Previous: System Definitions, Up: System Definitions [Contents][Index]
The function string-match
(*Index string-match::) is used to
match a regular expression against a string. If the variable
*case-fold-search*
is not nil, case is ignored in the match.
To determine the extent of the match use *Index match-beginning:: and
*Index match-end::.
Regular expressions are implemented using Henry Spencer’s package (thank you Henry!), and much of the description of regular expressions below is copied verbatim from his manual entry. Code for delimited searches, case insensitive searches, and speedups to allow fast searching of long files was contributed by W. Schelter. The speedups use an adaptation by Schelter of the Boyer and Moore string search algorithm to the case of branched regular expressions. These allow such expressions as ’not_there|really_not’ to be searched for 30 times faster than in GNU emacs (1995), and 200 times faster than in the original Spencer method. Expressions such as [a-u]bcdex get a speedup of 60 and 194 times respectively. This is based on searching a string of 50000 characters (such as the file tk.lisp).
|
. A match of the regular expression against a string is simply a match of the string with one of the branches.
+
, *
, or ?
.
+
matches a sequence of 1 or more matches of the atom.
*
matches a sequence of 0 or more matches of the atom.
?
matches a match of the atom, or the null string.
.
matching any single character
^
matching the null string at the beginning of the input string
$
matching the null string at the end of the input string
\
followed by a single character matching that character
[]
.
It normally matches any single character from the sequence.
^
,
it matches any single character not from the rest of the sequence.
-
, this is shorthand
for the full list of ASCII characters between them
(e.g. [0-9]
matches any decimal digit).
]
in the sequence, make it the first character
(following a possible ^
).
-
, make it the first or last character.
In general there may be more than one way to match a regular expression to an input string. For example, consider the command
(string-match "(a*)b*" "aabaaabb")
Considering only the rules given so far, the value of (list-matches 0 1)
might be ("aabb" "aa")
or ("aaab" "aaa")
or ("ab" "a")
or any of several other combinations.
To resolve this potential ambiguity string-match chooses among
alternatives using the rule first then longest.
In other words, it considers the possible matches in order working
from left to right across the input string and the pattern, and it
attempts to match longer pieces of the input string before shorter
ones. More specifically, the following rules apply in decreasing
order of priority:
In the example from above, (a*)b* matches aab: the (a*) portion of the pattern is matched first and it consumes the leading aa; then the b* portion of the pattern consumes the next b. Or, consider the following example:
(string-match "(ab|a)(b*)c" "xabc") ==> 1 (list-matches 0 1 2 3) ==> ("abc" "ab" "" NIL) (match-beginning 0) ==> 1 (match-end 0) ==> 4 (match-beginning 1) ==> 1 (match-end 1) ==> 3 (match-beginning 2) ==> 3 (match-end 2) ==> 3 (match-beginning 3) ==> -1 (match-end 3) ==> -1
In the above example the return value of 1
(which is > -1
)
indicates that a match was found. The entire match runs from
1 to 4.
Rule 4 specifies that (ab|a) gets first shot at the input
string and Rule 2 specifies that the ab sub-expression
is checked before the a sub-expression.
Thus the b has already been claimed before the (b*)
component is checked and (b*) must match an empty string.
The special characters in the string "\()[]+.*|^$?"
,
must be quoted, if a simple string search is desired. The function
re-quote-string is provided for this purpose.
(re-quote-string "*standard*") ==> "\\*standard\\*" (string-match (re-quote-string "*standard*") "X *standard* ") ==> 2 (string-match "*standard*" "X *standard* ") Error: Regexp Error: ?+* follows nothing
Note there is actually just one \
before the *
but the printer makes two so that the string can be read, since
\
is also the lisp quote character. In the last example
an error is signalled since the special character *
must
follow an atom if it is interpreted as a regular expression.
Previous: System Definitions, Up: System Definitions [Contents][Index]