GCL SI Manual: Regular Expressions

17.1 Regular Expressions

The function string-match is used to match a regular expression against a string. If the variable *case-fold-search* is not nil, case is ignored in the match. To determine the extent of the match, use match-beginning and match-end.
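As a minimal usage sketch, the helper below returns the text matched by a pattern, or NIL when string-match reports no match (a return value of -1). The name matched-text is hypothetical and not part of GCL. The sketch assumes that string-match, match-beginning, match-end and *case-fold-search* are accessible in the current package; if they are not, they may need a package prefix.

```lisp
;; Hypothetical helper: return the substring of STRING matched by PATTERN,
;; or NIL if there is no match.
(defun matched-text (pattern string)
  (if (>= (string-match pattern string) 0)
      (subseq string (match-beginning 0) (match-end 0))
      nil))

(matched-text "b+" "aabbbc")            ; "bbb"
(matched-text "z" "aabbbc")             ; NIL

;; With *case-fold-search* bound to nil the match is case sensitive.
(let ((*case-fold-search* nil))
  (matched-text "hello" "Say HELLO"))   ; NIL

;; Binding it to a non-nil value makes the match ignore case.
(let ((*case-fold-search* t))
  (matched-text "hello" "Say HELLO"))   ; "HELLO"
```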

Regular expressions are implemented using Henry Spencer’s package (thank you Henry!), and much of the description of regular expressions below is copied verbatim from his manual entry. Code for delimited searches, case insensitive searches, and speedups to allow fast searching of long files was contributed by W. Schelter. The speedups use an adaptation by Schelter of the Boyer and Moore string search algorithm to the case of branched regular expressions. These allow such expressions as 'not_there|really_not' to be searched for 30 times faster than in GNU emacs (1995), and 200 times faster than in the original Spencer method. Expressions such as [a-u]bcdex get speedups of 60 and 194 times, respectively, over the same two baselines. These figures are based on searching a string of 50000 characters (such as the file tk.lisp).

A regular expression is a string containing zero or more branches which are separated by |. A match of the regular expression against a string is simply a match of the string with one of the branches.
Each branch consists of zero or more pieces, concatenated. A matching string must contain an initial substring matching the first piece, immediately followed by a second substring matching the second piece and so on.
Each piece is an atom optionally followed by +, *, or ?.
An atom followed by + matches a sequence of 1 or more matches of the atom.
An atom followed by * matches a sequence of 0 or more matches of the atom.
An atom followed by ? matches a match of the atom, or the null string.
An atom is one of:
- a regular expression in parentheses (matching a match for the regular expression)
- a range (see below)
- a . (matching any single character)
- a ^ (matching the null string at the beginning of the input string)
- a $ (matching the null string at the end of the input string)
- a \ followed by a single character (matching that character)
- a single character with no other significance (matching that character).
A range is a sequence of characters enclosed in []. It normally matches any single character from the sequence.
- If the sequence begins with ^, it matches any single character not from the rest of the sequence.
- If two characters in the sequence are separated by -, this is shorthand for the full list of ASCII characters between them (e.g. [0-9] matches any decimal digit).
- To include a literal ] in the sequence, make it the first character (following a possible ^).
- To include a literal -, make it the first or last character.
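The syntax rules above can be tried out with a few short calls. The sketch below is illustrative only: the patterns and input strings are made up for this purpose, and the values in the comments are the start indices the rules above lead one to expect (-1 meaning no match). Note that a \ in the pattern has to be written as \\ inside a Lisp string.

```lisp
;; Each form is followed by the start index expected under the rules above.
(string-match "ab+c" "xabbbc")          ; 1: + takes one or more b
(string-match "ab*c" "xac")             ; 1: * also accepts zero b
(string-match "colou?r" "my color")     ; 3: ? makes the u optional
(string-match "gr(a|e)y" "a grey cat")  ; 2: branches inside parentheses
(string-match "[0-9]+" "abc 123")       ; 4: range of decimal digits
(string-match "[^a-c]" "abcd")          ; 3: range negated with ^
(string-match "a\\.b" "axb a.b")        ; 4: \. matches only a literal dot
(string-match "^b" "ab")                ; -1: ^ anchors at the start
```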

Ordering Multiple Matches

In general there may be more than one way to match a regular expression to an input string. For example, consider the command

 (string-match "(a*)b*"  "aabaaabb")

Considering only the rules given so far, the value of (list-matches 0 1) might be ("aabb" "aa") or ("aaab" "aaa") or ("ab" "a") or any of several other combinations. To resolve this potential ambiguity, string-match chooses among alternatives using the rule "first then longest". In other words, it considers the possible matches in order, working from left to right across the input string and the pattern, and it attempts to match longer pieces of the input string before shorter ones. More specifically, the following rules apply in decreasing order of priority:

[1] If a regular expression could match two different parts of an input string then it will match the one that begins earliest.
[2] If a regular expression contains | operators then the leftmost matching sub-expression is chosen.
[3] In *, +, and ? constructs, longer matches are chosen in preference to shorter ones.
[4] In sequences of expression components the components are considered from left to right.

In the example from above, (a*)b* matches aab: the (a*) portion of the pattern is matched first and it consumes the leading aa; then the b* portion of the pattern consumes the next b. Or, consider the following example:

 (string-match "(ab|a)(b*)c"  "xabc") ==> 1
 (list-matches 0 1 2 3) ==> ("abc" "ab" "" NIL)
 (match-beginning 0) ==> 1
 (match-end 0) ==> 4
 (match-beginning 1) ==> 1
 (match-end 1) ==> 3
 (match-beginning 2) ==> 3
 (match-end 2) ==> 3
 (match-beginning 3) ==> -1
 (match-end 3) ==> -1

In the above example the return value of 1 (which is > -1) indicates that a match was found. The entire match runs from 1 to 4. Rule 4 specifies that (ab|a) gets first shot at the input string and Rule 2 specifies that the ab sub-expression is checked before the a sub-expression. Thus the b has already been claimed before the (b*) component is checked and (b*) must match an empty string.
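Both examples in this section can be reproduced with a small helper built only from match-beginning and match-end. This is a sketch under assumptions: match-strings is a hypothetical name, not a GCL function, and it takes the searched string as an explicit argument so that it depends on nothing beyond the functions described above.

```lisp
;; Hypothetical helper: after a successful string-match against STRING,
;; collect the text of the given sub-match indices (NIL for unmatched ones).
(defun match-strings (string &rest indices)
  (mapcar #'(lambda (i)
              (let ((start (match-beginning i)))
                (if (>= start 0)
                    (subseq string start (match-end i))
                    nil)))
          indices))

(let ((s "aabaaabb"))
  (string-match "(a*)b*" s)    ; 0: the earliest match wins (rule 1)
  (match-strings s 0 1))       ; ("aab" "aa"): a* and then b* take all they can

(let ((s "xabc"))
  (string-match "(ab|a)(b*)c" s)
  (match-strings s 0 1 2 3))   ; ("abc" "ab" "" NIL)
```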

The special characters in the string "\()[]+.*|^$?" must be quoted if a simple string search is desired. The function re-quote-string is provided for this purpose.

(re-quote-string "*standard*") ==> "\\*standard\\*"

(string-match (re-quote-string "*standard*") "X *standard* ")
 ==> 2

(string-match "*standard*" "X *standard* ")
Error: Regexp Error: ?+* follows nothing

Note that the quoted string actually contains just one \ before each *; the printer shows two so that the string can be read back, since \ is also the Lisp quote character. In the last example an error is signalled because the special character * must follow an atom when it is interpreted as part of a regular expression.
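A literal search can be wrapped in one small function. The sketch below assumes only string-match and re-quote-string as described above; literal-position is a hypothetical name chosen for illustration, and the input strings are made up.

```lisp
;; Hypothetical helper: find TEXT literally in STRING by quoting any
;; regular expression special characters first. Returns the start index,
;; or -1 if TEXT does not occur.
(defun literal-position (text string)
  (string-match (re-quote-string text) string))

(string-match "a.b" "axb a.b")                   ; 0: unquoted, . matches the x
(literal-position "a.b" "axb a.b")               ; 4: quoted, . matches only a dot
(literal-position "*standard*" "X *standard* ")  ; 2
```

Quoting at the call site keeps the caller's text unchanged, and avoids the "follows nothing" error for text that happens to begin with *, + or ?.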