7.4  Repetitions and Branches

I find the concept of a quasichar to be useful in explaining regular-expression pattern matching. A quasichar is a part of the pattern that matches a single ASCII character, for example, A, [A-Z], \01, and so forth. Here are descriptions of the special symbols that are directly related to quasichars.

*
When following a quasichar, * causes any number (including zero) of copies of that quasichar to be used in the match. Preference is to the longest match.

So, the glob pattern * is the same as either of these regular expressions: ^.*$ or .* (Why?)
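The behavior of * can be checked with a short script. This is only a sketch, assuming a Tcl interpreter with the standard regexp and string match commands; the variable names (r1, m1, and so on) are made up for illustration.

```tcl
# Zero or more lowercase letters; the longest run at the first position wins.
set r1 [regexp {[a-z]*} "cat!" m1]      ;# m1 is "cat"

# There is no letter at the front of "1234", so zero copies are used:
# the match still succeeds, with the empty string.
set r2 [regexp {[a-z]*} "1234" m2]      ;# r2 is 1, m2 is ""

# The glob pattern * and the regular expressions ^.*$ and .* each
# accept any string at all.
set g  [string match * "any string"]
set r3 [regexp {^.*$} "any string" m3]
set r4 [regexp {.*}   "any string" m4]

puts [list $r1 $m1 $r2 $m2 $g $r3 $m3 $r4 $m4]
```

Note that a successful match with the empty string still makes regexp return 1.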

+
When following a quasichar, + causes one or more copies of that quasichar to be used in the match. Preference is to the longest match.

So, the pattern [a-z]+ will match the entire string "cat" because this pattern

[a-z][a-z][a-z]
matches "cat" and no longer sequence of [a-z] could make a match. The same pattern will match the first two characters of the string "hi!" and would not match any substring of "1234."
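The three cases just described can be tried directly. This is a minimal sketch; the variables Str, Match, and Results are illustrative names, not part of the original text.

```tcl
set Results {}
foreach Str {cat hi! 1234} {
    # + needs at least one lowercase letter and then takes as many as it can.
    if {[regexp {[a-z]+} $Str Match]} {
        lappend Results $Match
    } else {
        lappend Results "(no match)"
    }
}
puts $Results
```

The first two strings yield "cat" and "hi"; the third contains no lowercase letter, so regexp returns 0.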

?
When following a quasichar, ? causes zero or one copies of that quasichar to be used in the match. Preference is to the longest match.

I call these special symbols repeaters for obvious reasons.

Repeaters introduce the possibility that more than one matching substring might begin at the same position in the string. Ties are broken as with glob pattern matching:

If two matching substrings begin at the same position in the string, the longer is chosen.
That is what "preference is to the longest match" means in the descriptions above.
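One way to see the longest-match preference is to ask regexp for the indices of the match rather than its text. The sketch below assumes the standard -indices switch; the variable names are hypothetical.

```tcl
set Str "cat"

# Each pattern can match more than one substring starting at position 0.
# -indices stores the first and last index of the substring actually chosen.
regexp -indices {[a-z]?} $Str ZeroOrOne    ;# "" or "c"; the longer is "c"
regexp -indices {[a-z]+} $Str OneOrMore    ;# "c", "ca", or "cat"
regexp -indices {[a-z]*} $Str AnyNumber    ;# "", "c", "ca", or "cat"

puts "?: $ZeroOrOne   +: $OneOrMore   *: $AnyNumber"
```

Here ? stops after one character (indices 0 0), while + and * both run to the end of the string (indices 0 2).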

Regular expressions are more than just a sequence of quasichars with possible repeaters; they can be several such sequences. The special symbol | is used to separate these sequences, which are called branches. Each branch defines a different possible match.

Now for a major difference between Tcl's version 8.1 and everything that came before. (Note that version 8.1 is experimental at the time of writing.)

For versions 8.0 and earlier

When more than one branch defines a matching substring at a given position within a string, the leftmost branch will be used – even if the match defined by another branch would choose a longer substring.

For versions 8.1 and later

When more than one branch defines a matching substring at a given position within a string, the branch giving the longest match will be used. If two branches tie for the longest match, then the one to the left will be used.
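A quick way to find out which rule an interpreter follows is to give it two branches where the shorter match comes first. This is a sketch, not a definitive test; the pattern and the variable names are invented for illustration, and info patchlevel is used only to report the interpreter version.

```tcl
# Both branches match at position 0 of "abcdef".
# Leftmost-branch rule (8.0 and earlier, as described above): "ab".
# Longest-match rule (8.1 and later): "abcd".
regexp {ab|abcd} abcdef Match1

# With the branches swapped, both rules agree on "abcd".
regexp {abcd|ab} abcdef Match2

puts "Tcl [info patchlevel]: $Match1 $Match2"
```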

Some examples will help. They depend on this preassignment:

set BC_ {[bBcC]}

This,

regexp a|$BC_ cat Match
sets Match to "c" because that is the first substring of "cat" that can be matched.

This,

regexp $BC_?|$BC_* bbbb Match
sets Match to "b" in versions 8.0 and earlier and to "bbbb" in versions 8.1 and later.

This,

regexp ^$BC_* able Match
sets Match to the empty string. The * repeater enables a match with the empty string. The empty string at the front of "able" is the first possible match.

This,

regexp ^able|^$BC_* able Match
sets Match to "able". Both branches match at the first character of "able", and the first branch wins under either rule: it is the leftmost branch (versions 8.0 and earlier) and it also gives the longer match (versions 8.1 and later). It might seem that a match to the empty string at the beginning of "able" would come first. It does not.
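The four examples can be collected into one script. This sketch assumes version 8.1 or later, so the second result follows the longest-match rule; the variable names M1 through M4 are made up here so that each result is kept.

```tcl
set BC_ {[bBcC]}

regexp a|$BC_ cat M1              ;# first matchable substring of "cat"
regexp $BC_?|$BC_* bbbb M2        ;# the longer branch wins in 8.1 and later
regexp ^$BC_* able M3             ;# zero copies: the empty string
regexp ^able|^$BC_* able M4       ;# "able" beats the empty string

puts [list $M1 $M2 $M3 $M4]
```

Under the longest-match rule this prints c, bbbb, an empty element, and able.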

Exercise 7.4a

Which of the following regexps will return true? Of those that do, what is assigned to the variable Match? Of those that do not, why?
set Digit_ {[0-9]}
set Space_ "\[ \t]"
set Dot_ {\.}
set NoDot_ {[^\.]}
set Quote_ {"}
regexp -indices $Space_$Quote_ {  "} Match
regexp $Digit_.$Digit_ 201 Match
regexp $NoDot_*$Dot_ "Interesting. But not relevant." Match
regexp ".*" "" Match

Solution

Exercise 7.4b

Which of the following will return true? Of those that do, what is assigned to the variable Match? Of those that do not, why?
regexp catbert|cat catbert Match
regexp cat|catbert catbert Match
regexp c?t|at catbert Match
set NoLowerCase_ {[^a-z]}
regexp $NoLowerCase_*at|atbert Catbert Match
regexp $NoLowerCase_*bert|bert Catbert Match

Solution

Exercise 7.4c

Write a regexp command that matches everything in a string Str up to, and including, the first end of line.

Test your answer with these strings: "Hi There\nBig Boy\n", "\nSecond Line", and "First Line". You should obtain, respectively, the string "Hi There" followed by a newline, a newline without anything before or after it, and no match.

Solution

 

 
