I find the concept of a quasichar to be useful in explaining
regular-expression pattern matching. A quasichar is a part of the pattern
that matches a single ASCII character, for example, A, [A-Z], \01,
and so forth. Here are descriptions of the special symbols that are directly related
to quasichars.
* - When following a quasichar, * causes any number (including zero) of
copies of that quasichar to be used in the match. Preference is to the
longest match.
So, the glob pattern * is the same as either of these regular
expressions: ^.*$ or .* (Why?)
+ - When following a quasichar, + causes one or more copies of that
quasichar to be used in the match. Preference is to the longest match.
So, the pattern [a-z]+ will match the entire string "cat" because
this pattern
[a-z][a-z][a-z]
matches "cat" and no longer sequence of [a-z] could make a match. The
same pattern will match the first two characters of the string "hi!" and
would not match any substring of "1234".
? - When following a quasichar, ? causes zero or one copies of that
quasichar to be used in the match. Preference is to the longest match.
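The three repeaters can be tried out with a short script. This is only a sketch, assuming Tcl 8.1 or later; the helper name showMatch and the choice of sample strings are mine, not part of Tcl.

```tcl
# Report what a pattern matches in a string, or that it does not match.
# showMatch is a hypothetical helper written for this illustration.
proc showMatch {pattern string} {
    if {[regexp -- $pattern $string Match]} {
        return "matched \"$Match\""
    }
    return "no match"
}

puts [showMatch {[a-z]+} cat]    ;# matched "cat"  (longest run of [a-z])
puts [showMatch {[a-z]+} hi!]    ;# matched "hi"   (stops before the !)
puts [showMatch {[a-z]+} 1234]   ;# no match       (+ needs at least one copy)
puts [showMatch {[a-z]*} 1234]   ;# matched ""     (* accepts zero copies)
puts [showMatch {[a-z]?} cat]    ;# matched "c"    (? takes at most one copy)
```

The patterns are written in braces so that Tcl does not treat the square brackets as command substitution before regexp sees them.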
I call these special symbols repeaters for obvious reasons.
Repeaters introduce the possibility that more than one matching substring
might begin at the same position in the string. Ties are broken as with glob
pattern matching:
- If two matching substrings begin at the same
position in the string, the longer is chosen.
That is what "preference is
to the longest match" means in the descriptions above.
Regular expressions are more than just a sequence of quasichars with possible
repeaters; they can be several sequences of quasichars with possible
repeaters. The special symbol | is used to separate these sequences,
which are called branches. Each branch defines a different
possible match.
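Here is a small sketch of branches at work. The variable name Pet_ and the sample strings are made up for illustration. The two branches can never match at the same position, so the result does not depend on which version of Tcl is used.

```tcl
# Two branches: either "cat" or "dog" satisfies the pattern.
# The pattern is braced so Tcl hands it to regexp unchanged.
set Pet_ {cat|dog}

foreach Str {{hot dog} catalog bird} {
    if {[regexp -- $Pet_ $Str Match]} {
        puts "$Str -> $Match"
    } else {
        puts "$Str -> no match"
    }
}
# hot dog -> dog
# catalog -> cat
# bird -> no match
```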
Now for a major difference between Tcl's version 8.1 and everything that came
before. (Note that version 8.1 is experimental at the time of writing.)
- For versions 8.0 and earlier:
  When more than one branch defines a matching substring at a given
  position within a string, the leftmost branch will be used even if the
  match defined by another branch would choose a longer substring.
- For versions 8.1 and later:
  When more than one branch defines a matching substring at a given
  position within a string, the longest match will be used. If two branches
  tie for the longest match, the one to the left will be used.
Some examples will help. They depend on this preassignment:
set BC_ {[bBcC]}
This,
regexp a|$BC_ cat Match
sets Match to "c" because that is the first substring of cat
that can be matched.
This,
regexp $BC_?|$BC_* bbbb Match
sets Match to "b" in versions 8.0 and earlier and to
"bbbb" in versions 8.1 and later.
This,
regexp ^$BC_* able Match
sets Match to the empty string. The * repeater enables a match
with the empty string. The empty string at the front of "able" is the
first possible match.
This,
regexp ^able|^$BC_* able Match
sets Match to "able". Both branches match at the first character of
"able"; the first branch is the leftmost, which decides it in versions 8.0
and earlier, and it also gives the longest match, which decides it in
versions 8.1 and later. It might seem that a match to the empty string at
the beginning of "able" would come first. It does not.
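The four examples can be run together with a script along these lines. It is a sketch, assuming Tcl 8.1 or later; the variable names Cases, Pattern and Str are introduced only for this illustration.

```tcl
set BC_ {[bBcC]}

# Pattern/string pairs taken from the four examples above.
# Inside double quotes $BC_ is substituted; the brackets in its value
# are not evaluated again.
set Cases [list \
    "a|$BC_"        cat  \
    "$BC_?|$BC_*"   bbbb \
    "^$BC_*"        able \
    "^able|^$BC_*"  able ]

foreach {Pattern Str} $Cases {
    regexp -- $Pattern $Str Match
    puts [format {%-22s %-5s -> "%s"} $Pattern $Str $Match]
}
# Under the 8.1 rule, Match is in turn: c, bbbb, the empty string, able.
# Under the 8.0 rule described above, the second case gives b instead.
```

All four patterns match their strings, so the return value of regexp is not checked here. A script that handles arbitrary input should test it before reading Match.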
Exercise 7.4a -
Which of the following regexps will return
true? Of those that do, what is assigned to the variable Match? Of those
that do not, why?
set Digit_ {[0-9]}
set Space_ "\[ \t]"
set Dot_ {\.}
set NoDot_ {[^\.]}
set Quote_ {"}
regexp -indices $Space_$Quote_ { "} Match
regexp $Digit_.$Digit_ 201 Match
regexp $NoDot_*$Dot_ "Interesting. But not relevant." Match
regexp ".*" "" Match
Solution
Exercise 7.4b -
Which of the following will return true? Of those
that do, what is assigned to the variable Match? Of those that do not,
why?
regexp catbert|cat catbert Match
regexp cat|catbert catbert Match
regexp c?t|at catbert Match
set NoLowerCase_ {[^a-z]}
regexp $NoLowerCase_*at|atbert Catbert Match
regexp $NoLowerCase_*bert|bert Catbert Match
Solution
Exercise 7.4c -
Write a regexp command that matches
everything in a string Str up to, and including, the first end of
line.
Test your answer with these strings: "Hi There\nBig Boy\n",
"\nSecond Line", and "First Line". You should obtain, respectively, the
string "Hi There" followed by a new line, a new line without anything before
or after it, and no match.
Solution