A regular expression is a way of describing a pattern that you are looking for in a string (in Tcl, mainly through the regexp and regsub commands). One of the best resources for understanding regular expressions is the O'Reilly book Mastering Regular Expressions, which covers many of the uses of regular expressions, from the Unix grep(1) command to Tcl and beyond.
In the following examples, regular expressions and strings are written inside {}.
An analogy that might make regular expressions easier is to think of them in a chemical sense. One starts with atoms, the smallest building blocks of a regular expression. A regular expression atom is either a literal character or a metacharacter.
A literal character is the simplest regular expression possible. For example, the string {a} is a one-character regular expression. It matches a portion of any string that contains the letter a. Compare the regular expression {a} against the string "abc" and you get a match. Compare {a} against the string "xyz" and you do not get a match.
A metacharacter tells the regular expression engine what you want in a less specific manner. For instance, the metacharacter {.} means 'match any character'. Compare a period {.} against any string one character or longer and you get a match.
Another metacharacter is the backslash {\\}. It tells the regular expression engine that the following character is to be used literally, which is most helpful when describing patterns that themselves contain metacharacters.
The following applies to Tcl 8.1 or newer.
Regular expression atoms and metacharacters fall into one of several classes. The first class expresses that a specific character is to be matched.
- literal
- Alphabetic, numeric, and white space characters are usually treated as literal matches. However, there are a few cases, detailed below, where they are used in a metacharacter construct.
- [characters]
- The notation here defines a set of characters, any one of which may match. An exclusive match is one in which the first character inside the brackets is the caret (^) character; the atom then matches any single character that is not in the set.
- .
- A period matches any single character
- \k
- When k is non-alphanumeric, the atom matches the literal character k
- \c
- When c is alphanumeric (possibly followed by other characters), the sequence is called a Regular Expression Escape Sequence
- *
- The longest series (zero or more occurrences) of the preceding regular expression atom will be matched.
- +
- The longest series (one or more occurrences) of the preceding regular expression atom will be matched.
- ?
- This is a boolean type quantifier - it means the atom may or may not appear (i.e. it may appear 0 or 1 times).
- {m}
- a sequence of exactly m matches of the atom
- {m,}
- a sequence of m or more matches of the atom
- {m,n}
- a sequence of no less than m and no more than n matches of the atom
- *?
- non-greedy form of * quantifier - if there is more than one match, selects the smallest of the matches
- +?
- non-greedy form of + quantifier
- ??
- non-greedy form of ? quantifier
- {m}?
- non-greedy form of {m} quantifier
- ^
- The following regular expression will only match when it occurs at the beginning of a string.
- $
- The preceding regular expression will only match when it occurs at the end of a string. While it is common to think of this character as matching the newline, it matches a position rather than a character, so one cannot manipulate the newline by, for instance, replacing $ with an empty string.
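The atoms, quantifiers, and anchors listed above can be tried out directly at a tclsh prompt. The following is a minimal sketch, assuming Tcl 8.1 or newer; the sample strings and the variable names out and kept are made up for illustration.

```tcl
# Literal atom: {a} matches somewhere inside "abc" but nowhere in "xyz".
puts [regexp {a} abc]                    ;# 1
puts [regexp {a} xyz]                    ;# 0

# Bracket expression and its negated (caret) form.
puts [regexp {[0-9]} "room 7"]           ;# 1
puts [regexp {[^0-9]} 12345]             ;# 0, every character is a digit

# A backslash turns the metacharacter "." into a literal period.
puts [regexp {\.} 3.14]                  ;# 1
puts [regexp {\.} 314]                   ;# 0

# Quantifiers, greedy and non-greedy, applied to the same string.
puts [regexp -inline {a+} caaat]         ;# aaa
puts [regexp -inline {a+?} caaat]        ;# a
puts [regexp -inline {a{2}} caaat]       ;# aa
puts [regexp -inline {ca?t} ct]          ;# ct

# Anchors match positions, not characters.
puts [regexp {^abc$} abc]                ;# 1
puts [regexp {^abc$} xabc]               ;# 0

# Substituting at $ inserts text at the end of the string...
regsub {$} abc ! out
puts $out                                ;# abc!

# ...and replacing $ with an empty string leaves a trailing newline alone.
regsub {$} "abc\n" {} kept
puts [string length $kept]               ;# 4
```

The -inline switch makes regexp return the matched text instead of a match count, which is convenient for this kind of experiment.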
One must be aware that a regular expression as a whole is either greedy or non-greedy, regardless of how you mix greedy and non-greedy quantifiers within it. Refer to this [1] comp.lang.tcl thread, and specifically this [2] Sept. 1999 posting from Henry Spencer to c.l.t.
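A small experiment makes the point about greediness concrete. This is only a sketch, assuming Tcl 8.1 or newer and a made-up sample string; the result of the mixed pattern is deliberately left uncommented, since it is the case the thread above discusses, and it is worth comparing against the two uniform patterns on your own interpreter.

```tcl
set s aabbb

# All quantifiers greedy: the longest possible match is reported.
puts [regexp -inline {a+b+} $s]       ;# aabbb

# All quantifiers non-greedy: the shortest possible match is reported.
puts [regexp -inline {a+?b+?} $s]     ;# aab

# Mixed quantifiers: the expression as a whole still takes a single
# preference, so do not assume each quantifier acts independently.
puts [regexp -inline {a+?b+} $s]
```

If the mixed pattern does not give the match you wanted, one common way out is to restructure the expression so that all of its quantifiers share one preference.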
Comma Number Formatting
Some folks insist on inserting commas (or other characters) to format digits into groups of three. Here is a regexp to do the trick from Keith Vetter. (Thanks Keith!) The Perl manual describes a very slick method of doing this:
1 while s/^([-+]?\d+)(\d{3})/$1,$2/;
Translated into (pre-8.1) Tcl you get:
set n 123456789.00
while {[regsub {^([-+]?[0-9]+)([0-9][0-9][0-9])} $n {\1,\2} n]} {}
puts $n
which results in
123,456,789.00
You can tighten this up a little using Tcl 8.1's regular expressions:
while {[regsub {^([-+]?\d+)(\d{3})} $n {\1,\2} n]} {}
Using the extended syntax, this becomes a bit easier to understand:
while {[regsub {(?x)
    ^([-+]?\d+) # The number at the start of the string...
    (\d{3})     # ...has three digits at the end
} $n {\1,\2} n]} {
    # So we insert a comma there and repeat...
}
For a version with configurable separator, see Bag of algorithms, item "Number commified" - RS
See also Human readable file size formatting for a version without regular expressions for those of us who are allergic to monstrous complexity ;) - Ro
Henry Spencer writes:
>...You can't put extra spaces into regular
>expressions to improve readability, you just have to suffer along
>with the rest of us.
Actually, since 8.1 you can, although since it's one of 57 new features, it's easy to miss. Like so:
set re {(?x)
\s+ ([[:graph:]]+) # first number
\s+ ([[:graph:]]+) # second number
}
set data " -1.2117632E+00 -5.6254282E-01"
regexp $re $data match matchX matchY
The initial "(?x)" (which must be right at the start) puts the regexp parser into expanded mode, which ignores white space (with some specific exceptions) and #-to-end-of-line comments.
More information is available in the Tcl manual pages on regular expressions. You can view two of the pages at http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/re_syntax.htm and http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/regexp.htm .
Chapter 11 in Brent Welch's book also covers regular expressions. Information on the book can be found at http://www.beedub.com/book/3rd . An older version of the chapter, which won't cover the most recent developments, is available too [3].
The above discussion needs to cover the advanced regular expression syntax as handled by Tcl 8.1 and later and show the user what all the differences are, so that one can write portable code when necessary - or at least create appropriate package require statements.
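As a starting point for the portability concern raised above, a script can check the interpreter version before relying on 8.1 syntax. This is a minimal sketch, not a complete portability guide; the variable names new and old and the error message are made up for illustration, and the explicit version comparison is just one possible spelling of such a guard.

```tcl
# Hypothetical guard: stop early on interpreters that predate the 8.1 engine.
if {[package vcompare [info tclversion] 8.1] < 0} {
    error "this script needs the Tcl 8.1 regular expression syntax"
}

# \d and {3} are 8.1 features; the bracket form is the pre-8.1 spelling
# already used in the comma formatting example.
set new {^\d{3}$}
set old {^[0-9][0-9][0-9]$}

foreach s {123 12a} {
    puts "${s}: [regexp $new $s] [regexp $old $s]"
}
```

On an 8.1 or newer interpreter both spellings should agree for every input, so code that must also run on older releases can stay with the bracket form throughout.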
Another useful place to learn about Regular Expressions is the page at the Tcl Developer's Xchange [4], where the Tcl 8.x specific features are discussed.
Some algorithms are easier coded withOUT REs. Tcl's string command is versatile, and often simplifies problems many programmers hit with the RE hammer.
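A few everyday tasks show where the string command can stand in for an RE. This is only an illustrative sketch; the sample path and strings are made up.

```tcl
set path /usr/local/lib/tcl8.6/init.tcl

# Does the name end in ".tcl"? A glob pattern needs no escaping of the dot.
puts [string match *.tcl $path]          ;# 1
puts [regexp {\.tcl$} $path]             ;# 1, the RE equivalent

# Where does a fixed substring start? string first returns an index, or -1.
puts [string first lib $path]            ;# 11
puts [string first bin $path]            ;# -1

# Replace fixed text without worrying about metacharacters.
puts [string map {. _} a.b.c]            ;# a_b_c
```

None of these calls has to worry about a period, bracket, or backslash in the data being treated as a metacharacter, which removes a whole class of quoting mistakes.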
[Explain Komodo RE debugger.]
tkWorld contains tkREM which is a regular expression maker. Perhaps someone familiar with it would like to discuss it.
^txt2regex$ [5] is a regular expression wizard written in bash2 that converts human sentences into regular expressions. It can be used to build up regular expressions suitable for use in Tcl.
Visual REGEXP [6] is software to help you debug regular expressions.
See redet for another tool to assist in developing regular expressions.
If someone is still stuck using Tcl 8.0.x, you might take a look at ftp://ftp.procplace.com/pub/tcl/sorted/packages-7.6/devel/nre30.tar.gz which is one of a couple of extensions from back then that provided a superset of regular expression functionality. Unfortunately, it does not provide all the power of Tcl 8.1 and newer, but at least it is more than was available before 8.1.
tcLex [7] is a lexical analyzer which uses Tcl regular expressions to do the matching.
Yeti is another lexical analyser and parser generator.
Tcl's regular expression engine is an interesting and subtle object for study in its own right. While Perl is the language that deserves its close identification with RE capabilities, Tcl's engine competes well with it and every other one. In fact, although he doesn't favor Tcl as a language, RE expert [Jeffrey Friedl] has written [8] that "Tcl's [RE] engine is a hybrid with the best of both worlds."
For more on different engines, see Henry's comments in [9].
Most common regular expression implementations (notably Perl and direct derivatives of the PCRE library) exhibit poor performance in certain pathological cases. Henry Spencer's complete reimplementation as a "hybrid" engine appears to address some of those problems. See [10] for some fascinating benchmarks.
Lars H: A very nice paper! Highly recommended for anyone interested in the internals of regular expression engines, and a good introduction to the theory.
Yet another meaning of "Regular Expressions": the name of an at-least-monthly column on scripting languages CL has co-authored since 1998 [11].
Tcl variables can be marked to indicate that their value holds a compiled regular expression. REs can be pre-compiled by the call "regexp $RE {}" [12].
DKF: I prefer to use regexp -about $RE to do the compilation, but that's probably a matter of style.
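Both pre-compilation styles mentioned above can be sketched as follows. The pattern, the sample line, and the variable names are made up for illustration; how much time the warm-up actually saves depends on the script and is not measured here.

```tcl
set RE {(\d+)-(\d+)}

# Style 1: match against an empty string. The result (0 here, since the
# pattern needs digits) is thrown away; the call is made for its side effect.
regexp $RE {}

# Style 2: ask for information about the RE instead of matching.
# The first element of the returned list is the number of subexpressions.
set about [regexp -about $RE]
puts [lindex $about 0]                   ;# 2

# Later matches reuse the same variable.
regexp $RE "pages 10-25" all from to
puts "$from $to"                         ;# 10 25
```

Keeping the expression in one variable and passing that variable to every regexp call is what gives the interpreter a chance to reuse the compiled form.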
KBK has astutely remarked that, "Much of the art of designing recognizers is the art of controlling such things in the common cases; regexp matching in general is PSPACE-complete, and our extended regexps are even worse (... not bounded by any tower of exponentials ...)." [13]
Lars H, 2008-06-01: Somehow I doubt KBK would say that, in part because it's dead wrong as far as basic regular expressions are concerned: given a regular expression of size m and a string of size n, it is always possible to test whether the string matches that regular expression in time that is linear in n and polynomial in m. Googling for "regexp matching PSPACE complete" turns up this page, but otherwise rather suggests that other problems concerning regular expressions, in particular deciding whether two regular expressions are equivalent, may be PSPACE-complete. (Which is actually kind of interesting, since the naive determinization algorithm for this might need exponential amounts of memory and thus not be in PSPACE at all, but off-topic.)
The link provided as source currently doesn't work (no surprise, it's into SourceForge mail archives), but the forum_id seems to refer to development of a Perl module (text::similarity, in some capitalization) rather than anything Tcl related. That matching using Perl's so-called "regexps" should be "worse than PSPACE-complete" is something I can believe, so in that context the quote makes sense, but why it should then be attributed to KBK, and moreover why it should appear in this Wiki (added 2006-08-09, in revision 35 of this page), is still a mystery.
LV Searching http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/tcl-core (well, actually from what I can see, one can only search ALL of activestate's mailing list archives), doesn't turn up a reference like this. Maybe it is quite old - before activestate?
TV Auw, man... That's like suggesting something like the traveling salesman problem is only there to upset people that a certain repository will have the perfect solution for this type of problem, but like that the actual world's best sorting algorithm (O(log(2.1...)) has gotten lost in a van with computer tapes from some university in 1984 or so, the whole of "datastructures and algorithms" will end up like the 'English IT Show' on the Comedy Channel, and then on the Who says Tcl sucks... graveyard like the [connection machine] was great but forgotten and the world's greatest synthesizer developers/researchers are in "The Dead Presidents Society" (CEO's that is, like 'the dead Poets Society').
See also:
- Regular Expression Examples
- Beginning Regular Expressions
- Regular Expression Debugging Tips
- re_syntax
- Drawbacks of Tcl's Regexps
- a thorough tutorial with examples [14] (although its imprecisions exasperate CL)
- an extensive library of regular expressions for various tasks [15]
- an article on five habits for regular expression development [16]
- Tcl REs are actually even more wonderful than those of other languages--but no one cares [17]
- An introductory article, "Know your regular expressions" [18], which features a flexible RE generator
- New Regular Expression Features in Tcl 8.1 (http://www.tcl.tk/doc/howto/regexp81.html), which seems to have been lost to time.
[Refs to Henry Spencer and Kleene [19].]