Version 59 of Regular Expressions

Purpose: Describe Tcl Regular Expressions, emphasising advanced features.

The page regular expression (singular) is more about regular expressions in general (including features not found in Tcl) and the theory behind them.

A regular expression is a technique of describing a pattern that you are seeking in a string (used mainly by Tcl's regexp and regsub commands). One of the best resources for understanding regular expressions is the O'Reilly BOOK Mastering Regular Expressions, which talks about many of the uses of regular expressions, from the Unix grep(1) command to Tcl and beyond.

In the following examples, regular expressions and strings will be listed inside {} .

An analogy that might make regular expressions easier is to think of them in chemistry sense. One starts with atoms - the smallest building blocks of uniqueness in regular expressions.

A regular expression atom is made of either a literal character or a metacharacter.

A literal character is the simplest regular expression possible. For example, the string {a} is a one character regular expression. It can be used to match a portion of any string which contains the letter a. Compare the regular expression {a} against the string "abc" and you get a match. Compare {a} against the string "xyz" and you do not get a match.

A metacharacter is the means for telling regular expression what you want in a bit more vague manner. For instance, the metacharacter {.} means 'match any character'. Compare a period {.} against any string one character or longer and you get a match.

Another metacharacter is the {\\}. The backslash metacharacter tells the regular expression that the following character is to be used literally. This comes in most helpfully when attempting to describe patterns containing metacharacters.

The following applies to Tcl 8.1 or newer.

Regular expression atoms and metacharacters fall into one of several classes. The first type express a specific character is to be matched.

literal: Any alphabetic, numeric, white space character are frequently treated as literal matches. However, there are a few cases, detailed below, where they are used in a metacharacter construct.

[characters]: The notation here defines a subset of characters to match. An exclusive match is one in which the first character inside the matching braces is the caret (^) character.

If this character is the first in the subset, then the following characters define a subset of characters which must not match. Otherwise without the leading caret, the characters define a subset of characters which must match. The subset is defined as either individual characters or ranges of characters separated by a minus sign (-).

.: A period matches any literal character

\k: When k is non-alphanumeric, the atom matches the literal character k

\c: When c is alphanumeric (possibly followed by other characters), the sequence is called a Regular Expression Escape Sequence

The second type are modifiers to regular expression atoms, expressing the quantity of characters to be matched. These are called quantified atoms.

*: The largest series (zero or more occurances) of the preceeding regular expression atom will be matched.
+: The largest series (one or more occurances) of the preceeding regular expression atom will be matched.
?: This is a boolean type quantifier - it means the atom may or may not appear (i.e. it may appear 0 or 1 times).
{m}: a sequence of exactly m matches of the atom
{m,}: a sequence of m or more matches of the atom
{m,n}: a sequence of no less than m and no more than n matches of the atom

*?: non-greedy form of * quantifier - if there is more than one match, selects the smallest of the matches
+?: non-greedy form of + quantifier
??: non-greedy form of ? quantifier
{m}?: non-greedy form of {m} quantifier

The third type of regular expression metacharacters modify a regular expression atom by placing constraints upon the characters being matched.

^: The following regular expression will only match when it occurs at the beginning of a string.
$: The preceeding regular expression will only match when it occurs at the end of a string. While it is common to think of this character matching the newline, note that one cannot manipulate the newline by for instance trying to replace the symbol by a null string, etc.

The fourth type of regular expression metacharacters are used for expressing groupings.

(.RE-a.) -- Parens surrounding a series of regular expression atoms (with possible modifiers) are treated as an entity to be matched. The results are returned, if the invocation provides for matches to be returned in a variable.

(?:.RE-a.) -- This modified version of grouping treats the enclosed regular expression atoms as an entity, but does not return the results in a match variable.

() -- This matches an empty string, returning the match in a variable.

(?:) -- This matches an empty string, but does not return the results in a match variable.

.RE-a.|.RE-b. -- The vertical bar/or symbol/pipe symbol is a metacharacter used to separate regular expression atoms. The resulting regular expression means "match RE-a OR RE-b". Each side of the symbol is called a branch. Each branch consists of zero or more constraints or quantified atoms, concatenated. Empty branches match an empty string.

One must be aware that regular expression are either greedy or non-greedy, regardless of your mixture of greedy/non-greedy metacharacters. Refer to this [L1 ] comp.lang.tcl thread, and specifically this [L2 ] Sept. 1999 posting from Henry Spenter to c.l.t.

Comma Number Formatting

Some folks insist on inserting commas (or other characters) to format digits into groups of three. Here is a regexp to do the trick from Keith Vetter. (Thanks Keith!) The Perl manual describes a very slick method of doing this:

     1 while s/^([-+]?\d+)(\d{3})/$1,$2/;

Translated into (pre 8.1) tcl you get:

    set n 123456789.00
    while {[regsub {^([-+]?[0-9]+)([0-9][0-9][0-9])} $n {\1,\2} n]} {}
    puts $n

results in

    123,456,789.00

(You can tighten this up a little using Tcl 8.1's regular expressions:

    while {[regsub {^([-+]?\d+)(\d{3})} $n {\1,\2} n]} {}

Using the extended syntax, this becomes a bit easier to understand:

    while {[regsub {(?x)
        ^([-+]?\d+)     # The number at the start of the string...
        (\d{3})         # ...has three digits at the end
    } $n {\1,\2} n]} {
                        # So we insert a comma there and repeat...
    }

)

For a version with configurable separator, see Bag of algorithms, item "Number commified" - RS

See also Human readable file size formatting for a version without regular expressions for those of us who are allergic to monstrous complexity ;) - Ro

Henry Spencer writes

 >...You can't put extra spaces into regular
 >expressions to improve readability, you just have to suffer along
 >with the rest of us.

Actually, since 8.1 you can, although since it's one of 57 new features, it's easy to miss. Like so:

 set re {(?x)
        \s+ ([[:graph:]]+)      # first number
        \s+ ([[:graph:]]+)      # second number
 }
 set data "     -1.2117632E+00     -5.6254282E-01"
 regexp $re $data match matchX matchY

The initial "(?x)" (which must be right at the start) puts the regexp parser into expanded mode, which ignores white space (with some specific exceptions) and #-to-end-of-line comments.

More information is available in the Tcl manual page on regular expressions. You can view two of the pages at:

http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/re_syntax.htm and http://www.purl.org/tcl/home/man/tcl8.4/TclCmd/regexp.htm .

Chapter 11 in Brent Welch's book also covers regular expressions. Information on the book can be found at http://www.beedub.com/book/3rd . And older version of the chapter, which won't cover the most recent developments is available too [L3 ].

The above discussion needs to cover the advanced regular expression syntax as handled by Tcl 8.1 and later and show the user what all the differences are, so that one can write portable code when necessary - or at least create appropriate package require statements.

Another useful place to learn about Regular Expressions is the page at the Tcl Developer's Xchange [L4 ], where info on the Tcl 8.x specific features are discussed.

Some algorithms are easier coded withOUT REs. Tcl's string command is versatile, and often simplifies problems many programmers hit with the RE hammer.

[Explain Komodo RE debugger.]

tkWorld contains tkREM which is a regular expression maker. Perhaps someone familar with it would like to discuss it.

^txt2regex$ [L5 ] is a Regular expression wizard written in bash2 that converts human sentences into regular expressions. It can be used to build up regular expressions suitable for use in Tcl.

Visual REGEXP [L6 ] is software to help you debug regular expressions.

See redet for another tool to assist in developing regular expressions.

If someone is still stuck using Tcl 8.0.x, you might take a look at ftp://ftp.procplace.com/pub/tcl/sorted/packages-7.6/devel/nre30.tar.gz which is one of a couple extensions back then that provided a superset of regular expression functionality. Unfortunately, this does not provide all the power of Tcl 8.1 and newer, but at least it is more than was available before 8.1.

tcLex [L7 ] is a lexical analyzer which uses Tcl regular expressions to do the matching.

Yeti is another lexical analyser, parser generator.

Tcl's regular expression engine is an interesting and subtle object for study in its own regard. While Perl is the language that deserves its close identification with RE capabilities, Tcl's engine competes well with it and every other one. In fact, although he doesn't favor Tcl as a language, RE expert Jeffrey Friedl has written [L8 ] that "Tcl's [RE] engine is a hybrid with the best of both worlds."

For more on different engines, see Henry's comments in [L9 ].

Most common regular expression implementations (notable perl and direct derivatives of the PCRE library) exhibit poor performance in certain pathological cases. Henry Spencer's complete reimplementation as a "hybrid" engine appears to address some of those problems. See [L10 ] for some fascinating benchmarks.

Lars H: A very nice paper! Highly recommended for anyone interested in the internals of regular expression engines, and a good introduction to the theory.

Yet another meaning of "Regular Expressions": the name of an at-least-monthly column on scripting languages CL has co-authored since 1998 [L11 ].

TCL variables can be marked that an instance contains a compiled regular expression. REs can be pre-compiled by the call "regexp $RE {}" [L12 ].

DKF: I prefer to use regexp -about $RE to do the compilation, but that's probably a matter of style.

KBK has astutely remarked that, "Much of the art of designing recognizers is the art of controlling such things in the common cases; regexp matching in general is PSPACE-complete, and our extended regexps are even worse (... not bounded by any tower of exponentials ...)." [L13 ]

Lars H, 2008-06-01: Somehow I doubt KBK would say that, in part because it's dead wrong as far as basic regular expressions are concerned — given a regular expression of size m and a string of size n it is always possible to test whether the string matches that regular expression in time that is linear in n and polynomial in m. Googling for "regexp matching PSPACE complete" turns up this page, but otherwise rather suggests that other problems concerning regular expressions, in particular deciding whether two regular expressions are equivalent, may be PSPACE-complete. (Which is actually kind of interesting, since the naive determinization algorithm for this might need exponential amounts of memory and thus not be in PSPACE at all, but off-topic.)

The link provided as source currently doesn't work (no surprise, it's into SourceForge mail archives), but the forum_id seems to refer to development of a Perl module (text::similarity, in some capitalization) rather than anything Tcl related. That matching using Perl's so-called "regexps" should be "worse than PSPACE-complete" is something I can believe, so in that context the quote makes sense, but why it should then be attributed to KBK, and moreover why it should appear in this Wiki (added 2006-08-09, in revision 35 of this page), is still a mystery.

LV Searching http://aspn.activestate.com/ASPN/Mail/Browse/Threaded/tcl-core (well, actually from what I can see, one can only search ALL of activestate's mailing list archives), doesn't turn up a reference like this. Maybe it is quite old - before activestate?

TV Auw, man... That´s like suggesting something like the traveling salesman problem is only there to upset people that a certain repository will have the perfect solution for this type of problem, but like that the actual worlds´ best sorting algorithm (O(log(2.1...)) has gotten lost in a van with computer tapes from some university in 1984 or so, the whole of "datastructures and algorithms" will end up like the ´English IT Show´ on the Comedy Channel, and than on the Who says Tcl sucks... graveyard like the connection machine was great but forgotten and the world´s greatest synthesizer developers/researchers are in "The Dead Presidents Society" (CEO´s that is, like ´the dead Poets Society´).