Version 23 of Regular Expressions Are Not A Good Idea for Parsing XML, HTML, or e-mail Addresses

[L1 ] - NEM: This link was dead for me on 2 June 2005. However, this [L2 ] article from comp.lang.tcl certainly looks relevant.

[Wiki page on e-mail addresses]

[different meanings of "regular expressions"]

[Perl disease]

[When REs go wrong]

Regular expression examples

05Apr03 Brian Theado - For XML, I'm guessing the title of this page is referring to one-off regular expressions, but see [L3 ] for a paper describing shallow parsing of XML using only a regular expression. The regular expression is about 30 lines long, but the paper documents it well. The Appendix includes sample implementations in Perl, JavaScript and Flex/Lex. The Appendix also includes an interactive demo (apparently using the JavaScript implementation). The demo helped me understand what they meant by "shallow parsing". For a Tcl translation, see XML Shallow Parsing with Regular Expressions.
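
To give a rough feel for what "shallow parsing" means, here is a much-simplified Tcl sketch that only splits input into markup tokens and text tokens. It is not the expression from the paper: the proc name shallowParse and the two-branch pattern are assumptions made for illustration, and the pattern ignores comments, CDATA sections, processing instructions, and ">" characters inside attribute values.

```tcl
# Simplified shallow parse: split a string into markup and text tokens.
# Assumption: no comments, CDATA, processing instructions, or ">" inside
# attribute values. The paper's full expression handles those cases.
proc shallowParse {xml} {
    # First branch: anything from "<" to the next ">".
    # Second branch: a run of text up to the next "<".
    return [regexp -all -inline {<[^>]*>|[^<]+} $xml]
}

set tokens [shallowParse {<p>Hello <b>world</b>!</p>}]
foreach tok $tokens {
    puts [list $tok]
}
```

The result is a flat token list. Nothing here checks that tags nest or match, which is the job of a later parsing step.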

Why are regular expressions not suited for parsing email addresses? "Regular expression to validate e-mail addresses" comments on this.

A few more comments appear in "The Limits to Regular Expressions" [L4 ] and "Regular Expressions Do Not Solve All Problems" [L5 ], themselves descendants of Jamie Zawinski's notorious judgment [L6 ] that REs multiply, rather than solve, problems.

D. McC: OK, so what can you use instead of REs to solve, rather than multiply, problems?

AM: In Tcl you have a number of options, depending on what you really want to do:

Searching for individual words - consider [lsearch]
Searching for particularly simple patterns - consider [string match]
Try writing a simple RE that solves, say, 80 or 90% of the matching problem, and use a second step to get rid of the "false positives"
Use a combination of all three
If you are trying to match text that spans multiple lines (which is not uncommon), turn it into one long string first, removing any unnecessary characters (like \ or \n)

That is just a handful of methods. I am sure others can come up with more methods.
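
The two-step idea and the line-joining idea from the list above might look something like this in Tcl. The input text, the id=... format, and the variable names are all made up for illustration. The loose RE deliberately over-matches, and a second pass throws away what does not qualify.

```tcl
# Hypothetical input: the second physical line continues the first,
# marked by a trailing backslash.
set text "id=42 \\\nname=foo\nid=abc\nid=7\n"

# Join continued lines by dropping each backslash-newline pair.
set joined [string map [list "\\\n" ""] $text]

# Step 1: a deliberately loose RE picks up anything shaped like id=...
set candidates [regexp -all -inline {id=\S+} $joined]

# Step 2: discard the false positives with a stricter check.
set ids {}
foreach c $candidates {
    set value [string range $c 3 end]
    if {[string is integer -strict $value]} {
        lappend ids $value
    }
}
puts $ids   ;# prints: 42 7
```

Keeping the RE loose and doing the strict check in ordinary code tends to be easier to read and debug than one expression that tries to do both.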

DKF: For XML and HTML, use a proper parser to build a DOM tree. For email addresses, do a cheap hack that does the 99.999% of the cases seen in practice. :^)
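
As a sketch of what such a cheap hack for email addresses could look like (the proc name looksLikeEmail and the exact pattern are assumptions, not anything DKF posted), the check below only asks for a single "@", no whitespace, and a dot somewhere in the domain part. It makes no attempt at the full address grammar.

```tcl
# A deliberately cheap check: exactly one "@", no whitespace,
# and at least one dot after the "@". Not a full address parser.
proc looksLikeEmail {addr} {
    return [regexp {^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$} $addr]
}

foreach addr {user@example.com "no at sign" a@b user@@example.com} {
    puts "[list $addr] -> [looksLikeEmail $addr]"
}
```

A check like this rejects some legal but rare forms (quoted local parts, for instance) and accepts addresses that do not exist. Whether that trade-off is acceptable depends on the application.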

NEM: I'm not sure if [lsearch] or [string match] would be the way to go if [regexp] wasn't good enough. The direction I'd go in would be to use one of the many parser generators available for Tcl (e.g. I've heard good things about taccle), or check out some of the tools in tcllib (look at the grammar_fa stuff by AKU). Or, you could roll your own parser using recursive descent. At some point soon, I'd like to experiment with parser combinators [L7 ], which look great. Note that most of these techniques probably make use of regexps as part of the solution. However, regular expressions, in their most basic form, can only recognise regular grammars (see [L8 ] for a description of the Chomsky language hierarchy), but many times what needs to be parsed is context-free or context-sensitive (XML is context-free, IIRC).
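
To illustrate the "roll your own recursive descent" option, here is a minimal Tcl sketch for a toy language of nested tags. The grammar, the proc names (tokenize, parseElement, parse), and the tree format are all assumptions made for this example. It is not real XML parsing. Note that a regexp still does the tokenising, while the nesting is handled by recursion.

```tcl
# Toy grammar (an assumption for illustration, not real XML):
#   element ::= "<" name ">" element* "</" name ">"

# Tokenise with a regexp. Anything that is not a tag is silently skipped.
proc tokenize {input} {
    return [regexp -all -inline {</?[a-z]+>} $input]
}

# Parse one element starting at index pos in the token list.
# Returns a two-element list: the parse tree and the next index.
proc parseElement {tokens pos} {
    set tok [lindex $tokens $pos]
    if {![regexp {^<([a-z]+)>$} $tok -> name]} {
        error "expected an opening tag at token $pos"
    }
    incr pos
    set children {}
    # Keep parsing children while the next token is an opening tag.
    while {[regexp {^<[a-z]+>$} [lindex $tokens $pos]]} {
        lassign [parseElement $tokens $pos] child pos
        lappend children $child
    }
    # The closing tag must carry the same name as the opening tag.
    if {[lindex $tokens $pos] ne "</$name>"} {
        error "expected </$name> at token $pos"
    }
    incr pos
    return [list [list $name $children] $pos]
}

proc parse {input} {
    set tokens [tokenize $input]
    lassign [parseElement $tokens 0] tree next
    if {$next != [llength $tokens]} {
        error "unexpected tokens after the root element"
    }
    return $tree
}

puts [parse {<a><b></b><c><d></d></c></a>}]
```

Each element comes back as a {name children} pair, and mismatched nesting such as `<a><b></a></b>` raises an error. Checking that each closing tag matches its opening tag at arbitrary depth is exactly the kind of bookkeeping the recursion provides.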

AM: I use [lsearch] and [string match] for identifying lines of interest - quite often the first thing you need to do. I do not intend them as replacements for splitting up the text into smaller pieces...
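
Picking out lines of interest this way could be sketched as follows. The log text and the "ERROR *" pattern are hypothetical. Both variants select the same lines and leave any further splitting of those lines to a later step.

```tcl
# Hypothetical log text. The line format is an assumption for illustration.
set text "INFO start\nERROR disk full\nINFO running\nERROR timeout\n"
set lines [split [string trimright $text "\n"] "\n"]

# lsearch with -all -inline -glob returns every line matching the pattern.
set errors [lsearch -all -inline -glob $lines "ERROR *"]

# The same selection with string match, one line at a time.
set errors2 {}
foreach line $lines {
    if {[string match "ERROR *" $line]} {
        lappend errors2 $line
    }
}

puts $errors
```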