Regexp HTML Attribute Parsing

20040711 CMcC: Here's a little HTML/XML/SGML attribute parser. It's iterative, peeling one attribute at a time off the front of the string, but it uses regexps extensively to do so.

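    # One pattern per value style. Each captures the attribute name, its value,
    # and whatever is left of the string after that attribute.
    array set match {
        quote {^([a-zA-Z0-9_-]+)[ \t]*=[ \t]*["]([^"]+)["][ \t]*(.*)$}
        squote {^([a-zA-Z0-9_-]+)[ \t]*=[ \t]*[']([^']+)['][ \t]*(.*)$}
        uquote {^([a-zA-Z0-9_-]+)[ \t]*=[ \t]*([^ \t'"]+)[ \t]*(.*)$}
    }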

    proc parseAttr {astring} {
        global match
        array set attr {}
        set astring [string trim $astring]
        if {$astring eq ""} {
            return {}
        }

        while {$astring ne ""} {
            # Snapshot the input once per pass, before trying any pattern, so
            # that a successful match is not mistaken for "no progress".
            set org $astring
            foreach m {quote squote uquote} {
                if {[regexp $match($m) $astring all var val suffix]} {
                    set attr($var) $val
                    set astring [string trimleft $suffix]
                }
            }
            # No pattern consumed anything: the rest is not a valid attribute.
            if {$astring eq $org} {
                error "parseAttr: can't parse $astring - not a properly formed attribute string"
            }
        }
        return [array get attr]
    }
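
To see what a single pass of the loop does, here is a minimal sketch that applies just the double-quote pattern once. The sample input and the variable names pattern, input, name, value and rest are made up for illustration; only the pattern itself is taken from the code above.

```tcl
# One step of the iteration: peel a single double-quoted attribute off the
# front of the string and keep the remainder for the next pass.
set pattern {^([a-zA-Z0-9_-]+)[ \t]*=[ \t]*["]([^"]+)["][ \t]*(.*)$}
set input {href="index.html" class="nav"}

if {[regexp $pattern $input all name value rest]} {
    puts "name:  $name"
    puts "value: $value"
    puts "rest:  $rest"
}
```

Here name ends up as href, value as index.html, and rest as class="nav", which is what the next pass would start from. The list that parseAttr finally returns comes from array get, so it can be loaded back with array set or read with dict get.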

Since you are considering the dark side of markup parsing, you might also enjoy XML Shallow Parsing with Regular Expressions.

LES: Regexps are too often criticized by those who just don't know them or don't like them. "If only you knew the power of the dark side..."

NEM: Regular Expressions Are Not A Good Idea for Parsing XML, HTML, or e-mail Addresses. Regular expressions can be immensely useful: I use them frequently for pulling apart simple (regular) strings. However, there are genuine limits to the power of regexps, and people should be aware of them, especially in situations (such as parsing XML/HTML) where several excellent full parsers already exist.

LES: I find your lack of faith disturbing. E-mail cannot be parsed with regex. But XML can. Feel free to ask for help whenever you need it and think it can't be done.

NEM: XML cannot be parsed purely with regexp. You can parse XML by applying several regexps, and using additional code to put things together correctly. If you put a lot of time and effort into it and fix all the edge cases, you'll end up with something resembling the pure-Tcl version of TclXML, I suspect. But what's the point? It's a waste of time reinventing a perfectly good wheel.
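
As a rough illustration of "several regexps plus additional code", the sketch below uses a regexp only to find the tags and a plain Tcl list as a stack to check nesting, which is the part a regular expression cannot express on its own. The proc name tagsBalanced and the tag pattern are hypothetical, invented for this example. It ignores comments, CDATA sections, processing instructions and attribute values containing ">", which is exactly the kind of edge case a full parser handles.

```tcl
# Hypothetical helper: report whether the tags in a fragment nest properly.
proc tagsBalanced {xml} {
    set stack {}
    # Captures: optional leading slash, tag name, everything up to ">".
    set tagRE {<(/?)([a-zA-Z][a-zA-Z0-9_:-]*)([^>]*)>}
    foreach {all slash name rest} [regexp -all -inline $tagRE $xml] {
        if {$slash eq "/"} {
            # Close tag: it must match the most recently opened element.
            if {[lindex $stack end] ne $name} {
                return 0
            }
            set stack [lrange $stack 0 end-1]
        } elseif {[string index $rest end] ne "/"} {
            # Open tag that is not self-closing (such as <c/>): remember it.
            lappend stack $name
        }
    }
    # Balanced only if every opened element was also closed.
    return [expr {[llength $stack] == 0}]
}

puts [tagsBalanced {<a><b x="1">text</b><c/></a>}]   ;# prints 1
puts [tagsBalanced {<a><b></a></b>}]                 ;# prints 0
```

The regexp does the tokenizing; the stack is the "additional code" that makes the result depend on nesting depth.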

escargo 29 Jan 2007 - Isn't there also an issue with parsing nonconforming HTML or XML? It's a nicer experience if somebody has made sure the structure is right, but in the real world you cannot rely on that assumption.

NEM: See e.g. tdom's [dom parse -html] option which will attempt to deal with nonconforming HTML documents.
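
A minimal sketch of that approach, assuming the tdom package is installed. The sample markup and the variable names are made up for illustration; the calls used (dom parse -html, documentElement, selectNodes, getAttribute, delete) are part of tdom's documented interface.

```tcl
package require tdom

# Deliberately sloppy input: <p> is never closed and <br> has no end tag.
set html {<p>Hello<br><a href="index.html">home</a>}

set doc  [dom parse -html $html]
set root [$doc documentElement]

# Collect the href attribute of every <a> element, wherever it sits in the tree.
set hrefs {}
foreach node [$root selectNodes {//a}] {
    lappend hrefs [$node getAttribute href ""]
}
puts $hrefs

# The document must be freed explicitly once it is no longer needed.
$doc delete
```

The attribute values come back from the DOM already unquoted, so no attribute regexps are needed at all.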


Tcllib also contains a module, htmlparse, for parsing HTML code.