A little XML parser

Description

Richard Suchenwirth 2002-08-20: XML (the eXtensible Markup Language), shortly demonstrated as

<tag attribute="value">textcontent<singletontag/></tag>

is not the most beautiful or efficient way to package data, but it's fashionable and "standard". Often it's better to design a file format in XML (and rely on the available tools, e.g. tDOM and, based on it, starDOM) than to design a proprietary format, and have to write matching parsers, and document the format, etc., oneself. After all, computer memory and CPU power are getting ever cheaper these years than brain cells...

Another way to represent complex data is as a nested Tcl list. If you just get your braces balanced, Tcl will parse a string into a list and let you process it - which gets easier with the "multi-dimensional" lindex and lset commands since Tcl 8.4. tDOM has the $node asList command which, after parsing the input XML string into a DOM structure in memory, returns a well-formed Tcl list with these element properties:

  • if the first element is #text, the second is the text content;
  • else the first element is the tag, the second the attributes alternating name value..., and the third is a list of the child elements.

Comparing XML input and toList output, I thought that many differences can be handled by string manipulations in local context, for which Tcl offers powerful tools (for instance, string map, regexp and regsub). So one morning I decided to try a direct conversion of an XML string to a string that makes a well-formed list equivalent to the toList output. The following code does some XML well-formedness checking with a stack for matching start/end tags, but besides it's of course weaker than the power of tDOM. Still, it was a nice little evening project, so here goes:

proc xml2list xml {
    regsub -all {>\s*<} [string trim $xml " \n\t<>"] "\} \{" xml
    set xml [string map {> "\} \{#text \{" < "\}\} \{"}  $xml]

    set res ""   ;# string to collect the result   
    set stack {} ;# track open tags
    set rest {}
    foreach item "{$xml}" {
        switch -regexp -- $item {
            ^# {append res "{[lrange $item 0 end]} " ; #text item}
            ^/ {
                regexp {/(.+)} $item -> tagname ;# end tag
                set expected [lindex $stack end]
                if {$tagname!=$expected} {error "$item != $expected"}
                set stack [lrange $stack 0 end-1]
                append res "\}\} "
        }
            /$ { # singleton - start and end in one <> group
                regexp {([^ ]+)( (.+))?/$} $item -> tagname - rest
                set rest [lrange [string map {= " "} $rest] 0 end]
                append res "{$tagname [list $rest] {}} "
            }
            default {
                set tagname [lindex $item 0] ;# start tag
                set rest [lrange [string map {= " "} $item] 1 end]
                lappend stack $tagname
                append res "\{$tagname [list $rest] \{"
            }
        }
        if {[llength $rest]%2} {error "att's not paired: $rest"}
    }
    if [llength $stack] {error "unresolved: $stack"}
    string map {"\} \}" "\}\}"} [lindex $res 0]
}

Now that this went so well, I'll throw in the converse:

proc list2xml list {
    switch -- [llength $list] {
        2 {lindex $list 1}
        3 {
            foreach {tag attributes children} $list break
            set res <$tag
            foreach {name value} $attributes {
                append res " $name=\"$value\""
            }
            if [llength $children] {
                append res >
                foreach child $children {
                    append res [list2xml $child]
                }
                append res </$tag>
            } else {append res />}
        }
        default {error "could not parse $list"}
    }
}
#-------------------------------------------- now testing:
set test {<foo a="b">bar and<grill x:c="d" e="f g"><baz x="y"/></grill><room/></foo>}
proc tdomlist x {[[dom parse $x] documentElement root] asList} ;# reference
proc lequal {a b} {
    if {[llength $a] != [llength $b]} {return 0}
    if {[lindex $a 0] == $a} {return [string equal $a $b]}
    foreach i $a j $b {if {![lequal $i $j]} {return 0}}
    return 1
}
proc try x {
    puts [set a [tdomlist $x]]
    puts [set b [xml2list $x]]
    puts list:[lequal $a $b],string:[string equal $a $b]
}
puts [set res [xml2list $test]]
foo {a b} {{#text {bar and}} {grill {x:c d e {f g}} {{baz {x y} {}}}} {room {} {}}}

which is equal to the toList result. This may not be the most readable code I ever wrote (it strongly illustrates that unbalanced braces have to be escaped in strings ;-), but it demonstrates the mind-boggling power of hopping between the list and the string representation, which is one of Tcl's unique features.

Having the two converters above, one may start to think about a Tcl list DOM variation, where accesses go via lindex/lset in the nested list, and finally well-formed XML comes out again... But this smells like more work than fun for an evening ;-)


RS: The idea that index vectors, as can now be used with lindex/ lset, are paths in a tree (which the XML or DOM is) is fascinating - navigating nested lists can go very fast with this." - Here is a little general-purpose depth-first traverser that you can run over a listDOM:

proc forall {varName list body} {
    set $varName $list
    eval $body
    foreach child [lindex $list 2] {
        forall $varName $child $body
    }
} ;# RS
% forall i $a {puts [lrange $i 0 1]}
foo {a b}
#text {bar and}
grill {x:c d e {f g}}
baz {x y}
room {}

Brian Theado 2003-04-05: See [L1 ] for a paper describing shallow parsing of XML using only a regular expression. The regular expression is about 30 lines long, but the paper documents it well. The Appendix includes sample implementation in Perl, Javascript and Flex/Lex. The Appendix also includes an interactive demo (using the Javascript implementation apparently). The demo helped me understand what they meant by "shallow parsing".

I'm guessing translation from the Perl implementation to a Tcl implementation should be pretty straightforward. ...the next day: Translation to Tcl was straightforward. See XML Shallow Parsing with Regular Expressions.


JM 2010-01-05: I suggest LogParser for XML files, you can always drive the parser from Tcl of course.