Stephen Uhler's HTML parser in 10 lines

Stephen Uhler's HTML parser in 10 lines, originally by Stephen Uhler, previously known as HTML parser in 8 lines of Tcl, and currently known as HTML parser in 4 lines of Tcl, is a small toy HTML parser. It's not correct in that it can get messed up by angle brackets in attribute values and unbalanced braces in the HTML content, but it's an interesting code snippet nonetheless.

Attributes

location (defunct)
http://freegis.org/cgi-bin/viewcvs.cgi/grass51/lib/form/html_library.tcl?rev=1.1&content-type=text/vnd.viewcvs-markup
Didn't anyone go to last week's Tcl/Tk Workshop? , comp.lang.tcl, 1995-07-10
Tcl/Tck and HTML , comp.lang.tcl, 1995-07-20

Description

EKB et al:

Stephen Uhler's HTML parser in 8 lines is now actually in 4 lines. Here is the current version:

############################################
# Turn HTML into TCL commands
#   html    A string containing an html document
#   cmd                A command to run for each html tag found
#   start        The name of the dummy html start/stop tags

proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    set exp {<(/?)([^ \t\r\n>]+)[ \t\r\n]*([^>]*)>}
    set sub "\}\n[list $cmd] {\\2} {\\1} {\\3} \{"
    regsub -all $exp [string map {\{ \&ob; \} \&cb;} $html] $sub html
    eval "$cmd {$start} {} {} \{ $html \}; $cmd {$start} / {} {}"
}

But it was missing the default value for cmd, HMtest_parse, so I wrote one and applied it to a sample bit of HTML:

proc HMtest_parse {tag state props body} {
    if {$state eq {}} {
        set msg "Start $tag"
        if {$props ne {}} {
            set msg "$msg with args: $props"
        }
        set msg "$msg\n$body"
    } else {
        set msg "End $tag"
    }
    puts $msg
}

HMparse_html {
   <html>
     <p class="bubba">
     This is my very first paragraph. How do you
     like it? I think it has a lot to recommend it.
     </p>
     <p class="louielouie">
     This is my second paragraph, which is OK,
     but not as nice as my first one.
     </p>
   </html>
}

Output:

Start hmstart


Start html


Start p with args: class="bubba"

       This is my very first paragraph. How do you
       like it? I think it has a lot to recommend it.

End p
Start p with args: class="louielouie"

       This is my second paragraph, which is OK,
       but not as nice as my first one.

End p
End html
End hmstart

In fact, the code is not HTML-specific, and can handle simple XML code (e.g., that doesn't use the self-closing <tag/> format). It's like a mini-SAX. (Actually, it isn't quite like SAX. It's only like it because you define handlers for each tag. But unlike SAX it operates on a string in memory and doesn't execute until everything has been converted.) I've created a small XML parser based on this code and put it in TAX: A Tiny API for XML.

PYK: the following comment refers to a previous, longer version of the parser which can be found in the history for this page

In spite of its incredible (to me) brevity, the code can actually be shortened somewhat. The proc HMcl is introduced in order to avoid trouble with [ ]'s. But it can also be avoided by enclosing the value of exp in { }'s. Also, the variable w doesn't need to be defined (at least in recent Tcl versions): \s can be used instead. Here's the new HMparse_html proc:

proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}

OK, one more thing... If the cmd is an ensemble, then the different tags can be sub-procs within the ensemble. For example, just like string length is a command, where string is the ensemble, and length is the sub-proc, it should be possible to set up cmd so that cmd p would invoke the proc for parsing p tags, cmd html would invoke the command for parsing html tags, etc.

It's pretty easy to create ensembles in snit, so here's a snit version:

package require snit

############################################
# Turn HTML into TCL commands
#   html    A string containing an html document
#   cmd                A command to run for each html tag found
#   start        The name of the dummy html start/stop tags

proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}

snit::type parser {
    proc isend {state} {
        if {$state eq {}} {
            return false
        } else {
            return true
        }
    }
    method hmstart {args} {}
    method html {state args} {
        if [isend $state] {
            puts {That's all, folks!}
        } else {
            puts {Let's get going!}
        }
    }
    method p {state props body} {
        if {![isend $state]} {puts $body}
    }
}

parser HMtest_parse

HMparse_html {
   <html>
     <p class="bubba">
     This is my very first paragraph. How do you
     like it? I think it has a lot to recommend it.
     </p>
     <p class="louielouie">
     This is my second paragraph, which is OK,
     but not as nice as my first one.
     </p>
   </html>
}

Output:

Let's get going!

        This is my very first paragraph. How do you
        like it? I think it has a lot to recommend it.
        

        This is my second paragraph, which is OK,
        but not as nice as my first one.
        
That's all, folks!

The problem with using snit (or incr tcl is you have to declare handlers for all tags or you will end up with a runtime error (for example "method body not found"). I myself use the following mechanism with some success:

proc HMtest_parse {tag state props body} {
    if {[info proc handle_$tag] ne {}} {
        handle_$tag $state $props $body
    }
}

proc handle_a {state props body} { ... }
proc handle_img {state props body} { ... }

This way, you only have to declare handlers for the tags that you care about.

Hai Vu


WHD: Actually, Snit allows you to define a method that receives all unknown methods:

delegate method * using {%s UnknownMethod %m}

method UnknownMethod {methodName args} { ... }