[EKB] This is a follow-up to the discussion on [Is Tcl Different!].
Here's the wonderful HTML parser in 10 lines as posted on that page:
############################################
# Turn HTML into TCL commands
# html A string containing an html document
# cmd A command to run for each html tag found
# start The name of the dummy html start/stop tags
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set w " \t\r\n"    ;# white space
    proc HMcl x {return "\[$x\]"}
    set exp <(/?)([HMcl ^$w>]+)[HMcl $w]*([HMcl ^>]*)>
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}
But the page didn't include a definition for ''HMtest_parse'', the default value of ''cmd'', so I wrote one and applied it to a sample bit of HTML:
proc HMtest_parse {tag state props body} {
    if {$state == ""} {
        set msg "Start $tag"
        if {$props != ""} {
            set msg "$msg with args: $props"
        }
        set msg "$msg\n$body"
    } else {
        set msg "End $tag"
    }
    puts $msg
}
HMparse_html {
<html>
<p class="bubba">This is my very first paragraph. How do you
like it? I think it has a lot to recommend it.</p>
<p class="louielouie">This is my second paragraph, which is OK,
but not as nice as my first one.</p>
</html>
}
This gives the following output:
Start hmstart
Start html
Start p with args: class="bubba"
This is my very first paragraph. How do you
like it? I think it has a lot to recommend it.
End p
Start p with args: class="louielouie"
This is my second paragraph, which is OK,
but not as nice as my first one.
End p
End html
End hmstart
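To see how this output comes about, it helps to look at the script that the substitutions build before it is handed to [eval]. The sketch below repeats the same [regsub] steps as the first listing but returns the generated script instead of evaluating it. ''HMshow_script'' is a hypothetical helper name introduced only for this illustration.

```tcl
# Hypothetical helper: same substitutions as HMparse_html, but the
# generated script is returned rather than passed to eval.
proc HMshow_script {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set w " \t\r\n"    ;# white space
    proc HMcl x {return "\[$x\]"}
    set exp <(/?)([HMcl ^$w>]+)[HMcl $w]*([HMcl ^>]*)>
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    return "$cmd {$start} {} {} \{ $html \}"
}

puts [HMshow_script {<p class="x">Hi</p>}]
# Prints:
# HMtest_parse {hmstart} {} {} { }
# HMtest_parse {p} {} {class="x"} {Hi}
# HMtest_parse {p} {/} {} { }
```

Each tag closes the previous command's last argument and opens a new command, so the body a handler receives is only the text up to the next tag, not the whole element.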
In fact, the code is not HTML-specific: it can handle simple XML too, as long as the document doesn't use self-closing tags. It's like a mini-[SAX], though only loosely. The resemblance is that you define a handler for each tag, but unlike SAX it operates on a string in memory and doesn't run any handler until the whole string has been converted. I've created a small XML parser based on this code and put it in [TAX: A Tiny API for XML].
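The limitation with self-closing tags can be seen by applying the tag pattern to a few tags on their own. This is a minimal sketch: ''HMsplit_tag'' is a hypothetical helper, the pattern is written in its braced form with \s, and [lassign] assumes Tcl 8.5 or later.

```tcl
# Hypothetical helper: apply the tag pattern to one tag and return
# the captured slash, tag name, and attribute text as a list.
proc HMsplit_tag {tag} {
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    regexp $exp $tag -> slash name props
    return [list $slash $name $props]
}

foreach tag [list {<br>} {<br/>} {<br />}] {
    lassign [HMsplit_tag $tag] slash name props
    puts "tag=$tag slash='$slash' name='$name' props='$props'"
}
```

For <br/> the trailing slash ends up in the tag name (''br/''), and for <br /> it ends up in the attribute text. In both cases the slash capture is empty, so the handler sees a start tag and never an end tag.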
In spite of its incredible (to me) brevity, the code can actually be shortened somewhat. The proc ''HMcl'' exists only to keep Tcl from treating the [[ ]]'s in the regular expression as command substitution, and enclosing the value of ''exp'' in { }'s achieves the same thing. The variable ''w'' isn't needed either (at least in recent Tcl versions): \s can be used instead. Here's the new ''HMparse_html'' proc:
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}
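As a quick sanity check on the shortening, the two ways of building ''exp'' can be compared on a few sample tags. This is only a sketch with hand-picked samples, not a proof of equivalence; the variable names ''old'' and ''new'' are made up for the comparison.

```tcl
# Old construction, copied from the first listing.
proc HMcl x {return "\[$x\]"}
set w " \t\r\n"
set old <(/?)([HMcl ^$w>]+)[HMcl $w]*([HMcl ^>]*)>

# New construction: braces stop Tcl from substituting the brackets.
set new {<(/?)([^\s>]+)\s*([^>]*)>}

foreach sample [list {<p class="bubba">} {</p>} "<a\thref=x>"] {
    set a [regexp -inline $old $sample]
    set b [regexp -inline $new $sample]
    puts [expr {$a eq $b ? "same" : "differ"}]
}
```

The two expressions should only disagree on whitespace that \s matches but ''w'' does not list, such as a form feed between the tag name and its attributes.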
OK, one more thing... If ''cmd'' is an [ensemble], then each tag can be handled by its own subcommand. For example, just as ''string length'' is a command in which [string] is the ensemble and ''length'' is the subcommand, it should be possible to set up ''cmd'' so that ''cmd p'' invokes the proc for p tags, ''cmd html'' invokes the proc for html tags, and so on.
It's pretty easy to create ensembles in [snit], so here's a snit version:
package require snit
############################################
# Turn HTML into TCL commands
# html A string containing an html document
# cmd A command to run for each html tag found
# start The name of the dummy html start/stop tags
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}
snit::type parser {
    proc isend {state} {
        if {$state == ""} {
            return false
        } else {
            return true
        }
    }
    method hmstart {args} {}
    method html {state args} {
        if {[isend $state]} {
            puts "That's all, folks!"
        } else {
            puts "Let's get going!"
        }
    }
    method p {state props body} {
        if {![isend $state]} {puts $body}
    }
}
parser HMtest_parse
HMparse_html {
<html>
<p class="bubba">This is my very first paragraph. How do you
like it? I think it has a lot to recommend it.</p>
<p class="louielouie">This is my second paragraph, which is OK,
but not as nice as my first one.</p>
</html>
}
This is the output:
Let's get going!
This is my very first paragraph. How do you
like it? I think it has a lot to recommend it.
This is my second paragraph, which is OK,
but not as nice as my first one.
That's all, folks!
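The same dispatch can also be sketched without snit, using the built-in ''namespace ensemble'' command. This assumes Tcl 8.5 or later. The namespace layout and the ''log'' variable are choices made for this illustration, and the handlers collect their lines in ''log'' rather than printing them directly.

```tcl
# Shortened parser, repeated so the example is self-contained.
proc HMparse_html {html {cmd HMtest_parse} {start hmstart}} {
    regsub -all \{ $html {\&ob;} html
    regsub -all \} $html {\&cb;} html
    set exp {<(/?)([^\s>]+)\s*([^>]*)>}
    set sub "\}\n$cmd {\\2} {\\1} {\\3} \{"
    regsub -all $exp $html $sub html
    eval "$cmd {$start} {} {} \{ $html \}"
    eval "$cmd {$start} / {} {}"
}

namespace eval HMtest_parse {
    variable log {}    ;# lines collected by the handlers

    proc hmstart {args} {}
    proc html {state args} {
        variable log
        lappend log [expr {$state eq "" ? "Let's get going!" : "That's all, folks!"}]
    }
    proc p {state props body} {
        variable log
        if {$state eq ""} {lappend log [string trim $body]}
    }

    # Export one subcommand per tag; the ensemble command takes the
    # namespace's name, so it is invoked as "HMtest_parse <tag> ...".
    namespace export hmstart html p
    namespace ensemble create
}

HMparse_html {<html><p class="bubba">First paragraph.</p><p class="louielouie">Second paragraph.</p></html>}
puts [join $HMtest_parse::log \n]
```

As with the snit version, a tag that has no matching subcommand raises an error, so every tag that appears in the input needs a handler.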
----
[Category XML]
[Category Word and Text Processing]
[Category Internet]