'''HTML''', or '''HyperText [Markup Language]''', is a [markup language] used
on the [WWW%|%World-Wide Web].
** Tools **
[Tcllib html]: a module for generating HTML
[Wub]: includes a [http://wub.googlecode.com/svn/trunk/Utilities/Html.tcl%|%utility] for structured HTML tag generation
[htmlparse]: tools to parse HTML
[tkHTML]: an [extension] that parses and renders HTML, compiled for use without Tk
[tcltidy]: a wrapper to Tidy
[tkhtml3]: the successor to [tkHTML]
[tDOM]'s [XPath]-oriented parser: can be used to manipulate HTML
[TclXML]: includes xmlgen for generating HTML or XML
[MajaMaja]: structures and lays out a static collection of HTML pages that arrange a wide variety of materials
** See Also **
[HTML widgets]: discusses widgets that ''render'' HTML into a visual representation.
[Web scraping]:
[august html editor]:
[url-encoding]:
[html2text]:
** Description **
For extracting data from HTML, it's generally more robust to parse the page
into a document model, perhaps using [tDOM], and then use [XPath] to find the
data, than to hack at it with regular expressions.
If the task is to 'pull out' some data from an HTML page, I'm indeed a strong
believer in the 'parse the HTML page into a tree and query that tree' approach.
For real-life problems, I claim that this approach is much simpler and easier
to maintain than any regexp approach. And you will certainly have to maintain
such a thing, because the layout of HTML pages tends to change frequently.
Sure, you have to learn another query language, XPath in this case. But if you
are really in the web business, chances are you will have to learn XPath
anyway.
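
The 'parse into a tree and query that tree' approach described above might look roughly like the following with [tDOM]. This is only a minimal sketch: the sample page, the `prices` table id, and the `item` and `price` class names are made up for illustration, and a real page would normally be read from a file or fetched over HTTP.

```tcl
package require tdom

# Hypothetical sample page, invented for this example.
set html {<html><body>
<table id="prices">
<tr><td class="item">Widget</td><td class="price">4.50</td></tr>
<tr><td class="item">Gadget</td><td class="price">12.00</td></tr>
</table>
<p>See the <a href="/catalog">catalog</a>.</p>
</body></html>}

# -html asks tDOM to use its HTML parser instead of the strict XML parser.
set doc  [dom parse -html [string trim $html]]
set root [$doc documentElement]

# Walk every row of the table and pull the two cells out by class.
# string(...) makes selectNodes return the text value rather than a node.
set prices {}
foreach row [$root selectNodes {//table[@id='prices']//tr}] {
    set item  [$row selectNodes {string(td[@class='item'])}]
    set price [$row selectNodes {string(td[@class='price'])}]
    lappend prices $item $price
}

# Collect the target of every link on the page.
set links {}
foreach a [$root selectNodes {//a[@href]}] {
    lappend links [$a getAttribute href]
}

# Free the DOM tree once the data has been extracted.
$doc delete

puts $prices
puts $links
```

If the page layout changes, usually only the XPath expressions need adjusting, which is the maintenance advantage claimed above over a regexp that has to match the surrounding markup literally.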
<> Category Package | Tcllib | Category Internet | Category Glossary | Markup Language