'''HTML''', or '''HyperText [Markup Language]''', is a [markup language] used
on the [WWW%|%World-Wide Web].


** Tools **

   [Tcllib html]:   a module for generating HTML

   [Wub]:   includes a [http://wub.googlecode.com/svn/trunk/Utilities/Html.tcl%|%utility] for structured HTML tag generation

   [htmlparse]:   tools to parse HTML

   [tkHTML]:   an [extension] that parses and renders HTML, compiled for use without Tk
   
   [tcltidy]:   a wrapper to Tidy

   [tkhtml3]:   the successor to [tkHTML]

   [tDOM]'s [XPath]-oriented parser:   can be used to manipulate HTML

   [TclXML]:   includes xmlgen for generating HTML or XML

   [MajaMaja]:   a tool to structure and lay out a static collection of HTML pages that arrange a wide variety of materials


** See Also **

   [HTML widgets]:   discusses widgets that ''render'' HTML into a visual representation.

   [Web scraping]:   

   [august html editor]:   
   
   [url-encoding]:   
   
   [html2text]:   


** Description **

For extracting data from HTML, it's generally more robust to parse the HTML
page into some document model, perhaps using [tDOM], and then use [XPath] to
find the data, than to hack at the page with regular expressions.

If the task is to 'pull out' some data from an HTML page, I'm a strong
believer in the 'parse the HTML page into a tree and query that tree' approach.
For real-life problems, I claim that this approach is much simpler and easier
to maintain than any regexp approach. And you will have to maintain such a
thing, because the layout of HTML pages tends to change frequently.
Sure, you have to learn another query language, XPath in this case. But if you
are really in the web business, chances are you will have to learn XPath
anyway.
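
A minimal sketch of the 'parse into a tree and query that tree' approach, assuming the [tDOM] package is installed. The proc name `extractLinks` and the sample page are made up for illustration and do not come from any particular project. It parses the page with tDOM's HTML parser and uses one XPath expression to collect every link.

```tcl
package require tdom

# Hypothetical helper: return a list of {href text} pairs, one per link.
proc extractLinks {html} {
    # -html selects tDOM's forgiving HTML parser instead of the strict XML one.
    set doc [dom parse -html $html]
    set root [$doc documentElement]

    set links {}
    # XPath: every <a> element, anywhere in the tree, that has an href attribute.
    foreach node [$root selectNodes {//a[@href]}] {
        lappend links [list [$node getAttribute href] [$node asText]]
    }

    # DOM trees are not freed automatically; release the document explicitly.
    $doc delete
    return $links
}

# Usage with an invented sample page.
set page {<html><body><ul>
<li><a href="/tdom">tDOM</a></li>
<li><a href="/xpath">XPath</a></li>
</ul></body></html>}

foreach link [extractLinks $page] {
    lassign $link href text
    puts "$href -> $text"
}
```

If the page layout changes, say the list becomes a table, only the XPath expression is likely to need adjusting, which is the maintenance argument made above.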


<<categories>> Category Package | Tcllib | Category Internet | Category Glossary | Markup Language