Aug. 2009 by [rmax]

[http://mikrocontroller.net/%|%mikrocontroller.net%|%] is a popular German forum for microcontroller hobbyists working with controllers like [http://atmel.com/%|%AVR%|%] or [http://www.microchip.com%|%PIC%|%]. It allows users to subscribe to discussion threads and get a notification email when something new has been posted. Unfortunately, these emails contain only a link to the new posting, not the posted text.

This script can be used as a filter in a [http://procmail.org/%|%procmail%|%] rule to replace the notification body with the actual text of the new posting. It uses the Tcl core's [http] package to fetch the discussion page and the [tdom] package to parse the HTML.

----

======
package require http
package require tdom

fconfigure stdout -encoding utf-8

# pass on the mail header, stopping at the blank line that ends it (or at EOF)
while {[gets stdin line] > 0} {
    puts $line
}
puts ""

# read the body and grab the URL from it
set body [read stdin]
if {![regexp {(https?://[^\#]*)\#([0-9]+)} $body -> url rel]} {
    # no posting link found, so pass the body through unchanged
    puts -nonewline $body
    exit
}
# fetch over plain HTTP, since the http package needs a registered TLS handler for https
regsub {^https} $url {http} url

# fetch the thread
set token [http::geturl $url]
set html [http::data $token]
http::cleanup $token

# parse the HTML and select the posting that contains the anchor named $rel
set dom [dom parse -html $html doc]
set div [format \
    {//div[@class="post box gainlayout " and .//a[@name="%s"]]} $rel]
set p [[$dom documentElement] selectNodes $div]

# get and print the author
set A [[$p selectNodes {.//div[@class="author"]}] asText]
puts [regsub -all {\s+} [string trim $A] { }]

# get and print the date
set D [[$p selectNodes {.//div[@class="date"]}] asText]
puts [regsub -all {\s+} [string trim $D] { }]

# get and print the names of attachments
foreach F [$p selectNodes {.//div[@class="attachment"]}] {
    puts [regsub -all {\s+} [string trim [$F asText]] { }]
}
puts ""

# get and print the text of the posting
set T [[$p selectNodes {.//div[contains(@class,"text")]}] asText]
puts $T
puts ""

# append the original link to the posting
puts "$url#$rel"
======

<>Web Scraping