Aug. 2009 by [rmax]

[http://mikrocontroller.net/%|%mikrocontroller.net%|%] is a popular German forum for microcontroller hobbyists working with controllers like [http://atmel.com/%|%AVR%|%] or [http://www.microchip.com%|%PIC%|%]. It lets users subscribe to discussion threads and receive a notification email whenever something new is posted. Unfortunately, these emails contain only a link to the new posting, not its text.

This script can be used as a filter in a [http://procmail.org/%|%procmail%|%] rule to replace the notification body with the actual text of the new posting. It uses the Tcl core's [http] package to fetch the discussion page and the [tdom] package to parse the HTML.

----

    package require http
    package require tdom

    fconfigure stdout -encoding utf-8

    # Pass on the mail header, up to the empty line that ends it.
    # gets returns -1 at end of input, so test for > 0 rather than != 0
    # to avoid looping forever on a message without a body.
    while {[gets stdin line] > 0} {
        puts $line
    }
    puts ""

    # Read the mail body and grab the URL from it
    regexp {(https?://[^\#]*)\#([0-9]+)} [read stdin] -> url rel
    # Use plain http; the http package cannot fetch https URLs
    # without additional setup
    regsub {^https} $url {http} url

    # Fetch the whole thread
    set token [http::geturl $url]
    set html [http::data $token]
    http::cleanup $token

    # Parse the HTML and select the <div> with the new message
    set dom [dom parse -html $html]
    set doc [$dom documentElement]
    set div [$doc selectNodes \
        [format {//div[@class="post box gainlayout " and .//a[@name="%s"]]} $rel]]

    # Get and print the author of the new message
    set A [[$div selectNodes {.//div[@class="author"]}] asText]
    puts [regsub -all {\s+} [string trim $A] { }]

    # Get and print the time stamp of the new message
    set D [[$div selectNodes {.//div[@class="date"]}] asText]
    puts [regsub -all {\s+} [string trim $D] { }]

    # Get and print the names of attachments, if any
    foreach F [$div selectNodes {.//div[@class="attachment"]}] {
        puts [regsub -all {\s+} [string trim [$F asText]] { }]
    }
    puts ""

    # Get and print the text of the message
    set T [[$div selectNodes {.//div[contains(@class,"text")]}] asText]
    puts $T
    puts ""

    # Print the full URL
    puts "$url#$rel"

<<categories>>Web Scraping