Version 3 of ucnetgrab

Updated 2009-08-06 16:07:35 by rmax

Aug. 2009 by rmax

mikrocontroller.net is a popular German forum for microcontroller hobbyists working with controllers like AVR or PIC.

It allows users to subscribe to discussion threads and receive a notification email whenever something new has been posted. Unfortunately, these emails contain only a link to the new posting, not the posted text.

This script can be used as a filter in a procmail rule to replace the notification body with the actual text of the new posting. It uses the Tcl core's http package to fetch the discussion page and the tdom package to parse the HTML.
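
A procmail recipe for this could look roughly like the sketch below. The sender condition and the script path are assumptions made for illustration, not taken from the original setup, so adjust both to match the actual notification mails and the place where the script is installed.

 # Sketch only: the From: pattern and the script location are assumptions.
 :0 fw
 * ^From:.*mikrocontroller\.net
 | tclsh $HOME/bin/ucnetgrab.tcl

The f flag makes procmail treat the command as a filter whose output replaces the message, and w makes it wait for the exit status, so that the unfiltered mail is kept if the script fails.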


 package require http
 package require tdom

 fconfigure stdout -encoding utf-8

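 # pass on the mail header; gets returns 0 at the blank separator line
 # and -1 at end of input, so stop on either to avoid looping forever
 while {[gets stdin line] > 0} {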
    puts $line
 }
 puts ""
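 # read the body and grab the URL and the posting id from it
 if {![regexp {(https?://[^\#]*)\#([0-9]+)} [read stdin] -> url rel]} {
    puts stderr "no posting URL found in the mail body"
    exit 1
 }
 # the http package only handles https after tls has been registered,
 # so fall back to plain http
 regsub {^https} $url {http} url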

 # fetch the thread
 set token [http::geturl $url]
 set html [http::data $token]
 http::cleanup $token

 # parse the HTML and select the post <div> that contains the anchor
 # named after the posting id
 set dom [dom parse -html $html doc]
 set div [format \
    {//div[@class="post box gainlayout " and .//a[@name="%s"]]} $rel]
 set p [[$dom documentElement] selectNodes $div]

 # get and print the author
 set A [[$p selectNodes {.//div[@class="author"]}] asText]
 puts [regsub -all {\s+} [string trim $A] { }]

 # get and print the date
 set D [[$p selectNodes {.//div[@class="date"]}] asText]
 puts [regsub -all {\s+} [string trim $D] { }]

 # get and print the names of attachments
 foreach F [$p selectNodes {.//div[@class="attachment"]}] {
    puts [regsub -all {\s+} [string trim [$F asText]] { }]
 }
 puts ""

 # get and print the text of the posting
 set T [[$p selectNodes {.//div[contains(@class,"text")]}] asText]
 puts "$T"
 puts ""
 puts "$url#$rel"
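
The URL extraction step can be tried on its own, without any mail or network access. The following is a minimal sketch using the same regular expression as the script; the sample body text, topic number and posting id are made up for illustration and do not come from a real notification mail.

 # Hypothetical notification body; the URL and ids are invented.
 set body "A new posting is available:\nhttps://www.mikrocontroller.net/topic/12345#67890\n"

 # Split the link into the thread URL and the posting id after the "#".
 if {[regexp {(https?://[^\#]*)\#([0-9]+)} $body -> url rel]} {
    regsub {^https} $url {http} url
    puts "url=$url rel=$rel"
 } else {
    puts "no link found"
 }

With the sample body above this prints url=http://www.mikrocontroller.net/topic/12345 rel=67890. The posting id in rel is what the script later uses to find the matching anchor in the fetched page.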