#!/bin/sh
# -*- tcl -*- \
exec tclsh "$0" ${1+"$@"}

package require Tcl 8.3

if {[llength $argv] == 0} {
    puts stderr "usage: wiki-reaper page ?page ...?"
    exit 1
}

# Use nstcl if it is available; otherwise define minimal stand-ins
# for the three ns_* commands that are needed below.
if {![catch { package require nstcl-html }] &&
    ![catch { package require nstcl-http }]} {
    namespace import nstcl::*
} else {
    package require http

    proc ns_geturl {url} {
        set conn [http::geturl $url]
        set html [http::data $conn]
        http::cleanup $conn
        return $html
    }

    # The first argument only mirrors the nstcl calling convention;
    # its value is ignored.
    proc ns_striphtml {-tags_only html} {
        regsub -all -- {<[^>]+>} $html "" html
        return $html ;# corrected a typo here
    }

    proc ns_urlencode {string} {
        set allowed_chars {[a-zA-Z0-9]}
        set encoded_string ""

        foreach char [split $string ""] {
            if {[string match $allowed_chars $char]} {
                append encoded_string $char
            } else {
                scan $char %c ascii
                append encoded_string %[format %02x $ascii]
            }
        }

        return $encoded_string
    }
}

proc output {data} {
    # we don't want to throw an error if stdout has been closed
    catch { puts $data }
}

proc reap {page} {
    package require htmlparse

    set url  http://wiki.tcl.tk/[ns_urlencode $page]
    set now  [clock format [clock seconds] -format "%e %b %Y, %H:%M" -gmt 1]
    set html [ns_geturl $url]

    # can't imagine why these characters would be in here, but just to be safe
    set html [string map [list \x00 "" \x0d ""] $html]
    # mark where each <pre> block starts and ends, so that the boundaries
    # survive the tag stripping below
    set html [string map [list <pre> \x00 </pre> \x0d] $html]

    if {![regexp -nocase {<title>([^<]*)</title>} $html => title]} {
        set title "(no title!?)"
    }

    if {![regexp -nocase {<i>Updated on ([^G]+ GMT)} $html => updated]} {
        set updated "???"
    }

    output "#####"
    output "#"
    output "# \"$title\""
    output "#"
    output "# Tcl code harvested on: $now GMT"
    output "# Wiki page last updated: $updated"
    output "#"
    output "#####"
    output \n

    set html [ns_striphtml -tags_only $html]

    foreach chunk [regexp -inline -all {\x00[^\x0d]+\x0d} $html] {
        # drop the two marker characters, then decode HTML entities
        set chunk [string range $chunk 1 end-1]
        set chunk [::htmlparse::mapEscapes $chunk]

        foreach line [split $chunk \n] {
            # the wiki indents preformatted text by one space; remove it
            if {[string index $line 0] == " "} {
                set line [string range $line 1 end]
            }
            output $line
        }
    }

    output \n
    output "# EOF"
    output \n
}

foreach page $argv {
    reap $page
}

Sample usage:
- First, get the above code into a file; you have to start somewhere ;-). Save this page into a file called "wiki-reaper", and edit the contents to remove comments, etc.
- Make certain that the file will be found when you attempt to run it. On Unix-like systems, that means putting the file into one of the directories in $PATH.
- wiki-reaper 4718 causes wiki-reaper to fetch itself... :-)
- if verbatim text (text in <pre>...</pre> form) starts off with a certain marker, it gets recognized as being a "snippet"
- snippets are stored in a separate read-only area, and remain accessible forever, even if the page subsequently changes
- the main trick is that snippets get stored on the basis of their MD5 sum
- each snippet also includes: the wiki page#, the IP of the submitter, a timestamp, and a tag
- the tag is extracted from the special marker that introduces a snippet; it is a "name" for the snippet, to help deal with multiple snippets on a page
- if you have an MD5, you can retrieve a snippet, without risk of it being tampered with, via a URL, say http://mini.net/wikisnippet/<this-is-the-32-character-md5-in-hex>
- the IP stored with it is the IP of the person who made the change and created the snippet in the first place, so it is a reliable indicator of the source of the snippet
- if you edit a page and don't touch snippet contents, nothing happens to them
- if you do alter one, it gets a new MD5 and other info, and gets stored as a new snippet
- if you delete one, it stops being on the page, but the old one is retrievable as before
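The storage scheme in the list above could be sketched roughly as follows. This is only an illustration under assumptions: the procedure names snippet_put and snippet_get and the in-memory array are hypothetical, the md5 package from tcllib is assumed to be installed, and nothing here reflects how the wiki itself is implemented.

```tcl
package require md5   ;# from tcllib (assumed to be installed)

# Hypothetical store: a global array "snippets", keyed by the MD5 of the text.

# Store a snippet and return its key (32 lowercase hex characters).
proc snippet_put {content page ip tag} {
    global snippets
    set key [string tolower [md5::md5 -hex $content]]
    # Existing entries are never overwritten, so an old snippet stays
    # retrievable after the page changes; altered text simply gets a new key.
    if {![info exists snippets($key)]} {
        set snippets($key) [list content $content page $page ip $ip \
                                timestamp [clock seconds] tag $tag]
    }
    return $key
}

# Fetch a snippet by key and check that the content still hashes to that key.
proc snippet_get {key} {
    global snippets
    if {![info exists snippets($key)]} {
        error "no snippet stored under $key"
    }
    array set rec $snippets($key)
    if {![string equal [string tolower [md5::md5 -hex $rec(content)]] $key]} {
        error "snippet $key failed its MD5 check"
    }
    return $rec(content)
}
```

For example, `set key [snippet_put "puts hello" 4718 127.0.0.1 demo]` stores a snippet for page 4718 (the IP and tag are placeholders), and `snippet_get $key` returns the text only if it still matches its MD5.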
SB 2002-11-23: If you forget for a minute about validating code integrity and think about the possibility of modifying program code independent of location, then it sounds like a very good idea. One example is showing the progress of coding. The start is a very simple piece of example code; the example is then slightly modified to show how the program can be improved. With this scheme, every improvement to the code can be traced back to the very beginning and, hence, work as a tutorial for new programmers. If we then think about trust again, there are too many avenues for code fraud that I do not know about.
escargo 23 Nov 2002 - I have to point out that the IP address of the source is subject to a bunch of qualifications. Leaving out the possibility of the IP address being spoofed, I get different IP addresses because of the different locations I use to connect to the wiki; with subnet masking it's entirely possible that my IP addresses could look very different at different times even when I am connected from the same system.
Aside from that issue, could such a scheme be made to work well with a version of the unknown proc and mounting the wiki, or part of the wiki, through VFS? This gets back to the TIP dealing with metadata for a repository.
This in turn leads me to wonder, how much of a change would it be to add a page template capability to the wiki? In practice now, when we create a new page, it is always the same kind of page. What if there was a policy change that allowed for creating each new page selected from a specific set of types of pages. The new snippet page would be one of those types. Each new page would have metadata associated with it. Instead of editing pages always in a text box, maybe there would be a generated form. Is that possible? How hard would it be? This could lead from a pure wiki to a web-based application, but I don't know if that is a bad thing or not. Just a thought. (Tidied up 5 May 2003 by escargo.)
LV May 5, 2003 - with regards to the snippet ideas above, I wonder if, with the addition of CSS support here on the wiki, some sort of specialized marking would not only enable snipping code, but would also enable some sort of special display as well - perhaps color coding to distinguish proc names from variables from data from comments, etc.
CJU March 7, 2004 - In order to do that, you would need to add quite a bit of extra markup to the HTML. I once saw somewhere that one of the unwritten "rules" of wikit development was that preformatted text should always be rendered untouched from the original wiki source (with the exception of links for URLs). I don't particularly agree with it, but as long as it's there, I'm not inclined to believe that the developer(s) are willing to change.
Now, straying away from your comment for a bit, I would rather have each preformatted text block contain a link to the plaintext within that block. This reaping is an entertaining exercise, but it's really just a work-around for the fact that getting just the code out of an HTML page is inconvenient for some people. I came to this conclusion when I saw a person suggest that all reapable pages on the wiki should have hidden markup so that the reaper could recognize whether the page was reapable or not. To me, it's a big red flag when you're talking about manually editing hundreds or thousands of pages to get capability that should be more or less automatic.
I'm looking at toying around with wikit in the near future, so I'll add this to my list of planned hacks.
LV 2007 Oct 08: Well, I changed the mini.net reference to wiki.tcl.tk. But there is a bug that results in punctuation being encoded. I don't know why that wasn't a problem before. But I changed one string map into a call to ::htmlparse::mapEscapes to take care of the problem.
tb 2009 Jun 16: Hm... I still get unmapped escape sequences when reaping from this page using kbskit-8.6. I don't get them when reaping from a running wikit. Am I missing something?
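The entity decoding discussed in the two comments above can be tried in isolation. The following is a small sketch, assuming tcllib's htmlparse package is installed (the script already requires it); the sample string is made up for illustration and does not come from any wiki page.

```tcl
package require htmlparse   ;# from tcllib, as already used by reap

# Made-up sample: code as it might look once the HTML tags have been
# stripped but the entities are still encoded.
set escaped {if {$a &lt; $b &amp;&amp; $c &gt; 0} { puts &quot;ok&quot; }}

# A hand-written string map only decodes the entities it lists ...
set by_hand [string map {&lt; < &gt; > &quot; \" &amp; &} $escaped]

# ... whereas mapEscapes decodes every escape sequence the package knows.
set by_pkg [::htmlparse::mapEscapes $escaped]

puts $by_hand
puts $by_pkg
```

For this sample both lines print the same decoded code. If escape sequences still show up in reaped output, feeding the offending text through a test like this may help tell whether the page contains entities that mapEscapes does not know, or whether the text was escaped twice.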
See also:
- wiki-runner
- TWiG
- fetch <page>.txt to get the Wiki markup instead of the HTML
LV - 2009-06-17 07:37:08
Is anyone still using this program? Do any of the wiki's enhancements from the past year or two provide a way to make this type of program easier?
jdc - 2009-06-17 08:34:23
Fetching <pagenumber>.txt will get you the Wiki markup. Best start from there when you want to parse the wiki pages yourself. Another option is to fetch <pagenumber>.code to only get the code blocks. Or use TWiG.
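A reaper built on jdc's suggestion could be much shorter than the script above. This is a minimal sketch, assuming the .txt and .code suffixes are simply appended to the page number on the same base URL that reap uses; the procedure names wiki_url and fetch are hypothetical, and whether the server still answers at these addresses has not been checked.

```tcl
package require http

# Build the URL for one rendering of a page. The base URL is the one used in
# reap; the suffix is "", ".txt" (wiki markup) or ".code" (code blocks only).
proc wiki_url {pagenumber {suffix ""}} {
    return http://wiki.tcl.tk/$pagenumber$suffix
}

# Fetch a URL and return its body, in the same way as the ns_geturl stand-in.
proc fetch {url} {
    set conn [http::geturl $url]
    set body [http::data $conn]
    http::cleanup $conn
    return $body
}
```

With these two procedures, `puts [fetch [wiki_url 4718 .code]]` would print just the code blocks of page 4718, with no HTML to strip and no entities to decode.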
