Wikit DB Repair

CMcC 9May07 - this is the code I used to repair the wiki from history.

Can someone advise as to why it mucks up unicode in titles?

LV I posted pointers to this page on both the Metakit and the tclerswiki mailing lists. Hopefully someone will stop by with comments.


   package require Mk4tcl
   package require fileutil

   encoding system utf-8

   set dbf [lindex $argv 0]
   set histdir [lindex $argv 1]

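   # History files are named <id>-<date>-<who>.
   # Keep only the most recent revision of each page id.
   foreach f [glob -tails -directory $histdir *] {
       if {[string match .* $f]} {
           continue    ;# skip hidden files
       }
       lassign [split $f -] id date who
       if {![info exists diffs($id)]
           || $date > [lindex $diffs($id) 0]
       } {
           set diffs($id) [list $date $id $who $f]
       }
   }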

   mk::file open db $dbf

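   foreach id [lsort -integer [array names diffs]] {
       #lappend repairs [lindex $diffs($id) 1]
       lassign $diffs($id) date id1 who f
       # Split the history file into lines; line 0 carries the title.
       set content [split [fileutil::cat -encoding utf-8 [file join $histdir $f]] \n]
       set title [lindex $content 0]
       # Skip the first four lines and keep the rest as the page body.
       set content [join [lrange $content 4 end] \n]
       if {$id >= [mk::view size db.pages]} {
           # New row: the page name is the text between the first and second colon.
           set title [string trim [lindex [split $title :] 1]]
           puts "adding $id '$title'"
           # The appended row lands at index [mk::view size db.pages],
           # which equals $id only if no ids are missing.
           mk::row append db.pages name $title page $content date $date who $who
       } else {
           puts "modding $id"
           mk::set db.pages!$id page $content date $date who $who
       }
   }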

   mk::file commit db
   mk::file close db

wdb Just a guess -- as far as I understand, Metakit stores ASCII only -- perhaps it makes sense to convert the Unicode characters to Tcl escapes such as \u004f before writing to the db, and to run subst -nocommands -novariables $title after reading back?
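
A minimal sketch of the escaping wdb describes. The proc names escapeNonAscii and unescapeNonAscii are hypothetical and belong to neither Wikit nor Metakit. The sketch assumes every character fits in four hex digits, and it escapes the backslash as well so that subst can reverse the process without touching anything else. Whether Metakit needs this at all is disputed in the reply that follows.

```tcl
# Hypothetical helpers for illustration; not part of Wikit or Metakit.

# Replace every non-ASCII character, and the backslash itself, with a
# \uXXXX escape so the result is plain ASCII.
# Assumption: all code points fit in four hex digits.
proc escapeNonAscii {s} {
    set out ""
    foreach ch [split $s ""] {
        scan $ch %c code
        if {$code > 127 || $code == 92} {
            append out [format {\u%04x} $code]
        } else {
            append out $ch
        }
    }
    return $out
}

# Reverse the escaping. Only backslash substitution is enabled, so
# "$" and "[" in a title are left alone.
proc unescapeNonAscii {s} {
    return [subst -nocommands -novariables $s]
}

set title "caf\u00e9 \[x\] \$y"
set stored [escapeNonAscii $title]
puts $stored
puts [expr {[unescapeNonAscii $stored] eq $title}]
```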

stevel Metakit stores and returns Tcl strings as is - i.e. UTF-8


NEM Regarding unicode breakage, I don't see anything wrong with the code here. What does the code that saves the history look like? Are you sure it is saving pages as UTF-8? If you get rid of the encoding system and -encoding options, does that help things? Also, does Metakit know anything about encodings or does it treat strings as just blobs of binary data?


EMJ Page contents have also been mucked up - see e.g. page 18008, which I had fixed not long ago - and it also seems to have changed its page number (it was 18012). Also, if you look at the references to page 17213 you will see many pages listed which do not actually contain such a reference - I edited a couple, which forced them off the list, but most of them do not contain the reference and are still there.
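
One thing that could be checked against the renumbering reported above: the repair script appends a row whenever $id is at or beyond the current view size, and an appended row always lands at the index equal to that size. If some page id had no history file, every later page would shift down. This is only a guess at a cause; nothing on this page confirms it. A sketch of such a check, with missingIds as a hypothetical proc name:

```tcl
# Hypothetical helper: list the ids that are absent from $ids when
# counting upward from $start. Ids below $start are ignored.
proc missingIds {ids start} {
    set missing {}
    set expected $start
    foreach id [lsort -integer $ids] {
        while {$expected < $id} {
            lappend missing $expected
            incr expected
        }
        if {$id == $expected} {
            incr expected
        }
    }
    return $missing
}

# Example with made-up ids: 6 and 9 have no history file.
puts [missingIds {5 7 8 10} 5]
```

In the repair script this could be called with [array names diffs] and the size of db.pages before the loop starts; a non-empty result would mean appended rows will not end up at their own id.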


WJP "encoding system" tells Tcl what encoding to use when communicating with the system. With the rare exception of situations in which Tcl doesn't know what the system encoding is, issuing this command is almost always a mistake since it can only screw things up. It is not a substitute for "fconfigure ...-encoding..."
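
A minimal sketch of the per-channel approach, assuming the history files are UTF-8 on disk. The proc names readUtf8 and writeUtf8 and the demo file name are hypothetical. The encoding is set on each channel with fconfigure, and the system encoding is left alone.

```tcl
# Hypothetical helpers: set the encoding on the channel rather than
# changing the system encoding.
proc readUtf8 {path} {
    set chan [open $path r]
    fconfigure $chan -encoding utf-8
    set data [read $chan]
    close $chan
    return $data
}

proc writeUtf8 {path data} {
    set chan [open $path w]
    fconfigure $chan -encoding utf-8
    puts -nonewline $chan $data
    close $chan
}

# Round trip through a scratch file in the current directory.
set demo [file join [pwd] utf8-demo.txt]
set text "Title: caf\u00e9 \u65e5\u672c\u8a9e"
writeUtf8 $demo $text
puts [expr {[readUtf8 $demo] eq $text}]
file delete $demo
```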


Category Wikit