[KBK] 2007-06-12 Owing to various problems that have happened over the years, the Wiki is known to have a number of [Pages containing invalid UTF-8 sequences]. People who are interested in improving the Wiki are invited to attempt to repair the text of these pages. Note that all invalid UTF-8 sequences have been replaced with the character � (\ufffd) - searching for that character will locate the damage within a page. ''([http://wiki.tcl.tk/_search?S=%EF%BF%BD*&_charset_=UTF-8] is a link to pages with problems, and this page too!)''

----

[KBK] 2007-06-29 The problem with the [remove diacritic] page is that the testing for "valid" UTF-8 is intentionally overzealous. When I reviewed the damaged pages, a great many of them contained the dreaded "double encoding": ISO8859-1 text expanded to UTF-8, with the result interpreted as ISO8859-1 and expanded to UTF-8 a second time. The result of this "double encoding" is that a character such as é (\u00e9) is expanded into the two-byte UTF-8 sequence C3 A9, and that sequence is then interpreted as the spurious combination \u00c3\u00a9.

The page in question was, as far as I can tell, the ''only'' case of either of the characters \u00c2 (upper-case Latin letter A with circumflex) and \u00c3 (upper-case Latin letter A with tilde) appearing on the Wiki ''other'' than as the result of this process; these two characters are extremely uncommon even in natural languages that use them. (French, for instance, often omits accents from capital letters other than É.) So it seemed wise to reject these two characters rather than have, say, broken browsers silently convert ü to the presumptively valid pair of characters \u00c3\u00bc (upper-case Latin letter A with tilde followed by the vulgar fraction ¼). Given the large number of browsers out there that appear to get it wrong, I really don't know what else to do. I'm open to suggestions.
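The double encoding described above can be reproduced - and, so long as the bad bytes have not yet been replaced with \ufffd, reversed - with Tcl's [encoding] command. A minimal sketch (the sample string is just an illustration):

======
# Simulate the double encoding: correctly encoded text is re-read
# as ISO8859-1 and so each UTF-8 byte becomes a separate character.
set s "résumé"
set mangled [encoding convertfrom iso8859-1 [encoding convertto utf-8 $s]]
puts $mangled    ;# rÃ©sumÃ©

# Run the two steps in the opposite direction to undo the damage.
set repaired [encoding convertfrom utf-8 [encoding convertto iso8859-1 $mangled]]
puts $repaired   ;# résumé
======

Note that this repair only works on pages where the doubly encoded bytes are still intact; once an invalid sequence has been collapsed to \ufffd, the original bytes are gone and the text must be fixed by hand.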
[LV] Perhaps in the case where there is a possibility of a character being correct, the user should be prompted with an "are you certain" type prompt.

[Lars H]: Try adding a hidden field (like the O field used for page versions to detect edit conflicts) to the edit page form, which contains some non-ASCII characters (e.g. those occurring in the page already). If the browser gets the text to edit wrong, there's a fair chance it gets all form fields wrong in the same way. Since the server knows what went out in this extra field, it can verify that it gets the same thing back.

Hmm... Looking at the code for this edit, there is a hidden item named _charset_ which doesn't appear to have any value. Is this an incomplete implementation of the idea I propose?

[Lars H]: My edit #124 was bad -- attempting repair. Oddly, this browser (Safari) didn't have the encoding problem with the old Wiki.

[Lars H]: Edit trying to diagnose the encoding problem. Will surely disturb the contents further.

----

[jdc] 29-nov-2007: I used the following script on the wiki database to detect invalid UTF-8 sequences:

======
lappend auto_path /home/decoster/tcl/Wub/Utilities
package require Mk4tcl
package require utf8

mk::file open db wikit.tkd
mk::loop i db.pages {
    lassign [mk::get $i name page] name page
    set data [encoding convertto identity $page]
    set point [utf8::findbad $data]
    if { $point >= 0 && $point < [string length $page] - 1 } {
        puts "bad utf8: $i / $point"
    }
}
mk::file close db
exit
======

This reported the following pages:

======
bad utf8: db.pages!2957 / 9075
bad utf8: db.pages!2987 / 2143
bad utf8: db.pages!4588 / 5130
bad utf8: db.pages!8410 / 292
bad utf8: db.pages!8442 / 5608
bad utf8: db.pages!8788 / 886
bad utf8: db.pages!9112 / 4925
bad utf8: db.pages!9281 / 554
bad utf8: db.pages!12169 / 4736
bad utf8: db.pages!14525 / 2935
bad utf8: db.pages!15412 / 4059
bad utf8: db.pages!15599 / 3036
bad utf8: db.pages!19658 / 310
bad utf8: db.pages!19693 / 9485
======

[LV] Any way for
the above code to display a bit of context - or is there some option in the various web browsers to show which character of the page is at a given position? It's just tough to figure out what needs to be fixed with the info here. And is that utf8 package available here on the wiki someplace?

[jdc] It's possible to print some text around the bad UTF-8 characters, but I can't put the result here :-) Replace the puts with:

======
puts "bad utf8: $i / $point:\n[encoding convertfrom identity [string range $data [expr {$point-10}] [expr {$point+10}]]]"
======

----
!!!!!! %|[Category Characters] | [Category Discussion] | [Category Wikit]|% !!!!!!