Encoding Translations and i18n

Now that the Tcl core has a fairly comprehensive set of native translations among character sets, it's time to flesh out this great beginning. Many translations are not yet available for a simple

 encoding convertfrom ?encoding_name? string

Yet the world-wide documentation of these encodings is expanding rapidly. Recently (Dec 1999), an ebcdic.enc[L1 ] was posted by Jan Nijtmans [L2 ] based on a web table [L3 ], but there are so many more [L4 ] still missing. Mark Leisher has a compendium at his homepage [L5 ], for example. Tcl can be a powerful tool for standardization and automatic compatibility [L6 ] leadership in data exchange. A mapping of Tcl encoding names to IANA's list is available at [L7 ].
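
To see which translations a given installation actually offers, the interpreter can be asked directly. A minimal sketch follows; the name ebcdic is only an example and may or may not be present on your system.

```tcl
# List the encodings this interpreter knows about.
set known [lsort [encoding names]]
puts "[llength $known] encodings available"

# "ebcdic" is just an example name; it may be missing on a given install.
set wanted ebcdic
if {[lsearch -exact $known $wanted] >= 0} {
    puts "$wanted is available"
} else {
    puts "$wanted is missing; an .enc file would have to be supplied"
}

# Bytes in, characters out: byte 0xE9 is e-acute in iso8859-1.
set text [encoding convertfrom iso8859-1 "\xE9"]
puts [expr {$text eq "\u00E9"}]
```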


UTF-8[L8 ] and other transformations [L9 ]

Ref: (Compuserve)[L10 ] (Wyoming)[L11 ] (Germany)[L12 ] (Wiki)[L13 ]

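The UTF-8 encoding (e.g. Unicode-like encoding for the web - Netscape/IE 4+ support [L14 ]) is an alternative encoding to Unicode-16. It still encodes character by character, but with a variable number of bytes: ASCII characters take one byte, and everything else takes a lead ('escape') byte followed by one or more continuation bytes. You cannot mix Unicode-16 with UTF-8 in the same data, but you can convert losslessly between them, so long as you stay out of the range that only the Unicode-32 encodings can hold.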

  \xFD\xBF\xBF\xBF\xBF\xBF translates to U+7FFFFFFF (Unicode-32)
  \xFB\xBF\xBF\xBF\xBF     translates to U+03FFFFFF (Unicode-32)
  \xF7\xBF\xBF\xBF         translates to U+001FFFFF (Unicode-32)
  \xEF\xBF\xBF             translates to U+0000FFFF or ''\uFFFF'' in Tcl
  \xDF\xBF                 translates to U+000007FF or ''\u07FF'' in Tcl
  \x7F                     is the highest single-byte code in UTF-8
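
The rows of the table that fit in 16 bits can be checked against the built-in utf-8 encoding. This is a small sketch; utf8_hex is a helper name made up for this example.

```tcl
# Return the UTF-8 bytes of one code point as a lowercase hex string.
# utf8_hex is a hypothetical helper, not part of Tcl.
proc utf8_hex {cp} {
    set bytes [encoding convertto utf-8 [format %c $cp]]
    binary scan $bytes H* hex
    return $hex
}

foreach cp {0x7F 0x07FF 0xFFFF} {
    puts [format "U+%04X -> %s" $cp [utf8_hex $cp]]
}
```

The hex digits printed for each code point should match the byte sequences in the left column of the table.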

Although there are Unicode escaped glyphs, similar to those so often used like &nbsp; for the ISO8859-1 non-breaking space, you cannot count on browsers (yet) properly interpreting them within a page, especially when the page itself has not been tagged as using charset=utf-8.

BR - Do you have specific experience here? In theory the numeric character entities (what you call "escaped glyphs", like &#160;) do not depend on the charset that a file is tagged with. The charset tag is explicitly for the characters not represented by entities. I have experience with IE, Netscape 4, Mozilla and Opera. Among those I remember that I have had problems with hexadecimal character entities like &#xA0; and with UTF-8 display support, especially outside of Latin-1 (as expected for non-Unicode apps). Other than that things have worked fine.


The following procs should be considered alpha quality. In Tcl 8.x, with its built-in encodings, simply use utf-8 as the encoding name for encoding convertto/convertfrom.

 #
 # Converts a Unicode string into an array of 16-bit values, for which
 # the low 8 bits of each character should be emitted to give the true
 # UTF-8 value (e.g. [encoding convertto iso8859-1] in most cases)
 # Equivalent: [encoding convertto utf-8 $string]
 #
 proc {unicode_to_utf8} {string} {
   set rv {}
   foreach c [split $string {}] {
     scan $c %c i
     if {$i < 128} {
       append rv $c
     } elseif {$i < 2048} {
 #       Two bytes: 110xxxxx 10xxxxxx
       append rv [format %c%c [expr {(($i & 1984) >> 6) | 192}] \
                              [expr {($i & 63) | 128}]]
     } elseif {$i < 65536} {
 #       Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
       append rv [format %c%c%c [expr {(($i & 61440) >> 12) | 224}] \
                                [expr {(($i & 4032) >> 6) | 128}] \
                                [expr {($i & 63) | 128}]]
     } elseif {$i < 2097152} {
 #       Can't happen in Tcl 8.3.x and below
     } elseif {$i < 4294967296} {
 #       Can't happen in Tcl 8.3.x and below
     }
   }
   return $rv; # to be interpreted as a byte array
 }


 #
 # Converts a "string" of 16-bit UTF-8 entities into true unicode-16 where
 # values of \u0000-\uFFFF are specified in the UTF-8.  Source data
 # likely was read in as [encoding convertfrom iso8859-1].  When the
 # second parameter (uescape) is specified as a non-zero (TRUE) value,
 # any UTF-8 value above U+0000FFFF will be inserted as a pseudo \u
 # escaped ASCII-hex value.  When it is not specified, any values above
 # U+0000FFFF will be replaced with a \uFFFC (not a character) which is
 # officially called the "Object Replacement Character"
 # Equivalent: [encoding convertfrom utf-8 $hextetarray]
 #
 proc {utf8_to_unicode} {hextetarray {uescape 0}} {
   set rv {}
 #   Split into a list of one-character elements for llength/lindex below
   set hextetarray [split $hextetarray {}]
   for {set x 0} {$x < [llength $hextetarray]} {incr x} {
     scan [lindex $hextetarray $x] %c i
     if {$i > 253} {
 #       Cannot be handled in 31 bits, let alone 16-bit Unicode-16
 #       Most likely an error - absorb ONE byte
       append rv ?
     } elseif {$i >= 252} {
 #       Cannot be handled in 16-bit Unicode, this is a 32-bit value
 #       in the range U+04000000..U+7FFFFFFF
       if {$uescape} {
         set iiiiii [expr ($i & 1) << 30]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iiiii [expr ($i & 63) << 24]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iiii [expr ($i & 63) << 18]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iii [expr ($i & 63) << 12]
         incr x
         scan [lindex $hextetarray $x] %c i
         set ii  [expr ($i & 63) << 6]
         incr x 
         scan [lindex $hextetarray $x] %c i
         set i [expr $i & 63]
         append rv {\u}
         append rv [format %X [expr $i | $ii | $iii | $iiii | $iiiii | $iiiiii]]
       } else {
         append rv "\uFFFC"
         incr x 5
       }
     } elseif {$i >= 248} {
 #       Cannot be handled in 16-bit Unicode, this is a 32-bit value
 #       in the range U+00200000..U+03FFFFFF
       if {$uescape} {
         set iiiii [expr ($i & 3) << 24]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iiii [expr ($i & 63) << 18]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iii [expr ($i & 63) << 12]
         incr x
         scan [lindex $hextetarray $x] %c i
         set ii  [expr ($i & 63) << 6]
         incr x
         scan [lindex $hextetarray $x] %c i
         set i [expr $i & 63]
         append rv {\u}
         append rv [format %X [expr $i | $ii | $iii | $iiii | $iiiii]]
       } else {
         append rv "\uFFFC"
         incr x 4
       }
     } elseif {$i >= 240} {
 #       Cannot be handled in 16-bit Unicode, this is a 32-bit value
 #       in the range U+00010000..U+001FFFFF
       if {$uescape} {
         set iiii [expr ($i & 7) << 18]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iii [expr ($i & 63) << 12]
         incr x
         scan [lindex $hextetarray $x] %c i
         set ii  [expr ($i & 63) << 6]
         incr x
         scan [lindex $hextetarray $x] %c i
         set i [expr $i & 63]
         append rv {\u}
         append rv [format %X [expr $i | $ii | $iii | $iiii]]
       } else {
         append rv "\uFFFC"
         incr x 3
       }
     } elseif {$i >= 224} {
       set iii [expr ($i & 15) << 12]
       incr x
       scan [lindex $hextetarray $x] %c i
       set ii  [expr ($i & 63) << 6]
       incr x
       scan [lindex $hextetarray $x] %c i
       set i [expr $i & 63]
       append rv [format %c [expr $i | $ii | $iii]]
     } elseif {$i >= 192} {
       set ii [expr ($i & 31) << 6]
       incr x
       scan [lindex $hextetarray $x] %c i
       set i [expr ($i & 63)]
       append rv [format %c [expr $i | $ii]]
     } elseif {$i < 128} {
       append rv [lindex $hextetarray $x]
     }
   }
   return $rv; # as a Unicode string
 }
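
For comparison, here is the round trip that the two procs above aim to reproduce, done with the built-in encoding command. The sample string is made up for this sketch.

```tcl
# A made-up sample: "caf", e-acute, a space, and one CJK character.
set original "caf\u00E9 \u56FD"

set bytes [encoding convertto utf-8 $original]   ;# characters -> UTF-8 bytes
set back  [encoding convertfrom utf-8 $bytes]    ;# UTF-8 bytes -> characters

puts [string length $original]   ;# 6 characters
puts [string length $bytes]      ;# 9 bytes: 3 + 2 + 1 + 3
puts [expr {$back eq $original}]
```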

Byte-Order Mark

Also, there has been no strong push for use of the Unicode "introducer" in the Tcl community (yet). It's wise to use \uFEFF at the beginning of any Unicode-16 encoded file. This gives insurance about byte order: a reader that sees the two bytes swapped gets \uFFFE, which is guaranteed to never be a true Unicode[L15 ] character. In UTF-8, the BOM is \xEF\xBB\xBF.

When writing files in UTF-8 encoding, Tcl gives no support for a BOM prefix (this isn't an easy matter: the open command would have to be extended for the case where a utf-8 file is opened for writing). UTF-8 with a BOM may have the advantage of automatic encoding recognition by file processors, e.g. Excel's .csv file import. A way to add a BOM prefix manually is as follows:

 set fid [open $filename w]
 fconfigure $fid -encoding binary
 puts -nonewline $fid \xef\xbb\xbf   ;# add bom header
 fconfigure $fid -encoding utf-8
 ....
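
On the reading side, the BOM usually has to be dropped again by hand, since (as far as I know) the utf-8 encoding passes it through as an ordinary \uFEFF character. A minimal sketch; strip_bom is a name invented for this example.

```tcl
# Remove a leading U+FEFF from already-decoded text, if there is one.
# strip_bom is a hypothetical helper, not part of Tcl.
proc strip_bom {text} {
    if {[string index $text 0] eq "\uFEFF"} {
        return [string range $text 1 end]
    }
    return $text
}

# Simulated file contents: the UTF-8 BOM followed by "hello".
set bytes "\xEF\xBB\xBFhello"
set text  [encoding convertfrom utf-8 $bytes]
puts [strip_bom $text]
```

Text without a BOM is returned unchanged, so the helper can be applied to every file that is read.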

Slash-U format

In Tcl (and/or Tcl/Tk) source files, we 'must' use slash-u format for unicode characters which are beyond the basic ASCII encoding, in order to preserve values across different system encodings. This proc is provided as an easy way to grab Unicode characters into strings which the interpreter will later encode into the desired values.

  proc {unicode_to_slashu} {string} {
    set rv {}
    foreach c [split $string {}] {
      scan $c %c c
      append rv {\u}
      append rv [format %.4X $c]
    }
    return $rv
  }

Note: Java calls this format "Unicode escapes", C and C++ talk about UCNs, Universal Character Names.
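
Going the other way, a slash-u string held in a variable (for instance, read from a file rather than typed into a script) can be turned back into characters with subst. A small sketch with a made-up two-character sample:

```tcl
# Braces keep the backslashes literal, as they would be in data read from a file.
set escaped {\u56FD\u5BB6}

# Only backslash substitution is wanted, so commands and variables are switched off.
set text [subst -nocommands -novariables $escaped]

puts [string length $text]
```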


Unicodes to HTML format

Here's a similar helper that converts all characters above 127 in a string to the decimal entity format in HTML (e.g. &#22269; for \u56fd, i.e. 国):

 proc u2html {s} {
    set res ""
    foreach u [split $s ""] {
        scan $u %c t
        if {$t>127} {
            append res "&#$t;"
        } else {
            append res $u
        }
    }
    set res
 } ;# RS
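
The reverse direction can be sketched along the same lines. html2u is a name made up here; it handles only decimal entities of the form produced above, and it uses lassign, so it assumes Tcl 8.5 or later.

```tcl
# Replace decimal entities such as &#22269; with the characters they name.
# html2u is a hypothetical helper; hex and named entities are left untouched.
proc html2u {s} {
    set out ""
    set pos 0
    # Each match yields two index pairs: the whole entity and its digits.
    foreach {whole digits} [regexp -all -inline -indices {&#([0-9]+);} $s] {
        lassign $whole from to
        lassign $digits dfrom dto
        append out [string range $s $pos [expr {$from - 1}]]
        append out [format %c [scan [string range $s $dfrom $dto] %d]]
        set pos [expr {$to + 1}]
    }
    append out [string range $s $pos end]
    return $out
}

puts [string length [html2u "a&#22269;b"]]
```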

See also the Drag and Drop page on the Wiki, and The Lish family - The i18n package for other ways to get Unicodes from 7-bit ASCII.


LV What does one do when there is a need for an encoding which doesn't seem to be present? Obviously, one possibility is that the encoding is there, but under a different name. For instance, one of my developers needs something called simplified Chinese, aka gbk. I don't see anything by that name in Tcl/Tk. So, is this something known by some other name, or is this a missing encoding? And if this is a missing encoding, then how does an encoding get created and included in a future release?

KBK GBK (国家标准扩展 "Guo2 jia1 Biao1 zhun3 Kuo4 zhan3") means "Extended National Standard." It's a proper superset of GB2312, and gb2312 (occasionally, gb2312-raw) will usually work. gb12345 is another possibility for what is meant by GBK. If you've heard a Windows user talk about GBK, they usually mean cp936. So those are all things to try; "GBK" is itself rather ill-defined.

If you *do* have an unsupported encoding, please submit a feature request (FR).
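
Before filing anything, it is quick to check which of the candidate names mentioned above a given installation really has. A minimal sketch; available_encodings is a name invented for this example.

```tcl
# Return those candidate names that this interpreter lists as encodings.
# available_encodings is a hypothetical helper, not part of Tcl.
proc available_encodings {candidates} {
    set known [encoding names]
    set found {}
    foreach name $candidates {
        if {[lsearch -exact $known $name] >= 0} {
            lappend found $name
        }
    }
    return $found
}

# The candidates suggested above for "GBK".
puts [available_encodings {gb2312 gb2312-raw gb12345 cp936}]
```

An empty result means none of the names is installed, and a new .enc file or a feature request is the next step.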

Lars H: Some googling turned up [L16 ], which appears to be a BSD manpage on gbk.