Transliteration

Richard Suchenwirth 2003-06-04 - Transliteration is a kind of string conversion in which one subsequence (one or more characters) is replaced by another subsequence in all its occurrences. This may be of use in cryptography, but most of all in linguistics and i18n, where it is often helpful to transliterate between 7-bit ASCII and the standard orthography of a given language - see the Lish family.

Let's take Danish as an example, where the three special letters Å, Æ, Ø are in use, but often not found on keyboards in other countries. We can define a mapping like this:

 AA <-> Å
 AE <-> Æ
 OE <-> Ø

For a long time, such mappings were implemented in Tcl with regsub -all, but for quite a while now we have been able to use the more efficient string map command. A simple but complete implementation could be (using the \x escapes for characters on the U+00 page, which are substituted when string map parses the braced map as a list):

 proc danlish string {
    string map {AA \xC5 AE \xC6 OE \xD8} $string
 }

However, it is even more efficient not to write a proc at all, but to express this special case of a single command with constant leading argument(s) by "currying" it into an interp alias:

 interp alias {} danlish {} string map {AA \xC5 AE \xC6 OE \xD8}

This way, any call to danlish will be executed as the specified string map, provided it is called with exactly one argument. But as the interp alias syntax is not especially beautiful, we may abstract the above process into a reusable proc - read "x" as "trans-":

 proc xlit {name map} {
    interp alias {} $name {} string map $map
 }
 xlit danlish {
    AA \xC5 AE \xC6 OE \xD8 
    aa \xE5 ae \xE6 oe \xF8
 }

Now we have a pleasant way of declaring transliterations by just giving a name and the mapping list, and have used it here to take care of lowercase too. But what if we have a genuine Danish string and want our ASCII back? We would call string map with the inverted map. The map can be retrieved from the alias by introspection, so we don't have to waste a global variable on it, and the inversion seems to be best done pairwise, like this:

 proc lswap list {
    #-- turn {a b c d} to {b a d c}
    set res {}
    foreach {a b} $list {lappend res $b $a}
    set res
 }
 proc from {name string} {
    #-- inverse of a given 'xlit' mapping
    set rmap [lindex [interp alias {} $name] end]
    string map [lswap $rmap] $string
 }
 #--- testing:
 puts 1:[danlish AEroe]
 puts 2:[from danlish [danlish AEroe]]

This should print first the Danish spelling of that Baltic island (Ærø), and second our original input.
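To make the introspection step more tangible, here is a minimal sketch that repeats the definitions above so it runs on its own, then looks at what interp alias actually returns and checks the round trip. The input Aeroe in the last line is my own illustration, not part of the original test: it shows that the map as written only covers all-uppercase and all-lowercase digraphs.

```tcl
# Definitions repeated from above so that this sketch is self-contained.
proc xlit {name map} {
    interp alias {} $name {} string map $map
}
proc lswap list {
    set res {}
    foreach {a b} $list {lappend res $b $a}
    set res
}
proc from {name string} {
    set rmap [lindex [interp alias {} $name] end]
    string map [lswap $rmap] $string
}
xlit danlish {
    AA \xC5 AE \xC6 OE \xD8
    aa \xE5 ae \xE6 oe \xF8
}

# The alias target is a list: the command words, then the constant map.
set target [interp alias {} danlish]
puts [lrange $target 0 1]                 ;# the two words: string map
puts [llength [lindex $target end]]       ;# number of map elements

# Round trip: ASCII -> Danish -> ASCII.
set danish [danlish AEroe]
puts [expr {[from danlish $danish] eq "AEroe"}]

# Illustration only: title-case "Ae" is not a key, so only "oe" is mapped.
puts [danlish Aeroe]
```

If title-case input matters, one option would be to add keys such as Aa, Ae, Oe to the map; note that the inverted map then contains each special letter twice as a key, and string map uses whichever pair comes first.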

The members of the Lish family sometimes need extra processing, like right-to-left conversion in Arblish and Heblish, or totally different treatment as in Hanglish, but part or all of e.g. Greeklish or Ruslish can be rewritten in terms of the simple but generic xlit code above. Well, the learning never stops...
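As a hedged sketch of how such a family member might be expressed with xlit: the name minigreek and its tiny map below are hypothetical and are not the actual Greeklish table. The one point it is meant to show is that string map tries the keys in list order at each position, so multi-character keys such as th must come before their single-character prefixes.

```tcl
# xlit repeated from above so that this sketch is self-contained.
proc xlit {name map} {
    interp alias {} $name {} string map $map
}

# Hypothetical mini-map for illustration only: digraphs first, then
# single letters. \u escapes are resolved when the map is parsed as a list.
xlit minigreek {
    th \u03B8 ps \u03C8
    t \u03C4 p \u03C0 s \u03C3 a \u03B1 e \u03B5
}

puts [minigreek thea]      ;# "th" is matched as one unit
puts [minigreek pasta]     ;# only single-letter keys apply here

# With the order reversed, "t" wins and the "h" is left over.
puts [string map {t \u03C4 th \u03B8} th]
```

The same ordering concern does not arise in the Danish map above, since none of its keys is a prefix of another.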

Lars H: Right-to-left conversion in Arblish and Heblish? Is the *lish for right-to-left scripts in reverse logical order? The Unicode standard stresses the fact that the characters in Unicode strings should be in logical order (regardless of writing directions) quite heavily. Or are you reversing the strings because Tk widgets don't respect the writing directions?
RS: Exactly - a workaround for Bidi rendering. The Arblish and Heblish routines produce strings that look right, though I'm aware they are wrong in memory... :(