The Lish family

Richard Suchenwirth 2001-01-16 -- The Lish family is a set of transliterations all designed to convert strings in lowly 7-bit ASCII to appropriate Unicode strings (see Unicode and UTF-8) in some major non-Latin writing systems.

The name comes from the common suffix "lish" as in English, which is actually the neutral element of the family, faithfully returning its input ;-) Some rules of thumb:

  • One *lish character should unambiguously map to one target character, wherever applicable
  • One target letter should be represented by one *lish letter ([A-Za-z]), wherever applicable. Special characters and digits should be avoided for coding letters
  • Mappings should look intuitive and/or follow established practices
  • In languages that distinguish case, the corresponding substitutes for upper- and lowercase letters should also correspond casewise in lower ASCII (e.g. see Ruslish)
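The casewise rule can be sketched with Tcl's built-in string map. This is only an illustration, not the actual Ruslish code: the helper name withUppercase and the partial letter table are assumptions, covering just enough letters for the example phrase.

```tcl
# Hypothetical helper, not part of the Lish code: given a lowercase-only
# table of {ascii cyrillic} pairs, append the matching uppercase pairs so
# that case in the ASCII source corresponds to case in the target script.
proc withUppercase {lower} {
    set map $lower
    foreach {from to} $lower {
        lappend map [string toupper $from] [string toupper $to]
    }
    return $map
}

# Partial table (an assumption), one ASCII letter per Cyrillic letter.
set rusLower [list \
    a \u0430  d \u0434  e \u0435  g \u0433  i \u0438 \
    k \u043a  l \u043b  m \u043c  n \u043d  o \u043e \
    r \u0440  s \u0441  v \u0432]
set rusMap [withUppercase $rusLower]

puts [string map $rusMap "Moskva i Leningrad"]
```

Because each key is a single letter and string map does not rescan its own output, the table stays a plain 1:1 substitution.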

It all began with Greeklish, which is not my invention, but is used by Greeks on the Internet for writing Greek without Greek fonts or character set support. I just extended the practice I found with the convention of marking accented vowels with a trailing apostrophe (so it's not a strict 1:1 transliteration anymore).
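A minimal sketch of the trailing-apostrophe convention, assuming a table that covers only the letters of the phrase Ellhnikh' Dhmokrati'a. The proc name demoGreeklish is hypothetical and the real Greeklish table is certainly larger. The one technical point it shows: string map tries keys in list order, so the two-character accented keys must come before the plain vowels.

```tcl
# Hypothetical, partial Greeklish: only the letters needed for the example.
proc demoGreeklish {text} {
    set map [list \
        h' \u03ae  i' \u03af \
        D \u0394  E \u0395 \
        a \u03b1  h \u03b7  i \u03b9  k \u03ba  l \u03bb \
        m \u03bc  n \u03bd  o \u03bf  r \u03c1  t \u03c4]
    # Accented keys (h', i') are listed first so they win over h and i.
    return [string map $map $text]
}

puts [demoGreeklish "Ellhnikh' Dhmokrati'a"]
```

An apostrophe that does not follow a vowel in the table is simply passed through unchanged.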

The *lish procedures can be called with any number of arguments, for convenience. So you can just type

   arblish dby w Abw Zby
   ruslish Moskva i Leningrad
   greeklish Ellhnikh' Dhmokrati'a

and watch the output on any Unicode-enabled device (e.g. all Tk widgets that accept text). BTW: printing Unicode text works quite nicely on NT: display it in a Tk text widget, then copy and paste it into Notepad with a Unicode font selected.
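The "any number of arguments" convenience presumably rests on Tcl's special args parameter. A sketch using the neutral element of the family, written here as a hypothetical proc english that just returns its input:

```tcl
# Hypothetical neutral member of the family: collects all its words
# into one string and returns them unchanged.
proc english args {
    return [join $args " "]
}

puts [english Moskva i Leningrad]      ;# three separate words
puts [english "Moskva i Leningrad"]    ;# one quoted string, same result
```

One consequence of this design: unquoted words are rejoined with single spaces, so runs of whitespace survive only if the text is passed as one quoted argument.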

Depending on job requirements and interests, the family grew, and now contains (see also Languages supported by Lish):

  • Arblish -- see A simple Arabic renderer, r2l and context glyphs
  • Chinlish -- Pinyin words to Unicode, partial solution - add the words you want
  • Eurolish -- Danish, French, German, Icelandic, Italian, Spanish, Swedish
  • Greeklish -- the mother of all Lishes
  • Hanglish -- computing Hangul 2.0 Unicodes from Jamo equivalents
  • Heblish -- for Hebrew (r2l, context forms of letters explicitly indicated)
  • Japlish -- work in progress, here's a first shot
  • Ruslish -- for Russian
  • APLish -- for APL (not exactly a natural language)
  • Monglish -- for Mongolian in Tcl strimjes; different because the vertical writing requires a bottom-up design for pixel fonts, and the output is bitmap images of Mongolian text

For frequent use in multilingual contexts, one might introduce two-letter language code aliases: ar, gr, kr, iv, jp, ru. Source text with such embeddings just needs to be subst-ed and then makes nice Unicode. One minor flaw is left-justification even for Arabic and Hebrew, but in plain text you can't do much more than pad with spaces ;-(
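A sketch of how such an alias and the subst step could fit together. The proc demoRuslish and its partial table are stand-ins (assumptions) for the real Ruslish. The -novariables and -nobackslashes switches are a choice made here so that stray $ and \ in ordinary text are left alone; they are not something the original setup is known to use.

```tcl
# Stand-in for the real Ruslish: partial table, illustration only.
proc demoRuslish args {
    set map [list \
        L \u041b  M \u041c \
        a \u0430  d \u0434  e \u0435  g \u0433  i \u0438 \
        k \u043a  n \u043d  o \u043e  r \u0440  s \u0441  v \u0432]
    return [string map $map [join $args " "]]
}

# Two-letter language code as a command alias.
interp alias {} ru {} demoRuslish

# Source text with embedded calls; only [...] is substituted.
set source {The capital is [ru Moskva], the second city is [ru Leningrad].}
puts [subst -novariables -nobackslashes $source]
```

Since subst evaluates any bracketed command, this approach is only suitable for source text you trust.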

With this set of transliterations, I've basically covered most of what the fabulous Bitstream Cyberbit (available from http://jefferson.village.virginia.edu/IBabble/download/cyberbit.html ) and other monster fonts have to offer. Any volunteers for "Thailish" ;-?

Future plans: The parts of the Lish family developed over the years and were just recently put under their common wrapper. You can see evolution in the code. When I have some time, I'd like to unify concepts and interfaces more than before to turn the whole thing into the i18n package. Most parts of the Lish family are coming together under the roof of taiku, see taiku goes multilingual, or its little PocketPC brother iKu.


See lish2html for a filter that substitutes embedded *lish calls from and to HTML files.