Mapping words to integers

Richard Suchenwirth 2004-02-15 - On Integer range generator, SS asked for a way to enumerate a range of words, like

 {a..z aa..az ba ..}

Such enumeration must of course be based on an alphabet - the example shows that it's supposed to be [a-z] in this case.

At first glance, the task appears similar to conversion from or to baseN, where N is the length of the alphabet. However, the zero element is somehow special: if "a" were 0, "aa" would be 00, which for an integer is the same. So I let "a" count as 1, up to "z" as 26 -- and "aa" is then (a)*26+(a), 1*26+1 = 27. Different from normal base conversion,

 az = (a)*26+(z), 1*26+26 = 52 (in place of 2*26+0)

, and so on. The code below takes care of this "zero-less" base conversion, by adding 1 in word2int, or decrementing by 1 in int2word. The empty string "" maps to resp. from 0, which makes sense for a "zero element" of a set of strings. Come to think, in "words" over an alphabet, all characters are equal, while in integers (at any base), leading zeroes are redundant. So zero is a special case, and the other characters can occur in a 1..N range, while with baseN conversion, the characters used are in the 0..(N-1) range. If zero is "" and part of the rendering alphabet, we would be thrown back before the time zero as visible digit was invented (in India) - because nobody could see it :)

One trivial practical use of this is the names of Excel spreadsheet columns, which, with [A-Z] as alphabet, just follow the same pattern - but their little world ends at column 256, or "IV"... My experiments show that with the [a-z] alphabet, strings of up to 13 letters can be correctly transformed to integers. Above that size, strange errors may happen:

 % int2word [word2int kaaaaaaaaaaaaa]
 coscxanchbfnjk

So best check the string length early in word2int. I expect the limit to be related to the length of the alphabet, but haven't researched this in detail yet.

Note the cute side-effect that the name of the alphabet-listing procedure a-z looks, when called (in brackets), exactly like the corresponding regular expression. And make sure if you change that, say to [a-z0-9], to provide a proc with that name... Even if it looks like a RE, it is just a function call.

 proc word2int {word {alphabet ""}} {
    if {$alphabet eq ""} {set alphabet [a-z]}
    set i [expr {wide(0)}]
    foreach c [split $word ""] {
        set i [expr {$i*[llength $alphabet]+[lsearch $alphabet $c]+1}]
    }
    if {$i<0} {error "word $word too long to fit in integer"}
    set i
 }
 proc int2word {int {alphabet ""}} {
    if {$alphabet eq ""} {set alphabet [a-z]}
    set word ""
    set la [llength $alphabet]
    while {$int > 0} {
        incr int -1
        set word  [lindex $alphabet [expr {$int % $la}]]$word
        set int   [expr {$int/$la}]
    }
    set word
 }
 proc a-z {} {list a b c d e f g h i j k l m n o p q r s t u v w x y z}
# Testing:
 proc must {cmd res} {
    if {[set r [eval $cmd]] ne $res} {error "$cmd -> $r, not $res"}
 }
 must {word2int a}  1
 must {word2int z}  26
 must {word2int aa} 27
 must {word2int az} 52
 must {word2int ba} 53
 must {int2word 1}  a
 must {int2word 26} z
 must {int2word 27} aa
 must {int2word 52} az
 must {int2word 53} ba
 must {int2word [word2int suchenwi]} suchenwi
#--------------------- Systematic testing in a loop
 for {set i 1} {$i<10000} {incr i} {
    if {[word2int [int2word $i]] != $i} {
        error "$i: [int2word $i] / [word2int [int2word $i]]"
    }
 }

This can also be used for a simple encryption, e.g. by multiplying resp. dividing the intermediate integers, or adding/subtracting a well-known integer:

 % int2word [expr 2*[word2int hello]]
 pjxyd
 % int2word [expr [word2int pjxyd]/2]
 hello
 % int2word [expr [word2int message] + [word2int secretkey]]
 sepwymlmd
 % int2word [expr [word2int sepwymlmd] - [word2int secretkey]]
 message

Note however that for addition, message and key should be of about the same length, othe====== rwise the prefix of the longer will show unhidden, like the "se" above in "sepwymlmd", or more evident:

 % int2word [expr [word2int longmessage] + [word2int key]]
 longmesslmd
 % int2word [expr [word2int word] + [word2int verylongkey]]
 veryloodzxc

Not different from when you add a long and a short integer...


See also Mapping words to floats


This code found a surprising use in Brute force meets Goedel, where int2word could produce RPN programs from their Goedel number