Anagrams

With a given word list (such as from the 12-dicts [L1 ] project) formatted as a Tcl procedure that returns a list [L2 ], it is interesting to speculate on the fastest procedure for obtaining anagrams of a given set of letters.

The following procedure is my effort. It has some refinements: (1) it takes an optional template, as for string match and (2) it allows for a wildcard character (such as the blank in Scabble or the joker in card games), written as "_".

Code:

 proc anagram {style letters {template *}} {
  set anagrams [list]
  set usingTemplate [string match "template" $style]
  foreach word [wordList] {
    if {$usingTemplate} {
      if {![string match $template $word]} {continue}
    }
    set discard $letters
    set flag true
    for {set i 0} {$i<[string length $word]} {incr i} {
      set match [string first [string index $word $i] $discard] 
      if {$match>-1} {
        set discard [string replace $discard $match $match]
      } else {
        #-- We don't hold the char; but we may have a blank
        set match [string first _ $discard]
        if {$match>-1} {
          #-- We do have a blank, so we use that instead
          set discard [string replace $discard $match $match]
        } else {
          set flag false
          break
        }
      }
    }
    if {$flag} {
      lappend anagrams $word
    }
  }
  return $anagrams
 }

 # Examples: 
 catch {console show}
 puts [anagram anyof RETAINS]
 puts [anagram template RETAINS *A]
 puts [anagram template TO_UE ?????]

Alastair Davies - 17 January 2006


HJG I get the error 'Invalid command name "wordList"'. Maybe a Tcl 8.5 feature ? - RS No - looks like a function that reads the word list file, and returns them as a list. Left as an exercise for the reader :^)

Alastair responds: HJG and others can download [L3 ] the source code that defines "wordList". The start and end of this look like...

 proc wordList {{lang en}} {
  switch $lang {
    en {
      return {
        AARDVARK
        AARDVARKS
        ABACI
        ABACK
        ABACUS

        ZYGOTE
        ZYGOTES
        ZYGOTIC
        ZYMURGY
      }
    }
  }
 }

...but the whole file is 1359 kb. English only at present, but does anyone have a similar French or German word list that could be added to the source?


KPV By far the fastest way to do this is to preprocess the word list and create what's known as an anagram dictionary[L4 ]. Once that's done, figuring out anagrams takes O(1) time.

Over lunchtime sandwiches, Alastair created an anagram dictionary:

 set anagramDict [dict create]
 foreach word [wordList] {
   dict lappend anagramDict [join [lsort [split $word ""]] ""] $word
 }

In anagramDict, the key "AEINRST", for example, lists the words "ANTSIER", "NASTIER", "RETAINS", "RETINAS" and "RETSINA", and it takes (approximately) no time at all to access them. Thanks, KPV! Incidentally, there are 69894 keys in the dictionary and 75273 words in the original list.

Thinking of anagrams in the looser sense (using any, rather than all, of the letters), I came up with this recursive procedure to reduce a word letter by letter to find all possible keys:

 proc reduce {word} {
  lappend reduction $word
  if {[string length $word]>2} {
    for {set i 0} {$i<[string length $word]} {incr i} {
      set reduction [concat $reduction [reduce [string replace $word $i $i]]]
    }
  }
  return [lsort -uniq $reduction]
 }

These keys can be used to search the anagramDict (as defined above):

 proc anagram {word} {
  global anagramDict
  set word [join [lsort [split $word ""]] ""]
  set result [list]
  foreach key [reduce $word] {
    if {[dict exists $anagramDict $key]} {
      set result [concat $result [dict get $anagramDict $key]]
    }
  }  
  return $result
 }

KPV Speaking of anagram dictionaries, there are a couple of interesting questions you can ask. See if you answer them before looking at the dictionary.

  • What's the first entry after 'a' and 'aa'?
  • What's the last entry?
  • What's the longest entry that is itself a word?

Alastair is amused: Ha! Funny. Makes me look at words in a new light, to be considering the letters in alphabetical order.

Unfortunately (for me), recursion (see proc reduce {word} above) turns out not to be a good solution for identifying all the possible groups of letters within a word. The time taken blows up for words of length somewhere around nine or ten letters. Has anyone got a better suggestion to share?

Lars H: I'm not surprised, considering that you generate every subword of length k from a word of length n as many as (n-k)! times. Changing reduce so that it branches on the two cases (i) subword contains first letter of $word and (ii) subword does not contain first letter of $word should improve this considerably; see power set of a list for some relevant procedures.

There is also a regexp-based alternative, which may be very fast:

 proc anagram2 {word} {
  global anagramDict
  array set anagramArray $anagramDict
  set RE ^[join [lsort [split $word ""]] ?]?\$
  set result [list]
  foreach key [array names anagramArray -regexp $RE] {
    lappend result {*}$anagramArray($key)
  }  
  return $result
 }; # Not tested, but should work.

It's necessary to put the dictionary in an array for this, because the -regexp option of array names was not remembered (or simply discarded as feature creep) when dict keys was designed.

Alastair found the pointer to enumerating the power set of a list very helpful, and snaffled the subsets procedure for the following (speedy) anagrammer:

 proc subsets {l} {
   set subsets [list [list]]
   foreach e $l {
     foreach subset $subsets {
       lappend subsets [lappend subset $e]
     }
   }
   return $subsets
 }

 proc anagram3 {word {template *}} {
   global anagramDict
   set letters [lsort [split $word ""]]
   set result [list]
   foreach subset [subsets $letters] {
     set key [join $subset ""]
     if {[dict exists $anagramDict $key]} {
       foreach word [dict get $anagramDict $key] {
         if {[string match $template $word]} {
           lappend result $word
         }
       }
     }
   }
   return $result
 }

This version doesn't handle wildcard characters - as far as I can see these would need to be inserted in each possible position within each of the returned subsets, to be matched using a glob pattern by dict keys. Incidentally, using regular expressions to match the keys (of an array) proved to be rather slow.