Huffman coding, part 2

Richard Suchenwirth 2002-10-31 - In Huffman coding, it was described how to encode and decode strings with bit sequences of varying length - short codes for frequent characters, longer ones for rare characters. In order to play with the binary command, here is a second take which starts by counting the character frequencies of a particular text:

 proc ccount string {
    # character frequency count, e.g.
    # ccount Tennessee => {e 4} {n 2} {s 2} {T 1}
    foreach char [split $string ""] {
        if {[catch {incr a($char)}]} {set a($char) 1}
    }
    set res {}
    foreach char [array names a] {
        lappend res [list [list $char] $a($char)]
    }
    lsort -integer -decr -index 1 $res
 }
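
if 0 {A quick check - note that the order of the tied counts for n and s depends on the hash order of array names, so your result may vary slightly:}

 puts [ccount Tennessee]   ;# e.g. {e 4} {n 2} {s 2} {T 1}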

if 0 {Such a tally can well be used to construct a Huffman code tree, as I've learned from SICP: the two rarest entries are repeatedly merged into one node until the tree is complete:}

 proc huffmantree string {
    set res [ccount $string]
    while {[llength $res]>1} {
        # merge the two rarest entries (found at the end of the sorted list) into one node
        foreach {c1 n1} [lindex $res end] break
        foreach {c2 n2} [lindex $res end-1] break
        set res [lrange $res 0 end-2]
        lappend res [list [list $c1 $c2] [expr {$n1+$n2}]]
        set res [lsort -integer -decr -index 1 $res]
    }
    lindex [lindex $res 0] 0
 }
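
if 0 {For Tennessee this yields, for instance, the following tree - the exact nesting depends on how ccount happened to order the equal counts:}

 puts [huffmantree Tennessee]   ;# one possible result: e {n {T s}}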

if 0 {For use with string map, such a tree has to be turned into a paired list of characters and bit sequences. A recursive procedure traverses the code tree, appending 0 or 1 to the prefix, until a terminal node is reached:}

 proc tree2map {tree {prefix ""}} {
    if {[llength $tree]==1} {
        list $tree $prefix
    } else {
        concat [tree2map [lindex $tree 0] ${prefix}0] \
             [tree2map [lindex $tree 1] ${prefix}1]
    }
 }
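
if 0 {Continuing the Tennessee example, the tree above maps to these codes - the more frequent a character, the shorter its code:}

 puts [tree2map {e {n {T s}}}]   ;# e 0 n 10 T 110 s 111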

if 0 {The resulting list has to be joined once and is then ready for encoding into a binary string. The following proc returns both the map it used and the binary string:}

 proc encode string {
    # note that "//" is appended here, so "/" is guaranteed a code in the map
    set map [join [tree2map [huffmantree $string//]]]
    list $map [binary format b* [string map $map $string]]
 }
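
if 0 {As a small aside on binary format b*, which matters for the padding discussed below: the bits are packed low-to-high within each byte, and an incomplete final byte is padded with zero bits:}

 binary scan [binary format b* 01100101] H* hex
 puts $hex                                           ;# a6
 puts [string length [binary format b* 110100110]]   ;# 9 bits fill 2 zero-padded bytes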

if 0 {Decoding is easy, given the map (which has to be inverted, so that {a b c d} becomes {b a d c}) and the binary string as input:}

 proc decode {mapstring} {
    # invert the map, so bit sequences translate back to characters
    set rmap {}
    foreach {char code} [lindex $mapstring 0] {
        lappend rmap $code $char
    }
    binary scan [lindex $mapstring 1] b* bits
    string map $rmap $bits
 }
 # Self-referential test data...
 set fp [open [info script]]
 set data [read $fp]
 close $fp

if 0 {You can now experiment with short strings:

 puts [decode [encode "Hello, world!"]]

or longer strings, namely this page:

 puts [decode [encode $data]]

and notice that the original string comes back intact, except for maybe some stray characters added at the end. This is because binary format produces a sequence of whole bytes, so when the bit string's length is not a multiple of eight, encode pads it with zero bits, and those padding bits may decode into extra characters. A simple way to get around this is to append a known sentinel sequence (e.g. //) to the original and let the receiver discard everything after that sequence - a sketch of this follows below. (Of course the sentinel should not appear in the original text - so don't use this for C++ source code...)

By comparing the string length of the original with that of the binary string, you get an idea of how much compression is possible with this approach: even the custom Huffman map together with the binary string comes out shorter than the original, e.g.

 2885/3428 = 84.2%

You can use this for compressed encryption if both sender and receiver know the Huffman map, but then you have to take care that all characters in the input appear in the map - for this page (at some point during its writing) the compression ratio (counting the binary string only) is

 1991/3428 = 58.1%
}
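
if 0 {Here is a minimal sketch of the sentinel idea mentioned above. The names encode2 and decode2 are made up for illustration, and the sketch assumes that the sequence // never occurs in the payload:}

 proc encode2 string {
    # append the sentinel before encoding; "/" is always present in the map
    encode $string//
 }
 proc decode2 mapstring {
    # decode, then discard the sentinel and any stray padding characters after it
    set text [decode $mapstring]
    string range $text 0 [expr {[string first // $text]-1}]
 }
 puts [decode2 [encode2 "Hello, world!"]]   ;# prints: Hello, world!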

MS: Huffman coding single characters is not all that efficient.

As an illustration: for pure printable ASCII text, just stripping the two leading bits (01) from every byte compresses to 75% of the original (special handling for SPACE, 00100000, is needed - for instance, code it as 111111 if DEL is guaranteed not to appear in the source).

KPV: I think this criticism is misplaced, or rather that it applies not just to Huffman coding but to any generalized compression scheme. If you know that the data you want to compress has some specialized characteristics, then you can almost always utilize that knowledge to develop a better-optimized compression scheme.

MS definitely did *not* mean that comment as criticism! It was meant as a comment on the previous paragraph.

I find the Huffman implementation simply marvelous (Richard got us used to that), and Huffman coding per se is known to be very compact. The fact is that any compression scheme that looks at single characters (order 0) will not be very efficient for text sources.


SS 14Oct2004: I implemented Huffman code generation starting from the frequency table. I used a somewhat different algorithm, in which the tree generation is only implicit. The test case is the one you can find in the Introduction to Algorithms book. I think it's interesting to note that, while Huffman code generation is generally thought to require a tree, with high-level languages it can be done much more concisely in another way.

 proc HuffmanBuildCode freqtab {
     # freqtab is a list of {frequency symbol} pairs; the result is a
     # one-element table of the form {totalFrequency {{symbol bitcode} ...}}
     while {[llength $freqtab] > 1} {
         set freqtab [lsort -integer -index 0 $freqtab]
         set freqsum [expr {[lindex $freqtab 0 0]+[lindex $freqtab 1 0]}]
         set left [lindex $freqtab 0 1]
         set right [lindex $freqtab 1 1]
         set newnode {}
         foreach e $left {
             lappend newnode [list [lindex $e 0] 0[lindex $e 1]]
         }
         foreach e $right {
             lappend newnode [list [lindex $e 0] 1[lindex $e 1]]
         }
         set freqtab [lrange $freqtab 2 end]
         lappend freqtab [list $freqsum $newnode]
         puts $freqtab  ;# show the intermediate tables as the merging proceeds
     }
     return $freqtab
 }

 puts [HuffmanBuildCode {{5 f} {9 e} {12 c} {13 b} {16 d} {45 a}}]

The final line of the output (the intermediate tables are printed by the puts inside the loop) is

 {100 {{a 0} {c 100} {b 101} {f 1100} {e 1101} {d 111}}}
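
Purely as an illustration (the helper name HuffmanCode2Map is made up here), such a nested result can be flattened into the flat character/code pairing used by string map earlier on this page:

 proc HuffmanCode2Map result {
     # result has the form {totalFrequency {{symbol bitcode} ...}}
     set map {}
     foreach pair [lindex $result 1] {
         lappend map [lindex $pair 0] [lindex $pair 1]
     }
     return $map
 }
 puts [HuffmanCode2Map {100 {{a 0} {c 100} {b 101} {f 1100} {e 1101} {d 111}}}]

which prints

 a 0 c 100 b 101 f 1100 e 1101 d 111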

Arts and crafts of Tcl-Tk programming - Category Compression - Category File