Octet-packed integers

Richard Suchenwirth 2002-10-11 - This page was originally called "adaptive integers", but as JCW writes below, the term was wrong - hence I renamed this page. - "Shorter than a short, or longer than a long long...", that could be a slogan for "octet-packed integers" as described in MetaKit File Format - a way of encoding integers as a varying-length sequence of octets (8-bit bytes) in a very compact way. As I read the description there, an octet-packed integer is a sequence of octets, each carrying 7 bits "payload" of the original number, MSB first (i.e. "big endian"). The last octet is marked with the high bit 7 set, the others have it as 0. This way, arbitrary long integers can be stored (e.g. a 70-bit number in ten octets), while on the other hand the frequent small integers below 128 just consume one byte.

This concept is both more economic (you don't have to use 4 bytes if most of the time you're just storing 0's or 1's) and more flexible than fixed-size integers (which have an upper bound, MAXINT) - good for compact storage, but requiring some conversion when interfacing to the "outside world". JCW has certainly implemented these conversions efficiently in C, but like so often, I felt tempted to do it in Tcl too. Such sequences of bytes will of course be strings (since 8.0 we may have zero bytes inside strings), but their printed representation may be pretty unreadable, so the first tool for experimenting is a little hex-dumper for strings:

 proc string2hex string {
    set res {}
    foreach char [split $string ""] {
        scan $char %c ichar
        lappend res [format %02X $ichar]
    }
    set res
 }
 proc int2opi int {
    if {$int < 0} {error "can't convert negative number $int"}
    set res [expr {$int? "": [format %c 0x80]}]
    while {$int} {
        set low [expr {$int % 128}]  ;# get rightmost 7 bits
        if {$res==""} {incr low 128} ;# maybe set "end" bit
        set res [format %c $low]$res ;# prepend data octet
        set int [expr {$int / 128}]  ;# and carry on...
    }
    set res
 }
 proc opi2int opi {
    set res 0
    foreach byte [split $opi ""] {
        scan $byte %c ichar      ;# get numeric byte value
        set res [expr $res * 128 + ($ichar & 0x7F)]
        if {$ichar & 0x80} break ;# detect end bit
    }
    set res
 }
# ------------------- testing...
 foreach case {
    0 1 10 100 1000 10000 100000 127 128 255 256 1234567 2147483647
 } {
    set opi [int2opi $case]
    puts "$case -> [string2hex $opi] -> [opi2int $opi]"
 }

The above code passes all tests, which is at least encouraging. Further thinking on octet-packed integers, the concept is similar to the UTF-8 representation (see Unicode and UTF-8), which also employs byte sequences of varying length, but more efficient and stronger: UTF-8 can hold non-negative integers (what character codes really are) of up to 48 bits (one "locomotive" 0xFF, indicating that 8 "trailers" follow, each carrying a "payload" of 6 bits), while octet-packed integers can hold numbers of arbitrarily many bits, at 7 bits per byte. A 49-bit octet-packed integer is 7 bytes long, while the maximum UTF-8 holds 48 bits in 9 bytes.

Another comparison is, very evidently for Tcl, with numbers in decimal strings, where 0..9 take one byte, 10..99 2 bytes, etc. In contrast, octet-packed integers pack 0..127 in one byte, 128..16383 in two bytes, and so on. An additional advantage is that lists of opis can be packed without delimiting characters, because the last byte of each number makes clear that it's over, thus saving the space required between elements in "decimal string" lists. This requires some more coding:

 proc ints2opis ints {
    set res ""
    foreach int $ints {append res [int2opi $int]}
    set res
 }

That was easy. In the other direction, we have the problem that opis can be separated only by checking each byte's high bit. (Inserting blanks wouldn't help, as they might as well be parts of an opi. So we best rewrite the single-item converter to produce a list when meeting multiple opis:

 proc opis2ints opis {
    set ress {}
    set res 0
    foreach byte [split $opis ""] {
        scan $byte %c ichar      ;# get numeric byte value
        set res [expr $res * 128 + ($ichar & 0x7F)]
        if {$ichar & 0x80} {
            lappend ress $res
            set res 0
        }
    }
    set ress
 } ;# but see negative-enabled version below

#----------------------- Testing again:
 foreach case {
    1 {1 10 100 1000} {0 100 200} 4711 {}
 } {
    set opis [ints2opis $case]
    puts [list $case -> [string2hex $opis] -> [opis2ints $opis]]
 }

Works again. An empty list produces an empty opi string, which on reconversion produces an empty list - I didn't expect that, but it well makes sense. Multi-element lists come out equal again, but as single-element lists work too, we can do away with the single-case opi2int and let the plural form take over its job. (The savings of the compact format are best on longer lists, of course). The same consideration lets us integrate int2opi into int2opis, so we end up in two procedures for lists, which also cover single elements:

 proc ints2opis ints {
    set ress {}
    foreach int $ints {
        if {$int < 0} {error "can't convert negative number $int"}
        set res [expr {$int? "": [format %c 0x80]}]
        while {$int} {
            set low [expr {$int % 128}]  ;# get rightmost 7 bits
            if {$res==""} {incr low 128} ;# maybe set "end" bit
            set res [format %c $low]$res ;# prepend data octet
            set int [expr {$int / 128}]  ;# and carry on...
        }
        append ress $res
    }
    set ress
 } ;# but see negative-enabled version below

Well. A nice little package of two procs to deal with opis. Of course, the more you have the more you want. Negative integers appeared to be a tricky problem: a fixed width-integer is just left-filled with 1s to the fixed size - but we don't have a fixed size here. JCW explained how to do it below - add a leading 00 "sign byte". The updated code is at end.

Other wishes are easier fulfilled, e.g. we want to know the length of an opi sequence, which by definition is the number of bytes with high bit set:

 proc opis'length opis {
    set res 0
    foreach byte [split $opis ""] {
        if {[scan $byte %c] & 0x80} {incr res}
    }
    set res
 }
 #... or an accessor to the i-th element of an opi sequence:
 proc opis'index {opis i} {
    set from 0
    set n 0
    set bytepos 0
    foreach byte [split $opis ""] {
        if {[scan $byte %c] & 0x80} {
            if {$n == $i} {
                return [string range $opis $from $bytepos]
            }
            incr n
            if {$n >= $i-1} {set from [expr {$bytepos+1}]}
        }
        incr bytepos
    }
 }
 #...or an accessor to a specified range of opi sequence elements:
 proc opis'range {opis from to} {
    set frompos 0
    set n 0
    set bytepos 0
    foreach byte [split $opis ""] {
        if {[scan $byte %c] & 0x80} {
            if {$n == $to} {
                return [string range $opis $frompos $bytepos]
            }
            incr n
            if {$n == $from} {set frompos [expr {$bytepos+1}]}
        }
        incr bytepos
    }
 }

14oct02 jcw - Delightful whizlet stuff, Richard! Let me elaborate about the background of these things, and their use in the MetaKit database core. The 1..N-byte encoding is indeed used in MK as a way to store several things though I now tend to call them "byte-packed ints". The term "adaptive integers" is actually used for the fact that MK stores integer data as columns of 0/1/2/4/8/16/32-bit values, tightly packed. I.e. a million rows of 0/1 values are stored in 125 Kb. The adaptiveness is explained by the fact that the size gets adjusted to accommodate larger ints as they get stored. Storing 12345 in my example would cause MK to expand the column to 2 Mb.

FYI, in the byte-packed format you described, negative ints can be stored by prefixing these bytes with a zero byte as 1-s complement flag - this is possible because a zero byte never normally appears as first byte. So "148" means "20", while "0,148" means ~20, i.e. -21. Again, the nice property here is that such an encoding works for arbitrarily-sized ints.

14oct02 escargo - If this format is designed to store nonnegative integers, how do you disinguish between a natural zero (which would be a zero byte) and a zero byte acting as a 1-s complement flag?

RS: In the description I implemented, the last byte always has bit 7 set, so 0 is 0x80. JCW's suggestion would make -1 come out as 0x00 0x80, where the 0 byte marks the negative sign. This however means that negative numbers always cost a full byte more...

escargo: Looking back at the description above I see where only bit 7 of the last byte is set to one. That clears up my misunderstanding. It is still strange to see 1's complement numbers again. That is kind of like 8-bit bytes; most modern machines are 2's complement.

AK: I have updated the description of the file format to correct usage of adaptive integer versus octet-packed integer. Richard, note the reference to readkit.tcl on the format page. This will contain code to extract octet-packed integers too I guess. - RS: Not when I looked... but this is a roll-your-own fun project anyway. For breakfast, I have upgraded the two procs so now they deal with negatives too:}

 proc opis2ints opis {
    set ress {}
    set res 0
    set negative 0
    foreach byte [split $opis ""] {
        scan $byte %c ichar      ;# get numeric byte value
        if {$res==0 && $ichar==0} {incr negative}
        set res [expr $res * 128 + ($ichar & 0x7F)]
        if {$ichar & 0x80} {
            if $negative {set res [expr {~$res}]}
            lappend ress $res
            set res 0
            set negative 0
        }
    }
    set ress
 }
 proc ints2opis ints {
    set ress {}
    foreach int $ints {
        set negative 0
        if {$int < 0} {
            incr negative
            set int [expr {~$int}]
        }
        set res [expr {$int? "": [format %c 0x80]}]
        while {$int} {
            set low [expr {$int % 128}]  ;# get rightmost 7 bits
            if {$res==""} {incr low 128} ;# maybe set "end" bit
            set res [format %c $low]$res ;# prepend data octet
            set int [expr {$int / 128}]  ;# and carry on...
        }
        if $negative {set res \x00$res}
        append ress $res
    }
    set ress
 }

-1 is an interesting corner case, as it converts to 00 80, while 0 is 80.

escargo: But we have one's complement, which means that both positive and negative zero are possible. So 0 is 80 and negative zero is 00 80. One should be 81 and negative one should be 00 81. Sanity check, please (10/15/2002)

RS: According to JCWs sketch above, which I take as spec, the sign byte causes the number to be treated as its bitwise complement: 0 is 80, -1 is (sign) ~-1 (which is 0), which is 00 80. And we rather end up with two's complement again...

escargo: Something is lacking in either the description or the specification. It can't be one's complement if there aren't both positive and negative zero. Consider that this format should allow taking the absolute value by omitting the sign byte of a negative number (if it's one's complement). Therefore, if 1 = 81 then -1 = 00 81. The absolute value of 00 81 = 81. If -1 is converting to 00 80 then this not a one's complement encoding. In point of fact, this should be considered a signed magnitude [L1 ] encoding.

20060308 dcd: Sanity check: +1 == 81 => -1 == 00 <high bit><one's complement of 1> == 00 FE no? Here's a version that implements this (with some probably paranoid ops thrown in:)

proc opis2ints opis {
    set ress {}
    set res 0
    set negative 0
    foreach byte [split $opis ""] {
        scan $byte %c ichar      ;# get numeric byte value
        if {$res==0 && $ichar==0} {incr negative; continue}
        if {$negative} {
            set res [expr {$res * 128 + (128 + ~($ichar & 0x7f))}]
        } else {
            set res [expr {$res * 128 + ($ichar & 0x7f)}]
        }
        if {$ichar & 0x80} {
            if $negative {set res [expr {-1 * $res}]}
            lappend ress $res
            set res 0
            set negative 0
        }
    }
    set ress
 }

 proc ints2opis ints {
    set ress {}
    foreach int $ints {
        set negative 0
        if {$int < 0} {
            incr negative
            set int [expr {-1 * $int}]
        }
        set res [expr {$int ? "" : [format %c 0x80]}]
        while {$int} {
            set low [expr {$int % 128}]  ;# get rightmost 7 bits
            if {$negative} {set low [expr {~ $low & 0x7f}]}
            if {$res==""} {set low [expr {$low | 0x80}]} ;# maybe set "end" bit
            set res [format %c $low]$res ;# prepend data octet
            set int [expr {$int / 128}]  ;# and carry on...
        }
        if $negative {set res \x00$res}
        append ress $res
    }
    set ress
 }

Note that the opis2ints code is complicated by sign extension during the complement op.