zlib stream

Documentation

zlib stream mode ?options?

This command, part of zlib, creates a streaming compression or decompression command, allowing greater control over the compression/decompression process. It returns the name of the stream instance command. The mode must be one of compress, decompress, deflate, inflate, gzip, gunzip. The optional -level option, which is only valid for compressing streams, gives the compression level from 0 (none) to 9 (max), as in "zlib stream deflate -level 9".

The returned streamInst command will support the following subcommands:

streamInst add ?option? data
    • Shortcut for a put followed by a get.
streamInst checksum
    • Returns the current checksum of the uncompressed data, calculated using the appropriate algorithm for the stream's mode.
streamInst close
    • Disposes of the streamInst command. Deleting with rename works the same.
streamInst eof
    • Returns whether the end of the input data has been reached.
streamInst finalize
    • Shortcut for “streamInst put -finalize {}”.
streamInst flush
    • Shortcut for “streamInst put -flush {}”.
streamInst fullflush
    • Shortcut for “streamInst put -fullflush {}”.
streamInst get ?count?
    • Returns up to count bytes from the stream's internal buffers. If count is unspecified, returns as much as is available (without flushing).
streamInst put ?option? data
    • Appends the bytes data to the stream, compressing or decompressing as necessary. The option controls the type of flush done: -flush means to ensure that all data appended to the stream has been processed and made ready for get at some compression performance penalty, -fullflush also makes sure that the compression engine can restart from the point after the flush (at more penalty), and -finalize states that no more data will be written to the stream, causing any trailing bytes required by the format to be written.
streamInst reset
    • Recreates the stream, ready to start afresh. Discards whatever is in the stream's buffers.
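
A minimal round-trip sketch of the subcommands above, assuming Tcl 8.6 with the built-in zlib command. The variable names are illustrative only. It compresses a string with a gzip stream and restores it with a gunzip stream, calling get in a loop until nothing more comes back.

```tcl
# Sample input (illustrative only).
set original [string repeat "Twas brillig, and the slithy toves\n" 100]

# Compress: feed the data, finalize, then collect the output.
set z [zlib stream gzip -level 9]
$z put $original          ;# output may stay in the stream's buffers
$z finalize               ;# no more input: write the gzip trailer
set compressed ""
while {[string length [set chunk [$z get]]] > 0} {
    append compressed $chunk
}
$z close

# Decompress: feed the compressed bytes and read until get returns nothing.
set u [zlib stream gunzip]
$u put $compressed
set restored ""
while {[string length [set chunk [$u get]]] > 0} {
    append restored $chunk
}
set sawEnd [$u eof]       ;# true once the end of the compressed data was seen
$u close
```

For data that fits comfortably in memory, the one-shot commands zlib gzip and zlib gunzip do the same job; the stream form pays off when the input arrives piece by piece.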

Example - streaming over sockets

For simple zlib streaming over sockets, as in HTTP, zlib push is sufficient. This breaks down for more interactive protocols, as it gives you no way to control when a block is flushed to the receiver. If you want to flush each line, for example, you will need something like the following.

This code forces a flush each time the write method of the zchan object is called. If that does not suit your protocol, remove the -flush flag in method write and call the object's flush method directly.

This code was inspired by an experiment by karll.

See the zlib manual and http://www.bolet.org/~pornin/deflate-flush.html for more detail on zlib's flushing modes.

# it appears that [$transchan flush] doesn't get called any time interesting.
# So each [$transchan write] needs to flush by itself.
#
# Flushing an already flushed stream is a harmless error {TCL ZLIB BUF}, so we catch it
#
oo::class create zchan {
    variable Stream
    variable Chan
    variable Mode
    constructor {mode} {
        set Stream [zlib stream $mode]
        #  oo::objdefine [self] forward stream $Stream
    }

    method initialize {chan mode} {
        set Chan $chan
        set Mode $mode
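        # A stream either compresses or decompresses, so the channel must be
        # pushed for exactly one direction.
        if {$mode eq "write"} {
            return {initialize finalize write flush}
        } elseif {$mode eq "read"} {
            return {initialize finalize read drain}
        } else {
            error "zchan needs a channel open for reading or writing, not both"
        }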
    }
    method finalize {chan} {
        my destroy
    }

    method write {chan data} {
        try {
            $Stream add -flush $data
            # equivalent to:
            #  $Stream put $data
            #  $Stream flush
            #  $Stream get
        } trap {TCL ZLIB BUF} {} {
            return ""
        }
    }
    method flush {chan} {
        try {
            $Stream add -flush {}
            # equivalent to:
            #  $Stream flush
            #  $Stream get
        } trap {TCL ZLIB BUF} {} {
            return ""
        }
    }

    method read {chan data} {
        $Stream add $data
    }
    method drain {chan} {
        $Stream add -finalize {}
    }
}

if 0 {
    lassign [chan pipe] r w
    chan configure $w -translation binary -buffering none
    chan configure $r -translation binary -blocking 0
    lassign {gzip gunzip} out in
    puts $w "Frumious bandersnatch!"
    puts "read: [gets $r]"
    chan push $w [zchan create gw $out]
    chan push $r [zchan create gr $in]
    puts $w "Vorpal snacks!"
    puts "read: [gets $r]"
    puts $w "And bric-a-brac!"
    puts "read: [gets $r]"
    chan pop $w
    chan pop $r
    puts $w "Galumphing back"
    puts "read: [gets $r]"
}
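
The effect of the flush can also be seen without any channel machinery. The following is a small sketch, assuming Tcl 8.6 and using two bare streams in place of the two ends of a connection; the variable names are illustrative. Each line is written with add -flush, and the receiving stream can decode the resulting block straight away, before the stream has been finalized.

```tcl
set sender   [zlib stream deflate]
set receiver [zlib stream inflate]
set received ""

foreach line [list "Vorpal snacks!" "And bric-a-brac!"] {
    # add -flush is a put -flush followed by a get: everything written so
    # far is pushed out as a block the other side can decode immediately.
    set block [$sender add -flush "$line\n"]

    # The receiver decodes this block without waiting for end of stream.
    append received [$receiver add $block]
}

$sender close
$receiver close
```

Without the -flush option, the sender may hold a short line in its buffers and return nothing for it, which is what stalls an interactive protocol.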

AMG: I'm trying to read data from disk, compress it, and store the compressed result into an SQLite database. For small files this is easy, but Tcl panics when files exceed two gigabytes in size. Tcl strings simply can't grow that large. Thus, I need to stream the data rather than buffer it all at once.

At first I thought the way to go was to use [zlib push deflate] on [db incrblob], then [chan copy] from disk to the incrblob channel, but I have to preallocate the blob. If I set the blob size to that of the disk file (plus 10% in case the file is too random), this would work, except I have to follow up by truncating the blob to the actual compressed size. How can I tell what that size is? [chan copy] returns the uncompressed size, which doesn't do me any good. [chan tell] doesn't work on an incrblob channel. [zlib push] adds some configuration options to the channel, but none of them tell me how many bytes have passed in or out of the stream.

If I could use [zlib push deflate] on the read channel, [chan copy] would return the compressed size, but I get the error "compression may only be applied to writable channels". I really don't know why this error exists, but it's definitely getting in my way.

Next up: [::tcl::transform::zlib] from Tcllib. However, I found that for small files it doesn't produce any output at all. When [finalize] gets called, it's too late to finalize the zlib stream and return the last of the compressed data, so my version does this in [drain] instead. There may be cases where [drain] is too early to finalize the zlib stream, but [chan copy] only does one [drain] at the very end. Code below.

# zlibCompressor --
# Input stream compression.
oo::class create zlibCompressor {
    variable stream
    method initialize {handle mode} {
        set stream [zlib stream deflate -level 9]
        return {initialize finalize drain read}
    }
    method finalize {handle} {
        $stream close
        my destroy
    }
    method drain {handle} {
        $stream finalize
        return [$stream get][$stream reset]
    }
    method read {handle data} {
        $stream add $data
    }
}
oo::objdefine zlibCompressor method push {chan} {
    chan push $chan [my new]
}

Alas, this still doesn't work. [finalize] can return a lot of data all at once, but [chan copy] throws away all but the first four kilobytes or so. If I attempt to manually drain the rest using [chan read], that incurs more [finalize]s, giving me an unlimited stream of bogus data.

The only thing I can really do is bypass [chan copy] altogether:

set inChan [open $input rb]
set outChan [$db incrblob $table $column $rowid]
set stream [zlib stream gzip -level 9]
set size 0
set end 0
while {!$end} {
    if {[set inData [chan read $inChan 4096]] ne {}} {
        $stream put $inData
    } else {
        $stream finalize
        set end 1
    }
    set outData [$stream get]
    chan puts -nonewline $outChan $outData
    incr size [string length $outData]
}
$stream close
chan close $inChan
chan close $outChan
chan puts $size

As I dug in deeper, I discovered SQLite has blob size limits too, much tighter than Tcl even, so I had to implement a chunking scheme dividing files across multiple table rows. The incrblob system became less and less a good fit, but [zlib stream] is proving to be indispensable for this task.
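
One possible shape for such a chunking scheme is sketched below. It is not AMG's actual code: the proc name compressInChunks, its parameters, and the storeCmd callback are hypothetical, and the database side is left to the callback. The sketch assumes Tcl 8.6 and an input channel already configured for binary data.

```tcl
# Hypothetical helper: gzip everything readable from $in and hand the result
# to $storeCmd in pieces of at most $chunkSize bytes. $storeCmd is called as
# "storeCmd index bytes". Returns the number of pieces produced.
proc compressInChunks {in chunkSize storeCmd} {
    set z [zlib stream gzip -level 9]
    set pending ""
    set index 0
    set done 0
    while {!$done} {
        set inData [chan read $in 65536]
        if {[string length $inData] > 0} {
            $z put $inData
        } else {
            # End of input: let the stream write its trailer.
            $z finalize
            set done 1
        }
        # Collect everything the stream has ready so far.
        while {[string length [set out [$z get]]] > 0} {
            append pending $out
        }
        # Emit full pieces; on the last pass also emit the remainder.
        while {[string length $pending] >= $chunkSize
               || ($done && [string length $pending] > 0)} {
            set piece [string range $pending 0 [expr {$chunkSize - 1}]]
            set pending [string range $pending $chunkSize end]
            {*}$storeCmd $index $piece
            incr index
        }
    }
    $z close
    return $index
}
```

The callback could, for example, insert one table row per piece, keyed by the index, so that concatenating the rows in index order gives back a single gzip stream.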


AMG: Even though the documentation says that [get] returns as much data as is available, in practice it seems to return at most 65536 bytes at a time. If more data than that is immediately available, [get] has to be called repeatedly until it returns fewer bytes than that (or an empty string). I lost so much time trying to debug this in my program...
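
A small helper along these lines avoids the problem by looping until [get] comes back empty. This is only a sketch, assuming Tcl 8.6; the name drainStream is made up for illustration, and the 65536-byte limit is the behaviour observed above rather than something the documentation promises.

```tcl
# Hypothetical helper: collect everything a stream currently has available,
# calling get as many times as it takes.
proc drainStream {stream} {
    set out ""
    while {[string length [set chunk [$stream get]]] > 0} {
        append out $chunk
    }
    return $out
}

# Usage: 200000 bytes of input, more than a single get may hand back.
set data [string repeat "0123456789" 20000]

set z [zlib stream deflate]
$z put -finalize $data
set packed [drainStream $z]
$z close

set u [zlib stream inflate]
$u put $packed
set unpacked [drainStream $u]
$u close
```

Stopping at the first empty result, rather than at the first short one, keeps the loop independent of the exact buffer size.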