indexing a flat file, and a readBytes proc

<jkock, 2005-04-20, 2005-10-21>

The problem

Suppose you have a huge flat text file whose content actually comes in blocks. For example, this could be a Berkeley mbox, where a "From " pattern delimits the individual mails in the file. Now you want to create an index to this file, telling where each block starts and how big it is, and then later be able to read in only a specified block. At first sight this is an easy problem: first you create the index, and then you do seek-and-read.

However if the file is multi-byte encoded you run into some more serious problems due to the fact that seek measures in bytes whereas read measures in chars. (You may also run into trouble just in the trivial situation where the file has DOS line endings.)
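To make the mismatch concrete, here is a minimal sketch comparing the character count of a string with the number of bytes it occupies in utf-8. The sample word is just an illustration; any text with non-ASCII characters shows the same effect.

```tcl
# A 10-character word, three of whose characters are non-ASCII.
set txt "bl\u00e5b\u00e6rgr\u00f8d"

# [string length] counts characters, which is what [read] counts too.
set numChars [string length $txt]

# [encoding convertto] returns the encoded bytes; their count is what
# [seek] and [tell] measure.
set numBytes [string length [encoding convertto utf-8 $txt]]

puts "chars: $numChars, bytes: $numBytes"
```

Each of the three non-ASCII characters takes two bytes in utf-8, so the two counts drift apart by one for every such character that precedes a given position.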

For example, to create the index, just read in the entire file (using fconfigure appropriately) and find what you want with some regexp -indices. But these indices won't help you to seek, because seek needs a pure byte measurement... Bad luck. (Of course your difficulty in producing a byte measurement is this: you would have to read the entire text chunk up to each index to count how many bytes each char occupies. The expense of doing this is of course also the reason why seek cannot measure in chars...)
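The conversion described in the parenthesis can be sketched as follows. The helper name charToByteOffset and the sample text are made up for illustration, and the sketch assumes LF line endings, so that no eol translation interferes. Note that it re-encodes the whole prefix for every index, which is exactly the expense mentioned above.

```tcl
# Sample text with two blocks; the first contains non-ASCII characters.
set txt "From a\nsm\u00f8rrebr\u00f8d\nFrom b\nplain\n"

# Character indices of every "From " found at the start of a line.
set charStarts {}
foreach pair [regexp -all -inline -indices -line {^From } $txt] {
    lappend charStarts [lindex $pair 0]
}

# Hypothetical helper: encode everything before charIdx and count the bytes.
proc charToByteOffset {txt charIdx enc} {
    set prefix [string range $txt 0 [expr {$charIdx - 1}]]
    return [string length [encoding convertto $enc $prefix]]
}

set byteStarts {}
foreach c $charStarts {
    lappend byteStarts [charToByteOffset $txt $c utf-8]
}

puts "char offsets: $charStarts"
puts "byte offsets: $byteStarts"
```

Here the second block starts at character 18 but at byte 20, because the first block contains two characters that each take two bytes in utf-8.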

Workaround

The easiest solution to this problem is to not use a flat file in the first place. Use, for example, an SQLite database instead. SQLite is designed to handle exactly this sort of problem transparently. And you get the added benefits of atomic, consistent, isolated, and durable I/O and an advanced query language. But using SQLite to store your information would obviate the original purpose of this page, so we will say no more on that topic.

Towards a solution

Alright, then let us do everything in bytes. To set up the index in bytes you can perhaps get away with using the TclX scanmatch command, and get the byte offsets from $matchInfo(offset). Very well, now we've got the indices in bytes, and we can seek. But now we cannot easily read!
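If TclX is not at hand, a byte index can also be sketched in plain Tcl by reading the file in binary mode and running regexp over the raw bytes, since every byte then counts as one character. The proc name buildByteIndex and the default pattern are hypothetical, and the sketch assumes an ASCII-compatible encoding (such as utf-8 or iso8859-1), so that the delimiter looks the same in bytes as in text. Like the approach above, it reads the whole file into memory.

```tcl
# Return a list of {byteOffset byteLength} pairs, one per block.
# A block starts at every line matching $pattern and runs to the next
# block start (or to the end of the file).
proc buildByteIndex {path {pattern {^From }}} {
    set fd [open $path r]
    # Binary mode: no encoding conversion, no eol translation.
    fconfigure $fd -translation binary
    set data [read $fd]
    close $fd

    # In binary data one char is one byte, so these indices are byte offsets.
    set starts {}
    foreach pair [regexp -all -inline -indices -line -- $pattern $data] {
        lappend starts [lindex $pair 0]
    }

    set index {}
    set total [string length $data]
    set n [llength $starts]
    for {set i 0} {$i < $n} {incr i} {
        set start [lindex $starts $i]
        if {$i + 1 < $n} {
            set next [lindex $starts [expr {$i + 1}]]
        } else {
            set next $total
        }
        lappend index [list $start [expr {$next - $start}]]
    }
    return $index
}
```

Calling buildByteIndex on an mbox-like file gives pairs that can be fed directly to seek and to a byte-counted read.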

In order to read a specified number of bytes, my first solution was a trick with a translation pipe using the TclX command pipe: first read from the file in binary mode, then put it into the pipe in raw mode, and finally pick it out of the pipe with the appropriate encoding. E.g.

    package require Tclx
    set fd [open $bigFlatFile r]
    # We know this file is utf-8 encoded, but we want to read a 
    # certain number of bytes, not chars...
    fconfigure $fd -encoding binary
    pipe out in
    fconfigure $in -encoding binary -blocking 0 -buffering none
    fconfigure $out -encoding utf-8 -blocking 0 -buffering none
    seek $fd $offset
    puts $in [read $fd $numBytes]
    # Keep the decoded text; -nonewline drops the newline added by puts
    set txt [read -nonewline $out]
    close $fd
    close $in 
    close $out

Unfortunately, on big chunks of text (more than 8192 bytes), there seems to be a bug in pipe that defeats this solution: the Tcl interpreter simply hangs.

In any case, Lars H pointed out that this could be done in a much cleaner way using encoding. Here is the final solution (so far):

 # This proc is supposed to work just like [read $fileHandle $numChars],
 # except that the size of the chunk to read is specified in bytes, not in
 # chars.  This is useful in connection with [seek] and [tell] which always
 # measure in bytes.  The proc is supposed to respect the fileHandle's
 # configuration w.r.t. encoding, but it will not respect the configuration 
 # w.r.t. eol convention, I think.
 proc readBytes { fileHandle numBytes } {
     # Record the original configuration:
     set enc [fconfigure $fileHandle -encoding]
     # Special treatment of encoding "binary", since this encoding is not
     # accepted by [encoding convertfrom].  But this case is trivial:
     if { $enc eq "binary" } {
         return [read $fileHandle $numBytes]
     }
     # We are going to reconfigure the channel.  If anything goes wrong, at
     # least we should restore the original configuration, hence the catch:
     if { [catch {

         # Configure for binary read:
         fconfigure $fileHandle -encoding binary
         set binaryData [read $fileHandle $numBytes]
         set txt [encoding convertfrom $enc $binaryData]
         # And restore the original configuration:
         fconfigure $fileHandle -encoding $enc
                 
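     } err] } {
         # Save the error details before the restoring fconfigure can disturb them:
         set savedInfo $::errorInfo
         set savedCode $::errorCode
         fconfigure $fileHandle -encoding $enc
         error $err $savedInfo $savedCode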
     } else {
         return $txt
     }
 }
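As a usage illustration, here is a self-contained sketch of the whole seek-and-read cycle. It does not call readBytes itself but uses a compact, hypothetical helper readBlock built on the same idea (read raw bytes, then decode with encoding convertfrom). The file name and the offsets are made up for the demonstration; in practice the offsets would come from the index.

```tcl
# Hypothetical helper: read numBytes bytes starting at byte offset
# and decode them from the given encoding.
proc readBlock {path offset numBytes enc} {
    set fd [open $path r]
    fconfigure $fd -translation binary
    seek $fd $offset
    set bytes [read $fd $numBytes]
    close $fd
    return [encoding convertfrom $enc $bytes]
}

# Write a small utf-8 sample file with two blocks.
set path readBlock_demo.txt
set fd [open $path w]
fconfigure $fd -translation binary
puts -nonewline $fd [encoding convertto utf-8 \
    "From a\nsm\u00f8rrebr\u00f8d\nFrom b\nplain\n"]
close $fd

# Byte offsets and lengths as a byte index would record them.
set first  [readBlock $path 0 20 utf-8]
set second [readBlock $path 20 13 utf-8]
file delete $path

puts "first block: [string length $first] chars read from 20 bytes"
puts -nonewline $second
```

The first block comes back as 18 characters although 20 bytes were read, which is the discrepancy that a char-counted read after a byte-counted seek would trip over.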

Older remark: it would be really nice (and quite logical, in view of the functionality provided by seek and tell) if read could accept a -bytes flag. The only thing needed is a convention about how to handle the situation where the number of bytes does not constitute a whole number of chars. One convention could be: finish the char in that case. Another convention: discard the incomplete char. Or finally, just leave the fractional char as binary debris --- it is up to the caller to make sure this does not happen, and in examples like the one above this comes about naturally.


See Also