Read-only memory-mapped files

The "mmap" command for memory mapping is now part of the mvector package in CritLib [L1 ]. [Should mvector and CritLib have their own pages?]


This demo illustrates an experimental "mmap" command and datatype for Tcl, which provides efficient access to files by mapping them into memory. Using data in such a way takes maximum advantage of virtual memory (VM).

The interface is almost hidden; it's very easy to forget it exists. It's also easy to cause "shimmering" (Tcl repeatedly discarding and rebuilding a value's internal representation) and lose most of the speed advantage.

What you need is an open file in the form of a Tcl "channel" descriptor. Sockets cannot be used; the underlying file descriptor must be mappable (that is, whatever the OS considers acceptable).

After that, you only need to deal with two call variations to "mmap":

    set filesize [mmap $fd]

This returns the file size and, as a side effect, sets up the mapping as the internal half of the dual representation of the value in $fd.

The second call is the one to fetch a particular byte range:

    set offset 123
    set length 17
    set data [mmap $fd $offset $length]

The result is a 17-byte string (byte array, to be precise), as found at offset 123 in the file.

Trivial, eh? You can throw away the "read" command now :)

Actually, things are slightly more complex. Accessing data in this way is extremely efficient, but setting up and tearing down a mapping is not cheap, so you need to avoid losing the dual-object representation. A simple way to do this is to store the channel identifier in a variable and never use that variable in any other way than as the first arg to "mmap". And always call mmap with that same variable.

Keep in mind that mmap is for binary data - it's 100% ignorant of Unicode. Even end-of-line translations and end-of-file markers are totally ignored.
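Because mmap hands back raw bytes, any text decoding has to be done explicitly by the caller. The following is only a minimal sketch of that step: the byte string is built by hand as a stand-in for the result of an mmap call, so it runs without the extension, and the choice of UTF-8 and CRLF line endings is an assumption made for illustration.

```tcl
# Stand-in for the result of [mmap $fd $offset $length]: raw UTF-8 bytes
# with a CRLF line ending, built by hand so the example runs without mmap.
set raw [encoding convertto utf-8 "caf\u00e9\r\n"]

# This is a byte count, not a character count: the accented letter
# occupies two bytes in UTF-8.
set nbytes [string length $raw]          ;# 7

# Decode explicitly, then normalise the line ending ourselves, since
# no channel translation has been applied to the bytes.
set text [encoding convertfrom utf-8 $raw]
set text [string map [list "\r\n" "\n"] $text]
set nchars [string length $text]         ;# 5
```

Note that a byte range picked at an arbitrary offset may cut a multi-byte character in half, so the offsets have to fall on character boundaries before decoding.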

The "mmap" command is special in that "mvec", if present, is special-cased to work with it (see "mvec.README"). The combination mvec+mmap makes it possible to implement high-performance vector storage. This is the "vkit" project, described elsewhere, and is a work in progress.

When compiled as a C extension, "mmap" is part of JOLT.

JOLT stands for JC's Own Little Toolbox - it's a context I use to try out new things, especially when mixing Tcl and C and exploring possibilities. This demo includes the C code for mmap (and a few more pieces), with a binary build (just Linux for now). The code will be moved into a context and namespace of its own when the dust has settled a bit more.

Below are some examples. One of the nice things about "mmap" is that it can be faked entirely in pure Tcl (by reading the entire file on first access). That is why the output below consists of two sections - one with C-level support, and one in pure Tcl.
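The pure-Tcl approach can be sketched roughly as follows. Only the proc signature, the global cache array name and the "cache a full copy" idea are visible in the hex dump further down; everything else here (the argument handling, the error message, switching the channel to binary mode) is an assumption, so treat this as an illustration rather than the code shipped with JOLT. It needs Tcl 8.5 or later for lassign.

```tcl
# Hypothetical pure-Tcl stand-in for mmap; a sketch, not the JOLT code.
# It is only defined when no real "mmap" command is available.
if {[info commands mmap] eq ""} {
    proc mmap {fd args} {
        upvar #0 _mmap_data($fd) data
        # Cache a full copy of the file on first access to simulate a mapping.
        if {![info exists data]} {
            # Assumption: force binary mode so no EOL or encoding
            # translation is applied. This changes the channel's settings.
            fconfigure $fd -translation binary
            seek $fd 0 start
            set data [read $fd]
        }
        switch -- [llength $args] {
            0 {
                # [mmap $fd] returns the file size in bytes.
                return [string length $data]
            }
            2 {
                # [mmap $fd offset length] returns that byte range.
                lassign $args offset length
                return [string range $data $offset [expr {$offset + $length - 1}]]
            }
            default {
                error "usage: mmap fd ?offset length?"
            }
        }
    }
}
```

With this in place, [mmap $fd] and [mmap $fd $offset $length] behave like the two call forms described earlier, at the cost of holding the whole file in memory. The sketch never clears its cache, so a channel name that is reused after a close would return stale data.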

Don't put too much emphasis on the speed differences shown below. The key performance boost comes from the combination with mvec, which can then use data on file as efficiently as data in memory, with no copying or buffering overhead.

Script:


  puts "version = [package require jolt]"

  if {[info commands nop] != ""} {
    puts "10 nops = [time {nop;nop;nop;nop;nop;nop;nop;nop;nop;nop} 10000]"
  }

  proc mmap_try {} {
    set fd [open [info script]]
    puts "mmap size = [mmap $fd]"
    puts "mmap data = [mmap $fd 261 8]"

    puts "mmap call = [time {mmap $fd} 10000]"

    set a abc
    puts "mmap convert = [time {mmap $fd; string length $a} 10000]"
    puts "mmap shimmer = [time {mmap $fd; string length $fd} 1000]"

    if {[info commands hexdump] != ""} {
      puts [hexdump [mmap $fd 128 256]]
    }
    close $fd
  }

Output (SuSE 7.1 Linux, PIII/650) - WITH MMAP IN C:


  version = 2001.11.06.194312
  10 nops = 7 microseconds per iteration

  MMAP
  ====

  mmap size = 14829
  mmap data = fallback
  mmap call = 2 microseconds per iteration
  mmap convert = 3 microseconds per iteration
  mmap shimmer = 21 microseconds per iteration
  00000000 6a6f6c74 3a206e6f 20636f6d 70696c65 *jolt: no compile*
  00000010 64206578 74656e73 696f6e22 207d0a20 *d extension" }. *
  00000020 20706163 6b616765 2070726f 76696465 * package provide*
  00000030 206a6f6c 7420302e 300a7d0a 0a696620 * jolt 0.0.}..if *
  00000040 7b5b696e 666f2063 6f6d6d61 6e647320 *{[info commands *
  00000050 6d6d6170 5d203d3d 2022227d 207b0a20 *mmap] == ""} {. *
  00000060 20696620 7b247463 6c5f696e 74657261 * if {$tcl_intera*
  00000070 63746976 657d207b 20707574 7320226d *ctive} { puts "m*
  00000080 6d61703a 2066616c 6c626163 6b20746f *map: fallback to*
  00000090 2054636c 22207d0a 0a202070 726f6320 * Tcl" }..  proc *
  000000a0 6d6d6170 207b6664 20617267 737d207b *mmap {fd args} {*
  000000b0 0a202020 20757076 61722023 30205f6d *.    upvar #0 _m*
  000000c0 6d61705f 64617461 28246664 29206461 *map_data($fd) da*
  000000d0 74610a20 20202023 20636163 68652061 *ta.    # cache a*
  000000e0 2066756c 6c20636f 7079206f 66207468 * full copy of th*
  000000f0 65206669 6c652074 6f207369 6d756c61 *e file to simula*

Output (SuSE 7.1 Linux, PIII/650) - PURE TCL RUN:


  version = 0.0
  10 nops = 7 microseconds per iteration

  MMAP
  ====

  mmap size = 14829
  mmap data = fallback
  mmap call = 21 microseconds per iteration
  mmap convert = 22 microseconds per iteration
  mmap shimmer = 22 microseconds per iteration
  00000000 6a6f6c74 3a206e6f 20636f6d 70696c65 *jolt: no compile*
  00000010 64206578 74656e73 696f6e22 207d0a20 *d extension" }. *
  00000020 20706163 6b616765 2070726f 76696465 * package provide*
  00000030 206a6f6c 7420302e 300a7d0a 0a696620 * jolt 0.0.}..if *
  00000040 7b5b696e 666f2063 6f6d6d61 6e647320 *{[info commands *
  00000050 6d6d6170 5d203d3d 2022227d 207b0a20 *mmap] == ""} {. *
  00000060 20696620 7b247463 6c5f696e 74657261 * if {$tcl_intera*
  00000070 63746976 657d207b 20707574 7320226d *ctive} { puts "m*
  00000080 6d61703a 2066616c 6c626163 6b20746f *map: fallback to*
  00000090 2054636c 22207d0a 0a202070 726f6320 * Tcl" }..  proc *
  000000a0 6d6d6170 207b6664 20617267 737d207b *mmap {fd args} {*
  000000b0 0a202020 20757076 61722023 30205f6d *.    upvar #0 _m*
  000000c0 6d61705f 64617461 28246664 29206461 *map_data($fd) da*
  000000d0 74610a20 20202023 20636163 68652061 *ta.    # cache a*
  000000e0 2066756c 6c20636f 7079206f 66207468 * full copy of th*
  000000f0 65206669 6c652074 6f207369 6d756c61 *e file to simula*

I wrote a shared-memory extension for Tcl, which is listed on my home page. I'm also interested in writing a front-end for libMM, by Ralf Engelschall. -davidw

See Also