'''Compact data storage''' presents techniques for conserving space in memory
and on disk.  


** See Also **

   [Memory costs with Tcl]:   Some hard results on how many bytes Tcl uses for various kinds of things.

   [Measuring memory usage]:   Some scripts that can be used for measuring how much memory is allocated by a sequence of commands.  For example the '''growth''' and '''Delta''' procedures used below can be found on that page.


** Avoid That String Representation **

Although [EIAS%|%every value is a string], Tcl employs some tricks under the
hood to only generate that string when necessary.  A `[list]`, for example, is
a string, but one can create a list with `[list]`, and then `[lappend]` items
to the list without ever triggering Tcl to generate the string for the list.
If the list is large, this can save a drastic amount of memory.

The thing to remember is that every value is a string, but that string is not
generated until it is actually used.  With a little care, for a data structure
like a list, it may ''never'' get generated.


** 100000 random digits **

[Lars H] 2002-11-11:

''100000 random digits'' is the name of a classical table of random numbers -- depending on one's point of view, it is either 
one of the most boring books ever published, or one of the least boring books ever published 
(since you never can tell what is going to happen next) -- but it also makes a nice example.
Suppose you want to make a list of 100000 (pseudo-)random digits. 
The obvious way of doing this is

======
set L [list]
for {set n 0} {$n < 100000} {incr n} {
    lappend L [expr {int(floor(rand()*10))}]
}
======

and the test

======
Delta [growth {
    set L [list]
    for {set n 0} {$n < 100000} {incr n} {
        lappend L [expr {int(floor(rand()*10))}]
    }
}]
======

tells you that this causes Tcl to allocate 3207168 (or thereabout) bytes. 
However, since equal values appear over and over again in this list, it is more efficient 
to make sure that they really are ''the same'' (as objects) values. One way of doing this is

======
set L [list]
set digits [list]
for {set n 0} {$n<10} {incr n} {lappend digits $n}
for {set n 0} {$n<100000} {incr n} {
    lappend L [lindex $digits [expr {int(floor(rand()*10))}]]
}
======

The test

======
Delta [growth {
    set L [list]
    set digits [list]
    for {set n 0} {$n < 10} {incr n} {lappend digits $n}
    for {set n 0} {$n < 100000} {incr n} {
        lappend L [lindex $digits [expr {int(floor(rand()*10))}]]
    }
}]
======

returns (for me) a mere 798720 bytes. The number of correct digits in these measurements is probably rather low, but the difference in magnitude is quite significant.


** Helpful extensions **

Extensions which can help with storing data compactly in memory include:

   [Metakit]:   

   [Speed Tables]:   Faster table access than generic [dbms%|%database management systems].  A really slick and competent extension.  If the task is processing large volumes of [Vector or matrix algebra in Tcl%|%tabular data] in near-real time, this is the extension.

   [NAP]:   

   [TArray]:   

   [narray]:   

   [Bita]:   

   [TNA]:   (how does this relate to "storing data compactly"?  The linked page is a documentation generator.)

   [ycl eav]:   Create ad-hoc hierarchies of records, each representing an object with arbitrary attributes.  Array-valued attributes make it convenient to iterate through tabular [Vector or matrix algebra in Tcl%|%data].  All data is stored in sqlite for compact storage.  Table data can be transformed in-place.


** Metakit **

[jcw] 2002-11-11: You could use [Metakit] - it's a database, but it also works with in memory data.  
Here's an example to show how it adaptively stores ints in 1..32 bits per value:

======
package require Mk4tcl

foreach x {2 4 10 100 200 30000 40000} {
    mk::file open db
    mk::view layout db.v i:I
  
    puts "100,000 ints in range \[0..$x) take:"
  
    for {set i 0} {$i < 100000} {incr i} {
        mk::row append db.v i [expr {int(rand()*$x)}]
    }
  
    set fd [open tempfile w]
    mk::file save db $fd
    close $fd
  
    puts "   [file size tempfile] bytes"
  
    file delete tempfile
    mk::file close db
}
======

The output is:

======none
100,000 ints in range [0..2) take:
  12543 bytes
100,000 ints in range [0..4) take:
  25045 bytes
100,000 ints in range [0..10) take:
  50045 bytes
100,000 ints in range [0..100) take:
  100045 bytes
100,000 ints in range [0..200) take:
  200045 bytes
100,000 ints in range [0..30000) take:
  200045 bytes
100,000 ints in range [0..40000) take:
  400045 bytes
======

FWIW, writing the data to file is done just to find out how much space it consumes.  
In-memory consumption will add a few Kb (plus a per-row overhead of 0.1% ...).

[Lars H]: In this case, it is an absolutely essential piece of information. 
Being compact on file is, if not easy, so at least nothing that is harder in Tcl 
than it is in other languages. Being compact in memory is what is important ''here.''
The need for a standard extension that provides near-optimal (1 bit of useful information costs 1+epsilon bits of memory) storage of information in memory is something that I have felt a couple of times during the last year. From the looks of the above figures, it looks like [Metakit] could fill this need. Excellent! (Not that it means we shouldn't care about being efficient when storing ordinary Tcl objects too, though.)

----

[GPS]: For [NexTk]'s text widget I found that using a list of lists was easiest to program for, but drastically increased memory requirements.  My initial prototype used something like: 

======
$path lines [list [list line1Width [list [list contextId "text with context"] [list anotherContext "more text"]]]].
======

Where a `contextId` stores the font and color for text, or an image. 

My eventual solution was to use `[binary]` to store a serialized representation of each line, which is then built up as the list of lists by iteration of the binary output whenever a range of text is required, such as for display.  So, now the improved representation is currently like:

 
======
$path lines [list [list line1Width line1Height $binData] [list line2Width line2Height $binData2]];  
======


The memory requirements are drastically reduced now, and it's usably fast now.  The line1Width is used for the final scrollbar calulation, when accounting for wrapped text.  It's imperfect, but writing a text widget is very hard.  For perfection I'd have to iterate the entire text everytime I update the yview/scrollbar, and take into account each and every context and where it wraps, whereas the solution I chose is an approximation that is generally accurate.

<<categories>> Data Structure | Internals | Performance