Version 6 of Some observations on behind-the-scenes actions of Tcl

Updated 2008-02-24 15:26:04 by lars_h

2008-02-24

1. I keep a large collection of text data as a list in memory (lappend x {text...} etc)

2. I want to search this data.

Here's what happens.

  set match [lsearch -regexp $x {needle}]

-> memory usage of the Tcl process more than doubles (before: 80MB, after: 200MB)

  foreach k $x { if {[regexp -nocase {needle} $k]} {puts "match"} }

-> ditto

  foreach k $x { if {[regexp -nocase {needle} [list $k]]} {puts "match"} }

-> Eureka! Total memory usage stays at 80MB.

I'm still not quite sure what's going on; I guess it's about keeping lists 'pure'. I'm now consulting these pages:

list, shimmering, pure list.

6am EDIT: I think I almost get it now. regexp treats $x as a string and forces every element into a string representation AS WELL as a list representation. Sigh. lsearch is forcing a string representation of all elements. Seems unavoidable. I could also lappend items as strings, not lists, but I'm saving my data as Tcl source for various reasons, and of course { } is much cleaner in that case.

-hans

Lars H: Are you talking about the object getting a string representation, or getting a String internal representation? (They're not the same.) For

  foreach k $x { if {[regexp -nocase {needle} [list $k]]} {puts "match"} }

to make a difference, it seems it'd have to be the latter. In order to get a string rep for [list $k], one first needs the string rep of $k, but since [list $k] is not the same Tcl_Obj as $k, an intrep imposed upon [list $k] by regexp will not be retained in the elements of $x. That String intreps use 2 bytes for every character is consistent with a jump from 80MB to 200MB, if you have about 60M ASCII characters without intrep in your big list (60M characters at 2 bytes each is the extra 120MB).
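One way to check which of the two is happening is to look at the Tcl_Obj directly. The following is only a sketch: it assumes Tcl 8.6 or later, where the debugging command ::tcl::unsupported::representation exists (it is explicitly unsupported and its output format may change), and showRep and the sample data are hypothetical names made up for illustration.

```tcl
# Hypothetical helper: print what Tcl reports about a value's
# representations, if the (unsupported) debugging command is present.
proc showRep {label value} {
    if {[llength [info commands ::tcl::unsupported::representation]]} {
        puts "$label: [::tcl::unsupported::representation $value]"
    } else {
        puts "$label: (representation command not available)"
    }
}

# Made-up sample data standing in for the big list.
set x [list "some haystack text" "another needle here"]

# Inspect one element before and after regexp has looked at it.
showRep "before regexp" [lindex $x 1]
regexp -nocase {needle} [lindex $x 1]
showRep "after regexp" [lindex $x 1]
```

Comparing the two reports should show whether the element itself picked up a new internal representation; running the same check with [list [lindex $x 1]] as the regexp argument would be the corresponding test for the workaround.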

I once had a similar difficulty, but the other way round, with a program that dumped a list of lists of integers to a text file. In order to generate a stringrep for a list of integers, Tcl first had to generate stringreps for all the integers, and that similarly doubled the memory usage during the final dump. In that case I solved it by feeding each integer through format before I did anything stringy to it, since that gave a Tcl_Obj that wasn't shared with the big list of lists of integers...
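The format workaround could look roughly like the sketch below. It is not the original program: formatRow, dumpRows and the sample data are hypothetical names, and it assumes every element is an integer that %d accepts.

```tcl
# Hypothetical helper: build the text for one row of integers.
proc formatRow {row} {
    set out {}
    foreach n $row {
        # format returns a new Tcl_Obj, so the string rep is generated
        # for the copy rather than for the element held in the big list.
        lappend out [format %d $n]
    }
    return [join $out " "]
}

# Hypothetical helper: dump a list of lists of integers, one row per line.
proc dumpRows {chan rows} {
    foreach row $rows {
        puts $chan [formatRow $row]
    }
}

# Made-up sample data.
set data [list [list 1 2 3] [list 40 50]]
dumpRows stdout $data
```

The copies in out are thrown away after each row, so the extra strings only exist for one row at a time rather than for the whole list.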


See also: