Complex data structures

Summary

Arjen Markus Quite often the question arises: how to model complex data structures? As Tcl allows various solutions, it seemed to me a good idea to start a new page on this subject.

Problems

escargo 2003-01-17: Just for the sake of illustration, could you give some examples of these complex data structures? Are we trying to talk about AVL trees, red-black trees, tries, DAGs, petri nets, bags, or what all else?

AM 2003-01-24: When I set up this page, I was thinking of C structs, Fortran arrays and other "passive" data structures. Note that the Tcllib contains a number of modules for managing data structures - matrix, tree, ...

awalczak: I deal with tcl code generated by idl2tcl. In my case IDL code is modeling struct with other structs and collections inside. I would be very thankfull for example how to build in TCL a big struct-like construct step-by-step (having smaller units defined earlier and combining them together before return form a function).

RBTree is a Tcl extension that implements Red-Black trees, having features of an ordered list and direct access as array. - RS found explanations here (link broken 2013-05-12), but from there I gather it's just well-balanced binary trees, and on insertion are (by possible rotation) brought back to well-balancedness.

DRH: Complex data structures can be modeled using an in-memory SQLite database. (Or use a standard disk-resident database if persistence is important.) Think of each table as a class and each row as a structure instance or object. With this in mind, it is trivial to build arrays, queues, stacks, or graphs of objects.

escargo 2005-09-23: Possible, yes, but not necessarily efficient. If what I want is a Tclish version of Judy arrays tries, or a PATRICIA tree , I lose all the specialized performance advantages if I have to build them out of the heavy plumbing of an in-memory database.

DRH: escargo seems to confuse abstract data structures (ex: arrays, bags, DAGs, queues, etc) with their implementations (red/black trees, tries, hash tables, linked lists, etc). The Zen of Tcl is that you don't worry about the details of implementation, you focus instead on the higher-level aspects of the problem. An in-memory SQL database provides O(logN) search and O(N) traversal of data using a variety of underlying algorithms. The specific algorithms used (B-tree, hashing, linked lists, etc) are irrelevant to the programmer using SQL. All she cares about is that the results come up quickly. And an SQL database provides the same asymptotic performance as the exotic data structures mentioned by escargo above. The exotic data structures may be faster by a constant factor in certain scenarios. But generally speaking, that constant is measured in percent. And the speedup only applies for narrowly defined problems. If performance is so critical that I am willing to write intricate, customized code to achieve a speedup of 20 or 30%, then I'm going to code the problem in C, not in TCL. For 95% of the things that I and most other people do, where ease and generality of implementation and simplicity of maintenance are the driving concerns, then a generalized abstraction such TCL's built-in arrays or an in-memory SQL database is a far better choice than trying to employ an exotic data storage and retrieval algorithm.

escargo: I can't program with abstract data structures; I need to use their implementations. (And I do understand the differences between policy and mechanism, too.) I never said that the data structures would be implemented in Tcl. I fully expect that the implementations would be in C (or C++) with a wrapper that allows Tcl to use them. An in-memory SQL database seems far more exotic to me than Tcl access to an existing C library for Judy arrays, etc. I certainly believe that the KISS principle applies. That constant k that differentiates performance between an ideal implementation of a complex data structure and one layered on an in-memory SQL database can be significant.

And there are times when I have a program that is in the 5% where such things matter. (Also, I'm aware that writing a simpler but naive implementation and profiling it to see where it's slow is better than picking a complex algorithm just to speed up part of a program guessing that you know ahead of time where the bottlenecks will be.)

L: I agree wholeheartedly with escargo. It is nice in many cases to ignore the "details of implementation and focus on the higher-level aspects of the problem". However, sometimes this is impossible. Example: I have a 15GB text file that I want to index by creating a trie of all the words in it to record at which offset in the file this word first occurs. Because I know that the same word will be repeated a million times

kanryu 2008-06-08: huddle is a complex container with abstract container algorithms (such as dict and list) which can contain multirank nodes.

Ideas for Implementing Complex Data Structures

Ordered roughly after how "exotic" they are, taking into account requirements, user base, and history:

  • Use arrays (associative arrays allow indices that are strings, rather than mere numbers or tuples of numbers)
  • Use dicts (quite similar functionally to arrays)
  • Use tagged lists (see explanation below)
  • Use namespaces
  • Use key–value lists (essentially the same as dicts, but one avoids using the dict command; mostly of interest for backwards compatibility)
  • Use struct::tree (in tcllib)
  • Use TclX keyed lists
  • Use one of the object-oriented extensions
  • Write a specialized C function to implement a new kind of Tcl object
  • Use SWIG to create glue between some commonly used C library and Tcl
  • Use Records - Larry Smith 6/8/2008
  • Use huddle (or tagged list, dict, or other containers)

object oriented extensions

When you have a bunch of related data, collect it in an object (e.g. separate fields are separate instance variables).

Larry Smith 2008-06-08: Lost and Most are very compact, likely solutions to this problem since their overhead for object-orientation is minimal.

  • Pro: Natural and intuitive to anybody who has done any OO.
  • Pro: The data structure can own resources, as opposed to only store values.
  • Pro (but also a con): the data structure can mutate.
  • Con: Not really a solution by itself (unless all one wants is a C-style struct), but rather a framework within which solutions can be more elegantly implemented.
  • Con: At the moment garbage collection can't be applied to these objects so you need to take care of destruction yourself.
  • Con: Objects are not first-class data items.
  • Con: Requires an external package or at least Tcl 8.6a2 (for TclOO).

TclX keyed lists

  • Pro: a keyed list can handle nested data structures. For example, the keyed list "{ID 106} {NAME {{FIRST Frank} {LAST Zappa}}}" represents the following hierarchical structure:
ID = 106
NAME:
   FIRST = Frank
   LAST = Zappa

[So, what commands create this keyed list?]

  • Pro: keyed lists are first-class data items. They can be passed in and out of procs like any other list or string.
  • Pro: it is possible to query the existence and value of a key in a single command, without the speed drawback of catch.
  • Con: the access syntax is more verbose than for arrays. [Seems similar to dict, however]
  • Con: access to nested elements requires multiple commands.
  • Con: requires an external package.

The thing I like best about keyedlist is how it uses the dot notation when defining the key. You can build nested structures with a single command. The previous example would be create like this.

keylset mykeylist ID 106 NAME.FIRST Frank NAME.LAST Zappa
set mykeylist
{ID 106} {NAME {{FIRST Frank} {LAST Zappa}}}

You retrieve the same way.
keylget mykeylist NAME.LAST
Zappa

I definitely prefer keyedlist over dict when dealing with nest structures.

arrays

Separate fields are separate entries in an array (a data structure which has been in the standard Tcl distribution for quite some time).

  • Pro: by one name, you gather all the pieces of information
set personal_data($name,birthday) "..."
set personal_data($name,telnum) "..."
...
  • Pro: each piece of data resides in a separate variable, so access is easy.
  • Pro: traces can be set on individual elements, which makes it possible to (e.g.) use array elements as Tk -textvariables.
  • Pro: implemented internally by hash tables, hence access is fast.
  • Con: arrays are not "first-class" data in Tcl, you will have to use upvar or array get and set commands to transfer them between procedures (or provide namespace qualification, but then the arrays cannot be local).
  • Con: essentially only one level of structuring. Otherwise you would have to use structured indices to store the values.

AIN (Array In Namespace) is a package for this kind of data structure, providing commands for e.g. copying an array and generating names for arrays.

dicts

dicts became available in Tcl 8.5.

  • Pro: in one value, you gather all the pieces of information
     set personal_data [dict create name [dict create FIRST "Frank" LAST "Zappa"] ID 106]
     dict set personal_data telnum $phone
     dict get personal_data name FIRST ; # => Frank
     ...
  • Pro: easy access to all the data pieces.
  • Pro: implemented internally by hash tables, hence access is fast.
  • Pro: dicts are "first-class" data items in Tcl, you can freely pass any dict value to any proc
  • Pro: support for multiple levels of structuring, by putting dicts in dicts.
  • Pro: the order of elements is (as of Tcl 8.5.0, although not in the alpha-releases) preserved; key are listed in the order they were added to the dictionary.
  • Con: Comparatively new (Tcl 8.5) addition to the language, but also available for 8.4 as a compiled extension [link, anyone?], and not too hard to emulate in pure Tcl either -- see forward-compatible dict
  • Con: need to use [dict] subcommands to manipulate contents, unlike with arrays where contents are manipulated using "ordinary" commands. Sometimes elements have to be unpacked, manipulated, then placed back into the dict. Example: there is no [dict lset].
  • Con: nested dicts aren't "natively" supported by all [dict] subcommands, so they sometimes have to be unpacked, manipulated, then placed back into the dict (similar to the previous "con"). Example: [dict incr] does not allow a key "path" whereas [dict get] does.
  • Minor con: although dictionaries implement the theoretical concept of a "map" (like lists do for tuples), they do not provide a canonical representation, since the order of keys depends on how the dictionary was constructed (two pure dictionaries with the same keys and mapping of keys to values may still have distinct string representations). An [lsort -index -stride] may however be used to overcome this.

Tagged lists

A tagged list is a list where the first element is a "type name" and subsequent entries contain data. Each field has a separate position in the list, but the meaning of a list element may depend on the "type name"; in C terms, the data type of a tagged list is a union of structs. [So, where does one find an implementation of this data structure? Well, data is code shows some things that can be done with it. If that is too hand-on, then you might prefer to use something like struct::tree instead.]

  • Pro: tagged lists {NAME ... {BIRTHDAY ...} {TELNUM ...} ...} are simple to use, they are first-class data (they are simply a form of lists).
  • Pro: tagged lists can have any level of substructure, and are especially suited for tree-like data structures.
  • Pro: the format lends itself to the data is code style of processing.
  • Con: replacing values deep inside such a structure can be a nuisance. (But this may force you to use constructor and accessor procs to do this - nice modular design).
  • Con: because of the need for linear or binary searching, tagged lists will be slower than arrays for large data structures.

The comparison on this page is in a sense unfair to tagged lists, since it focuses on table-style data structures. While tagged lists can be used to implement such things, their primary use-case is rather for tree-like data structures. For an application of this, see A little XML parser.

The nice thing about this wiki is that it is relatively simple to add more content. If the existing comparison seeems unfair, feel free to add examples that demonstrate the additional functionality. It appears, to this reader, that the page is not intent on being an advocate for one thing over another, but instead providing the means to describe a data structure's benefits and detriments, so that someone can more easily determine when the appropriate times are to make use of them.

namespaces

When you have a bunch of related data, collect it as variables in a particular namespace. Separate data sets can have separate namespaces.

  • Pro: each field is a separate variable; no special syntax required.
  • Pro: implemented internally with hash tables, for O(1) access.
  • Pro: the data structure can nested by nesting namespaces (children of a node resides in children of the namespace of that node).
  • Pro: traces can be set on individual elements, which makes it possible to (e.g.) use fields as Tk -textvariables.
  • Con: namespaces are not first-class data; they can only be passed by reference (name of namespace).
  • Con: if you need many namespaces it can be difficult to maintain the application.
  • Con: variables will exist independent of any procedure, so they have a non-local scope. Again, maintenance problems will arise.
  • Con: variables that are in some, but not the current, namespace can be awkward to access.

anonymous: What do you mean by namespaces here? I can think of two uses: One use is simply using namespaces as a way to group a bunch of variables, and the other would include procs, to create something that acts more like an object. Perhaps both approaches need entries?

key–value lists

A key–value list is a list where every even index item is a "key" (arbitrary identifier string) and the following (odd index) item is the value associated with that key.

  • Pro: a key–value list can handle nested data structures. For example, the keyed list "ID 106 NAME {FIRST Frank LAST Zappa}" represents the following hierarchical structure:
ID = 106
NAME:
   FIRST = Frank
   LAST = Zappa
  • Pro: key–value lists are first-class data items. They can be passed in and out of procs like any other list or string.
  • Pro: key–value lists do not require any Tcl extensions.
  • Pro: The [array get] and [array set] subcommands use key–value lists natively.
  • Pro: Forward compatible with dicts (key–value lists have the same string representation as the corresponding dict).
  • Pro: ordering of elements is preserved (although not if you get it from array get), which is useful for some applications.
  • Con: before Tcl 8.5, there were no commands for directly accessing data stored in key–value lists.

The most convenient way of accessing information in a key–value list is to unroll it into an array. (In practice, key–value lists spend a lot of time unrolled into arrays.) Typical code might look like

proc update_KVL {KVL some more args} {
    array set A $KVL
    # Do the update
    return [array get A]
}

It is often convenient to use a key–value list for storing "the settings". Consider

proc do_something {settings_KVL args} {
    # Store defaults into array A
    array set A $settings_KVL; # Override with specific settings
    # Use data in A when doing something
}

jcw 2003-05-12: FWIW, there's a little "ihash" package in critlib coded in C (and I have a minimal pure Tcl script for doing the same somewhere), which makes key-value lists (which I used to call interleaved lists) in the same performance order as arrays, because it maintains a hash table alongside. This is transparent, but you have to be careful to avoid shimmering or things will slow down as the hash table gets rebuilt way too often. Probably also some more info, search for ihash on the wiki.


Of course it is possible to combine these solutions - to get the best (or worst) out of them.

Discussion

Ken: Are there other ways to create C-like structure besides TclX keyed lists; for example like the "record" below indicated by Brett? Regarding the solution you've given, can you kindly explain how to get it working cause when i tried the code, it failed to recognize the package. And when i tried to use TclX keyed lists, tclsh8.4 fail to recognize the commands. How to get TclX working under Tcl? And is TclX part of Tcl? if yes, why is it separate? Sorry new to this language.

Lars H: Tclx is a package. You need to be certain to install Tclx as a package, then code a package require Tclx to load it. But dicts are probably a better choice than keyed lists nowadays.''

Ken: Another question: How do i create an array of structures? Is it there any way of doing it besides manually naming it directly, for example:

set ArrayStructure(0.value) 1
set ArrayStructure(1.value) 2

Lars H: Depends on how you implement your "structures". Maybe C struct is Tcl! will give you some ideas? But if you're new to Tcl then you're likely to be thinking in too C-ish terms about these matters. Perhaps a list of structs would suffice as good? Also take a look at the struct package.


RS 2002-07-05: In the Tcl chatroom we discussed the potential of nested lists for data storage, given the new "multi-dimensional" lindex and lset starting from 8.4. Steve Landers wished for structures so list elements could be addressed with meaningful strings rather than positional indices only. Here's a nice minimal solution that I came up with:

proc struct {name elements} {interp alias {} $name {} lsearch $elements}

#... used to define a "struct" (map from names to positions):
struct staff-> {firstname lastname birthdate phone}

#... used to abstract such details in coding accesses:
lset database $employeeID [staff-> phone] 123-4567890

Letting the struct name end in "->" is cute sugar. Note however that after it (and not before it) there must be whitespace.

jcw: Nother way out would be to use "staff:"

RS: Yep - see Implementing enumerated types for a two-way map that generates both "staff:" and "staff@" for the inverse direction.

Brett Schwarz: humm, I just found this page. I actually put a package in tcllib that does what RS is talking about (essentially). It is called 'record', and is located under the 'struct' module in tcllib. It is only in CVS now (as of Jan 2003), but should go into next tcllib release. Here is how you would use it:

package require struct

namespace import struct::*
 
# define records
record phones {
     home
     work
     cell
}

record addr {
    name
    street
    {country USA}
    {record phones phlist}
}

# instantiate
addr inst1
# assign members
inst1 configure -name Brett -street main
# get a values (returns list)
inst1 cget -name -street
# you also use the dot notation
set name [inst1.name]
inst1.street = Main
# the phlist is a nested record, and is accesses like this:
inst1 configure -phlist.home 425-555-1212
inst1.phlist.work = 206-555-1212
# following would return a list of the phones (home work cell)
set phones [inst1.phlist]

Note the 'country' member. I assigned it a default value of "USA". Anyways, you get the picture. See the docs in CVS for more info...


George Peter Staplin: I couldn't resist writing an example of doing that with my megapkg structure extension.

load ./structure.so

proc phone {} {
    set p [structure]
    # Initialize the phone record
    $p -home "" -work "" -cell ""
    return $p
}

proc addr {} {
    set addr [structure]
    $addr -name "" -street "" -country USA -phoneNumbers [phone]
    return $addr
}

set addr [addr]
# This prevents setting new keys by accident
lock.structure.creation $addr
$addr -name "John Doe" -street "Logan"

[$addr -phoneNumbers] -home 555-1234 -work 555-7812

puts "The person's name is [$addr -name].  He lives on [$addr -street].  His work number is [[$addr -phoneNumbers] -work]."

The - in a key is just some added sugar. Any data can be used for a key (even binary data).


escargo 2005-08-30: Are there any extensions or packages for implementing a trie in Tcl? (I didn't say tries because that's too easy to hit when searching.) A hashing Trie in Tcl


What other complex data structures are available? The tcllib distribution has graph, sets, matrix, pool, prioqueue, queue, record, skiplist, stack, and tree.