Fitting a new Rear-End to Arrays

This page written by DKF


What is good about Tcl's arrays? Their compact syntax for common operations.

What is not so good about Tcl's arrays? You have to use a hash-table for the mapping from keys to values.


LV For the more newbie reader, why is the hash table not so good?


Would that make things extra cool? If we could take arrays and put a new back-end on them based on lists, dictionaries, or even something more exotic. There have been attempts to do this in the past (BLT's vectors spring to mind here)

Why do this to arrays instead of putting magic conversion on values? We'd have a proper named location to store the metadata describing how the array is implemented.

What is going to be hard? Well, traces are definitely tricky, as is upvar and friends, as they both (currently) require that your mapping mechanism hands out references to a Var structure, and that's not something that you can really map nicely onto a Tcl_Obj as it is an updatable structure.


What Operations are Needed?

get
Retrieve a value from the mapping given a key. This operation should succeed if the key was one returned by the list keys operation below; behaviour for other keys is undefined in general.
set
Update or insert a value into the mapping given a key
unset
Update the mapping so as to remove an element
list keys
Get the list of keys that map to data values. Note that an array might be permitted to support other keys with "magic" names, but this operation should only list the keys that map in a straight-forward fashion.
serialize
Convert entire array to a string (preferably in the order listed by the list keys operation.)
deserialize
Convert string to array

[...]


RS Don't arrays (and dicts) represent a string -> value mapping, while vectors (and lists) do (0<=int<maxint) -> value?

DKF: So what if vectors/lists have a restricted language of keys? :^)

13may04 jcw - Cool! The serialize/deserialize operations are "slightly less primitive operations" IMO, in that they can be implemented with the first four. The get/set/unset/list combo is the core. They offer all sorts of interesting new options, similar to Perl's "tie". My first goal would probably be to map this to hashed Metakit views, i.e. persistent memory-mapped arrays. Gdbm is another obvious candidate. More advanced uses may need some more machinery, but I think most of that can be done in Tcl.

I'd like to describe an idea which unifies keyed access (arrays), indexed access (lists), scalars, and more - but let me just point to [L1 ] and [L2 ] for now.


RHS 08June2004 I've been thinking a lot about arrays and how they aren't "first class citizens" when it comes to being data objects. Other than the fact that it would require a lot of work to make the change, what are the reasons for not converting arrays to be TclObjects that could be shimmered to/from other types of TclObjects? By way of an example, I think it would be very handy to be able to do:

set bob {a 1 b 2}
puts "bob(a) = $bob(a) -> should be 1"
puts "bob = $bob -> should be {a 1 b 2}"
set bob(c) 3
puts "bob = $bob -> should be {a 1 b 2 c 3} or {b 2 a 1 c 3} or {c 3 b 2 a 1} or etc"

My thought is that, by converting tcl arrays to TclObjects and allowing them to shimmer to/from other types, that we would now be able to return arrays from procs, pass them in by value, etc.

Other than the obviously huge amount of work the conversion would require, what are the major problems with this approach?

Traces perhaps? RS: As soon as we have dicts, these can take over all the pure-value/first-class usages, and arrays will be used less, I expect.

NEM There is a major problem with this approach (overloading array syntax). Let's rearrange the items in your example:

set bob {1 a 2 b}
puts "bob(2) = $bob(2) -> should be... err 'b' if it's a dict/array, or 'a' if it's a list..."

As you can see, this introduces an inheritant ambiguity. The return value depends on the current underlying Tcl_ObjType of the Tcl_Obj (value) - thus exposing a previously hidden implementation detail, and radically changing Tcl's script-level semantics. I wrote some notes on a related subject at [L3 ] a while back. A possible way out is to introduce type-tagging at the script-level (so that the current type becomes a part of the string rep) where you want this sort of polymorphism. See TOOT for my experiments in that direction (no use of array syntax, but I think the idea could carry across, with some work).

DKF: My idea is that the array command will get a new subcommand which allows you to declare a new array with an alternative implementation inside it. This might work like this:

array declare bob -type list  ;# Support more options (e.g. database username/password)
array set bob {a b c d}       ;# Set up the array contents
puts "bob(2) = $bob(2)"       ;# which is the value "c" if it is a list, of course.

RHS I disagree with NEM's conclusion above. The notation bob(2) refers, as Tcl is defined now, to the element 2 of array bob. As such,

set bob {1 a 2 b}
puts "bob(2) = $bob(2) -> should be b"

Would be the only interpretation. Only if we decide to state that bob(2) can represent

  • element 2 of array bob
  • lindex $bob 2 (for a list)
  • some other reference into bob, for some other form of bob

Do we run into ambiguity. I propose that sticking with the current definition of bob(2) as an index into an array is just fine.

As an added note, if both the key and values of arrays can be TclObjects, then we can have arrays of arrays of arrays (much like in 7.6), which I'm a big fan of for representing tree structures and the like... And the shimmering to other forms should work fine:

set bob {a 1 b {A 8 B 7} c 3}
set joe $bob(b)
puts "bob(b)(A) = $joe(A) -> should be 8"
puts "bob(b)(A) = [set ${bob(b)}(A)] -> should be 8"
puts "bob(b)(A) = [set $bob(b)(A)] -> should be 8 too? Not sure on the parser here"

NEM Ah, ok - I misunderstood what you were proposing. So you are after accessing Dictionaries as Arrays (i.e. using array syntax to access dicts - coming in Tcl 8.5, see TIP 111 [L4 ])? I haven't quite made up my mind about DKF's proposal for an [array declare] command - static typing makes me nervous. Actually, it's a weird sort of typing that we have since array variables were introduced -- special variables which hold a value (an array) which is opaque and can never be replaced (the array variable cannot be assigned to) -- instead, the array is mutable (it contains scalar variables, which can be assigned to as usual). With Donal's proposal, AIUI, this changes, and we now enforce type-checking on variables. The array variables declared become normal variables, and can presumably be assigned to. However, they are now type-checked, or at least cause automatic shimmering of items assigned to them. Some examples, needing clarification:

array declare bob -type list
set a "This is \{not a list"
catch {llength $a} err; puts $err ;# --> unmatched open-brace in list
array set bob $a ;# 1
puts $a(2)  ;# 2

What happens now? An error? At 1 or 2? This is presumably an error, and is indeed an error now, as [array set] expects a valid list anyway. But what if we had other types?

array declare foo -type myspecialtype
array set foo [myothertype create]
puts $a(jimmy)

Now suppose that values created by [myothertype create] are valid lists, but not valid myspecialtypes... What happens? As a further example:

array declare foo -type dict
array declare bob -type list
array set bob {a 1 b} ;# 1
array set foo [array get bob] ;# 2

What happens here at 1 and 2? Is 1 valid - it is a valid list, but not a valid dict or array (odd number of elements)? If 1 is valid, what about 2? Note, I've not made up my mind on this completely yet, just trying to think about the implications.

Lars H: NEM's claim that an array can never be replaced needs to be refined. You cannot replace an array with an ordinary variable using set, but if you unset it then that variable name is ready to be reset as any kind of variable -- ordinary or array.

As I see it, one important benefit of fitting a new rear end to arrays is that one can store special sets of data more efficiently. As an extreme example one could take a long "vector of bits" array type, where the data is stored as Judy arrays or some such. For such an array, accessing any index which is not a (32-bit) integer would be an error, as would be trying to assign a non-boolean value to an array element. array get would still be possible, but run a significant risk for out-of-memory panics.

RHS 09June2004 Personally, I'm not a big fan of either dicts or typed arrays. I'm of the opinion that normal arrays, if they were TclObjects and could be treated as such (ie, how I described above), would be adequete for all the tasks I would want to use them for. Admittedly, it wouldn't address Lars' point of storing specialized data more effeciently, but it would be fine for most complex data types that I can think of (ie, trees, etc).

As far as NEM's comments about malformed lists and the like, the way I see it they would cause an error. My thought is that any form that array set accepts (ie, a even numbered element list) would be acceptable as a list form that could be shimmered to an array. If array set wouldn't accept it, neither would it be possible to shimmer it to an array from a list.

DKF: But arrays are collections of variables, not collections of values. It's a subtle but utterly fundamental difference.

RHS 11June2004 : True, but thats only because we think of and use them that way. There really is no inherent difference between an array and a dict, other than the way that can be used. They are both a collection of key->value pairs, where keys must be unique. It just so happens that arrays have all their functionality built around being modified in place. If we used list operations to work on lists in a similar way, they would be the same:

list::set bob {a b c d e}
list::setElement bob 1 B
puts $bob ;# a B c d e

Would there really be any difference between the following two sets of operations?

proc changeAValue {&name} {
    upvar ${&name} name
    list::set name {a b c}
}
proc dontChangeAValue {value} {
    list::set value {a b c}
}
set bob {A B C}
dontChangeAValue $bob
puts $bob ;# "A B C"
changeAValue bob
puts $bob ;# "a b c"

proc changeAValue {&name} {
    upvar ${&name} name
    array set name {a 1 b 2 c 3}  ;# ok, so array set doesn't set so much as "merge", but its the idea
}
proc dontChangeAValue {value} {
    # Assume we can pass in arrays by value, as is being discussed
    array set value {a 1 b 2 c 3}
}
unset bob
array set bob {a A b B c C}
dontChangeAValue $bob
puts $bob ;# "a A b B c C" (or in some other order)
changeAValue bob
puts $bob ;# "a 1 b 2 c 3" (or in some other order)

The whole concept of "list are immutable and arrays aren't, while lists are values and arrays are names" is overthinking the problem, in my opinion. We have list commands that modify lists in place (ie, we can treat lists as Values or Variables). Why not have the ability to modify arrays as Values?

NEM Just a quick correction: we have list commands that modify variables containing lists - they don't act on the lists themselves. That's why you do stuff like [lappend foo args] rather than [lappend $foo args]. [set] and foo(blah) syntax always operate on variables, however, so this isn't a major problem. A couple of things to think about:

set foo(1) "Hello!"   ;# is foo an array, list or dict?
trace add variable foo(1) read [list blah...] ;# only valid for arrays, as only arrays are containers of variables

The first item could be solved by picking a default. Anything else could be done through appropriate set commands:

array set foo { 1  "Hello!" }
lset foo 1 "Hello!"
dict set foo 1 "Hello!"

For backwards compatibility, the default should probably be an array. Trace can be solved by simply throwing an error if the variable isn't an array. Considering Donal's proposed [array declare] command:

array declare foo -type list
array set foo {1 a 2 b}
puts $foo(2) ;# Outputs "2"
puts [dict get $foo 2] ;# Outputs "b" - shimmer to dict
puts $foo(2) ;# Outputs "2" - shimmer back to list

So, the array declare just has to make sure that any accesses through the foo(bar) syntax cause a shimmer of the contained value to the declared type (if necessary). That could work. So basically, $foo(bar) syntax becomes an alias for [$type set foo bar], and [set foo(bar) val] becomes [$type set foo bar val]. This is quite an interesting idea - the variable is a typed container, but the value is still untyped.

Another idea: what if "namespace" was allowed as a type? Then you really could throw away arrays... [set foo(bar)] would be equivalent to [set foo::bar], and thus would allow traces etc.

RHS You are, in my opinion, overthinking the problem. Lets assume that tcl arrays are TclObjects, because thats where I'm trying to get to. Then we have:

set foo(1) "Hello!"  ;# This is a TclObject, currently in ArrayObject form
puts $foo            ;# {1 Hello!} Now the TclObject has a string rep
puts [lindex $foo 1] ;# Now its a TclObject in ListObject form
puts $foo            ;# {1 Hello!} Now it has a valid string rep again
puts $foo(1)         ;# Hello! Now its in ArrayObject form again

What would this mean would have to be implemented? The way I see it:

  • Arrays need to be stored in TclObjects
  • Lists need a way to keep track of which elements (0/1,2/3,4/5,etc) have traces on them

This also means that you couldn't provide a way to put traces on individual list elements. Effectively, you'd only be able to put them on "pairs" (ie, the key and value). However, I don't really consider that much of a concern. If you could only put traces on individual element pairs through arrays, then we'd never have to worry about that.

As a sidenote, I'm firmly of the opinion that dicts would not be needed if we could get arrays to be first class data in tcl. Once they are, they can do everything dicts can do. (No offense is intended to the dict creators, for my lofty goal may be to difficult to be reachable anytime in the forseeable future).

NEM I disagree. Arrays are basically dicts which contain variables. I don't think setting a trace on a value is a useful thing to do - so setting traces on elements of a list is wrong (unless it is a list of variable names of course, in which case [trace add variable [lindex $list $index] ...] would work exactly as it does now). However, what I propose can work, I think. [trace]/[set] etc work on variables - more specifically variable names. I'd propose two changes. Firstly, I'd place the typed variable functionality as an option to the [variable] command. Secondly, I'd change setting array variables to require an extra (explicit) dereference. Some examples of use:

variable -type list foo [list a b c d e]
puts $foo(2) ;# -> c
set foo(2) a  ;# creates a (scalar) variable named "foo(2)" (literally)
set $foo(2) a ;# creates a var named "c" with value "a"
variable -type namespace bar ::somens
puts $foo(bar) ;# prints "::somens::bar" (the literal string)
puts [set $foo(bar)] ;# prints the value which is in ::somens::bar
set foo(bar) "Hello, World!" ;# creates var named "foo(bar)"
set $foo(bar) "Hello, World!" ;# set ::somens::bar to "Hello, World!"

This is perhaps slightly radical. Of course, this isn't backwards compatible as it requires an extra dereference when setting/getting array variables. In fact, a traditional array is now just a dict which contains variable names.

What'd be cool would be a TOOT based solution, with no typed vars, but values which are (type,data) tuples, and some sugaring, such that:

set foo [Dict create a 1 b 2 c 3]
puts $foo(b) ;# -> puts [$foo set b]  -> puts 2
set foo [$foo set bar [Dict create jim "Hello, World!" a 2]]
puts $foo(bar)(jim) ;# -> puts [[$foo set bar] set jim] -> puts "Hello, World!"
f course, this falls down for such things as:
set foo(bar)(jim) "Arrgh!"
lthough that could be transformed (by the [[set]] command) into:
[$foo set bar] set jim "Arrgh!"

which would return a new value with appropriate thing set. I have a "reference" type for TOOT almost complete which allows modifying in-place. The advantages of this over the typed variables descibed above are:

  • The "type" is associated directly with the value, and is explicit in the string rep.
  • With the nested syntax, type is never ambiguous - with typed variables you can only unambiguously determine the type for the first level of nesting.

But perhaps I'm rambling a bit too much here...


KJN: A rear-end based on an ordered rather than non-ordered map could be used to implement (and unify) arrays, dicts, and lists. This is the approach used by PHP: see [L5 ] and the links in that page's "Introduction". What I really like about PHP arrays is that:

  • (a) they are first-class objects, and so can be an argument or the return value of a function.
  • (b) they are the only compound datatype in PHP: you don't need to worry about whether you need an array, dict, list, tuple or whatever. It is the functions that you choose to manipulate the array that determine whether it is "in spirit" an ordered list, or an associative array, or (rarely) a mixture.

Any implementation of a language feature is a set of compromises; "one compound datatype, implemented as an ordered map" is no exception, and requires that, if the user does not provide a key for an array value, a key must be autogenerated: in PHP this is a (previously unused) integer. Beginners might be confused between this autogenerated key, and the integer that represents an element's position in the (ordered) array. IMHO, this is less troublesome to explain to a newbie than how to choose between dict, list, and array, and how to use upvar instead of passing arrays between procs.

(DKF notes that it depends on whether the newbies have had datastructures-and-algorithms training; if they've done any CS or SE course, they'll definitely have done that.)

Of course, if only the rear-end is unified, then the distinct API and notation for arrays and lists will remain, and the keys that the rear-end autogenerates for list elements will be invisible. We can then argue at length over whether unification of the front-end is desirable and, if so, what the notation and API should be for the "unified" compound variables. One possibility is that "old-style" lists and arrays would coexist with the new "unified" variables for ever, for reasons of backward compatibility, but that "unified" compound variables would be preferred for new designs.