Updated 2014-06-08 03:47:52 by pooryorick

An abstract data type is a data type that instead of specifying implementation details such as memory storage requirements, memory layout, specifies only the behaviour of some conceptual structure.

Description  edit

An abstract data type defines a class of abstract objects which is completely characterized by the operations available on those objects. This means that an abstract data type can be defined by defining the characterizing operations for that type (Liskov 74). Object systems are founded in this concept.

The implementation of an abstract data type is not necessarily hidden. Rather, abstract data types, being abstract, are type definitions that simply have nothing to say about their implementation details. C data types fit the definition of abstract data types because they are defined by the operations C provides for them, but because they are provided by the language proper, they are considered primitives instead. In many languages there is very little that differentates primitives from abstract data types.

Languages such as Haskell that provide type checking for abstract data types do it by means of [algebra] of types.

In Tcl, every value is an instance of string, which could be considered a primitive, but because of the extreme flexibility of Tcl in allowing all commands to be redefined, could also be considered just another abstract data type. In any given command, a value may additionally be interpreted as having some other type, abstract or otherwise. Whether the value is the data, as with the operand of list or dict, or an identifier for the data, as with array or chan, the type is defined by the operations, which are exposed as commands.

In turn, every command is part of the definition of some abstract data type. In the case of commands like info or if, that data type is the Tcl interpreter. The unique thing about Tcl is that it always transcends whatever it wraps itself around. There is only one way for the implementation of an abstract data type to be surfaced at the script level: as a value and set of commands. All the object systems surface objects and classes as commands. Tcl itself is the ultimate abstract data type, subsuming all others under its stringy cloak. A script is in fact the data structure of an abstract data type, whose structure and operations are defined by the Dodekalogue, and Tcl is its implementation. One of the reasons that the Tcl community may have been so ambivalent in the first place about ordaining an object system is that Tcl itself is an object system.

A Discussion  edit

CMCc: I'm going with this definition: A data type together with all the operations on data objects of that type that make use of information about the representation of the data type

array is an implementation of an abstract data type. It supports various operations such as array get and array set. It's also an opaque handle because it isn't possible to tinker with the hash collision tables from the script level, but that characteristic isn't essential to being an abstract data type.

The abstract data type of an array is the only information needed to make use of the array. Implementation details are completely separate from the abstraction:

  • the internal representation of an array is completely hidden
  • the abstract data type for array presents a clearly defined interface, represented by the subcommands of array
  • A completely different implementation could be substituted in without affecting the client of the abstract data type in any way.

Each of the other builtin data structures, and many of those in tcllib, have these benefits too.

However, types like array and (say) ::struct::tree suffer from an inability to be formally composed together with other data types into more complex abstract data types. That is to say, they aren't part of an [type algebra].

For example, one can't derive a SortedArray, in which array get always returns its results in a predefined order. Native array doesn't do this because it's unnecessary for most uses and it would be expensive but there are cases where one might like this: for example if one was using an array to store a time sequence of events. The ability to derive SortedArray (e.g.) from builtin Array in a coherent and consistent manner, while preserving something of the encapsulation of Array, would be a benefit of OO.

How many ways are there to wrap Array in something which adds/changes its functionality? An infinite number. There should also be a single, efficient, simple, predictable, well known way. That, I think, is what the adherents of OOTcl are suggesting. I agree with them.

RS has never felt a need for "SortedArray" - it's easy enough to iterate like this:
foreach name [lsort [array names arr]] {...}

Types/subtypes/classes/derived classes are popular in other languages, but doing without them, I just feel freer in Tcl... :)

CMCc: I presume, RS, that you've never had more than a few instances of SortedArray and discovered that you had to change the sort order ... in OO one changes it once, in the solution proposed, and it is changed everywhere it appears.

As there are an infinite number of ways to wrap functionality and compose simpler objects programmatically, and the only metric of quality being admitted here is performance, it would seem very difficult for a maintainer to recognise it when it's being done.

AMG: The Sorted Lists page has some related information.

How Does One Implement New Data Types in Tcl?  edit

Lars H: As remarked in the lengthy discussion on the Things holding Tcl back page, it is possible to encode data structures on top of simpler data structures, and these on top of still simpler data structures, in rather many levels. Tcl has a marvelous way of encoding lists as strings, but stops there, as everything is a string. Most computational models (e.g. Turing machines) stop there as well, although some (e.g. recursively defined functions) continue by encoding strings as Gödel numbers. That something can be encoded using simpler structures is not of much interest here.

CMCc: Yes, if one replaces the question "what are important/desirable characteristics of composition in a language?" to the question "is it possible to compose in [some given] language?", it does lose value.

Lars H: Given a range of implementations of a data structure, the right question is probably rather what the computational complexities of standard operations are for these implementations. If one can find an example of a data structure where Tcl implementations have significantly worse complexity than other languages, then that is a sign that the language could do with improvements. Some historical examples:

Lists. In Tcl 7, there were no Tcl_Obj's, and thus in particular, lists had no "internal representation". Any operation on a list required that the string was reparsed, so that for example llength and lindex (on average) were linear time operations. In particular, the complexity would depend on how large (as strings) the elements of the list were, which was a significant drawback in comparison with languages that maintained structured representations of their lists, such as Lisp.

In Tcl 8, such structured representations could be stored in Tcl_Obj's and both llength and lindex became constant time operations on values for values that have a structured list representations. Now it is rather Lisp (using a linked list implementations) that is at a disadvantage, because computing the length of a linked list requires that the list is traversed. (OTOH, insertion is linear in list length in Tcl, but can be faster in Lisp.)

Dictionaries. In Tcl 8.4, getting and setting individual elements of a dictionary (stored as a key value key value ... list) is linear in the size of that dictionary, but in Tcl 8.5 the new dictionary values with a hash internal representation will be faster (anyone know how much faster? Logarithmic?).

In these two cases, there was a clear gain with getting these things into Tcl, but for many of the things proposed, there is no gain! Lists are all that one needs.

Fancier data structures aren't always better

In the reference page (or thereabout) I asked whether there was any case at all of a data structure that could be implemented more efficiently (better complexity) using references (e.g. Lisp cons cells or C pointers to structs) than in current Tcl. The simplest example anyone gave was that of a dual-sorted queue: If the queue is a list then you need to search through it to find the smallest element in at least one of the sorting orders, so dequeuing is a linear time operation. If the queue is instead implemented as a linked list, where there is one "next1" reference for sorting order 1 and one "next2" reference for sorting order 2, then it is possible to find and dequeue the first element according to either sorting order in constant time. You do need to make both linked list structures dual-linked, so there is a total of four pointers per queue element, but it is still constant time.

There is a "but" though: that is just the complexity of dequeuing. Any kind of system that makes use of such a queue would have to enqueue as many elements as it dequeues, and inserting a new element on the proper position into a linked list is still a linear time operation. So in the end there was no gain, and total complexity is the same regardless of whether one uses reference-based linked lists or just plain Tcl lists.

Then again, it is possible to speed up insertion in the linked list case by adding a self-balancing tree structure to the thing, in which the original linked list is just the "thread" (I think it is called); this gets insertion down to logarithmic time, at the price of increasing the total number of references by another factor two (possibly more). The result is a rather huge data structure, but it is faster than simple-minded searching through a list. On the other hand, it still isn't superior to Tcl, because nothing stops us from treating a Tcl list as one in the old days sometimes treated an Pascal array as a "heap" from which we can allocate and deallocate our own dynamic memory ("pointers" then being indices into this list). Anything that can be done with references (without automagic garbage collection) can be simulated in Tcl using lists, without loss of asymptotic performance.

CMCc: Who said the sole metric of "better" is "more efficient"?

Lars gives a few examples of completely encapsulated objects which benefited from a reimplementation without a change in interface. The ability to do that, to distinguish between implementation and interface is why something which permits encapsulation is "better" than something which does not.

If it were not for the fact that the internal representation (and implementation) was distinct from its interface, it may well have been practically impossible to produce the more efficient implementation without disrupting users of the construct.

For such a distinction to be significant, it has to be distinct and recognisable by most people. That's one of the things OO does - signals what's available for use and what's implementational, and subject to change.

[DKF: what is meant by "implementational"? Replace with real meaning and delete this comment.]

Lars H: Your claim that my examples concern "completely encapsulated objects" is incorrect. Lists were not completely encapsulated in Tcl 7 (they could at any time be introspected using e.g. [string index]) and they are not completely encapsulated in Tcl 8 either (you can still peek at characters using [string index]). It is inefficient to operate on them using non-list commands, but it is perfectly possible.

Come to think of it, this may be what object-oriented programmers find difficult to understand about everything is a string (or at least one thing that makes them confused): A good data structure need not necessarily hide the data behind some opaque API -- it is sufficient that the public representation (the string) of the data can be lazily generated. Thinking of data structures as objects instead promotes the view that as little as possible should be made public.

((interpolated rebuttal: CMCc - I take it amiss when I'm told what I don't understand. Recently, on the wiki, I started talking about things which aren't strings (procs, arrays, etc) and Miguel was kind enough to give me the full version: everything is representable by a string. Every value is serializable and constructable to a string.

PYK 2014-06-06: In Tcl, every value is a string. If the term value refers to something at some other level, all bets are off, and Tcl has nothing to say about it.

Additionally, list, the example given by Lars H, is implemented as an array of pointers in C, but could be implemented as a linked list (for example) and this would not change its interface. List implementation is completely hidden from the user, as are all of the other basic data types. If it were not so, it would be possible to write code which broke when the implementation changed from the current array-of-pointers to linked-list. It is not possible.))

NEM: I've been giving this some thought recently -- the seeming contrast between transparent values and opaque objects/values. Tcl's built in list implementation is based on a string representation, but if I was to write a new list implementation using one of the many OO packages, it would be an opaque handle. The reason for this opaque handle is that in order to support the OO [object method args] notation (and the polymorphic method dispatch this allows) the "object/value" must be a command name. It would thus be pretty inconvenient to change this command name every time the data changed -- not to mention the risk of name clashes. This has implications for how the value is treated compared to built-in lists/dicts etc. Values in Tcl are often said to be typed according to how command treat them (there is no list type, merely some commands which attempt to treat a subset of strings as lists). However, OO extensions reverse this by making the value itself a command name, hiding the actual data behind the scenes, and laying down which commands can be applied to the value. So, in that sense, a certain amount of typing has been introduced by these extensions. Is this a bad thing? OO advocates would argue that it is a very good thing -- hiding implementation details behind an interface make codes maintenance easier, and helps robust code reuse. I can certainly see the logic in this, and I've begun to wonder whether transparent data types are actually a good idea. We've all seen the problems that people get into with using string functions on lists, and it seems advice generally given these days on how to get good performance, and best use out of values in Tcl is to not mix commands which expect different "types" - always use list functions on lists, try to avoid shimmering, etc. So, on the one hand we have the argument that transparent values make Tcl so powerful, and you can do wonderful things with them, and on the other, we have the wisdom; Don't Do That.

Note, also, that there is a confusion between the terms "everything is a string" and "transparent data types". Opaque data-types (handles) are still represented as a string, but their internal data structure need not be (could be implemented in C) convertible to a string. So, at the Tcl level, every value you see is still a string, just that sometimes that isn't the whole story. This is not a new problem introduced by OO extensions - channels, images, namespaces, commands themselves and more are all opaque handles. Many provide introspection capabilities to allow serialisation, but some do not, and generally none do so in a consistent manner.

So, what are the advantages of transparent data types? Natural serialisation is one: just write the value to a file and read it in later and you can exactly recreate it - no explicit serialisation/deserialisation to perform. I don't see this as particularly convincing argument. You could just as easily make "everything an object", and just mandate that each object provides routines to convert to/from a string -- then the input/output commands simply call these methods as needed. What are the advantages of making the implementation details of a value (the string representation) a part of the interface (indeed, the only interface)? Is Tcl unique in doing this?

(Discussion moved to bottom of page - marked with ***)

NEM Ok, I'm going to retract that statement about serialisation - it's not quite as easy in the OO approach as I made out. The problem comes from things like:
puts "This is my object: $obj"

Here, you need some way of finding the reference to $obj and replacing it with [$obj tostring]` transparently. Not impossible: you could do the serialisation in the variable dereferencing code (rather than in a channel handler), but this breeds more problems:
set obj2 $obj

$obj is serialised and then deserialised, rather than the opaque handle being copied - so what is the "type" of the new object? We have returned to a transparent value here, basically. This is a much more interesting problem than I had thought - and has rekindled my appreciation of transparent values. I need to think about this some more...

NEM: I just had an idea too: has anyone ever tried to combine the two approaches? I mean, create an OO package where the string-rep was something like:
 [list dispatch-cmd data...]

(CMCc: check out namespace ensemble for something along those lines - it makes a namespace a command which can be used to dispatch other stuff)

dispatch-cmd could be a command for the "class" of the object (assuming class-based OO), and the data bit could be a transparent representation of the data contained in the object (say a dict). Would require auto-expansion of leading word to work... Just an idea.

NEM: Looking at this idea again, I see it has a problem of this prefix stuff. For instance, it still doesn't solve the problem of being able to use an OO toolkit to write a compatible replacement for Tcl's built in lists - e.g. something that could be used with foreach transparently.

NEM: *** Moved this to the bottom for people skimming for changes:

Lars H, later insertion: The serialisation/deserialisation issue may not sound like much in a theoretical discussion, but if you frequently communicate structured data then it is a fantastic feature! The practical advantage is HUGE.

NEM (later still): Indeed - the practical benefits of such simple serialisation are well known, and very real. However, I'm just trying to think about what the trade-offs are. It seems that to have this natural serialisation (and powerful introspection) we have discarded most of the benefits of encapsulation - we have laid bare the very implementation of the data structure. For example, suppose if at some point somebody (a madman clearly, but anyway) decided to change the string format of lists (let's say they decide items will be separated by commas instead of whitespace, for easy production of csv files). Now if you send a list from a current Tcl, and try to read it into this new whizz-bang modern Tcl, it will probably fail (or at least, it is unlikely to reproduce the same list). There is no versioning in the string representation, either (but, of course, there is versioning in the Tcl package which implements it). So, we enforce encapsulation through a lot of saying (and documenting) "Don't Do That". Is this a good situation? Are the trade-offs acceptable? Is there a better way? Is there a way to say which data-types would be better as transparent, and which would be better as (encapsulated) opaque handles? And is relying on natural built-in serialisation a good idea for communication - should you really be designing/using a special protocol for it?

NEM: See also transparent and handle pages which discuss these two representations. Basically, transparent values are first-class, while handles are more like pointers or references (the handle is first-class, but the thing it points to is not).

FB: I'm trying to address the transparent vs handle dichotomy in Cloverfield. Transparent values are IMHO one of the major strengths of Tcl. However there are cases where this falls short, because Tcl lists can easily represent trees but not graphs. IOW, graphs cannot be expressed as a single transparent value in Tcl without in-band signaling (which is considered A Bad Thing most of the time in CS, see [1] vs. [2], return codes vs. exceptions, zero or {} vs. null, etc.

Lars H: A problem with your argument is that those references are about control signalling, whereas your graph example below is data signalling. I seriously doubt you can find references for in-band data signalling being a bad thing, but feel free to delete this comment if you provide such above.

), for example by enforcing some naming convention. But in this case you add some computational complexity, or are forced to supplement the value with external data. Let's consider the following graph:
|\ | /
| \|/

A possible Tcl definition would be:
    A {B D E} 
    B {A C E} 
    C {B E} 
    D {A E} 
    E {A B C D F}
    F {E}

Here the graph is expressed as a dictionary whose keys are the node names and the values are lists of connected nodes. This case is simple because of the naming convention. Now let's imagine that we want to allow changing the name of a node. This implies changing the key as well as all instances of the name in the lists. To prevent that we have to choose another convention by giving each node a constant identifier:
    1 {A {2 4 5}} 
    2 {B {1 3 5}} 
    3 {C {2 5}} 
    4 {D {1 5}}
    5 {1 2 3 4 6}}
    6 {F {5}}

Here we uncoupled the name from the identifier, and stored the latter in the lists. That means that we are now free to change the names of the nodes, but we can no longer use this name as an index into this structure, we are forced to use arbitrary identifiers as dictionary keys. To keep this possibility, we need to maintain a supplementary structure, here a reverse map. Moreover the identifiers are totally meaningless with respect to the data held by the structure. Identifiers should be hidden data, but they must be made transparent and explicitly used. This is a kind of in-band signaling.

The other alternative is to use handles. Handles are the Tcl equivalent of pointers. Each node becomes a distinct data identified by its handle. Accessing node data needs support from specific procs, or must be properly encapsulated into an object that provide access methods. Node data can be easily changed without altering the overall structure. But what we gain on one side is lost on the other side, as the structure can no longer be expressed as a string. And we have shifted the paradigm from a declarative to an imperative style.

Addressing this kind of problem needs out-of-band signalling. We need a way to express references in transparent Tcl values so that data and relationships can be uncoupled. To do so Cloverfield introduces the concept of word modifier. Words (in the Tcl sense) can be tagged by special modifiers using a syntax similar to the argument expansion operator. In our case, one can associate a reference identifier to a given word, and later refer back to this reference in the same structure (see also Cloverfield, section References). This uses the prefix {ref id}, where id is an arbitrary string. Let's define our graph using Cloverfield syntax:
    {ref 1}A ({ref 2}B {ref 4}D {ref 5}E)
    {ref 2}B ({ref 1}A {ref 3}C {ref 5}E)
    {ref 3}C ({ref 2}B {ref 5}E)
    {ref 4}D ({ref 1}A {ref 5}E)
    {ref 5}E ({ref 1}A {ref 2}B {ref 3}C {ref 4}D {ref 6}F)
    {ref 6}F ({ref 5}E)

Here each occurrence of node names are prefixed by a reference identifier (for comparison purpose we chose to reuse the same identifier values as above). These tags tell the interpreter that the associated data can be referenced later in the structure. Value wise, the above list is strictly equal to the first Tcl version above: the structure is still a dictionary that can be indexed by node names, and the lists of connected nodes hold references to this names rather than value, however foreach will only "see" the names; changing a node name doesn't break this relationship, because there is no longer an explicit naming convention, and the purposed served by the identifiers in the second Tcl version is now expressed as metadata. This is out-of-band signalling, and it allows keeping all meaningful data in a single structure that can be expressed as a string. For this example I've simply taken the first Tcl version and prefixed each name with its matching ID. (Note: There are better ways to express graphs using this syntax but I chose this one for comparison purpose).

This mechanism is not really meant for direct use in scripts (although its usage is fairly easy), but rather to enable the serialization of circular referencing structures that can't be expressed as trees. For example an application could build a list with complex reference patterns, such as Scheme-style "infinite" lists, and not worry about its string representation, whereas such structures would typically be impossible to create from Tcl, and generate an infinite loop when building the string representation of a self-referencing list created from C (which nothing prevents in the current version). Tcl is meant to be a glue language, and to do so data exchange must be expressed as strings and formatted following specific conventions. However some structures simply cannot be expressed that way without a lot of hassle, and this proposal tries to address this problem.

I think the lack of expressive power in languages is one of the major reasons why OOP is so popular. In the case of XML, compare for example DOM-based access (imperative) versus XPath queries (declarative): large portions of the Document Object Model become almost totally useless because a single XPath query can replace a complex sequence of DOM statements. And SQL, while still free of any OO concept, remains the best way to access relational databases, because no sane person wants to express database relationships in an imperative language.

DKF: The traditional Tcl way of handling such complex structures is (regardless of the opacity of the value) to provide a command (or commands) that talks the users' language and performs the operations on whatever the structure actually is. I encourage you to stick to that paradigm (however it works in practice when mapped to what you're doing).

NEM: The problem with lists is that there are two levels of implementation - the implementation of the efficient internal representation behind the scenes (by the list Tcl_ObjType and associated gubbins--like this?), and the implementation of the conversion from the string representation to this internal representation. You are right, in that the internal internal representation is completely encapsulated, and could be changed to a linked-list without breaking anything. I was referring, however, to the implementation in terms of string representation, which certainly would break things if it changed. It's this representation which is more important at the script level, as it is the only representation which is has any level of guarantee of surviving shimmering etc. So the string rep is an encoded form of (part of) the interface which the underlying implementation has to honour. Does that make sense?

CMCc - Makes sense to me. It's a shame Tcl_Obj wasn't called something different to save confusion. Yes, struct Tcl_Obj defines objects at the C language level, but it's an impoverished/minimalist object interface which, by design, only concerns itself with construction and conversion to/from a string representation. Nothing wrong with that, but it hardly impacts on the visible tcl language at all (if at all) which is evidenced by the way we think about shimmering.

The semantics of conversion to and from string representation is the explicit concern of Tcl_Obj, it's the C interface defined by Tcl_Obj, and Tcl_Obj is (by design) silent with regard to other interface functionality and is in any case invisible to the Tcl language level. The only Tcl interfaces to things supported by Tcl_Obj are command wrappers. Several significant entities in Tcl aren't even represented by Tcl_Obj (e.g. open files) and several of the significant entities which are represented by Tcl_Obj aren't visible to the Tcl language anyway.

DKF: They should have been Tcl_Value IMO, but that datatype was taken (for user-defined functions in expr, since you ask).

FB: I'd suggest Tcl_Word. The Dodekalogue designates values as words.

DKF: That'd probably get confused with 'machine word'. Sometimes you're just stuck.

FB: I think I could live with that confusion ;-). Confusion with objects is more problematic IMHO.

NEM 2008-01-24: Tcl_Term?

PYK 2014-06-07: Tcl_Chimera

Reference  edit

Programming with Abstract Data Types, Barbara Liskov, Stephen Zilles, 1974
Google definition

Page Authors  edit

Lars H
others as identified above