TDL

TDL is a data format for representing an ordered tree.

Lars H 2010-02-03:

Some months ago, while working on OpenMath things, I grew tired of the XML syntax — it just felt cluttered and clumsy with all those </>="' characters. I wanted something more lightweight (for me as author), so where to turn, if not to Tcl? First for sketching things out, and later as a format that programs would actually process, I began to use something I dubbed TDL (for Tcl Data Language, or perhaps — in analogy with Tool Command Language and Tool Protocol Language — Tool Data Language).

First, an example to illustrate the idea. A small fragment of OpenMath is

<OMA>
    <OMS cd="symocat1" name="label"/>
    <OMS cd="Hopf-algebra" name="mult"/>
    <OMA>
        <OMS cd="list1" name="list"/>
        <OMV name="a"/>
    </OMA>
    <OMA>
        <OMS cd="list1" name="list"/>
        <OMV name="b"/>
        <OMV name="c"/>
    </OMA>
</OMA>

This is rather cluttered. The same thing expressed as TDL could be

OMA {
    /OMS symocat1 label
    /OMS Hopf-algebra mult
    OMA {/OMS list1 list; /OMV a}
    OMA {/OMS list1 list; /OMV b; /OMV c}
}

or if not so specialised

OMA {
    OMS cd symocat1 name label
    OMS cd Hopf-algebra name mult
    OMA {OMS cd list1 name list; OMV name a}
    OMA {
        OMS cd list1 name list
        OMV name b
        OMV name c
    }
}

giorgio_v: It reminds me of xmlgen

Lars H: Yes, there are similarities (and not only in purpose). One difference is that TDL is supposed to be data (even if treating it as code will be a convenient way of operating on it), whereas xmlgen / htmlgen is rather a way of writing scripts that construct the wanted data structure. That has implications for whether one might also want to generate it, or just write it. In addition, xmlgen / htmlgen seems to require that all tags to be used are defined as commands, whereas TDL would mostly require that only for positional-argument commands.

A more trivial difference is that xmlgen / htmlgen keeps the = between attribute name and value. I wanted to get rid of that too.

JFL 2010-03-13: I had a very similar idea a few years ago, and created a script for converting XML files to "SML" files, and back. SML stood for Simple Markup Language. The script is here. See that file's header for a full description of the syntax. At first I thought of this SML format only as another view on XML data, but eventually I used it as the native protocol for data transfers between programs. This proved to be very convenient because it's even easier to parse than XML, and it's MUCH easier to debug by looking at what's transmitted.

Another variation on the same theme: It's also possible to present an actual file system as a text tree like this, using the convention that directory names are always noted with a trailing /, and that the file and directory contents MUST be indented more than the name. File metadata, like date/time, can be encoded as XML/SML-style attributes. Here's a script that dumps a directory tree this way:

http://jf.larvoire.free.fr/progs/show

I used that script a lot for browsing Linux /proc subdirectories. (By default it displays only the top 10 lines of each file like "head")

Both programs work on both Linux and Windows. Run them with option -? to get a command line help screen.
Enjoy!

Basic principles

1. A general XML element is transformed to a Tcl command with the element name as command name, and the contents as a final "body" argument. Other arguments must occur in pairs and encode the attributes in the usual key–value style. Hence

<element attr1="val1" attr2="val2"></element>

is equivalent to

element attr1 val1 attr2 val2 {}

and also to

element attr1 val1 attr2 val2

since parity allows recognising the case of a missing body argument.
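
For instance, a TDL processor might separate attributes from the body with a parity check along these lines (just a sketch, not part of the format definition; the element command is made up for illustration):

proc element {args} {
    if {[llength $args] % 2} then {
        # Odd number of arguments: the last one is the body
        set body [lindex $args end]
        set attributes [lreplace $args end end]
    } else {
        # Even number of arguments: attributes only, no body
        set body {}
        set attributes $args
    }
    list $attributes $body
}

With this definition, both forms above return the attribute list {attr1 val1 attr2 val2} together with an empty body.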

AMG: Prefixing the attribute names with a minus sign might help readability.

element -attr1 val1 -attr2 val2 {}
element -attr1 val1 -attr2 val2

Lars H 2011-07-20: Since "-" is not allowed as an initial character of XML names, and attribute names are names, it would indeed be unambiguous to allow an optional hyphen there. However, it would make it more cumbersome to operate on TDL data: hyphens would often need to be removed or added, there would be an ambiguity as to whether names are given with or without hyphen, etc.

2. A sequence of XML elements is transformed to a sequence of commands, i.e., to a script. Hence

<element1/><element2 attr2="val2"/>

is equivalent to

element1; element2 attr2 val2

and also to

element1
element2 attr2 val2

3. Command names which do not fit the XML syntax for element names are used to express things other than elements and can have other syntaxes, e.g. arguments identified by position. In particular, the / command is used to encode character data in a sequence of elements. Thus

<OMI>3</OMI>

can be encoded as

OMI {/ 3}

/ can have any number of arguments, which are concatenated in the manner of append.
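
For instance,

OMI {/ 12 34}

encodes the same thing as

<OMI>1234</OMI>

since the two arguments of / are concatenated without any separator.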

4. Elements can sometimes be given an alternative encoding, in the form of a positional command. These commands will usually have a / prepended to the element name. The syntax is usually that required attributes (if any) and character data contents (if any) are mandatory arguments, but no general rule defining such commands exists; each must be defined explicitly.

Example: The OM elements

<OMS cd="arith1" name="times"/>
<OMV name="x"/>
<OMI>5</OMI>

could be expressed as

/OMS arith1 times
/OMV x
/OMI 5

but this would be context-dependent.

Downside

Some probably think it should be a criminal offense to invent a new syntax for what is almost XML, and there are no doubt interoperability problems lurking. On the other hand, if it allows me to be more productive, then I mostly think it is a good thing.

Basic operations

Prettyprinting

For robust and informative handling of syntax errors, it would probably be necessary to use something like parsetcl to parse TDL, but if we're content with parsing valid TDL and throwing an error on the rest, then things are much easier. The trick is to parse TDL data by evaluating it in an empty slave interpreter.

namespace eval prettyTDL {
    interp create -safe [list [namespace current]::theinterp]
    theinterp hide namespace
    theinterp invokehidden namespace delete ::
}

The first operation on TDL code will be to prettyprint it. The central command for prettyprinting is the prettyprint procedure, which has the call syntax

prettyTDL::prettyprint script ?option value ...?

and returns the prettyprinting of the script. The supported options are:

-indent
    Basic indent string for the code block. Defaults to the empty string.
-step
    Indent step, as a string to append to the -indent. Defaults to three spaces.

proc prettyTDL::prettyprint {script args} {
    set res {} 
    array set O {-indent {} -step {   }}
    array set O $args
    theinterp eval $script
    return $res
}

The way the pretty-printing works is that each command appends the pretty-printed form of itself, preceded by the appropriate indentation and followed by a newline, to the local variable $res in this procedure. The local array O has two entries, -indent and -step, which hold the current values of these parameters.

In most cases, that appending is taken care of by the unknown command in the slave interpreter, which is an alias to the following procedure in the master interpreter.

prettyTDL::theinterp alias unknown [namespace current]::prettyTDL::unknown
proc prettyTDL::unknown {name args} {
    upvar 1 res res O O
    switch -regexp -- $name {
        {^[[:alpha:]_:][[:alnum:]_:.-]*$} {
            if {[llength $args] % 2} then {
                set body [lindex $args end]
                set args [lreplace $args end end]
            } else {
                set body {} 
            }
            append res $O(-indent) [linsert $args 0 $name]
            if {[regexp {\S} $body]} then {
                append res " \{\n" [
                    prettyprint $body {*}[array get O]\
                        -indent $O(-indent)$O(-step)
                ] $O(-indent) \}
            }
            append res \n
        }
        default {
            append res $O(-indent) [linsert $args 0 $name] \n
        }
    }
}

proc prettyTDL::slash {args} {
    upvar 1 res res O(-indent) indent
    set L [list /]
    for {set i 0} {$i < [llength $args]} {incr i} {
        set n [string first \n [lindex $args $i]]
        if {$n<0} then {
            lappend L [lindex $args $i]
            continue
        }
        if {$n > 0} then {
            lappend L [string range [lindex $args $i] 0 $n-1]
        }
        append res $indent $L { \n} \n
        set L [list /]
        lset args $i [string replace [lindex $args $i] 0 $n]
        incr i -1
    }
    if {[llength $L] > 1} then {
        append res $indent $L \n
    }
}
prettyTDL::theinterp alias / [namespace current]::prettyTDL::slash

For example:

% prettyTDL::prettyprint {OMA {/OMS arith1 plus; /OMV a; /OMV b}}
OMA {
    /OMS arith1 plus
    /OMV a
    /OMV b
}
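
The options adjust the layout; for example, requesting a two-space indent step should give

% prettyTDL::prettyprint {OMA {/OMS arith1 plus; /OMV a}} -step {  }
OMA {
  /OMS arith1 plus
  /OMV a
}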

One could also perform various types of normalisation at this step, for example change an OMV to an /OMV.

prettyTDL::theinterp alias OMV apply {args {
    upvar 1 res res O O
    if {[llength $args]==2 && [lindex $args 0] eq "name"} then {
        append res $O(-indent) [list /OMV [lindex $args 1]] \n
    } else {
        prettyTDL::unknown OMV {*}$args
    }
}}

So that

% prettyTDL::prettyprint {OMA {/OMS arith1 plus; OMV name a; OMV name b}}
OMA {
    /OMS arith1 plus
    /OMV a
    /OMV b
}

In the case of OpenMath, it might be more useful to normalise from /OMV to OMV, but that is something that would rather be done at TDL-to-XML conversion.

Conversion to XML

Another useful operation is that of converting a TDL script to equivalent XML, which is what the following does — or rather, it converts a TDL script to a list of tdom data-trees. A data-tree is an encoding (using built-in Tcl data containers) of general XML elements; technically it is a list in one of the forms

element-name attribute-dict children
#text text
#cdata text
#comment text
#pi target text

(possibly there could be more), but only the first two are used here. The children part is again a list (possibly empty) of data-trees. The attribute-dict is a dictionary mapping attribute names to their values, neither of which has any quoting of special characters. Similarly, the text is character data between tags, without any XML-quoting.
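
For example, the XML element

<OMS cd="arith1" name="times"/>

corresponds to the data-tree

OMS {cd arith1 name times} {}

where the final {} is the (empty) list of children.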

For XML commands and the / command, this conversion is obvious, but for the non-XML commands some kind of XML encoding has to be invented. The basic method is to express those as TDL:cmd elements, which must conform to the DTD fragment

<!ELEMENT TDL:arg (#PCDATA)>
<!ELEMENT TDL:cmd (TDL:arg*)>
<!ATTLIST TDL:cmd name CDATA #REQUIRED>
<!ATTLIST TDL:arg xml:space (preserve) #FIXED 'preserve' >

In other words, the command name is made an attribute of the TDL:cmd element, while the arguments appear in sequence as the character data contents of TDL:arg elements. It may be observed that this makes the whitespace contents in TDL:arg elements highly significant.
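
For example, the positional command

/OMS arith1 times

would (disregarding any whitespace added between tags) be encoded as

<TDL:cmd name="/OMS"><TDL:arg>arith1</TDL:arg><TDL:arg>times</TDL:arg></TDL:cmd>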

This first block of code defines a command TDLtoXML::main which takes a TDL script as argument and returns the corresponding list of data-trees.

namespace eval TDLtoXML {
    interp create -safe [list [namespace current]::theinterp]
    theinterp hide namespace
    theinterp invokehidden namespace delete ::
}
proc TDLtoXML::main {script} {
    set res {}
    theinterp eval $script
    return $res
}
TDLtoXML::theinterp alias unknown [namespace current]::TDLtoXML::unknown

proc TDLtoXML::unknown {name args} {
    switch -regexp -- $name {
        {^[[:alpha:]_:][[:alnum:]_:.-]*$} {
            set tree [list $name]
            if {[llength $args] % 2} then {
                lappend tree [lreplace $args end end]\
                    [main [lindex $args end]]
            } else {
                lappend tree $args {}
            }
        }
        default {
            set L {}
            foreach arg $args {
                lappend L [list TDL:arg {} [list [list \#text $arg]]]
            }
            set tree [list TDL:cmd [list name $name] $L]
        }
    }
    uplevel 1 [list ::lappend res $tree]
}

proc TDLtoXML::slash {args} {
    upvar 1 res res
    foreach arg $args {
        lappend res [list \#text $arg]
    }
}
TDLtoXML::theinterp alias / [namespace current]::TDLtoXML::slash
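
Normalisation from positional forms back to ordinary elements (such as the /OMV-to-OMV case mentioned in the prettyprinting section) could be handled here with further aliases. A sketch, assuming the /OMV positional form used in the examples above:

TDLtoXML::theinterp alias /OMV apply {args {
    upvar 1 res res
    # /OMV name  ->  data-tree for <OMV name="name"/>
    lappend res [list OMV [list name [lindex $args 0]] {}]
}}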

This second block of code defines the TDLtoXML::xml_from_trees command, which goes the rest of the way: taking a list of data-trees as argument, and returning corresponding XML code. (One could alternatively use the appendFromList method of some tdom node object for this.) xml_from_trees also takes some options that can be used to influence indentation and the string map used for encoding characters as XML entities.

proc TDLtoXML::xml_from_trees {treeL args} {
    array set Opt {
        -nodesep    \n
        -indentstep ""
        -charmap    {}
    }
    array set Opt $args
    if {![dict size $Opt(-charmap)]} then {
        set Opt(-charmap) {< &lt; > &gt; & &amp; ' &apos; \" &quot;}
    } else {
        foreach char {< > & ' \"} entity {&lt; &gt; &amp; &apos; &quot;} {
            if {![dict exists $Opt(-charmap) $char]} then {
                dict set Opt(-charmap) $char $entity
            }
        }
    }
    set res {}
    foreach tree $treeL {
        append res [
            xml_from_treenode $tree $Opt(-nodesep) $Opt(-indentstep)\
                $Opt(-charmap)
        ]
    }
    return $res
}

proc TDLtoXML::xml_from_treenode {tree sep indent map} {
    switch -- [lindex $tree 0] "#text" - "#cdata" {
        return [string map $map [lindex $tree 1]]
    } #comment {
        return <!--[lindex $tree 1]-->
    } #pi {
        return "<?[lindex $tree 1] [lindex $tree 2]?>"
    } TDL:arg {
        set sep {} 
        set indent {} 
    }
    set res <[lindex $tree 0]
    dict for {key val} [lindex $tree 1] {
        append res { } $key =\" [string map $map $val] \"
    }
    if {[llength [lindex $tree 2]]} then {
        set subsep $sep$indent
        append res >
        foreach child [lindex $tree 2] {
            append res $subsep\
                [xml_from_treenode $child $subsep $indent $map]
        }
        append res $sep </[lindex $tree 0]>
    } else {
        append res />
    }
    return $res
}

For example,

% TDLtoXML::main {OMI {/ 3}}
{OMI {} {{{#text} 3}}}
% TDLtoXML::xml_from_trees [TDLtoXML::main {OMI {/ 3}}]
<OMI>
3
</OMI>

To be continued…


AMG: Very exciting stuff! I've made formats like this before, but without acknowledging the parallel to XML. They have worked very well for me. I'm considering using TDL for an upcoming project, but there's a dilemma I'm trying to work out. One use case that's been suggested to me is exporting subsets of the data as CSV that can be edited and reimported. I can't figure out how to do this, or whether it's even possible: is there any way to preserve comments, layout, quote style, and all that other formatting when programmatically manipulating TDL? (Or XML, for that matter.)

Another thing I'm worrying about is generating detailed error diagnostics. My previous code was written in Python (not my choice), and to do it I basically had to implement a subset of the Dodekalogue in Python (no substitution, no expansion). On the plus side, I was able to fairly easily make each word an object tagged with source information (file name and line number), which I used later when printing warnings and errors. What would be a good way to do this in pure Tcl?

See Config file using slave interp for a few successful Tcl implementations I've done, lacking good error prints. Also see grok for what's basically a Tcl port of the Python code I had written. Or don't; it's quite ugly, and it resides in a previously uncharted circle of Quoting hell.

Okay, I've thought about it a little bit more. Another problem that sprang to mind is my interp-based approach's potential for "preprocessing" macros and how to preserve formatting and track line numbers. Oh well, every C programmer knows macros can make errors much more difficult to locate, so I think I'll just throw up my hands and accept this as a natural consequence of metaprogramming. I'm sure preserving formatting would be even more difficult in this context, so perhaps I'll refuse to try to programmatically update data produced by macros.

As for preserving formatting in the first place, perhaps intermix word tokens with formatting tokens denoting whitespace, comments, quote style, etc. This could get ugly, though. I wish for a script interface to Tcl's parser such that it could return this data directly; however, this wiki already has several Tcl implementations of Tcl or Tcl-like parsers. Another approach would be to only produce tokens for words, including start and end character indexes as well as line numbers and file names. The interstitial formatting would be implied by gaps between adjacent words. Hmm. Ideas?

Maybe it's not clear, so I'll say it again. The two reasons I think I want to track the origin of each word are: (1) so the user can be told where the problem is when there's an error, and (2) so the program knows what to edit or delete when it's asked to make a change. Job (1) can also be handled by making the parser do validation internally, so it immediately knows when there's an error and it can just look at its state variables when it's assembling the error message. It can also be handled by making the parser be SAX-style instead of DOM-style, so that it calls into user code for each token or line, thus interleaving parsing and validation/analysis/processing into a single pass, so that the origin state variables are available when an error is detected. Job (2) maybe can be done similarly, except that the user code that's called into is able to return opcodes to keep, delete, or update the token or line it was passed.

Everything's far simpler when the data is stored in a format not intended to be human-editable. If things get too complicated, I'll go to that. It can be human-readable pretty-printed text suitable for diff, but maybe I can't make any attempt to preserve formatting. Or even have it be an SQLite database (or whatever), with the pretty-printed diff'able text as an exportable/importable report format, same as the CSV exports/imports.

Lars H: For locating errors, the -errorline entry in the catch options dictionary may be of use, but I don't know how well that would interact with the use of unknown; you'd have to check and see. As long as evaluating a script in the parser interpreter doesn't have any irreversible side-effects, one can use the following approach for parsing in general:

  1. Try to parse by evaluating in parser interpreter. If that works, then we're done.
  2. Otherwise parse it using something that can report errors with position and then continue (e.g. parsetcl), even if it's comparatively slow. Generate a detailed error report (in the case of parsetcl applied to TDL, every Ne, Sa, Sc, and Sv node marks an error).

A nice feature of TDL is that the parser can know from start which commands have script arguments, and therefore continue to parse these. That's not always obvious in Tcl in general.
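
In code, the first step of that approach might look something like the following (an untested sketch; tdlScript holds the TDL text to parse):

if {[catch {prettyTDL::theinterp eval $tdlScript} msg opts]} then {
    # The return options dictionary records the line within the evaluated
    # script at which the error occurred.
    puts stderr "TDL error on line [dict get $opts -errorline]: $msg"
    # ... fall back to a slower but position-aware parser for a detailed report ...
}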

AMG: I've continued to ponder incorporating TDL into my project. I tried creating an index over the data, such that the index can be queried to discern the non-semantic (syntactic?) formatting, in order to modify the data while preserving most of the formatting, including comments. In the index I designed, finding an index entry given its character position takes logarithmic time (binary search), and finding its parent entries then takes linear time (rewind through previous index entries until the parents are found). Finding a line number given a character position also takes linear time (count the newline tokens). The formatting can be found by looking at the spaces between lines and words. However, this index design is worthless for my purposes because it's totally invalidated whenever any edits are made. Someone might still find it useful, so I'll post it here.

This index covers the second TDL example on this page, the "not so specialized" one. Each line in the following is a list element, so the whole index is a 2D list. It's sorted by ascending subelement #0, which is the character index. Subelement #1 is the length in characters of the object being indexed. This forms a sort of skip list, except that a binary search is needed to find the next sibling element. "{...}" is an abbreviation, so that I don't have to embed newlines in my example.

One more thing: the column of numbers down the right isn't actually part of the index; it just shows the order in which each line of the index gets generated. 0's are generated when the file is read, 1's when the first OMA body is parsed, 2's when the second OMA body is parsed, and 3's when the third OMA body is parsed. Entries from each indexing stage get merged into the existing index, maintaining the sort order.

  0 202 file whatever.tdl                                   0
  0 202 line {OMA {...}}                                    0
  0   3 word OMA                                            0
  4 196 word {...}                                          0
  5   1 newline                                             0
 10  26 line {OMS cd symocat1 name label}                   1
 10   3 word OMS                                            1
 14   2 word cd                                             1
 17   8 word symocat1                                       1
 26   4 word name                                           1
 31   5 word label                                          1
 36   1 newline                                             0
 41  29 line {OMS cd Hopf-algebra name mult}                1
 41   3 word OMS                                            1
 45   2 word cd                                             1
 48  12 word Hopf-algebra                                   1
 61   4 word name                                           1
 66   4 word mult                                           1
 70   1 newline                                             0
 75  40 line {OMA {OMS cd list1 name list; OMV name a}}     1
 75   3 word OMA                                            1
 79  36 word {OMS cd list1 name list; OMV name a}           1
 80  22 line {OMS cd list1 name list}                       2
 80   3 word OMS                                            2
 84   2 word cd                                             2
 87   5 word list1                                          2
 93   4 word name                                           2
 98   4 word list                                           2
104  10 line {OMV name a}                                   2
104   3 word OMV                                            2
108   4 word name                                           2
113   1 word a                                              2
115   1 newline                                             0
120  78 line {OMA {...}}                                    1
120   3 word OMA                                            1
124  74 word {...}                                          1
125   1 newline                                             0
134  22 line {OMS cd list1 name list}                       3
134   3 word OMS                                            3
138   2 word cd                                             3
141   5 word list1                                          3
147   4 word name                                           3
152   4 word list                                           3
156   1 newline                                             0
165  10 line {OMV name b}                                   3
165   3 word OMV                                            3
169   4 word name                                           3
174   1 word b                                              3
175   1 newline                                             0
184  10 line {OMV name c}                                   3
184   3 word OMV                                            3
188   4 word name                                           3
193   1 word c                                              3
194   1 newline                                             0
200   1 newline                                             0

For now, the most workable idea I have is to use a SAX-type interface in multiple passes. The first pass would be to analyze the input to determine what edits to make. The second pass would be to actually make the edits, using state information embedded in the call stack to emit the updated data. This should be fine, since maximum performance is not required. Further experimentation may lead me to identify a better index structure that logs SAX events (that's pretty much what the above index does) yet is resilient in the face of most edits.


SEH: The Wize library's XTL is similar.

AMG: XTL looks similar to TDL, but I think it takes me farther from my goal. Its format is a little more cumbersome to write by hand. Compatibility with XML is not important to me, so I don't need to complicate things just for XML's sake. Thanks for bringing this to our attention, though; maybe it needs a page on this Wiki.

Lars H: Actually, XTL rather seems to be like the data-trees: a representation of XML(ish) data structures in terms of Tcl lists. TDL is the script-like representation. It would probably be a good idea to start a Wiki page XML-list to survey the various list representations of XML data that are in use, though.