A simple database

Summary

Richard Suchenwirth 2001-05-23: There are lots of complex databases around. Here I want to explore how a database can be implemented in the Tcl spirit of simplicity, and how far that approach takes us.

Description

Consider the following model:

A database is a set of records
A record is a nonempty set of fields with a unique ID
A field is a pair of tag and nonempty value, both being strings

Fields

Fields may well be implemented as array entries, so we could have an array per record, or better one array for the whole database, where the key is composed of ID and tag. Unique IDs can be had by just counting up (incrementing the highest ID so far). The process of creating a simple database consists only of setting an initial value for the ID:

set db(lastid) 0

Let's consider a library application for an example. Adding a book to the database can be simply done by

set id [incr db(lastid)]
set db($id,author) "Shakespeare, William"
set db($id,title) "The Tempest"
set db($id,printed) 1962
set db($id,label) S321-001

Note that, as we never specified what fields a record shall contain, we can add whatever we see fit. For easier handling, it's a good idea to classify records somehow (we'll want to store more than books), so we add

set db($id,isa) book

Retrieving a record is as easy as this (though the fields come in undefined order):

array get db $id,*

and deleting a record is only slightly more convolved:

foreach i [array names db $id,*] {unset db($i)}

or, even easier and faster from Tcl 8.3 on:

array unset db $id,*

Here's how to get a "column", all fields of a given tag:

array get db *,title

But real columns may have empty fields, which we don't want to store. Retrieving fields that may not physically exist needs a tolerant access function:

proc db'get {_db id field} {
   upvar $_db db
   if {[array names db $id,$field]=="$id,$field"} {
       return $db($id,$field)
   } else {return ""}
} ;# thanks to Michel Salvagniac for fixing this!

In a classical database we have to define tables: which fields of what type and of which width. Here we can do what we want, even retrieve which fields we have used so far (using a temporary array to keep track of field names):

proc db'fields {_db} {
    upvar $_db db
    foreach i [array names db *,*] {
       set tmp([lindex [split $i ,] 1]) ""
    }
    lsort [array names tmp]
}

Searching for records that meet a certain condition can be done sequentially. For instance, we want all books printed before 1980:

foreach i [array names db *,printed] {
    if {$db($i)<1980} {
        set id [lindex [split $i ,] 0]
        puts "[db'get db $id author]: [db'get db $id title] $db($i)"
    }
}

We might also store our patrons in the same database (here in a different style):

set i [incr $db(lastid)]
array set db [list $i,name "John F. Smith" $i,tel (123)456-7890 $i,isa  patron}

Without a concept of "tables", we can now introduce structures like in relational databases. Assume John Smith borrows "The Tempest". We have the patron's and book's ID in variables and do double bookkeeping:

lappend db($patron,borrowed) $book ;# might have borrowed other books
set db($book,borrower) $patron
set db($book,dueback) 2001-06-12

When he returns the book, the process is reversed:

set pos [lsearch $db($patron,borrowed) $book]
set db($patron,borrowed) [lreplace $db($patron,borrowed) $pos $pos]
unset db($book,borrower) ;# we're not interested in empty fields
unset db($book,dueback)

The dueback field (%Y-%M-%d format is good for sorting and comparing) is useful for checking whether books have not been returned in time:

set today [clock format [clock seconds] -format %Y-%M-%d]]
foreach i [array names db *,dueback] {
    if {$db($i)<$today} {
        set book [lindex [split $i ,] 0] ;# or: set book [idof $i] - see below
        set patron $db($book,borrower)
        #write a letter
        puts "Dear $db($patron,name), "
        puts "please return $db($book,title) which was due on\
        $db($book,dueback)"
    }
}

Likewise, parts of the accounting (e.g. orders to, and bills from, booksellers) can be added with little effort, and cross-related also to external files (just set the value to the filename).

Indexes

As shown, we can retrieve all data by sequential searching over array names. But if the database grows in size, it's a good idea to create indexes which cross-reference tags and values to IDs. For instance, here's how to make an authors' index in four lines:

foreach i [array names db *,author] {
    set book [lindex [split $i ,] 0]
    lappend db(author=[string toupper $db($i)]) $book
} ;# and then..
foreach i [lsort [array names db author=SHAK*]] {
    puts "[lindex [split $i =] 1]:" ;# could be wrapped as 'valueof'
    foreach id $db($i) {
        puts "[db'get db $id title] - [db'get db $id label]"
    }
}

gives us a books list of all authors matching the given glob pattern (we reuse Tcl's functionality, instead of reinventing it...). Indexes are useful for repeated information that is likely to be searched. Especially, indexing the isa field allows iterating over "tables" (which we still don't explicitly have! ;-):

regsub -all isa= [array names db isa=*] "" tables
foreach patron $db(isa=patron) {...}

And beyond industry-standard SQL, we can search multiple indices in one query:

array names db *=*MARK*

gives you all (case-independent) occurrences of MARK, be it in patron's names, book's authors or titles. As versatile as good old grep...

Persistence

Databases are supposed to exist between sessions, so here's how to save a database to a file:

set fp [open Library.db w]
puts $fp [list array set db [array get db]]
close $fp

and loading a database is even easier (on re-loading, better unset the array before):

source Library.db

If you use characters outside your system encoding (no problem to write Japanese book titles in Kanji), you'll have to fconfigure (e.g -encoding utf-8) on saving and loading, but that's just a few more LOC. Saving also goes a good way to what is ceremonially called "committing" (you'll need write-locking for multi-user systems), while loading (without saving before) might be called a "one-level rollback", where you want to discard your latest changes.

Notice that so far we have only defined one short proc, all other operations were done with built-in Tcl commands only. For clearer code, it is advisable to factor out frequent operations into procs, e.g.

proc idof {index} {lindex [split $index ,] 0}
proc db'add {_db data} {
    upvar $_db db
    set id [incr db(lastid)]
    foreach {tag value} $data {set db($id,$tag) $value}
    # might also update indexes here
}
proc db'tablerow {_db id tags} {
    upvar $_db db
    set res {}
    foreach tag $tags {lappend res [db'get db $id $tag]}
    set res
}

Of course, with growing databases we may reach memory limits: arrays need some extra storage for administration. On the other hand, the present approach is pretty economic, since it does not use field widths (all strings are "shrink-wrapped"), and omits empty fields, while at the same time allowing to add whatever fields you wish. A further optimization could be to tally value strings, and replace the frequent ones with "@$id", where db(@$id) holds the value once, and only db'get has to be adapted to redirect the query.

Also, memory limits on modern computers are somewhere up high... so only at some time in the future you might have (but maybe not want) to change to a complex database ;-)

On the limits

Tcl arrays may get quite large (one app was reported to store 800000 keys in Greek characters), and at some point enumerating all keys with array names db (which produces one long list) may exceed your available memory, causing the process to swap. In that situation, you can fall back to the (otherwise slower, and uglier) use of a dedicated iterator:

set search [array startsearch db]
while {[array anymore db $search]} {
    set key [array nextelement db $search]
    # now do something with db($key) - but see below!
}
array donesearch db $search

But neither can you filter the keys you will get with a glob pattern, nor may you add or delete array elements in the loop - the search will be immediately terminated.

Another approach for databases with a large number of relatively short records is to keep all fields, as pairlist, in one element:

set db($id) {fnm Mary lnm Smith tel (123)234-4567 isa patron}

For processing a record, you'd just use into a temporary array:

array set tmp $db($id)
# do modifications...
set db($id) [array get tmp]

Full indexing in that approach would go like this:

foreach {tag value} $db($id) {lappend db($tag=[string toupper $value]) $id}

Discussion

davidw: Very nice - I wonder if we could optionally add mutexes from tcl's threading model to make sure that only one thing at a time is writing - or do we even need that?

RS: With only one interpreter per thread, no two threads can access the same variable, I think.

davidw: Aren't some of these commands vulnerable to keys with commas in them?

RS: Sure. You can pick any delimiter, and require that it does not appear in keys. Just commas look so tidy... :)

ZB: You're right - but the comma sometimes can be misunderstood as suggestion, that there are 2 parameters (2-dim. array) in the parentheses - while in fact there's just one; one "element name". Perhaps underscore could be more "unambiguous"?

RS: Well, that's the point. Although arrays take one key which is a string, it is perfectly ok to have that string look like multi-dimensional, with commas as separators:

set arr($x,$y,$z) 1

escargo 2004-05-18: I find the third premise to be potentially false. That is, the value might be empty.

A database is a set of records
A record is a nonempty set of fields with a unique ID
A field is a pair of tag and nonempty value, both being strings

There is also an open question about uniformity:

Do all records in the database have the same set of fields?

I think when a database is being created that there might be a point in time where records do not yet have any fields defined. That violates premise 2.

jcw: This approach loads all data on startup - which may not satisfy everyone's definition of a "database". Though it's delightfully simple and a good match for Tcl.

RS: OK, in view of the database NULL debate, I'm willing to withdraw the nonempty words :) My idea was that a non-existing field would return {}, so fields with empty content could just be deleted, to reduce memory needs. And as the text further on shows, any records can have any content - record uniformity is left to the user, if he so wishes.

DKF: I'd note that a set of records with a common pattern of relations is usually considered to be a table; a database is a collection of tables. (Well, for a relational DB; non-relational DBs are… different.)

Arrays / Hash Maps

Category Database

Category Embedded Database