unification

Unification is used in Logic programming to find one or more instantiations of the free variables in a set of logical formulas that satisfies those formulas. See [L1 ] for a brief overview. Pattern matching (as often used in Functional Programming) is a special case of unification, where variables are only allowed in one of the values being unified; the pattern.

NEM The core unification algorithm is actually quite simple, and is a classic piece of code to know. Here I'll present a general unification algorithm that works over many different types of data structures.

In order to perform unification of two values, we need to know something about their structure. For instance, to unify two lists we would traverse each in order attempting to unify each corresponding element. However, to unify two dicts we would traverse each and compare corresponding key/value pairs. This information about the structure of values is type information. As Tcl has no built-in notion of types, we have to explicitly pass this information in. The core of our unification algorithm looks as follows:

proc unify {type a b {env ""}} {
    if {[var? $a]} {
        unify-var $type $a $b $env
    } elseif {[var? $b]} {
        unify-var $type $b $a $env
    } else {
        invoke $type $a $b $env
    }
}

This basic algorithm takes a type-function (we'll descibe in a moment), two values to be unified, and an optional initial environment dictionary, env. This dict will be used to hold possible variable bindings as we traverse the values. If the two values can be successfully unified then the environment dictionary is returned, containing the bindings of variables to values. Otherwise, an error is thrown. The basic unifier checks to see if either of the values is a variable, and if so attempts to unify that variable with the other value in the context of the environment. This is done by the unify-var procedure:

# Variables are single words that start with a question mark, e.g. ?name
proc var? item { regexp {^\?\S*$} $item }
proc unify-var {type var other env} {
    if {[dict exists $env $var]} {
        unify $type [dict get $env $var] $other $env
    } elseif {[dict exists $env $other]} {
        unify $type $var [dict get $env $other] $env
    } elseif {$var eq "?"} {
        return $env
    } else {
        occurs-check $var $other
        return [dict set env $var $other]
    }
}

Again, this code is fairly straight-forward. If the environment already contains a mapping for this variable, then we lookup the value and retry unification. If not, then there are a number of options. Firstly, we can check if the other value is also a variable and is already bound in the environment (second branch of the if). If so, then we again lookup the value and retry unification. If neither value is bound, then we create a new binding for the variable in the environment with the other value. We also allow a special wild-card variable "?", which matches everything without forming a binding. This is useful when we don't care about some part of a structure.

The only missing piece of this variable unification is the "occurs-check" call. When we create a new binding in the environment, we need to make sure that the variable we are binding is not itself present in the value we are binding it to. As a value cannot contain itself, this would form an inconsistent binding, and so unification should fail. For instance, trying to unify ?foo with {a b ?foo} should fail as it would otherwise create an impossible structure. This occurs-check can be expensive, so Prolog and other logic programming languages tend to leave it out. (Actually, some versions of Prolog would create a cyclic data-structure for the example unification above. This is illegal in first-order logic, however). This is what we will do here, too. If you were writing a theorem prover or other application where soundness was more important than efficiency, then you should put this occurs-check back in.

proc occurs-check {var other} {}
# A crude implementation:
# proc occurs-check {var other} { if {[string first $var $other] >= 0} { fail } }

Our basic framework is pretty much complete. We just need a couple of extra utilities before we move on to individual type-functions. "invoke" is a simple helper to invoke a type function. It takes care of expanding the function into individual words and invoking at the global scope. "fail" is a special exception type to indicate that unification has failed. We do not handle this exception in unification, but let it propagate to callers. A Prolog implementation could catch this error and trigger back-tracking (i.e., trying a different rule).

proc invoke {cmd args} { uplevel #0 $cmd $args }
proc fail {} { return -code error "unification failure" }

We could use a custom return code (e.g., 3245 which is "fail" on a telephone keypad) if we wished to avoid the overhead of an error (which builds a stack-trace as it propagates), but we'll keep things simple for now.

We now must deal with describing the structure of individual types. Each type is represented by a function that takes two values of that type and an environment, and performs unification over those values. We will name our type-functions with an initial capital letter to distinguish them from the commands that manipulate those types. Let's start with the simplest type, Tcl's strings:

proc String {a b env} {
    if {$a ne $b} { fail }
    return $env
}

This simply compares two strings for equality. If they are not equal then we fail, otherwise we return the environment as it is. We can now test our unification function on simple strings:

% unify String hello world
unification failure
% unify String hello hello
% unify String hello ?s
?s hello
% unify String ?s hello
?s hello

Great! Everything seems to be working. We can define a type for numbers (both integers and reals) just as easily:

proc Num {a b env} {
    if {$a != $b} { fail }
    return $env
}

But what about lists and dictionaries? These are slightly more complicated, as they can contain elements of different types -- they are polymorphic data structures. In order to fully describe the type of these structures, we can further parameterise the type-functions with the types of their elements. We assume, for simplicity, that lists are homogenous; all elements are of the same type:

proc List {elemtype as bs env} {
    if {[llength $as] != [llength $bs]} { fail }
    foreach a $as b $bs {
        set env [unify $elemtype $a $b $env]
    }
    return $env
}

Here we run down each element of the lists and call "unify" recursively to see if they match. We pass "unify" the expected element type, which is an argument to the List function. So where does this element type come from? We simply pass in a compound type description when we call unify:

% unify {List Num} {1 2 3 ?d} {1 ?b 3.0 4}
?b 2 ?d 4

The "invoke" procedure we defined earlier makes sure that when we call the type-function, it is expanded into:

 invoke {List Num} $a $b $env => List Num $a $b $env

And now we can start to see how powerful unification is. We can also define a type of Tuples which are heterogenous lists, but we need to specify the type of each element:

proc Tuple {types as bs env} {
    if {[llength $types] != [llength $as] ||
        [llength $as]    != [llength $bs]} { fail }
    foreach type $types a $as b $bs {
        set env [unify $type $a $b $env]
    }
    return $env
}

We can then define dictionaries in a similar manner:

proc Dict {types as bs env} {
    if {[dict size $types] != [dict size $as] ||
        [dict size $as]    != [dict size $bs]} { fail }
    dict for {key type} $types {
        if {![dict exists $as $key] || ![dict exists $bs $key]} { fail }
        set env [unify $type [dict get $as $key] \
                             [dict get $bs $key] $env]
    }
    return $env
}

Here we parameterise our Dict type with a dictionary describing the types of each key/value mapping. We then check that each value has precisely the same mappings and try to unify each individual. More flexible schemes are possible; for instance, we could allow each dictionary to contain extra keys which are not in our type mapping -- structural sub-typing, but I'll leave that as an exercise.

We can now perform unification over very complex structures:

# Simple aliases:
proc def {name = args} {
    interp alias {} $name {} {*}$args
}
# Create a type for people:
def Person = Dict {
    name       String
    age        Num
    birthday   {Tuple {Num String}}
}
set neil {
    age        25
    name       "Neil Madden"
    birthday   {27 October}
}
array set ans [unify Person $neil {
    name     ?name 
    birthday {? ?month} 
    age      ?
}]
puts "$ans(?name)'s birthday is in $ans(?month)"

As you can see, unification forms the basis of some powerful structural querying, which we can use in a little database with unification.

There is still a piece of the puzzle missing however. When we have performed unification, we end up with an environment dictionary that maps variable names to the values they unified with. However, variables may unify with other variables, and with structures that contain variables. It is not always possible to resolve these variable references during unification as some variables may be unbound. In order to fix this, we can use a lookup procedure which takes care to resolve any free variables from the environment:

proc resolve {env term} {
     while {[var? $term]} { set term [dict get $env $term] }
     if {[llength $term] > 1} {
         set ret [list]
         foreach item $term {
             lappend ret [resolve $env $item]
         }
         return $ret
     }
     return $term
}

This allows us to resolve some quite tricky cases in unification. For instance, consider the following:

def ComplexType = Tuple {String {List String} String}
set res [unify ComplexType {f {g ?a} ?a} {f ?b abc}]
dict for {var term} $res {
    puts "$var = [resolve $res $term]"
}

This prints

?a = abc
?b = g abc

Which is indeed the correct unification. Note that our resolve procedure makes some assumptions about the layout of the data it gets. It should really be parameterised over types like our unification algorithm, but it should work for all our types so far. Also note that this is where the lack of an occurs-check can cause trouble: the resolve procedure will go into an infinite loop if any variable is bound to a term which contains itself.

In the current code above, we assume that each element is known to be of the correct type before we attempt unification. We can remove this assumption by tagging data with an indication of the type and then checking that the types match at the start of "unify" (simply "fail" if they don't match). I'll leave this as another exercise for the curious reader, as well as coming up with a definition of when two types match (e.g., taking into account sub-typing etc).