Parser Combinators

Discussions recently on this wiki about parsing techniques [L1 ] reminded me that I've been meaning to have a play with defining a simple parser combinator library in Tcl, based on the techniques used in Haskell [L2 ]. After my previous experiments with Monadic TOOT it should be fairly straightforward to translate the monadic parser combinator paper for use in Tcl. It also demonstrates a little more concretely why monads are a cool (if slightly bewildering at first) technique: here, a monad is used to sequence parsers to build larger parsers. A parser in this case is a function from a string to one or more possible parses, each represented as a pair of the parsed info and the remaining unparsed string. The result is an elegant way of building a recursive-descent parser. The code isn't going to win any prizes for efficiency, and is heavily recursive in places, so use with care.
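The core idea translates to any language with first-class functions. Here is a rough Python sketch (purely illustrative; the names `item`, `ret` and `bind` mirror the Tcl code below but are my own): a parser is a function from a string to a list of (value, remainder) pairs, an empty list meaning failure, and `bind` sequences two parsers.

```python
# A parser is a function: str -> list of (value, remaining-input) pairs.
# An empty result list signals a failed parse.

def item(s):
    """Consume the first character, or fail on empty input."""
    return [(s[0], s[1:])] if s else []

def ret(a):
    """Inject a value: succeed without consuming any input."""
    return lambda s: [(a, s)]

def bind(p, f):
    """Sequence: run p, then feed each parsed value into f to get the next parser."""
    return lambda s: [r for (a, rest) in p(s) for r in f(a)(rest)]

# Take the 1st and 3rd characters, discarding the 2nd:
p = bind(item, lambda c:
    bind(item, lambda _:
    bind(item, lambda d: ret((c, d)))))
```

The nested lambdas in `p` are exactly the boilerplate that do-notation sugar tidies up later on.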

NEM 7June2005: I've updated the code a bit to improve performance and give a little more natural sugar (see expression parser at end). The main performance improvement comes from replacing the TOOT-style lambda (elegant, and with automatic GC) with a uniquely-named-proc pseudo-lambda that has no GC, but benefits immensely from byte-compiling. I also adjusted some of the recursive algorithms to be more efficient and added some explicit {*}s in places to avoid the unknown-handler overhead. The final test now takes about 72ms per iteration versus the original 450ms (a 6x speedup!). Still not quite industrial-strength, but may be ok for some purposes. There is also a toy BibTeX parser given at the end.

At present this code is 8.5 only.

# Monadic parser combinator library. Written using [TOOT] techniques, and
# based on the paper "Monadic Parsing in Haskell" Graham Hutton and Erik
# Meijer, Journal of Functional Programming, 8(4):437--444, July 1998.
# See http://www.cs.nott.ac.uk/~gmh/bib.html#pearl
#
# Copyright (c) 2005 Neil Madden ([email protected])
package require Tcl 8.5

# A helper method. This is a lexically-scoped lambda construct. Variables
# to be captured from the lexically-enclosing scope can either be specified
# explicitly by using the form [lambda params statics body], or if the
# statics argument is omitted then all local vars from the current scope
# are captured (actually, snap-shotted) to become part of the (immutable)
# lexical closure of the lambda. If this is gibberish to you, don't panic!
# All it means is that code such as:
#  set a 12
#  set foo [lambda {} { puts "a = $a" }]; $foo
# will do the right thing (i.e. print "a = 12"), instead of complaining
# that a is not a variable.
# This version doesn't use TOOT, but instead is a leaky version that
# creates uniquely named procs, which are never garbage collected. Less
# neat, but improves the performance immensely.
set lambda_id 0
proc lambda {params args} {
    global lambda_id
    if {[llength $args] == 1} {
        set body [lindex $args 0]
        # Get 'em all!
        set statics [uplevel 1 info locals]
    } else {
        foreach {statics body} $args { break }
    }
    set scope {}
    foreach vname $statics {
        # Ignore if it will be shadowed by a param
        # Could use {$vname ni $params} here, but not sure how widespread it
        # is (fairly recent addition to 8.5).
        if {[lsearch -exact $params $vname] == -1} {
            upvar 1 $vname var
            dict set scope $vname $var
        }
    }
    set name lambda[incr lambda_id]
    proc $name $params "
        set __scope [list $scope]
        dict with __scope [list $body]
    "
    return $name
}
# TOOT's auto-expand magic:
if {[llength [info commands ::__toot_unknown]] == 0} {
    rename ::unknown ::__toot_unknown
    proc ::unknown {cmd args} {
        if {[llength $cmd] > 1} {
            #puts "TOOT invoke: $cmd"
            uplevel 1 $cmd $args
        } else {
            uplevel 1 [linsert $args 0 ::__toot_unknown $cmd]
        }
    }
}
# Another little helper -- creates a unified var/command thing.
proc def {name = parser} {
    upvar #0 $name var
    set var $parser
    # Avoid needing to auto-{*} by using a curried alias
    interp alias {} $name {} ::Parser: [lindex $parser 1]
}

# newtype Parser a = Parser (String -> [(a,String)])
# A Parser is a function from strings to a list of token,string pairs which
# represent a parse sequence. Each pair consists of a typed item which is
# the parsed representation, and the remaining unparsed string suffix.
namespace eval Parser {
    namespace export {[a-z]*}
    namespace ensemble create

    # Simple constructor
    proc create {args} {
        list Parser: [uplevel 1 [linsert $args 0 lambda]]
    }

    # Implement the monad interface, which allows us to sequence parsers
    # together allowing for backtracking (actually, due to eager evaluation,
    # all possible parses are explored, but for the given examples this
    # makes little difference).

    # ret        :: a -> Parser a
    # Injects a value into the Parser monad. Returns a parser which when
    # given a string, simply returns the given token and leaves the string
    # untouched. This is called simply "return" in Haskell, but that name
    # is already taken in Tcl, so we use "ret" instead.
    proc ret a {
        create cs { list $a $cs }
    }

    # >>=        :: Parser a -> (a -> Parser b) -> Parser b
    # Creates a parser which is a combination of two other parsers. The
    # resulting parser feeds the input string into the first parser and then
    # tries each possible parse by feeding the resulting suffix strings into
    # the second parser. This is the fundamental operation of monadic
    # programming (the bind/sequencing op).
    proc >>= {p f} {
        create cs {
            set ret [list]
            foreach {a ccs} [{*}$p parse $cs] {
                lappend ret [{*}[$f $a] parse $ccs]
                # Insert a "break" here to only explore first parse result
            }
            # Flatten the resulting list
            join $ret
        }
    }

    # MonadZero instance
    # No-op parser, which simply fails to parse anything.
    variable zero [create cs { list } ]
    proc zero {} { variable zero; return $zero }

    # MonadPlus instance. This is used to combine the results of two parsers
    # (effectively creating a choice between them). This is done by simply
    # concatenating the result lists of the parsers. For instance, if you
    # had a grammar with a production:
    #  Foo ::= Bar | Jim | Bob
    # Then you could code that up as:
    #  def Foo ::= [$Bar | $Jim | $Bob]
    # We use "|", but the paper uses ++
    proc | {args} {
        create cs {
            set ret [list]
            foreach p $args {
                if {$p eq "|" || $p eq "||"} { continue }
                lappend ret {*}[{*}$p parse $cs]
            }
            return $ret
        }
    }

    # Deterministic version of | -- returns only first result
    # Called +++ in the Haskell paper
    proc || {args} {
        create cs {
            foreach p $args {
                if {$p eq "||" || $p eq "|"} { continue }
                set ret [{*}$p parse $cs]
                if {[llength $ret]} {
                    return [lrange $ret 0 1]
                }
            }
            return [list]
        }
    }

    # Just unpack the parser function and apply it to the given input
    # string.
    proc parse {p cs} {
        [lindex $p 1] $cs
    }

    # Type dispatch function -- part of [TOOT]'s magic.
    proc ::Parser: {p args} {
        if {[llength $args]} {
            set method [lindex $args 0]
            uplevel 1 [lreplace $args 0 0 ::Parser::$method [list Parser: $p]]
        } else {
            return [list Parser: $p]
        }
    }
}
# A little syntactic sugar. Does a simple version of Haskell's do
# notation. Converts a script separated by semi-colons into monadic
# sequenced form, e.g.:
#  do { a <- p1; p2; b <- p3; Parser ret [list $a $b] }
# becomes:
#  p1 >>= [lambda a { p2 >>= [lambda _ { p3 >>= [lambda b {
#                     Parser ret [list $a $b]
#  }]}]}]
# This version is a bit more robust than the version on [Monadic TOOT],
# but still cannot handle nested do-scripts. Also, the use of
# semi-colons as a separator char may be a bit subtle given that they
# are usually optional in Tcl.
proc do {script} {
    set eval ""
    set num 0
    foreach line [lrange [split $script \;] 0 end-1] {
        set line [string trim $line]
        if {[string length $line]} {
            if {[regexp {(.*)<-(.*)} $line -> var comp]} {
                append eval "\n \[$comp\] >>= \[lambda $var \{"
            } else {
                append eval "\n \[$line\] >>= \[lambda _ \{"
            }
            incr num
        }
    }
    append eval \n[lindex [split $script \;] end]
    append eval [string repeat "\n\}\]" $num]
    uplevel 1 $eval
}

Now that we have some basic infrastructure in place, let's start writing some actual parsers.

# Simple parser -- consumes first character, if there is one, or fails
# otherwise.
# item :: Parser Char
# item = Parser (\cs -> case cs of
#                                ""        -> []
#                                (c:ccs) -> [(c,ccs)])
def item ::= [Parser create cs {
    if {[string length $cs]} {
        list [string index $cs 0] [string range $cs 1 end]
    } else {
        list
    }
}]

# p :: Parser (Char,Char)
# Takes the 1st and 3rd characters from a string
def p ::= [item >>= [lambda c {
    item >>= [lambda _ {
    item >>= [lambda d {
         Parser ret [list $c $d]
    }]}]}]]
# Same, but using do notation. We will use do notation pretty much
# exclusively from here on, for obvious reasons!
def p2 ::= [do {
    c <- item;
    item;
    d <- item;
    Parser ret [list $c $d]
}]

proc const {a} { lambda b { return $a } }

# sat         :: (Char -> Bool) -> Parser Char
# A combinator which takes a predicate and yields a parser that consumes
# characters only if they satisfy the predicate.
proc sat p {
    do {
        c <- item;
        if {[$p $c]} {
            Parser ret $c
        } else {
            Parser zero
        }
    }
}
# char         :: Char -> Parser Char
# Returns a parser which matches a single character
proc char c { sat [lambda x { string equal $x $c }] }

# String :: String -> Parser String
# Match a specified string - this is an optimised version compared to the
# char by char original version.
proc String s {
    set len [string length $s]
    Parser create cs {
        set r [string range $cs 0 [expr {$len-1}]]
        if {$s eq $r} {
            list $s [string range $cs $len end]
        } else {
            list
        }
    }
}
# Case-insensitive string match
proc StringNC s {
    set len [string length $s]
    Parser create cs {
        set r [string range $cs 0 [expr {$len-1}]]
        if {[string equal -nocase $s $r]} {
            list $s [string range $cs $len end]
        } else {
            list
        }
    }
}

# many        :: Parser a -> Parser [a]
# Kleene-star operator. Applies the given parser 0 or more times.
# Equivalent to * regexp modifier.
proc many p {
    [many1 $p] || [Parser ret [list]]
}
# 1 or more version of above (equivalent to + regexp modifier).
proc many1 p {
    do {
        a <- {*}$p;
        as <- many $p;
        Parser ret [linsert $as 0 $a]
    }
}
# Sugared versions
interp alias {} ::Parser::* {} ::many
interp alias {} ::Parser::+ {} ::many1
# Version which joins the results back into a string:
proc Parser::*s p {
    [[many $p] >>= [lambda xs { Parser ret [join $xs] }]]
}
proc Parser::+s p {
    [[many1 $p] >>= [lambda xs { Parser ret [join $xs] }]]
}
# Repeated applications of parser p, separated by applications of parser sep
# whose result values are thrown away. e.g. sepby [char a] [char ,] will
# match a sequence of 0 or more "a"s separated by commas, such as "a,a,a".
# sepby :: Parser a -> Parser b -> Parser [a]
proc sepby {p sep} {
    # Fall back to an empty parse (as in the paper), so 0 matches succeed
    [sepby1 $p $sep] || [Parser ret [list]]
}
proc sepby1 {p sep} {
    # Simple do notation doesn't handle nesting, so we resort to explicit
    # sequencing for the inner "many" loop in here:
    do {
        a <- {*}$p;
        as <- many [{*}$sep >>= [lambda _ { return $p }]];
        Parser ret [linsert $as 0 $a]
    }
}

# chainl :: Parser a -> Parser (a -> a -> a) -> a -> Parser a
# Parses a sequences of values separated by applications of an operator
# parser which yields an operation which is used to combine values being
# parsed. Like a cross between sepby and foldl.
proc chainl {p op a} {
    [chainl1 $p $op] || [Parser ret $a]
}

proc chainl1 {p op} {
    do {
        a <- {*}$p;
        rest $a $p $op
    }
}
# Helper for chainl1
proc rest {a p op} {
    [do {
         f <- {*}$op;
         b <- {*}$p;
         rest [{*}$f $a $b] $p $op
    }] || [Parser ret $a]
}

All seems to be working OK so far. We'll leave out the chainr/chainr1 parser combinators that the paper also defines; they shouldn't be too difficult to work out. Now we move on to the section on lexical combinators, which shows how we can avoid the usual lexing/parsing distinction by defining combinators to do the lexing.

# Whitespace
proc isSpace {char} { regexp {\s} $char }
def space ::= [[sat isSpace] *]
# Parse a token and discard trailing space
proc token p {
    do {
        a <- {*}$p;
        space;
        Parser ret $a
    }
}
# Parse a symbolic (string) token:
proc symb cs { token [String $cs] }
# Apply a parser, p, discarding any leading space:
# apply :: Parser a -> String -> [(a,String)]
proc apply {p cs} {
    {*}[do { space; {*}$p }] parse $cs
}

The final example in the paper is a simple expression evaluator, which uses the following grammar:

    expr        ::= expr addop term | term
    term        ::= term mulop factor | factor
    factor      ::= number | ( expr )
    digit       ::= 0 | 1 | ... | 9
    number      ::= number digit | digit
    addop       ::= + | -
    mulop       ::= * | /

We have to define these in the reverse order to the paper, so that each definition is in place before it is used. We also extend the grammar and evaluator to handle multi-digit numbers.

# We can be a bit more concise than Haskell here, as we don't have to
# distinguish between "+" the string and "+" the operator, as Everything Is
# A String!
def addop         ::= [[symb +] | [symb -]]
def mulop         ::= [[symb *] | [symb /]]
def digit         ::= [do { x <- token [sat isDigit]; Parser ret $x }]
def number        ::= [do { ds <- [digit +]; Parser ret [join $ds ""] }]
def factor        ::= [number || [do { symb "("; n <- Expr; symb ")";
                                       Parser ret $n }]]
def term          ::= [chainl1 factor mulop]
def Expr          ::= [chainl1 term addop]

# Some helpers:
foreach op {+ - * /} { proc $op {a b} [format {expr {$a %s $b}} $op] }
proc isDigit d { string match {[0-9]} $d }

# And now a little test:
puts " 1 - 2 * 3 + 4 = [apply Expr { 1 - 2 * 3 + 4 }]"
puts "12 * 52 / 64 = [apply Expr {12 * 52 / 64 }]"
puts "time = [time { apply Expr {12 * 52 / 64 } } 20]"
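For comparison, the whole expression evaluator fits in a short self-contained Python sketch. This is not a line-for-line translation of the Tcl above: the names and the regexp-based tokeniser are my own, and `floordiv` stands in for Tcl's integer `/`.

```python
import re
import operator

# Parser: str -> list of (value, rest) pairs; [] means failure.
def ret(a):       return lambda s: [(a, s)]
def bind(p, f):   return lambda s: [r for (a, rest) in p(s) for r in f(a)(rest)]
def choice(p, q): return lambda s: p(s) or q(s)   # deterministic, like ||

def regexp(pattern):
    """Match pattern at the head of the input, skipping leading whitespace."""
    rx = re.compile(r'\s*(' + pattern + r')')
    def parse(s):
        m = rx.match(s)
        return [(m.group(1), s[m.end():])] if m else []
    return parse

def symb(tok):
    return regexp(re.escape(tok))

number = bind(regexp(r'[0-9]+'), lambda ds: ret(int(ds)))

def chainl1(p, op):
    """One or more p's separated by op, combined left-associatively."""
    def rest(a):
        return choice(bind(op, lambda f:
                      bind(p,  lambda b: rest(f(a, b)))),
                      ret(a))
    return bind(p, rest)

addop = choice(bind(symb('+'), lambda _: ret(operator.add)),
               bind(symb('-'), lambda _: ret(operator.sub)))
mulop = choice(bind(symb('*'), lambda _: ret(operator.mul)),
               bind(symb('/'), lambda _: ret(operator.floordiv)))

def factor(s):
    return choice(number,
                  bind(symb('('), lambda _:
                  bind(expr,      lambda n:
                  bind(symb(')'), lambda _: ret(n)))))(s)

def term(s): return chainl1(factor, mulop)(s)
def expr(s): return chainl1(term, addop)(s)
```

As in the Tcl version, `expr("1 - 2 * 3 + 4")` parses left-to-right with the usual precedence, yielding `[(-1, '')]`.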

To me, this is what parsing should be like: elegant and straightforward (once the infrastructure is in place). It'll take quite a bit of work to get it up to "industrial-strength" (like the Parsec library [L3 ] for Haskell). For instance, it takes 450592 microseconds per iteration for that last test on my iBook 800MHz G4! (Update: the new code cuts that down to about 70ms.) Most of that is due to the overhead of TOOT, which involves lots of extra function calls and unknown-command trickery. It'd be an interesting project to see how far this could be taken from fun demo to a useful level of efficiency.

As a further test of the power of these parsers, I thought I'd have a go at recreating a simple BibTeX parser using them. First, for convenience I'll define a parser which matches an arbitrary regular expression (which simplifies the scanning a bit). The actual parser is based on a simplified grammar for BibTeX and will fail on quite a lot of valid input. Still, it shows how simply a parser can be constructed using this technique.
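The regular-expression trick is simple to state in any language with regexp support: anchor the pattern at the start of the input and hand back the match together with the unconsumed suffix. A Python sketch of the same idea (illustrative only, using the usual (value, remaining-input) representation):

```python
import re

def regexp(pattern):
    """Parser matching `pattern` at the start of the input.

    Returns [(match, suffix)] on success, [] on failure.
    """
    rx = re.compile(pattern)
    def parse(s):
        m = rx.match(s)            # re.match anchors at the start of s
        return [(m.group(0), s[m.end():])] if m else []
    return parse
```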

# Parse an arbitrary regular expression
proc Regexp {pattern} {
    Parser create cs {
        if {[regexp "^($pattern)(.*)" $cs -> match rest]} {
            list $match $rest
        } else {
            list
        }
    }
}
# Rough grammar:
#        BibTex     ::= Record*
#        Record     ::= @ Type { Key , Fields }
#        Fields     ::= Field*,
#        Field      ::= Key = BibStr1
#        BibStr1    ::= Key | { BibStr+ }
#        BibStr     ::= [^{}] | { BibStr+ }
#        Key        ::= [^\s,=\{\}]+
#        Type       ::= [^\{]+
def Type           ::= [Regexp {[^\{]+}]
def Key            ::= [token [Regexp {[^\s,=\{\}]+}]]
def BibStr         ::= [[Regexp {[^\{\}]+}] | [do {
                        symb "\{"; s <- [BibStr +s]; symb "\}";
                        Parser ret $s }]]
def BibStr1        ::= [[Key] || [do { symb "\{"; s <- [BibStr +s]; symb "\}";
                                       Parser ret $s }]]
def Field          ::= [do { k <- Key; symb "="; s <- BibStr1;
                             Parser ret [list $k $s] }]
def Fields         ::= [sepby Field [symb ","]]
def Record         ::= [do { symb "@"; t <- Type; symb "\{"; k <- Key; symb ",";
                             f <- Fields; symb "\}"; Parser ret [list $t $k $f] }]
# Apply a parser and invoke callback at end
proc Callback {p c} {
    do {
        res <- $p;
        $c $res
        Parser ret $res
    }
}
# The whole thing
def BibTeX ::= [[Callback Record PrintRecord] *]

proc PrintRecord {record} {
    foreach {type key fields} $record { break }
    puts "${type}: $key"
    foreach field $fields {
        puts "  [join $field { = }]"
    }
}
# A couple of records picked at random from my BibTeX database:
set bibtex {
@Article{Dennett/Kinsbourne:95a,
    author =    {Daniel C. Dennett and Marcel Kinsbourne},
    title =     {Multiple Drafts: An eternal golden braid?},
    journal =   {Behavioral and Brain Sciences},
    volume =    18,
    number =    4,
    year =      1995,
    pages =     {810--811}
}
@Book{Mitchell:93a,
    author =    {Melanie Mitchell},
    title =     {Analogy-Making as Perception: a computer model},
    year =      {1993},
    publisher = {{MIT} Press},
    address =   {Cambridge, {MA}}
}
}
set t [time { apply BibTeX $bibtex }]
puts "Parsed in $t"

It takes just under a second to parse and display those two records on my laptop -- still some work to do...


jima 2010-12-16

Just wanted to put here a link to some recent developments in the parsing world: [L4 ]. The post applies the concept of the derivative of a language to the rules of a CFG, and to the parser combinators that handle it. Some implementations (Scala, Racket, ...) are also provided.