grammar::peg

grammar::peg can be used to create and manipulate parsing expression grammars. Superseded by pt::peg::container

Description

Parsing expression grammers are similar but not equivalent to Context-Free Grammars, which may be familiar from BNF syntax specifications. Surprisingly, PEGs are both faster when matching (linear time as opposed to cubic time) and more powerful (at least no language has been shown context-free but not parsing expression) than CFGs.

Documentation:

official reference for grammar::peg
the more venerable peg package
reference for grammar::peg::interp
a companion to grammar::peg

Reference

Wikipedia
an overview of parsing expression grammars.
Parsing Expression Grammars (Part 1 of N) , Tim Davis, 2014-07-25
includes examples of grammar::peg usage

See Also

PEG Example
dkf 2013-12-03 16:54:07: I've written a new example that appears to work, and shows the clear difference between parsing (which the parser tools handle), compilation, and evaluation.

Examples

2008-03-09: I hope the following may help others who are trying to use the "grammar::peg" package. These were tested under "etcl" (Evolane Tcl/Tk Engine Version 1.0-rc26).

1% # For now, just load the 'grammar::peg^ package
2% package require grammar::peg
0.1

3% #initialize a new (abeit empty) grammar.  The start symbol will be, by default, 'epsilon^.
4% grammar::peg myGrammar
::myGrammar

5% # add a nonterminal to it
6% ::myGrammar nonterminal add myDigit {/ {t 0} {t 1} {t 2}}

7% # show its serialization
8% ::myGrammar serialize
grammar::pegc {myDigit {/ {t 0} {t 1} {t 2}}} {myDigit value} epsilon

9% # change the start symbol to the nonterminal 'myDigit^
10% ::myGrammar start {/ {n myDigit}}

11% # show its serialization
12% ::myGrammar serialize
grammar::pegc {myDigit {/ {t 0} {t 1} {t 2}}} {myDigit value} {/ {n myDigit}}

13% # Now we show how you could, instead, have started with a list representing
14% # the grammar. Since "myGrammar" exists, we will make "myGrammarFromList".
15% package require struct::list
1.6.1

16% # (Incidentally, Andreas Kupries states: 'struct::list' is a package in Tcllib, like grammar::peg, and
17% #                  guesses that grammar::peg forgot to 'package require' it.)
18% set mySerialization [list grammar::pegc \
> {myDigit {/ {t 0} {t 1} {t 2}}} {myDigit value} {/ {n myDigit}} \
> ]
grammar::pegc {myDigit {/ {t 0} {t 1} {t 2}}} {myDigit value} {/ {n myDigit}}

19% grammar::peg myGrammarFromList deserialize $mySerialization
::myGrammarFromList

20% # Next we attempt to use (interpret) the grammar on an input string.
21% #Load all relevant packages (this may be overkill)
22% set pattern {grammar::me|grammar::peg}; foreach packageName [package names] {if [regexp $pattern $packageName] {puts "[catch {package require $packageName}]  $packageName"}}
0  grammar::peg::interp
0  grammar::peg
0  grammar::me::tcl
0  grammar::me::cpu
0  grammar::me::util
0  grammar::me::cpu::core
0  grammar::me::cpu::gasm

23% # Initialize the interpreter.
24% grammar::peg::interp::setup ::myGrammar
25% # As of 3-9-8, that's as far as I have gotten.  The documentation at http://tcllib.sourceforge.net/doc/peg_interp.html
26% # suggests using 
27% #                            ::grammar::peg::interp::parse nextcmd errorvar astvar 
28% #" which "interprets the loaded grammar and tries to match it against the stream of characters represented by the command prefix nextcmd".
29% # so I put the string "102" in "myFile.txt" and tried the following, but, as you can see, was unsuccessful.

30% set f [open c:/myFile.txt r+]
file37bf6c8
31% chan configure $f -encoding cp1252
32% set offset 0
0
33% ::grammar::peg::interp::parse $f myErrorVar myAstVar
invalid command name "file37bf6c8"
34% chan close $f

JBR: I've made my own attempt to create a grammar::peg example. Maybe someone who knows can make this example work. Thanks.

#!/usr/bin/env tclsh8.6

package require grammar::peg
package require grammar::peg::interp

proc parse-string string {
    coroutine next-char apply {string {
        yield
        set i 1
        foreach ch [split $string {}] {
            yield [list $ch 1 1 $i]
            incr i
        }
        while 1 {yield {}}
    }} $string
}

::grammar::peg parser

parser nonterminal add Digit {/ {t 1} {t 2} {t 3} {t 4} {t 5}}
parser nonterminal add Int   {+ {n Digit}}
parser start {n Int}

parse-string 54321

grammar::peg::interp::setup parser
grammar::peg::interp::parse next-char err ast

When run I get this:

wrong # args: should be "ict_match_token tok msg"
    while executing
"ict_match_token "Expected $ch""
    (procedure "MatchExpr" line 40)
    invoked from within
"MatchExpr $e"
    (procedure "MatchExpr" line 236)
    invoked from within
"MatchExpr $ru($nt)"
    (procedure "MatchExpr" line 75)
    invoked from within
"MatchExpr $sub"
    (procedure "MatchExpr" line 159)
    invoked from within
"MatchExpr $ru($nt)"
    (procedure "MatchExpr" line 75)
    invoked from within
"MatchExpr $se"
    (procedure "grammar::peg::interp::parse" line 9)
    invoked from within
"grammar::peg::interp::parse next-char err ast"
    (file "./ISBL" line 30)

This appears to be a bug in grammar::peg::interp I changed line 116 of peg_interp.tcl to:

ict_match_token $ch "Expected $ch"

tomas 2010-01-25: This bug is fixed from release 1.13 onwards (thanks, AK, you are my hero).

AK: The above sentence is a bit ambiguous. To clarify, the bug fix is not in Tcllib 1.13. It is in the CVS, just after the release, and will be found in the next release after 1.13. That being said, the 'Parser Tools' supercede grammar_peg and grammar_me.

Then changing my coroutine to return EOF multiple times after the input is exhausted the string parses returning:

ALL 0 4 {Int 0 4 {Digit 0 0 {{} 0 0}} {Digit 1 1 {{} 1 1}} {Digit 2 2 {{} 2 2}} {Digit 3 3 {{} 3 3}} {Digit 4 4 {{} 4 4}}}

Now to learn how to do something useful with this. The input tokens don't appear to be represented in the output AST?


AK 2011-01-21 14:40:21:

Regarding the representation of input tokens. No, they are not represented directly. The numbers in the AST tell you the character range covered by the symbol, as offsets from the beginning of the string. This allows you to extract the lexemes/token from the input string. See http://docs.activestate.com/activetcl/8.5/tcllib/pt/pt_parser_api.html for an example of the tree and its contents.


tomas 2011-01-23:

Here's a more complete example. It tries to implement the "classical" expression grammar, with (integer decimal) numbers, the four arithmetical operations.

To note:

  • PEGs look a lot like BNF. They are not the same (as noted in the comments). See this wikipedia page for a discussion. I still rode on this similarity to make clear in the comments what the PEG statements "mean".
  • I stumbled upon the bug mentioned above and tried to file in the SourceForge bug tracker
  • The program tries to visualize what's going on by printing the AST's structure with the (recursive) function print_nested.
  • Grokking whitespace is left as an exercise for the reader :-)

Comments very welcome!

#!/usr/bin/tclsh

# peg: playing around with peg
package require grammar::peg
package require grammar::me::tcl
package require grammar::peg::interp

grammar::peg expression

# PEGs ain't BNF, but the similarities are striking
# One important difference: the choices {/ ...} are _ordered_
# choices; no backtracking in * and +. See
#  <http://en.wikipedia.org/wiki/Parsing_expression_grammar>
# for te gory details

# Therefore the comments evoking BNF are just to tickle the
# reader's gentle associative memory (or was it the associative's
# reader gentle memory? Ah -- you know what I mean.

# Sign ::= '+' | '-'
expression nonterminal add Sign {/ {t +} {t -}}
# Digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
expression nonterminal add Digit {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}}
# Addop ::= '+' | '-'
expression nonterminal add Addop {/ {t +} {t -}}
# Mulop ::= '*' | '/'
expression nonterminal add Mulop {/ {t *} {t /}}

# Number ::= Sign? Digit+
expression nonterminal add Number {x {? {n Sign}} {+ {n Digit}}}
# Factor ::= '(' Expression ')' | Number
expression nonterminal add Factor {/ {x {t (} {n Expression} {t )}} {n Number}}
# Term ::= Factor ( Mulop Factor )*
expression nonterminal add Term {x {n Factor} {* {x {n Mulop} {n Factor}}}}
# Expression ::= Term ( Addop Term )*
expression nonterminal add Expression {x {n Term} {* {x {n Addop} {n Term}}}}
# Start here:
expression start {n Expression}

puts [expression serialize]

grammar::peg::interp::setup expression

set i -1
# The expression to be parsed. No whitespaces!
set src {1024+256}

proc nexttok {} {
  global src
  global i
  return [list [string index $src [incr i]] {} $i 0]
}

set errs {}
set ast {}

grammar::peg::interp::parse nexttok errs ast

puts "errs = $errs"
puts "ast = $ast"

proc print_nested {ast {prefix {}}} {
  global src
  puts "$prefix [lindex $ast 0]: [
    string range $src [lindex $ast 1] [lindex $ast 2]]"
  foreach sub [lrange $ast 3 end] {
    print_nested $sub "$prefix  "
  }
}

print_nested $ast

YOSIFOV: My example with human-readable PEG definition and without import from file:

package require starkit
starkit::startup
package require pt::peg::from::peg
package require pt::peg::container
package require pt::peg::interp

set PEG {
PEG myapp (Exp)
    IfKw       <- 'i' 'f'                               ;
    ThenKw     <- 't' 'h' 'e' 'n'                       ;
    ElseKw     <- 'e' 'l' 's' 'e'                       ;
    OSp        <- ('\t' / '\n' / ' ')?                  ;
    Sp         <- ('\t' / '\n' / ' ')+                  ;
    Block      <- (Exp / 't') &';'                      ;
    Cond       <- 'c' / 'o'                             ;
    If         <- IfKw Sp Cond OSp ':' OSp Block        ;
    Exp        <- If                                    ;
END;
}

set CPEG [pt::peg::from::peg convert $PEG]
pt::peg::container cont
cont deserialize = $CPEG
set src "if c: if o: t;"
pt::peg::interp myint
myint use cont
puts [myint parset $src]

Has bugs, I suppose, but shows main trick :) Parses nested IF (see $src): CONDITION is 'c' or 'o', BLOCK is EXP or 't'. See more HERE