GSoC Idea: Parser For expr Command

Parser For expr Command


Areas	Language parsing, computer arithmetic
Good if student knows	variable, based on interests
Priority	Low
Difficulty	low to medium
Benefits to the student	introduction to grammar and parsing concepts
Benefits to Tcl	Leverage ease of use of existing tools
Mentor	Steve Huntley

Project Description

Perl enables arbitrary precision math by means of a pragma, which overloads the standard math operators and thus transparently accepts large integers as arguments for mathematical expressions. In this way Perl is able to do not only large integer math transparently as Tcl does, but also floating-point, rational number, vector, etc. calculations [L1 ] [L2 ].

This approach could not work with Tcl, since it does not have math operators per se, but instead has the command expr with its own syntax for math expressions. However, it should be straightforward to overload the expr command itself; with a command that is able to parse existing valid math expressions, and can be expanded to accept a wider range of operations.

The goal of this project would be to write a selection of parsers/lexers for mathematical expressions, and incorporate them into a replacement for the expr command, thus allowing for transparent use of a wider range of numerical types (similar to Perl's usage), and experiments on new ways to execute expressions that are currently valid, such as:

compiling to C statements and executing via critcl
compiling to bytecode using methods outlined by KBK [L3 ]
automatic interfacing to graphing functions

The exact mix of features would be dependent on the skills and interests of the student.

Existing pure-Tcl parsing tools such as Yeti or taccle should be suitable for the task. Evaluation of parser features and selection of tools would be part of the project.

Requirements:

flexible

References

vexpr - An expr for vectors
lexpr
Shunting yard algorithm
modmath [L4 ]
tclparser, part of the TclPro suite has a C-coded extension to interface with Tcl's native parser. The parse expr subcommand parses expressions.

See: parsing expressions, Creating your own expr command for code that seems largely to have accomplished this goal.

Nsync also contains an expr parser.

Discussion

SEH -- In previous years, Tcl mentors have received criticism from Google that idea pages were messy, discursive and confusing to prospective students. Thus I am trying to keep all comments below the collapsible discussion header.

JBR - I've had very good experiences with grammar::peg which is included in tcllib. Look for my examples posted on the wiki.

Larry Smith: incorporate the Tcl expr patch allowing us to eliminate the dereferencing operator and outer braces? Please? Finally? allowing [ expr x*x ] instead of [ expr {$x*$x} ]

AMG: There are a lot of syntax conflicts which require variable names to be preceded by $. As for removing the outer braces, you can do so currently, though it is unsafe and slow.

However, these two changes at the same time would avoid part of the safety problem, since [expr] (not the Tcl parser) would be performing the variable substitution. But the speed problem would remain. So long as there are any spaces in the expression, there's more than one argument to [expr], and it has to [concat] them, which makes it impossible to bytecode the expression.

: Larry Smith Why should this be the case? Granted the current implementation does have this limitation, but it seems to me a SMOP.

: AMG: Multi-argument [expr] has no persistent Tcl_Obj in which to store the compiled form of the expression. [concat]'s result is a temporary Tcl_Obj, so it can't hold it. True, the bytecode compilation engine may still be able to convert the expression to bytecode which lives inside the proc body's Tcl_Obj, but this particular case has not been implemented because multi-argument [expr] is discouraged (brace your expr-essions).

: DKF: Bareword support is something we might do in the future — the main ambiguity is between function calls and array dereferences — but it's been postponed because it makes the parser significantly more complex (in particular, we were thinking of doing it for assignment support). The performance problem comes when Tcl determines that the argument to expr is non-constant. Constant values (of which braced ones are just the simplest) are compiled efficiently as they can be fully parsed ahead of time, non-constants can't and aren't. However, x*x could be a constant just fine. :-)

Larry Smith It's not encouraged because it's not implemented and it's not implemented because it is not encouraged? The idea should be to make Tcl's expr syntax simpler, more flexible, easier to use and easier to read. Restricting it to one argument is counterproductive of these goals and silly if we just implement your suggested method.

AMG: It's discouraged for two reasons: performance and security. Improving the performance is counterproductive if it only serves to encourage insecure coding practices. Your proposal to make [expr] substitute variables lacking a leading $ helps to alleviate the security concern in most cases, but bracket substitution remains a problem.

% tcl::unsupported::disassemble lambda {{} {expr 2+2}}
ByteCode 0x0174E2A8, refCt 1, epoch 4, interp 0x016144F8 (epoch 4)
  Source "expr 2+2"
  Cmds 1, src 8, inst 3, litObjs 1, aux 0, stkDepth 1, code/src 0.00
  Proc 0x015D6168, refCt 1, args 0, compiled locals 0
  Commands 1:
      1: pc 0-1, src 0-7
  Command 1: "expr 2+2"
    (0) push1 0         # "4"
    (2) done 

% tcl::unsupported::disassemble lambda {{} {expr 2 + 2}}
ByteCode 0x0174E6A8, refCt 1, epoch 4, interp 0x016144F8 (epoch 4)
  Source "expr 2 + 2"
  Cmds 1, src 10, inst 14, litObjs 3, aux 0, stkDepth 5, code/src 0.00
  Proc 0x015D6568, refCt 1, args 0, compiled locals 0
  Commands 1:
      1: pc 0-12, src 0-9
  Command 1: "expr 2 + 2"
    (0) push1 0         # "2"
    (2) push1 1         # " "
    (4) push1 2         # "+"
    (6) push1 1         # " "
    (8) push1 0         # "2"
    (10) concat1 5 
    (12) exprStk 
    (13) done

AMG: The remaining safety issue is due to [square bracket script substitution], which would still be performed by the Tcl parser instead of internal to [expr].

Some syntax conflicts, ambiguities, and difficulties:

Arrays and functions. Is cos(-1) calling a function "cos" with argument "-1", or is it looking for an element "-1" in an array called "cos"?

: Larry Smith In the context of expr, it is calling the function "cos" with argument -1. This is an artifact of the way Tcl follows C usage even though its own syntax is very different. In order to support arrays, some new notation would be needed. Following C, this would be cos[-1]. That just exacerbates the problem.

: -OR-

: One merely notes that "cos" is the custodian of some data, the specific instance needed being accessed by the specifier "-1". In other words, the distinction between arrays and functions is an artificial one, and not essential to the language or to our coding patterns. If "cos" is a function, it's a function call, if it isn't and "cos" is an array, it's an array access. If "cos" is neither, it is an error (or some third case).

: AMG: Tcl already has a value/command dichotomy (see "Getting rid of the value/command dichotomy for Tcl 9" for discussion). You point out that math functions form a third class, which can share names with values and commands. (Though tcl::mathfunc provides some measure of unification.) To this point, there haven't been any collisions, because context has always made it clear which is being accessed. Blending them all together in [expr] is contrary to precedent. I'd say it's un-Tcl'ish, however you point out that [expr] is already un-Tcl'ish.

Larry Smith =) This dichotomy was something of an artifact of the first implementation of Tcl. Arguably, the whole thing is over-engineered in this respect. Rather than "proc foo { bar } {...$bar...}", Ousterhaut could have simply done "set foo {...$0...}" and executed it with an optional "$" prefix - this mechanism would have made the whole lambda thing a doddle. Granted, such a change goes far astray from the objective under discussion, but if the objective under discussion has such far-ranging implications, perhaps we should talk about it.

AMG: I definitely hear you, I would very much like to see a unification, and I have a proposal to that end, with any benefits. However, it breaks every existing Tcl script, so it's for a new language. I don't know how far we're willing to go with Tcl 9, so my language project may be a complete fork, in which case I don't think I can pursue it.

: AMG: Modifying [expr] to handle the ambiguity between function calls and array references presents a challenge to the bytecode compiler. Tcl's dynamic nature makes it possible to add and remove tcl::mathfunc commands while the program is executing, so there's no way to know at compile time what each instance will be at run time. Here's current behavior:

  Command 1: "expr {$asdf(188)}"
    (0) push1 0         # "188"
    (2) loadArray1 %v0         # var "asdf"
    (4) tryCvtToNumeric 
    (5) done

  Command 1: "expr {asdf(188)}"
    (0) push1 0         # "tcl::mathfunc::asdf"
    (2) push1 1         # "188"
    (4) invokeStk1 2 
    (6) tryCvtToNumeric 
    (7) done

: loadArray1 and invokeStk1 would have to be replaced with a new bytecode that calls the function if it exists, or else gets the array element if the function lookup fails.

Array elements. foo(bar): The array is named "foo", that's no problem. But is the element literally "bar", or is it the string stored in the variable "bar"?

: Larry Smith This is also an artifact, this time of nature of the specifier used to access the data and the fact that Tcl puts no real limits on it. I would suggest it would mean the string stored in "bar". If you want "bar" literally, you would need to quote it, foo("bar").

: AMG: In a vacuum that works, but it breaks every script that reads array values inside expressions. Remember that [for], [if], and [while] have expression arguments, not only [expr].

Functions with an extra space. It's legal to put space between the function name and the argument list, e.g. cos (-1). Is this calling "cos" or looking in the variable "cos"?

: Larry Smith We often run into various pathological cases. The variable {} is one, this is another. While they are permitted under current syntax they are not good programming practices (much too "tricky", and usually easily replaced with less opaque code). As such I feel no real problem exists in removing the feature. "cos (-1)" would be the value of the variable "cos" followed by a space and the result of applying the -1 specification to the object represented by the null string - the function "", or the array {}(). Yes it may break some code, but I don't think that's a bad thing, it points up a wonky piece of code that needs attention anyway.

: AMG: I'd prefer to not change the interpretation of cos (-1). (-1), all by itself, could either be the numerical value negative one or the "-1" element in the "" array. Clearly we don't want to break that, so instead don't allow implicit reference to the "" variable or array inside an [expr]. With that gone, there's no reason why cos (-1) has to change. The [expr] parser already knows it's a function call, so leave that alone.

Variable names with special characters. "a b" (included space) is a valid variable name, but a simple parser might see it as two separate variables separated by a space. That's why there's "${a b}", with braces.

: Larry Smith Again, the pathological case. "a b" would refer to the values of a and b concatenated with a space. If you want an embedded space, you would need to specify "a\ b".

: AMG: You're not advocating removing support for $ notation from [expr], so just continue to use ${a b}. There's also [set "a b"]; that should continue to work. But here you do suggest an alternative syntax: use backslash to quote the special characters. Allowing backslashes and dollar signs in [expr]'s argument, yet ceasing to recommend that [expr]'s argument be a single brace-quoted word, yields a major conflict with the design of the Tcl parser. [expr]'s not going to see those backslashes and dollar signs, unless they themselves are backslashed or otherwise quoted (e.g. with braces!). Do you advocate having the Tcl parser specially handle lines that begin with the word "expr"? That would make the word "expr" a new form of quoting, in essence. See also: expr shorthand for Tcl9.

Larry Smith No, I don't. I was simply pointing out the most correct solution but, yes, it does imply quoting hell in the current way of doing things.

AMG: I'm perfectly fine with leaving [expr] as-is, so long as we also provide a shorthand for it that is convenient, safe, and equivalent. Actually I'm willing to go a step further and say that it need only be equivalent to the default behavior of [expr], not that it always wind up calling whatever command is currently named [expr]. This is analogous to the situation with [set var] and $var. Single-argument [set] is there for cases where the shorthand is inadequate, e.g. the variable name is computed. The shorthand is preferred, at least as fast as [set], and presents no security issues. It also always behaves in the same way, even if [set] itself is redefined. Back to [expr], it would remain available for cases where the expression itself is computed or stored in a variable, but most of the time it's only the constituent values that live in variables, which are substituted by the math expression engine, not the Tcl interpreter. My favored expr shorthand is $(...), with the Tcl interpreter upgraded to recognize that everything between the $( and the matching ) is not to be interpreted but rather passed verbatim to the math expression engine. Your $-less variable proposal is something I hadn't previously considered, but now I find it to be quite interesting, especially in combination with my other language design ideas to unify variables, procs, commands, lambdas, etc. into a single namespace.

Variable names that look like numbers. "1" is a valid variable name, so are "0xa" and "5.2e-7". They're also valid numbers. How can the parser tell the difference? If it assumes the leading digit makes it a number, then there needs to be a way to force the variable interpretation, such as "${5.2e-7)". But that brings back dollar sign notation, which defeats the purpose of your proposal and raises the safety issue again.

: Larry Smith Yet another pathological case, so much so I see no real problem with requiring the $ prefix to force interpretation as a variable.

Variable names that look like operators. "+" is also a valid variable name!

: Larry Smith and you would refer to it as "$+".

: AMG: Actually that would have to be ${+}, but I understand and agree with what you're saying.

Empty string variable. Believe it or not, but variables can be named empty string. How would you take the value, other than "${}"? Well, there's "$::", or "::" as you would have it.

: Larry Smith Or just $"". As I see it, this proposal is an opportunity to make expr more flexible and concise by making the corner cases the syntactical exception that they really are, rather than allowing them to dictate the (verbose and painful) notation needed for the normal cases.

: AMG: $"" is new syntax, are you proposing that as well?

It isn't really new, it merely regularizes referencing the null variable.

AMG: Nevertheless, here's a simple example demonstrating an overloaded [expr] that behaves as you ask. Note that it still has the safety problem, since variable substitution is performed by Tcl before calling [expr].

rename expr _expr; proc expr {args} {
    uplevel 1 _expr [regsub -all -nocase {[a-z:][a-z0-9_:]*\M(?!\()} [concat $args] {$&}]
}

This code doesn't get all cases, it doesn't support arrays, and it screws up the ?: ternary operator. Since it doesn't support arrays, which are used by the [history] mechanism, it won't work with an interactive Tcl session. For interactive use, try this:

proc expr2 {args} {
    uplevel 1 expr [regsub -all -nocase {[a-z:][a-z0-9_:]*\M(?!\()} [concat $args] {$&}]
}

AMG: Larry, don't get me wrong, I think this is a great idea, I'm just trying to be realistic about it so that it can get implemented without breaking everything.

Larry Smith I understand that, and, fundamentally, I support it. But someday we are going to have to either break backward compatibility or move on to another programming language. These are the sort of issues that will drive that discussion, and I'd like to keep them in mind.

SEH Since we're considering parsing and grammars here, the grammar devised could be more C-like and include keywords and everything. Thus trying to use a variable named e.g. cos could throw an error. But I was more interested in adding functionality than providing syntactic sugar. Another possibility: a parser specialized for currency calculations, thus solving a real problem.

Larry Smith I would argue that this change does add functionality, since I find a more readable and typeable syntax a win-win situation. However, if you look into the possibility of stacking various expr parsers these changes add a lot of new functionality. One preprocessor might translate "$1.34" as "[dollars 1 34]", for example, before handing it off the more general case (likewise £¥₠ to pound sterling, yen, and Euro). Another might be able to recognize vectors or numbers in some format - something like 2 3 ρ ι6 from APL say - permitting array operations in the same concise manner. Each layer would further specialize the underlying mathematical engine of expr to handle new problem domains in a concise manner.

FM Eliminate the dereferencing operator seems to lead to so much conflict, that it will surely never be done. It's too ambitious. The dereferencing operator don't bother me : it exists everywhere in TCL, and I admit easily that TCL is not C.

A good change, but less ambitious, would be to make expr multiInstructions and able to set variables with an = operator

Example :

proc xgcd {a b} {
    lassign [list 1 0 0 1] r0 s0 r1 s1
    while {$b} {
        expr {
            q  = $a/$b;
            b  = $a -$q *(a=$b);
            r1 = $r0-$q *(r0=$r1);
            s1 = $s0-$q *(s0=$s1);
       }
    }
    list $a $r0 $s0 
}

Instead of :

proc xgcd {a b} {
    lassign [list 1 0 0 1] r0 s0 r1 s1
    while {$b} {
       set q [expr {$a / $b}]
       set b [expr {$a - $q*[set a $b]}]
       set r1 [expr {$r0 - $q*[set r0 $r1]}]
       set s1 [expr {$s0 - $q*[set s0 $s1]}]
    }
    list $a $r0 $s0
}

AMG: Minor point: you'll want to delimit the instructions with semicolons rather than newlines, since expr currently treats newlines the same as spaces. FM:Thank's for the point. Example is updated.

DKF: We've definitely thought about this in the past (especially dgp). The problem is that it introduces a new basic concept — lvalue — so the parser gets a lot more complex because of that. We still want to do it I think, but it requires some serious sit-down-and-work-it-out time.

AM: Several packages/extensions/... come to mind: Lars Hellströms infix package and the L language for instance.

FM: Multi-instruction ability suffices. Assignement can be done via a proc in tcl::mathfunc.

proc tcl::mathfunc::set args {
    uplevel [list set {*}$args]
}
proc xgcd {a b} {
    lassign [list 1 0 0 1] r0 s0 r1 s1
    while {$b} {
        expr {
            set("q", $a/$b);
            set("b", $a -$q * set("a",$b));
            set("r1", $r0-$q * set("r0",$r1));
            set("s1", $s0-$q * set("s0",$s1));
       }
    }
    list $a $r0 $s0 
}

AMG: I suggest replacing your [set] with this: interp alias "" ::tcl::mathfunc::set "" set. Also, it looks like you meant to type: [list set {*}$args] (FM: changed). This can also be written slightly more efficiently: [concat set $args].

Semicolons introduce the notion of a sequence point, and your example shows the reason why. Since calls into mathfunc may have side effects including variable creation and modification, substitutions following semicolon must not be performed until everything previous to the semicolon is all done. So, how is this really different than just calling [expr] multiple times? All the following snippets work in Tcl right now:

interp alias "" ::tcl::mathfunc::set "" ::set
proc xgcd {a b} {
    lassign {1 0 0 1} r0 s0 r1 s1
    while {$b} {
        expr {set("q" , $a /$b)}
        expr {set("b" , $a -$q * set("a" ,$b ))}
        expr {set("r1", $r0-$q * set("r0",$r1))}
        expr {set("s1", $s0-$q * set("s0",$s1))}
    }
    list $a $r0 $s0
}

This might as well be written this way:

interp alias "" ::tcl::mathfunc::set "" ::set
proc xgcd {a b} {
    lassign {1 0 0 1} r0 s0 r1 s1
    while {$b} {
        set q  [expr {$a /$b}]
        set b  [expr {$a -$q * set("a" ,$b )}]
        set r1 [expr {$r0-$q * set("r0",$r1)}]
        set s1 [expr {$s0-$q * set("s0",$s1)}]
    }
    list $a $r0 $s0
}

Or even:

proc xgcd {a b} {
    lassign {1 0 0 1} r0 s0 r1 s1
    while {$b} {
        set q  [expr {$a /$b}]
        set b  [expr {$a -$q * [set a  $b ]}]
        set r1 [expr {$r0-$q * [set r0 $r1]}]
        set s1 [expr {$s0-$q * [set s0 $s1]}]
    }
    list $a $r0 $s0
}

Question: Where is it legal to put a semicolon? If it's legal to sneak it inside of parentheses (like the comma operator in C), things could get interesting.

FM : it should be so

# Tcl                             expr
# command arg0 arg1 ...  <=>      command(arg0,args1,...)  
# {a b c d}              <=>     (a,b,c,d)                 List of datas
# [A; B; C]              <=>     (A(); B(); C())           List of commands (returns the last result).

interp alias "" ::tcl::mathfunc::while "" ::while
interp alias "" ::tcl::mathfunc::lassign "" ::lassign
interp alias "" ::tcl::mathfunc::proc "" ::proc
interp alias "" ::tcl::mathfunc::list "" ::list
expr {
   proc("xgcd",{a b},(
      lassign(list(1,0,0,1),"r0","s0","r1","s1");
      while($b,(
         set("q",$a/$b);
         set("b", $a -$q * set("a",$b));
         set("r1", $r0-$q * set("r0",$r1));
         set("s1", $s0-$q * set("s0",$s1))
      ));
      list($a,$r0,$s0)
   ))}

AMG: How would [proc] and [while] know that their last arguments are expressions rather than Tcl scripts?

FM : because of the construct

# (expression1 ; expression2; expression3; ...etc;) is a sequence of Tcl expressions
# {command1    ; command2   ; command3;    ...etc;} is a sequence of Tcl commands
# sequence of expressions
expr {while($b,(
   set("b", $a -$q * set("a",$b));
   set("r1", $r0-$q * set("r0",$r1));
   set("s1", $s0-$q * set("s0",$s1))))}
# sequence of commands
expr {while($b, {
    set q  [expr {$a /$b}]
    set b  [expr {$a -$q * [set a  $b ]}]
    set r1 [expr {$r0-$q * [set r0 $r1]}]
    set s1 [expr {$s0-$q * [set s0 $s1]}]
}))}

So that

expr {(
   command1(arg0,...); 
   command2(arg0,...);
   ...
)}

is analogous to do

eval {
   expr {command1(arg0,...)}
   expr {command2(arg0,..)}
   ...
}

at the toplevel.

The construct (e0,e1,e2) should be seen as a list

# example : Helix parametric equation
# Note : this currently returns `unexpected "," outside function argument list`
lassign [expr {($R*cos($t), $R*sin($t), $k*$t)}] x y z
# Then, a sum of lists :
expr {(1,1)+(1,1)}; # eq {2 2} eq [expr {(2,2)}] eq [list 2 2]
# Note also that the effect of {*}[list 1 2 3] in expr should be (1,2,3), rather than `1 2 3`.
# It actually returns `missing operator at _@_ in expression "{*}_@_[list 1 2 3]"`.

DKF: What I would like to see out of this is rebuilding the expression in terms of normal Tcl commands, so (for example) that {$a+$b*$c**2} would go to:

+ $a [* $b [** $c 2]]

That could then be evaluated in a context that maps those operator-commands appropriately. Tcl already provides tcl::mathop so it would be easy to test, but extending to (say) complex numbers or matrices would be pretty simple.

FM : In fact, this context is Algebra. Algebra gives us

what are the valids objects that can be handled.
what are the valids operations.
commutatives, associatives rules.
the definition of operations

For instance, a complex number and a 2D vector can be written as a 2 elements list. But the multiplication of complex numbers and 2D vectors are not defined the same way. Also, a boolean addition is not defined as a decimal addition. There is a plenty of algebras : complex, quaternion, vectors, tensors, matrix, boolean, decimal,... etc

We should then be able to defined Algebras, then indicate that expr should operate in one of this.

DKF: You're reading somewhat too much. I'm saying that the transformation should be to Tcl commands and that those Tcl commands — which would not be defined by the transformation — should be responsible for interpreting the values as they see fit. This is The Tcl Way. I'd leave commutativity and associativity exactly as they are; Tcl has defined evaluation order so things are much simpler than in C.

FM : I was argee with your proposal, but was going a step further. My question was : what would happen if you want, in a same script, operate on complex numbers and operate on matrix ? Would you redefine each times all the operators before every calculation ? In complex scripts, how could you know the actual definition of operators/functions ? With a multi-instruction expr, it would be easy to give expr the context of calculations :

expr {
   algebra("complex");
   $R1*exp(i*$theta1)*$R2*exp(i*$theta2)
}
expr {
   algebra("matrix-3x3-real");
   ((1,0,0),(0,1,0),(0,0,1))*((1,0,0),(0,1,0),(0,0,1))
} # eq {{1 0 0} {0 1 0} {0 0 1}}
expr {
   algebra("matrix-2x2-complex");
   ((0,-i),(i,0))*((0,-i),(i,0))
} # eq {{1 0} {0 1}}

Some namespaces, for instance "::tcl::mathfunc::complex::", "::tcl::mathop::complex::", "::tcl::mathfunc::matrix-3x3-real::", "::tcl::mathop::matrix-3x3-real::",... etc ("::tcl::mathfunc::algebraName::", "::tcl::mathop::algebraName::") should contain the specifics definitions of functions / operators for specifics algebras. Algebric definitions won't change. So, we just have to be able to define Algebras, make them once, then use expr in their context.

GSOC