A regexp twist

NJG December 30, 2004


NJG January 23, 2005. A speed tuned version can now be downloaded!


Time ago in the line of my work I had come across a somewhat obfuscated VHDL source for a Reed-Solomon (see [L1 ]) decoder which otherwise was perfectly usable, i.e. could have been used for synthesizing the circuit. Having thought that I had known all about Reed-Solomon decoding that was there to be known I started wondering what obfuscation was supposed to hide and decided to turn the source into a digestible form.

One technique of obfuscation had been the replacement of all non-keyword identifiers with a long string composed solely of the characters: I (capital i), O (capital o), 1 (one), 0 (zero) and l (lower case L).

Without thinking much about efficiency I scribbled the following script to replace these with the more comprehensible alternatives: xxi.

 set nr 0
 set start 0
 while {1} {
       set subs [regexp -inline -indices -start $start -- {[IO10l]{10,}} $txt]
       if {$subs eq ""} {break}
       set beg [lindex [lindex $subs 0] 0]
       set start [lindex [lindex $subs 0] 1]
       set var [string range $txt $beg $start]
       if {[catch {set vn($var)}]} {
             set vn($var) xx$nr
             incr nr
       }
       set txt [string replace $txt $beg $start $vn($var)]
       set start [expr $start - [string length $var] + [string length $vn($var)]]
 }

Even while writing I thought that it ought to have been possible to do it with a single sweep of regular expression processing. The potential command syntax also seemed obvious:

regexp -inline ?other options? pattern string script

where script would be executed each time a pattern match occured.

(SG I am probably missing the point somewhere, but isn't that exactly what

  foreach var [regexp -all -inline ?other options? pattern string] script

would do?)

In current Tcl it would be illegal to specify a 3rd and further arguments in combination with the -inline option, so this form might as well represent a compatible extension.

I also decided that whatever I did would take the form of a loadable extension. The result can be downloaded from here [L2 ].

In the extension loading phase the pointer to the original Tcl_RegexpObjCmd is hijacked and used to replace the code that is executed for the regexp command.

The source of the new code is a slightly modified version of the original one. When the above command variant is detected the fact is noted, the script string is saved and the original code is made to believe as if a regexp scan with global match variables mVar0 ... mVar9 have been requested. (However only mVar0 and as many further variables are set as there are bracketed subexpresions.) At the end of each matching cycle then Tcl_Eval evaluates the saved script.

Using Tcl_Eval is the least efficient way to execute the script as in the case of sequential matches each time the interpretation starts with the string form. If the -all option is specified it is a better strategy to compile the script and execute the resulting byte code instead. For a compiled extension it certainly can be done. As for the loaded extension, I seem to remember as if not all required functions were stub exported.... Or was I just lazy?!

With this extension in place the following code produces identical output to the one above (except that now txt1 holds the result).

 set st 0
 set nr 0
 proc sbst {} {
       uplevel #0 {
              set p_st [lindex $mVar0 0]
              append txt1 [string range $txt $st [expr $p_st - 1]]
              set st [lindex $mVar0 1]
              set var [string range $txt $p_st $st]
              incr st
              if {[catch {set vn($var)}]} {
                     set vn($var) xx$nr
                     incr nr
              }
              append txt1 $vn($var)
       }
 }
 regexp -inline -indices -all {[IO10l]{10,}} $txt sbst
 append txt1 [string range $txt $st end]

HAPPY AND PROSPEROUS 2005 TO ALL OF YOU