Updated 2013-12-05 15:23:35 by m

Loop behaviour

Playing with the bytecode stepper in Visual TAL, I was a bit surprised to see the bytecode for foreach invoking llength at every iteration. This made me wonder about the performance of common looping constructs, and the results were somewhat surprising.

As with any microbenchmark, this is probably measuring the wrong thing.

Specification

The obvious question is, "how does foreach behave if you modify the list under it?". The answer is conveniently omitted from the foreach manual, but provided clearly enough by the dodekalogue: foreach has its list arguments substituted before it is called (in other words: passed by value) so it cannot be affected by mutations.
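
A quick illustration of that pass-by-value behaviour (a small sketch; the variable names are arbitrary):

```tcl
set l {a b c}
set n 0
foreach i $l {
    incr n
    lappend l extra   ;# mutate the variable mid-loop
}
# foreach iterates over the value it was handed when invoked,
# so the loop runs 3 times even though $l has grown to 6 elements
```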

Setup

Making big lists is easy with split [string repeat ..., but I have this in my prelude to start with:
proc range {from to} {
    if {$to<$from} {
        lassign [list $to $from] from to
    }
    set res {}
    while {$from<=$to} {lappend res $from; incr from}
    set res
}

And our test var. I initialise it this way because, when pasting into the shell, you don't really want the value of x echoed back at you.
set x [range 0 1e6]; llength $x

(side note: will such a list always have an odd number of elements? I don't know.)

Here are the procs we want to test:
proc f {x} {
    foreach i $x {
        incr j
    }
    set j
}
proc g {x} {
    for {set a 0} {$a<[llength $x]} {incr a} {
        incr j
    }
    set j
}
proc h {x} {
    set l [llength $x]
    for {set a 0} {$a<$l} {incr a} {
        incr j
    }
    set j
}

(micro)Benchmarking

It turns out that iterating over a million-and-one element list is a bit slow on my netbook, so initial timing results are of a small sample:
  tclsh-8.6.0 % time {f $x} 10
  805257.6 microseconds per iteration
  tclsh-8.6.0 % time {g $x} 10
  502229.6 microseconds per iteration
  tclsh-8.6.0 % time {h $x} 10
  285997.6 microseconds per iteration

Striking. Of course this benchmark isn't exactly fair, since foreach is accessing the contents of the list as well as iterating over it. Suitably handicapping g and h is easy:
rename g g_
proc g {x} {
    for {set a 0} {$a<[llength $x]} {incr a} {
        set i [lindex $x $a]
        incr j
    }
    set j
}
rename h h_
proc h {x} {
    set l [llength $x]
    for {set a 0} {$a<$l} {incr a} {
        set i [lindex $x $a]
        incr j
    }
    set j
}

This brings the results closer to foreach, but the same performance degradation is apparent:
  tclsh-8.6.0 % time {g $x} 10
  688677.9 microseconds per iteration
  tclsh-8.6.0 % time {h $x} 10
  460473.2 microseconds per iteration

Curiously, foreach is still the slowest version. I hope this is an artifact of the system I am testing on.

Disassembly

Using disassemble on g, h and their predecessors reveals nothing unexpected. Looking at f though:
  Command 1: "foreach i $x {incr j}"
    (0) loadScalar1 %v0         # var "x"
    (2) storeScalar1 %v1        # temp var 1
    (4) pop
    (5) foreach_start4 0
                [data=[%v1], loop=%v2
                 it%v1  [%v3]]
    (10) foreach_step4 0
                [data=[%v1], loop=%v2
                 it%v1  [%v3]]
    (15) jumpFalse1 +17         # pc 32
  Command 2: "incr j"
    (17) startCommand +12 1     # next cmd at pc 29
    (26) incrScalar1Imm %v4 +1  # var "j"
    (29) pop
    (30) jump1 -20      # pc 10
    (32) push1 0        # ""
    (34) pop
  Command 3: "set j"
    (35) startCommand +11 1     # next cmd at pc 46
    (44) loadScalar1 %v4        # var "j"
    (46) done

we see bytecode entries for foreach_start4 and foreach_step4. It turns out my initial data was wrong: dis2asm unrolls these instructions into code similar to what we see in g's loop. These really are single bytecodes: in tclExecute.c you will find about 100 lines of C at the jump label INST_FOREACH_STEP4, confirming that it takes the length of every list at every iteration.

Ruminations

I don't really know what to draw from this, but it raises some questions. Is this measurement meaningful? If so, why is foreach measurably slower than a loop which apparently does the same thing? (I am reminded of the 386's loop opcode, which was significantly slower than the unrolled combination dec cx; jnz, leading to its disuse outside of 4k demos.) Is this an indication that, rather than more bytecode instructions, what is really needed to boost Tcl performance is commands to support script-level interaction with the compiler? (I'm dreaming about Lisp macros and stuff like Sugar.)

RS 2013-12-04 Thanks for trying Visual TAL! As I wrote in dis2asm does macros, I could not find TAL instructions corresponding to foreach_start and foreach_step in tcl::unsupported::assemble, so the "macro expansion" I concocted was a workaround.

In Tcl list objects, the length is a data field (just as strings know their string length), so retrieving it does not cost much time. My first attempt had a separate variable for the length, but I then decided it was not needed.

DKF: Adding the foreach* instructions is one of the TODOs, but they've got some weird side-conditions on them that would need modelling. (Also, TAL can't currently construct auxiliary data blobs, which the foreach instructions use.) The real complexity of the foreach stuff is there to handle the multi-variable multi-list cases.

RS True... dis2asm already fails horribly on the simple variation
 proc f x {set res {};foreach {i j} $x {if {$i>0} {lappend res $i}};set res}
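
For reference, that variation walks the list two elements at a time, collecting each first-of-pair that is positive (the proc from above, repeated here so the example is runnable on its own):

```tcl
proc f x {set res {}; foreach {i j} $x {if {$i>0} {lappend res $i}}; set res}

f {1 -2 3 -4}   ;# pairs are {1 -2} and {3 -4}, so this returns: 1 3
```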

DKF: Of course, this also means that it might be worthwhile for foreach to be optimised in the one-var-one-list case by not using the foreach* instructions. I shall need to investigate this further.
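
To see why the one/one case is so much simpler, here is a rough script-level sketch of what an index-based expansion amounts to (foreach11 is a hypothetical name; the real optimisation would of course happen at the bytecode level, not as a proc):

```tcl
proc foreach11 {varName list body} {
    upvar 1 $varName var
    set len [llength $list]            ;# length taken once, as in h above
    for {set idx 0} {$idx < $len} {incr idx} {
        set var [lindex $list $idx]    ;# assign the loop variable in the caller
        uplevel 1 $body                ;# run the body in the caller's scope
    }
}

set out {}
foreach11 i {a b c} { lappend out $i }
# out is now: a b c
```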

RS See at end of dis2asm does macros how I manually expanded the foreach macros for multi-variable and multi-list cases - in hindsight, the code to be generated is pretty regular.

DKF: I think I'd definitely focus on the one/one case; it's massively more common than the others, so it's worth putting special effort into.

RS: I agree that the one/one case is the most frequent; but the complexity of the code to generate grows almost linearly with the number of iterators:

  • foreach_start must expand into 3 instructions per iterator
  • foreach_step must expand into 9 instructions per iterator...
  • ... and finally tack on one lor instruction for each iterator except the first
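
By those counts, an n-iterator loop expands to 3n + 9n + (n-1) = 13n - 1 instructions — a back-of-the-envelope figure from the list above, not measured from generated code:

```tcl
# Instruction count per the expansion rules above:
# 3 (start) + 9 (step) per iterator, plus one lor per iterator after the first
proc macroCost n { expr {3*$n + 9*$n + ($n - 1)} }

macroCost 1   ;# → 12
macroCost 3   ;# → 38
```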

With a cleverer macro expander (which parses the extra argument lines directly before generating the code), this will be an easy job - I will do it later today :^D

MS After reading this I finally decided to revisit the foreach bytecodes - see http://core.tcl.tk/tcl/info/3782af5f5a. The proposed bytecodes for f now look like:
% tcl::unsupported::disassemble proc f
ByteCode 0x0x14416f0, refCt 1, epoch 15, interp 0x0x13f35d0 (epoch 15)
  Source "\n    foreach i $x {\n        incr j\n    }\n    set j\n"
  Cmds 3, src 51, inst 19, litObjs 1, aux 1, stkDepth 4, code/src 0.00
  Proc 0x0x143c7c0, refCt 1, args 1, compiled locals 3
      slot 0, scalar, arg, "x"
      slot 1, scalar, "i"
      slot 2, scalar, "j"
  Exception ranges 1, depth 1:
      0: level 0, loop, pc 7-9, continue 11, break 12
  Commands 3:
      1: pc 0-15, src 5-39        2: pc 7-9, src 28-33
      3: pc 16-17, src 45-49
  Command 1: "foreach i $x {\n        incr j\n    }"
    (0) loadScalar1 %v0         # var "x"
    (2) foreach_start 0 
                [data=[%v0], loop=%v18446744073709551612
                 it%v0        [%v1]]
  Command 2: "incr j"
    (7) incrScalar1Imm %v2 +1         # var "j"
    (10) pop 
    (11) foreach_step 
    (12) foreach_end 
    (13) nop 
    (14) nop 
    (15) nop 
  Command 3: "set j"
    (16) loadScalar1 %v2         # var "j"
    (18) done