Version 1 of Contest - finding repeating substrings

Updated 2004-12-23 09:30:49

AM (23 december 2004) Here is a little puzzle for the days around Christmas and the Gregorian New Year:

Given a piece of text (where all whitespace has been replaced by single spaces, to make it less ambiguous), such as the description of the Endekalogue, are there any substrings of given length (longer than, say, 10 characters) that appear at least twice?

Requirements:

  • It should be a pure Tcl script that finds the answer for an arbitrary piece of text
  • It should be "best" in some way

The best solutions are either:

  • the shortest (smallest size of the code for any artistic but consistent definition of size)
  • the fastest (as measured by the [time] command)
  • the least number of command executions (as measured by the [info cmdcount] command)
  • the most fantastic use of regular expressions
  • the most obfuscated
  • the most elegant variation on the question (such as presenting a histogram of repeated substrings during the computation ...)
  • the best in any other challenging category

Rewards for each category: honorary mention on the Wiki

Add your solutions below!

adavis Does this fit the bill...

 # Text to be checked for repeated substrings.
 set testtext "well then, has this got some repeated strings? this is a test string"

 # Length of required substrings minus one.
 set findlength 3

 for {set i 0} {$i <= [expr {[string length $testtext] - $findlength}]} {incr i} {
    set teststring [string range $testtext $i [expr {$i + $findlength}]]
    if {[catch {incr result($teststring)}]} {set result($teststring) 1}
 }

 parray result

...I've used short text and substring for clarity.

Disclaimer: I have no other intention with this contest than the contest itself - it just struck me that I would not know how to solve it using [regexp], but I considered the use of regexps for about 10 seconds ....