Finding the overlap in two strings

MS Here is a proc that compares two strings that contain whole lines - where a line is ended by a configurable single- or multi-char EOL symbol.

findOverlap finds if andhow many of the last lines in $str1 match the first lines in $str2.

This might be used (as in Tcl chatroom snaphost history (2)) to build up the history of files that where:

  1. the file is built line-by-line
  2. lines are not timestamped
  3. old lines are removed

The proc is useful whether new lines are added at the bottom or the top of the file.

The first version below is correct; the second is lighter/faster, but suffers the stutterbug: if the first line in $str2 appears more than once in $str1, it may under some circumstances report some old lines as new. However, it never loses data, i.e., it does never report a new line as old.

  #
  # This proc finds the overlap of two strings; it returns a list 
  # of two indices:
  #   . the end of the last line in $str1 not present in $str2
  #   . the start of the first line in $str2 not present in $str1
  #                 | 
  #  $str1:   -----------
  #  $str2:          ----------
  #                      |
  # The line separator EOL is "\n" per default, but it can be set 
  # to any single or multicharacter value.
  #
  # The proc assumes that both $str1 and $str2 consist of full lines,
  # including the ending EOL
  
  proc findOverlap {str1 str2 {EOL "\n"}} {
      # shorten $str1 to be no longer than $str2 - it is unneeded
      # and makes faster searches. A long $str2 does not matter ...
 
      set len1 [string length $str1]
      set len2 [string length $str2]
      set str1 [string range $str1 end-$len2 end] 
 
      set first2 [string range $str2 0 [string first $EOL $str2]]
      append first2 [string range $EOL 1 end]
      set firstLen [string length $first2]
  
      while {1} {
          # Find if/where $first2 is present in $str1
  
          set index [string first $first2 $str1]
          if {$index == -1} {
              set len 0
              break
          }
  
          # remove the start of $str1, check if the end
          # matches the start of $str2. This rechecks
         # the first line, which isn't really needed.
  
          set str1 [string range $str1 $index end]
          set len [string length $str1]
          if {[string equal -length $len $str1 $str2]} {
              break
          }
  
          # no match; remove the first line from $str1, restart
          set str1 [string range $str1 $firstLen end]
      }
  
      # $len is the length of the match
      set last1  [expr {$len1 - $len - 1}]
      list $last1 $len
  }        

Can't this be done simply with the regexp engine? I did not find the way, yet ...


This version is more lightweight, but suffers from the stutterbug: if the first line in $str2 appears more than once in $str1, it may under some circumstances report some old lines as new.

However, it never loses data, i.e., it does never report a new line as old.

    proc findOverlap {str1 str2 {EOL "\n"}} {
        set len1 [string length $str1]
        
        set first2 [string range $str2 0 [string first $EOL $str2]]
        append first2 [string range $EOL 1 end]
        set firstLen [string length $first2]
        
        set index [string last $first2 $str1]
        if {$index == -1} {
            set len 0
            set index [expr {$len1 - 1}]
        } else {
            set len [expr {$len1 - $index}]
            incr index -1
        }
        # $len is the length of the match
        list $index $len
    }