lrmdups

Command in the TclX package for removing duplicate elements from a list. The returned list will be sorted.

lrmdups list

RS: You don't even need a procedure for this, because that's exactly what lsort -unique does:

interp alias {} lrmdups {} lsort -unique

Testing this interp alias interactively with a tclsh:

% lrmdups {foo bar grill bar foo}
bar foo grill

SS: Well... it depends. lsort -unique does it in O(N log N), while the trivial algorithm to remove duplicated elements runs in O(N). So if your list is very long, lsort may not be an option. Also, you may like to have a stable algorithm that will not alter the order of the non-duplicated elements.

RS: OK, but that means the sortedness specification is removed... Here is the trivial algorithm:

proc lrmdups list {
    set res {}
    foreach element $list {
        if {[lsearch -exact $res $element] < 0} {lappend res $element}
    }
    set res
}
% lrmdups {foo bar grill bar foo}
foo bar grill

SS: Oops, sorry Richard, I didn't notice the "The returned list will be sorted" part. So what the author of this page wants really is lsort -unique. Still, your algorithm is not O(N) but O(N^2). In order to have an O(N) algorithm you need to use a dictionary with O(1) search complexity in the average case to check whether an element already exists, i.e. a Tcl array.

RS: OK, can't resist a mini-challenge :)

proc lrmdups list {
    array set a {}
    foreach element $list {set a($element) {}}
    lsort [array names a]
}
% lrmdups {foo bar grill bar foo}
bar foo grill

SS: That's nice, and if you remove the final lsort it should indeed have O(N) complexity. Still, it is not stable. The following is a stable version:

proc lrmdups list {
    array set x {}
    set res {}
    foreach e $list {
        if {[info exists x($e)]} continue
        lappend res $e
        set x($e) {}
    }
    return $res
}
% lrmdups {foo bar grill bar foo}
foo bar grill

DKF: I suspect that that's actually still super-linear, but the reasoning for working that out is non-trivial. It depends on the distribution pattern of hash-chain lengths and the fact that you're building the entire hash table from scratch. The performance should be really good though (and better with dict, which does less memory management nasties).
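Following up on the dict remark: a dict-based rendition of the stable version above (assuming Tcl 8.5 or later; the name lrmdups_dict is just for this sketch) might look like:

```tcl
# Stable duplicate removal, using a dict as the "seen" set (Tcl 8.5+).
proc lrmdups_dict list {
    set seen [dict create]
    set res {}
    foreach e $list {
        if {![dict exists $seen $e]} {
            lappend res $e
            dict set seen $e {}
        }
    }
    return $res
}
```

% lrmdups_dict {foo bar grill bar foo}
foo bar grill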

SS: Not considering lappend, which is not O(1), and rehashing (so we have to assume we can resize the hash table at the top of the procedure), I think that under perfect hashing building the hash table is clearly O(N), and lookups are O(1). We are not in a perfect hashing situation, but if the hash function and the keys (assuming, for example, words) are not strange enough to show strong patterns, and if the hash function behaves more or less the same as N grows, then chains of a given length stay equally probable as N grows, so building the hash table will always take O(N) operations. Lookups should be O(1) under the same assumption, so I think this is an O(N) business. But my math is not good enough to test this. If somebody very skilled with this kind of problem (like KBK for example ;) can throw some light on this issue, that would be cool.

SS: Just noted that insertion can be a "pure" O(1) even without perfect hashing, if one does not care about duplicated elements when building the hash table. This, by the way, makes the chains longer for an element that appears multiple times in the list. Searching for that element is still fine, because if an element is present N times in a given bucket we find it on average in chainlen/N + 1 steps; but there may be other elements hashing to the same bucket as heavily duplicated elements, and this makes the analysis a bit more complex...
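One way to probe the scaling question empirically is to time both approaches on growing inputs with Tcl's time command. The numbers are machine-dependent, and the proc is repeated here only so the snippet runs on its own; this is a rough sketch, not a benchmark suite:

```tcl
# Stable array-based dedup, repeated from above for a self-contained run.
proc lrmdups list {
    array set x {}
    set res {}
    foreach e $list {
        if {[info exists x($e)]} continue
        lappend res $e
        set x($e) {}
    }
    return $res
}

# Time lsort -unique against it on lists with roughly 50% duplicates.
foreach n {1000 10000 100000} {
    set l {}
    for {set i 0} {$i < $n} {incr i} {
        lappend l [expr {int(rand() * ($n / 2))}]
    }
    puts "n=$n  lsort -unique: [time {lsort -unique $l} 10]"
    puts "n=$n  array lrmdups: [time {lrmdups $l} 10]"
}
```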


actually, according to the Chart of existing list functionality page, this functionality exists in a few extensions (the description above is from TclX), but not directly in the core. What I was looking for (opening this up before, asking, then removing my question, as I had actually resorted to an extension for only one feature :^( ) was the ability to remove duplicated entries without sorting (or shuffling) the list. I'm working with playlists, and I'd rather not have the same thing in there 3-4 times from importing other lists into one :^) This works great ... JMM


RS 2005-06-01 - Conversely, as asked for in the Tcl chatroom, here's code to return only duplicated elements:

proc lnunique list {
    set res {}
    foreach e $list {
        if {[info exists a($e)] && ![in $res $e]} {lappend res $e}
        set a($e) ""
    }
    set res
}
proc in {list element} {expr {[lsearch -exact $list $element] >=0}}
Test:
% lnunique {a b b c c c d}
b c

glennj - I'd approach that problem like this (not benchmarked, so I don't know if it's faster):

if {[info tclversion] < 8.5} {
    rename incr _tcl_incr
    proc incr {varName {value 1}} {
        upvar $varName var
        if {[info exists var]} {_tcl_incr var $value} else {set var $value}
    }
}
proc lnunique list {
    set res [list]
    foreach e $list {
        incr a($e)
        if {$a($e) == 2} {lappend res $e}
    }
    set res
}
lnunique {a b c d e d c d e g } ;# ==> d c e
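In Tcl 8.5+ the incr shim above is unnecessary (incr initializes a missing variable to 0 there), and the same counting idea can be phrased with a dict, as DKF suggested. A sketch (the name lnunique_dict is made up for this page):

```tcl
# Count occurrences in a dict; report each element the moment its
# count reaches 2, preserving first-duplicate order (Tcl 8.5+).
proc lnunique_dict list {
    set count [dict create]
    set res {}
    foreach e $list {
        dict incr count $e
        if {[dict get $count $e] == 2} {lappend res $e}
    }
    return $res
}
```

% lnunique_dict {a b c d e d c d e g}
d c e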