Split On Whitespace

Created by CecilWesterhof.

Often I want to split a string on runs of white-space. The built-in split command does not do what I want, because it treats every single white-space character as a separator. For example:

split "   To   show    the   problem.   "

gives:

{} {} {} To {} {} show {} {} {} the {} {} problem. {} {} {}

What I want is:

To show the problem.

That is why I created the following proc:

# A split that works on repeating white-space
# With:
#     splitOnWhiteSpace "   To   show    the   problem.   "
# You get:
#     "To show the problem."
# instead of:
#     "{} {} {} To {} {} show {} {} {} the {} {} problem. {} {} {}"
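# With min/max you can verify the number of elements:
# min alone requires exactly min elements, min and max together
# require a count in that range, and -1 (the default) means no check.
# Note that max is ignored when min is -1.
# I prefer the regexp version, but
# the list version can take about 55% of the time.
# That is why you can set fast to use the list version.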
proc splitOnWhiteSpace {value {min -1} {max -1} {fast False}} {
    if {!([string is integer -strict $min] && [string is integer -strict $max])} {
        error "min and max should both be integers ($min, $max)"
    }
    if {($min < -1) || ($max < -1)} {
        error "min and max should both be >= -1 ($min, $max)"
    }
    if {($max != -1) && ($max < $min)} {
        error "min should be <= max ($min, $max)"
    }
    if {$fast} {
        set splitLst [list {*}[string map {
            \{ \\\{
            \" \\\"
            \\ \\\\
            } $value]]
    } else {
        set splitLst [regexp -all -inline {\S+} $value]
    }
    if {$min != -1} {
        if {$max == -1} {
            set max $min
        }
        set length [llength $splitLst]
        if {($length < $min) || ($length > $max)} {
            if {$min == $max} {
                set msgEnd "$min values"
            } else {
                set msgEnd "between $min and $max values"
            }
            error "'$value' contains $length instead of $msgEnd"
        }
    }
    return $splitLst
}

With this I get:

To show the problem.
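
The two branches of the proc can also be tried on their own. The following is a minimal sketch, not part of the original proc: it copies the two expressions out of splitOnWhiteSpace and checks that they agree, both on the example above and on an input with quotes and a backslash. The variable names and the second input are made up for this illustration.

```tcl
# The two branches of splitOnWhiteSpace, taken out of the proc.
set value "   To   show    the   problem.   "

# Default branch: collect every run of non-white-space characters.
set viaRegexp [regexp -all -inline {\S+} $value]

# Fast branch: escape open braces, double quotes and backslashes,
# so that the string can be parsed as a plain Tcl list.
set viaList [list {*}[string map {\{ \\\{ \" \\\" \\ \\\\} $value]]

puts [llength $viaRegexp]              ;# 4
puts [expr {$viaRegexp eq $viaList}]   ;# 1

# An input with quotes and a backslash (chosen for this sketch).
set tricky {say "hi there" C:\temp}
set trickyRegexp [regexp -all -inline {\S+} $tricky]
set trickyList [list {*}[string map {\{ \\\{ \" \\\" \\ \\\\} $tricky]]
puts [llength $trickyList]                   ;# 4
puts [expr {$trickyRegexp eq $trickyList}]   ;# 1
```

Without the escaping of the double quote, list parsing would treat "hi there" as a single quoted element instead of two words.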

Besides splitting on repeating white-space, it can also check the number of elements. For example:

splitOnWhiteSpace "Just a test." 4

gives:

'Just a test.' contains 3 instead of 4 values

and:

splitOnWhiteSpace "Just a test." 4 5

gives:

'Just a test.' contains 3 instead of between 4 and 5 values
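
Because a wrong count is reported with error, a caller that wants to continue has to catch it. The sketch below shows one way to do that. To keep it self-contained it uses a hypothetical stand-in, splitExact, which is reduced to the regexp branch and an exact-count check; with the real splitOnWhiteSpace the catch works the same way.

```tcl
# Hypothetical stand-in for splitOnWhiteSpace: regexp branch only,
# and the count must match exactly.
proc splitExact {value count} {
    set lst [regexp -all -inline {\S+} $value]
    set length [llength $lst]
    if {$length != $count} {
        error "'$value' contains $length instead of $count values"
    }
    return $lst
}

# catch returns 1 when the script raises an error and
# stores the error message in msg.
if {[catch {splitExact "Just a test." 4} msg]} {
    puts "rejected: $msg"
} else {
    puts "accepted: $msg"
}
```

This prints the message after "rejected: " instead of aborting the script.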

As always: comments, tips and questions are appreciated.


StephanKuhagen:

About four times faster than the regexp line:

list {*}[string map {\{ \\\{} $value]

The string map is needed to avoid unmatched open braces in lists. If you know that your input will never contain an opening brace, you can drop the string map and make it even faster.
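
To illustrate why the escaping matters, here is a small sketch with an input that contains a lone opening brace. The input and the variable names are made up for this example.

```tcl
# The value is: a { b
set value "a \{ b"

# Without escaping, parsing the string as a list fails
# because the opening brace is never closed.
set failed [catch {list {*}$value} msg]
puts "failed: $failed"    ;# failed: 1

# With the brace escaped, the string parses as three words.
set words [list {*}[string map {\{ \\\{} $value]]
puts [llength $words]     ;# 3
```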

CecilWesterhof

Thanks, I implemented it. For the curious, originally I used:

set splitLst [regexp -all -inline {\S+} $value]
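
For anyone who wants to compare the two variants on their own machine, the built-in time command can be used. This is only a rough sketch: the iteration count is an arbitrary choice, and the numbers depend on the Tcl version, the input and the machine, so no particular ratio should be expected.

```tcl
set value "   To   show    the   problem.   "

# time runs the script the given number of times and returns
# a string of the form "N microseconds per iteration".
set regexpTime [time {regexp -all -inline {\S+} $value} 10000]
set listTime   [time {list {*}[string map {\{ \\\{ \" \\\" \\ \\\\} $value]} 10000]

puts "regexp: $regexpTime"
puts "list:   $listTime"
```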

PYK 2018-06-07: ycl string delimit is a more general routine for performing this type of task.


gerhardr - 2018-06-11 14:41:59

Just for completeness, the Tcllib solution.

It is considerably slower (about a factor of 3 in my tests), but it is more general because it accepts a regular expression as the separator. Maybe that is also a motivation to improve the Tcllib method.

 % package require textutil
 0.8
 % set str        "   To   show    the   problem.   "
    To   show    the   problem.   
 % textutil::splitx $str
 {} To show the problem. {}
 % textutil::splitx [string trim $str]
 To show the problem.
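
As a sketch of the more general use, textutil::splitx takes the separator regexp as an optional second argument. The input string and the pattern below are made up for this example, and it assumes that Tcllib is installed.

```tcl
package require textutil

# Split on a comma or semicolon, including any white-space around it.
set fields [textutil::splitx "alpha, beta;gamma" {\s*[,;]\s*}]
puts $fields    ;# alpha beta gamma
```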