- split string ?splitChars?

Examples
split "comp.unix.misc" .
comp unix misc
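Two more cases worth keeping in mind, shown as a small sketch (the variable names are only for illustration): every character in splitChars is a separator in its own right, and two adjacent separators produce an empty element.

```tcl
# Each character of splitChars acts as a separator on its own.
set fields [split "a,b;c" ",;"]
puts $fields                      ;# a b c

# Two adjacent separators produce an empty element between them.
set parts [split "a,,b" ","]
puts [llength $parts]             ;# 3
puts $parts                       ;# a {} b
```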
split "Hello world" {}
H e l l o { } w o r l d
Questions
ulis: Where in the documentation are the standard white-space characters defined?
DKF: I believe there's a standard (ANSI? POSIX?) somewhere. But the answer includes "space", "tab", and "newline".
escargo - By "tab" do you mean both horizontal tab (ASCII 9) and vertical tab (ASCII 11)? (See http://www.asciitable.com/ ) Arguments could be made for most of the ASCII characters under 33.
Strick: Let's ask Tcl what it thinks are white:
$ env | grep en_
LANG=en_US.UTF-8
$ cat what-chars-does-split-think-are-white.tcl
for {set i 0} {$i<65536} {incr i} {
    if {[llength [format "/%c/" $i]] > 1} { puts -nonewline "$i " }
}
$ tclsh what-chars-does-split-think-are-white.tcl
9 10 11 12 13 32 $
escargo 4 Jan 2005 - 9 = ASCII TAB, 10 = ASCII LF (line feed), 11 = ASCII VT (vertical tab), 12 = ASCII FF (form feed), 13 = ASCII CR (carriage return), and of course 32 = ASCII Space. I would have thought that the separator characters (28-31: FS, GS, RS, US) would count as white space, but I guess they are regarded as "nonprinting" characters.
DKF: I actually mean "what does isspace() think is whitespace". :^)
Strick: Oops, I forgot to actually use split in my script above. So now I test four different notions of white, and get three different answers. I understand why Tcl's builtin list-splitting rules must be fixed, regardless of locale. But it seems 'split' should use either the list-splitting rule or the 'string is space' rule; instead it uses its own (pre-Unicode?) rule:
$ cat what-chars-does-split-think-are-white.tcl
puts "tcl=[info patch] LANG=$env(LANG)"
puts -nonewline "according to llength: "
for {set i 0} {$i<65536} {incr i} {
    if {[llength [format "/%c/" $i]] > 1} { puts -nonewline "$i " }
}
puts ""
puts -nonewline "according to split: "
for {set i 0} {$i<65536} {incr i} {
    if {[llength [split [format "/%c/" $i]]] > 1} { puts -nonewline "$i " }
}
puts ""
puts -nonewline "according to 'string is space': "
for {set i 0} {$i<65536} {incr i} {
    if {[string is space [format "%c" $i]]} { puts -nonewline "$i " }
}
puts ""
puts -nonewline "according to regexp {\\s}: "
for {set i 0} {$i<65536} {incr i} {
    if {[regexp {\s} [format "%c" $i]]} { puts -nonewline "$i " }
}
puts ""
$
$ tclsh what-chars-does-split-think-are-white.tcl
tcl=8.4.7 LANG=en_US.UTF-8
according to llength: 9 10 11 12 13 32
according to split: 9 10 13 32
according to 'string is space': 9 10 11 12 13 32 160 5760 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8232 8233 8239 12288
according to regexp {\s}: 9 10 11 12 13 32 160 5760 8192 8193 8194 8195 8196 8197 8198 8199 8200 8201 8202 8203 8232 8233 8239 12288
$
escargo 27 Jan 2006 - If split used chars 9 10 11 12 13 32 then there would be only two sets, with the smaller set as a proper subset of the larger set. The two characters that would have to be added are the vertical tab and form feed.
Note that the argument named splitChars above is a series of 0 to n individual characters. However, if you want to split on a specific sequence of 2 or more characters together, or if you want to split on a regular expression, split will not work for you. See Tcllib's textutil::splitx for that functionality.
SS 2004/01/31 - or you can use the following function:
proc wsplit {string sep} {
    set first [string first $sep $string]
    if {$first == -1} {
        return [list $string]
    } else {
        set l [string length $sep]
        set left [string range $string 0 [expr {$first-1}]]
        set right [string range $string [expr {$first+$l}] end]
        return [concat [list $left] [wsplit $right $sep]]
    }
}
This version is recursive, so it may be better to rewrite it if you plan to use the function against very long strings with many separators. The difference between wsplit and splitx is that splitx uses regexp, so it may create problems with unknown separators.
IL 2005/01/03 - on the near anniversary of this proc, the iterative version, quick n dirty since I'm in a hurry to parse some html...
proc wsplit { str sepStr } {
    # NOTE: as written, each element keeps its separator, one character
    # after the separator is skipped, and the trailing remainder is dropped.
    set strList [list]
    set sepLength [string length $sepStr]
    while { [set index [string first $sepStr $str]] != -1 } {
        set left [string range $str 0 [expr {$index + $sepLength - 1}]]
        set str [string range $str [expr {$index + $sepLength + 1}] end]
        lappend strList $left
    }
    return $strList
}
Hmm, use this version instead; the string first approach doesn't catch separator strings connected to the ones you want:
proc wsplit { str sepStr } {
    # sepStr is used as a regular expression here, not as a literal string
    if { ![regexp $sepStr $str] } { return [list $str] }
    set strList [list]
    set pattern "(.*?)$sepStr"
    while { [regexp $pattern $str match left] } {
        lappend strList $left
        regsub $pattern $str "" str
    }
    lappend strList $str
    return $strList
}
RS writes recently: Note that wsplit can be done more simply:
- map the separating string to a single char that cannot appear in the string
- split on that single char
proc wsplit {str sep} {
    split [string map [list $sep \0] $str] \0
}
% wsplit This<>is<>a<>test. <>
This is a test.
2006-06-21 Sarnold Here is my version of wsplit:
proc wsplit {str sep} {
    set out ""
    set sepLen [string length $sep]
    if {$sepLen < 2} {
        return [split $str $sep]
    }
    while {[set idx [string first $sep $str]] >= 0} {
        # the left part: the current element
        lappend out [string range $str 0 [expr {$idx-1}]]
        # get the right part and iterate with it
        set str [string range $str [incr idx $sepLen] end]
    }
    # there is no separator anymore, but keep in mind the right part must be appended
    lappend out $str
    return $out
}
So what should you use when you don't care how many spaces were between tokens, you just want the nonblank tokens in the list and none of the separators? -- escargo
RS: Easy, just use a filter:
proc filter {cond list} {
    set res {}
    foreach element $list {
        if {[$cond $element]} {lappend res $element}
    }
    set res
}
% filter llength [split "a   list with   many    spaces"]
a list with many spaces
... or use
% split [regsub -all {[ \t\n]+} "a   list with   many    spaces" { }]
to eliminate the excess white space ...
... or use
% lreplace "a   list with   many    spaces" 0 -1
to force reinterpretation as a list ...
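Two further variants that need no helper proc, given here as a sketch and assuming Tcl 8.4 or later (for the -all, -inline and -not options of lsearch). The variable names are only for illustration.

```tcl
set s "a   list with   many    spaces"

# Variant 1: let the regexp engine collect the runs of non-space characters.
set words1 [regexp -all -inline {\S+} $s]

# Variant 2: split as usual, then drop the empty elements with lsearch.
set words2 [lsearch -all -inline -not -exact [split $s] {}]

puts $words1   ;# a list with many spaces
puts $words2   ;# a list with many spaces
```

Unlike the lreplace trick, neither variant parses the string as a list, so input with unbalanced braces or quotes does not raise an error.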
See Counting characters in a string where split was pretty good...
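One way split can be used for counting, as a sketch only (countChar is a hypothetical helper and not necessarily what that page does): splitting on a single character yields one more element than there are occurrences of it.

```tcl
# Hypothetical helper: count occurrences of one character in a string.
proc countChar {str char} {
    # split returns an empty list for an empty string, so handle that first
    if {$str eq ""} { return 0 }
    # n separators give n+1 elements
    expr {[llength [split $str $char]] - 1}
}

puts [countChar "comp.unix.misc" .]   ;# 2
```

Note that char must be exactly one character: with several characters split counts any of them, and with an empty string it splits between every character.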
Kaitzschu mentions, on comp.lang.tcl, this piece of info which is supposedly somewhere in the man pages:
"If splitChars is an empty string then each character of string becomes a separate element of the result list."
Thus,
set l [split {abcdefghijklmnopqrstuvwxyz} {}]
results in a list where each character is turned into a separate list entry.
RS 2006-07-04: When you split on "" on a byte array, it may be surprising that the result may contain unscannable characters for \x00 bytes. I had to work around like this:
proc hexdump str {
    set res {}
    foreach c [split $str ""] {
        set i [scan $c %c]
        if {$i eq ""} {set i 0} ;#<--------------------- here
        lappend res [format %02x $i]
    }
    set res
}
escargo 18 Jun 2007 - (We seem to have some edits getting lost; I'm putting this back in since it seems to have disappeared.) I realized while writing a little language that there doesn't seem to be a Tcl command that returns a list made from breaking up a string as the command parser would. split doesn't do it. It made me think how subst has arguments that disable command substitution and variable substitution, while eval does not. There are commands like eval, info complete, string is list that appear to do the tokenization, but the tokenized results are not accessible to the script level. Is there some way (apparently not obvious to me) to get the list form out of the string?
MG Tcl 8.5's {*} expansion is probably the easiest way. In 8.4 you'd probably need to use eval, which would mean a lot of work to escape special characters and to make sure that things the parser would see as one word, such as
[string is int $foo]
become
"[string is int $foo]"
so the eval doesn't split it into separate words, etc. It would be possible, I imagine, but a bit of a pain to do, compared to something like
proc str2list {args} {
    return $args
}
set list [str2list {*}$string]
in 8.5.
escargo - Reading the original TIP for {expand}, it appears that it was intended to expand a list into separate items for processing. So, using {*} as you did is working around the fact that something is taking $string and tokenizing it, treating the result as a list, and passing it to {*} to expand. Again, it seems, the tokenizing is buried where it is not accessible to a script.
It really seems like there ought to be a way. Right now split doesn't take any control arguments, so adding something like [split -tokens $string] wouldn't break any code.
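To make the distinction concrete, here is a sketch assuming Tcl 8.5 or later (str2listChecked is a hypothetical name). It guards the {*} conversion with string is list, and it shows that what happens is list parsing, not command parsing: brackets and dollar signs are ordinary characters to the list parser.

```tcl
# Hypothetical helper: convert a string to a list only if it parses as one.
proc str2listChecked {string} {
    if {![string is list $string]} {
        error "not a well-formed list: $string"
    }
    # {*} forces list parsing; list rebuilds the canonical form
    return [list {*}$string]
}

puts [str2listChecked {a   {b c}   "d e"}]       ;# a {b c} {d e}

# The command parser would see three words here; the list parser sees four
# elements, because [b c] is not treated as a unit.
puts [llength [str2listChecked {a [b c] $d}]]    ;# 4
```

So this gives the list form of a string, but it does not tokenize the way the command parser does.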