Version 21 of Splitting strings with embedded strings

Updated 2014-03-03 15:02:38 by PeterLewerin

Splitting strings with embedded strings

Richard Suchenwirth 2001-05-31: Robin Lauren wrote in comp.lang.tcl:

I want to split an argument which contains spaces within quotes into proper name=value pairs. But I can't :)

Consider this example:

set tag {body type="text/plain" title="This is my body"} 
set element [lindex $tag 0]
set attributes [lrange $tag 1 end] ;# *BZZT!* Wrong answer!

My attributes variable becomes the list {type="text/plain"} {title="This} {is} {my} {body"} (perhaps even with the optional backslash before the quotes), which isn't really what I had in mind.

Answer

If there are always exactly two attribute definitions following an element name, one simple solution is to scan the string, and then enclose the name/value pairs in sublists:

% set parts [scan $tag {%s %[^=]="%[^"]" %[^=]="%[^"]"}]
# -> body type text/plain title {This is my body}
% set result [list]
% foreach {name value} [lrange $parts 1 end] { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}

For a more general solution, where there can be fewer or more than two definitions, a regexp match might be useful:

% set matches [regexp -inline -all {(\S+?)="(.*?)"} $tag]
# -> type=\"text/plain\" type text/plain {title="This is my body"} title {This is my body}
% set result [list]
% foreach {- name value} $matches { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}

(Note that here, the foreach command extracts three values from the list during each iteration: the first value is just discarded.)
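The regexp approach above can be wrapped in a small procedure so that it is reusable for any number of attributes. This is only a sketch: the name tagAttributes is made up for illustration, and the pattern is the one used above, so it still assumes that every value is enclosed in double quotes.

```tcl
# Hypothetical helper (not part of Tcl): returns a list of {name value}
# pairs, one for every name="value" attribute found in the tag string.
proc tagAttributes {tag} {
    set result [list]
    # Each match yields three items: the full match (discarded via the
    # variable named "-"), the attribute name, and the attribute value.
    foreach {- name value} [regexp -inline -all {(\S+?)="(.*?)"} $tag] {
        lappend result [list $name $value]
    }
    return $result
}

set tag {body type="text/plain" title="This is my body"}
set attrs [tagAttributes $tag]
puts [llength $attrs]        ;# number of attributes found
puts [lindex $attrs 1 1]     ;# value of the second attribute
```

With the example tag, this should report two attributes, the second one having the value "This is my body".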

The problem can also be solved using list/string manipulation commands, but then we need to make sure that we see the data in the same way as Tcl does. To a human, $tag intuitively looks like a list of three items, but according to Tcl list syntax, it has 6 items, and the second item, for example, contains two literal quotes.

% llength $tag
# -> 6
% lmap item $tag { format "{%s}" $item }
# -> {{body}} {{type="text/plain"}} {{title="This}} {{is}} {{my}} {{body"}}

One simple solution is to rewrite the tag string into something that is convenient for list manipulation (careful with the quoting in the string map here!):

(Oops, the syntax highlighting in the wiki renderer was confused by my initial invocation (string map {=\" " \{" \" \}} $tag): the one below works better but obfuscates the code somewhat. \x22 is double quote, \x7b is left brace, \x7d is right brace. Both invocations work equally well in the Tcl interpreter.)

% set taglist [string map [list =\x22 " \x7b" \x22 \x7d] $tag]
# -> body type {text/plain} title {This is my body}
% set result [list]
% foreach {name value} [lrange $taglist 1 end] { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}
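Because the string map rewrite produces text that is then read as a Tcl list, an attribute value that itself contains a brace can yield something that is not a well-formed list. A minimal sketch of a defensive wrapper follows; the name tagToList is hypothetical, and the check assumes a Tcl version that provides string is list. Note that the check only catches strings that are not lists at all, not values that are silently split differently.

```tcl
# Hypothetical helper (not part of Tcl): rewrite name="value" pairs into
# Tcl list syntax, then refuse to continue if the rewritten string is
# not a well-formed list.
proc tagToList {tag} {
    set taglist [string map [list =\x22 " \x7b" \x22 \x7d] $tag]
    if {![string is list $taglist]} {
        error "cannot treat rewritten tag as a list: $taglist"
    }
    return $taglist
}

set good {body type="text/plain" title="This is my body"}
puts [lindex [tagToList $good] 4]     ;# the title value

# A value with an unbalanced open brace; built with backslash escapes
# so that the literal brace does not upset the Tcl parser here.
set bad "body title=\"an \{ unbalanced brace\""
puts [catch {tagToList $bad} msg]     ;# prints 1: the check rejected it
```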

Another solution splits the tag string into a list, not by white space but by double quotes (again, \x22 is just a wiki-friendly way to insert a double quote character: a \" or {"} will work in the Tcl interpreter):

% set taglist2 [split $tag \x22]
# -> {body type=} text/plain { title=} {This is my body} {}

Obviously, the result needs a little more processing:

  1. the element name is joined up with the first attribute name,
  2. the equal sign stays attached to the attribute name,
  3. the second (and third, etc) attribute name is preceded by leftover whitespace, and
  4. there is an empty element which resulted from splitting at the last double quote before the end of the string.

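The first three problems are easily dealt with: splitting either kind of string on white space leaves the attribute name, with its trailing equal sign, as the second item. (A string consisting of a space followed by non-space characters splits into a list with an empty first item and the non-space substring as the second item.)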

% string trimright [lindex [split {body type=}] 1] =
# -> type
% string trimright [lindex [split { title=}] 1] =
# -> title

and the fourth problem can be solved by breaking out of the loop if any attribute name is the empty string:

% set result [list]
% foreach {name value} $taglist2 {
    if {$name eq {}} { break }
    lappend result [list [string trimright [lindex [split $name] 1] =] $value]
}
% set result
# -> {type text/plain} {title {This is my body}}

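The regexp, string map, and split solutions also handle empty attribute value strings; the scan solution does not, because a %[^"] conversion must match at least one character. None of the solutions handles attribute values that are not surrounded by double quotes.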
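One way the regexp solution might be extended to accept unquoted values as well is sketched below. This is an illustration under stated assumptions rather than a tested parser: the name tagAttributesLoose is hypothetical, an unquoted value is assumed to be a run of characters containing neither white space nor double quotes, and the pattern uses only greedy quantifiers to avoid mixing greedy and non-greedy matching in one expression.

```tcl
# Hypothetical helper (not part of Tcl): accepts name="quoted value"
# as well as name=bareword.
proc tagAttributesLoose {tag} {
    # Group 1: attribute name; group 2: quoted value; group 3: bare value.
    set re {([^=[:space:]]+)=(?:"([^"]*)"|([^"[:space:]]+))}
    set result [list]
    foreach {- name quoted bare} [regexp -inline -all $re $tag] {
        # A bare value is never empty when it matched; otherwise use the
        # quoted value, which may legitimately be the empty string.
        if {$bare ne {}} { set value $bare } else { set value $quoted }
        lappend result [list $name $value]
    }
    return $result
}

set attrs [tagAttributesLoose {input type=checkbox name="my box" value=""}]
foreach pair $attrs {
    lassign $pair name value
    puts "$name -> '$value'"
}
```

An if/else is used instead of an expr conditional so that values that look like numbers are passed through as plain strings.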


AM Also see: Splitting a string on arbitrary substrings