Version 23 of Splitting strings with embedded strings

Updated 2014-03-03 15:47:59 by pooryorick

Splitting strings with embedded strings

Richard Suchenwirth 2001-05-31: Robin Lauren wrote in comp.lang.tcl:

I want to split an argument which contains spaces within quotes into proper name=value pairs. But I can't :)

Consider this example:

set tag {body type="text/plain" title="This is my body"} 
set element [lindex $tag 0]
set attributes [lrange $tag 1 end] ;# *BZZT!* Wrong answer!

My attributes variable becomes the list {type="text/plain"} {title="This} {is} {my} {body"} (perhaps even with the optional backslash before the quotes), which isn't really what I had in mind.

Answer

(Reworked and extended by PL 2014-03-03)

The proper solution is to use a full XML parser like tDOM, because, as the examples below illustrate, any other solution will have holes in its coverage.

If there are always exactly two attribute definitions following an element name, one simple solution is to scan the string, and then enclose the name/value pairs in sublists:

% set parts [scan $tag {%s %[^=]="%[^"]" %[^=]="%[^"]"}]
# -> body type text/plain title {This is my body}
% set result [list]
% foreach {name value} [lrange $parts 1 end] { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}

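One problem with this solution is that HTML attribute values can contain the double-quote character (for example, inside a single-quoted value such as title='say "hi"'), which this format string cannot match.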
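The fixed format string has two further limits that are easy to see in an interactive session. The following is only a sketch with made-up input strings (the img and body tags here are assumptions for illustration): a third attribute is silently ignored, and an empty value stops the scan early, because a %[...] conversion has to match at least one character.

```tcl
# Same format string as above: an element name, then exactly two name="value" pairs.
set fmt {%s %[^=]="%[^"]" %[^=]="%[^"]"}

# Three attributes: the format ends after the second pair, so width is never seen.
set three [scan {img src="a.png" alt="A" width="10"} $fmt]
puts $three   ;# img src a.png alt A

# Empty first value: scanning stops at type="", so the title value is not extracted.
set empty [scan {body type="" title="x"} $fmt]
puts $empty
```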

For a more general solution that accepts any number of attribute definitions (with the same caveat about double quotes in an attribute value), a regexp match might be useful:

% set matches [regexp -inline -all {(\S+?)="(.*?)"} $tag]
# -> type=\"text/plain\" type text/plain {title="This is my body"} title {This is my body}
% set result [list]
% foreach {- name value} $matches { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}

(Note that here, the foreach command extracts three values from the list during each iteration: the first value (stored in the variable named -) is just discarded.)
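The regexp approach can be wrapped in a small procedure. This is a minimal sketch rather than a definitive implementation: the name parseTag is hypothetical, and returning the attributes as a dict (Tcl 8.5 or later) instead of a list of pairs is a design choice made here, which means a repeated attribute name keeps only its last value.

```tcl
# Hypothetical helper: returns a two-element list {elementName attributeDict}.
proc parseTag {tag} {
    # Element name: the first run of non-whitespace characters.
    set element [lindex [regexp -inline {\S+} $tag] 0]
    set attrs [dict create]
    # Same pattern as above; the full match (variable -) is discarded.
    foreach {- name value} [regexp -inline -all {(\S+?)="(.*?)"} $tag] {
        dict set attrs $name $value
    }
    return [list $element $attrs]
}

lassign [parseTag {body type="text/plain" title="This is my body"}] element attrs
puts $element                 ;# body
puts [dict get $attrs title]  ;# This is my body
```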

The problem can also be solved using list/string manipulation commands, but then we need to make sure that we see the data in the same way as Tcl does. To a human, $tag intuitively looks like a list of three items, but according to Tcl list syntax, it has 6 items, and the second item, for example, contains two literal quotes.

% llength $tag
# -> 6
% lmap item $tag { format "{%s}" $item }
# -> {{body}} {{type="text/plain"}} {{title="This}} {{is}} {{my}} {{body"}}

One simple solution is to rewrite the tag string into something that is convenient for list manipulation (careful with the quoting in the string map here!):

(Oops, the syntax highlighting in the wiki renderer was confused by my initial invocation (string map {=\" " \{" \" \}} $tag): the one below works better but obfuscates the code somewhat. \x22 is double quote, \x7b is left brace, \x7d is right brace. Both invocations work equally well in the Tcl interpreter.)

% set taglist [string map [list =\x22 " \x7b" \x22 \x7d] $tag]
# -> body type {text/plain} title {This is my body}
% set result [list]
% foreach {name value} [lrange $taglist 1 end] { lappend result [list $name $value] }
% set result
# -> {type text/plain} {title {This is my body}}
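Because the rewritten string is handed to Tcl's list parser, an attribute value that contains an unbalanced brace breaks this approach. The sketch below uses an assumed input string to show the failure; catch is used so that the error can be inspected instead of aborting the script.

```tcl
# Assumed input: an attribute value that contains a left brace.
set tag "body title=\"a \{ b\""
set taglist [string map [list =\x22 " \x7b" \x22 \x7d] $tag]

# The rewritten string now has more opening than closing braces,
# so list commands refuse to parse it.
set failed [catch {lrange $taglist 1 end} msg]
puts "failed=$failed: $msg"
```

Here failed is 1, and msg holds the list parser's complaint about the unmatched brace.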

Another solution splits the tag string into a list, not by white space but by double quotes (again, \x22 is just a wiki-friendly way to insert a double quote character: a \" or {"} will work in the Tcl interpreter):

% set taglist2 [split $tag \x22]
# -> {body type=} text/plain { title=} {This is my body} {}

Obviously, the result needs a little more processing:

  1. the element name is joined up with the first attribute name,
  2. the equal sign stays attached to the attribute name,
  3. the second (and third, etc) attribute name is preceded by leftover whitespace, and
  4. there is an empty element which resulted from splitting at the last double quote before the end of the string.

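The first three problems are easily dealt with by splitting on whitespace and taking the second item: {body type=} splits into the element name and the attribute name, and a string consisting of a space and some non-space characters splits into an empty first item and the non-space substring as the second item. The trailing equal sign is then trimmed off: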

% string trimright [lindex [split {body type=}] 1] =
# -> type
% string trimright [lindex [split { title=}] 1] =
# -> title

and the fourth problem can be solved by breaking out of the loop if any attribute name is the empty string:

% set result [list]
% foreach {name value} $taglist2 {
    if {$name eq {}} { break }
    lappend result [list [string trimright [lindex [split $name] 1] =] $value]
}
% set result
# -> {type text/plain} {title {This is my body}}
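The split-based steps can be collected into one procedure. This is a sketch under stated assumptions: the name splitAttributes is hypothetical, and it takes the last whitespace-separated word (lindex ... end) rather than the item at index 1, a small deviation from the code above that also tolerates more than one space between attributes.

```tcl
# Hypothetical helper: returns a list of {name value} pairs.
proc splitAttributes {tag} {
    set result [list]
    # Splitting on the double quote yields alternating name parts and values.
    foreach {name value} [split $tag \"] {
        # Last whitespace-separated word before the quote, minus the trailing "=".
        set name [string trimright [lindex [split $name] end] =]
        # The empty piece after the final quote ends the loop.
        if {$name eq {}} { break }
        lappend result [list $name $value]
    }
    return $result
}

puts [splitAttributes {body type="text/plain" title="This is my body"}]
# -> {type text/plain} {title {This is my body}}
```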

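The regexp, string map, and split solutions also handle empty attribute values; the scan solution does not, because a %[...] conversion must match at least one character. None of the solutions handles attribute values that are not surrounded by double quotes.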
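If unquoted or single-quoted values have to be tolerated and a real parser is not an option, the regexp approach can be stretched a little further. The following is only a sketch: the procedure name looseAttributes and the pattern are assumptions made for illustration, not a complete HTML attribute grammar, and character references such as &quot; are left untouched.

```tcl
# Hypothetical helper: accepts name="v", name='v', and name=v.
proc looseAttributes {tag} {
    # One alternative per quoting style; only one value group takes part in a match.
    set re {([^\s=]+)=(?:"([^"]*)"|'([^']*)'|([^\s"']+))}
    set result [list]
    foreach {- name dq sq bare} [regexp -inline -all $re $tag] {
        # Groups that did not take part are empty strings, so concatenating
        # the three value groups gives the value that was actually matched.
        lappend result [list $name $dq$sq$bare]
    }
    return $result
}

puts [looseAttributes {body type=text/plain title='say "hi"' lang="en"}]
# -> {type text/plain} {title {say "hi"}} {lang en}
```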

See Also

split