Unicode file reader

Unicode file reader auto-detects and reads files in 16-bit Unicode representation

See Also

Unicode and UTF-8

Description

Richard Suchenwirth 1999-07-23: In order to auto-detect and read files (e.g. Tcl sources with Hebrew literals) in 16-bit Unicode representation, I wrote the following:

aspect 2014-02-12: WARNING: The following code is (unusually for RS!) rather buggy. It's a good example for newcomers of some easy mistakes to make:

  • the initial gets reads $f in the system encoding, which could be anything (and might not let BOMs through unscathed)
  • the main read expects a number of characters in its second argument, but file size counts bytes
  • comparing strings with == instead of eq
  • info tclversion is not documented as returning a number (see Donald Porter's comment below)

Doubtless it worked for RS, as these bugs are mostly harmless, but such errors add up to hard-to-trace bugs when you start distributing code ...

proc file:uread {fn} {
    set encoding ""
    set f [open $fn r]
    if {[info tclversion]>=8.1} {
        gets $f line
        if {[regexp \xFE\xFF $line]||[regexp \xFF\xFE $line]} {
            fconfigure $f -encoding unicode
            set encoding unicode
        }
        seek $f 0 start ;# rewind -- real reading is still to come
    }
    set text [read $f [file size $fn]]
    close $f
    if {$encoding=="unicode"} {
        regsub -all "\uFEFF|\uFFFE" $text "" text
    }         
    return $text
}
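A minimal sketch of how the points in aspect's warning might be addressed. It is not RS's code: the name uread_sketch is hypothetical, and instead of fconfigure -encoding unicode it decodes the 16-bit units by hand with binary scan, so that both byte orders are handled. It assumes plain 16-bit units (no surrogate pairs), does no end-of-line translation in the Unicode branch, and needs Tcl 8.4 or later for eq.

```tcl
proc uread_sketch {fn} {
    set f [open $fn r]
    fconfigure $f -translation binary   ;# raw bytes: the BOM arrives unaltered
    set bom [read $f 2]
    if {$bom eq "\xFF\xFE"} {
        set fmt s*                      ;# little-endian 16-bit units
    } elseif {$bom eq "\xFE\xFF"} {
        set fmt S*                      ;# big-endian 16-bit units
    } else {
        # No BOM: rewind and read as ordinary text in the system encoding.
        seek $f 0 start
        fconfigure $f -translation auto -encoding [encoding system]
        set text [read $f]              ;# no size argument: read to end of file
        close $f
        return $text
    }
    set units {}
    binary scan [read $f] $fmt units    ;# everything after the BOM
    close $f
    set text ""
    foreach u $units {
        # binary scan yields signed values; mask to get the code unit
        append text [format %c [expr {$u & 0xFFFF}]]
    }
    return $text
}
```

Looping over every code unit is slow for large files; the sketch favours clarity over speed.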

Works on both ASCII and Unicode files, but not on byte-swapped ones: the code recognizes FFFE, yet it does not swap the bytes ;-( See also: Unicode and UTF-8


Frank Pilhofer contributed the following swapper in comp.lang.tcl. It operates on a string, data, which might hold a whole Unicode file:

Fortunately, swapping is pretty easy in Tcl, at least in LOC:

private method wordswap {data} {
    binary scan $data s* elements
    return [binary format S* $elements]
}

jima: I think it is better to use:

binary scan $data c* elements

Can any expert test my (jima's) point?
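One way to try this out is to run both scan formats on a small sample, assuming the binary format S* line of wordswap stays as it is (the variable names here are only for illustration):

```tcl
set data [binary format H* 01020304]       ;# four sample bytes

binary scan $data s* words                 ;# two little-endian 16-bit values
binary scan $data c* bytes                 ;# four separate 8-bit values

set swapped [binary format S* $words]      ;# written back big-endian
set widened [binary format S* $bytes]      ;# each byte becomes a 16-bit value

binary scan $swapped H* swappedHex         ;# 02010403
binary scan $widened H* widenedHex         ;# 0001000200030004
```

With s* each byte pair is exchanged. With c* and an unchanged format line nothing is swapped, and the data doubles in length.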

So I'm now using the following code for reading:

global tcl_platform
# Read the first two bytes as a big-endian 16-bit value.
if {[binary scan $data S bom] == 1} {
    if {$bom == -257} {
        # 0xFEFF: the data is big-endian
        if {$tcl_platform(byteOrder) eq "littleEndian"} {
            set data [wordswap [string range $data 2 end]]
        } else {
            set data [string range $data 2 end]
        }
    } elseif {$bom == -2} {
        # 0xFFFE: the data is little-endian
        if {$tcl_platform(byteOrder) eq "littleEndian"} {
            set data [string range $data 2 end]
        } else {
            set data [wordswap [string range $data 2 end]]
        }
    } elseif {$tcl_platform(byteOrder) eq "littleEndian"} {
        # no BOM: assume big-endian data
        set data [wordswap $data]
    }
}

Donald Porter:

Slightly off-topic note: The code example above tests for the Tcl version with

if {[info tclversion] >= 8.1} ...

A better way of testing that is to use:

if {[package vcompare [package provide Tcl] 8.1] >= 0} ...

That will continue working if Tcl releases are ever labeled with version numbers more than two levels deep, or if/when a minor release > 9 is released.
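A short illustration of the minor-release case, using the version numbers 8.10 and 8.9 purely as examples:

```tcl
# As a number, 8.10 is just 8.1, which is less than 8.9.
set numeric [expr {"8.10" >= 8.9}]

# package vcompare compares component by component, so 10 > 9.
set versioned [expr {[package vcompare 8.10 8.9] >= 0}]

puts "numeric: $numeric, package vcompare: $versioned"
```

The numeric test reports 0 (false), while the package vcompare test reports 1 (true).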


RS:

Sure. I admit yours is The Right Way ;-) -- only it's about double as long as mine... Maybe I'm pampered, but I've grown to expect it could be done even simpler, so that frequent constructs are nicely wrapped:

proc version {"of" pkg op vers} {
    expr [package vcompare [package provide $pkg] $vers] $op 0
}

Then we can write this sugar: (cf Salt and Sugar)

if [version of Tcl >= 8.1] {...