Reading and parsing RFC 822 headers

Here's the package to read and parse RFC 822-like (or RFC 2822-like) headers, which are used in a variety of applications nowadays (e-mail, news and Debian Packages.gz files come to mind).

When developing the package, the ability to re-produce a header just read is important. E.g. the following code should produce a header file containing the part of the message file until (and including) the first empty line:

 ## get header from `message':
 array unset header
 set chan  [ open "message" r ]
 set order [ read-parse-header $chan header ]
 close $chan

 ## write header to `header':
 set chan  [ open "header" w ]
 write-header $chan $order header
 puts $chan ""
 close $chan

Please note, however, that in mailbox files the header is often prepended with From (note the absence of colon) line, which isn't part of the RFC 822 syntax, so the read-parse-header procedure will signal an error on seeing this sort of input.

One approach to this problem can be seen in Mbox to MH directory conversion tool.


Namespace
::rfc-822

The procs below belong to this namespace, exported and ready to use.

Function
header-fields-order arrayName

Returns the order of fields in header, which is currently stored in array element with key :order. Empty list is returned if no such element exists.

Procedure
header-field-finish arrayName

This procedure is called to signal that the header processing is done.

To allow header field continuation, an implementation of header-parse-line must preserve some information between the calls. The information is currently stored in array elements with keys beginning with a couple of colons. This procedure flushes such a temporary information (if any), enforcing the array to actually represent the header read.

This procedure is automagically called by header-parse-line when it sees an empty line, which is a common convention of terminating the header.

Function
header-parse-line arrayName line

Parses the line given, modifying the array arrayName accordingly. Given an empty line calls header-field-finish and returns 1, otherwise returns 0.

See the description of the read-parse-header for more information.

Function
read-parse-header ?channelId? arrayName

Reads and parses RFC 822 header fields from channelId (defaulting to stdin if omitted), filling the array specified by arrayName. Returns the order of fields read (as a list).

Header fields are keyed by header name and stored as lists, with each occurence of the header field resulting in exactly one element in the list. Array isn't cleared before processing, it's left to the caller. Leading whitespace is not removed. Header values continued over several lines result in newlines being embedded in the list elements.

Returned order of fields will contain several occurences of the same header name, in case there were multiple occurences of the header.

Procedure
write-header ?channelId? order arrayName

Writes header fields specified by order argument to the given channelId (defaulting to stdout if omitted), obtaining values from the array specified by arrayName.

NB: Header fields not included in order will not be output. This can be used to strip away unwanted headers.


 ### rfc822h.tcl --- RFC 822 (and alike) headers  -*- Tcl -*-
 ## $Id: 15254,v 1.5 2006-01-16 07:00:23 jcw Exp $

 ### Copyright (C) 2005, 2006 Ivan Shmakov

 ##  This library is free software; you can redistribute it and/or modify
 ##  it under the terms of the GNU Lesser General Public License as
 ##  published by the Free Software Foundation; either version 2.1 of the
 ##  License, or (at your option) any later version.

 ##  This library is distributed in the hope that it will be useful, but
 ##  WITHOUT ANY WARRANTY; without even the implied warranty of
 ##  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 ##  Lesser General Public License for more details.

 ##  You should have received a copy of the GNU General Public License
 ##  along with this program; if not, write to the Free Software
 ##  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301
 ##  USA

 ### Code:

 namespace eval ::rfc-822 {
     variable re-header-name "\[\\041-\\071\\073-\\177\]+"
     variable re-header-del  ":"
     namespace export \
         header-fields-order \
         header-field-finish \
         header-parse-line \
         read-parse-header \
         write-header
 }

 ##  In the code below, header is an array keyed by header field name,
 ##  and whose values are corresponding header field values.  Because RFC
 ##  822 header field name cannot contain colons, the convention is to
 ##  put additional information (such as the order of fields) into array
 ##  elements with keys beginning with one or more colons.  Eg. the order
 ##  of fields is stored in the element with key ":colon".  The keys of
 ##  the private fields of the package begin with two colons.

 ## : Get the value of a variable, or default if it's not existent
 proc ::rfc-822::safe-get { varn { default "" } } {
     upvar 1 $varn var
     ## .
     expr { [ info exists var ] ? [ set var ] : $default }
 }

 ## : Get the order of fields in the header
 proc ::rfc-822::header-fields-order { arrayName } {
     ## .
     safe-get header(:order) [ list ]
 }

 ## : Store an accumulated header line (if any) in the header
 proc ::rfc-822::header-field-finish { arrayName } {
     upvar 1 $arrayName header
     set varnf header(::last-field)
     set varnv header(::last-value)
     if { ! [ info exists $varnf ] } {
         ## no header field to finish
         ## .
         return
     }
     set field [ set $varnf ]
     lappend header($field) [ set $varnv ]
     lappend header(:order) $field
     unset $varnf $varnv
     ## .
     return
 }

 ## : Parse just one given header line, modifying the header
 proc ::rfc-822::header-parse-line { arrayName line } {
     ## => 0 | 1 (done)
     variable re-header-name
     variable re-header-del
     upvar 1 $arrayName header

     if { [ string length $line ] == 0 } {
         ## end of header, store accumulated field, if any
         header-field-finish header
         return 1
     } elseif { [ string match "\[ \t\]*" $line ] } {
         ## accumulate continuation of a header field
         if { [ info exists header(::last-field) ] == 0 } {
             error "no header to continue"
         }
         append header(::last-value) "\n" $line
     } elseif { [ regexp -- \
                      "^(${re-header-name})${re-header-del}(.*)\$" \
                      $line \
                      dummy n v ] } {
         ## new header field
         header-field-finish header
         set header(::last-field) $n
         set header(::last-value) $v
     } else {
         error "cannot parse header line"
     }

     ## .
     return 0
 }

 ## : Read lines from the channel, parsing them as the RFC 822 header
 proc ::rfc-822::read-parse-header { { channelId stdin } arrayName } {
     upvar 1 $arrayName header

     set done 0
     if { ! [ info exists header(:order) ] } {
         set header(:order) [ list ]
     }
     while { ! $done } {
         if { [ gets $channelId line ] < 0 } {
             error "eof or no data while reading header"
         }
         if { [ header-parse-line header $line ] } {
             set done 1
         }
     }

     ## .
     set header(:order)
 }

 ## : Write lines to the given channel, formatting them as the header
 proc ::rfc-822::write-header { { channelId stdout } order arrayName } {
     ## FIXME: should it write header fields not in ORDER?
     upvar 1 $arrayName header
     array unset is
     foreach hf $order {
         if { ! [ info exists is($hf) ] } {
             set is($hf) 0
         }
         set i $is($hf)
         set l $header($hf)
         if { $i >= [ llength $l ] } {
             ## silently ignore this case
             continue
         }
         puts $channelId \
             [ format "%s:%s" $hf [ lindex $l $i ] ]
         incr is($hf)
     }
     ## .
     return
 }

 package provide rfc822::headers 0.3

 ### Emacs stuff
 ## Local variables:
 ## fill-column: 72
 ## indent-tabs-mode: nil
 ## ispell-local-dictionary: "english"
 ## mode: outline-minor
 ## outline-regexp: "###\\|proc"
 ## End:
 ## LocalWords:  
 ### rfc822h.tcl ends here