Updated 2012-02-02 03:56:13 by RLE

Mini-Tutorial: Content type and file extension -- CMcC 20040929

Users of tclhttpd (any web author, really) should realise that the file extension (.htm, .html, .pdf, .exe, whatever) bears only a passing relationship to the type of the file sent to a browser, and to how the browser will process it.

Conventionally, .html means the file has mime type text/html, but the type is actually determined by the HTTP header Content-Type, whose value is expected to be a mime type.

It is quite possible, in other words, to have a file named fred.pdf which contains an HTML document, deliver it to a browser as HTML, and have it processed as HTML, merely by sending the appropriate Content-Type. The converse is also true. In fact, MSIE had a security flaw in which a file named fred.html could be an executable, and would be executed (I think they checked permission to render on the .html file suffix, not the actual content type).
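To make the point concrete, here is a minimal sketch in plain Tcl of a response whose type is chosen by the sender rather than by the file name. The proc name buildResponse is hypothetical and this is not how tclhttpd builds its replies; it only shows where the Content-Type header sits.

```tcl
# Build a minimal HTTP/1.0 response. The type is supplied by the caller;
# nothing here looks at a file name or suffix.
proc buildResponse {type body} {
    set bytes [encoding convertto utf-8 $body]
    set head "HTTP/1.0 200 OK\r\n"
    append head "Content-Type: $type\r\n"
    append head "Content-Length: [string length $bytes]\r\n\r\n"
    return $head$bytes
}

# Pretend this text was read from a file called fred.pdf.
set html "<html><body>Hello</body></html>"
puts [buildResponse text/html $html]
```

A browser receiving this reply treats the body as HTML, whatever the URL or the file on disk was called.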

Why does this matter?

The good news is that mostly it doesn't matter: tclhttpd takes care of the conventional mappings of .suffix to mime-type, so normally a .html file will be returned with the expected type. However, it does this by means of a file mime.types in the lib directory, and that mapping is necessarily limited to a small set of well-known (or asserted) associations. You may need to extend the table to support your favourite .suffix, or you can directly register a mapping via [Mtype_Add $suffix $type].
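The suffix lookup can be pictured with a small stand-alone sketch. The names suffixType, addType and typeOf are hypothetical, and the fallback type is an assumption; this is not tclhttpd's implementation. Inside a running tclhttpd, the registration step corresponds to the [Mtype_Add] call mentioned above.

```tcl
# Hypothetical stand-in for the suffix table that mime.types populates.
array set suffixType {
    .html text/html
    .txt  text/plain
    .pdf  application/pdf
}

# Register (or override) a mapping, analogous in spirit to Mtype_Add.
proc addType {suffix type} {
    global suffixType
    set suffixType([string tolower $suffix]) $type
}

# Look up the type for a path; unknown suffixes get the default
# (the default used here is an assumption for illustration).
proc typeOf {path {default application/octet-stream}} {
    global suffixType
    set suffix [string tolower [file extension $path]]
    if {[info exists suffixType($suffix)]} {
        return $suffixType($suffix)
    }
    return $default
}

addType .svg image/svg+xml
puts [typeOf logo.SVG]
puts [typeOf fred.unknown]
```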

It matters because the association is not set in stone, but rather established by a loose consensus. It's as well to be aware that your data is being typed, and that you can intervene in this process (particularly using the [Doc_$type] post-processing proc).

It also matters because a file's <!DOCTYPE> declaration might not agree with its mime type as calculated by tclhttpd, leading to perhaps unpredictable results. Some browsers go into a more or less strict rendering mode if the content type differs from the declared DOCTYPE. This might have an effect on the way your HTML or XHTML pages are rendered.

tclhttpd also provides several ways to dynamically determine the content type of a file or data it's about to return:

  • [Httpd_Return*] procs all take a type argument, which usually defaults to text/html
  • Templates can set the element data(contentType) to their preferred content-type.
  • [Doc_$type] commands call [Httpd_Return*] directly, so can specify their content-type.
  • Each Direct domain can set its content-type by storing it in a per-proc global variable, e.g. [set Device/a/b/c text/plain] for a direct domain Device and procedure /a/b/c.
  • Cgi domain probably also allows content-type setting, but someone who's familiar with it can fill this in.
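The Direct domain case from the list above can be sketched as follows. The URL prefix /device is an assumption for illustration; the proc and variable names are the ones used in the list. The guard around [Direct_Url] lets the snippet load in a plain tclsh, where the registration is simply skipped.

```tcl
# Handler for the URL /device/a/b/c (the /device prefix is illustrative).
proc Device/a/b/c {} {
    return "status: ok\n"
}

# Global variable named after the proc; the Direct domain reads it as the
# content type of whatever the proc returns. Set it at global scope.
set Device/a/b/c text/plain

# Register the domain only when running inside tclhttpd.
if {[llength [info commands Direct_Url]]} {
    Direct_Url /device Device
}
```

Without the variable, the result would be returned with the default type, which is usually text/html.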

It should also be noted that browsers (in theory) negotiate the permissible content type with the server by sending, in the Accept request header, a preference-weighted list of the mime types they are prepared to handle. It is therefore quite permissible to request files with no suffix, and to calculate different representations and types depending on the browser's willingness to process them. For example, one could (if one were strange) return PDF files for every page on a website.

In practice, this is much less useful than it might be, because most browsers claim to accept anything by including the catch-all mime type */* in that list. You're not violating any standard by returning an application/postscript file in response to a request for index.html (for example), although I'm not necessarily recommending it :)
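A rough sketch of such negotiation in plain Tcl, assuming the server has a list of types it can produce for a resource, in its own order of preference. The proc names qualityFor and chooseType are hypothetical, the parsing is deliberately simplified (it ignores parameters other than q), and none of this is part of tclhttpd.

```tcl
# Return the q value the Accept header assigns to a type, taking the most
# specific matching range: exact type, then major/*, then */*.
proc qualityFor {header type} {
    set major [lindex [split $type /] 0]
    set bestRank 0
    set quality 0.0
    foreach item [split $header ,] {
        set params [split $item \;]
        set range [string tolower [string trim [lindex $params 0]]]
        set q 1.0
        foreach param [lrange $params 1 end] {
            if {[regexp {^\s*q\s*=\s*([0-9]*\.?[0-9]+)\s*$} $param -> value]} {
                set q $value
            }
        }
        if {$range eq $type} {
            set rank 3
        } elseif {$range eq "$major/*"} {
            set rank 2
        } elseif {$range eq "*/*"} {
            set rank 1
        } else {
            continue
        }
        if {$rank > $bestRank} {
            set bestRank $rank
            set quality $q
        }
    }
    return $quality
}

# Pick the offered type with the highest q; ties go to the type the
# server listed first. Returns "" when nothing is acceptable.
proc chooseType {header offered} {
    set best ""
    set bestQ 0.0
    foreach type $offered {
        set q [qualityFor $header $type]
        if {$q > $bestQ} {
            set best $type
            set bestQ $q
        }
    }
    return $best
}

set accept "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
puts [chooseType $accept {application/pdf text/html}]
puts [chooseType "*/*" {application/pdf text/html}]
```

The second call shows why */* makes negotiation toothless: every offered type ties, so the server's own preference decides.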

Believe it or not, there's actually some discussion (in Apache) of using multiple suffixes to represent language variants, so fred.en.html and fred.fr.html would represent the same page in English and French respectively.