Updated 2013-01-27 03:44:41 by RLE

A possible tclvfs - CMcC 20041029

Storage of metadata (data about data) is an important and undersupported facility of file systems.

Tcl supports the [file attributes] command to store some system-defined metadata, and tclvfs permits this to be redefined and extended.

This makes possible a vfs which intercepts the [file attributes] command and loads/stores attribute (name,value) pairs in some parallel store. Such a vfs would be stackable over other filesystems, so as to provide generic metadata for any filesystem. Applications could then simply assume that they can store arbitrary metadata.
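As a rough illustration, the dispatch such a handler would perform might look like the sketch below. It assumes the tclvfs handler protocol (subcommand first, then root, relative path, and the path in the underlying filesystem); all the names here (metavfs::handler, the metadata array) are invented, only the fileattributes subcommand is sketched, and attributes are addressed by name rather than by the integer indexes the real protocol uses, just to keep the sketch short. A real handler would also have to service open, stat, access, matchindirectory and so on by delegating to the filesystem below, and would be registered with something like [vfs::filesystem mount $path [list metavfs::handler]].

```tcl
namespace eval metavfs {
    variable metadata            ;# maps path -> dict of name/value pairs
    array set metadata {}
}

# Hypothetical attribute-intercepting handler, tclvfs-shaped:
#   no extra args  -> list the attribute names for the path
#   one arg (name) -> return that attribute's value
#   two args       -> set name to value in the parallel store
proc metavfs::handler {cmd root relative actualpath args} {
    variable metadata
    if {$cmd ne "fileattributes"} {
        error "only fileattributes is sketched here; a real vfs delegates $cmd"
    }
    set path [file join $root $relative]
    if {![info exists metadata($path)]} { set metadata($path) {} }
    switch -- [llength $args] {
        0 { return [dict keys $metadata($path)] }
        1 { return [dict get $metadata($path) [lindex $args 0]] }
        2 { dict set metadata($path) {*}$args; return }
        default { error "wrong # args to fileattributes" }
    }
}
```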

The question is what form of parallel store would be best?

  • parallel hierarchy - metadata is stored in a file hierarchy which parallels the structure of the original, so there's a mapping from ${path} -> ${path}.metadata which contains an array dump of name->value pairs. CONS: expensive/slow, PROS: persistence for free.
  • hidden directory - metadata stored in special files hidden in a per-directory .metadata directory. CONS: expensive/slow, invasive (less so than invasive-interpolation, below), permissions problems, PROS: faster than parallel-hierarchy, metadata is joined with data (less so than invasive-interpolation)
  • persistent array - metadata is stored in an array $metadata($path) as an [array get] form per file, to be loaded/stored once, then accessed from memory. Could use tie for persistence. CONS: doesn't scale well, persistence needs work, PROS: fast.
  • metakit - metadata stored in a metakit file, loaded/stored as required. CONS: not pure tcl, slower than persistent-array, PROS: scales, faster than parallel-FS.
  • invasive interpolation - metadata stored at the head of each file in (say) RFC822-style name:value lines. PROS: some data (e.g. emails) are already of this form, some data (e.g. caches) can be coerced to this form, CONS: invasive, wrecks general files for general (non-Tcl) use.
  • multifork files - files may have several "forks", one of which is the traditional file and another of which can be used to store metadata. Example: Mac resource forks. Possible implementation on top of a monofork file system: each "file with metadata" is really a directory, where each fork is a separate file, i.e., there could be one $file/data with the file as such and one $file/about with the metadata. PROS/CONS: Similar to "hidden directory".
  • (add more here)
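To make the persistent-array option above concrete, here is one possible shape for it: a single global array keyed by path, each element holding name/value pairs, flushed to and reloaded from a dump file in [array get] form. The proc names and the use of a flat dump file (rather than tie) are assumptions for the sake of the sketch.

```tcl
# Persistent-array sketch: ::metadata maps path -> dict of attributes.

proc metaAttr {path name args} {
    global metadata
    if {[llength $args]} {
        # set: metaAttr $path $name $value
        if {![info exists metadata($path)]} { set metadata($path) {} }
        dict set metadata($path) $name [lindex $args 0]
    } elseif {[info exists metadata($path)]
              && [dict exists $metadata($path) $name]} {
        return [dict get $metadata($path) $name]
    } else {
        return ""
    }
}

proc metaSave {dumpfile} {
    global metadata
    set f [open $dumpfile w]
    puts -nonewline $f [array get metadata]   ;# one big path/dict list
    close $f
}

proc metaLoad {dumpfile} {
    global metadata
    array unset metadata
    if {[file exists $dumpfile]} {
        set f [open $dumpfile r]
        array set metadata [read $f]
        close $f
    }
}
```

The load/store-once cost is visible here: the whole store is read and written as a unit, which is exactly why this option is fast in memory but doesn't scale well.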

SEH -- 4feb05 -- I've been thinking a lot lately about how useful a metadata vfs would be. I think it could lead to a whole new way of developing applications: instead of building each application around a relational database designed to provide a single stovepiped function, one could design and improve new applications on an ongoing basis that all access the same filesystem space, each making use of its own preferred attributes. There's no reason why a single file couldn't serve as a weblog entry, a bug report, a todo list item, a mail message, a calendar event, a wiki page, a usenet post, an FAQ contribution, a code patch submission and a documentation paragraph; it would simply be a matter of which application you use to access the filesystem space.

I favor an RFC822-style header storage format. Steve Cassidy has made persuasive arguments for this option while discussing metadata storage for CANTCL. (Namely, it's well-defined, well-understood, and can be used by any other application that can access a file.) If you didn't want to mix your metadata with your file contents, you could optionally define a "-contents" attribute whose value was the location of the original file and store the attributes separately. That is to say, a vfs with arbitrary attributes would allow you some leeway in choosing your storage method; you wouldn't have to choose one of the above options irrevocably.
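A minimal round-trip for that header format might look as follows. This is a sketch only: the proc names are invented, RFC822 continuation lines are ignored, and the "-contents" indirection is not implemented. Attributes are Name: value lines, separated from the body by a blank line.

```tcl
# Split text into an attribute dict and the remaining body.
proc metaParse {text} {
    set attrs {}
    set lines [split $text \n]
    set i 0
    foreach line $lines {
        if {$line eq ""} { incr i; break }        ;# blank line ends headers
        if {![regexp {^([^:]+):\s*(.*)$} $line -> name value]} break
        dict set attrs [string trim $name] $value
        incr i
    }
    list $attrs [join [lrange $lines $i end] \n]
}

# Re-serialize an attribute dict and body into header form.
proc metaFormat {attrs body} {
    set out {}
    dict for {name value} $attrs { lappend out "$name: $value" }
    lappend out "" $body
    join $out \n
}
```

Because the result is still an ordinary text file, any application that understands mail-style headers can read the metadata, which is the main argument for this option.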

Such a vfs could do more than get/set name,value pairs. It could allow optional hooks to procedures associated with getting and setting attribute values. Imagine an attribute called "md5" or "sha1". Such an attribute could be calculated on the fly at the time it's requested, rather than calculated and stored once with the attendant risk of becoming inaccurate with the next file write.
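One way that hook table could be arranged: map attribute names to scripts evaluated at read time, falling back to a stored value when no hook exists. Everything here is hypothetical (the attrHook and stored arrays, the attrRead proc); the md5 entry assumes tcllib's md5 package is available, and a dependency-free "size" hook is included alongside it.

```tcl
# Scripts run with $path visible; computed fresh on every read,
# so the value can never go stale after a file write.
array set ::attrHook {
    size {file size $path}
    md5  {package require md5; md5::md5 -hex -file $path}
}
array set ::stored {}            ;# ordinary stored attributes: path -> dict

proc attrRead {path name} {
    global attrHook stored
    if {[info exists attrHook($name)]} {
        return [eval $attrHook($name)]           ;# computed attribute
    }
    if {[info exists stored($path)] && [dict exists $stored($path) $name]} {
        return [dict get $stored($path) $name]   ;# plain stored attribute
    }
    return ""
}
```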

Such a filesystem stacked on top of a metakit vfs could take advantage of the automatic file contents indexing metakit already does to do fast SQL-style queries based on attribute values, thus allowing the design of issue tracking, project management and ERP-type applications, with search and reporting functions. You might be able to do away with relational databases entirely for small-to-medium-sized systems, and best of all you'd get your application integration for free, instead of it being an expensive and complicated extra step.

CMcC one lovely thing about metakit as a substrate for a vfs is that you can add arbitrary metadata to it quite easily (and dynamically) as new columns, which can then be treated as [file attributes].

DKF: [Metadata] editing - a way for users to really cause havoc!

Want to point out that 'smarter' filesystems have this support built into them already. E.g., FreeBSD's UFS filesystem supports (almost) arbitrary key/value extended attributes on files.

slebetman: Longs for the resource forks of Classic MacOS where metadata can be anything from text strings to embedded images to executable code (I do find embedding executable code as metadata to be kind of hackish though).

SEH 20060824 --it looks like libsqlfs (a SQLite backed virtual filesystem) solves a lot of these issues. Now if only a Tcl binding and a Tclvfs version would appear...