A data file metaformat is a set of syntactic and lexical conventions that is either formally standardized or sufficiently well established by practice that there are standard service libraries to handle marshalling and unmarshalling it.
Unix has evolved or adopted metaformats suitable for a wide range of applications. It is good practice to use one of these (rather than an idiosyncratic custom format) wherever possible. The benefits begin with the amount of custom parsing and generation code that you may be able to avoid writing by using a service library. But the most important benefit is that developers and even many users will instantly recognize these formats and feel comfortable with them, which reduces the friction costs of learning new programs.
In the following discussion, when we refer to “traditional Unix tools” we mean the combination of grep(1), sed(1), awk(1) and tr(1) for doing text searches and transformations. Perl and other scripting languages tend to have good native support for parsing the line-oriented formats that these tools encourage.
Our first case study in textual data metaformats was the /etc/passwd file. This format (one record per line, colon-separated fields) is very traditional under Unix and frequently used for tabular data. Other classic examples include the /etc/group file describing security groups and the /etc/inittab file used to control startup and shutdown of Unix service programs at different run levels of the operating system.
Data files in this style are expected to support inclusion of colons in the data fields via backslash escaping. More generally, code that reads them is expected to support record continuation by ignoring backslash-escaped newlines, and to allow embedding non-printable character data via C-style backslash escapes.
This format is most appropriate when the data is tabular, keyed by a name (in the first field), and records are predictably short (less than 80 characters long). It works well with traditional Unix tools.
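As a minimal sketch of the field-splitting convention described above, the following Python function splits one record on colons while treating a backslash as "take the next character literally". The function name split_dsv is hypothetical, and the sketch deliberately leaves out record continuation and C-style escapes such as \n.

```python
def split_dsv(line, sep=":"):
    """Split one record into fields, honoring backslash escapes."""
    fields, current = [], []
    chars = iter(line.rstrip("\n"))
    for ch in chars:
        if ch == "\\":
            # Take the next character literally; a lone trailing
            # backslash is kept as-is.
            current.append(next(chars, "\\"))
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields
```

Called on a password-file-style line, this returns one string per field; a field written as a\:b comes back as a:b.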
Occasionally one sees field separators other than the colon, such as the pipe character | or even an ASCII NUL. Old-school Unix practice used to favor tabs, a preference reflected in the defaults for cut(1) and paste(1); but this has gradually changed as format designers became aware of the many small irritations that ensue from the fact that tabs and spaces are not visually distinguishable.
This format is to Unix what CSV (comma-separated value) format is under Microsoft Windows and elsewhere outside the Unix world. CSV (fields separated by commas, double quotes used to escape commas, no continuation lines) is rarely found under Unix.
The RFC-822 metaformat derives from the textual format of Internet electronic mail messages; RFC822 is the original Internet RFC describing this format (since superseded by RFC2822). MIME (Multipurpose Internet Mail Extensions) provides a way to embed typed binary data within RFC822-format messages. (Web searches on either of these names will turn up the relevant standards.)
In this metaformat, record attributes are stored one per line, named by tokens resembling mail header-field names and terminated with a colon followed by whitespace. Field names do not contain whitespace; conventionally a dash is substituted instead. The attribute value is the entire remainder of the line, exclusive of trailing whitespace and newline. A physical line that begins with a tab or other whitespace is interpreted as a continuation of the current logical line.
A blank line may be interpreted either as a record terminator or as an indication that unstructured text follows.
Under Unix, this is the traditional and preferred textual metaformat for attributed messages or anything that can be closely analogized to electronic mail. Usenet news uses it; so do the HTTP 1.1 (and later) formats used by the World Wide Web. It is very convenient for editing by humans. Traditional Unix search tools are still good for attribute searches, though finding record boundaries will be a little more work than in a record-per-line format.
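The rules above (one attribute per line, colon after the name, leading whitespace marks a continuation, blank line ends the headers) are simple enough to sketch by hand. The Python below is an illustration only, not a conforming RFC 822 parser; the name parse_headers and the choice to join continuation lines with a single space are assumptions made for the example.

```python
def parse_headers(text):
    """Return (list of (name, value) pairs, body text)."""
    headers = []
    lines = text.splitlines()
    body_start = len(lines)
    for i, line in enumerate(lines):
        if not line.strip():
            # Blank line: headers end here, unstructured text follows.
            body_start = i + 1
            break
        if line[0] in " \t" and headers:
            # Continuation: fold onto the previous logical line.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
        else:
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"line {i + 1}: no colon in {line!r}")
            headers.append((name.strip(), value.strip()))
    body = "\n".join(lines[body_start:])
    return headers, body
```

Returning a list of pairs rather than a dictionary preserves field order and repeated field names, both of which occur in real mail headers.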
For examples of this format, look in your mailbox.
Fortune-cookie format is used by the fortune(1) program for its database of random quotes. It is appropriate for records that are just bags of unstructured text. It simply uses % followed by newline (or sometimes %% followed by newline) as a record separator. Example 5.3 is an example section from a file of email signature quotes:
Example 5.3. A fortune file example
"Among the many misdeeds of British rule in India, history will look
upon the Act depriving a whole nation of arms as the blackest."
-- Mohandas Gandhi, "An Autobiography", pg 446
%
The people of the various provinces are strictly forbidden to have in their
possession any swords, short swords, bows, spears, firearms, or other types
of arms. The possession of unnecessary implements makes difficult the
collection of taxes and dues and tends to foment uprisings.
-- Toyotomi Hideyoshi, dictator of Japan, August 1588
%
"One of the ordinary modes, by which tyrants accomplish their purposes
without resistance, is, by disarming the people, and making it an
offense to keep arms."
-- Constitutional scholar and Supreme Court Justice Joseph Story, 1840
It is good practice to accept whitespace after % when looking for record delimiters. This helps cope with human editing mistakes.
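A tolerant record splitter along these lines might look like the following Python sketch. The names SEPARATOR and split_fortunes are hypothetical; the pattern accepts % or %% alone on a line, optionally followed by stray spaces or tabs.

```python
import re

# A separator is a line holding % or %%, optionally followed by
# stray spaces or tabs left behind by a human editor.
SEPARATOR = re.compile(r"^%%?[ \t]*$", re.MULTILINE)

def split_fortunes(text):
    """Return the records of a fortune-cookie file as a list of strings."""
    pieces = SEPARATOR.split(text)
    # Drop empty records and the newlines that hugged each separator.
    return [piece.strip("\n") for piece in pieces if piece.strip()]
```

Because the pattern is anchored at the start of a line, a percent sign inside a quote (say, at the end of a sentence) is not mistaken for a delimiter.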
Fortune-cookie record separators combine well with the RFC-822 metaformat for records. If you need a textual format that will support multiple records with a variable repertoire of explicit fieldnames, one of the least surprising and human-friendliest ways to do it would look like Example 5.4.
Example 5.4. Three planets in an RFC822-like format
Planet: Mercury
Orbital-Radius: 57,910,000 km
Diameter: 4,880 km
Mass: 3.30e23 kg
%
Planet: Venus
Orbital-Radius: 108,200,000 km
Diameter: 12,103.6 km
Mass: 4.869e24 kg
%
Planet: Earth
Orbital-Radius: 149,600,000 km
Diameter: 12,756.3 km
Mass: 5.972e24 kg
Moons: Luna
Of course, the record delimiter could be a blank line, but a line consisting of "%\n" is more explicit and less likely to be introduced by accident during editing. In a format like this it is good practice to simply ignore blank lines.
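To show how little code such a format demands, here is a rough Python sketch that reads records of this kind into dictionaries. The function name parse_stanzas is hypothetical, and the sketch assumes no continuation lines and no repeated field names within a record.

```python
def parse_stanzas(text):
    """Parse %-separated stanzas of 'Name: value' lines into dicts."""
    records, current = [], {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue                     # blank lines are simply ignored
        if stripped in ("%", "%%"):      # record separator line
            if current:
                records.append(current)
                current = {}
            continue
        name, sep, value = stripped.partition(":")
        if not sep:
            raise ValueError(f"expected 'Name: value', got {line!r}")
        current[name.strip()] = value.strip()
    if current:
        records.append(current)
    return records

PLANETS = """\
Planet: Mercury
Diameter: 4,880 km
%
Planet: Earth
Diameter: 12,756.3 km

Moons: Luna
"""

planets = parse_stanzas(PLANETS)
```

Each record may carry a different set of fields; here only the second record has a Moons entry, and the stray blank line inside it does no harm.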
XML is well-suited for complex data formats (the sort of things that the old-school Unix tradition would use an RFC-822-like stanza format for) though overkill for simpler ones. It is especially appropriate for formats that have a complex nested or recursive structure of the sort that the RFC-822 metaformat does not handle well. For a good introduction to the format, see XML In A Nutshell [Harold&Means].
XML has a very simple syntax resembling HTML's — angle-bracketed tags and ampersand-led literal sequences. It is about as simple as a plain-text markup can be and yet express recursively nested data structures. XML is just a low-level syntax; it requires a document type definition (such as XHTML) and associated application logic to give it semantics.
Example 5.5 is a simple example of an XML-based configuration file. It is part of the kdeprint tool shipped with the open-source KDE office suite hosted under Linux. It describes options for an image-to-PostScript filtering operation, and how to map them into arguments for a filter command. For another instructive example, see the discussion of Glade in Chapter 9 (Generation).
Example 5.5. An XML example
<?xml version="1.0"?>
<kprintfilter name="imagetops">
    <filtercommand data="imagetops %filterargs %filterinput %filteroutput" />
    <filterargs>
        <filterarg name="center"
                   description="Image centering"
                   format="-nocenter" type="bool" default="true">
            <value name="true" description="Yes" />
            <value name="false" description="No" />
        </filterarg>
        <filterarg name="turn"
                   description="Image rotation"
                   format="-%value" type="list" default="auto">
            <value name="auto" description="Automatic" />
            <value name="noturn" description="None" />
            <value name="turn" description="90 deg" />
        </filterarg>
        <filterarg name="scale"
                   description="Image scale"
                   format="-scale %value"
                   type="float" min="0.0" max="1.0" default="1.000" />
        <filterarg name="dpi"
                   description="Image resolution"
                   format="-dpi %value"
                   type="int" min="72" max="1200" default="300" />
    </filterargs>
    <filterinput>
        <filterarg name="file" format="%in" />
        <filterarg name="pipe" format="" />
    </filterinput>
    <filteroutput>
        <filterarg name="file" format="> %out" />
        <filterarg name="pipe" format="" />
    </filteroutput>
</kprintfilter>
One advantage of XML is that it is often possible to detect ill-formed, corrupted, or incorrectly generated data through a syntax check, without knowing the semantics of the data.
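A syntax-only check of this kind takes only a few lines with a stock parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical, and the check says nothing about whether the document is valid against any particular document type.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, knowing nothing of its meaning."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

A truncated file or a mismatched closing tag is rejected here without the checker knowing what any of the tags mean.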
The most serious problem with XML is that it doesn't play well with traditional Unix tools. Software that wants to read an XML format needs an XML parser; this means bulky, complicated programs, and may even restrict your choice of language when you write programs that want to read or generate your format.
One application area where XML is clearly winning is in markup formats for document files (we'll have more to say about this in Chapter 16 (Documentation)). Tagging in such documents tends to be relatively sparse among large blocks of plain text; thus, traditional Unix tools still work fairly well for simple text searches and transformations.
One interesting bridge between these worlds is PYX format, a line-oriented translation of XML that can be hacked with traditional line-oriented Unix text tools and then losslessly translated back to XML. A web search for “Pyxie” will turn up resources. The xmltk toolkit takes the opposite tack, providing stream-oriented tools analogous to grep(1) and sort(1) for filtering XML documents; web search for “xmltk”.
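To give a feel for what a line-oriented translation of XML looks like, here is a rough Python sketch that emits PYX-style lines using the standard xml.parsers.expat module. The line prefixes used here ("(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag) and the escaping of newlines, tabs, and backslashes are stated from memory of the PYX notation and should be treated as assumptions; processing instructions are ignored. The function name xml_to_pyx is hypothetical.

```python
import xml.parsers.expat

def xml_to_pyx(xml_text):
    """Translate an XML document into a list of PYX-style lines."""
    lines = []

    def escape(s):
        # Keep every event on one physical line.
        return (s.replace("\\", "\\\\")
                 .replace("\n", "\\n")
                 .replace("\t", "\\t"))

    def start(name, attrs):
        lines.append("(" + name)
        for key, value in attrs.items():
            lines.append("A" + key + " " + escape(value))

    def end(name):
        lines.append(")" + name)

    def chars(data):
        lines.append("-" + escape(data))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True          # deliver adjacent text as one chunk
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return lines
```

With one event per line, a question such as "which elements carry a name attribute?" becomes a job for grep(1) on lines beginning with Aname.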
XML can be a simplifying choice or a complicating one. There is a lot of hype surrounding it, but don't be a fashion victim by either adopting or rejecting it uncritically. Choose carefully and bear the KISS principle in mind.
Many Microsoft Windows programs use a textual data format that looks like Example 5.6. This example associates optional resources named ‘account’, ‘directory’, ‘numeric_id’, and ‘developer’ with named projects ‘python’, ‘sng’, ‘fetchmail’, and ‘py-howto’. The DEFAULT entry supplies values that will be used when a named entry fails to supply them.
Example 5.6. A .INI file example
[DEFAULT]
account = esr

[python]
directory = /home/esr/cvs/python/
developer = 1

[sng]
directory = /home/esr/WWW/sng/
numeric_id = 1012
developer = 1

[fetchmail]
numeric_id = 18364

[py-howto]
account = eric
directory = /home/esr/cvs/py-howto/
developer = 1
This style of data file format is not native to Unix, but some Linux programs support it under Windows's influence. This format is readable and not badly designed, but is not widely supported by Unix tools. Like XML it doesn't play well with grep(1) or conventional Unix scripting tools. If you are willing to accept these limitations, using an XML format would probably be a better idea.
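Where a service library for the format exists, reading such a file is short work. As an illustration, Python's standard configparser module reads this dialect and treats a section named DEFAULT as the source of fallback values; the helper name lookup and the abbreviated sample text are assumptions made for the sketch.

```python
import configparser

INI_TEXT = """\
[DEFAULT]
account = esr

[python]
directory = /home/esr/cvs/python/
developer = 1

[fetchmail]
numeric_id = 18364

[py-howto]
account = eric
directory = /home/esr/cvs/py-howto/
developer = 1
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

def lookup(project, key):
    """Return the value for key in a project section, or None if unset."""
    return config.get(project, key, fallback=None)
```

Here lookup("python", "account") yields esr from the DEFAULT entry, while lookup("py-howto", "account") yields the section's own value, eric.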
There are longstanding Unix traditions about how textual data formats ought to look. Most of these derive from one or more of the standard metaformats we've just described. It is wise to follow these unless you have strong and specific reasons to do otherwise.
One record per newline-terminated line, if possible. This makes it easy to extract records with text-stream tools. For data interchange with other operating systems, it's wise to make your file-format parser indifferent to whether the line ending is LF or CR-LF. It's also conventional to ignore trailing whitespace in such formats; this protects against common editor bobbles.
Less than 80 chars per line, if possible. This makes the format browseable in an ordinary-sized terminal window. If many records must be longer than 80 characters, consider a stanza format (see below).
Support the backslash convention. The standard way to support embedding non-printable control characters is by parsing C-like backslash escapes — \n for a newline, \r for a carriage return, \t for a tab, \b for backspace, \f for formfeed, \onn or \0nn for the octal character with value nn, \xnn for the hex character with value nn, \\ for a literal backslash.
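A decoder for the escapes just listed can be sketched in Python as follows. The names ESCAPE, SIMPLE, and unescape are hypothetical, and two choices here are assumptions rather than part of the convention: octal escapes accept one to three digits, and an unrecognized escape is passed through unchanged.

```python
import re

SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f", "\\": "\\"}

# \xnn (hex), \onn or \0nn (octal), or one of the single-letter escapes.
ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|[o0][0-7]{1,3}|[nrtbf\\])")

def unescape(text):
    """Expand C-like backslash escapes; leave unknown ones untouched."""
    def replace(match):
        body = match.group(1)
        if body[0] == "x":
            return chr(int(body[1:], 16))
        if body[0] in "o0":
            return chr(int(body[1:], 8))
        return SIMPLE[body]
    return ESCAPE.sub(replace, text)
```

Because the substitution scans left to right, an escaped backslash followed by the letter n decodes to a backslash and an n, not to a newline.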
In one-record-per-line formats, use colon as a field separator. This convention seems to have originated with the Unix password file. If your fields must contain colons, use a backslash as the prefix to escape them.
Do not allow the distinction between tab and whitespace to be significant. This is a recipe for serious headaches when the tab settings on your users' editors are different; more generally, it's confusing to the eye. Using tab as a field separator is especially likely to cause problems.
Favor hex over octal. Hex-digit pairs and quads are easier to eyeball-map into bytes and words than octal digits of three bits each; also marginally more efficient. This rule needs emphasizing because some older Unix tools such as od(1) violate it; that's a legacy from the field sizes in PDP-11 machine language.
For complex records, use a ‘stanza’ format: multiple lines per record, with a record separator line of %%\n or %\n. The separators make useful visual boundaries for human beings eyeballing the file.
In stanza formats, either have one record field per line or use a record format resembling RFC822 electronic-mail headers, with colon-terminated field-name keywords leading fields. The second choice is appropriate when fields are often either absent or longer than 80 characters, or when records are sparse (often missing fields).
In stanza formats, support line continuation. When interpreting the file, either discard backslash followed by whitespace or fold newline followed by whitespace to a single space, so that a long logical line can be folded into short (easily editable!) physical lines. It's also conventional to ignore trailing whitespace in these formats; this protects against common editor bobbles.
Either include a version number or design the format as self-describing chunks independent of each other. If there is even the faintest possibility that the format will have to be changed or extended, include a version number so your code can conditionally do the right thing on all versions. Alternatively, design the format as self-describing chunks so that new chunk types can be added without instantly breaking old code.
Beware of floating-point roundoff problems. Conversion of floating-point numbers from binary to text format and back can lose precision, depending on the quality of the conversion library you are using. If the structure you are marshalling/unmarshalling contains floating point, you should test the conversion in both directions and, if it looks like conversion in either direction is subject to roundoff errors, be prepared to dump the floating-point field as raw binary instead, or a hex encoding thereof. The binary dump may even be portable if both machines implement the IEEE floating-point standard.
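The round-trip test suggested above is easy to write. The Python sketch below contrasts a lossy decimal rendering with a hex text encoding of the same value, using the built-in float.hex() and float.fromhex() methods; the helper names dump_float and load_float are hypothetical.

```python
def dump_float(value):
    """Render a float as hex text, which records every bit of the value."""
    return value.hex()

def load_float(text):
    """Recover the float written by dump_float()."""
    return float.fromhex(text)

x = 1.0 / 3.0

# Six significant decimal digits are not enough to recover x.
lossy = float("%.6g" % x)

# The hex text form survives the trip to text and back unchanged.
exact = load_float(dump_float(x))
```

After this runs, exact == x holds while lossy == x does not. The hex form stays textual, so it remains friendly to the usual text tools while sidestepping the roundoff problem.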
Don't bother compressing or binary-encoding just part of the file. An effective way to combine transparency with storage economy is to apply some standard compression technique to the entirety of a text data file. For example: many projects, such as OpenOffice.org and AbiWord, now use XML compressed with gzip(1). Experiments have shown that documents in a compressed XML file are usually significantly smaller than the same documents in Microsoft Word's native file format, a binary format that one might imagine would take less space. The reason relates to a fundamental of the Unix philosophy: do one thing well. Creating a single tool to do the compression job well is more effective than ad-hoc compression on parts of the file, because the tool can look across all the data and exploit all repetitive information. In contrast, a binary format generally must allocate space for information that may not be used in particular cases (e.g., for unusual options or large ranges).
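Applying compression to the whole file is a two-line affair in most languages. As an illustration, this Python sketch wraps the standard gzip module; the function names pack and unpack are hypothetical, and UTF-8 is assumed as the text encoding.

```python
import gzip

def pack(text):
    """Compress an entire text document into gzip-format bytes."""
    return gzip.compress(text.encode("utf-8"))

def unpack(blob):
    """Recover the original text from gzip-format bytes."""
    return gzip.decompress(blob).decode("utf-8")
```

The format itself stays plain text; a user who wants to inspect a file can run it through gunzip(1) and apply the usual tools.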
In Chapter 10 (Configuration) we will discuss a different set of conventions used for program run-control files.