Monday, December 12, 2005

Design Rules for Textual Data Formats

Another set of rules from Eric Raymond's excellent "The Art of Unix Programming". Use a textual data format instead of a binary one to store or transport your data:
  • Easy for human beings to read, write, and edit without specialized tools.
  • Easy to prepare test data and to debug.
  • Future-proof your system. One specific reason is that ranges on numeric fields aren't implied by the format itself.
  • Other tools and applications can easily use your data, stimulating reuse and innovation.
Unix Textual File Format Conventions
  • One record per newline-terminated line, if possible.
  • Less than 80 characters per line, if possible.
  • Use # as an introducer for comments.
  • Support the backslash convention.
  • In one-record-per-line formats, use colon or any run of whitespace as a field separator.
  • Do not allow the distinction between tabs and other whitespace to be significant.
  • For complex records, use a ‘stanza’ format: multiple lines per record, with a record separator line of %%\n or %\n.
  • In stanza formats, either have one record field per line or use a record format resembling RFC 822 electronic-mail headers, with colon-terminated field-name keywords leading fields.
  • In stanza formats, support line continuation.
  • Either include a version number or design the format as self-describing chunks independent of each other.
  • Beware of floating-point round-off problems.
  • Don't bother compressing or binary-encoding just part of the file.
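A minimal Python sketch of a reader that follows several of the conventions above: one record per line, # comments, colon-separated fields, and a backslash escape. The function names split_fields and parse_records are hypothetical, and the sketch makes simplifying assumptions: only whole-line comments are recognized, and a backslash simply makes the next character literal (it does not expand sequences such as \n or \t).

```python
def split_fields(line, sep=":"):
    """Split one record on `sep`, treating backslash as an escape character."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            # Assumption: a backslash makes the next character literal,
            # so "\:" is a colon inside a field, not a separator.
            current.append(next(chars, ""))
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_records(text):
    """Yield one list of fields per non-blank, non-comment line."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        yield split_fields(line)
```

Called as list(parse_records(text)), this yields one list of strings per record. Keeping the escape handling in its own function makes it easy to extend later, for example to expand \n and \t.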
Data File Metaformats
  • DSV Format: Delimiter-Separated Values. One record per line, colon-separated fields. Most appropriate for tabular data keyed by a name in the first field.
  • RFC 822 Format: derives from the textual format of Internet electronic mail messages. Record attributes are stored one per line, named by tokens resembling mail header-field names and terminated with a colon followed by whitespace. Field names do not contain whitespace; conventionally a dash is substituted instead. The attribute value is the entire remainder of the line, exclusive of trailing whitespace and newline. A physical line that begins with a tab or other whitespace is interpreted as a continuation of the current logical line. A blank line may be interpreted either as a record terminator or as an indication that unstructured text follows.
  • Cookie-Jar Format: appropriate for records that are just bags of unstructured text. It simply uses a newline followed by %% as a record separator.
  • Record-Jar Format: cookie-jar record separators combined with the RFC 822 metaformat for records. It supports multiple records with a variable repertoire of explicit field names.
  • XML Format: well suited for complex data formats, though overkill for simpler ones. It is especially appropriate for formats that have a complex nested or recursive structure.
  • Windows INI Format: appropriate if your data naturally falls into its two-level organization of name-attribute pairs clustered under named records or sections.
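To make the record-jar idea concrete, here is a rough Python sketch of a parser for it. The function name parse_record_jar and the sample field names are hypothetical. It assumes that a line starting with %% separates records, that "Name: value" lines define fields, that a line beginning with whitespace continues the previous field, and that blank lines can be ignored (the RFC 822 description above also allows treating a blank line as a record terminator, which this sketch does not do).

```python
def parse_record_jar(text):
    """Parse record-jar text into a list of dicts, one dict per record."""
    records, current, last_key = [], {}, None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("%%"):
            # Record separator: close the current record, if it has fields.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif not line:
            continue  # assumption: blank lines carry no meaning here
        elif line[0] in " \t" and last_key is not None:
            # Continuation line: append to the previous field's value.
            current[last_key] += " " + line.strip()
        else:
            # "Field-Name: value"; the value is the rest of the line.
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)  # final record without a trailing %%
    return records
```

Each record comes back as a plain dict, so records with different sets of fields need no special handling, which matches the "variable repertoire of explicit field names" that the format is meant for.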
Read more at "The Art of Unix Programming, chapter 5: The Importance of Being Textual", with examples.
See also "Basics of the Unix Philosophy" with general rules for good programming design.
