Victorian Electronic Records Strategy

2.2 Structured textual encoding

Long-term preservation of information can be viewed as a transmission protocol. The sending computer is the system that creates and (perhaps) initially stores the information. The receiving computer is the future system - as yet unbuilt - that will display the record. A transmission protocol cannot work unless the sending and receiving hosts agree precisely on the encoding of the information that passes between them. This is difficult enough to achieve when the two systems can be tested against each other, but when electronic information is archived the 'receiving' system has not yet been constructed at the time the 'sending' system 'transmits' the record.

Thus a well-designed long-term record format has three highly desirable characteristics:

  • Simple encoding. The encoding of the record should be as simple and as easy to understand as possible.
  • Self-describing. The smallest units of information in the record should be clearly identifiable, and labelled to indicate their meaning. Consider examining a record and finding a stream of undifferentiated bits. It is impossible to determine where each atom of information starts and stops, the data type of the information, or the meaning of the information. The simplest self-describing encoding represents the information as text, delimits each unit of information with special characters, tags each unit with a label, and structures the information to show relationships.
  • Self-documenting. Some units of information will require complex explanations of their meaning. Consider a digital signature. It is easy enough to tag the data that forms the signature with the label 'Digital Signature', but checking the signature requires a lot more information. What algorithm was used to generate the digital signature (and what were the values of any parameters)? Exactly what data in the record is covered by the digital signature? Is the digital signature encoded (e.g. has the binary digital signature been turned into text)? An archived record must include sufficient documentation to allow a future user to understand what was done to the record. A reference to an external publication is sufficient documentation only if the external publication will be available indefinitely.
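
The three characteristics above can be illustrated with a small sketch. It is not a prescribed format: the use of XML, the element names (Record, Title, Content, DigitalSignature, Algorithm, Covers, TextEncoding, Value), and the function build_record are all assumptions made for illustration, and a SHA-256 digest stands in for a real digital signature. The point is only that each unit of information is delimited and labelled, and that the signature carries the answers to the questions a future user would need to ask.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def build_record(title: str, body: str) -> ET.Element:
    """Build a labelled, textual record (hypothetical element names)."""
    record = ET.Element("Record")
    ET.SubElement(record, "Title").text = title
    ET.SubElement(record, "Content").text = body

    # Placeholder only: a SHA-256 digest stands in for a real signature.
    digest = hashlib.sha256(body.encode("utf-8")).digest()

    sig = ET.SubElement(record, "DigitalSignature")
    # Self-documentation: which algorithm, what it covers, how it is encoded.
    ET.SubElement(sig, "Algorithm").text = "SHA-256 digest (placeholder)"
    ET.SubElement(sig, "Covers").text = "UTF-8 octets of the Content element text"
    ET.SubElement(sig, "TextEncoding").text = "base64 (RFC 4648)"
    ET.SubElement(sig, "Value").text = base64.b64encode(digest).decode("ascii")
    return record


record = build_record("Minutes", "The meeting opened at 10 am.")
text = ET.tostring(record, encoding="unicode")
print(text)
```

A reader holding only the printed text can see where each unit starts and stops, what it is called, and how the signature value was produced and encoded.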

The requirement for a simple, self-describing, and self-documenting encoding suggests a textual encoding. However, there are two problems with the pure textual encoding of a record.

The first problem is efficiency. For example, binary encoding of a 24-bit RGB image requires 3 octets for each pixel. A simple textual encoding would require a minimum of 6 octets (e.g. "0,0,0;") and a maximum of 12 octets (e.g. "255,255,255;") for each pixel, that is, between 200% and 400% of the size of the binary RGB data (a space overhead of 100% to 300%). In addition to the space overhead, both parsing and generating the textual encoding are normally more expensive than parsing and generating the equivalent binary encoding. It is often preferable to use binary encoding simply for efficiency.
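
The size comparison can be checked with a short sketch. The function names text_encode and binary_encode are hypothetical, and the "r,g,b;" layout is simply the one used in the example above; the sketch counts octets for the smallest and largest possible pixel values.

```python
def text_encode(pixels):
    """Encode pixels as decimal text, e.g. "255,255,255;" per pixel."""
    return "".join(f"{r},{g},{b};" for r, g, b in pixels)


def binary_encode(pixels):
    """Encode pixels as raw octets, 3 per pixel (R, G, B)."""
    return bytes(channel for pixel in pixels for channel in pixel)


for name, pixel in (("black", (0, 0, 0)), ("white", (255, 255, 255))):
    text_size = len(text_encode([pixel]).encode("ascii"))
    binary_size = len(binary_encode([pixel]))
    # Ratio of textual size to binary size for a single pixel.
    print(name, binary_size, text_size, text_size / binary_size)
```

For a single pixel the textual form is between two and four times the size of the binary form, depending on how many decimal digits each channel needs.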

The second problem is complexity. Many types of data are inherently complex. Describing a printed page, for example, requires describing the position of every character on the page together with the characteristics of the character such as weight, orientation, and skew. It would be possible to develop a textual encoding to describe a page, but this requires specialist knowledge to ensure that the textual encoding is suitable. It is far preferable to use existing standards for complex data, even if they use complex binary encoding.

It is possible to include binary encodings within an archived object. The key is to:

  • Choose a binary encoding that has been published and is therefore available to be referenced, or is sufficiently simple that it can be documented within the record.
  • Include, in the archived object itself, documentation identifying which binary encoding has been chosen.
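
A minimal sketch of these two points follows, assuming an XML-style textual wrapper. The helper embed_binary and the element names (EncodedContent, Format, FormatReference, TextEncoding, Data) are hypothetical. The binary data here is raw RGB, chosen because it is simple enough to document in the record itself; for a published standard, the Format and FormatReference elements would instead name and cite that standard.

```python
import base64
import xml.etree.ElementTree as ET


def embed_binary(parent, data: bytes, format_name: str, reference: str):
    """Embed binary data as text, together with documentation of its encoding."""
    item = ET.SubElement(parent, "EncodedContent")
    # Which binary encoding was chosen, and where it is described.
    ET.SubElement(item, "Format").text = format_name
    ET.SubElement(item, "FormatReference").text = reference
    # How the binary octets were turned into text.
    ET.SubElement(item, "TextEncoding").text = "base64 (RFC 4648)"
    ET.SubElement(item, "Data").text = base64.b64encode(data).decode("ascii")
    return item


record = ET.Element("Record")
pixels = bytes([255, 0, 0, 0, 255, 0])  # two pixels: red, then green
embed_binary(
    record,
    pixels,
    "Raw 24-bit RGB, 3 octets per pixel in R, G, B order, 2 pixels wide, 1 high",
    "Fully described by the Format element",
)
text = ET.tostring(record, encoding="unicode")
print(text)

# A future reader follows the documentation to recover the original octets.
parsed = ET.fromstring(text)
recovered = base64.b64decode(parsed.find("EncodedContent/Data").text)
```

The binary content stays compact and standard, while everything needed to interpret it travels with the record as labelled text.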

In summary, a good design for a long-term electronic record format will be based on a simple textual encoding that 'marks up' the data to indicate its extent, syntactic meaning, semantic meaning, and relationship to other data in the record. The use of binary encodings for specific elements in the record is acceptable when this allows the use of specialist standards, provided the use of these standards is well documented within the record.
