


2.0 VERS Long Term Format (VERS Encapsulated Objects)

2.1 eXtensible Markup Language (XML)
2.2 Standard Encodings
   2.2.1 Documents
   2.2.2 Database Tables
   2.2.3 ‘Onion’ Records
   2.2.4 Digital Signatures
   2.2.5 Other Standard Encodings
Endnotes


The VERS Long Term Format consists of an object (known as a VERS Encapsulated Object or VEO) represented in eXtensible Markup Language (XML). This object will contain contextual information about the record and may also contain document files, image files, sound files, movie files, etc. The VEO is also signed using digital signature technology to ensure authenticity. For information about the generic VERS Long Term Record Format see PROS 99/007 Specification 1.

While a compact binary format is desirable for encoding long term records, a binary format depends on the program that interprets the binary data to extract the content. To make the record self-sufficient, a textual encoding is preferred. Although a textual encoding is less efficient, the contents of a record can be inspected using simple text editing software, and consequently the record is not dependent on particular software or documentation. XML is the text based encoding chosen for the VERS Long Term Record Format.


2.1 eXtensible Markup Language (XML)

The recommended Long Term Record Format is expressed using XML (eXtensible Markup Language). XML is a text based markup language. XML specifications are easily extensible (unlike HTML) and are relatively simple. The XML standard is defined in Extensible Markup Language (XML) 1.0.[1]


2.2 Standard Encodings

The data that forms a document[2] may be encoded in many ways when the document is included in the record. For example, a report could be encoded as a PDF file or as a Word file. A Document in the record can contain several representations of the same physical document; each encoding is one such representation.

When storing documents, note that XML requires data to be stored as character data: the contents of XML tags must be textual characters. Two rules follow. First, all binary data must be encoded as text. For VEOs this encoding should be done in Base64,[3] an Internet standard that is a fundamental component of email systems, and the encoding used should be documented in the File Encoding element. Second, textual data cannot include the characters ‘<’, ‘>’, or ‘&’ directly. Where these characters occur within the data, they must be replaced by the strings ‘&lt;’, ‘&gt;’ and ‘&amp;’ respectively.
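
As a minimal sketch of these two rules, the following Python fragment Base64-encodes binary data and escapes the reserved characters in ordinary text, using only the Python standard library. The helper names and the sample inputs are illustrative assumptions and are not part of the VERS specification.

```python
import base64
from xml.sax.saxutils import escape


def encode_binary_for_veo(data: bytes) -> str:
    """Return binary data as Base64 text (hypothetical helper name)."""
    # Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=',
    # so it never contains '<', '>' or '&'.
    return base64.b64encode(data).decode("ascii")


def escape_text_for_veo(text: str) -> str:
    """Replace '&', '<' and '>' in character data with XML entity references."""
    return escape(text)


# Sample inputs, invented for illustration only.
pdf_bytes = b"%PDF-1.3\n\x00\x01\x02 sample bytes, not a real PDF"
encoded = encode_binary_for_veo(pdf_bytes)
escaped = escape_text_for_veo("Smith & Jones <draft>")
```

Here `base64.b64encode` produces one unbroken line of text. If the line length limit of RFC 2045 matters for a particular system, `base64.encodebytes` inserts a line break after every 76 characters of output.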

For the long term preservation of records, standard encodings for three types of documents have been defined: documents, database tables, and records. Agencies may choose to use other encodings for alternate representations of documents but the standard encodings detailed below must be used for permanent or long term temporary public records.

2.2.1 Documents

The documents which are currently best able to be conserved as long term electronic records are items which can be printed. Examples include Word documents, database reports, emails, spreadsheets, and drawings.

In this Standard, documents are represented using the Portable Document Format (PDF) Version 1.3 produced by Adobe Systems Incorporated.[4] The primary selection criterion for a document format was confidence that, for the foreseeable future, it would be possible to write a viewer for the document from publicly available information. The Microsoft Word file format, for example, would not be appropriate, as its description has not been published. The PDF standard has been published and is freely available.

PDF is flexible: it can be generated from any application that can generate PostScript (the standard printing language), so anything that can be printed can be represented in PDF. PDF can also be generated from scanned documents, producing an electronic document that is very close (or identical) in appearance to the original paper document. After optical character recognition software has been applied, the text of the scanned image can be accessed, altered or reused.

PDF is reasonably efficient in terms of size. In the VERS prototype,[5] PDF generated from Word documents was typically 50 to 80 percent of the size of the original Word document. PDF is much more efficient than PostScript.

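PDF is a binary format and hence must be encoded as text before the data is inserted into the VEO (see section 2.2 on the use of Base64 to encode binary data as text). Base64 represents every 3 bytes of binary data as 4 text characters, so the encoded file is about one third (33%) larger than the original, and slightly more once line breaks are included.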
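
The size overhead of Base64 can be worked out directly, because every group of 3 bytes becomes 4 text characters. The Python sketch below uses a hypothetical helper name and a stand-in file of zero bytes to compute the encoded size and compare it with the standard library encoder. Measured against the original file the increase is one third; equivalently, one quarter of the encoded text is overhead.

```python
import base64
import math


def base64_length(n_bytes: int) -> int:
    """Base64 characters needed for n_bytes (with padding, without line breaks)."""
    return 4 * math.ceil(n_bytes / 3)


original = bytes(300_000)  # stand-in for a 300,000 byte PDF file
encoded = base64.b64encode(original)

# Fractional increase relative to the original size.
growth = len(encoded) / len(original) - 1
# Share of the encoded text that is overhead.
overhead_share = 1 - len(original) / len(encoded)
```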

2.2.2 Database Tables

Database tables are collections of sets of data. A database row encodes a single data set, and database columns encode the data set categories. A database table may also have associated forms, queries and macros that encode how the database is commonly used.

An XML DTD has been defined to mark up the rows of a database table.

<!ELEMENT vers:DatabaseTable
		(vers:DatabaseTableRow)*>
<!ELEMENT vers:DatabaseTableRow 
		(vers:DatabaseTableElement)*> 
<!ELEMENT vers:DatabaseTableElement (#PCDATA)> 

A database table is represented as zero or more rows (database records). Each row comprises zero or more elements (columns or fields).
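
As an illustration of this structure, the following Python sketch writes a small table using the three element names from the DTD above. The function name and the sample rows are assumptions made for the example. The output is a fragment intended for inclusion in a VEO, not a complete XML document.

```python
from xml.sax.saxutils import escape


def table_to_xml(rows):
    """Serialise rows (an iterable of sequences of values) as a vers:DatabaseTable."""
    lines = ["<vers:DatabaseTable>"]
    for row in rows:
        lines.append("  <vers:DatabaseTableRow>")
        for value in row:
            # Every value is stored as character data, so escape '&', '<' and '>'.
            text = escape(str(value))
            lines.append(
                f"    <vers:DatabaseTableElement>{text}</vers:DatabaseTableElement>"
            )
        lines.append("  </vers:DatabaseTableRow>")
    lines.append("</vers:DatabaseTable>")
    return "\n".join(lines)


# Sample data, invented for illustration only.
sample_rows = [
    (1, "Smith & Jones", "1999-03-11"),
    (2, "Public Record Office Victoria", "1999-04-01"),
]
xml_fragment = table_to_xml(sample_rows)
```

Note that the DTD as given records only the values in each row. Column names and data types are not part of this markup and would need to be described elsewhere in the record.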

Developing ‘grammars’ sufficient to fully describe the functionality of a database, including its schema, query language and fixed queries, may be a very time consuming and arduous process. Agencies should fully consider the database functionality which is desirable for long term preservation (which may be functionality considerably less than that desired for a fully functioning database) and concentrate on preserving and describing only those aspects of the database which will be useful in the long term and which are required to be kept for legislative or historical reasons.

2.2.3 ‘Onion’ Records

The digital signatures that secure a VERS record also prevent any modification of the record. In some circumstances it might be necessary to modify the metadata associated with a record. For example, the need may arise to refile a record, to add descriptive information, or to make a new linkage to another record. To allow the metadata to be modified without disturbing the evidentiary integrity of the record, the VERS XML DTD allows a record to be included as the content of a new record. This layering of record metadata is referred to as producing ‘Onion Records’. The content of an Onion Record is a complete record (i.e. an XML VEO).
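
The layering idea can be sketched in Python as follows. This is a conceptual model only: the class, the function names, the metadata fields and the byte layout are all hypothetical and do not reflect the VERS DTD, and a SHA-256 digest stands in for a real digital signature. The point it shows is that wrapping leaves the inner record, and the seal over it, byte for byte unchanged.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One layer of a hypothetical onion record (not the VERS DTD structure)."""
    metadata: dict
    content: bytes  # documents, or the complete serialised inner record
    seal: str       # SHA-256 digest standing in for a digital signature


def serialise(metadata: dict, content: bytes) -> bytes:
    """Deterministic byte form of a layer's metadata and content."""
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return header + b"\n" + content


def make_layer(metadata: dict, content: bytes) -> Layer:
    """Build a layer and seal its metadata and content together."""
    seal = hashlib.sha256(serialise(metadata, content)).hexdigest()
    return Layer(metadata, content, seal)


def verify(layer: Layer) -> bool:
    """Check that the layer's metadata and content still match its seal."""
    digest = hashlib.sha256(serialise(layer.metadata, layer.content)).hexdigest()
    return digest == layer.seal


def wrap(inner: Layer, new_metadata: dict) -> Layer:
    """Create an outer layer whose content is the complete inner record."""
    inner_bytes = serialise(inner.metadata, inner.content)
    inner_bytes += b"\n" + inner.seal.encode("ascii")
    return make_layer(new_metadata, inner_bytes)


# Refiling: the new file reference goes in the outer layer only.
original = make_layer({"file": "1999/0042"}, b"report text")
refiled = wrap(original, {"file": "2001/0007", "note": "refiled"})
```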

2.2.4 Digital Signatures

Digital Signature technology must be used to ‘sign’ the VEO in order to ensure record authenticity (see discussion in section 3.0). It is recommended that any published and freely available Digital Signature Standard (such as the National Institute of Standards and Technology’s Digital Signature Standard[6]) be used as long as it is fully documented in the Signature Format Description element of the VEO.

Both the Signature and Public Key are binary data and need to be encoded as text. For VEOs this encoding should be done in Base64.[7] The encoding used should be documented in the Signature Format Description[8] element of the VEO metadata.

2.2.5 Other Standard Encodings

It is expected that other standard encodings will be defined after further research. Agencies should apply to PROV for further information and assistance in choosing standard encodings for computer file formats which are not covered by the encodings given above.


Endnotes

1. Extensible Markup Language (XML) 1.0, W3C Recommendation REC-xml-19980210, http://www.w3.org/TR/REC-xml

2. In its widest sense a 'document' could be a sound file, an image, a digital video as well as the more traditional word processing document or email.

3. Multipurpose Internet Mail Extensions (MIME), Part One: Format of Internet Message Bodies, Section 6.8 Base64 Content-Transfer-Encoding, IETF RFC 2045, http://src.doc.ic.ac.uk/computing/internet/rfc/rfc2045.txt

4. Portable Document Format Reference Manual, Version 1.3, Adobe Systems Incorporated, March 11 1999 http://partners.adobe.com/asn/developer/acrosdk/docs/PDFRef.pdf

5. For further information about the VERS prototype see Victorian Electronic Records Strategy Final Report, Public Record Office Victoria, March 1999.

6. National Institute of Standards and Technology, Federal Information Processing Standards Publication, Digital Signature Standard, FIPS PUB 186-2, 27 January 1998, http://csrc.nist.gov/publications/fips/fips186-2/fips186-2.pdf

7. Multipurpose Internet Mail Extensions (MIME), Part One: Format of Internet Message Bodies, Section 6.8 Base64 Content-Transfer-Encoding, IETF RFC 2045, http://src.doc.ic.ac.uk/computing/internet/rfc/rfc2045.txt

8. See PROS 99/007 Specification 2 VERS Metadata.
