Victorian Electronic Records Strategy - Forever Digital logo
 



7.1 Text

Text files contain only text: typically letters, numbers, punctuation, spaces, and tab characters. Text files were once very common, but few files on most office computer systems are now likely to contain only plain text, as modern office applications support documents with sophisticated formatting and images. Plain text files may become more common again in the future, because HTML pages and XML documents are plain text files. Other examples of text files include software source files and some emails. On Windows file systems many text files have the extension ‘.txt’.

Text files can be opened using the simplest text editors, such as Notepad and vi, and by more complex programs, such as Word and WordPad.

A text file consists of a sequence of characters, each of which is represented in the file as a number. To display the contents of a text file it is necessary to convert the bit stream in the file into a sequence of numbers, and then to convert the numbers into characters. Both conversions are defined by standards.
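The two conversions can be sketched in Python as follows. This is a minimal illustration that assumes the file is encoded in ASCII; the byte values and variable names are chosen for the example and are not taken from any standard.

```python
# Raw contents of a small text file, as the bytes stored on disk
# (illustrative values only).
raw = bytes([0x56, 0x45, 0x52, 0x53, 0x0A])

# Conversion 1: interpret the bit stream as a sequence of numbers.
# In ASCII, each 8-bit byte holds one number.
numbers = list(raw)

# Conversion 2: map each number to the character assigned to it.
# For values below 128, chr() returns the ASCII character.
text = "".join(chr(n) for n in numbers)

# Python's built-in decoder performs both conversions in one step.
decoded = raw.decode("ascii")
```

Here `numbers` holds the intermediate sequence of character numbers, and `text` and `decoded` hold the same string.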

The current root of English language textual encodings is ASCII (American Standard Code for Information Interchange), which dates from 1968 and which subsequently became an international standard (ISO 646:1991) [ISO646]. ASCII (ISO 646) was extended in the 1980s to allow coverage of some non-English languages. The extended standard is known as ISO 8859 [ISO8859]. The important point is that in all three standards (ASCII, ISO 646, and ISO 8859) the first 128 character codes (0 to 127) are identical, so English text is represented identically in all three.
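The overlap between the standards can be checked with a short Python sketch. It assumes that Python's built-in `ascii` and `iso-8859-1` codecs are acceptable stand-ins for the standards themselves; the variable names are illustrative.

```python
# Every code from 0 to 127, one byte each.
low_codes = bytes(range(128))

# Decoding these bytes under either codec yields the same characters.
same_below_128 = low_codes.decode("ascii") == low_codes.decode("iso-8859-1")

# Above 127 the standards diverge: ASCII assigns no character to 0xE9,
# whereas ISO 8859-1 assigns it an accented letter.
try:
    bytes([0xE9]).decode("ascii")
    ascii_accepts_e9 = True
except UnicodeDecodeError:
    ascii_accepts_e9 = False

latin1_e9 = bytes([0xE9]).decode("iso-8859-1")
```

The differences between the standards only matter for codes above 127, which plain English text does not normally use.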

Text files with non-Latin characters should be represented as Unicode [Unicode]. Unicode itself is defined by the Unicode Consortium, but a functionally equivalent standard is produced by ISO as ISO/IEC 10646 [ISO10646]. Unicode defines virtually all known characters in most languages. The standard continually evolves, mainly by the addition of new characters for additional languages. Because this evolution adds characters rather than changing or removing existing ones, any object that conforms to the current version of the Unicode standard will also conform to future versions.

Like all character standards, Unicode defines a set of characters and assigns each of them a number. Unlike the other character standards, Unicode allows the character numbers to be physically encoded in a digital file in a variety of different ways. The common physical encoding mechanisms are UTF-8 and UTF-16. UTF-8 has the characteristic that the characters available in ASCII are encoded identically to ASCII. Thus Unicode text that contains only the characters available in ASCII and is encoded in UTF-8 is identical to the same text encoded as ASCII.
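The distinction between a character number and its physical encoding can be illustrated with a short Python sketch. The sample strings are arbitrary, and big-endian UTF-16 without a byte order mark is assumed purely to keep the output simple.

```python
ascii_only = "Text"
mixed = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# The character numbers do not depend on the physical encoding.
code_points = [ord(c) for c in mixed]

# The same character numbers produce different bytes in UTF-8 and UTF-16.
utf8_hex = mixed.encode("utf-8").hex()
utf16_hex = mixed.encode("utf-16-be").hex()

# Text limited to ASCII characters is byte-for-byte identical in UTF-8.
identical_for_ascii = ascii_only.encode("utf-8") == ascii_only.encode("ascii")
```

In UTF-8 the three unaccented letters occupy one byte each and the accented letter occupies two; in UTF-16 every character in this example occupies two bytes.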

In summary, if the text is English it should be expressed in ASCII. The resulting file will be identical irrespective of whether it is considered to be encoded in ISO 646, ISO 8859-1, or Unicode (ISO 10646) represented in UTF-8.

Languages other than English should be expressed in Unicode (ISO 10646) represented in UTF-8.
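The summary rule can be sketched as a small Python helper. The function name `choose_encoding` is hypothetical and not part of any standard or library. Note also that the sketch tests which characters are present rather than which language is used, so English text containing typographic quotation marks, for example, would be reported as needing UTF-8.

```python
def choose_encoding(text: str) -> str:
    """Return "ascii" if every character is available in ASCII, else "utf-8".

    Hypothetical helper for illustration only.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return "utf-8"
    return "ascii"


def to_bytes(text: str) -> bytes:
    """Encode text using the encoding selected by choose_encoding()."""
    return text.encode(choose_encoding(text))
```

Because UTF-8 encodes ASCII characters identically to ASCII, the bytes returned by `to_bytes` can always be read back as UTF-8, whichever label `choose_encoding` reports.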

