Deciding how to archive
When deciding which formats to use for the long-term storage of your documents, it is wise to choose one that preserves the significant properties of the document and that can also be readily accessed and, if need be, migrated by a range of common applications.
Significant properties of documents
Generally, the significant properties of documents and texts that need to be preserved are:
- The words in the document and their order.
- The hierarchical structure of the document (e.g. different levels of headings).
- Formatting within the document (e.g. bold or italicised text).
- The page numbering of a document. This is particularly important where a document is a published or unpublished report or thesis. If users wish to cite and reference the document then retaining the correct text on the correct page is important (particularly so if files go through subsequent migrations).
- Any other non-text content such as images and data tables. Ideally, this content should be stored separately.
The properties that are generally not seen as significant are:
- The font type and font size (unless this significantly changes pagination or formatting).
- Track changes functionality.
Significant properties may, however, vary depending on the exact nature of the document being preserved. All files should be assessed on a case-by-case basis in order to determine which of the above are relevant.
File formats
For preservation and stable long-term storage there is a distinct preference for one of the now popular standardised XML formats, such as Microsoft’s OOXML (.docx) or the OpenDocument Format, ODF (.odt), used by OpenOffice. A JISC Technology Watch Report, ‘XML-based Office Document Standards’ (Ditch 2007), examines and compares these specifications in some detail. The main benefits of both formats are that they are recognised international open standards and are text-based (as opposed to binary files) and thus human readable. Each format can be opened by the other's native application, as well as by a number of third-party applications such as Google Docs. The two formats are also similar in that they use a zipped archive to contain the separate components that make up each file.
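Because both formats are zipped archives, their components can be inspected with any ZIP-aware tool. The following is a minimal Python sketch, using only the standard library zipfile module, that lists each component of a document together with its compressed and uncompressed size. The function name list_components and the file name report.docx are illustrative assumptions and do not come from either specification.

```python
import zipfile


def list_components(source):
    """Return (name, compressed_size, uncompressed_size) for every member
    of a zipped document such as a .docx or .odt file.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return [
            (info.filename, info.compress_size, info.file_size)
            for info in archive.infolist()
        ]


if __name__ == "__main__":
    # "report.docx" is a hypothetical file name used for illustration only.
    for name, packed, unpacked in list_components("report.docx"):
        print(f"{name}: {packed} bytes compressed, {unpacked} bytes uncompressed")
```

For a typical .docx file the listing would be expected to include entries such as word/document.xml, and for an .odt file entries such as content.xml, alongside any embedded images.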
ODF does, however, make better use of other existing open standards (SVG, for example). The available specification for ODF is also far shorter (and potentially more complete) than that for OOXML, which suggests that third-party support for this format might become more readily available. Microsoft’s OOXML, on the other hand, provides better support for migrating earlier versions of MS Word files (backwards compatibility was one of its major design goals), whereas formatting is not always reproduced faithfully when converting from MS Word to ODF.
In addition to these XML-based formats, PDF/A should also be considered as a potential preservation format though primarily in cases where the original document only exists in a PDF format. Although a binary format, PDF/A is an open standard with a freely available reader and growing third-party support. As extracting or migrating content from PDF documents into other formats is problematic, PDF/A provides an effective means of accurately preserving existing PDF content in a recognised, albeit binary, open standard format.
Preservation format | Requirements |
---|---|
.docx | Suitable for deposit, dissemination and preservation though embedded content should be stored separately. The final file is essentially a zipped archive and may be best stored in an uncompressed format. |
.odt | Suitable for both deposit and preservation though, in the latter case, the files should be stored in their uncompressed form. Additionally, where the document contains images or other content, these should ideally be stored separately in a suitable preservation format. |
.pdf/a | Suitable for long-term preservation. Where the PDF/A file has been created from another format (e.g. .doc or .odt), it is recommended that the original files are retained and deposited alongside it. |
.txt / plain text files | Suitable for ingest, preservation and dissemination but only for extremely simple files. |
.sgml | Suitable for preservation and dissemination though documents must be valid. |
.html / .xhtml | Suitable for preservation and dissemination though documents must adhere to (and specify) a valid DTD and character encoding. Where used, CSS styles should either be specified within the document or supplied as a separate file. Images and other media should be dealt with as individual objects as per other guides. |
.xml | Suitable for preservation and dissemination though documents must adhere to (and specify) a valid DTD/schema and character encoding. |
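The table recommends holding .docx and .odt files in an uncompressed form. One possible way of doing this, sketched below under the assumption that "uncompressed" means unpacking the zipped archive into a directory of its component files, uses Python's standard zipfile module. The function name unpack_document and the file and directory names are hypothetical, and an archive may have its own preferred procedure.

```python
import zipfile
from pathlib import Path


def unpack_document(source, target_dir):
    """Unpack a zipped document (.docx or .odt) into target_dir.

    Returns the sorted list of member file names that were extracted.
    Raises ValueError if any member fails its integrity (CRC) check.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(source) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        corrupt = archive.testzip()
        if corrupt is not None:
            raise ValueError(f"corrupt member in archive: {corrupt}")
        archive.extractall(target)
        # Skip directory entries, which end with a slash.
        return sorted(n for n in archive.namelist() if not n.endswith("/"))


if __name__ == "__main__":
    # Hypothetical file and directory names, for illustration only.
    for name in unpack_document("thesis.odt", "thesis_odt_unpacked"):
        print(name)
```

Unpacking in this way also exposes embedded images and other media as individual files, which makes it easier to store them separately in a suitable preservation format.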