What are documents and texts?

Kieron Niven, Archaeology Data Service / Digital Antiquity, Guides to Good Practice

Simply put, the majority of text documents are digital analogues of traditional publications and can therefore range in size and complexity from fairly simple reports and short papers through to substantial documents such as theses or books. These files consist predominantly of structured text (sentences, paragraphs, pages, chapters) but often include other elements such as images, figures and tabular data.

Digital texts can be produced in a variety of ways, though most are created from scratch in word processing packages such as Microsoft Word and OpenOffice. In the past, files produced by word processing packages were predominantly stored in proprietary binary file formats. More recent packages such as Microsoft Word 2007 and OpenOffice mark a distinct move towards human-readable, XML-based formats and standards such as .docx (part of the Office Open XML [1] format) and .odt (part of the OpenDocument [2] format). In addition to the formats in which documents are originally created, the final versions of many text documents are stored and disseminated in a common interchange format, most notably Adobe’s Portable Document Format [3] (PDF), which keeps the formatting and structure of a document consistent across a variety of platforms while also removing many of the editing possibilities.
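
To illustrate the point about XML-based formats: both .docx and .odt files are ZIP packages whose parts are largely XML, so their contents can be inspected with general-purpose tools rather than the originating word processor. The following is a minimal Python sketch using only the standard library. The function name list_xml_parts and the file name report.docx are illustrative assumptions, not part of either standard.

```python
import zipfile
from xml.etree import ElementTree


def list_xml_parts(source):
    """Return the names of the well-formed XML parts in a ZIP-based document.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper, written for illustration only.)
    """
    xml_parts = []
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip images, thumbnails and other non-XML parts
            try:
                # Parsing confirms the part is well-formed XML,
                # not merely a file with an .xml extension.
                ElementTree.fromstring(package.read(name))
            except ElementTree.ParseError:
                continue
            xml_parts.append(name)
    return xml_parts
```

For a file saved by Microsoft Word, a call such as list_xml_parts("report.docx") would normally include word/document.xml, which holds the main body text; for an OpenDocument text file the equivalent part is content.xml.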

In addition to documents created within word processing software, a significant proportion of text documents are created as the result of a digitisation process. Journal digitisation, usually carried out to preserve or disseminate pre-digital collections, is often the largest source of digital texts created outside of a word processor. This process generally starts with a digitised image of the hard copy page, which is then processed using optical character recognition (OCR) in order to transform the image into ‘real’ (i.e. editable, searchable, etc.) text. The final text, which may also include images and figures, is predominantly stored in the PDF file format, though an XML-based format may also be used, especially where dynamic online dissemination is required.

Beyond common word processing formats and PDF files, texts may also exist in a range of plain text or marked-up formats such as SGML, HTML and XML. This range of formats is discussed in detail in the Oxford Text Archive Preservation Manuals and will be dealt with briefly alongside other formats below.
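
The difference between a marked-up text and its plain text equivalent can be shown with a short sketch. The Python example below, which uses only the standard library, strips the tags from a small XML fragment and returns the running text. The function name plain_text and the element names in the usage note (p, site, emph) are invented for illustration and do not belong to any particular schema.

```python
from xml.etree import ElementTree


def plain_text(xml_markup):
    """Return the text of an XML fragment with all tags removed.

    (Hypothetical helper, written for illustration only.)
    """
    root = ElementTree.fromstring(xml_markup)
    # itertext() yields the text of every element in document order;
    # split/join then collapses runs of whitespace into single spaces.
    return " ".join("".join(root.itertext()).split())
```

Given the fragment "<p>Pottery from <site>Trench 4</site> was <emph>not</emph> recorded.</p>", the function returns "Pottery from Trench 4 was not recorded." The wording survives, but the information carried by the tags (that "Trench 4" is a site name, that "not" is emphasised) is lost, which is why the choice between plain and marked-up formats matters for preservation.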

