Archiving Digital Documents

Archaeologists are creating large numbers of electronic texts in the course of their researches. These have historically been drafts leading to a final hardcopy but are increasingly becoming end products in themselves. There is therefore a need to archive these electronic texts. However, many of the tools currently available in word processors are designed specifically for printed documents and can lead to problems in digital archiving. This document aims to highlight some of these difficulties and suggest approaches which can be used to circumvent them.

Contents

  1. Introduction
  2. Formatting and fonts
  3. Tables and figures
  4. Tables of contents and tables of figures
  5. Footnotes and endnotes
  6. Cross referencing within documents
  7. Saving as HTML
  8. Recommendations
  9. Appendix

1. Introduction

Word processing has played a major part in the increasing popularity of personal computers in archaeology. Word processors gave the advantage of allowing substantial and repeated edits without the need to fully retype a document. This gave much greater control over the content of a document with much reduced effort. The final printed document, however, was still produced by what was essentially (or even actually) an automatic typewriter using fixed pitch fonts (each character is allotted the same amount of space) and, normally, a ragged right margin.

As technologies improved and costs came down, page (laser) printers replaced character printers and word processing programmes moved closer to desk-top publishing packages, giving users greater control over the appearance of their printed documents. Authors may now choose from a variety of fonts in a variety of sizes and can choose how the text should be justified. Tables, graphs and images are easily included in documents and colour can be used if desired.

The increase in the capabilities of word processors has inevitably led to an increase in the complexity of the computer files in which the documents are stored. The formats of these files are vendor specific and this has been a problem for users of different software packages. Most word processors now come with a range of filters which allow them to read and write files in a number of different formats. Documents may be freely exchanged between users regardless of their choice of word processor, unless they receive a file from a more recent version of a programme than those recognised by their own word processor.

In the short term there are few problems with sharing files but this cannot be guaranteed in the future. Indeed some files created by 'entry level' programmes (e.g. MS Works) or by obsolete programmes can prove to be inaccessible to users of higher specification programmes, and files created by the latest releases of word processing packages are invariably inaccessible to users of older software. For example, WordStar was a popular programme before the advent of graphical interfaces and many documents in this format are still in existence. WordPerfect can normally recognise WordStar files as a matter of course. MS Word does not install filters to read WordStar documents by default; these must be specifically requested during installation or be loaded at a later date. Even with these filters present both WordPerfect and Word may sometimes fail to recognise WordStar files and attempt to convert them inappropriately, presenting the user with a largely unintelligible document.

When archiving word processed documents it is, therefore, preferable to convert the files to a neutral format or for the formatting to be embedded in the document as easily recognisable tags and the text as plain text which can be extracted if necessary. The most basic neutral format is plain ASCII text. The most widely used tagged formats are probably Rich Text Format (RTF) and HTML, the format of World Wide Web documents.

These examples clearly show the advantages of ASCII text as an archive format - the 'intellectual content' of the material is obvious. The text is easily discernible in the simple (hand crafted) HTML but is more difficult to find in the automatically generated HTML, requiring software to render the text. Trying to extract the text from the RTF file is an almost futile task and thus the content can only be accessed using software that can read RTF format.

ASCII is a form of computer coding in which each character is coded as a number. Computer programmes use the number to find the corresponding shape (the character) in a look-up table. ASCII stands for American Standard Code for Information Interchange and was developed in the days of mechanical terminals such as Teletypes. Thus it only supports standard alphabetical characters, numbers and symbols.

To cope with 'primitive' graphics and to give some support for foreign languages ASCII was extended to give another 128 codes. Accented characters and a variety of symbols are available in this extended range. The contents of the extended ASCII range are not firmly defined and may differ between systems: where one system provides ° (the degree symbol) another may give a block graphic character. None of the characters in the extended ASCII range can be relied upon in plain text files.
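The ambiguity of the extended range can be illustrated with a minimal sketch. Python is used here purely for illustration, and the two character tables chosen (ISO 8859-1 and IBM code page 437) are assumptions standing in for 'different systems'; they are not named in the guidance above.

```python
import unicodedata

# Byte value 176 lies in the extended range, above the codes of standard ASCII.
raw = bytes([176])

# Assumed example 1: ISO 8859-1 (Latin-1) maps this byte to the degree symbol.
as_latin1 = raw.decode("latin-1")

# Assumed example 2: IBM code page 437 maps the same byte to a block graphic.
as_cp437 = raw.decode("cp437")

# Characters in the standard ASCII range decode identically under both tables.
plain = b"Trench 3"
same_everywhere = plain.decode("latin-1") == plain.decode("cp437")

print(ord("A"))                      # the code number stored for the letter A
print(unicodedata.name(as_latin1))   # name of the character under Latin-1
print(unicodedata.name(as_cp437))    # name of the character under code page 437
print(same_everywhere)
```

The same stored number yields two different characters, whereas text restricted to the standard ASCII range reads the same under either table.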

2. Formatting and fonts

Heavily formatted documents pose great difficulties for archiving and disseminating electronic documents. Despite efforts by software writers to enable word processing programmes to read and to write files in the formats of other programmes they are often only partially successful at achieving this end. For example, multi-column documents are seldom laid out the same way by different word processors or even by different versions of a given word processor. Using object frames or graphics boxes to lay out or emphasise text can lead to severe problems if the document is viewed in a version or programme other than the one in which the material was created. In some circumstances it can cause the computer to crash. To avoid these problems it is best to compose the document with minimal formatting and to apply the desired formatting to a copy of the document created specifically for printing. The content is, after all, more important than the presentation and attempting to preserve the look of a document at the expense of its content is folly.

Particular problems occur when the meaning of the text is dependent upon its formatting. It is an easy matter to construct matrices in a word processor, using combinations of centred text, spaces and tabs to lay out the diagram. Centred text is likely to retain its position on the page correctly in a word processed document but if the document is converted to plain ASCII text then all the centred text will 'fall' to the left hand margin and as a result the layout, and possibly the meaning, will be lost.

[Figures: matrix; matrix as ASCII text]

The matrix on the left, consisting of numbers, underscores and the vertical bar formatted with centring, tabs and spaces, collapses to the matrix on the right when saved as plain text.

The layout can be preserved by saving the file in Rich Text Format; saving as HTML may not be as successful:
[Figure: matrix as HTML]

The use of non-standard symbols may also result in loss of information when a document is converted to plain text or HTML. The following small matrix diagram created in MS Word
[Figure: another matrix]

saved as plain text becomes
[Figure: matrix as ASCII text]

saved as HTML and viewed in a browser for which the fonts for the special characters are not available renders as
[Figure: matrix in HTML without the special fonts]

Special symbols are often automatically inserted by word processors as you type. Unless instructed otherwise, some programmes will replace standard quotes with "smart" quotes, fractions (as x/y) with fraction characters, three dots with an ellipsis character and turn ordinals into superscript. These may not be saved as expected if the document is converted to plain text or saved as HTML.
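One way of guarding against such substitutions before archiving is to map the inserted symbols back to plain equivalents. The following is a minimal sketch only: the function names and the mapping are hypothetical, and the mapping covers just the kinds of symbol mentioned above rather than everything a word processor may insert.

```python
# Hypothetical mapping from automatically inserted symbols to ASCII equivalents.
ASCII_EQUIVALENTS = {
    "\u2018": "'",     # left single "smart" quote
    "\u2019": "'",     # right single "smart" quote
    "\u201c": '"',     # left double "smart" quote
    "\u201d": '"',     # right double "smart" quote
    "\u2026": "...",   # single ellipsis character
    "\u00bc": "1/4",   # fraction characters; note that a mixed number
    "\u00bd": "1/2",   # such as 3 followed by a half would become "31/2"
    "\u00be": "3/4",   # and so still needs checking by eye
}

def to_plain_ascii(text: str) -> str:
    """Replace the symbols listed above with their ASCII equivalents."""
    return text.translate(str.maketrans(ASCII_EQUIVALENTS))

def remaining_non_ascii(text: str) -> list:
    """List any characters outside standard ASCII that still need attention."""
    return sorted({ch for ch in text if ord(ch) > 127})

sample = "\u201cpit\u201d \u2026 \u00bd m"
cleaned = to_plain_ascii(sample)
print(cleaned)
print(remaining_non_ascii(cleaned))
```

Anything reported by the second function is left for the author to resolve, since an automatic substitution may not preserve the intended meaning.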

Formatting is also used to convey the structure of the document - headings may be in a larger or different font, emboldened, underlined, or positioned distinctively on the line by centring the text or starting it some distance into the left margin. The hierarchies of sections within a document can be clearly communicated to the reader by formatting each heading level differently. All such effects are again lost when a document is saved as ASCII text. Documents destined for electronic archiving as plain text files should have the section headings clearly distinct from the text using differing amounts of vertical and horizontal white space and/or labelled using a hierarchical numbering system.
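A hierarchical numbering system of the kind suggested above can be sketched as follows. This is an illustration under stated assumptions: the function name, the (level, title) input and the sample headings are all hypothetical, and level 1 is taken to be the top of the hierarchy.

```python
def number_headings(headings):
    """Turn (level, title) pairs into hierarchically numbered heading lines."""
    counters = []
    lines = []
    for level, title in headings:
        # Forget counters belonging to deeper levels than the current heading.
        counters = counters[:level]
        # Open new levels as needed (a skipped level is numbered 0).
        while len(counters) < level:
            counters.append(0)
        counters[level - 1] += 1
        number = ".".join(str(n) for n in counters)
        lines.append(f"{number} {title}")
    return lines

# Hypothetical outline used only to demonstrate the numbering.
outline = [
    (1, "Introduction"),
    (2, "Background"),
    (2, "Methods"),
    (1, "Results"),
    (2, "Pottery"),
]
for line in number_headings(outline):
    print(line)
```

Because the numbers carry the hierarchy, the structure survives even when every font and indent is stripped from the file.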

3. Tables and figures

It is convenient to insert tables and charts created in a spreadsheet directly as objects in a word processed document. However, it is not just the current spreadsheet table that is imported but the whole workbook. If there are multiple tables on the worksheet, or there are multiple worksheets, they are all imported into the document. This excess baggage cannot be removed. A better (and quicker) solution is to copy the tables from the spreadsheet and paste them into the word processed document. They are then converted to normal tables in the document.

If a document containing embedded objects such as spreadsheet tables and charts is saved as HTML the embedded objects are converted to image files. This is perfectly acceptable for charts or other graphical material (although the quality may be seriously compromised - see appendix) but it makes the data in tables inaccessible. It is generally better to link graphics files to the document (see appendix for file formats) and to archive both text and graphics files separately. The relative locations of the documents need to be considered to ensure the link does not become broken.

Embedded objects are lost if the document is converted to plain text. Native tables can be extracted from word processed documents with ease if the table structure is not complicated and the formatting in the table is straightforward but complex tables may become scrambled.

4. Tables of contents and tables of figures

Tables of contents and tables of figures are generally created when the text is finalised and the pagination known. If documents require conversion to other formats for archiving the pagination is likely to be lost and these tables will only serve to indicate how the document is logically structured and fail to fulfil their role of helping the reader to find the individual sections.

Tables of contents and tables of figures can be generated automatically if inbuilt heading styles are used and captions inserted via the relevant menu item. The tables can be updated during the editing process, saving the effort of keeping track of sections and figures. The items in the table of contents also become short-cuts to the sections in the document and if the document is saved in HTML format they are turned into hyperlinks to the sections in the document (but see 7. Saving as HTML).

5. Footnotes and endnotes

The loss of pagination in archive formats makes footnotes undesirable. Saving a file with footnotes as HTML results in the footnotes being moved to the end of the file where they appear as endnotes. Using both footnotes and endnotes therefore results in two sets of endnotes which can lead to confusion. If notes must be used it is better to use endnotes which, in HTML files, can be hyper-linked to the referencing text.

6. Cross referencing within documents

Pagination can seldom be guaranteed between different word processors or between different versions of a given word processor. This is because

  • documents opened by word processors are formatted for the computer's current default printer. Settings for different printers may differ sufficiently for the text to be formatted differently on the page and the default page size may be different from that envisaged by the author.
  • different word processor packages and versions handle embedded items differently. For example, a table prepared in MS Word 2000 tends to take up more space when the document is converted to earlier versions of MS Word.

It is best to avoid using your perception of the pagination when cross referencing within a document (including tables of contents, lists of figures and tables) because it may not be preserved when the document is viewed by others or by yourself if you change or upgrade your software or change your printer. The pagination can not be guaranteed if the document is saved as plain text and indeed is irrelevant for documents converted to HTML. The safest solution is to use numbered sections and even numbered paragraphs to guide users as close as possible to the referenced material.

7. Saving as HTML

Most current word processors allow you to save a document in HTML format. This appears very convenient but it has drawbacks. As with all machine translations the results are generally idiosyncratic. Conversion to HTML is approached by most word processors as a printing task and the documents are subject to physical mark-up in which the tags are explicit instructions on rendering the text - the font, its size and attributes, and the position of the text on the page. This tends to result in the insertion of an excess of formatting tags each with a full set of attributes which leads to a substantial inflation of the file.

Hard coded formatting in the text may exacerbate these problems. Each time emboldening or italicisation is set a formatting code is inserted in the text which will become an HTML tag. Within paragraphs this is inevitable but it is not necessary for headings. Headings can be formatted using the word processor's style tool. This is semantic mark-up - the essence of HTML - and will be used in the conversion process to mark up the text as HTML heading levels. The default formats of the styles offered by the word processor may be changed to suit your own tastes and the redefined style will be incorporated in the HTML file's style definitions. There are also styles defined for other structural elements, particularly bulleted and numbered lists and indented (block) text. Using these styles in word processed documents will assist in producing 'clean' HTML.

Automatically generated HTML may be designed for a specific browser, e.g. Internet Explorer, which may not fully support some elements of HTML and may implement non-standard features. The result may look fine in one browser but be an utter mess in another. Non-standard elements and superfluous tagging can be removed by utilities which can clean up the HTML in the converted document. Such utilities are available either as stand-alone programmes (e.g. HTML Tidy available from the World Wide Web Consortium - see a review and tips on its use) or as tools incorporated into web authoring software, and may operate automatically on saving the document (e.g. Netscape Composer removes a lot of unnecessary tags inserted by MS Word by simply opening the HTML file and then saving it). For users of MS Office 2000 there is the Office 2000 HTML Filter 2.0 for removing MS Office specific tags although it still leaves a lot of superfluous tags in the HTML file.
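The difference between physical and semantic mark-up can be seen in a minimal sketch of the kind of clean-up such utilities perform. This is not how HTML Tidy or any of the tools named above works; it is an illustration using Python's standard html.parser module. The class name, the function name and the set of tags treated as structural are assumptions, and the input is an invented fragment in the style of word processor output.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: only these tags are treated as structural and kept.
STRUCTURAL_TAGS = {"h1", "h2", "h3", "p", "ul", "ol", "li",
                   "table", "tr", "th", "td", "blockquote"}
# Assumption: the contents of these elements are discarded entirely.
SKIPPED_ELEMENTS = {"style", "script"}

class MarkupCleaner(HTMLParser):
    """Keep structural tags (without attributes) and text; drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_ELEMENTS:
            self.skip_depth += 1
        elif tag in STRUCTURAL_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in SKIPPED_ELEMENTS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in STRUCTURAL_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(escape(data, quote=False))

def clean_html(source: str) -> str:
    cleaner = MarkupCleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.parts)

# Invented example of heavily tagged output.
generated = ('<style>p.MsoNormal{margin:0cm}</style>'
             '<h1 align="center"><font face="Arial" size="5"><b>Site report</b></font></h1>'
             '<p class="MsoNormal"><span lang="EN-GB">Trench 1 &amp; 2</span></p>')
print(clean_html(generated))
```

The sketch also discards emphasis within paragraphs, which a real utility would normally retain, so it should be read as a demonstration of the principle rather than a substitute for a dedicated tool.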

8. Recommendations

When creating documents concentrate solely on the content, using formatting only to indicate the document structure. Final formatting for printing should be applied to a copy of the original document.

There are advantages in using your word processor's built-in semantic mark-up styles. These can generally be found in a drop down selection list or under the Format menu. The defaults can be altered and added to in order to suit your own personal tastes and changes are permanently stored in the document. You can also create your own templates to be applied to future documents. Styles are recognised and used if the document is saved as HTML, giving a much cleaner and more generic result than when formatting is applied directly to elements within the document. Styles also make it easy to update formatting globally and, when heading styles are used, to generate tables of contents which can be automatically updated during editing.

Do not make meaning in a document dependent on formatting. The formatting may be changed or lost when the document is translated to another file format and the meaning may be lost with it.

The use of special characters should be avoided if possible. These may not translate to other formats or be accessible to users of other word processing packages, particularly if they are taken from non-standard fonts.

Numbering paragraphs or at least sections will ensure that cross references retain their meaning if the document is translated to another file format.

Do not embed 'foreign' material, be it tables or graphics, created by other software packages in documents. If such material needs to be included then copying it from the source and pasting it into the word processed document will give a result which is easier to archive and disseminate; this is particularly the case with tables from spreadsheets. An alternative is to link the 'foreign' file or material to the document. The text and tabular or graphic material can then be converted to suitable archival formats separately.

9. Appendix

Graphic file formats

There are many graphic file formats in existence with differing capabilities. Some are better as archive formats than others. Only a few are understood by today's web browsers. The two principal formats that can be used with web documents are GIF and JPEG, with PNG currently gaining more support. These are both compressed formats which can substantially reduce the size of graphic files if used on the correct type of image.

  • GIF images may only contain 256 distinct colours or shades of grey. These 256 colour values are not fixed but are selected from a 16 million colour palette. The compression of GIF images involves no loss of information in the file. GIF files are best for images with blocks of single colours. They should not be used for storing photographs in which colours or grey values grade gradually across the image. Many of the grey or colour values in a photograph need to be modified to reduce the total colour count in the picture to 256, resulting in a substantial loss in image quality. GIF is a suitable archive format for images with few colours and large blocks of single colours.
  • JPEG images are full colour, i.e. they use a 16 million colour palette. This is a better format than GIF for storing photographs as the gradual changes between colours are preserved. However, the compression used in JPEG images does result in a loss of information (information beyond what the human eye can see is discarded) and so JPEG cannot be considered a suitable format for archiving images.
  • TIFF is a very flexible format which can be extended to cover different requirements. TIFF files can be saved at a variety of colour depths from 2 colour (black and white) to over 16 million colours depending on the nature of the image. Saving at a lower colour depth results in a smaller file size. TIFF files can be saved in compressed format and it is a non-lossy compression - all information in the image is preserved if the chosen colour depth is sufficiently great (colours are not modified to reduce the total number of colours to the available colour palette). However compression is only effective for images with little fine colour gradation; a photograph saved as a compressed file can be larger than an uncompressed version. TIFF format is a good candidate for archiving images.
  • BMP and PCX files are in many respects similar to core (basic) TIFF files: they can hold a variety of colour depths and use a non-lossy compression technique.
  • TGA (Targa) has most of the qualities of TIFF images and can store 32 bit colour images (approaching 4,300 million colours). Targa images may be compressed using a non-lossy algorithm but support for Targa files is not universal reducing its value as an archive format.
  • PNG is a compressed (non-lossy) format supporting 256 or 16 million colours. It was developed as a patent-free alternative to the GIF and JPEG formats, designed principally for use on the internet, but during the six or so years since it was developed it has seen little support.
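The colour counts quoted for these formats follow directly from the number of bits stored for each pixel, and the same figure gives a rough idea of uncompressed file size. The sketch below is illustrative arithmetic only: the function names are hypothetical, and the size calculation ignores file headers and any padding a particular format may add.

```python
def colours_for_depth(bits: int) -> int:
    """Number of distinct colours that can be coded in the given number of bits."""
    return 2 ** bits

def uncompressed_bytes(width: int, height: int, bits: int) -> int:
    """Approximate size of the raw pixel data, rounded up to whole bytes."""
    return (width * height * bits + 7) // 8

for label, bits in [("1 bit (black and white)", 1),
                    ("8 bit (GIF palette)", 8),
                    ("24 bit (full colour)", 24),
                    ("32 bit (Targa)", 32)]:
    print(f"{label}: {colours_for_depth(bits):,} colours")

# A hypothetical 1000 x 1000 pixel image at two colour depths.
print(uncompressed_bytes(1000, 1000, 1))
print(uncompressed_bytes(1000, 1000, 24))
```

An 8 bit image can therefore hold 256 colours and a 24 bit image 16,777,216, while the 1000 x 1000 pixel example grows from 125,000 bytes at 1 bit to 3,000,000 bytes at 24 bit, which is why saving at a lower colour depth gives a smaller file.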