6. Archives and Digital Datasets
This chapter explores current practices in long-term archiving of digital datasets. Swain's recent study of museum archives in England showed that archaeology archives were held by a variety of organisations - regional and national museums, Royal Commissions, and also some contracting field units (Swain 1998; see also 4.3). Although digital datasets currently represent a tiny proportion of the holdings of archiving facilities, they do present a challenge to the bodies that hold them. Strategies for Digital Data makes it possible to explore in more detail the current state of digital data in archives in Britain and Ireland. This chapter details findings from organisations that stated that they held digital archives in the short- and long-term. Only responses from organisations were felt to provide sufficient information on the current state of digital data archives.
Among the 166 organisations that held digital data (both short- and long-term), there was great variation in the size and complexity of digital archives. A diverse range of strategies, from the inadequate to the exemplary, has been adopted to curate material. This is a reflection, in part, of the range of organisations that are archiving digital data, as this rôle is not confined only to museums/formal archaeology archives. The diversity of curating strategies is also the result of national and international guidelines and standards for digital datasets being relatively new, and not as well established as those for the conservation of paper, photographs, and records in other media.
Figure 6.1 shows that representation of archiving facilities from the various countries surveyed corresponds well with the overall returns from organisations (shown in Figure 5.2). Although the percentage of archiving facilities responding in England is higher than the overall percentage of organisations that replied (69%), archives in Scotland and Wales are slightly lower than expected (9% and 8% respectively in Figure 5.2). The survey returns from Northern Ireland and the Republic are the same percentage as in Figure 5.2.
Figure 6.1 The country where archives are held (based on survey returns only)
6.2 How much digital information do archives hold?
Figure 6.2 shows the amount of digital data currently being held in digital archiving facilities, both by the size of the archive in megabytes (Mb), and by the number of projects archived. About one-third of these are fairly small - between 1 and 100Mb (equivalent to 70 floppy discs). A further 13% hold between 100 and 1,000Mb (equivalent to 695 floppy discs). Almost a third of organisations (49) are holding over 1Gb of information. 29% of the survey population provided no information regarding the size of their digital holdings.
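The floppy-disc equivalents quoted above can be checked with a short calculation. This is a minimal sketch that assumes a nominal capacity of 1.44Mb per disc; the report does not state which capacity it used, and the function name is purely illustrative.

```python
import math

# Assumption: a 3.5-inch high-density floppy disc with a nominal 1.44Mb capacity.
FLOPPY_CAPACITY_MB = 1.44


def floppies_needed(size_mb: float) -> int:
    """Number of whole floppy discs needed to hold size_mb megabytes."""
    return math.ceil(size_mb / FLOPPY_CAPACITY_MB)


for size in (100, 1000):
    print(f"{size}Mb needs about {floppies_needed(size)} floppy discs")
```

Under that assumption the sketch gives 70 discs for 100Mb and 695 discs for 1,000Mb, which agrees with the equivalents given in the text.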
Figure 6.2 also shows the response for the number of projects archived, where a substantial proportion of the information is in digital form. Where this is the case, there is an increased likelihood that the digital component contains unique information, for example, 3D models in CAD, a GIS, or enhanced context information on a computer database. It may not be possible to reproduce this information on paper - the functionality of the dataset may rely on its being available digitally. Half of those responding held relatively few extensively digital projects (between 1 and 20). Almost a fifth had greater numbers of projects (between 20 and 100), and a seventh stated they held over 100 extensively digitised projects. Fewer respondents were able to answer this question than state how large their archives were in Mb.
Figure 6.2 Approximation of the amount of digital data held by organisations
Size of archive in Mb
Number of Projects archived digitally
Figure 6.3 explores the relationship between the amount of digital data (in Mb) held by archiving bodies and the number of extensively digital projects held in those archives. This was to see whether archiving bodies holding a lot of digital data also hold a large number of extensively digitised project archives. Figure 6.3 presents a lot of information. Archiving bodies have been divided into groups according to the size of their digital holdings, and each bar on the graph gives the total number of archiving bodies holding a certain amount of digital data (in Mb). Divisions within each bar further categorise archiving bodies according to the number of individual, extensively digitised projects held within their archives. The categories shown in the legend represent the number of extensively digital projects held in an archive. The main conclusion to draw from this graph is that there is no clear relationship between the size of archives (in Mb) and the number of extensively digital projects they contain. Large archives are more likely to contain many digitised projects, but this is a fairly safe assumption to make for any digital archive. Additionally, individual, extensively digitised projects can vary greatly in size - those that contain complex CAD files, GIS, or digitised images will be much larger than tabulated datasets or databases. The size of a digitised project (in Mb) is therefore no indication of the amount of information it contains.
Figure 6.3 Amount of digital data held by archiving bodies, compared with the number of extensively digital projects contained in these archives
In order to gain a better understanding of the sorts of organisations that keep digital datasets in the short- and long-term, tests were run to identify the size of organisations (in staff numbers) and their rôle in archaeology. Traditionally, archaeological archives have been large bodies such as national museums and the Royal Commissions. Swain's (1998) survey of museum archives in England (outlined in 4.3) showed that many large units also retain project archives. Figures 6.4, 6.5 and 6.6 also suggest that many small organisations are holding digital archives. In Figure 6.4 each column on the graph shows the total number of archiving bodies holding a certain amount of digital data (1-5Mb, 5-10Mb etc.). Archiving bodies are further categorised based on the number of employees, which are listed in the legend. Figure 6.4 shows that archiving bodies holding large amounts of digital data are not necessarily large organisations (based on the number of employees). Reading from left to right, and using the code for organisations with 1-5 employees, it is apparent that small organisations exhibit great variation in the size of digital datasets they hold. Although larger organisations (with over 20 employees) tend to be associated with large datasets (over 100Mb), they are not alone. Consequently any attempt to locate digital archives must be addressed to all organisations, not just the major or national bodies (section 5.4 and Figure 5.3 provide more information on the range of organisations responding to the survey).
Figure 6.4 Relationship between the number of staff in an organisation/department and the size of digital archives
Figure 6.5 Relationship between the rôles organisations play in archaeology and the amount of digital data they hold
Figure 6.5 explores the range of organisations holding digital archives. It is organised along similar lines to Figure 6.4 - each column shows the total number of archiving bodies holding a certain amount of digital data (1-5Mb, 5-10Mb etc.). Archiving bodies are further categorised according to their rôle in archaeology. The legend shows the categories used for the various areas of archaeology. The figure illustrates that many archive holders are consultancies and that a minority of museums hold large digital datasets. Large digital datasets are held by organisations in almost all of the other sectors listed. This underlines the need to address all archaeologists regarding the creation and maintenance of digital archives.
Figure 6.6 explores the number of extensively digitised projects held by different archiving bodies (e.g. none, 1-5 projects etc.). Each column represents the total number of archiving bodies holding a certain number of digital projects. Archiving bodies are further subdivided according to the rôle they play in archaeology, shown in the legend. Figure 6.6 illustrates that museums do not hold many extensively digitised projects; instead these are held by consultancies, field units, local government archaeology departments, and national bodies.
Figure 6.6 Relationship between the rôles organisations play in archaeology and the number of extensively digitised projects they hold
Although the museum sector and traditional archaeological archives have the responsibility to curate information in the long term, Figures 6.5-6.6 imply that they are currently holding a small proportion of digital data potentially available. This reinforces the findings of Swain's survey into museum archives, where he concluded that some field units are holding onto project archive material, including finds (Swain 1998, paragraph 7). Although Swain suggested that it was mostly large units, particularly those in towns, which were holding onto project archives, the findings of Strategies for Digital Data suggest that digital datasets are being retained by a wide variety of organisations.
6.3 How secure are digital data currently held in archives?
Strategies for Digital Data indicates that archaeologists are holding many projects with an important digital component. The amount of space needed to store the physical media on which this digital information is held is not great. As well as protecting the media on which information is stored, steps need to be taken to ensure it is still readable on computer. Additionally, support documentation is essential to enable re-use of datasets for those not involved in the original project.
Many tools are available for storing information - floppy discs, tapes, CD-ROMs and so on. Some are more reliable than others. Protecting the media on which data are held is only one element of ensuring their preservation. Information can also become unreadable through hardware and software development. Changes in programs can mean that information created in an old version may not be readable in subsequent versions. Archiving digital data is therefore an active process (for more information, see AHDS 1998).
As the questionnaire was designed to identify current preservation strategies it was not possible to differentiate between long-term strategies and ongoing security measures. Figure 6.7 shows the range of storage facilities used for digital data (this is a summary of the responses to question 6.5). It is clear that the great majority of archives are held on magnetic media (hard disc, floppy disc, or some kind of tape). Fewer organisations have access to networks (only those archaeologists working for large organisations such as local government or HE departments). However, these are used by some for holding digital datasets, as well as for everyday work. Even fewer have access to read/write CD-ROMs.
Figure 6.7 Use of various file storage options
Another indication of the complexity of archiving practices is the number of ways in which data are held. If several methods are employed, the chances of losing information are reduced. Figure 6.7 also shows the percentage of organisations that rely on 1, 2 and 3 or more back-up options. It is assumed that information held on hard disc is not a formal back-up and that other means are used to store information. The results are summarised in Table 6.1. Almost half of the respondents rely on a single back-up strategy. Table 6.1 shows that many data creators rely solely on floppy discs for back-ups. These are widely acknowledged to be the least stable means of storing data long-term. Another one-third use two options, with the majority using floppy discs and tapes. Only one-fifth use three or more back-up options.
Table 6.1. Summary of methods used to store digital information
Medium         Single medium (74 responses)    1 of 2 media (54 responses)    1 of 3+ media (32 responses)
CD-ROM         2                               6                              14
Network        15                              16                             24
Tape           18                              47                             32
Floppy disc    39                              41                             29
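The proportions quoted for Table 6.1 (almost half, one-third and one-fifth) can be re-derived from the response counts given in its column headings. The following is a minimal sketch; the variable names are illustrative and not part of the survey.

```python
# Response counts taken from the column headings of Table 6.1.
responses = {"single medium": 74, "two media": 54, "three or more media": 32}

total = sum(responses.values())

# Share of respondents in each group, as a whole-number percentage.
shares = {group: round(100 * count / total) for group, count in responses.items()}

for group, share in shares.items():
    print(f"{group}: {share}% of {total} respondents")

# Floppy-only archives (39) as a share of those relying on a single medium.
floppy_only_share = round(100 * 39 / responses["single medium"])
print(f"floppy disc only: {floppy_only_share}% of single-medium archives")
```

On these counts the three groups account for 46%, 34% and 20% of the 160 respondents, and just over half of the single-medium archives rely on floppy discs alone.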
Two further tests were undertaken to explore whether large digital archives employ a more complex back-up strategy compared with those holding small numbers of digitised projects. The results of these tests are shown in Figures 6.8-6.9 and summarised in Tables 6.2-6.3.
Figure 6.8 Relationship between the size of archives (Mb) and complexity of back-up strategies
Table 6.2 Summary of archive size and number of back-up methods used
Size of archive    1 option    2 options    3 or more
1-5Mb              8           2            0
5-10Mb             7           3            0
10-50Mb            6           2            2
50-100Mb           10          4            1
100-1,000Mb        3           10           6
>1Gb               15          19           15
Figure 6.9 Relationship between the size of archives (no. of projects) and back-up strategies
Table 6.3. Summary of archive size (no. of projects) and number of back-up methods used
No. of projects    1 option    2 options    3 or more
1-5                27          14           4
6-20               14          8            6
21-100             10          13           5
>100               3           9            11
In Figures 6.8 and 6.9, each column represents archives of a particular size. The codes for the number of back-up methods used are shown in the legend. The results are summarised in Tables 6.2 and 6.3. The figures show that generally the larger the archive (those over 100Mb in size, or holding more than 20 projects) the greater the range of back-up methods used. However, some large archives rely on a single back-up strategy (generally tapes or networks). Moreover, two organisations holding over 100Mb of information and six organisations holding more than 20 extensively digitised projects rely solely on floppy discs.
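The trend described above can be made explicit by calculating, for each size band, the share of archives that use two or more back-up methods. This is a minimal sketch using the counts in Tables 6.2 and 6.3; the function and variable names are illustrative.

```python
# Counts transcribed from Tables 6.2 and 6.3 as (1 option, 2 options, 3 or more).
by_size_mb = {
    "1-5Mb": (8, 2, 0),
    "5-10Mb": (7, 3, 0),
    "10-50Mb": (6, 2, 2),
    "50-100Mb": (10, 4, 1),
    "100-1,000Mb": (3, 10, 6),
    ">1Gb": (15, 19, 15),
}
by_projects = {
    "1-5": (27, 14, 4),
    "6-20": (14, 8, 6),
    "21-100": (10, 13, 5),
    ">100": (3, 9, 11),
}


def share_with_multiple_methods(counts):
    """Percentage of archives in a band that use two or more back-up methods."""
    one, two, three_plus = counts
    return 100 * (two + three_plus) / (one + two + three_plus)


for label, table in (("Size of archive", by_size_mb), ("No. of projects", by_projects)):
    print(label)
    for band, counts in table.items():
        print(f"  {band}: {share_with_multiple_methods(counts):.0f}%")
```

On these counts, between 20% and 40% of archives of up to 100Mb use more than one method, compared with 84% of those holding 100-1,000Mb and 69% of those over 1Gb. By number of projects the share rises from 40% (1-5 projects) to 87% (over 100 projects).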
There is variation in the storage of digital information. Many organisations rely on a complex holding strategy for digital datasets, combining two or more backing-up methods. From these results, it is possible to make some comments on everyday security practices and long-term preservation strategies:
- Many archives (39) rely only on copies held on floppy disc for short- and long-term preservation. By itself, this is an inadequate strategy.
- Tapes are being used by most archives for storing digital data, presumably both for short- and long-term.
- Little use is being made of CD-ROMs.
- There may not be a clear difference between practices used for everyday security and long-term storage of digital data.
- Although organisations holding larger digital archives tend to use a wider range of methods, some of these datasets could still be lost through unreliable backing-up (i.e. using floppy discs).
An essential purpose of Strategies for Digital Data was to explore future plans for file storage. These are illustrated in Figure 6.10. Overall, magnetic media (hard discs, floppy discs and tapes) will remain popular with archives. The most significant difference between Figures 6.7 (current) and 6.10 (future) is the planned increase in the use of CD-ROMs. This implies that archaeologists see the CD-ROM as the way forward, alongside continued use of magnetic tape drives, although floppy discs continue to be a popular option. The increased use of networks seems to have a limited prospect. More information can be gained by looking at the combinations of options archaeologists would like to use. There is a desire to use a greater range of backing-up methods, with only one-fifth seeing their future strategy relying on a single option (down from 46% in Figure 6.7). Table 6.4 shows that, for those relying on single strategies, many intend to use floppy discs. The number of archives planning to use two methods is roughly the same as in Figure 6.7. Almost half of those responding aim to use a wide range of methods - in particular future strategies are seen to lie increasingly with CD-ROMs and tape drives. Additionally, a few organisations stated that they plan to use a remote store for their digital data.
Figure 6.10 Future plans for storing digital data - use of typical storage options
Table 6.4 Summary of future methods to be used to store digital information
Medium         Single strategy    1 of 2 strategies    1 of 3+ strategies
CD-ROM         3                  17                   30
Network        4                  18                   27
Tape           6                  32                   43
Floppy disc    20                 26                   37
Figure 6.11 Protecting physical media (tapes etc.)
As mentioned in 6.3, an element of digital archiving is the preservation of the physical media on which data are held. Figure 6.11 illustrates details of the steps taken to protect the physical storage media (question 6.7).
The percentages of data creators using no form of protection are different on the two graphs because they were calculated differently. Respondents were given a list of options, from which they could select as many as were applicable to their preservation strategies in use. The pie chart on the left is based on a simple count of 'ticks' for each option listed; where an organisation selected more than one option, it is recorded more than once. The chart on the right summarises variation in the number of preservation strategies used. Each organisation is recorded only once in this graph. The chart on the right shows that three-fifths of the organisations replying do not protect the media on which information is stored; indeed 91 have no suitable storage facilities. 64 archiving bodies have some kind of secure store. Overall, a quarter (39) rely on a single means of protection. Bearing in mind the reliance of most organisations on discs and tapes for storage, this indicates the real possibility of the discipline losing much information.
Before evaluating what information may be at the greatest risk, we need to explore the long-term strategies employed to preserve information. The questionnaire provided a list of practices that respondents could tick off, indicating the range of preservation practices employed, and space was left for additional comments. Some of the options listed can preserve information, but not digital functionality. Although many organisations do not have secure facilities for storing tapes, discs etc., this may not be a major issue if data are regularly migrated and copied onto new media. The results of the analysis into long-term preservation are shown in Figure 6.12. The pie chart on the left summarises a simple count of all options ticked by respondents (several options could be selected by each respondent). The chart on the right summarises variation in the number of strategies used (responses from organisations were counted only once). From the methods listed in Figure 6.12, the only one that can retain digital functionality is data migration.
Figure 6.12 Variation in the strategies being used to ensure long-term preservation of information
This involves the transferral of digital information as software is updated and checked for compatibility. Data migration and validation is practised in only 14% of cases. This process may be required several times while digital datasets remain in an archive. Refreshing digital datasets by copying them onto new media is a less reliable preservation method, as there is no guarantee that the information will remain readable as software and hardware are updated. A good number of organisations (53) ensure that information is secure by making printouts or microfiche copies, although not necessarily in digitised form. Where complex datasets such as relational databases, GIS, 3-D CAD drawings and even virtual reality models are concerned, a lot of information is not carried across in printed/microfiche form. The functionality of these datasets relies on their being available digitally. The second chart shows that many organisations (c. one-third) do not claim to do anything to the digital datasets in their care. In addition, almost half rely on a single strategy for preserving information.
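Refreshing, as described above, only moves the same bytes onto new media. A minimal sketch of how such a copy might be verified is given below, assuming a checksum comparison (SHA-256) is an acceptable check; the function names and file names are hypothetical and not taken from the survey. A matching checksum confirms only that the copy is identical to the original, not that the file will remain readable in future software, which is why migration is still needed.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source: Path, destination: Path) -> bool:
    """Copy a file to new storage and report whether the copy is byte-identical."""
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)


# Illustrative run using a temporary directory in place of old and new media.
with tempfile.TemporaryDirectory() as tmp:
    old_copy = Path(tmp) / "contexts.csv"  # hypothetical dataset
    old_copy.write_text("context,description\n101,pit fill\n")
    new_copy = Path(tmp) / "contexts_refreshed.csv"
    print("copy verified:", refresh_copy(old_copy, new_copy))
```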
Few of the techniques discussed above preserve information in an accessible and usable digital form. Providing printouts/fiche copies may preserve information, but these may not retain the complexity of a digital dataset. Current practice indicates that the majority of digital datasets are not being archived in a way that can ensure their functionality for future use.
As a final review of the security of archived digital datasets, the following three graphs compare practices between different areas of archaeology. Care needs to be taken with the results as some groups are represented by a few organisations only (the lowest being HE institutions, with only nine responding, contrasting with 40 for Consultancy). Totals are shown on the bars of the charts. Figure 6.13 summarises the number of storage methods used within each area of archaeology (use of floppy discs, tapes etc., covered in question 6.5 on the questionnaire, and discussed in 6.3). Although national bodies emerge as relying on the greatest range of back-up methods, there is little difference amongst the other areas of archaeology (around 50% in each area relying on only one method).
Figure 6.14 summarises responses to question 6.7 in the questionnaire (covered in 6.3). The majority of archives based in HEIs, Consultancies and Field Archaeology Units are not held in a secure place (e.g. fireproof safe). In addition, many archives held by museums and local government departments are not held in a protected place. This is problematic when some archives do not have strategies in place to ensure that the digital format is preserved in a usable fashion (though the information may be).
Figure 6.13 Storing digital datasets
Figure 6.14 Preserving physical media (use of fireproof safes, protection from humidity etc.)
Figure 6.15 summarises responses to question 6.8 of the questionnaire (discussed in 6.3). Only 50% of the archives held by local government bodies are copied to new media or migrated. Even fewer organisations ensure that digital datasets retain their functionality. Archives held in museums are in many cases left unmodified. Overall, the majority of digital datasets are not preserved in a way that ensures information can be accessed digitally, even though information may be preserved by making printouts or copying to microfiche.
Figure 6.15 Long-term security of data (file migration, use of printouts etc.)
Clearly there is already an important digital element to archaeological archives, but it is not secure. Moreover, the range of future strategies these organisations are adopting for their digital datasets, not least those who do not as yet have such a strategy, is a clear indication of the need for agreed standards and guidelines within archaeology as a whole. The following comments from our questionnaire illustrate the confusion and limited guidance that surrounds the issue of preserving digital data.
Everything is retained in-house in hard and electronic copies. However, there is no long-term strategy to safeguard data within our organisation - it falls to individuals to send data to international data banks.
In practice - we hold excavation archives but no excavators have yet handed over their digital data. The pace of change of our own excavation work has been so fast that we haven't stopped to think up formal policies about archiving digital data.
We, like I'm sure many museums, have not the expertise to use or understand this technology and I cynically wonder how much this expensive data will be used - how often are paper archives used once stored in a museum?
As a museum the prospect of looking after digital archives is quite frightening, as none of us really know how long these materials/info will survive.
6.4 Usability of digital datasets
The digital portion of a project is not necessarily usable in and by itself. In many instances it is necessary to consult several parts of a project archive - paper reports and syntheses, plans and photographs, and any digital component. It is also necessary to have detailed information about digital datasets, such as hardware and programs that should be used, details of any coding or terminology to enable a user not involved in the original project to make considered use of the information. These may not be held in the same place.
Figure 6.16 illustrates that a large proportion (44%) of digital archives are stored separately from parts in other media. Only 18% are stored with the rest of the archive. That 38% of organisations follow both practices may be indicative of changing strategies. There are advantages and disadvantages to storing digital datasets with other parts of a project archive. Conditions suitable for paper and photographic preservation are not necessarily adequate for digital information (e.g. boxed up alongside printed reports). At the same time, a user will probably need at least part of the paper or other archive to make use of any digital element. Maintaining detailed records listing where each part of a project archive is held, and keeping a copy with each part of that archive, is one means of overcoming this problem.
Figure 6.16 The location of copies of digital records
Figure 6.17 Additional information held alongside digital archives
To ensure that information contained in datasets is usable, a range of support documentation is required. It enables digital datasets to be maintained for re-use; users are assisted in their attempts to read the file, to read and use its contents and to evaluate the information. Archive holders were asked a set of questions on this topic, and responses are shown in Figure 6.17.
Only 50% of organisations can provide this information. Bearing in mind the common dislocation of digital datasets from their associated files in other media, there are many organisations (c. 35%) that do not maintain information on the whereabouts of a project's full dataset. Once digital datasets have been obtained, some IT knowledge will be required to make use of them. Although over 50% note the hardware and software used to create the dataset, slightly fewer list what is required to use them. Having opened the files successfully, there is still a strong chance that the information will not be usable due to limited documentation of terminology, coding, or data collection strategies. The reliability of the information contained in the digital dataset may not be assessed as so few organisations have information on the validation procedures used. Finally, an assessment of the archaeological quality or value of the data/project is again not often noted, although this requires familiarity with the project on the part of the archive/depositor. How else can the value of the information held in digital form be considered? There are some organisations that are doing exemplary work in this field, but there is also much undeveloped practice. To conclude:
- Support documentation to enable re-users to evaluate content may be available for only 50% of datasets.
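The kinds of support documentation discussed in this section can be summarised as a simple record kept with each digital dataset. The following is a minimal sketch; the class, field names and example values are hypothetical, chosen to mirror the items discussed above, and do not represent a published metadata standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SupportDocumentation:
    """Illustrative record of the documentation accompanying a digital dataset."""

    archive_locations: Optional[str] = None           # where each part of the project archive is held
    creation_hardware_software: Optional[str] = None  # what was used to create the dataset
    required_hardware_software: Optional[str] = None  # what is needed to use it
    terminology_and_coding: Optional[str] = None      # codes and terms used in the data
    data_collection_strategy: Optional[str] = None    # how the data were collected
    validation_procedures: Optional[str] = None       # how the data were checked
    quality_assessment: Optional[str] = None          # archaeological value of the data/project


def missing_items(doc: SupportDocumentation) -> List[str]:
    """List the documentation items that have not been recorded."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name)]


# Hypothetical example: only the creating software has been noted.
record = SupportDocumentation(creation_hardware_software="PC, relational database")
print(missing_items(record))
```

A check of this kind could be run when a dataset is deposited, so that gaps in documentation are identified while those familiar with the project are still available.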
6.5 An estimate of the amount of digital data held by archaeologists
This section attempts to estimate how much digital data is currently held by archaeologists in Britain and Ireland. It uses the results in this chapter and builds on the estimate provided in section 5.7 of the previous chapter. The results are, of course, very tentative. Strategies for Digital Data received information on digital data held by 166 archiving bodies, in Britain and Ireland. This group alone stated that they held at least 4,600 extensively digitised projects, and c. 60,600Mb of digital data (based on responses to question 6.3 - see 6.2). The total for all archiving bodies in Britain and Ireland must be significantly greater than this. The mailing list used identified 552 archaeology organisations in Britain and Ireland (see 5.3). The survey returns identified 166/263 (63.1%) organisations as holding digital archives in the short- and long-term. Of these, only 23 (13.9%) were classified as museums. It is possible to extrapolate from the survey returns to gain an estimate for the amount of digital data currently held:
An estimate of number of organisations holding digital archives in Britain and Ireland
= 63.1% of 552
= c. 350
A conservative estimate of 50%
= c. 275
The average size of archives held by respondents to Strategies for Digital Data (based on Figure 6.2; using the breakdown of organisations holding archives of a certain size; sizes are converted from a range to a mid-point):
= ((11*2.5)+(10*5)+(12*25)+(15*50)+(21*500)+(49*1000)) / 118 (no. of archives providing details)
= c. 500Mb
An upper estimate of size of digital archives in Britain and Ireland
= 350 * 500Mb
= c. 175,000Mb (175Gb)
A lower estimate
= 275 * 500Mb
= c. 140,000Mb (140Gb)
The average number of extensively digitised projects held in archives by respondents (based on responses for Figure 6.2; ranges used to categorise archives are converted to mid-points):
= ((51*2.5)+(29*13.5)+(29*60)+(23*100)) / 133 (no. of archives providing details)
= c. 35 projects
An upper estimate of number of extensively digitised projects archived in Britain and Ireland
= 350 * 35
= c. 12,200 projects
A lower estimate
= 275 * 35
= c. 9,600 projects
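The extrapolation above can be reproduced step by step. This is a minimal sketch that uses the counts, mid-points and divisors exactly as printed in the calculation; the variable and function names are illustrative. The report rounds each intermediate figure (350, 275, 500 and 35) before multiplying, so the sketch does the same for the final estimates.

```python
# Figures as printed in the report.
MAILING_LIST = 552          # archaeology organisations identified
HOLDING_SHARE = 166 / 263   # share of respondents holding digital archives

# (number of archives, mid-point value) for each band, as used in the report.
size_bands_mb = [(11, 2.5), (10, 5), (12, 25), (15, 50), (21, 500), (49, 1000)]
project_bands = [(51, 2.5), (29, 13.5), (29, 60), (23, 100)]


def banded_mean(bands, respondents):
    """Mean value per archive, from band counts and mid-points."""
    return sum(count * midpoint for count, midpoint in bands) / respondents


upper_orgs = HOLDING_SHARE * MAILING_LIST          # rounded to c. 350 in the report
lower_orgs = 0.5 * MAILING_LIST                    # rounded to c. 275 in the report
mean_size_mb = banded_mean(size_bands_mb, 118)     # rounded to c. 500Mb
mean_projects = banded_mean(project_bands, 133)    # rounded to c. 35 projects

print(f"organisations: {upper_orgs:.0f} (upper), {lower_orgs:.0f} (lower)")
print(f"mean archive size: {mean_size_mb:.0f}Mb")
print(f"mean projects per archive: {mean_projects:.1f}")

# Final estimates using the report's rounded intermediate figures.
for orgs in (350, 275):
    print(f"{orgs} organisations: {orgs * 500:,}Mb and {orgs * 35:,} projects")
```

The unrounded intermediate values (about 348 organisations, 514Mb and 34 projects) differ slightly from the rounded figures, which underlines how tentative the final estimates are.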
There is a considerable amount of digital data being created and archived, though traditional archives do not hold a lot of this. The analysis yielded the following results:
- Large datasets are held by a variety of organisations. In many cases such organisations are the units/contractors who undertook the work.
- Large digital datasets (over 100Mb) are not held exclusively by large organisations - some very small groups, of fewer than 5 people, hold considerable digital archives.
- Many archives rely only on copies held on floppy disc for short- and long-term preservation. Tapes are being used by most archives for storing digital data, presumably both for short- and long-term. Little use is being made of CD-ROMs.
- There may not be a clear difference between practices used for everyday security and long-term storage of digital data.
- Although those respondents holding larger digital archives tend to use a greater number of methods, even some of these datasets could be lost through unreliable backing-up (i.e. using floppy discs).
- Physical media (discs, tapes) are protected in a minority of cases.
- Most of the techniques used by archives fail to preserve information in an accessible and usable digital form. Providing printouts/fiche copies may preserve information, but these may not retain the complexity of a digital dataset. Current practice indicates that the majority of digital datasets are not being archived in a way that can ensure their functionality for future use.
- Support documentation to enable re-users to evaluate content may be available for only 50% of datasets.
- Most datasets are accessible to others, even those held by field units and consultants.
- By following 'good practice' for the preservation of other media, curators may be disregarding the needs of digital information.
- Information on digital datasets was obtained from 166 archiving bodies based in Britain and Ireland. This group alone hold at least 4,600 extensively digitised projects, and c. 60,600Mb of digital data.
- An estimate of the total amount of digital data held by archiving bodies in Britain and Ireland is 140-175Gb.
- An estimate of the number of extensively digitised projects held by archiving bodies in Britain and Ireland is 9,600-12,200.
A general mood of concern came across in the comments from respondents. Although there were exemplars of good practice, many curators felt they did not have the skills or experience even to appreciate the instability of the digital datasets in their care:
I filled this in as requested, being the Senior Archaeologist (manager) of the Service. I find I do not know enough, even of my own organisation's procedures. High time I remedied this.