Migrating Data: The Council for British Archaeology Research Reports

At the Archaeology Data Service we know that in order to keep files safe and accessible long into the future, we need to migrate or refresh them. Migration creates newer versions of the files to replace old ones that would one day be unreadable by modern software. To this end, we are currently working on one of the very first large collections entrusted to us back in the early days of the ADS: an archive of Council for British Archaeology (CBA) Research Reports. This run of reports, dating back to 1955, was no longer in print, so the reports were scanned and given to us in digital form (as TIFF images and PDF files) to archive and make more widely available on-line. The collection in our care consists of over 100 reports covering many different topics and themes within British archaeology. It has remained one of our most popular and frequently accessed resources since we began making it available on-line in the year 2000.

Twelve years is a very long time in the world of computing. The internet was very different in the year 2000, and the majority of our users relied on slow dial-up connections to access on-line resources. When we first archived the CBA Research Reports and made them available, we decided that most people would not be able to download the full reports in one go and would prefer to access them in small chunks of 3 or 4 pages per PDF file, with hyperlinks between each section. This was an effective strategy at the time, but times have changed: our users now have access to faster broadband speeds and would much prefer to download the whole report as a single file.

Another key issue with the original CBA Research Reports is that the files first made available on-line were in quite an early version of the PDF format (version 1.2). Though they are not yet obsolete, a number of them are starting to generate error messages when accessed. Many of the errors and warnings relate to fonts that are no longer in general use. Because fonts were not embedded in the original PDF files, the files rely on the PDF reader finding the fonts they require on the user’s computer. If the fonts do not exist there, some pages of the PDF files can become unreadable. It was agreed that all of these PDF 1.2 files would benefit from being converted to PDF/A (the more archivally stable flavour of the PDF specification). In a PDF/A file, all necessary fonts are embedded, meaning that the file should have a longevity far greater than the original files that were submitted.
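As a rough illustration of how files in an early version of the format might be picked out of a collection, the sketch below reads the version declared in the header comment that opens every PDF file (for example `%PDF-1.2`). It is a minimal sketch and not the tooling we used: the function names and the threshold are assumptions made for this example. Note also that the header only reports the declared version (later versions of the format allow it to be overridden inside the file) and says nothing about whether fonts are embedded, so it is a first filter rather than a full check.

```python
import re
from pathlib import Path

# Matches the header comment that opens a PDF file, e.g. b"%PDF-1.2".
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes):
    """Return (major, minor) from a PDF header, or None if no header is found."""
    # Some files carry a few stray bytes before the header, so search the
    # first kilobyte rather than insisting on offset zero.
    match = HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_migration(data: bytes, threshold=(1, 2)) -> bool:
    """True when the declared version is at or below the (assumed) threshold."""
    version = pdf_version(data)
    return version is not None and version <= threshold


def find_old_pdfs(folder: str, threshold=(1, 2)):
    """List the .pdf files under a folder whose declared version is old."""
    old = []
    for path in sorted(Path(folder).rglob("*.pdf")):
        with path.open("rb") as handle:
            if needs_migration(handle.read(1024), threshold):
                old.append(path)
    return old
```

Calling `find_old_pdfs` on a folder of reports would give a list of candidates to inspect more closely with a proper PDF validation tool.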

The task that we have been working on over the last few months is to recombine the CBA Research Reports into single PDF files, remove the hyperlinks which originally allowed you to navigate between sections, and migrate them to PDF/A. We have also taken the opportunity to refresh the interface through which our users access the reports on-line. We hope that this work not only makes the reports available in a more useful format for our users, but also ensures that they can continue to be accessed in this format for many years to come.
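Before sections can be recombined, they need to be grouped by report and put back into page order. The sketch below shows one way that planning step could look. The file naming scheme (`<report>_p<first page>-<last page>.pdf`) and the function name are hypothetical, invented for this example and not a description of how our files are actually named. The merging itself, the removal of hyperlinks and the conversion to PDF/A all need a dedicated PDF tool and are not shown.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme: <report id>_p<first page>-<last page>.pdf
CHUNK = re.compile(r"(?P<report>[A-Za-z0-9]+)_p(?P<first>\d+)-(?P<last>\d+)\.pdf$")


def plan_merges(filenames):
    """Group section files by report and order each group by first page."""
    groups = defaultdict(list)
    for name in filenames:
        match = CHUNK.search(name)
        if match is None:
            continue  # not a section file under the assumed scheme
        groups[match["report"]].append(
            (int(match["first"]), int(match["last"]), name)
        )

    plan = {}
    for report, chunks in groups.items():
        chunks.sort()  # tuples sort by first page, then last page
        # Record any place where one section does not follow directly on
        # from the previous one, so a missing or overlapping section is
        # noticed before the files are merged.
        gaps = [
            (prev[1], nxt[0])
            for prev, nxt in zip(chunks, chunks[1:])
            if nxt[0] != prev[1] + 1
        ]
        plan[report] = {"files": [c[2] for c in chunks], "gaps": gaps}
    return plan
```

Each entry in the returned plan lists the files for one report in the order they should be joined, together with any page ranges that do not line up.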

For those wishing to digitise journal articles or grey literature reports, we provide some useful pointers in our Advice for Depositors, whilst the Guides to Good Practice offer help on suitable archival formats for documents and text files. A future blog post will discuss the ADS’ experience of working with PDF files.