Category Archives: Digital Preservation

in the dark near the Tannhäuser Gate

Blade Runner 1982, by Bill Lile. Image shared under a CC BY-NC-ND 2.0 licence.

As it’s World Digital Preservation Day I thought I’d finish the following blog about our work managing the digital objects within our collection. Like most of my blogs (including the much-awaited sequel to Space is the Place) it has languished for a while awaiting a final burst of input. To celebrate WDPD 2018, here we go….


I half-heartedly apologise for the self-indulgent title of this blog, which most readers will know is taken from Rutger Hauer’s speech in the film Blade Runner (apparently he improvised it). Aside from being an unashamed lover of the original film, like Roy Batty in the famous rooftop finale I’ve recently been prone to reflection on the events I’ve witnessed [at the ADS] over the last few years. In all honesty these aren’t quite on a par with “Attack ships on fire off the shoulder of Orion”, but they are perhaps as impressive in their own little way.

This reflection isn’t prompted by any impending doom – that I’m aware of – but rather by the fact that some of my recent work has involved looking at what the ADS has done over the last two decades: for example, tracing the history of OASIS as we move further into the redevelopment project, and revisiting the quantities and types of data we store as we find we’re rapidly filling up file servers. Along with this is a sudden realisation that after so long here I have become part of the furniture (I’ll let the reader decide which part!). However, as colleagues inevitably leave – and although we take the utmost care to document decisions (meeting minutes, copies of old procedural documents etc.) – the institutional memory sometimes becomes somewhat blurred, even taking on mythical status: “We’ve always done it like that”, “That never worked because…”, “So-and-so was working on that when they left”, “A million pounds” and so on.

Short of uploading Julian’s consciousness to an AI – which even with our best efforts we’re still some way off perfecting – there’s a danger of much of this internal history becoming lost (like tears in the rain). Over the past few years I’ve quite enjoyed talking – mainly to peers within the wider Digital Preservation community – about issues, problems and successes at the ADS. Just recently I gave a talk at CAA (UK) in Edinburgh about the twenty-year journey of the ADS, from one of the five AHDS centres to a self-sustaining accredited digital archive. The talk itself didn’t have a particularly large audience, perhaps a result of the previous night’s party (the conference as a whole was welcoming and well-organised) or the glittering papers in the parallel session; on this occasion I also think I struggled to get 20 years of history into exactly 20 minutes!

The main thing I really wanted to communicate was quite how far the ADS have come technically and conceptually: from our beginnings in 1996, to where we are now, and more importantly to where we want to be. As a previous blog has covered in massive detail (WITH GRAPHS!) our holdings have grown considerably over the years, with associated problems in finding room to store things. Another issue, as we surge past 2.5 million files, is increasing the capacity for our users (and us!) to find things. As I showed an enraptured audience at CAA, we’ve come a long way from 2006 (when I joined), when we were running 2 or 3 physical servers, to the present day, where we have a dispersed system of nearly 40 virtual machines running a range of software, which in turn supports a large array of tools, applications and services that underpin our website(s) and the flows of data we provide to third parties.

The legendary minerva 2, bought in 2006: with its mighty 8GB of RAM it was for many years the backbone of the ADS systems. Now retired from the University server racks, it sits quietly in our office. Contrary to the sticker, it does not contain the consciousness of Michael Charno.
A simplified representation of our current systems: 38 virtual machines supporting a range of server software, which in turn supports the ADS applications and services.

I always think this is a part of what the ADS do that goes unseen by many outsiders: along with the procedures we have for actually being an archive, there’s a whole lot of work going on underneath what we make visible to our users. In the talk at CAA I used the common analogy of a swan: what you see is the website; what you don’t see are the feet paddling away underneath. This doesn’t detract from the website of course – a commitment to providing access to data has always been a fundamental part of what and who we are. It’s as frustrating to us as to a user when someone can’t find what they’re looking for, especially when they know it exists. Which is why it is interesting (and I really think it is) to look at how we manage our data, and to make the ‘ADS Swan’ as efficient as possible.

For example, back in the old days (2006) interfaces to data were effectively hard-coded into web pages using the ColdFusion platform (CFML) as an interface between the XHTML and the underlying file server and database. This was OK in its way, although it still required someone to either code in links to files or generate file listings in the page (or via separate scripts or commands). A common source of broken links from this era is simply human error in generating these lists and replicating them in the web page.

The old way of doing things…

Of course, even at the time my colleagues were aware that this was not the most efficient way we could work; even the functions of ColdFusion (and its successors Open BlueDragon and Lucee) that generated listings directly in the code were still reliant on someone setting which directory was needed, and how to handle the results, directly in the page. Not great for when we had to update things… There was also an issue with the information displayed in the page: effectively you came to an archive, scrolled through, and were presented with descriptions that were often little more than the file name. And there was the massive issue of a disconnect between the files and the interface: actual file-level metadata was only stored in the files (e.g. CSV) on the file store. Our Collections Management System (CMS) stored lots of information about a collection, and we knew it had files in it, but not the details. Any fixing, updating, migrating or querying had to be done by hand, which was fine when we only had a small number of collections but presented problems when scaling up. Effectively, we had to get our files (or objects/datastreams) into some sort of Digital Asset Management System. Cue project SWORD-ARM.

This project is probably deserving of its own essay; suffice to say we investigated using Fedora (Commons, as it later became) as a DAMS for storing all the lovely rich technical and thematic metadata we collect, and perhaps most importantly had already collected (we already had several hundred collections comprising nearly a million files at this point). In short, an implementation of Fedora to suit our needs was deemed too complicated, with too high a level of subsequent software development and maintenance for us to sustain. If – at that point, and again for our understanding and needs – even deleting a record required issuing a ticket for our systems team (the magnificent Michael and the prodigious Paul at that point), then we were onto a loser. For our needs, perhaps all we needed was a database and a programming language…

The heroes of this story were undoubtedly Paul Young, Jenny Mitcham, Jo Gilham and Ray Moore, who between them created an extension to our existing CMS: the Object Management System (OMS). The OMS is really too big to explore in detail here, but its design was based on three overarching principles:

  1. To manage our digital objects in line with the PREMIS data model
  2. To store accurate and consistent technical file-metadata
  3. To store thematic metadata about the content (what does the file show/do?)

The ambition was, and still is, to have a situation where a user provides much of this information ready-formed, courtesy of an application such as ADS-easy or OASIS. But most important, I believe, was the move towards an implementation of the semantic units as defined in PREMIS (and I’ll stop there, so this blog doesn’t derail into masses of detail). To explain, consider the shapefiles below.

What’s an object?

In our traditional way of doing things we just had a bunch of files on a server. Here, we have the files in the database, but also a way of classifying and grouping them to explain what they are. So, for example, a Shapefile has commonly used the dBase IV format (.dbf) for storing attributes, but we also get .dbf files as stand-alone databases. We need to know that this .dbf is part of a larger entity, and should only be “handled” as part of that entity. In this case a Shapefile is normalized to GML (3.2) for preservation, and zipped up for easy dissemination. All of these things are part of the same representation object: we need to keep them together however dispersed they are across servers, associate them with the correct metadata, and plan their future migration accordingly.
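To make that a little more concrete, here’s a minimal sketch in TypeScript of the grouping idea – purely my own illustration, with hypothetical names and structure, not the actual OMS schema:

```typescript
// A sketch only: group dispersed files into PREMIS-style representations
// of a single object, so a lone .dbf is never mistaken for a database.

interface FileObject {
  path: string;        // location on the file server
  format: string;      // file type against a recognised classification
  checksum?: string;   // fixity value recorded at accession
}

interface Representation {
  kind: 'original' | 'preservation' | 'dissemination';
  files: FileObject[]; // every component travels together
}

interface ArchivalObject {
  objectId: number;    // hypothetical identifier
  title: string;
  representations: Representation[];
}

// The shapefile from the text: the multi-part original, its GML 3.2
// preservation copy and the zipped dissemination copy are all
// representations of one object, however dispersed across servers.
const boundary: ArchivalObject = {
  objectId: 42, // made-up id
  title: 'Site boundary (Shapefile)',
  representations: [
    {
      kind: 'original',
      files: [
        { path: '/orig/boundary.shp', format: 'ESRI Shapefile geometry (.shp)' },
        { path: '/orig/boundary.shx', format: 'ESRI Shapefile index (.shx)' },
        { path: '/orig/boundary.dbf', format: 'dBase IV attribute table (.dbf)' },
      ],
    },
    { kind: 'preservation', files: [{ path: '/pres/boundary.gml', format: 'GML 3.2' }] },
    { kind: 'dissemination', files: [{ path: '/diss/boundary.zip', format: 'Zip archive' }] },
  ],
};
```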

And of course this is where we can store all our lovely technical and thematic metadata. For example, I know for any object:

  • When it was created
  • What software created it
  • Who created it
  • Who holds copyright
  • Geographic location
  • Its subject (according to archaeological understanding)
  • The file type – according to international standards of classification
  • Its checksum
  • Its content type
  • If it’s part of a larger intellectual entity

And we’re close to also fully recording an object’s life-cycle within our system:

  • When it was accessioned
  • When it was normalized – and the details of this action
  • When it was migrated
  • If it was edited
  • etc etc
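Again purely as an illustration (hypothetical names, not our actual tables), recording the life-cycle steps in the list above as PREMIS-style events against the object from the earlier sketch might look like this:

```typescript
// Continuing the hypothetical sketch above: PREMIS-style events give
// each object an auditable life history alongside its metadata.

type EventType = 'accession' | 'normalisation' | 'migration' | 'edit';

interface PreservationEvent {
  objectId: number;
  type: EventType;
  date: Date;
  agent: string;  // person or software responsible
  detail: string; // e.g. the tool and settings used
}

const events: PreservationEvent[] = [
  {
    objectId: 42, type: 'accession', date: new Date('2018-03-02'),
    agent: 'digital archivist', detail: 'Deposited and virus-checked',
  },
  {
    objectId: 42, type: 'normalisation', date: new Date('2018-03-09'),
    agent: 'ogr2ogr (GDAL)', detail: 'Shapefile normalised to GML 3.2 for preservation',
  },
];

// "When was object 42 last normalised, and by what?" becomes a simple
// query rather than an exercise in institutional memory.
const lastNormalisation = events
  .filter(e => e.objectId === 42 && e.type === 'normalisation')
  .sort((a, b) => b.date.getTime() - a.date.getTime())[0];
console.log(lastNormalisation?.detail);
```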

I’ve deliberately over-simplified a very complicated process there, as I’m running out of words. But suffice to say that the hard work many people (including current colleagues Jenny and Kieron) have put in on developing this system is nearing a stage where the benefits of all this are tantalizingly close.

Now, readers from a Digital Preservation background will understand how essential that is for how we need to work. The lay reader may well be wondering about the benefit to them. Put simply, this offers the chance to explore our objects as having an independence from their parent collections. For example, when working on the British Institute in Eastern Africa Image Archive (https://doi.org/10.5284/1038987) Ray built this specialised interface for cross-searching all the images. In this case all the searching is done on the metadata for the object representation, so for example:

http://archaeologydataservice.ac.uk/archives/view/object.cfm?object_id=1187702

It’s not too much of a jump to see future versions of the ADS website incorporating cross-collection searching, allowing people quick, intuitive access to the wealth of data we store and, perhaps, a way to cite the object… Something to aim for in a sequel at least.

Anyway, as always, if you’ve made it this far: thanks for reading.

Tim

ADS Goes Live on International Digital Preservation Day

On 30th November 2017 the first ever International Digital Preservation Day will draw together individuals and institutions from across the world to celebrate the collections preserved, the access maintained and the understanding fostered by preserving digital materials.

The aim of the day is to create greater awareness of digital preservation that will translate into a wider understanding which permeates all aspects of society – business, policy making and personal good practice.

To celebrate International Digital Preservation Day ADS staff members will be tweeting about what they are doing, as they do it, for one hour each before passing on to the next staff member. Each staff member will be focusing on a different aspect of our digital preservation work to give as wide an insight into our work as possible. So tune in live with the hashtags #ADSLive and #idpd17 on Twitter or follow our Facebook page for hourly updates. Here is a sneak preview of what to expect and when:

Continue reading ADS Goes Live on International Digital Preservation Day

The (DOC)X-files

The following blog is simply a musing on our historic approaches to archiving formatted text files, prompted by a user enquiry into “best formats” for the preservation of their reports, and by my role at the ADS of keeping abreast of said formats and our internal policies.

Many years ago, in a meeting of the curatorial and technical team (CATTS), conversation veered towards our procedures for handling text documents: that is, files whose significant properties are formatted text and typesetting – reports – as opposed to the plain text files (with ASCII or UTF-8 encoding) often used for exporting or importing data. One colleague, half in jest, commented that as the Archaeology Data Service our focus should be on the literal data as understood in computer science: the individual pieces of information being generated from various instruments or collected in databases. Reports, it may be argued, are an interpretation of that data, but often not the raw data itself.

Continue reading The (DOC)X-files

DAI IANUS visits the ADS!

The ADS was recently pleased to host three Data Curators from a project called IANUS, as part of the ARIADNE project. The ADS spent two weeks immersing Martina, Anne and Philipp in the day-to-day duties of a fully established repository. Here is what they had to say about their visit.

DAI IANUS visits the ADS!
By Martina Trognitz, Anne Sieverling & Philipp Gerth

From the 23rd of November until the 4th of December, York had three more German inhabitants: us (Anne, Martina and Philipp)! We came all the way from Berlin to learn from the ADS.


In Berlin we work at the German Archaeological Institute (DAI) on a project called IANUS. It is funded by the German Research Foundation (DFG), and a first three-year phase is now being followed by a second, which started in March 2015. The aim of the project is to build up a digital archive for archaeology and related sciences in Germany.
Continue reading DAI IANUS visits the ADS!

ADS a Recommended Repository for Nature Publishing Group

ADS are very pleased to announce that we are now an officially recommended repository for Nature Publishing Group’s open access data journal Scientific Data. ADS joins approximately 80 other data repositories, representing research data from across the entire scientific spectrum. ADS has been approved by Scientific Data as providing stable archiving and long-term preservation of archaeology data.


Scientific Data offers a new article type, the ‘Data Descriptor’, which has been specifically designed to publish peer-reviewed research data in an accessible way, so as to facilitate its interpretation and reuse. Publishing Data Descriptors enables data producers and curators to gain appropriate credit for their work, whilst also promoting reproducible research. The main goals of this journal are tightly aligned with those of the ADS, focusing on making data publicly accessible and encouraging re-use.


By becoming a recommended repository for Scientific Data, we are not only a recommended repository for archaeological data accompanying articles published by the Nature Publishing Group; researchers also now have the opportunity to deposit archaeological data with the ADS whilst submitting a Data Descriptor to Scientific Data.

All depositors depositing with the ADS and intending to publish in Scientific Data or another Nature Publishing Group journal must choose to disseminate the data they are depositing with us under a CC BY licence. For more information contact the ADS at help@archaeologydataservice.ac.uk.


Archiving Ipswich

Re-posted from Day of Archaeology

Two years after posting about my work on the Silbury Hill digital archive in ‘An ADS Day of Archaeology’, I’m still busy working as a Digital Archivist with the ADS!

For the past few months, I have been working on the Ipswich Backlog Excavation Archive, deposited by Suffolk County Council, which covers 34 sites, excavated between 1974 and 1990.


To give a quick summary of the work so far: the data first needed to be accessioned into our systems, which involved all of the usual checks for viruses, removing spaces from file names, sorting the data into 34 separate collections, sifting out duplicates etc. The archive packages were then created, which involved migrating the files to their preservation and dissemination formats and creating file-level metadata using DROID. The different representations of the files were linked together using object IDs in our database, and all of the archiving processes were documented before the coverage and location metadata were added to the individual site collections.

Though time-consuming, due to the quantity of data, this process was fairly simple as most of the file names had been created consistently and contained the site code. Those that didn’t have descriptive file names could be found in the site database and sorted according to the information there.
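For illustration, a minimal sketch of that kind of tidying might look like the following – hypothetical code, not the actual ADS accession scripts, and the site codes are made up:

```typescript
// Strip spaces from file names, then bucket files into site
// collections according to the site code found in the name.
import { readdirSync, renameSync } from 'fs';
import { join } from 'path';

const SITE_CODES = ['IAS1801', 'IAS4601']; // hypothetical site codes

function accession(dir: string): Map<string, string[]> {
  const collections = new Map<string, string[]>();
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    if (!entry.isFile()) continue;                 // sketch: ignore subdirectories
    const clean = entry.name.replace(/\s+/g, '_'); // spaces break URLs and older tools
    if (clean !== entry.name) renameSync(join(dir, entry.name), join(dir, clean));
    // Anything without a recognisable site code is resolved by hand
    // against the site database.
    const site = SITE_CODES.find(code => clean.includes(code)) ?? 'UNMATCHED';
    collections.set(site, [...(collections.get(site) ?? []), clean]);
  }
  return collections;
}
```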

The next job was to create the interfaces; again, this was fairly simple for the individual sites as they were made using a template which retrieves the relevant information from our database allowing the pages to be consistent and easily updateable.

The overarching Ipswich Backlog Excavation Archive called for a more innovative approach, however, in order to allow users greater flexibility in searching: the depositors requested a map interface as well as a way to query information from their core database. The map interface was the most complex part of the process and involved a steep learning curve for me, as it used applications, software and code that I had not previously worked with, such as JavaScript, OpenLayers, GeoServer and QGIS. The resulting map allows the user to view the features excavated on the 34 sites and retrieve information such as feature type and period, as well as linking through to the project archive for each site.

OpenLayers map of Ipswich excavation sites.
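For a flavour of how such a map hangs together, here is a minimal sketch using the modern OpenLayers API (the original interface used an earlier version, and the GeoServer layer and attribute names here are hypothetical):

```typescript
// Draw excavated features served by GeoServer over a base map, and
// report a clicked feature's type, period and archive link.
import Map from 'ol/Map';
import View from 'ol/View';
import TileLayer from 'ol/layer/Tile';
import VectorLayer from 'ol/layer/Vector';
import OSM from 'ol/source/OSM';
import VectorSource from 'ol/source/Vector';
import GeoJSON from 'ol/format/GeoJSON';
import { fromLonLat } from 'ol/proj';

// Features requested from GeoServer as WFS GeoJSON (names are made up).
const features = new VectorSource({
  url: '/geoserver/ipswich/wfs?service=WFS&version=2.0.0'
     + '&request=GetFeature&typeNames=ipswich:features&outputFormat=application/json',
  format: new GeoJSON(),
});

const map = new Map({
  target: 'map',
  layers: [new TileLayer({ source: new OSM() }), new VectorLayer({ source: features })],
  view: new View({ center: fromLonLat([1.155, 52.057]), zoom: 14 }), // Ipswich
});

// Clicking a feature surfaces its attributes and a link to the site archive.
map.on('singleclick', evt => {
  map.forEachFeatureAtPixel(evt.pixel, f => {
    console.log(f.get('feature_type'), f.get('period'), f.get('archive_url'));
  });
});
```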

So, as to what I’m up to today…

The next, and final, step is to create the page that queries the database. For the past couple of weeks I have been sorting the data from the core database into a form that will fit into the ADS object tables, cleaning and consolidating period, monument and subject terms and, where possible, matching them to recognised thesauri such as the English Heritage Monument Type Thesaurus.
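As an illustration of that consolidation step (hypothetical code, with a tiny stand-in vocabulary rather than the real thesaurus):

```typescript
// Normalise free-text monument terms and match them against a
// controlled vocabulary; unmatched terms go back for manual review.
const thesaurus = new Map<string, string>([
  ['pit', 'PIT'],
  ['post hole', 'POST HOLE'],
  ['ditch', 'DITCH'],
]);

function matchTerm(raw: string): string | undefined {
  const norm = raw.trim().toLowerCase().replace(/[-_]+/g, ' ');
  return thesaurus.get(norm);
}

matchTerm('Post-hole'); // => 'POST HOLE'
matchTerm('posthole');  // => undefined: add a synonym or review by hand
```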

Today will be a continuation of that process and hopefully, by the end of the day, all of the information required by the query pages will be added to our database tables so that I can begin to build that part of the interface next week.  If all goes to plan, the user should be able to view specific files based on searches by period, monument/feature type, find type, context, site location etc. with more specialist information, such as pottery identification, being available directly from the core database tables which will be available for download in their entirety.  Fingers crossed that it does all go to plan!

So, that’s my Day of Archaeology 2015, keep a look out for ADS announcements regarding the release of the Ipswich Backlog Excavation Archive sometime over the next few weeks and check out the posts from my ADS colleagues Jo Gilham and Georgie Field!

UPDATE: Ipswich Excavation Archive has now been released! All sites can be explored here!

New Guidelines for ADS Depositors

The ADS, supported by funding from the Archives and Records Association, has recently revamped our Guidelines for Depositors.

The revamp reviewed the current ADS guidelines on digital archive deposition and developed updated guidance policies for depositors in light of the recent revisions to the Guides to Good Practice and the development of ADS-easy.

The revision to the ADS Guidelines for Depositors has produced a new user-friendly interface, designed after detailed consultation with users on the most intuitive and instructive way to present the guidelines.

Continue reading New Guidelines for ADS Depositors

Keeping our Data Consistent

The consistency and integrity of data are essential for any digital archive. Therefore, for the past few months we have been running a series of programs to test the consistency of our file system and database, and to try to identify any other problems. This work started when we decided to develop a program to test all the checksums in our file system. The idea was to run the program every few months in order to identify any checksums which had changed since the last run.

Part of a checksum report.

In addition, the program would test the checksums in the file system against the checksums in the database so that we could be sure that they were synchronised. The program took a few weeks to develop and has now been run several times. Each run produces a report which shows any checksum changes in the file system and the database. Happily, only a few checksums have been flagged up in the reports so far, and usually there have been good reasons why they changed.
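For the curious, a minimal sketch of the general technique – my own illustration, not the ADS program itself, which assumes MD5 checksums keyed by path – hashes every file under a root and compares the results against values recorded in the database:

```typescript
// Stream-hash every file under a root directory and report anything
// that has drifted from the checksum recorded in the database.
import { createHash } from 'crypto';
import { createReadStream, readdirSync } from 'fs';
import { join } from 'path';

function md5(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash('md5');
    createReadStream(path)
      .on('data', chunk => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

function* walk(dir: string): Generator<string> {
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const path = join(dir, entry.name);
    if (entry.isDirectory()) yield* walk(path);
    else yield path;
  }
}

// `recorded` stands in for a query against the checksum table.
async function report(root: string, recorded: Map<string, string>): Promise<void> {
  for (const path of walk(root)) {
    const actual = await md5(path);
    const expected = recorded.get(path);
    if (!expected) console.log(`NOT IN DATABASE: ${path}`);
    else if (expected !== actual) console.log(`MISMATCH: ${path}`);
  }
}
```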

Continue reading Keeping our Data Consistent

Opening up the Grey Literature Library

The Grey Literature Library is one of the ADS’s most popular resources and, as shown by projects such as the Roman Rural Landscape project, one of massive research value. The library is constantly growing, with most reports coming from the OASIS system; in 2013 alone, 3891 reports were submitted. Feedback from all levels of the archaeological community makes it clear that the hosting of openly accessible digital grey literature is a boon. However, one of the questions we are most commonly asked is “why does it take so long for a report uploaded to OASIS to make its way into the library?”. This is perfectly understandable; people who have completed an OASIS record to share the results of their fieldwork want to make sure this effort is not in vain. Rest assured it isn’t – here’s a small insight into what’s going on underneath the workings of the library.
Continue reading Opening up the Grey Literature Library

The Internet Archaeology of the ADS

While rationalising old and orphaned files on the ADS servers, I stumbled upon an old index.html file from a previous version of the website. Similar to discovering a long-forgotten photograph in the attic, this led me down the meandering path of memory lane. However, unlike a photograph, reconstructing the look and feel of a web page requires some fiddling to correctly associate the style sheets and any server-side includes. After a few cut-and-paste commands replacing server-side includes with actual HTML, and a directory search for the missing stylesheet, the old homepage was back up again in all of its glory.

ADS homepage c. 2008.

Even though I spent my first four years at the ADS using this homepage, it looked totally foreign to me. The structure was confused, the JavaScript unnecessary and the style uninspired. The page was functional, but left a lot to be desired compared to the structure and clarity of the present version. The backend framework and systems that make up the current website also make it manageable and easy to update, compared to the organic, disjointed structure of the previous website, which led to headache-inducing updates as seemingly insignificant modifications caused unanticipated bugs (or features, depending on your preferred coping mechanism).
Continue reading The Internet Archaeology of the ADS