This case study describes the background and behind-the-scenes work that has gone into archiving the Day of Archaeology Project. The final digital archive for Day of Archaeology is now live on the ADS.
The Day of Archaeology (DoA) project aimed to provide a window into the daily lives of archaeologists. The project asked people working, studying or volunteering in the archaeological world to participate in a “Day of Archaeology” each year by recording whatever they were actually doing on a specific day and sharing it through text, images or video on the Day of Archaeology blog. By choosing a single day, the project let readers experience a real cross-section of archaeological work, whether exotic or mundane, that reflected the reality of the profession.
The project was conceived and developed by a group of archaeologists with expertise in digital methods, communication and analysis: Lorna-Jane Richardson, Matt Law, J. Andrew Dufton, Kate Ellenberger, Stuart Eve, Tom Goskar, Jessica Ogden, Daniel Pett and Andrew Reinhard (see Richardson et al 2018). The first ever Day of Archaeology was held in 2011, and the project then saw steady growth, with participation from thousands of archaeologists, from those working in the field through to specialists working in laboratories and behind computers. The last Day of Archaeology in this form was on Friday 28th July 2017.
The project ‘website’ was originally a WordPress instance (a free and open-source content management system) paired with a MySQL database, and hosted by the Portable Antiquities Scheme (PAS). At its inception, Day of Archaeology was conceived as a community partnership, ideally with crowdsourced funds to cover maintenance of the WordPress site and domain name fees (see Richardson et al 2018). However, in 2015 NEARCH, an EU Culture-funded project that aimed to study the different dimensions of public participation in archaeology, agreed to provide support for DoA, both by widening participation across Europe and supporting contributions in a wider range of languages, and by hosting the site during a period of transition. As ADS was a NEARCH partner, the WordPress site and MySQL database were transferred to ADS. NEARCH provided the resource to keep the DoA website running for the remaining length of the project.
As NEARCH came to a close in 2017, discussions were undertaken between ADS and the DoA partners about the future of the project. The partners agreed that after seven successful years, the project had accomplished much of what it had set out to do, and should be drawn to a close. This opened the opportunity for NEARCH to fund the long-term preservation of the project, ensuring this important resource for understanding archaeological practice during a particular period of time would continue to be available for future researchers.
Keeping the WordPress instance running indefinitely was deemed an unsustainable option. Instead, we looked at the data itself (i.e. the text, images and videos uploaded by the original creators). The ADS Curatorial and Technical staff (CATs) considered a number of options, such as exporting directly from MySQL as XML or JSON. However, in this workflow the “look and feel” of a blog post was lost, as each post was stripped down to its component parts. In our view, the “look and feel” was a significant property that needed to be preserved.
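To make the problem with a raw export concrete, here is a minimal sketch of what such a workflow might look like. It is not the process ADS used: it assumes an in-memory SQLite database as a stand-in for MySQL, a cut-down table borrowing a few column names from the standard WordPress `wp_posts` table, and a single invented post.

```python
import json
import sqlite3


def build_demo_db():
    """Create a stand-in database holding one invented blog post."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_date TEXT, "
        "post_title TEXT, post_content TEXT, post_status TEXT)"
    )
    conn.execute(
        "INSERT INTO wp_posts VALUES (?, ?, ?, ?, ?)",
        (
            1,
            "2017-07-28 09:00:00",
            "A hypothetical post",
            '<p>Washing finds all morning.</p><img src="trench.jpg">',
            "publish",
        ),
    )
    return conn


def export_posts_as_json(conn):
    """Dump published posts as JSON, one record per database row."""
    conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
    rows = conn.execute(
        "SELECT ID, post_date, post_title, post_content FROM wp_posts "
        "WHERE post_status = 'publish' ORDER BY ID"
    ).fetchall()
    return json.dumps([dict(row) for row in rows], indent=2)


exported = export_posts_as_json(build_demo_db())
print(exported)
```

The output keeps the words and a reference to the image, but nothing about the theme, layout or the way the page was presented to a reader, which is the “look and feel” that a field-by-field export leaves behind.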
The next option was WARC, which ADS have been monitoring for some time as a preservation solution for websites, including our very own Internet Archaeology. When assessing WARC, the main obstacle we found was that creating individual WARC files from Day of Archaeology posts was time-consuming, and would not preserve embedded audio-visual content. There was also a concern that end users would face an extra step in having to view the WARC files through the externally hosted Wayback Machine. As we know from our Helpdesk, many of our users like a quick and simple option for accessing data, so would WARC achieve this when all users want to do is look at (static) text and pictures? Perhaps a simple PDF would do?
This put us into something of a quandary! We’ve always been reluctant to rely on PDF as a long-term preservation solution. Sometimes we’re forced to do this as it is all a creator can give us. As many know, attempts at any sort of normalisation or migration strategy for PDF are difficult at best, and fundamentally flawed at worst. However, the CATs came up with the idea of ensuring the significant properties of the blog posts were preserved in an AIP package containing text, images and audio-visual content in suitable individual long-term preservation formats. We also keep a complete copy of the original Day of Archaeology database in its raw form, should we ever want to return to a WARC option (and for the record, we are interested in WARC, we just need time to thoroughly investigate how to integrate it into our workflows). Given all that, what would be wrong in providing a combined PDF for end users? In our view, nothing at all. The CATs also thought that providing additional access to the images within each post would be beneficial, as they are useful digital resources in and of themselves.
Thus began a fairly mammoth undertaking led by Katie Green and Jenny O’Brien, assisted by Teagan Zoldoske, Alfie Talks and Hayden Strawbridge (on University of York MSc workplace secondments), to convert the Day of Archaeology posts into easily accessible, yet sustainable digital documents. The work involved creating a PDF of the original page but also ensuring the original data underneath was kept as a separate Object in our Object Management System (OMS), and then building the relationships between the post itself as an intellectual entity (something for PREMIS fans), and the AIP and SIP versions of the content.
Within the data being reassembled, the CATs also felt it was important to retain the original DoA tags (essentially a folksonomy) as metadata directly associated with the preserved objects, and to treat them as a significant property; i.e. here’s what the original author thought their blog should be tagged as. Of course this also had the benefit of giving us a structure upon which to build a user interface, but more of that later.
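The structure described above, a post as an intellectual entity linked to its submitted (SIP) and preserved (AIP) objects, with the author’s tags carried along as metadata, can be sketched roughly as follows. This is only an illustration and not the OMS data model: the class names, identifiers, file formats and the `derived_from` field are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StoredObject:
    """One file held for a post (hypothetical stand-in for an OMS Object)."""
    object_id: str
    package: str                        # "SIP" (as submitted) or "AIP" (preserved)
    file_format: str                    # illustrative format label only
    derived_from: Optional[str] = None  # id of the SIP object it was made from


@dataclass
class BlogPost:
    """The post as an intellectual entity, independent of any one file."""
    post_id: str
    title: str
    tags: List[str] = field(default_factory=list)  # the author's own tags
    objects: List[StoredObject] = field(default_factory=list)

    def in_package(self, package: str) -> List[StoredObject]:
        """Return the objects belonging to the given package type."""
        return [obj for obj in self.objects if obj.package == package]


# An invented post with two submitted files and their preservation copies.
post = BlogPost("post-001", "A day in the finds room", tags=["finds", "museum"])
post.objects += [
    StoredObject("obj-1", "SIP", "text/html"),
    StoredObject("obj-2", "SIP", "image/jpeg"),
    StoredObject("obj-3", "AIP", "text/plain", derived_from="obj-1"),
    StoredObject("obj-4", "AIP", "image/tiff", derived_from="obj-2"),
]

for obj in post.in_package("AIP"):
    print(obj.object_id, "derived from", obj.derived_from)
```

Keeping the tags on the post rather than on any single file means they travel with the intellectual entity, whichever version of the content a user ends up looking at.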
Another task was assessing each post for text or images which ran against our Sensitive Data Policy, particularly images of minors and personal data, as much of this content was created prior to the introduction of more stringent GDPR legislation. This ran to thousands of files that needed checking. At this point we needed to stop, as other projects required attention and staffing capacity could not be spared for all the extra checking that was legally required before we could proceed. With the appointment of Teagan as a Trainee Digital Archivist we soon had the capacity to get things rolling again, although the task at hand was still significant. Things moved forward slowly where time and priorities allowed, and then in the Spring of 2020 the UK went into a national lockdown in the wake of COVID-19. Although we all remained busy, we thought that having a collaborative team exercise to finish the job together would be good for morale, and thus DoA suddenly roared forward. Nearly every member of staff, including our Director, Administrator and the Editor of Internet Archaeology, lent a hand. We helped each other with case studies (“should I keep this?”), and Teagan and Jenny helped collate files and metadata into a coherent archive.
An interface framework created by Jenny, completely underpinned by use of the OMS and all that metadata, was then used by Teagan to load in the final files. A simple query interface was also put in place to allow basic searching on year, author, and of course DoA tag. As a final embellishment, each post, except where the post could not be displayed for sensitivity reasons, was also assigned its own DataCite DOI. According to international definitions, blogs are grey literature, so why shouldn’t we assign a persistent identifier to each, just as we do with fieldwork reports? Looking at many of the blog posts, they do genuinely represent a snapshot of working life, and capture processes, thoughts and ideas. For example, there are fieldwork snapshots, space archaeology, zooarchaeology and ruminations on the actual activities of an archaeologist. Perhaps not this one though*. They should be as findable and citable as the other forms of literature we hold.
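The kind of basic search described above, filtering on year, author and tag, can be sketched in a few lines. This is an illustration only and not the ADS interface code: the record layout, the sample posts and the `search_posts` function are all invented for the example.

```python
# Invented sample records; a real interface would draw these from the archive.
POSTS = [
    {"title": "Digging in the rain", "year": 2012,
     "author": "A. Example", "tags": ["fieldwork", "excavation"]},
    {"title": "Bones on the bench", "year": 2012,
     "author": "B. Sample", "tags": ["zooarchaeology", "lab"]},
    {"title": "Back in the trench", "year": 2015,
     "author": "A. Example", "tags": ["fieldwork"]},
]


def search_posts(posts, year=None, author=None, tag=None):
    """Return posts matching every filter that was supplied.

    A filter left as None is ignored; author and tag matching
    is case-insensitive.
    """
    results = []
    for post in posts:
        if year is not None and post["year"] != year:
            continue
        if author is not None and post["author"].lower() != author.lower():
            continue
        if tag is not None and tag.lower() not in [t.lower() for t in post["tags"]]:
            continue
        results.append(post)
    return results


for hit in search_posts(POSTS, year=2012, tag="Fieldwork"):
    print(hit["title"])
```

Because the filters combine with a logical AND, a user can start broad (everything from one year) and narrow down by adding an author or a tag.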
After all that effort, we feel that the work done on the DoA archive and interface has stayed true to the original project, and has not only facilitated the preservation of these posts but also built something which showcases them. Of course there’s more we want to do: integrating the blog posts into the ADS Library, and integrating the images into a single application which cross-searches all our images (some of the DoA images are stunning, and need to be shown off more). I hope with time we’ll be able to achieve this, and continue the DoA legacy.
Readers may already know, but the project has been reimagined as the Day in Archaeology, run by the Council for British Archaeology (CBA), so do please have a look at that fantastic resource if you’re interested.
* For the record, England won by 239 runs; Moeen Ali took a hat-trick spread over two overs to finish off the tail. Could do with this now to counteract the ‘competitive’ pitches they’re struggling with on the current tour.