Category Archives: Open Access

Space is the Place (part I)

“server-racks-clouds_blue_circuit” by Kin Lane. CC BY-SA 2.0

This is the first part of a (much delayed) series of blogs investigating the storage requirements of the ADS. This began way back in late 2016/early 2017 as we began to think about refreshing our off-site storage, and I asked myself the very simple question of “how much space do we need?”. As I write, it’s evolving into a much wider study of historic trends in data deposition, and the effects of our current procedures and strategy on the size of our digital holdings. Aware that blogs are supposed to be accessible, I thought I’d break it into smaller and more digestible chunks of commentary, and a lot of time spent at Düsseldorf airport recently for ArchAIDE has meant I’ve been able to finish this piece.

——————-

Here at the ADS we take the long-term integrity and resilience of our data very seriously. Although what most people in archaeology know us for is the website and access to data, it’s the long-term preservation of that data that underpins everything we do. The ADS endeavour to work within a framework conforming to the ISO (14721:2003) specification of a reference model for an Open Archival Information System (OAIS). As you can see in the much-used schematic reproduced below, under the terminologies and concepts used in the OAIS model ‘Archival Storage’ is right at the heart of the operation.

How we achieve this is actually a pretty complicated process, documented in our Preservation Policy; suffice to say it’s far more than simply copying files to a server! However, we shouldn’t discount storage space entirely. Even in the ‘Zettabyte Era’, where cloud-based storage is commonplace and people are used to streaming or downloading files of a size that, 10 years ago, would have been viewed as prohibitive, we still need some sort of space on which to keep our archive.

At the moment we maintain multiple copies of data in order to facilitate disaster recovery, a common and necessary strategy for any organisation that wants to be seen as a Digital Archive rather than simply a place to keep files. Initially, all data is held on the main ADS production server, which is maintained by the ITS at the University of York and backed up via daily snapshot. These snapshots are stored for a month, and are furthermore backed up onto tape for 3 months.

In addition to this, all our preservation data is synchronised once a week from the local copy in the University of York to a dedicated off-site store, currently maintained in the machine room of the UK Data Archive at the University of Essex. This repository takes the form of a standalone server behind the University of Essex firewall. In the interests of security, outside access to this server is via an encrypted SSH tunnel from nominated IP addresses. Data is further backed up to tape by the UKDA. Quite simply, if something disastrous happened here in York, our data would still be recoverable.

This system has served us well; however, recently a very large archive (laser scanning) was deposited with us. In its original form alone it was just under a quarter of the size of all our other archives combined, and it almost filled up the available server space at York and Essex. In the short term, getting access to more space is not a problem as we’re lucky to be working with very helpful colleagues within both organisations. Longer-term, however, I think it’s unrealistic to simply keep on asking for more space at ad-hoc intervals, and this feeds into a wider debate over the merits of cloud-based solutions (such as Amazon) versus procuring traditional physical storage space (i.e. servers) with a third party. However, I’ll save that dilemma for another blog!

However, regardless of which strategy we use in the future, for business reasons (i.e. any storage with a third party will cost money) it would be good to be able to begin to predict or understand:

  • how much data we may receive in the future;
  • how size varies according to the contents of the deposit;
  • the impact of our collections policy (i.e. how we store the data);
  • the effect of our normalisation and migration strategy.

Thus was the genesis of this blog….

We haven’t always had the capacity to ask these questions. Traditionally we never held information about the files themselves in any kind of database, and any kind of overview was produced via home-brew scripts or command-line tools. In 2008 an abortive attempt to launch an “ADS Big Table”, which held basic details on file type, location and size, was scuppered by the difficulties in importing data by hand (my entry of “Comma Seperated Values” [sic] was a culprit). However, we took a great leap forward with the 3rd iteration of our Collections Management System, which incorporated a schema to record technical file-level metadata for every file we hold, and an application to generate and import this information automatically. As an aside, reaching this point required a great deal of work (thanks Paul!).
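To give a flavour of what file-level records make possible, here is a minimal sketch of a per-year summary. The table and column names (`file_record`, `accession_id`, `accession_year`, `file_format`, `size_bytes`) and the sample rows are invented purely for illustration; this is not the actual ADS Collections Management System schema or data.

```python
import sqlite3

# Hypothetical schema: one row per file, tagged with the accession it arrived in.
# Names and values are illustrative only, not the real ADS schema or holdings.
SCHEMA = """
CREATE TABLE file_record (
    file_id        INTEGER PRIMARY KEY,
    accession_id   INTEGER NOT NULL,
    accession_year INTEGER NOT NULL,
    file_format    TEXT    NOT NULL,
    size_bytes     INTEGER NOT NULL
)
"""

SAMPLE_ROWS = [
    (1, 101, 2016, "DXF", 2_000_000),
    (2, 101, 2016, "TIFF", 50_000_000),
    (3, 102, 2017, "PDF", 1_500_000),
    (4, 103, 2017, "TIFF", 75_000_000),
]


def build_demo_db():
    """Create an in-memory database holding the made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO file_record VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def accessions_per_year(conn):
    """Return (year, number of accessions, total bytes) for each year."""
    query = """
        SELECT accession_year,
               COUNT(DISTINCT accession_id) AS accessions,
               SUM(size_bytes)              AS total_bytes
        FROM file_record
        GROUP BY accession_year
        ORDER BY accession_year
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for year, count, total in accessions_per_year(build_demo_db()):
        print(year, count, total)
```

Counting distinct accession identifiers rather than rows matters here, because a single accession usually contains many files.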

As well as aiding management of files (e.g. “where are all our DXF files?”), this means we can run some pretty gnarly queries against the database. For starters, I wanted to see how many deposits of data (Accessions) we received every year, and how big these were:

Number of Accessions (right axis) and combined size in Gb (left axis), ADS 1998-2017

As the graph above shows, over the years we’ve seen an ever increasing number of Accessions, that is the single act of giving us a batch of files for archiving (note: many collections contain more than one accession). Despite a noticeable dip in 2016, the trend has clearly been for people to give us more stuff, and for the combined size of this to increase. A notable statistic is that we’ve accessioned over 15 Tb in the last 5 years. In total last year (2017), we received just over 3 Terabytes of data, courtesy of over 1400 individual events; compared with 2007 (a year after I started work here) when we received c. 700Gb in 176 events. That’s an increase of 364% in size and 713% in number of events over 10 years, and it’s interesting to note the disparity between those two values, which I’ll talk about later. However, at this point the clear message is that we’re working harder than ever in terms of throughput, both in number and size.
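For anyone wanting to check the arithmetic, the sketch below works back from the 2007 baseline and the two percentages quoted above. It assumes the 2007 size is read as c. 700 Gb, matching the units on the graph; the exact 2017 totals are not given in the text, so the values computed here are implied by the percentages rather than quoted figures.

```python
def percent_increase(old, new):
    """Percentage increase from old to new (100 means a doubling)."""
    return (new - old) / old * 100


def implied_new_value(old, pct):
    """Value reached after growing old by pct percent."""
    return old * (1 + pct / 100)


# 2007 baseline from the text; reading the size as c. 700 Gb is an assumption.
SIZE_2007_GB = 700
EVENTS_2007 = 176

# 2017 values implied by the quoted increases of 364% and 713%.
size_2017_gb = implied_new_value(SIZE_2007_GB, 364)   # just over 3 Tb
events_2017 = implied_new_value(EVENTS_2007, 713)     # over 1400 events

# Mean size per accession event in each year.
mean_2007_gb = SIZE_2007_GB / EVENTS_2007   # about 4 Gb per event
mean_2017_gb = size_2017_gb / events_2017   # about 2.3 Gb per event
```

One way to read the disparity: because the number of events has grown faster than the total size, the mean size of an individual accession has roughly halved over the decade.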

Is this to do with the type of Accessions we’re dealing with? Over the years our Collections Policy has changed to reflect a much wider appreciation of data, and community. A breakdown of the Accessions by broad type adds more detail to the picture:

Number of Accessions by broad type, ADS 1998-2017

Aside from showing an interesting (to me at least) historical change in what the ADS takes (the years 1998-2004 are really a few academic research archives and inventory loads for Archsearch), this data also shows how we’ve had to handle the explosion of ‘grey literature’ coming from the OASIS system, and a marked increase in the number of Project Archives since we started taking more development-led work around 2014. The number of Project Archives should however come with a caveat, as in recent years these have been inflated by a number of ‘backlog’ type projects that have included a lot of individual accessions under one much larger project, for example:

This isn’t to entirely discount these, just that they could be viewed as exceptional to the main flow of archives coming in through research and development-led work. So without these, the number of archives looks like:

Accessions for Project Archives: all records and with backlog/ALSF/CTRL removed, ADS 1998-2017

So, we can see the ALSF was having an impact 2006-2011, and that in 2014-2016 Jenny’s work on Ipswich and Exeter, and Ray’s reorganisation of CTRL, were inflating the figures somewhat. What is genuinely startling is that in 2017 this ceases to be the case: we really are taking 400+ ‘live’ Accessions from Project Archives now. How are these getting sent to us? Time for another graph!

Number of Accessions for Project Archives, split by delivery method, ADS 1998-2017

The numbers clearly show that post-2014 we are seeing a lot more smaller archives being delivered semi-automatically via ADS-easy (limit of 300 files) and OASIS images (currently limited to 150 raster images). When I originally ran this query back in early 2017 it looked like ‘Normal’ deposits (*not that there’s anything that we could really call normal, a study of that is yet more blogs and graphs!) were dropping off, but 2017 has blown this hypothesis out of the water. What’s behind this? Undoubtedly the influence of Crossrail, which has seen nearly 30 Accessions, but also HLCs, ACCORD, big research projects, and a lot of development-led work sent on physical media or via FTP sites (so perhaps bigger or more complex than could be handled by ADS-easy). Put simply, we really are getting a lot more stuff!

There is one final thing I want to ask myself before signing off; how is this increase in Accessions affecting size? We’ve seen that total size is increasing (3 Tb accessioned in 2017), but is this just a few big archives distorting the picture? Cue final graphs…

Size (Gb) of Accessions for Journals, Archives and Grey Literature, ADS 1998-2017
Size (Gb) of Accessions from Normal archives, ADS-easy and OASIS images, ADS 1998-2017

I’m somewhat surprised by the first graph, as I hadn’t expected the OASIS Grey Literature to be so high (1.5 Tb), although anecdotes from Jenny and Leontien attest to the size of files increasing as processing packages enable more content to be embedded (another blog to model this?). Aside from this, although the impact of large deposits of Journal scans (uncompressed TIFF) can be seen in most years, particularly 2015, it does seem as though we’re averaging around 1.5 Tb per year for archives. Remember, this is just what we’re being given, before any normalisation for our AIP (what we rely on for migration) and DIP (what we disseminate on the website). And, interestingly enough, the large amount of work we are getting through ADS-easy and OASIS images isn’t having a massive size impact: just under 400Gb combined for the last 3 years of these figures.

—————-

Final thoughts. First off, I’m going to need another blog or two (and more time at airports!) to go deeper into these figures, as I do want to look at average sizes of files according to type, and the impact of our preservation strategy on the size of what we store. However, I’m happy at this stage to reach the following conclusions:

  • Over the last 5 years we’ve Accessioned 15 Tb of data.
  • Even discounting singular backlog/rescue projects and big deposits of journal scans, this does seem to represent a longer-term trend in growth.
  • OASIS reports account for a significant proportion of this amount, at over a Tb a year.
  • ADS-easy and OASIS images are having a big impact on how many Accessions we’re getting, but not an equal impact on size.
  • After threatening to fall away, non-automated archives are back! And these account for at least 1.5Tb per year, even disregarding anomalies.

Right, I’ll finish there. If anyone has read this far, I’m amazed, but thanks!

Tim

ps. Still here? Want to see another graph? I’ve got lots…

Total files Accessioned per year, ADS 1998-2017

Meet the #OAFund winner!

To mark the 2017 Open Access week, we thought it would be a good time to introduce the winner of our first Open Access Archaeology fund award (see our original announcement here), decided on after much deliberation and consideration by the panel of 3 independent judges. So…

Meet Chris

Figure 1: Chris with his geophysics equipment. Image credit: C. Whittaker

Chris Whittaker carried out a survey at Breedon on the Hill, a multi-period hilltop site, as part of his undergraduate dissertation at Newcastle University, supervised by Dr Caron Newman. After graduating he worked outside archaeology in the technology sector. However, conscious that his data was potentially at risk, he applied to the fund to help preserve the data and publish his findings. He has since started to study for a research master’s in settlement archaeology at Newcastle University.

The judges felt that Chris’ proposal – Breedon Hill, Leicestershire: an archaeological investigation at the multi-period hilltop site – was “an important site and methodically-collected dataset, which made good use of both Internet Archaeology and ADS, with the data having considerable potential for re-use to inform future fieldwork”.

About Breedon Hill
Breedon Hill, Leicestershire is a scheduled ancient monument. The hilltop was the site of a univallate hillfort present from the Early-Middle Iron Age. From the 7th century AD, a minster church was founded within the hillfort enclosure. Today, approximately two-thirds of the Iron Age rampart, and much of the hillfort interior, have been irretrievably lost due to quarrying (Figure 2). The investigation combined magnetometry and resistivity geophysical surveys, alongside digital terrain models (processed LIDAR data), to contribute to the understanding of the character and development of the hillfort interior and its immediate environment. Very little is known about the different phases of occupation at the hilltop, as previous excavations have primarily focussed on the ramparts, and so Chris’ investigation sought to address this issue.

Figure 2: Breedon Hill Quarry. Taken from http://www.geograph.org.uk/p/4597198 ©Anthony Parkes and licensed for reuse under creativecommons.org/licenses/by-sa/2.0

The results of Chris’ geophysical survey reveal several phases of roundhouses and post-hole built structures, as well as several potential associated enclosures, in the south-eastern part of the hillfort interior. These will be published as part of a future open access article in Internet Archaeology and will link to a related digital archive deposited with the Archaeology Data Service. We are looking forward to working with Chris in the coming months.

The church at Breedon in relation to what remains of the western rampart. Image credit: C. Whittaker

Chris said “The work was undertaken while I was an undergraduate student, firstly as part of an independent summer research programme (processing the LIDAR data), and secondly as part of an undergraduate dissertation (undertaking the geophysical survey). Publisher or institutional paywalls are often barriers for local researchers to study the world around them. And I know from personal experience that projects such as the digitisation of volumes of the Derbyshire Archaeological Journal, preserved with the ADS, are of great benefit to local and school-level research alike. From a research perspective [open access] offers many opportunities for colleagues from different backgrounds to build on and potentially refine the resources preserved.”

And now, we start all over again…
As you know, the Open Access Archaeology fund is made up of donations, set aside to support the digital archiving and publication costs of those researchers for whom funding is simply not available despite research quality and whose digital data is potentially at greater risk.

Thank you to everyone for your support for our #OAFund which is now being used to support the open access dissemination of Chris’ work. Of course, in making the first award, we now need to start all over again to raise sufficient funds for the next round to help more early career and independent researchers like him. So please consider donating today and help to reduce the barriers to open archaeological research and advance knowledge of our shared human past.

https://www.yorkspace.net/giving/donate/archaeology-fund

We want to send out lots more of our little USB trowels just like last year and we have an extra special gift for everyone who sets up a recurring monthly or annual gift!

The dark valley: notes from the ADS library

Tim Evans

Over a year and a half ago I wrote a short blog on the mechanics of the ADS grey literature library, going into (what I considered) fascinating detail on the technical considerations of archiving the reports we host online. In the intervening period since that blog I’ve spent a large portion of my time working on the Roman Rural Settlement of Britain project, and an array of what we term special collections (for example Stones of Greece, Origins of Nottingham and Parks and Gardens). Colleagues such as Jenny O’Brien and Georgie Field have primarily been responsible for transferring reports into the library and, as such, some distance has crept into the relationship between myself and the library. Like an old friend to whom one hasn’t spoken for some time, one starts to wonder whether the links and shared experiences will persevere.


ADS a Recommended Repository for Nature Publishing Group

ADS are very pleased to announce that we are now an officially recommended repository for Nature Publishing Group’s open access data journal Scientific Data. ADS joins approximately 80 other data repositories, representing research data from across the entire scientific spectrum. ADS has been approved by Scientific Data as providing stable archiving and long-term preservation of archaeology data.


Scientific Data offers a new article type, the ‘Data Descriptor’, which has been specifically designed to publish peer-reviewed research data in an accessible way, so as to facilitate its interpretation and reuse. Publishing Data Descriptors enables data producers and curators to gain appropriate credit for their work, whilst also promoting reproducible research. The main goals of this journal are tightly aligned with those of ADS, focusing on making data publicly accessible and encouraging re-use.


By becoming a recommended repository for Scientific Data, we are now a recommended repository for archaeological data accompanying articles published by the Nature Publishing Group. In addition, researchers now have the opportunity to deposit archaeological data with ADS whilst submitting a Data Descriptor to Scientific Data.

All depositors archiving with ADS and intending to publish in Scientific Data or another Nature Publishing Group journal must choose to disseminate the data they are depositing with us under a CC-BY licence. For more information contact the ADS at help@archaeologydataservice.ac.uk

Internet Archaeology is awarded the Directory of Open Access Journals Seal

Internet Archaeology is delighted to announce that we have been awarded the Directory of Open Access Journals (DOAJ) Seal.

The DOAJ is an online directory that indexes and provides access to high quality, open access, peer-reviewed journals.

The DOAJ Seal is awarded to a journal that fulfils a set of criteria related to accessibility, openness, discoverability, reuse and author rights. It acts as a signal to readers and authors that the journal has generous use and reuse terms and author rights, and adheres to the highest level of ‘openness’.

Internet Archaeology has been awarded the DOAJ Seal because it:

  • has an archival and preservation arrangement in place with the Archaeology Data Service
  • provides permanent DOI identifiers in the published content
  • provides article level metadata to DOAJ
  • embeds machine-readable CC licensing information in article level metadata
  • allows reuse and remixing of content in accordance with a CC BY license
  • has a deposit policy registered in SHERPA/RoMEO
  • allows authors to hold copyright without restriction.

Internet Archaeology is currently the only open access archaeology journal to be awarded the Seal, sitting alongside 88 other journals from right across the academic spectrum. It is wonderful to have been recognised for our work in this area by the DOAJ.

Internet Archaeology Displays Pre-Columbian Rock Art in New Light with Interactive Technology

Polynomial Texture Mapping (PTM) is a fairly new technique employed by archaeologists and it has furthered research at a well-known Brazilian rock art site, Avencal 1, revealing details not previously detected. An article outlining the work has just been published in Internet Archaeology and it contains an interactive viewer which enables readers to explore the rock art panels for themselves, including altering lighting conditions.

The WebRTIViewer showing Panel 1a from Urubici embedded in the Internet Archaeology article. © P. Riris, R Corteletti, Internet Archaeology.

The viewer was developed by colleagues at the Visual Computing Lab at Pisa who are also developing the 3DHOP application for use by the ADS. This is the first time the viewer has been used in a peer-reviewed journal, and demonstrates once again the capabilities of publishing in Internet Archaeology over many other journals.

Phil Riris (Southampton, UK) and Rafael Corteletti (Universidade Federal do Paraná, Brazil) applied the technique to a series of ‘blank’ panels and revealed undocumented geometric designs. They were also able to identify differences in how the engravings were produced, as well as potential sequencing.

 Riris, P. and Corteletti, R. (2015). A New Record of Pre-Columbian Engravings in Urubici (SC), Brazil using Polynomial Texture Mapping, Internet Archaeology 38. 

Internet Archaeology Goes Fully Open Access

Internet Archaeology is pleased to announce that it has become a fully open access journal.

From this month Internet Archaeology’s 130 institutional subscribers from the UK, USA, Australia and Europe will no longer have to pay the £160 a year subscription and the £7 charge for individual articles is also being scrapped, making Internet Archaeology one of the first journals to transition from a subscription model to full open access. Several things have spurred this decision.

CAA 2014 Paris

The department of archaeology on rue Michelet, Paris
The impressive exterior of the department

The annual CAA (Computer Applications and Quantitative Methods in Archaeology) conference took place in the impressive surroundings of the Sorbonne. The Archaeology Data Service and Internet Archaeology were very well represented throughout the four days of the conference.

Day One 22nd April 2014

Partners from the ARIADNE project came together in Paris in the ARIADNE Workshop on On-line Resources chaired by ADS’s Catherine Hardman. The workshop introduced archaeological researchers to a variety of on-line data resources, including those held by the three partners providing on-line access to their data as part of the EC Infrastructures funded Advanced Research Infrastructure for Archaeological Dataset Networking (ARIADNE) project.

The partners were the Archaeology Data Service (ADS), ARACHNE at the German Archaeological Institute (DAI), and Fasti Online at the International Association of Classical Archaeology (AIAC). In addition to the ARIADNE partners, the workshop featured a presentation on data and data integration in the Digital Archaeological Record (tDAR). tDAR is an international digital repository based in America for the digital records of archaeological investigations.


Persistence in Preservation and Publication of Data

To recognise the effort that authors make in order to deposit digital data and to get academic credit for that effort, Internet Archaeology (IA) and the ADS have established an open access data paper series. ‘Data papers’ maximise a dataset’s re-use potential and help to improve the preservation and publication of data, making them a valuable addition to the advancement of archaeological research. However, IA and ADS have now taken the concept a little further.

In order to identify the content and provide a persistent link to its location on the Internet, each data paper in IA and the corresponding archive in ADS are assigned unique DOIs (Digital Object Identifiers, issued via CrossRef and DataCite). The introduction of these unique digital identifiers has been a major advancement for persistence in data preservation, publication and citation, but our approach has been to extend them to a more granular level. While an ADS dataset is assigned a ‘top level’ DOI, additional identifiers to specific sections of the data area have also been allocated. This enhances the archive not just by enabling direct access to a subset of data; it also allows those sub-sections, often authored by specialist researchers, to be citable in their own right and gives recognition to the individuals who undertook the work (e.g. see the Richards & Roskams (2013) archive, where the Geophysical Survey, the Field-walking Survey and Animal Bone reports all have their own DOI). There is no limit to the granularity possible and we envisage usage right down to individual digital objects, such as a photograph or a GIS shapefile, when their importance to a hypothesis is apparent. Such use of DOIs will greatly benefit archaeological research, providing greater transparency in archaeological reporting and improving research efficiency.
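The idea of a ‘top level’ DOI with further DOIs for sub-sections can be pictured as a simple tree of citable items. The sketch below is only an illustration: the class name `CitableItem`, the titles, and the DOIs (which use the placeholder prefix 10.5555) are all hypothetical and do not correspond to any real ADS archive. The only real-world detail relied on is that a DOI can be resolved by appending it to `https://doi.org/`.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class CitableItem:
    """A dataset, sub-section or single object carrying its own DOI (hypothetical model)."""
    title: str
    doi: str
    parts: List["CitableItem"] = field(default_factory=list)

    def url(self) -> str:
        """Resolver link for this item's DOI."""
        return f"https://doi.org/{self.doi}"

    def walk(self) -> Iterator["CitableItem"]:
        """Yield this item, then every nested part, depth first."""
        yield self
        for part in self.parts:
            yield from part.walk()


# Made-up example: one archive with three separately citable sections.
archive = CitableItem(
    "Example archive",
    "10.5555/example-archive",
    parts=[
        CitableItem("Geophysical survey", "10.5555/example-archive.geophysics"),
        CitableItem("Field-walking survey", "10.5555/example-archive.fieldwalking"),
        CitableItem("Animal bone report", "10.5555/example-archive.animalbone"),
    ],
)

if __name__ == "__main__":
    for item in archive.walk():
        print(item.title, item.url())
```

Because each part is itself a `CitableItem`, the same structure could in principle nest down to an individual photograph or shapefile.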


Three New Yorkshire Archives

ADS is pleased to announce the release of three new digital archives exploring the history of settlement in Yorkshire, carried out under the auspices of the University of York.

The first, ‘Burdale: an Anglian settlement in the Yorkshire Wolds’ by Julian Richards and Steve Roskams, comprises a broad range of primary and secondary data derived from fieldwork and post-excavation analysis of the site. The aims of the project were to:

  • establish the depth, extent and survival of archaeological deposits
  • explore the nature of sedimentation
  • identify the extent of the 8th and 9th century activity
  • establish the relationship of the metalwork finds and the features
  • collect environmental & artefactual samples
  • determine the nature of activity on site
  • help protect the site from illegal metal-detecting

The release of this archive coincides with the release of a Data Paper on the data set in our sister service Internet Archaeology. This Data Paper highlights the reuse potential of the archive.

The second and third new digital archives are related to fieldwork undertaken within the parish of Cottam.