English Heritage

Preservation and Management Strategies for Exceptionally Large Data Formats: 'Big Data'

Overview

Aims of the 'Big Data' Project:

Preliminary and Ongoing Investigations:
  • Define 'Big Data' and the technologies, delivery mechanisms, storage methods and activities in archaeology, research and cultural resource management currently generating such data.
  • Identify and review, over the period of the project, the circumstances under which the technologies that generate 'Big Data' formats are deployed.
  • Identify, describe and characterise the data formats used in 'Big Data' projects and their relationship to storage, delivery and reusability.
  • Identify a representative list of creators and users of 'Big Data'.
  • Identify existing good practice in related fields. As this is an exploratory project for archaeology, research and cultural resource management, this will necessarily include a review of policies and services offered in other related scientific fields, e.g. the National eScience Centre, the Natural Environment Research Council (NERC), the Council for the Central Laboratory of the Research Councils (CCLRC), the British Oceanographic Data Centre (BODC), the British Atmospheric Data Centre (BADC), the US Geological Survey (USGS), or even the European Space Agency (ESA), the National Aeronautics and Space Administration (NASA), the National Geophysical Data Centre at Boulder, Colorado, USA, or the Conseil Européen pour la Recherche Nucléaire (CERN). The petroleum exploration industry also generates considerable holdings of 3D seismic data for the North Sea. Maritime archaeology has close and important links with Defra, which will also need to be consulted, in particular Defra's appointed Marine ALSF Science Co-ordinator, Dr Richard Newell, and the new Marine Data & Information Partnership.
  • Determine existing suitable repositories for 'Big Data' under a distributed archiving model. This would allow the ADS to act as a pointer to archives held elsewhere. The AHRB project outlined below, for example, will be archived at the AHDS repository, while other project data may be archived at other suitable sites.
  • Identify best practice in terms of cost for the storage of 'Big Data' in its raw and unprocessed form. This will include the best formats for storage (e.g. ASCII) and the best policies regarding file processing prior to archiving (e.g. compression).
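
As a rough illustration of the storage question in the last point above, the following minimal sketch compares the size of a plain ASCII file with its losslessly compressed equivalent. Everything in it is an assumption made for illustration: the synthetic "x y z" records, the choice of gzip, and the helper names `ascii_point_cloud` and `compression_report` are hypothetical and are not taken from the project. Real survey data would be far less regular and would compress less well.

```python
import gzip


def ascii_point_cloud(n_points: int) -> bytes:
    """Build a synthetic ASCII file with one 'x y z' record per line.

    The values are invented for illustration only.
    """
    lines = (
        f"{i * 0.5:.3f} {i * 0.25:.3f} {(i % 100) * 0.01:.3f}"
        for i in range(n_points)
    )
    return ("\n".join(lines) + "\n").encode("ascii")


def compression_report(raw: bytes) -> dict:
    """Compare raw and gzip-compressed sizes of the same data."""
    compressed = gzip.compress(raw)
    # gzip is lossless: decompressing must return the original bytes.
    if gzip.decompress(compressed) != raw:
        raise ValueError("round trip failed")
    return {
        "raw_bytes": len(raw),
        "gzip_bytes": len(compressed),
        "ratio": len(compressed) / len(raw),
    }


report = compression_report(ascii_point_cloud(10_000))
print(report)
```

Multiplying either byte count by an assumed cost per stored byte gives a first estimate of how a compression policy might affect storage cost, although any such figure depends on the repository concerned.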
To Carry Out a User Survey:
  • Identify and make recommendations on existing standards amongst users in archaeology, research and cultural resource management and current good practice in the generation, use, delivery, preservation and storage of 'Big Data'.
  • Identify and make recommendations on the potential uses and reuses of such 'Big Data'.
  • Identify the likely amount of re-use of the various elements of 'Big Data' sets and the time periods for which they remain valid.
  • Address issues regarding archiving discard policies and data selection policies.
  • Address issues raised by the use of proprietary formats.
  • Assess potential future developments in the costs of storage and the implications for future policy.
To Carry Out Preservation and Data Access Work on Suitable Pilot Studies:
  • Investigate, set out and make recommendations on the preservation and storage options available for 'Big Data'.
  • Identify, describe and characterise the required documentation for the preservation and reuse of 'Big Data' formats.
  • Identify, describe and characterise the interpretive processes applied to 'Big Data' and relate these to the documentation issues.
  • Identify preservation (storage), dissemination (delivery) and reusability options and costs, specifically to:
    • Identify issues related to online dissemination and make recommendations for the dissemination and delivery of data.
    • Identify issues regarding proprietary formats and make recommendations regarding their effects on storage, delivery and reuse.
    • Investigate and refine an appraisal mechanism for assessing large-scale digital archives in the light of archiving discard policies.
    • Assess the potential for re-use of the different types of data as an indicator of archive suitability.
    • Explore cost models for setting threshold limits on archive sizes, or on how long archives need to be kept if they are not accessed at all.

Project Dissemination Aims:

  • To encourage and review debate in the user community.
  • To disseminate the results of the project through conference papers or a conference session.
  • To produce a 'Big Data' report.
  • The above report and discussion will address generic and strategic issues regarding the archiving and reuse of 'Big Data' with the future in mind.
  • The project will have its own website hosted by the ADS where interested parties can access information and findings related to 'Big Data'. This website will link to the Heritage3D site and will complement rather than duplicate the information available there.
