Selection and appraisal of data
This guide will help you develop a managed approach to appraising and selecting datasets for long-term curation. It should interest archaeologists across the sector who are responsible for managing data or who work in data-intensive fields, and those who support them in institutional repositories, data centres or archives.
Why select and appraise?
Just as for physical archives, it is neither possible nor desirable to keep all digital data forever. Outside the traditional archive and museum communities, however, there is no widespread recognition of the need to select data for curation. Instead there is a view that “storage is cheap so why don’t we just decide to keep everything”. While that may be technologically possible in theory, in practice there are four main objections to this view (1):
- Digital content expands. And “…if the growth of content (per byte or per object) keeps pace with the declining cost of storage, then the real cost of keeping everything may actually be the same as it is now, or higher” (2).
- Backup and mirroring increase costs. No digital preservation approach can survive without appropriate mirroring and backup systems. This instantly increases the storage cost by at least a factor of two, but more usually by three.
- Discovery gets harder. Keeping everything means that the noise-to-signal ratio of searches will be high, so each user must spend additional effort working out which data is the intended target of a search.
- Managing and preserving is expensive. We must consider the cost of creating and managing preservation metadata, and the cost of preservation actions on data that does need to be retained.
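The first two objections can be made concrete with a little arithmetic. The sketch below is illustrative only: the growth rate, price decline, base cost and number of copies are assumed figures rather than measurements, and `annual_storage_cost` is a hypothetical helper written for this guide.

```python
def annual_storage_cost(year, base_cost=1000.0, content_growth=0.25,
                        price_decline=0.20, copies=3):
    """Estimated storage spend in a given year (year 0 = today).

    All defaults are assumed, illustrative figures:
    base_cost      cost of holding one copy of today's content today
    content_growth yearly growth in the volume held (0.25 = 25%)
    price_decline  yearly fall in unit storage price (0.20 = 20%)
    copies         total copies kept, including mirrors and backups
    """
    volume_factor = (1 + content_growth) ** year
    price_factor = (1 - price_decline) ** year
    return base_cost * volume_factor * price_factor * copies


# Growth of 25% a year cancels a 20% yearly price fall (1.25 * 0.80 = 1),
# so the yearly bill does not shrink however cheap storage becomes.
for year in (0, 5, 10):
    print(year, round(annual_storage_cost(year), 2))
```

With these assumed rates the annual cost stays flat; any faster growth in content makes "keeping everything" more expensive each year, and the `copies` multiplier applies throughout.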
The decision to be selective may raise a difficult question: does the cost of selection outweigh the combined cost of creating and managing metadata, and undertaking preservation? Given the ‘unrepeatable’ nature of much archaeological data, for the archaeological community the answer is probably no. Long-term retention and curation of data nevertheless require a commitment to incur future costs, and this necessarily imposes on any community a need to consider carefully what should be retained.
“Appraisal is the noblest function, the central core of contemporary archival practice.” (3)
What archivists call ‘appraisal’ is often referred to outside the archival profession as ‘selection’ or ‘acquisition’, and is closely linked to a repository or institutional policy on collection development. Appraisal is the process whereby some records are selected for retention while others are deemed of insufficient value to justify permanent retention. Selection is not an ad hoc process; it must be guided by local and community policies and legal requirements. The process used to make data selection decisions must be transparent and accountable.
Appraisal is perhaps the most contentious and certainly one of the most difficult undertakings of the professional archivist or those involved in post-excavation archive creation. It determines what records will be preserved for posterity. In a very real way, by making these decisions within the archaeological discipline we may shape national narratives about the past.
Appraisal and Selection Policy
A policy needs to ensure consistent, transparent and accountable decision-making, so that commitments can be tracked and accounted for. The policy must fit legal requirements, e.g. relating to privacy and Intellectual Property Rights. It may also need to comply with relevant legislation for the jurisdiction, e.g. Public Records Acts, as well as national and local authority data policies and codes of conduct adopted by the host institution or funder, and any information governance policies relating to archaeology.
The policy will set out criteria for assessing the value of a dataset or resource, and what should be done with it accordingly. Criteria will vary depending, for example, on whether the remit includes preservation. In any case the policy will give the basis for further assessment of the datasets. That assessment will also be influenced by discipline-specific factors and will be based around general criteria such as the seven listed below, which are drawn from various sources (4,5,6).
- Relevance: The resource content fulfils the priorities stated in the funding or commissioning body’s current strategy, including any legal requirement to retain the data beyond its immediate use.
- Scientific or Historical Value: Is the data scientifically, socially, or culturally significant? Assessing this involves inferring anticipated future use.
- Uniqueness: The extent to which the resource is the only or most complete source of the information that can be derived from it, and whether it is at risk of loss if not accepted, or may be preserved elsewhere.
- Potential for Redistribution: The reliability, integrity, and usability of the data files can be determined; the files are received in formats that meet designated technical criteria; and Intellectual Property (7) or human subjects issues are addressed.
- Non-Replicability: It would not be feasible to replicate the data/resource, or doing so would not be financially viable.
- Economic Case: Costs can be estimated for managing and preserving the resource, and are justifiable when assessed against evidence of potential future benefits; funding has been secured where appropriate.
- Full Documentation: The information necessary to facilitate future discovery, access, and reuse is comprehensive and correct, including metadata on the resource’s provenance and the context of its creation and use.
Developing the Appraisal Process
The appraisal process will apply the criteria set out in the policy. This process must be transparent and accountable to justify selection decisions to current and future users, so it must follow clear, unambiguous, and objective criteria.
High data volumes rule out appraisal at the record level, and may even do so at the dataset level. The appraisal/selection process should be undertaken at as high a level of data aggregation as will ensure justifiable outcomes and allow cost-effective decision making. The challenge is to identify a set of high-level criteria that can be applied using evidence that is practical to obtain, yet are sensitive to possibly wide variations at the individual item level.
The post-excavation team/staff should preferably take selection decisions, as this is the appropriate level of responsibility. The views of discipline specialists will often be essential, especially those of the team who created and used the data. All decisions must be recorded, with justifications, so that future users can understand why particular datasets were kept or destroyed. Decision records are metadata, to be held in whatever archive, asset or digital object management system is used to manage and control the data, and these metadata must be retained permanently.
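One way to picture a decision record held as metadata is sketched below. It is a minimal illustration, not a prescribed schema: the class name, field names and example values are all hypothetical, and a real system would follow whatever metadata standard the repository has adopted. The only content taken from this guide is the list of seven general criteria and the requirement that every decision carries a justification.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import date

# The seven general criteria from the policy section.
CRITERIA = (
    "Relevance",
    "Scientific or Historical Value",
    "Uniqueness",
    "Potential for Redistribution",
    "Non-Replicability",
    "Economic Case",
    "Full Documentation",
)


@dataclass
class AppraisalDecision:
    """One recorded selection decision (field names are hypothetical)."""
    dataset_id: str
    outcome: str          # "retain" or "dispose"
    decided_by: str
    decided_on: str       # ISO 8601 date
    justification: str
    # Evidence noted against each criterion that was assessed.
    evidence: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.outcome not in ("retain", "dispose"):
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if not self.justification.strip():
            raise ValueError("a decision must carry a justification")
        unknown = set(self.evidence) - set(CRITERIA)
        if unknown:
            raise ValueError(f"unknown criteria: {sorted(unknown)}")

    def to_json(self):
        """Serialise for storage alongside the data it describes."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Invented example values, for illustration only.
decision = AppraisalDecision(
    dataset_id="example-site-context-sheets",
    outcome="retain",
    decided_by="Post-excavation team",
    decided_on=date(2010, 6, 24).isoformat(),
    justification="Primary record of an excavation that cannot be repeated.",
    evidence={"Non-Replicability": "Site destroyed by development."},
)
print(decision.to_json())
```

Refusing to create a record without a justification is a design choice that mirrors the accountability requirement: a decision that cannot be explained later should not be storable in the first place.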
Below we take the general criteria listed in the Policy section and consider some more detailed questions and evidence relevant to appraisal on those criteria.
Relevance to Mission
Does the dataset or resource fall within the repository’s scope?
- Refer to the remit or mandate set by the host institution or other funders, their broader data policies, and codes of conduct for research e.g. for retention periods.
- Consult the strategic priorities of the host institution or other funders.
Are there other relevant legal requirements or guidelines?
- These may for example include legislation relating to Public Records, Copyright and Patents, Data Protection, Health and Safety, and Equality (see the ADS guide on sensitive data).
- Also check any discipline-specific information governance guidelines and codes of practice, e.g. codes and guidelines on the excavation and handling of human remains.
Scientific or Historical Value
Does the dataset reflect the interests of contemporary society?
- Consider how the research questions relate to trends in research awards by national funding bodies.
Uniqueness
Is the dataset the only source of its content and will it be preserved elsewhere?
- Check whether the dataset(s) duplicates existing work, is new or unique.
- Try to find out if other copies of the data exist and are accessible and useable. If other copies exist, where is the most comprehensive or up-to-date version?
- Are any other copies at risk of loss, or will they be preserved where they are?
Potential for Redistribution
Are Intellectual Property Rights (IPR) issues addressed?
- Check the repository/archive/museum’s policy on IPR and sharing, access to and re-use of data.
- Check whether the funder or project consortium have IPR policies affecting the work, and whether these have been adhered to.
- Identify any contractual or licence terms affecting the dataset, e.g. has the copyright owner given permission for archiving?
- If a Creative Commons or similar ‘copyleft’ licence has been used, with what conditions, or has a public domain waiver been given?
Are human subjects issues addressed?
- Was informed consent obtained from the research subjects for archiving and re-use of personal data, on what terms, and is it feasible for the archive to adhere to them? E.g. can the data be effectively anonymised?
- Was approval by an Ethics Committee required to collect the data and if so is there evidence of this?
- Are there any other restrictions on sharing, access and re-use if the research involved human subjects (e.g. records of human remains)?
What is the reliability and usability of the dataset?
- Is the dataset in a format that allows others to use it without costs or other restrictions?
- Is software available to access, view and query the data, and if so will any costs or terms apply to users, especially for its long term maintenance?
- Is there enough metadata and documentation for the dataset to be readily used and understood away from its original context of creation?
Has the data been stored in a way that ensures its integrity has not been compromised?
- Whoever has kept and stored the dataset needs to ensure that the data cannot be tampered with or inadvertently changed.
- Backups must have been kept safely to ensure corrupted data can be replaced.
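A common way to gather evidence on integrity is to record a checksum for every file when it is first stored and to recompute the checksums later. The sketch below uses SHA-256 from the Python standard library; the function names and the manifest layout (relative path mapped to hex digest) are assumptions made for illustration, not a format required by any particular archive.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Compare the files now under root with a stored manifest.

    Returns (changed, missing, unexpected) as sorted lists of paths.
    """
    current = build_manifest(root)
    changed = sorted(k for k in manifest
                     if k in current and current[k] != manifest[k])
    missing = sorted(k for k in manifest if k not in current)
    unexpected = sorted(k for k in current if k not in manifest)
    return changed, missing, unexpected
```

A manifest built at the point of creation and kept separately from the data gives the appraiser something to check against: three empty lists from `verify` indicate that the files match what was originally recorded, and any changed or missing file can be restored from a backup whose checksum still matches.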
Does the dataset meet technical criteria that allow its easy redistribution?
- Has the data been created, or kept, in an open, machine-independent or easily accessible format?
- Can the data be easily migrated to other formats that might be more accessible to external users?
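A first pass at the format questions can be made by listing the file types in a deposit and flagging those not on the repository's preferred list. The sketch below is a rough illustration under stated assumptions: `PREFERRED_EXTENSIONS` is a hypothetical list invented for the example (a real archive would publish its own), the file names are made up, and a file extension is only a proxy for the actual format.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical list of formats a repository might designate as preferred.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".xml", ".pdf", ".tif", ".tiff"}


def format_inventory(file_names):
    """Count files by lower-cased extension ('' when there is none)."""
    return Counter(PurePosixPath(name).suffix.lower() for name in file_names)


def needs_migration(file_names, preferred=PREFERRED_EXTENSIONS):
    """Return, sorted, the files whose extension is not on the preferred list."""
    return sorted(
        name for name in file_names
        if PurePosixPath(name).suffix.lower() not in preferred
    )


# Invented deposit listing, for illustration only.
deposit = ["contexts.csv", "plans/trench1.dwg", "report.PDF", "finds.mdb"]
print(format_inventory(deposit))
print(needs_migration(deposit))
```

The flagged files are candidates for the migration question above; whether migration is feasible, and at what cost, still has to be judged case by case.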
Non-Replicability
Can the data be easily replicated, recreated or re-measured?
- Do the data record transient or one-off events that cannot be repeated, such as an archaeological excavation?
- Is the event/project which caused the data to be created easily reproducible?
Is the cost of replicating or re-measuring the data financially viable?
- Would another body be prepared to fund the future reproduction of the data?
- Even if the data can be recreated or re-measured it may be so expensive to do so that it is preferable to retain the original.
Economic Case
Has the total cost of retaining the data been considered?
- Keeping data for long periods involves more than storage. Data must be kept accessible, backups kept, and sharing and access implemented. All this adds to the cost of keeping data. The total cost must be considered and estimated to check whether it is financially viable to keep the data.
- The JISC Keeping Research Data Safe (KRDS) Phase 2 Project produced a cost model for digital preservation (8) including ‘acquisition’ costs. The British Library’s LIFE (Life Cycle Information for E-Literature) projects (9) have also developed a lifecycle model and a predictive costing tool that can help to determine costs. See also the ADS charging policy.
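The point that retention involves more than storage can be shown as a simple sum. The sketch below is not the KRDS or LIFE model; it is a deliberately crude, undiscounted estimate whose cost components and figures are all assumed for illustration.

```python
from dataclasses import dataclass


@dataclass
class RetentionCosts:
    """Assumed cost components for one dataset (all figures hypothetical)."""
    ingest: float                 # one-off: appraisal, cleaning, metadata
    storage_per_year: float       # for a single copy
    copies: int                   # including mirrors and backups
    access_per_year: float        # delivery and user support
    preservation_per_year: float  # migration, integrity checks, metadata upkeep

    def annual(self):
        """Recurring cost per year across all copies and activities."""
        return (self.storage_per_year * self.copies
                + self.access_per_year
                + self.preservation_per_year)

    def total(self, years):
        """Undiscounted cost of keeping the dataset for a number of years."""
        return self.ingest + self.annual() * years


costs = RetentionCosts(ingest=500.0, storage_per_year=20.0, copies=3,
                       access_per_year=40.0, preservation_per_year=60.0)
print(costs.total(years=10))
```

Even in this toy example, raw storage for a single copy is a small share of the annual figure, which is why an estimate based on storage prices alone understates the commitment being made.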
If the cost is acceptable who is going to pay for data retention?
- Even if the cost of keeping the data is acceptable, how data retention will be funded must be considered. Without this, selection cannot be viable.
- Has funding been provided, promised or assured?
Full Documentation
Is there documentation to support sharing, access and re-use of the data?
- Users need some way of understanding a dataset’s structure and the meaning of its field names, etc., so that anyone not directly involved in creating the data will be able to re-use it. Are there data dictionaries explaining the layout and structure?
- Is there comprehensive information about the context of data creation: the nature of the project, the data collection methodology and any post-collection manipulation?
- Have records been kept of any access, copyright/IPR, privacy or ethical restrictions on access and re-use?
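Whether a data dictionary actually covers a dataset can be checked mechanically. The sketch below compares the column names of a CSV file with the fields described in a dictionary; the function name, the dictionary layout (a CSV with `field` and `description` columns) and the sample data are assumptions made for illustration.

```python
import csv
import io


def undocumented_fields(data_csv, dictionary_csv):
    """Return data column names that the data dictionary does not describe.

    Assumes the dictionary is itself CSV text with 'field' and
    'description' columns; that layout is an assumption of this sketch.
    """
    header = next(csv.reader(io.StringIO(data_csv)))
    described = {
        row["field"].strip()
        for row in csv.DictReader(io.StringIO(dictionary_csv))
        if row["description"].strip()
    }
    return [name for name in header if name.strip() not in described]


# Invented sample data and dictionary, for illustration only.
data = "ctx_no,ctx_type,munsell\n101,fill,10YR 4/2\n"
dictionary = (
    "field,description\n"
    "ctx_no,Context number assigned on site\n"
    "ctx_type,Context type (cut; fill; layer)\n"
)
print(undocumented_fields(data, dictionary))
```

Any field name returned is one that a future user would have to guess at, so an empty list is a reasonable minimum to expect before accepting the documentation as complete.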
Re-modelling data management workflows
The need to reduce data management costs is driving greater automation of workflows. This presents opportunities for automation across the boundary between the originating data creator and the data repository. More metadata will be generated automatically through tools embedded in everyday research activity (like OASIS). In the future, research material itself may be marked up using automated classification techniques, or may be ‘linked data’ that has been integrated from disparate web sources. These new sources and technologies provide opportunities for data centres to improve data discovery.
Engaging the community in appraisal
There is a role for the wider community in the appraisal process, driven by the need to assess the research value of ever-expanding volumes of data more cost-effectively. According to Fran Berman the “need for community appraisal will push academic disciplines beyond individual stewardship, where project leaders decide which data is valuable, which should be preserved, and how long it should be preserved” (10). For a repository or data centre deciding to acquire a database or its contents, dataset-level usage and citation metrics may become an indicator of their current usage, provided of course they are valid measures for the research community concerned. These metrics may also be relevant for re-appraising datasets the archive is already making available.
1. See “The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth Through 2011”, IDC White Paper, March 2008, which reports that the amount of digital data created in 2007 exceeded (for the first time) the amount of storage available (new and existing). IDC estimate that by 2011 half the digital information created will not be able to be stored.
2. Paradigm Project: Workbook on Digital Private Papers, Section 04: Appraising digital records: a worthwhile exercise? Retrieved Feb 17 2010 from: http://www.paradigm.ac.uk/workbook/appraisal/digitalappraisal.html (no longer available).
3. C. Couture, “Archival Appraisal: A Status Report”, Archivaria 59, 2005, p.107
4. NASA Socioeconomic Data and Applications Center (SEDAC) Long-Term Archive. (n.d.). Appraisal for Accession to the SEDAC LTA. Retrieved June 24, 2010, from http://sedac.ciesin.columbia.edu/lta/Appraisal.html
5. Faundeen, J. (2010). Appraising U.S. Geological Survey Science Records. Archival Issues, 32(1), 7–22.
6. NARA – U.S. National Archives and Records Administration. (2007). Strategic Directions: Appraisal Policy. Retrieved June 24, 2010, from http://www.archives.gov/records-mgmt/initiatives/appraisal.html
7. Copying data for preservation purposes without specific approval from the copyright owner may not be covered by copyright legislation, or the intellectual property rights for some material may be so restrictive that there is no real possibility of access to the data being made available in the future. In this case, it is probably pointless to expend resources on its curation.
8. KRDS Project factsheet. Retrieved Aug 29 2010, from: http://www.beagrie.com/KRDS_Factsheet_0910.pdf (no longer available).
9. LIFE Projects. Retrieved Aug 29 2010, from: http://www.life.ac.uk/
10. Berman, F. (2008). Got data?: a guide to data preservation in the information age. Communications of the ACM, 51(12), 50–56.
This guidance is adapted from How to Appraise & Select Research Data for Curation by Angus Whyte, Digital Curation Centre, and Andrew Wilson, Australian National Data Service (2010). Thanks to the DCC and ANDS for permission to re-use their guidance.
Further Information and Bibliography
Whyte, A. and Wilson, A. (2010) ‘How to Appraise & Select Research Data for Curation’ Digital Curation Centre. Retrieved 06/06/2011 from: http://www.dcc.ac.uk/resources/how-guides/appraise-select-research-data
Two other DCC guides by Ross Harvey cover this topic:
Awareness Level: Introduction to Curation: Appraisal and Selection (2008)
Expert Level: Curation Reference Manual: Appraisal and Selection chapter (2006)
- Dallas, C. (n.d.). An agency-oriented approach to digital curation theory and practice. In Proceedings: International Symposium on “Information and Communication Technologies in Cultural Heritage” (p. 49).
- Digital Preservation Coalition. (n.d.). Decision Tree for Selection of Digital Materials for Long-term Retention. Retrieved June 24, 2010, from http://www.dpconline.org/advice/decision-tree.html
- Downs, R. R., & Chen, R. S. (2009). Designing Submission Services for a Trustworthy Digital Repository of Interdisciplinary Scientific Data. In Earth and Space Science Informatics Workshop: Developing the Next Generation of Earth and Space Science Informatics: Technologies and the People That Will Implement Them. August (pp. 3–5).
- Esanu, J., Davidson, J., Ross, S., & Anderson, W. (2004). Selection, Appraisal, and Retention of Digital Scientific Data: Highlights of an ERPANET/CODATA Workshop. Data Science Journal, 3, 226.
- Faundeen, J. (2010). Appraising U.S. Geological Survey Science Records. Archival Issues, 32(1), 7–22.
- Gray, J., Szalay, A., Thakar, A., Stoughton, C., vandenBerg, J. (2002). “Online Scientific Data Curation, Publication, and Archiving”, Microsoft Research Technical Report MSR-TR-2002-74.
- Gutmann, M., Schürer, K., Donakowski, D., & Beedham, H. (2004). The selection, appraisal, and retention of social science data. Data Science Journal, 3(0), 209–221.
- NARA – US National Archives and Records Administration. (n.d.). Strategic Directions: Appraisal Policy. Retrieved June 24, 2010, from http://www.archives.gov/records-mgmt/initiatives/appraisal.html
- NASA Socioeconomic Data and Applications Center (SEDAC) Long-Term Archive. (n.d.). Appraisal for Accession to the SEDAC LTA. Retrieved June 24, 2010, from http://sedac.ciesin.columbia.edu/lta/Appraisal.html
- Norris, R., Andernach, H., Eichhorn, G., Genova, F., Griffin, E., Hanisch, R., Kembhavi, A., et al. (2006). Astronomical Data Management. Arxiv preprint astro-ph/0611012.
- Pearce-Moses, R. (n.d.). SAA: Glossary of Archival Terminology. Retrieved July 6, 2010, from http://www.archivists.org/glossary/
- Schade, D. (2009). Data Centre Operations in the Virtual Observatory Age. In Proceedings PV2009. Presented at the Ensuring Long-Term Preservation and Adding Value to Scientific and Technical Data, Madrid, Spain.
- Wallis, J. C., Borgman, C. L., Mayernik, M. S., & Pepe, A. (2008). Moving archival practices upstream: An exploration of the life cycle of ecological sensing data in collaborative field research. International Journal of Digital Curation, 3(1).
- Witt, M. (2008). Institutional Repositories and Research Data Curation in a Distributed Environment. Library Trends, 57(2), 191–201.
- Yakel, E. (2007). Digital curation. Perspectives, 23(4), 335–340.