Keeping our Data Consistent

The consistency and integrity of data is essential for any digital archive. Therefore, for the past few months we have been running a series of programs to test the consistency of our file system and database and try to identify any other problems. This work started when we decided to develop a program to test all the checksums in our file system. The idea was to run the program every few months in order identify any checksums which had changed since the last run.

Screenshot of a part of a checksum report. — Part of a checksum report.

In addition, the program would test the checksums in the file system against the checksums in the database so that we could be sure that they were synchronised. The program took a few weeks to develop and has now been run several times. Each run produces a report which shows any checksum changes in the file system and the database. Happily, there have only been a few checksums flagged up in the reports so far and usually there have been good reasons why they have been changed.

During the development of the checksum program it also became apparent that some files didn’t conform to our current guidelines for file naming and we found that this became a problem when we ran our file identification software Droid (from PRONOM). We also found that the file structure of some our archives was not as consistent as we would like. This was particularly evident in our early archives when the procedures were different. Therefore, during the development of the checksum program we also developed some tools to check the file system for any file name problems, for example accented and windows characters, and check that the file structure conformed to our current procedures. The result of this work was that we found a few thousand file names which were inconsistent and about twenty archives in which the file structure needed to be modified. This is a very small percentage of our digital data as we have over 1.5 million files and over 1300 archives. This work has also shown that the vast majority of the inconsistences are in our very early archives and it is good to know that our archivists stringently adhere to the current guidelines.

In conclusion, the development of these tools has enhanced our ability to monitor the integrity of the contents of the archive and, moving forwards, should contribute to more robust data management policies and practices. At the same time it should allow for more rigorous and proactive data auditing.

Recent blog articles See all

CATs Assemble: Insights from our 2025 Archiving Strategy Week

Unlocking Archaeological Data: Exploring New Ways to Access ADS Collections via APIs

Evaluating evaluation trenching: reengaging with a PhD project using the ADS and cutting-edge data science.