SPRUCE Hackathon – File Characterisation

The other week I had the opportunity to participate in the SPRUCE Hackathon hosted by Leeds University.  Hackathons are an opportunity for developers to get together and work on (or hack) common problems.  Typically hackathons in the USA are fuelled by Mountain Dew and pizza, but as this was a British hackathon it was mostly fuelled by tea and cakes (and mighty fine cakes thanks to Becky).  The hackathon was specifically focused on issues around file characterisation: precisely identifying and describing the technical characteristics of a file as well as its metadata.  This is an ongoing challenge for practitioners in the digital preservation realm, since there are many file formats, many versions of each format, and little consistency in the way these formats and versions identify themselves internally.  Digital archivists need to know more than just the file extension or format’s name, which Gary McGath sums up nicely in his recent Code4Lib article:

Just knowing the format’s generic name isn’t enough. If you have a “Microsoft Word” file, that doesn’t tell you whether it’s a version from the early eighties, a recent document in Microsoft’s proprietary format, or an Office Open XML document. The three have practically nothing in common but the name.

Thankfully there are a number of characterisation tools to help digital archivists with this, and among the attendees at the hackathon were some of the key developers behind major tools such as JHOVE, JHOVE2, FITS, DROID and C3PO.  This provided an exciting opportunity to work alongside them on their tools and learn more about how the tools work.

Rather than everyone working on one big problem, we split up into smaller groups to tackle more manageable tasks.  I was in the FITS group, which included the main developer behind it, Spencer McEwen, and the main developer behind JHOVE, Gary McGath.  Our objective for the hackathon was to add Apache Tika to the list of tools which FITS supports.  In principle we achieved this, but as the hackathon lasted only two days, further work is still necessary.  Spencer has released a new version of FITS with Apache Tika functionality included, although it is disabled by default until more work can be done to process the outputs.

The hackathon provided a really good opportunity to work closely with Spencer and get a better understanding of the underlying code in FITS.  This also got me thinking about how we do file characterisation in the ADS, and how FITS could be a better fit in some of our workflows.  To provide a bit of context, FITS is effectively a wrapper program which simplifies the execution and outputs of a number of file characterisation tools.  In addition to two original tools (FileInfo and XMLMetadata), it bundles the following tools:

That is a pretty comprehensive list of tools, and with the addition of Apache Tika, FITS now supports most of the major tools used in file characterisation.  But the power of FITS isn’t just in throwing a lot of characterisation tools at files; it’s also in the normalised output that’s returned to the user.  FITS parses the (sometimes complex) outputs of all the individual tools and then constructs a unified and simplified XML file for the digital archivist.  The original outputs can also be included in the FITS XML output, so the archivist can still consult the raw results to confirm the FITS output.
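To make the wrapper idea a little more concrete, here is a minimal Python sketch of the general pattern: parse each tool's own output format into a common record, then build one simplified XML report, optionally keeping the raw output alongside.  This is not FITS code.  The two tools ("tool-a" and "tool-b"), their output formats, and the element names in the report are all hypothetical and were made up for illustration.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ToolReport:
    tool: str         # name of the tool that produced the result
    format_name: str  # format as reported by that tool
    mimetype: str
    raw: str          # original, unparsed tool output


def parse_tool_a(raw: str) -> ToolReport:
    # Hypothetical tool A prints "format=...;mime=..." on a single line.
    pairs = dict(item.split("=", 1) for item in raw.split(";"))
    return ToolReport("tool-a", pairs["format"], pairs["mime"], raw)


def parse_tool_b(raw: str) -> ToolReport:
    # Hypothetical tool B prints a small JSON object.
    data = json.loads(raw)
    return ToolReport("tool-b", data["name"], data["type"], raw)


def build_report(filename, reports, include_raw=False) -> ET.Element:
    # One <identity> element per tool, all sharing the same attributes,
    # regardless of how each tool originally formatted its output.
    root = ET.Element("report", {"file": filename})
    for r in reports:
        identity = ET.SubElement(
            root,
            "identity",
            {"tool": r.tool, "format": r.format_name, "mimetype": r.mimetype},
        )
        if include_raw:
            # Keep the untouched tool output so it can be checked by hand.
            ET.SubElement(identity, "rawOutput").text = r.raw
    return root


if __name__ == "__main__":
    reports = [
        parse_tool_a("format=Portable Document Format;mime=application/pdf"),
        parse_tool_b('{"name": "PDF", "type": "application/pdf"}'),
    ]
    report = build_report("example.pdf", reports, include_raw=True)
    print(ET.tostring(report, encoding="unicode"))
```

The point of the sketch is the shape of the design: the per-tool parsers absorb the differences between tools, so whatever consumes the report only ever sees one schema.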

Currently the ADS uses DROID 6.1 in its Collection Management System (CMS) to characterise files as they are ingested into our archive.  The choice was made to use DROID because of its perceived simplicity, its active development, and its ability to be integrated into other systems via an API.  After talking to the main developer behind our CMS, Paul Young, it has become clear that FITS may be a better choice than DROID on its own.  The gaps in DROID’s format coverage could be filled by other tools within FITS, potentially giving us more accurate results.  FITS tool invocation is also customisable, allowing fine-grained control over which tools run on particular file extensions, which could speed up the overall execution time of FITS if implemented in the CMS.
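As a rough illustration of why per-extension control saves time, the following Python sketch picks which tools to run for a given file based on a table of excluded extensions.  It only models the idea and does not reproduce the FITS configuration format.  The function name `tools_for` is hypothetical, and the exclusion sets below are invented for the example and are not FITS defaults.

```python
from pathlib import Path

# Hypothetical exclusion table: tool name -> extensions that tool should skip.
# These values are made up for illustration only.
EXCLUDED_EXTENSIONS = {
    "droid": set(),
    "jhove": {"dng", "csv"},
    "tika": {"zip"},
}


def tools_for(path: str, exclusions=EXCLUDED_EXTENSIONS) -> list[str]:
    # Normalise the extension: lower-case, without the leading dot.
    ext = Path(path).suffix.lower().lstrip(".")
    # Run every tool that has not been told to skip this extension.
    return [tool for tool, skipped in exclusions.items() if ext not in skipped]


if __name__ == "__main__":
    for name in ("scan.DNG", "archive.zip", "report.pdf"):
        print(name, "->", tools_for(name))
```

Skipping a tool that is known to be slow or unhelpful for a given extension means fewer tool runs per file, which adds up quickly across a large ingest.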

The most useful aspect of FITS for the ADS, though, is the normalised XML output, which will make it much easier for the digital archivist to characterise files accurately.  The FITS output identifies the tools which agreed on a format and highlights the points on which they disagreed.  These outputs could easily be integrated into the CMS to provide a GUI front end to FITS, so that the processing and alignment of the digital objects is done in one familiar interface.  This is made simpler still by the fact that FITS is open source, so tweaking and updating it for the ADS is straightforward.  The most immediately useful contribution to the FITS source code would be a full implementation of the Apache Tika integration, which we will likely begin helping with in the near future.
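The agreement and disagreement reporting could be approximated along the following lines.  This Python sketch groups tools by the format each one reported and flags whether they all agree.  It is an assumption about the general approach, not the actual FITS consolidation logic.  The function name `summarise`, the status labels, and the sample identifications are hypothetical.

```python
from collections import defaultdict


def summarise(identifications):
    """Group tools by reported format.

    `identifications` is an iterable of (tool, format_name) pairs.
    Returns a (status, groups) tuple, where groups maps each reported
    format to the list of tools that reported it.
    """
    by_format = defaultdict(list)
    for tool, format_name in identifications:
        by_format[format_name].append(tool)

    if not by_format:
        status = "UNKNOWN"   # no tool produced an identification
    elif len(by_format) == 1:
        status = "AGREED"    # every tool reported the same format
    else:
        status = "CONFLICT"  # at least two different formats reported
    return status, dict(by_format)


if __name__ == "__main__":
    # Made-up results for a single file, for illustration only.
    status, groups = summarise(
        [("droid", "PDF 1.4"), ("jhove", "PDF 1.4"), ("tika", "PDF")]
    )
    print(status)
    for format_name, tools in groups.items():
        print(f"  {format_name}: {', '.join(tools)}")
```

A CMS front end could use the status to decide which files need an archivist's attention, and the groups to show which tools are on each side of a disagreement.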

Overall the hackathon was a valuable experience, and I really enjoyed meeting and talking with a lot of other developers in the digital preservation world.  I am looking forward to integrating FITS into the CMS as well as contributing more code to the FITS project.  I was also very impressed with C3PO and hope to deploy it as well for visualising our collection of digital objects.  The fact that C3PO also works with FITS outputs is especially convenient, and gives us one more reason to use FITS within our CMS.