The other week I had the opportunity to participate in the SPRUCE Hackathon hosted by Leeds University. Hackathons are a chance for developers to get together and work on (or hack) common problems. Typically hackathons in the USA are fuelled by Mountain Dew and pizza, but as this was a British hackathon it was mostly fuelled by tea and cakes (and mighty fine cakes, thanks to Becky). The hackathon focused specifically on file characterisation: precisely identifying and describing the technical characteristics of a file as well as its metadata. This is an ongoing challenge for practitioners in digital preservation, since there are many file formats, many versions of each format, and little consistency in the way formats and their versions internally identify themselves. Digital archivists need to know more than just the file extension or the format’s name, as Gary McGath sums up nicely in his recent Code4Lib article:
Just knowing the format’s generic name isn’t enough. If you have a “Microsoft Word” file, that doesn’t tell you whether it’s a version from the early eighties, a recent document in Microsoft’s proprietary format, or an Office Open XML document. The three have practically nothing in common but the name.
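To make that point a little more concrete, here is a minimal sketch of my own (not taken from any of the tools mentioned in this post) that looks at the first few bytes of a file rather than its name. It relies on two well-known signatures: the OLE2 compound file header used by Word 97-2003 documents, and the ZIP header used by Office Open XML. The function names are hypothetical, and a match only identifies the container, not the application: other Office formats share the same signatures, and very early Word files use neither.

```python
# Illustrative sketch only: real characterisation tools use far richer
# signature sets and also inspect the structure inside the container.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (e.g. Word 97-2003 .doc)
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header (container for Office Open XML .docx)


def guess_word_container(header: bytes) -> str:
    """Guess the container format from the leading bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "OLE2 compound file"
    if header.startswith(ZIP_MAGIC):
        return "ZIP container"
    return "unknown"


def guess_from_path(path: str) -> str:
    """Read the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as f:
        return guess_word_container(f.read(8))
```

Two files that both end in a Word extension can land in different branches here, which is the problem the quote describes.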
Thankfully there are a number of characterisation tools to help digital archivists tell such files apart, and among the attendees at the hackathon were some of the key developers behind major tools such as JHOVE, JHOVE2, FITS, DROID and C3PO. This provided an exciting opportunity to work alongside them on their tools and learn more about how the tools work.