One Pitfall of a Linked Open Data World

At the 7th World Archaeological Congress in Jordan, Martin Doerr raised a concern about the Linked Open Data world that was being advocated in our session. In particular, he was worried about the assumption that all of this Linked Open Data would remain persistently and indefinitely accessible, and he suggested that people keep RDF or other serializations of the Linked Open Data they use, particularly vocabularies and thesauri. This seemed like a good idea to us given the fragility of the web, and we have been informally promoting it at conferences and workshops.

A reason to heed Martin’s advice has just presented itself in the form of the recent US Government shutdown, which has brought down the Library of Congress website, including the id.loc.gov domain that hosts its linked data records.

A screenshot from the Library of Congress website

We have aligned a lot of our archive metadata with the Library of Congress Subject Headings, which means all of our archive Linked Data includes links to the id.loc.gov domain. With the Library of Congress servers shut off, those links currently hang. This doesn’t “break” our Linked Open Data, and our archive metadata is still coherent and sensible. However, users of our Linked Open Data will find link rot, which is never good, particularly within a technology that is predicated on persistent links. It’s telling when even a seemingly rock-solid organisation such as the Library of Congress can have its website taken down by poor “management”.

A collection with lots of broken references to Library of Congress Subject Heading concepts

We maintain our own copy of the Library of Congress Subject Heading Linked Data, which we downloaded six months ago and keep in a triple store. Not only does that give us security if anything gets turned off again, but we can also interrogate the dataset with ease from within our own network, regardless of political wranglings 3,000 miles away. Again, even without resolvable URLs for those Library of Congress Subject Heading concepts, our Linked Open Data is still valid and makes sense, but this event certainly adds support to Martin Doerr’s recommendations in Jordan.
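To illustrate why a local copy is useful, here is a minimal sketch of looking up a concept label offline. It is not how our triple store works: it assumes the vocabulary has been saved as N-Triples and that labels are published with skos:prefLabel. The function name `load_pref_labels`, the identifier `sh00000000` and the label "Example heading" are made up for the example.

```python
import re
from typing import Dict, Iterable

SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Matches only the simple N-Triples shape: <subject> <predicate> "literal" .
# with an optional language tag or datatype. Escapes inside the literal are
# kept as written; a real RDF parser would decode them.
_LITERAL_TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]+>)?\s*\.\s*$'
)


def load_pref_labels(lines: Iterable[str]) -> Dict[str, str]:
    """Build a {concept URI: preferred label} index from N-Triples lines."""
    labels: Dict[str, str] = {}
    for line in lines:
        match = _LITERAL_TRIPLE.match(line.strip())
        if match and match.group(2) == SKOS_PREF_LABEL:
            # Keep the first label seen for each concept.
            labels.setdefault(match.group(1), match.group(3))
    return labels


# Hypothetical two-line dump; a real one would be read from a file on disk.
SAMPLE_DUMP = [
    '<http://id.loc.gov/authorities/subjects/sh00000000> '
    '<http://www.w3.org/2004/02/skos/core#prefLabel> "Example heading"@en .',
    '<http://id.loc.gov/authorities/subjects/sh00000000> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://www.w3.org/2004/02/skos/core#Concept> .',
]

labels = load_pref_labels(SAMPLE_DUMP)
print(labels.get("http://id.loc.gov/authorities/subjects/sh00000000"))
```

The lookup never touches the network, so it keeps working whether or not id.loc.gov resolves. For a dataset of several million triples, a proper RDF parser or a triple store would be the sensible choice over this line-by-line approach.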

7 thoughts on “One Pitfall of a Linked Open Data World”

  1. Hi Michael,
    Nice post following on from what we saw on Tuesday.
    Any chance ADS also has a ‘backup’ for the US Government too?
    Might come in handy once in a while 🙂
    It would be interesting to learn the criteria by which some of the US Gov ‘legislative’ websites were kept running while the Library of Congress was obviously not considered a priority.
    Best Wishes
    K

    1. I should have given you credit in helping me discover this!

      We do have a sort of “backup” of the data in the form of the 4 million triples associated with the LoC Subject Headings, as do many others, I would imagine. The LoC did not have a queryable SPARQL endpoint for that data, so downloading the raw RDF was the only way to interrogate it. This may have been by design, which makes sense in the light of this experience.

  2. Michael, this is a very interesting commentary, but being an optimist I would say that the situation you describe is not a pitfall but a strong advantage over the existing or previous alternatives.

    URIs may not resolve to URLs but they are still valid Uniform Resource Identifiers, as you acknowledge. LOC data is not only linked but open, so you can keep a copy of an external resource. If needed, one could set up an id.loc.gov mirror (and possibly even adjust one’s DNS configuration). This is unprecedented and, I think, much better than the kind of “Semantic Web” that was commonly presented a few years ago, based essentially on a closed world assumption. The real challenge here is to enable easy and quick creation of such mirror networks, trying to avoid centralised systems (e.g. purl.org ‒ in a different area). In the age of cloud computing this shouldn’t be too difficult, at least in theory. What would happen if the ADS shut down tomorrow?

    1. I like your optimism Stefano! You are very right though, the model does not break when a LOD endpoint goes offline, and a LOCKSS approach potentially builds in that redundancy. There’s a fair concern about the potential versioning chaos that creates, but that’s a better problem to have than data extinction.

      The decentralised approach makes much more sense, and Ceri Binding, Holly Wright and I talked a lot about this for the SENESCHAL/Heritage Data LOD. Thankfully we were able to get a purl.org domain for the official Heritage Data concepts.

      As for an ADS shutdown, we have a “preservation” plan and fund for the ADS to ensure the safe migration of our data (this is part of any archive accreditation), which would include our LOD. Dereferencing would remain a potential issue since our domain may not be transferable, but as long as our underlying RDF is widely distributed our data should remain useful.

  3. I wholly agree. To me, joining the British Museum as Development Manager, one of my key points about Linked Open Data is that, while offering a public SPARQL endpoint is a laudable public good, offering the data dumps behind it, with the full set of ontologies on which reasoning relies, is a must. You’ll see this in our upcoming announcement of our revised collections data.
