PDF, or PDF/A: that is the question

The Portable Document Format (PDF) remains the de facto format for sharing printable documents across the web. As such, the PDF has become deeply embedded within personal, institutional and governmental workflows since its inception in 1993; indeed its pervasiveness is highlighted by the 100,000 or so PDFs within the ADS’ collections, making it by far our most common file type. We therefore thought it might be useful to provide some insight into the PDF, and its archival equivalent PDF/A, so that you can benefit from our (very!) long discussions and sleepless nights.

So what is PDF/A?

Essentially it is a constrained form of PDF version 1.4 that makes it more suitable for archiving and long-term preservation (the A standing for Archive). As an ISO standard (ISO 19005-1:2005), it requires the file to be self-contained: it must not need external programs or information in order to be displayed. As a result certain content is prohibited (e.g. audio, video, JavaScript or other executable content, and LZW compression) and any encryption is forbidden. Similarly, unlike a regular PDF file, which may substitute fonts that are unavailable, a PDF/A will store all fonts within the file structure. Certain metadata is also mandated. Within the PDF/A standard there are two levels of compliance:

  • PDF/A 1a – meets all the requirements of the standard.
  • PDF/A 1b – meets a much lower level of compliance, which allows for the retention of the visual appearance of the file, but does not secure the structural or semantic properties of the file.

PDF/A: Why bother?

Unfortunately, while PDF is an open standard it is still essentially a proprietary format and is generally regarded as unsuitable for preservation (see Guides to Good Practice for a fuller discussion). Consequently, PDF/A has become widely accepted as a viable alternative for preserving PDF content (e.g. by the Library of Congress). However, PDF/A remains in essence a proprietary format and is therefore far from ideal for preservation. As a result, at the ADS we suggest that a better alternative for the long-term preservation of files is the retention of those ‘original’ files that were used to create the PDF (Word, ODT, etc). Unfortunately this is not always possible, in which case PDF/A is the next best alternative.

Creating a PDF/A?

The creation of PDF/A compliant files is becoming increasingly easy, with a wide range of commercial and freeware products available that can convert existing PDF files, whilst Microsoft Office 2007 and OpenOffice can handle direct conversions to PDF/A from their own formats. It should be noted that PDF/A files produced in this manner are generally only PDF/A 1b compliant. The high volume of PDFs within our workflow means that the only practical option for large PDF collections, such as the Grey Literature Library, is a batch process. Initial experiments with Adobe Acrobat proved unsatisfactory, but a higher degree of success was reached with PDFTron’s PDF/A Manager, which not only allowed batch conversions but also included a validation tool.

When is a PDF/A not a PDF/A?

Just to muddy the waters a little further, a note of caution is needed when creating PDF/A files. You may have noticed the helpful blue banner (beneath the menu bar) that appears in Adobe Reader or Acrobat to notify you that the PDF you are reading, or have just created, is a PDF/A file. Unfortunately this message only indicates that the file claims conformance to the PDF/A standard; it does not mean that the file actually adheres to the standard. As a result we always find it worth checking that the outcomes of conversions really are PDF/A using a validation programme (here at the ADS we use Adobe Acrobat, but there are others available).
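To illustrate the difference between claiming and conforming, here is a minimal Python sketch (not part of our own workflow) that reports what a file claims about itself. It assumes the claim is recorded in the file's XMP metadata as pdfaid:part and pdfaid:conformance, and that this metadata is stored uncompressed, as PDF/A requires. The function names claimed_pdfa_level and report_claims are our own inventions for this example. Much like the blue banner, it reports only the claim and validates nothing.

```python
import re
from pathlib import Path

# PDF/A identification lives in XMP metadata as pdfaid:part and
# pdfaid:conformance, written either as elements or as attributes.
_PART = re.compile(rb'pdfaid:part(?:>|=["\'])\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance(?:>|=["\'])\s*([A-Za-z])')


def claimed_pdfa_level(data: bytes):
    """Return the claimed level (e.g. '1b'), or None if no claim is found.

    This only reads the claim; it does NOT check that the file conforms.
    """
    part = _PART.search(data)
    conf = _CONF.search(data)
    if not part or not conf:
        return None
    return part.group(1).decode() + conf.group(1).decode().lower()


def report_claims(folder: str):
    """Map every .pdf file under folder to its claimed PDF/A level."""
    return {
        str(path): claimed_pdfa_level(path.read_bytes())
        for path in sorted(Path(folder).rglob("*.pdf"))
    }
```

A file for which this returns '1b' still needs to be run through a proper validator before it can be trusted as PDF/A; a file for which it returns None makes no claim at all.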

Complicating matters and future directions

In the last twelve months a second ISO standard (ISO 19005-2:2011), based on PDF version 1.7 and called PDF/A 2, has been published. Unlike PDF/A 1, it allows JPEG2000 compression, supports transparency effects and layers, the embedding of OpenType fonts, and digital signatures. It also allows archiving of sets of documents as individual documents in one file (PDF/A Competence Center, 2011).

Like PDF/A 1 there are levels of compliance, and those files which are PDF/A 1 compliant will meet the PDF/A 2 standard. A full appraisal of these developments is pending, but we will keep you up to date with any developments.

Further information on the use of PDF/A as a preservation format can be found in the ADS’ Guides to Good Practice.

3 thoughts on “PDF, or PDF/A: that is the question”

  1. Thanks for the information on what is a pertinent topic concerning digital outputs of archaeological work and research. I can’t help thinking that too much of archaeological output is locked into PDF format, and that if steps are not taken (as it seems the ADS are taking with their work with PDF/A) we could be facing another microfiche scenario, or perhaps even worse. To quote the film ‘Terminator’, “a storm is coming”.

  2. Hi Frank – thanks for your comments. Although not ostensibly the most interesting of topics PDF vs PDF/A is a key issue for those interested in the preservation (and thus future dissemination) of archaeological information. I don’t think it’s controversial to say that PDF is now the default format for reports (see Ray’s stats), and the use of PDF for scans or copies of plans, sections and elevations is becoming increasingly common. In much the same way microfiche was championed as a publication solution in the late seventies (see Mytum in Antiquity 52), I’ve been getting the feeling that PDF is being viewed in the same way today. It’s definitely a subject for us to stay on top of!

  3. Thanks for your comments, you have both made valid points about the use of PDF. Frank, as you correctly observe, a lot of archaeological material is now “locked into PDF format”, and whilst PDF remains an excellent format for exchanging material, once material is transformed into that format it is essentially impossible to get it back out. This isn’t so bad for text documents, but for drawings, plans or images it is almost catastrophic, meaning the reuse potential is heavily reduced. At the same time, and something not many people automatically realise, the conversion process to standard PDF invariably means that content is downsampled, resulting in a loss of detail and data at the outset. Tim’s observation is astute; ease of creation and small file size mean that the PDF is often considered a solution to archiving large datasets, but the significant advantage of fiche is the much higher resolution with which it preserves data, higher than the standard PDF. Again, loss of resolution means a loss of information. Creating PDF/A files at the outset negates some of the loss of resolution, but it doesn’t counteract the ‘locked in syndrome’.