Digitisation

This guide explains how to digitise journal articles or grey literature reports with a view to their long-term sustainability. The guidance is based on typical requirements for the deposit of data with the ADS, although the principles laid out here represent good practice more generally.

Why are you digitising your journal articles or grey literature report?

There are a number of reasons why you may wish to undertake a digitisation programme, but two are relatively common:

We want to free up shelf space and destroy our hard copy: If this is the case then you really should think carefully about the management overhead involved in looking after the digital version of the reports/articles as they will be the only copy of the information you have. Not only can digital versions take up a relatively large amount of server space that some public bodies may struggle to release, but also digital archives take just as much looking after as hard copy. In addition, producing an effective and useful digital resource is not as simple as undertaking a quick scanning programme; the real costs fall on the work required in indexing. Even with specialist natural language processing software, creating a useful index to the scans can be excessively time consuming.
We want to put all our information online: Greater access to journal articles or grey literature reports is something that most people in the profession seem to support. However, careful thought needs to be given to the issue of copyright. You don’t necessarily have to hold the copyright yourself, but you may require permission from the copyright holder to disseminate their work online.

What are your Digitisation Options?

You may want to outsource your digitisation work, and there are specialists who offer this service. You may also find competitive rates at a local high-street copy shop, some of which offer this sort of service. A commercial contractor would usually undertake the digitisation in one of two ways: destructive or non-destructive. The ‘destructive’ method involves cutting off the binding and feeding the pages through a scanner automatically; the non-destructive method involves turning the pages manually and is therefore more expensive.

What will you need to do it yourself?

To digitise a copy of a journal or a grey literature report you will need access to the following hardware and software:

A scanner capable of scanning the full size pages and generating black and white, greyscale, and colour images.
Suitable graphics software (Adobe Photoshop, Paint Shop Pro, Photopaint, Openoffice etc.) to edit and save the scanned images.
A computer with a CD/DVD writer or other output device (to store final copies of the files).
Spreadsheet or Database software (e.g. MS Excel, Access, etc) to create an index of scanned files (see below).
Adobe Acrobat PDF maker – or similar software – to convert scanned TIFF files to PDF files, preferably to PDF/A.

What will you produce?

Scanning a journal or grey literature backlog can take a long time and is typically spread over a number of sessions so it is important to have a clear understanding of what you are going to create. When you have finished, the paper copies will be represented by a series of digital files. These files will be:

The Original/Archival files, i.e. a large number of TIFF image files, one for each page of the journal including the covers and indexes. Publications of A5 size and smaller can be scanned two pages at a time, though it is still recommended that the pages are split during editing and saved as separate files. The TIFF files are best stored together in a folder/directory per journal volume or, for grey literature, in a folder/directory organised by date or HER number. Pages with text and line drawings should be saved as black and white images, pages with photographs as greyscale images and pages with colour as colour images. This can be done either at the scanning stage (i.e. scanning directly in the desired format) or after scanning via the graphics software (i.e. scanning all pages in full colour and then downgrading them). Note: the selective use of black and white, greyscale and colour images is important as it helps minimise the size of the resulting TIFF files.
The Dissemination files, i.e. a number of PDF files, each representing (depending on size) either an article or a grey literature report. The original structure of the journal or set of reports will also influence the structure of the PDF files, e.g. plates or figures may be saved as individual files, as may plans and site images. It is recommended that individual PDF files do not exceed 50MB each. OCR can be performed on the PDF using the text recognition function of your PDF software (e.g. Document > OCR Text Recognition in Adobe Acrobat).
An Index file listing all the articles, figures and covers in relation to both their archival TIFFs and dissemination PDF files (see below).

File Naming

All files produced, as well as their folders/directories, should be named in a logical and consistent fashion. Filenames should use only alphanumeric characters (a-z, 0-9), the hyphen (-) and the underscore (_). No other punctuation or special characters should be included within the filename. Both upper and lower case characters can be used in a filename, but try to keep filenames within your project consistent and ensure the Index file accurately reflects the case of your filenames. We recommend using the underscore character to imply a space within your filename.

A full stop (.) should only be used as a separator between the filename and the file extension and should not be used elsewhere. Files must have a file extension to help future users of the resource determine the file type. File extensions are normally 3 characters long and should be lower case.

It is best to start all filenames with a shortened version of the journal title and volume/issue number or perhaps use the county, district, parish or unit name e.g. Medieval Archaeology Vol.1 = ‘medarch001’

Note: For all filenames use leading zeros for numbering e.g. ‘001’ rather than ‘1’ as this allows files to be easily ordered.
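The naming rules above can be checked automatically before files are deposited. The following is a minimal sketch in Python rather than an official ADS tool: the function names are hypothetical, and allowing extensions of 3 or 4 characters is an assumption based on the statement that extensions are normally 3 characters long.

```python
import re

# Assumed pattern: letters, digits, hyphen and underscore in the name,
# then exactly one full stop and a lower-case extension of 3 or 4 characters.
FILENAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+\.[a-z0-9]{3,4}")


def is_valid_filename(name: str) -> bool:
    """Return True if `name` follows the naming rules described above."""
    return FILENAME_PATTERN.fullmatch(name) is not None


def zero_pad(number: int, width: int = 3) -> str:
    """Format a volume or page number with leading zeros, e.g. 1 -> '001'."""
    return f"{number:0{width}d}"


if __name__ == "__main__":
    for candidate in ["medarch001_001.tif", "Med Arch Vol.1.tif", "medarch001_001.TIF"]:
        print(candidate, is_valid_filename(candidate))
```

A check like this is most useful when run over a whole folder at the end of each scanning session, so that naming problems are caught before the Index file is compiled.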

TIFF File Names

TIFF files should be named with the journal name, volume number and page number e.g. medarch001_001.tif

Where pages are unnumbered, such as covers etc, the following naming convention should be used:
medarch001_000_1-cover.tif or medarch001_000_2-contents.tif

This scheme, which can easily be modified for grey literature reports, will ensure that all filenames have a unique name while also allowing them to be placed in order and easily referenced to the paper version.
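As an illustration of this scheme, the short Python sketch below builds TIFF filenames for numbered and unnumbered pages. The function names and parameters are hypothetical and the three-digit padding simply follows the examples given above; adapt the prefix for grey literature as needed.

```python
def tiff_name(prefix: str, volume: int, page: int) -> str:
    """Name a numbered page, e.g. ('medarch', 1, 1) -> 'medarch001_001.tif'."""
    return f"{prefix}{volume:03d}_{page:03d}.tif"


def unnumbered_tiff_name(prefix: str, volume: int, order: int, description: str) -> str:
    """Name an unnumbered page using the 'zero page' convention plus an order number."""
    return f"{prefix}{volume:03d}_000_{order}-{description}.tif"


if __name__ == "__main__":
    names = [
        tiff_name("medarch", 1, 1),
        unnumbered_tiff_name("medarch", 1, 2, "contents"),
        unnumbered_tiff_name("medarch", 1, 1, "cover"),
    ]
    # A plain alphabetical sort puts the cover and contents before page 001.
    for name in sorted(names):
        print(name)
```

Because the unnumbered pages use ‘000’ followed by an order number, an ordinary alphabetical sort reproduces the order of the paper volume.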

PDF File Names

PDF files should be named in a similar fashion to the TIFFs but, as they contain multiple pages, the name should include the page range. The author’s surname, a shortened version of the title or the plate/figure number can also be included e.g. medarch001_001-012_surname.pdf or medarch001_015-015_plate1.pdf

In cases where an article starts/ends on the same page as another article these pages should be duplicated in both PDF files so that each file contains a complete article. As with the TIFF files, PDF files containing sections or pages that are unnumbered should use a ‘zero page’ number with an associated order number and description e.g. medarch001_000_1-2-cover_contents.pdf
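The PDF naming convention can be sketched in the same way. This Python example is illustrative only: the function name is hypothetical, and the surnames ‘smith’ and ‘jones’ are invented to show two articles that share page 12.

```python
def pdf_name(prefix: str, volume: int, first_page: int, last_page: int, label: str) -> str:
    """Name a PDF by volume, page range and a short label (surname, title or plate)."""
    return f"{prefix}{volume:03d}_{first_page:03d}-{last_page:03d}_{label}.pdf"


if __name__ == "__main__":
    # Hypothetical articles: the first ends and the second begins on page 12,
    # so page 12 is included in both files and in both page ranges.
    print(pdf_name("medarch", 1, 1, 12, "smith"))
    print(pdf_name("medarch", 1, 12, 20, "jones"))
    # A single plate uses the same page number for the start and end of the range.
    print(pdf_name("medarch", 1, 15, 15, "plate1"))
```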

Scanning

Decide prior to scanning which orientation – i.e. the positioning of the journal on the scanner – the journal will be scanned in. Later adjustments to the digital image are time consuming and may result in loss of quality.

Decide on the bit depth/colour depth and resolution that you are going to scan at:

The bit depth (which determines the number of distinct tonal variations that can be captured for each colour channel) that you choose should reflect the type of document that is being scanned. Scanning a black and white text document in 24-bit colour (i.e. 8 bits per RGB channel) will produce an image that contains a large amount of information, much of which may not be relevant to the original or the intended use of the digital copy. Conversely, full colour scans of old black and white photos may more accurately capture other tones present in the image and create a more faithful reproduction. As a basic guideline, it is suggested that images are scanned at the highest available bit depth in either greyscale or colour modes. For greater accuracy it is also suggested that users calibrate their scanners and monitors to a colour chart to check that they are correctly set up. Colour, contrast, brightness etc. should also be adjusted if a scanned image is visibly different from the original document.

In terms of resolution – i.e. the level of detail at which a document is captured by the scanner – journals should be scanned at a consistent resolution throughout, though variation can be made when dealing with images.

As a general guideline the following minimum resolutions should be followed:

Text at 300 dpi using either 1-bit (black and white) or 8-bit (greyscale); 300 dpi is the recommended minimum if OCR is to be used at a later date.
Line drawings at a minimum of 300 dpi using either 1-bit (black and white) or 8-bit (greyscale).
Photos/images at 600 dpi using 8-bit greyscale or 24-bit colour.

Note: These resolutions are recommended for archive files. They may produce large files, so the resolution can be reduced for files that are being created for dissemination purposes only.
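The effect of resolution and bit depth on file size can be estimated with simple arithmetic: an uncompressed image needs roughly (width × dpi) × (height × dpi) × bits per pixel ÷ 8 bytes. The Python sketch below applies this to an A4 page (8.27 × 11.69 inches). It is an approximation only, assuming 1 bit per pixel for black and white and ignoring TIFF headers and row padding; the function name is hypothetical.

```python
def uncompressed_size_mb(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> float:
    """Approximate size in megabytes of an uncompressed scan of the given page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel / 8 / 1_000_000


if __name__ == "__main__":
    a4_width, a4_height = 8.27, 11.69  # A4 page size in inches
    settings = [
        ("text, black and white, 300 dpi", 300, 1),
        ("text, greyscale, 300 dpi", 300, 8),
        ("photo, greyscale, 600 dpi", 600, 8),
        ("photo, colour, 600 dpi", 600, 24),
    ]
    for label, dpi, bits in settings:
        size = uncompressed_size_mb(a4_width, a4_height, dpi, bits)
        print(f"{label}: about {size:.1f} MB per page")
```

Doubling the resolution quadruples the number of pixels, which is why choosing the lowest suitable bit depth and resolution for each page type makes such a difference to the size of the archive.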

It is recommended that a trial run using pages containing different content and at different resolutions and bit depths is carried out to aid in selecting the final scanner settings. Scanned images should be checked against the original document to verify that the chosen resolution is capturing the document in a suitable amount of detail. Finally, prior to scanning, all images and the scanner should be checked to see that they are clean and dust free.

Saving the Images

Once the page has been scanned, any changes to the bit depth – e.g. conversion to greyscale or black and white – should be made. Each scanned page should then be saved as an uncompressed (i.e. without LZW compression) TIFF file. Please do not create multipage TIFF files as they cannot be archived in this format. It is advisable to save each set of TIFF files in a directory representing each journal volume or batch of grey literature.

Conversion to PDF

Once a volume has been scanned it should be converted to PDF format. This can be done either article by article or for an entire volume. Plates/figures may need to be saved individually. In most PDF applications the conversion to PDF is a straightforward task and simply requires the relevant TIFF files to be selected and placed in the desired order (this is another task which is made much easier when using a sensible file naming convention). It is recommended that individual PDF files do not exceed 20MB each. There are a few guidelines that should be followed to make these files as sustainable as possible:

PDF files should be created / saved as PDF/A (see below).
The following settings should be used:
- Embed and subset all fonts
- Embed all colour information
- Embed all images
- Do not link to/reference external files
- Use standards-based metadata where appropriate
- Add PDF tags to the document to provide structure
- Use the ‘make searchable’ option to perform OCR on the scan
Do not include:
- Audio and video content (or other multimedia)
- PDF transparency
- Encryption or security measures e.g. passwords or printing/opening/editing restrictions
- LZW compression
- JavaScript
- Executable file launches
- Any security features

PDF/A

Standard PDF files, while ideal as a means of disseminating formatted text, are not easily converted to a stable archival format. However, an archival format of the PDF standard is now available. This is called ‘PDF/Archive’ or PDF/A. PDF/A files are more suitable for long-term preservation than normal PDF files so if you do not wish to keep archival versions of your scans as TIFF image files (which can generate very large file sizes) you could create PDF/A directly from scanning.

The PDF/A format is supported by many different software packages, making it a format that is relatively easy to create and work with. You may already have software that will allow you to create PDF/A files, but if you do not, there is software freely available to download that should help (for example at freepdfcreator.org). For more help with PDF/A please contact the ADS helpdesk.

Creating the Index File

The index file should provide a tabulated ‘table of contents’ for the journal that you are scanning and can be created as either a spreadsheet or a database. The index file also provides a means of associating each PDF file that is created with the original journal article or report. As a basic guide the index file should include the following fields (as applicable):

Author – article/report author
Title – the article/report title
Year – year of publication e.g. 2008
Pages – article’s page range e.g. 20-25
Volume – journal volume e.g. 3
Site reference or HER number
Location information (District/Parish)
Filename – filename of the relevant PDF and TIFF file
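If the index is kept as a spreadsheet, a plain CSV file is one simple way to store it. The following Python sketch writes an index with the fields listed above using the standard csv module. The column names, the split of the Filename field into separate PDF and TIFF columns, and the example record are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical column names based on the fields listed above.
INDEX_FIELDS = [
    "author", "title", "year", "pages", "volume",
    "site_reference", "location", "pdf_filename", "tiff_filenames",
]


def write_index(rows, handle) -> None:
    """Write index records (dictionaries keyed by INDEX_FIELDS) as CSV."""
    writer = csv.DictWriter(handle, fieldnames=INDEX_FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Invented example record; real entries come from the scanned volume.
    example = {
        "author": "Surname, A.",
        "title": "An example article title",
        "year": "2008",
        "pages": "20-25",
        "volume": "3",
        "site_reference": "",
        "location": "",
        "pdf_filename": "medarch003_020-025_surname.pdf",
        "tiff_filenames": "medarch003_020.tif to medarch003_025.tif",
    }
    buffer = io.StringIO()
    write_index([example], buffer)
    print(buffer.getvalue())
```

To write to disk instead, open the output with open("index.csv", "w", newline="", encoding="utf-8") and pass the file handle to write_index.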

For Further Help and Information

http://www.dcc.ac.uk/

http://www.dpconline.org/
