Attribute data and databases

Mark Gillings and Alicia Wise, with contributions by Mark Gillings, Peter Halls, Gary Lock, Paul Miller, Greg Phillips, Nick Ryan, David Wheatley, and Alicia Wise. Revised by Tim Evans, Peter Halls and Kieron Niven (2011), Archaeology Data Service / Digital Antiquity, Guides to Good Practice

Information commonly stored or manipulated using a GIS tends to have two main components: the spatial data and the descriptive attributes. For many users, and with many software products, these two data types may appear to be a seamless unit. There are, however, some data management issues that are specific to either spatial or attribute data, and certain general issues that are common to both forms of information.

Attributes are data that describe the properties of a point, line, or polygon record in a Geographic Information System. For example, imagine a GIS coverage in which points represent sites on a landscape. The attribute data that accompanied this coverage would record more detailed information about each site. Attribute information might include an indication of the time period in which the site was occupied (e.g. Neolithic, Iron Age, Medieval), full descriptions of the archaeological deposits excavated from each site, and an indication of the class of artefacts found on the surface at each site.

Archaeological attribute data already exists in myriad forms. These can range from simple card indexes (for example, the results of a graveyard survey undertaken using the Council for British Archaeology guidelines) through to complex digital databases recording a wealth of detailed information. Such databases sometimes include descriptions of all the archaeological sites in a country or county, and sometimes contain very detailed site-specific information such as stratigraphic records. This diversity is on the increase as the use of computers grows in archaeology.

In archaeological GIS you will often be linking and combining attribute information collected by others, and turning this information to new purposes.

Common sources of attribute data

Below are some likely sources of attribute data which you may come across, and wish to re-use:

paper based card indexes
archaeological site and survey archives (including paper based records, finds databases)
qualitative report texts and articles published in journals (paper based or on the Internet)
microfiche archives
geophysical interpretation data derived from interpreted geophysics plots
aerial photograph interpretations which may include morphological analysis, attribute data and photo source information
typological databases or artefact type series
data generated at a regional level for integrated large scale historic landscape studies
local level archaeological databases (e.g. Sites and Monuments Records or Urban Archaeological Databases where they are held separately)
local museum site and finds databases
local Record Offices
national archaeological databases (such as the various National Monument Records or English Heritage’s database of Scheduled Ancient Monuments)
Gardens Trust surveys
historic buildings surveys and databases maintained by local authorities
metadata relating to data sets

Designing a new attribute database

Whether you are using pre-existing attribute data or actually collecting new information yourself, you will need to think carefully about the design of your new attribute database. A great deal of literature exists on this complicated topic – it is quite literally a topic which has launched a thousand PhD research projects! Some good sources of basic information can be found in Batini et al. (1992), Date (1995), Ryan and Smith (1995), and Whittington (1988).

Archaeologists should also be aware of MIDAS Heritage, the Monument Inventory Data Standard[1]. This standard is designed for those establishing a new attribute database in which to manage archaeological information or for those who have been working with archaeological attribute databases for a long time.

The principal types of database structure

Database systems should be efficient tools for the storage, analysis and reporting of your data. As a result, the choice of database package and data structure used in a given project should be dictated by the requirements of each organisation or project. It is not within the scope of this guide to enter into a discussion of the merits and failings of software packages. Instead a short overview of the types of database data models is presented.

Data structures currently fall into four major types: flat file, hierarchical, relational, and object oriented. More detailed discussion of these can be found in Fundamentals of Spatial Information Systems (Laurini and Thompson 1996), especially pages 620-38 on object oriented databases.

Flat file data structures

In this simplest form of data structure, data are arranged in consecutive horizontal rows, with attributes stored in vertical columns. One row stores all attributes for a single entry (object) in the database. If many of the objects in the database share the same attribute values, those values must be entered many times, leading to data redundancy and often to empty fields. A common example is the card index.

Hierarchical data structures

Hierarchical data structures have useful applications within archaeology as they arrange the objects in a database in a tree of linked parent and child records. This can be used to model the breakdown of the historic environment into ‘monuments within monuments’ and allows flexible searching across the hierarchy. The most common applications for these database structures are in cultural resource management environments such as Historic Environment Records (HERs) and national databases, which contain large amounts of data and need efficient, speedy searches.
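The parent and child arrangement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how any HER system is implemented; the Record class, the path_to function and the monument names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One record in the tree; each record holds its own child records."""
    name: str
    children: list = field(default_factory=list)

    def add_child(self, name):
        child = Record(name)
        self.children.append(child)
        return child


def path_to(record, name, trail=()):
    """Search the tree depth-first and return the names from the root
    down to the first record called `name`, or None if it is absent."""
    trail = trail + (record.name,)
    if record.name == name:
        return list(trail)
    for child in record.children:
        found = path_to(child, name, trail)
        if found is not None:
            return found
    return None


# Invented example: a hillfort containing smaller monuments.
hillfort = Record("hillfort")
hillfort.add_child("rampart")
roundhouse = hillfort.add_child("roundhouse")
roundhouse.add_child("hearth")

hearth_path = path_to(hillfort, "hearth")
print(hearth_path)
```

The returned path shows which larger monuments a record sits inside, which is the kind of question a hierarchical structure answers easily.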

Relational data structures

Currently the most common type of database used is based on the relational data model. If you imagine a series of tables similar to small flat file databases, with links or relationships between specific unique fields that allow complex queries of different data sets, you have the essence of the relational structure. One table could be of pottery types, another of contexts, and a third of scientific dates. With easily structured queries it is then possible to construct chronologies based on pottery typology or scientific dating.
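As a minimal sketch of the pottery, context and scientific date tables mentioned above, the following uses Python's built-in sqlite3 module with an in-memory database. The table names, field names and all values are invented for illustration; a real project would follow its own data standard.

```python
import sqlite3

# Hypothetical tables, fields and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contexts (
        context_id  INTEGER PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE pottery (
        sherd_id     INTEGER PRIMARY KEY,
        context_id   INTEGER REFERENCES contexts(context_id),
        pottery_type TEXT
    );
    CREATE TABLE scientific_dates (
        date_id    INTEGER PRIMARY KEY,
        context_id INTEGER REFERENCES contexts(context_id),
        method     TEXT,
        years_bp   INTEGER
    );
""")
conn.executemany("INSERT INTO contexts VALUES (?, ?)",
                 [(101, "ditch fill"), (102, "pit fill")])
conn.executemany("INSERT INTO pottery VALUES (?, ?, ?)",
                 [(1, 101, "Grooved Ware"), (2, 102, "Samian"),
                  (3, 101, "Grooved Ware")])
conn.executemany("INSERT INTO scientific_dates VALUES (?, ?, ?, ?)",
                 [(1, 101, "radiocarbon", 4200), (2, 102, "radiocarbon", 1850)])


def dated_pottery(connection):
    """Pair each pottery type with the dates from the same context,
    oldest first, by joining the tables on the shared context_id field."""
    query = """
        SELECT DISTINCT p.pottery_type, d.years_bp
        FROM pottery AS p
        JOIN scientific_dates AS d ON d.context_id = p.context_id
        ORDER BY d.years_bp DESC
    """
    return connection.execute(query).fetchall()


rows = dated_pottery(conn)
print(rows)
```

The join works only because context_id is defined and coded in the same way in every table.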

Object oriented data structures

Object oriented structures are the newest form of data structure, and a limited number of GIS packages currently use the object oriented approach (e.g. Smallworld). While the relational data structure deals with an object’s description by splitting it into single rows held in many discrete but linked tables of similarly grouped attributes, the object oriented approach allows the descriptive attributes of an object (e.g. a monument) to be encapsulated digitally in one place, allowing a more realistic model of the ‘real world’ to be assembled. The geographic location of the object is then just another characteristic of the object, just as function, date and period of existence are.
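The idea of encapsulation can be illustrated with an ordinary Python class. This is not an object oriented database, and it says nothing about how Smallworld or any other package models data. It only shows description and location held together in one object. The Monument class and its example values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Monument:
    """All attributes of one monument, including its location, in one place."""
    name: str
    function: str
    period: str
    easting: int   # metres, assumed to be in a projected grid system
    northing: int  # metres

    def distance_to(self, other):
        """Straight-line distance in metres to another monument."""
        return math.hypot(self.easting - other.easting,
                          self.northing - other.northing)


# Invented example values.
barrow = Monument("Barrow A", "funerary", "Bronze Age", 456000, 267000)
enclosure = Monument("Enclosure B", "settlement", "Iron Age", 456300, 267400)

separation = barrow.distance_to(enclosure)
print(separation)
```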

Issues to consider when structuring and organising a flexible attribute database

When attempting to structure and organise a flexible attribute database the following factors are of critical importance. In the following section each of these issues will be looked at in turn.

Naming conventions
Key fields
Character field definitions
Grid references
Validation
Numeric data
Data entry control
Confidence values
Consistency
Dates
Documentation

Naming conventions

Try to keep field names descriptive rather than cryptic. The crib sheet for decoding cryptic names may easily get lost, and your fields are likely to be too numerous for you to remember their contents easily. Also be aware that some GISs (particularly ArcMap) have limits on the number of characters in field names, so truncation may occur.
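One way to anticipate truncation problems is to check whether any field names would become identical once shortened. The sketch below assumes a limit of 10 characters, which is the limit for field names in the dBASE tables used by shapefiles; check the actual limit for your own software. The function name and the field names are hypothetical.

```python
from collections import defaultdict


def truncation_clashes(field_names, max_length=10):
    """Group field names that become identical when cut to max_length."""
    groups = defaultdict(list)
    for name in field_names:
        groups[name[:max_length]].append(name)
    # Keep only the shortened names shared by more than one field.
    return {short: names for short, names in groups.items() if len(names) > 1}


# Invented field names: the first two share their first 10 characters.
fields = ["excavation_start", "excavation_end", "easting", "northing"]
clashes = truncation_clashes(fields)
print(clashes)
```

Names such as exc_start and exc_end would stay descriptive and survive truncation.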

Key fields

Key fields are the most important fields in your attribute database and are the fields that will be used for primary searching of the database and/or for linking tables within your database. It is essential that the same data definitions are used for all instances of the key field in your database and that the same codes are used in each.

Character field definitions

Take care with character field definitions. Most databases require character data to be stored in a fixed length form and so, inevitably, every record must contain enough space for the largest expected value, even where this is not required for the vast majority of records. As an example, there is no point in defining a location name field large enough to store the longest name in Monmouthshire, Llanvihangel-Ystern-Llewern, if the name Monmouth happens to be the longest in the data set!
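Before fixing a field length, it can help to measure the values you actually hold. The following is a small sketch; the function name and the list of place names are invented for illustration.

```python
def longest_value(values):
    """Return the longest string in `values` together with its length."""
    longest = max(values, key=len)
    return longest, len(longest)


# Invented data set in which Monmouth is the longest name.
place_names = ["Monmouth", "Usk", "Raglan"]
name, width = longest_value(place_names)
print(name, width)
```

Allow some margin above the measured width if further records are likely to be added later.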

Grid references

Store grid references in a notation that allows easy transfer to a GIS or conversion to an appropriate map projection. For example, British National Grid references are commonly held as alphanumeric attributes in a single column, which requires some processing before points can be mapped in a GIS. A more appropriate notation uses two numeric columns, one for the easting and one for the northing (e.g. 456344 / 267833 for SP 5634467833).
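The processing needed to turn an alphanumeric British National Grid reference into two numeric columns can be sketched as follows. The function name is hypothetical. The sketch assumes a two-letter prefix followed by an even number of digits (up to ten), and it returns the south-west corner of the square that the reference describes.

```python
def bng_to_easting_northing(gridref):
    """Convert a reference such as 'SP 5634467833' to (easting, northing)
    in metres. Shorter references are padded with zeros on the right."""
    ref = gridref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if (len(letters) != 2 or not letters.isascii()
            or not letters.isalpha() or "I" in letters):
        raise ValueError(f"bad grid letters in {gridref!r}")
    if len(digits) % 2 or len(digits) > 10 or (digits and not digits.isdigit()):
        raise ValueError(f"bad grid digits in {gridref!r}")

    def index(letter):
        # Letters are numbered A=0 ... Z=24; the letter I is not used.
        i = ord(letter) - ord("A")
        return i - 1 if i > 7 else i

    first, second = index(letters[0]), index(letters[1])
    # Position of the 100 km square, counted from the grid's false origin.
    e100km = ((first - 2) % 5) * 5 + second % 5
    n100km = 19 - (first // 5) * 5 - second // 5
    if not (0 <= e100km <= 6 and 0 <= n100km <= 12):
        raise ValueError(f"{letters} is outside the National Grid")

    half = len(digits) // 2
    easting = e100km * 100000 + int(digits[:half].ljust(5, "0"))
    northing = n100km * 100000 + int(digits[half:].ljust(5, "0"))
    return easting, northing


print(bng_to_easting_northing("SP 5634467833"))
```

Doing this conversion once, when the data are entered, and storing the two numbers avoids repeating it every time the points are mapped.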

Validation

Get in the habit of ensuring that the data entered into any field in your attribute database makes sense. For example, check that you haven’t typed the letter ‘O’ instead of ‘0’ (zero). Another tip is to check that numeric values are within range – for example that a slip of the old typing fingers hasn’t moved your Norman site from 1066 to 2066. It’s often helpful to have someone else validate data that you have entered as typos are more easily detected by a fresh pair of eyes. If your data input tools allow you to define validation checks, use them, but remember that – like spelling checkers – they cannot catch all possible input errors.
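The two checks mentioned above, the letter O typed for a zero and a value outside its expected range, might be automated along these lines. The function name and the range limits are assumptions chosen for the example, and such a check cannot catch every possible input error.

```python
def check_year(raw, earliest, latest):
    """Return a list of problems found in a year typed as text.
    An empty list means no problem was detected."""
    problems = []
    text = raw.strip()
    if "O" in text.upper():
        problems.append("contains the letter O; did you mean 0 (zero)?")
    try:
        year = int(text)
    except ValueError:
        problems.append("not a whole number")
        return problems
    if not earliest <= year <= latest:
        problems.append(
            f"{year} is outside the expected range {earliest} to {latest}")
    return problems


# Hypothetical range for a Norman site.
print(check_year("1066", 1000, 1200))
print(check_year("2066", 1000, 1200))
print(check_year("1O66", 1000, 1200))
```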

Numeric data

It is best to use numeric field types rather than text fields if you have numeric data. This can have three benefits. First, confusing characters – such as that familiar O (letter) instead of 0 (zero) problem – cannot be stored in the wrong field type. Second, in many computer-based databases numeric information is stored more efficiently than text and occupies less space. This means that your GIS data set will be leaner and meaner. Third, when data is held in numeric form the data can more readily be manipulated with the arithmetic operators.

If you are using numeric data, also ensure that you use the most appropriate numeric type – integer or floating point. Integer types are used for storing whole numbers and floating point numbers are used for storing numbers which have, or may have, a fractional part.
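A short illustration of why numeric types matter is given below, using invented sherd counts and deposit depths. Text values sort character by character and cannot be summed, while numeric values behave as expected. Whole-number counts suit an integer type and measurements suit a floating point type.

```python
# Invented values, held first as text and then converted.
counts_as_text = ["9", "10", "100"]
counts = [int(value) for value in counts_as_text]   # whole numbers: integer

depths_as_text = ["0.5", "1.25", "2.0"]
depths_m = [float(value) for value in depths_as_text]  # fractions: floating point

text_order = sorted(counts_as_text)   # character by character, so "9" comes last
numeric_order = sorted(counts)        # true numeric order
total_sherds = sum(counts)
mean_depth_m = sum(depths_m) / len(depths_m)

print(text_order, numeric_order, total_sherds, mean_depth_m)
```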

Data entry control

Where possible the fields should be set up to use dictionaries or thesauri to ensure that typing errors are kept to a minimum and restricted to free text fields, and that terms used to describe real world objects are used accurately and consistently. Adhere to established appropriate project data standards (e.g. the RCHME/English Heritage Urban Archaeological Database Data Standards). If no project standards exist, adhere to the data standards of the digital archive for your data. Remember that your data will need a home if it is to remain a useful and accessible resource in the future, and it is your responsibility to ensure its compatibility with other data sets of a similar spatial or temporal resolution.
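A dictionary-controlled field can be approximated with a simple lookup, as sketched below. The word list here is a hypothetical three-term example and not an extract from any published thesaurus; a real project would load its terms from the standard it has adopted.

```python
# Hypothetical controlled vocabulary: lower-case key -> preferred term.
PERIOD_TERMS = {
    "neolithic": "Neolithic",
    "iron age": "Iron Age",
    "medieval": "Medieval",
}


def controlled_term(raw, vocabulary=PERIOD_TERMS):
    """Return the preferred form of `raw`, ignoring case and extra spaces,
    or raise ValueError if the term is not in the vocabulary."""
    key = " ".join(raw.split()).lower()
    if key not in vocabulary:
        raise ValueError(f"{raw!r} is not a recognised term")
    return vocabulary[key]


print(controlled_term("  iron  AGE "))
```

Rejecting unknown terms at entry, rather than storing them, keeps typing errors out of the coded fields.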

Confidence values

These indicate the level of certainty that is associated with an entry in the attribute database, for example how certain you are that the location, identification or dating of the object is accurate. It is very good practice to maintain this information at all times.

Consistency

Try to ensure that the codes used to record your attribute data are consistent. Ensuring consistency is especially difficult when data entry is performed by more than one person, or if data entry is carried out incrementally over time. The use of thesauri and documentation standards can be helpful in ensuring consistency within your database and between your database and others.

Dates

Calendar dates should be recorded in a date field type rather than a character field type to avoid the loss of crucial data when transferring into different software packages. Be aware that some software will not warn you if you are about to lose data due to incompatible field types.
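The difference between a date held as text and a date held as a date type can be seen in a short sketch. The day/month/year input format is an assumption made for this example, as are the function name and the dates used.

```python
from datetime import date, datetime


def parse_recorded_date(text, fmt="%d/%m/%Y"):
    """Convert a date typed as text into a date value.
    Raises ValueError for impossible dates such as 31/02/2011."""
    return datetime.strptime(text.strip(), fmt).date()


# As text, "03/04/2011" could be read as 3 April or 4 March.
# As a date value it is unambiguous.
visit = parse_recorded_date("03/04/2011")
print(visit.isoformat())
```

Once held as a date value, the entry can be written out in an unambiguous form such as the ISO 8601 year-month-day format.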

Documentation

The most important thing of all is to document the way you have organised your database and entered information into it! It is essential that source-specific information is recorded as and when data is generated, as this task becomes increasingly difficult retrospectively. Where did the source data originate, at what scale was it prepared, where can it be found if it is based on others’ work, and what copyright restrictions apply to its use by a third party? What levels of accuracy were accepted and what errors were recorded during digitization? What data standards were adhered to (dated if possible, as revisions will occur) and what naming conventions have been adopted?

Combining and integrating attribute databases

Data standards

Successful database integration relies on the implementation of data standards. These aim to facilitate the production of a common frame of reference for archaeologists, endorsed by the profession as a whole and implemented in a widely compatible national network of databases and digital archives.

Core data standards have been defined for many fields of archaeology, from standards for portable items, such as the MDA Archaeological Object Thesaurus (MDA 1998) and the International Guidelines for Museum Object Information produced by the International Committee for Documentation (CIDOC), to the draft data standards for SMRs and revisions of the RCHME Thesauri of Architectural Types, Monument Types, and Building Materials. Another useful resource is MIDAS Heritage. Outside the profession, essential standards have been set for such data sets as British postal addresses (BS 7666) and international naming conventions for countries (ISO 3166).

The basic process involved in the integration of data from external databases relies on compatible field structure. This means that complementary fields in both the source and target databases must be of a compatible type (Integer, Floating Point, Date, a Character field of an appropriate length etc.) to avoid the loss of data during the integration process.
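A compatibility check of this kind could be sketched as below. The way field definitions are represented here, as a name mapped to a type and an optional length, is an assumption made for the example, as are the field names. The sketch treats only identical types as compatible, which is stricter than many database packages require.

```python
def incompatible_fields(source, target):
    """Compare two sets of field definitions, each a dict of
    field name -> (type, length or None), and list the problems."""
    problems = []
    for name, (src_type, src_len) in source.items():
        if name not in target:
            problems.append(f"{name}: missing from target")
            continue
        tgt_type, tgt_len = target[name]
        if src_type != tgt_type:
            problems.append(f"{name}: type {src_type} does not match {tgt_type}")
        elif src_type == "character" and src_len > tgt_len:
            problems.append(
                f"{name}: length {src_len} exceeds target length {tgt_len}")
    return problems


# Hypothetical field definitions for a source and a target database.
source_fields = {
    "site_name": ("character", 40),
    "easting": ("character", 6),
    "visit_date": ("date", None),
}
target_fields = {
    "site_name": ("character", 30),
    "easting": ("integer", None),
}

issues = incompatible_fields(source_fields, target_fields)
for issue in issues:
    print(issue)
```

Running such a check before importing shows where values would be truncated, converted or dropped.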

Some features of certain databases (e.g. dBASE memo fields) are difficult to export to other systems and may require specialist advice to avoid their loss. The new data should be date stamped digitally by the computer operator, and a record kept of its source and ownership.

Integrating paper records

Data can be extracted from documents and typed manually into an existing database, or whole reports can be captured speedily using a commercially available optical character recognition (OCR) package. These packages convert scanned text into digital characters which can be saved in a variety of word processor formats. The character interpretation is never 100% effective and will require spell-checking and proof-reading before it is used, but this method can save a great deal of time, especially when capturing printed table data.

Most often, the integration of paper records will involve some form of manual input, often by a number of separate individuals over a considerable period of time. Here the importance of adherence to existing standards and guidelines cannot be over-stressed. Such a process often involves a great number of decisions that directly affect the quality of the resulting data sets, as very descriptive information is often broken down into the discrete thematic field structure of the database. To ensure that the resultant database is usable it is important to record such decisions and to ensure that a degree of consistency is adopted throughout the process.

[1] http://www.heritage-standards.org.uk/midas-heritage/
