Preparing to archive: files and formats
As GIS data often incorporates data from a variety of sources, the formats that are safest for digital preservation vary with the type of information contained within a file. In this section, recommendations are given for the formatting of GIS files, databases, images, documentation, and metadata.
Any archiving of GIS files should aim to preserve the following properties:
- Coordinate reference system information
- Geometry (e.g. point, polygon, line)
- Attribute fields
- For rasters – source elevation model, bit-type, colourmap, pixel type
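One way to make sure these properties are written down before deposit is to keep a small structured record for each dataset. The sketch below is illustrative only: the class name `SignificantProperties`, its field names, and the example values are hypothetical and are not taken from any archive's schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class SignificantProperties:
    """Hypothetical per-dataset record of the properties listed above."""
    dataset: str
    coordinate_reference_system: str           # e.g. an EPSG code or a WKT string
    geometry_type: Optional[str] = None        # "point", "line" or "polygon" (vector data)
    attribute_fields: List[str] = field(default_factory=list)
    # Raster-only properties; left as None for vector datasets.
    source_elevation_model: Optional[str] = None
    bit_type: Optional[str] = None
    colourmap: Optional[str] = None
    pixel_type: Optional[str] = None

    def to_json(self) -> str:
        """Serialise only the properties that have actually been recorded."""
        recorded = {k: v for k, v in asdict(self).items() if v not in (None, [], "")}
        return json.dumps(recorded, indent=2)


# Example values are invented for illustration.
record = SignificantProperties(
    dataset="trench_outlines",
    coordinate_reference_system="EPSG:27700",
    geometry_type="polygon",
    attribute_fields=["context_id", "period"],
)
print(record.to_json())
```

A record like this could be deposited as plain-text documentation alongside the data.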
Strictly speaking, colour is not seen as a significant property of GIS data. This tailoring of data is stored in the project file (see below) and not in the digital object itself. If data creators require that colour/styling of original data should be recorded then this should be supplied as documentation in the form of a document or image. This documentation can then be stored with the data.
As highlighted in a 2009 DPC Technology Watch Report, “Attempts at defining a universal data model for geospatial data have been made (for example the Spatial Data Transfer Standard (SDTS)…but have not achieved widespread adoption. As a consequence, it is not possible to speak of – geospatial data – as a single type of information that can be handled by multiple, functionally equivalent applications and formats.” (McGarva et al 2009, 5). As with other data types discussed in these Guides, where the original source data (e.g. raw survey data) cannot be archived outside of the GIS environment, the most suitable files to use for archiving GIS data fall into the categories of open formats (e.g. GML and KML) and widely used standards (e.g. ESRI Shapefiles). This approach is further supported by the wide range of import and export functions offered by most GIS applications, as well as by third-party libraries such as the open source GDAL library (raster geospatial data formats) and the related OGR library (vector data).
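As a rough illustration of the export route described above, the GDAL/OGR command-line tools `ogr2ogr` (vector) and `gdal_translate` (raster) can convert between formats. The sketch below only assembles the commands and runs them if the tools are installed. The helper names and file names are hypothetical. The GML and GeoTIFF targets reflect the recommendation made later in this section, and the `COMPRESS=NONE` option reflects the general advice against compression.

```python
import shutil
import subprocess
from typing import List


def vector_to_gml_command(src: str, dst: str) -> List[str]:
    """Build an ogr2ogr call; note that ogr2ogr takes the destination first."""
    return ["ogr2ogr", "-f", "GML", dst, src]


def raster_to_geotiff_command(src: str, dst: str) -> List[str]:
    """Build a gdal_translate call that writes an uncompressed GeoTIFF."""
    return ["gdal_translate", "-of", "GTiff", "-co", "COMPRESS=NONE", src, dst]


def run_if_available(command: List[str]) -> bool:
    """Run the command only when the tool is on the PATH; return whether it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names, printed rather than executed.
print(" ".join(vector_to_gml_command("sites.shp", "sites.gml")))
print(" ".join(raster_to_geotiff_command("survey.asc", "survey.tif")))
```

Any converted file should be checked against the original (feature counts, attribute fields, coordinate reference system) before the original is set aside.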
General considerations, as outlined in the Guide-wide section on Planning for the Creation of Digital Data, include ensuring that data, where possible, is not encoded or compressed.
In many GIS applications, project files – such as .apr or .mxd – can be created to hold data in a tailored manner that involves classification, symbolization, and annotation based upon the data content. These data views typically appear as maps, charts, or tables, or some combination thereof. In order for an end user to render this content it is necessary to have not only the project file, but also the software that supports it, the related components (possibly including software add-ons or extensions), and the actual data. The required use of specific software, the complexity of the project file formats, and the tenuous links to the actual data, which is often simply pointed to, put these project files at high risk of failure over time. It is therefore recommended that project files are not archived, or at least are not used to hold key information relating to the associated datasets.
Generally speaking, GIS data falls into two main categories: geo-referenced vector and geo-referenced raster data formats. Unlike other, simpler data types, GIS files may consist of more than one physical file/object. This is well illustrated by the case of ESRI Shapefiles, where a single ‘file’ may be made up of a collection of up to eight separate files. When archiving GIS data it is essential that all relevant files are stored. The tables below outline a number of common GIS formats for both raster and vector data; in summary, it is recommended that, where possible, vector data is archived in the GML format and raster data as GeoTIFF files.
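To illustrate the point about multi-file formats, the following minimal sketch gathers every file that shares a Shapefile's base name and reports any of the three required components (.shp, .shx, .dbf) that are missing. It also notes whether a .prj file is present, since that optional file carries the coordinate system information. The function name and the returned dictionary keys are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Union

REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def shapefile_components(shp_path: Union[str, Path]) -> Dict[str, object]:
    """List sibling files of a Shapefile and flag missing required parts."""
    shp = Path(shp_path)
    prefix = shp.stem + "."
    # Every file named "<stem>.<something>" in the same folder is treated as
    # part of the Shapefile, which also picks up names such as "<stem>.shp.xml".
    found: List[str] = sorted(
        p.name for p in shp.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
    suffixes = {name[len(shp.stem):].lower() for name in found}
    return {
        "files": found,
        "missing_required": [s for s in REQUIRED_SUFFIXES if s not in suffixes],
        "has_prj": ".prj" in suffixes,
    }
```

Calling `shapefile_components("data/sites.shp")` on a folder holding only `sites.shp` and `sites.shx` would report `.dbf` as missing and `has_prj` as false, which is a prompt to locate the remaining files before deposit.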
|ArcInfo Interchange (.e00)||An ESRI format developed to move coverages, INFO data files, text files such as ARC Macro Language (AML) files, and other ArcInfo files between machines not connected by a file sharing network. Interchange files contain all coverage information and appropriate INFO data file information in a fixed-length ASCII format. The ESRI E00 interchange data format combines spatial and descriptive information for vectors and rasters in a single ASCII file. It is mainly used to exchange files between different versions of ArcInfo, but can also be read by many other GIS programs. This can be used as a preservation format.|
|ESRI Shapefile (.shp, .shx, .dbf, .sbn and .sbx, .fbn and .fbx, .ain and .aih, .prj and .xml)||Shapefile is an openly published format and is actually a collection of files, the number and combination of which depend upon the type of data stored. Shapefiles store nontopological geometry and must be accompanied by an index file (.shx) and a dBASE file (.dbf) that holds the attributes of the shapes in the .shp file. Shapefiles contain the following files:
– SHP – the file that stores the feature geometry. Required.
– SHX – the file that stores the index of the feature geometry. Required.
– DBF – the dBASE file that stores the attribute information of features. Required.
– SBN, SBX – the files that store the spatial index of the features. Optional.
– FBN, FBX – the files that store the spatial index of the features for shapefiles that are read-only. Optional.
– AIN, AIH – the files that store the attribute index of the active fields in a table or a theme’s attribute table. Optional.
– PRJ – the file that stores the coordinate system information. Optional.
– XML – metadata. Optional.|
|Geography Markup Language (.gml)||GML utilises XML to express geographical features. It can serve as a modelling language for geographic systems as well as an open interchange format for geographic data. It is an ISO standard (ISO 19136) and is built on a number of other ISO standards collectively known as the 19100 family. GML is defined by the Open Geospatial Consortium. As an XML-based schema and an ISO standard, GML is very suitable as a preservation format and is recommended for GIS data.|
|Keyhole Markup Language (KML)||An XML based format initially developed for use with the Google Earth application but now an international standard of the Open Geospatial Consortium.|
|MOSS export||An export format from the MOSS GIS software, it can be problematic to import into other applications and is not a preferred format.|
|MapInfo Interchange Format (.mif & .mid)||MapInfo is a commonly used GIS software package. The .mif file contains the graphics, while the optional .mid component contains any attribute data as delimited text. The format is a standard format and most other GIS programs can also read it. This format is ASCII based and open, and is thus a possible preservation format, although MapInfo products provide support for GML, which is even more suited to preservation.|
|National Transfer Format (NTF)||The NTF format has been mostly used in the past by the UK Ordnance Survey. The format is widely supported for conversion into other formats and is supported (read access) by the OGR Library.|
|Spatial Data Transfer Standard (.ddf)||The Spatial Data Transfer Standard (SDTS) is a data exchange format for transferring different databases between dissimilar computing systems, preserving meaning and minimizing the amount of external information needed to describe the data. It can only be used for certain types of feature point, arc and grid data. One coverage would produce many files, all with the extension .ddf.|
|Vector Product Format (.vpf)||Vector Product Format (VPF) is a U.S. Department of Defense Standard. The National Imagery and Mapping Agency (NIMA) is using VPF for digital vector products developed at a variety of scales. VPF has also been adopted into an international spatial standard as the Digital Geographic Information Exchange Standard (DIGEST). VPF coverages and tables can be translated into ARC/INFO coverages and INFO tables.|
|Geo-referenced TIF Image/GeoTIFF .tif (.rrd, .aux, .xml)||GeoTIFF is a metadata format, which provides geographic information associated with the image data. The TIFF file structure allows both the metadata and the image data to be encoded into the same file. GeoTIFF files embed information about the projection within tags in the file and will import automatically with correct georegistration. Images saved as GeoTIFF require only one file, with a .tiff or .tif file extension, and it is important not to confuse GeoTIFF with a different approach that also uses .tif files, the “TFW” format. That format uses two files, a .tif file and a .tfw “world” file, to provide georeferencing information. TFW is not the same as GeoTIFF. Adding to the confusion, some packages will create both a GeoTIFF file and a .tfw “world” file. The .tfw file provided in such cases is not part of the GeoTIFF standard. GeoTIFF is preferred over TIFF world files.|
|ESRI GRID (.adf, .asc, .grd)||An ESRI GRID is a raster GIS file format developed by ESRI, which has two formats. The first is a proprietary binary format with the extension .adf, also known as an ARC/INFO GRID, ARC GRID and many other variations. See ESRI documentation on the binary version. The second is a non-proprietary ASCII format, also known as an ARC/INFO ASCII GRID, with the file extension .asc, although recent versions of ESRI software also recognize the extension .grd. Both types are well documented online.|
|JPG World .jpg & .jgw (.rrd, .aux, .xml)||As with TIFF world files (described above), these files consist of a standard JPG file accompanied by a world file containing the georeferencing information.|
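Because GeoTIFF and the .tif plus .tfw combination are easy to confuse, it can help to check what a given .tif actually carries. The following is a minimal sketch under stated assumptions: it handles only classic (non-BigTIFF) files, inspects only the first image directory, and looks for a lower-case `.tfw` file alongside. It reports whether the GeoKeyDirectoryTag (tag 34735 in the GeoTIFF specification) is present. The function names and dictionary keys are illustrative.

```python
import struct
from pathlib import Path
from typing import Dict, Set, Union

GEOKEY_DIRECTORY_TAG = 34735  # GeoKeyDirectoryTag from the GeoTIFF specification


def first_ifd_tags(path: Union[str, Path]) -> Set[int]:
    """Return the tag numbers found in the first image directory of a classic TIFF."""
    with open(path, "rb") as fh:
        header = fh.read(8)
        if len(header) < 8 or header[:2] not in (b"II", b"MM"):
            raise ValueError("not a TIFF file")
        endian = "<" if header[:2] == b"II" else ">"
        magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
        if magic != 42:
            raise ValueError("only classic TIFF (magic number 42) is handled here")
        fh.seek(ifd_offset)
        (entry_count,) = struct.unpack(endian + "H", fh.read(2))
        tags: Set[int] = set()
        for _ in range(entry_count):
            entry = fh.read(12)  # each directory entry is 12 bytes long
            if len(entry) < 12:
                break
            tags.add(struct.unpack(endian + "H", entry[:2])[0])
        return tags


def describe_georeferencing(path: Union[str, Path]) -> Dict[str, bool]:
    """Report embedded GeoTIFF keys and the presence of a sidecar world file."""
    tif = Path(path)
    return {
        "geotiff_tags": GEOKEY_DIRECTORY_TAG in first_ifd_tags(tif),
        "world_file": tif.with_suffix(".tfw").exists(),
    }
```

A result with `geotiff_tags` false and `world_file` true indicates a plain TIFF that depends on its world file for georeferencing, so both files (and a note of the coordinate reference system, which a world file does not record) would need to be kept together.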
If you have external databases connected to your GIS system, for example a database containing your attribute data, then you may want to archive these as well. Details on how best to archive database data are covered in the Databases and Spreadsheets guide.
It is NOT necessary to archive images of every single coverage in your GIS, nor is it necessary to archive images showing all of the ways you used the GIS to play with that data. Occasionally an image may have proven useful to you in a research project and, in order to document the research that you did, archiving that image might be worth more than 1,000 words of documentation. One example is an image showing lithic flakes scattered across a house floor in a pattern that you argued demonstrates lithic production was taking place on site — that single image might be well worth including.
Further information on archiving raster images can be found in the Raster Images guide.
Documentation and metadata to accompany your GIS, database, or image files
Your data set (the GIS files, database files, and image files) will need to be accompanied by detailed documentation as described in Sections 3.2 and 3.3. These are general guidelines, and certain archives may have specific requirements for the format and content of the metadata that accompanies GIS and spatial datasets. For example, some archives may request that it is supplied as documents along with the data, whereas others, such as tDAR, utilise interactive web forms to help users create metadata for resources they deposit.