File formats for archiving datasets

Angie Payne, Kieron Niven (Ed.), Archaeology Data Service / Digital Antiquity, Guides to Good Practice

The decision to archive a scan project should ideally occur in the planning phase of a project. Planning to archive can help project organization by establishing a framework for processing documentation and for data structuring. By establishing a proper framework at the beginning of a project, the archival process becomes more streamlined and integrated into a project rather than a secondary product that is generated at the end.

As stated in Project planning and requirements, the minimum standard deliverables for archiving a terrestrial scan dataset in archaeology are the individual raw scans, the final registered point cloud and all associated metadata elements including project metadata, scan metadata and registration metadata.

For additional products generated beyond the registered point cloud, it is also strongly advised to archive the final product, the interim dataset used to create the product, all transformation matrices and metadata elements associated with all datasets. The interim dataset is considered to be very important because it provides the connection between the original registered point cloud and the derived product. For example, a mesh is rarely generated from the original registered point cloud. The point cloud is typically cleaned to remove erroneous data points, overlapping data is typically minimized or deleted and additional smoothing or subsampling may be performed. Therefore it is valuable to have the interim edited point cloud to understand how the mesh relates back to the original dataset. The interim dataset is important for assessing a product’s validity and for understanding the processes used to create the final product and how these may/may not have deviated from the original dataset.

The recommended datasets and associated metadata for archiving are listed in the table below and discussed in the subsequent sections:

Dataset Description | Preservation Format | Metadata | Required?
Original raw scan files | ASCII TXT and Native | Project Metadata, Scan Metadata | Yes
Registered point cloud; Transformation Matrices for each scan; Control Point File if georeferenced | ASCII TXT | Registration Metadata | Yes
Pre-mesh point cloud | ASCII TXT | Pre-Meshing Metadata | Strongly advised if archiving a derived polygonal mesh
Polygonal mesh | OBJ + MTL (ASCII) | Polygonal Mesh Metadata | Required if final dataset; strongly advised if used to create a derived 2D/3D CAD dataset
Decimated mesh | OBJ + MTL (ASCII) | Decimated Mesh Metadata | Required if final dataset
2D CAD dataset | DXF/DWG | 2D CAD Model Metadata | Required if final dataset
3D CAD dataset | DXF/DWG | 3D CAD Model Metadata | Required if final dataset
DEM | SDTS | DEM Creation Metadata | Required if final dataset
Video | Suitable MPG format | Video Creation Metadata | No
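
The transformation matrices are what tie each raw scan to the registered point cloud. As a minimal sketch of that relationship, the example below applies a 4x4 homogeneous transformation matrix to raw scan coordinates. The matrix values, the row-major layout and the names `apply_transform` and `scan_to_project` are assumptions made for illustration; registration software may store and export its matrices differently.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Matrix4 = Sequence[Sequence[float]]  # 4 rows of 4 values, row-major (assumed layout)


def apply_transform(matrix: Matrix4, points: Sequence[Point]) -> List[Point]:
    """Apply a 4x4 homogeneous transformation to each (x, y, z) point."""
    out = []
    for x, y, z in points:
        # Treat the point as the column vector (x, y, z, 1); the last
        # matrix row is not needed for a rigid-body transformation.
        tx, ty, tz = (
            row[0] * x + row[1] * y + row[2] * z + row[3]
            for row in matrix[:3]
        )
        out.append((tx, ty, tz))
    return out


# Hypothetical matrix: rotate 90 degrees about Z, then translate by (10, 20, 0).
scan_to_project = [
    [0.0, -1.0, 0.0, 10.0],
    [1.0, 0.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

raw_points = [(1.0, 0.0, 0.0), (0.0, 2.0, 5.0)]
registered = apply_transform(scan_to_project, raw_points)
print(registered)
```

Archiving the matrix alongside the raw scan means the registered coordinates can be regenerated, or checked, without the original registration software.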

File types for archiving point clouds

There is currently no general purpose, open format for storing point cloud datasets generated by all types of laser scanners (The ASTM E57 File Format for 3D Imaging Data Exchange). The ASTM E57 format is promising, particularly given the large number of vendors that support it; however, the format is still in beta testing and is years away from complete implementation. The LAS format currently used by the airborne LIDAR community is a possible alternative, but the LAS standard is limited in its application to terrestrial and object-level datasets. Both standards are discussed further in the text below.

Until there is a single open format to support all types of scan data, it is suggested to archive point cloud datasets in both an ASCII format and the native proprietary format where applicable. Although the ASCII format does not retain data organization and tends to generate large files, it is easy for a human to interpret and is readable by virtually all point cloud software. All point cloud datasets, including the original scan files and the registered point cloud, should be archived in the ASCII format using the TXT file extension. For spherical datasets and/or datasets where the ASCII file is an “edited and processed version” of the original, it is also suggested to archive scans in their original native format. See section 3.1.1.a below to read more on when to archive scans as ASCII only or as both native and ASCII.
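
To illustrate why ASCII TXT is considered easy to interpret, the sketch below writes a few points as space-delimited text and reads them back. The column layout (X Y Z intensity), the four-decimal precision and the function names are assumptions for the example only; the guide does not prescribe a column order, so whatever layout is used should be recorded in the scan metadata.

```python
import io
from typing import Iterable, List, TextIO, Tuple

# Assumed column layout: X Y Z intensity. The guide does not prescribe one.
Record = Tuple[float, float, float, int]


def write_ascii_points(stream: TextIO, records: Iterable[Record]) -> None:
    """Write one space-delimited point per line."""
    for x, y, z, intensity in records:
        stream.write(f"{x:.4f} {y:.4f} {z:.4f} {intensity}\n")


def read_ascii_points(stream: TextIO) -> List[Record]:
    """Read the same layout back, skipping blank lines."""
    records = []
    for line in stream:
        fields = line.split()
        if not fields:
            continue
        x, y, z = (float(v) for v in fields[:3])
        records.append((x, y, z, int(fields[3])))
    return records


points = [(1.25, 2.5, 3.0, 120), (4.0, 5.5, 6.125, 87)]
buffer = io.StringIO()  # stands in for a TXT file on disk
write_ascii_points(buffer, points)
buffer.seek(0)
round_trip = read_ascii_points(buffer)
print(buffer.getvalue())
```

Note that the flat list of lines carries no information about scan rows and columns, which is the loss of data organization described above.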

The ASTM E57 format is currently being developed to address the critical need for an open standard, vendor neutral data format for the exchange and storage of 3D point clouds. The format can store data produced by laser scanners as well as other systems including flash LIDAR systems, structured light scanners, stereo vision systems, and others (The ASTM E57 File Format for 3D Imaging Data Exchange, 2). E57 also supports storing 2D images associated with a scan, together with all metadata associated with the 2D images and 3D data. In addition, the format supports organized “gridded” datasets, which can be very valuable for long range spherical scans, and can store data in multiple coordinate systems. The format is currently supported by a long list of vendors including Leica, Optech, Riegl, Trimble, Z+F and more. To read more about the future development and implementation of the E57 format, please refer to the ASTM website and http://www.libe57.org/.

The LAS format was developed by ASPRS to store aerial LIDAR data and has recently gained more support in the terrestrial scanning community. While the format is not specific to terrestrial datasets, it does support point cloud organization and normal information, which are important for storing original raw scan files. The format does not, however, support registration information or other ancillary data such as 2D images. Section 7.4.1 in the Metric Survey Specifications for Cultural Heritage provides additional discussion of the LAS format. For more information, please refer to the ASPRS website.

When to use ASCII, native/proprietary, or both?

All scans are required to be archived, at a minimum, in the ASCII TXT format to ensure file accessibility and readability. In addition, for datasets where it is beneficial to retain point cloud organization (for example, mid/long range spherical datasets), it is advised to archive original raw scans in their native format. For spherical scans, preserving normal information prior to scan registration is important, particularly for potential meshing operations. Spherical datasets are also often gridded when imported into processing software, which involves re-interpolating the data. An ASCII version of a scan that is exported from the software can therefore look considerably different than the original raw scan in its native format.

For any dataset that has been considerably edited or cleaned prior to importation/registration, it is also advised to archive the scan in its original native format. For example, phase scanners can produce significant amounts of noise that is typically filtered and removed in the processing software. This filtering process can also potentially remove useful data. Therefore, in this example, it would be advised to archive the original unedited dataset in the scan’s native format and to also archive the cleaned version of the scan used in registration in ASCII format. For object level datasets that have been scanned directly into processing software, typical scan editing will include removing non-object data or ancillary information from a scan. In instances when scan data are transferred directly to the processing software, it can be difficult to extract the original scan in its native format. In this case, where scan editing has removed only non-object specific information and where the original dataset is difficult to extract, it would be deemed acceptable to archive scans in the ASCII format only.

While it may not always be clear whether to archive a scan in both ASCII and native formats, hopefully this discussion has provided some insight into possible scenarios. The best rule of thumb to remember is that if the ASCII version of a scan is considerably different than the original, then it is best to also archive the original native format. When archiving both ASCII and native scans, the metadata should be completed for the native scan files.
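
The scenarios above can be summarized as a small decision rule. The sketch below is one possible reading of the guidance, not a definitive rule set: the `ScanInfo` fields and the function `formats_to_archive` are hypothetical names, and each flag still requires a judgment by the person preparing the archive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScanInfo:
    # All field names are hypothetical labels for the cases discussed above.
    organization_matters: bool          # e.g. mid/long range spherical scan
    edited_before_registration: bool    # considerably filtered or cleaned
    only_non_object_data_removed: bool  # editing removed nothing of interest
    native_hard_to_extract: bool        # scanned directly into processing software


def formats_to_archive(scan: ScanInfo) -> List[str]:
    """Return the suggested formats: ASCII TXT always, native in some cases."""
    formats = ["ASCII TXT"]
    # ASCII alone is acceptable when editing removed only non-object data
    # and the native scan is difficult to extract.
    ascii_only_ok = scan.only_non_object_data_removed and scan.native_hard_to_extract
    needs_native = scan.organization_matters or scan.edited_before_registration
    if needs_native and not ascii_only_ok:
        formats.append("native")
    return formats


spherical_scan = ScanInfo(True, False, False, False)
filtered_phase_scan = ScanInfo(False, True, False, False)
object_scan = ScanInfo(False, True, True, True)

for scan in (spherical_scan, filtered_phase_scan, object_scan):
    print(formats_to_archive(scan))
```

Under these assumptions the spherical scan and the filtered phase scan are archived in both formats, while the object level scan is archived as ASCII only.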

File types for archiving polygonal meshes

There are numerous 3D interchange formats in use today. The OBJ format is a geometry definition file format first developed by Wavefront Technologies[2]. OBJ is a simple data format that represents 3D geometry, namely the position of each vertex, the UV position of each texture coordinate vertex, the vertex normals, and the faces that make up each polygon, defined as lists of vertex and texture vertex indices. Additional files that can accompany an OBJ file include the material file (.MTL) and an image texture file (.JPG). OBJ files are the recommended data format for archiving polygonal mesh files. Alternative formats for archival submission include the X3D or DAE formats.
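
Because OBJ is plain text, its structure can be inspected directly. The sketch below embeds a hypothetical single-triangle OBJ file and counts its record types: `v` (vertex position), `vt` (texture coordinate), `vn` (normal) and `f` (face). The file content, the material name and the helper functions are invented for illustration, and the parser handles only the positive, 1-based indices used in this example.

```python
from typing import Dict, List

# A hypothetical single textured triangle; file and material names are illustrative.
OBJ_TEXT = """\
mtllib triangle.mtl
v 0.0 0.0 0.0
v 1.0 0.0 0.0
v 0.0 1.0 0.0
vt 0.0 0.0
vt 1.0 0.0
vt 0.0 1.0
vn 0.0 0.0 1.0
usemtl surface
f 1/1/1 2/2/1 3/3/1
"""


def summarize_obj(text: str) -> Dict[str, int]:
    """Count the geometry record types that make up an OBJ mesh."""
    counts = {"v": 0, "vt": 0, "vn": 0, "f": 0}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] in counts:
            counts[parts[0]] += 1
    return counts


def face_vertex_indices(text: str) -> List[List[int]]:
    """Return the 1-based vertex index of each face corner (the part before the first '/')."""
    faces = []
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "f":
            faces.append([int(ref.split("/")[0]) for ref in parts[1:]])
    return faces


summary = summarize_obj(OBJ_TEXT)
faces = face_vertex_indices(OBJ_TEXT)
print(summary)
print(faces)
```

Each face corner is written as vertex/texture/normal indices, and the `mtllib` line names the accompanying MTL file, which is why the OBJ and MTL files should be archived together.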

File types for archiving related products

As stated above, there are a host of different products that can be derived from point cloud datasets. Those discussed in this document include polygonal meshes (discussed above), 2D CAD models including plans, sections, and elevations, 3D CAD models, digital elevation models (DEMs) and movie files. For a discussion of archival CAD formats, please see the CAD data format section in the CAD Guide. Here it is recommended to archive both 2D and 3D CAD models using either the DXF or DWG formats developed by Autodesk. In the GIS guide, it is recommended to archive DEMs in the USGS SDTS format or other available ArcView formats. Movie files or digital videos are recommended to be archived using the MPG format. For an additional discussion on digital video formats, please refer to the Digital Video guide.

[2] http://en.wikipedia.org/wiki/Wavefront_Technologies