GIS file format
A GIS file format or geospatial file format is a standard for encoding geographical information into a computer file. It is a specialized type of file format for use in geographic information systems, remote sensing image processing tools, and other geospatial applications. Since the 1970s, dozens of formats have been created based on various data models for various purposes. They have been created by government mapping agencies, GIS software vendors, standards bodies such as the Open Geospatial Consortium, informal user communities, and even individual developers.
History
The first GIS installations of the 1960s, such as the Canada Geographic Information System, were based on bespoke software and stored data in bespoke file structures designed for the needs of the particular project. As more of these appeared, they could be compared to find best practices and common structures. When general-purpose GIS software was developed in the 1970s and early 1980s, including programs from academic labs such as the Harvard Laboratory for Computer Graphics and Spatial Analysis, government agencies, and new GIS software companies such as Esri and Intergraph, each program was built around its own proprietary file format. Since each GIS installation was effectively isolated from all others, interchange between them was not a major consideration.
By the early 1990s, the proliferation of GIS worldwide and an increasing need for sharing data, soon accelerated by the emergence of the World Wide Web and spatial data infrastructures, created demand for interoperable data and standard formats. An early attempt at standardization was the U.S. Spatial Data Transfer Standard, released in 1994 and designed to encode the wide variety of federal government data. Although this particular format failed to garner widespread support, it led to other standardization efforts, especially the Open Geospatial Consortium, which has developed or adopted several vendor-neutral standards, some of which have been adopted by the International Organization for Standardization (ISO).
Another development in the 1990s was the public release of proprietary file formats by GIS software vendors, enabling them to be used by other software. The most notable example of this was the publication of the Esri Shapefile format, which by the late 1990s had become the most popular de facto standard for data sharing by the entire geospatial industry. When proprietary formats were not shared, software developers frequently reverse-engineered them to enable import and export in other software, further facilitating data exchange. One result of this was the emergence of free and open-source software libraries, such as the Geospatial Data Abstraction Library, which have greatly facilitated the integration of spatial data in any format into a variety of software.
During the 2000s, the need for specialized spatial files was reduced somewhat by the emergence of spatial databases, which incorporated spatial data into general-purpose relational databases. However, new file formats have continued to appear, especially with the proliferation of web mapping; formats such as the Keyhole Markup Language and GeoJSON can be more easily integrated into web development languages than traditional GIS files.
Format characteristics
Over a hundred distinct formats have been created for the storage of spatial data, of which 20-30 are currently in common usage for different purposes. These can be distinguished in a number of ways:
- Open formats are developed collectively by a community and are available for anyone to implement and contribute improvements, while proprietary formats have been developed by a software company for use only in their own software and are generally maintained as a trade secret. A third category between these would include formats that are owned exclusively by one company or organization, but are published and available for implementation by anyone, such as the Esri Shapefile.
- Some file formats are text files that can be read by humans, especially those intended for data exchange, while others are binary files, most commonly those designed for native use in GIS software.
- Inherently spatial formats were designed specifically for storing geographic data, while others are spatial extensions to formats designed for a more general use.
- Many data formats incorporate some form of data compression, especially raster files. Lossless compression methods are generally preferable to lossy methods, because lossy methods do not allow the original data values to be recovered exactly.
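The difference between lossless and lossy storage can be illustrated with a minimal sketch. The elevation values below are invented for illustration, and rounding to whole metres is only a stand-in for a lossy method; it is not how any particular raster codec works. The lossless path uses the DEFLATE implementation in Python's standard `zlib` module.

```python
import struct
import zlib

# Hypothetical elevation samples in metres (illustrative values only).
elevations = [1203.4, 1203.9, 1204.6, 1205.1, 1205.1, 1204.8]

# Lossless path: pack as 64-bit floats, compress, decompress, unpack.
fmt = f"<{len(elevations)}d"
raw = struct.pack(fmt, *elevations)
compressed = zlib.compress(raw)
restored = list(struct.unpack(fmt, zlib.decompress(compressed)))

# Lossy stand-in (an assumption, not a real codec): keep whole metres only.
lossy = [float(round(v)) for v in elevations]

# Largest difference between an original value and its lossy version.
max_error = max(abs(a - b) for a, b in zip(elevations, lossy))
```

After the round trip, `restored` is identical to `elevations`, whereas `lossy` differs from the original by up to half a metre per cell, which may or may not matter depending on the analysis.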
Raster formats
Because a grid is a sample of a continuous space, raster data is most commonly used to represent geographic fields, in which a property varies continuously or discretely over space. Common examples include remote sensing imagery, terrain/elevation, population density, weather and climate, soil properties, and many others. Raster data can be images with each pixel containing a color value. The value recorded for each cell may be of any level of measurement, including a discrete qualitative value, such as land use type, or a continuous quantitative value, such as temperature, or a null value if no data is available. While a raster cell stores a single value, it can be extended by using raster bands to represent RGB colors, colormaps, or an extended attribute table with one row for each unique cell value. Raster data can also be used to represent discrete geographic features, but usually only when circumstances require it.
Raster data is stored in various formats, from standard file-based structures such as TIFF and JPEG to binary large object (BLOB) data stored directly in a relational database management system (RDBMS), in a manner similar to vector-based feature classes. Database storage, when properly indexed, typically allows for quicker retrieval of the raster data but can require storage of millions of sizable records.
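When a multi-band raster is written to a flat binary file, the cell values are commonly ordered in one of three ways: band sequential (BSQ), band interleaved by line (BIL), or band interleaved by pixel (BIP). The following sketch shows the three orderings on a tiny two-band grid. The function name `interleave` and the sample values are hypothetical, and nested Python lists stand in for the arrays a real reader or writer would use.

```python
def interleave(bands, order):
    """Flatten a multi-band raster into a single list of values.

    bands: list of bands; each band is a list of rows; each row is a
           list of cell values. All bands must have the same shape.
    order: "BSQ", "BIL", or "BIP".
    """
    n_bands = len(bands)
    n_rows = len(bands[0])
    n_cols = len(bands[0][0])
    out = []
    if order == "BSQ":      # all of band 0, then all of band 1, ...
        for b in range(n_bands):
            for r in range(n_rows):
                out.extend(bands[b][r])
    elif order == "BIL":    # row 0 of every band, then row 1 of every band, ...
        for r in range(n_rows):
            for b in range(n_bands):
                out.extend(bands[b][r])
    elif order == "BIP":    # every band's value for cell 0, then cell 1, ...
        for r in range(n_rows):
            for c in range(n_cols):
                for b in range(n_bands):
                    out.append(bands[b][r][c])
    else:
        raise ValueError(f"unknown interleave order: {order!r}")
    return out


# Two hypothetical 2x2 bands.
band_a = [[1, 2], [3, 4]]
band_b = [[10, 20], [30, 40]]

bsq = interleave([band_a, band_b], "BSQ")  # [1, 2, 3, 4, 10, 20, 30, 40]
bil = interleave([band_a, band_b], "BIL")  # [1, 2, 10, 20, 3, 4, 30, 40]
bip = interleave([band_a, band_b], "BIP")  # [1, 10, 2, 20, 3, 30, 4, 40]
```

The choice of ordering affects access patterns: BSQ favors reading one whole band at a time, while BIP favors reading all band values for a single cell.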
Raster format examples
- ADRG – National Geospatial-Intelligence Agency's ARC Digitized Raster Graphics
- Binary file – An unformatted file consisting of raster data written in one of several data types, where multiple bands are stored in BSQ, BIP or BIL order. Georeferencing and other metadata are stored in one or more sidecar files.
- Digital raster graphic – digital scan of a paper USGS topographic map
- ECRG – National Geospatial-Intelligence Agency's Enhanced Compressed ARC Raster Graphics
- ECW – Enhanced Compression Wavelet. A compressed wavelet format, often lossy.
- Esri grid – proprietary binary raster format used by Esri since the mid-1980s
- GeoTIFF – TIFF variant enriched with GIS relevant metadata, especially georeferencing. An open format that has become one of the most common formats for data sharing.
- IMG – ERDAS IMAGINE image file format
- JPEG2000 – Open-standard raster format. A compressed format that allows both lossy and lossless compression.
- MrSID – Multi-Resolution Seamless Image Database. A compressed wavelet format, allows both lossy and lossless compression.
- netCDF-CF – netCDF file format with CF metadata conventions for earth science data. Binary storage in an open format with optional compression. Allows for direct web access to subsets or aggregations of maps through the OPeNDAP protocol.
- RPF – Raster Product Format, military file format specified in MIL-STD-2411
  - CADRG – Compressed ADRG, developed by NGA, nominal compression of 55:1 over ADRG
  - CIB – Controlled Image Base, developed by NGA
- USGS DEM – The USGS' Digital Elevation Model
  - GTOPO30 – Large complete Earth elevation model at 30 arc seconds, delivered in the USGS DEM format
- DTED – National Geospatial-Intelligence Agency's Digital Terrain Elevation Data, the military standard for elevation data
- World file – A plain-text sidecar file that georeferences a raster image file
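As an illustration of how a sidecar file can georeference an image, a world file holds six numbers, one per line, that define an affine transformation from pixel (column, row) positions to map coordinates. The sketch below assumes the conventional line order A, D, B, E, C, F, where A and E are the pixel sizes in x and y, D and B are rotation terms, and C and F are the map coordinates of the center of the upper-left pixel. The function names and the sample values are hypothetical.

```python
def parse_world_file(text):
    """Return the six world-file parameters in file order: A, D, B, E, C, F."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError("a world file must contain exactly six numbers")
    return tuple(values)


def pixel_to_map(params, col, row):
    """Map a pixel (col, row) to map coordinates (x, y) via the affine transform."""
    a, d, b, e, c, f = params
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Hypothetical world file: 30-unit pixels, no rotation, north-up image
# (E is negative because row numbers increase downward).
sample = "30.0\n0.0\n0.0\n-30.0\n500015.0\n4200985.0\n"
params = parse_world_file(sample)

upper_left = pixel_to_map(params, 0, 0)   # (500015.0, 4200985.0)
other = pixel_to_map(params, 2, 3)        # (500075.0, 4200895.0)
```

Note that a world file carries no coordinate reference system; the software or user must know which system the map coordinates belong to.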
Vector formats
The vector data model uses coordinate geometry to represent each shape as one of several geometric primitives, most commonly points, lines, and polygons. Many data structures have been developed to encode these primitives as digital data, but most modern vector file formats are based on the Open Geospatial Consortium Simple Features specification, often directly incorporating its well-known text (WKT) or well-known binary (WKB) encodings.
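To give a sense of the two encodings, the following sketch writes a single two-dimensional point as well-known text and as well-known binary, and reads the binary form back. It covers only the simplest case: a WKB point consists of a one-byte byte-order flag (1 for little-endian), a 32-bit geometry type code (1 for Point), and two 64-bit floating-point coordinates. The function names and coordinates are hypothetical, and other geometry types, Z/M values, and spatial reference identifiers are not handled.

```python
import struct


def point_to_wkt(x, y):
    """Encode a 2D point as well-known text."""
    return f"POINT ({x} {y})"


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian well-known binary (21 bytes)."""
    # "<" = little-endian, no padding; B = byte-order flag,
    # I = geometry type (1 = Point), dd = x and y as float64.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data):
    """Decode a 2D point from well-known binary in either byte order."""
    endian = "<" if data[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", data[1:21])
    if geom_type != 1:
        raise ValueError("only 2D Point geometries are handled in this sketch")
    return x, y


wkt = point_to_wkt(-111.5, 40.25)   # "POINT (-111.5 40.25)"
wkb = point_to_wkb(-111.5, 40.25)
decoded = wkb_to_point(wkb)         # (-111.5, 40.25)
```

The text form is easy to read and embed in other files, while the binary form is fixed-size for a point and preserves the coordinates exactly.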
In addition to the geometry of each object, a vector dataset must also be able to store its attributes. For example, a database that describes lakes may contain each lake's depth, water quality, and pollution level. Since the 1970s, almost all vector file formats have adopted the relational database model, either in principle or directly incorporating RDBMS software. Thus, the entire dataset is stored in a table, with each row representing a single object that contains columns for each attribute.
Two strategies have been used to integrate the geometry and attributes into a single vector file format structure:
- A georelational format stores them as two separate files, with the geometry and attributes of each object being linked by file ordering or a primary key. This was most common from the 1970s through the early 1990s, because GIS software developers had to invent their own geometry data structures, but incorporated existing relational database file formats for the attributes. For example, the Esri Shapefile format includes the .dbf file from the DOS dBase software.
- An object-based format stores them in a single structure, loosely or directly based on the objects in object-oriented programming languages. This is the basis of most modern file formats, including spatial databases that include a geometry column along with the other attributes in a single relational table. Other formats, such as GeoJSON, use different structures for geometry and attributes, but combine them for each object in the same file.
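The contrast between the two strategies can be sketched by joining separately held geometry and attribute records on a shared key and emitting them as a single GeoJSON document. The lake names, coordinates, field names, and the `join_to_geojson` helper below are all invented for illustration, and each lake is simplified to a single point.

```python
import json

# Georelational style: geometry and attributes are held separately
# and linked by a primary key (here, a hypothetical "id").
geometries = {
    1: [-111.9, 40.7],
    2: [-111.6, 40.2],
}
attributes = [
    {"id": 1, "name": "Lake A", "depth_m": 12.5},
    {"id": 2, "name": "Lake B", "depth_m": 30.0},
]


def join_to_geojson(geometries, attributes, key="id"):
    """Combine both parts into one GeoJSON FeatureCollection (as a dict)."""
    features = []
    for row in attributes:
        properties = {k: v for k, v in row.items() if k != key}
        features.append({
            "type": "Feature",
            "id": row[key],
            "geometry": {"type": "Point", "coordinates": geometries[row[key]]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


collection = join_to_geojson(geometries, attributes)
geojson_text = json.dumps(collection, indent=2)
```

In the resulting document each feature carries its own geometry and properties together, so no separate file or key lookup is needed to interpret it.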
Vector datasets usually represent discrete geographical features, such as buildings, trees, and counties. However, they may also be used to represent geographical fields by storing locations where the spatially continuous field has been sampled. Sample points, contour lines, and triangulated irregular networks (TINs) are used to represent elevation or other values that change continuously over space. TINs record values at point locations, which are connected by lines to form an irregular mesh of triangles. The faces of the triangles represent the terrain surface.
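Because each TIN triangle is a planar face, the value at any location inside it can be estimated by linear interpolation from its three vertices. The sketch below does this with barycentric weights for a single triangle. The function name and the sample coordinates are hypothetical, and locating the triangle that contains a given point within a full mesh is left out.

```python
def interpolate_in_triangle(p, a, b, c):
    """Linearly interpolate z at p = (x, y) inside triangle a, b, c.

    a, b, c are (x, y, z) vertices. Returns None if p lies outside
    the triangle; raises ValueError for a degenerate triangle.
    """
    x, y = p
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle (vertices are collinear)")
    # Barycentric weights: each is 1 at its own vertex and 0 on the opposite edge.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    if min(w1, w2, w3) < 0:
        return None  # p is outside this triangle
    return w1 * z1 + w2 * z2 + w3 * z3


# Hypothetical triangle with elevations at its three vertices.
a = (0.0, 0.0, 100.0)
b = (10.0, 0.0, 200.0)
c = (0.0, 10.0, 300.0)

z_vertex = interpolate_in_triangle((0.0, 0.0), a, b, c)     # 100.0
z_inside = interpolate_in_triangle((2.0, 2.0), a, b, c)     # approximately 160.0
z_outside = interpolate_in_triangle((20.0, 20.0), a, b, c)  # None
```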