Chapter 4 Data Structures There are a number of different ways to organize the data inside any information system. The choice of a particular spatial data structure is one of the important early decisions in designing a geographic information system. While very few of us will ever design a GIS from start to finish, a knowledge of data structures is valuable from several points of view other than system design. Fundamentally, users must be aware of the characteristics of several different structures, since several different standard forms are commonly used, and the choice of a data structure can affect both data storage volume and processing efficiency. From another point of view, when we are collecting our own data, we must make a choice of data structure for storage. Also, in an operational or research environment, it is often necessary to convert datasets between several different data structures, either to work with several kinds of data at the same time, or to import an unusual dataset into an existing system. It is very important to be able to understand how these conversions affect the underlying information itself. As we stated in the introduction, each different type kind of spatial data or theme in a GIS is referred to as a data layer or data plane. In each of these data layers there are three primitive types of geometrical entities to encode (after Peucker and Chrisman, 1975): points, lines, and polygons or planes. Some authors make a distinction between the representation of a truly three-dimensional surface, such as elevation datasets, and a representation of space in two dimensions, such as legal boundaries of land ownership on a flat map. In this chapter, we will focus on the latter. The essential function of the spatial data we store and manipulate is to subdivide the Earth's surface into meaningful entities or objects that can be characterized. In this way, the contents of a spatial database is a model of the Earth. Points, such as the locations of oil and water wells, and lines, such as the centerlines of roadways or streams, are key elements of this breakdown into component parts. When we consider bounded regions, such as the borders of a subdivision or the edges of a reservoir, we often focus on the boundary lines, and call the enclosed regions polygons. These polygonal regions are not necessarily defined in the precise terms of geometry, where a polygon is ordinarily a planar figure bounded by a series of straight line segments. In spatial data processing, common usage relaxes the requirement that the bounding line segments be straight; we use the term polygon even when the boundaries are curved. We note, however, that not all GISs can work directly with curves as such, but more often permit a single line to have interior digitized points in addition to the end points. Many sophisticated applications have been developed around networks of lines, such as the network developed by the arteries of a transportation or communications system, or a variety of piping systems such as a sanitary sewer or a pressurized gas delivery system. The above discussion concentrates on the geometry of the data. Equally important is the non-spatial or attribute data, which in some systems requires a greater amount of data storage. For a simple spatial object like a water well, the essential spatial information is the geodetic location of the well. The attribute data can include wide range of ancillary information about that well, including its depth, date of drilling, production volume, ownership, and so forth. Many geographic information systems have specialized capabilities for storing and manipulating the attribute data in addition to the spatial information. 4.1 Raster Data Structures One of the simplest data structures is a raster or cellular organization of spatial data. In a raster structure, a value for the parameter of interest -- elevation in meters above datum, land use class from a specified list, plant biomass in grams per square meter, and so forth -- is developed for every cell in a (frequently regular) array over space. For example, in Figure 4.1, elevation in meters above mean sea level has been recorded at locations on a regular grid. The original data is from a topographic map, from which we have extracted the contour lines. The raster array of elevations is derived from these contour lines using procedures discussed in section 6.7. This kind of data structure is intuitive; we might imagine a survey team determining elevations at regular distances along lines of constant latitude. 4.1.1 Simple Raster Arrays The horizontal dimension of the simplest raster, along the rows of the array, is often oriented parallel to the east-west direction for convenience. Following the conventional practice in image processing, raster elements in this direction along the rows of the array are sometimes called samples, and numbered from the left (or west) margin. Positions in the vertical direction aligning with the columns of the array, are often numbered starting from the top (or northern) boundary. This numbering scheme comes from the computer graphics field, in which displays are often painted on the computer screen or printer from the top down. Thus, the origin of the raster is frequently the upper left corner. This location is considered position (1,1) in some systems of notation, and position (0,0) in others - please be aware of the difference ! Note that this referencing system for cells in a raster is different from more traditional georeferencing systems such as latitude-longitude in which one specific point on the Earth's surface (such as the point where the prime meridian crosses the equator) is the origin. It is also different from the Universal Transverse Mercator system, where (in the northern hemisphere) the origin of the coordinate system is in the lower left corner, which is similar to a conventional cartesian system. Often, the distances between cells in the raster are constant in both the row and column directions; in other words, the cells in the raster are square. In this case, it is natural to store the data on a computer in a two-dimensional array. While a simple rectangular raster structure is a very popular approach, there are at least two limitations. First, for a raster structure at a particular scale, there is a finite limit to our ability to specify location. We are in either one cell or another - there is nothing in between. This is true because the line separating adjacent cells is considered to be infinitely narrow. This is the case for any raster data structure, including the non-rectangular forms we'll examine in a moment. Second, adjoining cells may not be evenly spaced, depending on how we define the word adjoining. Consider the center cell in Figure 4.2a. The cells above and below are 1 unit of distance from the center cell, while the cells on the diagonal are approximately 1.41 (the square root of 2) units of distance from the center. For searches through the data, if we include only cells above and below a cell of interest, we are working in a 4-connected neighborhood (Figure 4.2a), and all cells are equidistant from their neighbors. Thus, neighboring cells share an edge. If we include elements on the diagonal, we are working in an 8-connected neighborhood (Figure 4.2b), and now cells are not evenly spaced. In this latter case, some cells in the neighborhood share an edge, while others share only a vertex. Since all the cells in these two examples have neighbors of the same size and shape, we say that we have spatial neighborhood similarity (Tobler, 1979). An alternative is that each cell in the neighborhood has a different size and organization. While this is certainly reasonable for many kinds of geographic phenomena (urban neighborhoods are of different sizes and shapes, for example), it is not a common data model in a raster geographic information system. Whether it is more appropriate to base an analysis on 4-connected or 8-conaected space must be determined by the characteristics of the data and the objectives of the exercise. In many metropolitan areas of recent design, the streets are laid out in a rectangular network. If a raster data layer is used to represent this transportation network, a 4-conaected analysis of travel distances is appropriate, since diagonal motion is not ordinarily possible. In contrast, an 8-connected analysis of travel might be appropriate for a raster database of the weight of cotton harvested per hectare, where the objects traveling through space are airborne insect pests that do not respect the orientation of the local roads. It isn't necessary that the size of a raster be the same in the row and column directions. But for reasons of simplicity and symmetry, applications based on the use of rectangular raster cells are rare. It is also important to recognize that there are two different theoretical interpretations of the value stored in a cell of our sample raster of elevations. A cell's value might represent an elevation measured at the center of the cell, or the value might represent the average elevation of the entire cell. It is likely that neither interpretation is truly correct, although we typically operate as if the latter is our underlying model of truth. For practical purposes, the distinction rarely matters if we've chosen an appropriately small cell size. The size of the raster cell in a dataset is sometimes confused with the minimum mapping unit, i.e., the smallest element we can uniquely represent in our data. However, raster cell size and minimum mapping unit are not quite the same. Choosing an appropriate minimum mapping unit (or resel, an abbreviation for resolution element, as discussed in Chapter 1) for a study is a very important decision in the design phase of a project. Consider a dataset in which we wish to record vegetation types. In Figure 4.3a is a sketch map that might have come from a field mapping exercise. In the study area, a patch of evergreen trees is surrounded by grassland. One way to manually convert this data to a raster form is to overlay a regular grid on the sketch map (Figure 4.3b) and to assign a vegetation class to each cell, based on which class covers the majority of the cell. Consider the problem shown in Figure 4.3c. A different grid with larger cells has replaced the first grid. Four raster cells overlay the small area where we have grassland surrounding the stand of evergreen trees. Based on the majority rule we chose above, the evergreen stand will not appear in the database. Each of four cells is covered by a small piece of the evergreen stand, but in each of these cells, the evergreen stand is only a minority component of the total cell area. If the size of the raster cells had been different, the stand's existence would have been captured, as it was in Figure 4.3b. We might have even captured this stand if the cells in the raster had been oriented differently. This illustrates the need to understand the relationship between the minimum mapping unit and the raster cell size. A convenient rule of thumb, based on statistical sampling theory, is to use a raster cell half the length (or one-fourth the area) of the smallest feature you wish to record. A more conservative suggestion would be to use a raster cell one-third or one-fourth the length of the smallest desired feature. To reinforce this important issue, the size of the resels in a dataset must be significantly larger than the size of the raster cells, or we run the risk of losing important spatial objects when we build the database. Geometrical figures that completely cover a flat surface are called tessellations. The square raster structure we've just discussed is one such tessellation. Triangles and hexagons are two other tessellations of the plane (Figure 4.4). There has been a continuing interest in regular hexagonal cells as the basis of spatial data structures, in part because in a hexagonal tessellation of the plane, all neighboring cells are equidistant, unlike the situation in a raster of square cells (Burt, 1980). However, the use of a hexagonal or even a triangular data structure creates two problems that may be significant in some circumstances. First, the cells cannot be recursively subdivided into smaller cells of the same shape as the original cells, as is the case in a square system. Conversely, a hexagon made up of smaller hexagons will not be the same shape as those smaller hexagons. A third and less important point is that a numbering system for a hexagonal system is more complex than that of a square system, imposing at least a small additional overhead in system operations. Let's return briefly to the golf course example from the first chapter and create a raster database for this site. The simplest of the three datasets we considered was the map from the local planning agency, which describes the legal property boundaries, streets, and restrictions on construction and development due to easement for public utilities. In Figures 4.5a and 4.5b, respectively, we show a version of the original map, and a raster-converted representation of the map. The numbers in each cell indicate the permitted land use for each cell, using a majority rule as discussed above. By adding up the number of cells in each category, we can determine the fractional area coverage of each land use in the map: Land-Use Total Percent Category Class Cells of Total --------------------- ------- ------ --------- Roads 1 213 39.0% Easement Restrictions 2 129 23.6% Unrestricted Development 3 204 37.4% Raster datasets in practical use can be very large. As an example, satellite remote sensing data is frequently used to distinguish categories of land cover over large areas. The standard view of the earth from the U.S.'s Landsat series of satellites covers approximately 30,000 square kilometers, which at a nominal pixel size of 30 meters, corresponds to approximately 35 million raster cells or pixels (where pixel is a contraction of "picture element"). When dealing with such large datasets, there are several algorithms used to compress the data. Some of these algorithms are completely reversible; that is, we may recover exactly the original datasets. Others minimize the volume of the stored data by losing a (preferably) small and controlled amount of the original information. We will briefly mention two perfect-recovery compression mechanisms. The first, called run-length encoding, tries to exploit the fact that many datasets have large homogeneous regions. Consider the raster data in Figure 4.5b. The data attribute values at the beginning of the sixth row from the top are: 1 1 1 3 3 3 2 3 3 3 3 3 3 3 3 3 In a run-length encoded version, the original data is replaced by data pairs or tuples. The first number in the pair is a counter, indicating how many repetitions of the second number, the data value, occur starting at that point in the row. Thus, three cells in a row with data value 1 are compressed from three elements (1 1 1 ) to two (3 1). The data from the beginning of row 6 in our example would then become: (3 1) (3 3) (1 2) (9 3) Thus, we have three 1s, followed by three 3s, followed by a single 2, and then nine 3s. In this case, the original data occupied 16 elements and the compressed data 8 elements, for a compression factor of 50%. Note that we have assumed that the data elements in the run-length encoded file-both attribute values and repeat counts - occupy the same amount of space. The effectiveness of this compression mechanism varies with the dataset. In the worst case, where there are no repeating sequences at all along the rows of the array, the algorithm will make the dataset twice as large. For binary data (in other words, data with only two possible classes), Burrough (1986) shows another kind of run-length coding, where in a single row, the position of a cell where a run begins is stored. A second technique for compressing raster datasets uses what are called chain codes. In some instances we can consider a map as a set of spatially referenced objects placed on top of a background. The use of chain codes takes this point of view. The coordinates of a starting point on the border of an object (for example, a reservoir) are recorded, and then we store the sequence of cardinal directions of the cells that make up the boundary (Figure 4.3c). This may be an efficient means to store areas, particularly since each spatial object is kept as a separate entity in the database. However, some kinds of processing will require that the entire raster array be reconstituted which may be an unacceptable cost. 4.1.2 Hierarchical Raster Structures There has been a great deal of interest recently in a family of enhancements to the standard square-celled raster structure. We will introduce these modifications through an example. Consider a set of digital elevation data values, where the fundamental data are stored on a 50-meter square grid (that is, each cell represents a square that is 50 meters on a side). Rather than storing this information as a single layer in our GIS, we shall store it in several interrelated layers. One layer corresponds to the original 50-meter interval raster data. A second layer consists of data resampled to a 100-meter interval. Each cell in the 100-meter layer is the algebraic average of four cells in the 50-meter layer (Figure 4.6a). A third layer is created by averaging four 100-meter cells to create 200-meter cells. And we could continue this spatial averaging process, decreasing the spatial resolution at each "higher" layer, until at the highest layer we might have a single pixel, whose elevation value is the numerical average of all the data in the original 50-meter layer. As an aside, this only works perfectly - that is, yields a single pixel in the highest layer - if the original 50-meter data was a square array of 2 pixels on a side. In general, this is called a pyramidal data structure, since we can imagine each of the derived layers stacked on top of previous layers, in the shape of a pyramid. If each higher level layer has pixels that are exactly twice as wide as the previous (and thus, four times the area), as in our example, this is called a quadtree data structure (Samet, 1984). The name quadtree comes from the four-fold reduction in number of pixels in each layer, and the fact that that structure is easily pictured (as well as represented in the digital database) as a tree (Figure 4.6b). Leaves at the bottom of the tree, in this case, represent the elevation values in the original 50-meter data layer, and branches in the tree represent the averaging process as connections between pixels or cells in the different layers. Tobler and Chen (1986) discuss a modified quadtree system that may be useful for coding data for the entire Earth's surface. The single node at the top level of the tree represents the entire planet. At the 15th level, the resolution of the cells is comparable to that of meteorological satellites. At the 26th level, the spatial resolution is comparable to most aerial photography, and at level 30 there is centimeter-scale resolution over the globe, which Tobler and Chen indicate is adequate for geodetic control points in many applications. Data stored in a simple quadtree will occupy somewhat more storage space than the original raster. This is true because the original raster is a component of the pyramid, and forms the lowest layer in the quadtree. Consider a raster with 32 pixels on a side. This raster requires (32 times 32) or 1024 cells be stored. If we include the higher level layers, we require: Layer Width in Cells Total Cells 1 32 1024 2 16 256 3 8 64 4 4 16 5 2 4 6 1 1 which totals 1365 cells. For this small example, we increase the storage overhead 33%, since there are 33% more cells in the complete tree than in the original raster. However, this storage cost can be offset by some distinct advantages. There may be some processing steps that do not require the full resolution of the original raster dataset. If the data have been stored in a quadtree, a given step in the analysis may be based on whatever layer in the quadtree holds information of the appropriate resolution or scale, potentially saving both computer system input/output and processing time. A situation in which this advantage may be significant is where different datasets in the geographic information system have different mean resolutions. This is one of many examples we will discuss of the classic database management trade-off of storage costs versus processing costs, which we will see again in our discussions. Additionally, the hierarchical nature of the quadtree may permit us to minimize some kinds of search through the database (Smith et al., 1987). For example, if we are looking for a high elevation in the database (but we don't necessarily need the highest elevation), we could instruct our system to work down the quadtree: first find the quadrant in the next-to-uppermost layer with the highest average elevation, and ignore the rest of the data. In Figure 4.7, the lower-right quadrant of the two-by-two layer of the quadtree is the highest, with an elevation of 121 meters. We therefore consider only this fraction of the data at the next level down, removing 75% of the data from further consideration. Continuing down towards the highest-precision data layers in Figure 4.7, we find the highest cell in the lower-right corner of the four-by-four layer at an elevation of 127 meters. Repeating the procedure in the next level of the quadtree, we locate the cell marked as 134 meters. In total, we have examined 12 cells, out of total database of 64 elemental cells at the bottom of the tree, or 84 cells in the complete tree. In this way, we can search only a small fraction of the entire database, and still find an acceptably high place for many purposes. However, with this search strategy there is no guarantee that we will find the highest location in the database. If we stored more than just a single data value at each node in the quadtree, we can even satisfy a specific search for the highest location in the database. Consider a version in which we store three values at each quadtree node: the mean elevation of the appropriate area, as well as the minimum and maximum known elevation values in this area. In this case, we can use a search strategy parallel to the one discussed above to be able to find the highest (or lowest) cell in the entire region in a very small amount of time. This is another example of the compromise between storage and processing. Here we have additional storage costs (since we store three values in each cell) in order to decrease the time required to search the database. Another way in which a quadtree data structure can be used to minimize search effort involves categorical data. Imagine a quadtree for storing a number of land-use classes. At the base of the tree, each cell takes the attribute for the dominant class in the cell. However, each composite cell at the each higher level of the tree does not have to store just the dominant class for the cell or just the fact that the cell is not homogeneous; the higher-level cells are not limited to storing a single value. Instead, each higher-level cell could store a list of all the land-use classes stored in the cells beneath it. In this way, when we search through the quadtree looking for areas with certain kinds of land use, going from less geographic detail to more, we can discard limbs of the tree with no error. A modification of the quadtree can be used to minimize system storage under some circumstances. In a maximum block representation, we systematically eliminate all the redundant information in the tree. Consider the land-use example in Figure 4.8, where each cell in the original raster (Figure 4.8a) is coded in a binary fashion for land use. In this example, 0 represents undeveloped land surrounding a city, and 1 represents urban land use. In the associated quadtree, open circles represent the undeveloped land, and solid boxes represent the urban land use. For each lower-resolution level, we average the pixel values in the higher-resolution layer, to code for the fractional coverage of urban area in a cell. The numbering scheme to specify cells in a quadtree proceeds from the upper left in the data array; cell numbers are recorded in the upper right corner of the individual cells in the accompanying figures. Starting with the 2-by-2 area in the upper left, cells 1 through 4 are the upper left, upper right, lower left, and lower right, in sequence. The 2-by-2 area in the upper right is then numbered in the same pattern, then the area in the lower left, and finally we number the cells in the lower right. In a larger data array, we would then number the cells in the upper right corner 4-by-4 array, and so on. However, notice that there is redundancy in the companion quadtree to this raster array (Figure 4.8b). Cells 5, 6, 7, and 8 are all class 0, thus their average is zero; therefore specifying the exact attribute value of zero in the higher level layer tells us everything about the coordinating pixels in the lower-level layer. In this case, we could leave out the nodes in the tree at the lower (or increased resolution) level, thus decreasing the storage costs without losing any information. This is the maximum block representation (Figures 4.8c and 4.8d) of the quadtree. In this case, the maximum block quadtree requires eight fewer nodes to describe the spatial arrangement of land use. This revised tree requires a bit more processing to construct and use than a simple quadtree, since one must test to determine whether a cell at a particular resolution exists as a unique entity, or is represented as a part of a composite larger-area node in a different layer. One problem with the quadtree data structure is that it is not invariant under translation, rotation, or scaling. This is the same problem that is found with any arbitrary subdivision of datasets. Arbitrary boundaries are placed on the underlying continuous data to develop the quadtree, and the locations of the boundaries can strongly affect the resulting derived data. Consider a simple example of translation along one of the major axes of the raster array. Figure 4.9a shows a simple object composed of six shaded raster cells, along with the maximum block quadtree representation. The shaded cells in the raster array are indicated in the tree by filled squares; the white background is indicated in the tree by open circles. What happens when we translate the object to the east by one cell width? This could represent either another object of the same kind in a different location, or another quadtree with the same cell size but a different starting location. The object in Figure 4.9b looks the same to the human interpreter, but evolves into a quadtree with a very different shape. One solution to this problem is offered by Scott and Iyengar (1986), who discuss a related data structure that is translation invariant – that is, if the entire figure is moved, the characteristics of the data structure will not change. Their system is based upon (1) recognizing homogeneous square regions in a raster, and (2) recording both the size of the block and the coordinates of the block's upper left-hand corner. 4.2 Vector Data Structures A mathematician might define a vector as a quantity with a starting coordinate, and an associated displacement and direction (or bearing). In a description of spatial data based on vectors, we make the assumption that an element may be located at any location, without the positional constraints of a raster array. To briefly review the previous section, raster data structures are based on (usually regular) decompositions (i.e., "breakdowns") of the plane. In raster-based systems for representing spatial data, our ability to specify a location in space is limited by the size of the raster elements, since we are unable to know anything about different locations within a raster cell. In other words, there is a limit to geographic specificity. Vector data structures are based on elemental points whose locations are known to arbitrary precision, in contrast to the raster or cellular data structures we have described. As a simple example, to store a circle in one of the raster data structures above, we might find and encode all the raster cells (of a pre-defined shape and size) whose locations correspond to the boundary of the circle. This might be called a low-level description of the circle. A high-level description, on the other hand, might efficiently store the circle by recording a point location for the center of the circle, and specifying the radius. In this example, note that the high level description, based on a vector representation is more efficient in terms of the amount of data required, as well as more precise, if we have a means to indicate the geometric object circle. Many computer graphics and computer-aided-design (CAD) systems use vector-like models as their internal data organization, using primitives such as points, lines, and circles. These elements may also be found in the computer graphics language standards, such as GKS (Enderle et al., 1984), as well as many personal computer languages such as BASIC. These advantages may disappear, however, if we have to store the circle as a connected sequence of straight-line segments. For spatial data in most geographic information systems, the coordinate data is encoded, and after input processing, is stored as some combination of points, lines, and areas or polygons (Males, 1977; Peucker and Chrisman, 1975; and Peuquet, 1977). Several forms of vector data structures are in common use, both as representative database types within geographic information systems, as well as standards for data transfer between systems: whole polygon structure Dual Independent Map Encoding (DIME) file structure arc-node structure relational structure digital line graphs We will discuss each of these in turn. 4.2. 1 Whole Polygon Structure In a whole polygon structure, each layer in the database is divided into a set of polygons (Figure 4.10). Each polygon is encoded in the database as a sequence of locations that define the boundaries of each closed area in a specified coordinate system (sometimes called a boundary loop). Each polygon is then stored as an independent feature. There is no explicit means in this system of referencing areas that are adjacent. This is, to some extent, comparable to the chain-coded raster discussed in the previous section: in both whole polygon structure and a chain coded raster, the emphasis is on the individual polygonal areas, where each discrete area is stored separately. Attributes of the polygons, such as land cover or ownership, may be stored with the coordinate list. By maintaining each polygon as a separate entity in this way, the topological organization of the polygons is not maintained. By topology, we refer to the relationships between different spatial objects: which polygons share a boundary, which points fall along the edge of a particular polygon, and so forth. In a whole polygon structure, line segments that define the edges of polygons are recorded twice - once for the polygon on each side of a line. Similarly, points that are shared by several polygons, such as location (4,2) in the example, will also be represented several times in the database. With this organization, editing and updating the database without corrupting the data structure can be difficult. 4.2.2 DIME Structure The Dual Independent Map Encoding or DIME file structure was developed for use by the U.S. Bureau of the Census (Cooke and Maxfield, 1967). It was designed to incorporate topological information about urban areas for use in demographic analyses (Cooke, 1987). While DIME files themselves do not generally correspond to the internal database organization of a GIS, they are in common use as an archive data format as well as a defined format for data exchange between different systems. The basic element of the DIME file structure is a line segment defined by two end points or nodes. Figure 4.1l illustrates the structure of the DIME file, slightly simplified. The line segments and nodes are shared by adjacent polygonal units. In this structure the line segments are assumed to be straight. When curved lines re needed, they are represented as sequences of straight line segments. Each line segment is stored with three essential components: a segment name (such as the name of the street) that identifies the segment, node identifiers for the "from" and "to" endpoints of the segment, and identifiers for the polygons on left and right sides of the segment. A number of additional attributes may be coded in the DIME file structure, in order to store additional information about the various spatial objects. When the segment is a part of street, the address ranges for both sides of the street may be stored. A field is available for segments that are not streets, to indicate such features as a proposed street or the shoreline of a lake. Additional attribute fields, labeled by header numbers, are available for groups of segments, such as a telephone exchange, mailing address code (such as ZIP codes in the United States or similar mailing codes in the United Kingdom and Canada), or voting districts. There is a separate set of fields kept for logically grouping and labeling segments. In the example in the figure, the correspondence between headers and the mailing and telephone exchange codes are represented. A major disadvantage of DIME structures lies in the difficulty of manipulating complex lines, as in functions that require search along streets. Since streets are broken into discrete street segments by the cross streets, it is significant computational effort to follow the segments in sequence when required. An advantage of the system for some applications is its ability to catch addresses of spatial objects in multiple files, since the addresses are explicitly stored in the DIME file. 2.3 Arc-Node Structure In an arc-node data structure, objects in the database are structured anarchically. In this system, points are the elemental basic components (figure 4.12). Arcs are the individual line segments that are defined by a series of x-y coordinate pairs. Nodes are at the ends of arcs and form the points of intersection between arcs. There may be a distinction made between nodes at the ends of lines, and points that are not associated with lines (as we had in our discussions earlier). Polygons are areas that are completely banded by a set of arcs. Thus, nodes are shared by both arcs and contiguous polygons (Peucker and Chrisman, 1975). Several commercial geographic information systems use forms of this arc-node data structure. Arc-node structures permit us to encode the geometry of the data with no redundancy. Unlike the whole polygon structure, points are stored only once. They are reused as often as necessary, as in Figure 4.12 where point 1 is a component of polygon A as well as lines I and II. In an arc-node database, we can easily include attributes of various kinds. In the street and traffic control example in Figure 4.12, attributes are explicitly linked to geometry. For example, traffic control device descriptions are stored with the relevant nodes, and roadway length and pavement condition are stored with the appropriate arcs. Notice that in this arc-node transportation example we have stored the lengths of the arcs. This was a conscious decision that adds some redundancy to our database, since we could in theory calculate these lengths from the coordinates of the nodes at the arc ends. If our application makes intensive use of arc lengths, it may be more efficient to determine the lengths once when the database is constructed, rather than recalculating this value many times as the data is used. This is another example of the classic problem in database management systems of achieving a balance between storage and retrieval costs in comparison to processing costs. In a modern GIS, the user should be able to affect this balance, based on specific applications requirements. 4.2.4 Relational Structure Another form of arc-node vector data organization is sometimes called relational data structure. In the last arc-node example, data attribute values were stored together with topological information. In a relational data structure, attribute information is kept separately. This has become a popular design strategy for several commercial geographic information systems. We'll recast the same data from in the arc-node example of Figure 4.12 into this alternative organization. The topological data in a relational structure is organized in the same fashion as in an arc-node organization. The principal difference is in the attribute values. These are stored in relational tables (as in a relational database management system, which is discussed in more detail in section 7.3.1) as in Figure 4.13. These tables are straightforward: a row in a table represents a single data record, and the columns represent the different fields or attributes. One of the relational tables keeps track of the attributes of points or nodes in the spatial database. In our example, this table records pointers to the nodes, the kind of traffic control that exists at this node, and whether there are marked crosswalks at the street intersection. Similar tables for arcs might hold information about road lengths, pavement type, and the dates the street had been repaved. The tables for polygons might record area, zoning, assessed value, and ownership. Notice that this manner of storing data is directly comparable to that of the arc-node format in the previous section. The main difference is that, in the relational structure, the attribute data are maintained separately from the topological information. Thus, there are more separate files and pointers to maintain. More detail on the relational database model is found in section 7.3.l. General-purpose database management software is available for the relational model. Geographic information systems that take advantage of this software may cost less to develop and may provide added flexibility, by relying on a commercial database management system. 4.2.5 Digital Line Graph Structure The U.S. Geological Survey has made tremendous investments in digital cartographic data, and in the process, has developed a number of standards for such data. One of the best known is a standard vector file format known as a digital line graph (DLG). The agency is producing these vector files based on the source materials used to compile the USGS 7.5-Minute and 15-Minute Topographic Maps Series. A separate set of DLG data files is based on l:2,000,000 scale map products. The data contents of the DLG files are subdivided into different thematic layers (Allder and Elasal, 1984). One layer consists of boundary information, including both political and administrative boundaries in the region. A second layer is for hydrographic features. A third layer is for the transportation network in the area. Finally, the fourth layer is based on the Public Land Survey System, which has as its focus a survey system administered by the U.S. Bureau of Land Management. The essential data elements of the DLG level 3 structure are similar to the other vector data structures discussed in sections 4.2.2, 4.2.3, and 4.2.4. Nodes represent either end points of lines or line intersections, while additional points are used where required to indicate significant features along lines. Lines have starting and ending nodes, and as such, permit us to specify a direction along the line as well as the areas on the left and right sides of the line. A special degenerate line is defined as a line of zero length, and is used to define features that are indicated on the map as a point. Degenerate lines are recognizable because they have the same starting and ending node. Areas in the DLG format are completely bounded by line segments. Each area may have an associated point that represents the characteristics of the area; the point location is arbitrary, and may not even be within the area. The point, line, and area elements provide information about topology and location. In addition, there is an extensive system for coding attribute information for the elements. The attribute codes are based on those features represented on USGS topographic maps. Attribute codes are structured in a specified way, with both major and minor code components. Allder et. al. (1984) provide the details of the attribute coding. The major code is three digits long. The first two digits denote the general category of the element, while the third digit provides additional detail. A few of the major code categories are: Major Code Category 020 Hypsography 050 Hydrography 070 Surface Cover Minor codes are four digits long. The first digit is generally zero. The remaining three digits provide further detail: Minor Code Description 001 - 099 nodes 100 - 199 areas 200 – 299 lines 300 - 399 degenerate line 400 - 499 general-purpose codes 600 - 699 descriptive codes The general-purpose codes are used for those features that may be digitized as a node, an area, or a line, depending on the feature's size and position. The descriptive codes are used to provide additional information when required. Several attribute codes have special meaning. Code 000 0000 is reserved for the area outside a given map sheet; remember that the DLG data series are based on topographic maps. Other codes are reserved for such information as features that have been revised photographically, and those features that cannot be identified from the available source materials. To illustrate some of the detail that may be stored in the DLG format, we present a few of the codes from the hydrography DLG data layer. Note that the 050 major code specifies hydrography. Nodes Areas 050 0001 Upper end of stream 050 0101 Reservoir 050 0004 Stream entering water body 050 0103 Glacier 050 0005 Stream leaving water body 050 0106 Fish hatchery Lines Degenerate Lines 050 0200 Shoreline 050 0300 Spring 050 0201 Man-made shoreline 050 0302 Flowing well General Purpose Attributes General Descriptive Attributes 050 0400 Rapids 050 0601 Underground 050 0401 Falls 050 0603 Elevated 050 0406 Dam or weir 050 0604 Tunnel A DLG level 3 data file contains a number of header records, followed by the data records. The header records provide information about the date of creation of the file, map projection and coordinate system, and the number of points, lines, and areas stored in the file. Data records for nodes include a description (including the node location), the major and minor attribute codes, and a text string. Data records for areas include a description (including the coordinate location of the representative point), the attribute codes, and an associated text string. Data records for lines include a description (including identification of the starting and end nodes and the areas on the left and right), an ordered sequence of x-y coordinates along the line, the attribute codes, and a text string. There is an optional final record with information about estimated accuracy of the data. Allder et al. (1984) present greater details of the DLG format, as well as example data files. We point out that there is a new development that is expected to supplant the DLG and DIME file structures for use in the 1990 U.S. Census, called TIGER: the Topologically Integrated Geographic Encoding and Referencing system. A detailed discussion of TIGER is outside the scope of this section. 4.3 Comparisons between Data Structures A data storage structure may or may not incorporate topological information, describing not only an object's position, but also its spatial relationships with respect to neighboring objects. Topological information is important for many kinds of analyses, including automated error detection, windowing for analysis and graphic presentation, network applications, determining whether a point falls in a specific polygon, proximity operations, polygon overlay, and other intersection procedures (Peucker and Chrisman, 1975). However, if your application does not require this kind of detailed information about the relationships between spatial objects, the additional overhead of explicitly treating topology may significantly complicate the tasks of database creation and update. For example, unstructured vector lists may be perfectly adequate for some kinds of routine data display. Some kinds of topological information are implicit in spatial data. In a simple raster-structured data file, for example, there is a specified spatial organization for the data, with no gaps in the fabric of the database. The regularity in the array provides an implicit addressing system. This permits rapid random access to specified locations in the database. Thus, we know immediately those cells that are adjacent to any target location, and we can easily find and examine those regions that bound a specified group of cells. Topological information in vector structures is often coded explicitly in the database. Line segments within DIME files, for example, have identifiers and codes for the polygon on either side. When needed topological relationships are not explicitly coded in vector data structures, it can be relatively expensive and time-consuming to get the system to come up with them. For example, in some operational systems, data entry from maps starts with a digitizing process that does not require that the operator exp1icitly relate various line segments and polygons that use these segments as parts of their boundaries. Instead, after the operator creates what is sometimes called cartographic spaghetti, a batch process link points, lines, and polygons together into a topologically structured database. This is another illustration of the trade-off between the efforts expended when developing a database and additional storage costs versus costs and speed during analysis and retrieval. The traditional advantages and disadvantages of raster versus vector spatial data structures have been documented by Kennedy and Meyers (1977) and Durfee (1974), among many others. Basic issues include data volume (or storage efficiency), retrieval efficiency, robustness to perturbation, data manipulation (or processing) efficiency, data accuracy, and data display. Some of these differences, however, are less important in modern GIS implementations. Comparisons of data volume between raster and vector systems is entirely dependent on the database contents, as well as considerations of accuracy and precision. Some may even argue that it is inappropriate to compare the two in terms of data volume, since the characteristics of the two are so different. For example, an elevation dataset is generally stored as a complete cellular array in a raster, and as line segment storing the locations of lines of constant elevation (i.e., the contour lines) in a vector model. These are fundamentally different views of the underlying spatial information: the raster is quasi-continuous, the vector is clearly discrete; the raster representation may be considered more dense than the vector because more unique values are stored. Comparisons of processing efficiency are difficult in modern systems as well. Traditionally, overlay operations are thought to be more efficient in a raster system (see section 8.1.1). In current systems, there may be an efficient means to determine the approximate locations of polygons by maintaining a separate index database. Using such an index to structure a search through the spatial data, a comparison of raster and vector data structures based on processing speed may be more sensitive to the spatial data itself than to the choice between the two data structures. Durfee (1974) and others have suggested that modern geographic information systems should be able to accept, store, manage, retrieve, and display both raster and vector data types, as well as convert from one data structure to the other (Kennedy and Meyers, 1977 and Cicone, 1977). Unfortunately some existing GIS implementations do not have this capability. If forms of both raster and vector structures are found in a GIS, as well as structure conversion routines and appropriate analysis tools for each data type, then the data could be stored in their "natural" form to both optimize geographic specificity and minimize conversion costs and attendant bias. This also permits analytic procedures to operate on a data structure where efficiency or accuracy are highest. While this strategy is more complex than one in which all data are stored and manipulated in a single data structure, efficient software and hardware for vector/raster conversions reduce the size of the problem. The quadtree, as a more sophisticated means of working with raster data, may provide significant performance improvements over simple methods of raster storage. In exchange for increased complexity of the software to store and retrieve the data elements, as well as increased costs at the start to create the tree structure, there can be tremendous improvements in search speed, as we have discussed. A number of research groups, as well as commercial firms, are developing geographic information systems based on the quadtree data model. In particular, merging hierarchical raster data structures with vector data structures may take advantage of the benefits of each. We briefly examine one of these research systems in Chapter 12. The performance of a pure arc-node database varies with the application. When you have retrieved a geometric entity (point, line, or polygon), you have also retrieved all the relevant attributes. If the attributes of each geometric entity are required, no additional database operations are required. On the other hand, there is no way to work with the geometrical entities without incurring the overhead of moving through the attribute data as well, since they are in a common database. This may exact significant performance penalties during the retrieval process. The relational data structure has the potential for efficient search, at the expense of data file management complexity. As we have seen, such a system design permits search through either the geometrical entities or the attribute data, without the other getting in the way, since these two kinds of information are stored separately. Thus, one expects better data retrieval performance for simple kinds of search, which should result in more efficient operations. In any event, such a system can minimize the computer's input/output operations, at the expense of more complex file management, which may be particularly important when working on multi-user systems.