Chapter 8  Manipulation and Analysis

We will discuss the analysis of data from data planes in a GIS without regard to the underlying data structure (e.g., whether the data is stored in a raster or vector format). With algorithms available for both raster and vector data structures for virtually any analytic technique, as well as conversions from one structure to the other, which we have already discussed (Cicone, 1977; Peuquet, 1977; and Marble and Peuquet, 1977), a user of a geographic information system should be more concerned with the requirements of the analytic task at hand, and less so with the internals of the database. Capabilities should also exist for the direct manipulation and analysis of the non-spatial attribute data files -- these are briefly discussed in the data management chapter.

The distinction between manipulation and analysis is rather arbitrary. The Oxford English Dictionary defines manipulation as "the handling of objects for a particular purpose." In our case, the objects we are handling are the spatial datasets in the information system. The OED definitions of analysis, on the other hand, revolve around "the discovery of general principles underlying . . . phenomena." Some of these procedures will modify the form or structure of a data layer in specific ways; some authors consider these manipulations. Other more general-purpose procedures can create data that may be non-spatial (such as the mean elevation in a region of interest or the number of linear miles of roadway in a city) or can create new information in a form similar to the original data (for example, creating a data layer of elevation in meters referenced to a local benchmark, which is derived from elevation in feet referenced to a regional benchmark). The development of new derived data layers, which may form the input to further analysis, is an important function of any GIS. Figure 8.1 outlines the procedures we will examine in the rest of this chapter.

  Reclassification and Aggregation
  Geometric Operations
    Rotation, Translation, and Scaling
    Rectification and Registration
  Centroid Determination
  Data Structure Conversion
  Spatial Operations
    Connectivity and Neighborhood Operations
  Measurement
    Distance and Direction
  Statistical Analysis
    Descriptive Statistics
    Regression, Correlation, and Cross-tabulation
  Modeling

Figure 8.1 Data manipulation and analysis operations.

Many of these functions (Figure 8.1) are needed in the preprocessing phase discussed in Chapter 6, and we will try to be consistent about providing cross-references to the appropriate sections. Goodchild (1987) organizes spatial analysis functions along different lines. He discusses a set of basic classes of analytic functions, based on the inputs and outputs of the analysis process. For example, one of his classes is the set of functions that require only attribute information from a class of spatial objects. An example of such an operation would be to count the number of water wells in a database. Another of his classes involves functions that examine the characteristics of object pairs. An example of this class is a function that calculates distances between pairs of objects. This overall approach, developing a taxonomy of analysis functions, is reminiscent of the toolbox viewpoint discussed at the end of Chapter 3.

8.1 Reclassification and Aggregation

Frequently, the original data that is available to an analyst is not quite right for the task at hand.
Two of the common problems are:
1. The categories of information in the datasets need to be modified in some way to make them appropriate for the intended use. This arises frequently with both nominal and ordinal variables. Thus, the attributes themselves require modification.
2. The data are not at the right resolution -- the spatial database is effectively at the wrong scale. This latter situation may be a problem for both raster and vector data structures, where the original specifications for data collection and database development did not capture all the spatial features required in later processing steps.
We will examine these two problems separately.

8.1.1 Attribute Operations

The original coding of attribute data may not be appropriate to later analysis tasks. For example, the categories of surface rock types in a geologic map may be too detailed for a particular purpose. It is easy to imagine an engineer needing to know whether a site is suitable for the construction of a specific facility. Rather than providing the list of the original geologic classes and information about the regions they cover, it would be more efficient to present the engineer with recoded data, in which the classes are described simply in terms of their suitability for the construction project: suitable or unsuitable. Following this operation, the boundaries between the new suitability classes can be readjusted. This is often necessary after a recoding operation, since (in this case) a single suitability class may come from more than one original geologic class. The recoded data may have redundant boundaries that must be removed, so that a map that directly addresses the user's needs can be drawn.

Another example of attribute recoding comes from some of our recent work in forests. In Figure 8.2a, we present a map of tree species groupings, which may be based on both visiting the sites and aircraft observations. One of the research goals in this kind of study is to determine whether satellite and aircraft multispectral scanners are able to distinguish between coniferous and deciduous trees, based on their multispectral signatures. In Figure 8.2b, we have recoded the polygons that were labeled by tree species into one of two classes -- deciduous or coniferous. Finally, in Figure 8.2c we derive the final map by removing the redundant internal boundaries. This is a more complex operation in a vector-based data structure than in a raster-based one, since polygons are usually coded explicitly in the former and implicitly in the latter. In a vector data structure, eliminating a redundant line may require removing not only the line itself but also points which are no longer necessary, and then merging the appropriate polygons.

There is a much wider range of possibilities when working with more than one data layer. The overlay procedure is one of the simplest operations to understand; we've already seen an example in the golf course planning exercise in Chapter 1. When overlaying two data layers, a matrix can be a useful tool to describe the operation. As an example, we might consider what is called a trafficability problem: can a particular kind of vehicle travel across the terrain in an area, based on the slope and the type of soil? An example of this kind of analysis is found in Figure 8.3. The slope and soil data layers in this example are categorical variables, and form the inputs to the analysis. In Figure 8.3b, we show the matrix of possibilities that is the basis of the traversal data layer.
The columns in the matrix correspond to different soil types -- gravel, sand, and clay in this example. The rows correspond to different classes of slope -- level, moderate, and steep. The former is a nominal variable, and the latter ordinal. Each entry in the interior of the matrix tells us how easy it is for the vehicle in question to navigate a particular combination of slope and surface-soil type. For example, when the slope is moderate, the vehicle cannot travel over sandy soil, but can travel on a gravel surface. Figure 8.3c shows the resulting derived suitability map for traversal.

In a raster-based system, each cell in the input data provides a soil-slope data tuple as input to the analysis matrix, which determines the class of trafficability in the output data. For example, in a band-sequential raster structure, this will be a simple sequential algorithm. We read the value of the first element in the soil array and the first element in the slope array, send these values to a routine that derives the resulting trafficability class, and send this derived value to the first element in the output array. This process continues through all the elements in the raster arrays. In a vector-based system, the analysis is based on a polygon intersection algorithm, in which new polygons are created as needed, and redundant boundaries (as illustrated in Figure 8.2) are eliminated. Once the attribute lists of the new polygons are created, we pass through the attribute lists (the polygon file in a simple arc-node structure, or the polygon attribute file in a relational structure). For each polygon, the soil and slope attributes are sent to a routine as before, to derive a new trafficability value for each polygon, which may be stored in a new column in the attribute list. Note that the vector case requires a relatively complex geometrical operation to derive the intersected polygons, and with it the creation of appropriate new nodes and arcs and their combined attribute values. This geometrical process is not necessary in the raster case. On the other hand, we generally have relatively few polygons to process in deriving the desired answer, in comparison to the large number of cells in a typical raster database.

It is easy to see that this process can be generalized to larger numbers of data layers -- where in addition to slope and soils, a decision to drive a vehicle through an area may also depend on the vegetation cover (dense tree stands are hard to traverse, while open grassland is easy) and soil moisture (wet sand is hard, dry sand is easy). Furthermore, there are strategies that include the contributions of continuous variables (both ratio and interval), such as cost and time. Often, the manipulation of multiple data layers in this fashion is done in a stepwise manner: two input data layers combine to form an intermediate layer, and this intermediate layer is combined with the third source data layer to form another intermediate layer, and so forth.
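To make the raster algorithm concrete, the following sketch applies a trafficability lookup matrix to two co-registered raster layers, cell by cell. The class codes, the array contents, and the matrix entries are hypothetical stand-ins for the soil, slope, and trafficability categories of Figure 8.3.

```python
import numpy as np

# Hypothetical class codes for two co-registered raster layers.
# Soil:  0 = gravel, 1 = sand, 2 = clay
# Slope: 0 = level,  1 = moderate, 2 = steep
soil  = np.array([[0, 1, 1],
                  [2, 1, 0],
                  [2, 2, 0]])
slope = np.array([[0, 0, 1],
                  [1, 1, 2],
                  [0, 2, 2]])

# Trafficability matrix: rows are slope classes, columns are soil classes.
# 1 = traversable, 0 = not traversable (entries are illustrative only).
traffic_matrix = np.array([[1, 1, 1],    # level
                           [1, 0, 0],    # moderate
                           [0, 0, 0]])   # steep

# Cell-by-cell lookup: each (slope, soil) pair indexes the matrix.
trafficability = traffic_matrix[slope, soil]
print(trafficability)
```

The array indexing replaces the explicit element-by-element loop described above; a vector implementation would instead intersect the polygons first and then apply the same lookup to each new polygon's soil-slope attribute pair.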
Another form of the overlay process is found when the values of the original data layers combine mathematically to produce the output decision data layer. This is common when the data are ratio variables. Consider two data layers that may have come from an analysis of a natural forest: layer 1 contains polygons coded for the number of plant species found in this polygon in the understory, and layer 2 contains polygons coded for the number of species that make up the canopy. These two data layers may have come from separate field studies. To arrive at the total number of plant species per spatial unit, we in some sense "add" the two data layers. With raster datasets, one can easily imagine adding the values in each of the data layers, cell by cell. The software is more complex when we are dealing with a vector data structure, but the principle is exactly the same. Interior boundaries are created in the derived data layer wherever necessary, as described in the last example, and the attributes from the two input layers are added.

Logical (or Boolean) operations have many uses as well. Consider an example in which there are restrictions on construction and new development, based on specific characteristics at the site. New construction might not be permitted in any plot in a proposed subdivision where either (1) archaeological artifacts have been found, or (2) nesting sites for rare or endangered bird species have been discovered. When dealing with this kind of analysis, one data layer is normally coded for the presence of artifacts, and another layer coded for the presence of the target endangered species. The decision for construction opportunities is based on the Boolean operator "OR": if a lot is known to have artifacts, OR to have the designated target species, construction is forbidden. The OR operator creates the desired output data layer from the two input data layers. As before, with raster data structures, the operation proceeds on a cell-by-cell basis; with vector data structures, the geometry of the output data layer must first be derived before the attributes may be combined.

In this case, we have combined the exclusions together logically. If there are additional factors that could deny construction, we can sequentially take them into account by operating on the latest intermediate or derived layer, together with the next excluding layer. Similarly, we can include other characteristics that are required of a site with additional Boolean operators. To finish this example, we would overlay the subdivision boundaries on the derived exclusion data layer, and eliminate the excluded lots from consideration. The automated mapping-facilities management example in Chapter 12 shows precisely these kinds of requirements for both inclusion and exclusion.

There are several other Boolean operators. The "AND" operator is used to merge characteristics that are required at the same time. For example, new construction at a location is permitted when the local environmental study is completed AND plans for meeting any specific zoning restrictions have been developed. An exclusive-OR operator, abbreviated "XOR", is used to determine when one condition or the other is met, but not both.
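Both forms of overlay reduce to simple cell-by-cell array operations in a raster system. The sketch below, with hypothetical layer values, adds two species-count layers arithmetically and then combines two presence/absence layers with a Boolean OR to flag excluded cells.

```python
import numpy as np

# Hypothetical raster layers on a common grid.
understory = np.array([[3, 5], [2, 4]])   # plant species counts, layer 1
canopy     = np.array([[2, 1], [4, 3]])   # plant species counts, layer 2
artifacts  = np.array([[True,  False], [False, True ]])  # archaeological finds
nests      = np.array([[False, False], [True,  True ]])  # endangered-species nests

# Arithmetic overlay: total species per cell.
total_species = understory + canopy

# Boolean overlay: construction is forbidden where either condition holds.
excluded = artifacts | nests        # logical OR, cell by cell
buildable = ~excluded               # the complement gives the permissible cells

print(total_species)
print(buildable)
```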
There is another set of techniques for developing thematic descriptions from multiple attributes. Termed classification, these techniques start with several data layers and, based on a family of statistical procedures, derive a single multi-valued layer. Classification is designed to locate and describe relatively homogeneous regions, and is frequently used with remotely sensed data. Two different approaches are in common use: supervised and unsupervised. In a supervised classification, the analyst develops multivariate descriptions of the different classes of interest. In a land-use application, the analyst could statistically describe the characteristics of each class. A certain type of recreational area might be described in terms of:
- public land ownership
- presence of on-site parking
- presence of rest rooms with wheelchair ramps
- the area of the site is greater than 10 hectares
- the site contains a water body larger than 1 hectare
This description is sometimes called a training class for this land-use type, and we might have one or more specific examples of this type in the database. Other classes are described in a similar fashion. The GIS is then used to assign all the regions in the dataset to one of the described land use categories. There are a number of different decision rules that may be used to make assignments between the different classes, when there is no exact match between the available categories and the characteristics of a parcel or other delimited area (in a vector database) or pixel (in a raster database). In addition, various measures are available to examine the multivariate distance between the specified training classes and the parcel in question, and then assign the parcel to the nearest or most likely class (for example, see Moik, 1980). The procedure is called a supervised classification, since the analyst supervises the choice of the classes.

A user-friendly way to develop the list of classes and their characteristics is commonly found in systems that use multispectral image data. In these systems, the areas on the ground that the analyst chooses as training sites for categories of interest are indicated with a light pen or joystick. Next, the system determines the characteristics of these selected sites from the stored database, and then develops statistical descriptions of the class. These derived statistical descriptions then form the basis for the subsequent statistical classification.

Unfortunately, this is not as straightforward a procedure as it sounds. The difficulty stems from the problem of variance capture. When there are variations in the characteristics of the members of a class, it may be difficult to accurately and unambiguously describe the class as a whole. In the land-use class example above, the area constraints may not be exact; there may be recreation areas that are very much like those described, but slightly too small. In a multispectral sensor dataset, there may be water bodies with different levels of phytoplankton biomass, thus presenting the sensor with different amounts of green reflectance. A common practice in trying to minimize the problem is to develop descriptions of several training sites within a class, so that the variability in the class is explicitly considered.

Many of the implementations of supervised classification permit the analyst to calculate a measure of confidence in the labeling of each polygon. In a two-class example, if the probabilities of assignment of a polygon into one of the two classes are 92% and 8%, respectively, one has confidence in the result. If the probabilities of assignment are 51% and 49% for another polygonal area, however, the region may be placed into the same class, but the analyst is less comfortable with the reliability of the result.
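One of the simplest decision rules mentioned above is to assign each parcel or pixel to the class whose training description is nearest in the multivariate attribute space. The sketch below uses a minimum-distance-to-means rule; the class names, feature values, and training means are hypothetical, and a production system would more likely use a maximum-likelihood or similar rule that also accounts for class variance.

```python
import numpy as np

# Hypothetical training statistics: mean feature vector for each class
# (e.g., mean reflectance in several spectral bands, or coded attributes).
class_means = {
    "recreation":  np.array([0.20, 0.45, 0.30]),
    "residential": np.array([0.35, 0.25, 0.15]),
    "water":       np.array([0.05, 0.10, 0.60]),
}

def classify_minimum_distance(feature_vector):
    """Assign the observation to the class whose training mean is nearest
    (Euclidean distance) -- one of several possible decision rules."""
    distances = {name: np.linalg.norm(feature_vector - mean)
                 for name, mean in class_means.items()}
    return min(distances, key=distances.get)

print(classify_minimum_distance(np.array([0.08, 0.12, 0.55])))  # -> "water"
```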
In contrast, unsupervised classification is much like the statistical clustering procedures used in ecological studies (Noy-Meir and Whittaker, 1977). In unsupervised algorithms, the computer attempts to find different areas with similar attribute relationships. These areas are then labeled as belonging to the same class. It is then up to the analyst to determine a meaningful label for the classes, usually by working with site-specific information (see, for example, Jensen et al., 1987). Again considering the land use example, the software could scan through the data for commonly occurring patterns of land ownership, parking facilities, presence of water bodies and their sizes, and total area of the parcel. When statistically reliable patterns are discerned in this multivariate data, the system can group together those areas with similar characteristics. From a mathematical point of view, the system is attempting to locate areas in the multivariate data space where observations cluster. Such routines can be a good tool for summarizing complex multi-layered datasets, as well as a good exploration tool, to help build hypotheses about spatial pattern and structure. Some background in multivariate statistics can be a great help in the interpretation, however; these statistical procedures can often permit naïve users to fool themselves.

8.1.2 Spatial Aggregation

Spatial aggregation involves increasing the size of the elemental unit in the database. Since there is rarely an explicit elemental spatial unit in vector datasets (except for considerations of accuracy and precision, and the concept of generalization discussed earlier), this function commonly applies to raster datasets where there is a predetermined cell size. However, the concept is useful even in vector datasets. There may be instances in which regions of less than a specified area must be ignored for a particular application. In addition, a common operation with vector datasets is to merge adjoining polygons based on their attributes. These processes of changing the mean resolution of the data change the effective size of the minimum mapping unit.

Figure 8.4 Spatial aggregation of categorical data. (a) original data layer, (b) aggregation based on a majority rule, (c) aggregation with the creation of a new mixed class.

The processes of spatial aggregation require the user to develop a decision rule for merging the attributes of the data. When the data describes discrete categories of information, such as land-cover classes, majority or plurality rules are normally used to derive the new category for the aggregated data from the old (Figure 8.4b). The example in the figure shows an urban area bordered by a suburban area. In this case, we are increasing the linear dimensions of the raster cell by a factor of exactly two, so that four original cells cover the same area as a single derived cell. During the process, the category that covers the largest fraction of the resulting aggregated area determines the attribute label for this new area; when there is a tie, we have specified that urban takes priority. An alternative, which keeps more of the information about pattern in the data, is to instruct the system to create a new "mixed" class for the aggregated data when there is no majority (Figure 8.4c).
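A minimal sketch of majority-rule aggregation for a categorical raster is shown below, reducing the linear dimensions by a factor of two. The class codes and the tie-breaking priority are illustrative; a variant could instead assign a new "mixed" code whenever no single class covers a majority of the block.

```python
import numpy as np

def aggregate_majority(categories, block=2, tie_winner=1):
    """Aggregate a categorical raster by a factor of `block`, giving each
    output cell the class that covers the largest share of its block.
    `tie_winner` is the class given priority when there is no majority."""
    rows, cols = categories.shape
    out = np.empty((rows // block, cols // block), dtype=categories.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = categories[i*block:(i+1)*block, j*block:(j+1)*block]
            values, counts = np.unique(window, return_counts=True)
            winners = values[counts == counts.max()]
            if len(winners) > 1 and tie_winner in winners:
                out[i, j] = tie_winner      # tie: priority class wins
            else:
                out[i, j] = winners[0]      # clear majority (or plurality)
    return out

# Hypothetical land-cover codes: 1 = suburban, 2 = urban.
urban_suburban = np.array([[1, 1, 2, 2],
                           [1, 2, 2, 2],
                           [1, 1, 1, 2],
                           [1, 1, 2, 2]])
print(aggregate_majority(urban_suburban, tie_winner=2))
```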
For continuous data, such as elevation or average rainfall, an averaging process can be used to derive the aggregated data from the original values in the appropriate pixels (Figure 8.5a). This spatial averaging is a kind of filtering process, in which only general trends (or longer wavelengths, from a spatial analysis perspective) remain in the data. As in any other signal filtering process, we could have specified a different averaging function than the straightforward arithmetic average we used in this example. When the aggregation process for raster data does not involve combining an integer number of cells to form a single larger cell, the processes of spatial interpolation discussed in section 6.6.4 are used. These tools use different models of spatial change to convert between different spatial scales.

This aggregation process is used frequently: different datasets, each with its own spatial resolution, are often all aggregated to a common scale before analysis. A simple example might be the integration of digital elevation data, with a raster cell size of 25 meters, with land-use data that has been processed to a 100-meter cell size. From a data management viewpoint, we must be careful to keep the original disaggregated data, even if it is not immediately useful, since it cannot be reconstructed from the aggregated data. The aggregation process is not reversible; that is, we cannot generally find a numerical operation that can identically recreate the original data from the aggregated data layer. The spatial resampling examples we have considered in this section have only dealt with cases in which the desired spatial resolution is an integer multiple of the resolution in the original data. In Chapter 6, we discussed the more general case of resampling, to which this restriction does not apply.

We must point out one important caution. When working with multiple datasets of different underlying spatial resolution (as we defined the term in Chapter 1), be extremely careful about the final results. From a simplistic but practical point of view, the end results can have no more spatial precision than that of the input data layer with the largest mean resolution element.

8.2 Geometric Operations on Spatial Data

This family of operations is important in both the preprocessing and analysis phases. In Chapter 6 we provide a detailed discussion of some of the uses of these functions in the preprocessing phase. In the following sections, we focus on using these tools during data analysis.

Rectification is the process of converting the spatial locations of objects in a dataset to those of a specified map projection. Rectification has been discussed in Chapter 6. Generally, it requires an explicit understanding of the geometrical characteristics of the source data, as well as those of the desired map projection. Recent work with Landsat Thematic Mapper digital imagery (which has a pixel size of 30 meters) presents a useful case study of ground control point rectification. Welch et al. (1985) discuss the geometrical characteristics of TM imagery. Their analysis considered the numerical procedures necessary to register Landsat TM to U.S. Geological Survey 1:24,000 topographic maps. In their investigations, root mean square errors of less than 1 pixel could be obtained from as few as four ground control points for an area as large as 30,000 square kilometers.
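A common way to perform this kind of ground-control-point rectification is to fit a first-order (affine) polynomial transformation by least squares and report the residual error at the control points. The sketch below uses hypothetical image and map coordinates; higher-order polynomials and larger sets of control points are handled in the same way.

```python
import numpy as np

# Hypothetical ground control points: (column, row) in the image and the
# corresponding (easting, northing) on the target map projection.
image_xy = np.array([[120.0,  80.0], [950.0, 110.0], [900.0, 870.0], [150.0, 840.0]])
map_xy   = np.array([[503600., 4201900.], [528500., 4201000.],
                     [527000., 4179200.], [504500., 4180100.]])

# First-order (affine) transformation: map coordinates = [x, y, 1] @ coeffs
design = np.hstack([image_xy, np.ones((len(image_xy), 1))])
coeffs, *_ = np.linalg.lstsq(design, map_xy, rcond=None)

# Residuals at the control points give the root-mean-square error.
predicted = design @ coeffs
rmse = np.sqrt(np.mean(np.sum((predicted - map_xy) ** 2, axis=1)))
print(coeffs)                      # 3 x 2 matrix of transformation coefficients
print(f"RMSE = {rmse:.1f} map units")
```

With only four control points the six coefficients are barely overdetermined; in practice more points are used and the RMSE is compared against the pixel size, as in the Welch et al. study.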
As we discussed in section 6.6, there will often be cases in which the procedures in the previous section can't be used, because one or more of the spatial data sets does not represent a known map projection. There are many ways this problem arises. There will be instances in which a map-like product does not conform to a known map projection. Perhaps we have simply lost the details of the projection through the passage of time, or the cartographic process was not properly controlled, as is often the case in a field sketch map. A map may be based on measurements that have not been corrected for the effects of changes in elevation. Raster imaging sensors frequently present similar kinds of problems. These may be the result of a non-vertical imaging axis, distortion of many kinds in the optics and mechanical systems, curvature of the Earth's surface, and so on. The tools for registering data layers under these circumstances have been discussed in section 6.6. In Figure 8.6, we illustrate a common example of this process, where a field sketch map, which does not correspond to any particular map projection, is registered to a map.

This brief discussion of coordinate systems has focused on point locations, and thus, primarily on vector datasets. When working with raster data organizations, there is an additional complication, since the area covered by a cell in the registered data space will usually not coincide precisely with a cell in the original data space. This requires an interpolating function, as discussed in section 6.6.

The numerical tools of rectification and registration can be used in an inverse fashion, to intentionally distort a geographic dataset to make it communicate more effectively. For example, a planimetric dataset of landcover could be "distorted" to represent the perspective projection seen by an observer in a low-flying aircraft, or a motorist on a road viewing the watershed. Such a representation could be constructed using the tools of registration, and could effectively simulate a perspective view to a potential user. More sophisticated effects can be developed by integrating elevation data, and simulating the view more accurately -- this is discussed in section 9.1.3. This specific example, simulating perspective views of a three-dimensional field, is an important component in the computer-aided design field. These kinds of operations, applying a specific geometric operation to analytic results, can also be used to mitigate problems in an output device, such as a projection video display with significant geometric distortions.

8.3 Centroid Determination

A centroid is a common way to describe the "average" location of a line or polygon. It can be defined physically as the center of mass of a two- or three-dimensional object. For a two-dimensional polygon as defined in a vector data structure, we could average the location of all the infinitesimal area elements within the polygon and could thus determine the coordinate location of the area's centroid. For an object defined within a raster, we could average the center coordinates of all the raster elements that make up an implicitly defined polygon, and could thus arrive at the object's centroid.

There is an easy way to think about a centroid. Consider taking a polygonal area on a rigid map, cutting it out along its boundaries, and then locating the point at which this two-dimensional object will balance on a single point. This point is the centroid of the area. Note, however, that there are cases in which the centroid position is not inside the object. For example, consider a U-shaped region; its centroid is between the uprights in the U, and thus is not in the area itself. Imagine a three-dimensional doughnut or torus: its centroid is a location inside the hole, not within the bounds of the solid object itself. However, most people would agree that even for these relatively difficult cases, the centroid location is a reasonable representative location for the objects.

In some systems, after the centroid for a spatial object is calculated, the location is stored in the database as a searchable key. With the centroid as a location key, we can rapidly find objects that are relatively near a specified location. Thus, we can use the set of centroid locations as a spatial index for objects in the database.
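For a polygon stored as an ordered list of vertices, the centroid can be computed directly with the standard area-weighted formula. The sketch below uses a hypothetical U-shaped polygon to illustrate the point made above: the computed centroid, roughly (1.5, 1.36), falls in the mouth of the U, outside the polygon itself.

```python
def polygon_centroid(vertices):
    """Centroid of a simple (non-self-intersecting) polygon, given as a list
    of (x, y) vertices in order. Uses the standard area-weighted formula;
    the signed area handles either vertex ordering."""
    area2 = 0.0          # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)

# A U-shaped polygon (hypothetical coordinates).
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
print(polygon_centroid(u_shape))
```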
8.4 Data Structure Conversion

Data structure conversion was examined in detail in section 6.1.1. It involves procedures for moving information in existing datasets from one specific data structure to another. It is frequently required in the preprocessing phase, to permit various datasets to be stored in a database that has a limited number of internal data structures. At the same time, there may be instances where, in order to run a particular analytic function, the data must be in a particular structure. This happens frequently when using external numerical models, as discussed in section 8.8. Thus, we must have access to the conversion functions even during analysis. In addition, we will need these kinds of tools when generating output products as well, as discussed in Chapter 9.

8.5 Spatial Operations

Spatial analysis involves operations that consider the spatial arrangement of information in one or more data layers. They may be broken down into two families: operations concerned with connections between locations, and operations concerned with local or neighborhood characteristics.

8.5.1 Connectivity Operations

One of the classic functions of a geographic information system is to be able to identify items in proximity to each other. This function has a wide variety of applications, including identifying impact and easement corridors, and in problems of site selection. In a site selection context, it may be important to locate near certain kinds of activities, and to avoid locating near others. If we can specify the appropriate distance criteria, the GIS can help narrow down the search for acceptable sites by taking a data layer that stores the locations of key activities, and deriving a new data layer that describes the impact zone around these activities. A hospital should be relatively near to major transportation routes to make it easy for people to get to the hospital, for example, but far from sources of noise and air pollution.

For a more complex example, consider the problem of residential development near an airport. Locations under the principal flight paths and those within a specified distance of the airport boundary may be subject to restrictions on the height of buildings and other facilities. If we have geographic data layers in which the flight corridor boundaries as well as the airport boundaries are stored, we should be able to calculate the locations of those areas that are subject to the height restrictions. Developing the boundaries of the region within the specified distance from the airport is called a proximity or buffer operation, and is a common tool in spatial analysis.

Proximity operations on vector datasets may require relatively complex geometrical operations. Locations within a specified distance of a point will form a circle; when the indicated points are closer than twice the specified distance, the circles will overlap (Figure 8.7a). Proximity boundaries around individual straight lines are relatively easy to develop; they are more complex for connected line segments or curves. Similar operations on raster data are easier to imagine, since we are effectively counting pixels away from the specified cells.
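In a raster, a buffer can be sketched by computing each cell's distance to the nearest feature cell and then thresholding that distance. The example below leans on SciPy's Euclidean distance transform; the grid, cell size, and buffer width are hypothetical.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Hypothetical raster: True marks cells occupied by the feature of interest
# (for example, the airport boundary or a major road).
feature = np.zeros((8, 8), dtype=bool)
feature[3, 2:6] = True

cell_size = 100.0          # metres per cell (illustrative)
buffer_distance = 250.0    # buffer width in metres

# distance_transform_edt gives each cell's distance to the nearest zero cell,
# so we invert the mask to measure distance *to* the feature.
distance = distance_transform_edt(~feature) * cell_size
buffer_zone = distance <= buffer_distance   # cells inside the proximity buffer
print(buffer_zone.astype(int))
```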
There are a variety of kinds of spatial analyses that are based on networks. Applications in this general area include optimum corridor or travel selection, and hydrology and flow functions. A complex but useful function found in some systems is to be able to identify the separate watersheds in an area, through run-off direction calculations that are based on a terrain description. Another interesting example of a network analysis problem is to evaluate alternative routes for emergency vehicles, based on a combination of the total length of the route plus anticipated congestion on surface streets, which is dependent on the time of day (Figure 8.7b). This is a complex problem in systems analysis, and is not usually the province of general-purpose geographic information systems. In some commercial systems, these kinds of applications are handled by separate software packages that interface to the GIS, both for access to the database and for display functions.

8.5.2 Neighborhood Operations

There are a variety of different operations that are based on local or neighborhood characteristics of the data. In vector datasets, we often store information about nearby objects, such as polygons which are adjacent to each other. Thus, if we need to know which polygons are adjacent to a specified target polygon, line, or point, we can search through the data structure files in a straightforward fashion. For example, this kind of operation permits us to examine the various kinds of land-use surrounding a specific parcel which a client might wish to purchase. In a raster dataset used to store many different kinds of variables, we can take the same approach. We can examine the attributes of the pixels surrounding a set of one or more target pixels, and look for such things as variability in certain attributes.

On the other hand, when the raster contains values of an ordinal, interval, or ratio variable such as per capita income or rainfall, one of the simplest local operations to imagine, and an extremely common function, is computation of the local mean value in a single data layer. If we consider that every pixel represents the center point of a 3-by-3-pixel neighborhood, we can directly calculate the mean of this local area. The new set of mean values forms the derived data layer. This operation is a low-pass filter, in the language of signal processing. It will identify the general trends in the data, and reduce or smooth out the effects of single cells that vary from the local trends. This can be a very useful function when examining "noisy" data. Without considering the details, we note that a similar operation is possible in vector datasets. However, in this latter case we are examining the attributes of objects (for example, polygons) within a specified distance of a target object.

The mean filter we have just described is a special case of what is called spatial convolution. We have applied a set of weights to the raster cells in a local area, to derive a value representative of that area.
In the case of the 3-by-3 mean filter above, we have summed the 9 values in the local neighborhood, and then divided by nine to preserve the local mean. Alternatively, we can weight each value by the constant 1/9, and then sum the weighted values. It is common practice to represent this operation by writing out the array of weights:

[ 1/9  1/9  1/9
  1/9  1/9  1/9
  1/9  1/9  1/9 ]

This array is sometimes called the kernel of the convolution operation. If we were interested in the mean of a larger area, we could specify a larger spatial window -- perhaps a 5-by-5 cell region. By increasing the size of the window for this simple mean filter, we smooth the behavior of the data layer more dramatically. In other words, we attenuate the high-frequency information in the data more strongly. Figure 8.8a shows a raster dataset of infrared reflectance, derived from the Landsat Thematic Mapper satellite sensor (which is discussed in more detail in Chapter 10). Figure 8.8b shows the same data after applying the mean filter. By removing a portion of the "high-frequency energy" in the data, or in other words, focusing on the general (or "low-frequency") trends in this information, we get a very different picture of the spatial pattern in land cover and land use. More detail on various convolution operations is presented in Moik (1980); we will only discuss some of the most common techniques.

In direct contrast to the low-pass mean filter we have described, there is a family of edge-detecting or high-pass filters in common use. One frequent edge-detecting filter has the following kernel:

[ -1  -1  -1
  -1   9  -1
  -1  -1  -1 ]

As in the 3-by-3 mean filter above, the sum of the weights is 1, which suggests that the filter will approximately preserve the mean of the dataset. We point out that these operations are not well-defined for categorical data, and may produce results which are difficult to interpret. If these kinds of filters are applied to data that is not continuous and naturally ordered, such as classes of land use or lists of species, these operations may not make sense. Figure 8.8c shows the operation of this high-pass filter on the remote sensing data; note how it complements the low-pass filter shown in Figure 8.8b. This view of the original data emphasizes areas where there is change in adjacent land-cover types.

Another common spatial manipulation, which is directly related to the convolutions discussed above, is a median filter. With this filter, rather than calculating the mean value of a neighborhood, we calculate the median value. The median is the central value in a dataset, in that half of the data values fall above the median, and half fall below. This kind of calculation may be of particular value when the data has outliers or is otherwise not statistically well-behaved. This operation is another kind of low-pass filter, as it can remove the effect of outliers and can focus our attention on the general patterns in the data. This filter is popular with analysts who prefer non-parametric statistics (see Winkler and Hays, 1975), which are algorithms for analyzing data without regard to the underlying frequency distributions.
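These neighborhood filters are convenient to sketch with SciPy's ndimage module, which applies a kernel across a raster in a single call; the raster here is random stand-in data.

```python
import numpy as np
from scipy import ndimage

# Hypothetical raster of a continuous variable (e.g., infrared reflectance).
rng = np.random.default_rng(0)
raster = rng.normal(loc=100.0, scale=10.0, size=(50, 50))

# 3-by-3 mean (low-pass) kernel and the high-pass kernel from the text.
mean_kernel = np.full((3, 3), 1.0 / 9.0)
high_pass_kernel = np.array([[-1, -1, -1],
                             [-1,  9, -1],
                             [-1, -1, -1]], dtype=float)

smoothed = ndimage.convolve(raster, mean_kernel, mode="nearest")
edges    = ndimage.convolve(raster, high_pass_kernel, mode="nearest")
median   = ndimage.median_filter(raster, size=3)   # robust low-pass variant

print(raster.std(), smoothed.std(), edges.std())   # smoothing reduces variance
```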
Texture transformations are another set of tools that are used to identify the spatial pattern in data. Texture filters are designed to enhance the heterogeneity of an area. A common algorithm for a texture filter is to calculate the standard deviation of the values in a 3-by-3 neighborhood. If the attribute values in this neighborhood are all similar, the standard deviation is small, and we say that this neighborhood has low texture or low variability. Where there are many different attributes in a neighborhood, we have high texture. These transformations are also used to find the boundaries between delimited areas, since texture within a homogeneous area (in the way we have defined texture) must be zero. When the neighborhood for the calculation spans the border between relatively homogeneous areas, the texture value is high.

There are a number of spatial transformations that are particularly useful when working with elevation data. A slope transformation (Figure 8.9b) turns a data layer of elevation into one of slope, by calculating the local first derivative. As we have seen in the use of spatial filters, many raster-based systems use a local 3-by-3-cell region to calculate the slope at the center of the region. This operation is the same as calculating a local derivative: imagine using the 3-by-3 local set of elevations to determine a best-fitting plane, and then calculating the slope of that plane. A companion to slope when working with elevation data is aspect. An aspect calculation is used to determine the direction that a slope faces (Figure 8.9c). In mountainous terrain, with a ridge oriented east-west, the slopes on the north of the ridge face to the north, and those south of the ridge face to the south. A mathematical way of explaining aspect is to calculate the horizontal component of the vector perpendicular to the surface. Aspect is usually classified into bins of fixed size, so that the resulting data layer is not continuous, but ordinal. A frequent choice is to classify aspect into eight categories, each representing an eighth-of-a-circle (or 45-degree) range in aspect.

Another neighborhood operation can be used to determine the characteristics of a dataset along a specified line. Such a process is commonly graphed as a profile or cross-section of a dataset. For example, in the terrain data shown in Figure 8.9, we might need information about the slope of a specified path through the terrain, which corresponds to a ski slope or roadway. There are a number of ways to develop this information, generally based on extracting the elevation values in the neighborhood of the line (Figure 8.9d), interpolating values along the specified path, and then calculating the local slope (Figure 8.9e). Such a calculation may be simpler with raster datasets than with vector forms, at least conceptually. In a raster of elevations, we can use the pixel values near the desired line to estimate elevations along the line, and then calculate the desired set of slopes. In a vector dataset, elevation values are often stored as contour lines. In this second instance, one must effectively fit a smooth surface to the known elevations along the contour lines, and then estimate values along the desired path from the surface.

There are many uses for determination of visibility from a point. Impact assessment procedures often consider the visual effects of a new construction project. Tactical planning of military exercises frequently considers sites in terms of their visibility. Calculation of a viewshed -- the terrain that is visible from a specified point -- is a complex calculation, and is similar to one frequently found in computer-aided design systems.
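Slope and aspect can be sketched for a raster elevation model from its local first derivatives. The example below assumes a grid whose rows increase northward and columns increase eastward, and uses a randomly generated surface; real implementations must respect the actual grid orientation, and aspect conventions vary between systems.

```python
import numpy as np

# Hypothetical elevation model (metres) on a 30-metre grid; rows are assumed
# to increase northward and columns eastward.
rng = np.random.default_rng(1)
dem = 500.0 + np.cumsum(rng.normal(0.0, 1.0, size=(40, 40)), axis=0)
cell = 30.0

# Local first derivatives of elevation with respect to north and east.
dz_dnorth, dz_deast = np.gradient(dem, cell, cell)

slope_deg = np.degrees(np.arctan(np.hypot(dz_deast, dz_dnorth)))

# Aspect: azimuth of the downslope direction, clockwise from north.
aspect_deg = (np.degrees(np.arctan2(-dz_deast, -dz_dnorth)) + 360.0) % 360.0

# Classify aspect into eight 45-degree sectors (class 0 = north-facing).
aspect_class = (((aspect_deg + 22.5) // 45.0).astype(int)) % 8

print(slope_deg.mean(), np.bincount(aspect_class.ravel(), minlength=8))
```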
8.6 Measurement

A wide variety of measurement tools are needed in a geographic information system. With raster-based data, the precision of a measurement is of course limited to the raster cell size; vector systems are ultimately limited by the precision of the stored node locations.

Distance measurements are of value in many circumstances. Software components must be available to determine line and arc lengths and point-to-point distance calculations, as well as the perimeter of a specified area. With a vector data structure, the specified region is often a polygon (or based upon the union and intersection of a set of polygons), and the area can be determined from the coordinates of the appropriate nodes. With a raster data structure, polygonal areas are implicit in the data, as collections of locally connected pixels that share a specified set of attribute values. Thus, an operation such as finding the perimeter of an area may require (1) determining the boundaries of the region of interest by searching through the data for changes in attribute values, and (2) totaling the length of the discovered boundary.

When calculating the distance between areas, whether the data is based on raster or vector data models, there are at least two different interpretations of distance. The required distance may be at the closest approach between the two polygons, or between a representative location for each polygon (such as the centroid, as in section 8.3). The former may be appropriate for planning an irrigation channel between a pond and a field, while the latter may be better for estimating average travel time between cities.

In Figure 8.10, we see another hole in the golf course we discussed in Chapter 1. A simple distance calculation for this application would be to find the locations for the markers on the sides of the fairway which indicate a distance of 100 yards from the hole. The marker locations could be determined in any of several ways. One sequence of steps might be:
1. identify the specific hole in the database,
2. use a proximity operator to determine the circle of radius 100 yards centered at the hole, and
3. locate the markers where the sides of the fairway intersect the circle.

In a raster database, distances between pixels are ordinarily calculated between the center points of the pixels. In a vector database, there are several options. While the distance between two points is unambiguous, the distance between two lines or two polygons, as examples, is not. The distance between two polygons could be calculated as the shortest distance between the two, which might be appropriate when considering the most cost-effective location for a canal (along the line of shortest distance) between two bodies of water. An alternative is to calculate the distance between the polygons as the distance between their centroids, which might be a reasonable model for the average distance between two parcels.

Area calculations can work in a similar fashion. For example, determining the area of a catchment basin, or the total area of specified land-use zoning in an area, may be required for an application. Thus, returning to Figure 8.10 and our planned golf course, we will need to estimate the area of the green to know how much grass seed and fertilizer to order.
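The two interpretations of polygon-to-polygon distance can be sketched directly from vertex coordinates. In the example below the coordinates are hypothetical, the closest-approach figure is approximated from vertices only (a full implementation would also test vertex-to-edge distances), and the centroid used is a simple vertex average rather than the area-weighted centroid of section 8.3.

```python
import numpy as np

def vertex_centroid(points):
    """Simple vertex-average centroid (approximation for illustration)."""
    return np.asarray(points, dtype=float).mean(axis=0)

def nearest_approach(points_a, points_b):
    """Approximate closest-approach distance using vertices only."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    return np.min(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2))

pond  = [(0, 0), (4, 0), (4, 3), (0, 3)]        # hypothetical coordinates
field = [(10, 1), (14, 1), (14, 5), (10, 5)]

print(nearest_approach(pond, field))                                 # e.g. siting an irrigation channel
print(np.linalg.norm(vertex_centroid(pond) - vertex_centroid(field)))  # e.g. average travel distance
```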
In three dimensions, volume calculations may be required. Such applications would include an earth-moving application in civil engineering, or the calculation of the volume of a subsurface water body in hydrology. In our golf course, we could estimate the volume of soil to remove to create the sand trap, or the volume to move to create a level green.

Determination of direction has a wide range of uses. The regions downwind of a smokestack must be identified in an air pollution study. Areas down slope of a water body may be at risk of a catastrophic mud slide after an earthquake. Direction calculations also have applications in route planning, in which a determination of the distance and direction between start and end points provides the necessary navigation instructions.

Another form of measurement involves counting specified objects in a region. If the objects are specified explicitly in the database, the problem is reduced to one of sorting. The system is essentially programmed to sort through the database to find a set of keywords and record the number of times they are found. Some applications may require only a determination that a specified object is found in an area; one example of the object is sufficient. Other applications require an exhaustive search through the database to determine all the occurrences of the desired class of objects. In a raster dataset, the attribute values of the cells in the region of interest are successively compared to the desired values. When a match is found, a counter is incremented. The procedure is similar for a vector dataset, where the attribute tables are searched for desired sets of values and relationships.

When the desired objects are not explicitly stored in the database, the problem becomes much more complex. Consider the problem of locating a site appropriate for a fire tower in a forest. The problem is mostly that of finding a location with an appropriate viewshed. The data for this problem includes elevation, vegetation cover, and land ownership. There may be algorithmic solutions to such a problem which require a tremendous amount of computation, but there may also be efficient heuristics or "rules of thumb," which permit us to eliminate large areas of the database from consideration. Such strategies enter the realm of artificial intelligence (Smith et al., 1987b).

8.7 Statistical Analysis

There are a variety of statistical techniques that are common in modern geographic information systems. As with many of the other tools we have discussed, these have value in many places in the overall information flow in a GIS. Whether for quality assurance during preprocessing, or for summarizing a dataset as a data management report, or for deriving new data during analysis, several statistical procedures are commonly required. These include:

Descriptive statistics: The mean, median, and variance of the attribute values in a data layer (or delimited area within a layer) are often needed, for continuous variables. For example, it may be necessary to know the mean elevation of a specified area, or the variance in vegetation density in a field. Higher-order statistical moments, such as the coefficients of skewness and kurtosis, which may be used to compare a dataset against a Gaussian distribution, are relatively rare in current systems.

Histograms or frequency counts: The histogram of a dataset provides us with the distribution of attribute values in a region (section 9.1.3). Note that this calculation is straightforward in a raster, since each cell stores attribute information for a predetermined area; thus, we calculate an area-weighted estimate.
On the other hand, in a vector database we have the option of using the area of each polygon to appropriately weight the attribute, or, alternately, of basing the histogram on a per-polygon analysis. Histograms and frequency counts can be extremely valuable as data screening tools and can help us to formulate hypotheses during analysis. Histograms are useful for examining all types of variables.

Extreme values: Locating the maximum and minimum attribute values in a specified area is often useful. As an example, in a dataset of bathymetry we may need to find the shallowest and deepest values of a water body. This information is vital during preprocessing, to ensure that the values being entered into a database are reasonable; they are a rapid but not exhaustive way to detect some kinds of coding and transcription errors.

Correlations and cross-tabulation: In a correlation analysis, we wish to compare the spatial distribution of attributes in two (or more) data layers, usually by calculating a correlation coefficient or linear regression equation when working with interval or ratio variables. For many kinds of data, it is desirable to be able to specify a non-parametric regression; this is rare in modern GISs. For nominal or ordinal variables, a cross-tabulation compares the attributes in two data layers by determining the joint distribution of attributes. When working with both categorical and continuous variables at the same time, the appropriate statistical model is generally an analysis of variance (or analysis of covariance).

We will briefly examine a cross-tabulation, since it is such a powerful and common tool in a GIS. The results of a cross-tabulation are usually displayed in a two-dimensional table:

                       Average per capita income
Household     < $5,000    < $12,500    < $22,500    > $22,500
Owner             154          354          673          982
Renter            269          627          513          451

The table shown above describes two data layers of demographic information, one regarding per capita income for each household, and the other indicating home ownership. Note that the table is based on categorical data: household ownership is nominal, and per capita income, which is ultimately a ratio variable, has been broken into categories and thus appears in this analysis as an ordinal variable. The data, typical of census data, may have come from the attribute tables in a relational database model, or from separate income and ownership layers in a raster database model. This specific analysis can be used to examine whether there is a relationship between the level of income and the probability of home ownership. For an analysis of this kind to be complete, there are standard statistical tests that may be applied to determine whether the arrangement of data in the cells of the table might have arisen by chance. Having such statistical procedures within the GIS is very useful. In this cross-tabulation, notice that we have turned one continuous ratio variable plus a nominal variable into an integer-valued ratio variable: this kind of variable type conversion happens frequently.
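Building such a table from per-parcel attributes, and testing whether its pattern could have arisen by chance, takes only a few lines outside the GIS. The sketch below uses randomly generated stand-in attributes and SciPy's chi-square test of independence.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical per-parcel attributes pulled from the database: income class
# (ordinal, 0-3) and tenure (0 = owner, 1 = renter) for each parcel.
rng = np.random.default_rng(2)
income_class = rng.integers(0, 4, size=2000)
tenure = (rng.random(2000) < 0.4).astype(int)

# Build the two-way table of counts (rows: tenure, columns: income class).
table = np.zeros((2, 4), dtype=int)
np.add.at(table, (tenure, income_class), 1)
print(table)

# A chi-square test indicates whether the arrangement could be due to chance.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, p = {p_value:.3f}")
```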
There will be times when the statistical capability of a geographic information system is simply inadequate for an analysis problem. When this occurs, it will be important to be able to create an intermediate output file, so that data may be transferred to an appropriate system -- perhaps one of the well-known statistical analysis packages (e.g., SPSS, BIOMED, MINITAB). The ability to extract data from the GIS in a standard (or at least, predefined) format is an important consideration for several reasons. In this case, we focus on the need to gain access to statistical software that is external to the system. In section 8.8, we discuss this problem in more detail.

Manipulation and analysis techniques such as these operate on data retrieved from the spatial database, and include both spatial and non-spatial information. New data files containing the "value added" or derived data/information may be created for inclusion within the database. These data can then be analyzed at a later date.

8.8 Modeling

One cannot reasonably expect the developer of a geographic information system to anticipate everything. First, as a user develops experience with a system, the user's interest in complex analytic functions tends to grow. Second, there will be analytic models that are unique to a discipline, or unique to an organization. Perhaps the details of an analytic procedure for an application are proprietary to the company. In these and other instances, we must have means of getting data out of the GIS (as we've mentioned in the discussion of external statistics packages), and into other numerical models for analysis. Furthermore, we should have a means of getting the results of these external systems back into the GIS, for display as well as for storage and further manipulation.

A few common modeling functions should be found in modern systems. The Boolean and arithmetic functions discussed in section 8.1.1 can be used to develop a wide range of applications models. These are of such broad use and applicability that one should expect them in all systems. When it is possible for a user to specify a numerical function by effectively providing a subroutine in an appropriate high-level computer language, the system gains tremendous flexibility and potential for expansion.

A number of considerations bear on the problem of extracting data for use in other software systems. A key component of the problem is the ability to extract a specified subset of the database. Tools should be available to delimit the data of interest both spatially (perhaps by defining a spatial window of interest) and by the desired attributes. Another key component is the ability to determine the specific organization and format for the exported data. Raster data formats, as we have discussed, cover a relatively small range of possibilities, and as such, most geographic information systems possess functions for exporting a univariate raster. Providing functions to specify the raster cell size and the value coding (for example, floating-point format or word size for unsigned integers) is an important key to flexibility. Since vector forms vary more widely, it is more difficult to develop means to export vector files in multiple formats that are widely useful. When such functions are available, it is of course extremely important to have accurate documentation about the detailed formats of the exported data, as well as any available options.

There will always be users who are forced to develop their own software for specialized kinds of analysis. When the internal data formats of a GIS are clearly specified, and when they are appropriate to the analysis problem, it may be possible to write code that operates directly on the GIS database.
While this may seem particularly efficient, it opens a path for a number of serious data management problems, including database corruption, lack of appropriate backup protection, and loss of currency. When direct access to the GIS database is desirable, however, we suggest that a reasonable compromise is to be able to directly read the GIS internal database, but not write to it. In this way, the derived data layers can be examined in a separate procedure, and only after appropriate tests have been passed is the output incorporated into the general database. This is less of a concern if the analytic software is able to "see" only a user's local copy of the database, rather than the overall multi-user database.

There are a number of popular spatial models that are found on some systems. These include determination of visibility or viewshed, network analysis models such as for water flow through piping systems or for calculating optimal paths through transportation networks, and classification and corrections for atmospheric transmission for remote sensing data. In the rest of this section, we briefly describe two applications of modeling in a geographic information system; more are found in Chapter 12.

Spanner et al. (1982) discuss an interesting use of a geographic information system, in which they examine potential soil loss in an area in southern California. According to research they cite, the predicted loss of soil at a particular place, in tons per acre per year, may be estimated as a multiplicative function of rainfall, soil erodability, slope gradient and length, and crop practices. Soil erodability as a function of location was estimated from data collected by the U.S. Department of Agriculture and presented in the form of a map, which was digitized. Length of slope and slope gradient were calculated from a digital elevation model.

Figure 8.11 Components of the universal soil-loss equation. Predicted soil loss in tons/acre/year is a function of rainfall, soil erodability, slope and slope length, crop management, and conservation practice. Inputs: NOAA Precipitation-Frequency Atlas; USDA Soil Conservation Service maps; USGS digital elevation datasets; Landsat remotely sensed data. Intermediate layers: interpolated rainfall; derived soil erodability coefficients; slope and slope length; crop management factor; conservation factor. Outputs: predicted soil loss; soil-loss tolerance.

The effect of crop management was determined in two phases. First, a map of crop types was developed from a combination of satellite multispectral data, photography, elevation, and site observations. The crop management factor was then determined from guidelines developed by the U.S. Soil Conservation Service. The Soil Conservation Service has also developed estimates of soil-loss tolerance for the soils in the study area; these are estimates that reflect the amount of permissible erosion. The separate data layers were developed and rectified to a common 60-meter raster grid system. Estimates of soil loss and soil-loss tolerance were then calculated on a cell-by-cell basis. By then subtracting the soil-loss values from the soil-loss tolerance, a final thematic layer is developed that shows those locations where estimated erosion may be a serious obstacle to farming (Figure 8.11). An analysis of the accuracy of the individual data layers, as well as a brief sensitivity analysis of the components of the model, concludes their discussion.
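In a raster GIS, this kind of multiplicative model reduces to cell-by-cell arithmetic on co-registered factor layers. The sketch below mimics the structure of the universal soil-loss calculation with randomly generated placeholder layers; the factor ranges are illustrative, not calibrated values from the Spanner et al. study.

```python
import numpy as np

# Hypothetical co-registered 60-metre raster layers of soil-loss factors
# (values are placeholders, not calibrated coefficients).
rng = np.random.default_rng(3)
shape = (100, 100)
R  = rng.uniform(40, 80, shape)      # rainfall erosivity
K  = rng.uniform(0.1, 0.5, shape)    # soil erodability
LS = rng.uniform(0.2, 3.0, shape)    # slope length and gradient factor
C  = rng.uniform(0.05, 0.6, shape)   # crop management factor
P  = rng.uniform(0.3, 1.0, shape)    # conservation practice factor

soil_loss = R * K * LS * C * P                  # tons/acre/year, cell by cell
tolerance = rng.uniform(2.0, 5.0, shape)        # soil-loss tolerance layer

# Thematic output: cells where predicted erosion exceeds the tolerance.
problem_cells = soil_loss > tolerance
print(f"{problem_cells.mean():.1%} of the study area exceeds its soil-loss tolerance")
```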
Most geographic information systems have the components needed to estimate a simple cost-of-construction model as a function of location. Consider an organization that needs a site for a new warehouse. The costs of developing the site include building the access road, bringing in utilities (such as power, water, sanitary sewer, and communications), acquiring the land, and building the warehouse itself. Where in the area is the least-cost location for this facility? The key to such a model is to be able to mathematically describe the component costs as a simple function of location.

Consider one component of the problem: building the access road. In the general area of the proposed warehouse, we can identify the principal streets. For any possible location in the general area, we can determine the distance to the nearest street. We can then multiply this distance by the specific cost of constructing the access road (as cost per unit of distance), and thus have an estimate of the cost of building the access road to any specified location. In a geographic information system, we generally have proximity operators (section 8.5) that can determine distances from specified classes of features. The input data for this operation is the relevant street network. In a raster-based system, we can develop a data layer in which each pixel is labeled by the distance to the road (Figure 8.12a). In a vector-based system, we might calculate isolines of distance from the outer margins of the streets, which convey basically the same information. In either case, the distance measures can then be weighted by the cost per unit distance, thus developing one of the component cost layers in this analysis. In exactly the same way, we can develop data layers that contain the costs of running power and communications lines, water, and so forth.

The costs of acquiring the site for our example warehouse may depend on the site's location. There is an airport in the target area, which affects the cost of land. Sites within 1 kilometer of the airport property boundary are less expensive than those farther away, since the airport is a source of noise and air pollution (Figure 8.12b). This is another application of a proximity operator. However, in this instance we are developing a boundary at a specified distance (which one might consider a binary data layer: inside versus outside the boundary), rather than a continuous surface of distances. If costs are based on a more complicated function, perhaps reflecting aircraft landing and takeoff patterns, these can be added to the model as well. Finally, we can add the cost of constructing the building itself, which might depend on geotechnical considerations, such as surface soil characteristics and depth to bedrock. A final data layer, calculated as the sum of the component cost layers, can then be developed, which will indicate the least expensive site for constructing the new facility.
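A raster sketch of this siting model is shown below. It combines a distance-weighted road-construction layer, a land-cost layer discounted within 1 kilometer of a hypothetical airport, and a flat building-cost placeholder; all grids, unit costs, and distances are invented for illustration.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Hypothetical 100-metre raster of the study area.
shape = (60, 60)
cell = 100.0
roads = np.zeros(shape, dtype=bool)
roads[30, :] = True                     # one principal street
airport = np.zeros(shape, dtype=bool)
airport[5:15, 5:15] = True              # airport property

# Component 1: access-road cost = distance to the nearest street x cost per metre.
road_cost = distance_transform_edt(~roads) * cell * 250.0    # $250 per metre (illustrative)

# Component 2: land cost, discounted within 1 km of the airport boundary.
near_airport = distance_transform_edt(~airport) * cell <= 1000.0
land_cost = np.where(near_airport, 80_000.0, 150_000.0)

# Component 3: building cost, here a flat placeholder layer.
building_cost = np.full(shape, 500_000.0)

total_cost = road_cost + land_cost + building_cost
best = np.unravel_index(np.argmin(total_cost), shape)
print("least-cost cell:", best, "estimated cost:", round(float(total_cost[best])))
```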