Chapter 3 The Essential Elements of a Geographic Information System: An Overview

As we said in the first chapter, an information system is fundamentally an end-to-end system, which deals with the flow of data and information from its primary sources to the derived information and its ultimate uses. Geographic information systems are designed to handle information regarding spatial locations. In this chapter, we will introduce the essential functional components of a GIS, and will discuss some key concepts in geography and geographic data processing.

3.1 GIS Functional Elements

There are five essential elements that a GIS must contain (Figure 3.1; based on the discussion in Knapp, 1978): data acquisition, preprocessing, data management, manipulation and analysis, and product generation. For any given application of a geographic information system, it is important to view these elements as a continuing process. We will introduce each of the elements in this chapter, and will examine each in greater detail later in this text.

As a guiding principle, the analyst should develop an end-to-end model of the task at hand. Even when the precise details of the steps to be taken may depend on the results of intermediate calculations and analyses, an explicit outline of the process, like a working hypothesis in a scientific experiment, can be very valuable.

Data acquisition is the process of identifying and gathering the data required for your application. This typically involves a number of procedures. One procedure might be to gather new data by preparing large-scale maps of natural vegetation from field observations, or by contracting for aerial photography. Other kinds of surveys may be required to determine, for example, consumer satisfaction and preferences in different parts of a city to help locate new business offices.
Other procedures for data acquisition may include locating and acquiring existing data, such as maps, aerial and ground photography, surveys of many kinds, and documents, from archives and repositories. One must never underestimate the costs (in time as well as money) of the data-acquisition phase. A GIS is of no use to anyone until the relevant data have been identified and located. Furthermore, the accuracy of the decisions reached through spatial analysis is limited by the accuracy and precision of the underlying datasets. We often know too little about the underlying quality of many kinds of spatial data. At times, however, we may be forced to use maps and other datasets whose underlying quality is unknown. And without spending some effort ensuring that various datasets are not only relevant but also reliable, we run the risk of fooling ourselves.

Preprocessing involves manipulating the data in several ways so that they may be entered into the GIS. Two of the principal tasks of preprocessing are data format conversion and identifying the locations of objects in the original data in a systematic way. Converting the format of the original data often involves extracting information from maps, photographs, and printed records (such as demographic reports) and then recording this information in a computer database. This process is a time-consuming and costly effort for many organizations. This is particularly (and sometimes painfully) true when one calculates the costs of converting large volumes of data from paper maps and transparent overlays to an automated GIS based on computerized datasets. We will discuss aspects of this process in Section 6.1. A second key task of the preprocessing phase is to establish a consistent system for recording and specifying the locations of objects in the datasets.
When this task is completed, it is possible to determine the characteristics of any specified location in terms of the contents of any data layer in the system. During these processes, it is very important to establish specific quality control criteria for monitoring the operations during the preprocessing phase so that the databases can be of maximum value to the user.

Data-management functions govern the creation of, and access to, the database itself. These functions provide consistent methods for data entry, update, deletion, and retrieval. Modern database management systems isolate the users from the details of data storage, such as the particular data organization on a mass storage medium. When the operations of data management are executed well, the users usually do not notice. When they are done poorly, everyone notices: the system is slow, cumbersome to use, and easy to disrupt. Under these latter circumstances, the smallest human and machine errors create large problems for both the users and the system operators.

Data-management concerns include issues of security. Procedures must be in place to provide different users with different kinds of access to the system and its database. For example, database updates may be permitted only after a control authority has verified that the change is both appropriate and correct.

Manipulation and analysis are often the focus of attention for users of the system. Many users believe, incorrectly, that this module is all that constitutes a geographic information system. In this portion of the system are the analytic operators that work with the database contents to derive new information. For example, we might specify a region of interest and request that the average slope of the area be calculated, based on the contours of elevation that have already been stored in the GIS database.
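As a hedged illustration of such an analytic operator, the sketch below computes the average slope over a rectangular region of interest. It assumes that the stored contours have already been interpolated to a regular elevation grid, a simplification that the text does not describe. The function name `average_slope`, the grid values, and the cell size are all hypothetical.

```python
import math

def average_slope(elev, cell_size, rows, cols):
    """Mean slope in degrees over the grid cells in rows x cols.

    elev is a list of rows of elevations (metres) on a regular grid and
    cell_size is the grid spacing in metres. Central differences are
    used, so the region must not touch the edge of the grid.
    """
    total = 0.0
    count = 0
    for r in rows:
        for c in cols:
            dz_dx = (elev[r][c + 1] - elev[r][c - 1]) / (2 * cell_size)
            dz_dy = (elev[r + 1][c] - elev[r - 1][c]) / (2 * cell_size)
            total += math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
            count += 1
    return total / count

# Hypothetical 4 x 4 grid: a plane rising 10 m per 10 m cell toward the east.
grid = [[0, 10, 20, 30] for _ in range(4)]

# Region of interest: the four interior cells.
roi_slope = average_slope(grid, 10.0, range(1, 3), range(1, 3))
print(round(roi_slope, 1))  # 45.0
```

A production system would more likely work from the contours directly or from a surface model; the grid is used here only to keep the arithmetic visible.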
Since no single system can encompass the complete range of analytic capabilities a user can imagine, we must have specific facilities to be able to move data and information between systems. For example, we may need to move data from our GIS to an external system where a particular numerical model is available, and then transport the derived results back into the spatial database inside the GIS. This kind of modularity, where other data processing and analysis systems can be linked to a GIS, is very valuable in many circumstances, and permits the system to be easily extended over time by pairing it with other analytic tools. When one speaks of geoprocessing, one is often focused on the manipulation and analysis components of a GIS.

Product generation is the phase where final outputs from the GIS are created. These output products might include statistical reports (such as a table listing the average population densities for each county in California, or a report indicating landowners who are delinquent in their property taxes), maps (for example, a presentation of the property boundaries of plots within a township that are owned by public agencies, or a map of a subdivision indicating where construction workers must be careful when digging due to the presence of underground pipes and cables), and graphics of various kinds (such as a set of bar charts that compare the acreage of different crop types in an area). Some of these products are soft copy images: these are transient images on television-like computer displays. Others, which are durable since they are printed on paper and film, are called hard copy. Increasingly, output products include computer-compatible materials: tapes and disks in standard formats for storage in an archive or for transmission to another system. The capability of taking the output of an analytic process, and placing it back into the geographic database for future analysis, is extremely important.
These essential components of a geographic information system are the same as those of any other information system. Let us compare this sequence of functional elements to a more conventional information system problem. Consider the steps that are taken in an automated system to manage employee records for a business. Information about the individuals must be gathered together, perhaps via a questionnaire and interview when the individual is hired. This is clearly the data acquisition phase. Then, because some of the information is inevitably expressed by different people in different ways (for example, some people will list their education as "through grade 12", while others will say "through high school"), the data must be put into a consistent vocabulary and format. Only after this preprocessing phase can the data be entered into the computer in a consistent form. Validation of the data entered into the system is a fundamental part of the preprocessing phase, to ensure the accuracy of the resulting database.

Once the data have been converted into a consistent form and put in the computer database, we have accomplished a large fraction of the end-to-end task and often expended a large fraction of the end-to-end costs. Data management functions permit us to update the information when necessary (for example, when an employee completes an advanced degree), and to retrieve only the relevant information when required (as in a summary report of salaries for a particular division of the company). Various kinds of analytical operations can be run, perhaps using employee addresses to find out which employees live close to one another in an effort to encourage car pooling. Finally, we need to be able to develop statistical reports, graphics of many kinds, and other output products, such as documentation for management reviews of salary levels. These steps exactly parallel the five GIS components we will discuss in detail.
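The employee-records example can be sketched end to end in a few lines. This is a minimal illustration of the five elements under stated assumptions, not a design for a real personnel system: the records, the controlled vocabulary in `EDUCATION_CODES`, and every function name are hypothetical.

```python
# Data acquisition: hypothetical raw questionnaire answers.
raw_records = [
    {"name": "Ana", "division": "Sales", "salary": 52000, "education": "through grade 12"},
    {"name": "Ben", "division": "Sales", "salary": 48000, "education": "Through High School"},
    {"name": "Cy", "division": "Research", "salary": 70000, "education": "masters degree"},
]

# Assumed controlled vocabulary for the education field.
EDUCATION_CODES = {
    "through grade 12": "secondary",
    "through high school": "secondary",
    "masters degree": "post-graduate",
}

def preprocess(record):
    """Preprocessing: consistent vocabulary plus basic validation."""
    key = record["education"].strip().lower()
    if key not in EDUCATION_CODES:
        raise ValueError(f"unrecognized education entry: {record['education']!r}")
    if record["salary"] <= 0:
        raise ValueError("salary must be positive")
    return {**record, "education": EDUCATION_CODES[key]}

# Data management: a keyed store with entry, update, and retrieval.
database = {}

def store(record):
    database[record["name"]] = record

def update(name, **changes):
    database[name].update(changes)

def retrieve(division):
    return [r for r in database.values() if r["division"] == division]

def mean_salary(records):
    """Manipulation and analysis: derive new information from the store."""
    return sum(r["salary"] for r in records) / len(records)

def report(division):
    """Product generation: a one-line summary report."""
    rows = retrieve(division)
    return f"{division}: {len(rows)} employees, mean salary {mean_salary(rows):.0f}"

for raw in raw_records:
    store(preprocess(raw))
update("Ben", education="college")  # e.g. an employee completes a degree
sales_report = report("Sales")
print(sales_report)  # Sales: 2 employees, mean salary 50000
```

Note that the two differently worded answers for the same level of education end up under one code, which is the point of the preprocessing step.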
3.2 Data in a GIS

It is important to understand the different kinds of variables that can be stored in any information system. Nominal variables are those which are described by name, with no specific order. Categories of land use (such as parks, wilderness areas, residential districts, and central business districts) and tree species (such as Eucalyptus calophylla, Pinus coulteri, and Quercus agrifolia) are different kinds of nominal variables. These are common in many kinds of thematic maps.

Ordinal variables are lists of discrete classes, but with an inherent order. Classes of streams (first order, second order, and so forth; referring to the number of tributaries which contribute to the stream) or levels of education (primary, secondary, college, post-graduate) are ordinal variables since the discrete classes have a natural sequence.

Interval variables have a natural sequence, but in addition, the distances between the values have meaning. Temperature measured in degrees Celsius is an interval variable, since the distance between 10°C and 20°C is the same as the distance between 20°C and 30°C. Finally, ratio variables have the same characteristics as interval variables, but in addition, they have a natural zero or starting point. Since degrees Celsius is a measurement with an arbitrary zero point, the freezing point of pure water, it fails the latter test. Temperature measured in kelvins, since it is based on an absolute standard, is a ratio variable. Per capita income, the fraction of the weight of a soil sample that passes through a specified sieve, and rainfall per month are common ratio variables.

In addition to these four kinds of data, there are two different classes of data found in most geographic information systems. Consider a simple object in space: a water well.
From the point of view of a GIS, the primitive but essential piece of information to record about this water well is its location on the Earth: a data value pair such as longitude and latitude. This is the simplest kind of spatial data. However, there may be a wide range of additional information which is required for many applications. This might include the depth of the well, the volume of water produced over a given period of time, dates of pump tests, and temporal sequences of measurements of dissolved and particulate matter in the water from the well. This second set of non-spatial or attribute data, which is logically connected to the spatial data, must not be forgotten. In many geographic information systems, there are tools to both store and manipulate the non-spatial data along with the spatial data. In some applications, as we will see, the volume of non-spatial data may actually be larger than the volume of the spatial data, and the logical connections between the spatial and non-spatial information may be very important.

A recent issue of The American Cartographer (January, 1988), the journal of the American Congress on Surveying and Mapping, proposes a standard for digital cartographic data. This standard is based on entities in the real world, and a mechanism to represent these entities in terms of objects in a database. Within this proposal is a set of definitions of spatial objects, which we now paraphrase to explain more of the vocabulary of geographic information systems. This brief discussion also expands on the comments in Chapter 1 about different kinds of spatial objects.

One may divide the different kinds of spatial objects into three classes, based on spatial dimensions of the objects. A 0-dimensional object is a point that specifies a geometric location. From a mathematician's perspective, a point is a primitive location with no areal extent.
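The water well introduced earlier in this section can be sketched as a 0-dimensional point whose coordinates are linked to attribute data by a shared identifier. This is one possible arrangement, not the layout of any particular GIS; the identifier `W-17`, the field names, and all values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    """A 0-dimensional spatial object: a location with no areal extent."""
    longitude: float
    latitude: float

# Spatial data: where each well is (hypothetical coordinates).
well_locations = {"W-17": Point(longitude=-119.70, latitude=34.42)}

# Attribute (non-spatial) data, logically connected by the same key.
well_attributes = {
    "W-17": {
        "depth_m": 85.0,
        "pump_test_dates": ["1987-03-02", "1987-09-15"],
        "dissolved_solids_mg_per_l": [410.0, 425.0, 402.0],
    }
}

def describe(well_id):
    """Join the spatial and attribute data for one well."""
    location = well_locations[well_id]
    attributes = well_attributes[well_id]
    return {
        "longitude": location.longitude,
        "latitude": location.latitude,
        **attributes,
    }

well = describe("W-17")
print(well["latitude"], well["depth_m"])
```

Even in this small case the attribute record is larger than the coordinate pair, and it is the shared key that keeps the two logically connected.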
Points are used in a number of ways in both computer graphics and digital cartographic data, as well as in a geographic information system. They are commonly used to indicate features themselves, such as the exact center of the water well mentioned above, the end of a street, or the corner of a lot in a subdivision. Points are also used as a reserved position for a label (such as a place name) or a symbol (such as an airport or benchmark) on a map, or to carry information for the surrounding region (such as who owns the region, or the color to be used when the region is displayed). Points are also used to define more complex spatial objects, such as lines and areas.

The simplest 1-dimensional object is a straight line between two points. More complex forms of lines include connected sets of straight lines (determined by the sequence of points at which the path changes direction), curves which are based on mathematical functions, and lines whose direction is specified. Particular sets of mathematical functions are used to define curves in some disciplines, as in the functional definition of the curve of a street used by a civil engineer. One advantage of a directed line segment is that we have a way to distinguish which end is the beginning of the line, and which end is the end. This may be particularly valuable in circumstances as diverse as the analysis of flow in pipes (perhaps indicating source and destination for flow in a potable water supply system) or models of population flow between countries. When the line segments carry information about direction, we are also able to distinguish the regions on the left and right sides of the line. As we shall see later, this can be very useful in a number of applications.

Finally, 2-dimensional objects are areas, which also come in many forms. In a particular application, we may refer to a bounded area, or focus on just the boundary, or just the region within the boundary.
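The way a directed line distinguishes the regions on its left and right, described above, can be sketched with a cross product. The sketch assumes planar coordinates with x increasing to the east and y to the north; the class name `DirectedSegment`, the method `side_of`, and the pipe coordinates are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DirectedSegment:
    """A straight line from (x0, y0) to (x1, y1); the order gives its direction."""
    x0: float
    y0: float
    x1: float
    y1: float

    def side_of(self, x, y):
        """Return 'left', 'right', or 'on' for a point relative to the segment."""
        cross = (self.x1 - self.x0) * (y - self.y0) - (self.y1 - self.y0) * (x - self.x0)
        if cross > 0:
            return "left"
        if cross < 0:
            return "right"
        return "on"

# Hypothetical pipe running due east from source (0, 0) to destination (10, 0).
pipe = DirectedSegment(0.0, 0.0, 10.0, 0.0)
north_side = pipe.side_of(5.0, 2.0)
south_side = pipe.side_of(5.0, -2.0)

# Reversing the direction swaps left and right for the same point.
reversed_side = DirectedSegment(10.0, 0.0, 0.0, 0.0).side_of(5.0, 2.0)
print(north_side, south_side, reversed_side)  # left right right
```

Without a stored direction the two ends are interchangeable, and "left" and "right" have no defined meaning.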
The description of the area itself is normally based on the geometry of the bounding line segments. The area may be either homogeneous or divided internally, as discussed in Chapter 1. A distinction is often made between sets of two-dimensional bounded regions, and true three-dimensional surfaces. In some applications, an analysis based on a two-dimensional planimetric representation of the Earth may be completely sufficient. We focus on these kinds of applications in this introductory text.

Information about the connections between spatial objects, such as which areas bound a line segment, is called topology. One of the distinguishing features of some geographic information system databases is that they have explicit mechanisms to store topology, as we shall see in Chapter 4.

Cowen (1987) discusses a geographic information system from several different points of view. The database approach stresses the ability of the underlying data structures to contain complex geographical data. The descriptions of spatial objects in the previous several paragraphs take this view. In Chapter 4 we examine a number of common alternatives for storing spatial data. The process-oriented approach focuses on the sequence of system elements used by an analyst when running an application; the five components we discussed at the beginning of this chapter follow this view. Chapters 5 through 9 in this text represent such an approach. An application-oriented approach defines a GIS based on the kinds of information manipulated by the system and the utility of the derived information produced by the system. Chapter 12 presents a number of uses of these spatial data processing systems, and clearly emphasizes this view. A natural resources inventory system is an easily understood example of this approach. Finally, a toolbox approach emphasizes the software components and algorithms that should be contained in a GIS.
We develop a number of details from this point of view of a GIS in Chapters 6 and 8. Each of these different points of view of a geographic information system is useful; we recommend that the reader consider the differences between them during the following discussions.