Chapter 5 Data Acquisition

The point is often made that the value of a geographic information system is due in large part to the quality of the data contained within the system. In this chapter we address the problem of acquiring the wide variety of essential datasets.

5.1 Introduction

The first steps in developing the database for a geographic information system are to acquire the data and to place them into the system. GISs must be able to accept a wide range of kinds and formats of data for input. There may be times when a given user may generate all their own datasets; this is, however, relatively rare. Even so, getting data is one of the greatest operational problems and costs in the field. Kennedy and Guinn (1975) described the importance of the data in automated spatial information systems in this way: While models which use the data are important to support the decision-making activities of those who use the system, a large portion of the investment will be in obtaining, converting, and storing new data. Data to be input to a GIS are typically acquired in a diverse variety of forms. Some data come in graphic and tabular forms. These would include maps and photography, records from site visits by specialists from many fields, and related non-spatial information from both printed and digital files (including descriptive information about the spatial data, such as date of compilation and observational criteria). Other data come in digital form. These would include digital spatial data such as computer records of demographic or land ownership data, and magnetic tapes containing information about topography and remotely sensed imagery. Often these data will require manual or automated preprocessing prior to data encoding. For example, tabular records may need to be entered into the computer system.
Aerial photography might require photointerpretation to extract the important spatial objects and their relative locations, a digitizing process to convert the data to digital form, and numerical rectification algorithms to convert the locations of significant features to a standard georeferencing system. Computer programmers may need to help move digital datasets from one computer to another. Airborne scanner data might require thematic classification and rectification before the data are suitable for entry into a GIS. We discuss many of these processes in Chapter 6. An important and sometimes overlooked issue during data acquisition is to retain information about the accuracy, precision, currency, and spatial characteristics (such as the georeferencing system and scale) of the data themselves (Kennedy and Meyers, 1977; Salmen et al., 1977). In some cases, there may be mapping standards of various kinds, which will vary by country, organization, and agency. We can think of a number of broad classes of data to guide our discussions of data acquisition. Many interesting spatial datasets are effectively sets of points. These include the locations of water, gas, and oil wells, a representative location for groups of buildings, and the addresses of members of a target demographic group. Other datasets may be considered forms of networks. The complete set of roads in an area, ranging from unpaved fire trails to multi-lane superhighways, is one kind of transport network. Other networks include waterways, potable-water delivery and waste-water collection systems, railroad systems, and gas, electric, and communications systems at local and regional levels. Other kinds of spatial data might be described as continuous fields, where we can theoretically calculate or measure a value at any location. Examples of this class of data are descriptions of elevation, plant biomass, and population density (and many kinds of demographic variables).
Finally, an important class of spatial data involves dividing a portion of the Earth's surface into relatively homogeneous discrete regions. A political map is perhaps the most common example of this type, since it subdivides a portion of the Earth into countries, states, provinces, counties, and so forth. A simple way to subdivide the Earth would be to develop classes of land cover or land use, indicating the boundaries of the different classes and the characteristics within the boundaries. Each of these different kinds of spatial data may be stored and presented to users in various ways. Traditional cartographic products provide an easy-to-understand means of storing and communicating these various kinds of spatial data. In many cases, photography may be used, perhaps in combination with cartographic overlays, to portray phenomena and relationships on the Earth in an efficient manner. Text and tables are also commonly used to archive information about spatial objects. And as we discussed in Chapter 4, there are a variety of ways to use digital computer technology to store spatial data. When thinking about spatial datasets, whether examining an existing dataset or setting specifications for developing future collections, there are a number of important non-spatial data elements to consider. The date of the data collection process is of course a natural item to record. It is one of the many important ancillary elements that provide us with an indication of the value of a dataset. There will be times when information from a specific point in time is needed. There will be other times when we need the most recent information available, as well as an understanding of how the landscape changes over time, to evaluate whether the data will be sufficiently precise or accurate for a specific application. Another consideration is the observation criteria and source.
A vegetation map that is based on field visits of an expert botanist is of more value than a hypothetical map of presumed species distributions based on knowledge of latitude and weather. Thus, users must evaluate the characteristics of the best available data, in order to determine their suitability. Other important elements to consider include the positional accuracy of the dataset, its logical consistency, and its completeness. Scale and resolution are two separate issues in spatial data. Consider two aerial photographs as a simple illustration of this fact. Both photographs were taken with the same camera, from the same altitude and location, over the same terrain. Assuming identical processing of the photographs, they will have the same scale. However, if one film emulsion has a finer grain and better contrast, it may have a better ability to resolve smaller details in the scene, and thus, better resolution. To put it in the terms we used in Chapter 4, we may say that the two images have essentially different minimum mapping units. The geometrical properties of a set of spatial data are of course important. We must be able to determine the particular means used to specify location in the data. This usually implies that we must be able to identify the coordinate system and projection used in the data (as we discuss in section 6.6.1), or be able to devise a means to modify the spatial arrangement in the original data so that it corresponds to a desired arrangement (see section 6.6.4). There are a number of other important attributes of a dataset, including information about accuracy and precision, as well as the density of observations used to develop the entire dataset. Regarding the latter, we may develop a large dataset based on a small number of measurements, and then use numerical models (as in section 6.7) to infer values at many other locations. This is commonly the case for elevation datasets, as well as detailed demographic information.
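To make the idea of inferring values from a small number of measurements concrete, the following Python sketch estimates a value at an unsampled location from a handful of spot measurements. It is only an illustration under stated assumptions: inverse-distance weighting is one simple numerical model chosen here for the example (the text does not say which models are used), and the function name, the `power` parameter, and the spot elevations are all hypothetical.

```python
import math

def idw_estimate(samples, x, y, power=2.0):
    """Estimate a value at (x, y) from sparse (sx, sy, value) samples.

    Uses inverse-distance weighting, an assumed illustrative model:
    nearby measurements count for more than distant ones.
    """
    if not samples:
        raise ValueError("at least one measurement is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # the location coincides with a measurement
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical spot elevations (x, y, elevation), all in meters.
spot_heights = [(0.0, 0.0, 100.0), (10.0, 0.0, 120.0),
                (0.0, 10.0, 110.0), (10.0, 10.0, 130.0)]

# The center is equally far from all four measurements,
# so the estimate is simply their average.
center = idw_estimate(spot_heights, 5.0, 5.0)
print(center)
```

A dataset built this way contains far more values than were ever measured, which is why the density of the original observations is worth recording alongside the data.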
From a very simple point of view, we can distinguish two different families of datasets, and thus have an idea of the kinds of effort each requires before it is ready for use in a geographic information system. Existing datasets are those that are already compiled and available in some form. We of course do not minimize the fact that they may require a great deal of effort to make them appropriate for a particular use. The many steps required to prepare an existing dataset for use are discussed in Chapter 6. In contrast, there are many circumstances where we must develop or generate the dataset ourselves. In this second case, while we may have complete control over the data gathering process, we generally have much more work to do.

5.2 Existing Datasets

A lesson that many institutions learn through experience is that it is usually worthwhile to spend some time looking for existing data that can serve the stated needs, before plunging ahead and perhaps developing sources de novo. There is a great deal of spatial information available in the public domain for some parts of the world, if you know where to begin to look. We caution that different nations treat spatial data differently. In the United States, much spatial data collected by government agencies, including maps, photographs, and many kinds of digital data, are considered in the public domain. Thus, there are effectively no restrictions on access to these data. In contrast, in many other parts of the world, spatial data are considered the proprietary resource of the agency that collected the data, or may be controlled for economic or security reasons. In these cases, access to certain data may be strictly controlled. We will discuss some well-known sources of spatial data, both to help document these agencies and their data products, and to document the trend towards greater availability of digital cartographic data in the western countries. The most common form of spatial data is a map.
Section 2.1 briefly introduces cartographic products. Maps of various kinds are in common use for many kinds of spatial analysis. National agencies in many developed countries have systematic collections of map products at various scales, and programs for distributing and maintaining these resources. When appropriate maps are available, a digitizing process (described in Chapter 6) permits us to extract the information from the flat map and place it into a digital computer. However, many people are surprised to find that coverage of the world, in terms of scale, currency, and the themes of interest, is quite uneven. When spatial data may be found in a digital form, there may be significant cost savings since the digitizing process is not required. In the United States, a number of kinds of thematic spatial data in digital form are being produced on a routine basis. While more and more states, regional authorities, counties, and cities are creating such datasets, the Federal Government is the best-known source for many GIS users.

Table 5.1 Examples of Data in Digital Form Available from the United States Government.

DATA TYPE                               DATA SOURCE
Topography
  Digital Elevation Model               USGS/NMD
  Digital Terrain Data                  DMA
Land Use and Land Cover                 USGS/NMD
Ownership and Political Boundaries      USGS/NMD
Transportation                          USGS/NMD, DOE
Hydrography                             USGS/NMD
Socioeconomic and Demographic Data      USCB
  Census Tract Boundaries
  Demographic Data
  Socioeconomic Data
Soils                                   USDA/SCS
Wetlands                                USFWS
Remotely Sensed Data                    NASA, NOAA

Abbreviations used in the table:
DMA         Defense Mapping Agency
DOE         Department of Energy
NASA        National Aeronautics and Space Administration
NOAA        National Oceanic and Atmospheric Administration
USCB        U.S. Census Bureau
USDA/SCS    U.S. Department of Agriculture Soil Conservation Service
USFWS       U.S. Fish and Wildlife Service
USGS/NMD    U.S. Geological Survey National Mapping Division

Table 5.1 presents a number of examples of digital datasets, produced on a routine basis, that are available (or are being made available) from the U.S. Government. Agencies and commercial firms involved in remote sensing of the Earth have large data holdings, and may also be able to provide new data acquisitions. At the present time, two well-established commercial firms, Eosat and SPOT Image Corp., provide access to the Landsat and SPOT series of remote sensing systems. These are discussed in Chapter 10. Historical datasets are available, and orders for future acquisitions are accepted, on a fee-for-services basis. National agencies, such as the National Oceanic and Atmospheric Administration in the U.S., can provide spatial data from the operational weather satellite programs. As we discussed in Chapter 2, there has been an explosion in the development and applications of geographic information systems since the 1960s. This explosion has been touched off by achievements in two areas: the incredible advances in computer science and technology, and the increasing availability of spatially referenced data in digital form. As we shall see in later chapters, data in digital form are generally easier to place into a modern geographic information system. Data in analog form, such as photographs and printed tables, require a painstaking and sometimes expensive conversion to digital form. The majority of map makers think of aerial photos as source materials rather than data. They point out that, in showing objects and phenomena, maps (usually) have uniform symbols, and that maps treat both distance and directions in carefully controlled ways. Aerial photographs, as well as other forms of remotely sensed data, likewise show objects and phenomena, but they lack both the interpretation and the geometric control. However, advances in sensor technology, as well as in processing and analysis techniques, are slowly changing the views of the cartographic community.
We reemphasize that all operational mapping programs in the U.S. Federal Government are based on remotely sensed data of various kinds and photogrammetric techniques. For conventional aerial photography the standard format is a black-and-white or color photographic print on paper approximately 9 by 9 inches. For vertical aerial photography acquired from a mapping camera for a private individual, commercial firm, or government agency, the tilt should not exceed 3 degrees from vertical. Tilt is defined as the angle between the optical axis and a vertical line through the center of the camera's lens. The scale of this photography is dependent upon both the focal length of the lens of the camera and the height from which the photo is acquired (see Chapter 10). Most scales used by agencies of the U.S. federal government, for example, range from 1:20000 to 1:40000. While the most commonly used scales of conventional photography range from 1:4800 to 1:24000, a wider spread of scales, from perhaps 1:500 to 1:90000, is in common use. Scales in the range 1:500 to 1:2000 may be employed for detailed urban planning and recreation-area management applications, while smaller scales will most often be used for more general resource analysis applications (such as large-area forest stand maps or regional land-cover mapping and planning applications). A more detailed discussion of aerial photography is beyond the scope of this text. Another product that deserves attention here is the orthophoto map. Orthophotos are produced by an instrument called an orthophotoscope. The orthophotoscope removes the relief displacement found in any photograph and creates an essentially planimetric product (see section 6.8). Orthophoto technology can be employed to produce a variety of different photographic products. Photogrammetrists are capable of producing orthophotos as photographic prints or half-tones.
These can then be lithographed along with overprints of map grids, contour lines, place names, and other cartographic symbols. The U.S. Geological Survey currently produces two types of photoimage products: the orthophotoquad and the orthophoto map. The orthophotoquad is essentially a photographic base-representation of the standard USGS 7.5-minute series topographic map. The orthophoto maps include contours and colors representing water bodies, wetlands, forests, and so on. These maps tend to be highly selective in terms of the features that are labeled and in coverage. The advantage of these products for an analyst is that they have properties of both maps and photographs. They may be used like a map, in that there is careful control of geometry and strict control of the use of symbols and graphics. At the same time, objects or phenomena may be interpreted from these orthophoto maps as in any other aerial photo. This combination makes these products ideal for many GIS applications. In the United States, data from the Department of Commerce, National Aeronautics and Space Administration (NASA), National Oceanic and Atmospheric Administration (NOAA), and the Department of Interior through the United States Geological Survey (USGS) have been particularly important. Specific examples of data here include detailed census datasets at a variety of levels of spatial aggregation, remotely sensed data (see Chapter 10), as well as land-use, land-cover, and digital elevation data from the USGS files. For sites in the United States, the U.S. Geological Survey, via the National Cartographic Information Centers and Public Information Offices, provides a wealth of information in digital form (we note that these will soon be known as Earth Science Information Offices). The U.S. GeoData tapes are one good example.
Digital line graph data tapes are sold by 7.5- and 15-minute block, which correspond to USGS 7.5- and 15-minute topographic quadrangle maps - the standard map base for the United States. The U.S. GeoData digital tapes contain information in four thematic layers, which can be obtained separately: boundaries, transportation, hydrography, and the U.S. Public Land Survey System. All of these layers are coded in a vector format, based on a latitude-longitude coordinate system (as discussed briefly in Chapter 4). Another series of U.S. GeoData tapes is for land use and land cover. The land-use and land-cover tapes are available in either vector or raster cell format, based on a Universal Transverse Mercator coordinate system. Another kind of information available from the USGS is the Digital Elevation Model. These are regular raster datasets of elevation values. There is a long-range effort in the U.S. government to create a National Digital Cartographic Database (Eric Anderson, pers. comm.). This is based on the work of an interagency coordinating committee, to set standards for the format and content of digital spatial data throughout the federal government. The layers to be included in this database include hypsography, hydrography, land surface cover, surface features excluding vegetation, boundaries, positional control, transportation, other man-made structures, and the Public Land Survey System. Not all governments choose to make these kinds of data part of the public domain, as is the case in the United States. In Great Britain, for example, spatial datasets like those we have discussed are generally viewed as private holdings of the appropriate national agencies. As such, users may need to negotiate with the agency for access to these data. The United Nations Environment Programme is a spatial data user as well as a producer. 
Through the newly established Global Resources Information Database, with existing centers at the UNEP offices in Geneva, Switzerland, and Nairobi, Kenya, efforts are under way to collect and disseminate important spatial datasets for the globe, as well as provide certain kinds of assistance in spatial data collection and processing to less-developed countries. Sample datasets in the archives now include range and endangered species distributions for parts of the world, as well as small-scale global datasets of soils and vegetation. We believe this is an important initiative, and a mechanism for the more effective management of important large-area datasets. In many cases, existing datasets may not be quite suitable for the application. For example, in looking for data for a particular project, one may find a certain readily available map that has an appropriate scale and an appropriate set of thematic categories. The map may, however, be old enough to raise a significant concern about its value for the intended project. In cases such as this there are a variety of techniques one can use to update the old map -- thus avoiding the expenditure of time, effort, and money that is required to compile a new map from scratch. Photogrammetric tools such as transfer scopes and projectors permit us to merge the information of older maps with later aerial photographs (see section 6.8). Similar results may be obtained in some cases by using image processing technology (as outlined in section 10.3). There are other times when an existing dataset doesn't quite suit the needs of an application. In some cases, the desired information may be obtained from these datasets through inference. For example, to help decide on a good location for a shop that sells and repairs bicycles, one may wish to obtain some suitable demographic data. But available demographic data may not provide directly the information desired about the bicycle-riding population in an area.
If, on the other hand, available datasets (such as those which are created by the U.S. Census Bureau) do give information about the age distribution of the population as a function of location, plus information about the location of certain educational institutions, the needed information may be inferred (with the help of whatever reasoning, observation, and other information are available). Thus, in southern California, where college-aged people living within four miles of a college or university are likely to own bicycles, potential locations for a bicycle shop -- near student housing or between student housing and the educational institutions -- could be identified. Overall, it is not always possible to find information about the quality of existing datasets. The expensive but conservative option is to attempt to verify the accuracy and precision, either quantitatively or qualitatively. Unfortunately, we often avoid these issues, and simply use what we believe to be the best available datasets for the task at hand. This latter choice runs a serious risk of letting us fool ourselves.

5.3 Developing your own Data

There will be times and circumstances in which it is necessary to develop your own datasets. Existing information resources may not be relevant to the problem, or perhaps they are not sufficiently current. There may also be datasets whose validity is unknown, thus forcing us to collect our own data to either test the existing datasets or to replace them. These occasions have a particular advantage, in that they give us the opportunity to design the data acquisition program to meet our needs exactly. The usual disadvantage of such a program is, unfortunately, the expense of designing, implementing, and managing the data-gathering task. Constructing new datasets involves field work of many kinds. Maps of terrain or of the location of certain cultural features may need to be created.
Details of plant and animal populations, such as noting the occurrences of different species or determining the age distributions within a population, may be required for an environmental report. Knowledge of the general trends in groundwater elevation in an area, as well as changes through the annual cycle of discharge and recharge, may be needed to site a water well or a waste disposal facility. Knowledge of the presence or absence of archaeological remains at predetermined locations may be needed before permits may be granted for construction projects. As we discussed in section 5.2, we must always look for both direct observations of the target and inferences that can be made from other datasets. Sampling design is one of the important elements of any data gathering plan, where decisions are made about how to gather the data of interest. We will discuss some of the key concepts of sampling design in the next few pages. We must first take a moment and distinguish accuracy from precision. When making observations and measurements of any kind, an understanding of these two different concepts is very important. By accuracy, we mean freedom from error or bias, and thus, closeness to the "true" value. Precision, on the other hand, refers to our ability to distinguish small differences. Determining the distance between two trees with a device that measures to the nearest centimeter provides a higher level of precision than using a device that measures to the nearest meter. On the other hand, if the centimeter-scale device is calibrated incorrectly, so that it is consistently underestimating distances, the meter-scale device may have higher accuracy. Before looking into some of the details of sampling in space, we will examine some concepts of spatial pattern. In Figure 5.1, we show three different spatial patterns. These patterns might have come from an exercise in mapping the locations of certain types of plants.
In Figure 5.1a, the pattern could be described as clumped, since the mapped objects are concentrated in certain areas. Another way of describing this distribution is to say we have positive autocorrelation: the objects are typically found close to others of the same kind. In the case of plants this might be due to the environment in these locations being particularly hospitable to the type of plant. In Figure 5.1b, on the other hand, the objects in the spatial pattern are spread out or dispersed; we could also say that we observe negative autocorrelation, meaning that the objects are typically found well away from others of the same kind. This type of pattern is sometimes seen in desert plants that compete for water, or in established forests where, in the competition for sunlight and root space, the older and more successful trees have overshadowed, crowded out, and killed off the smaller and less vigorous trees of like requirements. One simple way to characterize these patterns is to focus on the distance from any given plant to the nearest neighboring plant of the same kind. In the clumped pattern, the average distance to the nearest neighbor is very small, and there will be relatively little variation in this distance if the clumps are uniformly dense. In the dispersed pattern, the average distance to the nearest neighbor is quite large. A random distribution of these plants would imply that at any location, there is an equally likely chance of finding a plant, regardless of other plants in the vicinity. Thus, in a random pattern (Figure 5.1c), some plants would be close to their neighbors, and others would be far away. The clumped and dispersed spatial patterns in Figure 5.1 represent significant departures from random. When designing a sampling program, one must make a number of decisions. One of the essential choices involves the samples themselves. Point sampling involves determining the desired information at a single point.
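The nearest-neighbor characterization of clumped and dispersed patterns described above lends itself to a short computation. The Python sketch below is only an illustration: the function name and the plant coordinates are hypothetical, and it simply compares every pair of points rather than using any more efficient search.

```python
import math

def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest neighbor."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to the closest of all other points.
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        total += nearest
    return total / len(points)

# Hypothetical plant locations in meters.
clumped = [(0, 0), (1, 0), (0, 1), (1, 1),          # one tight clump
           (20, 20), (21, 20), (20, 21), (21, 21)]  # a second clump
dispersed = [(x, y) for x in (0, 10, 20) for y in (0, 10, 20)]

clumped_mean = mean_nearest_neighbor_distance(clumped)
dispersed_mean = mean_nearest_neighbor_distance(dispersed)
print(clumped_mean, dispersed_mean)
```

Both hypothetical patterns occupy roughly the same region, yet the mean nearest-neighbor distance is 1 meter for the clumped points and 10 meters for the dispersed ones, which is the contrast the text describes.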
For example, once a point is chosen, we could determine depth to groundwater by drilling a well at that location. Another option is line sampling, where two points are chosen and we census the desired information along the line between them. This is common practice in vegetation ecology, where all the plant species are identified along chosen lines or transects. The third common option is to locate appropriately sized areas or quadrats in the region of interest, and determine the needed information in each area. Quadrats are typically square or round. An advantage of quadrat and transect sampling techniques is that they provide information about spatial distributions both within the samples (the quadrats or transects themselves) and between the samples. The endpoints in a continuum of sampling strategies are the case where we take a single measurement to infer regional characteristics, and the case of exhaustive sampling, where we are able to sample the entire region. In essence, the latter is the extreme case of quadrat sampling, in which the area of the quadrat is chosen to be identical to the area of interest. Clearly, this will provide us with the best information, since we will examine the geographic region of interest completely. However, we often have limited resources with which to conduct our data gathering, and thus have to design a plan to maximize the value of the limited data we can obtain. Figure 5.2 shows two different ways to apportion sampling effort. In the upper left of Figure 5.2, we show an example of random sampling. We have first divided the area into quadrats of equal size, and then, using a computer-generated list of random numbers, have randomly selected 25% of the quadrats for examination. Note that this is but one way to select the quadrats. Alternatively, we could have chosen the quadrat center points randomly, rather than choosing from a regular array of possible quadrats.
In this example, our 25% sample has discovered 1 object of interest, and we thus extrapolate to decide that 4 objects probably exist in the region. Notice that the true number of objects is 8; we have underestimated the total. Another way to apportion our sampling effort is shown in the upper right of Figure 5.2. In this second case we have used a systematic sample. Here we use the same number of sample quadrats, but they are arrayed in a regular, predictable pattern in space. We find two objects in this second sample, and thus extrapolate to a total population of 8. The example above is not meant to suggest that systematic sampling is always better than random sampling. Systematic sampling is often easier to design and implement than random sampling, since it is a relatively simple task to systematically place the sampling elements, whether they are points, lines, or quadrats. Random sampling, on the other hand, requires relatively more effort to choose the locations, navigate in space, and locate the samples. However, systematic sampling can cause all kinds of artifacts when the sample spacing roughly coincides with a multiple of the spacing of the objects of interest. Consider, for example, trees in an orchard that are routinely planted in a square array, with 4 meters between rows and columns. If we apply quadrat samples systematically on a square grid with an 8-meter spacing, precisely twice the spacing of the trees, we could completely miss all the trees (Figure 5.2c)! To deal with this problem, a method called systematic unaligned sampling is used, in which the sampling interval and orientation are specifically adjusted to avoid alignment with the phenomena of interest (Congalton, 1988). Another sampling strategy is used when we have some information about the study site. This third option, known as stratified sampling, involves dividing the study region into relatively homogeneous units, and sampling each unit separately.
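Before turning to stratified sampling in detail, the orchard example and the extrapolation from a sample to a regional total can be made concrete with a small simulation. The Python sketch below is an illustration under assumptions that are not in the text: the orchard size, the quadrat size, the grid offsets, and all function names are hypothetical. Only the 4-meter tree spacing and the 8-meter sampling grid come from the example above.

```python
import random

TREE_SPACING = 4   # meters between rows and columns (from the text)
FIELD_SIZE = 48    # hypothetical square orchard, meters on a side
QUADRAT_SIZE = 2   # hypothetical quadrat side, meters

# Trees planted on a regular square array.
trees = [(x, y) for x in range(0, FIELD_SIZE, TREE_SPACING)
                for y in range(0, FIELD_SIZE, TREE_SPACING)]

def count_in_quadrat(corner):
    """Count trees inside the square quadrat with this lower-left corner."""
    cx, cy = corner
    return sum(cx <= x < cx + QUADRAT_SIZE and cy <= y < cy + QUADRAT_SIZE
               for x, y in trees)

def estimate_total(corners):
    """Extrapolate the sampled count to the whole field by area."""
    found = sum(count_in_quadrat(c) for c in corners)
    sampled_area = len(corners) * QUADRAT_SIZE ** 2
    return found * FIELD_SIZE ** 2 / sampled_area

def systematic_corners(spacing, offset):
    """Quadrat corners on a regular square grid starting at the offset."""
    steps = range(offset, FIELD_SIZE - QUADRAT_SIZE + 1, spacing)
    return [(x, y) for x in steps for y in steps]

# An 8-meter grid that falls between the rows misses every tree ...
aligned_miss = estimate_total(systematic_corners(8, offset=1))
# ... while the same grid shifted onto the rows hits a tree every time.
aligned_hit = estimate_total(systematic_corners(8, offset=0))

# The same number of quadrats (36) placed at random locations.
rng = random.Random(0)
random_corners = [(rng.uniform(0, FIELD_SIZE - QUADRAT_SIZE),
                   rng.uniform(0, FIELD_SIZE - QUADRAT_SIZE))
                  for _ in range(36)]
random_estimate = estimate_total(random_corners)

print(len(trees), aligned_miss, aligned_hit, random_estimate)
```

With these assumed dimensions the orchard holds 144 trees, yet the two aligned systematic samples extrapolate to 0 and 576 trees depending only on where the grid starts. The random sample is not tied to the planting pattern, although any single random estimate will still vary around the true value.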
For example, imagine a program for monitoring a group of insect pests that damage farm crops. The target area may have several different crops under cultivation. In a stratified sampling program, we can explicitly take the spatial distributions of the crops into account. We might sample more intensively in fields growing crops that the target insects are known to inflict great economic damage upon, to be able to take measures as early as possible in the growing season. In contrast, we might work less hard in crops on which the insects are known to cause little damage. An important use of stratified sampling is seen when the geographic region has important but rare polygonal areas. Consider a program to determine the distribution of land cover as a function of cover type. In a random sample, the number of observations of a particular type of land cover will be proportional to the area of the cover. Thus, common land-cover types will be sampled frequently, and rare land-cover types may never be observed. To compensate, we can use a stratified sampling plan to make sure that we visit examples of every known land-cover type, and thus improve our knowledge of the rare types. It is common practice in many fields to use a pilot study -- a rapid or preliminary look at the population of interest -- before any major sampling efforts. The pilot study is usually designed for two purposes. First, it permits us to gather some information in the field, possibly from the target area. This small amount of information can be used for adjusting quadrat size, selecting the total number of samples for the principal sampling effort, testing the observational methods, and (where possible) developing some general characteristics of the study population. Such a pilot study could permit us to choose an unaligned sampling strategy in the case illustrated in Figure 5.2c, and thus could prevent us from being misled. 
In the case of stratified sampling, the pilot study could provide the necessary information to develop or test the initial stratification. The second purpose of the pilot study is that it permits us to check our original hypotheses about the costs and time that would be required for gathering the data. Thus, the pilot study permits us to avoid costly mistakes and poor data quality. The tools of remote sensing, discussed in some detail in section 6.8 and Chapter 10, can be of tremendous value in designing field studies. Remote sensing, defined here as a suite of techniques for making observations at a distance, can often provide cost-effective information about properties of the Earth's surface over large areas. Frequently, an aerial photograph or processed multispectral scanner image can provide the basis for a field sampling campaign. In our own work, these data sources have been used to minimize the effort required to collect more traditional spatial datasets, such as social and environmental surveys.
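The stratified sampling plan discussed earlier in this section, in which every known land-cover type is visited even when it is rare, can be sketched as a simple allocation of sampling effort. The Python example below is a hedged illustration: the land-cover names, their areas, the minimum of 10 samples per stratum, and both function names are hypothetical, and the rule of sharing the remaining samples by area is one assumed choice among many.

```python
def proportional_allocation(areas, total_samples):
    """Expected samples per cover type when effort simply follows area,
    as happens on average in a simple random sample."""
    total_area = sum(areas.values())
    return {name: total_samples * area / total_area
            for name, area in areas.items()}

def stratified_allocation(areas, total_samples, minimum_per_stratum):
    """Reserve a minimum for every stratum, then share the rest by area."""
    reserved = minimum_per_stratum * len(areas)
    if reserved > total_samples:
        raise ValueError("not enough samples to meet the minimum everywhere")
    extra = proportional_allocation(areas, total_samples - reserved)
    return {name: minimum_per_stratum + extra[name] for name in areas}

# Hypothetical land-cover areas in square kilometers.
cover_areas = {"cropland": 700, "forest": 250, "wetland": 45, "bare rock": 5}

by_area = proportional_allocation(cover_areas, 100)
stratified = stratified_allocation(cover_areas, 100, minimum_per_stratum=10)

for name in cover_areas:
    print(name, by_area[name], round(stratified[name], 1))
```

Under these assumptions, a sample that follows area alone would be expected to place less than one observation in the rarest class, while the stratified plan guarantees it at least ten at the cost of fewer observations in the common classes. The allocations are expected values, so in practice they would be rounded to whole samples.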