Chapter 7 Data Management As we discussed in Chapter 3, the data management element of a geographic information system is central to the overall system. The data acquisition and preprocessing phases we have already examined involve preparing the data for storage and use. The data management functions make the information available to the users. There are a number of important components in any information system. The database itself is a structured collection of information. The tools of database management provide safe and efficient access to the database. The overall goal of data management is to provide users with such access without having to learn the details of the database itself. In effect, the database management system hides many of the details, and thus provides a higher-level set of tools for users. An important distinction is made in the data management field between logical data and physical data. The way in which data appear to a user is called a logical view of the data. Through the database management software, the physical data themselves including the details of data organization as it actually appears in memory or on a storage medium, can be kept hidden from the user. As a brief example, consider a system that records publications of the faculty in a university. The faculty members may all be using the same computer system for their publications index, but each may keep track of her or his own publications in slightly different ways. One may use a fixed-length record organization, where information about each reference falls into pre-defined locations in sequence in a data record. The first 32 alphanumeric characters in the data record might be reserved for the first author’s name, for example. Thus, the position of a piece of information has meaning. Another may use variable-length fields, with tags to indicate record contents; the records in this case might have elements such as "YEAR = 1988" and "KEY = HYDROLOGY." The elements "YEAR" and "KEY" are called tags, since they mark the fields so that we (and the applications software) know what the field contains. In some cases, there may not be rigid specifications for the lengths of these fields. Because of the presence of the data tags, the sequence of the data fields does not matter in this second case. As long as all the required information for this application is available -- title of the paper, the list of authors, the name of the journal, etc. -- this can be managed. Thus, several different data files of bibliographic references, one for each faculty member, stored in possibly different physical organizations on different fixed disk, may be in simultaneous use. However, a modern database management system should be able to present a user with a single logical view of the dataset in a formatted presentation. The database management system, or DBMS, hides the physical details of storage and retrieval from the users. Furthermore, different users may have different logical views of the stored data. In other words, the data does not have to appear the same way to all the users. Thus, logical views are dynamic in nature: they may change, depending on the user and on the application program. Such a system may also provide a mechanism for access control: anyone in the department may examine the contents of the bibliographic database we've described, but only the authors (or their designated representatives) may enter new references or modify old ones (for example, to correct errors, or to change an entry to indicate a paper formerly “in press” has finally appeared). In these ways, data management software allows a user to access data efficiently without being concerned with its actual physical storage implementation, and allows degrees of protection in terms of what a user may see, and what a user is permitted to do. 7.1 Basic Principles of Data Management A database may be defined as data and information stored “more-or-less permanently” (Ullman, 1982), or as structured collection of information on a defined subject (Martin, 1986). A database management system is the software that permits one or more users to work efficiently with the data. The essential components of the system must provide the means to define the contents of a database, insert new data, delete old data, ask about the database contents, and modify the contents of the database. A DBMS is like a high-level computer language, in that primitive functions, such as the details of the operating system calls for opening and closing files, are hidden from the user. Obviously, defining the contents of the database must be the first step in data management. The kinds of information required include: Data format definition: Is a specified data value to be stored as an integer -- and if so, how many digits are required for storage or presentation? Is the value a floating point number -- and if so, shall we use exponential notation? Is the data value a string of characters -- if so, is there a pre-defined format, such as those which are used for a date or telephone number? Data contents definition: The fields in the database are named. Useful names (such as CREATION.DATE or LATITUDE) are more helpful than arbitrary names (such as VARIABLE001 and X3). Value restrictions: Many systems permit the user to enter constraints on data values, which are then used to validate new entries. Variables such as date and time of course have constrained values: accidentally entering “month 13” or“63 minutes past noon” should be easy to prevent. Other values may have applications-dependent restrictions (such as the latitude and longitude values for a specified country), as well as default values. These kinds of information about the data in a DBMS is normally stored in a data dictionary. The data dictionary is itself a database, and since it describes the contents of another database, we say that it contains metadata: data about data. There are a number of important functions a database management system must provide: Security: As in the department-wide bibliographic database example above, all users should not have all modes of access to a database. Those without proper knowledge, or proper authority, should not have the ability to modify the contents of the database. Integrity: A DBMS checks elements as they are entered, to enforce the necessary structural constraints of the internal data. Data fields are checked for permissible values as described above; users (and applications programs) are forced to enter those data fields that are required; and so on. Furthermore, when the system has the responsibility of creating internal elements (such as pointers that explicitly provide links between data elements, or values that are derived from the entered data), they are created and checked for consistency. Checks for logical consistency are also required for the user's entries as well. Remembering the arc-node vector database structures discussed in Chapter 4, arcs are defined based on their end nodes. The database management system must prevent an arc from being described unless the related nodes exist. Further, editing functions must not be permitted to disrupt internal dependencies. Again considering an arc-node database, we must be prevented from deleting a node if there are still arcs that are defined based on that node. Synchronization: This refers to forms of protection against inconsistencies that can result from multiple simultaneous users. For a simple example, consider a database system for managing a collection of maps in a library. At the same moment, two users examine the database to determine whether a particular map is present in the collection. Both users are able to determine that the map exists, and they are each led to believe that it is currently in the library and available to be checked out. They each go to the library to check out the map, and one of the two users must be disappointed. A mechanism is required so that when one user is about to remove something from the collection, the other user is either warned, or prevented from accessing the information until the first has committed to the transaction. While most GISs rarely have more than one user accessing a portion of the database at a time, the problem is quite real in conventional database systems. Imagine one operator updating a land-use data layer, while another is trying to analyze the same layer: the analytic results will change through time, much to the chagrin of the second operator. Physical data independence: The underlying data storage and manipulation hardware should not matter to the user. In fact, under perfect circumstances, the hardware could be changed without users having any awareness of the change. This is a part of the distinction we made between logical and physical views of the database. It also permits system developers to write software without having to keep track of all of the details of physical data storage. Ultimately, this independence permits us to change hardware as needs and technologies change, without rewriting the associated data manipulation software. Minimization of redundancy: A desirable goal for a database is to avoid redundancy. When material is stored redundantly in a database, updating can be complex, since we must modify the same information stored in several places. If a data element is not changed identically in each of its locations, the database is corrupted. As we mentioned in Chapter 4, however, there are times when we optimize overall system performance by not following this tenet. 7.2 Efficiency Efficient data storage, retrieval, deletion, and update are dependent on many parameters. Optimizing a database management system is a complex subject, discussed at length in many texts (see, for example, Chapter 6 in Martin, 1986). There are specialized systems in the commercial marketplace that are designed to handle certain kinds of database transaction problems, particularly in financial applications and travel reservation systems, with extremely high performance. We touch only on a few highlights of the subject. Two key elements to consider are the physical storage medium and the organization of the data (Calkins and Tomlinson, 1977). The most common storage media in most computer systems are magnetic disks and tapes. As we write this manuscript, we are also beginning to see optical materials in commercial computer systems. Data on magnetic disk are typically organized as concentric circles of data bits, called tracks, with tracks divided into sectors. When more than one disk is held on a single drive spindle, all the tracks vertically above each other are termed a cylinder. The magnetic heads that read and write the data on the disks are stepped across the disk surface to address the appropriate track, and then the system must pause while the relevant sector rotates underneath the disk head. Thus, the raw parameters of disk performance are governed by the speed at which the head assembly can move to a new track, and the speed at which the disk rotates. The speed of disk rotation, plus information about the linear density of data along the tracks, directly provides information about the ultimate data transfer speed to or from the disk. Since we can direct the head assembly to any track from any other, we call this a random access storage system. If many users are simultaneously working with a single disk drive, they may cause a tremendous amount of head movement, as the system works to satisfy all the user’s different requests for data. Two techniques are commonly used to improve disk system performance. A cache is a bank of high-speed memory used to store data that the system expects to need. For many systems, there may be little difference between reading a single sector of data compared to reading several sectors from a disk. Once the disk head assembly has moved to the correct track, reading sectors in addition to the one chosen may often take very little additional time. If the system can place one or more additional sectors from the disk in the cache, a subsequent request for data might be satisfied by the cache, without incurring the costs of head movement and disk rotation. This may be the case when sequentially reading from or writing to a file. If the data are organized on the disk such that caching is effective, system performance can be significantly improved, since data stored in the cache can be retrieved much faster than that stored on the disk. A second way to improve system performance, particularly on multi-user computers, is to build intelligence into the disk controller to optimize a sequence of data requests. If there is a processor on the disk controller that examines all the pending data requests, it may be possible to schedule the requests in a sequence other than the order in which the request were made, to improve overall system performance. For example, the disk controller may be able to make use of knowledge of the distribution of data on the disk surface itself, in conjunction with knowledge of the current position of read/write head: to satisfy a sequence of data requests, read the data nearest the current head position before moving the head a great distance. This method may reduce the average time it takes to retrieve data over a large number of requests. Such a strategy must not be taken to an extreme; it must be ensured that all users have equitable access to the resources. Magnetic tape storage is called sequential access, since the storage system must pass through the length of the tape to locate data elements, unlike the random access capabilities of magnetic disk systems. The industry standard tape records a linear density of 1600 or 6250 bits per inch, with nine parallel data tracks on half-inch-wide tape. Eight of these tracks are used for data storage, and the last provides some measure of error protection. Read and write speeds of 25 to 125 inches per second are common. Cache memory buffers are common in these systems, for a similar reason as in disk systems. Since the magnetic tape cannot accelerate and decelerate instantaneously, the cache can receive or provide data in very-high-speed bursts to improve system performance. Since these devices are inherently sequential, magnetic-tape drives are inappropriate when random data access is required. They are most commonly used for database backup, long-term storage, and data transfer between systems. We mention a caveat to long-term tape storage in Chapter 12. According to Calkins and Tomlinson (1977), The file structure should be determined by the complexity of the data structures, the data manipulations that must be performed, and the type of computer-aided techniques to be used for example, regular computer job processing (termed batch) or interactive manipulation and analysis. A difficulty that arises here is based on the uses and users of the system. If a system is designed to support a single kind of processing problem, it is possible to define an optimum file structure, in terms of overall system performance. Design of such a structure would take into account the physical characteristics of the hardware, the design and operation of the software, and the data structure and its implementation. In contrast, when a wide variety of analytic operations are needed, supporting different kinds of users with different requirements and datasets, across several disciplines, such an optimization is much more difficult. In the creation of a spatial database, it is necessary to provide modes of access for retrieval of both spatial and non-spatial (or attribute) information. Searches are conducted to locate features as well as sets of features. A GIS may be required to locate any of the spatial objects we mentioned in Chapter 1, or any of the components of the spatial database we discussed in Chapter 4. Some of the queries of the spatial database involve classes of features such as (from Calkins and Tomlinson, 1977; and Salmen, et al., 1977): A single feature: Find a second-level stream. A set of defined features: Find all second- and third-level streams. An incompletely-defined feature or set of features (sometimes termed browse): Find all features of (type = hydrography) in this region. Features based on defined relationships within the data set: Find all second-level streams above 3000 meters elevation. A set of features in which the criteria are within another dataset. For example, we may have an external database management system with both water-quality measurements as well as water-quality standards, in addition to a geographic information system with information about the stream network: Find second-level streams with PCB pollutant levels above state water- quality guidelines. Efficient data-retrieval operations (which of course are required during the processing phases in addition to an explicit query of the database) are largely dependent upon four items. The volume of data stored certainly affects data-retrieval speed, particularly when exhaustive search of the database is necessary. The method of data encoding can be very important to performance. This includes decisions about the types of variables to be stored as well as the way the values are stored. The design of the database structure can take advantage of knowledge of the types of problems to be solved, as we mentioned in Chapter 4. Finally, the complexity of the query directly affects the necessary calculations as well as the types and amounts of requests to be made of the database management system. These four concerns are all important to any evaluation of system performance. In general, there are well-developed sets of procedures for reasonably efficient retrieval of non-spatial data. However, searches for spatial features or sets of features are considerably more complex and optimizing systems’ search performance under these kinds of conditions is an active field of GIS research. 7.3 Conventional Database Management Technology At the present time, there are two principal approaches or models in database management systems. The following briefly discusses each of these in turn, to help users of geographic information systems to understand better a system's internal structure and capabilities. Modern textbook in database management (such as Martin, 1986) treat this topic in much greater depth. 7.3. 1 Relational DBMS Relational database management systems have become extremely popular in recent years. The underlying model on which they are based is very easy to understand, and they are particularly well-suited for ad hoc user queries. As an example, consider the following data about the hypothetical golf course we’ve mentioned in previous chapters: Nominal Area of Total Date Hole Length Par Green Area Repaired 1 410 4 210 17690 10/87 2 365 4 160 20350 2/87 3 390 4 150 16980 2/87 4 150 3 75 4210 2/87 5 490 5 185 21760 2/87 6 340 4 95 11610 4/87 7 165 3 85 4500 10/87 8 475 5 120 18200 4/87 9 420 4 110 14540 10/87 In this example, the entire table is called a relation. Each row in the relation, which corresponds to the complete set of data about a single hole, is called either a record or a tuple. Each column within a record, which corresponds to a different kind of information about a given hole, is called a field or an attribute. For example, the record which describes the first hole stores five different facts about the first hole. These facts are the fields within the records, and the separate fields tell us that the first hole is 410 yards long, par for the hole is 4 strokes, the green is 210 square yards, the total area of the hole is 17690 square yards, and the hole was last refurbished in October of 1987. For those readers who are not golfers, the par for a hole is the estimated number of strokes required to play a specified hole, if one is skilled and makes no errors. Asking questions of such a database in most cases requires that either a database programmer construct a custom software module for the application, or that the user learn a query language. The query language is the means of controlling the database management system, much like an operating system has its own language for controlling the computer. A sample query about the contents of this database might look like: OPEN TABLE “FRONT-NINE” PRINT (HOLE, LENGTH) WHERE (PAR > 4) The first command instructs the system to look in the appropriate part of the database (in this example, we are retrieving a set of records about a particular group of holes). The second command poses a specific question to the database manager. The system then responds with a table, listing the requested information (holes and their length) based on the specified constraint -- that the par for that hole is greater than 4: QUERY #l: TABLE = “FRONT-NINE” PRINT (HOLE, LENGTH) WHERE (PAR > 4) HOLE 5 8  LENGTH 490 475   One of the key characteristics of a relational database management system is that a single query can find more than one set of tuples, as we have seen in this example. More sophisticated queries might involve more than one attribute at a time: PRINT (HOLE, TOTAL.AREA) WHERE (DATE = 2/87) AND (TOTAL.ABEA > 10000) This query might be used to begin planning the purchase of materials needed to restore some of the golf course, based on the elapsed time since the last refurbishment and the area which may need restoration. The database management system would respond to this query with: QUERY #2: TABLE = “FRONT-NINE” PRINT (HOLE, TOTAL.AREA) WHERE (DATE = 2/87) AND (TOTAL.ABEA > 10000) HOLE 2 3 5  TOTAL.AREA 20350 16980 21760   In later queries, we could ask the system to compute values derived from several queries, to generate such information as the total area of the greens (perhaps to estimate the amount of seed and fertilizer required in an annual maintenance budget). Note that these tables of data correspond closely to the organization of spatial and non-spatial data in several of the vector data structures described in Chapter 4, in particular, the arc-node (section 4.2.3) and relational data structures (section 4.2.4). Chang (1981) notes that there are a wide range of functional query languages. At one extreme they are termed procedural, since the user must specify not only what is wanted from the database, but also the details of how to get it. Since the user must work relatively hard in this case, by specifying the procedures necessary to answer the question, the system processing may be relatively simple. At the other extreme are non-procedural query languages, which specify only the user’s objectives. The system then has more work to do, since the user has supplied none of the implementation details. A popular model for a non-procedural interface to a database is termed query-by-example (or QBE), and is found in several commercial products. In a QBE system, queries are described by using the database interface to (in effect) describe the appearance of the output report. The system then develops a method of producing such a report, which essentially corresponds to the problem of writing a program in the database query language. Systems based on the QBE model can be extremely easy to learn. As we discussed in Chapter 4, this relational database model is used in a number of operational geographic information systems. Lorie and Meier (1984) discuss some of the important issues in the use of such systems. In contrast to some operational GISs, relational database management systems often have well-developed capabilities for security, networked multi-user access, and concurrency control. 7.3.2 Navigational DBMS In contrast to the relational database model described above, which is based on rectangular arrays of data, there are navigational database models which may be thought of as directed graphs (Martin, 1986). Usually broken down into network and hierarchical versions, navigational databases are also based on records, as discussed in the relational model in section 7.3.1. We note here that the term navigational, as used in this section, is not used in the usual sense of spatial analysis, where the goal may be route evaluation through a transportation network, or an evaluation of fluid flow through a plumbing system. Instead, we use the term as Martin does, to describe the following structure for an automated database management system. In a navigational database, the relationships between different tuples are often displayed as links in a diagram. As an example, consider one organization for a collection of maps. We have separate kinds of records for storing information about the individual maps, about countries, about map series, and about the way maps are stored: MAPS: Map Name, Area of Coverage MAPSERIES: Series Name, Scale, Thematic Coverage COUNTRY: Country Name, Area of Coverage STORAGE: Map Name, Cabinet, Drawer As is clear in Figure 7.1, there are a number of links between these different databases. When we retrieve information from this database, we traverse the links between the different record types; this is why we use the word navigational for these databases models. To retrieve a map of a particular country, we first examine the COUNTRY database to determine the geographic area covered by the country. We can also examine the MAPSERIES database to find a map series with the appropriate themes (for example, we may be looking specifically for geologic maps). Based on the MAPSERIES database, we can follow a link to a specific MAPS database, where information about the maps for the particular series are stored. We then use the information from the COUNTRY database to find a specific map in the selected series. Finally, based on the unique map name, we follow a link to the STORAGE database to find exactly where the desired map is stored. Unlike the relational data model, in navigational systems we only retrieve a single tuple from a given query. In order to continue traversing links to find additional examples, we must specifically request that the system get the next relevant tuple. COUNTRY LATITUDE 19N - 72N… CANADA LATITUDE 42N - … MAPSERIES USGS 1 DEGREE MAPS 11952’30’’ W 3430’N 7.5-MINUTE 1:24000 CALFAX QUADRANGLE 12007’30’’ W 3622’ 30’’N 7.5-MINUTE 1:24000 STORAGE ROW 6 CABINET 9 DRAWER 1 ROW 6 CABINET 9 DRAWER 2 ROW 6 CABINET 9 DRAWER 3 Figure 7.1 Navigation database model. Notice that there is a hierarchy in a part of the system we have described. The MAPSERIES database points us to one of a number of MAPS databases; thus, we might consider the MAPSERIES database as a parent of the MAPS databases, which would be called the children. Note also that there are many-to-many relationships in this example. One such instance is the relationship between MAPS, MAPSERIES, and STORAGE based on the map name. The relational database model has become more popular than the navigational model in recent years. While the navigational model may provide a faster response time for predefined queries, the relational model is extremely easy to explain to a user, and is well-suited to ad hoc queries. Further, the relational query languages may often be easier to learn than those for navigational database systems. 7.4 Spatial Database Management The functions of data management permit the efficient use of a database and are the entry points to hardware and software facilities. In a very real sense, there are two fundamental questions to ask of spatial data, no matter whether the data consists of a paper map stored in a cabinet or a digital file system within a geographic information system running on a multi-million dollar computer: What is found at a given location? Examples of this question might be: What is the elevation of a specific geographic location? At a specified coordinate, what is the soil type? Are there examples of specified objects within a specified area? Examples of this question might be: Are there any fire towers in a specified region? Where are any sources of potable water in a specified region? These two fundamental questions are ultimately at the core of any geographic analysis. In addition to answering these kinds of questions about spatial data, a modern GIS should possess a number of qualities that are common to all database management systems. These include: Efficiency: The storage, retrieval, deletion, and update of large datasets is an expensive process overall. These are the essential management functions for any database, and must be carried out efficiently regardless of physical storage device (fixed or removable disks, magnetic tape, or the new optical media) or database location (whether stored on a local computer, or accessible across a network or multiple-computer cluster). Capability of handling multiple users and databases: This means we must be able to support simultaneous access to the database by multiple users when required, as well as logically view the database as arbitrary subsets of the entire physical database. Lack of redundancy of data: Redundancy in a database is generally not desirable. In section 4.2.1, we discussed whole polygon vector structure, and noted that one of the problems with this data structure is its redundancy. In a database, storing values that are dependent on other stored values without explicitly keeping track of the dependencies can lead to disruption of the database. At the same time, storing and manipulating the dependencies, in addition to the data itself, increases the difficulties of working with the data. Data independence, security, and integrity: These three areas are, again, not unique in any way to geographic databases, but are guiding principles for any database management system. Data independence implies that data and the application programs that operate on them are independent, so that either may be changed without affecting the other. This is a fundamental consideration, and reflects a common database management point of view. Whenever possible, we must design systems in which components may be changed in a modular fashion, with minimum disruption of other components. Security refers to the protection of the data against accidental or intentional disclosure to unauthorized persons and protection against unauthorized access, modification, or destruction of the database. Integrity in this context is the ability to protect data from systems problems through a variety of assurance measures (e.g., range checking, backup, and recovery; Martin, 1975). In conventional data processing terms, such support functions are referred to as database management. These principles are not firm rules that must be obeyed. Rather, they are useful guidance. In automated data processing systems, there are always tradeoffs to consider, balancing theoretical preferences against practical system operations. As we saw in an example in the data structures chapter, if we have a geometrical description of a bounded region stored in the database, we can always calculate its area or perimeter whenever required. However, if these values are needed repeatedly, it may be more appropriate to store the values permanently. If we do so, we are creating redundancy in the database, which will make data update more difficult. This will also increase the size of the database, since we are storing additional attributes. These costs must be balanced against the reduced processing needed during system operation. Many retrieval strategies have been devised to facilitate manipulation and retrieval of large spatial databases. One of the most costly operations is searching the database. Some of the means to minimize the costs of search include: Paging: In paging (which is sometimes called tiling), the spatial database is subdivided into (usually rectangular) regions, which may not have to be the same size. In this way, operations on the database do not need to work with the entire database at once, but with a smaller set of pages (or tiles). This is reminiscent of dividing a study area among a set of map sheets. This internal stratification can dramatically improve a system’s speed when responding to queries, but it must not burden the users. As the database grows in size (for example, when new subdivisions are built in a city), functions should be available to subdivide the existing pages into smaller ones, to keep the data volume in each page small, which facilitates rapid access. Complications will arise in this kind of system at the margins between the pages, where features may overlap the boundaries. Retrieval within a specified region of interest: In this case, the user specifies a spatial window of interest for the current set of queries. If portions of the database can be eliminated from consideration, the system can be much faster (and thus, ultimately, work with more data). To avoid problems of redundancy, this should be possible without copying the data in the region of interest to a new file. Centroid as a search key: If the centroid of an object’s location is stored in the database, the speed of operations such as search for the conjunction between objects can be improved. Clearly, the centroid, as a single parameter description of an object, cannot completely characterize the object for all applications. However, indexing classes of objects by the location of their centroids can provide a dramatic help when searching through large spatial databases. Similarly, if the applications require them, other high-order descriptions such as perimeter, size, and indices of shape can also be stored to improve system response. Hierarchical storage: As discussed in section 4.1, hierarchical decompositions of space can offer significant performance improvements in some applications. By effectively having descriptions of space at different mean resolutions, it is sometimes possible to prune away regions very rapidly and to focus quickly on promising areas. At the same time, for some kinds of data, the hierarchical storage structures may provide mechanisms for space-efficient storage for some kinds of geographic data. Topological encoding: As discussed in Chapter 4, many modern spatial databases explicitly encode information about the spatial relationships between features. The DIME file structure, for example, has specific components to indicate which blocks are bordered by street segments. Since these relationships are explicitly stored, certain kinds of operations may be executed rapidly, in comparison to a data structure where resource-intensive search is required to uncover implicit relationships like adjacency and containment. Such techniques are often developed and implemented within a specific operational environment (i.e., batch versus interactive, sequential versus direct access, mainframe versus minicomputer). Furthermore, particular encoding and retrieval techniques that take advantage of the data management and retrieval characteristics of a specific operating system may have been developed and thus, are dependent on a particular machine architecture. System design and retrieval techniques must be carefully examined whenever or wherever the transportability of a GIS to a different operating environment or operating system is a consideration.