CHAPTER 6: DATA MANAGEMENT
(GIS: A Management Perspective – Stan Aronoff) Pages 151 - 187

For an organization to function effectively, it requires accurate and timely information. Based on this need, it is easy to see why the business community was the first to adopt computer-based data storage and retrieval technology. In the 1960s, large engineering projects like the space program required an enormous amount of inventory, and a database system was used to manage this information. Another application of the 1960s was the Sabre airline reservation system developed by IBM and American Airlines. Since these early beginnings, the business community has invested heavily in database technology to gather and maintain information. As the information systems field developed during the 1960s and 1970s, the concepts of the database and the database management system were developed and refined. Today, database management systems handle enormous databases such as the national census.

A database is the information to be stored, whereas the database management system is the system used to manage the database. Specifically, a database is a collection of information about things and their relationships to each other. For example, a database might hold names, addresses and relationships (client, relative, friend). The objective in collecting and maintaining information in a database is to relate facts and situations that were previously separate.

There are historically two approaches to database management. The first is the file processing approach and the second is the more recent database management system approach. The file processing approach (Figure on page 152) requires that the data be stored as one or more computer files that are accessed by special-purpose software in whatever manner the designer believes to be most efficient. File processing is the most common approach to using a database.
Drawbacks of the file processing approach:
Since each application program must directly access each data file that it uses, the program must know how the data in each file are stored. This can create redundancy because the instructions to access a data file must be present in each application program. Another problem exists when data are shared by different application programs and by different users. If data files are accessed and modified by several programs and users, then there must be some overall control over which users are given access to the database and what modifications they are permitted to make. A lack of central control can seriously degrade the integrity of the data.

The second approach is the DBMS approach. A database management system (DBMS) consists of a set of programs that manipulate and maintain the data in a database. DBMSs were developed to manage the sharing of data in an orderly manner and to ensure that the integrity of the database is maintained. A DBMS acts as a central control over all interactions between the database and the application programs, which in turn interact with the user. When programs such as order entry services or geographic analysis functions require access to the database, the DBMS acts as the intermediary and supervisor.

One of the major benefits of a DBMS is that it provides data independence. The application program does not need to know how the data are physically stored because all access to the database is via the DBMS. The application program issues a command to the DBMS, which retrieves and "re-packages" the data into the format needed by the application. This greatly reduces the effort needed to maintain the application programs and the database. Many DBMSs incorporate a direct user interface. A DBMS is also used to tailor the style of information presented to different users. In Figure 6.3 (page 153), the same dataset is presented in two different ways depending on the needs of the users.
The two presentations are the account executive view and the inventory management view. This ability to present the data in different ways is very valuable and does not require storing multiple copies of the same database.

ADVANTAGES OF THE DATABASE APPROACH (over the file processing approach)
1. Centralized Control: A single DBMS under the control of one person can ensure that data quality standards and the integrity of the data are maintained.
2. Data Can Be Shared Efficiently: Using a DBMS, the information in a database can be shared in a flexible yet controlled manner. This also facilitates the development of new applications of the existing data.
3. Data Independence: Application programs are independent of the physical form in which the data are stored.
4. Easier Implementation of New Database Applications: New application programs and unique database searches can be more easily implemented using the services provided by a DBMS.
5. Direct User Access: Database systems now commonly provide a user interface so that non-programmers can perform sophisticated analyses.
6. Redundancy Can Be Controlled: In a file-processing environment, separate data files are used for each application and data are stored more than once. Excessive data redundancy is expensive. In addition, an effective strategy must be provided to update the multiple copies of the data. A DBMS can be used to monitor and reduce the level of redundancy, as well as manage the updating procedures.
7. User Views: A DBMS can provide a convenient user interface to create and maintain multiple 'views' of the data.

DISADVANTAGES OF THE DATABASE APPROACH
1. Cost: The database system software and any associated hardware can be expensive. At a minimum, they represent an additional acquisition and maintenance cost.
2. Added Complexity: A database system is more complex than a file processing system. In theory, the more complex the system, the more susceptible it is to failure and the more difficult the recovery. In practice, full-featured DBMSs are provided with effective backup and recovery systems.
3. Centralized Risk: In centralizing the location of the data and reducing data redundancy, there is a greater theoretical risk of loss or corruption of data while running an application program. However, the backup and recovery procedures normally provided in a DBMS minimize the risks.

The first GISs used a file-processing database and many still do. However, the trend is increasingly towards the use of a DBMS, if not to manage all the data in a GIS, at least to manage the non-spatial attribute components. Virtually all commercial GISs now incorporate some form of DBMS.

DBMS TERMINOLOGY
Record: a small group of related data items stored together. One row in the table.
Field: a record is divided into fields. A field defines one of the attributes in the record.
Key: a label comprised of one or more fields and used as the basis for a search.
Query: a search through the database.

THREE CLASSIC DATA MODELS
The conceptual organization of a database is termed the data model. It can be thought of as the style of describing and manipulating the data in a database. There are three classic data models that are used to organize electronic databases: the Hierarchical, the Network, and the Relational Models.

THE HIERARCHICAL DATA MODEL
In a hierarchical data model, the data are organized in a tree structure. See Figure 6.5 on page 156. The organization is encoded in the data records for each entity. There is one field that is designated as the key field and is used to organize the hierarchy. The top of the hierarchy is termed the root and is comprised of one entity. Except for the root, every element has one higher-level element related to it, called the parent, and may have one or more subordinate elements, termed children. In the hierarchical data model, every relation is a many-to-one relation or a one-to-one relation. For example, many Departments belong to one University, and there are many students in each department.
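The tree structure described above can be illustrated with a small sketch. This is only an assumption-laden illustration, not how any particular hierarchical DBMS is implemented: the `Node` class, its method names, and the department and student names are all hypothetical. It shows the two defining rules: every element except the root has exactly one parent, and a parent may have many children.

```python
class Node:
    """One entity in a hierarchy: a single parent (None for the root), many children."""

    def __init__(self, key, parent=None):
        self.key = key            # the key field used to organize the hierarchy
        self.parent = parent      # exactly one parent, or None for the root
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path_to_root(self):
        """Follow the parent links upward and return the keys that were visited."""
        node, path = self, []
        while node is not None:
            path.append(node.key)
            node = node.parent
        return path


# Hypothetical data: one University (root), many Departments, many Students.
university = Node("University")
geography = Node("Geography Dept", parent=university)
biology = Node("Biology Dept", parent=university)
alice = Node("Student: Alice", parent=geography)
bob = Node("Student: Bob", parent=geography)

# A direct parent-to-child link makes this retrieval a single step.
students_in_geography = [child.key for child in geography.children]
```

Because each node stores only one parent link, a second parent for the same entity cannot be expressed in this structure.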
Retrieval of all the students or all the professors in a specific department is a very efficient search because there is a direct link between student and department entities and between professor and department entities. However, to find all the courses offered by a specific department requires a two-stage search. First, the records for all the professors teaching in that department would be retrieved, and then the courses that each of those professors taught would be retrieved. This is a less efficient type of retrieval because an intermediate entity, the professors, must be retrieved. This type of retrieval can still be efficient if it does not involve too many intermediate levels.

In the hierarchical model an entity can have only one parent, so the Course entity is not permitted to have both the Department and Professor entities as parents. Another limitation of this model is that searches cannot be done on the attribute fields. In this example, you cannot retrieve all second-year students because the Year field is not a key.

Hierarchical systems are easy to understand and easy to update. They also provide high-speed access to large data sets. This is a good system for bibliographic databases and airline reservation systems, where the types of searches are very predictable and can be tightly specified. The major disadvantages of the hierarchical model are that the data relationships are difficult to modify and queries are restricted to traversing the hierarchy. Geographic information analysis searches are often exploratory and cannot be predicted in advance. Another disadvantage is that multiple parents are not allowed.

THE NETWORK DATA MODEL
In the network data model, an entity can have multiple parents as well as multiple children, and no root is required. The data records can be directly searched without traversing the entire hierarchy above that record. See Figure 6.6 on page 158. The Course entity can have two parents in the Department and Professor entities.
A search of all courses in a specified department can now be done more directly than in the hierarchical example. The Student-Course relation is a many-to-many relation. Each student can be enrolled in many courses and each course can have many students. However, because this model does not allow many-to-many relations directly, the relation is handled indirectly by using an intermediate relation or intermediate record. For example, the intersection records represent the registration of students in courses, that is, the Student-Course combinations. Each Student-Course combination is unique. One Course entity can have many Registration entities and one Student entity can have many Registration entities.

Network data models tend to have less redundant data storage than the corresponding hierarchical model. However, more extensive linkage information must be stored, adding to the size and complexity of the data files. When the data structure to be represented is in fact a simple hierarchy, there is no real difference in the expressive power of these two models. However, where a more complex real-world data structure must be represented, the network model can accommodate the added complexity. As with the hierarchical model, the relations among data elements are encoded in the database. This provides high-speed retrieval, but the data relationships are difficult to modify. The principal disadvantages of the network model are that it is more complex than the hierarchical model and not as flexible as the relational model.

THE RELATIONAL DATA MODEL
See Figure 6.7. In the relational data model there is no hierarchy of data fields within a record and every data field can be used as a key. The data are stored as a collection of values in the form of simple records or tuples (rows). The tuples are grouped together in two-dimensional tables, with each table usually stored as a separate file. The table as a whole represents the relationships among all the attributes it contains and is called a relation.
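The intermediate Registration record described above, and the idea of storing data as simple tuples grouped into tables, can be sketched as follows. This is a minimal illustration under stated assumptions: the student names, course codes, and function names are hypothetical, and plain Python tuples stand in for stored records.

```python
# Each table is a collection of tuples (rows). All values are hypothetical.
students = [(1, "Alice"), (2, "Bob")]                               # (student_id, name)
courses = [("GEOG101", "Intro to GIS"), ("BIOL200", "Ecology")]     # (course_id, title)

# The intersection (Registration) records: one tuple per Student-Course
# combination. A set is used so that each combination can occur only once.
registrations = {(1, "GEOG101"), (1, "BIOL200"), (2, "GEOG101")}


def courses_for_student(student_id):
    """One student can have many Registration records."""
    return sorted(course for student, course in registrations if student == student_id)


def students_in_course(course_id):
    """One course can have many Registration records."""
    return sorted(student for student, course in registrations if course == course_id)


# Registering the same combination twice does not create a duplicate record.
registrations.add((1, "GEOG101"))
```

The many-to-many Student-Course relation never appears directly; it is expressed as two one-to-many relations that meet in the Registration records.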
Using the relational model, a search can be made of any single table using any of the attribute fields, singly or together. For example, . . . . .

Searches of related attributes that are stored in different tables can be done by linking two or more tables using any attribute they share in common. This is a join operation. See Figure 6.8 on page 159. By including only the data fields required, redundant data storage is reduced. In fact, table 6 does not have to be stored at all; it can be created as a virtual table. As can be seen in table 6, there is a certain amount of redundancy in a relational table. The Course-ID, Course Department, and Course Name information is repeated. However, each row (tuple) is unique. There should never be two identical rows because there is no need to store the same fact twice.

Advantages of the relational model over the hierarchical and network models:
1. The relational model is more flexible than other models. The way the data values exist in the relational tables does not in any way restrict the kinds of processing that can be done. In the hierarchical and network models, manipulation of the data is restricted by the structure built into the data model.
2. The relational model has a sound base in mathematical theory. You can use the mathematics of relations as the basis for data processing procedures.
3. The organization of the relational model is simple to understand and, therefore, a good vehicle to communicate database ideas.
4. The same database can generally be represented with less redundancy using the relational model than the other two models.

Disadvantages of the relational model:
1. It is more difficult to implement.
2. It tends to have slower performance.
3. The absence of pointers (a pointer is a code that indicates a location in a file, such as the location where the attributes of a geographic feature are stored) requires that manipulation of the data be based on matching values in the relational tables.
This matching is a much more time-consuming operation and, as a result, a relational database system tends to be significantly slower than the corresponding hierarchical or network database system.

THE NATURE OF GEOGRAPHIC DATA
The map is the most familiar form for representing geographical data. A map consists of a group of points, lines, and areas that are positioned with reference to a common coordinate system. The map legend links the non-spatial attributes, such as place names, symbols, and colors, to the spatial data (the locations of the elements). The map itself serves both to store the data and to present the data to the user. In a computer-based GIS, the storage and presentation of geographic data are separate, and the same data may be viewed as many different types of maps. In addition to maps, the data may be presented in the form of tables or text descriptions.

In a computer-based GIS, geographic data are represented as points, lines, and areas, as with maps. However, for efficient computer implementation, these elements are organized somewhat differently than on a paper map. The information for a geographic feature has four major components: its geographic position, its attributes, its spatial relationships, and time.

Geographic Position (Where is it?)
Each feature has a location that must be specified in a unique way. For geographic data, locations are recorded in terms of a coordinate system such as latitude/longitude, UTM, or SPC (State Plane Coordinates). A GIS requires that a common coordinate system be used for all the datasets that will be used together.

Attributes (What is it?)
The second characteristic of geographic data is their non-spatial attributes. There is a level of inaccuracy inherent in non-spatial attribute data, as there is for spatial data. A commercial district may not be 100% commercial and a pine stand may not be 100% pine.
Often this type of inaccuracy is not addressed by GIS users, but for many types of analyses it is most important to recognize and take into account this imprecision.

Spatial Relationships (What are its relationships?)
The spatial relationships among geographic data are very numerous and often complex. For example, it is not only important to know the location of the fire and the fire hydrants, but also how close those fire hydrants are to the fire. This relationship is intuitive to the person reading the map but must be expressed in a computer-compatible manner. Because it is not possible to store information about all possible spatial relationships, only some of them are stored, and others are either calculated as needed or not available.

Time (When did this exist?)
Geographic information is referenced to a point in time or a period in time. Knowing when the data were collected can be important. The representation of time in a GIS is an added level of complexity that is difficult to handle.

Taken together, these four components make geographic data uniquely different from other types of data. As with other database systems, a data model is used to represent the information considered to be most relevant to the application at hand. If the model is appropriately designed, the GIS will mimic the behavior of the real world accurately enough to provide useful information.

SPATIAL DATA MODELS
There are two fundamental approaches to the representation of the spatial component of geographic information: the Vector Model and the Raster Model.

In the vector model, objects or conditions in the real world are represented by the points and lines that define their boundaries, much as they are when drawn on a map. Every position in the map space has a unique coordinate value. Points, lines, and polygons are used to represent irregularly distributed geographic objects or conditions in the real world. Examples of point, line, and polygon data.
The spatial entities in the vector model correspond more or less to the spatial entities that they represent in the real world.

In the raster model, the space is regularly subdivided into cells. The location of geographic objects or conditions is defined by the row and column position of the cells they occupy. The area that each cell represents defines the spatial resolution available. Because positions are defined by the cell row and cell column numbers, the position of geographic features is only recorded to the nearest cell. The value stored for each cell indicates the type of object or condition that is found at that location. The cell values report a condition at a location, and that condition pertains to the entire cell. Unlike the vector model, the units of the raster model do not correspond to the spatial entities they represent in the real world.

In both models, the spatial information is represented using homogeneous units. In the raster approach, the homogeneous units are cells. In the vector approach, the homogeneous units are the points, lines, and polygons. Each approach tends to work best in situations where the spatial information is to be treated in a manner that closely matches the data model. Where the geographic information of interest is the spatial variability of a phenomenon, the raster approach is generally better suited. The subtle color variations from point to point in a digital image are well represented by very large numbers of cells, each assigned a set of values to represent the red, green, and blue intensities at each cell position. Similarly, the shape of a surface, its topography, is well represented by a set of evenly spaced elevation measurements. Where the information of interest is the distribution of objects in space or the conditions that apply to an area feature (soil or forest stands on a thematic map), the vector approach tends to be better suited.
THE RASTER DATA MODEL
In its simplest form, the raster data model consists of a regular grid of square or rectangular cells. The location of each cell is defined by its row and column numbers. The value assigned to the cell indicates the value of the attribute it represents. A point is represented by a single cell (see Figure 6-10), a line by several cells with the same value forming a linear grouping, and an area by a clump of cells all having the same value. The raster data model is also easily interfaced to the hardware devices commonly used for the input and output of spatial data. For this reason, the first GISs were written in FORTRAN and were raster based.

Each cell in a raster file is assigned only one value, so different attributes are stored in separate files. Operations on multiple raster files involve the retrieval and processing of the data from corresponding cell positions in the different data files. Conceptually, the process is like stacking the files (see Figure 6-11) and using the vertical stack of cell values to analyze each cell location. For example, in order to find all the cells with a Pine forest cover and a Sandy soil type, each cell in the soil file and each corresponding cell in the forest file would be retrieved and evaluated. All those cells that were coded as Pine forest and also as Sandy soil would be identified and could be output to a new data file. This is overlay analysis.

The total number of values stored in a raster file equals the number of rows x the number of columns. The smaller the area of land that each cell represents, the higher the resolution of the data and the larger the file size.
With 250 m x 250 m cells: 1 km = 4 cells, so 1 km x 1 km = 16 cells.
With 100 m x 100 m cells: 1 km = 10 cells, so 1 km x 1 km = 100 cells.
Raster files are commonly very large.
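The Pine and Sandy overlay and the cell-count arithmetic above can be sketched in a few lines. This is an illustrative sketch only: the class codes, the 3 x 3 grids, and the function names are hypothetical, and each nested list stands in for one single-attribute raster file.

```python
# Hypothetical class codes; each raster "file" stores one attribute per cell.
PINE, OAK = 1, 2
SANDY, CLAY = 1, 2

forest = [
    [PINE, PINE, OAK],
    [PINE, OAK,  OAK],
    [PINE, PINE, PINE],
]
soil = [
    [SANDY, CLAY,  CLAY],
    [SANDY, SANDY, CLAY],
    [CLAY,  SANDY, SANDY],
]


def overlay(forest_grid, soil_grid, forest_class, soil_class):
    """Compare corresponding cells; output 1 where both conditions hold, else 0."""
    return [
        [1 if f == forest_class and s == soil_class else 0
         for f, s in zip(forest_row, soil_row)]
        for forest_row, soil_row in zip(forest_grid, soil_grid)
    ]


def cell_count(width_m, height_m, cell_size_m):
    """Number of stored values = number of rows x number of columns."""
    return (height_m // cell_size_m) * (width_m // cell_size_m)


pine_on_sand = overlay(forest, soil, PINE, SANDY)   # the new output layer
cells_at_250m = cell_count(1000, 1000, 250)         # 16 cells per square km
cells_at_100m = cell_count(1000, 1000, 100)         # 100 cells per square km
```

Halving the cell size quadruples the number of stored values, which is why fine-resolution rasters grow so quickly.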
Where there is considerable redundancy in a raster file, such as many neighboring cells holding the same value, a significant reduction in file size can be achieved by using various methods of data compression, such as run-length encoding and quadtrees.

RUN-LENGTH ENCODING
Data compression means that data are represented in a more compact form. If the data in a file are very different from cell to cell (e.g., digital terrain data or satellite imagery), then the large number of cells serves to capture the high spatial variability. If the number of values were reduced, some of the spatial information would be lost. However, in many cases the spatial variability is not high and the information can be represented with less redundancy and without loss of detail. This is useful for thematic data: because cells representing areas of the same class have the same value, the pattern of values tends to be spatially clumped. The quantity of data needed to capture a clumped pattern of spatial variability can be considerably reduced by using data structures that code these repeated values more compactly than the simple raster data structure.

In run-length encoding, adjacent cells along a row that have the same value are treated as a group termed a run. Instead of repeatedly storing the same value for each cell, the value is stored once, together with information about the size and location of the run. In standard run-length encoding, the value of the attribute, the number of cells in the run, and the row number are recorded. See Figure 6-12 B.

Another type of run-length data compression is value point encoding. Here cells are assigned a position number, starting in the upper left corner and proceeding from left to right and from top to bottom. The position number for the end of each run is stored in the point column. The value for each cell in the run is in the value column.
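Both encodings described above can be sketched briefly. This is a hedged illustration, not a reproduction of Figure 6-12: the sample grid and function names are hypothetical, position numbers are assumed to start at 1, and the value point sketch assumes a run may continue from the end of one row into the next.

```python
def run_length_encode(grid):
    """Standard run-length encoding: one (row, value, run_length) triple per run."""
    runs = []
    for row_number, row in enumerate(grid):
        start = 0
        for i in range(1, len(row) + 1):
            # Close the run at the end of the row or when the value changes.
            if i == len(row) or row[i] != row[start]:
                runs.append((row_number, row[start], i - start))
                start = i
    return runs


def value_point_encode(grid):
    """Value point encoding: one (value, end_position) pair per run.

    Cells are numbered from 1, left to right and top to bottom (an assumption).
    """
    cells = [value for row in grid for value in row]
    pairs = []
    for position, value in enumerate(cells, start=1):
        is_last = position == len(cells)
        # cells[position] is the next cell, because lists are indexed from 0.
        if is_last or cells[position] != value:
            pairs.append((value, position))
    return pairs


# Hypothetical 3 x 4 thematic raster with clumped classes.
grid = [
    ["A", "A", "B", "B"],
    ["B", "B", "B", "B"],
    ["A", "A", "A", "C"],
]
rle = run_length_encode(grid)
vpe = value_point_encode(grid)
```

For this grid, the 12 cell values reduce to 5 runs in the row-based form and 4 pairs in the value point form; a grid with a transition at every cell would gain nothing.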
Figure 6-12 shows examples of the full raster coding, run-length encoding, and value point encoding, which require 100 values, 54 values, and 32 values respectively. These forms of data compression become less efficient as the number of edges or transitions increases. The greatest degree of compression is achieved when there are only a few classes and they occur in large clumps.

QUADTREES
The quadtree data model provides a more compact raster representation by using a variable-sized grid cell. Instead of dividing an area into cells of one size, finer subdivisions are used in those areas with finer detail. In this way, a higher level of resolution is provided only where it is needed. Using the quadtree structure, a coarse resolution (large cells) is used to encode large homogeneous areas, while a finer resolution (small cells) is used for areas of high spatial variability.

Conceptually, the construction of a quadtree can be thought of as a process of regularly subdividing a map. If there is more than one class present, then the map is subdivided into four equal-sized quadrants. Then each quadrant is tested to determine if more than one class is present. Every quadrant that contains more than one class is again subdivided into four equal-sized quadrants, whereas homogeneous quadrants are not subdivided. In Figure 6-13, notice that more cells and smaller cells are created at feature boundaries. The dividing process is limited to a chosen maximum number of iterations and this, in effect, establishes the minimum cell size that can be represented.

Figure 6-14 provides a more detailed look at the quadtree structure. The physical structure of the computer file is also organized according to the numbering scheme. As a result, cells that are close together on the map are close together in the file. For operations that use data for a neighborhood, this storage organization provides efficient data retrieval.
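The subdivision process described above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions: the grid is square with a side length that is a power of two, subdivision stops at single cells rather than at a chosen iteration limit, and the quadrant order (NW, NE, SW, SE), the function names, and the sample data are all hypothetical.

```python
def build_quadtree(grid, row=0, col=0, size=None):
    """Return a class value for a homogeneous block, or a list of four subtrees.

    Subtrees are ordered NW, NE, SW, SE (an arbitrary choice for this sketch).
    Class values must not themselves be lists.
    """
    if size is None:
        size = len(grid)
    first = grid[row][col]
    homogeneous = all(
        grid[r][c] == first
        for r in range(row, row + size)
        for c in range(col, col + size)
    )
    if homogeneous:
        return first                      # one large cell covers the whole block
    half = size // 2
    return [
        build_quadtree(grid, row, col, half),
        build_quadtree(grid, row, col + half, half),
        build_quadtree(grid, row + half, col, half),
        build_quadtree(grid, row + half, col + half, half),
    ]


def count_leaves(tree):
    """Count the cells actually stored in the quadtree."""
    if isinstance(tree, list):
        return sum(count_leaves(subtree) for subtree in tree)
    return 1


# Hypothetical 4 x 4 map with two classes; only one quadrant is mixed.
grid = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 1, 2],
    [1, 1, 2, 2],
]
tree = build_quadtree(grid)
```

Only the quadrant containing the class boundary is subdivided, so the 16 original cells are stored as 7 quadtree cells.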
Quadtrees are particularly efficient for identifying the nearest neighbor of a selected point and for identifying the area (polygon) in which a point is located (point-in-polygon search). The major disadvantage of quadtrees is the time it takes to create and modify them. Quadtrees can provide more efficient storage of data, but only if the data are fairly homogeneous. The fewer the classes and the larger the clumps, the greater the degree of compression and the more efficient the quadtree structure.

THE VECTOR DATA MODEL
The vector data model provides for precise positioning in space. The approach used in the vector model is to precisely specify the position of the points, lines and polygons used to represent features of interest. The map area is assumed to be a continuous coordinate space where a position can be defined as precisely as desired. The location of features on the earth's surface is referenced to map positions using an X, Y coordinate system (Cartesian coordinate system).

Geographic features are commonly recorded on 2-D maps as points, lines and areas. The vector model uses a similar approach. A point feature is recorded as a single XY coordinate pair, a line as a series of XY coordinate pairs, and an area as a closed loop of XY coordinate pairs. See Figure 6-15.

The early systems were designed to meet the needs of automated mapping, where the principal objective was to store the positions of the points, lines and polygons, as well as the drawing instructions to plot them (color, pattern, etc.). These systems were later developed to provide for storage of geographic attributes and recognition of the graphic elements that represented a particular geographic feature. However, the data were stored as a more or less unorganized collection of elements. This is called the spaghetti model.

THE SPAGHETTI MODEL
See Figure 6-16. In this model, the paper map is translated line-for-line into a list of XY coordinates.
A point is encoded as a single XY coordinate pair, a line as a string of XY coordinate pairs, and a polygon as a closed loop of XY coordinates. The common boundary between adjacent polygons must be recorded twice, once for each polygon. The data have no inherent structure; they are stored only as a collection of coordinate strings. The spatial relationships between these features are not encoded either. For example, there is no information about which features are adjacent to each polygon. This information would have to be generated by searching all the features in the data file and calculating whether or not they were adjacent. The spaghetti model is very inefficient for most types of spatial analysis since any spatial relationships must be derived by computation. However, it is an efficient model for digitally reproducing maps because information extraneous to the plotting process, such as spatial relationships, is not stored.

THE TOPOLOGICAL MODEL
The topological model is the most widely used method of encoding spatial relationships in a GIS. Topology is the mathematical method used to define spatial relationships. See Figure 6-17. This particular form of topological model is called the Arc-Node data model. The basic logical entity is the arc, a series of points that starts and ends at a node. A node is an intersection point where two or more arcs meet. A polygon is comprised of a closed chain of arcs that represents the boundaries of an area.

Table 6-17. Note the tables: Polygon Topology, Node Topology, Arc Topology, and Arc Coordinate Data. In a GIS, polygons and points are often stored in one type of data layer and lines are stored in a separate data layer. The table does not exemplify this. The Polygon Topology Table shows the arcs that make up the boundaries of each polygon. Polygons can have islands within them; Polygon C is an island in Polygon B. This is indicated in the arc list for Polygon B by a zero preceding the list of arcs that make up the island.
The point in Polygon B is also treated as a polygon, Polygon D, which is comprised of a single arc, a6. A point can be considered a polygon with no area. In order to complete the spatial definitions, there must be a way to refer to the area that is outside the map boundary. This outside area is designated as Polygon E, for which the arcs are not explicitly defined.

In the Node Topology Table, each node is defined by the arcs to which it belongs. N1 is an endpoint for arcs a1, a3, and a4. The Arc Topology Table defines the relationship of the nodes and polygons to the arcs. The end points are distinguished by designating one node as the start or "from" node and one as the end or "to" node. The left and right polygons are also designated.

From the topology tables alone, analyses of the relative position of the map elements can be done. For example, all polygons adjacent to Polygon B can be found by searching the Arc Topology Table. Every polygon paired with B in this table is adjacent to it. The topology tables can be used to find all features contained within a polygon by searching the Polygon Topology Table for arc lists that contain a zero. Polygon B is seen to have two contained features, one defined by arc a6 and the other by arc a7. Spatial queries of this type can be processed much more quickly using the topology tables than by calculation from the coordinate data, as required in the spaghetti model.

To relate map features to real-world positions, the XY coordinates are needed. These are stored in the Arc Coordinate Data Table. Each arc is represented by one or more straight-line segments defined by a series of coordinates. Attribute data are commonly stored in the form of relational tables in which one data field contains an identification code for the spatial entity. A topologically structured data model is well-suited to such spatial operations as contiguity and connectivity analyses.
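The adjacency search described above can be sketched with a small Arc Topology Table. The table below is hypothetical and much simpler than the one in the text: it assumes a square map split into two polygons, A and B, by a single shared arc, with E as the outside area. The dictionary layout and function names are illustrative assumptions, not the storage format of any particular GIS.

```python
# Hypothetical Arc Topology Table:
# arc id -> (from_node, to_node, left_polygon, right_polygon)
arc_topology = {
    "a1": ("N1", "N2", "A", "E"),   # outer boundary of A
    "a2": ("N1", "N2", "B", "A"),   # boundary shared by A and B
    "a3": ("N1", "N2", "E", "B"),   # outer boundary of B
}


def adjacent_polygons(polygon):
    """Every polygon paired with the given polygon on some arc is adjacent to it."""
    neighbors = set()
    for _from_node, _to_node, left, right in arc_topology.values():
        if left == polygon:
            neighbors.add(right)
        elif right == polygon:
            neighbors.add(left)
    return neighbors


def arcs_at_node(node):
    """Derive one Node Topology entry: the arcs that start or end at a node."""
    return sorted(
        arc for arc, (from_node, to_node, _left, _right) in arc_topology.items()
        if node in (from_node, to_node)
    )
```

Neither function touches a coordinate: the answers come entirely from the stored left and right polygon codes and the from and to nodes.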
Contiguity is the spatial relationship of adjacency. That is, elements that touch each other are adjacent. A biologist might be interested in the habitats that occur next to each other, whereas a city planner might be interested in zoning conflicts, such as industrial zones that border recreational areas. Connectivity refers to interconnected pathways or networks that transport something. The streets of a city, the cables of a telephone system, and the streams and rivers in a landscape are examples of transportation networks. Connectivity functions are used to find optimum routings through a network.

One of the advantages of the topological structure over the spaghetti model is that spatial analyses can be done without using the coordinate data. This avoids the time-consuming calculations needed to derive spatial relationships from the geographic coordinates. When spatial data are stored using a non-topological model, extensive calculations are needed to derive the topological information. Creating the topological structure does impose a cost, however. When a new map is entered or an existing map is changed, the topology must be updated. Systems that do not have a topological structure can use a simpler internal data structure but require more complex algorithms to analyze spatial relationships. Virtually all full-featured, vector-based GISs now use a topological data model.

Table 6-1 (p. 166) COMPARISON OF RASTER AND VECTOR DATA MODELS

Advantages of the Raster Data Model
1. It is a simple data structure.
2. Overlay operations are easily and efficiently implemented.
3. High spatial variability is efficiently represented in a raster format.
4. The raster format is more or less required for efficient manipulation and enhancement of digital images.

Disadvantages of the Raster Data Model
1. The raster data structure is less compact. Data compression techniques can often overcome this problem.
2. Topological relationships are more difficult to represent.
3. The output of graphics is less aesthetically pleasing because boundaries tend to have a blocky appearance rather than the smooth lines of hand-drawn maps. This can be overcome by using a very large number of cells, but may result in unacceptably large files.

Advantages of the Vector Model
1. It provides a more compact data structure than the raster model.
2. It provides efficient encoding of topology and, as a result, more efficient implementation of operations that require topological information, such as network analysis.
3. The vector model is better suited to supporting graphics that closely approximate hand-drawn maps.

Disadvantages of the Vector Model
1. It is a more complex data structure than a simple raster format.
2. Overlay operations are more difficult to implement.
3. The representation of high spatial variability is inefficient.
4. Manipulation and enhancement of digital images cannot be effectively done in the vector domain.

DATABASES FOR GIS: MANAGING SPATIAL AND ATTRIBUTE DATA TOGETHER
The development of database systems to handle spatial information and the associated attribute data has happened within the last few years. Today, most spatial data are still being stored in the form of paper maps, imagery, tables or text descriptions and are analyzed using manual techniques. However, there is a rapid increase in the amount of geographic information that is now being collected and stored in digital form suitable for computer-based retrieval and analysis.

Organizing Geographic Information Within a DBMS
Digital mapping was historically viewed as a set of tools for automating the drafting function. The digital mapping function was not viewed as an integral part of the data management system of the organization. Early GIS and automated cartography systems used data files directly without using a database management system. This file processing approach to data management is still used in many GISs today.
Since automated cartography systems and GISs perform similar manipulations, it is often expected that they should provide comparable levels of performance. However, the additional complexities of handling the data components other than the geometric ones and of providing a multi-user environment make this expectation difficult to satisfy. In a GIS, geographic information is not approached as a drafting task but as a database application.

As noted earlier, the advantages of organizing data using a DBMS are that it minimizes redundancy of data storage; provides central control of data access, manipulation, integrity, and security of the database; and makes application programs independent of the form in which the data are stored. There are at least four ways that DBMS concepts are applied in the context of a GIS:
1. The uses of the data are independent of the way the data are stored. For example, instead of storing maps, the data that describe the geographic elements (attributes) are stored with minimum redundancy, and then maps or other types of output are generated as needed in the form best suited for a specific analysis.
2. Automated updating of interrelated data files is provided. The transaction that registers a change (for example, the sale of land) can be used to update all of the data files that are affected. All users immediately have access to updated data.
3. The relationships among all the spatial and attribute information are explicitly defined. Topology is used to relate all the spatial elements to each other.
4. The central control of the DBMS provides better control of the integrity of the database.

Limitations of General Purpose DBMSs for GIS Applications

The data model most widely accepted for handling non-spatial attribute data in GIS applications is the relational model. The organization of attribute data into a series of tables that can be used individually or together is simple to understand and provides efficient data storage.
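As a rough illustration of attribute tables that are used individually or together, and of a single transaction updating every affected table, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a relational DBMS. The table names (parcel, sale_log), the column names, the parcel codes, and the functions build_demo_db, register_sale, and sale_report are all hypothetical and chosen only for this example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical attribute tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- parcel_id is the identification code linking attributes to a map feature
        CREATE TABLE parcel (
            parcel_id TEXT PRIMARY KEY,
            land_use  TEXT NOT NULL,
            owner     TEXT NOT NULL
        );
        CREATE TABLE sale_log (
            parcel_id TEXT NOT NULL REFERENCES parcel(parcel_id),
            buyer     TEXT NOT NULL,
            sale_date TEXT NOT NULL
        );
        INSERT INTO parcel VALUES ('P-101', 'residential', 'Chen');
        INSERT INTO parcel VALUES ('P-102', 'industrial',  'Okafor');
        """
    )
    return conn


def register_sale(conn, parcel_id, buyer, sale_date):
    """Record a sale as one transaction that touches both tables."""
    with conn:  # commits if both statements succeed, rolls back on an error
        conn.execute(
            "UPDATE parcel SET owner = ? WHERE parcel_id = ?", (buyer, parcel_id)
        )
        conn.execute(
            "INSERT INTO sale_log (parcel_id, buyer, sale_date) VALUES (?, ?, ?)",
            (parcel_id, buyer, sale_date),
        )


def sale_report(conn):
    """Use the two tables together by joining on the shared parcel code."""
    return conn.execute(
        """
        SELECT p.parcel_id, p.land_use, p.owner, s.sale_date
        FROM parcel AS p
        JOIN sale_log AS s ON s.parcel_id = p.parcel_id
        ORDER BY p.parcel_id
        """
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    register_sale(db, "P-102", "Rivera", "1990-05-01")
    print(sale_report(db))
```

The sketch covers only non-spatial attributes. The spatial side of the same change, such as redrawing a divided parcel, is not shown.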
Relational databases also provide query languages that are easily learned by GIS users and provide very flexible analysis capabilities. For these reasons, the relational model is well suited to storing the non-spatial attribute data, and it is well adapted to the unpredictable nature of geographic data. However, using a relational DBMS to store the spatial data, and adapting the DBMS query language to provide spatial analysis functions, has been a much more difficult problem.

Using a relational database to manage tabular data involves functions such as adding, deleting, or changing a record, for example when property ownership changes. Spatial information is more complex, and the transactions performed on it are more complicated. Modifying spatial information usually involves the simultaneous updating of large numbers of records in multiple files. See Figure 6-21, page 182 (Dividing a parcel of land).

The functions needed to handle geographic data are not done well in the tabular database environment of a standard DBMS. There are many reasons why; here are five.
1. The spatial data records used in a GIS are variable-length records, which are needed to store variable numbers of coordinate points, whereas general purpose database systems are designed to handle fixed-length records.
2. Manipulation of geographic data involves spatial concepts such as proximity, connectedness, containment, and overlay. These functions are not easily accommodated by general purpose database query languages.
3. A GIS requires sophisticated graphics that are not normally provided by a DBMS.
4. Geographic information is complex. The representation of a single geographic feature requires multiple records, possibly in multiple files. For example, the feature may involve feature coordinates, topology, measurements (area, etc.), links to the non-spatial data, and the attributes themselves.
5. The highly interrelated nature of GIS data records requires a more sophisticated security system than the record locking approach taken by most general purpose DBMSs. To ensure the integrity of the geographic database, the security system must protect the integrity of the multiple files in which the spatial data are stored. A change in one record can create multiple errors in multiple files.

PRACTICAL APPROACHES USED TO IMPLEMENT A GIS

Various practical approaches have been taken to provide data management services for a GIS. They are grouped into four (somewhat overlapping) strategies. See Figure 6-22.

1. Develop a proprietary system providing the individual data management services required by the different application modules. This is the file processing approach. The Map Analysis Package and ERDAS are examples of GISs that use this approach. Each data set is stored as a separate file. The user can invoke separate analysis functions to manipulate one or more of the data files. Results are produced in the form of new data files that can be output using various routines.

2. Develop a hybrid system using a commercially available DBMS (usually a relational one) for storage of the non-spatial attributes. Develop separate software to manage the storage and analysis of the spatial data, using the services of the relational database to access the attribute data. ESRI's ARC/INFO is an example of the hybrid approach. The non-spatial attribute data are stored in the INFO (or Oracle, dBase, or INGRES) database management system, which provides both storage and manipulation functions for the non-spatial attribute data. The ARC system provides for the storage and manipulation of the spatial data. Spatial analysis functions are provided using the toolbox approach. In ARC, individual modules are provided for such functions as data entry, editing, network analysis, etc.

3. Use an existing DBMS, usually a relational one, as the core of the GIS and then develop extensions to the system where needed. Although the spatial and attribute data may be managed by the DBMS, a significant amount of software is generally added to the DBMS to provide the spatial functions and graphics display used in geographic analysis. System 9, developed by Wild Heerbrugg and Prime Computers, is an example of a GIS built around an existing DBMS. The spatial coordinate data, the topological data, and the non-spatial attribute data are all stored using the relational database in a UNIX environment. The relational DBMS supports variable-length fields, and the developers have added software to handle the spatial references (such as the points and lines that comprise a geographic feature). They have also added software that allows spatial queries such as overlay, connectivity, and neighborhood operations.

4. Start from scratch and develop a spatial database capable of handling the spatial and non-spatial data in an integrated fashion. The MAPS system developed at Carnegie-Mellon University has taken a different approach than most GISs. Instead of trying to develop one or perhaps two representations of the spatial data that are general enough for all types of spatial data, the MAPS system supports several different representations of data that are all transparent to the user. Data in various raster and vector forms can be used, and queries are optimized to use the representation best suited to execute a particular function. Five databases are maintained within the MAPS system:
1. A digital terrain database;
2. A map database;
3. A landmark database;
4. An image database; and
5. The CONCEPTMAP database.
Together they provide for the storage and integrated analysis of such diverse datasets as digitized aerial photography, digital satellite imagery, digital terrain data, topologically encoded map data, symbols, and non-spatial attributes.
CONCEPTMAP contains the information about spatial entities and manipulation procedures that are used to interact directly with a user or with the applications programs. In this way, MAPS provides a single uniform access to all spatial data, independent of their internal structure.

CONCLUSION

A DBMS provides a number of functions critical to the effective operation of a GIS. However, the difficulties of adapting existing DBMSs to handle spatial data have required that hybrid or modified DBMSs be developed. There is a strong demand for systems that can integrate diverse raster and vector format data sets without requiring the operator to first convert the data to a common format.
