CHAPTER 6: DATA MANAGEMENT
(GIS: A Management Perspective – Stan Aronoff)
Pages 151 - 187
For an organization to function effectively, it requires accurate and timely information.
Based on this need, it is easy to see why the business community first adopted computer-based data storage and
retrieval technology. In the 1960s large engineering projects like the space program required an enormous amount of
inventory and a database system was used to manage this information.
Another application of the 1960s was the Sabre airline reservation system developed by IBM and American Airlines.
Since these early beginnings, the business community has invested heavily in database technology to gather and
maintain information.
As the information systems field developed during the 1960s and 1970s, the concepts of the database and the database
management system were developed and refined. Today, database management systems handle enormous databases
such as the national census.
A database is the information to be stored whereas the database management system is the system used to manage the
database.
Specifically, a database is a collection of information about things and their relationships to each other. For example, a
database of names, addresses and relationships (client, relative, friend).
The objective in collecting and maintaining information in a database is to relate facts and situations that were previously
separate.
There are historically two approaches to database management. The first is the file processing approach and the second
is the more recent database management system approach.
The file processing approach (Figure on page 152) required that the data be stored as one or more computer files that
were accessed by special-purpose database software in whatever manner the designer believed to be most efficient.
File processing is the most common approach to using a database.
Drawbacks of the file processing approach:
Since each application program must directly access each data file that it uses, the program must know how the data in
each file are stored. This can create redundancy because the instructions to access a data file must be present in each
application program.
Another problem exists when data are shared by different application programs and by different users. If data files are
accessed and modified by several programs and users, then there must be some overall control over which users are
given access to the database and what modifications they are permitted to make. A lack of central control can seriously
degrade the integrity of the data.
A database management system (DBMS) is comprised of a set of programs that manipulate and maintain the data in a
database. This is the second approach: The DBMS Approach.
DBMSs were developed to manage the sharing of data in an orderly manner and to ensure that the integrity of the
database is maintained.
A DBMS acts as a central control over all interactions between the database and the application programs, which in turn
interact with the user.
When programs, such as order entry services or geographic analysis functions, require access to the database, the
DBMS acts as the intermediary and supervisor.
One of the major benefits of a DBMS is that it provides data independence. The application program does not need to
know how the data are physically stored because all access to the database is via the DBMS.
The application program issues a command to the DBMS that retrieves and "re-packages" the data into the format
needed by the application. This greatly reduces the effort needed to maintain the application programs and the database.
Many DBMS incorporate a direct user interface.
A DBMS is also used to tailor the style of information presented to the different users.
In Figure 6.3 page 153 the same dataset is presented in two different ways depending on the needs of the users.
Account executive view versus the inventory management view
This ability to present the data in different ways is a very valuable function and does not store multiple copies of the same
database.
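The idea of several user views over one stored dataset can be sketched with Python's built-in sqlite3 module. The table, columns, and rows below are invented for illustration and are not taken from Figure 6.3; SQLite simply stands in for a DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, client TEXT, "
    "part TEXT, quantity INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Acme", "valve", 10, 2.5),
        (2, "Acme", "pump", 1, 120.0),
        (3, "Borealis", "valve", 4, 2.5),
    ],
)

# Account executive view: the value of what each client has ordered.
conn.execute(
    "CREATE VIEW account_view AS "
    "SELECT client, SUM(quantity * unit_price) AS total_value "
    "FROM orders GROUP BY client"
)
# Inventory management view: how many units of each part are on order.
conn.execute(
    "CREATE VIEW inventory_view AS "
    "SELECT part, SUM(quantity) AS units_ordered FROM orders GROUP BY part"
)

account_rows = conn.execute("SELECT * FROM account_view ORDER BY client").fetchall()
inventory_rows = conn.execute("SELECT * FROM inventory_view ORDER BY part").fetchall()
print(account_rows)
print(inventory_rows)
```

Both views are computed from the single orders table when they are queried, so no second copy of the data is stored.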
ADVANTAGES OF THE DATABASE APPROACH (over the file processing approach)
1. Centralized Control: A single DBMS under the control of one person can ensure that data quality standards and
the integrity of the data are maintained.
2. Data Can Be Shared Efficiently: Using a DBMS, the information in a database can be shared in a flexible yet
controlled manner. A DBMS also facilitates the development of new applications of the existing data.
3. Data Independence: Application programs are independent of the physical form in which the data are stored.
4. Easier Implementation of New Database Applications: New application programs and unique database
searches can be more easily implemented using the services provided by a DBMS.
5. Direct User Access: Database systems now commonly provide a user interface so that non-programmers can
perform sophisticated analyses.
6. Redundancy Can Be Controlled: In a file-processing environment, separate data files are used for each
application and data are stored more than once. Excessive data redundancy is expensive. In addition, an
effective strategy must be provided to update the multiple copies of the data. A DBMS can be used to monitor
and reduce the level of redundancy, as well as manage the updating procedures.
7. User Views: A DBMS can provide a convenient user interface to create and maintain multiple 'views' of the data.
DISADVANTAGES OF THE DATABASE APPROACH
1. Cost: The database system software and any associated hardware can be expensive. At a minimum, they
represent an additional acquisition and maintenance cost.
2. Added Complexity: A database system is more complex than a file processing system. In theory, the more
complex the system, the more susceptible it is to failure and the more difficult the recovery. In practice,
full-featured DBMS are provided with effective backup and recovery systems.
3. Centralized Risk: In centralizing the location of the data and reducing data redundancy, there is a greater
theoretical risk of loss or corruption of data while running an application program. However, the backup and
recovery procedures normally provided in a DBMS minimize the risks.
The first GIS used a file-processing database and many still do. However, the trend is increasingly towards the use of a
DBMS, if not to manage all the data in a GIS at least to manage the non-spatial attribute components.
Virtually all commercial GIS now incorporate some form of DBMS.
DBMS TERMINOLOGY
Record: a small group of related data items stored together. One row in the table.
Field: a record is divided into fields. A field defines the attributes in the record.
Key: a label comprised of one or more fields and used as a search foundation.
Query: a search through the database.
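These four terms can be illustrated with a small in-memory table. The Parcel record, its fields, and the sample rows are hypothetical and serve only to show how the terms relate to one another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parcel:
    """One record (row); each attribute below is a field."""
    parcel_id: int   # key field used to look a record up directly
    owner: str
    land_use: str


table = [
    Parcel(101, "Lee", "residential"),
    Parcel(102, "Okoro", "commercial"),
    Parcel(103, "Silva", "residential"),
]

# Key: an index built on the parcel_id field.
index = {record.parcel_id: record for record in table}


def query(records, **criteria):
    """Query: return the records whose fields match every given value."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


print(index[102])
print(query(table, land_use="residential"))
```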
THREE CLASSIC DATA MODELS
The conceptual organization of a database is termed the data model. It can be thought of as the style of describing and
manipulating the data in a database.
There are three classic data models that are used to organize electronic data bases: The Hierarchical, the Network, and
the Relational Models.
THE HIERARCHICAL DATA MODEL
In a hierarchical data model, the data are organized in a tree structure. See Figure 6.5 on page 156. The organization is
encoded in the data records for each entity.
There is one field that is designated as the key field and is used to organize the hierarchy.
The top of the hierarchy is termed the root and is comprised of one entity. Except for the root, every element has one
higher level element related to it, called the parent, and one or more subordinate elements, termed children.
In the hierarchical data model, every relation is a many-to-one relation or a one-to-one relation.
The many Departments belong to one University; there are many students in each department.
Retrieval of all the students or all the professors in a specific department is a very efficient search because there is a
direct link between student and department entities and between professor and department entities.
However, to find all the courses offered by a specific department requires a two stage search. First, the records for all the
professors teaching in that department would be retrieved and then the courses that each of those professors taught
would be retrieved.
This is a less efficient type of retrieval because an intermediate entity, the professors, must be retrieved. This type of
retrieval can still be efficient if it does not involve too many intermediate levels.
In the hierarchical model an entity can have only one parent, so the Course entity is not permitted to have both the
Department and Professor entities as parents.
Another limitation of this model is that searches cannot be done on the attribute fields. In this example, you cannot
retrieve all second year students because the Year field is not a key.
Hierarchical systems are easy to understand and easy to update. They also provide high speed access to large data
sets.
This is a good system for bibliographic databases and airline reservation systems when the types of searches are very
predictable and can be tightly specified.
The major disadvantages of the hierarchical model are that the data relationships are difficult to modify and queries are
restricted to traversing the hierarchy. Geographic information analysis searches are often exploratory and cannot be
predicted in advance. Another disadvantage is that multiple parents are not allowed.
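The university example can be sketched as a nested structure to show why some searches are direct and others are not. The names and values are invented, and nested dictionaries are only a stand-in for the pointer structure of a real hierarchical DBMS.

```python
# University (root) -> departments -> professors -> courses, and
# departments -> students. Every entity has exactly one parent.
university = {
    "Geography": {
        "professors": {
            "Adams": {"courses": ["GIS I", "Cartography"]},
            "Baker": {"courses": ["Remote Sensing"]},
        },
        "students": {"Chen": {"year": 2}, "Diaz": {"year": 3}},
    },
    "Biology": {
        "professors": {"Evans": {"courses": ["Ecology"]}},
        "students": {"Ford": {"year": 2}},
    },
}


def students_in(department):
    """Direct search: students are children of the department."""
    return sorted(university[department]["students"])


def courses_in(department):
    """Two-stage search: first the professors, then each professor's courses."""
    courses = []
    for professor in university[department]["professors"].values():
        courses.extend(professor["courses"])
    return sorted(courses)


def students_in_year(year):
    """Year is not a key, so there is no path to follow. This sketch can
    only emulate the search by visiting every student record in the tree."""
    return sorted(
        name
        for department in university.values()
        for name, record in department["students"].items()
        if record["year"] == year
    )


print(students_in("Geography"))
print(courses_in("Geography"))
print(students_in_year(2))
```

The last function is the kind of query that the hierarchical model does not support through its own structure; it works here only because the whole tree is scanned.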
THE NETWORK DATA MODEL
In the network data model, an entity can have multiple parents as well as multiple children and no root is required. The
data records can be directly searched without traversing the entire hierarchy above that record. Figure 6.6 page 158.
The Course entity can have two parents in the Department and Professor entities.
A search of all courses in a specified department can now be done more directly than in the hierarchical example.
The Student-Course relation is a many-to-many relation. Each student can be enrolled in many courses and each course
can have many students.
However, this model does not allow many-to-many relations directly, so this relation is handled indirectly by using an
intermediate relation or intermediate record.
For example, the intersection records represent the registration of students in courses or the Student-Course
combinations. Each Student-Course combination is unique. One Course entity can have many Registration entities and
one Student entity can have many Registration entities.
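A minimal sketch of the intermediate record follows. The identifiers and names are hypothetical; the point is only that each Registration record links one student to one course, which turns the many-to-many relation into two one-to-many relations.

```python
students = {"S1": "Chen", "S2": "Diaz"}
courses = {"C1": "GIS I", "C2": "Cartography"}

# Registration (intersection) records: one per Student-Course combination.
registrations = [("S1", "C1"), ("S1", "C2"), ("S2", "C1")]

# Each Student-Course combination appears only once.
assert len(set(registrations)) == len(registrations)


def courses_for(student_id):
    """Follow the student's registration records to the courses."""
    return sorted(courses[c] for s, c in registrations if s == student_id)


def students_in(course_id):
    """Follow the course's registration records to the students."""
    return sorted(students[s] for s, c in registrations if c == course_id)


print(courses_for("S1"))
print(students_in("C1"))
```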
Network data models tend to have less redundant data storage than the corresponding hierarchical model. However,
more extensive linkage information must be stored, adding to the size and complexity of the data files.
When the data structure to be represented is in fact a simple hierarchy, there is no real difference in the expressive
power of these two models. However, where a more complex real-world data structure must be represented, the network
model can accommodate the added complexity.
As with the hierarchical model, the relations among data elements are encoded in the database. This provides high
speed retrieval, but the data relationships are difficult to modify. The principal disadvantages of the network model are
that it is more complex than the hierarchical model and not as flexible as the relational model.
THE RELATIONAL DATA MODEL
See Figure 6.7. In the relational data model there is no hierarchy of data fields within a record and every data field can be
used as a key.
The data are stored as a collection of values in the form of simple records or tuples (rows). The tuples are grouped
together in two-dimensional tables with each table usually stored as a separate file.
The table as a whole represents the relationships among all the attributes it contains and is called a relation.
Using the relational model, a search can be made of any single table using any of the attribute fields, singly, or together.
For example, . . . . .
Searches of related attributes that are stored in different tables can be done by linking two or more tables using any
attribute they share in common. This is a join operation. See figure 6.8 page 159.
By including only the data fields required, redundant data storage is reduced. In fact, table 6 does not have to be stored
at all; it can be created as a virtual table.
As can be seen in table 6, there is a certain amount of redundancy in a relational table. The Course-ID, Course
Department, and Course Name information is repeated. However, each row (tuple) is unique. There should never be two
identical rows because there is no need to store the same fact twice.
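The join and the virtual table can be sketched with Python's built-in sqlite3 module. The two tables and their rows are invented and do not reproduce the tables of Figure 6.8; a SQL view plays the role of the virtual table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE course (
        course_id TEXT PRIMARY KEY, department TEXT, name TEXT);
    CREATE TABLE registration (
        student TEXT, course_id TEXT, PRIMARY KEY (student, course_id));

    INSERT INTO course VALUES
        ('G101', 'Geography', 'GIS I'),
        ('B200', 'Biology', 'Ecology');
    INSERT INTO registration VALUES
        ('Chen', 'G101'), ('Diaz', 'G101'), ('Chen', 'B200');

    -- The joined table is defined, not stored: it is rebuilt from the
    -- two base tables whenever it is queried.
    CREATE VIEW class_list AS
        SELECT r.student, c.course_id, c.department, c.name
        FROM registration AS r
        JOIN course AS c ON r.course_id = c.course_id;
    """
)

rows = conn.execute(
    "SELECT * FROM class_list ORDER BY course_id, student"
).fetchall()
for row in rows:
    print(row)
```

The course details repeat in the joined result for every registered student, yet every row is distinct, and each fact is stored only once in the base tables.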
Advantages of the relational model over the hierarchical and network models.
1. The relational model is more flexible than other models. The way the data values exist in the relational tables
does not in any way restrict the kinds of processing that can be done. In the hierarchical and network models,
manipulation of the data is restricted by the structure built into the data model.
2. The relational model has a sound theoretical base in mathematical theory. You can use the mathematics of
relations as the basis for data processing procedures.
3. The organization of the relational model is simple to understand and, therefore, a good vehicle to communicate
database ideas.
4. The same database can generally be represented with less redundancy using the relational model than the
other two models.
Disadvantages of the relational model.
1. It is more difficult to implement.
2. It tends to have slower performance. The absence of pointers (a code that indicates a location in a file, such as
the location in a file where the attributes of a geographic feature are stored) requires that manipulation of the
data be based on matching values in the relational tables.
3. Matching values is a much more time-consuming operation and, as a result, a relational database system tends to be
significantly slower than the corresponding hierarchical or network database system.
THE NATURE OF GEOGRAPHIC DATA
The map is the most familiar form for representing geographical data. A map consists of a group of points, lines, and
areas that are positioned with reference to a common coordinate system.
The map legend links the non-spatial attributes, such as place names, symbols, and colors to the spatial data (the
locations of the elements).
The map itself serves both to store the data and to present the data to the user. In a computer-based GIS, the storage
and presentation of geographic data are separate, and the same data may be viewed as many different types of maps.
In addition to maps, the data may be presented in the form of tables or text descriptions.
In a computer-based GIS, geographic data are represented as points, lines, and areas as with maps. However, for
efficient computer implementation, these elements are organized somewhat differently than the organization of a paper
map.
The information for a geographic feature has four major components: Its geographic position, its attributes, its spatial
relationships, and time.
Geographic Position (where is it?)
Each feature has a location that must be specified in a unique way. For geographic data, locations are recorded in terms
of a coordinate system such as latitude/longitude, UTM, or State Plane Coordinates (SPC).
A GIS requires that a common coordinate system be used for all the datasets that will be used together.
Attributes (What is it?)
The second characteristic of geographic data is their non-spatial attributes.
There is a level of inaccuracy inherent in non-spatial attribute data as there is for spatial data.
A commercial district may not be 100% commercial and a pine stand may not be 100% pine.
Often this type of inaccuracy is not addressed by GIS users, but for many types of analyses it is most important to
recognize and take into account this imprecision.
Spatial Relationships (What are its relationships?)
The spatial relationships among geographic data are very numerous and often complex.
For example, it is not only important to know the location of the fire and the fire hydrants, but also how close those fire
hydrants are to the fire.
This relationship is intuitive to the person reading the map but must be expressed in a computer-compatible manner.
Because it is not possible to store information about all possible spatial relationships, only some of them are stored and
others are either calculated as needed or not available.
Time (When did this exist?)
Geographic information is referenced to a point in time or a period in time. Knowing when the data was collected can be
important.
The representation of time in a GIS is an added level of complexity that is difficult to handle.
Taken together, these four components make geographic data different from other types of data.
As with other database systems, a data model is used to represent the information considered to be most relevant to the
application at hand. If the model is appropriately designed, the GIS will mimic the behavior of the real world accurately
enough to provide useful information.
SPATIAL DATA MODELS
There are two fundamental approaches to the representation of the spatial component of geographic information: The
Vector Model and the Raster Model.
In the vector model, objects or conditions in the real world are represented by the points and lines that define their
boundaries, much like they are when they are drawn on a map.
Every position in the map space has a unique coordinate value. Points, lines, and polygons are used to represent
irregularly distributed geographic objects or conditions in the real world.
Examples of point, line, and polygon data.
The spatial entities in the vector model correspond more or less to the spatial entities that they represent in the real
world.
In the raster model, the space is regularly subdivided into cells. The location of geographic objects or conditions is
defined by the row and column position of the cells they occupy.
The area that each cell represents defines the spatial resolution available.
Because positions are defined by the cell row and cell column numbers, the position of geographic features is only
recorded to the nearest cell.
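A short sketch shows how a map position is reduced to a cell position. The function name, the origin at the upper left corner, and the 100 m cell size are assumptions made for illustration.

```python
import math


def cell_for(x, y, x_min, y_max, cell_size):
    """Return the (row, column) of the cell containing map position (x, y).

    Row 0 is the top row and column 0 is the left column (assumed layout).
    """
    column = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)
    return row, column


# Two different positions inside the same 100 m cell get the same address,
# so the difference between them is lost in the raster.
first = cell_for(130.0, 870.0, x_min=0.0, y_max=1000.0, cell_size=100.0)
second = cell_for(199.9, 801.0, x_min=0.0, y_max=1000.0, cell_size=100.0)
print(first, second)
```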
The value stored for each cell indicates the type of object or condition that is found at that location. The cell values report
a condition at a location and that condition pertains to the entire cell.
The units of the raster model do not correspond to the spatial entities they represent in the real world, unlike the vector
model.
In both models, the spatial information is represented using homogeneous units. In the raster approach, the
homogeneous units are cells. In the vector approach, the homogeneous units are the points, lines, and polygons.
Each approach tends to work best in situations where the spatial information is to be treated in a manner that closely
matches the data model.
Where the geographic information of interest is the spatial variability of a phenomenon, the raster approach is generally
better suited. The subtle color variations from point to point in a digital image are well represented by very large numbers
of cells each assigned a set of values to represent the red, green, and blue intensities at each cell position.
Similarly, the shape of a surface, its topography, is well represented by a set of evenly spaced elevation measurements.
Where the information of interest is the distribution of objects in space or the conditions that apply to an area feature (soil
or forest stands on a thematic map), then the vector approach tends to be better suited.
THE RASTER DATA MODEL
In its simplest form, the raster data model consists of a regular grid of square or rectangular cells. The location of each
cell is defined by its row and column numbers. The value assigned to the cell indicates the value of the attribute it
represents.
A point is represented by a single cell (See figure 6-10), a line by several cells with the same value forming a linear
grouping, and an area by a clump of cells all having the same value.
The raster data model is also easily interfaced to the hardware devices commonly used for the input and output of spatial
data. For this reason, the first GISs were written in FORTRAN and were raster based.
Each cell in a raster file is assigned only one value, so different attributes are stored in separate files. Operations on
multiple raster files involve the retrieval and processing of the data from corresponding cell positions in the different data
files.
Conceptually, the process is like stacking the files (See figure 6-11) and using the vertical stack of cell values to analyze
each cell location.
For example, in order to find all the cells with a Pine forest cover and a Sandy soil type, each cell in the soil file and each
corresponding cell in the forest file would be retrieved and evaluated. All those cells that were coded as Pine forest and
also as Sandy soil would be identified and could be output to a new data file. This is overlay analysis.
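The Pine and Sandy example can be sketched with two small grids. The cell values and the function name are made up; the nested lists stand in for two raster files that cover the same area with the same cell size.

```python
forest = [
    ["pine", "pine", "oak"],
    ["pine", "oak", "oak"],
]
soil = [
    ["sandy", "clay", "sandy"],
    ["sandy", "sandy", "clay"],
]


def overlay(layer_a, layer_b, value_a, value_b):
    """Return a new grid with 1 where both conditions hold, 0 elsewhere."""
    return [
        [1 if a == value_a and b == value_b else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(layer_a, layer_b)
    ]


pine_on_sand = overlay(forest, soil, "pine", "sandy")
for row in pine_on_sand:
    print(row)
```

Each output cell depends only on the two input cells at the same row and column, which is why overlay is simple to implement in the raster model.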
The total number of values stored in a raster file equals the number of rows x the number of columns. The smaller the
area of land that each cell represents, the higher the resolution of the data and the larger the file size.
At 250 x 250 m pixels: 1 km = 4 cells, so 1 km x 1 km = 16 cells
At 100 x 100 m pixels: 1 km = 10 cells, so 1 km x 1 km = 100 cells
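The same arithmetic can be written as a small helper. The function name is hypothetical, and partial cells at the edge are rounded up as an assumption.

```python
import math


def cell_count(width_m, height_m, cell_size_m):
    """Number of cells needed to cover an area at a given cell size."""
    columns = math.ceil(width_m / cell_size_m)
    rows = math.ceil(height_m / cell_size_m)
    return rows * columns


for size in (250, 100, 50):
    print(size, cell_count(1000, 1000, size))
```

Halving the cell size quadruples the number of values, so file size grows much faster than resolution.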
Raster files are commonly very large. Where many neighboring cells hold the same value, a significant reduction in the
size of the raster file can be achieved by using various methods of data compression, such as run-length encoding and
quadtrees.
RUN-LENGTH ENCODING
Data compression means that data is represented in a more compact form.
If data in a file are very different from cell to cell (e.g., digital terrain data or satellite imagery), then the large number of
cells serves to capture the high spatial variability. If the number of values were reduced, some of the spatial information
would be lost.
However, in many cases, the spatial variability is not high and the information can be represented with less redundancy
and without loss of detail. This is especially useful for thematic data: because cells representing areas of the same class
have the same value, the pattern of values tends to be spatially clumped.
The quantity of data needed to capture a clumped pattern of spatial variability can be considerably reduced by using data
structures that code these repeated values more compactly than the simple raster data structure.
In run-length encoding, adjacent cells along a row that have the same value are treated as a group termed a run. Instead
of repeatedly storing the same value for each cell, the value is stored once, together with information about the size and
location of the run.
In standard run-length encoding, the value of the attribute, the number of cells in the run, and the row number are
recorded. See Figure 6-12 B.
Another type of run-length data compression is value point encoding. Here cells are assigned position
numbers starting in the upper left corner, proceeding from left to right and from top to bottom.
The position number for the end of each run is stored in the point column. The value for each cell in the run is in the
value column.
Figure 6-12 shows examples of the full raster coding, run-length encoding, and value point encoding and the sizes which
are 100 values, 54 values, and 32 values respectively.
These forms of data compression become less efficient as the number of edges or transitions increases. The greatest
degree of compression is achieved when there are only a few classes and they occur in large clumps.
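Both encodings can be sketched in a few lines. The grid is invented and is not the raster of Figure 6-12. Two details are assumptions, since the text does not state them: position numbers start at 1, and in value point encoding a run may continue from the end of one row into the next.

```python
from itertools import groupby


def run_length_encode(grid):
    """Standard run-length encoding: (row number, value, cells in run)."""
    runs = []
    for row_number, row in enumerate(grid, start=1):
        for value, group in groupby(row):
            runs.append((row_number, value, len(list(group))))
    return runs


def value_point_encode(grid):
    """Value point encoding: (value, position number of the run's last cell)."""
    flat = [value for row in grid for value in row]
    pairs = []
    position = 0
    for value, group in groupby(flat):
        position += len(list(group))
        pairs.append((value, position))
    return pairs


grid = [
    ["A", "A", "A", "B"],
    ["A", "A", "B", "B"],
    ["B", "B", "B", "B"],
]

rle = run_length_encode(grid)
vpe = value_point_encode(grid)
print(rle)
print(vpe)
```

For this grid the full raster holds 12 values, the run-length form holds 5 runs of three numbers each, and the value point form holds 4 pairs. With so small a grid the saving is slight; it grows as the clumps get larger.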
QUADTREES
The quadtree data model provides a more compact raster representation by using a variable-sized grid cell. Instead of
dividing an area into cells of one size, finer subdivisions are used in those areas with finer detail. In this way, a higher
level of resolution is provided only where it is needed.
Using the quadtree structure, a coarse resolution (large cells) is used to encode large homogeneous areas while a finer
resolution (small cells) is used for areas of high spatial variability.
Conceptually, the construction of a quadtree can be thought of as a process of regularly subdividing a map. If there is
more than one class present, then the map is subdivided into four equal-sized quadrants. Then each quadrant is tested
to determine if more than one class is present. Every quadrant that contains more than one class is again subdivided into
four equal-sized quadrants, whereas homogeneous quadrants are not subdivided.
In Figure 6-13, notice that more cells and smaller cells are created at feature boundaries. The dividing process is limited
to a chosen maximum number of iterations and this, in effect, establishes the minimum cell size that can be represented.
Figure 6-14 provides a more detailed look at the quadtree structure.
The physical structure of the computer file is also organized according to the numbering scheme. As a result, cells that
are close together on the map are close together in the file. For operations that use data for a neighborhood, this storage
organization provides efficient data retrieval. Quadtrees are particularly efficient for identifying the nearest neighbor of a
selected point and for identifying the area (polygon) in which a point is located (point-in-polygon search).
The major disadvantage of quadtrees is the time it takes to create and modify them. Quadtrees can provide more
efficient storage of data but only if the data are fairly homogeneous. The fewer the classes and the larger the clumps, the
greater the degree of compression and the more efficient the quadtree structure.
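The subdivision process can be sketched recursively. This is a simplified illustration, not the file structure of Figure 6-14: it assumes a square grid whose side is a power of two, stops at single cells rather than at a chosen iteration limit, and orders the quadrants NW, NE, SW, SE.

```python
def build_quadtree(grid, row=0, col=0, size=None):
    """Return a class value for a homogeneous block, else a list of
    four sub-quadrants in the order NW, NE, SW, SE."""
    if size is None:
        size = len(grid)
    first = grid[row][col]
    if all(
        grid[r][c] == first
        for r in range(row, row + size)
        for c in range(col, col + size)
    ):
        return first  # one class only: no further subdivision
    half = size // 2
    return [
        build_quadtree(grid, row, col, half),
        build_quadtree(grid, row, col + half, half),
        build_quadtree(grid, row + half, col, half),
        build_quadtree(grid, row + half, col + half, half),
    ]


def count_leaves(node):
    """Number of cells actually stored in the quadtree."""
    if isinstance(node, list):
        return sum(count_leaves(child) for child in node)
    return 1


grid = [
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
    ["A", "A", "A", "B"],
    ["A", "A", "B", "B"],
]

tree = build_quadtree(grid)
print(tree)
print(count_leaves(tree), "leaves instead of", len(grid) ** 2, "cells")
```

Only the quadrant that contains a boundary is subdivided, so small cells appear only where the detail requires them.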
THE VECTOR DATA MODEL
The vector data model provides for precise positioning in space. The approach used in the vector model is to precisely
specify the position of the points, lines and polygons used to represent features of interest. The map area is assumed to
be a continuous coordinate space where a position can be defined as precisely as desired.
The locations of features on the earth's surface are referenced to map positions using an X, Y coordinate system
(Cartesian coordinate system).
Geographic features are commonly recorded on 2-D maps as points, lines and areas. The vector model uses a similar
approach. A point feature is recorded as a single XY coordinate pair, a line as a series of XY coordinate pairs, and an
area as a closed loop of XY coordinate pairs. See figure 6-15.
The early systems were designed to meet the needs of automated mapping where the principal objective was to store
the positions of the points, lines and polygons, as well as the drawing instructions to plot them (color, pattern, etc.).
These systems were later developed to provide for storage of geographic attributes and recognition of the graphic
elements that represented a particular geographic feature. However, the data were stored as a more or less unorganized
collection of elements. This is called the spaghetti model.
THE SPAGHETTI MODEL
See Figure 6-16. In this model, the paper map is translated line-for-line into a list of XY coordinates. A point is encoded
as a single XY, a line as a string of XYs, and a polygon as a closed loop of XY coordinates.
The common boundary between adjacent polygons must be recorded twice, once for each polygon. There is no inherent
structure; the data are stored only as a collection of coordinate strings.
The spatial relationships between these features are not encoded either. For example, information about the features
adjacent to each polygon is not stored. This information would have to be generated by searching all the features in the
data file and calculating whether or not they were adjacent.
The spaghetti model is very inefficient for most types of spatial analysis since any spatial relationships must be derived
by computation. However, it is an efficient model for digitally reproducing maps because information extraneous to the
plotting process, such as spatial relationships, is not stored.
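The cost of deriving adjacency by computation can be seen in a small sketch. The polygons are invented, and the test for adjacency is deliberately simple: it assumes that a shared boundary was digitized with exactly the same vertices in both polygons, which real spaghetti data would not guarantee.

```python
# Each polygon is a closed loop of XY pairs (first point repeated last).
# The boundary between P1 and P2 is stored twice, once in each loop.
polygons = {
    "P1": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)],
    "P2": [(2, 0), (4, 0), (4, 2), (2, 2), (2, 0)],
    "P3": [(5, 5), (6, 5), (6, 6), (5, 5)],
}


def edges(ring):
    """Undirected line segments of a closed coordinate loop."""
    return {frozenset((a, b)) for a, b in zip(ring, ring[1:])}


def adjacent_pairs(polys):
    """Compare every polygon with every other one to find shared segments."""
    names = sorted(polys)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if edges(polys[first]) & edges(polys[second]):
                pairs.append((first, second))
    return pairs


print(adjacent_pairs(polygons))
```

Every pair of polygons has to be examined, so the work grows with the square of the number of features. This is the computation that a stored topology avoids.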
THE TOPOLOGICAL MODEL
The topological model is the most widely used method of encoding spatial relationships in a GIS.
Topology is the mathematical method used to define spatial relationships. See Figure 6-17. This particular form of
topological model is called the Arc-Node data model.
The basic logical entity is the arc, a series of points that start and end at a node. A node is an intersection point where
two or more arcs meet. A polygon is comprised of a closed chain of arcs that represents the boundaries of an area.
In Figure 6-17, note the tables: Polygon Topology, Node Topology, Arc Topology, and Arc Coordinate Data. In a GIS,
polygons and points are often stored in one type of data layer and lines are stored in a separate data layer. The figure
does not exemplify this.
The Polygon Topology Table shows the arcs that make up the boundaries of each polygon. Polygons can have islands
within them; Polygon C is an island in Polygon B. This is indicated in the arc list for Polygon B by a zero preceding the list
of arcs that make up the island.
The point in Polygon B is also treated as a polygon, Polygon D, which is comprised of a single arc a6. A point can be
considered a polygon with no area.
In order to complete the spatial definitions, there must be a way to refer to the area that is outside the map boundary.
This outside area is designated as polygon E, for which the arcs are not explicitly defined.
In the Node Topology Table, each node is defined by the arcs to which it belongs. N1 is an endpoint for arcs a1, a3, and
a4.
The Arc Topology Table defines the relationship of the nodes and polygons to the arcs. The end points are
distinguished by designating one node as the start or from node and one as the end or to node. The left and right
polygons are also designated.
Analyses of the relative position of the map elements can be done from the topology tables alone. For
example, all polygons adjacent to polygon B can be found by searching the Arc Topology Table. Every polygon paired
with B in this table is adjacent to it.
The topology tables can be used to find all features contained within a polygon by searching the polygon topology table
for arc lists that contain a zero. Polygon B is seen to have two contained features, one defined by arc a6 and the other by
arc a7.
Spatial queries of this type can be processed much more quickly using the topology tables than they can be done by
calculation from the coordinate data as required in the spaghetti model.
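An adjacency search on an arc topology table can be sketched as follows. The table is a small invented example, not the one in Figure 6-17: polygons A and B share arc a3, C is an island inside B, and E is the outside area.

```python
# Arc topology: arc -> (from node, to node, left polygon, right polygon).
arc_topology = {
    "a1": ("N1", "N2", "A", "E"),
    "a2": ("N2", "N1", "B", "E"),
    "a3": ("N1", "N2", "B", "A"),
    "a4": ("N3", "N3", "C", "B"),  # island boundary: starts and ends at N3
}


def adjacent_to(polygon):
    """Every polygon paired with the given one on some arc is adjacent to it."""
    neighbors = set()
    for _from_node, _to_node, left, right in arc_topology.values():
        if left == polygon:
            neighbors.add(right)
        elif right == polygon:
            neighbors.add(left)
    return sorted(neighbors)


def arcs_at(node):
    """Node topology derived from the arc table: arcs that end at a node."""
    return sorted(
        arc for arc, (start, end, _l, _r) in arc_topology.items()
        if node in (start, end)
    )


print(adjacent_to("B"))
print(arcs_at("N1"))
```

No coordinates are read at any point; the search touches only the table, one row per arc.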
To relate map features to the real world positions, the XY coordinates are needed. These are stored in the Arc
Coordinate Data Table. Each arc is represented by one or more straight-line segments defined by a series of
coordinates.
Attribute data are commonly stored in the form of relational tables in which one data field contains an identification code
for the spatial entity.
A topologically structured data model is well-suited to such spatial operations as contiguity and connectivity analyses.
Contiguity is the spatial relationship of adjacency. That is, elements that touch each other are adjacent. A biologist might
be interested in the habitats that occur next to each other, whereas a city planner might be interested in zoning conflicts,
such as industrial zones that border recreational areas.
Connectivity refers to interconnected pathways or networks that transport something. The streets of a city, the cables of
a telephone system, and the streams and rivers in a landscape are examples of transportation networks. Connectivity
functions are used to find optimum routings through a network.
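One common way to find an optimum routing is Dijkstra's shortest-path algorithm, sketched below. The notes do not say which algorithm any particular GIS uses, so this is an assumption for illustration; the STREETS network and its costs are invented.

```python
import heapq

# Hypothetical street network: node -> {neighbouring node: travel cost}.
STREETS = {
    "N1": {"N2": 4.0, "N3": 1.0},
    "N2": {"N1": 4.0, "N3": 2.0, "N4": 5.0},
    "N3": {"N1": 1.0, "N2": 2.0, "N4": 8.0},
    "N4": {"N2": 5.0, "N3": 8.0},
}


def shortest_path(network, start, goal):
    """Dijkstra's algorithm; returns (total cost, node list) or None."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in network.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None  # goal is not connected to start


route = shortest_path(STREETS, "N1", "N4")
```

The network here is stored purely as connections and costs, which is the kind of information a topological model holds without reference to coordinates.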
One of the advantages of the topological structure over the spaghetti model is that spatial analyses can be done without
using the coordinate data. This avoids the time-consuming calculations needed to derive spatial relationships from the
geographic coordinates.
When spatial data are stored using a non-topological model, extensive calculations are needed to derive the topological
information.
Creating the topological structure does impose a cost, however. When a new map is entered or an existing map is
changed, the topology must be updated.
Systems that do not have a topological structure can use a simpler internal data structure but require more complex
algorithms to analyze spatial relationships.
Virtually all full-featured, vector-based GISs now use a topological data model.
Table 6-1 (p. 166) COMPARISON OF RASTER AND VECTOR DATA MODELS
Advantages of the Raster Data Model
1. It is a simple data structure.
2. Overlay operations are easily and efficiently implemented.
3. High spatial variability is efficiently represented in a raster format.
4. The raster format is more or less required for efficient manipulation and enhancement of digital images.
Disadvantages of the Raster Data Model
1. The raster data structure is less compact. Data compression techniques can often overcome this problem.
2. Topological relationships are more difficult to represent.
3. The output of graphics is less aesthetically pleasing because boundaries tend to have a blocky appearance
rather than the smooth lines of hand-drawn maps. This can be overcome by using a very large number of cells,
but may result in unacceptably large files.
Advantages of the Vector Model
1. It provides a more compact data structure than the raster model.
2. It provides efficient encoding of topology, and, as a result, more efficient implementation of operations that
require topological information, such as network analysis.
3. The vector model is better suited to supporting graphics that closely approximate hand-drawn maps.
Disadvantages of the Vector Model
1. It is a more complex data structure than a simple raster format.
2. Overlay operations are more difficult to implement.
3. The representation of high spatial variability is inefficient.
4. Manipulation and enhancement of digital images cannot be effectively done in the vector domain.
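Two points from Table 6-1 can be illustrated with a short sketch: raster overlay reduces to a cell-by-cell operation on aligned grids, and run-length encoding is one simple compression technique for raster rows. The grids and function names are invented for the example, and run-length encoding is chosen here as an assumption, since the table does not name a specific technique.

```python
def overlay(grid_a, grid_b, combine):
    """Combine two aligned grids of equal size, one cell at a time."""
    return [
        [combine(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(grid_a, grid_b)
    ]


def run_length_encode(row):
    """Compress a row into (value, run length) pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


# Hypothetical layers: 1 marks cells that are forested / publicly owned.
forest = [[1, 0], [1, 1]]
public = [[1, 1], [0, 1]]

public_forest = overlay(forest, public, lambda a, b: a and b)
encoded = run_length_encode([0, 0, 0, 1, 1, 0])
```

The equivalent vector overlay would require intersecting polygon boundaries and rebuilding topology, which is why the table lists overlay as a vector disadvantage.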
DATABASES FOR GIS: MANAGING SPATIAL AND ATTRIBUTE DATA TOGETHER
The development of database systems to handle spatial information and the associated attribute data has happened
within the last few years.
Today, most spatial data are still being stored in the form of paper maps, imagery, tables or text descriptions and are
analyzed using manual techniques. However, there is a rapid increase in the amount of geographic information that is
now being collected and stored in digital form suitable for computer-based retrieval and analysis.
Organizing Geographic Information Within a DBMS
Digital mapping was historically viewed as a set of tools for automating the drafting function. The digital mapping function
was not viewed as an integral part of the data management system of the organization.
Early GIS and automated cartography systems used data files directly without using a database management system.
This file processing approach to data management is still used in many GISs today.
Since automated cartography systems and GISs perform similar manipulations, it is often expected that they should
provide comparable levels of performance. However, the additional complexities of handling the data components other
than the geometric ones, and of providing a multi-user environment, make this expectation difficult to satisfy.
In a GIS, geographic information is not approached as a drafting task but as a database application.
As noted earlier, the advantages of organizing data using a DBMS are that it minimizes redundancy of data storage; provides
central control of data access, manipulation, integrity and security of the database; and makes application programs
independent of the form in which the data are stored.
There are at least four ways that DBMS concepts are applied in the context of a GIS:
1. The uses of the data are independent of the way the data are stored. For example, instead of storing
maps, the data that describe the geographic elements (attributes) are stored with minimum
redundancy, and then maps or other types of output are generated as needed in the form best suited
for a specific analysis.
2. Automated updating of interrelated data files is provided. The transaction that registers a change
(the sale of land) can be used to update all of the data files that are affected. All users immediately
have access to updated data.
3. The relationships among all the spatial and attribute information are explicitly defined. Topology is
used to relate all the spatial elements to each other.
4. The central control of the DBMS provides better control of the integrity of the database.
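The second point, a single transaction updating every affected data file, can be sketched with Python's built-in sqlite3 module. The table names, columns and parcel ID are hypothetical; SQLite simply stands in for whatever DBMS an organization uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ownership (parcel_id TEXT PRIMARY KEY, owner TEXT);
    CREATE TABLE tax_roll (parcel_id TEXT PRIMARY KEY, billed_to TEXT);
    INSERT INTO ownership VALUES ('P-17', 'Smith');
    INSERT INTO tax_roll VALUES ('P-17', 'Smith');
    """
)


def record_sale(connection, parcel_id, buyer):
    """Register a land sale in every affected table as one transaction."""
    with connection:  # commits on success, rolls back if an error is raised
        connection.execute(
            "UPDATE ownership SET owner = ? WHERE parcel_id = ?",
            (buyer, parcel_id),
        )
        connection.execute(
            "UPDATE tax_roll SET billed_to = ? WHERE parcel_id = ?",
            (buyer, parcel_id),
        )


record_sale(conn, "P-17", "Jones")
owner = conn.execute(
    "SELECT owner FROM ownership WHERE parcel_id = 'P-17'"
).fetchone()[0]
billed_to = conn.execute(
    "SELECT billed_to FROM tax_roll WHERE parcel_id = 'P-17'"
).fetchone()[0]
```

Because both updates sit inside one transaction, users never see the ownership table changed while the tax roll still shows the previous owner.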
Limitations of General Purpose DBMSs for GIS Applications
The data model most widely accepted for handling non-spatial attribute data in GIS applications is the relational model.
The organization of attribute data into a series of tables that can be used individually or together is simple to understand
and provides efficient data storage.
Relational databases also provide query languages that are easily learned by GIS users and provide very flexible
analysis capabilities.
For these reasons, the relational model is well-suited to storing the non-spatial attribute data. It also adapts
well to the unpredictable nature of geographic data.
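A minimal sketch of an attribute table and a query against it is shown below, again using sqlite3 as a stand-in relational DBMS. The parcels table, its columns and its rows are invented; the polygon_id field plays the role of the identification code that links each record to its spatial entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parcels ("
    "polygon_id TEXT PRIMARY KEY, owner TEXT, zoning TEXT, area_ha REAL)"
)
conn.executemany(
    "INSERT INTO parcels VALUES (?, ?, ?, ?)",
    [
        ("A", "Smith", "industrial", 3.5),
        ("B", "Jones", "recreational", 1.2),
        ("C", "Lee", "industrial", 0.8),
    ],
)

# An ad hoc question posed in the query language: which industrial
# parcels are larger than 2 hectares?
large_industrial = conn.execute(
    "SELECT polygon_id, owner FROM parcels "
    "WHERE zoning = ? AND area_ha > ? ORDER BY polygon_id",
    ("industrial", 2.0),
).fetchall()
```

The polygon IDs returned by such a query can then be handed to the spatial side of the system to display or analyze the matching features.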
However, using a relational DBMS to store the spatial data and adapting the DBMS query language to provide spatial
analysis functions has been a much more difficult problem to handle.
Using a relational database to manage tabular data involves functions such as adding, deleting, or
changing a record, for example when property ownership changes.
Spatial information is more complex, and the transactions performed on it are more complicated. Modifying spatial
information usually involves the simultaneous updating of large numbers of records in multiple files. See Figure 6-21,
page 182 (Dividing a parcel of land).
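The sketch below suggests why dividing one parcel touches several tables at once. It is a simplified illustration and does not reproduce Figure 6-21: the table layouts and the function split_polygon are hypothetical, and the dividing arc is assumed to run between two existing nodes so that no existing arc has to be broken.

```python
import copy


def split_polygon(arcs, polygons, attributes, old_id, new_arc_id, new_arc, parts):
    """Return updated copies of all three tables after dividing `old_id`.

    `parts` maps each new polygon ID to the existing boundary arcs it inherits.
    """
    arcs = copy.deepcopy(arcs)
    polygons = copy.deepcopy(polygons)
    attributes = copy.deepcopy(attributes)

    # 1. Arc topology table: relabel left/right polygons on inherited arcs.
    for new_id, arc_ids in parts.items():
        for arc_id in arc_ids:
            for side in ("left", "right"):
                if arcs[arc_id][side] == old_id:
                    arcs[arc_id][side] = new_id
    # 2. Arc topology table: add the dividing arc.
    arcs[new_arc_id] = dict(new_arc)
    # 3. Polygon topology table: replace the old polygon with the new ones.
    del polygons[old_id]
    for new_id, arc_ids in parts.items():
        polygons[new_id] = list(arc_ids) + [new_arc_id]
    # 4. Attribute table: each new parcel starts from a copy of the old record.
    old_record = attributes.pop(old_id)
    for new_id in parts:
        attributes[new_id] = dict(old_record)
    return arcs, polygons, attributes


arcs = {
    "a3": {"from": "N1", "to": "N2", "left": "E", "right": "B"},
    "a4": {"from": "N1", "to": "N2", "left": "B", "right": "A"},
}
polygons = {"B": ["a3", "a4"]}
attributes = {"B": {"owner": "Smith"}}

new_arcs, new_polygons, new_attributes = split_polygon(
    arcs, polygons, attributes,
    old_id="B",
    new_arc_id="a8",
    new_arc={"from": "N1", "to": "N2", "left": "B2", "right": "B1"},
    parts={"B1": ["a4"], "B2": ["a3"]},
)
```

Even in this reduced form, one edit changes rows in three tables, and all of the changes must succeed or fail together. A real split would usually also create new nodes and update the coordinate table.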
The functions needed to handle geographic data are not done well in the tabular database environment of a
standard DBMS. There are many reasons why; here are five.
1. The spatial data records used in a GIS are variable length records which are needed to store
variable numbers of coordinate points, whereas general purpose database systems are designed to
handle fixed length records.
2. Manipulation of geographic data involves spatial concepts, such as proximity, connectedness,
containment, and overlay. These functions are not easily accommodated by general purpose database
query languages.
3. A GIS requires sophisticated graphics that are not normally provided for a DBMS.
4. Geographic information is complex. The representation of a single geographic feature requires
multiple records in possibly multiple files. For example, the feature may involve feature coordinates,
topology, measurements (area, etc.), links to the non-spatial data, and the attributes themselves.
5. The highly interrelated nature of GIS data records requires a more sophisticated security system than
the record locking approach taken by most general purpose DBMSs. To ensure the integrity of the
geographic database, the security system must protect the integrity of the multiple files in which the
spatial data are stored. A change in one record can create multiple errors in multiple files.
PRACTICAL APPROACHES USED TO IMPLEMENT A GIS
Various practical approaches have been taken to provide data management services for a GIS. They are grouped into
four (somewhat overlapping) strategies. See Figure 6-22.
1. Develop a proprietary system providing the individual data management services required by the different
application modules. This is the file processing approach.
The Map Analysis Package and ERDAS are examples of GISs that use this approach. Each data set is stored as a
separate file. The user can invoke separate analysis functions to manipulate one or more of the data files. Results are
produced in the form of new data files that can be output using various routines.
2. Develop a hybrid system using a commercially available DBMS (usually a relational one) for storage of the
non-spatial attributes. Develop separate software to manage the storage and analysis of the spatial data, using
the services of the relational database to access the attribute data.
ESRI's ARC/INFO is an example of the hybrid approach. The non-spatial attribute data are stored in the INFO (or Oracle,
dBase, or INGRES) database management system, which provides both storage and manipulation functions for the
non-spatial attribute data. The ARC system provides for the storage and manipulation of the spatial data. Spatial analysis
functions are provided using the toolbox approach. In ARC, individual modules are provided for such functions as data
entry, editing, network analysis, etc.
3. Use an existing DBMS, usually a relational one, as the core of the GIS and then develop extensions to the
system where needed. Although the spatial and attribute data may be managed by the DBMS, a significant
amount of software is generally added to the DBMS to provide the spatial functions and graphics display used in
geographic analysis.
System 9, developed by Wild Heerbrugg and Prime Computers, is an example of a GIS built around an existing DBMS.
Both the spatial coordinate data and the topological data, as well as the non-spatial attribute data, are stored using the
relational database in a UNIX environment. The relational DBMS supports variable length fields, and the developers have
added software to handle the spatial references (such as the points and lines that comprise a geographic feature). They
have also added software that allows spatial queries such as overlay, connectivity, and neighborhood operations.
4. Start from scratch and develop a spatial database capable of handling the spatial and non-spatial data in an
integrated fashion.
The MAPS system developed at Carnegie-Mellon University takes a different approach from most GISs. Instead of
trying to develop one or perhaps two representations of the spatial data that are general enough for all types of spatial
data, the MAPS system supports several different representations of data that are all transparent to the user.
Data in various raster and vector forms can be used and queries are optimized to use the representation best-suited to
execute a particular function. Five databases are maintained within the MAPS system:
1. A digital terrain database;
2. A map database;
3. A landmark database;
4. An image database; and
5. The CONCEPTMAP database.
Together they provide for the storage and integrated analysis of such diverse datasets as digitized aerial photography,
digital satellite imagery, digital terrain data, topologically encoded map data, symbols, and non-spatial attributes.
CONCEPTMAP contains the information about spatial entities and manipulation procedures that are used to interact
directly with a user or with the applications programs. In this way, MAPS provides a single uniform access to all spatial
data, independent of their internal structure.
CONCLUSION
A DBMS provides a number of functions critical to the effective operation of a GIS. However, the difficulties of adapting
existing DBMSs to handle spatial data have required that hybrid or modified DBMSs be developed.
There is a strong demand for systems that can integrate diverse raster and vector format data sets without requiring the
operator to first convert the data to a common format.