Data Modelling and Database Requirements for Geographical Data

Håvard Tveite
January, 1997

Abstract

An overview of the fields of data modelling, database systems and geographical information systems is presented as a background. Requirements for a data model and a data modelling methodology for geographical data are investigated and presented. Another contribution is an extension of the traditional ER-diagrams to better communicate the semantics of geographical data. The approach is based on earlier work on Sub-Model Substitution, and adds new symbology that is relevant for geographical data. Database system requirements for geographical data servers are investigated and presented, together with new ideas on the distribution of geographical data for parallel processing.

Table of Contents

Chapter 1  Introduction
    Motivation
    Contributions
    Related work
    How this document is organised

Chapter 2  Database Systems and Data Models
    Data modelling
        Modelling concepts
        Infological data models and the infological and datalogical realm
        Metadata versus "ordinary" data
    Semantic data models
        ER models and diagrams
        EER models and diagrams
        Object-oriented data models
    Database systems
        Brief history
        Definitions
        The three-schema architecture
        Features/services of database systems
        Distributed database systems
        Database machines
        Status of database systems
    Database models
        Hierarchical DBMSs
        Network DBMSs
        Relational DBMSs
        Object-oriented DBMSs
        Deductive DBMSs

Chapter 3  Geographical Information Systems
    History
    Definitions of GIS
    The utility of geographical information systems
        Local administration GIS, an example application area
    Geographical data
        Geographical maps
        Spatial geographical data
        Non-spatial or "catalogue type" GIS data
        Historical data
        Data quality
        Data distribution and sharing
    Models for geographical data
        The raster paradigm
        The vector paradigm
        Representation of the interior of spatial objects
    Queries and operations
        GIS queries
        Use of the different GIS query types
    Current GIS technology
        ARC/INFO
        System 9
        TIGRIS
        Smallworld GIS
        GRASS
        Summary
    Trends
        Hardware trends
        Technology trends
        GIS trends
    The GIS of the future
        Servers of geographical information
    Research and research issues

Chapter 4  Data model requirements
    Introduction
    Geographical data revisited
        Borders of geographical phenomena
        Features of geographical data
    Requirements for high level geographical data models
        Traditional ER model abstractions
        Geometrical object types
        Spatial relationships
        Implicit geographical relationships
        Topology
        Aggregation
        Generalisation
        Categories
        History and time
        Quality/accuracy
        Derived objects
        Sharing of geometrical objects among geographical objects
        Roles and scale
        Spatial constraints
        Groups of related objects (themes)
        Distributed ownership
        Behaviour
    Modelling implications
    Proposed data models and exchange standards for GIS data
        ATKIS
        SDTS
        NGIS and FGIS
        MetaMap

Chapter 5  Sub-Structure Abstraction in Geographical Data Modelling
    Context
    Geographical data modelling using structure abstractions
        Extending ER-diagrams with sub model substitution
        A forestry research example
    Translation
    Conclusion
        Future work

Chapter 6  Database management system issues for geographical data
    Basic requirements
    Data volumes and data types
        Samples
        Raster data
        Vector data
        Time
        Generalisation levels
        Summary
    Multimedia (integrated) database systems
        Hypertext
    Spatio-temporal databases
        Concepts of time in databases
        Representing time in databases
        TQuel
        Time in geographical databases
    Metadata and data dictionaries
        Quality in geographical databases
        Data dictionary issues for geographical data
    Geographical Query Languages
        Different ways of organising geographical information
        Spatial query language proposals
        Query optimisation
        Spatial data types
        Spatial constraints
        Operations
    Transactions
        Transactions on temporal geographical data
        Transaction management
        Concurrency Control
    Distribution issues
        Parallel processing
        Distribution of spatial data
        Replication
        Heterogeneous database system integration
        Fast geometrical processing
        Data exchange formats
    Some limitations of currently used database models
        Network database models
        The relational database model
        Object-oriented database models
    Conclusions

Appendix A  Data structures for spatial databases
    Basic data structures
        Digital computer storage media
        Sequences (lists/arrays)
        Randomised sequences
    Hierarchical structures
    Multi-dimensional trees
        Points
        Lines
        Regions in 2D
    Grid partitioning and spatial hashing
        Multi resolution image trees (pyramids)
        Region quad trees
        Linearisation
        EXCELL
        Grid file

Appendix B  Representation of 3D structures
    3D objects
    Storage organisation
    Point sampling
    Wire frame
    Triangulated Irregular Network
    Parametric representations
    Constructive Solid Geometry

Appendix C  The NHS Electronic Navigational Chart Database
    Introduction
    Background
    Navigational Charts
        ENC and ECDIS
        The ENCDB
        Data management
        Relating the traditional chart data to other data
    Structures for the ECDIS database
    Data modelling for ECDIS
    DBMS-aspects of an ENC-server
        The amount of data
        The data
        Response time
        Concurrency and recovery
        Security
        Reliability
        Billing
        The choice of a database system for the ECDIS server
    Conclusions

Bibliography
Index

Acknowledgements

I wish to thank my supervisor, professor Kjell Bratbergsengen, for encouragement and support through the 8 years that have passed since I started these studies. Without his continuous commitment and goodwill, I would have given up a long time ago. Thanks also to professor Ingolf Hådem, who agreed to be my advisor on photogrammetry. The contact with Hådem has been sporadic since the focus of my work was directed towards data modelling and database management. The first part of this study took place while I was employed as a research assistant at the Department of Computer Systems and Telematics, NTH (now a part of NTNU) for two and a half years from 1988 to 1990. Then, I was supported by a research grant from the Norwegian Research Council for one and a half years from 1990 to 1991.
The rest of the work has been done now and then while I have been employed at the Department of Surveying at the Agricultural University of Norway (NLH). I would like to thank everyone at the Department of Computer Systems and Telematics in Trondheim for a friendly atmosphere. In particular, I would like to mention the members of the database group. They have always been very helpful. My employer during the last years of this work, the Department of Surveying at NLH, also deserves some thanks for encouraging me to finish this work. I have some "friends" who have been annoying me by asking about the status of my thesis work on all occasions during the last 4 years. I am not quite sure whether I should thank them or not! Last, but not least, the friendly atmosphere of "Munkholmens Lægeforening" has been an important inspiration. Without such a stimulating environment, it would have been difficult to find the necessary inspiration for finishing this work.

Chapter 1  Introduction

Digital geographical data are indispensable for monitoring and managing the environment, and for managing and planning geographically based human activities such as land use, utility networks, long-distance transportation and mining in efficient ways. Sharing of digital geographical data, both between and within organisations, is of utmost importance for the efficient use of geographical information systems (GISs). One reason for this is the large effort and cost involved in collecting and maintaining high quality geographical data sets: the more users that can share the data, the easier it is to cover these expenses. Another important reason for sharing is that the availability of high quality data sets has the potential of making environmental (and land use) planning and management better and more cost-efficient. To be able to share digital geographical data, standards are necessary.
Standard data models for spatial data, standard encoding formats for the exchange of spatial data and standard communication protocols for distributing spatial data are all necessary parts of a foundation for efficient geographical data sharing, with the spatial data model as the basic component. A number of national standards for the digital encoding of topographic and thematic maps have emerged in the last decade. The problem with today's standards is that they cover only a limited part of the semantics necessary for general purpose exchange of geographical data. The lack of an agreed-upon data model that covers the essential aspects of spatial data has been impeding the development of powerful exchange standards. This thesis looks into the problems of geographical data models and geographical data modelling, and outlines some possible solutions. To be able to share geographical data between and within organisations, it is necessary to have a system for managing the data. This thesis presents a set of requirements for database management systems that are to act as servers of geographical data/information.

1.1 Motivation

Research on geographical information systems has suffered from the lack of a solid foundation. Many GIS concepts need clarification, spatial data modelling methodologies should be developed, spatial database systems and data structures need elaboration [Günther90], digital cartography and GIS user interfaces need sophistication and, finally, there is an urgent need for standards. The use of GISs is particularly impeded by the lack of standards and the resulting limited availability of high quality data sets. Investments in GISs are risky in such a situation. It is difficult for users to find a suitable system for covering their needs for spatial data support when there is no consensus on what kind of functionality and which kinds of interfaces such a system should provide.
Developing and marketing geographical data sets is difficult when there are no generally accepted standards for their structuring, storage and exchange. When such standards are in place, there will be a market for geographical data servers and services. Such servers should be connected to an international public computer network, giving "everyone" access to useful spatial data. An international system of spatial data servers will have to be supported by mechanisms for finding the right data, and sophisticated spatial database systems are required to manage the geographical data on these servers. Data modelling techniques supporting spatial data become increasingly important as the use of GISs becomes more and more widespread. There is a need for simple concepts and intuitive models in the communication process between the computer scientists and GIS experts on the one side and the spatial science experts on the other. A standardised high level approach to geographical data modelling would be a very useful tool. Such a platform for the integrated use of all kinds of geographical information would be a good starting point for GIS application and database development. If a more solid foundation for GISs can be achieved, the activity in the field must be expected to increase significantly. The serious use of GISs could blossom, and GIS related research and the use of GIS as a tool in other kinds of spatial research would accelerate.

1.2 Contributions

The main contributions of this work are in two areas. The first area is geographical data modelling, and the second area is database support for geographical data, with special emphasis on the distribution/partitioning of geographical information.

• Modelling concepts specific to geographical and spatial information are identified.
• Spatial sub-structure abstractions in ER-like diagrams [Tveite92] are proposed.
• Database issues for geographical data are outlined and investigated.
• Distribution issues for geographical data [Tveite93] are identified, and a distribution strategy for geographical data is outlined.

1.3 Related work

Research on databases for spatial data is one of the branches of database system research that has been receiving increasing attention during the last decade [Günther90]. There are now well attended special purpose conferences on advances in spatial databases (SSD 89 [Buchmann90], SSD 91 [Günther91], SSD 93 [Abel93], SSD 95 [Egenhofer95]). Data models for spatial and geographical databases and geographical information systems received some attention in the 1980s and early 1990s. As in the database community, object-oriented methods have been particularly popular recently. Among the early publications on these topics are Egenhofer [Egenhofer87] [Egenhofer89a] [Egenhofer89b], Feuchtwanger [Feuchtwanger89] [Feuchtwanger93], Frank [Frank88] [Egenhofer87] [Egenhofer89a] [Egenhofer89b], Worboys [Worboys90a] [Worboys90b], Hearnshaw [Worboys90a] [Worboys90b], Maguire [Worboys90a] [Worboys90b], Morehouse [Morehouse90], Orenstein [Orenstein86] [Orenstein88] [Orenstein90b], Peucker [Peucker75] [Peucker78], Scholl [Scholl90] and Voisard [Scholl90]. Within the area of distribution and parallelisation, there has been work on the use of parallel technology for geographical information analysis at the University of Edinburgh, where the parallelisation of GIS algorithms has been investigated. Some other efforts on algorithms have also been made, for instance by Mower [Mower92], but the use of parallel technology for organising general purpose spatial databases has not been given much attention. The main part of this thesis has been written while working with the database technology group at IDT, NTH.
Distributed database technology (both hardware and software) has been the focus of the group, and several prototype parallel database machines have been developed for research purposes. The research performed by this group has provided valuable input to the "distribution part" of this thesis.

1.4 How this document is organised

The thesis can roughly be divided into two parts. The first part includes chapters 2 to 6, and contains the central aspects of the thesis, namely data modelling and database system topics for GIS. The second part comprises appendices A to C. Appendix A is a very short overview of spatial data structures. Appendix B is a short presentation of representation techniques for three-dimensional (3D) structures. Appendix C is a report submitted to the Norwegian Hydrographic Service discussing database issues for an electronic navigational chart database that was under construction some years ago. The server was to provide authorised chart information to ships.

Chapter 2  Database Systems and Data Models

This chapter is an introduction to the fields of data modelling, database systems and database models, a necessary background for the rest of the thesis. The review will be limited to a short summary of the most common data modelling approaches, an overview of the features expected from a database system and some short notes on the most popular database models.

2.1 Data modelling

An information system that is to support an activity should cover all aspects of the real world pertinent to that activity. To be able to develop such an information system, a good model of this so-called mini-world must be developed. Such a high-level data model should abstract and structure descriptions of the phenomena in the mini-world in such a way that the information becomes manageable and understandable for humans.
It is important for a useful data model to [Tsichritzis82]: "… capture the appropriate amount of meaning as related to the desired use of the data". Much research has been devoted to the development of powerful modelling formalisms, emphasising the communication (presentation and visualisation) of mini-worlds between humans and the translation of the models into formats suitable for computer handling.

2.1.1 Modelling concepts

To be able to talk about the world and our representation of the world in a model, a certain vocabulary must be defined. The following is a blend of terminology taken from different sources ([Tsichritzis82], [Chen76], [Ng81], [Elmasri89], [Rumbaugh91], [Sindre90]).

Abstraction is used to hide detail, so that one can concentrate on overall structure. The recognised data abstraction mechanisms are:

• Classification is the formation of an object type from a group of similar tokens (the reverse process is called instantiation).
• Generalisation is the abstraction of similar object types into a higher level object type (the reverse process is called specialisation).
• Aggregation is the abstraction by which an object is constructed from its constituent objects [Tsichritzis82]. Aggregation and generalisation hierarchies are orthogonal, and can therefore be specified separately [Tsichritzis82]. The term Association [Elmasri89] is also used for type level aggregations.
• Association [Sindre90] is related to aggregation, but is a weaker relationship between independent objects (not really structural). Grouping [Hull87] covers the same abstraction as association. Category [Elmasri89] is also similar to association. One use of association is the grouping of different classes that play the same role in a relationship to some other class (the owner of a property can be either an organisation or a person). Associations can often be represented using generalisations.
Identification ensures that all abstract concepts and concrete objects can be made uniquely identifiable. This can be accomplished by unique names or by other means [Elmasri89].

Attribute: A named domain that represents a semantically meaningful object … [Tsichritzis82] (for example the name of a person, the geometry of an area feature, the speed limit of a road, …).

Class: The group of all objects obeying the class' membership condition/predicate; the set of all objects of a certain object type. Category [Tsichritzis82] is a similar concept to class. Data in the same category are supposed to have similarities [Tsichritzis82].

Constraint: In a data modelling context, inherent constraints are limitations imposed by the structures of the data model. Explicit constraints enable the modeller to include more semantics in the model than the structures of the data model itself convey.

Datum (plural: data): An existing description of some phenomenon or phenomena (measurement recordings, images, information catalogues, …).

Domain: In data modelling, homogeneous sets are called domains (examples of some traditional domains in data modelling: integers with values between 0 and 80, real numbers, strings of characters of maximum length 15, dates, …).

Extensional property: A token-/object-level property.

Intensional property: An (object) type-level property.

Object: The human interpretation of a phenomenon in a modelling context (in some modelling formalisms this is represented as an aggregation of attributes).

(Object) Type: The common characteristics of a set of similar objects can be covered by a type (abstraction is used to define a type from a class of similar tokens [Tsichritzis82]). Strictly typed data models are data models where each datum must belong to some category; loosely typed data models do not make any assumptions about categories [Tsichritzis82].
(Object) Token: An instance of an object type (a token is an actual value or a particular instance of an object [Tsichritzis82]).

Phenomenon: Some interesting "thing" (event, object, …) in the real world (for example a flow of water, an organism, a building, a car accident, …). The phenomenon concept covers the Entity concept (entity: "… something with objective reality which exists or can be thought of", as suggested by Hall in 1976 [Tsichritzis82]). Phenomenon will be used for references to the real world in this thesis. Entity will be reserved for use in the context of the Entity-Relationship (ER) modelling formalism.

Relation: A mathematical relation is a set that expresses a correspondence between (or aggregation of) two or more sets [Tsichritzis82]. In the relational model, both the entities and the relationships from the ER model are formalised using relations. N-ary relations can be visualised as tables where n-tuples constitute the rows.

Relationship: An observed or intended connection between phenomena that is interesting for the modelling of a mini-world. An n-ary relationship connects n phenomena. The most common relationship type is the binary relationship, connecting two phenomena. Rumbaugh et al. call the relationship concept an association [Rumbaugh91].

Set: In data modelling, a set is any collection of objects that is properly identified and is characterised by a membership condition [Tsichritzis82]. A classical mathematical set is not ordered, and duplicates are not allowed. An extended mathematical set allows ordering. Groupings [Vossen91] and sets are similar concepts.

Tuple: The row of a relational table, or a list of values. In the relational model, each value comes from a pre-defined domain. An n-tuple is a list of n values, one from each of n domains.
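As an illustration, the classification, generalisation and aggregation abstractions above can be sketched in a few lines of Python. The object types, attribute names and values below are invented for the example; the thesis itself is not tied to any programming language.

```python
# A minimal sketch of classification, generalisation and aggregation,
# using hypothetical geographical object types.

class GeographicalObject:          # generalisation: a common supertype
    def __init__(self, name):
        self.name = name           # attribute: a named domain (here, strings)

class Road(GeographicalObject):    # specialisation: a Road is-a GeographicalObject
    def __init__(self, name, speed_limit):
        super().__init__(name)
        self.speed_limit = speed_limit

class Municipality(GeographicalObject):
    # aggregation: a Municipality is constructed from constituent objects
    def __init__(self, name, roads):
        super().__init__(name)
        self.roads = list(roads)   # a consists-of relationship to member objects

# Classification: the tokens e6 and e18 are instances of the Road type.
e6 = Road("E6", 90)
e18 = Road("E18", 80)
oslo = Municipality("Oslo", [e6, e18])
```

Note how the two hierarchies stay orthogonal, as the text observes: the is-a structure (Road under GeographicalObject) is specified independently of the consists-of structure (roads inside a municipality).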
2.1.2 Infological data models and the infological and datalogical realm

The concepts of infological and datalogical data models were introduced by Langefors in a series of publications starting in 1963 [Tsichritzis82]. Infological data models represent information in ways that are supposed to be similar to how people perceive the information (the infological realm), without considering its final computer-related representation (the datalogical realm). The ideal situation for an information system designer is to have a powerful infological data model that can be easily communicated between humans, together with a way to perform a non-loss translation of this infological data model into the datalogical realm.

Infological data models

In the early theoretical work on infological data models, the concepts of object, property, relationship and time were identified as basic. An elementary fact is in this framework represented as a triple (a collection of objects + a property or relationship + time), called an elementary constellation. Structured textual descriptions (natural language), formal logic (specification in, for instance, the logic-based programming language Prolog [Clocksin84]) and other structural techniques (with visualisation through diagrams) have been proposed as infological data models. Structured textual descriptions can express things in a human readable format, but have severe limitations when it comes to data structuring and formalisation for translation into the datalogical realm. Logic has the advantage of being a formal description, having its roots in mathematics. It is therefore more easily translated into the datalogical realm. A problem is that logic lacks mechanisms for efficient communication of structure.
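The triple structure of an elementary constellation can be written out directly as data. The following sketch uses invented parcels, properties and dates purely to make the (objects, property-or-relationship, time) shape concrete:

```python
# Elementary facts as Langefors-style "elementary constellations":
# (a collection of objects, a property or relationship, a time).
# All parcels, properties and dates below are invented examples.

constellations = [
    (("parcel_17",),              ("area_ha", 4.2),  "1995-06-01"),
    (("parcel_17", "parcel_18"),  ("adjacent_to",),  "1995-06-01"),
    (("parcel_18",),              ("area_ha", 2.8),  "1996-01-15"),
]

def facts_about(obj, time):
    """All elementary facts involving a given object at a given time."""
    return [c for c in constellations if obj in c[0] and c[2] == time]
```

A binary relationship simply becomes a constellation whose object collection holds two objects, as in the adjacency fact above.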
Diagrams have the advantage that they can show structure (relationships) in a human readable way (usually as two-dimensional maps), and diagrams have therefore become very popular for "semantic" data modelling. A problem with diagrams is that they can be difficult to translate into the datalogical realm, and there is a limit to the amount of information that can be put into a diagram without making it difficult to comprehend. Semantic data models [Hull87] [Peckham88] introduce many useful methods for data structuring and abstraction, and constitute the most interesting branch of infological data models for database modelling. In this chapter, the ER model and an EER model are described to give a background in high level data modelling. The entity-relationship (ER) approach (or ER diagrams, initially proposed by Chen [Chen76]) has been the most popular diagrammatic representation for data modelling in the last decade. The expressiveness of the original ER model has been extended in many directions to capture more real-world semantics in the diagrams. The latest direction in real world modelling for computer representation is the object-oriented approach. Object-oriented methods add encapsulation and behaviour to the traditional structuring mechanisms of semantic data models.

The datalogical realm

Many different lower level data models (more closely tied to the datalogical realm) have been used through the years. They are by definition computer oriented, but the evolutionary trend is that these data models are approaching infological data models in expressiveness. The first low level data models, from the 1950s and 1960s, were based on simple file and record structures. Beginning in the late 1960s, there has been an "evolution" of the datalogical models, starting with the hierarchical data models and continuing with network data models and relational data models. In the last decade, object-oriented data models have been proposed.
With the arrival of object-oriented data models, the distinction between the datalogical and the infological realm is getting fuzzy. Object-oriented models are claimed to cover both the infological and the datalogical realm, being directly implementable through object-oriented database systems. As datalogical data models approach infological data models in expressive power and sophistication, their implementation becomes more and more complicated.

2.1.3 Metadata versus "ordinary" data

The semantics of data in a database can be described using metadata. In a relational database system, some metadata are available through the data description in the system catalogues, where all the relations (tables) are described (with relation names, field names, field types and keys). In the context of the geographical information standardisation work within CEN (Comité Européen de Normalisation, the European Committee for Standardisation), the term metadata is defined as [CEN95b]:

Data that describes the content, representation, extent (both geographic and temporal), spatial reference, quality and administration of a geographic data set.

The inclusion of more semantics through more elaborate (and higher levels of) data description is often desirable. As much as possible of the information from the semantic data model underlying the database should be available within the final database. Data quality, the time of validity/acquisition of the data, the constraints that pertain to the data, and the data set location and ownership in a distributed database environment are all examples of useful metadata. Metadata could be provided at a separate level, or they could be integrated with the basic data using attributes or relationships to metadata. In general, it will be difficult to draw a sharp line between what constitutes the metadata and what constitutes the "ordinary" data.
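A separate-level metadata record covering the elements of the CEN definition quoted above might look as follows. This is only a sketch; all concrete values, and the idea of representing the record as a plain dictionary, are assumptions made for the example.

```python
# A sketch of a separate-level metadata record, with one field per element
# of the CEN definition quoted above.  All concrete values are invented.

dataset_metadata = {
    "content": "topographic base data: roads and buildings",
    "representation": "vector",
    "extent": {"geographic": "one municipality", "temporal": "1990-1995"},
    "spatial_reference": "national map projection, zone 3",   # hypothetical
    "quality": {"positional_accuracy_m": 2.0},
    "administration": {"owner": "the mapping authority", "updated": "1996-05-01"},
}

cen_elements = ["content", "representation", "extent",
                "spatial_reference", "quality", "administration"]

def covers(metadata, required):
    """Check that a metadata record provides all required elements."""
    return all(key in metadata for key in required)
```

A standard for metadata representation would, in effect, fix the list of required elements and their value domains, so that a check like `covers` could be performed by any user of the data set.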
The method of metadata representation (integrated or separated) will often be a matter of preference, but could also be dictated by the application type. For example: should the spatial extent/position of a geographical object be considered a metadata attribute or a basic attribute of the object? It is important to arrive at standards for the representation of metadata. If such standards are available, databases can be more self-contained (representing more of the real world semantics), and easier to utilise and validate for a larger class of users.

2.2 Semantic data models

Semantic data models [Hull87] have been a popular investigation topic since the late 1970s. One of the early data models in this category was the ER (Entity Relationship) model proposed by Chen [Chen76]. The SDM [Hammer78] is an example of a semantically richer data model, using terminology such as class, entity, object, aggregate, abstraction, event, name, attribute, subclass, restriction and subset. Semantic data models have a strong advantage over the traditional "database models" for real-world modelling, since they incorporate a wider range of data abstraction mechanisms. Developers and database designers working with complex data (CAD, CASE, GIS) face problems when they try to model their applications and data sets within the limits of the network or relational data model. The semantic data models are useful for infological data modelling, but the translation of complex semantic data models into, for instance, the relational model can be non-trivial. A common "solution" to this problem in many application areas has been to avoid traditional databases, developing custom data structures instead.

2.2.1 ER models and diagrams

The basic Entity Relationship (ER) model, proposed by Chen [Chen76] and later elaborated on by Ng [Ng81] and others, offers the following primitives for modelling:

• Regular and weak entities.
Weak entities are entities that cannot exist in isolation, and depend on other entity types for full identification. In the diagrams, a regular entity is represented by a labelled rectangle, and a weak entity by a double-bordered labelled rectangle.
• Named relationships, involving two or more entities. In the diagrams, an n-ary relationship is usually represented by a labelled diamond with one line to each of the n participating entities.
• Constraints, such as existence dependencies (arrows instead of plain lines in the diagram) and relationship cardinalities (numbers put with the relationship lines in the diagram).
A structure example showing the symbology of ER diagrams, as proposed by Chen (except that there are no labels on the entities and relationships in the figure), is presented in Figure 2-1.
Figure 2-1 Original ER diagram symbology.
The expressiveness of the original ER diagrams has been extended (trivially) with:
• Attributes, with names and value sets / domains (value sets are represented as labelled circles) that can be attached (with a line) to both entities and relationships in the diagrams. The attribute name is placed along the line that attaches the value set circle to the entity rectangle.
• Constraints on attributes, such as keys (illustrated by underlining the attribute name).
The resulting EAR model is described by an ISO document (ISO/TC97/SC5-N695). EAR (entity-attribute-relationship) diagrams have been used extensively in modelling, especially for relational database design. Whether or not to include attributes in the diagrams is a matter of preference. The problem with including attributes is that the diagrams tend to become cluttered and hence more difficult to communicate. Complex objects (aggregations) can be modelled using the ER model by introducing consists-of/part-of (component-of) relationships between the complex entities and their member entities.
Generalisation and specialisation are often modelled in the ER model by defining is-a relationships between the specialised object types and the more general object types (the vehicle object type is connected via is-a relationships from the more specialised object types: car, bus, bicycle, lorry, tractor, tram, …). Associations can be modelled using is-member-of relationships. Temporal relationships can be modelled by using precedes relationships, but history data or versioned objects do not have a particular modelling primitive (time is not included in the ER model). Time can be supported using attributes (time of creation, time of destruction). The ER modelling formalism was intended as a data modelling tool. The behavioural part of modelling is not addressed. The big advantage of the basic ER model is that there are methods for translating all its concepts into many popular database models (hierarchical, network and relational) [Ng81]. It is therefore fairly straightforward to implement as a database schema something that has been specified using the original ER model. Another advantage of the ER model is its limited number of modelling primitives, which makes the model easy to learn. The limited number of modelling primitives is also a problem with the ER model. The pure ER approach can result in diagrams that are difficult to comprehend and communicate because of the necessary overloading of the very limited number of primitives. An abstraction mechanism that would allow the recognition of overall structure by grouping and hiding independent sub-models is lacking in the ER model. Omission of attributes is the only information hiding mechanism available, so it is not possible to perform multi-level modelling. As the number of entities and relationships in ER models increases, the diagrams tend to become visually unmanageable.
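The translation from the ER model into the relational model mentioned above can be sketched in a few lines of code. This is a minimal illustration, not a method prescribed in this thesis: the entity and relationship names (River, County, flows_through) and the helper functions are invented for the example. Entities become tables keyed by their identifying attributes, and an M:N relationship becomes a table whose key combines the keys of the participating entities.

```python
# Sketch of ER-to-relational translation (hypothetical names).

def entity_to_table(name, attributes, key):
    """A regular entity type becomes a table; its identifying
    attributes become the primary key of that table."""
    return {"table": name, "columns": list(attributes), "primary_key": list(key)}

def relationship_to_table(name, participants):
    """An M:N relationship type becomes its own table whose primary
    key is the combination of the participating entities' keys."""
    columns = [k for _, keys in participants for k in keys]
    return {"table": name, "columns": columns, "primary_key": columns}

river = entity_to_table("River", ["river_id", "name", "length"], ["river_id"])
county = entity_to_table("County", ["county_id", "name"], ["county_id"])
# The M:N relationship "flows_through" between River and County:
flows = relationship_to_table("flows_through",
                              [("River", ["river_id"]), ("County", ["county_id"])])
print(flows["primary_key"])  # the combined key of both participants
```

The same pattern extends to weak entities (their key is the owner's key plus a partial key) and to n-ary relationships (the key combines n entity keys).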
In psychology it has been found that humans can only process 5 to 9 information items at a time (George Miller's paper in Psychological Review, March 1956, pp. 81-97 [Coad90]). According to this, diagrams with 10 to 100 information items will be very hard to digest when there is no apparent way of grouping them into more manageable pieces. In practical ER modelling of large structures it is already normal to split the diagrams in one way or another. The ER model does not, however, offer any abstraction mechanisms to support such a partitioning of the model into sub-models. The choice of representation for a phenomenon will in many cases be a matter of preference. There are no basic rules for when to apply entities and when to apply relationships. All relationships can, in theory, also be represented as entities. This can be confusing to the users of the data modelling formalism.
2.2.2 EER models and diagrams
Extended Entity Relationship (EER) models and diagrams have been proposed to overcome some of the deficiencies of the first generation of ER models ([Teorey86], [Batini86], [Elmasri89]). These models provide new abstraction mechanisms in addition to those provided by the original ER model. The EER approach also introduces new symbology for some of the most common abstractions to produce more easily comprehensible diagrams. Elmasri and Navathe’s proposal for an EER model [Elmasri89] includes the notion of a class (that encompasses entity types), subclasses, superclasses (the set of members of a subclass is always a subset of the set of members of the superclass) and categories (associations). All classes can participate in relationships. The following symbology is added to the ER diagrams (see Figure 2-2 for an illustration):
Figure 2-2 EER symbology as used by Elmasri and Navathe [Elmasri89].
• superclass - subclass: The superclass’ and the subclass’ rectangles are connected with a line containing the subset symbol (⊂). The open end of the subset symbol points towards the superclass. A subclass can be defined by a predicate on the superclass’ group of attributes. In this case, the predicate is attached as a label to the subclass - superclass line.
• generalisation/specialisation: This is represented as a circle with a “d” (disjoint specialisation) or an “o” (overlapping specialisation) in it, connected with one line (or a double line, if the specialisation is total) to the superclass, and subset lines (with the subset symbol (⊂)) to all the subclasses. A specialisation can be based on the value of a single attribute, in which case it is called attribute defined. The name of this attribute is used to tag the specialisation at the superclass end of the symbol.
• categories: This is represented in diagrams as a circle with a “∪” (union) in it, having one subset line (double lined, in case the categorisation is total) to the category class (the open end of the subset symbol pointing towards the circle), and lines to all the defining classes. Predicates can be attached as labels to these lines to specify which members of each defining class should be members of the category. The concept of categories makes it possible to group very different classes that play the same role in a relationship. A labelled rectangle is introduced for each category. This notion of category is similar to association.
• constraints: superclass - subclass: A predicate to determine which characteristics a member of the superclass should have to be a member of the subclass. specialisations: A double line from the superclass to the circle to indicate that all the members of the superclass must be members of some subclass.
A “d” or an “o” in the circle to indicate whether the specialisation is disjoint (no superclass member can be a member of more than one subclass) or overlapping. categorisation: A double line from the category class to the circle to indicate that all the members of the defining classes must be members of the category class; predicates to determine which members of the defining classes should be members of the category class.
This EER model does not include aggregation as a special concept, and that must be considered a weakness in the context of complex object modelling. By using aggregations it would be easier to hide detailed information and emphasise overall structure by using a levelled or black-box based method. The structure of the EER model makes it possible to do some sort of multi-level modelling, but it is meant to be a single-level approach, and hence it inherits the one-level weakness of the ER model. The EER model performs reasonably well for semantic modelling when compared to other popular modelling formalisms. In an empirical study comparing data modelling formalisms [Kim95c], the EER model [Teorey86] was compared to NIAM [Nijssen77], one of the most popular object-relationship models (a sort of binary model [Tsichritzis82]) [Biller77]. The findings of this empirical study can be summarised as follows (six hypotheses were tested).
(1) There was no significant difference between the NIAM user group and the EER user group in their model comprehension performance, (2) the NIAM user group did not perform significantly better than the EER user group in the discrepancy-checking task, (3a) there was no significant difference between the NIAM user group and the EER user group in their perceived difficulty of formalism, but (3b) the EER users valued their modelling formalism significantly more than the NIAM users, (4) EER analysts produced a data model of significantly higher semantic quality than NIAM analysts, (5) EER analysts did not produce a data model of significantly higher syntactic quality than NIAM analysts, and (6) the EER users perceived their modelling formalism to be significantly more useful than the NIAM users did.
2.2.3 Object-oriented data models
Object-oriented modelling research, starting in the 1980s, had its roots in semantic data models and object-oriented programming languages (such as SIMULA [Birtwistle73] and Smalltalk [Goldberg83]). Object-oriented data models incorporate such things as encapsulation and behaviour in addition to the structuring mechanisms of semantic data models [Rumbaugh91] [Coad90]. Direct realisations of object-oriented data models into object-oriented database systems have received a lot of attention, gaining momentum in the mid 1980s [Abiteboul90]. Ideas of richer database models than the relational model were, however, starting to emerge already in the late 1970s (e.g. the SIMULA based ASTRA with the ASTRAL (extended Pascal) language [Bratbergsengen83] and PASCAL/R [Schmidt83b]). Modelling approaches that incorporate mechanisms from semantic data models are called structurally object-oriented, while those using mechanisms from object-oriented programming languages are termed behaviourally object-oriented. An object-oriented modelling approach should incorporate both the behavioural and the structural aspects.
Object-oriented programming languages
The behavioural aspect of object-oriented data models has evolved from the field of object-oriented programming languages, having their roots in SIMULA [Birtwistle73] in the late 1960s, and continuing with Smalltalk [Goldberg83] and C++ [Stroustrup91] in the 1970s and 1980s. The key features of object-oriented programming languages are:
• Abstract data types, including methods for presenting and manipulating the state of the objects.
• Communication by message passing. To query an object about some property, a message is sent to the object. The messages constitute the interface to the object.
• Encapsulation/information hiding. Access to the internals of the objects is restricted, so information on an object is generally only available through its public interface (methods).
• Generalisation/specialisation hierarchies. A car, a bus and a lorry all have some common properties that can be captured by the more general class of vehicles. Cars, buses and lorries are specialised subsets of vehicles.
• Inheritance: properties and methods are inherited from the root of a generalisation tree and out to the leaves.
Object-oriented modelling and analysis
Object-oriented data models combine abstractions from semantic data modelling and object-oriented programming languages. This makes them useful for many classes of real-world modelling. Their advantage is in areas where behaviour is important. Simulation is such an application domain, often used in decision support systems. Object-oriented approaches provide an integrated framework for modelling both applications and the data the applications will be working on [Coad90]: … it combines the data and process model into one complete model. Object-oriented methodology has great potential for GIS modelling, but for the geographical data modelling undertaken in this thesis, structural methods are considered sufficient, as explained further in chapter 5.
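The vehicle generalisation hierarchy used as an example above can be sketched in a modern object-oriented language (Python here; the class names, attributes and registration numbers are purely illustrative):

```python
class Vehicle:
    """General class capturing the common properties of all vehicles."""
    def __init__(self, registration, weight_kg):
        self._registration = registration  # encapsulated state
        self._weight_kg = weight_kg

    def describe(self):
        # Part of the public interface (the "messages" the object answers).
        return f"{type(self).__name__} {self._registration}, {self._weight_kg} kg"

class Car(Vehicle):
    """Specialised subclass: inherits properties and methods from Vehicle."""
    def __init__(self, registration, weight_kg, seats):
        super().__init__(registration, weight_kg)
        self.seats = seats

class Bus(Vehicle):
    def __init__(self, registration, weight_kg, capacity):
        super().__init__(registration, weight_kg)
        self.capacity = capacity

car = Car("AB12345", 1200, seats=5)
bus = Bus("CD67890", 11000, capacity=50)
print(car.describe())            # inherited behaviour, specialised class name
print(isinstance(bus, Vehicle))  # a Bus is-a Vehicle
```

The sketch shows the structural side (the is-a hierarchy) and the behavioural side (the describe method inherited from the root of the generalisation tree) in one model.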
2.3 Database systems
Database systems facilitate data sharing and easy access to data. This is made possible by providing standardised interfaces to the data in the database and by applying mechanisms that ensure consistent access to the data for concurrent users. In addition to this, database systems ensure database consistency after system failure.
2.3.1 Brief history
The history of electronic data management started with the “process-oriented” period (1960-1970). In this period, before database systems were introduced, applications and their data were tied intimately together. Files could be shared between applications, but the structuring of the data was embedded within the applications. This meant that in order to apply a small modification to the data structure in a file it was necessary to change all the applications that were using it. By far the easiest approach for such systems was therefore to let the data structures remain static. Consequently, new and more efficient data structuring methods were difficult to take advantage of. In this period, work on data management systems started, and early commercial systems emerged (e.g. IDS in 1962, and IMS-2/VS in 1968 [Wiederhold81]) with standardised access methods. The “data-oriented” period (1970-) followed this first period. The necessity of controlled sharing of data was recognised, particularly for business data within large organisations. The introduction of the database system approach, as we know it today, occurred early in this period. Standard database models with standard interfaces to the data were developed (network, hierarchical and relational), hiding the internal structure of the database (access structures and internal data formats) from the applications.
The security and integrity of data in multi-user centralised - and distributed - database systems have continuously been enhanced through advances in transaction management research (concurrency control mechanisms, recovery protocols and commit protocols). By the beginning of the 1990s, the database needs of most business-type applications had been satisfied by commercially available database system technology. Engineering applications and other applications based on complex data do, however, seem to have demands on databases that go beyond the capabilities of current database technology [Carey90] [Maier89] [Frank84] [Egenhofer87] [Frank88]. These applications have, for efficiency and modelling reasons, until now not been using database systems for the management of their data. Some database systems have been constructed to meet the special needs of technical applications, such as the extended relational system TECHRA [TECHRA93]. During the last decade, the need for database system support has become apparent also for applications that work on complex data. To try to meet these needs, extensions to the now maturing relational database management systems have been proposed (in competition with object-oriented databases). These new database systems should provide a more flexible and efficient environment for integrating applications and data.
2.3.2 Definitions
There have been many attempts at defining a good and consistent terminology for the research field of database systems. The descriptions provided below, taken from Elmasri and Navathe’s book on database systems [Elmasri89], apply for this thesis and reflect the most common terminology in the database literature.
Database
“A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot be referred to as a database.” “A database is designed, built, and populated with data for a specific purpose.
It has an intended group of users and some preconceived applications in which these users are interested.” “A database represents some aspect of the real world, sometimes called the mini-world. Changes to the mini-world are reflected in the database.”
Database management system (DBMS)
“A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is hence a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications.”
Database system ( = database + DBMS)
… “ - we usually have a considerable amount of software to manipulate the database in addition to the database itself. The database and software are together called a database system.”
Self-contained nature of a database system
“A fundamental characteristic of the database approach is that the database system contains not only the database itself but also a complete definition or description of the database. This definition is stored in the system catalogue, …”
Distributed DBMS (DDBMS)
“A distributed DBMS (DDBMS) can have the actual database and DBMS software distributed over many sites connected by a computer network. Homogeneous DDBMSs use the same software at multiple sites. A recent trend is to develop software to access several autonomous pre-existing databases stored under heterogeneous DDBMSs. This leads to a federated DBMS (or multidatabase system), where the participating DBMSs are loosely coupled and have a degree of local autonomy.”
2.3.3 The three-schema architecture
The three-schema architecture (or the ANSI/X3/SPARC DBMS Framework [Yormark77] [Tsichritzis78]) is a recognised three-level model for database system architecture (Figure 2-3). The internal schema/level is the direct interface to the data structures used to implement the database.
Low-level features, such as pointers, hash tables and other data structures, are available at this level. All the mechanisms provided by the conceptual schema must be translated into the operations and data structures of the internal schema. The internal schema is only used by system programmers to implement data formats and operations at the conceptual level of the database system. The conceptual schema is described as follows [Elmasri89]: The conceptual schema is a global description of the database that hides the details of physical storage structures and concentrates on describing entities, data types, relationships, and constraints. A high-level data model or an implementation data model can be used at this level. The external schema/level provides specialised views of the database. Each external view is tailored to a user or a group of users, so that only the data and operations that are of interest to these users are accessible through the view. The external level can be used both to hide data from unauthorised usage and to customise interfaces to the database.
Figure 2-3 The ANSI/X3/SPARC three-schema architecture for database systems.
2.3.4 Features/services of database systems
A set of requirements expected to be met by database systems has evolved, and some of the most central features are listed below.
• A database system must be able to store large amounts of data.
• A database system should conceptually organise the data according to an accepted (“standard”) data model, and should allow access to the data through a well-defined (“standard”) interface (at the least a data manipulation language (DML) for interfacing to general-purpose programming languages), hiding details of the internal data structures from the user. Both interactive interfaces, integrated application development environments and embeddings in the most popular general-purpose programming languages are expected.
Content-based (associative) retrieval should be provided through set-oriented operations, and it should be possible to find related objects by navigating through the structures of the conceptual schema. The data model of the conceptual schema therefore has to be able to represent complex data structures and relationships.
• Metadata, or descriptions of the information present in the database, should be available in the database, both to the DBMS itself and to users through a query interface. The system catalogue (of relational systems) or a data dictionary (an extended system catalogue) have traditionally been used for these purposes.
• It shall be possible to specify constraints on the data, such as domains of attributes, cardinality of relationships, optional or mandatory features, … These constraints should, once specified, be automatically enforced by the database system.
• A database system should provide multiple users with concurrent and controlled access to the data through transaction management [Bernstein87]. Transaction management should provide atomic transactions through the recovery system, and serialisability or other correctness criteria through concurrency control. An atomic transaction should have the ACID transaction properties. ACID stands for: Atomic, Consistency preserving, Isolated and Durable transactions [Elmasri94]. The notion of atomic transactions implies that either the whole transaction (all of the operations) is done or nothing is done. No partial execution of transactions is allowed. A recovery system shall monitor transactions and log all changes made to the database on secure/permanent storage. If the system crashes for some reason, the recovery system will go through this log and bring the database back to a consistent state.
This can be done by making sure that all changes made by committed transactions (transactions that had finished when the crash occurred) are reflected in the database (REDO-ing changes made by these transactions that are not reflected in the database), while none of the changes made by transactions that were aborted by the system crash are left in the database (UNDO-ing these changes). Serialisability is currently the most recognised correctness criterion for concurrency control mechanisms in database systems. A sequence of database operations belonging to different concurrent transactions is serialisable if the resulting state of the database could have been obtained by performing some serial execution of the involved transactions. Serialisability does not seem to be a good criterion for co-operative work, such as in design and planning. New kinds of concurrency control mechanisms are needed to control the complex interactions between co-operating concurrent processes.
• Multiple views on the data should be supported to provide customised interfaces to the data and to enforce access restrictions, avoiding unauthorised usage of the data.
• Fault tolerance is a desirable feature of database systems containing vital information that has to be kept on-line at all hours. Fault tolerance means that the database system should be able to continue to operate normally (having the complete database available) also in the case of failures. Failures could be a disk crash, memory errors, loss of power, communication failure, program error, etc. Fault tolerance can be obtained through controlled redundancy. Mirroring of disks can be used to take care of disk crashes. RAID* technology provides the same functionality [Chen94] [Ganger94] [Patterson88]. Duplication can be used for most hardware elements in a database system to provide fault tolerance (processors, communication channels, tape drives, disk drives and controllers).
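Serialisability as described above is commonly tested with a precedence (conflict) graph: a schedule of read/write operations is conflict-serialisable exactly when the graph of conflicts between transactions is acyclic. A minimal sketch (the schedule representation is invented for illustration):

```python
def conflict_serialisable(schedule):
    """schedule: list of (transaction, operation, item) tuples, e.g.
    ("T1", "r", "x").  Two operations on the same item conflict when they
    belong to different transactions and at least one is a write.  Build
    the precedence graph and test it for cycles with depth-first search."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))  # t1's operation precedes t2's

    def cyclic(node, visiting, done):
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(cyclic(b, visiting, done) for a, b in edges if a == node):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    nodes = {t for t, _, _ in schedule}
    done = set()
    return not any(cyclic(n, set(), done) for n in nodes)

# T1 and T2 interleave conflicting writes on x: not serialisable.
bad = [("T1", "r", "x"), ("T2", "r", "x"), ("T1", "w", "x"), ("T2", "w", "x")]
ok  = [("T1", "r", "x"), ("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "x")]
print(conflict_serialisable(bad))  # False
print(conflict_serialisable(ok))   # True
```

In the bad schedule the conflicts produce edges in both directions between T1 and T2 (a cycle), so no serial order of the two transactions can reproduce the result.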
(* RAID - Redundant Array of Inexpensive/Independent Disks [Chen94])
In addition to these basic features, monitoring of the database (usage statistics) is provided by most commercial database management systems.
2.3.5 Distributed database systems
Distributed database systems are an active area of research [Özsu91] [Garcia-Molina95]. By storing logically connected data at different sites or computers, many interesting issues arise. Distributed transaction management (atomicity, serialisability, concurrency control, commit protocols), distributed query optimisation, reliability of distributed databases and the use of redundancy are all good examples of the complex problems that are receiving attention in this field [Bernstein87] [Breitbart92] [Ceri88]. Multidatabases or federated database systems are loosely connected database systems where the individual databases could be organised according to different database models, and each database system has a high degree of local autonomy [Hsiao92]. Methods for achieving (transparent) data sharing in this kind of environment are emerging, but still constitute a topic for research [Breitbart92] [Kim95d]. Object-oriented approaches to distributed data management have been proposed, using object-oriented abstractions to specify high-level interfaces to the databases through, for instance, a distributed conceptual schema [Papazoglou90].
2.3.6 Database machines
The management of large databases has become a problem in many application areas. This has encouraged research in reliable, high-capacity database systems. Special-purpose database machines (or database computers) [Su88] have come out of this research. One of the most promising approaches is the parallel database machine, where multiple processors co-operate in storing and retrieving data from a shared database (generally distributed over a number of disks).
Such architectures are used both to achieve better performance and to improve availability [Kim84]. This research has led to commercial products, among which the Tandem (NonStop System) was one of the first (the NonStop fault-tolerant architecture came in 1976 [Katzman78], and the (distributed) transaction manager ENCOMPASS came a little later [Kim84]). Parallel database machines can provide improved performance [DeWitt85] through distribution and parallel processing, and reliability through duplication of hardware and data. The relational database model has proved itself a good model for parallelisation, and most current parallel database machines are based upon the relational paradigm [Omiecinski95]. In Norway there have been experiments on parallel relational database machines, and several generations of experimental parallel database machines have been built at NTH in Trondheim [Bratbergsengen89].
2.3.7 Status of database systems
Vossen gives a short and useful overview of the status of database systems entering the 1990s [Vossen91]. The following is partly based on his observations.
The database systems of the 1980s are good at handling:
• Simply structured data objects (record oriented)
• Simple data types (number, character string, …)
• Short transactions
• High transaction rates
• Frequent in-place updates
New areas of database applications differ significantly from the traditional database application areas, and need support for:
• Complex (evolving) data models
• New data types, for instance spatial data types, such as images and topological structures, with associated data structures and operators
• Integration of very different data types
• Relaxed consistency constraints
• Long transactions with few serious access conflicts (which must lead to re-evaluation of concurrency control and recovery mechanisms)
• Fault tolerance and 100% availability
• High data rates with guaranteed service, as required by for instance video servers
• Extremely low response times, as demanded by real-time applications (“real time DB”)
These features are not well supported by the database systems of the 1980s, and must be given more emphasis in the years to come. Geographical information systems are one example of these “new” application areas.
2.4 Database models
The three-schema architecture’s conceptual schema can presently be specified using three or four major approaches. The different approaches to conceptual schema definition will here be termed database models. The most popular models, up to 1990, have been the two set models (the hierarchical and the network model), the relational model, and recently also object-oriented models.
2.4.1 Hierarchical DBMSs
In the middle of the 1960s the first commercial hierarchical database management systems were on the market, one of them being IMS of IBM* (1968). GIS** of IBM was a hierarchical query and update system that was out even earlier (1966).
There is no formal theory of hierarchical database models, but some common characteristics of the family can be identified [Tsichritzis82] [Elmasri89]. The abstractions used in hierarchical models are records (entities) and parent-child relationships. A parent-child relationship type has one owner record type (parent) and one member record type (child). A record type can act as the owner of many different parent-child relationship types, but can only act as a member of one parent-child relationship type, thereby forming a strict hierarchy. An instance of the parent-child relationship type has a unique owner record (from the owner record type) and zero or many member records (from the member record type).
(* IBM - International Business Machines; ** GIS - General Information System)
Figure 2-4 Spatial topology as modelled using a hierarchical diagram.
Hierarchical models support one-to-many (1:N) hierarchical relationships in a natural way, but many-to-many (M:N) relationships and non-hierarchical structures are impossible to handle without introducing some kind of data duplication. N-ary relationships are even more problematic. Virtual records have been introduced to allow other relationship types than one-to-many. Hierarchical data models can be displayed in a hierarchical definition tree [Tsichritzis82], as illustrated by the spatial topology example in Figure 2-4 (spatial topology is described in chapter 4). Virtual record types are shown with a thicker outline, and their real record types are indicated by thin arrowed lines in the figure. Many of the early database systems were hierarchical (Mass General Hospital's MUMPS from 1966, Informatics’ MARKIV from 1967, IBM's IMS-2/VS from 1968, Control Data's MARS from 1969, MRI’s System 2000/S2K from 1970 [Wiederhold81]), and many installations of these systems are still in use. The hierarchical data model’s limited expressiveness makes it inferior to the CODASYL DBTG network model for most non-hierarchical applications.
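The strict-hierarchy restriction described above (a record type may own many parent-child relationship types, but be a member of at most one) can be sketched as a small schema check. The record-type names and the list representation are invented for illustration:

```python
# Sketch: each parent-child relationship type names an owner record type
# and a member record type; a legal hierarchical schema lets every record
# type be the member of at most one relationship type.

def is_strict_hierarchy(relationship_types):
    """relationship_types: list of (owner, member) record-type pairs."""
    members = [member for _, member in relationship_types]
    # A record type appearing twice as a member would break the hierarchy.
    return len(members) == len(set(members))

# Node owns Edge, Edge owns Point: a legal hierarchy.
legal = [("Node", "Edge"), ("Edge", "Point")]
# Point as member of both Edge and Polygon: not a strict hierarchy.
illegal = [("Edge", "Point"), ("Polygon", "Point")]
print(is_strict_hierarchy(legal))    # True
print(is_strict_hierarchy(illegal))  # False
```

The illegal case is exactly the situation that forces data duplication or virtual records in a hierarchical DBMS.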
The hierarchical model is optimised for hierarchical structures, and performs well in such settings.
2.4.2 Network DBMSs
The first network database management system to appear was Honeywell's IDS in 1962. This was also the first commercial database management system to appear [Elmasri89]. The first standardisation effort in the field of database systems was made by the CODASYL* Data Base Task Group (DBTG). The results of this work were a series of proposals for a standardised interface to database systems (1969, 1971, 1973 and 1978) [Tsichritzis82]. These proposals have been collectively referred to as the CODASYL network data model. Many database systems that follow this standard have been implemented, and a large number of databases are organised and managed by CODASYL systems. The CODASYL network data model is more conveniently called the network model.
(* CODASYL - Conference on Data System Languages)
Figure 2-5 Spatial topology as modelled using a data structure diagram (DBTG network model).
The abstractions used in the network model are about the same as the abstractions used in hierarchical models. The DBTG network data model’s set type corresponds to the parent-child relationship type of the hierarchical data model (but should not be confused with a mathematical set). Each set type consists of an owner record type and a member record type. In the network model, a record type can be a member of more than one set type, but a member record can have at most one owner record for each set type. This means that a member record can only take part in one set occurrence for each set type it participates in. Many-to-many (M:N) relationships can therefore only be supported “non-redundantly” by introducing a “dummy” record type between the two participating record types (Island, L-border and R-border in Figure 2-5 are examples of such “dummy” record types).
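The dummy-record technique for M:N relationships can be sketched as follows. The record and set names (Student, Course, Attendance) are invented for illustration and are not the notation of Figure 2-5; each dummy record is a member of exactly one set occurrence per set type, one on each side of the M:N relationship:

```python
# Sketch of the "dummy record" technique in the network model:
# a Student may attend many Courses and vice versa; each Attendance
# dummy record pairs one Student occurrence with one Course occurrence.

students = {"s1": "Ann", "s2": "Bo"}
courses = {"c1": "Databases", "c2": "GIS"}

# Dummy records: members of the "attends" set (owner: a student)
# and of the "attended_by" set (owner: a course).
attendance = [("s1", "c1"), ("s1", "c2"), ("s2", "c1")]

def courses_of(student_id):
    """Navigate from a student, through its dummy records, to courses."""
    return [courses[c] for s, c in attendance if s == student_id]

def students_of(course_id):
    """Navigate the other way, from a course to its students."""
    return [students[s] for s, c in attendance if c == course_id]

print(courses_of("s1"))   # ['Databases', 'GIS']
print(students_of("c1"))  # ['Ann', 'Bo']
```

The one-record-at-a-time navigation through the dummy records mirrors the navigational DML style of CODASYL systems.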
In addition, the network model supports the relationships available in the hierarchical model. Network models can be represented graphically using data structure diagrams, also called Bachman diagrams [Elmasri89]. Spatial topology as modelled in a data structure diagram is shown in Figure 2-5. The CODASYL proposals include a DDL (data definition language) to describe the database structure textually and a navigational DML (data manipulation language) to query and modify the database. The notions of user work area (UWA) and currency indicators are introduced to facilitate programming language interfaces and database navigation. The NDL (Network Definition Language) standard for network languages was proposed by ANSI in 1985 [Elmasri89]. After the CODASYL DBTG report in 1971 [CODASYL71], several commercial products were developed (Honeywell's IDS II, Burroughs' DMS II, Univac's DMS 1100, DEC's DBMS10 and 11, HP's IMAGE, Cullinet's IDMS [Wiederhold81]). The network data model is very good at navigation, that is, one-item-at-a-time retrieval. It was not made for set-based retrieval, and is not very good at it. The fixed structure dictated by the model makes it painful to change the schema, which means that the model is too rigid for applications in dynamic environments. Distribution and parallelisation have not been considered useful or feasible with the network and hierarchical data models. For the many organisations whose requirements suit this technology, the robust network database systems are still among the most powerful. In the early 1990s, a large share of production database systems were network systems, but their share of the database market seems to be decreasing.

2.4.3 Relational DBMSs

The relational data model, introduced by Codd [Codd70], is a database model that builds on the mathematical concepts of sets and relations.
Functional dependencies and keys are two other concepts that are important in the modelling and design of relational databases.

• The properties of sets important to the relational model are the following: duplicates are not allowed in a set, and a set imposes no ordering on its members.
• A relation establishes a connection between an arbitrary number of domains (n-ary relations are relations which include n domains). Relations are represented as tuples. A tuple is a collection that contains one instance of each of the domains participating in the relation. The tuples of a relation are organised as unordered rows in a two-dimensional table.
• Functional dependencies. If, in a relation R, a set of attributes, B, is functionally dependent on a set of attributes, A, this means that if two tuples of R have the same value for A, they must also have the same value for B.
• Keys. A key of a relation is a minimal set of attributes that functionally determines all the attributes of the tuple (since duplicates are not allowed, no two tuples in a relation can have the same key). A relation can have many keys (e.g. the set of all attributes of a relation always functionally determines the tuple, although it need not be minimal), in which case one of them is chosen as the primary key.

Relations are created to describe relevant features of the phenomena being modelled. These features include relationships between phenomena in addition to the individual phenomena with their characteristics/attributes. A person could, in the relational model, be described by attributes such as name, date of birth and colour of the eyes, and by relationships to other phenomena such as father, mother, employer and place of living. All these properties can be grouped together into an (unnormalised) person relation in the relational model. Relations are used to store most of the system information in a relational database system. A table is established for each relation in the (normalised) data model.
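The definition of functional dependency above translates directly into a check over a relation. The following sketch represents a relation as a list of dicts; the `person` data and the `holds_fd` helper are hypothetical illustrations, not part of any relational system's interface.

```python
def holds_fd(relation, A, B):
    """Check whether the functional dependency A -> B holds:
    any two tuples that agree on the attributes A must also
    agree on the attributes B."""
    seen = {}
    for t in relation:
        key = tuple(t[a] for a in A)
        val = tuple(t[b] for b in B)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

person = [
    {"id": 1, "name": "Kari", "birth": "1960-05-01"},
    {"id": 2, "name": "Ola",  "birth": "1958-11-23"},
    {"id": 3, "name": "Kari", "birth": "1971-02-14"},
]

# "id" functionally determines all other attributes, so it is a key;
# "name" does not (two persons named Kari with different birth dates).
```

Running `holds_fd(person, ["id"], ["name", "birth"])` returns True, while `holds_fd(person, ["name"], ["birth"])` returns False.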
Operations in the relational model are defined in the relational algebra or the relational calculus. The relational algebra consists of the relational operators selection (σ), a set operation that retrieves tuples based on the values of the attributes of a relation; projection (π), which picks out certain domains/attributes/columns from a relation; and join (⋈), a sophistication of the cartesian product in which two relations are combined into a new relation on the basis of the values of some common domain(s) of the relations. The new relation consists of all the domains of the original relations; in it, a row from the first relation is combined with all the rows of the second relation that satisfy the condition on the join attributes. Natural join (*) is an equi-join (the condition on the join attributes is equality) where the join domains are not duplicated. In addition, the general set operations union (∪), intersection (∩) and difference (-) are available in the relational model.

The relational calculus is related to first-order predicate calculus, using the logical symbols ∧, ∨, ¬, ∀, ∃ (and, or, not, for all, exists). In tuple relational calculus the variables have tuples as their range, while in domain relational calculus the variables have attribute value domains as their range. πA,B(σC=1(R)) in the relational algebra is equivalent to {t.A, t.B | R(t) ∧ t.C=1} in the tuple relational calculus and {A, B | (∃C) (R(ABC) ∧ C=1)} in the domain relational calculus. Both SQL (Structured Query Language) and QUEL (the query language of the INGRES database management system) are related to the tuple relational calculus. QUEL is much more closely related to the relational calculus than SQL is [Elmasri89]. QBE (Query By Example) is related to the domain relational calculus.
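The algebra operators above are easy to sketch over relations represented as lists of dicts. The helper names (`select`, `project`, `natural_join`) and the toy relations R and S are illustrative assumptions; the point is only to show how σ, π and ⋈ compose.

```python
def select(relation, pred):
    """sigma: keep the tuples satisfying the predicate."""
    return [t for t in relation if pred(t)]

def project(relation, attrs):
    """pi: keep only the given attributes, removing duplicates
    (as the set semantics of the relational model requires)."""
    out = []
    for t in relation:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out

def natural_join(r, s):
    """Natural join: combine tuples that agree on all common
    attribute names, without duplicating the join domains."""
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in common)]

R = [{"A": 1, "B": "x", "C": 1}, {"A": 2, "B": "y", "C": 2}]
S = [{"C": 1, "D": "p"}, {"C": 2, "D": "q"}]

# The algebra expression from the text: pi_{A,B}(sigma_{C=1}(R))
result = project(select(R, lambda t: t["C"] == 1), ["A", "B"])
```

Here `result` is `[{"A": 1, "B": "x"}]`, and `natural_join(R, S)` combines the two relations on their common domain C.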
Normalisation

To avoid the problems that duplication of information can introduce, normalisation is performed on relational data models before realisation in a database system [Date86] [Elmasri89]. A measure of a relational design is provided by the normal form metric, describing the properties of the design. Normal forms were introduced by Codd in 1971-1972 [Tsichritzis82]. In this first effort, a series of three normal forms was defined. The notion of functional dependency as introduced in Codd's original paper [Codd70] is very important for specifying these original normal forms.

• The first normal form (1NF) requires that all attributes in a relational scheme are atomic (no group of values is allowed for a single attribute).
• The second normal form (2NF) requires that the relation is in first normal form, and that all attributes that are not part of the primary key are functionally dependent on the primary key of the relation, but not functionally dependent on a subset of the primary key.
• The third normal form (3NF) requires that the relation is in second normal form, and that no transitive functional dependencies exist in the relation.

Further normal forms have been specified since then, among them the Boyce-Codd normal form (BCNF, which is stronger than 3NF), 4NF (introducing multi-valued dependencies) and 5NF (introducing join dependencies). The more normalised a relational schema is, the more well-behaved it will be in the face of queries and updates. A relational schema can be normalised by splitting the relations that violate the conditions of normalisation. For some kinds of normalisation, it is not possible to split relations without losing functional dependencies or introducing replication. There is also a penalty on splitting relations, because of all the joins that must be performed to reconstruct the universal relation (a relation consisting of all the attributes of the relational schema).
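The splitting described above can be made concrete with a small example. The relation below has a transitive dependency (employee → department → location) and therefore violates 3NF; splitting it removes the redundancy, and the natural join reconstructs the original without loss. The attribute and value names are hypothetical.

```python
# Not in 3NF: "location" depends transitively on "emp" via "dept",
# so the department's location is stored once per employee.
emp = [
    {"emp": "Kari", "dept": "GIS", "location": "Oslo"},
    {"emp": "Ola",  "dept": "GIS", "location": "Oslo"},
    {"emp": "Per",  "dept": "DB",  "location": "Bergen"},
]

# Split into two relations, each in 3NF.
emp_dept = [{"emp": t["emp"], "dept": t["dept"]} for t in emp]
dept_loc = []
for t in emp:
    row = {"dept": t["dept"], "location": t["location"]}
    if row not in dept_loc:       # duplicates removed: one row per dept
        dept_loc.append(row)

# The decomposition is lossless: joining on "dept" gives back
# exactly the original relation.
rejoined = [{**e, **d} for e in emp_dept for d in dept_loc
            if e["dept"] == d["dept"]]
```

The join needed to reconstruct the original relation is exactly the penalty on splitting mentioned above.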
The choice of how far to normalise will depend on the application.

SQL

SQL (Structured Query Language) is the standard interface to relational databases. The SQL "standard" has been enhanced in a stepwise fashion to meet new user requirements [Melton90]. The traditional SQL (SQL-86 and SQL-89) data types are INTEGER, SMALLINT, CHARACTER, DECIMAL, NUMERIC, REAL, FLOAT and DOUBLE PRECISION. The SQL2 standard includes commonly available extensions such as CHARACTER VARYING, DATE, TIME, BIT, TIMESTAMP, INTERVAL and BIT VARYING [Melton90]. Traditional SQL uses the following operators: SELECT, INSERT, UPDATE, DELETE, join (a join between the tables X and Y on the column COL is specified by the condition X.COL=Y.COL), project (the columns in the projection are specified in the SELECT part of the query), UNION, comparison (=, ¬=, >, >=, ¬>, <, <=, ¬<, [NOT] LIKE, IS [NOT] NULL, IN (set membership), [NOT] EXISTS), AND, OR, NOT, aggregations (COUNT, SUM, AVG, MAX, MIN), GROUPing, aliasing and ORDERing. SQL2 adds INTERSECT, EXCEPT (difference), OUTER JOIN, CROSS JOIN and NATURAL JOIN. A typical SQL query returns a new relation, and is structured as follows:

SELECT a set of columns (original columns + aggregations)
FROM a set of tables
WHERE conditions combined with AND and OR
GROUP BY grouping columns
HAVING aggregation condition
ORDER BY result columns

RM/T

As a response to the work on semantic data models, Codd [Codd79] wrote a paper in which he proposed extensions to the relational model to support a higher level of data semantics. The proposal includes modelling concepts, rules for insertion, update and deletion, and algebraic operators. The ideas were first presented at a conference in Tasmania, and the model was called RM/T (Relational Model / Tasmania). RM/T supports "objects" by introducing system-controlled surrogate keys (E-attributes) for identification in addition to the user-defined keys of the traditional relational model.
Using this scheme, it is possible for an object to change type dynamically. RM/T supports generalisation with attribute inheritance and aggregations as first-class "objects" (entities). It also supports temporal ordering of events.

Summary

Relational database management systems have evolved to a high degree of sophistication, providing atomic transactions and serialisability for concurrent users on distributed databases through logging, concurrency control and transaction management. Progress has also been made on parallel relational database machines [Omiecinski95]. Research on relational databases has taken advantage of the simple mathematical model that the relational model is built upon. Since its introduction in 1970, the relational model has developed into a de facto standard for database systems with its standardised SQL interface. Most of the database systems that have been developed in the last decades are based on the relational paradigm. While being well suited for administrative applications, it does not seem powerful enough for more complex design applications [Frank84, Frank88, Kemper87] in the present technological settings. Current relational database systems do not seem to be able to manage large amounts of complex data and long-lasting transactions that operate on such data. Ingres, Oracle, Sybase, Informix, Tandem NonStop (a parallel database machine) and dBase IV (personal computers) are some of the most popular of the currently available relational database systems.

2.4.4 Object-oriented DBMSs

The problems of mapping complex data models to database models have led to a great interest in realisations/implementations of high-level data models. One solution has been to build the database system around abstract data types and object-oriented programming languages, thereby supporting all the object-oriented programming paradigms [Atkinson87].
These kinds of database systems have been termed object-oriented database management systems (OODBMSs). OODBMSs were introduced into the field of database systems in the middle of the 1980s. Many of the first systems were based on C++ (often termed persistent C++). Research on OODBMSs has grown tremendously, and is currently one of the most actively pursued areas in the field of database systems [Banchilon90]. If an object-oriented DBMS could be built, it would save all the effort currently being spent on translating semantic data models into computer-representable mechanisms and structures. With OODBMSs, once the system is modelled, the database schema is also completely specified! Not surprisingly, it has been problematic to implement fully object-oriented DBMSs, and the complexity of such systems seems to demand further research before OODBMS technology will be really "competitive". To be able to function as a proper database system, an OODBMS must support most of the aforementioned DBMS features, in addition to the features of object-oriented programming languages and semantic data models. Among other things, OODBMSs bring navigation mechanisms, as previously used in the network model, back to database systems to obtain better performance for widely used non-set-based retrievals. There has been controversy about what OODBMSs are and should be. In order to make the foundations firmer, some of the most active researchers in the field came up with a list of features OODBMSs should include, the so-called "object-oriented database system manifesto" [Atkinson89]. This effort was intended to provide the basis for more fruitful discussions.
The key features of OODBMSs, according to this list, are:

• Complex objects
• Unique object identity (preferably with a version mechanism)
• Encapsulation, information hiding
• Types and classes (preferably with type checking)
• Class or type hierarchies with inheritance (preferably multiple inheritance)
• Overriding, overloading, late binding
• Computational completeness
• Extensibility (schema evolution)
• Persistence
• Secondary storage management (very large data sets)
• Concurrency control (preferably long transactions)
• Recovery
• Ad hoc query facilities

Object-oriented database issues divided the database research community into two camps, one in favour of building object-oriented database systems from scratch, the other in favour of extending the existing (relational) database systems to capture more real-world semantics (called extended relational database models), as proposed in the "third generation database system manifesto" [Stonebraker90].

Advantages of object-oriented databases are:

• Trivial compilation of semantic high-level data models into the database conceptual schema (they are the same).
• The database is fully integrated within a programming language, and therefore computationally complete. The so-called impedance mismatch between programming languages and traditional database systems is avoided.
• Possibilities for inclusion of behaviour in the database.
• Uniform interface to all objects through methods specified in the data model. This makes them a strong integration tool. It is possible to specify standard interfaces to heterogeneous devices, applications, databases and systems, as long as they offer the necessary/same functionality.
• Abstraction and encapsulation (information hiding), keeping internal structures hidden from the applications.
• Reuse of code through inheritance.
• Schema evolution is claimed to be easier in OODBMSs than in present DBMSs.
Disadvantages of object-oriented databases are:

• OODBMSs are complex, and therefore difficult to implement.
• There is not yet a production-quality OODBMS (?).
• The present lack of a simple, formal object-oriented data model [Beeri90] makes specification complicated, and leads to a lack of standards for distributed computing and sharing of data.
• The lack of an SQL-like associative set-based query language. Today's OODBMSs are advanced network database systems (based on navigation).
• Global optimisation is difficult, because the optimiser should not be allowed to access the internal data structures of the objects (encapsulation). E.g. indexing based on attribute values is complicated.
• Encapsulation means security, but also overhead.

The lack of standards in many fields of information management makes the object-oriented approach attractive for integration [Kim95d]. If good standards for information exchange evolve, the need for the integration mechanisms of object-oriented technology can be expected to decrease. Object-oriented databases take advantage of clustering and buffering in main memory to achieve high performance. Since many OODBMSs are aimed at the construction and design market (CAD), check-in and check-out of complete "drawings" has normally been the only level of concurrency control in OODBMSs. By keeping the complete workspace (all interesting data) in main memory, efficient interactive systems are possible. This kind of clustering and buffering could complicate concurrency control for more general-purpose databases. ObjectStore, O2 [Deux90], Ontos, GemStone and Postgres [Stonebraker91] are some examples of early OODBMS products.

2.4.5 Deductive DBMSs

Deductive databases build on deduction in logic, as found in the programming language Prolog [Clocksin84]. Deductive database systems store rules and facts, and are able to answer queries to the database by combining these rules and facts.
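The combination of facts and rules can be sketched with a single hand-written rule and naive forward chaining. This is only an illustration of the principle (with hypothetical facts), not how a deductive DBMS evaluates rules efficiently.

```python
# Facts, stored as tuples (predicate, argument, argument).
facts = {("parent", "anne", "berit"), ("parent", "berit", "carl")}

# One rule, applied to saturation:
#   grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
def derive(facts):
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (p1, x, y) in list(derived):
            for (p2, y2, z) in list(derived):
                if p1 == p2 == "parent" and y == y2:
                    new = ("grandparent", x, z)
                    if new not in derived:
                        derived.add(new)
                        changed = True
    return derived

db = derive(facts)
# Query: grandparent(anne, carl)? Derivable, hence true.
# Query: grandparent(carl, anne)? Not derivable, hence false
# under the closed world assumption.
```

The closed world assumption appears in the second query: the system answers "false" not because the negation is stored, but because the fact cannot be derived.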
Expert systems and knowledge-based systems are application areas that need the support of a deductive database system. A central constraint in deductive database systems and logic is the closed world assumption (CWA). The CWA is a prerequisite for making deductions on the basis of a set of rules and facts. It means that one assumes that facts that are not present in the database do not exist, and the same applies to rules. Research in deductive databases has concentrated on developing fast methods to combine rules and facts, in order to perform deductions as rapidly as possible. The research has encompassed both special-purpose hardware and the development of new algorithms for efficient combination of corresponding rules and facts. Active database systems are database systems that have production rules for integrity constraint enforcement, derived data maintenance, triggers, alerters, protection, version control, and others [Dayal95]. The rule processing capabilities of active database systems also make them suitable for handling deductive databases.

Chapter 3 Geographical Information Systems

The paper map has for some thousand years been the main carrier of geographically related information. Increasing demands on the availability of information and the efficiency of information processing have led to a need for better ways of storing and distributing such information. GISs (Geographical Information Systems) are expected to provide the answer to these needs. This chapter is a short presentation of geographical information system topics and their current status. The last three sections explore future trends, parallelisation issues and geographical information servers.

3.1 History

People have always had a need for geographically related information.
In the earliest times, when man's principal occupation was hunting and collecting food, geographical information (for navigating between the "home", good places for collecting different herbs, fruits and vegetables, good areas for hunting, …) could normally only be "stored" in human brains. Orientation was probably centred around significant landmarks and landscape features such as rivers, lakes, mountain ranges and vegetation boundaries. The amount of man-made features and infrastructure started to increase from nearly zero when man settled down and started with agriculture. From then on, there has been a steady increase in both infrastructure and the number of man-made features. In the last couple of centuries, the growth has exploded. Orientation and the organisation of human society have become more and more complex as the amount of man-made features has increased. The traditional (paper) map, being an abstraction of the landscape as observed by humans, has been a very important tool for orientation as well as for land-use planning and administration. It has always been the main carrier of geographically related information (from stone carvings to paper maps). Today we have an enormous number of maps of varying scale for different purposes stored in voluminous archives all over the world. In parallel with the proliferation of maps, the amount of geographically relatable data has steadily increased. Information about land-owners, land use, vegetation, mineral reserves, wild-life, infrastructure and all kinds of services is growing day by day. This growth in the amount of infrastructure, utilities and available information has led to a need for more powerful management tools. The present situation requires highly skilled personnel with many years of experience to handle and utilise the available information in an efficient way.
This is partly due to the lack of tools for efficient integration of the many information sources, which therefore have to be integrated "by hand" (or head), and partly due to the complexity of the tasks. There is hope that GISs will provide a means to make geographical information more accessible to a broader class of users. GISs originated in Canada in the early 1960s [Tomlinson89] [Nagy79].

3.2 Definitions of GIS

In one of the first widespread textbooks on GIS, Burrough uses the following definition of GIS [Burrough89]:

.. tools for collecting, storing, retrieving at will, transforming, and displaying spatial data from the real world for a particular set of purposes. This set of tools constitutes a 'Geographical Information System' (sometimes a Geographic Information System - sic)

Geographical information systems should, by word analysis, be information systems that have something to do with geography. To provide a definition of GIS, one could therefore start with the definition of an information system. A much cited definition of information systems is given by Langefors [Langefors73]:

An information system is a system that collects, stores, processes, and distributes information sets

A definition concentrating on geographical data and implicitly referencing Langefors' definition would be:

Geographical information systems are information systems where some of the information sets acted upon are geographically related (to points, lines, areas, surfaces or volumes in space)

According to the above definition, GISs constitute a very broad class of information systems. A GIS can be anything from a small dedicated information system with a limited domain and small amounts of data, to an enormous general-purpose information system encompassing every possible piece of geographical information distributed over many databases throughout the world.
3.3 The utility of geographical information systems

GISs have many potential uses and users. There are, however, some requirements that have to be satisfied in order to make them generally useful. Burrough [Burrough86] points out that GISs will have to provide more advanced analysis capabilities than boolean logic, map overlay and conventional thematic mapping techniques. He also stresses that statistics should be employed to a larger extent when collecting and handling sampling data. Commercial GISs have not yet reached an acceptable level of sophistication when it comes to the incorporation and use of statistics and advanced spatial analysis. In addition to Burrough's points, it is a basic requirement that the application of computer-based GIS methods should give results that are of better, or at least the same, quality as traditional methods, while being more cost effective and easier to use on an overall basis. Some additional requirements will also have to be fulfilled before GIS can be considered a mature information system branch.

• An internationally standardised data model for the intrinsic properties of geographical data.
• Standardised modelling tools for both spatial data and applications.
• An internationally standardised thesaurus of geographical terms for use in (global) data dictionaries.
• Integration and analysis of different kinds of spatial data sets (raster, vector, sampled surfaces).
• Support for advanced (spatial) statistical methods. Some GISs provide interfaces to statistical packages, but the development of spatial statistical methods has not come very far.
• The handling of data quality in the context of data sets, processing algorithms and presentations. This is currently a hot research topic.
• Co-operation in data capture and sharing of geographical data.
• On-line availability (over some public network) of geographical data that users could be interested in (in a standard digital format).
• Good support for multiple concurrent users (co-operating and/or independent). This issue is currently receiving an increasing amount of attention.
• Standardised user interfaces to simplify the training of GIS personnel. During the 1990s, most GISs have been moving to windows-based user interfaces, which is an improvement.
• System performance adequate for interactive usage.

The most pressing problem for the proliferation of GISs is the lack of computerised geographical data sets. The pioneer GIS users must normally digitise most of the necessary data themselves, and this makes the introduction of a GIS very expensive. As soon as the national mapping authorities and the other vendors of geographical data can provide complete* geographical data coverage of adequate quality/precision in digital format, this situation will change, and new applications with new users will be feasible. This must be expected to take some time, but will probably be accelerated by the growing interest in GIS, which has already led to the production of a lot of digital geographical data sets. The Norwegian Mapping Authority expects to be able to offer all their products in a standardised digital format before the turn of the century. When the GIS field has matured, there are many possible areas of utilisation for GISs. A very promising application area for GIS is public (and private) services and planning, where a very good cost-benefit ratio has been forecast (up to 1:4 [Bernhardsen86]). Hypertext-based browsing systems for "tourist" information have a great potential (e.g. on the WWW** or similar systems, using for instance VR*** techniques for visualisation and interaction).
GISs have a potential for transport monitoring and routing systems (GIS will probably play an important role in logistics in the future). Navigation applications certainly constitute an interesting application area; land (cars), sea (ships) and to some extent also air (planes) navigation can profit from the use of GIS technology together with a positioning system (e.g. GPS****). ECDIS (electronic chart display and information system) [Grant90] is a promising application in the area of maritime navigation and information. And many car manufacturers are now supplying their cars with road navigation systems (as these systems mature and become widespread, powerful database servers will have to be established to provide millions of cars with real-time information on roads, construction sites and accident points). Archaeology is another good example of an application area for GIS. Natural resource exploration such as mining would benefit from the use of a GIS for organisation and analysis of geological probes and other relevant data. The use of GIS technology in marketing seems to have great potential. Finally, GIS is the perfect tool for monitoring the environment, and very useful for natural resource management and land use planning.

3.3.1 Local administration GIS, an example application area

A GIS that administrates all the information (geographical and non-geographical) pertinent to an administrative unit (e.g. a county or municipality) is a very good tool for improving and rationalising management and decision making processes. This is possible by providing shared and controlled access to all the relevant community information to the different branches at all levels of the local administration.

* For most countries, the number of map sheets that are available digitally still comprises only a fraction of the number of map sheets the national and regional mapping authorities are responsible for.
** World Wide Web: A publicly available world-wide hypertext structure (treated later in the thesis) on the internet. References include the internet protocol (IP) address of the computer where the document resides plus the path to the document on that computer.
*** Virtual Reality.
**** Global Positioning System, an operative special-purpose US military satellite-based positioning system.

A Nordic working group has forecast that investments in such a GIS by local authorities will give a (purely economical) cost-benefit ratio of 1:4 [Bernhardsen86]. The real benefits for society should be even greater, considering the potential for improved decisions and the new possibilities for better communication, both with the public and within the public services themselves. A local administration GIS relies on many different data sets collected from a variety of sources. The GIS will have to be responsible for the integration of these data sets, and for making them available to the planners in the different sectors for common and simultaneous usage. Users should, in general, be allowed to read all data, but only update the data for which they are responsible. Non-local data sets will, in the normal case, only be used as background, read-only information.

3.4 Geographical data

There is a multitude of data types and formats that are interesting for incorporation into a GIS. The GIS information base includes all data collected and measured for the purpose of describing phenomena connected to a location on the earth. Geographical data are characterised by being related to a position/point, a line or an area on the surface of the earth, or a geographically relevant surface or volume. These data constitute one of the most important subclasses of spatial data or spatially referenced data. Spatial data have traditionally been stored and presented as "paper" maps.
For geographical information systems it is very important that the traditional map information (objects with a spatial attribute) is augmented with "non-spatial" information (such as census data, cadastres*, pay-roll information, "product" information, …) to better facilitate analysis.

3.4.1 Geographical maps

A geographical map is a simplification/abstraction of reality for the purpose of illustrating spatial relationships in the real world. Geographical maps have for a long time played the key role in storing and presenting information with a dimensionality of at least 2. Topographical maps constitute the most general type of geographical maps. Such a map is a simplified representation of an area as it appears to a human observer. It provides a known scale and a standard frame of positional reference. It will contain an elevation model, characteristic terrain features (vegetation and hydrography) and visible human infrastructure. Topographical maps are used for orientation (way-finding and positioning) and as a frame of reference for the display of various kinds of thematic information. Thematic maps show only one or a few themes of an area. Examples of such maps are: geological maps, maps of some kind of utility (telephone network, sewage, fresh-water, electricity, …), vegetation maps and economical maps (borders of cadastres). Thematic maps are tailored for specific purposes and applications.

* Cadastre: Register of land properties with relevant information

Sophisticated methods and theories for good map design have been developed through the years (cartography and cartographic communication) [Keates82]. Cartographic techniques are now being developed further to take advantage of the technical possibilities of the "GIS age", with powerful computers and sophisticated display devices (communication channels).
3.4.2 Spatial geographical data

Spatial features

Laurini and Thompson [Laurini92] give an introductory, generalised overview of spatial (geographical) features:

1) Phenomena that vary in character from place to place.
2) Natural features with unclear boundaries or no boundaries at all.
3) Person-made phenomena with clear limits.
4) Phenomena located in space, either geographic (earth) or arbitrary.
5) Entities that are related or unrelated to each other by location.

Geographical data are computer-friendly descriptions of geographical features.

Geographical data

There are basically two types of geographical data. The first type consists of what one could classify as geographical objects, while the other encompasses geographical samples. The two ways of representing geography have different characteristics. Geographical objects refer to specific features or objects in space (and time), and can form large structures such as complex objects, line networks and area manifolds. Geographical samples are normally tied to geographical(-temporal) points or small point-like areas, and convey information about a set of characteristics at the sampling spots (soil type, vegetation type, temperature, elevation, humidity, rainfall, …) [Neugebauer90].

Geographical object structures

Geographical objects can be further grouped into a limited set of structure classes.

• Isolated geographical objects
An isolated geographical object is an object that is not composed of other objects, and does not take part in a network or manifold structure. Trees, beacons, (street lights), poles, traffic signs, and in some contexts houses, oil wells, glaciers, … are all spatial objects that can be treated in "isolation".

• Complex geographical objects
Aggregated geographical objects include many man-made features. Some examples: Political units form a hierarchy, with countries composed of counties that are again composed of municipalities.
Economical units have the same structure, with properties that are composed of lots. Buildings are composed of various components, for instance rooms, walls, roofs, floors and a variety of utility networks. Towns consist of streets, buildings, … Aggregated geographical objects could also be composed by taking groups of spatial objects that fulfil certain criteria (properties polluted by oil-spills).

• Networks
A network is a connection of geographical items, most often linear items, but regional items can also take part (complex networks). Transport systems, such as roads, waterways, utilities (pipelines, cables) and railways, are typical examples of networks. Surface hydrology is an example of a natural network (most often a hierarchy).

• Manifolds*
A manifold is a complete partitioning of the region of interest. The countries and oceans of the earth make up a 2D manifold, while geology classification makes up a 3D manifold. Soil classifications, economical and political units and vegetation classifications all make up 2D manifolds.

These classes of geographical objects put different demands on the storage structure and the modelling concepts, and will be referred to in later sections.

Geographical samples

Sampling is a method that can be used to collect information about continuously varying phenomena or fields** (for instance natural resources or climate). Sampling theory and statistical methods based on sampling provide the scientific basis for geographical sampling [Blais86]. The utility of samples for classification will depend on the value of the auto-correlation function (analogous to the inverse rate of change) of the sampled geographical phenomenon as compared to the sampling frequency (density of the samples). The Nyquist frequency corresponds to half the sampling frequency, and provides an upper limit for the frequency information that can be recovered from samples.
If the auto-correlation is high, the number of samples can be low to obtain a certain level of accuracy. If an estimate of the auto-correlation function can be provided, it is possible to determine the expected accuracy of interpolation and classification into manifolds. A subjective estimate could be given by a human observer (for topography the classes could include plain, hilly and mountainous for trend surfaces, combined with for instance smooth, ragged and broken for local variations).

Sampling can be done in a regular grid, in which case the resulting data sets are closely related to rasters or images. Satellite images can be classified as regular samples.

Example applications for geographical sampling:
• Elevation/topography (field over a 2D region)
• Vegetation classification and statistics (field over a 2D region)
• Soil classification and statistics (field over a 2D or 3D region)
• Snow cover representation (field over a 2D region)
• Geological probes/samples (field over a 3D region, the probes are fields over a 1D region)
• Climatic measurements/samples (temperature, rainfall, humidity/aridity, wind, rain chemistry, fog, cloud cover, cloud height, …) (fields over 2D or 3D regions or 3D surfaces)
• Water quality, river currents, … (fields over 3D regions or 3D surfaces)

* As defined in the SDTS [USGS90]
** A 2-dimensional field is a variable that has a defined value at every point in the plane

Geographical samples can be used for direct analysis, or for performing thematic classifications and making thematic maps. A thematic classification is produced by interpolating and extrapolating the sample data into a classification manifold covering the region of interest. This classification manifold can then be presented as a thematic map. Spatial sampling and temporal sampling will have to be combined when monitoring natural phenomena.
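The relationship between sampling density and recoverable spatial variation can be illustrated numerically. The following is an illustrative sketch (not part of the thesis); the function names are invented:

```python
# Illustrative sketch: the Nyquist limit for a regular sampling grid.

def nyquist_frequency(sample_spacing_m: float) -> float:
    """Sampling frequency is 1/spacing; the Nyquist frequency is half of that."""
    sampling_frequency = 1.0 / sample_spacing_m   # samples per metre
    return sampling_frequency / 2.0               # cycles per metre

def is_recoverable(phenomenon_wavelength_m: float, sample_spacing_m: float) -> bool:
    """A spatial variation can be recovered only if its frequency (1/wavelength)
    stays below the Nyquist frequency, i.e. wavelength > 2 * spacing."""
    return 1.0 / phenomenon_wavelength_m < nyquist_frequency(sample_spacing_m)

# A 10 m terrain undulation sampled every 100 m is lost (aliased);
# sampled every 4 m it can be recovered.
print(is_recoverable(10.0, 100.0))  # False
print(is_recoverable(10.0, 4.0))    # True
```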
3.4.3 Non-spatial or "catalogue type" GIS data

The term catalogue type data is introduced to give a name to traditional data, as opposed to spatial data in a GIS. All kinds of information traditionally handled by computer database management systems fit into the catalogue type. An example of catalogue type data is the supplier-parts database in [Date86]:

Part (part number, part name, colour, weight, city of storage)
Supplier (supplier number, supplier name, status, city)
SuppParts (supplier number, part number, quantity)

Catalogue type data can easily be organised into tables of numerical and textual fields, where each row describes an object and each column contains one characteristic or attribute, so these kinds of data fit very well into the relational database model and are easy to store and manipulate in today's information systems. Databases for record keeping in business and administration are a typical example of what is here termed catalogue type information.

Most catalogue type information is spatially relatable (in this example through the city name) and, as such, potentially interesting for spatial analysis using a GIS. In the supplier-parts database, a typical example of this is the optimal transportation routing problem for the various parts.

3.4.4 Historical data

Historical data and time-series data can often be of great value in the GIS context. This means that geographical samples that are to be used for time-series analysis should have a 4-dimensional reference, including both geographical and temporal positions, while man-made geographical objects such as buildings, roads, canals and land properties that have a more discrete life-cycle could have their history of changes or time of validity attached. Snapshot databases are not adequate for a general purpose GIS. Historical information could be of interest to researchers, record keepers and planners, both for monitoring and for forecasting.
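The 4-dimensional referencing of samples can be sketched as follows. This is an illustrative sketch (not from the thesis); all data and names are invented:

```python
# Illustrative sketch: samples stored with both position and time of
# sampling, which is what makes time-series analysis possible later.
from datetime import date

# Each sample: (easting, northing, sampling date, measured value).
rainfall_samples = [
    (1000.0, 2000.0, date(1994, 5, 1), 12.0),
    (1000.0, 2000.0, date(1995, 5, 1), 15.5),
    (1000.0, 2000.0, date(1996, 5, 1), 9.0),
    (3000.0, 4000.0, date(1995, 5, 1), 20.0),
]

def time_series(samples, easting, northing):
    """All historical values at one sampling spot, ordered by time.
    Possible only because position AND date are stored with each sample."""
    hits = [(t, v) for x, y, t, v in samples if (x, y) == (easting, northing)]
    return sorted(hits)

series = time_series(rainfall_samples, 1000.0, 2000.0)
print([v for _, v in series])  # [12.0, 15.5, 9.0]
```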
The consequence of this is that all historical versions of geographical objects should be kept in the geographical database, and samples should be stored with both their date and position of sampling to allow time-series analysis and forecasting by analysts in the future. It is important that both the geographical and temporal dimensions (sometimes termed 4D) are considered when a sampling strategy is determined.

3.4.5 Data quality

Results from a sophisticated GIS analysis will be unreliable in the presence of errors and inaccuracies in the input data (garbage in - garbage out). In his chapter on data quality in [Burrough86], Burrough discusses "the main sources of error and variation in data that can contribute to unreliable results being produced by a geographical information system". It would be very useful to be able to incorporate into geographical information systems methods to automatically determine the fidelity and accuracy of GIS results from the quality of input data and the characteristics of the applications. To achieve this, it is necessary to represent errors and accuracy for the input data, to track the errors and inaccuracies as they propagate through the various analysis steps of a GIS application, and finally to communicate (by for instance using advanced visualisation techniques) the uncertainties embedded in the final results to the user [NCGIA91]. GIS applications might themselves be inaccurate, and this should be taken into account in the error/accuracy analysis process. A US initiative on standards for spatial accuracy [Chrisman84] [USGS90] has come up with a taxonomy for data quality.
According to this work, data quality can be divided into [USGS90]:
• lineage (data sources and transformations performed)
• positional accuracy
• attribute accuracy
• logical consistency (fidelity of relationships, valid values, topology and geometry)
• completeness

Temporal accuracy has been proposed as an additional accuracy parameter by the CEN TC 287 WG2 PT05 [CEN95].

3.4.6 Data distribution and sharing

Geographical data sets are often useful for many purposes and users, and at the same time expensive to collect. Sharing is therefore very desirable for these kinds of data. Collection of geographical data is performed by governmental agencies, local authorities, private companies and individuals, for private usage or for the benefit of the public. It will normally be in the responsible collector's interest to keep the data set up to date by continuously taking the necessary new measurements and samples. The responsible collector is the legal owner of the data, and must be given credit for other organisations' use of the data [Carter92]. The owner of the data will therefore want to keep control of the database, preferably by storing it locally and allowing other users access to the database only on a commercial basis. This ownership structure of geographical information puts certain constraints on the freedom of choice when organising and designing database systems for GIS.

3.5 Models for geographical data

Two paradigms for geographical data have been used in GISs through the years. One is image- or grid-based and is termed the tessellation or raster model, while the other is geometry- or object-based and is termed the vector model [Peuquet84]. Some GISs utilise only one of these paradigms, while others provide for both, with "translation" procedures for "conversion" between the two models when that is necessary.
3.5.1 The raster paradigm

The basic unit of representation in the raster approach to geographical information management is a rectangular* region in 2D (or 3D**) geographical space. The size of the basic units will depend on the requirements of the applications that are to make use of the information (10 km x 10 km cells can be more useful than 10 m x 10 m cells for certain applications). These equal-sized units, called raster elements or pixels***/voxels****, are arranged as the elements of a matrix, hence imposing a tessellation of the geographical region of interest into a regular grid. The raster approach hence uses a fixed-resolution (discrete) representation of 2D (or 3D) space (R^2/R^3). Together with the raster one has to store the geographical location of the raster and the cell size of the raster, in order to make the mapping from raster element to geographical location possible. Rasters can not represent lines in a straightforward fashion, nor can they represent points. In order to be able to represent lines and points in the raster model, one has to treat them as regions.

The raster (or regular tessellation) approach to GIS data storage and representation is, as mentioned earlier, sample- or image-based. It can be termed place-oriented, as opposed to object-oriented.

The traditional raster model can be termed a "2.5D" model. The reason for this is that it is able to represent how a phenomenon varies over a 2D region (2D fields). A phenomenon of interest to the user, e.g. soil type, rain-fall or elevation, is measured for each grid cell, and the resulting matrix of values constitutes a layer in the raster model. At least one layer is introduced for each phenomenon that is of interest to the user, and the resulting multi-layer structure constitutes the basis for operations and analysis in the raster model. A raster data model of an area will consequently consist of many raster layers, each covering a feature of interest to the users.
To be able to do efficient combinatorial analysis on the raster layers of the area of interest, it is important that the tessellations (cell size and cell boundaries) of all the raster layers are compatible (cell borders coincide). If the cell boundaries of one raster layer are different from those of the other raster layers (or the cell sizes are different), resampling will have to be performed to make the raster layers compatible. Resampling takes time.

* Other cell shapes, such as triangles and hexagons, can also be used, but this is not common.
** The regular tessellation based approach to geographical information representation does not have to be limited to 2D. 3D representations can also be useful, for instance for atmospheric modelling, modelling of oceans, lakes and rivers and modelling of geology.
*** Picture element: used for images and 2D rasters.
**** Volume element: used for 3D rasters.

For each cell in a raster a value can be stored. This value is used to describe the phenomenon that is being represented by this raster layer. Some systems will provide 8 bits (2^8 = 256 possible values) for each grid cell, some may provide less, but others again may provide much more (e.g. 32 bits). The more bits that are used, the more information can be put into a grid cell, and the more data will have to be managed by the system.

The raster paradigm is illustrated by Figure 3-1, showing one theme (layer) of a 2D raster model, for instance soil type. In this example, 3 bits are available for representing the soil information at each cell of the grid.

Figure 3-1 Representing geographic space using the raster paradigm (regular grid sampling)

Some pros of the raster paradigm are:
• It uses a simple data model.
• It is suitable for continuously varying phenomena (elevation, rainfall, soil, vegetation, …).
• It allows easy and efficient overlay operations by per-cell computations (avoiding geometrical calculations).
• It makes fast retrieval of the thematic characteristics of a place possible (there is an easy mapping from geographical position to the position of the relevant element in the raster).
• It is possible to make an easy transition from map-based routines to computer-based routines through scanning (producing 2D rasters) of traditional maps.
• The raster data structure is compliant with important source data (satellite/aerial imagery, scanned maps).

Some of the problems with the present use of the raster paradigm are:
• Non-adaptive/fixed resolution within a raster layer. To be able to represent localities of high spatial variation, the cell size will have to be made very small. This means that the raster will be extremely large, and huge amounts of data must be stored. Compression will help out a bit on this problem.
• Non-intuitive representation of spatial object geometry (lines, points, homogeneous 2D and 3D regions). This results in poor storage efficiency for such structures, even if compression techniques are used. It also makes network analysis impractical.
• Difficulties with representing explicit relationships between spatial objects (topology) and relationships between spatial object geometry and the non-spatial attribute data of the objects.
• Geometrical transformations on raster data are complicated, and always introduce errors through the necessary resampling, because regions of fixed shape and size are the basic elements.

These problems put restrictions on the kinds of analysis that can be done in a raster-based environment. The raster paradigm does not seem to be suitable as the only way of data representation in a general purpose GIS. The raster data model tends to be requested and favoured by researchers interested in environmental analysis applications [Maguire91a].
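Two of the raster strengths listed above, the easy mapping from geographical position to cell index and overlay by per-cell computation, can be sketched as follows. This is an illustrative sketch (not from the thesis); layer contents, coordinates and names are invented:

```python
# Illustrative sketch of two raster-layer operations.
# A layer: origin of the lower-left corner, cell size, and a matrix of
# values (row 0 = southernmost row).
ORIGIN_E, ORIGIN_N, CELL_SIZE = 500_000.0, 6_600_000.0, 100.0

soil = [[1, 1, 2],
        [1, 2, 2],
        [3, 3, 2]]
wet  = [[0, 0, 1],
        [0, 1, 1],
        [0, 0, 0]]

def cell_index(easting: float, northing: float) -> tuple[int, int]:
    """Map a geographical position to (row, col) in the raster matrix."""
    col = int((easting - ORIGIN_E) // CELL_SIZE)
    row = int((northing - ORIGIN_N) // CELL_SIZE)
    return row, col

def overlay(a, b, op):
    """Overlay two compatible layers by a per-cell computation
    (no geometrical calculations needed)."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Soil type at a position: a direct array lookup, no search needed.
r, c = cell_index(500_250.0, 6_600_150.0)
print(soil[r][c])  # 2

# Cells that are soil type 2 AND wet:
print(overlay(soil, wet, lambda s, w: int(s == 2 and w == 1)))
```

Note that the overlay only works because the two layers share the same tessellation; with differing cell sizes or boundaries, resampling would be needed first, as described above.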
3.5.2 The vector paradigm

The vector approach to representing geographical information is a phenomenon-based (or object-oriented) way of representing spatial reality. Each geographical phenomenon has to be described using a combination of structured geometrical objects (points, lines, areas, surfaces and volumes). A vector-represented geographical object can take part in complex geometrical structures, such as networks and 2D/3D manifolds. The vector paradigm provides a continuous representation of object boundaries in space (limited only by the numerical precision of the computer representation).

Topology

The topological data model [Peucker75] is a very important element in the vector-based approach. Topology structures can organise the geometry of spatial phenomena into large structures by linking geometrical objects through their borders. An edge is linked through its end-points, an area is linked through its bounding lines and a volume is linked through its bounding surfaces. Using topology information, it is possible to find the neighbours of a geometrical object in a network or manifold by looking up the objects that share a border with it in the topological structure. A more detailed description of the topological data model is provided in chapter 4. The topological (manifold) structure of the vector model is illustrated in Figure 3-2.

Figure 3-2 Representing geographic space using the vector paradigm (points, edges and regions)

The strength of the vector paradigm is that it is very expressive when it comes to representing geographical objects. Network and manifold analysis is directly supported by the vector data model (overlay operations do, however, require massive geometrical computations in vector-based systems as compared to overlays on raster data).
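The neighbour lookup through shared borders described above can be sketched minimally. This is an illustrative sketch, not the thesis's model: a left/right region reference per edge is one common way of encoding 2D manifold topology, and all region and edge names are invented:

```python
# Illustrative sketch: topological neighbour lookup in a 2D manifold.
# Each edge records the region on its left and right, so the neighbours
# of a region are found without any geometrical computation.

# edge id -> (left region, right region); None marks the outside world.
edges = {
    "e1": ("A", "B"),
    "e2": ("B", "C"),
    "e3": ("A", "C"),
    "e4": ("A", None),
    "e5": ("C", None),
}

def neighbours(region: str) -> set[str]:
    """Regions sharing a border edge with the given region."""
    result = set()
    for left, right in edges.values():
        if left == region and right is not None:
            result.add(right)
        elif right == region and left is not None:
            result.add(left)
    return result

print(sorted(neighbours("A")))  # ['B', 'C']
print(sorted(neighbours("C")))  # ['A', 'B']
```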
3.5.3 Representation of the interior of spatial objects

The vector model, as manifested by today's vector-based GISs, is not well suited for the representation and analysis of continuously varying phenomena (fields). Current vector-based systems represent geometrical objects by their boundaries/borders, and are therefore not able to support, in an integrated way, properties that vary over the interior of an object (for more on point-set topology, interior (X°), boundary (∂X) and co-dimension in a GIS context, see [Pullar88], [Egenhofer90b], [Egenhofer91a] or [Papadias94]).

• In 1D space, it should be possible to represent phenomena that vary along an interval (the border of an interval is its end-points, and the interior of an interval is the interval, excluding the end-points).

• In 2D space, it should be possible to represent phenomena that vary over the interior of a 2D region (the borders of 2D regions are lines in 2D, while the interior is the 2D region, excluding its border lines). In a typical vector GIS you can attach information to the 2D region as a whole, and to its border lines. The representation of the interior of 2D regions is not sufficiently integrated into the vector models that are applied in current GISs. The interior can be represented using triangulated irregular networks (TINs* [Peucker78]) that are available in some systems, but a TIN is only one interpolation method, and there is normally a very limited set of analysis tools available for TINs. Lines in 2D: in a typical vector GIS, it is only possible to attach information to the line as a whole and to its end-points. The ARC/INFO GIS uses what it calls "dynamic segmentation" to allow a better representation of the interior of lines in 2D.

• In 3D space, it should be possible to represent phenomena that vary over a 3D region (the borders of 3D regions are surfaces in 3D, while the interior is the 3D region, excluding its bounding surfaces).
Most current vector GISs do not support 3D functionality at all, let alone the interior of 3D objects. Surfaces in 3D and lines in 3D space: for both lines and surfaces in 3D space, it can be useful to represent variation over the interior.

In general, it is safe to state that most of the vector models used in today's vector GISs only provide border representations.

* Triangulated Irregular Network

3.6 Queries and operations

A geographical information system must be able to handle many sorts of queries. Queries for the catalogue type information that prevails in today's database applications will always be important in an information system, but in GISs spatial queries will also play a central role. A GIS interface should provide mechanisms for both textual and spatial/graphical interaction ("image" based queries). To be able to support all potential queries to a GIS database, a GIS will have to implement a multitude of operations. The most basic operations in a GIS are the data integration operations that prepare the different geographical data sets of a study area for analysis (e.g. transformations between different coordinate systems / reference systems and data model/format translations). Update, analysis and presentation are also basic tasks that all require an extensive set of operations. Berry, taking the raster approach, suggests the following basic set of operations in computer assisted map analysis: reclassification, overlay, distance measurements, connectivity measurements and characterisation of neighbourhoods [Berry87].

As a background to the design of data structures and database interfaces for GISs, it is useful to identify common query types, and to provide an indication of the relative importance (frequency of use) of the different query types.
If query information is available, the task of tailoring database systems for fast execution of the most common geographical queries will be easier (while not forgetting the less common queries), resulting in more efficient GISs. It will be difficult to get authoritative statistics on these matters, but assessments can be made based on the nature of the routines performed by various (potential) GIS user groups.

3.6.1 GIS queries

Classification of queries into high-level query types will depend very much upon the point of view of the classifier. For the purpose of this thesis, the data modelling and database point of view is taken, so the query types identified below are based on a classification of the data types useful in a GIS, rather than thematic or other kinds of classifications. A further treatment of GIS queries in the context of a database management system is given in chapter 6.

General catalogue queries

This category includes set-oriented queries (both range queries and exact match queries) and direct queries on identified phenomena. The first example is a set-based range query, the second is a set-based exact match query and the rest are direct object queries.

• Find all cities with a population greater than 4 000 000.
• Find the address of persons with last name "Olsen" and first name "Vegard".
• Find the names of the constituent parts of the engine with the identifier "XYZ1".
• Find the names of the minerals contained in the "Iddefjord granite" together with their weight percentage.

Spatial queries

Spatial queries are queries that use position and spatial relationships (such as distance) as a basis for retrieving information about spatial phenomena. These queries can be divided into spatial computations, set-oriented (range or exact match) queries and instance-oriented queries. Topological queries for network and manifold analysis make up a separate type of spatial queries.
The first example given below is a spatial computation query, the second is a set query constrained by a spatial object (Norway). The third is a set query constrained by a point and a distance operator, the fourth is an instance query, more specifically a nearest neighbour query (point-point or region-region, depending on the representation), and the fifth is a topological query.

• Find the area of the district "Hordaland" in "Norway".
• Find all cities in "Norway" with a population exceeding 40 000 people.
• How many grocery stores are there within a radius of 10 km from the city hall in "Trondheim"?
• Find the bank closest to the company X's office in "Oslo".
• Find the properties that share property border(s) with properties containing hospitals in "Norway".

3D model queries

Queries that require manipulation and computations on 3D models form a sub-class of spatial queries, and can be projection queries (often for display purposes) or "surface-constrained" computational queries. The first example given below is a projection query, while the second, third and fourth are 3D computational queries.

• Show a 3D model of the construction area, "Building-land", as seen from 1 km to the south-west, at a viewing angle of 15 degrees, with default shadows and hidden-surface removal.
• What is the volume of the known sand reserves in "Vestfold"?
• Contour the region with north-east corner (lat1,long1) and south-west corner (lat2,long2) at the scale 1:1000, using a contour interval of 1 meter.
• Perform a simulation of water flow in the snow melting period for the "Gaula" river system, using climate and precipitation data from 1986-1987.

Image queries and integration queries

Image queries and the integration of images with vector-based objects through transformations and image processing are used for visualisations and intelligent/constrained image processing. Image queries are display-oriented or image processing-oriented.
The first and the fourth example given below are display-oriented, while the second and third are of the image processing type. The third and the fourth integrate images with vector-based object information. The fourth also includes a 3D model query.

• Show a picture of the house of "Per Monsen".
• Extract line features from the infra-red band of the satellite image "pa11tdu.img" (using some default filter).
• Indicate the position of houses on image "abc.def" ("abc.def" includes all orientation parameters).
• Insert the new ski-jump site ski-jump.new into the image lillehammer.124 with image orientation: point of view (p.x, p.y, p.z); direction of view (vert.deg, hor.deg, tilt.deg); focal length (foc).

Some of these high-level query types will be discussed further in chapter 6.

3.6.2 Use of the different GIS query types

It is difficult to determine with some degree of certainty the relative importance of the different query types in GISs. There will be many uses of GISs, and each usage will have its own pattern of query types. The list below is therefore a very high-level assessment of the value of the different query types identified in the previous section.

General catalogue type queries

Since this is the type of queries that are available in today's database systems, we must expect them to remain important also for GIS applications. Many spatially related queries can also be formulated using the general mechanisms of catalogue queries.

Spatial (2D) queries

Spatial queries are essential for spatial analysis, and spatial analysis is one of the key features of GISs. Network and manifold analysis will continue to be important in planning, and spatial statistics will be important for environmental research and monitoring.
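The set-oriented and instance-oriented spatial queries discussed in this section (features within a given distance of a point; the nearest neighbour of a point) can be sketched as follows. This is an illustrative sketch (not from the thesis) with invented planar coordinates and data; a real GIS would use spatial indexing rather than a linear scan:

```python
# Illustrative sketch: a distance-constrained range query and a
# nearest-neighbour query over point features in the plane.
import math

stores = {"s1": (1000.0, 2000.0), "s2": (8000.0, 500.0), "s3": (1200.0, 2500.0)}
city_hall = (1100.0, 2100.0)

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def within_radius(features, point, radius_m):
    """Range query: all features within a given distance of a point."""
    return {fid for fid, pos in features.items() if distance(pos, point) <= radius_m}

def nearest(features, point):
    """Nearest-neighbour query: the feature closest to a point."""
    return min(features, key=lambda fid: distance(features[fid], point))

print(sorted(within_radius(stores, city_hall, 1000.0)))  # ['s1', 's3']
print(nearest(stores, city_hall))  # 's1'
```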
3D model queries

3D model queries will be more common in future GISs for visualisations (VR implementations with motion through the 3D model), slope analysis, geological analysis and semi-automatic computation of 3D models from multiple images using (3D constrained) digital photogrammetry techniques. These kinds of queries are presently not very common because they require too much computation for interactive usage in today's technological setting. It should also be noted that many GIS applications will not be interested in three dimensions; a standard projection of the earth's surface will generally suffice.

Image queries

These kinds of queries will probably also grow in importance in the future. Simple image presentation queries must be expected to become quite common, for instance in hypertext type GIS applications. The more advanced integration of images and vector objects will probably be limited to advanced visualisations for planners and data acquisition from image-based sensing equipment (3D model refinements and other types of semi-automatic and automatic data set maintenance), at least in the nearest future.

A general purpose GIS must support all these very different kinds of queries, and since most GIS applications are interactive, it is difficult to give some query types priority or advantages over others.

3.7 Current GIS technology

This section gives an indication of where GIS technology stands today: where the bottlenecks are as to performance, and what functionality is offered. The presentation will emphasise data management in the systems as much as possible.

3.7.1 ARC/INFO*

ARC/INFO was introduced in 1982. It was the first database-oriented vector GIS developed, and is presently the most used "complete" general-purpose GIS world-wide.
ESRI (Environmental Systems Research Institute, Inc., USA) started the development of the system in 1980, after several years of experience with developing GIS software for in-house use (the first commercially available GIS software from ESRI was PIOS, the Polygon Information Overlay System). Much of the initial work performed by ESRI was of a hands-on, environmental consulting nature: regional planning, forest inventory, coastal zone analysis, wildlife mapping and environmental assessment [ESRI95a].

Since its introduction, ARC/INFO has been continuously enhanced and developed. The few things that have not changed since the introduction are that it is vector-based and that it stores geographical information in two separate but logically connected parts.

ARC/INFO provides a toolbox for analysis and presentation of geographical information. The basic interface is command-line based, with separate commands for all the tools. The set of available tools is growing steadily, and is already quite extensive. An example of a tool is a network analysis module. A macro language (AML) is available for tailoring of the user interface through specification and programming of menus and applications. A programmers' library with all the routines used in ARC/INFO is also available for "advanced customers". ARC/INFO is currently the market leader on the GIS arena both for PCs (PC ARC/INFO) and workstations.

The contents of this chapter have been based upon information from the literature [Morehouse85], [Morehouse89], [Peuquet90b], brochures/newsletters from ESRI and personal electronic communication (email) with ESRI staff [ESRI95a].

* ARC/INFO is a registered trademark of Environmental Systems Research Institute Inc. (ESRI), Redlands, USA

Data model

The data model, as described in [Morehouse85], is a hybrid data model, where the geometrical/geographical/topological/structural data are stored in the ARC part of the system, whereas the other (non-geometrical) part of the data (the thematic data) are stored in the INFO* part.

• The ARC part is built upon the topological data model [Peucker75], and is optimised for efficient access to the geometry (spatial searching, topological navigation).
• The INFO part contains the thematic part of the GIS data set. These data are stored using a tabular/relational data model, as found in relational databases. ARC/INFO provides INFO for storage of the thematic part of the data sets, but interfaces to general purpose commercial RDBMSs (relational database management systems) are also provided for this purpose.

This kind of data model is called a geo-relational model. The association of ARC data with INFO data is accomplished in the RDBI (relational database interface) using keys for indexing the cartographic and attribute information. The connection is bi-directional to make the following modes of operation possible:

• use the ARC part to select geometrical objects and then look up information about these objects in the INFO part.
• use the INFO part to select interesting objects, based on non-geometrical attributes, and then fetch them from the ARC part for display or further analysis.

An ARC/INFO database is divided into coverages, layers and tiles to allow the management of large data sets. In multi-user environments, concurrency control is performed through a map library/librarian, to which all requests for coverages have to be made. The mechanism used is a primitive check-in - check-out mechanism, whose prime purpose is to overcome the most basic multi-user problems. More fine-grained concurrency control has been considered impractical due to long, interactive transactions.
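The two modes of operation of such a geo-relational model can be sketched minimally. This is an illustrative sketch of the general idea, not ESRI's implementation; all data and names are invented:

```python
# Illustrative sketch of the geo-relational idea: geometry and thematic
# attributes live in two separate stores, linked bi-directionally
# through a shared key.

# Geometry side: feature id -> geometry (here just a polygon's vertex list).
geometry = {
    101: [(0, 0), (4, 0), (4, 3), (0, 3)],
    102: [(4, 0), (9, 0), (9, 3), (4, 3)],
}

# Thematic side: feature id -> attributes (tabular/relational).
thematic = {
    101: {"landuse": "forest", "owner": "state"},
    102: {"landuse": "farmland", "owner": "private"},
}

# Mode 1: select geometrically (here: polygons lying entirely west of
# x = 5), then look up thematic data through the key.
selected = [fid for fid, poly in geometry.items()
            if all(x < 5 for x, _ in poly)]
print([thematic[fid]["landuse"] for fid in selected])  # ['forest']

# Mode 2: select on attributes, then fetch geometry for display/analysis.
farm_ids = [fid for fid, row in thematic.items()
            if row["landuse"] == "farmland"]
print([geometry[fid][0] for fid in farm_ids])  # [(4, 0)]
```

With the key lookups going through two independent stores, keeping the two sides consistent under concurrent updates is the application's problem, which is the dichotomy drawback noted below.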
To facilitate data sharing in multi-user environments, a new data server product has been developed (ArcStorm). ESRI has also, in co-operation with Oracle Corp., developed a separate product called the spatial database engine (SDE) on top of the Oracle RDBMS [ESRI95b]. The SDE application interface provides integrated storage of all geographical data in an RDBMS. It also provides more fine-grained concurrency control.

The hybrid geographical data model of ARC/INFO has its advantages in that both thematic searching and spatial searching can be performed in the most efficient way. The problem with the approach is that integrated concurrency control, consistency checking and recovery are complicated by such a dichotomy.

Database management systems

Many of the most popular commercial DBMSs providing a relational (SQL) interface can host ARC/INFO's INFO part. One can therefore profit from the state of the art of relational databases, and from any future enhancements to this technology. Transaction management, concurrency control, recovery and monitoring are supported by RDBMSs and are therefore readily available for the INFO part of the database. The ARC part, however, has very limited capabilities in these respects, being primarily a geometrically optimised data structure.

* INFO is a relational DBMS developed by Henco Corp., used in ARC/INFO under licence from Henco.

Some examples of the functionality provided by ARC/INFO

To indicate what the state of the art is in GIS, some of the operations provided by ARC/INFO are listed.
• Geographical data set (called coverage) overlay routines.
Coverage overlay can be performed for polygons on polygons (with elimination of spurious polygons and reclassification through polygon joins), lines on polygons and points on polygons
• Buffer zoning (for instance around a road or a building)
• Manual and automatic data editing of both geometry and thematic information (the ARCEDIT function family)
• Support for the most common map projections and transformations
• Datum adjustments
• Interfaces/data transfer to and from other systems (more than 20 conversion interfaces: Scitex, IGES, TIGER/Line, MOSS GIS export files, AutoCad DXF, MIADS, Gerber, SOSI, …)
• Output for a variety of devices
• Network analysis with route planning and optimal resource allocation (for instance students to schools)
• Analysis and presentation of digital elevation models (DEMs) with thematic information imposed is supported through triangulated irregular networks (TINs). There are also functions for contouring on DEMs
• Support for the use of external statistical software packages with ARC/INFO
• Limited integration of raster and vector data (in a separate GRID module)
• Display and query routines for raster and vector data (the ARCPLOT family of functions)
• Multi-user control for ARC (locking of map sections)

The comprehensive set of functions available indicates that GISs have grown into useful tools for the administration of geographical data. Most of the other available GISs provide similar functionality.

Environment

ARC/INFO has been ported to many different operating systems, among them DOS, VMS, Windows NT and many flavours of UNIX. The command-line user interface is cumbersome, with a lot of commands to remember, and is most useful for expert users. Workstation usage has become easier after the X Window System* protocol was supported.
For less advanced users, ESRI has developed ArcView, a menu-based, user-friendly system for browsing through and presenting geographical data that are organised as ARC/INFO coverages. ArcView is available for the MS-Windows family and X-Windows.

* X Window System is a trademark of The Massachusetts Institute of Technology

Performance

Performance is a problem for all GISs of today. This is due to the potentially huge amounts of data necessary to perform an analysis. Response times for queries on limited data sets are acceptable, but as the amount of data increases, long computation and search times must be expected. There are also limits on the amount of data that can be processed in one run; especially for PC ARC/INFO these limitations are noticeable. To be able to perform analysis and presentations interactively on even limited data sets, a powerful hardware platform must be used.

3.7.2 System 9*

System 9 was introduced by Wild in 1985-1986. A discussion of System 9 is included here because all data management (both geometrical data and thematic data) is taken care of by a relational database system. This presentation is based on literature from the late 1980s, so it might not be representative of the current state of the system.

The database management approach of Wild/Prime System 9 is slightly different from the dichotomy used by ARC/INFO [Morehouse85, Lauzon85, Charlwood87]. System 9 stores all its data (both geometrical and property data) using the relational database model. The layered architecture of System 9 is shown in Figure 3-3. System 9 has built an object-oriented extension shell around the relational database system, and this shell provides a geo-object interface. Variable-length fields (varchar, blob/bulk) of the relational database system have been utilised to store text, lists (for instance of the coordinates of a line segment) and images.
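Storing a coordinate list in an unstructured variable-length field can be sketched as follows (a hypothetical illustration, not System 9's actual encoding): the shell serialises the typed data into bytes before storage and restores the structure on retrieval.

```python
import struct

# Hypothetical illustration of packing a coordinate list into a blob:
# the DBMS sees only an opaque byte string; the object-oriented shell
# knows the structure and restores it on retrieval.

def pack_line(coords):
    """Serialise [(x, y), ...] into a blob: a count followed by doubles."""
    blob = struct.pack("<I", len(coords))
    for x, y in coords:
        blob += struct.pack("<dd", x, y)
    return blob

def unpack_line(blob):
    """Restore the coordinate list from the opaque blob."""
    (n,) = struct.unpack_from("<I", blob, 0)
    return [struct.unpack_from("<dd", blob, 4 + 16 * i) for i in range(n)]

line = [(58.1, 7.0), (58.2, 7.1), (58.3, 7.3)]
blob = pack_line(line)          # what the relational system stores
restored = unpack_line(blob)    # what the shell presents externally
```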
A blob field has no internal structure from the DBMS point of view, so the data types embedded in blob/bulk fields are taken care of by the object-oriented shell, and the structure is thus seen at the external level.

* System 9 is a trademark of Computervision GIS, Inc., a subsidiary of Computervision, a subsidiary of Prime Computer Inc.

[Figure 3-3: The System 9 architecture (based on [Charlwood87]). A layered stack: applications issue queries to an application interface; an object shell provides object functions, an object cache, data dictionary functions, variable-length list functions and generic read/write/update routines; a kernel of database functions sits on top of the relational database system.]

System 9 identifies the primitive spatial types: node, line and surface. These primitives are used to build more complex types as necessary. The data are stored non-redundantly, so that a line primitive that is shared between a border and a road is stored once and referenced from both the road object and the border object. This is advantageous both from a consistency point of view and a data management point of view [McLaren86]. A minimum enclosing rectangle is stored with every spatial object to facilitate efficient spatial operations.

The query language is an extended SQL. SQL is enhanced to handle references between spatial entities, to handle queries to the blob/bulk fields and to handle spatial relationships (overlap, connectivity, containment). System 9 uses the relational database system for report writing, transaction logging, security, recovery and rollback. Concurrency control is also provided by relational database systems, but is not fully utilised in System 9. Using the object-oriented extension shell for the geometry, and hiding much of the structure in unintelligent bulk fields, complicates advanced concurrency control and recovery.

A System 9 geo-database is split into self-contained databases called "projects".
These are again split into working subsets called "partitions". A particular data item can be modified through only one pre-defined partition, but is available for read-only access by other partitions. Long updating transactions can be supported by checking partitions in and out of the projects. System 9 uses caching in the object shell extensively. This should not pose too many problems for coarse-grained concurrency control mechanisms (locking of database partitions), but it limits the possibilities for applying more fine-grained locking strategies (e.g. object-level locking) in that it puts greater demands on the cache manager to secure consistent multi-user operation.

The System 9 approach seems to be sound and efficient. The development of GIS databases is provided in a smoothly integrated fashion. Tools (called "table generator") are provided for the creation of database tables and interface routines from a description of the database in a binary data model.

3.7.3 TIGRIS

Intergraph corporation has developed TIGRIS (Topologically Integrated Geographic and Resource Information System) [Herring87], [Herring89], [Herring90]. The reason for presenting it here is that it was the first commercial GIS utilising object-oriented methodology and an OODBMS. TIGRIS has full support for topology, and uses object-oriented methods for the integrated representation and storage of thematic and geometrical data.

The geographical object system of TIGRIS is layered. At the bottom there is the topological level (node, edge or face). The next level collects topological objects into feature components (the simplest physically homogeneous features represented in the data, subclassed as point, line or area). Subsequent levels collect feature components and other features into more complex and abstract entities [Herring87].
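This layering can be sketched as a small class hierarchy (all class and attribute names are hypothetical; the actual TIGRIS classes are not described in the sources): topological primitives at the bottom, feature components referencing them, and features aggregating components.

```python
# Sketch of a layered geographical object system in the spirit of
# TIGRIS: topology -> feature components -> features. All names are
# hypothetical illustrations.

class Node:                      # topological level
    def __init__(self, x, y):
        self.x, self.y = x, y

class Edge:                      # topological level
    def __init__(self, start, end):
        self.start, self.end = start, end

class LineComponent:             # feature component referencing topology
    def __init__(self, edges):
        self.edges = edges

class Feature:                   # abstract entity built from components
    def __init__(self, name, components):
        self.name = name
        self.components = components

    def edge_count(self):
        return sum(len(c.edges) for c in self.components)

# A tiny road feature built bottom-up through the layers.
a, b, c = Node(0, 0), Node(1, 0), Node(2, 1)
road = Feature("E18", [LineComponent([Edge(a, b), Edge(b, c)])])
```

Because each layer only references the one below, a topological primitive (such as an edge on a shared boundary) can be referenced by several features without being duplicated.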
The information base is organised into objects, and each object can be investigated in isolation to determine its attributes, relationships and behaviour. TIGRIS provides both a procedural and a declarative (SQL-like) query interface to the data sets [Herring88], [Herring89]. TIGRIS is developed for Intergraph workstations running the UNIX operating system. A separate file server or database machine is recommended for environments that need to store large amounts of data. Spatial analysis and queries are supported extensively. Herring describes a totally customisable environment for querying the geographical database [Herring88]. Multiqueries (sequences of connected queries) and user-defined SQL-like macros are provided, and it is possible to link in optimised, pre-compiled procedures to perform parts of the queries or time-critical operators.

TIGRIS has been using its own OODBMS, and Intergraph has been awaiting further progress on object-oriented database systems. The lack of suitable OODBMSs has probably caused Intergraph to keep a low profile in their marketing of TIGRIS.

3.7.4 Smallworld GIS

Smallworld GIS is a relative newcomer on the GIS market. It is completely based on object-oriented technology, both for programming the system and for data management [Newell92]. It provides object-level concurrency control and a limited version control capability. In contrast to TIGRIS, it has been made very visible through marketing. Smallworld GIS is fully customisable, and a customisation is normally done for each customer. Magik is the object-oriented programming environment that is used. Its interactive environment is inspired by Smalltalk, while its procedural syntax is Algol-like. An extensive library of standard object classes is available. The database schema can be developed incrementally by adding new classes. Magik is used for system development, application development and customisation.
The developers claim that the Magik environment provides faster development, reduced efforts in programming and maintenance, and easy transfer to new (hardware) platforms [Chance90a], [Chance90b].

The Smallworld system provides integrated version management (the management and merging of a hierarchy of alternative versions). It uses an optimistic approach, which means that no actions are taken before an attempt is made to insert a new version into the database. Semantic locking is also possible in a version-managed database (limiting database access for a user geographically or thematically), and concurrency control can be based on version management [Easterfield90], [Newell91b].

Figure 3-4 The Smallworld open software architecture ([Chance90a])

A virtual database functions as the application database interface, and provides a seamless interface to all databases (local and external), as shown in Figure 3-4. Versioning is handled at the virtual database level. The fundamental persistent storage of Smallworld GIS is tabular, but it is made to look like an object data structure through encapsulation [Newell92]. Clustering has been used to speed up storage access. The clustering is based on a spatial key (linearisation), generated when the object is first created (the clustering key does not change even if the geometry changes) [Newell91a].

3.7.5 GRASS

The Geographical Resource Analysis Support System (GRASS) is a public domain raster-based GIS. Version 4.0 was completed in 1991, and in 1996 the current version was 4.1 [GRASS93]. GRASS has been ported to many UNIX platforms, and is a collection of utilities covering most aspects of GIS. The module library has been developed by interested users and researchers all over the world, but the initial effort on programming and system design was made by the US Army Construction Engineering Research Laboratory (USACERL) [GRASS95].
The source code for all the software components in the system is available to all interested users. This means that further development of the modules and the total environment can be performed by everyone, leading to an increasingly powerful system. CERL has been co-ordinating the contributions into new releases of GRASS. The GRASS concept is a toolbox approach, and most of the modules operate independently. Data are organised in a UNIX file hierarchy and are placed at a user-defined location in the UNIX file system. GRASS uses its own internal formats for vector, raster and site data, but provides conversion between its internal formats and many other spatial data formats. GRASS is a raster system, and provides a comprehensive set of tools for raster operations and analysis. Some example applications of GRASS for environmental modelling and visualisation are presented on the CERL WWW server [CERL95].

The user interface of the GRASS toolbox is by default a command-line based conversation. For some of the GRASS tools, more sophisticated interfaces have been built. An X Windows interface has been used for display purposes (an integration of a subset of the display routines), and an integrated X Windows based environment is under development and is being shipped with the GRASS releases [Gardels88].

3.7.6 Summary

There are now many powerful geographical information systems available, both commercially and as public domain software. Many of the systems are still in their infancy, and have problems with reliability and robustness. The available GISs are continuously improving both in computational power and in expressiveness. The analytical capabilities of the systems are improving, and the systems take advantage of increasingly powerful hardware platforms to obtain better performance for all kinds of operations. User interfaces are slowly following the trends in the rest of the computer science field, with multiple windows and nice graphics.
GIS-specific features such as spatial interaction techniques (query "languages"), spatial analysis, visualisation of spatial features and spatial data quality are still challenging areas for research, and need further attention.

The data management part of GIS is the one causing most trouble for both the users and the vendors. Most GIS data sets are complex and huge, and perpetually growing. Traditional database management systems have been abandoned for managing geometrical data by most GIS vendors. Instead they have settled for some custom data structure to store the spatial data. Attribute/thematic data are not as problematic as the geometry data, and suit traditional relational DBMSs well. DBMS interfaces are therefore provided by many vendors for the storage of the non-spatial data. This approach to data management has its problems, particularly for transaction management and multi-user support.

There is limited multi-user concurrency control available in today's GISs, apart from the check-in/check-out mechanism in some systems. This means that cooperative work using GISs is inhibited. The extent to which cooperative work is useful in GIS should determine the efforts put into this problem area. Transaction management research in related areas, such as Computer-Aided Software Engineering (CASE) and Computer-Aided Design (CAD), could be useful also within the GIS context.

3.8 Trends

This section tries to outline the trends that are believed to be most influential for GIS technology in the next twenty years. The predictions for hardware development are not very controversial, involving only enhancements to current technology. When it comes to technology trends, the predictions are a little more speculative, but not very controversial. The conclusions reached are in line with previous forecasts [Dangermond86].

3.8.1 Hardware trends

• Less expensive and more powerful processors. The price of microprocessors seems to fall without limits.
The ultimate limit is the price of the piece of "metal" and the energy required to make the wafer and the microprocessor logic. This limit could be in the order of 10 NOK (1 US$), which is pretty inexpensive. The power of microprocessors will probably finally be limited by the speed of light and the cooling requirements of the processors. That limit has not yet been reached.
• Less expensive and higher capacity transistor-based memory. By 1994, RAM (Random Access Memory) cost about 200,000 NOK (30,000 US$) per Gigabyte, 16 MB RAM chips were available off the shelf, and 64 MB RAM chips had already been tested. The limit does not seem to have been reached. In parallel with this development, ever faster RAM is emerging.
• Less expensive and higher capacity secondary storage devices (optical and magnetic disks). By 1991, magnetic disk memory cost about 30,000 NOK (5,000 US$) per Gigabyte; by 1995, less than 2,000 NOK per Gigabyte, and the price is still going down. Disk capacities of several Gigabytes are now common (in 1994, 18 Gigabyte disks were available), and the technology is still advancing. In 1994, magnetic tapes were able to store about 20 Gigabytes of data per tape cartridge, and the capacity is still increasing. Hierarchical Storage Management (HSM) systems were in 1994 able to store many Petabytes (10^15 bytes) of data in one single system. HSMs are normally based on extensible robot-operated multi-tape/optical disk archives at the bottom of the hierarchy, and magnetic disks at the top, all integrated into a single cabinet with an advanced interface (currently SCSI).
• Proliferation of high capacity (speed and volume) local and long haul networks. Optical fibres are being introduced everywhere, both in local area networks and for long distances. Hopefully, the high-end ISDN* services will become available for computer networks soon. ATM**-based networks also promise new capacity and speed improvements (ATM bandwidths are currently 155 Mbit/s, 622 Mbit/s and 1.2 Gbit/s).
* Integrated Services Digital Network

The following observation can also be made: the gap between processor speed and I/O speed is widening at a fast pace. The increase in CPU speed has been about 70% per year, while the increase in I/O speed (magnetic disk based) has been 10-20% per year. The I/O subsystem is therefore becoming more and more of a bottleneck in computers, and optimisations in this area will be increasingly important (some progress has already been made using RAID technology).

3.8.2 Technology trends

• Proliferation of inexpensive and powerful parallel computers. This seems to be a little further away, but transputer-based environments (using for instance the OCCAM language) have been available for a while, and inexpensive microprocessors will accelerate this trend.
• A shift to parallel processing for computationally intensive applications, and the introduction of new tools for programming parallel computers. Parallel computers have developed from toys to commercial products. Meiko, Parsytec, Siemens, Intel (Paragon), MasPar, Parsys, SGI, Pyramid, Convex, IBM (SP2), Cray (T3D), nCUBE, Hitachi and NEC are current actors on the hardware side of this arena. This indicates that the technology has matured, and is taken seriously by the major hardware vendors. It also shows that research in the area of parallel processing is given priority.
• Use of parallel computers for data management. The Tandem Non-Stop-SQL server was the first product to exploit parallelism for data management. Teradata has been another actor on the scene with several products (these products were taken over by AT&T); their first product was the DBC1012. Parallelism is used for improving security (by duplication of hardware, software and data) and for providing better performance. The principal use so far has been in "high volume - short transactions" environments.
• RAID technology will probably be applied more extensively in the future in order to improve the performance and reliability/fault tolerance of disk-based I/O subsystems.
• Advances in computer storage methods. If the advances in computer storage technology continue in the future, we must expect new types of "permanent" storage devices that are orders of magnitude faster than the present mechanical disk technology. In the future, "permanent" computer storage must be expected to reach the same speed of access and the same compactness as today's volatile memory. We have already seen solid state disks (transistor-based) of several hundred megabytes capacity. Further development here will mean that database technology will be able to move into a new performance dimension.
• Network access to distributed services. A future scenario is that computing will become distributed, with application servers and database servers providing all kinds of applications and data (probably for some fee) through standardised interfaces. The user chooses which services to use based on his/her requirements (e.g. text processing, spatial analysis) and the services that are available. There will probably be a reduced need for local software and data. Such an environment requires standardisation of service interfaces and a service broker. The Object Management Group* (OMG) is specifying CORBA** [Soley95] to provide such an environment, while Microsoft uses OLE for the same purposes. Schek has done some investigations in a GIS context [Schek93], and the OGC*** is working on OGIS**** to specify interfaces for geographical information and analysis services (some information has been available on the internet [OGIS95]). ISO is also working on standards in this area (ISO TC211).

** Asynchronous Transfer Mode

3.8.3 GIS trends

• Ever increasing amounts of digitally available geographical information.
• More advanced users and usage of geographical data.
3D visualisations, complex analysis including large amounts of data, …

All these trends will be welcomed by both GIS users and vendors, and should provide a good basis for the development of the GISs of the future. The systems will hopefully continue to improve, to become capable of more and more demanding tasks.

* The OMG comprised over 300 companies early in 1995. It promotes the object-oriented approach and develops standards for open distributed processing based on object-oriented methodology [Soley95]
** The Common Object Request Broker Architecture
*** Open GIS Consortium, Inc. Consists of GIS vendors, computer vendors and federal agencies. They are supported by some university research activity. OGC plans to produce a set of proposed standards by late 1996. These proposals will be submitted to ANSI and ISO.

3.9 The GIS of the future

To be able to cope with the challenges of the future, geographical information systems must take advantage of the available technology to develop and mature into generally useful systems. A list of requirements for the next generation of general-purpose GISs follows.

Data model
• full topological capabilities (networks and manifolds)
• full 3D model
• full integration of the raster model and the vector model (support for fields)
• temporal data
• quality measures incorporated into data, procedures and presentations
• advanced support for data sharing

Database
• non-stop operation
**** Open Geodata Interoperability Specification (trademark)

• utilisation of large amounts of main memory for the buffering of complete data sets
• integration of imagery, geometry and thematic information
• possibilities for transparent distribution and parallelised database operations
• standard application interfaces (data dictionary, query facilities)
• full support for concurrent and cooperative usage
• integration of heterogeneous databases

GIS processing capabilities
• full support for all aspects of spatial analysis, including network and manifold analysis
• advanced image processing, including (semi-)automatic digital photogrammetry
• full 3D processing for analysis and visualisation (including VR applications)
• support for distributed, parallel spatial query processing

Environment and user interfaces
• advanced visualisation techniques (e.g. 3D views and animations)
• standardised interfaces for user interaction (e.g. multiple "windows")
• standardised interfaces for data exchange from external data servers

This long list of requirements is not easy to fulfil. A lot of research is needed, specifically on database and data modelling issues. This research will have to take into account the requirements of interactive GIS processing applications, and will have to find ways of applying distribution and parallel methods in geographical data processing.

3.9.1 Servers of geographical information

A geographical information server is an agent that provides geographical data to local and/or remote geographical information systems. Such servers are expected to play an important role in GIS environments due both to performance considerations [Dowers90] [Healey91] and to the increasing necessity of data sharing. Geographical data come from a variety of sources (national mapping agencies, utility management companies, census bureaux, satellite programmes).
The owners of the data generally want to control the availability of the data, they want credit for data usage, and they want to be responsible for keeping the data continuously up to date. All these observations support the concept of a non-centralised approach to GIS data management. Each data supplier will have to maintain its own local database containing all the information that should be available to the potential customers. The database should be attached to a high-speed, world-wide computer-computer communication network, and should support a standardised geographical query interface. The database content will have to be described (using metadata) in such a way that potential users can assess the suitability of the data before acquisition. It will probably not be necessary to support updates through such a geographical data interface, since all modifications are expected to be done locally.

A functional analysis of this kind of server will have to be based both on our current technological and social context and on a scenario for the future. The analysis has to provide a set of functions that a geographical information system has to provide, describe the complexity of these functions, and examine the possibility of dividing the workload between the GISs and the geographical data servers. The functional analysis has to give an indication of the processing power and data storage capacity needed at the different levels of the system.

Transaction processing

A geographical data server will not have to meet the same requirements as a typical transaction system (a server for bank accounts, say). The transaction rates for geographical data will generally be lower, but the complexity of the individual transactions will be higher. The fact that most external transactions will be read-only, and that the database will be a historical database, simplifies concurrency control and transaction management.
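A metadata-described catalogue of the kind outlined above can be sketched as follows (all dataset names, metadata fields and the suitability test are hypothetical illustrations): a client filters the server's metadata to assess suitability before acquiring any data.

```python
# Sketch of a metadata catalogue for a geographical data server.
# Dataset names and metadata fields are hypothetical.

catalog = [
    {"name": "roads_oslo", "theme": "transport", "scale": 50000,
     "bbox": (10.5, 59.8, 11.0, 60.0), "updated": 1996},
    {"name": "soil_survey", "theme": "geology", "scale": 250000,
     "bbox": (5.0, 58.0, 12.0, 63.0), "updated": 1994},
    {"name": "rail_network", "theme": "transport", "scale": 100000,
     "bbox": (4.0, 58.0, 31.0, 71.0), "updated": 1995},
]

def suitable(entry, theme, max_scale, region):
    """Assess suitability from metadata alone, before any data transfer."""
    x1, y1, x2, y2 = entry["bbox"]
    rx1, ry1, rx2, ry2 = region
    covers = x1 <= rx1 and y1 <= ry1 and x2 >= rx2 and y2 >= ry2
    return entry["theme"] == theme and entry["scale"] <= max_scale and covers

# A client looking for detailed transport data covering central Oslo:
region = (10.6, 59.9, 10.8, 59.95)
candidates = [e["name"] for e in catalog
              if suitable(e, "transport", 100000, region)]
```

Only after this metadata-level assessment would the client issue read-only queries for the data itself, which is consistent with the observation that updates need not be supported through the external interface.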
These topics are discussed further in chapter 6.

3.10 Research and research issues

Interest in research on GIS has been steadily increasing for the last 20 years or so. The main research conferences to date have been Auto-Carto (from the 1970s), the Symposiums on Spatial Data Handling (SDH, from 1984) and the Symposiums on Large Spatial Databases (SSD, from 1989). A research journal was established in 1987, the International Journal of Geographical Information Systems (it changed its name to the International Journal of Geographical Information Science in 1997).

NCGIA

The NCGIA* was established by the US National Science Foundation (NSF) in 1988, and is run as a co-operation project between three US universities with special interests in GIS. The co-operating universities are: the State University of New York, Buffalo; the University of California, Santa Barbara; and the University of Maine. The NCGIA has initiated and performed a lot of research in the field of GIS, and has been the single most important contributor to GIS research in the world since its establishment.
The NCGIA bases its research on initiatives; the first 16 of them are listed below:
Initiative 1: Accuracy of Spatial Databases
Initiative 2: Languages of Spatial Relations
Initiative 3: Multiple Representations
Initiative 4: Use and Value of Geographic Information
Initiative 5: Very Large Spatial Databases (VLSDB)
Initiative 6: Spatial Decision Support Systems
Initiative 7: Visualisation of the Quality of Spatial Information
Initiative 8: Formalising Cartographic Knowledge
Initiative 9: Institutions Sharing Geographic Information
Initiative 10: Spatio-Temporal Reasoning in GIS
Initiative 11: Space-Time Modelling in GIS
Initiative 12: GIS and Remote Sensing
Initiative 13: User Interfaces for Geographic Information Systems
Initiative 14: GIS and Spatial Analysis
Initiative 15: Multiple Roles for GIS in US Global Change Research
Initiative 16: Law, Public Policy and Spatial Databases

* National Center for Geographic Information and Analysis

The initiatives of the NCGIA are supposed to cover most of the current research issues in GIS. The results of the work on the initiatives are published in technical reports that are generally available.

GISDATA

In Europe, the European Science Foundation (ESF) established a scientific programme called GISDATA in 1993, supposed to run until the end of 1996. GISDATA is meant to play a similar role to the NCGIA in Europe, and initially the following research areas were proposed [GISDATA93]:
• Geographical Databases
• Geographical Data Integration
• Social and Environmental Applications

Later, new research areas were given focus [GISDATA95]. For 1995, the research areas were
• Data Quality
• RS and Urban Change
• Spatial Models & GIS
For 1996, the research areas will be
• Spatial and Temporal Change in GIS
• Geographical Information: the European Dimension
• GIS & Emergency Management
The GISDATA programme had, as of November 1995, resulted in six books and some other publications [GISDATA95].

Chapter 4 Data model requirements

4.1 Introduction

Geographical Information Systems should be able to handle complex, real-world information. Geographical data can have many uses, and different kinds of applications will apply the data in different contexts. To make the sharing of geographical information possible, standardised (core) data models for spatial data should be developed and agreed upon. These models must include powerful mechanisms for representing geographical phenomena with their abstractions, constraints, attributes and relationships. Such a standardised core for geographical data models would provide a sound basis for the development of special-purpose data models. The aim of this chapter is to identify requirements for such a core data model.

Several national initiatives have been taken to specify data models and exchange formats for geographical information in order to facilitate easy exchange and sharing of GIS data. Examples are FGIS/SOSI of Norway, ATKIS of Germany and the SDTS of the USA. All of these seem to have their limits and weaknesses, and there is not yet consensus on what kind of model should constitute the basis for an internationally acceptable standard. The lack of a good common (useful for both humans and computers) data model for geographical information has long impeded GIS technology and use [Peuquet84]. There has been much research on data models for geographical data, and some progress has been made. Throughout this chapter, references to this research will be provided.

In this chapter, a list of the properties of geographical data relevant to modelling is put together, and requirements for geographical data modelling tools are identified. The last part of the chapter reviews some national efforts on standardisation for geographical data modelling and exchange.
4.2 Geographical data revisited
Data modelling and GIS have been presented in chapters 2 and 3 respectively. To provide a basis for the discussions on data models for geographical data, some distinctive properties of such data relevant to modelling are identified and outlined. The issues presented here will be elaborated on further in later sections.
4.2.1 Borders of geographical phenomena
As mentioned in chapter 3, spatial measurements can be divided into measurements on geographical object structures and measurements on continuously varying geographical phenomena. It is important for the data modeller to be aware of the difference between these two paradigms.
• It is not meaningful to provide exact borders for soil types, geological features, vegetation types and lakes in nature. Most natural borders are fuzzy, and their locations depend on the time of measurement, human interpretation and classification [Burrough86]. What should be provided for “deep” spatial analysis are the (time series of) spatial samples underlying the classifications. Examples of such samples are drilling probes, soil profiles, elevation points, water surface levels, rainfall, wind and temperature measurements. From these samples, classifications can be performed to produce for instance soil maps, geological maps, rainfall maps and elevation maps with accuracy and confidence measures attached. The resulting classification manifold (with its accuracy measures) could then again be used as input data in other application environments.
• It is meaningful to store political boundaries, economical boundaries and land-use boundaries as exact geometrical lines in topological manifold structures, and to store roads, tubes, railways and cables as edges in topological network structures.
This dichotomy must be reflected in the data model.
A general purpose geographical database should always provide the original measured data (describing the earth) as the basis for analysis, classification, planning and visualisations.
4.2.2 Features of geographical data
• Spatial/geometrical objects
The inclusion of positional information for the storage and manipulation of earth-based spatial objects is the single most distinctive feature of geographical data. Geographical phenomena are spatial phenomena pertaining to the earth, most of which are constrained by the earth’s surface. One can define a very limited number of basic (generic) spatial object types as a basis for developing data models for such phenomena:
- Points in 2D and 3D space (graph vertices [Wilson85]), e.g. a trigonometric point
- Lines in 2D and 3D space (graph edges [Wilson85]), e.g. a cable
- Fields over 1D (line) features (“1.5D”; can represent functions of position along a line: f(k), where 0 ≤ k ≤ length(line), or f(p(u)), where p(u), 0 ≤ u ≤ 1, is a parametric representation of the position along a line), e.g. elevation along a road
- Regions in 2D space (a face of a plane graph [Wilson85], “2D object homeomorphic to a disc” [Egenhofer92]), e.g. a property lot
- Fields over 2D features and 1D features in 2D (can represent functions of position in 2D: f(x,y) | (x,y) ∈ feature), e.g. rainfall
- Volumes in 3D space, e.g. a cloud
- Surfaces in 3D space (e.g. a general parametric surface patch, such as p(u,w) = (x(u,w), y(u,w), z(u,w)), 0 ≤ u ≤ 1, 0 ≤ w ≤ 1)
- Fields over 1D, 2D or 3D features in 3D space (can represent functions of position in 3D: f(x,y,z) | (x,y,z) ∈ feature), e.g. grain size distribution over a sand/gravel reserve
For these spatial object types the temporal dimension is also of interest. There are many options for representing these basic objects geometrically, both with respect to data structures/coding and with respect to the selection of an adequate reference system (e.g. datum and projection).
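As an illustration, the basic spatial object types listed above could be sketched as simple data structures. This is only an illustrative sketch in Python; the type names and fields are invented for the example, and a real implementation would also carry reference-system metadata (datum, projection):

```python
# Illustrative sketch of some basic (generic) spatial object types.
# All names are invented for this example; a real schema would also
# record the reference system (datum, projection) for each object.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Point:
    x: float
    y: float
    z: Optional[float] = None   # None: assumed to lie on the earth's surface

@dataclass
class Line:                     # a graph edge, given as a point sequence
    vertices: List[Point]

    def length(self) -> float:  # planar length, ignoring elevation
        return sum(((b.x - a.x) ** 2 + (b.y - a.y) ** 2) ** 0.5
                   for a, b in zip(self.vertices, self.vertices[1:]))

@dataclass
class LineField:                # a "1.5D" field: f(k), 0 <= k <= length(line)
    line: Line
    f: Callable[[float], float]  # e.g. elevation as a function of distance

road = Line([Point(0, 0), Point(3, 4)])
profile = LineField(road, lambda k: 100.0 + 2.0 * k)  # invented elevation profile
print(road.length())             # 5.0
print(profile.f(road.length()))  # 110.0
```

The field-over-line type shows how continuous variation along a feature can be kept separate from the feature's geometry.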
• Samples
As a rule, natural phenomena have fuzzy borders and variation over their interior. The characteristics of most natural phenomena at a point in space are, however, often highly correlated with those of the point’s nearest surroundings. A very common way to get assessments of natural phenomena and resources is therefore to take samples or probes at selected locations within the region of interest, and then use statistical methods (interpolation) to get measures for other locations. Samples can be based on points, lines, regions or volumes (in the last three cases, one will have to have some kind of sampling or aggregation within the sample).
• Images
Regular samplings, such as digital satellite images and digital aerial photographs, in addition to maps and sketches, make up a significant part of most GIS data sets, both as input (satellite images, aerial photographs) and output (maps, sketches).
• The earth’s surface
The earth’s surface is the platform for a very large group of geographical phenomena, such as vegetation, rainfall, administrative and land use units, roads and rivers. By having a good basic 3D (or 2 1/2D) model of the elevation of the earth’s surface, such phenomena can be represented using only 2D/planar coordinates ((easting, northing) or (latitude, longitude) pairs) in a suitable map projection, knowing that elevation information can be found using the 3D surface model.
• Spatial relationships
A variety of spatial relationships are used in everyday life, such as above, in front of, behind, at, inside and between. Many of them are fuzzy and inexact, while some are well defined. An example of the latter is topology, describing geometrical properties that stay invariant under translation, rotation and scaling. Examples of topological relationships are inside, outside, overlapping, on and bordering.
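As a small illustration of one of the well defined topological relationships just mentioned, the inside/outside relation between a point and a simple polygonal region can be tested with the classic ray-casting method. This is a sketch, not tied to any particular GIS:

```python
# Sketch: testing the topological relationship "inside" for a point and a
# simple polygonal region, using ray casting (count boundary crossings of
# a horizontal ray from the point to the right).
from typing import List, Tuple

def point_in_region(p: Tuple[float, float],
                    boundary: List[Tuple[float, float]]) -> bool:
    """True if p lies inside the closed polygon given by `boundary`."""
    x, y = p
    inside = False
    n = len(boundary)
    for i in range(n):
        (x1, y1), (x2, y2) = boundary[i], boundary[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge straddles the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                           # crossing to the right
                inside = not inside
    return inside

lot = [(0, 0), (10, 0), (10, 10), (0, 10)]   # a square property lot
print(point_in_region((5, 5), lot))          # True
print(point_in_region((15, 5), lot))         # False
```

An odd number of crossings means the point is inside; an even number means outside, which is exactly the invariant-under-transformation property topology captures.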
Topology of spatial objects
Topology [Peucker75] is an explicit representation of the spatial relationships derivable from connectedness and neighbourhood through borders.
- Knot-points (vertices) are introduced where lines meet
- A line (edge) has two end-points (always knot-points (vertices), but not necessarily distinct)
- A region (face of a plane graph) has bounding edges
- A volume has bounding surfaces
Topological analysis does not require knowledge of the underlying (e.g. Euclidean) geometry of the objects.
• Complex objects / aggregation relationships
An aggregation is an abstraction in which a relationship between objects is regarded as a higher level object [Smith77].
- a telephone network is an assembly of cables, coupling boxes, switching boxes and phones
- a property is an aggregation of parcels
- a country is an aggregation of districts
- a building can be an assembly of elements (walls, roof, cables, tubes)
- a water system is an aggregation of rivers, lakes and streams
It should be possible to identify the constituents of a compound object, and it should also be possible to identify all the complex objects that an object takes part in.
• Generalisation / specialisation relationships
Generalisation (in the classification sense) is an abstraction in which a set of similar objects is regarded as a generic object [Smith77]. Instantiation is the inverse of this kind of generalisation. A group of similar/related GIS object types can be generalised (in the modelling sense) into a generic object type that covers the common properties of the group. Specialisation is the inverse of generalisation. One starts out with a generic object type and arrives at more specialised object types.
Some examples of generic geographical object types (more specialised object types in parentheses):
- forest compartment (birch parcel, spruce parcel, felled (no forest) parcel, …)
- political area (nation, county, municipality, …)
- building (factory, office building, block of flats, villa, …)
- image (satellite image, scanned photograph, perspective drawing, map, …)
Generalisation/specialisation is a useful tool for finding an appropriate level of detail when modelling geographical reality.
• Category
A category is a grouping of entity types that play the same role in some relationship. A category is similar to a generalisation, but while a set of attributes is the common denominator for the entity types in a generalisation, a set of relationships is the common denominator for the entity types in a category. An example of a situation where the category abstraction is useful is the property-owner relationship, where the owner side of the relationship can be either a person or a company. The category could also be used to model the land-cover manifold.
• Other relationships for spatial data
Relationships other than topological/spatial relationships, aggregations and generalisations also exist for geographical data. The coupling of non-spatial data with spatial objects is particularly important in many GIS applications.
- ownership of land parcels
- coupling of census data to political units
- legal decisions pertaining to a property
- coupling of climate measurements to the positions of the observation sites
• Temporal behaviour and versions
Very few geographical objects are static through history. Some objects exist only for a limited period of time, and many evolve as time passes (seasonal change, harvesting, wearing, …). This temporal behaviour is interesting for many GIS applications (analysis, monitoring, statistics).
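A minimal sketch of how such temporal behaviour could be captured with valid-time intervals, so that the state of an object can be queried "as of" a date (the structure and names are invented for illustration):

```python
# Sketch (hypothetical structure): temporal behaviour recorded as
# valid-time intervals, one ObjectVersion per state of the object.
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class ObjectVersion:
    attributes: dict
    valid_from: date
    valid_to: Optional[date]          # None: still valid

def as_of(versions: List[ObjectVersion], when: date) -> Optional[dict]:
    """Return the attribute state valid at `when`, if any."""
    for v in versions:
        if v.valid_from <= when and (v.valid_to is None or when < v.valid_to):
            return v.attributes
    return None

# A forest parcel that was felled in 1990 (invented example data).
parcel_history = [
    ObjectVersion({"use": "forest"}, date(1970, 1, 1), date(1990, 6, 1)),
    ObjectVersion({"use": "felled"}, date(1990, 6, 1), None),
]
print(as_of(parcel_history, date(1980, 1, 1)))  # {'use': 'forest'}
print(as_of(parcel_history, date(1995, 1, 1)))  # {'use': 'felled'}
```

The single time axis is what makes temporal objects the "single-threaded" special case of versioned objects discussed next.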
Temporal objects have much in common with versioned objects, but are easier to handle due to their constrained semantics (a single-threaded case of versioned objects). General versioned objects can be of interest in the context of GIS for planning purposes.
• Accuracy / quality
Spatial data (e.g. points, lines, regions, surfaces, volumes, fields) contained in a GIS are based on measurements or assessments of real world phenomena, and as such should have a measure of accuracy or confidence of assessment/classification attached. Such “attributes” provide a way to estimate the quality of the results of GIS analysis through, for instance, sensitivity analysis [Lodwick90]. The accuracy of derived data has to be calculated from the accuracy properties of the source data and the computation algorithms.
• Geometry sharing
Sharing of a geometrical object amongst many geographical objects is possible.
- a road can be used as a lot border, a field border and a transportation network component
- a border can be defined to follow the centre line of a river/stream
- many different cables can be put in the same ditch
In order to avoid redundant storage in such cases, geometry sharing/referencing should be supported.
• Scale / Roles
Spatial data have many uses, and spatial objects will exhibit different characteristics when they are viewed at different levels (scale dependent properties) or appear in different contexts (role dependent properties). Some objects or object characteristics become insignificant as the scale gets smaller.
- a house could be interesting at large scales, but at smaller scales it should be ignored or included as a part of a settlement
The same applies to most geographical features. Either they become obsolete as the scale gets smaller, or they will need to be combined with other objects into larger structures or represented in new and generalised ways. Geographical features tend to play different roles in different contexts.
Roads, rivers and houses can serve as examples:
- a road can be seen as part of a transportation network in the context of routing and transport analysis (emphasising speed limits, surface type and length), while the road managers might see it as a piece of construction (a volume object including many layers of material, bridges, tunnels, etc.)
- rivers can be analysed both as transportation networks and as water resources, emphasising very different characteristics of the river phenomenon
- houses can play the role of homes to people in some contexts, while only their physical characteristics may be of interest in other contexts
• Constraints
Constraints are rules and characteristics of the real world that the data model and database system must capture, conform to and enforce. Constraints on spatial relationships, topology and geometry are crucial for GISs, and require special attention. In addition, the more traditional constraints (cardinality of relationships, the domains of attributes, mandatory attributes) must be taken care of. A potentially interesting class of constraints for geographical databases is quality/accuracy based constraints.
• Derived objects
Geographical objects may be derived from other objects, images or measurements (sampling, surveying). Derived objects should have references to the objects from which they have been derived (as is proposed in the lineage portion of the data quality part of the SDTS* proposal [USGS90], described later in the chapter). References from measurements to derived objects (cross referencing) can also be useful.
• Analysis
An important application area for GIS data is in analysis, statistics and supervision/monitoring. This kind of usage implies (read-only) bulk data access and scrutiny of every possible kind of relationship in the data.
To facilitate analysis, the data model should provide possibilities for exploiting unanticipated relationships and unorthodox ways of accessing the data. • Themes Geographical objects can often be organised into themes (vegetation, road-network, land property, building, water-course, geology, transportation, topography, …). An application will normally be interested in only a limited number of these themes. Geographical data models that can organise data into themes will be convenient for many users and applications. • Distributed ownership A GIS often includes information from many different data sources. These sources could be owned by independent organisations and distributed over a large physical area, needing long-haul data networks for access. Ownership issues, data integration and data communication must be addressed in the development of an integrated platform for geographical data. Distribution issues could therefore be interesting also in a data modelling context. • Behaviour Geographical objects’ behaviour will often be associated with presentation of the object, but “active” GIS objects could need to include other kinds of behaviour as well. This is particularly true for simulation environments. Examples include climate models, nutrition/growth simulations and environmental stress tolerance simulations. GIS interfaces depend on good visualisation of geographical objects. Such visualisations again depend on scale and context. Visualisation methods could be integrated into geographical and geometrical objects. To be able to handle these issues on the modelling level, an object-oriented (both structural and behavioural) approach to GIS modelling should be considered. 
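A small sketch of such behaviourally object-oriented GIS objects, where an object carries its own scale-dependent presentation rule. The class name, thresholds and return values are invented for illustration:

```python
# Sketch: a GIS object with behaviour attached, here a scale-dependent
# presentation rule (all thresholds and names are invented examples).
class House:
    def __init__(self, footprint_m2: float):
        self.footprint_m2 = footprint_m2

    def presentation(self, scale_denominator: int) -> str:
        # At large scales, draw the actual footprint; at smaller scales
        # the house is reduced to a symbol or disappears into a
        # settlement (cf. the scale discussion earlier in this chapter).
        if scale_denominator <= 10_000:
            return "outline"
        if scale_denominator <= 50_000 and self.footprint_m2 >= 100:
            return "point symbol"
        return "omit"

h = House(120.0)
print(h.presentation(5_000))    # outline
print(h.presentation(30_000))   # point symbol
print(h.presentation(100_000))  # omit
```

Keeping the rule inside the object means a viewer never needs to know house-specific cartographic conventions; it simply asks each object how to present itself at the current scale.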
* Spatial Data Transfer Standard, US Geological Survey
4.3 Requirements to high level geographical data models
A high level data model is a tool for modelling/describing interesting aspects of real world phenomena as accurately, convincingly and completely as possible. For this purpose, adequate structuring and abstraction mechanisms should be provided. For geographical information in particular, we want to be able to model geographical phenomena in their right context. A high level data model should be easily comprehensible for all specialists involved, and should provide a structuring of the information that is useful for translation into database oriented data models (as described in chapter 2). Through the years, many different approaches have been taken to high level data modelling. In the early 1980s, entity-relationship (ER) data models, binary data models, semantic network data models and infological data models were considered to be a representative selection of the methods used [Tsichritzis82]. Semantic extensions to the entity-relationship approach have become increasingly popular, and in the late 1980s, these types of models were classified as structurally object-oriented models [Rumbaugh91]. The ER data model and some of its extensions will be used as a basis in this thesis. The GIS information base consists of spatial data objects (with their geometrical and non-geometrical properties) and traditional (spatially relatable) “catalogue” types of data (for instance census data and administrative data). A useful high-level data model for the GIS information base will have to integrate spatial and non-spatial data in a common modelling framework to facilitate full data integration in GIS databases.
Geographical data model requirements
Data models for geographical data should cover most of the following topics (in accordance with the preceding section):
• Basic primitives such as entities, relationships and attributes in the traditional ER model.
• Two-dimensional and three-dimensional geographical structures with (shared) geometrical data types (points, lines, regions, surfaces, volumes, fields, samples, rasters). The time/temporal dimension should be supported for all geographical structures.
• Two-dimensional and three-dimensional spatial relationships and constraints, including topological relationships (network and manifold structures).
• Aggregation hierarchies, both at the attribute level and the object level.
• Generalisation/specialisation hierarchies/networks (in the case of multiple inheritance).
• Relationships specified on “unions” of different object types (EER categories).
• Historical information.
• Quality (including accuracy) and scale measures.
• Cross referencing of derived objects (data aggregation, computed data) and source data.
• Support for different scales and roles when modelling a single phenomenon.
• Grouping of related object types into themes.
• Data integration mechanisms.
and probably also:
• Behaviour, presentation mechanisms (projections, rules for combining different themes).
These topics will be elaborated further and discussed in a data modelling context in the following subsections. Basic components of spatial knowledge have been presented by Golledge [Golledge92]. The requirements presented here should take care of most of those basic components, and in addition, some new requirements are presented.
4.3.1 Traditional ER model abstractions
Entities
Entities (or object types) are useful as a basic abstraction also for geographical data modelling. Some examples of entities are house, forest parcel, road segment and tree.
Figure 4-1 Cardinality constrained relationships
Constrained object-object relationships
Generic object type to object type relationships are indispensable in any model of reality. Ordinary relationships (e.g. relationships as used in the ER model) should in a geographical context for instance be used to connect owners to properties, crops to fields, minerals to geological formations, buses to routes/roads and so on. For modelling purposes it is also useful to be able to put constraints on relationships, such as those used in ER models (1:1, 1:N, N:M, compulsory/possible relationships). See Figure 4-1 for an illustration of constrained relationships (1:N, N:M) in the ER model.
Attributes
Properties of entities and relationships can be represented as attributes. An attribute could be the date of birth of a person, the colour of a flower, the surface material of a road or the amount of algae of a certain type at a sampling spot in a lake or river. Attributes and relationships are our key tools for describing the distinctive features of an entity/object type.
4.3.2 Geometrical object types
All necessary geometrical/spatial object types should be supported in their most general representation. Object borders/boundaries (∂) and interiors (°) [Egenhofer90b] (as discussed in chapter 3) should be supported, as should continuous variation throughout the interior of objects (fields), as discussed earlier in this chapter.
Point.
A point needs to be related to some reference system (datum and projection), and its position within this reference system must be given. Either some kind of “global” reference system (where a point is given by for instance latitude, longitude and elevation) or a planar/projected reference system (where a point is given relative to a certain origin: north, east, elevation) should be applied. If no elevation value is specified for a point, it should be assumed to lie on the earth’s surface.
Line.
A geometrical line can theoretically be represented as an infinite sequence of points connecting two end-points. A representation will have to be chosen that gives the desired accuracy and that is reasonably compact when it comes to computer storage. A sequence of points selected according to, for instance, the Douglas-Peucker algorithm [Douglas73], or a parametric representation (such as spline curves, B-spline curves or Bezier curves [Mortenson85]), could be used for line representation. A way of representing variation of a property along a line would also be useful for many applications, as mentioned in the preceding summary. A line whose representation does not include elevation information should be assumed to lie on the earth’s surface.
Region.
A region can, in a geographical context, be restricted to two dimensions, and can be defined as all the points that are inside the region’s (closed) boundary lines (a face in a plane graph [Wilson85]). A region can therefore be described by its bounding lines and an indication of where the interior of the region is (the vector approach), or by a finite set of regularly sampled points representing the interior of the region (the raster approach). The raster approach to representing a region can capture continuous change in a property throughout the region, making it an efficient “2.5D” (or field) representation. Representation of homogeneous regions using raster technology is consequently a waste of the capabilities of the raster approach, while the current vector approach to regions can only handle homogeneous regions.
Surface.
A geographical surface can be described as a function in three dimensional space. It can either be represented by a set of points lying on the surface with some neighbourhood information (topology) attached (to determine how the surface is to be constructed from the points, e.g. a TIN structure), or it can be represented by functions (e.g.
parametric functions such as B-splines or Bezier). An important sub-class of surfaces consists of those that can be represented as functions of two dimensions: z = f(north, east). Such functions can be used to represent “continuously” varying geographical phenomena (e.g. elevation, rainfall, surface temperature and soil depth). Such a representation is an implementation of a field over a 2D region. Neugebauer [Neugebauer90] suggests that continuous surfaces should be available directly at the database level, hiding the (sample) data sets that will have to be the basis for the database system’s or GIS’s interpolations. Such a feature would be very useful for many disciplines. However, the underlying samples should also be available for users who want to provide their own interpolation methods. Peucker and Chrisman were among the first to emphasise the importance of fields over 2D regions (they called them three-dimensional surfaces) in geographical information systems [Peucker75].
Volume.
A geographical volume is the three-dimensional space bounded by a closed set of surfaces (the volume boundary). The representation of volumes can be oriented towards the surfaces of the volume or the interior of the volume. The vector approach can represent the surfaces of a volume quite easily using for instance parametric functions or TINs, and it is also possible to represent the interior of a volume using the same techniques. The raster approach has to represent the interior of the volume, and can do this in a straightforward way. The raster approach to representing volumes makes it possible to represent continuous changes for a property throughout the volume, making the raster approach to volume modelling a “3.5D” representation (a field over a 3D region).
Spatial set.
For the important class of spatial samples, a spatial set concept is valuable. Spatial sets could be sets of points, lines, regions or volumes.
A spatial set is different from other sets, because all the elements of the set have a position in two- or three-dimensional space. The members of a spatial set thus have an inherent structure, being spatially related to each other.
Field.
Variation throughout the interior of a geographical object should be supported in all relevant dimensions.
4.3.3 Spatial relationships
Spatial relationships are relationships that are a result of the location of objects in space. The importance of spatial relationships in geographical data models and geographical information systems has been emphasised by many researchers [Mark89, Peuquet86]. Spatial relationships can be divided into groups on the basis of their characteristics. Three such groups could be:
• (Geo)Metric relationships, such as distance and direction
• Topological relationships. These can be derived from geometry (e.g. neighbour and border)
• Fuzzy spatial relationships. These can be difficult to define accurately and are often context sensitive. Examples are above and in front of.
Between is an example of a relationship that can have both fuzzy and topological properties. Peuquet claims that all spatial relationships can be built from three primitives (distance, direction and boolean set operators) [Peuquet86]. Spatial relationships can in general be derived from the geographical location of the involved objects. Metric relationships are embedded in the geometry, and can be calculated in a straightforward way. Topological relationships can be found by investigating geometry. An example is neighbouring objects (that is, objects with common borders). To find all topological relationships, one will have to search the geometrical structures to find all relationships that stay invariant under spatial transformations such as rotation, scaling and shifting. Fuzzy spatial relationships can to a certain extent be deduced through knowledge-based / rule-based systems.
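The metric relationships distance and direction, for instance, follow directly from geographically referenced planar coordinates. A minimal sketch (coordinate values are invented):

```python
# Sketch: metric spatial relationships (distance and direction) computed
# directly from planar (easting, northing) coordinates.
import math
from typing import Tuple

def distance(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    return math.hypot(b[0] - a[0], b[1] - a[1])

def direction(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Azimuth from a to b in degrees, clockwise from north, in [0, 360)."""
    east, north = b[0] - a[0], b[1] - a[1]
    return math.degrees(math.atan2(east, north)) % 360

site_a, site_b = (0.0, 0.0), (100.0, 100.0)   # (easting, northing)
print(distance(site_a, site_b))   # ~141.42
print(direction(site_a, site_b))  # 45.0 (north-east)
```

Together with boolean set operators, these two are exactly the primitives Peuquet proposes as a basis for all spatial relationships.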
Mark and Frank [Mark90] have shown interest in a linguistic approach in their research on spatial relationships. They state that: The slow progress in GIS development appears at least partially to be due to the lack of formal understanding of spatial concepts as they apply to geographic space. Mark and Frank also cite an article of Boyle from 1983, where the status is described more generally [Mark90]: The (present) lack of a coherent theory of spatial relations hinders the use of automated geographic information systems at nearly every point. Since the theory of spatial relationships is immature at the moment, it is not possible to specify a complete set of requirements for data models with respect to spatial relationships. However, it is probably not necessary to represent all spatial relationships explicitly. Most of them can be derived from geometry anyway.
4.3.4 Implicit geographical relationships
All geographical objects are spatially relatable to each other through their common geographical reference system. Relationships such as distance and overlap can be derived from the geographically referenced geometry of the objects. When an application is interested in examining different geographical data sets in combination, there is no need to have explicit relationships to take care of this. The common geographical reference system ensures that such analysis is always possible. Some modellers will only be interested in the combination of a limited number of geographical data sets. When this is the case, it should be possible to specify in advance which data sets are frequently examined together. This could make optimisation possible (for vector-structured data, the topology of the combined data sets could be stored to avoid demanding computations each time the combined data sets are to be examined).
4.3.5 Topology
Some classes of GIS objects form large geometrical structures.
Examples of such structures are:
• Planar graphs [Wilson85], also called 2D manifolds* (such as political, economical or land use partitioning of an area)
• Networks (such as roads and railways, sewage and freshwater tubes, electricity and telephone cables).
Classification manifolds are important examples of derived data sets. The borders on such maps are interpolated on the basis of a systematic sampling of the area of interest (e.g. for soil, climate, geology or vegetation mapping). For 2D manifolds and networks, connection and neighbourhood information is very useful for the purpose of analysis. This information can be made explicit by adding a topological structure for such geographical features. The topology [Peucker75] of spatial geographical data objects is a non-(geo)metrical model of the spatial connections between the volumes, surfaces, regions, lines and points that constitute geographical objects in space. One of the first standards for topology-structured geographical information was the US Bureau of the Census’ TIGER file structure [Boudriault87][Broome90].
Figure 4-2 Topology for geometrical objects in 3-dimensional space
In topological modelling, the borders of the objects are the only features of interest. An edge is topologically defined by its two end-vertices, a region is topologically defined by its border edges, and a volume is topologically defined by its bounding surfaces. If the topological model is complete, it is possible to use the topology to find neighbouring objects using this border information. Figure 4-2 shows an ER-illustration of topology for objects in three dimensional space, while Figure 4-3 shows the 2D equivalent. Spatial topology is an active field of research, and particularly the representation of topology in 3 (and 4) dimensional space has been investigated.
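The border information described above is enough to derive neighbourhood: two regions are neighbours exactly when they share a bounding edge. A sketch of such a derivation from an edge-based topological structure (all identifiers are invented example data):

```python
# Sketch: deriving region neighbourhood from node-edge-face topology.
# Each face (region) is described only by the ids of its bounding edges;
# faces sharing an edge are neighbours.
from collections import defaultdict
from typing import Dict, List, Set

def build_neighbours(face_edges: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    edge_faces = defaultdict(set)          # edge id -> faces it bounds
    for face, edges in face_edges.items():
        for e in edges:
            edge_faces[e].add(face)
    neighbours: Dict[str, Set[str]] = {f: set() for f in face_edges}
    for faces in edge_faces.values():      # faces sharing an edge
        for f in faces:
            neighbours[f] |= faces - {f}
    return neighbours

# Three municipalities partitioning an area (a small 2D manifold).
faces = {
    "A": ["e1", "e2", "e3"],
    "B": ["e3", "e4", "e5"],
    "C": ["e5", "e6", "e1"],
}
print(sorted(build_neighbours(faces)["A"]))   # ['B', 'C']
```

Note that no coordinates appear anywhere: the neighbourhood query is answered from connectivity alone, which is precisely why topological analysis needs no knowledge of the underlying geometry.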
Point set topological relationships have been studied in 2 and 3 dimensions [Pullar88] [Egenhofer91b] [Jen94] [Molenaar94], and cell simplexes, cell complexes and manifolds have been considered for modelling and representing topology for 3D space [Frank86] [Pigot92a]. Topology in a space-time context has also been investigated [Egenhofer92] [Pigot92b].
* As defined in the SDTS [USGS90]
Figure 4-3 Topology for geometrical objects in 2-dimensional space
To be able to model spatial connections between and within objects, topological relationships are indispensable. Using a layered approach to spatial object modelling, the geometrical level would constitute the lowest level in the hierarchy, with its representations of points, lines and surfaces. The spatial objects would be at the highest level, while the topological level would be an intermediate level in the hierarchy, present only for geographical objects that participate in topological structures. The topology layer will show how the basic geometrical objects are interconnected to form spatial object structures. It will show how borderlines can be assembled to describe different economical and administrative units, how a hydrological system is an assembly of lakes, rivers, streams and ponds, how different road segments should be interconnected to form the road transportation network, and so on. The topology level should be logically self-contained, and together with the geometry it will provide the backbone of the geographical data model. By storing topology on a thematic basis, separated from, but connected to, the geometry, sharing of the geometrical descriptions (for instance between a river and a lot border) can be accomplished. The geometry of a river could then be used both in a property manifold and a river network.
Inter-thematic topology
Topological relationships can be incorporated into the data model using two different approaches.
• An integration approach, covering all the themes in the GIS database, and thereby causing a very detailed topological model.
• A separation approach, keeping the topology within a theme separated from the topology of the other themes.

This can be illustrated by a road-network theme and a water-system theme. The separation approach will result in topological points/nodes (grey circles) at road crossings and stream meets (Figure 4-4). The integration approach will introduce additional topological points where roads cross rivers and streams (Figure 4-5). ATKIS [ATKIS89] has chosen the integration approach. Using this concept, a lot of new, artificial object partitionings must be introduced to represent inter-thematic topological relationships.

Figure 4-4 Separation topological approach

The separation approach seems to be the simplest and most general, and gives a higher degree of data independence. The separation approach will demand more processing for overlay analysis, whereas the integration approach will give a higher data overhead (an excessive amount of topological points), and a lot of work is needed to keep the topological structures updated. The overhead means that single data set analysis will be slowed down in the integration approach (particularly for updates). The integration approach seems to be a convenient solution from the data processing point of view, particularly when it comes to overlay analysis. One problem with this approach is that a complete restructuring of all the stored spatial objects will be required each time a new data set is introduced! Also, the storage of inter-thematic topology integrated into the various themes will inevitably make the single data set topology much more complicated. Because of the big efforts involved in importing (and exporting) data sets in the integrated approach, it could inhibit the exchange of data by making it very expensive to utilise external data sets.
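Under the separation approach, inter-thematic topological points can be derived on demand from the geometry of the two themes. A naive sketch of such a derivation, as a pairwise segment-intersection scan, is shown below; a real system would use a spatial index and robust geometric predicates, and the function names are illustrative:

```python
def segment_intersection(p1, p2, p3, p4):
    """Intersection point of segments p1-p2 and p3-p4, or None.
    Standard parametric test; points are (x, y) tuples."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    d = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if d == 0:  # parallel or collinear: no single crossing point
        return None
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / d
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / d
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None

def inter_theme_nodes(theme_a, theme_b):
    """Derive inter-thematic topological points (e.g. road/river
    crossings) from two themes given as lists of segments."""
    points = []
    for s in theme_a:
        for t in theme_b:
            p = segment_intersection(*s, *t)
            if p is not None:
                points.append(p)
    return points
```

For example, a road segment from (0,0) to (2,2) and a stream segment from (0,2) to (2,0) yield a single derived crossing point at (1,1), without that point ever being stored in either theme.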
Integrated topology should be provided in the high-level data model, since it is useful for many applications. This does not mean that the integration model has to be reflected in the low-level topological model within each theme. Topology could either be stored as metadata, or it could be derived when needed (inter-thematic topological points can always be derived from the spatial data of the themes of interest). If the inter-theme topology is stored as metadata, the same heavy computations are needed to update the topology as new data sets are included. A (physical) solution could be a two-levelled topological model. At the bottom level, the topological relationships are kept within the different themes. At the top level, one could maintain a dynamic data structure storing some or all of the inter-layer topological relationships according to their usefulness and rate of usage. For node-edge topology, the top level model must introduce references between the crossing topological edges to faithfully represent an inter-layer knot point and facilitate cross referencing.

Figure 4-5 Integration topological approach

4.3.6 Aggregation

Smith and Smith describe aggregation as [Smith77]: "… an abstraction in which a relationship between objects is regarded as a higher level object. In making such an abstraction, many details of the relationship may be ignored. For example, a certain relationship between a person, a hotel, and a date can be abstracted as the object 'reservation'. It is possible to think about a 'reservation' without bringing to mind all details of the underlying relationship, for example the number of the room reserved, the name of the reserving agent, or the length of the stay." Aggregation, or the construction of an object from its constituent objects, can be illustrated in different ways. In [Tsichritzis82], an aggregation on the type level is shown as in Figure 4-6.
Figure 4-6 Aggregation

Substituting the tokens "Per Jensen" for name, "Ulvefaret 3" for address and "76" for age, one ends up with a token level object aggregation. Aggregations are useful for building up spatial objects that consist of many different objects as parts. Water systems (lakes, streams, rivers), waterways (canals, rivers, lakes), sewage systems (bowls, tubes) and buildings (doors, windows, rooms, stairs, …) are examples of such. These are examples of (object) type level aggregations. In GISs, aggregations could be used both for geometrical aggregations (land parcels are aggregated into properties, counties are aggregated into states, construction material could be aggregated into roads) and general attribute aggregations (the properties of geometry, #inhabitants, area and government could be aggregated into a high level country object). The EER model does not support entity-level aggregation with a special diagrammatic representation [Elmasri89]. To model this kind of aggregation in the EER model, one will have to resort to using "is-a-part-of" or "is-a-component-of" relationships.

4.3.7 Generalisation

Smith and Smith use generalisation in the following sense [Smith77]: "A generalisation is an abstraction which enables a class of individual objects to be thought of generically as a single object." The generalisation abstraction is used whenever a potential application could benefit from treating a group of similar objects or object types uniformly. Generalisation can be justified as long as the group of objects / object types have one or (preferably) more properties (attributes) in common. The generalised object / object type will consist of all the properties (attributes) that are common to the lower level objects / object types and that could be of interest to some application.
Generalisation does not introduce new objects, so a certain lake (for instance Femunden) has the same identity and hence is the same lake object, even if it is accessed as a "region" object. The difference is that when it is accessed as a region, only the properties that have been defined as relevant to a region will be available. Generalisation can be performed on the object level (often referred to as classification), in which case the generalisation is used to form a basic class / entity set from real world objects (first level abstraction). This could be termed a phenomenon-to-class generalisation. The classification of individual cats into the class cat is an example of this kind of generalisation (see Figure 4-7).

Figure 4-7 Phenomenon to class generalisation (classification)

The other kind of generalisation is the class-to-class generalisation (at the type level). The generalisation of political units, lakes, and islands into generic regions is an example of class-to-class generalisation (see Figure 4-8 for an illustration of this generalisation hierarchy). Such a generalisation could be justified if we wanted to access the area of every kind of region in a uniform manner (to be able to compare the area of a lake with the area of a political region). Another class-to-class generalisation is the abstraction of all classes of public roads into a generic road class. This class could be used for network analysis (shortest path routing, …) for the road transportation sector.

Figure 4-8 Class-class generalisation, as used in [Elmasri89]

Generalisation can be performed on all kinds of entities / classes. Multi-level generalisation will result in a directed acyclic graph (DAG) of classes, and an object that is a member of a class in such a generalisation DAG will at the same time be a member of all its ancestor classes in the DAG. Generalisation is useful for information hiding in the data model.
During data modelling, it is a goal to find the right level of generalisation. If the topological/geometrical properties of all area objects in a region are of interest, operations on a generic "region" class will be most useful, whereas if the crops that are grown in the agricultural areas are of interest, the lower level "field" class will also be needed.

Inheritance

In generalisation hierarchies, many properties and relationships will be the same down the hierarchy. The region generalisation hierarchy could be used as an example. The top level of the hierarchy consists of the generic region, with its topological/geometrical properties, such as borderlines, circumference, area of extent and relationships to neighbouring regions. These properties will be useful in all the objects down the generalisation hierarchy, and should therefore be inherited all the way down to the bottom of the tree. By using inheritance we can ensure that all objects in a generalisation hierarchy can be treated uniformly when considered at the highest level of generalisation. Without inheritance, there is a risk that common properties/attributes get different names for each object. Using inheritance, we know that it will be possible to use the same algorithms for querying and manipulating for instance the geometry of all kinds of regions (forest stands, airport runways, lakes, parking lots, etc.). If an object can be present in many generalisation hierarchies, it should inherit properties from all these hierarchies. This concept is termed multiple inheritance, and introduces complications through the possibility of name conflicts between the hierarchies. Multiple inheritance is necessary to represent phenomena that play more than one role in the actual modelling context.
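The uniform treatment that inheritance makes possible can be sketched in an object-oriented notation. In the Python sketch below, the class names (Region, Lake, PoliticalUnit) and the choice of the shoelace formula for area are illustrative; the point is that the same area algorithm, defined once on the generic region, works on every class down the hierarchy:

```python
class Region:
    """Generic region: topological/geometrical properties that are
    inherited by every class down the generalisation hierarchy."""
    def __init__(self, border):
        self.border = border  # list of (x, y) border vertices

    def area(self):
        # shoelace formula over the border polygon
        n = len(self.border)
        s = 0.0
        for i in range(n):
            x1, y1 = self.border[i]
            x2, y2 = self.border[(i + 1) % n]
            s += x1 * y2 - x2 * y1
        return abs(s) / 2.0

class Lake(Region):
    def __init__(self, border, depth):
        super().__init__(border)
        self.depth = depth            # lake-specific attribute

class PoliticalUnit(Region):
    def __init__(self, border, government):
        super().__init__(border)
        self.government = government  # political-unit attribute

# The same area() algorithm is applied uniformly to both kinds of
# region, so their areas can be compared directly:
regions = [Lake([(0, 0), (2, 0), (2, 2), (0, 2)], depth=10),
           PoliticalUnit([(0, 0), (4, 0), (4, 4), (0, 4)],
                         government="county")]
```

Querying `r.area()` for every `r` in `regions` treats lakes and political units at the highest level of generalisation, exactly as argued above.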
4.3.8 Categories

If an application is interested in treating a union of different classes as a whole because they play the same role in a relationship to other classes, a new modelling concept has to be introduced. This concept has been denoted category by Elmasri and Navathe [Elmasri89]. Figure 4-9 shows an example from the book illustrating the abstract entity of an "owner" (of, for instance, a land parcel), which could be either a "company" entity or a "person" entity.

Figure 4-9 Category [Elmasri89]

Categories could also be modelled using the concept of generalisation, but quite often the only generic feature of the categorised entities will be the relationships through the category. In these cases, the use of generalisation would be misleading and confusing for the reader/user of the model (and even the modeller). An example where categories can be useful is in specifying the topology of a road network for timber transportation. A node (category) in this network could either be a road crossing, a dead end, a factory or a piling site (the links will always be road segments).

4.3.9 History and time

Historical data are now acknowledged as being of interest to many GIS applications [Langran88] [Vrana89]. Such data could be used for modelling and monitoring both environmental changes and changes in infrastructure. The historical dimension is potentially interesting for most kinds of geographical phenomena. Time should therefore be included as a basic element in geographical data models. By including time in the data model, it will be much easier to handle the temporal dimension uniformly in query languages and in data transfers. Trend analysis and time series analysis are new possibilities that arise when time is included as a basic property of geographical objects. Historical snapshots and history animations will also be trivial to obtain from geographical data when time is fully integrated into the data model.
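A historical snapshot, as mentioned above, could be computed directly once each attribute value carries its interval of validity. The following is a minimal sketch (the representation of histories as lists of half-open (valid_from, valid_to, value) intervals is an assumption made here for illustration):

```python
def snapshot(versions, t):
    """Historical snapshot: return the attribute value of each object
    as it was valid at time t. `versions` maps an object id to a list
    of (valid_from, valid_to, value) tuples, half-open intervals."""
    state = {}
    for oid, history in versions.items():
        for start, end, value in history:
            if start <= t < end:
                state[oid] = value
                break
    return state
```

Running `snapshot` for a sequence of points in time gives the frames of a history animation for free, which illustrates why integrating time into the data model makes such products trivial to obtain.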
Nature

Environmental monitoring is based on measurements over a certain area using various sampling techniques (air photos, satellite images, climate monitoring stations and other point sampling techniques). Environmental data should be handled statistically since they represent samples of continuously varying natural phenomena. The measurements can be made in time series at certain points, or be taken systematically over an area at well defined points in time.

Infrastructure

Infrastructure is composed of well defined geometrical structures, and changes in such structures can be considered discrete compared to the slow changes that occur in nature. Man-made constructions (buildings, bridges, roads) are made over short periods of time, and so are modifications to such constructions. Infrastructure components can be torn down in a day or just left for a gradual decay into ruins. History could be difficult to handle for infrastructure because of the object-oriented nature of man-made features. The handling of object identity will be one of the problems encountered when trying to represent changes to geographical objects. Should an object change identity when it has been changed? How significant must the change be before we have a new object? Environmental data are sample based, so they are not object-oriented in the same way (although it is possible to derive objects from the data using classification techniques).

Versions

Versions can be useful for representing alternatives in the planning of infrastructure and land use. For a planning department it will be convenient to have all plans stored in a geographical database, with the different alternatives that have been considered. This will particularly be the case for plans under current consideration, but historical plans can also be of interest in the future. If such storage shall be possible, a more general purpose versioning mechanism must be available.
It must be possible to store an existing road together with a number of alternative placements of the road.

4.3.10 Quality/accuracy

Geographical data represent phenomena in nature. The level of accuracy that can be achieved with such representations varies, and it is important that this is reflected in the data themselves [Goodchild91]. To be able to perform sensitivity analysis [Lodwick90] (through for instance error propagation and simulation) and in other ways provide measures of the level of confidence in results from GIS analysis and classification, quality aspects of the data - such as accuracy of the underlying measurements and completeness of the data sets - must be available for calculation and propagation through the analysis process.

Quality information for traditional (paper) maps

The map has always carried with it an implicit measure of positional accuracy through its scale (scale is not available for digital geographical information, so there is a need for alternative and more direct representations of spatial accuracy). The lineage of the map has normally been described somewhere on the map (producer, method of production, time of data collection). Other quality measures for paper maps have been manifested in mapping rules and requirements, but normally not described on the map itself.

Geographical data quality

Accuracy and other quality measures for the geographical data in a GIS database are essential information for assessing the usefulness of the results of an integration of different geographical data sets, and for reliable and trustworthy GIS analysis and presentations (visualisation of the quality of spatial data processing results will have to be included in future GISs [Clapham91] [NCGIA91]). Work in this area was initiated in the early 1980s (e.g. [Chrisman84], [Chrisman86], [Beck86] and [Openshaw89]), and has received increasing attention in the 1990s.
The US SDTS (Spatial Data Transfer Standard) requires quality measures to be supplied with all kinds of geographical data when they are transferred from one system to another [USGS90]. In the SDTS (see page 86), the following five quality measures are identified:

1) Lineage: the origin and history of the data, including methods of measurement/derivation, transformations, control information used and dates of validity/collection.
2) Positional accuracy.
3) Attribute accuracy. Both numerical and classification accuracy are covered.
4) Logical consistency. Describes the fidelity of the relationships encoded in the data structure.
5) Completeness. Describes the ratio of the objects represented to the abstract universe of all such objects (exhaustiveness).

According to Firns and Benwell, two main types of accuracy can be identified for GIS data [Firns91]. Spatial accuracy involves the accuracy of absolute and relative positioning, while descriptional accuracy is the accuracy of the representation of the state of objects in terms of non-spatial attribute values and relationships. Spatial accuracy is a geometrical property, and should be tied to the shareable geometry of the spatial objects. Descriptional accuracy could be provided by giving non-spatial attributes and relationships relevant accuracy measures. This might be a useful way of attacking geographical data quality, but a further refinement of the concepts will have to be introduced. Using this taxonomy, topological relationships would have to be treated as special kinds of non-spatial relationships, and other quality issues, such as the completeness of a spatial data set, are not covered at all (e.g. the completeness of a road coverage). Quality information is sometimes associated with a data set, sometimes with a group of objects, and sometimes with a single object or an attribute of an object. A method for modelling quality should take this into account, and provide flexibility in representation.
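The flexibility just asked for - quality attached at the data set level, the group level or the individual object level - can be sketched as a lookup with fallback. In the Python sketch below, the measure names echo the SDTS components, but the dictionary representation and the function name are assumptions made for illustration:

```python
def quality_of(obj_id, object_quality, dataset_quality, measure):
    """Return a quality measure for an object, falling back to the
    data set level value when no object-level value is recorded.
    `object_quality` maps object ids to per-object measure dicts;
    `dataset_quality` holds the data-set-wide measures."""
    per_object = object_quality.get(obj_id, {})
    if measure in per_object:
        return per_object[measure]
    return dataset_quality.get(measure)
```

With such a scheme, a precisely surveyed road can carry its own positional accuracy while all other objects silently inherit the accuracy of the data set they belong to.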
Conceptually, as much quality information as possible should be available on the attribute/object level. Where only data set measures are available, a method for inheriting quality measures from the data set level to the level of the individual objects should be made available.

4.3.11 Derived objects

Some of the objects stored in geographical databases are extracted or calculated from other data sets. Such derived objects should have a reference to the source data from which they have been derived, and to the methodology that was used to obtain the new object from the original data (lineage [USGS90]). Many representations of natural phenomena fall into this category. Rivers, vegetation boundaries, digital elevation models and geological structures will generally be based on some kind of measurements. These measurements could be aerial photos, satellite images or field surveys. In the data model, derived data should be identified and related to their source data.

4.3.12 Sharing of geometrical objects among geographical objects

When a data model for geographical data is to be developed, one should have in mind sharing of geometry between different geographical objects. As discussed in the section on topology, this suggests that the geometry should be isolated from the geographical objects and the topology. A three level model for the representation of spatial properties could be used, with geometry at the lowest level, topology at the intermediate level and the geographical objects at the highest level (as in ATKIS [ATKIS89]). The model should be flexible, possibly allowing for more efficient representations (by-passing the topological level) for data where sharing is not possible or the demands on consistency are lower. Figure 4-10 shows an example of such a layered data organisation.
Figure 4-10 Geographical data layers

In some cases it will be natural to refer to the geographical object itself, rather than the topology/geometry of the object, when sharing is of interest. A road as a boundary to a compartment or field, and a river as a boundary to a property, are examples where this could be of interest. The legal definitions of the borders will decide in cases where law is applicable (if a border is defined to follow the centre line of a river, the river object should be referenced; if the border is defined to follow the centre line of the river at a particular point in time, the geometry of the river at that point in time should be referenced), and users' wishes and convenience of representation will decide elsewhere. A problem with referring to an object instead of its topology/geometry is that some extra computations will have to be done at run-time in order to find the exact geometry of the object.

4.3.13 Roles and scale

The representation of geographical phenomena depends on both the scale of interest and the role the phenomenon plays in our model. Both the graphic presentation and the storage representation of geographical objects depend on scale and role. The complex challenges of computer-assisted cartographic generalisation [Brassel88] [Muller91] are a part of the scale problem, while role-dependent representations constitute a new problem domain.

Roles

Geographical phenomena can play various roles, determined by the context in which they appear. People of different background/profession will often have different points of view when it comes to what aspects of a certain phenomenon are considered interesting for representation in a data model (an ecologist and a ship owner will generally be interested in different characteristics of a river or a lake). This role problem introduces a new complexity dimension in the modelling and representation of geographical data.
Roles should therefore be covered in a general purpose data modelling technique (and, if possible, also by geographical data transfer standards). Role aspects of representation result from the different uses a geographical phenomenon might have, and the many roles it might play as a part of nature and to humans. Continuing the example on rivers and lakes, it is evident that water systems play many roles. They are habitats of many different species of fish, algae and other types of animals and plants. They can be used as fresh-water supply to people, cooling water to power plants, recipients of many kinds of waste (industrial and natural), transportation media for boats and timber logs, and sources for hydro-electric power. The possible list of roles is long.

Scale

The "scale" at which an application works will often determine what aspects of a phenomenon are considered interesting (e.g. aggregated information could be most useful when considering large regions, while more detailed information will be considered most useful when working on small areas). Relationships could also be different at different generalisation levels (this is true for topology). Scale dependent representation can be illustrated by some examples:

• A building can be generalised from an area (volume) object to a point object at certain scales, and in certain contexts.
• A building can also be combined (or aggregated) with other buildings that are "close" to form a region object (settlement, town) at certain scales, and for certain applications.
• A dirt road can be generalised from an area object to a line object for smaller scales, and can perhaps be excluded at the smallest scales.
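The building examples above can be sketched as a scale-dependent representation choice. In this Python sketch the scale thresholds, the dictionary keys and the function name are purely illustrative assumptions:

```python
def building_representation(building, scale_denominator):
    """Choose a representation for a building depending on the map
    scale denominator (larger denominator = smaller scale)."""
    if scale_denominator <= 10_000:
        return ("area", building["footprint"])   # full outline
    if scale_denominator <= 100_000:
        return ("point", building["centroid"])   # generalised point
    return None                                  # excluded entirely
```

The representation indications discussed in the text would, in effect, record which of these branches are valid for a stored object, instead of hard-coding the thresholds in every application.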
A way of handling the scale and role problems is to include a set of representation indications for all objects in the data model, in order to be able to show what roles are interesting in the applications that are to operate within a certain modelling context. It could also be useful to include a specification of the scales and contexts for which an existing representation will be suitable. A generalisation strategy for moving between different geometrical structures could be included in the representation whenever techniques for this become available. Automatic (cartographic) generalisation is a large research area [Brassel88] [Muller91] [Bjørke90].

4.3.14 Spatial constraints

There are many possible types of spatial constraints. Some could be between data sets, while others could be internal to a data set. Examples of inter-data set constraints could be: a forest parcel should not overlap with a water surface; a building should be contained in a property (depending on the existing rules for the relationship between buildings and properties); a forest compartment with trees should not overlap a field; where a river crosses a road or railway, there should be a bridge or a tunnel. Data set internal constraints could be of the kind: buildings cannot overlap (either in 2D or in 3D); roads cannot cross if there is no crossroads (road network node) unless there is a bridge or tunnel; properties cannot overlap (in 2D); a light point must be connected to the electricity network; the intersection of a tube for an electricity cable with all other 3D infrastructure elements must be empty. Topological constraints: an edge must have two end-points (not necessarily distinct); an edge in a manifold structure must limit two and only two distinct planar surfaces (faces). All constraints must be specified in the data model, so that rules can be specified and enforced in the database system.
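Enforcement of such a rule can be sketched for the simplest case, "buildings cannot overlap (2D)". The sketch below uses axis-aligned bounding boxes as a stand-in for real footprints; an actual database system would use exact polygon intersection tests and a spatial index, and the function names are illustrative:

```python
def boxes_overlap(a, b):
    """Axis-aligned bounding boxes given as (xmin, ymin, xmax, ymax).
    Boxes that merely touch along an edge do not count as overlapping."""
    return (a[0] < b[2] and b[0] < a[2] and
            a[1] < b[3] and b[1] < a[3])

def check_no_overlap(buildings):
    """Check the data set internal constraint 'buildings cannot
    overlap' (2D); returns the list of violating index pairs."""
    violations = []
    for i in range(len(buildings)):
        for j in range(i + 1, len(buildings)):
            if boxes_overlap(buildings[i], buildings[j]):
                violations.append((i, j))
    return violations
```

A database system enforcing the constraint would run such a check (restricted to the neighbourhood of the changed object) as part of every insert or update, rejecting the transaction when the returned list is non-empty.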
4.3.15 Groups of related objects (themes)

The pattern of access on geographical data sets is seldom completely random. Each user tends to concentrate on certain themes and/or certain geographical regions. By arranging GIS data according to themes or groups, both the use and the management of the data can be made more efficient. This observation could be utilised in a geographical database management system, and to be able to achieve the potential benefits, the data modeller will have to identify groups of data objects and groups of object types that are accessed coherently.

Thematic grouping

Some examples of useful thematic groupings: political/economical boundaries could constitute a theme (world, continent, country, district, municipality, property, lot). The water system could make up a theme (lake, river, canal, stream), the road network another (roads, crossroads, parking lots, squares), the topographic surface of the earth yet another (DEM, spot heights, faults, drainage systems), and so on.

Spatial grouping

Geographical regions are often useful for limiting data sets. If an application is a municipality application, one would presume that only the data that lie within the border of the municipality are of interest. This could be used when accessing remote databases. Applications that work in the context of water systems (water pollution, hydro-power, fresh water) could use drainage basins for the same purpose. In the modelling context, it would consequently be useful to be able to specify a region for constraining large data sets that are of interest to a set of users/applications. Another aspect of grouping is the natural grouping that occurs because of the distributed nature of geographical databases (ownership). This grouping could also be of interest to application developers and geographical data brokers.
It would therefore be useful if a geographical data model could represent distributed ownership of data in addition to thematic groupings of spatial object types and structures.

4.3.16 Distributed ownership

Since the geographical data that are of interest to an application might be distributed over a large number of geographical data servers due to ownership issues, it could be interesting for the data modeller to be able to specify whether a data set or a set of objects is managed locally or at an external site. In this context it would be interesting to be able to specify the characteristics of the retrieval process for the external data. Alternatives could be on-line database access, off-line access (the data may have to be ordered, introducing a certain delay) or access to a local copy (that might not be up to date). Pricing information, restrictions on usage and expected network delays for on-line access are also among the things that could be specified.

4.3.17 Behaviour

The fully object-oriented approach to data modelling makes it possible to associate behaviour with objects. Behaviour could be a procedure to present the object to the user graphically, or it could be other kinds of analysis or retrievals pertinent to the object. Behaviour could be useful for simulation environments, as mentioned earlier in this chapter, but the behavioural aspects of geographical data in today's applications seem to be limited to presentation, cartographic generalisation and some geometrical and statistical computations. Behaviour is useful for the modelling of geographical information systems, but for most kinds of geographical data it has a more limited utility.

4.4 Modelling implications

The requirements put on the modelling environment by geographical data and applications, as outlined in the previous section, are extensive. A model that shall accommodate general purpose GIS database development must therefore be very expressive.
None of the data modelling methodologies mentioned in chapter 2 cover all the aspects treated here. The entities and relationships provided in the basic ER modelling scheme constitute too limited a set of structuring mechanisms for the complexity of GIS data. Even the EER model and other sophisticated semantic data models run into trouble when facing the needs of geographical data modelling. Semantic or (structurally) object-oriented data models (e.g. ER and EER) are hopefully general enough to provide the basis for a modelling tool for geographical data. The EER model provides some of the abstraction mechanisms that we need for describing geographical data models. Necessary modelling extensions to the EER approach will have to be investigated in order to arrive at a consistent modelling framework that is able to capture and structure the semantics of geographical data, as described in the list presented earlier in this chapter.

The following concepts can be considered as adequately covered by the EER approach and similar semantic approaches:

• Entities/object types
• Constrained object-object relationships
• Attributes
• Sharing
• Aggregation
• Categories
• Generalisation

The following concepts are not considered, or not adequately/fully covered:

• Geometrical primitives
• Spatial relationships, including topology (for geometrical objects, these relationships deserve special attention in the modelling formalism)
• Scale and roles
• History/time
• Accuracy/quality
• Derived objects
• Groups of "related" objects (a new structuring/abstraction method is needed)
• Behaviour

One of the most important things to work on is a more powerful method for managing large diagrams with a huge number of entities and relationships. The overall structure must be communicated, avoiding unmanageably complex diagrams. A solution based on some sort of "black box" principle for hiding self-contained parts at higher levels would be attractive.
The key to the problem is the isolation of smaller parts from the total modelling problem. This is not a trivial task. Spatial objects and their structures often give rise to very complex modelling diagrams when modelled using standard data modelling technology. This can partly be overcome by defining some kernel spatial structures that can be abstracted to symbols in the diagrams. Bédard introduces symbol-based entities to represent spatial structures [Bédard89]. The approach is called the sub-model substitution (SMS) technique. This approach will be elaborated on in chapter 5, where elements of a data modelling framework for geographical information will be outlined.

4.5 Proposed data models and exchange standards for GIS data

The need for effective transfer of geographical data has resulted in many national projects to develop powerful and flexible models that can support the exchange of geographical and geographically related data. Central in these efforts is the specification of a geographical data model. In the following sections some of the better known efforts are presented, namely the German ATKIS and the US SDTS. Norwegian work in this area is also given treatment.

4.5.1 ATKIS

The "Amtliches Topographisch-Kartographisches Informationssystem" (ATKIS) is the responsibility of the Federal German Republic State Survey Working Committee (AdV*) [ATKIS89]. It covers the whole cartographic process, emphasising collection, storage, presentation and automation. The storage and exchange of digital spatial information is provided at two levels. The DLM (Digitales Landschaftsmodell) covers the semantics of the real-world data (topographic objects), while the DKM (Digitales Kartographisches Modell) is a digital representation of the paper map, with cartographic symbology (but no semantics). A linkage of the symbol objects in the DKM to the spatial objects in the DLM is provided in the ATKIS-SK (Signaturenkatalog) to facilitate cross-referencing.
The DLM structures the landscape into objects and object hierarchies. The different object types are described in a catalogue of object types (ATKIS-OK). The OK (Objektartenkatalog) is structured by object themes, object groups and object types. Objects consist of geometrical boundaries, attributes and relationships to other objects. Both the DLM and the DKM are for the time being scale-based (due to the ease of implementation?). Separate DLMs are specified for the scales 1:25000, 1:200000 and 1:1000000 (DLM25, DLM200 and DLM1000 respectively).
* Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland
The object types are chosen in accordance with the regulations for the German Topographic State Survey, but the set of object types is supposed to be open and extendible and will therefore be able to include new object types and information. The DLM consists of seven object themes:
• Control points
• Settlement
• Transportation
• Vegetation
• Water(ways)
• Areas
• Relief
The first six themes are considered two-dimensional and are collectively called the digital situation model (DSM), while the relief theme comprises the digital terrain model (DGM).
Figure 4-11 The three level information model of ATKIS [ATKIS89]
The DLM is a three-level way of attacking the problem of real-world modelling (see Figure 4-11). The modelling is done on an object-part basis. The highest level is the semantic level, covering most attributes of the data objects, the medium level is the topological level, and the lowest level is the geometrical level. The primitive object types supported are point, line, area and raster objects. Complex objects can be built from these primitives (for instance a waterway consists of lakes (area), rivers (line/area) and canals (line/area)). Every object has a so-called global identity.
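As a reading aid, the three-level, object-part based organisation of ATKIS objects described above can be sketched as a simple data structure. All class and attribute names in this sketch are illustrative assumptions, not ATKIS terminology:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the ATKIS three-level object organisation
# (semantic / topological / geometrical); names are assumptions.

@dataclass
class GeometricPrimitive:      # lowest level: geometry
    kind: str                  # "point", "line", "area" or "raster"
    coordinates: list

@dataclass
class ObjectPart:              # medium level: topology
    geometry: GeometricPrimitive
    neighbours: list = field(default_factory=list)

@dataclass
class AtkisObject:             # highest level: semantics
    global_identity: str       # every object has a global identity
    object_type: str           # from the ATKIS-OK catalogue of object types
    attributes: dict
    parts: list                # modelling is done on an object-part basis

# A complex object, e.g. a waterway built from primitive parts:
lake = ObjectPart(GeometricPrimitive("area", [(0, 0), (2, 0), (1, 2)]))
river = ObjectPart(GeometricPrimitive("line", [(1, 2), (3, 5)]))
waterway = AtkisObject("W001", "waterway", {"name": "Example"}, [lake, river])
print(len(waterway.parts))  # → 2
```

The sketch only illustrates the layering; real ATKIS object parts additionally obey the topological constraints discussed below.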
Object parts have to be topologically atomic, meaning that attributes and relationships cannot change within an object part. Non-redundant storage is provided by allowing several object parts to share a vector element. ATKIS does not provide mechanisms for the storage of historical data, but the time of latest change is maintained for each object. The Swedish STANLI project (a national project for the standardisation of landscape information) has reviewed the ATKIS DLM through real-world case studies [STANLI91]. Their comments on ATKIS can be summarised as follows:
+ The topological and geometrical organisation of ATKIS is good.
- ATKIS has no support for non-geometrical objects (and relationships to such data), historical data and quality data.
- ATKIS does not support complexity levels. Such levels should be provided, so that simple data (pure geometry) could be organised without the complexity introduced by more sophisticated data (containing object identities and relationships).
Commentary
ATKIS is a good attempt at providing a standard data model for geographical information. It may be incomplete, but what it covers, it covers reasonably well. The scale-based approach is questionable, and ATKIS needs to include time, quality and 3D objects to become useful for most needs. ATKIS seems to be a good starting point for further efforts on standardisation for geographical data modelling.

4.5.2 SDTS
The Spatial Data Transfer Standard (SDTS) is a standard for spatial data modelling and transfer for the USA [USGS90]. The US Geological Survey (USGS) has co-ordinated the work with the standard, which has been a long-term effort [Moellering86]. The work was initiated in 1980 and a final proposal of the SDTS as a FIPS (Federal Information Processing Standard) was completed in July 1991. The SDTS was approved as a FIPS (FIPS 173) in 1992. The SDTS specifies modules for describing data organisation, data formats and data quality.
Data modelling
SDTS uses the following modelling concepts from the object-oriented programming systems literature [USGS90].
• Phenomenon: A fact, occurrence or circumstance (SDTS transfers information about phenomena that are defined in space and time and are described by using a fixed location, i.e. spatial phenomena).
• Classification: Assignment of similar phenomena to a common class. An individual phenomenon is an instance of its class.
• Generalisation: A process in which classes are assigned to other (higher level) classes. The general class contains all the instances of the constituent classes.
• Aggregation: The operation of constructing more complex phenomena out of component phenomena.
• Association: The assignment of phenomena to sets, using criteria different from those used for classification.
The classes of (spatial) phenomena that are of interest for data modelling are called entity types, and the individual phenomena are called entity instances. An entity’s digital representation consists of one or more spatial objects. A spatial object that represents all of a single entity instance is an entity object. Entity objects have locational attributes (spatial address), non-locational attributes and relationships (e.g. topology). A feature is defined as the combination of a phenomenon and its representation. An attribute is a characteristic of a class. The combination of values of the key attributes forms a unique identifier for each entity instance (as in the relational database model). A relationship is a special case of an association, and can exist between entity types. SDTS represents entity instances as static, without a temporal dimension. All temporal characteristics are expected to be treated as ordinary attributes, and have not been standardised.
Some temporal aspects are taken care of by the lineage part of the quality transport module, which incorporates information on data collection and later modifications.
Spatial data types
In the SDTS, both geometry and topology (or a combination) are defined as valid representations for objects with spatial attributes.
Geometry
The following geometrical objects are included in the SDTS.
• Point: A zero-dimensional object that specifies geometrical location.
• Line segment: A directed line between two points.
• String: An ordered sequence of points, representing a line.
• Arc: An ordered sequence of points that forms a curve that is defined by a mathematical function.
• G-Ring: An ordered sequence of strings and/or arcs.
• G-Polygon: An area bounded by an outer G-ring and zero or more inner G-rings, none of which are collinear or intersecting.
• Pixel: A two-dimensional picture element that is the smallest non-dividable element of an image.
• Grid cell: A two-dimensional object that represents an element of a grid.
Topology
The following topological objects are included in the SDTS.
• Node: A zero-dimensional object, which may bound one or more links or chains.
• Link: A connection between two nodes. A link cannot intersect other links, and may be directed by ordering the two nodes.
• Chain: A directed non-branching sequence of non-intersecting line segments and/or arcs connecting two nodes (not necessarily distinct). Can specify start and end node (network chain), left and right polygons (area chain) or both (complete chain).
• GT-Ring: A ring created from complete and/or area chains.
• GT-Polygon: An atomic two-dimensional component of one and only one two-dimensional manifold, bounded by GT-Rings. The universe polygon is the part of the universe that is outside the area covered by other GT-polygons. A void polygon is an area that is bounded by other GT-polygons, but has the same characteristics as the universe polygon.
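The chain variants listed above differ only in which topological references they carry. The following sketch illustrates that classification; it is an illustration of the definitions, not the SDTS encoding, and all class and function names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of SDTS chain classification; illustrative names only.

@dataclass
class Point:                 # zero-dimensional, specifies geometrical location
    x: float
    y: float

@dataclass
class Node(Point):           # topological: may bound links or chains
    pass

@dataclass
class Chain:                 # directed, non-branching sequence of segments
    points: list             # ordered vertices of the chain
    start: Optional[Node] = None          # set for network and complete chains
    end: Optional[Node] = None
    left_polygon: Optional[str] = None    # set for area and complete chains
    right_polygon: Optional[str] = None

def chain_kind(c: Chain) -> str:
    """Classify a chain according to the SDTS definitions quoted above."""
    has_nodes = c.start is not None and c.end is not None
    has_polygons = c.left_polygon is not None and c.right_polygon is not None
    if has_nodes and has_polygons:
        return "complete chain"
    if has_nodes:
        return "network chain"
    if has_polygons:
        return "area chain"
    return "string"          # plain geometry, no topology

c = Chain([(0.0, 0.0), (1.0, 1.0)], Node(0.0, 0.0), Node(1.0, 1.0),
          left_polygon="P1", right_polygon="P2")
print(chain_kind(c))  # → complete chain
```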
Aggregate spatial objects
The following aggregate spatial objects are included in the SDTS.
• Planar graph: A graph that can be drawn on a planar surface without introducing intersections of the links and chains.
• Network: A graph without two-dimensional objects (containing only points and lines).
• Two-dimensional manifold: A planar graph and its associated two-dimensional objects. Each chain bounds two, and only two, not necessarily distinct, GT-polygons. The GT-polygons are mutually exclusive and completely exhaust the surface.
• Image: A two-dimensional array of regularly spaced elements constituting a picture.
• Grid: A regular (the repeating pattern is a square, an equilateral triangle or an equiangular and equilateral hexagon) or nearly regular (the repeating pattern consists of rectangles, parallelograms or non-equilateral triangles) tessellation of a surface.
• Layer: An integrated, spatially distributed set of data representing entity instances within one theme, or having one common attribute value in an association of spatial objects.
• Raster: A number of overlapping layers for the same grid or image.
Data quality
The data quality part of SDTS identifies the following aspects of (spatial) data quality. Each of these aspects is covered by a mandatory transfer module, and the tests that have been performed to establish the quality measures shall be described in the modules.
• Lineage: The source and history of the data, including methods of measurements/derivation, transformations, control information used and dates of validity/collection.
• Positional accuracy: The degree of compliance to the spatial address standard used, including references to the test used to establish the accuracy.
• Attribute accuracy: Both numerical and classification accuracy is covered. The tests used to establish the accuracy shall be described.
• Logical consistency:
Describes “the fidelity of the relationships encoded in the data structure of the digital spatial data”. Tests used to establish the consistency measure shall be described. Covered are graphic data consistency, topological data consistency and general data structure consistency.
• Completeness: Describes “the relationship between the objects represented and the abstract universe of all such objects” (exhaustiveness). The geometrical thresholds used (minimum area, shortest lines, …) shall be stated.
The data quality modules only provide textual descriptions for all the quality measures. Numerical fields are not present in any of the quality modules.
The transfer format
A data transfer using the SDTS consists of a number of files, each containing a transfer module. Cross-referencing between the modules is facilitated by using unique module names and by numbering the records within each module. ISO* 8211 is used for encoding the modules, and the coding scheme is described in the SDTS documentation. A comprehensive thesaurus contained in the SDTS documentation provides a standard nomenclature of spatial features for use in the SDTS. For an SDTS data transfer to be valid it will have to include the following modules, each contained in a separate file:
• Identification module
• Internal spatial reference module
• External spatial reference module
• Catalogue/Directory module
• Catalogue/Cross-Reference module
• Catalogue/Spatial-Domain module
• Spatial-Domain module
• All quality modules (lineage, positional accuracy, attribute accuracy, logical consistency and completeness)
In addition to these mandatory modules, a number of data modules and modules with auxiliary information will be present in a typical transfer. Non-spatial attributes are organised according to the relational paradigm and the SQL standard, using foreign keys to facilitate joins.
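A transfer validator would, among other things, check that the mandatory modules listed above are present. A minimal sketch of such a check follows; the module names are taken from the list above (with the quality modules expanded), and the code is illustrative only, not part of the SDTS specification:

```python
# Illustrative check for the mandatory SDTS transfer modules listed above.

MANDATORY_MODULES = {
    "Identification",
    "Internal spatial reference",
    "External spatial reference",
    "Catalogue/Directory",
    "Catalogue/Cross-Reference",
    "Catalogue/Spatial-Domain",
    "Spatial-Domain",
    # the five mandatory quality modules:
    "Lineage",
    "Positional accuracy",
    "Attribute accuracy",
    "Logical consistency",
    "Completeness",
}

def missing_modules(transfer_modules: set) -> set:
    """Return the mandatory modules absent from a transfer."""
    return MANDATORY_MODULES - transfer_modules

# A transfer lacking its lineage quality module is not valid:
transfer = MANDATORY_MODULES - {"Lineage"}
print(missing_modules(transfer))  # → {'Lineage'}
```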
* International Organization for Standardization
Commentary
The SDTS was, when it was completed, the most comprehensive effort at specifying a well-founded standard for the exchange of geographical information. But, unfortunately, it does not provide a standard geographical data model. The introductory parts of the document provide concise definitions of geographical data and data modelling concepts. This part is perhaps the most valuable of the whole document. The SDTS has been specified in such a way that data encoded in all kinds of spatial data models can be transported. It does not put any requirements on the structure of the data. It supports topologically complete data as well as spaghetti data. This is one of SDTS’s strengths, but perhaps also its greatest weakness. It has become too comprehensive and complicated. There are a lot of different possibilities for even the simplest spatial objects. This means that the job of making an SDTS translation module for import of all kinds of data will be very difficult and time-consuming. If the SDTS had aimed at specifying a new “standard” data model, limiting the number of alternative terms, it would have been much more useful as a standardisation vehicle. The strong part of the SDTS is that it covers geometry, topology and data quality in a very general fashion, and the same applies to attribute data and cross-referencing. The value of the very detailed and voluminous thesaurus is more questionable. A less detailed framework for naming and classifying GIS objects (for instance into hierarchies and themes), or a standard for geographical data dictionaries, would probably have been more valuable. All in all, the SDTS is a valuable step in the direction of an international standard for the exchange of geographical information, but it is not enough.
4.5.3 NGIS and FGIS
The Norwegian Mapping Authority (Statens Kartverk) is also involved in standardisation in the field of geographical information. The NGIS and FGIS projects were the main efforts in Norway in this area around 1990.
NGIS
NGIS* is a multi-million NOK project run by the Norwegian Mapping Authority. It was initiated in 1989, and the aim was to specify a national server for geographical data. The centre (NGIS) is meant to provide the basic geographical information contained in the national map series and registers in a standardised digital format to the community in the late 1990s as an on-line database service. The development of NGIS was initiated to meet the Norwegian society’s future needs for (digital) geographical information. NGIS shall offer logically integrated 3-dimensional geographical data, and not only spaghetti digital representations of the traditional paper maps. NGIS was planned to be main-frame based and centralised, with large capacities for data storage, and other vendors of geographical information are to be offered to use NGIS as a host for distribution of their data.
* Nasjonalt Geografisk InformasjonsSenter (in Norwegian) = National geographical information centre.
NGIS will encompass information from all the services of the Norwegian Mapping Authority, including:
• Topographic maps in the national map series 1:50000, 1:250000 and other smaller scale maps
• (Sea-) Navigational charts
• Economic maps of scale 1:5000–1:20000 (with detailed infrastructure and borders for vegetation, property and administration)
• Registers for addresses, properties and buildings (GAB)
• The road network
• The national mesh of triangulation fixed points
The information contained in NGIS will consist of alphanumerical data, vector/geometrical data and image/raster data.
The amount of data contained in the NGIS database is forecast to grow to 100–200 gigabytes during the first 5 years of operation. The data will be available to the community through public and private telephone and data networks, and it is expected that ISDN with its broad-band services will provide sufficient capacity for the potentially large transfer volumes. Data selection from the NGIS database can be based on geographical criteria (some specified region), thematic criteria (hydrography, topography, land-use, ...), scale and others. Easy access to the data is one of the main concerns of the NGIS work, and a user-friendly and intuitive interface is considered very important. Window-based client tools for NGIS access will be developed to accomplish this. In order to make NGIS as open and available as possible, a relational database with a standard SQL interface is preferred for data management. The data will be delivered in the Norwegian Mapping Authority’s in-house transfer format (SOSI*) [SOSI90]. NGIS seems to be a useful and necessary project for the society, but it can be argued that in this case one would probably be better off using a distributed approach to geographical data management than the proposed centralist approach (as discussed in chapter 3), because the responsibility for updating the data will be geographically distributed. There is, however, a need for a central metadata management site, providing an interface for searching for data sets.
FGIS
FGIS** was a standardisation project with participants from industry and the Norwegian Mapping Authority, initiated as a part of the NGIS project to specify the data model, the data structures and the interfaces of geographical data servers. It was carried out under the supervision of the Norwegian Mapping Authority. FGIS was initiated in 1989 and finished in 1990 with specifications covering an exchange format, a data model and a geographical information system kernel application interface [FGIS90].
The goals of the FGIS project were to [FGIS90]:
* Samordnet Opplegg for Stedfestet Informasjon (in Norwegian) = Co-ordinated arrangement for geographical information
** Felles-GIS (in Norwegian) = shared GIS
• Contribute to simple and consistent exchange of data between geographical applications and systems
• Contribute in the area of user interfaces in such a way that FGIS-based geographical applications can have a uniform appearance and follow industrial standards
• Support the administrative, structural and security aspects of FGIS-based geographical applications
• Make sure that users of FGIS-based geographical applications will be able to utilise new technology as soon as possible
The FGIS results were to be used for the NGIS database and its interfaces. To secure the portability of geographical applications, the efforts have been based on a platform of international standards or de facto standards (the ones explicitly mentioned are: the ANSI SPARC three-schema architecture for database design, the SQL database query language, the Unix operating system (POSIX and X/OPEN), the X-windows user interface environment, the C programming language, the EDIFACT data exchange standard, the TCP/IP communication protocol, the GKS and PHIGS computer graphics standards). The geographical data transfer format was specified by developing the Norwegian Mapping Authority’s SOSI geographical data coding standard further.
Figure 4-12 The FGIS system components [FGIS90]
The FGIS architecture
The FGIS project has proposed an architecture for open GIS systems. The architecture is high-level and intuitive and is shown in Figure 4-12. The core of the system is the FGIS kernel. The other parts of the system are tied together through the kernel.
The EDI Management provides the interface to the rest of the world of GIS systems through EDIFACT and SOSI, while the Database Management System and the Data Dictionary provide an interface to the GIS data and the data model (metadata). The concurrency control provided in the FGIS database is a check-in/check-out mechanism. The applications communicate with the kernel through the FGIS API (Application Program Interface). The applications shall provide a standardised window-based interface (preferably X-windows) to the users.
Data modelling
A geographical meta model is specified that is an extension of the ER model. This model provides the framework for the specification of FGIS-compliant (object-oriented) geographical data models. A sketch of the FGIS geographical meta model is shown in Figure 4-13.
Figure 4-13 The FGIS geographical meta model (based on a figure in [FGIS90])
• Object is used instead of entity (a more familiar concept)
• An object will have to belong to a certain geometrical class (neutral object, point object, line object, surface object or volume object)
Point objects: Poles, measured points, border markers and buoys can all be represented as point objects
Line objects: Linear features such as telephone lines, communication lines and centre lines of roads constitute the trasé (route) objects
Region objects: Properties, forest parcels and parking lots are examples of region objects
Volume objects: Geological structures, houses and water reservoirs can be seen as volume objects
Neutral objects:
Non-spatial objects such as persons or companies fall into this group
• Geometrical constraints (topological and spatial) are introduced
• Global and local relationships (complex objects can be constructed using local relationships, where identification does not have to be globally unique)
• The entities and relationships of the path concept are quite vaguely described, and are supposed to capture “equality-of-path” problems (no matter what path one takes in the model, using different sequences of relationships between two object types, one should always end up with the same connected objects)
Other (non-geographical) mechanisms available for modelling:
• All objects can have attributes and relationships to all types of objects (including sub-typing)
• A variety of constraints are available, among them cardinality of relationships, domains and keys/identifiers
The implementation of the geometrical structure of a spatial object is hidden, and separated from the other properties of the object. Many geographical objects can therefore refer to, and thereby share, the same hidden geometrical structures. This means that the border between two properties can be the same geometrical object as the centre line of a road (if the road is moved, the border moves with it). FGIS suggests the use of only two primitive geometrical objects, namely the point and the line. A line represents a connection between two points. The other spatial objects will have to be built from these primitives. The support for a terrain elevation model is only described at a very high level [FGIS90]: it should be possible to calculate the elevation of any chosen point in the terrain from the elevation data in the database.
Interfaces
SQL is used for internal database access, but a special-purpose object-oriented geographical database interface called FAPI (FGIS Application Interface) has been specified to hide the SQL interface.
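Returning briefly to the geometry model: the sharing of hidden geometrical structures described above (a property border and a road centre line referring to the same line primitive, so that moving the road moves the border) can be sketched as follows. All class names are illustrative assumptions, not FGIS terminology:

```python
# Sketch of FGIS-style geometry sharing: only point and line primitives,
# and several geographical objects may refer to the same line.

class PointPrim:
    def __init__(self, x, y):
        self.x, self.y = x, y

class LinePrim:
    """A line represents a connection between two points."""
    def __init__(self, a: PointPrim, b: PointPrim):
        self.a, self.b = a, b

class GeoObject:
    def __init__(self, name, geometry: LinePrim):
        self.name = name
        self.geometry = geometry   # a reference, not a copy

shared = LinePrim(PointPrim(0, 0), PointPrim(100, 0))
road_centre_line = GeoObject("road centre line", shared)
property_border = GeoObject("property border", shared)

# Moving the road moves the border with it, because the geometry is shared:
shared.b.y = 10
print(property_border.geometry.b.y)  # → 10
```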
The data exchange between different systems over public and private networks has been investigated, and EDIFACT combined with SOSI is suggested as an important interface component. Remote data must be accessed through the FGIS Kernel and the EDI Management component. An EDIMS* will govern the interchange of data with external databases according to the X.200** standard.
Commentary
The “open” approach of FGIS is very useful, allowing integration with other information systems by emphasising the use of international standards in all possible areas. FGIS is very ambitious and comprehensive in its coverage. The geographical meta model is well formulated (except, perhaps, for the path concepts), but the geometry data model could have been more sophisticated.
* Electronic Data Interchange Management System
** Reference Model of Open Systems Interconnection for CCITT Applications
A question is whether two geometrical primitives (point and line) are sufficient for a general purpose geographical data model. The lack of an explicit surface representation mechanism makes 3D support limited. For 3D modelling, e.g. for geology, a general purpose surface representation would be useful. The only 3D phenomenon mentioned in the FGIS proposal is terrain elevation, so it seems likely that general 3D objects have not been considered. The two most important aspects of geographical data that are not covered explicitly by FGIS are time and quality.
• Time is never mentioned, but should be an intrinsic part of a general purpose geographical meta model
• The spatial accuracy part of quality could be included as a geometrical constraint in the FGIS geographical meta model. Completeness and logical consistency are, however, in many instances data set oriented.
To be able to accommodate these concepts, neutral objects must be introduced by the modeller to represent the data set aggregations, and quality measures could then be attached as attributes to these neutral aggregation objects. The problem with this approach is that everything must be modelled by the users. All quality aspects should be intrinsic to the geographical meta model.
Other deficiencies of less importance:
• The meta model does not provide mechanisms for handling samples and interpolation (apart from the vaguely described terrain elevation support). Fields are thus not supported
• Complex objects / aggregations are not supported directly
• Roles/scale (different representations) are not addressed
• Derived objects are not mentioned
• The grouping of spatial objects into themes is not supported
Even though the FGIS geographical meta model does not cover everything of importance to geographical data modelling, it did provide a good step in the right direction. The FGIS results and ideas are being considered, together with several other European efforts, in the CEN* attempt at specifying a standard exchange format and data model for the European GIS market (CEN TC 287), as well as in similar ISO efforts (ISO/TC 211).
4.5.4 MetaMap
MetaMap is a multi-resolution model proposed by the research establishment SINTEF SI in Oslo, Norway for representing and handling spatio-temporal information [Misund93]. The proposal description is not very detailed, so a lot of questions remain unanswered.
* CEN are the initials of the association of European national standardisation organisations
Delta-representation
MetaMap represents temporal multi-resolution geometrical objects as a basic object plus a number of geometrical and temporal deltas/differences.
The base geometrical object (M0) in MetaMap is the crudest possible representation of the oldest known instance of the object. The scale-oriented geometrical deltas (Dn) add detail to the geometry for representation at larger scales (increased resolution), while the temporally oriented deltas handle temporal changes to the geometry. An example of the derivation of a particular object variant, Mi:
Mi = M0 + D1 + D2 + ... + Di
This concept is claimed to give a compact representation of spatio-temporal objects.
Commentary: The combination of the multi-resolution and temporal deltas has not been described in [Misund93], so it is difficult to evaluate the benefits of the approach when it comes to compactness.
Object-oriented features
All geographical phenomena are represented in MetaMap as objects in accordance with the COM* of the OMG** [Soley95]. All geographical objects should have a thematic description and a geometrical description. This is the proposed type definition for a geographical object/entity that is a specialisation of Object (from COM):
type Entity ≤ Object {
  nameE : (e : Entity) → (a : Attribute);
  geoE : (e : Entity) → (g : Geometry);
  infoE : (e : Entity) → (i : Information)
}
The geometrical and thematic descriptions are linked using an object identity-based thematic-geometrical relationship (in addition to the implicit aggregation relationship for the geographical object). This relationship is to cover both pure geometry and topology. It must also ensure that thematic versions are related to the geometrical versions in a correct and consistent way. MetaMap shall also be able to accommodate multimedia object types.
Geographical object organisation
MetaMap organises geography as a hierarchy of geographical objects that are representable in 3 dimensions. The mother of all geographical objects is the surface of the earth.
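The delta derivation described earlier in this section (Mi = M0 + D1 + ... + Di) can be sketched as follows. Since [Misund93] does not describe how the deltas are actually encoded, the vertex-list deltas below are purely an assumption for illustration:

```python
# Sketch of the MetaMap delta derivation Mi = M0 + D1 + ... + Di.
# A "delta" is here assumed simply to append detail vertices to a crude
# base geometry; the real MetaMap encoding is not described in [Misund93].

def derive_variant(m0: list, deltas: list) -> list:
    """Apply the deltas D1..Di in order to the base object M0."""
    variant = list(m0)
    for delta in deltas:
        variant = variant + delta      # each delta adds geometric detail
    return variant

m0 = [(0, 0), (10, 0)]                 # crudest possible representation
d1 = [(10, 10)]                        # detail for a larger scale
d2 = [(0, 10)]                         # still more detail
m2 = derive_variant(m0, [d1, d2])
print(len(m2))  # → 4
```

The sketch illustrates why the representation can be compact: only the base object and the (small) deltas are stored, and any variant Mi is derived on demand.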
An object does not have to be explicitly linked to its mother object; the geographical coordinates of the object can serve as an implicit reference.
Commentary: A hierarchy of geographical objects makes local geographical reference systems possible, and can therefore provide a more convenient and compact representation of local objects. The notion of hierarchies is probably more relevant to object structures (CAD-type data) than to the structures one may or may not find for natural phenomena.
* Core Object Model
** Object Management Group: A group of representatives from the computer hardware and software industry that develop standards for object-oriented system development
Primitive geometrical objects
MetaMap recognises three categories of objects based on their geometrical properties:
• curve/line (implicit/explicit 3D and simple/complex structure)
• surfaces (implicit/explicit 3D and simple/complex structure)
• solids
Points are not mentioned/recognised as an object category. Implicit 3D representation means that the object inherits the 3D properties (elevation) from its ancestors, and therefore can be represented using only 2D coordinates. Simple structure means that there is only one (atomic) element, while complex structure could for instance indicate a network of geometrical objects.
Future
MetaMap constitutes the core geographical model for a strategic technology development program on geographical information technology funded by the Norwegian Research Council. The program aims at developing advanced applications for analysis and presentation of geographical data.
Contributions and problems
A possible contribution of MetaMap is the integrated representation of temporal and multi-resolution data in the geometry. It should also be possible to integrate quality into the model, both for geometry and thematics, and this would make it even more innovative.
The details on the implementation will show if this is a fruitful approach or not. A discussion regarding the support for, and possible consequences of, changes in dimensionality as a result of change in resolution (e.g. from 3D (volume) to 2D (region) and 1D (point) for a house) is presently lacking. The object-oriented hierarchical approach to geographical data modelling is of more questionable value, but could for instance be useful for the inheritance of elevation information and for certain classes of man-made features. The integrated organisation of geometry/topology and thematics for geographical objects is very vaguely described, but the principle of sharing geometry between objects is a good one. The handling of topology is not described in sufficient detail for further treatment. The exclusion of geographical point objects is unfortunate.

Chapter 5 Sub-Structure Abstraction in Geographical Data Modelling
The purpose of a high level data model is to model a certain selection of real world phenomena as accurately, convincingly and completely as possible using various structuring and abstraction mechanisms. For geographical data in particular, we want to be able to model geographical phenomena in a context that suits our purposes. In this chapter, a structuring method for ER diagrams that takes care of some of the peculiarities of geographical data is proposed. Emphasis is put on providing intuitive and expressive diagrams. The approach uses sub-structure abstractions, and builds on the work on sub model substitution (SMS) proposed by Bédard [Bédard89]. The method can be used to emphasise overall structure in large models with a huge number of entities and relationships, avoiding too complex diagrams by using multiple levels of abstraction in the specifications. Geographical data models, as specified in these diagrams, must be translatable into database conceptual schemas.
How to automate the translation step has not been investigated in this thesis.
5.1 Context
The following presentation concentrates on the structural aspects of geographical data modelling. The resulting methodology is not meant to be a general GIS modelling method. General modelling methods should also incorporate behavioural aspects. There have been several reasons for not including behaviour in this thesis. First of all, data sharing between various applications is one of the most important goals in current GIS research. Such co-operation on data usage relies upon a common structural data model for efficient data management and application development. And, as mentioned in chapter 4, behaviour is probably not the most central aspect of human interpretation of most geographical phenomena. Finally, the behavioural aspects of object-oriented modelling are far less understood than the structural part [Beeri90], and their inclusion should therefore await further progress in OO research. Consequently, inclusion of behavioural aspects in geographical data modelling seems too ambitious considering the current state of the art in modelling and GIS research, so structural modelling will be the centre of focus. There should, however, be a place for behavioural modelling for geographical data, for instance for representing the seasonal variations of spatial phenomena. Geographical data modelling methods should incorporate the same set of basic tools and paradigms that are used in other branches of information system data modelling. The particular nature of geographical data will have to be reflected by geographical augmentations to the traditional data modelling methodologies. As discussed in chapter 4, spatial, temporal and quality aspects are particularly important for geographical data.
These concepts must be covered in such a way that the augmented model keeps its good characteristics with respect to structuring, expressive power, expressive economy and (visual) clarity/perceptibility [Sindre90]. This can partly be achieved by incorporating quality measures as an implicit part of the structure of the data model, avoiding the extra symbolism that explicit representation would require. This approach is akin to the one taken for temporal data models, where all objects get temporal properties implicitly. The result of an integration of quality measures into the data model could be termed a quality data model. Modelling of geographical information can in many cases be relatively easy. Many application domains see a limited number of interesting phenomena that can be easily structured. A challenge in geographical data modelling is to provide models that allow sharing of digital representations of “semantically rich” geographical phenomena. In order to meet this challenge, it will be important to determine a handy set of spatial data types, good generalisation hierarchies, and object groupings, so that the models can be useful both for data sharing and for communicating ideas in the system- and database design process. The need for data sharing, through for instance public databases, encourages the development of standard geographical data models that are relevant for a large spectrum of geographical data users. As discussed in chapter 4, geographical phenomena often play many different roles. Geographical data can therefore be used in a variety of contexts by the data users. This makes development of a general purpose data model for a geographical phenomenon difficult, if not impossible. Even if it were possible to develop a general purpose data model for a phenomenon, there will very often be a need for multiple views on this complex model, each view tailored to the needs of a particular type of user group (role view). As an example, consider roads.
• For a transport company or an ambulance, a road is a potential part of a route, which in turn is a connection between two places. The interesting properties of the road are those determining its suitability for transport (length, cover, roughness, speed limits and surroundings such as settlements).
• For the construction company, the road consists of many types of foundation and material in several layers, varying along the road.
• For a telecommunications or other cable company, roads could act as barriers for cable ditches.
• For a farmer, a road could act as part of the border of a field.
A forest stand is another example.
• For a harvester, it is a stand of timber at a certain stage, with a certain economic value and a certain future potential for timber production.
• For a zoologist, it is a habitat for various kinds of animals.
• For a geographer and a meteorologist, it is a climate factor, buffering water and stabilising temperature.
• For a botanist, it could be a collection of individual plants.
• For a landscape architect, it is an important visual element in the landscape.
• ... and so on.
Considering the possibly diverse roles of geographical objects, a general purpose geographical data model will either have to incorporate many subtle representations and relationships, making it very complex, or it will have to give priority to certain properties that will be sufficient for most needs, facilitating the more subtle needs in less obvious ways. A goal for this part of the thesis is to specify some low-level building blocks on which geographical data models can be based, and a framework for building structures out of these basic elements in a consistent way. The modelling approach should reflect the semantic richness of geographical data by allowing individual views, and not constrain the modelling unnecessarily by building a fixed and rigid type/class hierarchy.
Such a hierarchy should be specified by the modeller, not the modelling methodology. It should also be up to the application modeller to decide which views of the data are of interest within a given setting (houses can for instance be viewed as point objects, region objects or volume objects). Another goal of this work is to allow for abstractions and information hiding in the model diagrams, by introducing high level structuring mechanisms. Finally, the model, as specified in the diagrams, should be translatable into database models.
5.2 Geographical data modelling using structure abstractions
The modelling framework chosen is an extended entity relationship approach (see chapter 2). Quality measures are expected to be an integrated part of the structure of the data model, in the same way as time is integrated in temporal data models. The result of the integration of quality measures into the data model could be called a quality data model. When quality is integrated at the model level, standardisation of the query language interface with respect to quality will also be feasible. How to integrate quality aspects into the data model is a subject for further research. The presented method provides more powerful abstractions, and thereby information hiding in the model diagrams, by introducing symbology in the data model diagrams (the visualisation mechanisms of the data model).
5.2.1 Extending ER-diagrams with sub-model substitution
Traditional ER diagrams are, as discussed in chapter 4, not well suited for GIS data modelling without modifications. Extensions will therefore be necessary. EER diagrams and other semantic approaches are examples of earlier and more general work on extending the ER model and its diagrams.
The following extensions build upon earlier efforts on EER models [Elmasri89], augmenting their most useful abstractions with more specific geographical data modelling constructs.
Sub-Model Substitution
New symbols are incorporated into the ER-type diagrams in order to make them better suited for the communication of geographical data models. The approach is a further refinement of the Sub-Model Substitution (SMS) approach of Bédard [Bédard89]. The ER tradition of using rectangles for object types is continued, but icons are added to the rectangles to visualise important properties of the object type/entity. The number of different icons has to be kept as small as possible to allow easy perception of the diagrams. Many icons can be placed in a single rectangle (entity) in order to indicate the object type’s characteristics more completely. Combining different icons for the same object type is akin to multiple inheritance. The use of icons can therefore also be seen as a new way of visualising certain specialisation relationships. Icons will, hopefully, make the diagram structure less complicated by hiding standard components and relationships, and symbolising them in an intuitive fashion. The perceptibility of the diagrams is expected to increase considerably. The first icons presented here cover spatial aspects (geometry, topology, etc.). After these spatially oriented icons, a set of more general purpose icons and mechanisms are presented, covering time and traditional abstractions.
3D
The 3D icon indicates that the object type should be represented in 3-dimensional space, and hence be available for 3D analysis. This icon will be placed in the rectangles of all modelled object types that have interesting 3D properties and that behave as more than just planar objects on the earth’s surface. A proposal for a 3D icon is shown in Figure 5-1 (a shaded “3D” symbol).
Figure 5-1 Representation of an entity/object type with 3-dimensional properties.
Geometrical constraints: Every object that has a 3D icon attached must be related to a 3-dimensional space (elevation in the case of terrain surfaces). For geographical objects this will mean a 3D geographical position/extent indicator.
Geometry
Spatial objects need special attention. Geometry is a basic property of spatial objects. Geometrical objects are geographical objects with geometrical properties. They constitute the basis of spatial information systems. Useful geometrical object types are points, lines, regions in 2D, surfaces in 3 dimensions and volumes/regions in 3D. The first two of these object types can have both 2D and 3D properties. This can be shown by using the 3D icon in addition to the geometry icon whenever the third dimension is of interest. A volume is inherently 3-dimensional. A region (2D and 3D) can be uniquely defined by its bounding lines/surfaces. To represent variation over the interior of geometrical objects, a field is needed. In some contexts it can be interesting to show that a geometrical object is geographically referenced. This could for instance be accomplished by including a globe icon in the entity box.
Point
An example of an icon to represent a geometrical point is shown in Figure 5-2. A rectangle containing a point icon represents a phenomenon that is interesting as a zero-dimensional geometrical object.
Figure 5-2 Representation of a point entity/object type using a point icon.
Constraints: A point should have a single reference to a geographical reference system (datum and projection), and coordinates in at least two dimensions.
Line
A geometrical line can be represented in the diagrams by using a line icon as illustrated in Figure 5-3. A geometrical line can be thought of as a one-dimensional geometrical object.
Figure 5-3 Representation of a line entity/object type using a line icon.
Constraints: A line should have two well-defined end-points. A line should have an internal representation conveying the shape of the line.
Region
Figure 5-4 is a proposal for a 2D region icon. The region is an example of a geometrical object that is inherently two-dimensional.
Figure 5-4 Representation of a region entity/object type using the 2D region icon.
Constraints: A region must have a defined interior and a closed boundary of lines.
Surface
A surface is similar to a region, being bounded by lines. All points on a surface have a location in three-dimensional space. A surface can be arbitrarily folded. A surface could be represented using the region icon in combination with the 3D icon (Figure 5-5).
Figure 5-5 Representation of a surface entity/object type using the 3D and region icons.
Constraints: A surface must have a closed boundary consisting of (3D) lines. 3D coordinates must be available at all positions on the surface. It must be possible to determine the characteristics of the neighbourhood of a surface point (curvature, etc.).
Volume
A volume (3D region) is a three-dimensional object bounded by surfaces. It could have its own icon, and a suggestion is shown in Figure 5-6.
Figure 5-6 Representation of a volume entity/object type using the volume icon.
Constraints: A volume must have an interior and a closed boundary consisting of surfaces.
Varying phenomena
Phenomena that vary over the interior of a spatial object (fields) are important, particularly for environmental modelling (elevation, rainfall, temperature, soil and geology), but also for infrastructure (road information that varies along the road, e.g. speed limit and elevation). Many natural phenomena vary in a continuous fashion.
When these changes over the interior of an object are to be represented in the database (for example by sampling), a field icon could be included, as shown for a 2D region object with a non-homogeneous interior in Figure 5-7 (this entity could for instance represent elevation). To symbolise change/variation, the sine curve has been used in this icon.
Figure 5-7 Representation of a varying entity/object type using the field icon.
Constraints: The field icon must be used in combination with a geometrical icon (line, region, surface or volume).
Other geometrical objects of interest
Initially, two additional useful geometrical object representations have been identified: the raster and the sample-set. Both of these can be combined with the 3D icon.
Raster
The raster icon can be used whenever matrices of values (measurements) appear, and a proposal is shown in Figure 5-8. It can be used in combination with the 3D icon. This will cover remotely sensed imagery, scanned photographs and other regular (region or volume) samples. A raster is a representation of a field over a 2D/3D region.
Figure 5-8 Representation of a raster entity/object type using the raster icon.
Sample set
The sample-set icon shown in Figure 5-9 could be used for sets of samples taken irregularly in 2D or 3D geographical space. This icon can be useful for representing point probes for the purpose of classification, monitoring and taxation of natural resources (continuously varying environmental phenomena).
Figure 5-9 Representation of a sample-set entity/object type, using the sample-set icon.
Topology
Spatial objects that take part in structures have topology as an important extra property (chapter 4). All geographical objects that take part in some kind of topology structure must also have a geometry, and are therefore just a “refined” type of geometrical objects.
A topology object can be represented in the diagrams by using both a topology icon and a geometry icon. Topology objects are spatial objects where topology properties are of interest. They are very important in spatial analysis, so both networks (Figure 5-10) and manifolds (Figure 5-11) are honoured with their own icons (the network icon and the manifold icon). A 3D manifold can be represented by composing the manifold icon and the 3D icon.
Figure 5-10 Representation of a network entity/object type, using the network icon.
Figure 5-11 Representation of a manifold entity/object type, using the manifold icon.
Time
It is essential to incorporate time into geographical data models for the purposes of monitoring and time series analysis. History objects (geographical objects where history is of interest) are represented using the time icon (a small analog clock), as shown in Figure 5-12.
Figure 5-12 Representation of a time entity/object type, using the time icon.
Examples of icon usage
In a road network, a road object type could have the line icon and the network icon attached, while if we want to model the crossroads, they could have the point icon and the network icon attached. A house could be represented as a point (symbol), a region (2D map), or a volume. Depending on the application, one or more of these different representations can be interesting, and the representations that are of interest are included by using the corresponding icons in the object’s rectangle.
Traditional abstraction
In order to provide a basis for hierarchical data modelling, mechanisms must be provided that hide detail at the highest levels and allow for more detail at the lower levels. These can generally be termed abstraction mechanisms. The three most useful abstractions for data modelling are aggregation, generalisation/specialisation and association [Sindre90]. These abstractions could have icons for sub-model substitution.
At lower levels of the modelling hierarchy the abstractions should be represented as complete structures (as in the EER example in Figure 2-2). Aggregation is used to compose complex object types from their constituent object types. This approach is a kind of brick house approach, where the constituent object types are glued together to form the new high-level complex object. A natural way to depict this in a diagram would be to show this gluing. An attempt at this brick house approach is shown in Figure 5-13. The parts of an aggregation are generally candidates for participation in other structures (relationships, other aggregations, specialisations/generalisations or associations). This makes it difficult to use the “brick house” notation at a detailed level of modelling. Such an aggregation abstraction could be used as an icon at high levels (including only the label of the aggregate), hiding the sub-components of the structure. At lower levels, the individual object types of the aggregation will have to be represented explicitly.
Figure 5-13 An aggregation icon; the parts are not interesting as isolated phenomena.
Generalisation can be used when we want to treat objects with common characteristics as a group. Similar object types are gathered under the umbrella of the high-level object type resulting from the generalisation abstraction. It would be desirable to illustrate that the similar object types have a common interface. A proposal for such a diagrammatic representation is shown in Figure 5-14. All the object types participating in the generalisation keep their integrity. The individual object types are still able to take part in separate relationships with other object types, but using this representation, it is difficult for an object type to take part in multiple generalisation hierarchies.
The generalisation representation shown in the figure should therefore only be used as a symbol at high levels in the modelling hierarchy (incorporating only the label of the high-level object type). If there is no multiple inheritance, this representation could also be used at lower levels of the modelling hierarchy, and then as a complex of all the participating object types (labelling all the object types in the generalisation, giving them some degree of integrity). Generalisation can be performed both on generalisations and on atomic object types. To allow multiple inheritance and complex objects, a relationship based approach will have to be taken (the traditional EER model approach as shown in Figure 2-2).
Figure 5-14 A generalisation/specialisation icon.
Association is used to group object types that have something in common. They could for instance play the same role in relationships to other object types. This is a much looser coupling than aggregation and generalisation. A sub-model substitution for associations could be useful for high level data modelling, and could be shown in diagrams using a dotted outline around the associated object types. A proposal for an association abstraction is shown in Figure 5-15.
Figure 5-15 An association icon with a dotted outline.
A general sub-structure abstraction could be used to represent very high level features, such as themes (hydrography, vegetation, geology, …) in a geographical data model. This would constitute a sort of general black box that could be used for all abstractions not covered by other primitives. Rumbaugh et al. [Rumbaugh91] suggest the term module for this kind of abstraction. A modelling primitive that indicates the fuzzy nature of this component by using a cloud-like representation is shown in Figure 5-16.
Figure 5-16 A general sub-structure abstraction (module) icon, for instance for a theme.
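The icon mechanism described in this section can be summarised with a minimal sketch: an entity (rectangle) simply carries a set of icons, and combinations such as road = line + network, or surface = region + 3D, fall out naturally. The class and icon names below are illustrative only, not part of the proposed notation.

```python
from enum import Enum, auto

class Icon(Enum):
    POINT = auto(); LINE = auto(); REGION = auto(); VOLUME = auto()
    FIELD = auto(); RASTER = auto(); SAMPLE_SET = auto()
    NETWORK = auto(); MANIFOLD = auto()
    THREE_D = auto(); TIME = auto(); REMOTE = auto()

class Entity:
    """An ER entity (rectangle) carrying a set of property icons."""
    def __init__(self, name, *icons):
        self.name = name
        self.icons = set(icons)
        # Constraint from the text: the field icon must be combined with
        # a geometrical icon (line, region, surface or volume).
        if Icon.FIELD in self.icons:
            assert self.icons & {Icon.LINE, Icon.REGION, Icon.VOLUME}

# Examples from the text:
road = Entity("Road", Icon.LINE, Icon.NETWORK)
crossroad = Entity("Crossroad", Icon.POINT, Icon.NETWORK)
# A varying surface (e.g. a terrain) as region + 3D + field:
terrain = Entity("Terrain", Icon.REGION, Icon.THREE_D, Icon.FIELD)
```

Treating icons as a set mirrors the observation above that icon combination is akin to multiple inheritance: an entity can acquire several orthogonal properties without a fixed class hierarchy.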
Data set location
The need for visualising the location of data sets could also be met using icons. Objects belonging to non-local databases could be marked with a distribution icon (e.g. a set of database symbols and a telecommunications symbol). Figure 5-17 shows an example of such a symbology.
Figure 5-17 Representation of remotely stored data sets, using a distribution icon.
Sequences
A sequence abstraction is useful in many contexts, and can be introduced in the diagrammatic notation by labelling the lines of sequence relationships with a sequence of numbers (1, 2, 3, 4, 5, …). Examples of the use of sequences are the ordering of neighbouring nodes in a manifold structure, the road pieces making up a route, the points defining a line and the sides of a polygon.
Data quality
Quality is not naturally covered by icons in the diagrams. In the proposed framework, these aspects should be included as basic properties of all geographical objects, following a standardised approach. In the data model, it should be possible to include quality measures for groups of geographical objects, individual objects and attributes of geographical objects. Work on spatial data quality has been done within the SDTS project (five classes of spatial data quality are identified, see page 86), and work is going on in ISO and CEN. The spatial quality measures required for a particular object should be determined from the spatial object type, as shown by the icons in the data model.
5.2.2 A forestry research example
As an example of the use of the proposed modelling framework, a model of a forestry research environment is shown in Figure 5-18. The data model shows five external (remote) databases. A remote climate database that contains climate information collected at geographically distributed weather observation points (3D).
A remote property database containing historical information on property regions/polygons in a manifold structure. A remote soil database that contains a soil classification manifold. A remote topographic database containing elevation information as point samples. A remote vegetation database containing historical vegetation information in a classification manifold. The model also shows the structure of the research data. A field experiment (with a number of attributes) consists of a number of experimental plots. An experimental plot has a number of measured properties and treatment data. A tree is stored with measurement data, and a tree is always related to the plot in which it is located. In this example, the external databases are accessed on the basis of the (implicit) common geographical reference framework. No other relationships exist to these data sets.
Figure 5-18 An example of a forestry research data model (without the attributes).
5.3 Translation
Sub-structure abstractions pose no particular problems when it comes to translating the high level data model into database models. The sub-structure abstractions only act as a structuring mechanism for the high level data model. A high level data model that includes sub-structure abstractions can always be translated into an equivalent high level data model without sub-structure abstractions. Sub-structure abstractions simplify the translation of the resulting high level data model because of their use of standard components. The translations of these standard components can be optimised.
5.4 Conclusion
The complexity of geographical data models requires better structuring and abstraction tools; in particular, a way of hiding detail on commonly used structures would be useful. A framework for including structure abstractions in the ER family of data models has been presented.
The work builds on sub-model substitution, as proposed by Bédard [Bédard89]. A set of useful abstractions for geographical data modelling has been proposed. Abstraction icons have been introduced for geometrical objects (point, line, polygon, 3D and varying) and spatial structures (network, manifold, raster and samples). Non-spatial abstraction icons have also been suggested (generalised object, aggregated object, association object, time and remotely stored data) to make it easier to communicate overall structure in ER-type diagrams. The use of symbols has long traditions in cartography, and since most potential participants in a geographical database design process will have experience with map use, the suggested representation should be particularly suitable for geographical data modelling. The utility of the method in practical modelling has yet to be proven.
5.4.1 Future work
The sub-model substitution framework must be tried out on real world geographical data modelling problems. An evaluation of the model should be performed, and improvements suggested. Some questions will need to be answered:
• Is sub-model substitution a useful abstraction mechanism for geographical data modelling?
• What comprises the ideal set of sub-models?
• How should the sub-models be represented graphically?
An ultimate goal is an intuitive modelling environment that is able to generate (distributed) database schemas from diagrams and auxiliary information.
Chapter 6
Database management system issues for geographical data
Huge amounts of structured and unstructured spatial and spatially related data must be organised and made available to a large set of local and remote users in a GIS environment. The GIS community consists of many potential, demanding users of database management systems (DBMS). Potential, because DBMSs have not yet been utilised and developed to their full potential by the GIS community.
In this chapter, some of the demands geographical data put on database systems are pointed out, and database system research and technology is reviewed with a geographical bias. Distributed database systems are discussed, as well as time, data dictionary issues, query languages and transaction processing with concurrency control. The chapter is rounded off with an assessment of the suitability of different database models for geographical data management, and a short conclusion. Some research issues in spatial databases have been proposed by Günther and Buchmann [Günther90]. A review of spatial database system research has been given by Güting [Güting94]. A similar overview of spatial database implementations and query languages for spatial data has been provided by Samet and Aref [Samet95].
6.1 Basic requirements
The basic requirements that users of geographical data put on a DBMS are similar to those of other database system users [Frank84][Frank88][Feuchtwanger89][Frank91]. Some of the requirements mentioned below have a special meaning for GIS, and will be further elaborated on in later sections.
• Many geographical DBMSs must support interactive operation, that is, less than 2 seconds response time for most data retrievals.
• Storage efficiency is important for large data sets. For geographical databases, the amount of data to be handled could be extreme (Gigabytes, Terabytes, …). HSM (Hierarchical Storage Management) methods could be required for some data intensive applications (see page 53).
• A geographical database management system should support and be able to handle structured and complex data, that is, vector and raster spatial data types, object hierarchies, data with quality measures and temporal (versioned) data. A BLOB (Binary Large Object) data type could also be useful. Consistency enforcement for all data types should be performed by the geographical DBMS.
• Operations should be supported by an integrated standard query language, so that query optimisation is possible. Geographical data will be subject to spatial/topological queries, image queries, 3D queries and general attribute queries.
• Transaction support. In the case of GISs, long duration transactions will tend to dominate, and there will in general not be much updating (the majority of transactions will be read-only). Concurrency control will be necessary in order to ensure the consistency of non-static data sets while providing 24 hour availability. The amount of updating to geographical data sets is often limited, so some “relaxed” concurrency control method would probably suffice. Recovery of the database to a consistent state should be possible after system failures, requiring a transaction log to be kept.
• Transaction capacity. High-volume transaction systems such as banking applications require that the DBMS can handle hundreds or thousands of transactions per second (TPS) with very short delays. In general, transactions on geographical databases will not be that frequent, but some public geographical databases could get very frequent accesses (e.g. property databases, traffic information system databases in large cities, and other databases that will be used for real-time navigation), demanding capacity for handling many transactions per second.
• Data sharing, and the resulting data transfers to and from remote databases, requires the involved databases to be accessible on the network through a standardised database interface.
• Advanced data dictionary support should be provided (preferably as an integrated part of the query language) in order to make more of the semantics of the data available and to allow heterogeneous database integration (explicitly or transparently). The data dictionary should support geographically based meta-queries and thematically based meta-queries.
• Accounting and billing on the basis of data access and processing time must be possible in a data sharing environment. In the case of integrated data sets, it is important that the accounting can also be done at the object and attribute levels, on the basis of ownership of the source data.
• Integrity should be enforced by defining the valid states of the geographical database and the (state dependent) valid operations on the geographical database.
• Security against unauthorised access to the data will be necessary, at least for parts of some databases [Jajodia90] [Lunt90].
• Mechanisms for adding, deleting and changing data types and constraints (schema evolution) are generally nice to have. This is also true for geographical databases, since the totality of interesting roles/uses of the modelled geographical phenomena can be difficult to predict.
6.2 Data volumes and data types
Our means for collecting geographical data have been steadily improving, and this trend is continuing, leading to larger and larger amounts of existing and incoming data. GIS data have previously been discussed in chapters 3 and 4. This section adds some examples showing the order of magnitude of present and future GIS data set volumes. The storage requirements of geographical databases will vary from application area to application area. Global applications will generally require more data than local ones, and applications that utilise images or other kinds of (high-volume) automatically sampled data (such as satellite imagery, seismic data and drilling logs) will require more storage space than the typical vector-based applications.
6.2.1 Samples
Continuously varying phenomena are conveniently represented by performing regular (such as raster data) or irregular sampling on them (as discussed in chapters 3 and 4).
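The difference between the two sampling regimes can be made concrete with a minimal sketch (illustrative only): a regular sample is just an array indexed by grid position, while an irregular sample set must store coordinates with every value, and reconstructing the phenomenon at an unsampled position requires computation (here, simple nearest-neighbour lookup).

```python
import math

# Regular sampling: a 3 x 3 grid of values; position is implicit in the index.
raster = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]

# Irregular sampling: each sample carries its own (x, y) coordinates.
samples = [(0.0, 0.0, 1.0), (2.5, 0.5, 3.1), (1.0, 2.0, 7.2)]

def raster_value(row, col):
    return raster[row][col]  # direct lookup, no computation

def sample_value(x, y):
    # Reconstruction: take the value of the nearest sample point.
    nearest = min(samples, key=lambda s: math.hypot(s[0] - x, s[1] - y))
    return nearest[2]
```

This also illustrates why applications working on samples tend to be computation-bound rather than I/O-bound: every query at an unsampled position triggers a reconstruction step.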
Sampling is often applied within such areas as vegetation monitoring and mapping, soil and geology mapping, and water and air quality monitoring. Samples are also referred to as measurement data elsewhere in the literature [Neugebauer90]. The reconstruction of continuously varying phenomena from samples requires computations. Applications working on samples are therefore normally not I/O bound. As long as samples are taken one by one with human assistance, the number of samples will stay reasonably manageable. On the other hand, when sampling can be performed automatically and “continuously”, the number of samples will increase significantly. The resulting volumes of data depend on the sampling frequency, the area sampled and, if temporal monitoring is involved, the length of the sampling period. Automatic sampling is, or could be, applied in areas such as: climate measurements (wind, temperature, humidity, precipitation, wave height, wave length, solar radiation, …), traffic monitoring (radars, video), land and seabed topographic mapping, seismic exploration with multi-sensor measurements, and continuous logging during drilling processes. All these areas are expected to be given higher priority in the future, leading to huge amounts of high resolution (spatial and temporal) data. The volume of data required to represent a continuously varying spatial phenomenon using point samples should be limited only by the variability of the phenomenon. The sampling frequency must be at least twice as high as the highest frequency of variation of the sampled phenomenon (the Nyquist criterion).

Terrain models

A special, and important, kind of sampled structure is the terrain surface model (often called a digital terrain model, DTM). Such a model should provide approximate elevation values at all places within the modelled area.
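Providing elevation values “at all places” from point samples requires spatial interpolation. A minimal sketch using inverse-distance weighting, one common choice among many (the function and variable names are illustrative, not from any particular GIS):

```python
def idw_elevation(samples, x, y, power=2.0):
    """Estimate the elevation at (x, y) by inverse-distance weighting
    of the given (xi, yi, zi) point samples."""
    num, den = 0.0, 0.0
    for xi, yi, zi in samples:
        d2 = (x - xi) ** 2 + (y - yi) ** 2
        if d2 == 0.0:
            return zi  # query point coincides with a sample
        w = 1.0 / d2 ** (power / 2)
        num += w * zi
        den += w
    return num / den

# Four surveyed elevation points; estimate the elevation at the centre.
points = [(0, 0, 100.0), (10, 0, 120.0), (0, 10, 110.0), (10, 10, 130.0)]
print(idw_elevation(points, 5, 5))  # symmetric case: the mean of the four samples
```

Note that such interpolation is exactly the kind of computation that makes sample-based applications compute bound rather than I/O bound.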
Terrain surface models are very often based on point samples of the elevation (traditionally they have been represented with iso-elevation lines or contours on maps). Applications that utilise the 3D properties of terrain models are generally extremely computationally intensive, and are very seldom I/O bound. Terrain models are presently not extreme in data volume, because a fully automatic way of obtaining elevation points on the surface of the earth is only just becoming available. When robust methods for automatic digital photogrammetry become widely available, the situation will be different, and very detailed (high resolution) and voluminous terrain models will result.

3D models

3D models are useful in meteorology, oceanography and geology. To be able to represent detailed 3D information, huge amounts of data must be stored. Some 3D phenomena are also dynamic by nature (the weather, the ocean), and therefore require frequent temporal sampling. 3D data are normally acquired using equipment such as seismics, echo sounders, lasers and radars. Storage structures for 3D phenomena are reviewed in appendix B.

6.2.2 Raster data

A very important subclass of samples are samples taken in a regular grid/raster. Raster images require large amounts of storage space.
• For current medium-cost raster display technology, about 1000 rows x 1000 columns x 1 byte = 1 Megabyte of data is needed to store an image that covers the computer screen. Both the number of bytes per pixel (e.g. 3 bytes / 24 bits) and the number of rows and columns are increasing.
• A scene from the Landsat* MSS** contains 3240 x 2340 x 4 (bands) ≈ 30 million pixels (6 bits/pixel: 23 Mbyte/image) [Lillesand87]. This means less than 50 Landsat MSS scenes per Gigabyte. The Landsat TM*** produces scenes of the same size, but with 6 bands (+ the lower resolution thermal band), and with a radiometric resolution of 8 bits/pixel.
The MSS generates 15 Mbits/second, while the TM generates 85 Mbits/second (about 1⁄2 Gigabyte/minute!). The two sensors of the earth observing satellites in the Spot**** programme each deliver 6000 8-bit pixels per row in panchromatic mode, and the satellite can transmit data at a rate of 25 Mbps [Lillesand87]. A square shaped scene/image will take up about 36 Megabytes of storage space. The storage requirements for satellite imagery are enormous compared to other kinds of data.
• Instruments used for scanning “analogue” images are providing higher and higher geometrical and radiometric resolution. Presently, the geometrical resolution used is of the same order of magnitude as the resolution of photographic film (about 10 micrometers). A scanned 25 cm by 25 cm air photograph will consequently result in about (25 · 10⁻² m / 10 · 10⁻⁶ m)² = 6.25 · 10⁸ pixels. If each pixel is represented using 1 byte, this will result in over 1 Gigabyte of uncompressed data for a stereo pair of air photographs.
• Image archives. Due to the large storage space requirements of individual images, image archives that store a large number of images will require enormous amounts of storage space.
The future will probably bring more satellites with more advanced sensors, leading to a higher rate of data production than today. Airborne digital sensors, providing higher resolution data, will probably also become more popular.

* Landsat is a US series of orbiting earth observation satellites carrying a set of sensors, providing world-covering multi-band images, currently with a maximum resolution of 30 m x 30 m [Lillesand87].
** Multi Spectral Scanner.
*** Thematic Mapper.
**** Spot is a French series of orbiting earth observation satellites, providing images where each pixel represents a patch of 10 m x 10 m (panchromatic) or 20 m x 20 m (3 different bands) of the earth surface.
This would lead to even tougher requirements on local databases and data handling equipment.

Compression

The data volumes of raster data can be reduced through image compression [Gonzalez87]. Lossless compression methods, such as run-length coding, ensure that the image can be completely restored, while lossy compression methods, such as fractal encodings [Barnsley88], make full restoration of the original image impossible. Lossy techniques are generally not suitable for compression of images that are to be used as the basis for further analysis. Compression techniques can reduce the storage requirements of most pictures, but if an image has been compressed, it will generally have to be restored before usage. The computational load of this restoration depends on the compression algorithm used. Special purpose hardware has been developed for fast compression and restoration of images. The Joint Photographic Experts Group (JPEG) has developed a standard for still image compression. This standard is also aimed at images for presentations, and therefore allows lossy compression [Kim91]. The compression ratio can be adjusted according to the application's demands for quality and compactness.

Video

If raster images/pictures are to be put in series to produce film sequences, a rate of 25-75 pictures per second is needed to produce acceptable to very good quality presentations. For normal screens this means more than 25 Megabytes of data per second of uncompressed motion video. Special image compression techniques for highly correlated sequential images can, however, reduce this amount significantly. The Moving Pictures Experts Group (MPEG) is working on standards for this type of image compression. The techniques suggested are lossy, and an improvement of the compression ratio by a factor of 3 to 10 is expected by exploiting the correlation of succeeding images in such sequences [Kim91].
The combination of single image compression and difference compression in MPEG-1 is consequently able to reduce the data stream to between 1 and 2 Megabits per second for normal video [Furht95].

6.2.3 Vector data

In many cases, the vector model provides a more compact method for storing thematic geographical information than does the raster model. The “exact” borders of all Norwegian properties stored in a compact vector format will probably not take up significantly more storage space than a single uncompressed high-resolution stereo pair of aerial photographs (some Gigabytes for the property database, and about a Gigabyte for the stereo pair)! This does not mean that vector data sets are small. The order of magnitude of many Norwegian national vector data sets will be Gigabytes. The vector format is compact for representing (boundary) lines, but its structure is much more complex than the raster format (e.g. topology and line representation). This, combined with the significant volume of many vector data sets, is likely to cause performance problems for most current database systems.

6.2.4 Time

When the temporal dimension is included in a geographical database, the data volume will accumulate as new data are included (e.g. 1⁄2 Gigabyte/min. of Landsat data). All historical information is potentially interesting, so no data should be thrown away, leading to a perpetual accumulation of data. One consequence for the database system is that most data will be read-only. New data sets will generally not replace or correct older data sets, but will be stored together with them. The older data sets are kept as historical information. In a practical implementation, the storage of the new and the old data sets could be co-ordinated to provide a more compact representation (change-oriented storage).
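Change-oriented storage as sketched above can be illustrated by keeping one full base raster and, for each later epoch, only the cells that changed. A minimal sketch (the function names and the dict-of-cells raster representation are illustrative only):

```python
def reconstruct(base, deltas, epoch):
    """Rebuild the raster as it was at the given epoch by replaying deltas."""
    raster = dict(base)
    for t in sorted(deltas):
        if t > epoch:
            break
        raster.update(deltas[t])
    return raster

def store_epoch(base, deltas, epoch, new_raster):
    """Record only the cells of new_raster that differ from the previous
    epoch (change-oriented storage)."""
    current = reconstruct(base, deltas, epoch - 1)
    deltas[epoch] = {pos: v for pos, v in new_raster.items()
                     if current.get(pos) != v}

# Base raster at epoch 0, indexed by (row, col); one cell changes at epoch 1.
base = {(0, 0): 3, (0, 1): 3, (1, 0): 5, (1, 1): 5}
deltas = {}
store_epoch(base, deltas, 1, {(0, 0): 3, (0, 1): 4, (1, 0): 5, (1, 1): 5})
print(deltas[1])                     # only the changed cell is stored
print(reconstruct(base, deltas, 0))  # the original raster is still recoverable
```

The trade-off is the one discussed in the text: storage becomes compact and all history is kept, at the price of replaying deltas when an old state is requested.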
6.2.5 Generalisation levels

Geographical data can be useful at many levels of generalisation, from the most detailed representation of a garden to an overview representation of a continent, or even the complete earth. Geographical data generalisation can consist of simplification of object representations, aggregation of objects, and removal of insignificant objects (as the application scale gets smaller). An ideal geographical data server should be able to provide geographical data from a single data set at many generalisation levels (as requested by the individual queries). The indication of generalisation level could be the (map) scale of the application (e.g. 1:10000 for municipal land use planning and 1:1000000 for global environmental research). Generalisation is a very complicated process, requiring the application of a large set of rules, and for good results it will probably also require generalisation information to be attached to the individual objects of the database. Reduction of dimensionality is a more easily implementable kind of generalisation. A user could request a volume, region or point representation of a house. The server should then have operations to derive lower dimensional representations on the basis of its stored representation (volume to region could be done by projecting the volume onto a 2D reference system, region to point could be done by returning the centre of mass of the region).

6.2.6 Summary

The challenge of GIS data management is twofold: first, organising and storing the large volumes of spatial data, and second, finding methods for filtering out “interesting” information from the global geographical information base. The answer to the first challenge is powerful geographical data models, sophisticated multimedia database management systems and a suitable data-distribution and -integration method.
To answer the second challenge, one will have to add efficient data structures (supporting generalisation) and access methods on powerful hardware.

6.3 Multimedia (integrated) database systems

The variety of data types within a GIS suggests that its database management system will have to be a specialised integrated database system (or multimedia database system). In the preceding sections it has been shown that the volume of geographical data available to geographical information systems is already enormous, and that the main contribution presently comes from remotely sensed imagery. Structured vector data will generally be manageable for local applications, but for wider area analysis, the amount of vector data to consider could be overwhelming even with state of the art technology. Efficient methods for searching for interesting data and for filtering away the rest will therefore be very important for the performance of GISs. Multimedia information systems must be able to manage and integrate a variety of data types, including for instance textual information, numerical information, all kinds of graphic information (e.g. drawings and images), sound and video sequences [Christodoulakis95] [Yager91]. Multimedia database systems should be designed to provide much of the same functionality as traditional database systems for all these data types. There are presently no such systems available. The GIS branch of multimedia database systems will have to be based on a global frame of reference, within which the different representations of geographical objects and events can be localised [Rhind92]. Several global reference systems exist; the latitude-longitude system, the UTM system and WGS84* are some of the traditional ones. Other approaches have also been taken lately, for instance regular hierarchical triangulation of the earth (the quaternary triangular mesh, QTM [Dutton89]).
If many different reference systems are to be used, transformations between these reference systems will have to be performed on the fly by a multimedia system. Images (rasters), 2D geometry (vectors) and alphanumerical tabular information constitute the traditional data types useful in a GIS. An important feature of many future GIS database systems will be a 3D model of the surface of the earth and possibly also geological features. This 3D model must be truly integrated with the global reference system. In addition, it must be possible to integrate 3D object geometry (e.g. houses, bridges and trees) with the rest of the database in a straightforward manner. The integration of these different data types is important in order to allow seamless spatial analysis and presentation. Seamless access to all available information sources gives new opportunities for analysis and presentation, for instance by allowing vegetation and soil maps to be combined with satellite imagery and the 3D terrain model. Sound has been given some emphasis in multimedia systems, but will probably only be of limited use for GISs. It could be useful in tourist information systems, for instance the sounds from nature (a waterfall, the cracking of an iceberg, birds) or sounds of human activity (traffic, talk, music). While multimedia research has led to some results in multimedia user interface design, it has not yet come up with any standards for multimedia database systems. Image databases have attracted most attention, and some work has been done on the integration of images with relational databases [Roussopoulos88] [Joseph88]. Within ISO, there is work in progress on multimedia extensions to SQL (SQL3/MM [ISO/IEC94a]). The potential of object-oriented databases for multimedia applications has also been investigated [Woelk86, Woelk87].

* World Geodetic System 1984
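The on-the-fly transformations between reference systems mentioned earlier in this section typically pass through an earth-centred intermediate frame. As one concrete piece of such a pipeline, the standard conversion from geodetic coordinates on the WGS84 ellipsoid to earth-centred, earth-fixed (ECEF) coordinates can be sketched as follows (the function name is illustrative; the constants and formulae are the standard WGS84 ones):

```python
import math

# WGS84 ellipsoid constants
A = 6378137.0            # semi-major axis (m)
F = 1 / 298.257223563    # flattening
E2 = F * (2 - F)         # first eccentricity squared

def geodetic_to_ecef(lat_deg, lon_deg, h=0.0):
    """Convert latitude/longitude/height on WGS84 to ECEF x, y, z in metres."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)  # prime vertical radius
    x = (n + h) * math.cos(lat) * math.cos(lon)
    y = (n + h) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - E2) + h) * math.sin(lat)
    return x, y, z

# A point on the equator at the prime meridian lies one semi-major axis out.
print(geodetic_to_ecef(0.0, 0.0))  # → (6378137.0, 0.0, 0.0)
```

Conversions between two geodetic reference systems can then be composed as geodetic-to-ECEF, a datum shift in ECEF, and the inverse conversion.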
In the GIS arena, no really elegant methods for integrating vector and raster data have been found, and in addition, the geometry is usually stored separately from the thematic information in GIS databases (the geo-relational approach). The status of GIS with respect to multimedia integration is consequently not too encouraging.

6.3.1 Hypertext

Hypertext (or hypermedia) is an approach to multimedia systems that puts emphasis on information structuring and user interfaces [Conklin87] [Goyal89]. Hypertext techniques should also be interesting for geographical information system design. Incorporation of these kinds of structuring methods is likely to make GISs more user friendly and flexible. Hypertext is first and foremost a browsing tool where information is structured as a web of nodes and associative links. Each node contains some information on a certain subject and from zero to many links to other nodes. From a node you can move to other nodes that are associated with the current node in some way. A link to another node can be indicated explicitly as highlighted text, icons and parts of images; in addition, some dictionary- or rule-based approach could be used. The linking mechanism is intrinsically very flexible, and the structure of the resulting web of links and nodes depends very much on the creator of the web. A big problem in hypertext systems is how to keep track of where you are in the web. Network browsers that can handle millions of nodes and links in a user-friendly way are very difficult to implement. Hypertext has also been investigated in the context of GIS, and it has been suggested to use it as a cartographic product [Lindholm90] [Laurini90]. There is still no agreed-upon theory for the implementation of hypertext-structured database management systems. The utility of hypertext techniques is therefore presently limited to user interfaces in browsing applications (e.g.
the Internet WWW*, electronic atlases and public database interfaces).

* World Wide Web

6.4 Spatio-temporal databases

Time in databases has been investigated extensively during the last 20 years [McKenzie86] [Snodgrass90]. By including time in the database it is possible to represent the history of database objects. Concurrency control and recovery problems can also be solved in an elegant way in temporal databases [Agrawal89, Bernstein87]. Temporal/history data are a kind of versioned data. Temporal data, however, do not require support for the parallel versions that must be supported for general versioned data (e.g. for road and land use planning or computer aided software engineering). Temporal versions are linearly ordered along the temporal dimension, and therefore comprise a subset of versioned data. Many different ways of including time in databases have been proposed, and most of the work has been based on the relational model.

6.4.1 Concepts of time in databases

Over the years, different concepts of time in databases have been identified, the most important being transaction time, valid time and user-defined time [Snodgrass85]. The first two are maintained by and known to the DBMS, while the last one is a user-defined attribute that contains some information about time. Transaction time and valid time will be supported by mechanisms in the query language, while the user will have to take care of processing and administration of the user-defined time attributes without any time-specific help from the DBMS. According to their use of time, databases can be classified into different categories [Snodgrass85] [Snodgrass86] [Snodgrass92]:
• Static databases have no notion of time; they only contain a snapshot of the reality they are supposed to represent. Past states are discarded and forgotten.
• Static rollback databases support the notion of transaction time.
This means that it is possible to find out what the state of the database was at a certain point in the past by checking the transaction time tags on the different data items (rolling back the database to some state in the past). Only the time of data entry and deletion is recorded, so the states of a static rollback database do not necessarily reflect the states of the modelled reality.
• Historical databases support the notion of valid time. In a historical database it is possible to query about the state of the reality model at a certain point in time. It is also possible to correct errors from the past and insert new facts about the past. Databases may also include user-defined time. User-defined time will have to be included as normal attributes in the schema and maintained and manipulated by the user, just like any other attribute.
• Temporal databases incorporate both transaction time and valid time. It is possible to query about the state of the world at a certain point in time, as recorded by the database at a later point in time.

6.4.2 Representing time in databases

The inclusion of time in relational databases has been extensively investigated, and a number of suggestions have come up. Most of them use two basic temporal elements: a time interval for database items that have a (limited) lifetime, and a single time value for events. A time interval can be specified using the start point and the end point of the interval. Temporal attributes are not available for direct manipulation in most of the suggestions for temporal extensions to the relational model. Such attributes must normally be accessed using special purpose operations and operators. Tansel and Clifford include time at the attribute level, as an integral part of the attributes [Clifford85] [Tansel86].
In their model, an attribute is either an atom (value), a set of atoms, a triplet (from time, until time, value) or a set of triplets. Gadia treats relational tuples as atomic with respect to temporal issues [Gadia88]. That is, the same from time and until time are valid for all attributes in a tuple. This approach has been termed a homogeneous model. Snodgrass also attaches time at the tuple level [Snodgrass87].

6.4.3 TQuel

Snodgrass has developed TQuel, an extension to Quel (a query language used in Ingres®, based on the relational calculus) incorporating both transaction time and valid time [Snodgrass87]. The new mechanisms/clauses in TQuel are shown below as an example of temporal extensions to query languages:

as of <time>, for transaction time querying (query on a snapshot of the database)
as of <time> through <time>, for transaction time querying (for examining a sequence of transactions)
valid at <time>, for valid time (history) querying (query about the state of the reality model at a point in history)
valid from <time> to <time>, for valid time (history) intervals (query about intervals in history)
when <point(s) in time> | <time interval(s)>, for valid time (history) querying (query using a combination of historical events and intervals)

In TQuel, temporal expressions are used to define the points in time and time intervals to be used in the temporal clauses: begin of x, end of x, x precede y, x overlap y, a extend b, a equal b. These expressions can be combined using the traditional logical operators (and, or, not) and used in when expressions.

6.4.4 Time in geographical databases

For geographical databases, a minimum requirement should be that historical data are kept for the future. The more integrated temporal information that is available, the better. Fully temporal database support can be useful, providing better means for system maintenance and database analysis (e.g.
analysis of the evolution of the reality model and the database). The most flexible solution would therefore be to utilise a database system that supports time at the level of temporal databases, also called spatio-temporal databases [Al-Taha94]. The integration of the spatial dimension with the temporal dimension in a data model is not trivial. Not only can there be changes to the geometry of the objects; spatial relationships (such as topology) will also be modified when elements of a topological structure are created, deleted and modified. Sometimes complete restructurings of a topological structure can even occur (e.g. cadastre restructuring). A question in this respect is how to handle object identities in the case of such changes. In addition to man-controlled changes, there are the gradual changes in nature that are normally best taken care of by sampling methods. The representation of gradual changes of topology has been investigated by Egenhofer and Al-Taha [Egenhofer92]. Object structures such as cadastres, buildings, land use, roads and other infrastructure often change in an event-driven way (“abrupt” changes). Natural phenomena tend to vary in a more continuous (e.g. surface changes due to erosion, vegetation changes resulting from plant growth / natural succession and climate driven water level changes), and often periodic, fashion (e.g. seasonal changes to vegetation, hydrography and climate). There are exceptions to the “rule of continuity” in nature (e.g. changes resulting from fires, volcano eruptions, earthquakes, tornadoes, floods, landslides and human interventions). This distinction suggests the use of event-based structures for temporal monitoring of object structures, and a sampling approach for natural phenomena, with the addition of events for representing important abrupt changes in nature (natural “disasters”).
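The event-based structures suggested above combine naturally with valid time: each object version carries the interval in which it was valid, and a query in the spirit of TQuel's valid at clause simply retrieves the versions whose intervals cover the query time. A minimal sketch (the tuple layout and the cadastral example data are illustrative):

```python
# Each version: (object_id, valid_from, valid_to, attributes).
# A valid_to of None means "valid until further notice".
parcels = [
    ("parcel-1", 1950, 1972, {"owner": "A", "area_m2": 12000}),
    ("parcel-1", 1972, None, {"owner": "B", "area_m2": 12000}),
    ("parcel-2", 1960, None, {"owner": "C", "area_m2": 8000}),
]

def valid_at(versions, time):
    """Return the versions valid at the given time (a 'valid at' query)."""
    return [v for v in versions
            if v[1] <= time and (v[2] is None or time < v[2])]

print([v[0] for v in valid_at(parcels, 1965)])  # → ['parcel-1', 'parcel-2']
```

This also illustrates the point that, for event-driven changes, temporal “interpolation” amounts to nothing more than retrieving the valid versions.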
Gradual temporal changes to geometry are interesting mostly for geometry derived from sampling of natural phenomena, and are therefore irrelevant from the database point of view when only the sampled data are stored. Interpolation in temporal databases is easy for event-driven changes (no computations, just retrieval of valid objects), but sample-based approaches require special purpose spatio-temporal interpolation algorithms (these algorithms should also provide quality measures of the interpolation results). Early efforts to discuss temporal issues for geographical information were made by Langran and Chrisman [Langran88] and Price [Price89]. They proposed that the temporal dimension should be organised topologically, just as the spatial dimension, using events as nodes and states as links in the topological structure. They rejected the snapshot way of representing time, and instead suggested that only events (that lead to changes) should be reflected in the database. This leads to a compact (non-redundant) and expressive model of temporal change. The work focuses on objects, and there is no mention of samples of continuous phenomena and derived structures (manifolds). Queries on the proposed model are discussed in a later paper by Langran [Langran89]. A problem with history data is what to do with the ever accumulating amounts of data. Ahn and Snodgrass have tried to solve the problem of efficient maintenance of history data by letting the most recent data be more available than older data (partitioned storage) [Ahn88]. By partitioning the database, the older data can be kept in a more space-efficient format, or on a slower storage medium, than the current data. There has been some interest in spatio-temporal modelling for geographical data in recent years.
Guptill discusses the use of extensible relational DBMSs for representing spatial and temporal aspects of geographical data [Guptill90]. Lin discusses spatio-temporal intersection, using time as a separate dimension [Lin91]. Worboys proposes an event-driven spatial-temporal model based on simplexes within an object-oriented framework [Worboys92]. Pigot and Hazelton propose a model for a “4D” GIS based on gradual topological change of manifolds [Pigot92b].

6.5 Metadata and data dictionaries

A general definition of metadata is that it is information/data about data. It can, however, be difficult to decide what should constitute the basic data and what should constitute the metadata. In this section, the metadata that are used to describe, catalogue and index data from different sources are discussed. Such metadata should enable the user to determine whether a data set is of interest for his/her purposes without having to inspect the real data. An important high-level component of these metadata will be a spatial and thematic catalogue for the initial navigation among the available spatial data sets of the world. The metadata should describe the individual data sets, for instance with respect to data quality. At its most detailed level, metadata should provide descriptions of individual objects and attributes. Standardisation work in the field of metadata for geographical information has been started in the USA [FGDC94], in Europe [CEN95b] and now also internationally in ISO/TC211. Advanced and (globally) standardised data dictionaries are essential for the sharing of geographical data and co-operation in a world of heterogeneous systems. The way in which GIS data normally are organised (as discussed in chapter 3) suggests a data dictionary that can support search on a spatial (geographical) basis, a thematic basis and a temporal basis.
Spatial search should be possible using both explicit references (the geometrical representation of the region of interest) and implicit references (e.g. geographical names). Such a global data dictionary should (preferably) be accessible to users everywhere, and its contents available through, for instance, standard geographical query language extensions. In geographical information systems, the global distributed heterogeneous geographical information base should be described in such a way that local geographical database systems can navigate through all available geographical databases to answer queries such as:

Find all privately owned properties larger than 10 km² in Europe, valid on 1 January 1960.
Select the areas that are at an altitude of at least 2000 meters in Kenya and Tanzania.
Calculate the total area of pine forest at altitudes of 500-800 meters in Scandinavia.

As it will be difficult to force different countries and institutions to conform to such a data dictionary standard, it is important that the benefits of conforming to the standard are noticeably higher than the costs. In the United States, an initial proposal for a metadata standard for geographical data has been suggested [FGDC94]. The standard proposes the following kinds of metadata for geographical data:
• Data set identification (identity, themes covered, representation model, format, size, description, intended use, data set extent, intended scale, resolution of the data)
• Data quality information
• Spatial data organization information
• Spatial reference information (coordinate system, etc.)
• Entity and attribute information (entity types, attributes, domains)
• Distribution information (data distributor and ways of distribution)
• Metadata reference information (currentness and contacts for the metadata)
• Citation information (how to reference this data set)
• Time period information
• Contact information (how to communicate with people associated with the data set)

The first (identification) and second (quality) categories seem to pose the most challenging research issues at the moment. The organisation of geographical themes in a thesaurus (preferably hierarchical) and methods for including data set extents in a geographical data dictionary system are very important areas of standardisation, as is the complex area of data quality. Most of the other issues are more easily solvable, or solutions already exist.

6.5.1 Quality in geographical databases

An important issue in geographical database research is how to represent the quality of the data in the database. Quality measures are needed to make quality assessments and error propagation possible throughout GIS analysis. Some of the earliest efforts within this research area were made in the standardisation work that resulted in the SDTS [USGS90] (see page 86). To allow sensitivity analysis and error propagation modelling, there is a need for adequate representations of data quality in the database [Hootsmans92]. In addition to the quality descriptions of the data, it is important to have good computational models for the applications that work on the data. Quality should be available at all levels in a data set (attribute, object/tuple, object class/table, theme, data set), and different measures of quality will be interesting at the different levels. Currency, lineage, and numerical and classification precision are most interesting at the attribute level. Consistency, currency, lineage, and positional and geometrical accuracy will be useful at the object level.
Completeness and consistency will be the most important quality measures at the object class, theme and database level. The quality measures to apply will depend on the type of geographical information that is considered. The quality information should, if possible, be split into “independent” parts. An example: for line and surface geometry one could split spatial accuracy into locational accuracy and shape fidelity [Tveite95]. The SDTS divides spatial data quality into five groups (see the section on SDTS in chapter 4) [USGS90]. These are discussed below. Lineage information must be available for all object types in the database. Some aspects of lineage are meaningful for the complete data set, while others can be attached to individual objects. It is important that derived data sets refer to the original data set for further lineage information. According to the nature of the phenomena represented, lineage information can be provided as attributes with the individual objects or attributes with aggregations of objects (e.g. data sets). Some of the lineage information can be represented as relationships to other objects, for instance to control points or instrument specifications. Positional accuracy can be given for aggregates of objects (e.g. complete data sets) as a statistical measure, or it can be given for individual geometrical objects if accuracy information is available at the object level. Investigations have to be done on how to represent positional accuracy for geometrical objects other than points (e.g. shape fidelity [Tveite95]). Attribute accuracy can be represented by a statistical measure for complete data sets, or for individual objects. For classifications, the probability of misclassification can be provided as a misclassification matrix, while for measurements instrument and/or method accuracy can be used.
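The misclassification matrix mentioned above supports simple derived accuracy measures. A minimal sketch, with an invented three-class matrix (the class counts are purely illustrative):

```python
# Sketch: deriving attribute-accuracy measures from a misclassification
# (confusion) matrix. Rows are true classes, columns are assigned classes.
# The matrix values below are invented for illustration.

def overall_accuracy(matrix):
    """Fraction of all samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def misclassification_rate(matrix, i):
    """Probability that a sample of class i was assigned some other class."""
    row_total = sum(matrix[i])
    return (row_total - matrix[i][i]) / row_total

# rows/columns: pine, spruce, open land (hypothetical classes)
m = [[80, 15, 5],
     [10, 85, 5],
     [ 2,  3, 95]]

print(overall_accuracy(m))           # overall fraction correctly classified
print(misclassification_rate(m, 0))  # error rate for the first class
```

Such per-class rates could be stored at the object class level, while the overall accuracy could accompany the complete data set.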
Logical consistency is a quality measure that can be used on complex objects to indicate the correctness of relationships within the object. Networks and manifolds are examples where the consistency of the topological relationships has to be quantified as a logical consistency measure. Completeness is meaningful only at the data set level, and could be represented as an attribute showing the percentage of the objects that are represented in the database. Quality and accuracy in spatial data have received increasing attention in the last couple of years. Chrisman has discussed errors in categorical maps [Chrisman89], Goodchild has looked into modelling of error for remotely sensed data input to GIS [Goodchild89], and Openshaw has applied simulation for handling errors in spatial databases [Openshaw89]. Data quality has also been the theme of meetings (accuracy problems in spatial data, NCGIA, 1988 [Goodchild89]), conferences (e.g. the “Symposium on Spatial Database Accuracy” [Hunter91] and the “International Symposium on the Spatial Accuracy of Natural Resource Data Bases” [Congalton94]), and a workshop topic (for instance given by Goodchild at the symposium on spatial data handling, 1992). The NCGIA in the US has launched an initiative on visualisation of spatial data quality [NCGIA91].

6.5.2 Data dictionary issues for geographical data
Geographical data lend themselves to distribution in a natural way, as discussed in chapter 3. Such a distributed, heterogeneous data environment will need a well organised data dictionary to facilitate distributed data retrieval and management. Data dictionary issues should, if possible, also be included in the data model. If a data set is known to be distributed over a large number of sites, this could then be indicated by the modeller.
By including this in the model, the users and programmers will have the liberty to take distribution into consideration when formulating queries and deciding on processing strategies. A distribution icon (a group of databases, or a globe) attached to the diagrammatic representation of the object type could be useful for this (more about icons in chapter 5). It is normally one of the goals of a database system to make data distribution invisible to users (distribution transparency using a distributed conceptual schema, DCS). This can be desirable for single company/owner databases and distributed databases containing noncommercial (free) data. Geographical data are generally different (see chapter 3). Geographical data are provided on a commercial basis by a large number of institutions and companies, all having different pricing policies and varying thematic interests in the modelled phenomena. This setting is different from most of the settings that are normally discussed in the distributed database management literature. The users of geographical data should be interested in having access to a distribution data dictionary at the conceptual data model level. This is desirable because of the need to determine the availability of the data sets and their cost (both in terms of payment/royalties and acquisition time). At the conceptual level, the data model should provide a standard, abstracted interface to the (distributed) data dictionary of all relevant geographical databases. This standard interface should be available through the query language, providing an integrated framework for distributed (geographical) data retrieval. That is, the data dictionary will be an important part of the geographical database system. The data dictionary part of the database should, in addition to quality measures, include other metadata elements, such as ownership, location, availability and cost, as standard “attributes” of all data sets.
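The idea of ownership, availability and cost as standard dictionary attributes might be sketched as follows. All field names, data set names and values are invented for illustration:

```python
# Sketch: data-dictionary entries carrying ownership, location, availability
# and cost as standard "attributes", screened before any data are fetched.
# Every field name and value here is hypothetical.

from dataclasses import dataclass

@dataclass
class DictionaryEntry:
    name: str
    owner: str
    site: str            # where the data set is stored
    available: bool
    cost_per_km2: float  # royalty, in some currency
    delivery_days: int   # acquisition time

def affordable(entries, max_cost, max_days):
    """Return available data sets within the given cost and delivery limits."""
    return [e for e in entries
            if e.available and e.cost_per_km2 <= max_cost
            and e.delivery_days <= max_days]

catalogue = [
    DictionaryEntry("roads_no", "NMA", "oslo.example", True, 2.0, 1),
    DictionaryEntry("soils_eu", "ESB", "brussels.example", True, 9.5, 30),
    DictionaryEntry("hydro_se", "SMA", "stockholm.example", False, 1.0, 2),
]

hits = affordable(catalogue, max_cost=5.0, max_days=7)
print([e.name for e in hits])  # only data sets meeting all criteria
```

A query formulated against such standard attributes can rule out unavailable or too expensive data sets before any geographical data cross the network.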
The setting for a distributed geographical data dictionary will probably be an international network of data servers that are able to communicate and exchange data on a commercial basis. Standardisation efforts within the OSI (Open Systems Interconnection) framework of ISO (the International Organization for Standardization) cover communication protocols between such servers. Geographical data can be thought of as hierarchically organised according to location and theme. Leaf nodes represent individual, generally autonomous, data sets. The data sets are placed at different levels and branches, according to their thematic contents and their geographical extent/location. Such an indexing/tree structure should, conceptually, be a composition of two orthogonal hierarchies, namely a location hierarchy and a theme hierarchy. In addition to these two hierarchies, it should be possible to specify the time at which the data set should be valid (points and intervals in time must be supported). At each node in the “location” tree there should be a full thematic hierarchy available, and from each node in the “theme” tree, it should be possible to search through a “location” tree for the current theme. Hence, a single data set could be found from many different places in the structure, and it should be possible to reach it using different paths.

Location
In the location hierarchy, one could choose the top level to be a geographical level dividing the world into continents and oceans. The next levels could then reflect political boundaries, giving a division on the basis of political units such as nations, states, districts, counties and municipalities. In this hierarchy, international data sets should be attached to the highest levels, national data sets to the national level, and so on. Such a geographical/political hierarchy provides one natural framework for guiding queries to geographical data sets.
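The two orthogonal hierarchies can be sketched as follows: each data set is registered under both a location path and a theme path, and can be reached from either. All paths and data set names are invented:

```python
# Sketch: data sets indexed by two orthogonal hierarchies (location and
# theme), each represented as a path of nodes from the root. A search can
# start from a node in either hierarchy. Names are illustrative only.

datasets = {
    "norwegian_roads": {
        "location": ("Europe", "Norway"),
        "theme": ("infrastructure", "transport", "roads"),
    },
    "alpine_vegetation": {
        "location": ("Europe",),
        "theme": ("vegetation",),
    },
}

def under(path, prefix):
    """True if `path` lies at or below the hierarchy node `prefix`."""
    return path[:len(prefix)] == prefix

def find(location=(), theme=()):
    """Data sets reachable from the given nodes of both hierarchies."""
    return sorted(name for name, d in datasets.items()
                  if under(d["location"], location) and under(d["theme"], theme))

print(find(location=("Europe", "Norway")))  # entered via the location tree
print(find(theme=("vegetation",)))          # entered via the theme tree
```

Since both hierarchies are checked, the same data set is reachable through different paths, as required above.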
Another approach could be to base the hierarchy on watersheds. It should be possible to provide many location hierarchies in the dictionary. The basis for such a solution is further discussed in connection with distribution of geographical data sets later in this chapter. An alternative way of organising the spatial part of the dictionary is a global recursive triangular tiling system, QTM [Dutton89][Goodchild90a]. Within this framework, an easy way of specifying the approximate location and extent of a data set would be to provide the address of the smallest triangle enclosing the data set and the largest triangle enclosed by the data set. The address of a spatial phenomenon in this tiling system will be a very crude indication of its spatial extent, so a more precise description of the geometry should also be available at the dictionary level. The approach also inherits the classical problem of indexing based on hierarchical tiling, namely the objects that lie on tile boundaries at a high level in the hierarchy. A solution to this problem could be to represent objects by a point (carefully chosen, so that it does not lie on any tile boundaries) and an indication of the extent of the object. The QTM could then provide every geographical object with a (linearised) global address. According to the FGDC proposal, data set location should be specified either as a minimum bounding “rectangle” or as a bounding polygon, in both cases using latitude and longitude coordinates [FGDC94]. Such a location indication will be very useful, if it is made available to the data dictionary system.

Geographical names
Geographical names are an essential tool for humans when they wish to specify a location. Geographical names should therefore be available for expressing location in query languages for geographical databases. For navigation in the data dictionary location hierarchy, one should be able to use the geographical names of the appropriate regions.
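Name-based navigation presupposes that variant spellings can be mapped to one canonical dictionary entry. A minimal alias-resolution sketch; the alias table is invented for illustration:

```python
# Sketch: resolving variant spellings of geographical names to a canonical
# dictionary entry. A real system would need a far larger, curated table;
# this one is illustrative only.

ALIASES = {
    "norway": "Norway", "norwegen": "Norway", "norge": "Norway",
    "norvege": "Norway", "noreg": "Norway",
    "germany": "Germany", "deutschland": "Germany", "tyskland": "Germany",
}

def resolve(name):
    """Map a (case-insensitive) name variant to its canonical region name."""
    return ALIASES.get(name.strip().lower())

print(resolve("Norge"))     # variant spellings map to one region
print(resolve("Norwegen"))  # same region, different language
```

Unknown names resolve to nothing, which a dictionary front-end could report back to the user.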
Since there are many alternative ways of writing geographical names (Norway, Norwegen, Norge, Norvege, Noreg, ...), a name resolution mechanism will have to be employed. When specifying location while querying geographical databases, it should also be possible to use other geographical names than those in the dictionary hierarchy (as an alternative to specifying geometrical locations explicitly through coordinates). A comprehensive index of names and accompanying places/locations (points, regions or fuzzy regions) must be made for this purpose. The development of such an index would require a lot of work, and particularly the specification of fuzzy regions (such as Hardangervidda, Oslomarka, Oslofjorden and Central Europe) will be demanding, since a lot of interviews will have to be conducted for each region to get sufficient statistical material. More advanced natural language interfaces could be the next step for geographical query languages, allowing locational prepositions (e.g. between Hardanger and Bergen) in addition to the geographical names.

Theme
The equally important theme hierarchy should also be available in the data dictionary for navigation. In order to specify a hierarchy that allows search for geographical data on a thematic basis, an internationally agreed-upon taxonomy/thesaurus for GIS information must be developed. Such a thesaurus will have to be very comprehensive, and will have to be built in a systematic fashion. An important first step is to specify a skeleton classification hierarchy, upon which the thesaurus can be based. The first layer of this hierarchy could for instance be made up of the following classes: infrastructure, topography/geomorphology, hydrography, oceanography, meteorology, vegetation, soils and geology. Because of the expected difficulty of obtaining interdisciplinary consensus on what comprises an ideal theme hierarchy, the goal of the work should be to find a suitable lattice structure.
As in the location hierarchy, databases could be attached at all levels of this lattice, and a single database must be referenced from all the nodes of the lattice that describe themes covered by that particular database. Efforts in this area have thus far only been made at national levels, for instance within the SDTS project in the USA (parts 2 and 3 of [USGS90]). On the European scene (CEN, TC 287), work is in progress to establish international (European) standards for the modelling and exchange of geographical/spatial data. This standard will hopefully provide a better basis for a global geographical data dictionary. An important practical problem in the design of distributed database systems is to decide whether the global data dictionary should be centralised or distributed, replicated or not replicated. This is a very complicated issue, and the pros and cons of these alternative approaches are discussed in the distributed database systems literature [Ceri88]. What is important is that the data dictionary should be available to all potential users of geographical data.

6.6 Geographical Query Languages
A geographical database management system should at least provide an application interface (for programming language bindings / application development). If the database system is to provide a direct interactive query service to end-users, an additional, higher level graphic/textual interface to the database will be necessary. With such an interface, the end-users can interact directly with the maps/images using a pointing device to specify locations (e.g. points, areas or volumes) of interest. This last topic, while interesting and important, is user-interface oriented, and will therefore not be elaborated on any further here. It is, however, important that the application interface to the database system provides all the mechanisms that are needed to support these higher level (graphic) interfaces.
Traditional query languages provide either set-based retrieval mechanisms (operating on groups of objects, as in QUEL and SQL) or navigational (one object at a time) mechanisms (as provided by hierarchical and network DMLs). In a database query language that is to be used for retrieval from geographical databases, both of these mechanisms are useful. The navigational mechanisms are necessary for topological structures, particularly for network analysis, but also for manifold analysis and information browsing. The set-based mechanisms are useful when performing statistical analysis on thematic data for research and planning. A query language standard suitable for geographical data and applications should support multimedia data types. A standardisation of the database interface is especially important for shared distributed databases, because it will allow straightforward access to all external data sets stored in databases conforming to the standards. Image operations and 3D analysis are important application areas that are not adequately covered by traditional query mechanisms, but should be given attention when developing query languages for GIS. Mechanisms for accessing non-local data must be expected to become a requirement for future general purpose GISs. The query language must be able to help the users in the search for interesting data sets (all over the world), by offering queries to data dictionaries for meta-information on the data sets available in external databases (e.g. data describing the data model used, the spatial and thematic contents of the database and the quality of the data sets contained in the database). This dictionary capability should be integrated with the traditional DML mechanisms (navigation and set-oriented retrieval).
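The two retrieval styles can be contrasted in a small sketch: a set-based selection over the whole collection versus navigation along topological links, one object at a time. The toy road network and its attributes are invented:

```python
# Sketch: set-based retrieval (SQL/QUEL style) versus navigational
# retrieval (hierarchical/network DML style) over a tiny invented
# road network.

roads = {
    "E6":  {"length_km": 3140, "next": ["E18"]},
    "E18": {"length_km": 1890, "next": ["Rv4"]},
    "Rv4": {"length_km": 182,  "next": []},
}

# Set-based: one expression operates on the whole collection at once.
long_roads = sorted(r for r, d in roads.items() if d["length_km"] > 1000)

# Navigational: follow links from a start object, one object at a time.
def traverse(start):
    path, current = [], start
    while current is not None:
        path.append(current)
        nxt = roads[current]["next"]
        current = nxt[0] if nxt else None
    return path

print(long_roads)      # all roads satisfying a predicate
print(traverse("E6"))  # the chain of links reachable from E6
```

The set-based form suits statistical/thematic analysis; the navigational form suits network analysis, exactly the division argued above.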
6.6.1 Different ways of organising geographical information
Depending on the application domain, the spatial component of geographical data can be looked upon in different ways, for instance:
• As geometrical objects, where location, distances and directions, normally in Euclidean space or on the geoid (the “sea-level” surface of equal gravity), are important.
• As topological structures, where the geometrical properties that stay invariant under simple transformations (translation, rotation, scaling), such as neighbourhood information and information about the borders of objects, are of interest.
• As continuous phenomena, where sampling and interpolation methods can be used to represent the phenomena (e.g. rasters and irregular samples).
Each of these approaches to geographical data must be supported by a GIS, and a general purpose query language should include mechanisms for integrating, querying and manipulating all these representations of geographical information. The geometrical object part of a query language for geographical databases will probably be the most difficult to specify. The reason for this is that many human questions on space and geometry tend to be quite fuzzy, incorporating expressions such as “large”, “high”, “long”, “in front of”, “between”, “north of” and so on. It will therefore be difficult to find a good set of spatial/geometrical operations. A good starting point could be to define a set of basic operations, where distance operators would play an important role. An example is an operator that, given a geometrical object and a distance, returns a new geometrical object covering the part of space that is within this distance from the original geometrical object (a buffer operation). Topology, as discussed in chapter 4, is a rather formalised part of geographical data models. Some of the topological operations required are border/endpoint, interior, co-border/bounding, intersect and containment.
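The buffer operation described above can be sketched for the simplest case, point geometries, where the buffer of a point is a disc. A minimal illustration; the well and factory coordinates are invented:

```python
# Sketch: a buffer operation for point geometries. The buffer of a point
# is the disc of the given radius, and membership can be tested for other
# points. Coordinates are invented, in arbitrary planar units.

import math

def in_buffer(centre, radius, point):
    """True if `point` lies within `radius` of the point geometry `centre`."""
    return math.dist(centre, point) <= radius

wells = [(1.0, 1.0), (4.0, 4.0), (10.0, 0.0)]
factory = (0.0, 0.0)

# "Select all wells within 6 units of the factory", phrased as a buffer test.
nearby = [w for w in wells if in_buffer(factory, 6.0, w)]
print(nearby)
```

For lines and regions, the distance computation (and hence the buffer boundary) becomes considerably more involved, which is part of why the geometrical operator set is hard to pin down.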
Rasters (regular samples) and “randomly” sampled data require general purpose and special purpose operations on “images” and point-sets, and must be treated in a different way than topology and vector-geometrical object data. For binary images, some useful query mechanisms are the operators and, or, not and xor for overlay operations. In addition to these come image processing operators for making FFTs (Fast Fourier Transforms), convolutions, and other kinds of filterings and maskings [Gonzalez87]. For scattered, irregular samples, it would be nice to be able to query for values at all locations in the “space” of interest (normally made up by the geographical and temporal dimensions). To achieve this, interpolations on the sample data must be performed on the fly by the system [Neugebauer90].

6.6.2 Spatial query language proposals
There have been many suggestions for spatial/geographical query languages and data types, and some overviews of research on spatial query languages have been published recently [Güting94] [Samet95]. Most of these query language suggestions have started out with SQL or QUEL and then introduced extensions to support spatial and complex object queries. Many of the efforts in this area have been limited to image database systems, but there are also some more general approaches.
• Abel and Smith [Abel86] extend SQL in their COREGIS database with spatial data types (point, line, simple polygon and composite) and spatial operators. “Inclusion/exclusion” has been used to represent complex spatial objects. A raw (BLOB) field is used to store the sequence of points in a line to avoid using one tuple for each point in a line. They do not have a full integration of the spatial operators with SQL. Intersects (a boolean binary spatial operator) is one of the spatial operators [Abel86].
• GEOVIEW [Waugh86] is also a relational approach, and uses one or more “LONG” (BLOB) fields to accommodate new data types (point, line, arc, node, polygon, text, quadtree block, collections, grid, TIN node, TIN patch). Only two tables are used to store the spatial data of a “coverage”. One table contains the geometry (the ENTITY table), while the other contains the attributes (the ATTRIBUTES table). An ID field that identifies the entity is used to link the two tables. One more table is used to store metadata, one tuple per “coverage” (the DIRECTORY table). Since the geometry is stored in “LONG” (BLOB) fields, it cannot be interpreted by the traditional query language. The topology is available in traditional relational tables (ATTRIBUTES), so no special operators are needed to retrieve topology information. A directory containing metadata was included as a database table, providing metadata integrated with the rest of the data set. GEOVIEW can store geographical data, but seems to have a limited number of spatial operators and functions. A spatial window search is supported [Waugh86].
• System9 (see chapter 3), as described by Charlwood, Moon and Tulip [Charlwood87], also applies bulk fields to represent spatial data types (node, line and surface are the basic types). SQL is extended by adding grammar and vocabulary to handle referencing between spatial entities, to handle queries based on the values in bulk fields, and to handle spatial relationships such as overlap, connectivity, and containment [Charlwood87].
• Aref and Samet also took the SQL approach for their SAND spatial database architecture, and introduced POINT, LINE_SEGMENT, POLYGON and REGION as basic spatial abstract data types, upon which spatial selects and joins could operate (spatial operators are only described at a high level of abstraction) [Aref91].
• XSQL/2 [Lorie91] is meant to be fully compatible with SQL, a property that Lorie sees as the only option considering the currently very strong position of relational database systems. Lorie does not discuss spatial extensions in particular, he only outlines some object-oriented extensions to SQL (identity, complex objects, ADTs and methods).
• TIGRIS uses an object-oriented spatial extension to an SQL-dialect together with multiqueries and macros [Herring88]. This SQL-dialect with its extensions works on an object-oriented database. Some example spatial operators: adjacent, contains, contains_point, enclosed_by, intersect, near and self_intersect (boolean operators); boundary, construct_from, containment_set, difference, intersection, self_intersection, merge, split and union (derivation operators that produce a new spatial object); area, approach_point, centroid, distance, length, perimeter, project_point, range, representative_point and set_distance (functional operators that return a single “value”). A multiquery groups several queries together in a sequence, so that the results from a query can be used by the next. The result of the multiquery is the result from the last query in the multiquery.
• Query language extensions for the support of image operations have also been examined (e.g. PSQL [Roussopoulos88] and PICQUERY [Joseph88]).
Some object-oriented approaches to spatial query languages follow.
• Worboys, Hearnshaw and Maguire use the relational “Domain Retrieval Calculus” (referencing Lacroix and Pirotte, 1977) as a basis for a query language over an object-oriented data model [Worboys90b]. The basic spatial objects in their model are: point, node, line segment, chain, string, ring and polygon. A list of spatial operators is not provided.
An example of an object retrieval calculus query for retrieving the closest hospital to “my_house” [Worboys90b]: {h: hospital | ∀m: hospital, distance(m, my_house) ≥ distance(h, my_house)}
• Scholl and Voisard specify query language mechanisms for thematic maps, restricting their focus to regions and operations on regions [Scholl90]. The approach taken is called a complex object approach, supplementing the relational algebra with a boolean algebra over the two-dimensional space ℜ² together with geometrical operations and set operations. Application dependent operations (in this case, operations on geographical data) are expressed through a general “apply” construct that applies a user defined function to each member of a set. The approach builds on work on a complex object algebra by Abiteboul (technical report 846, INRIA, France, 1988).
Lu proposes the use of deductive database techniques to support geographical queries [Lu90]. Also QBE (query by example) has been adapted to geographical databases, for instance in GEOBASE [Barrera81]. Other discussions on spatial query language mechanisms include the linearisation approach in PROBE [Orenstein86, Orenstein88, Orenstein90a].

SQL3
There is currently work in progress to specify SQL3. A part of this work deals with multimedia (SQL/MM) [ISO/IEC94a], including spatial extensions [ISO/IEC94b, ISO/IEC96]. SQL3 is leading SQL in an object-oriented direction, and supports abstract data types (ADTs). A variety of spatial domains and data types are proposed (in the 1994 version [ISO/IEC94b] many temporal domains and data types were included, while in the 1996 version [ISO/IEC96] time is supported through multiple inheritance). Coordinate data types are specified at the lowest level (ST_Coordinate and its 2D and 3D specialisations: ST_Coord2D and ST_Coord3D), and a generic spatial object ADT is placed at the top of the type hierarchy (ST_SpatialObject).
In between and around, there are a number of other ADTs: geometrical ADTs and metadata ADTs (the 1994 version included quality ADTs). An ST_SpatialObject consists of a set of ST_GeometricObject, which is a generalisation of ST_Point, ST_Line, ST_Area, and ST_GeometricAggregate. This means that ST_SpatialObject functions can operate on all these ADTs. The topology approach taken is that of point set topology (and the 9-intersection model) [Egenhofer91a]. The boundary and interior of a spatial object can be found using the spatial operators ST_Boundary and ST_Interior, and intersection is supported by the boolean binary function ST_Set_Intersects. These operators ensure that a complete set of topological relationships can be derived. Two boolean binary functions on ST_SpatialObject are specified: ST_Set_Contains and ST_Set_Equals. Operators such as ST_Set_Intersection, ST_Set_Union and ST_Set_Difference are provided for establishing new ST_SpatialObjects. A number of other operators on ST_SpatialObject have also been proposed. Volumes and surfaces are not supported yet, and rasters do not seem to be integrated with the vector data types.

Summary
For the sake of standardisation, extensions to the basic query language (at the moment SQL) should be as limited as possible. One must therefore search for a small set of spatial operators and functions that is complete in the sense that all possible spatial requests can be served by using only the operators and functions in the set. The current SQL3/MM approach seems sound in this respect. When considering performance, the picture is not as clear when it comes to what constitutes an optimal set of spatial operators.
There will be a trade-off between a small set of simple, highly optimised operations that a query optimiser can combine in efficient ways, and a larger set of more powerful optimised operations that normally have a complex behaviour and can be more difficult for a query optimiser to integrate with other operations (a classical RISC/CISC dilemma; RISC: Reduced Instruction Set Computer, CISC: Complex Instruction Set Computer). CISC systems give superior performance for the operations they have been optimised for, while RISC systems can be more efficient for operations that are not directly supported by the CISC operations (RISCs are generally more flexible). Basic geographical operations are discussed later in this section.

6.6.3 Query optimisation
An advantage of standardised query languages is that optimisation of the query execution plan is possible by changing the order of the different operations in such a way that the total amount of computation is minimised. A basic technique of relational query optimisation is to apply selections (restrictions) on the involved tables before joins are performed, in order to minimise the amount of data that must be considered in the more computationally demanding join operations. In order to be able to perform advanced query optimisation, one will have to have knowledge of the cost of all the query language operations. In addition, it is very helpful to have a good knowledge of the data sets in the database (statistics). For a relational system, such knowledge could include the volume of data contained in each table, and the distribution of values for each attribute. To allow efficient query optimisation for spatial database transactions, the spatial data types and operators should be first class citizens of the query language.
This could be achieved through general ADT mechanisms [Aref91] [Haas91], but could probably be handled more efficiently by specifying a standard set of spatial data types, constructors and operations. The use of ADTs for spatial data requires that there is a built-in mechanism for explaining the characteristics of the operations and data types to the query optimiser. This makes it a very complex task to include new ADTs. It is also probable that ADT optimisation techniques will be inferior to an integrated approach that can build upon a predefined set of spatial data types and spatial operations. If a standardised set of spatial ADTs is specified (as it might be in the spatial part of the SQL3 standard [ISO/IEC94a]), optimisation will be possible for database management systems that want to provide good performance on geographical data sets. Samet has included a section on spatial query optimisation in a recent review article [Samet95].

6.6.4 Spatial data types
The domains introduced by geographical databases can be called spatial domains, and the most central are listed below (see also chapter 3). Associated with all basic positional references, there must be a description of the geographical reference system used.
First, the most central geographical/spatial domains:
• Points in space (0D objects)
• Lines / vectors in space (1D objects)
• Regions (2D objects in 2D space)
• Surfaces in space (2D objects in 3D space)
• Volumes in space (3D objects in 3D space)
• Fields - continuous variation of some value over the interior of a geometric object (a line, a region (in 2D or 3D), a surface in 3D)
On top of these basic domains, the most important structural domains should be defined:
• Networks of lines in 2D or 3D space (could also include TINs)
• Manifolds (for regions and volumes)
To support integration with rasters or images one can include some additional spatial domains (similar to/supporting the field):
• Pixel (the atomic element of a raster), usually representing a rectangle or square
• Voxels (volume raster elements)
• Raster/matrix (n by m (by o) grid of pixels(/voxels))
All these domains should be supported by the basic spatial data types of a geographical query language. GEOBASE, an example of an early effort in this area, uses 0-, 1- and 2-dimensional geographical objects (points, lines and polygons, termed images by the authors) as basic geographical data types [Barrera81].

An extended relational approach
If we choose to go for an implementation within the relational framework, we need some spatial data types in addition to the SQL data types mentioned in chapter 2, to enable users to formulate efficient queries on spatial databases. As mentioned earlier, the set of new data types should be as small as possible, in order to limit the complexity of the resulting query language syntax. Geometry can be covered in many ways, and the minimum requirement is a spatial reference type. The most natural spatial reference type is the point data type, consisting of a group of two or three real numbers describing a position on the earth (latitude, longitude (and elevation)) in some reference system.
There might be a need for two 0-dimensional object types: one for 2D references (e.g. position) and another one for 3D (e.g. point). The position will be a projection of the point onto the plane or earth surface. An easy way out would be to go for only the position type, and include elevation as an ordinary attribute. Such an approach would limit the possibilities for developing standardised 3D operations. The other geometrical constructs, such as line, region, surface, volume and field, could be represented using constructors on the point or position type. A general purpose constructor that is useful for building more complex geometrical and topological items is the sequence constructor (as used in TECHRA, a relational database system developed in Norway for scientific and technical applications [TECHRA93]). The use of ADTs for the data types (for instance line), where the representation of the data type (line) is hidden, but where the ADT interface could support all imaginable query and update operations, would be very convenient. In this way it would truly function as a basic data type. For all the higher-level geometrical types (line, region, surface and volume), it would be useful to allow both attribute variation over the interior of the objects (fields) and homogeneous interiors (as is normal in current systems) at the conceptual level. The important class of continuously varying natural phenomena, which normally could be represented using samples in geographical databases, does deserve a special purpose data type. This is the field. The field is currently most used for 2.5D surfaces, but could theoretically also be used for representing continuous change along a line in some space, over a 3D surface or over the interior of a 3D volume. Without such a data type, the users will have to retrieve the underlying samples from the database system and perform the interpolations themselves.
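What such user-side interpolation amounts to can be sketched with simple inverse-distance weighting over scattered samples. This is an illustration only; the elevation samples are invented, and a real field ADT could use any interpolation method:

```python
# Sketch: the interpolation a user must perform by hand when the database
# only returns raw samples, here inverse-distance weighting (IDW) over
# scattered 2D elevation samples. Sample values are invented.

import math

def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (xi, yi, value) samples."""
    num = den = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # exactly on a sample point
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# elevation samples: (x, y, elevation)
samples = [(0, 0, 100.0), (10, 0, 200.0), (0, 10, 150.0)]
print(idw(samples, 0, 0))  # on a sample point: its own value
print(idw(samples, 5, 0))  # between samples: a weighted average
```

With a field ADT, this computation would move into the database system, which could then also attach the accuracy measure argued for below the interpolation result.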
Using the ADT approach for fields, interpolations will be performed by the database management system. It is therefore very important that the interpolation results are augmented by an accuracy measure, from which the user can decide whether the sampling frequency is high enough to provide a basis for meaningful interpolation over the region of interest. Topology and geometry operations could need some special-purpose data types in order to make geometrical and topological constraints and query language operations a part of the database system. In case this is what is wanted, the database system could profit from knowing about some of the following data types: node, edge, face, network, TIN and manifold. Regular grids or rasters are a very common way of representing and storing environmental data and measurements, and should be supported by a geographically oriented database management system for full integration with the other geographical data types. This could be done using a matrix type, or a clumsier sequence-of-sequence type (one more nesting for 3D data, such as seismic, atmospheric and marine data). The matrix type could be regarded as a specific implementation of (1.5D,) 2.5D or 3.5D fields.
Data dictionary data types
There is a need for dictionary data types that can facilitate queries to the data dictionary (and directory) system, providing both geographical and thematic searching. Data types for thematic dictionary search and for geographical dictionary search are therefore necessary.
• The thematic data type should be built according to internationally accepted standard thesauri of geographical data.
• The geographical data type should be able to index different kinds of geographical unit hierarchies (e.g. political/administrative or watershed). Such data types should make hierarchy queries possible.
The domain of the data types could for instance be a structured string, supporting wild-card character searching.
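A sketch of such a structured-string hierarchy domain with wild-card searching, under the assumption that levels are separated by "/" (the catalog entries and function name are hypothetical):

```python
from fnmatch import fnmatchcase

# Hypothetical structured-string encoding of a geographical unit hierarchy.
catalog = [
    "Europe/Norway/Akershus/Aas",
    "Europe/Norway/Oslo/Oslo",
    "Europe/Sweden/Uppsala/Uppsala",
]

def hierarchy_query(pattern: str) -> list[str]:
    """Return all catalog entries matching a wild-card hierarchy pattern;
    '*' matches across levels, so 'Europe/Norway/*' finds all Norwegian units."""
    return [entry for entry in catalog if fnmatchcase(entry, pattern)]
```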
By combining the domains it should be possible to specify many administrative and thematic restrictions within a single query. A geographical hierarchy could for instance be based on the political/economical units:
Continent - Country - (Landscape -) State/District - County - Municipality - Township - Property - Lot
A natural supplement to this method is region/polygon geometry search. A region expressed using a hierarchical expression of geographical names could then be translated into a closed polygon or a volume. Such a geometry could then be used in the next stages of processing of the data dictionary query. An advantage of using polygons is that language expressions vary from culture to culture, while geometrical descriptions are internationally standardised, and translators are easier to specify. The data dictionary types should also be used in interactive graphical systems, to provide the user with a friendly browsing environment.
6.6.5 Spatial constraints
In addition to traditional database constraints, such as the nature and cardinality of relationships, temporal constraints and identification, spatial constraints are necessary for complex spatial objects, topology and other spatial relationships. Some examples of spatial constraints are given below.
• Networks require that all links have two end-nodes (not necessarily distinct), and that all nodes are attached to at least one edge. In manifolds, all border segments (edges or surfaces) should border two and only two regions or volumes (this requires a universal polygon/volume).
• Complex spatial objects, such as edges, regions, surfaces, volumes and rasters, will have to conform to structural constraints. An edge should be a one-dimensional path connecting two points. A 2D region should be a 2-dimensional phenomenon bounded by edges.
A raster should be a matrix of objects of a certain domain, representing a regular tessellation (having the same dimensions as the matrix) of a geographical region (normally 2D or 3D).
• For spatial database integration, a very important aspect is the identification of spatial objects contained in multiple databases. The geographical database management system should be able to tell that a river or a road in one database is the same as the river or the road in another database. The key to this problem could be a combination of a common spatial reference framework and adequate metadata (with quality/accuracy measures) in the database.
• Usage and quality constraints (scale, accuracy, context-dependence) form an equally important class of constraints for geographical data. Different aspects of scale will be central to these constraints.
Spatial constraints will be important in the design of the spatial operations of geographical query languages. It should also be possible to specify spatial constraints that apply to a certain kind of geographical phenomenon/data (e.g. a river cannot flow uphill).
6.6.6 Operations
The traditional set-based operations of relational database management systems are: selection, union, intersection, division, difference, negation, aggregation and join (see chapter 2). GISs introduce new domains, outlined in the previous sections, and therefore also need some spatial variants of the traditional operations. As mentioned earlier in this section: for the sake of standardisation and optimisation, it is important to find a smallest set of spatial operators and functions that is complete, in the sense that all possible spatial requests can be served by combinations of these operators and functions. Finding a set of basic spatial operations will be the first and most important step in developing a full set of spatial operations.
Following the ideas of the previous sections, a starting point could be to divide the operations into geometrical operations (including geographical data integration), vector-topological (often navigational) operations and “raster operations”. Basic spatial operations have also been discussed in the literature (e.g. [Egenhofer87], [Egenhofer90b]).
Geometrical operations on objects in spatial databases
Geometrical operations are operations that operate on geometrical elements (points, lines, polygons, surfaces and volumes) and normally use Euclidean geometry to obtain the results. The first category of geometrical operations returns a scalar value, and could therefore be called geometrical calculations or scalar operations (giving scalar results). Classes of geometrical calculations are:
• Distance queries: compute and return the (for instance Euclidean) distance between geometrical objects (3D or 2D distance). This operation could in addition return a direction vector.
• Extent queries, e.g.: length queries: compute the length of a line (or a perimeter); area queries: compute the area of a polygon or a surface; volume queries: compute the volume of 3D objects.
• Field queries: compute properties, e.g. slope_of_field(field, x, y), elevation_of_field(field, x, y) or value(field, point) that returns the value of the field at that point; mean(field, region) that returns the mean value of the field over the specified region; mean(field, line) for lines. These operations should also return a measure of the reliability of the value, since it is derived using the field representation/interpolation method.
The second category encompasses operations that return spatial objects. The operations that combine two different data sets can be termed integration operations.
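The scalar calculations above (distance and extent queries) can be sketched for simple 2D geometries, assuming points as coordinate pairs and polygons as vertex lists:

```python
import math

def distance(p: tuple[float, float], q: tuple[float, float]) -> float:
    """Euclidean 2D distance between two points (a distance query)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def perimeter(polygon: list[tuple[float, float]]) -> float:
    """Length of a closed polygon boundary (a length/extent query)."""
    return sum(distance(polygon[i], polygon[(i + 1) % len(polygon)])
               for i in range(len(polygon)))

def area(polygon: list[tuple[float, float]]) -> float:
    """Area of a simple polygon via the shoelace formula (an area query)."""
    s = sum(polygon[i][0] * polygon[(i + 1) % len(polygon)][1]
            - polygon[(i + 1) % len(polygon)][0] * polygon[i][1]
            for i in range(len(polygon)))
    return abs(s) / 2.0
```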
A spatial join [Güting94] [Samet95] is an operation that integrates two spatial data sets by merging their geometries to form a new (integrated) spatial data set, and perhaps performs operations based on the values of the attributes of the data sets. In traditional geographical information systems, a spatial join is often called an overlay. The union and intersection operations listed below are two kinds of spatial joins.
• boundary (object): an operation that returns the boundary of an object (topological [Egenhofer90b]).
• interior (object): a unary operation that returns the interior of an object (topological [Egenhofer90b]).
• exterior (object): a unary operation that returns the exterior of an object (topological [Egenhofer90b]).
• complement (object, [universe]): a unary operator that returns the part of the data set / space that is “outside” the object. The result will depend on the context (e.g. 1D, 2D or 3D) and the universe of discourse.
• union (polygon-polygon, line-line, network-network, manifold-manifold): a binary operation that returns the union of two (sets of) geometrical objects.
• intersection (object1, object2): a binary operation, returning the intersection of the two (sets of) geometrical objects. The intersection of two sets of areas can be a set of areas, lines and points (intersection is a valid operation for line, area, surface and volume objects).
• projection (object, projection description): an operation that returns an object of lower dimensionality. Should support dimensionality generalisation as discussed in an earlier section.
• 3D queries: operations on volumes and surfaces/fields that return points, lines, areas, fields or volumes, such as drainage_basin(surface, point) and visible_areas(surface, point) (both of which return areas).
• clip (object1, object2): returns the parts of object1 that are within object2.
When applied to a field, the dimensionality of the field (object1) will be reduced to the dimensionality of object2.
• generalise (object, scale): an operation that generalises the object (structure) to the indicated scale.
• buffer (object, (Euclidean) distance): an operation that returns a new object (region or volume, depending on the context/universe) that covers the original geometrical object and all the space that is within distance from the object. The buffer function could for instance be used in conjunction with an “inside” query to perform neighbourhood queries (find all houses within 200 meters of the E6 in “Nordland fylke”).
• neighbour queries: an operation that finds the n nearest neighbours (of some specified type) of a geographical object.
The third category of queries operates on geometrical objects and returns a truth value. These queries can in most cases be formulated using the previously mentioned query types [Egenhofer90b].
• equal (x,y) can be determined by (intersection(complement(x),y) = ∅) and (intersection(complement(y),x) = ∅).
• contains (x,y) can be determined by (intersection(complement(x),y) = ∅). (Point on line, point in polygon, point on surface, point in volume, line in line, line in polygon, line on surface, line in volume, polygon in polygon, polygon in volume, surface on surface, surface in volume, volume in volume.)
• overlaps (x,y) can be determined by (intersection(x,y) ≠ ∅).
• touch (x,y) can be determined by ((intersection(boundary(x),boundary(y)) ≠ ∅) and (intersection(interior(x), interior(y)) = ∅)).
• location queries (spatial relationships between two objects): n, s, e, w, ne, nw, se, sw [Roussopoulos88].
Topological spatial operations
Topology comprises the geometrical relationships between objects, and operations on topology will consequently utilise these relationships (navigation).
Navigation is performed by starting out with some object, and then following the relationships of the data model to other objects in the database. In the GIS context, navigation is useful for topological relationships in networks and manifolds, in addition to other object-object relationships. Topological operations for geometrical data have been investigated in the literature (e.g. [Pullar88], [Egenhofer90a]).
• Border (n-complex): returns the (n-1)-complexes that make up the borders of the n-complex. E.g. in Figure 3-2, border(RX) = {L1,L2,L8,L9} and border(L4) = {P4,P5}.
• Coborder (n-complex): returns the (n+1)-complexes that have this n-complex as a part of their border. E.g. in Figure 3-2, coborder(L8) = {RX,RY} and coborder(P3) = {L2,L3,L8}.
• Neighbour (n-complex): returns the set of n-complexes that are neighbours of this n-complex. E.g. in Figure 3-2, neighbour(X) = {Y} and neighbour(L8) = {L2,L3,L6,L9}.
• Transitive closure: (recursive) operations working on relationships. An illustrating example is to find all ancestors or descendants of a person using the parent-child relationship recursively. Topological operations related to transitive closure work on complete networks of topological relationships, such as shortest path, travelling salesman, reachability and minimum spanning tree.
Raster operations
The raster-specific operations performed within GISs will mostly be operations for image processing [Gonzalez87] and pattern recognition [Tou75] [Gonzalez78] [Jain87] [Thomason87]. Since rasters can be used to represent fields, these operations are related to field operations. For a database management system, the most interesting queries are subimage queries and content-oriented (pattern recognition) queries. Advanced analysis on rasters/fields, such as modelling the spread of forest fires, will normally be handled by applications, and therefore need not be a capability of the database system.
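The border/coborder and transitive-closure operations above can be sketched over incidence maps (the example data below is a hypothetical fragment in the style of Figure 3-2):

```python
from collections import deque

# Hypothetical incidence map: each complex maps to the lower-dimensional
# complexes on its border (regions -> edges, edges -> points).
border = {"RX": {"L1", "L2", "L8", "L9"}, "RY": {"L3", "L6", "L8"},
          "L8": {"P3", "P6"}}

def coborder(cell: str) -> set[str]:
    """All higher-dimensional complexes having `cell` on their border."""
    return {c for c, b in border.items() if cell in b}

def reachable(start: str, adj: dict[str, set[str]]) -> set[str]:
    """Transitive closure of a topological adjacency relation, via BFS."""
    seen, queue = {start}, deque([start])
    while queue:
        for n in adj.get(queue.popleft(), ()):  # neighbours of current cell
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return seen
```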
A few examples of operations on images that could be supported by a database management system:
• Subimage(image_id, x1, y1, x2, y2).
• Channel(image, ch): returns channel ch of the image.
• Filter(image, a_filter): performs a convolution on the image using a_filter as a filter.
• Histogram(image): produces the image histogram.
• Find_feature(image, feature, similarity-requirement): this would be a very advanced pattern recognition operation that would be very useful for content-based image retrieval and feature extraction.
• Ortho_photo(image, DEM, control_point_pairs): returns a geometrically rectified image (using photogrammetrical methods), useful for integration with other spatially referenced data. This is an example of an operation that is needed to transform images to a format that makes them suitable for integration with other kinds of geographical information, for instance within a GIS.
• Other special-purpose functions, such as Fourier transforms and stretching (e.g. linear stretching and histogram equalisation).
Image operations have been discussed more thoroughly by Berry [Berry87].
Overlay
The following paragraphs discuss overlay, an integration operation for spatial data sets that cover the same geographical area. It is perhaps the most central operation of current GISs. A polygon overlay is the action of integrating two sets of polygons (each set will be seen as a polygon network (manifold)). The name overlay comes from the traditional methods, where maps of the same scale, covering the same area, were placed on top of each other to be analysed in combination. The result of the overlay will be a new set of polygons (in a polygon network (manifold) structure), where the borderlines are made up of all the borderlines from the two original polygon sets. The properties of the new polygons will be a combination of the properties of the polygons of the original coverages (see for instance [Burrough89]).
This kind of overlay can also be used to combine point and line data with polygon data, for instance to determine which administrative unit(s) a road or a house belongs to, or which drainage basins a waste disposal site will affect. In geographical information systems, overlay is one of the most useful operations. It is used to combine different types of geographical information for a region, for instance for suitability analysis and hazard or impact analysis. In traditional geographical analysis, the overlay operations have been performed by putting different thematic map layers on top of each other for visual inspection, or for printing the result to produce a new map. The overlay process normally includes some pre-processing and post-processing. The pre-processing is performed to produce polygon networks from the original data. The post-processing is usually some kind of computation of the properties of the new polygon network from the properties of the original ones, and possibly removal of edges between polygons of equal type. After the overlay operation, the new polygon set can be used as input to new overlay operations. An example of an overlay is the combination of a cadastral database* with a vegetation database to find oak forests on government properties. Another example is the suitability analysis presented by Burrough [Burrough89], where a soil map is combined with a drainage map to find promising areas (for instance for agriculture). A more elaborate example could be to find areas suitable for cottages. Requirements could be that they should be within 1 km of the sea or a lake, on a south- to south-west-facing slope that is not in the shadows, not wasting fertile land, within 100 meters of a public road, with plenty of ground water supplies, etc.
* A cadastre is an official register containing information on all real estate in an administration unit.
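The oak-forest example, in its simplest boolean-raster form (two aligned masks combined cell by cell; the data below is invented for illustration), can be sketched as:

```python
def raster_and(a: list[list[bool]], b: list[list[bool]]) -> list[list[bool]]:
    """Cell-by-cell boolean overlay of two rasters with identical extents,
    e.g. an oak-forest mask ANDed with a government-property mask."""
    return [[x and y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
```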
Overlay is useful for both raster and vector data, separately and in combination. Vector polygon overlay is computationally demanding. The first step involves finding all new line crossings, and the second step is to build the new polygon topology. The resulting polygons inherit all the properties of the participating data sets. Raster overlay is straightforward when the input rasters have the same cell boundaries. The resulting raster can be obtained cell by cell from the originals using the relevant operations (for instance addition, subtraction, multiplication and division) on the cell values. For boolean rasters, the boolean operators AND, OR, XOR and NOT can be used to determine the resulting raster. In a general-purpose GIS, it should also be possible to perform raster-vector overlay (support for continuous variation/fields in the vector model). In a database query language, overlay will be performed in some form or another in all operations that involve a spatial join.
6.7 Transactions
Most current GIS implementations offer only limited database management support. Database management systems are to some extent used for storing non-geometrical attribute data, but very few systems store all their data in an integrated DBMS environment. This status quo stems from the performance problems GIS vendors are facing for their geometrical operations (selection, overlay, network analysis, and so on). The vendors have been forced to optimise the spatial data structures, by-passing database management system support. This data organisation makes advanced transaction management virtually impossible. To provide the most primitive support for transaction management, some systems offer check-in check-out capabilities, but in general, the first-generation GISs do not support multi-user environments in an acceptable way.
In the future, a significant part of the growing community of GIS users will operate in multi-user environments, where controlled sharing of geographical data sets will be essential for the utility of the systems. Hence, transaction handling and concurrency control will have to be given a higher priority in the next generation of GISs. To be able to perform transaction processing tasks using today's standards and methods, the geometrical part of the data sets will have to be integrated with the rest of the data sets more closely than has been provided by most of the first-generation GISs. The geo-relational approach, as utilised by for instance ARC/INFO, is too weak an integration mechanism for present concurrency control and transaction processing methodologies. The problem of the geo-relational approach is not primarily that the geometrical part is separated from the non-spatial part of the data sets, but that the data organisation of the geometrical part has not been designed with concurrency control in mind (a non-database management system approach). Ideally, thematic, geometrical and topological data should all be organised according to database management system principles. Such a solution should be possible for all systems by modifying the internal representation of the geometrical part of the data sets. The reluctance of GIS designers to take this last step into sophisticated database management is at least partly due to performance concerns. Current database transaction processing and concurrency control mechanisms are mostly developed for the relational database model [Bernstein87]. The mathematical foundation of the relational model has made it an attractive model for research in this area. Some of the techniques developed for the relational model are also being investigated and extended to fit into object-oriented database technology (e.g. [Herlihy90]).
6.7.1 Transactions on temporal geographical data
Most geographical data sets represent phenomena that evolve over time, and historical information on these changes should be kept for time series analysis. This means that when storing changes to phenomena, one must also be sure to keep the past states accessible to the database users. The opportunities for simplification that temporal/versioned data handling gives should be exploited when choosing transaction processing mechanisms and concurrency control methods (concurrency control in multi-version databases is discussed by, for instance, Agrawal and Sengupta [Agrawal89]). What distinguishes spatial data sets from other data sets is that they occur in a spatial context (all geographical data have a position in 2- or 3-dimensional geographical space). This underlying structure is the single most distinguishing part of geographical data semantics, and attempts should be made to utilise these characteristics of spatial data in transaction management, and particularly in developing new concurrency control methods.
6.7.2 Transaction management
Traditional transaction management is based on the notion of atomic transactions and the ACID transaction properties. If a transaction has completed without errors or conflicts of any kind, it is allowed to commit; if not, it has to be aborted. If a transaction commits, all the changes it has made to the database are made permanent. If a transaction must be aborted for some reason before completion, all the changes that this transaction has made to the database must be undone in such a way that no traces of the transaction are left in the database (if some other transaction has read data written by an aborted transaction, it must also be aborted). This is called rolling back or undoing the transaction.
To allow rollback and recovery from system failures (due to disk crashes, power failures, …), a transaction log is kept that records all the operations that have been performed on the database, which transaction performed them, and interesting transaction events (particularly commits). In addition, checkpoints should be established at specified intervals (suspend all transactions, and force all updates made by committed transactions to be written to permanent storage (normally the disks)). To recover from system failure, one starts out with the latest checkpoint, and uses the log to fix the database. Changes made by transactions that had committed when the crash occurred must be reflected in the database. If the commit happened after the last checkpoint, operations that appeared after the checkpoint must be redone. The changes made by transactions that started before the checkpoint and were aborted after the checkpoint, or did not manage to complete before the crash occurred, will have to be undone. A transaction manager uses the services of the underlying concurrency control system to ensure that transactions are executed and ended in a correct way. The transaction capacity required by a geographical database will depend on the popularity of the database and its availability. For servers of official geographical data, such as national map series that are often used as background maps for presentations and analysis, several simultaneous transactions must be expected to occur throughout the day. Exact numbers are difficult to predict, as they depend on the future number of GIS users and the way in which data distribution and management is organised. The most primitive transaction management method is check-in check-out, avoiding lower-level concurrency control (as used in ARC/INFO LIBRARIAN [Aronson89]).
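The redo/undo recovery procedure can be sketched as follows, under simplifying assumptions (the whole log lies after the last checkpoint, and each log record carries before- and after-images):

```python
def recover(log: list[tuple[str, str, str, str]], committed: set[str],
            db: dict[str, str]) -> dict[str, str]:
    """Recovery sketch: redo the writes of committed transactions in log
    order, then undo the writes of uncommitted transactions in reverse
    order using the logged before-images.
    Log records are (txn, key, before_value, after_value)."""
    for txn, key, _before, after in log:
        if txn in committed:
            db[key] = after          # redo committed work
    for txn, key, before, _after in reversed(log):
        if txn not in committed:
            db[key] = before         # undo incomplete/aborted work
    return db
```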
Check-in check-out has been much used in design applications in the past, but does not allow the necessary concurrency for sharing of data in cooperative work and for on-line access to external data sets. For more advanced transaction management, involving distributed databases, the 2PC (two-phase commit) protocol is the most widely used [Bernstein87]. This protocol provides consistency of a distributed database system by ensuring that a transaction is either committed at all sites or aborted at all sites in the network. To limit application blocking, 2PC has built-in flexibility to handle site failures and network failures/partitions correctly.
6.7.3 Concurrency control
There are two popular groups of methods that are used for concurrency control in database systems. One is based on locking and the other on timestamp ordering. A third group, which often utilises techniques from the first two, is the optimistic methods. Traditional concurrency control mechanisms are based on the serialisability criterion (page 18) [Bernstein87]. The locking technique tags all items that are accessed by some transaction with a read tag or a write tag (other tag types are possible, e.g. read-write and increment). All the tags of a particular transaction are removed before or when the transaction terminates. Access to a tagged item by another transaction is allowed or disallowed on the basis of the existing tag type of the item and the operation requested (e.g. an existing read tag normally blocks all new write operations on an item until the read tag is removed). The most popular locking technique is 2PL (two-phase locking), which ensures serialisability by forbidding transactions to acquire new locks once they have released one or more locks. There is a variety of different 2PL locking techniques, some liberal and some conservative [Bernstein87]. Liberal techniques allow much concurrency at the risk of having to roll back transactions that conflict (deadlock situations).
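The lock-compatibility rule underlying such locking schedulers can be sketched as a minimal lock table (lock release, blocking and deadlock handling are omitted; names are hypothetical):

```python
class LockManager:
    """Minimal shared/exclusive lock table in the spirit of 2PL: a request
    is granted only if it is compatible with the locks held by other
    transactions (two reads are compatible; any write conflicts)."""
    def __init__(self):
        self._locks: dict[str, list[tuple[str, str]]] = {}  # item -> [(txn, mode)]

    def request(self, txn: str, item: str, mode: str) -> bool:
        holders = self._locks.setdefault(item, [])
        for other_txn, other_mode in holders:
            if other_txn != txn and (mode == "write" or other_mode == "write"):
                return False  # incompatible: a real system would block or abort
        holders.append((txn, mode))
        return True
```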
Conservative techniques avoid rollbacks by not allowing risky concurrent access. Conservative strict 2PL is an ultra-conservative technique that requires all locks to be acquired before any operations can be performed (conservative 2PL), and that does not release any locks before the transaction has committed or aborted (strict 2PL). Timestamp ordering is a different kind of approach. Each transaction is assigned a timestamp. When a data item is accessed, the transaction's timestamp is assigned to the data item together with the transaction-id and an indication of the kind of operation (read/write). When a transaction tries to access a data item, the timestamp of the new transaction is compared with the timestamps that are attached to the data item. If the access would lead to a conflict with some other transaction, it is not allowed (e.g. if an older transaction tries to read an item that has been written by a younger transaction). As long as there are no conflicts, the operation is performed, and the new timestamp is added to the list of timestamps of the data item. There are many variations of timestamp ordering. Optimistic techniques allow a maximum amount of concurrency. All accesses are allowed, but when a transaction has performed all its operations, it must be checked whether or not conflicts have occurred. If there have been conflicts, the transaction is rolled back; if not, it is committed. Comparisons of the level of concurrency allowed by the different methods illustrate the strengths and weaknesses of the three approaches [Franaszek85].
Geographical databases
Geographical databases have the following characteristics relevant to concurrency control:
• Transactions will often be spatially localised, by requesting data from a certain region.
• Data may be distributed between many spatial data servers.
• Virtually no updates to non-local data (one will generally not be allowed to update the data that belong to another organisation), and even very few local updates.
• Many of the geographical data sets (themes) will be used as background (read-only) data (information/maps), with relaxed requirements with regard to the correctness of the data (need not be completely up to date, and could be coarse).
• For updates, long transactions tend to dominate.
• Historical data. Most data should not be changed or deleted. New data are added to the data set, with the time of validity or acquisition attached. A possible modification to historical data is to update an attribute that says that the data are no longer current (for instance when a house is torn down). Temporal data lead to very few locking conflicts.
• Hot spots* are generally not a problem. Different kinds of metadata might be hot-spot candidates.
The characteristics of geographical databases have much in common with those of other advanced database applications (such as CAD/CAM and software development environments). A good review of concurrency control in advanced database applications can be found in [Barghouti91]. In the following sections, some issues of transaction processing for geographical databases are discussed.
Ways of allowing more concurrency for long transactions
Traditional serialisable schedulers are very restrictive on concurrency, and their limitations are particularly evident for long transactions [Barghouti91]. The theoretical (and unreachable) limit of transaction concurrency is the class of correct schedules. To approach this limit, our knowledge of the data and transaction semantics has to be utilised (Farrag and Özsu discuss the utilisation of transaction semantics [Farrag89]).
This means that for geographical databases, the temporal nature of data together with their spatial properties must be investigated in order to specify liberal and correct concurrency control methods. Other correctness criteria than strict serialisability have been proposed for long nested transactions on versioned data. Korth and Spiegel suggest that database constraints are used to partition transactions into independent parts [Korth88]. They also incorporate versioned data into their method, and exploit the fact that sub-transactions can often be arranged into a partial order instead of a serial order, allowing a higher degree of concurrency. They still require serialisability among the top-level transactions. CAD transactions have been suggested as a potential application area for the method. Concurrency control mechanisms for GIS databases should take advantage of results obtained on more liberal concurrency control methods, and extend those methods in accordance with the particular semantics of geographical data. Spatial locking methods that are able to perform locking on a spatial region, preferably on a per-theme basis, should be investigated.
* A hot spot is a data item in the database that is very often accessed (e.g. accumulation data).
Temporal data
If all data stored in the database are marked with their date of validity, and many generations of data are present (historical data), all read-transactions can be served immediately without having to lock the data items, provided they do not require up-to-the-minute data. To ensure consistency of the data sets, such read-transactions should be tagged with a query timestamp that is older than all active write-transactions. The data that were valid at the defined query time will be returned to the transaction. A problem with this approach is the difference between transaction time and valid time.
Transaction time poses no special problems to the concurrency control mechanisms, and opens for time-stamp based concurrency control [Bernstein87]. The solution sketched in the previous paragraph will work if transaction time is used. Valid time cannot be used as a basis for concurrency control: a data set will often not be inserted into the database at the time of collection, and this means that we cannot know in advance whether interesting “older” data will be inserted during the transaction. We can get around this by returning the time-stamp of the query together with the results, to indicate that the results were consistent with the state of the database (not the world) at a certain point in time (as-of).

Read-only transactions

Read-only transactions will dominate for most geographical database servers. It is therefore important that read-only transactions are not unnecessarily delayed by the concurrency control process. A strategy for accomplishing this could be to employ optimistic methods when these kinds of transactions are dominant. Through the concurrency control process, the transaction manager should log information on possible violations of the correctness criterion (e.g. serialisability) that occur throughout the life of a transaction. After the transaction has done all its operations (reads), the transaction manager could determine what to do (commit or abort), depending on the preferences of the transaction. In the case of possible conflicts, the transaction should always be warned that an inconsistent data set could have been returned. If one can determine whether a transaction will be read-only before it starts, read-only transactions could receive special treatment based on their nature. A way to accomplish this is to require transactions to state beforehand whether they will (possibly) perform writes. In many real-world geographical databases, only a limited set of users are allowed to make changes.
If a transaction comes from a user that is not allowed to perform updates, it will have to be a read-only transaction, and all writes will be refused. Read-only transactions may also want to know for certain that they get a consistent view of the database. A transaction should therefore be able to specify what kind of service it wants. Transactions that want a consistent view of the database may, of course, be delayed and restarted as necessary, resolving whatever conflicts may occur. Whether guaranteed consistency should be the default or not is a matter of taste. In a 1983 paper, Garcia-Molina [Barghouti91] introduces the notion of sensitive transactions in order to be able to treat browsing transactions and normal (sensitive) transactions differently. The introduction of a lock type for browsing transactions (that do not require a perfectly consistent view of the database) has been suggested [Kemper94].

Replication

As for all other databases, replication can be used to speed up access to non-local geographical data and to make the global system of geographical data more resilient to site and network failures. This is achieved by storing copies of the master database at different locations in the network, so that the copies can take some of the load during normal operation, and take over in the case of network failures (see page 155). Replication gives rise to interesting challenges for concurrency control and transaction management, and different approaches can be taken. By using a master copy, all write-locks will have to be obtained at a single site, while reads can be performed locally. Another approach is to demand that write-locks are acquired at a majority of the sites that contain the data before an update is allowed. This approach avoids errors when a network partition occurs, but imposes more overhead at query time. For geographical databases, no special issues arise with regard to replication due to the spatial nature of the data.
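The two write-lock strategies for replicated data just described can be sketched as follows. This is only an illustrative sketch with invented site names, not a full replication protocol:

```python
# Illustrative sketch of the two write-lock strategies for replicated data:
# a master-copy scheme needs the lock at one designated site only, while a
# majority scheme needs locks at more than half of the sites holding a copy,
# which avoids errors under network partitions (no two partitions can both
# hold a majority). All names are invented for the example.

def master_copy_ok(locks_granted, master_site):
    # Master-copy scheme: the update may proceed once the master grants it.
    return master_site in locks_granted

def majority_ok(locks_granted, all_sites):
    # Majority scheme: more than half of the replica sites must grant locks.
    return len(set(locks_granted) & set(all_sites)) > len(all_sites) / 2

sites = ["oslo", "bergen", "trondheim"]
print(majority_ok(["oslo", "bergen"], sites))        # 2 of 3 sites: allowed
print(majority_ok(["oslo"], sites))                  # 1 of 3: refused, since a
                                                     # partition might grant a
                                                     # conflicting majority
print(master_copy_ok(["oslo"], master_site="oslo"))  # master granted: allowed
```

The majority test makes the partition argument concrete: since two disjoint groups of sites cannot both exceed half of the replicas, conflicting updates cannot be allowed on both sides of a partition.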
Replication can be used to increase reliability, as for all other databases, and the dominance of read-only users makes concurrency control a fairly uncomplicated task.

A spatial locking / spatial concurrency control method

An interesting concurrency control strategy for spatial databases in general, and GIS databases in particular, is a locking scheme based on a combination of location and theme. A query to a spatial database is either an ordinary attribute-qualified query, such as:

    select person.name
    from person, building
    where building.area > 5000 sqm
      and building.owner = person.pid;

or a spatially qualified query:

    select landuse.class, sum(landuse.area)
    from landuse, districts
    where landuse.region is_inside districts.region
      and districts.name = “Akershus”
    group by landuse.class;

A GIS user that is doing updates to the database will normally be working on a small region at a time. To ensure that undesirable interference is avoided, it should be possible to lock some or all objects of a certain theme (or with certain properties) in a specified region. GIS applications, with their long transactions and region-based queries, offer opportunities for efficient concurrency control techniques, and the dominance of temporal data in GIS applications does not reduce these opportunities. If a locking scheme (e.g. 2PL*) is employed as a concurrency control mechanism, both spatial and thematic locking should be made available. This means that a spatial object should be locked both when parts of its geometry are accessed and when some of its thematic attributes are accessed by a transaction. A locking structure that can support this could be organised as follows: each lock on a spatial object should consist of sufficient metadata to determine its spatial extent (volume, polygon, line or point) and its thematic “class”.
The geometries of all the locked objects could then be maintained in a spatial data structure that allows efficient intersection operations (e.g. an R-tree [Guttman84]). The locking structure should reflect the data structures used for storing the data; 2- or 3-dimensional hierarchical structures are obvious candidates. Each node in the tree should have locking sub-structures for all the object types (themes) in the database. The locking process should traverse the tree from top to bottom, just as for ordinary tree locking [Bernstein87].

Before a spatial object is accessed, the geometry of the object must be combined with the locking structure using a geometrical overlay. There are three possible outcomes of this operation: no intersection, complete containment (either way) and overlap. If the object does not intersect with any objects in the locking structure, the operation can be allowed, and the object included in the locking structure. If the object is completely contained by objects in the locking structure, the thematic structure is checked. If there is no thematic clash, the operation can be allowed. If there is a clash, one has to determine whether the operation should be allowed on the basis of the type of operation in question. If an overlap occurs and there is a clash in both thematic content and type of operation, the operation cannot be allowed, and the “conflict region” could be returned to the transaction. The transaction can then adjust the region and try again.

A modification to this strategy could be to use the thematic content of the object to be accessed to select a smaller conflict group of geometrical objects from the locking structure before doing the overlay. This could save time, depending on the implementation of the geometrical overlay process.
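The combined spatial/thematic conflict test described above can be sketched as follows. A flat list of bounding boxes stands in for the R-tree, and all class, theme and mode names are invented for the example:

```python
# Minimal sketch of spatial + thematic locking: a lock request conflicts
# only if it overlaps an existing lock spatially, clashes thematically,
# and the operation types are incompatible. On a conflict the "conflict
# region" is returned so the transaction can adjust its region and retry.
# A flat list stands in for the R-tree of a real implementation.

def intersects(a, b):
    # Axis-aligned bounding-box intersection test.
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

class SpatialLockManager:
    def __init__(self):
        self._locks = []  # entries: (bbox, theme, mode, txn)

    def request(self, txn, bbox, theme, mode):
        for lbox, ltheme, lmode, ltxn in self._locks:
            if ltxn == txn or not intersects(bbox, lbox):
                continue                 # no spatial overlap: no conflict
            if ltheme != theme:
                continue                 # different theme: no thematic clash
            if lmode == "read" and mode == "read":
                continue                 # shared reads are compatible
            return ("denied", lbox)      # report the conflict region
        self._locks.append((bbox, theme, mode, txn))
        return ("granted", None)

lm = SpatialLockManager()
lm.request(1, (0, 0, 10, 10), "landuse", "write")       # granted
lm.request(2, (5, 5, 15, 15), "roads", "write")         # granted: other theme
status, region = lm.request(3, (5, 5, 15, 15), "landuse", "read")
print(status, region)                                   # denied (0, 0, 10, 10)
```

Note how the denied request learns the overlapping region, mirroring the retry possibility described in the text; a production version would replace the linear scan with an R-tree search.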
A problem with object-level approaches such as the one outlined above is that a geographical transaction often involves a large number of spatial objects. To reduce the locking overhead, such transactions should be able to do the locking at a higher level of granularity (multi-granularity locking [Bernstein87] [Kemper94] or multi-level concurrency control [Weikum86] [Badrinath90]). A transaction could state that it is interested in reading a certain region and a certain group of themes within that region, and updating another region (normally a subset of the first region) and a group of themes within that region. The more coarse-grained one gets in this strategy, the more one approaches the check-in/check-out concurrency control strategy. The method, as outlined, would therefore be very flexible.

Parallel locking and searching

Multiple-processor database machines are common today, and for these kinds of architectures it is possible to give the locking manager a dedicated set of processors. This could be a useful approach for spatial concurrency control. A way of doing it could be as follows:

1. The database machine/transaction manager gives the locking manager the spatial regions accessed by the transactions, together with the object types/themes that are involved.
2. The locking manager searches for conflicts, and reports them back to the database machine.
3. The database machine/transaction manager determines what to do with the transactions.

To optimise for speed, all spatial objects of the database could be placed in the same data structure (overlay), and for all points, lines, regions and volumes there should be a list of all conflicting/overlapping objects.

* 2 Phase Locking

6.8 Distribution issues

As discussed in chapter 3, geographical data have a potentially large and geographically widespread customer base.
At the same time, most GIS applications will request local data most frequently, and updates will generally only be allowed for local data sets. Such a setting is likely to make a distributed approach to global geographical database management more attractive than isolated huge database servers.

Advantages

The following advantages of distribution can be identified for the geographical data scenario:

• Limitations on the size of local databases. Because each site will only use local storage for locally owned data and perhaps some copies of often-used external data sets, the amount of data at each site is more likely to be manageable.
• Reliability. Compared to the centralist approach, the database is less vulnerable to site failures. If one site goes down, only the local data of that site will become unavailable, affecting only a limited number of users. Through replication the availability of the data can be improved even further.
• Locality. The geographical and thematic distribution of geographical data will generally reflect the data usage patterns. Most user queries will reference local data only, saving communication bandwidth.
• Autonomy. The distribution of geographical data will be determined by ownership. The owners of the data can administrate their data locally, controlling updates and the availability of the data. The local systems will to a large extent be able to function even if one or several remote geographical databases become unavailable for some reason.
• Capacity. The databases in a distributed setting will generally be manageable in size, and in a distributed environment the “global” database can be expanded by adding more local databases. Performance bottlenecks are also less likely to appear in a distributed system than in centralised systems, since the transaction load will be distributed among many database systems.
• Controlled replication.
By putting copies of the original data sets at the sites where they are most frequently requested, the data can be made directly available to the customers at their local site, saving communication bandwidth. This is most advantageous for static data sets. For dynamic data sets, the update problem for replicated data sets will be significant.

Disadvantages

• Data administration. Someone (the system data dictionary) must keep track of where the different data sets are located (both the originals and the copies), and the collection of usage statistics for billing purposes will probably be more complicated than in a centralist approach.
• Network delays. Communication among geographically distributed database servers over long-haul networks is presently slow, so the retrieval of distributed data sets can take some time (both for localisation and retrieval). The evolving higher-capacity networks (e.g. fibre-optic cables) will help to reduce these problems.
• User inconvenience. Because the data are distributed, a “user” (or perhaps the transaction manager) will sometimes have to access a number of databases to answer a single query.
• Communication overhead. A certain amount of communication is needed for transaction management and data administration in a distributed system.

Managing huge local databases

Even if geographical data are distributed according to ownership, some local GIS databases will become huge by today's standards (for instance those of national mapping agencies and sites containing remotely sensed and other automatically collected data). The size of these databases could make them candidates for distribution. Such a distribution, through the partitioning of a huge local database into many more manageable databases, is another aspect of distribution. This aspect can be covered by parallel database systems or parallel database machines.

6.8.1 Parallel processing

Parallel processing techniques use the divide-and-conquer strategy to speed up computations.
First the problem is divided into sub-tasks that can be performed independently and in parallel. Then the sub-tasks are distributed among the available processors as evenly as possible. Each processor carries out its part of the job (possibly communicating with a limited number of other processors), and finally the partial results are combined to obtain the complete result.

There are different ways of obtaining speed-up through parallelism [Quinn87]. The pipeline principle lets a data stream go sequentially through an array of processors, each of which performs its separate task (similar to a factory assembly line). The results from the first processor are passed on to the next, and the final results appear at the end of the processor pipeline. On a stream of data, the speed-up factor will grow, bounded by the number of processors (N) in the array, as the amount of data increases. This is because all N processors are working simultaneously on different stages of the computation: the first processor is working on data item X(i) in the sequence, the second processor is working on data item X(i-1), while the last processor in the sequence is working on data item X(i-N+1). The pipeline is not considered a truly parallel architecture.

In SIMD (Single Instruction stream, Multiple Data stream) processor arrays, data are distributed among the processors, which all execute the same program. On MIMD (Multiple Instruction stream, Multiple Data stream) multi-processor systems, each processor can be programmed individually. The processors in a parallel computer can, according to their primary usage and communication patterns, be connected using all-to-all connections, mesh connections, pyramid connections, hypercubic connections or other kinds of special-purpose connections [Quinn87].

The problem with parallel processing is that one must find ways of partitioning the problem into an arbitrary number of autonomous sub-tasks.
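The pipeline speed-up bound mentioned above can be checked numerically: with N stages and M data items, a full pipeline finishes in N + M - 1 steps against N * M steps sequentially, so the speed-up approaches N as the stream grows. A small sketch:

```python
# Numerical check of the pipeline speed-up bound: the speed-up grows with
# the amount of data, but is bounded by the number of processors N.

def pipeline_speedup(n_stages, m_items):
    sequential = n_stages * m_items       # one item at a time, all stages
    pipelined = n_stages + m_items - 1    # steps until the last item emerges
    return sequential / pipelined

N = 8
for m in (1, 10, 1000, 100000):
    print(m, round(pipeline_speedup(N, m), 3))
# The printed speed-up rises from 1.0 towards (but never reaches) N = 8.
```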
For data processing applications, a very important part of the problem is the partitioning/splitting of the data set.

Geographical Information Systems

Parallel processing has not yet been utilised in commercial GISs. There has, however, been some research activity, for instance in Edinburgh [Dowers90] [Hopkins92]. To be able to handle the data processing needs of the interactive GISs of the future, very powerful technology will have to be employed. There is a need for fast spatial and “traditional” retrieval of large amounts of geographical information, and there is a need for fast geometrical processing of the spatial data. It will probably be very difficult to meet the future requirements of general-purpose interactive GISs without utilising parallel processing technology.

6.8.2 Distribution of spatial data

As pointed out in the beginning of this chapter, the amount of globally available digital geographical information (probably Petabytes of data) is growing enormously, and will continue to do so in the foreseeable future. The information that is of interest to a single application is, however, normally restricted to some small, manageable subset of the global geographical database. Mechanisms must be devised to facilitate fast and easy retrieval of the interesting data sets from the complete GIS information base. One aspect of this is adaptation to scale, or generalisation; this topic will not be considered in detail here. An aspect that will be elaborated further is the organisation of the data for fast retrieval of subsets of the global geographical database. By dividing and distributing the global geographical database in such a way that the local databases become as autonomous as possible, the performance goals could be fulfilled for the majority of geographical data users. As discussed in chapter 3, GIS data are spread between many different owners and vendors.
The most realistic approach to database management for GIS should therefore be the distributed database approach. The various producers and owners of GIS data are generally autonomous, and must be expected to choose different hardware and operating system platforms and different DBMS solutions. The resulting collection of (semi-)autonomous databases is termed a heterogeneous distributed database system [Ceri88], a distributed multidatabase system [Özsu91] [Kim95d], or a federated DBMS [Elmasri89] (as mentioned in chapter 2). Partitioning of the data set can be done at the inter-database level and the intra-database level.

• At the inter-database level one can “distribute” the data between different organisations, parts of organisations and databases to reduce the overall amount of remote data that will have to be considered during local searching. This is termed a filtering method here.
• At the intra-database level, the data can be spread out among the available processors at a site. This is termed a parallelisation method here.

If integration of distributed geographical data sets and communication between geographical data servers are to be possible, a set of standardised interfaces must be supported by all the participating systems. Such standards will have to be developed both at the modelling level (conceptual schema) and at the data-transfer level. Data modelling has been discussed in chapter 4, and data transfer is covered by the seven OSI* layers [Tanenbaum81] and spatial data transfer standards (yet to be internationally approved), such as the SDTS [USGS90]. Both the OGC (OGIS) and the ISO (ISO/TC211) are presently working on standards for allowing distributed processing of geographical information.
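The two partitioning levels just mentioned can be sketched as follows; the site names, regions and themes are invented for the example:

```python
# Sketch of the two partitioning levels for geographical data:
# inter-database filtering routes a request only to the sites whose
# advertised region and theme set can contribute to the answer, while
# intra-database hashing spreads objects over the nodes within one site.

def overlaps(a, b):
    # Axis-aligned bounding-box overlap test (x1, y1, x2, y2).
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def relevant_sites(sites, query_bbox, theme):
    # Inter-database level (filtering): discard sites that cannot match.
    return [name for name, (bbox, themes) in sites.items()
            if overlaps(bbox, query_bbox) and theme in themes]

def local_node(object_id, n_nodes):
    # Intra-database level (parallelisation): hash the identity to a node.
    return hash(object_id) % n_nodes

sites = {
    "mapping_agency": ((0, 0, 100, 100), {"topography"}),
    "telecom":        ((0, 0, 50, 50),   {"telephone"}),
    "forestry":       ((40, 40, 90, 90), {"vegetation"}),
}
print(relevant_sites(sites, (45, 45, 60, 60), "vegetation"))  # ['forestry']
```

Only the forestry site needs to be contacted for the example query; the two other sites are filtered away on theme alone, even though their regions overlap the query region.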
Filtering

Two filtering methods are very useful for GIS data, one due to the spatial properties of the data, and the other due to the many classes of users and owners of GIS data:

• Spatial filtering restricts the search to a geographical region.
• Thematic filtering restricts the search to a certain theme or class of themes.

Both methods can be realised by physically splitting the global database according to spatial and thematic criteria. Such a splitting is very often feasible because it suits the structure of ownership of geographical data (the number of owners of spatial data sets tends to be of the same order of magnitude as the number of spatial data sets; see, for instance, page 37).

Spatial filtering

Many GIS applications will operate in a fixed geographical context that in itself provides good filtering. Such contexts should be reflected in the distribution of the data to provide spatial filtering directly in the organisation of the data. Spatial distribution of geographical data will sometimes even be compulsory, since geographical information is often considered strategic by countries, states and companies (e.g. many nations want to control the use of their national geographical data sets). Such a political motivation will have to result in per-nation geographical databases, and hence restricts the freedom of choice for the first levels of filtering for “the global GIS” database. It is possible that such protectionism will occur also at lower levels, thereby demanding further political/economical partitioning.

* Open Systems Interconnection, the ISO model for communication between computer systems.

Figure 6-1 Spatial filtering of the global geographical database

For economical and political applications, a natural way of splitting the database would be hierarchically, according to administrative and economical units.
At the highest level, the global geographical database would be split into national databases; these could then again be split into district databases, these into municipal databases, and at the lowest level private property or lot databases. The upper part of this hierarchy is illustrated in Figure 6-1. Sites at each level of this filtering hierarchy should provide access to their own local data sets in addition to a global data dictionary or some other mechanism for directing data requests to other sites/agents. Such a partitioning of the data set seems to reflect the natural ownership of both economical and political data.

For environmental data, a non-political partitioning scheme could be very useful. Traditional cartographic tiling into map sheets will probably be useful also in the future, but for global tiling of large spatial databases new methods should be considered [Goodchild90a]. The Quaternary Triangular Mesh [Dutton89] is a promising approach. It is based on hierarchically dividing the globe into triangular tiles, starting with an octahedron with one vertex at each pole and four vertices on the equator (longitudes 0, 180, 90 West and 90 East). Each of these triangles is recursively subdivided into four new triangles by connecting the midpoints of its sides. Tiling based on natural borders (e.g. watersheds) is another possibility for environmental data.

The consequences of spatial filtering can be illustrated by an example: a local authority GIS will be utilising information that is related to geographical locations within the administrative unit. Some of the data will have to be extracted from external databases on demand (databases of a specific thematic content, or databases on a higher or a lower level of the filtering hierarchy), but most of the data will come from its own data sets.
Requests on the private information base can be served by accessing the limited local database with acceptable response times. Requests that go beyond the private data will have to be serviced by retrieving data from remote databases, and will introduce delays from the network and the remote databases before the data become available. Non-local data will generally be used as read-only information, so if the performance penalties are high for external access, copies of frequently accessed external data sets could be stored locally and updated as needed. By storing external information locally, access to the “alien” data sets will be sped up, while local access could be slowed down if it leads to an increased amount of data to consider in a search.

Thematic filtering

Geographical data can be divided into many classes or themes. Pipelines, railways, telephone networks, vegetation and land use are some examples of semi-autonomous spatial data sets. Some themes are of general utility. Topographic maps, for instance, are often used as a read-only frame of reference for other kinds of thematic information, and can therefore be considered the backbone of the geographical information base. Most thematic data sets will be of little interest to anyone but the owners of the data or a very limited user group. Under this assumption, distributing geographical data for storage on a thematic basis is advantageous both because of its filtering effect and because it is practical with respect to ownership issues. Thematic filtering makes many special-purpose GISs feasible. A telephone company does not have to consider detailed road networks or the vegetation and soil cover of an area to accomplish most of its tasks, and can therefore limit its local database to the information on the telephone network and the necessary background (topographic) information.
With this kind of filtering, the amount of information in the database is dramatically reduced, and it becomes more likely that the data can be accessed in a time-efficient way.

Lower-level filters and parallelisation

Distribution through the thematic and spatial filters discussed in the previous sections partly comes “for free” as a result of the ownership structure of geographical data. The resulting local databases can still be very large (e.g. the database of the national mapping authority), demanding distribution between disks (or nodes in a parallel DBMS) using further levels of filtering to partition the data into manageable units. There are at least two ways of obtaining further filtering.

One approach is to continue along the earlier threads, and organise the database so that “direct” access to coherent groups of data is possible through the storage structure. This can be done by organising the data using modified versions of existing associative storage structures (e.g. B-trees, ISAM or k-d trees for thematic filtering, and k-d trees, quad-trees, R-trees or grid files for spatial filtering). Appendix A contains an overview of data structures for spatial data. Each disk (or node in a parallel DBMS) could, for instance, store a sub-tree of the structure.

Another approach is to use other distribution techniques within a parallel DBMS. The objects of the database can be distributed between the available nodes in the parallel database machine according to the (hashed) value of some property/attribute. The hashing attribute could be the identity attribute or a spatial or thematic classification attribute, depending on the application. Hashing is extensively used in parallel relational database machines, where it can support all relational algebra operations (discussed in [Bratbergsengen84] [Bratbergsengen90]), and could probably also be adopted in other environments.
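A toy sketch of such hash-based distribution, applied to an equi-join over two small invented tables (a real system would ship each partition to a separate processor):

```python
# Sketch of hash partitioning for a parallel equi-join: both operand
# tables are partitioned on the hashed join attribute, so matching rows
# land in the same partition and each node can join its own partitions
# completely independently. Data and attribute names are invented.

def partition(rows, key, n_nodes):
    # Distribute rows over n_nodes partitions by the hashed key value.
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row[key]) % n_nodes].append(row)
    return parts

def local_join(left, right, key):
    # Plain in-memory equi-join of two partitions on the given key.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

buildings = [{"owner": "p1", "area": 6000}, {"owner": "p2", "area": 300}]
persons = [{"owner": "p1", "name": "Ola"}, {"owner": "p2", "name": "Kari"}]

joined = []
for bpart, ppart in zip(partition(buildings, "owner", 4),
                        partition(persons, "owner", 4)):
    joined.extend(local_join(bpart, ppart, "owner"))  # fully independent work
print(sorted(j["name"] for j in joined))              # both pairs are found
```

Because the same hash function is applied to the join attribute of both tables, no partition ever needs a row held by another partition; this is what makes the per-node joins independent.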
In join operations, this is done by distributing both of the operand tables among the processors on the basis of the join attribute (hash join). The join operation can then be performed autonomously by each of the active processors. The advantage of this “randomisation” approach is that it is easily scalable to increasing numbers of processors, while it ensures decent load balancing and does not require too much synchronisation. A problem with hashing approaches is that the resulting distribution could become very uneven if the hashing method is not able to handle the distribution of the hashing attribute(s).

6.8.3 Replication

Replication in a distributed database environment can take two different forms. Controlled replication aims at improving the system performance and the availability of data in a way that is transparent to the users (maintaining the image of a single database system). The other kind of replication could be called autonomous replication, where a local site keeps a snapshot of a remote database for convenience. Controlled replication is an important distributed database systems topic [Ceri88] [Bernstein87] [Bernstein93], but will not be elaborated on any further here.

As discussed in chapter 3, the data used in GISs will often come from many different sources, and the proprietors of data sets normally want to stay in control of the distribution of the data. So, if outsiders need the information, they must buy it from the owners, possibly under certain licence conditions. The use of remote data gives rise to an important question: shall the acquired data be stored on a permanent basis in the local database system or not? If the data are stored, they can be used to answer future queries where the demands for up-to-date data are relaxed (the data normally go gradually out of date), saving communication bandwidth and the query overhead involved with remote retrieval.
A new question that will then have to be answered is when, and how often, an updated version of the data should be retrieved from the source database. The updating of the local copies could be handled either by requesting the remote database to send updates when the changes have become “considerable”, or by retrieving the complete data set in a systematic fashion (for instance every hour, day, week or month, or on demand). This decision will have to depend on the extent to which the local applications need up-to-date data, and on the expected rate of change of the data. Using this kind of replication, all queries can be answered by accessing the local database only. The problem with such an approach is that it is difficult to decide in advance what data will be interesting to the users, and even if one does find out what will be useful, the database could grow so large that it exceeds the capacity of the local database system. The bandwidth between the local site and the providers of the data will also tend to be wasted on a lot of data that nobody will ever be interested in.

The least complicated solution is, of course, not to use replication at all. This requires all the databases to be accessible at all times. In such an environment, data are fetched from their sources when they are asked for (at query time). For such a “no-replication” approach to be efficient in a heterogeneous distributed database system environment, at least the data model and the query language will have to be standardised in some way. The currently dominant query language (“standard” SQL) does not have the expressive power to handle complex data types, such as geometrical data. A standard query language with powerful spatial capabilities would be very useful, because it could use its knowledge of the data to limit the amount of data transferred in each query.
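The refresh decision for autonomous replication discussed earlier in this section can be sketched as a simple staleness test; the class and parameter names are invented for the example:

```python
# Minimal sketch of an autonomously replicated data set: the local copy is
# served while its age is within the application's staleness tolerance,
# and re-fetched from the remote source database otherwise.

import time

class LocalCopy:
    def __init__(self, fetch, max_age_seconds):
        self._fetch = fetch                  # callable hitting the remote DB
        self._max_age = max_age_seconds      # application's tolerance
        self._data, self._fetched_at = None, None

    def get(self, now=None):
        now = time.time() if now is None else now
        stale = (self._fetched_at is None
                 or now - self._fetched_at > self._max_age)
        if stale:                            # retrieve fresh data on demand
            self._data, self._fetched_at = self._fetch(), now
        return self._data, stale

copy = LocalCopy(fetch=lambda: "topographic data set", max_age_seconds=3600)
print(copy.get(now=0))        # first access fetches from the remote source
print(copy.get(now=1800))     # young copy is served locally
print(copy.get(now=7200))     # stale copy triggers a re-fetch
```

The same shape accommodates the other policy from the text, fixed-interval retrieval, by simply setting the tolerance to the chosen interval; push-based updates from the source would need a notification channel instead.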
6.8.4 Heterogeneous database system integration

The goal of heterogeneous database system research is to integrate different database management systems in such a way that data from one database system can also be accessed from other, completely different, database systems [Hsiao92]. To achieve such an integration, standards are needed at many levels. The six lowest layers in the seven-layer OSI model of ISO provide standards for signal processing and various higher-level network protocols, while the seventh layer (the application layer) is meant to cover all application-specific standards, in this case database-specific standards. Standards for integrating database systems will have to cover dictionaries for distributed data, data model abstractions with query languages and access protocols to remote data, and preferably also protocols for distributed transaction processing. For GIS, the most interesting aspects of heterogeneous database system integration are data dictionary issues and data model abstractions (with access protocols and query languages). GISs depend upon a model for geometry representation that captures all the aspects of spatial data that are of interest in a GIS context. Search in remote databases would be significantly more efficient if mechanisms for “joins” were available across database platforms (to allow semi-joins [Ceri88]). This would, for instance, allow the retrieval of objects in a remote database whose geometry overlaps the geometry of a given set of local objects.
Examples:

• Retrieve all spruce forest stands that lie on government property in Nordland (send over the geometry of Nordland, do a spatial join with the geometry of the property data set, select the governmental properties, send the resulting geometry (of all governmental properties in Nordland); do a spatial join with the geometry of spruce stands in the remote vegetation database; send the resulting spruce stands back for further processing).

• Find all residential houses with children below the age of 15 that lie more than 4000 meters from a school in Ringsaker municipality.

Adapting standard database terminology, these examples could be classified as spatial semi-joins [Ceri88]. Thematic restrictions as used in traditional relational semi-joins would also be interesting for geographical data (e.g. owner = government in the first example). Data dictionary issues and query languages for spatial databases have been treated earlier in this chapter.

6.8.5 Fast geometrical processing

Parallel algorithms for geometrical processing of the data selected from the database will probably be the next step to take in enhancing the processing power of GISs. Not much research has been done in this area, but it should be possible to draw upon the general results from research in parallel processing [Quinn87]. Parallel geographical data processing should take advantage of both the spatial and the thematic properties of geographical data. The workload should be distributed among the available processors in such a way that the amount of work is about the same for all processors, while at the same time the inter-processor communication is minimised. A potentially large share of the time spent on parallel processing goes to the initial data distribution step and the final data collection step. In the distribution step, data are sent to the participating processors, while in the collection step the results are sent to their final destination (processor).
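The distribution step has to balance the workload over the processors. A minimal greedy sketch follows, assuming the data have already been split into spatial tiles, each carrying a precomputed workload estimate (tile names and the workload measure are hypothetical):

```python
from heapq import heappush, heappop

def distribute_tiles(tiles, n_procs):
    """Greedily assign spatial tiles to processors so that the total
    workload per processor is roughly equal (longest-processing-time
    heuristic).  tiles: list of (tile_id, workload)."""
    # heap of (current load, processor number, assigned tiles)
    heap = [(0, p, []) for p in range(n_procs)]
    for tile_id, work in sorted(tiles, key=lambda t: -t[1]):
        load, p, assigned = heappop(heap)   # least-loaded processor
        assigned.append(tile_id)
        heappush(heap, (load + work, p, assigned))
    return {p: assigned for _, p, assigned in heap}
```

Keeping neighbouring tiles on the same processor (not attempted here) would additionally reduce the inter-processor communication discussed above.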
One or both of these steps can be avoided if the data are present locally at the beginning of the processing and/or can be left locally after the last processing step has been done. Tailoring the parallel algorithms to fit the data distribution methodology will therefore be advantageous. Parallel applications will often consist of a number of processing steps. It is important to arrange the sequence of steps in such a way that communication is minimised. At the Edinburgh Parallel Computing Centre and Edinburgh University there has been research activity on the use of parallel technology for GIS [Healey89] [Dowers90]. An algorithm for polygon overlay has been developed for a parallel computer [Hopkins92] [Waugh92]. At the Hypercube Laboratory at NTH in Trondheim, some initial testing has been done on the parallelisation of spatial database operations. Parallelisation of a spatial join operation was investigated in a student project [Hagaseth90].

6.8.6 Data exchange formats

Without standard geographical data formats, the use of external geographical databases becomes far too complicated to be practical. Standardised formats for geographical data will have to be agreed upon some time in the not too distant future to facilitate easy exchange of data between the different suppliers and the customers. While de facto standards for digital encoding of sound (compact disks) exist and there are many formats available for image exchange, much work will still have to be done before standards for geographical data representation and geographical data exchange are internationally agreed upon and available.
A wish-list for GIS-users could consist of the following:

• A standard for the exchange of object geometry (including fields and 3D models of surfaces and volumes)
• A standard for the specification of spatial/topological relationships and constraints
• A standard for representing all relevant data quality aspects and temporal aspects of geographical data
• A standard way of exchanging object hierarchies and complex objects
• A standard way of exchanging complete geographical objects (all properties, including geometry, in the same framework)
• A nomenclature/thesaurus for the wide variety of geographical objects and themes
• A scheme for inter-database object identification (should preferably be solved by using the spatial reference of the object in some way)

As discussed in chapter 4, work has been done to provide standards for the exchange of geographical data in many parts of the world (e.g. in Germany [ATKIS89], Norway [FGIS90] and the USA [USGS90]), but an international standard has yet to emerge. European countries have started work on a new standard for the European arena. This effort was initiated by CERCO (European Committee of National Mapping Directors) and organised through CEN (TC 287), the European counterpart of ISO. This work was expected to be finished in 1993-1994. Similar work was started by ISO (ISO/TC211) in 1995. The question is whether the GIS technology is mature enough to make the specification of an acceptable standard possible at this point in time. The slow progress of CEN/TC287 suggests that this might not have been the case. The GIS standardisation work will probably be iterative, and the ambitions of standardisation projects should preferably reflect this assumption.

6.9 Some limitations of currently used database models

A database model (chapter 2) that is to be used for general purpose geographical databases must be able to accommodate geographical data according to the requirements outlined earlier.
In this section, network databases, relational databases and object-oriented databases are briefly discussed as hosts for geographical databases. Hierarchical models will be left out because they are similar to network database models, and provide only a subset of the mechanisms offered by these. An attempt is made to highlight the problems (and opportunities) of the database models with respect to geographical data handling.

6.9.1 Network database models

In network database models, relationships are explicitly represented, and this makes fast navigation and transitive closure operations feasible. A set-type in network database models represents a one-to-many relationship. Network database models are therefore optimised for storing one-to-many relationships (strictly hierarchical structures, such as the political unit hierarchy with nations consisting of counties, and counties consisting of municipalities). For many-to-many relationships, one must introduce extra record types as links, and the support for such relationships is therefore less efficient and a bit more awkward. A very useful feature of the network database model is the sequence. Sequences are supported as ordered sets, and could for instance be used for representing geometrical lines (as point sequences) in an efficient way. Network database management systems are normally optimised for a particular set of applications, and a standard interface for ad hoc querying is not available (the standard query language has to be embedded in some host language).

Geometry and topology

Network database systems can accommodate topological structures, using explicit links for the topological relationships.
Some examples of the representation of geometrical and topological records and sets:

Record types:
point(x,y{,z}, point-property*) - does also cover nodes
lineseg(lineseg-property*)
edge(edge-property*) - this is an explicit topology relationship
region(x,y,region-property*) - x,y gives an interior point for the region

Extra record types used for the topological many-to-many relationships and recursive relationships:
l-p - an N:M geometrical relationship connecting lines with their constituting points (it is N:M because the end-points could also be used by multiple lines; one could perhaps also use separate sets for nodes and interior points)
e-n - a 2:N topology relationship connecting edges with their end-points (nodes/vertices). This record type is redundant, since the end-points can also be found via the line and l-p record types
r-e - a 2:N topology relationship connecting regions with their bounding edges
r-h - a recursive 1:N topology relationship connecting regions with their holes/islands/embedded regions

Set types:
line-points (OWNER: line, MEMBER: l-p) - order is significant
point-lines (OWNER: point, MEMBER: l-p)
arc-lines (OWNER: arc, MEMBER: line) - order is useful
arc-nodes (OWNER: arc, MEMBER: a-n) - order could be useful
node-arcs (OWNER: point, MEMBER: a-n) - order could be useful
region-arcs (OWNER: region, MEMBER: r-a) - order is useful
arc-regions (OWNER: arc, MEMBER: r-a) - order could be useful
encloses (OWNER: region, MEMBER: r-r) - 1:N
enclosed-by (OWNER: region, MEMBER: r-r) - N:1

A Bachman diagram of this structure is shown in Figure 6-2.

[Figure 6-2: Bachman diagram for geometry/topology]

To be able to share geometry between different topological structures/themes, one will have to introduce topological set types (arc-lines, arc-nodes, arc-regions, region-arcs, encloses, enclosed-by) for each theme.
Images

Images are not supported directly as a data type in the network database model, but could be represented either as binary large objects (BLOBs), or as ordinary sets of sets of pixels or sets of sets of subimages (which in turn could be represented as BLOBs or sets of sets of pixels):

Image-rows (OWNER: image, ordered MEMBER: row)
row-pixel (OWNER: row, ordered MEMBER: pixel) or row-subimage (OWNER: row, ordered MEMBER: subimage)

Such a representation with a matrix of pixels or subimages will make navigation much slower down the columns than along the rows. This can be fixed by introducing a set class for column-pixel (and also one for image-columns). The network database model is a bit out of fashion, and has not been a research issue for many years. The navigational features of network database models can also be found in object-oriented database models.

6.9.2 The relational database model

The relational database model is well equipped to handle catalogue type information. What is interesting in the GIS context is the way in which geographical objects can be organised, in particular the geometry and topology. The relational model has been criticised for spreading the attributes associated with a feature across many different relations. This makes operations on single objects complicated, because several relational joins are required for reconstructing complete features/objects [Kemper87] [Keating87]. Another consequence of this is that efficient clustering of information on secondary storage is complicated (Newell et al. suggest a clustering method, using a spatial “key”, to make relational databases more efficient for geographical data [Newell91a]). Hierarchical data dictionaries based on spatial location and thematic content (as mentioned earlier in this chapter) could also provide some help for clustering data for relational databases.
Healey states that there is presently no alternative to the relational model for GIS databases, but that this might change with the development of object-oriented models [Healey91b]. The theory of the relational database model has been developed to a very high level of sophistication when it comes to query optimisation, transaction processing, concurrency control and recovery. One of the reasons for this is that the relational database model has been very popular in the database research community since the 1970s. Various hashing and indexing schemes have been developed to make set-based retrievals and single tuple retrievals efficient in relational database systems.

Geometry and topology

For an ordinary (non-extended) relational database, the following geometrical relations are necessary (provided that the geometry of lines and surfaces can be specified through point samples). For points and lines, there could be 2- and 3-dimensional variants (for many points and lines the third dimension is not relevant, and for such features it should be possible to skip the third dimension).

Point(Point-Id, X-Coord, Y-Coord[, Z-Coord])
Line(Line-Id, Seq#, X-Coord, Y-Coord[, Z-Coord]) or Line(Line-Id, Seq#, Point-Id)

The Seq# field should (in addition to the Line-Id field) be generated and maintained by the system in such a way that there will never be holes in a point sequence representing a line (Seq# 1,2,3,4,5,6 is OK, while Seq# 1,2,5,8,9 should not be allowed). The user will therefore not have write-access to this field, and the complete line will have to be locked when updating a part of a line (updates to existing lines will be very infrequent in a historical geographical database system). Such an arrangement will make the handling of incomplete lines (which could for instance result from a spatial partitioning of the data set) less complicated.
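The Point/Line decomposition and the ordered reconstruction of a line can be illustrated with an in-memory SQLite database. Table and column names follow the relations above (Seq# written as Seq); the concrete identifiers and coordinates are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Point (PointId INTEGER PRIMARY KEY, X REAL, Y REAL, Z REAL);
CREATE TABLE Line  (LineId INTEGER, Seq INTEGER, PointId INTEGER,
                    PRIMARY KEY (LineId, Seq));
""")
con.executemany("INSERT INTO Point VALUES (?,?,?,?)",
                [(1, 0.0, 0.0, None), (2, 1.0, 0.0, None), (3, 1.0, 1.0, None)])
con.executemany("INSERT INTO Line VALUES (?,?,?)",
                [(10, 1, 1), (10, 2, 2), (10, 3, 3)])

# Reconstructing the geometry of line 10 requires a join with Point,
# ordered by the system-maintained sequence number.
coords = con.execute("""
    SELECT p.X, p.Y FROM Line l JOIN Point p ON p.PointId = l.PointId
    WHERE l.LineId = ? ORDER BY l.Seq""", (10,)).fetchall()
```

The join per line retrieval is exactly the overhead the criticism cited above ([Kemper87] [Keating87]) is about.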
The downside is that more updating is needed when changing the geometry of lines (which, as mentioned earlier, will occur very seldom).

Topology

It is straightforward to specify topology within the relational paradigm by introducing relations that make use of the basic geometrical relations presented above. Some interesting topological relations are those used in network and manifold structures. Conceptually, these structures should be placed in a layer between the geometry objects and the geographical objects. This can be achieved by not storing geometry in the topology layer, using only references to the underlying geometrical objects. To be able to use a common border for two different theme topologies, one will have to have a complete set of topological relations for each topologically structured theme in the database. Network structures will have to use a node relation, an edge relation and a node-edge relation. The node relation should include a node identifier (primary key) and a point identifier (foreign key). The edge relation should include the edge-id and a set of references to the line segments constituting the edge. The node-edge relation should include a reference to the edge and references to one or two end-nodes, and possibly also a direction attribute. 2D manifolds will have to provide references between neighbouring polygons through the edges. This can be accomplished by including references to the bounding edges with all manifold regions. The manifold regions could in addition have a reference to an interior point at the geometrical level. 3D manifolds could be represented similarly, using surfaces as borders. It would be useful to be able to distinguish between geometrical surface patches and topological surfaces. Surface could be used for the topological surface, while patch could be used for the geometrical surface patches.
A surface is the combination of geometrical surface patches making up the common border of two topological volumes. An interior point attribute could be included also in the volume relation. An example of “adequately” normalised topological relations follows:

Node(Node-Id, Point-Id) - node-point is a 1:1 relationship within a theme. An alternative representation could be Node(Node-Id, Theme, Point-Id) for integrated topology.
Edge(Edge-Id, Seq#, Line-Id) - edge-line is a 1:M relationship; the sequence number makes it a 1:1 relationship (a short alternative: Edge(Edge-Id, Line-Id)). For integrated topology: Edge(Edge-Id, Seq#, Theme, Line-Id)
Surface(Surf-Id, Patch-Id) - surface-surface_patch is a 1:M relationship within a theme
VolumeBoundary(Vol-Id, Surf-Id) - volume-surface is a 2:N relationship within a theme
SurfaceBoundary(Surf-Id, Edge-Id) - surface-edge is an N:M relationship within a theme
RegionBoundary(Reg-Id, Edge-Id) - region-edge is a 2:N relationship within a theme. For integrated topology: RegionBoundary(Reg-Id, Edge-Id, Theme)
EdgeBoundary(Edge-Id, Node-Id), alternatively EdgeBoundary(Edge-Id, StartNode-Id, EndNode-Id) - edge-node is an N:2 relationship. For integrated topology: EdgeBoundary(Edge-Id, Node-Id, Theme)

A problem with the relational model with respect to topology is that it does not support transitive closure operations. Topological search in the relational model can be performed step by step by performing joins on the topological relations. In order to provide fast search through the topology of spatial data, the number and size of attributes in each topological relation should be kept as small as possible (to minimise the data volumes involved in the join operations). An alternative could be to add special purpose transitive closure operations to the relational operators*.

* RECURSIVE UNION is proposed in SQL3 to handle hierarchical relationships.
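The transitive closure extension mentioned in the footnote has since appeared in SQL dialects as recursive common table expressions. A sketch of transitively closing an encloses-style relationship between regions, using SQLite's WITH RECURSIVE (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Encloses (Container INTEGER, Enclosed INTEGER)")
# region 1 encloses region 2, region 2 encloses region 3,
# so transitively region 1 also encloses region 3
con.executemany("INSERT INTO Encloses VALUES (?,?)", [(1, 2), (2, 3)])

rows = con.execute("""
    WITH RECURSIVE Closure(Container, Enclosed) AS (
        SELECT Container, Enclosed FROM Encloses
        UNION
        SELECT c.Container, e.Enclosed
        FROM Closure c JOIN Encloses e ON e.Container = c.Enclosed)
    SELECT Container, Enclosed FROM Closure
    ORDER BY Container, Enclosed""").fetchall()
```

Without such an operator, each level of nesting would cost one explicit self-join, as described above.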
Rasters/Images

Rasters are usually represented using the bulk or BLOB data type normally present in relational database systems. This data type treats the raster as a large uninterpreted block of storage, and allows no individual treatment of the pixels or subregions of the raster. A relational representation of a raster should support queries on individual pixels and sub-rasters. If rasters are to be represented in the standard relational model, a somewhat wasteful representation would result, as indicated below:

Raster(Raster-Id, row#, col#, pixel_value)
ImageInfo(Raster-Id, #-of-rows, #-of-cols, image_name, …)

The support for image-based operations within a relational database framework has been discussed, for instance in “Database system for PSQL” [Roussopoulos88].

Terrain surface model representation

The role of the DBMS in DTM representation is primarily to store the sampled points in an efficient fashion, so that they are available for spatial searching. In addition, it is also possible to store the topology of the DTM explicitly, for instance as a TIN model. In the relational database model, the TIN model could be stored simply as a TIN relation:

TIN(Point-Id, Point-Id) or the more lengthy: TINPatch(Patch-Id, Point-Id, Point-Id, Point-Id)

A built-in mechanism for spatial searching in the relational database query language would be useful for efficient storage and retrieval of terrain surface data.

Quality

Completeness information and consistency information could be included in the system tables of relational database systems. Object-/tuple-level quality information could be represented as a separate quality relation for each relation that should have quality information attached, using the primary key of the ordinary relation as a foreign key in the quality relation. It is not straightforward to include attribute-level quality measures in the relational model.
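The per-pixel Raster relation sketched above, wasteful as it is, does support sub-raster (window) queries directly in the query language. A small SQLite illustration, with hypothetical table and column names and a synthetic 3x3 raster:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Raster
               (RasterId INTEGER, RowNo INTEGER, ColNo INTEGER,
                Pixel INTEGER)""")
# synthetic 3x3 raster: pixel value = 10*row + col
con.executemany("INSERT INTO Raster VALUES (1,?,?,?)",
                [(r, c, 10 * r + c) for r in range(3) for c in range(3)])

# sub-raster query: the window covering rows 1..2 and columns 0..1
window = con.execute("""
    SELECT RowNo, ColNo, Pixel FROM Raster
    WHERE RasterId = 1 AND RowNo BETWEEN 1 AND 2
      AND ColNo BETWEEN 0 AND 1
    ORDER BY RowNo, ColNo""").fetchall()
```

A BLOB representation would instead force the application to fetch and decode the whole raster to answer the same query.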
Concurrency control

The concurrency control and transaction management mechanisms of current relational database management systems are not flexible enough to handle many long transactions in an efficient way. There has been some research in this area, but the results have not yet propagated to the industry [Barghouti91].

New trends in relational database technology

Relational databases have in the last couple of years been extended in many directions. The introduction of abstract data types into the relational database model will make more efficient handling of complex objects possible, and will bridge a part of the gap in expressiveness between the relational database model and object-oriented database models. Relational systems supporting ADTs have been termed extendible relational database management systems (see also chapter 2). Oosterom has investigated this approach in a geographical context using Postgres [Stonebraker91] [Oosterom91] [Vijlbrief92]. There are many alternative ways of representing the spatial constructs of a geographical data model in the extended relational model. A standardised way of handling these constructs is essential for straightforward access to external data sets and to allow distributed processing of geographical data using the relational database model. Some standard spatial data types will have to be included to provide a basis for efficient representations of spatial data. A spatial location/point data type is the most important one, but complex geographical objects should also be considered in order to make standardised operations/operators for these objects possible. The 2- and 3-dimensional geometrical objects: volume, surface, region, line, point and field, must all be supported, preferably as built-in data types. The important complex structures: networks, manifolds and rasters, could perhaps also be supported.
Standardisation work is in progress on extendible relational databases, and in particular on SQL, to determine standard abstract data types for multimedia extensions [ISO/IEC94a]. Included in this work is also a part on spatial extensions relevant for GIS: the spatial part (part 3 [ISO/IEC94b]) of the ISO/IEC* standardisation work on SQL/MM (based on the not yet completed ISO standard proposal SQL3).

6.9.3 Object-oriented database models

Current database technology relies heavily on the relational data model. The relational model is excellent for storing “business” types of data, but has weak support for abstract data types and the mechanisms of object-oriented modelling, such as inheritance, encapsulation and behaviour. Different approaches to object management have been suggested. One approach is to extend the relational database model [Stonebraker90], and another is to start afresh with object-oriented database systems [Atkinson89]. Object-oriented database models are based on, and hence support, the advanced modelling features described in chapters 4 and 5. The realisation of the high level data model in an OODBMS is therefore trivial. The support for concepts such as generalisation/inheritance and aggregation directly in the database model makes it very flexible and powerful [Mohan88]. To arrive at a standardised object-oriented implementation of geographical databases, the interfaces of a set of intrinsic geographical classes, for instance based on the icons used in chapter 5, must be specified. When these basic classes are available, an object type in a data model diagram can be constructed using inheritance from the intrinsic classes according to the icons that are attached to the object type in the data model. The task of specifying flexible interfaces for intrinsic geographical classes will, however, be complicated. This is partly due to the wide variety of interpretations that are possible for many geographical phenomena.
Many of the structures of geographical data models fit well into the “object” framework [Egenhofer87]. This is particularly true for human infrastructure (properties, roads, water, electricity, buildings, railways, etc.). On the other hand, many geographical phenomena are difficult to classify into object structures because they are continuous in nature, with unclear boundaries and many possible interpretations [Goodchild90b] [Aangeenbrug91].

* ISO/IEC JTC1/SC21 Information Retrieval, Transfer and Management for OSI, WG3 Database

Object-oriented data models use explicit connections between data objects (object handles). For a GIS, such direct linkages will not be too numerous, due to the limited number of direct connections a GIS object will have to other GIS objects (most spatial relationships are implicit in the spatial location of the objects). The most important linkages in geographical data are the topological linkages. These linkages should be reflected in the database by object handles and clustering of connected components to increase the speed of topological operations (searching, transitive closure). The availability of semantic knowledge, inherited from the high-level data model, should enable object-oriented database systems to make data management more efficient, for instance through intelligent buffering and clustering. A problem with object-oriented databases is the lack of standards. The products that have emerged so far have been largely experimental, and many of the earlier systems include only enough mechanisms to make C++ persistent. Concurrency control, query languages and view mechanisms in today's OODBMSs have not been developed to the same level of sophistication as in the RDBMSs [Kotz-Dittrich95]. Since object-oriented databases are to correspond directly to object-oriented high level data models, data types and operators will not be elaborated on.
6.10 Conclusions

The field of database management systems is in a state of rapid change. Until the early 1990s, the typical database system customer was primarily interested in business type data, that is, numbers, character strings and dates. These kinds of data fit the traditional relational model perfectly. Safe storage of multimedia data is becoming more and more critical for business in many organisations. This has resulted in a growing demand among DBMS customers for a richer selection of data types. Spatial data types, supporting geographical data, have been recognised as a very important multimedia component, as can be observed in the work on the SQL3/MM standard. The lack of support for spatial data types in relational DBMSs has always been a great frustration for GIS vendors and users. This lack of support has led to the widespread use of the “geo-relational” approach (only non-spatial attributes (not geometry) are stored in an RDBMS, e.g. Intergraph’s MGE and ESRI’s ArcInfo) and also to the use of BLOBs for storing geometry in the relational model (System9 and ESRI’s SDE - Spatial Database Engine). Such solutions impose problems, particularly for transaction management/concurrency control. Object-oriented DBMSs have always been considered well suited for applications with a demand for a richer and more flexible set of data types than relational DBMSs provide, but the business community has shown a reluctance to take OODBMSs into use. This is probably partly due to their legacy systems (hierarchical, network or relational DBMSs containing business data, with many mission critical applications that work in this environment), and partly due to the immaturity of OODBMS products. To keep their customers, the relational DBMS suppliers have had to react. Most RDBMS vendors are therefore working on multimedia support, and some already have prototypes on the market (e.g.
Oracle’s multi-dimensional and Informix’s Illustra data blades). Consequently, there has been significant support in the RDBMS industry for the SQL3 work. Geographical data have many important characteristics that will have to be taken into account when designing a supporting DBMS. First of all, a core set of spatial data types will have to be fully supported. Support is needed for 0D (point), 1D (line), 2D (region) and 3D (volume) geometrical objects in 2D and 3D space, with a representation that supports (continuous) variation over the interior of the objects. In addition to the spatial data types, there is a need for (spatio-)temporal support, long spatial transactions and advanced metadata support. The spatial data types will have to be supported as basic data types in order to be able to perform efficient queries. Many GIS environments also have large storage space requirements that will have to be addressed, for instance by implementing integrated tertiary storage (HSM). More research is needed on spatial data types, spatial operators, spatial joins, spatial query optimisation, spatially aware concurrency control (spatial locking) and transaction management, spatial constraints and spatial data distribution in a networked environment. The characteristics of spatial data and their particular requirements pose some new problems, but also provide some new opportunities to database researchers, as discussed throughout this chapter. The database research community has started to take an interest in the management of geographical data, and contributions on geographical/spatial data modelling and management are often requested for database research conferences.

Appendix A Data structures for spatial databases

This appendix provides a short overview of data structures suitable for the special needs of spatial information, including geographical data. The first two parts are an introduction to basic (non-spatial) data structures.
In the rest of the appendix, some different approaches to the storage of spatial or multi-dimensional data are presented. Data sets can vary in structure, content and other characteristics. A very important characteristic of a data set, in the context of data structures, is whether it is dynamic or static. Static data structures are much easier to build than dynamic data structures, and particularly for hierarchical methods and hashing methods, static data sets are much easier to handle. The focus here will be on data structures for dynamic data sets.

A.1 Basic data structures

As an introduction, the main categories of data structures for storing simple and more complex data items are reviewed. Most of these structures have been developed for one-dimensional data sets, but they provide the basic techniques for developing new data structures. In cases where the time complexity of search is discussed, n stands for the number of items stored in the structure.

A.1.1 Digital computer storage media

A data structure has to be mapped to a storage medium. An overview of the most common storage media for digital computers will therefore be presented to show the context in which data structures operate. The one-dimensional address space has been the paradigm underlying digital computers ever since the von Neumann machine. Our storage media follow this paradigm, and provide access to the data accordingly. The media are listed with the slowest at the top (first), and the fastest at the bottom.

• Magnetic (and optical) tape is a one-dimensional storage medium (a tape consists of many tracks, so it is not purely one-dimensional). In order to access a random data item, the tape has to be read through from the start until the data item is found.

• Optical disks have about the same characteristics as magnetic disks, but they also provide continuous one-dimensional reading by organising the data in a spiral (just like a music record).
• Magnetic disks have a two- (or three-) dimensional address space on the physical disk. Along one dimension you go through the sectors of the tracks, and along the other dimension you go through the tracks/cylinders. A third dimension can be introduced going through the different heads on a disk with multiple surfaces. Reading and writing of data on a magnetic disk can, however, only be done sequentially along one dimension (along the sectors of a track). So what you get is an ordered group of one-dimensional sequences to which you have random access, and not a real two-dimensional storage medium. Most disk interfaces (for instance the SCSI interface) provide a one-dimensional address space of sectors for random access to the user.

• Transistor-based memory is accessed on a per-word basis through a linear one-dimensional address space. Since transistor memory is accessed randomly, the memory can be arranged according to any dimensionality without affecting the efficiency of access. A memory space of 2^N addresses can for instance be turned into a three-dimensional memory pool of 2^A x 2^B x 2^C locations (A+B+C = N), using A bits for addressing the first dimension, B bits for the second and C bits for the third.

A.1.2 Sequences (lists/arrays)

There are a variety of ways in which to organise data on a sequential medium to facilitate fast associative access to data. The most basic ones are presented here.

Unsorted sequences

Unsorted sequences are very easy to maintain (because you do not need to do any maintenance). Insertions can be done by using a free position in the sequence (such a free position would be the result of a deletion), or by adding the new item to the end of the sequence. To search an unsorted sequence for an item, an average of n/2 items must be investigated to find the item of interest. To find a group of items based on a search criterion, all the items in the file will have to be inspected.
Sorted sequences

Sorted sequences are data that are organised in sequence according to some criterion. The problem with sorted sequences on linear computer storage shows up when a new item is to be inserted into (or deleted from) the sequence. Since there is no room for the new item, some of the existing items will have to be moved to give room for it (half of the existing items on average). Sorted sequences are well suited for static data. We have two important types of sorting criteria: attribute value and frequency of usage.

• Attribute-sorted sequences are sorted according to the value of an attribute or some value computed from the attributes of the data item. If the storage device provides direct access to individual data items, an item can be found after having searched through an average of log n items by using the binary search method (successively halving the search space). An attribute-sorted sequence can be augmented by a permanent index in one or many levels (some sort of tree), also resulting in a search complexity of log n. The advantage of this approach over the binary search method is that the index structure can be clustered (saving disk accesses), and in addition the highest levels of the structure can be kept in primary memory, saving many secondary memory accesses (magnetic and optical disks are many orders of magnitude slower than transistor-based memory).

• Frequency-sorted sequences are sorted by the frequency of access. Frequency-sorted sequences will show very good average performance for single data item selection in cases where the probability of access varies greatly between the data items in the structure. The average number of items to search before finding the desired item will be $\sum_{\text{all items } i} P(i) \cdot \mathrm{pos}(i)$, where $P(i)$ is the probability of accessing item $i$ and $\mathrm{pos}(i)$ is its sequence number.

A.1.3 Randomised sequences

For randomised sequences, one defines an address space into which a hash function shall map the data items.
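The binary search on an attribute-sorted sequence described above can be sketched as follows; this is an illustrative fragment, not code from the thesis:

```python
def binary_search(seq, key):
    """Locate key in a sorted sequence by successively halving the
    search space; about log2(n) items are inspected."""
    lo, hi = 0, len(seq)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[mid] < key:
            lo = mid + 1        # key must be in the upper half
        elif seq[mid] > key:
            hi = mid            # key must be in the lower half
        else:
            return mid          # position of the item
    return None                 # not found
```

Each iteration halves the remaining search space, which is where the log n behaviour comes from.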
This address space is traditionally static, but dynamic address spaces have also been suggested (e.g. expandable open hashing [Knott71], dynamic hashing [Larson78], extendible hashing [Fagin79] and linear hashing [Litwin80]). Each address maps into a block of storage (called a bucket) capable of storing a certain number of data items. An item is placed in the storage area according to the value obtained by applying the hash function to a predefined attribute or set of attributes of the data item (often a key). When a bucket at a certain address is filled up with data items, and a new data item maps to the same location, bucket overflow occurs. The new data item cannot be stored in the correct bucket, and will have to be stored elsewhere. There are many solutions to the overflow problem, but no matter what approach is chosen, overflow results in at least one extra block access for storage and retrieval of overflow items. When designing a randomised data structure, an important aspect is therefore the choice of storage utilisation (bucket size / address space), or how much “extra” space one should allocate to avoid excessive overflow.

Research has provided a variety of hash functions and overflow strategies. The choice of hash function will have to be made on the basis of the characteristics of the data set. It is important that the hash function spreads the data as evenly as possible over the address space, so different hash functions should, if possible, be analysed and tested on the real data before deciding which one to use, and with what kind of parameters. Randomised sequences are particularly good at providing very fast access to single data items based on the hash attributes. One or, in the case of overflow, a few accesses are all that is needed to locate and retrieve an item if its hash attributes are known. Randomised sequences give very poor performance for interval search based on attribute values.
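As an illustration of buckets and overflow, here is a minimal sketch of a static hashed file. The class name, the modulo hash and the single shared overflow area are arbitrary choices for this sketch; real systems use more refined hash functions and overflow strategies:

```python
class HashedFile:
    """Minimal static hashed file: a fixed address space of buckets,
    each holding up to bucket_size items; overflow items go to a
    shared overflow area (one of several possible strategies)."""

    def __init__(self, n_buckets=8, bucket_size=2):
        self.bucket_size = bucket_size
        self.buckets = [[] for _ in range(n_buckets)]
        self.overflow = []               # searching here costs extra accesses

    def _hash(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, item):
        b = self.buckets[self._hash(key)]
        if len(b) < self.bucket_size:
            b.append((key, item))
        else:                            # bucket overflow
            self.overflow.append((key, item))

    def lookup(self, key):
        for k, v in self.buckets[self._hash(key)]:   # one bucket access
            if k == key:
                return v
        for k, v in self.overflow:                   # extra accesses
            if k == key:
                return v
        return None
```

Note how a lookup normally touches a single bucket, but falls back to scanning the overflow area, mirroring the extra block access discussed above.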
This concludes the most basic data structures, and the focus will now be put on more “advanced” storage organisation methods. They all map down to the one-dimensional address space provided by today's computers. Many of the structures partition memory into blocks which can be addressed individually. These blocks are often, for efficiency reasons, chosen to be compatible with or have the same size as the file system block size (typically 1024, 4096, 8192, … bytes).

A.2 Hierarchical structures

The basis of hierarchical structures is a recursive splitting of problems into smaller sub-problems. Such a decomposition can be visualised as a tree, and the data structures that build on this paradigm are called tree structures or hierarchical structures. If a splitting into two parts is performed at each level, a binary tree results (Figure A-1). Splitting into more than two parts at each level is also possible (higher branching factors), and reduces the height of the tree.

There are two alternative ways of organising the data items within a tree structure. The first method is to store the data items embedded in the structure (data in each node) and at all levels of the tree. The second method is to store data items, or references to the data items, only at the lowest level of the tree (the so-called leaf nodes), using the intermediate nodes only for indexing. The second method gives a clear separation of the data from the indexing method, and means that a tree structure can be built on top of another data structure or a flat sequential file; it is therefore often preferred in database management systems.

A node in a tree structure contains pointers to its child nodes and a description of the data that are contained in these lower-level nodes. Following this scheme, the user can find a data item stored in the tree by navigating from the root of the tree through the internal nodes to the data item, using the descriptions in the internal nodes to find the way.
Algorithms for tree operations are intrinsically recursive. A balanced tree structure guarantees that the number of levels in the tree is log_x n, where x is the branching factor of the tree. More or less balanced trees are desirable for short average access times. Therefore, techniques for balancing tree structures, and for keeping dynamic tree structures balanced after deletions and insertions, have received much attention.

Figure A-1 The binary tree (log2 n levels)

Many tree structures have been proposed for storage organisation. One of the earliest and most general is the B-tree family for one-dimensional data [Bayer72, Aho83]. Other popular hierarchical structures are: ISAM (for one-dimensional data), which uses a high, hardware-dependent branching factor; the tries (for one-dimensional data), which use a high, data-dependent branching factor; the versatile quad-trees, used for storing points, lines, regions and volumes (oct-trees) [Finkel74, Samet84, Samet89]; R-trees for storing lines and regions [Guttman84, Sellis87]; k-d trees [Bentley75]; and many more [Samet89].

Tree structures are used to provide efficient direct access to data items. They provide a time complexity for searching of O(log n), where n is the number of data items stored in the structure. This is a dramatic improvement over the O(n) performance of sequential access methods. Fast sequential access to the data in a tree is also possible if the tree has been built over a sorted sequential file. Sorted sequential files are problematic for dynamic data sets, and will give high time penalties for insertions. Trees introduce data overhead by adding a secondary structure to the data. When inserting data into or deleting data from the data set organised by the tree, the tree will also have to be updated, introducing overhead for processing these operations.

A.3 Multi-dimensional trees

Trees can be generalised to organise multi-dimensional data.
The quad-tree ([Finkel74], [Samet84], [Samet89]) and the k-d tree [Bentley75] were among the first attempts at adapting tree structures to multi-dimensional data. Both structures were developed to address the problem of data retrieval based on composite keys in an integrated way, as opposed to the method of secondary indexes (inverted files).

Figure A-2 A quad-tree partitioning of 2D space

A.3.1 Points

Points in space are 0-dimensional objects with an address composed of all the dimensions of the space within which they are contained. The quad-tree [Finkel74] is a multi-way tree, where each node has two children per dimension (four children in 2D). An example of a traditional (2D) point quad-tree is shown in Figure A-2. A quad-tree structure for 3D space is called an oct-tree.

The k-d tree [Bentley75] is a binary tree that is able to store truly multi-dimensional data. The level of the tree determines the split dimension: at the first (highest) level the data are split along the first dimension, at the second level along the second dimension, and this continues in a round-robin fashion. For a k-dimensional tree, the dimension to use for splitting at a given level can be determined as:

split dimension = (level − 1) mod k + 1

An example of a k-d tree is shown in Figure A-3 (using the same points as in the quad-tree example). Balancing multi-dimensional branching trees, and keeping them balanced for dynamic data sets, is much more complicated than balancing one-dimensional trees. Because the k-d tree uses binary branching, it is one of the easiest to turn into a balanced structure, as one can build on the methods used for traditional one-dimensional trees. The k-d-B-tree [Robinson81] and the hB-tree [Lomet90] combine the k-d tree with B-tree properties to provide a balanced tree structure suitable for dynamic data sets.

A.3.2 Lines

Lines are 1-dimensional objects.
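The round-robin splitting rule can be sketched as a minimal (unbalanced) k-d tree; the names and the 0-based form of the split formula are illustrative choices, not from the thesis:

```python
class KdNode:
    def __init__(self, point):
        self.point = point
        self.left = self.right = None

def kd_insert(node, point, level=1, k=2):
    """Insert a point; the split dimension cycles with the level:
    d = (level - 1) mod k (0-based version of the formula above)."""
    if node is None:
        return KdNode(point)
    d = (level - 1) % k
    if point[d] < node.point[d]:
        node.left = kd_insert(node.left, point, level + 1, k)
    else:
        node.right = kd_insert(node.right, point, level + 1, k)
    return node

def kd_search(node, point, level=1, k=2):
    """Follow the same splitting rule used at insertion."""
    if node is None:
        return False
    if node.point == point:
        return True
    d = (level - 1) % k
    child = node.left if point[d] < node.point[d] else node.right
    return kd_search(child, point, level + 1, k)
```

Keeping such a tree balanced under insertions and deletions is the hard part, as the text notes; this sketch simply inserts in arrival order.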
They have two end-points, and between these end-points the line can have a complex shape in space. It is the shape of lines that makes them special when compared to points. Lines can be stored as point sequences, where the individual points can be stored using for instance a point quad-tree or a k-d tree. This is not a very efficient solution, so special-purpose data structures for lines have been developed.

Figure A-3 A k-d tree partitioning of 2D space

The strip tree [Ballard81] was one of the first data structures suggested for representing lines. It is a binary tree structure, and a kind of multi-resolution structure where the line is represented at each node by a rectangle/strip: a directed straight line with an indication of the width to the left and the width to the right of the straight line, represented as a six-tuple (xb, yb, xe, ye, wr, wl). The strip tree expects lines to be represented as a sequence of points. The original procedure for constructing a strip tree from a line consisting of n points with distinct end-points (mechanisms for handling closed curves are also suggested), for a resolution w* ≥ 0, is as follows [Ballard81]:

Find the smallest rectangle with a side parallel to the line L through x0 and xn which just covers all the points. This rectangle is the strip of the root node of the strip tree. Now pick a point xk which touches one of the two sides of the rectangle that are parallel to L. Repeat the process for each of the two sublists [x0, …, xk] and [xk, …, xn]. This results in two subtrees that are sons of the root node. The process terminates when all strips have width w ≤ w*.

w* is a user-definable parameter that selects the accuracy (maximum deviation from the original line) of the resulting line representation (by choosing w* = 0, the n−1 original line segments will be at the leaf nodes of the tree). An example of a strip tree representation of a line is shown in Figure A-4.
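The quoted procedure can be sketched as follows. The width computation via signed perpendicular distances from the chord, and the node layout, are assumptions of this illustration; closed curves are not handled:

```python
import math

def strip_tree(pts, w_star):
    """Recursive strip-tree construction sketch: the strip is the
    smallest rectangle parallel to the chord x0-xn covering all
    points; if it is wider than w_star, split at a point touching
    one of the rectangle's sides and recurse on the two sublists."""
    (x0, y0), (xn, yn) = pts[0], pts[-1]
    dx, dy = xn - x0, yn - y0
    length = math.hypot(dx, dy) or 1.0
    # signed perpendicular distance of each point from the chord
    dist = [((px - x0) * dy - (py - y0) * dx) / length for px, py in pts]
    wl, wr = max(max(dist), 0.0), max(-min(dist), 0.0)
    node = {"strip": (pts[0], pts[-1], wl, wr)}
    if wl + wr > w_star and len(pts) > 2:
        # the point of maximum |distance| touches a side of the strip
        k = max(range(1, len(pts) - 1), key=lambda i: abs(dist[i]))
        node["sons"] = [strip_tree(pts[:k + 1], w_star),
                        strip_tree(pts[k:], w_star)]
    return node
```

With w_star = 0, the recursion bottoms out at the original line segments, as stated in the text.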
The first-level strip is shown in gray, the two second-level strips are shown with outlines only. The original line is shown with a thicker line. There are also other proposals for structures for indexing line data. Oosterom adapts the Binary Space Partition (BSP) tree to represent line segments [Oosterom89]. Samet discusses methods for representing lines using the quad-tree [Samet89].

A.3.3 Regions in 2D

Regions in 2D are 2-dimensional objects. They can be represented by their bounding lines, but such a representation makes it difficult to work on the region as a whole (the interior of the region). A basis for indexing regions is often their bounding box. This gives good representation economy and is convenient for partitioning and search. There are also structures that do not rely on the bounding box.

Figure A-4 A strip tree representation of a line (first level gray, second level outlines only)

The R-tree [Guttman84] family is the representation that has become the most popular [Roussopoulos85]. While k-d trees and quad-trees are particularly well suited for point storage, the R-tree is made for the storage of regions, and is based on their bounding boxes. R-trees are an extension of B-trees to multi-dimensional regions, and are therefore able to cope with dynamic data sets. At each internal node in an R-tree, a list of references to child nodes is stored (there can be between m and M children, as for B-trees). With each child reference, the minimum bounding box of the objects of the child is stored. An example 2D R-tree partitioning is shown in Figure A-5. The R*-tree [Beckmann90] is an attempt at optimising the area, margin and overlap of R-tree nodes. A performance evaluation of the R*-tree has been done by Mackert and Lohman [Mackert86]. The original R-tree applies overlapping branches. The R+-tree [Sellis87] is an extension of k-d-B-trees to cover non-zero area objects.
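A window query over such a structure descends only into children whose stored bounding boxes intersect the query window. The dictionary-based node layout below is purely illustrative (real R-trees store between m and M entries per disk page):

```python
def intersects(a, b):
    """Do two bounding boxes (xmin, ymin, xmax, ymax) overlap?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rtree_search(node, query):
    """Collect all objects whose bounding box intersects the query.
    Leaves hold (mbr, object) pairs; internal nodes hold
    (mbr, child-node) pairs (hypothetical layout)."""
    if node["leaf"]:
        return [obj for mbr, obj in node["entries"] if intersects(mbr, query)]
    hits = []
    for mbr, child in node["entries"]:
        if intersects(mbr, query):          # prune non-intersecting subtrees
            hits.extend(rtree_search(child, query))
    return hits
```

Because sibling bounding boxes may overlap in the original R-tree, more than one subtree may have to be visited for a single query, which is the cost the R+-tree's non-overlapping division tries to avoid.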
Non-overlapping rectangle division gives more efficient search [Faloutsos87], but the fractioning of the objects that follows increases the number of references in the data structure (duplication each time an object is split) and introduces a kind of redundancy in the structure. These disadvantages can outweigh the advantages for many data sets (it depends much on the structure of the data). The more spread out the objects of a data set are, the more efficient the R+-tree is compared with the traditional R-tree. A comparison of the performance of the R+-tree and the R-tree has been done by Greene [Greene89]. Different ways of using region quad-trees (a raster-like structure) for indexing bounding rectangles are discussed by Samet [Samet89].

Figure A-5 An R-tree

The bounding boxes of spatial objects can be transformed into points in a higher-dimensional space. These representation points can then be indexed using data structures for multi-dimensional points [Six88] [Faloutsos89] [Pagel93]. Günther proposes the application of half-spaces for storing and indexing multi-dimensional region objects [Günther87] [Günther89].

A.4 Grid partitioning and spatial hashing

Spatial data structures that work on a regular partitioning of the area of interest are popular. The grid file [Nievergelt84], the region quad-tree [Samet89] and linearisation methods [Orenstein84] [Jagadish90] are among the most popular. Hashing methods have also been applied in the spatial data context. An early proposal was EXCELL [Tamminen82].

A.4.1 Multi-resolution image trees (pyramids)

Pyramid structures hold the full-resolution image at the bottom level of the structure, while the intermediate and topmost levels are derived from the lower levels by computing each pixel in a higher-level image from the pixel values of a matrix of pixels in the image below. See Figure A-6.
Pyramids are useful for image browsing, since they allow incremental resolution improvements.

Figure A-6 A multi-resolution representation of an image (pyramid)

A.4.2 Region quad-trees

A region quad-tree [Samet84] [Samet89] divides the region of interest into homogeneous regions of varying sizes by using a tree structure with branching factor 4. At the top level of the tree, the region is divided into 4 equal-sized rectangular or square regions, and this splitting scheme is applied recursively until the leaf regions are homogeneous or a maximum depth has been reached. Region quad-trees can also be used for binary image compression.

A.4.3 Linearisation

Multi-dimensional raster structures can be linearised by establishing a unique method of counting the pixels of the raster (finding a space-filling curve). There are many ways of counting multi-dimensional structures, the most common being methods that count linearly through the dimensions (e.g. first counting the elements of the first row, then the elements of the second row, and so on). The problem with this simple approach is that it does not preserve much of the spatial “structure” (neighbouring pixels might be very far apart in the resulting sequence, and spatial relationships are difficult to establish from the sequence). This problem has been attacked by many researchers, and various methods have been proposed. Morton order [Morton66], Hilbert's space-filling curves [Jagadish90] and Z-ordering [Orenstein84] all use bit interleaving of the involved dimensions to preserve as much spatial clustering as possible. The linearisation of a 2D raster according to the Z-ordering method is shown in Figure A-7. Linearisation of quad-tree structures is also possible [Samet89].

A.4.4 EXCELL

The extendible cell (EXCELL) method comprises a structure of variably sized data bucket rectangles on top of a regular grid, indexed by hash functions [Tamminen82].
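The bit interleaving underlying Morton order and Z-ordering can be sketched as follows; which coordinate contributes the even bits and which the odd bits is a matter of convention:

```python
def z_order(x: int, y: int, bits: int = 8) -> int:
    """Bit-interleave x and y (Morton/Z-ordering): bit i of x becomes
    bit 2i+1 of the code, and bit i of y becomes bit 2i."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i + 1)
        code |= ((y >> i) & 1) << (2 * i)
    return code
```

With this convention, the four cells of any aligned 2×2 block receive consecutive codes, which is exactly the clustering property the linearisation methods aim for.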
To be able to accommodate dynamic data sets, it applies extendible hashing [Fagin79] as its indexing function. The grid (hash) function is in principle just a bit interleaving of the binary representations of the x and y values. This means that the basic structure is a regular 2D grid. Splitting is done in a round-robin fashion: first in the x dimension, then in the y dimension, then in the x dimension, and so on. When a bucket that corresponds to the size of a grid cell has to be split, the whole structure is expanded by splitting the grid in the next dimension. Many cells in the grid can refer to the same data bucket, using the extendible hashing method to map grid cells to data buckets.

Figure A-7 The Z-ordering [Orenstein84] linearisation path

A.4.5 Grid file

The grid file [Nievergelt84] partitions the area of interest into a grid (not necessarily regular). Each dimension is partitioned into a number of intervals (the intervals do not have to be of the same size). A directory is maintained for each of the dimensions. Dynamic data sets are supported, so during operation the partitioning of a dimension can be changed by adding a new interval (expansion) or merging two intervals (contraction). These splittings and mergings can only be applied to one dimension at a time. Merging of intervals is not expected to be required very often. Each grid cell maps to a data bucket in the grid file system. The central structure in the grid file is the grid directory. It is responsible for mapping grid cells to the real data buckets. The assignment of grid cells to data buckets is governed by a rule stating that a data bucket can only correspond to a convex (box-shaped) grid region. The grid directory consists of the k-dimensional grid array of bucket pointers, and one 1-D array for each of the k dimensions (linear scales that hold the partitioning information).
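The roles of the linear scales and the grid directory can be sketched as follows (a static snapshot only; splitting and merging of intervals is left out, and all names are illustrative). Note that the example directory assigns buckets to box-shaped cell regions, as the convexity rule requires:

```python
import bisect

class GridFile2D:
    """Sketch of the grid-file lookup path: one linear scale per
    dimension maps a coordinate to an interval number, and the grid
    directory maps the resulting cell to a data bucket (several
    cells may share one bucket)."""

    def __init__(self, x_bounds, y_bounds, directory):
        self.x_bounds = x_bounds      # ascending interval boundaries
        self.y_bounds = y_bounds
        self.directory = directory    # directory[i][j] -> bucket id

    def bucket_for(self, x, y):
        i = bisect.bisect_right(self.x_bounds, x)   # linear scale, dim x
        j = bisect.bisect_right(self.y_bounds, y)   # linear scale, dim y
        return self.directory[i][j]                 # grid directory lookup
```

A point lookup thus costs one directory access plus one bucket access, regardless of how the intervals have been split.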
To manage the data bucket to grid region mapping, a twin tree can be maintained. Each time a bucket is split, the twin tree is updated, with two children (twins, normally called buddies in 1D) under the bucket's node. Merging will have to proceed from the leaves and up in the twin tree. Different splitting and merging policies are possible.

Appendix B Representation of 3D structures

This appendix is a very short review of traditional methods for organising 3-dimensional (3D) structures [Mortenson85] [Encarnação83], such as volumes, surfaces, lines and points, using computer storage devices. It is included as a background for the discussion on spatial data models and data structures. This section contains an overview of 3D objects and 3D modelling, followed by a presentation of some different ways of representing 3D structures.

B.1 3D objects

Four different object types are possible in 3D space: the point, the line, the surface and the volume. The storage characteristics of these object types are the following:

• Points, or 0-dimensional spatial objects, are trivial to represent using 3D coordinates (e.g. Euclidean (x, y, z) or polar coordinates). Compression techniques can be used to code sets of points in a more efficient way (differential representations). In addition to the geometry, an indication of the accuracy of measurement is necessary to represent a measured 3D point.

• Lines, or 1-dimensional spatial objects, introduce more complexity. Theoretically, a line is made up of an infinite sequence of points. The representation in computer storage will have to be a simplification of this infinite sequence. Two different methods have been used for this simplification. The first is the approximation of the whole line by a parametric function.
The second is sampling from the infinite point sequence, and then representing the line segments between the sample points by some kind of function (for instance straight lines or higher-order splines). Both regular (fixed-interval) and adaptive/optimal sampling (avoiding, for instance, the oversampling of straight lines) are possible. The methods will be constrained by user-defined limits on the maximum deviation allowed from the original line. Generally, the more accurate the representation is required to be, the more storage is needed for it (higher sampling frequency or more complex functions). The accuracy of the resulting line is determined by the accuracy of the constituent points, the sampling frequency and the fidelity of the interpolation method.

• Surfaces can be represented in much the same way as lines. Sampling and approximation by functions are the two main methods of representation. Sampling can be done in a regular pattern (grid) or adaptively (considering the variability or auto-correlation of the surface). Functional approximations can be done globally (on the object as a whole) or on subregions or patches (regular or adaptive). The accuracy of a surface is determined, as for lines, by the accuracy of the sampling points, the sampling frequency and the fidelity of the interpolation method. It is important for all representations to maintain the topology of the sampled points, particularly for complex surfaces. Fractal geometry has also been investigated in the context of terrain surface modelling [Xia91].

• Volumes, or 3-dimensional spatial objects, can be completely represented by their bounding surfaces and holes, plus an indication of what constitutes the inside of the volume. An alternative way of representing volumes is to use simple volumes as basic elements, and transform and combine these to make up the complete volume object.
This kind of representation is generally more useful for CAD-type objects than for geographical objects, because of the much more regular nature of CAD objects. The accuracy of a volume representation is completely determined by the accuracy of its bounding surfaces, or of its basic elements and transformations.

B.2 Storage organisation

Efficient analysis and presentation of 3D structures is difficult, and the computer storage representation determines to what extent such operations are feasible at all in an interactive environment. There is a choice between two paradigms for organising 3D structures using main memory and background storage. The homogeneous solution uses the same representation on the background storage device (disk) as in main memory. This solution allows paging and therefore does not constrain the size of the data structure to the limits of main memory. The split solution uses two different storage structures for main memory and for secondary storage. Conversion is then necessary when moving data between secondary storage and main memory, and this complicates paging. Such a solution will introduce the problems of limited main memory in addition to more complicated updating of the structure. The advantage of this solution is that it is possible to use a more general and flexible structure on secondary storage, allowing easier integration with other applications. The choice of paradigm will depend on the application context.

B.3 Point sampling

Using points to represent 3D structures is the sampling approach. To represent a line or a surface in space, a set of sample points that lie on the structure will have to be selected. Using these topologically structured sample points, a model of the complete original structure can be constructed using interpolation techniques. There are many kinds of interpolation techniques.
The simplest is the linear approach, using straight line segments for the interpolation of curves, and flat triangles for interpolating on surfaces. If we have no knowledge of the auto-correlation of a structure, the linear approach could be as good as any other interpolation method. If we know that the auto-correlation is large compared to the sampling frequency, more sophisticated methods of interpolation should be investigated. B-splines, Bezier curves and other kinds of parametric functions will then be candidates. Kriging with trend surface and probability estimates is a particularly good candidate when the auto-correlation is fairly well known and accuracy estimates are required.

Using 3D sampling, the points in the sample set will all lie on the measured structure. This means that an interpolation method that keeps the sample points on the surface of the modelled structure is preferable (exact interpolation at the sampling points). Most geographical sampling is performed at the surface of the earth. In this case, there is nearly always a functional mapping from a position (lat, long) to a sample point, and neighbouring samples can be found on the basis of their latitude and longitude. For general 3D surfaces this is not so. Even for simple convex volumes, there will be two sample points for each position in any 2D projection of the volume (except for points on the border of the projection): one point for each side of the volume (e.g. top and bottom). For general 3D structures, the samples must therefore be structured in such a way that the topology is maintained. Point sampling is the underlying approach for many of the other representation methods.

B.4 Wire frame

The wire frame model is an old method for presenting volumes and rod structures in CAD applications. The method uses “wires” at all edges and lines in a construction (for instance rods). This results in the familiar skeleton appearance of the wire frame model.
A wire frame model can also be regarded as a kind of unshaded perspective drawing. Wire frames have been used for the visualisation of terrain surfaces, both on computer screens and as drawings. The method is as follows. First, a regular grid sampling of the elevation is done over the terrain area, giving a grid of elevation values. Secondly, a point of view and a viewing direction are determined for the perspective drawing (the point is often chosen above and some distance away from the area of interest). Finally, the wire frame model is drawn using this perspective, with wires in the x and y directions through the grid elevation points, (linearly) interpolating the elevation between the grid points. The result is a web or mesh covering the landscape, giving a 3D appearance. The wire frame model is similar to a point sample model with linear interpolation between the samples. It is a nice presentation model, but it lacks expressiveness as a storage model for curved and complex surfaces.

B.5 Triangulated Irregular Network

The triangulated irregular network (TIN) is a method for geographical surface representation and modelling based on irregular point samples [Peucker78]. The surface points are stored together with a triangulation. This method is much used in geographical information systems, for instance in ARC/INFO. The basic data structure in a TIN model is the node with an attached list of ordered neighbour nodes. The neighbour nodes can be ordered by starting north of the node and proceeding clockwise. The world outside the modelled area is represented by a dummy node. It is possible to extend the basic model in various ways, for instance by introducing explicit references to the triangles (convenient for attaching attributes to the triangles). Another often-used extension is the representation of surface-specific points (peaks and pits) and lines (ridges and channels).
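A minimal sketch of such a TIN node structure, with a clockwise-ordered neighbour list and a dummy "outside" node; the triangle-enumeration helper is an illustrative addition, not part of the original model:

```python
# Hypothetical TIN layout following the description above: each node
# stores its (x, y, elevation) position and its neighbour list,
# ordered clockwise starting north; "outside" is the dummy node.
OUTSIDE = "outside"

tin = {
    "A": {"pos": (0.0, 1.0, 12.5), "neighbours": ["B", "C", OUTSIDE]},
    "B": {"pos": (1.0, 0.0, 11.0), "neighbours": ["C", "A", OUTSIDE]},
    "C": {"pos": (-1.0, 0.0, 10.0), "neighbours": ["A", "B", OUTSIDE]},
}

def triangles(tin):
    """Enumerate triangles by walking consecutive neighbour pairs of
    each node (the explicit neighbour order makes this a local,
    pointer-chasing operation)."""
    seen = set()
    for node, rec in tin.items():
        nb = rec["neighbours"]
        for a, b in zip(nb, nb[1:] + nb[:1]):
            if OUTSIDE in (a, b):
                continue
            tri = frozenset((node, a, b))
            # accept only mutually neighbouring triples
            if len(tri) == 3 and node in tin[a]["neighbours"] \
                    and node in tin[b]["neighbours"] \
                    and a in tin[b]["neighbours"]:
                seen.add(tri)
    return seen
```

The explicit neighbour lists are what give the TIN its efficient local search, as noted in the text.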
The advantage of the TIN is that neighbour information is stored explicitly and compactly in the data structure, resulting in efficient methods for local search. The planar Voronoi diagram and its dual, the Delaunay triangulation, are well-known methods for establishing TINs from irregularly spaced points [Aurenhammer91].

B.6 Parametric representations

Parametric representations have been widely used for the geometrical modelling of curved lines and surfaces [Mortenson85]. These methods are point-based: a set of topologically ordered points comprises the backbone of such structures, and a set of parametric functions describes the lines/surfaces using the points. For representing a line, a sequence of knot points must be found, while for a surface, a grid of points has to be determined. There are many different kinds of parametric representations; some are used to describe complete curves and surfaces with a single function (global methods), while others divide the structures into pieces and determine these separately (piecewise or local methods). Global methods use high-order polynomials in the defining functions, and therefore give high-order continuity, but are complicated to handle (modifications have global implications) and to compute. Local methods use only lower-order polynomials, resulting in lower-order continuity, but the individual pieces of the structure are easy to compute and handle (modifications have only local implications). Many kinds of parametric functions have been proposed and used for reducing the number of points necessary to represent a geometrical structure at a certain level of accuracy. For representing a curved line, one has to determine (sample) points (or knot points) for the line, and for each line segment or point, a set of parameters that faithfully describes the line segment must be provided for the function.
Spline curves

A spline curve (a minimum-energy or elastic curve) is a curve that passes through, or interpolates, all its control points. A sequence of PC (parametric cubic) curves can be used as an exact geometrical model of a traditional spline. The derivation of the set of PC curves requires that a set of simultaneous equations be solved (using local coordinate systems for simplicity) to make the curves fit together at the points, and then these curves will have to be transformed into the global coordinate system. The resulting set of PC curves can then be reparameterised so that the parameter runs from u = 0 to u = 1 over the length of the line.

Bezier curves and surfaces

The Bezier curve is an approximation method for obtaining a curve from a given set of points. A Bezier curve is a polynomial representation, where the degree of the polynomial is one less than the number of given points. Bezier curves can be joined together at their end points. Bezier curves do not in general interpolate their defining points. The method is used to limit the degree of the polynomials and to obtain a higher degree of local control than global methods offer. A Bezier curve over the points/vertices $p_i$ of the characteristic line or polygon is defined as

$p(u) = \sum_{i=0}^{n} p_i B_{i,n}(u), \quad u \in [0, 1]$,

where the blending function is

$B_{i,n}(u) = C(n,i)\, u^i (1-u)^{n-i}$,

using the binomial coefficient

$C(n,i) = \frac{n!}{i!\,(n-i)!}$.

Bezier curves start at p0 and end at pn. The rest of the points in the characteristic line/polygon only govern the shape of the curve, and the Bezier curve does not have to pass through any of them. The nth derivative at the start point is given by the first n+1 points. Hence the tangent (first derivative) is given by p0 and p1, the curvature (second derivative) is given by p0, p1 and p2, and so on. The continuity can therefore be controlled at the joints between Bezier curves.
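The Bezier formulae above can be evaluated directly; a small sketch for 2D points (the function name and point format are illustrative choices):

```python
from math import comb

def bezier(points, u):
    """Evaluate p(u) = sum_i p_i * B_{i,n}(u) with the Bernstein
    blending functions B_{i,n}(u) = C(n,i) u^i (1-u)^(n-i)."""
    n = len(points) - 1          # degree: one less than the point count
    x = y = 0.0
    for i, (px, py) in enumerate(points):
        b = comb(n, i) * u**i * (1 - u)**(n - i)
        x += px * b
        y += py * b
    return x, y
```

Evaluating at u = 0 and u = 1 returns the first and last control points, illustrating that the curve interpolates only its end points.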
Bezier surfaces are a generalisation of Bezier curves to two dimensions.

B-spline curves and surfaces

B-spline curves provide a higher degree of local control than Bezier curves, and the degree of continuity throughout the curve is specified by a separate parameter (k):

p(u) = ∑ (i=0..n) pi Ni,k(u), u ∈ [0, n−k+2],

where the blending function is defined recursively as

Ni,1(u) = 1 if ti ≤ u < ti+1, and 0 otherwise

Ni,k(u) = (u − ti) Ni,k−1(u) / (ti+k−1 − ti) + (ti+k − u) Ni+1,k−1(u) / (ti+k − ti+1)

k controls the degree (k−1) of the resulting polynomial in u. The knot values ti relate the parametric variable u to the control points pi:

ti = 0 if i < k
ti = i − k + 1 if k ≤ i ≤ n
ti = n − k + 2 if i > n

with 0 ≤ i ≤ n+k and 0 ≤ u ≤ n−k+2. For k=1, the B-spline function gives only the set of control points, not a curve. For k=2, it gives a set of straight line segments connecting the control points as its resulting curve (two control points influence each curve segment). For k=3, it gives a sequence of polynomials in u, having continuous first derivatives at the connections (three control points influence each curve segment). The resulting curve generally does not pass through the control points. B-splines have many possible applications, and have been suggested for compression of 3D models, for instance seabed terrain models [Dæhlen90].

B.7 Constructive Solid Geometry

Constructive Solid Geometry (CSG) is a volume representation first used in CAD/CAM [Encarnação83] [Mortenson85]. In CSG, a volume is “constructed” by combining a basic set of building blocks (spheres, boxes, cylinders, rotational surfaces) using combinatorial operators (union, intersection, minus, …) in space (see Figure B-1 for an example). CSG is well suited for many mechanical parts that are constructed by man and thereafter machined with, for instance, numerically controlled (NC) tools.
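The blending functions above can be evaluated with the Cox-de Boor recursion. A minimal Python sketch using the uniform knot vector defined in the text (names are my own; terms of the recursion with a zero denominator are taken as zero, as is conventional, and the half-open intervals mean the curve should be evaluated just below the end of the parameter range):

```python
def bspline_point(points, k, u):
    """Evaluate an order-k B-spline curve (degree k-1) defined by the
    control points p0..pn at parameter u, 0 <= u <= n - k + 2."""
    n = len(points) - 1

    def t(i):
        # Knot values as defined in the text.
        if i < k:
            return 0
        if i <= n:
            return i - k + 1
        return n - k + 2

    def N(i, kk):
        # Cox-de Boor recursion; zero-denominator terms are dropped.
        if kk == 1:
            return 1.0 if t(i) <= u < t(i + 1) else 0.0
        a = b = 0.0
        if t(i + kk - 1) != t(i):
            a = (u - t(i)) * N(i, kk - 1) / (t(i + kk - 1) - t(i))
        if t(i + kk) != t(i + 1):
            b = (t(i + kk) - u) * N(i + 1, kk - 1) / (t(i + kk) - t(i + 1))
        return a + b

    weights = [N(i, k) for i in range(n + 1)]
    x = sum(p[0] * w for p, w in zip(points, weights))
    y = sum(p[1] * w for p, w in zip(points, weights))
    return (x, y)

# For k=2 the result is the polyline through the control points:
# bspline_point([(0, 0), (2, 0), (2, 2)], 2, 0.5) lies halfway between p0 and p1
```

This makes the k=2 behaviour described above easy to check: only two control points have non-zero weight at any u.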
For the class of objects that fits into this framework, a very compact representation can be obtained (an object-type reference with scaling parameters and possibly some other parameters is enough to describe an individual part, while orientation parameters and location coordinates are needed for its integration with other parts).

Figure B-1 Constructive Solid Geometry (CSG) primitive elements

For 3D structures in geographical nature, the utility of CSG has not been proven. The problem with nature is that the structures are highly irregular. As a complement to surface modelling methods for the representation of man-made features (e.g. buildings) within GISs, CSG might have potential. It is difficult to say whether this integration will be feasible or not.

Appendix C The NHS Electronic Navigational Chart Database

A couple of years ago, the Norwegian Hydrographic Service (NHS) initiated the specification of an Electronic Navigational Chart Database (ENCDB), that is, a database that will act as a server for electronic navigational charts all over the world. I investigated the database issues for this kind of server as a case study.

C.1 Introduction

Electronic Navigational Charts (ENCs) are supposed to become an integrated part of bridge information systems for seagoing vessels. The integration of these charts with the global positioning system (GPS), active sensors (for instance radar) and other information sources (various databases) will provide better means for safe sea-navigation (through for instance collision avoidance). The advent of ENCs will also, hopefully, provide better and more flexible user interfaces to the information that today is carried by paper charts and other paper-based information sources (e.g. lists of lights). Ways of storing and distributing ENCs and their updates will become important issues in such a setting.
After presenting a little background material on ENCs, I will concentrate on the database and data modelling aspects of a server database that is to deliver updated ENC information to Electronic Chart Display and Information Systems (ECDIS) on board the ships.

C.2 Background

The use of electronic navigational charts, and the distribution of chart updates via the INMARSAT C system, was first tested in practice in the North Sea Project [NORTH SEA89]. The results were encouraging, and accelerated the work on ECDISs. The International Hydrographic Organisation (IHO) has set up committees on the standardisation of formats for the exchange of ENCs and their updates. The NHS has been participating in this work, and was hoping to be the host of a model ENCDB. The involved people at the NHS have provided me with useful information on these subjects.

C.3 Navigational Charts

A navigational chart is a legal document. To simplify a bit, a ship must, as a rule, carry updated navigational charts for its insurance to be valid. The oceans of the world are covered by overlapping sheets of navigational charts of varying scales. It has been estimated that a ship sailing globally will have to carry about 2000 paper charts. The national hydrographic offices have the responsibility for keeping these charts updated, and charts can be purchased from the responsible hydrographic offices. The charts have a date of validity, and to enable the seagoing vessels to keep their charts updated at all times, the hydrographic offices publish periodic updates to their navigational charts (countries with a large number of charts publish updates at intervals of only a few days, while countries with a smaller number of charts or fewer resources publish their updates once a month). It is the responsibility of the crew to manually update the charts accordingly.
This updating is very time-consuming, and hence expensive for a vessel that utilises many charts.

C.3.1 ENC and ECDIS

The IHO has proposed that “ECDIS should be the equivalent to the paper chart” (point 1.3 in [IHOSP5288]). All the information from the paper chart should be available, and the same legal restrictions apply. A cell structure is suggested [NORTH SEA89] where the coarsest cells cover an area of 8°x8° (A-cells, for free ocean navigation, scales < 1:250000), and the finest cells cover 15’x15’ (I-cells, for harbour navigation, scale range 1:12500 – 1:40000). In between there are the 4°x4° B-cells, the 1°x1° C-cells and the 30’x30’ D-cells. A further refinement (four sub-cells) of the D- and I-cells into EFGH-cells and JKLM-cells should be provided for areas with high data density. Cells are identified by the “scale” letter and two numbers: first a 3-digit number, the number of 15’ increments from the south pole northward, then a 4-digit number, the number of 15’ increments eastward from Greenwich. A cell identification will look like this: I5750016 (I-cell 53°45’N, 4°00’E).

C.3.2 The ENCDB

The IHO working group on ECDIS has proposed guidelines for the logical structure of the ENCDB. This database is “The master data base for production and maintenance of the ENC, compiled from the national ENCD” [Grant90]. The main purpose of the database is to make ECDIS data (ENCs) available to the customers. Data from the national hydrographic offices are to be translated and incorporated into the ENCDB by the ENCDB’s host. It has been suggested ([IHOSP5288], point 7.1) that the information in an ENCDB should be divided into an approved part (resembling the old paper chart), a modifications part (resembling the 14-day update publications to the paper chart) and an administrative part. The administrative part will consist of useful information not normally found on traditional paper charts.
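The cell identification scheme of C.3.1 is straightforward to compute. A sketch in Python (the function name, and the assumption that the given coordinates are the cell’s south-west corner on whole 15’ increments, are mine):

```python
def cell_id(scale_letter, lat_deg, lon_deg):
    """Build a chart cell identifier from the cell's south-west corner.

    lat_deg: degrees north (negative south of the equator),
    lon_deg: degrees east of Greenwich. Both are converted to the
    number of 15' (0.25 degree) increments described in C.3.1.
    """
    north = round((lat_deg + 90) / 0.25)  # 15' steps from the south pole
    east = round(lon_deg / 0.25) % 1440   # 15' steps eastward from Greenwich
    return "%s%03d%04d" % (scale_letter, north, east)

# The example from the text: cell_id("I", 53.75, 4.0) gives "I5750016"
```

The `% 1440` wraps longitudes given west of Greenwich around to eastward increments; how negative longitudes are actually encoded is an assumption here.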
The administrative part of the database could expand into various new areas, and should be logically connected to the rest of the data as an ordinary information database. Updates to the on-ship databases could be, and are planned to be, broadcast by world-covering satellite systems, as tested in the North Sea Project. The INMARSAT system has so far been tested, and is considered useful for the purpose. In the near future, the data will be distributed on diskettes. The reason for this is that equipment for, and usage of, satellite systems is too expensive at the moment. Whether or not broadcasting is chosen as a distribution strategy for updates in the long term, the format of the exchange will have to be internationally (IHO) agreed upon. For satellite broadcasting it will be important that the format is compact [Sandvik90]. The problem of securing successful delivery of the updates to all recipients (thousands of ships) by broadcasting in a noisy environment is in itself a topic. The broadcasting strategy is by far the easiest from the point of view of an ENC server (if messages get through to all subscribers). More secure, non-broadcasting strategies would put high demands on the transaction capacity of the ENCDB. The host of an ENCDB must provide data conforming to the IHO exchange standards for ECDIS. The IHO CEDD (Committee on the Exchange of Digital Data) is specifying such a format, known as DX90. The status of nautical charts as legal documents implies that the security and integrity of the data must be given high priority when storing and distributing them in digital form. The quality control throughout production, and before a chart can be integrated into the database, will also have to be very strict.

C.3.3 Data management

The NHS wants the ECDIS server to utilise a database management system (DBMS) for the ENCDB.
The advantages of using a DBMS for the ENCDB are the traditional ones: a standard query language interface, integrity constraints, a data dictionary, concurrency control, recovery and various kinds of DBMS utilities. It could be argued that a DBMS is overkill for this (basically) file-server application, and will only impose unnecessary overhead and slow the system down. The chart data are very modular (cells) and structured (approved data and modification data), and each cell could constitute a single file conforming to the CEDD exchange standards of the IHO (presently DX90). A file system would therefore cover the needs of a simple chart-server system (a system for distributing the electronic chart as a pure substitute for the paper chart), but for a complete ECDIS server other issues may arise.

• The ENCs are supposed to be integrated with other kinds of information (the so-called administrative part) in ECDIS. If this shall be possible, one needs a way of integrating the chart data with the rest of the information base. The contents of the administrative part of ECDIS are expected to evolve over time, and a DBMS provides mechanisms for the integration of new data types into the system with only limited or no effects on existing applications. In addition, the inclusion of the administrative part of ECDIS will lead to increased updating activity on the ENCDB. Hence, more sophisticated transaction management is required, as provided by a DBMS.

• For an interactive system, it is important to be able to obtain more information on a feature (e.g. lights, beacons or places) on the screen by pointing to it (for instance touching the screen). In the file system approach, examining all objects in the complete ENC, updates included, is one possible strategy, but this can be time-consuming for large data sets. Another alternative is to use advanced data structures to organise the data in order to limit the search space, and hence bring response times down.
Many kinds of search structures are supported (in some way) by a DBMS.

• The chart data to be stored in the ENCDB consist of geometric features such as points, lines and regions. It is important for the applications that these geometric features are stored in a consistent manner, ensuring that the topology [Peucker75] of the data is explicitly or implicitly present. This requires the use of topological constraints on the data, and a DBMS could provide such mechanisms, whereas a pure file approach could not (the applications would have to take care of everything).

• Provision of new (on-line) services to customers and communication/integration with other information systems and databases will be simplified by the standard interface a DBMS provides.

• A DBMS solution can provide sheetlessness, while the one-chart-per-file approach will make linkage of map sheets and smooth transitions over sheet boundaries non-trivial. The tiling of geographical data into map sheets has been discussed in [Chrisman90], and one of his conclusions is (p. 161): “.. tiles are more likely to survive in single-purpose, centralist circumstances. Multipurpose use will create pressures for sheetlessness”. In our context, single-purpose could correspond to the use of an ENC in isolation, and multipurpose could correspond to the complete, integrated ECDIS.

With respect to response times for the presentation of a complete ENC, the DBMS will generally be slower than a pure file-system approach, but it is realistic to assume that affordable computer technology will be able to master these types of database retrievals without unreasonable delays (“instant” retrieval) in some years’ time (that is, when ECDIS becomes operational). For zooming (that is, viewing only a part of a cell) a DBMS approach would be able to exploit the structuring of the data, and should give better performance.
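The idea of using a data structure to limit the search space can be sketched as a regular partitioning of space: point features are binned into grid cells, so a query for the features near a pointed-at screen position examines only the overlapping cells instead of the complete ENC. A minimal illustration (all names are my own, and a real system would index lines and regions as well):

```python
from collections import defaultdict

class GridIndex:
    """Regular partitioning of space: point features binned by grid cell."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _key(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, feature, x, y):
        self.cells[self._key(x, y)].append((feature, x, y))

    def window_query(self, xmin, ymin, xmax, ymax):
        """Examine only the grid cells that overlap the query window."""
        (i0, j0), (i1, j1) = self._key(xmin, ymin), self._key(xmax, ymax)
        hits = []
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                for feature, x, y in self.cells.get((i, j), ()):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(feature)
        return hits
```

For a small query window the cost depends on the local data density, not on the size of the database, which is exactly the property the interactive point-and-query scenario needs.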
A DBMS approach will probably be the best solution in the long run. A DBMS-based system should be able to provide a seamless integration of all the ECDIS data. For such a seamless database, ownership issues together with extraction and distribution procedures will have to be considered carefully. The ENC (CEDD) exchange format will have to support different levels of user sophistication (some may want all information available, while others may only want what is required by a particular application), and at the same time provide hooks for ECDIS relations (that is, some kind of global identification of the individual chart objects).

C.3.4 Relating the traditional chart data to other data

The organisation of the relations between the map elements and the “administrative” part of the database constitutes the main challenge of the ECDIS. It must be possible both to start with the ENC and find additional information pertaining to a specified object (area, line, point), and vice versa.

C.4 Structures for the ECDIS database

In this discussion of data structures I take for granted the partitioning of ECDIS data into approved, modification and administration data. An ECDIS is comprised of different components with different legal restrictions. The backbone of the ECDIS will be the information compulsory for all seagoing vessels (over some minimum size). In addition to this there are possibilities for enhancing the on-board information system for safer and more efficient navigation. As the amount of data grows, the demands on the data structures will grow accordingly. A data structure for the ENC database must be efficient enough to support rapid display of any ENC on the computer screen on the bridge; a response time of at most 2 seconds for the display of a screen-full of a “representative” chart could be considered acceptable.
A complete ENC cell can consist of up to 30 megabytes of data (reported by the NHS for some cells at the Norwegian coast, including height contours), and with a maximum rate of 2 MBytes per second for a standard SCSI disk interface, the mere retrieval of a file of this size would take about 15 seconds for a single PC with a single SCSI disk. With the proposed SCSI-2 standard interface [ANSI86], a rate of 8 MBytes per second will be the maximum achievable, and the retrieval should take about 4 seconds. At present, however, computer displays are not of the same size as a paper chart sheet (about 1m x 1m), so only a portion of the chart can be displayed at the right scale. This will reduce the amount of data to be displayed by at least a factor of 4, and the response times could be acceptable even for the most crowded cells. The “normal” cells of 2 MBytes of data should not introduce an I/O bottleneck for today’s powerful PCs. Times of 15 to 30 seconds have been reported for the retrieval and presentation of 2 MByte cells on a standard Intel-386-based PC. Zooming in on the data (without changing cell class) will result in improved response times, while zooming out (without changing to another class of cells) will introduce large amounts of data from the neighbouring cells for display. Without (cartographic) generalisation, zooming out will lead to unmanageable amounts of data, both for retrieval and display. Because of the large amounts of data (especially line information) in the ENCs, compression has been considered necessary. A special study [Dæhlen90] of the use of splines in line compression, with experimentation on data from the Seatrans Project, has resulted in a compression rate of about 20 (that is, a 95% reduction compared to the original line data) with no visual effects on the lines of the original (not zoomed) maps.
The use of splines results in savings in data storage but a little more time spent on processing during data retrieval (if you do not have special graphics hardware for splines). Most of the lines in an ENC are approximations (depth contours, height contours, land-water boundaries) derived from various kinds of measurements. This means that the small losses in accuracy that the spline technique imposes will have very little influence on the utility of the data. Other kinds of data may be more vulnerable to spline compression, for instance man-made features such as docks. To be able to take advantage of the situation when only a portion of a map is displayed, the database has to be structured accordingly. If the database is not structured to take advantage of the spatial position of the data (for instance by “dividing” the database into spatial regions according to the chart cells), the retrieval of data for a single chart cell would require a search through the complete database to extract the relevant objects. This is truly unacceptable. The file organisation method should take advantage of the structure of the electronic navigational chart database. Since the data are organisationally divided into cells, a grid file [Nievergelt84] or Excell [Tamminen82] type of data organisation (regular partitioning of space) at the top level should be efficient for chart retrieval. Other kinds of spatial structures [Samet89], such as quad-trees [Samet84], R-trees [Guttman84] and R+-trees [Sellis87], should also be considered, especially for lower levels of the data structure. A top-level quad-tree/R-tree could also be used, and it could then, for instance, be arranged in such a way that the single chart cells would show up as sub-trees. The multi-scale aspect of the ENCs can pose problems for the storage of the data. A single harbour will be covered by charts of many different scales (1:10000, 1:20000, ...
1:500000). When generating a map of 8° x 8°, one would have to filter out many of the features pertinent to the large-scale harbour map. This could be done by marking all the information in the database in accordance with the range of scales it is to be used for. For instance, if a certain buoy should be included in charts with a scale larger than 1:50000, it will have this “scale property” as an attribute. In the DX90 format, every object has the MAXSCALE and MINSCALE attributes, which determine the range of scales for which the object is valid. Some objects (e.g. an island) will be of interest over the whole range of scales. The problem for these objects is to reduce the amount of detail as the scales get smaller. Efficient line-generalisation methods, both on the data structure side and on the presentation side, will have to be developed (multi-resolution structures). The multi-scale aspect also applies to the symbols used to represent a certain feature. An object could be represented quite differently in a small-scale ocean-navigation map compared to a large-scale harbour map. Should symbols for presentation of the object at the different chart scales be stored in the database, or should this be up to the presentation part of the ECDIS? Another approach to the multi-scale problem is to partition the database according to scale, and hence duplicate some information. This would give rise to update problems, some of which might be remedied by using triggered updates. The modification part of the database has to be managed with great care in the ENCDB. One solution is to store the updates separately in a temporal sequence of modifications. Each modification will apply to a single cell, and the storage structure for the modification part should be similar to, or integrated with, the storage structure of the approved part.
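The MINSCALE/MAXSCALE marking described above amounts to a simple filter at display time. A Python sketch, where the attributes are stored as scale denominators so that a larger denominator means a smaller scale (the attribute names come from the text; the dictionary encoding and the denominator convention are my assumptions):

```python
def visible_at(objects, scale_denominator):
    """Select the objects whose scale range covers the display scale.

    MINSCALE/MAXSCALE are taken to be scale denominators, so an object
    valid from 1:12500 to 1:50000 has MINSCALE=12500, MAXSCALE=50000.
    """
    return [o for o in objects
            if o["MINSCALE"] <= scale_denominator <= o["MAXSCALE"]]

buoy = {"name": "buoy", "MINSCALE": 12500, "MAXSCALE": 50000}
island = {"name": "island", "MINSCALE": 12500, "MAXSCALE": 10000000}
# At 1:250000 only the island survives the filter; at 1:25000 both do.
```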
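One way to serve updates from such a temporal sequence of modifications is a time-based filter: mark every object with a date of creation and, when it is superseded, a date of destruction, and return everything that changed after the customer’s date of latest update. A sketch (all field names are my assumptions):

```python
from datetime import date

def updates_since(cell_objects, last_update):
    """Return the objects of one cell created or destroyed after last_update."""
    changed = []
    for o in cell_objects:
        if o["created"] > last_update:
            changed.append(o)
        elif o.get("destroyed") is not None and o["destroyed"] > last_update:
            changed.append(o)
    return changed

cell = [
    {"id": 1, "created": date(1990, 1, 10), "destroyed": None},
    {"id": 2, "created": date(1990, 3, 5), "destroyed": None},               # new buoy
    {"id": 3, "created": date(1989, 6, 1), "destroyed": date(1990, 2, 20)},  # removed light
]
# updates_since(cell, date(1990, 1, 31)) returns objects 2 and 3
```

The reply is exactly the modification set the customer is missing, extracted without keeping a separate per-customer log on the server.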
By integrating the modification data into the approved part, and marking every object in the ENCDB with its date of creation (and destruction) as in a temporal database, the modification problem could be alleviated. An update request would then be answered by using a simple time-based filter. Each update request must be accompanied by a “date of latest update” for each of the requested chart cells. The server then extracts the modification data by using the time filter, and sends the customer all the updates that have arrived since the specified date. The administrative part of the database will probably be the most difficult part to handle. This is partly because there are no restrictions on the organisation of the data. In addition, there is no limit to the amount of data that can be stored. For the time being, a list of lights is the only thing known to become included in this part. In the future, however, images of distinctive features in areas of demanding navigation, information on harbours, political issues, maps of land features and other interesting information could become part of this add-on information. The administrative part of ECDIS could also be provided by sources other than the hydrographic offices. It is therefore very important to have a clean interface (data dictionary) to this part of the database. If the ENCDB is to provide all the administrative data, and if an on-line transaction system (communicating for instance via satellites) is provided, a powerful database transaction system will be needed to handle the traffic. Updates to the database could only be provided by authorised users, e.g. the national hydrographic offices, national surveillance institutions and other authorised information providers.

Figure C-1 An ER-model of some of the information contained in a navigational chart

C.5 Data modelling for ECDIS

The data contained in ECDIS have already been mentioned.
There is the ENC data, consisting of geographical/spatial information in the following forms: 3D points (depths/soundings), surface points (buoys, markers, lights), islands/shorelines (polygons/lines), dryfall (tidal variations), depth contours, light sectors, fairways, …, and then there is the administrative data. To be able to accommodate the data to the structures of a database management system, the ENC information must be structured into a data model. Important elements of a data model for ECDIS are position, scale and time. Topology [Peucker75] will also be useful for some of the ENC data. The object approach to modelling should be suitable for ECDIS. The most common data model for modelling “reality” is the Entity-Relationship (ER) model [Chen76]. The ER approach to spatial map data modelling has been tried out, for instance, in [Calkins87]. An ER-model of some aspects of the nautical chart information could be as shown in Figure C-1 (I am, however, not an expert on the information needs of sea-navigation, so the model should just be taken as an example).

Figure C-2 An icon-based ER-model of some of the information in a navigational chart

Inheritance is supported through a generalisation hierarchy (a cable is a line, a buoy is a point, a settlement is a point, a dryfall is an area, an island is an area). Chart generalisation could profit from scale-dependent relationships (e.g. a settlement could inherit from the area entity at large scales and from the point entity at smaller scales). The selection of entities represented is quite arbitrary, and is only meant to give a general idea of the complexity of the problem. Attributes, such as max./min. scale and from/until time, are not included in the model.
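The generalisation hierarchy described above maps directly onto inheritance in an object-oriented language. A minimal Python sketch (class and attribute names are my own):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class Line:
    def __init__(self, vertices):
        self.vertices = vertices  # ordered (x, y) pairs

class Area:
    def __init__(self, boundary):
        self.boundary = boundary  # closed ring of (x, y) pairs

# The generalisation hierarchy from the text: "a cable is a line,
# a buoy is a point, a settlement is a point, a dryfall is an area,
# an island is an area".
class Cable(Line): pass
class Buoy(Point): pass
class Settlement(Point): pass
class Dryfall(Area): pass
class Island(Area): pass
```

Every chart entity can then be handled through its geometric supertype, e.g. a generic display routine accepting any Point also draws buoys and settlements. The scale-dependent relationships mentioned above (a settlement as an area at large scales, a point at small scales) do not fit static inheritance so neatly, which illustrates why they are interesting for the data model.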
The model, as shown in Figure C-1, contains some spatial references that are not really necessary (the island on which a settlement is situated can be derived from the geographical position of the settlement and the land/sea manifold). Figure C-2 shows the same information as it could be represented using an icon-based ER approach [Tveite92].

C.6 DBMS-aspects of an ENC-server

The discussion in the following sections is at an overview level, and is meant only to highlight the problems and possibilities that the management of ECDIS data poses to DBMSs.

C.6.1 The amount of data

The data volumes for storing a total coverage of the oceans and waterways of the earth are huge. An average of 1-2 Megabytes of data for each chart cell is forecast by the IHO. On some parts of the Norwegian coast, data volumes in the order of 30 Megabytes have been reported for the 1:50000-equivalent cell size with height (land) contours included. In accordance with these figures, one must expect that a minimum navigational chart database will have to store many Gigabytes of data, and that the data volume of an extended database (including “administrative” data) should be forecast to grow well into the Terabytes. Such an extended database could include new kinds of information pertaining to the objects in the ENC, pictures of harbour approaches, pictures of other significant features, full coverage of land features and detailed 3D models of the seabed and land surface. As for the regular updates to the ENCs, the data volume has been forecast to be about 135 KBytes per week ([Sandvik90], p. 65) for an international server covering the whole world. This number applies only to the paper-chart part of the ENC. Updates to extended services would give numbers of a higher order of magnitude.

C.6.2 The data

The chart information in the ENCDB is very important (many lives depend on it), so the data should be handled with great care to avoid errors.
A chart error leading to a grounding could leave the ENCDB host legally responsible. The “administrative” part of the information base (list of lights, harbour information, national laws, etc.) does not have the same legal restrictions as the “paper chart” part. It can therefore be handled in a more flexible way. The ECDIS data will consist of a mixture of different types and formats.

• The geometric part of the database is the most problematic from a database point of view, and will contain all the geographic properties of the chart data.

• The tabular part of the database will consist of non-spatial information on the objects in the ENCDB.

• The pictorial part of the database consists of the images and pictures that could be of use in an ECDIS. This is information found in the administrative part of ECDIS, and could be pictures of harbours, lights and other distinct features of interest to the navigator.

All of these data types, with their associated operations, will have to be supported in the database management system for ECDIS. Most, if not all, of the data in ECDIS should be temporal, that is, they should have a time of validity. This will also have to be supported by the DBMS. The accuracy of the data in ECDIS will vary. To enable predictions on the accuracy of the results of the various operations on, and applications of, the ECDIS data, accuracy measures should be included in the database. Indications of the adequate scales of usage will have to be attached to all displayable objects.

C.6.3 Response time

A navigational chart should be displayed immediately at the request of the operator. Efficient retrieval and integration of the traditional chart data (coast-lines, dryfalls, soundings, islands, ..) with other data (the so-called administrative part of the ENCD) is needed to give the navigation systems information in real-time.
The traditional, general-purpose database management systems do not at present have the power to do this for fast-moving vessels. In the near future these systems must be expected to provide higher performance. Today, only a database system based on parallel processing and/or parallel storage, or a tailor-made system, will be able to give the efficiency required.

C.6.4 Concurrency and recovery

The ENC database on board ships does not need any concurrency control or recovery [Bernstein87] if the system is to be used as a pure information source. Hence, the management of data on board will be greatly simplified. The ENCDB will, however, need some kind of concurrency control to ensure that the data that are sent out are consistent (no partial updates have occurred during the transmission of a chart). For the ENCDB, a coarse-granularity, spatial locking system will do (for instance locking at the cell level, or locking the complete database). If the ENCDB shall allow general transactions, a full concurrency control scheme is required. In choosing the granularity and type of concurrency control, one must take into account the relatively large number of long (read-)transactions that result from the extraction of charts and updates for transmission. A spatially based concurrency control mechanism should be preferred. Recovery systems for the ENCDB will have to be state of the art, since the demands on the reliability and correctness of the system are so high.

C.6.5 Security

The status of the navigational chart as a legal document puts very high demands on the security and integrity of the data and on monitoring the data communication. The data in the approved and modification parts should be securely protected. One has to be able to determine whether or not the data come from the authorised ENCDB server. 100 percent reliable communication is necessary to ensure correct delivery.
Cryptographic storage methods and physical security arrangements, in addition to ordinary operating system file-protection mechanisms, would be appropriate to ensure adequate security for the data in the system.

C.6.6 Reliability

The ENCDB should be a non-stop system. It should always be available and resilient to the most common failures and accidents, such as disk crashes, power failures, corruption of internal memory and failure of cables and components. One will have to assume that the demands on the mean time between failures (MTBF) will be in the order of years. To achieve this, parallel/replicated storage and duplicated processors must be used. This kind of database technology has been around for some years.

C.6.7 Billing

ECDIS data will be provided from different sources, and these sources will want credit (often as cash) for the use of their data. To be able to do efficient billing of the data, suppliers of the data must be recorded with their percentage of ownership. Billing at cell level seems the most natural way of organising this. A log of all the retrievals in the database has to be kept to perform the billing. A suitable pricing policy must be determined.

C.6.8 The choice of a database system for the ECDIS server

Today’s alternatives for DBMSs can be outlined as follows. The choice of database system can be divided into a hardware choice and a model choice. The hardware choice concerns the type of processor or architecture to be utilised, while the model choice is between the different data models proposed for database systems.
The choices can be simplified to the following:
• Single-processor versus (centralised) multi-processor
• Relational databases [Codd70] versus object-oriented databases [Atkinson89] (network databases and hierarchical databases seem to be somewhat out of fashion)

The choice of a single-processor versus a multi-processor database machine will depend on the expected transaction rates and storage requirements. For efficiency reasons, a database system utilising parallel technology could be preferable for demanding systems. For security and reliability reasons, duplication is advantageous, and multi-processor environments are well suited for this.

Relational database management systems (RDBMSs) have the advantage of being the most modern of the dominant technologies of today. Modelling for RDBMSs is a well-understood task, and the models can be modified as new objects, relationships or requirements show up. The interfaces to relational databases are also well defined and partly standardised (SQL). Data stored in relations is trivial to update through insertions, deletions and changes.

Object-oriented database management systems (OODBMSs) are quite fresh as commercial products, but have proven very fast in comparison with RDBMSs for selected tasks [SI91]. Exchange of data with other systems is not trivial because of the lack of standards. We are still waiting for mature OODBMS technology. CAD, software engineering and GIS are some of the fields suffering under the limitations of today's relational DBMSs, and fields where the contribution of OODBMSs is forecast to be significant. The advantages of OODBMSs are: intuitive and expressive modelling ((multiple) inheritance), information hiding (access to data only through methods), object identity, complex objects, a complete programming environment, and direct implementation of the model in the working database. Suggested features of OODBMSs are described in [Atkinson89].
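The OODBMS advantages listed above — inheritance, information hiding and object identity — can be illustrated with a minimal sketch. The class names, attributes and the safety rule are hypothetical examples, not taken from the ENC object catalogue.

```python
class ChartFeature:
    """Base class: every feature carries an object identity,
    hidden behind a method rather than exposed as raw data."""

    def __init__(self, feature_id):
        self._id = feature_id

    def identity(self):
        return self._id

class Depth(ChartFeature):
    """Information hiding: the stored depth is reachable only
    through methods, so the representation can change freely."""

    def __init__(self, feature_id, metres):
        super().__init__(feature_id)
        self._metres = metres

    def is_safe_for(self, draught_m, margin_m=1.0):
        return self._metres >= draught_m + margin_m

class DredgedDepth(Depth):
    """Inheritance: a dredged area is a depth with an extra
    maintenance attribute; all depth behaviour is reused."""

    def __init__(self, feature_id, metres, dredged_year):
        super().__init__(feature_id, metres)
        self.dredged_year = dredged_year
```

In a relational model the same information would be spread over tables and reassembled by joins; in the object model the behaviour (`is_safe_for`) travels with the data, which is what makes the modelling "intuitive and expressive" for chart features.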
C.7 Conclusions

The building of an electronic navigational chart server seems to be feasible, and a stripped-down version will not require advanced database technology. A general-purpose DBMS is, however, necessary for a full ECDIS. A DBMS approach will give an open-ended system, available to the customers in a more direct manner. When the exchange standards and the data structures of the ENC are standardised and available from the IHO, the provision of data in the specified format from an ENCDB server will be straightforward, provided the necessary data are available. For the ENCDB, the main effort has to be put into securing safe operation of the system and into efficient data structures.

An interesting topic for further research is the information system part of the ECDIS. This could evolve into a full-grown multimedia database system. The research area of multimedia database systems is still in its infancy [IEEECOMPUTER89], so it would be wise to wait for some of the dust to settle (standardisation) in this area before taking the step towards a general-purpose multimedia information system.

Bibliography

[Aangeenbrug91] "A Critique of GIS" R.T. Aangeenbrug In [Maguire91], pp. 101-107
[Abel86] "A Relational GIS Database Accommodating Independent Partitionings of the Region" David J. Abel, John L. Smith Second Symposium on Spatial Data Handling, Seattle, 1986, pp. 213-225
[Abel93] "Advances in Spatial Databases" David Abel, Beng Chin Ooi (Eds.) Proceedings, Third International Symposium, SSD'93, Singapore, June 1993, Lecture Notes in Computer Science 692, Springer Verlag, 1993, 431p.
[Abiteboul90] "New Hope on Data Models and Types: Report of an NSF-INRIA workshop" Serge Abiteboul, Peter Buneman, Claude Delobel, Richard Hull, Paris Kanellakis, Victor Vianu SIGMOD Record, Vol. 19, No. 4, Dec. 1990, pp. 41-48
[Agrawal89] "Modular Synchronization in Multiversion Databases: Version Control and Concurrency Control" Divyakant Agrawal, Soumitra Sengupta ACM, Proc. SIGMOD 1989, pp.
408-417
[Ahn88] "Partitioned Storage for Temporal Databases" Ilsoo Ahn, Richard Snodgrass Information Systems, Vol. 13, No. 4, 1988, pp. 369-391
[Aho83] "Data Structures and Algorithms" Alfred V. Aho, John E. Hopcroft, Jeffrey D. Ullman Addison Wesley, 1983 (first edition 1982)
[Al-Taha94] "Bibliography on Spatiotemporal Databases" Khaled K. Al-Taha, Richard T. Snodgrass, Michael D. Soo International Journal of Geographical Information Systems, Vol. 8, No. 1, 1994, pp. 95-103
[Aref91] "Extending a DBMS with Spatial Operations" Walid G. Aref, Hanan Samet In [Günther91], pp. 299-318
[Aronson89] "The Geographic Database - Logically Continuous and Physically Discrete" Peter Aronson Proceedings, Auto-Carto 9, Baltimore, Maryland, 1989, pp. 452-461
[Atkinson87] "Types and Persistence in Database Programming Languages" Malcolm P. Atkinson, O. Peter Buneman ACM Computing Surveys, Vol. 19, No. 2, June 1987, pp. 105-190
[Atkinson89] "The Object-Oriented Database System Manifesto" Malcolm Atkinson, François Bancilhon, David DeWitt, Klaus Dittrich, David Maier, Stanley Zdonik Proceedings of the 1st Intl. Conf. on Deductive and Object-Oriented Databases (DOOD'89), Kyoto, Japan, Dec. 1989, pp. 40-57
[ATKIS89] "Amtliches Topographisch-Kartographisches Informationssystem ATKIS, Teil A Konzeption und Inhalt des Informationssystems ATKIS" AdV-Arbeitsgruppe ATKIS Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland (AdV), Stand 10.1989 (in German), 31p.
[Aurenhammer91] "Voronoi Diagrams - A Survey of a Fundamental Geometric Data Structure" Franz Aurenhammer ACM Computing Surveys, Vol. 23, No. 3, September 1991, pp. 345-405
[Badrinath90] "Performance Evaluation of Semantics-based Multilevel Concurrency Control Protocols" B.R. Badrinath, Krithi Ramamritham ACM, SIGMOD Record, Vol. 19, No. 2, 1990 (Proc. SIGMOD'90), pp. 163-172
[Ballard81] "Strip Trees: A Hierarchical Representation for Curves" Dana H.
Ballard Communications of the ACM, Vol. 24, No. 5, May 1981, pp. 310-321
[Bancilhon90] "Object-Oriented Database Systems: In Transit" François Bancilhon, Won Kim SIGMOD Record, Vol. 19, No. 4, Dec. 1990, pp. 49-53
[Barghouti91] "Concurrency Control in Advanced Database Applications" Naser S. Barghouti, Gail E. Kaiser ACM Computing Surveys, Vol. 23, No. 3, Sept. 1991, pp. 269-317
[Barnsley88] "Fractals Everywhere" Michael Fielding Barnsley Academic Press, 1988, 394p.
[Barrera81] "Schema Definition and Query Language for a Geographical Database System" R. Barrera, A. Buchmann IEEE Computer Architecture for Pattern Analysis and Image Database Management, Nov. 1981, pp. 250-256
[Batini86] "A Comparative Analysis of Methodologies for Database Schema Integration" C. Batini, M. Lenzerini, S.B. Navathe ACM Computing Surveys, Vol. 18, No. 4, December 1986, pp. 323-364
[Bayer72] "Organization and Maintenance of Large Ordered Indexes" R. Bayer, E. McCreight Acta Informatica, Vol. 1, No. 3, 1972, pp. 173-189
[Beck86] "Quality Control and Standards for a National Digital Cartographic Data Base" Francis J. Beck, Randle W. Olsen Proceedings, Auto Carto London, 1986, Vol. 1, pp. 372-380
[Beckmann90] "An Efficient and Robust Access Method for Points and Rectangles" Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger ACM, SIGMOD Record, Vol. 19, No. 2, 1990 (Proc. SIGMOD'90), pp. 322-331
[Bédard89] "Extending Entity/Relationship Formalism for Spatial Information Systems" Yvan Bédard, François Paquette Proceedings, Auto Carto 9, Baltimore, Maryland, 1989, pp. 818-827
[Beeri90] "Formal Models for Object Oriented Databases" Catriel Beeri In: Deductive and Object-Oriented Databases (DOOD89), Editors: Kim, Nicolas, Nishio. Elsevier, 1990, pp. 405-430
[Bentley75] "Multidimensional Binary Search Trees Used for Associative Searching" Jon Louis Bentley Communications of the ACM, Vol. 18, No. 9, 1975, pp. 509-517
[Bernhardsen86] "Community Benefit of Digital Spatial Information" T.
Bernhardsen, S. Tveitdal Proceedings, Auto Carto London, 1986, Vol. 2, pp. 1-3
[Bernstein93] "Concurrency in Programming and Database Systems" Arthur J. Bernstein, Philip M. Lewis Jones and Bartlett Publishers, 1993, 548p.
[Bernstein87] "Concurrency Control and Recovery in Database Systems" Philip A. Bernstein, V. Hadzilacos, Nathan Goodman Addison Wesley, 1987
[Berry87] "Fundamental Operations in Computer-assisted Map Analysis" Joseph K. Berry International Journal of Geographical Information Systems, Vol. 1, No. 2, 1987, pp. 119-136
[Biller77] "Concepts for the Conceptual Schema" N. Biller, E. Neuhold In "Architecture and Models in Data Base Management Systems", G. Nijssen, Ed. North-Holland, Amsterdam, 1977, pp. 1-30
[Birtwistle73] "SIMULA Begin" Graham M. Birtwistle, Ole-Johan Dahl, Bjørn Myrhaug, Kristen Nygaard Studentlitteratur, Lund, Sweden, 1973
[Bjørke90] "Cartographic Zoom" Jan Terje Bjørke, Rune Aasgaard Proceedings of the 4th International Symposium on Spatial Data Handling, 1990, Zürich, Switzerland, Vol. 1, pp. 345-353
[Blais86] "Optimal Interval Sampling in Theory and Practice" J.A.R. Blais, M.A. Chapman, W.K. Lam Second Symposium on Spatial Data Handling, Seattle, 1986, pp. 185-192
[Boudriault87] "Topology in the TIGER file" Gerard Boudriault Auto Carto 8, Baltimore, Maryland, 1987, pp. 258-263
[Brassel88] "A review and conceptual framework of automated map generalization" Kurt E. Brassel, Robert Weibel International Journal of Geographical Information Systems, Vol. 2, No. 3, 1988, pp. 229-244
[Bratbergsengen83] "Feature Analysis of ASTRAL" Kjell Bratbergsengen, Tor Stålhane In [Schmidt83a], pp. 50-75
[Bratbergsengen84] "Hashing Methods and Relational Algebra Operations" Kjell Bratbergsengen Proc. of the 10th Conference on Very Large Data Bases, Singapore, Aug.
1984
[Bratbergsengen89] "The Development of the CROSS8 and HC16-186 Parallel (Database) Computers" Kjell Bratbergsengen, Torgrim Gjelsvik The Sixth International Workshop on Database Machines, France, June 19-23, 1989
[Bratbergsengen90] "Relational Algebra Operations" Kjell Bratbergsengen PRISMA Workshop, Parallel Database Systems, September 24-26, 1990, Nordwijk, The Netherlands, 1990, 20p.
[Breitbart92] "Overview of Multidatabase Transaction Management" Y. Breitbart, H. Garcia-Molina, A. Silberschatz The VLDB Journal, Vol. 1, No. 2, 1992, pp. 181-239
[Broome90] "The TIGER Data Base Structure" Frederick R. Broome, David B. Meixler Cartography and Geographic Information Systems, Vol. 17, No. 1, 1990, pp. 39-47
[Buchmann90] "Design and Implementation of Large Spatial Databases (first symposium SSD '89, Santa Barbara, California, July 17/18, 1989)" A. Buchmann, O. Günther, T.R. Smith, Y.-F. Wang (Eds.) Lecture Notes in Computer Science 409, Springer Verlag, 1990
[Burrough86] "Five Reasons why Geographical Information Systems are not being Used Efficiently for Land Resources Assessment" P.A. Burrough Proceedings, Auto Carto London, 1986, Vol. 2, pp. 139-148
[Burrough89] "Principles of Geographical Information Systems for Land Resources Assessment" P.A. Burrough Clarendon Press, Oxford, 1989 (first edition 1986)
[Calkins87] "The Transition To Automated Production Cartography: Design Of The Master Cartographic Database" Hugh W. Calkins, Duane F. Marble The American Cartographer, Vol. 14, No. 2, 1987, pp. 105-119
[Carey90] "Extensible Database Management Systems" Michael Carey, Laura Haas SIGMOD Record, Vol. 19, No. 4, Dec. 1990, pp. 54-60
[Carter92] "Perspectives on Sharing Data in Geographic Information Systems" James R. Carter Photogrammetric Engineering and Remote Sensing, Vol. 58, No. 11, Nov. 1992, pp.
1557-1560
[CEN95] "Geographic Information - Data Description - Quality" CEN/TC287 - Geographic Information, WG2, PT05 CEN/TC287, document N369, 1995
[CEN95b] "Geographic Information - Data Description - Metadata" CEN/TC287 - Geographic Information, WG2, PT01 CEN/TC287, document N370, 1995
[Ceri88] "Distributed Databases Principles & Systems" Stefano Ceri, Giuseppe Pelagatti McGraw-Hill, third printing 1988 (first edition 1985)
[CERL95] "Environmental Modeling and Visualization With GRASS GIS" CERL, Bill Brown Internet URL: http://softail.cecer.army.mil/grass/viz/VIZ.html, 1995
[Chance90a] "An Object-Oriented GIS - Issues and Solutions" Arthur Chance, Richard Newell, David G. Theriault Conference Proceedings of EGIS, Amsterdam, April 1990
[Chance90b] "An Overview of Smallworld Magic" Arthur Chance, Richard Newell, David G. Theriault Smallworld Technical Paper no. 9, 1990
[Charlwood87] "Developing a DBMS for Geographic Information: A Review" Gerald Charlwood, George Moon, John Tulip Auto Carto 8, Baltimore, Maryland, 1987, 14p.
[Chen76] "The Entity-Relationship Model - Toward a Unified View of Data" Peter Pin-Shan Chen ACM Transactions on Database Systems, Vol. 1, No. 1, March 1976, pp. 9-36
[Chen94] "RAID: High-Performance, Reliable Secondary Storage" Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, David A. Patterson ACM Computing Surveys, Vol. 26, No. 2, June 1994, pp. 145-185
[Chrisman84] "The Role of Quality Information in the Long-Term Functioning of a Geographic Information System" Nicholas R. Chrisman Cartographica, Vol. 21, No. 2/3, 1984, pp. 79-87
[Chrisman86] "Obtaining Information on Quality of Digital Data" Nicholas R. Chrisman Proceedings, Auto Carto London, 1986, Vol. 1, pp. 350-358
[Chrisman89] "Modelling Error in Overlaid Categorical Maps" Nicholas R. Chrisman In [Goodchild89], pp. 21-34
[Chrisman90] "Deficiencies of sheets and tiles: building sheetless databases" Nicholas R.
Chrisman International Journal of Geographical Information Systems, Vol. 4, No. 2, 1990, pp. 157-167
[Christodoulakis95] "Multimedia Information Systems: Issues and Approaches" Stavros Christodoulakis, Leonidas Koveos In [Kim95a], pp. 318-337
[Clapham91] "The Development of an Initial Framework for the Visualisation of Spatial Data Quality" Sarah B. Clapham, Kate Beard Technical Papers, 1991 ACSM-ASPRS Annual Convention, Vol. 2, Cartography and GIS/LIS, Baltimore, 1991, pp. 73-82
[Clifford85] "On an Algebra for Historical Relational Databases: Two Views" James Clifford, Abdullah Uz Tansel ACM, SIGMOD Record, Vol. 14, No. 4, 1985 (Proc. SIGMOD'85), pp. 247-265
[Clocksin84] "Programming in Prolog" W.F. Clocksin, C.S. Mellish Springer-Verlag, 1984
[Coad90] "Object-Oriented Analysis" Peter Coad, Edward Yourdon Prentice-Hall, 1990
[CODASYL71] "CODASYL Data Base Task Group, April 1971 Report" Data Base Task Group ACM, 1971
[Codd70] "A Relational Model of Data for Large Shared Data Banks" E.F. Codd Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387
[Codd79] "Extending the Database Relational Model to Capture More Meaning" E.F. Codd ACM Transactions on Database Systems, Vol. 4, No. 4, Dec. 1979, pp. 397-434
[Congalton94] "International Symposium on the Spatial Accuracy of Natural Resource Data Bases" Russel G. Congalton, Ed. American Society for Photogrammetry and Remote Sensing, 1994, 271p.
[Conklin87] "Hypertext: An Introduction and Survey" Jeff Conklin IEEE Computer, Vol. 20, No. 9, September 1987, pp. 17-41
[Dangermond86] "GIS Trends and Experiences" Jack Dangermond Second Symposium on Spatial Data Handling, Seattle, 1986, pp. 1-4
[Date86] "An Introduction to Database Systems, Volume I" C.J. Date Addison Wesley, fourth edition 1986
[Dayal95] "Active Database Systems" Umeshwar Dayal, Eric Hanson, Jennifer Widom In [Kim95a], pp. 434-456
[Deux90] "The Story of O2" O. Deux, et al. IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, pp.
91-108
[DeWitt85] "Multiprocessor Hash-Based Join Algorithms" David J. DeWitt, Robert Gerber Proceedings of VLDB'85, Stockholm, 1985, pp. 151-164
[Douglas73] "Algorithms for the reduction of the number of points required to represent a digitized line or its caricature" David H. Douglas, Thomas K. Peucker Canadian Cartographer, Vol. 10, No. 4, 1973, pp. 110-122
[Dowers90] "Analysis of GIS Performance on Parallel Architectures and Workstation-Server Systems" S. Dowers, B.M. Gittings, T.M. Sloan, T. Waugh Proceedings, GIS/LIS '90, 7-10 Nov. 1990, Anaheim, CA, Vol. 2, pp. 555-561
[Dutton89] "Planetary Modelling via Hierarchical Tessellation" Geoffrey Dutton Proceedings, Auto-Carto 9, Baltimore, Maryland, 1989, pp. 462-471
[Dæhlen90] "Compression of Hydrographic Data" Morten Dæhlen, Geir Westgaard Senter for Industrial Research, report no. 900612-1, August 1990, 23p.
[Easterfield90] "Version Management in GIS - Applications and Techniques" Mark E. Easterfield, Richard G. Newell, David G. Theriault EGIS '90, EGIS Foundation, Netherlands, 1990, pp. 288-297
[Egenhofer87] "Object-Oriented Databases: Database Requirements for GIS" Max J. Egenhofer, Andrew U. Frank International Geographic Information Systems Symposium: The Research Agenda, Crystal City, VA, November 1987, pp. II:189-211
[Egenhofer89a] "Object-Oriented Modeling in GIS: Inheritance and Propagation" Max J. Egenhofer, Andrew U. Frank Proceedings, Auto-Carto 9, Baltimore, Maryland, 1989, pp. 588-598
[Egenhofer89b] "Object-Oriented Software Engineering Considerations for Future GIS" Max J. Egenhofer, Andrew U. Frank Proceedings, IGIS'89, Baltimore, Maryland, 1989, pp. 55-72
[Egenhofer90a] "A Topological Data Model for Spatial Databases" M.J. Egenhofer, A.U. Frank, J.P. Jackson In [Buchmann90], pp. 271-286
[Egenhofer90b] "A Mathematical Framework for the Definition of Topological Relationships" Max Egenhofer, John R.
Herring Proceedings of the 4th International Symposium on Spatial Data Handling, 1990, Zürich, Switzerland, pp. 803-813
[Egenhofer91a] "Point-set topological spatial relations" Max J. Egenhofer, Robert D. Franzosa International Journal of Geographical Information Systems, Vol. 5, No. 2, 1991, pp. 161-174
[Egenhofer91b] "Reasoning about Binary Topological Relations" Max J. Egenhofer In [Günther91], pp. 143-160
[Egenhofer92] "Reasoning about Gradual Changes of Topological Relationships" Max J. Egenhofer, Khaled K. Al-Taha In [Frank92], pp. 196-219
[Egenhofer95] "Advances in Spatial Databases, 4th International Symposium, SSD'95, Portland, Maine, USA, August 6-9, 1995, proceedings" Max J. Egenhofer, John R. Herring (Eds.) Lecture Notes in Computer Science, Vol. 951, Springer, 1995
[Elmasri89] "Fundamentals of Database Systems" Ramez Elmasri, Shamkant B. Navathe The Benjamin/Cummings Publishing Company, Inc., California, 1989
[Elmasri94] "Fundamentals of Database Systems, second edition" Ramez Elmasri, Shamkant B. Navathe The Benjamin/Cummings Publishing Company, Inc., California, 1994
[Encarnação83] "Computer Aided Design - Fundamentals and System Architectures" José Encarnação, Ernst G. Schlechtendahl Springer-Verlag, 1983
[ESRI95a] ESRI (Peter Moran, product marketing) Personal email communication, 1995
[ESRI95b] "Spatial Database Engine Released" ESRI (contact Carl Sylvester) ARC News, Vol. 17, No. 4, 1995, pp. 1-2
[Fagin79] "Extendible Hashing - A Fast Access Method for Dynamic Files" Ronald Fagin, Jürg Nievergelt, Nicholas Pippenger, H. Raymond Strong ACM Transactions on Database Systems, Vol. 4, No. 3, September 1979, pp. 315-344
[Faloutsos87] "Analysis of Object Oriented Spatial Access Methods" Christos Faloutsos, Timos Sellis, Nick Roussopoulos ACM, SIGMOD Record, Vol. 16, No. 3, 1987 (Proc. SIGMOD'87), pp. 426-439
[Faloutsos89] "Tri-Cell - A Data Structure for Spatial Data" Christos Faloutsos, Winston Rego Information Systems, Vol. 14, No.
2, 1989, pp. 131-139
[Farrag89] "Using Semantic Knowledge of Transactions to Increase Concurrency" Abdel Aziz Farrag, M. Tamer Özsu ACM Transactions on Database Systems, Vol. 14, No. 4, Dec. 1989, pp. 503-525
[Feuchtwanger89] "Geographic Logical Database Model Requirements" Martin Feuchtwanger Proceedings, Auto Carto 9, Baltimore, Maryland, 1989, pp. 599-609
[Feuchtwanger93] "Towards a Geographic Semantic Database Model" Martin Feuchtwanger Thesis, Doctor of Philosophy, Geography, Simon Fraser University, July 1993, 186p.
[FGDC94] "Content Standards for Digital Geospatial Metadata" Federal Geographic Data Committee Department of the Interior, US Geological Survey, Federal Geographic Data Committee (FGDC), June 8, 1994
[FGIS90] "FGIS Konseptbeskrivelse, Versjon 2.0" Statens Kartverk Statens Kartverk, Hønefoss, 27/7-1990
[Finkel74] "Quad Trees: A Data Structure for Retrieval on Composite Keys" R.A. Finkel, J.L. Bentley Acta Informatica, Vol. 4, No. 1, 1974, pp. 1-9
[Firns91] "ER on the Side of Spatial Accuracy" Peter G. Firns, George L. Benwell Proceedings, Symposium on Spatial Database Accuracy, June 19-20, 1991, Melbourne, Australia, pp. 192-202
[Franaszek85] "Limitations of Concurrency in Transaction Processing" Peter Franaszek, John T. Robinson ACM Transactions on Database Systems, Vol. 10, No. 1, March 1985, pp. 1-28
[Frank84] "Requirements for Database Systems Suitable to Manage Large Spatial Databases" Andrew U. Frank International Symposium on Spatial Data Handling, Zurich, Switzerland, August 1984, pp. 38-60
[Frank86] "Cell Graphs: A Provable Correct Method for the Storage of Geometry" Andrew U. Frank, Werner Kuhn Proceedings, Second Symposium on Spatial Data Handling, Seattle, 1986, pp. 411-436
[Frank88] "Requirements for a Database Management System for a GIS" Andrew U. Frank Photogrammetric Engineering and Remote Sensing, Vol. 54, No. 11, Nov. 1988, pp.
1557-1564
[Frank91] "Properties of Geographic Data: Requirements for Spatial Access Methods" Andrew Frank In [Günther91], pp. 225-234
[Frank92] "Theories and Methods of Spatio-Temporal Reasoning in Geographical Space" Andrew U. Frank, Irene Campari, Ubaldo Formentini (Eds.) Proceedings, International Conference GIS - From Space to Territory: Theories and Methods of Spatio-Temporal Reasoning, Pisa, Italy, September 1992, Lecture Notes in Computer Science 639, Springer Verlag, 1992, 431p.
[Furht95] "Design Issues for Interactive Television Systems" Borko Furht, Deven Kalra, Frederick L. Kitson, Arturo A. Rodriguez, William E. Wall IEEE Computer, Vol. 28, No. 5, May 1995, pp. 25-39
[Gadia88] "A Homogeneous Relational Model and Query Languages for Temporal Databases" Shashi K. Gadia ACM Transactions on Database Systems, Vol. 13, No. 4, Dec. 1988, pp. 418-448
[Ganger94] "Disk Arrays: High-Performance, High-Reliability Storage Subsystems" Gregory R. Ganger, Bruce L. Worthington, Robert Y. Hou, Yale N. Patt IEEE Computer, Vol. 27, No. 3, March 1994, pp. 30-36
[Garcia-Molina95] "Distributed Databases" Hector Garcia-Molina, Mei Hsu In [Kim95a], pp. 477-493
[Gardels88] "GRASS in the X-Windows Environment: Distributing GIS Data and Technology" Kenneth Gardels GIS/LIS'88 Proceedings, ACSM, ASP/RS, AAG, URISA, San Antonio, TX, Nov. 1988, pp. 751
[GISDATA93] GISDATA Newsletter No. 1, ESF, March 1993
[GISDATA95] GISDATA Newsletter No. 6, ESF, November 1995
[Goldberg83] "Smalltalk-80: The Language and its Implementation" Adele Goldberg, David Robson Addison-Wesley, Reading, MA, 1983
[Golledge92] "Do People Understand Spatial Concepts: The Case of First-Order Primitives" Reginald G. Golledge In [Frank92], 1992, pp. 1-21
[Gonzalez78] "Syntactic Pattern Recognition: An Introduction" Rafael C. Gonzalez, Michael G. Thomason Addison-Wesley, 1978, 283p.
[Gonzalez87] "Digital Image Processing" Rafael C.
Gonzalez, Paul Wintz Addison Wesley, 1987, 503p.
[Goodchild89] "The Accuracy of Spatial Databases" Michael Goodchild, Sucharita Gopal, Eds. Taylor and Francis, London, 1989
[Goodchild90a] "Tiling of Large Geographical Databases" Michael F. Goodchild In [Buchmann90], pp. 137-146
[Goodchild90b] "Keynote address: Spatial Information Science" Michael F. Goodchild Proceedings, 4th International Symposium on Spatial Data Handling, 1990, Zürich, Vol. 1, pp. 3-12
[Goodchild91] "Keynote address: Symposium on Spatial Database Accuracy" Michael F. Goodchild Proceedings, Symposium on Spatial Database Accuracy, June 19-20, 1991, Melbourne, Australia, pp. 1-16
[Goyal89] "Intelligent Information Systems: The Concept of an Intelligent Document" Pankaj Goyal Information Systems, Vol. 14, No. 4, 1989, pp. 351-358
[Grant90] "The Management and Dissemination of Electronic Navigational Chart Data in the 1990s" Stephen Grant, Michael Casey, Timothy Evangelatos, Horst Hecht International Hydrographic Review, Monaco, LXVII(2), July 1990, pp. 17-30
[GRASS93] "Grass 4.1 Reference Manual" GRASS Project US Army Corps of Engineers, Construction Engineering Research Laboratories, Champaign, Illinois, 1993
[GRASS95] "GEOGRAPHIC RESOURCES ANALYSIS SUPPORT SYSTEM (GRASS)" GRASS (William D. Goran) Internet URL: http://deathstar.rutgers.edu/grass/what.html
[Greene89] "An Implementation and Performance Analysis of Spatial Data Access Methods" Diane Greene Proc. IEEE, 5th International Conference on Data Engineering, Los Angeles, Calif., 1989, pp. 606-615
[Guptill90] "Multiple Representations of Geographic Entities through Space and Time" Stephen C. Guptill Proceedings of the 4th International Symposium on Spatial Data Handling, 1990, Zürich, Switzerland, pp. 859-868
[Guttman84] "R-Trees: A Dynamic Index Structure for Spatial Searching" Antonin Guttman ACM, Proc. SIGMOD'84, Boston, MA, June 18-21, 1984, pp.
47-57
[Günther87] "A Dual Space Representation for Geometric Data" Oliver Günther, Eugene Wong Proc. of the 13th VLDB Conference, Brighton, 1987, pp. 501-506
[Günther89] "The Design of the Cell Tree: An Object-Oriented Index Structure for Geometric Databases" Oliver Günther Proc. IEEE, 5th International Conference on Data Engineering, Los Angeles, Calif., 1989, pp. 598-605
[Günther90] "Research Issues in Spatial Databases" O. Günther, A. Buchmann SIGMOD Record, Vol. 19, No. 4, Dec. 1990, pp. 61-68
[Günther91] "Advances in Spatial Databases" O. Günther, H.-J. Schek Proceedings, 2nd Symposium, SSD'91, Zurich, Switzerland, August 28-30, 1991, Springer Verlag, 1991, 471p.
[Güting94] "An Introduction to Spatial Database Systems" R.H. Güting The VLDB Journal, Vol. 3, No. 4, 1994, pp. 357-399
[Hagaseth90] "Multimedia Databasesystemer for Geografiske Informasjonssystemer" Marianne Hagaseth Unpublished student report, IDT, NTH, 4/5-1990 (in Norwegian)
[Haas91] "Exploiting Extensible DBMS in Integrated Geographic Information Systems" Laura M. Haas, William F. Cody In [Günther91], pp. 423-450
[Hammer78] "The Semantic Data Model: A Modelling Mechanism for Data Base Applications" Michael Hammer, Dennis McLeod Proceedings of the ACM SIGMOD Conference, Austin, 1978, pp. 26-36
[Healey89] "Transputer Based Parallel Processing for GIS Analysis: Problems and Potentialities" R.G. Healey, G.B. Desa Auto Carto 9, Baltimore, Maryland, April 1989, pp. 90-99
[Healey91a] "Determination of Computing Resource Requirements for GIS Processing in a Workstation-Server Environment" R.G. Healey, S. Dowers, B.M. Gittings, T.M. Sloan, T.C. Waugh Proceedings EGIS 1991, pp. 422-426
[Healey91b] "Database Management Systems" R.G. Healey Chapter 18, in [Maguire91], pp. 251-267
[Herlihy90] "Apologizing Versus Asking Permission: Optimistic Concurrency Control for Abstract Data Types" Maurice Herlihy ACM Transactions on Database Systems, Vol. 15, No. 1, March 1990, pp.
96-124
[Herring87] "TIGRIS: Topologically Integrated Geographic Information System" John R. Herring Auto Carto 8, Baltimore, Maryland, March 1987, pp. 282-291
[Herring88] "Extensions to the SQL Query Language to Support Spatial Analysis in a Topological Data Base" John R. Herring, Robert C. Larsen, Jagadisan Shivakumar GIS/LIS'88 Proceedings, ACSM, ASP/RS, AAG, URISA, San Antonio, TX, Nov. 1988, pp. 741-750
[Herring89] "A Fully Integrated Geographic Information System" John R. Herring Auto Carto 9, Baltimore, Maryland, April 1989, pp. 828-837
[Herring90] "The Definition and Development of a Topological Spatial Data System" John R. Herring Photogrammetry and Land Information Systems, Editor: Otto Kölbl, Lausanne, Switzerland, 1990, pp. 57-70
[Hootsmans92] "Knowledge-Supported Generation of Meta-Information on Handling Crisp and Fuzzy Datasets" Rob M. Hootsmans, Wouter M. de Jong, Frans J.M. van der Wel Proceedings, 5th International Symposium on Spatial Data Handling, Charleston, SC, August 3-7, 1992, pp. 470-479
[Hopkins92] "Algorithm Scalability for Line Intersection Detection in Parallel Polygon Overlay" Sara Hopkins, Richard G. Healey, Thomas Waugh Proceedings of the 5th International Symposium on Spatial Data Handling, 1992, Charleston, SC, USA, Vol. 1, pp. 210-218
[Hsiao92] "Tutorial on Federated Databases and Systems (Part I)" D. Hsiao The VLDB Journal, Vol. 1, No. 1, 1992, pp. 127-179
[Hull87] "Semantic Database Modeling: Survey, Applications, and Research Issues" Richard Hull, Roger King ACM Computing Surveys, Vol. 19, No. 3, Sept. 1987, pp. 201-260
[Hunter91] "Proceedings, Symposium on Spatial Database Accuracy" Gary J. Hunter, editor Dept. of Surveying and Land Information, Univ. of Melbourne, 1991, 260p.
[IEEECOMPUTER89] "IEEE Computer, special issue on image database management" IEEE Computer, December 1989, pp. 7-71
[IHOSP5288] "Draft Specifications for ECDIS" IHO special publication 52, 3.
draft, October 1988
[ISO/IEC94a] "SQL Multimedia and Application Packages (SQL/MM) Project Plan" ISO/IEC JTC1/WG3, N1677, SQL/MM SOU-002, March 1994
[ISO/IEC94b] "SQL Multimedia and Application Packages (SQL/MM). Part 3: Spatial" ISO/IEC JTC1/WG3, N1677, SQL/MM SOU-005, March 1994
[ISO/IEC96] "SQL Multimedia and Application Packages - Part 3: Spatial" ISO/IEC JTC1/SC21, N10441, ISO/IEC CD 13249-3:199x (E), November 1996
[Jagadish90] "Linear Clustering of Objects with Multiple Attributes" H.V. Jagadish ACM, SIGMOD Record, Vol. 19, No. 2, 1990 (Proc. SIGMOD'90), pp. 332-342
[Jain87] "Advances in Statistical Pattern Recognition" Anil K. Jain NATO ASI Series, Vol. F30, Pattern Recognition Theory and Applications, Edited by P.A. Devijver and J. Kittler, Springer-Verlag, 1987, pp. 1-19
[Jajodia90] "Database Security: Current Status and Key Issues" Sushil Jajodia, Ravi Sandhu SIGMOD Record, Vol. 19, No. 4, Dec. 1990, pp. 123-126
[Jardine77] "The ANSI/SPARC DBMS Model: Proceedings of the Second SHARE Working Conference on Data Base Management Systems, Montreal, Canada, April 26-30, 1976" D.A. Jardine (editor) North Holland, 1977
[Jen94] "A Model for Handling Topological Relationships in a 2D Environment" Tao-Yuan Jen, Patrice Boursier In [Waugh94], pp. 73-88
[Joseph88] "PICQUERY: A High Level Query Language for Pictorial Database Management" Thomas Joseph, Alfonso F. Cardenas IEEE Transactions on Software Engineering, Vol. 14, No. 5, May 1988, pp. 630-638
[Katzman78] "A Fault-tolerant Computing System" James A. Katzman Proc. of the 11th Hawaii International Conference on System Sciences, Volume 3, 1978, pp. 85-102
[Keates82] "Understanding Maps" J.S. Keates Longman, London and New York, 1982
[Keating87] "An Integrated Topological Database Design for Geographic Information Systems" Terrence Keating, William Phillips, Kevin Ingram Photogrammetric Engineering and Remote Sensing, Vol. 53, No. 10, Oct. 1987, pp.
1399-1402
[Kemper87] "An Analysis of Geometric Modeling in Database Systems" Alfons Kemper, Mechtild Wallrath ACM Computing Surveys, Vol. 19, No. 1, March 1987, pp. 47-91
[Kemper94] "Object-Oriented Database Management, Applications in Engineering and Computer Science" Alfons Kemper, Guido Moerkotte Prentice-Hall, 1994, 680p.
[Kim84] "Highly Available Systems for Database Applications" Won Kim ACM Computing Surveys, Vol. 16, No. 1, March 1984, pp. 71-98
[Kim89] "Object-Oriented Concepts, Databases, and Applications" Won Kim, Fredrick H. Lochovsky, editors ACM Press, 1989
[Kim95a] "Modern Database Systems: The Object Model, Interoperability and Beyond" Won Kim (editor) ACM Press, Addison Wesley, 1995, 703p.
[Kim95d] "Introduction to Part 2: Technology for Interoperating Legacy Databases" Won Kim In [Kim95a], pp. 515-520
[Kim91] "Chips Deliver Multimedia" Yongmin Kim Byte, December 1991, pp. 163-173
[Kim95c] "Comparing Data Modelling Formalisms" Young-Gul Kim, Salvatore T. March Communications of the ACM, Vol. 38, No. 6, 1995, pp. 103-115
[Knott71] "Expandable Open Address Hash Table Storage and Retrieval" Gary D. Knott Proc. ACM SIGFIDET Workshop on Data Description, Access and Control, 1971, pp. 187-206
[Korth88] "Formal Model of Correctness Without Serializability" Henry F. Korth, Gregory D. Speegle ACM, SIGMOD Record, Vol. 17, No. 3, 1988 (Proc. SIGMOD'88), pp. 379-386
[Kotz-Dittrich95] "Where Object-Oriented DBMSs Should Do Better: A Critique Based on Early Experiences" Angelika Kotz-Dittrich, Klaus R. Dittrich In [Kim95a], pp. 238-254
[Langefors73] "Theoretical analysis of information systems" Börje Langefors Philadelphia, Auerbach, 4th ed., 1973, 489p.
[Langran88] "A Framework for Temporal Geographic Information" Gail Langran, Nicholas R. Chrisman Cartographica, Vol. 25, No. 3, 1988, pp. 1-14
[Langran89] "Accessing Spatiotemporal Data in a Temporal GIS" Gail Langran Proceedings, Auto-Carto 9, Baltimore, Maryland, 1989, pp.
191-198 [Larson78] "Dynamic Hashing" Per-Åke Larson BIT, No. 18, 1978, pp. 184-201 [Laurini90] "Principles of Geomatic Hypermaps" Robert Laurini, Françoise Milleret-Raffort Proceedings of the 4th International Symposium on Spatial Data Handling, 1990, Zürich, Switzerland, Vol. 2, pp. 642-651 [Laurini92] "Fundamentals of Spatial Information Systems" Robert Laurini, Derek Thompson Academic Press, 1992 [Lauzon85] "Database Support for Geographic Information Systems: The Wild System 9 Approach" J.P. Lauzon, R. McLaren, C. Harwood Proceedings of ACSM-ASPRS Fall Meeting, Indianapolis, Sept. 8-13, 1985, pp. 583-594 [Lillesand87] "Remote Sensing and Image Interpretation" Thomas M. Lillesand, Ralph W. Kiefer John Wiley & Sons, second edition 1987 [Lin91] "A Rationale for Spatiotemporal Intersection" Hui Lin, Hugh W. Calkins Technical Papers, 1991 ACSM-ASPRS Annual Convention, Vol. 2, Cartography and GIS-LIS, Baltimore, 1991, pp. 204-213 [Lindholm90] "Hypermedia as a Cartographic Product - Use and Production" Mikko Lindholm, Tapani Sarjakoski Course Material, Scandinavian Summer Course in Cartography, August 19-31, 1990, Gol, Norway [Litwin80] "Linear Hashing: A New Tool for File and Table Addressing" Witold Litwin Proceedings of the Sixth International Conference on Very Large Data Bases, Montreal, October 1980, pp. 212-223 [Lomet90] "The hB-Tree: A Multiattribute Indexing Method with Good Guaranteed Performance" David B. Lomet, Betty Salzberg ACM Transactions on Database Systems, Vol. 15, No. 4, Dec. 1990, pp. 625-658 [Lorie91] "The Use of a Complex Object Language in Geographic Data Management" Raymond A. Lorie In [Günter91], pp. 319-337 [Lu90] "Decomposition of Spatial Database Queries by Deduction and Compilation" Wei Lu, Jiawei Han Proceedings of the 4th International Symposium on Spatial Data Handling, 1990, Zürich, Switzerland, Vol. 2, pp. 579-588 [Lunt90] "Database Security" Teresa F. Lunt, Eduardo B. Fernandez SIGMOD RECORD, Vol. 19, No. 4, Dec. 1990, pp.
90-97 [Mackert86] "R* Optimizer Validation and Performance Evaluation for Local Queries" Lothar F. Mackert, Guy M. Lohman ACM, SIGMOD record, Vol. 15, No. 2, 1986 (Proc. SIGMOD’86), pp. 84-95 [Maguire91a] "Integrated GIS: The Importance of Raster" David J. Maguire, Barry Kimber, Julian Chick Technical Papers, 1991 ACSM-ASPRS Annual Convention, Volume 4, GIS, Baltimore, ACSM-ASPRS 1991, pp. 107-116 [Maguire91b] "Geographical Information Systems" David J. Maguire, Michael F. Goodchild, David W. Rhind Longman 1991, 2 volumes [Maier89] "Making Database Systems Fast Enough for CAD Applications" David Maier In [Kim89], pp. 573-582 [Mark89] "Concepts of Space and Spatial Language" David M. Mark, Andrew U. Frank Proceedings, Auto-Carto 9, Baltimore, 1989, pp. 538-556 [Mark90] "Experiential and Formal Models of Geographic Space" David M. Mark, Andrew U. Frank Santa Barbara, California: National Center for Geographic Information and Analysis, Report 90-10, part 1, 24p. [McKenzie86] "Bibliography: Temporal Databases" Edwin McKenzie SIGMOD Record, Vol. 15, No. 4, Dec. 1986, pp. 40-52 [McLaren86] "The Next Generation of Manual Data Capture and Editing Techniques: The Wild System 9 Approach" Robin A. McLaren, Walter Brunner Proceedings 1986 ACSM-ASPRS Annual Convention, Vol. 4, pp. 50-59 [Melton90] "SQL2 The SEQUEL An Emerging Standard" Jim Melton Database Programming and Design, Nov. 1990, pp. 24-32 [Misund93] "Multimodels and Metamap - Towards an Augmented Map Concept" Gunnar Misund Thesis, Cand Scient, University of Oslo, Nov. 1993, 170p. [Moellering86] "Developing Digital Cartographic Data Standards for the United States" Harold Moellering Proceedings, Auto-Carto London, 1986, Vol. 1, pp. 312-322 [Mohan88] "An Object-Oriented Knowledge Representation for Spatial Information" L. Mohan, L. Kashyap IEEE Transactions on Software Engineering, Vol. 14, No. 5, May 1988, pp. 675-681 [Molenaar94] "Modelling Topologic Relationships in Vector Maps" M. Molenaar, O.
Kufoniyi, T. Bouloucos In [Waugh94], pp. 112-126 [Morehouse85] "ARC/INFO: A Geo-Relational Model for Spatial Information" Scott Morehouse Proceedings, Auto-Carto 7, Washington DC, ACSM, 1985, pp. 388-397 [Morehouse89] "The Architecture of ARC/INFO" Scott Morehouse Proceedings, Auto-Carto 9, Baltimore, Maryland, 1989, pp. 266-277 [Morehouse90] "The Role of Semantics in Geographic Data Modelling" Scott Morehouse Proceedings of the 4th International Symposium on Spatial Data Handling, 1990, Zürich, Switzerland, pp. 689-698 [Mortenson85] "Geometric Modeling" Michael E. Mortenson John Wiley & Sons, Inc., 1985 [Morton66] "A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing" G.M. Morton Internal document, IBM Canada Ltd., 1966 [Mower92] "Building a GIS for Parallel Computing Environments" James E. Mower Proceedings of the 5th International Symposium on Spatial Data Handling, 1992, Charleston, SC, USA, Vol. 1, pp. 219-229 [Muller91] "Generalisation of Spatial Databases" Jean-Claude Muller In [Maguire91b], pp. 457-475 [Muller92] "Parallel Distributed Processing: An Application to Geographic Feature Selection" Jean-Claude Muller Proceedings of the 5th International Symposium on Spatial Data Handling, 1992, Charleston, SC, USA, Vol. 1, pp. 230-240 [Nagy79] "Geographic Data Processing" George Nagy, Sharad Wagle ACM Computing Surveys, Vol. 11, No. 2, 1979, pp. 139-181 [NCGIA91] "Scientific Report for the Specialist Meeting 8-10 June 1991" NCGIA, Initiative 7: Visualization of Spatial Data Quality, Technical Paper 91-26, October 1991 [Neugebauer90] "Extending a Database to Support the Handling of Environmental Measurement Data" Leonore Neugebauer In [Buchmann90], pp. 147-165 [Newell91a] "Integration of Spatial Objects in a GIS" Richard G. Newell, Mark Easterfield, David G. Theriault Proceedings, Auto-Carto 10, Baltimore, 1991, pp. 408-415 [Newell91b] "The Management of Multiple Users of Large Seamless Databases" Richard G. Newell, David G.
Theriault, Mark Easterfield, Colin Dean Smallworld technical papers 14, 1991 [Newell92] "Practical Experiences of Using Object-Orientation to Implement a GIS" Richard G. Newell, Mark Easterfield, David G. Theriault Proceedings of GIS/LIS 1992 [Ng81] "Further Analysis of the Entity-Relationship Approach to Database Design" Peter A. Ng IEEE Transactions on Software Engineering, Vol. 7, No. 1, 1981, pp. 85-99 [Nievergelt84] "The Grid File: An Adaptable, Symmetric Multikey File Structure" J. Nievergelt, H. Hinterberger, K.C. Sevcik ACM Transactions on Database Systems, Vol. 9, No. 1, March 1984, pp. 38-71 [Nijssen77] "Current Issues in Conceptual Schema Concepts" G.M. Nijssen In "Architecture and Models in Data Base Management Systems", G. Nijssen, Ed. North-Holland, Amsterdam, 1977 [NORTH SEA89] "The North Sea Project A test project for electronic navigational charts Experiences and Conclusions" The Norwegian Hydrographic Service, Stavanger, March 28th, 1989 [OGIS95] "Open GIS Consortium" [email protected] Internet URL: http://www.ogis.org/ogis.html [Omiecinski95] "Parallel Relational Database Systems" Edward Omiecinski In [Kim95a], pp. 494-512 [Oosterom89] "A Reactive Data Structure for Geographical Information Systems" Peter van Oosterom Proceedings, Auto-Carto 9, Baltimore, Maryland, 1989, pp. 665-674 [Oosterom91] "Building a GIS on top of the open DBMS "Postgres"" Peter van Oosterom, Tom Vijlbrief EGIS ’91, Brussels, Belgium, April 2-5, 1991, pp. 775-787 [Openshaw89] "Learning to Live with Errors in Spatial Databases" Stan Openshaw In [Goodchild89], pp. 263-276 [Orenstein84] "A Class of Data Structures for Associative Searching" Jack A. Orenstein Proceedings 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, 1984, pp. 181-190 [Orenstein86] "Spatial Query Processing in an Object-Oriented Database System" Jack A. Orenstein ACM, SIGMOD record, Vol. 15, No. 2, 1986 (Proc. SIGMOD’86), pp.
326-336 [Orenstein88] "PROBE Spatial Data Modeling and Query Processing in an Image Database Application" Jack A. Orenstein, Frank A. Manola IEEE Transactions on Software Engineering, Vol. 14, No. 5, May 1988, pp. 611-629 [Orenstein90a] "A Comparison of Spatial Query Processing Techniques for Native and Parameter Spaces" Jack Orenstein ACM, SIGMOD record, Vol. 19, No. 2, 1990 (Proc. SIGMOD’90), pp. 343-352 [Orenstein90c] "An Object-Oriented Approach to Spatial Data Processing" Jack A. Orenstein Proceedings of the 4th International Symposium on Spatial Data Handling, 1990, Zürich, Switzerland, pp. 669-678 [Özsu91] "Distributed Database Systems: Where Are We Now?" M. Tamer Özsu, Patrick Valduriez IEEE Computer, Vol. 24, No. 8, August 1991, pp. 68-78 [Pagel93] "The Transformation Technique for Spatial Objects Revisited" Bernd-Uwe Pagel, Hans-Werner Six, Henrich Toben In [Abel93], pp. 73-88 [Papadias94] "Qualitative Representation of Spatial Knowledge in Two-Dimensional Space" Dimitris Papadias, Timos Sellis The VLDB Journal, Vol. 3, No. 4, 1994, pp. 479-516 [Papazoglou90] "An Object-Oriented Approach to Distributed Data Management" M.P. Papazoglou, L. Marinos IBM Journal on Systems Software, 11, 1990, pp. 95-109 [Patterson88] "A Case for Redundant Arrays of Inexpensive Disks" D. Patterson, G. Gibson, R. Katz Proceedings of the SIGMOD Conference, New York, 1988, pp. 109-116 [Peckham88] "Semantic Data Models" J. Peckham, F. Maryanski ACM Computing Surveys, Vol. 20, No. 3, Sept. 1988, pp. 153-189 [Peucker75] "Cartographic Data Structures" Thomas K. Peucker, Nicholas Chrisman The American Cartographer, Vol. 2, No. 1, 1975, pp. 55-69 [Peucker78] "The Triangulated Irregular Network" Thomas K. Peucker, Robert J. Fowler, James J. Little, David M. Mark Proceedings, Digital Terrain Models (DTM) Symposium, ASP-ACSM, St. Louis, 1978, pp. 516-540 [Peuquet84] "A Conceptual Framework and Comparison of Spatial Data Models" Donna J. Peuquet Cartographica, Vol. 21, No.
4, 1984, pp. 66-113 [Peuquet86] "The Use of Spatial Relationships to Aid Spatial Database Retrieval" Donna J. Peuquet Proceedings, Second Symposium on Spatial Data Handling, Seattle, 1986, pp. 459-471 [Peuquet90a] "Introductory readings in Geographic Information Systems" Donna J. Peuquet, Duane F. Marble, eds. Taylor & Francis, 1990 [Peuquet90b] "ARC/INFO: an example of a contemporary geographic information system" Donna J. Peuquet, Duane F. Marble In [Peuquet90a], pp. 90-99 [Pigot92a] "A Topological Model for a 3D Spatial Information System" Simon Pigot Proceedings, 5th International Symposium on Spatial Data Handling, Charleston, SC, August 3-7, 1992, pp. 344-360 [Pigot92b] "The Fundamentals of a Topological Model for a Four-Dimensional GIS" Simon Pigot, Bill Hazelton Proceedings, 5th International Symposium on Spatial Data Handling, Charleston, SC, August 3-7, 1992, pp. 580-591 [Price89] "Modelling the Temporal Element in Land Information Systems" S. Price International Journal of Geographical Information Systems, Vol. 3, No. 3, 1989, pp. 233-244 [Pullar88] "Toward Formal Definitions of Topological Relations Among Spatial Objects" David V. Pullar, Max J. Egenhofer Proceedings, Third Symposium on Spatial Data Handling, Sydney, Australia, 1988, pp. 225-241 [Quinn87] "Designing Efficient Algorithms for Parallel Computers" Michael J. Quinn McGraw-Hill, 1987, 288p. [Rhind92] "The Information Infrastructure of GIS" David Rhind Proceedings, 5th International Symposium on Spatial Data Handling, Charleston, SC, August 3-7, 1992, pp. 1-19 [Robinson81] "The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes" John T. Robinson Proceedings ACM SIGMOD 1981, pp. 10-19 [Roussopoulos85] "Direct Spatial Search on Pictorial Databases Using Packed R-trees" Nick Roussopoulos, Daniel Leifker ACM, SIGMOD record, Vol. 14, No. 4, 1985 (Proc. SIGMOD’85), pp.
17-31 [Roussopoulos88] "An Efficient Pictorial Database System for PSQL" Nick Roussopoulos, Christos Faloutsos, Timos Sellis IEEE Transactions on Software Engineering, Vol. 14, No. 5, May 1988, pp. 639-650 [Rumbaugh91] "Object-Oriented Modeling and Design" James Rumbaugh, Michael Blaha, William Premerlani, Frederick Eddy, William Lorensen Prentice Hall, 1991 [Samet84] "The Quadtree and Related Hierarchical Data Structures" Hanan Samet ACM Computing Surveys, Vol. 16, No. 2, June 1984, pp. 187-260 [Samet89] "The Design and Analysis of Spatial Data Structures" Hanan Samet Addison Wesley, 1989 [Samet95] "Spatial Data Models and Query Processing" Hanan Samet, Walid G. Aref In [Kim95a], pp. 338-360 [Sandvik90] "Updating the Electronic Chart - The Seatrans Project" Robert Sandvik International Hydrographic Review, Monaco, LXVII(2), July 1990, pp. 59-67 [Schek93] "From Extensible Databases to Interoperability between Multiple Databases and GIS Applications" Hans-J. Schek, Andreas Wolf In [Abel93], pp. 207-238 [Schmidt83a] "Relational Database Systems, Analysis and Comparison" Joachim W. Schmidt, Michael L. Brodie (Eds.) Springer Verlag, 1983, 618p. [Schmidt83b] "Feature Analysis of the PASCAL/R Relational System" J.W. Schmidt, M. Mall, W.H. Dotzek In [Schmidt83a], pp. 332-377 [Scholl90] "Thematic Map Modeling" Michael Scholl, Agnès Voisard In [Buchmann90], pp. 167-190 [Sellis87] "The R+-tree: A Dynamic Index for Multi-dimensional Objects" Timos Sellis, Nick Roussopoulos, Christos Faloutsos Proceedings of the 13th VLDB Conference, Brighton, 1987, pp. 507-518 [SI90] Reported results from engineering benchmarking of some OODBMSs and RDBMSs at SI, Norway.
SI, Norway, late autumn 1990 [Sindre90] "HICONS: A General Diagrammatic Framework for Hierarchical Modelling" Guttorm Sindre Thesis, University of Trondheim, NTH, 1990:31 [Six88] "Spatial Searching in Geometric Databases" Hans-Werner Six, Peter Widmayer Proceedings, IEEE, 4th International Conference on Data Engineering, Los Angeles, Calif., 1988, pp. 496-503 [Smith77] "Database Abstractions: Aggregation and Generalization" John Miles Smith, Diane C.P. Smith ACM Transactions on Database Systems, Vol. 2, No. 2, June 1977, pp. 105-133 [Snodgrass85] "Taxonomy of Time in Databases" Richard Snodgrass, Ilsoo Ahn ACM, SIGMOD record, Vol. 14, No. 4, 1985 (Proc. SIGMOD’85), pp. 236-246 [Snodgrass86] "Temporal Databases" Richard Snodgrass, Ilsoo Ahn IEEE Computer, Vol. 19, No. 9, Sept. 1986, pp. 35-42 [Snodgrass87] "The Temporal Query Language TQuel" Richard Snodgrass ACM Transactions on Database Systems, Vol. 12, No. 2, June 1987, pp. 247-298 [Snodgrass90] "Temporal Databases: Status and Research Directions" Richard Snodgrass SIGMOD RECORD, Vol. 19, No. 4, Dec. 1990, pp. 83-89 [Snodgrass92] "Temporal Databases" Richard T. Snodgrass In [Frank92], pp. 22-64 [Soley95] "The OMG Object Model" Richard Mark Soley, William Kent In [Kim95a], pp. 18-41 [SOSI90] "SOSI, Spesifikasjoner, Brukerveiledning, versjon 1.4" Statens Kartverk Statens Kartverk, Hønefoss, March 1990 (in Norwegian) [STANLI91] "ATKIS-test - test av datamodellen i ATKIS som underlag för val av format för överföring av geografiska data" STANLI, SIS-STG STANLI Rapport nr 1:1991, TK80 Landskapsinformation, 1991, 65p. (in Swedish) [Stonebraker90] "Third-Generation Database System Manifesto" Michael Stonebraker, Lawrence A. Rowe, Bruce Lindsay, James Gray, Michael Carey, Michael Brodie, Philip Bernstein, David Beech (The Committee for Advanced DBMS Function) SIGMOD RECORD, Vol. 19, No. 3, Sept. 1990, pp. 31-44 [Stonebraker91] "The POSTGRES Next Generation Database Management System" M. Stonebraker, G.
Kemnitz Communications of the ACM, Vol. 34, No. 10, Oct. 1991, pp. 78-92 [Stroustrup91] "The C++ Programming Language, second edition" Bjarne Stroustrup Addison-Wesley, 1991 [Su86] "Modeling Integrated Manufacturing Data with SAM*" Stanley Y.W. Su Computer, Vol. 19, No. 1, Jan. 1986, pp. 34-49 [Su88] "Database Computers: Principles, Architectures, and Techniques" Stanley Y.W. Su McGraw-Hill, 1988 [Tamminen82] "The EXCELL Method for Efficient Geometric Access to Data" Markku Tamminen, Reijo Sulonen ACM IEEE 19th Design Automation Conference, Las Vegas, 1982, pp. 345-351 [Tanenbaum81] "Computer Networks" Andrew S. Tanenbaum Prentice/Hall, 1981 [Tansel86] "Adding Time Dimension to Relational Model and Extending Relational Algebra" Abdullah Uz Tansel Information Systems, Vol. 11, No. 4, 1986, pp. 343-355 [TECHRA93] "Techra SQL Reference Manual" KVATRO A/S, T012B, 1993 [Teorey86] "A Logical Design Methodology for Relational Databases Using the Extended Entity-Relationship Model" Toby J. Teorey, Dongqing Yang, James P. Fry ACM Computing Surveys, Vol. 18, No. 2, June 1986, pp. 197-222 [Thomason87] "Structural Method in Pattern Analysis" Michael G. Thomason NATO ASI Series, Vol. F30, Pattern Recognition Theory and Applications, Edited by P.A. Devijver and J. Kittler, Springer-Verlag, 1987, pp. 307-321 [Tomlinson89] "Canadian GIS Experience" Roger F. Tomlinson CISM Journal ACSGC, Vol. 43, No. 3, Autumn 1989, pp. 227-232 [Tou74] "Pattern Recognition Principles" Julius T. Tou, Rafael C. Gonzalez Addison-Wesley, 1974, 377 p. [Tsichritzis78] "The ANSI/X3/SPARC DBMS Framework: Report of the Study Group on Data Base Management Systems" Dionysios C. Tsichritzis, A. Klug, eds. Information Systems 3, 1978, pp. 173-191 [Tsichritzis82] "Data Models" Dionysios C. Tsichritzis, Frederick H. Lochovsky Prentice Hall, Inc., 1982 [Tveite92] "Sub-Structure Abstractions in Geographical Data Modelling" Håvard Tveite Proc., Neste Generasjons GIS, Trondheim, 14-15 Dec. 1992, pp.
17-35 [Tveite93] "Methods for Partitioning Large Geographical Databases" Håvard Tveite Proc., Neste Generasjons GIS, NLH, Ås, 16-17 Dec. 1992, pp. 193-208 [Tveite95] "Accuracy Assessments of Geographical Line Data Sets, the Case of the Digital Chart of the World" Håvard Tveite, Sindre Langaas Proc., ScanGIS’95, the 5th Scandinavian Research Conference on Geographical Information Systems, Trondheim, Norway, 12-14 June, 1995, pp. 145-154 [USGS90] "Spatial Data Transfer Standard, version 12/90" USGS US Department of the Interior, US Geological Survey, National Mapping Division, 1990, 202 p. [Vijlbrief92] "The GEO++ System: An Extensible GIS" Tom Vijlbrief, Peter van Oosterom Proceedings, 5th International Symposium on Spatial Data Handling, Charleston, SC, August 3-7, 1992, pp. 40-50 [Vossen91] "Data Models, Database Languages and Database Management Systems" Gottfried Vossen Addison-Wesley, 1991 [Vrana89] "Historical Data as an Explicit Component of Land Information Systems" Nick Vrana International Journal of Geographical Information Systems, Vol. 3, No. 1, 1989, pp. 33-49 [Waugh86] "The GEOVIEW design: a relational database approach to geographical data handling" T.C. Waugh, R.G. Healey Second Symposium on Spatial Data Handling, Seattle, 1986, pp. 193-212 [Waugh92] "An Algorithm for Polygon Overlay Using Cooperative Parallel Processing" T.C. Waugh, S. Hopkins International Journal of Geographical Information Systems, Vol. 6, No. 6, 1992, pp. 457-467 [Waugh94] "Advances in GIS Research" Thomas C. Waugh, Richard G. Healey, Eds. Proceedings of the Sixth International Symposium on Spatial Data Handling, Taylor & Francis 1994, 2 vols. [Weikum86] "A Theoretical Foundation of Multilevel Concurrency Control" G. Weikum Proceedings of the Fifth ACM Symposium on Principles of Database Systems, March 1986, pp. 31-42 [Wiederhold81] "Database Design" Gio Wiederhold McGraw-Hill, International Student Edition, 1981, 658p. [Wilson85] "Introduction to Graph Theory, third edition" Robin J.
Wilson Longman, 1985, 166p. [Woelk86] "An Object-Oriented Approach to Multimedia Databases" Darrell Woelk, Won Kim, Willis Luther ACM, SIGMOD record, Vol. 15, No. 2, 1986 (Proc. SIGMOD’86), pp. 311-325 [Woelk87] "Multimedia Information Management in an Object-Oriented Database System" Darrell Woelk, Won Kim Proc. of the 13th VLDB Conference, Brighton, 1987, pp. 319-329 [Worboys90a] "Object-Oriented Data Modeling for Spatial Databases" Michael F. Worboys, Hilary M. Hearnshaw, David J. Maguire International Journal of GIS, Vol. 4, No. 4, 1990, pp. 369-383 [Worboys90b] "Object-Oriented Data and Query Modelling for Geographical Information Systems" M.F. Worboys, H.M. Hearnshaw, D.J. Maguire Proceedings of the 4th International Symposium on Spatial Data Handling, 1990, Zürich, Switzerland, Vol. 2, pp. 679-688 [Worboys92] "A Model for Spatio-Temporal Information" M.F. Worboys Proceedings, 5th International Symposium on Spatial Data Handling, Charleston, SC, August 3-7, 1992, pp. 602-611 [Xia91] "The Uses and Limitations of Fractal Geometry in Terrain Modeling" Zong-Guo Xia, Keith C. Clarke Technical Papers, 1991 ACSM-ASPRS Annual Convention, Vol. 2, Cartography and GIS/LIS, Baltimore, 1991, pp. 336-352 [Yager91] "Information’s Human Dimension" Tom Yager Byte, Dec. 1991, pp. 153-160 [Yormark77] "The ANSI/X3/SPARC/SG DBMS Architecture" B.
Yormark In [Jardine77]