Integrating Information from Heterogeneous Databases Using Agents and Metadata
Hazem Turki El Khatib ~ PhD Thesis, 2000

CHAPTER 2: THE INTEGRATION OF DATABASE SYSTEMS

2.1 INTRODUCTION

The problem of retrieving and integrating information from distributed heterogeneous databases has been investigated by a number of researchers. Numerous papers have been published describing a variety of approaches developed to handle this problem, although aspects of it remain unresolved. This chapter discusses some of the approaches reported in the literature for tackling the problems of heterogeneity among databases. A survey of, and comparison between, systems developed to handle this problem can be found in [1,2]. This chapter focuses on three commonly used and well-supported approaches: the physical approach, the global approach, and the multidatabase approach. Section 2.2 gives background information on distributed database systems, and section 2.3 presents database autonomy. Section 2.4 presents database heterogeneity. Section 2.5 discusses the physical approach, and section 2.6 the global approach. The multidatabase approach is presented in section 2.7, and section 2.8 presents other approaches. The approach adopted in this project is discussed in section 2.9. The chapter concludes with a summary in section 2.10.

2.2 THE MOVE TO DISTRIBUTED DATABASE SYSTEMS

A computer system provides four types of service [3]:

1. Data storage services, which provide users with efficient storage media.
2. Data access services, which provide functions for retrieving data from the storage media.
3. Application services, which provide users with capabilities to execute specific tasks.
4. Presentation services, which provide user interfaces to end-users.
A database system (DBS) is a data-storage and access system composed of two elements: a set of data, called a database, and a software program, called a database management system (DBMS) [4]. The main aim of such systems is to store and process information. Database management systems include functions for protecting data integrity, supporting easy maintenance, assuring data security, providing languages for the storage, retrieval and update of the data, and providing facilities for data backup and recovery.

Before the development of the DBMS, data were stored in separate data files. These files could be created and maintained by different applications and were likely to contain duplicate and redundant data. Moreover, there were significant problems if two applications wanted to share access to the same file, including writing to it. A DBMS provides centralised control of the data and solves the problems of shared access to data files [5,6]. The data in a database are organised according to a particular data model, such as the relational, hierarchical or network model. A schema describes the actual structures and organisation within the system.

Initially, database systems were developed as stand-alone systems operating independently. During the 1970s, centralised databases were in common use. A centralised DBS consists of a single centralised DBMS managing a single database on the same computer system [7]. However, more recent "innovations in communications and database technologies have engendered a revolution in data processing" [4], giving rise to a new generation of decentralised database systems.
A distributed (decentralised) database system is "made up of a single logical database that is physically distributed across a computer network, together with a distributed database management system (DDBMS) that answers correct queries and services update requests" [4]. Bell and Grimson [8] present a taxonomy of distributed data systems and classify them into two types:

Homogeneous distributed database management systems (HmDDBMSs)
Heterogeneous distributed database management systems (HgDDBMSs)

A homogeneous distributed database system is one in which all the physical components run the same distributed database management system, which supports a single data model and query language. In contrast, a heterogeneous distributed database system includes heterogeneous components at the database level, where, for example, the local nodes have different types of hardware, operating system [1], data models, query languages and schemas.

2.3 AUTONOMY

The concept of database autonomy refers to the ability of each local database system to have control over its own data and to perform various operations on it. Bukhres et al. [4] define database autonomy as "the ability of each local database system to control access to its data by other database systems, as well as the ability to access and manipulate its own data independently of other systems". Sheth and Larson [7] describe four types of autonomy:

1. Design Autonomy: the ability of a database system to choose its own design with respect to data model, query language, semantic interpretation of the data, the operations supported by the system, and the implementation of the system (e.g., concurrency control algorithms).

2. Communication Autonomy: the ability of a database system to decide whether and when to communicate with other database systems.
3. Execution Autonomy: the ability of a database system to decide how to execute local operations without interference from external operations, and the order in which to execute external operations.

4. Association Autonomy: the ability of a database system to determine the extent to which it will share its functionality and resources with others.

2.4 HETEROGENEITY

With the current explosion of data, retrieving and integrating information from a collection of independently designed autonomous databases has become a complex and critical problem [9,10,11,12]. There are several different forms of heterogeneity, discussed in detail in chapter 3, which have exercised the research community for nearly 20 years [5]. Many types of heterogeneity are the result of technological differences, for example, differences in hardware, system software (such as operating systems), and communication systems. The following sections present the approaches that different systems have taken to resolve the heterogeneity problem.

2.5 THE PHYSICAL APPROACH

The physical approach [4,13] integrates all data needed by an application into one database. This is referred to as the data warehouse, which is built up from different databases and contains large amounts of data (billions of records) [14]. It was "designed especially for decision support queries, therefore only data that is needed for decision support is extracted from the operational data and stored in the warehouse" [14]. Two approaches, "top down" and "bottom up", are used to build a data warehouse.
In the "top down" approach, the data warehouse is first built for the complete organisation, and from this the information needed by different end-users is selected. In the "bottom up" approach, smaller local data warehouses, known as datamarts, are used by end-users at a local level, and the data warehouse is generated from these. Figure 2.1 shows the relationship between databases, the data warehouse, and datamarts. Replication techniques are used to load the information from the different databases into the data warehouse.

[Figure 2.1 ~ The relationship between databases, the data warehouse, and datamarts]

2.5.1 The Physical Approach Architecture

The physical approach architecture might be referred to as the monolithic architecture. The physical approach integrates all data needed by an application into one database and, as illustrated in Figure 2.2, the monolithic architecture views the system as a standalone application which does not interact with any other system. It is one large application that contains data storage, business logic code and presentation code. This architecture was popular in the days of large mainframes, because such machines could manage applications of this size.

[Figure 2.2 ~ Monolithic Architecture]

2.5.2 The Physical Approach Disadvantages

The physical approach is incomplete because it does not allow data to be maintained independently; it requires an expensive application conversion, and the database administrator has to understand the structure of the database and keep track of any changes. Furthermore, it results in redundant data.
Although the monolithic architecture is simple to understand, it is difficult to develop, maintain and test.

2.6 THE GLOBAL APPROACH

An alternative solution is the global approach. Other terms used for the global approach are the logical approach [4,7,9,13], the composite approach [15], schema integration [16], and view integration [9,17,18]. Schema integration and view integration have been used to refer to closely related ideas. However, Spaccapietra et al. [17] argue that there are some differences between the two. View integration is "a process in classical database design deriving an integrated schema from a set of user views". In view integration, the views are usually based on the same data model (homogeneous) and have no "associated recorded extension". In database integration, on the other hand, the local schemas may be based on different data models and have an associated extension (i.e., they describe data which are actually stored in a database).

This approach has been widely reported in the literature [19,20,21] as the process of providing a global schema for all the databases, to which transactions need to be mapped [1]. It provides users with uniform access to data contained in various databases, without migrating the data to a new database, and without requiring the users to know either the location or the characteristics of the different databases and their corresponding DBMSs. Database designers attempt to reconcile the conflicts among all component databases by designing a global schema, against which users can pose queries. The definition of a global schema incorporates functions to resolve discrepancies and inconsistencies among the underlying databases.
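The uniform-access idea can be illustrated with a toy mediator: each global attribute name is mapped to the local name used by every component database, so the user poses one query against the global schema and the mediator rewrites it per source. This is a minimal sketch, not any particular system; all database and attribute names here are hypothetical.

```python
# Toy global-schema mediator: users query one global schema; the mediator
# rewrites each query into the naming conventions of every component database.
# All schema and attribute names below are hypothetical.

GLOBAL_TO_LOCAL = {
    "db_students": {"name": "stud_name", "id": "stud_no"},
    "db_registry": {"name": "full_name", "id": "person_id"},
}

def rewrite_query(global_attrs, source):
    """Map global attribute names to the local names used by one source."""
    mapping = GLOBAL_TO_LOCAL[source]
    return [mapping[attr] for attr in global_attrs]

def global_select(global_attrs):
    """Fan the same global query out to every component database."""
    return {src: rewrite_query(global_attrs, src) for src in GLOBAL_TO_LOCAL}

print(global_select(["name", "id"]))
# {'db_students': ['stud_name', 'stud_no'], 'db_registry': ['full_name', 'person_id']}
```

The user never sees `stud_no` or `person_id`; the resolution of naming discrepancies is confined to the mapping tables that define the global schema.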
The global schema approach was first described by Dayal and Hwang [9], and it arises in two different contexts [22]:

Global schema design
Logical database design

In global schema design, several databases already exist and are in use. The objective is to represent the contents of these databases by designing a single global schema. User queries can then be specified against this global schema, and the requests are mapped to the relevant databases [22,23]. In logical database design [23], each class of users designs a view of the part of the proposed database they need to access. The objective is to design a conceptual schema that represents the contents of all of these views. User queries and transactions specified against each view are mapped to the logical integrated schema.

Database design has been described as an art rather than a science, and it depends upon the experience of the designer. Two schools of thought are represented in [22] and [24]. Schema integration is a three-phase process. First, in the investigation phase, commonalities and discrepancies among the input schemas have to be determined [16] or detected [25]; the aim is to identify semantically related objects by a comparison process based on matching names, structures and constraints [18,26,27,28], in which their similarities and dissimilarities are discovered. Second, semantic conflicts between objects of the component databases are resolved [25]. Finally, integration is performed [17]. Interaction with the database administrator is required to solve conflicts among input schemas when the integrator does not have the knowledge to do so.

Many approaches and techniques for schema integration have been reported in the literature. A detailed survey by Batini et al. [16] discussed and compared twelve methodologies for the problem of schema integration.
They argue that the two problems to be dealt with during integration stem from the structural and semantic diversity of the schemas to be merged, which arises, for example, when different names are attached to the same concept in two views, or when a concept is represented as a relation in one database and as an attribute in another. The survey divides schema integration activities into five steps: preintegration, comparison of the schemas, conformation of the schemas, merging, and restructuring.

Preintegration: this involves analysis of the schemas to decide which schemas are to be integrated, the order of integration, and the number of schemas to be integrated at one time.

Comparison: this involves comparing the objects of the schemas to be integrated in order to determine the correspondences among objects and detect possible conflicts, including naming conflicts, domain conflicts, structural differences and missing data.

Conformation: this step enables more effective comparison by ensuring that related information is represented in a similar form in different schemas. Close interaction with designers and users is needed before compromises can be achieved.

Merging and Restructuring: these two steps are concerned with specifying the interrelationships between the schema objects in the different schemas, resulting in a global conceptual schema. This global schema must correctly contain all concepts present in any component schema; however, if the same concept is represented in more than one component schema, it must appear only once in the integrated schema. The integrated schema should be easy for both the designer and the end-user to understand.
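The comparison step above can be sketched mechanically: match attribute names across two schemas (optionally via a designer-supplied synonym table, the semi-automatic element) and flag same-name, different-type pairs as homonym/domain conflicts for the DBA to resolve. This is a deliberately naive sketch, with hypothetical schemas, of only the name-and-type part of the comparison.

```python
# Sketch of the "comparison" step: detect candidate correspondences and naming
# conflicts between two input schemas by matching attribute names.
# Schemas are given as {attribute_name: type}; all names are hypothetical.

def compare_schemas(schema_a, schema_b, synonyms=None):
    """Return candidate matches and conflicts between two schemas."""
    synonyms = synonyms or {}
    matches, conflicts = [], []
    for name_a, type_a in schema_a.items():
        # Follow a designer-supplied synonym table, if any (the semi-automatic step).
        name_b = synonyms.get(name_a, name_a)
        if name_b in schema_b:
            if type_a == schema_b[name_b]:
                matches.append((name_a, name_b))
            else:
                # Same name, different type: a homonym/domain conflict to resolve.
                conflicts.append((name_a, name_b))
    return matches, conflicts

s1 = {"stud_no": "int", "name": "str", "grade": "str"}
s2 = {"person_id": "int", "name": "str", "grade": "int"}
matches, conflicts = compare_schemas(s1, s2, synonyms={"stud_no": "person_id"})
print(matches)    # [('stud_no', 'person_id'), ('name', 'name')]
print(conflicts)  # [('grade', 'grade')]
```

Real methodologies also compare structures and constraints; the point of the sketch is that the mechanical part only produces candidates, which is why interaction with the designer remains necessary.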
Several other techniques have been identified to facilitate schema integration and are summarised in [5,29]: the catalogue technique, the hyperstructured technique, meta-translation, object equivalencing, mediation, intelligent co-operating systems, and KBS assist.

According to [17], two methodologies have been proposed for schema integration: the manual methodology and the semi-automatic methodology. The manual methodology, first developed by Motro and Buneman [30], aims at providing a tool which allows the DBA to build the integrated schema from the local schemas. The semi-automatic methodology uses a semi-automatic reasoning technique to discover similarity assertions between objects by evaluating some degree of similarity of, for example, names and structures [5,6]. Interaction with the DBA is invoked to accept or refuse some of the correspondence assertions and to solve conflicts that the system is unable to resolve; this is why it is called semi-automatic. Some semi-automatic tools developed to perform schema integration are reported in [17,18,31]. Nevertheless, Sheth and Larson [7] have argued that a completely automatic schema integration process is impractical, because it would require that all of the semantics of the schemas be completely specified.

2.6.1 The Global Approach Architecture

The disadvantages of the monolithic architecture opened the way for another architecture: the monolithic architecture has been replaced by the client/server architecture. With the availability of smaller and cheaper PCs and off-the-shelf database management systems, the client/server architecture has become widespread [32].
As shown in Figure 2.3, the client/server architecture divides the system functionality between the client and the server. The server acts as a producer, and the client as a consumer, of a service. The main concepts of the architecture are:

Server: performs services in response to requests sent by clients.
Client: sends requests to servers and receives the results of the service returned from the server.
Service: could be data, functions, etc.

[Figure 2.3 ~ Client/Server Architecture]

When a client also becomes a server in the client/server architecture, the relationship is referred to as a peer-to-peer connection. Two examples of protocol modules used in the client/server architecture are [33]:

the three-message protocol, in which the client makes a request, the server responds, and the client then acknowledges receipt of the response.
the single-shot protocol, in which only repeatable requests are used.

2.6.2 Schema Integration Processing Strategies

Batini et al. [16] have proposed different strategies for the sequencing and grouping of integration. As shown in Figure 2.4, each strategy can be drawn as a tree: the leaf nodes correspond to the component schemas, the non-leaf nodes to intermediate results of integration, and the root node to the final result. The primary classification of strategies is binary versus n-ary. Binary strategies integrate two schemas at a time. They are called "ladder" strategies when a new component schema is integrated with an existing intermediate result at each step; a binary strategy is "balanced" when the schemas are divided into pairs at the start. N-ary strategies integrate n schemas at a time (n > 2).
An n-ary strategy is "one shot" when the n schemas are integrated in a single step; it is "iterative" otherwise.

[Figure 2.4 ~ Types of integration-processing strategies: ladder, balanced, one-shot and iterative [16]]

2.6.3 CARNOT System

"There has been some work on the problem of accessing information distributed over multiple sources both in the AI-oriented database community and in the more traditional database community" [10]. A survey and comparison of these can be found in [1,2]. The use of a knowledge base to integrate a variety of information sources has been investigated by the AI-oriented database community. Breitbart [13] believes that, for a system involving hundreds of local databases, expert-system and knowledge-base technologies are required to help users in their access to the data. In this section the Carnot system [15] is presented as an example based on the global approach.

The Carnot system was developed at the Microelectronics and Computer Technology Corporation, Austin, Texas. The goal of the project is to integrate heterogeneous databases using a set of articulation axioms that describe how to map between SQL queries and domain concepts. Carnot uses the Cyc knowledge base as a global schema to build the articulation axioms (statements of equivalence between components of two theories). The schemas of individual resources are compared and merged with this knowledge base, although not with each other, making a global schema much easier to construct and maintain. The Carnot system uses the following knowledge in resolving semantic differences:

a structural description of the local schema.
schema knowledge, which is the structure of the data, integrity constraints, and allowed operations.
resource knowledge, which is a description of supported services, such as the data model and languages, lexical definitions of object names, the data itself, comments from the resource designer, and guidance from the integrator.
organisation knowledge, which is the corporate rules governing use of the resource.

As shown in Figure 2.5, resource integration is achieved by separate mappings between each information resource and the global schema. Each mapping consists of a syntax translation and a semantics translation. The syntax translation provides a bidirectional translation between a local data manipulation language and the global context language. The semantics translation is a mapping between two expressions that have equivalent meanings in the global context language. After integration, the information that becomes available can be used through a global view or a local view. However, Carnot has two problems: the difficulty of integrating the results returned from multiresource queries, and the large size of the global schema.
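The bidirectional character of these mappings can be shown with a toy sketch. Here an "articulation axiom" is reduced to a pair asserting equivalence between a local term and a global-context term, applied in either direction; this is an illustration of the idea only, not Carnot's actual axiom language, and all terms are hypothetical.

```python
# Toy illustration of articulation-axiom style mappings: statements of
# equivalence between local schema terms and global-context terms, applied
# bidirectionally. All terms are hypothetical.

AXIOMS = [("emp_salary", "Employee.salary"), ("emp_name", "Employee.name")]

LOCAL_TO_GLOBAL = dict(AXIOMS)
GLOBAL_TO_LOCAL = {g: l for l, g in AXIOMS}

def to_global(local_terms):
    """Translate local DML terms into the global context language."""
    return [LOCAL_TO_GLOBAL[t] for t in local_terms]

def to_local(global_terms):
    """Translate global-context terms back into the local DML."""
    return [GLOBAL_TO_LOCAL[t] for t in global_terms]

# Because each axiom is an equivalence, translation round-trips cleanly.
assert to_local(to_global(["emp_name"])) == ["emp_name"]
print(to_global(["emp_salary"]))  # ['Employee.salary']
```

Each resource keeps its own axiom set against the one global context, which is why resources are mapped to the knowledge base rather than to each other.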
[Figure 2.5 ~ Global and local views in semantic transaction processing [15]]

2.6.4 Schema Integration Disadvantages

Schema integration has the disadvantage that it is difficult and complex to create a single global schema for a large number of data sources, because of the large number of comparisons required, and that the schema is difficult to maintain every time a local schema changes [2,12,13,34]. Litwin and Abdellatif [11] argue that "a single schema for the thousands of databases is a dream". The global designers have to understand the underlying assumptions and semantics of each existing system, and the heterogeneous local database structures [4].

Distributing the processing between client and server allows applications to scale and develop easily. However, this architecture retains some of the problems of the monolithic case. One such problem occurs because the client and the server depend on each other: it is difficult to reuse the presentation code in another system with a different database, and vice versa [32]. This is because in the client/server architecture there is no independent place to store the implementation of the business processes. Often much of the processing goes on in the client or in the server, which means that either the client or the server becomes a large program. This drawback of the client/server architecture has been solved by the development of the n-tier architecture.
2.7 THE MULTIDATABASE APPROACH

The third approach is the multidatabase approach, or federated approach [4]. The terms "multidatabase system" and "federated system" are often used interchangeably [6]. The term "multidatabase system" itself is used by different authors to mean a number of different things. In practice, the terms "federated database system (FDBS)", "multidatabase system", "interoperable database system", and "heterogeneous DDBMS" are used synonymously [2]. They refer to systems that support access to multiple autonomous databases without a global schema. Sheth and Larson [7] use the term to describe heterogeneous distributed database systems. Litwin et al. [2], however, use it to mean a loosely coupled FDBS (see section 2.7.3); they proposed a reference architecture for the multidatabase approach whose levels are the "internal level", the "conceptual level", and the "external level". Bukhres et al. [4] consider a multidatabase system to be a collection of loosely coupled element databases, with no global schema applied for their integration: "a multidatabase system is a system that supports the execution of operations on multiple component database systems". Dayal [9] uses the term to describe a tightly coupled FDBS. Hurson and Bright [34] consider a multidatabase system to be a federated database, that is, a distributed system that provides a global interface to heterogeneous pre-existing local DBMSs, through which users can access multiple remote databases with a single query.
Nodine and Zdonik [35] define a multidatabase as a "system that integrates multiple autonomous heterogeneous database systems to allow them to be accessed using a uniform interface and to provide basic database support for consistency and persistence across the set of databases".

A key characteristic of the multidatabase is that the local DBMSs remain autonomous [2]. An essential component of such a system is a language designed to cope with local autonomy and data redundancy [5]. Such a language needs features which are not part of an ordinary database language, such as the use of logical database names in queries to qualify data elements in different databases, and support for autonomy alongside co-operation between database administrators [6].

2.7.1 The Multidatabase System Taxonomy

The basic multidatabase system taxonomy is shown in Figure 2.6. The taxonomy involves the following classification [4,7]:

Nonfederated Database System. A nonfederated multidatabase system integrates component systems that are not autonomous but are still heterogeneous. It has only one level of management, and all operations are performed uniformly. A nonfederated database system does not distinguish local and non-local users.

Federated Database System. A federated database system (FDBS) is a collection of co-operating database systems that are autonomous [7]. One of the significant aspects of an FDBS is that a component DBS can continue its local operations and at the same time participate in a federation to allow data sharing.
The term federated database system was introduced by Heimbigner and McLeod [36] to refer to a collection of databases in which sharing is made more explicit by allowing export schemas, which define the shareable part of each local database [after 16]. Since its introduction, the term has been used for several different but related DBS architectures, until Sheth and Larson [7] proposed a reference architecture based on all the preceding research [5].

[Figure 2.6 ~ Multidatabase system taxonomy]

2.7.2 The Federated Database Systems Architecture

The n-tier architecture can be viewed as a three-tiered architecture, a five-tiered architecture, or more. It breaks the application into three basic layers: a presentation layer (external schema), a business logic layer (conceptual schema), and a database layer (internal schema) [32,37], as shown in Figure 2.7. The presentation layer (service consumer) provides a user interface and acts as a client of the business logic servers. The business logic layer (service provider) acts as a server with which the client code interacts; it contains all the complicated application logic and does not take into account how data is stored. The database layer (data provider) has no idea what operations will be performed on it. Applications exist as co-operating components, which allows different clients to share the same business logic. This creates reusable software, which is easy to deploy and maintain and offers much greater flexibility and scalability.
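The three basic layers just described can be sketched in a few lines: the presentation layer calls only the business-logic layer, which alone touches the data layer, so either end can be swapped without disturbing the other. The enrolment domain and all names here are hypothetical.

```python
# Minimal sketch of the three basic n-tier layers: presentation calls
# business logic, which alone touches the data layer.
# The domain (course enrolment) and all names are hypothetical.

class DataLayer:
    """Data provider: stores rows, knows nothing about business rules."""
    def __init__(self):
        self.rows = []
    def insert(self, row):
        self.rows.append(row)

class BusinessLogic:
    """Service provider: enforces application rules, hides storage details."""
    def __init__(self, data):
        self.data = data
    def enrol(self, student, course):
        if not student or not course:   # an application rule lives here
            raise ValueError("student and course are required")
        self.data.insert({"student": student, "course": course})
        return "enrolled"

class Presentation:
    """Service consumer: formats results for the end-user."""
    def __init__(self, logic):
        self.logic = logic
    def submit(self, student, course):
        return f"{student}: {self.logic.enrol(student, course)}"

ui = Presentation(BusinessLogic(DataLayer()))
print(ui.submit("Hazem", "Databases"))  # Hazem: enrolled
```

Because the rule lives in `BusinessLogic`, a second client (say, a batch loader) can reuse it unchanged; this is the reusability the n-tier split buys over client/server.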
[Figure 2.7 ~ N-Tier Architecture]

The ANSI/X3/SPARC Study Group on Database Systems introduced the standard ANSI/SPARC three-level schema architecture for centralised DBMSs [7], as shown in Figure 2.8.

[Figure 2.8 ~ ANSI-SPARC three-level architecture]

Sheth and Larson [7] introduced software system components that link the three levels together. They introduced some component definitions and then extended the diagram shown in Figure 2.8 to that shown in Figure 2.9. The basic types of system component are:

1. Schemas, which are descriptions of data managed by a DBMS.

2. Mappings, which are functions that correlate the schema objects in one schema to the schema objects in another schema.

3. Commands, which are requests for specific actions that are either entered by a user or generated by a processor.

[Figure 2.9 ~ Extended ANSI-SPARC three-level architecture [7]]

4. Processors, which are software modules that manipulate commands and data. There are usually four types of processor, each performing different functions on data. The transforming processor translates commands/data from one language/format to another. This provides data model transparency, hiding differences in query language and data format.
Filtering processors constrain the commands and associated data that can be passed to another processor. Examples of filtering processors include a syntactic constraint checker, a semantic integrity constraint checker, and an access controller.

A constructing processor is used particularly in heterogeneous database systems. It partitions and/or replicates an operation submitted by a single processor into operations that are accepted by two or more processors. Constructing processors also merge data produced by several processors into a single data set for consumption by another single processor. This supports location, distribution and replication transparency, undertaking tasks such as schema integration (integrating multiple schemas into a single schema), negotiation between schema owners (to determine what protocol should be used among them), query decomposition and optimisation, and global transaction management (performing concurrency control).

The accessing processor executes commands to retrieve data from the database.

The three-level schema architecture is incomplete for describing the architecture of an FDBS. It must be extended to support the three dimensions of a federated database system: distribution, heterogeneity, and autonomy [38]. Sheth and Larson [7] proposed an architecture composed of five levels of schema, as shown in Figure 2.10. These schemas are:

Local Schema: each local DBMS has a local schema that defines all the local data. A local schema is expressed in the native data model of the component DBMS, and therefore different local schemas may be expressed in different data models.
Figure 2.10 ~ Five-level schema architecture of an FDBS [7] (external schemas over federated schemas, which integrate export schemas derived from the component schemas of each component DBS and its local schema)

Component Schema: a component schema is the local schema represented in the global data model, called the common data model (CDM), of the FDBS. Sheth and Larson [7] gave two reasons for defining component schemas in a CDM:

1. they describe the divergent local schemas using a single representation.

2. semantics that are missing from a local schema can be added to its component schema.

The process of schema translation from a local schema to a component schema generates the mappings between the component schema objects and the local schema objects. Transforming processors use these mappings to transform commands on a component schema into commands on the corresponding local schema. Such transforming processors and the component schemas support the heterogeneity feature of an FDBS.

Export Schema: not all data of a component DBS may be available to the federation and its users. An export schema represents the portion of the component schema that the component database chooses to make available to the FDBS. The purpose of defining export schemas is to facilitate control and management of association autonomy. The export schemas support the autonomy feature of an FDBS.

Federated Schema: a federated schema is an integration of several export schemas. A federated schema also includes the information on data distribution that is generated when the export schemas are integrated. The federated schemas support the distribution feature of an FDBS.
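The export and federated levels described above can be sketched as simple schema dictionaries. Everything here (sites, table names, the withheld salary column) is invented for illustration; the point is that each site exposes only part of its component schema, and the federated schema records where each exported table lives:

```python
# Sketch: deriving export schemas from component schemas, then
# integrating them into a federated schema carrying distribution info.

component_a = {"staff": ["id", "name", "salary"], "projects": ["pid", "title"]}
component_b = {"personnel": ["id", "name"], "grants": ["gid", "amount"]}

# Association autonomy: each component DBA exposes only part of its
# component schema. Site A withholds salaries; site B withholds grants.
export_a = {"staff": ["id", "name"], "projects": ["pid", "title"]}
export_b = {"personnel": ["id", "name"]}

def integrate(exports):
    """Merge export schemas, recording which site holds each table."""
    federated = {}
    for site, schema in exports.items():
        for table, columns in schema.items():
            federated[table] = {"columns": columns, "site": site}
    return federated

fed = integrate({"A": export_a, "B": export_b})
print(fed["personnel"])  # {'columns': ['id', 'name'], 'site': 'B'}
```

A real federated schema would also resolve naming and structural conflicts between exports (for instance, recognising that "staff" and "personnel" describe the same kind of entity); this sketch shows only the distribution bookkeeping.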
There may be multiple federated schemas in an FDBS, one for each class of federation users.

External Schema: an external schema defines a schema for a particular user or application.

Reddy et al. [39] proposed a methodology that transforms existing local databases to meet diverse application needs at the global level. It uses a four-layered schema architecture: local schema, local object schema, global schema, and global view schema.

Architectures with Additional Basic Components

There are several types of architectures with additional components that are extensions or variations of the basic components of the reference architecture. Such components improve the capabilities of an FDBS. An example is an auxiliary schema that stores the following types of information:

1. data needed by federation users but not available in any of the pre-existing component DBSs.

2. information needed to resolve incompatibilities (e.g., unit translation tables, format conversion information).

3. statistical information helpful in performing query processing and optimisation.

Extended Federated Architectures

To allow a federation user to access data from systems other than the component DBSs, the five-level schema architecture can be extended in additional ways, for example by replacing a component database with a collection of application programs. It is conceptually possible to replace some database tables by application programs. For example [7], a table containing pairs of equivalent Fahrenheit and Celsius values can be replaced by a procedure that calculates the value on one scale from the value on the other.

2.7.3 Coupling in Federated Database Systems

As shown in Figure 2.6, federated database systems can be classified as tightly coupled or loosely coupled systems.
The major difference between these two classifications lies in who manages the federation and how the components are integrated [7].

Tightly Coupled. An FDBS is tightly coupled if the federation and its administrator(s) have the responsibility for creating and maintaining the federation and actively control access to the component DBSs. Tightly coupled federations take the form of schema integration [4] and may have one or more federated schemas. A tightly coupled FDBS is said to have a single federation if it allows the creation and management of only one federated schema, and multiple federations if it allows the creation and management of several. It provides location, replication, and distribution transparency.

The administration of a tightly coupled FDBS proceeds as follows. First, export schemas are created by negotiation between a component DBA and the federation DBA; the component DBA has control over what is included in the export schemas. Then the federation DBA creates and controls the federated schema; s/he is usually allowed to read the component schemas to help determine what data are available and where they are located, and then to negotiate for their access. Finally, external schemas are created by negotiation between a federation user and the federation DBA, who has the authority to decide what is included in each external schema. DDTS [40] can be categorised as a tightly coupled FDBS with a single federation; Mermaid [41] and Multibase [9] are examples of tightly coupled FDBSs with multiple federations.

Loosely Coupled. In loosely coupled federated database systems, users are largely responsible for the administration of the federated system. There is no central authority that controls the creation of, or access to, data.
Each component system is responsible for constructing a global schema view and for processing queries that access remote component systems. Loosely coupled federations take the form of schema importation [16], interoperable database systems, or multidatabase systems. Loosely coupled systems do not maintain hard links into component databases; only those databases that require integration are attached, as needed, to fulfil the transaction request. A loosely coupled FDBS always supports multiple federated schemas and provides an interface for dealing directly with multiple component DBMSs. A typical way to formulate queries is to use a multidatabase language.

A typical process for developing federated schemas in a loosely coupled FDBS is as follows. Each federation user is the administrator of his or her own federated schema. First, a federation user looks at the available set of export schemas to determine which of them describe data s/he would like to access. Next, the federation user defines a federated schema by importing the export schema objects, using a user interface or an application program, or by defining a multidatabase language query that references export schema objects. The user is responsible for understanding the semantics of the objects in the export schemas and for resolving the DBMS and semantic heterogeneity. Finally, the federated schema is named and stored under the account of the federation user, who is its owner.

The loosely coupled approach may be inappropriate for more traditional business or corporate databases, where system control is desirable, where the users are naïve and would find it difficult to perform negotiation and integration themselves, or where location, distribution, and replication transparencies are desirable.
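The federated-schema development process described above, in which the user imports export schema objects and resolves semantic heterogeneity personally, can be sketched as follows. The two export sources, their attribute names, and the rename mapping are all invented for the example:

```python
# Loosely coupled sketch: the federation user, not a federation DBA,
# selects export schema objects and supplies the semantic mapping
# (here, attribute renamings) needed to combine them into one view.

export_hotel = {"hotels": [{"name": "Astra", "city": "Rome", "cost": 90}]}
export_travel = {"lodging": [{"hotel": "Metro", "town": "Rome", "price": 120}]}

# The user resolves semantic heterogeneity: "lodging" rows are hotels,
# and its attribute names map onto the user's preferred vocabulary.
rename = {
    "lodging": {"hotel": "name", "town": "city", "price": "cost"},
    "hotels": {},  # already in the user's vocabulary
}

def import_table(source, table):
    mapping = rename[table]
    return [{mapping.get(k, k): v for k, v in row.items()}
            for row in source[table]]

# The user's private federated view, stored under his or her own account.
view = import_table(export_hotel, "hotels") + import_table(export_travel, "lodging")
rome = [r for r in view if r["city"] == "Rome"]
print(sorted(r["name"] for r in rome))  # ['Astra', 'Metro']
```

This makes visible why the approach demands sophisticated users: the rename table encodes knowledge about the semantics of each export schema that, in a tightly coupled system, a federation DBA would supply once for everyone.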
On the other hand, the loosely coupled approach provides a greater degree of autonomy for the component systems, because no central authority is imposed. In addition, loose coupling generally tends to scale better to very large systems; however, maintaining consistency is more difficult in a system that lacks a central authority [4].

Breitbart [13] argues that the federated approach is more advantageous than the global schema approach: it is popular, and many currently developed systems use it to resolve heterogeneity problems between databases, since it avoids constructing a global schema and is easier to maintain. To achieve this, a system must maintain knowledge about the contents of each database, both to know what to include in a query and to help resolve conflicts between sources, because there is no global schema to provide information about the semantics of the databases.

2.7.4 Multimedia Information Presentation System (MIPS)

In this section the MIPS system [42,43] is presented as an example of the loosely coupled FDBS approach. The MIPS system was developed in a European-funded project to link together distributed heterogeneous databases. It supports a single user query that can retrieve information from a collection of databases and provides a single integrated answer to the user. The MIPS system analyses the query, identifies the databases required to answer it, fetches the information, assembles the results, and presents them to the user. Ideally, all of this is done transparently.

The architecture of the MIPS system is illustrated in Figure 2.11. The system consists of a set of tools which can be grouped into three layers: the Presentation Layer, the Dialogue Management Layer and the Data Layer. The Presentation Layer accepts a user request and displays the results.
The Dialogue Management Layer consists of two major components, the Knowledge Based System (KBS) and the Selection and Retrieval Tool (SRT), together with a minor one, the External Data Access (EDA). The Data Layer represents the data sources that can be accessed. The SRT consists of two sub-modules, Breakdown and Clarification (B&C) and Assembly of the Consolidated Answer (ACA). The B&C sub-module accepts the user query from the Presentation Layer and, with help from the KBS, breaks it down into a set of sub-queries targeted at the databases. The ACA sub-module assembles the result and sends it to the Presentation Layer. Communication amongst the original MIPS system modules is managed through an Internal Representation Language (IRL), which exists in a number of dialects between different pairs of modules.

However, this system is limited for two reasons. First, the number of databases that can be handled in MIPS is restricted by the adoption of a centralised KBS module, which contains the knowledge about the available databases (schemas, locations, protocols, query languages, etc.) as well as the knowledge necessary for resolving database heterogeneity. Second, the mechanism used for accessing databases is restricted, since it is based on a standard commercial communication protocol to link to the databases; this requires considerable effort to add a new database to the KBS.

Figure 2.11 ~ The MIPS Architecture

2.8 OTHER APPROACHES

Ceri and Widom [44] argue that when a multidatabase environment includes facilities at each site for production rules and persistent queues, these facilities can be used to maintain consistency across semantically heterogeneous databases.
Production rules in database systems allow the specification of database operations that are executed automatically whenever certain events occur or conditions are met. Persistent queues in multidatabase (or client-server) environments provide a mechanism for the reliable execution of asynchronous transactions on remote data. In a multidatabase environment with both facilities, consistency across semantically heterogeneous databases can be maintained automatically: rules are triggered by any changes to a database that may create inconsistencies.

Chatterjee and Segev [31] present a probabilistic technique for resolving data heterogeneity problems. They discuss the Entity Join, which can be used to join records across different databases, and present a probabilistic model for estimating the accuracy of the join in a heterogeneous environment.

2.9 THE APPROACH ADOPTED IN THIS RESEARCH

From the above discussion, the approach developed in this work adopts the features of the federated approach used in the MIPS system to resolve the heterogeneity problems. However, this architecture distributes the knowledge over the network instead of storing it in a centralised knowledge base. It hides the location and structure of the databases from the users without the need for creating a global schema, and it raises the number of databases that can be attached to the system. It also distributes the task of resolving heterogeneity between autonomous, cooperating agents. The structure of the system is presented in Chapter 5, and more detailed functional descriptions of the individual agents follow in Chapters 6, 7 and 8. In this approach, users can send queries using concepts mediated between them and the databases; these queries are transformed into the shared context and then transferred to the appropriate databases.
The returned data is processed to resolve heterogeneities and assembled into a single answer.

2.10 SUMMARY

With the large number of databases now accessible to users, the need to retrieve and integrate information from distributed, autonomous, heterogeneous databases has been investigated by a number of researchers. This chapter has set out some of the research issues that have been addressed in relation to the problem of providing access to such databases. Various papers describing a variety of approaches to this problem have been published, although aspects of the problem remain. Three commonly used and well-supported approaches are the physical approach, the global approach, and the multidatabase approach.

The physical approach requires the database administrator to understand the structure of the databases and keep track of changes to the underlying databases, and it is likely to result in redundant data. The global approach has the advantage that it possesses location, replication, and distribution transparency, achieved mainly by schema integration. However, it is complex to create a single global schema for a large number of data sources because of the large number of comparisons required, and the global designers have to understand the underlying assumptions and semantics of each existing system, as well as the heterogeneous local database structures. The multidatabase approach is more advantageous than the global schema approach: it is popular, many currently developed systems use it to resolve heterogeneity between databases, and it avoids constructing a global schema.
Since there is no global schema to provide information about the semantics of the databases, the system must maintain knowledge about the contents of each database, both to know what to include in a query and to help resolve conflicts between sources.

The approach developed in this work adopts the features of the federated approach in order to resolve the heterogeneity problems. This architecture distributes the knowledge over the network instead of storing it in a centralised knowledge base.

REFERENCES

[1] G. Thomas, G.R. Thompson, C. Chung, E. Barkmeyer, F. Carter, M. Templeton, S. Fox, B. Hartman, Heterogeneous distributed database systems for production use, ACM Computing Surveys 22 (3) (1990) 237-266.
[2] W. Litwin, L. Mark, N. Roussopoulos, Interoperability of multiple autonomous databases, ACM Computing Surveys 22 (3) (1990) 267-293.
[3] Y. Bishr, Semantic Aspects of Interoperable GIS, PhD thesis, International Institute for Aerospace Survey and Earth Sciences (ITC), Netherlands, 1997.
[4] O.A. Bukhres, A.K. Elmagarmid, F.F. Gherfal, X. Liu, K. Barker, T. Schaller, The integration of database systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1996, pp. 37-56.
[5] L.M. MacKinnon, Intelligent Query Manipulation for Heterogeneous Databases, PhD thesis, Department of Computing & Electrical Engineering, Heriot-Watt University, Edinburgh, October 1998.
[6] J. Hu, Interoperability of Heterogeneous Medical Databases, PhD thesis, Department of Computing & Electrical Engineering, Heriot-Watt University, Edinburgh, May 1994.
[7] A.P. Sheth, J.A. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys 22 (3) (1990) 183-232.
[8] D. Bell, J. Grimson, Distributed Database Systems, Addison-Wesley, Wokingham, England, 1992.
[9] U. Dayal, H. Hwang, View definition and generalization for database integration in a multidatabase system, IEEE Transactions on Software Engineering SE-10 (6) (1984) 628-645.
[10] Y. Arens, C. Chee, C. Hsu, C. Knoblock, Retrieving and integrating data from multiple information sources, International Journal of Intelligent and Cooperative Information Systems 2 (2) (1993) 127-158.
[11] W. Litwin, A. Abdellatif, Multidatabase interoperability, Computer 19 (12) (1986) 10-18.
[12] S. Navathe, A. Savasere, A schema integration facility using object-oriented data model, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1996, pp. 105-127.
[13] Y. Breitbart, Multidatabase interoperability, SIGMOD Record 19 (3) (1990) 53-60.
[14] P. Adriaans, D. Zantinge, Data Mining, Addison-Wesley Longman, England, 1998.
[15] C. Collet, M.N. Huhns, W.-M. Shen, Resource integration using a large knowledge base in Carnot, Computer 24 (12) (1991) 55-62.
[16] C. Batini, M. Lenzerini, S.B. Navathe, A comparative analysis of methodologies for database schema integration, ACM Computing Surveys 18 (4) (1986) 323-364.
[17] S. Spaccapietra, C. Parent, Y. Dupont, Model independent assertions for integration of heterogeneous schemas, VLDB Journal 1 (1) (1992) 81-126.
[18] S. Navathe, S. Gadgil, A methodology for view integration in logical database design, in: Proceedings of the Eighth International Conference on Very Large Data Bases, Mexico City, VLDB Endowment, Saratoga, California, 1982, pp. 142-164.
[19] T. Landers, R. Rosenberg, An overview of Multibase, in: Proceedings of the 2nd International Symposium on Distributed Databases, 1982, pp. 153-183.
[20] Y. Arens, C.A. Knoblock, Planning and reformulating queries for semantically modeled multidatabase systems, in: Proceedings of the 1st International Conference on Information and Knowledge Management, 1992, pp. 92-101.
[21] R. Ahmed, P.D. Smedt, W. Du, W. Kent, M.A. Ketabchi, W.A. Litwin, A. Raffi, M.C. Shan, The Pegasus heterogeneous multidatabase system, Computer 24 (12) (1991) 19-27.
[22] S.B. Navathe, T. Sashidhar, R. Elmasri, Relationship merging in schema integration, in: Proceedings of the 10th International Conference on Very Large Data Bases, Singapore, 1984, pp. 78-90.
[23] J.A. Larson, S.B. Navathe, R. Elmasri, A theory of attribute equivalence in databases with application to schema integration, IEEE Transactions on Software Engineering 15 (4) (1989) 449-463.
[24] R. Elmasri, S. Navathe, Object integration in logical database design, in: Proceedings of the IEEE COMPDEC Conference, 1984, pp. 426-433.
[25] M. Solaco, F. Saltor, M. Castellanos, Semantic heterogeneity in multidatabase systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, 1996, pp. 129-202.
[26] C. Batini, M. Lenzerini, A methodology for data schema integration in the entity-relationship model, IEEE Transactions on Software Engineering SE-10 (6) (1984) 650-664.
[27] M. Siegel, S. Madnick, A metadata approach to resolving semantic conflicts, in: Proceedings of the Seventeenth International Conference on Very Large Data Bases, Barcelona, September 1991.
[28] C. Yu, B. Jia, W. Sun, S. Dao, Determining relationships among names in heterogeneous databases, SIGMOD Record 20 (4) (1991) 79-80.
[29] K.G. Jeffery, L. Hutchinson, J. Kalmus, M. Wilson, W. Behrendt, C. MacNee, A model for heterogeneous distributed database systems, in: D.S. Bowers (Ed.), Directions in Databases: Proceedings BNCOD 12, Guildford, U.K., Lecture Notes in Computer Science 826, Springer-Verlag, July 1994, pp. 221-234.
[30] A. Motro, P. Buneman, Constructing superviews, in: Proceedings of the International Conference on Management of Data, Ann Arbor, Michigan, ACM, New York, 1981, pp. 56-64.
[31] A. Chatterjee, A. Segev, Data manipulation in heterogeneous databases, SIGMOD Record 20 (4) (1991) 64-68.
[32] D.J. Berg, J.S. Fritzinger, Advanced Techniques for Java Developers, John Wiley & Sons, 1999.
[33] J.M. Crichlow, The Essence of Distributed Systems, Pearson Education, Essex, England, 1999.
[34] A.R. Hurson, M.W. Bright, Object-Oriented multidatabase systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1996, pp. 1-33.
[35] M.H. Nodine, S.B. Zdonik, The impact of transaction management on object-oriented multidatabase views, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1996, pp. 57-104.
[36] M. Hammer, D. McLeod, On database management system architecture, MIT Laboratory for Computer Science, MIT/LCS/TM-141, October 1979.
[37] B. Elbert, B. Martyna, Client/Server Computing: Architecture, Applications, and Distributed Systems Management, Artech House, 1994.
[38] R.M. Colomb, M.E. Orlowska, Interoperability in information systems, Information Systems Journal, May 1994, pp. 37-50.
[39] M.P. Reddy, B.E. Prasad, P.G. Reddy, A. Gupta, A methodology for integration of heterogeneous databases, IEEE Transactions on Knowledge and Data Engineering 6 (6) (1994) 920-933.
[40] P. Dwyer, J. Larson, Some experiences with a distributed database testbed system, Proceedings of the IEEE 75 (5) (1987) 633-647.
[41] M. Templeton, D. Brill, S.K. Dao, E. Lund, P. Ward, A.L.P. Chen, R. MacGregor, Mermaid: a front-end to distributed heterogeneous databases (invited paper), Proceedings of the IEEE 75 (5) (1987) 695-708.
[42] L.M. Mackinnon, D.H. Marwick, M.H. Williams, A model for query decomposition and answer construction in heterogeneous distributed database systems, Journal of Intelligent Information Systems 11 (1998) 69-87.
[43] W.J. Austin, E.K. Hutchinson, J.R. Kalmus, L.M. Mackinnon, K.G. Jeffery, D.H. Marwick, M.H. Williams, M.D. Wilson, Processing travel queries in a multimedia information system, in: Proceedings of Information & Communications Technologies in Tourism, Springer-Verlag, 1994, pp. 64-71.
[44] S. Ceri, J. Widom, Managing semantic heterogeneity with production rules and persistent queues, in: Proceedings of the 19th International Conference on Very Large Data Bases, Dublin, Ireland, 1993, pp. 108-119.