Integrating Information from Heterogeneous Databases Using Agents and Metadata
Hazem Turki El Khatib, PhD Thesis ~ 2000

CHAPTER 1
INTRODUCTION

1.1 BACKGROUND

From the early days of civilisation, humans have invented methods for locating or collecting resources and distributing them to the communities that need them. Thousands of years ago, the Romans built stone networks that soared above the underlying buildings to bring water directly from its source to their cities. Today, from aqueducts to oil pipelines to postal services, civilisations depend on network systems that gather, filter, and then distribute goods and services. The computer network is the most recent example of such a network.

The development of computer networks during the late 1980s and 1990s made it possible to link distributed computers. Moreover, the development of the World Wide Web (WWW) made it possible to access different data sources through the Internet. However, when the WWW was first developed, it offered access only to semi-structured data (e.g. an HTML document) or unstructured data (e.g. a text file). More recently, access to structured data (e.g. a database) through Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), or some other technology has become possible.

The introduction of the WWW and ODBC/JDBC has extended users' data processing capabilities from access to a single database on a local host to access to a number of different databases distributed over the network. The present data processing situation is characterised by a growing number of applications that require access to data from a set of heterogeneous distributed databases. This raises the problem of integrating and accessing heterogeneous distributed databases.
1.2 PROBLEMS OF INTEGRATING INFORMATION

With the current explosion of information accessible through the Internet, the retrieval and integration of information from heterogeneous data sources is a challenging problem. Much work has been done in this area, although aspects of the problem remain unsolved.

To understand better the nature of heterogeneous distributed database systems, consider the following example of a medical heterogeneous database. In this example, four databases are shown in Figures 1-4 in Appendix 1. The question to be considered is: find the weights of all male patients weighed within the last year. To answer the question, the user would have to access more than one database. Since each database uses a different format for representing data and attaches different meanings to the concepts, answering this question is not straightforward.

As another example, consider two doctors working in different hospitals located in different areas. One is a physician and the other is a surgeon. The physician needs and stores information about patients, information which is also needed by the surgeon. What prevents the physician and the surgeon from sharing their information instead of storing the same data again? There are several reasons:

1. Difference in hardware; each doctor may have different kinds of hardware (machine server, network).
2. Difference in operating system.
3. Difference in the way data is captured and stored. These differences include:
- Naming heterogeneity, when the same values are stored in different databases but the names given to the attributes differ between systems.
- Relational structure heterogeneity, when the composition of elementary attributes into composite structures varies but, once again, the values stored are identical.
- Value heterogeneity, when the way in which values are represented is different in different databases.
- Semantic heterogeneity, when the data stored in different databases embodies different assumptions.
- Data model heterogeneity, when the data model itself is the issue.
- Timing heterogeneity, when the structure of a database, the representation of attributes and the values themselves change over time.

In subsequent chapters the differences in (3) above will be studied in greater detail.

1.3 RESEARCH OBJECTIVES AND CONTRIBUTIONS

The overall objective of this research was to develop and implement a system to integrate information from heterogeneous distributed databases with the following properties:

1. It should provide users with transparent access to data sources. Transparency means hiding from the user everything that would otherwise have to be known in order to retrieve the data: the heterogeneity between databases, where data is physically stored, which databases are being accessed, the structure and size of the data, the query language, etc.
2. Functions to resolve the heterogeneity must be performed automatically by the system and be transparent to the user.
3. The system architecture must be extensible, flexible, and adaptable to increasing system size. To achieve this, the system has to distribute over the network the knowledge about the databases it connects to and the knowledge about how to resolve the heterogeneity between them. This information is stored in metadata at the database level.
4. The system should also maintain the autonomy of the underlying databases.
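The naming and value heterogeneity listed in Section 1.2, and the requirement that resolving them be transparent to the user, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the two sources, their attribute names (`gender`, `wt_lb`, `sex`, `weight`), the value codes, and the common terms (`sex`, `weight_kg`) are all hypothetical and are not taken from the databases in Appendix 1. The date restriction of the example question is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AttributeMapping:
    """Hypothetical per-database metadata entry for one common term."""
    local_name: str                         # attribute name used locally (naming heterogeneity)
    to_common: Callable[[object], object]   # value conversion (value heterogeneity)


# Assumed metadata: source A stores sex as "M"/"F" and weight in pounds.
HOSPITAL_A: Dict[str, AttributeMapping] = {
    "sex": AttributeMapping("gender", lambda v: {"M": "male", "F": "female"}[v]),
    "weight_kg": AttributeMapping("wt_lb", lambda v: round(v * 0.45359237, 1)),
}

# Assumed metadata: source B stores sex as 1/2 and weight in kilograms.
HOSPITAL_B: Dict[str, AttributeMapping] = {
    "sex": AttributeMapping("sex", lambda v: {1: "male", 2: "female"}[v]),
    "weight_kg": AttributeMapping("weight", lambda v: float(v)),
}


def to_common_view(row: dict, metadata: Dict[str, AttributeMapping]) -> dict:
    """Rename local attributes to common terms and convert their values."""
    return {term: m.to_common(row[m.local_name]) for term, m in metadata.items()}


def male_weights(sources: List[Tuple[List[dict], Dict[str, AttributeMapping]]]) -> List[float]:
    """Answer 'weights of all male patients' across sources in a single format."""
    results = []
    for rows, metadata in sources:
        for row in rows:
            common = to_common_view(row, metadata)
            if common["sex"] == "male":
                results.append(common["weight_kg"])
    return results


rows_a = [{"gender": "M", "wt_lb": 200}, {"gender": "F", "wt_lb": 150}]
rows_b = [{"sex": 1, "weight": 80}, {"sex": 2, "weight": 60}]
print(male_weights([(rows_a, HOSPITAL_A), (rows_b, HOSPITAL_B)]))  # [90.7, 80.0]
```

The user-facing function works only with the common terms; every source-specific name and code is confined to that source's metadata, so the underlying databases themselves are left unchanged.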
The retrieval system uses agents to enable data retrieval and answer construction from autonomous, distributed, heterogeneous data sources, taking account of the syntactic and semantic differences between data sources. Each database has its own metadata description, created by the database administrator (DBA) on the basis of the system ontology, and Web technologies are used to interface with the underlying databases. The benefits of such systems are better user/customer service (the user/customer does not have to log in to different databases and retrieve the required information in many operations at different stages) and, as a result, faster time to market, as organisations can respond more quickly to their customers' demands.

Contributions made by this research to meet the objectives include:

- The research presents a novel agent-based architecture, which distributes the knowledge over the network instead of storing it in a centralised knowledge base. This architecture analyses the user query, identifies the databases required to answer it, fetches the information, assembles the results, and presents them to the user.
- A framework is presented for classifying different aspects of heterogeneity in the data set.
- An approach has been developed to distribute the task of resolving heterogeneity between autonomous and co-operating agents transparently to the user.
- A system has been produced that is extensible and adaptable to increasing system size, and that allows databases to be added or removed with little effort. This has been achieved through the distribution of the knowledge over the network at the database level by the creation of metadata mapping database attributes onto the domain ontology. This metadata also provides support for the resolution of heterogeneity between databases within the system.
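The steps named in the first contribution (identify the databases required, fetch the information, assemble the results) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the architecture developed in later chapters: the `SourceAgent` class, its methods, and the ontology terms are hypothetical, and each agent carries its own attribute mapping to mimic metadata held at the database level rather than in a central knowledge base.

```python
from typing import Dict, List


class SourceAgent:
    """Hypothetical stand-in for an agent wrapping one autonomous database."""

    def __init__(self, name: str, term_to_column: Dict[str, str], rows: List[dict]):
        self.name = name
        self.term_to_column = term_to_column  # metadata: ontology term -> local column
        self._rows = rows                     # stands in for the real database

    def covers(self, terms: List[str]) -> bool:
        """True if this source's metadata maps every requested term."""
        return set(terms) <= set(self.term_to_column)

    def fetch(self, terms: List[str]) -> List[dict]:
        """Return the rows expressed in ontology terms instead of local names."""
        return [{t: row[self.term_to_column[t]] for t in terms} for row in self._rows]


def answer(terms: List[str], agents: List[SourceAgent]) -> List[dict]:
    """Select the sources that can answer the query, fetch, and merge the results."""
    selected = [a for a in agents if a.covers(terms)]
    merged = []
    for agent in selected:
        for record in agent.fetch(terms):
            merged.append({**record, "source": agent.name})
    return merged


clinic = SourceAgent("clinic", {"patient_id": "pid", "weight": "wt"}, [{"pid": 1, "wt": 70}])
ward = SourceAgent("ward", {"patient_id": "id"}, [{"id": 2}])
print(answer(["patient_id", "weight"], [clinic, ward]))
```

Adding or removing a database in this sketch means adding or removing one agent together with its own mapping; no shared structure has to be edited, which is the property the distributed-metadata design aims for.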
1.4 THESIS STRUCTURE

The research has been conducted in four phases. The first phase sets the scope of the study and classifies the problems of integrating information from heterogeneous distributed databases. The second phase handles the system architecture perspective, addresses the technologies that will be used in the system, and builds the External Data Access Agent. The third phase builds the set of agents responsible for resolving heterogeneity between data retrieved from heterogeneous distributed databases. The fourth phase is concerned with building the set of agents responsible for locating suitable databases to answer the user query.

This thesis is laid out in ten chapters. In Chapter 2, the architectures for distributed systems are studied. A major challenge in developing a system that provides access to a collection of databases is to resolve the heterogeneity that may exist between different databases. To assist in handling this problem, Chapter 3 proposes a framework for classifying different aspects of heterogeneity in data sets, and relates to this framework the various aspects of heterogeneity discussed by different researchers. From this, a test suite has been developed to evaluate systems that provide access to heterogeneous databases.

The objective of Chapter 4 is to study various models that have been proposed to resolve heterogeneity among heterogeneous distributed databases, and to show how this work benefits from them and creates an improved approach. Chapter 5 presents the system requirements and the architecture needed to satisfy these requirements. The system architecture consists of six levels of functionality within a five-layer model. In this chapter the concepts of software agent, ontology, and metadata are discussed.
Chapter 6 describes the roles of agents in the Query Layer that are responsible for the query break-down processes. The roles of agents in the Information Finder Layer are presented in Chapter 7. These agents are responsible for locating a suitable data source to answer the query and for helping to resolve conflicts between data sources by providing information about databases. Chapter 8 describes the roles of agents in the Answer Layer, which are responsible for resolving conflicts that may occur in the result. Chapter 9 is a description of the system implementation, and shows how this system benefits from technology such as CORBA, Java, and XML. Conclusions and issues for future research are outlined in Chapter 10.

1.5 PUBLICATIONS

One paper has already been published on material presented in this thesis and two others have been submitted for publication. Chapter 3 substantially reproduces the paper:

- A framework and test-suite for assessing approaches to resolving heterogeneity in distributed databases. Hazem T. El-Khatib, M. Howard Williams, Lachlan M. MacKinnon, David H. Marwick. Information and Software Technology, Volume 42, Issue 7 (1 May 2000), pp. 505-515.

The papers submitted for publication are:

- Applying web technology to linking to heterogeneous data sources. David H. Marwick, M. Howard Williams, Lachlan M. MacKinnon, Hazem T. El-Khatib. Submitted for publication.
- Using Agents to Retrieve and Integrate Information from Heterogeneous Distributed Databases. Hazem T. El-Khatib, M. Howard Williams, David H. Marwick, Lachlan M. MacKinnon. Submitted for publication.