Integrating Information from Heterogeneous Databases Using Agents and Metadata
Chapter 1
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
From the early days of civilisation, humans have invented methods for locating or
collecting resources and distributing them to communities that need them.
Thousands of years ago, Romans built stone networks that soared above the
underlying buildings to bring water directly from its source to their cities. Today, from
aqueducts to oil pipelines to postal services, civilisations depend on network
systems that gather, filter, and then distribute goods and services.
The computer network is the most recent example of such a network. The
development of computer networks during the late 1980s and 1990s provided users
with the possibility of linking distributed computers. Moreover, the development of
the World Wide Web (WWW) provided users with the possibility of accessing
different data sources through the Internet. However, when the WWW was
developed, it offered access to either semi-structured data (e.g. an HTML document)
or unstructured data (e.g. a text file). More recently, access to structured data (e.g. a
database) through Open Database Connectivity (ODBC), Java Database
Connectivity (JDBC), or some other technology, has become possible.
Hazem Turki El Khatib
1
PhD Thesis ~ 2000
The introduction of the WWW and ODBC/JDBC has extended users' data processing
capabilities from accessing a single database on a local host to accessing a
number of different databases distributed over the network. The present data
processing situation is characterised by a growing number of applications that
require access to data from a set of heterogeneous distributed databases. This
opens up the problem of integrating and accessing heterogeneous distributed
databases.
1.2 PROBLEMS OF INTEGRATING INFORMATION
With the current explosion of information accessible through the Internet, the
retrieval and integration of information from heterogeneous data sources is a
challenging problem. Much work has been done in this area, although aspects of the
problem remain unsolved. To understand better the nature of heterogeneous distributed
database systems, the following example of a medical heterogeneous database may
be considered. In this example, four databases are shown in Figures 1-4 in
Appendix 1. The question to be considered is: find the weights of all male patients
weighed within the last year. To answer the question, the user would have to access
more than one database. Since each database uses a different format for
representing data and attaches different meanings to the concepts, answering this
question is not straightforward.
In another example, consider two kinds of doctors in different hospitals located in
different areas. One is a physician, and the second is a surgeon. The physician
needs and stores information about patients, information which is also needed by
the surgeon. What prevents the physician and the surgeon from sharing their
information instead of storing the same data again? There are several reasons:
1. Difference in hardware; each doctor may have different kinds of hardware
(machine server, network).
2. Difference in operating system.
3. Difference in the way data is captured and stored. These include:
- Naming heterogeneity, when the same values are stored in different databases
  but the names given to the attributes are different in different systems.
- Relational structure heterogeneity, when the composition of elementary attributes
  into composite structures varies but once again values stored are identical.
- Value heterogeneity, when the way in which values are represented is different in
  different databases.
- Semantic heterogeneity, when the data stored in different databases embodies
  different assumptions.
- Data model heterogeneity, when the data model itself is the issue.
- Timing heterogeneity, when the structure of a database, the representation of
  attributes and the values themselves change over time.
In subsequent chapters the differences in (3) above will be studied in greater detail.
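As an illustration of naming and value heterogeneity, the sketch below applies the
earlier query (weights of male patients) to two invented sources. It is a minimal
sketch in Python, not the system developed in this thesis: the record layouts, the
attribute names (`pid`, `wt_lb`, and so on), the sex codes, and the choice of pounds
versus kilograms are all assumptions made for illustration and are not the schemas
of Appendix 1. The date restriction in the original question is omitted for brevity.

```python
# Hypothetical records from two sources. Field names, codes, and units are
# assumptions made for illustration, not the schemas shown in Appendix 1.
HOSPITAL_A = [
    {"patient_id": 1, "sex": "M", "weight_kg": 82.0},
    {"patient_id": 2, "sex": "F", "weight_kg": 61.5},
]
HOSPITAL_B = [
    {"pid": 7, "gender": "male", "wt_lb": 176.0},
    {"pid": 8, "gender": "female", "wt_lb": 140.0},
]

LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound

# Per-source mapping: local attribute name -> (common name, value converter).
# The attribute renaming addresses naming heterogeneity; the converters
# address value heterogeneity (codes and units).
MAPPINGS = {
    "A": {
        "patient_id": ("patient_id", int),
        "sex": ("sex", lambda v: {"M": "male", "F": "female"}[v]),
        "weight_kg": ("weight_kg", float),
    },
    "B": {
        "pid": ("patient_id", int),
        "gender": ("sex", str.lower),
        "wt_lb": ("weight_kg", lambda v: round(v * LB_TO_KG, 1)),
    },
}


def to_common(source, record):
    """Translate one source record into the common representation."""
    out = {"source": source}
    for local_name, value in record.items():
        common_name, convert = MAPPINGS[source][local_name]
        out[common_name] = convert(value)
    return out


def male_weights():
    """Return (source, patient_id, weight_kg) for every male patient."""
    rows = [to_common("A", r) for r in HOSPITAL_A]
    rows += [to_common("B", r) for r in HOSPITAL_B]
    return [
        (r["source"], r["patient_id"], r["weight_kg"])
        for r in rows
        if r["sex"] == "male"
    ]
```

Under these assumptions, `male_weights()` returns one row from each source, both
expressed in kilograms. The other kinds of heterogeneity listed above are not
covered by this sketch.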
1.3 RESEARCH OBJECTIVES AND CONTRIBUTIONS
The overall objective of this research was to develop and implement a system to
integrate information from heterogeneous distributed databases with the following
properties:
1. It should provide users with transparent access to data sources. Transparency
means hiding from the user the heterogeneity between databases, where data is
physically stored, which databases are being accessed, the structure and size of
the data, the query language, etc., when retrieving data from them.
2. Functions to resolve the heterogeneity must be automatically performed by the
system and be transparent to the user.
3. The system architecture must be extensible, flexible, and adaptable to increasing
system size. In doing so, the system has to distribute over the network the
knowledge about the databases the system connects to and the knowledge
about how to resolve the heterogeneity between them. This information is stored
in metadata at the database level.
4. The system should also maintain the autonomy of the underlying databases.
The retrieval system uses agents to enable data retrieval and answer construction
from autonomous, distributed, heterogeneous data sources, taking account of the
syntactic and semantic differences between data sources. Each database has its
own metadata description created by the database administrator (DBA) based on
the system ontology, and Web technologies are used to interface with the underlying
databases. The benefits of such systems are better user/customer service (the
user/customer does not have to log in to different databases and retrieve the required
information in many operations at different stages) and, as a result, faster time to
market, as organisations can respond more quickly to customer demands.
Contributions made by this research to meet the objectives include:

- The research presents a novel agent-based architecture, which distributes the
  knowledge over the network instead of storing it in a centralised knowledge base.
  This architecture analyses the user query, identifies the databases required to
  answer it, fetches the information, assembles the results, and presents them to
  the user.
- A framework is presented for classifying different aspects of heterogeneity in the
  data set.
- An approach has been developed to distribute the task of resolving heterogeneity
  between autonomous and co-operating agents transparently to the user.
- A system has been produced that is extensible and adaptable to increasing
  system size, and for databases to be added or removed with little effort. This has
  been achieved through the distribution of the knowledge over the network at the
  database level by the creation of metadata mapping database attributes onto the
  domain ontology. This metadata also provides support for the resolution of
  heterogeneity between databases within the system.
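The role of per-database metadata can be sketched as follows. This is a hedged
illustration in Python, not the agent implementation described in later chapters:
the database names, the ontology terms, the local attribute names, and the
functions `find_sources`, `fetch`, and `answer` are hypothetical, and in-memory
lists stand in for remote databases.

```python
# Hypothetical per-database metadata: ontology term -> local attribute name.
# In this sketch each entry plays the part of the description that a DBA
# would write for one database.
METADATA = {
    "clinic_db": {"patient": "pid", "weight": "wt", "sex": "gender"},
    "ward_db": {"patient": "patient_no", "blood_pressure": "bp"},
    "gp_db": {"patient": "id", "weight": "weight_kg", "sex": "sex"},
}

# In-memory stand-ins for the remote databases (assumed data).
DATA = {
    "clinic_db": [{"pid": 1, "wt": 82.0, "gender": "male"}],
    "ward_db": [{"patient_no": 1, "bp": "120/80"}],
    "gp_db": [{"id": 5, "weight_kg": 70.5, "sex": "female"}],
}


def find_sources(terms, metadata):
    """Return the databases whose metadata covers every requested term."""
    return sorted(
        name
        for name, mapping in metadata.items()
        if all(term in mapping for term in terms)
    )


def fetch(name, terms, metadata, data):
    """Read rows from one database, renaming local attributes to ontology terms."""
    mapping = metadata[name]
    return [{term: row[mapping[term]] for term in terms} for row in data[name]]


def answer(terms, metadata=METADATA, data=DATA):
    """Identify suitable databases, fetch from each, and assemble one result."""
    results = []
    for name in find_sources(terms, metadata):
        for row in fetch(name, terms, metadata, data):
            results.append({"source": name, **row})
    return results
```

In this sketch, adding or removing a database amounts to adding or removing one
entry in `METADATA`; the query code is unchanged, which is the property the last
contribution above aims for.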
1.4 THESIS STRUCTURE
The research has been conducted in four phases. The first phase sets the scope of
the study and classifies the problems of integrating information from heterogeneous
distributed databases. The second phase handles the system architecture
perspective, addresses the technologies that will be used in the system, and builds
the External Data Access Agent. The third phase builds the set of agents
responsible for resolving heterogeneity between data retrieved from heterogeneous
distributed databases. The fourth phase is concerned with building the set of agents
responsible for locating suitable databases to answer the user query.
This thesis is laid out in ten chapters. In Chapter 2, the architectures for distributed
systems are studied. A major challenge in developing a system that provides access
to a collection of databases is to resolve the heterogeneity that may exist between
different databases. To assist in handling this problem, Chapter 3 proposes a
framework for classifying different aspects of heterogeneity in data sets, and relates
to this framework the various aspects of heterogeneity discussed by different
researchers. From this, a test suite has been developed to evaluate systems that
provide access to heterogeneous databases.
The objective of Chapter 4 is to study various models that have been proposed to
resolve heterogeneity among heterogeneous distributed databases, and to show
how this work benefits from them and creates an improved approach. Chapter 5
presents the system requirements and the architecture needed to satisfy these
requirements. The system architecture consists of six levels of functionality within a
five-layer model. In this chapter the concepts of software agent, ontology, and
metadata are discussed. Chapter 6 describes the roles of agents in the Query Layer
that are responsible for the query break-down processes. The roles of agents in the
Information Finder Layer are presented in Chapter 7. These agents are responsible
for locating a suitable data source to answer the query and for helping to resolve
conflicts between data sources by providing information about databases. Chapter 8
describes the roles of agents in the Answer Layer, which are responsible for
resolving conflicts that may occur in the result. Chapter 9 is a description of the
system implementation, and shows how this system benefits from technology such
as CORBA, Java, and XML. Conclusions and issues for future research are outlined
in Chapter 10.
1.5 PUBLICATIONS
One paper has already been published on material presented in this thesis and two
others have been submitted for publication.

- Chapter 3 substantially reproduces the paper: A framework and test-suite for
  assessing approaches to resolving heterogeneity in distributed databases.
  Hazem T. El-Khatib, M. Howard Williams, Lachlan M. MacKinnon, David H.
  Marwick. Information and Software Technology, Volume 42, Issue 7, (1 May
  2000) pp 505-515.
- Applying web technology to linking to heterogeneous data sources. David H.
  Marwick, M. Howard Williams, Lachlan M. MacKinnon, Hazem T. El-Khatib.
  Submitted for publication.
- Using Agents to Retrieve and Integrate Information from Heterogeneous
  Distributed Databases. Hazem T. El-Khatib, M. Howard Williams, David H.
  Marwick, Lachlan M. MacKinnon. Submitted for publication.