CHAPTER 2
THE INTEGRATION OF DATABASE SYSTEMS
2.1 INTRODUCTION
The problem of retrieving and integrating information from distributed heterogeneous databases has been investigated by a number of researchers. Numerous papers have been published describing the variety of approaches developed to handle this problem, although aspects of the problem remain open.
This chapter discusses some of the approaches reported in the literature for tackling the problems of heterogeneity among databases. A survey of, and comparison between, systems developed to handle this problem can be found in [1,2]. This chapter will focus on three commonly used and well-supported approaches: the physical approach, the global approach, and the multidatabase approach. Section 2.2 gives background information on distributed database systems, section 2.3 discusses database autonomy, and section 2.4 presents database heterogeneity. Section 2.5 discusses the physical approach, and section 2.6 discusses the global approach. The multidatabase approach is presented in section 2.7, and section 2.8 presents other approaches. The approach adopted in this project is discussed in section 2.9. The chapter concludes with a summary in section 2.10.
2.2 THE MOVE TO DISTRIBUTED DATABASE SYSTEMS
A computer system provides four types of services [3]:
1. The data storage services, which provide users with efficient storage media.
2. The data access services, which provide functions for retrieving data from the
storage media.
3. The application services, which provide users with capabilities to execute specific
tasks.
4. The presentation services, which provide user interfaces to end-users.
A database system (DBS) is a data-storage and access system, which is composed
of two elements: a set of data, called a database, and a software program, called a
database management system (DBMS) [4]. The main aim of such systems is to
store and process information. Database management systems include functions for
protecting data integrity, supporting easy maintenance, assuring data security,
providing languages for the storage, retrieval and update of the data, and providing
facilities for data backup and recovery, etc. Before the development of the DBMS,
the data were stored in separate data files. These files could be created and
maintained by different applications and were likely to contain duplicate and
redundant data. However, there were significant problems if two applications wanted
to share access to the same file, including writing to it. A DBMS provides centralised
control of the data and solves the problems of shared access to data files [5,6]. The
data in a database are organised according to a particular data model, such as the
relational model, hierarchical model, network model, etc. A schema describes the
actual structures and organisation within the system.
Initially, database systems were developed as stand-alone systems operating
independently. During the 1970s, centralised databases were in common use. A
centralised DBS consists of a single centralised DBMS managing a single database
on the same computer system [7]. However, recent “innovations in communications
and database technologies have engendered a revolution in data processing” [4],
giving rise to a new generation of decentralised database systems.
A distributed (decentralised) database system is “made up of a single logical
database that is physically distributed across a computer network, together with a
distributed database management system (DDBMS) that answers correct queries
and services update requests” [4]. Bell and Grimson [8] present the taxonomy of
distributed data systems and classify them into two types:
• Homogeneous distributed database management systems (HmDDBMSs)
• Heterogeneous distributed database management systems (HgDDBMSs)
A homogeneous distributed database system implies homogeneity, in that all its physical components run the same distributed database management system, which supports a single data model and query language. In contrast, a heterogeneous distributed database system includes heterogeneous components at the database level, where, for example, the local nodes have different types of hardware, operating systems [1], data models, query languages and schemas.
2.3 AUTONOMY
The concept of database autonomy refers to the ability of each local database
system to have control over its own data and to perform various operations on its
own data. Bukhres et al. [4] define database autonomy as “the ability of each local
database system to control access to its data by other database systems, as well as
the ability to access and manipulate its own data independently of other systems”.
Sheth and Larson [7] describe four types of autonomy:
1. Design Autonomy: the ability of a database system to choose its own design
with respect to data model, query language, semantic interpretation of the data,
the operations supported by the system, and the implementation of the system
(e.g., concurrency control algorithms).
2. Communication Autonomy: the ability of a database system to decide whether
and when to communicate with other database systems.
3. Execution Autonomy: the ability of a database system to decide how to execute
local operations without interference from external operations, and the order in
which to execute external operations.
4. Association Autonomy: the ability of a database system to determine the
extent to which it will share its functionality and resources with others.
2.4 HETEROGENEITY
With the current explosion of data, retrieving and integrating information from a collection of independently designed autonomous databases has become a complex and critical problem [9,10,11,12]. There are several different forms of heterogeneity, discussed in detail in chapter 3, which have exercised the research community for nearly 20 years [5]. Many types of heterogeneity are the result of technological differences, for example, differences in hardware, system software (such as operating systems), and communication systems. The following sections present the approaches that different systems have adopted to resolve the heterogeneity problem.
2.5 THE PHYSICAL APPROACH
The physical approach [4,13] integrates all data needed by an application into one
database. This is referred to as the data warehouse, which is built up from different
databases and contains large amounts of data (billions of records) [14]. It was
“designed especially for decision support queries, therefore only data that is needed for decision support is extracted from the operational data and stored in the warehouse” [14]. Two approaches, “top down” and “bottom up”, are used to build a data warehouse. In the top-down approach, the data warehouse is first built for the complete organisation, and the information needed by different end-users is selected from it. In the bottom-up approach, smaller local data warehouses, known as datamarts, are used by end-users at a local level, and the data warehouse is generated from these. Figure 2.1 shows the relationship between databases, the data warehouse, and datamarts. Replication techniques are used to load the information from the different databases into the data warehouse.
Figure 2.1 ~ The relationship between databases, data warehouse, and datamarts
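To make the replication step concrete, the following minimal sketch (in Python, using the standard sqlite3 module; the table and column names are invented for illustration) extracts only the decision-support data from an operational database and loads it into a warehouse table:

    import sqlite3

    # Hypothetical operational source and warehouse (in-memory for illustration).
    source = sqlite3.connect(":memory:")
    warehouse = sqlite3.connect(":memory:")

    source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, internal_note TEXT)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                       [(1, "north", 120.0, "x"), (2, "south", 80.0, "y")])

    warehouse.execute("CREATE TABLE sales_facts (region TEXT, amount REAL)")

    # Replication: extract only the data needed for decision support
    # (region and amount), not the full operational record.
    rows = source.execute("SELECT region, amount FROM orders").fetchall()
    warehouse.executemany("INSERT INTO sales_facts VALUES (?, ?)", rows)

    # A decision-support query against the warehouse.
    for region, total in warehouse.execute(
            "SELECT region, SUM(amount) FROM sales_facts GROUP BY region"):
        print(region, total)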
2.5.1 The Physical Approach Architecture
The physical approach architecture might be referred to as the monolithic architecture. The physical approach integrates all data needed by an application into one database; as illustrated in Figure 2.2, the monolithic architecture views the system as a standalone application which does not interact with any other system. It is one large application that contains the data storage, the business logic code and the presentation code. This architecture was popular in the days of large mainframes, because such mainframes could manage such applications.
Figure 2.2 ~ Monolithic Architecture (user interface, logic and data combined in a single application)
2.5.2 The Physical Approach Disadvantages
The physical approach has several disadvantages: it does not allow data to be maintained independently; it requires an expensive application conversion; and the database administrator has to understand the structure of the database and keep track of any changes. Furthermore, it results in redundant data. Although the monolithic architecture is simple to understand, it is difficult to develop and maintain, as well as being difficult to test.
2.6 THE GLOBAL APPROACH
An alternative solution is the global approach. Other terms used for the global approach are logical approach [4,7,9,13], composite approach [15], schema integration [16], and view integration [9,17,18]. Schema integration and view integration have been used almost interchangeably. However, Spaccapietra et al. [17] argue that there are some differences between the two approaches. View integration is “a process in classical database design deriving an integrated schema from a set of user views”. In view integration the views are usually based on the same data model (homogeneous), and have no “associated recorded extension”. On the other hand, in database integration, the local schemas may be based on different data models and have an associated extension (i.e., they describe data which are actually stored in a database). This approach has been widely reported in the literature [19,20,21] as the process of providing a global schema for all databases, to which all transactions need to be mapped [1].
The global approach provides users with uniform access to data contained in various databases, without migrating the data to a new database, and without requiring the users to know either the location or the characteristics of the different databases and their corresponding DBMSs. Database designers attempt to reconcile the conflicts among all component databases by designing a global schema, which allows users to send queries based on the global schema. The definition of a global schema incorporates functions to resolve discrepancies and inconsistencies among the underlying databases.
The global schema approach was first described by Dayal and Hwang [9], and it
arises in two different contexts [22]:
• Global schema design
• Logical database design
In global schema design, several databases already exist and are in use. The
objective is to represent the contents of these databases by designing a single
global schema. User queries can then be specified against this global schema, and
the requests are mapped to the relevant databases [22,23].
In logical database design [23], each class of users designs a view of that part of the
proposed database they need to access. The objective is to design a conceptual
schema that represents the contents of all of these views. User queries and
transactions specified against each view are mapped to the logical integrated
schema. Database design has been described as an art rather than a science, and it depends upon the experience of the designer; two schools of thought are represented in [22] and [24].
Schema integration is a three-phase process. First, in the investigation phase, commonalities and discrepancies among the input schemas have to be determined [16] or detected [25]; the aim is to identify semantically related objects by a comparison process based on matching of names, structures and constraints [18,26,27,28], in which their similarities and dissimilarities are discovered. Second, semantic conflicts between objects of component databases are resolved [25], and finally the integration is performed [17]. Interaction with the database administrator is required to solve conflicts among input schemas where the integrator lacks the knowledge to resolve them.
Many approaches and techniques for schema integration have been reported in the literature. A detailed survey by Batini et al. [16] discussed and compared twelve methodologies for the problem of schema integration. They argue that the two problems to be dealt with during integration come from the structural and semantic diversities of the schemas to be merged, caused, for example, when different names are attached to the same concept in two views, or when a concept is represented as a relation in one database and as an attribute in another. The survey divides schema integration activities into five steps: preintegration, comparison of the schemas, conformation of the schemas, merging and restructuring.
• Preintegration: this involves analysis of schemas to decide the choice of
schemas to be integrated, the order of integration, and the number of schemas to
be integrated at one time.
• Comparison: this involves comparing the objects of the schemas to be integrated to determine the correspondences among objects and detect possible conflicts, including identification of naming conflicts, domain conflicts, structural differences and missing data (a sketch of such a comparison follows this list).
• Conformation: this step enables more effective comparison by ensuring that
related information is represented in a similar form in different schemas. Close
interaction with designers and users is needed before compromises can be
achieved.
• Merging & Restructuring: these two steps are concerned with specifying the interrelationships between the schema objects in the different schemas, resulting in a global conceptual schema. This global schema has to represent correctly all the concepts present in any component schema; however, a concept that is represented in more than one component schema must appear only once in the integrated schema. The integrated schema should be easy for both the designer and the end user to understand.
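As a minimal sketch of the comparison step, the following fragment (Python; the attribute names are invented) flags pairs of attributes from two schemas whose names are lexically similar. This is only a crude stand-in for the name, structure and constraint matching described in [18,26,27,28], and the candidate matches would still be confirmed or rejected by the DBA:

    from difflib import SequenceMatcher

    schema_a = ["emp_name", "emp_salary", "dept"]
    schema_b = ["employee_name", "salary", "department", "location"]

    def similarity(x, y):
        # Ratio of matching characters between two attribute names (0..1).
        return SequenceMatcher(None, x.lower(), y.lower()).ratio()

    # Flag candidate correspondences for the DBA to confirm or reject.
    for a in schema_a:
        for b in schema_b:
            score = similarity(a, b)
            if score > 0.6:
                print(f"possible match: {a} ~ {b} (score {score:.2f})")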
Several other techniques have been identified to facilitate schema integration and
are summarised in [5,29]: Catalogue technique, Hyperstructured technique, Meta-Translation, Object Equivalencing, Mediation, Intelligent Co-operating systems, and KBS assist.
Two methodologies have been proposed for schema integration according to [17]: manual methodology and semi-automatic methodology. Manual methodology, first developed by Motro and Buneman [30], aims at providing a tool which allows the DBA to build the integrated schema from the local schemas. Semi-automatic methodology uses a semi-automatic reasoning technique to discover similarity assertions between objects by evaluating some degree of similarity, e.g. of names and structures [5,6]. Interaction with the DBA is invoked to accept or refuse some of the correspondence assertions and to solve some conflicts that the system is unable to resolve; this is why it is semi-automatic. Some semi-automatic tools developed to perform schema integration were reported in
[17,18,31]. Nevertheless, Sheth and Larson [7] have argued that a completely
automatic schema integration process is impractical because it would require that all
of the semantics of the schema be completely specified.
2.6.1 The Global Approach Architecture
The disadvantages of the monolithic architecture opened the way for another architecture: the monolithic architecture has been replaced by the client/server architecture. With the availability of smaller and cheaper PCs and of off-the-shelf database management systems, the client/server architecture has become widespread [32]. As shown in Figure 2.3, the client/server architecture
breaks the system functionality between the client and the server. The server acts as
a producer, and the client as a consumer of service. The main concepts of the
architecture are:
• Server: performs services in response to requests sent by clients.
• Client: sends requests to servers and receives the results of the service returned from the server.
• Service: could be data, functions, etc.
Figure 2.3 ~ Client/Server Architecture (user interface, logic and data divided between client and server)
When a client becomes a server in the client/server architecture, the relationship is
referred to as peer-to-peer connection.
Two examples of protocol modules used in the client-server architecture are [33]:
• the three-message protocol, in which the client makes a request, the server responds and then the client acknowledges receipt of the response (a sketch of this exchange follows the list);
• the single-shot protocol, in which only repeatable requests are used.
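As an illustration only, the following toy simulation (Python, using in-memory queues rather than a real network transport) walks through the three messages of the first protocol:

    import threading
    from queue import Queue

    to_server, to_client = Queue(), Queue()

    def server():
        request = to_server.get()                 # message 1: the client's request arrives
        to_client.put(f"result-for({request})")   # message 2: the server responds
        to_server.get()                           # message 3: the client's acknowledgement
        # After the acknowledgement the server can discard any state
        # it was holding in case the response had to be re-sent.

    def client():
        to_server.put("read record 42")   # message 1
        response = to_client.get()        # message 2
        to_server.put("ack")              # message 3
        return response

    t = threading.Thread(target=server)
    t.start()
    print(client())
    t.join()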
2.6.2 Schema Integration Processing Strategies
Batini et al. [16] have proposed different strategies for the sequencing and grouping of schemas for integration. As shown in Figure 2.4, each strategy can be represented as a tree: the leaf nodes of the tree correspond to the component schemas, the non-leaf nodes correspond to intermediate results of integration, and the root node is the final result. The primary classification of strategies is binary versus n-ary.
• Binary strategies allow the integration of two schemas at a time. They are called “ladder” strategies when a new component schema is integrated with an existing intermediate result at each step. A binary strategy is “balanced” when the schemas are divided into pairs at the start.
• N-ary strategies allow integration of n schemas at a time (n > 2). An n-ary strategy is “one shot” when the n schemas are integrated in a single step; it is “iterative” otherwise.
Figure 2.4 ~ Types of integration-processing strategies (ladder, balanced, one-shot, iterative) [16]
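The difference between the strategies lies purely in the order in which merges are performed. The following sketch (Python; merge is a placeholder that merely records which schemas were combined) contrasts the ladder, balanced and one-shot orderings over four schemas:

    def merge(s1, s2):
        # Placeholder for integrating two schemas: a real system would
        # compare, conform and merge the schema objects here.
        return f"({s1}+{s2})"

    schemas = ["S1", "S2", "S3", "S4"]

    # Ladder strategy: each new schema is merged with the running result.
    ladder = schemas[0]
    for s in schemas[1:]:
        ladder = merge(ladder, s)
    print("ladder:  ", ladder)               # (((S1+S2)+S3)+S4)

    # Balanced strategy: the schemas are divided into pairs at the start.
    left = merge(schemas[0], schemas[1])
    right = merge(schemas[2], schemas[3])
    print("balanced:", merge(left, right))   # ((S1+S2)+(S3+S4))

    # One-shot n-ary strategy: all n schemas integrated in a single step.
    print("one-shot:", "(" + "+".join(schemas) + ")")   # (S1+S2+S3+S4)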
2.6.3 CARNOT System
“There has been some work on the problem of accessing information distributed
over multiple sources both in the AI-oriented database community and in the more
traditional database community” [10]. A survey and comparison of these can be
found in [1,2]. The use of a knowledge base to integrate a variety of information
sources has been investigated by the AI-oriented database community. Breitbart [13] believes that, for a system involving hundreds of local databases, expert systems and knowledge base technologies are required to help users in their access to the data.
In this section the Carnot system [15] will be presented as an example based on the global approach. The Carnot system has been developed at the Microelectronics
and Computer Technology Corporation, Austin, Texas. The goal of this project is to
integrate heterogeneous databases using a set of articulation axioms that describe
how to map between SQL queries and domain concepts. Carnot uses the Cyc
knowledge base as a global schema to build the articulation axioms (statements of
equivalence between components of two theories). The schemas of individual
resources are compared and merged with this knowledge base, although not with
each other, making a global schema much easier to construct and maintain. The
Carnot system uses the following knowledge in resolving semantic differences:
• a structural description of the local schema.
• schema knowledge, which is the structure of the data, integrity constraints, and allowed operations.
• resource knowledge, which is a description of supported services, such as the data model and languages, lexical definitions of object names, the data itself, comments from the resource designer, and guidance from the integrator.
• organisation knowledge, which is the corporate rules governing use of the resource.
As shown in Figure 2.5 resource integration is achieved by separate mappings
between each information resource and the global schema. Each mapping consists
of a syntax translation and a semantics translation. The syntax translation provides a
bidirectional translation between a local data manipulation language and the global
context language. The semantics translation is a mapping between two expressions
that have equivalent meanings in the global context language. After integration, one
can use the information that becomes available through a global view or a local
view. However, Carnot has two problems: a problem of integrating the results
returned from multiresource queries, and the large size of the global schema.
Figure 2.5 ~ Global and local views in semantic transaction processing [15] (applications access either local views, expressed in local data manipulation languages DML1..DMLn, or a global view expressed in the global context language (GCL); articulation axioms perform the local-to-global and global-to-local semantic translations between each local schema and the global view)
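As a toy sketch of the idea (the vocabularies below are invented, and Carnot's actual articulation axioms are logical statements built over the Cyc knowledge base, not lookup tables), each resource can be given a bidirectional mapping between its local terms and the terms of the global context:

    # Hypothetical articulation "axioms" for two resources, reduced to
    # term equivalences between local vocabularies and a global context.
    axioms = {
        "db1": {"EMP": "Employee", "SAL": "salary"},
        "db2": {"STAFF_MEMBER": "Employee", "PAY": "salary"},
    }

    def to_global(resource, term):
        # Local-to-global semantic translation.
        return axioms[resource][term]

    def to_local(resource, concept):
        # Global-to-local semantic translation (the inverse mapping).
        inverse = {g: l for l, g in axioms[resource].items()}
        return inverse[concept]

    # A global concept is reached from either resource, and back again.
    assert to_global("db1", "EMP") == to_global("db2", "STAFF_MEMBER")
    print(to_local("db2", "Employee"))   # STAFF_MEMBER

Note that each resource is mapped against the global schema only, not against every other resource, which is what keeps the number of mappings linear in the number of resources.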
2.6.4 Schema Integration Disadvantages
Schema integration has the disadvantages that it is difficult and complex to create a
single global schema for a large number of data sources because of the large
number of comparisons required, and it is difficult to maintain every time a local
schema changes [2,12,13,34]. Litwin and Abdellatif [11] argue that “a single schema
for the thousands of databases is a dream”. The global designers have to
understand the underlying assumptions, semantics of each existing system, and the
heterogeneous local database structure [4].
Distributing the process between client and server allows applications to scale and develop easily. However, this architecture retains some of the problems of the monolithic case. One such problem occurs because the client and the server depend on each other: it is difficult to use the presentation code in another system with a different database, and vice versa [32]. This happens because in the client/server architecture there is no independent place to store the implementation of the business processes. Often much of the processing goes on in the client or in the server, which means that either the client or the server is a large program. This drawback of the client/server architecture has been solved in the development of the n-tier architecture.
2.7 THE MULTIDATABASE APPROACH
The third approach is the multidatabase approach or federated approach [4]. The terms “multidatabase system” and “federated system” are often used interchangeably [6]. The term “multidatabase system” itself is used by different authors to mean a number of different things. In practice, the terms “federated database system (FDBS)”, “multidatabase system”, “interoperable database system”, and “heterogeneous DDBMS” are used synonymously [2]. They refer to systems that support access to multiple autonomous databases without a global schema. Sheth and Larson [7] used this term to describe heterogeneous distributed database systems. However, Litwin et al. [2] use the term to mean a loosely coupled FDBS (see section 2.7.3). They proposed a reference architecture for the multidatabase systems approach. The levels identified in the architecture are the “Internal Level”, “Conceptual Level”, and “External Level”. Bukhres et al. [4] considered a multidatabase system as a collection of loosely coupled element databases, with no global schema applied for their integration: “a multidatabase system is a system that supports the execution of operations on multiple component database systems”.
Dayal [9] uses the term to describe a tightly coupled FDBS. Hurson and Bright [34]
consider a multidatabase system to be a federated database, which is a distributed
system that provides a global interface to heterogeneous pre-existing local DBMS.
Users can access multiple remote databases with a single query. Nodine and Zdonik
[35] define a multidatabase as a “system that integrates multiple autonomous
heterogeneous database systems to allow them to be accessed using a uniform
interface and to provide basic database support for consistency and persistence
across the set of databases”.
A key characteristic of the multidatabase is that the local DBMSs remain autonomous [2]. An essential component of such a system is a language designed to cope with local autonomy and data redundancy [5]; it needs features which are not part of a conventional database language, such as the use of logical database names in queries to qualify data elements in different databases, and support for autonomy combined with co-operation between database administrators [6].
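A minimal sketch of one such feature, the use of logical database names to qualify data elements (Python; the dotted-name syntax and database names are invented for illustration):

    # In a multidatabase language, a name such as "payroll.staff.salary"
    # qualifies a data element with a logical database name, so one query
    # can span several autonomous databases.
    def parse_qualified(name):
        database, table, column = name.split(".")
        return database, table, column

    query_elements = ["payroll.staff.salary", "personnel.employees.name"]

    # Group the referenced elements by the database that must serve them.
    plan = {}
    for element in query_elements:
        db, table, column = parse_qualified(element)
        plan.setdefault(db, []).append((table, column))
    print(plan)   # {'payroll': [('staff', 'salary')], 'personnel': [('employees', 'name')]}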
2.7.1 The Multidatabase System Taxonomy
The basic multidatabase system taxonomy is shown in Figure 2.6. The taxonomy
involves the following classification [4,7]:
• Nonfederated Database System. A nonfederated multidatabase system integrates component systems that are not autonomous but still heterogeneous. It has only one level of management, and all operations are performed uniformly. A nonfederated database system does not distinguish local and nonlocal users.
• Federated Database System. A federated database system (FDBS) is a collection of co-operating database systems that are autonomous [7]. One of the significant aspects of an FDBS is that a component DBS can continue its local
operations and at the same time participate in a federation to allow data sharing.
The term federated database system was introduced by Heimbigner and McLeod
[36] to refer to a collection of databases in which the sharing is made more
explicit by allowing export schemas, which define the shareable part of each local
database [after 16]. Since its introduction, the term has been used for several
different but related DBS architectures, until Sheth and Larson [7] proposed a
reference architecture based on all the preceding research [5].
Figure 2.6 ~ Multidatabase system taxonomy (nonfederated database systems, and federated database systems classified into tightly coupled, with single or multiple federations, and loosely coupled)
2.7.2 The Federated Database Systems Architecture
The n-tier architecture can be viewed as a three-tiered architecture, a five-tiered
architecture, or more. The n-tier architecture breaks the application into three basic
layers, a presentation layer (External schema), a business logic layer (Conceptual
schema), and a database layer (Internal schema) [32,37], as shown in Figure 2.7.
The presentation layer (service consumer) provides a user interface and acts as a
client of the business logic servers. The business logic layer (service provider) acts
as a server with which the client code interacts. It contains all the complicated
application logic and it does not take into account how data is stored. The database
layer (data provider) has no idea what operations will be performed on it.
Applications exist as co-operating components, which allows different clients to
share the same business logic. This creates reusable software, which is easy to
deploy and maintain and offers much greater flexibility and scalability.
Figure 2.7 ~ N-Tier Architecture (user interface, logic and data as separate tiers)
The ANSI/X3/SPARC Study Group on Database Systems introduced the standard
ANSI/SPARC three-level schema architecture for centralised DBMSs [7] as shown in
Figure 2.8.
Figure 2.8 ~ ANSI-SPARC three-level architecture (external schemas 1..n above the conceptual schema, which sits above the internal schema)
Sheth and Larson [7] introduced software system components that link the three levels together. They introduced some component definitions and then extended the diagram shown in Figure 2.8 to that shown below in Figure 2.9. The basic types of system components used include:
1. schemas, which are descriptions of data managed by a DBMS.
2. mappings, which are functions that correlate the schema objects in one schema to the schema objects in another schema.
3. commands, which are requests for specific actions that are either entered by a user or generated by a processor.
Figure 2.9 ~ Extended ANSI-SPARC three-level architecture [7] (a filtering processor between each external schema and the conceptual schema, a transforming processor between the conceptual and internal schemas, and an accessing processor between the internal schema and the database)
4. Processors are software modules that manipulate commands and data. There are usually four types of processors, each performing different functions on data.
• The transforming processor translates commands/data from one language/format to another language/format. This provides data model transparency, which hides differences in query language and data formats.
• Filtering processors constrain the commands and associated data that can
be passed to another processor. Examples of filtering processors include
a syntactic constraint checker, semantic integrity constraint checker, and
access controller.
• A constructing processor is used particularly in the heterogeneous database system. It partitions and/or replicates an operation submitted by a single processor into operations that are accepted by two or more processors. Constructing processors also merge data produced by several processors into a single data set for consumption by another single processor. This supports location, distribution and replication transparency, undertaking tasks such as schema integration (integrating multiple schemas into a single schema), negotiation between schema owners (to determine what protocol should be used among them), query decomposition and optimisation, and global transaction management (performing concurrency control).
• The accessing processor executes commands to retrieve data from the database.
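A minimal sketch of the constructing processor's partition-and-merge role (Python; the component "databases" are reduced to dictionaries and the operation to a lookup by key):

    # Two component databases holding different fragments of the same relation.
    component_1 = {"e1": {"name": "Ada"}, "e2": {"name": "Grace"}}
    component_2 = {"e3": {"name": "Edsger"}}

    def construct(query_keys):
        # Partition: send each sub-operation to the component that can serve it.
        partial_results = []
        for component in (component_1, component_2):
            hits = {k: component[k] for k in query_keys if k in component}
            partial_results.append(hits)
        # Merge: combine the partial answers into a single data set, which is
        # what gives the caller location and distribution transparency.
        merged = {}
        for part in partial_results:
            merged.update(part)
        return merged

    print(construct(["e1", "e3"]))   # {'e1': {'name': 'Ada'}, 'e3': {'name': 'Edsger'}}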
The three-level schema architecture is incomplete for describing the architecture of an FDBS. The three-level schema must be extended to support the three dimensions of a federated database system: distribution, heterogeneity, and autonomy [38]. Sheth and Larson [7] proposed an architecture composed of five levels of schema, as shown in Figure 2.10. These schemas are:
• Local Schema: Each local DBMS has a local schema that defines all the local
data. A local schema is expressed in the native data model of the component
DBMS, and therefore different local schemas may be expressed in different data
models.
Figure 2.10 ~ Five-level schema architecture of an FDBS [7] (external schemas defined over federated schemas; each federated schema integrates several export schemas; each export schema is derived from a component schema, which represents the local schema of a component DBS)
• Component Schema: A component schema is the local schema represented in
the global data model called the common data model (CDM) of the FDBS. Sheth
and Larson [7] gave two reasons for defining component schemas in a CDM:
1. they describe the divergent local schemas using a single representation.
2. semantics that are missing in a local schema can be added to its component
schema.
The process of schema translation from a local schema to a component schema
generates the mappings between the component schema objects and the local
schema objects. Transforming processors use these mappings to transform
commands on a component schema into commands on the corresponding local
schema. Such transforming processors and the component schemas support the
heterogeneity feature of an FDBS.
• Export Schema: not all data of a component DBS may be available to the
federation and its users. An export schema represents the portion of the
component schema that the component database chooses to make available to
the FDBS. The purpose of defining export schemas is to facilitate control and
management of association autonomy. The export schemas support the
autonomy feature of an FDBS.
• Federated Schema: a federated schema is an integration of several export
schemas. A federated schema also includes the information on data distribution
that is generated when integrating export schemas. The federated schemas
support the distribution feature of an FDBS. There may be multiple federated
schemas in an FDBS, one for each class of federation users.
• External Schema: an external schema defines a schema for a particular user or
application.
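The following sketch (Python; each schema is reduced to a bare set of attribute names, which omits almost everything a real schema carries) illustrates the derivation chain from local schema to external schema:

    # Local schema in the component's native terms.
    local_schema = {"EMPNO", "ENAME", "SAL", "COMM"}

    # Component schema: the local schema expressed in the common data model
    # (here, simply renamed into CDM vocabulary).
    cdm_names = {"EMPNO": "id", "ENAME": "name", "SAL": "salary", "COMM": "commission"}
    component_schema = {cdm_names[a] for a in local_schema}

    # Export schema: the portion the component chooses to share (autonomy).
    export_schema = component_schema - {"commission"}

    # Federated schema: an integration of several export schemas (distribution).
    other_export = {"id", "name", "department"}
    federated_schema = export_schema | other_export

    # External schema: the view a particular class of users needs.
    external_schema = {"name", "department"}
    assert external_schema <= federated_schema
    print(sorted(federated_schema))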
Reddy et al. [39] proposed a methodology that transforms existing local databases to meet diverse application needs at the global level. It uses a four-layered schema
architecture: local schema, local object schema, global schema, and global view
schema.
Architectures with Additional Basic Components
There are several types of architectures with additional components that are
extensions or variations of the basic components of the reference architecture. Such
components improve the capabilities of an FDBS. Examples of such components
include an auxiliary schema that stores the following types of information:
• data needed by federation users but not available in any of the pre-existing
component DBSs.
• information needed to resolve incompatibilities (e.g., unit translation tables,
format conversion information).
• statistical information helpful in performing query processing and optimisation.
Extended Federated Architectures
To allow a federation user to access data from systems other than the component
DBSs, the five-level schema architecture can be extended in additional ways, for
example, by replacing a component database by a collection of application
programs. It is conceptually possible to replace some database tables by application
programs. For example [7], a table containing pairs of equivalent Fahrenheit and
Celsius values can be replaced by a procedure whereby calculated values on one
scale give values on the other.
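For the example just given, the stored table of equivalent values could be replaced by a procedure along the following lines (a trivial Python sketch):

    def fahrenheit_from_celsius(c):
        # Computes the value instead of looking it up in a stored table.
        return c * 9.0 / 5.0 + 32.0

    def celsius_from_fahrenheit(f):
        return (f - 32.0) * 5.0 / 9.0

    print(fahrenheit_from_celsius(100.0))   # 212.0
    print(celsius_from_fahrenheit(32.0))    # 0.0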
2.7.3 Coupling in Federated Database Systems
As shown in Figure 2.6, Federated Database Systems can be classified as tightly
coupled or loosely coupled systems. The major difference between these two
classifications lies in who manages the federation and how the components are
integrated [7].
• Tightly Coupled. An FDBS is tightly coupled if the federation and its administrator(s) have the responsibility for creating and maintaining the federation and actively control the access to component DBSs. Tightly coupled federations take the form of schema integration [4] and may have one or more federated schemas. A tightly coupled FDBS is said to have a single federation if it allows the creation and management of only one federated schema; it is said to have multiple federations if it allows the creation and management of multiple federated schemas. It provides location, replication, and distribution transparency. The process for the administration of a tightly coupled FDBS is as follows. First, export schemas are created by negotiation between a component DBA and the federation DBA; the component DBA has control over what is included in the export schemas. Then the federation DBA creates and controls the federated schema, as s/he is usually allowed to read the component schemas to help determine what data are available and where they are located, and then negotiate for their access. Finally, external schemas are created by negotiation between a federation user and the federation DBA, who has the authority to decide what is included in each external schema. DDTS [40] can be categorised as a tightly coupled FDBS with a single federation. Mermaid [41] and Multibase [9] are examples of tightly coupled FDBSs with multiple federations.
• Loosely Coupled. In loosely coupled federated database systems, users are largely responsible for the administration of the federated system. There is no central authority that controls the creation of or access to data. Each component system is responsible for constructing a global schema view and for processing queries that access remote component systems. Loosely coupled federations take the form of schema importation [16], interoperable database systems, or multidatabase systems. Loosely coupled systems do not maintain hard links into component databases: only those databases that require integration are attached, as needed, to fulfil the transaction request. A loosely coupled FDBS always supports multiple federated schemas and provides an interface to deal directly with multiple component DBMSs. A typical way to formulate queries is to use a multidatabase language. A typical process of developing federated schemas in a loosely coupled FDBS is as follows. Each federation user is the administrator of his or her own federated schema. First, a federation user looks at the available set of export schemas to determine which of these describe data s/he would like to access. Next, the federation user defines a federated schema by importing the export schema objects by using a user interface or an application program, or by defining a multidatabase language query that references export schema objects. The user is responsible for understanding the semantics of the objects in the export schemas and resolving the DBMS and semantic heterogeneity. Finally, the federated schema is named and stored under the account of the federation user who is its owner.
The loosely coupled approach may be inappropriate for more traditional business or corporate databases, where system control is desirable, where the users are naïve and would find it difficult to perform negotiation and integration themselves, or where location, distribution, and replication transparencies are desirable. The loosely coupled approach provides a greater degree of autonomy for the component systems, because no central authority is imposed. In addition, loose coupling generally tends to scale better to very large systems. On the other hand, maintaining consistency is more difficult in a system that lacks a central authority [4].
Breitbart [13] believes that the federated approach is more advantageous than the
global schema approach; it is popular, and many currently developed systems use
this approach for resolving heterogeneity problems between databases. This avoids
constructing a global schema and is easier to maintain. To achieve this, a system
must maintain knowledge about the contents of each database to know what to
include in a query and to help in resolving conflicts between sources, since there is
no global schema to provide information about the semantics of the databases.
2.7.4 Multimedia Information Presentation System (MIPS)
In this section the MIPS system [42,43] will be presented as an example based on a
loosely coupled FDBS approach. The MIPS system was developed in a European-funded project to link together distributed heterogeneous databases. It supports a
single user query that can retrieve information from a collection of databases and
provide a single integrated answer to the user. The MIPS system analyses the
query, identifies the databases required to answer it, fetches the information,
assembles the results, and presents them to the user. Ideally, all this would be done
transparently. The architecture of the MIPS system is illustrated in Figure 2.11. The
system consists of a set of tools which can be grouped into three layers:
Presentation Layer, Dialogue Management Layer and Data Layer. The Presentation
Layer accepts a user request and displays the results. The Dialogue Management
Layer consists of two major components, Knowledge Based System (KBS) and
Selection and Retrieval Tool (SRT), together with a minor one, the External Data
Access (EDA). The Data Layer represents the data sources that can be accessed.
The SRT consists of two sub-modules, Breakdown and Clarification (B&C) and
Assembly of the Consolidated Answer (ACA). The B&C sub-module accepts the
user query coming from the Presentation Layer, and breaks down the query into a
set of sub-queries targeted at the databases with help from the KBS. The ACA sub-module assembles the result and sends it to the Presentation Layer.
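As an illustration of the breakdown and assembly steps (Python; the knowledge-base contents and query concepts are invented, and the real KBS holds far richer metadata about schemas, locations, protocols and query languages):

    # A miniature stand-in for the KBS: which databases can answer which concepts.
    kbs = {
        "hotel": ["tourismDB"],
        "train": ["transportDB"],
        "price": ["tourismDB", "transportDB"],
    }

    def breakdown(query_concepts):
        # B&C: split the user query into sub-queries, one per target database.
        sub_queries = {}
        for concept in query_concepts:
            for db in kbs[concept]:
                sub_queries.setdefault(db, []).append(concept)
        return sub_queries

    def assemble(answers):
        # ACA: merge the sub-answers into one consolidated answer for the user.
        consolidated = []
        for db, rows in answers.items():
            consolidated.extend(rows)
        return consolidated

    subs = breakdown(["hotel", "train", "price"])
    print(subs)   # {'tourismDB': ['hotel', 'price'], 'transportDB': ['train', 'price']}
    print(assemble({db: [f"rows-from-{db}"] for db in subs}))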
The communication amongst the original MIPS system modules is managed through
an Internal Representation Language, the IRL, which exists in a number of dialects
between different pairs of modules.
However, this system is limited for two reasons. First, the number of databases that
can be handled in MIPS is restricted because of the adoption of a centralised KBS
module. This contains the knowledge about the available databases (schemas,
locations, protocols, query languages, etc.) as well as the knowledge necessary for
resolving database heterogeneity. Second, the mechanism used for accessing
databases is restricted since it is based on a standard commercial communication
protocol to link to the databases. This requires much effort to add a new database to
the KBS.
Figure 2.11 ~ The MIPS Architecture
2.8 OTHER APPROACHES
Ceri and Widom [44] argue that when a multidatabase environment includes facilities
at each site for production rules and persistent queues, these facilities can be used
to maintain consistency across semantically heterogeneous databases. Production
rules in database systems allow specification of database operations that are
executed automatically whenever certain events occur or conditions are met.
Persistent queues in multidatabase (or client-server) environments provide a
mechanism for reliable execution of asynchronous transactions on remote data. In a
multidatabase environment with production rules and persistent queues, consistency
across semantically heterogeneous databases can be maintained automatically as
follows: rules are triggered by any changes to a database that may create
inconsistencies.
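A minimal sketch of the mechanism (Python; a deque stands in for a persistent queue, and the production rule is hand-coded inside the update routine rather than declared to a rule system):

    from collections import deque

    local_db = {"part_42": {"price": 10.0}}
    remote_replica = {"part_42": {"price": 10.0}}
    persistent_queue = deque()   # stand-in for a reliable, persistent queue

    def update_price(key, price):
        local_db[key]["price"] = price
        # Production rule: triggered by a change that may create an
        # inconsistency, it enqueues an asynchronous repair transaction.
        persistent_queue.append(("set_price", key, price))

    def queue_worker():
        # Executed later, possibly after failures, against the remote data.
        while persistent_queue:
            op, key, price = persistent_queue.popleft()
            remote_replica[key]["price"] = price

    update_price("part_42", 12.5)
    queue_worker()
    assert remote_replica["part_42"]["price"] == 12.5
    print(remote_replica)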
Chatterjee and Segev [31] present a probabilistic technique for resolving data
heterogeneity problems. They discussed the Entity Join, which can be used to join
records across different databases. A probabilistic model was presented for
estimating the accuracy of the join in a heterogeneous environment.
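A crude sketch of the flavour of such a technique (Python; the attribute weights are invented, and the model in [31] is considerably more sophisticated) scores a pair of records by the attributes on which they agree:

    # Records describing possibly the same real-world entity in two databases.
    rec_a = {"name": "J. Smith", "city": "Leeds", "phone": "555-0101"}
    rec_b = {"name": "John Smith", "city": "Leeds", "phone": "555-0101"}

    # Hypothetical weights: how strongly agreement on each attribute
    # suggests that two records denote the same entity.
    weights = {"name": 0.3, "city": 0.2, "phone": 0.5}

    def match_score(a, b):
        # Sum the weights of the attributes on which the records agree.
        return sum(w for attr, w in weights.items() if a.get(attr) == b.get(attr))

    p = match_score(rec_a, rec_b)
    print(f"match score: {p:.2f}")   # 0.70: city and phone agree, names differ
    if p > 0.5:
        print("join the records as the same entity")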
2.9 THE APPROACH ADOPTED IN THIS RESEARCH
From the above discussion, the approach developed in this work adopts the features of the federated approach found in the MIPS system to resolve the heterogeneity problems. However, this architecture distributes the knowledge over the network instead of storing it in a centralised knowledge base. It hides the location and structure of the databases from the users without the need for creating a global schema, and it increases the number of databases that can be attached to the system. It also distributes the task of resolving heterogeneity between autonomous, co-operating agents. The structure of the system is presented in chapter 5, and more detailed functional descriptions of the individual agents follow in chapters 6, 7 and 8.
In this approach, users send queries using concepts mediated between them and the databases; the queries are transformed into the shared context and then transferred to the appropriate databases. The returned data are processed to resolve heterogeneities and assembled into a single answer.
2.10 SUMMARY
With the large number of databases now accessible to users, the problem of retrieving and integrating information from distributed, autonomous, heterogeneous databases has been investigated by a number of researchers. This chapter has set out some of the research issues that have been addressed in relation to the problem of providing access to such databases. Various papers describing a variety of approaches to handle this problem have been published, although aspects of the problem remain.
Three commonly used and well-supported approaches are: the physical approach,
the global approach, and the multidatabase approach. The physical approach
requires the database administrator to understand the structure of the databases
and keep track of changes to the underlying databases, and it is likely to result in
redundant data. The global approach has the advantage that it possesses location, replication, and distribution transparency, which is achieved mainly by schema
integration. However, it is complex to create a single global schema for a large
number of data sources because of the large number of comparisons required, and
the global designers have to understand the underlying assumptions, and semantics
of each existing system, as well as the heterogeneous local database structure. The
multidatabase approach is more advantageous than the global schema approach; it
is popular; many currently developed systems use this approach for resolving
heterogeneity between databases, and it avoids constructing a global schema. Since
there is no global schema to provide information about the semantics of the
databases, the system needs to maintain knowledge about the contents of each
database to know what to include in a query and to help in resolving conflicts
between sources.
The approach developed in this work adopts the features of the federated approach
in order to resolve the heterogeneity problems. This architecture distributes the
knowledge over the network instead of it being stored in a centralised knowledge
base.
REFERENCES
[1] G. Thomas, G.R. Thompson, C. Chung, E. Barkmeyer, F. Carter, M. Templeton, S. Fox, B. Hartman, Heterogeneous distributed database systems for production use, ACM Computing Surveys 22 (3) (1990) 237-266.
[2] W. Litwin, L. Mark, N. Roussopoulos, Interoperability of multiple autonomous databases, ACM Computing Surveys 22 (3) (1990) 267-293.
[3] Y. Bishr, Semantic Aspects of Interoperable GIS, PhD thesis, International Institute for Aerospace Survey and Earth Sciences (ITC), Netherlands, 1997.
[4] O.A. Bukhres, A.K. Elmagarmid, F.F. Gherfal, X. Liu, K. Barker, T. Schaller, The integration of database systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1996, pp. 37-56.
[5] L.M. MacKinnon, Intelligent Query Manipulation for Heterogeneous Databases, PhD thesis, Department of Computing & Electrical Engineering, Heriot-Watt University, Edinburgh, October 1998.
[6] J. Hu, Interoperability of Heterogeneous Medical Databases, PhD thesis, Department of Computing & Electrical Engineering, Heriot-Watt University, Edinburgh, May 1994.
[7] A.P. Sheth, J.A. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys 22 (3) (1990) 183-236.
[8] D. Bell, J. Grimson, Distributed Database Systems, Addison-Wesley, Wokingham, England, 1992.
[9] U. Dayal, H. Hwang, View definition and generalization for database integration in a multidatabase system, IEEE Transactions on Software Engineering SE-10 (6) (1984) 628-645.
[10] Y. Arens, C. Chee, C. Hsu, C. Knoblock, Retrieving and integrating data from multiple information sources, International Journal of Intelligent and Cooperative Information Systems 2 (2) (1993) 127-158.
[11] W. Litwin, A. Abdellatif, Multidatabase interoperability, Computer 19 (12) (1986) 10-18.
[12] S. Navathe, A. Savasere, A schema integration facility using object-oriented data model, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1996, pp. 105-127.
[13] Y. Breitbart, Multidatabase interoperability, SIGMOD Record 19 (3) (1990) 53-60.
[14] P. Adriaans, D. Zantinge, Data Mining, Addison-Wesley Longman, England, 1998.
[15] C. Collet, M.N. Huhns, W. Shen, Resource integration using a large knowledge base in Carnot, Computer 24 (12) (1991) 55-62.
[16] C. Batini, M. Lenzerini, S.B. Navathe, A comparative analysis of methodologies for database schema integration, ACM Computing Surveys 18 (4) (1986) 323-364.
[17] S. Spaccapietra, C. Parent, Y. Dupont, Model independent assertions for integration of heterogeneous schemas, VLDB Journal 1 (1) (1992) 81-126.
[18] S. Navathe, S. Gadgil, A methodology for view integration in logical database design, Eighth International Conference on Very Large Data Bases, Mexico City, VLDB Endowment, Saratoga, Calif., 1982, pp. 142-164.
[19] T. Landers, R. Rosenberg, An overview of Multibase, Proceedings of the 2nd International Symposium on Distributed Databases, 1982, pp. 153-183.
[20] Y. Arens, C.A. Knoblock, Planning and reformulating queries for semantically modeled multidatabase systems, Proceedings of the 1st International Conference on Information and Knowledge Management, 1992, pp. 92-101.
[21] R. Ahmed, P.D. Smedt, W. Du, W. Kent, M.A. Ketabchi, W.A. Litwin, A. Raffi, M.C. Shan, The Pegasus heterogeneous multidatabase system, IEEE Computer 24 (12) (1991) 19-27.
[22] S.B. Navathe, T. Sashidhar, R. Elmasri, Relationship merging in schema integration, 10th International Conference on Very Large Data Bases, Singapore, 1984, pp. 78-90.
[23] J.A. Larson, S.B. Navathe, R. Elmasri, A theory of attribute equivalence in databases with application to schema integration, IEEE Transactions on Software Engineering 15 (4) (1989) 449-463.
[24] R. Elmasri, S. Navathe, Object integration in logical database design, IEEE COMPDEC Conference, 1984, pp. 426-433.
[25] M. Solaco, F. Saltor, M. Castellanos, Semantic heterogeneity in multidatabase systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, 1996, pp. 129-202.
[26] C. Batini, M. Lenzerini, A methodology for data schema integration in the entity-relationship model, IEEE Transactions on Software Engineering SE-10 (6) (1984) 650-664.
[27] M. Siegel, S. Madnick, A metadata approach to resolving semantic conflicts, Seventeenth International Conference on Very Large Data Bases, Barcelona, September 1991.
[28] C. Yu, B. Jia, W. Sun, S. Dao, Determining relationships among names in heterogeneous databases, SIGMOD Record 20 (4) (1991) 79-80.
[29] K.G. Jeffery, L. Hutchinson, J. Kalmus, M. Wilson, W. Behrendt, C. MacNee, A model for heterogeneous distributed database systems, in: D.S. Bowers (Ed.), Directions in Databases: Proceedings BNCOD 12, Guildford, U.K., Lecture Notes in Computer Science 826, Springer-Verlag, July 1994, pp. 221-234.
[30] A. Motro, P. Buneman, Constructing superviews, International Conference on Management of Data, Ann Arbor, Mich., ACM, New York, April-May 1981, pp. 56-64.
[31] A. Chatterjee, A. Segev, Data manipulation in heterogeneous databases, SIGMOD Record 20 (4) (1991) 64-68.
[32] D.J. Berg, J.S. Fritzinger, Advanced Techniques for Java Developers, John Wiley & Sons, 1999.
[33] J.M. Crichlow, The Essence of Distributed Systems, Pearson Education Limited, Essex, England, 1999.
[34] A.R. Hurson, M.W. Bright, Object-oriented multidatabase systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1996, pp. 1-33.
[35] M.H. Nodine, S.B. Zdonik, The impact of transaction management on object-oriented multidatabase views, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, New Jersey, 1996, pp. 57-104.
[36] M. Hammer, D. McLeod, On database management system architecture, MIT Laboratory for Computer Science, MIT/LCS/TM-141, October 1979.
[37] B. Elbert, B. Martyna, Client/Server Computing: Architecture, Applications, and Distributed Systems Management, Artech House, 1994.
[38] R.M. Colomb, M.E. Orlowska, Interoperability in information systems, Information Systems Journal, May 1994, pp. 37-50.
[39] M.P. Reddy, B.E. Prasad, P.G. Reddy, A. Gupta, A methodology for integration of heterogeneous databases, IEEE Transactions on Knowledge and Data Engineering 6 (6) (1994) 920-933.
[40] P. Dwyer, J. Larson, Some experiences with a distributed database testbed system, Proceedings of the IEEE 75 (5) (1987) 633-647.
[41] M. Templeton, D. Brill, S.K. Dao, E. Lund, P. Ward, A.L.P. Chen, R. MacGregor, Mermaid: a front-end to distributed heterogeneous databases (invited paper), Proceedings of the IEEE 75 (5) (1987) 695-708.
[42] L.M. Mackinnon, D.H. Marwick, M.H. Williams, A model for query decomposition and answer construction in heterogeneous distributed database systems, Journal of Intelligent Information Systems 11 (1998) 69-87.
[43] W.J. Austin, E.K. Hutchinson, J.R. Kalmus, L.M. Mackinnon, K.G. Jeffery, D.H. Marwick, M.H. Williams, M.D. Wilson, Processing travel queries in a multimedia information system, Proc. Information & Communications Technologies in Tourism, Springer-Verlag, 1994, pp. 64-71.
[44] S. Ceri, J. Widom, Managing semantic heterogeneity with production rules and persistent queues, Proceedings of the 19th Very Large Data Bases Conference, Dublin, Ireland, 1993, pp. 108-119.