Integrating Information from Heterogeneous Databases Using Agents and Metadata
Chapter 3
CHAPTER 3
A FRAMEWORK FOR DISTRIBUTED HETEROGENEOUS DATABASES
3.1 INTRODUCTION
During the past three decades there has been a rapid growth in the number of
databases. This has led to the storage of related data in different formats across
multiple databases. For example, in areas such as healthcare, the information on
a single patient may be scattered over a number of different medical databases
with no simple way of obtaining a complete record of the patient.
In this chapter a framework for classifying different aspects of heterogeneity in
data sets is proposed, relating the various aspects of heterogeneity discussed by
different researchers to this framework. The idea behind such a framework is to
identify a comprehensive range of different types of heterogeneity that can arise
either alone or in combinations. A simple test-suite using this framework has been
devised which can be used to test and compare different approaches to the
interoperability of databases. The suite comprises a small number of data sets
and queries which exercise almost all aspects of the framework.
Hazem Turki El Khatib
46
PhD Thesis ~ 2000
The focus of this framework is on the relational database model since the vast
majority of databases currently in use are relational. Most of the heterogeneities
identified are common across all database models, including Object Oriented, but
this framework has not been extended to consider these other models at this time.
Section 3.2 provides an overview of work done by a number of different
researchers on the problem of heterogeneous distributed database systems.
Section 3.3 describes the framework with examples drawn from a small set of
example databases. Section 3.4 describes the test-suite derived from this and
section 3.5 provides a summary of the chapter.
3.2 OVERVIEW OF PREVIOUS WORK
This section provides a brief overview of the aspects covered in a number of
different papers on this subject. The next section presents our proposed
framework. A summary of how these fit into the proposed framework is given in
Table 3.1, and a key to the abbreviations used in that table is given in Table 3.2. It should be
noted that the terminology used differs amongst authors and their coverage of a
concept varies; this is indicated by an ‘F’ (if the concept is fully covered) or a ‘P’ (if
it is partially covered) in the table.
The technological differences between computer systems give rise to heterogeneity
conflicts. These include differences in hardware, system software (such as operating
systems), communication protocols, and so on [7]. At the database system level, the
heterogeneity can be divided into that resulting from differences in DBMSs and
that resulting from differences in the semantics of data, as shown in Figure 3.1.
Solaco et al., [25] classified the heterogeneity as systems heterogeneities and
semantic heterogeneities. Systems heterogeneities include differences in hardware,
operating systems, database management systems, transaction management,
communication protocols, and so on. Semantic heterogeneities include differences
in database models, particularly in the schemas of the databases.
Database Systems
  - Differences in DBMS
      - Data models (structures, constraints, query languages)
      - System level support (concurrency control, commit, recovery)
  - Semantic Heterogeneity
Operating System
  - File systems
  - Naming, file types, operations
  - Transaction support
  - Interprocess communication
Hardware/System
  - Instruction set
  - Data formats & representation
  - Configuration
Communication
Figure 3.1 ~ Types of heterogeneities
This thesis is not concerned with system heterogeneities that may or may not
exist. It may be that during the design process all databases in the system chose
the same hardware, operating system, DBMS, and so on. However, “semantic
heterogeneities will nearly always exist because the designers of the respective
component databases will have conceived the real world in differing ways and will
have designed different schemas” [25].
The major problem is that of finding how a data item in one set can be mapped to
an appropriate form to make it accessible in another – in other words, finding
attribute equivalence. The simplest form of heterogeneity in this regard is that of
naming conflicts and naming heterogeneity. In general, the categories of
structural and naming heterogeneities are recognised by most authors, e.g. [45], [4].
Naming conflicts are defined as homonyms and synonyms in
[9,16,18,26,31,44,46,47]. Elmasri et al. [48] used the same two categories but
widened naming conflicts to include attribute equivalence and entity class
equivalence. Structural conflicts may be viewed as differences in abstraction level
[48,49,50] as well as differences in roles, degree and cardinality constraints [48,22].
[Table 3.1: a matrix of Authors (rows) against Heterogeneity categories (columns, abbreviated as in Table 3.2). Each cell is either 'F' (fully covered), 'P' (partially covered) or empty. The individual cell entries are not recoverable from this transcript. The rows are listed below.]
Cardenas, A.F. [58]
Sheth and Larson [7]
Thomas, Thompson et al. [1]
Ferrier and Stangret [59]
Litwin and Abdellatif [11]
Urban and Wu [52]
Hurson and Bright [34]
Larson, Navathe et al. [23]
Chatterjee and Segev [31]
Spaccap., Parent et al. [17]
Navathe and Gadgil [18]
Batini and Lenzerini [26]
Bukhres, Elmag. et al. [4]
Navathe and Savasere [12]
Casanova and Vidal [45]
Motro and Buneman [30]
Al-fedaghi and Scheu. [65]
Yao, Waddle, Housel [66]
Yu, Jia, Sun, and Dao [28]
Teorey and Fry [67]
Kahn [46]
Elmasri and Navathe [24]
Navathe, Sashid. et al. [22]
Mannino and Effelsbe. [49]
Kual, Drosten, Neuhold [50]
Elmasri, Larson, Nav. [48]
Dayal and Hwang [9]
Batini, Lenzerini et al. [16]
Spaccapietra, Parent [60]
Solaco, Saltor et al. [25]
Ventrone and Heiler [55]
Kim and Seo [47]
Reddy, Prasad, Gupta [39]
Breitbart, Olson et al. [61]
Fankhauser, Neuhold [56]
Sheth and Kashyap [57]
Jeffery, Hutchinson et al. [29]
Deen, Amin, Taylor [51]
Table 3.1 ~ Relationship between concepts used by other researchers and our classification
ABBREVIATION | TERM
N.HG | Naming Heterogeneity
N.S | Naming Synonyms
A.S | Attribute Synonyms
R.S | Relation (Table) Synonyms
N.H | Naming Homonyms
A.H | Attribute Homonyms
R.H | Relation (Table) Homonyms
A-R H | Attribute_Relation Homonyms (Entity-Class Homonyms)
R.S.H | Relational Structure Heterogeneity
R.Z | Relation Size
V.H | Value Heterogeneity
N.N | Numeric-Numeric
D.U-F.C | Different Units- Fixed Conversion
D.U-T.V.C | Different Units- Time Varying Conversion
U-O.C | Units- Other Conversion
G | Granularity
C.V.S-V A | Composition of Values in a Single-Valued Attribute
S.S | String-String
V.S | Value Synonyms
V.h | Value Homonyms
D.S.F | Different String Formats
N.s | Numeric-String
S.C | Simple Conversion
S | Structures
I.I | Incomplete Information
S.H | Semantic Heterogeneity
W.D.R | What the data represents
C.C | Context in which data is captured
D.A.L | Difference in Abstraction Level
D.M | Data Model
P.H | Paradigm Heterogeneity
B.D | Behavioural Differences
D.C | Dependency Conflicts
D.I.C | Differences in Constraints
D.V | Default Value
R.K | Relation Keys
T.H | Timing Heterogeneity
D.E | Domain Evolution
I.D.A.U | Inconsistencies due to asynchronous updates
Table 3.2 ~ Abbreviations of the terms used in Table 3.1
The second major form of heterogeneity is concerned with differences in the
representations of values. Again several authors recognise differences in units
and granularity as well as differences in data types and structure. These include
Deen, Amin, and Taylor [51], Bukhres et al. [4], Jeffery et al. [29], and Larson et
al. [23], who also cover differences in level of abstraction and in object identifier,
Chatterjee and Segev [31], who include codes, incomplete information and
recording errors, and Navathe and Savasere [12], who include data type and
scale.
Another way of viewing this is by distinguishing between schema level conflicts
and data level inconsistencies [9]. This notion is elaborated by Kim and Seo [47],
who distinguish between data that has been incorrectly entered, obsolete data
and different representations for the same data. Reddy, Prasad and Gupta [39]
refer to quantitative data incompatibilities which they attribute to different levels of
accuracy, asynchronous updates and lack of security.
The most complex heterogeneity is semantic heterogeneity, which is addressed
by Urban and Wu [52], Colomb and Orlowska [38], Spaccapietra et al. [17], Sheth
and Larson [7], and Hurson and Bright [34]. Solaco, Saltor and Castellanos [25]
also base the classification of semantic heterogeneities on an object-oriented
data model. In [53] the following definition of semantic heterogeneity is given:
“variations among component database systems in the structure, organization,
and conceptual description of information facts (units), units of behaviour
(procedures), and semantic integrity constraints”.
Two forms of semantic heterogeneity in the context of geographic databases have
been identified in [54]. They are generic semantic heterogeneity and contextual
semantic heterogeneity. Generic semantic heterogeneity arises when nodes use
different generic conceptual models of the spatial information. Contextual
semantic heterogeneity is caused by the local environmental conditions at nodes.
In addition, Ventrone and Heiler [55] describe problems of semantic heterogeneity
resulting from domain evolution.
Fankhauser and Neuhold [56] refer to the
problem of ambiguity and distinguish model ambiguity (arising from primitives
such as is-a, instance-of, part-of) and semantic ambiguity. Sheth and Kashyap
[57] include conflicts such as default value conflicts, attribute integrity constraint
conflicts and union compatibility conflicts.
The question of data model heterogeneity is addressed by [58] – [61], and
[1,11,12,34], while Bukhres et al. [4] break the heterogeneity dimension into three
different possible dimensions: model, access, and processing.
Apart from the heterogeneities covered in this chapter, authors have also covered
differences in the database management systems [7], in data models [62], in
query languages and differences at the system level (e.g. concurrency control,
commit protocols and recovery). Ferrier and Stangret [59] include the network and
the operating system and Litwin and Abdellatif [11] physical aspects such as login
procedures, concurrency control, etc.
3.3 FRAMEWORK FOR CLASSIFYING HETEROGENEITIES
This section presents a framework for classifying the different types of heterogeneity
which arise and need to be catered for. In so doing, the classifications are described
in terms of the relational model as mentioned in the previous section.
From the discussion in the previous section, different instances of heterogeneity
can be classified into one or a combination of the following:
1. Naming heterogeneity. This occurs when the same values are stored in
different databases but the names given to the attributes are different in
different systems. These can be handled by a simple (syntactic) attribute
transformation of the query.
2. Relational structure heterogeneity. Here the composition of elementary
attributes into composite structures varies but once again values stored
are identical. This can be handled by a (syntactic) relational transformation
of the query.
3. Value heterogeneity. In this case the way in which values are represented
is different in different databases.
This may involve type and value
transformations.
4. Semantic heterogeneity. This is the most difficult form to deal with as in
this case the data stored in different databases embody different
assumptions, e.g. in what they represent or in how they have been
captured. To quote [63]: “As we move away from systems issues to
semantic issues, we move from well-defined computational paradigms for
symbol manipulation to the issues of meaning and use of data as used by
different applications and by different human data administrators and end
users. We need to deal with multiple (possibly changing) interpretations of
data by different user in different context, data inconsistencies, and
incomplete information.”
5. Data model heterogeneity. Here the data model itself is the issue and
transformations between data models and differences between them are
relevant.
6. Timing heterogeneity. This concerns the changes over time in the
structure of a database, the representation of attributes and the values
themselves. Basically, almost any difference from each of the preceding
categories, which can occur between databases, may also arise within a
single database if it changes with time.
One area not covered in this categorisation is that of recording errors in the data.
Although this is a factor that does create problems, the issues of noisy data are
generally highly dependent on the application and therefore impossible to cover in
any generalised way.
In order to illustrate the different aspects of heterogeneity which follow, four
simple data sets are given in Figures 3.2, 3.3, 3.4 and 3.5. Each contains a
collection of patient record data for patients attending different clinics in different
institutions.
The aim is to link these together to provide a single integrated
information system.
The first data set, Database1, comprises three relations: PAT-REC which stores
basic patient data, VISIT which records details of individual visits to the clinic and
LAB-TEST which stores information on laboratory tests conducted. The second
data set, Database2, is a minor variation on Database1 with essentially the same
three relations. Database3, on the other hand, consists of four relations: PATIENT
which stores patient data, PATIENT-NAME which stores patient names, VISIT
and TEST. Database4 represents data from a paediatric clinic. It consists of three
relations: PATIENT which stores patient data, VISIT which stores information
about home visits, and C-VISIT which records details of individual visits to the clinic.
PAT-REC
NAME | ID | BIRTHDATE | SEX | PHONE
Mark Richard | 529 | 05-01-73 | Male | 031-6220723
Karen Taylor | 319 | 13-04-74 | Female | 0131-3122468
Susan Marshal | 129 | 05-03-76 | Female | NULL

VISIT
ID | WEIGHT | VISIT-DATE | MEDICATION-PRICE | TEST-ID
319 | 75 | 120894 | £5.5 | 3100
129 | 69 | 240292 | £6.0 | 1200
129 | 70 | 030392 | £3.0 | 2400
529 | NULL | 050492 | £8.5 | 4000

LAB-TEST
TEST-ID | TEST-CODE | RESULT
3100 | PNE | 1.3
1200 | BLD | 4.0
2400 | LP8 | 2.2
4000 | BLOOD | 7.5

RELATION NAME | ATTRIBUTE NAME | SEMANTIC | ATTRIBUTE TYPE | NULL | VALUE TYPE
PAT-REC | PHONE | Home telephone number | String | UNKNOWN | NATIONAL-PHONE
VISIT | VISIT-DATE | The date when patient visits the clinic | Int | |
VISIT | WEIGHT | The person’s weight when s/he enters the clinic | Int | | KILOGRAMS
VISIT | MEDICATION-PRICE | The medication price excluding TAX | String | | UK-POUND

Note: Patient 129 does not have a home telephone number. Mark Richard's telephone number is stored with the old area code,
while Karen Taylor's telephone number is stored with the new area code. The code ‘PNE’ represents ‘Pneumonia’. The
tax rate before 1991 was 15% and after this date it became 17.5%.
Figure 3.2 ~ Structure of Database 1
The different types of heterogeneity as shown in Figure 3.6 are given below.
3.3.1 Naming heterogeneity
The simplest form of heterogeneity is associated with concept naming.
This
arises when the same concept is described by two or more names in different
databases (synonyms), or when the same name is used for different concepts
(homonyms). This form of heterogeneity is not concerned with the value which is
stored but merely with the name by which it is accessed.
PATIENT
ID | PATIENT-NAME | PHONE | SEX | DATE
4480 | Peter Brown | 3542311 | M | 23 FEB 80
1280 | Janet Smith | 6248526 | F | 12 MAY 81
6512 | Mark Richard | 5112168 | M | 05 JAN 73
5555 | Karen Taylor | NULL | F | 13 APR 74

VISIT
ID | WEIGHT | DATE | TEST-ID | PRICE
4480 | 180 | AUG 12 90 | 4000 | $10.30
6512 | 150 | APRIL 24 92 | 3010 | $12.0
1280 | 120 | MAY 14 88 | 3020 | $5.0
5555 | 160 | JAN 15 93 | 3030 | NULL

TEST
TEST-ID | CODE | RESULT
3010 | MKT | Normal
4000 | PNE | Above Normal
3020 | BLD | High
3030 | PNE | Normal

RELATION NAME | ATTRIBUTE NAME | SEMANTIC | ATTRIBUTE TYPE | NULL | VALUE TYPE
PATIENT | PHONE | Work telephone number | Int | UNKNOWN | LOCAL-PHONE
VISIT | DATE | The date when patient visits the clinic | String | |
VISIT | WEIGHT | The person’s weight when s/he enters the clinic | Int | | POUNDS
VISIT | PRICE | The medication price excluding TAX | String | Not Applicable | US-DOLLAR

Note: Mark Richard's telephone number was updated on APRIL 24 1992. The code ‘PNE’ represents
‘Pneumoconiosis’.
Figure 3.3 ~ Structure of Database 2
• Naming synonyms
These include the following:
Attribute synonyms
The same attribute may be given different names in different databases. For
example, the attribute NAME in relation PAT-REC in Database1 corresponds to
the attribute PATIENT-NAME in relation PATIENT in Database2.
PATIENT
ID | BIRTHDATE | SEX | PHONE
529 | 05 JAN 73 | Male | 4495111
319 | 13 APRIL 74 | Female | NULL
420 | 12 JUN 80 | Female | NULL

PATIENT-NAME
ID | FIRST-NAME | SURNAME | MAIDEN-NAME
319 | Karen | Taylor | Thomas
529 | Mark | Richard | NULL
420 | Diana | Steven | Adam

VISIT
ID | DATE | WEIGHT | PRICE | T-ID
319 | 24 JUN 87 | 74.6 | $12 | 12b
529 | 03 APRIL 89 | 68.2 | $9 | 2FC
529 | 15 MAY 89 | 67.4 | $5 | 13f
420 | 13 APRIL 90 | 70.6 | $8 | 7N

TEST
T-ID | CODE | RESULT
12b | EF6 | A
2FC | LP8 | C
13f | PNE | B
7N | EF7 | C

RELATION NAME | ATTRIBUTE NAME | SEMANTIC | ATTRIBUTE TYPE | NULL | VALUE TYPE
PATIENT | PHONE | Work telephone number | Int | UNKNOWN | LOCAL-PHONE
PATIENT-NAME | MAIDEN-NAME | The surname before marriage | String | Not applicable |
VISIT | DATE | The date when patient visits the clinic | String | |
VISIT | WEIGHT | The person’s weight when s/he enters the clinic | Float | | KILOGRAMS (to nearest tenth)
VISIT | PRICE | The medication price including TAX | String | | US-DOLLAR

Note: Mark Richard's telephone number was updated on MAY 15 1989. Patient phone number was not
compulsorily captured until 01/01/1990. So, NULL prior to this date represents 'unknown' and after this date
represents 'no telephone number'. The code ‘PNE’ represents ‘Pneumonia’.
Figure 3.4 ~ Structure of Database 3
Relation (Table) synonyms
The same relation may be represented by different names in different databases.
For example, the relation LAB-TEST in Database1 corresponds to the relation
TEST in Database2.
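The synonym examples above can be sketched as a simple renaming step. This is a minimal illustration in Python, not the mechanism used in the thesis: the two lookup tables and the function name `translate` are hypothetical, and the pairing of TEST-CODE with CODE is inferred from Figures 3.2 and 3.3.

```python
# Hypothetical synonym tables mapping Database1 names to Database2 names.
RELATION_SYNONYMS = {"PAT-REC": "PATIENT", "LAB-TEST": "TEST"}
ATTRIBUTE_SYNONYMS = {
    ("PAT-REC", "NAME"): "PATIENT-NAME",
    ("LAB-TEST", "TEST-CODE"): "CODE",  # inferred from the figures
}


def translate(relation, attributes):
    """Rewrite a (relation, attributes) request from Database1 to Database2 names."""
    target_relation = RELATION_SYNONYMS.get(relation, relation)
    target_attributes = [
        ATTRIBUTE_SYNONYMS.get((relation, name), name) for name in attributes
    ]
    return target_relation, target_attributes


# A request for NAME and SEX from PAT-REC becomes a request on PATIENT.
print(translate("PAT-REC", ["NAME", "SEX"]))
```

Names without an entry in the tables are passed through unchanged, which matches the case where both databases already use the same name.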
PATIENT
P-NAME | ID | BIRTHDATE | SEX | PHONE
Alex Brown | 9211 | 120393 | M | 3542311
Sue Peter | 9345 | 231195 | F | 6248526
Mark Smith | 9289 | 050496 | M | 5112168

VISIT
ID | V-NAME | DATE | BLOOD PRESSURE | COMMENT
9211 | Morton Pattie | OCT 03 97 | 70:140 | NO CHANGE
9345 | Fernie Garcia | APRIL 12 96 | 80:140 | IMPROVING
9289 | Dines Douglas | JAN 24 96 | 80:160 | IMPROVING
9345 | Ross Scott | APRIL 23 96 | 90:140 | NO CHANGE

C-VISIT
ID | DATE | BLOOD PRESSURE | WEIGHT
9345 | 06 JAN 96 | 90:140 | 30.5
9211 | 03 MAR 96 | 80:140 | 35.5
9289 | 15 SEP 97 | 90:160 | 48.5

RELATION NAME | ATTRIBUTE NAME | SEMANTIC | ATTRIBUTE TYPE | NULL | VALUE TYPE
PATIENT | P-NAME | The patient’s name | String | |
PATIENT | PHONE | The person’s contact phone number | Int | UNKNOWN | LOCAL-PHONE
VISIT | ID | The home visit’s identification | Int | |
VISIT | V-NAME | The nurse’s name who visits patient at home | String | |
VISIT | DATE | The date when the nurse visits the patient at home | String | |
VISIT | BLOOD PRESSURE | The blood pressure measured at home | | |
C-VISIT | ID | The clinic visit’s identification | Int | |
C-VISIT | DATE | The date when patient visits the clinic | String | |
C-VISIT | BLOOD PRESSURE | The blood pressure measured at the clinic | String | |
C-VISIT | WEIGHT | The person’s weight when s/he enters the clinic | Float | | KILOGRAMS

Figure 3.5 ~ Structure of Database 4
• Naming homonyms
These include the following:
Attribute homonyms
Two attributes with the same name occurring in different databases represent
different things. For example, the attribute DATE occurring in relation VISIT of
Database2 is different from DATE in relation VISIT of Database4. Although they
have the same name, they represent different concepts.
Relation (Table) homonyms
Relations with the same name occurring in different databases contain different
things. For example, the relation VISIT in Database1 records details of individual
visits to the clinic, while the relation VISIT in Database4 stores information about
visits to the patient’s home.
Attribute-Relation homonyms (or entity-class homonyms)
An attribute in one database has the same name as a relation in another
database. For example, PATIENT-NAME is an attribute of relation PATIENT in
Database2, but it is a relation in Database3.
3.3.2 Relational structure heterogeneity
This form of heterogeneity arises when the way in which attributes are composed
into relations in one database is different from that of another. Once again this
form of heterogeneity is not concerned with the values of attributes, but merely
how they are assembled into relations.
• Relation size
In this case relations with the same name have different numbers of attributes in
different databases, and thus are not union-compatible. For example, relation
PATIENT in Database2 has five attributes whereas relation PATIENT in
Database3 has only four.
3.3.3 Value heterogeneity
This form of heterogeneity is concerned with the way in which the values of a
concept are represented.
It is possible that different instances of the same
concept occurring in different databases may be represented in different ways.
• Numeric – numeric
Different units - fixed conversion
This arises when different databases use different units for the same data
element. For example, an attribute WEIGHT in the VISIT relation of Database1 is
expressed in kilograms whereas in Database2 it is expressed in pounds. This
represents a straightforward conversion from one set of units to another.
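A minimal sketch of such a fixed conversion, assuming Python and the standard definition of the international pound (1 lb = 0.45359237 kg). The function names are illustrative only.

```python
KG_PER_LB = 0.45359237  # exact by definition of the international pound


def pounds_to_kilograms(pounds):
    """Convert a Database2-style weight (pounds) to kilograms."""
    return pounds * KG_PER_LB


def kilograms_to_pounds(kilograms):
    """Convert a Database1-style weight (kilograms) to pounds."""
    return kilograms / KG_PER_LB


# The first WEIGHT value in relation VISIT of Database2 is 180 (pounds).
print(round(pounds_to_kilograms(180), 1))
```

Because the factor never changes, the conversion can be applied to any stored value without further context.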
Different units- time varying conversion
As an example, consider the MEDICATION-PRICE/PRICE attributes in the VISIT
relations which in Database1 contains values expressed in pounds sterling and in
Database2 contains values expressed in US dollars. This is also a conversion but
the conversion factor varies with time and a conversion factor must be chosen for
an appropriate instant of time.
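The dependence on time can be sketched as a lookup of the rate in force on a given date. The exchange rates below are invented purely for illustration (they are not historical figures and do not come from the thesis), as are the names `RATES` and `usd_to_gbp`.

```python
from bisect import bisect_right
from datetime import date

# HYPOTHETICAL US dollars per pound sterling, each effective from the given date.
RATES = [
    (date(1990, 1, 1), 1.9),
    (date(1992, 1, 1), 1.8),
    (date(1993, 1, 1), 1.5),
]


def usd_to_gbp(amount_usd, on):
    """Convert dollars to pounds using the rate in force on the date `on`."""
    effective_dates = [start for start, _ in RATES]
    index = bisect_right(effective_dates, on) - 1
    if index < 0:
        raise ValueError(f"no exchange rate known for {on}")
    return amount_usd / RATES[index][1]


# A price recorded on 24 April 1992 is converted with the 1992 rate.
print(usd_to_gbp(12.0, date(1992, 4, 24)))
```

The same price converts to different sterling amounts depending on the date supplied, which is why the conversion cannot be stored as a single constant.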
Units- other conversions
Apart from the standard conversions of the previous two subsections, several
irregular conversions arise.
For example, the telephone number value in the
PHONE attribute in relation PAT-REC of Database1 is represented with area
codes whereas in the PHONE attribute in relation PATIENT of Database2 it is
represented without area codes.
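One possible way of reconciling the two representations is to strip the area code from the national form. The sketch below assumes, from the values in Figure 3.2, that the area code is separated from the local number by a hyphen. The function name is hypothetical.

```python
def to_local_phone(national):
    """Reduce a national number such as '0131-3122468' to its local part as an integer."""
    if national is None:  # NULL stays NULL
        return None
    # rpartition keeps the whole string as the local part when no hyphen is present.
    _, _, local = national.rpartition("-")
    return int(local)


print(to_local_phone("031-6220723"))
print(to_local_phone("0131-3122468"))
```

The reverse direction is harder, since restoring an area code requires knowledge that is not held in the local number itself.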
- Naming Heterogeneity (N.HG)
    - Naming Synonyms (N.S)
        - Attribute Synonyms (A.S)
        - Relation (Table) Synonyms (R.S)
    - Naming Homonyms (N.H)
        - Attribute Homonyms (A.H)
        - Relation (Table) Homonyms (R.H)
        - Attribute_Relation Homonyms (Entity-Class Homonyms) (A-R H)
- Relational Structure Heterogeneity (R.S.H)
    - Relation Size (R.Z)
- Value Heterogeneity (V.H)
    - Numeric-Numeric (N.N)
        - Different Units- Fixed Conversion (D.U-F.C)
        - Different Units- Time Varying Conversion (D.U-T.V.C)
        - Units- Other Conversion (U-O.C)
        - Granularity (G)
        - Composition of Values in a Single-Valued Attribute (C.V.S-V A)
    - String-String (S.S)
        - Value Synonyms (V.S)
        - Value Homonyms (V.h)
        - Different String Formats (D.S.F)
    - Numeric-String (N.s)
        - Simple Conversion (S.C)
    - Structures (S)
    - Incomplete Information (I.I)
- Semantic Heterogeneity (S.H)
    - What the data represents (W.D.R)
    - Context in which data is captured (C.C)
    - Difference in Abstraction Level (D.A.L)
- Data Model (D.M)
    - Paradigm Heterogeneity (P.H)
    - Behavioural Differences (B.D)
    - Dependency Conflicts (D.C)
    - Differences in Constraints (D.I.C)
    - Default Value (D.V)
    - Relation Keys (R.K)
- Timing Heterogeneity (T.H)
    - Domain Evolution (D.E)
    - Inconsistencies due to asynchronous updates (I.D.A.U)
Figure 3.6 ~ The Classification Diagram
Granularity
This form of heterogeneity arises when data elements representing a particular
measurement differ in their level of granularity. For example, the WEIGHT
attribute value in the VISIT relation of Database1 is stored to the nearest kilogram
while in Database3 it is stored to the nearest tenth of a kilogram.
Composition of values in a single-valued attribute
Sometimes a value consists of two or more components which are directly
related. A classic example is that of the price of an object or service, which may
be given inclusive or exclusive of tax. Similarly, prices in a restaurant may be
inclusive or exclusive of service charge. As an example of this form of
heterogeneity, consider the attribute PRICE of relation VISIT in Database3, which
gives the price of the medicine including tax, whereas the attribute PRICE in
relation VISIT in Database2 gives the medication price without tax.
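Separating the components can be sketched using the tax rates quoted in the note to Figure 3.2 (15% before 1991, 17.5% afterwards). The exact changeover day is an assumption made here, whether those rates apply to every database is not stated in the source, and the function names are illustrative.

```python
from datetime import date

# ASSUMPTION: the change takes effect on 1 January 1991; the source only says "before 1991".
TAX_CHANGE = date(1991, 1, 1)


def tax_rate(on):
    """Tax rate in force on the date `on`, per the note to Figure 3.2."""
    return 0.15 if on < TAX_CHANGE else 0.175


def price_including_tax(net, on):
    """Compose a tax-inclusive price from a tax-exclusive one."""
    return net * (1 + tax_rate(on))


def price_excluding_tax(gross, on):
    """Recover the tax-exclusive price from a tax-inclusive one."""
    return gross / (1 + tax_rate(on))


# A net price of 6.0 recorded on 24 February 1992 attracts the 17.5% rate.
print(price_including_tax(6.0, date(1992, 2, 24)))
```

Since the rate depends on the date, this case combines value composition with a time-varying factor.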
• String – string
Value synonyms
This occurs when the values of an attribute are represented as strings but a
slightly different set of values is used in different databases. As an example, the
value of the SEX attribute in the PAT-REC relation of Database1 is stored as
Male or Female, while in attribute SEX in relation PATIENT of Database2 it is
stored as M or F.
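Value synonyms of this kind can be reconciled with a lookup table that maps every known spelling to one canonical value. The table and function below are a hypothetical sketch and the choice of 'Male'/'Female' as the canonical form is arbitrary.

```python
# Every known spelling (upper-cased) mapped to one canonical value.
SEX_CANONICAL = {"M": "Male", "MALE": "Male", "F": "Female", "FEMALE": "Female"}


def normalise_sex(value):
    """Map 'M', 'F', 'Male' or 'Female' (any case) to a canonical string."""
    if value is None:  # NULL stays NULL
        return None
    return SEX_CANONICAL[value.strip().upper()]


print(normalise_sex("M"), normalise_sex("Female"))
```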
Value homonyms
The value ‘PNE’ occurring in attribute TEST-CODE in relation LAB-TEST of
Database1 represents ‘Pneumonia’ but ‘PNE’ represents ‘Pneumoconiosis’ in
attribute CODE in relation TEST of Database2.
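Because the same code carries different meanings, a code must always be interpreted together with the database it came from. A minimal sketch, with a hypothetical per-database code table holding only the meanings stated in the notes to Figures 3.2, 3.3 and 3.4:

```python
# Meanings of test codes, keyed by source database (only 'PNE' is documented).
CODE_MEANINGS = {
    "Database1": {"PNE": "Pneumonia"},
    "Database2": {"PNE": "Pneumoconiosis"},
    "Database3": {"PNE": "Pneumonia"},
}


def decode(database, code):
    """Return the documented meaning of a code in a given database, or None."""
    return CODE_MEANINGS[database].get(code)


def same_concept(db_a, code_a, db_b, code_b):
    """True only if both codes have a documented meaning and the meanings agree."""
    meaning_a = decode(db_a, code_a)
    meaning_b = decode(db_b, code_b)
    return meaning_a is not None and meaning_a == meaning_b


print(same_concept("Database1", "PNE", "Database2", "PNE"))
print(same_concept("Database1", "PNE", "Database3", "PNE"))
```

Comparing the raw strings alone would wrongly treat the Database1 and Database2 values as equal.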
Different string formats
These arise when different databases use different string formats for the same
element. The most common occurrence of this is in date representation; for
example the attribute DATE in relation VISIT of Database3 is represented as Day
Month Year, whereas the attribute DATE in relation VISIT of Database2 is
represented as Month Day Year.
Other forms might include “MM-DD-YY”,
“DD/MM/YY”, “DDMMYY”, “MMDDYY”, “YYYYMMDD” and so on.
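Parsing the two layouts into a common date value can be sketched with the Python standard library. The format table is an assumption derived from the sample values in Figures 3.3 and 3.4 (month names appear both abbreviated and in full), and the sketch relies on English month names being available in the current locale.

```python
from datetime import datetime

# Candidate layouts per database: abbreviated month name first, then full month name.
DATE_FORMATS = {
    "Database2": ("%b %d %y", "%B %d %y"),  # e.g. 'AUG 12 90', 'APRIL 24 92'
    "Database3": ("%d %b %y", "%d %B %y"),  # e.g. '24 JUN 87', '03 APRIL 89'
}


def parse_visit_date(database, text):
    """Parse a DATE string from relation VISIT into a datetime.date."""
    for layout in DATE_FORMATS[database]:
        try:
            return datetime.strptime(text, layout).date()
        except ValueError:
            continue  # try the next candidate layout
    raise ValueError(f"unrecognised date {text!r} for {database}")


print(parse_visit_date("Database2", "APRIL 24 92"))
print(parse_visit_date("Database3", "24 JUN 87"))
```

Two-digit years are interpreted by `strptime` according to its own pivot rule, so a real system would need an explicit policy for the century.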
• Numeric – string
Simple conversion
This arises when the same attribute is defined in terms of different data types in
different databases. For example, the PHONE attribute in relation PAT-REC of
Database1 is of type ‘string’, whereas in relation PATIENT of Database2 it is of
type integer.
The date problem described in the previous section also arises here; for example
the VISIT-DATE attribute value in relation VISIT of Database1 is stored as a
numeric value while the DATE attribute value in relation VISIT of Database2 is
stored as a string.
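A numeric date loses its leading zero once it is held as an integer, so the conversion has to restore it before parsing. The sketch below assumes the VISIT-DATE values in Figure 3.2 are laid out as DDMMYY, which is inferred from the value 240292 (24 cannot be a month). The function name is illustrative.

```python
from datetime import datetime


def int_to_visit_date(value):
    """Convert an integer such as 30392 (displayed as 030392) to a date, assuming DDMMYY."""
    text = str(value).zfill(6)  # restore any leading zero lost in the integer form
    return datetime.strptime(text, "%d%m%y").date()


print(int_to_visit_date(30392))
print(int_to_visit_date(120894))
```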
• Structures
These arise when different databases use different formats for the same element.
For example, in Database2 the name of the patient is represented as a single
attribute in relation PATIENT whereas in Database3, it is represented as a pair of
separate attributes in the relation PATIENT-NAME.
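Moving between the two structures can be sketched as follows. Composing a full name is straightforward, but splitting one is only a heuristic: the sketch assumes the surname is the last space-separated word, which will not hold for every name. Both function names are hypothetical.

```python
def compose_name(first_name, surname):
    """Build a single PATIENT-NAME value from FIRST-NAME and SURNAME."""
    return f"{first_name} {surname}"


def split_name(full_name):
    """Heuristically split a full name; ASSUMES the surname is the last word."""
    first_name, _, surname = full_name.rpartition(" ")
    return first_name, surname


print(compose_name("Karen", "Taylor"))
print(split_name("Mark Richard"))
```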
• Incomplete information
The meaning of NULL varies amongst databases (unknown, not applicable,
unavailable). For example, when the value of an attribute MAIDEN-NAME is
NULL, this is interpreted as not applicable if attribute SEX is ‘MALE’. However, if
SEX is ‘FEMALE’, it could mean either that there is no maiden name or that the
maiden name is unknown. If the AGE attribute value is NULL, this is taken as an
unknown value. On the other hand, if the PHONE attribute is NULL, as in
Database1 and Database3, this may mean either not applicable or unknown.
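The note to Figure 3.4 gives a concrete rule for Database3: a NULL phone number means 'unknown' before 01/01/1990 and 'no telephone number' from that date on. A minimal sketch of that rule, assuming the date on which the record was captured is available to the caller (the source does not say where that date is held):

```python
from datetime import date

COMPULSORY_FROM = date(1990, 1, 1)  # from the note to Figure 3.4


def interpret_phone(phone, captured_on):
    """Classify a Database3 PHONE value given the (assumed) capture date of the record."""
    if phone is not None:
        return "known"
    if captured_on < COMPULSORY_FROM:
        return "unknown"
    return "no telephone number"


print(interpret_phone(None, date(1989, 5, 15)))
print(interpret_phone(None, date(1990, 4, 13)))
```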
3.3.4 Semantic Heterogeneity
Ter Bekke [64] defines semantics as the discipline which deals with relationships
between words and the things to which these words refer. In database modelling,
semantics is concerned with the study of the meaning and relationship between real
world features and database objects [3]. This form of heterogeneity occurs when
there are differences in what the data actually represents or the context in which the
data has been captured in different databases. The semantic heterogeneity can be
classed as:
• What the data represents
As an example, the PHONE attribute in relation PAT-REC of Database1 is a
home phone number, whereas the PHONE attribute in relation PATIENT of
Database4 is a contact phone number, which may or may not be the home phone
number. The two concepts are related but not necessarily identical.
• Context in which data is captured
As an example, consider blood pressure. If blood pressure is measured at home
by a nurse the measurement may be significantly lower than that obtained in the
clinic by a doctor (so-called ‘white coat’ syndrome). In the case of Database4 the
blood pressure in relation VISIT is measured at home by a nurse, whereas in
relation C-VISIT it is measured in the clinic.
Equally, one would like to know whether a measurement may be affected by other
conditions (e.g. if a patient being examined for condition X is also suffering from
condition Y at the same time).
• Difference in abstraction level
The requirements of different local DBMSs may cause objects to be modelled at
different levels of abstraction. For example, the attribute RESULT in relation
LAB-TEST of Database1 describes the result of a test on the scale 0 to 10 whereas
attribute RESULT in relation TEST of Database2 describes the result in terms of
values {Low, Normal, Above Normal, High}.
As another example, the attribute TELEPHONE-NUMBER may contain a home telephone
number [23].
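
The RESULT example can be made concrete with a small sketch that maps the numeric 0 to 10 scale onto the four categories. The cut-off values are assumptions made for illustration: only the 4.0 boundary echoes the "> 4.0" threshold used for a high result in query Q6, and the other two are placeholders that would in practice have to be supplied as metadata. The function name is hypothetical.

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds for every category except the last.
# Only 4.0 echoes the "> 4.0" threshold in Q6; 2.0 and 3.0 are placeholders.
CUTOFFS = [2.0, 3.0, 4.0]
LABELS = ["Low", "Normal", "Above Normal", "High"]


def result_to_category(score: float) -> str:
    """Map a numeric RESULT (scale 0 to 10) onto the categorical RESULT domain."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score {score} is outside the 0-10 scale")
    # bisect_left returns the index of the first cut-off >= score,
    # which is also the index of the matching label.
    return LABELS[bisect_left(CUTOFFS, score)]
```

The mapping is lossy in the other direction: a category such as Normal corresponds to a range of scores rather than a single value, so a query phrased on the numeric scale can only be answered approximately against the categorical database.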
3.3.5 Data Model Heterogeneity

Paradigm heterogeneity
Local database systems may employ different paradigms, such as relational,
hierarchical, object-oriented, or entity-relationship. This framework focuses on
the relational database model and has not yet been extended to consider these
other models.

Behavioural differences
These arise when different insertion/deletion policies are associated with the
same class of objects in distinct schemas. A record type may have constraints on
the total number of occurrences, or on the insertions and deletions of records.
For example, one database may require the details of a patient’s visit to hospital
to be kept for a minimum of 10 years before they can be deleted, whereas another
may require them to be kept for only 5 years.

Dependency conflicts
These arise when a group of concepts are related to one another by different
dependencies in different schemas. For example, a relationship between two
concepts may be 1:1 in one database, whereas in another it may be
1:n.

Differences in constraints
The data model may support different constraints. For example, in Database4 the
patients are all children and hence the attribute BIRTHDATE in relation PATIENT
is constrained to dates consistent with this (e.g. less than 10 years of age). On the
other hand, the corresponding attribute BIRTHDATE in relation PAT-REC in
Database1 has no such constraint.

Default value
This form of heterogeneity occurs when there are different definitions of the
attribute domain. Two attributes might have different default values in different
databases. For example, when inserting a new VISIT record the default value for
VISIT-DATE in Database1 may be the current date whereas the default value for
DATE in Database2 may be NULL.

Relation keys
In this case, equivalent relations in different databases may have different
attributes as keys which can affect updates to these relations.
3.3.6 Timing Heterogeneity

Domain evolution
This problem occurs when the semantics of values of a domain change over time.
This includes many of the different kinds of heterogeneity already described. For
example, the form used to represent a value may change over time. An example
of this is the change in telephone code occurring in Database1 where the area
code changed from ‘031’ to ‘0131’ at a particular point in time. Other forms of
domain evolution include changes in composition of values (e.g. when the tax
rules changed), changes in granularity, changes in string representations resulting
from changes in coding systems, changes in cardinality, etc.
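
As a minimal sketch of how the area-code change might be handled, the function below rewrites numbers stored under the old ‘031’ code to the new ‘0131’ form. It assumes that the area code is stored as a prefix of the PHONE string and that no other valid prefix begins with ‘031’. Both are assumptions, as is the function name, and the sample numbers in the usage note are invented.

```python
def normalise_area_code(phone: str) -> str:
    """Return the digits of `phone`, with an old '031' prefix rewritten to '0131'."""
    # Drop spaces, hyphens and other punctuation so that only digits remain.
    digits = "".join(ch for ch in phone if ch.isdigit())
    if digits.startswith("031"):
        return "0131" + digits[3:]
    # Numbers already using '0131', or stored without an area code, are unchanged.
    return digits
```

For instance, `normalise_area_code("031 555 0100")` and `normalise_area_code("0131 555 0100")` both yield the same string, so values recorded before and after the change compare as equal.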

Inconsistencies due to asynchronous updates
These happen when data items are replicated in different databases, get updated
at different points in time and become inconsistent. For example, the PHONE
attribute in relation PATIENT of Database2 for Mark Richard has been updated
without a corresponding update to attribute PHONE in relation PATIENT of
Database3, and so the two attribute values become inconsistent.
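
A simple way to detect this kind of inconsistency is to compare the replicated attribute across the two copies. The sketch below assumes each PATIENT relation has already been reduced to a mapping from patient name to PHONE value; this representation and the function name are hypothetical.

```python
from typing import Dict, List


def find_inconsistent_phones(db2: Dict[str, str], db3: Dict[str, str]) -> List[str]:
    """Return, in sorted order, the patients whose PHONE differs between the copies."""
    # Only patients present in both databases can be compared.
    shared = db2.keys() & db3.keys()
    return sorted(name for name in shared if db2[name] != db3[name])
```

Detecting a mismatch does not by itself show which copy is current; that would need additional information such as an update timestamp, which is not assumed here.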
3.4
THE TEST SUITE
For the test suite, the following five simple queries have been selected.
Q1:
Find the test code and result for Karen Taylor
This query tests for attribute synonyms (e.g. NAME/PATIENT-NAME), attribute-relation
homonyms (e.g. PATIENT-NAME), relation synonyms (e.g. LAB-TEST/TEST), value
homonyms (e.g. meaning of PNE), structures (e.g. PATIENT-NAME), relation size
(PATIENT has 4 or 5 attributes), and difference in abstraction level (RESULT).
Q2:
Find the telephone numbers for patients born before 01/01/1981
This query involves semantic heterogeneity – what data represents (home vs.
work number), different units – other conversion (with or without area code),
incomplete information (meaning of NULL), different string formats (for
BIRTHDATE), relation synonyms (PAT-REC/PATIENT), relation size (PATIENT),
domain evolution (area code and changed meaning of NULL), and inconsistencies
due to asynchronous updates (the attribute PHONE in relation PATIENT of
Database2 is updated for Mark Richard without a corresponding update to the
attribute PHONE in relation PATIENT of Database3).
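
The string-format aspect of Q2 can be sketched as follows. The three BIRTHDATE formats are assumptions chosen for illustration, since the formats actually used by each database are not listed here, and the function names are hypothetical.

```python
from datetime import date, datetime

# Hypothetical BIRTHDATE formats; the real ones would be recorded as
# metadata for each database.
BIRTHDATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%Y%m%d")


def parse_birthdate(text: str) -> date:
    """Try each known format in turn and return the first successful parse."""
    for fmt in BIRTHDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised BIRTHDATE format: {text!r}")


def born_before(text: str, cutoff: date = date(1981, 1, 1)) -> bool:
    """Q2 selection condition: was the patient born before the cutoff date?"""
    return parse_birthdate(text) < cutoff
```

A string such as 01/02/1980 is ambiguous between day-first and month-first order. The sketch assumes day-first, which is exactly the kind of information that has to be recorded explicitly for each source.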
Q3:
Find weights of all male patients weighed within the last year
The query covers fixed conversion between different units (WEIGHT – pounds vs.
kilograms), different granularity (kilograms vs. tenths of a kilogram), value
synonyms (Male/Female vs. M/F), relation size, relation homonyms (VISIT), relation
synonyms (PATIENT/PAT-REC) and numeric-string conversion (VISIT-DATE).
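
The fixed conversions and value synonyms in Q3 can be illustrated with a short sketch that brings weights and sex codes into one common form. The unit tags ("lb", "kg", "tenth_kg") and function names are hypothetical, and which database uses which unit is assumed to be known from metadata.

```python
LB_TO_KG = 0.45359237  # the international pound is defined as exactly this many kg

# Value synonyms for the sex attribute, keyed by lower-cased source value.
SEX_SYNONYMS = {"male": "M", "m": "M", "female": "F", "f": "F"}


def weight_to_kg(value: float, unit: str) -> float:
    """Convert a stored WEIGHT to kilograms; `unit` is a hypothetical metadata tag."""
    if unit == "kg":
        return float(value)
    if unit == "tenth_kg":  # finer granularity: weight stored in tenths of a kg
        return value / 10.0
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown weight unit: {unit!r}")


def normalise_sex(value: str) -> str:
    """Map Male/Female and M/F onto the single-letter code."""
    return SEX_SYNONYMS[value.strip().lower()]
```

With both attributes normalised, a condition such as "male patients" can be evaluated in the same way against every database.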
Q4:
Find the price of medication for patient 529
This query involves time-varying conversion between different units (Pounds vs.
Dollars), composition of values in a single-valued attribute (PRICE with or without
TAX), and domain evolution (TAX rate).
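
Unlike the weight conversion, the conversions in Q4 depend on the date to which a price applies. The sketch below shows one possible way to look up a date-dependent exchange rate and tax rate. All rates and dates in the tables are placeholders chosen to keep the arithmetic simple, not real figures, and the function names are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# (effective_from, value) pairs sorted by date. All figures are placeholders.
USD_PER_GBP = [(date(1995, 1, 1), 2.0), (date(2000, 1, 1), 1.5)]
TAX_RATE = [(date(1990, 1, 1), 0.10), (date(1998, 1, 1), 0.20)]


def rate_on(table, day: date) -> float:
    """Return the value in force on `day` from a sorted (date, value) table."""
    dates = [start for start, _ in table]
    index = bisect_right(dates, day) - 1
    if index < 0:
        raise ValueError(f"no rate recorded on or before {day}")
    return table[index][1]


def price_in_gbp_with_tax(amount: float, currency: str,
                          includes_tax: bool, day: date) -> float:
    """Express a PRICE in pounds, tax included, using the rates in force on `day`."""
    if currency == "GBP":
        gbp = amount
    elif currency == "USD":
        gbp = amount / rate_on(USD_PER_GBP, day)
    else:
        raise ValueError(f"unknown currency: {currency!r}")
    if not includes_tax:
        gbp *= 1 + rate_on(TAX_RATE, day)
    return round(gbp, 2)
```

Keeping the rates in dated tables means that a change in the tax rate, which is an instance of domain evolution, is handled by adding a row rather than by changing the conversion logic.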
Q5:
Find the blood pressure for Alex Brown
This query covers semantic heterogeneity – context in which data is captured
(BLOOD PRESSURE), and attribute homonyms (DATE).
A single query which covers most of the range of heterogeneities in the test suite
is as follows:
Q6:
Find the name, date of birth and telephone number of every male
patient who has had a high result (> 4.0) for test PNE and whose weight
exceeds 180 pounds.
3.5
SUMMARY
Much research has been carried out on the problem of accessing heterogeneous
distributed database systems and a range of different aspects of heterogeneity
has been identified by different authors. This chapter presents a framework for
classifying the different types of heterogeneity, which brings together the different
aspects of heterogeneity addressed by these authors. A summary of this
classification is given in Figure 3.6.
An overview of some of the work done by different researchers on the problem of
heterogeneity is given in Section 3.2. A summary of how the different concepts
covered by different authors fit into the proposed framework is given in Table 3.1.
From this framework a test suite has been developed which can be used to
evaluate and compare the extent to which different approaches handle different
aspects of this heterogeneity. A major advantage of this test suite is that it
consists of four small databases and a small set of queries, all of which are easy
to implement. Using it, all aspects of heterogeneity identified in the framework are
covered, with the exception of data model heterogeneity.
This classification is based on the relational model, although it could easily be
adapted to other paradigms. Such a framework can aid database designers and
researchers working on the integration of heterogeneous databases.
REFERENCES
[45]
M.A. Casanova, V.M.P. Vidal, Towards a sound view integration methodology, Second ACM
SIGACT/SIGMOD conference on principles of database systems, Atlanta, Ga., ACM, New York, Mar.
21-23, (1983) 36-47.
[46]
B. Kahn, A structured logical database design methodology, Ph.D. Dissertation, Department
of Computer Science, University of Michigan, Ann Arbor, Mich, (1979).
[47]
W. Kim, J. Seo, Classifying schematic and data heterogeneity in multidatabase systems,
Computer 24 (12) (1991) 12-18.
[48]
R. Elmasri, J. Larson, S.B. Navathe, Integration algorithms for federated databases and
logical database design, Tech. Rep. Honeywell Corporate Research Center, (1987).
[49]
M.V. Mannino, W. Effelsberg, A methodology for global schema design, Tech. Rep. No. TR-
84-1, Department of Computer and Information Science, University of Florida, (1984).
[50]
M. Kaul, K. Drosten, E.J. Neuhold, View System: integrating heterogeneous information
bases by object-oriented views, IEEE 6th International Conference on Data Eng., Los Angeles, (1990)
2-10.
[51]
S.M. Deen, R.R. Amin, M.C. Taylor, Data integration in distributed databases, IEEE Trans.
Softw. Eng. SE-13 (7) (1987) 860-864.
[52]
S.D. Urban, J. Wu, Resolving semantic heterogeneity through the explicit representation of
data model semantics, Sigmod Record 20 (4) (1991) 55-58.
[53]
P. Drew, R. King, D. McLeod, M. Rusinkiewicz, A. Silberschatz, Report of the Workshop
on Semantic Heterogeneity and Interoperation in Multidatabase Systems, Sigmod Record
22 (3) (1993) 47-56.
[54]
M. F. Worboys and S. M. Deen, Semantic Heterogeneity in Distributed Geographic
Databases, SIGMOD RECORD, Vol.20, No.4, December 1991, pp. 30-34
[55]
V. Ventrone, S. Heiler, Semantic heterogeneity as a result of domain evolution, Sigmod
Record 20 (4) (1991) 16-20.
[56]
P. Fankhauser, E.J. Neuhold, Knowledge based integration of heterogeneous databases,
Interoperable Database Systems (DS-5) (A-25), D.K. Hsiao, E.J. Neuhold and R. Sacks-Davis
(Editors), Elsevier Science Publishers B. V. North-Holland, (1993) 155-175.
[57]
A. Sheth, V. Kashyap, So Far (Schematically) yet, So Near (Semantically), Interoperable
Database Systems (DS-5) (A-25), D.K. Hsiao, E.J. Neuhold and R. Sacks-Davis (Editors), Elsevier
Science Publishers B. V. North-Holland, (1993) 283-311.
[58]
A.F. Cardenas, Heterogeneous distributed database management: The HD-DBMS,
Proceedings of the IEEE, 75 (5), (1987) 588-600.
[59]
A. Ferrier, C. Stangret, Heterogeneity in the distributed database management system Sirius-
Delta, Eighth Int. Conf. on Very Large Data Bases, Mexico City, (1982) 45-53.
[60]
S. Spaccapietra, C. Parent, Conflicts and correspondence assertions in interoperable
databases, Sigmod Record 20 (4) (1991) 49-54.
[61]
Y. Breitbart, P.L. Olson, G.R. Thompson, Database integration in a distributed heterogeneous
database system, Second IEEE Data Eng. Int. Conf., CS Press, Los Alamitos, Calif., Order
No. 655, (1986) 301-310.
[62]
D.K. Hsiao, M.N. Kamel, Heterogeneous databases: proliferations, issues, and solutions
(Invited Paper), IEEE Trans. on Know. and Data Eng. 1 (1) (1989) 45-62.
[63]
A. Sheth. Special Issue on Semantic Heterogeneity. ACM SIGMOD Record 20 (4)
December, 1991.
[64]
Ter Bekke, J. H., 1991. “Semantic Data Modeling in Relational Environments”. Ph.D. Thesis,
University of Delft.
[65]
S. Al-Fedaghi, P. Scheuermann, Mapping considerations in the design of schemas for the
relational model, IEEE Trans. Softw Eng. SE-7 (1) (1981) 99-111.
[66]
S.B. Yao, V.E. Waddle, B.C. Housel, View modelling and integration using the functional data
model, IEEE Trans. Softw. Eng. SE-8 (6) (1982) 544-553.
[67]
T. Teorey, J. Fry, Design of database structures, Prentice-Hall, Englewood Cliffs, N.J., 1982.