Concept Hierarchies for Database Integration in a
Multidatabase System
Pauray S.M. Tsai and Arbee L.P. Chen
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, R.O.C.
Abstract
Since relational DBMSs are widespread, it is significant to provide a relational global view
through which users who are familiar only with the relational model can access data in a
multidatabase system. In this paper, we propose an approach
to integrate database schemas into a relational global view which consists of
concept hierarchies. A concept hierarchy is composed of relation schemes with
the same concept, to which specialization and generalization can be specified to
enrich the semantics of the relational view. Based on the concept hierarchies, a
procedure is designed to decompose a global query into local subqueries, and a set
of transformation rules is developed to transform a query into an equivalent
one for query optimization.
1 Introduction
Because of the increasing need for data sharing among multiple databases, the development
of multidatabase systems [ASD91, SBD81] has been considered an important research
issue [Br90]. There are two approaches to manipulating data in a multidatabase system.
One is to provide users with a logically integrated global view. It presents a high level
of transparency and a uniform interface for the user to retrieve data in the multidatabase
system. A variety of research has focused on data/schema integration [BOT86, DH84,
KDN92, Mo87, SPD92]. Dayal and Hwang [DH84] and Motro [Mo87] adopted this approach
based on the functional model, while Breitbart et al. [BOT86] and Deen et al. [DAT87] based
theirs on the relational model, and Koh and Chen [KC93] on the object model. Batini et al. surveyed
twelve methodologies for database or view integration in [BLN86]. In [SLC88], an interactive
interface was developed to get the information required for the integration and to integrate
schemas according to the provided semantics.
The other approach is to provide users with a multidatabase query language [LAZ89];
namely, users can pose their queries against the local schemas by using a multidatabase
manipulation language. This approach does not require explicitly creating a global
view, but the user needs to have sufficient information about the local schemas to specify a
query. Czejdo et al. [CRE87] used a relational language to perform schema integration in
the process of query formulation, in which conflicts in component schemas were resolved by
specially defined operators and domain incompatibilities were resolved by extended abstract data
types. In [REC89], a graphical multidatabase query language was developed to manipulate
data across databases, and a knowledge base was used to resolve schema incompatibilities
among different databases.
Appeared in Proc. SIXTH International Conference on Management of Data (1994).
Relational database systems have been widely used, and object-oriented database systems have
also become popular in recent years. To help different users and applications
access data, a multidatabase system can provide two views: one based on the relational
model and the other based on the object-oriented model. DATAPLEX [Ch90] used the relational
model as a common data model for accessing heterogeneous distributed databases,
while the Amoco Distributed Database System (ADDS) [BOT86] integrated relational, network,
and hierarchical databases by an extended relational data model. Chen et al. [CKK94]
considered various schema and data conflicts and developed a methodology to provide an
object view over multiple object databases. The Pegasus [ASD91] heterogeneous multidatabase
system used both type and function abstractions to integrate different databases by an
object-oriented data model. In the prior work using the relational model for schema
integration, it is difficult for users to understand the relationships among relations in different
databases, especially when the number of component databases is large.
In this paper, we propose an approach to integrate database schemas into a relational
global view which consists of concept hierarchies. A concept hierarchy is composed of relation
schemes with the same concept. For example, relations relating to students are organized
into the STUDENT concept hierarchy. Similar to the concept hierarchy of [GN87], which
provides information for inductive learning, the concept hierarchy that we propose provides
users with valuable information to capture the relationships among different relations for
specifying queries. Concept hierarchies are created by the multidatabase administrator, who
collects the necessary information for integration from local database administrators and
integrates relation schemes into concept hierarchies by a schema integration language. Note
that our approach is different from the work of Dayal and Hwang [DH84]. In [DH84], the
functional data model with generalization is used to integrate database schemas, and queries
are specified by a query language like DAPLEX [Sh81], while we use the relational data
model with concept hierarchies to create a global view and SQL as the query
language. With concept hierarchies, the user who references the relational
view can easily understand the relationships among relations in different databases, and the
burden of specifying a query is relieved.
Schema integration by our approach consists of three phases:
Phase 1: map the non-relational schemas into relational ones.
Phase 2: collect the necessary information for integration from local database administrators.
The technique of assertion specification [SLC88] can be used to represent the integration semantics.
Phase 3: create concept hierarchies by a schema integration language. Schema conflicts can
be resolved in the process of integration.
Each relation in a concept hierarchy can be a virtual relation or a real relation. A real
relation corresponds to a relation defined in Phase 1, while a virtual relation
is generated from real or virtual relations by a generalization or specialization operation. The
problems of Phase 1 and Phase 2 have been well studied. In this paper, we devote our
attention to Phase 3 and to query processing based on the relational global view with concept
hierarchies.
This paper is organized as follows. In Section 2, we describe the creation of the concept
hierarchies. A schema integration language is developed to perform the integration. In
Section 3, query processing based on the concept hierarchies is discussed. Transformation
rules are developed to transform a query into an equivalent one for query optimization.
Finally, we conclude with the future work in Section 4.
[Figure 1: An example multidatabase system. The figure shows the component databases DB1, DB2,
and DB3, each with TEACHER and STUDENT relations (DB3 also holds WORKER_STUDENT), and the
creation of concept hierarchies by the multidatabase administrator, together with the data
dictionary and the global view.]
2 Concept Hierarchies
In this section, we consider the creation of a relational global view which is composed of
concept hierarchies.
2.1 A Schema Integration Language
Consider an example of the multidatabase system shown in Figure 1. The multidatabase
system consists of three individual databases. The relations TEACHER and STUDENT
in database DB1 record the information about teachers and students in the department
of Computer Science at National Tsing Hua University (NTHU), respectively. In database
DB2, relations TEACHER and STUDENT record the information about teachers and
students, respectively, in the department of Electrical Engineering at NTHU. In database
DB3, relations TEACHER and STUDENT record the information about teachers and
students, respectively, at National Chiao Tung University (NCTU). In addition, the relation
WORKER_STUDENT in DB3 records the information about worker-students at NCTU.
The multidatabase administrator is provided with a schema integration language for the
creation of concept hierarchies from the local schemas. The information about the mapping
between the local schemas and concept hierarchies is stored in the data dictionary. Users
specify queries against the concept hierarchies, which will be translated into equivalent
queries against the local schemas by referring to the data dictionary. In the following, we
describe constructs for defining the concept hierarchies.
1. Defining multidatabase relations
CREATE <multidatabase relation name>
FROM <database identifier>.<database relation identifier>
[WHERE ( RN <attribute identifier of the database relation> AS <attribute name>
[, RN <attribute identifier of the database relation> AS <attribute name>] ... )]
This statement defines a relation of a component database to be a multidatabase relation.
In syntactic definitions, square brackets indicate that the enclosed material
is optional. An ellipsis "..." indicates that the immediately preceding syntactic unit
may optionally be repeated one or more times. Material in <> must be replaced by
a specific value given by the user, and material in capitals must be written exactly
as shown. The names of a database, a database relation (or a multidatabase relation
which has been defined), and an attribute of a (multi)database relation are called a
database identifier, a database relation identifier (or a multidatabase relation identifier),
and an attribute identifier, respectively. The defined multidatabase relation
implicitly has all the attributes of the corresponding database relation; however, the
RN clause permits renaming a database attribute in the definition of a
multidatabase relation. The multidatabase relation defined by this statement is a real
relation. The instances in the multidatabase relation can be obtained by the following
procedure:
(a) Fetch the database relation specified in the FROM clause.
(b) Do the actions in the WHERE clause.
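The two steps above can be sketched in Python. This is only an illustration under stated assumptions: a relation is held as a list of dictionaries, the function name `materialize_real` and the sample row are hypothetical, and the renames mirror those used for CS_teacher in Section 2.2.

```python
def materialize_real(local_relation, renames=None):
    """Fetch a local relation (step a) and apply the RN actions (step b).

    local_relation: list of dict rows standing in for a component database relation
    renames: dict mapping local attribute names to multidatabase attribute names
    """
    renames = renames or {}
    return [
        # Attributes without an RN entry keep their local name.
        {renames.get(attr, attr): value for attr, value in row.items()}
        for row in local_relation
    ]


# Hypothetical contents of DB1.TEACHER
db1_teacher = [
    {"name": "Lin", "address": "Hsinchu", "phone": "555-0101", "salary": 1200000},
]

cs_teacher = materialize_real(
    db1_teacher, {"phone": "home_phone", "salary": "yearly_salary"}
)
```

Attributes that are not named in an RN clause pass through unchanged, which matches the statement that the multidatabase relation implicitly has all attributes of the database relation.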
2. Defining generalized virtual relations
CREATE <multidatabase relation name> AS
GEN OF <multidatabase relation identifier>, <multidatabase relation identifier>
[, <multidatabase relation identifier>] ...
CLASSIFYING ATTRIBUTE ( <classifying attribute name> DOMAIN AS
(<classifying attribute value for the corresponding multidatabase relation>,
<classifying attribute value for the corresponding multidatabase relation>
[, <classifying attribute value for the corresponding multidatabase relation>] ... ))
[WHERE ({ RN <multidatabase relation identifier>.<attribute identifier>
AS <attribute name> |
CONVERT
<multidatabase relation identifier>.(<attribute identifier>
[,<attribute identifier>] ...)
TO <attribute name> BY <function name> }
[, { RN <multidatabase relation identifier>.<attribute identifier>
AS <attribute name> |
CONVERT
<multidatabase relation identifier>.(<attribute identifier>
[,<attribute identifier>] ...)
TO <attribute name> BY <function name> }] ... )]
The statement defines a multidatabase relation to be the generalization of some existing
multidatabase relations. The multidatabase relation defined by this statement is
a generalized virtual relation because it is not a real relation. The generalized virtual
relation is called the superconcept relation of the multidatabase relations from which it
is derived, and the multidatabase relations that are generalized into the superconcept relation are
called the subconcept relations of the superconcept relation. The CLASSIFYING
ATTRIBUTE clause defines a classifying attribute which is used to identify the
corresponding subconcept relation for each instance. Attributes of the generalized virtual
relation are composed of the common attributes of its subconcept relations and the
classifying attribute. The DOMAIN AS clause specifies the value of the classifying
attribute for each corresponding subconcept relation in the GEN OF clause. We
can resolve the naming conflicts of the subconcept relations by the RN clause. Moreover,
the CONVERT clause can be used to resolve data representation conflicts and
data scaling conflicts; it converts an attribute (or a set of attributes) of a
multidatabase relation to another one by the conversion function. For example, the clause
"CONVERT R1.(a1, a2, ..., an) TO attr BY transform" indicates that the value of
attribute attr is evaluated by the function transform with R1.a1, R1.a2, ..., R1.an as
its parameters. Functions needed for schema integration are implemented by the
multidatabase administrator and stored in the data dictionary. The instances in the
generalized virtual relation can be obtained by the following procedure:
(a) Materialize the subconcept relations specified in the GEN OF clause.
(b) Do the actions in the WHERE clause.
(c) Project the common attributes of the subconcept relations.
(d) For each resultant subconcept relation from (c), add the classifying attribute and
its corresponding value specied in the DOMAIN AS clause for each tuple.
(e) Union the subconcept relations.
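Steps (c) to (e) can be sketched as follows. This is a minimal illustration, assuming that the subconcept relations are already materialized, renamed, and converted (steps a and b), that each is a non-empty list of dictionaries, and that `generalize` and the sample rows are hypothetical names and data.

```python
def generalize(subconcepts, classifying_attr):
    """Build a generalized virtual relation.

    subconcepts: dict mapping each classifying value to its subconcept relation
                 (a non-empty list of dict rows)
    classifying_attr: name of the classifying attribute to add
    """
    # Step (c): the common attributes, read from the first row of each relation.
    common = set.intersection(*(set(rows[0]) for rows in subconcepts.values()))
    result, seen = [], set()
    for value, rows in subconcepts.items():
        for row in rows:
            new_row = {attr: row[attr] for attr in common}
            new_row[classifying_attr] = value       # step (d)
            key = tuple(sorted(new_row.items()))
            if key not in seen:                     # step (e): set union
                seen.add(key)
                result.append(new_row)
    return result


cs_teacher = [{"name": "Lin", "address": "Hsinchu",
               "home_phone": "555-0101", "monthly_salary": 100000}]
ee_teacher = [{"name": "Wu", "address": "Taipei", "office_phone": "555-0202",
               "birthday": "1960-05-01", "monthly_salary": 90000}]

nthu_teacher = generalize({"CS": cs_teacher, "EE": ee_teacher}, "dept")
```

Only name, address, and monthly_salary survive the projection here, and each tuple is tagged with the dept value of the relation it came from.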
3. Defining the relationship between a real multidatabase relation and other multidatabase
relations by generalization
CREATE <real multidatabase relation identifier> AS
GEN OF <multidatabase relation identifier> [, <multidatabase relation identifier>] ...
The statement builds, by generalization, the relationship between a real multidatabase
relation and other multidatabase relations, which can be real or virtual. Unlike
the generalized virtual relation, the real multidatabase relation named in the
statement is a multidatabase relation which has already been defined and corresponds to a
relation in a component database. This real multidatabase relation is also called
the superconcept relation of the multidatabase relations specified in the GEN OF
clause, and the multidatabase relations specified in the GEN OF clause are called
the subconcept relations of the real multidatabase relation.
4. Defining specialized virtual relations
CREATE <multidatabase relation name> AS
SPE OF <multidatabase relation identifier>, <multidatabase relation identifier>
[, <multidatabase relation identifier>] ...
[WHERE ({ RN <multidatabase relation identifier>.<attribute identifier>
AS <attribute name> |
CONVERT
<multidatabase relation identifier>.(<attribute identifier>
[,<attribute identifier>] ...)
TO <attribute name> BY <function name> }
[, { RN <multidatabase relation identifier>.<attribute identifier>
AS <attribute name> |
CONVERT
<multidatabase relation identifier>.(<attribute identifier>
[,<attribute identifier>] ...)
TO <attribute name> BY <function name> }] ... )]
The statement defines a multidatabase relation to be the specialization of some existing
multidatabase relations. The multidatabase relation defined by this statement
is a specialized virtual relation because it does not correspond to any local relation.
The specialized virtual relation is called the subconcept relation of the multidatabase
relations from which it is derived, and the multidatabase relations from which it is
specialized are called the superconcept relations of the specialized virtual relation.
Attributes of the specialized virtual relation are the union of the attributes in
the superconcept relations. In a multidatabase system, the same real-world entity
can be represented as instances in different databases. The identification of the same
real-world entities from different databases is studied in [TC93]. For simplicity, we
assume that instances represent the same real-world entity if and only if they have
the same values for their common attributes. The instances in the specialized virtual
relation can be obtained by the following procedure:
(a) Materialize the superconcept relations specified in the SPE OF clause.
(b) Do the actions in the WHERE clause.
(c) Natural join the superconcept relations.
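Step (c) can be sketched as a small natural join in Python. The sketch assumes that the superconcept relations are lists of dictionaries that have already been renamed (steps a and b); the function name `natural_join` and the sample rows are hypothetical. It also reflects the simplifying assumption above: two instances are joined exactly when they agree on all common attributes.

```python
def natural_join(left, right):
    """Natural join of two relations given as lists of dict rows.

    The schema of each relation is read from its first row.
    """
    if not left or not right:
        return []
    common = set(left[0]) & set(right[0])
    joined = []
    for l_row in left:
        for r_row in right:
            # Same real-world entity: equal values on every common attribute.
            if all(l_row[a] == r_row[a] for a in common):
                joined.append({**l_row, **r_row})
    return joined


cs_teacher = [
    {"name": "Lin", "address": "Hsinchu", "home_phone": "555-0101",
     "CS_yearly_salary": 1200000},
    {"name": "Ho", "address": "Hsinchu", "home_phone": "555-0303",
     "CS_yearly_salary": 960000},
]
ee_teacher = [
    {"name": "Lin", "address": "Hsinchu", "office_phone": "555-0202",
     "birthday": "1960-05-01", "EE_monthly_salary": 90000},
]

cs_ee_teacher = natural_join(cs_teacher, ee_teacher)
```

The result holds one tuple for Lin, who appears in both relations, and its attributes are the union of the attributes of the two superconcept relations.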
5. Defining the relationship between a real multidatabase relation and other multidatabase
relations by specialization
CREATE <real multidatabase relation identifier> AS
SPE OF <multidatabase relation identifier> [, <multidatabase relation identifier>] ...
The statement builds the relationship between a real multidatabase relation and other
multidatabase relations by specialization. The real multidatabase relation is also
called the subconcept relation of the multidatabase relations specified in the SPE
OF clause, and the multidatabase relations specified in the SPE OF clause are
called the superconcept relations of the real multidatabase relation.
2.2 A Schema Integration Example
Consider the multidatabase system in Figure 1. First, we define multidatabase relations
as follows.
CREATE CS_teacher
FROM DB1.TEACHER
WHERE (RN phone AS home_phone, RN salary AS yearly_salary)
CREATE EE_teacher
FROM DB2.TEACHER
WHERE (RN phone AS office_phone, RN salary AS monthly_salary)
CREATE NCTU_teacher
FROM DB3.TEACHER
CREATE CS_student
FROM DB1.STUDENT
WHERE (RN id AS NTHU_id)
CREATE EE_student
FROM DB2.STUDENT
WHERE (RN id AS NTHU_id)
CREATE NCTU_student
FROM DB3.STUDENT
WHERE (RN id AS NCTU_id)
CREATE WORKER_student
FROM DB3.WORKER_STUDENT
WHERE (RN id AS NCTU_id)
Then, we define generalized virtual relations as follows.
CREATE NTHU_teacher AS
GEN OF CS_teacher, EE_teacher
CLASSIFYING ATTRIBUTE (dept DOMAIN AS (CS, EE))
WHERE (CONVERT CS_teacher.yearly_salary TO monthly_salary BY SAL)
The attribute salary is recorded as yearly_salary in the relation CS_teacher and as
monthly_salary in the relation EE_teacher. Assume we adopt the scale of monthly_salary in the
virtual relation NTHU_teacher. The SAL function is defined by
$SAL(e) = e/12, \quad e \in domain(CS\_teacher.yearly\_salary).$
CREATE TEACHER AS
GEN OF NTHU_teacher, NCTU_teacher
CLASSIFYING ATTRIBUTE (school DOMAIN AS (NTHU, NCTU))
WHERE (CONVERT NCTU_teacher.(zip,city,street) TO address BY ADDR)
In NCTU_teacher, the address is recorded as the aggregation of zip, city, and
street, such as "300 Hsinchu KuangFu." Assume we adopt the single-attribute representation
address, as used in NTHU_teacher, in the virtual relation TEACHER. The ADDR function is defined by
ADDR(e1, e2, e3) = e1 + " " + e2 + " " + e3
where operator '+' represents concatenation.
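The two conversion functions, and the way a CONVERT clause might apply them, can be sketched in Python. The helper `convert` is hypothetical; in particular, the sketch assumes that CONVERT replaces the source attributes by the target attribute, which the text suggests but does not state explicitly.

```python
def sal(yearly_salary):
    """SAL: scale a yearly salary to a monthly salary."""
    return yearly_salary / 12


def addr(zip_code, city, street):
    """ADDR: concatenate zip, city, and street with single spaces."""
    return zip_code + " " + city + " " + street


def convert(rows, source_attrs, target_attr, func):
    """Sketch of CONVERT R.(source_attrs) TO target_attr BY func.

    Assumption: the source attributes are dropped from the result.
    """
    converted = []
    for row in rows:
        new_row = {a: v for a, v in row.items() if a not in source_attrs}
        new_row[target_attr] = func(*(row[a] for a in source_attrs))
        converted.append(new_row)
    return converted


# Hypothetical NCTU_teacher tuple
nctu_teacher = [{"name": "Lin", "zip": "300", "city": "Hsinchu", "street": "KuangFu"}]
with_address = convert(nctu_teacher, ("zip", "city", "street"), "address", addr)
```

With these definitions, `with_address` carries the single attribute address in place of zip, city, and street, so NCTU_teacher and NTHU_teacher share the attribute needed for the TEACHER generalization.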
CREATE NTHU_student AS
GEN OF CS_student, EE_student
CLASSIFYING ATTRIBUTE (dept DOMAIN AS (CS, EE))
CREATE STUDENT AS
GEN OF NTHU_student, NCTU_student
CLASSIFYING ATTRIBUTE (school DOMAIN AS (NTHU, NCTU))
Next, we define the relationship between NCTU_student and WORKER_student by
generalization.
CREATE NCTU_student AS
GEN OF WORKER_student
The relationship can also be defined by specialization as follows.
CREATE WORKER_student AS
SPE OF NCTU_student
Finally, we define specialized virtual relations.
CREATE CS&EE_teacher AS
SPE OF CS_teacher, EE_teacher
WHERE (RN CS_teacher.yearly_salary AS CS_yearly_salary,
RN EE_teacher.monthly_salary AS EE_monthly_salary)
CREATE NCTU&NTHU_teacher AS
SPE OF NCTU_teacher, NTHU_teacher
WHERE (RN NCTU_teacher.dept AS NCTU_dept,
RN NTHU_teacher.dept AS NTHU_dept,
RN NTHU_teacher.monthly_salary AS NTHU_monthly_salary,
CONVERT NCTU_teacher.(zip,city,street) TO address BY ADDR)
[Figure 2: Concept hierarchies. The TEACHER hierarchy contains TEACHER, NTHU_teacher,
NCTU_teacher, CS_teacher, EE_teacher, CS&EE_teacher, and NCTU&NTHU_teacher; the STUDENT
hierarchy contains STUDENT, NTHU_student, NCTU_student, CS_student, EE_student, and
WORKER_student. Each relation is drawn with its attributes.]
The resultant concept hierarchies are shown in Figure 2. The mapping information for
the schema integration is recorded in three tables: the multidatabase relation table, the
generalized virtual relation table, and the specialized virtual relation table. The
multidatabase relation table records the information for the multidatabase relations in the
system. The information for generalized virtual relations and specialized virtual relations
is recorded in the generalized virtual relation table and the specialized virtual relation
table, respectively. Figure 3 shows the mapping tables for the TEACHER concept hierarchy
in Figure 2. The symbols Vgen and Vspe in the column "from" of the multidatabase relation
table denote that the corresponding multidatabase relations are a generalized virtual relation
and a specialized virtual relation, respectively.
multidatabase relation table
relation name        from           renamed attribute
CS_teacher           DB1.TEACHER    (phone, home_phone), (salary, yearly_salary)
EE_teacher           DB2.TEACHER    (phone, office_phone), (salary, monthly_salary)
NCTU_teacher         DB3.TEACHER
NTHU_teacher         Vgen
TEACHER              Vgen
CS&EE_teacher        Vspe
NCTU&NTHU_teacher    Vspe

generalized virtual relation table
relation name   subconcept relation   classifying attribute   classifying value   conversion function                             renamed attribute
NTHU_teacher    CS_teacher            dept                    CS                  SAL(CS_teacher.yearly_salary)%monthly_salary
                EE_teacher                                    EE
TEACHER         NCTU_teacher          school                  NCTU                ADDR(NCTU_teacher.(zip,city,street))%address
                NTHU_teacher                                  NTHU

specialized virtual relation table
relation name       superconcept relation   conversion function                             renamed attribute
CS&EE_teacher       CS_teacher                                                              (CS_teacher.yearly_salary, CS_yearly_salary)
                    EE_teacher                                                              (EE_teacher.monthly_salary, EE_monthly_salary)
NCTU&NTHU_teacher   NCTU_teacher            ADDR(NCTU_teacher.(zip,city,street))%address    (NCTU_teacher.dept, NCTU_dept)
                    NTHU_teacher                                                            (NTHU_teacher.dept, NTHU_dept)
                                                                                            (NTHU_teacher.monthly_salary, NTHU_monthly_salary)
Figure 3: The mapping information for the TEACHER concept hierarchy.
3 Query Processing
In this section, we consider query processing based on the concept hierarchies.
3.1 Query Decomposition
The query processor deals with a query by the following procedure.
Procedure P: Decompose a multidatabase query into local subqueries.
For each relation specified in the multidatabase query:
Step 1: Look up the multidatabase relation table.
1. If the relation is a real relation which has not been materialized, materialize this relation by information recorded in the columns from and renamed
attribute.
2. If the value of the column from for the relation is Vgen , then go to Step 2.
3. If the value of the column from for the relation is Vspe , then go to Step 3.
Step 2: Look up the generalized virtual relation table. The relation is materialized
by doing the following actions.
1. For each subconcept relation of the generalized virtual relation, if it has not
been materialized, then go to Step 1.
2. Rename attributes and perform conversion functions on subconcept relations
according to the information recorded in the columns renamed attribute and
conversion function, respectively.
3. Project the common attributes of the subconcept relations.
4. For each resultant subconcept relation from 3., add the classifying attribute
and its corresponding value for each tuple.
5. Union the subconcept relations.
Step 3: Look up the specialized virtual relation table. The relation is materialized
by doing the following actions.
1. For each superconcept relation of the specialized virtual relation, if it has not
been materialized, then go to Step 1.
2. Rename attributes and perform conversion functions on the superconcept relations according to the information recorded in the columns renamed attribute
and conversion function, respectively.
3. Natural join the superconcept relations.
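Procedure P can be sketched as a recursive function over the three mapping tables. This is a simplified illustration, not the prototype's implementation: the table layout (`tables`), the per-subconcept conversion callables, the helper names, and the sample data are all assumptions, relations are lists of dictionaries, and every subconcept relation is assumed to be non-empty.

```python
from functools import reduce


def rename(rows, pairs):
    """Rename attributes according to (old name -> new name) pairs."""
    return [{pairs.get(a, a): v for a, v in row.items()} for row in rows]


def natural_join(left, right):
    """Natural join; the schema is read from the first row of each input."""
    if not left or not right:
        return []
    common = set(left[0]) & set(right[0])
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in common)]


def materialize(name, tables, local_dbs, done=None):
    """Materialize one relation by dispatching on its 'from' column."""
    done = {} if done is None else done
    if name in done:                       # already materialized
        return done[name]
    entry = tables["multidatabase"][name]
    if entry["from"] == "Vgen":            # Step 2
        gen = tables["generalized"][name]
        parts = []
        for sub, value in gen["subconcepts"].items():
            rows = materialize(sub, tables, local_dbs, done)
            convert = gen.get("convert", {}).get(sub)
            parts.append((value, convert(rows) if convert else rows))
        common = set.intersection(*(set(rows[0]) for _, rows in parts))
        result = [{**{a: row[a] for a in common}, gen["classifying"]: value}
                  for value, rows in parts for row in rows]
    elif entry["from"] == "Vspe":          # Step 3
        spe = tables["specialized"][name]
        parts = [rename(materialize(sup, tables, local_dbs, done),
                        spe.get("renamed", {}).get(sup, {}))
                 for sup in spe["superconcepts"]]
        result = reduce(natural_join, parts)
    else:                                  # Step 1: real relation
        result = rename(local_dbs[entry["from"]], entry.get("renamed", {}))
    done[name] = result
    return result


def to_monthly(rows):
    """Conversion SAL(yearly_salary)%monthly_salary on each row."""
    return [{**{a: v for a, v in r.items() if a != "yearly_salary"},
             "monthly_salary": r["yearly_salary"] / 12} for r in rows]


local_dbs = {
    "DB1.TEACHER": [{"name": "Lin", "address": "Hsinchu",
                     "phone": "555-0101", "salary": 1200000}],
    "DB2.TEACHER": [{"name": "Lin", "address": "Hsinchu", "phone": "555-0202",
                     "birthday": "1960-05-01", "salary": 90000},
                    {"name": "Wu", "address": "Taipei", "phone": "555-0303",
                     "birthday": "1962-03-09", "salary": 85000}],
}
tables = {
    "multidatabase": {
        "CS_teacher": {"from": "DB1.TEACHER",
                       "renamed": {"phone": "home_phone", "salary": "yearly_salary"}},
        "EE_teacher": {"from": "DB2.TEACHER",
                       "renamed": {"phone": "office_phone", "salary": "monthly_salary"}},
        "NTHU_teacher": {"from": "Vgen"},
        "CS&EE_teacher": {"from": "Vspe"},
    },
    "generalized": {
        "NTHU_teacher": {"subconcepts": {"CS_teacher": "CS", "EE_teacher": "EE"},
                         "classifying": "dept",
                         "convert": {"CS_teacher": to_monthly}},
    },
    "specialized": {
        "CS&EE_teacher": {
            "superconcepts": ["CS_teacher", "EE_teacher"],
            "renamed": {"CS_teacher": {"yearly_salary": "CS_yearly_salary"},
                        "EE_teacher": {"monthly_salary": "EE_monthly_salary"}}},
    },
}

nthu_teacher = materialize("NTHU_teacher", tables, local_dbs)
cs_ee_teacher = materialize("CS&EE_teacher", tables, local_dbs)
```

The `done` dictionary plays the role of the "has not been materialized" checks in the procedure, so a relation that is shared by several virtual relations within one call is fetched only once.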
3.2 Query Transformation
In this subsection, we develop transformation rules for query optimization. Some
notations are described as follows.
$\rho_{a_1 \to b_1, \ldots, a_n \to b_n} R$ denotes that attributes $a_1, \ldots, a_n$ in the
relation $R$ are renamed to $b_1, \ldots, b_n$, respectively.
$R \times attr\{v\}$ denotes that the attribute $attr$ is added to relation $R$ and each
tuple in $R$ is filled with the value $v$ for the attribute $attr$.
$FUN_i(B_i)\%a_i$ represents a function named $FUN_i$ with $B_i$ denoting the list of
parameters for $FUN_i$. The corresponding attribute for the returned value of
the function is named $a_i$. $\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k} R$ denotes that
functions $FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k$ are performed on the relation $R$.
For a function $FUN_j(B_j)\%a_j$
with $|B_j| = 1$, if its inverse function exists, then the inverse function will be
defined. The definition of the inverse function benefits query optimization.
We assume that caching is used to store the returned values of function
calls. Thus, if the number of distinct values of $B_i$ is $n$, then $FUN_i(B_i)\%a_i$
needs to be computed only $n$ times. The technique of function caching has
been shown to be useful for query optimization [HS93].
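The effect of function caching can be illustrated with Python's `functools.lru_cache`. This is a sketch of the general idea only, not of the mechanism in [HS93]; the counter `calls` and the sample salaries are hypothetical.

```python
from functools import lru_cache

calls = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def sal(yearly_salary):
    """SAL with cached results: computed once per distinct argument."""
    global calls
    calls += 1
    return yearly_salary / 12


rows = [{"yearly_salary": s}
        for s in (1200000, 960000, 1200000, 960000, 1200000)]
monthly = [sal(row["yearly_salary"]) for row in rows]
```

Five tuples are processed, but since they hold only two distinct values of yearly_salary, the body of `sal` runs twice.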
For simplicity, the selection predicate considered in a query is of the form $attr\ op\ C$,
where $attr$ represents a relation attribute, $op$ denotes an operator such as ">", "<",
or "=", and $C$ represents a constant. The associated attribute of a predicate is its
$attr$ component. An associated attribute $a_i$ is called a private attribute of relation
$R_k$, where $k = 1, 2$, if $a_i$ appears in $R_k$ and is not an attribute of the other relation.
The transformation rules are described as follows.
Rule 1
Consider the query $Q = \pi_A(\rho_{a_1 \to b_1, \ldots, a_n \to b_n} R)$, where $A$ represents a set of
attributes. For each $i$, $1 \le i \le n$, if $b_i \notin A$, then
$$Q \equiv \pi_A(\rho_{a_1 \to b_1, \ldots, a_{i-1} \to b_{i-1}, a_{i+1} \to b_{i+1}, \ldots, a_n \to b_n} R).$$
By this rule, we can eliminate some unnecessary renaming operations.
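Rule 2
Consider the query $Q = \sigma_P(R_1 \times a\{v_1\} \cup R_2 \times a\{v_2\})$, where $P$ represents a
selection predicate.
$$Q \equiv \begin{cases}
R_1 \times a\{v_1\} & \text{if } v_1 \neq v_2 \text{ and } P \text{ is ``} a = v_1 \text{''} \\
R_2 \times a\{v_2\} & \text{if } v_1 \neq v_2 \text{ and } P \text{ is ``} a = v_2 \text{''} \\
(\sigma_P R_1) \times a\{v_1\} \cup (\sigma_P R_2) \times a\{v_2\} & \text{otherwise}
\end{cases}$$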
By this rule, the size of relations to be unioned can be reduced by performing
selections on the local relations. The union operation may be discarded in
some special cases.
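Rule 2 can be checked on a small example in Python. The sketch restricts itself to equality predicates, treats relations as lists of dictionaries with list concatenation as the union, and returns an empty result when the predicate asks for a classifying value that no subconcept carries; that last case and all names are assumptions made for the illustration.

```python
def select(rows, pred):
    return [row for row in rows if pred(row)]


def add_attr(rows, attr, value):
    """Add the classifying attribute with a fixed value to every tuple."""
    return [{**row, attr: value} for row in rows]


def naive(r1, r2, attr, v1, v2, p_attr, p_value):
    """Union first, then select on the whole result."""
    union = add_attr(r1, attr, v1) + add_attr(r2, attr, v2)
    return select(union, lambda row: row[p_attr] == p_value)


def with_rule2(r1, r2, attr, v1, v2, p_attr, p_value):
    """Evaluate the same query after applying Rule 2."""
    if p_attr == attr:
        # Predicate on the classifying attribute: keep only the matching
        # operands; the other operand is never read.
        parts = []
        if p_value == v1:
            parts += add_attr(r1, attr, v1)
        if p_value == v2:
            parts += add_attr(r2, attr, v2)
        return parts
    # Otherwise push the selection down to the local relations.
    pred = lambda row: row[p_attr] == p_value
    return (add_attr(select(r1, pred), attr, v1)
            + add_attr(select(r2, pred), attr, v2))


cs = [{"name": "Lin"}, {"name": "Ho"}]
ee = [{"name": "Wu"}, {"name": "Lin"}]

by_dept = with_rule2(cs, ee, "dept", "CS", "EE", "dept", "CS")
by_name = with_rule2(cs, ee, "dept", "CS", "EE", "name", "Lin")
```

For the predicate dept = CS the relation `ee` is not touched at all, which corresponds to the case where the union operation is discarded.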
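Rule 3
Consider the query $Q = \pi_A((\pi_{A_1} R_1) \times a\{v_1\} \cup (\pi_{A_2} R_2) \times a\{v_2\})$.
$$Q \equiv \begin{cases}
\pi_A R_1 \cup \pi_A R_2 & \text{if } a \notin A \\
(\pi_{A - \{a\}} R_1) \times a\{v_1\} \cup (\pi_{A - \{a\}} R_2) \times a\{v_2\} & \text{otherwise}
\end{cases}$$
By this rule, we can reduce the size of relations to be unioned by eliminating
some unnecessary projection attributes. The operation $\times\, a\{v\}$ may be discarded
in some special cases.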
Rule 4
Consider the query
$$Q = \sigma_P(\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_i(B_i)\%a_i, \ldots, FUN_k(B_k)\%a_k} R).$$
1. If the associated attribute of $P \notin \{a_1, \ldots, a_k\}$, then
$$Q \equiv \phi_{FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k}(\sigma_P R).$$
Because the associated attribute of $P$ is not dependent on any function,
we can perform the selection on $R$ before these functions are computed.
Thus, the size of $R$ can be reduced, and so can the cost of computing these
functions.
2. If the associated attribute of $P$ is $a_i$ and the inverse function of $FUN_i(B_i)$
can be obtained, then we can derive the corresponding predicate $P'$ of $P$
based on the inverse function, where the associated attribute of $P'$ is $B_i$.
Therefore
$$Q \equiv \phi_{FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k}(\sigma_{P'} R).$$
By this rule, we can reduce the cost of computing these functions. The
reason is similar to that of case 1.
3. If the associated attribute of $P$ is $a_i$ and the inverse function of $FUN_i(B_i)$
cannot be obtained, then
$$Q \equiv \phi_{FUN_1(B_1)\%a_1, \ldots, FUN_{i-1}(B_{i-1})\%a_{i-1}, FUN_{i+1}(B_{i+1})\%a_{i+1}, \ldots, FUN_k(B_k)\%a_k}(\sigma_P(\phi_{FUN_i(B_i)\%a_i} R)).$$
Because the associated attribute of $P$ is dependent on the function $FUN_i(B_i)$
and the inverse function of $FUN_i(B_i)$ cannot be obtained, we can only
perform the selection after the function $FUN_i(B_i)$ is computed. It is
clear that the cost of computing the other functions will be reduced after
$\sigma_P(\phi_{FUN_i(B_i)\%a_i} R)$ is performed.
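Case 2 of Rule 4 can be illustrated with the SAL function, whose inverse multiplies by 12. In the sketch below, the predicate monthly_salary > C is rewritten to yearly_salary > 12C before SAL is computed; keeping the ">" operator is valid here only because SAL is monotonically increasing. The call counter, the helper `apply_sal`, and the data are hypothetical.

```python
calls = {"sal": 0}  # how many times SAL is actually evaluated


def sal(yearly_salary):
    calls["sal"] += 1
    return yearly_salary / 12


def sal_inverse(monthly_salary):
    return monthly_salary * 12


def apply_sal(rows):
    """Perform SAL(yearly_salary)%monthly_salary on every tuple."""
    return [{**{a: v for a, v in r.items() if a != "yearly_salary"},
             "monthly_salary": sal(r["yearly_salary"])} for r in rows]


cs_teacher = [
    {"name": "Lin", "yearly_salary": 1200000},
    {"name": "Wu", "yearly_salary": 960000},
    {"name": "Ho", "yearly_salary": 1500000},
]
C = 90000

# Original plan: compute SAL for every tuple, then select.
original = [r for r in apply_sal(cs_teacher) if r["monthly_salary"] > C]
calls_original = calls["sal"]

# Rewritten plan: derive P' with the inverse function, select, then compute SAL.
calls["sal"] = 0
threshold = sal_inverse(C)
rewritten = apply_sal([r for r in cs_teacher if r["yearly_salary"] > threshold])
calls_rewritten = calls["sal"]
```

Both plans return the same tuples, but the rewritten plan evaluates SAL only for the tuples that survive the selection.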
Rule 5
Consider the query
$$Q = (\phi_{FUN_{r_1}(B_{r_1})\%a_{r_1}, \ldots, FUN_{r_i}(B_{r_i})\%a_{r_i}, \ldots, FUN_{r_k}(B_{r_k})\%a_{r_k}} R_1) \bowtie (\phi_{FUN_{s_1}(B_{s_1})\%a_{s_1}, \ldots, FUN_{s_j}(B_{s_j})\%a_{s_j}} R_2),$$
where the join is the natural join. If $a_{r_i}$ is not a common attribute of the two
operands of the join, then
$$Q \equiv \phi_{FUN_{r_i}(B_{r_i})\%a_{r_i}}((\phi_{FUN_{r_1}(B_{r_1})\%a_{r_1}, \ldots, FUN_{r_{i-1}}(B_{r_{i-1}})\%a_{r_{i-1}}, FUN_{r_{i+1}}(B_{r_{i+1}})\%a_{r_{i+1}}, \ldots, FUN_{r_k}(B_{r_k})\%a_{r_k}} R_1) \bowtie (\phi_{FUN_{s_1}(B_{s_1})\%a_{s_1}, \ldots, FUN_{s_j}(B_{s_j})\%a_{s_j}} R_2)).$$
Since $a_{r_i}$ is not a common attribute, we can delay the execution of
$FUN_{r_i}(B_{r_i})\%a_{r_i}$ until the natural join is performed. After the natural join, the
number of distinct values of $B_{r_i}$ may be reduced. Therefore, we can decrease
the cost of computing $FUN_{r_i}(B_{r_i})\%a_{r_i}$.
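The saving promised by Rule 5 can be seen on a small example in which SAL is applied either before or after a natural join. The sketch counts one evaluation per tuple (no caching), and the relation `office`, the helpers, and the data are hypothetical.

```python
calls = {"sal": 0}  # how many times SAL is actually evaluated


def sal(yearly_salary):
    calls["sal"] += 1
    return yearly_salary / 12


def apply_sal(rows):
    """Perform SAL(yearly_salary)%monthly_salary on every tuple."""
    return [{**{a: v for a, v in r.items() if a != "yearly_salary"},
             "monthly_salary": sal(r["yearly_salary"])} for r in rows]


def natural_join(left, right):
    if not left or not right:
        return []
    common = set(left[0]) & set(right[0])
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in common)]


cs_teacher = [
    {"name": "Lin", "yearly_salary": 1200000},
    {"name": "Wu", "yearly_salary": 960000},
    {"name": "Ho", "yearly_salary": 1500000},
]
office = [{"name": "Ho", "office_phone": "555-0303"}]

# Original plan: compute SAL on all of cs_teacher, then join.
early = natural_join(apply_sal(cs_teacher), office)
calls_early = calls["sal"]

# Rewritten plan: join first, then compute SAL on the joined tuples only.
# monthly_salary is not a join attribute, so the delay is allowed.
calls["sal"] = 0
late = apply_sal(natural_join(cs_teacher, office))
calls_late = calls["sal"]
```

Here the join keeps a single tuple, so delaying SAL reduces the number of evaluations from three to one while the result stays the same.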
Rule 6
Consider the query
$$Q = \sigma_P(\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_i(B_i)\%a_i, \ldots, FUN_k(B_k)\%a_k}(R_1 \bowtie R_2)).$$
1. If the associated attribute $a_p$ of $P \notin \{a_1, \ldots, a_k\}$, then
$$Q \equiv \begin{cases}
\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k}(\sigma_P R_1 \bowtie R_2) & \text{if } a_p \text{ is the private attribute of } R_1 \\
\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k}(R_1 \bowtie \sigma_P R_2) & \text{if } a_p \text{ is the private attribute of } R_2 \\
\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k}(\sigma_P R_1 \bowtie \sigma_P R_2) & \text{otherwise}
\end{cases}$$
Because the associated attribute of $P$ is not dependent on any function,
we can reduce the size of relations to be joined by performing selections
on the local relations. In addition, the number of distinct values for each attribute
can be reduced, which decreases the cost of computing functions.
2. If the associated attribute of $P$ is $a_i$ and the inverse function of $FUN_i(B_i)$
can be obtained, then we can derive the corresponding predicate $P'$ of $P$
based on the inverse function. Therefore
$$Q \equiv \begin{cases}
\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k}(\sigma_{P'} R_1 \bowtie R_2) & \text{if } B_i \subseteq \text{the attribute set of } R_1 \\
\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_k(B_k)\%a_k}(R_1 \bowtie \sigma_{P'} R_2) & \text{if } B_i \subseteq \text{the attribute set of } R_2
\end{cases}$$
By this rule, we can reduce the cost of performing the join and functions.
The reason is similar to that of case 1.
3. If the associated attribute of $P$ is $a_i$ and the inverse function of $FUN_i(B_i)$
cannot be obtained, then
$$Q \equiv \begin{cases}
\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_{i-1}(B_{i-1})\%a_{i-1}, FUN_{i+1}(B_{i+1})\%a_{i+1}, \ldots, FUN_k(B_k)\%a_k}(\sigma_P(\phi_{FUN_i(B_i)\%a_i} R_1) \bowtie R_2) & \text{if } B_i \subseteq \text{the attribute set of } R_1 \\
\phi_{FUN_1(B_1)\%a_1, \ldots, FUN_{i-1}(B_{i-1})\%a_{i-1}, FUN_{i+1}(B_{i+1})\%a_{i+1}, \ldots, FUN_k(B_k)\%a_k}(R_1 \bowtie \sigma_P(\phi_{FUN_i(B_i)\%a_i} R_2)) & \text{if } B_i \subseteq \text{the attribute set of } R_2
\end{cases}$$
Because the associated attribute of $P$ is dependent on the function $FUN_i(B_i)$
and the inverse function of $FUN_i(B_i)$ cannot be obtained, we can only
perform the selection after the function $FUN_i(B_i)$ is computed. It is
clear that the size of relations to be joined is reduced and the cost of
computing the other functions can be decreased after the join.
4 Conclusions and Future Work
Since relational DBMSs are widespread, it is significant to provide a relational global view
through which relational applications, and users who are familiar only with the relational
model, can access data in a multidatabase system. In this paper, we propose a
methodology for integrating schemas into a relational global view. The schemas to
be integrated are organized into concept hierarchies which capture the relationships
among relations in different databases. Unlike prior work using the relational
model for schema integration, the ideas of specialization and generalization
are applied to enrich the relational view and to help the user issue queries in a
standard relational query language such as SQL. We have studied query processing
based on concept hierarchies. A procedure is designed to decompose a global query
into local subqueries, and a set of transformation rules is developed to transform a
query into an equivalent one for query optimization.
We have implemented a multidatabase prototype using the concept hierarchy
approach at National Tsing Hua University. The update problem in the concept
hierarchy is under investigation. We are also studying the query optimization issue of
conversion functions in a multidatabase system. The conversion functions defined in the
concept hierarchies may be time-consuming. Therefore, the execution order of joins
and selections involving expensive conversion functions [HS93] in this environment
needs to be further considered.
References
[ASD91] R. Ahmed, P.D. Smedt, W. Du, W. Kent, M.A. Ketabchi, W.A. Litwin, A.
Rafii, and M.C. Shan, The Pegasus Heterogeneous Multidatabase System,
IEEE COMPUTER, December (1991) pp. 19-27.
[BLN86] C. Batini, M. Lenzerini, and S.B. Navathe, A Comparative Analysis of
Methodologies for Database Schema Integration, ACM Computing Surveys,
18 (4) (1986) pp. 323-364.
[BOT86] Y. Breitbart, P.L. Olson, and G.R. Thompson, Database Integration in
a Distributed Heterogeneous Database System, IEEE Second International
Conference on Data Engineering, (1986) pp. 301-310.
[Br90] Y. Breitbart, Multidatabase Interoperability, SIGMOD RECORD, 19 (3)
(1990) pp. 53-60.
[CKK94] A.L.P. Chen, J.L. Koh, T.C.T. Kuo, C.C. Liu, Schema Integration and
Query Processing for Multiple Object Databases, Journal of Integrated
Computer-Aided Engineering: Special Issue on Multidatabase and Interoperable
Systems, Wiley Interscience (1994) (to appear).
[Ch90] C. Chung, DATAPLEX: An Access to Heterogeneous Distributed Databases,
Communications of the ACM, 33 (1) (1990) pp. 70-80 (with corrigendum in
Comm. ACM 33 (4) p. 459).
[CRE87] B. Czejdo, M. Rusinkiewicz, and D.W. Embley, An Approach to Schema
Integration and Query Formulation in Federated Database Systems, IEEE
Third International Conference on Data Engineering, (1987) pp. 477-484.
[DH84] U. Dayal and H.Y. Hwang, View Definition and Generalization for Database
Integration in a Multidatabase System, IEEE Transactions on Software
Engineering, 10 (6) (1984) pp. 628-644.
[DAT87] S.M. Deen, R.R. Amin, and M.C. Taylor, Data Integration in Distributed
Databases, IEEE Transactions on Software Engineering, 13 (7) (1987) pp.
860-864.
[GN87] M. Genesereth and N. Nilsson, Logical Foundations of Artificial Intelligence,
San Francisco, CA: Morgan Kaufmann, (1987).
[HS93] J.M. Hellerstein and M. Stonebraker, Predicate Migration: Optimizing
Queries with Expensive Predicates, Proceedings of ACM SIGMOD, (1993)
pp. 267-276.
[KDN92] M. Kaul, K. Drosten, and E.J. Neuhold, View System: Integrating
Heterogeneous Information Bases by Object-Oriented Views, IEEE Sixth International
Conference on Data Engineering, (1992) pp. 2-10.
[KC93] J.L. Koh and A.L.P. Chen, Integration of Heterogeneous Object Schemas,
Proceedings of the 12th International Conference on Entity-Relationship
Approach, (1993) pp. 289-300.
[LAZ89] W. Litwin, A. Abdellatif, A. Zeroual, and B. Nicolas, MSQL: A
Multidatabase Language, Information Science, (1989) pp. 59-101.
[Mo87] A. Motro, Superviews: Virtual Integration of Multiple Databases, IEEE
Transactions on Software Engineering, 13 (7) (1987) pp. 785-798.
[REC89] M. Rusinkiewicz, R. Elmasri, B. Czejdo, D. Georgakopoulos, G. Karabatis,
A. Jamoussi, L. Loa, and Y. Li, OMNIBASE: Design and Implementation
of a Multidatabase System, Proceedings of the 1st Annual Symposium in
Parallel and Distributed Processing, (1989) pp. 162-169.
[SLC88] A. Sheth, J. Larson, A. Cornelio, and S. Navathe, A Tool for Integrating
Conceptual Schemas and User Views, IEEE Fourth International Conference
on Data Engineering, (1988) pp. 176-183.
[Sh81] D.W. Shipman, The Functional Data Model and the Data Language
DAPLEX, ACM Trans. Database Syst., 6 (1) (1981) pp. 140-173.
[SBD81] J.M. Smith, P.A. Bernstein, U. Dayal, N. Goodman, T. Landers, K.W.T.
Lin, and E. Wong, Multibase - Integrating Heterogeneous Distributed
Database Systems, Proceedings of AFIPS NCC, (1981) pp. 487-499.
[SPD92] S. Spaccapietra, C. Parent and Y. Dupont, Model Independent Assertions
for Integration of Heterogeneous Schemas, VLDB Journal, (1992) pp. 81-126.
[TC93] P.S.M. Tsai and A.L.P. Chen, Querying Uncertain Data in Heterogeneous
Databases, Proceedings of IEEE Third International Workshop on Research
Issues on Data Engineering: Interoperability in Multidatabase Systems
(1993) pp. 161-168.