A query based approach for integrating
heterogeneous data sources
Ruxandra Domenig
Klaus R. Dittrich
Department of Information Technology
University of Zurich
Switzerland
[email protected]
[email protected]
ABSTRACT
Data available on-line today is spread over heterogeneous data sources like traditional databases or sources of various forms containing unstructured and semistructured data. Obviously, the "technical" availability alone is not at all sufficient for making meaningful use of the existing information, and thus the problem of effectively and efficiently accessing and querying heterogeneous data is an important research issue. One main approach is to integrate the data sources and offer users an a priori defined global schema. Alternatively, there are approaches which implement tools for giving users the possibility to define the global schema themselves. We propose a new approach where heterogeneous sources can be queried through a unified interface and underlying sources are integrated by means of a query language only. In this paper, we present extensions to OQL which allow querying structurally heterogeneous data, i.e. structured, semistructured and unstructured data alike, and integrating data on-the-fly. We also deal with issues of the extensive meta information about sources that is needed for integration.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM 2000, McLean, VA, USA
© 2000 ACM 1-58113-320-0/00/11 ...$5.00

1. INTRODUCTION
Progress in both persistent storage of all kinds of data and computer network technology is the main reason for the explosive growth of data that is available on-line. In order to make meaningful use of it, new and innovative ways for searching must be developed. One specific case is to correlate and retrieve information - in a uniform and integrated way - scattered in a network of structurally heterogeneous data sources, i.e. sources which contain structured, semistructured and/or unstructured data. Generally speaking, integrated search refers to users posing queries against a set of data sources, using a global query language. Queries are internally split, sent to the respective data sources, and later the results are combined. Query processing comprises in this case two main steps: the homogenization of data and its combination, i.e. integration. Homogenization is realized by translating various query languages and models of the underlying data sources into a global query language and data model. Regarding the process of integration, there are three different approaches that can be taken:

The most common approach is to strictly separate integration from the querying process. First, a global schema is manually defined, by following the usual database design steps. Second, mappings between the global and local schemas¹ are defined. Only after completion of this integration step, the system may be used for querying. The disadvantage of this approach is that the schema is rather rigid and cannot be easily adjusted when, for example, new sources are connected to the system.

A more flexible approach is to offer tools which allow defining the global schema at run-time. While the integration process is still separated from querying, it can be repeated, i.e. a new global schema can be defined more easily when users' needs change. In this way, new sources may be connected more easily. Before querying, users (or their administrators) may themselves define the global schema as well as mappings between global and local schemas, using for example a specification language. However, this may also be regarded as a drawback, since specifying a lot of technical information (i.e. the global schema and mappings) is not always a simple task and requires a lot of expertise. Note that compared with the previous approach, the specification of the global schema is restricted by the available tools and thus not all desired mappings may be definable. In addition, fewer optimizations can be applied in this case.

The most flexible approach is to let users do the integration while formulating queries, without having to define a schema beforehand. The system just offers a homogenized view over the underlying data sources, but the integration is done while querying. New sources can be easily connected to the system, since their data only have to be homogenized. When querying, users
decide in which sources to search and how to integrate data. In this case, extensive information about the sources, i.e. metadata, must be available. If users have little or no knowledge about the structure and the characteristics of the underlying data sources, some "approximate" or "fuzzy" search is still possible. Alternatively, if users know the underlying data sources and their query capabilities, they can express highly targeted queries. Though many cases (like transactions in financial applications) require exact queries, there are also numerous examples where a "fuzzy" way to query and support to connect and disconnect data sources easily are very beneficial.

¹Though there is no "schema" for unstructured data sources, we use this notion for such sources as well and regard the functionality exported by an unstructured source as a function (e.g. a stored procedure in a DBMS) which has as input parameters the query (e.g. keywords) and as output the result set.
As will be discussed later, research so far has focused on the first and second approach. The main motivation for the third approach is twofold. First, larger numbers of diverse data sources can be more easily made available for searching because no (or only few) preparatory steps are needed. Second, since users have more and more diverse objectives, they need a flexible way to search. This includes "fuzzy" ways to query, aiming at getting any information, even if it is just loosely related to the information effectively needed. On the other hand, precise queries also have to be supported in order to get exact information. It turns out that the absence of an integrated schema facilitates the "fuzzy" way of querying, provided that the query language allows for querying heterogeneous data in an integrated way.

Our system SINGAPORE (SINGle Access POint for heterogeneous data REpositories) implements this approach. Our main goal in this paper is to present SINGAPORE's query language, together with the appropriate infrastructure environment. Specifically, we extend OQL with various features for querying structured, semistructured and unstructured data in a flexible and integrated way.

The paper is structured as follows. We start with a running example in Section 2. In Section 3 we present the user's view of the data, and in Section 4 we present the SINGAPORE query language. We then sketch the architecture of our system in Section 5. Section 6 deals with the meta information needed to formulate queries and to implement the query language. In Section 7 we describe some implementation aspects of the metadata repository. Section 8 concludes the paper.

2. A RUNNING EXAMPLE
In this section, we compare the three above-mentioned approaches for integration by using a uniform example. To this end, we first consider some data sources (see Figure 1) and second, a query against them.
create table Person (
    ID:      NUMBER,
    Name:    CHAR(50),
    Surname: CHAR(50),
    Photo:   BLOB,
    Inst:    CHAR(50)
);

create table Institute (
    Name:    CHAR(50),
    Depart:  CHAR(50),
    Summary: CHAR(10000)
);

create function idf_retrieve(CHAR(50))
    RETURNS list(CHAR(10000))
    EXTERNAL NAME '/home/locations/funcloc/dist'
    LANGUAGE C
    PARAMETER STYLE DB2SQL
    NO SQL
    DETERMINISTIC
    NO EXTERNAL ACTION
    NOT FENCED

class Paper (
    Title:    string,
    Abstract: string,
    Auth:     set(Author)
);
Related work.
Several approaches for integrated access to heterogeneous sources have been proposed. Most of them (including ours) are based on a three-tier architecture as presented in [17], consisting of a data source layer, an integration layer and a user or application layer. The lowest layer contains the sources and so-called wrappers which export their functionality and data. The middle layer comprises mediators, which were introduced in [17] to "simplify, abstract, reduce, merge and explain data". The upper layer represents the interface to users and applications.

The TSIMMIS project, which aims at assisting users in implementing mediators for integrating information [9], uses a query language based on a light-weight model for semistructured data, OEM. Data of the underlying data sources are represented in OEM, so in fact TSIMMIS offers a semistructured view of heterogeneous data. The advantage of this system is that it supports users well in specifying integration components, which are then generated automatically.

A similar approach is used in the DIOM project [13], which allows users to define a personal view on the data using a meta model, and to express queries based on it. DIOM uses an extension of the ODMG data model [6] for representing data on the mediation level, and an extension of OQL as query language. As a result, more complex queries can be formulated and more semantics of the underlying sources can be exploited because the ODMG model is more complex than OEM.

DISCO [16] assumes an integrated global view at the mediation layer and proposes solutions for special problems related to query processing in this context: unavailable data sources, query rewriting, and optimization. It also offers some solutions for the extensibility problem by defining data which have to be exported by wrappers.

Further projects related to our work include Garlic [14], InfoSleuth [5] and Information Manifold (IM) [12]. These projects provide solutions to subsets of problems which we also face. However, to the best of our knowledge, there is no other project which aims at the integration of data through the query language, at the time when users query the system.
class Author (
    Id:           integer,
    AuthorName:   string,
    Publications: set(Paper)
);

Literature(Text: string)
Metacrawler(Where: string, Keyword: string)

Figure 1: Data sources example
Structured data sources in our example include the tables Person and Institute and a stored procedure idf_retrieve defined in a DB2 database, and the classes Paper and Author defined in an object-oriented O2 database. The attribute Summary of table Institute contains a textual description. idf_retrieve is a stored procedure which takes keywords as argument and retrieves Summary attributes of tuples in Institute which contain the input keywords. We also consider a search engine, Literature, which allows searching for various information about publications in computer science, and the metasearch engine Metacrawler for searching the WWW and newsgroups. We represent the functionality exported by a search engine as a stored procedure which takes as arguments the query (e.g. for Literature the parameter Text and for Metacrawler the parameter Keyword) and other information needed for answering the query (e.g. for Metacrawler the attribute Where which specifies where to search, i.e. on the Web or in Newsgroups).
Let us now assume that users are interested in: Which publications have been published by employees of the ``Database Technology'' institute?

As mentioned before, the first two integration approaches offer a global integrated schema which users can query. One example of a global schema for the given data sources is expressed in an object-oriented model in Figure 2. In this case, the global schema merges the tables Person and Institute and the class Author into one class Person. We assume that information delivered by Metacrawler and the functionality offered by idf_retrieve are not of importance for global applications. Note that the generation of the global schema is a complex process which has to be done manually (at least in part), by a system administrator.

class Person (
    ID:           NUMBER,
    Name:         CHAR(50),
    Surname:      CHAR(50),
    Photo:        BLOB,
    Inst_Name:    CHAR(50),
    publications: set(Paper)
);

class Paper (
    Title:    CHAR(50),
    Abstract: CHAR(50),
    Auth:     set(Person)
);

class Literature (
    Text: CHAR(5000);
);

Figure 2: An example of an integrated schema

Our example query might then look as follows, in an OQL-like syntax [6].

SELECT Title
FROM   Paper, Person Pers
WHERE  Paper.Auth.Name = Pers.Name
AND    Pers.Inst_Name = ``Database Technology''

Let us now consider the third approach. In this case, no integration step is required before the system is queried. Only a global view is provided (see Figure 3) which is just the union of the homogenized schemas of the underlying sources (and thus contains all schema elements of Figure 1).

class Person (
    ID:      NUMBER,
    Name:    CHAR(50),
    Surname: CHAR(50),
    Photo:   BLOB,
    Inst:    CHAR(50)
);

class Institute (
    Name:    CHAR(50),
    Depart:  CHAR(50),
    Summary: CHAR(10000)
);

class Paper (
    Title:    string,
    Abstract: string,
    Auth:     set(Author)
);

class Author (
    Id:           integer,
    AuthorName:   string,
    Publications: set(Paper)
);

class Literature (
    Text: CHAR(MAX_STRING)
);

class Metacrawler (
    Keyword: CHAR(MAX_STRING),
    Where:   CHAR(100)
);

class Retrieval (
    idf_retrieve(CHAR(50)): list(CHAR(MAX_STRING))
);

Figure 3: The global (unintegrated) schema. Note that class Retrieval contains just a method, idf_retrieve.

Using this global view, we would like users to be able to formulate queries, ranging from "fuzzy" to precise ones, including for example:

1. What information is available about "Database Technology"? In this case no information about the sources is needed. Obviously, this query returns all kinds of information, from all sources and not only about publications about "Database Technology" (assuming that publications of employees of the "Database Technology" institute are about this topic area).

2. What information in the structured sources and in the "Literature" source is available about "Database Technology"? In this case, the information need is expressed in a more specific way, since the search is restricted to certain sources. Users have to know that information about "Database Technology" may be available in structured sources and in Literature.

3. Select titles and author names from all sources available which contain information about publications in the field of "Database Technology". The search is in this case even more precise: just titles and author names are searched, not all information as in case 1. Obviously, users need to know that attributes like Title and AuthName are available in some of the underlying data sources.

4. Select titles of publications from Paper and names of employees from Person who work at the Institute called "Database Technology". This query can only be formulated if users have extensive knowledge about the underlying data sources and know that it makes sense to combine information from Person, Institute, Paper and Author.

The challenge thus is to design a system which accepts all such kinds of queries. First, the query language should contain various constructs to support a "fuzzy" kind of querying (as in cases 1, 2 and 3). Second, in the last approach the integration of (heterogeneous) data has to be accomplished at the time the system is queried. For the example in case 4, data from Paper, Person, Institute and Author must be combined, and users have to know that it makes sense to combine it, and how to achieve the integration.

Though the "fuzzy" way of querying could also be realized in the first two approaches by offering a suitable query language, this has not been approached so far. In addition, the two challenges complement each other, since integration "on the fly" can be more easily realized by using "fuzzy" elements of the query language, as will be seen later.

3. THE DATA SPACE
In this section we present the conceptual view of data we use in SINGAPORE, which in particular defines the (sets of) data queries can be formulated against.
[Figure 4: The data space. A tree with four levels: Level 1 holds the classification roots (Data1, Data2); Level 2 the data or content type (structured, subdivided into DB_OR with DB2 and DB_OO with O2; semistructured; unstructured, subdivided into Text and HTML); Level 3 the source locations (D1, D2, O2Lit, Addr, Literature, Metacrawler); Level 4 the source schemas, e.g. those of Figure 3 plus the semistructured Address source.]
We regard data stored in a connected source as class extensions (in the spirit of object-orientation) and consider
these as sets which can be queried. Furthermore, we also
want to oer more general queriable data sets, which would
typically embody more than just one data source. For this
reason, we consider unions of extensions according to classications based on various criteria like the kind of data,
the particular contents of sources, etc. The resulting structure is called the data space and is more or less a tree for
which most nodes contain the union (of extensions) of data
items of their subordinate nodes. This does not always apply for nodes which represent the contents of data sources.
In this case, a node may be the composition of its subordinate nodes. For example, a relational table is composed out
of attributes, but it is not the union of the attribute values.
This conceptual view of data allows that any node except
the leaf nodes in the data space can be used as a set which
can be queried.
The explanation above suggests that the data space comprises four levels. The lowest level contains schema denitions of data items stored in the connected sources. Each
data source is identied by a name, specied on level three.
Level two contains classications of data, and on the rst
level, entry points to classications are specied. Apart from
the structure of data which may be automatically gained
from sources, all other information of the data space is dened by the system administrator. We discuss this issue in
Section 6.
Figure 4 represents an example of a data space which contains on the rst level two entry points of classication hierarchies, Data1 and Data2. On level two, data is organized
according to its \structuredness", i.e. structured, semistructured and unstructured data. We further classify structured data sources into object-relational databases (DB OR)
and object-oriented databases (DB OO), and we assume that
structured data are stored in database management systems
DB2 and O2. Unstructured data are organized according to
the kind of data which can be queried, normal text (Text)
or HTML les (html). Level three contains the names of
sources connected to the system, and on level four we represent e.g. the schema of Figure 3 that has been dened for
these sources. Note that we also extended levels three and
four with a semistructured data source storing information
about addresses. Here, an Address is usually composed of
a street, a number, a zip code, and a location. However, an
address may also comprise a name of a house (house name),
a zip code, and a location. As a consequence, an address is
only semistructured (see below).
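The union-of-extensions semantics of inner data-space nodes can be sketched as follows. This is a simplified illustration of the concept only (the node names and tuples are invented, and composition nodes are omitted), not SINGAPORE code.

```python
# Sketch of a data space: inner classification nodes are queried as the
# union of the extensions of their subordinate source nodes.

class Node:
    def __init__(self, name, children=None, extension=None):
        self.name = name
        self.children = children or []
        self.extension = extension  # only source-level nodes carry data

    def extent(self):
        """A classification node's extent is the union of its children's."""
        if self.extension is not None:
            return list(self.extension)
        items = []
        for child in self.children:
            items.extend(child.extent())
        return items

# Invented mini data space: two source nodes under one Level-2 class.
d1 = Node("D1", extension=[{"Name": "Meier", "Inst": "Database Technology"}])
o2lit = Node("O2Lit", extension=[{"AuthorName": "Meier"}])
structured = Node("structured", children=[d1, o2lit])
data1 = Node("Data1", children=[structured])

# Any non-leaf node can be used as a queryable set:
hits = [t for t in data1.extent() if "Meier" in t.values()]
```

Querying the root Data1 here transparently searches both D1 and O2Lit, which is exactly what makes the higher nodes useful for "fuzzy" queries.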
4. THE SINGAPORE QUERY LANGUAGE: SOQL
Assuming the above view on the data, we can now define the query language of SINGAPORE. We define it starting from a language for structured data, in our case OQL, and then extend it with features for unstructured and semistructured data, and for integrating data.
Querying unstructured data.
Basically, querying unstructured data means querying data using keywords. SOQL provides an operator contains which retrieves data from unstructured data sources. For our example, the query

SELECT * FROM Metacrawler, Literature
WHERE contains(``Database Technology'')

will trigger a search for the keywords "Database Technology" in the specified search engines (cf. Figure 4). A variation of contains is contains_related. It takes the same arguments, but also retrieves information pertaining to keywords similar to the given ones. This requires further transformations to be done on the input keywords, for example stemming, word decomposition or the application of ontological knowledge².
Querying semistructured data.
As defined in [1], [7], semistructured data have structure, but the structure is not rigid. In addition, the structure definition or parts of it are not necessarily separated from the data, i.e. they may be implicit. Query languages for semistructured data (LOREL [3], StruQL [8]) deal with these aspects. We thus extend OQL with features of those languages:

Regular path expressions are used to specify the structure of data even if it is not rigid. For a path specification, we introduce two new syntactic elements: "*", if none, one or more nodes are not specified, and "?", if exactly one node is not specified. For example, assume that source Addr accepts queries which specify regular path expressions. In SOQL, a query like SELECT Address.*.zip_code FROM Addr can be formulated.

Type coercion refers to implicit type casts. For example, the expression Address.zip_code = '8001' is allowed in SOQL, where '8001' is a string and Address.zip_code has been specified as an integer.

Using unstructured and semistructured query characteristics for searching in the data space.
We can now take advantage of the above-mentioned characteristics of query languages for unstructured and semistructured data and apply them to the entire data space:

contains and contains_related may be used for querying structured or semistructured data as if it were unstructured. For example, searching for a keyword in a database table triggers a search (exact for contains, approximate for contains_related) in each textual attribute of the table and returns the tuples which contain this string.

Regular path expressions can be used for specifying paths in the entire data space. For our example in Figure 4, structured.*.Name is matched by structured.DB_OR.DB2.D1.Person.Name and structured.DB_OR.DB2.D2.Institute.Name.

Type coercion is also applied to structured data sources in the data space. If for example a condition Person.ID = '101' is specified, where '101' is a string and Person.ID is declared as an integer in D1, a type cast is realized and the expression Person.ID = 101 is sent to D1, where 101 is an integer.

In analogy to [1], which proposes that data and metadata can be used together in queries, we implement a similar feature in our system, but make it applicable to the whole data space. If an attribute (or a path in the data space) has to be specified but its exact name is not known, the operator LIKE applied to a related name triggers a keyword search in the entire data space. For example, in LIKE(author.name) = 'Meier', the exact name of the attribute is not known. A search for the keywords "author, name" (using algorithms for word decomposition, stemming and once again applying ontological knowledge) will result in two matches, D1.Person.Name = 'Meier' and O2Lit.Author.AuthorName = 'Meier'.

²Ontological knowledge is a set of concepts in a particular information domain and the relationships between them [10]. For example, "Computer Science" and "Database Technology" would be related concepts in our application domain.

4.1 Supporting integrated access to heterogeneous data sources
A last query language issue is the combination of heterogeneous data in one query, which may have various meanings for structured and unstructured data. First, consider the query SELECT ... FROM D1.Person, D2.Institute, which is supposed to combine data from D1.Person and D2.Institute (i.e. structured data in a conventional relational database). The result of this query is a projection (depending on the attributes specified in the SELECT part) on the cartesian product of the tables D1.Person and D2.Institute. Second, consider metasearch engines³ which also allow the "combination" of information from different search engines. An example of a query which retrieves information from Literature and Metacrawler (i.e. from unstructured data sources) is SELECT ... FROM Literature, Metacrawler. As in any metasearch engine, the result of this query would be the union of the sets of documents returned by Literature and by Metacrawler.

These examples show that we need to differentiate between the two cases. We introduce a new syntactic feature in SOQL: {}, used to specify the union between the result sets of two or more data sources, and [ ], used to specify the cartesian product between the result sets of two or more data sources (which is also the default if no such brackets are used).

³A metasearch engine is a system which provides a single access point to various search engines. It sends a query to the underlying search engines, combines the results and returns a single, unified hit list to the user.

5. THE SINGAPORE ARCHITECTURE
We present in this section the coarse architecture of SINGAPORE, which follows the usual three-tier architecture mentioned in Section 1 (see Figure 5).

[Figure 5: The coarse architecture of SINGAPORE. The user layer contains the User Interface; the mediation layer contains the Query Transformer (Query Preprocessor and Query Processor), the Query Postprocessor, the Wrapper Manager, the Source Registration component and the Metadata Repository; the source layer contains wrappers over the sources (a search engine, an OODBMS, an RDBMS).]
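The "*" and "?" path elements described above can be illustrated by compiling a SOQL path expression into a regular expression over dot-separated data-space paths. This is our own illustrative sketch (the paths come from Figure 4, the matching logic is invented for illustration), not the actual SINGAPORE implementation.

```python
import re

# Sketch: match SOQL path expressions with "*" (zero or more nodes) and
# "?" (exactly one node) against full paths in the data space.

def compile_path(expr):
    parts = []
    for step in expr.split("."):
        if step == "*":
            parts.append(r"(?:[^.]+\.)*")   # zero or more path nodes
        elif step == "?":
            parts.append(r"[^.]+\.")        # exactly one path node
        else:
            parts.append(re.escape(step) + r"\.")
    pattern = "".join(parts)
    if pattern.endswith(r"\."):
        pattern = pattern[:-2]              # drop the trailing dot
    return re.compile("^" + pattern + "$")

PATHS = [
    "structured.DB_OR.DB2.D1.Person.Name",
    "structured.DB_OR.DB2.D2.Institute.Name",
    "structured.DB_OO.O2.O2Lit.Author.AuthorName",
]

# structured.*.Name matches both Name paths, but not AuthorName.
matches = [p for p in PATHS if compile_path("structured.*.Name").match(p)]

# "?" fixes exactly one intermediate node.
one_level = [p for p in PATHS
             if compile_path("structured.DB_OR.?.D1.Person.Name").match(p)]
```

Note that the anchored match keeps AuthorName from matching *.Name, which is the behaviour the Figure 4 example requires.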
and operations (Operation). Operations have a Signature
and a Result. The union of various properties and operations forms an Entity (for example a table in a relational
database). A data source and an entity itself may contain
other entities. Each element has a name (SName, EName,
PName, OName). The Path variable species for a source its
classication path on level 1 in the data space.
Problems related to schema integration have already been
investigated in the area of federated databases, and we use
these results in our case to determine the integration information which has to be stored in the metadata repository. As described in e.g. [4], the integration process includes
mainly two steps:
Preintegration. The database administrator determines
similarities and conicts between schemas and denes
rules for interschema correspondences. Obviously, this
has to be done manually.
Integration. The integrated schema is (automatically)
built according to the rules dened in the previous
step.
As a consequence, two issues are of importance at the moment the system is queried: Which sources contain similar
information? and Which are the conicts between sources
containing similar information?
These two problems will be discussed in the next subsections, by extending the specications shown before.
Users interact with a presentation layer and use it to submit their queries. The mediation layer has two important
tasks: to process queries and to register new sources into
the system. When a query is entered, it is passed to the
query transformer component which contains mainly two
components: the query preprocessor parses the query and
handles its \fuzzy" part, and the query processor does
further processing, including the generation of subqueries.
These are sent through the wrapper manager to the respective sources. Wrappers translate subqueries into the query
language (or the procedural query interface) of each source.
Sources return results in their particular data model and
wrappers translate them back into the SINGAPORE data
model. Wrappers may also improve the query capabilities of
sources, for example by extracting or ltering data returned
by the source. The wrapper manager sends partial results to
the query postprocessor, which combines the returned results. The source registration component is responsible
for connecting and disconnecting sources.
One of the main components is the metadata repository
which stores information about the underlying data sources.
This information is used by system components (query transformer, query postprocessor and source registration) or by
users when formulating queries. The metadata repository
will be presented in more detail in Section 6.
The mediation layer models queries and data in a common data model which is used as a logically homogenized
view of the data of the underlying sources, but also as a
means to support global query processing. We choose an
already existing one, the ODMG data model, and extend it
to meet our requirements. First, we add a composite type
ss-struct similar to the struct constructor, to represent
semistructured data. In contrast to struct, ss-struct does
not apply any type checking. Note that this type is also implemented in other systems, like [11]. Second, we implement
heterogeneous collections. This is again necessary for representing semistructured data, but also for combining data
from various sources.
6.
nonterminalPredeccessors
*
1 MetadataRepository
1..*
TreeNode
1..*
String: SName
1..* String: Description
String: Related_to
nonterminalSuccessors
ClusterTreeNode
*
1..*
* terminalSuccessors
contains 1..*
*
contains_oper.
contains_prop.
THE METADATA REPOSITORY
1..*
1..*
Property
Operation
Bool: Mandatory
String: Relation
The metadata repository contains the following information: i) structure, query capabilities and result types of data
sources, ii) descriptive information about sources, which we
call \soft" metadata, and iii) ontological knowledge.
This information is specied either manually or entered
automatically. In the former case, the system administrator
has to specify it while in the latter case, the source registration component extracts it from sources (for example, a
DBMS stores data about its schema in its data dictionary
and also allows to query it).
We present in this section examples of metadata needed
to formulate queries and to integrate the data sources at the
time the system is queried, using an XML-like notation [2].
We start from the data contained in the example data space
and extend it later on:
1..*
DataSource
*
1..*
contains_oper.
*
has_entity
1..*
*
Entity
*
*
1..*
1..*
has_underentity
contains_prop.
Figure 6: The main classes of the metadata repository
6.1 Specifying information about similarity
In the traditional case of schema integration, the system
administrator determines which sources contain similar information and whether it makes sense to integrate them.
There are dierent levels of similarities: between the topics covered in dierent sources, between entities, between
properties and between operations. In order to detect such
similarities, users need descriptive information about the
data, represented by \soft" metadata. For this reason, we
rst extend each element in a data source with a new tag
Description which contains a textual description of the content of this element. Additionally, we allow the specication
of descriptive information about which sources, properties,
operations or entities are similar, by introducing the tag
Related to. For example, the \soft" metadata of sources
D1 and O2Lit are as follows:
<!ELEMENT DataSource (SName, Entity*, Property*, Operation*, Path*)>
<!ELEMENT Entity (EName, Type, Entity*, Property*, Operation*)>
<!ELEMENT Property (PName, Type)>
<!ELEMENT Operation (OName, Signature)>
<!ELEMENT Signature (ArgumentList, Result+)>
<!ELEMENT ArgumentList (Type*)>
<!ELEMENT Result (Type)>
Data sources contain various structural units (Property)
<DataSource>
<SName>D1</SName>
<Description>Database about employees of the
Computer Science Department since 1990.
</Description>
<Related_to>O2Lit</Related_to>
<Entity><SName>Person</SName>
<Related_to>O2Lit.Author</Related_to>
</Entity>
<Property><SName>Name</SName>
<Description>The name of the employee</Description>
<Related_to>O2Lit.Author.AuthName</Related_to>
</Property>
...
</DataSource>

<DataSource>
<SName>O2Lit</SName>
<Description>Database about publications of the
Computer Science Department since 1980.
</Description>
<Related_to>D1</Related_to>
<Entity><SName>Author</SName>
<Related_to>D1.Person</Related_to></Entity>
<Property><SName>AuthName</SName>
<Description>The name of the author</Description>
<Related_to>D1.Person.Name and D1.Person.Surname</Related_to>
</Property>
...
</DataSource>

6.3 Source query capabilities
Since we have made the assumption that sources are heterogeneous, they will most likely also have different query capabilities, which users have to know when querying.
First, it may be the case that the specification of some properties is mandatory for some sources. This is indicated in a tag Mandatory for a Property, which can take the values yes or (default) no. One example is Metacrawler, where the property Keyword is mandatory, but Where is optional, since it has "Web" as a default value.
Second, properties may have various dependencies among them in any specific source. There are cases when a group of properties must be specified together. For this reason, we introduce two tags: AND_op, used for properties which have to be specified together, and XOR_op, used for properties which have to be specified alternatively. One example is the source Address, which requires that either house_name, or street and number are specified:
<DataSource>
<SName>Address</SName>
<XOR_op>house_name,<AND_op>street, number</AND_op></XOR_op>
</DataSource>
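A check of a user query against such Mandatory, XOR_op and AND_op metadata can be sketched as follows. The rule encoding is our own, not SINGAPORE's internal representation:

```python
# Hedged sketch: validate the set of properties a user specified
# against the Mandatory / XOR_op / AND_op metadata of a source.

def satisfies(bound, mandatory=(), xor=None):
    """bound: set of property names the user specified.
    mandatory: properties that must always be present.
    xor: pair of alternatives; each alternative is a single property
    or an AND-group (tuple) whose members must be given together."""
    if not all(p in bound for p in mandatory):
        return False
    if xor is not None:
        def full(alt):
            group = alt if isinstance(alt, tuple) else (alt,)
            return all(p in bound for p in group)
        # exactly one alternative must be completely specified
        if sum(1 for alt in xor if full(alt)) != 1:
            return False
    return True

# Source Address: either house_name, or street AND number.
addr_xor = ("house_name", ("street", "number"))
satisfies({"house_name"}, xor=addr_xor)        # True
satisfies({"street"}, xor=addr_xor)            # False: number missing
satisfies({"street", "number"}, xor=addr_xor)  # True

# Source Metacrawler: Keyword is mandatory, Where is optional.
satisfies({"Keyword"}, mandatory=("Keyword",))  # True
```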
Finally, users need to know the semantics of operations defined in data sources. The Signature tag defined before represents just the argument and result types. We introduce a tag Argument which contains the list of properties of the source which must be specified as arguments for this operation. In the example of Figure 1, idf_retrieve is implemented in source D2 and takes as an argument just the attribute Summary of table Institute.
Summarizing, the extensions for Property and Operation are:
<!ELEMENT Property (PName, Type, Description*, Related_to*, Mandatory*, Relation*)>
<!ELEMENT Relation (XOR_op*, AND_op*)>
<!ELEMENT XOR_op (PName, AND_op*)>
<!ELEMENT AND_op (PName, XOR_op*)>
<!ELEMENT Operation (OName, Signature, Description*, Related_to*)>
<!ELEMENT Signature (ArgumentList, Result+)>
<!ELEMENT ArgumentList (Argument*)>
<!ELEMENT Argument (Property*)>

There are mainly two ways to determine similar information in the underlying sources: first, by using the information in Description or Related_to presented before; second, by using ontological knowledge. In Section 2, we have assumed that information about publications in the field of "Database Technology" is desired. However, the Description tag of sources D1 and O2Lit shows that these sources contain information about "Computer Science". The fact that "Database Technology" is a subordinate concept of "Computer Science" is stored in the ontology database of the SINGAPORE metadata repository.
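The subsumption lookup against the ontology database can be sketched as follows. The hierarchy table and function names here are illustrative assumptions, not SINGAPORE's actual ontology representation:

```python
# Illustrative sketch of the ontology lookup: a small is-a hierarchy
# answering whether one topic is subsumed by a broader one, e.g.
# "Database Technology" being subordinate to "Computer Science".
# The table content is an assumption for the example data space.

BROADER = {
    "Database Technology": "Computer Science",
    "Query Processing": "Database Technology",
}

def subsumed_by(concept, ancestor):
    """True if `ancestor` equals `concept` or any broader concept."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = BROADER.get(concept)
    return False

subsumed_by("Database Technology", "Computer Science")  # True
subsumed_by("Computer Science", "Database Technology")  # False
```

With such a lookup, a query about "Database Technology" can be matched against sources whose Description only mentions "Computer Science".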
6.2 Specifying information about conflicts
As described in [15], the following kinds of conflicts may arise during the integration process:
Structural Conflicts. A concept of the real world may be represented differently in two sources. A simple example for our data space is the name of a person, which is represented in table Person as Name and Surname, and in class Author as one string, AuthName. In SINGAPORE, users can search in the "soft" meta information and learn about this structural conflict.

Descriptive Conflicts. There are mainly two types of descriptive conflicts: name conflicts (synonyms or homonyms for attributes) and scaling conflicts (e.g. different units of measurement are used for the same real world object). In SINGAPORE, users can learn about descriptive conflicts through the information stored in the Related_to tag. For scaling conflicts, the necessary transformation should be available. An example is the transformation of km into m, specified in the following way:

<Scale>
<From> km </From> <To> m </To>
<Through> 1 km = 1000 m </Through>
</Scale>

6.4 Using metadata for query processing

The "soft" metadata presented in this section supports users in learning about the underlying data sources in order to formulate queries which integrate data. It is obvious that this metadata may easily become voluminous, and consequently suitable graphical tools need to be provided to present it to users. However, even if such tools are available, the process of formulating queries may become quite complex. Therefore, in the long run, we aim at using metadata also during query processing as far as possible, e.g. by suggesting to users meaningful queries which might match the input query they have formulated using fuzzy query language features. The next section sketches some aspects of the implementation of the metadata repository for processing these features.

7. IMPLEMENTATION

We implement the metadata repository as a collection of trees (the main classes and their relationships are represented in Figure 6). Each element in the metadata repository is a TreeNode. Class ClusterTreeNode represents the classification nodes on levels 1 and 2. On level 3 there are DataSource nodes, and on level 4 the schemas of the sources (composed of Entity, Property and Operation).
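A minimal sketch of this tree of classes follows. The class and field names are taken from Figure 6; the example contents and the path-enumeration helper are our own assumptions:

```python
# Sketch of the repository tree of Figure 6: TreeNode as the common
# superclass, ClusterTreeNode for the classification levels 1 and 2,
# DataSource nodes on level 3 and schema elements on level 4.

class TreeNode:
    def __init__(self, sname, description="", related_to=""):
        self.SName = sname
        self.Description = description
        self.Related_to = related_to
        self.successors = []

    def add(self, child):
        self.successors.append(child)
        return child

class ClusterTreeNode(TreeNode):  # classification nodes, levels 1 and 2
    pass

class DataSource(TreeNode):       # level 3
    pass

class Entity(TreeNode):           # level 4 schema element
    pass

root = ClusterTreeNode("MetadataRepository")
dbs = root.add(ClusterTreeNode("Databases"))          # illustrative cluster
d1 = dbs.add(DataSource("D1", "Database about employees ..."))
d1.add(Entity("Person"))

def paths(node, prefix=""):
    """Enumerate dotted paths, e.g. as input for the Data Space Index."""
    here = prefix + node.SName
    yield here
    for s in node.successors:
        yield from paths(s, here + ".")

list(paths(dbs))  # ['Databases', 'Databases.D1', 'Databases.D1.Person']
```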
Data Space Index
Keyword: Name
List of paths:
D1.Person.Name
D1.Person.Surname
D2.Institute.Name
O2Lit.Author.AuthName
Addr.Address.location.name
Addr.Address.house_name

Figure 7: The data space index

9. ACKNOWLEDGMENTS

We thank Anca Vaduva for her help during the preparation of this paper.
10. REFERENCES
[1] S. Abiteboul. Querying semi-structured data. In
Proceedings of the International Conference on
Database Theory (ICDT97), 1997.
[2] S. Abiteboul, P. Buneman, and D. Suciu. Data on the
Web: From Relations to Semistructured Data and
XML. Morgan Kaufmann Publishers, 1999.
[3] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and
J. Wiener. The Lorel Query Language for
Semistructured Data. International Journal on Digital
Libraries, 1(1), Apr. 1997.
[4] C. Batini, M. Lenzerini, and S. Navathe. A
comparative analysis of methodologies for database
schema integration. ACM Computing Surveys, 18(4), 1986.
[5] R. J. Bayardo and W. B. et al. InfoSleuth:
Agent-based semantic integration of information in
open environments. SIGMOD Record, 1997.
[6] R. Cattell, editor. The Object Database Standard:
ODMG 3.0. Morgan Kaufmann, 2000.
[7] R. Domenig and K. R. Dittrich. An overview and
classification of mediated query systems. SIGMOD
Record, September 1999.
[8] M. Fernandez, D. Florescu, J. Kang, A. Levy, and
D. Suciu. STRUDEL: A Web site management
system. SIGMOD Record, 26(2), 1997.
[9] H. Garcia-Molina, Y. Papakonstantinou, D. Quass,
A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and
J. Widom. The TSIMMIS approach to mediation:
Data models and languages. Journal of Intelligent
Information Systems, 1997.
[10] T. R. Gruber. Toward principles for the design of
ontologies used for knowledge sharing. International
Workshop on Formal Ontology, March 1993.
[11] A. Heuer and D. Priebe. IRQL - yet another language
for querying semi-structured data? Preprint,
Universität Rostock, Fachbereich Informatik, 2000.
[12] A. Levy, A. Rajaraman, and J. Ordille. Querying
heterogeneous information sources using source
descriptions. In Proceedings of the International
Conference on VLDB, India, 1996.
[13] L. Liu, C. Pu, and Y. Lee. Adaptive query mediation
across heterogeneous information sources. In
Proceedings of the International Conference on
Cooperative Information Systems (CoopIS), 1996.
[14] M. Roth, F. Ozcan, and L. Haas. Cost models do
matter: Providing cost information for diverse data
sources in a federated system. In Proceedings of the
International Conference on VLDB, 1999.
[15] S. Spaccapietra, C. Parent, and Y. Dupont. Model
independent assertions for integration of
heterogeneous schemas. VLDB Journal, July 1992.
[16] A. Tomasic, L. Raschid, and P. Valduriez. Scaling
access to heterogeneous data sources with Disco. In
IEEE Transactions on Knowledge and Data
Engineering. IEEE, September/October 1998.
[17] G. Wiederhold. Mediators in the architecture of future
information systems. IEEE Computer, 25(3), 1992.
The metadata repository stores an index called Data Space Index which supports an information retrieval kind of search on paths in the data space.4 Paths of the data space are transformed (indexed) by applying stemming and word decomposition algorithms. The index has a keyword as its key and contains the lists of paths in the data space which match the keyword. For example, when processing the query SELECT LIKE(Name) FROM ..., the Data Space Index is checked, which in our example contains the information shown in Figure 7. Processing of LIKE would transform the given query into six queries, each having one of the paths specified in Figure 7 in its SELECT part. Note that this index can also be used by users when searching for keywords in the data space. Additionally, the metadata repository contains an index called Soft Inform Index which allows users to search, in an information retrieval way, the descriptions stored in the attributes Description and Related_to of each element in the metadata repository. These indexes are built at system start-up and updated when new sources are registered.
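The LIKE expansion against the Data Space Index can be sketched as follows. In the paper, paths are indexed after stemming and word decomposition; here a simple lowercase substring match stands in for that analysis:

```python
# Sketch of the Data Space Index lookup: LIKE(keyword) is expanded into
# one concrete query per matching path. The substring match below is a
# stand-in for the paper's stemming and word decomposition algorithms.

PATHS = [
    "D1.Person.Name", "D1.Person.Surname", "D2.Institute.Name",
    "O2Lit.Author.AuthName", "Addr.Address.location.name",
    "Addr.Address.house_name",
]

def expand_like(keyword, paths=PATHS):
    """Paths matching the fuzzy keyword; each yields one rewritten
    query with that path in its SELECT part."""
    kw = keyword.lower()
    return [p for p in paths if kw in p.lower()]

queries = ["SELECT %s FROM ..." % p for p in expand_like("Name")]
len(queries)  # 6, one per path of Figure 7
```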
8. DISCUSSION AND CONCLUSIONS
We have presented a new approach to querying data from a network of heterogeneous data sources. Our goal was to integrate data sources at query time and to offer the possibility to formulate both exact and fuzzy queries. For this reason we extended OQL with special features, and we have shown that the realization of this approach relies on extensive meta information.

We believe that a system which integrates heterogeneous data through the query language is a necessity today, because it allows connecting new sources in an easy way, without requiring a global integrated schema. Furthermore, users can query the system even if they do not know all query capabilities of the sources. They can formulate queries similar to keyword search in information retrieval systems, but in addition, they can also formulate more targeted queries if they use the information stored in the metadata repository. Thus, in our approach we combine the way of searching in information retrieval systems with the way of searching in DBMS, and therefore the implementation combines techniques from both areas.

Consequently, we offer a solution to the problems of integrated search in multiple heterogeneous data sources and of supporting various advanced search requirements. The next step in our project is to detail the query processing components.
4 This kind of index is called an inverted file in information retrieval systems.