A query based approach for integrating heterogeneous data sources

Ruxandra Domenig                          Klaus R. Dittrich
Department of Information Technology      Department of Information Technology
University of Zurich, Switzerland         University of Zurich, Switzerland
[email protected]                         [email protected]

ABSTRACT

Data available on-line today is spread over heterogeneous data sources like traditional databases or sources of various forms containing unstructured and semistructured data. Obviously, the "technical" availability alone is not at all sufficient for making meaningful use of the existing information, and thus the problem of effectively and efficiently accessing and querying heterogeneous data is an important research issue. One main approach is to integrate the data sources and offer users an a priori defined global schema. Alternatively, there are approaches which implement tools for giving users the possibility to define the global schema themselves. We propose a new approach where heterogeneous sources can be queried through a unified interface and the underlying sources are integrated by means of a query language only. In this paper, we present extensions to OQL which allow to query structurally heterogeneous data, i.e. structured, semistructured and unstructured data alike, and to integrate data on-the-fly. We also deal with issues of the extensive meta information about sources that is needed for integration.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM 2000, McLean, VA USA
© 2000 ACM 1-58113-320-0/00/11 $5.00

1. INTRODUCTION

Progress in both persistent storage of all kinds of data and computer network technology is the main reason for the explosive growth of data that is available on-line. In order to make meaningful use of it, new and innovative ways for searching must be developed. One specific case is to correlate and retrieve information, in a uniform and integrated way, scattered in a network of structurally heterogeneous data sources, i.e. sources which contain structured, semistructured and/or unstructured data.

Generally speaking, integrated search refers to users posing queries against a set of data sources, using a global query language. Queries are internally split, sent to the respective data sources, and later the results are combined. Query processing comprises in this case two main steps: the homogenization of data and its combination, i.e. integration. Homogenization is realized by translating the various query languages and models of the underlying data sources into a global query language and data model. Regarding the process of integration, there are three different approaches that can be taken:

1. The most common approach is to strictly separate integration from the querying process. First, a global schema is manually defined, by following the usual database design steps. Second, mappings between the global and local schemas are defined (though there is no "schema" for unstructured data sources, we use this notion for such sources as well and regard the functionality exported by an unstructured source as a function, e.g. a stored procedure in a DBMS, which has the query, e.g. keywords, as input parameters and the result set as output). Only after completion of this integration step may the system be used for querying. The disadvantage of this approach is that the schema is rather rigid and cannot be easily adjusted when, for example, new sources are connected to the system.

2. A more flexible approach is to offer tools which allow to define the global schema at run-time. While the integration process is still separated from querying, it can be repeated, i.e. a new global schema can be defined more easily when users' needs change. In this way, new sources may be connected more easily. Before querying, users (or their administrators) may themselves define the global schema as well as the mappings between global and local schemas, using for example a specification language. However, this may also be regarded as a drawback, since specifying a lot of technical information (i.e. the global schema and mappings) is not always a simple task and requires a lot of expertise. Note that, compared with the previous approach, the specification of the global schema is restricted by the available tools, and thus not all desired mappings may be definable. In addition, fewer optimizations can be applied in this case.

3. The most flexible approach is to let users do the integration while formulating queries, without having to define a schema beforehand. The system just offers a homogenized view over the underlying data sources, and the integration is done while querying. New sources can be easily connected to the system, since their data only have to be homogenized. When querying, users decide in which sources to search and how to integrate data. In this case, extensive information about the sources, i.e. metadata, must be available. If users have little or no knowledge about the structure and the characteristics of the underlying data sources, some "approximate" or "fuzzy" search is still possible. Alternatively, if users know the underlying data sources and their query capabilities, they can express highly targeted queries. Though many cases (like transactions in financial applications) require exact queries, there are also numerous examples where a "fuzzy" way to query and support for easily connecting and disconnecting data sources are very beneficial.

As will be discussed later, research so far has focused on the first and second approach. The main motivation for the third approach is twofold. First, larger numbers of diverse data sources can be made available for searching more easily, because no (or only few) preparatory steps are needed. Second, since users have more and more diverse objectives, they need a flexible way to search. This includes "fuzzy" ways to query, aiming at getting any information, even if it is just loosely related to the information effectively needed. On the other hand, precise queries also have to be supported in order to get exact information. It turns out that the absence of an integrated schema facilitates the "fuzzy" way of querying, provided that the query language allows for querying heterogeneous data in an integrated way.

Our system SINGAPORE (SINGle Access POint for heterogeneous data REpositories) implements this approach. Our main goal in this paper is to present SINGAPORE's query language, together with the appropriate infrastructure environment. Specifically, we extend OQL with various features for querying structured, semistructured and unstructured data in a flexible and integrated way.

Related work. Several approaches for integrated access to heterogeneous sources have been proposed. Most of them (including ours) are based on a three-tier architecture as presented in [17], consisting of a data source layer, an integration layer and a user or application layer. The lowest layer contains the sources and so-called wrappers, which export their functionality and data. The middle layer comprises mediators, which were introduced in [17] to "simplify, abstract, reduce, merge and explain data". The upper layer represents the interface to users and applications. The TSIMMIS project, which aims at assisting users in implementing mediators for integrating information [9], uses a query language based on OEM, a light-weight model for semistructured data. Data of the underlying data sources are represented in OEM, so in fact TSIMMIS offers a semistructured view of heterogeneous data. The advantage of this system is that it supports users well in specifying integration components, which are then generated automatically. A similar approach is used in the DIOM project [13], which allows users to define a personal view on the data using a meta model, and to express queries based on it. DIOM uses an extension of the ODMG data model [6] for representing data on the mediation level, and an extension of OQL as query language. As a result, more complex queries can be formulated and more semantics of the underlying sources can be exploited, because the ODMG model is more expressive than OEM. DISCO [16] assumes an integrated global view at the mediation layer and proposes solutions for special problems related to query processing in this context: unavailable data sources, query rewriting, and optimization. It also offers some solutions for the extensibility problem by defining the data which have to be exported by wrappers. Further projects related to our work include Garlic [14], InfoSleuth [5] and Information Manifold (IM) [12]. These projects provide solutions to subsets of the problems which we also face. However, to the best of our knowledge, there is no other project which aims at the integration of data through the query language, at the time when users query the system.

The paper is structured as follows. We start with a running example in Section 2. In Section 3 we present the user's view of the data, and in Section 4 we present the SINGAPORE query language. We then sketch the architecture of our system in Section 5. Section 6 deals with the meta information needed to formulate queries and to implement the query language. In Section 7 we describe some implementation aspects of the metadata repository. Section 8 concludes the paper.

2. A RUNNING EXAMPLE

In this section, we compare the three above-mentioned approaches for integration by using a uniform example. To this end, we first consider some data sources (see Figure 1) and second, a query against them.

    create table Person(          create table Institute(
      ID: NUMBER,                   Name: CHAR(50),
      Name: CHAR(50),               Depart: CHAR(50),
      Surname: CHAR(50),            Summary: CHAR(10000)
      Photo: BLOB,                );
      Inst: CHAR(50)
    );

    create function idf_retrieve(CHAR(50))
      RETURNS list(CHAR(10000))
      EXTERNAL NAME '/home/locations/funcloc/dist'
      LANGUAGE C PARAMETER STYLE DB2SQL
      NO SQL DETERMINISTIC NO EXTERNAL ACTION NOT FENCED

    class Paper (                 class Author (
      Title: string,                Id: integer,
      Abstract: string,             AuthorName: string,
      Auth: set(Author)             Publications: set(Paper)
    );                            );

    Literature(Text: string)
    Metacrawler(Where: string, Keyword: string)

    Figure 1: Data sources

The structured data sources in our example include the tables Person and Institute and a stored procedure idf_retrieve defined in a DB2 database, and the classes Paper and Author defined in an object-oriented O2 database. The attribute Summary of table Institute contains a textual description. idf_retrieve is a stored procedure which takes keywords as argument and retrieves the Summary attributes of those tuples in Institute which contain the input keywords. We also consider a search engine, Literature, which allows to search for various information about publications in computer science, and the metasearch engine Metacrawler for searching the WWW and newsgroups. We represent the functionality exported by a search engine as a stored procedure which takes as arguments the query (e.g. for Literature the parameter Text and for Metacrawler the parameter Keyword) and other information needed for answering the query (e.g. for Metacrawler the attribute Where, which specifies where to search, i.e. on the Web or in Newsgroups).

Let us now assume that users are interested in:

    Which publications have been published by employees of the
    "Database Technology" institute?

As mentioned before, the first two integration approaches offer a global integrated schema which users can query. One example of a global schema for the given data sources is expressed in an object-oriented model in Figure 2.

    class Person (                class Paper (
      ID: NUMBER,                   Title: CHAR(50),
      Name: CHAR(50),               Abstract: CHAR(50),
      Surname: CHAR(50),            Auth: set(Person)
      Photo: BLOB,                );
      Inst_Name: CHAR(50),
      publications: set(Paper)
    );

    class Literature (
      Text: CHAR(5000)
    );

    Figure 2: An example of an integrated schema

In this case, the global schema merges the tables Person and Institute and the class Author into one class Person. We assume that the information delivered by Metacrawler and the functionality offered by idf_retrieve are not of importance for global applications. Note that the generation of the global schema is a complex process which has to be done manually (at least in part), by a system administrator. Our example query might then look as follows, in an OQL-like syntax [6]:

    SELECT Title
    FROM   Paper, Person Pers
    WHERE  Paper.Auth.Name = Pers.Name
    AND    Pers.Inst_Name = "Database Technology"

Let us now consider the third approach. In this case, no integration step is required before the system is queried. Only a global view is provided (see Figure 3), which is just the union of the homogenized schemas of the underlying sources (and thus contains all schema elements of Figure 1).

    class Person (                class Institute (
      ID: NUMBER,                   Name: CHAR(50),
      Name: CHAR(50),               Depart: CHAR(50),
      Surname: CHAR(50),            Summary: CHAR(10000)
      Photo: BLOB,                );
      Inst: CHAR(50)
    );

    class Paper (                 class Author (
      Title: string,                Id: integer,
      Abstract: string,             AuthorName: string,
      Auth: set(Author)             Publications: set(Paper)
    );                            );

    class Literature (            class Metacrawler (
      Text: CHAR(MAX_STRING)        Keyword: CHAR(MAX_STRING),
    );                              Where: CHAR(100)
                                  );

    class Retrieval (
      idf_retrieve(CHAR(50)): list(CHAR(MAX_STRING))
    );

    Figure 3: The global (unintegrated) schema

Note that the class Retrieval contains just a method, idf_retrieve. Using this global view, we would like that users can formulate queries ranging from "fuzzy" to precise ones, including for example:

1. What information is available about "Database Technology"? In this case no information about the sources is needed. Obviously, this query returns all kinds of information, from all sources, and not only about publications about "Database Technology" (assuming that publications of employees of the "Database Technology" institute are about this topic area).

2. What information in the structured sources and in the "Literature" source is available about "Database Technology"? In this case, the information need is expressed in a more specific way, since the search is restricted to certain sources. Users have to know that information about "Database Technology" may be available in the structured sources and in Literature.

3. Select titles and author names from all available sources which contain information about publications in the field of "Database Technology". The search is in this case even more precise: just titles and author names are searched, not all information as in case 1. Obviously, users need to know that attributes like Title and AuthName are available in some of the underlying data sources.

4. Select titles of publications from Paper and names of employees from Person who work at the institute called "Database Technology". This query can only be formulated if users have extensive knowledge about the underlying data sources and know that it makes sense to combine information from Person, Institute, Paper and Author.

The challenge thus is to design a system which accepts all such kinds of queries. First, the query language should contain various constructs to support a "fuzzy" kind of querying (as in cases 1, 2 and 3). Second, the integration of (heterogeneous) data has to be accomplished in the last approach at the time the system is queried. For the example in case 4, data from Paper, Person, Institute and Author must be combined, and users have to know that it makes sense to combine it, and how to achieve the integration. Though the "fuzzy" way of querying could also be realized in the first two approaches by offering a suitable query language, this has not been attempted so far. In addition, the two challenges complement each other, since integration "on the fly" can be more easily realized by using the "fuzzy" elements of the query language, as will be seen later.

3. THE DATA SPACE

In this section we present the conceptual view of data we use in SINGAPORE, which in particular defines the (sets of) data queries can be formulated against.
    Figure 4: The data space (a tree with four levels: level 1 holds the
    classification roots Data1 and Data2; level 2 the data or content
    types, i.e. structured (DB_OR with DB2, DB_OO with O2), semistructured,
    and unstructured (HTML, Text); level 3 the source locations D1, D2,
    O2Lit, Addr, Literature and Metacrawler; level 4 the source schemas,
    i.e. the elements of Figure 3 plus the semistructured source Address
    with street, number, zip_code, location and house_name).

We regard the data stored in a connected source as class extensions (in the spirit of object-orientation) and consider these as sets which can be queried. Furthermore, we also want to offer more general queryable data sets, which would typically embody more than just one data source. For this reason, we consider unions of extensions according to classifications based on various criteria like the kind of data, the particular contents of sources, etc. The resulting structure is called the data space and is more or less a tree for which most nodes contain the union (of extensions) of the data items of their subordinate nodes. This does not always apply for nodes which represent the contents of data sources: in this case, a node may be the composition of its subordinate nodes. For example, a relational table is composed of attributes, but it is not the union of the attribute values. This conceptual view of data allows that any node except the leaf nodes in the data space can be used as a set which can be queried. The explanation above suggests that the data space comprises four levels. The lowest level contains the schema definitions of the data items stored in the connected sources. Each data source is identified by a name, specified on level three. Level two contains classifications of data, and on the first level, entry points to classifications are specified.
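The union-of-extensions reading of the upper data-space nodes can be sketched as follows. This is a minimal illustration with invented data items and invented names (Node, queryable_set), not SINGAPORE code:

```python
# Sketch of a data-space tree: a leaf holds the extension of one source,
# and every non-leaf classification node is queried as the union of the
# extensions below it. All data and names are illustrative.

class Node:
    def __init__(self, name, children=None, extension=None):
        self.name = name
        self.children = children or []
        self.extension = extension or []  # data items, only for leaves

    def queryable_set(self):
        """Union of all extensions in the subtree rooted at this node."""
        if not self.children:
            return list(self.extension)
        items = []
        for child in self.children:
            items.extend(child.queryable_set())
        return items

# A fragment of the data space of Figure 4.
d1 = Node("D1", extension=[{"Name": "Meier", "Inst": "Database Technology"}])
o2lit = Node("O2Lit", extension=[{"AuthorName": "Meier"}])
structured = Node("structured", children=[
    Node("DB_OR", children=[Node("DB2", children=[d1])]),
    Node("DB_OO", children=[Node("O2", children=[o2lit])]),
])

print(len(structured.queryable_set()))  # 2: both sources contribute
```

Nodes that represent the contents of a single source (e.g. a table and its attributes) are the exception noted above: they compose rather than union their children, which this sketch leaves out.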
Apart from the structure of the data, which may be gained automatically from the sources, all other information of the data space is defined by the system administrator. We discuss this issue in Section 6. Figure 4 represents an example of a data space which contains on the first level two entry points of classification hierarchies, Data1 and Data2. On level two, data is organized according to its "structuredness", i.e. structured, semistructured and unstructured data. We further classify structured data sources into object-relational databases (DB_OR) and object-oriented databases (DB_OO), and we assume that the structured data are stored in the database management systems DB2 and O2. Unstructured data are organized according to the kind of data which can be queried, plain text (Text) or HTML files (HTML). Level three contains the names of the sources connected to the system, and on level four we represent e.g. the schema of Figure 3 that has been defined for these sources. Note that we also extended levels three and four with a semistructured data source storing information about addresses. Here, an Address is usually composed of a street, a number, a zip code, and a location. However, an address may also comprise the name of a house (house_name), a zip code, and a location. As a consequence, an address is only semistructured (see below).

4. THE SINGAPORE QUERY LANGUAGE: SOQL

Assuming the above view on the data, we can now define the query language of SINGAPORE. We define it starting from a language for structured data, in our case OQL, and then extend it with features for unstructured and semistructured data, and for integrating data.

Querying unstructured data. Basically, querying unstructured data means to query data using keywords. SOQL provides an operator contains which retrieves data from unstructured data sources.
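The intended semantics of such a keyword operator can be sketched in a few lines; the data and the helper are invented for illustration, and the real operator of course pushes the search to the underlying search engines rather than filtering in memory:

```python
# Hedged sketch of a "contains"-style keyword operator: keep the items of a
# source whose textual values contain every given keyword (case-insensitive).

def contains(items, keywords):
    """Return the items whose string-valued fields contain all keywords."""
    hits = []
    for item in items:
        text = " ".join(v for v in item.values() if isinstance(v, str)).lower()
        if all(k.lower() in text for k in keywords):
            hits.append(item)
    return hits

# Invented stand-in for the result list of the Literature source.
literature = [
    {"Title": "Database Technology Trends", "Abstract": "A survey."},
    {"Title": "Compilers", "Abstract": "Parsing techniques."},
]
print(contains(literature, ["Database", "Technology"]))  # first item only
```

A contains_related variant would additionally expand the keyword list (stemming, word decomposition, ontological neighbours) before matching.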
For our example, the query

    SELECT *
    FROM Metacrawler, Literature
    WHERE contains("Database Technology")

will trigger a search for the keywords "Database Technology" in the specified search engines (cf. Figure 4). A variation of contains is contains_related. It takes the same arguments, but information pertaining to keywords similar to the given ones is retrieved as well. This requires further transformations to be applied to the input keywords, for example stemming, word decomposition or the application of ontological knowledge. (Ontological knowledge is a set of concepts in a particular information domain and the relationships between them [10]. For example, "Computer Science" and "Database Technology" would be related concepts in our application domain.)

Querying semistructured data. As defined in [1], [7], semistructured data have structure, but the structure is not rigid. In addition, the structure definition or parts of it are not necessarily separated from the data, i.e. they may be implicit. Query languages for semistructured data (LOREL [3], StruQL [8]) deal with these aspects. We thus extend OQL with features of those languages:

- Regular path expressions are used to specify the structure of data even if it is not rigid. For a path specification, we introduce two new syntactic elements: "*", if none, one or more nodes are not specified, and "?", if exactly one node is not specified. For example, assume that the source Addr accepts queries which specify regular path expressions. In SOQL, a query like SELECT Address.*.zip_code FROM Addr can be formulated.

- Type coercion refers to implicit type casts. For example, the expression Address.zip_code = '8001' is allowed in SOQL, where '8001' is a string and Address.zip_code has been specified as an integer.

Using unstructured and semistructured query characteristics for searching in the data space. We can now take advantage of the above-mentioned characteristics of query languages for unstructured and semistructured data and apply them to the entire data space:

- contains and contains_related may be used for querying structured or semistructured data as if it were unstructured. For example, searching for a keyword in a database table triggers a search (exact for contains, approximate for contains_related) in each textual attribute of the table and returns the tuples which contain this string.

- Regular path expressions can be used for specifying paths in the entire data space. For our example in Figure 4, structured.*.Name is matched by structured.DB_OR.DB2.D1.Person.Name and structured.DB_OR.DB2.D2.Institute.Name.

- Type coercion is also applied to structured data sources in the data space. If for example a condition Person.ID = '101' is specified, where '101' is a string and Person.ID is declared as an integer in D1, a type cast is realized and the expression Person.ID = 101 is sent to D1, where 101 is an integer.

- In analogy to [1], which proposes that data and metadata can be used together in queries, we implement a similar feature in our system, but make it applicable to the whole data space. If an attribute (or a path in the data space) has to be specified but its exact name is not known, the operator LIKE applied to a related name triggers a keyword search in the entire data space. For example, in LIKE(author.name) = 'Meier', the exact name of the attribute is not known. A search for the keywords "author, name", using algorithms for word decomposition and stemming and once again applying ontological knowledge, will result in two matches, D1.Person.Name = 'Meier' and O2Lit.Author.AuthorName = 'Meier'.

4.1 Supporting integrated access to heterogeneous data sources

A last query language issue is the combination of heterogeneous data in one query, which may have different meanings for structured and unstructured data. First, consider the query SELECT ... FROM D1.Person, D2.Institute, which is supposed to combine data from D1.Person and D2.Institute (i.e. structured data in a conventional relational database). The result of this query is a projection (depending on the attributes specified in the SELECT part) on the cartesian product of the tables D1.Person and D2.Institute. Second, consider metasearch engines, which also allow the "combination" of information from different search engines. (A metasearch engine is a system which provides a single access point to various search engines. It sends a query to the underlying search engines, combines the results and returns a single, unified hit list to the user.) An example of a query which retrieves information from Literature and Metacrawler (i.e. from unstructured data sources) is SELECT ... FROM Literature, Metacrawler. As in any metasearch engine, the result of this query would be the union of the sets of documents returned by Literature and by Metacrawler. These examples show that we need to differentiate between the two cases. We therefore introduce a new syntactical feature in SOQL: {}, used to specify the union of the result sets of two or more data sources, and [ ], used to specify the cartesian product of the result sets of two or more data sources (which is also the default if no such brackets are used).

5. THE SINGAPORE ARCHITECTURE

We present in this section the coarse architecture of SINGAPORE, which follows the usual three-tier architecture mentioned in Section 1 (see Figure 5).

    Figure 5: The coarse architecture of SINGAPORE (user layer: the user
    interface; mediation layer: the query transformer with its query
    preprocessor and query processor, the query postprocessor, the source
    registration component, the metadata repository and the wrapper
    manager; source layer: wrappers on top of the sources, e.g. search
    engines, OODBMSs and RDBMSs).
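To make the two combination semantics concrete, the following sketch runs subqueries through per-source stand-ins for wrappers and then combines the partial results. The names (wrapper, postprocess) and data are invented; this illustrates the {} (union) and [ ] (cartesian product) semantics, not SINGAPORE's implementation:

```python
# Sketch: subqueries go to per-source wrappers; the postprocessor combines
# the partial results either as a union (SOQL's {}) or as a cartesian
# product (SOQL's [ ], the default). All names and data are invented.

from itertools import product

def wrapper(source_items, keywords):
    """Stand-in for a wrapper: keyword-filter one source's items."""
    return [item for item in source_items
            if all(k.lower() in str(item).lower() for k in keywords)]

def postprocess(result_sets, mode="product"):
    """Combine partial results: 'union' for {}, 'product' for [ ]."""
    if mode == "union":
        return [item for rs in result_sets for item in rs]
    return [tuple(t) for t in product(*result_sets)]

literature = [{"Title": "Database Technology Trends"}]
metacrawler = [{"URL": "a", "Snippet": "a database survey"},
               {"URL": "b", "Snippet": "databases explained"}]

partial = [wrapper(literature, ["database"]),
           wrapper(metacrawler, ["database"])]
print(len(postprocess(partial, mode="union")))    # 3: one unified hit list
print(len(postprocess(partial, mode="product")))  # 2: 1 x 2 combined tuples
```

The union mode mirrors a metasearch engine's single hit list, while the product mode mirrors the relational combination of two structured sources.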
Operations have a Signature and a Result. The union of various properties and operations forms an Entity (for example a table in a relational database). A data source and an entity itself may contain other entities. Each element has a name (SName, EName, PName, OName). The Path variable species for a source its classication path on level 1 in the data space. Problems related to schema integration have already been investigated in the area of federated databases, and we use these results in our case to determine the integration information which has to be stored in the metadata repository. As described in e.g. [4], the integration process includes mainly two steps: Preintegration. The database administrator determines similarities and conicts between schemas and denes rules for interschema correspondences. Obviously, this has to be done manually. Integration. The integrated schema is (automatically) built according to the rules dened in the previous step. As a consequence, two issues are of importance at the moment the system is queried: Which sources contain similar information? and Which are the conicts between sources containing similar information? These two problems will be discussed in the next subsections, by extending the specications shown before. Users interact with a presentation layer and use it to submit their queries. The mediation layer has two important tasks: to process queries and to register new sources into the system. When a query is entered, it is passed to the query transformer component which contains mainly two components: the query preprocessor parses the query and handles its \fuzzy" part, and the query processor does further processing, including the generation of subqueries. These are sent through the wrapper manager to the respective sources. Wrappers translate subqueries into the query language (or the procedural query interface) of each source. 
Sources return results in their particular data model, and wrappers translate them back into the SINGAPORE data model. Wrappers may also improve the query capabilities of sources, for example by extracting or filtering data returned by the source. The wrapper manager sends partial results to the query postprocessor, which combines the returned results. The source registration component is responsible for connecting and disconnecting sources. One of the main components is the metadata repository, which stores information about the underlying data sources. This information is used by system components (query transformer, query postprocessor and source registration) or by users when formulating queries. The metadata repository will be presented in more detail in Section 6.

The mediation layer models queries and data in a common data model which serves as a logically homogenized view of the data of the underlying sources, but also as a means to support global query processing. We choose an already existing one, the ODMG data model, and extend it to meet our requirements. First, we add a composite type ss-struct, similar to the struct constructor, to represent semistructured data. In contrast to struct, ss-struct does not apply any type checking. Note that this type is also implemented in other systems, like [11]. Second, we implement heterogeneous collections. This is again necessary for representing semistructured data, but also for combining data from various sources.

6. THE METADATA REPOSITORY

[Figure 6: The main classes of the metadata repository — MetadataRepository, TreeNode (SName, Description, Related_to), ClusterTreeNode, DataSource, Entity, Property (Mandatory, Relation) and Operation, connected by the relationships contains, has_entity, has_underentity, contains_prop. and contains_oper.]

The metadata repository contains the following information: i) structure, query capabilities and result types of data sources, ii) descriptive information about sources, which we call "soft" metadata, and iii) ontological knowledge. This information is either specified manually or entered automatically. In the former case, the system administrator has to specify it, while in the latter case, the source registration component extracts it from sources (for example, a DBMS stores data about its schema in its data dictionary and also allows to query it). In this section, we present examples of metadata needed to formulate queries and to integrate the data sources at the time the system is queried, using an XML-like notation [2]. We start from the data contained in the example data space and extend it later on:

<!ELEMENT DataSource (SName, Entity*, Property*, Operation*, Path*)>
<!ELEMENT Entity (EName, Type, Entity*, Property*, Operation*)>
<!ELEMENT Property (PName, Type)>
<!ELEMENT Operation (OName, Signature)>
<!ELEMENT Signature (ArgumentList, Result+)>
<!ELEMENT ArgumentList (Type*)>
<!ELEMENT Result (Type)>

Data sources contain various structural units (Property) ...

6.1 Specifying information about similarity

In the traditional case of schema integration, the system administrator determines which sources contain similar information and whether it makes sense to integrate them. There are different levels of similarity: between the topics covered in different sources, between entities, between properties and between operations. In order to detect such similarities, users need descriptive information about the data, represented by "soft" metadata. For this reason, we first extend each element in a data source with a new tag Description which contains a textual description of the content of this element. Additionally, we allow the specification of descriptive information about which sources, properties, operations or entities are similar, by introducing the tag Related_to. For example, the "soft" metadata of sources D1 and O2Lit are as follows:

<DataSource>
  <SName>D1</SName>
  <Description>Database about employees of the Computer Science Department since 1990.</Description>
  <Related_to>O2Lit</Related_to>
  <Entity><SName>Person</SName>
    <Related_to>O2Lit.Author</Related_to>
  </Entity>
  <Property><SName>Name</SName>
    <Description>The name of the employee</Description>
    <Related_to>O2Lit.Author.AuthName</Related_to>
  </Property>
  .....
</DataSource>

<DataSource>
  <SName>O2Lit</SName>
  <Description>Database about publications of the Computer Science Department since 1980.</Description>
  <Related_to>D1</Related_to>
  <Entity><SName>Author</SName>
    <Related_to>D1.Person</Related_to>
  </Entity>
  <Property><SName>AuthName</SName>
    <Description>The name of the author</Description>
    <Related_to>D1.Person.Name and D1.Person.Surname</Related_to>
  </Property>
  .....
</DataSource>

6.3 Source query capabilities

Since we have made the assumption that sources are heterogeneous, they will most likely also have different query capabilities, which users have to know when querying. First, it may be the case that the specification of some properties is mandatory for some sources. This is indicated in a tag Mandatory for a Property, which can take the values yes or (by default) no. One example is Metacrawler, where the property Keyword is mandatory, but Where is optional, since it has "Web" as a default value. Second, properties may have various dependencies among them in any specific source. There are cases when a group of properties must be specified together.
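As an illustration of how such capability metadata could be exploited at query time, the following Python sketch checks a set of bound properties against per-source Mandatory flags and fills in default values. The dictionary encoding and function name are our own illustration, not SINGAPORE's actual metadata format:

```python
# Sketch: query-time validation of source query capabilities.
# The capability encoding below is illustrative, not SINGAPORE's format.

# Per-source capability metadata: mandatory flags and default values,
# here for the Metacrawler example of Section 6.3.
CAPABILITIES = {
    "Metacrawler": {
        "Keyword": {"mandatory": True},
        "Where": {"mandatory": False, "default": "Web"},
    },
}

def validate_query(source, bound_properties):
    """Return the effective property bindings for `source`, filling in
    defaults and rejecting queries that omit mandatory properties."""
    caps = CAPABILITIES[source]
    effective = dict(bound_properties)
    missing = []
    for prop, spec in caps.items():
        if prop not in effective:
            if spec.get("mandatory"):
                missing.append(prop)
            elif "default" in spec:
                effective[prop] = spec["default"]
    if missing:
        raise ValueError(f"{source}: mandatory properties missing: {missing}")
    return effective
```

A query binding only Keyword would thus be accepted and silently completed with Where = "Web", while a query omitting Keyword would be rejected before being sent to the source.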
For this reason, we introduce two tags: AND_op, used for properties which have to be specified together, and XOR_op, used for properties which have to be specified alternatively. One example is the source Address, which requires that either a house name, or a street and a number are specified:

<DataSource>
  <SName>Address</SName>
  <XOR_op>house_name,<AND_op>street, number</AND_op></XOR_op>
</DataSource>

Finally, users need to know the semantics of operations defined in data sources. The Signature tag defined before represents just the argument and result types. We introduce a tag Argument which contains the list of properties of the source which must be specified as arguments for this operation. In the example of Figure 1, idf_retrieve is implemented in source D2 and takes as an argument just the attribute Summary of table Institute. Summarizing, the extensions for Property and Operation are:

<!ELEMENT Property (PName, Type, Description*, Related_to*, Mandatory*, Relation*)>
<!ELEMENT Relation (XOR_cond*, AND_cond*)>
<!ELEMENT XOR_op (PName, AND_cond*)>
<!ELEMENT AND_op (PName, XOR_cond*)>
<!ELEMENT Operation (OName, Signature, Description*, Related_to*)>
<!ELEMENT Signature (ArgumentList, Result+)>
<!ELEMENT ArgumentList (Argument*)>
<!ELEMENT Argument (Property*)>

There are mainly two ways to determine similar information in the underlying sources: first, by using the information in the Description and Related_to tags presented before; second, by using ontological knowledge. In Section 2, we have assumed that information about publications in the field of "Database Technology" is desired. However, the Description tag of sources D1 and O2Lit shows that these sources contain information about "Computer Science". The fact that "Database Technology" is a subordinate concept of "Computer Science" is stored in the ontology database of the SINGAPORE metadata repository.

6.2 Specifying information about conflicts

As described in [15], the following kinds of conflicts may arise during the integration process:

Structural conflicts. A concept of the real world may be represented differently in two sources. A simple example for our data space is the name of a person, which is represented in table Person as a Name and a Surname, and in class Author as one string, AuthName. In SINGAPORE, users can search in the "soft" meta information and learn about this structural conflict.

Descriptive conflicts. There are mainly two types of descriptive conflicts: name conflicts (synonyms or homonyms for attributes) and scaling conflicts (e.g., different units of measurement are used for the same real-world object). In SINGAPORE, users can learn about descriptive conflicts through the information stored in the Related_to tag. For scaling conflicts, the necessary transformation should be available; an example is the transformation of km into m, specified through a Scale element.

6.4 Using metadata for query processing

The "soft" metadata presented in this section supports users in learning about the underlying data sources in order to formulate queries which integrate data. Obviously, this metadata may easily become voluminous, and consequently, suitable graphical tools need to be provided to present it to users. However, even if such tools are available, the process of formulating queries may get quite complex. Therefore, in the long run, we aim at using metadata also during query processing as far as possible, e.g. by suggesting to users meaningful queries which might match the input query they have formulated using fuzzy query language features. The next section sketches some aspects of the implementation of the metadata repository for processing these features.

7. IMPLEMENTATION

We implement the metadata repository as a collection of trees (the main classes and their relationships are represented in Figure 6).
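A minimal Python sketch may make this class structure concrete. Class and attribute names follow Figure 6; the constructor signatures and the list-valued relationship attributes are our own illustration, not SINGAPORE's actual implementation:

```python
# Sketch of the metadata-repository classes of Figure 6.
# Class names follow the figure; the rest is illustrative.

class TreeNode:
    """Common superclass: every repository element has a name (SName)
    and the "soft" metadata tags Description and Related_to."""
    def __init__(self, sname, description="", related_to=()):
        self.sname = sname
        self.description = description
        self.related_to = list(related_to)

class Property(TreeNode):
    def __init__(self, sname, mandatory=False, **kw):
        super().__init__(sname, **kw)
        self.mandatory = mandatory  # the Mandatory flag of Section 6.3

class Operation(TreeNode):
    pass

class Entity(TreeNode):
    """Schema node; may contain sub-entities (has_underentity),
    properties (contains_prop.) and operations (contains_oper.)."""
    def __init__(self, sname, **kw):
        super().__init__(sname, **kw)
        self.entities, self.properties, self.operations = [], [], []

class DataSource(TreeNode):
    """Source node; owns entities (has_entity), properties and operations."""
    def __init__(self, sname, **kw):
        super().__init__(sname, **kw)
        self.entities, self.properties, self.operations = [], [], []

class ClusterTreeNode(TreeNode):
    """Classification node; its successors are either further
    clusters or DataSource nodes."""
    def __init__(self, sname, **kw):
        super().__init__(sname, **kw)
        self.successors = []
```

Building the D1 fragment of Section 6.1 then amounts to creating a DataSource node for D1, attaching an Entity node for Person via has_entity, and attaching a Property node for Name to that entity.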
Each element in the metadata repository is a TreeNode. Class ClusterTreeNode represents the classification nodes on levels 1 and 2. On level 3 there are DataSource nodes and on level 4 the schemas of the sources (composed of Entity, Property and Operation). The scaling transformation of Section 6.2, for example, is stored as:

<Scale>
  <From> km </From>
  <To> m </To>
  <Through> 1 km = 1000 m </Through>
</Scale>

The metadata repository stores an index called Data Space Index which supports an information-retrieval kind of search on paths in the data space (this kind of index is called an inverted file in information retrieval systems). Paths of the data space are transformed (indexed) by applying stemming and word decomposition algorithms. The index has a keyword as key and contains the lists of paths in the data space which match the keyword. For example, when processing the query SELECT LIKE(Name) FROM ..., the Data Space Index is checked, which in our example contains the information shown in Figure 7. Processing of LIKE would transform the given query into six queries, each having one of the paths specified in Figure 7 in its SELECT part.

[Figure 7: The data space index — the keyword Name maps to the list of paths D1.Person.Name, D1.Person.Surname, D2.Institute.Name, O2Lit.Author.AuthName, Addr.Address.location.name and Addr.Address.house_name.]

Note that this index can also be used by users when searching for keywords in the data space. Additionally, the metadata repository contains an index called Soft Inform Index which allows users to search, in an information-retrieval way, the descriptions stored in the attributes Description and Related_to of each element in the metadata repository. These indexes are built at system startup and updated when new sources are registered.

8. DISCUSSION AND CONCLUSIONS

We have presented a new approach to querying data from a network of heterogeneous data sources. Our goal was to integrate data sources at query time and to offer the possibility to formulate both exact and fuzzy queries. For this reason we extended OQL with special features, and we have shown that the realization of this approach is based on extensive meta information. We believe that a system which integrates heterogeneous data through the query language is a necessity today, because it allows new sources to be connected in an easy way, without requiring a global integrated schema. Furthermore, users can query the system even if they do not know all query capabilities of the sources. They can formulate queries similar to keyword search in information retrieval systems, but in addition, they can also formulate more targeted queries if they use the information stored in the metadata repository. Thus, our approach combines the way of searching in information retrieval systems with the way of searching in DBMS, and therefore the implementation combines techniques from both areas. Consequently, we offer a solution for the problems of integrated search in multiple heterogeneous data sources and of supporting various advanced search requirements. The next step in our project is to detail the query processing components.

9. ACKNOWLEDGMENTS

We thank Anca Vaduva for her help during the preparation of this paper.

10. REFERENCES

[1] S. Abiteboul. Querying semi-structured data. In Proceedings of the International Conference on Database Theory (ICDT), 1997.
[2] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann Publishers, 1999.
[3] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1), Apr. 1997.
[4] C. Batini, M. Lenzerini, and S. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 1986.
[5] R. J. Bayardo et al. InfoSleuth: Agent-based semantic integration of information in open environments. SIGMOD Record, 1997.
[6] R. Cattell, editor. The Object Database Standard: ODMG 3.0. Morgan Kaufmann, 2000.
[7] R. Domenig and K. R. Dittrich. An overview and classification of mediated query systems. SIGMOD Record, Sept. 1999.
[8] M. Fernandez, D. Florescu, J. Kang, A. Levy, and D. Suciu. STRUDEL: A Web site management system. SIGMOD Record, 26(2), 1997.
[9] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and J. Widom. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 1997.
[10] T. R. Gruber. Toward principles for the design of ontologies used for knowledge sharing. In International Workshop on Formal Ontology, March 1993.
[11] A. Heuer and D. Priebe. IRQL - yet another language for querying semi-structured data? Preprint, Universität Rostock, Fachbereich Informatik, 2000.
[12] A. Levy, A. Rajaraman, and J. Ordille. Querying heterogeneous information sources using source descriptions. In Proceedings of the International Conference on VLDB, India, 1996.
[13] L. Liu, C. Pu, and Y. Lee. Adaptive query mediation across heterogeneous information sources. In Proceedings of the International Conference on Cooperative Information Systems (CoopIS), 1996.
[14] M. Roth, F. Ozcan, and L. Haas. Cost models do matter: Providing cost information for diverse data sources in a federated system. In Proceedings of the International Conference on VLDB, 1999.
[15] S. Spaccapietra, C. Parent, and Y. Dupont. Model independent assertions for integration of heterogeneous schemas. VLDB Journal, July 1992.
[16] A. Tomasic, L. Raschid, and P. Valduriez. Scaling access to heterogeneous data sources with Disco. IEEE Transactions on Knowledge and Data Engineering, Sept./Oct. 1998.
[17] G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, 25(3), 1992.