
Concept Hierarchies for Database Integration in a Multidatabase System

Pauray S.M. Tsai and Arbee L.P. Chen
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, R.O.C.

Abstract

Since relational DBMSs are widespread, it is significant to provide a relational global view through which users who are familiar only with the relational model can access data in a multidatabase system. In this paper, we propose an approach to integrate database schemas into a relational global view which consists of concept hierarchies. A concept hierarchy is composed of relation schemes with the same concept, to which specialization and generalization can be specified to enrich the semantics of the relational view. Based on the concept hierarchies, a procedure is designed to decompose a global query into local subqueries, and a set of transformation rules is developed to transform a query into an equivalent one for query optimization.

1 Introduction

Because of the increasing need for data sharing among multiple databases, the development of multidatabase systems [ASD91, SBD81] has been considered an important research issue [Br90]. There are two approaches to manipulating data in a multidatabase system. One is to provide users with a logically integrated global view, which presents a high level of transparency and a uniform interface for retrieving data in the multidatabase system. Much research has focused on data/schema integration [BOT86, DH84, KDN92, Mo87, SPD92]. Dayal and Hwang [DH84] and Motro [Mo87] adopted this approach based on the functional model, while Breitbart et al. [BOT86] and Deen et al. [DAT87] based their work on the relational model, and Koh and Chen [KC93] on the object model. Batini et al. surveyed twelve methodologies for database or view integration in [BLN86]. In [SLC88], an interactive interface was developed to get the information required for the integration and to integrate schemas according to the provided semantics.
The other approach is to provide users with a multidatabase query language [LAZ89]; namely, users pose their queries against the local schemas by using a multidatabase manipulation language. This approach does not require explicitly creating a global view, but the user needs sufficient information about the local schemas to specify a query. Czejdo et al. [CRE87] used a relational language to perform schema integration in the process of query formulation, in which conflicts in component schemas were resolved by specially defined operators and domain incompatibilities were resolved by extended abstract data types. In [REC89], a graphical multidatabase query language was developed to manipulate data across databases, and a knowledge base was used to resolve schema incompatibilities among different databases.

Appeared in Proc. Sixth International Conference on Management of Data (1994).

Relational database systems have been widely used, and object-oriented database systems have also become popular in recent years. To help different users and applications access data, a multidatabase system can provide two views: one based on the relational model and the other based on the object-oriented model. DATAPLEX [Ch90] used the relational model as a common data model for accessing heterogeneous distributed databases, while the Amoco Distributed Database System (ADDS) [BOT86] integrated relational, network, and hierarchical databases by an extended relational data model. Chen et al. [CKK94] considered various schema and data conflicts and developed a methodology to provide an object view over multiple object databases. The Pegasus [ASD91] heterogeneous multidatabase system used both type and function abstractions to integrate different databases by an object-oriented data model.
In prior work using the relational model for schema integration, it is difficult for users to understand the relationships among relations in different databases, especially when the number of component databases is large. In this paper, we propose an approach to integrate database schemas into a relational global view which consists of concept hierarchies. A concept hierarchy is composed of relation schemes with the same concept. For example, relations relating to students are organized into the STUDENT concept hierarchy. Similar to the concept hierarchy of [GN87], which provides information for inductive learning, the concept hierarchy that we propose provides users with valuable information to capture the relationships among different relations for specifying queries. Concept hierarchies are created by the multidatabase administrator, who collects the necessary information for integration from local database administrators and integrates relation schemes into concept hierarchies by a schema integration language.

Note that our approach is different from the work of Dayal and Hwang [DH84]. In [DH84], the functional data model with generalization is used to integrate database schemas, and queries are specified in a query language like DAPLEX [Sh81], while we use the relational data model with concept hierarchies to create a global view and SQL as the query language. With concept hierarchies, the user who references the relational view can easily understand the relationships among relations in different databases, and the burden of specifying a query is relieved.

Schema integration by our approach consists of three phases:

Phase 1: map the non-relational schemas into relational ones.
Phase 2: collect the necessary information for integration from local database administrators. The technique of assertion specification [SLC88] can be used to represent the integration semantics.
Phase 3: create concept hierarchies by a schema integration language.
Schema conflicts can be resolved in the process of integration. Each relation in a concept hierarchy can be a virtual relation or a real relation. A real relation corresponds to a relation defined in Phase 1, while a virtual relation is generated from real or virtual relations by a generalization or specialization operation. The problems of Phase 1 and Phase 2 have been well studied. In this paper, we devote our attention to Phase 3 and to query processing based on the relational global view with concept hierarchies.

This paper is organized as follows. In Section 2, we describe the creation of the concept hierarchies. A schema integration language is developed to perform the integration. In Section 3, query processing based on the concept hierarchies is discussed. Transformation rules are developed to transform a query into an equivalent one for query optimization. Finally, we conclude with the future work in Section 4.

Figure 1: An example multidatabase system. (The figure shows the relation schemes of the component databases DB1, DB2, and DB3, and the creation of concept hierarchies by the multidatabase administrator, which yields the global view and the data dictionary.)

2 Concept Hierarchies

In this section, we consider the creation of a relational global view which is composed of concept hierarchies.

2.1 A Schema Integration Language

Consider the example multidatabase system shown in Figure 1. The multidatabase system consists of three individual databases. The relations TEACHER and STUDENT in database DB1 record the information about teachers and students, respectively, in the department of Computer Science at National Tsing Hua University (NTHU).
In database DB2, relations TEACHER and STUDENT record the information about teachers and students, respectively, in the department of Electrical Engineering at NTHU. In database DB3, relations TEACHER and STUDENT record the information about teachers and students, respectively, at National Chiao Tung University (NCTU). In addition, the relation WORKER_STUDENT in DB3 records the information about worker-students at NCTU.

The multidatabase administrator is provided with a schema integration language for the creation of concept hierarchies from the local schemas. The information about the mapping between the local schemas and the concept hierarchies is stored in the data dictionary. Users specify queries against the concept hierarchies, which are translated into equivalent queries against the local schemas by referring to the data dictionary. In the following, we describe the constructs for defining the concept hierarchies.

1. Defining multidatabase relations

    CREATE <multidatabase relation name>
    FROM <database identifier>.<database relation identifier>
    [WHERE ( RN <attribute identifier of the database relation> AS <attribute name>
             [, RN <attribute identifier of the database relation> AS <attribute name>] ... )]

This statement defines a relation of a component database to be a multidatabase relation. In syntactic definitions, square brackets indicate that the enclosed material is optional. An ellipsis "..." indicates that the immediately preceding syntactic unit may optionally be repeated one or more times. Material in <> must be replaced by a specific value given by the user, and material in capitals must be written exactly as shown. The names of a database, a database relation (or a multidatabase relation which has been defined), and an attribute of a (multi)database relation are called a database identifier, a database relation identifier (or a multidatabase relation identifier), and an attribute identifier, respectively.
The defined multidatabase relation implicitly has all the attributes of the corresponding database relation; however, the RN clause permits renaming a database attribute in the definition of a multidatabase relation. The multidatabase relation defined by this statement is a real relation. The instances in the multidatabase relation can be obtained by the following procedure:

(a) Fetch the database relation specified in the FROM clause.
(b) Do the actions in the WHERE clause.

2. Defining generalized virtual relations

    CREATE <multidatabase relation name> AS GEN OF
        <multidatabase relation identifier>, <multidatabase relation identifier>
        [, <multidatabase relation identifier>] ...
    CLASSIFYING ATTRIBUTE ( <classifying attribute name> DOMAIN AS
        (<classifying attribute value for the corresponding multidatabase relation>,
         <classifying attribute value for the corresponding multidatabase relation>
         [, <classifying attribute value for the corresponding multidatabase relation>] ... ))
    [WHERE ( { RN <multidatabase relation identifier>.<attribute identifier> AS <attribute name>
             | CONVERT <multidatabase relation identifier>.(<attribute identifier> [, <attribute identifier>] ...)
               TO <attribute name> BY <function name> }
             [, { RN <multidatabase relation identifier>.<attribute identifier> AS <attribute name>
                | CONVERT <multidatabase relation identifier>.(<attribute identifier> [, <attribute identifier>] ...)
                  TO <attribute name> BY <function name> }] ... )]

The statement defines a multidatabase relation to be the generalization of some existing multidatabase relations. The multidatabase relation defined by this statement is a generalized virtual relation because it is not a real relation. The generalized virtual relation is called the superconcept relation of the multidatabase relations from which it is derived, and the multidatabase relations that are generalized into the superconcept relation are called the subconcept relations of the superconcept relation.
The CLASSIFYING ATTRIBUTE clause defines a classifying attribute which is used to identify the corresponding subconcept relation for each instance. The attributes of the generalized virtual relation are the common attributes of its subconcept relations plus the classifying attribute. The DOMAIN AS clause specifies the value of the classifying attribute for each corresponding subconcept relation in the GEN OF clause. Naming conflicts among the subconcept relations can be resolved by the RN clause. Moreover, the CONVERT clause can be used to resolve data representation conflicts and data scaling conflicts; it converts an attribute (or a set of attributes) of a multidatabase relation to another attribute by a conversion function. For example, the clause "CONVERT R1.(a1, a2, ..., an) TO attr BY transform" indicates that the value of attribute attr is evaluated by the function transform with R1.a1, R1.a2, ..., R1.an as its parameters. Functions needed for schema integration are implemented by the multidatabase administrator and stored in the data dictionary. The instances in the generalized virtual relation can be obtained by the following procedure:

(a) Materialize the subconcept relations specified in the GEN OF clause.
(b) Do the actions in the WHERE clause.
(c) Project the common attributes of the subconcept relations.
(d) For each resultant subconcept relation from (c), add the classifying attribute and its corresponding value specified in the DOMAIN AS clause to each tuple.
(e) Union the subconcept relations.

3. Defining the relationship between a real multidatabase relation and other multidatabase relations by generalization

    CREATE <real multidatabase relation identifier> AS GEN OF
        <multidatabase relation identifier> [, <multidatabase relation identifier>] ...

The statement builds, by generalization, the relationship between a real multidatabase relation and other multidatabase relations, which can be real or virtual.
Unlike the generalized virtual relation, the real multidatabase relation named in the statement is a multidatabase relation which has already been defined and corresponds to a relation in a component database. This real multidatabase relation is also called the superconcept relation of the multidatabase relations specified in the GEN OF clause, and the multidatabase relations specified in the GEN OF clause are called the subconcept relations of the real multidatabase relation.

4. Defining specialized virtual relations

    CREATE <multidatabase relation name> AS SPE OF
        <multidatabase relation identifier>, <multidatabase relation identifier>
        [, <multidatabase relation identifier>] ...
    [WHERE ( { RN <multidatabase relation identifier>.<attribute identifier> AS <attribute name>
             | CONVERT <multidatabase relation identifier>.(<attribute identifier> [, <attribute identifier>] ...)
               TO <attribute name> BY <function name> }
             [, { RN <multidatabase relation identifier>.<attribute identifier> AS <attribute name>
                | CONVERT <multidatabase relation identifier>.(<attribute identifier> [, <attribute identifier>] ...)
                  TO <attribute name> BY <function name> }] ... )]

The statement defines a multidatabase relation to be the specialization of some existing multidatabase relations. The multidatabase relation defined by this statement is a specialized virtual relation because it does not correspond to any local relation. The specialized virtual relation is called the subconcept relation of the multidatabase relations from which it is derived, and the multidatabase relations from which it is specialized are called the superconcept relations of the specialized virtual relation. The attributes of the specialized virtual relation are the union of the attributes of the superconcept relations. In a multidatabase system, the same real-world entity can be represented as instances in different databases. The identification of the same real-world entities from different databases is studied in [TC93].
For simplicity, we assume that instances represent the same real-world entity if and only if they have the same values for their common attributes. The instances in the specialized virtual relation can be obtained by the following procedure:

(a) Materialize the superconcept relations specified in the SPE OF clause.
(b) Do the actions in the WHERE clause.
(c) Natural join the superconcept relations.

5. Defining the relationship between a real multidatabase relation and other multidatabase relations by specialization

    CREATE <real multidatabase relation identifier> AS SPE OF
        <multidatabase relation identifier> [, <multidatabase relation identifier>] ...

The statement builds the relationship between a real multidatabase relation and other multidatabase relations by specialization. The real multidatabase relation is also called the subconcept relation of the multidatabase relations specified in the SPE OF clause, and the multidatabase relations specified in the SPE OF clause are called the superconcept relations of the real multidatabase relation.

2.2 A Schema Integration Example

Consider the multidatabase system in Figure 1. First, we define multidatabase relations as follows.

    CREATE CS_teacher FROM DB1.TEACHER
        WHERE (RN phone AS home_phone, RN salary AS yearly_salary)
    CREATE EE_teacher FROM DB2.TEACHER
        WHERE (RN phone AS office_phone, RN salary AS monthly_salary)
    CREATE NCTU_teacher FROM DB3.TEACHER
    CREATE CS_student FROM DB1.STUDENT
        WHERE (RN id AS NTHU_id)
    CREATE EE_student FROM DB2.STUDENT
        WHERE (RN id AS NTHU_id)
    CREATE NCTU_student FROM DB3.STUDENT
        WHERE (RN id AS NCTU_id)
    CREATE WORKER_student FROM DB3.WORKER_STUDENT
        WHERE (RN id AS NCTU_id)

Then, we define generalized virtual relations as follows.
    CREATE NTHU_teacher AS GEN OF CS_teacher, EE_teacher
        CLASSIFYING ATTRIBUTE (dept DOMAIN AS (CS, EE))
        WHERE (CONVERT CS_teacher.yearly_salary TO monthly_salary BY SAL)

The salary is recorded as a yearly salary in the relation CS_teacher and as a monthly salary in the relation EE_teacher. Assume we adopt the scale of monthly salary in the virtual relation NTHU_teacher. The SAL function is defined by

    SAL(e) = e / 12, where e ∈ domain(CS_teacher.yearly_salary).

    CREATE TEACHER AS GEN OF NTHU_teacher, NCTU_teacher
        CLASSIFYING ATTRIBUTE (school DOMAIN AS (NTHU, NCTU))
        WHERE (CONVERT NCTU_teacher.(zip,city,street) TO address BY ADDR)

The attribute address in NTHU_teacher corresponds to the aggregation of zip, city, and street in NCTU_teacher, such as "300 Hsinchu KuangFu." Assume we adopt the representation of the attribute address in the virtual relation TEACHER. The ADDR function is defined by

    ADDR(e1, e2, e3) = e1 + " " + e2 + " " + e3,

where the operator '+' represents concatenation.

    CREATE NTHU_student AS GEN OF CS_student, EE_student
        CLASSIFYING ATTRIBUTE (dept DOMAIN AS (CS, EE))
    CREATE STUDENT AS GEN OF NTHU_student, NCTU_student
        CLASSIFYING ATTRIBUTE (school DOMAIN AS (NTHU, NCTU))

Next, we define the relationship between NCTU_student and WORKER_student by generalization.

    CREATE NCTU_student AS GEN OF WORKER_student

The relationship can also be defined by specialization as follows.

    CREATE WORKER_student AS SPE OF NCTU_student

Finally, we define specialized virtual relations.
    CREATE CS&EE_teacher AS SPE OF CS_teacher, EE_teacher
        WHERE (RN CS_teacher.yearly_salary AS CS_yearly_salary,
               RN EE_teacher.monthly_salary AS EE_monthly_salary)
    CREATE NCTU&NTHU_teacher AS SPE OF NCTU_teacher, NTHU_teacher
        WHERE (RN NCTU_teacher.dept AS NCTU_dept,
               RN NTHU_teacher.dept AS NTHU_dept,
               RN NTHU_teacher.monthly_salary AS NTHU_monthly_salary,
               CONVERT NCTU_teacher.(zip,city,street) TO address BY ADDR)

Figure 2: Concept hierarchies. (The figure shows the TEACHER and STUDENT concept hierarchies, with the attributes of each relation defined above.)

The resultant concept hierarchies are shown in Figure 2. The mapping information for the schema integration is recorded in three tables: the multidatabase relation table, the generalized virtual relation table, and the specialized virtual relation table. The multidatabase relation table records the information for the multidatabase relations in the system. The information for generalized virtual relations and for specialized virtual relations is recorded in the generalized virtual relation table and the specialized virtual relation table, respectively. Figure 3 shows the mapping tables for the TEACHER concept hierarchy in Figure 2. The symbols Vgen and Vspe in the column from of the multidatabase relation table denote that the corresponding multidatabase relation is a generalized virtual relation or a specialized virtual relation, respectively.
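The materialization procedures of Section 2.1 can be illustrated with a small sketch applied to the teacher relations of this example. The Python code below is not the schema integration language or the authors' prototype; it models relations as lists of dictionaries, and the helper names (rename, convert, generalize, specialize) and the sample tuples are hypothetical. It assumes that CONVERT replaces its source attributes by the target attribute and that every relation is non-empty.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]
Relation = List[Row]


def rename(rel: Relation, mapping: Dict[str, str]) -> Relation:
    """RN clause: rename attributes according to an old -> new mapping."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rel]


def convert(rel: Relation, sources: Sequence[str], target: str,
            fn: Callable[..., object]) -> Relation:
    """CONVERT clause (assumption: source attributes are replaced by target)."""
    out = []
    for row in rel:
        new_row = {k: v for k, v in row.items() if k not in sources}
        new_row[target] = fn(*(row[s] for s in sources))
        out.append(new_row)
    return out


def dedupe(rel: Relation) -> Relation:
    """Remove duplicate tuples while keeping first-seen order."""
    seen, out = set(), []
    for row in rel:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def generalize(subs: Sequence[Relation], class_attr: str,
               class_values: Sequence[str]) -> Relation:
    """GEN OF, steps (c)-(e): project common attributes, tag tuples, union."""
    common = set.intersection(*(set(rel[0]) for rel in subs))
    out = []
    for rel, value in zip(subs, class_values):
        for row in rel:
            tagged = {k: row[k] for k in sorted(common)}
            tagged[class_attr] = value
            out.append(tagged)
    return dedupe(out)


def specialize(supers: Sequence[Relation]) -> Relation:
    """SPE OF, step (c): natural join of the superconcept relations."""
    result = supers[0]
    for rel in supers[1:]:
        result = [{**a, **b} for a in result for b in rel
                  if all(a[k] == b[k] for k in a.keys() & b.keys())]
    return dedupe(result)


# Hypothetical sample tuples for DB1.TEACHER and DB2.TEACHER.
db1_teacher = [
    {"name": "Lee", "address": "Hsinchu", "phone": "111", "salary": 1200000},
    {"name": "Wang", "address": "Taipei", "phone": "222", "salary": 960000},
]
db2_teacher = [
    {"name": "Lee", "address": "Hsinchu", "phone": "333",
     "birthday": "1955-05-05", "salary": 90000},
]

cs_teacher = rename(db1_teacher, {"phone": "home_phone", "salary": "yearly_salary"})
ee_teacher = rename(db2_teacher, {"phone": "office_phone", "salary": "monthly_salary"})

# NTHU_teacher: GEN OF CS_teacher, EE_teacher with SAL(e) = e / 12.
nthu_teacher = generalize(
    [convert(cs_teacher, ["yearly_salary"], "monthly_salary", lambda e: e / 12),
     ee_teacher],
    "dept", ["CS", "EE"])

# CS&EE_teacher: SPE OF CS_teacher, EE_teacher with the two salaries renamed.
cs_and_ee_teacher = specialize([
    rename(cs_teacher, {"yearly_salary": "CS_yearly_salary"}),
    rename(ee_teacher, {"monthly_salary": "EE_monthly_salary"}),
])
```

In this sketch the generalized relation keeps only name, address, and monthly_salary plus the classifying attribute dept, while the specialized relation contains one tuple for the teacher who appears in both departments, carrying the union of the attributes.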
3 Query Processing

In this section, we consider query processing based on the concept hierarchies.

3.1 Query Decomposition

The query processor handles a query by the following procedure.

Procedure P: Decompose a multidatabase query into local subqueries.

For each relation specified in the multidatabase query:

Step 1: Look up the multidatabase relation table.
    1. If the relation is a real relation which has not been materialized, materialize this relation using the information recorded in the columns from and renamed attribute.
    2. If the value of the column from for the relation is Vgen, then go to Step 2.
    3. If the value of the column from for the relation is Vspe, then go to Step 3.

Step 2: Look up the generalized virtual relation table. The relation is materialized by the following actions.
    1. For each subconcept relation of the generalized virtual relation, if it has not been materialized, then go to Step 1.
    2. Rename attributes and perform conversion functions on the subconcept relations according to the information recorded in the columns renamed attribute and conversion function, respectively.
    3. Project the common attributes of the subconcept relations.
    4. For each resultant subconcept relation from 3., add the classifying attribute and its corresponding value to each tuple.
    5. Union the subconcept relations.

Step 3: Look up the specialized virtual relation table. The relation is materialized by the following actions.
    1. For each superconcept relation of the specialized virtual relation, if it has not been materialized, then go to Step 1.
    2. Rename attributes and perform conversion functions on the superconcept relations according to the information recorded in the columns renamed attribute and conversion function, respectively.
    3. Natural join the superconcept relations.

Multidatabase relation table

    relation name     | from        | renamed attribute
    CS_teacher        | DB1.TEACHER | (phone, home_phone), (salary, yearly_salary)
    EE_teacher        | DB2.TEACHER | (phone, office_phone), (salary, monthly_salary)
    NCTU_teacher      | DB3.TEACHER |
    NTHU_teacher      | Vgen        |
    TEACHER           | Vgen        |
    CS&EE_teacher     | Vspe        |
    NCTU&NTHU_teacher | Vspe        |

Generalized virtual relation table

    relation name | subconcept relation | classifying attribute | classifying value | conversion function                           | renamed attribute
    NTHU_teacher  | CS_teacher          | dept                  | CS                | SAL(CS_teacher.yearly_salary)%monthly_salary  |
                  | EE_teacher          |                       | EE                |                                               |
    TEACHER       | NCTU_teacher        | school                | NCTU              | ADDR(NCTU_teacher.(zip,city,street))%address  |
                  | NTHU_teacher        |                       | NTHU              |                                               |

Specialized virtual relation table

    relation name     | superconcept relation | conversion function                          | renamed attribute
    CS&EE_teacher     | CS_teacher            |                                              | (CS_teacher.yearly_salary, CS_yearly_salary)
                      | EE_teacher            |                                              | (EE_teacher.monthly_salary, EE_monthly_salary)
    NCTU&NTHU_teacher | NCTU_teacher          | ADDR(NCTU_teacher.(zip,city,street))%address | (NCTU_teacher.dept, NCTU_dept)
                      | NTHU_teacher          |                                              | (NTHU_teacher.dept, NTHU_dept), (NTHU_teacher.monthly_salary, NTHU_monthly_salary)

Figure 3: The mapping information for the TEACHER concept hierarchy.

3.2 Query Transformation

In this subsection, we develop transformation rules for query optimization. The notation is as follows; $\pi$, $\sigma$, $\cup$, and $\bowtie$ denote projection, selection, union, and natural join.

$\rho_{a_1 \to b_1, \ldots, a_n \to b_n} R$ denotes that attributes $a_1, \ldots, a_n$ in the relation $R$ are renamed to $b_1, \ldots, b_n$, respectively.

$R \times_{attr} \{v\}$ denotes that the attribute $attr$ is added to relation $R$ and each tuple in $R$ is filled with the value $v$ for the attribute $attr$.

$\mathrm{FUN}_i(B_i)\%a_i$ represents a function named $\mathrm{FUN}_i$, with $B_i$ denoting the list of parameters for $\mathrm{FUN}_i$. The attribute holding the returned value of the function is named $a_i$.

$\Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k} R$ denotes that the functions $\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k$ are performed on the relation $R$.

For a function $\mathrm{FUN}_j(B_j)\%a_j$ with $|B_j| = 1$, if its inverse function exists, then the inverse function will be defined. The definition of the inverse function benefits query optimization. We assume that caching is used to store the returned values of function calls. Thus, if the number of distinct values of $B_i$ is $n$, then $\mathrm{FUN}_i(B_i)\%a_i$ needs to be computed only $n$ times. The technique of function caching has been shown to be useful for query optimization [HS93].

For simplicity, the selection predicate considered in a query is of the form $attr \; op \; C$, where $attr$ represents a relation attribute, $op$ denotes an operator such as ">", "<", or "=", and $C$ represents a constant. The associated attribute of a predicate is its $attr$ component. An associated attribute $a_i$ is called a private attribute of relation $R_k$, where $k = 1, 2$, if $a_i$ appears in $R_k$ and is not an attribute of the other relation. The transformation rules are described as follows.

Rule 1. Consider the query $Q = \pi_A(\rho_{a_1 \to b_1, \ldots, a_n \to b_n} R)$, where $A$ represents a set of attributes. For each $i$, $1 \le i \le n$, if $b_i \notin A$, then

\[ Q \equiv \pi_A(\rho_{a_1 \to b_1, \ldots, a_{i-1} \to b_{i-1}, a_{i+1} \to b_{i+1}, \ldots, a_n \to b_n} R). \]

By this rule, we can eliminate some unnecessary renaming operations.

Rule 2. Consider the query $Q = \sigma_P(R_1 \times_a \{v_1\} \cup R_2 \times_a \{v_2\})$, where $P$ represents a selection predicate.

\[ Q \equiv \begin{cases} R_1 \times_a \{v_1\} & \text{if } v_1 \neq v_2 \text{ and } P \text{ is "} a = v_1 \text{"} \\ R_2 \times_a \{v_2\} & \text{if } v_1 \neq v_2 \text{ and } P \text{ is "} a = v_2 \text{"} \\ (\sigma_P R_1) \times_a \{v_1\} \cup (\sigma_P R_2) \times_a \{v_2\} & \text{otherwise} \end{cases} \]

By this rule, the size of the relations to be unioned can be reduced by performing selections on the local relations. The union operation may be discarded in some special cases.

Rule 3. Consider the query $Q = \pi_A((\pi_{A_1} R_1) \times_a \{v_1\} \cup (\pi_{A_2} R_2) \times_a \{v_2\})$.

\[ Q \equiv \begin{cases} \pi_A R_1 \cup \pi_A R_2 & \text{if } a \notin A \\ (\pi_{A - \{a\}} R_1) \times_a \{v_1\} \cup (\pi_{A - \{a\}} R_2) \times_a \{v_2\} & \text{otherwise} \end{cases} \]

By this rule, we can reduce the size of the relations to be unioned by eliminating some unnecessary projection attributes. The operation $\times_a$ may be discarded in some special cases.

Rule 4. Consider the query $Q = \sigma_P(\Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_i(B_i)\%a_i, \ldots, \mathrm{FUN}_k(B_k)\%a_k} R)$.

1. If the associated attribute of $P$ is not in $\{a_1, \ldots, a_k\}$, then

\[ Q \equiv \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(\sigma_P R). \]

Because the associated attribute of $P$ does not depend on any function, we can perform the selection on $R$ before these functions are computed. Thus, the size of $R$ can be reduced, and so can the cost of computing these functions.

2. If the associated attribute of $P$ is $a_i$ and the inverse function of $\mathrm{FUN}_i(B_i)$ can be obtained, then we can derive the corresponding predicate $P'$ of $P$ based on the inverse function, where the associated attribute of $P'$ is $B_i$. Therefore

\[ Q \equiv \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(\sigma_{P'} R). \]

By this rule, we can reduce the cost of computing these functions. The reason is similar to that of case 1.

3. If the associated attribute of $P$ is $a_i$ and the inverse function of $\mathrm{FUN}_i(B_i)$ cannot be obtained, then

\[ Q \equiv \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_{i-1}(B_{i-1})\%a_{i-1}, \mathrm{FUN}_{i+1}(B_{i+1})\%a_{i+1}, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(\sigma_P(\Phi_{\mathrm{FUN}_i(B_i)\%a_i} R)). \]

Because the associated attribute of $P$ depends on the function $\mathrm{FUN}_i(B_i)$ and the inverse function of $\mathrm{FUN}_i(B_i)$ cannot be obtained, we can only perform the selection after the function $\mathrm{FUN}_i(B_i)$ is computed. The cost of computing the other functions is reduced because they are applied only to the tuples that remain after $\sigma_P(\Phi_{\mathrm{FUN}_i(B_i)\%a_i} R)$ is performed.

Rule 5. Consider the query

\[ Q = (\Phi_{\mathrm{FUN}_{r_1}(B_{r_1})\%a_{r_1}, \ldots, \mathrm{FUN}_{r_i}(B_{r_i})\%a_{r_i}, \ldots, \mathrm{FUN}_{r_k}(B_{r_k})\%a_{r_k}} R_1) \bowtie (\Phi_{\mathrm{FUN}_{s_1}(B_{s_1})\%a_{s_1}, \ldots, \mathrm{FUN}_{s_j}(B_{s_j})\%a_{s_j}} R_2), \]

where the join is the natural join. If $a_{r_i}$ is not a common attribute of the two join operands, then

\[ Q \equiv \Phi_{\mathrm{FUN}_{r_i}(B_{r_i})\%a_{r_i}}((\Phi_{\mathrm{FUN}_{r_1}(B_{r_1})\%a_{r_1}, \ldots, \mathrm{FUN}_{r_{i-1}}(B_{r_{i-1}})\%a_{r_{i-1}}, \mathrm{FUN}_{r_{i+1}}(B_{r_{i+1}})\%a_{r_{i+1}}, \ldots, \mathrm{FUN}_{r_k}(B_{r_k})\%a_{r_k}} R_1) \bowtie (\Phi_{\mathrm{FUN}_{s_1}(B_{s_1})\%a_{s_1}, \ldots, \mathrm{FUN}_{s_j}(B_{s_j})\%a_{s_j}} R_2)). \]

Since $a_{r_i}$ is not a common attribute, we can delay the execution of $\mathrm{FUN}_{r_i}(B_{r_i})\%a_{r_i}$ until the natural join is performed. After the natural join, the number of distinct values of $B_{r_i}$ may be reduced. Therefore, we can decrease the cost of computing $\mathrm{FUN}_{r_i}(B_{r_i})\%a_{r_i}$.

Rule 6. Consider the query $Q = \sigma_P(\Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_i(B_i)\%a_i, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(R_1 \bowtie R_2))$.

1. If the associated attribute $a_p$ of $P$ is not in $\{a_1, \ldots, a_k\}$, then

\[ Q \equiv \begin{cases} \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(\sigma_P R_1 \bowtie R_2) & \text{if } a_p \text{ is a private attribute of } R_1 \\ \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(R_1 \bowtie \sigma_P R_2) & \text{if } a_p \text{ is a private attribute of } R_2 \\ \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(\sigma_P R_1 \bowtie \sigma_P R_2) & \text{otherwise} \end{cases} \]

Because the associated attribute of $P$ does not depend on any function, we can reduce the size of the relations to be joined by performing selections on the local relations. In addition, the number of distinct values for each attribute can be reduced, which decreases the cost of computing the functions.

2. If the associated attribute of $P$ is $a_i$ and the inverse function of $\mathrm{FUN}_i(B_i)$ can be obtained, then we can derive the corresponding predicate $P'$ of $P$ based on the inverse function. Therefore

\[ Q \equiv \begin{cases} \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(\sigma_{P'} R_1 \bowtie R_2) & \text{if } B_i \subseteq \text{the attribute set of } R_1 \\ \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(R_1 \bowtie \sigma_{P'} R_2) & \text{if } B_i \subseteq \text{the attribute set of } R_2 \end{cases} \]

By this rule, we can reduce the cost of performing the join and the functions. The reason is similar to that of case 1.

3. If the associated attribute of $P$ is $a_i$ and the inverse function of $\mathrm{FUN}_i(B_i)$ cannot be obtained, then

\[ Q \equiv \begin{cases} \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_{i-1}(B_{i-1})\%a_{i-1}, \mathrm{FUN}_{i+1}(B_{i+1})\%a_{i+1}, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(\sigma_P(\Phi_{\mathrm{FUN}_i(B_i)\%a_i} R_1) \bowtie R_2) & \text{if } B_i \subseteq \text{the attribute set of } R_1 \\ \Phi_{\mathrm{FUN}_1(B_1)\%a_1, \ldots, \mathrm{FUN}_{i-1}(B_{i-1})\%a_{i-1}, \mathrm{FUN}_{i+1}(B_{i+1})\%a_{i+1}, \ldots, \mathrm{FUN}_k(B_k)\%a_k}(R_1 \bowtie \sigma_P(\Phi_{\mathrm{FUN}_i(B_i)\%a_i} R_2)) & \text{if } B_i \subseteq \text{the attribute set of } R_2 \end{cases} \]

Because the associated attribute of $P$ depends on the function $\mathrm{FUN}_i(B_i)$ and the inverse function of $\mathrm{FUN}_i(B_i)$ cannot be obtained, we can only perform the selection after the function $\mathrm{FUN}_i(B_i)$ is computed. The size of the relations to be joined is then reduced, and the cost of computing the other functions after the join is decreased.
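To make the effect of the rules concrete, the following Python sketch contrasts an unoptimized plan with the plan produced by Rule 4, case 2, for the conversion function SAL of Section 2.2, and also shows the branch pruning of Rule 2. It is only an illustration under stated assumptions: the function names, the sample tuples, and the use of functools.lru_cache to stand in for function caching are hypothetical and not taken from the prototype. The derived predicate P' relies on SAL being strictly increasing, so that a ">" comparison keeps its direction under the inverse function.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def sal(yearly_salary):
    """Conversion function SAL(e) = e / 12; the cache stores returned values."""
    return yearly_salary / 12


def sal_inverse(monthly_salary):
    """Inverse of SAL, used to derive the predicate P' from P."""
    return monthly_salary * 12


def apply_sal(rel):
    """SAL(yearly_salary)%monthly_salary applied to every tuple of rel."""
    return [{"name": r["name"], "monthly_salary": sal(r["yearly_salary"])}
            for r in rel]


def convert_then_select(rel, threshold):
    """Unoptimized plan: compute SAL for all tuples, then evaluate P."""
    return [r for r in apply_sal(rel) if r["monthly_salary"] > threshold]


def select_then_convert(rel, threshold):
    """Rule 4, case 2: evaluate P' on yearly_salary first, then compute SAL."""
    bound = sal_inverse(threshold)
    return apply_sal([r for r in rel if r["yearly_salary"] > bound])


def select_on_classifier(branches, class_attr, wanted):
    """Rule 2 for P: class_attr = wanted, assuming distinct classifying values.

    Only the matching branch is read and tagged; no union is performed.
    """
    return [{**row, class_attr: wanted} for row in branches.get(wanted, [])]


# Hypothetical sample tuples.
cs_teacher = [
    {"name": "Lee", "yearly_salary": 1200000},
    {"name": "Wang", "yearly_salary": 960000},
    {"name": "Chen", "yearly_salary": 1200000},
]

plain = convert_then_select(cs_teacher, 90000)
plain_computations = sal.cache_info().misses      # one per distinct argument

sal.cache_clear()                                 # reset cache and statistics
optimized = select_then_convert(cs_teacher, 90000)
optimized_computations = sal.cache_info().misses  # only surviving values

branches = {"CS": [{"name": "Lee"}], "EE": [{"name": "Wu"}]}
only_cs = select_on_classifier(branches, "dept", "CS")
```

Both plans return the same tuples, but the rewritten plan calls SAL only for the yearly salaries that satisfy P'. The cache miss counter gives the number of actual function computations, which corresponds to the number of distinct argument values discussed above.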
4 Conclusions and Future Work

Since relational DBMSs are widespread, it is significant to provide a relational global view through which relational applications, and users who are familiar only with the relational model, can access data in a multidatabase system. In this paper, we propose a methodology for integrating schemas into a relational global view. The schemas to be integrated are organized into concept hierarchies which capture the relationships among relations in different databases. Unlike prior work using the relational model for schema integration, the ideas of specialization and generalization are applied to enrich the relational view and to help the user issue queries in a standard relational query language such as SQL. We have studied query processing based on concept hierarchies. A procedure is designed to decompose a global query into local subqueries, and a set of transformation rules is developed to transform a query into an equivalent one for query optimization.

We have implemented a multidatabase prototype using the concept hierarchy approach at National Tsing Hua University. The update problem in the concept hierarchy is under investigation. We are also studying the query optimization issue of conversion functions in a multidatabase system. The conversion functions defined in the concept hierarchies may be time-consuming. Therefore, the execution order of joins and selections involving expensive conversion functions [HS93] in this environment needs to be considered further.

References

[ASD91] R. Ahmed, P.D. Smedt, W. Du, W. Kent, M.A. Ketabchi, W.A. Litwin, A. Rafii, and M.C. Shan, The Pegasus Heterogeneous Multidatabase System, IEEE COMPUTER, December (1991) pp. 19-27.
[BLN86] C. Batini, M. Lenzerini, and S.B. Navathe, A Comparative Analysis of Methodologies for Database Schema Integration, ACM Computing Surveys, 18 (4) (1986) pp. 323-364.
[BOT86] Y. Breitbart, P.L. Olson, and G.R. Thompson, Database Integration in a Distributed Heterogeneous Database System, IEEE Second International Conference on Data Engineering, (1986) pp. 301-310.
[Br90] Y. Breitbart, Multidatabase Interoperability, SIGMOD RECORD, 19 (3) (1990) pp. 53-60.
[CKK94] A.L.P. Chen, J.L. Koh, T.C.T. Kuo, and C.C. Liu, Schema Integration and Query Processing for Multiple Object Databases, Journal of Integrated Computer-Aided Engineering: Special Issue on Multidatabase and Interoperable Systems, Wiley Interscience (1994) (to appear).
[Ch90] C. Chung, DATAPLEX: An Access to Heterogeneous Distributed Databases, Communications of the ACM, 33 (1) (1990) pp. 70-80 (with corrigendum in Comm. ACM 33 (4) p. 459).
[CRE87] B. Czejdo, M. Rusinkiewicz, and D.W. Embley, An Approach to Schema Integration and Query Formulation in Federated Database Systems, IEEE Third International Conference on Data Engineering, (1987) pp. 477-484.
[DH84] U. Dayal and H.Y. Hwang, View Definition and Generalization for Database Integration in a Multidatabase System, IEEE Transactions on Software Engineering, 10 (6) (1984) pp. 628-644.
[DAT87] S.M. Deen, R.R. Amin, and M.C. Taylor, Data Integration in Distributed Databases, IEEE Transactions on Software Engineering, 13 (7) (1987) pp. 860-864.
[GN87] M. Genesereth and N. Nilsson, Logical Foundations of Artificial Intelligence, San Francisco, CA: Morgan Kaufmann, (1987).
[HS93] J.M. Hellerstein and M. Stonebraker, Predicate Migration: Optimizing Queries with Expensive Predicates, Proceedings of ACM SIGMOD, (1993) pp. 267-276.
[KDN92] M. Kaul, K. Drosten, and E.J. Neuhold, View System: Integrating Heterogeneous Information Bases by Object-Oriented Views, IEEE Sixth International Conference on Data Engineering, (1992) pp. 2-10.
[KC93] J.L. Koh and A.L.P. Chen, Integration of Heterogeneous Object Schemas, Proceedings of the 12th International Conference on Entity-Relationship Approach, (1993) pp. 289-300.
[LAZ89] W. Litwin, A. Abdellatif, A. Zeroual, and B. Nicolas, MSQL: A Multidatabase Language, Information Science, (1989) pp. 59-101.
[Mo87] A. Motro, Superviews: Virtual Integration of Multiple Databases, IEEE Transactions on Software Engineering, 13 (7) (1987) pp. 785-798.
[REC89] M. Rusinkiewicz, R. Elmasri, B. Czejdo, D. Georakopoulous, G. Karabatis, A. Jamoussi, L. Loa, and Y. Li, OMNIBASE: Design and Implementation of a Multidatabase System, Proceedings of the 1st Annual Symposium in Parallel and Distributed Processing, (1989) pp. 162-169.
[SLC88] A. Sheth, J. Larson, A. Cornelio, and S. Navathe, A Tool for Integrating Conceptual Schemas and User Views, IEEE Fourth International Conference on Data Engineering, (1988) pp. 176-183.
[Sh81] D.W. Shipman, The Functional Data Model and the Data Language DAPLEX, ACM Trans. Database Syst., 6 (1) (1981) pp. 140-173.
[SBD81] J.M. Smith, P.A. Bernstein, U. Dayal, N. Goodman, T. Landers, K.W.T. Lin, and E. Wong, Multibase: Integrating Heterogeneous Distributed Database Systems, Proceedings of AFIPS NCC, (1981) pp. 487-499.
[SPD92] S. Spaccapietra, C. Parent, and Y. Dupont, Model Independent Assertions for Integration of Heterogeneous Schemas, VLDB Journal, (1992) pp. 81-126.
[TC93] P.S.M. Tsai and A.L.P. Chen, Querying Uncertain Data in Heterogeneous Databases, Proceedings of IEEE Third International Workshop on Research Issues on Data Engineering: Interoperability in Multidatabase Systems, (1993) pp. 161-168.
