Searching Integrated Relational and Record-based Legacy Data Using Ontologies

L.L. Miller and Hsine-Jen Tsai, Department of Computer Science, Iowa State University, Ames, Iowa 50011
Sree Nilakanta, College of Business, Iowa State University, Ames, Iowa 50011
Mehdi Owrang, American University, Washington, D.C.

Abstract
Integration of data continues to be a problem. The number of databases available to a corporation continues to grow. Simply keeping track of the number and diversity of the attributes (fields) can be a difficult problem in large organizations. In this paper we define an ontology model for the domain that uses ontologies for object (entity) search over a set of integrated relational databases and record-based legacy systems. The integration process is based on a hypergraph model that makes use of the theory of universal relations. The design of the complete system model is given and a prototype of the model is briefly discussed.

1. Introduction
Managing the vast amounts of information in large computer networks presents a number of difficulties to system users and designers. New applications and databases are created on a regular basis to solve local problems as they arise. For large organizations, this means the number of databases can be staggering. We cannot expect users to know the terms needed to identify specific information from multiple data sources. Most databases are created and maintained by local groups and/or organizations that use software optimized for local transactions. Even if we assume that all databases use a standard hardware/software platform, language, and protocol, there is still the issue of conceptual heterogeneity.

Assisting users in obtaining an integrated view of information from heterogeneous distributed data sources continues to be an active research area. Among the research groups working on this problem, the use of an ontology [6,8,10,13,18,20,21] seems very appealing. Since the beginning of the nineties, ontologies have become a popular research topic investigated by several artificial intelligence research communities, including knowledge engineering, natural-language processing, and knowledge representation. More recently, the notion of ontologies has become widespread in fields such as intelligent information integration, information retrieval on the Internet, and knowledge management. Ontologies are popular in large part because of what they promise: a shared and common understanding of some domain that can be communicated across people and computers.

General ontologies have not been effective. Therefore, the best one can expect from an ontology is for it to be domain specific. However, for imprecise queries, the first problem is to take query terms and map them to database terms. Therefore, minimally we must modify the ontology to make it database specific. The Summary Schemas Model (SSM) [2,3,4] provides a way to link database terms to the ontology.

In spite of the large amount of research that has been done on integrating heterogeneous data sources, the problem continues to create difficulties for most organizations. In the present work we look at a subproblem of the general integration problem, namely the case where the data sources are controlled by one organization and consist of relational databases and record-based legacy systems. While this is a small part of the general problem, it covers a large number of applications that typical organizations are concerned with integrating.
Our contribution in this paper is the development of an ontology-based model that provides access to a distributed set of relational databases and record-based legacy systems through imprecise queries. A database specific ontology is integrated with a set of semantically disjoint universal relations over the set of data sources to provide access. The use of universal relations simplifies the connection between the ontology and the set of distributed data sources. For any request for semantically related data, there is a single universal relation capable of responding to the request. Specifically, we develop the notion of database specific weighted ontologies as a means of determining the required universal relation. The use of universal relations in this context is made possible by our data integration scheme. The integration scheme is based on the use of hypergraphs and the theory of relational databases. Such an approach provides the additional capability of testing the correctness of any query generated.

A brief overview of ontologies, the Summary Schemas Model (SSM), and integration issues is presented in Section 2. The overall model is presented in Section 3. In Section 4 we present our approach to ontologies and look at the issue of generating SSM tree fragments and database specific ontologies. Section 5 looks at the issues that make up our data integration scheme, and Sections 6 and 7 overview query generation and the Data/Query Manager. Section 8 overviews our current version of the feasibility prototype. Finally, we conclude by summarizing our results.

2. Background

2.1 Ontologies
The word "ontology" is borrowed from philosophy, in which it refers to the "subject of existence" [8]. It is the science of "what is". It discusses the structures of entities, the properties of entities, and the relations between entities. In short, it seeks to find an appropriate classification of entities. In the context of artificial intelligence, an ontology is a model of some portion of the world and is described by defining a set of representational terms [6]. A formal definition is "a formal, explicit specification of a shared conceptualization" [8]. "Conceptualization" refers to an abstract model of some phenomena in the world obtained by identifying the relevant concepts of those phenomena. So, an ontology is a description of concepts and the relationships between them.

The main motivation for an ontology is knowledge sharing and reuse [9,25]. In the field of information systems, different groups gather data using their own terminologies. When all of those data are integrated, a major problem that needs to be handled is terminological and conceptual incompatibility. This could be done on a case-by-case basis, but a solution based on a "consistent and unambiguous description of concepts and their potential relation" [19] is much better than a case-by-case one. In the Knowledge Sharing Effort (KSE) project [18], ontologies are put forward as a means to share knowledge bases between various knowledge-based systems.

A major challenge in using ontologies lies in how to build them, or what they should look like. Several groups have given solutions. They describe how ontologies should be constructed so that they contain the richest information in the least space and can be efficiently retrieved for use. A solution based on the definition of a "core library" has been proposed in [25]. More often, an ontology is considered as a taxonomic hierarchy of words with the "is-a" relation between them [9].
Techniques have also been proposed for modifying a poorly designed ontology into a better one [11]. In dealing with multidatabase systems, ontologies can be used effectively to organize keywords as well as database concepts by capturing the semantic relationships among keywords or among tables and fields in a relational database. By using these relationships, a network of concepts can be created to provide users with an abstract view of an information space for their domain of interest. Ontologies are well suited for knowledge sharing in a distributed environment where, if necessary, various ontologies can be integrated to form a global ontology. Database owners find ontologies useful because they form a basis for integrating separate databases through identification of logical connections or constraints between the information pieces. Ontologies can provide a simple conversational interface to existing databases and support extraction of information from them. Because of the distinctions made within an ontological structure, they have been used to support database cleaning, semantic database integration, consistency checking, and data mining [20].

An example of using ontologies in databases is Ontolingua [9]. Ontolingua is being built to enable databases (and the people and systems that interface with them) to share an ontology specific to the computer science and mathematics domains, with the intention of enabling data sharing and reuse. Another database-related example is the Cyc ontology, which has a knowledge base built on a core of approximately 400,000 hand-entered assertions (or rules) designed to capture a large portion of what we normally consider consensus knowledge about the world [14]. Cyc is partitioned into an Upper Cyc Ontology and the full Cyc Knowledge Base: the Upper Cyc Ontology contains 3,000 terms covering the most general concepts of human consensus reality, with literally millions of logical axioms about more specific concepts descending below them to populate the Cyc Knowledge Base. This foundation enables Cyc to address effectively a broad range of otherwise intractable software problems. The global ontology's objects, attributes, transitions, and relationships are accepted as forming the domain's universe.

2.2 Summary Schemas Model (SSM)
The SSM was first proposed by M. Bright et al. [2,3,4]. The SSM was designed to address the following issues [2,3,4]:
1. In a multidatabase system, users cannot be expected to remember voluminous specific access terms, so the global database should provide system aids for matching user requests to system data access.
2. Because of different local requirements, independent database designers are unlikely to use consistent terms in structuring data. The system must take responsibility for matching user requests to precise system access terms.
The SSM provides the following capabilities: it allows imprecise queries and automatically maps imprecise data references to the semantically closest system access terms. Note that the SSM deals with imprecision in database access terms rather than in data values within the database. The SSM uses a taxonomy of the English language that maintains synonym and hypernym/hyponym links between terms. Roget's original thesaurus provided just such a taxonomy and is the current basis for the SSM. Identifying semantic similarity is the first step in mapping local to global data representation.
The SSM creates an abstract view of the data available in local databases by forming a hierarchy of summary schemas. A database schema is a group of access terms that describe the structure and content of the data available in a database. A summary schema is a concise, although more abstract, description of the data available in a group of lower level schemas. In the SSM, schemas are summarized by mapping each access term to its hypernym. Hypernyms are semantically close to their hyponyms, so summary schemas retain most of the semantic content of the input schemas. The SSM trees structure the nodes of a multidatabase into a logical hierarchy. Each leaf node contributes a database schema, and each access term in a leaf schema is associated with an entry-level term in the system taxonomy. Once these terms have been linked to the taxonomy hierarchy, creating the summary schemas at the internal nodes is automatic. Each internal node maintains a summary schema representing the schemas of its children. Conceptually, only leaf nodes have participating DBMSs, while internal nodes are responsible for the summary schema structure and most of the SSM processing.

2.3 Integration
Bright et al. [1] define a multidatabase system as a system layer that allows global access to multiple, autonomous, heterogeneous, and preexisting local databases. This global layer provides full database functionality and interacts with the local DBMSs at their external user interface. Both the hardware and software intricacies of the different local systems are transparent to the user, and access to different local systems appears to the user as a single, uniform system. The term multidatabase includes federated databases, global schema multidatabases, multidatabase language systems, and homogeneous multidatabase language systems. Multidatabases inherit many of the problems associated with distributed databases, but must also contend with the autonomy and heterogeneity of the databases that they are trying to integrate. As the number of local systems and the degree of heterogeneity among these systems rise, the cost of integration increases.

There has been considerable research on multidatabase systems. A great deal of the work has been examined in [12]. This line of work has focused on the problem from the point of view of applying traditional database techniques to bridge the mismatch between the underlying data sources. Several researchers have explored the use of intelligent agents called mediators [26,27] as a means of bridging the mismatch between heterogeneous data sources. At present, there is no implemented system that offers the full range of functionality envisioned by Wiederhold in his paper [26]. Examples of projects that have been developed include HERMES at the University of Maryland [23], CoBase at UCLA [5], NCL at the University of Florida [22], and MIX at SDSC [15]. The advantage of such mediator-based systems is that to add a new data source it is only necessary to find the set of rules that define the new data source.

More recently, a number of researchers have started to look at XML-based data integration techniques as a way to attack the general data integration problem. The use of XML in the general data integration problem is especially interesting, as the semistructured format that XML supports allows one to manipulate a variety of data types. Beyond simply storing the data in XML format, data integration requires mechanisms to do the integration.
Zamboulis makes use of graph restructuring to accomplish the integration [30]. A number of groups have looked at XQuery as the basis of their approach to XML-based data integration [6,7,8,12]. The Tukwila Data Integration System provides a complete solution that involves not only integration, but activities like optimizing network performance as well [23]. In the next section we overview the complete model before examining the two principal components of our model in more detail.

3. Model Overview
The proposed model makes use of a database specific ontology and an integration scheme based on universal relations to support imprecise queries over a distributed set of relational databases and record-based legacy systems. Figure 3.1 illustrates the relationship between the objects used to construct the physical state of our model. The universal relations are used to provide a simple query interface to the set of distributed relational databases and record-based legacy systems. The Summary Schemas Model (SSM) tree fragments are used to convert a domain entity ontology into a database specific ontology. The result is that the model is capable of supporting imprecise requests. Once the terms used in the user's request are related to the appropriate database terms (i.e., attribute names), the model automatically generates a result relation and returns it to the user.

Figure 3.1. The physical state of the model: an entity ontology connected through SSM tree fragments to the universal relations defined over the relational databases and legacy systems.

Figure 3.2 looks at the model from the perspective of the processes that are required to enable the model. The components inside the dotted rectangle illustrate the relationship between the components of the model.

Figure 3.2. Block diagram of the proposed model.

The interactions between the components of the model are best illustrated by looking at the way that data flows within the model. The front end system passes the model a set of terms and conditions as a request (query). The controller passes the terms, including any terms in the conditions, to the Ontology Mediation Manager. The terms are used to search the ontology to find the universal relation(s) needed to generate the universal relation query that responds to the request. Terms that cannot be located in the database specific ontology are typically mediated with the user. There are multiple ways that this mediation could be implemented, depending on the nature of the front end. In our discussion (and prototype) we have assumed the use of a GUI to conduct this mediation as a visual process, but this would not be required. Locating the terms in the ontology identifies one or more universal relations that can be used to answer the request. In general, only one universal relation will be identified, because the universal relations are semantically disjoint. More details on this issue are discussed in Section 5. As a result, in the remainder of the paper we will assume that only one universal relation is required to produce a result for a given request. Based on the results of the ontology search, a universal relation query is generated. The universal relation query is passed to the Query Engine along with a request id. There it is converted into an integration query that makes use of the relations and legacy system records that define the universal relation's data space. The integration query is partitioned by the Data/Query Manager, and the resulting subqueries are sent to the appropriate data sources. The relations generated by the subqueries are returned to the Data/Query Manager, where they are merged, and the final result relation is sent back to the front end system.
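To make this flow concrete, the following sketch (in Java, the implementation language of the prototype described in Section 8) shows one possible decomposition of the components into interfaces and a controller that drives a request end to end. All of the type and method names (OntologyMediationManager, QueryEngine, DataQueryManager, and the request/result records) are illustrative assumptions, not the prototype's actual classes.

import java.util.List;

// Illustrative sketch of the request flow described above; all names are hypothetical.
interface OntologyMediationManager {
    // Map the user's terms to a universal relation and its attribute names,
    // mediating with the user when a term is not found in the ontology.
    UniversalRelationQuery resolve(List<String> searchTerms, String condition);
}

interface QueryEngine {
    // Convert a universal relation query into an integration query over the
    // relations and pseudo relations that define the universal relation.
    IntegrationQuery translate(UniversalRelationQuery query, String requestId);
}

interface DataQueryManager {
    // Partition the integration query by data source, run the subqueries,
    // and merge the intermediate results into the final result relation.
    ResultRelation execute(IntegrationQuery query);
}

record UniversalRelationQuery(String universalRelation, List<String> attributes, String condition) {}
record IntegrationQuery(String requestId, List<String> subqueries, String frameworkQuery) {}
record ResultRelation(List<String> attributes, List<List<String>> tuples) {}

final class Controller {
    private final OntologyMediationManager ontology;
    private final QueryEngine engine;
    private final DataQueryManager manager;

    Controller(OntologyMediationManager o, QueryEngine e, DataQueryManager m) {
        ontology = o; engine = e; manager = m;
    }

    // End-to-end handling of one imprecise request from the front end.
    ResultRelation handle(String requestId, List<String> terms, String condition) {
        UniversalRelationQuery urq = ontology.resolve(terms, condition);
        IntegrationQuery iq = engine.translate(urq, requestId);
        return manager.execute(iq);
    }
}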
In the next two sections, we take a more detailed look at the components of the model. Our approach to creating and searching database specific ontologies is examined in Section 4. An overview of our integration scheme is given in Section 5.

4. Ontology Design
Ontologies are in general domain specific. In an environment where one is trying to integrate a set of heterogeneous, distributed data sources, this means that it is necessary to make the ontology used to search the data sources database specific. For an ontology, this means that the attribute names used in the universal relations must be incorporated into the ontology.

4.1 Ontology Design
The focus in this section of the paper is moving from domain specific ontologies to database specific ontologies. We see ontologies as representing the entities (objects) in the domain in which the user of the integrated databases is working. The domain is represented by terms that define the problem area. Note that the user's problem and the available databases must come from the same domain in order for a solution to exist.

An ontology can be defined as a graph O = (T, E), where T is the set of terms used to represent the domain and E is the set of edges connecting the nodes representing the terms. Each term node can have properties assigned to it. In our ontology model there are four types of edges in E, namely, is-a, is-part-of, synonym, and antonym edges. Is-a and is-part-of edges are directed, while synonym and antonym edges have no direction. Let I(O) be the set of is-a edges in the ontology O. Then (T, I(O)) represents a directed acyclic graph (dag) with the more general terms higher in the dag and the more specific terms lower in the dag. As expected, synonym and antonym edges are used to connect terms with the same and opposite meanings, respectively.

To enhance the search operation, we add the notion of edge weights to create a weighted ontology. Let W be the set of weights such that wi is the weight of edge Ei. We use the weights to prune the search of the ontology. For the edges in I(O), the weights are used to estimate the relative closeness of the is-a relationship. A similar argument can be made for is-part-of edges. Going through a term like Physical Object would not be useful; to block the search, the weights assigned to the edges connected to such a term are set to large values. In our current ontology design, weights for is-a and is-part-of edges are integers. Note that the purpose of the weights is to reduce the number of questions that a user must be asked during the search. In meaningful queries there are likely to be several query terms. This, combined with the expected bushiness of the ontologies, gives rise to the possibility of an overwhelming number of questions if the user had to resolve all of the choices. The weights on the synonym and antonym edges range from zero to one, where one indicates an exact match for a synonym and an exact opposite for an antonym. Using weights on these edges allows us to show the degree of the match.
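As a concrete illustration of the ontology model just described, the following Java sketch shows one way the weighted ontology could be represented: term nodes carry a property set (used later to point at universal relations), and typed edges carry weights that are integer-valued for is-a and is-part-of edges and lie in [0,1] for synonym and antonym edges. The class and field names are our own and are not taken from the prototype.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of a weighted ontology graph; names are illustrative.
class Ontology {
    enum EdgeType { IS_A, IS_PART_OF, SYNONYM, ANTONYM }

    static class Edge {
        final TermNode from, to;
        final EdgeType type;
        final double weight;   // integer-valued for IS_A/IS_PART_OF, in [0,1] for SYNONYM/ANTONYM
        Edge(TermNode from, TermNode to, EdgeType type, double weight) {
            this.from = from; this.to = to; this.type = type; this.weight = weight;
        }
    }

    static class TermNode {
        final String term;
        // Properties of the term, e.g., pointers to the universal relations
        // (and attribute names) in which the term occurs as an attribute.
        final Map<String, String> properties = new HashMap<>();
        final List<Edge> edges = new ArrayList<>();
        TermNode(String term) { this.term = term; }
    }

    private final Map<String, TermNode> terms = new HashMap<>();

    TermNode term(String name) { return terms.computeIfAbsent(name, TermNode::new); }

    TermNode find(String name) { return terms.get(name); }

    // IS_A and IS_PART_OF edges are directed; SYNONYM and ANTONYM edges have no
    // direction and are therefore stored in both directions here.
    void addEdge(String from, String to, EdgeType type, double weight) {
        TermNode f = term(from), t = term(to);
        f.edges.add(new Edge(f, t, type, weight));
        if (type == EdgeType.SYNONYM || type == EdgeType.ANTONYM) {
            t.edges.add(new Edge(t, f, type, weight));
        }
    }
}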
A small example of a weighted ontology is shown in Figure 4.1.

Figure 4.1. A weighted ontology.

The method of generation of the weights depends on the builder of the ontology. The weights can be assigned by hand or generated automatically. We have generated the weights by hand in our current test sets, but we have designed an algorithm for generating the weights from metadata and domain documents.

4.2 Creating a database specific ontology
To move from a domain specific ontology to a database specific ontology, we make use of Summary Schemas Model (SSM) tree fragments. The process of creating a database specific ontology requires us to create SSM tree fragments that are relatively specific. The SSM tree fragments are constructed starting with the attribute names used in the schemas of the universal relations that are defined by the data source data. To successfully search a database specific ontology, it is critical that the SSM tree fragments do not generalize too far. If the root term of an SSM tree fragment is too general, the database terms will not be found by searches starting at meaningful domain terms.

To start the process of making an ontology database specific, we check the attribute names in the universal relations defined by the data sources to determine if they already exist as terms in the ontology. If a term exists, a pointer is added to the ontology term's property set to point to the universal relation in which the attribute is located. For the remaining universal relation attributes, the metadata of the databases is used to unify the attribute names into one or more SSM tree fragments. In particular, the definitions of the database fields named by the attribute names given in the metadata are used to determine related (i.e., unifiable) terms. The term that is used to unify a subset of the remaining universal relation attributes is then matched against the ontology terms. If it is found, the SSM tree fragment is attached to the ontology term. Weights are assigned by the individual expanding the ontology. If the root term of the new fragment is not in the ontology, the unification process asks the user for related terms and again checks the ontology. If no match exists, our algorithm looks to incorporate more universal relation attributes into the SSM tree fragment (i.e., to grow the SSM tree fragment). Our early attempts to completely automate the process have not been very promising, so we are currently using a human-aided approach. The metadata definitions and related documents are used to determine likely unification terms. This gives the human guiding the process the opportunity to choose a unifying term from an existing list. At each step, the root term of the SSM tree fragment is checked to see if it exists in the ontology. When all of the attribute names have been incorporated into the ontology in this manner, we say that the ontology is database specific. Figure 4.2 shows a block diagram of the database specific ontology.

Figure 4.2. Block diagram of the database specific ontology (entity ontology, SSM tree fragments, and universal relations).
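The sketch below outlines the first step of this process: each universal relation attribute is checked against the ontology and, when the attribute name already occurs as a term, a pointer to the universal relation is recorded in the term's property set. Attributes that are not found are returned so that SSM tree fragments can be built for them from the metadata with human assistance, as described above. The method and property-key names are illustrative only, reusing the hypothetical Ontology class sketched in Section 4.1.

import java.util.ArrayList;
import java.util.List;

// Sketch only: attach universal relation attributes to existing ontology terms.
class DatabaseSpecificOntologyBuilder {

    // For each universal relation attribute that already exists as an ontology
    // term, add a pointer (property) from the term to the universal relation.
    // Attributes with no matching term are returned; SSM tree fragments are
    // then built for them from the data source metadata.
    static List<String> linkAttributes(Ontology ontology,
                                       String universalRelation,
                                       List<String> attributes) {
        List<String> unresolved = new ArrayList<>();
        for (String attribute : attributes) {
            Ontology.TermNode node = ontology.find(attribute);
            if (node != null) {
                node.properties.put("universalRelation:" + universalRelation, attribute);
            } else {
                unresolved.add(attribute);   // candidate for SSM tree fragment construction
            }
        }
        return unresolved;
    }
}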
4.3 Search
The basic premise of our ontology search is to allow the user to give a set of search terms and proceed from the search terms to "nearby" database terms. Weights combined with user interaction are used to define what is meant by "nearby". To look at the search, we provide the set of basic rules used in the search; a sketch of the weighted traversal that the rules imply is given after the list.

Ontology Search Rules for is-a, synonym, and antonym edges:
1. A user creates a request by supplying a set of search terms. A search algorithm searches the database specific ontology to locate the search terms. If some of the search terms are not found in the ontology, the user is asked to refine the query terms.
2. Weights are used to block paths that are unlikely to provide useful results. As an example, an is-a edge from a very general term to a specific term (e.g., Apple in Figure 4.1) is unlikely to yield a useful "nearby" term. Weights are used in combination with user interaction to provide an effective search without overwhelming the user.
3. In a typical successful search, when no link to a universal relation is found at an original term node, the algorithm starts from that node by looking for synonym edges. If one is found, the weight is tested against the synonym threshold. If the weight is larger than the threshold, the search moves to the next node and continues. Since more than one synonym edge may be followed, the weights on the synonym edges are multiplied and the product is tested against the threshold. Whether more edges are followed from the individual nodes depends on whether we are looking for all "nearby" database terms or just one. If no synonym edge exists, then the is-a edges are used as indicated in rule 2.
4. For a NOT search, the algorithm starts from the query term in the ontology and looks for an antonym edge leaving the term node. If one exists, its weight is tested against the antonym threshold. If an appropriate antonym edge is found, the search moves to the new term node and a positive search (rule 3) is initiated from that point.
5. In all cases, if no "nearby" database term is found for a query term, the user is notified and asked to refine the query term.
6. When all query terms have been processed, the search algorithm returns the set of universal relations and attribute names that can be used to generate the required universal relation query.
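A minimal sketch of the weighted traversal implied by rules 2 and 3 is given below, reusing the hypothetical Ontology class from Section 4.1. It follows synonym edges as long as the product of their weights stays above a threshold and collects any universal relation pointers found along the way; the handling of is-a, is-part-of, and antonym edges, and the user mediation of rules 1, 4, and 5, follow the same pattern and are omitted here.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the synonym-edge portion of the ontology search (rule 3).
class OntologySearch {

    record State(Ontology.TermNode node, double product) {}

    // Returns the universal relation pointers reachable from startTerm through
    // synonym edges whose cumulative weight product stays above the threshold.
    static Map<String, String> nearbyDatabaseTerms(Ontology ontology,
                                                   String startTerm,
                                                   double synonymThreshold) {
        Map<String, String> found = new HashMap<>();
        Set<String> visited = new HashSet<>();
        Deque<State> frontier = new ArrayDeque<>();

        Ontology.TermNode start = ontology.find(startTerm);
        if (start == null) return found;          // rule 1: the user is asked to refine the term
        frontier.push(new State(start, 1.0));

        while (!frontier.isEmpty()) {
            State current = frontier.pop();
            if (!visited.add(current.node().term)) continue;

            // Collect any universal relation links stored on this term.
            for (Map.Entry<String, String> p : current.node().properties.entrySet()) {
                if (p.getKey().startsWith("universalRelation:")) {
                    found.put(p.getKey(), p.getValue());
                }
            }

            // Follow synonym edges; multiply weights and test against the threshold.
            for (Ontology.Edge e : current.node().edges) {
                if (e.type == Ontology.EdgeType.SYNONYM) {
                    double product = current.product() * e.weight;
                    if (product > synonymThreshold) {
                        frontier.push(new State(e.to, product));
                    }
                }
            }
        }
        return found;
    }
}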
5. Integration Scheme
While there has been a great deal of activity on integrating heterogeneous databases, important questions remain. To bridge this gap, we use an integration model designed to operate on a subset of the general integration problem, where the data sources are limited to relational databases and record-based legacy systems. Our approach takes advantage of the work on universal relation interfaces (URIs) [7,17]. The idea behind a URI is to provide a single relation view of a set of relations found in the same database. The set of relations should have sufficient semantic overlap that the single universal relation view is able to provide a semantically correct "view" of the data. In addition, a URI has to support the development of a correct query. The task of applying the earlier work on URIs to the integration of relational databases and record-based legacy systems has three basic steps:
1. Give the record-based legacy systems a relational structure, which we call a pseudo relation.
2. Group attributes so that only semantically equivalent attributes have the same name in the integrated environment.
3. Model each set of connected relations (defined in Section 5.3) as a universal relation.
The result of applying the three steps is a set of universal relations that are visible to any software interacting with the integration model. The number of universal relations will depend on the degree of overlap between the relations and pseudo relations. The next three subsections look at the three steps in more detail.

5.1 Defining Pseudo Relations
Our approach is to have the local data administrator of each record-based legacy system define the set of export "relation view(s)" (records) that he/she is willing to export into the integrated environment. This set can change over time. The local data administrator defines these "relation views" as a set of requests to the legacy system at the programmatic level (batch mode). Each "relation view" places a pseudo relation in the integrated environment. A pseudo relation is a set of tuples with each column named by a unique attribute name. A wrapper for the legacy system is then created that resides on the same platform as the legacy system. The wrapper is a static agent that interfaces with the integration model by exporting the required "relation view" as a set of tuples (i.e., a pseudo relation). To generate the pseudo relation, the view manager executes the appropriate request to the legacy system through the "relation views" defined by the local administrator. Figure 4.3 illustrates the relationship. Each application of retrieving data through a wrapper results in placing a pseudo relation in the integrated environment. Selection of rows in the resulting table can easily be implemented as part of the view manager.

Figure 4.3. Relationship between the wrapper and the legacy system.

5.2 Attribute Names
In any set of database relations and legacy systems there are likely to be problems with attribute names. In particular, one expects some instances of semantically equivalent attributes with different names and some cases of attributes with the same names but different meanings. We use the typical solution to this problem, i.e., we have the designer of the integrated system evaluate the existing name set by reviewing the metadata defined over the data sources. He/she can then rename attributes within the integrated system to remove the problem. For relational databases, this can be accomplished by using views. Views can also be used by the local database administrator as a means of controlling what data is exported into the integrated environment. Since the local data administrator of a legacy system is already defining a "relation view" in the integrated environment for each export schema, any required name changes can be handled at that level. The result is that we can look at the integrated environment as defining a set of attributes such that, if two attributes have the same semantics, they have the same name, and if two attributes have the same name, they have the same semantics. Another advantage of renaming the attributes in the proposed environment is that attribute names can be chosen to provide more semantic meaning. This results in easier SSM tree fragment construction.
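As an illustration of Sections 5.1 and 5.2, the sketch below shows a wrapper that exports one "relation view" of a fixed-width record file as a pseudo relation, renaming the legacy field names to the attribute names chosen for the integrated environment. The file layout, field positions, and all names are invented for the example; an actual wrapper would execute the batch request defined by the local data administrator against the legacy system.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of a legacy-system wrapper exporting a pseudo relation.
class LegacyWrapper {

    // Column names as chosen for the integrated environment (already renamed).
    static final String[] EXPORT_SCHEMA = {"employee_name", "department_id", "sales"};

    // Read a hypothetical fixed-width record file (positions are invented) and
    // return the tuples of the exported "relation view" as a pseudo relation.
    static List<String[]> exportRelationView(Path recordFile) throws IOException {
        List<String[]> pseudoRelation = new ArrayList<>();
        for (String record : Files.readAllLines(recordFile)) {
            if (record.length() < 40) continue;              // skip malformed records
            String name  = record.substring(0, 20).trim();   // legacy field EMP-NAME
            String dept  = record.substring(20, 26).trim();  // legacy field DEPT-NO
            String sales = record.substring(26, 40).trim();  // legacy field SALES-YTD
            pseudoRelation.add(new String[]{name, dept, sales});
        }
        return pseudoRelation;
    }
}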
5.3 Universal Relations
A universal relation u(U) is seen as a virtual relation u over a scheme U. We use U and attr(U) interchangeably to mean the attributes in the scheme U. The universal relation u can be defined over a set of relations {r1(R1), r2(R2), …, rn(Rn)}, where u = r1 ⋈ r2 ⋈ … ⋈ rn and attr(U) = attr(R1) ∪ attr(R2) ∪ … ∪ attr(Rn). The universal relations used in our integration model are restricted to being connected and maximal. A universal relation over a set of relations R = {R1, R2, …, Rn} is connected as long as it is not possible to partition R into two nonempty sets O1 and O2 such that attr(O1) ∩ attr(O2) = ∅. A universal relation u(U) is considered to be maximal if attr(U) is the maximum set of attributes such that u is connected. In the remainder of this presentation, we use the phrase universal relation to mean a maximal and connected universal relation. In the next subsection, we look at the basic aspects of our integration model.

5.4 Data Integration
The Ontology Mediation Manager (Figure 3.2) sees the data through the integration scheme as a set of disjoint universal relations. As such, it simply generates a universal relation SQL query of the form: Select attribute list From universal relation Where condition. The Ontology Mediation Manager tags the universal relation query with the request id from the front end system, supplemented by the controller to identify the front end and the user making the request. The task of the integration system is to:
1. Convert the universal relation query into a query over the relations that support the universal relation.
2. Ensure the correctness of the query.
3. Partition the query with respect to the data sources.
4. Query the individual data sources, combine the results into a final relation, and return it to the user.
The integration system is made up of two primary components, namely, a Query Engine and a Data/Query Manager (Figure 3.2). The Query Engine makes use of a hypergraph model of the set of relations that support the universal relation used in the universal relation query to generate the integration query and test its correctness. The Data/Query Manager receives the integration query from the Query Engine, partitions it with respect to the location of the data, sends the resulting subqueries to the appropriate data sources, and combines the results of the subqueries if there is more than one subquery. In the next two sections we look briefly at the underlying concepts of the Query Engine and the Data/Query Manager, respectively.

6. Query Generation and Correctness Overview
Hypergraphs play a critical role in our approach to integration. A hypergraph is a pair H = (N, E), where N is the set of vertices and E is the set of hyperedges, which are nonempty subsets of N. There is a natural correspondence between database schemes and hypergraphs. Consider the set of relation schemes R = {R1, R2, …, Rn}. We can define the set of attributes of R as attr(R) = R1 ∪ R2 ∪ … ∪ Rn. The hypergraph HR = (attr(R), R) can be seen as a hypergraph representation of the set of relations. Typically, the hypergraph has been used to represent the scheme of a single database, but there is no reason that we cannot use the more general interpretation of having it represent the scheme of the relations and pseudo relations that define the data in the integrated environment. Let L = {L1, L2, …, Lm} be the set of pseudo relations defined for the record-based legacy systems as described in Section 5.1. Let R = {R1, R2, …, Rn} be the set of relation schemes associated with the relational databases that exist within the integrated environment. If RENAME() is the process described in Section 5.2, then S = RENAME(L) ∪ RENAME(R) can be perceived as the relation set for the integrated environment.

Figure 5.1. An example hypergraph, its complete intersection graph, and the resulting ABFS tree.
We can then look at HI = (attr(S), S) as a hypergraph representation of the integrated environment. The hypergraph HI defines a set of one or more connected subhypergraphs. The precise number of connected subhypergraphs depends on the connectivity of the relations and pseudo relations in the integrated environment. Each connected subhypergraph, say Hu = (attr(U), U), where U is a subset of S and attr(U) ∩ attr(S − U) = ∅, provides the basis of one universal relation. Looking at the elements of S = {S1, S2, …, Sm+n}, where Si = RENAME(Li) for 1 ≤ i ≤ m and Sj+m = RENAME(Rj) for 1 ≤ j ≤ n, we assume that the Sk, 1 ≤ k ≤ m+n, define meaningful groupings of attributes within the integrated environment. Using the results of [7], we then have the join dependency ⋈[S] defined over the integrated environment. The importance of this is that we can apply the strategy used in our earlier work on universal relations [16,17] to check the correctness of any queries generated in the integrated environment.

To translate a universal relation query to an integration query, we must translate the request to the target data space (the hypergraph representing the collection of connected operational databases). Finally, the target query hypergraph needs to be mapped to an SQL query. To create the mapping, we convert the underlying hypergraph into a set of Adjusted Breadth First Search (ABFS) trees [17]. An ABFS tree is created by applying a variation of breadth first search to the complete intersection graph (CIG) defined by the underlying hypergraph model. An ABFS tree is created for each node (relation) in the CIG that contains attributes required in the SQL query. Each path from the root to a leaf of the ABFS tree defines a set of relations that can be joined. From this set of paths, we choose a subset that covers the attributes required in the query. The ABFS tree that requires joining the fewest relations is chosen to create the relation list in the new SQL query. Figure 5.1 illustrates a simple example of this process. The complete details of mapping the request to an SQL query are given in [17] and in Appendix A.

To ensure the correctness of the integration query, we need the join sequence to define a lossless join. Using the result from [7], the join dependency ⋈[U] is defined over the relations and pseudo relations that make up the universal relation used in the universal relation query being translated. The importance of this is that an FD-hinge of Hu defines a set of edges whose corresponding relations have a lossless join [16]. The test for correctness starts by testing whether the edges that correspond to the join sequence define an FD-hinge in Hu. Failing that, the set of edges is expanded to form an FD-hinge.
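To connect this discussion to the integration query shown in Example 1 below, the following sketch generates the relation list and the join conditions for a chosen join sequence by equating the attributes shared between each relation and the relations already placed in the sequence. It is a simplification: the actual Query Engine selects the join sequence from the ABFS trees and verifies it with the FD-hinge test (Appendices A and B); the class and method names are ours.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: build the relation list and join conditions for a join sequence.
class JoinConditionBuilder {

    // joinSequence maps each relation name to its attribute set, in join order.
    static String toSqlFromWhere(LinkedHashMap<String, Set<String>> joinSequence,
                                 String extraCondition) {
        List<String> relations = new ArrayList<>();
        List<String> conditions = new ArrayList<>();
        for (Map.Entry<String, Set<String>> current : joinSequence.entrySet()) {
            for (String attribute : current.getValue()) {
                // Join on any attribute shared with a relation already in the list.
                for (String earlier : relations) {
                    if (joinSequence.get(earlier).contains(attribute)) {
                        conditions.add(earlier + "." + attribute + "=" +
                                       current.getKey() + "." + attribute);
                        break;
                    }
                }
            }
            relations.add(current.getKey());
        }
        if (extraCondition != null && !extraCondition.isEmpty()) {
            conditions.add(extraCondition);
        }
        return "From " + String.join(",", relations) +
               " Where " + String.join(" and ", conditions);
    }
}

With the data layout of Example 1 (R1(A,B,C), R2(C,D,E), R3(E,F,G)) and the extra condition F=10, the sketch produces "From R1,R2,R3 Where R1.C=R2.C and R2.E=R3.E and F=10", the From and Where clauses of the integration query shown there.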
7. Data/Query Manager Overview
The first task of the Data/Query Manager is to partition the integration query generated by the Query Engine into subqueries with respect to the location of the relations/pseudo relations involved in the query. Once the integration query has been partitioned, the resulting subqueries are sent to the appropriate data sources. Example 1 provides a simple example of the partition process.

Example 1: Example of query partition using SQL syntax.
Data layout:
  Site 1 tables: R1(A,B,C), R2(C,D,E)
  Site 2 tables: R3(E,F,G)
Universal Relation Query:
  Select G,B Where F=10
Integration Query:
  Select G,B From R1,R2,R3 Where R1.C=R2.C and R2.E=R3.E and F=10
Partition results:
  Query for Site 1 (Q1): Select B, E From R1, R2 Where R1.C=R2.C
  Query for Site 2 (Q2): Select E, G From R3 Where F=10
  Request Framework Query: Select G, B From Q1, Q2 Where Q1.E = Q2.E

The Data/Query Manager retains the Request Framework Query so that it can combine the results when two or more subqueries are needed. Assuming that Id1 is the request identifier for the universal relation query, Site1 and Site2 are the site locations, and Q1 and Q2 are the subquery identifiers for the two subqueries in Example 1, Example 2 illustrates the strings used by the Data/Query Manager to represent the subqueries and the Request Framework Query.

Example 2: The query strings for the result given in Example 1:
SubQuery Queue:
  "Select B, E From R1, R2 Where R1.C = R2.C":<Id1,Q1,Site1>
  "Select E, G From R3 Where F = 10":<Id1,Q2,Site2>
Request Framework Query Queue:
  "Select G, B From Q1, Q2 Where Q1.E = Q2.E":<Id1>

The results of the subqueries are placed in a temporary database at the site of the Data/Query Manager. When the results from all of the subqueries have been returned and stored in the local database, the Request Framework Query is used to combine the intermediate results before returning the final result relation to the front end system.

8. Prototype
A prototype was implemented to test the feasibility of our approach. The prototype was implemented in Java, developed on the Red Hat Linux platform, and tested on the Windows platform. Figure 7.1 illustrates a block diagram of the prototype. It is made up of four primary components: the User Interface, the Ontology Search Manager, the Query Engine, and the Data/Query Manager. The functionality of the Ontology Mediation Manager has been incorporated into the User Interface in the current version of the prototype.

Figure 7.1. Block diagram of the prototype (User Interface, Ontology Search Manager, Query Engine, and Data/Query Manager).

The User Interface allows a user to enter a set of domain search terms and a condition. The beginning screen with an example in progress is shown in Figure 7.2. When the user is satisfied with what has been entered, he/she clicks on the Start Request button, and the Ontology Search System performs the search described in Section 4.3. The ontology is searched for the domain terms provided by the user.

Figure 7.2. The beginning screen of the Ontology Aided Search Environment, with the domain search terms "name, department, sales" and the condition location = 'US' entered.

If all of the domain search terms are found in the ontology, the database information found through the SSM fragments is returned to the user interface module. The user is notified of a successful ontology search with the screen shown in Figure 7.3. The user has the option to see the SQL query that has been constructed, see the results of the query on screen, or restart the query process. Note that the motivation for the prototype has been to test our underlying systems and not to develop a full-featured user interface.

Figure 7.3. Successful ontology search screen, showing the domain search terms used in the search, the current condition, and buttons to view the query, view the results on screen, or restart the request.
The discussion above assumes that the domain terms the user entered were in the ontology used by the system. When the ontology search does not find all of the domain search terms, the system creates a screen showing the terms that cannot be found. Two conditions exist: either the ontology search found a term that appears to be close, or no term(s) can be found. In the first case, the system returns the fragment of the ontology that it thinks may be relevant. The user can choose one of the terms shown in the ontology fragment or enter another term. Figure 7.4 shows an example of the case where a fragment of the ontology is presented to the user. The screen illustrates how the system prototype engages the user to help out the ontology search. The example shows three is-a relationships, with "country" being the likely choice for the user. The "geographic feature" node represents an is-closely-related relationship. Note that neither the type of the arc nor the weights are shown at this point. We are hoping to get the user's interpretation without biasing the user's choice.

Figure 7.4. Screen showing user/ontology interaction: the user's help is required to complete the search for the domain search term "location", and the terms assumed to be related are geographic features, geographic location, country, state, and city. The user may choose one of these terms or enter a new search term.

In the case that no terms are considered close to the domain search term, the user is asked to enter a new domain search term with the same meaning. When the ontology search (with the user's assistance) resolves the search terms to database terms, the information is passed to the Query Generation System, where an SQL query is generated. The Query Generation System tests the correctness of the generated query [17]. If the Query Generation System was called through the View Query button, the SQL query is shown. Again, since we are in test mode, we have chosen to show the full SQL query with the tables or pseudo tables from the distributed data sources as though they were in the same database. In a commercial package, more options for how to show the query would have to be considered.

When the user clicks on the Results on Screen button, the query information is passed to the Data Integration System. There the query is partitioned into queries for the individual relational databases or legacy systems, and the subqueries are sent to the individual data source sites. The results of the individual queries are sent back to the Data Integration System, and the query results for the user's query are prepared using the original query. The subqueries are used in the prototype to define and spawn a set of mobile agents. The agents are sent to the sites that contain the relevant data. Each agent carries one of the SQL queries. The data returned by the agents is combined to produce the required result. The result of the request is then returned to the user and displayed on the screen. The choice of mobile agents is not critical to the model, but rather represents a method for quickly generating the necessary infrastructure. Client-server models using SOAP, CORBA, or Java JDBC connections could also be used. We have used all four types of connections in related projects.

9. Conclusions
A model that uses domain specific ontologies, converting them to database specific ontologies to aid in the interpretation of a user's query, has been given.
The model allows users to define both domain specific search terms and domain specific functions to operate on the results of the query. The model was built on an integrated database/legacy system environment. Our data integration scheme provides a universal relation view of the distributed data sources. A prototype to test the feasibility of the ontology and data integration model has been designed and implemented. The prototype takes the user input and generates SQL queries for the relational databases/legacy systems over which the ontology search operates.

10. References
1. Bright, M.W., A.R. Hurson and S.H. Pakzad. A taxonomy and current issues in multidatabase systems. IEEE Computer, Vol. 25, No. 3, 1992, pages 50-60.
2. Bright, M.W. and A. Hurson, "Summary Schemas in multidatabase systems", Computer Engineering Technical Report, Penn State, 1990.
3. Bright, M.W., A. Hurson, S. Pakzad, and H. Sarma, "The Summary Schemas Model – An approach for handling Multidatabases: Concept and Performance Analysis", Multidatabase Systems: An Advanced Solution for Global Information Sharing, pp. 199, 1994.
4. Bright, M.W. and A. Hurson, "Automated Resolution of Semantic Heterogeneity in Multidatabases", ACM Transactions on Database Systems, 19(2), pp. 213, 1994.
5. Chu, W.W., H. Yang, K. Chiang, M. Minock, G. Chow and C. Larson. CoBase: A Scalable and Extensible Cooperative Information System, Journal of Intelligent Information Systems, Vol. 6, No. 2/3, 1996, pp. 223.
6. Corazzon, Raul, ed. "Descriptive and Formal Ontology", http://www.formalontology.it.
7. Fagin, R., A.O. Mendelzon, and J.D. Ullman. A simplified universal relation assumption and its properties. ACM Transactions on Database Systems, Vol. 7, 1982, pages 343-360.
[6] Gardarin, Georges, Antoine Mensch, Anthony Tomasic: An Introduction to the e-XML Data Integration Suite. EDBT 2002: 297-306.
[7] Gardarin, Georges, Fei Sha, Tuyet-Tram Dang-Ngoc: XML-based Components for Federating Multiple Heterogeneous Data Sources. ER 1999: 506-519.
[8] Gardarin, Georges, Antoine Mensch, Tuyet-Tram Dang-Ngoc, L. Smit: Integrating Heterogeneous Data Sources with XML and XQuery. DEXA Workshops 2002: 839-846.
8. Gruber, T. "A translation approach to portable ontologies", Knowledge Acquisition, 5(2), pp. 199-220, 1993.
9. Gruber, T. "Toward Principles for the Design of Ontologies Used for Knowledge Sharing", ed. N. Guarino, International Workshop on Formal Ontology, Padova, Italy, 1993.
10. Guarino, N., "Formal Ontology, Conceptual Analysis and Knowledge Representation", International Journal of Human and Computer Studies, special issue on The Role of Formal Ontology in the Information Technology, edited by N. Guarino and R. Poli, Vol. 43, No. 5/6, 1995.
11. Guarino, N. and C. Welty, "Ontological Analysis of Taxonomic Relationships", in A. Laender, V. Storey, eds., Proceedings of ER-2000: The 19th International Conference on Conceptual Modeling, October 2000.
12. Hurson, A., M. Bright, S. Pakzad (eds.): Multidatabase Systems – An Advanced Solution for Global Information Sharing. IEEE Computer Society Press, 1994.
13. Karp, Peter D., Vinay K. Chaudhri and Jerome Thomere, "XOL: An XML-Based Ontology Exchange Language", http://www.oasis-open.org/cover/xol-03.html.
[12] Lehti, Patrick, Peter Fankhauser: XML Data Integration with OWL: Experiences and Challenges. SAINT 2004: 160-170.
14. Lenat, D.B. "Welcome to the Upper Cyc Ontology", http://www.cyc.com/overview.html, 1996.
15. Ludäscher, B., Y. Papakonstantinou, P. Velikhov. A Framework for Navigation-Driven Lazy Mediators.
ACM Workshop on the Web and Databases, 1999.
16. Miller, L.L. Generating hinges from arbitrary subhypergraphs. Information Processing Letters, Vol. 41, No. 6, 1992, pages 307-312.
17. Miller, L.L. and M.M. Owrang. A dynamic approach for finding the join sequence in a universal relation interface. Journal of Integrated Computer-Aided Engineering, No. 4, 1997, pages 310-318.
18. Neches, R., Fikes, R.E., Finin, T., Gruber, T.R., Senator, T., Swartout, W.R., "Enabling technology for knowledge sharing", AI Magazine, 12(3), pp. 36-56, 1991.
19. Schulze-Kremer, S. "Adding Semantics to Genome Databases: Towards an Ontology for Molecular Biology", Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, T. Gaasterland et al. (eds.), Halkidiki, Greece, June 1997.
20. Slattery, N.J., "A Study of Ontology and Its Uses in Information Technology Systems", http://www.mitre.org/support/papers/swee/papers/slattery/.
21. Sowa, John. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole, Pacific Grove, CA, 2000.
22. Su, S.Y.W., H.L. Yu, J.A. Arroyo-Figueroa, Z. Yang and S. Lee. NCL: A Common Language for Achieving Rule-Based Interoperability Among Heterogeneous Systems, Journal of Intelligent Information Systems, Vol. 6, No. 2/3, 1996, pp. 171-198.
23. Subrahmanian, V.S., Sibel Adali, Anne Brink, Ross Emery, James J. Lu, Adil Rajput, Timothy J. Rogers, Robert Ross, Charles Ward. HERMES: Heterogeneous Reasoning and Mediator System, http://www.cs.umd.edu/projects/hermes/publications/abstracts/hermes.html.
24. Swartout, W.R., P. Patil, K. Knight, and T. Russ, "Toward Distributed Use of Large-Scale Ontologies", in Proceedings of the 10th Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Canada, 1996.
[23] Tukwila Data Integration System. University of Washington. http://data.cs.washington.edu/integration/tukwila. Accessed 10/5/2004.
25. van Heijst, G., A. Schreiber, B. Wielinga. "Using explicit ontologies in KBS development", International Journal of Human-Computer Studies, Vol. 46, No. 2/3, pp. 183-292, Feb. 1997.
26. Wiederhold, G. Mediators in the Architecture of Future Information Systems, IEEE Computer, Vol. 25, No. 3, 1992, pp. 38-49.
27. Wiederhold, G. and M. Genesereth. The Conceptual Basis for Mediation Services, IEEE Expert, Vol. 12, No. 5, 1997, pp. 38-47.
[30] Zamboulis, L., XML Data Integration By Graph Restructuring, Proc. BNCOD 21, Edinburgh, July 2004. Springer-Verlag, LNCS 3112, pp. 57-71.

Appendix A. Query Generation
To create a query, we must translate the request to the target data space (the hypergraph representing the collection of connected operational databases). Finally, the target query hypergraph is mapped to an SQL query. To look at this process in more detail, we consider the basic data structures and algorithms. We start by looking at the notion of a complete intersection graph.

Complete Intersection Graph (CIG)
Let H = (U, R) be a hypergraph, where U = {A1, A2, ..., An} is a set of attributes and R = {R1, R2, ..., Rp} is a set of relation schemes over U. The complete intersection graph (CIG) [17] is an undirected graph (R, E) where E = {(Ri, Rj) : Ri ∩ Rj ≠ ∅, Ri ∈ R, Rj ∈ R, i ≠ j}. Note that the edge (Ri, Rj) between vertices (or nodes) Ri and Rj exists if and only if Ri and Rj have at least one attribute in common. The edge (Ri, Rj) is labeled with Rij, where Rij = Ri ∩ Rj. An example of a hypergraph and its complete intersection graph is shown in Figure A.1.
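A minimal sketch of constructing the complete intersection graph from a hypergraph follows. Relation schemes are represented simply as named attribute sets, and an edge is recorded whenever two schemes share at least one attribute, labeled with the shared attributes. The class and record names are ours.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: build the complete intersection graph (CIG) of a hypergraph.
class CigBuilder {

    record Scheme(String name, Set<String> attributes) {}
    record CigEdge(Scheme left, Scheme right, Set<String> sharedAttributes) {}

    static List<CigEdge> buildCig(List<Scheme> schemes) {
        List<CigEdge> edges = new ArrayList<>();
        for (int i = 0; i < schemes.size(); i++) {
            for (int j = i + 1; j < schemes.size(); j++) {
                Set<String> shared = new HashSet<>(schemes.get(i).attributes());
                shared.retainAll(schemes.get(j).attributes());
                if (!shared.isEmpty()) {
                    edges.add(new CigEdge(schemes.get(i), schemes.get(j), shared));
                }
            }
        }
        return edges;
    }
}

For example, for a hypergraph with hyperedges ABC, CDE, AEF, and BF, the sketch yields the CIG edges (ABC,CDE), (ABC,AEF), (ABC,BF), (CDE,AEF), and (AEF,BF), labeled {C}, {A}, {B}, {E}, and {F}, respectively.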
Adjusted Breadth First Search (ABFS)
The adjusted breadth first search (ABFS) [17] is a variation of the breadth first search (BFS) used to determine the join sequence for a target hypergraph. ABFS supplements BFS by including a path label for each node and an adjustment set in the search tree so that the search is more efficient. The resulting search tree is called an ABFS tree [17].

Figure A.1. A hypergraph and its complete intersection graph (CIG).

The node from which the search is started is called the root of the ABFS tree. A sample ABFS tree is shown in Figure A.2. The path label [17] of an ABFS tree node is the union of all query attributes on that ABFS tree node and its ancestors on the search path, so the path label of an ABFS tree node is a superset of its parent's path label. In the process of creating an ABFS tree, the path labels are used to prune or delay the expansion of subtrees where the unused nodes that are adjacent to the current endpoint of the search path do not contribute any new query attributes to the path label. Any nodes falling into this class are stored in the adjustment set [17] (denoted by ASet) with a pointer to the position where they could be added to the ABFS tree during further search or expansion. The relevant CIG can be used to determine which nodes are adjacent to the current endpoint of the search path. The expansion of the ABFS tree continues until the union of the path labels of all the leaves in the current ABFS tree contains all the query attributes. If the ABFS tree cannot be expanded and the union of the path labels of all the leaves in the current ABFS tree does not contain all the query attributes, then a node can be taken from the adjustment set and the process can be restarted from the position pointed to by this node. Note that this process of creating an ABFS tree terminates successfully in a finite number of steps, since all the query attributes are in the hypergraph and can be reached eventually. In addition, using the above approach, many different ABFS trees with the same root may be generated. This is because the order of the search is not unique. Also, there is more than one way (such as FIFO, LIFO, or random) to select nodes from the adjustment set.

Join Sequence
Finding an optimal join sequence for the selected query attributes (including the attributes appearing in the query condition) is a crucial part of the model design and implementation.

Figure A.2. Adjusted BFS tree using the CIG of Figure A.1 with root ABC and the query attributes ABF.

Once the ABFS tree with a given root is created, we can determine the join sequence defined by this tree. The approach is to select a set of paths connected to the root such that the union of the path labels contains all of the desired query attributes. We use the following procedure to select the appropriate paths [17]:
<Step 1.0> Set W := the set of query attributes. Go to <Step 1.1>.
<Step 1.1> Mark every leaf and its ancestors if its path label has a query attribute that appears only once in the path labels of all leaves in the ABFS tree. Remove the query attributes included in the path labels of the marked nodes from W. If W is empty, stop; otherwise, go to <Step 1.2>.
<Step 1.2> If there is a contributing query attribute in more than one path label of the unmarked leaves with the same parent, then mark one (and only one) of those leaves and its ancestors.
Remove the query attributes in the path labels of the marked nodes from W. If W is empty, stop; otherwise, go to <Step 1.3>. (By a contributing query attribute we mean a query attribute that occurs in the path label of a leaf but does not occur in its parent's path label.)
<Step 1.3> If there is a leaf which contains a remaining query attribute in W with the lowest frequency, then mark this leaf and its ancestors. In case of a tie, choose the leaf with the shortest path and mark the nodes on this path.

It is worth noting that the approach described in the previous subsection does not guarantee creation of the optimal ABFS tree with a given root, since the order of search in that approach is not necessarily optimal. The so-called optimal ABFS tree with a given root is the one with the minimum weight over all possible ABFS trees for this root. By the weight of an ABFS tree we mean the length of the join sequence defined by the ABFS tree. The creation of a non-optimal ABFS tree for some given root does not cause serious problems. On one hand, our goal is to generate an optimal join sequence, which is the one with the minimum weight over the optimal ABFS trees for all roots, and the probability of creating non-optimal ABFS trees for all roots is very low. On the other hand, one can remove a redundant join from the resulting join sequence at a later stage. Another point worth noting is that we do not have to generate ABFS trees for all roots. We need only generate ABFS trees for the so-called legal roots. A root is called illegal if it does not contain any query attribute or if its set of query attributes is properly contained in the set of query attributes of an adjacent node. The algorithm to find a join sequence for a given hypergraph and a set of query attributes is summarized as follows:
<Step 2.0> Create the CIG for the hypergraph. Find the set LR of all legal roots in the CIG. Set minweight := the number of nodes in the CIG. Go to <Step 2.1>.
<Step 2.1> If LR is empty, then stop. Otherwise, choose a root r ∈ LR, set LR := LR − {r}, and go to <Step 2.2>.
<Step 2.2> Create an ABFS tree with root r. Find the weight and the corresponding join sequence for this tree. If the weight is smaller than minweight, then save this join sequence as the current best one, and replace minweight with the weight. Go to <Step 2.1>.
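The sketch below illustrates the legal-root test and the outer loop of Steps 2.0-2.2, reusing the hypothetical CigBuilder types above. The construction of the ABFS tree itself and the weight computation are left behind an interface, since they follow the longer procedure described in the text; all names are illustrative.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of Steps 2.0-2.2: enumerate legal roots and keep the best join sequence.
class JoinSequenceSearch {

    interface AbfsTreeBuilder {
        // Build an ABFS tree rooted at root and return the join sequence it defines
        // (the list of relation schemes to be joined), or null if none exists.
        List<CigBuilder.Scheme> joinSequence(CigBuilder.Scheme root,
                                             List<CigBuilder.Scheme> schemes,
                                             Set<String> queryAttributes);
    }

    // A root is illegal if it contains no query attribute, or if its query
    // attributes are properly contained in those of an adjacent CIG node.
    static boolean isLegalRoot(CigBuilder.Scheme candidate,
                               List<CigBuilder.CigEdge> cig,
                               Set<String> queryAttributes) {
        Set<String> own = new HashSet<>(candidate.attributes());
        own.retainAll(queryAttributes);
        if (own.isEmpty()) return false;
        for (CigBuilder.CigEdge e : cig) {
            CigBuilder.Scheme other =
                e.left().equals(candidate) ? e.right() :
                e.right().equals(candidate) ? e.left() : null;
            if (other == null) continue;
            Set<String> adjacent = new HashSet<>(other.attributes());
            adjacent.retainAll(queryAttributes);
            if (adjacent.containsAll(own) && adjacent.size() > own.size()) return false;
        }
        return true;
    }

    static List<CigBuilder.Scheme> bestJoinSequence(List<CigBuilder.Scheme> schemes,
                                                    Set<String> queryAttributes,
                                                    AbfsTreeBuilder builder) {
        List<CigBuilder.CigEdge> cig = CigBuilder.buildCig(schemes);
        List<CigBuilder.Scheme> best = null;
        for (CigBuilder.Scheme root : schemes) {
            if (!isLegalRoot(root, cig, queryAttributes)) continue;
            List<CigBuilder.Scheme> sequence =
                builder.joinSequence(root, schemes, queryAttributes);
            if (sequence != null && (best == null || sequence.size() < best.size())) {
                best = sequence;                 // fewer joins means a smaller weight
            }
        }
        return best;
    }
}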
Then E’ has the bridge-property if and only if for every i = 1, 2, …, p there exists Ei E’ such that (attr(E’)NiEi, where Ni = attr(i), Ei is called a separating edge of E’ corresponding to i. A nontrivial complete subset E’ of E with the bridge property is call a hinge of H. An example of a hinge is shown in Figure B.1. Note that {E2,E3,E4} is not a hinge. Let F be a set of functional dependencies (fds). Let TE’ be the tableau defined over the attributes in E for the schemes represented by the edges in E’ E, the chaseF(TE’) is the result of using the fds in F to chase the tableau TE’. Now let E* be the set defined by chaseF(E’) such that E* = {Si|if wi(A) is a distinguished variable and wi 32 chaseF(TE’), then ASi}. In other words each element in E* corresponds to a row in the tableau chaseF(TE’) and consists of the attributes that have distinguished values in the row. Note that by the definition of the chase algorithm each element of E* is a super set of the corresponding element in E’ that was used to initially define the row in the tableau. Construct the hypergraph HE*,F = (attr(E),(E-E’)E*). Then E’ is an F-fd-hinge of a hypergraph H when E* is a hinge of HE*,F. In [16] we showed that an F-fd-hinge was equivalent to an embedded join dependency. In other words any time that a set of edges defines an F-fd-hinge, the set of relation schemes that correspond to the edges define a lossless join. As a result, our test for query correctness comes down to testing the set to determine if they define an F-fd-hinge. 33