Searching Integrated Relational and Record-based Legacy Data Using Ontologies
L.L. Miller and Hsine-Jen Tsai
Department of Computer Science
Iowa State University
Ames, Iowa 50011
Sree Nilakanta
College of Business
Iowa State University
Ames, Iowa 50011
Mehdi Owrang
American University
Washington, D.C.
Abstract
Integration of data continues to be a problem. The number of databases available to a corporation continues
to grow. Simply keeping track of the number and diversity of the attributes (fields) can be a difficult
problem in large organizations. In this paper we define a domain ontology model that supports
object (entity) search over a set of integrated relational databases and record-based legacy
systems. The integration process is based on a hypergraph model that makes use of the theory of
universal relations. The design of the complete system model is given and a prototype of the
model is briefly discussed.
1. Introduction
Managing the vast amounts of information in large computer networks presents a
number of difficulties to system users and designers. New applications and databases are
created on a regular basis to solve local problems as they arise. For large organizations, this
means the number of databases can be staggering. We cannot expect users to know the terms
for identifying specific information from multiple data sources. Most databases are
created and maintained by local groups and/or organizations that use software optimized
for local transactions. Even if we assume that all databases use a standard
hardware/software platform, language, and protocol, there is still the issue of conceptual
heterogeneity. Assisting users in obtaining an integrated view of information from
heterogeneous distributed data sources continues to be an active research area.
Among the approaches explored by research groups working on this problem, the use of an ontology
[6,8,10,13,18,20,21] seems very appealing. Since the beginning of the nineties, ontologies
have become a popular research topic investigated by several artificial intelligence
research communities, including knowledge engineering, natural-language processing and
knowledge representation. More recently, the notion of ontologies has become
widespread in fields such as intelligent information integration, information retrieval on
the Internet, and knowledge management. The reason for ontologies being so popular is
in large part due to what they promise: a shared and common understanding of some
domain that can be communicated across people and computers. General ontologies have
not been effective. Therefore, the best one expects from an ontology is for it to be
domain specific. However, for imprecise queries, the first problem is to take query terms
and map them to database terms. Therefore, minimally we must modify the ontology to
make it database specific. The Summary Schemas Model (SSM) [2,3,4] provides a way to
link database terms to the ontology.
In spite of the large amount of research on database integration of heterogeneous
data sources that has been done, the problem continues to create difficulties for most
organizations. In the present work we look at a subproblem of the general integration
problem, namely the case where the data sources are controlled by one organization and
the data sources consist of relational databases and record-based legacy systems. While
this is a small part of the general problem, it covers a large number of applications that
typical organizations are concerned with integrating.
Our contribution in this paper is the development of an ontology-based model that
provides access to a distributed set of relational databases and record-based legacy
systems through imprecise queries. A database specific ontology is integrated with a set
of semantically disjoint universal relations over the set of data sources to provide access.
The use of universal relations simplifies the connection between the ontology and
the set of distributed data sources. For any request over semantically related data,
there is a single universal relation that is capable of responding to the request.
Specifically, we develop the notion of database specific weighted ontologies as a means
of determining the required universal relation. The use of universal relations in this
context is made possible due to our data integration scheme. The integration scheme is
based on the use of hypergraphs and the theory of relational databases. Such an approach
provides the additional capability of testing the correctness of any query generated.
A brief overview of ontologies, the Summary Schema Model (SSM), and integration
issues is presented in Section 2. The overall model is overviewed in Section 3. In
Section 4 we present our approach to ontologies and look at the issue of generating SSM
tree fragments and database specific ontologies. Section 5 looks at the issues that
make up our data integration scheme, and Sections 6 and 7 overview query generation and
the Data/Query Manager. Section 8 overviews our current version of the
feasibility prototype. Finally, we conclude by summarizing our results.
2. Background
2.1 Ontologies
The word “ontology” is borrowed from philosophy, in which it refers to the
“subject of existence” [8]. It is the science of “what is”.
It discusses the structures of
entities, the properties of entities and the relations between entities. In a word, it seeks to
find an appropriate classification of entities. In the context of artificial intelligence, an
ontology is a model of some portion of the world and is described by defining a set of
representational terms [6]. A formal definition is “a formal, explicit specification of a
shared conceptualization” [8]. “Conceptualization” refers to an abstract model of some
phenomena in the world by having identified the relevant concepts of those phenomena.
So, an ontology is a description of concepts and relationships between them.
The main motivation of an ontology is knowledge sharing and reuse [9,25]. In the
field of information systems, different groups gather data using their own terminologies.
When all those data are integrated, a major problem that needs to be handled is the
terminological and conceptual incompatibility. This could be done on a case-by-case basis.
But a solution based on a “consistent and unambiguous description of concepts and their
potential relation” [19] will be much better than a case-by-case one. In the Knowledge
Sharing Effort (KSE) project [18], ontologies are put forward as a means to share
knowledge bases between various knowledge-based systems.
A major challenge in using ontologies lies in how to build them, or what they should
look like. Several groups have given solutions. They describe how ontologies should
be constructed so that they contain the richest information in the least space and can be
efficiently retrieved for use. A solution based on the definition of a “core library” has
been proposed in [25]. More often, an ontology is considered as a taxonomic hierarchy of
words with the "is-a" relation between them [9]. Techniques have also been proposed
to modify a poorly designed ontology into a better one [11].
In dealing with multi-database systems, ontologies can be used effectively to
organize keywords as well as database concepts by capturing the semantic relationships
among keywords or among tables and fields in a relational database. By using these
relationships, a network of concepts can be created to provide users with an abstract view
of an information space for their domain of interest. Ontologies are well suited for
knowledge sharing in a distributed environment where, if necessary, various ontologies
can be integrated to form a global ontology.
Database owners find ontologies useful because they form a basis for integrating
separate databases through identification of logical connections or constraints between
the information pieces. Ontologies can provide a simple conversational interface to
existing databases and support extraction of information from them. Because of the
distinctions made within an ontological structure, they have been used to support database
cleaning, semantic database integration, consistency-checking, and data mining [20].
An example of using ontologies in databases is Ontolingua [9]. Ontolingua is
being built to enable databases (and the people and systems that interface with them) to
share an ontology specific to the computer science and mathematics domains, with the
intention of enabling data sharing and reuse. Another
example of database application is the Cyc ontology that had a knowledge base built on a
core of approximately 400,000 hand-entered assertions (or rules) designed to capture a
large portion of what we normally consider consensus knowledge about the world [14].
Partitioned into an Upper Cyc Ontology and the full Cyc Knowledge Base, the Upper Cyc
Ontology contains 3,000 terms covering the most general concepts of human consensus
reality, with literally millions of logical axioms about more specific concepts descending
below and populating the Cyc Knowledge Base. This foundation enables Cyc to effectively
address a broad range of otherwise intractable software problems. The global ontology's
objects, attributes, transitions, and relationships are accepted as forming the domain's universe.
2.2 Summary Schema Model (SSM)
The SSM was first proposed by M. Bright et al. [2,3,4]. The SSM was designed to
address the following issues [2,3,4]:
1. In a multi-database system, users cannot be expected to remember
voluminous specific access terms, so the global database should provide
system aids for matching user requests to system data access.
2. Because of different local requirements, independent database designers are
unlikely to use consistent terms in structuring data. The system must take
responsibility for matching user requests to precise system access terms.
The SSM provides the following capabilities: it allows imprecise queries and
automatically maps imprecise data references to the semantically closest system access
terms. Note that the SSM deals with imprecision in database access terms rather than data
values within the database.
The SSM uses a taxonomy of the English language that maintains synonym and
hypernym/hyponym links between terms. Roget’s original thesaurus provided just such a
taxonomy and is the current basis for the SSM. Identifying semantic similarity is the first
step in mapping local to global data representation.
The SSM creates an abstract view of the data available in local databases by
forming a hierarchy of summary schemas. A database schema is a group of access terms
that describe the structure and content of the data available in a database. A summary
schema is a concise, although more abstract, description of the data available in a group
of lower level schemas. In SSM, schemas are summarized by mapping each access term
to its hypernym. Hypernyms are semantically close to their hyponyms, so summary
schemas retain most of the semantic content of the input schemas.
The SSM trees structure the nodes of a multi-database into a logical hierarchy.
Each leaf node contributes a database schema, and each access term in a leaf schema is
associated with an entry-level term in the system taxonomy. Once these terms have been
linked to the taxonomy hierarchy, creating the summary schemas at the internal nodes is
automatic. Each internal node maintains a summary schema representing the schemas of
its children. Conceptually, only leaf nodes have participating DBMSs, while internal
nodes are responsible for the summary schemas structure and most of the SSM
processing.
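As a rough illustration of this hypernym-based summarization (not taken from the SSM implementation; the taxonomy and schema terms below are hypothetical), the following sketch maps each access term contributed by two leaf schemas to its hypernym in a small taxonomy to form the parent node's summary schema.

import java.util.*;

// Minimal sketch of SSM-style schema summarization (hypothetical names).
public class SummarySchemaSketch {
    public static void main(String[] args) {
        // Toy hypernym links taken from a thesaurus-like taxonomy.
        Map<String, String> hypernym = new HashMap<>();
        hypernym.put("salary", "compensation");
        hypernym.put("bonus", "compensation");
        hypernym.put("surname", "name");
        hypernym.put("department", "organizational unit");

        // Access terms contributed by two leaf (local) schemas.
        List<Set<String>> leafSchemas = List.of(
            Set.of("surname", "salary", "department"),
            Set.of("surname", "bonus"));

        // The parent node's summary schema: each access term replaced by its hypernym.
        Set<String> summarySchema = new TreeSet<>();
        for (Set<String> schema : leafSchemas) {
            for (String term : schema) {
                summarySchema.add(hypernym.getOrDefault(term, term));
            }
        }
        System.out.println("Summary schema: " + summarySchema);
        // Prints: Summary schema: [compensation, name, organizational unit]
    }
}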
2.3. Integration
Bright et al. [1] define a multidatabase system as a system layer that allows global
access to multiple, autonomous, heterogeneous, and preexisting local databases. This
global layer provides full database functionality and interacts with the local DBMSs at
their external user interface. Both the hardware and software intricacies of the different
local systems are transparent to the user, and access to different local systems appears to
the user as a single, uniform system. The term multidatabase includes federated
databases, global schema multidatabases, multidatabase language systems, and
homogeneous multidatabase language systems.
Multidatabases inherit many of the problems associated with distributed
databases, but also must contend with the autonomy and heterogeneity of the databases
that they are trying to integrate. As the number of local systems and the degree of
heterogeneity among these systems rises, the cost of integration increases.
There has been considerable research on multidatabase systems; a great deal of
this work is examined in [12]. This work has focused on the problem from the
point of view of applying traditional database techniques to bridge the mismatch between
the underlying data sources.
Several researchers have explored the use of intelligent agents called mediators
[26,27] as a means of bridging the mismatch between the heterogeneous data sources. At
present, there is no implemented system that offers the full range of functionality
envisioned by Wiederhold in his paper [26]. Examples of projects that have been
developed include HERMES at the University of Maryland [23], CoBase
at UCLA [5], NCL at the University of Florida [22], and MIX at SDSC [15]. The
advantage of such mediator-based systems is that to add a new data source it is only
necessary to find the set of rules that define the new data source.
More recently, a number of researchers have started to look at XML-based data
integration techniques as a way to attack the general data integration problem. The use of XML
in the general data integration problem is especially interesting as the semistructured format that
XML supports allows one to manipulate a variety of data types. Beyond simply storing the data
in XML format, data integration requires mechanisms to do the integration. Zamboulis makes
use of Graph Restructuring to accomplish the integration [30]. A number of groups have looked
at XQuery as the basis of their approach to XML-based data integration [6,7,8,12]. The Tukwila
Data Integration System provides a complete solution that involves not only integration, but
activities like optimizing network performance as well [23].
In the next section we overview the complete model before examining the two
principal components of our model in more detail.
3. Model Overview
The proposed model makes use of a database specific ontology and an integration
scheme based on universal relations to support imprecise queries over a distributed set of
relational databases and record-based legacy systems.
Figure 3.1 illustrates the
relationship between the objects used to construct the physical state of our model. The
universal relations are used to provide a simple query interface to the set of distributed
relational databases and record-based legacy systems. The Summary Schema Model
(SSM) tree fragments are used to convert a domain entity ontology into a database
specific ontology.
The result is that the model is capable of supporting imprecise
requests. Once the terms used in the user’s request are related to the appropriate database
terms (i.e., attribute names), the model automatically generates a result relation and
returns it to the user.
[Figure 3.1 omitted: the Entity Ontology is linked through SSM tree fragments to the Universal Relations, which are defined over the underlying Relational Databases and Legacy Systems.]
Figure 3.2 looks at the model from the perspective of the processes that are
required to enable the model. The components inside the dotted rectangle provide an
illustration of the relationship between the components.
The interactions between the components of the model are best illustrated by
looking at the way that data flows within the model. The front end system passes the
model a set of terms and conditions as a request (query). The controller passes the terms,
including any terms in the conditions, to the Ontology Mediation Manager. The terms are
used to search the ontology to find the universal relation(s) that are needed to generate the
universal relation query to respond to the request. Terms that cannot be located in the
database specific ontology are typically mediated with the user. There are multiple ways
that this mediation could be implemented depending on the nature of the front end. In
our discussion (and prototype) we have assumed the use of a GUI to conduct this
mediation as a visual process, but this would not be required.
Locating the terms in the ontology would identify one or more universal relations
that can be used to answer the request. In general only one universal relation would be
identified due to the universal relations being semantically disjoint. More details on this
issue are discussed in Section 5. As a result, in the remainder of the paper we will
assume that only one universal relation is required to produce a result for a given request.
Based on the results of the ontology search, a universal relation query is generated.
The universal relation query is passed to the Query Engine along with a request id.
There it is converted into an integration query that makes use of the relations and legacy
system records that define the universal relation’s data space. The integration query is
partitioned by the Data/Query Manager and the resulting subqueries are sent to the
appropriate data sources. The relations that are generated by the subqueries are returned
to the Data/Query Manager where they are merged and the final result relation is sent
back to the front end system.
[Figure omitted: block diagram showing the Front End System and User, the model's Controller, the Ontology Mediation Manager, the Ontology Search Manager, the Database Specific Ontology, the Query Engine, the Data/Query Manager with its metadata, and the Data Sources.]
Figure 3.2. Block diagram of the proposed model.
In the next two sections, we take a more detailed look at the components of the
model. Our approach to creating and searching database specific ontologies is examined
in Section 4. An overview of our integration scheme is given in Section 5.
4. Ontology Design
Ontologies are in general domain specific. In an environment where one is trying
to integrate a set of heterogeneous, distributed data sources, this means that it is necessary
to make the ontology used to search the data sources database specific. For an ontology,
this means that the attribute names used in the universal relations must be incorporated
into the ontology.
4.1 Ontology Design
The focus in this section of the paper is moving from domain specific ontologies
to database specific ontologies. We see ontologies as representing the entities (objects) in
the domain in which the user of the integrated databases is working.
The domain is
represented by terms that define the problem area. Note that the user’s problem and the
available databases must come from the same domain in order for a solution to exist.
An ontology can be defined as a graph O = (T, E), where T is the set of terms used
to represent the domain and E is the set of edges connecting the nodes representing the
terms. Each term node can have properties assigned to it. In our ontology model there
are four types of edges in E, namely, the is-a, is-part-of, synonym, and antonym edges. Is-a
and is-part-of edges are directed, while synonym and antonym edges have no direction.
Let I(E) be the set of is-a edges in the ontology O. Then (T, I(E)) represents a directed
acyclic graph (dag) with the more general terms higher in the dag and more specific terms
lower in the dag. As expected, synonym and antonym edges are used to connect terms
with the same and opposite meaning, respectively.
To enhance the search operation, we add the notion of edge weights to create a
weighted ontology. Let W be the set of weights such that wi ∈ W is the weight for Ei ∈ E.
We use the weights to prune the search of the ontology. For Ei ∈ I(E), the weights are used
to estimate the relative closeness of the is-a relationship. A similar argument can be made
for is-part-of edges. Going through a term like Physical Object would not be useful. To
block the search, the weights assigned to the edges connected to such a term are set to
large values. In our current ontology design, weights for is-a and is-part-of edges are
integers. Note that the use of weights is to reduce the number of questions that a user
must be asked during the search. In meaningful queries there are likely to be several
query terms. This, combined with the expected bushiness of the ontologies, gives rise to the
possibility of an overwhelming number of questions that the user could be asked if the
user had to resolve all of the choices.
The weights on the synonym and antonym edges range from zero to one, where
one indicates an exact match for a synonym and an exact opposite for an antonym. Using
weights on these edges allows us to show the degree of the match. A small example of a
weighted ontology is shown in Figure 4.1.
[Figure omitted: a small weighted ontology whose terms include Entity, Physical Object, Living Being, Social Entity, Apple, Animal, Human Being, Country, and Person, with integer weights on the is-a edges and a synonym edge with weight 0.9.]
Figure 4.1 A weighted ontology.
The method of generation of the weights depends on the builder of the ontology.
The weights can be assigned by hand or can be generated automatically. We have
generated the weights by hand in our current test sets, but we have designed an algorithm
for generating the weights from metadata and domain documents.
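For illustration, a minimal sketch of one possible in-memory representation of such a weighted ontology is given below; the class, field, term, and universal relation names are our own and are not taken from the prototype.

import java.util.*;

// Sketch of a weighted ontology graph with typed edges (hypothetical names).
public class WeightedOntologySketch {
    enum EdgeType { IS_A, IS_PART_OF, SYNONYM, ANTONYM }

    static class Edge {
        final String from, to;
        final EdgeType type;
        final double weight;   // integer-valued for IS_A/IS_PART_OF, 0..1 for SYNONYM/ANTONYM
        Edge(String from, String to, EdgeType type, double weight) {
            this.from = from; this.to = to; this.type = type; this.weight = weight;
        }
    }

    // Adjacency list keyed by term; term properties such as universal relation pointers kept separately.
    final Map<String, List<Edge>> adjacency = new HashMap<>();
    final Map<String, Set<String>> universalRelationLinks = new HashMap<>();

    void addEdge(String from, String to, EdgeType type, double weight) {
        adjacency.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(from, to, type, weight));
        if (type == EdgeType.SYNONYM || type == EdgeType.ANTONYM) {
            // Synonym and antonym edges are undirected, so store both directions.
            adjacency.computeIfAbsent(to, k -> new ArrayList<>()).add(new Edge(to, from, type, weight));
        }
    }

    public static void main(String[] args) {
        WeightedOntologySketch o = new WeightedOntologySketch();
        // A fragment loosely following Figure 4.1: large is-a weights block unhelpful paths.
        o.addEdge("Physical Object", "Apple", EdgeType.IS_A, 100);
        o.addEdge("Living Being", "Human Being", EdgeType.IS_A, 15);
        o.addEdge("Human Being", "Person", EdgeType.SYNONYM, 0.9);
        o.universalRelationLinks.computeIfAbsent("Person", k -> new HashSet<>()).add("U_employee");
        System.out.println("Edges from Human Being: " + o.adjacency.get("Human Being").size());
    }
}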
4.2 Creating a database specific ontology
To move from a domain specific ontology to a database specific ontology, we
make use of Summary Schema Model (SSM) tree fragments. The process of creating a
database specific ontology requires us to create SSM tree fragments that are relatively
specific. The SSM tree fragments are constructed starting with the attribute names used in
the schema of the universal relations that are defined by the data source data. To
successfully search a database specific ontology, it is critical that the SSM tree fragments
do not generalize. If the root term of an SSM tree fragment is too general, the database
terms will not be found by searches starting at meaningful domain terms.
To start the process of making an ontology database specific, we check the
attribute names in the universal relation defined by the data sources to determine if they
already exist as terms in the ontology. If the term exists, a pointer is added to the
ontology term property set to point to the universal relation that the attribute is located in.
For the remaining universal relation attributes, the metadata of the databases is used to
unify the attribute names into one or more SSM tree fragments.
In particular the
definitions of the database fields named by the attribute names given in the meta data are
used to determine related (i.e., unifyable) terms. The term that is used to unify a subset of
the remaining universal relation attributes is then matched against the ontology terms. If
it is found, the SSM tree fragment is attached to the ontology term.
Weights are
assigned by the individual expanding the ontology. If the root term of the new fragment
is not in the ontology, the unification process asks the user for related terms and again
checks the ontology.
If no match exists, our algorithm looks to incorporate more
universal relation attributes into the SSM tree fragment (i.e., grow the SSM tree
fragment). Our early attempts to completely automate the process have not been very
promising, so we are currently using a human aided approach.
The metadata definitions
and related documents are used to determine likely unification terms. This gives the
human guiding the process the opportunity to choose a unifying term from an existing
list.
At each step, the root term of the SSM tree fragment is checked to see if it exists in the
ontology. When all of the attribute names have been incorporated into the ontology in
this manner, we say that the ontology is database specific. Figure 4.2 shows a block
diagram of the database specific ontology.
[Figure 4.2 omitted: block diagram showing the Entity Ontology linked through SSM tree fragments to the Universal Relations.]
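A minimal sketch of the attribute-linking step described above is given below; the relation and attribute names are hypothetical, and the SSM fragment unification itself is not shown.

import java.util.*;

// Sketch of linking universal relation attributes into the ontology (hypothetical names).
public class DatabaseSpecificOntologySketch {
    public static void main(String[] args) {
        // Terms already present in the domain ontology.
        Set<String> ontologyTerms = new HashSet<>(List.of("name", "department", "sales", "country"));

        // Attribute names taken from a universal relation's scheme.
        String universalRelation = "U_sales";
        List<String> attributes = List.of("name", "department", "region_code", "sales");

        // Pointers from ontology terms to the universal relations that contain them.
        Map<String, Set<String>> termToUniversalRelations = new HashMap<>();
        List<String> needSsmUnification = new ArrayList<>();

        for (String attr : attributes) {
            if (ontologyTerms.contains(attr)) {
                termToUniversalRelations.computeIfAbsent(attr, k -> new HashSet<>()).add(universalRelation);
            } else {
                // Remaining attributes are grouped into SSM tree fragments using the metadata definitions.
                needSsmUnification.add(attr);
            }
        }
        System.out.println("Linked terms: " + termToUniversalRelations.keySet());
        System.out.println("Left for SSM unification: " + needSsmUnification);
    }
}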
4.3 Search
The basic premise of our ontology search is to allow the user to give a set of
search terms and proceed from the search terms to "nearby" database terms. Weights
combined with user interaction are used to define what is meant by "nearby". To look at
the search, we provide a set of basic rules used in the search.
Ontology Search Rules for is-a, synonym, and antonym edges:
1. A user creates a request by supplying a set of search terms. A search
algorithm searches the database specific ontology to locate the search
terms. If some of the search terms are not found in the ontology, the user
is asked to refine the query terms.
2. Weights are used to block paths that are unlikely to provide useful
results. As an example, an is-a edge from a very general term to a specific
term (e.g., Apple in Figure 4.1) is unlikely to yield a useful "nearby" term.
Weights are used in combination with user interaction to provide an
effective search without overwhelming the user.
3. In a typical successful search, when no link to a universal relation is
found at an original term node, the algorithm starts from the node by
looking for synonym edges. If one is found the weight is tested against the
synonym threshold. If the weight is larger than the threshold, the search
moves to the next node and continues. Since more than one synonym edge
may be followed, the weights on synonym edges are multiplied and the
product is tested against the threshold. Whether more edges are followed
from the individual nodes depends on whether we are looking for all "nearby"
database terms or just one. If no synonym edge exists, then the is-a edges
are used as indicated in rule 2.
4. For a NOT search, the algorithm starts from the query term in the
ontology and looks for an antonym edge leaving the term node. If one
exists, its weight is tested against the antonym threshold. If an appropriate
antonym edge is found, the search moves to the new term node and a
positive search (rule 3) is initiated from that point.
5. In all cases, if no "nearby" database term is found for a query term, the
user is notified and asked to refine the query term.
6. When all query terms have been processed, the search algorithm returns
a set of universal relations and attribute names that can be used to generate
the required universal relation query.
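To make rules 3 and 4 more concrete, the following sketch illustrates the synonym-edge traversal of rule 3, multiplying edge weights and testing the product against the synonym threshold; the data, threshold value, and universal relation link are hypothetical, and the is-a fallback and antonym handling are omitted.

import java.util.*;

// Sketch of ontology search rule 3: follow synonym edges while the
// product of edge weights stays above a threshold (hypothetical data).
public class OntologySearchSketch {
    static Map<String, Map<String, Double>> synonymEdges = new HashMap<>();
    static Map<String, String> universalRelationLink = new HashMap<>();
    static double SYNONYM_THRESHOLD = 0.7;

    static void addSynonym(String a, String b, double w) {
        synonymEdges.computeIfAbsent(a, k -> new HashMap<>()).put(b, w);
        synonymEdges.computeIfAbsent(b, k -> new HashMap<>()).put(a, w);   // undirected
    }

    // Returns a "nearby" database term reachable from 'term', or null (rule 5 case).
    static String search(String term, double productSoFar, Set<String> visited) {
        if (universalRelationLink.containsKey(term)) return term;          // database term found
        visited.add(term);
        for (Map.Entry<String, Double> e : synonymEdges.getOrDefault(term, Map.of()).entrySet()) {
            double product = productSoFar * e.getValue();
            if (product >= SYNONYM_THRESHOLD && !visited.contains(e.getKey())) {
                String hit = search(e.getKey(), product, visited);
                if (hit != null) return hit;
            }
        }
        return null;   // would fall back to is-a edges (rule 2) or user mediation
    }

    public static void main(String[] args) {
        addSynonym("location", "geographic location", 0.9);
        addSynonym("geographic location", "country", 0.85);
        universalRelationLink.put("country", "U_sales.country");
        System.out.println(search("location", 1.0, new HashSet<>()));
        // Product 0.9 * 0.85 = 0.765 >= 0.7, so "country" is returned.
    }
}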
5. Integration Scheme
While there has been a great deal of activity on integrating heterogeneous
databases, important questions remain. To bridge this gap, we use an integration model
designed to operate on a subset of the general integration problem where the data sources
are limited to relational databases and record-based legacy systems. Our approach takes
advantage of the work on universal relation interfaces (URIs) [7,17]. The idea behind a
URI is to provide a single relation view of a set of relations found in the same database.
The set of relations should have sufficient semantic overlap that the single universal
relation view is able to provide a semantically correct "view" of the data. In addition,
a URI has to support development of a correct query.
The task of applying the earlier work on URIs to the integration of relational
databases and record-based legacy systems has three basic steps:
1. Give the record-based legacy systems a relational structure, which
we call a pseudo relation.
2. Group attributes so that only semantically equivalent attributes
have the same name in the integrated environment.
3. Model each set of connected relations (defined in Section 5.3) as
a universal relation.
The result of applying the three steps is a set of universal relations that are visible
to any software interacting with the integration model. The number of universal relations
will depend on the degree of overlap between relations and pseudo relations. The next
three subsections look at the three steps in more detail.
5.1 Defining Pseudo Relations
Our approach is to have the local data administrator of each record-based legacy
system define the set of export “relation view(s)” (records) that he/she is willing to export
into the integrated environment.
This set can change over time.
The local data
administrator defines these “relation views” as a set of requests to the legacy system at the
programmatic level (batch mode). Each “relation view” places a pseudo relation in the
integrated environment. A pseudo relation is a set of tuples with each column named by
a unique attribute name.
A wrapper for the legacy system is then created that resides on the same platform
as the legacy system. The wrapper is a static agent that interfaces with the integration
model by exporting the required “relation view” as a set of tuples (i.e., a pseudo relation).
To generate the pseudo relation, the view manager executes the appropriate request to the
legacy system through the "relation views" defined by the local administrator. Figure 4.3
illustrates the relationship.
[Figure omitted: the wrapper's Relation View Manager issues requests for data to the Record-Based Legacy System and exports the returned records as a set of pseudo relation tuples.]
Figure 4.3. Relationship between wrapper and legacy system.
Each retrieval of data through a wrapper results in placing a pseudo
relation in the integrated environment. Selection of rows in the resulting table can easily
be implemented as part of the view manager.
5.2 Attribute Names
In any set of database relations and legacy systems there are likely to be problems
with attribute names. In particular, one expects some instances of semantically equivalent
attributes with different names and some cases of attributes with the same names but
different meanings.
We use the typical solution to this problem, i.e., have the designer of the
integrated system evaluate the existing name set by reviewing the metadata defined over
the data sources. He/she can then rename attributes within the integrated system to
remove the problem. For relational databases, this can be accomplished by using views.
Views can also be used by the local database administrator as a means of
controlling what data is exported into the integrated environment. Since the local data
administrator of a legacy system is already defining a “relation view” in the integrated
environment for each export schema, any required name changes can be handled at that
level.
The result is that we can look at the integrated environment as defining a set of
attributes, such that, if two attributes have the same semantics, they have the same name.
Also, if two attributes have the same name, they have the same semantics.
Another advantage of renaming the attributes in the proposed environment is that
attribute names can be chosen to provide more semantic meaning. This results in easier
SSM tree fragment construction.
5.3 Universal Relations
A universal relation u(U) is seen as a virtual relation u over a scheme U. We use
U and attr(U) interchangeably to mean the attributes in the scheme U. The universal
relation u can be defined over a set of relations {r1(R1), r2(R2), … , rn(Rn)} where
u = r1 ⋈ r2 ⋈ … ⋈ rn and attr(U) = attr(R1) ∪ attr(R2) ∪ … ∪ attr(Rn).
The universal relations used in our integration model are restricted to being
connected and maximal. A universal relation over a set of relations R = {R1, R2, …, Rn}
is connected as long as it is not possible to partition the set of relations into two
nonempty sets, say O1 and O2, such that attr(O1) ∩ attr(O2) = Ø. A universal relation
u(U) is considered to be maximal if attr(U) is the maximum set of attributes such that u is
connected.
In the remainder of this presentation, we use the phrase universal relation to mean
a maximal and connected universal relation. In the next subsection, we look at the basic
aspects of our integration model.
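The connectivity restriction amounts to grouping relation schemes that share attributes; the following sketch illustrates this grouping on hypothetical schemes, with each resulting group corresponding to one maximal connected universal relation.

import java.util.*;

// Sketch: partition relation schemes into connected groups, each group
// defining one maximal connected universal relation (hypothetical schemes).
public class ConnectedSchemesSketch {
    public static void main(String[] args) {
        Map<String, Set<String>> schemes = new LinkedHashMap<>();
        schemes.put("R1", Set.of("A", "B", "C"));
        schemes.put("R2", Set.of("C", "D", "E"));
        schemes.put("R3", Set.of("E", "F", "G"));
        schemes.put("R4", Set.of("X", "Y"));        // shares no attributes with the others

        List<Set<String>> groups = new ArrayList<>();
        Set<String> unassigned = new LinkedHashSet<>(schemes.keySet());
        while (!unassigned.isEmpty()) {
            // Grow one connected component starting from an arbitrary unassigned scheme.
            Deque<String> stack = new ArrayDeque<>();
            stack.push(unassigned.iterator().next());
            Set<String> component = new LinkedHashSet<>();
            while (!stack.isEmpty()) {
                String r = stack.pop();
                if (!unassigned.remove(r)) continue;
                component.add(r);
                for (String s : unassigned) {
                    if (!Collections.disjoint(schemes.get(r), schemes.get(s))) stack.push(s);
                }
            }
            groups.add(component);
        }
        System.out.println(groups);   // [[R1, R2, R3], [R4]] -> two universal relations
    }
}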
5.4 Data Integration
The Ontology Mediation Manager (Figure 3.2) sees the data through the
integration scheme as a set of disjoint universal relations. As such, it simply generates a
universal relation SQL query of the form Select attribute list From universal relation
Where condition. The Ontology Mediation Manager tags the universal relation query
with the request id supplied by the front end system and supplemented by the controller to
identify the front end and the user making the request. The task of the integration system
is to
1. Convert the universal relation query into a query over the relations that support
the universal relation.
2. Ensure the correctness of the query.
3. Partition the query with respect to the data sources.
4. Query the individual data sources, combine the results into a final relation, and
return it to the user.
The integration system is made up of two primary components, namely, a Query
Engine and a Data/Query Manager (Figure 3.2). The Query Engine makes use of a
hypergraph model of the set of relations that support the universal relation used in the
universal relation query to generate the query and test its correctness. The Data/Query
Manager receives the integration query from the Query Engine, partitions it with
respect to the location of the data, sends the resulting sub-queries to the appropriate data
sources, and combines the results of the subqueries if there is more than one sub-query.
In the next two sections we look briefly at the underlying concepts of the Query
Engine and Data/Query Manager, respectively.
6. Query Generation and Correctness Overview
Hypergraphs play a critical role in our approach to integration. A hypergraph is a
couple H = (N,E), where N is the set of vertices and E is the set of hyperedges, which are
nonempty subsets of N. There is a natural correspondence between database schemes and
hypergraphs. Consider the set of relation schemes R = {R1, R2, …, Rn}. We can define
the set of attributes of R as being attr(R) = R1 ∪ R2 ∪ … ∪ Rn. The hypergraph HR = (attr(R), R) can
be seen to be a hypergraph representation of the set of relations.
Typically, the
hypergraph has been used to represent the scheme of a single database, but there is no
reason that we cannot use the more general interpretation of having it represent the
scheme of the relations and pseudo relations that define the data in the integrated
environment.
Let L = {L1, L2, …, Lm} be the set of pseudo relations that are defined for the
record-based legacy systems as described in Section 5.1. Let R = {R1, R2, …, Rn} be the
set of relation schemes associated with the relational databases that exist within the
integrated environment. If RENAME() is the process described in Section 5.2, then S =
RENAME(L) ∪ RENAME(R) can be perceived as the relation set for the integrated
environment. We can then look at HI = (attr(S), S) as a hypergraph representation of the
integrated environment. The hypergraph HI defines a set of one or more connected
subhypergraphs. The precise number of connected subhypergraphs is dependent on the
connectivity of the relations and pseudo relations in the integrated environment. Each
connected subhypergraph, say Hu = (attr(U), U) where U is a subset of S and
attr(U) ∩ attr(S − U) = Ø, provides the basis of one universal relation.
[Figure 5.1 omitted: a) a hypergraph with edges ABC, CDE, AEF, and BF over the attributes A through F, b) its complete intersection graph, and c) an ABFS tree with root ABC (path label {A,B}), child BF (path label {A,B,F}), and ASet = {CDE}.]
Looking at the elements of S = {S1, S2, …, Sm+n}, where Si = RENAME(Li) for 1 ≤ i ≤ m
and Sj+m = RENAME(Rj) for 1 ≤ j ≤ n, we assume that the Sk, 1 ≤ k ≤ m+n, define
meaningful groupings of attributes within the integrated environment. Using the results
of [7], we then have the join dependency ⋈[S] defined over the integrated environment.
The importance of this is that we can apply the strategy used in our earlier work on
universal relations [16,17] to check the correctness of any queries generated in the
integrated environment.
To translate a universal relation query to an integration query, we must translate
the request to the target data space (the hypergraph representing the collection of
connected operational databases).
Finally, the target query hypergraph needs to be
mapped to an SQL query. To create the mapping, we convert the underlying hypergraph
into a set of Adjusted Breadth First Search (ABFS) trees [17]. An ABFS tree is created
by applying a variation of the breadth first search to the complete intersection graph
(CIG) defined by the underlying hypergraph model. An ABFS tree is created for each
node (relation) in the CIG that contains attributes required in the SQL query. Each path
from the root to a leaf of the ABFS tree defines a set of relations that can be joined. From
this set of paths, we choose a subset that covers the attributes required in the query. The
ABFS tree that requires joining the fewest relations is chosen to create the relation list in
the new SQL query. Figure 5.1 illustrates a simple example of this process. The
complete details of mapping the request to an SQL query are given in [17] and in Appendix A.
To ensure the correctness of the integration query, we need to have the join
sequence define a lossless join. Using the result from [7], the join dependency ⋈[U] is
defined over the relations and pseudo relations that make up the universal relation used in
the universal relation query that is being translated. The importance of this is that an
FD-hinge of Hu defines a set of edges whose corresponding relations have a lossless join
[16]. The test for correctness starts by testing whether the edges that correspond to the join
sequence define an FD-hinge in Hu. Failing that, the set of edges is expanded to form
an FD-hinge.
7. Data/Query Manager Overview
The first task of the Data/Query Manager is to partition the integration query generated by
the Query Engine into subqueries with respect to the location of the
relations/pseudo relations involved in the query. Once the integration query has been
partitioned, the resulting subqueries are sent to the appropriate data sources. Example 1
provides a simple example of the partition process.
Example 1: Example of query partition using SQL syntax.
Data layout:
Site 1 tables: R1(A,B,C), R2(C,D,E)
Site 2 tables: R3(E,F,G)
Universal Relation Query:
Select G,B
Where F=10
Integration Query:
Select G,B
From R1,R2,R3
Where R1.C=R2.C and R2.E=R3.E and F=10
Partition results:
Query for Site 1 (Q1):
Select B, E
From R1, R2
Where R1.C=R2.C
Query for Site 2 (Q2):
Select E, G
From R3
Where F=10
Request Framework Query:
Select G, B
From Q1, Q2
Where Q1.E = Q2.E
The Data/Query Manager retains the Request Framework Query so that it can
combine the results when two or more subqueries are needed. Assuming Id1 is the
request identifier for the universal relation query, Site1 is the site location, and Q1 & Q2
are the subquery identifiers for the two subqueries in Example 1, then Example 2
illustrates the strings used by the Data/Query Manager to represent the subqueries and the
Request Framework Query.
Example 2: The query string for the result given in Example 1:
SubQuery Queue:
“Select B, E From R1, R2 Where R1.C = R2.C”:<Id1,Q1,Site1>
“Select E, G From R3 Where F =10”:<Id1,Q2,Site 2>
Request Framework Query Queue:
”Select G, B From Q1, Q2 Where Q1.E = Q2.E”:<Id1>
The results of the subqueries are placed in a temporary database at the site of the
Data/Query Manager. When results from all of the subqueries have returned and are
stored in the local database, the Request Framework Query is used to combine the
intermediate results before returning the final result relation to the front end system.
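As a simple illustration of the partition step, the sketch below groups the tables of Example 1's integration query by the site that stores them; the actual Data/Query Manager also rewrites the select lists and join conditions, which is not shown here.

import java.util.*;

// Sketch of partitioning an integration query's table list by data source site
// (layout follows Example 1; the rest of the partition logic is omitted).
public class QueryPartitionSketch {
    public static void main(String[] args) {
        Map<String, String> tableSite = new LinkedHashMap<>();
        tableSite.put("R1", "Site1");
        tableSite.put("R2", "Site1");
        tableSite.put("R3", "Site2");

        // Tables referenced by the integration query of Example 1.
        List<String> queryTables = List.of("R1", "R2", "R3");

        // Group the referenced tables by the site that stores them.
        Map<String, List<String>> tablesBySite = new LinkedHashMap<>();
        for (String t : queryTables) {
            tablesBySite.computeIfAbsent(tableSite.get(t), k -> new ArrayList<>()).add(t);
        }

        // One subquery per site; the Request Framework Query later joins their results.
        int q = 1;
        for (Map.Entry<String, List<String>> e : tablesBySite.entrySet()) {
            System.out.println("Q" + (q++) + " (" + e.getKey() + "): From "
                    + String.join(", ", e.getValue()));
        }
        // Prints: Q1 (Site1): From R1, R2
        //         Q2 (Site2): From R3
    }
}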
8. Prototype
A prototype was implemented to test the feasibility of our approach. The
prototype was implemented in Java, developed on the Red Hat Linux platform, and
tested on the Windows platform. Figure 7.1 illustrates a block diagram of the prototype.
It is made up of four primary components: the User Interface, the Ontology Search
Manager, the Query Engine, and the Data/Query Manager. The functionality for the
Ontology Mediation Manager has been incorporated into the User Interface in the current
version of the prototype. The User Interface allows a user to enter a set of domain search
terms and a condition. The beginning screen with an example in progress is shown in
Figure 7.2.
[Figure omitted: block diagram of the prototype showing the User Interface, Ontology Search Manager, Query Engine, and Data/Query Manager.]
Figure 7.1 Block diagram of the prototype.
[Figure 7.2 omitted: the beginning screen of the Ontology Aided Search Environment, showing the domain search terms "name, department, sales", the condition "location = 'US'", and the Start Request and Help buttons.]
When the user is satisfied with what has been entered, he/she clicks on the Start Request
button. The Ontology Search System performs the search described in Section 4.3. The
ontology is searched for the domain terms provided by the user. If all of the domain
search terms are found in the ontology, the database information found through the SSM
fragments is returned to the user interface module. The user is notified of a successful
ontology search with the screen shown in Figure 7.3. The user has the option to see the
SQL query that has been constructed, see the results of the query on screen, or restart the
query process. Note that the motivation for the prototype has been to test our underlying
systems and not to develop a full featured user interface.
[Figure omitted: the Ontology Aided Search Environment screen reporting "Ontology Search Successful", showing the domain search terms "name, department, sales", the current condition "location = 'US'", and the View Query, Results on Screen, Restart Request, and Help buttons.]
Figure 7.3 Successful ontology search screen.
The discussion above assumes that the domain terms that the user entered were in
the ontology used by the system. When the ontology search doesn’t find all of the
domain search terms, the system creates a screen showing the terms that cannot be found.
Two conditions exist, namely, the ontology search found a term that appears to be close
or no term(s) can be found. In the first case the system returns the fragment of the
ontology that it thinks may be relevant. The user can choose one of the terms shown in
the ontology fragment or enter another term. Figure 7.4 shows an example of the case
where a fragment of the ontology is presented to the user. The screen illustrates how the
system prototype engages the user to help out the ontology search. The example shows
three is-a relationships with "country" being the likely choice for the user.
The
"geographic feature" node represents an is-closely-related relationship. Note that neither
the type of the arc nor the weights are shown at this point. We are hoping to get the user's
interpretation without biasing the user's choice.
[Figure omitted: a user help screen stating that help is required to complete the search, identifying the domain search term in question ("location?"), asking the user to enter the best choice or a new search term, and listing the terms assumed to be related: geographic features, geographic location, country, state, and city.]
Figure 7.4. Screen showing user/ontology interaction.
In the case that no terms are considered close to the domain search term, the user
is asked to enter a new domain search term with the same meaning.
When the ontology search (with the user's assistance) resolves the search terms to
database terms, the information is passed to the Query Generation System, where an SQL
query is generated.
The Query Generation System tests for the correctness of the
generated query [17]. If the Query Generation System was called through the View Query
button, the SQL query is shown. Again since we are in the test mode, we have chosen to
show the full SQL query with the tables or pseudo tables from the distributed data sources
as though they are in the same database. In a commercial package, more options of how
to show the query would have to be considered.
When the user clicks on the Results on Screen button, the query information is
passed to the Data Integration System. There the query is partitioned into queries for the
individual relational databases or legacy systems and the sub queries are sent to the
individual data source sites. The results of the individual queries are sent back to the
Data Integration system and the query results for the user’s query are prepared using the
original query. The sub-queries are used in the prototype to define and spawn a set of
mobile agents. The agents are sent to the sites that contain the relevant data. Each agent
carries one of the SQL queries. The data returned by the agents is combined to produce
the required result. The result of the request is then returned to the user and displayed on
the screen.
The choice of mobile agents is not critical to the model, but rather represents a
method for quickly generating the necessary infrastructure. Client-Server models using
SOAP, CORBA or Java JDBC connections could also be used. We have used all four
types of connections in related projects.
9. Conclusions
A model for using domain specific ontologies, converting them to database
specific ontologies to aid in the interpretation of a user's query, has been given. The
model allows users to define both domain specific search terms and domain specific
functions to operate on the results of the query. The model was built on an integrated
database/legacy system environment. Our data integration scheme provides a universal
relation view of the distributed data sources.
A prototype to test the feasibility of the
ontology and data integration model has been designed and implemented. The prototype
takes the user input and generates SQL queries for the relational databases/legacy systems
over which the ontology search operates.
10. References
1. Bright, M.W., A.R. Hurson and S.H. Pakzad. A taxonomy and current issues in
multidatabase systems. IEEE Computer, Vol. 25, No. 3, 1992, pages 50-60.
2. Bright, M.W and A. Hurson, “Summary Schemas in multidatabase systems”, Computer
Engineering Technical Report at PennState, 1990.
3. Bright, M.W., A. Hurson, S. Pakzad, and H. Sarma, “The Summary Schemas Model –
An approach for handling Multidatabases: Concept and Performance Analysis”,
Multidatabase System: An Advanced Solution for Global Information Sharing, pp.199,
1994.
4. Bright, M.W. and A. Hurson, “Automated Resolution of Semantic Heterogeneity in
Multidatabases”, ACM Transactions on Database Systems, pp. 213, 19(2), 1994.
5. Chu, W.W., H. Yang, K. Chiang, M. Minock, G. Chow & C. Larson. CoBase: A
Scalable and Extensible Cooperative Information System, Journal of Intelligent
Information Systems, Vol. 6, No. 2/3, 1996, pp. 223.
6. Corazzon, Raul. ed. “Descriptive and Formal Ontology”, http://www.formalontology.it.
7. Fagin, R., A.O. Mendelzon, and J.D. Ullman. 1982. A simplified universal relation
assumption and its properties. ACM Transactions on Database Systems. Vol. 7. Pages
343-360.
[6] Gardarin, Georges, Antoine Mensch, Anthony Tomasic: An Introduction to the e-XML Data
Integration Suite. EDBT 2002: 297-306.
[7] Gardarin, Georges, Fei Sha, Tuyet-Tram Dang-Ngoc: XML-based Components for
Federating Multiple Heterogeneous Data Sources. ER 1999: 506-519.
[8] Gardarin, Georges, Antoine Mensch, Tuyet-Tram Dang-Ngoc, L. Smit: Integrating
Heterogeneous Data Sources with XML and XQuery. DEXA Workshops 2002: 839-846.
8. Gruber, T. "A translation approach to portable ontologies," Knowledge Acquisition, pp.
199-220, 5(2), 1993.
9. Gruber, T. “Toward Principles for the Design of Ontologies Used for Knowledge
Sharing”, ed. N. Guarino. International Workshop on Formal Ontology, Padova, Italy
1993.
10. Guarino N., “Formal Ontology, Conceptual Analysis and Knowledge
Representation”. International Journal of Human and Computer Studies, special issue on
The Role of Formal Ontology in the Information Technology edited by N. Guarino and R.
Poli, vol 43 no. 5/6, 1995.
11. Guarino, N. and C. Welty, "Ontological Analysis of Taxonomic Relationships", In, A.
Laender, V. Storey, eds, Proceedings of ER-2000: The 19th International Conference on
Conceptual Modeling, October, 2000.
12. Hurson, A., M. Bright, S. Pakzad (ed.): Multidatabase systems - an advanced solution
for global information sharing. IEEE Computer Soc. Press 1994.
13. Peter D. Karp, Vinay K. Chaudhri and Jerome Thomere “XOL: An XML-Based
Ontology Exchange Language”, http://www.oasis-open.org/cover/xol-03.html.
[12] Lehti, Patrick, Peter Fankhauser: XML Data Integration with OWL: Experiences and
Challenges. SAINT 2004: 160-170.
14. Lenat, D. B. “Welcome to the Upper Cyc Ontology”, http://www.cyc.com/overview.html,
1996.
15. Ludäscher, B., Y. Papakonstantinou, P. Velikhov. A Framework for Navigation-Driven Lazy Mediators. ACM Workshop on the Web and Databases. 1999.
16. Miller, L. L. 1992. Generating hinges from arbitrary subhypergraphs. Information
Processing Letters. Vol. 41. No. 6. Pages 307-312.
17. Miller, L.L. and M.M. Owrang. 1997. A dynamic approach for finding the join
sequence in a universal relation interface. Journal of Integrated Computer-Aided
Engineering. No. 4. Pages 310-318.
18. Neches, R., Fikes, R.E., Finin, T., Gruber, T.R., Senator, T., Swartout, W.R.,
"Enabling technology for knowledge sharing", AI Magazine, pp.36-56, 12(3), 1991.
19. Schulze-Kremer, S. "Adding Semantics to Genome Databases: Towards an Ontology
for Molecular Biology", Proceedings of The Fifth International Conference on Intelligent
Systems for Molecular Biology, T. Gaasterland et al. (eds.), Halkidiki, Greece, June 1997.
20. N. J. Slattery, “A Study of Ontology and Its Uses in Information Technology Systems”,
http://www.mitre.org/support/papers/swee/papers/slattery/.
21. Sowa, John. Knowledge Representation: Logical, Philosophical, and Computational
Foundations. Brooks/Cole. Pacific Grove, CA. 2000.
22. Su, S.Y.W., H.L. Yu, J.A. Arroyo-Figueroa, Z. Yang and S. Lee. NCL: A Common
Language for Achieving Rule-Based Interoperability Among Heterogeneous Systems,
Journal of Intelligent Information Systems, Vol. 6, No. 2/3, 1996, pp. 171-198.
23. Subrahmanian, V.S., Sibel Adali, Anne Brink, Ross Emery, James J. Lu, Adil Rajput,
Timothy J. Rogers, Robert Ross, Charles Ward. HERMES: Heterogeneous Reasoning
and Mediator System, http://www.cs.umd.edu/projects/hermes/publications/abstracts/hermes.html.
24. Swartout, W.R., P. Patil, K. Knight, and T. Russ, "Toward Distributed Use of Large-Scale Ontologies" In Proceedings of the 10th Knowledge Acquisition for Knowledge-Based Systems Workshop. Banff, Canada. 1996.
[23] Tukwila Data Integration System. University of Washington.
http://data.cs.washington.edu/integration/tukwila. Accessed 10/5/2004.
25. van Heijst, G., A. Schreiber, B. Wielinga. "Using explicit ontologies in KBS
development", International Journal of Human-Computer Studies, pp. 183-292, Vol. 46,
No. 2/3, Feb, 1997.
26. Wiederhold, G. Mediators in the Architecture of Future Information Systems, IEEE
Computer, Vol. 25, No. 3, 1992, pp. 38-49.
27. Wiederhold, G. and M. Genesereth. The Conceptual Basis for Mediation Services,
IEEE Expert, Vol.12 No.5, 1997, pp 38-47.
[30] Zamboulis, L., XML Data Integration By Graph Restructuring, Proc. BNCOD21,
Edinburgh, July 2004. Springer-Verlag, LNCS 3112, pp 57-71.
Appendix A. Query Generation
To create a query, we must translate the request to the target data space (the
hypergraph representing the collection of connected operational databases). Finally, the
target query hypergraph is mapped to an SQL query. To look at this process in more
detail, we consider the basic data structures and algorithms. We start by looking at the
notion of a complete intersection graph.
Complete Intersection Graph (CIG)
Let H = (U, R) be a hypergraph where U = {A1, A2, ..., An} is a set of
attributes and R = {R1, R2, ..., Rp} is a set of relation schemes over U. The complete
intersection graph (CIG) [17] is an undirected graph (R, E) where E = { (Ri, Rj) : Ri ∩
Rj ≠ Ø, Ri ∈ R, Rj ∈ R, i ≠ j }. Note that the edge (Ri, Rj) between vertices (or nodes)
Ri and Rj exists if and only if Ri and Rj have at least one attribute in common. The edge
(Ri, Rj) will be labeled with Rij where Rij = Ri ∩ Rj. An example of a hypergraph and its
complete intersection graph is shown in Figure A.1.
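The following sketch builds the CIG edges and labels for the hypergraph of Figure A.1 directly from this definition; it is an illustration only, not code from the prototype.

import java.util.*;

// Sketch: build the complete intersection graph (CIG) for the hypergraph of Figure A.1.
public class CigSketch {
    public static void main(String[] args) {
        Map<String, Set<Character>> schemes = new LinkedHashMap<>();
        schemes.put("ABC", Set.of('A', 'B', 'C'));
        schemes.put("CDE", Set.of('C', 'D', 'E'));
        schemes.put("AEF", Set.of('A', 'E', 'F'));
        schemes.put("BF",  Set.of('B', 'F'));

        List<String> names = new ArrayList<>(schemes.keySet());
        // An edge (Ri, Rj) exists iff Ri and Rj share at least one attribute;
        // the edge is labeled with the intersection of Ri and Rj.
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                Set<Character> label = new TreeSet<>(schemes.get(names.get(i)));
                label.retainAll(schemes.get(names.get(j)));
                if (!label.isEmpty()) {
                    System.out.println("(" + names.get(i) + ", " + names.get(j) + ") labeled " + label);
                }
            }
        }
    }
}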
Adjusted Breadth First Search (ABFS)
The adjusted breadth first search (ABFS) [17] is a variation of the breadth first
search (BFS) to determine the join sequence for a target hypergraph. ABFS supplements
BFS by including a path label for each node and an adjustment set in the search tree so
that the search is more efficient. The resulting search tree is called an ABFS tree [17].
[Figure omitted: a) a hypergraph with edges ABC, CDE, AEF, and BF over the attributes A through F, and b) its complete intersection graph.]
Figure A.1 A hypergraph & its complete intersection graph (CIG).
The node from which the search is started is called the root of the ABFS tree. A sample
ABFS tree is shown in Figure A.2.
The path label [17] for an ABFS tree node is the union of all query attributes on this
ABFS tree node and its ancestors on the search path. So the path label of an ABFS tree
node should be a superset of its parent’s path label. In the process of creating an ABFS
tree, the path labels will be used to prune or delay the expansion of subsets where the
unused nodes that are adjacent to the current endpoint of the search path do not contribute
any new query attributes to the path label. Any nodes falling into this class will be stored
in the adjustment set [17] (denoted by ASet) with a pointer to the position where they
could be added to the ABFS tree during the further search or expansion. The relevant CIG
can be applied to determine which nodes are adjacent to the current endpoint of the search
path.
The expansion of the ABFS tree will continue until the union of the path labels of
all the leaves in the current ABFS tree contains all the query attributes. If the ABFS tree
can not be expanded and the union of the path labels of all the leaves in the current ABFS
tree does not contain all the query attributes, then a node can be taken from the adjustment
set and the process can be restarted from the position pointed by this node. Note that this
process of creating an ABFS tree should terminate successfully in finite steps since all the
query attributes are in the hypergraph and can be reached eventually.
In addition, using the above approach, many different ABFS trees with the same
root may be generated. This is because the order of search is not unique. Also, there is
more than one way (such as FIFO, LIFO, or randomly) to select nodes from the
adjustment set.
Join Sequence
Finding an optimal join sequence for the selected query attributes (including the
attributes appearing in the query condition) is a crucial part of the model design
and implementation.
[Figure omitted: the adjusted BFS tree with root ABC (path label {A,B}), child BF (path label {A,B,F}), and ASet = {CDE}.]
Figure A.2. Adjusted BFS tree using the CIG of Figure A.1 with root ABC and the
query attributes ABF.
Once the ABFS tree with a given root is created, we can
determine the join sequence defined by this tree. The approach is to select a set of
paths connected to the root such that the union of the path labels contains all of
the desired query attributes. We use the following procedure to select the
appropriate paths [17]:
<Step 1.0> Set W := the set of query attributes. Go to <Step 1.1>.
<Step 1.1> Mark every leaf and its ancestors if its path label has a query
attribute that appears only once in the path labels of all leaves in the ABFS
tree. Remove the query attributes included in the path labels of the
marked nodes from W. If W is empty, stop; otherwise, go to <Step 1.2>.
<Step 1.2> If there is a contributing query attribute in the path labels of
more than one unmarked leaf with the same parent, then mark one (and
only one) of those leaves and its ancestors. Remove the query attributes in
the path labels of the marked nodes from W. If W is empty, stop;
otherwise, go to <Step 1.3>. (By a contributing query attribute we mean a
query attribute that occurs in the path label of a leaf but does not occur in
its parent’s path label.)
<Step 1.3> If there is a leaf which contains a remaining query attribute in
W with the lowest frequency, then mark this leaf and its ancestors. In case
of a tie, choose the leaf with the shortest path and mark the nodes on this
path.
It is worth noting that the approach described in the previous subsection
is not guaranteed to create the optimal ABFS tree with a given root since
the order of search in that approach is not necessarily optimal. The so-called
optimal ABFS tree with a given root is the one with the
minimum weight over all possible ABFS trees for this root. By the weight of
an ABFS tree we mean the length of the join sequence defined by the
ABFS tree.
The creation of a non-optimal ABFS tree with some given root does not
cause serious problems. On one hand, our goal is to generate an optimal
join sequence which is the one with the minimum weight over the optimal
ABFS trees for all roots. The probability of creating non-optimal ABFS
trees for all roots is very low. On the other hand, one can remove a
redundant join from the resulting join sequence at a later stage.
Another point worth noting is that we do not have to generate ABFS trees
for all roots.
We need only to generate ABFS trees for the so-called legal roots. A root
is called illegal if it does not contain any query attribute or its set of
query attributes is properly contained in the set of query attributes of an
adjacent node.
The algorithm to find a join sequence for a given hypergraph and a set of
query attributes is summarized as follows:
<Step 2.0> Create the CIG for the hypergraph. Find the set LR of all legal
roots in the CIG. Set minweight:= the number of nodes in CIG. Go to
<Step 2.1>.
<Step 2.1> If LR is empty, then stop. Otherwise, choose a root r ∈ LR, set
LR := LR - {r}, and go to <Step 2.2>.
<Step 2.2> Create an ABFS tree with root r. Find the weight and the
corresponding join sequence for this tree. If the weight is smaller than
minweight, then save this join sequence as the current best one, and
replace minweight with the weight. Go to <Step 2.1>.
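As a small illustration of <Step 2.0>, the sketch below identifies the legal roots of the CIG of Figure A.1 for the query attributes {A, B, F}; the ABFS construction of Steps 2.1 and 2.2 is not shown.

import java.util.*;

// Sketch of <Step 2.0>: find the legal roots in the CIG of Figure A.1 for
// the query attributes {A, B, F}. A root is illegal if it contains no query
// attribute or its query attributes are properly contained in those of a neighbor.
public class LegalRootsSketch {
    public static void main(String[] args) {
        Map<String, Set<Character>> schemes = new LinkedHashMap<>();
        schemes.put("ABC", Set.of('A', 'B', 'C'));
        schemes.put("CDE", Set.of('C', 'D', 'E'));
        schemes.put("AEF", Set.of('A', 'E', 'F'));
        schemes.put("BF",  Set.of('B', 'F'));
        Set<Character> queryAttrs = Set.of('A', 'B', 'F');

        List<String> legalRoots = new ArrayList<>();
        for (String r : schemes.keySet()) {
            Set<Character> rq = new HashSet<>(schemes.get(r));
            rq.retainAll(queryAttrs);
            if (rq.isEmpty()) continue;                       // no query attribute: illegal
            boolean dominated = false;
            for (String s : schemes.keySet()) {
                if (r.equals(s)) continue;
                // Adjacent in the CIG means the two schemes share an attribute.
                if (Collections.disjoint(schemes.get(r), schemes.get(s))) continue;
                Set<Character> sq = new HashSet<>(schemes.get(s));
                sq.retainAll(queryAttrs);
                if (sq.containsAll(rq) && sq.size() > rq.size()) dominated = true;
            }
            if (!dominated) legalRoots.add(r);
        }
        System.out.println("Legal roots: " + legalRoots);   // [ABC, AEF, BF]
    }
}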
Appendix B. Query Correctness
Our correctness process has been built on the issue of testing for fd-hinges [16].
To come to an understanding of what this includes, it is necessary to briefly look at the
underlying concepts of hinges and fd-hinges.
A hypergraph H is reduced if no hyperedge of H is properly contained in another
hyperedge of H. H is connected if every pair of its hyperedges is connected by some path
of hyperedges. If H is a reduced connected hypergraph with the vertex set N and edge set
E, then E’ is a complete subset of E if and only if E’ E and for each Ei in E if Ei 
attr(E’), then Ei belongs to E’. E’ is said to be a trivial subset of E if |E’| <= 1 or E = E’.
Let E’ be a complete subset of E and E1, E2  E – E’.
Figure B.1. Hinge example with E1 as the separating
edge and {E1,E2,E3,E4} as a Hinge.
Then we say E1 and E2 are connected with respect to E’ if and only if they have
common vertices not belonging to E’.
Let E’ be a nontrivial complete subset of E and 1, 2, …, p be connected
components of E-E’ with respect to E’. Then E’ has the bridge-property if and only if for
every i = 1, 2, …, p there exists Ei  E’ such that (attr(E’)NiEi, where Ni = attr(i), Ei
is called a separating edge of E’ corresponding to i. A nontrivial complete subset E’ of
E with the bridge property is call a hinge of H. An example of a hinge is shown in Figure
B.1. Note that {E2,E3,E4} is not a hinge.
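The complete-subset test above is easy to check mechanically; the following sketch applies it to the hypergraph of Figure A.1 and is an illustration only.

import java.util.*;

// Sketch of the complete-subset test: E' is complete iff every edge of E whose
// attributes all lie in attr(E') is already a member of E' (Figure A.1 hypergraph).
public class CompleteSubsetSketch {
    static boolean isComplete(Map<String, Set<Character>> edges, Set<String> ePrime) {
        Set<Character> attrEPrime = new HashSet<>();
        for (String e : ePrime) attrEPrime.addAll(edges.get(e));
        for (Map.Entry<String, Set<Character>> e : edges.entrySet()) {
            if (attrEPrime.containsAll(e.getValue()) && !ePrime.contains(e.getKey())) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Set<Character>> edges = new LinkedHashMap<>();
        edges.put("ABC", Set.of('A', 'B', 'C'));
        edges.put("CDE", Set.of('C', 'D', 'E'));
        edges.put("AEF", Set.of('A', 'E', 'F'));
        edges.put("BF",  Set.of('B', 'F'));

        // {ABC, AEF} is not complete: BF's attributes {B, F} lie inside attr(E') but BF is missing.
        System.out.println(isComplete(edges, Set.of("ABC", "AEF")));         // false
        // Adding BF makes the subset complete.
        System.out.println(isComplete(edges, Set.of("ABC", "AEF", "BF")));   // true
    }
}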
Let F be a set of functional dependencies (fds). Let TE' be the tableau defined
over the attributes in E for the schemes represented by the edges in E' ⊆ E; then
chaseF(TE') is the result of using the fds in F to chase the tableau TE'. Now let E* be the
set defined by chaseF(TE') such that E* = {Si | if wi(A) is a distinguished variable and wi ∈
chaseF(TE'), then A ∈ Si}. In other words, each element in E* corresponds to a row in the
tableau chaseF(TE') and consists of the attributes that have distinguished values in the
row. Note that by the definition of the chase algorithm each element of E* is a superset
of the corresponding element in E' that was used to initially define the row in the tableau.
Construct the hypergraph HE*,F = (attr(E), (E − E') ∪ E*). Then E' is an F-fd-hinge of a
hypergraph H when E* is a hinge of HE*,F.
In [16] we showed that an F-fd-hinge was equivalent to an embedded join dependency.
In other words, any time that a set of edges defines an F-fd-hinge, the set of relation schemes that
correspond to the edges defines a lossless join. As a result, our test for query correctness comes
down to testing the edge set to determine if it defines an F-fd-hinge.