A Deductive Database Solution to Intelligent Information Retrieval from Legacy Databases

Kayliang Ong, Natraj Arni, Christine Tomlinson, Unnikrishnan, and Darrell Woelk
Microelectronics and Computer Technology Corporation (MCC),
West Balcones Center Drive, Austin, Texas, USA
{kayliang, arni, tomlic, unniks, woelk}@mcc.com

Abstract

In this paper, we will report on the experience of building a successful industrial application using the LDL++ deductive database as part of the technology transfer process to our sponsor company, Eastman Chemical Company. We will describe the nature of the problems faced by Eastman Chemical Company and show how the LDL++ deductive database technology helps to build an Intelligent Information Retrieval System (IIRS) to solve their problems. We will also elaborate separately on the specific benefits contributed by the IIRS, Deductive Database technology and the LDL++ system. Lastly, we will share some of the invaluable lessons learned from the experience.

Proceedings of the Fourth International Conference on Database Systems for Advanced Applications (DASFAA'95), Ed. Tok Wang Ling and Yoshifumi Masunaga, Singapore, April 10-13, 1995. © World Scientific Publishing Co. Pte Ltd.

1 Introduction

The field of deductive database has received much attention as a form of new technology for integrating knowledge-based systems with relational database technology. Much of the research has focused on theories and techniques for efficient computation. The fruits of such efforts have brought about the release of several prototype systems such as Glue/Nail [6], LDL/LDL++ [2, 3, 13, 1], Coral [7], and Aditi [8]. The LDL system was released for use and experimentation by MCC shareholders and university researchers in 1989. This was followed by the release of the LDL++ system, a second generation system with an industrial-strength robust implementation and with many new and improved features based on feedback from users of the LDL system.

Despite the availability of the deductive database technology, little attention has been focused on the application of the technology to real-world problems faced in industry. There were research efforts to explore data dredging applications [12] such as [4, 10, 15] that attempted to find patterns in atmospheric and bio-genome data. However, there has been very little evidence of the value of deductive databases in the commercial/business arena. In the case of LDL/LDL++, several commercial applications are being explored right now. In particular, LDL++ is being tested as a tool for Data Cleaning and Purification [5, 9] on large telecommunication databases in collaboration with Bell Communications and Pacific Bell. In addition, LDL++ also serves as a Knowledge Discovery/Mining tool for integrating inductive Machine Learning algorithms with deductive querying capabilities [14]. More detailed results of these efforts will be reported in the future once they have matured.

In this paper, we will share our experience in building an industrial application using deductive database technology. The application is termed an Intelligent Information Retrieval System (IIRS) because it is a system for retrieving legacy information driven by a domain-specific knowledge base. The users of the system are novices; the system provides a very simple and user-friendly graphical form-based user interface for specifying a request. As there is a significant knowledge gap between the users and the underlying data format, the system has to be intelligent enough to utilize the knowledge base to process the query. The application is built as part of the technology transfer process to Eastman Chemical Company, a sponsor of the LDL/LDL++ research at MCC. The IIRS is used in their research division for the analysis of chemical compounds. As we will describe later, the whole effort is not a simple one-way process in which there is only technology transfer to the users. Instead, it is a two-way transfer process where much useful feedback was received from the users that prompted the revision and improvement of the technology itself.

2 The Problem

Eastman Chemical Company has been producing chemical products for the past 30 years. Before chemical products are manufactured, properties of these chemical compounds are examined and tested in the laboratory. Once the tests are performed, the information is recorded in a database. Thus, after 30 years, much valuable information has been accumulated. The latest version of the database has been migrated to a relational database system.

A chemical product is usually made up of a combination of compounds, which is also referred to as the composition of the chemical product. Each compound has its own special encoding and different chemical products are created
by mixing different compounds, in different quantities with different units of measurement, plus other information. Compounds are also divided into different categories, and there are domain-specific constraints depending on the categories of the compounds used to create the chemical product. The actual information about how each chemical product is composed is proprietary and will not be discussed in this paper.
Each chemical product is tested against a suite of different types of laboratory tests, and each type of laboratory test determines a property of the chemical product. Based on these properties, one can determine how the chemical products can be used. The information stored in the database includes the compounds, composition and the various properties of the chemical products. The number of compounds, chemical products and properties is relatively large. This information provides a significant competitive advantage to the company in terms of cost and response to new products in the marketplace. In particular, the information is being used repeatedly as follows:

(1) Given a composition, determine if a chemical product of the same or similar composition has been manufactured or investigated before.

(2) Given a set of properties, determine if a chemical product of the same or similar properties has been manufactured or investigated before.

(3) Given a chemical product, determine what tests have been performed before and also determine if a particular test has been performed before.
(1) is necessary to evaluate a competitive product manufactured by a rival company. First, the composition of the chemical product is determined. Then, the information from the database can help to determine if the same or similar chemical product has been manufactured or investigated before. If an identical or similar product is found, (3) can be used to retrieve prior tests on these products. The process to test a chemical product is normally very expensive, and (3) will help to avoid unnecessary tests. Furthermore, if sufficient information about the chemical product can be accumulated, the company can promptly begin manufacturing the product. Thus, this will give the company the benefits of lower costs and faster response time.
(2) is used when a customer requests the manufacturing of a chemical product based on some requirements on its properties. Again, from these constraints on the properties, the database could potentially have information about chemical products that have the same or similar property profile as the requested product. Even if only similar products are found, the information is useful to help chemists narrow down the choice of compositions. The composition can then be modified to fit the requirements of the product requested.
Unfortunately, the retrieval of information for these queries is too complex to be carried out directly by the chemists themselves. Historically, a chemist would have to consult the in-house database experts, who have knowledge of both the database configuration and domain-specific knowledge about chemicals, in order to get the necessary information. When the database experts finally get the results, they are sent back to the chemists. If the results are not what the chemists want, another request is necessary, and so on until the chemists get what is needed. This poses problems in the following ways:
Long Turn-Around Time It could take several weeks for the chemists to receive the results. The results returned by the queries may not be specific enough and thus, too many results could be generated, in hundreds of pages of output that are useless to the chemists. Furthermore, subsequent queries are normally necessary to probe further into the result. As a result, the turn-around time is extremely long, tedious and frustrating for both the chemists and the database experts.
Knowledge and Representation Gap There is a significant knowledge and representation gap between the users/chemists and the information in the underlying database. Information about the chemicals and laboratory tests is represented in a canonical form very different from the terms understood by the chemists. As a result, a chemist normally submits a request based on a set of conversation codes (sometimes informally on a piece of scratch paper) which the database experts can understand. The conversation codes are essentially a list of mappings between the encodings understood by the chemists and the actual canonical representation in the actual databases, and they are used for conversation between the users and the experts. Following that, the database expert creates a SQL query that converts the terms specified by the chemists into the canonical representations. Furthermore, it is also the responsibility of the database experts to include additional domain-specific constraints in order to retrieve the information correctly.
The long turn-around time frustrates the chemists. Often, the time required to retrieve the necessary information is so long that it does not justify their time waiting for it. Thus, they end up re-performing the laboratory tests and may incur unnecessary cost. Furthermore, requests for new products from customers must be answered in a specific time-frame, and if the information retrieval takes too long, it is simply not acceptable and requests may not be answered on time. The current setup for information retrieval also relies heavily on the experience and knowledge of the database experts. Eastman Chemical Company will face tremendous problems when these database experts retire. Their roles are not easily replaceable because each of them has accumulated a wealth of domain-specific knowledge about the conversation codes and the chemical domain. They also know how these chemicals are represented in the database systems as well as how to pose queries to retrieve information.

Figure 1: LDL++ Open Architecture
These problems have prompted Eastman Chemical Company to look into commercial products as well as MCC technologies for a solution. There was no commercial product that had both the capability to represent and perform inference on domain-specific knowledge and the capability to query against legacy databases. The LDL++ Deductive Database technology was investigated as a possible tool for building a system to facilitate their information retrieval needs.
3 The LDL++ System

The LDL++ system is a deductive database system based on the integration of a logic programming system with relational database technology. It provides a logic-based language that is suitable for both database queries and knowledge representation. More details on the LDL++ system and language can be found in [1, 3, 13, 18]. In this section, we will briefly describe some of the salient aspects of the LDL++ system as relevant to the building of the intelligent information retrieval system.
The LDL++ query language is based on Horn clause logic and an LDL++ program is essentially a set of declarative rules. For example, the following rules

    ancestor(X,Y) <- parent(X,Y).
    ancestor(X,Y) <- ancestor(X,Z), parent(Z,Y).

specify that a new relation ancestor/2 can be defined based on the relation parent/2. X, Y and Z are variables and ancestor and parent are predicate symbols. By declarativeness, we mean that the ordering between the rules is unimportant and will not affect the results returned. Deduction of all values of ancestor/2 is achieved through an iterative bottom-up execution model.
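The bottom-up execution model can be illustrated with a small sketch, written in Python rather than LDL++ purely as an illustration (the parent facts below are made up): starting from the parent facts, the two rules are applied repeatedly until no new ancestor tuples can be derived.

```python
# Naive bottom-up (fixpoint) evaluation of the two ancestor rules,
# sketched in Python for illustration. The parent facts are made up.
parent = {("abe", "bob"), ("bob", "cal"), ("cal", "dan")}

ancestor = set(parent)  # Rule 1: ancestor(X,Y) <- parent(X,Y).
while True:
    # Rule 2: ancestor(X,Y) <- ancestor(X,Z), parent(Z,Y).
    derived = {(x, y) for (x, z) in ancestor for (z2, y) in parent if z == z2}
    if derived <= ancestor:  # fixpoint reached: no new tuples derived
        break
    ancestor |= derived

print(sorted(ancestor))  # all six ancestor pairs of the chain abe-bob-cal-dan
```

Note that, in keeping with declarativeness, swapping the two rule applications does not change the final set: the iteration simply continues until the fixpoint is reached.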
The LDL++ language supports a rich set of complex data types such as complex objects, lists and sets in addition to other basic types such as integer, real and string. Examples of these complex types are rectangle(1,2), [1,2] and {1,2} respectively. Thus, the LDL++ language is ideal for representing the domain knowledge in the IIRS. Furthermore, the rule-based inference capability allows generation of complex domain-specific constraints at run-time. The language also supports the meta-query facility, which plays a very significant role in the IIRS. Based on run-time data, this facility first allows construction of rules, followed by the compilation of a query form before invoking the query. This empowers the IIRS to handle queries that are not originally specified in the rule base.
The open architecture of the LDL++ system, shown in Figure 1, meets many of the demands of the IIRS. It is "open" to the C/C++ procedural languages in two ways: it provides an Application Programming Interface (API) that allows applications to drive the system, and an External Function Interface (EFI) that allows C/C++ routines to be imported into the inference engine. It is also "open" to external databases such as Sybase, Oracle, Rdb, Ingres, and DB2¹, through its External Database Interface (EDI). Both tables in the external databases and C/C++ interface routines are modeled as predicates through the EDI and EFI respectively. As a result, these external resources are transparent to the inference engine and the IIRS can plug into different databases and procedural routines without having to make any changes to the overall implementation. This empowers the IIRS to have transparent access to data from different sources, and the front-end portion of the application does not have to change for different data sources. The EDI and EFI are also convenient for gathering data from multiple, heterogeneous databases or files.
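The idea of modeling both external tables and procedural routines uniformly as predicates can be sketched with a Python analogy. This is not the real EDI/EFI machinery, and the wrapper names and sample data are made up; it only shows why uniformity matters: once both sources yield tuples, the rules consuming them need not care which is which.

```python
# Python analogy (not the real EDI/EFI): a database table and a
# procedural routine are both exposed to the "inference engine" as
# the same thing -- an iterator of tuples.
def table_predicate(rows):
    """Wrap an external table (a list stands in for a DB cursor here)."""
    def pred():
        yield from rows
    return pred

def function_predicate(fn, domain):
    """Wrap a procedural routine as a predicate over a finite domain."""
    def pred():
        for x in domain:
            yield (x, fn(x))
    return pred

composition = table_predicate([(1000, "X..1", 20.0), (1000, "X..2", 15.0)])
double = function_predicate(lambda n: 2 * n, range(3))

# The engine consumes both identically:
print(list(composition()))  # tuples from the "external database"
print(list(double()))       # tuples from the "external function"
```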
4 The IIRS Implementation

We begin by first examining the configuration of the IIRS system. This will give an overview of the software components, how they fit together and their implementation platform. The configuration is shown in Figure 2. As shown, there is a three-tier design to the configuration. The tiers are:

• The Client Process Each individual user will have his/her own client process, which can reside on any platform.

• The Server Process The server process provides the ability to process and dispatch concurrent queries.

• The Data Repository The data could reside at any place, transparent to the client process.
¹Sybase is a trademark of Sybase Inc. Oracle and Rdb are trademarks of Oracle Inc. Ingres is a trademark of Computer Associates Inc. and DB2 is a trademark of IBM Inc.
The purpose of this three-tier design is to ensure data independence. The underlying mechanisms in the client process do not need to know where the data comes from. In the long run, it will allow for transparent migration of the data from the old repository to new ones.

The client process is a single process that includes a graphical form-based user interface (GUI) and the LDL++ engine. Additional C++ routines are imported into the inference engine to support customized predicates. The LDL++ backend communicates with the server through a SQL Access Group Call Level Interface (SAG CLI). The server is based on the Extensible Services Switch (ESS) technology developed at MCC [16]. It serves as the multiplexer that receives queries from multiple client processes and dispatches them to the appropriate data repositories. In the current setup, there are two relational databases, both residing in DEC's Rdb relational database on VAX minicomputers. These repositories are likely to be migrated to a new relational database on a new platform in the long run.
The chemists, who are novice users, manipulate graphical objects such as buttons, menus, forms, etc. provided by the graphical form-based user interface, which is implemented using the Motif graphical user interface toolkit. The underlying implementation and configuration are completely hidden from the users. They do not know where the data comes from, how the queries are composed or how results are assembled before being returned to them. This is a critical design decision, made consciously because the thought of having to use a logic-based system or language would impede their desire to use the system. Queries are automatically formulated as the users choose options in the menus and buttons. More specific information is entered by filling slots in electronic forms.
When the IIRS client process is brought up, the LDL++ schema, rules and facts (facts that represent some domain-specific knowledge) are automatically loaded into the client process. Query forms are then compiled once and are ready for querying. The C++ routines are loaded into the process on demand if used by the queries selected by the users. These routines are loaded once, and subsequent queries do not require re-loading them.
When a user has completed formulating his/her query, an LDL++ query form is instantiated with the appropriate bindings or values. The knowledge base, encoded as LDL++ rules and facts, performs inference and generates a meta-query expression represented as ground data. This meta-query expression is then fed into the LDL++ meta-query facility, which generates new rules and new query forms. These newly created query forms are subsequently compiled at run-time. The compiler performs the necessary optimization, compressing and collapsing as many rules and literals as possible to generate one or more SQL expressions as compactly as possible. These SQL expressions are then dispatched to the server process through the SQL CLI interface. Results returned from the server process are then propagated as tuples back to the GUI. Some results are filtered to provide a more mnemonic and meaningful presentation to the users. The server process accepts the SQL expression and dispatches it to the appropriate database repository. This server process can accept queries from various client processes concurrently. Thus, multiple LDL++ client processes can be executing at the same time, each serving a different user.

Figure 2: IIRS Configuration
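The step of collapsing a bound query into a single compact SQL expression can be sketched as follows. This is a hypothetical Python sketch, not the actual LDL++ compiler: the function name, the triple format, and the exact SQL shape (one self-join of the composition table per compound constraint) are illustrative assumptions.

```python
# Hypothetical sketch (not the actual LDL++ compiler) of collapsing a
# bound constraint list into one SQL statement, with one self-join of
# the composition table per compound constraint.
def constraints_to_sql(constraints):
    """constraints: list of (compound_code, high, low) triples."""
    tables, conds = [], []
    for i, (code, high, low) in enumerate(constraints):
        t = f"c{i}"  # one alias of the composition table per constraint
        tables.append(f"composition {t}")
        conds.append(f"{t}.CompoundCode = '{code}'")
        conds.append(f"{t}.CompoundQuantity BETWEEN {low} AND {high}")
        if i > 0:  # all aliases must describe the same sample
            conds.append(f"c0.SampleId = {t}.SampleId")
    return ("SELECT c0.SampleId FROM " + ", ".join(tables)
            + " WHERE " + " AND ".join(conds))

sql = constraints_to_sql([("A..1", 75.0, 45.0), ("D..4", 100.0, 10.0)])
print(sql)
```

The point of the collapsing step is that however many rules and literals the meta-query facility generates, the backend server receives a single flat SQL statement rather than one query per literal.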
5 Illustration of the Application

Even though the IIRS application is made up of different components, its core lies in the rule-based component driven by the deductive database engine. In this section, we will show and discuss a reduced, simplified subset of the actual application and illustrate it using one of the many query forms.
All the tables, regardless of the database in which they reside, are viewed transparently as LDL++ predicates. The database schema is declared as follows²:

    ess::property1(SampleId: integer,
        PropValue11: string, PropValue12: string)
    ess::property2(SampleId: integer,
        PropValue21: float, PropValue22: string,
        PropValue23: float)
    ...

There are about 100 properties, each represented as a separate table. Each record in a table represents one test on a particular property. Each property may have a different number of attributes. SampleId, PropValue11, etc. are attribute name declarations, while integer, float are column type declarations. The attribute SampleId is an index on the sample on which tests have been performed. The ess:: prefix denotes the database server where the table is residing. Each test sample is also a composition, which is made of a set of compounds. There is a primary table that stores the composition of the test sample. The schema for this table is:

    ess::composition(SampleId: integer,
        CompoundCode: string,
        CompoundQuantity: float)

Thus, a composition sample with SampleId of value 1000 could have the following entries in the composition/3 table:

    composition(1000,'X..1',20.0).
    composition(1000,'X..2',15.0).
    composition(1000,'X..3',65.0).
    composition(1000,'X..4',100.0).
    composition(1000,'X..5',40.0).
    composition(1000,'X..6',35.0).

'X..1', 'X..2', ... are canonical encodings of the compounds. If a test on property1 has been performed, there will be an entry in property1/3 as follows:

    property1(1000,15.22,'XYZ').

Various query forms are pre-compiled and available for the GUI to access. Here, we will illustrate using one query form. The objective of this query form is to find if there is a sample (in this case, finding the sample id will suffice) that satisfies a set of composition constraints. The query form is denoted in LDL++ as:

    export find_sample_id_from_composition(
        $CompConstraints, SampleId).

In LDL++, arguments prefixed with a '$' in the query form provide hints to the compiler that the argument will be bound with some values at query time. Hence, the query form indicates that $CompConstraints will be supplied with a value while the result will be bound to SampleId after the query is evaluated.

There are various important considerations. First of all, the number of composition constraints is unknown when the rule base is loaded and query forms are compiled. The chemists can input any number of constraints when filling in the form. Thus, the query form must be ready to handle a variable number of constraints. This is represented as a list of functors shown as follows:

    [compound('A1',range(75.0,45.0)),
     compound('B2',range(84.0,50.0)),
     compound('C3',range(10.0,5.0)),
     compound('D4',range(100.0,10.0)),
     compound('E5',range(100.0,10.0)), ...]

Each functor represents a compound with a compound code and a range of values. In this way, by having a list structure, an arbitrary number of constraints can be specified. The GUI is responsible for transforming the electronic form entries into the list structure.

Secondly, rules must be written to verify that the input compound codes, i.e. 'A1', 'B2', ..., etc., are indeed valid. They are checked against a knowledge base of conversation codes represented as facts. In addition, they are transformed into the canonical forms represented in the database. Thirdly, each of the compound codes is analyzed to determine to which category of chemicals it belongs. Thus, several string processing routines specialized for this application are written in C++ and imported into the rule base.

Once the various categories of chemicals are identified, the rule base validates the domain-specific constraints against the input constraints. More importantly, additional domain-specific constraints are generated and appended to the original constraints before the query is dispatched. For instance, consider the domain-specific constraint that all compounds of a given chemical category must sum to 100%. If 'B2' and 'C3' are the only compounds of the same chemical category, then we know that the query generated based on the constraints will never produce any result. Thus, the query is not evaluated any further and the query returns no solution. On the other hand, if 'A1' and 'D4' are the only compounds of the same chemical category, then an additional constraint is generated specifying that the values selected for compounds 'A1' and 'D4' must sum to 100%. This will become clear once the generated rule is shown.

Once the constraints are filtered, transformed, verified and enhanced, they are processed through a meta-query facility that takes these constraints as a form of data and generates new rules that implement these constraints. Assume that 'A1' and 'D4' are in the same chemical category while 'B2', 'C3' and 'E5' are in the same chemical category. Assume also that each compound code is transformed by adding '..' to the original encoding. Then, conceptually, the following LDL++ rule is generated at run time:

    metapred(SampleId) <-
        composition(SampleId,'A..1',V1),
        composition(SampleId,'B..2',V2),
        composition(SampleId,'C..3',V3),
        composition(SampleId,'D..4',V4),
        composition(SampleId,'E..5',V5),
        V1 <= 75.0, V1 >= 45.0, V2 <= 84.0,
        V2 >= 50.0, V3 <= 10.0, V3 >= 5.0,
        V4 <= 100.0, V4 >= 10.0, V5 <= 100.0,
        V5 >= 10.0, V1+V4 = 100,
        V2+V3+V5 = 100.

The LDL++ compiler will then rewrite this new rule and, through the SQL compression and collapsing algorithm, transform it into a compact SQL statement and dispatch it to the backend database server. Many of the system features (the meta-query facility, the SQL compression and collapsing algorithm, the external procedural interface, the external database interface, etc.) cannot be covered within the scope of this paper but will be covered in future publications [1]. In addition, this illustration is by no means comprehensive. Many details about how the conversion codes are represented and implemented, and how the constraints are processed, validated, enhanced and eventually transformed into a form suitable for the meta-query facility, are too tedious to be discussed here.

²As the information in the application is highly proprietary, table names, attribute names and values have been modified sufficiently in order not to disclose too much information. However, the made-up description should be adequate to illustrate the points.
6 Evaluation

In this section, we will discuss how the IIRS has contributed to solving the problem faced by Eastman Chemical Company and the impact and differences that have been made before and after its implementation. We will focus on the benefits realized by the IIRS users. In addition, we will discuss the role deductive database technology plays in realizing this solution. In particular, we would like to answer the question of why the deductive database approach is essential and why other approaches are less suitable. Lastly, we will highlight some of the specific contributions of the LDL++ system and we will examine some of the useful features in the LDL++ system that lead to a superior implementation of the IIRS.

6.1 Benefits of the IIRS

The specific benefits brought about by the installation of the IIRS are:

• Direct Access by Novice Users One of the major differences that the IIRS has introduced is to provide the users with the ability to access the information directly. This gives them a sense of control as well as flexibility. This includes the flexibility to access information at a time convenient to them as well as the flexibility to experiment with different queries in the way they want to. Prior to the IIRS, the information gathering process was time-consuming and users had to go through an expert. In short, the IIRS actually eliminates the knowledge gap between the users and the underlying information.

• Capturing Domain-Specific Knowledge The building of the IIRS required acquisition of domain knowledge. One of the problems that prompted the building of the IIRS was that some of the in-house experts who have been handling the queries for the chemists will be retiring. Thus, the IIRS, to a certain extent, replaces these experts. More significantly, information retrieval from the databases by the chemists does not depend on the availability of these experts any more.

• Increased Use of Legacy Information The GUI represents a significant step to encourage better use of the legacy information. Before the IIRS, each request for a query relied on the availability of the experts. Such bottlenecks discouraged requests for queries from the users. As mentioned before, re-performing laboratory experiments rather than searching for previous experimental results can be expensive. More importantly, chemists cannot take advantage of the valuable information and knowledge about manufacturing a chemical product that may be available in the legacy data. The easy-to-use GUI makes the legacy information more accessible and helps to prevent unnecessary costs.

• Information Filtering and Augmentation One of the roles of the IIRS is to provide filtering of both input and output information. Input specifications (also referred to as conversation codes) entered and understood by the users are mapped to the canonical representation in the data repositories. Furthermore, values returned from the database are filtered into a more presentable format and, if necessary, augmented with more mnemonic information. For example, the canonical representation of a compound could be replaced or augmented with the actual chemical name understood by the users.

In short, the IIRS represents a technology leap for Eastman Chemical Company as an organization. But more importantly, it improves the environment for conducting business with a higher productivity and a lower cost.
6.2
Benefits
of the Deductive
Database
Technol-
ogy
Why is deductive database technology better able to solve
the problems faced by Eastman Chemical Company than
other technologies such as the relational database or logic
programming
technology
? The IIRS could certainly
be
implemented
using C or C++ with embedded SQL on
top of a relational database. Or it can be implemented
by
extending logic programming
system such as Prolog with
the interface to databases.
However, the time, cost and
efforts would be tremendously
larger. This is due to the
following benefits made available by deductive databases:
• Knowledge-Driven Database Querying: The IIRS requires the ability to close the knowledge gap between the users and the canonical representation in the data repositories. Thus, there must be a representation language for capturing the domain-specific knowledge. The logic-based representation of a deductive database language fits this requirement very well. Secondly, inference on this knowledge is required to perform information filtering and augmenting, analysis of input specifications, and generation of domain-specific constraints. Furthermore, the application also does some form of constraint checking by pre-evaluating and pre-validating the input queries. Queries that are deemed to fail are intercepted, cancelled immediately and not dispatched to the server at all. Once the input specifications are processed by the knowledge base, the same specifications are also expressible as a query to the database. Thus, a deductive database satisfies these multiple needs of the IIRS very well in supporting a database querying environment driven by a knowledge base. Such capabilities to represent knowledge structures and perform inferences and constraint checking are not inherent in procedural languages such as C/C++, which were never designed for them. Thus, rebuilding such capabilities from scratch using C/C++ is time-consuming, and it makes no sense to spend all the resources developing yet another knowledge-based system. Furthermore, there is also the tedious job of dealing with the impedance mismatch between the C/C++-based manipulation language and the SQL querying language. Many details have to be implemented to handle the generation of the query and the conversion of data returned from the database server. On the other hand, logic programming systems can handle this knowledge representation and inference requirement equally well, but they lack the ability to query against databases and generate optimized SQL expressions from the rules. Deductive databases can perform the two tasks equally well in a seamless fashion because they have the knowledge representation and inference capability as well as the built-in facilities to perform database querying.
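The flow just described — closing the vocabulary gap, pre-validating the input, and intercepting doomed queries before any SQL is dispatched — can be sketched in a few lines. This is an illustrative approximation only; the table, column, and dictionary names (SYNONYMS, VALID_TEMP_RANGE, experiments) are invented here and are not part of the actual IIRS knowledge base.

```python
# Hypothetical sketch of knowledge-driven query pre-validation.
# Domain knowledge: user vocabulary -> canonical repository codes.
SYNONYMS = {"acetone": "CHEM-0017", "propanone": "CHEM-0017"}

# Domain-specific constraint used for pre-evaluation (illustrative).
VALID_TEMP_RANGE = (-80.0, 400.0)   # degrees Celsius

def compile_query(user_spec):
    """Translate a user specification into SQL, or reject it early."""
    chem = SYNONYMS.get(user_spec["chemical"].lower())
    if chem is None:
        # Knowledge gap cannot be closed: intercept, never hit the server.
        return None, "unknown chemical name"
    lo, hi = user_spec["temp_range"]
    if lo > hi or lo < VALID_TEMP_RANGE[0] or hi > VALID_TEMP_RANGE[1]:
        # Pre-evaluation shows the query can never succeed: cancel it.
        return None, "temperature range can never match"
    sql = ("SELECT * FROM experiments "
           f"WHERE chem_id = '{chem}' AND temp BETWEEN {lo} AND {hi}")
    return sql, None

sql, err = compile_query({"chemical": "Propanone", "temp_range": (20.0, 80.0)})
```

In a deductive database the SYNONYMS and range facts would be rules and facts in the knowledge base rather than Python dictionaries, but the division of labor is the same: inference first, database dispatch only for queries that survive it.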
• Easy Maintainability and Extensibility: As the IIRS is slowly emerging as a tool used by the chemists on a daily basis, the requirements are evolving as the users are exposed to its various functionalities. Thus, continuous modifications are necessary and new functionalities are constantly being requested. The IIRS has proven to be very flexible with respect to upgrades. The experience has shown that the time taken to build the rule-based portion of the application is short with a high-level declarative language. However, frequent iterations of revisions are necessary. Often, incomplete or incorrect knowledge is captured in the knowledge base due to misunderstandings in the knowledge acquisition process. Furthermore, since the rules determine the format of the results returned, users often request changes to the format for better presentation. Fortunately, as most of the applications are coded in high-level declarative rules, they can be easily revised, maintained and extended. Such tasks would require significantly more effort if the implementation were done in C/C++.
6.3 Benefits of the LDL++ Technology

Many of the features in the LDL++ system have been found to be critical in the implementation of the IIRS. This has confirmed the benefits of many of the design decisions made when the LDL++ system was being developed. In particular, the open architecture of LDL++ makes it possible to integrate the various implementation components, each of which has its own strengths and merits. The second important design decision was the transparent access to multiple heterogeneous legacy databases. More importantly, the experience in building the IIRS provided feedback that resulted in improvements to many aspects of the system. In particular, facilities such as the meta-query facility were developed due to the IIRS effort. The roles of these LDL++ features in the IIRS are discussed in detail below:

• Transparent Access to Heterogeneous Legacy Databases: One of the constraints when building the IIRS is that it requires no reconstruction of the legacy databases, at least in the first few years of operation. As shown in the configuration, the client process makes no assumption about the locale of the data repositories. Thus, other applications that are using the data repositories are not affected by the installation of the IIRS. Furthermore, when the data repositories are eventually migrated to a different locale, on a different vendor database, on a different platform, the implementation of the facilities on the client process, i.e. the GUI, the rules and facts, does not need to change at all. Another ability brought about by the IIRS is the ability to integrate information from different legacy databases in a transparent manner. The users are not aware of whether the query sent involves many databases, nor which part of the results comes from which database.

• Open Architecture: As shown in Figure 1, the open architecture of the LDL++ system offers various channels to access external resources in addition to the schema, rules and facts. The IIRS demands many capabilities. First, it has a GUI, which has to be implemented using the C-based Motif toolkits. Secondly, it also requires string manipulation and certain customized lower-level processing that can only be done efficiently in a procedural language such as C/C++, not in a declarative rule-based language like LDL++. In addition, the IIRS has to be able to interface with external legacy databases. The API of the LDL++ system offers the interface for the C-based implementation of the GUI. In fact, through the API, the GUI serves as the master application that drives the LDL++ engine. The C++ routines are imported into the system through the I3FI and thus, new customized predicates based on these routines can be defined. Lastly, external legacy databases can be accessed through the EDI. Without this ability to interface with various different types of components, building the IIRS would have been very difficult, maybe impossible.

• Meta-level Facility: The development of the meta-query facility originated from the need to generate SQL constraints based on input data and execute them at run-time. It was later generalized to generate rules and query forms and compile them at run-time. The meta-query facility is unique in the sense that it offers flexibility at run-time. New rules and query forms are created and deleted dynamically, and the choice of which predicate to query can be delayed and decided based on the input data values, rather than being fixed to the predicate of the query form. Due to the nature of the IIRS, where queries are driven by the knowledge base and the input data, the meta-query facility is absolutely essential for the successful implementation.
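The transparency described above — one query fanned out over several repositories, with the merged answer carrying no trace of its origins — can be sketched as follows. The sources and table names here are invented for illustration and stand in for the actual DB2/Unix repositories behind the IIRS.

```python
# Illustrative only: two "legacy databases" behind a single query
# interface; the caller never learns which rows came from which source.

SITE_A = [("CHEM-0017", 1987), ("CHEM-0042", 1990)]   # e.g. a mainframe table
SITE_B = [("CHEM-0017", 1992)]                        # e.g. a Unix RDBMS table

# Mapping from a logical table to the physical repositories holding it.
SOURCES = {"experiments": [SITE_A, SITE_B]}

def query(table, chem_id):
    """Fan the query out to every repository holding `table`, then merge."""
    rows = []
    for source in SOURCES[table]:
        rows.extend(r for r in source if r[0] == chem_id)
    return sorted(rows, key=lambda r: r[1])   # one uniform, source-free view

result = query("experiments", "CHEM-0017")
# -> [('CHEM-0017', 1987), ('CHEM-0017', 1992)]
```

Relocating a repository then only means editing the SOURCES mapping; the client-side rules, facts, and GUI are untouched, which is exactly the migration scenario described above.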
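A rough sketch of the meta-query idea follows: the query form is generated from the input data at run-time rather than fixed in advance, and even the predicate (here, the table) to query is chosen from the data itself. The table and column names are hypothetical, and real LDL++ meta-queries operate on rules rather than SQL strings.

```python
# Loose sketch of run-time query-form generation (names are invented).

def make_query_form(input_spec):
    """Build an SQL query form from whatever fields arrived at run-time."""
    # The predicate (table) to query is decided from the input data itself,
    # rather than being fixed when the program was written.
    table = "polymers" if input_spec.get("is_polymer") else "compounds"
    constraints = [f"{col} = '{val}'"
                   for col, val in sorted(input_spec.items())
                   if col != "is_polymer"]
    return f"SELECT * FROM {table} WHERE " + " AND ".join(constraints)

q = make_query_form({"is_polymer": True, "grade": "A"})
# The generated form can now be compiled, executed, and discarded.
```

The point of the sketch is the lifecycle: forms like `q` are created, used, and deleted dynamically, which is what lets knowledge-base-driven applications like the IIRS keep their queries in step with the input data.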
7 Conclusions
In developing the IIRS, we have learned some very critical lessons. The first is that further development of industrial applications is essential in order for deductive databases to mature into the commercial arena. The experience gained from these industrial applications is invaluable as a form of feedback that is not available from academic research and applications. More practical features can be discovered and designed to meet the real needs of the eventual users. Secondly, a deductive database must be designed with an open architecture in mind. It is recognized that a declarative rule-based language is more expressive for specifying queries more complex than SQL queries. However, the deductive database technology by itself does not meet all the requirements of a complete solution for the users, such as procedural processing, GUIs, etc. Thus, by having an open architecture, it serves as the glue, with the freedom to tap into other resources. Thirdly, access to legacy databases is important. Existing systems should be allowed to take advantage of deductive database capabilities quickly and directly without having to migrate all the data into a deductive database format. The migration process is cumbersome, takes time, and presents the risk of data corruption. An extensible approach that gradually enhances existing legacy databases with new functionalities in an incremental manner using deductive database techniques is more suitable. Lastly, we would like to remark on the issue of performance in the case of the IIRS. In almost all cases, the time taken by the inference engine is insignificant relative to the time taken to process the SQL queries dispatched to the server. Thus, query processing at the data repositories and the communication cost of data transfer represent the largest bottlenecks, and it is critical that optimization in deductive database compilers and engines pay more attention to such factors. As a result, to minimize communication cost, the LDL++ compiler attempts to push as many joins and selections as possible to the server by generating the SQL statement as compactly as possible.
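The effect of that push-down can be illustrated with a toy compiler step: instead of fetching two whole tables and joining them on the client, a single statement ships the join and selections to the server so only the final rows cross the network. The schema (experiments, samples, run_id, purity) is hypothetical, not the actual IIRS schema.

```python
# Hedged illustration of selection/join push-down (invented schema).

def compile_rule_to_sql(chem_id):
    # Naive plan: two round trips, then a client-side join over all rows.
    naive = [f"SELECT * FROM experiments WHERE chem_id = '{chem_id}'",
             "SELECT * FROM samples"]
    # Pushed-down plan: one compact statement; the join and the selection
    # both execute at the server, minimizing data transfer.
    pushed = ("SELECT e.run_id, s.purity "
              "FROM experiments e JOIN samples s ON e.run_id = s.run_id "
              f"WHERE e.chem_id = '{chem_id}'")
    return naive, pushed

naive, pushed = compile_rule_to_sql("CHEM-0017")
```

With the communication cost dominating, as observed above, the single pushed-down statement is the difference that matters, even when the inference engine's own work is negligible.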
In view of the successful completion of the first usable IIRS system, further applications are being explored. In particular, the LDL++ deductive database technology could be used to integrate heterogeneous databases from different departments or divisions into one uniform view and present it to the users for various decision making. Some databases are mainframe-based, such as DB2 and IMS, while others could be relational databases on Unix platforms. The LDL++ system can tap into whatever information resources are accessible by the ESS component of the IIRS. Furthermore, the LDL++ system could also be used for the migration of legacy databases into newer databases on newer platforms, such that old or erroneous data are filtered or corrected using the rule base. There is also an on-going pilot effort to perform knowledge mining on the same databases based on an approach that combines inductive learning with deductive database technology. The purpose is to attempt to discover new knowledge hidden in the huge data pool, and such discoveries may return substantial future business value in the long run.

In short, the potential of deductive database technology is tremendous, and the IIRS is one step closer to demonstrating that the technology is indeed useful in real-world applications.
References

[1] Arni, Ong, Tsur, and Zaniolo, LDL++: A Second Generation Deductive Database System, Working Paper, 1994.

[2] Chimenti, D. et al., The LDL System Prototype, IEEE Journal on Data and Knowledge Engineering, Vol. 2, No. 1, pp. 76-90, March 1990.

[3] Naqvi and Tsur, A Logical Language for Data and Knowledge Bases, W. H. Freeman Company, 1989.

[4] Muntz, R.R., E.C. Shek and C. Zaniolo, Using LDL++ for Spatio-temporal Reasoning in Atmospheric Science, Vancouver, Canada, 1993.

[5] Ong, KayLiang, Sheth, Amit and Wood, Christopher, LDL++ and Q-Data: A Practical Deductive Database in Action, Working Paper, 1994.

[6] Phipps, G., M.A. Derr and K.A. Ross, Glue-Nail: A Deductive Database System, Proc. 1991 ACM-SIGMOD Conference on Management of Data, pp. 308-317, 1991.

[7] Ramakrishnan, R., Srivastava, D. and Sudarshan, S., CORAL: A Deductive Database Programming Language, Proc. VLDB Int. Conf., pp. 238-250, 1992.

[8] Ramamohanarao, K., An Implementation Overview of the Aditi Deductive Database System, Proc. Third Int. Conference on Deductive and Object-Oriented Databases, Dec. 6-8, 1993, Scottsdale, Arizona.

[9] Tsou, E., et al., Improving Data Quality Via LDL++, ILP'93 Workshop on Programming with Logic Databases, Vancouver, Canada, 1993.

[10] Tsur, S., F. Olken and D. Naor, Deductive Databases for Genomic Mapping, Proc. NACLP Workshop on Deductive Databases, Ed. J. Chomicki, Nov. 1990.

[11] Tsur, S., Deductive Databases in Action, Proc. 10th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pp. 205-218, 1990.

[12] Tsur, S., Data Dredging, Data Engineering, Vol. 13, No. 4, IEEE Computer Society, Dec. 1990.

[13] Tsur, Arni, and Ong, The LDL++ User Guide, MCC Technical Report, Carnot-012-93(P), 1993.

[14] Shen, W., Mitbander, B., Ong, K. and Zaniolo, C., Using Metaqueries to Integrate Inductive Learning and Deductive Database Technology, AAAI Workshop on Knowledge Discovery from Databases, 1994.

[15] Wing-Kwong Wang, Logic Programming and Deductive Databases for Genomic Computations: A Comparison between Prolog and LDL, Proceedings HICSS, 1993.

[16] Woelk, D., Huhns, M., Jacob, N., Ksiezyk, T., Ong, K., Shen, W., Singh, M., and Tomlinson, C., Carnot Prototype, to appear in Object Oriented Multidatabase Systems, Ed. Omran Bukhres and Ahmed Elmagarmid, 1994.

[17] Zaniolo, C., Design and Implementation of a Logic Based Language for Data Intensive Applications, Proc. of the 5th Int. Conf. and Symp. on Logic Programming, pp. 1666-1687, MIT Press, 1988.

[18] Zaniolo, C., Intelligent Databases: Old Challenges and New Opportunities, Journal of Intelligent Information Systems, 1, pp. 271-292, Kluwer Academic, 1992.