Databases and Natural Language Interfaces
Porfírio P. Filipe 1
Nuno J. Mamede 2
1Inst. Sup. de Eng. de Lisboa, R. Conselheiro Emídio Navarro, 1949-014 Lisboa, Portugal
2CSTC/Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
Abstract. A Natural Language Interface for Databases allows users of
multimedia kiosks to formulate natural language questions. User questions are
first translated into a logic language and subsequently into Structured Query
Language (SQL), which is processed by a database management system to return
the answer. This paper focuses on the translation stage. Special attention is
devoted to the conceptual model, a relational database that organizes all the
data supporting the translation process. The translation algorithm is presented
and commented examples are used to better understand its functioning.
Keywords. language engineering, natural language interface for databases,
conceptual model, relational database, type hierarchy, translation
1 Motivation
One of the main characteristics of multimedia kiosks is their familiar visual appearance, which reduces the
complexity of communication between user and machine to a minimum. The anthroponomical
synchronisation of image, video, audio, and text is one of the crucial factors that “seduce” the user
into wanting to try the system. Another fundamental characteristic is usefulness. The user
must feel, when using the system for the first time, that it can be useful. This is only possible with a
well designed interface through which information can be easily accessed without needing to learn
another vast and complex communication language (the one used by the system).
In spite of the large variety of existing systems, a standard for these interfaces does not yet exist. As
a consequence, the user can fully understand the system only after a certain amount of time. Another
criticism of this interaction method is that in a traditional system, with navigation through several
successive windows, one cannot always get the needed information. This occurs either because the
information does not exist (and the system is unable to inform the user about it) or because the user
does not know the system’s language well enough to extract the desired information (it may take too
many steps to get there).
What could be the best browsing alternative that overcomes this limitation? The
answer could be a Natural Language Interface for Databases (NLIDB).
The evolution of technology has caused a continuous development of NLIDBs, especially in the area
of natural language processing, exploring architectures that transform the NLIDBs into relational
agents, and integrating languages and graphics that explore the advantages of both modalities [1][5].
The common citizen often needs to access information kept in databases. Almost all relational
database management systems use SQL’s SELECT instruction as the standard interface to perform
queries. This language is cumbersome for “normal” users.
An NLIDB may translate the questions from natural language into SELECT instructions. The
questions formulated by the user contain two types of information: (i) the information to be found,
i.e., what the user expects in the reply; and (ii) the conditions the reply must satisfy. The translation
process handles each type of information differently.
Replies supplied to the user may have presentation requirements (text format, use of graphs, or
videos, …) to clarify the answer, to make it more pleasant, or to complete it. If, for example, the
reply includes an address, a presentation similar to the one used in postcards will certainly be
appreciated.
2 System Architecture
The system’s architecture (see Figure 1) is based on an Intermediate Representation Language
(LIL), where the natural language question is transformed into an intermediate logical query before
the final translation into an SQL query. This language expresses the meaning of the sentence in terms
of high-level concepts, independent from database structure [2][3].
The system’s architecture can be seen as consisting of two large modules. The first module controls
natural language processing (linguistic component), where a question is submitted and successively
transformed (morphological, syntactic, and semantic analysis). One or more LIL expressions are
obtained at the end of this process. These expressions correspond to the possible interpretations of
the initial question. Given the domain’s dimension and the natural language’s flexibility, there will
usually exist several interpretations for the same question.
The second component is in charge of the connection with the database, translating the LIL
expressions into SQL expressions and sending them to the database management system to produce
the answers.
The main advantage of this architecture is the complete separation between the linguistic component
and the database knowledge. The portability of the system to other relational databases is guaranteed
by the conceptual model’s configuration.
The translation is performed in stages: natural language into a syntactic tree, then into a logical formula, and
finally into a SELECT instruction. Communication with the database management system, which
generates the replies, is carried out through an ODBC driver.
Fig. 1. Architecture of a Natural Language Interface for Databases. (Diagram not reproduced. Its
labels show a natural language question (Portuguese, French, English, ...) passing through
morphological, syntactic, and semantic analysis, with words information, a syntactic structure
(syntactic tree), and a logical query (LIL language) as intermediate results; the LIL/SQL translation
produces an SQL query (SELECT) for the DBMS, which consults the domain database to return the
answer. The conceptual model comprises a type hierarchy and a general representation.)
The conceptual model (CM) is the component of an NLIDB that symbolically represents the
constraints associated with the application’s domain [6]. The conceptual model has two components:
(i) a type hierarchy — contains information on inheritance between classes of entities, and (ii) a
general representation — an explicit representation of the domain’s conceptual constraints.
During the semantic analysis it is necessary to verify whether every question respects the conceptual
constraints of the domain database, which may help choose an adequate interpretation for the
original question. For example, the question “which hotels have swimming pools with salt-water”
has two possible interpretations: the “salt-water” can be related to the “hotel” or to the
“swimming-pool”. A request to the conceptual model asking if the “hotel” entity has the
“salt-water” property will fail if the fact “hotel has salt-water” is irrelevant and, consequently, not
represented in the conceptual model. When a question does not respect these domain constraints,
we say it is semantically incorrect.
3 The Source Language
The Logic Interface Language (LIL) is inspired by the formalism presented in the MASQUE/SQL
project [3] and was first described in [9][11]. The LIL’s syntax is similar to that of first-order logic
and has the expressive power of a predicate logic that allows representing real world concepts:
(a) A term may be a constant or a variable or a function symbol applied to a tuple of terms.
Constants and variables represent world objects, including abstract objects, such as events
and situations. Examples of constants are Ritz, sauna, -15, and 1997; examples of
variables are _3 and X.
(b) A primitive formula contains a predicate symbol (written as a term) and one or two
arguments (terms). A primitive formula is written as pred_symbol(term1[,term2]).
Examples of primitive formulas are hotel(X) and have(X,sauna).
(c) A LIL expression contains a set of LIL formulas separated by commas (denoting
conjunction), and has the following syntax: formula1,…,formulan. An example of a LIL
expression is author(X), write(X,book).
(d) Logical connectives, conjunction (&) and disjunction (V), glue LIL formulas. The valid syntax
is: V(pred1(term1,…),…,predn(termn1,…)), and &(pred1(term1,…),…,predn(termn1,…)).
An example of a formula with logical connectives: &(have(X,restaurant),have(X,sauna)).
(e) Predefined formulas have the following syntax: EXACT(term1,termcte) (equal),
SUP(term1,termcte) (greater), and INF(term1,termcte) (lesser).
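As an illustration of the syntax above, the following minimal Python sketch represents LIL terms and formulas as plain data and renders them back to LIL text. The class names (Const, Var, Formula) and the render helpers are hypothetical; they are not part of the system described in this paper.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Const:
    name: str  # e.g. "sauna", "Ritz", "1997"


@dataclass(frozen=True)
class Var:
    name: str  # e.g. "X", "_3"


@dataclass(frozen=True)
class Formula:
    pred: str                  # predicate symbol, or a connective "&" / "V"
    args: Tuple["Node", ...]   # terms, or nested formulas for connectives


Node = Union[Const, Var, Formula]


def render(node: Node) -> str:
    """Render a term or a formula in LIL surface syntax."""
    if isinstance(node, (Const, Var)):
        return node.name
    return f"{node.pred}({','.join(render(a) for a in node.args)})"


def render_expression(formulas) -> str:
    """A LIL expression is a comma-separated conjunction of formulas."""
    return ", ".join(render(f) for f in formulas)


X = Var("X")
expr = [
    Formula("hotel", (X,)),
    Formula("&", (Formula("have", (X, Const("restaurant"))),
                  Formula("have", (X, Const("sauna"))))),
]
print(render_expression(expr))
```

Variables are modelled as a separate class because the surface form alone does not distinguish them from constants (both X and Ritz start with a capital letter).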
4 The Object Language
The questions posed to an NLIDB system may contain three kinds of information that are used to
identify: (a) the properties that are relevant to the answer; (b) the selection condition; and (c) how the
reply is to be sorted. All these components are optional, i.e., may be unspecified.
For example, in the question “which hotels have sauna?”, the relevant property is the “hotel
name” and the selection condition is “have sauna”. Note that the original question does not identify
any relevant property, so it must be deduced during the translation process.
The translation from LIL into SQL involves the identification of the above components and
transforming them into the syntax of the SELECT instruction. Figure 2 describes the syntax of the
SELECT instruction, relating it with the contents of the natural language question: (a) the names (of
columns) or expressions (that involve columns) that follow the SELECT keyword specify the
properties that are relevant to the reply; (b) the tables that follow the FROM keyword specify the
entities referred to in the question; (c) the logical expressions that follow the WHERE and HAVING
keywords handle the conditions the reply must satisfy; (d) following the ORDER BY and GROUP
BY keywords is the definition of the output sorting; (e) the clause FOR UPDATE OF is
meaningless to the translation process.
Fig. 2. Syntax of the SELECT instruction (SQL language). (Diagram not reproduced. Its labels
annotate the clauses of the SELECT instruction as answer properties, question entities, answer
condition, answer organization, and “don't interest”.)
The WHERE component is the most important considering our objectives. It may be necessary to
use the HAVING component instead of the WHERE component when the logical expression
calls SQL functions.
5 Translation
The LIL-SQL translator [10] is based on several mapping tables, which are highly dependent on
database organization. This process is very efficient and, more importantly, it can be used with any
relational database; only the contents of the conceptual model need to be updated.
The question “which hotels do have sauna?” is translated to the following LIL expression:
hotel(X), have(X,sauna), which is then transformed into SQL, for example: SELECT
HOTEL.NAME FROM HOTEL WHERE HOTEL.QT_SAUNA>0. The answer may be produced using
nominal lists of tourism resources, texts, and graphics.
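To make the example concrete, the sketch below runs the generated SELECT against an in-memory SQLite database. The HOTEL table with its NAME and QT_SAUNA columns comes from the example; the rows and the use of SQLite (rather than the ODBC connection used by the system) are assumptions made only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HOTEL (NAME TEXT, QT_SAUNA INTEGER)")
# Made-up sample rows; only the column names come from the example.
conn.executemany("INSERT INTO HOTEL VALUES (?, ?)",
                 [("Ritz", 2), ("Avenida", 0), ("Tivoli", 1)])

# SQL produced for the LIL expression hotel(X), have(X,sauna).
sql = "SELECT HOTEL.NAME FROM HOTEL WHERE HOTEL.QT_SAUNA>0"
names = sorted(row[0] for row in conn.execute(sql))
print(names)
```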
5.1 Auxiliary Data Structure
The data structure that sustains the LIL/SQL translation contains information that depends on the
implementation of the domain database.
Fig. 3. Relational data model of the expanded conceptual model.
Instead of creating a new data structure, we decided to extend the relational database that stored the
conceptual model. The entity relationship diagram of that structure is presented in Figure 3. The
tables and columns used to support translation (shaded) depend on how each concept (already
belonging to the conceptual model) is represented in the domain database.
Representing Symbols
The representation of each symbol includes a field to keep its type. In the simplest case, the type of
each symbol corresponds to an SQL type such as INTEGER, DATE, or CHAR. We use the _TYPE_
column of the SYMBOL table to represent the type of each symbol. The meaning of the column
SYMBOL._TYPE_ is the following:
(a) if the symbol is a property, then it denotes the SQL data type;
(b) if the symbol is an entity, then it denotes the symbol represented by a view or a table;
(c) if the symbol is an equivalent symbol, then it denotes a format function.
The table SYMBOL also has the column _CONDITION_ to represent a condition with the following
meaning:
(a) if SYMBOL._TYPE_ is VIEW, it denotes a view and the _CONDITION_ value contains the
SELECT instruction used to generate the view;
(b) in the remaining cases, the _CONDITION_ value is a restriction (SQL logical condition).
Representing Associations
The requirement that each association’s type be represented is satisfied through the inclusion of the
ASSOCIATION._TYPE_ column, enabling the expression of its direction (direct or inverse) and the
side of N-ary associations having cardinality N: (a) character ‘D’, in the ASSOCIATION._TYPE_
column, denotes a direct association; (b) character ‘I’ denotes an inverse association. These
associations are represented in the conceptual model only to extend the natural language vocabulary
available to formulate questions.
In the presence of several associations with the same arguments, only the association defined first is
represented in the ASSOCIATION._ASSOC_ column. These associations, called base associations,
have a type ‘B’ (column ASSOCIATION._TYPE_) instead of ‘D’.
All associations with the same arguments, direct or inverse, are considered equivalent, being
convertible into the corresponding base association (the arguments of the associations in the inverse
direction are changed). When there are multiple associations with distinct meanings one of the entities
has to be defined as equivalent in the EQ_SYMBOL table [7].
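The conversion of equivalent associations into their base association can be sketched as follows. This is a minimal illustration that assumes the _ASSOC_ column holds the name of the base association; the association names own and belong_to, and the dictionary standing in for the ASSOCIATION table, are hypothetical.

```python
# Hypothetical rows of the ASSOCIATION table: name -> (_TYPE_, _ASSOC_).
ASSOCIATION = {
    "have":      ("B", "have"),  # base association
    "own":       ("D", "have"),  # direct association with the same arguments
    "belong_to": ("I", "have"),  # inverse association: arguments are swapped
}


def to_base(assoc, arg1, arg2):
    """Rewrite assoc(arg1, arg2) in terms of its base association."""
    kind, base = ASSOCIATION[assoc]
    if kind == "I":
        arg1, arg2 = arg2, arg1
    return (base, arg1, arg2)


print(to_base("belong_to", "sauna", "hotel"))
```

With this normalisation, only the base association needs a translation into SQL; the other associations merely widen the vocabulary.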
Representing Default Attributes and Default Entities
Questions that have a set of entities as reply are common. For example, “which hotels are located
in Lisbon?”. Note that this question cannot be directly translated into a SELECT instruction because
the hotel property that should be part of the reply has to be determined. This corresponds to
determining the relevant column in the HOTEL table.
The DEFAULT table (a detail of the SYMBOL table) helps the translator to determine the properties
implicit in a question. This table allows memorizing which property (SYMBOL2 column) represents an
entity by default (SYMBOL1 column). Default entities and default properties are represented in the
same way: column DEFAULT.SYMBOL2 contains the default entity associated with the property
represented in column DEFAULT.SYMBOL1.
Representing Values
We represent in the conceptual model values that belong to the domain database. This is done to
minimize the mismatch between the domain database and the conceptual model. Ideally, when the
definition of the domain database is generated from the conceptual model, the data necessary to
support the translation contains only meta-information about the domain database (mainly definitions
of database keys). Whenever it is necessary to adapt the NLIDB to an existing domain, we also
have to represent some data belonging to the domain database, such as “codes” and
“abbreviations”. The relational conceptual model was enlarged with the OCCURRENCE table, a detail
of the SYMBOL table.
The OCCURRENCE.OC_ENTITY column refers to an entity, the OCCURRENCE.OC_ATTRIBUTE column
refers to a relevant attribute, the OCCURRENCE.CODE column contains one code or one abbreviation
valid for that attribute, and the OCCURRENCE.VALUE column contains the represented data.
For example, we may represent the fact “red vehicle” in the database as ‘R’ if the
OCCURRENCE.OC_ENTITY column refers to the ‘vehicle’ entity, the OCCURRENCE.OC_ATTRIBUTE
column to the ‘color’ property, the OCCURRENCE.CODE column holds the ‘R’ code and the
OCCURRENCE.VALUE column contains the ‘red’ string.
Representing Proper Names
The representation of proper names is similar to the representation of values. An entity’s proper
name is stored in the OCCURRENCE table. To guide the search at execution time, at least one proper
name for each entity must be included in this table. The representation of proper names to support
translation is only possible in domains with few proper names. Otherwise, it is necessary to search, at
execution time, the domain database to verify which are the table and the column to which the proper
name corresponds.
To translate the question “which are the books by José Saramago?” one has to determine that
‘José Saramago’ is the proper name of a writer. To find which table and column contain that
proper name, it is necessary to search the OCCURRENCE.VALUE column. The result of this search is the
pair formed by the OC_ATTRIBUTE and OC_ENTITY columns of the OCCURRENCE table.
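The two lookups on the OCCURRENCE table (from a value to its code, and from a proper name to its entity and attribute) can be sketched with an in-memory SQLite table. The column names follow the text; the ‘writer’ entity, its ‘name’ attribute, the NULL code for proper names, and the locate helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE OCCURRENCE (
    OC_ENTITY TEXT, OC_ATTRIBUTE TEXT, CODE TEXT, VALUE TEXT)""")
conn.executemany("INSERT INTO OCCURRENCE VALUES (?, ?, ?, ?)", [
    ("vehicle", "color", "R", "red"),            # coded value
    ("writer", "name", None, "José Saramago"),   # proper name, no code assumed
])


def locate(value):
    """Return (OC_ENTITY, OC_ATTRIBUTE, CODE) for rows whose VALUE matches."""
    cur = conn.execute(
        "SELECT OC_ENTITY, OC_ATTRIBUTE, CODE FROM OCCURRENCE WHERE VALUE = ?",
        (value,))
    return cur.fetchall()


print(locate("red"))
print(locate("José Saramago"))
```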
Representing Keys
A key is made of a set of columns that uniquely identify a row of a table of the relational model.
When the key belongs to the table we say that it is the primary key of that table. When a table
contains the primary key of another table we say it is a foreign key. A foreign key is a way of relating
tables. The representation of primary and foreign keys is essential to determine the translation
between associations. The primary keys are represented in the KEY_PK table, and the foreign keys in
the KEY_FK table.
5.2 Auxiliary Functions
The LIL/SQL translation needs to know how each concept is represented in the domain database.
To help achieve that goal we defined five functions that always return a character string.
Function TRANSLATION(X,Y,Z)
All the arguments are symbols. The first argument refers to an association, the second to an entity, and the
third argument to either an entity, a property, a property value, or a proper name. Alternatively, the ‘_’
character may be used, anywhere, to express the concept “any symbol”. This function returns the
translation of the fact (part of the question) specified in the arguments.
Function CLASS(X)
Using the column CLASSIFICATION.DESG_CLASS, this function informs if the argument, a symbol,
is represented as a table or as a column in the domain database (see footnote 1).
Function DEFAULT(X)
This function is used to get the default properties or default entities associated with the argument, an
entity or a property: if the argument is a property it returns the default entity, but if the argument is an
entity it returns the default property. This function returns the symbol represented in column
DEFAULT.SYMBOL2 that is associated with column DEFAULT.SYMBOL1.
Function EQUIVALENT(X)
The argument can be either a property or an entity, and the return value is the expression that was
used to define the argument as an equivalent symbol of another symbol (see footnote 2). If there is an
equivalent symbol, it returns the value (the name of the equivalent function) stored in the
CLASSIFICATION.DESG_CLASS column and the arguments stored in the EQ_SYMBOL table.
Otherwise it returns the argument itself.
Function FORMATTER(X)
This function is used to provide a function to correctly display the reply: the argument is a property,
and the return value is the function name stored in the SYMBOL._TYPE_ column.
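A minimal sketch of four of these functions, backed by in-memory dictionaries instead of the conceptual model tables, is given below. The dictionary contents, the Python function names (CLASS cannot be used because class is a reserved word), and the upper-casing of the resulting column name are assumptions made for illustration.

```python
# Hypothetical excerpts of the conceptual model, keyed by symbol.
CLASSIFICATION = {"hotel": "Table", "name": "Column", "star": "Column"}
DEFAULT = {"hotel": "name", "name": "hotel", "star": "hotel"}  # SYMBOL1 -> SYMBOL2
EQ_SYMBOL = {}                       # symbol -> equivalent expression, if any
SYMBOL_TYPE = {"name": "CHAR(60)"}   # SYMBOL._TYPE_ of a property


def class_of(symbol):
    """CLASS(X): 'Table' or 'Column'."""
    return CLASSIFICATION[symbol]


def default_of(symbol):
    """DEFAULT(X): default property of an entity, or default entity of a property."""
    return DEFAULT[symbol]


def equivalent(symbol):
    """EQUIVALENT(X): the equivalent expression, or the symbol itself."""
    return EQ_SYMBOL.get(symbol, symbol)


def formatter(symbol):
    """FORMATTER(X): formatting information stored for a property."""
    return SYMBOL_TYPE[symbol]


# DEFAULT(name).EQUIVALENT(name), upper-cased to match the SQL in the text.
column = f"{equivalent(default_of('name'))}.{equivalent('name')}".upper()
print(column, formatter("name"))
```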
5.3 The Translation Algorithm
The translation of LIL into SQL is based on the assumption that the final SQL expression can be
obtained by appending the partial translations of each formula belonging to a LIL expression. LIL
formulas with a variable argument identify properties, columns, or tables containing values the reply
must exhibit. The translation of these formulas defines the SELECT or FROM clauses, while the
remaining formulas define the FROM or WHERE clauses, i.e., the conditions the reply must satisfy.
After translating all the formulas belonging to a LIL expression, it is necessary to verify whether the
FROM clause contains all the tables referenced in the SELECT and in the WHERE clauses.
When an absence is discovered, a comma and the missing table are appended to the FROM
clause. A similar operation must also be performed to identify tables that have no column referenced
in the SELECT clause.
The translation process uses eight rules: one to replace variables with their types; two to translate
primitive formulas; one to translate formulas that have logical connectives; one to handle predefined
formulas; two to guarantee that all the referenced tables and columns belong to the FROM and
SELECT clauses; and one to handle formatting.
Rule 1 — Variable Substitution
During the translation process it is assumed that variables are replaced by their types, i.e., the names
of the classes they belong to. The LIL expression: pred1(X1), pred2(X1,term2), after applying
the substitution X1/pred1 (assuming pred1 is the type of the variable X1), is transformed into
pred2(pred1,term2). When the substitution takes place we will refer to the type of X1 as X1cte.
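Variable substitution can be sketched in a few lines of Python. Formulas are modelled here as (predicate, arguments) tuples and the set of variable names is passed explicitly; both choices, and the function name substitute, are assumptions of this sketch rather than part of the original implementation.

```python
def substitute(expression, variables):
    """Replace each variable by its type, i.e. the predicate of the unary
    formula in which the variable appears."""
    types = {args[0]: pred for pred, args in expression
             if len(args) == 1 and args[0] in variables}
    result = []
    for pred, args in expression:
        if len(args) == 1 and args[0] in variables:
            result.append((pred, args))  # kept: it declares the variable's type
        else:
            result.append((pred, tuple(types.get(a, a) for a in args)))
    return result


expr = [("name", ("X",)), ("hotel", ("Y",)),
        ("have", ("Y", "name")), ("have", ("Y", "star")),
        ("EXACT", ("star", "5"))]
print(substitute(expr, {"X", "Y"}))
```

The unary formulas are kept in the result because they are still translated later on; only the occurrences of the variables inside the other formulas are rewritten.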
Footnote 1: This description is not complete, but a full description can be found in [6].
Footnote 2: The symbols’ classification is stored in the column CLASSIFICATION.DESG_CLASS. If it is a
function, the table EQ_SYMBOL contains a line referring to each argument.
Rule 2 — Unary Primitive Formulas
The translation of a primitive formula with one argument (prd(trm)) identifies a table or an
expression that involves columns. There are two possibilities, depending on the value returned by the
evaluation of CLASS(prd):
(a) if the returned value is ‘Table’, then the value returned after evaluating EQUIVALENT(prd) is
concatenated, with a comma, to the FROM clause.
    If trm is a constant (a proper name) then the value returned after evaluating
    TRANSLATION(_,prd,trm) is concatenated, with the ‘AND’ operator, to the WHERE clause;
(b) if the returned value is ‘Column’, then
    If the LIL expression contains a formula of the type pred2(trm2,prd)
        If trm is a constant (value of a property) then the value returned after evaluating
        TRANSLATION(pred2,trm2cte,prd(trm)) is concatenated, with the ‘AND’ operator, to
        the WHERE clause;
        If trm is a variable then if the value returned by EQUIVALENT(prd) contains a function
        symbol then the value returned after evaluating EQUIVALENT(prd) is concatenated, with
        a comma, to the SELECT clause;
        otherwise EQUIVALENT(trm2cte).EQUIVALENT(prd) is evaluated and its value is
        concatenated, with a comma, to the SELECT clause.
    If the LIL expression does not contain a formula of the type pred2(trm2,prd)
        If the value returned by EQUIVALENT(prd) contains a function symbol then the value
        returned after evaluating EQUIVALENT(prd) is concatenated, with a comma, to the
        SELECT clause;
        otherwise the value returned after evaluating DEFAULT(prd).EQUIVALENT(prd) is
        concatenated, with a comma, to the SELECT clause.
Rule 3 — Binary Primitive Formulas
The translation of a primitive formula with two arguments (prd(trm1,trm2)) returns an expression,
the evaluation of TRANSLATION(prd,trm1cte, trm2cte), that must be concatenated, using the ‘AND’
operator, to the WHERE clause.
Rule 4 — Logical Connectives
The translation of a formula using the logical connective and (&(prd1(trm2,trm3),…,prda(trmb,trmc)))
returns an expression, the evaluation of TRANSLATION(prd1cte,trm2cte,trm3cte) AND
… AND TRANSLATION(prdacte,trmbcte,trmccte), that must be concatenated, using the ‘AND’
operator, to the WHERE clause. The translation of a formula using the logical connective or
(V(prd1(trm2,trm3),…,prda(trmb,trmc))) returns an expression, the evaluation of
(TRANSLATION(prd1cte,trm2cte,trm3cte) OR … OR TRANSLATION(prdacte,trmbcte,trmccte)),
that must be concatenated, using the ‘AND’ operator, to the WHERE clause.
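The following sketch illustrates Rule 4, with a dictionary playing the role of the TRANSLATION function. The dictionary entries are taken from the swimming-pool example of Section 6; the helper names translate_connective and and_to_where are hypothetical.

```python
# Hypothetical lookup standing in for TRANSLATION(association, entity, symbol).
TRANSLATION = {
    ("have", "hotel", "pool"): "HOTEL.POOL='S'",
    ("have", "hotel", "sauna"): "HOTEL.QT_SAUNA>0",
}


def translate_connective(connective, formulas):
    """Translate &(...) or V(...) over binary formulas, after substitution."""
    keyword = {"&": " AND ", "V": " OR "}[connective]
    parts = [TRANSLATION[(pred, a1, a2)] for pred, (a1, a2) in formulas]
    joined = keyword.join(parts)
    # A disjunction is parenthesised so it can be safely ANDed to the WHERE clause.
    return f"({joined})" if connective == "V" else joined


def and_to_where(where, condition):
    """Concatenate a condition to the WHERE clause using the AND operator."""
    return condition if not where else f"{where} AND {condition}"


formulas = [("have", ("hotel", "pool")), ("have", ("hotel", "sauna"))]
condition = translate_connective("V", formulas)
print(and_to_where("HOTEL.STAR=5", condition))
```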
Rule 5 — Predefined Formulas
The translation of a predefined formula with the EXACT predicate (EXACT(trm,cte)) concatenates,
with the ‘AND’ operator, to the WHERE clause the value returned after evaluating an expression. The
latter depends on the LIL expression: (a) if it contains a formula of the type pred2(trm2,trm) the
evaluated expression is EQUIVALENT(trm2cte).EQUIVALENT(trmcte) = cte otherwise, (b)
expression DEFAULT(trmcte).EQUIVALENT(trmcte) = cte is evaluated.
The translation of a predefined formula with the SUP predicate (SUP(trm,cte)) concatenates, with
the ‘AND’ operator, to the WHERE clause the value returned after evaluating an expression. The
latter depends on the LIL expression: (a) if it contains a formula of the type pred2(trm2,trm) the
evaluated expression is EQUIVALENT(trm2cte).EQUIVALENT(trmcte) > cte otherwise, (b)
expression DEFAULT(trmcte).EQUIVALENT(trmcte) > cte is evaluated.
The translation of a predefined formula with the INF predicate (INF(trm,cte)) concatenates, with
the ‘AND’ operator, to the WHERE clause the value returned after evaluating an expression. The
latter depends on the LIL expression: (a) if it contains a formula of the type pred2(trm2,trm) the
evaluated expression is EQUIVALENT(trm2cte).EQUIVALENT(trmcte) < cte otherwise, (b)
expression DEFAULT(trmcte).EQUIVALENT(trmcte) < cte is evaluated.
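Since the three predefined predicates differ only in the comparison operator, Rule 5 can be sketched with a single function. In this sketch EQUIVALENT is assumed to be the identity, column names are upper-cased to match the SQL shown in the examples, and the DEFAULT_ENTITY dictionary and the function name are hypothetical.

```python
OPERATORS = {"EXACT": "=", "SUP": ">", "INF": "<"}
DEFAULT_ENTITY = {"star": "hotel"}  # hypothetical excerpt of the DEFAULT table


def translate_predefined(pred, prop, constant, expression):
    """Return the WHERE fragment for pred(prop, constant).

    `expression` holds the other formulas as (predicate, args) tuples,
    already rewritten by variable substitution.
    """
    # Case (a): a binary formula pred2(trm2, prop) names the owning entity.
    owner = next((args[0] for p, args in expression
                  if len(args) == 2 and args[1] == prop and p not in OPERATORS),
                 None)
    # Case (b): otherwise use the default entity of the property.
    if owner is None:
        owner = DEFAULT_ENTITY[prop]
    return f"{owner.upper()}.{prop.upper()} {OPERATORS[pred]} {constant}"


with_owner = translate_predefined("EXACT", "star", "5", [("have", ("hotel", "star"))])
by_default = translate_predefined("SUP", "star", "3", [])
print(with_owner)
print(by_default)
```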
Rule 6 — Missing tables
If the FROM clause does not contain all the tables referenced in the SELECT and WHERE clauses,
then all the missing tables are appended, using commas, to the FROM clause.
Rule 7 — Missing columns
This rule searches for the tables that are referenced in the LIL expression with a variable argument and
that do not have any of their columns included in the SELECT clause. Each table found
(named T) is processed as follows: the value returned after
evaluating EQUIVALENT(T).DEFAULT(T) is concatenated, with a comma, to the SELECT clause.
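Rules 6 and 7 can be sketched as two small completion steps over the clauses built so far. The sketch assumes that columns are always written as TABLE.COLUMN and that string constants in the WHERE clause contain no dots; the function names, the CITY table, and the default-property dictionary are hypothetical.

```python
import re


def missing_tables(select, from_, where):
    """Rule 6: tables referenced as TABLE.COLUMN but absent from FROM."""
    text = " ".join(select) + " " + where
    found = []
    for table in re.findall(r"\b([A-Za-z_]\w*)\.[A-Za-z_]\w*", text):
        if table not in from_ and table not in found:
            found.append(table)
    return found


def add_default_columns(select, variable_tables, default_property):
    """Rule 7: give each table with no selected column its default column."""
    for table in variable_tables:
        if not any(col.startswith(table + ".") for col in select):
            select.append(f"{table}.{default_property[table]}")
    return select


print(missing_tables(["HOTEL.NAME"], [], "CITY.NAME='Lisboa'"))
print(add_default_columns([], ["HOTEL"], {"HOTEL": "NAME"}))
```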
Rule 8 — Optional formatting
Applies the FORMATTER function to each column of the SELECT clause, or to the values returned
by the equivalence functions, obtaining the final format for the properties.
Algorithm
Start with an empty SELECT SQL query
With each LIL formula do
begin
    Try to apply rule 1
    Select the appropriate rule from the set {2,3,4,5}
    Apply the selected rule
end
Try to apply rule 6 and rule 7
Apply rule 8
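The overall control flow can be sketched as a compact Python function. This is a simplified illustration, not the original implementation: it covers only Rules 1, 2, 3, 5, and an approximation of Rule 7, treats EQUIVALENT as the identity, and relies on hypothetical dictionaries (CLASS, DEFAULT, TRANSLATION) in place of the conceptual model.

```python
# Hypothetical conceptual-model excerpts (assumptions, not the paper's data).
CLASS = {"name": "Column", "hotel": "Table", "star": "Column"}
DEFAULT = {"hotel": "name", "name": "hotel", "star": "hotel"}
TRANSLATION = {("have", "hotel", "name"): "", ("have", "hotel", "star"): ""}
OPERATORS = {"EXACT": "=", "SUP": ">", "INF": "<"}


def translate(expression, variables):
    """Translate a list of (predicate, args) LIL formulas into a SELECT string."""
    select, from_, where = [], [], []

    # Rule 1: replace each variable by its type (its unary predicate).
    types = {a[0]: p for p, a in expression if len(a) == 1 and a[0] in variables}
    subst = [(p, a if len(a) == 1 else tuple(types.get(t, t) for t in a))
             for p, a in expression]

    def column(prop):
        # Qualify a property with the entity it is associated with in a
        # binary formula, falling back to its default entity.
        owner = next((a[0] for p, a in subst
                      if len(a) == 2 and a[1] == prop and p not in OPERATORS),
                     DEFAULT[prop])
        return f"{owner}.{prop}".upper()

    for pred, args in subst:
        if pred in OPERATORS:                            # Rule 5
            where.append(f"{column(args[0])}{OPERATORS[pred]}{args[1]}")
        elif len(args) == 1 and CLASS[pred] == "Table":  # Rule 2(a)
            from_.append(pred.upper())
        elif len(args) == 1:                             # Rule 2(b), variable argument
            select.append(column(pred))
        else:                                            # Rule 3
            condition = TRANSLATION[(pred, *args)]
            if condition:
                where.append(condition)

    # Rule 7 (approximated): each table in FROM needs a column in SELECT.
    for table in from_:
        if not any(c.startswith(table + ".") for c in select):
            select.append(f"{table}.{DEFAULT[table.lower()].upper()}")

    sql = f"SELECT {', '.join(select)} FROM {', '.join(from_)}"
    return sql + (f" WHERE {' AND '.join(where)}" if where else "")


expr = [("name", ("X",)), ("hotel", ("Y",)), ("have", ("Y", "name")),
        ("have", ("Y", "star")), ("EXACT", ("star", "5"))]
sql = translate(expr, {"X", "Y"})
print(sql)
```

Keeping the three clauses as separate lists until the end mirrors the idea of appending partial translations, and makes the completion rules simple membership checks.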
6 Examples
To demonstrate the application of the LIL/SQL translation algorithm we will present a couple of
commented examples. A question mark will be used to represent an empty SQL clause. So, the
initial SELECT query is: SELECT ? FROM ?.
“What are the names of the hotels having five stars?”
LIL formula: name(X), hotel(Y), have(Y,name), have(Y,star), EXACT(star,5)
Initial SQL query:
SELECT ? FROM ?
Translation of: name(X), using Rule 2:
since the call to function CLASS(name) returns ‘Column’, and the LIL formula also contains
have(Y,name), the substitution rule produces have(hotel,name); the expression
EQUIVALENT(hotel).EQUIVALENT(name) is evaluated and its value (HOTEL.NAME) is
appended to the SELECT clause. The new SQL query is: SELECT HOTEL.NAME FROM ?
Translation of: hotel(Y), using Rule 2:
since the call to function CLASS(hotel) returns ‘Table’, the expression
EQUIVALENT(hotel) is evaluated and its value (HOTEL) is appended to the FROM clause. Then
the new SQL query is: SELECT HOTEL.NAME FROM HOTEL
Translation of: have(Y,name), using Rule 3:
since the LIL formula also contains hotel(Y), the substitution rule produces
have(hotel,name). The expression TRANSLATION(have,hotel,name) is evaluated and its
value (an empty string, assuming that table HOTEL has the NAME column) is appended to the
WHERE clause, so the SQL query remains unchanged.
Translation of: have(Y,star), using Rule 3:
since the LIL formula also contains hotel(Y), the substitution rule produces
have(hotel,star). The expression TRANSLATION(have,hotel,star) is evaluated and its
value (an empty string, assuming that table HOTEL has the column STAR) is appended to the
WHERE clause, so the SQL query remains unchanged.
Translation of: EXACT(star,5), using Rule 5:
since the LIL formula also contains have(Y,star), the substitution rule produces
have(hotel,star). The expression EQUIVALENT(hotel).EQUIVALENT(star)=5 is
evaluated and its value (HOTEL.STAR=5) is appended to the WHERE clause. The final SQL
query is: SELECT HOTEL.NAME FROM HOTEL WHERE HOTEL.STAR=5
The result of evaluating FORMATTER(NAME), in this case CHAR(60), is the information to format
the answer.
“Which hotels have swimming-pool or sauna?”
We assume that the “hotel” entity is represented by the HOTEL table, which has the QT_SAUNA
column to store the number of saunas. The fact “hotel has swimming-pool” is represented by
keyword ‘S’ in the POOL column.
LIL formula:
hotel(X), V(have(X,pool), have(X,sauna))
Initial SQL query:
SELECT ? FROM ?
Translation of: hotel(X), using Rule 2:
since the call to the function CLASS(hotel) returns ‘Table’, the expression
EQUIVALENT(hotel) is evaluated and its value (HOTEL) is appended to the FROM clause.
Then the new SQL query is: SELECT ? FROM HOTEL
Translation of: V(have(X,pool),have(X,sauna)), using Rule 4:
since the LIL formula also contains hotel(X), the substitution rule produces
have(hotel,pool) → TRANSLATION(HAVE, HOTEL, POOL) → HOTEL.POOL=‘S’
have(hotel,sauna) → TRANSLATION(HAVE, HOTEL, SAUNA) → HOTEL.QT_SAUNA>0
the obtained values are appended to the WHERE clause using the OR keyword as separator. Then
the new SQL query is: SELECT ? FROM HOTEL WHERE HOTEL.POOL=‘S’ OR HOTEL.QT_SAUNA > 0
Determination of the omitted column, using Rule 7:
the expression EQUIVALENT(HOTEL).DEFAULT(HOTEL) is evaluated and its value
(HOTEL.NAME) is appended to the SELECT clause. Then the final SQL query is: SELECT
HOTEL.NAME FROM HOTEL WHERE HOTEL.POOL=‘S’ OR HOTEL.QT_SAUNA > 0
The result of evaluating FORMATTER(NAME), in this case CHAR(60), is the information to format
the answer.
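As a quick check of the second example, the final query can be executed against a small in-memory SQLite table. The HOTEL, NAME, POOL, and QT_SAUNA names and the ‘S’ keyword come from the example; the rows, the ‘N’ value for hotels without a pool, and the use of SQLite are assumptions. Note that SQL requires straight quotes around ‘S’.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HOTEL (NAME TEXT, POOL TEXT, QT_SAUNA INTEGER)")
# Made-up rows covering the three interesting cases.
conn.executemany("INSERT INTO HOTEL VALUES (?, ?, ?)", [
    ("Alfa", "S", 0),  # swimming-pool only
    ("Beta", "N", 2),  # sauna only
    ("Gama", "N", 0),  # neither
])

sql = ("SELECT HOTEL.NAME FROM HOTEL "
       "WHERE HOTEL.POOL='S' OR HOTEL.QT_SAUNA > 0")
result = sorted(row[0] for row in conn.execute(sql))
print(result)
```

The disjunction is the only condition here, so the missing parentheses are harmless; if another condition were ANDed to the WHERE clause, the parenthesised form given in Rule 4 would be needed.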
7 Conclusion
The algorithm presented in this paper was implemented and is a module of Edite, a multi-lingual
(Portuguese, French, English, and Spanish) natural language front-end for relational databases. Edite
answers written questions about tourism resources by transforming them into SQL queries. The
answers depend on the type of question. They can be nominal lists of resources, text, images, or
graphics. Currently, the database contains 53000 tourism resources, organized as 253 distinct types,
corresponding to 209 database tables.
The main goal of Edite, an NLIDB, is to provide users with the capability of obtaining information
stored in a database [4]. The user is not required to learn an artificial communication language and
can formulate questions in his or her own native language. Our solution has the advantage of
being database independent [8].
References
1. Allen, J. 1995. “Natural Language Understanding”. The Benjamin/Cummings Publishing Company, Inc.
2. Androutsopoulos, I., Ritchie, G., Thanisch, P. 1993. “An Efficient and Portable Natural Language Query Interface for
Relational Databases”. Proceedings of the 6th International Conference on Industrial & Engineering
Applications of Artificial Intelligence and Expert Systems, Edinburgh, U.K., pages 327-330. Gordon and
Breach Publishers Inc., Langhorne, PA, U.S.A.
3. Androutsopoulos, I. 1993. “Interfacing a Natural Language Front-End to a Relational Database (MSc thesis)”.
Technical paper 11, Dept. of AI, Univ. of Edinburgh.
4. Androutsopoulos, I. 1994. “Natural Language Interfaces - An Introduction”. Journal of Natural Language
Engineering, Cambridge University Press.
5. Cohen, P.R. 1991. “The Role of Natural Language in a Multimodal Interface”. Technical Note 514, Computer
Dialogue Laboratory, SRI International, 1991.
6. Filipe, P. 1999. “Sistema de Interrogações em Língua Natural para Bases de Dados: Modelo Conceptual, Aquisição de
Vocabulário e Tradução”. M.Sc. Dissertation. Instituto Superior Técnico, Universidade Técnica de Lisboa.
7. Filipe, P., Mamede, N. 1999. “Aquisição de Vocabulário num Sistema de Interrogações em Língua Natural para Bases
de Dados”. Actas do IV Encontro para o Processamento Computacional da Língua Portuguesa Escrita e
Falada, Évora.
8. Grosz, B. J., Appelt, D. E., Martin, P. A., Pereira, C. N. 1987. “TEAM: An Experiment in the Design of Transportable
Natural-Language Interfaces”. Artificial Intelligence 32, pages 173-243. Elsevier Science Publishers B.V. (North-Holland).
9. Marques, L. 1996. “Edite - Um Sistema de Acesso a Base de Dados em Língua natural Análise Morfológica, Sintáctica e
Semântica”. M.Sc. Dissertation. Instituto Superior Técnico, Universidade Técnica de Lisboa.
10. Reis, P., Mamede, N. 1996. “LIL-SQL. Processamento de Interrogações LIL por Tradução para SQL”. Technical
Report. Grupo de Sistemas e Serviços Telemáticos, INESC.
11. Reis, P., Mamede, N., Matias, J. 1997. “Edite – A Natural Language Interface to Databases: a New Dimension
for an Old Approach”. In Proceedings of the Fourth International Conference on Information and
Communication Technology in Tourism, ENTER’97, Edinburgh, Scotland.