Keyword-based Search in a Relational Database
Daniël Suelmann
Advisor: dr. George M. Welling
Bachelor's Thesis
Department of Information Science
Faculty of Arts
University of Groningen
August 2009
Abstract
A relational database is often operated by means of a structured query language (SQL). When composing SQL queries one must have an understanding of the SQL syntax to be able to produce a query the database can
execute. Additionally, one must be familiar with the attributes and relations in the database to be able to retrieve
the data desired. When one wants to search the available data stored in the relational database, these
requirements can be discouraging. In this case keyword-based search functionality could improve the
accessibility of the data.
In this thesis I investigate the subject of keyword-based search in structured data. My objective in this thesis
is to show that keyword-based search in the available data can yield meaningful results.
I present a search application I developed that enables keyword-based search in the Lastgeld data. This data is
about skippers and their cargo entering the port of Amsterdam in the period from 1744 to 1748.
Answers to queries with multiple keywords are retrieved based on the following assumption: if keywords in a
keyword query are related in the relational database that contains the Lastgeld data, then retrieving this related
data yields results likely to be meaningful given the keyword query. I describe how the presented search
application achieves just that.
Additionally, given the experimental results, I conclude that keyword query execution time increases as the
intersection work that needs to be performed by the presented search application increases. This work depends
on the number of keywords in the query and the number of id-numbers associated with these keywords in the
indices. I also conclude that the more SQL queries the search application proposes to be executed by the
relational database, the more the overall query execution time increases.
Contents
1. INTRODUCTION .......................................................................................................................................... 1
1.1 OUTLINE ........................................................................................................................................................ 1
2. DATA REPRESENTATIONS ........................................................................................................................... 2
2.1 GRAPH REPRESENTATIONS ............................................................................................................................. 2
2.1.1 Definition ............................................................................................................................................... 3
2.2 THE LASTGELD DATABASE ............................................................................................................................. 4
3. SYSTEM DESIGN .......................................................................................................................................... 5
3.1 DEFINITION .................................................................................................................................................... 5
3.2 INDEXING ....................................................................................................................................................... 6
3.3 SEARCHING INDICES ............................................................................................................................... 7
3.4 CREATING A GRAPH DATA STRUCTURE........................................................................................................... 8
3.5 CREATING EDGES ........................................................................................................................................... 9
3.6 FINDING AN ANSWER TO A KEYWORD QUERY ............................................................................................... 10
3.7 CREATING SQL QUERIES .............................................................................................................................. 10
3.8 RESULTS ...................................................................................................................................................... 11
4. EXPERIMENT.............................................................................................................................................. 12
4.1 EXPERIMENTAL DESIGN ............................................................................................................................... 13
4.2 EXPERIMENTAL RESULTS ............................................................................................................................. 14
4.3 ANALYSIS .................................................................................................................................................... 16
5. RELATED WORK ........................................................................................................................................ 17
5.1 GRAPH-BASED SYSTEMS............................................................................................................................... 17
5.1.1 Data representation ............................................................................................................................. 17
5.1.2 Top-k ranking ....................................................................................................................................... 18
5.2 XML-BASED SYSTEMS ................................................................................................................................. 20
5.2.2 Semantic relatedness ............................................................................................................................ 21
5.2.3 Top-k ranking ....................................................................................................................................... 22
6. IMPLEMENTATION ................................................................................................................................... 23
6.1 CREATING INDICES ....................................................................................................................................... 23
6.2 OBJECT ORIENTED DESIGN ........................................................................................................................... 24
6.3 FLOW OF CONTROL....................................................................................................................................... 25
6.4 WEIGHTED GRAPH DATA STRUCTURE ........................................................................................................... 26
6.5 ALGORITHMS ............................................................................................................................................... 27
6.5.1 Searching indices ................................................................................................................................. 27
6.5.2 Intersecting id-numbers ....................................................................................................................... 27
6.5.3 Finding combinations........................................................................................................................... 28
6.5.4 Retrieving the vertex with the most edges ............................................................................................ 28
6.6 JAVA LIBRARY CLASSES ............................................................................................................................... 29
6.7 WEB DEPLOYMENT....................................................................................................................................... 30
6.7.1 Specification ......................................................................................................................................... 30
6.7.2 User interface ....................................................................................................................................... 30
6.7.3 Web address ......................................................................................................................................... 30
6.7.4 The search application doesn’t work ................................................................................................... 30
7. EPILOGUE ................................................................................................................................................... 31
7.1 CONCLUSIONS .............................................................................................................................................. 31
7.2 DISCUSSION ................................................................................................................................................. 31
7.3 PROPOSALS FOR FUTURE WORK.................................................................................................................... 32
REFERENCES ................................................................................................................................................... 32
APPENDIX A: CREATING INDICES IN PERL ....................................................................................................... 34
APPENDIX B: SEARCH APPLICATION SOURCE CODE IN JAVA .......................................................................... 36
FIGURES
2.1 Königsberg anno 1736………………………………………………………………………………….……………………………………………..…………………. 2
2.2 Graph representation of the Königsberg problem……………………………………………..……………..………………..…………………………. 2
2.3 Graph representations……………………………………………………………………………………………………………………………………………….…... 3
2.4 A directed graph……………………………………………………………………………………………………………………………………………….………..….. 3
2.5 A weighted graph…………………………………………………………………………………………………………………………………….…………………….. 3
3.1 System design………………………………………………………………………………………………………………………………………………….…………….. 5
3.2 A fragment of an inverted index………………………………………………………………………………………………………………………….…………. 6
3.3 A directed weighted graph representation of the keyword query IJsbrand Hanning Riga 1744…..………………………..…..…… 8
3.4 Graph II IJsbrand Hanning Riga 1744......………….……………………………………………………………………………………………….…………. 10
4.1 Experimental results keyword query A……………………………………………………………………………………………………………….……..... 14
4.2 Experimental results keyword query B…………………………………………………………………………………………………………….…………… 14
4.3 Experimental results keyword query C………………………………………………………………………………………………………………….……… 15
4.4 Experimental results keyword query D………………………………………………………………………………………………………………….……… 15
5.1 A fragment of a data graph model…………………….…………………………………………………………………………………………………………. 17
5.2 Radius Steiner trees …….…………………………………..……………………………………………………………………………………………….…………. 17
5.3 Accented Steiner nodes……………………………………………………………………………………………………………..………………….…...……….. 19
5.4 A Steiner graph result………………………….……………………………………………………………………………………………………….……………… 19
5.5 An XML data representation………………..……………………………………………………………………………..…………………….………………… 21
6.1 Object oriented design………………………………………………………………………………………………………………………..….…………….…….. 24
6.2 Weighted graph data structure…………………………………………………………………………….…………………………………………...….…….. 26
6.3 User interface search application…………………………………………………………………………………………………………………………………. 30
TABLES
2.1 Graph endpoints……………………………………………………………………………………………………………………………………………………….……. 3
2.2 A fragment of the Skippers table……………………………………………………………………………………………………………………………………. 4
2.3 A fragment of the Locations table……………………………………………………………………………………………………………………………….….. 4
3.1 Fields eligible for keyword-based search………………………………………………………………………………………………….……………….……. 4
3.2 Number of intersections…………………………………………………………………………………………………………………….………………..………….9
3.3 Example queries…………………………………………………………………………………………………………………………………….…………………….. 11
4.1 Experimental queries………………………………………………………………………………………………………………………………………………….….13
5.1 Table data representation………………………………………………………………………………………………………………………………………….… 17
6.1 Implemented Java Library classes…………………………………………………………………………………………………………………………………..29
6.2 Web deployment specification……………………………………………………………………………………………………………………………………….30
LISTINGS
3.1 An index of first names…………………………………………………………..………………………………………………………………………………………. 6
3.2 An index of dates……………………………………………………………………………………………………………………………………………………………. 6
3.3 An index of harbors…………………………………………………………………………………………………………………………………………….………….. 7
3.4 An index of countries………………………………………………………………………………………………………………………………….……………………7
3.5 Retrieving occurrences of keywords from the indices given the query Jan Cornelis Nantes 1744……………….…………….........8
3.6 Intersecting vertices for the query IJsbrand Hanning Riga 1744……………………..………………………………….…………………….….. 10
3.7 System-generated SQL queries and their answers………………………………………………………………………………………………..………. 10
5.1 An XML fragment……………………………………………………………..……………………………………….…….……………………………….………….. 20
6.1 Creating indices…………………………………………………………………………………………………..…………………………………….…………………..23
1. Introduction
Today's most widely used search engines enable users to express a search query by means of one or more
keywords. This query may be a descriptive phrase or no more than a single term, corresponding to a specific
information need. The user can query the available data without having to know any query language or having to
know how the data is stored in its internal data repository.
In this thesis I investigate the subject of applying a keyword-based search approach to data stored in a
relational database. More specifically, I focus on a database containing historical data. The database consists of two
tables containing structured data. The main table contains about 13,000 records of data about skippers and their
cargo entering the port of Amsterdam in the period from 1744 to 1748.
My objective in this thesis is to show that keyword-based search in a relational database can yield results likely
to be meaningful given the available Lastgeld data.
I investigate this subject by presenting a search application I developed that enables users to perform
keyword-based search in the available Lastgeld data. In general I focus on answering the following questions:
1. What methods are applied in querying a relational database given one or more keywords?
2. How can these methods yield meaningful results in a keyword-based search application, given the
available Lastgeld data and the relational database containing this data?
1.1 Outline
In the second chapter I describe some of the theory associated with graph data structures, since I use a graph data
structure to be able to query the data in the database. In this chapter I also describe the Lastgeld data and the way
it is stored in a relational database.
In chapter three, I present the application I developed and I describe the heuristics applied to achieve results. I
end this chapter by showing several example queries and their resulting answers returned by the presented
application.
In the fourth chapter I describe the experiment I performed regarding the efficiency of the presented search
application.
Chapter five is about some of the work performed in the research area of keyword-based search in structured
data. In this chapter I describe several systems that enable keyword-based search over structured data using
distinct approaches.
In chapter six I return to the subject of the presented search application and I describe which choices I made
to implement it.
Finally, in chapter seven I draw conclusions, discuss achievements and propose future work.
The appendices contain the documented source code of the presented search application.
2. Data representations
When a keyword-based search system receives a query, it needs to determine what this query means in terms of
the data repository it is designed to search. If the query consists of a single keyword, then that single keyword
can occur in many locations in the database. If there are more keywords, it also needs to determine the relation
between the keywords in the underlying database. If the data in the underlying database is connected by many
relations that give meaning to the data, then the search system has to be familiar with these relations to be able to
meaningfully determine if the keywords entered match any of these relations. To be able to determine such
relations, or to retrieve any data, the data must be transferred from a storage device to main memory.
Consequently the relations existing in the database on the storage device have to be replicated in main memory
also.
One could say that a model of the data and the relations meaningfully connecting entities of data must be
available in main memory to be useful to the search system. A way to create such a model is by means of a graph
data structure. In a graph abstract data type, entities of data can be connected to one another in an unrestricted
manner. Whereas tree data structures provide a useful way of representing relationships in which a hierarchy
exists, a graph data structure becomes useful if relationships between data entities appear more freely. Dale et
al.[3]
2.1 Graph representations
Graph theory is rooted in mathematics. In 1736
graph theory was born when the King of Prussia
confronted the mathematician Leonhard Euler with the
following problem:
The town of Königsberg (now Kaliningrad in
Russia), is built at the point where two branches of
the Pregel river come together. The river divides the
town into an island and some land around the river
banks. The island and the various pieces of main
land are connected by seven bridges. Is it possible
for a person to take a walk around town, starting and
ending at the same location, and crossing each of the
seven bridges exactly once?
Euler's conclusion was that it is impossible to travel
the bridges in the city of Königsberg once and only
once. Euler claimed that if there are more than two
landmasses with an odd number of bridges, then no
such journey is possible. Second, if the number of
bridges is odd for exactly two landmasses, then the
journey is possible if it starts in one of the two odd-numbered
landmasses. Finally, Euler claims that if
there are no landmasses with an odd number of
bridges, then the journey can be accomplished
starting in any region. Paoletti [12]
Figure 2.1: Königsberg anno 1736
Figure 2.2: graph representation of the Königsberg problem
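Euler's criterion can be checked mechanically by counting the landmasses with an odd number of bridges. The following sketch is only an illustration of that count (it is not part of the thesis's search application, and the landmass numbering is hypothetical); it also assumes the landmasses form one connected whole, which Euler's rule presupposes:

```java
import java.util.*;

// Euler's criterion for the bridge-walk problem: count landmasses with an
// odd number of bridges. Zero odd landmasses -> a closed walk exists from
// any start; exactly two -> a walk exists but must start at an odd one;
// more than two -> no such walk exists. Assumes a connected layout.
public class EulerCheck {

    // bridges[i] = {a, b} means one bridge connects landmass a and landmass b
    public static int oddDegreeCount(int[][] bridges) {
        Map<Integer, Integer> degree = new HashMap<>();
        for (int[] bridge : bridges) {
            degree.merge(bridge[0], 1, Integer::sum);
            degree.merge(bridge[1], 1, Integer::sum);
        }
        int odd = 0;
        for (int d : degree.values()) {
            if (d % 2 != 0) odd++;
        }
        return odd;
    }

    public static boolean closedWalkPossible(int[][] bridges) {
        return oddDegreeCount(bridges) == 0;
    }

    public static void main(String[] args) {
        // The seven bridges of Königsberg, landmasses numbered 0-3
        int[][] konigsberg = {
            {0, 1}, {0, 1}, {0, 2}, {0, 2}, {0, 3}, {1, 3}, {2, 3}
        };
        System.out.println("Odd landmasses: " + oddDegreeCount(konigsberg));   // 4
        System.out.println("Closed walk: " + closedWalkPossible(konigsberg));  // false
    }
}
```

All four Königsberg landmasses have an odd number of bridges, so by the criterion no such walk exists, matching Euler's conclusion.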
2.1.1 Definition
In our time, a graph consists of three entities: a set of vertices V(G),
a set of edges E(G), and an edge-endpoint function g that associates
each edge in E(G) with a pair of vertices in V(G).
Vertices can represent whatever is subject of attention; people,
brain cells, cities, courses, or entities of data present in a relational
database.
If the vertices represent cities, then the edges might represent the
roads between the cities. Because the road between Groningen and
Amsterdam also runs between Amsterdam and Groningen, the
edges in this representation have no direction. This is called an
undirected graph. If an edge represents for instance a pipe that
transports a fossil fuel like natural gas, then the gas is most
probably transported in only one direction. A graph with edges
directed from one vertex to another is called a directed graph.
Lanzani [8]
Directed graphs are often represented with arrows, as for
instance in figure 2.4. A more formal definition of the directed graph
in figure 2.4 is:

V(G) = {1, 3, 5, 7, 9, 11}
E(G) = {(1,3),(3,1),(5,7),(5,9),(9,9),(9,11),(11,1)}

Figure 2.3: graph representations
Lanzani [8]
In an undirected graph the arrows are simply omitted, since the
order of the vertices in each edge is unimportant. A more formal
definition of the first undirected graph in figure 2.3 is:

V(G) = {1, 2, 3, 4, 5}
E(G) = {(1,2),(2,3),(3,4),(4,5),(5,1)}
Edge    Endpoints
e1      {1, 2}
e2      {2, 3}
e3      {3, 4}
e4      {4, 5}
e5      {5, 1}
Table 2.1: graph endpoints
If two vertices in a graph are connected by an edge, then they are
said to be adjacent. In figure 2.4, vertex 5 is said to be adjacent to
vertices 7 and 9, while vertex 1 is said to be adjacent from vertices
3 and 11.
A tree is a special case of a directed graph in which each vertex
may be adjacent from only one parent vertex, except the root vertex,
which is not adjacent from any vertex.
A path from one vertex to another consists of a sequence of
vertices such that each consecutive pair of vertices in that sequence
is connected by an edge. This sequence of vertices and edges
makes it possible to traverse the graph in a certain manner.
A weighted graph is a graph in which edges are
associated with values. Weighted graphs can be
used to represent certain applications in which
edges are more than a connection. For instance,
Figure 2.5 depicts a graph in which the vertices are
cities and the edges the roads between the cities.
Additionally each edge contains a value that
represents the distance in kilometers between the
cities. Dale et al.[3]
Figure 2.4: a directed graph
Dale et al. [3]
Figure 2.5: a weighted graph
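The notions above (vertices, directed edges, adjacency, and edge weights) can be sketched in Java, the language the search application itself is written in. This is only an illustration of the definitions using the graph of figure 2.4, not the weighted graph data structure described in section 6.4:

```java
import java.util.*;

// A minimal directed weighted graph: V(G) is the key set of the adjacency
// map, E(G) the set of (from, to) pairs, and each edge carries a value
// (e.g. a distance in kilometers, as in figure 2.5).
public class WeightedDigraph {

    // vertex -> (adjacent vertex -> edge weight)
    private final Map<Integer, Map<Integer, Double>> adjacency = new HashMap<>();

    public void addVertex(int v) {
        adjacency.putIfAbsent(v, new HashMap<>());
    }

    public void addEdge(int from, int to, double weight) {
        addVertex(from);
        addVertex(to);
        adjacency.get(from).put(to, weight);
    }

    // vertex 'to' is adjacent FROM vertex 'from' when an edge (from, to) exists
    public boolean isAdjacent(int from, int to) {
        return adjacency.containsKey(from) && adjacency.get(from).containsKey(to);
    }

    // number of edges directed away from a vertex
    public int edgeCount(int from) {
        return adjacency.getOrDefault(from, Map.of()).size();
    }

    public static void main(String[] args) {
        // The directed graph of figure 2.4: V(G) = {1, 3, 5, 7, 9, 11}
        WeightedDigraph g = new WeightedDigraph();
        int[][] edges = {{1,3},{3,1},{5,7},{5,9},{9,9},{9,11},{11,1}};
        for (int[] e : edges) g.addEdge(e[0], e[1], 1.0);
        System.out.println(g.isAdjacent(5, 7)); // prints true: 7 is adjacent from 5
    }
}
```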
2.2 The Lastgeld database
I explained in the previous section that a graph model of the data and the relations meaningfully connecting
entities of data must be available in main memory to be useful to the search system. Now that I have explained some
important notions of graph theory, I continue by describing the database I used to build a graph data structure.
The database in question is called the Lastgeld database. The Lastgeld database contains data about skippers
and their cargo entering the port of Amsterdam from 1744 to 1748. These skippers had to pay a toll named
Lastgeld. Welling [15]
Table 2.2: a fragment of the Skippers table

I only use two tables. The main table (table 2.2) contains about 13,000
tuples1. Additionally I use a second table (table 2.3) that contains about
1500 tuples. This table relates to the main table by the name of the "hid"
attribute.

Table 2.3: a fragment of the Locations table

I chose this data representation because it shows two important
notions within the domain of any relational database. First, keywords in
the same tuple are related. For instance, a keyword query Cornelis Vos
should match the data in the first tuple. Secondly, keywords possibly
span multiple relations. For instance, a keyword query Hendrik Clasen
France should match data in the third tuple and should additionally match data in the Locations table, i.e. the
search system should return all the tuples that contain data about a skipper named Hendrik Clasen who visited
any harbor in France.
The Lastgeld database is governed on a storage device by a Relational Database Management System
(RDBMS). The RDBMS provides an interface to the data it retains by means of a query language. In the case of
the Lastgeld database the query language is SQL (Structured Query Language). For instance, the following
queries are valid when one wants to retrieve data from the Lastgeld database:
SELECT date, firstname, lastname FROM skippers WHERE lastname = 'clasen' (1)

SELECT date, firstname, lastname FROM skippers INNER JOIN locations ON
(skippers.hid = locations.hid) WHERE skippers.lastname = 'vos' AND locations.harbor
= 'nantes' (2)
The first query is directed at a single table, while the second query joins the two tables in order to get matches
from both tables.
If one reflects on the abstraction layer that allows us to operate the database, then one could say that the notions
of relations, tuples, attributes, keys, foreign keys, SQL, etc. are all abstractions of more complicated constructs
beneath the surface of the relational database. These abstractions allow us to deal with the database with relative
ease. For instance, programmers need the control of a SQL query language to perform a wide variety of actions
on the database. However, the abstraction layer suitable for programmers isn't very suitable for the people
whose primary interest is the data that is stored in the database.
1 For clarity: the terms relation, attribute and tuple are more commonly referred to as table, column, and row,
respectively.
3. System design
In chapter 2 I describe some of the properties of the Lastgeld database. I also explain that the data and relations
residing in the database are eventually stored on a storage device. To be able to perform any kind of search
operation, it is necessary to retrieve all the data from the database and have it available in main memory.
This can be done in several ways, for instance:
1. Replicate all the data available in the database as a graph data structure that is loaded into main
memory, and traverse this data graph to obtain answers to keyword queries; Aditya et al.[1]
2. Replicate only the data scheme of the database as a graph data structure that is loaded into main
memory and traverse this scheme graph to obtain answers to keyword queries; Hristidis et al.[6]
3. Export all data as XML and use an extended XML query language to obtain answers to keyword
queries. Cohen et al.[2]
In chapter five I describe more elaborately some of the methods applied in the research area of keyword-based
search in structured data. Although most probably inspired by what I describe in chapter five, I have not
attempted to rebuild these methods. They appear to be highly dependent on the context in which they are meant
to be implemented. Considering this observation, I have focused solely on the context of the Lastgeld data and
its container, a relational database.
3.1 Definition
I have developed a heuristic that is best described in terms of the following steps:
1. Index all table fields eligible for keyword search;
2. Search all indices based on the keywords in a keyword query;
3. For every keyword match found in an index, create a vertex in a graph data structure;
4. For every possible combination of two distinct vertices, create an edge between the two vertices if the id-numbers associated with each vertex have at least one id in common;
5. Find the vertex that has the most edges, since this is the vertex that relates the most keywords present
in the graph, and intersect all id-numbers associated with the edges adjacent from this vertex;
6. Use the id-numbers that result from the intersection of edges to create SQL queries.
Figure 3.1: System design
Figure 3.1 visualizes these steps in a more general fashion. In the following six sections I elaborate on each of
these steps in more detail.
3.2 Indexing
A major concept in information retrieval is indexing. Indexing is applied to gain speed in the process of
retrieval. An established indexing technique is to create what is called an inverted index. The basic idea of an
inverted index is depicted in figure 3.2, where the numbers denote the document id-numbers in which the terms
occur. The terms occurring in the document corpus are united in what is called a dictionary. Each term in this
dictionary points to what is called a postings list consisting of separate postings.
For instance, given a keyword query Brutus and Caesar, the postings lists of the terms Brutus and Caesar are
retrieved and intersected to obtain the document id-numbers that occur in both postings lists. Manning et al. [11]
The term and can be handled in different ways: treat it as a phrasal element; designate it to be a stop word, hence
ignoring it; or perhaps interpret it as a boolean operator.2

Figure 3.2: a fragment of an inverted index
Manning et al. [11]

I explain all of this because I used the concept of an inverted list within the context of the Lastgeld database,
although in a very different manner. The construct is based on the following reasoning: if a term like for instance
"Brutus" belongs to a document with id 1, then a term like for instance "Cornelis" belongs to a tuple with id 1.
I have materialized this reasoning by creating indices of the form shown in listings 3.1 and 3.2. An advantage
of the Lastgeld data is the fact that the data doesn't change over time. I simply pre-generated the indices and
stored them on disk. I created these indices for the following attributes3:

Skippers - date, firstname, lastname and
Locations - harbor, modcountry

Table 3.1: fields eligible for keyword-based search

The dates in the second column I have reduced to only the year notation as shown in listing 3.2. It is difficult
to see for instance "1744-04-01" as a keyword, while just "1744" is easier to grasp. However, I emphasize that
this is a design choice made to reduce complexity. I will not go into which date format is the best to index.

AART 1198
ABE 4765 12726 4790 1567 1246
ABE BR 6900 12471 11380 857
ABE BROER 5986
ABE BROERSZ 9008
ABEL 6545
ABEL BAS 3553
ABLBERT 7581
ABLBERT J 10065
ABRAHAM 3410 4434 6303 5806 9355 9064 6687 8444 9828…

Listing 3.1: an index of first names

1744 1 2 3 4 5 6 7 8 9 10 11 12 13 16 17 18 19 20...
1745 2332 2333 2334 2335 2336 2337 2338 2339 2340...
1746 5027 5028 5029 5030 5031 5032 5033 5034 5035…
1747 7619 7620 7621 7622 7623 7624 7625 7626 7627…
1748 10423 10424 10425 10426 10427 10428 10429 10430…

Listing 3.2: an index of dates
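The inverted-list idea can be sketched as follows. This is not the Perl index generator of appendix A; the terms and id-numbers are illustrative, and the sketch only shows the two operations the indices support: looking up the tuple id-numbers for a term, and intersecting the postings lists of several terms:

```java
import java.util.*;

// Each term points to the sorted set of tuple id-numbers it occurs in,
// like the dictionary-and-postings-list structure of figure 3.2.
public class InvertedIndex {

    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    public void add(String term, int tupleId) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(tupleId);
    }

    public SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    // intersect the postings lists of all given terms
    public SortedSet<Integer> intersect(List<String> terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            if (result == null) result = new TreeSet<>(lookup(term));
            else result.retainAll(lookup(term));
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndex firstnames = new InvertedIndex();
        firstnames.add("ABE", 4765);
        firstnames.add("ABE", 1567);
        firstnames.add("ABE", 1246);
        firstnames.add("ABEL", 6545);
        System.out.println(firstnames.lookup("ABE")); // prints [1246, 1567, 4765]
    }
}
```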
2 I will not go into this particular subject, since the relevance of the term "and" is minimal in the Lastgeld data
set. However, supporting Boolean operators in a keyword-based search system can be useful, although I consider
this subject to be beyond the scope of my thesis project.
3 I have decided not to use the numerical data shown in the Skippers table (table 2.2), since numerical data
will not fit the classification of a keyword.
I have shown the way I created indices for some of
the data in the Skippers table. To be able to associate
the data in the Locations table with the data in the
Skippers table, I chose to ignore the relational key-foreign
key concept altogether. In essence this
abstract concept serves the purpose of attaching
related entities of data over multiple tables within the
domain of a relational database, not in the data
structure I use to connect data.
Listings 3.3 and 3.4 show how I span two tables by
assigning the id-numbers in the Skippers table to their
related data in the Locations table. This way I can deal
with the data per tuple, including the data in a related
table, by a single id. I will elaborate on this benefit in
section 3.7.
AABO 10051 3670 12238 6360 9627 3812 6464
AAHUS 1465 4526 6088 5876 9036 1804 7579…
AALBORG 4345 6911 6027 6438 8408 9311 …
ALAMEIDA 10351 12755 12514 7843 12778 …
ALEXANDRIEN 5544
ALICANTE 2213 506 2587 8481 12925 4544…
Listing 3.3: an index of harbors
BELGIUM 7659 2379 7491 7596 12956 9064 9828…
DENMARK 11644 11762 1519 7549 3602 9108 10175…
EGYPT 5544
ENGLAND 1000 159 6016 2225 2312 12211 7832…
ESTONIA 3921 7800 1118 10851 884 11398 11622…
FINLAND 10051 3670 12238 6360 9627 3812 6464…

Listing 3.4: an index of countries
3.3 Searching indices
Now that I have explained how I created indices, I proceed by explaining how I search them. Given a keyword
query Q that consists of keywords K1,…,Kn, I search every index for every K in Q. For instance, given a keyword
query Cornelis Vos France 1744, I perform the following index searches:
Cornelis - Skippers.date, Skippers.firstname, Skippers.lastname, Locations.harbor, Locations.modcountry
Vos - Skippers.date, Skippers.firstname, Skippers.lastname, Locations.harbor, Locations.modcountry
France - Skippers.date, Skippers.firstname, Skippers.lastname, Locations.harbor, Locations.modcountry
1744 - Skippers.date, Skippers.firstname, Skippers.lastname, Locations.harbor, Locations.modcountry
This seems a bit redundant at first, but I do not really know based on some keywords entered where they might
occur in the indices or what their inter-keyword relationship is.
Furthermore I must emphasize that I do not perform any pre-analysis of keywords based on their appearance. For instance, if a keyword entered is numerical, like 1744, I still search the indices Skippers.firstname, Skippers.lastname, Locations.harbor and Locations.modcountry, even though I know that no numerical values exist in these indices. I do this to save myself extra work, at the cost of efficiency.
Also, I have not divided the indices into partitions; for instance, the index of Skippers.firstname is a continuous list from A to Z. I could have divided the indices into sub-indices that each contain one letter of the alphabet, so that the appropriate sub-index could be selected based on the first letter or digit of a keyword. I have not done this, to save myself from dealing with many separate indices, again at the cost of efficiency.
For instance, given the keyword query Jan Cornelis Nantes 1744, the process of searching the indices for keywords yields lists with the following contents:
A)Jan - firstname[2, 19, 54, 73, 76, 88, 95, 97, 112, ..., 12999]
B)Cornelis - firstname[1, 56, 130, 137, 143, 144, 176,..., 13007]
C)Cornelis - lastname[8, 53, 130, 160, 211, 220, 235, 358,..., 12956]
D)Nantes - harbor[1, 190, 234, 335, 379, 477, 478, 554,..., 12865]
E)1744 - date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 2331]
Listing 3.5: retrieving occurrences of keywords from the indices given the query Jan Cornelis Nantes 1744
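The exhaustive lookup of listing 3.5 can be sketched as follows. This is an illustrative sketch, assuming the dictionary-of-dictionaries index layout sketched earlier in this chapter; it is not the actual implementation.

```python
def search_indices(indices, keywords):
    """Search every index for every keyword, without any pre-analysis.

    indices:  {index_name: {value: [id, ...]}}
    keywords: the keywords of the query, e.g. ["Jan", "Cornelis", "Nantes", "1744"]
    Returns one (keyword, index_name, ids) entry per index the keyword occurs in.
    """
    occurrences = []
    for keyword in keywords:
        value = keyword.upper()              # index values are stored uppercased here
        for index_name, index in indices.items():
            ids = index.get(value, [])
            if ids:                          # keyword found in this index
                occurrences.append((keyword, index_name, ids))
    return occurrences
```

Note that a keyword such as Cornelis can produce two entries, one for the firstname index and one for the lastname index, exactly as entries B and C in listing 3.5.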
3.4 Creating a graph data structure
Looking at listing 3.5, B, D and E seem to be related in the first tuple; however, the first names A and C do not belong to the first tuple, so this is not a very good match. Ideally one would want to find the tuple(s) in which all keywords are present, or tuples in which at least four of the five keywords are present. However, I realize that if one wants to retrieve related data of this kind in any way, one first has to find a way to deal with the complexity of inter-keyword relationships.
I decided to use a directed weighted graph data structure. This data structure enables me to handle the complexity of the data in a better way. In section 3.6 I explain why I implemented the graph as a directed graph instead of an undirected graph. Recall that I described the essence of a weighted graph in section 2.1.1. Figure 2.5 depicts a weighted graph representation of the roads between cities, with the distances associated with the edges between them. Based on this concept I came up with the following reasoning:
If I declare every keyword in a keyword query that occurs with at least one id in an index to be a vertex, then I can connect a pair of vertices by creating an edge between them if they have at least one id in common. In doing so I can assign the id-numbers two vertices have in common to the edge between these two vertices. This way I create a weighted graph in which all keywords of the keyword query found in the indices are vertices, and the id-numbers associated with a keyword in an index are associated with the vertex of that keyword. An edge exists between any possible combination of two vertices if the two vertices have at least one id-number in common. If this is the case then an edge between that pair of vertices is created and the weight of the edge is associated with all the id-numbers these two vertices have in common. For example, the query IJsbrand Hanning Riga 1744 is inserted in the directed weighted graph data structure as follows:
Figure 3.3: a directed weighted graph of the keyword query IJsbrand Hanning Riga 1744
3.5 Creating edges
In order to create an edge between two vertices, the id-numbers belonging to each of these two vertices have to be intersected. The result of the intersection is assigned to the edge between these two vertices. I found this to be a good solution since the edge represents the relation between two data entities based on the id-numbers they have in common. If an edge is established between, for instance, firstname.ijsbrand and lastname.hanning, I chose not to also create an edge between lastname.hanning and firstname.ijsbrand. This would make the graph more complicated, while having the edges in just one direction turned out to be sufficient. I elaborate on this in the next section.
First, I describe the need to determine all possible pairs of vertex combinations in order to intersect the id-numbers associated with the vertices. Given a set of 5 vertices as depicted in figure 3.3, the following distinct combinations are possible: {0,1}{1,2}{2,3}{3,4}{0,2}{1,3}{2,4}{0,3}{1,4}{0,4}. These combinations of vertices must all be intersected because they are potentially related. For instance, given the keyword query IJsbrand Hanning Riga 1744, the following intersection sequence must be completed:
***vertex combinations***
ijsbrand firstname[64, 901, 5360, 7750, 7757, 11404]
ijsbrand lastname[5039]
--
ijsbrand lastname[5039]
hanning lastname[64]
--
hanning lastname[64]
riga harbor[64, 65, 85, 103, 195, 198, 346, 373,...,13006]
--
riga harbor[64, 65, 85, 103, 195, 198, 346, 373,...,13006]
1744 date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...,2331]
--
ijsbrand firstname[64, 901, 5360, 7750, 7757, 11404]
hanning lastname[64]
--
ijsbrand lastname[5039]
riga harbor[64, 65, 85, 103, 195, 198, 346, 373,...,13006]
--
hanning lastname[64]
1744 date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...,2331]
--
ijsbrand firstname[64, 901, 5360, 7750, 7757, 11404]
riga harbor[64, 65, 85, 103, 195, 198, 346, 373,...,13006]
--
ijsbrand lastname[5039]
1744 date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...,2331]
--
ijsbrand firstname[64, 901, 5360, 7750, 7757, 11404]
1744 date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...,2331]
Listing 3.6: intersecting vertices for the query "IJsbrand Hanning Riga 1744"
A consequence of intersecting all combinations of vertices this way is that the intersection work needed to create edges increases rapidly with the number of keywords in the query. Table 3.2 shows the increase in the number of intersections as the number of keywords in the query increases. Since some of the id-lists shown in listing 3.6 are quite large, the retrieval time should increase with the size of the id-lists as well as with the number of keywords in a keyword query.
In chapter four I perform an experiment to further analyze these dependencies in relation to overall retrieval time.
Number of keywords    Number of intersections
4                     6
6                     15
8                     28
10                    45
12                    66
20                    190
Table 3.2: number of intersections
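The pairwise intersection process of sections 3.4 and 3.5 can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation; the vertex names and id-sets are assumptions for the example. Note that n vertices yield n*(n-1)/2 candidate pairs, matching table 3.2.

```python
from itertools import combinations

def create_edges(vertices):
    """Intersect every distinct pair of vertices to create weighted edges.

    vertices: {vertex: set_of_ids}, e.g. {"firstname.ijsbrand": {64, 901}}
    Returns {(source, target): common_ids} for pairs sharing at least one id.
    """
    edges = {}
    for source, target in combinations(vertices, 2):
        common = vertices[source] & vertices[target]
        if common:                 # an edge only exists for shared id-numbers
            edges[(source, target)] = common
    return edges
```

For the IJsbrand Hanning Riga 1744 example, the edge between firstname.ijsbrand and lastname.hanning would carry the single common id 64, as in listing 3.6.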
3.6 Finding an answer to a keyword query
I assume that the vertex with the most outgoing edges is the "binding" vertex. Vertex A in figure 3.4 connects to B, C and D; I simply intersect the id-numbers associated with these three edges and obtain the final result with id 64. Because I decided to implement the graph as a directed graph this is possible. Had I implemented the graph in figure 3.4 as an undirected graph, each vertex would have 3 incoming and 3 outgoing edges and hence this approach wouldn't work.
I must admit that this is likely not the best solution. For instance, suppose there is another vertex in the game by the name E, and A also matches this vertex, but all the other vertices A is related to do not match it; then the final result is an empty list. In other words, the edges of the vertex with the most edges must all be related.
Figure 3.4: graph representation of the keyword query IJsbrand Hanning Riga 1744
A better solution would be to traverse the graph. There are several ways to do so. While traversing the graph, the main condition that must be fulfilled is that every new edge reached must have at least one id in common with the last visited edge in order to continue the traversal on the new edge.
To find the best-fitting answer given the keywords of a keyword query, it is evident that as many vertices as possible should be visited. In the case of the graph in figure 3.4, the path ABDC would yield the id-numbers spanning the most keywords. Finding a good graph traversal algorithm is one of the subjects I suggest in chapter 8 as a possible follow-up. For now I will use the rather 'naïve' approach of finding the vertex with the most edges.
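The 'naïve' binding-vertex strategy can be sketched as follows. This is a sketch under the assumption that edges of the directed graph are stored as a mapping from (source, target) pairs to their common id-sets; the actual implementation may differ.

```python
from collections import defaultdict

def binding_vertex_answer(edges):
    """Pick the vertex with the most outgoing edges and intersect the
    id-sets on all of its edges.

    edges: {(source, target): set_of_common_ids} of the directed graph.
    Returns the final set of id-numbers (empty if there are no edges).
    """
    outgoing = defaultdict(list)
    for (source, _target), ids in edges.items():
        outgoing[source].append(ids)
    if not outgoing:
        return set()
    binding = max(outgoing, key=lambda vertex: len(outgoing[vertex]))
    # All edges of the binding vertex must be related for a non-empty answer.
    return set.intersection(*outgoing[binding])
```

For the graph of figure 3.4, A would be the binding vertex and the intersection of its three edge weights yields the final result containing id 64.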
3.7 Creating SQL queries
At this stage an answer to a keyword query has or has not been found, depending on the presence of the query keywords in the indices of the relational database. If there is a result list of id-numbers, then the search application possesses all it needs to efficiently retrieve all related data from the relational database. For instance, given the keyword query Cornelis Vos France, the search application creates the following SQL queries:
SELECT * FROM Skippers WHERE idno = 1
SELECT * FROM Skippers WHERE idno = 2574
SELECT * FROM Skippers WHERE idno = 5288
SELECT * FROM Skippers WHERE idno = 8102
results:
id = 1    first name = CORNELIS    last name = VOS  harbor = NANTES      toll-decimal = 2.40  weight = 128.47  guldens = 2  stuivers = 8  cargo units = 65
id = 2574 first name = CORNELIS    last name = VOS  harbor = LE CROISIC  toll-decimal = 2.40  weight = 57.32   guldens = 2  stuivers = 8  cargo units = 29
id = 5288 first name = CORNELIS C. last name = VOS  harbor = BORDEAUX    toll-decimal = 2.40  weight = 128.47  guldens = 2  stuivers = 8  cargo units = 65
id = 8102 first name = CORNELIS W. last name = VOS  harbor = LIBOURNE    toll-decimal = 2.40  weight = 75.10   guldens = 2  stuivers = 8  cargo units = 38
Listing 3.7: system-generated SQL queries and their answers
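Formulating these per-id queries can be sketched as follows. This is an illustrative sketch: the `%s` placeholder is the DB-API parameter style used by the common MySQL drivers, and binding the id as a parameter (rather than string concatenation) is a design choice of the sketch, not necessarily of the original implementation.

```python
def build_queries(result_ids, table="Skippers"):
    """Formulate one SELECT per result id, as in listing 3.7.

    Returns (sql, params) pairs suitable for a DB-API cursor.execute() call,
    so the id values are bound as parameters instead of interpolated.
    """
    sql = "SELECT * FROM {} WHERE idno = %s".format(table)
    return [(sql, (idno,)) for idno in result_ids]
```

Each pair can then be passed to a cursor in sequence; note that the table name itself cannot be a bound parameter and is fixed here.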
3.8 Results
In Information Retrieval, search system effectiveness is assessed relative to an information need, not to a query. A document is classified as relevant if it is coherent with a certain information need, not because it contains all the keywords in a certain query (Manning et al. [11]).
However, a tuple in a relational database is not a document containing potentially many terms. When I query a document with 500 terms with just two keywords and these two keywords appear in that document, the relation of 2 to 500 is weak. In contrast, when I query tuples in a relational database consisting of five fields, as in the Lastgeld database, and two of these five fields are present in a keyword query as well as in a specific tuple, this is obviously a much stronger relation. Still, the relevance of retrieved results can't be judged without a defined information need.
An inherent characteristic of keyword queries is that they are inexact. One cannot claim that a keyword query like Jan Jansen Kleine Oost Germany 1745 is equivalent to, for instance:
SELECT * FROM skippers INNER JOIN locations ON (skippers.hid = locations.hid)
WHERE skippers.date = '1745' AND skippers.firstname = 'jan'
AND skippers.lastname = 'jansen' AND locations.harbor = 'kleine oost'
AND locations.modcountry = 'germany'
By trading a query syntax like SQL for just some keywords, control over retrieving exact answers is lost. The spaces between keywords do not per se imply a relation between the keywords, but they do not imply there isn't any relation either. This lack of exactness of keyword queries has forced me to make an assumption about what is important when dealing with multiple keywords in a query. I assume that if the keywords in a keyword query are related in the relational database that contains the Lastgeld data, then retrieving this related data yields results likely to be meaningful given the keyword query. Consider the following queries and their effect:
Query: Cornelis
Description: If there is just one keyword present in the query, the search application retrieves all tuples in which the keyword Cornelis occurs. Note that Cornelis appears in the indices as both a first name and a last name; tuples in which Cornelis appears as a last name and tuples in which it appears as a first name are both retrieved.

Query: Cornelis Vos 1744 France
Description: All keywords are present in the indices, and all keywords are related by several id-numbers. Based on the assumption described in the previous paragraph, the search application only retrieves the tuples in which all the keywords occur; tuples in which they occur separately are not retrieved. Keyword order does not influence the generation of results, e.g. Vos France 1744 Cornelis yields the exact same results.

Query: Cornelis Vos Groningen
Description: The keyword Groningen does not appear in the indices and thus no vertex is created for this keyword. However, Cornelis and Vos do occur in the indices and thus vertices are created for these keywords. As a result this query yields the same results as Cornelis Vos alone.

Query: Cornelis Vos Jansen
Description: This keyword query is a special case; Cornelis and Vos are related, but Cornelis and Jansen are also related. The keyword Cornelis has the most edges; it points to Vos and to Jansen. As a consequence of working with the vertex that has the most edges, the id-numbers associated with the edge between Cornelis and Vos are intersected with the id-numbers associated with the edge between Cornelis and Jansen. This leads to an empty result list. I have not attempted to simulate "OR" semantics, therefore I have not merged the results of Cornelis Vos and Cornelis Jansen into one result list.

Query: Cornelis France
Description: In terms of the relational database, Cornelis appears in the Skippers table and France appears in the Locations table. Since these relations do not exist in the indices, the search application generates SQL queries for a single table only. This query retrieves all the occurrences of Cornelis in the Skippers table in relation to France in the Locations table.
Table 3.3: Example queries
4. Experiment
In this chapter I present the results of the experiments regarding the efficiency of the presented search application. I have designed an experiment in which I measure:

1. The retrieval time of multiple keyword-based search queries posed to the search application. More specifically I focus on:

   A) The number of keywords in the query. I state in section 3.5 that the retrieval time will increase in relation to the number of keywords in a keyword query. In table 3.2 I show that the number of intersections needed increases rapidly as the number of keywords in the query increases.

   B) The size of the id-list associated with a keyword. In section 3.5 I also stated that retrieval time will increase if the size of the id-lists associated with the keywords increases.

   Given these parameters I can assess their processing weight in terms of the relative increase of retrieval time.

2. The retrieval time of the SQL queries the search application proposes. For instance, given the keyword query "Cornelis Vos France", the search application deals with data in different tables by merging the indices of two tables within the relational database; this way the search application restricts its querying to just one table, like so:
SELECT * FROM Skippers WHERE idno = 1
SELECT * FROM Skippers WHERE idno = 2574
SELECT * FROM Skippers WHERE idno = 5288
SELECT * FROM Skippers WHERE idno = 8102
However, in this approach multiple queries are posed in sequence, e.g. if there are 100 result tuples, then 100 queries are formulated and posed to the relational database. Intuitively this appears to be an expensive solution in terms of processing time. On the other hand, the queries are rather straightforward to process, since they do not span multiple tables and the 'where' clause contains just one argument at a time.
In the following sections I describe how the factors mentioned in both 1A/B and 2 influence the execution time
of a search request.
4.1 Experimental design
I formulated four search queries to measure the influence of three factors on the query execution time. The search queries are presented in table 4.1 and the three factors are:

1. The index size(s);
2. The number of keywords in the query;
3. The number of queries the search application proposes.
A: Jan Jansen Kleine Oost Germany 1745
   description: involves relatively big lists of id-numbers
   number of occurrences: Jan: 1319, Jansen: 149, Kleine Oost: 5618, Germany: 5683, 1745: 2696

B: Germany
   description: involves the biggest list of id-numbers
   number of occurrences: Germany: 5683

C: Nicolaas Bark
   description: involves very small lists of id-numbers
   number of occurrences: Nicolaas: 9, Bark: 4

D: Hendriks Bordeaux
   description: involves medium-sized (relative) lists of id-numbers
   number of occurrences: Hendriks: 333, Bordeaux: 264
Table 4.1: experimental queries
Query A consists of 5 keywords. Each keyword occurs in only one index. As a consequence the search application needs to intersect 10 combinations of reasonably sized lists of id-numbers. The answer to the query consists of one tuple; therefore the number of queries the search application proposes to the relational database is one.
Query B consists of just one keyword, which occurs in only one index. As a consequence the search application does not intersect at all. The answer to the query consists of 5683 tuples; therefore the number of queries the search application proposes to the relational database is 5683.
Query C consists of two keywords with small lists of id-numbers. As a consequence the search application intersects only once. The answer to the query consists of one tuple.
Query D consists of two keywords that have medium-sized lists of id-numbers in the indices (relative to the available data). As a consequence the search application intersects only once. The answer to the query consists of 6 tuples.
For each query I measured the execution time 50 times: both the execution time within the search application and the execution time of the queries proposed by the search application. Between measurements I paused for 10 seconds.
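The measurement procedure can be sketched as follows. This is a sketch of the experimental setup; `run_search` is a hypothetical stand-in for whichever part of the pipeline is being timed.

```python
import time

def measure(run_search, repetitions=50, pause_seconds=10):
    """Time a callable `repetitions` times, pausing between measurements.

    Returns the individual timings in milliseconds, so that e.g. the
    median over 50 measurements can be reported.
    """
    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_search()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
        time.sleep(pause_seconds)   # let the system settle between runs
    return timings_ms
```

A monotonic clock such as `time.perf_counter` is used here because wall-clock time can jump during a measurement.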
To add an extra element of comparison I also measured the speed of the SQL queries semantically related to A, B, C and D. For instance, keyword query A could be interpreted as:
SELECT * FROM skippers INNER JOIN locations ON (skippers.hid = locations.hid)
WHERE skippers.date = '1745' AND skippers.firstname = 'jan'
AND skippers.lastname = 'jansen' AND locations.harbor = 'kleine oost'
AND locations.modcountry = 'germany'
This serves as an indicator of the performance loss the presented search application causes, under the assumption that the keyword query expresses the same search intention as the SQL query, given the semantic relatedness of the Lastgeld data.
4.2 Experimental results
I have performed the experiments on a Pentium 2.2 GHz processor with 4 GB of RAM running Windows Vista. The Lastgeld database, as described in section 2.2, runs on the same machine. The RDBMS I use is MySQL[17], which is freely available. Further on in this thesis I describe the implementation of the presented search application in more detail.
Figure 4.1: experimental results keyword query A
Figure 4.2: experimental results keyword query B
Figure 4.3: experimental results keyword query C
Figure 4.4: experimental results keyword query D
4.3 Analysis
Keyword query A consists of a relatively high number of keywords. The 15,500 id-numbers that are associated with the keywords in the indices cause the search application to execute many intersections, and therefore the execution time of the search application increases by hundreds of milliseconds. However, the SQL query the search application poses executes almost instantaneously (recall that keyword query A yields only one query). The semantically related standalone query is somewhat slower, but executes in around 10 milliseconds.
Keyword query B consists of only one keyword. The search application finishes the request with a median of 44 milliseconds over 50 measurements. However, the search application proposes 5683 SQL queries, since the keyword Germany occurs with 5683 id-numbers. This slows down the eventual result generation considerably. The median of the execution of these SQL queries is 895 milliseconds over 50 measurements. The semantically related query doesn't execute flawlessly either.
Keyword query C still generates about 60 ms of execution time for the keyword search application, even though the number of intersections is reduced to one and the id occurrences in the indices are minimal.
The search application executes keyword query D slightly slower than query C, but the overall execution times are similar to those of keyword query C.
5. Related work
There has been extensive research on the topic of keyword search over structured data. In section 5.1 I describe three keyword search systems which perform operations on data stored in graph data structures. These graph data structures serve as models of the underlying relational database. I chose to describe three systems that apply different techniques to achieve roughly the same goals. In section 5.2 I describe work on the subject of keyword search over XML data. Like data stored in a relational database, XML data is structured as well, although in a different sense. The techniques applied in this area are different, but their purpose is ultimately identical to the purpose of the techniques applied when dealing with relational databases.
5.1 Graph-based systems
Discover[6], BANKS[1] and EASE[9] are systems developed in this area. In these systems a database is
represented by a graph data structure where tuples are the nodes in the data structure.
5.1.1 Data representation
Discover, BANKS and EASE generate query answers called 'tuple trees' based on the keywords received as input. A tuple tree is a joining tree of tuples, i.e. a data structure that contains one or more tuples in a predefined manner. In a tuple tree each node is connected via foreign-key relationships. In BANKS and EASE a specific kind of tuple tree, called a Steiner tree, is applied to hold the answers to keyword queries.
A difference between these systems can be found in the way the database in question is modeled to generate the tuple trees. The key algorithms of Discover work on a graph data structure that is modeled on the properties of the database schema. The database schema graph is used to produce a number of SQL queries needed to answer the keyword queries presented to the system. Unlike Discover, the more advanced BANKS and EASE model the entire database as a directed graph data structure. Figure 5.1 visualizes a fragment of the graph data model EASE constructs based on the publication database presented in tables 5.1.
From here the system generates tuple trees to create answers to the keyword queries presented to the system. In the case of EASE, as mentioned before, tuple trees are built as Steiner graphs. For clarity: a tree data structure is a kind of graph data structure (Dale et al. [3]). Figure 5.2 visualizes these Steiner trees as answers to keyword queries. The circles around nodes represent a method described in EASE to reduce processing, since it can be very costly to generate Steiner trees over a large data graph. The authors of EASE propose to define a radius, based on certain properties, containing the nodes necessary to produce an adequate answer to the keyword query being processed.
Tables 5.1: table data representation
Figure 5.1: a fragment of a data graph model
Figure 5.2: radius Steiner trees
5.1.2 Top-k ranking
An important notion in all these methods is efficiency. The described techniques potentially span very large databases. Because schema or data graphs are kept in main memory and trees are generated on the fly when the system is queried, the processing must finish almost instantaneously.
According to Dalvi et al. [4] the graph data structures used in keyword search engines can potentially span very large data sets, because structured data is fairly compact compared to textual data; graphs with millions of nodes, relating to hundreds of megabytes of data, can be stored in tens of megabytes of main memory. If a search system runs on dedicated servers, even larger graphs, such as the English Wikipedia, which contained over 1.4 million nodes and 34 million links (edges) as of October 2006, can be handled.
Although efficiency becomes a very important topic when dealing with large data graphs, a certain degree of effectiveness is in most cases just as important, or even more so. Discover, BANKS and EASE all incorporate a top-k ranking mechanism to achieve a degree of effectiveness as well.
Discover ranks results by the number of joins involved. The idea behind this strategy is that joins involving many tables are more difficult to grasp. This ranking strategy has a certain parallel with ranking methods used in document retrieval: documents in which keywords occur close to one another are ranked higher than documents in which keywords are far apart. However, in a follow-up to Discover, Hristidis et al. [7] propose a ranking method known in the field of Information Retrieval (IR) as relevance ranking. The general idea behind relevance ranking is that, according to some definition of relevance, only the few most relevant matches are generally of interest. Consequently, instead of computing all matches for a keyword query, only the top-k matches are computed. This, in turn, yields a more efficient solution.
The BANKS system incorporates a technique that assigns weights to tuples and assigns weights to edges
between tuples. A combination of tuple weights and edge weights in a tuple tree is calculated to rank matches.
Liu et al. [10] argue that, despite the methods applied to gain effectiveness, the focus of Discover and BANKS is still primarily on obtaining efficiency, by avoiding the creation of unnecessary tuple trees and by deploying algorithms to improve the time and space complexities. They state that effectiveness should be equally important. In turn they incorporate a full-fledged IR solution on top of a system conceptually comparable to Discover, BANKS and EASE.
Liu et al. define tuple trees to be super-documents and all text column values to be documents. Let T be a tuple tree and {D1, D2, …, Dm} be all text column values in T. To rank tuple trees they compute a similarity value between the query Q and the super-document T, as shown in Equation 1. The similarity is the dot product of the query vector and the super-document vector. In contrast to the systems described earlier, Liu et al. apply IR evaluation techniques to assess the results achieved.
Sim(Q, T) = Σ_{k ∈ Q ∩ T} weight(k, Q) * weight(k, T)    (1)
In EASE, a ranking mechanism is proposed that incorporates three notions: a TF-IDF based ranking function that considers the textual properties of a Steiner graph; the compactness of a Steiner graph; and the keyword order in the query.
The TF-IDF based ranking function assigns a weight to the Steiner graph, that is, to the keywords present in the Steiner graph. Recall that Steiner graphs are an extraction of the keyword presence in the underlying data graph that models the data present in a relational database. The TF-IDF based ranking function takes into account the term frequency (TF), the inverse document frequency (IDF) and the normalized document length (NDL). TF and IDF are used to rank. In the IR literature NDL is used to normalize for document length, since a longer document tends to repeat the same terms, while this does not per se mean that the document should be ranked higher (Manning et al. [11]).
These three parameters are then computed as follows:

(2) ntf(ki, G) = 1 + ln(1 + ln(1 + tf(ki, G)))
(3) idf(ki) = ln((N + 1) / (Nki + 1))
(4) ndl = (1 - s) + s * tlG / avgtl

where tf(ki, G) in Equation 2 denotes the term frequency of keyword ki in the data graph G; in Equation 3, N and Nki denote the number of Steiner graphs and the number of those Steiner graphs containing keyword ki; and in Equation 4, tlG denotes the total number of terms in G and avgtl is the average number of terms among all Steiner graphs. These parameters are subsequently used to compute a ranking weight between a keyword and a Steiner graph SG.
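Under the definitions above, the three parameters can be computed as in the following sketch. The function names follow Equations 2-4; the smoothing constant s and the argument names are notational assumptions of this sketch.

```python
import math

def ntf(term_frequency):
    """Equation 2: doubly dampened term frequency of a keyword in graph G."""
    return 1 + math.log(1 + math.log(1 + term_frequency))

def idf(num_graphs, num_graphs_with_keyword):
    """Equation 3: inverse frequency over the N Steiner graphs."""
    return math.log((num_graphs + 1) / (num_graphs_with_keyword + 1))

def ndl(s, graph_length, average_length):
    """Equation 4: normalized length of a Steiner graph."""
    return (1 - s) + s * graph_length / average_length
```

The double logarithm in ntf dampens large term frequencies, and ndl equals 1 exactly when a Steiner graph has average length.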
In EASE, however, Li et al. comment that Information Retrieval ranking methods based on TF-IDF can be effective for textual documents, but are not very effective for semi-structured and structured data.
Li et al. state that a consequence of modeling data as a graph is that ranking the structural properties of the data graph becomes just as important, or even more so. According to Li et al., rich structural relationships should be at least as important as discovering more keywords in the data graph. To this end EASE [9] takes the structural compactness of the data into account to create an additional weight on top of the TF-IDF weight described earlier.
Given a keyword query K = {k1, k2, ..., km}, the thick-edged circles in figure 5.3 containing p5, p7 and a4 are content nodes that contain at least one keyword (recall that a node represents the data of a tuple in the relational database). A node s is called a Steiner node if there exist two content nodes, u and v, such that s is on the path u ↔ v (s may be u or v), where u ↔ v denotes a path between u and v. Since such a path exists between p5, a4 and p7, the radius property described earlier yields a radius Steiner graph (figure 5.4), which serves as input to a ranking function. The ranking is accordingly based on the compactness of the Steiner graph. The underlying idea is that a more compact Steiner graph is more likely to be meaningful.
Figure 5.3: accented Steiner nodes
Although the structural compactness of nodes can be an important measure when generating a useful result set, it cannot be of service when evaluating inter-keyword semantics. The order of the keywords can hold meaning if the query is an expression of a phrase. In EASE, the weighting function applied also takes the keyword order into account. This is done by assigning more weight to keywords that have a smaller inter-keyword distance.
Figure 5.4: a Steiner graph result
5.2 XML-based systems
A different but intrinsically related research topic is keyword search in XML databases. This topic is related because XML, like data in relational databases, is structured by nature as well. XML queries may return entire XML documents, or may just as well return deeply nested XML elements. Because of the inherent nested structure of XML, the notion of ranking is no longer at the granularity of a document but at the granularity of an XML element (Manning et al. [11]).
As described in the previous section about graph-based systems, efficiency and effectiveness are very important factors in the process of developing a search system. In section 5.2.1 I describe an indexing method as employed by Florescu et al. [5]. After that, I shift from the indexing technique applied by Florescu et al. to the search system XSearch [2]. First I describe the semantics derived from an XML data representation in order to meaningfully answer a keyword query. In conclusion of this chapter, I describe the ranking mechanism of the XSearch system.
Florescu et al. [5] extend an XML query language for the purpose of keyword search. In their proposal the XML data is replicated in a relational database. I will not go into this particular architecture; I am interested in describing the index system that is employed to retrieve query answers from XML data.
A common indexing approach used in traditional IR-systems is by means of an inverted file as described in
section 3.2. A simple setup for an inverted file takes the following form:
<word, document>
This means that word can be found in document. However, when dealing with XML, retrieval is no longer at the granularity of a document but at the granularity of an XML element, as noted earlier. Consider listing 5.1. The word 'Analysis' appears in a title element nested in an article element. To be able to utilize the nested structure for retrieval, Florescu et al. make a distinction between keywords occurring as tags, e.g. article; as names of an attribute, e.g. id; or as data content of elements. Additionally, the depth at which a keyword occurs is taken into account. As a result, an inverted file of the following form is proposed:

<"article", elID1, 0, tag>
<"id", elID1, 1, attr>
...
<"name", elID1, 2, tag>
<"Adam", elID1, 2, value>

elIDn is associated with all elements, such that each interior node Ne is labeled with a distinct element ID, elID.

<document>
<article id="1">
<author><name>Adam Dingle</name></author>
<author><name>Peter Sturmh</name></author>
<author><name>Li Zhang</name></author>
<title>Analysis and Characterization of Large-Scale Web Server Access Patterns and Performance</title>
<year>1999</year>
<booktitle>World Wide Web Journal</booktitle>
</article>
</document>
Listing 5.1: an XML fragment
To utilize the inverted list, all data is modeled in records containing the URL of the XML document an element belongs to, the starting and ending positions of the element within this document, and the type of the element.
As a result the following relational schema is obtained:
elements(elID, docid, start_pos, end_pos, type,id_val)
documents(docid, URL,...)
In summary, by scanning the inverted index, which is in effect a compact representation of the data corpus, the search system can find the desired data more efficiently. If the system had to search the entire database sequentially, the processing cost would be high. Florescu et al. have created an inverted list that allows search on the level of tags, attributes and data values. This way a specific fragment of an XML document can be found quickly.
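The element-granularity postings described above can be sketched as a small in-memory structure. All class and method names below are mine, chosen for illustration; Florescu et al. actually store these postings in relational tables as shown above.

```java
import java.util.*;

// A minimal sketch of an element-granularity inverted index in the spirit of
// the <keyword, elID, depth, type> entries above. Names are mine, not from [5].
public class XmlInvertedIndex {
    // One posting: where a keyword occurs and in which role.
    static class Posting {
        final int elementId;   // elID of the element the keyword occurs in
        final int depth;       // nesting depth of that element
        final String type;     // "tag", "attr" or "value"
        Posting(int elementId, int depth, String type) {
            this.elementId = elementId;
            this.depth = depth;
            this.type = type;
        }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();

    public void add(String keyword, int elementId, int depth, String type) {
        index.computeIfAbsent(keyword, k -> new ArrayList<>())
             .add(new Posting(elementId, depth, type));
    }

    public List<Posting> lookup(String keyword) {
        return index.getOrDefault(keyword, Collections.emptyList());
    }
}
```

A lookup of "Adam" on the fragment of listing 5.1 would then return a posting with element ID 1, depth 2 and type "value".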
5.2.2 Semantic relatedness
In XSearch, Cohen et al. [2] present a free-form query language over XML documents, which are modeled
as trees that consist of interior nodes and leaf nodes. Each interior node is associated with a label and each leaf
node is associated with one or more keywords. Figure 5.5 represents such a tree as a model of a fragment of the
SIGMOD (Special Interest Group on Management of Data) publication database.
Figure 5.5: an XML data representation
In a sense a node in the tree can be viewed as a human being in our world; different people may have identical names. As such, two different nodes with the same label are different entities of the same type. To extend this analogy, one can say humans are related if they share the same ancestor(s). Now suppose that nodes n and n' have different ancestors, say na and n'a, and these ancestors share the same label; then it is said that n and n' are not meaningfully related. Let T be a tree and let n1 and n2 be nodes in T; then the shortest undirected path between n1 and n2 is the path via the lowest common ancestor of n1 and n2. Recall from chapter 2 that a path in a graph is undirected, such that G = {(n1, n2), (n2, n1)}, i.e. it is possible to get to n2 if n1 is the present location, and vice versa. The subtree consisting of these two paths is denoted T|n1,n2 and is called a relationship tree. The overall notion of meaningfully related nodes is formalized by means of two conditional rules:
1. T|n,n' does not contain two distinct nodes with the same label;
or
2. the only two distinct nodes in T|n,n' with the same label are n and n'.
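The two rules above can be sketched as a check on the multiset of labels occurring in the relationship tree. The representation and all names below are mine, for illustration only; XSearch itself operates on the tree structure directly.

```java
import java.util.*;

// A sketch of the two relatedness rules: given the labels of all nodes in the
// relationship tree T|n,n' and the labels of the endpoints n and n', decide
// whether n and n' are meaningfully related. Names are mine, not XSearch's.
public class RelatednessCheck {
    public static boolean meaningfullyRelated(List<String> treeLabels,
                                              String labelN, String labelN2) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : treeLabels) counts.merge(l, 1, Integer::sum);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                // Rule 2: a duplicated label is only allowed when the two
                // duplicates are exactly the endpoints n and n' themselves.
                boolean endpointsOnly = e.getValue() == 2
                        && e.getKey().equals(labelN)
                        && e.getKey().equals(labelN2);
                if (!endpointsOnly) return false;
            }
        }
        return true; // Rule 1 holds, or rule 2 covered the only duplicate.
    }
}
```

For example, a tree with labels article, author and name relates its endpoints, whereas a tree containing two author nodes that are not the endpoints does not.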
These semantics are additionally extended with traditional information retrieval techniques to rank query
answers.
5.2.3 Top-k ranking
In XSearch, as described in the previous section, subtrees are generated as possible answers to a keyword query. The weights for ranking are calculated at the level of the leaf nodes of a document.
Let k be a keyword and nl a leaf node, and let occ(k, nl) denote the number of occurrences of k in nl. The term frequency of k in nl is defined as:

$$ tf(k, n_l) := \frac{occ(k, n_l)}{\max_{k' \in words(n_l)} occ(k', n_l)} \quad (5) $$
This is a variation of an IR-based approach that assigns more weight to frequent words in sparse nodes. Let N be the set of all leaf nodes in the corpus; then the inverse leaf frequency is defined as:

$$ ilf(k) := \log\left(1 + \frac{|N|}{|\{n' \in N \mid k \in words(n')\}|}\right) \quad (6) $$
Then tfilf(k, nl) = tf(k, nl) × ilf(k). Note that by taking a logarithm in ilf, the importance of the tf factor is increased. The actual weight stored is a normalized version of tfilf, denoted w(k, nl), such that w is 0 if k does not appear in nl.
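Formulas (5) and (6) can be sketched directly in Java. The method names are mine, and the sketch assumes the leaf node is non-empty and that the keyword occurs in at least one leaf of the corpus (otherwise ilf would divide by zero):

```java
import java.util.*;

// A sketch of the tf and ilf formulas (5) and (6). A leaf node is represented
// as its list of words; the corpus as a list of such leaves. Names are mine.
public class TfIlf {
    // tf(k, nl): occurrences of k divided by the occurrences of the most
    // frequent word in the leaf. Assumes leafWords is non-empty.
    public static double tf(String k, List<String> leafWords) {
        Map<String, Integer> occ = new HashMap<>();
        for (String w : leafWords) occ.merge(w, 1, Integer::sum);
        int max = Collections.max(occ.values());
        return occ.getOrDefault(k, 0) / (double) max;
    }

    // ilf(k): log(1 + |N| / number of leaves containing k). Assumes k occurs
    // in at least one leaf.
    public static double ilf(String k, List<List<String>> allLeaves) {
        long containing = allLeaves.stream()
                .filter(leaf -> leaf.contains(k)).count();
        return Math.log(1.0 + allLeaves.size() / (double) containing);
    }

    public static double tfilf(String k, List<String> leaf,
                               List<List<String>> allLeaves) {
        return tf(k, leaf) * ilf(k, allLeaves);
    }
}
```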
Furthermore, the labels are taken into account as a weighting factor. Recall that each interior node is associated with a label. Each label l is associated with a weight w(l) that determines its importance. These weights can be either user-defined or system-generated. The key notion behind label weights is that the interior nodes, which determine the structure of the XML data, can also be taken into account; for instance, higher weights can be assigned to less common labels.
To incorporate both tfilf weights and label weights, the vector space model is utilized to determine how well
an answer satisfies a query. Let L be the set of all labels and let K be the set of all keywords. Each interior node
n in the data is associated with a vector Vn of size |L x K|. The vector has an entry for each pair (l,k) ∈ L x K.
Then Vn[l,k] denotes the entry of Vn corresponding to the pair (l,k). Let Nleaf be the set of leaf descendants of n. The values of Vn are defined as follows:

$$ V_n[l,k] = \begin{cases} \sum_{n' \in N_{leaf}} w(k, n') & \text{if } label(n) = l \\ 0 & \text{otherwise} \end{cases} \quad (7) $$
Note that w(k, nl) is 0 if k does not appear in nl. To be able to calculate similarities between answers and queries, there must be vectors representing the terms in the query as well. Each term t is associated with a vector of size |L x K|, denoted Vt. The similarity between a query Q and an answer N, denoted sim(Q,N), is the sum of the cosine distances between the vectors associated with the nodes in N and the vectors associated with the matching terms in Q.
As a next step, the semantics described in the previous section are extended with two weight factors. Let tsize(N) denote the number of nodes in the relationship tree of N. If this value is small, then the nodes are closer together, and therefore more likely to be meaningfully related. Additionally, the nodes n and n' are said to participate in an ancestor-descendant relationship if n is the ancestor of n' or vice versa. This indicates a strong relationship between n and n'. This notion is denoted anc-des(N), the number of unordered pairs of nodes in N that participate in an ancestor-descendant relationship.
Finally, given a query Q and an answer N, the factors sim(Q,N), tsize(N) and anc-des(N) are combined to determine the ranking of a query answer:

$$ \frac{sim(Q,N)^{\alpha}}{tsize(N)^{\beta}} \times \left(1 + \gamma \times anc\text{-}des(N)\right) \quad (8) $$
Experiments were conducted with varying α, β and γ values to gain more control over the resulting weight.
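Formula (8) is straightforward to express in code. The sketch below follows the parameter names of the text; the method name is mine:

```java
// A sketch of the combined ranking formula (8): sim(Q,N)^alpha / tsize(N)^beta
// multiplied by (1 + gamma * anc-des(N)). The class and method names are mine.
public class XSearchRank {
    public static double score(double sim, int tsize, int ancDes,
                               double alpha, double beta, double gamma) {
        return Math.pow(sim, alpha) / Math.pow(tsize, beta)
               * (1.0 + gamma * ancDes);
    }
}
```

With alpha = beta = gamma = 1, an answer with similarity 0.5, tree size 2 and one ancestor-descendant pair scores 0.5 / 2 × 2 = 0.5.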
6. Implementation
In this chapter I describe in general how I implemented the search application presented in chapter four. In the first section I describe how I created the indices using the programming language Perl. After that I focus on the search application itself, for which I used the Java programming language. First I show the object-oriented design of the search application, followed by a description of the flow of control between functional entities. Then I describe the weighted graph data structure I implemented, followed by the most prominent algorithms.
I used object-oriented constructs to design 12 separate classes, including inner classes. I also used four classes I did not author: WeightedGraph, LinkedQueue, QueueInterface and ConnectionPool. The first three are published in [3] and the ConnectionPool is available from [17]. The Perl source code is documented in appendix A and the Java source code in appendix B.
Since I use Java I have an elaborate library of functionality at my disposal. I list the Java library classes I used and the function they perform in the search application.
Finally, I describe how I deployed the application in a web-based environment.
6.1 Creating indices
For every field in the Lastgeld database I wanted to index, I executed the steps shown in listing 6.1. The Perl source code I wrote can be found in appendix A.

1. Make an alphabetically ordered SQL-dump of a field into a text file. The dump has the following form:

   (1198, 'AART'),
   (4765, 'ABE'),
   (12726, 'ABE'),
   (4790, 'ABE'), ...

2. Delete all punctuation (Perl):

   1198 AART
   4765 ABE
   12726 ABE
   4790 ABE ...

3. Place the id after the index term (Perl):

   AART 1198
   ABE 4765
   ABE 12726
   ABE 4790 ...

4. Place every similar index term on the same line (Perl):

   AART 1198 <new line>
   ABE 4765 ABE 12726 ABE 4790 ABE 1567 ABE 1246 ABE BR 6900 ABE BR 12471 ABE BR 11380 ABE BR 857 ABE BROER 5986 ABE BROERSZ 9008 <new line>
   ABEL 6545 ABEL BAS 3553 <new line>
   ABLBERT 7581 ABLBERT J 10065 <new line>

5. Delete all the terms except the term that starts the line (Perl):

   AART 1198
   ABE 4765 12726 4790 1567 1246 6900 12471 11380 857 5986 9008
   ABEL 6545 3553
   ABLBERT 7581 10065

Listing 6.1: creating indices
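The steps above essentially group id-numbers by index term. The Java sketch below shows the same grouping on in-memory dump lines; all names are mine, and the prefix-based grouping of step 4 (where a term such as ABE BR is merged under ABE) is not reproduced, identical terms are simply grouped. The actual implementation uses the Perl scripts in appendix A.

```java
import java.util.*;

// A sketch of the index-building idea: parse dump lines of the form
// "(1198, 'AART')," and group the id-numbers per term, sorted alphabetically.
public class IndexBuilder {
    public static Map<String, List<Integer>> build(List<String> dumpLines) {
        Map<String, List<Integer>> index = new TreeMap<>(); // sorted by term
        for (String line : dumpLines) {
            // Step 2: delete all punctuation (replaced by spaces here).
            String cleaned = line.replaceAll("\\p{Punct}", " ").trim();
            // Steps 3-5: split into id and term, then group id under term.
            String[] parts = cleaned.split("\\s+", 2);
            int id = Integer.parseInt(parts[0]);
            String term = parts[1].trim();
            index.computeIfAbsent(term, t -> new ArrayList<>()).add(id);
        }
        return index;
    }
}
```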
6.2 Object oriented design
The UML-diagram depicted in figure 6.1 shows an overview of the classes I created to implement the search
application.
Figure 6.1: object oriented design
I declared Firstname, Lastname, Date, Harbor and Country to be inner classes of Lastgeld because they are entities that belong to a single semantic unit, which can be thought of as a table. Grouping them within the Lastgeld class keeps them together, while retaining control over instantiating them separately.
I defined the Graphable interface to make sure that every object added to the graph has the same properties. The inner classes implement this interface, and thus they must implement every method I declared in it. The interface construct helps guarantee that all objects share the same properties, which can prevent inconsistencies and possible errors.
All instance variables are declared private or protected. This guarantees that all interaction with the variables from outside the class goes through public methods.
6.3 Flow of control
In this section I explain in general in what order control is passed between the classes depicted in figure 6.1, given a keyword query and a formulated answer.
The WebFrontend class is instantiated by the Tomcat web server when it receives a URL request from the web browser. The web browser displays an input field for a user to enter a keyword query and sends an HTTP request to the service method of the WebFrontend class. Here the keyword query is extracted and a Database object is created, with the keywords handed over as arguments to its constructor. The created Database object instantiates, for every keyword in the query, a new Lastgeld object. Every Lastgeld object instantiates five Connector objects, one for every index stored on disk; e.g. if a keyword query of three keywords is entered, three Lastgeld objects are created and thus 15 Connector objects in total. Each Connector object is associated with one keyword and sequentially searches one index. If there is a match, the id-numbers associated with that term in the index are stored in a list and returned to the Lastgeld object that created that Connector object. Given a match in a particular index and the id-numbers associated with that match, the Lastgeld object creates one of its inner classes (Firstname, Lastname, Country, Date or Harbor), depending on the index in which the keyword match is found. The id-numbers found in the indices are associated with these inner class objects. All of the inner class objects are stored in a list and given back to the Database object.
Then the Database object instantiates a Grapher object to store these objects as vertices in a WeightedGraph object, which is instantiated by the Grapher object. The Grapher object retrieves all vertices present in the WeightedGraph object and intersects the id-numbers associated with every possible unique combination of two vertices; where two vertices have id-numbers in common, an edge is created and the shared id-numbers are associated with that edge. The vertex with the most edges is retrieved, and the id-numbers associated with every edge of this vertex are intersected again. This yields a list of results, which is passed back to the Database object; the Database object in turn passes it back to the WebFrontend. The WebFrontend instantiates a ConnectionPool object, which grants access to the MySQL database, and formulates one or more SQL queries based on the id-numbers returned by the Database object. The answers to these queries are passed back via the WebFrontend's HTTP response parameter, and the results are presented to the user.
6.4 Weighted graph data structure
Recall from section 3.4 that all keywords of the keyword query found in the indices become vertices in the graph, and that the id-numbers associated with a keyword in the index are associated with that keyword's vertex. An edge exists between a pair of vertices if the two vertices have at least one id-number in common. If this is the case, an edge between that pair of vertices is created and all the id-numbers these two vertices have in common are associated with the edge as its weight.
To be able to use the weighted graph data structure as described in Dale et al. [3], I needed to modify the data structure at its core. The edges in the weighted graph data structure presented by Dale et al. consist of single integer values. However, I needed the weights to be lists of integer values. As a result I modified the data structure as depicted in figure 6.2.
Figure 6.2: weighted graph data structure
Note that the vertices are Graphable objects; they implement the Graphable interface I designed. Figure 6.1 shows that the inner classes of Lastgeld (Firstname, Lastname, Harbor, Country and Date) implement this interface.
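The modification can be sketched as a graph whose edge weights are lists of shared id-numbers. The class below is a minimal illustration of that idea only; it is neither Dale et al.'s code nor the exact class used in the application, and all names are mine.

```java
import java.util.*;

// A sketch of an undirected graph whose edge "weights" are lists of shared
// id-numbers rather than single integers, as described in section 6.4.
public class IdListGraph<V> {
    private final List<V> vertices = new ArrayList<>();
    private final Map<V, Map<V, List<Integer>>> edges = new HashMap<>();

    public void addVertex(V v) {
        vertices.add(v);
        edges.put(v, new HashMap<>());
    }

    // Store the shared id-numbers as the weight of an undirected edge.
    public void addEdge(V a, V b, List<Integer> sharedIds) {
        edges.get(a).put(b, sharedIds);
        edges.get(b).put(a, sharedIds);
    }

    public List<Integer> weight(V a, V b) {
        return edges.get(a).getOrDefault(b, Collections.emptyList());
    }

    // Number of edges incident to v, used to find the best-connected vertex.
    public int degree(V v) {
        return edges.get(v).size();
    }
}
```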
6.5 Algorithms
The core functionality of the presented application can be summarized as: searching indices, intersecting id-numbers, finding combinations of intersections and finding the vertex with the most edges. I describe these subjects in more detail in the next subsections.
6.5.1 Searching indices
I decided to store the indices as text files, as this appeared to be the most straightforward implementation. As a result the indices are searched sequentially. Recall that every line in an index starts with a term followed by a sequence of id-numbers. The StringTokenizer object takes two parameters: the line currently read and a delimiter, which is a space in the indices I created.
while (inLine != null) {
    tokenizer = new StringTokenizer(inLine, " ");
    term = tokenizer.nextToken();
    if (term.equals(keyword)) {
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            Integer id = Integer.valueOf(token);
            list.add(id);
        }
    }
    inLine = inFile.readLine();
}
6.5.2 Intersecting id-numbers
The id-numbers associated with each graphable object are stored in an ArrayList. To be able to intersect two
ArrayLists containing id-numbers I use ArrayList‟s retainall method. An ArrayList object inherits this method
from the AbstractCollection class. The source code is straightforward, but like many Java classes, this method
makes use of other classes to get the job done. Since ArrayList is a Collection, the “contains” method is available
for the actual intersection. To be able to loop through the items in the ArrayList an iterator object is used.
public boolean retainAll(Collection c) {
    boolean modified = false;
    Iterator<E> e = iterator();
    while (e.hasNext()) {
        if (!c.contains(e.next())) {
            e.remove();
            modified = true;
        }
    }
    return modified;
}

public boolean contains(Object o) {
    Iterator<E> e = iterator();
    if (o == null) {
        while (e.hasNext())
            if (e.next() == null)
                return true;
    } else {
        while (e.hasNext())
            if (o.equals(e.next()))
                return true;
    }
    return false;
}
JDK source 5.0 [14]
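A short usage sketch of this intersection, as the search application applies it when creating an edge between two vertices. The helper class and method names are mine:

```java
import java.util.*;

// Intersect two id-lists with retainAll. A copy is made first, because
// retainAll mutates the list it is called on.
public class IntersectDemo {
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>(a);
        result.retainAll(b);
        return result;
    }
}
```

Intersecting the id-lists [4765, 12726, 4790] and [4790, 9999, 4765] yields [4765, 4790], preserving the order of the first list.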
6.5.3 Finding combinations
All vertices are stored in an ArrayList, and each vertex is accessible via an index of the ArrayList. To intersect all pairs of vertices and create edges, all possible combinations of ArrayList indices must be obtained. These combinations can be calculated from the size of the ArrayList that contains the vertices. This size is the parameter of the following method I created:
private int[][] calculateCombinations(int c) {
    int[][] combinations = new int[c * c / 2][2];
    int store = 0;
    int number = c;
    for (int b = 1; b < number; b++) {
        int y = b - 1;
        int i = 1;
        while (i != number - y) {
            combinations[store][0] = i - 1;
            i = i + b;
            combinations[store][1] = i - 1;
            int x = b - 1;
            i = i - x;
            store++;
        }
    }
    return combinations;
}
For instance, if the int 3 is passed, the method returns the combinations 12, 23 and 13.
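For comparison, the same set of index pairs can be produced by two nested loops. The sketch below is an alternative illustration of the idea, not the method used in the application:

```java
import java.util.*;

// Enumerate all unordered pairs of indices 0..c-1 with a pair of nested
// loops: for each i, pair it with every j greater than i.
public class Combinations {
    public static List<int[]> pairs(int c) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < c; i++)
            for (int j = i + 1; j < c; j++)
                result.add(new int[]{i, j});
        return result;
    }
}
```

For c = 3 this yields the pairs (0,1), (0,2) and (1,2), i.e. the same three combinations in 0-based form.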
6.5.4 Retrieving the vertex with the most edges
The following fragment retrieves the vertex with the most edges. The vertices are retrieved from the graph using the getToVertices method belonging to the graph object, which is shown after the fragment.

LinkedQueue edges;
Graphable item;
Graphable biggest = null;
int biggestSize = 0;
for (int i = 0; i < vertices.size(); i++) {
    int size = 0;
    item = (Graphable) vertices.get(i);
    edges = graph.getToVertices(item);
    size = edges.size();
    if (size > biggestSize) {
        biggestSize = size;
        biggest = item;
    }
}
return biggest;

QueueInterface getToVertices(Graphable vertex) {
    QueueInterface adjVertices = new LinkedQueue();
    int fromIndex;
    int toIndex;
    fromIndex = indexIs(vertex);
    for (toIndex = 0; toIndex < numVertices; toIndex++)
        if (edges[fromIndex][toIndex] != NULL_EDGE)
            adjVertices.enqueue(vertices[toIndex]);
    return adjVertices;
}
Part of WeightedGraph.java, Dale et al. [3]
6.6 Java library classes
I used several classes available in the Java library. In this section I list these classes and explain in what way I used them.
java.io.BufferedReader: I use the BufferedReader class to retrieve data from the text files that store the indices. A BufferedReader object has a method named readLine to read a text file line by line.
java.io.InputStreamReader: A BufferedReader object cannot read text on its own; it needs an InputStreamReader object to bridge from a byte stream to a character stream.
java.io.IOException: An IOException object is created if anything goes wrong in the process of reading text files.
java.io.PrintWriter: This class prints formatted representations of objects to a text-output stream. When a servlet (Java web class) is called in a web browser, a request and a response object are sent to that servlet. The response object has a method named getWriter, which delivers a PrintWriter object to the servlet. In turn this object is necessary to print output to the web browser.
java.util.ArrayList: I have used ArrayLists extensively when passing data from object to object. An advantage of an ArrayList is that it can store any kind of object. Another advantage is that an ArrayList dynamically allocates space, i.e. it increases in size if it becomes full and decreases in size if objects are deleted from it.
java.util.Collections: I use the static sort method from the Collections class to sort the id-numbers associated with a Graphable object. The sorting algorithm is a modified version of mergesort.
java.util.StringTokenizer: A StringTokenizer cuts the lines read by the BufferedReader into single tokens.
java.net.URL: I use the URL class to define where the search application can find the indices.
java.net.URLConnection: A URLConnection object is returned when a URL object calls its method openConnection.
java.sql.Connection: To access the MySQL database there needs to be a connection to it. A Connection object establishes such a connection.
java.sql.Statement: An SQL statement is stored in a Statement object.
java.sql.ResultSet: SQL results are stored in a ResultSet object.
java.sql.SQLException: An SQLException object is created if anything goes wrong in the process of executing an SQL statement.
javax.servlet.http.HttpServlet: The WebFrontend class I wrote extends the HttpServlet class and thereby inherits all the methods from HttpServlet. This way I profit from the methods already written in HttpServlet.
javax.servlet.http.HttpServletRequest: An HttpServletRequest object is created when a URL is directed at the WebFrontend class. I use this object to get the information sent in the request, for instance the part of the URL of a search request: "web/Lastgeld?search_text=cornelis+vos".
javax.servlet.http.HttpServletResponse: An HttpServletResponse object is necessary to retrieve a PrintWriter object. Recall that a PrintWriter object is necessary to print output to the web browser.
Table 6.1: Used Java Library classes - Java 1.4.2 API specification [13]
6.7 Web deployment
6.7.1 Specification
Host: a slice of a 2.4 GHz processor and 512 MB of RAM running on a Linux 2.6.26 Xen VPS
Relational database: MySQL 5.1 Community Server [18]
Webserver: Apache Tomcat 5.5 [16]
Java Database Connectivity (JDBC) driver: Connector/J 5.1 [19]
Connection pool: a Java class for pre-allocating, recycling, and managing JDBC connections [17]
Java Runtime Environment: JRE 6 [20]
Table 6.2: web deployment specification
6.7.2 User interface
The user interface I created looks like this:
Figure 6.3: user interface search application
There is nothing difficult about this interface. The only function that needs a short introduction is the "verbose" option. The verbose option executes the search request while printing the results of some important intermediate steps that lead to a final query answer. (Warning: this option will flood your screen in some cases.)
6.7.3 Web address
I host the search application at the following web address:
http://daniel.adixhosting.nl/web/lastgeld
6.7.4 The search application doesn't work
A lot can go wrong when one runs a not-so-thoroughly-tested database-driven Java web application on an
external computer with very modest resources. Therefore I post my e-mail address:
[email protected]
I will gladly try to solve the problem at hand. Having expressed this, I end the chapter.
7. Epilogue
7.1 Conclusions
I presented a search application that enables keyword-based search in the available Lastgeld database. I
employed several techniques to be able to retrieve meaningful answers to queries consisting of multiple
keywords.
These answers are based on the following assumption: if the keywords in a keyword query are related in the relational database that contains the Lastgeld data, then retrieving this relating data yields results that are likely to be meaningful given the keyword query. My objective in this thesis was to show that keyword-based search in a relational database can yield meaningful results given the available Lastgeld data.
I have presented several keyword queries and the results retrieved by the presented search application. These
results show that although the control and exactness of SQL queries is lost, the results are likely to be
meaningful given the available Lastgeld data.
In the experiments I observed that the presented search application can be demanding in some cases. Based on the experiments I conclude that keyword query execution time increases as the intersection work that needs to be performed by the search application increases. This work depends on the number of keywords in the query and the number of id-numbers associated with these keywords in the indices. I also conclude that the more SQL queries the search application proposes to the relational database, the more the overall query execution time increases.
7.2 Discussion
The presented search application demonstrates one of many possible solutions for obtaining keyword-based
search in a relational database. However, the presented solution is mainly designed for the data available. I used
only two tables to show how relations can be combined in a graph data structure. In chapter five, where I describe related work, it is clear that the graph data structure is employed in very different ways. It appears to be more common to designate an entire tuple to be a vertex, instead of designating a field in a tuple to be a vertex the way I did. This way edges between vertices represent relations between tables, and finding relating data operates on a graph model of relating tables, not on a graph model of relating data within tuples, the way I have used the graph data structure. My approach seems far more difficult to scale to many tables, while solutions based on an entire tuple being a vertex connecting to other vertices that are tuples in other tables appear to be a more generalizable approach to modeling data stored in an undefined number of tables. I designed the presented search application to work with the available data, which does not have a rich relational structure by means of many relating tables.
The experiments indicate that the presented search application can be increasingly demanding in terms of query processing. The way id-lists are intersected to establish edges between vertices in the graph data structure can consume considerable execution time in some cases, as can proposing multiple queries to the relational database after the search application has finished processing. Although efficiency can be improved, it is unclear whether these inherently demanding constructs are acceptable in practice.
By trading query syntax like SQL for just some keywords, control over retrieving exact answers is lost. The spaces between keywords do not per se imply a relation between the keywords, but they do not imply there is no relation either. I chose to find answers to multiple keywords in a query by incorporating all the keywords as a relating phrase based on the available data. This yields meaningful results; however, this remains relative to the interpretation of the meaning of the spaces between keywords.
7.3 Proposals for future work
Efficiency can be improved by revising the index structure. Currently, every index is an alphabetical list per database field; the first-name index, for example, alphabetically lists all first names from A to Z. For every letter of the alphabet, such an index can be partitioned into sub-indices. Then the appropriate sub-indices can be retrieved given the first letters of the keywords, making the sequential search process less demanding in terms of processing.
To avoid the execution of the 5600 SQL queries a keyword query "Germany" initiates, it is possible to generate just one query containing multiple OR clauses, for instance: SELECT * FROM Skippers WHERE id='1' OR id='2' OR id='3' (or, more compactly, WHERE id IN ('1', '2', '3')). Still, in the case of the keyword query "Germany" this approach will yield a very long query.
Another solution to this problem is to take alternative steps for keyword queries that contain only one keyword, because the search application will likely propose many SQL queries in the case of a single keyword. A possible scenario is to bypass the search application entirely and search the indices separately for this keyword. If it is known that the keyword occurs in index A, then a more appropriate SQL query can be formulated given this knowledge.
The retrieval of meaningful results may be improved by traversing the employed weighted graph data structure. Currently the vertex with the most edges is retrieved; this approach retrieves the tuples in which all the keywords occur together. If the graph were traversed starting at different vertices, then different kinds of relating data could be retrieved. This way AND, OR and NOT semantics could be introduced into the application. An order of importance could also be established between vertices that have just one edge, vertices that have two edges, and so on.
However, to be able to evaluate the benefit of any kind of functional variation within the presented
application, it is necessary to assess the effectiveness of the application based on actual information needs as
well.
References
[1] B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, P. Parag, S.Sudarshan. BANKS: browsing
and keyword searching in relational databases. In Proceedings of the 28 th international conference on
Very Large Databases, 2002. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1290000/1287473/p1083aditya.pdf?key1=1287473&key2=0346186421&coll=ACM&dl=ACM&CFID=42949402&CFTOKEN=9
7092769, visited on may 14th, 2009.
[2] S.Cohen, J. Mamou. Y. Kanza, Y. Sagiv. XSearch: a semantic search engine for XML. In Proceedings of
the 29th international conference on Very large databases, vol. 29, 2003. http://delivery.acm.org.proxyub.rug.nl/10.1145/1320000/1315457/p45cohen.pdf?key1=1315457&key2=8557427421&coll=ACM&dl=ACM&CFID=43812128&CFTOKEN=5
1392105, visited on June 28th, 2009.
[3] N.Dale, D.T Joyce, C.Weems. Object-oriented data structures using Java. ISBN 0-7637-1079-2. Jones
and Bartlett Publishers International, London, 2002.
[4] B.B. Dalvi, M. Kshirsagar, S. Sudarshan. Keyword search on external memory graphs. In Proceedings of
the VLDB Endowment, Vol 1, Issue 1, 2008. http://delivery.acm.org.proxyub.rug.nl/10.1145/1460000/1453982/p1189dalvi.pdf?key1=1453982&key2=2831696421&coll=ACM&dl=ACM&CFID=42949402&CFTOKEN=97
092769, visited on July 3th, 2009.
[5] D. Floresqu, D. Kossmann, I. Manolescu. Integrating keyword search into XML query processing. In
Computer Networks, Vol. 33, 2000. http://www.sciencedirect.com.proxyub.rug.nl/science?_ob=MImg&_imagekey=B6VRG-40B2JGR-C11&_cdi=6234&_user=4385132&_orig=search&_coverDate=06%2F30%2F2000&_sk=999669998&vie
w=c&wchp=dGLbVlW-zSkWA&md5=c122eafe4481c853b1287351178b8472&ie=/sdarticle.pdf, visited
on May 17th, 2009.
32
[6] V. Hristidis, Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1290000/1287427/p670-hristidis.pdf, visited on May 15th, 2009.
[7] V. Hristidis, L. Gravano, Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. In Proceedings of the 29th International Conference on Very Large Data Bases, Vol. 29, 2003. http://portal.acm.org.proxy-ub.rug.nl/citation.cfm?id=1453856.1453887, visited on May 15th, 2009.
[8] L. Lanzani. Discrete Mathematics, chapter 11, Graph Theory, 2008. http://comp.uark.edu/~lanzani/2103NOTES/11.1-11.2.pdf, visited on July 6th, 2009.
[9] G. Li, B.C. Ooi, J. Feng, J. Wang, L. Zhou. EASE: An Effective 3-in-1 Keyword Search Method for Unstructured, Semi-structured and Structured Data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1380000/1376706/p903-li.pdf, visited on May 21st, 2009.
[10] F. Liu, C. Yu, W. Meng, A. Chowdhury. Effective Keyword Search in Relational Databases. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1150000/1142536/p563-liu.pdf, visited on May 21st, 2009.
[11] C.D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. ISBN 978-0-521-86571-5. Cambridge University Press, New York, 2008.
[12] L. Paoletti. Leonard Euler's Solution to the Königsberg Bridge Problem. http://mathdl.maa.org/mathDL/46/?pa=content&sa=viewDocument&nodeId=1310&bodyId=1452, visited on July 6th, 2009.
[13] Sun.com. Java 2 Platform, Standard Edition, v 1.4.2 API Specification. http://java.sun.com/j2se/1.4.2/docs/api/overview-summary.html, visited on July 18th, 2009.
[14] Sun.com. JDK Source 5.0. https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/ViewProductDetail-Start (SDN registration required).
[15] G.M. Welling. The Prize of Neutrality: Trade Relations between Amsterdam and North America 1771-1817, 1998. http://dissertations.ub.rug.nl/FILES/faculties/arts/1998/g.m.welling/thesis.pdf, visited on August 11th, 2009.
Software used in the implementation as described in sections 5.2 and 5.7.1.
[16] Apache Tomcat 6.0. http://tomcat.apache.org/. (open source)
[17] ConnectionPool.java http://archive.coreservlets.com/coreservlets/ConnectionPool.java (freely available)
[18] MySQL Community Server 5.0. http://dev.mysql.com/downloads/. (open source)
[19] MySQL Connector/J 5.1. http://dev.mysql.com/downloads/connector/j/5.1.html (open source)
[20] Sun. Java Runtime Environment 6. http://java.sun.com/javase/downloads/index.jsp (freely available)
Appendix A: creating indices in Perl
# Step 1 Creating indices
# Name: step1.pl
# Author: Daniel Suelmann
# Effect: Deletes punctuation from an SQL-dump.
use strict;
my $readfile = shift(@ARGV);
chomp($readfile);
open(FILE, $readfile) or die "Cannot open $readfile: $!";
open(RESULT, ">>inputtostep2.txt") or die "Cannot open: $!";
while (<FILE>) {
    my $line = $_;
    chomp($line);
    $line =~ s/[[:punct:]]//g;
    print RESULT "$line \n";
}
close FILE;
close RESULT;
---Result:
1198 AART
4765 ABE
12726 ABE
4790 ABE
1567 ABE >>
# Step 2 Creating indices
# Name: step2.pl
# Author: Daniel Suelmann
# Effect: Puts the id-numbers after the index term.
use strict;
my $numbers;
my $readfile = shift(@ARGV);
chomp($readfile);
open(FILE, $readfile) or die "Cannot open $readfile: $!";
open(RESULT, ">>inputtostep3.txt") or die "Cannot open: $!";
while (<FILE>) {
    my $line = $_;
    chomp($line);
    $line =~ s/([0-9]+)//g;
    $numbers = $1;
    print RESULT "$line $numbers\n";
}
close FILE;
close RESULT;
---Result:
AART 1198
ABE 4765
ABE 12726
ABE 4790
ABE 1567 >>
# Step 3 Creating indices
# Name: step3.pl
# Author: Daniel Suelmann
# Effect: Puts every similar index term on the same line.
use strict;
my $prevline = "";
my $string = "";
my $numbers;
my $readfile = shift(@ARGV);
chomp($readfile);
open(FILE, $readfile) or die "Cannot open $readfile: $!";
open(RESULT, ">>inputfinalstep.txt") or die "Cannot open: $!";
while (<FILE>) {
    my $line = $_;
    chomp($line);
    $line =~ /([A-Z]+)/;
    if ($prevline eq $1) {
        print RESULT "$line";
    } else {
        print RESULT "\n$line";
    }
    $prevline = $1;
}
close FILE;
close RESULT;
---Result:
AART 1198 <new line>
ABE 4765 ABE 12726 ABE 4790 ABE 1567 ABE 1246 ABE BR 6900 ABE BR 12471 ABE
BR 11380 ABE BR 857 ABE BROER 5986 ABE BROERSZ 9008 <new line>
ABEL 6545 ABEL BAS 3553 <new line>
ABLBERT 7581 ABLBERT J 10065 <new line>
ABRAHAM 3410 ABRAHAM 4434 ABRAHAM 6303 ABRAHAM 5806 ABRAHAM 9355 ABRAHAM 9064
ABRAHAM 6687 ABRAHAM 8444 ABRAHAM 9828 <new line>
# Step 4 Creating indices
# Name: step4.pl
# Author: Daniel Suelmann
# Effect: Deletes all the terms except the term that starts the line.
use strict;
my $string;
my $readfile = shift(@ARGV);
chomp($readfile);
open(FILE, $readfile) or die "Cannot open $readfile: $!";
open(RESULT, ">>resultsfinal.txt") or die "Cannot open: $!";
while (<FILE>) {
    my $line = $_;
    chomp($line);
    $line =~ /([A-Z ]+)/;
    $string = $1;
    $line =~ s/([A-Z ]+)/ /g;
    print RESULT $string;
    print RESULT $line . "\n";
}
close FILE;
close RESULT;
---Result:
AART
1198
ABE
4765 12726 4790 1567 1246 6900 12471 11380 857 5986 9008
ABEL
6545 3553
ABLBERT
7581 10065
ABRAHAM
3410 4434 6303 5806 9355 9064 6687 8444 9828
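For illustration only, the four steps above can be collapsed into a single in-memory pass. The sketch below (written in Java, the language of the search application, rather than Perl) builds the same term-to-id-numbers index from input lines of the form `1198 AART`; the class name `IndexSketch` is mine and not part of the thesis code:

```java
import java.util.*;

public class IndexSketch {
    // Build an inverted index: term -> list of record id-numbers.
    public static Map<String, List<Integer>> build(List<String> dumpLines) {
        Map<String, List<Integer>> index = new TreeMap<String, List<Integer>>();
        for (String line : dumpLines) {
            // Steps 1+2: strip punctuation, split off the leading id-number.
            String cleaned = line.replaceAll("\\p{Punct}", "").trim();
            String[] parts = cleaned.split("\\s+", 2);
            if (parts.length < 2) continue;
            int id = Integer.parseInt(parts[0]);
            String term = parts[1];
            // Steps 3+4: group all id-numbers under the same term.
            List<Integer> ids = index.get(term);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                index.put(term, ids);
            }
            ids.add(id);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1198 AART", "4765 ABE", "12726 ABE");
        System.out.println(build(lines)); // {AART=[1198], ABE=[4765, 12726]}
    }
}
```

The thesis pipeline instead streams an SQL dump through intermediate text files, which scales better than holding the whole index in memory.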
Appendix B: search application source code in Java
Note that I do not try to catch exceptions. Each search request restarts the application, so if an exception occurs, it occurs for that one search request only. The application throws the exception out to the web server, which in turn prints the exception in a readable and understandable format in the user's web browser.
/* Name: Database.java
 * Author: Daniël Suelmann
 * Effect:
 * main scenario:
 * 1. Put the keywords in a list;
 * 2. See whether they match the table fields;
 * 3. Receive the matches;
 *    3.1 if the matches are from a single keyword, the job is done.
 *    3.2 if not, continue.
 * 4. Put the matches in a graph.
 * 5. Add edges -between vertices- to the graph, based on id resemblance.
 * 6. Find out which vertex has the most edges: this is probably the most important one.
 * 7. Intersect the id-numbers associated with each edge; the resulting id-numbers
 *    are the result id-numbers, which are given back to the servlet that called
 *    this method from within the web environment.
 */
package standalone;
import java.io.IOException;
import java.util.ArrayList;
public class Database {
private Grapher grapher = new Grapher();
public static ArrayList<ArrayList<Graphable>> dispatch(String[] array) throws
IOException{
/* effect: creates an instance of Lastgeld
* for each keyword in the arguments array.
* effect: dispatches the keywords in requests
* to the text files.
* incoming collaboration: receives an array
* of keywords from Database's main.
* outgoing collaboration: sends a single
* keyword to Lastgeld's request.
* incoming collaboration: receives a list
* with object references from Lasgeld's request.
* outgoing collaboration: sends a list of lists
* with object references to Database's main.
*/
ArrayList<ArrayList<Graphable>> list = new
ArrayList<ArrayList<Graphable>>();
for(int i = 0; i < array.length; i++){
Lastgeld lg = new Lastgeld();
list.add(lg.request(array[i]));
}
return list;
}
public ArrayList<Integer> getResult(String[] str) throws IOException {
ArrayList <ArrayList<Graphable>> keywords;
ArrayList<Integer> result = new ArrayList<Integer>();
String[] input = str;
keywords = dispatch(input);
/*If there's only one keyword entered there's no
*need to set up a graph.
*The triple loop gets lists in the list (loop 1),
*these lists contain objects (loop 2), these
*objects have (first name, last name, harbor,
*country) and the last loop gets id-numbers from
*within the objects.
* */
if (input.length == 1){
if(keywords != null){
for (int i = 0; i < keywords.size(); i++){
ArrayList<Graphable> list = (ArrayList<Graphable>)
keywords.get(i);
if(list != null){
for(int j = 0; j < list.size(); j++){
Graphable item = (Graphable)
list.get(j);
if(item != null){
ArrayList<Integer> ids =
(ArrayList<Integer>)item.getIDs();
if (ids !=null){
for(int k = 0; k <
ids.size(); k++){
result.add(ids.get(k));
}
}
else{
System.out.println("No id-numbers in the list");
}
}
}
}
else{
System.out.println("emptylist");
}
}
}
else {
System.out.println("There are no objects to start with");
}
}
/* If there are more than one keyword
* entered; a graph is created;
* The graph is populated by inserting
* the keyword objects(vertices) into
* the graph. The edges in between two
* vertices are created; The edges are
* printed (for my administration
* output is returned to the Tomcat server
* console); And finally, the result set
* is determined.
*/
else {
if (!keywords.isEmpty()){
grapher.populate(keywords);
grapher.addEdges();
grapher.printEdgesToConsole();
result = (ArrayList<Integer>) grapher.findResult();
}
}
return result;
}
public ArrayList<Graphable> getVertices(){
ArrayList <Graphable> list = grapher.getVertices();
return list;
}
public ArrayList<Graphable> getIntersections(){
ArrayList <Graphable> list = grapher.getIntersections();
return list;
}
public WeightedGraph getGraph(){
WeightedGraph graph = grapher.getGraph();
return graph;
}
public ArrayList<Object> getCombinedIntersections(){
ArrayList<Object> list = grapher.getCombinedIntersections();
return list;
}
/* Comment on the reset methods:
* In between search requests all objects must
* be deleted otherwise data structures are
* flooded with mixed result data.
*/
public void resetCombinedIntersections(){
grapher.resetCombinedIntersections();
}
public void resetIntersections(){
grapher.resetIntersections();
}
public void resetGraph(){
grapher.resetGraph();
}
public void resetVertices(){
grapher.resetVertices();
}
public void resetListOfLists(){
grapher.resetListOfLists();
}
}
/* Name: Lastgeld.java
* Author: Daniël Suelmann
* Effect:
* Every Lastgeld object instantiates five
* Connector objects, one for every index
* stored on disk. The Connector objects return
* lists with matching id-numbers. Per match
* an inner class object is instantiated
* depending on which index matched the
* keyword. All object references are gathered
* and sent back to Database.
*/
package standalone;
import java.io.IOException;
import java.util.ArrayList;
public class Lastgeld {
//inner class
protected class Firstname implements Graphable{
protected String content;
protected final String type = "lastgeldfirstname";
protected ArrayList<Integer> ids;
public String toString(){
return content;
}
public ArrayList<Integer> getIDs(){
return ids;
}
public String getType(){
return this.type;
}
public boolean isEqual(Graphable object){
if (object.getType().equals(this.type) &&
object.toString().equals(this.content)){
return true;
}
else
return false;
}
}
//inner class
protected class Lastname implements Graphable{
protected String content;
protected final String type = "lastgeldlastname";
protected ArrayList<Integer> ids;
public String toString(){
return content;
}
public ArrayList<Integer> getIDs(){
return ids;
}
public String getType() {
return this.type;
}
public boolean isEqual(Graphable object){
if (object.getType().equals(this.type) &&
object.toString().equals(this.content))
return true;
else
return false;
}
}
//inner class
protected class Harbor implements Graphable{
protected String content;
protected final String type = "lastgeldharbor";
protected ArrayList<Integer> ids;
public String toString(){
return content;
}
public ArrayList<Integer> getIDs(){
return ids;
}
public String getType() {
return this.type;
}
public boolean isEqual(Graphable object){
if (object.getType().equals(this.type) &&
object.toString().equals(this.content))
return true;
else
return false;
}
}
//inner class
protected class Date implements Graphable{
protected String content;
protected final String type = "lastgelddate";
protected ArrayList<Integer> ids;
public String toString(){
return content;
}
public ArrayList<Integer> getIDs(){
return ids;
}
public String getType() {
return this.type;
}
public boolean isEqual(Graphable object){
if (object.getType().equals(this.type) &&
object.toString().equals(this.content))
return true;
else
return false;
}
}
//inner class
protected class Country implements Graphable{
protected String content;
protected final String type = "lastgeldcountry";
protected ArrayList<Integer> ids;
public String toString(){
return content;
}
public ArrayList<Integer> getIDs(){
return ids;
}
public String getType() {
return this.type;
}
public boolean isEqual(Graphable object){
if (object.getType().equals(this.type) &&
object.toString().equals(this.content))
return true;
else
return false;
}
}
protected static Firstname firstname;
protected static Lastname lastname;
protected static Harbor harbor;
protected static Date date;
protected static Country country;
// constructor of the table
public Lastgeld(){
/* effect: Instantiates Lastgeld's inner
* objects Firstname, Lastname, Date,
* Harbor, Country and points references
* to them.
*/
firstname = new Firstname();
lastname = new Lastname();
harbor = new Harbor();
date = new Date();
country = new Country();
}
private static ArrayList<Graphable> populate(ArrayList<ArrayList<Integer>>
lists, String keyword){
/* effect: attaches id-numbers to the instance
* variables of Lastgeld's inner classes.
* effect: attaches the keyword to the instance
* variables of Lastgeld's inner classes.
* incoming collaboration: receives a keyword
* from Lastgeld's request.
* incoming collaboration: receives a list of id
* lists from Lastgeld's request.
* outgoing collaboration: sends a list with
* object references to the fields to Lastgeld's
* request.
*/
ArrayList<Graphable> objectrefs = new ArrayList<Graphable>();
ArrayList<Integer> list0 = (ArrayList<Integer>) lists.get(0);
ArrayList<Integer> list1 = (ArrayList<Integer>) lists.get(1);
ArrayList<Integer> list2 = (ArrayList<Integer>) lists.get(2);
ArrayList<Integer> list3 = (ArrayList<Integer>) lists.get(3);
ArrayList<Integer> list4 = (ArrayList<Integer>) lists.get(4);
if(!list0.isEmpty()){
firstname.content = keyword;
firstname.ids = (ArrayList<Integer>) lists.get(0);
objectrefs.add(firstname);
}
if(!list1.isEmpty()){
lastname.content = keyword;
lastname.ids = (ArrayList<Integer>) lists.get(1);
objectrefs.add(lastname);
}
if(!list2.isEmpty()){
harbor.content = keyword;
harbor.ids = (ArrayList<Integer>) lists.get(2);
objectrefs.add(harbor);
}
if(!list3.isEmpty()){
date.content = keyword;
date.ids = (ArrayList<Integer>) lists.get(3);
objectrefs.add(date);
}
if(!list4.isEmpty()){
country.content = keyword;
country.ids = (ArrayList<Integer>) lists.get(4);
objectrefs.add(country);
}
return objectrefs;
}
public static ArrayList<Graphable> request(String s) throws IOException{
/* effect: dispatches calls to the connector to
 * access the text files.
 * incoming collaboration: receives a keyword
 * from Database's dispatch.
* incoming collaboration: receives lists of
* id-numbers (associated with a single keyword)
* from Connectors read method.
* outgoing collaboration: sends a list of id-lists
* to Lastgeld's populate.
* outgoing collaboration: sends the list with
* object references to Database's dispatch
*/
ArrayList<ArrayList<Integer>> lists = new
ArrayList<ArrayList<Integer>>();
ArrayList<Integer> list0 = null;
ArrayList<Integer> list1 = null;
ArrayList<Integer> list2 = null;
ArrayList<Integer> list3 = null;
ArrayList<Integer> list4 = null;
String keyword = s;
Connector con0 = new Connector("vnfreq.txt");
list0 = (ArrayList<Integer>) con0.read(s);
Connector con1 = new Connector("anfreq.txt");
list1 = (ArrayList<Integer>) con1.read(s);
Connector con2 = new Connector("havenfreq.txt");
list2 = (ArrayList<Integer>) con2.read(s);
Connector con3 = new Connector("datefreq.txt");
list3 = (ArrayList<Integer>) con3.read(s);
Connector con4 = new Connector("countryfreq.txt");
list4 = (ArrayList<Integer>) con4.read(s);
lists.add(list0);
lists.add(list1);
lists.add(list2);
lists.add(list3);
lists.add(list4);
return populate(lists, keyword);
}
}
/* Name: Connector.java
* Author: Daniël Suelmann
* Effect:
* Every Lastgeld object instantiates five Connector objects,
* one for every index stored on disk, e.g. if a keyword
* query of three keywords is entered, three Lastgeld objects
* are created and thus in total 15 Connector objects are created,
* each Connector object is associated with one keyword and
* sequentially searches one index.
*/
package standalone;
import java.net.URL;
import java.net.URLConnection;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;
import java.util.ArrayList;
import java.util.Collections;
public class Connector {
String filename;
String keyword;
public Connector(String fn){
this.filename = fn;
}
public ArrayList<Integer> read(String keyword) throws IOException{
/* effect: Opens a file and checks line by line
* if there's a match with the keyword. If there
* is a match the id-numbers associated with that
* match are stored in a list. This results in
* a list of id-numbers of matches, accumulated
* over all the lines in the files.
* incoming collaboration: receives a file name
* from Connector's constructor.
* incoming collaboration: receives a single keyword
* from Lastgeld's request.
* outgoing collaboration: sends a list of id-numbers
* to Lastgeld's request.
* */
URL url = new URL("http://localhost:8080/web/" + filename);
URLConnection urlConnection = url.openConnection();
urlConnection.connect();
String inLine = null;
String word;
StringTokenizer tokenizer;
ArrayList<Integer> list = new ArrayList<Integer>();
BufferedReader inFile;
inFile = new BufferedReader(new InputStreamReader(url.openStream()));
inLine = inFile.readLine();
while (inLine != null){
tokenizer = new StringTokenizer(inLine, " ");
word = tokenizer.nextToken();
word = word.toLowerCase();
if (word.equals(keyword)){
while(tokenizer.hasMoreTokens()){
String token = tokenizer.nextToken();
token = token.replaceAll("\\D", "");
if (!token.equals("")){
Integer id = Integer.valueOf(token);
list.add(id);
}
}
}
inLine = inFile.readLine();
}
Collections.sort(list);
return list;
}
}
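Connector's read method above scans index lines whose first token is the term and whose remaining tokens contain the id-numbers. The sketch below distills that per-line lookup into a standalone method; `LineLookupSketch` and `match` are my own names, not part of the thesis code, and the line format is assumed from the Appendix A output:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class LineLookupSketch {
    // Return the id-numbers on an index line if its first token matches the keyword.
    public static List<Integer> match(String line, String keyword) {
        List<Integer> ids = new ArrayList<Integer>();
        StringTokenizer tokenizer = new StringTokenizer(line, " ");
        if (!tokenizer.hasMoreTokens()) return ids;
        String term = tokenizer.nextToken().toLowerCase();
        if (!term.equals(keyword)) return ids; // first token is the index term
        while (tokenizer.hasMoreTokens()) {
            // Strip non-digits, as Connector.read does, so word tokens are skipped.
            String token = tokenizer.nextToken().replaceAll("\\D", "");
            if (!token.equals("")) ids.add(Integer.valueOf(token));
        }
        return ids;
    }
}
```

For example, `match("ABE 4765 12726", "abe")` yields the list [4765, 12726], while a non-matching keyword yields an empty list.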
/* Name: Grapher.java
 * Author: Daniël Suelmann
 * Effect: The effect of this class is explained
 * in detail at the level of each method.
 */
package standalone;
import java.util.ArrayList;
public class Grapher {
private static ArrayList<ArrayList<Graphable>> list;
private static WeightedGraph graph = new WeightedGraph();
private static ArrayList<Graphable> vertices;
private static ArrayList<Graphable> intersections = null;
private static ArrayList<Object> combinedintersections = null;
public void populate(ArrayList<ArrayList<Graphable>> keywords){
/* effect: extracts the references in each list and adds
* them to the graph.
* incoming collaboration: receives a list of lists from
* Database's main.
* outgoing collaboration: sends object references to
* WeightedGraph's addVertex.
*/
list = keywords;
for (int i = 0; i < list.size(); i++){
ArrayList<Graphable> l =
(ArrayList<Graphable>)list.get(i);
for(int j = 0; j < l.size(); j++){
graph.addVertex((Graphable) l.get(j));
}
}
}
public void addEdges(){
/* effect: Based on the vertices currently present,
* it initiates the process of finding and adding
* the edges to the graph.
* incoming collaboration: is called by Database's
* getResults.
* incoming collaboration: receives a list with match
* descriptives
*/
findEdges();
}
public static void findEdges(){
/* effect: compare the vertices based on all possible
* combinations.
* explanation: if there are N vertices compare the
* id-numbers of the vertices in all possible
* combinations. For instance: if there are three
* vertices made based on keyword hits, then combinations
* 1 2, 2 3 and 1 3 are possibilities. These combinations
* are used as indices for the vertex list, 1 2 turns
* into 0 1 etc. Then the ArrayLists of id-numbers
* associated with a vertex object -based on the
* combinations- is intersected to sort out all vertices
* with identical id-numbers. (The intersection is done by
* the ArrayList method retainAll.) If identical id-numbers
* are found, the ArrayList which contains the matches is
* added to an ArrayList, so the result is a list of possibly
* one or more lists. This list is returned to Grapher's addEdges.
* incoming collaboration: works with an ArrayList of vertices
* (Grapher instance variable) from WeightedGraph's retrieves.
* outgoing collaboration: passes the number of vertices to
* Grapher's combinationCalculator.
* incoming collaboration: receives a two-dimensional array
* with all possible combinations given the vertices as ints.
* outgoing collaboration: sends a list -possibly empty or
* containing one or more lists- to Grapher's addEdges.
*/
int[][]combinations;
vertices = graph.retrieve();
combinations = calculateCombinations(vertices.size());
ArrayList<Integer> idsobject0;
ArrayList<Integer> idsobject1;
if (intersections == null){
intersections = new ArrayList();
}
if (combinedintersections == null){
combinedintersections = new ArrayList();
}
for (int i = 0; i < combinations.length; i++){
if (!(combinations[i][0] == 0 && combinations[i][1] ==
0)){
Graphable object0 = (Graphable)
vertices.get(combinations[i][0]);
Graphable object1 = (Graphable)
vertices.get(combinations[i][1]);
intersections.add(object0);
intersections.add(object1);
idsobject0 = object0.getIDs();
idsobject1 = object1.getIDs();
ArrayList<Integer> clone;
clone = (ArrayList<Integer>)idsobject0.clone();
clone.retainAll(idsobject1);
if(!clone.isEmpty()){
if(!object0.toString().equals(object1.toString())){
graph.addEdge(object0, object1, clone);
}
combinedintersections.add(object0.getType());
combinedintersections.add(object0.toString());
combinedintersections.add(object1.getType());
combinedintersections.add(object1.toString());
combinedintersections.add(clone);
}
}
}
}
private static int[][] calculateCombinations(int c){
/*effect finds combinations based on a fixed
*value (parameter c); for instance if the
*int 3 is passed it finds out that 3 consists
*of combinations 12 23 and 13. To serve Grapher's
*findEdges it changes these combinations into array
*index format: 12 turns into 01, 13 turns into 02, etc.
*outgoing collaboration: sends a two-dimensional
*array with all possible combinations given the
*vertices as ints to findEdges.
*/
int[][] combinations = new int[c*c/2][2];
int store = 0;
int number = c;
for (int b = 1; b < number; b++ ){
int y = b - 1;
int i = 1;
while (i != number-y){
combinations[store][0] = i-1;
i = i + b;
combinations[store][1] = i-1;
int x = b - 1;
i = i - x ;
store++;
}
}
return combinations;
}
private static void printCombinationsToConsole(ArrayList<Object> l){
/* effect: prints some additional information about
* the combinations checked and their matches found.
* This will not be visible to any user, printed to
* the webserver's console.
* incoming collaboration: receives an ArrayList with
* data from grapher's addEdges.
* outgoing collaboration: prints all findings to the
* server console.
*/
int count = 0;
int total = 1;
for (int i = 0; i < l.size(); i++){
if (count == 5){
System.out.print("\n " + total);
count = 0;
total++;
}
System.out.print(" " + l.get(i));
count++;
}
System.out.print("\n");
}
public void printEdgesToConsole(){
/* effect: prints some additional information about the
* edges present in the graph. This will not be visible
* to any user, printed to the webserver's console.
* incoming collaboration: uses Grapher's ArrayList
* instance variable vertices to get to the vertices
* present in the graph.
* incoming collaboration: receives -for each vertex in
* the graph- a LinkedQueue of vertices that are adjacent.
* outgoing collaboration: prints all findings to the
* server console.
*/
for (int i = 0; i < vertices.size(); i++){
Graphable object;
object = (Graphable) vertices.get(i);
System.out.print(object.getType()+ ", "); //type
System.out.print(object.toString() + ", "); //content
//System.out.println(object.getIDs()+ "\n"); //ID's
QueueInterface queue;
queue = graph.getToVertices(object);
if (!queue.isEmpty()){
System.out.print("has edges: ");
int count = 1;
while(!queue.isEmpty()){
Graphable item = (Graphable) queue.dequeue();
System.out.print(" " + count++ + ": " +
item.getType());
System.out.print(", " + item);
}
System.out.print("\n");
}
else {
System.out.println("no edges.");
}
}
}
private Graphable findStart(){
/* effect: finds the vertex with the most edges.
* explanation: the vertex with the most edges is
* the vertex with the strongest relation to the other
* keywords, i.e. this is the vertex that's 'likely' relevant.
* incoming collaboration: receives a queue with edges from
* WeightedGraphs getToVertices.
* outgoing collaboration: sends the vertex with the most
* edges/vertices to Grapher's findResult
*/
QueueInterface edges;
Graphable item;
Graphable biggest = null;
int biggestSize = 0;
for(int i = 0; i < vertices.size(); i++){
int size = 0;
item = (Graphable) vertices.get(i);
edges = graph.getToVertices(item);
size = edges.size();
if(size > biggestSize){
biggestSize = size;
biggest = item;
}
}
return biggest;
}
public ArrayList<Integer> findResult(){
/* effect: dispatches two functions: findStart
 * returns the vertex with the most edges.
 * connectEdges performs the final intersection of
* the id-numbers associated with the edges.
* the result is a list of final result id-numbers.
* incoming collaboration: is called by Database's
* getResults.
* incoming collaboration: receives an ArrayList with
* result id-numbers from Grapher's connect edges.
* outgoing collaboration: sends back the ArrayList with
* result id-numbers to Database's getResult.
*/
ArrayList <Integer> result;
Graphable startPoint = findStart();
result = connectEdges(startPoint);
return result;
}
private static ArrayList<Integer> connectEdges(Graphable
startVertex){
/* effect:
* intersects all edges of the given argument
* startVertex, which is the vertex with the
* most edges to other vertices.
* explanation: up until now there is a graph
* that holds x vertices that are connected in
* some way. What is done here, is intersecting
* all the id-numbers that are associated with
* all the edges of a particular vertex.
* What makes a relation between certain vertices
* is the fact that an edge between vertex a & b
* holds the same or a subset of the id-numbers
* between vertices a & c or a & d.
* incoming collaboration: receives a Graphable
* object that becomes the start of the traversal.
* incoming collaboration: receives a LinkedQueue
* with all the edges of startVertex (argument)
* by WeightedGraph's getToVertices.
* incoming collaboration: receives an ArrayList
* with the id-numbers associated with an edge
* between the startVertex and a vertex connected
* to the startVertex.
* outgoing collaboration: sends an ArrayList of
* ints to Grapher's findResult.
*/
QueueInterface edges = graph.getToVertices(startVertex);
ArrayList<ArrayList<Integer>> lists = new
ArrayList<ArrayList<Integer>>();
ArrayList<Integer>sim;
Graphable item;
while(!edges.isEmpty()){
item = (Graphable) edges.dequeue();
sim = graph.holdSimilarities(startVertex, item);
if (!sim.isEmpty()){
lists.add(sim);
}
}
ArrayList<Integer> result = null;
if (!lists.isEmpty()){
result = (ArrayList<Integer>) lists.get(0);
for(int i = 1; i < lists.size();i++){
result.retainAll((ArrayList<Integer>)
lists.get(i));
}
return result;
}
else {
ArrayList<Integer> empty = new ArrayList<Integer>();
return empty;
}
}
public ArrayList<Graphable> getVertices(){
return vertices;
}
public WeightedGraph getGraph(){
return graph;
}
public ArrayList<Graphable> getIntersections(){
return intersections;
}
public ArrayList<Object> getCombinedIntersections(){
return combinedintersections;
}
/* The following resetters are necessary due
 * to the fact that objects persist in between
 * search requests of the WebFrontend class.
 */
public void resetIntersections(){
intersections = null;
}
public void resetCombinedIntersections(){
combinedintersections = null;
}
public void resetListOfLists(){
list = null;
}
public void resetVertices(){
vertices = null;
}
public void resetGraph(){
graph.reset();
}
}
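The pair enumeration performed by calculateCombinations above can equivalently be written with two nested loops, which makes explicit that N vertices yield N(N-1)/2 index pairs. This is an illustrative alternative, not the thesis implementation; the class and method names are my own:

```java
public class PairSketch {
    // Enumerate all index pairs (i, j) with i < j for c vertices.
    public static int[][] pairs(int c) {
        int[][] result = new int[c * (c - 1) / 2][2];
        int store = 0;
        for (int i = 0; i < c; i++) {
            for (int j = i + 1; j < c; j++) {
                result[store][0] = i;
                result[store][1] = j;
                store++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] p : pairs(3)) {
            System.out.println(p[0] + " " + p[1]); // prints 0 1, 0 2, 1 2
        }
    }
}
```

For three vertices this produces exactly the combinations 0 1, 0 2 and 1 2 described in the findEdges comment.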
/* Name: Graphable.java
 * Author: Daniël Suelmann
 * Effect: This is an interface.
 * It describes abstract methods
 * that must be implemented by all classes
 * implementing this interface.
 * Such objects are instances of Lastgeld's
 * inner classes: Firstname, Lastname, Harbor,
 * Country and Date.
 */
package standalone;
import java.util.ArrayList;
public interface Graphable {
public abstract String toString();
// effect: implementation prints the content variable of a graphable object.
public abstract ArrayList<Integer> getIDs();
// effect: implementation gets the id-numbers associated with a graphable object.
public abstract String getType();
// effect: implementation gets the type of a graphable object.
public abstract boolean isEqual(Graphable object);
// effect: implementation compares two objects of the graphable type.
}
/* Name: WeightedGraph.java
* This class is a slightly modified version of the
* WeightedGraph data structure presented in the book
* Object-oriented data structures using Java by
* Dale et al.[3]
* The modifications:
* The values associated with the edges were initially
* ints. For the application I need edges that store
* multiple ints representing the id-numbers that are
* shared by two vertices. Since I use ArrayLists
* throughout the application, I also changed the edges
* instance variable to be two-dimensional arrays of
* the type ArrayList. These modifications can be found
* among the instance variables and in the methods
 * holdSimilarities and addEdge. I also added two methods
 * to retrieve information from the graph; these are
 * retrieve and printNames.
* */
package standalone;
import java.util.ArrayList;
public class WeightedGraph implements WeightedGraphInterface
{
public static ArrayList <Integer> NULL_EDGE = null;
private int numVertices;
private int maxVertices;
private Graphable[] vertices;
private ArrayList<Integer>[][] edges;
private boolean[] marks; // marks[i] is mark for vertices[i]
public WeightedGraph()
// Post: Arrays of size 50 are dynamically allocated for
//       marks and vertices, and of size 50 x 50 for edges.
//       numVertices is set to 0; maxVertices is set to 50.
{
numVertices = 0;
maxVertices = 50;
vertices = new Graphable[50];
marks = new boolean[50];
edges = new ArrayList[50][50];
}
/* Comment on the edges
 * Modification of the original WeightedGraph data structure:
 * the weights have become ArrayLists of id-numbers.
 * Each edge contains the id-similarities between two vertices.
 * If vertexX = {1,2,3,4,5} and vertexY = {3,4,5,6} the edge
 * represents the intersection result vertex{X,Y} = {3,4,5};
 * these values would be stored in an ArrayList.
 * This ArrayList is necessary to be able to, in turn, intersect
 * this intersection with another intersection,
 * for instance vertex{P,Q} = {3,5,7,9}, which would result in
 * vertex{P,Q,X,Y} = {3,5}, etc.
 */
public void reset(){
vertices = new Graphable[50];
edges = new ArrayList[50][50];
numVertices = 0;
}
public ArrayList<Graphable> retrieve(){
ArrayList<Graphable> list = new ArrayList<Graphable>();
for (int i = 0; i < vertices.length; i++){
Graphable object = (Graphable) vertices[i];
if (object != null){
list.add(object);
}
}
return list;
}
public void printNames(){
for (int i = 0; i < vertices.length; i++){
Graphable object = (Graphable) vertices[i];
if (object != null){
System.out.println(object.toString());
}
}
}
public WeightedGraph(int maxV)
// Post: Arrays of size maxV are dynamically allocated for
//       marks and vertices, and of size maxV x maxV for edges.
//       numVertices is set to 0; maxVertices is set to maxV.
{
numVertices = 0;
maxVertices = maxV;
vertices = new Graphable[maxV];
marks = new boolean[maxV];
edges = new ArrayList[maxV][maxV];
}
public void addVertex(Graphable vertex)
// Post: vertex has been stored in vertices.
//       Corresponding row and column of edges have been set to NULL_EDGE.
//       numVertices has been incremented.
{
vertices[numVertices] = vertex;
for (int index = 0; index < numVertices; index++)
{
edges[numVertices][index] = NULL_EDGE;
edges[index][numVertices] = NULL_EDGE;
}
numVertices++;
}
private int indexIs(Graphable vertex)
// Post: Returns the index of vertex in vertices
{
int index = 0;
while (vertex != vertices[index])
index++;
return index;
}
public void addEdge(Graphable fromVertex, Graphable toVertex, ArrayList<Integer>
IDs)
// Post: Edge (fromVertex, toVertex) is stored in edges
{
int row;
int column;
row = indexIs(fromVertex);
column = indexIs(toVertex);
edges[row][column] = IDs;
}
public ArrayList<Integer> holdSimilarities(Graphable fromVertex, Graphable
toVertex)
// Post: Returns the weight associated with the edge
//       (fromVertex, toVertex).
{
int row;
int column;
row = indexIs(fromVertex);
column = indexIs(toVertex);
return edges[row][column];
}
public QueueInterface getToVertices(Graphable vertex)
// Returns a queue of the vertices that are adjacent from vertex.
{
QueueInterface adjVertices = new LinkedQueue();
int fromIndex;
int toIndex;
fromIndex = indexIs(vertex);
for (toIndex = 0; toIndex < numVertices; toIndex++)
if (edges[fromIndex][toIndex] != NULL_EDGE)
adjVertices.enqueue(vertices[toIndex]);
return adjVertices;
}
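The edge-intersection scheme described in the "Comment on the edges" block above rests on ArrayList's retainAll method. Below is a minimal standalone illustration using the example values from that comment; `IntersectSketch` and `intersect` are my own names, not part of the thesis code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IntersectSketch {
    // Intersect two id-lists without modifying the inputs.
    public static ArrayList<Integer> intersect(List<Integer> a, List<Integer> b) {
        ArrayList<Integer> result = new ArrayList<Integer>(a);
        result.retainAll(b); // keep only id-numbers present in both lists
        return result;
    }

    public static void main(String[] args) {
        // edge{X,Y}: intersect {1,2,3,4,5} with {3,4,5,6}
        ArrayList<Integer> edgeXY =
                intersect(Arrays.asList(1, 2, 3, 4, 5), Arrays.asList(3, 4, 5, 6));
        System.out.println(edgeXY); // [3, 4, 5]
        // intersect that edge with edge{P,Q} = {3,5,7,9}
        System.out.println(intersect(edgeXY, Arrays.asList(3, 5, 7, 9))); // [3, 5]
    }
}
```

This mirrors how connectEdges repeatedly calls retainAll to narrow the result id-numbers, except that findEdges clones the first list before intersecting, just as intersect does here by copying.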
/* Name: WebFrontend.java
 * Author: Daniël Suelmann
 * Effect:
 * 1. Displays a search interface;
 * 2. Receives keyword queries;
 * 3. Instantiates a Database object, which initializes
 *    other classes to retrieve an answer to the query.
 * 4. When the answer is retrieved, proposed SQL
 *    queries are executed.
 * 5. The answer is returned to the web browser.
 * 6. -Optional- Displays intermediate intersection results.
 */
package scriptie;
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.StringTokenizer;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import standalone.*;
public class WebFrontend extends HttpServlet {
    public void service(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        Database db = new Database();
        int countresults = 0;
        ArrayList<String> resultstorage = new ArrayList<String>();
        String enterkeywords = null;
        String noresults = null;
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        StringTokenizer tokenizer;
        ArrayList<Integer> result = null;
        long elapsedTimeMillis = 0;
        String input = request.getParameter("search_text");
        if (input != null && !input.equals("")) { // string equality, not reference comparison
            input = input.toLowerCase();
            tokenizer = new StringTokenizer(input, " ");
            int size = tokenizer.countTokens();
            String[] feed = new String[size];
            int count = 0;
            while (tokenizer.hasMoreTokens()) {
                feed[count] = tokenizer.nextToken();
                count++;
            }
            long start = System.currentTimeMillis(); // measuring speed
            result = db.getResult(feed);
            if (result.isEmpty()) {
                noresults = "No results.";
            }
            else {
                try {
                    ConnectionPool pool = new ConnectionPool("com.mysql.jdbc.Driver",
                            "jdbc:mysql://localhost:3306/scriptie", "root", "XXXXX", 10, 20, true);
                    Connection conn = pool.getConnection();
                    Statement stmt;
                    ResultSet rs;
                    stmt = conn.createStatement();
                    for (int i = 0; i < result.size(); i++) {
                        int id = (int) result.get(i);
                        rs = stmt.executeQuery("SELECT * FROM lastgeld WHERE Idno = '" + id + "'");
                        while (rs.next()) {
                            int theInt = rs.getInt("Idno");
                            String vn = rs.getString("voornaam");
                            String an = rs.getString("achternaam");
                            String hh = rs.getString("haven_herk");
                            String al = rs.getString("Aantal_lasten");
                            String hd = rs.getString("Heffing_decimaal");
                            String tn = rs.getString("tonnage");
                            String gl = rs.getString("guldens");
                            String st = rs.getString("stuivers");
                            String sc = rs.getString("scanfile");
                            countresults++;
                            String store = (" " + countresults + ". id = " + theInt
                                    + " first name = " + vn + " last name = " + an
                                    + " harbor = " + hh + " cargo units = " + al
                                    + " toll-decimal = " + hd + " weight = " + tn
                                    + " guldens = " + gl + " stuivers = " + st
                                    + " <a href=\"" + sc + "\" target=\"_blank\">source</a><br>");
                            resultstorage.add(store);
                        }
                    }
                    stmt.close();
                    pool.free(conn);
                    conn.close();
                    pool.closeAllConnections();
                }
                catch (SQLException e) {
                    e.printStackTrace();
                }
                result = null;
                feed = null;
            }
            elapsedTimeMillis = System.currentTimeMillis() - start; // time elapsed
        } // if input was provided
        else {
            enterkeywords = ("Enter one or more keywords.");
        }
        out.println("<html>");
        out.println("<body>");
        out.println("<pre>");
        // Search area
        out.println("<center>");
        out.println("<br><br>");
        out.println("<form action='lastgeld' method='get'>");
        out.print("<input type=text name=search_text>");
        out.print("<input type=submit value=Search>");
        out.print("<input type='checkbox' name='checkbox' value='verbose' />");
        out.print("verbose");
        out.println("</form>");
        out.println("<a href = '/web/experiment'>experiment</a><br>");
        if (noresults != null)
            out.println(noresults);
        if (enterkeywords != null)
            out.println(enterkeywords);
        out.println("</center>");
        // Creating the verbose output
        String checkbox = request.getParameter("checkbox");
        if (checkbox != null) {
            if (checkbox.equals("verbose")) {
                out.println("<table border='1' align='center'>");
                out.println("<tr>");
                out.println("<td align='right'>");
                out.println("Available indices: "
                        + "<a href = '/web/datefreq.txt' target = '_blank'>date</a>, "
                        + "<a href = '/web/vnfreq.txt' target = '_blank'>first name</a>, "
                        + "<a href = '/web/anfreq.txt' target = '_blank'>last name</a>, "
                        + "<a href = '/web/havenfreq.txt' target = '_blank'>harbor</a>, "
                        + "<a href = '/web/countryfreq.txt' target = '_blank'>country</a> ");
                out.println("</td>");
                out.println("</tr>");
                out.println("<tr>");
                out.println("<td>");
                // intersection combinations
                ArrayList<Graphable> intersections = db.getIntersections();
                int pair = 1;
                int hr = 1;
                int count = 0;
                if (intersections != null) {
                    if (!intersections.isEmpty()) {
                        out.println("<h2>Step 1: find intersection combinations based on "
                                + "the keywords entered and the available indices.</h2><br>");
                        for (int i = 0; i < intersections.size(); i++) {
                            if (count == 2) {
                                out.print("<hr size='3' color='gray'>");
                                count = 0;
                            }
                            Graphable object;
                            object = intersections.get(i);
                            if (count == 0) {
                                out.println("<b>pair " + pair + ":</b><br>");
                                pair++;
                            }
                            out.println("<b>" + object.toString() + " " + object.getType()
                                    + "</b>" + object.getIDs() + "<br>");
                            count++;
                            if (hr == 1) {
                                out.println("<hr>");
                                hr--;
                            }
                            else
                                hr++;
                        }
                    }
                }
                else {
                    out.print("Verbose returns output at a minimum of two keywords.");
                }
                intersections = null;
                db.resetIntersections();
                out.println("</td>");
                out.println("</tr>");
                // combined intersections
                ArrayList<Object> combinedintersections = db.getCombinedIntersections();
                int count1 = 0;
                int total = 1;
                if (combinedintersections != null) {
                    out.println("<tr>");
                    out.println("<td>");
                    out.println("<h2>Step 2: intersect the combinations.</h2><br>");
                    if (!combinedintersections.isEmpty()) {
                        for (int i = 0; i < combinedintersections.size(); i++) {
                            if (count1 == 2) {
                                out.print(" <b><--></b> ");
                            }
                            if (count1 == 5) {
                                out.print("<br>");
                                count1 = 0;
                                total++;
                            }
                            if (count1 < 4) {
                                out.print(" <b>" + combinedintersections.get(i) + "</b>");
                                count1++;
                            }
                            else {
                                out.print(" " + combinedintersections.get(i));
                                count1++;
                            }
                        }
                    }
                    out.println("</td>");
                    out.println("</tr>");
                }
                combinedintersections = null;
                db.resetCombinedIntersections();
                ArrayList<Graphable> vertices = db.getVertices();
                WeightedGraph graph = db.getGraph();
                if (vertices != null && graph != null) {
                    out.println("<tr>");
                    out.println("<td>");
                    out.println("<h2>Step 3: find the vertex with the most edges.</h2><br>");
                    if (!vertices.isEmpty()) {
                        for (int i = 0; i < vertices.size(); i++) {
                            Graphable object;
                            object = (Graphable) vertices.get(i);
                            out.print(" <b>" + object.getType() + "</b> ");  // type
                            out.print("<b>" + object.toString() + "</b> "); // content
                            //System.out.println(object.getIDs() + "\n");   // IDs
                            QueueInterface queue;
                            queue = graph.getToVertices(object);
                            if (!queue.isEmpty()) {
                                out.print("has edges: ");
                                int count2 = 1;
                                while (!queue.isEmpty()) {
                                    Graphable item = (Graphable) queue.dequeue();
                                    out.print(" " + count2++ + ": " + item.getType());
                                    out.print(", " + item);
                                }
                                out.print("<br>");
                            }
                            else {
                                out.println("no edges.<br>");
                            }
                        }
                    }
                    out.println("</td>");
                    out.println("</tr>");
                }
                out.println("</table>");
            }
        }
        if (!resultstorage.isEmpty()) {
            out.print(" <b>Found " + countresults + " records in ");
            if (elapsedTimeMillis != 0)
                out.println(elapsedTimeMillis + " milliseconds:</b><br>");
            for (int i = 0; i < resultstorage.size(); i++) {
                out.print(resultstorage.get(i));
            }
        }
        out.println("</pre>");
        out.println("</body>");
        out.println("</html>");
        input = null;
        db.resetIntersections();
        db.resetCombinedIntersections();
        db.resetGraph();
        db.resetVertices();
        db.resetListOfLists();
        db = null;
    }
}
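The verbose steps in the servlet above ultimately rest on one core operation: intersecting the per-keyword ID lists returned by the indices. A minimal sketch of that operation follows, with an invented keyword-to-IDs map standing in for the real indices (the keywords and record numbers are illustrative only, not taken from the Lastgeld data):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class IntersectSketch {
    public static void main(String[] args) {
        // Hypothetical inverted index: keyword -> record ids (sample data).
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("jan", new TreeSet<>(Arrays.asList(101, 205, 317)));
        index.put("hamburg", new TreeSet<>(Arrays.asList(205, 317, 400)));

        // Steps 1-2 in miniature: intersect the ID lists of the two keywords.
        Set<Integer> answer = new TreeSet<>(index.get("jan"));
        answer.retainAll(index.get("hamburg"));
        System.out.println(answer); // records matching both keywords: [205, 317]
    }
}
```

Records 205 and 317 survive because both keywords map to them; a set like this is what the proposed SQL queries would then fetch, record by record, from the lastgeld table.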