Keyword-based Search in a Relational Database

Daniël Suelmann
Advisor: dr. George M. Welling
Bachelor's Thesis
Department of Information Science
Faculty of Arts
University of Groningen
August 2009

Abstract

A relational database is often operated by means of a structured query language (SQL). When composing SQL queries one must understand the SQL syntax to be able to produce a query the database can execute. Additionally, one must be familiar with the attributes and relations in the database to be able to retrieve the desired data. When one simply wants to search the data stored in a relational database, these requirements can be discouraging. In that case keyword-based search functionality could improve the accessibility of the data. In this thesis I investigate the subject of keyword-based search in structured data. My objective in this thesis is to show that keyword-based search in the available data can yield meaningful results. I present a search application I developed that enables keyword-based search in the Lastgeld data. This data concerns skippers and their cargo entering the port of Amsterdam in the period from 1744 to 1748. Answers to queries with multiple keywords are retrieved based on the following assumption: if the keywords in a keyword query are related in the relational database that contains the Lastgeld data, then retrieving this relating data yields results likely to be meaningful given the keyword query. I describe how the presented search application achieves just that. Additionally, given the experimental results, I conclude that keyword query execution time increases as the intersection work that needs to be performed by the presented search application increases. This work depends on the number of keywords in the query and the number of id-numbers associated with these keywords in the indices.
I also conclude that the more SQL queries the search application proposes to be executed by the relational database, the more the overall query execution time increases.

Contents

1. INTRODUCTION
   1.1 OUTLINE
2. DATA REPRESENTATIONS
   2.1 GRAPH REPRESENTATIONS
       2.1.1 Definition
   2.2 THE LASTGELD DATABASE
3. SYSTEM DESIGN
   3.1 DEFINITION
   3.2 INDEXING
   3.3 SEARCHING INDICES
   3.4 CREATING A GRAPH DATA STRUCTURE
   3.5 CREATING EDGES
   3.6 FINDING AN ANSWER TO A KEYWORD QUERY
   3.7 CREATING SQL QUERIES
   3.8 RESULTS
4. EXPERIMENT
   4.1 EXPERIMENTAL DESIGN
   4.2 EXPERIMENTAL RESULTS
   4.3 ANALYSIS
5. RELATED WORK
   5.1 GRAPH-BASED SYSTEMS
       5.1.1 Data representation
       5.1.2 Top-k ranking
   5.2 XML-BASED SYSTEMS
       5.2.2 Semantic relatedness
       5.2.3 Top-k ranking
6. IMPLEMENTATION
   6.1 CREATING INDICES
   6.2 OBJECT ORIENTED DESIGN
   6.3 FLOW OF CONTROL
   6.4 WEIGHTED GRAPH DATA STRUCTURE
   6.5 ALGORITHMS
       6.5.1 Searching indices
       6.5.2 Intersecting id-numbers
       6.5.3 Finding combinations
       6.5.4 Retrieving the vertex with the most edges
   6.6 JAVA LIBRARY CLASSES
   6.7 WEB DEPLOYMENT
       6.7.1 Specification
       6.7.2 User interface
       6.7.3 Web address
       6.7.4 The search application doesn't work
7. EPILOGUE
   7.1 CONCLUSIONS
   7.2 DISCUSSION
   7.3 PROPOSALS FOR FUTURE WORK
REFERENCES
APPENDIX A: CREATING INDICES IN PERL
APPENDIX B: SEARCH APPLICATION SOURCE CODE IN JAVA

FIGURES

2.1 Königsberg anno 1736
2.2 Graph representation of the Königsberg problem
2.3 Graph representations
2.4 A directed graph
2.5 A weighted graph
3.1 System design
3.2 A fragment of an inverted index
3.3 A directed weighted graph representation of the keyword query IJsbrand Hanning Riga 1744
3.4 Graph II IJsbrand Hanning Riga 1744
4.1 Experimental results keyword query A
4.2 Experimental results keyword query B
4.3 Experimental results keyword query C
4.4 Experimental results keyword query D
5.1 A fragment of a data graph model
5.2 Radius Steiner trees
5.3 Accented Steiner nodes
5.4 A Steiner graph result
5.5 An XML data representation
6.1 Object oriented design
6.2 Weighted graph data structure
6.3 User interface search application

TABLES

2.1 Graph endpoints
2.2 A fragment of the Skippers table
2.3 A fragment of the Locations table
3.1 Fields eligible for keyword-based search
3.2 Number of intersections
3.3 Example queries
4.1 Experimental queries
5.1 Table data representation
6.1 Implemented Java library classes
6.2 Web deployment specification

LISTINGS

3.1 An index of first names
3.2 An index of dates
3.3 An index of harbors
3.4 An index of countries
3.5 Retrieving occurrences of keywords from the indices given the query Jan Cornelis Nantes 1744
3.6 Intersecting vertices for the query IJsbrand Hanning Riga 1744
3.7 System-generated SQL queries and their answers
5.1 An XML fragment
6.1 Creating indices

1. Introduction

Today's most widely used search engines enable users to express a search query by means of one or more keywords.
This query can express a descriptive phrase or consist of no more than a single term, corresponding to a specific information need. The user can query the available data without having to know any query language or how the data is stored in the internal data repository. In this thesis I investigate the subject of applying a keyword-based search approach to data stored in a relational database. More specifically, I focus on a database containing historical data. The database consists of two tables containing structured data. The main table contains about 13,000 records of data about skippers and their cargo entering the port of Amsterdam in the period from 1744 to 1748. My objective in this thesis is to show that keyword-based search in a relational database can yield results likely to be meaningful given the available Lastgeld data. I investigate this subject by presenting a search application I developed that enables users to perform keyword-based search in the available Lastgeld data. In general I focus on answering the following questions:

1. What methods are applied in querying a relational database given one or more keywords?
2. How can these methods yield meaningful results in a keyword-based search application, given the available Lastgeld data and the relational database containing this data?

1.1 Outline

In the second chapter I describe some of the theory associated with graph data structures, since I use a graph data structure to be able to query the data in the database. In this chapter I also describe the Lastgeld data and the way it is stored in a relational database. In chapter three, I present the application I developed and describe the heuristics applied to achieve results. I end this chapter by showing several example queries and the resulting answers returned by the presented application. In the fourth chapter I describe the experiment I performed regarding the efficiency of the presented search application.
Chapter five covers some of the work performed in the research area of keyword-based search in structured data. In this chapter I describe several systems that enable keyword-based search over structured data using distinct approaches. In chapter six I return to the subject of the presented search application and describe the choices I made to implement it. Finally, in chapter seven I draw conclusions, discuss achievements and propose future work. The appendices contain the documented source code of the presented search application.

2. Data representations

When a keyword-based search system receives a query, it needs to determine what this query means in terms of the data repository it is designed to search. If the query consists of a single keyword, then that keyword can occur in many locations in the database. If there are multiple keywords, the system also needs to determine the relation between the keywords in the underlying database. If the data in the underlying database is connected by many relations that give meaning to the data, then the search system has to be familiar with these relations to be able to meaningfully determine whether the keywords entered match any of them. To be able to determine such relations, or to retrieve any data, the data must be transferred from a storage device to main memory. Consequently, the relations existing in the database on the storage device have to be replicated in main memory as well. One could say that a model of the data and the relations meaningfully connecting entities of data must be available in main memory to be useful to the search system. A way to create such a model is by means of a graph data structure. In a graph abstract data type, entities of data can be connected to one another in an unrestricted manner.
Whereas tree data structures provide a useful way of representing relationships in which a hierarchy exists, a graph data structure becomes useful when relationships between data entities appear more freely. Dale et al. [3]

2.1 Graph representations

Graph theory is rooted in mathematics. Graph theory was born in 1736, when the King of Prussia confronted mathematician Leonhard Euler with the following problem: The town of Königsberg (now Kaliningrad in Russia) is built at the point where two branches of the Pregel river come together. The river divides the town into an island and some land around the river banks. The island and the various pieces of mainland are connected by seven bridges. Is it possible for a person to take a walk around town, starting and ending at the same location, and crossing each of the seven bridges exactly once?

Euler's conclusion was that it is impossible to travel the bridges in the city of Königsberg once and only once. Euler claimed that if there are more than two landmasses with an odd number of bridges, then no such journey is possible. Second, if the number of bridges is odd for exactly two landmasses, then the journey is possible if it starts in one of these two landmasses. Finally, Euler claimed that if there is no landmass with an odd number of bridges, then the journey can be accomplished starting in any region. Paoletti [12]

Figure 2.1: Königsberg anno 1736
Figure 2.2: graph representation of the Königsberg problem

2.1.1 Definition

In our time, a graph consists of three entities: a set of vertices V(G); a set of edges E(G); and an edge-endpoint function g: E(G) → V(G) × V(G) that connects each edge with a pair of vertices. Vertices can represent whatever is the subject of attention: people, brain cells, cities, courses, or entities of data present in a relational database. If the vertices represent cities, then the edges might represent the roads between the cities.
Because the road between Groningen and Amsterdam also runs between Amsterdam and Groningen, the edges in this representation have no direction. This is called an undirected graph. If an edge represents, for instance, a pipe that transports a fossil fuel like natural gas, then the gas is most probably transported in only one direction. A graph with edges directed from one vertex to another is called a directed graph. Lanzani [8] Directed graphs are often represented with arrows, as in figure 2.4. A more formal definition of the directed graph in figure 2.4 is:

V(G) = {1, 3, 5, 7, 9, 11}
E(G) = {(1,3), (3,1), (5,7), (5,9), (9,9), (9,11), (11,1)}

Figure 2.3: graph representations Lanzani [8]

In an undirected graph the arrows are simply omitted, since the order of the vertices in each edge is unimportant. A more formal definition of the first undirected graph in figure 2.3 is:

V(G) = {1, 2, 3, 4, 5}
E(G) = {(1,2), (2,3), (3,4), (4,5), (5,1)}

Edge  Endpoints
e1    {1, 2}
e2    {2, 3}
e3    {3, 4}
e4    {4, 5}
e5    {5, 1}

Table 2.1: graph endpoints

If two vertices in a graph are connected by an edge, then they are said to be adjacent. In figure 2.4, vertex 5 is said to be adjacent to vertices 7 and 9, while vertex 1 is said to be adjacent from vertices 3 and 11. A tree is a special case of a directed graph in which each vertex may only be adjacent from one parent vertex, except for the root vertex, which is not adjacent from any other vertex. A path from one vertex to another consists of a sequence of vertices such that each vertex in that sequence is connected to the next by an edge. This sequence of vertices and edges makes it possible to traverse the graph in a certain manner. A weighted graph is a graph in which edges are associated with values. Weighted graphs can be used to represent applications in which edges are more than a connection. For instance, figure 2.5 depicts a graph in which the vertices are cities and the edges the roads between the cities.
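As a brief aside, the formal definition of the directed graph of figure 2.4 maps directly onto an adjacency-set representation in code. The following Java sketch is my own illustration (class and method names are not taken from the thesis's implementation):

```java
import java.util.*;

// A minimal directed graph over integer vertices, mirroring the formal
// definition: V(G) as the key set, E(G) as the stored adjacency pairs.
class DirectedGraph {
    private final Map<Integer, Set<Integer>> adjacency = new HashMap<>();

    void addVertex(int v) {
        adjacency.putIfAbsent(v, new HashSet<>());
    }

    void addEdge(int from, int to) {
        addVertex(from);
        addVertex(to);
        adjacency.get(from).add(to);
    }

    // Vertex `to` is adjacent FROM vertex `from` only if edge (from, to) exists;
    // in a directed graph the reverse need not hold.
    boolean isAdjacentTo(int from, int to) {
        return adjacency.getOrDefault(from, Collections.emptySet()).contains(to);
    }

    public static void main(String[] args) {
        // E(G) = {(1,3),(3,1),(5,7),(5,9),(9,9),(9,11),(11,1)} from figure 2.4
        DirectedGraph g = new DirectedGraph();
        int[][] edges = {{1, 3}, {3, 1}, {5, 7}, {5, 9}, {9, 9}, {9, 11}, {11, 1}};
        for (int[] e : edges) g.addEdge(e[0], e[1]);

        System.out.println(g.isAdjacentTo(5, 7)); // prints "true"
        System.out.println(g.isAdjacentTo(7, 5)); // prints "false": edges are directed
    }
}
```

For an undirected graph, addEdge would simply insert the connection in both directions, reflecting that the order of the vertices in each edge is unimportant.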
Additionally, each edge carries a value that represents the distance in kilometers between the cities. Dale et al. [3]

Figure 2.4: a directed graph Dale et al. [3]
Figure 2.5: a weighted graph

2.2 The Lastgeld database

I explained in the previous section that a graph model of the data, and of the relations meaningfully connecting entities of data, must be available in main memory to be useful to the search system. Now that I have explained some important notions of graph theory, I continue by describing the database I used to build a graph data structure. The database in question is called the Lastgeld database. The Lastgeld database contains data about skippers and their cargo entering the port of Amsterdam from 1744 to 1748. These skippers had to pay a toll named Lastgeld. Welling [15]

Table 2.2: a fragment of the Skippers table

I only use two tables. The main table (table 2.2) contains about 13,000 tuples¹. Additionally I use a second table (table 2.3) that contains about 1,500 tuples. This table relates to the main table by the "hid" attribute. I chose this data representation because it shows two important notions within the domain of any relational database. First, keywords in the same tuple are related. For instance, a keyword query Cornelis Vos should match the data in the first tuple. Secondly, keywords may span multiple relations. For instance, a keyword query Hendrik Clasen France should match data in the third tuple and should additionally match data in the Locations table, i.e. the search system should return all the tuples that contain data about a skipper named Hendrik Clasen who visited any harbor in France.

Table 2.3: a fragment of the Locations table

The Lastgeld database is governed on a storage device by a Relational Database Management System (RDBMS). The RDBMS provides an interface to the data it retains by means of a query language. In the case of the Lastgeld database the query language is SQL (Structured Query Language).
For instance, the following queries are valid when one wants to retrieve data from the Lastgeld database:

SELECT date, firstname, lastname FROM skippers WHERE lastname = 'clasen' (1)

SELECT date, firstname, lastname FROM skippers INNER JOIN locations ON (skippers.hid = locations.hid) WHERE skippers.lastname = 'vos' AND locations.harbor = 'nantes' (2)

The first query is directed at a single table, while the second query joins the two tables in order to get matches from both. If one reflects on the abstraction layer that allows us to operate the database, then one could say that the notions of relations, tuples, attributes, keys, foreign keys, SQL, etc. are all abstractions of more complicated constructs beneath the surface of the relational database. These abstractions allow us to deal with the database with relative ease. For instance, programmers need the control of a SQL query language to perform a wide variety of actions on the database. However, the abstraction layer suitable for programmers isn't very suitable for the people whose primary interest is the data stored in the database.

¹ For clarity: the terms relation, attribute and tuple are more commonly referred to as table, column, and row, respectively.

3. System design

In chapter 2 I describe some of the properties of the Lastgeld database. I also explain that the data and relations residing in the database are eventually stored on a storage device. To be able to perform any kind of search operation, it is necessary to retrieve all the data from the database and have it available in main memory. This can be done in several ways, for instance:

1. Replicate all the data available in the database as a graph data structure that is loaded into main memory, and traverse this data graph to obtain answers to keyword queries; Aditya et al. [1]
2. Replicate only the data scheme of the database as a graph data structure that is loaded into main memory, and traverse this scheme graph to obtain answers to keyword queries; Hristidis et al. [6]
3. Export all data as XML and use an extended XML query language to obtain answers to keyword queries. Cohen et al. [2]

In chapter five I describe more elaborately some of the methods applied in the research area of keyword-based search in structured data. Although most probably inspired by what I describe in chapter five, I have not attempted to rebuild these methods. They appear to be highly dependent on the context in which they are meant to be implemented. Considering this observation, I have focused solely on the context of the Lastgeld data and its container, a relational database.

3.1 Definition

I have developed a heuristic that is best described in terms of the following steps:

1. Index all table fields eligible for keyword search;
2. Search all indices based on the keywords in a keyword query;
3. For every keyword match found in an index, create a vertex in a graph data structure;
4. For every possible combination of two distinct vertices, create an edge between the two vertices if the id-numbers associated with each vertex have at least one resemblance;
5. Find the vertex that has the most edges, since this is the vertex that relates the most keywords present in the graph, and intersect all id-numbers associated with the edges adjacent from this vertex;
6. Use the id-numbers that result from the intersection of edges to create SQL queries.

Figure 3.1: System design

Figure 3.1 visualizes these steps in a more general fashion. In the following six sections I elaborate on each of these steps in more detail.
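The core of steps 3-5 can be condensed into a few methods. The Java sketch below is my own simplified illustration, not the thesis's implementation: it assumes the index lookups of step 2 have already produced a map from each matched keyword to its set of tuple id-numbers, and it records connections in both directions purely to count a vertex's edges, whereas the thesis keeps each edge in one direction only.

```java
import java.util.*;

// A condensed sketch of steps 3-5 of the heuristic: build edges between
// keyword vertices that share tuple id-numbers, then pick the vertex that
// relates the most keywords. All names are my own, not the thesis's.
class KeywordSearchSketch {

    // Steps 3-4: one vertex per keyword match; a connection between two
    // vertices whenever their id-number sets share at least one id.
    static Map<String, List<String>> buildEdges(Map<String, Set<Integer>> vertices) {
        Map<String, List<String>> edges = new HashMap<>();
        List<String> keys = new ArrayList<>(vertices.keySet());
        for (int i = 0; i < keys.size(); i++) {
            for (int j = i + 1; j < keys.size(); j++) {
                // Intersect the two id-number sets (the edge weight).
                Set<Integer> shared = new TreeSet<>(vertices.get(keys.get(i)));
                shared.retainAll(vertices.get(keys.get(j)));
                if (!shared.isEmpty()) {
                    edges.computeIfAbsent(keys.get(i), k -> new ArrayList<>()).add(keys.get(j));
                    edges.computeIfAbsent(keys.get(j), k -> new ArrayList<>()).add(keys.get(i));
                }
            }
        }
        return edges;
    }

    // Step 5: the vertex with the most edges relates the most keywords.
    static String vertexWithMostEdges(Map<String, List<String>> edges) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            if (e.getValue().size() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue().size();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical keyword vertices with made-up id-numbers.
        Map<String, Set<Integer>> vertices = new HashMap<>();
        vertices.put("firstname.jan", Set.of(1, 2, 3));
        vertices.put("harbor.nantes", Set.of(1));
        vertices.put("date.1744", Set.of(2));
        System.out.println(vertexWithMostEdges(buildEdges(vertices))); // prints "firstname.jan"
    }
}
```

The id-numbers the best vertex shares with its neighbours would then drive the SQL generation of step 6.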
3.2 Indexing

A major concept in information retrieval is indexing. Indexing is applied to gain speed in the process of retrieval. An established indexing technique is to create what is called an inverted index. The basic idea of an inverted index is depicted in figure 3.2, where the numbers denote the document id-numbers in which the terms occur. The terms occurring in the document corpus are united in what is called a dictionary. Each term in this dictionary points to what is called a postings list, consisting of separate postings. For instance, given a keyword query Brutus and Caesar, the postings lists of the terms Brutus and Caesar are retrieved and intersected to obtain the document id-numbers that occur in both postings lists. Manning et al. [11] The term and can be handled in different ways: treat it as a phrasal element; designate it to be a stop word, hence ignoring it; or perhaps interpret it as a Boolean operator.²

Figure 3.2: a fragment of an inverted index (Manning et al. [11])

I explain all of this because I used the concept of an inverted list within the context of the Lastgeld database, although in a very different manner. The construct is based on the following reasoning: if a term like "Brutus" can belong to a document with id 1, then a term like "Cornelis" can belong to a tuple with id 1. I have materialized this reasoning by creating indices of the form shown in listings 3.1 and 3.2. An advantage of the Lastgeld data is the fact that the data doesn't change over time. I simply pre-generated the indices and stored them on disk. I created these indices for the following attributes³:

Skippers - date, firstname, lastname
Locations - harbor, modcountry

Table 3.1: fields eligible for keyword-based search

AART 1198
ABE 4765 12726 4790 1567 1246
ABE BR 6900 12471 11380 857
ABE BROER 5986
ABE BROERSZ 9008
ABEL 6545
ABEL BAS 3553
ABLBERT 7581
ABLBERT J 10065
ABRAHAM 3410 4434 6303 5806 9355 9064 6687 8444 9828…

Listing 3.1: an index of first names

1744 1 2 3 4 5 6 7 8 9 10 11 12 13 16 17 18 19 20...
1745 2332 2333 2334 2335 2336 2337 2338 2339 2340...
1746 5027 5028 5029 5030 5031 5032 5033 5034 5035…
1747 7619 7620 7621 7622 7623 7624 7625 7626 7627…
1748 10423 10424 10425 10426 10427 10428 10429 10430…

Listing 3.2: an index of dates

The dates in the second column I have reduced to only the year notation, as shown in listing 3.2. It is difficult to see, for instance, "1744-04-01" as a keyword, while just "1744" is easier to grasp. However, I emphasize that this is a design choice made to reduce complexity. I will not go into which date format is best to index.

² I will not go into this particular subject since the relevance of the term "and" is minimal in the Lastgeld data set. However, supporting Boolean operators in a keyword-based search system can be useful, although I consider this subject to be beyond the scope of my thesis project.
³ I have decided not to use the numerical data shown in the Skippers table (table 2.2), since numerical data will not fit the classification of a keyword.

I have shown the way I created indices for some of the data in the Skippers table. To be able to associate the data in the Locations table with the data in the Skippers table, I chose to ignore the relational key-foreign key concept altogether. In essence this abstract concept serves the purpose of attaching relating entities of data over multiple tables within the domain of a relational database, not in the data structure I use to connect data. Listings 3.3 and 3.4 show how I span two tables by assigning the id-numbers in the Skippers table to their relating data in the Locations table. This way I can deal with the data per tuple, including the data in a relating table, by a single id. I will elaborate on this benefit in section 3.7.
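An index of the kind shown in listing 3.2 can be generated in a few lines. The thesis builds its indices with Perl scripts (appendix A), so the Java sketch below, including all its names, is merely my own illustration of the idea, assuming each tuple's date is available as an ISO-style string:

```java
import java.util.*;

// Builds a year index in the style of listing 3.2: each tuple's date is
// reduced to its year, and every year maps to the sorted list of tuple
// id-numbers occurring in that year. Names are mine, not the thesis's.
class YearIndexBuilder {
    static Map<String, List<Integer>> build(Map<Integer, String> idToDate) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> tuple : idToDate.entrySet()) {
            String year = tuple.getValue().substring(0, 4); // "1744-04-01" -> "1744"
            index.computeIfAbsent(year, y -> new ArrayList<>()).add(tuple.getKey());
        }
        // Keep each postings list sorted so it can later be intersected cheaply.
        index.values().forEach(Collections::sort);
        return index;
    }
}
```

Since the Lastgeld data never changes, such an index can be generated once and written to disk, exactly as the thesis does with its pre-generated indices.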
AABO 10051 3670 12238 6360 9627 3812 6464
AAHUS 1465 4526 6088 5876 9036 1804 7579…
AALBORG 4345 6911 6027 6438 8408 9311 …
ALAMEIDA 10351 12755 12514 7843 12778 …
ALEXANDRIEN 5544
ALICANTE 2213 506 2587 8481 12925 4544…

Listing 3.3: an index of harbors

BELGIUM 7659 2379 7491 7596 12956 9064 9828…
DENMARK 11644 11762 1519 7549 3602 9108 10175…
EGYPT 5544
ENGLAND 1000 159 6016 2225 2312 12211 7832…
ESTONIA 3921 7800 1118 10851 884 11398 11622…
FINLAND 10051 3670 12238 6360 9627 3812 6464…

Listing 3.4: an index of countries

3.3 Searching indices

Now that I have explained how I created the indices, I proceed to explain how I search them. Given a keyword query Q that consists of keywords K1,…,Kn, I search every index for every K in Q. For instance, given a keyword query Cornelis Vos France 1744, I perform the following index searches:

Cornelis - Skippers.date, Skippers.firstname, Skippers.lastname, Locations.harbor, Locations.modcountry
Vos - Skippers.date, Skippers.firstname, Skippers.lastname, Locations.harbor, Locations.modcountry
France - Skippers.date, Skippers.firstname, Skippers.lastname, Locations.harbor, Locations.modcountry
1744 - Skippers.date, Skippers.firstname, Skippers.lastname, Locations.harbor, Locations.modcountry

This seems a bit redundant at first, but based on the keywords entered alone I do not know where they might occur in the indices or what their inter-keyword relationship is. Furthermore, I must emphasize that I do not perform any pre-analysis of keywords based on their appearance. For instance, if a keyword entered is numerical, like 1744, I still search the indices Skippers.firstname, Skippers.lastname, Locations.harbor and Locations.modcountry, even though I know that no numerical values exist in these indices. I do this to save myself extra work, at the cost of efficiency. Also, I haven't divided the indices into partitions; for instance, the index of Skippers.firstname is a continuous list from A to Z.
I could have divided the indices into sub-indices that each contain one letter of the alphabet; this way I could select the appropriate sub-index based on the first letter or digit of a keyword. I haven't done this to avoid dealing with many separate indices, again at the cost of efficiency. For instance, given a keyword query Jan Cornelis Nantes 1744, the process of searching the indices for keywords yields lists with the following contents:

A) Jan - firstname[2, 19, 54, 73, 76, 88, 95, 97, 112, ..., 12999]
B) Cornelis - firstname[1, 56, 130, 137, 143, 144, 176, ..., 13007]
C) Cornelis - lastname[8, 53, 130, 160, 211, 220, 235, 358, ..., 12956]
D) Nantes - harbor[1, 190, 234, 335, 379, 477, 478, 554, ..., 12865]
E) 1744 - date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 2331]

Listing 3.5: retrieving occurrences of keywords from the indices given the query Jan Cornelis Nantes 1744

3.4 Creating a graph data structure

Looking at listing 3.5, B, D and E seem to be related in the first tuple; however, A and C do not belong to the first tuple, so this isn't a very good match. Ideally one would want to find the tuple(s) in which all keywords are present, or tuples in which at least four of the five keywords are present. However, I realize that if one wants to retrieve relating data of this kind in any way, one has to find a way to deal with the complexity of inter-keyword relationships first. I decided to use a directed weighted graph data structure. This data structure enables me to handle the complexity of the data in a better way. In section 3.6 I explain why I implemented the graph as a directed graph instead of an undirected graph. Recall that I described the essence of a weighted graph in section 2.1.1. Figure 2.5 depicts a weighted graph representation of roads between cities, with the distances associated with the edges between the cities.
Based on this concept I came up with the following reasoning: if I declare every keyword in a keyword query that occurs with at least one id in an index to be a vertex, then I can connect a pair of vertices by creating an edge between them if they have at least one id in common. In doing so I can assign the id-numbers two vertices have in common to the edge between these two vertices. This way I create a weighted graph in which all keywords of the keyword query found in the indices are vertices in the graph, and the id-numbers associated with a keyword in the index are associated with the vertex of that keyword. An edge exists between any possible combination of two vertices if the two vertices have at least one id-number in common. If this is the case, an edge between that pair of vertices is created and the weight of the edge is associated with all the id-numbers these two vertices have in common. For example, the query IJsbrand Hanning Riga 1744 is inserted in the directed weighted graph data structure as follows:

Figure 3.3: a directed weighted graph of the keyword query IJsbrand Hanning Riga 1744

3.5 Creating edges

In order to create an edge between two vertices, the id-numbers that are part of each of these two vertices have to be intersected. The result of the intersection is assigned to the edge between these two vertices. I found this to be a good solution since the edge represents the relation between two data entities based on the id-numbers they have in common. If an edge is established between, for instance, firstname.ijsbrand and lastname.hanning, I chose not to also create an edge between lastname.hanning and firstname.ijsbrand. This would make the graph more complicated, while having the edges in just one direction worked out to be sufficient. I elaborate on this in the next section. First, I describe the need to determine all the possible pairs of vertex combinations to be able to intersect the id-numbers associated with vertices.
Given a set of 5 vertices as depicted in figure 3.3, the following distinct combinations are possible: {0,1} {1,2} {2,3} {3,4} {0,2} {1,3} {2,4} {0,3} {1,4} {0,4}. These combinations of vertices must all be intersected because they are potentially related. For instance, given the keyword query IJsbrand Hanning Riga 1744, the following intersection sequence must be completed:

***vertex combinations***
ijsbrand firstname[64, 901, 5360, 7750, 7757, 11404] -- ijsbrand lastname[5039]
ijsbrand lastname[5039] -- hanning lastname[64]
hanning lastname[64] -- riga harbor[64, 65, 85, 103, 195, 198, 346, 373,...,13006]
riga harbor[64, 65, 85, 103, 195, 198, 346, 373,...,13006] -- 1744 date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...,2331]
ijsbrand firstname[64, 901, 5360, 7750, 7757, 11404] -- hanning lastname[64]
ijsbrand lastname[5039] -- riga harbor[64, 65, 85, 103, 195, 198, 346, 373,...,13006]
hanning lastname[64] -- 1744 date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...,2331]
ijsbrand firstname[64, 901, 5360, 7750, 7757, 11404] -- riga harbor[64, 65, 85, 103, 195, 198, 346, 373,...,13006]
ijsbrand lastname[5039] -- 1744 date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...,2331]
ijsbrand firstname[64, 901, 5360, 7750, 7757, 11404] -- 1744 date[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...,2331]

Listing 3.6: intersecting vertices for the query "IJsbrand Hanning Riga 1744"

A consequence of intersecting all combinations of vertices this way is that the intersection work that needs to be done to create edges increases quickly in relation to the number of keywords in the query. Table 3.2 shows the increase in the number of intersections as a result of the increase in the number of keywords. Since some of the id-lists shown in listing 3.6 are quite large, the retrieval time should increase in relation to the size of the id-lists together with the number of keywords in a keyword query. In chapter four I perform an experiment to further analyze these dependencies in relation to overall retrieval time.
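The pairwise intersection step above can be sketched as follows (a minimal reconstruction of sections 3.4-3.5; the data layout and names are mine, and the id-lists are abbreviated versions of those in listing 3.6):

```python
# Sketch: every pair of vertices is intersected; an edge is created only
# for pairs that share at least one id-number, and the edge holds the
# common id-numbers as its weight.
from itertools import combinations

# Vertices for the query "IJsbrand Hanning Riga 1744" (id-lists abbreviated).
vertices = {
    "firstname.ijsbrand": {64, 901, 5360, 7750, 7757, 11404},
    "lastname.ijsbrand": {5039},
    "lastname.hanning": {64},
    "harbor.riga": {64, 65, 85, 103, 195, 198},
    "date.1744": set(range(1, 2332)),
}

def build_edges(vertices):
    """n vertices yield n*(n-1)/2 intersections (cf. table 3.2)."""
    edges = {}
    for u, v in combinations(vertices, 2):
        common = vertices[u] & vertices[v]
        if common:  # only related vertices get an edge
            edges[(u, v)] = common
    return edges

edges = build_edges(vertices)
print(edges[("firstname.ijsbrand", "lastname.hanning")])  # {64}
```

With 5 vertices, `combinations` produces the 10 pairs listed above; for 20 keywords it would produce the 190 intersections of table 3.2.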
Number of keywords       4   6   8  10  12   20
Number of intersections  6  15  28  45  66  190

Table 3.2: number of intersections

3.6 Finding an answer to a keyword query

I assume that the vertex with the most outgoing edges is the "binding" vertex. Vertex A in figure 3.4 connects to B, C and D; I simply intersect the id-numbers associated with these three edges and I obtain the final result with id 64. Because I decided to implement the graph as a directed graph this is possible. If I had implemented the graph in figure 3.4 as an undirected graph, each vertex would have 3 incoming and 3 outgoing edges and hence this approach wouldn't work.

I must admit that this is likely not the best solution. For instance, suppose there is another vertex, E, and A also matches this vertex, but all the other vertices A is related to do not match it; then the final result is an empty list. In other words, the edges of the vertex with the most edges must all be related.

Figure 3.4: graph representation of the keyword query IJsbrand Hanning Riga 1744

A better solution would be to traverse the graph. There are several ways to traverse the graph. While traversing the graph, the main condition that must be fulfilled is that every new edge reached must have at least one id in common with the latest visited edge to be able to continue the traversal on a new edge. To find the best-fit answer given the keywords of a keyword query, it is evident that as many vertices as possible should be visited. In the case of the graph in figure 3.4, the path A-B-D-C would yield the id-numbers spanning the most keywords. Finding a good graph traversal algorithm is one of the subjects I suggest in chapter 8 as a possible follow-up. For now I will use the rather 'naïve' approach of finding the vertex with the most edges.

3.7 Creating SQL queries

At this stage an answer to a keyword query has or has not been found, depending on the presence of the query keywords in the indices of the relational database.
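The binding-vertex strategy of section 3.6 can be sketched as follows (my own minimal reconstruction, not the thesis code; the edge data mirrors figure 3.4 with invented extra ids):

```python
# Sketch: pick the vertex with the most outgoing edges ("binding" vertex)
# and intersect the id-sets on those edges to obtain the final result.
def find_answer(edges):
    """edges maps (source, target) vertex pairs to their common id-set."""
    outgoing = {}
    for (u, _), ids in edges.items():
        outgoing.setdefault(u, []).append(ids)
    # The binding vertex is the one with the most outgoing edges.
    binding = max(outgoing, key=lambda u: len(outgoing[u]))
    result = set.intersection(*outgoing[binding])
    return binding, result

# Edges as in figure 3.4: A connects to B, C and D; extra ids invented.
edges = {
    ("A", "B"): {64, 901}, ("A", "C"): {64, 65}, ("A", "D"): {64},
    ("B", "D"): {64}, ("C", "D"): {64, 195},
}
print(find_answer(edges))  # ('A', {64})
```

This also makes the weakness described above visible: if one of A's edges shared no id with the others, the intersection would be empty even though a good partial answer exists.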
If there is a result list of id-numbers, then the search application possesses all it needs to efficiently retrieve all relating data from the relational database. For instance, given the keyword query Cornelis Vos France, the search application creates the following SQL queries:

SELECT * FROM Skippers WHERE idno = 1
SELECT * FROM Skippers WHERE idno = 2574
SELECT * FROM Skippers WHERE idno = 5288
SELECT * FROM Skippers WHERE idno = 8102

results:
id = 1, first name = CORNELIS, last name = VOS, harbor = NANTES, toll-decimal = 2.40, weight = 128.47, guldens = 2, stuivers = 8, cargo units = 65
id = 2574, first name = CORNELIS, last name = VOS, harbor = LE CROISIC, toll-decimal = 2.40, weight = 57.32, guldens = 2, stuivers = 8, cargo units = 29
id = 5288, first name = CORNELIS C., last name = VOS, harbor = BORDEAUX, toll-decimal = 2.40, weight = 128.47, guldens = 2, stuivers = 8, cargo units = 65
id = 8102, first name = CORNELIS W., last name = VOS, harbor = LIBOURNE, toll-decimal = 2.40, weight = 75.10, guldens = 2, stuivers = 8, cargo units = 38

Listing 3.7: system-generated SQL queries and their answers

3.8 Results

In Information Retrieval, search system effectiveness is assessed relative to an information need, not to a query. A document is classified as relevant if it is coherent with a certain information need, not because it contains all the keywords in a certain query. Manning et al. [11] However, a tuple in a relational database is not a document containing potentially many terms. When I query a document of 500 terms with just two keywords and these two keywords appear in that document, the relation of 2 to 500 is weak. In contrast, when I query tuples in a relational database consisting of five fields, as in the Lastgeld database, and two of these five fields are matched by the keywords in a specific tuple, the relation is obviously much stronger. Still, the relevance of retrieved results can't be judged without a defined information need.
An inherent characteristic of keyword queries is that they are inexact. One cannot claim that a keyword query like Jan Jansen Kleine Oost Germany 1745 is equivalent to, for instance:

SELECT * FROM skippers INNER JOIN locations ON (skippers.hid = locations.hid) WHERE skippers.date = '1745' AND skippers.firstname = 'jan' AND skippers.lastname = 'jansen' AND locations.harbor = 'kleine oost' AND locations.modcountry = 'germany'

By trading a query syntax like SQL for just some keywords, control over retrieving exact answers is lost. The spaces between keywords do not per se imply a relation between keywords, but they do not imply there isn't any relation either. This lack of exactness of keyword queries has forced me to make an assumption about what is important when dealing with multiple keywords in a query. I assume that if the keywords in a keyword query are related in the relational database that contains the Lastgeld data, then retrieving this relating data yields results likely to be meaningful given the keyword query. Consider the following queries and their effect:

Cornelis - If there is just one keyword present in the query, the search application retrieves all tuples in which the keyword Cornelis occurs. Note that Cornelis appears in the indices as a first name and as a last name; tuples in which Cornelis appears as a last name and tuples in which it appears as a first name are both retrieved.

Cornelis Vos 1744 France - All keywords are present in the indices, and all keywords are related by several id-numbers. Based on the assumption described in the previous paragraph, the search application only retrieves the tuples in which all the keywords occur; tuples in which they occur separately are not retrieved. Keyword order does not influence the generation of results, e.g. Vos France 1744 Cornelis yields the exact same results.

Cornelis Vos Groningen - The keyword Groningen doesn't appear in the indices and thus no vertex is created for this keyword.
However, Cornelis and Vos do occur in the indices and thus vertices are created for these keywords. As a result this query yields the same results as Cornelis Vos alone.

Cornelis Vos Jansen - This keyword query is a special case: Cornelis and Vos are related, but Cornelis and Jansen are also related. The keyword Cornelis has the most edges; it points to Vos and to Jansen. As a consequence of working with the vertex that has the most edges, the id-numbers associated with the edge between Cornelis and Vos are intersected with the id-numbers associated with the edge between Cornelis and Jansen. This leads to an empty result list. I have not attempted to simulate "OR" semantics; therefore I have not merged the results of Cornelis Vos and Cornelis Jansen into one result list.

Cornelis France - In terms of the relational database, Cornelis appears in the Skippers table and France appears in the Locations table. Since these table boundaries do not exist in the indices, the search application generates SQL queries for a single table only. This query retrieves all the occurrences of Cornelis in the Skippers table in relation to France in the Locations table.

Table 3.3: example queries

4. Experiment

In this chapter I present the results of the experiments regarding the efficiency of the presented search application. I have designed an experiment in which I measure:

1. The retrieval time of multiple keyword-based search queries posed to the search application. More specifically, I focus on:
A) The number of keywords in the query. I state in section 3.5 that the retrieval time will increase in relation to the number of keywords in a keyword query. In table 3.2 I show that the number of intersections needed increases rapidly as the number of keywords in the query increases.
B) The size of the id-list associated with a keyword. In section 3.5 I also stated that retrieval time will increase if the size of the id-lists associated with the keywords increases.
Given these parameters I can assess their processing weight in terms of the relative increase of retrieval time.

2. The retrieval time of the SQL queries the search application proposes. For instance, given the keyword query "Cornelis Vos France", the search application deals with data in different tables by merging the indices of two tables within the relational database; this way the search application restricts its querying to just one table, like so:

SELECT * FROM Skippers WHERE idno = 1
SELECT * FROM Skippers WHERE idno = 2574
SELECT * FROM Skippers WHERE idno = 5288
SELECT * FROM Skippers WHERE idno = 8102

However, in this approach multiple queries are posed in sequence, e.g. if there are 100 result tuples, then 100 queries are formulated and posed to the relational database. Intuitively this appears to be an expensive solution in terms of processing time. On the other hand, the queries are rather straightforward to process, since they do not span multiple tables and the 'where' clause contains just one argument at a time. In the following sections I describe how the factors mentioned in both 1A/B and 2 influence the execution time of a search request.

4.1 Experimental design

I formulated four search queries to measure the influence of three factors on the query execution time. The search queries are presented in table 4.1 and the three factors are:

1. the index size(s);
2. the number of keywords in the query;
3. the number of queries the search application proposes.

A: Jan Jansen Kleine Oost Germany 1745 - involves relatively big lists of id-numbers - Jan: 1319, Jansen: 149, Kleine Oost: 5618, Germany: 5683, 1745: 2696
B: Germany - involves the biggest list of id-numbers - Germany: 5683
C: Nicolaas Bark - involves very small lists of id-numbers - Nicolaas: 9, Bark: 4
D: Hendriks Bordeaux - involves medium-sized (relative) lists of id-numbers - Hendriks: 333, Bordeaux: 264

Table 4.1: experimental queries

Query A consists of 5 keywords.
Each keyword occurs in only one index. As a consequence the search application needs to intersect 10 combinations of reasonably sized lists of id-numbers. The answer to the query consists of one tuple; therefore the number of queries the search application proposes to the relational database is one. Query B consists of just one keyword that occurs in only one index. As a consequence the search application doesn't intersect at all. The answer to the query consists of 5683 tuples; therefore the number of queries the search application proposes to the relational database is 5683. Query C consists of two keywords with small lists of id-numbers. As a consequence the search application intersects only once. The answer to the query consists of one tuple. Query D consists of two keywords that have medium-sized lists of id-numbers in the indices (relative to the available data). As a consequence the search application intersects only once. The answer to the query consists of 6 tuples.

For each query I measured the execution time 50 times, both for the speed of the execution within the search application and for the speed of the execution of the queries proposed by the search application. In between measurements I halted for 10 seconds. To add an extra element of comparison I also measured the speed of the SQL queries semantically related to A, B, C and D. For instance, keyword query A could be interpreted as:

SELECT * FROM skippers INNER JOIN locations ON (skippers.hid = locations.hid) WHERE skippers.date = '1745' AND skippers.firstname = 'jan' AND skippers.lastname = 'jansen' AND locations.harbor = 'kleine oost' AND locations.modcountry = 'germany'

This serves as an indicator of the performance loss the presented search application causes, under the assumption that the keyword query expresses the same search intention as the mentioned SQL query, given the semantic relatedness of the Lastgeld data.
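The measurement setup can be sketched as follows (my own reconstruction; the thesis gives no code, and the `measure` helper and its stand-in workload are invented):

```python
# Sketch of the timing protocol of section 4.1: execute a query
# repeatedly, pause between measurements, and report the median.
import statistics
import time

def measure(run_query, repetitions=3, pause=0.0):
    """Return the median execution time of run_query in milliseconds.

    The thesis uses repetitions=50 and pause=10 (seconds); smaller
    defaults are used here so the example finishes quickly.
    """
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
        time.sleep(pause)  # let the system settle between measurements
    return statistics.median(timings)

# Hypothetical stand-in for posing a keyword query to the application.
median_ms = measure(lambda: sum(range(10000)))
print(f"median execution time: {median_ms:.2f} ms")
```

Reporting the median rather than the mean keeps a single slow outlier (e.g. a cold cache) from distorting the comparison, which matches the medians reported in section 4.3.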
4.2 Experimental results

I have performed the experiments on a Pentium 2.2 GHz processor with 4GB of RAM running Windows Vista. The Lastgeld database, as described in section 2.2, runs on the same machine. The RDBMS I use is provided free by MySQL[17]. In the next chapter I describe the implementation details of the presented search application in more detail.

Figure 4.1: experimental results keyword query A
Figure 4.2: experimental results keyword query B
Figure 4.3: experimental results keyword query C
Figure 4.4: experimental results keyword query D

4.3 Analysis

Keyword query A consists of a relatively high number of keywords. The 15,500 id-numbers that are associated with the keywords in the indices cause the search application to execute many intersections, and therefore the execution time of the search application increases by hundreds of milliseconds. However, the SQL query the search application poses is executed almost instantaneously. (Recall that keyword query A only yields one query.) The semantically related autonomous query is somewhat slower but executes in around 10 milliseconds.

Keyword query B consists of only one keyword. The search application finishes the request with a median of 44 milliseconds over 50 measurements. However, the search application proposes 5683 SQL queries, since the keyword Germany occurs with 5683 id-numbers. This slows down the eventual result generation considerably. The median of the execution of these SQL queries is 895 milliseconds over 50 measurements. The semantically related query doesn't execute flawlessly either.

Keyword query C still generates about 60 ms of execution time for the keyword search application, while the number of intersections is reduced to one and the id occurrences in the indices are minimal. The search application executes keyword query D slightly slower than query C, but the overall execution times are similar to keyword query C.

5.
Related work

There has been extensive research on the topic of keyword search over structured data. In section 5.1 I describe three keyword search systems which perform operations on data stored in graph data structures. These graph data structures serve as models of the underlying relational database. I chose to describe three systems that apply different techniques to achieve roughly the same goals. In section 5.2 I describe work on the subject of keyword search over XML data. Like data stored in a relational database, XML data is structured as well, although in a different sense. Techniques applied in this area are different, but ultimately their purpose is identical to the purpose of the techniques applied when dealing with relational databases.

5.1 Graph-based systems

Discover[6], BANKS[1] and EASE[9] are systems developed in this area. In these systems a database is represented by a graph data structure where tuples are the nodes in the data structure.

5.1.1 Data representation

Discover, BANKS and EASE generate query answers called 'tuple trees' based on the keywords received as input. A tuple tree is a joining tree of tuples, i.e. a data structure that contains one or more tuples in a predefined manner. In a tuple tree each node is connected via foreign-key relationships. In BANKS and EASE a specific kind of tuple tree, called a Steiner tree, is applied to hold the answers to keyword queries. A difference between these systems can be found in the way the database in question is modeled to generate the tuple trees. The key algorithms of Discover work on a graph data structure that is modeled based on the properties of the database schema. The database schema graph is used to produce a number of SQL queries needed to answer the keyword queries presented to the system. Unlike Discover, the more advanced BANKS and EASE model the entire database as a directed graph data structure.
Figure 5.1 visualizes a fragment of the graph data model EASE constructs based on the publication database presented in tables 5.1. From here the system generates tuple trees to create answers to the keyword queries presented to the system. In the case of EASE, as mentioned before, tuple trees are built as Steiner graphs. For clarity: a tree data structure is a kind of graph data structure (Dale et al. [3]). Figure 5.2 visualizes these Steiner trees as answers to keyword queries. The circles around nodes represent a method described in EASE to reduce processing, since it can be very costly to generate Steiner trees over a large data graph. The EASE authors propose to define a radius based on certain properties, containing the nodes necessary to produce an adequate answer to the keyword query being processed.

Tables 5.1: table data representation
Figure 5.1: a fragment of a data graph model
Figure 5.2: radius Steiner trees

5.1.2 Top-k ranking

An important notion in all these methods is efficiency. The described techniques possibly span very large databases. Because schema or data graphs are kept in main memory and trees are generated on the fly when the system is queried, the processing must finish almost instantaneously. According to Dalvi et al. [4], the graph data structures used in keyword search engines can potentially span very large data sets, because structured data is fairly compact compared to textual data; graphs with millions of nodes, related to hundreds of megabytes of data, can be stored in tens of megabytes of main memory. If a search system runs on dedicated servers, even larger graphs, such as the English Wikipedia, which contained over 1.4 million nodes and 34 million links (edges) as of October 2006, can be handled. Although efficiency becomes a very important topic when dealing with large data graphs, a certain degree of effectiveness is in most cases just as, or even more, important.
Discover, BANKS and EASE all incorporate a top-k ranking mechanism to achieve a degree of effectiveness as well. Discover ranks results by the number of joins involved. The idea behind this strategy is that joins involving many tables are more difficult to grasp. This ranking strategy has a certain parallel with ranking methods used in document retrieval; documents in which keywords occur close to one another are ranked higher than documents in which keywords are far apart. However, in a follow-up to Discover, Hristidis et al. [7] propose a ranking method known in the field of Information Retrieval (IR) as relevance ranking. The general idea behind relevance ranking is that, according to some definition of relevance, only the few most relevant matches are generally of interest. Consequently, instead of computing all matches for a keyword query, only the top-k matches are computed. This, in turn, yields a more efficient solution. The BANKS system incorporates a technique that assigns weights to tuples and to edges between tuples. A combination of tuple weights and edge weights in a tuple tree is calculated to rank matches. Liu et al. [10] argue that, despite the methods applied to gain effectiveness, the focus of Discover and BANKS is still primarily set on obtaining efficiency by avoiding the creation of unnecessary tuple trees and by deploying algorithms to improve the time and space complexities. They state that effectiveness should be equally important. In turn they incorporate a full-fledged IR solution on top of a system conceptually comparable to Discover, BANKS and EASE. Liu et al. define tuple trees to be super-documents and all text column values to be documents. Let T be a tuple tree and {D1, D2, …, Dm} be all text column values in T. Then, to rank tuple trees, they compute a similarity value between the query Q and the super-document T as shown in Equation 1. The similarity is the dot product of the query vector and the super-document vector.
In contrast to the systems described earlier, Liu et al. apply IR evaluation techniques to assess the results achieved.

Sim(Q, T) = Σ_{k ∈ Q ∩ T} weight(k, Q) × weight(k, T)    (1)

In EASE, a ranking mechanism is proposed that incorporates three notions: a TF-IDF based ranking function that considers textual properties of a Steiner graph; the compactness of a Steiner graph; and the keyword order in the query. The TF-IDF based ranking function assigns a weight to the Steiner graph, that is, to the keywords present in the Steiner graph. Recall that Steiner graphs are an extraction of the keyword presence in the underlying data graph that models the data present in a relational database. The TF-IDF based ranking function takes into account the term frequency (TF), the inverse document frequency (IDF) and the normalized document length (NDL). TF and IDF are used to rank. In IR literature, NDL is used to normalize document length, since a longer document tends to repeat the same terms, while this doesn't per se mean that the document should be ranked higher. Manning et al. [11]

These three parameters are computed as follows:

ntf(k_i, G) = 1 + ln(1 + ln(1 + tf(k_i, G)))    (2)
idf(k_i) = ln((N + 1) / (N_ki + 1))    (3)
ndl = (1 − s) + s × tl_G / avgtl    (4)

Where tf(k_i, G) in Equation 2 denotes the term frequency of keyword k_i in the data graph G; in Equation 3, N and N_ki denote the number of Steiner graphs and the number of those Steiner graphs containing keyword k_i; in Equation 4, tl_G denotes the total number of terms in G and avgtl is the average number of terms among all Steiner graphs. These parameters are consequently used to compute a ranking weight between a keyword and a Steiner graph SG. In EASE, however, Li et al. comment that Information Retrieval ranking methods based on TF-IDF can be efficient for textual documents, but are not very efficient for semi-structured and structured data. Li et al.
state that a consequence of modeling data as a graph is that the ranking of structural properties of the data graph becomes just as, or even more, important. According to Li et al., rich structural relationships should be at least as important as discovering more keywords in the data graph. To this end EASE [9] takes the structural compactness of the data into account to create an additional weight on top of the TF-IDF weight described earlier. Given a keyword query K = {k1, k2, ..., km}, the thick-edged circles in figure 5.3 containing p5, p7 and a4 are content nodes that contain at least one keyword. (Recall that a node represents the data of a tuple in the relational database.) Node s in figure 5.3 is called a Steiner node if there exist two content nodes, u and v, and s is on the path u ↔ v (s may be u or v), where u ↔ v denotes a path between u and v. Since such a path exists between p5, a4 and p7, the radius property described earlier yields a radius Steiner graph (figure 5.4), which serves as input to a ranking function. The ranking is accordingly based on the compactness of the Steiner graph. The underlying idea is that a more compact Steiner graph is more likely to be meaningful.

Figure 5.3: accented Steiner nodes
Figure 5.4: a Steiner graph result

Although the structural compactness of nodes can be an important measure when generating a useful result set, it cannot be of service to evaluate inter-keyword semantics. The order of the keywords can hold meaning if the query is an expression of a phrase. In EASE, the weighting function applied also takes the keyword order into account. This is done by assigning more weight to keywords that have a smaller inter-keyword distance.

5.2 XML-based systems

A different but intrinsically related research topic is keyword search in XML databases. This topic is related because XML, like data in relational databases, is structured by nature as well.
XML queries possibly return entire XML documents, or may as well return deeply nested XML elements. Because of the inherent nested structure of XML, the notion of ranking is no longer at the granularity of a document but at the granularity of an XML element. Manning et al. [11] As described in the previous section about graph-based systems, efficiency and effectiveness are very important factors in the process of developing a search system. In section 5.2.1 I describe an indexing method as employed by Florescu et al. [5]. After that, I shift from the indexing technique applied by Florescu et al. to the search system XSearch [2]. First I describe the semantics derived from an XML data representation in order to meaningfully answer a keyword query. In conclusion of this chapter, I describe the ranking mechanism of the XSearch system.

5.2.1 Indexing

Florescu et al. [5] extend an XML query language for the purpose of keyword search. In their proposal the XML data is replicated in a relational database. I will not go into this particular architecture; I am interested in describing the index system that is employed to retrieve query answers from XML data. A common indexing approach used in traditional IR systems is by means of an inverted file as described in section 3.2. A simple setup for an inverted file takes the following form:

<word, document>

This means that word can be found in document. However, when dealing with XML, retrieval is no longer at the granularity of a document but at the granularity of an XML element, as noted earlier. Consider listing 5.1. The word 'Analysis' appears in a title element nested in an article element. To be able to utilize the nested structure for retrieval, Florescu et al. make a distinction between keywords occurring as tags, e.g. article; as names of an attribute, e.g. id; or as data content of elements. Additionally, the depth at which a keyword occurs is taken into account. As a result an inverted file of the following form is proposed:

<"article", elID1, 0, tag>
<"id", elID1, 1, attr>
...
<"name", elID1, 2, tag>
<"Adam", elID1, 2, value>

Each interior node Ne is labeled with a distinct element ID, elID, and each elID is associated with all the elements it contains.

<document>
<article id="1">
<author><name>Adam Dingle</name></author>
<author><name>Peter Sturmh</name></author>
<author><name>Li Zhang</name></author>
<title>Analysis and Characterization of Large-Scale Web Server Access Patterns and Performance</title>
<year>1999</year>
<booktitle>World Wide Web Journal</booktitle>
</article>
</document>

Listing 5.1: an XML fragment

To utilize the inverted list, all data is modeled in records containing the URL of the XML document it belongs to, the starting and ending positions of the elements within this document, and the type of the element. As a result the following relational schema is obtained:

elements(elID, docid, start_pos, end_pos, type, id_val)
documents(docid, URL,...)

In summary, by scanning the inverted index, which is actually a representation of the data corpus, the search system can find the desired data more efficiently. If the system had to search the entire database sequentially, the processing cost would be high. Florescu et al. have created an inverted list that allows search on the level of tags, attributes and data values. This way a specific fragment of an XML document can be found fast.

5.2.2 Semantic relatedness

In XSearch, Cohen et al. present a free-form query language over XML documents. XML documents are modeled as trees that consist of interior nodes and leaf nodes. Each interior node is associated with a label and each leaf node is associated with one or more keywords. Figure 5.5 represents such a tree as a model of a fragment of the SIGMOD (Special Interest Group on Management of Data) publication database.
Figure 5.5: an XML data representation

In a sense a node in the tree can be viewed like a human being in our world: different people may have identical names. As such, two different nodes with the same label are different entities of the same type. To extend this analogy, one can say humans are related if they share the same ancestor(s). Now suppose that nodes n and n' have different ancestors, say na and n'a, and these ancestors share the same label; then it is said that n and n' are not meaningfully related. This holds as long as n and n' share the same relationship tree.

Let T be a tree and let n1 and n2 be nodes in T; then the shortest undirected path between n1 and n2 is the path via the lowest common ancestor of n1 and n2. Recall from chapter 2 that a path in a graph is undirected such that G = {(n1, n2), (n2, n1)}, i.e. it is possible to get to n2 if n1 is the present location, and vice versa. The subtree consisting of the two paths is denoted T|n1,n2 and is called a relationship tree. The overall notion of relating nodes is formalized by means of two conditional rules:

1. T|n,n' does not contain two distinct nodes with the same label; or
2. the only two distinct nodes in T|n,n' with the same label are n and n'.

These semantics are additionally extended with traditional information retrieval techniques to rank query answers.

5.2.3 Top-k ranking

In XSearch, as described in the previous section, subtrees are generated as possible answers to a keyword query. The weights for ranking are calculated at the level of the leaf nodes of a document. Let k be a keyword and nl a leaf node, and let occ(k, nl) denote the frequency of occurrence of k in nl. The term frequency of k in nl is defined as:

tf(k, n_l) := \frac{occ(k, n_l)}{\max_{k' \in words(n_l)} occ(k', n_l)} \qquad (5)

This is a variation of an IR-based approach that assigns more weight to frequent words in sparse nodes.
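The term-frequency computation of equation (5) can be sketched in Java as below. The method names occ and tf and the word-list representation of a leaf node are my own illustrative assumptions, not XSearch's actual code.

```java
import java.util.Arrays;
import java.util.List;

public class TermFrequency {
    // occ(k, nl): number of occurrences of keyword k among the words of leaf node nl
    static int occ(String k, List<String> leafWords) {
        int count = 0;
        for (String w : leafWords) {
            if (w.equals(k)) count++;
        }
        return count;
    }

    // tf(k, nl) = occ(k, nl) / max over k' in words(nl) of occ(k', nl)  -- equation (5)
    static double tf(String k, List<String> leafWords) {
        int max = 0;
        for (String w : leafWords) {
            max = Math.max(max, occ(w, leafWords));
        }
        return max == 0 ? 0.0 : (double) occ(k, leafWords) / max;
    }

    public static void main(String[] args) {
        // "web" is the most frequent word (3 occurrences), so tf of "server" is 1/3
        List<String> leaf = Arrays.asList("web", "server", "web", "web", "access");
        System.out.println(tf("server", leaf));
    }
}
```

Note how a word that is the most frequent in its leaf always receives tf = 1, regardless of the leaf's absolute size.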
Let N be the set of all leaf nodes in the corpus; then the inverse leaf frequency is defined as:

ilf(k) := \log\left(1 + \frac{|N|}{|\{\, n' \in N : k \in words(n') \,\}|}\right) \qquad (6)

Then tfilf(k, n_l) = tf(k, n_l) \times ilf(k). Note that by taking a logarithm in ilf, the relative importance of the tf factor is increased. The actual weight stored is a normalized version of tfilf, denoted w(k, nl), such that w is 0 if k does not appear in nl.

Furthermore, the labels are taken into account as a weighting factor. Recall that each interior node is associated with a label. Each label l is associated with a weight w(l) that determines its importance. These weights can either be user-defined or system-generated. The key notion of using label weights is that the interior nodes, which determine the structure of the XML data, can also be taken into account. For instance, higher weights can be assigned to less common labels.

To incorporate both tfilf weights and label weights, the vector space model is utilized to determine how well an answer satisfies a query. Let L be the set of all labels and let K be the set of all keywords. Each interior node n in the data is associated with a vector Vn of size |L × K|. The vector has an entry for each pair (l, k) ∈ L × K, and Vn[l, k] denotes the entry of Vn corresponding to the pair (l, k). Let Nleaf be the set of leaf descendants of n. The values of Vn are defined as follows:

V_n[l, k] = \begin{cases} \sum_{n' \in N_{leaf}} w(k, n') & \text{if } label(n) = l \\ 0 & \text{otherwise} \end{cases} \qquad (7)

Note that w(k, nl) is 0 if k does not appear in nl. To be able to calculate similarities between answers and queries, there have to be vectors representing the terms in the query as well. Each term t is associated with a vector of size |L × K|, denoted Vt. The similarity between a query Q and an answer N, denoted sim(Q, N), is the sum of the cosine distances between the vectors associated with the nodes in N and the vectors associated with the matching terms in Q.
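The cosine measure underlying sim(Q, N) can be sketched as follows. Representing the |L × K| vectors as flat double arrays and summing one cosine value per matching (node, term) pair is a simplification I assume for illustration; it is not XSearch's published implementation.

```java
public class Cosine {
    // Cosine of the angle between two equal-length vectors:
    // dot(a, b) / (|a| * |b|); defined as 0 when either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // sim(Q, N): sum of the cosine values over matching (node vector, term vector) pairs.
    static double sim(double[][] nodeVectors, double[][] termVectors) {
        double sum = 0;
        for (int i = 0; i < nodeVectors.length; i++) {
            sum += cosine(nodeVectors[i], termVectors[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Identical vectors have cosine 1, orthogonal vectors cosine 0.
        System.out.println(cosine(new double[]{1, 0}, new double[]{1, 0}));
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1}));
    }
}
```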
As a next step the semantics described in section 5.2.2 are extended by two weight factors. Let tsize(N) denote the number of nodes in the relationship tree of N. If this value is small, the nodes are close together and therefore more likely to be meaningfully related. Additionally, nodes n and n' are said to participate in an ancestor-descendant relationship if n is the ancestor of n' or vice versa. This indicates a strong relationship between n and n'. Let anc-des(N) denote the number of unordered pairs in N that participate in an ancestor-descendant relationship. Finally, given a query Q and an answer N, the factors sim(Q, N), tsize(N) and anc-des(N) are combined to determine the ranking of a query answer:

\frac{sim(Q, N)^{\alpha}}{tsize(N)^{\beta}} \times \left(1 + \gamma \cdot anc\text{-}des(N)\right) \qquad (8)

Experiments were conducted with varying α, β and γ values to gain more control over the resulting weight.

6. Implementation

In this chapter I describe in general how I implemented the search application presented in chapter four. In the first section I describe how I created the indices using the programming language Perl. After that I focus on the search application itself, for which I used the Java programming language. First I show the object-oriented design of the search application, followed by a description of the flow of control between functional entities. Then I describe the weighted graph data structure I implemented, followed by the most prominent algorithms. I used object-oriented constructs to design 12 separate classes, including inner classes. I also used four classes I did not author: WeightedGraph, LinkedQueue, QueueInterface and ConnectionPool. The first three are published in [3] and the ConnectionPool is available in [16]. The Perl source code is documented in appendix A and the Java source code in appendix B. Since I use Java I have an elaborate library of functionality at my disposal.
I list the Java library classes I used and the function they perform in the search application. Finally, I describe how I deployed the application in a web-based environment.

6.1 Creating indices

For every field in the Lastgeld database I wanted to index, I executed the steps shown in listing 6.1. The Perl source code I wrote can be found in appendix A.

1. Make an alphabetically ordered SQL dump of a field into a text file. The dump has the following form:

(1198, 'AART'), (4765, 'ABE'), (12726, 'ABE'), (4790, 'ABE'), ...

2. Delete all punctuation (Perl):

1198 AART
4765 ABE
12726 ABE
4790 ABE
...

3. Place the id after the index term (Perl):

AART 1198
ABE 4765
ABE 12726
ABE 4790
...

4. Place every similar index term on the same line (Perl):

AART 1198
ABE 4765 ABE 12726 ABE 4790 ABE 1567 ABE 1246 ABE BR 6900 ABE BR 12471 ABE BR 11380 ABE BR 857 ABE BROER 5986 ABE BROERSZ 9008
ABEL 6545 ABEL BAS 3553
ABLBERT 7581 ABLBERT J 10065

5. Delete all the terms except the term that starts the line (Perl):

AART 1198
ABE 4765 12726 4790 1567 1246 6900 12471 11380 857 5986 9008
ABEL 6545 3553
ABLBERT 7581 10065

Listing 6.1: creating indices

6.2 Object oriented design

The UML diagram depicted in figure 6.1 shows an overview of the classes I created to implement the search application.

Figure 6.1: object oriented design

I declared Firstname, Lastname, Date, Harbor and Country to be inner classes of Lastgeld because they are entities that belong to a single semantic unit, which can be thought of as a table. Grouping them within the Lastgeld class keeps them together while still allowing them to be instantiated separately. I implemented the Graphable interface to make sure that every object added to the graph has the same properties. The inner classes implement this interface and thus must implement every method I declared in the interface.
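The role of the interface in this design can be sketched as below. The member methods getTerm and getIds are my own guesses at plausible members, not the thesis's actual Graphable interface, and LastgeldSketch is a hypothetical stand-in for the Lastgeld class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Guessed members: every vertex exposes the matched term and its id-numbers,
// so the graph can treat first names, harbors, countries, etc. uniformly.
interface Graphable {
    String getTerm();
    List<Integer> getIds();
}

public class LastgeldSketch {
    // Inner class for one database field; grouped inside the outer class
    // but still instantiable on its own, as described in section 6.2.
    static class Firstname implements Graphable {
        private final String term;          // private: reachable only via public methods
        private final List<Integer> ids;

        Firstname(String term, List<Integer> ids) {
            this.term = term;
            this.ids = ids;
        }

        public String getTerm() { return term; }
        public List<Integer> getIds() { return ids; }
    }

    public static void main(String[] args) {
        Graphable v = new Firstname("ABE", new ArrayList<>(Arrays.asList(4765, 12726)));
        System.out.println(v.getTerm() + ": " + v.getIds());
    }
}
```

Because the graph code depends only on the interface, a new field class (say, for cargo type) could be added without touching the graph logic.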
The interface construct helps to guarantee that all objects share the same properties, which can prevent inconsistencies and errors. All instance variables are declared private or protected. This guarantees that all interaction with the variables from outside the class goes through public methods.

6.3 Flow of control

In this section I explain in general in what order control is passed between the classes depicted in figure 6.1, from a keyword query to a formulated answer. The WebFrontend class is instantiated by the Tomcat web server when it receives a URL request from the web browser. The web browser displays an input field for a user to enter a keyword query and sends an HTTP request to the service method of the WebFrontend class. Here the keyword query is extracted and a Database object is created, with the keywords handed over as arguments to its constructor. The created Database object instantiates, for every keyword in the query, a new Lastgeld object. Every Lastgeld object instantiates five Connector objects, one for every index stored on disk; e.g. if a keyword query of three keywords is entered, three Lastgeld objects are created and thus 15 Connector objects in total. Each Connector object is associated with one keyword and sequentially searches one index. If there is a match, the id-numbers associated with that term in the index are stored in a list and returned to the Lastgeld object that created the Connector object. Given a match in a particular index and the id-numbers associated with that match, the Lastgeld object creates one of its inner classes (Firstname, Lastname, Country, Date or Harbor), depending on the index in which the keyword match is found. The id-numbers found in the indices are associated with these inner class objects. All of the inner class objects are stored in a list and given back to the Database object.
Then the Database object instantiates a Grapher object to store these objects as vertices in a WeightedGraph object, which is instantiated by the Grapher object. The Grapher object retrieves all vertices present in the WeightedGraph object and intersects the id-numbers of all possible unique combinations of vertices; for every pair with a non-empty intersection an edge is created, and the id-numbers the two vertices have in common are associated with the edge between them. The vertex with the most edges is then retrieved, and the id-numbers associated with every edge of that vertex are intersected again. This yields a list of results, which is passed back to the Database object. The Database object in turn passes it back to the WebFrontend. The WebFrontend instantiates a ConnectionPool object, which grants access to the MySQL database, and formulates one or more SQL queries based on the id-numbers returned by the Database object. The answers to these queries are passed back via WebFrontend's HTTP response parameter and the results are presented to the user.

6.4 Weighted graph data structure

Recall from section 3.4 that all keywords of the keyword query found in the indices become vertices in the graph, and the id-numbers associated with a keyword in the index are associated with the vertex of that keyword. An edge exists between a pair of vertices if the two vertices have at least one id-number in common; in that case an edge between the pair is created and the weight of the edge is the set of id-numbers the two vertices have in common. To be able to use the weighted graph data structure as described in Dale et al. [3], I needed to modify the data structure at its core. The edges in the weighted graph data structure presented by Dale et al. consist of single integer values; however, I needed the weights to be lists of integer values.
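Such list-valued edge weights can be sketched as follows. This is a simplified stand-in for the modified WeightedGraph, not the thesis code; vertex indexing and method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListWeightedGraph {
    // edges[i][j] holds the id-numbers vertices i and j have in common,
    // or null when there is no edge (no shared id-numbers).
    private final List<Integer>[][] edges;

    @SuppressWarnings("unchecked")
    ListWeightedGraph(int numVertices) {
        edges = new List[numVertices][numVertices];
    }

    // Create an edge between vertices i and j only if their id lists intersect.
    void connect(int i, int j, List<Integer> idsI, List<Integer> idsJ) {
        List<Integer> shared = new ArrayList<>(idsI);
        shared.retainAll(idsJ);       // intersection, as in section 6.5.2
        if (!shared.isEmpty()) {
            edges[i][j] = shared;
            edges[j][i] = shared;     // undirected graph
        }
    }

    List<Integer> weight(int i, int j) { return edges[i][j]; }

    public static void main(String[] args) {
        ListWeightedGraph g = new ListWeightedGraph(2);
        g.connect(0, 1, Arrays.asList(1198, 4765, 12726), Arrays.asList(4765, 857, 12726));
        System.out.println(g.weight(0, 1));
    }
}
```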
As a result I modified the data structure as depicted in figure 6.2.

Figure 6.2: weighted graph data structure

Note that the vertices are Graphable objects: they implement the Graphable interface I designed. Figure 6.1 shows that the inner classes of Lastgeld (Firstname, Lastname, Harbor, Country and Date) implement this interface.

6.5 Algorithms

The core functionality of the presented application can be summarized as: searching indices, intersecting id-numbers, finding combinations for intersection, and finding the vertex with the most edges. I describe these subjects in more detail in the next subsections.

6.5.1 Searching indices

I decided to store the indices as text files, which appeared to be the most straightforward implementation. As a result the indices are searched sequentially. Recall that every line in an index starts with a term followed by a sequence of id-numbers. The StringTokenizer object takes two parameters: the line currently read and a delimiter, which is a space in the indices I created.

while (inLine != null){
    tokenizer = new StringTokenizer(inLine, " ");
    term = tokenizer.nextToken();
    if (term.equals(keyword)){
        while (tokenizer.hasMoreTokens()){
            String token = tokenizer.nextToken();
            Integer id = Integer.valueOf(token);
            list.add(id);
        }
    }
    inLine = inFile.readLine();
}

6.5.2 Intersecting id-numbers

The id-numbers associated with each Graphable object are stored in an ArrayList. To intersect two ArrayLists containing id-numbers I use ArrayList's retainAll method, which an ArrayList object inherits from the AbstractCollection class. The source code is straightforward, but like many Java classes, this method makes use of other classes to get the job done. Since ArrayList is a Collection, the contains method is available for the actual intersection, and an Iterator object is used to loop through the items in the ArrayList.
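In the application's terms, intersecting two id lists with retainAll can be sketched like this; the method and variable names are mine, for illustration, and the id values are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Intersect {
    // Returns the id-numbers the two lists have in common,
    // leaving both input lists untouched.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>(a);  // copy, so 'a' is not modified
        result.retainAll(b);                        // keep only ids also present in 'b'
        return result;
    }

    public static void main(String[] args) {
        List<Integer> firstnameIds = Arrays.asList(1198, 4765, 12726);
        List<Integer> harborIds = Arrays.asList(4765, 857, 12726);
        // ids shared by both keyword matches
        System.out.println(intersect(firstnameIds, harborIds));
    }
}
```

Note that retainAll mutates the list it is called on, which is why the sketch copies the first list before intersecting.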
public boolean retainAll(Collection c) {
    boolean modified = false;
    Iterator<E> e = iterator();
    while (e.hasNext()) {
        if (!c.contains(e.next())) {
            e.remove();
            modified = true;
        }
    }
    return modified;
}

public boolean contains(Object o) {
    Iterator<E> e = iterator();
    if (o == null) {
        while (e.hasNext())
            if (e.next() == null)
                return true;
    } else {
        while (e.hasNext())
            if (o.equals(e.next()))
                return true;
    }
    return false;
}

Comment: retainAll calls the contains method shown below it. JDK source 5.0 [14]

6.5.3 Finding combinations

All vertices are stored in an ArrayList, so each vertex is accessible via an index of the ArrayList. To intersect all pairs of vertices and create edges, all possible combinations of ArrayList indices must be obtained. These combinations can be derived from the size of the ArrayList that contains the vertices; this size is the parameter of the following method I created:

private int[][] calculateCombinations(int c){
    int[][] combinations = new int[c*c/2][2];
    int store = 0;
    int number = c;
    for (int b = 1; b < number; b++){
        int y = b - 1;
        int i = 1;
        while (i != number - y){
            combinations[store][0] = i - 1;
            i = i + b;
            combinations[store][1] = i - 1;
            int x = b - 1;
            i = i - x;
            store++;
        }
    }
    return combinations;
}

For instance, if the int 3 is passed, the method returns the combinations 12, 23 and 13.
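An equivalent but arguably clearer way to enumerate all unordered pairs is a pair of nested loops; this is my own alternative sketch, not the thesis method.

```java
import java.util.Arrays;

public class Combinations {
    // Enumerate all unordered pairs (i, j) with 0 <= i < j < c.
    // For c vertices this yields exactly c*(c-1)/2 combinations.
    static int[][] pairs(int c) {
        int[][] result = new int[c * (c - 1) / 2][2];
        int store = 0;
        for (int i = 0; i < c; i++) {
            for (int j = i + 1; j < c; j++) {
                result[store][0] = i;
                result[store][1] = j;
                store++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // For 3 vertices: the pairs (0,1), (0,2) and (1,2)
        System.out.println(Arrays.deepToString(pairs(3)));
    }
}
```

Sizing the array as c*(c-1)/2 also avoids the unused trailing rows that the c*c/2 allocation in calculateCombinations leaves behind.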
6.5.4 Retrieving the vertex with the most edges

LinkedQueue edges;
Graphable item;
Graphable biggest = null;
int biggestSize = 0;
for (int i = 0; i < vertices.size(); i++){
    int size = 0;
    item = (Graphable) vertices.get(i);
    edges = graph.getToVertices(item);
    size = edges.size();
    if (size > biggestSize){
        biggestSize = size;
        biggest = item;
    }
}
return biggest;

The vertices are retrieved from the graph using the getToVertices method belonging to the graph object, part of WeightedGraph.java (Dale et al. [3]):

QueueInterface getToVertices(Graphable vertex){
    QueueInterface adjVertices = new LinkedQueue();
    int fromIndex;
    int toIndex;
    fromIndex = indexIs(vertex);
    for (toIndex = 0; toIndex < numVertices; toIndex++)
        if (edges[fromIndex][toIndex] != NULL_EDGE)
            adjVertices.enqueue(vertices[toIndex]);
    return adjVertices;
}

6.6 Java library classes

I used several classes available in the Java library to get several jobs done. In this section I list these classes and explain in what way I used them.

java.io.BufferedReader - I use the BufferedReader class to retrieve data from the text files that store the indices. A BufferedReader object has a method named readLine to read a text file line by line.

java.io.InputStreamReader - A BufferedReader object cannot read text by itself; it needs an InputStreamReader object to bridge from a byte stream to a character stream.

java.io.IOException - An IOException object is created if anything goes wrong in the process of reading text files.

java.io.PrintWriter - This class prints formatted representations of objects to a text-output stream. When a servlet (a Java web class) is called in a web browser, a request and a response object are sent to that servlet. The response object has a method named getWriter which delivers a PrintWriter object to the servlet. In turn this object is necessary to print output to the web browser.

java.util.ArrayList - I have used ArrayLists extensively when passing data from object to object. One advantage of an ArrayList is that it can store any kind of object. Another advantage is that an ArrayList dynamically allocates space, i.e. it increases in size if it becomes full and shrinks if objects are deleted from it.

java.util.Collections - I use the static sort method from the Collections class to sort the id-numbers associated with a Graphable object. The sorting algorithm is a modified version of mergesort.

java.util.StringTokenizer - A StringTokenizer cuts the lines read by the BufferedReader into single tokens.

java.net.URL - I use the URL class to define where the search application can find the indices.

java.net.URLConnection - A URLConnection object is returned when a URL object calls its method openConnection.

java.sql.Connection - To access the MySQL database there needs to be a connection to it. A Connection object establishes such a connection.

java.sql.Statement - An SQL statement is stored in a Statement object.

java.sql.ResultSet - SQL results are stored in a ResultSet object.

java.sql.SQLException - An SQLException object is created if anything goes wrong in the process of executing an SQL statement.

javax.servlet.http.HttpServlet - The WebFrontend class I wrote extends the HttpServlet class. By extending this class the WebFrontend class inherits all the methods from HttpServlet, so I profit from the methods already written there.

javax.servlet.http.HttpServletRequest - An HttpServletRequest object is created when a URL is directed at the WebFrontend class. I use this object to get the information sent in the header, for instance "web/Lastgeld?search_text=cornelis+vos" (part of the URL of a search request).

javax.servlet.http.HttpServletResponse - An HttpServletResponse object is necessary to retrieve a PrintWriter object. Recall that a PrintWriter object is necessary to print output to the web browser.

Table 6.1: Used Java library classes - Java 1.4.2 API specification [13]

6.7 Web deployment

6.7.1 Specification

Host - A slice of a 2.4 GHz processor and 512 MB of RAM running on a Linux 2.6.26 Xen VPS
Relational database - MySQL 5.1 Community Server [18]
Webserver - Apache Tomcat 5.5 [16]
Java Database Connectivity (JDBC) driver - Connector/J 5.1 [19]
Connection pool - A Java class for pre-allocating, recycling, and managing JDBC connections [17]
Java Runtime Environment - JRE 6 [20]

Table 6.2: web deployment specification

6.7.2 User interface

The user interface I created is shown in figure 6.3.

Figure 6.3: user interface search application

The interface is deliberately simple. The only function that needs a short introduction is the "verbose" option, which executes the search request while printing the results of some important intermediate steps that lead to a final query answer. (Warning: this option will flood your screen in some cases.)

6.7.3 Web address

I host the search application at the following web address: http://daniel.adixhosting.nl/web/lastgeld

6.7.4 The search application doesn't work

A lot can go wrong when one runs a not-so-thoroughly-tested database-driven Java web application on an external computer with very modest resources. Therefore I post my e-mail address: [email protected] I will gladly try to solve the problem at hand. Having expressed this, I end the chapter.

7. Epilogue

7.1 Conclusions

I presented a search application that enables keyword-based search in the available Lastgeld database. I employed several techniques to be able to retrieve meaningful answers to queries consisting of multiple keywords.
These answers are based on the following assumption: if the keywords in a keyword query are related in the relational database that contains the Lastgeld data, then retrieving this relating data yields results that are likely to be meaningful given the keyword query. My objective in this thesis was to show that keyword-based search in a relational database can yield meaningful results given the available Lastgeld data. I have presented several keyword queries and the results retrieved by the presented search application. These results show that although the control and exactness of SQL queries is lost, the results are likely to be meaningful given the available Lastgeld data. In the experiment I observed that the presented search application can be demanding in some cases. Based on the experiments I conclude that keyword query execution time increases as the intersection work that needs to be performed by the search application increases. This work depends on the number of keywords in the query and the number of id-numbers associated with these keywords in the indices. I also conclude that the more SQL queries the search application proposes to be executed by the relational database, the more the overall query execution time increases.

7.2 Discussion

The presented search application demonstrates one of many possible solutions for keyword-based search in a relational database. However, the presented solution is mainly designed for the data available. I used only two tables to show how relations can be combined in a graph data structure. From the related work I describe in chapter five it is clear that the graph data structure is employed in very different ways. It appears to be more common to designate an entire tuple to be a vertex, instead of designating a field in a tuple to be a vertex the way I did. In that approach, edges between vertices represent relations between tables.
Finding relating data then happens at the scale of a graph model of relating tables, not at the scale of a graph model of relating data within tuples, the way I have used the graph data structure. My approach seems far more difficult to scale to many tables, while solutions based on an entire tuple being a vertex connecting to other vertices that are tuples in other tables appear to be a more generalizable approach for modeling data stored in an undefined number of tables. I designed the presented search application to work with the available data, which does not have a rich relational structure in the sense of many relating tables. The experiment indicates that the presented search application can be increasingly demanding in terms of query processing. The way id-lists are intersected to establish edges between vertices in the graph data structure can be time-consuming in some cases, as can issuing multiple queries to the relational database after the search application has finished processing. Although efficiency can be improved, it is unclear whether these inherently demanding constructs are acceptable in practice. By trading query syntax like SQL for just some keywords, control over retrieving exact answers is lost. The spaces between keywords do not necessarily imply a relation between keywords, but they do not imply there isn't any relation either. I chose to find answers to multiple keywords in a query by treating all the keywords as a related phrase based on the available data. This yields meaningful results, but it remains relative to the interpretation of the meaning of spaces between keywords.

7.3 Proposals for future work

Efficiency can be improved by revising the index structure. Currently, every index is an alphabetical list per database field; the first-name index, for example, alphabetically lists all first names from A to Z.
For every letter in the alphabet, this index can be partitioned into sub-indices. The appropriate sub-indices can then be retrieved given the first letters of the keywords, making the sequential search process less demanding in terms of processing. To avoid the execution of the 5600 SQL queries a keyword query "Germany" initiates, it is possible to generate just one query combining the id-numbers, for instance: SELECT * FROM Skippers WHERE id='1' OR id='2' OR id='3'. (Note that the conditions must be joined with OR, since a single row cannot match several different ids.) Still, in the case of the keyword query "Germany" this approach will yield a very long query. Another solution to this problem is to take alternative steps for keyword queries that contain only one keyword, because the search application will likely propose many SQL queries in that case. A possible scenario is to avoid the search application entirely and search the indices separately for this keyword. If it is known that the keyword occurs in index A, then a more appropriate SQL query can be formulated given this knowledge. The retrieval of meaningful results may be improved by traversing the employed weighted graph data structure. Currently the vertex with the most edges is retrieved; this approach retrieves the tuples of data in which all keywords occur in the same tuple. If the graph were traversed starting at different vertices, different kinds of relating data could be retrieved. This way and, or and not semantics could be introduced into the application. An order of importance could also be made between vertices that have one edge, vertices that have two edges, and so on. However, to be able to evaluate the benefit of any functional variation within the presented application, it is necessary to assess the effectiveness of the application based on actual information needs as well.

References

[1] B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, P. Parag, S. Sudarshan. BANKS: browsing and keyword searching in relational databases.
In Proceedings of the 28th International Conference on Very Large Databases, 2002. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1290000/1287473/p1083aditya.pdf?key1=1287473&key2=0346186421&coll=ACM&dl=ACM&CFID=42949402&CFTOKEN=97092769, visited on May 14th, 2009.

[2] S. Cohen, J. Mamou, Y. Kanza, Y. Sagiv. XSearch: a semantic search engine for XML. In Proceedings of the 29th International Conference on Very Large Databases, Vol. 29, 2003. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1320000/1315457/p45cohen.pdf?key1=1315457&key2=8557427421&coll=ACM&dl=ACM&CFID=43812128&CFTOKEN=51392105, visited on June 28th, 2009.

[3] N. Dale, D.T. Joyce, C. Weems. Object-Oriented Data Structures Using Java. ISBN 0-7637-1079-2. Jones and Bartlett Publishers International, London, 2002.

[4] B.B. Dalvi, M. Kshirsagar, S. Sudarshan. Keyword search on external memory graphs. In Proceedings of the VLDB Endowment, Vol. 1, Issue 1, 2008. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1460000/1453982/p1189dalvi.pdf?key1=1453982&key2=2831696421&coll=ACM&dl=ACM&CFID=42949402&CFTOKEN=97092769, visited on July 3rd, 2009.

[5] D. Florescu, D. Kossmann, I. Manolescu. Integrating keyword search into XML query processing. In Computer Networks, Vol. 33, 2000. http://www.sciencedirect.com.proxy-ub.rug.nl/science?_ob=MImg&_imagekey=B6VRG-40B2JGR-C11&_cdi=6234&_user=4385132&_orig=search&_coverDate=06%2F30%2F2000&_sk=999669998&view=c&wchp=dGLbVlW-zSkWA&md5=c122eafe4481c853b1287351178b8472&ie=/sdarticle.pdf, visited on May 17th, 2009.

[6] V. Hristidis, Y. Papakonstantinou. Discover: keyword search in relational databases. In Proceedings of the 28th International Conference on Very Large Databases, 2002. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1290000/1287427/p670hristidis.pdf?key1=1287427&key2=0686186421&coll=ACM&dl=ACM&CFID=42949402&CFTOKEN=97092769, visited on May 15th, 2009.

[7] V. Hristidis, L. Gravano, Y. Papakonstantinou. Efficient IR-style keyword search over relational databases.
In Proceedings of the 29th International Conference on Very Large Databases, Vol. 29, 2003. http://portal.acm.org.proxy-ub.rug.nl/citation.cfm?id=1453856.1453887&coll=ACM&dl=ACM&CFID=43719809&CFTOKEN=29707051, visited on May 15th, 2009.

[8] L. Lanzani. Discrete Mathematics, chapter 11, Graph Theory, 2008. http://comp.uark.edu/~lanzani/2103NOTES/11.1-11.2.pdf, visited on July 6th, 2008.

[9] G. Li, B.C. Ooi, J. Feng, J. Wang, L. Zhou. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1380000/1376706/p903li.pdf?key1=1376706&key2=0314196421&coll=ACM&dl=ACM&CFID=42949402&CFTOKEN=97092769, visited on May 21st, 2009.

[10] F. Liu, C. Yu, W. Meng, A. Chowdhury. Effective keyword search in relational databases. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. http://delivery.acm.org.proxy-ub.rug.nl/10.1145/1150000/1142536/p563liu.pdf?key1=1142536&key2=1057366421&coll=ACM&dl=ACM&CFID=43719809&CFTOKEN=29707051, visited on May 21st, 2009.

[11] C.D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. ISBN 978-0-521-86571-5. Cambridge University Press, New York, 2008.

[12] L. Paoletti. Leonard Euler's solution to the Königsberg bridge problem. http://mathdl.maa.org/mathDL/46/?pa=content&sa=viewDocument&nodeId=1310&bodyId=1452, visited on July 6th, 2009.

[13] Sun.com. Java 2 Platform, Standard Edition, v 1.4.2 API Specification. http://java.sun.com/j2se/1.4.2/docs/api/overview-summary.html, visited on July 18th.

[14] Sun.com. JDK Source 5.0. https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=J2SE-1.5.0-OTH-G-F@CDS-CDS_Developer&ProductUUID=.59IBe.oWd4AAAEZZvkZK4O9&ProductID=.59IBe.oWd4AAAEZZvkZK4O9&Origin=ViewProductDetail-Start (SDN registration required).

[15] G.M. Welling.
The Prize of Neutrality: Trade Relations between Amsterdam and North America 1771-1817, 1998. http://dissertations.ub.rug.nl/FILES/faculties/arts/1998/g.m.welling/thesis.pdf, visited on August 11th, 2009.

Software used in the implementation as described in sections 6.2 and 6.7.1:

[16] Apache Tomcat 6.0. http://tomcat.apache.org/ (open source).
[17] ConnectionPool.java. http://archive.coreservlets.com/coreservlets/ConnectionPool.java (freely available).
[18] MySQL Community Server 5.0. http://dev.mysql.com/downloads/ (open source).
[19] MySQL Connector/J 5.1. http://dev.mysql.com/downloads/connector/j/5.1.html (open source).
[20] Sun. Java Runtime Environment 6. http://java.sun.com/javase/downloads/index.jsp (freely available).

Appendix A: creating indices in Perl

# Step 1 Creating indices
# Name: step1.pl
# Author: Daniel Suelmann
# Effect: Deletes punctuation of a SQL-dump.

use strict;

my $readfile = shift(@ARGV);
chomp($readfile);
open(FILE, $readfile) or die "Cannot open $readfile: $!";
while (<FILE>) {
    my $line = $_;
    chomp($line);
    $line =~ s/[[:punct:]]//g;
    open(RESULT, ">>inputtostep2.txt") or die "Cannot open $!";
    print RESULT "$line \n";
}
close RESULT;

---Result:
1198 AART
4765 ABE
12726 ABE
4790 ABE
1567 ABE
...

# Step 2 Creating indices
# Name: step2.pl
# Author: Daniel Suelmann
# Effect: Puts the id-numbers after the index term.

use strict;

my $numbers;
my $readfile = shift(@ARGV);
chomp($readfile);
open(FILE, $readfile) or die "Cannot open $!";
while (<FILE>) {
    my $line = $_;
    chomp($line);
    $line =~ s/([0-9]+)//g;
    $numbers = $1;
    open(RESULT, ">>inputtostep3.txt") or die "Cannot open: $!";
    print RESULT "$line $numbers\n";
}
close RESULT;

---Result:
AART 1198
ABE 4765
ABE 12726
ABE 4790
ABE 1567
...

# Step 3 Creating indices
# Name: step3.pl
# Author: Daniel Suelmann
# Effect: Puts every similar index term on the same line.
use strict;

my $prevline = "";
my $string = "";
my $numbers;
my $readfile = shift(@ARGV);
chomp($readfile);
open(FILE, $readfile) or die "Cannot open $readfile: $!";
while (<FILE>) {
    my $line = $_;
    chomp($line);
    $line =~ /([A-Z]+)/;
    if ($prevline eq $1) {
        open(RESULT, ">>inputfinalstep.txt") or die "Cannot open: $!";
        print RESULT "$line";
    }
    else {
        open(RESULT, ">>inputfinalstep.txt") or die "Cannot open: $!";
        print RESULT "\n$line";
    }
    $line =~ /([A-Z]+)/;
    $prevline = $1;
}
close RESULT;

---Result:
AART 1198 <new line>
ABE 4765 ABE 12726 ABE 4790 ABE 1567 ABE 1246 ABE BR 6900 ABE BR 12471 ABE BR 11380 ABE BR 857 ABE BROER 5986 ABE BROERSZ 9008 <new line>
ABEL 6545 ABEL BAS 3553 <new line>
ABLBERT 7581 ABLBERT J 10065 <new line>
ABRAHAM 3410 ABRAHAM 4434 ABRAHAM 6303 ABRAHAM 5806 ABRAHAM 9355 ABRAHAM 9064 ABRAHAM 6687 ABRAHAM 8444 ABRAHAM 9828 <new line>

# Step 4 Creating indices
# Name: step4.pl
# Author: Daniel Suelmann
# Effect: Deletes all the terms except the term that starts the line.

use strict;

my $string;
my $readfile = shift(@ARGV);
chomp($readfile);
open(FILE, $readfile) or die "Cannot open $readfile: $!";
while (<FILE>) {
    my $line = $_;
    chomp($line);
    $line =~ /([A-Z ]+)/;
    $string = $1;
    $line =~ s/([A-Z ]+)/ /g;
    open(RESULT, ">>resultsfinal.txt") or die "Cannot open: $!";
    print RESULT $string;
    print RESULT $line . "\n";
}
close RESULT;

---Result:
AART 1198
ABE 4765 12726 4790 1567 1246 6900 12471 11380 857 5986 9008
ABEL 6545 3553
ABLBERT 7581 10065
ABRAHAM 3410 4434 6303 5806 9355 9064 6687 8444 9828

Appendix B: search application source code in Java

Note that I do not try to catch exceptions. Each search request restarts the application, so if an exception occurs, it affects that one search request only. The application throws the exception out to the web server, which in turn prints the exception in a readable and understandable format in the web browser of the user.
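The index files produced by step 4 are plain text, one line per term: the term first, then the id-numbers of every matching record. The following minimal, self-contained sketch illustrates how such a line-based index can be looked up for a keyword and how two id-lists can be intersected with ArrayList's retainAll, the same operation the application in Appendix B relies on. The class and method names here (IndexSketch, lookup, intersect) are illustrative only and do not appear in the application code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IndexSketch {

    // Scan index lines of the form "TERM id1 id2 ..." and collect the
    // id-numbers of every line whose term matches the keyword.
    static List<Integer> lookup(List<String> indexLines, String term) {
        List<Integer> ids = new ArrayList<Integer>();
        for (String line : indexLines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length > 1 && tokens[0].equalsIgnoreCase(term)) {
                for (int i = 1; i < tokens.length; i++) {
                    ids.add(Integer.valueOf(tokens[i]));
                }
            }
        }
        return ids;
    }

    // Intersect two id-lists via retainAll; the copy keeps 'a' intact.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<Integer>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList(
            "AART 1198",
            "ABE 4765 12726 4790 1567");
        System.out.println(lookup(index, "abe"));  // [4765, 12726, 4790, 1567]
        System.out.println(intersect(
            lookup(index, "abe"),
            Arrays.asList(12726, 1567, 9999)));    // [12726, 1567]
    }
}
```

Because retainAll preserves the order of the receiving list, the intersection result keeps the id-order of the first operand; the application sorts its id-lists once when reading an index, so this order is stable.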
/* Name: Database.java
 * Author: Daniël Suelmann
 * Effect:
 * main scenario:
 * 1. Put the keywords in a list;
 * 2. See whether it matches the table fields;
 * 3. Receive the matches;
 *    3.1 if the matches are from a single
 *        keyword, the job is done.
 *    3.2 if not, continue.
 * 4. Put the matches in a graph.
 * 5. Add edges -between vertices- to the
 *    graph based on id resemblance.
 * 6. Find out which vertex has the most edges:
 *    this is probably the most important one.
 * 7. Intersect the id-numbers associated with
 *    each edge; the id-numbers are the result
 *    id-numbers; these id-numbers are given
 *    back to the servlet that called this method
 *    from within the web environment.
 */
package standalone;

import java.io.IOException;
import java.util.ArrayList;

public class Database {

    private Grapher grapher = new Grapher();

    public static ArrayList<ArrayList<Graphable>> dispatch(String[] array) throws IOException {
        /* effect: creates an instance of Lastgeld
         * for each keyword in the arguments array.
         * effect: dispatches the keywords in requests
         * to the text files.
         * incoming collaboration: receives an array
         * of keywords from Database's main.
         * outgoing collaboration: sends a single
         * keyword to Lastgeld's request.
         * incoming collaboration: receives a list
         * with object references from Lastgeld's request.
         * outgoing collaboration: sends a list of lists
         * with object references to Database's main.
         */
        ArrayList<ArrayList<Graphable>> list = new ArrayList<ArrayList<Graphable>>();
        for (int i = 0; i < array.length; i++) {
            Lastgeld lg = new Lastgeld();
            list.add(lg.request(array[i]));
        }
        return list;
    }

    public ArrayList<Integer> getResult(String[] str) throws IOException {
        ArrayList<ArrayList<Graphable>> keywords;
        ArrayList<Integer> result = new ArrayList<Integer>();
        String[] input = str;
        keywords = dispatch(input);
        /* If there's only one keyword entered there's no
         * need to set up a graph.
*The triple loop gets lists in the list (loop 1), *these lists contain objects (loop 2), these *objects have (first name, last name, harbor, *country) and the last loop gets id-numbers from *within the objects. * */ if (input.length == 1){ if(keywords != null){ for (int i = 0; i < keywords.size(); i++){ ArrayList<Graphable> list = (ArrayList<Graphable>) keywords.get(i); if(list != null){ for(int j = 0; j < list.size(); j++){ Graphable item = (Graphable) list.get(j); if(item != null){ ArrayList<Integer> ids = (ArrayList<Integer>)item.getIDs(); if (ids !=null){ for(int k = 0; k < ids.size(); k++){ result.add(ids.get(k)); } } else{ System.out.println("No, id-numbers in the list"); } } } } else{ System.out.println("emptylist"); } } } else { System.out.println("There are no objects to start with"); } } 37 /* If there are more than one keyword * entered; a graph is created; * The graph is populated by inserting * the keyword objects(vertices) into * the graph. The edges in between two * vertices are created; The edges are * printed (for my administration * output is returned to the Tomcat server * console); And finally, the result set * is determined. */ else { if (!keywords.isEmpty()){ grapher.populate(keywords); grapher.addEdges(); grapher.printEdgesToConsole(); result = (ArrayList<Integer>) grapher.findResult(); } } return result; } public ArrayList <Graphable> getVertices(){ ArrayList <Graphable> list = grapher.getVertices(); return list; } public ArrayList <Graphable> getIntersections(){ ArrayList <Graphable> list = grapher.getIntersections(); return list; } public WeightedGraph getGraph(){ WeightedGraph graph = grapher.getGraph(); return graph; } public ArrayList<Object> getCombinedIntersections(){ ArrayList<Object> list = grapher.getCombinedIntersections(); return list; } /* Comment on the reset methods: * In between search requests all objects must * be deleted otherwise data structures are * flooded with mixed result data. 
*/ public void resetCombinedIntersections(){ grapher.resetCombinedIntersections(); } public void resetIntersections(){ grapher.resetIntersections(); } public void resetGraph(){ grapher.resetGraph(); } public void resetVertices(){ grapher.resetVertices(); 38 } public void resetListOfLists(){ grapher.resetListOfLists(); } } /* Name: Lastgeld.java * Author: Daniël Suelmann * Effect: * Every Lastgeld object instantiates five * Connector objects, one for every index * stored on disk. The Connector objects return * lists with matching id-numbers. Per match * an inner class object is instantiated * depending on which index matched the * keyword. All object references are gathered * and send back to Database. */ package standalone; import java.io.IOException; import java.util.ArrayList; public class Lastgeld { //inner class protected class Firstname implements Graphable{ protected String content; protected final String type = "lastgeldfirstname"; protected ArrayList<Integer> ids; public String toString(){ return content; } public ArrayList<Integer> getIDs(){ return ids; } public String getType(){ return this.type; } public boolean isEqual(Graphable object){ if (object.getType().equals(this.type) && object.toString().equals(this.content)){ return true; } else return false; } } //inner class protected class Lastname implements Graphable{ protected String content; protected final String type = "lastgeldlastname"; protected ArrayList<Integer> ids; public String toString(){ return content; } public ArrayList<Integer> getIDs(){ return ids; } 39 public String getType() { return this.type; } public boolean isEqual(Graphable object){ if (object.getType().equals(this.type) && object.toString().equals(this.content)) return true; else return false; } } //inner class protected class Harbor implements Graphable{ protected String content; protected final String type = "lastgeldharbor"; protected ArrayList<Integer> ids; public String toString(){ return content; } public ArrayList<Integer> 
getIDs() { return ids; }

        public String getType() { return this.type; }

        public boolean isEqual(Graphable object) {
            if (object.getType().equals(this.type) && object.toString().equals(this.content))
                return true;
            else
                return false;
        }
    }

    //inner class
    protected class Date implements Graphable {
        protected String content;
        protected final String type = "lastgelddate";
        protected ArrayList<Integer> ids;

        public String toString() { return content; }

        public ArrayList<Integer> getIDs() { return ids; }

        public String getType() { return this.type; }

        public boolean isEqual(Graphable object) {
            if (object.getType().equals(this.type) && object.toString().equals(this.content))
                return true;
            else
                return false;
        }
    }

    //inner class
    protected class Country implements Graphable {
        protected String content;
        protected final String type = "lastgeldcountry";
        protected ArrayList<Integer> ids;

        public String toString() { return content; }

        public ArrayList<Integer> getIDs() { return ids; }

        public String getType() { return this.type; }

        public boolean isEqual(Graphable object) {
            if (object.getType().equals(this.type) && object.toString().equals(this.content))
                return true;
            else
                return false;
        }
    }

    protected static Firstname firstname;
    protected static Lastname lastname;
    protected static Harbor harbor;
    protected static Date date;
    protected static Country country;

    // constructor of the table
    public Lastgeld() {
        /* effect: Instantiates Lastgeld's inner
         * objects Firstname, Lastname, Date,
         * Harbor, Country and points references
         * to them.
         */
        firstname = new Firstname();
        lastname = new Lastname();
        harbor = new Harbor();
        date = new Date();
        country = new Country();
    }

    private static ArrayList<Graphable> populate(ArrayList<ArrayList<Integer>> lists, String keyword) {
        /* effect: attaches id-numbers to the instance
         * variables of Lastgeld's inner classes.
         * effect: attaches the keyword to the instance
         * variables of Lastgeld's inner classes.
         * incoming collaboration: receives a keyword
         * from Lastgeld's request.
* incoming collaboration: receives a list of id * lists from Lastgeld's request. * outgoing collaboration: sends a list with * object references to the fields to Lastgeld's * request. */ ArrayList<Graphable> objectrefs = new ArrayList<Graphable>(); ArrayList<Integer> list0 = (ArrayList<Integer>) lists.get(0); ArrayList<Integer> list1 = (ArrayList<Integer>) lists.get(1); ArrayList<Integer> list2 = (ArrayList<Integer>) lists.get(2); ArrayList<Integer> list3 = (ArrayList<Integer>) lists.get(3); 41 ArrayList<Integer> list4 = (ArrayList<Integer>) lists.get(4); if(!list0.isEmpty()){ firstname.content = keyword; firstname.ids = (ArrayList<Integer>) lists.get(0); objectrefs.add(firstname); } if(!list1.isEmpty()){ lastname.content = keyword; lastname.ids = (ArrayList<Integer>) lists.get(1); objectrefs.add(lastname); } if(!list2.isEmpty()){ harbor.content = keyword; harbor.ids = (ArrayList<Integer>) lists.get(2); objectrefs.add(harbor); } if(!list3.isEmpty()){ date.content = keyword; date.ids = (ArrayList<Integer>) lists.get(3); objectrefs.add(date); } if(!list4.isEmpty()){ country.content = keyword; country.ids = (ArrayList<Integer>) lists.get(4); objectrefs.add(country); } return objectrefs; } public static ArrayList<Graphable> request(String s) throws IOException{ /* effect dispatches calls to the connector to * access the text files. * incoming collaboration: receives a keyword * form Database's dispatch. * incoming collaboration: receives lists of * id-numbers (associated with a single keyword) * from Connectors read method. * outgoing collaboration: sends a list of id-lists * to Lastgeld's populate. 
* outgoing collaboration: sends the list with
         * object references to Database's dispatch.
         */
        ArrayList<ArrayList<Integer>> lists = new ArrayList<ArrayList<Integer>>();
        ArrayList<Integer> list0 = null;
        ArrayList<Integer> list1 = null;
        ArrayList<Integer> list2 = null;
        ArrayList<Integer> list3 = null;
        ArrayList<Integer> list4 = null;
        String keyword = s;

        Connector con0 = new Connector("vnfreq.txt");
        list0 = (ArrayList<Integer>) con0.read(s);
        Connector con1 = new Connector("anfreq.txt");
        list1 = (ArrayList<Integer>) con1.read(s);
        Connector con2 = new Connector("havenfreq.txt");
        list2 = (ArrayList<Integer>) con2.read(s);
        Connector con3 = new Connector("datefreq.txt");
        list3 = (ArrayList<Integer>) con3.read(s);
        Connector con4 = new Connector("countryfreq.txt");
        list4 = (ArrayList<Integer>) con4.read(s);

        lists.add(list0);
        lists.add(list1);
        lists.add(list2);
        lists.add(list3);
        lists.add(list4);
        return populate(lists, keyword);
    }
}

/* Name: Connector.java
 * Author: Daniël Suelmann
 * Effect:
 * Every Lastgeld object instantiates five Connector objects,
 * one for every index stored on disk, e.g. if a keyword
 * query of three keywords is entered, three Lastgeld objects
 * are created and thus in total 15 Connector objects are created;
 * each Connector object is associated with one keyword and
 * sequentially searches one index.
 */
package standalone;

import java.net.URL;
import java.net.URLConnection;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;
import java.util.ArrayList;
import java.util.Collections;

public class Connector {

    String filename;
    String keyword;

    public Connector(String fn) {
        this.filename = fn;
    }

    public ArrayList<Integer> read(String keyword) throws IOException {
        /* effect: Opens up a file and checks line by line
         * if there's a match with the keyword. If there
         * is a match the id-numbers associated with that
         * match are stored in a list.
This results in
         * a list of id-numbers of matches, accumulated
         * over all the lines in the files.
         * incoming collaboration: receives a file name
         * from Connector's constructor.
         * incoming collaboration: receives a single keyword
         * from Lastgeld's request.
         * outgoing collaboration: sends a list of id-numbers
         * to Lastgeld's request.
         */
        URL url = new URL("http://localhost:8080/web/" + filename);
        URLConnection urlConnection = url.openConnection();
        urlConnection.connect();

        String inLine = null;
        String word;
        StringTokenizer tokenizer;
        ArrayList<Integer> list = new ArrayList<Integer>();
        BufferedReader inFile;
        inFile = new BufferedReader(new InputStreamReader(url.openStream()));
        inLine = inFile.readLine();
        while (inLine != null) {
            tokenizer = new StringTokenizer(inLine, " ");
            word = tokenizer.nextToken();
            word = word.toLowerCase();
            if (word.equals(keyword)) {
                while (tokenizer.hasMoreTokens()) {
                    String token = tokenizer.nextToken();
                    token = token.replaceAll("\\D", "");
                    if (!token.equals("")) {
                        Integer id = Integer.valueOf(token);
                        list.add(id);
                    }
                }
            }
            inLine = inFile.readLine();
        }
        Collections.sort(list);
        return list;
    }
}

/* Name: Grapher.java
 * Author: Daniël Suelmann
 * Effect:
 * The effect of this class is explained
 * in detail at the level of each method.
 */
package standalone;

import java.util.ArrayList;

public class Grapher {

    private static ArrayList<ArrayList<Graphable>> list;
    private static WeightedGraph graph = new WeightedGraph();
    private static ArrayList<Graphable> vertices;
    private static ArrayList<Graphable> intersections = null;
    private static ArrayList<Object> combinedintersections = null;

    public void populate(ArrayList<ArrayList<Graphable>> keywords) {
        /* effect: extracts the references in each list and adds
         * them to the graph.
         * incoming collaboration: receives a list of lists from
         * Database's main.
         * outgoing collaboration: sends object references to
         * WeightedGraph's addVertex.
*/ list = keywords; for (int i = 0; i < list.size(); i++){ ArrayList<Graphable> l = (ArrayList<Graphable>)list.get(i); for(int j = 0; j < l.size(); j++){ graph.addVertex((Graphable) l.get(j)); } } } 44 public void addEdges(){ /* effect: Based on the vertices currently present, * it initiates the process of finding and adding * the edges to the graph. * incoming collaboration: is called by Database's * getResults. * incoming collaboration: receives a list with match * descriptives */ findEdges(); } public static void findEdges(){ /* effect: compare the vertices based on all possible * combinations. * explanation: if there are N vertices compare the * id-numbers of the vertices in all possible * combinations. For instance: if there are three * vertices made based on keyword hits, then combinations * 1 2, 2 3 and 1 3 are possibilities. These combinations * are used as indices for the vertex list, 1 2 turns * into 0 1 etc. Then the ArrayLists of id-numbers * associated with an vertex object -based on the * combinations- is intersected to sort out all vertices * with identical id-numbers. (The intersection is done by * the ArrayList method retainAll. If identical id-numbers * are found, the ArrayList which contains matches is added * an ArrayList, so this would be a list of possible one or * more lists. It returns this list to Grapher's addEdges. * incoming collaboration: works with an ArrayList of vertices * (Grapher instance variable) from WeightedGraph's retrieves. * outgoing collaboration: passes the number of vertices to * Grapher's combinationCalculator. * incoming collaboration: receives an Two-dimensional array * with all possible combinations given the vertices as ints. 
* outgoing collaboration: sends a list -possible empty or * may contain one or more lists- to Grapher's addEdges */ int[][]combinations; vertices = graph.retrieve(); combinations = calculateCombinations(vertices.size()); ArrayList<Integer> idsobject0; ArrayList<Integer> idsobject1; if (intersections == null){ intersections = new ArrayList(); } if (combinedintersections == null){ combinedintersections = new ArrayList(); } for (int i = 0; i < combinations.length; i++){ if (!(combinations[i][0] == 0 && combinations[i][1] == 0)){ Graphable object0 = (Graphable) vertices.get(combinations[i][0]); Graphable object1 = (Graphable) vertices.get(combinations[i][1]); intersections.add(object0); intersections.add(object1); idsobject0 = object0.getIDs(); idsobject1 = object1.getIDs(); ArrayList<Integer> clone; clone = (ArrayList<Integer>)idsobject0.clone(); 45 clone.retainAll(idsobject1); if(!clone.isEmpty()){ if(!object0.toString().equals(object1.toString())){ graph.addEdge(object0, object1, clone); } combinedintersections.add(object0.getType()); combinedintersections.add(object0.toString()); combinedintersections.add(object1.getType()); combinedintersections.add(object1.toString()); combinedintersections.add(clone); } } } } private static int[][] calculateCombinations(int c){ /*effect finds combinations based on a fixed *value (parameter c); for instance if the *int 3 is passed it finds out that 3 consists *of combinations 12 23 and 13. To serve Grapher's *findEdges it changes these combinations in array *index format, 12 turn into 01, 13 turns into 02 etc. *outgoing collaboration: sends a two-dimensional *array with all possible combinations given the *vertices as ints to findEdges. 
*/ int[][] combinations = new int[c*c/2][2]; int store = 0; int number = c; for (int b = 1; b < number; b++ ){ int y = b - 1; int i = 1; while (i != number-y){ combinations[store][0] = i-1; i = i + b; combinations[store][1] = i-1; int x = b - 1; i = i - x ; store++; } } return combinations; } private static void printCombinationsToConsole(ArrayList<Object> l){ /* effect: prints some additional information about * the combinations checked and their matches found. * This will not be visible to any user, printed to * the webserver's console. * incoming collaboration: receives an ArrayList with * data from grapher's addEdges. * outgoing collaboration: prints all findings to the * server console. */ int count = 0; int total = 1; for (int i = 0; i < l.size(); i++){ if (count == 5){ System.out.print("\n " + total); count = 0; total++; } System.out.print(" " + l.get(i)); count++; 46 } System.out.print("\n"); } public void printEdgesToConsole(){ /* effect: prints some additional information about the * edges present in the graph. This will not be visible * to any user, printed to the webserver's console. * incoming collaboration: uses Grapher's ArrayList * instance variable vertices to get to the vertices * present in the graph. * incoming collaboration: receives -for each vertex in * the graph- a LinkedQueue of vertices that are adjacent. * outgoing collaboration: prints all findings to the * server console. 
*/ for (int i = 0; i < vertices.size(); i++){ Graphable object; object = (Graphable) vertices.get(i); System.out.print(object.getType()+ ", "); //type System.out.print(object.toString() + ", "); //content //System.out.println(object.getIDs()+ "\n"); //ID's QueueInterface queue; queue = graph.getToVertices(object); if (!queue.isEmpty()){ System.out.print("has edges: "); int count = 1; while(!queue.isEmpty()){ Graphable item = (Graphable) queue.dequeue(); System.out.print(" " + count++ + ": " + item.getType()); System.out.print(", " + item); } System.out.print("\n"); } else { System.out.println("no edges."); } } } private Graphable findStart(){ /* effect: finds the vertex with the most edges. * explanation: the vertex with the most edges is * the vertex with the strongest relation to the other * keywords, i.e. this is the vertex that's 'likely' relevant. * incoming collaboration: receives a queue with edges from * WeightedGraphs getToVertices. * outgoing collaboration: send the vertex with the most * edges/vertices to Grapher's findResult */ QueueInterface edges; Graphable item; Graphable biggest = null; int biggestSize = 0; for(int i = 0; i < vertices.size(); i++){ int size = 0; item = (Graphable) vertices.get(i); edges = graph.getToVertices(item); size = edges.size(); if(size > biggestSize){ biggestSize = size; biggest = item; } } return biggest; 47 } public ArrayList<Integer> findResult(){ /* effect: dispatches two functions: findstart * returns the vertex with the most vertices. * connectEdges perform the final intersection of * the id-numbers associated with the edges. * the result is a list of final result id-numbers. * incoming collaboration: is called by Database's * getResults. * incoming collaboration: receives an ArrayList with * result id-numbers from Grapher's connect edges. * outgoing collaboration: sends back the ArrayList with * result id-numbers to Database's getResult. 
*/ ArrayList <Integer> result; Graphable startPoint = findStart(); result = connectEdges(startPoint); return result; } private static ArrayList<Integer> connectEdges(Graphable startVertex){ /* effect: * intersects all edges of the given argument * startVertex, which is the vertex with the * most edges to other vertices. * explanation: up until now there is a graph * that holds x vertices that are connected in * some way. What is done here, is intersecting * all the id-numbers that are associated with * all the edges of a particular vertex. * What makes a relation between certain vertices * is the fact that an edge between vertex a & b * holds the same or a subset of the id-numbers * between vertices a & c or a & d. * incoming collaboration: receives a Graphable * object that becomes the start of the traversal. * incoming collaboration: receives a LinkedQueue * with all the edges of startVertex (argument) * by WeightedGraph's getToVertices. * incoming collaboration: receives an ArrayList * with the id-numbers associated with an edge * between the startVertex and a vertex connected * to the startVertex. * outgoing collaboration: sends an ArrayList of * ints to Grapher's findResult. 
*/ QueueInterface edges = graph.getToVertices(startVertex); ArrayList<ArrayList<Integer>> lists = new ArrayList<ArrayList<Integer>>(); ArrayList<Integer>sim; Graphable item; while(!edges.isEmpty()){ item = (Graphable) edges.dequeue(); sim = graph.holdSimilarities(startVertex, item); if (!sim.isEmpty()){ lists.add(sim); } } ArrayList<Integer> result = null; if (!lists.isEmpty()){ result = (ArrayList<Integer>) lists.get(0); for(int i = 1; i < lists.size();i++){ 48 result.retainAll((ArrayList<Integer>) lists.get(i)); } return result; } else { ArrayList <Integer> empty = return empty; new ArrayList<Integer>(); } } public ArrayList <Graphable> getVertices(){ return vertices; } public WeightedGraph getGraph(){ return graph; } public ArrayList<Graphable> getIntersections(){ return intersections; } public ArrayList<Object> getCombinedIntersections(){ return combinedintersections; } /* * * * The following resetters are necessary due to the fact that objects persist in between search requests of the Webfrontend class. */ public void resetIntersections(){ intersections = null; } public void resetCombinedIntersections(){ combinedintersections = null; } public void resetListOfLists(){ list = null; } public void resetVertices(){ vertices = null; } public void resetGraph(){ graph.reset(); } } 49 /* Name: Graphable.java * Author: Daniël Suelmann /* Effect: This is an interface class. * It this describes abstract methods * that must be implemented by all classes * implementing this interface. * This objects are instances of Lastgeld's * inner classes; Firstname,Lastname, Harbor, * Country and Date. * */ package standalone; import java.util.ArrayList; public interface Graphable { public abstract String toString(); //effect: implementation prints the content variable of a graphable object. public abstract ArrayList<Integer> getIDs(); //effect: implementation gets the id-numbers associated with a graphable object. 
public abstract String getType(); //effect: implementation gets the type of a graphable object. public abstract boolean isEqual(Graphable object); //effect: implementation compares to objects of the graphable type. } /* Name: WeightedGraph.java * This class is a slightly modified version of the * WeightedGraph data structure presented in the book * Object-oriented data structures using Java by * Dale et al.[3] * The modifications: * The values associated with the edges were initially * ints. For the application I need edges that store * multiple ints representing the id-numbers that are * shared by two vertices. Since I use ArrayLists * throughout the application, I also changed the edges * instance variable to be two-dimensional arrays of * the type ArrayList. These modifications can be found * among the instance variables and in the methods * holdSimilarties and addEdge. Also I added two methods * to retrieve information from the graph, these are * Retrieve and printNames. * */ package standalone; import java.util.ArrayList; public class WeightedGraph implements WeightedGraphInterface { public static ArrayList <Integer> NULL_EDGE = null; private int numVertices; private int maxVertices; private Graphable[] vertices; private ArrayList<Integer>[][] edges; private boolean[] marks; // marks[i] is mark for vertices[i] 50 public WeightedGraph() // Post: Arrays of size 50 are dynamically allocated for // marks and vertices, and of size 50 X 50 for edges // numVertices is set to 0; maxVertices is set to 50 { numVertices = 0; maxVertices = 50; vertices = new Graphable[50]; marks = new boolean[50]; edges = new ArrayList[50][50]; } /* Comment on the edges * Modification of the original WeightesGraph data structure: * the weights have become ArrayLists of id-numbers * Each edge contains the id-similarities between two vertices * If verticeX = {1,2,3,4,5} and verticeY = {3,4,5,6} the edge * represents the intersection result vertice{X,Y} = {3,4,5}, * these two values would be 
stores in a ArrayList * this ArrayList is necessary to be able to, in turn, intersect * this intersection with another intersection, * for instance vertice{P,Q) = {3,5,7,9} which would result in * vertice(P,Q,X,Y){3,5}, etc. * */ public void reset(){ vertices = new Graphable[50]; edges = new ArrayList[50][50]; numVertices = 0; } public ArrayList<Graphable> retrieve(){ ArrayList<Graphable> list = new ArrayList<Graphable>(); for (int i = 0; i < vertices.length; i++){ Graphable object = (Graphable) vertices[i]; if (object != null){ list.add(object); } } return list; } public void printNames(){ for (int i = 0; i < vertices.length; i++){ Graphable object = (Graphable) vertices[i]; if (object != null){ System.out.println(object.toString()); } } } public WeightedGraph(int maxV) // Post: Arrays of size maxV are dynamically allocated for // marks and vertices, and of size maxV X maxV for edges // numVertices is set to 0; maxVertices is set to maxV { numVertices = 0; maxVertices = maxV; vertices = new Graphable[maxV]; marks = new boolean[maxV]; edges = new ArrayList[maxV][maxV]; } public void addVertex(Graphable vertex) // Post: vertex has been stored in vertices. // Corresponding row and column of edges has been set to NULL_EDGE. 
51 // numVertices has been incremented { vertices[numVertices] = vertex; for (int index = 0; index < numVertices; index++) { edges[numVertices][index] = NULL_EDGE; edges[index][numVertices] = NULL_EDGE; } numVertices++; } private int indexIs(Graphable vertex) // Post: Returns the index of vertex in vertices { int index = 0; while (vertex != vertices[index]) index++; return index; } public void addEdge(Graphable fromVertex, Graphable toVertex, ArrayList<Integer> IDs) // Post: Edge (fromVertex, toVertex) is stored in edges { int row; int column; row = indexIs(fromVertex); column = indexIs(toVertex); edges[row][column] = IDs; } public ArrayList<Integer> holdSimilarities(Graphable fromVertex, Graphable toVertex) // Post: Returns the weight associated with the edge // (fromVertex, toVertex) { int row; int column; row = indexIs(fromVertex); column = indexIs(toVertex); return edges[row][column]; } public QueueInterface getToVertices(Graphable vertex) // Returns a queue of the vertices that are adjacent from vertex. { QueueInterface adjVertices = new LinkedQueue(); int fromIndex; int toIndex; fromIndex = indexIs(vertex); for (toIndex = 0; toIndex < numVertices; toIndex++) if (edges[fromIndex][toIndex] != NULL_EDGE) adjVertices.enqueue(vertices[toIndex]); return adjVertices; } 52 /* * /* * * * * * * * * * Name: WebFrontend.java Author: Daniël Suelmann Effect: 1. Displays a search interface; 2. Receives keyword queries; 3. Instantiates a Database object, that initializes other classes to retrieve an answer to the query. 4. When the answer is retrieved, proposed SQL queries are executed. 5. The answer is returned to the web browser. 6. 
-Optional- Displays intermediate intersection results.
 */
package scriptie;

import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.StringTokenizer;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import standalone.*;

public class WebFrontend extends HttpServlet {

    public void service(HttpServletRequest request, HttpServletResponse response) throws IOException {
        Database db = new Database();
        int countresults = 0;
        ArrayList resultstorage = new ArrayList();
        String enterkeywords = null;
        String noresults = null;
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        StringTokenizer tokenizer;
        ArrayList<Integer> result = null;
        long elapsedTimeMillis = 0;
        String input = request.getParameter("search_text");
        if (input != null && !input.equals("")) {
            input = input.toLowerCase();
            tokenizer = new StringTokenizer(input, " ");
            int size = tokenizer.countTokens();
            String[] feed = new String[size];
            int count = 0;
            while (tokenizer.hasMoreTokens()) {
                feed[count] = tokenizer.nextToken();
                count++;
            }
            long start = System.currentTimeMillis(); //measuring speed
            result = db.getResult(feed);
            if (result.isEmpty()) {
                noresults = "No results.";
            }
            else {
                try {
                    ConnectionPool pool = new ConnectionPool("com.mysql.jdbc.Driver",
                        "jdbc:mysql://localhost:3306/scriptie", "root", "XXXXX", 10, 20, true);
                    Connection conn = pool.getConnection();
                    Statement stmt;
                    ResultSet rs;
                    stmt = conn.createStatement();
                    for (int i = 0; i < result.size(); i++) {
                        int id = (int) result.get(i);
                        rs = stmt.executeQuery("SELECT * FROM lastgeld WHERE Idno = '" + id + "'");
                        while (rs.next()) {
                            int theInt = rs.getInt("Idno");
                            String vn = rs.getString("voornaam");
                            String an = rs.getString("achternaam");
                            String hh = rs.getString("haven_herk");
                            String al = rs.getString("Aantal_lasten");
String hd = rs.getString("Heffing_decimaal"); String tn = rs.getString("tonnage"); String gl = rs.getString("guldens"); String st = rs.getString("stuivers"); String sc = rs.getString("scanfile"); countresults++; //int printcount = countresults + 1; String store = null; store = (" " + countresults + ". id = " + theInt + " first name = " + vn + " last name = " + an + " harbor = " + hh + " cargo units = " + al + " toll-decimal = " + hd + " weight = " + tn + " guldens = " + gl + " stuivers = " + st + " <a href=\"" + sc + "\" target=\"_blank\">source</a><br>"); resultstorage.add(store); } } stmt.close(); pool.free(conn); conn.close(); pool.closeAllConnections(); } catch (SQLException e){ e.printStackTrace(); } result = null; feed = null; } elapsedTimeMillis = System.currentTimeMillis()-start; // Time elapsed } //if input != null && input != "" else { enterkeywords = ("Enter one or more keywords."); } //else out.println("<html>"); out.println("<body>"); out.println("<pre>"); //Search area out.println("<center>"); out.println("<br><br>"); out.println("<form action='lastgeld' method='get'>"); 54 out.print("<input type=text name=search_text>"); out.print("<input type=submit value=Search>"); out.print("<input type='checkbox' name='checkbox' value='verbose' />"); out.print("verbose"); out.println("</form>"); out.println("<a href = '/web/experiment'>experiment</a><br>"); if (noresults != null) out.println(noresults); if (enterkeywords != null) out.println(enterkeywords); out.println("</center>"); // Creating the verbose String checkbox = request.getParameter("checkbox"); if(checkbox != null){ if (checkbox.equals("verbose")){ out.println("<table border='1' align='center'>"); out.println("<tr>"); out.println("<td align='right'>"); out.println("Available indices: <a href = '/web/datefreq.txt' target = '_blank'>date</a>, <a href = '/web/vnfreq.txt' target = '_blank'>first name</a>, <a href = '/web/anfreq.txt' target = '_blank'>last name</a>, <a href = '/web/havenfreq.txt' target = 
'_blank'>harbor</a>, <a href = '/web/countryfreq.txt' target = '_blank'>country</a> "); out.println("</td>"); out.println("</tr>"); out.println("<tr>"); out.println("<td>"); //intersection combinations ArrayList <Graphable> intersections = db.getIntersections(); int pair = 1; int hr = 1; int count = 0; if (intersections != null){ if(!intersections.isEmpty()){ out.println("<h2>Step 1: find intersection combinations based on the keywords entered and the available indices.</h2><br>"); for (int i = 0; i < intersections.size(); i++){ if (count == 2){ out.print("<hr size='3' color= 'gray'>"); count = 0; } Graphable object; object = intersections.get(i); if (count == 0){ out.println("<b>pair " + pair + ":</b><br>"); pair++;; } out.println("<b>"+ object.toString() + " " + object.getType()+ "</b>" + object.getIDs() + "<br>"); count++; if(hr == 1){ out.println("<hr>"); hr--; } else 55 hr++; } } } else { out.print("Verbose returns output at a minimum of two keywords."); } intersections = null; db.resetIntersections(); out.println("</td>"); out.println("</tr>"); //combined intersections ArrayList<Object> combinedintersections = db.getCombinedIntersections(); int count1 = 0; int total = 1; if (combinedintersections != null){ out.println("<tr>"); out.println("<td>"); out.println("<h2>Step 2: intersect the combinations.</h2><br>"); if(!combinedintersections.isEmpty()){ for (int i = 0; i < combinedintersections.size(); i++){ if (count1 == 2){ out.print(" <b><--></b> "); } if (count1 == 5){ out.print("<br>"); count1 = 0; total++; } if (count1 < 4){ out.print(" <b>" + combinedintersections.get(i) + "</b>"); count1++; } else{ out.print(" " + combinedintersections.get(i)); count1++; } } } out.println("</td>"); out.println("</tr>"); } combinedintersections = null; db.resetCombinedIntersections(); ArrayList <Graphable> vertices = db.getVertices() ; WeightedGraph graph = db.getGraph(); if (vertices != null && graph != null){ out.println("<tr>"); out.println("<td>"); out.println("<h2>Step 
3: find the vertex with the most edges.</h2><br>"); if(!vertices.isEmpty()){ for (int i = 0; i < vertices.size(); i++){ 56 Graphable object; object = (Graphable) vertices.get(i); out.print(" <b>" + object.getType()+ "</b> "); //type out.print("<b>" + object.toString() + "</b> "); //content //System.out.println(object.getIDs()+ "\n"); //ID's QueueInterface queue; queue = graph.getToVertices(object); if (!queue.isEmpty()){ out.print("has edges: "); int count2 = 1; while(!queue.isEmpty()){ Graphable item = (Graphable) queue.dequeue(); out.print(" " + count2++ + ": " + item.getType()); out.print(", " + item); } out.print("<br>"); } else { out.println("no edges.<br>"); } } } out.println("</td>"); out.println("</tr>"); } out.println("</table>"); } } if (!resultstorage.isEmpty()){ out.print(" <b>Found "+ countresults + " records in "); if (elapsedTimeMillis != 0) out.println(elapsedTimeMillis + " milliseconds:</b><br>"); for(int i = 0; i < resultstorage.size(); i++){ out.print(resultstorage.get(i)); } } out.println("</pre>"); out.println("</body>"); out.println("</html>"); input = null; db.resetIntersections(); db.resetCombinedIntersections(); db.resetGraph(); db.resetVertices(); db.resetListOfLists(); db = null; } } 57