* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download DBXplorer : A System For Keyword-Based Search Over Relational
Survey
Document related concepts
Oracle Database wikipedia , lookup
Microsoft Access wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Concurrency control wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Relational algebra wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Ingres (database) wikipedia , lookup
Clusterpoint wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Transcript
Sanjay Agrawal Surajit Chaudhuri Gautham Das Microsoft Research Microsoft Research Microsoft Research Presented by Anusha Mannem Introduction Overview Of DBXplorer Publish & Design of Symbol Table Finding Matches For Keyword Search Support for Generalized Matches(Token Matching) Experimental Results Conclusions Internet search engines have popularized keyword based search. Alternative to Keyword-Based search is Structured Search. Ex: Searching for books in Bookseller's database Books Travel Lonely planet Asia (or) Books Travel Rough Guides Europe Other similar example is http://m.www.yahoo.com/ Example for Keyword-Based Search is searching for “Jim Gary” on Microsoft intranet to obtain matched rows, i.e., rows in DB where ‘Jim Gary’ occur. DBXplorer, an efficient and scalable keyword search utility for relational databases. The goal of DBXplorer is to enable keyword searches on multiple databases without requiring the users to know the schema of respective databases. How it works? When a DBXplorer is given a set of query keywords, it returns all the rows (either from single table, or by joining tables connected by foreign-key joins) such that each row contains all the keywords. Enabling this keyword search requires 2 steps (a)Publish Step (b)Search Step Publish : A databases is enabled for keyword search through the following steps. Step 1 : A database is identified, along with the set of tables or columns within the database to be published. Step 2 : Auxiliary tables are created for supporting keyword searches. The most important structure is Symbol Table S that is used at search time to efficiently determine the locations of query keywords in the database Search: Given a query consisting of a set of keywords, it is answered as follows. Step 1: Looks up symbol table to identify the tables, and columns/rows of the database which contain the query keywords. Step 2: Identify and enumerate all possible joins (subsets of table if joined) whose rows might contain required keywords Step 3: Generate SQL statements for each join (gives rows which contain all keywords), and the final rows are ranked and presented to the user. Publish and Search components are packaged as two separate COM objects. Publish Component provides interface to (a) Select a database (b) Select tables/columns within the database to publish ,and (c) modify/remove/maintain the publication For given set of keywords, the search component provides interfaces to (a) Retrieve matching databases from set of published databases. (b) Selectively identify tables, columns/rows that need to be searched within each database identified in step (a) Exact match problem is considered i.e., each keyword in the query must match the value of an attribute in a row of a table. Symbol Table S, which stores the information of the query keywords at different levels (row/column ) of granularity. The two granularity levels are : Column level granularity (Pub-Col) : for each keyword , we only store the list of columns where they occur. Row Level granularity (Pub-Cell) : for each keyword we keep track of all the rows that contains the keyword. Space and Time Requirements Pub-Col is much faster and smaller when compared to Pub-Cell. Keyword Search Performance Pub-Col is effective if there are an indexes on the published columns Ease of Maintenance Pub-Col is easier to maintain as it requires an update only if the insertions cause new values in column data. Hence, Pub-Col symbol table is always better than Pub-Cell table, unless certain columns do not have indexes. Symbol table is stored as a relation with two attributes : Keyword and ColId. Alternative way is to keep hashed values of keywords in the symbol table . FK-Comp: If the set of values in column c1 is a subset of the values in another column c2 due to key-foreign key relationship, we retain only a single hash table entry for keywords common to both as (keyval,c1), since the foreign key constraint can be used to infer presence of keyval in c2. CP-Comp : V2 (a)Partition H into a minimum number c1 V3 of bipartite cliques (a bipartite clique is any subgraph of H with a maximal number of edges). v4 x c2 Uncompressed hash table (b) Compress each clique. (map the symbol table S to a bipartite graph H having two node sets, HashVal & ColId, where every row (v, c) in table S corresponds to an edge in H.) ColumnsMap table Compressed hash table Step 1 : Symbol table is searched(using generated SQL) to identify the database tables, columns/cells that contain at least one query keyword. Step 2 : Enumerating Join trees Step 3 : Identifying matching rows Identify and enumerate all potential subsets of tables in the database that, if joined, might contain rows having all keywords. The resulting relation will contain all potential rows having all keywords specified in the query. If we view the schema graph G as an undirected graph, this step enumerates join trees, i.e., sub-trees of G such that: (a) the leaves belong to MatchedTables and (b) together, the leaves contain all keywords of the query keywords Join Trees The input to this final search step is enumerated join trees. Each join tree is then mapped to a single SQL statement that joins the tables as specified in the tree, and select those rows that contain all keywords. The retrieved rows are ranked before being output. Our approach is to rank the rows by number of joins involved E.g., documents in which keywords occur close to one another are ranked higher than documents in which keywords are far apart. Token matches - the keyword in the query matches only a token or substring of an attribute value (e.g., retrieve rows of address by specifying only a street name). Pub-Prefix method B+ tree indexes can be used to retrieve rows whose cell matches a given prefix string This clause is of the form WHERE T.C LIKE ‘P%K%’ During publishing of a database, for every keyword K, the entry (hash(K), T.C, P) is kept in the symbol table if there exists a string in column T.C which ▪ contains a token K, and ▪ has prefix P Let the hash values of the searchable tokens i.e., ‘string’, ‘ball’ and ‘round’ be 1, 2 and 3 respectively RowId C 1 this is a string 2 this string 3 this is a ball 4 no string 5 any ball is round Pub-Prefix table Database table T Consider searching keyword “string” Pub-Prefix table returns prefixes “th” and “no” and subsequent SQL will contain (T.C LIKE ‘th%string%’) OR (T.C LIKE ‘no%string%’) Setup: The Experiments are on a 450 MHz 256 MB Intel P-III machine. Four databases were used for evaluation: TPC-H data of sizes from 100 to 500 MB USR is Microsoft employee address DB of 130 MB with 19 tables ML is a 375 MB mailing list DB with 38 tables KB is a 365 MB DB with 84 tables containing information on articles and help manuals on various shipped products In particular it has been showed that : Pub-Col table is compact when compared to Pub-Cell Pub-Col scales linearly with data size, and is independent of data distribution. Pub-Prefix is compact compared to Pub-Cell and has a significantly better performance when full text indexes are not present Publish & Search performance of Pub-Col & Pub-Cell are compared: Fig: Matching Databases Fig: Join Trees Fig: Matching Rows DBXplorer has been implemented using a commercial relational database and web server and allows users to interact via a browser at front-end. Although the authors discussed only the case where there is a single database, this techniques can be extended to keyword search over multiple databases. DBXplorer support exact matches and generalized matches up to some extent. As mentioned earlier, the Pub-Col alternative is the best when columns have indexes on them. If a full-text index is available, use Pub-Col with the full-text index. Instead, if only a traditional index is available and the column width is small, use Pub-Prefix, otherwise use Pub-Cell.