Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
DBxplorer: A System for Keyword-Based Search over Relational Databases Paper By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das Presented By Bhushan Chaudhari University of Texas at Arlington 1 Contents Typical Key-word Search Overview of DBxplorer Processing Component for producing Symbol Table Storing Symbol table in databases Search Component Compaction Algorithms Finding Matches using graphs Generalized Matches Experiments and Statistics 2 Key-word Search General Scenario Finding specified keywords in same or different tables or completely different schemas Physical schema needs to be explored Existing indexes need to be leveraged by using proper data structures Data structure => Structure to be followed by symbol tables 3 Levels of Granularity Column level granularity Row level granularity 2 4 2 TOYOTA COROLLA 2 TOYOTA CAMRY 2 HONDA ACCORD 2 HONDA CRV 4 Overview of DBxplorer Supports Conjunctive Queries Implementation using MSSQL Server2000 and IIS Web Server ODBC Interface for database Connection Uses the functionality of relational engines very well 5 Overview of DBxplorer (Contd ..) Publish Determine the tables to be published all_tab, all_tab_user Table relations in form of graphs Columns to be published all_tab_columns, user_tab_columns “select table_name from user_tab_columns” “where column_name = ‘desired_name’ Building Symbol Tables 6 Overview of DBxplorer (Contd ..) Search Symbol table is looked to identify the tables, columns and rows containing keywords Join trees (set of tables which are related) Query is constructed for each join tree and tuples containing all keywords are found 7 Design Alternatives for Symbol Table Location Granularity Column level and cell level “Why there is no row level granularity? “ Hard to implement SQL queries work w.r.t. columns 8 Factors affected due to granularity Space and time requirements Searching time Time required to build Using in-built operators like ‘distinct’ accumulating all values inside column becomes easy Most of the typical database systems use Hash Value indexes which are good for equality searches 9 Factors affected due to granularity (Contd..) Search Performance Typically depends upon presence of indices Ease of maintenance One time creation Insert Update Delete Pub-Col Y/N Y/N N Pub-Cell Y Y Y “ Which type of symbol table should we use?” 10 Storing Symbol Tables Pub-Col Representation Key-word -> Column Id Hash Value -> Column Id Types of Compression Algorithms Used Foreign Key Compression (FK-Comp) General Compression Technique (CP-Comp) 11 Building Compression table 12 Compression Algorithm 13 Storing Symbol Tables (Contd..) Pub-Cell Representation Hash Value -> Cell ID Hash Value -> Cell ID List “Retrieval of all locations for a key-word is achieved by looking up a single row from pub-cell symbol table” No Compression Pre-computation is complex Inverted lists can be implemented using this table 14 Finding Matches for Keyword Search • Each join tree is mapped to a SQL Query and selects those rows that contain all keywords. • Ranking is based upon no. of joins (Quite similar to ranking upon proximity of words in documents) 15 Search Algorithm 16 Supporting Generalized Matches Where T.C like ‘%STRING%’ Traditional databases use B+ Trees indices Pub-Prefix Representation Microsoft SQL Server (Most Recent version) Where CONTAINS(C,’String’); Enables token searches having form WHERE T.C LIKE ‘P%K%’ Symbol table entry (Hash(k), T.C, P) Efficiency depends on length of prefix Length of prefix also affects symbol table size and build time. Stemming 17 Experiments – Symbol Table Granularity Symbol Table Size 18 Experiments – Symbol Table Granularity Publishing Time 19 Experiments – Symbol Table Granularity Search Performance 20 Experiments – Scalability of Pub-Col Data Size and Distribution 21 Experiments – Scalability of Pub-Col Number of Keywords in Search 22 Experiments – Scalability of Pub-Col Effectiveness of Compression Techniques 23 Experiments – Scalability of Pub-Col Effectiveness of Pub-Prefix Method 24 References DBxplorer: A System for Keyword Based Search over Relational Databases ICDE 2002 By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das DBXplorer: Enabling Keyword Search over Relational Databases. (Demo), SIGMOD Conference 2002: 627 By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das 25