Survey
Keyword Search in Relational Databases
Jaehui Park
Intelligent Database Systems Lab., Seoul National University
2009. 02. 12.

Copyright 2009 by CEBT

Outline
- Introduction
- Bibliography
- Fundamental Characteristics
- Research Dimensions
- Summary
- Future Direction

Introduction
- Relational databases
  – A repository for a significant amount of data (e.g. enterprise data)
  – RDBMS: managing an abstract view of underlying data
- Structured Query Language (SQL)
  – Precise and complete
  – Difficult for casual users
- (Web) documents
  – Collection of unstructured (natural language) documents available online
  – Search engine: the most popular application for information discovery
- Keyword search
  – Simple and user-friendly
  – Approximating the precise results in statistical and semantic ways
- Deep Web
  – Information over the Web comes out of relational databases
[Figure: query style (Precise vs. Easy) against data type (Structured vs. Unstructured); labels: Querying structured data, Querying unstructured data, Easy way of querying structured data]

Introduction
- Enabling casual users to query relational databases with keywords
  – "Casual users"
    - Without any knowledge of the schema information
    - Without any knowledge of the query language (SQL)
  – The search system should hold this knowledge on behalf of users
- Challenges
  – Inherent discrepancy of data between IR and DB
  – Information is often split across tables (or tuples) in relational databases
  – Ex) A single retrieval unit of information
[Figure: keywords, SQL, Relational Databases, Results]

Bibliography
- Proximity
  – [Goldman et al., VLDB, 1998] Proximity Search in Databases
- DataSpot
  – [Palmon et al., VLDB, 1998] DTL's DataSpot - Database Exploration Using Plain Language
  – [Palmon et al., SIGMOD, 1998] DTL's DataSpot - database exploration as easy as browsing the Web
- DBXplorer
  – [Agrawal et al., 2002, ICDE] DBXplorer: a system for keyword-based search over relational databases
- BANKS
  – [Hulgeri et al., 2001, DEBU] Keyword Search in Databases
  – [Hulgeri et al., 2002, ICDE] Keyword
    Searching and Browsing in Databases using BANKS
  – [Kacholia et al., 2005, VLDB] Bidirectional Expansion For Keyword Search
- DISCOVER
  – [Hristidis et al., 2002, VLDB] DISCOVER: Keyword search in relational databases
  – [Hristidis et al., 2003, VLDB] Efficient IR-Style Keyword Search over Relational Databases
  – [Liu et al., 2006, SIGMOD] Effective Keyword Search in Relational Databases
- ObjectRank
  – [Balmin and Hristidis et al., 2004, VLDB] ObjectRank: Authority-Based Keyword Search in Databases
  – [Balmin and Hristidis et al., 2008, TODS] Authority-based search on databases

Proximity
- Proximity
  – Measure of how related objects are
- Objects related by a distance function
  – Shortest path computation
- K-neighborhood distance look-up table
[Figure: document, relational database]

DataSpot
- Hyperbase
  – Modeling data graph
  – Sub-hyperbase as an answer
- Best-first searching
[Figure: keywords and SQL query; Relational Databases converted to a Hyperbase]
[Figure: hyperbase example with tables Customers and Orders sharing Customer ID 123456; node labels: Record, Field, Field Name, Field Value, Key 123456, String, Stem, Thesaurus, Stem "client", Stem "customer", Text "Customer", Text "ID"]

DBXplorer
- Symbol table
  – Index for schema entities
- Locating objects efficiently
  – Granularity
  – Compaction
- Schema graph
- Join tree enumeration
  – Joining several tables on the fly
[Figure: keywords query, symbol table (term, location), Relational Databases]

BANKS
- Directed (data) graph
  – Backward edge
- Graph traversing algorithm
  – NP-hard problem
  – Heuristics
    - Backward expanding search
    - Bi-directional expanding search
- Rich interface

DISCOVER
- High level representation of the architecture for keyword search in relational databases
- Top-k join query processing
  – Pipeline algorithm
    - Threshold [Fagin et al.
      2001]
- IR-style ranking function
  – TF-IDF based tuple ranking

ObjectRank
- Authority
  – Measure of how important objects are
  – Authority flow graph
- Modified PageRank algorithm
  – (Global) ObjectRank algorithm
  – Inverse ObjectRank algorithm

Fundamental Characteristics
- Identifying schema elements
  – To avoid linearly scanning all the tables
  – Indexing structure, e.g. inverted index
- Processing queries
  – Keyword query processing
  – Making the best of the lack of syntax in query keywords
  – Formalizing internal queries, e.g. SQL
- Modeling answers
  – Logical unit of retrieval is not a document
  – e.g. Directed Acyclic Graph (DAG)
- Ranking answers
  – Assign a single score, which can reflect the semantics of the underlying schema, to each answer
  – Order the returned answers
[Figure: search system with Ranking, Processing, Indexing and Model components between keywords k1, k2, k3, k4 and the RDBMS / RDB]

Research Dimensions
- Model
  – Data representation
  – Query representation
- Processing and Indexing: efficient processing
  – Processing: top-k query processing
  – Indexing: indexing structure
- Ranking
  – Ranking
- Presentation

Data representation (1/4)
- Graph model
  – Data graph
  – Schema graph
[Figure: data graph example. Paper (PaperID, PaperName): J.H.Park08, "Web Content Summarization Using …". Writes (AuthorID, PaperID): JHPark, J.H.Park08; SGLee, S.G.Lee08. Author (AuthorID, AuthorName): JHPark, Jaehui Park; SGLee, Sang-goo Lee]
[Figure: schema graph. Cites (Citing, Cited, …), Paper (PaperID, PaperName, …), Writes (AuthorID, PaperID, …), Author (AuthorID, AuthorName, …)]

Data representation (2/4)
- Data graph
- Efficient graph traversing
  – Reducing search time
  – Finding an optimal answer is NP-hard: Steiner tree problem
  – Heuristics
- Size problem
  – Too huge to fit into main memory
- Maintenance problem
  – Not appropriate for update-intensive databases
[Figure: keywords traverse the data graph built from the RDB]

Data representation (3/4)
- Schema graph
- Smaller size
  – Scales well for huge databases
- Utilize underlying RDBMS facilities
  – e.g.
      Database indexes on columns
- Exploiting the schema of the underlying database
  – Generating optimal internal queries: SQL
  – Evaluation for queries
[Figure: keywords, Query, traverse, RDB]

Query keywords: Jaehui Relational Database
--------------------------------------------------
Candidate join queries:
Tmp1 : select * from Paper, Writes where Paper.PaperName = 'Relational Database' AND …
Tmp2 : select * from Tmp1, Author where … Author.AuthorName = 'Jaehui' AND …

Data representation (4/4)
- Graph model
  – A logical unit of information: subgraph
    - A set of multiple nodes joined together (join operations)
    - May include some tuples that do not contain any query keywords
- Weighting scheme
  – Edges: distance (or proximity)
  – Nodes: importance (or authority)
[Figure: example subgraphs over tuples T1 to T6, some of which contain the query keywords K1, K2, K3]

Ranking
- Relevance
  – Answer size
    - Minimal subgraph including all the query keywords
    - Distance as the semantic closeness between objects
      - The distance between an entity and its attributes
      - The distance between tuples in the same table
      - The distance between tuples related through primary and foreign keys
  – Term frequency
    - Standard IR weighting method: TF-IDF
    - Text databases (e.g. user complaints, product descriptions, book reviews, etc.)
- Importance
  – Authority
    - Authority transfer graph
    - Nodes with incoming links from high-authority nodes are assumed to have higher importance
  – Specificity problem
    - Specific results should be ranked higher than general ones
    - e.g., InverseObjectRank algorithm
[Figure: example tuples (Writes; "Tree Traverse algorithm", "Query Evaluation"; Jane, Tom) with scores 0.8 and 0.2]
[Figure: authority transfer graph over Cites (Citing, Cited, …), Paper (PaperID, PaperName, …), Writes (AuthorID, PaperID, …), Author (AuthorID, AuthorName, …) with edge weights 0.7, 0, 0.2, 0.2, 0.4]

Efficient processing
- Indexing structure
  – Reducing scan time
    - Granularity levels of schema elements: column level vs. record (or cell) level
  – Reducing computation time
    - Precomputation of edge weights, node weights, relevance scores, etc.
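Note (illustrative sketch, not from the original slides): the granularity choice above can be made concrete with a small in-memory keyword index. The function name `build_symbol_table`, the `granularity` parameter, and the sample rows are assumptions made for illustration; the table and column names follow the Paper/Author example used earlier in the deck. A column-level index records only which column a keyword occurs in, while a cell-level index also records the row.

```python
from collections import defaultdict


def build_symbol_table(tables, granularity="column"):
    """Map each lowercase token to the places where it occurs.

    tables: dict of table name -> list of rows (each row is a dict).
    granularity: "column" stores (table, column);
                 "cell" stores (table, column, row position).
    """
    if granularity not in ("column", "cell"):
        raise ValueError("granularity must be 'column' or 'cell'")
    index = defaultdict(set)
    for table, rows in tables.items():
        for rowid, row in enumerate(rows):
            for column, value in row.items():
                # Whitespace tokenization only; a real system would also stem.
                for token in str(value).lower().split():
                    if granularity == "column":
                        index[token].add((table, column))
                    else:
                        index[token].add((table, column, rowid))
    return dict(index)


# Hypothetical sample data modeled on the deck's Paper/Author example.
tables = {
    "Author": [
        {"AuthorID": "JHPark", "AuthorName": "Jaehui Park"},
        {"AuthorID": "SGLee", "AuthorName": "Sang-goo Lee"},
    ],
    "Paper": [
        {"PaperID": "J.H.Park08", "PaperName": "Web Content Summarization"},
    ],
}

column_index = build_symbol_table(tables, "column")
cell_index = build_symbol_table(tables, "cell")
print(column_index.get("jaehui"))
print(cell_index.get("lee"))
```

The column-level index stays small but still requires a scan of the matching column to find rows; the cell-level index answers the lookup directly at the cost of one entry per occurrence.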
- Query execution technique
  – Top-k query processing
    - Avoiding creating all query results
    - Decide which candidate answers will produce top-k results
    - e.g. Sparse algorithm, Pipeline algorithm

ROWID | Score        ROWID | Score
a1    | 76           b1    | 90
a2    | 60           b2    | 50
a3    | 15           b3    | 12
…     | …            …     | …

Query representation
- Logical operators
  – Conjunction, disjunction
- Type and condition
  – Type: Find type, Near type
  – Conditional keywords, e.g. Year > 300

Presentation
- Visualizing search results
  – e.g. tree view
  – Structural level vs. tuple level
- Limiting the maximum size of an answer
- Limiting the maximum number of answers
- …

Summary
- Comparison in a common framework

Approach   | Data model               | Ranking                  | Efficiency                      | Query representation     | Presentation
Proximity  | Data-graph               | Distance                 | K-neighborhood distance look-up | Type, Conjunction        | -
DataSpot   | Data-graph               | Number of edges          | -                               | Conjunction              | Table
DBXplorer  | Schema-graph             | Number of joins          | Symbol table                    | Conjunction              | Enumerated rows
BANKS      | Data-graph (directed)    | Edge weight, Node weight | Disk resident index on keyword  | Conjunction              | Dynamic Joined Tree
DISCOVER   | Schema-graph             | Number of joins          | Master Index                    | Conjunction, Disjunction | -
ObjectRank | Schema-graph, Data-graph | Authority                | Master Index                    | Conjunction, Disjunction | -

Future Directions
- Probabilistic model
  – Naïve approaches
    - Rank measures based on the answer size
    - Cannot directly estimate the (probability of) relevance between the query and the retrieved tuples
    - Heuristics perform well
  – Probabilistic model
    - e.g. Bayesian belief network
    - Term-based approach to approximate the optimal answer
    - Modification for dealing with relational databases
- Efficient query processing
  – Top-k query processing has shown a great impact on performance
    - Ranking function involves aggregation or grouping operators
  – Symbol table design
    - Dependencies between schema elements

Conclusion
- Various approaches are described according to our understanding
- We envision the above research directions to be important to pursue.
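Note (illustrative sketch, not from the original slides): the top-k idea of avoiding the creation of all query results can be shown on the two score-sorted lists from the Efficient processing slide. The code below is a simplified best-first enumeration of joined pairs, not the Sparse or Pipeline algorithm itself. The function `top_k_joined`, the `joins` predicate, and the use of a plain sum as the combined score are assumptions made for illustration.

```python
import heapq


def top_k_joined(left, right, k, joins=lambda a, b: True):
    """Return up to k joined pairs with the highest combined score.

    left, right: lists of (rowid, score), each sorted by score descending.
    joins: hypothetical predicate telling whether two rows join.
    Pairs are explored in non-increasing order of score sum, so the
    search stops as soon as k joining pairs have been found.
    """
    if not left or not right or k <= 0:
        return []
    # heapq is a min-heap, so scores are negated to pop the largest sum first.
    heap = [(-(left[0][1] + right[0][1]), 0, 0)]
    seen = {(0, 0)}
    results = []
    while heap and len(results) < k:
        neg_score, i, j = heapq.heappop(heap)
        a_id, b_id = left[i][0], right[j][0]
        if joins(a_id, b_id):
            results.append((a_id, b_id, -neg_score))
        # The only pairs that can come next are the two neighbors of (i, j).
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(left[ni][1] + right[nj][1]), ni, nj))
    return results


# Scores taken from the two lists on the slide.
list_a = [("a1", 76), ("a2", 60), ("a3", 15)]
list_b = [("b1", 90), ("b2", 50), ("b3", 12)]

print(top_k_joined(list_a, list_b, k=3))
# → [('a1', 'b1', 166), ('a2', 'b1', 150), ('a1', 'b2', 126)]
```

With k = 3 only five of the nine possible pairs are ever scored, which is the point of top-k processing: the remaining combinations cannot beat the answers already returned.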
Thank you