
Keyword Search in Relational Databases
Jaehui Park
Intelligent Database Systems Lab.
Seoul National University
2009. 02. 12.
Copyright © 2009 by CEBT

Outline
• Introduction
• Bibliography
• Fundamental Characteristics
• Research Dimensions
• Summary
• Future Direction

Introduction
• Querying structured data
  – Relational databases
    • A repository for a significant amount of data (e.g. enterprise data)
    • RDBMS: managing an abstract view of underlying data
  – Structured Query Language (SQL)
    • Precise and complete
    • Difficult for casual users
• Querying unstructured data
  – (Web) documents
    • Collection of unstructured (natural language) documents available online
  – Search engine
    • The most popular application for information discovery
  – Keyword search
    • Simple and user-friendly
    • Approximating the precise results in statistical and semantic ways
• Easy way of querying structured data
  – Deep Web
    • Information over the Web comes out of relational databases
(Figure labels: Data: Structured vs. Unstructured; Querying: Precise vs. Easy)

Introduction
• Enabling casual users to query relational databases with keywords
  – "Casual users"
    • Without any knowledge of the schema information
    • Without any knowledge of the query language (SQL)
  – The search system should have this knowledge on behalf of users
• Challenges
  – Inherent discrepancy of data between IR and DB
    • Information is often split across tables (or tuples) in relational databases
    • Ex) A single retrieval unit of information
(Figure labels: keywords, SQL, Relational Databases, Results)

Bibliography
• Proximity
  – [Goldman et al., VLDB, 1998] Proximity Search in Databases
• DataSpot
  – [Palmon et al., VLDB, 1998] DTL's DataSpot: Database Exploration Using Plain Language
  – [Palmon et al., SIGMOD, 1998] DTL's DataSpot: database exploration as easy as browsing the Web
• DBXplorer
  – [Agrawal et al., ICDE, 2002] DBXplorer: a system for keyword-based search over relational databases
• BANKS
  – [Hulgeri et al., DEBU, 2001] Keyword Search in Databases
  – [Hulgeri et al., ICDE, 2002] Keyword Searching and Browsing in Databases using BANKS
  – [Kacholia et al., VLDB, 2005] Bidirectional Expansion For Keyword Search
• DISCOVER
  – [Hristidis et al., VLDB, 2002] DISCOVER: Keyword search in relational databases
  – [Hristidis et al., VLDB, 2003] Efficient IR-Style Keyword Search over Relational Databases
  – [Liu et al., SIGMOD, 2006] Effective Keyword Search in Relational Databases
• ObjectRank
  – [Balmin and Hristidis et al., VLDB, 2004] ObjectRank: Authority-Based Keyword Search in Databases
  – [Balmin and Hristidis et al., TODS, 2008] Authority-based search on databases

Proximity
• Proximity
  – Measure of how related objects are
  – Objects related by a distance function
    • Shortest path computation
(Figure labels: document, relational database, K-neighborhood distance look-up table)

DataSpot
• Hyperbase
  – Modeling data graph
  – Sub-hyperbase as an answer
• Best-first searching
(Figure: relational databases, normally accessed with an SQL query, are converted into a hyperbase that is queried with keywords. The example hyperbase contains records from Customers and Orders that share Customer ID 123456, with node types Record, Field, Field Name, Field Value, Key, Text ("Customer", "ID"), String and Stem ("customer"), plus a Thesaurus entry for the stem "client".)

DBXplorer
• Symbol table: index for schema entities
  – Locating objects efficiently
    • Granularity
    • Compaction
• Schema graph
  – Join tree enumeration
    • Joining several tables on the fly
(Figure labels: keywords, query, symbol table with columns "term" and "location", Relational Databases)

BANKS
• Directed (data) graph
  – Backward edge
• Graph traversing algorithm
  – NP-hard problem
  – Heuristics
    • Backward expanding search
    • Bi-directional expanding search
• Rich interface

DISCOVER
• High-level representation of the architecture for keyword search in relational databases
• Top-k join query processing
  – Pipeline algorithm
  – Threshold [Fagin et al. 2001]
• IR-style ranking function
  – TF-IDF based tuple ranking

ObjectRank
• Authority
  – Measure of how important objects are
  – Authority flow graph
• Modified PageRank algorithm
  – (Global) ObjectRank algorithm
  – Inverse ObjectRank algorithm

Fundamental Characteristics
• Identifying schema elements
  – To avoid linearly scanning all the tables
  – Indexing structure
    • e.g. inverted index
• Processing queries
  – Keyword query processing
    • Making the best of the lack of syntax in query keywords
  – Formalizing internal queries
    • e.g. SQL
• Modeling answers
  – Logical unit of retrieval is not a document
    • e.g. Directed Acyclic Graph (DAG)
• Ranking answers
  – Assign each answer a single score that can reflect the semantics of the underlying schema
  – Order the returned answers
(Figure: keywords k1 to k4 enter a search system with Model, Indexing, Processing and Ranking components, which sits on top of the RDBMS and the RDB)

Research Dimensions
• Model
  – Data representation
  – Query representation
• Efficient processing
  – Processing: top-k query processing
  – Indexing: indexing structure
• Ranking
  – Ranking
• Presentation

Data representation (1/4)
• Graph model
  – Data graph
  – Schema graph
(Figure, data graph example: Paper tuple (PaperID J.H.Park08, PaperName "Web Content Summarization Using …"); Writes tuples (AuthorID JHPark, PaperID J.H.Park08) and (AuthorID SGLee, PaperID S.G.Lee08); Author tuples (AuthorID JHPark, AuthorName Jaehui Park) and (AuthorID SGLee, AuthorName Sang-goo Lee))
(Figure, schema graph example: Cites(Citing, Cited, …), Paper(PaperID, PaperName, …), Writes(AuthorID, PaperID, …), Author(AuthorID, AuthorName, …))

Data representation (2/4)
• Data graph
  – Reducing search time
    • Efficient graph traversing
  – Finding an optimal answer
    • NP-hard: Steiner tree problem
    • Heuristics
  – Size problem
    • Too large to fit into main memory
  – Maintenance problem
    • Not appropriate for update-intensive databases
(Figure labels: keywords, traverse, RDB)

Data representation (3/4)
• Schema graph
  – Smaller size
    • Scales well for huge databases
  – Utilizes underlying RDBMS facilities
    • e.g. database indexes on columns
  – Exploits the schema of the underlying database
    • Generating optimal internal queries: SQL
    • Evaluation for queries
(Figure labels: keywords, traverse, Query, RDB)
Example
  Query keywords: Jaehui, Relational Database
  Candidate join queries:
    Tmp1: select * from Paper, Writes where Paper.PaperName = 'Relational Database' AND …
    Tmp2: select * from Tmp1, Author where … Author.AuthorName = 'Jaehui' AND …

Data representation (4/4)
• Graph model
  – A logical unit of information: subgraph
    • A set of multiple nodes joined together
    • May include some tuples that do not contain any query keywords
  – Weighting scheme
    • Edges: join operations, distance (or proximity)
    • Nodes: importance (or authority)
(Figure: answer subgraphs over tuples T1 to T6, some of which match the keywords K1, K2 and K3)

Ranking
• Relevance
  – Answer size
    • Minimal subgraph including all the query keywords
    • Distance as the semantic closeness between objects
      – The distance between an entity and its attributes
      – The distance between tuples in the same table
      – The distance between tuples related through primary and foreign keys
  – Standard IR weighting method
    • Term frequency: TF-IDF
    • Text databases (e.g. user complaints, product descriptions, book reviews, etc.)
• Importance
  – Authority
    • Authority transfer graph
    • Nodes with incoming links from high-authority nodes are assumed to have higher importance
  – Specificity problem
    • Specific results should be ranked higher than general ones
    • e.g. InverseObjectRank algorithm
(Figure: a distance-weighted example with the labels Tree, Traverse algorithm, Query Evaluation, Jane and Tom and the weights 0.8 and 0.2; an authority transfer graph over Cites (Citing, Cited), Paper, Writes and Author with the transfer rates 0.7, 0.4, 0.2, 0.2 and 0)

Efficient processing
• Indexing structure
  – Reducing scan time
    • Granularity levels of schema elements: column level vs. record (or cell) level
  – Reducing computation time
    • Precomputation: edge weights, node weights, relevance scores, etc.
• Query execution technique
  – Top-k query processing
    • Avoiding creating all query results
    • Decide which candidate answers will produce top-k results
    • e.g. Sparse algorithm, Pipeline algorithm
(Figure: two score-ordered tuple lists)
  ROWID | Score        ROWID | Score
  a1    | 76           b1    | 90
  a2    | 60           b2    | 50
  a3    | 15           b3    | 12
  …     | …            …     | …

Query representation
• Logical operators
  – Conjunction, disjunction
• Type and condition
  – Type: Find type, Near type
  – Conditional keywords: e.g. Year > 300

Presentation
• Visualizing search results
  – e.g. tree view
  – Structural level vs. tuple level
• Limiting the maximum size of an answer
• Limiting the maximum number of answers
• …

Summary
• Comparison in a common framework

System     | Data model               | Ranking                  | Efficiency                       | Query representation     | Presentation
-----------+--------------------------+--------------------------+----------------------------------+--------------------------+--------------------
Proximity  | Data-graph               | Distance                 | K-neighborhood distance look-up  | Type, Conjunction        | -
DataSpot   | Data-graph               | Number of edges          | -                                | Conjunction              | Table
DBXplorer  | Schema-graph             | Number of joins          | Symbol table                     | Conjunction              | Enumerated rows
BANKS      | Data-graph (directed)    | Edge weight, Node weight | Disk resident index on keyword   | Conjunction              | Dynamic Joined Tree
DISCOVER   | Schema-graph             | Number of joins          | Master Index                     | Conjunction, Disjunction | -
ObjectRank | Schema-graph, Data-graph | Authority                | Master Index                     | Conjunction, Disjunction | -

Future Directions
• Probabilistic model
  – Naïve approaches
    • Rank measures based on the answer size
    • Cannot directly estimate the (probability of) relevance between the query and the retrieved tuples
    • Heuristics perform well
  – Probabilistic model
    • e.g. Bayesian belief network
    • Term-based approach to approximate the optimal answer
    • Modification for dealing with relational databases: dependencies between schema elements
• Efficient query processing
  – Top-k query processing has shown a great impact on performance
    • Ranking function involves aggregation or grouping operators
  – Symbol table design
• Conclusion
  – Various approaches are described with our understanding
  – We envision the above research directions to be important to pursue.

Thank you
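
Sketch: keyword search over a data graph
The data-graph slides describe answers as subgraphs that connect tuples matching every query keyword, ranked by distance. The following is a minimal Python sketch of that idea, not the algorithm of any system listed in the bibliography: it assumes an undirected, unweighted graph, runs a breadth-first search outward from the tuples matching each keyword, and ranks candidate answer roots by their total distance to all keywords. The node names, the `EDGES` and `TEXT` structures, and all function names are hypothetical; the tuples are loosely based on the Paper/Writes/Author example.

```python
from collections import deque

# Hypothetical toy data graph: one node per tuple, one edge per key reference.
EDGES = [
    ("paper:J.H.Park08", "writes:JHPark/J.H.Park08"),
    ("writes:JHPark/J.H.Park08", "author:JHPark"),
    ("paper:S.G.Lee08", "writes:SGLee/S.G.Lee08"),
    ("writes:SGLee/S.G.Lee08", "author:SGLee"),
]

# Searchable text per tuple (assumption: only these attributes are indexed).
TEXT = {
    "paper:J.H.Park08": "Web Content Summarization Using ...",
    "author:JHPark": "Jaehui Park",
    "author:SGLee": "Sang-goo Lee",
}


def build_adjacency(edges):
    """Treat every key reference as an undirected edge."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def keyword_nodes(keyword, text):
    """Return the tuples whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return {node for node, value in text.items() if needle in value.lower()}


def distances_from(sources, adjacency):
    """Multi-source BFS: hop distance from the nearest source to each node."""
    dist = {source: 0 for source in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def search(keywords, adjacency, text, k=3):
    """Rank nodes reachable from every keyword by their summed distances."""
    per_keyword = [
        distances_from(keyword_nodes(kw, text), adjacency) for kw in keywords
    ]
    if not per_keyword:
        return []
    roots = set.intersection(*(set(d) for d in per_keyword))
    scored = sorted((sum(d[root] for d in per_keyword), root) for root in roots)
    return [(root, score) for score, root in scored[:k]]


ADJACENCY = build_adjacency(EDGES)
RESULTS = search(["Jaehui", "Summarization"], ADJACENCY, TEXT)
```

In this toy graph every returned root lies on the same Author, Writes, Paper path, so a real system would also need to merge answers that describe the same joined tuples. Edge weights, edge direction and node importance, which the slides list under the weighting scheme, are left out to keep the sketch short.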
