Efficient IR-Style Keyword Search
over Relational Databases
• Vagelis Hristidis
University of California, San Diego
• Luis Gravano
Columbia University
• Yannis Papakonstantinou
University of California, San Diego
Motivation
• Keyword search is the dominant
information discovery method in
documents
• Increasing amount of data is stored in
databases
• Plain text coexists with structured data
Motivation
• Up until recently, information discovery in
databases required:
– Knowledge of schema
– Knowledge of a query language (e.g., SQL)
– Knowledge of the role of the keywords
• Goal: Enable IR-style keyword search over DBMSs
without the above requirements
IR-Style Search over DBMSs
• IR keyword search well developed for
document search
• Modern DBMSs offer IR-style keyword
search over individual text attributes
• What is the equivalent of a document in a
database?
Example – Complaints Database: Schema
Products(prodId, manufacturer, model)
Complaints(prodId, custId, date, comments)
Customers(custId, name, occupation)
Example – Complaints Database: Data

Complaints
tupleId  prodId  custId  date       comments
c1       p121    c3232   6-30-2002  "disk crashed after just one week of moderate use on an IBM Netvista X41"
c2       p131    c3131   7-3-2002   "lower-end IBM Netvista caught fire, starting apparently with disk"
c3       p131    c3143   8-3-2002   "IBM Netvista unstable with Maxtor HD"

Products
tupleId  prodId  manufacturer  model
p1       p121    "Maxtor"      "D540X"
p2       p131    "IBM"         "Netvista"
p3       p141    "Tripplite"   "Smart 700VA"

Customers
tupleId  custId  name          occupation
u1       c3232   "John Smith"  "Software Engineer"
u2       c3131   "Jack Lucas"  "Architect"
u3       c3143   "John Mayer"  "Student"
Example – Keyword Query
[Maxtor Netvista]
(Query posed over the Complaints database shown above.)
Keyword Query Semantics
(definition of “document” in databases)
Keywords are:
• in same tuple
• in same relation
• in tuples connected through primary-foreign key
relationships
Score of result:
• distance of keywords within a tuple
• distance between keywords in terms of primary-foreign key connections
• IR-style score of result tree
Example – Keyword Query
[Maxtor Netvista]
(Same Complaints database as above.)
Results: (1) c3, (2) p2 ⋈ c3, (3) p1 ⋈ c1
Result of Keyword Query
Result is a tree T of tuples where:
• each edge corresponds to a primary-foreign key relationship
• no tuple of T is redundant (minimality)
• "AND" query semantics: every query keyword appears in T
• "OR" query semantics: some query keywords may be missing from T
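To make these conditions concrete, here is a minimal sketch (not the authors' code) that checks a candidate tree under either semantics. It interprets "redundant" as a leaf tuple that contributes no query keyword, mirroring the later requirement that every leaf of a candidate network be a non-free tuple set; the tree and its primary-foreign key edges are assumed given.

def keywords_in(text, keywords):
    """Query keywords (lowercased) that occur in a tuple's text."""
    words = set(text.lower().split())
    return {k.lower() for k in keywords if k.lower() in words}

def leaves(tuples, edges):
    """Tuples touching at most one edge (a single tuple is its own leaf)."""
    return [t for t in tuples
            if sum(1 for a, b in edges if t in (a, b)) <= 1]

def is_result(tuples, edges, keywords, semantics="OR"):
    # Minimality: every leaf tuple must contribute at least one keyword,
    # otherwise it (and the edge leading to it) is redundant.
    if any(not keywords_in(tuples[t], keywords) for t in leaves(tuples, edges)):
        return False
    if semantics == "AND":
        covered = set()
        for text in tuples.values():
            covered |= keywords_in(text, keywords)
        return covered == {k.lower() for k in keywords}
    return True   # OR semantics: some keywords may be missing

# The tree p2 -- c3 from the example database qualifies under both semantics:
tuples = {"p2": "IBM Netvista",
          "c3": "IBM Netvista unstable with Maxtor HD"}
edges = [("p2", "c3")]                       # joined on prodId = p131
print(is_result(tuples, edges, ["Maxtor", "Netvista"], "AND"))  # True
print(is_result(tuples, edges, ["Maxtor", "Netvista"], "OR"))   # True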
Score of Result T
• Combining function Score combines
scores of attribute values of T
• One reasonable choice:
Score(T) = Σa∈T Score(a) / size(T)
• Attribute value scores Score(a)
calculated using the DBMS's IR
“datablades”
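As a small worked sketch under the slide's assumptions, the combining function can be written directly; the per-attribute IR scores would come from the DBMS's text-search engine and are simply passed in here.

def score_tree(attribute_scores, size):
    """attribute_scores: IR scores Score(a) of the attribute values in tree T.
    size: number of tuples in T, i.e. size(T)."""
    return sum(attribute_scores) / size

# Reproducing the example: Score(p2 ⋈ c3) = (1 + 4/3) / 2 = 7/6
print(score_tree([1.0, 4/3], size=2))   # 1.1666...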
Shortcomings of Prior Work
• Simplistic ranking methods (e.g., based
only on size of connecting tree), ignoring
well-studied IR ranking strategies
• No straightforward extension to improve
efficiency by returning just top-k results
• Poor handling of free-text attributes
[DBXplorer, DISCOVER]
Example – Keyword Query
[Maxtor Netvista]

Complaints
tupleId  prodId  custId  date       comments                                                                    score
c1       p121    c3232   6-30-2002  "disk crashed after just one week of moderate use on an IBM Netvista X41"   1/3
c2       p131    c3131   7-3-2002   "lower-end IBM Netvista caught fire, starting apparently with disk"         1/3
c3       p131    c3143   8-3-2002   "IBM Netvista unstable with Maxtor HD"                                      4/3

Products
tupleId  prodId  manufacturer  model          score
p1       p121    "Maxtor"      "D540X"        1
p2       p131    "IBM"         "Netvista"     1
p3       p141    "Tripplite"   "Smart 700VA"  0

Customers
tupleId  custId  name          occupation
u1       c3232   "John Smith"  "Software Engineer"
u2       c3131   "Jack Lucas"  "Architect"
u3       c3143   "John Mayer"  "Student"

Score(c3) = 4/3
Score(p2 ⋈ c3) = (1 + 4/3)/2 = 7/6
Score(p1 ⋈ c1) = (1 + 1/3)/2 = 4/6

Results: (1) c3, (2) p2 ⋈ c3, (3) p1 ⋈ c1
Architecture

User submits keywords: [Maxtor Netvista]

IR Engine (uses IR Index) → non-free tuple sets:
ComplaintsQ = [(c3, comments, 1.33), (c1, comments, 0.33), (c2, comments, 0.33)]
ProductsQ = [(p1, manufacturer, 1), (p2, model, 1)]

Candidate Network Generator (uses database schema) → candidate networks:
ComplaintsQ
ProductsQ
ComplaintsQ ⋈ ProductsQ
ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ
ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ
...

Execution Engine (uses database; issues parameterized, prepared SQL queries):
SELECT * FROM ComplaintsQ c, ProductsQ p
WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?;
...

→ Top-k joining trees of tuples:
c3
p2 ⋈ c3
p1 ⋈ c1
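To make the last step concrete, here is a toy, self-contained sketch that uses sqlite3 in place of the commercial RDBMS and probes the ComplaintsQ ⋈ ProductsQ candidate network with a variant of the parameterized query above, extended to also return the combined score. The table contents follow the tuple sets in the figure; everything else is illustrative, not the paper's implementation.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ComplaintsQ (tupleId TEXT, prodId TEXT, custId TEXT, score REAL)")
cur.execute("CREATE TABLE ProductsQ (tupleId TEXT, prodId TEXT, score REAL)")
cur.executemany("INSERT INTO ComplaintsQ VALUES (?,?,?,?)",
                [("c3", "p131", "c3143", 1.33),
                 ("c1", "p121", "c3232", 0.33),
                 ("c2", "p131", "c3131", 0.33)])
cur.executemany("INSERT INTO ProductsQ VALUES (?,?,?)",
                [("p1", "p121", 1.0), ("p2", "p131", 1.0)])

# Prepared once, executed with different bindings as tuples are retrieved:
probe = ("SELECT c.tupleId, p.tupleId, (c.score + p.score) / 2 "
         "FROM ComplaintsQ c, ProductsQ p "
         "WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?")

# Probing with the newly retrieved complaint c3 (prodId p131, custId c3143):
print(cur.execute(probe, ("p131", "c3143")).fetchall())
# [('c3', 'p2', 1.165)]  -> the joining tree p2 ⋈ c3

The same prepared statement is reused with different bindings each time a new tuple is retrieved from ComplaintsQ.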
Architecture
(Architecture diagram repeated; see above.)
Candidate Network Generator
• Find all trees of tuple sets (free or non-free)
that may produce a result, based on
DISCOVER's CN generator [VLDB 2002]
• Use single non-free tuple set for each relation
– allows “OR” semantics
– fewer CNs are generated
– extra filtering step required for “AND”
semantics
Candidate Network Generator
Example
For query [Maxtor Netvista], CNs:
• ComplaintsQ
• ProductsQ
• ComplaintsQ ⋈ ProductsQ
• ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ
• ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ
Non-CNs:
• ComplaintsQ ⋈ Customers{} ⋈ Complaints{}
• ProductsQ ⋈ Complaints{} ⋈ ProductsQ
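The slides do not spell out the generation procedure itself; below is a simplified sketch in the spirit of DISCOVER's CN generator, restricted to path-shaped networks, which suffices for this star-shaped schema. It assumes free tuple sets R{} hold the tuples of R without keyword matches, and it prunes patterns X ⋈ S ⋈ X in which S references X, since both X tuples would have to be the same tuple (the reason the last non-CN above is rejected).

# Schema: Complaints.prodId -> Products, Complaints.custId -> Customers
FK = {("Complaints", "Products"), ("Complaints", "Customers")}  # child -> parent
NONFREE = ["Complaints", "Products"]   # relations with matches for [Maxtor Netvista]
ALL = ["Complaints", "Products", "Customers"]

def rel(ts):                           # "ComplaintsQ" / "Products{}" -> relation name
    return ts[:-1] if ts.endswith("Q") else ts[:-2]

def joinable(a, b):                    # adjacent in the schema graph?
    return (rel(a), rel(b)) in FK or (rel(b), rel(a)) in FK

def pruned(path):
    """Prune X - S - X where S references X: both X tuples would coincide."""
    return any(rel(x) == rel(y) and (rel(s), rel(x)) in FK
               for x, s, y in zip(path, path[1:], path[2:]))

def candidate_networks(max_size):
    tuple_sets = [r + "Q" for r in NONFREE] + [r + "{}" for r in ALL]
    cns, seen = [], set()
    frontier = [[r + "Q"] for r in NONFREE]        # grow paths from non-free sets
    while frontier:
        path = frontier.pop()
        if path[-1].endswith("Q"):                 # every leaf must be non-free
            key = min(tuple(path), tuple(reversed(path)))   # a path equals its reverse
            if key not in seen:
                seen.add(key)
                cns.append(" ⋈ ".join(key))
        if len(path) < max_size:
            for ts in tuple_sets:
                if joinable(path[-1], ts) and not pruned(path + [ts]):
                    frontier.append(path + [ts])
    return cns

print(candidate_networks(3))
# Includes, in some order: ComplaintsQ, ProductsQ, ComplaintsQ ⋈ ProductsQ,
# ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ, ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ, ...
# and excludes both non-CNs listed above.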
Architecture
(Architecture diagram repeated; see above.)
Execution Algorithms
• Users usually want only the top-k results.
• Hence, submitting to the DBMS one SQL query per CN (Naïve algorithm) is inefficient.
• When queries produce at most very few results, the Naïve algorithm is efficient, since it fully exploits the DBMS.
• Monotonic combining functions: if results T, T' have the same schema and Score(ai) ≤ Score(a'i) for every attribute ai, then Score(T) ≤ Score(T').
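For contrast with the pipelined algorithms that follow, here is a bird's-eye sketch of the Naïve strategy; run_sql is a hypothetical stand-in for submitting one CN's SQL statement to the DBMS.

import heapq

def naive_top_k(candidate_networks, run_sql, k):
    """run_sql(cn): hypothetical helper that fully evaluates one CN on the DBMS
    and returns its (joining tree, score) pairs."""
    results = []
    for cn in candidate_networks:
        results.extend(run_sql(cn))      # one SQL query per CN, evaluated fully
    return heapq.nlargest(k, results, key=lambda r: r[1])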
Sparse Algorithm: Example Execution

CN                         results    score          MFS
ProductsQ                  p1         9              9
ComplaintsQ                c2         7              7
ComplaintsQ ⋈ ProductsQ    c1 ⋈ p1    (9+5)/2 = 7    (9+7)/2 = 8

ComplaintsQ          ProductsQ
tupleId  score       tupleId  score
c2       7           p1       9
c1       5           p2       6
c3       1           p3       1

• Best when query produces at most a few results
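A minimal sketch of the Sparse strategy illustrated above (not the paper's exact pseudocode): candidate networks are evaluated in descending order of their maximum future score (MFS), and evaluation stops once the current k-th best score is at least the MFS of every unevaluated CN. evaluate is a stand-in for sending the CN's SQL query to the DBMS.

import heapq

def mfs(tuple_sets):
    """Upper bound for a CN: assume every tuple set contributes its top score."""
    return sum(max(scores) for scores in tuple_sets) / len(tuple_sets)

def sparse_top_k(cns, evaluate, k):
    """cns: list of (name, [score lists of the CN's tuple sets]).
    evaluate(name): stand-in for executing the CN's SQL query on the DBMS;
    returns (result, score) pairs."""
    ordered = sorted(cns, key=lambda cn: mfs(cn[1]), reverse=True)
    top = []                                    # min-heap of the best k (score, result)
    for name, tuple_sets in ordered:
        if len(top) == k and top[0][0] >= mfs(tuple_sets):
            break                               # no unevaluated CN can enter the top-k
        for result, score in evaluate(name):
            heapq.heappush(top, (score, result))
            if len(top) > k:
                heapq.heappop(top)
    return sorted(top, reverse=True)

# The example above, with hypothetical per-CN result lists:
cns = [("ProductsQ", [[9, 6, 1]]),
       ("ComplaintsQ", [[7, 5, 1]]),
       ("ComplaintsQ ⋈ ProductsQ", [[7, 5, 1], [9, 6, 1]])]
results = {"ProductsQ": [("p1", 9)],
           "ComplaintsQ": [("c2", 7)],
           "ComplaintsQ ⋈ ProductsQ": [("c1 ⋈ p1", 7)]}
print(sparse_top_k(cns, lambda name: results[name], k=1))
# [(9, 'p1')] -- the remaining CNs (MFS 8 and 7) are never executed.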
Single Pipelined Algorithm: Example Execution

CN: ComplaintsQ ⋈ ProductsQ

ComplaintsQ          ProductsQ
tupleId  score       tupleId  score
c2       7           p1       9
c1       5           p2       6
c3       1           p3       1

Get next tuple from the most promising non-free tuple set.

Successive bounds as tuples are retrieved:
MPFS = Max[(5+9)/2, (7+6)/2] = 7
MPFS = Max[(1+9)/2, (7+6)/2] = 6.5
MPFS = Max[(1+9)/2, (7+1)/2] = 5

Results queue
result     score
p1 → c1    7
p2 → c2    6.5

Output: p1 → c1, p2 → c2
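A compact sketch of the single-pipelined idea for one CN over two non-free tuple sets, as in the example; joins is a hypothetical predicate standing in for the parameterized SQL probe of the primary-foreign key join.

import heapq

def single_pipelined(ts_a, ts_b, joins, k):
    """ts_a, ts_b: (tupleId, score) lists, sorted by descending score.
    joins(a, b): stand-in for the parameterized SQL probe of the join."""
    sets, seen = [ts_a, ts_b], [0, 0]
    queue, out = [], []                   # candidate results (max-heap), emitted results

    def bound(i):                         # best score still reachable via tuple set i
        if seen[i] >= len(sets[i]):
            return float("-inf")
        return (sets[i][seen[i]][1] + sets[1 - i][0][1]) / 2

    while len(out) < k and (seen[0] < len(sets[0]) or seen[1] < len(sets[1])):
        i = 0 if bound(0) >= bound(1) else 1          # most promising tuple set
        t, score = sets[i][seen[i]]
        seen[i] += 1
        for u, uscore in sets[1 - i][:seen[1 - i]]:   # probe the retrieved prefix
            a, b = (t, u) if i == 0 else (u, t)
            if joins(a, b):
                heapq.heappush(queue, (-(score + uscore) / 2, (a, b)))
        mpfs = max(bound(0), bound(1))                # no unseen result can beat this
        while queue and -queue[0][0] >= mpfs and len(out) < k:
            s, r = heapq.heappop(queue)
            out.append((r, -s))
    while queue and len(out) < k:                     # tuple sets exhausted
        s, r = heapq.heappop(queue)
        out.append((r, -s))
    return out

# The example above: c1 joins p1 and c2 joins p2 (a hypothetical join relation).
complaints = [("c2", 7), ("c1", 5), ("c3", 1)]
products = [("p1", 9), ("p2", 6), ("p3", 1)]
pairs = {("c1", "p1"), ("c2", "p2")}
print(single_pipelined(complaints, products, lambda c, p: (c, p) in pairs, k=2))
# [(('c1', 'p1'), 7.0), (('c2', 'p2'), 6.5)]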
Global Pipelined Algorithm: Example Execution

Queue of CN processes, ordered by ascending MPFS: C4, C5, C1, C3, C2

Processing unit for CN process C3 (MPFS3 = 3.5):

Complaints           Products
tupleId  score       tupleId  score
c2       7           p1       9
c1       5           p2       6
c3       1           p3       1

global MPFS = max(MPFSi) over all CNs Ci

Results queue (output)
result     score
p1 → c1    7
p2 → c2    6.5

• Best when query produces many results.
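A sketch of the global scheduling loop described above, assuming each CN process object exposes two hypothetical methods: mpfs(), the maximum possible future score of that CN (returning -inf once the process is exhausted), and advance(), which performs one single-pipelined step for that CN and returns any newly produced (result, score) pairs.

import heapq

def global_pipelined(cn_processes, k):
    queue, out = [], []                        # candidate results; emitted top-k
    while len(out) < k and any(p.mpfs() > float("-inf") for p in cn_processes):
        best = max(cn_processes, key=lambda p: p.mpfs())   # most promising CN
        for result, score in best.advance():
            heapq.heappush(queue, (-score, result))
        global_mpfs = max(p.mpfs() for p in cn_processes)
        while queue and -queue[0][0] >= global_mpfs and len(out) < k:
            s, r = heapq.heappop(queue)
            out.append((r, -s))
    while queue and len(out) < k:              # all CN processes exhausted
        s, r = heapq.heappop(queue)
        out.append((r, -s))
    return out

A queued result can be emitted as soon as its score reaches the global MPFS, because no partially evaluated CN can still produce anything better.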
Hybrid Algorithm
• Estimate number of results.
– For “OR”-semantics, use DBMS estimator
– For “AND”-semantics, probabilistically
adjust DBMS estimator.
• If at most a few query results expected,
then use Sparse Algorithm.
• If many query results expected, then
use Global Pipelined Algorithm.
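The decision rule can be summarized in a few lines; the threshold constant c below is illustrative, not a value from the paper.

def hybrid_top_k(estimated_results, k, sparse, global_pipelined, c=5):
    """Choose an execution strategy from the DBMS's (adjusted) result estimate.
    sparse and global_pipelined are callables wrapping the two algorithms."""
    if estimated_results <= c * k:
        return sparse()             # few results expected: Sparse is best
    return global_pipelined()       # many results expected: Global Pipelined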
Related Work
• DBXplorer [ICDE 2002], DISCOVER [VLDB 2002]
– Similar three-step architecture
– Score = 1/size(T)
– Only AND semantics
– No straightforward extension for efficient top-k execution
• BANKS [ICDE 2002], Goldman et al. [VLDB 1998]
– Database viewed as graph
– No use of schema
• Florescu et al. [WWW 2000], XQuery Full-Text
• Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001]
– Top-k algorithms for join queries
Experiments – DBLP Dataset

Schema:
C(cid, name)         C: Conference
Y(yid, year, cid)    Y: Year
P(pid, title, yid)   P: Paper
A(aid, name)         A: Author
PP(pid1, pid2)
PA(pid, aid)

• DBLP contains few citation edges; synthetic citation edges were added such that the average number of citations is 20.
• Final dataset is 56 MB.
• Experiments run over a state-of-the-art commercial RDBMS.
OR Semantics: Effect of Maximum Allowed CN Size
[Chart: average execution time (msec, log scale) vs. maximum CN size (2–7), for Naive, Sparse, SA, SASymmetric, GA, GASymmetric, and Hybrid]
Average execution time of 100 2-keyword top-10 queries.
OR Semantics: Effect of Number of Objects Requested k
[Chart: average execution time (msec, log scale) vs. k (1–20), for Naive, Sparse, SA, SASymmetric, GA, GASymmetric, and Hybrid]
Average execution time of 100 2-keyword queries with maximum candidate-network size of 6.
OR Semantics: Effect of Number of Query Keywords
[Chart: average execution time (msec, log scale) vs. number of keywords (2–5), for Naive, Sparse, GA, GASymmetric, and Hybrid]
Average execution time of 100 top-10 queries with maximum candidate-network size of 6.
Conclusions
• Extend IR-style ranking to databases.
• Exploit text-search capabilities of modern
DBMSs, to generate results of higher quality.
• Support both “AND” and “OR” semantics.
• Achieve substantial speedup over prior work
via pipelined top-k query processing
algorithms.
Questions?
Compare Algorithms w.r.t. Result Size
[Two charts: execution time (msec, log scale) vs. total number of query results, comparing GA and Sparse; one panel for OR semantics, one for AND semantics]
Max CN size = 6, top-10 results, 2 keywords.
Ranking Functions
• The proposed algorithms support tuple-monotone combining functions.
• That is, if results T, T' have the same schema and Score(ai) ≤ Score(a'i) for every attribute ai, then Score(T) ≤ Score(T').