Efficient Keyword Search across Heterogeneous Relational Databases
Mayssam Sayyadian, AnHai Doan (University of Wisconsin - Madison)
Hieu LeKhac (University of Illinois - Urbana)
Luis Gravano (Columbia University)

Key Message of Paper
– Precise data integration is expensive
– But we can do IR-style data integration very cheaply, with no manual cost:
  just apply automatic schema/data matching, then do keyword search across
  the databases; there is no need to verify anything manually
– Already very useful
– Builds upon keyword search over a single database

Keyword Search over a Single Relational Database
A growing field with numerous current works:
– DBXplorer [ICDE02], BANKS [ICDE02]
– DISCOVER [VLDB02]
– Efficient IR-style keyword search in databases [VLDB03]; VLDB-05, SIGMOD-06, etc.
Many related works over XML / other types of data:
– XKeyword [ICDE03], XRank [SIGMOD03]
– TeXQuery [WWW04]
– ObjectRank [SIGMOD06]
– TopX [VLDB05], etc.
More are coming at SIGMOD-07.

A Typical Scenario

  Customers                                   Complaints
  tid  custid  name   contact         ...    tid  id    emp-name       comments
  t1   c124    Cisco  Michael Jones   ...    u1   c124  Michael Smith  Repair didn't work
  t2   c533    IBM    David Long      ...    u2   c124  John           Deferred work to John Smith
  t3   c333    MSR    David Ross      ...

A foreign-key join connects Customers.custid and Complaints.id.
Q = [Michael Smith Cisco] returns a ranked list of answers:
– t1 joined with u1 ("Repair didn't work"), score = .8
– t1 joined with u2 ("Deferred work to John Smith"), score = .7

Our Proposal: Keyword Search across Multiple Databases

Service-DB:
  Customers                                   Complaints
  tid  custid  name   contact         ...    tid  id    emp-name       comments
  t1   c124    Cisco  Michael Jones   ...    u1   c124  Michael Smith  Repair didn't work
  t2   c533    IBM    David Long      ...    u2   c124  John           Deferred work to John Smith
  t3   c333    MSR    Joan Brown      ...

HR-DB:
  Employees                       Groups
  tid  empid  name                tid  eid  reports-to
  v1   e23    Mike D. Smith       x1   e23  e37
  v2   e14    John Brown          x2   e14  e37
  v3   e37    Jack Lucas

Query: [Cisco Jack Lucas]
Answer: t1 joined with u1, matched across databases to v1, then joined
through x1 to v3 (Jack Lucas). This is IR-style data integration.

A Naive Solution
1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Now treat the combination of DBs as a single DB and apply current
   keyword search techniques
Just like in traditional data integration, this is too much manual work.

Kite Solution
Automatically find FK joins and matching data instances across databases
(over the same Service-DB and HR-DB tables as above); no manual work is
required from the user.

Automatically Find FK Joins across Databases
– Current solutions analyze data values (e.g., Bellman)
– Limited accuracy: e.g., an attribute "waterfront" with values yes/no and
  an attribute "electricity" with values yes/no look joinable on data alone
– Our solution: data analysis + schema matching, which improves accuracy
  drastically (by as much as 50% F-1)
– Automatic join/data matching can be wrong, so we incorporate confidence
  scores into answer scores

Incorporate Confidence Scores into Answer Scores
Recall the answer example in the single-DB setting:
  t1 (Cisco) joined with u1 (Michael Smith, "Repair didn't work"), score = .8
Recall the answer example in the multiple-DB setting:
  t1 joined with u1, data-matched to v1 with score 0.7, then joined through
  x1 to v3 via an FK join with score 0.9

  score(A, Q) = [α·score_kw(A, Q) + β·score_join(A, Q) + γ·score_data(A, Q)] / size(A)

Summary of Trade-Offs
– SQL queries over precise data integration: the holy grail
– IR-style data integration, naive way: manually identify FK joins and
  matching data; still too expensive
– IR-style data integration using Kite: automatic FK join finding and data
  matching; cheap, but only approximates the "ideal" ranked list found by
  the naive way

Kite Architecture
Offline preprocessing over databases D1 ... Dn:
– Schema Matcher and Data-instance Matcher (Deep)
– Foreign-Key Join Finder and Data-based Join Finder, producing the
  foreign-key joins
– Index Builder, producing IR index1 ... IR indexn
Online querying, e.g. for Q = [Smith Cisco]:
– Condensed CN Generator with refinement rules
– Top-k Searcher (Partial and Full strategies), issuing distributed SQL
  queries over D1 ... Dn

Online Querying
What current solutions do, given relations spread over the databases:
1. Create answer templates
2. Materialize answer templates to obtain answers

Create Answer Templates
Find tuples that contain query keywords, using each DB's IR index.
Example, Q = [Smith Cisco]:
– Service-DB: Complaints^Q = {u1, u2}, Customers^Q = {t1}
– HR-DB: Employees^Q = {v1}, Groups^Q = {}
Then create the tuple-set graph from the schema graph (Customers -J1-
Complaints -J4- Emps; Emps -J2/J3- Groups): each relation R contributes a
keyword node R^Q and a free node R^{}, connected by the same joins J1 ... J4.
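The tuple-set step above can be sketched in Python. This is a minimal illustration with a hypothetical in-memory data model and a linear scan standing in for the per-database IR index; it is not Kite's actual code.

```python
# Minimal sketch of tuple-set construction (hypothetical data model,
# not Kite's implementation): for each relation, collect the tuples
# that contain at least one query keyword. A real system would use
# the database's IR index instead of scanning every tuple.

def tuple_set(relation, keywords):
    """Return the ids of tuples containing any query keyword."""
    hits = set()
    for tid, values in relation.items():
        text = " ".join(str(v) for v in values).lower()
        if any(kw.lower() in text for kw in keywords):
            hits.add(tid)
    return hits

# Toy Service-DB fragment from the running example.
complaints = {
    "u1": ("c124", "Michael Smith", "Repair didn't work"),
    "u2": ("c124", "John", "Deferred work to John Smith"),
}
customers = {
    "t1": ("c124", "Cisco", "Michael Jones"),
    "t2": ("c533", "IBM", "David Long"),
}

q = ["Smith", "Cisco"]
print(tuple_set(complaints, q))  # {'u1', 'u2'}
print(tuple_set(customers, q))   # {'t1'}
```

The non-empty tuple sets become the keyword nodes of the tuple-set graph; relations with empty sets still contribute free nodes, since they may be needed as intermediate joins.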
Create Answer Templates (cont.)
Search the tuple-set graph to generate answer templates, also called
Candidate Networks (CNs). Each answer template is one way to join tuples to
form an answer. Sample CNs over the sample tuple-set graph:
– CN1: Customers^Q -J1- Complaints^Q
– CN2: Customers^Q -J1- Complaints^Q -J4- Emps^Q
– CN3: Emps^Q -J2- Groups^{} -J3- Emps^{} -J4- Complaints^Q
– CN4: Emps^Q -J3- Groups^{} -J2- Emps^{} -J4- Complaints^Q
(CN3 and CN4 differ in which Groups foreign key is used.)

Materialize Answer Templates to Generate Answers
Materialize each template by generating and executing a SQL query. For the
CN Customers^Q -J1- Complaints^Q, with Customers^Q = {t1} and
Complaints^Q = {u1, u2}:

  SELECT *
  FROM Customers C, Complaints P
  WHERE C.cust-id = P.id
    AND C.tuple-id = t1
    AND (P.tuple-id = u1 OR P.tuple-id = u2)

– Naive solution: materialize all answer templates, score, rank, then
  return the answers
– Current solutions: find only the top-k answers, materialize only certain
  answer templates, and make these decisions using refinement rules +
  statistics

Challenges for the Kite Setting
– More databases means way too many answer templates to generate; this can
  take hours on just 3-4 databases
– Materializing an answer template takes way too long: it requires SQL query
  execution across multiple databases, and invoking each database incurs a
  large overhead
– It is difficult to obtain reliable statistics across databases
See the paper for our solutions.

Empirical Evaluation: Domains

  Domain     # DBs  Avg # tables  Avg # attributes  Avg # tuples  Approx. FK joins        Total size
                    per DB        per table         per table     across DBs
  DBLP       2      3             3                 500K          11                      400 MB
  Inventory  8      5.8           5.4               2K            890 (33.6 per DB pair)  50 MB

Sample Inventory schema (Inventory 1): AUTHOR, ARTIST, BOOK, CD, WH2BOOK,
WH2CD, WAREHOUSE.
The DBLP schemas:
– DBLP 1: AR(aid, biblo), CITE(id1, id2), PU(aid, uid)
– DBLP 2: AR(id, title), AU(id, name), CNF(id, name)

Runtime Performance (1)
[Charts] Runtime vs. maximum CCN size: DBLP (2-keyword queries, k = 10,
2 databases) and Inventory (2-keyword queries, k = 10, 5 databases).
[Chart] Runtime vs. # of databases: Inventory, maximum CCN size = 4,
2-keyword queries, k = 10. Compared variants: the Hybrid algorithm adapted
to run over multiple databases; Kite without adaptive rule selection and
without rule Deep; Kite without condensed CNs; Kite without rule Deep; and
the full-fledged Kite algorithm.

Runtime Performance (2)
[Charts] Runtime vs. # of keywords in the query: DBLP (max CCN = 6, k = 10,
2 databases) and Inventory (max CCN = 4, k = 10, 5 databases).
[Charts] Runtime vs. # of answers requested (k = 1 to 30): Inventory,
2-keyword queries, max CCN = 4, 5 databases.

Query Result Quality
[Charts] Pr@k for k = 1 to 20, for OR-semantic and AND-semantic queries,
where Pr@k = the fraction of answers that appear in the "ideal" list.

Summary
– Kite executes IR-style data integration: it performs some automatic
  preprocessing, then immediately allows keyword querying
– Relatively painless: no manual work, no need to create a global schema or
  to understand SQL
– Can be very useful in many settings, e.g., on-the-fly, best-effort
  querying for non-technical people: enterprises and the Web, where only a
  few answers are needed; emergencies (e.g., hospital + police), where
  answers are needed quickly

Future Directions
– Incorporate user feedback: interactive IR-style data integration
– More efficient query processing: large numbers of databases, network
  latency
– Extend to other types of data: XML, ontologies, extracted data, Web data
IR-style data integration is feasible and useful; it extends current work
on keyword search over databases and raises many opportunities for future
work.

BACKUP

Other Experiments
– Join discovery accuracy [chart: accuracy (F1) on Inventory 1-5, comparing
  Join Discovery vs. Join Discovery + Schema Matching]: schema matching
  drastically improves the join discovery algorithm
– Kite over a single database [chart: time (sec) vs. max CCN size, Kite vs.
  mHybrid]: Kite also improves the single-database keyword search algorithm
  mHybrid
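The answer-scoring formula from the talk, score(A, Q) = [α·score_kw + β·score_join + γ·score_data] / size(A), can be sketched as follows. The weights and the component scores below are illustrative placeholders, not values from the paper; only the shape of the formula comes from the slides.

```python
# Sketch of Kite's combined answer score: IR-style keyword relevance
# plus confidence in the automatically discovered FK joins and data
# matches, normalized by answer size (number of joined tuples).
# The default weights of 1.0 are placeholders, not values from the paper.

def kite_score(score_kw, score_join, score_data, size,
               alpha=1.0, beta=1.0, gamma=1.0):
    """Combine keyword, join-confidence, and data-match scores."""
    return (alpha * score_kw + beta * score_join + gamma * score_data) / size

# Answer joining 5 tuples, with FK-join confidence 0.9 and data-match
# confidence 0.7 as in the running example (keyword score assumed 0.8).
s = kite_score(score_kw=0.8, score_join=0.9, score_data=0.7, size=5)
print(round(s, 2))  # 0.48
```

Dividing by size(A) penalizes answers that need many joins, so a hit assembled through several uncertain cross-database matches ranks below an equally relevant hit found within one database.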