ICDE07 Talk

Efficient Keyword Search across
Heterogeneous Relational Databases
Mayssam Sayyadian, AnHai Doan
University of Wisconsin - Madison
Hieu LeKhac
University of Illinois - Urbana
Luis Gravano
Columbia University
Key Message of Paper

Precise data integration is expensive
→ But we can do IR-style data integration
very cheaply, with no manual cost!
– just apply automatic schema/data matching
– then do keyword search across the databases
– no need to verify anything manually

Already very useful
Build upon keyword search over a single database ...
Keyword Search over
a Single Relational Database

A growing field, numerous current works
– DBXplorer [ICDE02], BANKS [ICDE02]
– DISCOVER [VLDB02]
– Efficient IR-style keyword search in databases [VLDB03],
– VLDB-05, SIGMOD-06, etc.

Many related works over XML / other types of data
– XKeyword [ICDE03], XRank [Sigmod03]
– TeXQuery [WWW04]
– ObjectRank [Sigmod06]
– TopX [VLDB05], etc.

More are coming at SIGMOD-07 ...
A Typical Scenario
Customers
tid  custid  name   contact        addr
t1   c124    Cisco  Michael Jones  …
t2   c533    IBM    David Long     …
t3   c333    MSR    David Ross     …

Complaints
tid  id    emp-name       comments
u1   c124  Michael Smith  Repair didn’t work
u2   c124  John           Deferred work to John Smith

Foreign-Key Join: Customers.custid = Complaints.id

Q = [Michael Smith Cisco]
Ranked list of answers
score=.8
  t1 c124 Cisco Michael Jones …
  u1 c124 Michael Smith Repair didn’t work
score=.7
  t1 c124 Cisco Michael Jones …
  u2 c124 John Deferred work to John Smith
Our Proposal:
Keyword Search across Multiple Databases
Complaints
tid  id    emp-name       comments
u1   c124  Michael Smith  Repair didn’t work
u2   c124  John           Deferred work to John Smith

Employees
tid  empid  name
v1   e23    Mike D. Smith
v2   e14    John Brown
v3   e37    Jack Lucas

Customers
tid  custid  name   contact        addr
t1   c124    Cisco  Michael Jones  …
t2   c533    IBM    David Long     …
t3   c333    MSR    Joan Brown     …

Groups
tid  eid  reports-to
x1   e23  e37
x2   e14  e37

Query: [Cisco Jack Lucas]
Answer, joined across databases:
t1 c124 Cisco Michael Jones …
u1 c124 Michael Smith Repair didn’t work
v1 e23 Mike D. Smith
x1 e23 e37
v3 e37 Jack Lucas
→ IR-style data integration
A Naive Solution
1. Manually identify FK joins across DBs
2. Manually identify matching data instances across DBs
3. Now treat the combination of DBs as a single DB
→ apply current keyword search techniques
Just like in traditional data integration,
this is too much manual work
Kite Solution

Automatically find FK joins / matching data instances
across databases
→ no manual work is required from user
[Figure: the Complaints, Employees, Customers, and Groups tables from “Our Proposal”]
Automatically Find FK Joins
across Databases
[Figure: the Complaints and Employees tables from “Our Proposal”]

Current solutions analyze data values (e.g., Bellman)
→ Limited accuracy
– e.g., “waterfront” with values yes/no
“electricity” with values yes/no

Our solution: data analysis + schema matching
– improve accuracy drastically (by as much as 50% F-1)
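
The waterfront/electricity case can be made concrete with a minimal sketch. It is not Bellman's algorithm and not Kite's join finder: `containment`, `name_similarity`, `join_confidence`, and the 0.5 weight are hypothetical names and choices for illustration, and string similarity of attribute names is only a crude stand-in for a real schema matcher.

```python
from difflib import SequenceMatcher


def containment(fk_values, key_values):
    """Fraction of distinct values in the candidate FK column found in the key column."""
    fk, key = set(fk_values), set(key_values)
    return len(fk & key) / len(fk) if fk else 0.0


def name_similarity(name_a, name_b):
    """Crude stand-in for a schema matcher: string similarity of attribute names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def join_confidence(fk_name, fk_values, key_name, key_values, weight=0.5):
    """Hypothetical blend of data evidence and schema evidence."""
    data = containment(fk_values, key_values)
    schema = name_similarity(fk_name, key_name)
    return weight * data + (1 - weight) * schema


# Two yes/no attributes that should NOT join, and one genuine FK pair.
waterfront = ["yes", "no", "no", "yes"]
electricity = ["yes", "yes", "no", "yes"]
complaint_id = ["c124", "c124"]
custid = ["c124", "c533", "c333"]

data_only_bad = containment(waterfront, electricity)
data_only_good = containment(complaint_id, custid)
combined_bad = join_confidence("waterfront", waterfront, "electricity", electricity)
combined_good = join_confidence("id", complaint_id, "custid", custid)
```

On value overlap alone both pairs look like perfect joins, since each gets containment 1.0. Once name evidence is blended in, the genuine id/custid pair scores higher than the waterfront/electricity pair.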
Automatic join/data matching can be wrong
→ incorporate confidence scores into answer scores
Incorporate Confidence Scores
into Answer Scores

Recall: answer example in single-DB settings
t1 c124 Cisco Michael Jones …
u1 c124 Michael Smith Repair didn’t work
score=.8

Recall: answer example in multiple-DB settings
t1 c124 Cisco Michael Jones …
u1 c124 Michael Smith Repair didn’t work
v1 e23 Mike D. Smith
x1 e23 e37
v3 e37 Jack Lucas
score 0.7 for data matching
score 0.9 for FK join

$$\mathrm{score}(A, Q) = \frac{\alpha \cdot \mathrm{score}_{kw}(A, Q) + \beta \cdot \mathrm{score}_{join}(A, Q) + \gamma \cdot \mathrm{score}_{data}(A, Q)}{\mathrm{size}(A)}$$
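
As a small worked instance of the scoring formula, the sketch below combines the three component scores and normalizes by answer size. The function name `answer_score` and the default weights of 1.0 are assumptions, as is the keyword score of 0.8 used in the example; the slide gives only the 0.9 join and 0.7 data-matching confidences.

```python
def answer_score(score_kw, score_join, score_data, size,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of keyword, join, and data-matching scores, divided by answer size.

    alpha, beta, gamma default to 1.0 here purely for illustration.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return (alpha * score_kw + beta * score_join + gamma * score_data) / size


# Five-tuple answer from the multiple-DB example; score_kw=0.8 is a made-up value.
example = answer_score(score_kw=0.8, score_join=0.9, score_data=0.7, size=5)
```

Dividing by size(A) penalizes long join chains, so a short answer with the same component scores ranks higher.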
Summary of Trade-Offs

SQL queries
Precise data integration
– the holy grail

IR-style data integration, naive way
– manually identify FK joins, matching data
– still too expensive

IR-style data integration, using Kite
– automatic FK join finding / data matching
– cheap
– only approximates the “ideal” ranked list found by naive
Kite Architecture
[Figure: Kite architecture]
Offline preprocessing (over databases D1 … Dn)
– Index Builder: IR index1 … IR indexn
– Foreign-Key Join Finder: Data-based Join Finder + Schema Matcher
  → Foreign key joins
Online querying (e.g., Q = [ Smith Cisco ])
– Condensed CN Generator
– Top-k Searcher, with refinement rules: Full, Partial, Deep
– Data instance matcher
– Distributed SQL queries over D1 … Dn
Online Querying
[Figure: Database 1 (Relation 1, Relation 2) and Database 2 (Relation 1, Relation 2)]
What current solutions do:
1. Create answer templates
2. Materialize answer templates to obtain answers
Create Answer Templates

Find tuples that contain query keywords
– Use DB’s IR index
– example:
Q = [Smith Cisco]
Tuple sets:
Service-DB: ComplaintsQ={u1, u2}  CustomersQ={v1}
HR-DB:      EmployeesQ={t1}  GroupsQ={}

Create tuple-set graph
[Figure: Service-DB with Complaints (u1, u2) and Customers (v1, v2, v3); HR-DB with Groups (x1, x2) and Employees (t1, t2, t3)]
Schema graph: Customers --J1-- Complaints --J4-- Emps --J2, J3-- Groups
[Figure: tuple-set graph over Customers{}, CustomersQ, Complaints{}, ComplaintsQ, Emps{}, EmpsQ, and Groups{}, with edges labeled J1 to J4]
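
The tuple-set step can be sketched as follows. This is a toy illustration only: a linear scan stands in for each database's IR index, and `DATABASES`, `tokens`, and `tuple_sets` are hypothetical names. The rows and tuple ids are taken from the tables on the “Our Proposal” slide, so the ids differ from the ones printed on this slide.

```python
import re

DATABASES = {
    "Service-DB": {
        "Customers": {
            "t1": ("c124", "Cisco", "Michael Jones"),
            "t2": ("c533", "IBM", "David Long"),
            "t3": ("c333", "MSR", "Joan Brown"),
        },
        "Complaints": {
            "u1": ("c124", "Michael Smith", "Repair didn't work"),
            "u2": ("c124", "John", "Deferred work to John Smith"),
        },
    },
    "HR-DB": {
        "Employees": {
            "v1": ("e23", "Mike D. Smith"),
            "v2": ("e14", "John Brown"),
            "v3": ("e37", "Jack Lucas"),
        },
        "Groups": {
            "x1": ("e23", "e37"),
            "x2": ("e14", "e37"),
        },
    },
}


def tokens(row):
    """Lower-cased word tokens of all attribute values in a row."""
    return set(re.findall(r"[a-z0-9']+", " ".join(row).lower()))


def tuple_sets(databases, query):
    """Map (database, relation) to the ids of tuples containing any query keyword."""
    keywords = {k.lower() for k in query}
    result = {}
    for db_name, relations in databases.items():
        for rel_name, rows in relations.items():
            result[(db_name, rel_name)] = {
                tid for tid, row in rows.items() if tokens(row) & keywords
            }
    return result


sets = tuple_sets(DATABASES, ["Smith", "Cisco"])
```

For Q = [Smith Cisco] this yields two Complaints tuples, one Customers tuple, one Employees tuple, and an empty set for Groups, the same shape as the tuple sets listed above.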
Create Answer Templates (cont.)

Search tuple-set graph to generate answer templates
– also called Candidate Networks (CNs)

Each answer template =
one way to join tuples to form an answer
sample tuple set graph
[Figure: tuple-set graph over Customers{}, CustomersQ, Complaints{}, ComplaintsQ, Emps{}, EmpsQ, and Groups{}, with edges labeled J1 to J4]

sample CNs
CN1: CustomersQ
CN2: CustomersQ --J1-- Complaints{Q}
CN3: EmpsQ --J2-- Groups{} --J2-- Emps{} --J4-- Complaints{Q}
CN4: EmpsQ --J2-- Groups{} --J3-- Emps{} --J4-- Complaints{Q}
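
A brute-force way to search a tuple-set graph is sketched below. It is deliberately simplified and is not the CN generator of DISCOVER or Kite: it enumerates only path-shaped templates, applies no pruning rules beyond requiring keyword tuple sets at both ends, and ignores tree-shaped templates. `SCHEMA_EDGES`, `NON_EMPTY_Q`, `neighbours`, and `candidate_paths` are hypothetical names, and the edge list is an assumed reading of the schema graph figure.

```python
from collections import deque

# Assumed schema graph: (relation, relation, join label).
SCHEMA_EDGES = [
    ("Customers", "Complaints", "J1"),
    ("Complaints", "Emps", "J4"),
    ("Emps", "Groups", "J2"),
    ("Emps", "Groups", "J3"),
]
# Relations whose keyword tuple set is non-empty (GroupsQ is empty in the example).
NON_EMPTY_Q = {"Customers", "Complaints", "Emps"}


def neighbours(relation):
    """Yield (adjacent relation, join label) pairs in the schema graph."""
    for left, right, join in SCHEMA_EDGES:
        if left == relation:
            yield right, join
        if right == relation:
            yield left, join


def candidate_paths(max_size):
    """Enumerate path-shaped templates with at most max_size tuple sets.

    A path alternates nodes and join labels: (node, join, node, ...), where a
    node is (relation, "Q") for a keyword tuple set or (relation, "{}") for a
    free tuple set. Both endpoints must be keyword tuple sets.
    """
    results = set()
    queue = deque(((rel, "Q"),) for rel in sorted(NON_EMPTY_Q))
    while queue:
        path = queue.popleft()
        nodes = path[0::2]
        if nodes[-1][1] == "Q":
            # A path and its reverse describe the same template: keep one form.
            results.add(min(path, path[::-1]))
        if len(nodes) < max_size:
            for rel, join in neighbours(nodes[-1][0]):
                kinds = ["Q", "{}"] if rel in NON_EMPTY_Q else ["{}"]
                for kind in kinds:
                    queue.append(path + (join, (rel, kind)))
    return results


small = candidate_paths(2)
larger = candidate_paths(4)
```

Even on this four-relation schema the number of templates grows quickly with the size bound.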
Materialize Answer Templates
to Generate Answers

By generating and executing a SQL query
CN: CustomersQ --J1-- ComplaintsQ
(CustomersQ = {v1}, ComplaintsQ = {u1, u2})
SQL:
SELECT * FROM Customers C, Complaints P
WHERE C.custid = P.id AND
(C.tid = 'v1') AND
(P.tid = 'u1' OR P.tid = 'u2')
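
For a single database, materializing this template can be sketched with Python's standard `sqlite3` module. This is a minimal sketch under stated assumptions: it uses an in-memory database, a hypothetical `materialize` helper, and the rows and tuple ids from the earlier table slides (so the Customers tuple is t1 here). It does not show Kite's distributed execution across databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (tid TEXT, custid TEXT, name TEXT, contact TEXT);
CREATE TABLE Complaints (tid TEXT, id TEXT, emp_name TEXT, comments TEXT);
INSERT INTO Customers VALUES
  ('t1', 'c124', 'Cisco', 'Michael Jones'),
  ('t2', 'c533', 'IBM', 'David Long'),
  ('t3', 'c333', 'MSR', 'David Ross');
INSERT INTO Complaints VALUES
  ('u1', 'c124', 'Michael Smith', 'Repair didn''t work'),
  ('u2', 'c124', 'John', 'Deferred work to John Smith');
""")


def materialize(conn, customers_q, complaints_q):
    """Join the two keyword tuple sets on the FK join custid = id."""
    c_marks = ",".join("?" * len(customers_q))
    p_marks = ",".join("?" * len(complaints_q))
    sql = (
        "SELECT C.tid, P.tid FROM Customers C JOIN Complaints P "
        "ON C.custid = P.id "
        f"WHERE C.tid IN ({c_marks}) AND P.tid IN ({p_marks}) "
        "ORDER BY P.tid"
    )
    return conn.execute(sql, [*customers_q, *complaints_q]).fetchall()


answers = materialize(conn, ["t1"], ["u1", "u2"])
```

Restricting both sides to the tuple ids in the keyword tuple sets keeps the join small. Only the `?` placeholders are interpolated into the SQL text; the ids themselves are passed as parameters.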

Naive solution
– materialize all answer templates, score, rank, then return answers

Current solutions
– find only top-k answers
– materialize only certain answer templates
– make decisions using refinement rules + statistics
Challenges for Kite Setting

More databases
→ way too many answer templates to generate
– can take hours on just 3-4 databases

Materializing an answer template takes way too long
– requires SQL query execution across multiple databases
– invoking each database incurs large overhead

Difficult to obtain reliable statistics across databases

See paper for our solutions
Empirical Evaluation
Domains
Domain     # DBs  Avg # tables  Avg # attributes  Avg # approximate FK joins   Avg # tuples  Total
                  per DB        per schema        total  across DBs  per pair  per table     size
DBLP       2      3             3                 11     6           11        500K          400M
Inventory  8      5.8           5.4               890    804         33.6      2K            50M

Sample Inventory Schema (Inventory 1)
AUTHOR, ARTIST, BOOK, CD, WH2BOOK, WH2CD, WAREHOUSE

The DBLP Schema
DBLP 1: AR (aid, biblo), CITE (id1, id2), PU (aid, uid)
DBLP 2: AR (id, title), AU (id, name), CNF (id, name)
Runtime Performance (1)
[Figure: runtime vs. maximum CCN size; time (sec) against max CCN size. Panels: DBLP (2-keyword queries, k=10, 2 databases) and Inventory (2-keyword queries, k=10, 5 databases)]
[Figure: runtime vs. # of databases; time (sec) against # of DBs. Panel: Inventory (maximum CCN size = 4, 2-keyword queries, k=10)]
Algorithms compared:
– Hybrid algorithm adapted to run over multiple databases
– Kite without adaptive rule selection and without rule Deep
– Kite without condensed CNs
– Kite without rule Deep
– Full-fledged Kite algorithm
Runtime Performance (2)
[Figure: runtime vs. # of keywords in the query; time (sec) against |q|. Panels: DBLP (max CCN=6, k=10, 2 databases) and Inventory (max CCN=4, k=10, 5 databases)]
[Figure: runtime vs. # of answers requested; time (sec) against k, for k from 1 to 30. Two panels, one labeled Inventory, captioned “2-keyword queries, max CCN=4, |q|=2, 5 databases” and “2-keyword queries, max CCN=4, 5 databases”]
Query Result Quality
[Figure: Pr@k against k, for k from 1 to 20. Panels: OR-semantic queries and AND-semantic queries]
Pr@k = the fraction of answers that appear in the “ideal” list
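
One plausible reading of this metric is sketched below. The function name `precision_at_k` is hypothetical, and two details are assumptions rather than something the slide states: the “ideal” list is truncated to its own top k, and the count is divided by k.

```python
def precision_at_k(returned, ideal, k):
    """Fraction of the top-k returned answers that also appear in the ideal top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    ideal_top = set(ideal[:k])
    hits = sum(1 for answer in returned[:k] if answer in ideal_top)
    return hits / k


# Made-up answer ids for illustration only.
returned = ["a1", "a2", "a5", "a3"]
ideal = ["a1", "a2", "a3", "a4"]
pr_at_4 = precision_at_k(returned, ideal, 4)
pr_at_2 = precision_at_k(returned, ideal, 2)
```

Here three of the four returned answers are in the ideal top four, and both of the top two are in the ideal top two.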
Summary

Kite executes IR-style data integration
– performs some automatic preprocessing
– then immediately allows keyword querying

Relatively painless
– no manual work!
– no need to create global schema, to understand SQL

Can be very useful in many settings:
e.g., on-the-fly, best-effort, for non-technical people
– enterprises, on the Web, need only a few answers
– emergency (e.g., hospital + police), need answers quickly
Future Directions

Incorporate user feedback
→ interactive IR-style data integration

More efficient query processing
– large # of databases, network latency

Extend to other types of data
– XML, ontologies, extracted data, Web data
IR-style data integration
is feasible and useful
extends current works on keyword search over DB
raises many opportunities for future work
BACKUP
Other Experiments

Schema matching helps improve join discovery algorithm drastically
[Figure: Join Discovery Accuracy; accuracy (F1) for Inventory 1 through Inventory 5. Series: Join Discovery, Join Discovery + Schema Matching]

Kite also improves single-database keyword search algorithm mHybrid
[Figure: Kite over single database; time (sec) against max CCN size]