Efficient IR-Style Keyword Search
over Relational Databases
• Vagelis Hristidis
University of California, San Diego
• Luis Gravano
Columbia University
• Yannis Papakonstantinou
University of California, San Diego
Motivation
• Keyword search is the dominant
information discovery method in
documents
• Increasing amount of data is stored in
databases
• Plain text coexists with structured data
Motivation
• Up until recently, information discovery in
databases required:
– Knowledge of schema
– Knowledge of a query language (e.g., SQL)
– Knowledge of the role of the keywords
• Goal: Enable IR-style keyword search over DBMSs
without the above requirements
IR-Style Search over DBMSs
• IR keyword search well developed for
document search
• Modern DBMSs offer IR-style keyword
search over individual text attributes
• What is the equivalent of a document in a
database?
Example – Complaints Database: Schema
Products(prodId, manufacturer, model)
Complaints(prodId, custId, date, comments)
Customers(custId, name, occupation)
Example – Complaints Database: Data

Complaints
tupleId  prodId  custId  date       comments
c1       p121    c3232   6-30-2002  "disk crashed after just one week of moderate use on an IBM Netvista X41"
c2       p131    c3131   7-3-2002   "lower-end IBM Netvista caught fire, starting apparently with disk"
c3       p131    c3143   8-3-2002   "IBM Netvista unstable with Maxtor HD"

Products
tupleId  prodId  manufacturer  model
p1       p121    "Maxtor"      "D540X"
p2       p131    "IBM"         "Netvista"
p3       p141    "Tripplite"   "Smart 700VA"

Customers
tupleId  custId  name          occupation
u1       c3232   "John Smith"  "Software Engineer"
u2       c3131   "Jack Lucas"  "Architect"
u3       c3143   "John Mayer"  "Student"
Example – Keyword Query
[Maxtor Netvista]
(Query posed over the Complaints database shown above.)
Keyword Query Semantics
(definition of “document” in databases)
Keywords are:
• in same tuple
• in same relation
• in tuples connected through primary-foreign key
relationships
Score of result:
• distance of keywords within a tuple
• distance between keywords in terms of primary-foreign key connections
• IR-style score of result tree
Example – Keyword Query
[Maxtor Netvista]
(Same Complaints database as above.)
Results: (1) c3, (2) p2 ⋈ c3, (3) p1 ⋈ c1
Result of Keyword Query
Result is a tree T of tuples where:
• each edge corresponds to a primary-foreign key relationship
• no tuple of T is redundant (minimality)
• "AND" query semantics: every query keyword appears in T
• "OR" query semantics: some query keywords may be missing from T
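To make these conditions concrete, here is a minimal sketch (not the authors' code) that checks a candidate tree under either semantics. It interprets "redundant" as a leaf tuple that contributes no query keyword, mirroring the later requirement that every leaf of a candidate network be a non-free tuple set; the tree and its primary-foreign key edges are assumed given.

def keywords_in(text, keywords):
    """Query keywords (lowercased) that occur in a tuple's text."""
    words = set(text.lower().split())
    return {k.lower() for k in keywords if k.lower() in words}

def leaves(tuples, edges):
    """Tuples touching at most one edge (a single tuple is its own leaf)."""
    return [t for t in tuples
            if sum(1 for a, b in edges if t in (a, b)) <= 1]

def is_result(tuples, edges, keywords, semantics="OR"):
    # Minimality: every leaf tuple must contribute at least one keyword,
    # otherwise it (and the edge leading to it) is redundant.
    if any(not keywords_in(tuples[t], keywords) for t in leaves(tuples, edges)):
        return False
    if semantics == "AND":
        covered = set()
        for text in tuples.values():
            covered |= keywords_in(text, keywords)
        return covered == {k.lower() for k in keywords}
    return True   # OR semantics: some keywords may be missing

# The tree p2 -- c3 from the example database qualifies under both semantics:
tuples = {"p2": "IBM Netvista",
          "c3": "IBM Netvista unstable with Maxtor HD"}
edges = [("p2", "c3")]                       # joined on prodId = p131
print(is_result(tuples, edges, ["Maxtor", "Netvista"], "AND"))  # True
print(is_result(tuples, edges, ["Maxtor", "Netvista"], "OR"))   # True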
Score of Result T
• Combining function Score combines
scores of attribute values of T
• One reasonable choice:
Score(T) = Σa∈T Score(a) / size(T)
• Attribute value scores Score(a)
calculated using the DBMS's IR
“datablades”
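As a small worked sketch under the slide's assumptions, the combining function can be written directly; the per-attribute IR scores would come from the DBMS's text-search engine and are simply passed in here.

def score_tree(attribute_scores, size):
    """attribute_scores: IR scores Score(a) of the attribute values in tree T.
    size: number of tuples in T, i.e. size(T)."""
    return sum(attribute_scores) / size

# Reproducing the example: Score(p2 ⋈ c3) = (1 + 4/3) / 2 = 7/6
print(score_tree([1.0, 4/3], size=2))   # 1.1666...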
Shortcomings of Prior Work
• Simplistic ranking methods (e.g., based
only on size of connecting tree), ignoring
well-studied IR ranking strategies
• No straightforward extension to improve
efficiency by returning just top-k results
• Poor handling of free-text attributes
[DBXplorer, DISCOVER]
Example – Keyword Query
[Maxtor Netvista]

Complaints
tupleId  prodId  custId  date       comments                                                                    score
c1       p121    c3232   6-30-2002  "disk crashed after just one week of moderate use on an IBM Netvista X41"   1/3
c2       p131    c3131   7-3-2002   "lower-end IBM Netvista caught fire, starting apparently with disk"         1/3
c3       p131    c3143   8-3-2002   "IBM Netvista unstable with Maxtor HD"                                      4/3

Products
tupleId  prodId  manufacturer  model          score
p1       p121    "Maxtor"      "D540X"        1
p2       p131    "IBM"         "Netvista"     1
p3       p141    "Tripplite"   "Smart 700VA"  0

Customers
tupleId  custId  name          occupation
u1       c3232   "John Smith"  "Software Engineer"
u2       c3131   "Jack Lucas"  "Architect"
u3       c3143   "John Mayer"  "Student"

Score(c3) = 4/3
Score(p2 ⋈ c3) = (1 + 4/3)/2 = 7/6
Score(p1 ⋈ c1) = (1 + 1/3)/2 = 4/6

Results: (1) c3, (2) p2 ⋈ c3, (3) p1 ⋈ c1
Architecture

User submits keywords: [Maxtor Netvista]

IR Engine (uses IR Index) → non-free tuple sets:
ComplaintsQ = [(c3, comments, 1.33), (c1, comments, 0.33), (c2, comments, 0.33)]
ProductsQ = [(p1, manufacturer, 1), (p2, model, 1)]

Candidate Network Generator (uses database schema) → candidate networks:
ComplaintsQ
ProductsQ
ComplaintsQ ⋈ ProductsQ
ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ
ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ
...

Execution Engine (uses database; issues parameterized, prepared SQL queries):
SELECT * FROM ComplaintsQ c, ProductsQ p
WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?;
...

→ Top-k joining trees of tuples:
c3
p2 ⋈ c3
p1 ⋈ c1
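To make the last step concrete, here is a toy, self-contained sketch that uses sqlite3 in place of the commercial RDBMS and probes the ComplaintsQ ⋈ ProductsQ candidate network with a variant of the parameterized query above, extended to also return the combined score. The table contents follow the tuple sets in the figure; everything else is illustrative, not the paper's implementation.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ComplaintsQ (tupleId TEXT, prodId TEXT, custId TEXT, score REAL)")
cur.execute("CREATE TABLE ProductsQ (tupleId TEXT, prodId TEXT, score REAL)")
cur.executemany("INSERT INTO ComplaintsQ VALUES (?,?,?,?)",
                [("c3", "p131", "c3143", 1.33),
                 ("c1", "p121", "c3232", 0.33),
                 ("c2", "p131", "c3131", 0.33)])
cur.executemany("INSERT INTO ProductsQ VALUES (?,?,?)",
                [("p1", "p121", 1.0), ("p2", "p131", 1.0)])

# Prepared once, executed with different bindings as tuples are retrieved:
probe = ("SELECT c.tupleId, p.tupleId, (c.score + p.score) / 2 "
         "FROM ComplaintsQ c, ProductsQ p "
         "WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?")

# Probing with the newly retrieved complaint c3 (prodId p131, custId c3143):
print(cur.execute(probe, ("p131", "c3143")).fetchall())
# [('c3', 'p2', 1.165)]  -> the joining tree p2 ⋈ c3

The same prepared statement is reused with different bindings each time a new tuple is retrieved from ComplaintsQ.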
Architecture
(Architecture diagram repeated; see above.)
Candidate Network Generator
• Find all trees of tuple sets (free or non-free)
that may produce a result, based on
DISCOVER's CN generator [VLDB 2002]
• Use single non-free tuple set for each relation
– allows “OR” semantics
– fewer CNs are generated
– extra filtering step required for “AND”
semantics
Candidate Network Generator
Example
For query [Maxtor Netvista], CNs:
• ComplaintsQ
• ProductsQ
• ComplaintsQ ⋈ ProductsQ
• ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ
• ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ
Non-CNs:
• ComplaintsQ ⋈ Customers{} ⋈ Complaints{}
• ProductsQ ⋈ Complaints{} ⋈ ProductsQ
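The slides do not spell out the generation procedure itself; below is a simplified sketch in the spirit of DISCOVER's CN generator, restricted to path-shaped networks, which suffices for this star-shaped schema. It assumes free tuple sets R{} hold the tuples of R without keyword matches, and it prunes patterns X ⋈ S ⋈ X in which S references X, since both X tuples would have to be the same tuple (the reason the last non-CN above is rejected).

# Schema: Complaints.prodId -> Products, Complaints.custId -> Customers
FK = {("Complaints", "Products"), ("Complaints", "Customers")}  # child -> parent
NONFREE = ["Complaints", "Products"]   # relations with matches for [Maxtor Netvista]
ALL = ["Complaints", "Products", "Customers"]

def rel(ts):                           # "ComplaintsQ" / "Products{}" -> relation name
    return ts[:-1] if ts.endswith("Q") else ts[:-2]

def joinable(a, b):                    # adjacent in the schema graph?
    return (rel(a), rel(b)) in FK or (rel(b), rel(a)) in FK

def pruned(path):
    """Prune X - S - X where S references X: both X tuples would coincide."""
    return any(rel(x) == rel(y) and (rel(s), rel(x)) in FK
               for x, s, y in zip(path, path[1:], path[2:]))

def candidate_networks(max_size):
    tuple_sets = [r + "Q" for r in NONFREE] + [r + "{}" for r in ALL]
    cns, seen = [], set()
    frontier = [[r + "Q"] for r in NONFREE]        # grow paths from non-free sets
    while frontier:
        path = frontier.pop()
        if path[-1].endswith("Q"):                 # every leaf must be non-free
            key = min(tuple(path), tuple(reversed(path)))   # a path equals its reverse
            if key not in seen:
                seen.add(key)
                cns.append(" ⋈ ".join(key))
        if len(path) < max_size:
            for ts in tuple_sets:
                if joinable(path[-1], ts) and not pruned(path + [ts]):
                    frontier.append(path + [ts])
    return cns

print(candidate_networks(3))
# Includes, in some order: ComplaintsQ, ProductsQ, ComplaintsQ ⋈ ProductsQ,
# ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ, ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ, ...
# and excludes both non-CNs listed above.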
Architecture
(Architecture diagram repeated; see above.)
Execution Algorithms
• Users usually want only the top-k results.
• Hence, submitting to the DBMS one SQL query per CN (Naïve algorithm) is inefficient.
• When queries produce at most very few results, the Naïve algorithm is efficient, since it fully exploits the DBMS.
• Monotonic combining functions: if results T, T' have the same schema and Score(ai) ≤ Score(a'i) for every attribute ai, then Score(T) ≤ Score(T').
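For contrast with the pipelined algorithms that follow, here is a bird's-eye sketch of the Naïve strategy; run_sql is a hypothetical stand-in for submitting one CN's SQL statement to the DBMS.

import heapq

def naive_top_k(candidate_networks, run_sql, k):
    """run_sql(cn): hypothetical helper that fully evaluates one CN on the DBMS
    and returns its (joining tree, score) pairs."""
    results = []
    for cn in candidate_networks:
        results.extend(run_sql(cn))      # one SQL query per CN, evaluated fully
    return heapq.nlargest(k, results, key=lambda r: r[1])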
Sparse Algorithm: Example Execution

CN                         results    score          MFS
ProductsQ                  p1         9              9
ComplaintsQ                c2         7              7
ComplaintsQ ⋈ ProductsQ    c1 ⋈ p1    (9+5)/2 = 7    (9+7)/2 = 8

ComplaintsQ          ProductsQ
tupleId  score       tupleId  score
c2       7           p1       9
c1       5           p2       6
c3       1           p3       1

• Best when query produces at most a few results
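A minimal sketch of the Sparse strategy illustrated above (not the paper's exact pseudocode): candidate networks are evaluated in descending order of their maximum future score (MFS), and evaluation stops once the current k-th best score is at least the MFS of every unevaluated CN. evaluate is a stand-in for sending the CN's SQL query to the DBMS.

import heapq

def mfs(tuple_sets):
    """Upper bound for a CN: assume every tuple set contributes its top score."""
    return sum(max(scores) for scores in tuple_sets) / len(tuple_sets)

def sparse_top_k(cns, evaluate, k):
    """cns: list of (name, [score lists of the CN's tuple sets]).
    evaluate(name): stand-in for executing the CN's SQL query on the DBMS;
    returns (result, score) pairs."""
    ordered = sorted(cns, key=lambda cn: mfs(cn[1]), reverse=True)
    top = []                                    # min-heap of the best k (score, result)
    for name, tuple_sets in ordered:
        if len(top) == k and top[0][0] >= mfs(tuple_sets):
            break                               # no unevaluated CN can enter the top-k
        for result, score in evaluate(name):
            heapq.heappush(top, (score, result))
            if len(top) > k:
                heapq.heappop(top)
    return sorted(top, reverse=True)

# The example above, with hypothetical per-CN result lists:
cns = [("ProductsQ", [[9, 6, 1]]),
       ("ComplaintsQ", [[7, 5, 1]]),
       ("ComplaintsQ ⋈ ProductsQ", [[7, 5, 1], [9, 6, 1]])]
results = {"ProductsQ": [("p1", 9)],
           "ComplaintsQ": [("c2", 7)],
           "ComplaintsQ ⋈ ProductsQ": [("c1 ⋈ p1", 7)]}
print(sparse_top_k(cns, lambda name: results[name], k=1))
# [(9, 'p1')] -- the remaining CNs (MFS 8 and 7) are never executed.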
Single Pipelined Algorithm: Example Execution

CN: ComplaintsQ ⋈ ProductsQ

ComplaintsQ          ProductsQ
tupleId  score       tupleId  score
c2       7           p1       9
c1       5           p2       6
c3       1           p3       1

Get next tuple from the most promising non-free tuple set.

Successive bounds as tuples are retrieved:
MPFS = Max[(5+9)/2, (7+6)/2] = 7
MPFS = Max[(1+9)/2, (7+6)/2] = 6.5
MPFS = Max[(1+9)/2, (7+1)/2] = 5

Results queue
result     score
p1 → c1    7
p2 → c2    6.5

Output: p1 → c1, p2 → c2
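A compact sketch of the single-pipelined idea for one CN over two non-free tuple sets, as in the example; joins is a hypothetical predicate standing in for the parameterized SQL probe of the primary-foreign key join.

import heapq

def single_pipelined(ts_a, ts_b, joins, k):
    """ts_a, ts_b: (tupleId, score) lists, sorted by descending score.
    joins(a, b): stand-in for the parameterized SQL probe of the join."""
    sets, seen = [ts_a, ts_b], [0, 0]
    queue, out = [], []                   # candidate results (max-heap), emitted results

    def bound(i):                         # best score still reachable via tuple set i
        if seen[i] >= len(sets[i]):
            return float("-inf")
        return (sets[i][seen[i]][1] + sets[1 - i][0][1]) / 2

    while len(out) < k and (seen[0] < len(sets[0]) or seen[1] < len(sets[1])):
        i = 0 if bound(0) >= bound(1) else 1          # most promising tuple set
        t, score = sets[i][seen[i]]
        seen[i] += 1
        for u, uscore in sets[1 - i][:seen[1 - i]]:   # probe the retrieved prefix
            a, b = (t, u) if i == 0 else (u, t)
            if joins(a, b):
                heapq.heappush(queue, (-(score + uscore) / 2, (a, b)))
        mpfs = max(bound(0), bound(1))                # no unseen result can beat this
        while queue and -queue[0][0] >= mpfs and len(out) < k:
            s, r = heapq.heappop(queue)
            out.append((r, -s))
    while queue and len(out) < k:                     # tuple sets exhausted
        s, r = heapq.heappop(queue)
        out.append((r, -s))
    return out

# The example above: c1 joins p1 and c2 joins p2 (a hypothetical join relation).
complaints = [("c2", 7), ("c1", 5), ("c3", 1)]
products = [("p1", 9), ("p2", 6), ("p3", 1)]
pairs = {("c1", "p1"), ("c2", "p2")}
print(single_pipelined(complaints, products, lambda c, p: (c, p) in pairs, k=2))
# [(('c1', 'p1'), 7.0), (('c2', 'p2'), 6.5)]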
Global Pipelined Algorithm: Example Execution

Queue of CN processes, ordered by ascending MPFS: C4, C5, C1, C3, C2

Processing unit for CN process C3 (MPFS3 = 3.5):

Complaints           Products
tupleId  score       tupleId  score
c2       7           p1       9
c1       5           p2       6
c3       1           p3       1

global MPFS = max(MPFSi) over all CNs Ci

Results queue (output)
result     score
p1 → c1    7
p2 → c2    6.5

• Best when query produces many results.
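A sketch of the global scheduling loop described above, assuming each CN process object exposes two hypothetical methods: mpfs(), the maximum possible future score of that CN (returning -inf once the process is exhausted), and advance(), which performs one single-pipelined step for that CN and returns any newly produced (result, score) pairs.

import heapq

def global_pipelined(cn_processes, k):
    queue, out = [], []                        # candidate results; emitted top-k
    while len(out) < k and any(p.mpfs() > float("-inf") for p in cn_processes):
        best = max(cn_processes, key=lambda p: p.mpfs())   # most promising CN
        for result, score in best.advance():
            heapq.heappush(queue, (-score, result))
        global_mpfs = max(p.mpfs() for p in cn_processes)
        while queue and -queue[0][0] >= global_mpfs and len(out) < k:
            s, r = heapq.heappop(queue)
            out.append((r, -s))
    while queue and len(out) < k:              # all CN processes exhausted
        s, r = heapq.heappop(queue)
        out.append((r, -s))
    return out

A queued result can be emitted as soon as its score reaches the global MPFS, because no partially evaluated CN can still produce anything better.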
Hybrid Algorithm
• Estimate number of results.
– For “OR”-semantics, use DBMS estimator
– For “AND”-semantics, probabilistically
adjust DBMS estimator.
• If at most a few query results expected,
then use Sparse Algorithm.
• If many query results expected, then
use Global Pipelined Algorithm.
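The decision rule can be summarized in a few lines; the threshold constant c below is illustrative, not a value from the paper.

def hybrid_top_k(estimated_results, k, sparse, global_pipelined, c=5):
    """Choose an execution strategy from the DBMS's (adjusted) result estimate.
    sparse and global_pipelined are callables wrapping the two algorithms."""
    if estimated_results <= c * k:
        return sparse()             # few results expected: Sparse is best
    return global_pipelined()       # many results expected: Global Pipelined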
Related Work
• DBXplorer [ICDE 2002], DISCOVER [VLDB 2002]
– Similar three-step architecture
– Score = 1/size(T)
– Only AND semantics
– No straightforward extension for efficient top-k execution
• BANKS [ICDE 2002], Goldman et al. [VLDB 1998]
– Database viewed as graph
– No use of schema
• Florescu et al. [WWW 2000], XQuery Full-Text
• Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001]
– Top-k algorithms for join queries
Experiments – DBLP Dataset

Schema:
C(cid, name)         C: Conference
Y(yid, year, cid)    Y: Year
P(pid, title, yid)   P: Paper
A(aid, name)         A: Author
PP(pid1, pid2)
PA(pid, aid)

• DBLP contains few citation edges; synthetic citation edges were added such that the average number of citations is 20.
• Final dataset is 56 MB.
• Experiments run over a state-of-the-art commercial RDBMS.
OR Semantics: Effect of Maximum Allowed CN Size
[Chart: average execution time (msec, log scale) vs. maximum CN size (2–7), for Naive, Sparse, SA, SASymmetric, GA, GASymmetric, and Hybrid]
Average execution time of 100 2-keyword top-10 queries.
OR Semantics: Effect of Number of Objects Requested k
[Chart: average execution time (msec, log scale) vs. k (1–20), for Naive, Sparse, SA, SASymmetric, GA, GASymmetric, and Hybrid]
Average execution time of 100 2-keyword queries with maximum candidate-network size of 6.
OR Semantics: Effect of Number of Query Keywords
[Chart: average execution time (msec, log scale) vs. number of keywords (2–5), for Naive, Sparse, GA, GASymmetric, and Hybrid]
Average execution time of 100 top-10 queries with maximum candidate-network size of 6.
Conclusions
• Extend IR-style ranking to databases.
• Exploit text-search capabilities of modern
DBMSs, to generate results of higher quality.
• Support both “AND” and “OR” semantics.
• Achieve substantial speedup over prior work
via pipelined top-k query processing
algorithms.
Questions?
Compare Algorithms w.r.t. Result Size
[Two charts: execution time (msec, log scale) vs. total number of query results, comparing GA and Sparse; one panel for OR semantics, one for AND semantics]
Max CN size = 6, top-10 results, 2 keywords.
Ranking Functions
• The proposed algorithms support tuple-monotone combining functions.
• That is, if results T, T' have the same schema and Score(ai) ≤ Score(a'i) for every attribute ai, then Score(T) ≤ Score(T').