Machine and Statistical Learning
for Database Querying
Chao Wang
Data Mining Research Lab
Dept. of Computer Science & Engineering
The Ohio State University
Advisor: Prof. Srinivasan Parthasarathy
Supported by: NSF Career Award IIS-0347662
Copyright 2006, Data Mining Research Lab
Outline
• Introduction
– Selectivity estimation
– Probabilistic graphical model
• Querying transaction database
• Probabilistic model-based itemset
summarization
• Querying XML database
• Conclusion
Introduction
Introduction
• Database querying
• Selectivity estimation
– Estimation of the result size of a query in a database system
– Used by the query optimizer to choose an efficient execution plan
• Our approach relies on probabilistic graphical models
Probabilistic Graphical Models
• Marriage of graph theory and probability theory
• Special cases of the basic algorithms discovered in many (dis)guises:
– Statistical physics
– Hidden Markov models
– Genetics
– Statistics
– …
• Numerous applications:
– Bioinformatics
– Speech
– Vision
– Robotics
– Optimization
– …
Directed Graphical Models
(Bayesian Network)
(Figure: a directed acyclic graph over x1–x6.)
p(x1,x2,x3,x4,x5,x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)
Undirected Graphical Models
(Markov Random Field (MRF))
(Figure: an undirected graph over x1–x6.)
p(x1,x2,x3,x4,x5,x6) = (1/Z) Φ(x1,x2) Φ(x1,x3) Φ(x2,x4) Φ(x3,x5) Φ(x2,x5,x6)
Inference – Computing Conditional
Probabilities
• Conditioning
• Marginalization
• Conditional probabilities
(Figure: the six-variable model x1–x6 from the previous slides.)
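In standard form (for the six-variable model above; x1 and x6 are just example choices):
Marginalization:            p(x1) = Σ over x2,x3,x4,x5,x6 of p(x1,x2,x3,x4,x5,x6)
Conditional probability:    p(x1 | x6) = p(x1, x6) / p(x6)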
Querying Transaction
Database
Transaction Database
• Consists of records of interactions among
entities
• Two examples:
– Market-basket data
Each basket is a transaction consisting of
items
– Co-authorship data
Each paper is a transaction consisting of
“author” items
Querying Transaction Database
• Rely on frequent itemsets to learn graphical
models
• Rely on the model to solve the selectivity
estimation problem
– Given a conjunctive query Q, estimate the size
of the answer set, i.e., how many transactions
satisfy Q
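As a toy illustration of what is being asked for, the following Python sketch (all names and data are illustrative, not part of the proposed system) turns a joint distribution over binary item variables into an answer-set size for a conjunctive query such as X1 & ¬X2:

def selectivity(p_joint, n_transactions, positives, negatives):
    """Estimate the answer-set size of a conjunctive query, given a joint
    distribution p_joint over binary variable assignments
    (dict: tuple of 0/1 values -> probability). Illustrative only."""
    prob = sum(pr for x, pr in p_joint.items()
               if all(x[i] == 1 for i in positives)
               and all(x[i] == 0 for i in negatives))
    return prob * n_transactions

# Toy joint over (X1, X2, X3); query X1 & not X2
p = {(0,0,0): .2, (1,0,0): .3, (1,1,0): .1, (1,0,1): .25, (0,1,1): .15}
print(selectivity(p, n_transactions=1000, positives=[0], negatives=[1]))  # 550.0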
Frequent Itemset Mining
• Market-Basket Analysis
(Figure: market-basket data as a binary matrix over items A, B, C, D; each transaction records which items it contains, e.g., (0, 1, 1, 0).)
Frequent Itemset Mining
• Support(I): number of transactions “containing I”
Frequent Itemset Mining
Problem
• Given D, minsup
Find all itemsets I with support(I) ≥ minsup
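To make the definitions concrete, here is a minimal brute-force sketch in Python (the experiments later use the Apriori algorithm instead; the names and the toy dataset are illustrative):

from itertools import combinations

def support(itemset, transactions):
    """Number of transactions containing every item in the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def frequent_itemsets(transactions, minsup):
    """Brute-force enumeration of all itemsets with support >= minsup."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = support(cand, transactions)
            if sup >= minsup:
                result[cand] = sup
    return result

D = [{'A', 'B', 'C'}, {'A', 'C'}, {'B', 'C', 'D'}, {'A', 'B', 'C'}]
print(frequent_itemsets(D, minsup=2))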
Using Frequent Itemsets to Learn
an MRF
• A k-itemset can be viewed as a constraint on the
underlying distribution generating the data
• Given a set of itemsets, we compute the maximum-entropy (ME) distribution that satisfies all of them
• This maximum entropy distribution is equivalent
to an MRF
An ME Distribution Example
Frequent itemsets (11 constraints): X1, X2, X3, X4, X5, X1 X2, X1 X3, X2 X3, X3 X4, X4 X5, X1 X2 X3
• The maximum entropy distribution has the following product form:
p(X1, X2, X3, X4, X5) = u0 · ∏ (j = 1..11) uj^Ij(X)
where Ij(·) is an indicator function for the corresponding itemset constraint and the constants u0, u1, …, u11 are estimated from the data.
An MRF Example
(Figure: the corresponding MRF over X1–X5, with cliques C1 = {X1, X2, X3}, C2 = {X3, X4}, C3 = {X4, X5}.)
Iterative Scaling Algorithm
• Time complexity
Runs for k iterations over m itemset constraints, where t is the average inference time per constraint
⇒ O(k · m · t)
Efficient inference is crucial!
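The scaling step can be sketched as iterative proportional fitting over an explicit joint distribution. This brute-force version (illustrative names; feasible only for a handful of variables) shows why each of the k iterations touches all m constraints and pays an inference cost t per constraint, and why the real implementation uses junction-tree or approximate inference instead of full enumeration:

from itertools import product

def iterative_scaling(n_vars, constraints, k_iters=50):
    """Fit a maximum-entropy joint over {0,1}^n_vars to itemset frequency
    constraints by iterative proportional fitting. Brute force over all
    states; names and defaults are illustrative."""
    states = list(product([0, 1], repeat=n_vars))
    p = {x: 1.0 / len(states) for x in states}        # start from uniform
    for _ in range(k_iters):                          # k iterations ...
        for itemset, target in constraints.items():   # ... over m constraints
            # "inference": model-estimated frequency of the itemset
            est = sum(pr for x, pr in p.items()
                      if all(x[i] == 1 for i in itemset))
            if est in (0.0, 1.0):
                continue
            for x in p:                               # proportional update
                if all(x[i] == 1 for i in itemset):
                    p[x] *= target / est
                else:
                    p[x] *= (1 - target) / (1 - est)
    return p

# Three items; constraints are itemset frequencies (support / |D|)
constraints = {(0,): 0.6, (1,): 0.5, (2,): 0.4, (0, 1): 0.3}
model = iterative_scaling(3, constraints)
print(sum(pr for x, pr in model.items() if x[0] == 1 and x[1] == 1))  # -> 0.3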
Junction Tree Algorithm
• Exact inference algorithm
• Time complexity is exponential in the
treewidth (tw) of the model
– Treewidth = (maximum clique size in the
graph formed by triangulating the model – 1)
• For real-world models, tw is often well above 20, making exact inference intractable
Approximate Inference Algorithm
• Gibbs sampling
– Draw samples from the posterior distribution
– Average over the samples to estimate marginal probabilities
• Mean field algorithm
– Convert the inference problem to an optimization problem, and
solve the relaxed optimization problem
• Loopy belief propagation
– Apply Pearl’s belief propagation directly to loopy graphs
– Works quite well in practice
Will the iterative scaling algorithm still converge when it relies on approximate inference?
Graph Partitioning-Based
Approximate MRF Learning
Lemma:
For all disjoint vertex subsets a, b, and c in an MRF, if a separates b and c in the graph, then the variables associated with b and c are conditionally independent given the variables associated with a.
Graph Partitioning-Based
Approximate MRF Learning
• Cluster variables based on graph
partitioning
• Interaction importance and treewidth
based variable-cluster augmentation
• Learn an exact local MRF on each variable-cluster and combine all local models to derive an approximate global MRF
Clustering Variables
• k-MinCut
– Partition the graph into k equal parts
– Minimize the number of edges of E whose
incident vertices belong to different partitions
– Weighted graphs: Minimize the sum of
weights of all edges across different partitions
Accumulative Edge Weighting
Scheme
• Edge weight should reflect the correlation strength
Itemsets and their supports:
X1 X2 : 3
X1 X3 : 4
X2 X3 : 2
X3 X4 : 2
X4 X5 : 6
X1 X2 X3 : 2
For example, weight(X1, X2) = supp(X1 X2) + supp(X1 X2 X3) = 3 + 2 = 5
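A minimal sketch of the accumulative weighting in Python (names are illustrative): every itemset adds its support to the weight of each pair of items it contains:

from collections import defaultdict
from itertools import combinations

def edge_weights(itemset_supports):
    """Accumulate, for every item pair, the supports of all itemsets
    that contain both items (illustrative sketch)."""
    w = defaultdict(int)
    for itemset, sup in itemset_supports.items():
        for a, b in combinations(sorted(itemset), 2):
            w[(a, b)] += sup
    return w

supports = {('X1','X2'): 3, ('X1','X3'): 4, ('X2','X3'): 2,
            ('X3','X4'): 2, ('X4','X5'): 6, ('X1','X2','X3'): 2}
print(edge_weights(supports)[('X1', 'X2')])   # 3 + 2 = 5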
Clustering Variables
• The k-MinCut partitioning scheme yields disjoint partitions, but edges still cross partition boundaries, i.e., the partitions are correlated with each other. How do we account for these cross-partition correlations?
Interaction Importance and Treewidth
Based Variable-Cluster Augmentation
• Augmenting variable-cluster
– Add back most significant incident edges to a
variable-cluster
• Optimization
– Take into consideration model complexity
• Keep track of treewidth of the augmented variable-clusters
• 1-hop neighboring nodes first, then 2-hop nodes, and so on
Treewidth Based Augmentation
(Figure: a variable-cluster is augmented with its 1-hop neighboring nodes, then its 2-hop neighboring nodes, and so on.)
Approximate Global MRFs
• For each augmented variable-cluster,
collect related itemsets and learn an exact
local MRF
• Together, all the local MRFs form an approximate global MRF
Learning Algorithm
A Greedy Inference Algorithm
• Given the global model consisting of a set of local MRFs, how do we perform inference?
– Case 1: all query variables are covered by a single
MRF, evaluate the marginal probability directly
– Case 2: use a greedy decomposition scheme to
compute
• First, pick the local model that has the largest intersection with the current query (i.e., covers the most query variables)
• Then pick the next local model covering most uncovered
query variables, and so on
• Overlapped decomposition
A Greedy Inference Algorithm
Local models: M1 over {X1, X2, X3, X6, X7}, M2 over {X3, X4, X6, X8}, M3 over {X5, X9, X10}
Query: Qx = X1 X2 X3 X4 X5
P(Qx) ≈ P(X1, X2, X3) · P(X3, X4) · P(X5) / P(X3)
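The greedy decomposition step can be sketched as follows (Python; the model and query sets are taken from the example above, everything else is illustrative). It returns, for each chosen local model, the covered piece and its overlap with what was already covered; multiplying the piece marginals and dividing by the overlap marginals reproduces the estimate shown above:

def greedy_decomposition(query_vars, local_models):
    """Greedily cover the query variables with local models; for each chosen
    model, record the covered piece and its overlap with the variables
    already covered (the P(overlap) correction terms). Illustrative sketch."""
    query = set(query_vars)
    covered = set()
    plan = []
    while covered != query:
        # model covering the most still-uncovered query variables
        best = max(local_models, key=lambda m: len((set(m) & query) - covered))
        gain = (set(best) & query) - covered
        if not gain:
            break                       # some query variable is not covered at all
        piece = set(best) & query       # overlapped decomposition
        plan.append((sorted(piece), sorted(piece & covered)))
        covered |= gain
    return plan

M1 = ['X1', 'X2', 'X3', 'X6', 'X7']
M2 = ['X3', 'X4', 'X6', 'X8']
M3 = ['X5', 'X9', 'X10']
Q  = ['X1', 'X2', 'X3', 'X4', 'X5']
print(greedy_decomposition(Q, [M1, M2, M3]))
# [(['X1', 'X2', 'X3'], []), (['X3', 'X4'], ['X3']), (['X5'], [])]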
Discussions
• The greedy inference scheme is a heuristic
• The global model is not globally consistent; however, we expect it to be nearly consistent (Heckerman et al. 2000)
• A generalized belief propagation style approach is currently under investigation to enforce consistency across the local models, thereby offering a globally consistent model
Experimental Results
• C++ implementation; the junction tree algorithm is built on Intel's open-source Probabilistic Networks Library (C++)
• The Apriori algorithm is used to collect frequent itemsets
• Metis is used for graph partitioning
Experimental Setup
• Datasets
– Microsoft Anonymous Web, |D|=32711, |I|=294
– BMS-Webview1, |D|=59602, |I|=497
• Query workloads
– Conjunctive queries,
e.g., X1 & ¬X2 & X4
• Performance metrics
– Time: online estimation time and offline learning time
– Error: average absolute relative error
• Varying
– k, the no. of clusters
– g, the no. of vertices used during the augmentation
– tw, the treewidth threshold when using treewidth based augmentation
optimization
Results on the Web Data
• Support threshold = 20, yielding 9901 frequent itemsets
• Treewidth = 28 according to the Maximum Cardinality Search (MCS) ordering heuristic
Varying k (g = 5):
(Charts: estimation accuracy, online time, and offline time.)
Varying g (k = 20):
(Charts: estimation accuracy, online time, and offline time.)
Varying tw (k = 25):
(Charts: estimation accuracy, online time, and offline time.)
Using Non-Redundant Itemsets
• There exist redundancies in a collection of
frequent itemsets
• Select non-redundant patterns to learn
probabilistic models
• Closely related to pattern summarization
Probabilistic Model-Based
Itemset Summarization
Non-Derivable Itemsets
• Based on redundancies
– How do supports relate?
• What information about unknown supports
can we derive from known supports?
– Concise representation: only store non-redundant information
The Inclusion-Exclusion Principle
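In general, for items A1, …, An, with Ai' denoting the set of transactions containing Ai (the notation defined on the following slides):
|A1' ∪ A2' ∪ … ∪ An'| = Σi |Ai'| - Σi<j |(Ai Aj)'| + Σi<j<k |(Ai Aj Ak)'| - … + (-1)^(n+1) |(A1 A2 … An)'|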
Deduction Rules via Inclusion-Exclusion
• Let A, B, C, … be items
• Let A’ correspond to the set
{ transactions t | t contains A }
• (AB)’ = (A)’ ∩ (B)’
• Then supp(AB) = | (AB)’|
Deduction Rules via Inclusion-Exclusion
• Inclusion-exclusion principle:
|A’ U B’ U C’| = |A’| + |B’| + |C’|
- |(AB)’| - |(AC)’| - |(BC)’|
+ |(ABC)’|
Thus, since |A’ U B’ U C’| ≤ n,
Supp(ABC) ≤ s(AB) + s(AC) + s(BC)
- s(A) - s(B) - s(C) + n
Copyright 2006, Data Mining Research Lab
Complete Set for Supp(ABC)
0:  sABC ≥ 0
1:  sABC ≤ sAB;  sABC ≤ sAC;  sABC ≤ sBC
2:  sABC ≥ sAB + sAC - sA;  sABC ≥ sAB + sBC - sB;  sABC ≥ sAC + sBC - sC
3:  sABC ≤ sAB + sAC + sBC - sA - sB - sC + n
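These rules translate directly into a bound computation; a small Python helper (illustrative, with a made-up 5-transaction example) computes the tightest lower and upper bounds they imply:

def bounds_ABC(sA, sB, sC, sAB, sAC, sBC, n):
    """Lower and upper bounds on supp(ABC) implied by the rules above."""
    lower = max(0,
                sAB + sAC - sA,
                sAB + sBC - sB,
                sAC + sBC - sC)
    upper = min(sAB, sAC, sBC,
                sAB + sAC + sBC - sA - sB - sC + n)
    return lower, upper

# Toy 5-transaction dataset; bounds coincide, so ABC is derivable (supp = 2)
print(bounds_ABC(sA=4, sB=4, sC=3, sAB=3, sAC=3, sBC=2, n=5))  # (2, 2)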
Derivable Itemsets
Given: Supp(I) for all I ⊂ J
⇒ lower bound on Supp(J) = L, upper bound on Supp(J) = U
• Without counting: Supp(J) ∈ [L, U]
• J is a derivable itemset (DI) iff L = U
We know Supp(J) exactly without counting!
Derivable Itemsets
• J is a derivable itemset:
– No need to count Supp(J)
– No need to store Supp(J)
• We can use the deduction rules
– Concise representation:
C = { (J, Supp(J)) | J not derivable from Supp(I), I ⊂ J }
Probabilistic Model Based Itemset
Summarization
• We can learn the MRF from non-derivable
itemsets alone
Lemma: Given a transaction dataset D, the MRF M constructed from all of its σ-frequent itemsets is equivalent to M', the MRF constructed from only its σ-frequent non-derivable itemsets
• Can we do better?
– Further compress the patterns
Probabilistic Model Based Itemset
Summarization
• Use smaller itemsets to learn an MRF
• Use this model to infer the supports of
larger itemsets
• Use the itemsets whose supports cannot be explained by the model (within some error threshold) to augment the model
Itemset Summarization
Algorithm
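The summarization loop described on the previous slide can be sketched as the following Python toy; it uses a simple independence model in place of the learned MRF for the support-inference step, and all names, data, and the error threshold are illustrative:

def summarize(itemset_supports, item_supports, n, err_threshold):
    """Keep only itemsets whose support cannot be predicted (within the
    error threshold) from smaller patterns. Here the predictor is a plain
    independence model; the actual method infers the support from an MRF
    learned over the current summary."""
    summary = dict(item_supports)                    # start from the 1-itemsets
    for itemset, true_sup in sorted(itemset_supports.items(),
                                    key=lambda kv: len(kv[0])):
        est = n
        for item in itemset:
            est *= item_supports[(item,)] / n        # independence estimate
        if abs(est - true_sup) / true_sup > err_threshold:
            summary[itemset] = true_sup              # "unexplained": keep it
            # (the real algorithm would also augment and re-learn the MRF here)
    return summary

items = {('A',): 6, ('B',): 5, ('C',): 4}
pairs = {('A', 'B'): 3, ('A', 'C'): 1, ('B', 'C'): 2}
print(summarize(pairs, items, n=10, err_threshold=0.2))
# only ('A', 'C') is added to the summary; the other pairs are explained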
Generalized Non-Derivable
Itemsets
• All the itemsets in the final summary are
non-derivable
• Relax the requirement for an itemset to be
derivable
Experimental Results
• Experimental Setup
– Datasets:
– Performance metrics:
• Summarization accuracy (restoration error)
• Summary size
• Summarizing time
Results on the Chess Dataset
minSup = 2000 ⇒ 166,581 frequent itemsets, of which 1,276 are non-derivable
(Charts: summary size, summarizing time, and estimation accuracy.)
Results on the Chess Dataset
Skewed itemset distribution when varying error threshold
Results on the Mushroom Dataset
minSup = 2031 (25%) ⇒ 5,545 frequent itemsets, of which 534 are non-derivable
(Charts: summary size, summarizing time, and estimation accuracy.)
Results on the Mushroom Dataset
Skewed itemset distribution when varying error threshold
Result Summary and
Discussions
• There do exist redundancies in a collection of itemsets, and
the probabilistic model based summarization scheme can
effectively eliminate such redundancies
– When datasets are dense and largely satisfy the conditional independence assumption, our summarization approach is extremely efficient
– When datasets are sparse and do not satisfy the conditional independence assumption, the summarization task becomes more difficult (it needs more time and space)
• Itemset-based MRF learning and MRF-based itemset summarization are two interacting procedures
Querying XML Database –
Exploiting Independence Structure
from Complex Structural Patterns
Querying XML Database
• XML is becoming the standard for data
exchange
• We need to query the structure and text data
of XML documents
• XML twig query:
– an important query mechanism
– a structural query with small branches
• Optimizing these queries requires estimating
the selectivity of the twig queries
Querying XML Database
• An XML document example:
DBLP.xml
(Digital Bibliography & Library Project)
Querying XML Database
• A twig example:
Written as an XQuery FLWOR expression:
for $b in doc("DBLP.xml")//book
where $b/publisher = "Morgan Kaufmann"
return $b/title
(Figure: the corresponding twig: a book node b with children p (publisher) and t (title).)
Querying XML Database
(Figure: in the example document, the twig b(p, t), i.e., a book with a publisher and a title, has selectivity = 2.)
Problem Statement
• The goal is to accurately estimate the
selectivity of twig queries with limited
memory
– Need a structure to store relevant statistics of
the data
– Then estimate selectivity from these statistics
Our Approach (TreeLattice)
• Key idea: store the occurrence statistics of
small twigs in the summary
– The summary is a lattice consisting of small
trees, thus called TreeLattice
• Then, based on these statistics, estimate the selectivity of larger twigs
Challenges
• How do we estimate the selectivity of a twig from the selectivity information of its sub-twigs?
• How do we decompose a large twig into smaller twigs?
• What statistics should be stored in the lattice summary?
Estimation Procedure
Lemma: Let T1 be the twig obtained by augmenting T with edge e1, and T2 the twig obtained by augmenting T with edge e2. If the two augmentations are conditionally independent (conditioned on T), then:
sel(T + e1 + e2) = sel(T1) · sel(T2) / sel(T)
where sel(·) denotes selectivity.
(Figure: T is augmented with e1 to get T1, and with e2 to get T2.)
Decomposition Strategies
• How to decompose a large twig into
smaller sub-twigs?
– Recursive decomposition with or without
voting
– Fixed-sized decomposition
– Hybrid decomposition
Recursive Decomposition
(Figure: a twig over nodes a–g is recursively split into smaller sub-twigs, applying the estimation formula at each step.)
• Multiple feasible decompositions may exist; rely on voting over them to obtain the best estimate we can (see the sketch below)
• Much more accurate than estimating without voting
• The estimation process slows down
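A toy sketch of the estimation-with-voting step in Python: sub-twigs are represented simply as sets of edges, the selectivity table is made up, and the sketch assumes every edge subset it looks up is a valid sub-twig, which a real decomposition has to verify:

from statistics import mean

def estimate_with_voting(twig, sel):
    """Estimate the selectivity of a twig (a frozenset of edges) from its
    sub-twigs, voting over the feasible ways of peeling off two edges:
        sel(T + e1 + e2) ~ sel(T + e1) * sel(T + e2) / sel(T)
    `sel` maps known sub-twigs to their selectivities. Illustrative only."""
    if twig in sel:
        return sel[twig]
    votes = []
    for e1 in twig:
        for e2 in twig:
            if e1 >= e2:
                continue
            t, t1, t2 = twig - {e1, e2}, twig - {e2}, twig - {e1}
            if t in sel and t1 in sel and t2 in sel and sel[t] > 0:
                votes.append(sel[t1] * sel[t2] / sel[t])
    return mean(votes) if votes else None

sel = {frozenset({'a-b'}): 100, frozenset({'b-d'}): 80, frozenset({'b-e'}): 60,
       frozenset({'a-b', 'b-d'}): 50, frozenset({'a-b', 'b-e'}): 30,
       frozenset({'b-d', 'b-e'}): 40}
print(estimate_with_voting(frozenset({'a-b', 'b-d', 'b-e'}), sel))  # 20.0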
Fixed-sized Decomposition
(Figure: the twig over nodes a–g is decomposed into fixed-size sub-twigs, whose selectivities are then combined.)
• Very fast, but cannot be applied directly
Hybrid Decomposition
(Figure: the twig is first broken up by fixed-sized decomposition; the resulting pieces are then estimated by recursive decomposition with voting.)
Summary Statistics
• What to store in lattice summary?
– Store important statistics
– Store non-redundant information
– How to achieve this?
• Store non-derivable patterns only!
Summary Statistics
• A twig pattern is δ-derivable if and only if its true selectivity is within an error tolerance δ of its expected selectivity according to TreeLattice.
– 0-derivable (δ=0) patterns are those patterns whose
selectivity can be estimated exactly.
• Pruning 0-derivable patterns
– No loss of accuracy
Summary Statistics
• Level-wise lattice summary construction
– Add all twigs of sizes 1 and 2 to the summary (the base)
– Then add larger non-derivable frequent twigs
into the summary, until the memory budget is
depleted
Experimental Methodology
• Datasets: NASA, PSD, IMDB and XMark
• Workloads: 1000 frequent twig queries of
size between 4 and 9.
• Error metric: mean absolute relative error
error = (1 / |W|) Σ over q ∈ W of |estim(q) - count(q)| / count(q)
Accuracy of Estimators
(Chart on NASA: average relative error (%) vs. query size (4 to 9) for Recursive Decomp+Voting, Fast Decomp, Recursive Decomp, and TreeSketches.)
• Recursive decomposition with voting yields best
estimates
• The quality of estimation degrades as the twig size
increases due to error propagation
Varying Summary Size
(Chart on NASA: average relative error (%) vs. summary size (10k to 50k) for TreeLattice and TreeSketches.)
• The larger the summary, the better the estimations
• TreeLattice makes more efficient use of the memory
budget
Estimation Time
(Chart on NASA: response time (ms) vs. query size (4 to 9) for Recursive Decomp+Voting, Fast Decomp, Recursive Decomp, and TreeSketches.)
• TreeLattice is very fast when processing relatively small twigs
• Recursive decomposition with voting slows down a lot as the
twig size increases.
• Overall, fast decomposition is best.
δ-derivable Pruning
• The proportion of 0-derivable patterns is very
high on NASA, PSD and XMark
– The tree-growing conditional independence assumption holds well
– TreeLattice works very well
• Assumption does not hold that well on IMDB.
How to improve the estimations on IMDB?
δ-derivable Pruning
(Chart on IMDB: comparison with TreeSketches as δ varies.)
• Larger δ is good for large twigs, at the cost of
sacrificing estimation accuracy for small twigs.
Discussions
• TreeLattice is effective in estimating the
selectivity of XML twig queries
– Compares favorably with the state-of-the-art
approach
– The lattice summary construction is fast
– The online estimation is fast
Conclusion
Conclusion
• Conditional independence structure is common in
the real world
• Graphical models are effective at capturing such structure and solving the selectivity estimation problem for database querying
• Model structured data (sequence/tree/graph) using
probabilistic models
• Model streaming/incremental data