Slides - ECML PKDD 2008

Bellwether Analysis
Hierarchies in Data Mining
Raghu Ramakrishnan
Chief Scientist for Audience and Cloud Computing
Yahoo!
About this Talk
• Common theme—multidimensional view of
data:
– Reveals patterns that emerge at coarser
granularity
• Widely recognized, e.g., generalized association rules
– Helps handle imprecision
• Analyzing imprecise and aggregated data
– Helps handle data sparsity
• Even with massive datasets, sparsity is a challenge!
– Defines candidate space of subsets for
exploratory mining
• Forecasting query results over “future data”
• Using predictive models as summaries
• Potentially, space of “mining experiments”?
Hierarchies in Data Mining
Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma
R. Ramakrishnan
2
Background:
The Multidimensional Data Model
Cube Space
Star Schema
“FACT” TABLE
SERVICE (pid, timeid, locid, repair)
DIMENSION TABLES
PRODUCT (pid, pname, Category, Model)
TIME (timeid, date, week, year)
LOCATION (locid, country, region, state)
Dimension Hierarchies
• For each dimension, the set of values can be
organized in a hierarchy:
PRODUCT: automobile > category > model
TIME: year > quarter > month > date, and year > week > date
LOCATION: country > region > state
Multidimensional Data Model
• One fact table $D = (X, M)$
– $X = X_1, X_2, \ldots$ Dimension attributes
– $M = M_1, M_2, \ldots$ Measure attributes
• Domain hierarchy for each dimension attribute:
– Collection of domains $Hier(X_i) = (D_i^{(1)}, \ldots, D_i^{(t)})$
– The extended domain: $EX_i = \bigcup_{1 \le k \le t} D_i^{(k)}$
• Value mapping function: $\gamma_{D_1 \to D_2}(x)$
– e.g., $\gamma_{\text{month} \to \text{year}}(12/2005) = 2005$
– Form the value hierarchy graph
– Stored as dimension table attribute (e.g., week for a time value) or conversion functions (e.g., month, quarter)
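The value mapping function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the string encodings ("MM/YYYY" for months) and the names `GAMMA`, `map_value`, `month_to_year`, and `month_to_quarter` are hypothetical and do not come from the talk.

```python
# Hypothetical value-mapping functions gamma_{D1 -> D2} for a toy Time hierarchy.
# Months are encoded as "MM/YYYY" strings; this encoding is an assumption.

def month_to_quarter(month: str) -> str:
    """Conversion function: '12/2005' -> 'Q4/2005'."""
    mm, yyyy = month.split("/")
    return f"Q{(int(mm) - 1) // 3 + 1}/{yyyy}"

def month_to_year(month: str) -> str:
    """Conversion function: '12/2005' -> '2005'."""
    return month.split("/")[1]

# (source domain, target domain) -> mapping function
GAMMA = {
    ("month", "quarter"): month_to_quarter,
    ("month", "year"): month_to_year,
}

def map_value(src: str, dst: str, value: str) -> str:
    """Apply gamma_{src -> dst} to a value of domain src."""
    return GAMMA[(src, dst)](value)

print(map_value("month", "year", "12/2005"))     # 2005
print(map_value("month", "quarter", "12/2005"))  # Q4/2005
```

Mappings that cannot be computed from the value itself (such as week for a date) would instead be looked up in the dimension table.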
Multidimensional Data
[Figure: facts p1-p4 plotted in a cube over two dimension attributes. LOCATION has levels 1 State (MA, NY, TX, CA), 2 Region (East, West), 3 ALL; Automobile has levels 1 Model (Civic, Camry, F150, Sierra), 2 Category (Sedan, Truck), 3 ALL.]
FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
Cube Space
• Cube space: $C = EX_1 \times EX_2 \times \cdots \times EX_d$
• Region: Hyper rectangle in cube space
– $c = (v_1, v_2, \ldots, v_d)$, $v_i \in EX_i$
– E.g., c1 = (NY, Camry); c2 = (West, Sedan)
• Region granularity:
– $gran(c) = (d_1, d_2, \ldots, d_d)$, $d_i = Domain(c.v_i)$
– E.g., gran(c1) = (State, Model); gran(c2) = (Region, Category)
• Region coverage:
– coverage(c) = all facts in c
• Region set: All regions with same granularity
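The definitions of granularity and coverage can be made concrete with the four facts of the running example. The sketch below is illustrative only: the child-to-parent dictionaries encode the two toy hierarchies, and the helper names (`domain`, `rolls_up_to`, `gran`, `coverage`) are assumptions made for this example.

```python
# Child -> parent maps for the two toy hierarchies in the running example.
LOC_PARENT = {"NY": "East", "MA": "East", "TX": "West", "CA": "West",
              "East": "ALL", "West": "ALL"}
AUTO_PARENT = {"Civic": "Sedan", "Camry": "Sedan", "F150": "Truck", "Sierra": "Truck",
               "Sedan": "ALL", "Truck": "ALL"}
# Domain (hierarchy level) names, finest first.
LOC_LEVELS = ["State", "Region", "ALL"]
AUTO_LEVELS = ["Model", "Category", "ALL"]

# FactID -> (Loc, Auto, Repair)
FACTS = {"p1": ("NY", "F150", 100), "p2": ("NY", "Sierra", 500),
         "p3": ("MA", "F150", 100), "p4": ("MA", "Sierra", 200)}

def domain(value, parent, levels):
    """Name of the domain a value belongs to, found by counting steps up to ALL."""
    steps = 0
    while value in parent:
        value = parent[value]
        steps += 1
    return levels[len(levels) - 1 - steps]

def rolls_up_to(value, target, parent):
    """True if value equals target or has target as an ancestor."""
    while value != target and value in parent:
        value = parent[value]
    return value == target

def gran(region):
    """Granularity of a region (loc, auto)."""
    loc, auto = region
    return (domain(loc, LOC_PARENT, LOC_LEVELS),
            domain(auto, AUTO_PARENT, AUTO_LEVELS))

def coverage(region):
    """IDs of all facts that fall inside the region."""
    loc, auto = region
    return sorted(fid for fid, (f_loc, f_auto, _) in FACTS.items()
                  if rolls_up_to(f_loc, loc, LOC_PARENT)
                  and rolls_up_to(f_auto, auto, AUTO_PARENT))

print(gran(("NY", "Camry")))        # ('State', 'Model')
print(gran(("West", "Sedan")))      # ('Region', 'Category')
print(coverage(("East", "Truck")))  # ['p1', 'p2', 'p3', 'p4']
print(coverage(("NY", "ALL")))      # ['p1', 'p2']
```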
OLAP Over Imprecise Data
with Doug Burdick, Prasad Deshpande, T.S. Jayram, and
Shiv Vaithyanathan
In VLDB 05, 06 joint work with IBM Almaden
Imprecise Data
[Figure: the LOCATION by Automobile cube with precise facts p1-p4 and an imprecise fact p5, which is recorded at the Category level (Truck) rather than the Model level.]
FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100
Querying Imprecise Facts
Query: Auto = F150, Loc = MA, SUM(Repair) = ???
How do we treat p5?
[Figure: the Truck (F150, Sierra) by East (NY, MA) part of the cube. Facts p1-p4 each sit in a single cell; p5 = (Truck, MA) could belong to either the F150 or the Sierra cell in MA.]
FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100
Allocation (1)
[Figure: the Truck (F150, Sierra) by East (NY, MA) part of the cube with facts p1-p5; p5 is still unallocated.]
FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100
Allocation (2)
(Huh? Why 0.5 / 0.5? Hold on to that thought)
[Figure: p5 now appears in both the F150 and the Sierra cells for MA.]
ID  FactID  Auto    Loc  Repair  Weight
1   p1      F150    NY   100     1.0
2   p2      Sierra  NY   500     1.0
3   p3      F150    MA   100     1.0
4   p4      Sierra  MA   200     1.0
5   p5      F150    MA   100     0.5
6   p5      Sierra  MA   100     0.5
Allocation (3)
Query: Auto = F150, Loc = MA, SUM(Repair) = 150 (p3 contributes 100 × 1.0 and p5 contributes 100 × 0.5)
Query the Extended Data Model!
[Figure: p5 appears in both the F150 and the Sierra cells for MA.]
ID  FactID  Auto    Loc  Repair  Weight
1   p1      F150    NY   100     1.0
2   p2      Sierra  NY   500     1.0
3   p3      F150    MA   100     1.0
4   p4      Sierra  MA   200     1.0
5   p5      F150    MA   100     0.5
6   p5      Sierra  MA   100     0.5
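The allocation steps on these slides can be sketched as follows. This is a minimal illustration that assumes a uniform allocation policy (equal weight to every possible completion), which reproduces the 0.5 / 0.5 split shown here. The names `CHILDREN`, `allocate_uniform`, and `weighted_sum` are hypothetical, and uniform allocation is only one of many possible policies.

```python
# Fact table from the slides: (FactID, Auto, Loc, Repair).
FACTS = [
    ("p1", "F150", "NY", 100),
    ("p2", "Sierra", "NY", 500),
    ("p3", "F150", "MA", 100),
    ("p4", "Sierra", "MA", 200),
    ("p5", "Truck", "MA", 100),   # imprecise: Auto is known only at the Category level
]

# Leaf-level completions of each non-leaf Auto value.
CHILDREN = {"Truck": ["F150", "Sierra"]}

def allocate_uniform(facts):
    """Build the extended table: one weighted row per possible completion."""
    extended = []
    for fid, auto, loc, repair in facts:
        completions = CHILDREN.get(auto, [auto])
        weight = 1.0 / len(completions)   # uniform policy (an assumption)
        for model in completions:
            extended.append((fid, model, loc, repair, weight))
    return extended

def weighted_sum(extended, auto, loc):
    """SUM(Repair) over the extended table, weighting each row by its allocation."""
    return sum(repair * w for _, a, l, repair, w in extended if a == auto and l == loc)

extended = allocate_uniform(FACTS)
print(len(extended))                        # 6
print(weighted_sum(extended, "F150", "MA"))  # 150.0
```

A different policy would change only the weights assigned in `allocate_uniform`; the query in `weighted_sum` stays the same.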
Allocation Policies
• Procedure for assigning allocation weights is
referred to as an allocation policy
– Each allocation policy uses different information to
assign allocation weight
• Key contributions:
– Appropriate characterization of the large space of
allocation policies (VLDB 05)
– Designing efficient algorithms for allocation policies
that take into account the correlations in the data
(VLDB 06)
Motivating Example
Query: COUNT
We propose desiderata that enable appropriate definition of query semantics for imprecise data
[Figure: facts p1-p5 over Truck (F150, Sierra) by East (NY, MA).]
Desideratum I: Consistency
• Consistency specifies the relationship between answers to related queries on a fixed data set
[Figure: facts p1-p5 over Truck (F150, Sierra) by East (NY, MA).]
Desideratum II: Faithfulness
• Faithfulness specifies the relationship between answers to a fixed query on related data sets
[Figure: three related data sets (Data Set 1, Data Set 2, Data Set 3), each placing facts p1-p5 over F150/Sierra by NY/MA.]
Imprecise facts lead to many possible worlds [Kripke63, …]
[Figure: four possible worlds w1-w4, each assigning facts p1-p5 to cells of F150/Sierra by NY/MA.]
Query Semantics
• Given all possible worlds together with their
probabilities, queries are easily answered using
expected values
– But number of possible worlds is exponential!
• Allocation gives facts weighted assignments to
possible completions, leading to an extended
version of the data
– Size increase is linear in number of (completions of)
imprecise facts
– Queries operate over this extended version
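The relationship between the two views can be checked on the running example. The sketch below enumerates possible worlds explicitly and computes the expected answer to the SUM query. It assumes that imprecise facts are completed independently and that p5 completes to F150 or Sierra with probability 0.5 each; `COMPLETIONS`, `possible_worlds`, and `expected_sum` are hypothetical names.

```python
from itertools import product

# Fact table: (FactID, Auto, Loc, Repair); p5 is imprecise in Auto.
FACTS = [
    ("p1", "F150", "NY", 100),
    ("p2", "Sierra", "NY", 500),
    ("p3", "F150", "MA", 100),
    ("p4", "Sierra", "MA", 200),
    ("p5", "Truck", "MA", 100),
]

# Assumed completion probabilities for imprecise facts (these act as allocation weights).
COMPLETIONS = {"p5": [("F150", 0.5), ("Sierra", 0.5)]}

def possible_worlds(facts):
    """Yield (world, probability); a world fixes one completion per fact."""
    options = []
    for fid, auto, loc, repair in facts:
        comps = COMPLETIONS.get(fid, [(auto, 1.0)])
        options.append([((fid, a, loc, repair), p) for a, p in comps])
    for combo in product(*options):   # grows exponentially with the number of imprecise facts
        world = [fact for fact, _ in combo]
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

def expected_sum(facts, auto, loc):
    """Expected value of SUM(Repair) for the query cell over all possible worlds."""
    return sum(prob * sum(r for _, a, l, r in world if a == auto and l == loc)
               for world, prob in possible_worlds(facts))

print(expected_sum(FACTS, "F150", "MA"))  # 150.0
```

With one imprecise fact there are only two worlds, but each additional imprecise fact multiplies the count, whereas the extended table grows by just one row per completion.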
Bellwether Analysis
Dealing with Data Sparsity
Deepak Agarwal, Andrei Broder, Deepayan
Chakrabarti, Dejan Diklic, Vanja Josifovski, Mayssam
Sayyadian
Estimating Rates of Rare Events at Multiple
Resolutions, KDD 2007
Motivating Application
Content Match Problem
[Figure: pages matched with ads.]
• Problem:
– Which ads are good on what pages
– Pages: no control; Ads: can control
• First simplification:
– (Page, Ad) completely characterized by a set of high-dimensional features
• Naïve Approach:
– Experiment with all possible pairs several times and estimate CTR.
• Of course, this doesn’t work
• Most (ad, page) pairs have very few impressions, if any, and even fewer clicks
Severe data sparsity
Estimation in the “Tail”
• Use an existing, well-understood hierarchy
– Categorize ads and webpages to leaves of the
hierarchy
– CTR estimates of siblings are correlated
The hierarchy allows us to aggregate data
• Coarser resolutions
– provide reliable estimates for rare events
– which then influences estimation at finer resolutions
Similar “coarsening”, different motivation:
Mining Generalized Association Rules
Ramakrishnan Srikant, Rakesh Agrawal , VLDB 1995
Sampling of Webpages
• Naïve strategy: sample at random from the set
of URLs
Sampling errors in impression volume AND click
volume
• Instead, we propose:
– Crawling all URLs with at least one click, and
– a sample of the remaining URLs
Variability is only in impression volume
Imputation of Impression Volume
• Region node = (page node, ad node)
• Build a Region Hierarchy
– A cross-product of the page hierarchy and the ad hierarchy
[Figure: the region hierarchy, with levels from Z(0) at the root through Z(i) down to the leaf regions.]
Exploiting Taxonomy Structure
• Consider the bottom two levels
of the taxonomy
• Each cell corresponds to a (page, ad)-class pair
• Key point: Children under a parent node are alike and expected to have similar CTRs (i.e., form a cohesive block)
Imputation of Impression Volume
For any level Z(i), regions form a grid of page classes (rows) by ad classes (columns). In each region:
#impressions = nij + mij + xij
– nij: impressions in the clicked pool
– mij: impressions in the sampled non-clicked pool
– xij: excess impressions (to be imputed)
• [row constraint] each row sums to ∑nij + K·∑mij
• [column constraint] each column sums to #impressions on ads of this ad class
• All regions together sum to the total impressions (known)
Imputation of Impression Volume
[Figure: block constraint. The regions in each block sum to the value of the block's parent region.]
Imputing xij
Iterative Proportional Fitting
[Darroch+/1972]
Initialize xij = nij + mij
Top-down, for adjacent levels Z(i) and Z(i+1):
• Scale all xij in every block in Z(i+1) to sum to its parent in Z(i)
• Scale all xij in Z(i+1) to sum to the row totals
• Scale all xij in Z(i+1) to sum to the column totals
Repeat for every level Z(i)
Bottom-up: Similar
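To illustrate the scaling steps, here is a minimal sketch of classical two-way iterative proportional fitting on a single table. It covers only the row and column constraints at one level; the block constraint and the top-down and bottom-up passes over the hierarchy are omitted. The function name `ipf` and the example numbers are assumptions, not values from the talk.

```python
def ipf(x, row_totals, col_totals, iters=100, tol=1e-9):
    """Rescale a positive table until its rows and columns match the given totals."""
    x = [row[:] for row in x]                      # work on a copy
    for _ in range(iters):
        # Scale each row to its target total.
        for i, row in enumerate(x):
            s = sum(row)
            if s > 0:
                x[i] = [v * row_totals[i] / s for v in row]
        # Scale each column to its target total.
        for j, target in enumerate(col_totals):
            s = sum(row[j] for row in x)
            if s > 0:
                for row in x:
                    row[j] *= target / s
        # Columns now match; stop once the rows also match.
        if all(abs(sum(row) - t) <= tol for row, t in zip(x, row_totals)):
            break
    return x

# Hypothetical starting values and totals (row and column totals both sum to 100).
start = [[1.0, 2.0], [3.0, 4.0]]
fitted = ipf(start, row_totals=[30.0, 70.0], col_totals=[40.0, 60.0])
for row in fitted:
    print([round(v, 3) for v in row])
```

Each pass multiplies whole rows or whole columns by a constant, so the fitted table keeps the interaction structure (cross-product ratios) of the starting values while matching the margins.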
Imputation: Summary
• Given
– nij (impressions in clicked pool)
– mij (impressions in sampled non-clicked pool)
– # impressions on ads of each ad class in the ad
hierarchy
• We get
– Estimated impression volume
Ñij = nij + mij + xij
in each region ij of every level Z(.)
Bellwether Analysis
Dealing with Data Sparsity
Deepak Agarwal, Pradheep Elango, Nitin Motgi,
Seung-Taek Park, Raghu Ramakrishnan, Scott Roy,
Joe Zachariah
Real-time Content Optimization through Active User
Feedback, NIPS 2008
Yahoo! Home Page Featured Box
• It is the top-center
part of the Y!
Front Page
• It has four tabs:
Featured,
Entertainment,
Sports, and Video
Novel Aspects
• Classical: Arms assumed fixed over time
– We gain and lose arms over time
• Some theoretical work by Whittle in 80’s; operations research
• Classical: Serving rule updated after each pull
– We compute optimal design in batch mode
• Classical: Generally, CTR assumed stationary
– We have highly dynamic, non-stationary CTRs
Bellwether Analysis:
Global Aggregates from Local Regions
with Bee-Chung Chen, Jude Shavlik, and Pradeep Tamma
In VLDB 06
Motivating Example
• A company wants to predict the first year worldwide profit
of a new item (e.g., a new movie)
– By looking at features and profits of previous (similar) movies, we
predict expected total profit (1-year US sales) for new movie
• Wait a year and write a query! If you can’t wait, stay awake …
– The most predictive “features” may be based on sales data
gathered by releasing the new movie in many “regions” (different
locations over different time periods).
• Example “region-based” features: 1st week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
• Gathering this data has a cost (e.g., marketing expenses, waiting
time)
• Problem statement: Find the most predictive region
features that can be obtained within a given “cost budget”
Key Ideas
• Large datasets are rarely labeled with the targets that we
wish to learn to predict
– But for the tasks we address, we can readily use OLAP
queries to generate features (e.g., 1st week sales in
Peoria) and even targets (e.g., profit) for mining
• We use data-mining models as building blocks in
the mining process, rather than thinking of them
as the end result
– The central problem is to find data subsets
(“bellwether regions”) that lead to predictive features
which can be gathered at low cost for a new case
Motivating Example
• A company wants to predict the first year’s
worldwide profit for a new item, by using its
historical database
• Database Schema:
Profit Table (Time, Location, CustID, ItemID, Profit)
Item Table (ItemID, Category, R&D Expense)
Ad Table (Time, Location, ItemID, AdExpense, AdSize)
• The combination of the underlined attributes forms a key
A Straightforward Approach
• Build a regression model to predict item profit
By joining and aggregating tables in the historical database we can create a training set, with item-table features and Profit as the target:
ItemID  Category  R&D Expense  Profit (target)
1       Laptop    500K         12,000K
2       Desktop   100K         8,000K
…       …         …            …
An example regression model:
$\text{Profit} = \beta_0 + \beta_1\,\text{Laptop} + \beta_2\,\text{Desktop} + \beta_3\,\text{RdExpense}$
• There is much room for accuracy improvement!
Using Regional Features
• Example region: [1st week, HK]
• Regional features:
– Regional Profit: The 1st week profit in HK
– Regional Ad Expense: The 1st week ad expense in HK
• A possibly more accurate model:
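$\text{Profit}[\text{1yr, All}] = \beta_0 + \beta_1\,\text{Laptop} + \beta_2\,\text{Desktop} + \beta_3\,\text{RdExpense} + \beta_4\,\text{Profit}[\text{1wk, HK}] + \beta_5\,\text{AdExpense}[\text{1wk, HK}]$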
• Problem: Which region should we use?
– The smallest region that improves the accuracy the most
– We give each candidate region a cost
– The most “cost-effective” region is the bellwether region
Basic Bellwether Problem
[Figure: a cube of candidate regions over Time (weeks 1, 2, 3, 4, 5, …, 52) and Location (KR, …, USA: WI, …, WY, …), with the region r = [1-2, USA] highlighted.]
• Features of item i in region r, computed from DB: aggregate over data records in region r = [1-2, USA]
ItemID  Category  …  Profit[1-2, USA]  …
i       Desktop   …  45K               …
• Target of item i, computed from DB: Total Profit in [1-52, All]
ItemID  Total Profit
i       2,000K
For each region r, build a predictive model hr(x); and then choose the bellwether region such that:
• Coverage(r), the fraction of all items in the region, ≥ minimum coverage support
• Cost(r, DB) ≤ cost threshold
• Error(hr) is minimized
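The selection step can be sketched as a constrained minimum. The example below assumes that the cost, coverage, and model error of each candidate region have already been computed; the `Candidate` class, the function `pick_bellwether`, and all the numbers are hypothetical and only illustrate the three criteria.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    region: str       # e.g. "[1-2, WI]" = weeks 1-2 in Wisconsin
    cost: float       # Cost(r, DB)
    coverage: float   # fraction of all items that have data in the region
    error: float      # Error(h_r), e.g. cross-validated RMSE

def pick_bellwether(cands: List[Candidate], budget: float,
                    min_coverage: float) -> Optional[Candidate]:
    """Return the feasible region with the smallest error, or None if none is feasible."""
    feasible = [c for c in cands if c.cost <= budget and c.coverage >= min_coverage]
    return min(feasible, key=lambda c: c.error) if feasible else None

# Made-up candidates for illustration only.
CANDIDATES = [
    Candidate("[1-1, NY]", cost=10, coverage=0.90, error=5000),
    Candidate("[1-2, WI]", cost=20, coverage=0.80, error=3000),
    Candidate("[1-8, MD]", cost=60, coverage=0.95, error=1200),
    Candidate("[1-52, All]", cost=500, coverage=1.00, error=0),
    Candidate("[1-1, WY]", cost=5, coverage=0.20, error=900),
]

print(pick_bellwether(CANDIDATES, budget=85, min_coverage=0.5).region)  # [1-8, MD]
print(pick_bellwether(CANDIDATES, budget=25, min_coverage=0.5).region)  # [1-2, WI]
```

The full-year, all-locations region has zero error by construction but is far over budget, and the cheapest region fails the coverage threshold, so neither is selected.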
Experiment on a Mail Order Dataset
Error-vs-Budget Plot
[Figure: RMSE (0 to 30000) versus Budget (5 to 85) for Bel Err, Avg Err, and Smp Err, with the bellwether region [1-8 month, MD] marked.]
• Bel Err: The error of the bellwether region found using a given budget
• Avg Err: The average error of all the cube regions with costs under a given budget
• Smp Err: The error of a set of randomly sampled (non-cube) regions with costs under a given budget
(RMSE: Root Mean Square Error)
Experiment on a Mail Order Dataset
Uniqueness Plot
[Figure: fraction of indistinguishables (0 to 0.9) versus Budget (5 to 85), with [1-8 month, MD] marked.]
• Y-axis: Fraction of regions that are as good as the bellwether region
– The fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region
• We have 99% confidence that [1-8 month, MD] is a quite unusual bellwether region
Basic Bellwether Computation
• OLAP-style bellwether analysis
– Candidate regions: Regions in a data cube
– Queries: OLAP-style aggregate queries
• E.g., Sum(Profit) over a region
[Figure: candidate regions in a cube over Time (weeks 1, 2, 3, 4, 5, …) and Location (KR, …, USA: WI, …, WY, …).]
• Efficient computation:
– Use iceberg cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01)
• Infeasible regions: Regions with cost > B or coverage < C
– Share computation by generating the features and target values for all the feasible regions all together
• Exploit distributive and algebraic aggregate functions
• Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
Subset-Based Bellwether Prediction
• Motivation: Different subsets of items may have
different bellwether regions
– E.g., The bellwether region for laptops may be
different from the bellwether region for clothes
• Two approaches:
– Bellwether Tree
– Bellwether Cube
[Figure: a bellwether tree that splits items on R&D Expense (at 50K, with Yes/No branches) and on Category (e.g., Laptop, Desktop), with a bellwether region at each leaf (e.g., [1-2, WI], [1-1, NY], [1-3, MD]); and a bellwether cube over Category (Software: OS, …; Hardware: Laptop, …) by R&D Expenses (Low, Medium, High), with a bellwether region in each cell (e.g., [1-3, CA], [1-1, NY], [1-2, CA] for OS and [1-4, MD], [1-1, NY], [1-3, WI] for Laptop).]
Characteristics of Bellwether Trees & Cubes
Dataset generation:
• Use random tree to generate different bellwether regions for different subsets of items
Parameters:
• Noise
• Concept complexity: # of tree nodes
Result:
• Bellwether trees & cubes have better accuracy than basic bellwether search
• Increase noise → increase error
• Increase complexity → increase error
[Figure: two plots of RMSE for basic, cube, and tree: RMSE versus Noise (with 15 nodes), and RMSE versus Number of nodes (with noise level 0.5).]
Efficiency Comparison
[Figure: running time (Sec, 0 to 3000) versus Thousands of examples (100 to 300). Naïve computation methods: naive cube, naive tree. Our computation techniques: RF tree, single-scan cube, optimized cube.]
Scalability
[Figure: two plots of running time (Sec) versus Millions of examples (2.5 to 10): one for single-scan cube and optimized cube, one for RF tree.]
Exploratory Mining:
Prediction Cubes
with Bee-Chung Chen, Lei Chen, and Yi Lin
In VLDB 05
The Idea
• Build OLAP data cubes in which cell values represent
decision/prediction behavior
– In effect, build a tree for each cell/region in the cube—
observe that this is not the same as a collection of trees
used in an ensemble method!
– The idea is simple, but it leads to promising data mining
tools
– Ultimate objective: Exploratory analysis of the entire space
of “data mining choices”
• Choice of algorithms, data conditioning parameters …
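The idea of a cube whose cell values describe model behavior can be sketched with a toy example. To stay self-contained, the sketch uses a majority-class predictor as a stand-in for the decision tree, and uses accuracy on a labeled test set as the cell measure. The data, the function names (`train_majority`, `prediction_cube`), and the `cell_of` roll-up argument are all hypothetical.

```python
from collections import Counter, defaultdict

def train_majority(rows):
    """Stand-in 'model': always predicts the most common label in its training subset."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label

def prediction_cube(data, test_set, cell_of):
    """Map each cell to the test accuracy of a model trained only on that cell's data.

    data: iterable of (z, x, y) with z = dimension values, x = predictors, y = class.
    cell_of: maps z to the cell it belongs to at the chosen granularity.
    """
    subsets = defaultdict(list)
    for z, x, y in data:
        subsets[cell_of(z)].append((x, y))
    cube = {}
    for cell, rows in subsets.items():
        h = train_majority(rows)
        cube[cell] = sum(h(x) == y for x, y in test_set) / len(test_set)
    return cube

# Made-up data: z = (Location, Time), x = predictors, y = Approval.
DATA = [
    (("USA", "Dec 04"), {"income": "high"}, "Yes"),
    (("USA", "Dec 04"), {"income": "low"}, "Yes"),
    (("CA", "Dec 04"), {"income": "low"}, "No"),
]
TEST = [({"income": "high"}, "Yes"), ({"income": "low"}, "Yes"), ({"income": "low"}, "No")]

fine = prediction_cube(DATA, TEST, cell_of=lambda z: z)                 # [Country, Month]
coarse = prediction_cube(DATA, TEST, cell_of=lambda z: ("ALL", z[1]))   # roll up Location
print(fine)
print(coarse)
```

Unlike an ensemble, each model here is a separate object of study: the cube is browsed by rolling up or drilling down to see how the measure changes across subsets.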
Example (1/7): Regular OLAP
Goal: Look for patterns of unusually high numbers of applications:
Z: Dimensions (Location, Time); Y: Measure (# of App.)
Location  Time     # of App.
…         …        …
AL, USA   Dec, 04  2
…         …        …
WY, USA   Dec, 04  3
Dimension hierarchies:
Location: All > Country (Japan, USA, Norway) > State (AL, …, WY)
Time: All > Year (85, 86, …, 04) > Month (Jan., 86, …, Dec., 86)
Example (2/7): Regular OLAP
Goal: Look for patterns of unusually high numbers of applications:
Z: Dimensions (Location, Time); Y: Measure (# of App.)
Location  Time     # of App.
…         …        …
AL, USA   Dec, 04  2
…         …        …
WY, USA   Dec, 04  3
[Figure: a cube of loan-application counts at level [Country, Month] (rows CA, USA, …; columns Jan … Dec of 2004 and 2003). Roll up gives coarser regions at [Country, Year] (e.g., CA: 100 in 04 and 90 in 03; USA: 80 in 04 and 90 in 03). Drill down gives finer regions by state or province and month (CA: AB, …, YT; USA: AL, …, WY).]
Cell value: Number of loan applications
Example (3/7): Decision Analysis
Goal: Analyze a bank’s loan decision process w.r.t. two dimensions: Location and Time
Fact table D, with Z: Dimensions (Location, Time), X: Predictors (Race, Sex, …), Y: Class (Approval)
Location  Time     Race   Sex  …  Approval
AL, USA   Dec, 04  White  M    …  Yes
…         …        …      …    …  …
WY, USA   Dec, 04  Black  F    …  No
Cube subset → Model h(X, Z(D)), e.g., decision tree
Dimension hierarchies:
Location: All > Country (Japan, USA, Norway) > State (AL, …, WY)
Time: All > Year (85, 86, …, 04) > Month (Jan., 86, …, Dec., 86)
Example (3/7): Decision Analysis
• Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
– Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, “Are the predictions of this classifier closely correlated with race?”
• Are there branches and times with decision making reminiscent of 1950s Alabama?
– Requires comparison of classifiers trained using different subsets of data.
Example (4/7): Prediction Cubes
[Figure: a prediction cube at level [Country, Month]: rows CA, USA, …; columns Jan … Dec of 2004 and 2003; each cell holds a value between 0 and 1.]
1. Build a model using data from USA in Dec., 2004
2. Evaluate that model
Data [USA, Dec 04](D):
Location  Time     Race   Sex  …  Approval
AL, USA   Dec, 04  White  M    …  Y
…         …        …      …    …  …
WY, USA   Dec, 04  Black  F    …  N
Model h(X, [USA, Dec 04](D)), e.g., decision tree
Measure in a cell:
• Accuracy of the model
• Predictiveness of Race measured based on that model
• Similarity between that model and a given model
Example (5/7): Model-Similarity
Given:
- Data table D
- Target model h0(X)
- Test set D w/o labels
Data table D:
Location  Time     Race   Sex  …  Approval
AL, USA   Dec, 04  White  M    …  Yes
…         …        …      …    …  …
WY, USA   Dec, 04  Black  F    …  No
Test set (no labels):
Race   Sex  …
White  F    …
…      …    …
Black  M    …
For each cell at Level: [Country, Month], build a model on that cell's data, then compare its predictions on the test set with those of h0(X) (e.g., Yes, …, Yes versus No, …, Yes) to obtain a Similarity value.
[Figure: the resulting cube of similarity values; rows CA, USA, …; columns Jan … Dec of 2004 and 2003.]
The loan decision process in USA during Dec 04 was similar to a discriminatory decision model h0(X)
Example (6/7): Predictiveness
Given:
- Data table D
- Attributes V
- Test set D w/o labels
Data table D:
Location  Time     Race   Sex  …  Approval
AL, USA   Dec, 04  White  M    …  Yes
…         …        …      …    …  …
WY, USA   Dec, 04  Black  F    …  No
Test set (no labels):
Race   Sex  …
White  F    …
…      …    …
Black  M    …
For each cell at Level: [Country, Month], build models h(X) and h(X − V) on that cell's data, then compare their predictions on the test set (e.g., Yes, No, …, Yes versus Yes, No, …, No) to obtain the Predictiveness of V.
[Figure: the resulting cube of predictiveness values; rows CA, USA, …; columns Jan … Dec of 2004 and 2003.]
Race was an important predictor of loan approval decision in USA during Dec 04
Example (7/7): Prediction Cube
[Figure: a prediction cube at level [Country, Month] (rows CA, USA, …; columns Jan … Dec of 2004 and 2003). Roll up gives the cube at [Country, Year] (columns 04, 03). Drill down gives the cube by state or province and month (CA: AB, …, YT; USA: AL, …, WY), where the WY cells hold notably high values (0.9, 0.7, 0.8).]
Cell value: Predictiveness of Race
Efficient Computation
• Reduce prediction cube computation to data
cube computation
– Represent a data-mining model as a distributive or
algebraic (bottom-up computable) aggregate
function, so that data-cube techniques can be
directly applied
Bottom-Up Data Cube Computation
        1985  1986  1987  1988  All
Norway    10    30    20    24   84
…         23    45    14    32  114
USA       14    32    42    11   99
All       47   107    76    67  297
Cell Values: Numbers of loan applications
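The roll-up on this slide can be reproduced with a short sketch: the coarser cells are computed from the finer cells alone, without rescanning the underlying records. The variable names are hypothetical, and the country whose name is elided on the slide is kept as "…".

```python
# Base cells at level [Country, Year]: numbers of loan applications from the slide.
BASE = {
    "Norway": {1985: 10, 1986: 30, 1987: 20, 1988: 24},
    "…":      {1985: 23, 1986: 45, 1987: 14, 1988: 32},   # country name elided on the slide
    "USA":    {1985: 14, 1986: 32, 1987: 42, 1988: 11},
}

# Roll up Time: [Country, All] is the sum of that country's yearly cells.
by_country = {country: sum(cells.values()) for country, cells in BASE.items()}

# Roll up Location: [All, Year] is the sum over countries for each year.
by_year = {}
for cells in BASE.values():
    for year, n in cells.items():
        by_year[year] = by_year.get(year, 0) + n

# [All, All] can be computed from either set of intermediate cells.
grand_total = sum(by_country.values())

print(by_country)   # {'Norway': 84, '…': 114, 'USA': 99}
print(by_year)      # {1985: 47, 1986: 107, 1987: 76, 1988: 67}
print(grand_total)  # 297
```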
Functions on Sets
• Bottom-up computable functions: Functions that can be
computed using only summary information
• Distributive function: $\alpha(X) = F(\{\alpha(X_1), \ldots, \alpha(X_n)\})$
– $X = X_1 \cup \cdots \cup X_n$ and $X_i \cap X_j = \emptyset$
– E.g., $Count(X) = Sum(\{Count(X_1), \ldots, Count(X_n)\})$
• Algebraic function: $\alpha(X) = F(\{G(X_1), \ldots, G(X_n)\})$
– $G(X_i)$ returns a fixed-length vector of values
– E.g., $Avg(X) = F(\{G(X_1), \ldots, G(X_n)\})$
• $G(X_i) = [Sum(X_i), Count(X_i)]$
• $F(\{[s_1, c_1], \ldots, [s_n, c_n]\}) = Sum(\{s_i\}) / Sum(\{c_i\})$
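The difference between the two kinds of function is easy to see in code. The sketch below computes Count as a distributive function and Avg as an algebraic function over a partitioned set; the partition values and the helper names `g` and `avg_from_summaries` are made up for illustration.

```python
# A set X split into disjoint partitions X1, X2, X3 (made-up values).
PARTITIONS = [[1, 2, 3], [10], [4, 6]]

# Distributive: Count(X) is the sum of the partition counts.
count_x = sum(len(p) for p in PARTITIONS)

def g(xs):
    """G(Xi): fixed-length summary [Sum(Xi), Count(Xi)]."""
    return (sum(xs), len(xs))

def avg_from_summaries(summaries):
    """F: combine the [sum, count] summaries into the overall average."""
    total = sum(s for s, _ in summaries)
    n = sum(c for _, c in summaries)
    return total / n

avg_x = avg_from_summaries([g(p) for p in PARTITIONS])

# For comparison: averaging the partition averages gives a different (wrong) answer,
# which is why Avg is algebraic rather than distributive.
avg_of_avgs = sum(sum(p) / len(p) for p in PARTITIONS) / len(PARTITIONS)

print(count_x)      # 6
print(avg_x)        # 26 / 6
print(avg_of_avgs)  # 17 / 3
```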
Scoring Function
• Represent a model as a function of sets
• Conceptually, a machine-learning model h(X; Z(D)) is
a scoring function Score(y, x; Z(D)) that gives each
class y a score on test example x
– h(x; Z(D)) = argmax y Score(y, x; Z(D))
– $Score(y, x; Z(D)) \approx p(y \mid x, Z(D))$
– Z(D): The set of training examples (a cube subset of D)
Machine-Learning Models
• Naïve Bayes:
– Scoring function: algebraic
• Kernel-density-based classifier:
– Scoring function: distributive
• Decision tree, random forest:
– Neither distributive, nor algebraic
• PBE: Probability-based ensemble (new)
– To make any machine-learning model distributive
– Approximation
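To see why a Naïve Bayes scoring function is algebraic, note that it depends on the training data only through counts, and counts from disjoint cells can be added. The sketch below is a simplified illustration with categorical attributes and Laplace smoothing; the example data, the smoothing parameters, and the names `summarize`, `merge`, `score`, and `predict` are assumptions rather than the method described in the talk.

```python
import math
from collections import Counter

def summarize(examples):
    """G(Xi): class counts and (attribute, value, class) counts for one cell."""
    class_counts, feat_counts = Counter(), Counter()
    for x, y in examples:
        class_counts[y] += 1
        for attr, val in x.items():
            feat_counts[(attr, val, y)] += 1
    return class_counts, feat_counts

def merge(summaries):
    """Combine the summaries of disjoint cells by adding their counts."""
    class_counts, feat_counts = Counter(), Counter()
    for cc, fc in summaries:
        class_counts.update(cc)
        feat_counts.update(fc)
    return class_counts, feat_counts

def score(y, x, summary, alpha=1.0, n_values=2):
    """Log of the smoothed Naive Bayes score for class y on example x."""
    class_counts, feat_counts = summary
    n = sum(class_counts.values())
    s = math.log((class_counts[y] + alpha) / (n + alpha * len(class_counts)))
    for attr, val in x.items():
        s += math.log((feat_counts[(attr, val, y)] + alpha)
                      / (class_counts[y] + alpha * n_values))
    return s

def predict(x, summary):
    """h(x) = argmax over classes of Score(y, x)."""
    return max(summary[0], key=lambda y: score(y, x, summary))

# Made-up training examples in two lowest-level cells.
CELL_A = [({"income": "high"}, "Yes"), ({"income": "high"}, "Yes")]
CELL_B = [({"income": "low"}, "No"), ({"income": "low"}, "No")]

rolled_up = merge([summarize(CELL_A), summarize(CELL_B)])
print(rolled_up == summarize(CELL_A + CELL_B))   # True
print(predict({"income": "high"}, rolled_up))    # Yes
```

The model for a coarser cell is obtained from the summaries of its sub-cells, so the training records never need to be revisited.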
Probability-Based Ensemble
[Figure: a decision tree built directly on the cell [WA, 85], compared with the PBE version of the decision tree on [WA, 85], which combines decision trees built on the lowest-level cells (WA by month, Jan … Dec of 1985).]
Efficiency Comparison
[Figure: Execution Time (sec, 0 to 2500) versus # of Records (40K to 200K). Using exhaustive method: RFex, KDCex, NBex, J48ex. Using bottom-up score computation: NB, KDC, RFPBE, J48PBE.]
Bellwether Analysis
Conclusions
Related Work: Building Models on
OLAP Results
• Multi-dimensional regression [Chen, VLDB 02]
– Goal: Detect changes of trends
– Build linear regression models for cube cells
• Step-by-step regression in stream cubes [Liu, PAKDD 03]
• Loglinear-based quasi cubes [Barbara, J. IIS 01]
– Use loglinear model to approximately compress dense regions of
a data cube
• NetCube [Margaritis, VLDB 01]
– Build Bayes Net on the entire dataset to approximately answer count queries
Related Work (Contd.)
• Cubegrades [Imielinski, J. DMKD 02]
– Extend cubes with ideas from association rules
– How does the measure change when we rollup or drill down?
• Constrained gradients [Dong, VLDB 01]
– Find pairs of similar cell characteristics associated with big
changes in measure
• User-cognizant multidimensional analysis [Sarawagi,
VLDBJ 01]
– Help users find the most informative unvisited regions in a data
cube using max entropy principle
• Multi-Structural DBs [Fagin et al., PODS 05, VLDB 05]
• Experiment Databases: Towards an Improved
Experimental Methodology in Machine Learning
[Blockeel & Vanschoren, PKDD 2007]
Take-Home Messages
• Promising exploratory data analysis paradigm:
– Can use models to identify interesting subsets
– Concentrate only on subsets in cube space
• Those are meaningful subsets, tractable
– Precompute results and provide the users with an interactive
tool
• A simple way to plug “something” into cube-style
analysis:
– Try to describe/approximate “something” by a distributive or
algebraic function
Conclusion
• Hierarchies are widely used, and a promising tool to help us deal with
– Data sparsity
– Data imprecision and uncertainty
– Exploratory analysis
– “Experiment” planning and management
• Area is as yet under-appreciated
– Lots of work on taxonomies and how to use them,
but there are many novel ways of using them that
have not received enough attention