Theory, Practice & Methodology
of Relational Database
Design and Programming
Copyright © Ellis Cohen 2002-2006
Introduction to
Data Mining
These slides are licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 2.5 License.
For more information on how you may use them,
please see http://www.openlineconsult.com/db
Topics
Types of Data Mining
Data Mining Activities
Estimation:
Classification & Approximation
Decision Trees
Clustering
Targeted Clustering &
Association Rules
Market Basket Analysis
Scoring Market Basket Analysis
Types of
Data Mining
Goals of Data Mining
Find patterns and relationships
among objects represented by
data points in a dataset
Novel / Significant
"People who buy notebooks buy pencils" (obvious) vs.
"men who buy diapers at night also buy beer" (novel)
Understandable / Useful
Tradeoff between accuracy and simplicity
Causality vs Chance
All data has random variations, which can show
up as spurious patterns and relationships
Types of Data Mining Covered
Estimation
(Classification & Approximation)
Classify/categorize objects, or
approximate some value associated
with objects based on their features
Clustering
Find groups of objects, [some of] whose
features are all similar to one another
Market Basket Analysis
Find collections of items which frequently
occur together (in the same basket) &
formulate the cause
Objects & Features
custid   age   income   …
3043     23    5000
3174     44    6190
…        …     …

The entire collection of data is called a dataset.
Object: Customer (each row represents one data point)
Features (also called Variables): the columns — age, income, …
Classification
Determine target category of an
object, based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarType
Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars
Objects: Customers
[Figure: scatter plot of customers over Age (horizontal) vs Monthly Income (vertical), points marked x, o, #. Regions are drawn in which one category predominates — an x region, an o region, and a # region. Note the difficult separation.]
When a new customer comes along, you can
categorize what kind of car they're likely to buy
based upon which region they fall into
Approximation
Approximate some (continuous) target
value of an object based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarCost -- amt spent by customer on a car
Objects: customers
Find a function
f(income,age) that
gives a good estimate
of the actual value
[Figure: scatter plot of customers ('o' points) over Age (horizontal) vs Monthly Income (vertical), with a fitted function f(income, age) approximating CarCost.]
When a new customer comes along, you can
apply f to their income & age to estimate how
much they're likely to spend on a car
Applications of Estimation
Sales
Estimate how much a customer is likely to spend
Determine amount of effort/money to put into
capturing a potential customer
Credit
Decide whether to approve a credit application
Determine someone's credit limit
Evaluate whether a credit card is stolen
Medicine
Determine whether a patient has a disease based
on symptoms and test results
Assess which protocol should be used to treat an
illness
Clustering
Find groups of objects, [some of] whose
features are all similar to one another
Objects: Corvette buyers
Features: Age, Monthly Income

[Figure: scatter plot of 'x' points over Age vs Monthly Income; a dense cluster of buyers is identified, e.g. by center & radius.]
Applications of Clustering
Marketing
Target advertising for each cluster;
use media whose demographics match cluster
Astronomy
Find galactic clusters;
explore large scale structure of the universe
Epidemiology
Find things in common among people with the
same illness (esp location)
Government Policy & Planning
Identify regions with similar features (e.g.
economic, industrial, geographic) for policies
and planning (e.g. land use, economic
assistance, office locations, bus routes)
Market Basket Analysis
Organize dataset into baskets.
Find groups of items which frequently
occur together in baskets
Basket: Daily shopping by a customer
Item: Product purchased

11-Feb-99  Joe    Diapers
11-Feb-99  Joe    Formula
11-Feb-99  Joe    Beer
…
11-Feb-99  Simha  Pretzels
11-Feb-99  Simha  Beer
…
13-Feb-99  Sasha  Diapers
13-Feb-99  Sasha  Beer

Diapers and beer occur together frequently.
Rules capture causality:
Beer → Diapers? NO! People who buy beer are not more likely to buy diapers.
Diapers → Beer? YES! People who buy diapers are more likely to buy beer (esp men at night).
Applications of
Market Basket Analysis
Marketing
Baskets: Daily Shopping
Items: Products Purchased
Controlling Customer Traversal in Stores
Coupons
Recommendations (e.g. Amazon)
Semantic Nets
Baskets: Documents
Items: Words/Phrases
Use for Semantic Search
Plagiarism Detection
Baskets: Sentences
Items: Documents
Data Mining Approaches
               Deterministic                Heuristic
Clustering     Agglomerative, CLIQUE        K-Means
Estimation     Regression,                  Neural Nets,
               Bayesian Networks,           Genetic Algorithms
               K-Nearest Neighbor,
               Decision Trees
Mkt Basket     Apriori
Analysis

Deterministic: produces the "best" possible model or prediction
Heuristic: produces a reasonably good model or prediction
Data Mining
Activities
Data Mining Activities Diagram
[Diagram: Sample Data flows into Discovery (Modelling), which produces a Model. The Model drives an Analyzer, used for Forensics over the sample data (reporting Anomalies), and a Detector/Predictor, used for Detection and Prediction over Live Data (emitting Predictions).]
Data Mining Activities
Discovery/Modeling
Using an existing sample dataset to develop a model
which characterizes or describes the data
Forensics
Finding anomalous data points in an existing sample
dataset – those that do not match the discovered
model, and determining the cause (which may
involve another round of Discovery/Modeling)
Prediction
Using the discovered model to predict an unknown
feature value of a new live data point
Detection
Detecting new live data points which are anomalous –
those that do not match the discovered model – and
determining the cause (more Discovery/Modeling)
Applications of
Modeling & Prediction
Clustering
Model to decide on a targeted advertising
program
Predict whether a web user is in a target group
for a product, and if so, show them the ad
prepared for that group.
Market Basket Analysis
Use the model to decide on store layout, sales &
promotions
Use Predictions to deliver Personalized Coupons
at checkout
Classification/Approximation
Model to decide marketing program for less-focused mass media advertising
Predict how an individual potential customer will
behave, and personalize sales approach to that
one customer
Data Mining Planning
Data, Mining Type & Activity Selection
What data do you want to mine and how do you
want to mine it?
Data Cleaning & Transformation
Does the data need to be prepared so the mining
will work correctly?
Evaluation
How will you be able to tell whether the results
are good?
Visualization & Scoring
Mining Type & Activity Details
What approach will you use to implement the
mining type & activity, and with what control
parameters?
Integrated DW/DM Process
[Diagram: Data Sources → (ETL) → Data Warehouse → (More ETL) → Data Mining Store. The Data Warehouse feeds OLAP & Visualization; the Data Mining Store feeds Data Mining, which may be:
• Episodic
• Strategic
• Continuous]
Estimation:
Classification &
Approximation
Classification
Determine target category of an
object, based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarType
Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars
Objects: Customers
[Figure (repeated from the earlier Classification slide): scatter plot of customers over Age vs Monthly Income, marked x, o, #, with an x region, an o region, and a # region drawn where each category predominates. Note the difficult separation.]
When a new customer comes along, you can
categorize what kind of car they're likely to buy
based upon which region they fall into
Approximation
Approximate some (continuous) target
value of an object based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarCost -- amt spent by customer on a car
Objects: customers
Find a function
f(income,age) that
gives a good estimate
of the actual value
[Figure (repeated from the earlier Approximation slide): scatter of 'o' points over Age vs Monthly Income, with a fitted function f(income, age).]
When a new customer comes along, you can
apply f to their income & age to estimate how
much they're likely to spend on a car
Estimation Activities
Modeling
Come up with a way of estimating target value of data items (only
known for sampled data, not live data) based on other features
Forensics
Understand why some data items have values which significantly
differ from the estimated value
Prediction
Estimate the (unknown) target value of live data items based on
the known features
Detection
When the live data's unknown target value becomes known, find
items whose target value doesn't match its estimated value
Model Characteristics
Characteristics of Models:
• Transparent (Descriptive): Understandable
• Opaque (Predictive): Not particularly
understandable; its sole purpose is to drive a
predictor/detector/analyzer
• Raw Data: The retained sample data is provided
directly to the predictor/detector/analyzer
Training & Testing
[Figure: the Sample Set is divided into a Training Set and several Testing Sets.]
Use the training set to
build the model
Use testing sets to tweak
and validate the model
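A minimal sketch of this split, assuming the sample set is simply a Python list of records (the fold sizes and seed are illustrative):

import random

def split_sample(sample, n_test_sets=5, test_size=100, seed=42):
    # Shuffle the sample set, carve off several testing sets,
    # and keep the remainder as the training set.
    rows = list(sample)
    random.Random(seed).shuffle(rows)
    test_sets = [rows[i * test_size:(i + 1) * test_size]
                 for i in range(n_test_sets)]
    training_set = rows[n_test_sets * test_size:]
    return training_set, test_sets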
Estimation Approaches
Models        Classification              Approximation
Raw Data      K Nearest Neighbor          K Nearest Neighbor
Transparent   Bayesian Network,           Linear Regression,
              Decision Trees              Regression Trees
Opaque        Feed-Forward Neural Networks,
              Genetic Algorithms
Decision
Trees
Classification
Determine target category of an
object, based on its features
Predictor Variables: Age, MonthlyIncome
Target Variable: CarType
Category values: x - Luxury Cars o - Midrange Cars # - Cheap Cars
Objects: Customers
[Figure (repeated from the earlier Classification slide): scatter plot of customers over Age vs Monthly Income, marked x, o, #, with an x region, an o region, and a # region drawn where each category predominates. Note the difficult separation.]
When a new customer comes along, you can
categorize what kind of car they're likely to buy
based upon which region they fall into
Motivating Decision Trees
[Figure: a sequence of panels showing the same x/o/# scatter being partitioned step by step: (1) a first axis-parallel split divides the data, then (2) further splits refine each side until most regions are dominated by a single category (x, o, or #).]
Decision Tree Construction
[Figure: the partitioned x/o/# scatter (axes: age, income) alongside the corresponding decision tree. The root splits on income < 3000; further splits include income < 6000, age < 35, age < 45, age < 52, and age < 54; leaves are labeled with the predominant category (x, o, or #).]
Decision Tree Algorithm
Given a set of variables to
consider, in each iteration the
algorithm
– decides which variable to split on, and
– where (at what value) to split it
in order to find the best possible
separation of tuples
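A minimal sketch of one iteration's split search in Python, scoring candidate binary splits with the Gini criterion (one of the criteria listed on the next slide); the dict-per-tuple layout is an assumption:

from collections import Counter

def gini(labels):
    # Gini impurity of a collection of target labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, variables, target):
    # Try every variable and every observed value as a binary
    # split point; keep the (variable, value) pair that minimizes
    # the weighted Gini impurity of the two sides.
    best, best_score = None, float("inf")
    n = len(rows)
    for var in variables:
        for value in {row[var] for row in rows}:
            left = [row[target] for row in rows if row[var] < value]
            right = [row[target] for row in rows if row[var] >= value]
            if not left or not right:
                continue
            score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            if score < best_score:
                best, best_score = (var, value), score
    return best

The algorithm then recurses on the tuples falling on each side of the chosen split.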
Decision Tree
Construction Issues
Splitting
Which variable to split & where
Gini, Twoing, Entropy, Chi-Square, Gain, …
Binary or n-ary splits,
esp for categorical variables
Use multiple approaches and pick best
Biased Splitting
Non-symmetric misclassification costs
safety misclassification (e.g. O rings)
Category Importance
fraud detection
Penalties on Variables
acquisition cost, questionable values
Case weights
different distributions of sample & live data
Linear Combination Splitting
[Figure: two versions of the x/o/# scatter. Axis-parallel splits separate the categories awkwardly; a single linear-combination split, income < 40*age + 400, gives a much cleaner separation.]
Overfitting
[Figure: an overgrown decision tree (splits such as income < 3000, income < 6000, income < 4500, age < 52, age < 59, age < 61, …) whose leaf regions chase individual points in the scatter.]

Prevent overfitting by
• Stopping
Need criteria
Danger of stopping too early
• Pruning
Build full tree
Cut back when testing
Classification Rule Extraction
[Figure: the partitioned x/o/# scatter (axes: age, income), with one region highlighted.]
Extract a rule for each region
(25 ≤ age ≤ 35) Λ (income < 3000) →
CarType = 'midrange' (i.e. 'o')
Support:
7/94 ≈ 7.4%
Confidence:
6/7 ≈ 86%
Support( A ) = # of objects satisfying A / # of total objects:
Does the region determined by A have enough pts to matter?
Confidence( A → B ) = # satisfying A & B / # satisfying A:
How confident are we that a point in the region determined
by A also satisfies B?
Classification Ruleset
(income < 3000) & (age < 35) → '#'
(income < 3000) & (35 <= age < 45) → 'o'
(income < 3000) & (45 <= age) → '#'
(3000 <= income < 4500) & (age < 20) → 'o'
…
((income < 3000) & (35 <= age < 45) |
(3000 <= income < 4500) & (age < 20)) |
… → 'o'
(income < 3000) &
((age < 35) | (45 <= age)) … → '#'
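A ruleset like this translates directly into a predictor. A sketch covering only the rules shown above (the function name and the fall-through default are illustrative):

def classify(income, age):
    # Apply the extracted classification rules in order;
    # each rule maps a region of the (income, age) plane
    # to a car type: '#' cheap, 'o' midrange, 'x' luxury.
    if income < 3000 and age < 35:
        return '#'
    if income < 3000 and 35 <= age < 45:
        return 'o'
    if income < 3000 and 45 <= age:
        return '#'
    if 3000 <= income < 4500 and age < 20:
        return 'o'
    return None  # remaining rules elided on the slide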
Using Decision Trees
The Decision Tree
(or the extracted Classification Rules)
is a Transparent (understandable) Model.
How is the Model used for
Forensics, Prediction & Detection?
Clustering
Clustering
Find groups of objects, [some of] whose
features are all similar to one another
Objects: Corvette buyers
Features: Age, Monthly Income

[Figure (repeated from the earlier Clustering slide): scatter of 'x' points over Age vs Monthly Income; the cluster is identified, e.g. by center & radius.]
Clustering Activities
Modeling
Description of each cluster
• Cluster boundaries
• For compact clusters: subspaces + centroid + radius
Forensics
Identify and explain outliers (points not in a cluster)
Detection/Prediction
Does live data cluster in the same way?
Modeling Clusters
Bounds-Based
Cluster 1:
[age: 21 ± 2, monthlyIncome: 5000 ± 1000]
Cluster 2:
[age: 54 ± 3, monthlyIncome: 6400 ± 1200]
Centroid/Radius-Based
Cluster 1:
centroid: [age: 21, monthlyIncome: 5000],
radius: .12
Cluster 2:
centroid: [age: 54, monthlyIncome: 6400],
radius: .15
Centroid/Distance-Based approach implies that
• clusters are circular (too strong)
• we need a uniform distance metric (needed anyway)
Distance Metric
Clustering requires a distance metric
Given 2 data points, pt1, pt2
Compute distance d( pt1, pt2 )
Distance in a single dimension
Easy for quantitative variables (v2-v1)
Harder for categorical variables
Hardest for structured variables
(e.g. similarity metrics for text, images)
Distance over multiple dimensions
More than just Pythagoras …
Categorical Variable Distance
Ordinal Variables [ordered]
v2 - v1 doesn't work
Use lookup table or function f(v1,v2)
Nominal Variables [unordered]
– Non-hierarchical [e.g. gender]
d(v1,v2) = 0, if v1=v2
1, otherwise
– Hierarchical
Use distance based upon hierarchy
d(p1,p2) [p1 and p2 are prodid] = for example
0, if p1 = p2, else
.4, if Prodtyp(p1) = Prodtyp(p2), else
.7, if Subcat(p1) = Subcat(p2), else
.9, if Category(p1) = Category(p2)
1, otherwise
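A sketch of this hierarchical distance in Python; the dictionaries standing in for the Prodtyp, Subcat, and Category lookups are hypothetical:

def make_hierarchical_distance(prodtyp, subcat, category):
    # Build d(p1, p2) from dicts mapping each prodid to its
    # product type, subcategory, and category.
    def d(p1, p2):
        if p1 == p2:
            return 0.0
        if prodtyp[p1] == prodtyp[p2]:
            return 0.4
        if subcat[p1] == subcat[p2]:
            return 0.7
        if category[p1] == category[p2]:
            return 0.9
        return 1.0
    return d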
Multidimensional Distance
x = (x1, x2, …, xn)
y = (y1, y2, …, yn)
Euclidean Distance
d(x,y) = sqrt( Σ (xi - yi)² )
What if dimensions aren't commensurate?
Scale all dimensions
– Use weights based upon importance, or
– scale so values lie between 0 and 1, or
– d(x,y) = sqrt( Σ ((xi - yi)/σi)² )
where σi is the standard deviation of the ith
dimension
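A minimal sketch of the scaled distance, estimating each dimension's standard deviation from the dataset itself (points are equal-length tuples of numbers):

import math

def std_devs(points):
    # Per-dimension standard deviation of a list of points
    n, dims = len(points), len(points[0])
    means = [sum(p[i] for p in points) / n for i in range(dims)]
    return [math.sqrt(sum((p[i] - means[i]) ** 2 for p in points) / n)
            for i in range(dims)]

def scaled_distance(x, y, sigma):
    # Euclidean distance with each dimension divided by its
    # standard deviation, so dimensions become commensurate
    return math.sqrt(sum(((xi - yi) / s) ** 2
                         for xi, yi, s in zip(x, y, sigma)))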
Types of Clustering
Partition-Driven
(primarily for O.R.)
Partition data points
Score: based on compactness
Either
• Every pt is in a cluster
• Minimize the # of pts which are not
Density-Driven
(primarily for Data Mining)
Discovering dense collections of data
points
Find all clusters which have minimum
size & density
No requirement to include outliers
Targeted Clustering &
Association Rules
Exploratory vs Targeted Clustering
Exploratory Clustering
Find clusters involving an arbitrary
set of variables
Targeted Clustering
Find clusters among a set of
variables which include the target
variable (possibly restricted to a
particular value)
Single Value Targeted Clustering
Suppose
• Our sample dataset consists of car
buyers
• we want to find clusters of car
buyers who bought luxury cars
⇒
1. Restrict the sample dataset to just
those tuples where CarType =
"luxury"
2. Use clustering among this
restricted dataset
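A sketch of both steps, assuming the sample is a pandas DataFrame with CarType, age, and monthlyIncome columns, and borrowing scikit-learn's KMeans as one possible clustering algorithm for step 2:

import pandas as pd
from sklearn.cluster import KMeans

def single_value_targeted_clusters(df, target_value, k=3):
    # 1. Restrict the dataset to tuples with the targeted value
    subset = df[df["CarType"] == target_value]
    # 2. Cluster within the restricted dataset
    features = subset[["age", "monthlyIncome"]]
    model = KMeans(n_clusters=k, n_init=10).fit(features)
    return model.cluster_centers_, model.labels_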
Multiple Value Targeted Clustering
Suppose
• Our sample dataset consists of car
buyers
• we want to find clusters of car
buyers who bought the various
categories of cars
⇒
Use Single Value Targeted Clustering
for CarType = "luxury", then again
for CarType = "midrange", then
again for CarType = "cheap"
Clustering vs Classification
Can't we use Multiple Value Targeted
Clustering to do Classification?
Find the clusters where
CarType = "luxury",
CarType = "midrange",
and CarType = "cheap"
Use the clusters as the model and to
predict the value of live data.
WILL THIS WORK OR NOT?
Clustering Does NOT Classify
Clusters do not cover the space
Clusters only identify dense regions of objects.
The bulk of the space that a decision tree would
assign to cheap car buyers probably does NOT
hold dense clusters of them, so would not be
included in the clustered model
Clusters for different target values may
overlap
[Figure: an 'o' cluster and an 'x' cluster overlapping in the same region of the feature space.]
Association Rules
The region corresponding to a cluster
may include other data points whose
target values differ from the targeted
cluster value
[Figure: an 'o' cluster over Age vs Monthly Income whose region also contains a few 'x' points.]

Cluster:
[age: 26 ± 3, monthlyIncome: 3000 ± 1000]
Pct of data pts within the cluster region with target value
'midrange' (i.e. with symbol 'o') is the same as
the CONFIDENCE of the association rule
(23 ≤ age ≤ 29) Λ (2000 ≤ monthlyIncome ≤ 4000)
→ carType = 'midrange'
Cluster Confidence
Depending upon the data mining
application, confidence may or
may not matter.
Problem: Come up with
applications where the
confidence of clusters
– must be > 80%
– must be > 30%
– doesn't matter
Requiring Confident Clusters
Suppose we are interested in
regions where enough data
points are clustered together
(i.e. with good support), but
where a minimum confidence
w.r.t. a targeted value is
required.
Is there any alternative to
simply discarding clustered
regions with low confidence?
Split Clusters
[Figure: the 'o' cluster with interspersed 'x' points, before splitting.]
Use decision-tree style splitting of
the data points within the cluster
to best separate the 'o' valued
points from the non-'o' valued
points
If the 'o' region has too few data
points (too little support), tough
luck
If it has adequate support &
confidence, done!
If it has adequate support, but has
inadequate confidence, split it
again …
(If the non-'o' regions are large enough,
they can also potentially be split to
find 'o' subregions)
Quantitative Target Variables
If a target variable is
quantitative, especially if it is
continuous, how can targeted
clustering be done?
For example, how do you find
clusters of car buyers who
spent approximately the same
amount on a car?
Quantitative Targeted Clustering
How do you find clusters of car
buyers who spent approximately
the same amount on a car?
Do standard clustering,
just require that the variables
used always INCLUDE the
target variable!
Market Basket
Analysis
Market Basket Analysis
Organize dataset into baskets.
Find groups of items which frequently
occur together in baskets
Basket: Daily shopping by a customer
Item: Product purchased

11-Feb-99  Joe    Diapers
11-Feb-99  Joe    Formula
11-Feb-99  Joe    Beer
…
11-Feb-99  Simha  Pretzels
11-Feb-99  Simha  Beer
…
13-Feb-99  Sasha  Diapers
13-Feb-99  Sasha  Beer

Diapers and beer occur together frequently.
Rules capture causality:
Beer → Diapers? NO! People who buy beer are not more likely to buy diapers.
Diapers → Beer? YES! People who buy diapers are more likely to buy beer (esp men at night).
Market Basket Activities
Modeling
Identify good sets of rules (w statistics)
Forensics
Understand why other groups of items are NOT related
Detection/Prediction
Does live data follow the same rules with the same statistics?
Baskets
In order to use market basket
analysis, you must first divide the
dataset into baskets.
Baskets are specified as a group of
variables (possibly derived). The
actual baskets are obtained by
grouping the dataset by these
variables (e.g. date/customer).
The first step of market basket
analysis is deciding which variables
define the baskets.
Items
Market Basket analysis looks for
groups of items which frequently
appear together in a basket.
An item is determined by a variable
(or set of variables). Each different
value for that variable (or
variables) determines a different
item (e.g. productPurchased).
The second step of market basket
analysis is determining which
variable(s) are used to identify the
items
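A sketch of these two decisions in Python, assuming the raw dataset is a list of (date, customer, product) tuples: the pair (date, customer) defines the basket, and product defines the item:

from collections import defaultdict

def make_baskets(purchases):
    # Group the transaction list into baskets keyed by the
    # basket-defining variables (here: date and customer)
    baskets = defaultdict(set)
    for date, customer, product in purchases:
        baskets[(date, customer)].add(product)  # item = product
    return baskets

purchases = [("11-Feb-99", "Joe", "Diapers"),
             ("11-Feb-99", "Joe", "Beer"),
             ("13-Feb-99", "Sasha", "Diapers"),
             ("13-Feb-99", "Sasha", "Beer")]
baskets = make_baskets(purchases)  # two baskets of items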
Market Basket Discovery
1) Find frequent itemsets
2 items that appear together frequently
are interesting
3 items that appear together frequently
are really interesting
{ charcoal, chicken, bbq sauce }
4 or more items, really really interesting
2) Find rules that characterize
causality
Diapers → Beer, but not Beer → Diapers
Think in terms of which one you'd give a
coupon for.
Apriori Algorithm
Find all itemsets which appear in at least n baskets
Use Apriori or Monotonicity principle:
If a set of items S is frequent (i.e. appears in at least
n baskets), then every subset of S is frequent.
Call L1, the items which appear in at least n baskets
Consider all combos of 2 items {A,B}, both from L1
Call L2, those which appear in at least n baskets
Consider all combos of 3 items {A,B,C}, where
{A,B} and {A,C}, and {B,C} are in L2
Call L3, those which appear in at least n baskets
Consider all combos of 4 items {A,B,C,D}, where
{A,B,C} and {A,B,D}, {A,C,D} and {B,C,D} are in L3
Call L4, those which appear in at least n baskets
The frequent itemsets are L1 + L2 + …
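A compact sketch of this level-wise search in Python, assuming baskets are given as sets of items:

from itertools import combinations

def apriori(baskets, n):
    # Find all itemsets appearing in at least n baskets
    baskets = [frozenset(b) for b in baskets]
    counts = {}
    for b in baskets:
        for item in b:
            counts[item] = counts.get(item, 0) + 1
    # L1: the frequent individual items
    current = {frozenset([i]) for i, c in counts.items() if c >= n}
    frequent = set(current)
    k = 2
    while current:
        # Candidate k-sets: unions of frequent (k-1)-sets ...
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # ... pruned by monotonicity: every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Lk: candidates that actually appear in at least n baskets
        current = {c for c in candidates
                   if sum(c <= b for b in baskets) >= n}
        frequent |= current
        k += 1
    return frequent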
DB Implementation of Apriori
One scan through the DB to get frequent items:
CREATE TABLE HotItems AS
SELECT item FROM Purchases
GROUP BY item HAVING count(*) >= n;
Another scan through the DB looking for pair
itemsets (repeat analogously for size-k itemsets):
WITH HotPurchases AS
(SELECT * FROM
Purchases NATURAL JOIN HotItems)
SELECT P1.item, P2.item
FROM HotPurchases P1, HotPurchases P2
WHERE P1.basket = P2.basket
AND P1.item < P2.item
GROUP BY P1.item, P2.item
HAVING count(*) >= n;
Apriori Scalability
Obtaining frequent pair itemsets:
If HotItems can be kept in memory allowing
rapid lookup (sorted list or hashtable)
FP (frequent pair) itemsets can be obtained in
one linear pass through the DB.
Obtaining frequent size-n itemsets:
1. Use a separate linear scan through the DB for
each size up to n. Slow.
2. On the second scan, don't just count pairs;
instead build a memory-based FP-Tree, which
can be used to find all frequent itemsets of
any size.
But we often only care about frequent pair
itemsets.
Clustering & Market Basket Analysis
Market Basket Analysis is a form of clustering
• Turn each basket into a single LARGE data item.
– Each LARGE data item has a separate boolean
variable for each possible item that can be in
a basket. For example
– Beer, diapers, etc. are separate variables
– A LARGE data item's beer value is TRUE if the
basket it came from had a beer
• In the original dataset, we look for
k-element itemsets which appear in p or more
baskets (using apriori)
This is equivalent to using the LARGE-item
dataset and
– using a subspace clustering algorithm (e.g.
CLIQUE) to look for k-dimensional cells with
p or more points
Scoring
Market Basket
Analysis
Support
Support( S ) =
# of baskets containing S /
# of total baskets

[Venn diagram: 4000 baskets total; 1000 contain Beer, 2000 contain Cereal, 40 contain both.]

Support { Beer } = 1000/4000 = 25%
Support { Cereal } = 2000/4000 = 50%
Support { Beer, Cereal } = 40/4000 = 1%
Support: How significant is this itemset
In a supermarket, anything over .1% might be
significant
Given the # of total baskets, the minimum
interesting support determines n for the Apriori
algorithm
Confidence
Confidence( A → B ) =
Support( A & B ) /
Support( A )

[Venn diagram: 4000 baskets total; 1000 contain Beer, 2000 contain Cereal, 40 contain both.]

Confidence( A → B ) =
# of baskets containing A & B /
# of baskets containing A
Confidence( Beer → Cereal ) = 40/1000 = 4%
Confidence( Cereal → Beer ) = 40/2000 = 2%
Confidence( A → B ): If a basket has A,
how likely is it that the basket also will have B
(i.e. how confident are we that A predicts B)
If this is low (say, less than 30%), it is not very
interesting, since the two items don't correlate
High Support & Confidence
[Venn diagram: 4000 baskets total; 1000 contain Beer, 2000 contain Milk, 400 contain both.]

Support { Beer } = 1000/4000 = 25%
Support { Milk } = 2000/4000 = 50%
Support { Beer, Milk } = 400/4000 = 10% WOW!
Confidence( Milk → Beer ) = 400/2000 = 20%
Confidence( Beer → Milk ) = 400/1000 = 40%
High Confidence, so potentially interesting
BUT 40% < 50%, the pct who buy milk anyway
So giving milk coupons to beer buyers is probably
not the most useful thing to do
Lift
Lift( A → B ) =
Confidence( A → B ) /
Support( B )
= Lift( B → A )

[Venn diagram: 4000 baskets total; 1000 contain Beer, 2000 contain Milk, 400 contain both.]

Support { Beer } = 1000/4000 = 25%
Support { Milk } = 2000/4000 = 50%
Support { Beer, Milk } = 400/4000 = 10% WOW!
Confidence( Milk → Beer ) = 400/2000 = 20%
Confidence( Beer → Milk ) = 400/1000 = 40% OK!
Lift( A → B ): How much does A help B?
Lift( Beer → Milk ) = 40% / 50% = 0.8
If lift < 1, then it doesn't help at all!
Good Lift
Lift( A → B ) =
Confidence( A → B ) /
Support( B )
= Lift( B → A )

[Venn diagram: 4000 baskets total; 1000 contain Beer, 200 contain Diapers, 80 contain both.]

Support { Beer } = 1000/4000 = 25%
Support { Diapers } = 200/4000 = 5%
Support { Beer, Diapers } = 80/4000 = 2% OK!
Confidence( Beer → Diapers ) = 80/1000 = 8%
Confidence( Diapers → Beer ) = 80/200 = 40% OK!
Lift( Diapers → Beer ) = 40% / 25% = 1.6
Note: Lift can be useful in clustering situations as well
Support, Confidence & Lift
A → B
Support( A & B )
How important is the rule: what percent of baskets
have both A & B?
Confidence( A → B )
How likely is it that a basket which contains A also
contains B? In general, should be at least 35%.
Lift( A → B )
If we know that a basket contains A, how much surer
are we that the basket contains B than if we didn't
know what else was in the basket? Must be > 1;
probably should be at least 1.3.
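The three scores are easy to compute directly. A minimal sketch that reproduces the diapers/beer numbers from the Good Lift slide (baskets as Python sets, with counts matching that slide):

def support(baskets, itemset):
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(baskets, a, b):
    return support(baskets, set(a) | set(b)) / support(baskets, a)

def lift(baskets, a, b):
    return confidence(baskets, a, b) / support(baskets, b)

# 4000 baskets: 1000 with beer, 200 with diapers, 80 with both
baskets = ([{"beer", "diapers"}] * 80 + [{"beer"}] * 920 +
           [{"diapers"}] * 120 + [set()] * 2880)
print(support(baskets, {"beer", "diapers"}))       # 0.02
print(confidence(baskets, {"diapers"}, {"beer"}))  # 0.4
print(lift(baskets, {"diapers"}, {"beer"}))        # 1.6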
Hierarchical Categories
Do apriori with values at each category level
Whole Wheat Bread → Skim Milk,
but not
Bread → Milk
or vice versa!
For scalability,
can initially only include higher level categories,
then split itemsets with high levels of support
Rules for Larger Itemsets
For { A, B, C, D }
Consider
A, B, C → D
A, B, D → C
A, C, D → B
B, C, D → A
Diapers, ChildrensTylenol → Beer
may have less support than
Diapers → Beer
but may well have higher confidence
and higher lift
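A sketch of generating candidate rules from one frequent itemset, reusing the confidence and lift helpers sketched above; the thresholds echo the guidelines from the scoring slides:

def rules_from_itemset(baskets, itemset, min_conf=0.35, min_lift=1.3):
    # Try each item of the frequent itemset as the consequent,
    # with the remaining items as the antecedent
    rules = []
    for consequent in itemset:
        antecedent = set(itemset) - {consequent}
        c = confidence(baskets, antecedent, {consequent})
        l = lift(baskets, antecedent, {consequent})
        if c >= min_conf and l >= min_lift:
            rules.append((antecedent, consequent, c, l))
    return rules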
Incorporating Other Variables
Diapers,
gender:male,
time: [8pm:1am]
→ Beer
will also have less support than
Diapers → Beer
But will almost certainly have
higher confidence & lift
Remember, this is just
subspace clustering with more variables