Data Clustering for Forecasting
James B. Orlin
MIT Sloan School and OR Center
Mahesh Kumar
MIT OR Center
Nitin Patel
Visiting Professor
Jonathan Woo
ProfitLogic Inc.
1
Overview of Talk
• Overview of Clustering
• Error-based clustering
• Use of clustering in forecasting
• But first, a few words from Scott Adams
2
[Slides 3-5: Dilbert comic strips]
What is clustering?
Clustering is the process of partitioning a set of data or
objects into clusters with the following properties:
• Homogeneity within clusters: data that belong to the
same cluster should be as similar as possible
• Heterogeneity between clusters: data that belong to
different clusters should be as different as possible.
6
Overview of this talk
• Provide a somewhat personal view of the
significance of clustering in life, and why it has
not met its promise
• Provide our technique for how to incorporate
uncertainty about data into clustering, so as to
reduce uncertainty in forecasting.
7
Iris Data (Fisher, 1936)
[Scatter plot of the iris data on canonical axes can1 and can2; species: Setosa, Versicolor, Virginica]
8
Cluster the iris data
• This is a 2-dimensional projection of 4-dimensional
data. (sepal length and width, petal length and width)
• It is not clear if there are 2, 3 or 4 clusters
• There are 3 clusters
• Clusters are usually chosen to minimize some metric
(e.g., sum of squared distances from center of the
cluster)
9
Iris Data
[The same iris scatter plot (can1 vs. can2) with the three species marked]
10
Iris Data, using ellipses
[The same iris scatter plot with an ellipse drawn around each species]
11
Why is clustering important: a
personal perspective
• Two very natural aspects of intelligence:
– grouping (clustering) and categorizing
– It’s an organizing principle of our minds and of our life
• Just a few examples
– We cluster life into “work life” and “family life”
– We cluster our life by our “roles”: father, mother, sister, brother, teacher, manager, researcher, analyst, etc.
– We cluster our work life in various ways, perhaps organized by projects, by whom we report to, or by who reports to us, etc.
– We even cluster the talks we attend, perhaps organized by quality, by what we learned, or by where they were given.
12
More on Clustering in Life
• More clustering Examples:
– Go shopping: products are clustered in the store
(useful for locating things)
– As a professor, I need to cluster students into letter grades: “what really is the difference between a B+ and an A-?” (useful in evaluations)
– When we figure out what to do, we often prioritize by
clustering things (important vs. non-important)
– We cluster people into multiple dimensions based on
appearance, intelligence, character, religion, sexual
orientation, place of origin, etc
• Conclusion: Humans cluster and categorize by nature; it is part of our intelligence.
13
Fields that have used clustering
• Marketing (market segmentation, catalogues)
• Chemistry (the periodic table is a great example)
• Finance (making sense of stock transactions)
• Medicine (clustering patients)
• Data mining (what can we do with transactional data, such as click-stream data?)
• Bioinformatics (how can we make sense of proteins?)
• Data compression and aggregation (can we cluster massive data sets into smaller data sets for subsequent analysis?)
• plus much more
14
Has clustering been successful
in data mining?
• Initial hope: clustering would find many interesting
patterns and surprising relationships
– arguably not met, at least not nearly enough
– perhaps it requires too much intelligence
– perhaps we can do better in the future
• Nevertheless: clustering has been successful in using computers for things that humans are quite bad at
– dealing with massive amounts of data
– effectively using knowledge of “uncertainty”
15
An issue in clustering:
the effect of scale
• Background: an initial motivation for our work in clustering (as sponsored by the e-Business Center) is to eliminate the effect of scale in clustering
16
A Chart of 6 Points
[Chart “Clustering 6 points”: six points plotted with x from roughly 1.5 to 4.5 and y from 0 to 6]
17
Two Clusters of the 6 Points
[The same six points, grouped into two clusters]
18
We added two points and adjusted
the scale
[Chart “Clustering 8 points”: the x-axis now runs from 0 to 60 and the y-axis from 3.5 to 5.5]
19
3 clusters of the 8 points
[The same eight points, grouped into three clusters]
The 6 points on the left are clustered differently.
20
Scale Invariance
• A clustering approach is called “scale
invariant” if it develops the same solution,
independent of the scales used
• The approach developed next is scale invariant
21
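To make the scale issue concrete, here is a minimal sketch (not from the talk; the points and the nearest_pair helper are made up for illustration) showing that grouping by plain Euclidean distance is not scale invariant: changing the units of one coordinate changes which points end up together.

```python
import numpy as np

def nearest_pair(points):
    """Indices of the two closest points under Euclidean distance."""
    n = len(points)
    best, best_d = None, np.inf
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(points[i] - points[j])
            if d < best_d:
                best, best_d = (i, j), d
    return best

pts = np.array([[1.0, 0.0], [1.2, 5.0], [3.0, 0.3]])
print(nearest_pair(pts))                     # (0, 2): on this scale the y-gap dominates
pts_rescaled = pts * np.array([100.0, 1.0])  # express x in different units
print(nearest_pair(pts_rescaled))            # (0, 1): the grouping flips with the scale of x
```

The error-based distance introduced later divides each coordinate's squared difference by a variance measured in the same units, so rescaling a coordinate leaves the distance, and hence the clustering, unchanged.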
Using clustering to reduce uncertainty.
Try to find the average of the 3 populations
[The iris scatter plot again; the goal is to estimate the mean of each of the three species]
22
Using uncertainty to improve clustering:
an example with 4 points in 1 dimension
The four points were obtained as sample means for four
samples, two from one distribution, and two from
another.
Objective: cluster into two groups of two each so as to
maximize the probability that each cluster represents
two samples from the same distribution.
[Number line from 0.48 to 0.6 showing the four points]
23
Standard Approach
Consider the four data points, and cluster based on
these values.
Resulting cluster:
[Number line from 0.48 to 0.6 showing the resulting clusters]
24
Incorporating Uncertainty
• a common assumption in statistics
– data comes from “populations” or distributions
– from the data, we can estimate the mean of the population and the standard deviation of the original distribution
• Usual approach to clustering
– keep track of the estimated mean
– ignore the standard deviation (estimate of the error)
• Our approach: use both the estimated mean and the
estimate of the error.
[Number line from 0.48 to 0.6 showing the four points again]
25
The two samples on the left were samples with
10,000 points each. The samples on the right
were two samples with 100 points each.
[Number line from 0.48 to 0.6 with each point drawn as a circle]
The radius corresponds to the standard deviation.
Smaller circles → larger data sets → more certainty.
26
[The three possible pairings of the four points, annotated with their probabilities: 4/19, 8/19, and 7/19]
27
One cluster: 10,000 points with mean .501 plus 100 points with mean .562 give 10,100 points with mean .501 (true mean: .5).
The other cluster: 10,000 points with mean .536 plus 100 points with mean .592 give 10,100 points with mean .537 (true mean: .53).
[Number line from 0.48 to 0.6 showing this pairing]
28
More on using uncertainty
• We will use clustering to reduce uncertainty
• We will use our knowledge of the uncertainty to
improve the clustering
• In the previous example, the correct clustering had probability 8/19
• We had generated 20 sets of four points at random; the data shown came from the second set of four points.
29
Error-based clustering
1. Start with n points in k-dimensional space
– the next example has 15 points in 2 dimensions
– each point has an estimated mean as well as a standard deviation of the estimate
2. Determine the likelihood of each pair of points coming from the same distribution (see the sketch below)
3. Merge the two points with the greatest likelihood
4. Return to Step 2.
30
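A small sketch of step 2 under a Gaussian reading of the model (our illustration, not the authors' code; the standard errors below are invented): the likelihood that two points share a common mean is governed by the gap between their estimated means relative to the combined standard error.

```python
import math

def pair_log_likelihood(x_i, se_i, x_j, se_j):
    """Log-density of the observed gap x_i - x_j under a common-mean hypothesis,
    i.e. N(0, se_i**2 + se_j**2) evaluated at the gap (1-D case)."""
    var = se_i ** 2 + se_j ** 2
    return -0.5 * math.log(2 * math.pi * var) - (x_i - x_j) ** 2 / (2 * var)

# Points that are close relative to their standard errors score higher:
print(pair_log_likelihood(0.501, 0.001, 0.536, 0.001))  # sizable gap, tiny errors -> very unlikely
print(pair_log_likelihood(0.562, 0.010, 0.592, 0.010))  # similar gap, larger errors -> plausible
```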
Using Maximum Likelihood
• Maximum Likelihood Method
– Suppose we have G clusters, C1, C2, …, CG. Out of the exponentially many possible clusterings, which clustering is most likely with respect to the observed data?

Objective:

$$\max \sum_{k=1}^{G} \left(\sum_{i \in C_k} \frac{x_i}{\sigma_i^2}\right)^{t} \left(\sum_{i \in C_k} \frac{1}{\sigma_i^2}\right)^{-1} \left(\sum_{i \in C_k} \frac{x_i}{\sigma_i^2}\right)$$

Computationally difficult!
31
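For concreteness, here is a minimal sketch (our own, assuming diagonal per-coordinate variances and made-up data) that evaluates this objective for a given candidate clustering; the hard part is searching over the exponentially many clusterings.

```python
import numpy as np

def ml_objective(x, var, clusters):
    """x, var: (n, d) arrays of estimated means and variances.
    clusters: a partition of range(n) given as a list of index lists."""
    total = 0.0
    for idx in clusters:
        w = np.sum(x[idx] / var[idx], axis=0)   # sum_i x_i / sigma_i^2 over the cluster
        p = np.sum(1.0 / var[idx], axis=0)      # sum_i 1 / sigma_i^2 over the cluster
        total += float(np.sum(w * w / p))       # the (sum)^t (sum)^-1 (sum) term, coordinate-wise
    return total

# Example: score two candidate clusterings of four 1-D points (variances are made up).
x = np.array([[0.501], [0.536], [0.562], [0.592]])
var = np.array([[1e-6], [1e-6], [1e-4], [1e-4]])
print(ml_objective(x, var, [[0, 2], [1, 3]]))
print(ml_objective(x, var, [[0, 1], [2, 3]]))
```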
Heuristic solution based on maximum
likelihood
• Greedy heuristic
– Start with n single-point clusters
– Combine the pair of clusters that leads to the maximum increase in the objective value (based on maximum likelihood)
– Stop when we have G clusters.
This is similar to hierarchical clustering.
32
Error-based clustering
• At each step, combine the pair of clusters Ci, Cj with the smallest value of
$$(x_i - x_j)^{t}\,(\sigma_i^2 + \sigma_j^2)^{-1}\,(x_i - x_j)$$
– $x_i$, $x_j$: maximum-likelihood estimates of the cluster means
– $\sigma_i$, $\sigma_j$: standard errors of the $x$'s
• We define the distance between two clusters as
$$\mathrm{distance}(C_i, C_j) = (x_i - x_j)^{t}\,(\sigma_i^2 + \sigma_j^2)^{-1}\,(x_i - x_j)$$
Computationally much easier!!
33
Error-based Clustering Algorithm
• distance(Ci, Cj) = $(x_i - x_j)^{t}\,(\sigma_i^2 + \sigma_j^2)^{-1}\,(x_i - x_j)$
• Start with n singleton clusters
• At each step, combine the pair of clusters Ci, Cj with the smallest distance.
• Stop when we have the desired number of clusters.
It is a generalization of Ward’s method.
34
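Below is a minimal sketch of the greedy agglomerative procedure just described, assuming diagonal (per-coordinate) variances; the precision-weighted merge rule and the example data are our own illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def eb_distance(mean_i, var_i, mean_j, var_j):
    """(x_i - x_j)^t (sigma_i^2 + sigma_j^2)^{-1} (x_i - x_j), diagonal case."""
    diff = mean_i - mean_j
    return float(np.sum(diff * diff / (var_i + var_j)))

def error_based_clustering(means, variances, n_clusters):
    """Greedy agglomeration: repeatedly merge the pair of clusters with the
    smallest error-based distance until n_clusters remain."""
    # Each cluster keeps its estimated mean, the variance of that estimate,
    # and the indices of the original points it contains.
    clusters = [(np.asarray(m, float), np.asarray(v, float), [i])
                for i, (m, v) in enumerate(zip(means, variances))]
    while len(clusters) > n_clusters:
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = eb_distance(clusters[a][0], clusters[a][1],
                                clusters[b][0], clusters[b][1])
                if d < best_d:
                    best, best_d = (a, b), d
        a, b = best
        m_a, v_a, ids_a = clusters[a]
        m_b, v_b, ids_b = clusters[b]
        # Precision-weighted (maximum-likelihood) combination of the two clusters.
        prec = 1.0 / v_a + 1.0 / v_b
        merged = ((m_a / v_a + m_b / v_b) / prec, 1.0 / prec, ids_a + ids_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [ids for _, _, ids in clusters]

# Illustrative 2-D data: four "estimated means" with their variances.
means = [[1.0, 2.0], [1.1, 2.1], [5.0, 0.5], [5.2, 0.4]]
variances = [[0.01, 0.01], [0.04, 0.04], [0.02, 0.02], [0.02, 0.02]]
print(error_based_clustering(means, variances, 2))   # -> [[0, 1], [2, 3]]
```

With equal variances everywhere, the distance reduces to a scaled squared Euclidean distance between cluster means, which is one way to see the connection to Ward's method noted above.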
The mean is the dot. The error is given by the ellipse.
A small ellipse
means that the
data is quite
accurate.
35
Determine the two
elements most
likely to come
from the same
distribution.
Merge them into a
single element.
36
Continue this
process, reducing
the number of
clusters one at a
time.
38
39
40
41
42
43
44
45
46
47
48
49
Here we went all the way to a
single cluster.
We could stop with 2 or 3 or more
clusters. We can also evaluate
different numbers of clusters at
the end.
50
Rest of the Lecture
• The use of clustering in forecasting was developed while Mahesh Kumar worked at ProfitLogic.
• Joint work: Mahesh Kumar, Nitin Patel, Jonathan Woo.
51
Motivation
• Accurate sales forecasting is very important in the retail industry in order to make good decisions.
[Diagram: Manufacturer → Wholesaler → Retailer → Customer, with shipping, pricing, and allocation decisions along the chain]
Kumar et al. used clustering to help in accurate sales forecasting.
52
Forecasting Problem
• Goal: Forecast Sales
• Parameters that affect sales
– Price
– When a product is introduced
– Promotions
– Inventory
– Base demand as a function of the time of year
– Random effects
53
Seasonality Definition
• Seasonality is the hypothesized underlying base demand of a group of similar merchandise as a function of the time of year.
• It is a vector of size 52, describing variations over the year.
• It is independent of external factors like changes in price, promotions, inventory, etc., and is modeled as a multiplicative factor (see the sketch below).
• e.g., two portable CD players have essentially the same seasonality, but they may differ in price, promotions, inventory, etc.
54
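One plausible reading of the multiplicative model (our interpretation; the slides do not state a formula): expected weekly sales of item $i$ in week $t$ factor into the week's seasonality coefficient and non-seasonal, item-specific effects,

$$E[\mathrm{sales}_{i,t}] = \mathrm{Seas}_{\mathrm{week}(t)} \times \mathrm{base}_i \times f(\mathrm{price}_{i,t}, \mathrm{promotion}_{i,t}, \mathrm{inventory}_{i,t}),$$

so dividing out estimates of the non-seasonal effects leaves an estimate of the 52-week seasonality vector.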
Seasonality Examples (made up data)
[Two plots: weekly sales for summer shoes; weekly sales for winter boots]
55
Objective: determine seasonality
of products
• Difficulty: observation of a product’s seasonality is complicated by other factors
– when the product is introduced
– sales and promotions
– inventory
• Solution methods
– preprocess the data to compensate for sales, promotion, and inventory effects
– average over lots of very similar products to eliminate some of the uncertainty
– further clustering of products can eliminate more uncertainty
56
Retail Merchandise Hierarchy
Chain: J-Mart
Department: Shoes
Class: Men’s summer shoes
Item: Debok walkers
Sales data available for items
57
Modeling Seasonality
$$\mathrm{Seas}_i = \{(x_{i1}, \sigma_{i1}^2), (x_{i2}, \sigma_{i2}^2), \ldots, (x_{i,52}, \sigma_{i,52}^2)\} = (x_i, \sigma_i^2)$$
• Seasonality is modeled as a vector with 52 components
• Assumptions:
– We assume the errors are Gaussian
– We treat the estimates of the σ’s as if they were the correct values
58
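A small sketch of how the $(x_i, \sigma_i^2)$ pairs might be produced (our own illustration; the preprocessing and the class_seasonality helper are assumptions, not the authors' code): average the normalized weekly sales of the items in a class, and use the spread across items for the error estimate.

```python
import numpy as np

def class_seasonality(item_sales):
    """item_sales: (n_items, 52) array of weekly sales, assumed already adjusted
    for price, promotion, and inventory effects. Returns the 52-week seasonality
    estimate x and the variance of that estimate."""
    # Multiplicative seasonality: normalize each item so its average week equals 1.
    normalized = item_sales / item_sales.mean(axis=1, keepdims=True)
    x = normalized.mean(axis=0)                                 # coefficient estimates
    var = normalized.var(axis=0, ddof=1) / item_sales.shape[0]  # variance of the mean
    return x, var
```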
Illustration on simulated data
• Kumar et al. generated data with 3 different seasonalities.
• They then combined similar products and
produced estimates of seasonalities.
• Clustering produced much better final
estimates.
59
Simulation Study
• 3 different seasonalities were used to generate sales
data for 300 items.
• All 300 items were divided into 12 classes.
• This gave 12 estimates of the seasonality coefficients, along with their associated errors.
• Clustering into three clusters was used to forecast the correct seasonalities.
60
Seasonalities
61
Initial seasonality estimates
62
Clustering
• Cluster classes with similar seasonality to reduce
errors.
– Example: Men’s winter shoes, men’s winter coats.
• Standard Clustering methods do not incorporate
information contained in the errors.
– Hierarchical clustering
– K-means clustering
– Ward’s method
63
Further Clustering
• They used K-means, hierarchical, and Ward’s
technique
• They also used error based clustering
64
K-means, hierarchical (average), and Ward’s results
65
Error-based Clustering Result
66
Real Data Study
• Data from the retail industry.
• 6 departments: books, sporting goods, greeting cards, videos, etc.
• 45 classes.
• Sales forecast
– Without clustering
– Standard clustering
– Error-based clustering
67
Forecast Result (An example)
[Chart: sales vs. weeks, comparing forecasts with no clustering, standard clustering, and error-based clustering]
68
Result Statistics
• Average Forecast Error
$$= \frac{\sum \lvert \mathrm{ForecastSale} - \mathrm{ActualSale} \rvert}{\sum \mathrm{ActualSale}}$$
69
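A one-line sketch of this measure (assuming the errors are taken in absolute value):

```python
def average_forecast_error(forecast, actual):
    """Sum of absolute forecast errors divided by total actual sales."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / sum(actual)
```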
Summary and Conclusion
• A new clustering method that incorporates
information contained in errors
• It has strong theoretical justification under
appropriate assumptions
• Computationally easy
• Works well in practice
70
Summary and Conclusion
• Major point: if one is using clustering to reduce
uncertainty, then it makes sense to use error-based
clustering.
• The method is scale invariant.
• Error-based clustering has strong theoretical justification and works well in practice.
• The concept of using errors can be applied to many other applications where one has a reasonable estimate of the errors.
71