Supervised vs Unsupervised
• Supervised learning
– the “teacher” Y (the response variable)
– The goal is to predict Y from X, which is equivalent to estimating
f(x) = E[Y | X = x],
under squared error loss for a quantitative response or 0/1 loss for a
binary response.
• Unsupervised learning: understand the data generating process for X.
• P (X, Y ) = P (Y |X) × P (X)
– P (Y |X): focus of supervised learning.
– P (X): focus of unsupervised learning.
• The goal is clear in supervised learning and the measure of performance is
well defined, but that is not the case for unsupervised learning.
Clustering
• Goal: group or segment objects into subsets or clusters, such that
those within each cluster are more similar to one another than to objects
assigned to different clusters.
• A key issue: choice of similarity/dissimilarity measure between two
objects.
• Partitioning methods: K-means and K-medoids
• Hierarchical methods: agglomerative, divisive or hybrid.
• Spectral clustering
• Model based clustering, such as mixture models,
f(x) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k).
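As an illustration of the mixture-model approach, the sketch below fits a
Gaussian mixture with the mclust package in R (an assumption about available
tools, not part of these notes); Mclust estimates the mixture by EM and picks
the number of components by BIC.

## Minimal sketch of model-based clustering with a Gaussian mixture (mclust).
library(mclust)

set.seed(1)
## toy data: two well-separated 2-d Gaussian clusters
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

## Mclust fits f(x) = sum_k pi_k p(x | theta_k) by EM and selects the
## number of components (and covariance structure) by BIC
fit <- Mclust(X, G = 1:5)
summary(fit)              # chosen model and number of clusters
head(fit$classification)  # hard cluster assignments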
Inputs for Clustering
• Xn×p : data matrix, where rows stand for objects/samples and
columns stand for variables/features.
• Dn×n : dissimilarity matrix where d(i, j) = d(j, i) measures the
difference between the objects i and j.
Some clustering methods process the data matrix along with their own
similarity measures, while others operate on the dissimilarity matrix. But we
can always transform the input from one form to the other:
• X =⇒ D: easy
• D =⇒ X: e.g., multi-dimensional scaling (MDS)
Multi-dimensional Scaling (MDS)
Given a dissimilarity matrix Dn×n , find values z1 , . . . , zn ∈ Rk to minimize
the following stress function
S(z_1, \ldots, z_n) = \sum_{i \neq j} \left( d_{ij} - \|z_i - z_j\| \right)^2.
That is, MDS finds a k-dimensional representation of the data that
preserves the pairwise distances as well as possible.
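A minimal R sketch of the D =⇒ X direction, assuming only standard tools:
cmdscale() performs classical (metric) MDS, which minimizes a related strain
criterion, while MASS::isoMDS() minimizes a stress-type criterion closer to
S(z_1, ..., z_n) above.

## Sketch: recovering low-dimensional coordinates from a dissimilarity matrix.
set.seed(1)
X <- matrix(rnorm(100), ncol = 5)   # 20 objects, 5 features
D <- dist(X)                        # X ==> D is easy

Z1 <- cmdscale(D, k = 2)            # classical MDS (strain criterion)
library(MASS)
Z2 <- isoMDS(D, k = 2)$points       # Kruskal's stress-based MDS

## how well are the pairwise distances preserved?
cor(as.vector(D), as.vector(dist(Z1)))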
K-means Clustering
• Input: Xn×p (data matrix) and K (number of clusters)
• K-means clustering algorithm
0. Start with some initial guess for the K cluster centers (centroids).
1. For each data point, find the closest cluster center (partition step).
2. Replace each centroid by average of data points in its partition
(update centroids).
3. Repeat steps 1 and 2 until convergence.
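The steps above translate almost line-for-line into code; the sketch below is
a bare-bones illustration (random initialization, squared Euclidean distance),
not a replacement for R's built-in kmeans().

## Bare-bones K-means illustrating steps 0-3 above.
my_kmeans <- function(X, K, max_iter = 100) {
  n <- nrow(X)
  centers <- X[sample(n, K), , drop = FALSE]   # step 0: random initial centroids
  cluster <- integer(n)
  for (it in seq_len(max_iter)) {
    ## step 1: assign each point to its closest centroid
    d2 <- sapply(seq_len(K), function(k) colSums((t(X) - centers[k, ])^2))
    new_cluster <- max.col(-d2)                # index of the smallest squared distance
    if (identical(new_cluster, cluster)) break # step 3: stop at convergence
    cluster <- new_cluster
    ## step 2: replace each centroid by the mean of the points in its partition
    for (k in seq_len(K))
      if (any(cluster == k))
        centers[k, ] <- colMeans(X[cluster == k, , drop = FALSE])
  }
  list(cluster = cluster, centers = centers)
}

## usage: res <- my_kmeans(as.matrix(iris[, 1:4]), K = 3)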
• Goal: Partition data into K groups so that the dissimilarities between
points in the same cluster are smaller than those in different clusters.
• Dissimilarity measure is the squared Euclidean distance. (What about the
un-squared Euclidean distance?)
• Optimize the following objective function (within-cluster sum of squares):
\Omega(C_1^K, m_1^K) = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - m_k\|^2.
K-means converges to a local minimum of Ω by iterating the following two
steps:
Step 1 (Update Centers): Given the partition C_1, \ldots, C_K,
\min_{m_1^K} \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - m_k\|^2.
Step 2 (Update Partition): Given the cluster centers m_1, \ldots, m_K,
\min_{C_1^K} \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - m_k\|^2.
Some Practical Issues
• Try many random starting centroids and choose the solution with the
smallest within-cluster SS.
• How to choose K?
• Should we normalize each dimension?
• If we use other dissimilarity measures, such as
– different measures for numerical, ordinal and categorical variables,
can we still use the K-means algorithm?
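A small sketch of these practical points with R's built-in kmeans(): normalize
the columns (a judgment call), use many random starts via nstart, and inspect
the within-cluster SS over a range of K (the usual "elbow" heuristic). The
iris data are a stand-in example.

## Many random starts + normalized features + within-cluster SS vs. K.
X <- scale(iris[, 1:4])             # normalize each dimension

wss <- sapply(1:10, function(K)
  kmeans(X, centers = K, nstart = 25)$tot.withinss)   # best of 25 random starts

plot(1:10, wss, type = "b", xlab = "K", ylab = "Within-cluster SS")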
Dissimilarity Measures
d(x_i, x_j) = \sum_{l=1}^{p} w_l(i, j) \, d(x_{il}, x_{jl}).    (1)
• Numerical variables:
d(xil , xjl ) = g(|xil − xjl |),
where g is a monotone function.
• Interval variables: replace x_{il} by x_{il} / (range of the lth variable),
and then treat it as numerical.
• Ordinal variables (e.g., rank data, say 1 : M): replace x_{il} by
(x_{il} - 1/2)/M, and then treat it as an interval-valued variable.
• Categorical variables:
d(x_{il}, x_{jl}) = 1(x_{il} \neq x_{jl}).
• w_l(i, j) = 0 if x_{il} or x_{jl} is missing.
• w_l(i, j) = 0 if x_{il} = x_{jl} = 0 when the lth variable is asymmetric binary.
Other choices of dissimilarity measures, which cannot be expressed as (1):
check ?dist and ?daisy in R.
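A sketch of mixed-type dissimilarities with cluster::daisy(); with
metric = "gower" it follows the weighted form in (1): range-scaled absolute
differences for numeric variables, simple matching for factors, weight
w_l(i, j) = 0 for missing values, and asymmetric-binary handling when
requested. The toy data frame below is made up for illustration.

## Gower-type dissimilarities for mixed numeric / categorical / binary data.
library(cluster)

df <- data.frame(
  age    = c(25, 40, 31, NA),              # numeric, with a missing value
  income = c(30e3, 80e3, 55e3, 42e3),      # numeric
  sex    = factor(c("F", "M", "F", "M")),  # categorical (simple matching)
  smoker = c(1L, 0L, 0L, 1L)               # 0/1, treated as asymmetric binary
)

D <- daisy(df, metric = "gower", type = list(asymm = "smoker"))
as.matrix(D)   # the 4 x 4 dissimilarity matrix; see also ?dist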
K-medoids Clustering
• Input: Dn×n (pair-wise dissimilarity matrix) and K (number of
clusters)
• Cluster centers (medoids) are restricted to be data points.
m_k = \arg\min_{x_i : C(i) = k} \sum_{j : C(j) = k} d(x_i, x_j).    (2)
• Iterate the following two steps:
– Partition: C(i) = \arg\min_{1 \le k \le K} d(x_i, m_k)
– Update medoids based on (2).
Partition Around Medoids (PAM)
Recall that the K-medoids algorithm iteratively solves
\min_{C, m_{1:K}} \sum_{k=1}^{K} \sum_{i : C(i) = k} d(x_i, m_k).
PAM (Kaufman and Rousseeuw, 1990) minimizes the above objective function by
iteratively swapping x_i \leftrightarrow x_j that decreases the objective
function the most, where
x_i \in \{m_1, \ldots, m_K\}, \quad x_j \notin \{m_1, \ldots, m_K\},
until convergence.
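A sketch of K-medoids via cluster::pam(), which implements PAM; it accepts
either a data matrix or, with diss = TRUE, a precomputed dissimilarity matrix.
The iris data are again just a placeholder.

## K-medoids / PAM on a precomputed dissimilarity matrix.
library(cluster)

X <- as.matrix(iris[, 1:4])
D <- dist(X)                  # any dissimilarity measure could be used here

fit <- pam(D, k = 3, diss = TRUE)
fit$medoids                   # the data points chosen as medoids
table(fit$clustering)         # cluster sizes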
How Many Clusters?
• Difficult, because it is NOT a well-defined problem.
• Can we cast it as a model selection problem and use “fitting
error” plus “penalty” (increasing with K) to select K?
– Possible for the mixture model approach where error is −2 log Lik.
– For ordinary clustering algorithms, what’s the error?
– Can we cast it as a classification problem with the label being the
cluster membership? A tricky issue: the meaning of the so-called
“cluster 1” changes when data change.
– A better choice is the association matrix A: A_{ij} = 1 if i and j are
in the same cluster, and 0 otherwise.
A couple of heuristic approaches
• Gap statistics
• Silhouette statistics
• Prediction strength
Gap Statistics
• Many measures of goodness for clusterings are based on the tightness
of clusters.
SS(K) = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - m_k\|^2.
• Gap statistic (Tibshirani et al., 2001):
G(K) = E_0\left[\log SS(K)\right] - \log SS(K)
\approx \frac{1}{B} \sum_{b=1}^{B} \log SS_b(K) - \log SS(K)
• One-standard-error type rule:
K^* = \arg\min_K \{K : G(K) \ge G(K + 1) - s_{K+1}\},
where s_K = \mathrm{sd}_0(\log SS(K)) \sqrt{1 + 1/B}.
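A sketch using cluster::clusGap(), which simulates B reference data sets,
computes log SS_b(K), and returns the gap values with their standard errors;
maxSE() then applies the one-standard-error type rule above. Data and tuning
values are placeholders.

## Gap statistic with B reference data sets, then the 1-SE type rule.
library(cluster)

X   <- scale(iris[, 1:4])
gap <- clusGap(X, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 20)

print(gap, method = "Tibs2001SEmax")           # gap values and SE.sim per K
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
      method = "Tibs2001SEmax")                # selected K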
Silhouette Statistics
• The Silhouette statistic (Rousseeuw, 1987) of the ith obs measures
how well it fits in its own cluster versus how well it fits in its next
closest cluster.
• Adapted to K-means, define
a(i) = \|x_i - m_k\|^2, \quad b(i) = \|x_i - m_l\|^2,
where i \in C_k and C_l is the next-closest cluster to x_i. Then its
silhouette is
s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.
• In general,
– a(i) = average dissimilarity between i and all other points in the
same cluster;
– d(i, C) = average dissimilarity between i and the points in cluster C;
– b(i) = min_C d(i, C), where the minimum is over clusters C not containing i.
• −1 ≤ s(i) ≤ 1; usually s(i) > 0, but it can be slightly below 0.
• s(i) ≈ 1: object i is well classified;
• s(i) ≈ 0: object i lies intermediate between two clusters;
• s(i) ≈ −1: object i is badly classified.
• The larger the (average) silhouette, the better the cluster.
• Silhouette Coefficient (SC): the average of the silhouette widths s(i).
A rule of thumb from Struyf et al., “Clustering in an Object-Oriented
Environment”:
– SC = 0.71–1.00: A strong structure has been found.
– SC = 0.51–0.70: A reasonable structure has been found.
– SC = 0.26–0.50: The structure is weak and could be artificial; try
additional methods.
– SC < 0.26: No substantial structure has been found.
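A sketch of the silhouette computation with cluster::silhouette(), which takes
the cluster labels together with a dissimilarity matrix; the mean silhouette
width is the SC to compare against the rule of thumb above.

## Silhouette widths s(i) and the Silhouette Coefficient (SC).
library(cluster)

X <- scale(iris[, 1:4])
D <- dist(X)

km  <- kmeans(X, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, D)    # one s(i) per observation

summary(sil)
mean(sil[, "sil_width"])            # SC: average silhouette width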
Prediction Strength
• Divide the data into training and test sets.
• Cluster test set into K clusters.
• Cluster training set into K clusters.
• Predict pairwise co-memberships in the test set, using the clustering
rule obtained from the training set.
• Prediction strength (Tibshirani and Walther, 2005) is the proportion of
pairwise co-memberships that are correctly predicted.
• For each k, repeat the CV procedure above and calculate the average
prediction strength.
• Suppose we have training and test sets, Xtr and Xte , and a clustering
operator C(X, k).
• If X has n obs, let D[C(...), X] be an n × n (co-membership) matrix with
D_{ij} = 1 if obs i and j are in the same cluster.
• Let A1 , . . . , Ak denote the indices of the test obs in the k test clusters.
• Let n1 , . . . , nk denote the number of observations in these clusters.
• The prediction strength of C(·, k) is defined to be
\min_{j \in \{1, \ldots, k\}} \frac{1}{n_j (n_j - 1)}
\sum_{i \neq i' \in A_j} I\left( D(C[X_{tr}, k], X_{te})_{ii'} = 1 \right).
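A bare-bones sketch of this computation for K-means; the function name and the
50/50 split below are choices made here for illustration (the fpc package
offers a full implementation in prediction.strength()).

## Prediction strength: cluster training and test halves, predict test
## co-memberships from the training centroids, take the worst cluster.
prediction_strength <- function(X, k, frac = 0.5) {
  n   <- nrow(X)
  tr  <- sample(n, floor(frac * n))
  Xtr <- X[tr, , drop = FALSE]
  Xte <- X[-tr, , drop = FALSE]

  km_tr <- kmeans(Xtr, centers = k, nstart = 20)
  km_te <- kmeans(Xte, centers = k, nstart = 20)

  ## apply C(Xtr, k) to the test set: nearest training centroid
  d2   <- sapply(seq_len(k), function(j) colSums((t(Xte) - km_tr$centers[j, ])^2))
  pred <- max.col(-d2)

  ## for each test cluster A_j, proportion of pairs also co-clustered by pred
  ps_j <- sapply(seq_len(k), function(j) {
    idx <- which(km_te$cluster == j)
    nj  <- length(idx)
    if (nj < 2) return(NA)
    same <- outer(pred[idx], pred[idx], "==")
    (sum(same) - nj) / (nj * (nj - 1))    # exclude the diagonal i = i'
  })
  min(ps_j, na.rm = TRUE)
}

## usage: prediction_strength(scale(iris[, 1:4]), k = 3)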
Hierarchical Clustering Methods
• Input:
– Pairwise dissimilarity (or distance) of observations
– Rule to calculate dissimilarity between (disjoint) groups of
observations
• Output: a hierarchical clustering result
– n single-point clusters at the lowest level;
– one cluster at the highest level;
– clusters at one level are created by splitting/merging clusters at
the next higher/lower level.
• Bottom-up (agglomerative) clustering starts with all observations separate,
and successively joins together the closest groups of observations until all
observations are in a single group.
• Dissimilarity (or distance) between groups of obs:
– Single-linkage (nearest-neighbor)
– Complete-linkage (furthest-neighbor)
– Group average
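A sketch of agglomerative clustering in R with hclust(), which takes a
dissimilarity matrix and a linkage rule ("single", "complete", "average");
cutree() then extracts a flat clustering at any chosen level. The iris data
are a placeholder.

## Bottom-up clustering with different linkages, then cut the dendrogram.
X <- scale(iris[, 1:4])
D <- dist(X)

hc <- hclust(D, method = "average")   # also try "single" or "complete"
plot(hc)                              # dendrogram: n leaves at the bottom, one cluster at the top
clusters <- cutree(hc, k = 3)         # cut into K = 3 groups
table(clusters)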
• Top-down (divisive) clustering starts with all observations together, and
successively divides groups into two until all observations are separate.
• Hybrid (Chipman and Tibshirani, 2006).