COMP1942
Clustering 1 (Introduction and k-means)

Prepared by Raymond Wong
Some parts of these notes are borrowed from LW Chan's notes
Screenshots of XLMiner captured by Kai Ho CHAN
Presented by Raymond Wong
raywong@cse
Clustering

Example: students and their scores in two subjects.

Student    Computer    History
Raymond    100         40
Louis      90          20
Wyman      45          95
…          …           …

[Scatter plot of History score against Computer score, showing two groups:
Cluster 1, e.g. high score in Computer and low score in History;
Cluster 2, e.g. high score in History and low score in Computer]

Problem: to find all clusters
Why Clustering?

- Clustering for Utility
  - Summarization
  - Compression
Why Clustering?

- Clustering for Understanding
  - Applications
    - Biology: group different species
    - Psychology and Medicine: group medicine
    - Business: group different customers for marketing
    - Network: group different types of traffic patterns
    - Software: group different programs for data analysis
Clustering Methods

- K-means Clustering
  - Original k-means Clustering
  - Sequential k-means Clustering
  - Forgetful Sequential k-means Clustering
  - How to use the data mining tool
- Hierarchical Clustering Methods
  - Agglomerative methods
  - Divisive methods: polythetic approach and monothetic approach
  - How to use the data mining tool
K-means Clustering

- Suppose that we have n example feature vectors x1, x2, …, xn, and we know that they fall into k compact clusters, k < n.
- Let mi be the mean of the vectors in cluster i. Then we can say that x is in cluster i if the distance from x to mi is the minimum of all the k distances.
Procedure for finding k-means

- Make initial guesses for the means m1, m2, …, mk
- Until there is no change in any mean:
  - Assign each data point to the cluster whose mean is the nearest
  - Calculate the mean of each cluster: for i from 1 to k, replace mi with the mean of all examples assigned to cluster i
Procedure for finding k-means

[Figure: the k-means iterations on a 2-D dataset with k = 2, shown over four
panels: arbitrarily choose k means; assign each object to the most similar
center; update the cluster means and reassign the objects; update the cluster
means and reassign again, until no mean changes.]
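The procedure illustrated above is short enough to transcribe directly into code. Below is a minimal NumPy sketch (the course demo uses XLMiner instead); the function name, the random initialization, and the np.allclose stopping test are our own choices.

import numpy as np

def kmeans(X, k, rng=None):
    """Batch k-means, following the procedure above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    # Initial guesses for the means: randomly choose k of the examples.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    while True:
        # Assign each data point to the cluster whose mean is the nearest.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each mi with the mean of the examples assigned to cluster i.
        new_means = means.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:   # an empty cluster can occur with a bad start
                new_means[i] = members.mean(axis=0)
        # Stop when there is no change in any mean.
        if np.allclose(new_means, means):
            return means, labels
        means = new_means

# The Computer/History example from the beginning of the notes:
X = [[100, 40], [90, 20], [45, 95]]
means, labels = kmeans(X, k=2, rng=np.random.default_rng(0))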
Initialization of k-means

- The way to initialize the means was not specified. One popular way to start is to randomly choose k of the examples.
- The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
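As an illustration of "try a number of different starting points", the sketch below reruns the kmeans function from the earlier sketch with several seeds and keeps the best run. Scoring runs by the within-cluster sum of squared errors (SSE) is a standard choice, not something the slide specifies.

def kmeans_restarts(X, k, n_starts=5):
    """Run kmeans from several random starting points; keep the best run."""
    X = np.asarray(X, dtype=float)
    best_sse, best = np.inf, None
    for seed in range(n_starts):
        means, labels = kmeans(X, k, rng=np.random.default_rng(seed))
        # Judge each run by its within-cluster sum of squared errors (SSE).
        sse = ((X - means[labels]) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (means, labels)
    return best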
Disadvantages of k-means

- With a "bad" initial guess, it can happen that no points are assigned to the cluster with the initial mean mi, leaving that cluster empty.
- The value of k is not user-friendly: we usually do not know the number of clusters before we set out to find them.
Clustering Methods

- K-means Clustering
  - Original k-means Clustering
  - Sequential k-means Clustering
  - Forgetful Sequential k-means Clustering
  - How to use the data mining tool
- Hierarchical Clustering Methods
  - Agglomerative methods
  - Divisive methods: polythetic approach and monothetic approach
  - How to use the data mining tool
Sequential k-Means Clustering

- Another way to modify the k-means procedure is to update the means one example at a time, rather than all at once.
- This is particularly attractive when we acquire the examples over a period of time and we want to start clustering before we have seen all of the examples.
- Here is a modification of the k-means procedure that operates sequentially:
Sequential k-Means Clustering

- Make initial guesses for the means m1, m2, …, mk
- Set the counts n1, n2, …, nk to zero
- Until interrupted:
  - Acquire the next example, x
  - If mi is closest to x:
    - Increment ni
    - Replace mi by mi + (1/ni)(x - mi)
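A minimal sketch of this sequential update, assuming the examples arrive from any Python iterable (so "until interrupted" becomes "until the stream ends"); the function and parameter names are our own.

def sequential_kmeans(stream, initial_means):
    """Sequential k-means: update one mean per incoming example."""
    means = [np.array(m, dtype=float) for m in initial_means]
    counts = [0] * len(means)                    # n1, …, nk start at zero
    for x in stream:                             # "until interrupted"
        x = np.asarray(x, dtype=float)
        # Find the mean mi closest to x.
        i = min(range(len(means)), key=lambda j: np.linalg.norm(x - means[j]))
        counts[i] += 1                           # increment ni
        means[i] += (x - means[i]) / counts[i]   # mi <- mi + (1/ni)(x - mi)
    return means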
Clustering Methods

- K-means Clustering
  - Original k-means Clustering
  - Sequential k-means Clustering
  - Forgetful Sequential k-means Clustering
  - How to use the data mining tool
- Hierarchical Clustering Methods
  - Agglomerative methods
  - Divisive methods: polythetic approach and monothetic approach
  - How to use the data mining tool
Forgetful Sequential k-means

- This also suggests another alternative, in which we replace the counts by constants. In particular, suppose that a is a constant between 0 and 1, and consider the following variation:
- Make initial guesses for the means m1, m2, …, mk
- Until interrupted:
  - Acquire the next example, x
  - If mi is closest to x, replace mi by mi + a(x - mi)
Forgetful Sequential k-means

- The result is called the "forgetful" sequential k-means procedure.
- It is not hard to show that mi is a weighted average of the examples that were closest to mi, where the weight decreases exponentially with the "age" of the example.
- That is, if m0 is the initial value of the mean vector and xj is the j-th of the n examples that were used to form mi, then

  mn = (1 - a)^n m0 + a ∑_{k=1}^{n} (1 - a)^(n-k) xk
Forgetful Sequential k-means

- Thus, the initial value m0 is eventually forgotten, and recent examples receive more weight than ancient examples.
- This variation of k-means is particularly simple to implement, and it is attractive when the nature of the problem changes over time and the cluster centres "drift".
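The forgetful variant changes only one line of the sequential sketch above: the count-based step 1/ni becomes the constant a. The default a = 0.1 below is an arbitrary illustrative value; unrolling this update over n examples gives exactly the weighted-average formula on the previous slide.

def forgetful_sequential_kmeans(stream, initial_means, a=0.1):
    """Forgetful sequential k-means: a constant a in (0, 1) replaces 1/ni."""
    means = [np.array(m, dtype=float) for m in initial_means]
    for x in stream:
        x = np.asarray(x, dtype=float)
        i = min(range(len(means)), key=lambda j: np.linalg.norm(x - means[j]))
        means[i] += a * (x - means[i])           # mi <- mi + a(x - mi)
    return means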
Clustering Methods

- K-means Clustering
  - Original k-means Clustering
  - Sequential k-means Clustering
  - Forgetful Sequential k-means Clustering
  - How to use the data mining tool
- Hierarchical Clustering Methods
  - Agglomerative methods
  - Divisive methods: polythetic approach and monothetic approach
  - How to use the data mining tool
How to use the data mining tool

- We can use XLMiner for performing k-means clustering
- Open "cluster.xls"
[Screenshot: XLMiner data source dialog with fields Workbook, Worksheet, and Data range]
[Screenshot: data settings showing # Rows: 10 and # Cols: 3; "First row contains headers" is checked; Variables In Input Data: Name, Computer, History]
[Screenshot: k-means dialog with "Normalize input data" checked; Parameters: # Clusters: 2, # Iterations: 10; Options: Fixed Start with Set Seed 12345, or Random Starts with No. of Starts: 5]
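For readers without XLMiner, roughly the same run can be reproduced in Python with scikit-learn. This is a sketch under assumptions: the column names come from the earlier dialog, and StandardScaler stands in for XLMiner's "Normalize input data" option, which it may not match exactly.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the demo workbook (reading .xls files may require the xlrd package).
df = pd.read_excel("cluster.xls")
X = StandardScaler().fit_transform(df[["Computer", "History"]])

# Mirror the dialog: # Clusters = 2, # Iterations = 10, fixed start, seed 12345.
km = KMeans(n_clusters=2, init="random", n_init=1, max_iter=10, random_state=12345)
labels = km.fit_predict(X)
print(km.cluster_centers_)   # cluster means, in normalized units
print(labels)                # cluster membership for each student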
[Screenshot: output options with "Show data summary" and "Show distances from each cluster center" checked]
[Remaining slides: screenshots of the XLMiner k-means clustering output]