Clustering Algorithms
Sunida Ratanothayanon
What is Clustering?
Clustering

Clustering is a form of classification that divides data into groups in a meaningful and useful way.
It is an unsupervised form of classification.
Outline

K-Means Algorithm

Hierarchical Clustering Algorithm
K-Means Algorithm

A partitional clustering algorithm.
It produces k clusters (the number of clusters k is specified by the user).
Each cluster has a cluster center called a centroid.
The algorithm iteratively groups data into k clusters based on a distance function.
K-Means Algorithm

The centroid of a cluster is the mean of all data points in that cluster.
Stop when the centers no longer change (a short code sketch of this loop follows).
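To make the loop concrete, here is a minimal sketch of K-Means in Python (assuming NumPy is available; the function name kmeans and its arguments are illustrative rather than taken from the slides):

import numpy as np

def kmeans(X, centers, max_iter=100):
    """Iteratively group the rows of X into k clusters, where k = len(centers)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # assign every point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        # (this sketch assumes no cluster ever becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
        if np.allclose(new_centers, centers):  # stop when the centers no longer change
            break
        centers = new_centers
    return centers, labels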
A numerical example
K-Means example
We have five data points with 2 attributes and want to group the data into 2 clusters (k = 2).

Data Point    x1    x2
1             22    21
2             19    20
3             18    22
4              1     3
5              4     2
K-Means example
We can plot the five data points as follows.
[Scatter plot of the five data points over x1 and x2; the points fall into two visible groups, labelled cluster C1 and cluster C2.]
K-Means example (1st iteration)
Step 1: Choosing the centers and defining k
C1 = (18, 22), C2 = (4, 2)

Step 2: Computing cluster centers
We have already defined C1 and C2, so there is nothing more to compute in this first iteration.
Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

d = sqrt( Σ_{i=1}^{n} (x_i - y_i)^2 )
K-Means example (1st iteration)
Step 3 (cont.):
Distance table for all data points

Data Point    C1 (18, 22)    C2 (4, 2)
(22, 21)          4.12         26.17
(19, 20)          2.23         23.43
(18, 22)          0            24.41
(1, 3)           25.49          3.16
(4, 2)           24.41          0

Then we assign each data point to a cluster by comparing its distances to the two centers: each data point is assigned to its closest center. This gives cluster 1 = {(22, 21), (19, 20), (18, 22)} and cluster 2 = {(1, 3), (4, 2)}.
K-Means example (2nd iteration)
Step 2: Computing cluster centers
We compute new cluster centers from the assignments of the 1st iteration.

Members of cluster 1 are (22, 21), (19, 20) and (18, 22). We take the average of these data points:
( (22 + 19 + 18)/3 , (21 + 20 + 22)/3 ) = ( 59/3 , 63/3 ) = (19.7, 21)
The new C1 is (19.7, 21).

Members of cluster 2 are (1, 3) and (4, 2):
( (1 + 4)/2 , (3 + 2)/2 ) = ( 5/2 , 5/2 ) = (2.5, 2.5)
The new C2 is (2.5, 2.5).
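The same centroid update can be checked in a couple of lines (assuming NumPy; the variables below are illustrative):

import numpy as np

cluster1 = np.array([[22, 21], [19, 20], [18, 22]], dtype=float)
cluster2 = np.array([[1, 3], [4, 2]], dtype=float)
print(cluster1.mean(axis=0))   # approximately [19.67, 21] -> new C1
print(cluster2.mean(axis=0))   # [2.5, 2.5]                -> new C2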
K-Means example (2nd iteration)
Step 3: Finding the Euclidean distance of each data point from each new center and assigning each data point to a cluster

Distance table for all data points with the new centers

Data Point    C1' (19.7, 21)    C2' (2.5, 2.5)
(22, 21)           2.3               26.88
(19, 20)           1.22              24.05
(18, 22)           1.97              24.91
(1, 3)            25.96               1.58
(4, 2)            24.65               1.58

Assign each data point to its closest center by comparing its distances to the two centers. Because the centers still changed (from (18, 22) and (4, 2) to (19.7, 21) and (2.5, 2.5)), repeat Steps 2 and 3 for the next iteration.
K-Means example (3rd iteration)
Step 2: Computing cluster centers
We compute the new cluster centers again.

Members of cluster 1 are still (22, 21), (19, 20) and (18, 22):
( (22 + 19 + 18)/3 , (21 + 20 + 22)/3 ) = (19.7, 21), so C1 stays (19.7, 21).

Members of cluster 2 are still (1, 3) and (4, 2):
( (1 + 4)/2 , (3 + 2)/2 ) = (2.5, 2.5), so C2 stays (2.5, 2.5).
K-Means example (3rd iteration)
Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

Distance table for all data points with the new centers

Data Point    C1'' (19.7, 21)    C2'' (2.5, 2.5)
(22, 21)            2.3                26.88
(19, 20)            1.22               24.05
(18, 22)            1.97               24.91
(1, 3)             25.96                1.58
(4, 2)             24.65                1.58
Assign each data point to its closest center. The assignments and the centers are the same as in the 2nd iteration, so we stop the algorithm: the centers no longer change.
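The whole worked example can be reproduced in a few lines, for instance with scikit-learn (assuming it is installed; the initial centers below are the ones chosen in Step 1):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[22, 21], [19, 20], [18, 22], [1, 3], [4, 2]], dtype=float)
init_centers = np.array([[18, 22], [4, 2]], dtype=float)

km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)
print(km.labels_)            # expected: [0 0 0 1 1] (first three points in C1, last two in C2)
print(km.cluster_centers_)   # approximately [[19.67, 21.0], [2.5, 2.5]]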
Hierarchical Clustering Algorithm

[Dendrogram sketch over five points A, B, C, D, E.]

Produces a nested sequence of clusters, like a tree.
Allows clusters to have subclusters.
Individual data points at the bottom of the tree are called "singleton clusters".
Hierarchical Clustering Algorithm

Agglomerative method

The tree is built up from the bottom level; at each level the nearest pair of clusters is merged to move one level up.
Continue until all the data points are merged into a single cluster (a short code sketch follows below).
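A minimal Python sketch of this bottom-up merging, assuming the pairwise-average distance update used in the numerical example that follows (all names here are illustrative):

import numpy as np
from itertools import combinations

def agglomerative(points):
    """points: dict name -> coordinate tuple. Returns the merge order, bottom-up."""
    # distance between every pair of current clusters, keyed by a frozenset of names
    dist = {frozenset((a, b)): float(np.linalg.norm(np.subtract(points[a], points[b])))
            for a, b in combinations(points, 2)}
    clusters = set(points)
    merges = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)          # closest pair of clusters
        a, b = tuple(pair)
        merged = a + "&" + b
        merges.append((a, b, dist[pair]))
        # average-distance update: d(merged, k) = (d(a, k) + d(b, k)) / 2
        for k in clusters - {a, b}:
            dist[frozenset((merged, k))] = (dist[frozenset((a, k))] + dist[frozenset((b, k))]) / 2
        # drop every entry that still mentions a or b
        dist = {p, d for p, d in dist.items()} if False else {p: d for p, d in dist.items() if a not in p and b not in p}
        clusters = (clusters - {a, b}) | {merged}
    return merges

# Example usage with the five points from the example that follows:
data = {"A": (9, 3, 7), "B": (10, 2, 9), "C": (1, 9, 4), "D": (6, 5, 5), "E": (1, 10, 3)}
for a, b, d in agglomerative(data):
    print(f"merge {a} and {b} at distance {d:.2f}")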
A numerical example
Hierarchical Clustering example
We have five data points with 3 attributes.

Data Point    x1    x2    x3
A              9     3     7
B             10     2     9
C              1     9     4
D              6     5     5
E              1    10     3
Hierarchical Clustering example (1st iteration)
Step 1: Calculating the Euclidean distance between every two data points

We obtain the following distance table:

Data Point      A (9, 3, 7)   B (10, 2, 9)   C (1, 9, 4)   D (6, 5, 5)   E (1, 10, 3)
A (9, 3, 7)         0             2.45          10.44          4.12          11.36
B (10, 2, 9)        -             0             12.45          6.4           13.45
C (1, 9, 4)         -             -              0             6.48           1.41
D (6, 5, 5)         -             -              -             0              7.35
E (1, 10, 3)        -             -              -             -              0
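A short snippet (assuming NumPy; the names are illustrative) that reproduces this pairwise distance table:

import numpy as np

points = {"A": (9, 3, 7), "B": (10, 2, 9), "C": (1, 9, 4), "D": (6, 5, 5), "E": (1, 10, 3)}
names = list(points)
for i, p in enumerate(names[:-1]):
    row = [f"{p}-{q}: {np.linalg.norm(np.subtract(points[p], points[q])):.2f}" for q in names[i + 1:]]
    print("  ".join(row))
# first row: A-B: 2.45  A-C: 10.44  A-D: 4.12  A-E: 11.36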
Hierarchical Clustering example (1st iteration)
Step 2: Forming a tree

Consider the most similar pair of data points in the distance table above: C and E are the most similar (distance 1.41).
We obtain the first cluster, C&E.
[Dendrogram so far: C and E joined.]

Repeat Steps 1 and 2 until all data points are merged into a single cluster.
Hierarchical Clustering example (2nd iteration)
Step 1: Calculating the Euclidean distance between every two clusters

We redraw the distance table, now including the merged entity C&E.

Data Point           A (9, 3, 7)   B (10, 2, 9)   D (6, 5, 5)   C&E (1, 9.5, 3.5)
A (9, 3, 7)              0             2.45           4.12            10.9
B (10, 2, 9)             -             0              6.4             12.95
D (6, 5, 5)              -             -              0                6.90
C&E (1, 9.5, 3.5)        -             -              -                0

The distance from C&E to A is obtained from
d((C,E),A) = avg( d(C,A), d(E,A) )
Using the previous table for the distances from C to A and from E to A: avg(10.44, 11.36) = 10.9.
Hierarchical Clustering example (2nd iteration)
Step 2: Forming a tree

Consider the most similar pair in the distance table above: A and B are the most similar (distance 2.45).
We obtain the second cluster, A&B.
[Dendrogram so far: C and E joined; A and B joined.]

Repeat Steps 1 and 2 until all data points are merged into a single cluster.
Hierarchical Clustering example (3rd iteration)
Step 1: Calculating the Euclidean distance between every two clusters

We redraw the distance table, now including the merged entities C&E and A&B. From the previous table we obtain the following distances for the new table:

Data Point     A&B    D (6, 5, 5)    C&E
A&B             0         5.26       11.93
D (6, 5, 5)     -         0           6.9
C&E             -         -           0

d((A,B),D) = avg( d(A,D), d(B,D) ) = avg(4.12, 6.40) = 5.26
d((C,E),D) = 6.90
d((C,E),(A,B)) = avg( d((C,E),A), d((C,E),B) ) = avg(10.9, 12.95) = 11.93
Hierarchical Clustering example (3rd iteration)
Step 2: Forming a tree

Consider the most similar pair in the distance table above: A&B and D are the most similar (distance 5.26).
We obtain the new cluster A&B&D.
[Dendrogram so far: C and E joined; A and B joined, then joined with D.]

Repeat Steps 1 and 2 until all data points are merged into a single cluster.
Hierarchical Clustering example (4th iteration)
Step 1: Calculating the Euclidean distance between every two clusters

We redraw the distance table, now including the merged entities C&E and A&B&D. From the previous table we obtain the distance from cluster A&B&D to C&E as follows:

Data Point    A&B&D    C&E
A&B&D            0      9.4
C&E              -      0

d((A,B,D),(C,E)) = avg( d((A,B),(C,E)), d(D,(C,E)) ) = avg(11.93, 6.9) = 9.4
Hierarchical Clustering example (4th iteration)
Step 2: Forming a tree

Only the pair A&B&D and C&E remains in the distance table, so no more recalculation has to be made: we merge all data points into the single cluster A&B&D&C&E and form the final tree.
[Final dendrogram: C and E joined; A and B joined, then joined with D; the two branches joined at the root.]
Stop the algorithm.
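The whole hierarchy can also be reproduced with SciPy (assuming it is installed); its "weighted" linkage method applies the same pair-averaging distance update used above:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]], dtype=float)  # A, B, C, D, E
Z = linkage(X, method="weighted", metric="euclidean")
print(np.round(Z, 2))
# Each row is one merge: [cluster i, cluster j, merge distance, new cluster size].
# Expected merge order: C with E (~1.41), A with B (~2.45), A&B with D (~5.26), then the root (~9.4).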
Conclusion

Two major clustering algorithms:

K-Means algorithm
An algorithm that iteratively groups data into k clusters based on a distance function.
The number of clusters k is specified by the user.

Hierarchical Clustering algorithm
Produces a nested sequence of clusters, like a tree.
The tree is built up from the bottom level, merging clusters until all the data points are merged into a single cluster.
Thank you