KDD: Part II Clustering
Dae-Won Kim
School of Computer Science & Engineering
Chung-Ang University
What is Data Mining?
Too much data and not enough information — this is a problem
facing many businesses and industries.
A solution lies here, with data mining. Most businesses have an
enormous amount of data, with a great deal of information
hiding within it, but "hiding" is usually exactly what it is doing:
So much data exists that it overwhelms traditional methods of
data analysis.
Data mining provides a way to get at the information buried in
the data. Data mining finds hidden patterns in large, complex
collections of data, patterns that elude traditional statistical
approaches to analysis.
- Oracle Data Mining Solution
Issues
Classification vs. Clustering vs. Rule Mining
What is Classification?
Can you tell me the name of this fish?
What is Classification?
Construct a classifier for making a decision
Fish: x^T = (x1, x2) = (Lightness, Width)
Classification
Definition
“The act of taking in raw data and taking an action based on the category of the
pattern.”
“Build a machine that can recognize or predict patterns:
characters, speech, faces, cancer, proteins, DNA sequences, etc.”
An example: classify patients by cancer subtype or treatment
Leukemia (Golub et al., Science, 1999):
- 2-class problem
- 38 patient samples
- 34 normal samples
- 6,817 genes
What is Clustering?
Cluster analysis discovers (b) from (a)
What is Clustering?
Clustering vs. Classification
- Clustering (cluster analysis)
• Unsupervised pattern classification
• No training patterns, no prior knowledge
• Discovers homogeneous groups in data based on proximity
- Classification (discriminant analysis)
• Supervised pattern classification
• Labeled training patterns, the groups are known a priori
• Constructs rules for classifying new data into the known groups
Interchangeable Terms
- Cluster Analysis is the preferred generic term
• Clustering in Computer Science
• Numerical taxonomy in Biology
• Q analysis in Psychology
• Segmentation in Market Research
- Cluster is the preferred generic term
• group or class are also often used
- Proximity is the preferred generic term
• (dis)similarity or distance are also often used
Image Segmentation
Gene Function Prediction
More Examples
Intrusion Detection Systems
Atopic Dermatitis
Notation
Given data set: an ‘n x d’ pattern (or data) matrix
- ‘n’ data patterns (objects, observations, vectors)
- ‘d’ variables (features, attributes, dimensions, fields)

Air pollution in US cities:

City       SO2   TEMP   WIND   DAYS
Phoenix     10   70.3    6.0     36
Miami       10   75.5    8.8    128
Seattle     29   51.1    9.4    164
Detroit     36   49.9    8.4    113
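To make the notation concrete, here is a minimal numpy sketch (not part of the original slides) that stores the air-pollution table above as an ‘n x d’ pattern matrix:

```python
import numpy as np

# 'n x d' pattern matrix: n = 4 cities (rows), d = 4 variables (columns)
# Column order: SO2, TEMP, WIND, DAYS (values from the table above)
X = np.array([
    [10, 70.3, 6.0,  36],   # Phoenix
    [10, 75.5, 8.8, 128],   # Miami
    [29, 51.1, 9.4, 164],   # Seattle
    [36, 49.9, 8.4, 113],   # Detroit
])

n, d = X.shape   # n data patterns, d variables
print(n, d)      # 4 4
```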
Other example pattern matrices:

          Time_1   Time_2   ...   Time_d
Gene_1       2.1      3.6   ...     -2.6
Gene_2       3.5      7.1   ...     -2.1
...
Gene_n      -1.2      8.9   ...      6.5

Analogous matrices arise for samples (Sample_1 ... Sample_n) measured over time points,
for genes measured over experimental conditions (Cond_1 ... Cond_d), and for EEG
recordings (EEG_1 ... EEG_n) measured over time points.

In general, the pattern matrix is

X = [ x11  x12  ...  x1d
      x21  x22  ...  x2d
      ...
      xn1  xn2  ...  xnd ]
Q: cluster data into three groups
Observations:
A = (1, 1), B = (3, 1), C = (3, 2), D = (9, 2), E = (11, 2), F = (9, 6), G = (11, 6)
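As an illustration (not from the slides), one way to attempt this exercise programmatically is with scipy's hierarchical clustering; the linkage method and the cut into three groups are assumptions here, and the algorithm itself is introduced in the next section:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Observations A..G from the slide
points = np.array([
    [1, 1],    # A
    [3, 1],    # B
    [3, 2],    # C
    [9, 2],    # D
    [11, 2],   # E
    [9, 6],    # F
    [11, 6],   # G
])

# Agglomerative (single-link) clustering, cut into three groups
Z = linkage(points, method='single')
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)   # one group label per point, e.g. {A,B,C}, {D,E}, {F,G}
```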
Basic Algorithms for Cluster Analysis
Hierarchical Clustering Algorithm
Dae-Won Kim
School of Computer Science and Engineering
CAU
Hierarchical Algorithm
1. Start with each point as its own cluster
2. At each iteration, merge two clusters with the smallest distance
[Scatter plot of the data in a 3-D feature space: Feature 1, Feature 2, Feature 3]
Hierarchical Algorithm
Eventually all points will be linked into a single cluster
Hierarchical Algorithm
The sequence of mergers is represented in a hierarchical tree
[Dendrogram over the points a, b, c, d, e, f, g showing the sequence of mergers]
Hierarchical Algorithm
Agglomerative vs. Divisive
[Figure: nested clusters of points a, b, c, d, e. Agglomerative clustering builds them bottom-up through levels 0 to 4: {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}. Divisive clustering produces the same hierarchy top-down, from level 4 back to level 0.]
Example
Observations of students' heights and weights:
Data 1 = (180, 70), Data 2 = (180, 71), Data 3 = (180, 73), …
Example
Initial Distance Matrix (objects 1 to 6):

        1    2    3    4    5    6
  1     0
  2     1    0
  3     3    2    0
  4     6    7    6    0
  5     5    6    5    3    0
  6     7   10    9    4    5    0
Example: Single-Link Method

Step 1: (1,2) group identified (minimum distance = 1). Updated distance matrix:

          1,2    3    4    5    6
  1,2      0
  3        2    0
  4        6    6    0
  5        5    5    3    0
  6        7    9    4    5    0

Step 2: (1,2,3) identified (distance = 2). Updated distance matrix:

          1,2,3    4    5    6
  1,2,3      0
  4          6    0
  5          5    3    0
  6          7    4    5    0

Step 3: (4,5) identified (distance = 3). Updated distance matrix:

          1,2,3   4,5    6
  1,2,3      0
  4,5        5     0
  6          7     4    0

Step 4: (4,5,6) identified (distance = 4). Updated distance matrix:

          1,2,3   4,5,6
  1,2,3      0
  4,5,6      5      0

Step 5: finalized. The two remaining clusters, (1,2,3) and (4,5,6), merge at distance 5.
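The merge sequence above can be reproduced with a minimal from-scratch sketch (scipy's linkage function does the same job more efficiently); the distance matrix is the one from the example, and the nested-loop search is written for clarity rather than speed:

```python
import numpy as np

# Symmetric distance matrix from the example above (objects 1..6)
D = np.array([
    [0,  1, 3, 6, 5,  7],
    [1,  0, 2, 7, 6, 10],
    [3,  2, 0, 6, 5,  9],
    [6,  7, 6, 0, 3,  4],
    [5,  6, 5, 3, 0,  5],
    [7, 10, 9, 4, 5,  0],
], dtype=float)

clusters = [{i + 1} for i in range(len(D))]   # start: every object is its own cluster

while len(clusters) > 1:
    # single-link: cluster-to-cluster distance = smallest pairwise distance
    best_i, best_j, best_d = None, None, np.inf
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(D[a - 1, b - 1] for a in clusters[i] for b in clusters[j])
            if d < best_d:
                best_i, best_j, best_d = i, j, d
    merged = clusters[best_i] | clusters[best_j]
    print(f"merge {sorted(clusters[best_i])} + {sorted(clusters[best_j])} at distance {best_d}")
    clusters = [c for k, c in enumerate(clusters) if k not in (best_i, best_j)] + [merged]
```

Running it prints the same five mergers as Steps 1 to 5: (1,2) at 1, (1,2,3) at 2, (4,5) at 3, (4,5,6) at 4, and the final merge at 5.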
Variant: distance between two groups
# based on the distance between the two closest elements
[Figure: element-wise distances between the members of two clusters (left) vs. the resulting group-wise distance (right)]
Variant: Average-Link Method
# based on the average of all pairs of distances
[Figure: element-wise distances between the members of two clusters (left) vs. the resulting group-wise (average) distance (right)]
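A small sketch of the two group-wise distance definitions, applied to the distance matrix from the earlier example; the function names are mine, not from the slides:

```python
import numpy as np

def single_link(D, A, B):
    # group-wise distance = the smallest element-wise distance between A and B
    return min(D[a, b] for a in A for b in B)

def average_link(D, A, B):
    # group-wise distance = the average of all element-wise distances between A and B
    return float(np.mean([D[a, b] for a in A for b in B]))

# Distance matrix from the single-link example (0-based indices for objects 1..6)
D = np.array([
    [0,  1, 3, 6, 5,  7],
    [1,  0, 2, 7, 6, 10],
    [3,  2, 0, 6, 5,  9],
    [6,  7, 6, 0, 3,  4],
    [5,  6, 5, 3, 0,  5],
    [7, 10, 9, 4, 5,  0],
], dtype=float)

A, B = [0, 1, 2], [3, 4]        # clusters {1,2,3} and {4,5}
print(single_link(D, A, B))     # 5.0  (closest cross-cluster pair)
print(average_link(D, A, B))    # about 5.83  (mean of 6, 5, 7, 6, 6, 5)
```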
Quiz: Find two groups using
1. Hierarchical algorithm using single-link
2. Hierarchical algorithm using average-link
[Figure: data points annotated with pairwise distances (3, 8, 1, 4, 9, 5, 4, 3, 2, 11)]
Example
Student 1 = (180, 70), Student 2 = (176, 70), …
[Figure: the data points annotated with their pairwise distances]

Initial distance matrix:

        1    2    3    4
  1     0
  2     4    0
  3     8    3    0
  4     5    9   11    0

Single-link: (2,3) group identified (distance = 3). Updated distance matrix:

          1   2,3    4
  1       0
  2,3     4    0
  4       5    9    0

Variant (average-link): after merging (2,3), the group-wise distances become

          1   2,3    4
  1       0
  2,3     6    0
  4       5   10    0
Basic Algorithms for Cluster Analysis
K-Means Clustering Method
Dae-Won Kim
School of Computer Science and Engineering
CAU
Review: Hierarchical
1. Hierarchical algorithm
“Yields a dendrogram representing the nested grouping and similarity levels
at which groupings change.”
2. Three popular schemes
“Two clusters are merged into a larger cluster based on a minimum-distance criterion.”
1) single-link: minimum distance over all pairs of patterns from the two clusters
2) complete-link: maximum distance
3) average-link: average distance
[Figure: two clusters, Cluster A and Cluster B, with the pairwise distances between them]
Review: Hierarchical
3. Single-link vs. complete-link
• Single-link is more versatile than complete-link
• Complete-link does not suffer from a “chaining effect”
[Figure: the same data clustered by single-link and by complete-link]
Review: Hierarchical
4. Pros and cons
• More versatile than partitional algorithms
• Dendrogram provides a visual inspection
• Computationally prohibitive: O(n²)
• Cannot repair the faults of previous steps
• May produce large chunks of clusters
• Most widely used due to ease of use and visualization
• The average-link algorithm has shown superior performance
• In some reports, the performance was close to random
Review: Hierarchical
5. Graph-theoretic clustering is a hierarchical clustering approach
- Single-link clusters are subgraphs of the minimum spanning tree
- Complete-link clusters are maximal complete subgraphs
Using the minimum spanning tree to form clusters
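A hedged sketch of the minimum-spanning-tree route to single-link clusters, assuming scipy is available: build the MST of the distance graph, cut the largest edges, and read the clusters off the connected components.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Distance matrix of the six objects from the single-link example
D = np.array([
    [0,  1, 3, 6, 5,  7],
    [1,  0, 2, 7, 6, 10],
    [3,  2, 0, 6, 5,  9],
    [6,  7, 6, 0, 3,  4],
    [5,  6, 5, 3, 0,  5],
    [7, 10, 9, 4, 5,  0],
], dtype=float)

k = 2                                    # desired number of single-link clusters
mst = minimum_spanning_tree(D).toarray()

# removing the (k - 1) largest MST edges leaves k connected components,
# and these components are exactly the single-link clusters
cut = np.sort(mst[mst > 0])[-(k - 1):].min()
mst[mst >= cut] = 0
_, labels = connected_components(mst, directed=False)
print(labels)   # two components: {1, 2, 3} and {4, 5, 6}
```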
Review: Hierarchical
Tip. To speed up the implementation, precompute the dissimilarity matrix and use an
indexing structure (a small sketch follows below).
[The sequence of reduced dissimilarity matrices from the single-link example above: from the initial 6 x 6 matrix through the merges (1,2), (1,2,3), (4,5), and (4,5,6), down to the final 2 x 2 matrix]
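A minimal sketch of the tip above, assuming scipy: compute the dissimilarity matrix once with pdist and reuse it for the hierarchical algorithm (here on the seven points A to G used earlier):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

# The seven points A..G used earlier
X = np.array([[1, 1], [3, 1], [3, 2], [9, 2], [11, 2], [9, 6], [11, 6]], dtype=float)

# Precompute the dissimilarity matrix once (condensed form: n*(n-1)/2 values)
condensed = pdist(X, metric='euclidean')
D = squareform(condensed)            # full n x n matrix, if (i, j) indexing is needed

# Reuse it: linkage accepts the precomputed condensed matrix directly
Z = linkage(condensed, method='single')
print(D.round(2))
```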
K-Means Clustering Algorithm
• Fast clustering
• Each cluster is represented by a cluster center
• Each cluster center is obtained as the average of its members
• Each pattern is assigned to the cluster whose center is closest
K-Means Clustering Algorithm
1. Objective function of K-means algorithm
“Yields a single partition of data at each iteration using the distance between patterns and
centroids, leading to intracluster compactness and intercluster separation.”
min J(X, k) = Σ_{i=1}^{k} Σ_{x_j ∈ C_i} d(x_j, v_i)

where v_i is the centroid of cluster C_i and d(x_j, v_i) is the distance between pattern x_j and centroid v_i.
2. Incremental Greedy Procedure
1) Select k initial centroids
2) Assign each pattern to its closest cluster centroid
3) Update cluster centroids
4) Repeat these steps until no improvement in J(X,k)
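A small sketch of the objective, assuming squared Euclidean distance for d(x_j, v_i); the data below are made up purely for the check:

```python
import numpy as np

def kmeans_objective(X, centroids, labels):
    # J(X, k): sum over clusters of the squared distances from members to their centroid
    return float(sum(np.sum((X[labels == i] - c) ** 2)
                     for i, c in enumerate(centroids)))

# Tiny made-up check: two tight clusters, each point 0.5 away from its centroid
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
labels = np.array([0, 0, 1, 1])
print(kmeans_objective(X, centroids, labels))   # 1.0 (four contributions of 0.25)
```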
Procedure
Step 1. Guess the ‘K’ centers at random
Step 2. Classify data into the K groups (centers)
Step 3. Update the centers
Step 4. if no change in centers, then stop; otherwise go to Step 2
[Figure: two cluster centers, center1 and center2, marked among the data points]
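A minimal numpy sketch of this procedure (random initialization from the data points and Euclidean distance are assumptions; the slides do not fix these choices):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: guess the K centers at random (here: k distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: classify each pattern into the group of its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each center as the mean of the data assigned to it
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # Step 4: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Usage on two synthetic blobs (made up for illustration)
data = np.vstack([np.random.default_rng(1).normal(m, 0.3, size=(20, 2)) for m in (0, 5)])
centers, labels = kmeans(data, k=2)
print(centers)   # two centers, one near (0, 0) and one near (5, 5)
```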
Key Steps
- Classify data: assign each datum to the closest group
- Update centers: compute the average of the data in each group
[Figure: seven data points assigned to two centers c1 and c2; e.g., new c1 = (1 + 2 + 3 + 7) / 4, the average of the members of c1]
Example:
Cluster the data into K=2 groups
1. Initialize:
G1-Center = A = (1, 1), G2-Center = B = (2, 1)

2. Classify Data: G1 = {A, C}, G2 = {B, D, E, F, G, H}
   Update Centers: G1-Center = (1.0, 1.5), G2-Center = (3.6, 3.5)
   Check Stop: Change (Yes) -> Continue

3. Classify Data: G1 = {A, B, C, D}, G2 = {E, F, G, H}
   Update Centers: G1-Center = (1.5, 1.5), G2-Center = (4.5, 4.5)
   Check Stop: Change (Yes) -> Continue

4. Classify Data: G1 = {A, B, C, D}, G2 = {E, F, G, H}
   Update Centers: G1-Center = (1.5, 1.5), G2-Center = (4.5, 4.5)
   Check Stop: Change (No) -> Stop

[Scatter plot of points A to H in the x-y plane with the two resulting clusters]
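The same iterations can be checked in a few lines of numpy; the coordinates below are read off the plot and are consistent with the centers stated above, but they are an assumption rather than values given in the text:

```python
import numpy as np

# Coordinates inferred from the plot (an assumption; they reproduce the stated centers)
pts = {'A': (1, 1), 'B': (2, 1), 'C': (1, 2), 'D': (2, 2),
       'E': (4, 4), 'F': (5, 4), 'G': (4, 5), 'H': (5, 5)}
X = np.array(list(pts.values()), dtype=float)

centers = np.array([pts['A'], pts['B']], dtype=float)   # initialize with A and B
while True:
    # classify: each point goes to its nearest center
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    # update: each center becomes the mean of its members
    new_centers = np.array([X[labels == i].mean(axis=0) for i in range(2)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)   # [[1.5 1.5] [4.5 4.5]], matching the result on the slide
```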
Example:
Cluster the data into K=2 groups

1. Initialize:
G1-Center = (0, 0), G2-Center = (1, 0)

2. Classify Data: G1 = {x1, x3}, G2 = {x2, x4, …, x20}
   Update Centers: G1-Center = (0.0, 0.5), G2-Center = (5.67, 5.33)
   Check Stop: Change (Yes) -> Continue

3. Classify Data: G1 = {x1, ..., x8}, G2 = {x9, …, x20}
   Update Centers: G1-Center = (1.25, 1.13), G2-Center = (7.67, 7.33)
   Check Stop: Change (Yes) -> Continue

4. Classify Data: G1 = {x1, ..., x8}, G2 = {x9, …, x20}
   Update Centers: G1-Center = (1.25, 1.13), G2-Center = (7.67, 7.33)
   Check Stop: Change (No) -> Stop

[Scatter plot: points x1 to x20 on a 0-9 grid, with x1 to x8 in the lower left and x9 to x20 in the upper right]
Issues in K-means
Pros and cons
• The simplest and most commonly used algorithm
• Computationally efficient: O(nks)
• Tends to work well with isolated and compact clusters
• Induces fixed cluster shapes, depending on the distance measure
• Sensitive to the initial selection of centroids
1) An ‘ellipsoidal’ clustering result comes from the initial selection {A, B, C}
2) A ‘rectangular’ clustering result comes from the initial selection {A, D, F}
3) Density-based initial selection is popular: mountain clustering
[Figure: two different clustering results obtained by grouping the same 7 data points into 3 clusters]
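A sketch of this sensitivity to initial centroids using scikit-learn's KMeans (the data points are hypothetical); running the algorithm several times and keeping the partition with the lowest objective is the usual remedy:

```python
import numpy as np
from sklearn.cluster import KMeans

# Seven hypothetical points forming three natural groups (illustration only)
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6],
              [9, 0], [9, 1]], dtype=float)

# A single run from random initial centroids may land in different partitions
for seed in range(3):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 2), km.labels_)

# The usual remedy: repeat the whole procedure and keep the lowest-J(X, k) result
best = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X)
print(round(best.inertia_, 2), best.labels_)
```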
Issues in K-means
Tip. Why do K-means-type algorithms remain so popular?
1) The cluster solution is affected by the choice of internal parameters
: the K-means algorithm is easy to use in most applications because it has only a small number of such parameters
2) Mathematically sound
: convergence of K-means-type algorithms in a finite number of iterations has been proved
: local optimality of the partial optimal solution has been proved
Tip. In implementation, a ‘dead’ cluster can arise due to random initialization.
Thus, always check the number of data points belonging to each cluster in each iteration.
[Figure: a clustering result containing a dead (empty) cluster]
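A minimal sketch of the dead-cluster check, with one possible repair (re-seeding an empty cluster at a randomly chosen data point, which is an assumption, not a method from the slides):

```python
import numpy as np

def assign_and_repair(X, centers, rng):
    # one assignment step that also reports and repairs 'dead' (empty) clusters
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    counts = np.bincount(labels, minlength=len(centers))
    print("cluster sizes:", counts)           # inspect this in every iteration
    for i in np.where(counts == 0)[0]:
        centers[i] = X[rng.integers(len(X))]  # re-seed a dead cluster at a random point
    return labels, centers

# Usage on made-up data: the third center starts far away and comes back empty
rng = np.random.default_rng(0)
X = rng.random((50, 2))
centers = np.array([[0.2, 0.2], [0.8, 0.8], [5.0, 5.0]])
labels, centers = assign_and_repair(X, centers, rng)
```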
Other Issues in K-means
How to guess the desirable initial
centers in K-means algorithm?
Other Issues in K-means
How to cluster uncertain data
with vague boundaries?
Other Issues in K-means
How to know the optimal number
of clusters in K-means algorithm?
Other Issues in K-means
Do K-means-type algorithms detect
only sphere-shaped clusters?
Other Issues in K-means
How to cluster symbolic,
categorical data?