DATA MINING: CLUSTER ANALYSIS (3)
Instructor: Dr. Chun Yu
School of Statistics
Jiangxi University of Finance and Economics
Fall 2015
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Density-based clustering
DBSCAN
• DBSCAN is a density-based algorithm.
• Center-based density
• Density = number of points within a specified radius (Eps)
DBSCAN
• A point is a core point if it has more than a specified number of points (MinPts) within Eps
• These are points that are in the interior of a cluster
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
• A noise point is any point that is not a core point or a border point.
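To make these definitions concrete, here is a minimal sketch that labels every point as core, border, or noise under Euclidean distance. The function name and example data are illustrative, not from the slides, and it counts a point as its own neighbor, which is one common convention.

```python
import numpy as np

def label_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'."""
    # Pairwise Euclidean distances (brute force; fine for small data sets).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood test; the count includes the point itself here.
    neighbors = dists <= eps
    core = neighbors.sum(axis=1) >= min_pts
    # A border point is not core but has at least one core point within Eps.
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    return np.where(core, "core", np.where(border, "border", "noise"))

# Example: one dense blob plus an outlier.
X = np.array([[0, 0], [0, 0.1], [0.1, 0], [0.1, 0.1], [5, 5]])
print(label_points(X, eps=0.2, min_pts=4))   # four core points, one noise point
```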
DBSCAN: Core, Border, and Noise Points
[Figure: illustration of core, border, and noise points for a given Eps and MinPts.]
DBSCAN Algorithm
• Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster
• Any border point that is close enough to a core point is put in the same cluster as the core point
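In practice one rarely implements the full algorithm by hand; the sketch below runs scikit-learn's DBSCAN on toy data. The data and the parameter values are illustrative, not from the slides.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((200, 2))                    # toy 2-D data

# eps corresponds to Eps and min_samples to MinPts in the slides.
db = DBSCAN(eps=0.1, min_samples=4).fit(X)
labels = db.labels_                         # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", np.sum(labels == -1))
```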
DBSCAN: Determining Eps and MinPts
• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance (called k-dist)
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor
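A sketch of that k-distance plot, assuming scikit-learn and matplotlib are available; k is usually set to MinPts, and the data here is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4                                       # typically k = MinPts
rng = np.random.default_rng(0)
X = rng.random((300, 2))

# Ask for k + 1 neighbors because each point is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])               # distance to the k-th neighbor

plt.plot(k_dist)
plt.xlabel("points sorted by k-dist")
plt.ylabel("distance to k-th nearest neighbor")
plt.show()                                  # the knee of the curve suggests Eps
```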
DBSCAN: Core, Border and Noise Points
[Figure: original points (left) and their point types, core, border, and noise (right), using Eps = 10 and MinPts = 4.]
When DBSCAN Works Well
[Figure: original points (left) and the clusters found by DBSCAN (right).]
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
Cluster Validity
• For supervised classification we have a variety of
measures to evaluate how good our model is
• Accuracy, precision, etc.
• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters
• Why do we want to evaluate them?
• To avoid finding patterns in noise
• To compare clustering algorithms
• To compare two sets of clusters
• To compare two clusters
Clusters found in Random Data
[Figure: four scatter plots of the same random points: the original random points, and the “clusters” found in them by DBSCAN, K-means, and complete link.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Determining the ‘correct’ number of clusters.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
- Use only the data
4. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
5. Comparing the results of two different sets of cluster analyses to determine which is better.
Measures of Cluster Validity
• External Index: Used to measure the extent to which
cluster labels match externally supplied class labels.
• Entropy
• Internal Index: Used to measure the goodness of a
clustering structure without respect to external
information.
• Sum of Squared Error (SSE)
• Relative Index: Used to compare two different
clusterings or clusters.
• Often an external or internal index is used for this function, e.g.,
SSE or entropy
Internal Measures: Cohesion and Separation
• Cluster Cohesion: Measures how closely related the objects in a cluster are
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
Internal Measures: Cohesion and Separation
• A proximity-graph-based approach can also be used for cohesion and separation.
• Cluster cohesion is the sum of the weights of all links within a cluster.
• Cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
[Figure: within-cluster links illustrate cohesion; links crossing the cluster boundary illustrate separation.]
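A minimal sketch of these two graph-based quantities, assuming the proximity graph is given as a symmetric similarity matrix S and `labels` assigns each point to a cluster; the function name is illustrative.

```python
import numpy as np

def cohesion_separation(S, labels, cluster):
    """Graph-based cohesion and separation of one cluster."""
    mask = labels == cluster
    inside = np.outer(mask, mask)
    np.fill_diagonal(inside, False)          # ignore self-similarity links
    # Cohesion: total weight of links within the cluster (each pair once).
    cohesion = S[inside].sum() / 2
    # Separation: total weight of links crossing the cluster boundary.
    separation = S[np.outer(mask, ~mask)].sum()
    return cohesion, separation
```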
Measuring Cluster Validity Via Correlation
• Two matrices
• Proximity matrix
• “Incidence” matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belongs to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices (actual and ideal similarity matrices)
• High correlation indicates that points that belong to the same cluster are close to each other.
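A sketch of this check, assuming `D` is the proximity matrix given as distances and `labels` is the clustering; since both matrices are symmetric, only the upper-triangular entries are correlated. The function name is illustrative.

```python
import numpy as np

def incidence_proximity_corr(D, labels):
    labels = np.asarray(labels)
    # Ideal similarity: 1 if a pair shares a cluster, 0 otherwise.
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(labels), k=1)   # each pair once, no diagonal
    # With distances as proximity, a good clustering yields a strongly
    # negative correlation (same-cluster pairs have small distances),
    # matching the sign of the Corr values on the next slide.
    return np.corrcoef(incidence[iu], D[iu])[0, 1]
```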
Measuring Cluster Validity Via Correlation
• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.
[Figure: two scatter plots with their K-means clusterings; Corr = -0.9235 for the left data set and Corr = -0.5810 for the right one.]
Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster
labels and inspect visually.
[Figure: a well-separated data set (left) and its similarity matrix with points ordered by cluster label (right); sharp blocks along the diagonal correspond to the clusters.]
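A sketch of the visual check itself: convert distances to similarities, reorder rows and columns by cluster label, and display the matrix; blocks along the diagonal indicate well-formed clusters. The distance-to-similarity conversion used here is one simple choice among several.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ordered_similarity(X, labels):
    order = np.argsort(labels)                   # group points by cluster
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    S = 1 - D / D.max()                          # simple distance-to-similarity map
    plt.imshow(S[np.ix_(order, order)], cmap="viridis")
    plt.colorbar(label="similarity")
    plt.xlabel("points")
    plt.ylabel("points")
    plt.show()
```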
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: random points clustered by DBSCAN (left) and the similarity matrix ordered by those cluster labels (right).]
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: random points clustered by K-means (left) and the similarity matrix ordered by those cluster labels (right).]
Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: random points clustered by complete link (left) and the similarity matrix ordered by those cluster labels (right).]
Using Similarity Matrix for Cluster Validation
[Figure: similarity matrix, ordered by cluster label, for a DBSCAN clustering of a more complicated data set with seven clusters.]
Internal Measures: SSE
• Internal Index: Used to measure the goodness of a clustering structure without respect to external information
• SSE
• SSE is good for comparing two clusterings or two clusters (average SSE).
• Can also be used to estimate the number of clusters (see the sketch after the figure below)
[Figure: a sample data set (left) and its SSE curve as K varies from 2 to 30 (right).]
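A sketch of estimating the number of clusters from the SSE curve, using scikit-learn's KMeans, whose `inertia_` attribute is the within-cluster SSE; the data is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((500, 2))                     # replace with the data of interest

ks = range(2, 31)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()                                   # look for knees/elbows in the curve
```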
Internal Measures: SSE
• SSE curve for a more complicated data set
[Figure: a more complicated data set with seven labeled clusters, and the SSE of the clusters found using K-means.]
Framework for Cluster Validity
• Need a framework to interpret any measure.
• For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity
• The more “atypical” a clustering result is, the more likely it represents valid structure in the data
• Can compare the values of an index that result from random data or clusterings to those of a clustering result.
• There is the question of whether the difference between two index values is significant
Statistical Framework for SSE
• Example
• Compare an SSE of 0.005 against the SSE of three clusters in random data
• Histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2 – 0.8 for x and y values
[Figure: the clustered data set (left) and the histogram of SSE values from random data (right); the random SSEs range from roughly 0.016 to 0.034, well above the observed 0.005.]
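A sketch of this Monte Carlo comparison, following the slide's setup (500 random data sets of 100 points uniform over 0.2 – 0.8 in each coordinate, clustered with K = 3); the empirical p-value at the end is the assumed way of reading the histogram, not something stated on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sse_null = []
for _ in range(500):
    # 100 points uniform over 0.2-0.8 in x and y, as in the slide.
    X = rng.uniform(0.2, 0.8, size=(100, 2))
    sse_null.append(KMeans(n_clusters=3, n_init=10).fit(X).inertia_)

observed_sse = 0.005                          # SSE of the actual clustering
# Fraction of random data sets with an SSE as small as the observed one.
p_value = np.mean(np.array(sse_null) <= observed_sse)
print(f"observed SSE = {observed_sse}, empirical p-value = {p_value}")
```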
Thank you!