Data Mining
Summer school
CLUSTER ANALYSIS
Assoc. Prof. NGUYEN Tri Thanh
University of Engineering and Technology
August 11, 2016
Outline
1.  Introduction
2.  Clustering applications
3.  A Categorization of Major Clustering Methods
4.  Data representation
5.  Flat clustering
6.  Hierarchical clustering
7.  Cluster evaluation
8.  Summary
9.  Discussion
Introduction
- Cluster analysis (clustering)
  - Given a set of objects, group similar objects into clusters according to their characteristics
- Cluster: a collection of data objects
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- How do we measure similarity?
- How do we determine the number of clusters?
Introduction
- Clustering comes naturally to humans
  - Grouping animals and plants
  - Grouping students
  - Grouping customers
  - Facebook groups
  - Interest groups
  - ...
Clustering vs Classification
- Clustering
  - No predefined classes (unknown number of clusters)
  - Unlabeled data objects
  - Unsupervised learning
- Classification
  - Predefined classes
  - Labeled data objects
  - Predict/identify the class of an unlabeled object
  - Supervised learning
Clustering applications
- Image Processing and Pattern Recognition
- Spatial Data Analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters or support other spatial mining tasks
- Economic Science (especially market research)
- WWW
  - Document classification
  - Cluster weblog data to discover groups of similar access patterns
Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Categorization of Clustering Approaches
- Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
- Model-based:
  - A model is hypothesized for each cluster, and the goal is to find the best fit of the data to that model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based:
  - Based on the analysis of frequent patterns
  - Typical methods: p-Cluster
- User-guided or constraint-based:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering:
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus
Flat clustering
Data object representation
- A set/vector of features/attributes
  - Person: name, age, sex, job, ...
  - Text: set of distinct words
- Distance between two objects:

  dis(d_1, d_2) = \sqrt{\sum_{i=1}^{n} (d_{1i} - d_{2i})^2}
- Similarity
  - Distance: the smaller, the more similar
  - Similarity: the larger, the more similar
- Cosine similarity:

  sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} = \frac{\sum_{i=1}^{n} d_{1i} d_{2i}}{\sqrt{\sum_{i=1}^{n} d_{1i}^2} \sqrt{\sum_{i=1}^{n} d_{2i}^2}}
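
A minimal sketch of these two measures in Python (function names and the toy vectors are illustrative, not from the slides):

  import math

  def euclidean_distance(d1, d2):
      # dis(d1, d2) = sqrt(sum_i (d1[i] - d2[i])^2); smaller means more similar
      return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

  def cosine_similarity(d1, d2):
      # sim(d1, d2) = (d1 . d2) / (|d1| |d2|); larger means more similar
      dot = sum(a * b for a, b in zip(d1, d2))
      norm1 = math.sqrt(sum(a * a for a in d1))
      norm2 = math.sqrt(sum(b * b for b in d2))
      return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

  print(euclidean_distance([1, 0, 2], [0, 1, 2]))  # 1.414...
  print(cosine_similarity([1, 0, 2], [0, 1, 2]))   # 0.8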
K-means Clustering
- The number of clusters (k) is known in advance
- Clusters are represented by the centroid of the documents that belong to them:

  c = \frac{1}{|S|} \sum_{d \in S} d
- Cluster membership is determined by the most similar cluster centroid

1. Select k documents from S to be used as cluster centroids. This is usually done at random.
2. Assign documents to clusters according to their similarity to the cluster centroids, i.e., for each document find the most similar centroid and assign the document to the corresponding cluster.
3. For each cluster, recompute the cluster centroid using the newly assigned cluster members.
4. Go to Step 2 until the process converges, i.e., the same documents are assigned to each cluster in two consecutive iterations or the cluster centroids remain the same.
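
A compact sketch of these four steps, assuming documents are numeric vectors and using cosine similarity for the assignment step (the helper names are illustrative):

  import math
  import random

  def cosine_similarity(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      na = math.sqrt(sum(x * x for x in a))
      nb = math.sqrt(sum(y * y for y in b))
      return dot / (na * nb) if na and nb else 0.0

  def centroid(docs):
      # c = (1/|S|) * sum of the member vectors
      return [sum(col) / len(docs) for col in zip(*docs)]

  def kmeans(docs, k, max_iter=100):
      centroids = random.sample(docs, k)            # Step 1: random initial centroids
      assignment = None
      for _ in range(max_iter):
          # Step 2: assign each document to the most similar centroid
          new_assignment = [max(range(k), key=lambda j: cosine_similarity(d, centroids[j]))
                            for d in docs]
          if new_assignment == assignment:          # Step 4: convergence check
              break
          assignment = new_assignment
          # Step 3: recompute each centroid from its new members
          for j in range(k):
              members = [d for d, a in zip(docs, assignment) if a == j]
              if members:
                  centroids[j] = centroid(members)
      return assignment, centroids

  docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
  print(kmeans(docs, k=2)[0])  # e.g. [0, 0, 1, 1]; cluster ids depend on the random start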
The K-Means Clustering Method
- Example (K = 2)

[Figure: scatter plots of the k-means iterations. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means again until the assignments no longer change.]
K-means Clustering Discussion
- In Step 2, documents are moved between clusters in order to maximize the intra-cluster similarity
- The clustering maximizes the criterion function (a measure for evaluating clustering quality)
- In distance-based k-means clustering, the criterion function is the sum of squared errors (based on Euclidean distance and means)
- For k-means clustering of documents, a criterion function based on centroids and similarity is used:

  J = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j)
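
A small sketch of this criterion function, assuming cosine similarity and that clusters and their centroids are given as plain lists (names and numbers are illustrative):

  import math

  def cosine_similarity(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      na = math.sqrt(sum(x * x for x in a))
      nb = math.sqrt(sum(y * y for y in b))
      return dot / (na * nb) if na and nb else 0.0

  def criterion_J(clusters, centroids):
      # J = sum over clusters i of sum over documents d_j in D_i of sim(c_i, d_j)
      return sum(cosine_similarity(c, d)
                 for c, cluster in zip(centroids, clusters)
                 for d in cluster)

  clusters = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
  centroids = [[0.95, 0.05], [0.05, 0.95]]
  print(criterion_J(clusters, centroids))  # close to 4 for this well-separated example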
K-means Clustering Discussion (cont’d)
- Clustering that maximizes this function is called minimum variance clustering
- The k-means algorithm produces minimum variance clustering, but does not guarantee that it always finds the global maximum of the criterion function
- After each iteration the value of J increases, but it may converge to a local maximum
- The result greatly depends on the initial choice of cluster centroids
Sample data
K-means Clustering Example (result)
Clustering of CCSU Departments data with 6 TFIDF attributes (k = 2)
Hierarchical clustering
Hierarchical Partitioning
- Produces a nested sequence of partitions of the set of data points
- Can be displayed as a tree (called a dendrogram) with a single cluster containing all points at the root and singleton clusters (individual points) at the leaves
- Example: hierarchical partitioning of the set of numbers {1, 2, 4, 5, 8, 10}
- The similarity measure used in this example is computed as (10 - d)/10, where d is the distance between data points or cluster centers
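
A quick sketch of how such a hierarchy could be computed for this example with SciPy (the choice of average linkage is an assumption; any linkage method yields a nested sequence of partitions):

  import numpy as np
  from scipy.cluster.hierarchy import linkage, dendrogram

  points = np.array([[1], [2], [4], [5], [8], [10]])  # the example set {1, 2, 4, 5, 8, 10}

  # Each row of Z records one merge: the two cluster ids, their distance, and the new size
  Z = linkage(points, method='average')
  print(Z)

  # Uncomment to draw the dendrogram (requires matplotlib):
  # import matplotlib.pyplot as plt
  # dendrogram(Z, labels=[1, 2, 4, 5, 8, 10])
  # plt.show()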
Approaches to Hierarchical Partitioning
- Agglomerative
  - Starts with the individual data points and at each step merges the two closest (most similar) points (or clusters at later steps) until a single cluster remains
- Divisive
  - Starts with the original set of points and at each step splits a cluster until only individual points remain
  - To implement this approach we need to decide which cluster to split and how to perform the split
Approaches to Hierarchical Partitioning (cont’d)
- The agglomerative approach is more popular, as it needs only the definition of a distance or similarity function on clusters/points
Approaches to Hierarchical Partitioning (cont’d)
- For data points in Euclidean space, the Euclidean distance is the best choice
- For documents represented as TF-IDF vectors, the preferred measure is the cosine similarity, defined as follows:

  sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} = \frac{\sum_{i=1}^{n} d_{1i} d_{2i}}{\sqrt{\sum_{i=1}^{n} d_{1i}^2} \sqrt{\sum_{i=1}^{n} d_{2i}^2}}
Agglomerative Hierarchical Clustering
- There are several versions of this approach, depending on how the similarity between clusters sim(S_1, S_2) is defined (S_1, S_2 are sets of documents)
  - Similarity between cluster centroids, i.e. sim(S_1, S_2) = sim(c_1, c_2), where the centroid c of cluster S is

    c = \frac{1}{|S|} \sum_{d \in S} d

  - Maximum similarity between documents from each cluster (nearest-neighbor clustering):

    sim(S_1, S_2) = \max_{d_1 \in S_1, d_2 \in S_2} sim(d_1, d_2)
Agglomerative Hierarchical Clustering (cont’d)
  - Minimum similarity between documents from each cluster (farthest-neighbor clustering):

    sim(S_1, S_2) = \min_{d_1 \in S_1, d_2 \in S_2} sim(d_1, d_2)

  - Average similarity between documents from each cluster:

    sim(S_1, S_2) = \frac{1}{|S_1| \, |S_2|} \sum_{d_1 \in S_1, d_2 \in S_2} sim(d_1, d_2)
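
A brief sketch of these cluster-similarity definitions in Python, built on cosine similarity between document vectors (function names are illustrative):

  import math
  from itertools import product

  def sim(d1, d2):
      # cosine similarity between two document vectors
      dot = sum(a * b for a, b in zip(d1, d2))
      n1 = math.sqrt(sum(a * a for a in d1))
      n2 = math.sqrt(sum(b * b for b in d2))
      return dot / (n1 * n2) if n1 and n2 else 0.0

  def centroid(S):
      return [sum(col) / len(S) for col in zip(*S)]

  def centroid_sim(S1, S2):
      return sim(centroid(S1), centroid(S2))

  def max_sim(S1, S2):                 # nearest-neighbor (single-link style)
      return max(sim(a, b) for a, b in product(S1, S2))

  def min_sim(S1, S2):                 # farthest-neighbor (complete-link style)
      return min(sim(a, b) for a, b in product(S1, S2))

  def avg_sim(S1, S2):                 # average pairwise similarity
      return sum(sim(a, b) for a, b in product(S1, S2)) / (len(S1) * len(S2))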
Agglomerative Clustering Algorithm
- S is the initial set of documents and G is the clustering tree
- k and q are control parameters that stop the merging
  - when a desired number of clusters (k) is reached
  - or when the similarity between the clusters to be merged becomes less than a specified threshold (q)
Agglomerative Clustering Algorithm (cont’d)
1. G ← {{d} | d ∈ S} (initialize G with singleton clusters, each one containing a document from S)
2. If |G| ≤ k then exit (stop if the desired number of clusters is reached)
3. Find S_i, S_j ∈ G such that (i, j) = arg max_(i,j) sim(S_i, S_j) (find the two closest clusters)
4. If sim(S_i, S_j) < q then exit (stop if the similarity of the closest clusters is less than q)
5. Remove S_i and S_j from G
6. G = G ∪ {S_i ∪ S_j} (merge S_i and S_j, and add the new cluster to the hierarchy)
7. Go to 2

For n documents, both the time and space complexity of the algorithm are O(n^2).
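
A minimal sketch of this loop, assuming a pair_sim function such as one of the cluster-similarity definitions above; it merges until k clusters remain or the best similarity drops below q:

  from itertools import combinations

  def agglomerative(docs, pair_sim, k=1, q=0.0):
      # start from singleton clusters; record merges to build the hierarchy
      G = [[d] for d in docs]
      hierarchy = []
      while len(G) > k:
          # find the two most similar clusters
          (i, j), best = max(
              ((pair, pair_sim(G[pair[0]], G[pair[1]]))
               for pair in combinations(range(len(G)), 2)),
              key=lambda item: item[1])
          if best < q:
              break                      # closest clusters are not similar enough
          merged = G[i] + G[j]
          hierarchy.append((G[i], G[j], best))
          G = [c for idx, c in enumerate(G) if idx not in (i, j)] + [merged]
      return G, hierarchy

With avg_sim from the previous sketch, agglomerative(docs, avg_sim, k=2) would return two clusters together with the list of merges that form the dendrogram.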
Agglomerative Clustering Example 1
DIANA clustering algorithm
0. G ← {{d_1, d_2, ..., d_n}} (initialize G with a single cluster containing all documents); let g be this cluster
1. In cluster g, find the document d that is the most distinct from the others; S ← {d}
2.1. For all d_i ∉ S, calculate l_i = avg_{d_j ∉ S}(|d_i − d_j|) − avg_{d_j ∈ S}(|d_i − d_j|)
2.2. Find the document d_h with the maximum l_h; if l_h > 0, S ← S ∪ {d_h}
3. Repeat Step 2 until no l_h > 0 remains; update G by replacing g with the two clusters g \ S and S
4. Select the cluster g with the largest diameter, diameter(g) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} (d_i − d_j)^2}{m(m−1)}, and go to Step 1
5. Stop when all clusters contain a single document
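
A rough sketch of the splitting step (Steps 1-3) on numeric values, assuming dissimilarity is the absolute difference; a full DIANA run would repeat this on the cluster with the largest diameter:

  def split_cluster(cluster):
      """One DIANA split: peel off a splinter group S from the given cluster."""
      def avg_dist(d, others):
          return sum(abs(d - o) for o in others) / len(others) if others else 0.0

      remaining = list(cluster)
      # Step 1: start S with the document most distinct from the others
      seed = max(remaining, key=lambda d: avg_dist(d, [o for o in remaining if o != d]))
      S = [seed]
      remaining.remove(seed)

      # Steps 2-3: keep moving documents that are closer to S than to the rest
      while remaining:
          gains = {d: avg_dist(d, [o for o in remaining if o != d]) - avg_dist(d, S)
                   for d in remaining}
          best = max(gains, key=gains.get)
          if gains[best] <= 0:
              break
          S.append(best)
          remaining.remove(best)
      return S, remaining

  print(split_cluster([1, 2, 4, 5, 8, 10]))  # ([10, 8], [1, 2, 4, 5])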
Clustering evaluation
Web Content Mining
- Evaluating Clustering
  - Similarity-Based Criterion Functions
  - Probabilistic Criterion Functions
  - MDL-Based Model and Feature Evaluation
  - Classes to Clusters Evaluation
  - Precision, Recall and F-measure
  - Entropy
Similarity-Based Criterion Functions (distance)
- Basic idea: the cluster center c_i (centroid or mean in the case of numeric data) best represents cluster D_i if it minimizes the sum of the lengths of the "error" vectors x − c_i for all x ∈ D_i:

  c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x
  \qquad
  J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - c_i\|^2

- Alternative formulation based on the pairwise distance between cluster members:

  J_e = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{x_j, x_l \in D_i} \|x_j - x_l\|^2
Similarity-Based Criterion Functions (cosine similarity)
- For document clustering, the (centroid) cosine similarity is used:

  J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j), \quad
  sim(c_i, d_j) = \frac{c_i \cdot d_j}{\|c_i\| \, \|d_j\|}, \quad
  c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j

- Equivalent form based on pairwise similarity:

  J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_k \in D_i} sim(d_j, d_k)

- Another formulation based on intracluster similarity (used to control the merging of clusters in hierarchical agglomerative clustering), where sim(D_i) is the average pairwise similarity within cluster D_i:

  J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} sim(d_j, d_l) = \frac{1}{2} \sum_{i=1}^{k} |D_i| \, sim(D_i)
Classes to Clusters Evaluation
- Assume that the classification of the documents in a sample is known, i.e. each document has a class label
- Cluster the sample without using the class labels
- Assign to each cluster the class label of the majority of documents in it
- Compute the error as the proportion of documents with different class and cluster labels
- Or compute the accuracy as the proportion of documents with the same class and cluster label
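
A short sketch of this evaluation, assuming clusters is a list of lists of true class labels (one inner list per cluster; the example labels are illustrative):

  from collections import Counter

  def classes_to_clusters_accuracy(clusters):
      # label each cluster with its majority class, then count matching documents
      correct = sum(Counter(labels).most_common(1)[0][1] for labels in clusters)
      total = sum(len(labels) for labels in clusters)
      return correct / total

  clusters = [["A", "A", "A", "B"], ["B", "B", "A"]]
  print(classes_to_clusters_accuracy(clusters))  # 5/7 ≈ 0.714; error = 1 - accuracy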
Classes to Clusters Evaluation (Example)
Confusion matrix (contingency table)
TP (True Positive), FN (False Negative), FP (False Positive), TN (True Negative)

  Accuracy = \frac{TP + TN}{TP + FP + TN + FN}
  \qquad
  Error = \frac{FP + FN}{TP + FP + TN + FN}
Precision and Recall
  Precision = \frac{TP}{TP + FP}
  \qquad
  Recall = \frac{TP}{TP + FN}
F-Measure
- Generalized confusion matrix for m classes and k clusters, where n_{ij} is the number of documents of class i in cluster j
- Combining precision and recall:

  P(i, j) = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}
  \qquad
  R(i, j) = \frac{n_{ij}}{\sum_{j=1}^{k} n_{ij}}
  \qquad
  F(i, j) = \frac{2 P(i, j) R(i, j)}{P(i, j) + R(i, j)}

- Evaluating the whole clustering:

  F = \sum_{i=1}^{m} \frac{n_i}{n} \max_{j=1,...,k} F(i, j),
  \quad
  n_i = \sum_{j=1}^{k} n_{ij},
  \quad
  n = \sum_{i=1}^{m} \sum_{j=1}^{k} n_{ij} \; \text{(total number of documents)}
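
A minimal sketch of the overall F computation from such a contingency table (rows are classes, columns are clusters; the example counts are illustrative):

  def clustering_f_measure(n):
      # n[i][j] = number of documents of class i in cluster j
      m, k = len(n), len(n[0])
      cluster_sizes = [sum(n[i][j] for i in range(m)) for j in range(k)]
      class_sizes = [sum(n[i][j] for j in range(k)) for i in range(m)]
      total = sum(class_sizes)

      def f(i, j):
          if n[i][j] == 0:
              return 0.0
          p = n[i][j] / cluster_sizes[j]     # precision P(i, j)
          r = n[i][j] / class_sizes[i]       # recall R(i, j)
          return 2 * p * r / (p + r)

      # F = sum over classes of (n_i / n) * max_j F(i, j)
      return sum(class_sizes[i] / total * max(f(i, j) for j in range(k)) for i in range(m))

  n = [[8, 3],   # class 1: 8 documents in cluster 1, 3 in cluster 2
       [3, 6]]   # class 2: 3 documents in cluster 1, 6 in cluster 2
  print(clustering_f_measure(n))  # 0.7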
F-Measure (Example)
Entropy
- Consider the class label as a random event and evaluate its probability distribution in each cluster
- The probability of class i in cluster j is estimated by the proportion of occurrences of class label i in cluster j:

  p_{ij} = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}

- The entropy is a measure of "impurity" and accounts for the average information in an arbitrary message about the class label:

  H_j = -\sum_{i=1}^{m} p_{ij} \log p_{ij}
Entropy (cont’d)
- To evaluate the whole clustering, we sum up the entropies of the individual clusters, weighted by the proportion of documents in each:

  H = \sum_{j=1}^{k} \frac{n_j}{n} H_j
Entropy (Examples)
- A "pure" cluster, where all documents have a single class label, has an entropy of 0
- The highest entropy is achieved when all class labels have the same probability
- For example, for a two-class problem the 50-50 situation has the highest entropy: (-0.5 log 0.5 - 0.5 log 0.5) = 1
Entropy (Examples) (cont’d)
- Compare the entropies of the previously discussed clusterings for the attributes offers and students:

  H(offers) = \frac{11}{20}\left(-\frac{8}{11}\log\frac{8}{11} - \frac{3}{11}\log\frac{3}{11}\right) + \frac{9}{20}\left(-\frac{3}{9}\log\frac{3}{9} - \frac{6}{9}\log\frac{6}{9}\right) = 0.878176

  H(students) = \frac{15}{20}\left(-\frac{10}{15}\log\frac{10}{15} - \frac{5}{15}\log\frac{5}{15}\right) + \frac{5}{20}\left(-\frac{1}{5}\log\frac{1}{5} - \frac{4}{5}\log\frac{4}{5}\right) = 0.869204
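
A small sketch that reproduces these two values, assuming base-2 logarithms and representing each clustering by its per-cluster class counts:

  import math

  def weighted_entropy(clusters):
      # clusters: list of per-cluster class counts, e.g. [[8, 3], [3, 6]]
      total = sum(sum(c) for c in clusters)
      H = 0.0
      for counts in clusters:
          n_j = sum(counts)
          H_j = -sum((c / n_j) * math.log2(c / n_j) for c in counts if c > 0)
          H += (n_j / total) * H_j
      return H

  print(round(weighted_entropy([[8, 3], [3, 6]]), 6))   # offers   -> 0.878176
  print(round(weighted_entropy([[10, 5], [1, 4]]), 6))  # students -> 0.869204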
Summary
- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
- There are still many open research issues in cluster analysis
References
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006
- Z. Markov and D. T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, John Wiley & Sons, 2007
- Nguyễn Hà Nam, Nguyễn Trí Thành, Hà Quang Thụy, Giáo trình khai phá dữ liệu (Data Mining Textbook), NXB ĐHQGHN, 2014
Discussion