Mining Projected Clusters in High Dimensional Spaces
Mohamed Bouguessa
Data Mining
Data mining is the process of extracting potentially
useful information from a dataset.
Clustering is one of the central data mining problems
Clustering?
Clustering techniques aim to partition a given set of data or objects into groups such that elements drawn from the same group are highly similar, while those assigned to different groups are dissimilar.
Similarity measure
Clustering algorithms employ a distance metric in order to partition the database
Euclidean distance
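For reference, the Euclidean distance between two d-dimensional points xi and xj is the standard

$$\mathrm{dist}(x_i, x_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2}$$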
High dimensional data
!   Problem with high dimensional data
Similarity functions require similar objects to have close values in all dimensions.
The concept of similarity between objects in the full-dimensional space is often invalid.
Clustering in high dimensional space
!   Illustration
Person | Age | Virus Level | Blood Type | Disease A
1      | 35  | 0.95        | AB         | Uninfected
2      | 64  | 0.9         | AB         | Uninfected
3      | 27  | 1.0         | AB         | Uninfected
4      | 18  | 9.8         | O          | Infected
5      | 42  | 8.6         | AB         | Infected
6      | 53  | 11.3        | B          | Infected
7      | 37  | 0.75        | O          | Recovered
8      | 28  | 0.8         | A          | Recovered
9      | 65  | 0.89        | B          | Recovered
Clustering in high dimensional space
•  Problem with high dimensional data
Ø  Presence of irrelevant dimension (attributes/features)
Ø  Clusters may exist in different subspaces
(not in the full-dimensional space)
8
Clustering in high dimensional space
!   Illustration
!   Data set with 4 clusters.
Ø  cluster 1 et cluster 2 exist in dimension a & b.
Ø  cluster 3 et cluster 4 exist in dimension b & c.
9
Clustering in high dimensional space
[Figure: sample data plotted in one dimension and in two dimensions, showing the projected clusters]
Projected clustering
!   Definition
A projected cluster is a subset S of data points, together with a subspace of dimensions D, such that the points in S are closely clustered in D.
Projected clustering
!   Example
[Figure: data matrix with attributes A1–A10 and points x1, …, xa, …, xb, …, xc, …, xN grouped into Cluster1–Cluster4]
Ø  The third cluster: (S3, D3) = ({xb, …, xc}, {A2, A4, A10}).
Previous approaches
!   Major limitations of previous approaches
(each approach has one or more of the following)
!   Produce clusters all of the same dimensionality.
!   Unable to determine the dimensionality of each cluster automatically.
!   Unable to identify clusters with low dimensionality (clusters with a low percentage of relevant dimensions, e.g. only 5% of the input dimensions (features) are relevant).
Our approach: PCKA
!   PCKA: Projected Clustering based on the K-means Algorithm
DB: data set of d-dimensional points
A = {A1, A2, …, Ad}: set of attributes
X = {x1, x2, …, xN}: set of N data points, where xi = (xi1, …, xij, …, xid)
xij: 1-d point
nc: given number of clusters
Cs: a projected cluster containing Ns data points, defined in a ds-dimensional subspace formed by the set As of its relevant dimensions. The remaining set A − As represents the irrelevant dimensions of Cs.
PCKA
In order to discover clusters in different subspaces, PCKA
proceeds in three phases:
1.  Attribute relevance analysis.
2.  Outlier handling.
3.  Discovery of projected clusters.
Phase 1: Attribute relevance analysis
!   Identify relevant dimensions which exhibit some cluster structure
!   By cluster structure we mean a region that has a higher density of points than its surrounding regions
Detect dense regions and their location in each dimension
Attribute relevance analysis
!   In order to detect densely populated regions in each attribute we
compute a sparseness degree yij for each 1-d point xij by measuring
the variance of its k nearest (1-d point) neighbors
Intuitively, a large value of yij means that xij belongs to a sparse
region, while a small one indicates that xij belongs to a dense
region.
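A minimal sketch of this step, assuming the sparseness degree of xij is the variance of the set formed by xij together with its k nearest neighbours in dimension j (the exact neighbourhood convention used by PCKA may differ):

```python
import numpy as np

def sparseness_degrees(X, k):
    """Sparseness degree y_ij for every 1-d point x_ij: the variance of x_ij
    together with its k nearest 1-d neighbours in the same dimension.
    Large y_ij -> sparse region, small y_ij -> dense region."""
    N, d = X.shape
    Y = np.empty((N, d))
    for j in range(d):                                # one dimension at a time
        col = X[:, j]
        for i in range(N):
            order = np.argsort(np.abs(col - col[i]))  # 1-d distances to x_ij
            Y[i, j] = np.var(col[order[:k + 1]])      # x_ij plus its k nearest neighbours
    return Y
```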
Attributes Relevance Analysis
In order to identify dense regions in each dimension we are interested in all sets of xij having a small sparseness degree.
Model the sparseness degree in each dimension as a mixture distribution.
Attributes Relevance Analysis
!   PDF estimation
!   Identify the statistical properties of the sparseness degree
[Figure: data matrix with attributes A1–A10 and points x1, …, xN grouped into Cluster1–Cluster4]
Histograms of the sparseness degree of: (a) A1, (b) A2, (c) A3 and (d) A4
Attributes Relevance Analysis
!   PDF estimation
The histograms suggest the existence of components with different shapes and/or heavy tails,
which inspires us to use the gamma mixture model.
Formally, we expect that the sparseness degree follows a mixture density of the form sketched below.
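The density formula itself did not survive the transcript; a standard m-component gamma mixture, which is presumably what is meant, has the form

$$p(y_{ij}) = \sum_{l=1}^{m} \pi_l \, G(y_{ij} \mid \alpha_l, \beta_l), \qquad G(y \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} e^{-\beta y},$$

where the mixing weights $\pi_l$ are non-negative and sum to 1, and $\alpha_l$, $\beta_l$ are the shape and rate parameters of component Gl.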
Attributes Relevance Analysis
!   How to estimate the parameters of the gamma components Gl?
Maximum likelihood technique
!   How to estimate the number of components?
Bayesian Information Criterion (BIC)
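Both tools are standard. For a candidate mixture with m components, $v_m$ free parameters, and maximized likelihood $\hat{L}_m$ fitted to n sparseness degrees, the criterion is

$$\mathrm{BIC}(m) = -2 \ln \hat{L}_m + v_m \ln n,$$

and the number of components is taken as the value of m that minimizes BIC(m).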
Attributes Relevance Analysis
!   PDF estimation
[Figure: data matrix with attributes A1–A10 and points x1, …, xN grouped into Cluster1–Cluster4]
Attributes Relevance Analysis
!   Dense region detection
Observation
The locations of components that represent dense regions
are close to zero in comparison to those that represent
sparse regions.
Examine the location of each component.
Find a typical value that best describes the sparseness degrees that belong to each component.
Attributes Relevance Analysis
!   Dense region detection
Let LOC = {loc1, …, locj, …, locm_total}, where locj denotes the location of component j and m_total is the total number of components.
In order to determine the location of each component, we calculate its median.
Attributes Relevance Analysis
!   Dense region detection
!   A large value of locj means that component j corresponds to a sparse region.
!   A small value of locj means that component j corresponds to a dense region.
Our objective
!   Divide LOC into two groups E and F, where E is the group of the highest values of locj and F is the group of the lowest values of locj
MDL principle
MDL: Minimum Description Length
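The exact MDL coding used by PCKA is not reproduced in the slides. As a stand-in that serves the same purpose, the sketch below splits the sorted locj values at the boundary that minimizes the total within-group variance, separating the high-location (sparse) components E from the low-location (dense) components F:

```python
import numpy as np

def split_locations(loc):
    """Split component locations LOC into E (highest values, sparse regions)
    and F (lowest values, dense regions). A simple variance-based stand-in
    for PCKA's MDL criterion, not the original formulation.
    Returns (E, F) as sets of component indices."""
    loc = np.asarray(loc, dtype=float)
    order = np.argsort(loc)                      # component indices, low to high
    vals = loc[order]
    best_cut, best_cost = 1, np.inf
    for cut in range(1, len(vals)):              # try every split point
        f, e = vals[:cut], vals[cut:]
        cost = len(f) * np.var(f) + len(e) * np.var(e)
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    return set(order[best_cut:]), set(order[:best_cut])
```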
Attributes Relevance Analysis
!   Illustration
[Plot: locj values by rank number (1 to 17); the set E contains the highest locj values and the set F the lowest]
Attributes Relevance Analysis
!   Dense region detection
We obtain a binary matrix Z(N*d)
[Figure: data matrix with attributes A1–A10 and points x1, …, xN grouped into Cluster1–Cluster4]
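A minimal sketch of how Z could be assembled from the phase-1 results, assuming zij = 1 exactly when the sparseness degree yij is assigned to a mixture component whose location falls in the dense set F (the exact assignment rule is an assumption):

```python
import numpy as np

def binary_matrix(component_of, dense_components):
    """Build the N x d binary matrix Z: z_ij = 1 when the sparseness degree
    y_ij was assigned to a dense component (set F), 0 otherwise.
    `component_of` is a hypothetical N x d array giving, for each y_ij, the
    global index (as in LOC) of the gamma component it was assigned to."""
    return np.isin(component_of, list(dense_components)).astype(int)
```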
Phase 2: Outlier handling
!   Outliers can be defined as a set of data points that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.
!   Our outlier handling mechanism makes efficient use of the properties of the binary matrix Z.
!   In our case, outliers do not belong to any of the identified dense regions in the matrix Z; most of them are located in sparse regions.
Outlier handling
!   We use a binary similarity coefficient (the Jaccard coefficient) in order to measure the similarity between binary data points zi in the matrix Z.
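For two binary rows zi and zj of Z, the Jaccard coefficient is the number of dimensions where both are 1 divided by the number of dimensions where at least one is 1; a minimal sketch:

```python
import numpy as np

def jaccard(zi, zj):
    """Jaccard coefficient between two binary vectors: |zi AND zj| / |zi OR zj|.
    Defined as 0 when both vectors are all zeros."""
    zi, zj = np.asarray(zi), np.asarray(zj)
    both = np.sum((zi == 1) & (zj == 1))
    either = np.sum((zi == 1) | (zj == 1))
    return both / either if either > 0 else 0.0
```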
Outlier handling
Let JC(zi, zj) be the Jaccard coefficient between two binary points.
A pair of points is considered similar if the estimated Jaccard coefficient between them exceeds a certain threshold ε.
Our outlier handling mechanism is based on the following definition.
!   The above definition exploits the fact that points which belong to dense regions (clusters) have, in general, a large number of similar points, in contrast to outliers.
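The formal definition referred to above is not reproduced in the transcript. A hedged sketch of the rule it suggests: a point is flagged as an outlier when fewer than `min_similar` other points have a Jaccard coefficient with it above ε (the `min_similar` threshold is a hypothetical parameter, not a value from the slides):

```python
import numpy as np

def flag_outliers(Z, eps, min_similar):
    """Flag z_i as an outlier when fewer than `min_similar` other points z_j
    satisfy JC(z_i, z_j) > eps. `min_similar` is a hypothetical threshold."""
    Z = np.asarray(Z)
    n = len(Z)
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        inter = np.sum((Z == 1) & (Z[i] == 1), axis=1)   # |z_i AND z_j| for every j
        union = np.sum((Z == 1) | (Z[i] == 1), axis=1)   # |z_i OR z_j| for every j
        jc = np.where(union > 0, inter / np.maximum(union, 1), 0.0)
        jc[i] = 0.0                                      # ignore the point itself
        flags[i] = np.sum(jc > eps) < min_similar
    return flags
```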
Phase 3: Discovery of projected clusters
!   The main goal of phase 3 is to identify clusters and their
relevant dimensions
!   The clustering process is based on the K-means algorithm
!   K-means partitions a data set into a number of clusters, each of which is represented by a center.
•  Pick nc points as cluster centers.
•  Alternate:
• Assign each data instance to the closest mean
• Move each mean to the average of its assigned points
•  Stop when no point assignments change.
K-means principle
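A compact sketch of the plain k-means loop summarized above (the standard algorithm, not PCKA-specific code):

```python
import numpy as np

def kmeans(X, nc, max_iter=100, seed=0):
    """Plain k-means: pick nc data points as initial centers, then alternate
    (1) assign each point to its closest center and (2) move each center to
    the mean of its assigned points, until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), nc, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for s in range(nc):                      # recompute each center
            if np.any(labels == s):
                centers[s] = X[labels == s].mean(axis=0)
    return labels, centers
```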
Discovery of projected clusters
Problem: each cluster has its own relevant dimensions
Consequence
The use of a classical distance function to compute the similarity between two data points is not an effective approach with high dimensional data.
Why?
Because each dimension is equally weighted when computing the distance between two points.
Discovery of projected clusters
Proposed solution
To address this problem, we associate the binary weights tij in the matrix T (extracted from the matrix Z) with the Euclidean distance.
This makes the distance measure more effective because the computation of distance is restricted to subsets (i.e. projections) where the object values are dense.
Formally (see the sketch below):
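The formula did not survive the transcript; a form consistent with the description (restricting the Euclidean distance to the dimensions where the binary weight is 1) is

$$\mathrm{dist}(x_i, c_s) = \sqrt{\sum_{j=1}^{d} t_{ij}\,(x_{ij} - c_{sj})^2},$$

where $c_s$ is the center of cluster Cs and $t_{ij} \in \{0, 1\}$.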
Discovery of projected clusters
!   How to identify the relevant dimensions for each cluster?
!   The sum of the binary weights of the data points belonging to the same cluster over each dimension gives us a meaningful measure of the relevance of each dimension to the cluster.
Discovery of projected clusters
!   How to identify the relevant dimensions for each cluster?
!   We propose a relevance index Wsj for each dimension in cluster Cs.
!   The value of the index is always between 0 and 1. The index gives a large value (close to 1) when the dimension is relevant to the cluster. On the other hand, an irrelevant dimension receives a very small index value (close to 0).
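The index formula itself is not in the transcript; one plausible form (an assumption, not taken from the slides) that matches the stated behaviour and the binary weights above is the fraction of the cluster's points that are dense in dimension Aj,

$$W_{sj} = \frac{1}{N_s} \sum_{x_i \in C_s} z_{ij},$$

which lies in [0, 1], is close to 1 when Aj is relevant to Cs, and close to 0 otherwise; Aj can then be declared relevant when Wsj exceeds the threshold δ introduced on the next slide.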
Discovery of projected clusters
!   How to identify the relevant dimensions for each cluster?
δ is a user-defined parameter that controls the degree of relevancy of the dimension Aj to the cluster Cs.
Discovery of projected clusters
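The phase-3 procedure is not spelled out in the transcript; the sketch below simply combines the pieces described on the preceding slides (a k-means loop with the weighted distance, followed by the relevance index thresholded by δ). The name `projected_kmeans` and the empty-cluster handling are assumptions:

```python
import numpy as np

def projected_kmeans(X, T, nc, delta, max_iter=50, seed=0):
    """Sketch of phase 3: k-means where the distance from x_i to a center is
    computed only over the dimensions with binary weight t_ij = 1, then a
    per-cluster relevance index W_sj (fraction of dense points) thresholded
    by delta to pick each cluster's relevant dimensions."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), nc, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # weighted squared distances: sparse dimensions of x_i contribute 0
        d2 = (((X[:, None, :] - centers[None, :, :]) ** 2) * T[:, None, :]).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for s in range(nc):
            if np.any(labels == s):
                centers[s] = X[labels == s].mean(axis=0)
    # relevance index and relevant dimensions per cluster
    W = np.vstack([T[labels == s].mean(axis=0) if np.any(labels == s)
                   else np.zeros(X.shape[1]) for s in range(nc)])
    relevant_dims = [np.where(W[s] > delta)[0] for s in range(nc)]
    return labels, relevant_dims
```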
Empirical evaluation
!   Accuracy: the aim is to test whether our algorithm, in
comparison with other existing approaches, is able to correctly
identify projected clusters in complex situations.
!   Efficiency: the aim is to determine how the running time
scales with 1) the size and 2) the dimensionality of the dataset.
!   We evaluate the performance of PCKA on a number of synthetic datasets.
Empirical evaluation
!   Performance measure:
We use the Clustering Error (CE) for projected clustering
Such a metric performs comparisons in a more objective way since it takes into account the data point groups and the associated subspaces simultaneously.
The value of CE is always between 0 and 1. The more similar the original partition and the partition generated by the clustering algorithm, the smaller the CE value.
Experiments on synthetic data sets
!   Robustness to the average cluster dimensionality
We generated sixteen different datasets with:
Number of data points N = 3000
Number of dimensions d = 100
Number of clusters nc = 5
The average cluster dimensionality varies from 2% to
70% of d.
No outliers were added to the generated datasets.
Experiments on synthetic data sets
!   Robustness to the average cluster dimensionality
[Plot: CE distance vs. average cluster dimensionality (2% to 70% of d) for PCKA, SSPC(BEST), HARP, PROCLUS(BEST), and FASTDOC(BEST)]
CE distance between the output of the five algorithms and the true clustering
Experiments on synthetic data sets
!   Outlier immunity
We generated three groups of data sets.
In each group there are five datasets with:
N = 1000, d = 100, nc = 3.
In each dataset in the group, the percentage of outliers varied from 0% to 20% of N.
The average cluster dimensionality of the datasets in the first group is fixed to 2% of d; for the second and third groups the average cluster dimensionality is fixed to 15% of d and 30% of d, respectively.
Experiments on synthetic data sets
!   Outlier immunity
[Plot: CE distance vs. percentage of outliers (0% to 20%) for PCKA, SSPC(BEST), HARP, PROCLUS(BEST), and FASTDOC(BEST)]
CE distance; datasets with average cluster dimensionality = 2% of d
Experiments on synthetic data sets
!   Outlier immunity
[Plot: CE distance vs. percentage of outliers (0% to 20%) for PCKA, SSPC(BEST), HARP, PROCLUS(BEST), and FASTDOC(BEST)]
CE distance; datasets with average cluster dimensionality = 15% of d
Experiments on synthetic data sets
!   Outlier immunity
[Plot: CE distance vs. percentage of outliers (0% to 20%) for PCKA, SSPC(BEST), HARP, PROCLUS(BEST), and FASTDOC(BEST)]
CE distance; datasets with average cluster dimensionality = 30% of d
Experiments on synthetic data sets
!   Scalability with data set size
[Plot: running time (seconds) vs. dataset size (1,000 to 100,000 points, log scale)]
Scalability of PCKA w.r.t. the data set size
Experiments on synthetic data sets
!   Scalability with dimensionality of data
[Plot: running time (seconds) vs. dataset dimensionality (100 to 1,000 dimensions)]
Scalability of PCKA w.r.t. the data dimensionality
Experiments on real data
!   Wisconsin Diagnostic Breast Cancer Data (WDBC): The set contains 569 samples, each with 30 features. The samples are grouped into two clusters: 357 samples for benign and 212 for malignant.
!   Saccharomyces Cerevisiae Gene Expression Data (SCGE): This data set contains the expression levels of 205 genes under 80 experiments. The data set is presented as a matrix: each row corresponds to a gene and each column represents an experiment. The genes are grouped into four clusters.
Experiments on real data
!   Multiple Features Data (MF): The set consists of features of
handwritten numerals ("0"-"9") extracted from a collection of
Dutch utility maps. 200 patterns per cluster (for a total of
2,000 patterns) have been digitized in binary images. For our
experiments we have used five feature sets (files):
1.  mfeat-fou: 76 Fourier coefficients of the character shapes;
2.  mfeat-fac: 216 profile correlations;
3.  mfeat-kar: 64 Karhunen-Loève coefficients;
4.  mfeat-zer: 47 Zernike moments;
5.  mfeat-mor: 6 morphological features.
In summary, we have a data set with 2000 patterns, 409 features,
and 10 clusters.
Experiments on real data
We use the class labels as ground truth and measure the accuracy of clustering by matching the points in the input and output clusters.
Accuracy of clustering