Gaussian Kernel Width Exploration and Cone Cluster Labeling for Support Vector Clustering
Nov. 28, 2007
Sei-Hyung Lee
Karen Daniels
Department of Computer Science
University of Massachusetts Lowell
Outline
• Clustering Overview
• SVC Background and Related Work
• Selection of Gaussian Kernel Widths
• Cone Cluster Labeling
• Comparisons
• Contributions
• Future Work
Clustering Overview
• Clustering
– discovering natural groups in data
• Clustering problems arise in
– bioinformatics
• patterns of gene expression
– data mining/compression
– pattern recognition/classification
Definition of Clustering
• Definition by Everitt (1974)
  – “A cluster is a set of entities which are alike, and entities from different clusters are not alike.”
• If we assume that the objects to be clustered are represented as points in the measurement space, then
  – “Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.”
Sample Clustering Taxonomy (Zaiane 1999)
• Partitioning (fixed number of clusters k)
• Hierarchical
• Density-based
• Grid-based
• Model-based
  – Statistical (COBWEB)
  – Neural Network (SOM)
Hybrids are also possible.
http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/ (Chapter 8)
Strengths and Weaknesses
• Partitioning
  – Typical strength: relatively efficient, O(ikn)
  – Weaknesses: tends to split large clusters and merge small clusters; finds only spherical shapes; sensitive to outliers (k-means); requires a choice of k; sensitive to the initial selection
• Hierarchical
  – Typical strength: does not require a choice of k
  – Weaknesses: merge/split decisions can never be undone; requires a termination condition; does not scale well
• Density-based
  – Typical strength: discovers clusters of arbitrary shape
  – Weakness: sensitive to parameters
• Grid-based
  – Typical strength: fast processing time
  – Weakness: sensitive to parameters
• Model-based
  – Typical strength: exploits the underlying data distribution
  – Weaknesses: can’t find arbitrary shapes; the model assumption is not always true; expensive to update; difficult for large data sets; slow
Comparison of Clustering Techniques
(columns: Scalability / Arbitrary Shape / Handles Noise / Order Dependency / High Dimension / Time Complexity)

Partitional
  k-means:    YES / NO / NO / NO / YES / O(ikN)
  k-medoids:  YES / NO / outliers / NO / YES / O(ikN)
  CLARANS:    YES / NO / outliers / NO / NO / O(N²)
Hierarchical
  BIRCH:      YES / NO / ? / NO / NO / O(N)
  CURE:       YES / YES / YES / NO / NO / O(N² log N)
SVC:          ? / YES / YES / NO / YES / O((N − Nbsv) Nsv)
Density-based
  DBSCAN:     YES / YES / YES / NO / NO / O(N log N)
Grid-based
  STING:      YES / NO / ? / NO / NO / O(N)
Model-based
  COBWEB:     NO / ? / ? / YES / NO / ?

k = number of clusters, i = number of iterations, N = number of data points, Nsv = number of support vectors, Nbsv = number of bounded support vectors. SVC time is for a single combination of parameters.
Jain et al. Taxonomy (1999)
Cross-cutting issues:
• Agglomerative vs. Divisive
• Monothetic vs. Polythetic (sequential feature consideration)
• Hard vs. Fuzzy
• Deterministic vs. Stochastic
• Incremental vs. Non-incremental
Single-link: distance between 2 clusters = minimum of distances between all inter-cluster pairs.
Complete-link: distance between 2 clusters = maximum of distances between all inter-cluster pairs.
More Recent Clustering Surveys
• Clustering Large Datasets (Mercer 2003)
  – Hybrid methods: e.g. Distribution-Based Clustering Algorithm for Clustering Large Spatial Datasets (Xu et al. 1998)
    • Hybrid: model-based, density-based, grid-based
• Doctoral Thesis (Lee 2005)
  – Boundary-detecting methods:
    • AUTOCLUST (Estivill-Castro et al. 2000)
      – Voronoi modeling and Delaunay triangulation
    • Random Walks (Harel et al. 2001)
      – Delaunay triangulation modeling and k-nearest-neighbors
      – Random walk in a weighted graph
    • Support Vector Clustering (Ben-Hur et al. 2001)
      – One-class Support Vector Machine + cluster labeling
Overview of SVM
• Map non-linearly separable data into a feature space where they are linearly separable.
• Class of hyperplanes: f(x) = ω · Φ(x) + b = 0,
  where ω is the normal vector of a hyper-plane,
  b is the offset from the origin,
  Φ is the non-linear mapping.
Overview of SVC
• Support Vector Clustering (SVC)
• Clustering algorithm using (one-class) SVM
• Able to handle arbitrary shaped clusters
• Able to handle outliers
• Able to handle high dimensions, but…
• Needs input parameters
  – For the kernel function that defines the inner product in feature space
    • e.g. Gaussian kernel width q in K(x, y) = e^(−q‖x−y‖²)
  – Soft margin C to control outliers
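Since the Gaussian kernel K(x, y) = e^(−q‖x−y‖²) is reused throughout the remaining slides, a minimal sketch of computing the full kernel matrix is given below; NumPy and the function name are illustrative choices, not part of the original slides.

import numpy as np

def gaussian_kernel_matrix(X, q):
    # K[i, j] = exp(-q * ||x_i - x_j||^2) for an N x d data matrix X.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-q * sq_dists)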
SVC Main Idea
Gaussian kernel: K(x, y) = e^(−q‖x−y‖²)
Φ maps every data point x onto the surface of the unit ball in feature space. “Attract” the hyper-plane onto the data points instead of “repel.” Data space contours are not explicitly available.
(figure: data points x, their images Φ(x) on the unit ball, and the minimal hyper-sphere with center a and radius R)
R : radius of the minimal hyper-sphere
a : center of the sphere
R(x) : distance between Φ(x) and a
BSV : data x outside of the sphere, R(x) > R; Num(BSV) is controlled by C
SV : data x on the surface of the sphere, R(x) = R; Num(SV) is controlled by q
Others : data x inside of the sphere, R(x) < R
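As a concrete reading of the SV/BSV/interior classification above, the small sketch below labels each point from its Lagrange multiplier β_j (obtained from the dual problem on the next slide); the tolerance and names are illustrative assumptions.

import numpy as np

def classify_points(beta, C, tol=1e-7):
    # beta_j == 0      -> interior point (inside the sphere)
    # 0 < beta_j < C   -> support vector (on the sphere surface)
    # beta_j == C      -> bounded support vector (outside the sphere)
    labels = np.full(len(beta), "interior", dtype=object)
    labels[beta > tol] = "SV"
    labels[beta > C - tol] = "BSV"
    return labels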
Find Minimal Hyper-sphere (with BSVs)

minimize R²
subject to ‖Φ(x_j) − a‖² ≤ R² + ξ_j and ξ_j ≥ 0 for all j.

Lagrangian:
  L = R² − Σ_j (R² + ξ_j − ‖Φ(x_j) − a‖²) β_j − Σ_j ξ_j μ_j + C Σ_j ξ_j ,
where β_j ≥ 0 and μ_j ≥ 0 are Lagrange multipliers, C is a constant, and C Σ_j ξ_j is a penalty term for BSVs.

Setting the derivatives of L to zero gives:
  ∂L/∂R = 2R − 2R Σ_j β_j = 2R (1 − Σ_j β_j) = 0   ⟹   Σ_j β_j = 1
  ∂L/∂ξ_j = −β_j − μ_j + C = 0   ⟹   β_j = C − μ_j   (so 0 ≤ β_j ≤ C)
  ∂L/∂a = −2 Σ_j Φ(x_j) β_j + 2a Σ_j β_j = 0   ⟹   a = Σ_j β_j Φ(x_j)

KKT complementarity conditions:
  ξ_j μ_j = 0   and   (R² + ξ_j − ‖Φ(x_j) − a‖²) β_j = 0 .

Substituting these relations back into L eliminates R, ξ, and μ and yields the Wolfe dual, written with the kernel K(x_i, x_j) = Φ(x_i) · Φ(x_j):
  maximize   W = Σ_j β_j K(x_j, x_j) − Σ_{i,j} β_i β_j K(x_i, x_j)
  subject to 0 ≤ β_j ≤ C and Σ_j β_j = 1.

Use β_j to classify each data point (only points with β_j > 0 contribute to a):
  SV : ξ_j = 0 and 0 < β_j < C   ⟹ on the surface of the sphere
  BSV : ξ_j > 0, so μ_j = 0 and β_j = C   ⟹ outside the sphere
  Others : β_j = 0   ⟹ inside the sphere
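The slides do not name a particular solver for the Wolfe dual; as one hedged possibility, a generic constrained optimizer can be used. The sketch below hands the dual to SciPy's SLSQP method purely for illustration; the original work may rely on a dedicated QP/SVM routine.

import numpy as np
from scipy.optimize import minimize

def solve_svc_dual(K, C):
    # Maximize W(beta) = sum_j beta_j K[j, j] - beta^T K beta
    # subject to sum_j beta_j = 1 and 0 <= beta_j <= C.
    N = K.shape[0]
    diagK = np.diag(K)

    def neg_dual(beta):
        return -(diagK @ beta - beta @ K @ beta)

    constraints = [{"type": "eq", "fun": lambda b: np.sum(b) - 1.0}]
    bounds = [(0.0, C)] * N
    beta0 = np.full(N, 1.0 / N)            # feasible starting point
    result = minimize(neg_dual, beta0, method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x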
Relationship Between Minimal Hyper-sphere and Cluster Contours

R : radius of the minimal hyper-sphere
a : center of the sphere
R(x) : distance between Φ(x) and a

R²(x) = ‖Φ(x) − a‖²
      = Φ(x)² − 2 a · Φ(x) + a²
      = K(x, x) − 2 Σ_j β_j K(x_j, x) + Σ_{i,j} β_i β_j K(x_i, x_j)

Cluster contours correspond to points mapped to the surface of the minimal sphere: { x | R(x) = R }.
Challenge: contour boundaries are not explicitly available in data space.
The number of clusters increases with increasing q.
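R²(x) above is the quantity every cluster labeling method evaluates repeatedly; a minimal sketch of it for the Gaussian kernel is shown below (function names are illustrative). Since support vectors lie on the sphere surface, the radius R² itself can be obtained by evaluating R²(v) at any support vector v.

import numpy as np

def r_squared(x, X, beta, q):
    # R^2(x) = K(x, x) - 2 * sum_j beta_j K(x_j, x) + sum_{i,j} beta_i beta_j K(x_i, x_j)
    k_xx = 1.0                     # Gaussian kernel: K(x, x) = 1
    k_xX = np.exp(-q * np.sum((X - x) ** 2, axis=1))
    K = np.exp(-q * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    return k_xx - 2.0 * beta @ k_xX + beta @ K @ beta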
SVC High-Level Pseudo-Code

SVC(X)
  q ← initial value
  C ← initial value (= 1)
  loop
    K ← computeKernel(X, q)
    β ← solveLagrangian(K, C)
    clusterLabeling(X, β)
    if clustering result is satisfactory, exit
    choose new q and/or C
  end loop
Previous Work on SVC
• Tax and Duin (1999): novelty detection using (one-class) SVM
• SVC suggested by A. Ben-Hur, V. Vapnik, et al. (2001)
  – Complete Graph
  – Support Vector Graph
• J. Yang, et al. (2002): Proximity Graph
• J. Park, et al. (2004): Spectral Graph Partitioning
• J. Lee, et al. (2005): Gradient Descent
• W. Puma-Villanueva et al. (2005): Ensembles
• S. Lee and K. Daniels (2004, 2005, 2006, 2007): kernel width exploration and fast cluster labeling
Previous Work on Cluster Labeling
• Complete Graph (CG): tests all pairs (xi, xj) in X
• Support Vector Graph (SVG): tests all pairs (xi, xj) where xi or xj is a SV
• Proximity Graph (PG): tests all pairs (xi, xj) where xi and xj are linked in a PG
Gradient Descent (GD)
(figure: support vectors, non-SV data points, and stable equilibrium points)
Traditional Sample Points Technique
• CG, SVG, PG, and GD use this technique.
• To test whether xi and xj belong to the same cluster, sample points y on the line segment between them and check each against the sphere: if every sampled point satisfies R(y) ≤ R, the pair is connected; otherwise it is disconnected. A hedged sketch of this check follows.
(figure: three example segments — two disconnected, one connected)
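A minimal sketch of the sample-point connectivity test described above, assuming the Gaussian kernel and the R²(x) expression from the earlier slide; the number of samples m, the uniform spacing, and the function names are illustrative assumptions.

import numpy as np

def r_squared(y, X, beta, q, K):
    k_yX = np.exp(-q * np.sum((X - y) ** 2, axis=1))
    return 1.0 - 2.0 * beta @ k_yX + beta @ K @ beta

def same_cluster(xi, xj, X, beta, q, K, R2, m=10):
    # xi and xj are connected iff every sampled point on the segment
    # between them stays inside the minimal sphere (R^2(y) <= R^2).
    for t in np.linspace(0.0, 1.0, m):
        y = (1.0 - t) * xi + t * xj
        if r_squared(y, X, beta, q, K) > R2:
            return False
    return True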
Problems of the Sample Points Technique
(figure: with finitely many sample points the test can produce a false negative, splitting one true cluster, or a false positive, joining two separate clusters)
CG Result (C=1)
Problems of SVC
• Difficult to find appropriate q and C
  – no guidance for choosing q and C
  – too much trial and error
• Slow cluster labeling
  – O(N²Nsv m) time for the CG method, where m is the number of sample points on the line segment connecting any pair of data points
  – general size of the Delaunay triangulation in d dimensions = Θ(N^⌈d/2⌉)
• Bad performance in high dimensions
  – as the number of principal components is increased, there is performance degradation
Our q Exploration
• Lemmas
  – If q = 0, then R² = 0
  – If q = ∞, then βi = 1/N for all i ∈ {1,…, N}
  – If q = ∞, then R² = 1 − 1/N
  – R² = 1 iff q = ∞ and N = ∞
  – If N is finite, then R² ≤ 1 − 1/N < 1
• Theorem
  – Under certain circumstances, R² is a monotonically nondecreasing function of q
  – Secant-like algorithm (see the sketch below)
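The slides name a secant-like algorithm without detailing it here; the sketch below is only one plausible reading, assuming each new q comes from a secant step of the R²-versus-q curve toward its asymptote R² → 1, and it should not be read as the authors' exact procedure. solve_R2 stands for a routine that solves the dual at a given q and returns R².

def secant_q_list(solve_R2, q0, q1, max_len=20, eps=1e-6):
    # Assumed secant-like iteration: extend the line through the two most
    # recent (q, R^2) points up to the level R^2 = 1 and use that q next.
    qs, r2s = [q0, q1], [solve_R2(q0), solve_R2(q1)]
    while len(qs) < max_len:
        dq, dr = qs[-1] - qs[-2], r2s[-1] - r2s[-2]
        if dr < eps:                 # R^2 curve has flattened; stop
            break
        q_next = qs[-1] + (1.0 - r2s[-1]) * dq / dr
        qs.append(q_next)
        r2s.append(solve_R2(q_next))
    return qs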
q-list Length Analysis
• Estimation of q-list length ≈ lg(max_{i,j} ‖xi − xj‖²) − lg(min_{i,j} ‖xi − xj‖²)
• Depends only on
  – spatial characteristics of the data set, and
  – not on the dimensionality of the data set or the number of data points
• 89% accuracy w.r.t. the actual q-list length for all datasets considered
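A minimal sketch of the length estimate above, assuming the minimum is taken over distinct pairs (nonzero distances); the function name is illustrative.

import numpy as np

def estimate_q_list_length(X):
    # ~ lg(max squared pairwise distance) - lg(min nonzero squared pairwise distance)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    nonzero = d2[d2 > 0]
    return np.log2(nonzero.max()) - np.log2(nonzero.min())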
Our Recent q Exploration Work
• The curve typically has one critical radius of curvature, at q*.
• Approximate q* to yield q̂* (without cluster labeling).
• Use q̂* as the starting q value in the sequence.
q Exploration Results
• 2D: on average, the actual q-list length is
  – 32% of the estimate
  – 22% of the secant length
• Higher dimensions: on average, the actual q-list length is
  – 112% of the estimate
  – 82% of the secant length
(Table of test data sets omitted; their dimensions include 3, 4, 9, 25, and 200.)

2D q Exploration Results
(figures)

Higher Dimensional q Exploration Results
(figures)
Cone Cluster Labeling (CCL)
• Motivation: avoid line segment sampling.
• Approach:
  – Leverage the geometry of the feature space.
  – For the Gaussian kernel K(x, y) = e^(−q‖x−y‖²):
    • images of all data points are on the surface of the unit ball in feature space;
    • a hyper-sphere in data space corresponds to a cone in feature space with apex at the origin.
(figures: a sample 2D data space and a low-dimensional view of the high-dimensional feature space with the unit ball)
Cone Cluster Labeling
• P : intersection between the surface of the unit ball and the minimal hyper-sphere in feature space.
• Support vector cone: for each support vector vi, the cone with apex at the origin whose axis passes through Φ(vi) and whose base angle is θ.
• Covering P : (∪_{vi ∈ V} cone(vi)) ∩ P = P, where V is the set of support vectors.
(figure: the cones of two support vectors vi and vj, each with base angle θ)
Cone Cluster Labeling
• Cone base angles are all equal to θ.
• All of the cones have a′ = a/‖a‖ in common.
• The Pythagorean theorem holds in feature space: for a support vector vi, ‖Φ(vi)‖² = ‖a‖² + R² with ‖Φ(vi)‖ = 1, so ‖a‖ = √(1 − R²).
• To derive the data-space hyper-sphere radius, use
    cos(θ) = Φ(vi) · a′ = ‖a‖ = √(1 − R²)
  ⟹ cos(θ) = √(1 − R²)
Cone Cluster Labeling
• P′ : mapping of P into the data space.
• Each support vector vi corresponds to a support vector hyper-sphere S_vi in data space, centered at vi with radius
    Z = √( −ln(cos θ) / q ) = √( −ln(√(1 − R²)) / q )
• ∪_{vi ∈ V} S_vi approximately covers P′.
(figures: the covering hyper-spheres of radius Z and the region P′ for q = 0.003 and q = 0.137)
Cone Cluster Labeling

ConeClusterLabeling(X, Q, V)
  for each q ∈ Q
    compute Z for q
    AdjacencyMatrix ← ConstructConnectivity(V, Z)
    Labels ← FindConnComponents(AdjacencyMatrix)
    for each x ∈ X, where x ∉ V
      idx ← index of the nearest SV to x
      Labels(x) ← Labels(x_idx)
    end for
    print Labels
  end for
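Below is a minimal, self-contained sketch of the CCL steps for a single q, assuming that two support vectors fall in the same cluster when their data-space hyper-spheres of radius Z overlap (distance at most 2Z); the slides do not spell out ConstructConnectivity, so that criterion, like the helper names, is an assumption rather than the authors' exact rule.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cone_cluster_labeling(X, sv_idx, R2, q):
    # Radius of each support-vector hyper-sphere in data space:
    # Z = sqrt(-ln(sqrt(1 - R^2)) / q), valid since R^2 < 1 for finite N.
    Z = np.sqrt(-np.log(np.sqrt(1.0 - R2)) / q)

    V = X[sv_idx]
    # ConstructConnectivity (assumed rule): spheres of radius Z overlap.
    dists = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    adjacency = csr_matrix(dists <= 2.0 * Z)

    # FindConnComponents on the support-vector graph.
    _, sv_labels = connected_components(adjacency, directed=False)

    # Label each non-SV point with the label of its nearest support vector.
    labels = np.empty(len(X), dtype=int)
    labels[sv_idx] = sv_labels
    for i in np.setdiff1d(np.arange(len(X)), sv_idx):
        labels[i] = sv_labels[np.argmin(np.linalg.norm(V - X[i], axis=1))]
    return labels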
2D CCL Results (C=1)
Sample Higher Dimensional CCL Results in “Heat Map” Form
• N = 12, d = 9: 3 clusters
• N = 30, d = 25: 5 clusters
• N = 205, d = 200: 5 clusters
Comparison – Cluster Labeling Algorithms

Construct Adjacency Matrix:
  CG O(N² Nsv m), SVG O(N Nsv² m), PG O(N(log N + Nsv m)), GD O(m(N² i + Nsv Nsep²)), CCL O(Nsv²)
Find Connected Components:
  CG O(N²), SVG O(N Nsv), PG O(N²), GD O(Nsep²), CCL O(Nsv²)
Non-SV Labeling:
  CG N/A, SVG O((N − Nsv) Nsv), PG O((N − Nsv) Nsv), GD O(N − Nsep), CCL O((N − Nsv) Nsv)
TOTAL:
  CG O(N² Nsv m), SVG O(N Nsv² m), PG O(N² + N Nsv m), GD O(m(N² i + Nsv Nsep²)), CCL O(N Nsv)

m : the number of sample points; i : the number of iterations for convergence; Nsep : the number of stable equilibrium points (GD).
Time is for a single (q, C) combination.
Comparisons – 2D
(charts: construct adjacency matrix, find connected components, non-SV labeling, and total time for cluster labeling)

Comparisons – HD
(charts: construct adjacency matrix, find connected components, and non-SV labeling)
Contributions
• Automatically generate Gaussian kernel width values
  – include appropriate width values for our test data sets
  – obtain some reasonable cluster results from the q-list
• Faster cluster labeling method
  – faster than any other SVC cluster labeling algorithm
  – good clustering quality
Future Work
“The presence or absence of robust, efficient parallel clustering techniques will determine the success or failure of cluster analysis in large-scale data mining applications in the future.” – Jain et al. 1999
Make SVC scalable!
End