Cluster Analysis 群聚分析
Transcript of lecture slides by Ming-Yen Lin, FCU
The K-Means Clustering Method
• Example (K = 2)
  [Figure: a sequence of scatter plots illustrating one run of k-means:
  arbitrarily choose K objects as the initial cluster centers; assign each
  object to the most similar center; update the cluster means; reassign
  objects; update the means again, until no object changes cluster.]
Cluster Analysis 群聚分析
• Cluster: a group of data objects
  – highly similar to one another within the same cluster
  – very dissimilar to the objects in other clusters
• Cluster analysis
  – grouping data into clusters by similarity
• Clustering is unsupervised classification: there are no
  predefined class labels
• Typical applications
  – as a stand-alone tool to get insight into the data distribution
  – as a preprocessing step for other methods

General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
  – create thematic maps in GIS by clustering feature spaces
  – detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
  – Document classification
  – Cluster Weblog data to discover groups of similar access patterns
What Is Good Clustering?
• A good clustering method produces high-quality clusters with
  – high intra-class similarity (within a cluster: highly similar)
  – low inter-class similarity (between clusters: highly dissimilar)
• The quality of the result depends on the clustering method's
  – similarity measure
  – implementation
• The quality of a clustering method can also be measured by its
  ability to discover some or all of the hidden patterns

Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
  customer bases, and then use this knowledge to develop targeted
  marketing programs
• Land use: Identification of areas of similar land use in an earth
  observation database
• Insurance: Identifying groups of motor insurance policy holders
  with a high average claim cost
• City-planning: Identifying groups of houses according to their
  house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be
  clustered along continent faults
Data Structures
• Data matrix (two modes): n objects × p variables

  ⎡ x_11  ...  x_1f  ...  x_1p ⎤
  ⎢  ...       ...        ...  ⎥
  ⎢ x_i1  ...  x_if  ...  x_ip ⎥
  ⎢  ...       ...        ...  ⎥
  ⎣ x_n1  ...  x_nf  ...  x_np ⎦

• Dissimilarity matrix (one mode): n × n

  ⎡   0                           ⎤
  ⎢ d(2,1)    0                   ⎥
  ⎢ d(3,1)  d(3,2)    0           ⎥
  ⎢   :       :       :           ⎥
  ⎣ d(n,1)  d(n,2)   ...    0     ⎦

Requirements of Clustering
• Scalability
• Ability to handle various types of attributes
• Discovery of clusters with arbitrary shape
• Minimal domain knowledge required to determine the input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Measuring Clustering Quality
• Dissimilarity/Similarity metric: expressed as a distance
  function d(i, j)
• The definition of the distance function differs with the type of
  the variables
• Variable weights should be assigned according to the application
  and the semantics of the data
• It is often hard to define "similar enough" or "good enough"
  – the answer is quite subjective

Data Types in Cluster Analysis
• Interval-scaled variables
  – weight, height, latitude, ... (roughly linear)
• Binary variables
  – symmetric: gender
  – asymmetric: fever (Y/N), test (P/N)
• Nominal, ordinal, and ratio variables
  – map_color, weather; orderings; quantities like Ae^(Bt)
• Variables of mixed types
  – interval-scaled, boolean, categorical, ordinal and ratio
    variables
Interval-Valued Variables
• Standardize the data first (x_if becomes z_if)
  – compute the mean absolute deviation
      s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
    where m_f = (1/n)(x_1f + x_2f + ... + x_nf) is the mean
  – compute the standardized measurement (z-score)
      z_if = (x_if − m_f) / s_f
• Using the mean absolute deviation is more robust than using the
  standard deviation
  – the standard deviation squares the differences, so outliers
    weigh more heavily

Similarity and Dissimilarity Between Objects
• Distances measure the similarity or dissimilarity of two data
  objects
• Properties
  – d(i, j) ≥ 0
  – d(i, i) = 0
  – d(i, j) = d(j, i)
  – d(i, j) ≤ d(i, k) + d(k, j)
• Manhattan distance
    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
• Euclidean distance
    d(i, j) = √(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
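As a minimal illustration of these formulas (not part of the original slides; the function names are ours), the standardization and the two distances fit in a few lines of Python:

```python
def standardize(column):
    """Return z-scores z_if = (x_if - m_f) / s_f for one variable f."""
    n = len(column)
    m_f = sum(column) / n                           # mean m_f
    s_f = sum(abs(x - m_f) for x in column) / n     # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in column]

def manhattan(i, j):
    return sum(abs(a - b) for a, b in zip(i, j))

def euclidean(i, j):
    return sum((a - b) ** 2 for a, b in zip(i, j)) ** 0.5
```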
Similarity and Dissimilarity (Cont.)
• Minkowski distance:
    d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q)
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp)
  are two p-dimensional data objects and q is a positive integer
• Manhattan distance: q = 1
• Euclidean distance: q = 2

Binary Variables
• A contingency table for binary data covers the four possible
  combinations:

                    Object j
                   1      0     sum
  Object i    1    a      b     a + b
              0    c      d     c + d
             sum  a + c  b + d    p

• Simple matching coefficient (invariant, if the binary variable
  is symmetric):
    d(i, j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant if the binary variable is
  asymmetric):
    d(i, j) = (b + c) / (a + b + c)
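A short sketch of the Minkowski family and the two binary coefficients (assuming the counts a, b, c, d have already been tallied; none of this code appears in the slides):

```python
def minkowski(i, j, q):
    """Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

def simple_matching(a, b, c, d):
    """Dissimilarity for symmetric binary variables."""
    return (b + c) / (a + b + c + d)

def jaccard(a, b, c):
    """Dissimilarity for asymmetric binary variables; d is irrelevant."""
    return (b + c) / (a + b + c)
```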
Dissimilarity Between Binary Variables
• Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack     M        Y       N       P        N        N        N
  Mary     F        Y       N       P        N        P        N
  Jim      M        Y       P       N        N        N        N

  – gender is a symmetric attribute
  – the remaining attributes are asymmetric binary
  – let the values Y and P be 1, and the value N be 0; applying the
    Jaccard coefficient gives

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Nominal Variables
• A generalization of the binary variable in that it can take more
  than two states, e.g., red, yellow, blue, green
• Method 1: Simple matching
  – m: # of matches, p: total # of variables
      d(i, j) = (p − m) / p
• Method 2: use a large number of binary variables
  – create a new binary variable for each of the M nominal states
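The slide's numbers can be reproduced with a small script (the encoding is ours: Y and P map to 1, N to 0, and the symmetric gender attribute is left out of the Jaccard computation):

```python
records = {
    "jack": [1, 0, 1, 0, 0, 0],   # Fever, Cough, Test-1 ... Test-4
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

def jaccard_dissim(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

print(round(jaccard_dissim(records["jack"], records["mary"]), 2))  # 0.33
print(round(jaccard_dissim(records["jack"], records["jim"]), 2))   # 0.67
print(round(jaccard_dissim(records["jim"],  records["mary"]), 2))  # 0.75
```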
Ordinal Variables
• Can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables
  – replace x_if by its rank r_if ∈ {1, ..., M_f}
  – map the range of each ordinal variable onto [0, 1] by replacing
    the f-th variable of the i-th object with
      z_if = (r_if − 1) / (M_f − 1)
  – compute the dissimilarity with the methods for interval-scaled
    variables

Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear
  scale, approximately at exponential scale, such as Ae^(Bt) or
  Ae^(−Bt)
• Methods:
  – treat them like interval-scaled variables: not a good choice!
    (why? the scale can be distorted)
  – apply a logarithmic transformation y_if = log(x_if)
    (maybe log-log)
  – treat them as continuous ordinal data and treat their ranks as
    interval-scaled
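A tiny sketch of both conversions (our own function names; the ranks are assumed to be already computed):

```python
import math

def ordinal_to_unit(ranks, M_f):
    """Map ranks r_if in {1, ..., M_f} onto [0, 1]."""
    return [(r - 1) / (M_f - 1) for r in ranks]

def log_transform(values):
    """Logarithmic transformation for ratio-scaled variables."""
    return [math.log(x) for x in values]

print(ordinal_to_unit([1, 2, 3], 3))   # [0.0, 0.5, 1.0]
```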
Variables of Mixed Types
• A database may contain all six types of variables
  – symmetric binary, asymmetric binary, nominal, ordinal,
    interval and ratio
• They can be combined into one dissimilarity with a weighted
  formula (a code sketch follows this list):
    d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f)   (sums over f = 1, ..., p)
  – δ_ij^(f) = 0 when (1) x_if or x_jf is missing, or
    (2) x_if = x_jf = 0 and f is asymmetric binary;
    otherwise δ_ij^(f) = 1
  – if f is binary or nominal:
    d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
  – if f is interval-based:
    d_ij^(f) = |x_if − x_jf| / (max_h x_hf − min_h x_hf)
  – if f is ordinal or ratio-scaled
    • compute the ranks r_if
    • set z_if = (r_if − 1) / (M_f − 1) and treat z_if as
      interval-scaled

Major Clustering Approaches
• Partitioning algorithms
  – construct various partitions
  – evaluate them by some criterion
• Hierarchical algorithms
  – create a hierarchical decomposition of the set of data (or
    objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: a model is hypothesized for each of the clusters,
  and the idea is to find the best fit of the data to the model
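A sketch of the weighted formula under our own encoding (each variable carries a type tag; ordinal and ratio variables are assumed to be pre-converted to interval scale via their ranks; None marks a missing value):

```python
def mixed_dissim(x, y, types, ranges):
    """Weighted dissimilarity d(i, j) over variables of mixed types.

    types[f]  : 'sym', 'asym', 'nominal', or 'interval'
    ranges[f] : max - min of variable f (used for interval variables)
    """
    num = den = 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        if xf is None or yf is None:                    # (1) missing: delta = 0
            continue
        if types[f] == 'asym' and xf == 0 and yf == 0:  # (2) 0/0 asymmetric
            continue
        if types[f] == 'interval':
            d_f = abs(xf - yf) / ranges[f]
        else:                                           # binary or nominal
            d_f = 0.0 if xf == yf else 1.0
        num += d_f                                      # delta = 1 for this f
        den += 1.0
    return num / den if den else 0.0
```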
Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n
  objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the
  chosen partitioning criterion
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: the k-means and k-medoids algorithms
  – k-means (MacQueen'67): each cluster is represented by the
    center of the cluster
  – k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
    Rousseeuw'87): each cluster is represented by one of the
    objects in the cluster

The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the
     current partition (the centroid is the center, i.e., the mean
     point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no more new assignments occur
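A compact sketch of those steps in Python (illustrative only; the slides contain no code, and the initialization follows the figure's variant of picking K objects as the initial centers rather than Step 1's arbitrary partition):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Plain k-means over a list of coordinate tuples."""
    centers = random.sample(points, k)   # arbitrarily choose K initial centers
    assign = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        new_assign = [min(range(k), key=lambda c: dist2(p, centers[c]))
                      for p in points]
        if new_assign == assign:         # Step 4: stop when nothing changes
            break
        assign = new_assign
        # Step 2: recompute each centroid as the mean point of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assign, centers

# Example: two obvious groups in 2-D
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(data, k=2))
```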
The K-Means Clustering Method
• Example (K = 2)
  [Figure: the same sequence of scatter plots shown earlier:
  arbitrarily choose K objects as the initial cluster centers,
  assign each object to the most similar center, update the
  cluster means, and reassign until the clusters stabilize.]

Comments on the K-Means Method
• Strength: relatively efficient: O(tkn), where n is the number of
  objects, k the number of clusters, and t the number of
  iterations; normally k, t << n
  – Compare: PAM: O(k(n − k)²), CLARA: O(ks² + k(n − k))
• Comment: often terminates at a local optimum
• Weakness
  – Applicable only when the mean is defined; what about
    categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes
Problems with K-Means
• The k-means algorithm is sensitive to outliers!
  – an object with an extremely large value may substantially
    distort the distribution of the data
• K-Medoids: instead of taking the mean value of the objects in a
  cluster as a reference point, a medoid can be used, which is the
  most centrally located object in the cluster
  [Figure: two scatter plots contrasting a cluster mean with a
  medoid.]
• PAM (Partitioning Around Medoids, 1987)

Hierarchical Clustering
• Uses the distance matrix as the clustering criterion. This
  method does not require the number of clusters k as an input,
  but it needs a termination condition.
  [Figure: objects a, b, c, d, e; agglomerative clustering (AGNES)
  merges them step by step (Step 0 to Step 4) into ab, de, cde,
  and finally abcde; divisive clustering (DIANA) runs the same
  steps in reverse (Step 4 down to Step 0).]
AGNES (Agglomerative Nesting)
• Implemented in statistical packages, e.g., Splus
• Uses the single-linkage method and the dissimilarity matrix
• Merges the nodes that have the least dissimilarity
• Goes on in a non-descending fashion
• Eventually all nodes belong to the same cluster
  [Figure: three scatter plots showing clusters being merged step
  by step.]

Dendrogram Shows Hierarchical Clustering
• A dendrogram decomposes data objects into several levels of
  nested partitionings (a tree of clusters)
• A clustering of the data objects is obtained by cutting the
  dendrogram at the desired level, e.g., into 4 clusters
• Single-linkage: distance between the closest pair of objects;
  complete-linkage: distance between the most distant pair
  [Figure: a dendrogram cut at a level that yields 4 clusters.]
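As a hedged sketch (the slides mention Splus; here we assume SciPy and matplotlib are available instead), single-linkage agglomerative clustering and its dendrogram look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(20, 2)                        # 20 random 2-D objects
Z = linkage(X, method='single')                  # AGNES-style single-linkage merges
labels = fcluster(Z, t=4, criterion='maxclust')  # cut the tree into 4 clusters
dendrogram(Z)                                    # draw the tree of clusters
plt.show()
```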
DIANA (Divisive Analysis)
• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
  [Figure: three scatter plots showing one cluster being split
  step by step into singletons.]

Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as
  density-connected points
• Major features:
  – discover clusters of arbitrary shape
  – handle noise
  – one scan
  – need density parameters as a termination condition
• Several interesting studies:
  – DBSCAN: Ester, et al. (KDD'96)
  – OPTICS: Ankerst, et al. (SIGMOD'99)
  – DENCLUE: Hinneburg & Keim (KDD'98)
  – CLIQUE: Agrawal, et al. (SIGMOD'98)
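As one concrete, hedged illustration (assuming scikit-learn is available; it is not mentioned in the slides), DBSCAN takes exactly the density parameters the slide refers to:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)
# eps is the neighborhood radius, min_samples the density threshold;
# the label -1 marks noise points.
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
```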
Model-Based Clustering Methods
• Attempt to optimize the fit between the data and some
  mathematical model
• Statistical approach
  – conceptual clustering, e.g., COBWEB (Fisher'87)
• AI approach
  – keep a "prototype" for each cluster (called an exemplar)
  – put a new object into the cluster with the most similar
    exemplar
• Neural network approach
  – Self-Organizing feature Map (SOM)
  – several units compete for the current object
Self-Organizing Feature Maps (SOMs)
• Clustering is also performed by having several units compete for
  the current object
• The unit whose weight vector is closest to the current object
  wins
• The winner and its neighbors learn by having their weights
  adjusted
• Useful for visualizing high-dimensional data in 2- or 3-D space
  [Figure: screenshot of an example SOM tool.]

What Is Outlier Discovery?
• What are outliers?
  – e.g., Michael Jordan, a CEO's salary, age = 999
  – data that is considerably dissimilar from the remainder of the
    data
• Problem: find the top k outliers among n objects
• Applications:
  – credit card / telecom fraud detection
  – customer segmentation
  – medical analysis
• Approaches
  – statistical-based
  – distance-based
  – deviation-based
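A minimal sketch of one SOM competition-and-update step (our own function; the grid shape, learning rate, and square neighborhood are illustrative assumptions):

```python
import numpy as np

def som_step(weights, x, lr=0.1, radius=1):
    """One competition-and-update step of a SOM.

    weights: (rows, cols, dim) array of unit weight vectors
    x      : (dim,) input object
    """
    # The unit whose weight vector is closest to the object wins.
    dist = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dist), dist.shape)
    # The winner and its grid neighbors learn: weights move toward x.
    rows, cols, _ = weights.shape
    for i in range(rows):
        for j in range(cols):
            if abs(i - wi) <= radius and abs(j - wj) <= radius:
                weights[i, j] += lr * (x - weights[i, j])
    return weights
```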
Outlier Discovery: Statistical Approaches
• Assume a model of the underlying distribution that generates the
  data set (e.g., a normal distribution)
• Use discordancy tests, which depend on
  – the data distribution
  – the distribution parameters (e.g., mean, variance)
  – the number of expected outliers
• Drawbacks
  – most tests are for a single attribute
  – in many cases, the data distribution may not be known

Outlier Discovery: Distance-Based Approach
• Parameters: p (a fraction), D (a distance)
• Distance-based outlier, DB(p, D)-outlier: an object O in a
  dataset S such that at least a fraction p of the objects in S
  lie at a distance greater than D from O
  – i.e., O does not have enough neighbors
• Distance-based outlier mining algorithms (a nested-loop sketch
  follows this list)
  – index-based algorithm
  – nested-loop algorithm
  – cell-based algorithm
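A sketch of the nested-loop variant (our reading of the definition counts the fraction over the other objects, excluding O itself; dist is any distance function from earlier):

```python
def db_outliers(S, p, D, dist):
    """Nested-loop search for DB(p, D)-outliers in dataset S.

    An object O qualifies when at least a fraction p of the other
    objects lie at distance greater than D from it.
    """
    outliers = []
    for o in S:
        far = sum(1 for x in S if x is not o and dist(o, x) > D)
        if far >= p * (len(S) - 1):
            outliers.append(o)
    return outliers
```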
Constraint-Based Clustering
• Example: the ATM allocation problem

Summary
• Cluster analysis groups objects based on their similarity
  – cluster analysis has wide applications
• Measures of similarity can be computed for various types of data
• Clustering algorithms can be categorized into
  – partitioning methods
  – hierarchical methods
  – density-based methods
  – grid-based methods
  – model-based methods
• Outlier detection and analysis
  – useful for fraud detection, etc.
  – performed by statistical, distance-based, or deviation-based
    approaches
• Research issues: constraint-based clustering