Cluster Analysis (群聚分析)
Ming-Yen Lin / FCU

Cluster Analysis
• Cluster: a group of data objects
  – objects in the same cluster are quite similar to one another
  – objects in different clusters are very dissimilar
• Cluster analysis
  – grouping data into clusters according to their similarity
• Clustering is unsupervised classification: there are no predefined class labels
• Typical applications
  – as a stand-alone tool to get insight into the data distribution
  – as a preprocessing step for other methods

General Applications of Clustering
• Pattern recognition
• Spatial data analysis
  – create thematic maps in GIS by clustering feature spaces
  – detect spatial clusters and explain them in spatial data mining
• Image processing
• Economic science (especially market research)
• WWW
  – document classification
  – cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults

What Is Good Clustering?
• A good clustering method produces high-quality clusters with
  – high intra-class similarity (within a cluster: highly similar)
  – low inter-class similarity (between clusters: very dissimilar)
• The quality of the result depends on the clustering method's
  – similarity measure
  – implementation
• The quality of a clustering method can also be measured by its ability to discover some or all of the hidden patterns

Requirements of Clustering
• Scalability
• Ability to handle various types of attributes
• Discovery of clusters with arbitrary shape
• Minimal domain knowledge required to determine the input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of the input records
• High dimensionality
• Ability to incorporate user-specified constraints
• Interpretability and usability

Data Structures
• Data matrix ("two modes"): n objects by p variables

    [ x_11  ...  x_1f  ...  x_1p ]
    [  ...        ...        ... ]
    [ x_i1  ...  x_if  ...  x_ip ]
    [  ...        ...        ... ]
    [ x_n1  ...  x_nf  ...  x_np ]

• Dissimilarity matrix ("one mode"): n by n

    [   0                              ]
    [ d(2,1)    0                      ]
    [ d(3,1)  d(3,2)    0              ]
    [   :        :      :              ]
    [ d(n,1)  d(n,2)   ...   ...    0  ]

Measuring Clustering Quality
• Dissimilarity/similarity metric: expressed as a distance function, d(i, j)
• The definition of the distance function differs with the type of variable
  – interval-scaled, boolean, categorical, ordinal, and ratio variables
• Variable weights are assigned according to the application and the meaning of the data
• It is sometimes hard to define "similar enough" or "good enough"
  – the answer is quite subjective

Types of Data in Cluster Analysis
• Interval-scaled variables
  – weight, height, latitude, … (roughly linear)
• Binary variables
  – symmetric: gender
  – asymmetric: fever (Y/N), test (P/N)
• Nominal, ordinal, and ratio variables
  – map_color, weather; ordering; Ae^(Bt)
• Variables of mixed types

Interval-valued Variables
• Standardize the data first (turn x_if into z_if)
  – compute the mean absolute deviation
    s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
    where m_f = (1/n) (x_1f + x_2f + ... + x_nf) is the mean
  – compute the standardized measurement (z-score)
    z_if = (x_if − m_f) / s_f
• Using the mean absolute deviation is more robust than using the standard deviation
  – the standard deviation squares the differences, so outliers have a larger effect
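As a concrete illustration of this standardization step, here is a minimal Python sketch (ours, not from the slides); the function name standardize and the sample values are illustrative assumptions:

    # Standardize one interval-scaled variable as described above:
    # compute the mean m_f, the mean absolute deviation s_f, and then
    # the z-scores z_if = (x_if - m_f) / s_f.
    def standardize(values):
        n = len(values)
        m_f = sum(values) / n                          # mean m_f
        s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
        return [(x - m_f) / s_f for x in values]

    print(standardize([10.0, 20.0, 30.0, 40.0]))       # [-1.5, -0.5, 0.5, 1.5]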
Similarity and Dissimilarity Between Objects
• Distances measure the similarity or dissimilarity of two data objects
  – properties
    • d(i, j) ≥ 0
    • d(i, i) = 0
    • d(i, j) = d(j, i)
    • d(i, j) ≤ d(i, k) + d(k, j)
• Manhattan distance
  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
• Euclidean distance
  d(i, j) = sqrt( |x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_ip − x_jp|^2 )

Similarity and Dissimilarity (Cont.)
• Minkowski distance:
  d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q )^(1/q)
  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects and q is a positive integer
• Manhattan distance: q = 1
• Euclidean distance: q = 2

Binary Variables
• A contingency table for binary data covers all the possibilities:

                     object j
                     1       0       sum
    object i    1    a       b       a + b
                0    c       d       c + d
              sum    a + c   b + d   p

• Simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i, j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant, if the binary variable is asymmetric):
  d(i, j) = (b + c) / (a + b + c)

Dissimilarity between Binary Variables
• Example

    Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
    Jack   M       Y      N      P       N       N       N
    Mary   F       Y      N      P       N       P       N
    Jim    M       Y      P      N       N       N       N

  – gender is a symmetric attribute
  – the remaining attributes are asymmetric binary
  – let the values Y and P be 1, and the value N be 0

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
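The example above can be checked mechanically. Below is a small Python sketch (ours) of the Jaccard dissimilarity for asymmetric binary variables; the 0/1 encoding follows the slide (Y and P as 1, N as 0; gender is omitted because it is symmetric):

    # d(i, j) = (b + c) / (a + b + c), where a counts 1/1 matches,
    # b counts 1/0 mismatches, and c counts 0/1 mismatches.
    def jaccard_dissim(i, j):
        a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
        b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
        c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
        return (b + c) / (a + b + c)

    #       Fever Cough T1 T2 T3 T4
    jack = [1,    0,    1, 0, 0, 0]
    mary = [1,    0,    1, 0, 1, 0]
    jim  = [1,    1,    0, 0, 0, 0]
    print(round(jaccard_dissim(jack, mary), 2))   # 0.33
    print(round(jaccard_dissim(jack, jim), 2))    # 0.67
    print(round(jaccard_dissim(jim, mary), 2))    # 0.75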
Nominal Variables
• A generalization of the binary variable: it can take more than two states, e.g., red, yellow, blue, green
• Method 1: simple matching
  – m: # of matches, p: total # of variables
  d(i, j) = (p − m) / p
• Method 2: use a large number of binary variables
  – create a new binary variable for each of the M nominal states

Ordinal Variables
• Can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables
  – replace x_if by its rank, r_if ∈ {1, …, M_f}
  – map the range of each ordinal variable onto [0, 1] by replacing the f-th variable of the i-th object with
    z_if = (r_if − 1) / (M_f − 1)
  – compute the dissimilarity using the methods for interval-scaled variables

Ratio-Scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(−Bt)
• Methods:
  – treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
  – apply a logarithmic transformation, y_if = log(x_if) (maybe log-log)
  – treat them as continuous ordinal data, and treat their ranks as interval-scaled

Variables of Mixed Types
• A database may contain all six types of variables
  – symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
• A weighted formula can be used to combine them:
  d(i, j) = Σ_{f=1..p} δ_ij(f) d_ij(f) / Σ_{f=1..p} δ_ij(f)
  where δ_ij(f) = 0 if (1) x_if or x_jf is missing, or (2) x_if = x_jf = 0 and f is asymmetric binary; otherwise δ_ij(f) = 1
  – if f is binary or nominal: d_ij(f) = 0 if x_if = x_jf, otherwise d_ij(f) = 1
  – if f is interval-based: d_ij(f) = |x_if − x_jf| / (max(x_f) − min(x_f))
  – if f is ordinal or ratio-scaled:
    • compute the ranks r_if
    • treat z_if = (r_if − 1) / (M_f − 1) as interval-scaled

Major Clustering Approaches
• Partitioning algorithms
  – construct various partitions and evaluate them by some criterion
• Hierarchical algorithms
  – create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model

Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  – global optimum: exhaustively enumerate all partitions
  – heuristic methods: the k-means and k-medoids algorithms
  – k-means (MacQueen'67): each cluster is represented by the center of the cluster
  – k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when there are no more new assignments
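A minimal Python sketch of these four steps (ours, not the slides' code); for simplicity it seeds the partition by sampling k objects as the initial centers, matching the example on the next slide:

    import random

    def dist2(p, q):
        # squared Euclidean distance between two points
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def kmeans(points, k, seed=0):
        random.seed(seed)
        centers = random.sample(points, k)        # arbitrary initial centers
        assignment = None
        while True:
            # assign each object to the most similar (nearest) center
            new = [min(range(k), key=lambda c: dist2(p, centers[c]))
                   for p in points]
            if new == assignment:                 # stop: no more new assignments
                return centers, assignment
            assignment = new
            # update the cluster means
            for c in range(k):
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centers[c] = tuple(sum(x) / len(members)
                                       for x in zip(*members))

    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
    print(kmeans(pts, k=2))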
The K-Means Clustering Method: Example
[Figure: with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; update the cluster means again, until no object changes cluster]

Comments on the K-Means Method
• Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
  – comparing: PAM: O(k(n − k)^2), CLARA: O(ks^2 + k(n − k))
• Comment: often terminates at a local optimum
• Weakness
  – applicable only when the mean is defined; what about categorical data?
  – need to specify k, the number of clusters, in advance
  – unable to handle noisy data and outliers
  – not suitable for discovering clusters with non-convex shapes

Problems of the K-Means Method
• The k-means algorithm is sensitive to outliers!
  – an object with an extremely large value may substantially distort the distribution of the data
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster
[Figure: the same data clustered around means versus around medoids]
• PAM (Partitioning Around Medoids, 1987)

Hierarchical Clustering
• Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition
[Figure: objects a, b, c, d, e; going left to right (Step 0 → Step 4), a and b merge into ab, d and e merge into de, then cde, then abcde: agglomerative (AGNES); going right to left (Step 4 → Step 0): divisive (DIANA)]

AGNES (Agglomerative Nesting)
• Implemented in statistical packages, e.g., Splus
• Uses the single-linkage method and the dissimilarity matrix
• Merges the nodes that have the least dissimilarity
• Proceeds in a non-descending fashion
• Eventually all nodes belong to the same cluster
  – single-linkage: distance between the closest pair; complete-linkage: distance between the most distant pair
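To make the merge loop concrete, here is a short Python sketch of single-linkage agglomeration in the spirit of AGNES (ours; AGNES itself refers to the Splus implementation). Stopping at a target number of clusters is one possible termination condition:

    def single_linkage(points, dist, num_clusters=1):
        # every object starts as its own cluster
        clusters = [[p] for p in points]
        while len(clusters) > num_clusters:
            # find the two clusters whose closest members are least dissimilar
            pairs = [(i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda ij: min(
                dist(p, q) for p in clusters[ij[0]] for q in clusters[ij[1]]))
            clusters[i].extend(clusters.pop(j))   # merge them
        return clusters

    euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    print(single_linkage([(0, 0), (0, 1), (5, 5), (5, 6)], euclid,
                         num_clusters=2))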
Dendrogram Shows Hierarchical Clustering
• A dendrogram decomposes the data objects into several levels of nested partitioning (a tree of clusters)
• A clustering is obtained by cutting the dendrogram at the desired level, e.g., into 4 clusters

DIANA (Divisive Analysis)
• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own

Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  – discover clusters of arbitrary shape
  – handle noise
  – one scan
  – need density parameters as a termination condition
• Several interesting studies:
  – DBSCAN: Ester, et al. (KDD'96)
  – OPTICS: Ankerst, et al. (SIGMOD'99)
  – DENCLUE: Hinneburg & D. Keim (KDD'98)
  – CLIQUE: Agrawal, et al. (SIGMOD'98)

Model-Based Clustering Methods
• Attempt to optimize the fit between the data and some mathematical model
• Statistical approach
  – conceptual clustering, e.g., COBWEB (Fisher'87)
• AI approach
  – keep a "prototype" for each cluster (called an exemplar) and put each new object with the most similar exemplar
  – neural network approach: the Self-Organizing feature Map (SOM), with several units competing for the current object

Self-Organizing Feature Maps (SOMs)
• Clustering is performed by having several units compete for the current object
• The unit whose weight vector is closest to the current object wins
• The winner and its neighbors learn by having their weights adjusted
• Useful for visualizing high-dimensional data in 2- or 3-D space
[Figure: an example SOM tool]

What Is Outlier Discovery?
• What are outliers?
  – Michael Jordan, a CEO's salary, age = 999
  – data that are considerably dissimilar from the rest of the data
• Problem: find the top k outliers among n objects
• Applications:
  – credit card / telecom fraud detection
  – customer segmentation
  – medical analysis
• Approaches:
  – statistical-based
  – distance-based
  – deviation-based

Outlier Discovery: Statistical Approaches
• Assume a model: an underlying distribution that generates the data set (e.g., a normal distribution)
• Use discordancy tests, which depend on
  – the data distribution
  – the distribution parameters (e.g., mean, variance)
  – the number of expected outliers
• Drawbacks
  – most tests are for a single attribute
  – in many cases, the data distribution may not be known

Distance-Based Approach
• Parameters: p (a fraction), D
• Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset S such that at least a fraction p of the objects in S lie at a distance greater than D from O
  – i.e., O does not have enough neighbors (see the sketch at the end of these notes)
• Distance-based outlier mining algorithms
  – index-based algorithm
  – nested-loop algorithm
  – cell-based algorithm

Constraint-Based Clustering
• Example: the ATM allocation problem

Summary
• Cluster analysis groups objects based on their similarity
  – cluster analysis has wide applications
• Measures of similarity can be computed for various types of data
• Clustering algorithms can be categorized into
  – partitioning methods
  – hierarchical methods
  – density-based methods
  – grid-based methods
  – model-based methods
• Outlier detection and analysis
  – useful for fraud detection, etc.
  – performed by statistical, distance-based, or deviation-based approaches
• Research issues: constraint-based clustering
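Finally, as promised under the distance-based approach above, a naive nested-loop sketch of the DB(p, D)-outlier test (ours; practical implementations use indexes or cells to avoid the quadratic scan). The fraction p, distance D, and sample data are illustrative assumptions:

    def db_outliers(S, p, D, dist):
        # O is a DB(p, D)-outlier if at least a fraction p of the other
        # objects in S lie at a distance greater than D from O.
        outliers = []
        for o in S:
            far = sum(1 for x in S if x is not o and dist(o, x) > D)
            if far >= p * (len(S) - 1):
                outliers.append(o)
        return outliers

    euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    data = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
    print(db_outliers(data, p=0.9, D=3.0, dist=euclid))   # [(9, 9)]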