Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining: Concepts and Techniques — Slides for Textbook — — Chapter 8 — ©Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser University, Canada http://www.cs.sfu.ca May 22, 2017 Data Mining: Concepts and Techniques 1 Chapter 8. Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Clustering Methods Outlier Analysis Summary May 22, 2017 Data Mining: Concepts and Techniques 2 Clustering Problem Formally Given a database D={t1,t2,…,tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f:D {1,..,k} where each ti is assigned to one cluster Kj, 1<=j<=k. A cluster, Kj, contains precisely those tuples mapped to it. Unlike classification problem, clusters are not known a priori. May 22, 2017 Data Mining: Concepts and Techniques 4 General Applications of Clustering Pattern Recognition Spatial Data Analysis create thematic maps in GIS by clustering feature spaces detect spatial clusters and explain them in spatial data mining Image Processing Economic Science (especially market research) WWW Document classification Cluster Weblog data to discover groups of similar access patterns May 22, 2017 Data Mining: Concepts and Techniques 5 Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Land use: Identification of areas of similar land use in an earth observation database Insurance: Identifying groups of motor insurance policy holders with a high average claim cost City-planning: Identifying groups of houses according to their house type, value, and geographical location Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults May 22, 2017 Data Mining: Concepts and Techniques 6 Clustering Issues – The appropriate number of clusters for each data set. – How to define similarity or the criterion used to group data together. – Outlier handling is difficult. Should they be a part of an existing cluster, or another cluster? – Dynamic database, how to update the clusters when there are changes in data. – The semantic meaning of each cluster. (Contrast with classes in classification process, each has a definitive meaning.) – Type of attributes that the clustering algorithm can handle. – Scalability to large datasets. May 22, 2017 Data Mining: Concepts and Techniques 7 Notion of a cluster is ambigious May 22, 2017 Data Mining: Concepts and Techniques 8 Different types of clusters Cluster 1 Cluster 2 Cluster 1 Cluster 2 Cluster 1 Cluster 2 Cluster 1 Cluster 2 Cluster 3 Cluster 4 May 22, 2017 Data Mining: Concepts and Techniques 9 What Is Good Clustering? A good clustering method will produce high quality clusters with high intra-class similarity low inter-class similarity The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns. May 22, 2017 Data Mining: Concepts and Techniques 10 Requirements of Clustering in Data Mining Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to determine input parameters Able to deal with noise and outliers Insensitive to order of input records High dimensionality Incorporation of user-specified constraints Interpretability and usability May 22, 2017 Data Mining: Concepts and Techniques 11 Chapter 8. Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Clustering Methods Outlier Analysis Summary May 22, 2017 Data Mining: Concepts and Techniques 12 Similarity and Dissimilarity Metric •Similarity - Numerical measure of how alike two data objects are. - Is higher when objects are more alike. - Often falls in the range [0,1] • Dissimilarity - Numerical measure of how different two data objects are. - Is lower when objects are more alike. - Minimum dissimilarity is often 0. - Upper limit varies • Proximity refers to a similarity or dissimilarity May 22, 2017 Data Mining: Concepts and Techniques 13 Data Structures Data matrix This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, gender, race, and so on. Called “two modes” : since rows and columns represent different entities Dissimilarity matrix Stores a collection of proximities that are available for all pairs of n objects. (n by n matrix) Called “one mode” : since it reprsents the same entity d(i,j) is the measured difference or dissimilarity between objects i and j. May 22, 2017 x11 ... x i1 ... x n1 ... x1f ... ... ... ... xif ... ... ... ... ... xnf ... ... x1p ... xip ... xnp 0 d(2,1) 0 d(3,1) d ( 3,2) 0 : : : d ( n,1) d ( n,2) ... ... 0 Data Mining: Concepts and Techniques 14 Measure the Quality of Clustering Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j) There is a separate “quality” function that measures the “goodness” of a cluster. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables. Weights should be associated with different variables based on applications and data semantics. It is hard to define “similar enough” or “good enough” the answer is typically highly subjective. May 22, 2017 Data Mining: Concepts and Techniques 15 Type of data in clustering analysis Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: May 22, 2017 Data Mining: Concepts and Techniques 16 Interval-valued variables Interval-scaled (based) variables are continuous measurements of a roughly linear scale (such as weight, height, weather). The measurement unit used can affect the clustering analysis. Using inches or meters for a measurement may lead to a very different clustering structure. To avoid dependence on on the choice of measurement units, the data should be standardized. How to Standardize data Calculate the mean absolute deviation: where sf 1 n (| x1 f m f | | x2 f m f | ... | xnf m f |) m f 1n (x1 f x2 f ... xnf ) . Calculate the standardized measurement (z-score) xif m f zif sf Using mean absolute deviation is more robust than using standard deviation May 22, 2017 Data Mining: Concepts and Techniques 17 Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects Some popular ones include: Minkowski distance: d (i, j) (| x x | | x x | ... | x x | ) i1 j1 i2 j 2 ip jp q q q q where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two pdimensional data objects, and q is a positive integer If q = 1, d is Manhattan distance d (i, j) | x x | | x x | ... | x x | i1 j1 i2 j 2 i p jp May 22, 2017 Data Mining: Concepts and Techniques 18 Similarity and Dissimilarity Between Objects (Cont.) If q = 2, d is Euclidean distance: d (i, j) (| x x |2 | x x |2 ... | x x |2 ) i1 j1 i2 j2 ip jp Properties d(i,j) 0 d(i,i) = 0 d(i,j) = d(j,i) d(i,j) d(i,k) + d(k,j) Also one can use weighted distance, parametric Pearson product moment correlation, or other disimilarity measures. May 22, 2017 Data Mining: Concepts and Techniques 19 Euclidean Distance Source : S. Ranka May 22, 2017 Data Mining: Concepts and Techniques 20 Similarity and Dissimilarity Between Objects (Cont.) Determine similarity between two objects. Definition: The similarity between two tuples ti and t j , sim(ti ,t j ), in a database D is a mapping from D D to the range [0,1]. Thus, sim(ti ,t j ) [0,1]. Similarity characteristics: ti D, sim(ti ,ti ) 1 ti ,t j D, sim(ti ,t j ) 0 if ti and t j are not alike at all ti ,t j ,tk D, sim(ti ,t j ) sim(ti ,tk ) if ti is more like tk than it is like t j Source : Dunham May 22, 2017 Data Mining: Concepts and Techniques 21 Similarity and Dissimilarity Between Objects (Cont.) 2h 1tiht jh k Dice : sim(ti ,t j ) k t h 1t 2jh k 2 h 1 ih k Jaccard : sim(ti ,t j ) t t h 1 ih jh k 2 h 1 jh t t t t Cosine : sim(t ,t ) t t t t Overlap : sim(t ,t ) min t , k 2 h 1 ih h 1tiht jh k k i h 1 ih jh j k k 2 h 1 ih 2 h 1 jh k i j h 1 ih jh k k 2 2 ih h 1 h 1 jh t ti ti1,,tik ,t j t j1,,t jk May 22, 2017 Data Mining: Concepts and Techniques 22 Similarity and Dissimilarity Between Objects (Cont.) Measure dissimilarity between objects Euclidean (L2 ) : dis(ti ,t j ) 2 ( t t ) h1 ih jh k Manhattan (L1) : dis(ti ,t j ) h1|(tih t jh ) | k May 22, 2017 Data Mining: Concepts and Techniques 23 Binary Variables How can we compute the dissimilaty between objects descired by by either symmetic or asymmetic binary variables. A binary variable has only two states 0 and 1. Symetric : both states are equally valuable and carry the same weight. Example: gender having states male and female Asymmetric : the outcome states are not equally important, such as the positive and negative outcomes of a disease test. Example : HIV positive - represented by 1 (rarest) HIV negative – represented by 0 May 22, 2017 Data Mining: Concepts and Techniques 24 Binary Variables A contingency table for binary data Object i Object j 1 0 1 a b 0 c d sum a c b d sum a b cd p • Simple matching coefficient (invariant, if the binary variable is symmetric): d (i, j) bc a bc d • Jaccard coefficient (noninvariant if the binary variable is asymmetric): d (i, j) May 22, 2017 bc a bc Data Mining: Concepts and Techniques 25 Dissimilarity between Binary Variables Example Name Jack Mary Jim Gender M F M Fever Y Y Y Cough N N P Test-1 P P N Test-2 N N N Test-3 N P N Test-4 N N N gender is a symmetric attribute the remaining attributes are asymmetric binary let the values Y and P be set to 1, and the value N be set to 0 01 0.33 2 01 11 d ( jack , jim ) 0.67 111 1 2 d ( jim , mary ) 0.75 11 2 d ( jack , mary ) May 22, 2017 Data Mining: Concepts and Techniques 26 Nominal Variables A nominal variable is a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green Method 1: Simple matching m: # of matches, p: total # of variables m d (i, j) p p Method 2: use a large number of binary variables creating a new binary variable for each of the M nominal states May 22, 2017 Data Mining: Concepts and Techniques 27 Ordinal Variables order is important, e.g., rank An ordinal variable can be discrete or continuous A discrete ordinal variable resebles a nominal variable, except that M states of the the ordinal value are ordered in a meaningful sequence (e.g. Projesional ranks : Assistant, Associate, Full professor) A continuous ordinal variable looks like a set of continous data of of an unkwon scale; that is, the realtive ordering of values is essential but their actual size is not. (e.g. The relative ranking in a particular sport: gold, silver, and bronze) May 22, 2017 Data Mining: Concepts and Techniques 28 Ordinal Variables They can be treated like interval-scaled Suppose f is a variable from a set of ordinal variables descibing n objects The value of f for the ith object is xif f has Mf ordered states 1,....,Mf Replace each xif by its rank corresponding rank rif {1,...,M f } map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by zif rif 1 M f 1 compute the dissimilarity using methods for interval-scaled variables May 22, 2017 Data Mining: Concepts and Techniques 29 Ratio-Scaled Variables Ratio-scaled variable: A ratio scale variable makes a positive measurement on a nonlinear scale, approximately at exponential scale, such as AeBt or Ae-Bt Methods: treat them like interval-scaled variables — not a good choice! (why?) it is likely that the scale may be distorted. apply logarithmic transformation yif = log(xif) treat them as continuous ordinal data treat their rank as intervalscaled. May 22, 2017 Data Mining: Concepts and Techniques 30 Variables of Mixed Types A database may contain all the six types of variables symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio. One may use a weighted formula to combine their effects. pf 1 ij( f ) d ij( f ) d (i, j) pf 1 ij( f ) f is binary or nominal: dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise f is interval-based: use the normalized distance f is ordinal or ratio-scaled r 1 z compute ranks rif and if M 1 and treat zif as interval-scaled if f May 22, 2017 Data Mining: Concepts and Techniques 31 Chapter 8. Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Clustering Methods Outlier Analysis Summary May 22, 2017 Data Mining: Concepts and Techniques 32 Major Clustering Approaches (Han) Partitioning algorithms: Construct various partitions and then evaluate them by some criterion Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion Density-based: based on connectivity and density functions Grid-based: based on a multiple-level granularity structure Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to each other May 22, 2017 Data Mining: Concepts and Techniques 33 Major Clustering Approaches (Dunham) Clustering Hierarchical Agglomerative May 22, 2017 Partitional Divisive Categorical Sampling Data Mining: Concepts and Techniques Large DB Compression 34