
Data Mining: Concepts and Techniques
Slides for Textbook, Chapter 8
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
May 22, 2017

Chapter 8. Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

Clustering Problem Formally
- Given a database $D = \{t_1, t_2, \dots, t_n\}$ of tuples and an integer value $k$, the clustering problem is to define a mapping $f : D \to \{1, \dots, k\}$ where each $t_i$ is assigned to one cluster $K_j$, $1 \le j \le k$.
- A cluster $K_j$ contains precisely those tuples mapped to it.
- Unlike in the classification problem, the clusters are not known a priori.

General Applications of Clustering
- Pattern recognition
- Spatial data analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- WWW
  - document classification
  - cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered
along continental faults

Clustering Issues
- The appropriate number of clusters for each data set.
- How to define similarity, that is, the criterion used to group data together.
- Outlier handling is difficult. Should outliers be part of an existing cluster, or form another cluster?
- Dynamic databases: how to update the clusters when the data changes.
- The semantic meaning of each cluster. (Contrast with classes in the classification process, each of which has a definitive meaning.)
- The types of attributes that the clustering algorithm can handle.
- Scalability to large datasets.

The notion of a cluster is ambiguous

Different types of clusters
[Figure: example point sets partitioned into labeled clusters, Cluster 1 through Cluster 4]

What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Chapter 8. Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

Similarity and Dissimilarity Metric
- Similarity
  - Numerical measure of how alike two data objects are.
  - Is higher when objects are more alike.
  - Often falls in the range [0, 1].
- Dissimilarity
  - Numerical measure of how different two data objects are.
  - Is lower when objects are more alike.
  - Minimum dissimilarity is often 0.
  - Upper limit varies.
- Proximity refers to either a similarity or a dissimilarity.

Data Structures
- Data matrix
  - Represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, gender, race, and so on.
  - Called "two-mode", since rows and columns represent different entities.

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

- Dissimilarity matrix
  - Stores a collection of proximities that are available for all pairs of n objects (an n-by-n matrix).
  - Called "one-mode", since rows and columns represent the same entity.
  - $d(i, j)$ is the measured difference or dissimilarity between objects i and j.

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function $d(i, j)$, which is typically a metric.
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Type of data in clustering analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

Interval-valued variables
- Interval-scaled variables are continuous measurements on a roughly linear scale (such as weight, height, weather).
- The measurement unit used can affect the clustering analysis. Using inches or meters for a measurement may lead to a very different clustering structure. To avoid dependence on the choice of measurement units, the data should be standardized.
- How to standardize data
  - Calculate the mean absolute deviation

    $$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

    where

    $$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

  - Calculate the standardized measurement (z-score)

    $$z_{if} = \frac{x_{if} - m_f}{s_f}$$

- Using the mean absolute deviation is more robust to outliers than using the standard deviation, because the deviations from the mean are not squared.

Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- Some popular ones include the Minkowski distance:

  $$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$

  where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance:

  $$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

Similarity and Dissimilarity Between Objects (Cont.)
- If q = 2, d is the Euclidean distance:

  $$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

- Properties
  - $d(i, j) \ge 0$
  - $d(i, i) = 0$
  - $d(i, j) = d(j, i)$
  - $d(i, j) \le d(i, k) + d(k, j)$
- One can also use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.

Euclidean Distance
Source: S. Ranka

Similarity and Dissimilarity Between Objects (Cont.)
- Determine similarity between two objects.
- Definition: the similarity between two tuples $t_i$ and $t_j$, $sim(t_i, t_j)$, in a database D is a mapping from $D \times D$ to the range [0, 1]. Thus, $sim(t_i, t_j) \in [0, 1]$.
- Similarity characteristics:
  - $\forall t_i \in D,\ sim(t_i, t_i) = 1$
  - $\forall t_i, t_j \in D,\ sim(t_i, t_j) = 0$ if $t_i$ and $t_j$ are not alike at all
  - $\forall t_i, t_j, t_k \in D,\ sim(t_i, t_j) < sim(t_i, t_k)$ if $t_i$ is more like $t_k$ than it is like $t_j$
Source: Dunham

Similarity and Dissimilarity Between Objects (Cont.)
With $t_i = \langle t_{i1}, \dots, t_{ik} \rangle$ and $t_j = \langle t_{j1}, \dots, t_{jk} \rangle$:
- Dice:

  $$sim(t_i, t_j) = \frac{2 \sum_{h=1}^{k} t_{ih} t_{jh}}{\sum_{h=1}^{k} t_{ih}^2 + \sum_{h=1}^{k} t_{jh}^2}$$

- Jaccard:

  $$sim(t_i, t_j) = \frac{\sum_{h=1}^{k} t_{ih} t_{jh}}{\sum_{h=1}^{k} t_{ih}^2 + \sum_{h=1}^{k} t_{jh}^2 - \sum_{h=1}^{k} t_{ih} t_{jh}}$$

- Cosine:

  $$sim(t_i, t_j) = \frac{\sum_{h=1}^{k} t_{ih} t_{jh}}{\sqrt{\sum_{h=1}^{k} t_{ih}^2 \, \sum_{h=1}^{k} t_{jh}^2}}$$

- Overlap:

  $$sim(t_i, t_j) = \frac{\sum_{h=1}^{k} t_{ih} t_{jh}}{\min\left(\sum_{h=1}^{k} t_{ih}^2,\ \sum_{h=1}^{k} t_{jh}^2\right)}$$

Similarity and Dissimilarity Between Objects (Cont.)
- Measure dissimilarity between objects
  - Euclidean ($L_2$):

    $$dis(t_i, t_j) = \sqrt{\sum_{h=1}^{k} (t_{ih} - t_{jh})^2}$$

  - Manhattan ($L_1$):

    $$dis(t_i, t_j) = \sum_{h=1}^{k} |t_{ih} - t_{jh}|$$

Binary Variables
- How can we compute the dissimilarity between objects described by either symmetric or asymmetric binary variables? A binary variable has only two states, 0 and 1.
- Symmetric: both states are equally valuable and carry the same weight.
  - Example: gender, with states male and female
- Asymmetric: the outcome states are not equally important, such as the positive and negative outcomes of a disease test.
  - Example:
    - HIV positive: represented by 1 (the rarer outcome)
    - HIV negative: represented by 0

Binary Variables
- A contingency table for binary data

                    Object j
                    1       0       sum
  Object i    1     a       b       a + b
              0     c       d       c + d
              sum   a + c   b + d   p

- Simple matching coefficient (invariant, if the binary variable is symmetric):

  $$d(i, j) = \frac{b + c}{a + b + c + d}$$

- Jaccard coefficient (noninvariant, if the binary variable is asymmetric):

  $$d(i, j) = \frac{b + c}{a + b + c}$$

Dissimilarity between Binary Variables
- Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

- Gender is a symmetric attribute.
- The remaining attributes are asymmetric binary.
- Let the values Y and P be set to 1, and the value N be set to 0.
- The dissimilarity is computed over the asymmetric attributes only, using the Jaccard coefficient:

  $$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

  $$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

  $$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$

Nominal Variables
- A nominal variable is a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
- Method 1: simple matching
  - m: number of matches, p: total number of variables

    $$d(i, j) = \frac{p - m}{p}$$

- Method 2: use a large number of binary variables
  - create a new binary variable for each of the M nominal states

Ordinal Variables
- Order is important, e.g., rank.
- An ordinal variable can be discrete or continuous.
  - A discrete ordinal variable resembles a nominal variable, except that the M states of the ordinal value are ordered in a meaningful sequence (e.g.
professional ranks: Assistant, Associate, Full professor).
  - A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of values is essential but their actual size is not (e.g. the relative ranking in a particular sport: gold, silver, and bronze).

Ordinal Variables
- They can be treated like interval-scaled variables.
- Suppose f is a variable from a set of ordinal variables describing n objects.
  - The value of f for the i-th object is $x_{if}$.
  - f has $M_f$ ordered states $1, \dots, M_f$.
- Replace each $x_{if}$ by its corresponding rank $r_{if} \in \{1, \dots, M_f\}$.
- Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

  $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

- Compute the dissimilarity using methods for interval-scaled variables.

Ratio-Scaled Variables
- Ratio-scaled variable: makes a positive measurement on a nonlinear scale, approximately an exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$.
- Methods:
  - Treat them like interval-scaled variables. This is not a good choice, because the scale is likely to be distorted.
  - Apply a logarithmic transformation $y_{if} = \log(x_{if})$.
  - Treat them as continuous ordinal data and treat their rank as interval-scaled.

Variables of Mixed Types
- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
- One may use a weighted formula to combine their effects:

  $$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

  - f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  - f is interval-based: use the normalized distance
  - f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled

Chapter 8. Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

Major Clustering Approaches (Han)
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model

Major Clustering Approaches (Dunham)
- Clustering
  - Hierarchical
    - Agglomerative
    - Divisive
  - Partitional
  - Categorical
  - Large DB
    - Sampling
    - Compression
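
The dissimilarity measures defined earlier in this chapter can be sketched in a few lines of Python. This is a minimal illustration, not code from the textbook: the function names `standardize`, `minkowski` and `asymmetric_binary_dissimilarity` are hypothetical, and the 0/1 encoding of Jack, Mary and Jim assumes the slide's convention (Y and P map to 1, N maps to 0) applied to the asymmetric attributes only (Fever, Cough, Test-1 to Test-4).

```python
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """z-scores for one variable f, using the mean absolute deviation s_f."""
    n = len(column)
    m_f = sum(column) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in column) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical, so s_f is zero")
    return [(x - m_f) / s_f for x in column]


def minkowski(x: Sequence[float], y: Sequence[float], q: int = 2) -> float:
    """Minkowski distance; q = 1 is Manhattan, q = 2 is Euclidean."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def asymmetric_binary_dissimilarity(i: Sequence[int], j: Sequence[int]) -> float:
    """Jaccard-based d(i, j) = (b + c) / (a + b + c); 0-0 matches are ignored."""
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    if a + b + c == 0:
        raise ValueError("no attribute is 1 in either object")
    return (b + c) / (a + b + c)


# Assumed encoding of the example table: Fever, Cough, Test-1 .. Test-4.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```

The three printed values match the worked example for the binary variables. In practice each interval-scaled column would be passed through `standardize` before calling `minkowski`, so that the choice of measurement unit does not dominate the distance.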
