Clustering: partitioning data into similar groupings.

A cluster is a grouping of "similar" items. Clustering is a process that partitions a set of objects into equivalence classes: the members of each class are in some way more "similar" to each other than to members of other classes. So
a. intra-class variance is low
b. inter-class variance is high

Applications of clustering:
1. Pattern recognition
2. Data mining
3. Image processing
4. Market research & econometric assessment
5. In WWW: document classification and search on a similar ontology space
6. Land use registry
7. Insurance
8. Urban and regional planning

Clustering = unsupervised learning with no notion of pre-defined classes (what they are, how many there are: no a priori knowledge).

Data preparation before data mining:
► Data to be mined is normally noisy, with many unwanted attributes, etc.
► Discretization of continuous data
► Data normalization to the range [-1 .. +1] or [0 .. 1]
► Data smoothing to reduce noise, removal of outliers, etc.
► Relevance analysis: feature selection to retain only the relevant features

Clustering is an unsupervised partition of given data into equivalence classes. Ultimately, clustering is equivalent to classification. A good clustering produces a partition with low within-group variance and high inter-group variance. The idea is that the centroid of a cluster (the geographical center of a cluster in spatial data) and the variance about that centroid may be sufficient in most cases to depict the data.

[Figure: many data reduced to few data]

Cluster point or centroid = exemplar
Variance = mushiness of the concept of the centroid
within-group = intra-cluster
between-groups = inter-cluster

The issue here is "similarity". How do we measure similarity? This is not easy to answer. Secondly, if there are "hidden" patterns, does the clustering scheme discover them?

Requirements of good clustering:
1. Insensitivity to order of input data
2. Capable of cluster identification in a single pass over the data
3. Works even in the presence of noise and outliers
4. Scalability
5. Low dependence on domain knowledge
6. Ability to deal with different types of input: numerical, ordinal, categorical, etc.
7. Discovery of clusters with arbitrary shape

e.g. Two points or patterns are similar if the distance between them is lower than some threshold:
$d(A, B) < \text{threshold} \Rightarrow$ A and B are in the same cluster.

A metric space is a set $S = \{x_i\}$ on which a generalized distance function of some sort can be defined, such that
$d(x_i, x_j) = 0$ if $x_i = x_j$
$d(x_i, x_j) = d(x_j, x_i) \ge 0$
$d(x_i, x_j) \le d(x_i, x_k) + d(x_k, x_j)$ (triangle inequality)

How do we measure the distance? It depends!

Euclidean distance (the 2-norm): $d_2(x_i, x_j) = \sqrt{\sum_k (x_{ik} - x_{jk})^2}$
Manhattan distance: $d(x_i, x_j) = \sum_k |x_{ik} - x_{jk}|$
Bounded distance: on any given metric space, a measure that never exceeds a threshold, e.g. $d(x_i, x_j) = \frac{D(x_i, x_j)}{1 + D(x_i, x_j)}$, where $D(x_i, x_j)$ is a measure of distance between two points obeying the usual distance properties.
Maximum distance (the $\infty$-norm): $d(x_i, x_j) = \max_k |x_{ik} - x_{jk}|$
Minimum distance: $d(x_i, x_j) = \min_k |x_{ik} - x_{jk}|$
Minkowski distance (the p-norm): $d_p(x_i, x_j) = \left( \sum_k |x_{ik} - x_{jk}|^p \right)^{1/p}$

String distance:
a. Hamming: the distance between two strings of equal length is the total number of positions at which they differ.
b. Levenshtein distance: given a source string s and a target string t, the minimum number of insertions, deletions and substitutions required to transform s into t. Useful in:
   spell checking
   speech recognition
   DNA analysis
   plagiarism detection

The larger the distance between two data items, the less similar they are.

Also, we need a way to measure the distance between two clusters. A number of possibilities exist.

Single linkage distance (nearest neighbor): the minimum distance between points in the two clusters,
$D(C_1, C_2) = \min_{x_1 \in C_1,\, x_2 \in C_2} d(x_1, x_2)$
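Before the remaining cluster-to-cluster measures, here is a minimal Python sketch of the point-to-point and string distances listed so far, using only the standard library. The function names are chosen here for illustration and are not from the lecture. The bounded distance assumes the form D / (1 + D) with the Euclidean distance as the underlying D.

```python
import math


def minkowski(x, y, p):
    """p-norm distance: (sum_k |x_k - y_k|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    """2-norm distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    """Infinity-norm distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def bounded(x, y, base=euclidean):
    """Bounded distance D / (1 + D); never reaches 1.
    Assumption: `base` (Euclidean by default) plays the role of D."""
    D = base(x, y)
    return D / (1.0 + D)


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    turning s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For instance, levenshtein("kitten", "sitting") evaluates to 3 (two substitutions and one insertion), and hamming("karolin", "kathrin") also evaluates to 3.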
Complete linkage distance (farthest neighbors):
$D(C_1, C_2) = \max_{x_1 \in C_1,\, x_2 \in C_2} d(x_1, x_2)$

Average linkage distance (a compromise): the average distance among all pairs of points,
$D(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x_1 \in C_1} \sum_{x_2 \in C_2} d(x_1, x_2)$

Centroid linkage (easiest):
$D(C_1, C_2) = d(\bar{x}_1, \bar{x}_2)$, where $\bar{x}_1$ and $\bar{x}_2$ are the centroids of $C_1$ and $C_2$.

■ Single linkage produces long and stringy clusters
■ Complete linkage produces compact and small clusters

What should be the criteria for good clustering? Again, it depends. Classical criterion-set:
1. The number of distinct clusters should be as low as possible and yet accommodate a sense of classification.
2. The distance between two distinct clusters should be larger than some threshold. How big, how small?
3. A decent clustering scheme will yield the minimum total variance of the clusters over all possible schemes.

Not all of these are independent. Finding the ideal clustering is an NP-complete problem of the bin-packing kind, so one needs a workable heuristic to approach it.

Classification = Clustering = Learning
■ A set S is partitioned into a number of equivalence classes.
Supervised learning ⇒ classification (normally)
Unsupervised learning ⇒ clustering
Both clustering and classification induce a partition on the given data set.

The clustering schemes:
a. Partitioning algorithms: construct various partitions and then evaluate them by various criteria. If clusters are too close, perhaps they should be coalesced. If they are too voluminous (high variance), perhaps they should be partitioned further.
b. Hierarchical algorithms: agglomerative and divisive approaches
c. Density-based algorithms: based on connectivity and density functions
d. Grid-based algorithms

[Figure: Clustering Scheme: Bottom-Up, Agglomerative; Top-Down, Hierarchical]

In general, a cluster scheme is agglomerative if new clusters are formed one at a time from the existing cluster set. Let $P^n = \{v_1, v_2, \ldots, v_n\}$ be an n-cluster partition. We obtain $P^k$ from $P^{k+1}$ as follows:
a. Choose $C_h, C_l \in P^{k+1}$.
b. $P^{k+1} \to P^k$: $C_h$ and $C_l$ are erased from $P^{k+1}$ and the new cluster $C = C_h \cup C_l$ is inserted into the configuration.

The selection of $C_h$ and $C_l$ is governed by some objective function. Some choices:
a. minimize the variance increase $\mathrm{var}(C_h \cup C_l) - \mathrm{var}(C_h) - \mathrm{var}(C_l)$
b. minimize $F(C_h, C_l) = \mathrm{diameter}(C_h \cup C_l)$, where $\mathrm{diameter}(C) = \max_{x, y \in C} \min_{P(x,y)} \mathrm{length}(P(x,y))$ and $P(x,y)$ is a path from x to y.

Clustering or classification can be classified as
► supervised learning (e.g. as in neural nets)
► unsupervised learning (e.g. as in Isodata, ...)

Example: Isodata. Basic algorithm.
1. Given n patterns (points, objects, signatures, ...) $x_k \in X$.
2. Choose randomly any two points $x_k$ and $x_l$ such that the distance $d(x_k, x_l)$ is at least the minimum inter-cluster distance. These two points seed the first two clusters.
3. Take one of the remaining points $x_m \in X$. If $d(x_m, C_i) < d(x_m, C_j)$, assign $x_m$ to $C_i$, provided $d(x_m, C_i)$ does not exceed the maximum cluster diameter.
4. If $d(x_m, C_i) < d(x_m, C_j)$ and $d(x_m, C_i)$ exceeds the maximum cluster diameter, assign $x_m$ to its own cluster $C_m$.
5. For the remaining points, go to step 3 until no point remains.

Pros and cons of Isodata

Pros.
1. Clustering is not geographically biased to any particular region of the data distribution.
2. A very efficient way of finding inherent clusters in a set.

Cons.
1. The clustering obtained depends on the number of iterations required.
2. One doesn't know a priori the number of distinct clusters.
3. Insensitive to variance/covariance.
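
The basic algorithm above (steps 1 to 5) can be sketched in Python as follows. This is an illustrative reading, not a definitive implementation. The names basic_clustering, min_seed_dist and max_diameter are hypothetical. The distance d(x, C) from a point to a cluster is assumed here to be the Euclidean distance to the cluster centroid, and the "maximum cluster diameter" is treated as an upper bound on that distance. The lecture does not say what to do when no two points are far enough apart to act as seeds, so the sketch simply raises an error in that case.

```python
import math
import random
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def basic_clustering(points, min_seed_dist, max_diameter, seed=0):
    """Single-pass clustering following steps 1 to 5 of the basic algorithm.

    Assumptions (not from the lecture): d(x, C) is the distance from x
    to the centroid of C; both thresholds are supplied by the caller.
    """
    rng = random.Random(seed)
    pts = list(points)
    rng.shuffle(pts)  # random visiting order

    # Step 2: first pair (in shuffled order) at least min_seed_dist apart.
    pair = next(
        ((i, j) for i, j in combinations(range(len(pts)), 2)
         if euclidean(pts[i], pts[j]) >= min_seed_dist),
        None,
    )
    if pair is None:
        raise ValueError("no two points are at least min_seed_dist apart")
    clusters = [[pts[pair[0]]], [pts[pair[1]]]]

    # Steps 3 to 5: place every remaining point.
    for k, x in enumerate(pts):
        if k in pair:
            continue  # seed points are already placed
        nearest = min(clusters, key=lambda c: euclidean(x, centroid(c)))
        if euclidean(x, centroid(nearest)) <= max_diameter:
            nearest.append(x)     # step 3: join the nearest cluster
        else:
            clusters.append([x])  # step 4: start a new cluster
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
clusters = basic_clustering(points, min_seed_dist=5.0, max_diameter=3.0)
```

With these well-separated sample points the three groups come out as three clusters whatever the shuffle order. On less cleanly separated data the result can depend on the visiting order and on the two thresholds, and the number of clusters is only known after the pass.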