Clustering II

Finite Mixtures
• Model the data using a mixture of distributions
  – Each distribution represents one cluster
  – Each distribution gives the probabilities of the attribute values in that cluster
• Finite mixtures: a finite number of clusters
• The individual distributions are usually normal
• The distributions are combined using cluster weights
• Each normal distribution can be described in terms of μ (mean) and σ (standard deviation)
• For a single attribute with two clusters:
  – μA, σA for cluster A and μB, σB for cluster B
  – An attribute value is drawn from cluster A with probability pA and from cluster B with probability pB
  – Five parameters, μA, σA, μB, σB and pA, describe the attribute value distribution (pB is determined by pA + pB = 1)

EM Algorithm
• EM = Expectation-Maximization
  – Generalizes k-means to a probabilistic setting
• Input: a collection of instances and the number of clusters, k
• Output: the probabilities with which each instance belongs to each of the k clusters
• Method (a runnable sketch appears after the Cobweb trace below):
  – Start by guessing values for all the parameters of the k clusters (similar to guessing the centroids in k-means)
  – Repeat
    • E 'Expectation' step: calculate the cluster probabilities for each instance
    • M 'Maximization' step: estimate the distribution parameters from the cluster probabilities
      – Store the cluster probabilities as instance weights
      – Estimate the parameters from the weighted instances
  – Until the distribution parameters predict the input data well, i.e. the likelihood of the data stops improving

Incremental Clustering (Cobweb/Classit)
• Input: a collection of instances
• Output: a hierarchy of clusters
• Method:
  – Start with an empty root node of the tree
  – Add the instances one by one
  – If any of the existing leaves is a good 'host' for the incoming instance, form a cluster with it
    • A good host has high category utility (next slide)
  – If required, restructure the tree
• Cobweb: for nominal attributes
• Classit: for numeric attributes

Category Utility
• Category utility:
  CU(C1, C2, …, Ck) = (1/k) ∑l P[Cl] ∑i ∑j (P[ai=vij|Cl]² − P[ai=vij]²)
• It measures the advantage a clustering gives in predicting the attribute values of the instances in a cluster
  – If knowing which cluster an instance belongs to does not help in predicting its attribute values, the cluster is not worth forming
• The inner difference of squared probabilities, P[ai=vij|Cl]² − P[ai=vij]², captures this predictive gain
• The division by k expresses the gain per cluster (a sketch of the computation appears after the trace below)

Weather Data with ID

ID  Outlook   Temperature  Humidity  Windy  Play
a   sunny     hot          high      false  no
b   sunny     hot          high      true   no
c   overcast  hot          high      false  yes
d   rainy     mild         high      false  yes
e   rainy     cool         normal    false  yes
f   rainy     cool         normal    true   no
g   overcast  cool         normal    true   yes
h   sunny     mild         high      false  no
i   sunny     cool         normal    false  yes
j   rainy     mild         normal    false  yes
k   sunny     mild         normal    true   yes
l   overcast  mild         high      true   yes
m   overcast  hot          normal    false  yes
n   rainy     mild         high      true   no

This is artificial data, so it is not possible to find natural clusters (in particular, two clean clusters of yeses and nos are not possible).

Trace of Cobweb
• Steps 1-2: the first five instances (a:no, b:no, c:yes, d:yes, e:yes) become leaves directly under the root, because no existing leaf is a good host for any of them.
• Step 3: when f arrives, e is the best host; the category utility of an e&f cluster is high because e and f are similar, so e and f are merged into a new cluster.

Trace of Cobweb (contd)
• Step 4: when g arrives, the e&f cluster is the best host at the root. Within e&f there is no good host, so no new cluster is formed; g is simply added to the e&f cluster (f and g are similar).
• Step 5: when h arrives, a is the best host at the root and d is the runner-up. Before h is inserted, the runner-up d is evaluated: the category utility of an a&d cluster is high, so d is merged with a to form a new cluster. Within a&d there is no good host, so no new cluster is formed; h is added to the a&d cluster.

Trace of Cobweb (contd)
• After all fourteen instances have been added, each occupies its own leaf of the final hierarchy (a:no, b:no, c:yes, d:yes, e:yes, f:no, g:yes, h:no, i:yes, j:yes, k:yes, l:yes, m:yes, n:no).
• For large data sets, growing the tree down to individual instances can lead to overfitting, so a similarity threshold called the cutoff is used to suppress growth.
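The two-cluster, one-attribute mixture model and the E/M steps described above can be made concrete with a short sketch. This is a minimal illustration, not any particular tool's implementation: the synthetic data, the initial guesses and the fixed iteration count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-attribute data: an assumed example drawn from two normal clusters.
x = np.concatenate([rng.normal(5.0, 1.0, 300), rng.normal(9.0, 1.5, 200)])

def normal_pdf(v, mu, sd):
    return np.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Initial guesses for the five parameters muA, sdA, muB, sdB, pA (pB = 1 - pA).
mu = np.array([4.0, 10.0])
sd = np.array([1.0, 2.0])
p = np.array([0.5, 0.5])          # cluster weights

for _ in range(100):
    # E step: the cluster probability of each instance (the instance weights).
    dens = p * np.column_stack([normal_pdf(x, mu[k], sd[k]) for k in range(2)])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M step: re-estimate the parameters from the weighted instances.
    w = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / w
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / w)
    p = w / len(x)

print(mu, sd, p)   # approaches the generating parameters (5, 9), (1, 1.5), (0.6, 0.4)
```

In practice the loop would be stopped when the log-likelihood of the data stops improving rather than after a fixed number of iterations.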
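Likewise, the category utility formula that drives the host and merge decisions in the trace above can be sketched directly. The instance encoding below (dicts of nominal attribute values, with the play class omitted) is an assumed convenience, not Cobweb's internal representation.

```python
from collections import Counter

def category_utility(clusters):
    """CU(C1,...,Ck) = (1/k) * sum_l P[Cl] * sum_i sum_j (P[ai=vij|Cl]^2 - P[ai=vij]^2)."""
    instances = [inst for cluster in clusters for inst in cluster]
    n, attrs = len(instances), instances[0].keys()

    # Unconditional term: sum_i sum_j P[ai = vij]^2 over the whole data set.
    base = sum((cnt / n) ** 2
               for a in attrs
               for cnt in Counter(inst[a] for inst in instances).values())

    cu = 0.0
    for cluster in clusters:
        # Conditional term for this cluster: sum_i sum_j P[ai = vij | Cl]^2.
        cond = sum((cnt / len(cluster)) ** 2
                   for a in attrs
                   for cnt in Counter(inst[a] for inst in cluster).values())
        cu += (len(cluster) / n) * (cond - base)
    return cu / len(clusters)

# Weather instances d, e and f (attribute values only).
d = dict(outlook="rainy", temperature="mild", humidity="high", windy="false")
e = dict(outlook="rainy", temperature="cool", humidity="normal", windy="false")
f = dict(outlook="rainy", temperature="cool", normal="normal", windy="true")
f["humidity"] = f.pop("normal")   # humidity = normal

# As in the trace: clustering the similar e and f together scores higher
# than pairing e with d.
print(category_utility([[e, f], [d]]))   # 0.50
print(category_utility([[e, d], [f]]))   # 0.33...
```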
Hierarchical Agglomerative Clustering
• Input: a collection of instances
• Output: a hierarchy of clusters
• Method (a SciPy sketch appears below, after the view descriptions):
  – Start with the individual instances as clusters
  – Repeat
    • Merge the two 'closest' clusters
  – Until only one cluster remains
• Ward's method: the closeness (proximity) of two clusters is defined as the increase in squared error that results when the two clusters are merged
• The squared-error measure is used only for the local decision of which clusters to merge
  – There is no global optimization

HCE
• A visual knowledge-discovery tool for analysing and understanding multi-dimensional (>3D) data
• Offers multiple, coordinated views of
  – the input data and
  – the clustered input data
• Many other, similar tools offer only a patchwork of statistics and graphics
• HCE follows two fundamental statistical principles of exploratory data analysis:
  – Examine each dimension first, then find relationships among dimensions
  – Try graphical displays first, then find numerical summaries

GRID Principles
• GRID: graphics, ranking and interaction for discovery
• Two principles:
  – Study 1D, study 2D and find features
  – Ranking guides insight, statistics confirm
• These principles help users organize their knowledge-discovery process
• Because of GRID, HCE is more than "R plus visualization"
• GRID can also be used to derive scripts that organize exploratory data analysis in R (or a similar statistics package)

Rank-by-Feature Framework
• A user-interface framework based on the GRID principles
• The framework
  – uses interactive information-visualization techniques combined with
  – statistical methods and data-mining algorithms
  – to let users examine the input data in an orderly fashion
• HCE implements the rank-by-feature framework; that is,
  – HCE uses existing statistical and data-mining methods to analyse the input data and
  – communicates the results using interactive information-visualization techniques

Multiple Views in HCE
• Dendrogram
• Colour mosaic
• 1D histograms
• 2D scatterplots
• And more

Dendrogram Display
• The results of HAC are shown visually using a dendrogram
• A dendrogram is a tree
  – with the data items at the terminal (leaf) nodes
  – in which the distance from the root node represents the similarity among the leaf nodes
• Two visual controls:
  – the minimum similarity bar allows users to adjust the number of clusters
  – the detail cut-off bar allows users to reduce clutter

Colour Mosaic
• The view in which the input data are shown
• A colour-coded visual display of tabular data: each cell of the table is painted in a colour that reflects the cell's value
• Two layout variations:
  – the original layout, which mirrors the original table
  – a transpose of the original layout
• HCE uses the transposed layout because data sets usually have more rows than columns (a minimal sketch appears below)
• A colour mapping control lets users adjust how values are mapped to colours

1D Histogram Ordering
• This data view is part of the rank-by-feature framework
• The data belonging to one column (variable) are displayed as a histogram plus a box plot
  – the histogram shows the scale and skewness
  – the box plot shows the distribution's centre and spread
• Many such views are possible for the entire data set
• By studying the individual variables in detail, users can select the variables for other visualizations
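As promised under Hierarchical Agglomerative Clustering above, here is a minimal sketch using SciPy's stock implementation of Ward's method; the synthetic 2-D data and the cut at two clusters are assumptions. Cutting the tree with fcluster mirrors what the minimum similarity bar does in HCE's dendrogram display.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # two synthetic blobs
               rng.normal(3, 0.5, (20, 2))])

# Ward's method: each greedy merge picks the pair of clusters whose union
# causes the smallest increase in squared error; no global optimization.
Z = linkage(X, method="ward")

# Cutting the tree into a chosen number of flat clusters corresponds to
# moving the minimum similarity bar in the dendrogram display.
labels = fcluster(Z, t=2, criterion="maxclust")

dendrogram(Z)                 # leaves are the individual instances
plt.show()
```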
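The colour mosaic described above is, in essence, a heatmap of the data table. A minimal matplotlib sketch of the transposed layout, assuming a small random table (HCE's actual rendering and its colour mapping control are richer):

```python
import numpy as np
import matplotlib.pyplot as plt

table = np.random.default_rng(2).random((50, 4))   # 50 rows x 4 columns (assumed)

# Transposed layout: the table's rows run along the x-axis, since data sets
# usually have more rows than columns; colour encodes each cell's value.
plt.imshow(table.T, aspect="auto", cmap="viridis")
plt.colorbar(label="cell value")   # a stand-in for the colour mapping control
plt.show()
```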
2D Scatter Plot Ordering
• This data view is also part of the rank-by-feature framework
• Three categories of 2D presentation are possible:
  – axes obtained from principal component analysis (linear or non-linear combinations of the original variables)
  – axes taken directly from the original variables
  – parallel coordinates
• HCE uses the second option: plotting pairs of the original variables
• Both the 1D and the 2D views can be sorted according to user-selected criteria, such as the number of outliers (a minimal ranking sketch follows)
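The "ranking guides insight" idea behind the 1D and 2D orderings can be sketched as follows. The absolute Pearson correlation is used here as a stand-in for whichever ranking criterion the user selects (number of outliers, etc.), and the random data are an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.random((100, 5))                     # rows = instances, cols = variables
names = [f"v{i}" for i in range(data.shape[1])]

# Score every pair of original variables by the chosen criterion;
# |Pearson r| stands in for criteria such as the number of outliers.
corr = np.corrcoef(data, rowvar=False)
pairs = sorted(((abs(corr[i, j]), names[i], names[j])
                for i in range(len(names))
                for j in range(i + 1, len(names))), reverse=True)

# Ranking guides insight: examine the top-scoring scatterplots first.
for score, a, b in pairs[:3]:
    print(f"{a} vs {b}: |r| = {score:.2f}")
```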