Clustering II

Finite Mixtures
• Model data using a mixture of distributions
  – Each distribution represents one cluster
  – Each distribution gives the probabilities of attribute values in that cluster
• Finite mixtures: a finite number of clusters
• The individual distributions are usually normal
• The distributions are combined using cluster weights
• Each normal distribution can be described by μ (mean) and σ (standard deviation)
• For a single attribute with two clusters
  – μA, σA for cluster A and μB, σB for cluster B
  – Attribute values are obtained by drawing from cluster A with probability PA and from cluster B with probability PB
  – Five parameters, μA, σA, μB, σB and PA (because PA + PB = 1), describe the attribute value distribution

EM Algorithm
• EM = Expectation–Maximization
  – Generalizes k-means to a probabilistic setting
• Input: a collection of instances and the number of clusters, k
• Output: the probabilities with which each instance belongs to each of the k clusters
• Method:
  – Start by guessing values for all the parameters of the k clusters (similar to guessing centroids in k-means)
  – Repeat
    • E ('Expectation') step: calculate the cluster probabilities for each instance
    • M ('Maximization') step: estimate the distribution parameters from those cluster probabilities
      – Store the cluster probabilities as instance weights
      – Estimate the parameters from the weighted instances
  – Until the parameters converge, i.e. the model's fit to the input data stops improving
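To make the E and M steps concrete, here is a minimal Python sketch for the single-attribute, two-cluster case described above. It is an illustration only: the function names, the initialisation and the fixed iteration count are assumptions for this example, not part of any tool mentioned in these slides; a real implementation would stop when the data's log-likelihood stops improving.

```python
import numpy as np

def gaussian(x, mu, sd):
    """Normal density with mean mu and standard deviation sd."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def em_two_gaussians(x, n_iter=50):
    """Fit the five parameters muA, sdA, muB, sdB, pA of a two-cluster
    mixture for a single numeric attribute using EM (pB = 1 - pA)."""
    x = np.asarray(x, dtype=float)
    # Initial guesses, analogous to guessing centroids in k-means
    mu_a, mu_b = x.min(), x.max()
    sd_a = sd_b = x.std() + 1e-9
    p_a = 0.5
    for _ in range(n_iter):
        # E step: probability that each instance belongs to cluster A
        dens_a = p_a * gaussian(x, mu_a, sd_a)
        dens_b = (1 - p_a) * gaussian(x, mu_b, sd_b)
        w = dens_a / (dens_a + dens_b)   # cluster probabilities used as instance weights
        # M step: re-estimate the parameters from the weighted instances
        mu_a = np.average(x, weights=w)
        mu_b = np.average(x, weights=1 - w)
        sd_a = np.sqrt(np.average((x - mu_a) ** 2, weights=w)) + 1e-9
        sd_b = np.sqrt(np.average((x - mu_b) ** 2, weights=1 - w)) + 1e-9
        p_a = w.mean()
    return mu_a, sd_a, mu_b, sd_b, p_a

# Example: data drawn from two overlapping normal clusters
rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(5, 1, 300), rng.normal(9, 2, 200)])
print(em_two_gaussians(sample))
```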
Incremental Clustering (Cobweb/Classit)
• Input: a collection of instances
• Output: a hierarchy of clusters
• Method:
  – Start with an empty root node of the tree
  – Add instances one by one
  – If any of the existing leaves is a good 'host' for the incoming instance, form a cluster with it
    • A good host has high category utility (next slide)
  – If required, restructure the tree
• Cobweb: for nominal attributes
• Classit: for numerical attributes

Category Utility
• Category utility: CU(C1, C2, …, Ck) = {Σl P[Cl] Σi Σj (P[ai = vij | Cl]² − P[ai = vij]²)} / k
• It measures the advantage gained in predicting the attribute values of the instances in a cluster
  – If knowing which cluster an instance belongs to does not help in predicting the values of its attributes, the cluster is not worth forming
• The inner difference of squared probabilities, P[ai = vij | Cl]² − P[ai = vij]², captures this gain for attribute ai taking value vij
• The denominator k expresses the gain per cluster
• (A small computational sketch follows the Cobweb trace below)

Weather Data with ID

  ID  Outlook   Temperature  Humidity  Windy  Play
  a   sunny     hot          high      false  no
  b   sunny     hot          high      true   no
  c   overcast  hot          high      false  yes
  d   rainy     mild         high      false  yes
  e   rainy     cool         normal    false  yes
  f   rainy     cool         normal    true   no
  g   overcast  cool         normal    true   yes
  h   sunny     mild         high      false  no
  i   sunny     cool         normal    false  yes
  j   rainy     mild         normal    false  yes
  k   sunny     mild         normal    true   yes
  l   overcast  mild         high      true   yes
  m   overcast  hot          normal    false  yes
  n   rainy     mild         high      true   no

• This is artificial data, so natural clusters cannot be expected (in particular, two clean clusters of yeses and nos are not possible)

Trace of Cobweb
• (The tree diagrams from the slides are omitted; the trace adds instances a–n one at a time)
• There is no good host among the first five instances (a–e), so each becomes its own leaf under the root
• When f arrives, e is the best host; the category utility of an e&f cluster is high because e and f are similar, so they form a cluster
• When g arrives: at the root, the e&f cluster is the best host; within e&f there is no suitable host, so no new cluster is formed and g is simply added to the e&f cluster (f and g are similar)
• When h arrives: at the root, a is the best host and d is the runner-up; before h is inserted, the runner-up d is evaluated, and because the category utility of an a&d cluster is high, d is merged with a to form a new cluster; within a&d there is no suitable host, so no new cluster is formed and h is added to the a&d cluster
• The completed tree covers all fourteen instances (a–n)
• For large data sets, growing the tree down to individual instances can lead to overfitting; a similarity threshold called the cutoff is used to suppress growth
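As an illustration of the category utility formula on the weather data above, here is a short Python sketch (this is not Cobweb itself; the function and variable names are invented for this example). Only instances a–f are listed; the remaining rows of the table follow the same pattern.

```python
from collections import Counter

# First six rows of the weather table: (Outlook, Temperature, Humidity, Windy, Play)
DATA = {
    'a': ('sunny',    'hot',  'high',   'false', 'no'),
    'b': ('sunny',    'hot',  'high',   'true',  'no'),
    'c': ('overcast', 'hot',  'high',   'false', 'yes'),
    'd': ('rainy',    'mild', 'high',   'false', 'yes'),
    'e': ('rainy',    'cool', 'normal', 'false', 'yes'),
    'f': ('rainy',    'cool', 'normal', 'true',  'no'),
}

def category_utility(clusters, data):
    """CU(C1..Ck) = {sum_l P[Cl] sum_i sum_j (P[ai=vij|Cl]^2 - P[ai=vij]^2)} / k."""
    ids = [x for cluster in clusters for x in cluster]
    n = len(ids)
    n_attrs = len(next(iter(data.values())))
    # Baseline term: sum_i sum_j P[ai=vij]^2 over the whole data set
    base = 0.0
    for i in range(n_attrs):
        counts = Counter(data[x][i] for x in ids)
        base += sum((c / n) ** 2 for c in counts.values())
    cu = 0.0
    for cluster in clusters:
        p_c = len(cluster) / n
        inner = 0.0                       # sum_i sum_j P[ai=vij|Cl]^2
        for i in range(n_attrs):
            counts = Counter(data[x][i] for x in cluster)
            inner += sum((c / len(cluster)) ** 2 for c in counts.values())
        cu += p_c * (inner - base)
    return cu / len(clusters)

# Clustering the similar instances e and f scores higher than pairing the
# dissimilar instances a and e, which is why Cobweb hosts f with e.
print(category_utility([['e', 'f'], ['a'], ['b'], ['c'], ['d']], DATA))   # ~0.46
print(category_utility([['a', 'e'], ['b'], ['c'], ['d'], ['f']], DATA))   # ~0.39
```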
Hierarchical Agglomerative Clustering
• Input: a collection of instances
• Output: a hierarchy of clusters
• Method:
  – Start with the individual instances as clusters
  – Repeat
    • Merge the 'closest' two clusters
  – Until only one cluster remains
• Ward's method: the closeness (proximity) of two clusters is defined as the increase in squared error that results when the two clusters are merged
• The squared-error measure is used only for the local decision of which clusters to merge
  – There is no global optimization

HCE
• A visual knowledge discovery tool for analysing and understanding multi-dimensional (more than 3D) data
• Offers multiple, coordinated views of
  – the input data and
  – the clustered input data
• Many other similar tools are a patchwork of statistics and graphics
• HCE follows two fundamental statistical principles of exploratory data analysis:
  – Examine each dimension first, then find relationships among dimensions
  – Try graphical displays first, then find numerical summaries

GRID Principles
• GRID: graphics, ranking and interaction for discovery
• Two principles
  – Study 1D, study 2D, then find features
  – Ranking guides insight, statistics confirm
• These principles help users organize their knowledge discovery process
• Because of GRID, HCE is more than R plus visualization
• GRID can be used to derive scripts that organize exploratory data analysis in R (or a similar statistics package)

Rank-by-Feature Framework
• A user interface framework based on the GRID principles
• The framework
  – uses interactive information visualization techniques combined with
  – statistical methods and data mining algorithms
  – to let users examine the input data in an orderly way
• HCE implements the rank-by-feature framework
  – This means HCE uses existing statistical and data mining methods to analyse the input data and
  – communicates those results using interactive information visualization techniques

Multiple Views in HCE
• Dendrogram
• Colour mosaic
• 1D histograms
• 2D scatterplots
• And more

Dendrogram Display
• The results of HAC are shown visually using a dendrogram
• A dendrogram is a tree
  – with data items at the terminal (leaf) nodes
  – in which distance from the root node represents the similarity among leaf nodes
• Two visual controls
  – The minimum similarity bar lets users adjust the number of clusters
  – The detail cut-off bar lets users reduce clutter

Colour Mosaic
• The input data is shown using this view
• It is a colour-coded visual display of tabular data
• Each cell of the table is painted in a colour that reflects the cell's value
• Two variations of the layout
  – A layout similar to the original table
  – A transpose of the original layout
• HCE uses the transposed layout because data sets usually have more rows than columns
• A colour mapping control is provided

1D Histogram Ordering
• This data view is part of the rank-by-feature framework
• The data in one column (variable) is displayed as a histogram plus a box plot
  – The histogram shows the scale and skewness
  – The box plot shows the centre and spread of the distribution
• For the entire data set, many such views are possible
• By studying individual variables in detail, users can select the variables for other visualizations

2D Scatterplot Ordering
• This data view is also part of the rank-by-feature framework
• Three categories of 2D presentation are possible
  – Axes obtained from principal component analysis
    • Linear or non-linear combinations of the original variables
  – Axes taken directly from the original variables
  – Parallel coordinates
• HCE uses the second option: plotting pairs of the original variables
• Both the 1D and 2D plots can be sorted according to user-selected criteria such as the number of outliers
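HCE itself cannot be reproduced in a few lines, but the ordering idea behind the rank-by-feature framework can be sketched in Python: compute one ranking criterion for every pair of variables and sort, so the analyst examines the highest-ranked scatterplots first. The criterion used here (absolute Pearson correlation), the data and the variable names are illustrative assumptions; the slides mention other criteria such as the number of outliers.

```python
from itertools import combinations
import numpy as np

def rank_pairs(table, names):
    """Rank every pair of variables by one criterion
    (here: absolute Pearson correlation), highest first."""
    scores = []
    for i, j in combinations(range(table.shape[1]), 2):
        r = np.corrcoef(table[:, i], table[:, j])[0, 1]
        scores.append((abs(r), names[i], names[j]))
    return sorted(scores, reverse=True)

# Toy data: 100 instances, 4 numeric variables; v1 is made to track v0
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
data[:, 1] = 0.8 * data[:, 0] + 0.2 * rng.normal(size=100)

for score, a, b in rank_pairs(data, ['v0', 'v1', 'v2', 'v3']):
    print(f'{a} vs {b}: |r| = {score:.2f}')   # inspect the top-ranked pairs first
```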