Summer Course: Data Mining
Feature Selection, Dimensionality Reduction, and Clustering
Presenter: Georgi Nalbantov, August 2009

Outline
- Feature Selection: filtering approach, wrapper approach, embedded methods
- Dimensionality Reduction: Principal Components Analysis (PCA), nonlinear PCA (Kernel PCA, Cat-PCA), Multi-Dimensional Scaling (MDS), homogeneity analysis
- Clustering: density estimation and clustering, K-means clustering, hierarchical clustering, clustering with Support Vector Machines (SVMs)

Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process
[Figure: the KDD process, after U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995).]

Feature Selection
In the presence of millions of features/attributes/inputs/variables, select the most relevant ones. Advantages: build better, faster, and easier-to-understand learning machines.
[Figure: the n x m data matrix X is reduced to its m' most relevant feature columns.]

Why selecting features one at a time can fail: suppose the goal is to select the two best features individually. Any reasonable objective J will rank the features J(x1) > J(x2) = J(x3) > J(x4), so the chosen pair is {x1, x2} or {x1, x3}. However, x4 is the only feature that provides information complementary to x1.

Three families of methods:
- Filtering approach: ranks features or feature subsets independently of the predictor (classifier). Univariate methods consider one variable at a time; multivariate methods consider more than one variable at a time.
- Wrapper approach: uses a classifier to assess (many) features or feature subsets.
- Embedded methods: use a classifier to build a (single) model with a subset of features that are selected internally.

Feature Selection: univariate filtering approach
Issue: determine the relevance of a given single feature Xi.
[Figure: class-conditional densities P(Xi | Y) for the two classes.]
Under independence, P(X, Y) = P(X) P(Y). A measure of dependence is the mutual information:
MI(X, Y) = ∫∫ P(X, Y) log [ P(X, Y) / ( P(X) P(Y) ) ] dX dY = KL( P(X, Y) || P(X) P(Y) ).
Note that correlation measures only linear dependence; under a jointly Gaussian distribution the two criteria agree, since then MI(X, Y) = -1/2 log(1 - rho^2), where rho is the correlation coefficient.

Criteria for measuring the dependence between a feature and the class label:
[Figure: class-conditional densities P(Xi | Y = -1) and P(Xi | Y = 1) with means m-, m+ and standard deviations s-, s+.]
A feature Xi is irrelevant if P(Xi | Y = 1) = P(Xi | Y = -1), and relevant if P(Xi | Y = 1) ≠ P(Xi | Y = -1).

T-test: is the distance between the class means significant?
- Assume normally distributed classes with equal variance sigma^2, unknown and estimated from the data as the pooled within-class variance s_within^2.
- Null hypothesis H0: mu+ = mu-.
- T statistic: if H0 is true, then t = (mu+ - mu-) / ( s_within * sqrt(1/m+ + 1/m-) ) follows a Student distribution with m+ + m- - 2 degrees of freedom, where mu+, mu- are the class means and m+, m- the class sample sizes.

Feature Selection: multivariate filtering approach
[Figure: examples where features that are useless individually become useful together (Guyon and Elisseeff, JMLR 2004; Springer 2006).]

Feature Selection: search strategies (Kohavi and John, 1997)
With N features there are 2^N possible feature subsets, so exhaustive search is infeasible.
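Because exhaustive subset search is infeasible, a univariate filter such as the T-statistic ranking above is often the cheapest first step. The following is a minimal sketch of that idea, not code from the course; the synthetic data, variable names, and the number of retained features are illustrative assumptions.

```python
import numpy as np

def t_statistic_ranking(X, y, top_k=10):
    """Rank features by |t|, with t = (mu+ - mu-) / (s_within * sqrt(1/m+ + 1/m-))."""
    pos, neg = X[y == 1], X[y == -1]
    m_pos, m_neg = len(pos), len(neg)
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    # Pooled within-class variance per feature (m+ + m- - 2 degrees of freedom).
    s2_within = (((pos - mu_pos) ** 2).sum(axis=0) +
                 ((neg - mu_neg) ** 2).sum(axis=0)) / (m_pos + m_neg - 2)
    t = (mu_pos - mu_neg) / np.sqrt(s2_within * (1.0 / m_pos + 1.0 / m_neg))
    order = np.argsort(-np.abs(t))            # most discriminative features first
    return order[:top_k], t[order[:top_k]]

# Illustrative use on synthetic data: only the first 3 of 20 features carry class signal.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200)
X = rng.normal(size=(200, 20))
X[:, :3] += 0.8 * y[:, None]                  # shift the informative features by class
selected, scores = t_statistic_ranking(X, y, top_k=5)
print("selected features:", selected)
```

Note that, as on the earlier slide, such a ranking can still miss a feature that is only useful in combination with another one.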
Practical search strategies:
- Forward selection or backward elimination.
- Beam search: keep the k best paths at each step.
- GSFS (generalized sequential forward selection): when (n - k) features are left, try all subsets of g features; more trainings at each step, but fewer steps.
- PTA(l, r): "plus l, take away r"; at each step, run SFS l times, then SBS r times.
- Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) for as long as better subsets than those of the same size obtained so far are found.

Feature Selection: filters vs. wrappers vs. embedded methods
Main goal: rank subsets of useful features.

Feature Selection: feature subset assessment (wrapper)
Split the data (M samples by N variables/features) into three sets: a training, a validation, and a test set. Then:
1) For each feature subset, train the predictor on the training data.
2) Select the feature subset that performs best on the validation data. Repeat and average if you want to reduce variance (cross-validation).
3) Estimate performance on the test data.
Danger of over-fitting with intensive search!

Feature Selection via embedded methods: L1 regularization
Adding a penalty proportional to sum_j |beta_j| to the loss shrinks some coefficients exactly to zero, so the surviving features are selected as part of model fitting.
[Figure: coefficient paths plotted against sum(|beta|).]

Feature Selection: summary
- Linear, univariate: T-test, AUC, feature ranking.
- Linear, multivariate: RFE with linear SVM or LDA.
- Non-linear, univariate: mutual-information feature ranking.
- Non-linear, multivariate: nearest neighbours, neural nets, trees, SVM.

Dimensionality Reduction
In the presence of many features, select the most relevant subset of (weighted) combinations of features:
- Feature selection: X1, ..., Xm  ->  Xk1, ..., Xkp.
- Dimensionality reduction: X1, ..., Xm  ->  f1(X1, ..., Xm), ..., fp(X1, ..., Xm).

Dimensionality Reduction: (linear) Principal Components Analysis
PCA finds a linear mapping of a dataset X to a dataset X' of lower dimensionality such that the variance of X retained in X' is maximal. (In the illustration, X is mapped to an X' of the same dimensionality.) The first dimension of X' (the first principal component) is the direction of maximal variance; the second principal component is orthogonal to the first.

Dimensionality Reduction: nonlinear (Kernel) Principal Components Analysis
Starting from the original dataset X, map X to a HIGHER-dimensional space, carry out LINEAR PCA in that space, and (if necessary) map the resulting principal components back to the original space.

Dimensionality Reduction: Multi-Dimensional Scaling (MDS)
MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower (for example, two-) dimensional space. MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space. The error of the fit is measured with a so-called "stress" function, and different choices of stress function are possible:
- Raw stress function (identical to PCA): phi(Y) = Σ_{i<j} ( d_ij - ||y_i - y_j|| )^2, where d_ij is the distance between observations i and j in the original space and y_i, y_j are their low-dimensional representations.
- Sammon cost function: phi(Y) = [ 1 / Σ_{i<j} d_ij ] Σ_{i<j} ( d_ij - ||y_i - y_j|| )^2 / d_ij, which weights the preservation of small original distances more heavily.
[Figure: MDS example showing the input distances and the resulting two-dimensional output configuration.]

Dimensionality Reduction: homogeneity analysis
Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.
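Before turning to clustering, a minimal sketch of the linear PCA mapping described above may help; it is not code from the course. It centres a data matrix and projects it onto its leading principal components via the SVD; the function name, the synthetic data, and the choice of two components are illustrative assumptions.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Linear PCA via SVD: project X onto the directions of maximal variance."""
    X_centered = X - X.mean(axis=0)                  # PCA is defined on centred data
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                   # shape (n_components, n_features)
    explained_variance = (S ** 2) / (len(X) - 1)     # variance captured by each axis
    return X_centered @ components.T, explained_variance[:n_components]

# Illustrative use: 100 observations with 5 correlated features, reduced to 2 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X_low, var = pca_project(X, n_components=2)
print(X_low.shape, var)
```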
Clustering
[Figure: three learning tasks side by side, clustering, classification, and regression, with example methods listed for each: k-th nearest neighbour, Parzen window; linear discriminant analysis, QDA, logistic regression (logit), decision trees, LS-SVM, neural nets, CART; classical linear regression, ridge regression; unfolding, conjoint analysis, Cat-PCA.]

Clustering is an unsupervised learning technique. The task is to organize objects into groups whose members are similar in some way; clustering finds structure in a collection of unlabeled data. A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

Density estimation and clustering
[Figure: estimated class densities with the Bayesian separation curve (optimal).]

Clustering: K-means clustering
K-means minimizes the sum of the squared distances to the cluster centers (the reconstruction error). It is an iterative process:
1) Given the current centers, estimate the cluster assignments (construct the Voronoi partition).
2) Given the new cluster assignments, set each cluster center to the center of mass of its points.
[Figure: four steps of the K-means iterations.]

Clustering: hierarchical clustering
Clustering based on (dis)similarities, summarized in a dendrogram. It is a multilevel clustering: level 1 has n clusters, level n has one cluster.
- Agglomerative HC starts with n clusters and combines clusters iteratively.
- Divisive HC starts with one cluster and divides iteratively.
- Disadvantage: a wrong division (or merge) cannot be undone.
[Figures: the nearest-neighbour (single-linkage) algorithm for hierarchical clustering, levels 2 through 8, going from k = 7 clusters down to k = 1 cluster.]

Clustering: similarity measures for hierarchical clustering
Pearson correlation measures trend similarity. For the example series b = 0.5 * a and c = a + 0.2:
C_pearson(a, b) = 1, C_pearson(a, c) = 1, C_pearson(b, c) = 1.

Euclidean distance: d(x, y) = sqrt( Σ_{n=1..N} (x_n - y_n)^2 ), for x = (x_1, ..., x_N) and y = (y_1, ..., y_N).

Cosine correlation: C_cosine(x, y) = ( Σ_{i=1..N} x_i y_i ) / ( ||x|| ||y|| ), ranging from +1 (same orientation) to -1 (opposite orientation). It reflects both trend and mean level; for the same example series:
C_cosine(a, b) = 1, C_cosine(a, c) = 0.9622, C_cosine(b, c) = 0.9622.

Comparing the three measures on a, b = 0.5 * a and c = a + 0.2:
C_pearson(a, b) = 1, C_pearson(a, c) = 1, C_pearson(b, c) = 1;
d(a, b) = 2.8025, d(a, c) = 1.5875, d(b, c) = 3.2211;
C_cosine(a, b) = 1, C_cosine(a, c) = 0.9622, C_cosine(b, c) = 0.9622.
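To see how the three measures above behave, the snippet below computes Pearson correlation, Euclidean distance, and cosine correlation for an illustrative series a with b = 0.5 * a and c = a + 0.2, mirroring the slide example; the concrete values of a are an assumption, so the printed numbers will differ from those on the slides.

```python
import numpy as np

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]       # trend similarity: insensitive to mean and scale

def euclidean(x, y):
    return np.linalg.norm(x - y)         # sensitive to both offset and scale

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # angle only, in [-1, +1]

a = np.array([0.3, 0.8, 0.2, 0.9, 0.4])  # illustrative series (not the one on the slides)
b = 0.5 * a                               # same trend, different scale
c = a + 0.2                               # same trend, shifted mean

for name, (x, y) in {"(a,b)": (a, b), "(a,c)": (a, c), "(b,c)": (b, c)}.items():
    print(name, round(pearson(x, y), 4), round(euclidean(x, y), 4), round(cosine(x, y), 4))
```

All three Pearson correlations come out as 1, while the Euclidean distances and cosine correlations separate the rescaled series from the shifted one, which is exactly the contrast the slides illustrate.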
Are three series that look alike by eye "similar"? In a second example the measures disagree:
C_pearson(a, b) = 0.1175, C_pearson(a, c) = 0.1244, C_pearson(b, c) = 0.1779;
d(a, b) = 0.0279, d(a, c) = 0.0255, d(b, c) = 0.0236;
C_cosine(a, b) = 0.7544, C_cosine(a, c) = 0.8092, C_cosine(b, c) = 0.844.
The Pearson correlations are low, the Euclidean distances are small, and the cosine correlations are high, so the answer depends on the chosen measure.

Clustering: grouping strategies for hierarchical clustering
Given clusters C1, C2, C3, which pair of clusters should be merged? That depends on how the dissimilarity between two clusters is defined:
- Single linkage: the dissimilarity between two clusters is the minimum dissimilarity between members of the two clusters. Tends to generate "long chains".
- Complete linkage: the dissimilarity between two clusters is the maximum dissimilarity between members of the two clusters. Tends to generate compact "clumps".
- Average linkage: the dissimilarity between two clusters is the average of the distances over all pairs of objects, one from each cluster.
- Average group linkage: the dissimilarity between two clusters is the distance between the two cluster means.

Clustering: Support Vector Machines for clustering (Ben-Hur, Horn, Siegelmann and Vapnik, 2001)
- The not-noisy case. Objective function: find the smallest sphere in feature space that encloses all mapped points, i.e. minimize R^2 subject to ||Phi(x_i) - a||^2 <= R^2 for all i, where Phi is the feature map, a the sphere center and R its radius.
- The noisy case. Objective function: allow some points to fall outside the sphere by introducing slack variables, i.e. minimize R^2 + C Σ_i xi_i subject to ||Phi(x_i) - a||^2 <= R^2 + xi_i, xi_i >= 0.
- The noisy case (II) and (III): [further illustrations of the noisy case].

Conclusion / Summary / References
Feature Selection
- Filtering approach: Kohavi and John, 1997
- Wrapper approach: Kohavi and John, 1997
- Embedded methods: I. Guyon et al., 2006
Dimensionality Reduction
- Principal Components Analysis (PCA): http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- Nonlinear PCA (Kernel PCA, Cat-PCA): Schoelkopf et al., 2001; Gifi, 1990
- Multi-Dimensional Scaling (MDS): Borg and Groenen, 2005
- Homogeneity Analysis: Gifi, 1990
Clustering
- Density estimation and clustering: Hastie et al., 2001
- K-means clustering: MacQueen, 1967; http://www.autonlab.org/tutorials/kmeans11.pdf
- Hierarchical clustering
- Clustering with Support Vector Machines (SVMs): Ben-Hur, Horn, Siegelmann and Vapnik, 2001
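As a closing illustration of the grouping strategies discussed above, here is a minimal sketch (not part of the course material) that runs agglomerative clustering with single, complete, average, and centroid linkage using SciPy; the toy data and the cut into two clusters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two illustrative blobs of points in the plane.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
               rng.normal(loc=2.0, scale=0.3, size=(20, 2))])

# 'single', 'complete' and 'average' match the linkages above; 'centroid' is SciPy's
# name for the distance between cluster means (average group linkage).
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # dendrogram encoded as a merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

On such well-separated blobs all four linkages agree; on chained or noisy data, single linkage tends to produce long chains while complete linkage produces compact clumps, as noted above.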