Summer Course: Data Mining
Feature Selection, Dimensionality Reduction, and Clustering
Presenter: Georgi Nalbantov, August 2009

Outline
- Feature Selection: filtering approach, wrapper approach, embedded methods
- Dimensionality Reduction: Principal Components Analysis (PCA), nonlinear PCA (Kernel PCA, Cat-PCA), Multi-Dimensional Scaling (MDS), homogeneity analysis
- Clustering: density estimation and clustering, K-means clustering, hierarchical clustering, clustering with Support Vector Machines (SVMs)

Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process
[Figure: the KDD process, after U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995).]

Feature Selection
In the presence of millions of features/attributes/inputs/variables, select the most relevant ones. Advantages: build better, faster, and easier-to-understand learning machines.
[Figure: the n x m data matrix X is reduced to its m' most relevant feature columns.]

Why selecting features one at a time can fail: suppose the goal is to select the two best features individually. Any reasonable objective J will rank the features J(x1) > J(x2) = J(x3) > J(x4), so the chosen pair is {x1, x2} or {x1, x3}. However, x4 is the only feature that provides information complementary to x1.

Three families of methods:
- Filtering approach: ranks features or feature subsets independently of the predictor (classifier). Univariate methods consider one variable at a time; multivariate methods consider more than one variable at a time.
- Wrapper approach: uses a classifier to assess (many) features or feature subsets.
- Embedded methods: use a classifier to build a (single) model with a subset of features that are selected internally.

Feature Selection: univariate filtering approach
Issue: determine the relevance of a given single feature Xi.
[Figure: class-conditional densities P(Xi | Y) for the two classes.]
Under independence, P(X, Y) = P(X) P(Y). A measure of dependence is the mutual information:
MI(X, Y) = ∫∫ P(X, Y) log [ P(X, Y) / ( P(X) P(Y) ) ] dX dY = KL( P(X, Y) || P(X) P(Y) ).
Note that correlation measures only linear dependence; under a jointly Gaussian distribution the two criteria agree, since then MI(X, Y) = -1/2 log(1 - rho^2), where rho is the correlation coefficient.

Criteria for measuring the dependence between a feature and the class label:
[Figure: class-conditional densities P(Xi | Y = -1) and P(Xi | Y = 1) with means m-, m+ and standard deviations s-, s+.]
A feature Xi is irrelevant if P(Xi | Y = 1) = P(Xi | Y = -1), and relevant if P(Xi | Y = 1) ≠ P(Xi | Y = -1).

T-test: is the distance between the class means significant?
- Assume normally distributed classes with equal variance sigma^2, unknown and estimated from the data as the pooled within-class variance s_within^2.
- Null hypothesis H0: mu+ = mu-.
- T statistic: if H0 is true, then t = (mu+ - mu-) / ( s_within * sqrt(1/m+ + 1/m-) ) follows a Student distribution with m+ + m- - 2 degrees of freedom, where mu+, mu- are the class means and m+, m- the class sample sizes.

Feature Selection: multivariate filtering approach
[Figure: examples where features that are useless individually become useful together (Guyon and Elisseeff, JMLR 2004; Springer 2006).]

Feature Selection: search strategies (Kohavi and John, 1997)
With N features there are 2^N possible feature subsets, so exhaustive search is infeasible.
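Because exhaustive subset search is infeasible, a univariate filter such as the T-statistic ranking above is often the cheapest first step. The following is a minimal sketch of that idea, not code from the course; the synthetic data, variable names, and the number of retained features are illustrative assumptions.

```python
import numpy as np

def t_statistic_ranking(X, y, top_k=10):
    """Rank features by |t|, with t = (mu+ - mu-) / (s_within * sqrt(1/m+ + 1/m-))."""
    pos, neg = X[y == 1], X[y == -1]
    m_pos, m_neg = len(pos), len(neg)
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    # Pooled within-class variance per feature (m+ + m- - 2 degrees of freedom).
    s2_within = (((pos - mu_pos) ** 2).sum(axis=0) +
                 ((neg - mu_neg) ** 2).sum(axis=0)) / (m_pos + m_neg - 2)
    t = (mu_pos - mu_neg) / np.sqrt(s2_within * (1.0 / m_pos + 1.0 / m_neg))
    order = np.argsort(-np.abs(t))            # most discriminative features first
    return order[:top_k], t[order[:top_k]]

# Illustrative use on synthetic data: only the first 3 of 20 features carry class signal.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200)
X = rng.normal(size=(200, 20))
X[:, :3] += 0.8 * y[:, None]                  # shift the informative features by class
selected, scores = t_statistic_ranking(X, y, top_k=5)
print("selected features:", selected)
```

Note that, as on the earlier slide, such a ranking can still miss a feature that is only useful in combination with another one.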
Practical search strategies:
- Forward selection or backward elimination.
- Beam search: keep the k best paths at each step.
- GSFS (generalized sequential forward selection): when (n - k) features are left, try all subsets of g features; more trainings at each step, but fewer steps.
- PTA(l, r): "plus l, take away r"; at each step, run SFS l times, then SBS r times.
- Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) for as long as better subsets than those of the same size obtained so far are found.

Feature Selection: filters vs. wrappers vs. embedded methods
Main goal: rank subsets of useful features.

Feature Selection: feature subset assessment (wrapper)
Split the data (M samples by N variables/features) into three sets: a training, a validation, and a test set. Then:
1) For each feature subset, train the predictor on the training data.
2) Select the feature subset that performs best on the validation data. Repeat and average if you want to reduce variance (cross-validation).
3) Estimate performance on the test data.
Danger of over-fitting with intensive search!

Feature Selection via embedded methods: L1 regularization
Adding a penalty proportional to sum_j |beta_j| to the loss shrinks some coefficients exactly to zero, so the surviving features are selected as part of model fitting.
[Figure: coefficient paths plotted against sum(|beta|).]

Feature Selection: summary
- Linear, univariate: T-test, AUC, feature ranking.
- Linear, multivariate: RFE with linear SVM or LDA.
- Non-linear, univariate: mutual-information feature ranking.
- Non-linear, multivariate: nearest neighbours, neural nets, trees, SVM.

Dimensionality Reduction
In the presence of many features, select the most relevant subset of (weighted) combinations of features:
- Feature selection: X1, ..., Xm  ->  Xk1, ..., Xkp.
- Dimensionality reduction: X1, ..., Xm  ->  f1(X1, ..., Xm), ..., fp(X1, ..., Xm).

Dimensionality Reduction: (linear) Principal Components Analysis
PCA finds a linear mapping of a dataset X to a dataset X' of lower dimensionality such that the variance of X retained in X' is maximal. (In the illustration, X is mapped to an X' of the same dimensionality.) The first dimension of X' (the first principal component) is the direction of maximal variance; the second principal component is orthogonal to the first.

Dimensionality Reduction: nonlinear (Kernel) Principal Components Analysis
Starting from the original dataset X, map X to a HIGHER-dimensional space, carry out LINEAR PCA in that space, and (if necessary) map the resulting principal components back to the original space.

Dimensionality Reduction: Multi-Dimensional Scaling (MDS)
MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower (for example, two-) dimensional space. MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space. The error of the fit is measured with a so-called "stress" function, and different choices of stress function are possible:
- Raw stress function (identical to PCA): phi(Y) = Σ_{i<j} ( d_ij - ||y_i - y_j|| )^2, where d_ij is the distance between observations i and j in the original space and y_i, y_j are their low-dimensional representations.
- Sammon cost function: phi(Y) = [ 1 / Σ_{i<j} d_ij ] Σ_{i<j} ( d_ij - ||y_i - y_j|| )^2 / d_ij, which weights the preservation of small original distances more heavily.
[Figure: MDS example showing the input distances and the resulting two-dimensional output configuration.]

Dimensionality Reduction: homogeneity analysis
Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a type of nonlinear extension of PCA.
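Before turning to clustering, a minimal sketch of the linear PCA mapping described above may help; it is not code from the course. It centres a data matrix and projects it onto its leading principal components via the SVD; the function name, the synthetic data, and the choice of two components are illustrative assumptions.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Linear PCA via SVD: project X onto the directions of maximal variance."""
    X_centered = X - X.mean(axis=0)                  # PCA is defined on centred data
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                   # shape (n_components, n_features)
    explained_variance = (S ** 2) / (len(X) - 1)     # variance captured by each axis
    return X_centered @ components.T, explained_variance[:n_components]

# Illustrative use: 100 observations with 5 correlated features, reduced to 2 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X_low, var = pca_project(X, n_components=2)
print(X_low.shape, var)
```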
Clustering
[Figure: three learning tasks side by side, clustering, classification, and regression, with example methods listed for each: k-th nearest neighbour, Parzen window; linear discriminant analysis, QDA, logistic regression (logit), decision trees, LS-SVM, neural nets, CART; classical linear regression, ridge regression; unfolding, conjoint analysis, Cat-PCA.]

Clustering is an unsupervised learning technique. The task is to organize objects into groups whose members are similar in some way; clustering finds structure in a collection of unlabeled data. A cluster is a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

Density estimation and clustering
[Figure: estimated class densities with the Bayesian separation curve (optimal).]

Clustering: K-means clustering
K-means minimizes the sum of the squared distances to the cluster centers (the reconstruction error). It is an iterative process:
1) Given the current centers, estimate the cluster assignments (construct the Voronoi partition).
2) Given the new cluster assignments, set each cluster center to the center of mass of its points.
[Figure: four steps of the K-means iterations.]

Clustering: hierarchical clustering
Clustering based on (dis)similarities, summarized in a dendrogram. It is a multilevel clustering: level 1 has n clusters, level n has one cluster.
- Agglomerative HC starts with n clusters and combines clusters iteratively.
- Divisive HC starts with one cluster and divides iteratively.
- Disadvantage: a wrong division (or merge) cannot be undone.
[Figures: the nearest-neighbour (single-linkage) algorithm for hierarchical clustering, levels 2 through 8, going from k = 7 clusters down to k = 1 cluster.]

Clustering: similarity measures for hierarchical clustering
Pearson correlation measures trend similarity. For the example series b = 0.5 * a and c = a + 0.2:
C_pearson(a, b) = 1, C_pearson(a, c) = 1, C_pearson(b, c) = 1.

Euclidean distance: d(x, y) = sqrt( Σ_{n=1..N} (x_n - y_n)^2 ), for x = (x_1, ..., x_N) and y = (y_1, ..., y_N).

Cosine correlation: C_cosine(x, y) = ( Σ_{i=1..N} x_i y_i ) / ( ||x|| ||y|| ), ranging from +1 (same orientation) to -1 (opposite orientation). It reflects both trend and mean level; for the same example series:
C_cosine(a, b) = 1, C_cosine(a, c) = 0.9622, C_cosine(b, c) = 0.9622.

Comparing the three measures on a, b = 0.5 * a and c = a + 0.2:
C_pearson(a, b) = 1, C_pearson(a, c) = 1, C_pearson(b, c) = 1;
d(a, b) = 2.8025, d(a, c) = 1.5875, d(b, c) = 3.2211;
C_cosine(a, b) = 1, C_cosine(a, c) = 0.9622, C_cosine(b, c) = 0.9622.
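To see how the three measures above behave, the snippet below computes Pearson correlation, Euclidean distance, and cosine correlation for an illustrative series a with b = 0.5 * a and c = a + 0.2, mirroring the slide example; the concrete values of a are an assumption, so the printed numbers will differ from those on the slides.

```python
import numpy as np

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]       # trend similarity: insensitive to mean and scale

def euclidean(x, y):
    return np.linalg.norm(x - y)         # sensitive to both offset and scale

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # angle only, in [-1, +1]

a = np.array([0.3, 0.8, 0.2, 0.9, 0.4])  # illustrative series (not the one on the slides)
b = 0.5 * a                               # same trend, different scale
c = a + 0.2                               # same trend, shifted mean

for name, (x, y) in {"(a,b)": (a, b), "(a,c)": (a, c), "(b,c)": (b, c)}.items():
    print(name, round(pearson(x, y), 4), round(euclidean(x, y), 4), round(cosine(x, y), 4))
```

All three Pearson correlations come out as 1, while the Euclidean distances and cosine correlations separate the rescaled series from the shifted one, which is exactly the contrast the slides illustrate.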
Are three series that look alike by eye "similar"? In a second example the measures disagree:
C_pearson(a, b) = 0.1175, C_pearson(a, c) = 0.1244, C_pearson(b, c) = 0.1779;
d(a, b) = 0.0279, d(a, c) = 0.0255, d(b, c) = 0.0236;
C_cosine(a, b) = 0.7544, C_cosine(a, c) = 0.8092, C_cosine(b, c) = 0.844.
The Pearson correlations are low, the Euclidean distances are small, and the cosine correlations are high, so the answer depends on the chosen measure.

Clustering: grouping strategies for hierarchical clustering
Given clusters C1, C2, C3, which pair of clusters should be merged? That depends on how the dissimilarity between two clusters is defined:
- Single linkage: the dissimilarity between two clusters is the minimum dissimilarity between members of the two clusters. Tends to generate "long chains".
- Complete linkage: the dissimilarity between two clusters is the maximum dissimilarity between members of the two clusters. Tends to generate compact "clumps".
- Average linkage: the dissimilarity between two clusters is the average of the distances over all pairs of objects, one from each cluster.
- Average group linkage: the dissimilarity between two clusters is the distance between the two cluster means.

Clustering: Support Vector Machines for clustering (Ben-Hur, Horn, Siegelmann and Vapnik, 2001)
- The not-noisy case. Objective function: find the smallest sphere in feature space that encloses all mapped points, i.e. minimize R^2 subject to ||Phi(x_i) - a||^2 <= R^2 for all i, where Phi is the feature map, a the sphere center and R its radius.
- The noisy case. Objective function: allow some points to fall outside the sphere by introducing slack variables, i.e. minimize R^2 + C Σ_i xi_i subject to ||Phi(x_i) - a||^2 <= R^2 + xi_i, xi_i >= 0.
- The noisy case (II) and (III): [further illustrations of the noisy case].

Conclusion / Summary / References
Feature Selection
- Filtering approach: Kohavi and John, 1997
- Wrapper approach: Kohavi and John, 1997
- Embedded methods: I. Guyon et al., 2006
Dimensionality Reduction
- Principal Components Analysis (PCA): http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- Nonlinear PCA (Kernel PCA, Cat-PCA): Schoelkopf et al., 2001; Gifi, 1990
- Multi-Dimensional Scaling (MDS): Borg and Groenen, 2005
- Homogeneity Analysis: Gifi, 1990
Clustering
- Density estimation and clustering: Hastie et al., 2001
- K-means clustering: MacQueen, 1967; http://www.autonlab.org/tutorials/kmeans11.pdf
- Hierarchical clustering
- Clustering with Support Vector Machines (SVMs): Ben-Hur, Horn, Siegelmann and Vapnik, 2001
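As a closing illustration of the grouping strategies discussed above, here is a minimal sketch (not part of the course material) that runs agglomerative clustering with single, complete, average, and centroid linkage using SciPy; the toy data and the cut into two clusters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two illustrative blobs of points in the plane.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
               rng.normal(loc=2.0, scale=0.3, size=(20, 2))])

# 'single', 'complete' and 'average' match the linkages above; 'centroid' is SciPy's
# name for the distance between cluster means (average group linkage).
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # dendrogram encoded as a merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

On such well-separated blobs all four linkages agree; on chained or noisy data, single linkage tends to produce long chains while complete linkage produces compact clumps, as noted above.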