Distance Metric Learning in Data Mining
Fei Wang

What is a Distance Metric?
Following the standard (Wikipedia) definition, a distance metric on a set X is a function d: X × X → R such that, for all x, y, z in X:
– Non-negativity: d(x, y) ≥ 0
– Identity of indiscernibles: d(x, y) = 0 if and only if x = y
– Symmetry: d(x, y) = d(y, x)
– Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)

Distance Metric Taxonomy
Distance metrics can be organized along several axes:
– Fixed vs. learned: fixed metrics are predefined for numeric data (continuous, e.g., Euclidean, or discrete) or for categorical data, while learned metrics are produced by unsupervised, supervised, or semi-supervised learning algorithms.
– Characteristics of learned metrics: linear vs. nonlinear, and global vs. local.

Categorization of Distance Metrics: Linear vs. Nonlinear
– Linear distance: first perform a linear mapping to project the data into some space, then evaluate the pairwise distance as the Euclidean distance in the projected space. The canonical example is the generalized Mahalanobis distance
    D_M(x_i, x_j) = sqrt((x_i − x_j)^T M (x_i − x_j)).
  • If M is symmetric and positive definite, D_M is a distance metric.
  • If M is symmetric and positive semi-definite, D_M is a pseudo-metric.
  • Writing M = W W^T gives D_M(x_i, x_j) = ||W^T x_i − W^T x_j||, i.e., the Euclidean distance after the linear projection W (a numerical check appears at the end of this introduction).
– Nonlinear distance: first perform a nonlinear mapping to project the data into some space, then evaluate the pairwise distance as the Euclidean distance in the projected space.

Categorization of Distance Metrics: Global vs. Local
– Global methods: the distance metric satisfies some global properties of the data set.
  • PCA: find the directions along which the variance of the whole data set is largest.
  • LDA: find the directions along which data from the same class are clustered and data from different classes are separated.
– Local methods: the distance metric satisfies some local properties of the data set.
  • LLE: find low-dimensional embeddings that preserve the local neighborhood structure.
  • LSML: find a projection such that the local neighbors from different classes are separated.

Related Areas
Metric learning is closely related to dimensionality reduction and kernel learning.

Related Area 1: Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features through feature selection or feature transformation.
– Feature selection: find a subset of the original features (correlation, mRMR, information gain, ...).
– Feature transformation: transform the original features into a lower-dimensional space (PCA, LDA, LLE, Laplacian Embedding, ...).
Every dimensionality reduction technique induces a distance metric: first perform the dimensionality reduction, then evaluate the Euclidean distance in the embedded space.

Related Area 2: Kernel Learning
– What is a kernel? Given a data set X with n data points, the kernel matrix K defined on X is an n × n symmetric positive semi-definite matrix whose (i, j)-th element represents the similarity between the i-th and j-th data points.
– Kernel learning vs. distance metric learning: kernel learning constructs a new kernel from the data, i.e., an inner-product function in some feature space, while distance metric learning constructs a distance function from the data.
– Kernel learning vs. the kernel trick: kernel learning infers the n × n kernel matrix from the data, whereas the kernel trick applies a predetermined kernel to map the data from the original space into a feature space, so that a nonlinear problem in the original space becomes a linear problem in the feature space.
B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. 2001.
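To make these definitions concrete, here is a minimal numpy sketch (an illustration of the standard formulas above, not code from the tutorial) checking that the generalized Mahalanobis distance with M = W W^T coincides with the Euclidean distance after projecting by W:

```python
import numpy as np

rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=5), rng.normal(size=5)

W = rng.normal(size=(5, 3))  # a linear projection from 5 to 3 dimensions
M = W @ W.T                  # symmetric positive semi-definite by construction

d = x_i - x_j
dist_M = np.sqrt(d @ M @ d)                        # generalized Mahalanobis distance
dist_proj = np.linalg.norm(W.T @ x_i - W.T @ x_j)  # Euclidean distance after projection

assert np.isclose(dist_M, dist_proj)  # the two definitions coincide
```

Continuing the same snippet, a kernel matrix as defined above can be instantiated with a predetermined kernel (the RBF kernel is assumed here purely for illustration) and checked for the stated properties:

```python
X = rng.normal(size=(20, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
K = np.exp(-sq / 2.0)                                     # n x n RBF kernel matrix

assert np.allclose(K, K.T)                  # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-8  # positive semi-definite (up to round-off)
```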
Different Types of Metric Learning

|                 | Linear                                      | Nonlinear            |
|-----------------|---------------------------------------------|----------------------|
| Unsupervised    | PCA, UMMP, LPP                              | LE, LLE, ISOMAP, SNE |
| Supervised      | LDA, MMDA, Xing, RCA, ITML, NCA, ANMM, LMNN |                      |
| Semi-supervised | CMM, LRML                                   | SSDR                 |

Each linear method can also be kernelized to obtain a nonlinear variant. (In the tutorial's original chart, red marked local methods and blue marked global methods.)

Metric Learning Meta-algorithm
1. Embed the data in some space, usually by formulating an optimization problem:
   – Define an objective function: separation based, geometry based, or information theoretic.
   – Solve the optimization problem: eigenvalue decomposition, semi-definite programming, gradient descent, or Bregman projection.
2. Use the Euclidean distance in the embedded space.

Unsupervised Metric Learning
Learn a pairwise distance metric purely from the data, i.e., without any supervision.

| Objective | Linear (kernelizable)           | Nonlinear                                                                                              |
|-----------|---------------------------------|--------------------------------------------------------------------------------------------------------|
| Global    | Principal Component Analysis    |                                                                                                          |
| Local     | Locality Preserving Projections | Laplacian Eigenmap, Locally Linear Embedding, Isometric Feature Mapping, Stochastic Neighbor Embedding |

Outline
Part I – Applications: motivation and introduction; a patient similarity application.
Part II – Methods: unsupervised metric learning; supervised metric learning; semi-supervised metric learning; challenges and opportunities for metric learning.

Principal Component Analysis (PCA)
Find successive projection directions (1st PC, 2nd PC, ...) that maximize the data variance.
– Geometry based objective; eigenvalue decomposition; linear mapping; global method.
If all principal components are retained, PCA is an orthogonal rotation of the centered data, so the pairwise Euclidean distances do not change (a numerical sketch appears after the LLE section below); distances change only once components are discarded.

Kernel Mapping
Map the data into a high-dimensional feature space so that a nonlinear problem in the original space becomes a linear problem in the feature space. The explicit form of the mapping is not needed; only a kernel function has to be defined.
(Figure from http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf)

Kernel Principal Component Analysis (KPCA)
Perform PCA in the feature space. By the representer theorem, the projection directions can be expressed as linear combinations of the mapped data points, so the embedded data can be computed from the kernel matrix alone.
– Geometry based objective; eigenvalue decomposition; nonlinear mapping; global method.
(Figure from http://en.wikipedia.org/wiki/Kernel_principal_component_analysis)

Locality Preserving Projections (LPP)
Find linear projections that preserve the localities of the data set, using the degree matrix and the Laplacian matrix of a neighborhood graph.
(Figure: on the same data set, the major direction found by PCA is the x-axis, while the major projection direction found by LPP is the y-axis.)
– Geometry based objective; generalized eigenvalue decomposition; linear mapping; local method.
X. He and P. Niyogi. Locality Preserving Projections. NIPS 2003.

Laplacian Embedding (LE)
Find embeddings y_i that preserve the localities of the data set: minimize Σ_{ij} w_ij ||y_i − y_j||², where y_i is the embedded x_i and w_ij encodes the relationship between x_i and x_j.
– Geometry based objective; generalized eigenvalue decomposition; nonlinear mapping; local method.
M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396, 2003.

Locally Linear Embedding (LLE)
1. Obtain the data relationships: reconstruct each point from its neighbors by minimizing Σ_i ||x_i − Σ_j w_ij x_j||².
2. Obtain the data embeddings that preserve those relationships by minimizing Σ_i ||y_i − Σ_j w_ij y_j||².
– Geometry based objective; generalized eigenvalue decomposition; nonlinear mapping; local method.
S. Roweis and L. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
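Before moving on to ISOMAP and SNE, here is the PCA sketch promised above: a minimal numpy illustration (not tutorial code) of PCA via eigendecomposition of the covariance matrix, which also verifies the distance-preservation remark.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
W = eigvecs[:, ::-1]                  # columns = PCs, largest variance first

Z = Xc @ W                               # keeping all 5 PCs: a pure rotation
assert np.allclose(pdist(Xc), pdist(Z))  # pairwise Euclidean distances preserved

Z2 = Xc @ W[:, :2]  # keeping only the top 2 PCs does change distances
```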
Isometric Feature Mapping (ISOMAP)
1. Construct the neighborhood graph.
2. Compute the shortest-path length (geodesic distance) between all pairs of data points.
3. Recover the low-dimensional embeddings of the data by Multi-Dimensional Scaling (MDS), preserving those geodesic distances.
– Geometry based objective; eigenvalue decomposition; nonlinear mapping; local method.
J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000.

Stochastic Neighbor Embedding (SNE)
The probability that point i picks point j as its neighbor is
    p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²),
and an analogous neighborhood distribution q_{j|i} is defined over the embedded points y_i. SNE preserves the neighborhood distributions by minimizing Σ_i KL(P_i || Q_i) over the embedding.
– Information theoretic objective; gradient descent; nonlinear mapping; local method.
G. E. Hinton and S. Roweis. Stochastic Neighbor Embedding. NIPS 2003.

Summary: Unsupervised Distance Metric Learning

|                       | PCA | LPP | LLE | Isomap | SNE |
|-----------------------|-----|-----|-----|--------|-----|
| Local                 |     | √   | √   | √      | √   |
| Global                | √   |     |     |        |     |
| Linear                | √   | √   |     |        |     |
| Nonlinear             |     |     | √   | √      | √   |
| Separation based      |     |     |     |        |     |
| Geometry based        | √   | √   | √   | √      |     |
| Information theoretic |     |     |     |        | √   |
| Extensibility         | √   | √   |     |        |     |

Supervised Metric Learning
Learn a pairwise distance metric from the data together with its supervision, i.e., data labels and pairwise constraints (must-links and cannot-links).

| Objective | Linear (kernelizable)                                                                    | Nonlinear |
|-----------|------------------------------------------------------------------------------------------|-----------|
| Global    | Linear Discriminant Analysis, Eric Xing's method, Information Theoretic Metric Learning |           |
| Local     | Locally Supervised Metric Learning                                                       |           |

Linear Discriminant Analysis (LDA)
Suppose the data come from C different classes. LDA finds a projection W that separates the classes, e.g., by maximizing the trace ratio tr(W^T S_b W) / tr(W^T S_w W), where S_b and S_w are the between-class and within-class scatter matrices (a toy sketch appears after the ITML section below).
– Separation based objective; eigenvalue decomposition; linear mapping; global method.
(Figure from http://cmp.felk.cvut.cz/cmp/software/stprtool/examples/ldapca_example1.gif)
K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
Y. Jia, F. Nie, and C. Zhang. Trace Ratio Problem Revisited. IEEE Trans. Neural Networks, 20(4), 2009.

Distance Metric Learning with Side Information (Eric Xing's method)
– Must-link constraint set S: each element is a pair of data points that come from the same class or cluster.
– Cannot-link constraint set D: each element is a pair of data points that come from different classes or clusters.
The method learns a Mahalanobis matrix M that shrinks the must-linked pairs while keeping the cannot-linked pairs apart:
    min_M Σ_{(x_i, x_j) ∈ S} ||x_i − x_j||²_M   s.t.   Σ_{(x_i, x_j) ∈ D} ||x_i − x_j||_M ≥ 1,  M ⪰ 0.
– Separation based objective; semi-definite programming; no direct mapping; global method.
E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance Metric Learning, with Application to Clustering with Side-Information. NIPS 2002.

Information Theoretic Metric Learning (ITML)
Given an initial Mahalanobis distance parameterized by M_0, a set of must-link constraints, and a set of cannot-link constraints, ITML solves
    min_M D_ld(M, M_0)   s.t.   d_M(x_i, x_j) ≤ u for must-links,  d_M(x_i, x_j) ≥ l for cannot-links,
where D_ld is the LogDet divergence between the learned and initial matrices. ITML is scalable and efficient.
– Information theoretic objective; Bregman projection; no direct mapping; global method.
J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-Theoretic Metric Learning. ICML 2007 (best paper award).
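To illustrate the separation based objective just described for LDA, here is a minimal numpy/scipy sketch (an illustration of the standard formulation, not the tutorial's code) that builds the scatter matrices on toy data and solves the generalized eigenproblem S_b w = λ S_w w:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# Toy data: three classes of 30 points in 4 dimensions, with shifted means
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)

mu = X.mean(axis=0)
d = X.shape[1]
Sw = np.zeros((d, d))  # within-class scatter
Sb = np.zeros((d, d))  # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    Sb += len(Xc) * np.outer(mc - mu, mc - mu)

# Generalized eigenproblem Sb w = lambda Sw w (small ridge keeps Sw well conditioned)
evals, evecs = eigh(Sb, Sw + 1e-8 * np.eye(d))  # eigenvalues in ascending order
W = evecs[:, ::-1][:, :2]                       # top C - 1 = 2 discriminant directions

Z = (X - mu) @ W  # low-dimensional embedding with classes separated
```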
Summary: Supervised Metric Learning

|                       | LDA | Xing | ITML | LSML |
|-----------------------|-----|------|------|------|
| Label                 | √   |      |      | √    |
| Pairwise constraints  |     | √    | √    |      |
| Local                 |     |      |      | √    |
| Global                | √   | √    | √    |      |
| Linear                | √   | √    | √    | √    |
| Nonlinear             |     |      |      |      |
| Separation based      | √   | √    |      | √    |
| Information theoretic |     |      | √    |      |
| Extensibility         | √   |      |      | √    |

Semi-Supervised Metric Learning
Learn a pairwise distance metric using data both with and without supervision.

| Objective | Linear (kernelizable)                                                  | Nonlinear                                |
|-----------|------------------------------------------------------------------------|------------------------------------------|
| Global    | Constraint Margin Maximization, Laplacian Regularized Metric Learning |                                          |
| Local     |                                                                        | Semi-Supervised Dimensionality Reduction |

Constraint Margin Maximization (CMM)
Maximize the scatterness (variance) of the projected data together with the projected constraint margin: must-linked pairs are made compact while cannot-linked pairs are pushed apart; with no label information the objective reduces to maximum variance (a toy sketch in this spirit appears after the reference list below).
– Geometry and separation based objective; eigenvalue decomposition; linear mapping; global method.
F. Wang, S. Chen, T. Li, and C. Zhang. Semi-Supervised Metric Learning by Maximizing Constraint Margin. CIKM 2008.

Laplacian Regularized Metric Learning (LRML)
Combine a smoothness term (a Laplacian regularizer over the data manifold) with a compactness term on must-linked pairs and a scatterness term on cannot-linked pairs.
– Geometry and separation based objective; semi-definite programming; no mapping; local and global method.
S. Hoi, W. Liu, and S. Chang. Semi-Supervised Distance Metric Learning for Collaborative Image Retrieval. CVPR 2008.

Semi-Supervised Dimensionality Reduction (SSDR)
Most nonlinear dimensionality reduction algorithms can be formalized as optimizing the low-dimensional embeddings against an algorithm-specific symmetric matrix; SSDR additionally constrains the embeddings of the points whose low-dimensional coordinates are known.
– Separation based objective; eigenvalue decomposition; no mapping; local method.
X. Yang, H. Fu, H. Zha, and J. Barlow. Semi-Supervised Nonlinear Dimensionality Reduction. ICML 2006.

Summary: Semi-Supervised Distance Metric Learning

|                       | CMM | LRML | SSDR |
|-----------------------|-----|------|------|
| Label                 | √   | √    |      |
| Pairwise constraints  | √   | √    |      |
| Embedded coordinates  |     |      | √    |
| Local                 |     | √    | √    |
| Global                | √   | √    |      |
| Linear                | √   | √    |      |
| Nonlinear             |     |      | √    |
| Separation based      | √   | √    | √    |
| Locality preservation |     | √    |      |
| Information theoretic |     |      |      |
| Extensibility         | √   |      |      |

Challenges and Opportunities for Metric Learning
– Scalability
  • Online (sequential) learning (Shalev-Shwartz, Singer, and Ng, ICML 2004; Davis et al., ICML 2007)
  • Distributed learning
– Efficiency
  • Labeling efficiency (Yang, Jin, and Sukthankar, UAI 2007)
  • Updating efficiency (Wang, Sun, Hu, and Ebadollahi, SDM 2011)
– Heterogeneity
  • Data heterogeneity (Wang, Sun, and Ebadollahi, SDM 2011)
  • Task heterogeneity (Zhang and Yeung, KDD 2010)
– Evaluation

References
S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and Batch Learning of Pseudo-Metrics. ICML 2004.
J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-Theoretic Metric Learning. ICML 2007.
L. Yang, R. Jin, and R. Sukthankar. Bayesian Active Distance Metric Learning. UAI 2007.
F. Wang, J. Sun, and S. Ebadollahi. Integrating Distance Metrics Learned from Multiple Experts and its Application in Inter-Patient Similarity Assessment. SDM 2011.
F. Wang, J. Sun, J. Hu, and S. Ebadollahi. iMet: Interactive Metric Learning in Healthcare Applications. SDM 2011.
Y. Zhang and D.-Y. Yeung. Transfer Metric Learning by Learning Task Relationships. KDD 2010, pp. 1199–1208.
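Finally, here is the toy sketch promised in the CMM section. It is a hedged, in-spirit illustration only — the constraint pairs, the trade-off weight beta, and the exact weighting are assumptions, not the published formulation. It maximizes total scatter plus cannot-link separation minus must-link compactness by taking the top eigenvectors of a combined symmetric matrix, mirroring the eigen-decomposition flavor of the semi-supervised objectives above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

must = [(0, 1), (2, 3)]    # hypothetical must-link pairs (assumed similar)
cannot = [(0, 4), (1, 5)]  # hypothetical cannot-link pairs (assumed dissimilar)

def pair_scatter(X, pairs):
    """Sum of (x_i - x_j)(x_i - x_j)^T over the given pairs."""
    S = np.zeros((X.shape[1], X.shape[1]))
    for i, j in pairs:
        diff = X[i] - X[j]
        S += np.outer(diff, diff)
    return S

beta = 1.0  # assumed trade-off between variance and constraint margin
A = Xc.T @ Xc / len(X)  # total scatter: the maximum-variance term
A += beta * (pair_scatter(X, cannot) / len(cannot)  # push cannot-links apart
             - pair_scatter(X, must) / len(must))   # pull must-links together

evals, evecs = np.linalg.eigh(A)  # A is symmetric, so eigh applies
W = evecs[:, ::-1][:, :2]         # top-2 projection directions

Z = Xc @ W  # embedded data; Euclidean distance here acts as the learned metric
```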