Classification problems with heterogeneous information sources
N.U.S. - January 13, 2006
Gert Lanckriet ([email protected]), U.C. San Diego

Motivation
• Statistical machine learning
– blends statistics, computer science, signal processing, optimization
– involves solving large-scale data analysis problems, autonomously or in tandem with a human
• Challenges:
– massive scale of data sets
– on-line issues
– diversity of information sources describing the data

Example: web-related applications
• Data point = web page
• Sources of information about the web page:
– Content: text, images, structure, sounds
– Relation to other web pages: link network
– Users (log data): click behavior, origin
• The information comes in diverse (heterogeneous) formats

Example: bioinformatics
• mRNA expression data
• hydrophobicity data
• protein-protein interaction data
• sequence data (gene, protein)
• upstream region data (TF binding sites)

Overview
• Kernel methods
• Classification problems
• Kernel methods with heterogeneous information
• Classification with heterogeneous information (SDP)
• Applications in computational biology

Kernel-based learning
• Pipeline: data x1, ..., xn → embed data → linear algorithm (SVM, MPM, PCA, CCA, FDA, ...)
• If the data are described by numerical vectors, the embedding is a (non-linear) transformation, yielding non-linear versions of linear algorithms
• The embedding can also be defined for non-vector data
• The embedding is implicit: the inner product Kij measures the similarity of embedded points i and j
• Property: any symmetric positive definite matrix specifies a kernel matrix, and every kernel matrix is symmetric positive definite
• The pipeline becomes: data → kernel design → kernel matrix K → kernel algorithm (SVM, MPM, PCA, CCA, FDA, ...)

Kernel methods
• A unifying learning framework
– connections to statistics, convex optimization, functional analysis
– different data analysis problems can be formulated within this framework: classification, clustering, regression, dimensionality reduction
• Many successful applications
– hand-writing recognition
– text classification
– analysis of micro-array data
– face detection
– time series prediction

Binary classification
• Training data: {(xi, yi)}, i = 1...n
– xi: description of the i-th object (e.g., HEART, URINE, DNA, BLOOD, SCAN measurements)
– yi ∈ {-1, +1}: label
• Problem: design a classification rule such that, given a new x, it predicts y with minimal probability of error
• Find a hyperplane that separates the two classes

Maximal margin classification
• Maximize the margin:
– position the hyperplane between the two classes
– such that the 2-norm distance to the closest point from each class is maximized
• If the data are not linearly separable:
– allow some errors, measured by slack variables on the misclassified points
– maximize the margin for the correctly classified points
• Training algorithm: trade off maximizing the margin against minimizing the error slack
• Training is a convex optimization problem (QP); the dual problem and its optimality condition identify the support vectors
• Classification rule: classify a new data point x by the sign of the learned decision function

Kernel-based classification
• Data x1, ..., xn → kernel design → K → linear classification algorithm: the support vector machine (SVM)

Overview
• Kernel methods
• Classification problems
• Kernel methods with heterogeneous information
• Classification with heterogeneous information (SDP)
• Applications in computational biology

Kernel methods with heterogeneous information
• Data points: proteins
• Information sources: several, each summarized in a kernel matrix
• Proposed approach:
1. First focus on every single source j of information individually, and extract the relevant information from source j into a kernel matrix Kj
– focus on kernel design for specific types of information
– yields a homogeneous, standardized input
2. Design an algorithm that learns the optimal K by "mixing" any number of kernel matrices Kj for a given learning problem
– flexibility: the algorithm can ignore information irrelevant for the learning task

Kernel design: classical vector data
• Data matrix:
– each row corresponds to a gene (data point)
– each column corresponds to an experiment (mRNA expression level)
• Each gene is described by a vector of numbers
• Inner product and normalized inner product: large for similar vectors, small for dissimilar ones
• A more advanced similarity measure for vector data: the Gaussian kernel
• Corresponds to a highly non-linear embedding
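The Gaussian kernel and the PSD property above can be illustrated in a few lines. This is a sketch, not part of the original slides; the function name and test points are illustrative.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared Euclidean distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0],   # two nearby points ...
              [0.0, 0.1],
              [5.0, 5.0]])  # ... and one distant point
K = gaussian_kernel(X, sigma=1.0)

# Nearby points get similarity close to 1, distant points close to 0,
# and K is symmetric positive semidefinite, as the slides state.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10
```

Plugging such a K into an SVM gives the "highly non-linear embedding" mentioned above without ever computing the embedding explicitly.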
Kernel design: strings
• Data points: proteins
• Described by variable-length, discrete strings (amino acid sequences), e.g.:
protein 1
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
protein 2
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI
• Kernel design: derive a valid similarity measure based on this non-vector information
• String kernels: sequence pairs sharing more common substrings are scored as more similar, pairs sharing fewer as less similar (the slides illustrate this with mutated versions of the two sequences above)

Kernel design: graphs
• Data points: vertices
• Information: connectivity described by a graph
• Diffusion kernel: establishes similarities between the vertices of a graph, based on the connectivity information
– based upon a random walk
– efficiently accounts for all paths connecting two vertices, weighted by path lengths

Kernel methods with heterogeneous data
• Step 1 (kernel design per source) yields kernel matrices K1, K2, ...; step 2 must learn how to combine them into one K

Learning the kernel matrix
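The slides do not name a specific string kernel; a common, simple choice is the k-mer "spectrum" kernel, which counts shared substrings of length k. A minimal sketch (function names and example strings are illustrative, not from the talk):

```python
from collections import Counter
import math

def spectrum_kernel(s, t, k=3):
    """k-spectrum string kernel: inner product of k-mer count vectors."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[m] * ct[m] for m in cs if m in ct)

def normalized_spectrum(s, t, k=3):
    """Cosine-normalized version, in [0, 1]."""
    return spectrum_kernel(s, t, k) / math.sqrt(
        spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))

a = "GDIFYPGYCPDVKP"
b = "GDIFYPGYCPDAKP"   # one substitution: still shares many 3-mers with a
c = "MKCLLLALALTCGA"   # unrelated sequence: shares no 3-mers with a
# normalized_spectrum(a, b) is much larger than normalized_spectrum(a, c)
```

Because the kernel is an inner product of explicit count vectors, the resulting matrix is automatically symmetric positive semidefinite, i.e., a valid kernel in the sense defined earlier.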
• Any symmetric positive definite matrix specifies a kernel matrix
• Positive semidefinite matrices form a convex cone
• So: learn K from the convex cone of positive semidefinite matrices (or a convex subset of it) ...
• Define a cost function to assess the quality of a kernel matrix, and restrict attention to convex cost functions
• ... according to a convex quality measure
• Semidefinite programming (SDP) deals with optimizing convex cost functions over the convex cone of positive semidefinite matrices (or a convex subset of it)

Classification with multiple kernels
• Learn K from the convex cone of positive semidefinite matrices (or a convex subset), according to a convex quality measure
• Integrate the constructed kernels: learn a linear combination of the kernel matrices Kj
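The cone property above is what makes the mixing step well-posed: any nonnegative combination of kernel matrices is again inside the PSD cone, hence again a valid kernel. A quick numerical check (the random-matrix construction and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_kernel(n):
    """A random symmetric PSD matrix: A @ A.T is always PSD."""
    A = rng.standard_normal((n, n))
    return A @ A.T

K1, K2, K3 = (random_kernel(5) for _ in range(3))
mu = np.array([0.5, 0.3, 0.2])          # nonnegative mixing weights
K = mu[0] * K1 + mu[1] * K2 + mu[2] * K3

# The combination stays inside the PSD cone: all eigenvalues are >= 0
assert np.linalg.eigvalsh(K).min() >= -1e-10
```

The SDP described next searches over exactly such weights mu, using the SVM margin as the convex quality measure.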
• Quality measure: the margin of a large margin classifier (SVM)
• So: learn a linear combination of the kernel matrices Kj that maximizes the margin

Classification with multiple kernels
• SVM with one kernel: the dual formulation is a QP in the dual variables
• SVM with multiple kernels: optimize the single-kernel dual over the kernel weights as well
– the resulting objective is convex (a pointwise maximum of a set of convex functions)
– this yields a semidefinite programming problem, which needs to be reformulated in standard SDP form
• Integrating the constructed kernels with a large margin classifier (SVM), learning a linear mix that maximizes the margin, gives an SDP in standard form
• Theoretical performance guarantees are available

Applications in computational biology
• Yeast membrane protein prediction
• Yeast protein function prediction

Yeast membrane protein prediction
• Membrane proteins:
– anchor in various cellular membranes
– serve important communicative functions across the membrane
– are important drug targets
• About 30% of all proteins are membrane proteins
• Kernels used:
– protein sequences: SW (Smith-Waterman) scores
– protein sequences: BLAST scores
– E-values of Pfam domains
– protein-protein interactions: diffusion kernel
– mRNA expression profiles: Gaussian kernel
– hydropathy profile

Yeast protein function prediction
• Five different types of data:
– Pfam domains
– genetic interactions (CYGD)
– physical interactions (CYGD)
– protein-protein interactions (TAP)
– mRNA expression profiles
• Compared against an approach using Markov random fields (Deng et al.)
– which uses the same five types of data
– and also reports improved accuracy compared to using any single data type
• Methods compared: MRF, SDP/SVM (binary), SDP/SVM (enriched)

Conclusion
• A computational and statistical framework to integrate data from heterogeneous information sources
– a flexible and unified approach within the kernel methodology
– specifically: classification problems
– resulting formulation: semidefinite programming
• Applications show that classification performance can be enhanced by integrating diverse genome-wide information sources
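End to end, the approach amounts to: build one kernel per source, mix them, and train a large margin classifier on the combined kernel. The sketch below uses a fixed uniform mix and a kernel perceptron as a simple stand-in for the SVM (the talk's method learns the weights by SDP and trains a max-margin QP; both substitutions, and all names and data here, are illustrative):

```python
import numpy as np

def kernel_perceptron(K, y, epochs=20):
    """Kernel perceptron on a precomputed kernel matrix K.
    Stand-in for the SVM in the slides: a max-margin QP would replace this."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            if np.sign(K[i] @ (alpha * y)) != y[i]:
                alpha[i] += 1.0
    return alpha

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Two heterogeneous "sources" of similarity: a linear and a Gaussian kernel
K_lin = X @ X.T
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-d2 / 2.0)
K = 0.5 * K_lin + 0.5 * K_rbf   # fixed uniform mix; the SDP would learn these weights

alpha = kernel_perceptron(K, y)
train_acc = np.mean(np.sign(K @ (alpha * y)) == y)
```

The classifier only ever touches the combined matrix K, which is why swapping in sequence, graph, or expression kernels (as in the yeast experiments) requires no change to the learning algorithm itself.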