Download Introduction

2010.01.22
Do Kyoon Kim

Introduction
• The new focus on genomics has highlighted a particular challenge
  – To integrate the different views of the genome that are provided by various types of experimental data
• Different data sources are likely to contain different, and thus partly independent, information about the task at hand
• However, genomic data come in a wide variety of formats
  – Expression data as vectors
  – Protein sequences
  – Gene sequences
  – Protein-protein interactions as graphs

Introduction (cont’d)
• This paper describes a computational framework for integrating heterogeneous genome-wide measurements
• Each dataset is represented via a kernel function
  – Which defines generalized similarity relationships between pairs of entities, such as genes or proteins
• Kernel matrices derived from different types of data can be combined in a straightforward fashion
  – The combination is chosen to minimize a statistical loss function
  – Using a convex optimization method known as semidefinite programming (SDP)
• The framework is applied to the recognition of two important groups of proteins in yeast
  – Ribosomal proteins
  – Membrane proteins

Kernel methods
• A kernel matrix K is symmetric and positive semidefinite: k_ij = k_ji and c'Kc ≥ 0 for any c
• [Figure: input feature space and kernel feature space]

Data
• Experiments use seven kernel matrices derived from three different types of data
  – Four from the primary protein sequence
  – Two from protein-protein interaction data
  – One from mRNA expression data

Making kernel matrices
• Smith-Waterman, BLAST, and Pfam HMM kernels
  – A homolog of a membrane protein is likely to be located in the membrane, and similarly for the ribosome
  – Standard homology detection methods
  – Using the empirical kernel map (Tsuda, 1999)
• Fast Fourier Transform (FFT) kernel
  – Specific to membrane protein recognition
  – Hydropathy profile (from the Kyte-Doolittle index): a vector containing the hydrophobicities of the amino acids along the protein
  – The profile is low-pass filtered, and its frequency content is extracted with the FFT
  – FFT kernel: a Gaussian kernel function applied to the frequency content

Making kernel matrices (cont’d)
• Protein interaction: linear and diffusion kernels
  – Using a database of known interactions (von Mering et al., 2002)
  – Linear kernel function for the interaction matrix
  – Diffusion kernel: graph -> kernel matrix
    • Based on a random walk on the graph
    • Nodes that are connected by shorter paths or by many paths are considered more similar
• Gene expression
  – Data from 441 distinct DNA microarray experiments were downloaded from the Stanford Microarray Database
  – Expression kernel using a radial basis kernel function

Kernel methods for data fusion
• An SVM forms a linear discriminant boundary in the feature space
• The weight vector can be expressed as a weighted sum of the training points mapped into the feature space, with the support values as weights
• The support values are the solution of a dual quadratic program

Kernel methods for data fusion (cont’d)
• A new data point can subsequently be classified by computing the linear function
• If the value is positive, the point belongs to class +1; otherwise, to class -1

Experiment Design
• Use as a gold standard the annotations provided by the MIPS Comprehensive Yeast Genome Database (CYGD)
  – 138 ribosomal proteins
  – 497 membrane proteins
• Randomly split the data (without stratifying) into a training and a test set in a ratio of 80/20
  – Repeated 30 times
• Report ROC score and TP1FP

Results
• Combining datasets yields better classification performance
• For ribosomal proteins, the SDP approach performs no better than the naïve approach, but it
  – Provides an additional explanatory result (the kernel weights)
  – Is robust in the presence of noise

Results (cont’d)
• Ribosomal protein classification
• Membrane protein classification

Discussion
• Kernel-based statistical learning methods have a number of general virtues as tools for biological data analysis
  – Handle a variety of data types, such as vectorial data, strings, trees, and graphs
  – Allow the incorporation of more specific biological knowledge
  – Require only that data be reduced to a kernel matrix, which creates opportunities for standardization
  – Allow the development of general tools for combining multiple data types
• Envision the development of general libraries of kernel matrices for biological data
  – Summarize the statistically relevant features of primary data
  – Encapsulate biological knowledge
  – Serve as inputs to a wide variety of subsequent data analyses
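
Example: building and combining kernel matrices
The kernel construction and combination steps described in the slides can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the six entities, the random expression profiles, the interaction edges, the parameters sigma and beta, and the equal weights are all hypothetical, and the weights are fixed by hand rather than learned with SDP. It shows a radial basis kernel on vector data, a linear and a diffusion kernel on an interaction graph, the symmetry and positive semidefiniteness check, and a weighted sum of the kernels.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian (radial basis) kernel between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def diffusion_kernel(A, beta=0.5):
    """Diffusion kernel exp(-beta * L) for a symmetric adjacency matrix A."""
    L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian
    vals, vecs = np.linalg.eigh(L)                   # L is symmetric
    return (vecs * np.exp(-beta * vals)) @ vecs.T    # matrix exponential

def is_valid_kernel(K, tol=1e-8):
    """Check k_ij = k_ji and c'Kc >= 0 for any c (smallest eigenvalue >= 0)."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

def combine(kernels, weights):
    """Weighted sum of kernel matrices; nonnegative weights keep it valid."""
    return sum(w * K for w, K in zip(weights, kernels))

# Hypothetical toy data: 6 proteins, 4 expression measurements each.
rng = np.random.default_rng(0)
X_expr = rng.normal(size=(6, 4))

# Hypothetical interaction graph, stored as a symmetric adjacency matrix.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

K_expr = rbf_kernel(X_expr, sigma=2.0)   # expression kernel
K_lin = A @ A.T                          # linear kernel on interaction rows
K_diff = diffusion_kernel(A, beta=0.5)   # diffusion kernel on the graph

kernels = [K_expr, K_lin, K_diff]
weights = [1.0, 1.0, 1.0]                # assumed equal weights, not SDP-learned
K_sum = combine(kernels, weights)

print(all(is_valid_kernel(K) for K in kernels + [K_sum]))
```

Because every entity is reduced to a row and column of a kernel matrix, the combined matrix K_sum could be passed to any kernel classifier that accepts a precomputed kernel, regardless of whether the original data were vectors or a graph.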
