2010.01.22
Do Kyoon Kim
Introduction
• The new focus on genomics has highlighted a particular challenge
– To integrate the different views of the genome that are provided by various types of
experimental data
• Different data sources are likely to contain different and thus partly
independent information about the task at hand
• However, genomic data come in a wide variety of data formats
– Expression data as vectors
– Protein sequence
– Gene sequence
– Protein-protein interaction as graphs
Introduction (cont’d)
• This paper describes a computational framework for integrating
heterogeneous genome-wide measurements
• Each dataset is represented via a kernel function
– Which defines generalized similarity relationships between pairs of entities, such
as genes or proteins
• Kernel matrices derived from different types of data can be combined
in a straightforward fashion
– The kernels are combined in a way that minimizes a statistical loss function
– The combination is found using a convex optimization method known as semidefinite programming (SDP)
• The framework is applied to the recognition of two important groups of proteins in
yeast
– Ribosomal proteins
– Membrane proteins
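The combination step can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `combine_kernels`, the toy feature matrices, and the hand-picked weights are hypothetical. In the framework described above the weights would come from solving an SDP, which is not shown here; uniform weights give a plain unweighted average.

```python
import numpy as np

def combine_kernels(kernels, weights=None):
    """Return a weighted sum of kernel matrices (uniform weights by default)."""
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    if weights is None:
        weights = np.ones(len(kernels)) / len(kernels)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        # Non-negative weights guarantee that the sum stays positive semidefinite.
        raise ValueError("weights must be non-negative")
    return sum(w * K for w, K in zip(weights, kernels))

# Toy data: two kernels over the same 4 entities, built from random features.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(4, 3))
X2 = rng.normal(size=(4, 5))
K1, K2 = X1 @ X1.T, X2 @ X2.T

# Hand-picked weights stand in for the weights an SDP solver would return.
K = combine_kernels([K1, K2], weights=[0.7, 0.3])
```

Because every kernel matrix is indexed by the same entities, the sum is well defined no matter how different the underlying data types are.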
Kernel methods
• A kernel matrix $K$ is symmetric and positive semidefinite: $k_{ij} = k_{ji}$ and $c^\top K c \ge 0$ for any $c$
[Figure: input feature space and kernel feature space]
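The two conditions on a kernel matrix can be checked numerically. The sketch below assumes NumPy; `is_valid_kernel` and the tolerance `tol` are hypothetical names chosen for illustration. It relies on the fact that, for a symmetric matrix, c'Kc ≥ 0 for every c exactly when all eigenvalues are non-negative.

```python
import numpy as np

def is_valid_kernel(K, tol=1e-8):
    """Check that K is square, symmetric, and positive semidefinite."""
    K = np.asarray(K, dtype=float)
    if K.ndim != 2 or K.shape[0] != K.shape[1]:
        return False
    if not np.allclose(K, K.T, atol=tol):
        return False
    # Small negative eigenvalues within tol are treated as rounding error.
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K_linear = X @ X.T                           # linear kernel, PSD by construction
K_bad = np.array([[1.0, 2.0], [2.0, 1.0]])   # symmetric, but eigenvalues are 3 and -1
```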
Data
• Experiment with seven kernel matrices derived from three different
types of data
– Four from the primary protein sequence
– Two from protein-protein interaction data
– One from mRNA expression data
Making kernel matrices
• Smith-Waterman, BLAST, and Pfam HMM kernels
– A homolog of a membrane protein is likely to be located in the membrane, and
similarly for the ribosome
– Standard homology detection methods
– Using the empirical kernel map (Tsuda, 1999)
• Fast Fourier Transform (FFT) kernel
– Specific to membrane protein recognition
– Hydropathy profile (from the Kyte-Doolittle index): a vector containing the
hydrophobicities of the amino acids along the protein
– Low-pass filter: smooths the hydropathy profile
– Frequency content: computed from the filtered profile with the FFT
– FFT kernel: a Gaussian kernel function applied to the frequency content
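The FFT kernel pipeline (hydropathy profile, low-pass filter, frequency content, Gaussian kernel) might look roughly like the sketch below. Several details are assumptions not given in the slides: the 3-tap smoothing filter, zero-padding to a common length, the use of `np.fft.rfft` magnitudes, and the Gaussian width `sigma`. The sequences are made up, and the function names are hypothetical.

```python
import numpy as np

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def fft_features(seq, length):
    """Hydropathy profile -> low-pass filter -> zero-pad -> FFT magnitudes."""
    profile = np.array([KYTE_DOOLITTLE[aa] for aa in seq])
    # Assumed low-pass filter: a 3-tap weighted moving average.
    smoothed = np.convolve(profile, [0.25, 0.5, 0.25], mode="same")
    # Assumed: zero-pad so that all proteins yield vectors of equal length.
    padded = np.zeros(length)
    padded[: len(smoothed)] = smoothed
    return np.abs(np.fft.rfft(padded))

def fft_kernel(seqs, sigma=10.0):
    """Gaussian kernel on the frequency-content vectors (sigma is assumed)."""
    length = max(len(s) for s in seqs)
    F = np.array([fft_features(s, length) for s in seqs])
    sq_dists = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

seqs = ["MKTLLILAVVAAALA", "MDEKRRQ", "GAVLIGAVLIPFW"]  # made-up toy sequences
K_fft = fft_kernel(seqs)
```

Working with magnitudes of the spectrum makes the features insensitive to where along the protein a periodic hydrophobic pattern occurs, which is one plausible reason to compare proteins in the frequency domain.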
Making kernel matrices (cont’d)
• Protein interaction: linear and diffusion kernels
– Using a database of known interactions (von Mering et al., 2002)
– Linear kernel function for interaction matrix
– Diffusion kernel: Graphs -> kernel matrix
• Based on a random walk on the graph
• Nodes that are connected by shorter paths or by many paths are considered more similar
• Gene expression
– Data from 441 distinct DNA microarray experiments were downloaded from the Stanford
Microarray Database
– Expression kernel using a radial basis kernel function
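The two interaction kernels can be illustrated on a toy graph. This sketch assumes the common formulation of the diffusion kernel as the matrix exponential K = exp(-βL), where L is the graph Laplacian; the value of `beta`, the function name, and the 4-node path graph are all hypothetical. The exponential is computed with NumPy's symmetric eigendecomposition.

```python
import numpy as np

def diffusion_kernel(A, beta=1.0):
    """K = exp(-beta * L), with L = D - A the graph Laplacian."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    # L is symmetric, so exp(-beta * L) follows from its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(L)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T

# Toy interaction graph: a path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

K_diff = diffusion_kernel(A, beta=0.5)
K_lin = A @ A.T   # linear kernel: inner products of rows of the interaction matrix
```

On this path graph the similarity of node 0 to the other nodes decreases with path length, which matches the intuition that nodes connected by shorter paths are considered more similar.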
Kernel methods for data fusion
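• SVM forms a linear discriminant boundary in the feature space: $f(x) = w^\top \Phi(x) + b$
• Weight vector can be expressed as $w = \sum_i \alpha_i y_i \Phi(x_i)$
• The support values $\alpha_i$ are the solution of the following dual quadratic program (standard soft-margin form):
$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ subject to $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$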
Kernel methods for data fusion (cont’d)
• A new data point $x$ can subsequently be classified by computing the linear function $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$
• If $f(x)$ is positive, $x$ belongs to class +1; otherwise, to class -1
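The classification rule can be sketched with a precomputed kernel. This is an illustration only, assuming NumPy and the standard SVM decision function f(x) = Σ α_i y_i k(x_i, x) + b. The function names are hypothetical, and the support values are set by hand for a two-point toy problem (one training point per class, linear kernel) rather than obtained from a QP solver.

```python
import numpy as np

def svm_decision(K_test_train, alpha, y_train, b):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b for each test point x."""
    return K_test_train @ (alpha * y_train) + b

def classify(scores):
    """Class +1 for positive scores, -1 otherwise."""
    return np.where(scores > 0, 1, -1)

# Toy problem: training points at +1 and -1 on the real line.
X_train = np.array([[1.0], [-1.0]])
y_train = np.array([1.0, -1.0])
# Hand-set dual solution for this toy problem: w = sum_i alpha_i y_i x_i = 1, b = 0.
alpha = np.array([0.5, 0.5])
b = 0.0

X_test = np.array([[2.0], [-0.3]])
K_test_train = X_test @ X_train.T      # linear kernel between test and training points
scores = svm_decision(K_test_train, alpha, y_train, b)
labels = classify(scores)
```

Only kernel values between test and training points are needed, so the same code applies unchanged when `K_test_train` is taken from a combined kernel.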
Experiment Design
• Use as a gold standard the annotations provided by the MIPS
Comprehensive Yeast Genome Database (CYGD)
– 138 ribosomal proteins
– 497 membrane proteins
• Data are randomly split (without stratification) into a training and a test set
in a ratio of 80/20
– Repeated 30 times
• Report ROC score and TP1FP (true positive rate at 1% false positives)
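The evaluation protocol can be sketched as follows, assuming NumPy. The helper names are hypothetical, and the sketch assumes that TP1FP denotes the fraction of true positives recovered at a 1% false positive rate. The ROC score is computed through the pairwise-ranking formulation of the area under the curve.

```python
import numpy as np

def random_splits(n, n_repeats=30, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) for repeated unstratified random splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * n))
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        yield perm[n_test:], perm[:n_test]

def roc_score(y_true, scores):
    """Area under the ROC curve: fraction of correctly ordered pos/neg pairs."""
    pos = scores[y_true == 1]
    neg = scores[y_true != 1]
    diff = pos[:, None] - neg[None, :]
    # Ties between a positive and a negative count as half a correct pair.
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))

def tp_at_fp(y_true, scores, fp_rate=0.01):
    """Fraction of positives scoring above a threshold that admits at most
    fp_rate of the negatives."""
    neg = np.sort(scores[y_true != 1])[::-1]
    n_allowed = int(np.floor(fp_rate * len(neg)))
    threshold = neg[n_allowed]   # first negative that must stay below the cut
    return float((scores[y_true == 1] > threshold).mean())

# Toy scores: 3 positives and 4 negatives.
y = np.array([1, 1, 1, -1, -1, -1, -1])
s = np.array([0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.0])
auc = roc_score(y, s)
tp1fp = tp_at_fp(y, s)
splits = list(random_splits(10))
```

In a full experiment each split would be used to train on the training indices and to score the test indices, and the two metrics would then be averaged over the 30 repetitions.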
Results
• Combining datasets yields better classification performance
• For ribosomal proteins, the SDP approach performs no better than the naïve approach, but it
– Provides an additional explanatory result (the kernel weights)
– Is robust in the presence of noise
Results (cont’d)
• Ribosomal protein classification
• Membrane protein classification
Discussion
• Kernel-based statistical learning methods have a number of general virtues
as tools for biological data analysis
– Handle a variety of data types, such as vectorial data, strings, trees, and graphs
– Allow the incorporation of more specific biological knowledge
– Require only that data be reduced to a kernel matrix, which creates opportunities for
standardization
– Allow the development of general tools for combining multiple data types
• The authors envision the development of general libraries of kernel matrices for biological
data
– Summarize the statistically relevant features of primary data
– Encapsulate biological knowledge
– Serve as inputs to a wide variety of subsequent data analyses