Protein Family Classification using Sparse Markov Transducers
Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), pp. 134-145
E. Eskin, W.N. Grundy, and Y. Singer
Cho, Dong-Yeon

Abstract

Classifying proteins into families using sparse Markov transducers (SMTs)
- Estimation of a probability distribution conditioned on an input sequence
- Similar to probability suffix trees
- Allowing for wild-cards
- Two models
- Efficient data structures

Introduction

Protein Classification
- Pairwise similarity
- Creating profiles for protein families
- Consensus patterns using motifs
- HMM-based approaches
- Probability suffix trees (PSTs)
  - A PST is a model that predicts the next symbol in a sequence based on the previous symbols.
  - This approach is based on the presence of common short sequences (motifs) throughout the protein family.
  - One drawback of PSTs is that they rely on exact matches to the conditional sequences (e.g., 3-hydroxyacyl-CoA dehydrogenase):
      VAVIGSGT
      VGVLGLGT
      V*V*G*GT (with wild-cards)
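
To make the wild-card idea concrete, here is a minimal Python sketch. It is not the SMT algorithm itself; the helper names `matches` and `consensus` are illustrative assumptions. It only shows that the two motif instances above fail an exact comparison yet both fit one wild-card pattern.

```python
def matches(pattern: str, sequence: str, wildcard: str = "*") -> bool:
    """Return True if `sequence` fits `pattern`; `wildcard` matches any symbol."""
    if len(pattern) != len(sequence):
        return False
    return all(p == wildcard or p == s for p, s in zip(pattern, sequence))


def consensus(a: str, b: str, wildcard: str = "*") -> str:
    """Build a wild-card pattern from two equal-length sequences (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return "".join(x if x == y else wildcard for x, y in zip(a, b))


pattern = consensus("VAVIGSGT", "VGVLGLGT")
print(pattern)                          # V*V*G*GT
print(matches(pattern, "VAVIGSGT"))     # True
print(matches(pattern, "VGVLGLGT"))     # True
print(matches("VAVIGSGT", "VGVLGLGT"))  # False: exact matching fails
```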

Sparse Markov Transducers (SMTs)
- A generalization of PSTs
- An SMT can condition the probability model over a sequence that contains wild-cards.
- In a transducer, the input symbol alphabet and output symbol alphabet can be different.
- Two methods
  - Single amino acid
  - Protein family
- Efficient data structure

Experiments
- Pfam database of protein families

Sparse Markov Transducers

A Markov Transducer of Order L
- Conditional probability distribution
  $P(Y_t \mid X_t X_{t-1} X_{t-2} X_{t-3} \ldots X_{t-(L-1)})$
- $X_k$ are random variables over an input alphabet
- $Y_k$ is a random variable over an output alphabet

Sparse Markov Transducer
- Conditional probability distribution
  $P(Y_t \mid \phi^{n_1} X_{t_1} \phi^{n_2} X_{t_2} \ldots \phi^{n_k} X_{t_k})$
- $\phi$: wild card
- $t_i = t - \left(\sum_{j=1}^{i} n_j\right) - (i - 1)$
- Two approaches for SMT-based protein classification
  - A prediction model for each family: single amino acid
  - A single model for the entire database: protein family
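
The index formula for t_i in the sparse Markov transducer definition can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names `conditioning_indices` and `render` are illustrative, and the wild-card exponents n_1, ..., n_k are passed as a plain list.

```python
from itertools import accumulate


def conditioning_indices(t, gaps):
    """Return [t_1, ..., t_k] with t_i = t - (n_1 + ... + n_i) - (i - 1).

    `gaps` is the list [n_1, ..., n_k] of wild-card run lengths.
    """
    # enumerate() starts at 0, which supplies the (i - 1) term directly.
    return [t - total - i for i, total in enumerate(accumulate(gaps))]


def render(gaps, symbols, wildcard="*"):
    """Write phi^{n_1} x_1 phi^{n_2} x_2 ... as a wild-card string."""
    return "".join(wildcard * n + x for n, x in zip(gaps, symbols))


print(conditioning_indices(10, [1, 2]))  # [9, 6]
print(render([1, 2], "AC"))              # *A**C
print(render([1, 3], "CC"))              # *C***C
```

With t = 10 and exponents (1, 2), the conditioned positions are 9 and 6: one wild-card covers position 10, and two wild-cards cover positions 8 and 7.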

Sparse Markov Trees
- Representationally equivalent to SMTs
- The topology of a tree encodes the positions of the wild-cards in the conditioning sequence of the probability distribution.
[Figure: a sparse Markov tree with two highlighted nodes, $u_2 = \phi^1 A \phi^2 C$ (pattern *A**C) and $u_5 = \phi^1 C \phi^3 C$ (pattern *C***C). Example input strings shown in the figure include ACAAAC and AACCC.]

Training a Prediction Tree
- A set of training examples
  - The input symbols are used to identify which leaf node is associated with that training example.
  - The output symbol is then used to update the count of the appropriate predictor.
- The predictor keeps counts of each output symbol seen by that predictor.
  - We smooth each count by adding a constant value to the count of each output symbol (cf. Dirichlet distribution).
- Example at node u1: the training examples (DACDADDDCAA, C) and (CAAAACAD, D) yield the prediction C: 0.5, D: 0.5 for the input AACCAAA.

Mixture of Sparse Prediction Trees
- We do not know which tree topology can best estimate the distribution.
- A mixture technique employs a weighted sum of trees as a predictor.
  $P^t(Y \mid X^t) = \dfrac{\sum_T w_T^t \, P_T(Y \mid X^t)}{\sum_T w_T^t}$
- The weight of each tree is updated for each input string in the data set, based on how well the tree performed in predicting the output.
  $w_T^{t+1} = w_T^t \, P_T(y_t \mid x^t)$
  $w_T^{t+1} = w_T^1 \prod_{i=1}^{t} P_T(y_i \mid x^i)$
- The prior probability of a tree is defined by the topology of the tree.
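
The mixture prediction and the multiplicative weight update can be sketched with an explicit list of trees. This is only an illustration: the function names and the two toy predictive distributions are assumptions, and enumerating the trees one by one like this would not scale to a real mixture.

```python
def mixture_predict(weights, tree_probs):
    """Weighted average of per-tree predictions, normalized by the total weight.

    weights    -- list of current weights w_T
    tree_probs -- list of dicts, one per tree, mapping output symbol -> P_T(y | x)
    """
    total = sum(weights)
    symbols = tree_probs[0].keys()
    return {
        y: sum(w * p[y] for w, p in zip(weights, tree_probs)) / total
        for y in symbols
    }


def update_weights(weights, tree_probs, observed):
    """Multiply each weight by the probability its tree gave the observed output."""
    return [w * p[observed] for w, p in zip(weights, tree_probs)]


# Two hypothetical trees with equal prior weight.
weights = [0.5, 0.5]
tree_probs = [{"C": 0.9, "D": 0.1}, {"C": 0.2, "D": 0.8}]

before = mixture_predict(weights, tree_probs)
weights = update_weights(weights, tree_probs, observed="C")
after = mixture_predict(weights, tree_probs)
print(before["C"], after["C"])
```

After observing C, the first tree gains relative weight, so the mixture's probability for C rises from 0.55 to about 0.77 on the same input.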

Implementation of SMTs
- Two important parameters
  - MAX_DEPTH: the maximum depth of the tree
  - MAX_PHI: the maximum number of wild-cards at every node
- Ten trees in the mixture if MAX_DEPTH = 2 and MAX_PHI = 1
- Template tree
  - We only store those nodes which are reached during training (e.g., for the input sequences AA, AC and CD).
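
The idea of storing only the nodes reached during training can be illustrated with a dictionary-based tree. This sketch is deliberately simplified and is an assumption, not the template tree from the paper: it ignores wild-card branches and only shows that nodes are created on first visit.

```python
class Node:
    def __init__(self):
        self.children = {}  # symbol -> Node, created only when first reached


def insert(root, sequence, max_depth):
    """Walk the first `max_depth` symbols, creating missing nodes on the way."""
    node = root
    for symbol in sequence[:max_depth]:
        if symbol not in node.children:
            node.children[symbol] = Node()
        node = node.children[symbol]


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = Node()
for seq in ["AA", "AC", "CD"]:
    insert(root, seq, max_depth=2)

stored = count_nodes(root)
full = 1 + 3 + 3 ** 2  # complete depth-2 tree over the alphabet {A, C, D}
print(stored, full)    # 6 13
```

Only 6 of the 13 possible nodes are allocated for these three sequences; the saving grows quickly with a 20-letter amino acid alphabet and deeper trees.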

Efficient Data Structures

Performance of the SMT typically improves with higher MAX_PHI and MAX_DEPTH.
- Memory usage becomes the bottleneck because it restricts these parameters to values that will allow the tree to fit in memory.

Lazy Evaluation
- We store the tails of the training sequences and recompute the part of the tree on demand when necessary.
- EXPAND_SEQUENCE_COUNT = 4
  - One stored sequence: ACDACAC(D)
  - Five stored sequences: ACDACAC(A), DACADAC(C), DACAAAC(D), ACACDAC(A), ADCADAC(D)
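
A rough sketch of lazy expansion follows. Several details are assumptions made for illustration: the class name `LazyNode`, the rule that a node expands once it holds more than EXPAND_SEQUENCE_COUNT tails, and the omission of wild-card branches. The source only states that sequence tails are stored and parts of the tree are recomputed on demand.

```python
EXPAND_SEQUENCE_COUNT = 4  # value taken from the slide


class LazyNode:
    def __init__(self):
        self.pending = []     # (tail, output) pairs not yet pushed down
        self.children = None  # built only when the node is expanded
        self.counts = {}      # output symbol -> count seen at this node

    def add(self, tail, output):
        self.counts[output] = self.counts.get(output, 0) + 1
        if self.children is None:
            self.pending.append((tail, output))
            # Assumed rule: expand when the threshold is exceeded.
            if len(self.pending) > EXPAND_SEQUENCE_COUNT:
                self._expand()
        else:
            self._push(tail, output)

    def _expand(self):
        self.children = {}
        pending, self.pending = self.pending, []
        for tail, output in pending:
            self._push(tail, output)

    def _push(self, tail, output):
        if tail:  # an empty tail stops at this node
            child = self.children.setdefault(tail[0], LazyNode())
            child.add(tail[1:], output)


node = LazyNode()
examples = [("ACDACAC", "A"), ("DACADAC", "C"), ("DACAAAC", "D"),
            ("ACACDAC", "A"), ("ADCADAC", "D")]
for tail, output in examples[:4]:
    node.add(tail, output)
expanded_after_four = node.children is not None
node.add(*examples[4])
print(expanded_after_four, sorted(node.children))  # False ['A', 'D']
```

The four stored tails cost only a list entry each; the subtree under the node is materialized when the fifth sequence arrives.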

Methodology

Data
- Two versions of the Pfam database
  - Version 1.0: for comparing results to previous work
  - Version 5.2: the latest version
- 175 protein families
  - A total of 15610 single-domain protein sequences containing a total of 3560959 residues
- Training and test data with a ratio of 4:1 for each family
  - Transmembrane receptor: 530 protein sequences (424 + 106)
  - The 424 sequences of the training set give 108858 subsequences that are used to train the model.

Building SMT Prediction Models
- A prediction model for each protein family
- A sliding window of size 11
  - Prediction of the middle symbol a6 using neighboring symbols
  - The input symbols are a5, a7, a4, a8, a3, a9, a2, a10, a1, a11.
- MAX_DEPTH = 7 and MAX_PHI = 1
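
The window construction can be written out directly. The sketch below assumes a generator named `prediction_examples` (an illustrative name) and reproduces the input ordering given above: nearest neighbors first, alternating left and right of the middle symbol.

```python
def prediction_examples(sequence, window=11):
    """Yield (input_symbols, output_symbol) pairs for every odd-sized window.

    The output is the middle symbol; the inputs are its neighbors ordered
    a5, a7, a4, a8, ... (closest first, left before right).
    """
    half = window // 2
    for start in range(len(sequence) - window + 1):
        w = sequence[start:start + window]
        inputs = []
        for offset in range(1, half + 1):
            inputs.append(w[half - offset])  # left neighbor
            inputs.append(w[half + offset])  # right neighbor
        yield "".join(inputs), w[half]


# Letters A..K stand in for a1..a11 so the ordering is easy to read.
pairs = list(prediction_examples("ABCDEFGHIJK"))
print(pairs)  # [('EGDHCIBJAK', 'F')]
```

Ordering the inputs by distance from the middle means the first levels of the tree condition on the closest neighbors.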

Classification of a Sequence using an SMT Prediction Model
- Computation of the likelihood for an unknown sequence
  - A sequence is classified into a family by computing the likelihood of the fit for each of the 175 models.
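
Scoring a sequence against per-family models can be sketched as below. The models here are toy stand-ins, and the plain sum of log-probabilities is an assumption: the slides do not say whether the score is normalized, for example by sequence length. The names `log_likelihood`, `classify`, `fam_a`, and `fam_b` are hypothetical.

```python
import math


def log_likelihood(model, examples):
    """Sum of log P(output | inputs) over all windows of one sequence."""
    return sum(math.log(model(inputs, output)) for inputs, output in examples)


def classify(models, examples):
    """Return the best-scoring family and all scores."""
    scores = {family: log_likelihood(m, examples) for family, m in models.items()}
    return max(scores, key=scores.get), scores


# Toy models: each is a callable (inputs, output) -> probability.
models = {
    "fam_a": lambda inputs, output: 0.6 if output == "A" else 0.2,
    "fam_b": lambda inputs, output: 0.2 if output == "A" else 0.6,
}
examples = [("CC", "A"), ("AC", "A"), ("CA", "C")]

best, scores = classify(models, examples)
print(best)  # fam_a
```

Working in log space avoids numerical underflow when thousands of window probabilities are multiplied.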

Building the SMT Classifier Model
- Estimation of the probability over protein families given a sequence of amino acids
  - Input sequence: an amino acid sequence from a protein family
  - Output symbol: the protein family name
- A sliding window of 10 amino acids: a1, …, a10
- MAX_DEPTH = 5 and MAX_PHI = 1

Classification of a Sequence using an SMT Classifier
- Each position of the sequence gives us a probability distribution over the 175 families, measuring how likely it is that the substring originated from each family.
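
The slides do not state how the per-position distributions are combined into one decision. The sketch below assumes one simple option, multiplying the probabilities across positions (in log space) and renormalizing; the function name `combine` and the family names are hypothetical.

```python
import math


def combine(per_position):
    """Combine per-position family distributions into one normalized score.

    per_position -- list of dicts mapping family -> probability (all nonzero,
                    which smoothed predictors guarantee)
    Assumption: positions are combined by a product of probabilities.
    """
    families = list(per_position[0])
    logs = {f: sum(math.log(d[f]) for d in per_position) for f in families}
    peak = max(logs.values())  # subtract the max for numerical stability
    unnorm = {f: math.exp(v - peak) for f, v in logs.items()}
    total = sum(unnorm.values())
    return {f: u / total for f, u in unnorm.items()}


positions = [
    {"kinase": 0.7, "globin": 0.3},
    {"kinase": 0.6, "globin": 0.4},
    {"kinase": 0.2, "globin": 0.8},
]
combined = combine(positions)
winner = max(combined, key=combined.get)
print(winner)  # globin
```

Here a single strongly discriminative position outweighs two weakly opposing ones, which is the characteristic behavior of a product rule; an averaging rule would behave differently.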

Results
- Time-space-performance tradeoffs
- Results of protein classification using SMTs
  - The SMT models outperform the PST models.
  - SMT Classifier > SMT Prediction > PST Prediction

Discussion

Sparse Markov Transducers (SMTs)
- We have presented two methods for protein classification using sparse Markov transducers (SMTs).

Future Work
- Incorporating biological information into the model, such as Dirichlet mixture priors
- Combining a generative and discriminative model
- Using both positive and negative examples in training