Localization prediction of transmembrane proteins
Stefan Maetschke, Mikael Bodén and Marcus Gallagher
The University of Queensland
Protein classes
Protein
  Membrane
    Integral
      Transmembrane
        β-barrel
        α-helical
          Single-spanning
          Multi-spanning
    Peripheral
    Anchored
  Soluble
Maetschke et al, The University of Queensland
Transmembrane protein types
[Figure: membrane orientation of the transmembrane protein types Type-I (with signal peptide), Type-II, Type-III and Type-IV (multi-spanning), with N- and C-termini marked; the cytosol is the inside.]
Eukaryotic cell
Peroxisome
Nucleus
Mitochondrion
RNA
Ribosome
Endoplasmic Reticulum
ERGIC
Golgi Complex
Lysosome
Endosome
Secretory and endocytic pathway
Problem and hypothesis
• Sorting signals for transmembrane proteins serve multiple purposes (targeting, retention, retrieval, avoidance) and are largely unknown, so the problem is challenging and multifaceted.
• Current localization prediction for eukaryotic transmembrane proteins is poor: models based on soluble proteins are ill-suited, so previous work is inadequate and incomplete.
• Localization prediction specifically for transmembrane proteins is virtually unexplored because data are scarce and highly variable; it remains an open problem.
• Hypothesis: explicit modelling of protein topology should enhance localization prediction accuracy, because it guides parameter tuning towards biologically sensible solutions.
Hidden Markov model
Observation sequence: $O = o_1 o_2 \dots$ (one amino acid per position)
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Initial state probabilities: $\pi = \{\pi_i\},\ \pi_i = P(q_1 = S_i)$
State transition probabilities: $A = \{a_{ij}\},\ a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$
Observation probabilities: $B = \{b_i(k)\},\ b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Figure: three states S1, S2, S3 with self-transitions a11, a22, a33 and forward transitions a12, a23. Each state Si has an emission distribution bi over the 20 amino acids, indexed A = 1, R = 2, ..., V = 20.]
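The definitions of π, A and B above are enough to score a sequence with the standard forward algorithm. The following is a minimal sketch under stated assumptions: the three-state left-to-right layout follows the slide, but every number, the two-letter toy alphabet and the function name `forward_likelihood` are made up for illustration and are not taken from the authors' model.

```python
from itertools import product

def forward_likelihood(obs, pi, A, B):
    """Return P(obs | model) via the forward algorithm.

    obs     : non-empty list of 0-based symbol indices k
    pi[i]   = P(q_1 = S_i)
    A[i][j] = P(q_t = S_j | q_{t-1} = S_i)
    B[i][k] = P(o_t = V_k | q_t = S_i)
    """
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: sum over all states at the last position
    return sum(alpha)

# Hypothetical parameters (assumption): 3 states, toy alphabet of 2 symbols
# instead of the 20 amino acids.
pi = [1.0, 0.0, 0.0]
A = [[0.7, 0.3, 0.0],   # a11, a12
     [0.0, 0.8, 0.2],   # a22, a23
     [0.0, 0.0, 1.0]]   # a33
B = [[0.9, 0.1],        # b1
     [0.2, 0.8],        # b2
     [0.5, 0.5]]        # b3

likelihood = forward_likelihood([0, 0, 1, 1], pi, A, B)

# Sanity check: likelihoods of all sequences of one length sum to 1.
total = sum(forward_likelihood(list(s), pi, A, B)
            for s in product(range(2), repeat=3))
```

For real protein sequences the products underflow quickly, so a practical version would work with scaled or log probabilities.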
2-order Hidden Markov model
Observation sequence: $O = o_1 o_2 \dots$ (one dipeptide per position)
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Initial state probabilities: $\pi = \{\pi_i\},\ \pi_i = P(q_1 = S_i)$
State transition probabilities: $A = \{a_{ij}\},\ a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$
Observation probabilities: $B = \{b_i(k)\},\ b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Figure: three states S1, S2, S3 with self-transitions a11, a22, a33 and forward transitions a12, a23. Each state Si has an emission distribution bi over the 400 dipeptides, indexed AA = 1, AR = 2, AN = 3, AD = 4, ..., VV = 400.]
3-order Hidden Markov model
Observation sequence: $O = o_1 o_2 \dots$ (one tripeptide per position)
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Initial state probabilities: $\pi = \{\pi_i\},\ \pi_i = P(q_1 = S_i)$
State transition probabilities: $A = \{a_{ij}\},\ a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$
Observation probabilities: $B = \{b_i(k)\},\ b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Figure: three states S1, S2, S3 with self-transitions a11, a22, a33 and forward transitions a12, a23. Each state Si has an emission distribution bi over the 8000 tripeptides, indexed AAA = 1, AAR = 2, AAN = 3, AAD = 4, AAC = 5, AAQ = 6, ..., VVV = 8000.]
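The emission alphabets of the higher-order models (400 dipeptides, 8000 tripeptides) can be indexed with a base-20 positional code. The sketch below reproduces the indices visible on the slides (AA = 1, AR = 2, ..., VV = 400 and AAA = 1, ..., VVV = 8000). Two things are assumptions: the full residue order `ARNDCQEGHILKMFPSTWYV`, of which the slides show only the first six letters and the last one, and the use of overlapping windows in the hypothetical helper `to_observations`.

```python
# Assumed residue order: the slides only show A, R, N, D, C, Q first and V last.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
RANK = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def kmer_index(kmer):
    """Map a k-mer to its 1-based symbol index (base-20 positional code)."""
    index = 0
    for aa in kmer:
        index = index * 20 + RANK[aa]
    return index + 1

def to_observations(sequence, k):
    """Turn a sequence into symbol indices of overlapping k-mers (assumption)."""
    return [kmer_index(sequence[i:i + k])
            for i in range(len(sequence) - k + 1)]

dipeptides = to_observations("AARV", 2)   # windows AA, AR, RV
```

The alphabet grows as 20^k, so each state of a 3-order model carries 8000 emission parameters against 20 for the first-order model.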
Signal peptide
[Figure: signal peptide with three regions (N-terminal region, hydrophobic core, cleavage region), followed by the mature protein.]
Transmembrane domain
icap
TMD
ocap
Protein topology model
[Figure: topology model built from the submodels SP, N-term, outside, ocap, TMD, icap, inside and C-term.]
Localization model (5 x topology models)
Peroxisome
Nucleus
Mitochondrion
ERGIC
Endoplasmic Reticulum
Lysosome
Golgi Complex
Endosome
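With one topology model per location, a natural decision rule is to assign a protein to the location whose model scores it highest. The slides do not state the rule, so the sketch below is an assumption, and the toy scorers are hypothetical stand-ins for the five trained topology models.

```python
import math

def classify(sequence, models):
    """Return the location whose model gives the highest log-likelihood.

    models: dict mapping location name -> function(sequence) -> log-likelihood
    """
    return max(models, key=lambda location: models[location](sequence))

def make_toy_scorer(favoured):
    """Hypothetical stand-in for a trained model: an i.i.d. residue model
    that gives the favoured residue probability 0.5 and the other 19
    residues 0.5 / 19 each."""
    def score(sequence):
        return sum(math.log(0.5 if aa == favoured else 0.5 / 19)
                   for aa in sequence)
    return score

# Five locations as abbreviated on the slides; the favoured residues are invented.
models = {
    "ER": make_toy_scorer("K"),
    "PM": make_toy_scorer("L"),
    "GO": make_toy_scorer("G"),
    "EN": make_toy_scorer("E"),
    "LY": make_toy_scorer("Y"),
}

predicted = classify("LLLKL", models)
```

In a real predictor each scorer would be the log-likelihood of the sequence under the HMM trained for that location.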
LOCATE dataset
Subset of the LOCATE database
FANTOM3, mouse proteome

Location                 Proteins
Plasma Membrane               873
Endoplasmic Reticulum         261
Golgi Complex                 141
Lysosome                       45
Endosome                       31
Total                        1351

• Filtered for transmembrane proteins
• No multi-targeted proteins
• Redundancy reduced (<25%)
• TMDs and SPs are labeled (predicted)
• High-quality localization annotation
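The class counts are strongly imbalanced, which matters when reading any accuracy figure. A short sketch of the arithmetic, using only the counts from the slide (the variable names are illustrative):

```python
counts = {
    "Plasma Membrane": 873,
    "Endoplasmic Reticulum": 261,
    "Golgi Complex": 141,
    "Lysosome": 45,
    "Endosome": 31,
}

total = sum(counts.values())
# Share of each location in the dataset
fractions = {location: n / total for location, n in counts.items()}
# Accuracy of a trivial predictor that always answers the largest class
majority_class = max(counts, key=counts.get)
majority_baseline = counts[majority_class] / total
```

Always answering "Plasma Membrane" would already be right for roughly 65% of the proteins, which is one reason imbalance-robust measures such as a correlation coefficient are often preferred over raw accuracy.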
Prediction performance
[Figure: bar chart of prediction performance (MCC, axis from 0 to 0.5) for SVM-1, SVM-2, HMM-1, HMM-2 and HMM-3, and the confusion matrix for HMM-2.]
LOCATE dataset
• Mean correlation coefficient
• 10-fold cross-validation, repeated 10 times
• Five locations (ER, PM, GO, EN, LY)
• SVM: linear kernel
• 1-, 2- and 3-order HMMs
=> Di-peptide composition superior to single amino acid composition
=> Topological model superior to non-topological model
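A correlation coefficient for a multi-class predictor can be computed from the confusion matrix. The sketch below assumes that MCC stands for the Matthews correlation coefficient, computed one-vs-rest for each location and then averaged. The slides do not spell this out, and the confusion matrix here is invented for illustration.

```python
import math

def mcc_one_vs_rest(confusion, c):
    """Matthews correlation coefficient for class c.

    confusion[r][p] = number of samples with true class r predicted as p.
    """
    n = len(confusion)
    tp = confusion[c][c]
    fn = sum(confusion[c]) - tp
    fp = sum(confusion[r][c] for r in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a marginal is empty and MCC is undefined
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mean_mcc(confusion):
    """Average of the per-class one-vs-rest MCC values."""
    n = len(confusion)
    return sum(mcc_one_vs_rest(confusion, c) for c in range(n)) / n

# Invented 3-class confusion matrix (rows: true class, columns: predicted)
toy = [[8, 2, 0],
       [1, 7, 2],
       [0, 1, 9]]

score = mean_mcc(toy)
```

A perfect predictor scores 1, and a predictor no better than chance scores about 0 regardless of class sizes.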
Predictor comparison
[Figure: bar chart of prediction accuracy in % (axis from 0 to 80) for CELLO, WolfPSort, PAnalyst and HMM-2; data labels 75, 48, 33 and 18.]
Test set (20 PM, 20 ER, 20 Golgi)
• HMM: only three classes but test set  train set
• Other predictors: more classes but test set  train set
→ difficult to compare!
CELLO 2.5: http://cello.life.nctu.edu.tw/
WolfPSort: http://wolfpsort.seq.cbrc.jp/
ProteomeAnalyst 2.5: http://www.cs.ualberta.ca/~bioinfo/PA/Sub/
HMM-2: http://pprowler.itee.uq.edu.au/TMPHMMLoc
Conclusion
• Novel predictor for the subcellular localization of transmembrane proteins along the secretory pathway: http://pprowler.itee.uq.edu.au/TMPHMMLoc
• The protein model has fewer states than topology predictors (TMHMM, HMMTOP, etc.) but is of second order
• The localization model is trained and tested on LOCATE, a recent, high-quality localization dataset
• Overall better performance than current localization predictors (transmembrane proteins, eukaryotic, secretory pathway)
  – Di-peptide composition superior to single amino acid composition
  – "Topological" model superior to "non-topological" baseline model