Localization prediction of transmembrane proteins
Stefan Maetschke, Mikael Bodén and Marcus Gallagher
The University of Queensland

Protein classes
[Diagram: classification tree with nodes Protein; Membrane, Soluble; Integral, Peripheral, Anchored; Transmembrane; β-barrel, α-helical; Multi-spanning, Single-spanning]
Maetschke et al, The University of Queensland

Transmembrane protein types
[Figure: membrane orientation of Type-I (with signal peptide), Type-II, Type-III and Type-IV (multi-spanning) proteins; N- and C-termini are marked relative to the cytosol (inside)]

Eukaryotic cell
[Figure labels: Peroxisome, Nucleus, Mitochondrion, RNA, Ribosome, Endoplasmic Reticulum, ERGIC, Golgi Complex, Lysosome, Endosome]

Secretory and endocytic pathway
[Figure]

Problem and hypothesis
• Sorting signals for transmembrane proteins serve multiple purposes (targeting, retention, retrieval, avoidance) and are largely unknown (the problem is challenging/multifaceted)
• Current localization prediction of eukaryotic transmembrane proteins is poor (models based on soluble proteins are ill-suited) (previous work is inadequate/incomplete)
• Localization prediction for transmembrane proteins is virtually unexplored (paucity/variance of data) (it is an open problem)
• Explicit modelling of protein topology should enhance localization prediction accuracy (parameter tuning receives explicit guidance to biologically sensible solutions) (the way to do it!)

Hidden Markov model
Observation sequence: ...
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Initial state probabilities: $\pi_i = P(q_1 = S_i)$
State transition probabilities: $A = \{a_{ij}\}$, $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$
Observation probabilities: $B = \{b_i(k)\}$, $b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Diagram: states S1, S2, S3 with self-transitions a11, a22, a33 and forward transitions a12, a23; each state Si has an emission table bi over the 20 amino acids (A = 1, R = 2, ..., V = 20)]
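The parameters on the Hidden Markov model slide (initial, transition and observation probabilities) are enough to score a sequence with the forward algorithm. The following is a minimal sketch of that idea, not the authors' implementation. It rests on several assumptions: the function name `forward_log_likelihood`, the toy values in `PI`, `A` and `B_UNIFORM`, and the ordering of the amino acids between the first few shown on the slides and the final V are all illustrative.

```python
import math

# Assumed alphabet order: the slides show A, R, N, D, C, Q first and V last;
# the letters in between follow the common three-letter-code ordering.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def forward_log_likelihood(seq, pi, A, B):
    """Return log P(seq | model) computed with the scaled forward algorithm.

    pi[i]   = P(q_1 = S_i)
    A[i][j] = P(q_t = S_j | q_{t-1} = S_i)
    B[i][k] = P(o_t = V_k | q_t = S_i), where k indexes AMINO_ACIDS
    """
    if not seq:
        return 0.0
    n_states = len(pi)
    obs = [AMINO_ACIDS.index(c) for c in seq]
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            # The model assigns zero probability to this sequence.
            return float("-inf")
        # Rescale to avoid underflow and accumulate the log of the scale factors.
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Toy left-to-right model with three states (hypothetical numbers).
PI = [1.0, 0.0, 0.0]
A = [
    [0.7, 0.3, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
]
B_UNIFORM = [[1.0 / 20] * 20 for _ in range(3)]

print(forward_log_likelihood("ARNDV", PI, A, B_UNIFORM))
```

With uniform emission tables every residue contributes a factor of 1/20 regardless of the state path, so the result equals the sequence length times log(1/20). A classifier along these lines would train one such model per location and assign a protein to the model with the highest likelihood.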
2nd-order Hidden Markov model
Observation sequence: ...
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Initial state probabilities: $\pi_i = P(q_1 = S_i)$
State transition probabilities: $A = \{a_{ij}\}$, $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$
Observation probabilities: $B = \{b_i(k)\}$, $b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Diagram: states S1, S2, S3 with self-transitions a11, a22, a33 and forward transitions a12, a23; each emission table bi covers the 400 di-peptides (AA = 1, AR = 2, AN = 3, AD = 4, ..., VV = 400)]

3rd-order Hidden Markov model
Observation sequence: ...
State sequence: s1 s1 s1 s2 s2 s2 s2 s2 s2 s3
Initial state probabilities: $\pi_i = P(q_1 = S_i)$
State transition probabilities: $A = \{a_{ij}\}$, $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$
Observation probabilities: $B = \{b_i(k)\}$, $b_i(k) = P(o_t = V_k \mid q_t = S_i)$
[Diagram: states S1, S2, S3 with self-transitions a11, a22, a33 and forward transitions a12, a23; each emission table bi covers the 8000 tri-peptides (AAA = 1, AAR = 2, AAN = 3, AAD = 4, AAC = 5, AAQ = 6, ..., VVV = 8000)]

Signal peptide
[Figure: N-terminal region, hydrophobic core and cleavage region, followed by the mature protein]

Transmembrane domain
[Figure: icap, TMD, ocap]

Protein topology model
[Diagram labels: SP, N-term, outside, ocap, TMD, icap, inside, C-term]

Localization model (5 x topology models)
[Figure labels: Peroxisome, Nucleus, Mitochondrion, ERGIC, Endoplasmic Reticulum, Lysosome, Golgi Complex, Endosome]

LOCATE dataset
Subset of the LOCATE database:
   873  Plasma Membrane
   261  Endoplasmic Reticulum
   141  Golgi Complex
    45  Lysosome
    31  Endosome
  1351  Total
• FANTOM3, mouse proteome
• Filter for transmembrane proteins
• No multi-targeted proteins
• Redundancy reduced (<25%)
• TMDs and SPs are labeled (predicted)
• High-quality localization annotation

Prediction performance
[Bar chart: prediction performance (MCC), axis from 0 to 0.5, for SVM-1, SVM-2, HMM-1, HMM-2 and HMM-3; second panel: confusion matrix for HMM-2]
• LOCATE dataset
• Mean correlation coefficient
• 10-fold cross-validation, 10 times
• Five locations (ER, PM, GO, EN, LY)
• SVM: linear kernel
• 1st-, 2nd- and 3rd-order HMMs
=> Di-peptide composition superior to single amino acid composition
=> Topological model superior to non-topological model

Predictor comparison
[Bar chart: prediction accuracy in %, axis from 0 to 80, for CELLO, WolfPSort, PAnalyst and HMM-2; bar labels 75, 48, 33, 18]
• Test set (20 PM, 20 ER, 20 Golgi)
• HMM: only three classes, but test set [?] train set
• Other predictors: more classes, but test set [?] train set
→ difficult to compare!
CELLO 2.5: http://cello.life.nctu.edu.tw/
WolfPSort: http://wolfpsort.seq.cbrc.jp/
ProteomeAnalyst 2.5: http://www.cs.ualberta.ca/~bioinfo/PA/Sub/
HMM-2: http://pprowler.itee.uq.edu.au/TMPHMMLoc

Conclusion
• Novel predictor for subcellular localization of transmembrane proteins along the secretory pathway: http://pprowler.itee.uq.edu.au/TMPHMMLoc
• Protein model has fewer states than topology predictors (TMHMM, HMMTOP, etc.) but is of second order
• Localization model is trained and tested using LOCATE, a recent, high-quality localization dataset
• Overall better performance than current localization predictors (transmembrane proteins, eukaryotic, secretory pathway)
  – Di-peptide composition superior to single amino acid composition
  – "Topological" model superior to "non-topological" baseline model
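The comparison of single amino acid and di-peptide composition can be made concrete with a small sketch of how such composition vectors might be computed. This is an illustration under assumptions, not the feature extraction used in the talk: the helper names `kmer_index` and `kmer_composition` are hypothetical, overlapping k-mers are assumed, indices are 0-based (the slides count from 1), and the amino acid ordering beyond the letters visible on the slides is the common three-letter-code ordering.

```python
from itertools import product

# Assumed alphabet order: A, R, N, D, C, Q, ... , V (see lead-in).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def kmer_index(k):
    """Map every k-mer over the amino acid alphabet to a 0-based index.

    For k = 2 this gives AA -> 0, AR -> 1, AN -> 2, ..., VV -> 399.
    """
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}


def kmer_composition(seq, k):
    """Return the relative frequency of each overlapping k-mer in seq.

    The vector has 20**k entries: 20 for single amino acids,
    400 for di-peptides and 8000 for tri-peptides.
    """
    index = kmer_index(k)
    counts = [0] * len(index)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Skip windows containing non-standard residues such as X.
        if kmer in index:
            counts[index[kmer]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Example: di-peptide composition of a short, made-up sequence.
vector = kmer_composition("AARA", 2)
print(len(vector), sum(vector))
```

The vector length grows as 20 to the power k, which is one reason a 3rd-order model needs far more training data than a 2nd-order one on a dataset of 1351 proteins.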