Recognition of Protein Features
Limsoon Wong
Institute for Infocomm Research
BI6103 guest lecture on ?? March 2004
Copyright 2003 limsoon wong
Lecture Plan
• Membrane proteins
• Subcellular localization
Copyright 2003 limsoon wong
Recognition of Transmembrane Helices
Eukaryotic Cells
• Eukaryotic cells have membrane-bound compartments with specialized functions
Lipids & Membrane
• A membrane is a double layer of lipids and associated proteins that defines a subcellular compartment or encloses the cell
• Lipids consist of a “polar head group” and long-chain fatty acids
• This dual nature promotes formation of lipid bilayers
• “Hydrophobic tails” are shielded from the aqueous environment
• Water-soluble (i.e., charged or polar) molecules cannot pass through this impermeable barrier
• Permeability across the bilayer is regulated by membrane proteins that span the bilayer and function like channels or pores
Membrane Proteins
• Two types of membrane proteins: integral vs peripheral
• Two types of integral membrane proteins: all-α vs β-barrel
[Figure: an all-α and a β-barrel membrane protein]
Topography & Topology
• Topography: predict location of transmembrane segments
• Topology: predict location of N- and C-termini wrt lipid bilayer
• We focus on topography prediction for all-α membrane proteins
Datasets
• Jayasinghe et al. Protein Sci, 10:455-458, 2001
– 59 high resolution membrane proteins
– www.biocomp.unibo.it/gigi/ENSEMBLE
• Moller et al. Bioinformatics, 16:1159--1160, 2000
– 151 low resolution membrane proteins
• Jones et al., Biochem., 33(10):3038--3049, 1994
– 38 multi-spanning and 45 single-spanning membrane proteins
– topologies experimentally determined
• Sonnhammer et al., ISMB, 6:175-182, 1998
– 108 multi-spanning and 52 single-spanning membrane proteins
– most topologies experimentally determined, but less reliably than in Jones et al.
Monne et al., JMB, 288:141--145, 1999:
Turn Propensity Scale for TM Helices
• E. coli Lep protein contains two TM domains (H1, H2) and a C-terminal domain P2
• Translocation of P2 to the lumenal side is easy to test by glycosylation
• Replace H2 by a 40-residue poly-L segment, LIK4L21XL7VL10Q3P (digits are repeat counts; the L21-X-L7-V-L10 part spans 40 residues)
• The poly-L segment can form either one long TM helix or 2 closely spaced TM helices, depending on what is substituted for X
Monne et al., JMB, 288:141--145, 1999:
Turn Propensity Scale for TM Helices
[Figure labels: glycosylated; non-glycosylated]
• Using the poly-L segment, measure the “turn” propensity of the 20 amino acids by substituting each of them for the X in the poly-L segment
• Hydrophobic residues (I, V, L, F, C, M, A) do not induce a turn
• Charged and polar residues (except S & T) induce a turn
• Exercise:
– What are the charged/polar residues?
– What could be the reason that S & T do not induce a turn?
Monne et al., JMB, 288:141--145, 1999:
Turn Propensity Scale for TM Helices
• In all-α membrane proteins,
– hydrophobic residues prefer the membrane environment and have low turn propensity
– charged & polar residues induce turn formation to avoid the membrane interior
⇒ prediction of TM helices
⇒ distinction of 1 long TM helix vs 2 closely spaced TM helices
Monne et al., JMB, 288:141--145, 1999
Weiss et al., ISMB, 1:420--421, 1993
Hydrophobicity Approach
• Inside of cellular membrane is hydrophobic
• A segment of protein that spans the membrane is expected to contain many hydrophobic amino acids
⇒ Locate segments that have a high average “hydrophobicity” score
Monne et al., JMB, 288:141--145, 1999
Weiss et al., ISMB, 1:420--421, 1993
Hydrophobicity Approach
• Caveats:
– may be unable to distinguish the hydrophobic core of nonmembrane proteins from transmembrane regions
– what are the right thresholds?
• Procedure (the thresholds are adjustable):
– find a segment of 10 to 70 aa with hp > 0.71
– expand to a longer segment with hp > 0.35
– mark this segment as TM
– repeat the above starting from the position after the previous segment
An Example: Bacteriorhodopsin
1 gigtllmlig tfyfiargwg vtdkkareyy aitilvpgia saaylsmffg iglttvevag
61 maepleiyya ryadwlfttp lllldlalla nadrttigtl igvdalmivt gligalshtp
121 larytwwlfs tiaflfvlyy lltvlrsaaa elsedvqttf ntltalvavl wtaypilwii
181 gtegagvvgl gvetlafmvl dvta
7 transmembrane helices
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=461610&dopt=GenPept&term=bacteriorhodopsin&qty=1
Copyright 2003 limsoon wong
An Example: Bacteriorhodopsin
• After applying hydrophobicity scale...
[Figure: the same sequence, annotated after applying the hydrophobicity scale]
An Example: Bacteriorhodopsin
• Compute hydrophobicity score, hp > 7
[Figure: the same sequence, with the segments found at this threshold marked]
TM identified: 6/7, TM FP: 0
TM residue identified: 62/117, TM residue FP: 4
An Example: Bacteriorhodopsin
• Expand segment, maintain hp > 5, avoid low hydrophobicity
[Figure: the same sequence, with the expanded segments marked]
TM identified: 6/7, TM FP: 0
TM residue identified: 100/117, TM residue FP: 15
Sonnhammer et al., ISMB, 6:175-182, 1998:
TMHMM, A HMM Approach
• There are 3 main locations of a residue:
– TM helix core (viz., in the hydrophobic tail region of the membrane)
– TM helix cap (viz., in the head region of the membrane)
• cytoplasmic vs
• non-cytoplasmic side of the helix core
– loops
• cytoplasmic vs
• non-cytoplasmic (short) vs
• non-cytoplasmic (long)
⇒ So we need an HMM with 7 types of states
• Exercise: What is the 7th state for?
Sonnhammer et al., ISMB, 6:175-182, 1998:
TMHMM, Architecture
[Figure: TMHMM architecture, with cytoplasmic (cyto) and non-cytoplasmic (non-cyto) sides]
Each state has an associated probability distribution over the 20 amino acids, characterizing the variability of amino acids in the region it models
Sonnhammer et al., ISMB, 6:175-182, 1998:
TMHMM, Architecture
• The first 3 and last 2 core states have to be traversed, but all other core states can be bypassed
• This models core regions of 5--25 residues
Sonnhammer et al., ISMB, 6:175-182, 1998:
TMHMM, Architecture
[Figure annotations: “To model neutral amino acid distribution”; “To model bias in amino acid usage near cap”]
• The states of globular, loop, & cap regions
• The caps are 5 residues each. Since the core is 5--25 residues, this allows for helices 15--35 residues long
Sonnhammer et al., ISMB, 6:175-182, 1998:
TMHMM, Training the HMM
• Stage 1: Baum-Welch is used for maximum likelihood estimation from “diluted” labeled training data. As the precise end of a TM helix is only approximately known, we “dilute” by unlabeling 3 residues on each side of a helix boundary to accommodate this
• Stage 2: Baum-Welch is used for maximum likelihood estimation from “relabeled” training data. The original training data are diluted by unlabeling 5 residues on each side of a helix boundary. The model from Stage 1 is used to produce “relabeled training data” by relabeling this part under the constraints of the remaining labels
• Stage 3: The model from Stage 2 is further tuned by a method for “discriminative” training, to maximize the probability of correct prediction (Krogh, ISMB, 5:179--186, 1997)
Krogh, ISMB, 5:179--186, 1997:
Discriminative HMM Training
Copyright 2003 limsoon wong
Sonnhammer et al., ISMB, 6:175-182, 1998:
TMHMM, Example
[Figure: example TMHMM output; legend: non-cytoplasmic, TM segment, cytoplasmic]
Datasets
• Jones et al., Biochem., 33(10):3038--3049, 1994
• Sonnhammer et al., ISMB, 6:175--182, 1998
Sonnhammer et al., ISMB, 6:175-182, 1998:
TMHMM, Accuracy (10-CV)
[Table: 10-fold cross-validation results, including precision, under two criteria]
• All TM segments & their orientation correctly predicted
• All TM segments correctly predicted, ignoring orientation
Martelli et al. Bioinformatics, 19:i205--i211, 2003
ENSEMBLE
[Figure: NN, HMM1, and HMM2 feed into ENSEMBLE]
ENSEMBLE:
The Neural Network Part
[Figure: cascade of feed-forward back-propagation neural networks. Labels: window positions 1 to 17; 17 * 20 input units; 15 hidden units; 17*2 inputs to the second network; outputs HMM and LOOP]
• The NN part is a cascade shown above, a la Rost et al., Protein Science, 1995
ENSEMBLE:
The HMM1 Part
• HMM1 models the hydrophobic nature of most TM helices, a la Krogh et al., JMB 2001 & Sonnhammer et al., ISMB 1998
ENSEMBLE:
The HMM2 Part
• HMM2 models TM helices that are a mix of hydrophobic and hydrophilic residues, a la Martelli et al., Bioinformatics 2002
ENSEMBLE:
Predicting if a residue is in TM
[Figure: per-position outputs of NN, HMM1, HMM2, and ENSEMBLE; states are helix (H) and loop (L, or inner I and outer O)]
• NN(p,i) = NN(H,p,i) - NN(L,p,i)
• HMM1(p,i) = AP1(H,p,i) - AP1(I,p,i) - AP1(O,p,i)
• HMM2(p,i) = AP2(H,p,i) - AP2(I,p,i) - AP2(O,p,i)
• E(p,i) = (NN(p,i) + HMM1(p,i) + HMM2(p,i)) / 3
E(p,i) > 0 means residue i of protein p is in a TM helix
Ensemble: Topography Prediction
Fariselli et al., Bioinformatics, 2003
[Figure: outputs of NN, HMM1, HMM2, and ENSEMBLE, segmented by MaxSubSeq. Annotations: “TM helix found by MaxSubSeq but would be missed w/o it”; “This path is taken means positions m to j form a helix”]
Ensemble:
Topography Prediction Results
[Figure: bar chart of topography prediction accuracy (y-axis 60% to 90%) on the Jayasinghe (CV) and Moller datasets for NN, HMM1, HMM2, ENSEMBLE, TMHMM2.0, MEMSAT, PHD, and HMMTOP]
A prediction is considered correct if
(a) the number of TM segments is correct and
(b) the overlap between a predicted and a real TM segment > 8aa
Topology Prediction: Positive-Inside Rule
Gavel et al., FEBS, 282:41--46, 1991
• Positively charged residues (Lys and Arg) are enriched more than 2-fold in stromal vs luminal loops
Topology Prediction:
Ensemble
“positive-inside” rule
Copyright 2003 limsoon wong
Ensemble:
Topology Prediction Results
[Figure: bar chart of topology prediction accuracy (y-axis 40% to 80%) on the Jayasinghe (CV) and Moller datasets for ENSEMBLE (rule 1), ENSEMBLE (rule 4), TMHMM2.0, MEMSAT, PHD, and HMMTOP]
Short Break
Copyright 2003 limsoon wong
Subcellular Localization
Copyright 2003 limsoon wong
Compartments and Sorting
• Eukaryotic cells require proteins to be targeted to their subcellular destinations
• Protein sorting is determined by specific amino acid sequences, or “signals”, within the protein
• The secretory pathway targets proteins to the plasma membrane, to some membrane-bound organelles such as lysosomes, or for export from the cell
Secretory Pathway
• The secretory pathway consists of the endoplasmic reticulum (ER), Golgi apparatus, and transport vesicles
• The transport vesicles carry proteins from one compartment to the other
• Exocytosis is mediated by fusion of secretory vesicles with the plasma membrane
• Endocytosis is the opposite of exocytosis and involves the uptake of extracellular material by pinching off vesicles from the plasma membrane
• The contents of the endocytic vesicles are delivered to the lysosomes by membrane fusion
• Lysosomes contain hydrolytic enzymes that break down macromolecules into smaller subunits, which can be utilized by the cell for its own biosynthesis
Datasets
• Reinhardt & Hubbard, NAR, 26:2230--2236, 1998
– 2427 eukaryotic proteins for 4 locations (cytoplasmic, extracellular, nuclear, & mitochondrial)
– 997 prokaryotic proteins for 3 locations (cytoplasmic, extracellular, & periplasmic)
• Park & Kanehisa, Bioinformatics, 19:1656--1663, 2003
– 7589 eukaryotic proteins from 709 organisms for 12 locations (chloroplast, cytoplasmic, cytoskeleton, ER, extracellular, Golgi, lysosomal, mitochondrial, nuclear, peroxisomal, plasma membrane, vacuolar)
• Chou & Cai, JBC, 277:45765--45769, 2002
– 2191 proteins for 12 locations
• Emanuelsson et al., JMB, 300:1005--1016, 2000
• Gardy et al., NAR, 31:3613--3617, 2003
Common Eukaryotic Protein
Sorting Signals
For a comprehensive list of cellular localization sites, see
http://mendel.imp.univie.ac.at/CELL_LOC/index.html
Copyright 2003 limsoon wong
Schematic View of Sorting Signals
[Figure labels: ~25aa; cleavage site]
Sequence Logos of SP, mTP, & cTP
[Figure: sequence logos for SP (signal peptide), mTP (mitochondrial transfer peptide), and cTP (chloroplast transit peptide)]
Neural Network Approach: TargetP
Emanuelsson et al., JMB, 300:1005--1016, 2000
• cTP, mTP, SP networks
– 4 hidden units
– feedforward NNs
– input windows: 55aa (cTP), 35aa (mTP), 27aa (SP), sparsely encoded
• Integrating network
– 0 hidden units
– feedforward NN
– input is taken from the outputs of the cTP, mTP, SP networks over 100aa at the N-terminal
cTP: chloroplast transit peptide, mTP: mitochondrial transfer peptide, SP: signal peptide
TargetP:
Performance
Dataset: Emanuelsson et al., JMB, 2000
Copyright 2003 limsoon wong
Expert System Approach: PSORT
Horton & Nakai, ISMB, 1997
[Figure: a simplified version of the decision tree that PSORT uses to check and reason over various sorting signals]
A Refinement: PSORT-B
Gardy et al., NAR, 31:3613--3617, 2003
• Sites considered
– cytoplasm
– inner membrane
– periplasm
– outer membrane
– extracellular space
[Figure: six modules (SCL-BLAST, Motifs, HMMTOP, Outer Membrane Protein, SubLocC, Signal Peptides) feed a Bayesian Network, which outputs localization sites or “unknown”]
PSORT-B:
SCL-BLAST
• Homology to a protein of known localization is a good indicator of a protein’s actual localization site
⇒ BLAST the target protein against a database of proteins whose localization sites are known
⇒ Return localization sites of hits at E-value of 10e-10 over 80% of length
PSORT-B:
Motifs
• Some motifs in PROSITE may be able to identify subcellular localization with 100% precision
⇒ Scan the target protein against a database of such motifs (28 such 100%-precision motifs are known)
⇒ Return localization sites corresponding to the motif hits
PSORT-B:
HMMTOP
• An α-helical transmembrane region is a reliable indicator of localization to the inner membrane
⇒ Scan the target protein for transmembrane α-helices using HMMTOP
⇒ Return localization site as “inner membrane” if >2 α-helices found
PSORT-B:
Outer Membrane Proteins
• Outer-membrane proteins have a characteristic β-barrel structure
⇒ Identify frequent subsequences (freq seq) occurring only in β-barrel proteins (279 such freq seq known)
⇒ Scan the target protein for these freq seq
⇒ Return localization site as “outer membrane” if >2 such freq seq found
PSORT-B:
SubLocC
• Overall amino acid composition is useful for recognizing cytoplasmic proteins
⇒ Train an SVM on overall amino acid composition to predict cytoplasmic vs non-cytoplasmic, as in SubLoc
⇒ Analyze the target protein’s amino acid composition using this SVM
PSORT-B:
Signal Peptides
• Presence of a signal peptide at the N-terminal means the protein is not cytoplasmic
⇒ Train an HMM and an SVM to recognize signal peptides and their cleavage sites
⇒ If a high-confidence cleavage site is found by the HMM in the first 70aa of the target protein, then “non-cytoplasmic”
⇒ If a low-confidence cleavage site is found, pass the candidate signal peptide to the SVM to confirm
⇒ If confirmed, then “non-cytoplasmic”
⇒ Otherwise, “unknown”
PSORT-B:
Bayesian Network
• The Bayesian Network integrates results from the 6 modules
• It produces a score for each of the 5 possible localization sites
• If a site scores >7.5, then it is predicted as a localization site of the target protein
• If no site scores >7.5, then no prediction is made
PSORT-B:
Performance of Individual Modules
Dataset: Gardy et al., NAR, 2003
Copyright 2003 limsoon wong
PSORT-B:
Performance wrt Localization Sites
PSORT-B is a considerable improvement over original PSORT
Dataset: Gardy et al., NAR, 2003
Copyright 2003 limsoon wong
PSORT vs PSORT-B:
Some Remarks
• PSORT considers various signals/features in a top-down way, driven by its reasoning tree
• PSORT-B generates all signals/features in a bottom-up way, then integrates them for decision making using a Bayesian Network
• Machine learning “beats” human expert? Probably the number of features/rules needed is too large or too complicated
Amino acid compositions of proteins residing in different sites are different
Amino Acid Composition Differences
• Each cellular location has its own characteristic physio-chemical environment
• Proteins in each location have adapted thru evolution to that environment
• This is thus reflected in the protein structure and amino acid composition
• If the above is true, the amino acid composition differences wrt cellular location sites should be more pronounced on protein surfaces than in the protein interior
• Exercise: Why?
Adaptation of Protein Surfaces
Andrade et al., JMB, 1998
• To test the theory of adaptation of protein surfaces to subcellular localization, we plot 3 types of composition vectors along their first two principal components
[Figure annotation: proportion of jth amino acid type in ith protein]
Adaptation of Protein Surfaces
Andrade et al., JMB, 1998
[Figure: three plots, one each for the total, surface, and interior amino acid composition vectors]
• Clearly total & surface composition vectors show better separation than interior composition vectors
Amino Acid Composition
• This means we can use amino acid composition vectors, especially those from protein surfaces, to predict subcellular localization!
• Let’s see how this turns out...
Neural Networks: NNPSL
Reinhardt & Hubbard, NAR, 26:2230--2236, 1998
[Figure: neural network with 20 inputs (Input1 to Input20), each the fraction of one amino acid in the input protein, and 4 outputs: cytoplasmic, extracellular, mitochondrial, nuclear]
NNPSL:
Performance
• Outputs of NNPSL have values 0 to 1. The difference (Δ) between the highest and the next highest output nodes can be used as a reliability index
[Table: accuracy by reliability bin (0 < Δ < 0.2, 0.2 < Δ < 0.4, 0.4 < Δ < 0.6, 0.6 < Δ < 0.8, 0.8 < Δ < 1); values not preserved]
Dataset: Reinhardt & Hubbard, NAR, 1998
Performance
Emanuelsson, BIB, 3:361--376, 2002
(940 proteins)
(2738 proteins)
Dataset: Emanuelsson et al., JMB, 2000
Copyright 2003 limsoon wong
Markov Chain
Yuan, FEBS Letters, 451:23--26, 1999
Why?
Copyright 2003 limsoon wong
Markov Chain: Performance (Eukaryotic)
[Figure: comparison of NNPSL and 4th-order Markov chain]
Dataset: Reinhardt & Hubbard, NAR, 1998
Support Vector Machines: SubLoc
Hua & Sun, Bioinformatics, 17:721--728, 2001
[Figure: a 20-dimensional vector giving the amino acid composition of the input protein is fed to four SVMs (nuclear vs rest, mitochondrial vs rest, extracellular vs rest, cytoplasmic vs rest); the prediction is Argmax_X of the X-vs-rest outputs]
The SVMs use
• polynomial kernel with d = 9 (prokaryotic), K(Xi, Xj) = (Xi · Xj + 1)^d
• RBF kernel with γ = 16 (eukaryotic), K(Xi, Xj) = exp(-γ |Xi - Xj|^2)
SubLoc:
Performance
[Table: NNPSL vs SubLoc (eukaryotic); values not preserved]
Dataset: Reinhardt & Hubbard, NAR, 1998
SubLoc: Robustness of Amino Acid Composition Approach
• Amazingly, accuracy of SubLoc is virtually unaffected when the first 10, 20, 30, & 40 amino acids in a protein are deleted
• Amino acid composition is a robust indicator of subcellular localization, and is insensitive to errors in N-terminal sequences
Amino Acid Composition:
Taking it Further
• How about pairs of consecutive amino acids (a.k.a. 2-grams)? How about 3-grams, …, k-grams?
• How about pseudo amino acid composition?
• How about presence of entire functional domains? (i.e., think of the presence/absence of a functional domain as a summary of amino acid sequence info...)
Functional Domain Composition
Chou & Cai, JBC, 277:45765--45769, 2002
[Figure: training seqs of various localization sites are BLASTed against a db of known functional domains (SBASE-A) to build vectors in which xi = 1 means the ith domain is present; these vectors, plus amino acid composition, are used to train an SVM]
Functional Domain Composition:
Performance
Dataset: Reinhardt & Hubbard, NAR, 1998
• Not so good
• Why? Number of known domains in SBASE-A too small
⇒ Need to handle the situation where a protein has no hit in known domains
Functional Domain Composition
Cai & Chou, BBRC, 305:407--411, 2003
• If a protein gets a hit in Interpro, use NN-5875D; else use NN-40D
[Figure: training seqs of various localization sites are BLASTed against a db of known functional domains (Interpro). NN-5875D: train k-NN (k=1) using the resulting functional domain vectors. NN-40D: if no hit is found, train k-NN (k=1) using vectors of amino acid composition and pseudo amino acid composition]
Functional Domain Composition:
Performance
Dataset: Reinhardt & Hubbard, NAR, 1998
Copyright 2003 limsoon wong
Notes
Copyright 2003 limsoon wong
References (Transmembrane)
• Weiss et al., “Transmembrane segment prediction from protein sequence data”, ISMB, 1:420--421, 1993
• Gavel et al., “The positive-inside rule applies to thylakoid membrane proteins”, FEBS, 282:41--46, 1991
• Monne et al., “A turn propensity scale for transmembrane helices”, JMB, 288:141--145, 1999
• Sonnhammer et al., “A hidden Markov model for predicting transmembrane helices in protein sequences”, ISMB, 6:175--182, 1998
• Martelli et al., “An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins”, Bioinformatics, 19(suppl):i205--i211, 2003
• Von Heijne, “Membrane protein structure prediction”, JMB, 225:487--494, 1992
• Jacoboni et al., “Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor”, Protein Sci., 10:779--787, 2001
• Martelli et al., “A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins”, Bioinformatics, 18:S46--S53, 2002
• Moller et al., “Evaluation of methods for the prediction of membrane spanning regions”, Bioinformatics, 17:646--653, 2001
• Fariselli et al., “MaxSubSeq: an algorithm for segment-length optimization. The case study of the transmembrane spanning segments”, Bioinformatics, 19:500--505, 2003
• Rost et al., “Transmembrane helices predicted at 95% accuracy”, Protein Sci., 4:521--533, 1995
• Krogh et al., “Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes”, JMB, 305:567--580, 2001
• Andersson et al., “Different positively charged amino acids have similar effects on the topology of a polytopic transmembrane protein in E. coli”, JBC, 267:1491--1495, 1992
References (Subcellular Localization)
• Horton & Nakai, “Better prediction of protein cellular localization sites with the k-nearest neighbours classifier”, ISMB, 5:147--152, 1997
• Gardy et al., “PSORT-B: Improving protein subcellular localization for Gram-negative bacteria”, NAR, 31:3613--3617, 2003
• Emanuelsson, “Predicting protein subcellular localization from amino acid sequence information”, BIB, 3:361--376, 2002
• Andrade et al., “Adaptation of protein surfaces to subcellular location”, JMB, 276:517--525, 1998
• Yuan, “Prediction of protein subcellular locations using Markov chain models”, FEBS Letters, 451:23--26, 1999
• Emanuelsson et al., “ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites”, Protein Sci., 8:978--984, 1999
• Emanuelsson et al., “Predicting subcellular localization of proteins based on their N-terminal amino acid sequence”, JMB, 300:1005--1016, 2000
• Hua & Sun, “Support vector machine approach for protein subcellular localization prediction”, Bioinformatics, 17:721--728, 2001
• Reinhardt & Hubbard, “Using neural networks for prediction of the subcellular location of proteins”, NAR, 26:2230--2236, 1998
• Cai & Chou, “Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition”, BBRC, 305:407--411, 2003
• Chou & Cai, “Using functional domain composition and support vector machines for prediction of protein subcellular location”, JBC, 277:45765--45769, 2002
• Park & Kanehisa, “Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs”, Bioinformatics, 19:1656--1663, 2003