BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 21 2005, pages 3963–3969
doi:10.1093/bioinformatics/bti650
Sequence analysis
pTARGET: a new method for predicting protein subcellular
localization in eukaryotes
Chittibabu Guda1,2,* and Shankar Subramaniam3,4,5
1Gen*NY*sis Center for Excellence in Cancer Genomics, 2Department of Epidemiology and Biostatistics,
University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY 12144-3456, USA and
3San Diego Supercomputer Center, 4Department of Bioengineering and 5Department of Chemistry and
Biochemistry, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Received on May 3, 2005; revised on and accepted on August 26, 2005
Advance Access publication September 6, 2005
ABSTRACT
Motivation: There is a scarcity of efficient computational methods for
predicting protein subcellular localization in eukaryotes. Currently available methods have several limitations that make them inadequate for genome-scale predictions. Here, we present a new prediction method, pTARGET, that
can predict proteins targeted to nine different subcellular locations in
eukaryotic animal species.
Results: The nine subcellular locations predicted by pTARGET include
cytoplasm, endoplasmic reticulum, extracellular/secretory, Golgi, lysosomes, mitochondria, nucleus, plasma membrane and peroxisomes.
Predictions are based on the location-specific protein functional
domains and the amino acid compositional differences across different
subcellular locations. Overall, this method can predict 68–87% of the
true positives at accuracy rates of 96–99%. Comparison of the prediction performance against PSORT showed that pTARGET prediction
rates are higher by 11–60% in 6 of the 8 locations tested. In addition,
the pTARGET method is robust enough for genome-scale prediction
of protein subcellular localizations since it does not rely on the presence
of signal or target peptides.
Availability: A public web server based on the pTARGET method is
accessible at the URL http://bioinformatics.albany.edu/~ptarget.
Datasets used for developing pTARGET can be downloaded from
this web server. Source code will be available on request from the
corresponding author.
Contact: [email protected]
Supplementary data: Accessible as online-only from the publisher.
INTRODUCTION
Protein subcellular localization, consequent to protein sorting or
protein trafficking, is a key functional characteristic of proteins.
The eukaryotic cell is a highly ordered structure where nucleus-encoded proteins are synthesized in the cytoplasm and all
non-cytosolic proteins are transported to their destined subcellular
locations. Subcellular localization of proteins in the intended compartments is vital for the structural and functional integrity of the
cell. Therefore, comprehensive knowledge on the subcellular
localization of proteins is essential for understanding their roles
and interacting partners in cellular metabolism. Exhaustive experimental studies have been carried out in yeast to elicit the subcellular
localization of the entire proteome (Kumar et al., 2002; Huh et al.,
2003); however, such exhaustive efforts are not practicable in all species.
Consequently, experimental annotation of protein subcellular localization cannot keep up with the large number of sequences that
continue to emerge from the genome sequencing projects. To bridge
this gap, there is a need to develop fast, accurate and genome-scale computational methods for predicting subcellular localization
of proteins.
*To whom correspondence should be addressed.
Several computational methods have been developed over the
past decade for predicting subcellular localization of eukaryotic
proteins. These methods are broadly classified into four groups.
(1) Methods based on the sorting signals rely on the presence of
protein targeting or signal peptides that are recognized by location-specific transport machinery to enable their entry (Nielsen et al.,
1997; Nakai and Horton, 1999; Emanuelsson et al., 2000). Among
these, PSORT is a popular method (Nakai and Horton, 1999) that
could predict proteins targeted to 12 different subcellular locations.
Nevertheless, these methods can predict only those proteins with
known sorting signals. (2) Methods based on the differences in the
amino acid composition or amino acid properties of proteins from
different subcellular locations. These methods use hydrophobicity
index of amino acids (Feng and Zhang, 2001), amino acid composition (Cedano et al., 1997; Reinhardt and Hubbard, 1998; Feng,
2000; Hua and Sun, 2001; Cui et al., 2004), etc.; however, the
overall prediction accuracy of these methods is rather low. (3)
Methods based on lexical analysis of keywords (LOCkey) from
the functional annotation of proteins (Nair and Rost, 2002). The
reliability of this method depends on the consistency and the accuracy of keyword assignments given to the proteins. (4) The fourth
group of prediction methods uses phylogenetic profiles (Marcotte
et al., 2000), domain projection (Mott et al., 2002) or a combination
of evolutionary and structural information (Nair and Rost, 2003).
However, these methods are useful for predicting only a limited number
of locations.
Recently, we published a new prediction method, MITOPRED,
based on functional domain occurrence patterns and amino acid
compositional differences between sequences belonging to different
subcellular locations (Guda et al., 2004). However, this method can
predict only those proteins targeted to mitochondria. Here we present another method, referred to henceforth as pTARGET, that can
predict proteins targeted to nine different subcellular locations in
eukaryotic species.
The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
[Figure 1: horizontal bar chart; vertical axis: subcellular location (NUC, EXC, MIT, CYT, PLA, END, GOL, LYS, POX); horizontal axis: number of unique Pfam-A domains (0–400).]
Fig. 1. Number of location-specific Pfam-A domains in different subcellular
locations.
METHODS
Data collection and filtering. We used protein sequences from the SWISSPROT database release 45.0 (http://www.ebi.ac.uk/swissprot) for training
and testing of pTARGET. To obtain high-quality datasets, we filtered
the data as follows. (1) Included sequences only from the animal species
(includes Fungi and Metazoa) that have annotation for ‘subcellular localization’. (2) Removed sequences with ambiguous and uncertain annotations
such as ‘by similarity’, ‘potential’, ‘probable’, ‘possible’, etc. (3) Removed
sequences known to exist in more than one subcellular location such as
those that shuttle between cytoplasm and nucleus, etc. In each location,
we clustered sequences at 95% identity using the cd-hit program (Li
et al., 2001) to remove highly homologous sequences. (4) Finally, we selected only those subcellular locations with at least 100 annotated sequences.
These locations include (the number of sequences is shown in parentheses)
CYT-cytoplasm (2062), END-endoplasmic reticulum (693), EXC-extracellular/secretory (5688), GOL-Golgi complex (221), LYS-lysosomes
(174), MIT-mitochondria (1698), NUC-nucleus (3446), PLA-plasma membrane (4162) and POX-peroxisomes (173).
Calculation of amino acid composition. For proteins from each location,
we calculated the average relative amino acid compositions (AACs) separately for the N-terminal 25 residues (NTAAC) and for the rest of the
sequence (CTAAC), as described in Guda et al. (2004).
Determination of location-specific Pfam domains. The Pfam database (database of protein families, version 16.0) has a collection of 7677 unique protein
functional domains built based on Hidden Markov Models (HMMs) (http://pfam.wustl.edu; Bateman et al., 2004). We searched all protein sequences in
each location against the Pfam-A database at gathering thresholds using a
faster ‘hmmpfam’ program (Chukkapalli et al., 2004) modified from the
HMMER software (Eddy, 1998). By comparing the occurrence patterns of
Pfam domains across nine subcellular locations, we determined the location-specific Pfam domains (Fig. 1).
Comparison of pTARGET with PSORT. We downloaded and locally
installed the PSORT stand-alone program from the URL http://psort.nibb.ac.jp. The datasets used for training and testing of PSORT are identical to
those used for pTARGET.
ALGORITHM
Recently, we published MITOPRED, a variant of this algorithm for
predicting only mitochondrial proteins (Guda et al., 2004), whereas
the current algorithm implements an improved scoring system that
predicts up to nine subcellular locations in animal species (includes
Fungi and Metazoans). This prediction algorithm calculates two
distinct scores, i.e. a score based on the presence or absence of
location-specific Pfam domains in a given location (Pfam score)
and a score based on the relative amino acid weights calculated from
AAC (AAC score). The sum of these two scores is used in the final
prediction.
Score based on Pfam domain occurrence patterns
Each location has a set of location-specific Pfam domains that are
not known to exist in other locations. A query sequence is searched
against the Pfam-A database and if any Pfam domains are found, a
Pfam score is calculated for each location based on the matching
location-specific domains. The Pfam score is an arbitrary value (we
chose ‘+50’ for rewards and ‘−50’ for penalties) assigned to locations based on the presence or absence of location-specific domains.
For example, protein sequence ‘ABF1_HUMAN’ contains the
‘Homeobox’ domain that is nucleus-specific. If the query sequence
contains the Homeobox domain, the Pfam score for the nuclear location is ‘+50’ and it is ‘−50’ for each of the remaining locations. If the
query sequence contains ‘shared’ domain(s), only those locations in
which the domain is shared will get a Pfam score of ‘0’, while the
other locations will get ‘−50’ since it is a non-specific domain for
them. Finally, if the query sequence does not have any known Pfam-A domain, the Pfam score is ‘0’ for all locations, in which case
prediction is based on the amino acid composition scores alone.
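The Pfam scoring rules above can be illustrated with a minimal Python sketch. It is not the pTARGET source code: the function name, the lookup table and the domain "SharedDomainX" are hypothetical, and only the +50/−50/0 values and the nucleus-specific Homeobox example are taken from the text. Scores from several domains are simply summed, which is an assumption about how multidomain proteins could be handled.

```python
LOCATIONS = ["CYT", "END", "EXC", "GOL", "LYS", "MIT", "NUC", "PLA", "POX"]
REWARD, PENALTY = 50, -50  # values quoted in the text

# Hypothetical lookup: Pfam domain -> set of locations in which it was observed.
# "Homeobox" as nucleus-specific comes from the text; "SharedDomainX" is invented.
DOMAIN_LOCATIONS = {
    "Homeobox": {"NUC"},
    "SharedDomainX": {"CYT", "NUC"},
}


def pfam_scores(query_domains, domain_locations=DOMAIN_LOCATIONS):
    """Return a Pfam score per location for the domains found in a query."""
    scores = {loc: 0 for loc in LOCATIONS}
    for domain in query_domains:
        observed = domain_locations.get(domain)
        if not observed:
            continue  # unknown domain: contributes 0 to every location
        is_specific = len(observed) == 1
        for loc in LOCATIONS:
            if loc not in observed:
                scores[loc] += PENALTY   # domain is foreign to this location
            elif is_specific:
                scores[loc] += REWARD    # location-specific domain
            # shared domain: the sharing locations receive 0
    return scores


print(pfam_scores(["Homeobox"]))
```

With no recognised domain the function returns 0 everywhere, which corresponds to the case where the prediction rests on the AAC score alone.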
Score based on the amino acid composition
The pTARGET program considers 9 subcellular locations and, for each
location, there are two distinct regions, i.e. NT and CT (N- and C-terminal regions), making it 18 effective locations with distinct
amino acid compositions (Table 1). We compared the AACs
from each location against those of similar regions in the other
locations, in all pairwise combinations. For each pairwise comparison, we calculated residue-specific weights using Equation (1) and
identified the residues whose compositions differ by at least 20%
(Table 2).
$$W_{ABi} = \left\{ \left[ (f_{Ai} - f_{Bi}) / \min(f_{Ai}, f_{Bi}) \right] \times 10 \right\}, \quad i = 1, 2, 3, \ldots, 20, \tag{1}$$
where $W_{ABi}$ is the weight for amino acid $i$ at location A in comparison with that at location B, and $f_{Ai}$ and $f_{Bi}$ are the relative frequencies
of residue $i$ at locations A and B, respectively. The AAC of a location
is represented as a 20-element vector. The total number of pairwise
vector comparisons in all combinations equals $2 \times ((n \times (n - 1))/2)$, i.e. 72,
where $n$ is the number of locations ($n = 9$) with two distinct regions,
i.e. NT and CT, in each location.
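As an illustration of Equation (1), the following Python sketch computes the residue weights for one pair of composition vectors and keeps the residues that differ by at least 20% (|W| ≥ 2). The function names are hypothetical; the two vectors are the CYT_NT and END_NT rows of Table 1. A frequency of zero would cause a division by zero and is not handled here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# CYT_NT and END_NT rows of Table 1, in the residue order given above.
CYT_NT = dict(zip(AMINO_ACIDS, [8.39, 1.56, 4.90, 6.98, 3.59, 7.44, 1.98, 4.80, 6.45, 8.49,
                                4.56, 3.76, 5.37, 4.36, 5.00, 7.84, 4.81, 6.60, 0.94, 2.17]))
END_NT = dict(zip(AMINO_ACIDS, [9.06, 2.54, 2.56, 3.93, 4.84, 6.39, 1.31, 4.54, 3.39, 18.90,
                                4.90, 1.82, 4.47, 2.54, 3.81, 8.47, 4.46, 7.75, 2.21, 2.11]))


def residue_weights(freq_a, freq_b):
    """Equation (1): W_ABi = ((f_Ai - f_Bi) / min(f_Ai, f_Bi)) * 10 for each residue i."""
    return {aa: (freq_a[aa] - freq_b[aa]) / min(freq_a[aa], freq_b[aa]) * 10
            for aa in AMINO_ACIDS}


def scoring_residues(weights, cutoff=2.0):
    """Residues whose compositions differ by at least 20%, i.e. |W| >= 2."""
    return [aa for aa in AMINO_ACIDS if abs(weights[aa]) >= cutoff]


n = 9
total_comparisons = 2 * (n * (n - 1) // 2)  # NT and CT regions, all location pairs

weights = residue_weights(CYT_NT, END_NT)
print(", ".join(scoring_residues(weights)))
```

For this pair the sketch selects C, D, E, F, H, K, L, N, P, Q, R and W, which is the set listed for the CYT versus END N-terminal comparison in Table 2.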
Table 1. Location-specific relative amino acid composition for the N-terminal and C-terminal sequences
Location      A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y
CYT_NT     8.39   1.56   4.90   6.98   3.59   7.44   1.98   4.80   6.45   8.49   4.56   3.76   5.37   4.36   5.00   7.84   4.81   6.60   0.94   2.17
END_NT     9.06   2.54   2.56   3.93   4.84   6.39   1.31   4.54   3.39  18.90   4.90   1.82   4.47   2.54   3.81   8.47   4.46   7.75   2.21   2.11
EXC_NT    10.01   4.57   2.55   2.99   4.24   6.25   1.63   4.25   3.92  17.18   4.58   2.40   4.85   3.28   3.49   7.68   4.96   7.37   1.58   2.22
GOL_NT     8.04   1.97   2.81   3.84   5.21   5.75   1.46   4.51   5.03  14.77   5.64   2.75   5.09   3.04   6.13   8.45   4.93   6.74   1.64   2.19
LYS_NT    12.13   3.23   1.96   2.18   2.70   7.61   1.50   2.45   2.30  20.94   4.47   1.52   7.24   2.62   5.44   7.54   4.08   6.56   2.34   1.19
MIT_NT    11.17   1.77   1.90   2.24   4.04   6.57   2.09   4.10   4.43  12.03   5.27   2.62   5.18   3.53   9.13   9.24   5.50   5.94   1.47   1.79
NUC_NT     8.15   1.45   5.11   6.80   3.02   6.83   2.07   3.04   6.09   7.14   5.34   3.95   6.91   4.37   6.68   9.98   5.06   4.85   0.64   2.54
PLA_NT     8.16   2.06   3.30   4.74   4.56   7.42   1.70   4.11   2.93  14.14   5.38   3.75   6.20   3.36   4.32   8.44   5.82   5.98   1.86   1.77
POX_NT    10.26   1.13   5.46   4.88   2.79   6.15   2.04   3.95   5.46   9.74   4.25   3.70   6.40   4.36   6.07   8.39   5.02   7.04   0.66   2.23
CYT_CT     7.08   1.88   5.67   7.34   3.91   7.02   2.40   5.37   7.28   9.04   2.09   4.01   5.17   4.27   4.78   6.60   5.38   6.76   1.10   2.86
END_CT     6.20   1.44   5.30   6.45   5.62   6.23   2.54   5.59   6.20  10.61   2.47   3.92   5.53   3.77   4.98   6.43   5.26   6.72   1.40   3.34
EXC_CT     6.29   5.08   5.17   6.19   3.50   7.25   2.37   4.17   6.62   7.82   1.73   4.90   5.59   4.35   5.69   7.31   5.59   5.61   1.35   3.45
GOL_CT     6.40   1.59   5.22   7.07   4.60   5.82   2.59   5.07   5.80  10.17   2.27   4.32   4.96   4.68   5.39   7.46   5.34   6.33   1.57   3.36
LYS_CT     6.50   2.27   5.21   5.36   4.70   7.71   2.75   4.60   4.88   9.34   2.08   4.71   5.75   4.11   4.25   7.20   5.38   6.72   2.21   4.30
MIT_CT     7.43   1.27   4.66   6.31   4.11   6.82   2.37   5.83   6.94   9.97   2.62   4.27   5.10   3.82   4.94   6.46   5.65   6.44   1.51   3.49
NUC_CT     7.21   1.74   5.04   7.00   3.19   6.33   2.63   4.17   7.16   8.35   2.03   4.34   6.36   4.93   6.56   8.94   5.25   5.15   0.86   2.75
PLA_CT     6.90   2.86   3.79   4.60   5.36   6.13   2.22   6.47   4.27  11.12   2.50   3.90   5.07   3.41   4.61   7.97   6.08   7.47   1.66   3.60
POX_CT     7.30   1.50   5.03   6.07   4.44   6.93   2.49   5.74   6.26   9.93   2.01   4.36   4.87   4.18   5.06   6.67   5.51   6.90   1.38   3.36
Rows labelled _NT are N-terminal sequences (gray in the original); rows labelled _CT are C-terminal sequences (white in the original).
Table 2. N-terminal and C-terminal scoring residues differing by at least 20% in their AAC from all-against-all comparison of subcellular locations
The matrix rows and columns are ordered CYT, END, EXC, GOL, LYS, MIT, NUC, PLA, POX. Each entry is given as row location / column location. The upper diagonal shows differences in the N-terminal region and the lower diagonal shows differences in the C-terminal region.
Upper diagonal (N-terminal region)
CYT / END   C, D, E, F, H, K, L, N, P, Q, R, W
CYT / EXC   C, D, E, H, K, L, N, Q, R, W
CYT / GOL   C, D, E, F, G, H, K, L, M, N, Q, R, W
CYT / LYS   A, C, D, E, F, H, I, K, L, N, P, Q, W, Y
CYT / MIT   A, D, E, K, L, N, Q, R, W, Y
CYT / NUC   I, P, R, S, V, W
CYT / PLA   C, D, E, F, K, L, Q, T, W, Y
CYT / POX   A, C, E, F, G, I, R, W
END / EXC   C, E, H, N, Q, W
END / GOL   C, K, L, N, R, W
END / LYS   A, C, D, E, F, I, K, P, R, Y
END / MIT   A, C, D, E, H, K, L, N, Q, R, T, V, W
END / NUC   C, D, E, F, H, I, K, L, N, P, Q, R, V, W, Y
END / PLA   C, D, E, H, L, N, P, Q, T, V
END / POX   C, D, E, F, H, K, L, N, P, Q, R, W
EXC / GOL   A, C, E, F, K, M, R
EXC / LYS   A, C, D, E, F, G, I, K, L, N, P, Q, R, T, W, Y
EXC / MIT   C, D, E, H, L, R, S, V, Y
EXC / NUC   A, C, D, E, F, H, I, K, L, N, P, Q, R, S, V, W
EXC / PLA   A, C, D, E, K, L, N, P, R, V, Y
EXC / POX   C, D, E, F, H, K, L, N, P, Q, R, W
GOL / LYS   A, C, D, E, F, G, I, K, L, M, N, P, T, W, Y
GOL / MIT   A, D, E, F, H, L, R, Y
GOL / NUC   C, D, E, F, H, I, K, L, N, P, Q, V, W
GOL / PLA   E, G, K, N, P, R, Y
GOL / POX   A, C, D, E, F, H, L, M, N, P, Q, W
LYS / MIT   C, F, H, I, K, L, N, P, Q, R, S, T, W, Y
LYS / NUC   A, C, D, E, H, I, K, L, N, Q, R, S, T, V, W, Y
LYS / PLA   A, C, D, E, F, I, K, L, M, N, Q, R, T, W, Y
LYS / POX   C, D, E, G, H, I, K, L, N, Q, T, W, Y
MIT / NUC   A, C, D, E, F, I, K, L, N, P, Q, R, V, W, Y
MIT / PLA   A, D, E, H, K, N, R, W
MIT / POX   C, D, E, F, K, L, M, N, P, Q, R, W, Y
NUC / PLA   C, D, E, F, H, I, K, L, Q, R, V, W, Y
NUC / POX   A, C, E, I, L, M, V
PLA / POX   A, C, D, F, G, H, K, L, M, Q, R, W, Y
Lower diagonal (C-terminal region)
END / CYT   C, F, W
EXC / CYT   C, I, M, N, V, W, Y
EXC / END   C, F, I, L, M, N
GOL / CYT   G, K, W
GOL / END   F, Q
GOL / EXC   C, F, G, I, L, M
LYS / CYT   C, E, F, K, W, Y
LYS / END   C, E, G, I, K, N, W, Y
LYS / EXC   C, F, K, M, R, W, Y
LYS / GOL   C, E, G, R, W, Y
MIT / CYT   C, D, M, W, Y
MIT / END   F
MIT / EXC   C, I, L, M
MIT / GOL   C, Q
MIT / LYS   C, I, K, M, W, Y
NUC / CYT   F, I, P, R, S, V, W
NUC / END   C, F, I, L, M, Q, R, S, V, W, Y
NUC / EXC   C, S, W, Y
NUC / GOL   F, I, K, L, P, R, V, W, Y
NUC / LYS   C, E, F, G, K, R, S, V, W, Y
NUC / MIT   C, F, I, M, P, Q, R, S, V, W, Y
PLA / CYT   C, D, E, F, I, K, L, Q, S, W, Y
PLA / END   C, D, E, K, S
PLA / EXC   C, D, E, F, I, K, L, M, N, Q, R, V, W
PLA / GOL   C, D, E, I, K, Q
PLA / LYS   C, D, G, H, I, M, N, Q, W
PLA / MIT   C, D, E, F, K, S
PLA / NUC   C, D, E, F, I, K, L, M, P, Q, R, V, W, Y
POX / CYT   C, E, W
POX / END   F, M
POX / EXC   C, F, I, L, V
POX / GOL   (none)
POX / LYS   C, I, K, W, Y
POX / MIT   M
POX / NUC   F, I, P, R, S, V, W, Y
POX / PLA   C, D, E, F, K, M, Q, W
AAC scores are calculated separately for each of the nine
locations, and the location with the highest score wins the prediction.
For each current location, there are 16 ‘other’ locations including 8
NT and 8 CT locations, and the AAC score for the current location is
the sum of 16 arbitrary scores (either zero or 10), one from each
pairwise comparison against ‘other’ locations. In each pairwise
comparison, two raw scores are calculated, one for the current
location (Equation 2) and the second for the ‘other’ location
(Equation 3). Every time the raw score of the current location is
higher than that of the ‘other’ location, an arbitrary score of 10 is added
to the AAC score of the current location; if not, ‘zero’ is added, and
vice versa. For example, the AAC score for a cytoplasmic location is
calculated by comparing the scoring residues in the query AAC
against matching residue averages of cytoplasmic AACs or the
‘other’ 16 non-cytoplasmic AACs. In other words, this translates
into a higher score for cytoplasmic locations and a lower score for
the non-cytoplasmic locations, if the AAC of query sequence is
closer to that of the cytoplasmic averages and vice versa. Note
that for each comparison, the scoring residues differ depending
on the ‘other’ location being compared, since we use only those
residue weights that differ by at least 20% in any given vector pair
comparison (Table 2). While calculating the cytoplasmic score,
residues from the first row of Table 2 are used for N-terminal
AAC comparisons, while residues from the first column of
Table 2 are used for C-terminal AAC comparisons. Cytoplasmic
($C_s$) and ‘other’ location ($O_s$) scores are calculated using
Equations (2) and (3), respectively.
$$C_s = \sum_{\forall i:\,|W_{COi}| \ge 2} \left( d_i \, |W_{COi}| \right) \quad \text{where } d_i = \begin{cases} Q_i - O_i, & \text{if } W_{COi} \ge +2 \\ O_i - Q_i, & \text{if } W_{COi} \le -2 \end{cases} \tag{2}$$
$$O_s = \sum_{\forall i:\,|W_{COi}| \ge 2} \left( d_i \, |W_{COi}| \right) \quad \text{where } d_i = \begin{cases} C_i - Q_i, & \text{if } W_{COi} \ge +2 \\ Q_i - C_i, & \text{if } W_{COi} \le -2 \end{cases} \tag{3}$$
where $W_{COi}$ is the weight for residue $i$ when the AACs from a
cytoplasmic location and location O are compared, and $Q_i$, $C_i$ and $O_i$
are the relative frequencies of residue $i$ in the query sequence, the cytoplasmic location and location O, respectively. The final AAC score
for the cytoplasmic location ($S_C$) is the sum of arbitrary scores
determined using Equation (4).
$$S_C = \sum_{o=0}^{R} S_o, \quad S_o = \begin{cases} a, & \text{if } C_s > O_s \\ 0, & \text{if } C_s \le O_s \end{cases} \tag{4}$$
where $R$ is the number of non-cytoplasmic locations (total 16), $S_o$ is
the score for ‘other’ location O and $a$ is an arbitrary value of 10. If
the query sequence is cytoplasmic, $C_s$ is expected to be higher than
$O_s$ at all locations, i.e. the total cytoplasmic score equals $R$ times $a$
(maximum 160). For example, the ADO_HUMAN protein is a cytoplasmic enzyme that functions as aldehyde oxidase and this protein
gets the maximum score of 160 in the current scoring scheme.
Likewise, the final AAC score for each location is calculated and
adjusted to a maximum score of 50 in order to equalize it with the
Pfam score.
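One possible reading of Equations (2) to (4) is sketched below in Python. It is an illustration under stated assumptions, not the pTARGET implementation: the function names and the nested dictionary layout are hypothetical, the query NT composition is compared only with NT averages and the query CT composition only with CT averages, and the adjustment to a maximum of 50 is assumed to be a linear rescaling, which the text does not specify.

```python
def weight(f_a, f_b):
    """Equation (1) for a single residue."""
    return (f_a - f_b) / min(f_a, f_b) * 10


def raw_scores(query, current, other, cutoff=2.0):
    """Return (Cs, Os) for one pairwise comparison, following Equations (2) and (3)."""
    cs = os_ = 0.0
    for aa in current:
        w = weight(current[aa], other[aa])
        if abs(w) < cutoff:
            continue  # not a scoring residue for this pair
        if w > 0:   # residue enriched in the current location
            d_cur, d_oth = query[aa] - other[aa], current[aa] - query[aa]
        else:       # residue depleted in the current location
            d_cur, d_oth = other[aa] - query[aa], query[aa] - current[aa]
        cs += d_cur * abs(w)
        os_ += d_oth * abs(w)
    return cs, os_


def aac_score(query, profiles, current, a=10, max_score=50):
    """Equation (4), then a linear rescaling to max_score (the rescaling is an assumption).

    query:    {"NT": {...}, "CT": {...}} composition of the query sequence
    profiles: {location: {"NT": {...}, "CT": {...}}} average compositions
    """
    hits = comparisons = 0
    for other, regions in profiles.items():
        if other == current:
            continue
        for region in ("NT", "CT"):
            cs, os_ = raw_scores(query[region], profiles[current][region], regions[region])
            comparisons += 1
            hits += cs > os_
    return hits * a * max_score / (comparisons * a)


# Toy two-residue alphabet and two locations, invented purely for illustration.
cyt = {"A": 10.0, "C": 5.0}
nuc = {"A": 5.0, "C": 10.0}
toy_profiles = {"CYT": {"NT": cyt, "CT": cyt}, "NUC": {"NT": nuc, "CT": nuc}}
toy_query = {"NT": cyt, "CT": cyt}
print(aac_score(toy_query, toy_profiles, "CYT"), aac_score(toy_query, toy_profiles, "NUC"))
```

With the nine locations of Table 1 the loop performs 16 comparisons per candidate location, so the unscaled maximum is 160 as stated in the text.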
Using Pfam and AAC scores in the prediction
The sum of Pfam and AAC scores is used in the prediction; however, their relative contributions to the final prediction vary depending on the presence, absence, shared or unknown nature of the Pfam
domains in the query sequence. In a nutshell, (1) when a query
sequence contains at least one location-specific domain, the Pfam
score itself is enough to make a prediction; (2) when a query
sequence has domain(s) shared across multiple locations, the combined score is necessary for prediction and (3) when a query
sequence has no known domain(s), the prediction is entirely based
on the AAC score. A detailed explanation of this process with actual
scores and examples is provided in Supplementary Table 3.
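The combination step can be sketched as follows. The function name and the example scores are hypothetical, and the handling of ties is not described in the text, so the sketch simply returns the first location with the highest total.

```python
def predict_location(pfam, aac):
    """Sum the Pfam and AAC scores per location and return the best location and its total."""
    totals = {loc: pfam[loc] + aac[loc] for loc in pfam}
    best = max(totals, key=totals.get)  # ties: first maximum in dictionary order (assumption)
    return best, totals[best]


# Invented scores for three locations, covering the three cases listed above.
specific = predict_location({"CYT": -50, "NUC": 50, "MIT": -50}, {"CYT": 40, "NUC": 20, "MIT": 10})
shared = predict_location({"CYT": 0, "NUC": 0, "MIT": -50}, {"CYT": 40, "NUC": 20, "MIT": 45})
no_domain = predict_location({"CYT": 0, "NUC": 0, "MIT": 0}, {"CYT": 40, "NUC": 20, "MIT": 45})
print(specific, shared, no_domain)
```

In the first case the location-specific domain decides the outcome even though the AAC score favours another location; in the second the AAC score separates the locations that share the domain; in the third the AAC score alone decides.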
Algorithm testing
We used various measures of quality including specificity, sensitivity and the Matthews correlation coefficient (MCC) for testing the
algorithm, as described in Guda et al. (2004). To characterize the
prediction performance for individual locations, we used ROC
(Receiver Operating Characteristic) plots (Swets, 1988).
IMPLEMENTATION
Analysis of Pfam domain occurrence patterns
Eukaryotic cells are organized into a complex network of membranes and compartments where metabolic pathways are distributed
across different subcellular locations. Since enzymes or proteins
involved in these pathways contain one or more functional domains
(Pfam domains), by keeping track of the functional domains specific
to a location, it is possible to predict the location of a protein that
contains such domains. We analyzed about 23000 protein sequences
from the SWISSPROT database containing subcellular location
information (from empirical studies) and determined unique
Pfam domains specific to each of the nine locations (Fig. 1). A
query sequence is searched against the Pfam database to find if
any Pfam domains are present in that sequence. The Pfam score
is calculated for each location depending on the presence or absence
of matching location-specific Pfam domains in the query sequence.
For multidomain proteins, the total Pfam score is the sum of all
domain scores; however, the presence of one location-specific Pfam
domain is enough to assign a query protein to that location.
The Pfam-A database release 16.0 contains 7677 functional
domains (HMM models), yet we used only 2146 unique domains in
this program because only the eukaryotic and non-plant sequences
were used in the dataset. The limitation of predicting solely based on
Pfam score is that for any given genome, 30–40% of the proteins
do not have reliable Pfam-A annotations at gathering thresholds,
and some functional domains are shared across multiple subcellular
locations. To predict such proteins, the current method uses AAC
differences across different subcellular locations in the scoring
system.
Analysis of AAC differences across different
subcellular locations
It has been known that protein sorting usually relies on the presence
of N-terminal targeting sequences that are recognized by location-specific translocation machinery (Rusch and Kendall, 1995). To
take full advantage of such targeting signals, we analyzed the
AAC of N-terminal 25 residues (NT) separately from the rest of
the C-terminal (CT) sequence (Table 1). We determined the AAC
differences across different locations in all pairwise combinations
(36 pairs) for 9 subcellular locations and chose only those residues
showing at least 20% difference, as the scoring residues (Table 2,
also Fig. 1 in the Supplementary data). Inclusion of residues with
less than 20% difference in the scoring system lowered the prediction performance of this method (data not shown).
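The split into an N-terminal and a C-terminal composition can be sketched in Python as follows. The function names are hypothetical, compositions are expressed as percentages (which appears to match the scale of Table 1), and the averaging over all proteins of a location is left out.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def relative_composition(segment):
    """Percentage of each of the 20 standard residues in a sequence segment."""
    counts = Counter(aa for aa in segment.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}  # empty or non-standard segment
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def nt_ct_composition(sequence, nt_length=25):
    """Return (NTAAC, CTAAC): the first 25 residues and the rest of the sequence."""
    return relative_composition(sequence[:nt_length]), relative_composition(sequence[nt_length:])


n_locations = 9
n_pairs = n_locations * (n_locations - 1) // 2  # location pairs compared per region

nt, ct = nt_ct_composition("A" * 25 + "C" * 10 + "D" * 10)  # artificial sequence
print(nt["A"], ct["C"], ct["D"], n_pairs)
```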
Analysis of AAC from different subcellular locations revealed
remarkable differences in the NT region compared to the CT region
(Table 2) because the targeting signals are mostly found in the
N-terminal region except for the endoplasmic reticulum and peroxisomal proteins where KDEL/HDEL and SKL signals, respectively, are found at the C-terminus (Stornaiuolo et al., 2003;
Subramani et al., 2000). For the mitochondria or other organelles
involved in the secretory pathway (Endoplasmic reticulum →
Golgi → Lysosomes → Extracellular), N-terminal target peptides
are identified based on the cleavage sites (Emanuelsson et al.,
2000); however, such sites are neither universal to all locations nor
to all proteins targeted to a particular location. Because of these
differences and ambiguities in the protein targeting mechanisms, we
used a scoring system that is independent of the targeting signals. In
this approach, the real differences are deduced by comparing the
AAC of each location against that of all other locations in a pairwise
fashion. This method reveals not only the differences in the target
peptide residues but also the latent differences in the internal regions
of the proteins that are otherwise difficult to discern. For example,
it is known that in the N-terminal mitochondrial target peptides, Arg (R), Ala (A) and Ser (S) are over-represented while
negatively charged residues such as Asp (D) and Glu (E) are
under-represented (Emanuelsson et al., 2000). Our analysis of
the NTAAC from mitochondrial proteins revealed more
than just these differences: for example, Tyr (Y) is under-represented
(by at least 20%) compared with most other locations, and Leu
(L) is under-represented against END, EXC, GOL and LYS but
over-represented (by at least 20%) against CYT, NUC and POX
locations (Table 1). For each location, we deduced such latent
and significant differences in the AAC of the NT- and CT-regions
for all-against-all locations (Table 2).
[Figure 2: ROC plots; horizontal axis: percentage of false positives (0.00–5.00); vertical axis: percentage of true positives (0–100); one curve per location (CYT, END, EXC, GOL, LYS, MIT, NUC, PLA, POX).]
Fig. 2. Comparison of the prediction performance of different subcellular
locations using ROC plots. Data points used in the ROC plots correspond to the
full range of discrete score thresholds, i.e. >50, 50, 46, 43, 40, 37, 31, 25, <25.
Evaluation of the prediction performance
For each location we used two test sets; the first one is all known
positives and the second set is all known negatives for that location.
We evaluated pTARGET’s performance in predicting nine subcellular locations based on specificity and sensitivity, MCC values
(Table 3) and ROC plots (Fig. 2). We also determined the rates
of false positives (FPs) and false negatives (FNs) using proteins
from all-against-all locations (Table 1 in the Supplementary data).
pTARGET can make predictions at different score thresholds, resulting in different values for the evaluation parameters stated above.
A score threshold of 50 is a cutoff where predictions could be either
from the Pfam score alone or from the AAC score alone, while 1 is
the lowest possible score.
Table 3. Measuring the performance of pTARGET based on several measures of quality
Location  Threshold     TP     FN      TN    FP    SN    SP   MCC
CYT              50    720   1320   15642   100  0.35  0.99  0.53
CYT               1   1551    489   15042   700  0.76  0.96  0.69
END              50    332    355   17052    43  0.48  1.00  0.64
END               1    548    139   16751   344  0.80  0.98  0.69
EXC              50   3512   1722   12517    31  0.67  1.00  0.76
EXC               1   4230   1004   12339   209  0.81  0.98  0.83
GOL              50     88    131   17558     5  0.40  1.00  0.61
GOL               1    150     69   17308   255  0.68  0.99  0.50
LYS              50     91     81   17568    42  0.53  1.00  0.60
LYS               1    145     27   17350   260  0.84  0.99  0.54
MIT              50    982    681   16089    30  0.59  1.00  0.74
MIT               1   1451    212   15523   596  0.87  0.96  0.76
NUC              50   2109   1329   14269    75  0.61  0.99  0.73
NUC               1   2862    576   14065   279  0.83  0.98  0.84
PLA              50   2030   2126   13615    11  0.49  1.00  0.65
PLA               1   3474    682   13255   371  0.84  0.97  0.83
POX              50     66    107   17601     8  0.38  1.00  0.58
POX               1    138     35   17390   219  0.80  0.99  0.55
TP-true positives, FN-false negatives, TN-true negatives, FP-false positives, SP-Specificity, SN-Sensitivity, MCC-Matthews correlation coefficient.
For each location, the upper row (bold in the original) and the lower row (italic in the original) are predictions at score thresholds of 50
and 1, respectively. SN, SP and MCC values are rounded to the second decimal place.
Specificity and sensitivity test
Specificity and sensitivity are two competing but non-exclusive
measures of quality useful for testing the performance of classification methods. An ideal classification method should have both values close to 1. As shown in Table 3, the maximum sensitivity of
pTARGET ranges from 0.68 (GOL) to 0.87 (MIT) at the lowest
score threshold of 1, and for all but the GOL location the sensitivity
exceeds 0.75. At the other end, specificity rates are
almost perfect (close to 1) for all locations at the highest score threshold
of 50, while at the highest sensitivity level (score threshold of 1) the
specificity rates are still at least 0.96. In other words, the worst-case
false positive rate expected for any location would not be >4%.
Figure 2 shows the relationship between specificity and sensitivity
using ROC plots. For all but the CYT location, the ROC curves climb
rapidly towards the upper left-hand corner of the graph, which is the
desired behaviour of a ROC plot. This shows that the pTARGET
program has high sensitivity as well as high specificity. The overall
prediction performance of pTARGET is the lowest for cytoplasmic proteins. This is probably because CYT is the default location
for protein synthesis as well as the hub of cellular core metabolism
and, therefore, it is likely to have the largest number of ‘shared’
functional domains, thus negatively affecting the prediction
performance.
Matthews correlation coefficient test
MCC provides a single measure for evaluating specificity and sensitivity together; it equals one for perfect predictions and zero
for random assignments (Matthews, 1975). At the highest specificity
level (score 50), MCC values for different locations range from
0.53 to 0.76, while at the highest sensitivity level (score 1) the
range is 0.50 to 0.84 (Table 3).
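The three quality measures can be reproduced from the confusion-matrix counts with a short Python sketch. The formulas are the standard definitions, assumed here to match those of Guda et al. (2004), and the function name is hypothetical. The example uses the CYT counts of Table 3 at a score threshold of 50.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, mcc


# CYT at score threshold 50 (Table 3): TP = 720, FN = 1320, TN = 15642, FP = 100.
sn, sp, mcc = classification_metrics(720, 1320, 15642, 100)
print(round(sn, 2), round(sp, 2), round(mcc, 2))
```

Rounded to two decimals, the sketch gives 0.35, 0.99 and 0.53, in agreement with the corresponding row of Table 3.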
3967
C.Guda and S.Subramaniam
Percentage of True Posivites
100
90
80
70
60
50
40
30
20
10
0
CYT
END
EXC
GOL
pTARGET
MIT
NUC
PLA
POX
PSORT
Fig. 3. Comparison of the prediction performance of pTARGET and PSORT.
Our results suggest that the prediction performance of pTARGET
is consistent and better than that of PSORT, for most of the locations
tested. Unlike PSORT, the current method is sufficiently robust for
genome-scale prediction of proteins in eukaryotic animal species
and does not require species-specific training datasets. Previously,
we used MITOPRED for genome-scale prediction of mitochondrial
proteins in six eukaryotic proteomes (Guda et al., 2004). One of the
limitations of pTARGET is its inability to accurately predict proteins localized in multiple locations such as those shuttling between
cytoplasm and nucleus. Based on the number of ‘shared’ domains in
our study (500, data not shown), we estimate that in eukaryotic
proteomes, at least 20% of the proteins are localized to multiple
locations. In the future, we will focus on developing sophisticated
scoring methods to accurately predict proteins targeted to multiple
locations.
Comparison of pTARGET with PSORT
PSORT has been chosen for comparison since it is the only other
computational method that predicts as many subcellular locations as
pTARGET does and is available as a stand alone version. We used
identical datasets for testing both methods and removed the LYS
location from the comparison because PSORT predicts this location
as part of the vesicular secretory pathway. Since the scoring systems
used in these two methods are not comparable, we used the highest
sensitivity thresholds for prediction that corresponds to a specificity
higher than 0.95, in both cases.
As shown in Figure 3, pTARGET prediction rates are higher than
those of PSORT for all but EXC and PLA locations. The improvement in the prediction rates of pTARGET vary with each location
i.e. CYT (24%), END (37%), GOL (60%), MIT (40%), NUC (11%)
and POX (42%), while PSORT prediction rates are higher in EXC
(11%) and PLA (7%) locations. PSORT employs a suite of regular
expressions for predicting signal peptides and cleavage sites and,
therefore, it is able to predict extracellular proteins more efficiently,
where such signals are well characterized. Signal peptides are known to
control the entry of almost all proteins into the secretory pathway, in both eukaryotes and prokaryotes (Gierasch, 1989;
von Heijne, 1990; Rapoport, 1992). For other locations, such knowledge of protein targeting is either ambiguous or not fully available.
The improved prediction rates observed for the END (37%), GOL (60%)
and POX (42%) locations are especially notable for the pTARGET method because most computational methods fail to predict
these locations owing to a lack of sufficient training data.
DISCUSSION
We removed the plant sequences from our datasets because several
metabolic pathways and organelles in plants are not the same as in
animals, leading to differences in the distribution of protein functional domains in these two systems. Even though the AAC differences in the CT regions are not as pronounced as those in the NT
regions (Table 2), inclusion of the CT differences in the scoring system
has considerably reduced the number of false positives (data not
shown). This is because the NT AAC averages are based on only 25
residues, and hence the scoring system could easily pick up unintended sequences with a similar NT composition. Since the pTARGET method is primarily based on location-specific protein
functional domains (Pfam-A domains), its performance could be
significantly improved as more functional domains are identified in
future versions of the Pfam database.
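To illustrate why a composition computed over only 25 residues is coarse, the following minimal sketch computes the amino acid composition of a terminal window. It is not the pTARGET implementation: the function name terminal_aac is hypothetical, the restriction to the 20 standard residues is an assumption, and the use of the same 25-residue window for the CT region is an assumption made only for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def terminal_aac(sequence, n=25, end="N"):
    """Fractional amino acid composition of the first (end="N") or
    last (end="C") n residues of a protein sequence.

    Non-standard symbols (e.g. X) are ignored in the denominator.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if end not in ("N", "C"):
        raise ValueError('end must be "N" or "C"')
    seq = sequence.upper()
    window = seq[:n] if end == "N" else seq[-n:]
    counts = Counter(window)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical sequence, used only to show the call.
    example = "M" + "A" * 24 + "K" * 50
    nt = terminal_aac(example, end="N")
    ct = terminal_aac(example, end="C")
    print(nt["A"], nt["M"], ct["K"])
```

With a 25-residue window, each residue shifts its frequency by 1/25 (4%), so unrelated sequences can match an NT profile by chance; scoring a second, independent region is one way to filter out such matches.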
ACKNOWLEDGEMENTS
The authors thank Dr. Giridhar Chukkapalli at the San Diego
Supercomputer Center for assistance in running genome-scale HMM
jobs. This project was supported by start-up funds to CG
from the State University of New York at Albany and the University
of California Life Sciences Informatics (LSI) Program/Mitokor
grant (L99-10077) to SS.
Conflicts of Interest: none declared.
REFERENCES
Bateman,A. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32,
D138–D141.
Cedano,J. et al. (1997) Relation between amino acid composition and cellular location
of proteins. J. Mol. Biol., 17, 594–600.
Chukkapalli,G. et al. (2004) SledgeHMMER: a web server for batch searching of Pfam
database. Nucl. Acids Res., 32, W542–W544.
Cui,Q. et al. (2004) Esub8: a novel tool to predict protein subcellular localizations in
eukaryotic organisms. Bioinformatics, 5, 66–72.
Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
Emanuelsson,O. et al. (2000) Predicting subcellular localization of proteins based on
their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.
Feng,Z.P. (2000) Prediction of the subcellular location of prokaryotic proteins
based on a new representation of the amino acid composition. Biopolymers, 58,
491–499.
Feng,Z.P. and Zhang,C.T. (2001) Prediction of the subcellular location of prokaryotic
proteins based on the hydrophobic index of the amino acids. Int. J. Biol. Macromol., 14, 255–261.
Gierasch,L.M. (1989) Signal sequences. Biochemistry, 28, 923–930.
Guda,C. et al. (2004) MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics, 20, 1785–1794.
Hua,S. and Sun,Z. (2001) Support vector machine approach for protein subcellular
localization prediction. Bioinformatics, 17, 721–728.
Huh,W.-K. et al. (2003) Global analysis of protein localization in budding yeast.
Nature, 425, 686–691.
Kumar,A. et al. (2002) Subcellular localization of the yeast proteome. Genes Dev., 16,
707–719.
Li,W. et al. (2001) Clustering of highly homologous sequences to reduce the size of
large protein databases. Bioinformatics, 17, 282–283.
Marcotte,E.M. et al. (2000) Localizing proteins in the cell from their phylogenetic
profiles. Proc. Natl Acad. Sci. USA, 97, 12115–12120.
Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure
of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
Mott,R. et al. (2002) Predicting protein cellular location using a domain projection
method. Genome Res., 12, 1168–1174.
Nair,R. and Rost,B. (2002) Inferring sub-cellular localization through automated
lexical analysis. Bioinformatics, 18, S78–S86.
Nair,R. and Rost,B. (2003) Better prediction of sub-cellular localization by combining
evolutionary and structural information. Proteins, 53, 917–930.
Nakai,K. and Horton,P. (1999) PSORT: a program for detecting the sorting signals of
proteins and predicting their subcellular localization. Trends Biochem. Sci., 24,
34–36.
Nielsen,H. et al. (1997) Identification of prokaryotic and eukaryotic signal peptides and
prediction of their cleavage sites. Prot. Engg., 10, 1–6.
Rapoport,T.A. (1992) Transport of proteins across the endoplasmic reticulum membrane. Science, 258, 931–936.
Reinhardt,A. and Hubbard,T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230–2236.
Rusch,S.L. and Kendall,D.A. (1995) Protein transport via amino-terminal
targeting sequences: common themes in diverse systems. Mol. Membr. Biol.,
12, 295–307.
Stornaiuolo,M. et al. (2003) KDEL and KKXX retrieval signals appended to the same
reporter protein determine different trafficking between endoplasmic reticulum,
intermediate compartment, and golgi complex. Mol. Biol. Cell, 14, 889–902.
Subramani,S. et al. (2000) Import of peroxisomal matrix and membrane proteins. Annu.
Rev. Biochem., 69, 399–418.
Swets,J.A. (1988) Measuring the accuracy of diagnostic system. Science, 240,
1285–1293.
von Heijne,G. (1990) The signal peptide. J. Membr. Biol., 115, 195–201.