BIOINFORMATICS ORIGINAL PAPER
Vol. 21 no. 21 2005, pages 3963–3969
doi:10.1093/bioinformatics/bti650

Sequence analysis

pTARGET: a new method for predicting protein subcellular localization in eukaryotes

Chittibabu Guda1,2 and Shankar Subramaniam3,4,5

1Gen*NY*sis Center for Excellence in Cancer Genomics, 2Department of Epidemiology and Biostatistics, University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY 12144-3456, USA and 3San Diego Supercomputer Center, 4Department of Bioengineering and 5Department of Chemistry and Biochemistry, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA

Received on May 3, 2005; revised on and accepted on August 26, 2005
Advance Access publication September 6, 2005

ABSTRACT
Motivation: There is a scarcity of efficient computational methods for predicting protein subcellular localization in eukaryotes. Currently available methods have several limitations that make them inadequate for genome-scale predictions. Here, we present a new prediction method, pTARGET, that can predict proteins targeted to nine different subcellular locations in eukaryotic animal species.
Results: The nine subcellular locations predicted by pTARGET are cytoplasm, endoplasmic reticulum, extracellular/secretory, Golgi, lysosomes, mitochondria, nucleus, plasma membrane and peroxisomes. Predictions are based on location-specific protein functional domains and on amino acid compositional differences across different subcellular locations. Overall, this method can predict 68–87% of the true positives at accuracy rates of 96–99%. Comparison of the prediction performance against PSORT showed that pTARGET prediction rates are higher by 11–60% in 6 of the 8 locations tested. Moreover, the pTARGET method is robust enough for genome-scale prediction of protein subcellular localization since it does not rely on the presence of signal or target peptides.
Availability: A public web server based on the pTARGET method is accessible at http://bioinformatics.albany.edu/~ptarget. Datasets used for developing pTARGET can be downloaded from this web server. Source code will be available on request from the corresponding author.
Contact: [email protected]
Supplementary data: Accessible as online-only from the publisher.

INTRODUCTION
Protein subcellular localization, consequent to protein sorting or protein trafficking, is a key functional characteristic of proteins. The eukaryotic cell is a highly ordered structure where nucleus-encoded proteins are synthesized in the cytoplasm and all non-cytosolic proteins are transported to their destined subcellular locations. Localization of proteins in the intended compartments is vital for the structural and functional integrity of the cell. Therefore, comprehensive knowledge of the subcellular localization of proteins is essential for understanding their roles and interacting partners in cellular metabolism. Exhaustive experimental studies have been carried out in yeast to determine the subcellular localization of the entire proteome (Kumar et al., 2002; Huh et al., 2003); however, such efforts are not practicable in all species. As a result, experimental annotation of protein subcellular localization cannot keep up with the large number of sequences that continue to emerge from genome sequencing projects. To bridge this gap, there is a need for fast, accurate, genome-scale computational methods for predicting the subcellular localization of proteins.

Several computational methods have been developed over the past decade for predicting subcellular localization of eukaryotic proteins. These methods are broadly classified into four groups.
(1) Methods based on sorting signals rely on the presence of protein targeting or signal peptides that are recognized by location-specific transport machinery to enable their entry (Nielsen et al., 1997; Nakai and Horton, 1999; Emanuelsson et al., 2000). Among these, PSORT is a popular method (Nakai and Horton, 1999) that can predict proteins targeted to 12 different subcellular locations. Nevertheless, these methods can predict only those proteins with known sorting signals.
(2) Methods based on differences in the amino acid composition or amino acid properties of proteins from different subcellular locations. These methods use the hydrophobicity index of amino acids (Feng and Zhang, 2001), amino acid composition (Cedano et al., 1997; Reinhardt and Hubbard, 1998; Feng, 2000; Hua and Sun, 2001; Cui et al., 2004), etc.; however, their overall prediction accuracy is rather low.
(3) Methods based on lexical analysis of keywords (LOCkey) from the functional annotation of proteins (Nair and Rost, 2002). The reliability of this method depends on the consistency and accuracy of the keyword assignments given to the proteins.
(4) Methods that use phylogenetic profiles (Marcotte et al., 2000), domain projection (Mott et al., 2002) or a combination of evolutionary and structural information (Nair and Rost, 2003). These methods, however, are useful for predicting only a limited number of locations.

Recently, we published a new prediction method, MITOPRED, based on functional domain occurrence patterns and amino acid compositional differences between sequences belonging to different subcellular locations (Guda et al., 2004). However, that method can predict only those proteins targeted to mitochondria. Here we present another method, referred to henceforth as pTARGET, that can predict proteins targeted to nine different subcellular locations in eukaryotic species.

© The Author 2005. Published by Oxford University Press. All rights reserved.
For Permissions, please email: [email protected]

METHODS
Data collection and filtering. We used protein sequences from the SWISSPROT database release 45.0 (http://www.ebi.ac.uk/swissprot) for training and testing of pTARGET. To obtain high-quality datasets, we filtered the data as follows.
(1) Included sequences only from the animal species (includes Fungi and Metazoa) that have annotation for ‘subcellular localization’.
(2) Removed sequences with ambiguous and uncertain annotations such as ‘by similarity’, ‘potential’, ‘probable’, ‘possible’, etc.
(3) Removed sequences known to exist in more than one subcellular location, such as those that shuttle between cytoplasm and nucleus. In each location, we clustered sequences at 95% identity using the cd-hit program (Li et al., 2001) to remove highly homologous sequences.
(4) Finally, we selected only those subcellular locations with at least 100 annotated sequences.
These locations are (the number of sequences is shown in parentheses): CYT-cytoplasm (2062), END-endoplasmic reticulum (693), EXC-extracellular/secretory (5688), GOL-Golgi complex (221), LYS-lysosomes (174), MIT-mitochondria (1698), NUC-nucleus (3446), PLA-plasma membrane (4162) and POX-peroxisomes (173).

Calculation of amino acid composition. For proteins from each location, we calculated the average relative amino acid compositions (AACs) separately for the N-terminal 25 residues (NTAAC) and for the rest of the sequence (CTAAC), as described in Guda et al. (2004).

Determination of location-specific Pfam domains. The Pfam database (database of protein families, version 16.0) contains 7677 unique protein functional domains built on hidden Markov models (HMMs) (http://pfam.wustl.edu; Bateman et al., 2004). We searched all protein sequences in each location against the Pfam-A database at gathering thresholds using a faster ‘hmmpfam’ program (Chukkapalli et al., 2004) modified from the HMMER software (Eddy, 1998). By comparing the occurrence patterns of Pfam domains across the nine subcellular locations, we determined the location-specific Pfam domains (Fig. 1).

[Fig. 1. Number of location-specific Pfam-A domains in different subcellular locations. Bar chart with one bar per location (NUC, EXC, MIT, CYT, PLA, END, GOL, LYS, POX); axis: number of unique Pfam-A domains, 0 to 400.]

Comparison of pTARGET with PSORT. We downloaded and locally installed the PSORT stand-alone program from http://psort.nibb.ac.jp. The datasets used for training and testing of PSORT are identical to those used for pTARGET.

ALGORITHM
Recently, we published MITOPRED, a variant of this algorithm for predicting only mitochondrial proteins (Guda et al., 2004), whereas the current algorithm implements an improved scoring system that predicts up to nine subcellular locations in animal species (includes Fungi and Metazoans). This prediction algorithm calculates two distinct scores, i.e. a score based on the presence or absence of location-specific Pfam domains in a given location (Pfam score) and a score based on the relative amino acid weights calculated from AAC (AAC score). The sum of these two scores is used in the final prediction.

Score based on Pfam domain occurrence patterns
Each location has a set of location-specific Pfam domains that are not known to exist in other locations. A query sequence is searched against the Pfam-A database and, if any Pfam domains are found, a Pfam score is calculated for each location based on the matching location-specific domains. The Pfam score is an arbitrary value (we chose ‘+50’ for rewards and ‘−50’ for penalties) assigned to locations based on the presence or absence of location-specific domains. For example, protein sequence ‘ABF1_HUMAN’ contains the ‘Homeobox’ domain that is nucleus-specific.
If the query sequence contains the Homeobox domain, the Pfam score is ‘+50’ for the nuclear location and ‘−50’ for each of the remaining locations. If the query sequence contains ‘shared’ domain(s), only those locations in which the domain is shared get a Pfam score of ‘0’, while the other locations get ‘−50’ since it is a non-specific domain for them. Finally, if the query sequence does not have any known Pfam-A domain, the Pfam score is ‘0’ for all locations, in which case the prediction is based on the amino acid composition scores alone.

Score based on the amino acid composition
The pTARGET program considers 9 subcellular locations and, for each location, two distinct regions, i.e. NT and CT (N- and C-terminal regions), making 18 effective locations with distinct amino acid compositions (Table 1). We compared the AACs from each location against those of the same region in the other locations, in all pairwise combinations. For each pairwise comparison, we calculated residue-specific weights using Equation (1) and identified the residues whose compositions differ by at least 20% (Table 2).

$$W_{ABi} = \left\{ \frac{f_{Ai} - f_{Bi}}{\min(f_{Ai}, f_{Bi})} \times 10 \right\}, \quad i = 1, 2, 3, \ldots, 20 \qquad (1)$$

where WABi is the weight for amino acid i at location A in comparison with that at location B, and fAi and fBi are the relative frequencies of residue i at locations A and B, respectively. A difference of at least 20% therefore corresponds to |WABi| ≥ 2. The AAC of a location is represented as a 20-element vector. The total number of pairwise vector comparisons in all combinations equals 2 × ((n × (n − 1))/2) = 72, where n is the number of locations (n = 9), each with two distinct regions, i.e. NT and CT. AAC scores are calculated separately for each of the nine locations, and the location with the highest score wins the prediction.
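As an illustration of Equation (1), the following minimal Python sketch computes the residue weights for one pair of N-terminal composition vectors and keeps the residues that differ by at least 20%. The two vectors are the MIT_NT and CYT_NT rows of Table 1; the function and variable names are hypothetical and are not taken from the pTARGET source code.

```python
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"  # column order of Table 1

# N-terminal (first 25 residues) average compositions in percent, from Table 1.
MIT_NT = [11.17, 1.77, 1.90, 2.24, 4.04, 6.57, 2.09, 4.10, 4.43, 12.03,
          5.27, 2.62, 5.18, 3.53, 9.13, 9.24, 5.50, 5.94, 1.47, 1.79]
CYT_NT = [8.39, 1.56, 4.90, 6.98, 3.59, 7.44, 1.98, 4.80, 6.45, 8.49,
          4.56, 3.76, 5.37, 4.36, 5.00, 7.84, 4.81, 6.60, 0.94, 2.17]


def residue_weights(freq_a, freq_b):
    """Equation (1): W_ABi = ((f_Ai - f_Bi) / min(f_Ai, f_Bi)) * 10 per residue."""
    return {
        res: (fa - fb) / min(fa, fb) * 10
        for res, fa, fb in zip(RESIDUES, freq_a, freq_b)
    }


def scoring_residues(weights, cutoff=2.0):
    """Keep residues whose composition differs by at least 20%, i.e. |W| >= 2."""
    return {res: w for res, w in weights.items() if abs(w) >= cutoff}


def pairwise_comparisons(n_locations):
    """Vector comparisons over all location pairs, counted for NT and CT."""
    return 2 * (n_locations * (n_locations - 1) // 2)


weights = residue_weights(MIT_NT, CYT_NT)
selected = scoring_residues(weights)
print("".join(sorted(selected)))  # scoring residues for MIT versus CYT (NT)
print(round(weights["R"], 2), round(weights["D"], 2))
print(pairwise_comparisons(9))
```

For this pair the selected residues are A, D, E, K, L, N, Q, R, W and Y, a set that also appears among the N-terminal entries of Table 2; Arg, Ala and Leu receive positive weights and Asp and Glu negative weights.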
For each current location, there are 16 ‘other’ locations (8 NT and 8 CT), and the AAC score for the current location is the sum of 16 arbitrary scores (either zero or 10), one from each pairwise comparison against the ‘other’ locations. In each pairwise comparison, two raw scores are calculated, one for the current location (Equation 2) and one for the ‘other’ location (Equation 3). Every time the raw score of the current location is higher than that of the ‘other’ location, an arbitrary score of 10 is added to the AAC score of the current location; if not, ‘zero’ is added, and vice versa.

Table 1. Location-specific relative amino acid composition for the N-terminal and C-terminal sequences

Location      A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y
CYT_NT     8.39   1.56   4.90   6.98   3.59   7.44   1.98   4.80   6.45   8.49   4.56   3.76   5.37   4.36   5.00   7.84   4.81   6.60   0.94   2.17
END_NT     9.06   2.54   2.56   3.93   4.84   6.39   1.31   4.54   3.39  18.90   4.90   1.82   4.47   2.54   3.81   8.47   4.46   7.75   2.21   2.11
EXC_NT    10.01   4.57   2.55   2.99   4.24   6.25   1.63   4.25   3.92  17.18   4.58   2.40   4.85   3.28   3.49   7.68   4.96   7.37   1.58   2.22
GOL_NT     8.04   1.97   2.81   3.84   5.21   5.75   1.46   4.51   5.03  14.77   5.64   2.75   5.09   3.04   6.13   8.45   4.93   6.74   1.64   2.19
LYS_NT    12.13   3.23   1.96   2.18   2.70   7.61   1.50   2.45   2.30  20.94   4.47   1.52   7.24   2.62   5.44   7.54   4.08   6.56   2.34   1.19
MIT_NT    11.17   1.77   1.90   2.24   4.04   6.57   2.09   4.10   4.43  12.03   5.27   2.62   5.18   3.53   9.13   9.24   5.50   5.94   1.47   1.79
NUC_NT     8.15   1.45   5.11   6.80   3.02   6.83   2.07   3.04   6.09   7.14   5.34   3.95   6.91   4.37   6.68   9.98   5.06   4.85   0.64   2.54
PLA_NT     8.16   2.06   3.30   4.74   4.56   7.42   1.70   4.11   2.93  14.14   5.38   3.75   6.20   3.36   4.32   8.44   5.82   5.98   1.86   1.77
POX_NT    10.26   1.13   5.46   4.88   2.79   6.15   2.04   3.95   5.46   9.74   4.25   3.70   6.40   4.36   6.07   8.39   5.02   7.04   0.66   2.23
CYT_CT     7.08   1.88   5.67   7.34   3.91   7.02   2.40   5.37   7.28   9.04   2.09   4.01   5.17   4.27   4.78   6.60   5.38   6.76   1.10   2.86
END_CT     6.20   1.44   5.30   6.45   5.62   6.23   2.54   5.59   6.20  10.61   2.47   3.92   5.53   3.77   4.98   6.43   5.26   6.72   1.40   3.34
EXC_CT     6.29   5.08   5.17   6.19   3.50   7.25   2.37   4.17   6.62   7.82   1.73   4.90   5.59   4.35   5.69   7.31   5.59   5.61   1.35   3.45
GOL_CT     6.40   1.59   5.22   7.07   4.60   5.82   2.59   5.07   5.80  10.17   2.27   4.32   4.96   4.68   5.39   7.46   5.34   6.33   1.57   3.36
LYS_CT     6.50   2.27   5.21   5.36   4.70   7.71   2.75   4.60   4.88   9.34   2.08   4.71   5.75   4.11   4.25   7.20   5.38   6.72   2.21   4.30
MIT_CT     7.43   1.27   4.66   6.31   4.11   6.82   2.37   5.83   6.94   9.97   2.62   4.27   5.10   3.82   4.94   6.46   5.65   6.44   1.51   3.49
NUC_CT     7.21   1.74   5.04   7.00   3.19   6.33   2.63   4.17   7.16   8.35   2.03   4.34   6.36   4.93   6.56   8.94   5.25   5.15   0.86   2.75
PLA_CT     6.90   2.86   3.79   4.60   5.36   6.13   2.22   6.47   4.27  11.12   2.50   3.90   5.07   3.41   4.61   7.97   6.08   7.47   1.66   3.60
POX_CT     7.30   1.50   5.03   6.07   4.44   6.93   2.49   5.74   6.26   9.93   2.01   4.36   4.87   4.18   5.06   6.67   5.51   6.90   1.38   3.36

Rows with the _NT suffix are N-terminal sequences; rows with the _CT suffix are C-terminal sequences. Values are relative frequencies in percent.

Table 2. N-terminal and C-terminal scoring residues differing by at least 20% in their AAC from all-against-all comparison of subcellular locations

CYT CYT END EXC GOL LYS MIT NUC PLA POX C, D, E, F, H, K, L, N, P, Q, R, W C, D, E, H, K, L, N, Q, R, W C, D, E, F, G, H, K, L, M, N, Q, R, W C, K, L, N, R, W A, C, D, E, F, H, I, K, L, N, P, Q, W, Y A, C, D, E, F, I, K, P, R, Y A, D, E, K, L, N, Q, R, W, Y I, P, R, S, V, W C, D, E, F, K, L, Q, T, W, Y A, C, E, F, G, I, R, W C, D, E, F, H, K, L, N, P, Q, R, W A, C, D, E, F, G, I, K, L, N, P, Q, R, T, W, Y A, C, D, E, F, G, I, K, L, M, N, P, T, W, Y C, D, E, F, H, I, K, L, N, P, Q, R, V, W, Y A, C, D, E, F, H, I, K, L, N, P, Q, R, S, V, W C, D, E, F, H, I, K, L, N, P, Q, V, W A, C, D, E, H, I, K, L, N, Q, R, S, T, V, W, Y A, C, D, E, F, I, K, L, N, P, Q, R, V, W, Y C, D, E, H, L, N, P, Q, T, V A, C, E, F, K, M, R A, C, D, E, H, K, L, N, Q, R, T, V, W C, D, E, H, L, R, S, V, Y A, C, D, E, K, L, N, P, R, V, Y C, D, E, F, H, K, L, N, P, Q, R, W E, G, K, N, P, R, Y A, C, D, E, F, H, L, M, N, P, Q, W A, C, D, E, F, I, K, L, M, N, Q, R, T, W, Y A, D, E, H, K, N, R, W C, D, E, G, H, I, K, L, N, Q, T, W, Y C, D, E, F, K, L, M, N, P, Q, R, W, Y A, C, E, I, L, M, V END C, F, W C, E, H, N, Q, W EXC C, I, M, N, V, W, Y C, F, I, L, M, N GOL G, K, W F, Q C, F, G, I, L, M LYS C, E, F, K, W, Y C, E, G, I, K, N, W, Y C, F, K, M, R, W, Y C, E, G, R, W, Y MIT C, D, M, W, Y F C, I, L, M C, Q C, I, K, M, W, Y NUC F, I, P, R, S, V, W C, F, I, L, M, Q, R, S, V, W, Y C, S, W, Y F, I, K, L, P, R, V, W, Y C, E, F, G, K, R, S, V, W, Y C, F, I, M, P, Q, R, S, V, W, Y PLA C, D, E, F, I, K, L, Q, S, W, Y C, D, E, K, S C, D, E, I, K, Q C, D, G, H, I, M, N, Q, W C, D, E, F, K, S POX C, E, W F, M C, D, E, F, I, K, L, M, N, Q, R, V, W C, F, I, L, V -- C, I, K, W, Y M A, D, E, F, H, L, R, Y C, F, H, I, K, L, N, P, Q, R, S, T, W, Y C, D, E, F, H, I, K, L, Q, R, V, W, Y C, D, E, F, I, K, L, M, P, Q, R, V, W, Y F, I, P, R, S, V, W, Y A, C, D, F, G, H, K, L, M, Q, R, W, Y C, D, E, F, K, M, Q, W

The upper diagonal shows differences in the N-terminal region and the lower diagonal shows differences in the C-terminal region.

For example, the AAC score for a cytoplasmic location is calculated by comparing the scoring residues in the query AAC against the matching residue averages of the cytoplasmic AACs or the ‘other’ 16 non-cytoplasmic AACs. In other words, this translates into a higher score for the cytoplasmic location and a lower score for the non-cytoplasmic locations if the AAC of the query sequence is closer to the cytoplasmic averages, and vice versa. Note that for each comparison, the scoring residues differ depending on the ‘other’ location being compared, since we use only those residue weights that differ by at least 20% in any given vector pair comparison (Table 2). While calculating the cytoplasmic score, residues from the first row of Table 2 are used for N-terminal AAC comparisons, while residues from the first column of Table 2 are used for C-terminal AAC comparisons. Cytoplasmic (Cs) and ‘other’ location (Os) scores have been calculated using Equations (2) and (3), respectively.
$$C_s = \sum_{\forall i:\,|W_{COi}| \ge 2} \left( d_i \times |W_{COi}| \right), \quad \text{where } d_i = \begin{cases} Q_i - O_i, & \text{if } W_{COi} \ge +2 \\ O_i - Q_i, & \text{if } W_{COi} \le -2 \end{cases} \qquad (2)$$

$$O_s = \sum_{\forall i:\,|W_{COi}| \ge 2} \left( d_i \times |W_{COi}| \right), \quad \text{where } d_i = \begin{cases} C_i - Q_i, & \text{if } W_{COi} \ge +2 \\ Q_i - C_i, & \text{if } W_{COi} \le -2 \end{cases} \qquad (3)$$

where WCOi is the weight for residue i when the AACs from a cytoplasmic location and location O are compared, and Qi, Ci and Oi are the relative frequencies of residue i in the query sequence, the cytoplasmic location and location O, respectively. The final AAC score for the cytoplasmic location (SC) is the sum of arbitrary scores determined using Equation (4).

$$S_C = \sum_{o=0}^{R} S_o, \quad S_o = \begin{cases} a, & \text{if } C_s > O_s \\ 0, & \text{if } C_s \le O_s \end{cases} \qquad (4)$$

where R is the number of non-cytoplasmic locations (total 16), So is the score for ‘other’ location O and a is an arbitrary value of 10. If the query sequence is cytoplasmic, Cs is expected to be higher than Os at all locations, i.e. the total cytoplasmic score equals R times a (maximum 160). For example, the ADO_HUMAN protein is a cytoplasmic enzyme that functions as aldehyde oxidase, and this protein gets the maximum score of 160 in the current scoring scheme. Likewise, the final AAC score for each location is calculated and adjusted to a maximum score of 50 in order to equalize it with the Pfam score.

Using Pfam and AAC scores in the prediction
The sum of the Pfam and AAC scores is used in the prediction; however, their relative contribution to the final prediction varies depending on whether the Pfam domains in the query sequence are location-specific, shared, or absent. In a nutshell, (1) when a query sequence contains at least one location-specific domain, the Pfam score itself is enough to make a prediction; (2) when a query sequence has domain(s) shared across multiple locations, the combined score is necessary for prediction; and (3) when a query sequence has no known domain(s), the prediction is entirely based on the AAC score. A detailed explanation of this process with actual scores and examples is provided in Supplementary Table 3.
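The scoring scheme described above (Pfam rewards and penalties, the raw scores Cs and Os, the vote of 10 per comparison, and the final sum) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: only three of the nine locations are included (profiles copied from Table 1), the rescaling of the AAC score to a maximum of 50 is assumed to be linear, and all function names, as well as the representation of each domain as a set of locations, are hypothetical.

```python
# Average compositions (percent) for three of the nine locations, from Table 1.
# Residue order: A C D E F G H I K L M N P Q R S T V W Y.
PROFILES = {
    "CYT": {
        "NT": [8.39, 1.56, 4.90, 6.98, 3.59, 7.44, 1.98, 4.80, 6.45, 8.49,
               4.56, 3.76, 5.37, 4.36, 5.00, 7.84, 4.81, 6.60, 0.94, 2.17],
        "CT": [7.08, 1.88, 5.67, 7.34, 3.91, 7.02, 2.40, 5.37, 7.28, 9.04,
               2.09, 4.01, 5.17, 4.27, 4.78, 6.60, 5.38, 6.76, 1.10, 2.86],
    },
    "MIT": {
        "NT": [11.17, 1.77, 1.90, 2.24, 4.04, 6.57, 2.09, 4.10, 4.43, 12.03,
               5.27, 2.62, 5.18, 3.53, 9.13, 9.24, 5.50, 5.94, 1.47, 1.79],
        "CT": [7.43, 1.27, 4.66, 6.31, 4.11, 6.82, 2.37, 5.83, 6.94, 9.97,
               2.62, 4.27, 5.10, 3.82, 4.94, 6.46, 5.65, 6.44, 1.51, 3.49],
    },
    "NUC": {
        "NT": [8.15, 1.45, 5.11, 6.80, 3.02, 6.83, 2.07, 3.04, 6.09, 7.14,
               5.34, 3.95, 6.91, 4.37, 6.68, 9.98, 5.06, 4.85, 0.64, 2.54],
        "CT": [7.21, 1.74, 5.04, 7.00, 3.19, 6.33, 2.63, 4.17, 7.16, 8.35,
               2.03, 4.34, 6.36, 4.93, 6.56, 8.94, 5.25, 5.15, 0.86, 2.75],
    },
}
REWARD, PENALTY, VOTE, MAX_SCORE = 50, -50, 10, 50


def raw_scores(query, current, other):
    """Equations (2) and (3): raw scores Cs (current) and Os (other)."""
    cs = os_ = 0.0
    for q, c, o in zip(query, current, other):
        w = (c - o) / min(c, o) * 10          # Equation (1)
        if w >= 2:                            # residue enriched in current
            cs += (q - o) * abs(w)
            os_ += (c - q) * abs(w)
        elif w <= -2:                         # residue depleted in current
            cs += (o - q) * abs(w)
            os_ += (q - c) * abs(w)
    return cs, os_


def aac_score(query, location):
    """Equation (4), then linear rescaling to MAX_SCORE (assumed)."""
    votes = comparisons = 0
    for other in PROFILES:
        if other == location:
            continue
        for region in ("NT", "CT"):           # NT against NT, CT against CT
            cs, os_ = raw_scores(query[region], PROFILES[location][region],
                                 PROFILES[other][region])
            votes += VOTE if cs > os_ else 0
            comparisons += 1
    return votes * MAX_SCORE / (VOTE * comparisons)


def pfam_scores(domain_locations):
    """One set of locations per Pfam-A domain found; domain scores are summed."""
    scores = dict.fromkeys(PROFILES, 0)
    for locations in domain_locations:
        for loc in PROFILES:
            if loc not in locations:
                scores[loc] += PENALTY        # domain not known at this location
            elif len(locations) == 1:
                scores[loc] += REWARD         # location-specific domain
            # shared domain: the sharing locations get 0
    return scores


def predict(query, domain_locations=()):
    pfam = pfam_scores(domain_locations)
    total = {loc: pfam[loc] + aac_score(query, loc) for loc in PROFILES}
    return max(total, key=total.get), total


# A query whose composition equals the MIT averages, without and with a
# nucleus-specific domain (such as Homeobox in the text).
print(predict(PROFILES["MIT"]))
print(predict(PROFILES["MIT"], [{"NUC"}]))
```

With no domain the prediction rests on the AAC score alone, whereas a single location-specific domain outweighs the composition signal, in line with cases (1) and (3) of the summary above.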
Algorithm testing
We used various measures of quality, including specificity, sensitivity and the Matthews correlation coefficient (MCC), for testing the algorithm, as described in Guda et al. (2004). To characterize the prediction performance for individual locations, we used ROC (Receiver Operating Characteristic) plots (Swets, 1988).

IMPLEMENTATION
Analysis of Pfam domain occurrence patterns
Eukaryotic cells are organized into a complex network of membranes and compartments where metabolic pathways are distributed across different subcellular locations. Since enzymes or proteins involved in these pathways contain one or more functional domains (Pfam domains), keeping track of the functional domains specific to a location makes it possible to predict the location of a protein that contains such domains. We analyzed about 23000 protein sequences from the SWISSPROT database containing subcellular location information (from empirical studies) and determined the unique Pfam domains specific to each of the nine locations (Fig. 1). A query sequence is searched against the Pfam database to find whether any Pfam domains are present in that sequence. The Pfam score is calculated for each location depending on the presence or absence of matching location-specific Pfam domains in the query sequence. For multidomain proteins, the total Pfam score is the sum of all domain scores; however, the presence of one location-specific Pfam domain is enough to assign a query protein to that location. The Pfam-A database release 16.0 contains 7677 functional domains (HMM models), yet we used only 2146 unique domains in this program because only eukaryotic, non-plant sequences were used in the dataset. The limitation of predicting solely from the Pfam score is that, for any given genome, 30–40% of the proteins do not have reliable Pfam-A annotations at gathering thresholds, and some functional domains are shared across multiple subcellular locations.
To predict such proteins, the current method uses AAC differences across different subcellular locations in the scoring system.

Analysis of AAC differences across different subcellular locations
It has been known that protein sorting usually relies on the presence of N-terminal targeting sequences that are recognized by location-specific translocation machinery (Rusch and Kendall, 1995). To take full advantage of such targeting signals, we analyzed the AAC of the N-terminal 25 residues (NT) separately from the rest of the C-terminal (CT) sequence (Table 1). We determined the AAC differences across different locations in all pairwise combinations (36 pairs) for the 9 subcellular locations and chose only those residues showing at least 20% difference as the scoring residues (Table 2, also Fig. 1 in the Supplementary data). Inclusion of residues with less than 20% difference in the scoring system lowered the prediction performance of this method (data not shown). Analysis of AAC from different subcellular locations revealed remarkable differences in the NT region compared to the CT region (Table 2), because the targeting signals are mostly found in the N-terminal region, except for the endoplasmic reticulum and peroxisomal proteins where KDEL/HDEL and SKL signals, respectively, are found at the C-terminus (Stornaiuolo et al., 2003; Subramani et al., 2000). For the mitochondria or other organelles involved in the secretory pathway (Endoplasmic reticulum → Golgi → Lysosomes → Extracellular), N-terminal target peptides are identified based on the cleavage sites (Emanuelsson et al., 2000); however, such sites are neither universal to all locations nor to all proteins targeted to a particular location. Because of these differences and ambiguities in the protein targeting mechanisms, we used a scoring system that is independent of the targeting signals. In this approach, the real differences are deduced by comparing the AAC of each location against that of all other locations in a pairwise fashion. This method reveals not only the differences in the target peptide residues but also latent differences in the internal regions of the proteins that are otherwise difficult to detect. For example, it has been known that in N-terminal mitochondrial target peptides, Arg (R), Ala (A) and Ser (S) are over-represented while negatively charged residues such as Asp (D) and Glu (E) are under-represented (Emanuelsson et al., 2000). Our analysis of the NTAAC from mitochondrial proteins revealed considerably more than these differences: for example, Tyr (Y) is under-represented (by at least 20%) compared with most other locations, and Leu (L) is under-represented against END, EXC, GOL and LYS but over-represented (by at least 20%) against the CYT, NUC and POX locations (Table 1). For each location, we deduced such latent and significant differences in the AAC of the NT and CT regions for all-against-all locations (Table 2).

Evaluation of the prediction performance
For each location we used two test sets: the first contains all known positives and the second all known negatives for that location. We evaluated pTARGET’s performance in predicting nine subcellular locations based on specificity and sensitivity, MCC values (Table 3) and ROC plots (Fig. 2). We also determined the rates of false positives (FPs) and false negatives (FNs) using proteins from all-against-all locations (Table 1 in the Supplementary data). pTARGET can make predictions at different score thresholds, resulting in different values for the evaluation parameters stated above. A score threshold of 50 is a cutoff where predictions could come either from the Pfam score alone or from the AAC score alone, while 1 is the lowest possible score.

[Fig. 2. Comparison of the prediction performance of different subcellular locations using ROC plots. Data points used in the ROC plots correspond to the full range of discrete score thresholds, i.e. >50, 50, 46, 43, 40, 37, 31, 25, <25. Vertical axis: percentage of true positives (0 to 100); horizontal axis: percentage of false positives (0.00 to 5.00); one curve per location (CYT, END, EXC, GOL, LYS, MIT, NUC, PLA, POX).]

Table 3. Measuring the performance of pTARGET based on several measures of quality

Location  Threshold     TP     FN      TN    FP    SN    SP   MCC
CYT              50    720   1320   15642   100  0.35  0.99  0.53
CYT               1   1551    489   15042   700  0.76  0.96  0.69
END              50    332    355   17052    43  0.48  1.00  0.64
END               1    548    139   16751   344  0.80  0.98  0.69
EXC              50   3512   1722   12517    31  0.67  1.00  0.76
EXC               1   4230   1004   12339   209  0.81  0.98  0.83
GOL              50     88    131   17558     5  0.40  1.00  0.61
GOL               1    150     69   17308   255  0.68  0.99  0.50
LYS              50     91     81   17568    42  0.53  1.00  0.60
LYS               1    145     27   17350   260  0.84  0.99  0.54
MIT              50    982    681   16089    30  0.59  1.00  0.74
MIT               1   1451    212   15523   596  0.87  0.96  0.76
NUC              50   2109   1329   14269    75  0.61  0.99  0.73
NUC               1   2862    576   14065   279  0.83  0.98  0.84
PLA              50   2030   2126   13615    11  0.49  1.00  0.65
PLA               1   3474    682   13255   371  0.84  0.97  0.83
POX              50     66    107   17601     8  0.38  1.00  0.58
POX               1    138     35   17390   219  0.80  0.99  0.55

TP-true positives, FN-false negatives, TN-true negatives, FP-false positives, SN-sensitivity, SP-specificity, MCC-Matthews correlation coefficient. For each location, the first and second rows are predictions at score thresholds of 50 and 1, respectively. SN, SP and MCC values are rounded to the second decimal place.

Specificity and sensitivity test
Specificity and sensitivity are two competing but non-exclusive measures of quality useful for testing the performance of classification methods. An ideal classification method should have both values close to 1. As shown in Table 3, the maximum sensitivity of pTARGET ranges from 0.68 (GOL) to 0.87 (MIT) at the lowest score threshold of 1, and for all but the GOL location the sensitivity rates peaked above 0.75.
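The quality measures reported in Table 3 can be recomputed from the raw counts. The short Python sketch below does this for the CYT and MIT rows at score threshold 1, assuming the standard definitions SN = TP/(TP + FN), SP = TN/(TN + FP) and the usual Matthews formula; these definitions reproduce the tabulated values, but the function names are illustrative only.

```python
import math


def sensitivity(tp, fn):
    """Fraction of known positives that are predicted as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of known negatives that are predicted as negative."""
    return tn / (tn + fp)


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; 0.0 if any marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Counts taken from Table 3 at score threshold 1.
counts = {
    "CYT": {"tp": 1551, "fn": 489, "tn": 15042, "fp": 700},
    "MIT": {"tp": 1451, "fn": 212, "tn": 15523, "fp": 596},
}

for location, c in counts.items():
    sn = sensitivity(c["tp"], c["fn"])
    sp = specificity(c["tn"], c["fp"])
    m = mcc(c["tp"], c["fn"], c["tn"], c["fp"])
    print(f"{location}: SN={sn:.2f} SP={sp:.2f} MCC={m:.2f}")
```

Rounded to two decimals, the results agree with the corresponding SN, SP and MCC entries of Table 3.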
At the other end, specificity rates are almost perfect (1) for all locations at the highest score threshold of 50, while at the highest sensitivity level (score threshold of 1) the specificity rates remain at 0.96 or above. In other words, the worst-case false positive rate expected for any location would not exceed about 4%. Figure 2 shows the relationship between specificity and sensitivity using ROC plots. For all but the CYT location, the ROC curves climb rapidly towards the upper left-hand corner of the graph, which is a good characteristic of ROC plots. This shows that the pTARGET program has high sensitivity as well as high specificity. The overall prediction performance of pTARGET is lowest for cytoplasmic proteins. This is probably because CYT is the default location for protein synthesis as well as the hub of cellular core metabolism and, therefore, is likely to have the largest number of ‘shared’ functional domains, which negatively affects the prediction performance.

Matthews correlation coefficient test
MCC provides a single measure for evaluating specificity and sensitivity together; it equals one for perfect predictions and zero for random assignments (Matthews, 1975). At the highest specificity level (score 50), MCC values for different locations range from 0.53 to 0.76, while at the highest sensitivity level (score 1) they range from 0.50 to 0.84 (Table 3).

Comparison of pTARGET with PSORT
PSORT was chosen for comparison since it is the only other computational method that predicts as many subcellular locations as pTARGET does and is available as a stand-alone version. We used identical datasets for testing both methods and removed the LYS location from the comparison because PSORT predicts this location as part of the vesicular secretory pathway. Since the scoring systems used in these two methods are not comparable, we used in both cases the highest-sensitivity threshold that corresponds to a specificity higher than 0.95. As shown in Figure 3, pTARGET prediction rates are higher than those of PSORT for all but the EXC and PLA locations. The improvement in the prediction rates of pTARGET varies with each location, i.e. CYT (24%), END (37%), GOL (60%), MIT (40%), NUC (11%) and POX (42%), while PSORT prediction rates are higher in the EXC (11%) and PLA (7%) locations. PSORT employs a suite of regular expressions for predicting signal peptides and cleavage sites and, therefore, is able to predict extracellular proteins more efficiently, where such signals are well characterized. It is known that signal peptides control the entry of almost all proteins to the secretory pathway, both in eukaryotes and prokaryotes (Gierasch, 1989; von Heijne, 1990; Rapoport, 1992). For other locations, such knowledge of protein targeting is either ambiguous or not fully available. The improved prediction rates observed for the END (37%), GOL (60%) and POX (42%) locations are especially significant because most computational methods fail to predict these locations due to lack of sufficient training data.

[Fig. 3. Comparison of the prediction performance of pTARGET and PSORT. Bar chart of the percentage of true positives (0 to 100) for the CYT, END, EXC, GOL, MIT, NUC, PLA and POX locations, with one bar each for pTARGET and PSORT.]

DISCUSSION
We removed the plant sequences from our datasets because several metabolic pathways and organelles in plants are not the same as in animals, leading to differences in the distribution of protein functional domains in these two systems. Even though the AAC differences in the CT regions are not as pronounced as those of the NT regions (Table 2), inclusion of CT differences in the scoring system has considerably reduced the number of false positives (data not shown). This is because the NTAAC averages are based on only 25 residues and hence the scoring system could easily pick up unintended sequences with similar NT composition. Since the pTARGET method is primarily based on location-specific protein functional domains (Pfam-A domains), its performance could be significantly improved as more functional domains are identified in future versions of the Pfam database.

Our results suggest that the prediction performance of pTARGET is consistent and better than that of PSORT for most of the locations tested. Unlike PSORT, the current method is sufficiently robust for genome-scale prediction of proteins in eukaryotic animal species and does not require species-specific training datasets. Previously, we used MITOPRED for genome-scale prediction of mitochondrial proteins in six eukaryotic proteomes (Guda et al., 2004). One of the limitations of pTARGET is its inability to accurately predict proteins localized in multiple locations, such as those shuttling between cytoplasm and nucleus. Based on the number of ‘shared’ domains in our study (500, data not shown), we estimate that in eukaryotic proteomes, at least 20% of the proteins are localized to multiple locations. In the future, we will focus on developing sophisticated scoring methods to accurately predict proteins targeted to multiple locations.

ACKNOWLEDGEMENTS
The authors are thankful to Dr. Giridhar Chukkapalli at the San Diego Supercomputer Center for assistance in running genome-scale HMM jobs. This project has been supported by start-up funds to CG from the State University of New York at Albany and by the University of California Life Sciences Informatics (LSI) Program/Mitokor grant (L99-10077) to SS.

Conflicts of Interest: none declared.

REFERENCES
Bateman,A. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
Cedano,J. et al. (1997) Relation between amino acid composition and cellular location of proteins. J. Mol. Biol., 17, 594–600.
Chukkapalli,G. et al. (2004) SledgeHMMER: a web server for batch searching of Pfam database. Nucl. Acids Res., 32, W542–W544.
Cui,Q. et al. (2004) Esub8: a novel tool to predict protein subcellular localizations in eukaryotic organisms. Bioinformatics, 5, 66–72.
Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
Emanuelsson,O. et al. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.
Feng,Z.P. (2000) Prediction of the subcellular location of prokaryotic proteins based on a new representation of the amino acid composition. Biopolymers, 58, 491–499.
Feng,Z.P. and Zhang,C.T. (2001) Prediction of the subcellular location of prokaryotic proteins based on the hydrophobic index of the amino acids. Int. J. Biol. Macromol., 14, 255–261.
Gierasch,L.M. (1989) Signal sequences. Biochemistry, 28, 923–930.
Guda,C. et al. (2004) MITOPRED: a genome-scale method for prediction of nucleus-encoded mitochondrial proteins. Bioinformatics, 20, 1785–1794.
Hua,S. and Sun,Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721–728.
Huh,W.-K. et al. (2003) Global analysis of protein localization in budding yeast. Nature, 425, 686–691.
Kumar,A. et al. (2002) Subcellular localization of the yeast proteome. Genes Dev., 16, 707–719.
Li,W. et al. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283.
Marcotte,E.M. et al. (2000) Localizing proteins in the cell from their phylogenetic profiles. Proc. Natl Acad. Sci. USA, 97, 12115–12120.
Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
Mott,R. et al. (2002) Predicting protein cellular location using a domain projection method. Genome Res., 12, 1168–1174.
Nair,R. and Rost,B. (2002) Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18, S78–S86.
Nair,R. and Rost,B. (2003) Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins, 53, 917–930.
Nakai,K. and Horton,P. (1999) PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization. Trends Biochem. Sci., 24, 34–36.
Nielsen,H. et al. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng., 10, 1–6.
Rapoport,T.A. (1992) Transport of proteins across the endoplasmic reticulum membrane. Science, 258, 931–936.
Reinhardt,A. and Hubbard,T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230–2236.
Rusch,S.L. and Kendall,D.A. (1995) Protein transport via amino-terminal targeting sequences: common themes in diverse systems. Mol. Membr. Biol., 12, 295–307.
Stornaiuolo,M. et al. (2003) KDEL and KKXX retrieval signals appended to the same reporter protein determine different trafficking between endoplasmic reticulum, intermediate compartment, and golgi complex. Mol. Biol. Cell, 14, 889–902.
Subramani,S. et al. (2000) Import of peroxisomal matrix and membrane proteins. Annu. Rev. Biochem., 69, 399–418.
Swets,J.A. (1988) Measuring the accuracy of diagnostic systems. Science, 240, 1285–1293.
von Heijne,G. (1990) The signal peptide. J. Membr. Biol., 115, 195–201.