R3P-Loc Server Guide
Supplementary Materials for R3P-Loc Web-server

Shibiao Wan and Man-Wai Mak
email: [email protected], [email protected]
June 2014

Contents
1 Introduction to R3P-Loc Server
2 Step-by-step Protocol Guide
  2.1 Inputting Protein Accession Numbers via Copy-and-Paste
  2.2 Inputting Protein Sequences via Copy-and-Paste
  2.3 File-Upload Function
  2.4 Emailing Function
3 Statistical Methods
4 Dataset Construction

1 Introduction to R3P-Loc Server

R3P-Loc is a subcellular-localization predictor that can handle datasets containing both single-label and multi-label proteins. The R3P-Loc server supports two species types (eukaryote and plant) and two input types (amino acid sequences in FASTA format and protein accession numbers (see footnote 1) in UniProtKB [1] format). The name R3P-Loc stands for using Ridge Regression and Random Projection for predicting the subcellular localization of both single-label and multi-label proteins: the predictor applies random projection to reduce the feature dimensions of an ensemble ridge regression classifier. Similar to many other GO-based predictors [2, 3, 4, 5], R3P-Loc uses gene ontology (GO) as the feature information. The specific algorithms can be found in the paper.

For eukaryote proteins, R3P-Loc is designed to predict 22 subcellular locations of multi-label eukaryotic proteins.
The 22 subcellular locations are: (1) acrosome; (2) cell membrane; (3) cell wall; (4) centrosome; (5) chloroplast; (6) cyanelle; (7) cytoplasm; (8) cytoskeleton; (9) endoplasmic reticulum; (10) endosome; (11) extracellular; (12) Golgi apparatus; (13) hydrogenosome; (14) lysosome; (15) melanosome; (16) microsome; (17) mitochondrion; (18) nucleus; (19) peroxisome; (20) spindle pole body; (21) synapse; and (22) vacuole. When the eukaryote option is selected, the predictor is not designed to predict the subcellular localization of non-eukaryotic proteins; the prediction results for such proteins are therefore arbitrary and meaningless.

Footnote 1: http://www.uniprot.org/manual/accession_numbers

Figure 1: Interface of the R3P-Loc web-server.

For plant proteins, R3P-Loc is designed to predict 12 subcellular locations of multi-label plant proteins. The 12 subcellular locations are: (1) cell membrane; (2) cell wall; (3) chloroplast; (4) cytoplasm; (5) endoplasmic reticulum; (6) extracellular; (7) Golgi apparatus; (8) mitochondrion; (9) nucleus; (10) peroxisome; (11) plastid; and (12) vacuole. Note that (11) plastid here covers the plastid groups other than (3) chloroplast. The predictor is not designed to predict the subcellular localization of non-plant proteins; the prediction results for such proteins are therefore arbitrary and meaningless.

Figure 2: Different formats and types of input. (Figure annotation: input format and type selection.)

2 Step-by-step Protocol Guide

Fig. 1 shows the interface of the R3P-Loc web-server. As can be seen, using R3P-Loc takes two steps:

1. Select the species type and the input type. Fig. 2 shows the four combinations of species types and input types: eukaryote protein amino acid sequences in FASTA format, eukaryote protein UniProtKB accession numbers, plant protein amino acid sequences in FASTA format and plant protein UniProtKB accession numbers.
2. Input the query proteins as either FASTA sequences or accession numbers. There are two ways to input the proteins: copy and paste the protein information into the text box, or upload a file containing the proteins. Users may optionally provide an email address if they upload a file containing many FASTA sequences or accession numbers; the prediction results will then be emailed to them.

For users’ convenience, several examples of eukaryote sequences, eukaryote accession numbers, plant sequences and plant accession numbers are provided on the R3P-Loc web-server. A help page on the web-server introduces the FASTA format and the UniProtKB accession number format. The two benchmark datasets can be downloaded through hyperlinks on the web-server. Some brief instructions, covering the significance of subcellular localization prediction, specific information about R3P-Loc and some notes, are also provided. To make the web-server easier to use, the different combinations of species types, input types and input methods are presented in the following subsections.

2.1 Inputting Protein Accession Numbers via Copy-and-Paste

Fig. 3 shows an example of using accession numbers (ACs) as input. R3P-Loc accepts one or more accession numbers per submission (see footnote 2). Details of UniProtKB ACs can be found on the ‘help’ page. After prediction, a results page similar to Fig. 4 is shown, listing the input statistics, the prediction results and a link to a downloadable file containing the prediction results. Fig. 5 specifies the details of the downloadable prediction-result file.

Footnote 2: The server allows a maximum of 100 accession numbers per submission.

Figure 3: An example of using accession numbers as input. (Figure annotations: select eukaryotic accession numbers; input accession numbers; press this button to predict.)

Figure 4: Prediction results page for using accession numbers as input.

Figure 5: An example of the prediction-result file.

2.2 Inputting Protein Sequences via Copy-and-Paste

Fig. 6 shows an example of using protein amino acid sequences as input. R3P-Loc accepts one or more protein sequences (maximum 10; see footnote 3) per submission. Details of the FASTA format can be found on the ‘help’ page. After prediction, a results page similar to Fig. 7 is shown, listing the input statistics, the prediction results and a link to a downloadable text file containing the prediction results. Fig. 8 specifies the details of the prediction-result file. Besides the final subcellular locations, the prediction results also show the BLAST E-value for each query protein sequence.

Footnote 3: The updated server allows a maximum of 50 sequences per submission.

2.3 File-Upload Function

R3P-Loc allows users to upload a text file containing a list of accession numbers or sequences in FASTA format. Fig. 9 shows an example of uploading a file with a list of accession numbers without providing an email address. In this case, R3P-Loc presents the prediction results in HTML format, as shown in Fig. 10. A text file can also be downloaded from the results page; Fig. 11 shows an example of the downloadable file.

2.4 Emailing Function

To make it easier to send and further process the prediction results, R3P-Loc provides an emailing function. By providing an email address as shown in Fig. 12, users receive the prediction results by email. After prediction, an email similar to that in Fig. 13 is sent to the designated address. The email is titled “Results for your subloc prediction task from REP-Loc Server” and is sent from the official email address of the R3P-Loc server, namely [email protected].
The content of the email reads:

“Dear users,
Thank you for using our R3P-Loc web-server to predict protein subcellular localization. Attached please find the prediction results of your submissions. You can find more information from our server website. Thank you again for your support.
Best wishes,
R3P-Loc Server”

The prediction results are saved as an attachment to the email.

Figure 6: An example of using protein amino acid sequences as input. (Figure annotations: select eukaryotic protein sequences; input protein sequences; press this button to predict.)

Figure 7: Prediction results page for using accession numbers as input.

Figure 8: Details of the downloadable prediction-results file.

Figure 9: An example of using a file with a list of accession numbers as input without providing an email address. (Figure annotations: select plant accession numbers; input file (with a list of accession numbers); press this button to predict.)

Figure 10: Prediction results page for using a file input.

3 Statistical Methods

In statistical prediction, three methods are often used to test the generalization capability of predictors: independent tests, subsampling tests (K-fold cross-validation) and jackknife tests (leave-one-out cross-validation, LOOCV for short). In independent tests, the training set and the test set are fixed, which yields a fixed accuracy for the predictor. However, the selection of the independent dataset often involves some arbitrariness [6], which inevitably biases the resulting accuracy estimate.

For subsampling tests, we take five-fold cross-validation as an example. The whole dataset is randomly divided into five disjoint parts of equal size [7]. The last part may have 1-4 more examples than each of the first four parts, so that every example is evaluated by the model.
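The partition just described can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function name `five_fold_indices`, the fixed random seed and the use of plain index lists are assumptions made here. The sample count 978 is the number of actual proteins in the plant benchmark dataset reported in this guide.

```python
import random


def five_fold_indices(n_samples, k=5, seed=0):
    """Randomly split range(n_samples) into k disjoint folds.

    The first k-1 folds have n_samples // k items each; the last fold
    also absorbs the remainder, so every sample lands in exactly one fold.
    """
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an illustrative choice
    base = n_samples // k
    folds = [indices[i * base:(i + 1) * base] for i in range(k - 1)]
    folds.append(indices[(k - 1) * base:])  # last fold takes the leftovers
    return folds


folds = five_fold_indices(978)
print([len(fold) for fold in folds])  # [195, 195, 195, 195, 198]
```

With 978 samples the last fold holds three more examples than the others, which matches the "1-4 more" case in the text.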
Then one part of the dataset is used as the test set and the remaining parts are jointly used as the training set. This procedure is repeated five times, each time with a different part chosen as the test set. The number of possible ways to divide the benchmark dataset is astronomical even for a small dataset. This means that different selections lead to different results even for the same benchmark dataset, so subsampling tests are still liable to statistical arbitrariness. Subsampling tests with a smaller K run faster than those with a larger K. Thus, subsampling tests are faster than LOOCV, which can be regarded as N-fold cross-validation, where N is the number of samples in the dataset and N > K. At the same time, subsampling tests are statistically acceptable and are usually regarded as less biased than independent tests.

In LOOCV, every protein in the benchmark dataset is singled out one by one and tested by the classifier trained on the remaining proteins. In this case, the arbitrariness is avoided because LOOCV yields a unique outcome for the predictor. Therefore, LOOCV is considered to be the most rigorous and bias-free method [8]. Hence, LOOCV was used to examine the performance of R3P-Loc against other state-of-the-art predictors.

Figure 11: Details of the downloadable prediction-results file. (Figure annotation: input format and type selection.)

Figure 12: An example of using a file with a list of protein sequences as input and providing an email address. (Figure annotations: select plant protein sequences; input file (with a list of protein sequences); input email to receive and save results; press this button to predict.)

Figure 13: An example of the email containing the prediction results.

4 Dataset Construction

R3P-Loc uses two benchmark datasets [2, 9] to evaluate its performance.
Both of them were constructed using the same standard procedures with the same Swiss-Prot versions and date of construction (i.e., Swiss-Prot 55.3 on 29-Apr-2008 for the benchmark plant dataset). They differ in species (eukaryote or plant). Here, we take the plant dataset as an example to illustrate the procedure:

1. Go to the UniProt/SwissProt official webpage (http://www.uniprot.org/);
2. Go to the ‘Search’ section and select ‘Protein Knowledgebase (UniProtKB)’ (default) in the ‘Search in’ option;
3. In the ‘Query’ option, select or type ‘reviewed: yes’;
4. Select ‘AND’ in the ‘Advanced Search’ option, and then select ‘Taxonomy [OC]’ and type in ‘Viridiplantae’;
5. Select ‘AND’ in the ‘Advanced Search’ option, and then select ‘Fragment: no’;
6. Select ‘AND’ in the ‘Advanced Search’ option, and then select ‘Sequence length’ and type in ‘50 - ’ (no less than 50);
7. Select ‘AND’ in the ‘Advanced Search’ option, and then select ‘Date entry integrated’ and type in ‘ -20080429’ (no later than 29-Apr-2008);
8. Select ‘AND’ in the ‘Advanced Search’ option, and then select “Subcellular location: XXX Confidence: Experimental”; (XXX means the specific subcellular location. Here it covers 12 different locations: cell membrane; cell wall; chloroplast; cytoplasm; endoplasmic reticulum; extracellular; Golgi apparatus; mitochondrion; nucleus; peroxisome; plastid; vacuole.)
9. Further exclude those proteins that are not experimentally annotated (this rechecks the proteins to guarantee that they are all experimentally annotated).

After selecting the proteins, BLASTClust (see footnote 4) was applied to reduce the redundancy in the dataset so that no pair of sequences has a sequence identity higher than 25%. The breakdown of the two benchmark datasets is listed in Table 1 and Table 2. Both datasets are accessible from the Datasets page of the R3P-Loc web-server. The R3P-Loc server is available at http://bioinfo.eie.polyu.edu.hk/R3PLocServer/.

Table 1: Breakdown of the multi-label eukaryotic protein dataset. The sequence identity is cut off at 25%. The superscript e stands for the eukaryotic dataset.

Label   Subcellular Location          No. of Locative Proteins
1       Acrosome                      14
2       Cell membrane                 697
3       Cell wall                     49
4       Centrosome                    96
5       Chloroplast                   385
6       Cyanelle                      79
7       Cytoplasm                     2186
8       Cytoskeleton                  139
9       ER (endoplasmic reticulum)    457
10      Endosome                      41
11      Extracellular                 1048
12      Golgi apparatus               254
13      Hydrogenosome                 10
14      Lysosome                      57
15      Melanosome                    47
16      Microsome                     13
17      Mitochondrion                 610
18      Nucleus                       2320
19      Peroxisome                    110
20      SPI (spindle pole body)       68
21      Synapse                       47
22      Vacuole                       170
Total number of locative proteins (N_loc^e)   8897
Total number of actual proteins (N_act^e)     7766

Table 2: Breakdown of the multi-label plant protein dataset. The sequence identity is cut off at 25%. The superscript p stands for the plant dataset.

Label   Subcellular Location          No. of Locative Proteins
1       Cell membrane                 56
2       Cell wall                     32
3       Chloroplast                   286
4       Cytoplasm                     182
5       Endoplasmic reticulum         42
6       Extracellular                 22
7       Golgi apparatus               21
8       Mitochondrion                 150
9       Nucleus                       152
10      Peroxisome                    21
11      Plastid                       39
12      Vacuole                       52
Total number of locative proteins (N_loc^p)   1055
Total number of actual proteins (N_act^p)     978

References

[1] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O’Donovan, N. Redaschi, and L. S. Yeh, “UniProt: the Universal Protein knowledgebase,” Nucleic Acids Res, vol. 32, pp. D115–D119, 2004.
[2] K. C. Chou, Z. C. Wu, and X. Xiao, “iLoc-Euk: A multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins,” PLoS ONE, vol. 6, no. 3, pp. e18258, 2011.
[3] S. Wan, M. W. Mak, and S. Y.
Kung, “GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou’s pseudo-amino acid composition,” Journal of Theoretical Biology, vol. 323, pp. 40–48, 2013.
[4] S. Wan, M. W. Mak, and S. Y. Kung, “mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines,” BMC Bioinformatics, vol. 13, pp. 290, 2012.
[5] S. Wan, M. W. Mak, and S. Y. Kung, “HybridGO-Loc: Mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins,” PLoS ONE, vol. 9, no. 3, pp. e89545, 2014.
[6] K. C. Chou and C. T. Zhang, “Review: Prediction of protein structural classes,” Critical Reviews in Biochemistry and Molecular Biology, vol. 30, no. 4, pp. 275–349, 1995.
[7] S. Y. Mei, W. Fei, and S. G. Zhou, “Gene ontology based transfer learning for protein subcellular localization,” BMC Bioinformatics, vol. 12, pp. 44, 2011.
[8] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
[9] Z. C. Wu, X. Xiao, and K. C. Chou, “iLoc-Plant: A multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites,” Molecular BioSystems, vol. 7, pp. 3287–3297, 2011.

Footnote 4: http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html
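As a supplement to Section 2.1, accession numbers can be checked for well-formedness before they are pasted into the server. The sketch below is an illustration under stated assumptions and not the server's own validation: the regular expression is the accession-number pattern published in the UniProt help pages, while the function name `check_accessions`, the accepted separators (whitespace, commas, semicolons) and the error handling are hypothetical choices. The limit of 100 accession numbers per submission is the one stated in this guide.

```python
import re

# Accession-number pattern as published in the UniProt help pages.
UNIPROT_AC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Per-submission limit stated in this guide for accession-number input.
MAX_ACCESSIONS = 100


def check_accessions(text):
    """Split pasted text into tokens and sort them into (valid, invalid).

    Separators (whitespace, commas, semicolons) are an assumption; the
    guide does not specify how the server tokenizes its input.
    """
    tokens = [t for t in re.split(r"[\s,;]+", text) if t]
    if len(tokens) > MAX_ACCESSIONS:
        raise ValueError(
            f"{len(tokens)} accession numbers given; limit is {MAX_ACCESSIONS}"
        )
    valid = [t for t in tokens if UNIPROT_AC.match(t)]
    invalid = [t for t in tokens if not UNIPROT_AC.match(t)]
    return valid, invalid


# Example strings chosen only to exercise the pattern.
print(check_accessions("P12345, Q9Y6K9\nA0A024R161 notAnAC"))
```

A check like this only confirms that each string has the accession-number format; it does not confirm that the entry exists in UniProtKB or belongs to the selected species type.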