R3P-Loc Server Guide
Supplementary Materials for R3P-Loc Web-server
Shibiao Wan and Man-Wai Mak
email: [email protected], [email protected]
June 2014
Contents
1 Introduction to R3P-Loc Server
2 Step-by-step Protocol Guide
  2.1 Inputting Protein Accession Numbers via Copy-and-Paste
  2.2 Inputting Protein Sequences via Copy-and-Paste
  2.3 File-Upload Function
  2.4 Emailing Function
3 Statistical Methods
4 Dataset Construction
1 Introduction to R3P-Loc Server
R3P-Loc is a subcellular-localization predictor that can handle datasets containing both
single-label and multi-label proteins. The R3P-Loc server supports two species types
(eukaryote and plant) and two input types (amino acid sequences in FASTA format and
protein accession numbers^1 in UniProtKB [1] format).
R3P-Loc stands for using Ridge Regression and Random Projection for predicting
subcellular localization of both single-label and multi-label proteins: the predictor
applies random projection to reduce the feature dimensions of an ensemble ridge
regression classifier. Similar to many other GO-based predictors [2, 3, 4, 5], R3P-Loc
uses gene ontology (GO) information as features. The specific algorithms can be found
in the paper.
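As a rough illustration of the two ingredients named above, and not a restatement of the
exact formulation in the paper, random projection and ridge regression are commonly
written in the textbook forms below. All symbols (q, R, T, M, N, rho, X, y_m) are
notation assumed for this sketch only.

```latex
% Random projection: map a T-dimensional GO feature vector q to M << T dimensions
% using a random matrix R (entries drawn independently of the data).
\mathbf{q}^{\mathrm{RP}} = \frac{1}{\sqrt{M}}\,\mathbf{R}\,\mathbf{q},
\qquad \mathbf{R} \in \mathbb{R}^{M \times T}

% Ridge regression for location m on the N projected training vectors,
% with X = [q_1^{RP}, ..., q_N^{RP}] (M x N) and targets y_m (N x 1).
\hat{\boldsymbol{\beta}}_m
  = \arg\min_{\boldsymbol{\beta}}
    \sum_{i=1}^{N} \bigl( y_{i,m} - \boldsymbol{\beta}^{\mathsf{T}} \mathbf{q}_i^{\mathrm{RP}} \bigr)^2
    + \rho\, \boldsymbol{\beta}^{\mathsf{T}} \boldsymbol{\beta}
  = \bigl( \mathbf{X}\mathbf{X}^{\mathsf{T}} + \rho\,\mathbf{I} \bigr)^{-1} \mathbf{X}\,\mathbf{y}_m
```

An ensemble would combine the scores obtained from several independently drawn matrices
R; how R3P-Loc combines them and turns scores into multi-label decisions is specified in
the paper rather than here.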
For eukaryotic proteins, R3P-Loc is designed to predict 22 subcellular locations of
multi-label eukaryotic proteins: (1) acrosome; (2) cell membrane; (3) cell wall;
(4) centrosome; (5) chloroplast; (6) cyanelle; (7) cytoplasm; (8) cytoskeleton;
(9) endoplasmic reticulum; (10) endosome; (11) extracellular; (12) Golgi apparatus;
(13) hydrogenosome; (14) lysosome; (15) melanosome; (16) microsome; (17) mitochondrion;
(18) nucleus; (19) peroxisome; (20) spindle pole body; (21) synapse; and (22) vacuole.
When the eukaryote option is selected, the predictor is not designed for non-eukaryotic
proteins; the prediction results for such proteins are therefore arbitrary and
meaningless.
^1 http://www.uniprot.org/manual/accession_numbers
Figure 1: Interface of the R3P-Loc web-server.
For plant proteins, R3P-Loc is designed to predict 12 subcellular locations of
multi-label plant proteins: (1) cell membrane; (2) cell wall; (3) chloroplast;
(4) cytoplasm; (5) endoplasmic reticulum; (6) extracellular; (7) Golgi apparatus;
(8) mitochondrion; (9) nucleus; (10) peroxisome; (11) plastid; and (12) vacuole.
Note that (11) plastid covers all plastid types other than (3) chloroplast. The
predictor is not designed for non-plant proteins; the prediction results for such
proteins are therefore arbitrary and meaningless.
Figure 2: Different formats and types of input (input format and type selection).
2 Step-by-step Protocol Guide
Fig. 1 shows the interface of the R3P-Loc web-server. Using R3P-Loc takes two steps:
1. Select the species type and input type. Fig. 2 shows the four combinations of
species types and input types: eukaryotic protein amino acid sequences in FASTA
format, eukaryotic protein UniProtKB accession numbers, plant protein amino
acid sequences in FASTA format and plant protein UniProtKB accession numbers.
2. Input the query proteins as either FASTA sequences or accession numbers. There
are two ways to do so: copy and paste the protein information into the textbox, or
upload a file containing the proteins. Users may optionally provide an email address
when uploading a file containing many FASTA sequences or accession numbers, in which
case the prediction results will be emailed to them.
For users’ convenience, the R3P-Loc web-server provides several examples of eukaryotic
sequences, eukaryotic accession numbers, plant sequences and plant accession numbers.
A help page introduces the FASTA format and the UniProtKB accession number format, and
the two benchmark datasets can be downloaded via hyperlinks on the web-server. Brief
instructions covering the significance of subcellular localization prediction, specific
information about R3P-Loc and some notes are also provided.
To make the web-server easier to use, the following subsections walk through the
different combinations of species types, input types and ways of inputting proteins.
2.1 Inputting Protein Accession Numbers via Copy-and-Paste
Figure 3: An example of using accession numbers as input. (Callouts in the figure:
select eukaryotic accession numbers; input accession numbers; press the button to
predict.)
Fig. 3 shows an example of using accession numbers (ACs) as input. R3P-Loc can handle
one or more accession numbers per submission.^2 Details of UniProtKB ACs can be found
on the ‘help’ page. After prediction, a results page similar to Fig. 4 is shown, listing
the input statistics, the prediction results and a link to a downloadable file
containing the prediction results. Fig. 5 specifies the details of the downloadable
prediction-result file.
^2 The server accepts a maximum of 100 accession numbers per submission.
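Before pasting accession numbers into the textbox, it can help to check them locally.
The following is a minimal sketch, not part of the R3P-Loc server: the function name
`parse_accessions` and the accepted separators (whitespace, commas, semicolons) are
assumptions, the regular expression is the accession pattern given in the UniProt help
pages, and the limit of 100 comes from footnote 2.

```python
import re

# Accession pattern published in the UniProt help pages (6- or 10-character ACs).
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
MAX_ACCESSIONS = 100  # per-submission limit stated in footnote 2


def parse_accessions(text, limit=MAX_ACCESSIONS):
    """Split pasted text into accession numbers and validate each one.

    Assumption: entries are separated by whitespace, commas or semicolons.
    """
    tokens = [t.upper() for t in re.split(r"[\s,;]+", text) if t]
    if not tokens:
        raise ValueError("no accession numbers found")
    invalid = [t for t in tokens if not UNIPROT_AC.fullmatch(t)]
    if invalid:
        raise ValueError(f"invalid UniProtKB accession numbers: {invalid}")
    if len(tokens) > limit:
        raise ValueError(f"{len(tokens)} accession numbers given; the limit is {limit}")
    return tokens


if __name__ == "__main__":
    # Format examples only; any well-formed accession numbers would do.
    print(parse_accessions("P12345\nQ9Y6K9, A0A024R161"))
```

The check is purely syntactic: a well-formed accession number may still be absent from
UniProtKB or belong to a species the selected predictor is not designed for.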
Figure 4: Prediction results page for using accession numbers as input.
2.2 Inputting Protein Sequences via Copy-and-Paste
Fig. 6 shows an example of using protein amino acid sequences as input. R3P-Loc can
handle one or more protein sequences (maximum 10)^3 per submission. Details of the
FASTA format can be found on the ‘help’ page. After prediction, a results page similar
to Fig. 7 is shown, listing the input statistics, the prediction results and a link to
a downloadable text file containing the prediction results. Fig. 8 specifies the details
of the prediction-result file. Besides the final subcellular locations, the prediction
results also show the BLAST E-value for each query protein sequence.
^3 The updated server accepts a maximum of 50 sequences per submission.
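Sequence input can be checked in the same spirit. The sketch below parses FASTA text
into (header, sequence) pairs and enforces a per-submission limit. The function name
`parse_fasta`, the accepted residue alphabet and the choice of 50 as the default limit
(the figure quoted for the updated server) are assumptions for illustration; the server
may apply different rules.

```python
MAX_SEQUENCES = 50  # limit quoted for the updated server in footnote 3
# Assumption: 20 standard residues plus the ambiguity/rare codes B, Z, X, U, O.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBZXUO")


def parse_fasta(text, limit=MAX_SEQUENCES):
    """Return a list of (header, sequence) tuples parsed from FASTA text."""
    records, header, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))

    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > limit:
        raise ValueError(f"{len(records)} sequences given; the limit is {limit}")
    for name, seq in records:
        unknown = set(seq) - VALID_RESIDUES
        if not seq or unknown:
            raise ValueError(f"record '{name}' is empty or has invalid residues: {sorted(unknown)}")
    return records


if __name__ == "__main__":
    # Made-up sequences used only to show the expected layout.
    example = ">query1 made-up example\nMKTAYIAKQR\nQISFVKSHFS\n>query2\nACDEFGHIK\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```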
Figure 5: An example of the prediction-result file.
2.3 File-Upload Function
R3P-Loc allows users to upload a text file containing a list of accession numbers or
sequences in FASTA format. Fig. 9 shows an example of uploading a file with a list of
accession numbers without providing an email address. In this case, R3P-Loc presents
the prediction results in HTML format, as shown in Fig. 10. A text file can also be
downloaded from the results page; Fig. 11 shows an example of the downloadable file.
Figure 6: An example of using protein amino acid sequences as input. (Callouts in the
figure: select eukaryotic protein sequences; input protein sequences; press the button
to predict.)
2.4 Emailing Function
To make it easier to send and further process the prediction results, R3P-Loc provides
an emailing function. Users who provide their email address, as shown in Fig. 12,
receive the prediction results by email. After prediction, an email with contents
similar to Fig. 13 is sent to the designated address. The email is titled “Results for
your subloc prediction task from REP-Loc Server” and is sent from the official email
address of the R3P-Loc server, namely [email protected]. Its contents read:
“Dear users,
Thank you for using our R3P-Loc web-server to predict protein subcellular
localization. Attached please find the prediction results of your submissions.
You can find more information from our server website. Thank you again for
your support.
Best wishes,
R3P-Loc Server”
The prediction results are saved as an attachment to the email.
Figure 7: Prediction results page for using protein sequences as input.
Figure 8: Details of the downloadable prediction-results file.
Figure 9: An example of using a file with a list of accession numbers as input
without providing an email address. (Callouts in the figure: select plant accession
numbers; input file with a list of accession numbers; press the button to predict.)
3 Statistical Methods
In statistical prediction, three methods are often used to test the generalization
capabilities of predictors: independent tests, subsampling tests (or K-fold
cross-validation) and jackknife tests (or leave-one-out cross-validation, LOOCV).
In independent tests, the training set and the test set are fixed, which yields a
fixed accuracy for a predictor. However, the selection of the independent dataset often
involves some arbitrariness [6], which inevitably biases the accuracy estimate.
Figure 10: Prediction results page for using a file input.
Figure 11: Details of the downloadable prediction-results file.
For subsampling tests, we use five-fold cross-validation as an example. The whole
dataset is randomly divided into 5 disjoint parts of equal size [7]; the last part may
have 1-4 more examples than the other 4 parts so that every example is evaluated. One
part of the dataset is then used as the test set and the remaining parts are jointly
used as the training set. This procedure is repeated five times, each time with a
different part as the test set. The number of possible ways of dividing the benchmark
dataset is astronomical even for a small dataset, so different divisions lead to
different results even for the same benchmark dataset, and subsampling tests remain
liable to statistical arbitrariness. Subsampling tests with a smaller K run faster than
those with a larger K. They are therefore faster than LOOCV, which can be regarded as
N-fold cross-validation, where N is the number of samples in the dataset and N > K. At
the same time, subsampling tests are statistically acceptable and usually regarded as
less biased than independent tests.
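The five-fold procedure described above can be sketched as follows. This is an
illustrative implementation rather than the code used for R3P-Loc: the function names
and the fixed random seed are assumptions. The last fold absorbs the remainder, which
matches the remark that the last part may hold 1-4 extra examples, and
`count_equal_partitions` shows why the number of possible divisions grows so quickly.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly split range(n_samples) into k disjoint folds.

    The first k-1 folds hold n_samples // k indices each; the last fold
    also takes the remainder so that every sample is tested exactly once.
    """
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    size = n_samples // k
    folds = [indices[i * size:(i + 1) * size] for i in range(k - 1)]
    folds.append(indices[(k - 1) * size:])  # last fold takes the remainder
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def count_equal_partitions(n_samples, k):
    """Number of ways to divide n_samples items into k unordered equal-size parts."""
    if n_samples % k:
        raise ValueError("n_samples must be divisible by k")
    size = n_samples // k
    return math.factorial(n_samples) // (math.factorial(size) ** k * math.factorial(k))


if __name__ == "__main__":
    folds = k_fold_indices(978, k=5)  # 978 = actual proteins in the plant dataset
    print([len(f) for f in folds])    # fold sizes: [195, 195, 195, 195, 198]
    print(count_equal_partitions(20, 5))  # already large for only 20 samples
```

Changing `seed` gives a different division and, in general, a different accuracy
estimate, which is the arbitrariness the text refers to.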
Figure 12: An example of using a file with a list of protein sequences as input
and providing an email address. (Callouts in the figure: select plant protein sequences;
input file with a list of protein sequences; input email to receive and save results;
press the button to predict.)
In LOOCV, every protein in the benchmark dataset is singled out one by one and tested
by the classifier trained on the remaining proteins. Arbitrariness is avoided because
LOOCV yields a unique outcome for a predictor. LOOCV is therefore considered the most
rigorous and bias-free method [8], and it was used to compare the performance of
R3P-Loc against other state-of-the-art predictors.
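The LOOCV loop can be sketched in a few lines. The helper `leave_one_out_accuracy` and
its `fit`/`predict` arguments are hypothetical names, and the 1-nearest-neighbour
classifier on scalar features is only a toy stand-in for a real predictor; it is not
the R3P-Loc classifier. Counting a prediction as correct only on an exact match is
likewise a simplifying assumption, since multi-label evaluation uses other measures.

```python
def leave_one_out_accuracy(samples, labels, fit, predict):
    """Hold out each sample once, train on the rest, and return the accuracy."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy stand-in classifier: 1-nearest neighbour on scalar features.
def fit_1nn(train_x, train_y):
    """'Training' just stores the (feature, label) pairs."""
    return list(zip(train_x, train_y))


def predict_1nn(model, x):
    """Return the label of the stored sample closest to x."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


if __name__ == "__main__":
    xs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
    ys = ["a", "a", "a", "b", "b", "b"]
    print(leave_one_out_accuracy(xs, ys, fit_1nn, predict_1nn))
```

No random division is involved, so with a deterministic classifier repeated runs return
the same value; the cost is one training run per sample.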
Figure 13: An example of the email containing the prediction results.
4 Dataset Construction
R3P-Loc uses two benchmark datasets [2, 9] to evaluate its performance. Both were
constructed using the same standard procedure, Swiss-Prot version and date of
construction (i.e., Swiss-Prot 55.3 on 29-Apr-2008 for the benchmark plant dataset);
they differ only in the species (eukaryote or plant). Here, we take the plant dataset
as an example to illustrate the procedure:
1. Go to the UniProt/Swiss-Prot official webpage (http://www.uniprot.org/);
2. Go to the ‘Search’ section and select ‘Protein Knowledgebase (UniProtKB)’ (default) in the ‘Search in’ option;
3. In the ‘Query’ option, select or type ‘reviewed: yes’;
4. Select ‘AND’ in the ‘Advanced Search’ option, and then select ‘Taxonomy [OC]’
and type in ‘Viridiplantae’;
5. Select ‘AND’ in the ‘Advanced Search’ option, and then select ‘Fragment: no’;
6. Select ‘AND’ in the ‘Advanced Search’ option, and then select ‘Sequence length’
and type in ‘50 - ’ (no less than 50);
7. Select ‘AND’ in the ‘Advanced Search’ option, and then select ‘Date entry integrated’ and type in ‘ -20080429’;
8. Select ‘AND’ in the ‘Advanced Search’ option, and then select “Subcellular location:
XXX Confidence: Experimental” (XXX stands for a specific subcellular location; here
it covers 12 different locations: cell membrane; cell wall; chloroplast; cytoplasm;
endoplasmic reticulum; extracellular; Golgi apparatus; mitochondrion; nucleus;
peroxisome; plastid; vacuole);
9. Further exclude proteins that are not experimentally annotated (this rechecks the
proteins to guarantee that they are all experimentally annotated).
After selecting the proteins, Blastclust^4 was applied to reduce the redundancy in the
dataset so that no pair of sequences has sequence identity higher than 25%.
The breakdown of the two benchmark datasets is listed in Table 1 and Table 2. Both
datasets are accessible from the Datasets page of the R3P-Loc web-server.
The R3P-Loc server is available at http://bioinfo.eie.polyu.edu.hk/R3PLocServer/.

Table 1: Breakdown of the multi-label eukaryotic protein dataset. The sequence identity
is cut off at 25%. The superscript e denotes the eukaryotic dataset.
Label  Subcellular Location                        No. of Locative Proteins
1      Acrosome                                    14
2      Cell membrane                               697
3      Cell wall                                   49
4      Centrosome                                  96
5      Chloroplast                                 385
6      Cyanelle                                    79
7      Cytoplasm                                   2186
8      Cytoskeleton                                139
9      ER                                          457
10     Endosome                                    41
11     Extracellular                               1048
12     Golgi apparatus                             254
13     Hydrogenosome                               10
14     Lysosome                                    57
15     Melanosome                                  47
16     Microsome                                   13
17     Mitochondrion                               610
18     Nucleus                                     2320
19     Peroxisome                                  110
20     SPI                                         68
21     Synapse                                     47
22     Vacuole                                     170
Total number of locative proteins (N^e_loc)        8897
Total number of actual proteins (N^e_act)          7766

Table 2: Breakdown of the multi-label plant protein dataset. The sequence identity is
cut off at 25%. The superscript p denotes the plant dataset.
Label  Subcellular Location                        No. of Locative Proteins
1      Cell membrane                               56
2      Cell wall                                   32
3      Chloroplast                                 286
4      Cytoplasm                                   182
5      Endoplasmic reticulum                       42
6      Extracellular                               22
7      Golgi apparatus                             21
8      Mitochondrion                               150
9      Nucleus                                     152
10     Peroxisome                                  21
11     Plastid                                     39
12     Vacuole                                     52
Total number of locative proteins (N^p_loc)        1055
Total number of actual proteins (N^p_act)          978
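The selection criteria in steps 3 to 9 of the plant-dataset procedure amount to a
conjunction of filters. The sketch below expresses them over a hypothetical in-memory
record type. The `Entry` class, its field names and the test accession are invented for
illustration and do not correspond to any UniProt file format or API; the Blastclust
redundancy-reduction step is not covered.

```python
from dataclasses import dataclass
from datetime import date

CUTOFF = date(2008, 4, 29)  # step 7: 'Date entry integrated' up to 20080429
MIN_LENGTH = 50             # step 6: sequence length no less than 50
PLANT_LOCATIONS = frozenset({
    "cell membrane", "cell wall", "chloroplast", "cytoplasm",
    "endoplasmic reticulum", "extracellular", "golgi apparatus",
    "mitochondrion", "nucleus", "peroxisome", "plastid", "vacuole",
})


@dataclass
class Entry:
    """Hypothetical protein record; field names are assumptions."""
    accession: str
    reviewed: bool                 # step 3: reviewed (Swiss-Prot) entry
    lineage: tuple                 # step 4: taxonomic lineage names
    is_fragment: bool              # step 5: fragment flag
    sequence: str
    integrated: date
    experimental_locations: frozenset = frozenset()  # steps 8-9


def benchmark_locations(entry):
    """Experimentally annotated locations that belong to the 12 plant classes."""
    return entry.experimental_locations & PLANT_LOCATIONS


def passes_filters(entry):
    """Apply steps 3 to 9 as a single boolean test."""
    return (
        entry.reviewed
        and "Viridiplantae" in entry.lineage
        and not entry.is_fragment
        and len(entry.sequence) >= MIN_LENGTH
        and entry.integrated <= CUTOFF
        and bool(benchmark_locations(entry))
    )


if __name__ == "__main__":
    demo = Entry("TEST01", True, ("Eukaryota", "Viridiplantae"), False,
                 "M" * 60, date(2007, 1, 1), frozenset({"nucleus", "cytoplasm"}))
    print(passes_filters(demo), sorted(benchmark_locations(demo)))
```

A protein with two experimentally annotated locations, as in the demo record, is the
kind of multi-label entry these datasets are built to include.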
^4 http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html
References
[1] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro,
E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale,
C. O’Donovan, N. Redaschi, and L. S. Yeh, “UniProt: the Universal Protein
knowledgebase,” Nucleic Acids Res, vol. 32, pp. D115–D119, 2004.
[2] K. C. Chou, Z. C. Wu, and X. Xiao, “iLoc-Euk: A multi-label classifier for predicting
the subcellular localization of singleplex and multiplex eukaryotic proteins,” PLoS
ONE, vol. 6, no. 3, pp. e18258, 2011.
[3] S. Wan, M. W. Mak, and S. Y. Kung, “GOASVM: A subcellular location predictor by
incorporating term-frequency gene ontology into the general form of Chou’s
pseudo-amino acid composition,” Journal of Theoretical Biology, vol. 323, pp. 40–48, 2013.
[4] S. Wan, M. W. Mak, and S. Y. Kung, “mGOASVM: Multi-label protein subcellular
localization based on gene ontology and support vector machines,” BMC
Bioinformatics, vol. 13, pp. 290, 2012.
[5] S. Wan, M. W. Mak, and S. Y. Kung, “HybridGO-Loc: Mining hybrid features on
gene ontology for predicting subcellular localization of multi-location proteins,” PLoS
ONE, vol. 9, no. 3, pp. e89545, 2014.
[6] K. C. Chou and C. T. Zhang, “Review: Prediction of protein structural classes,”
Critical Reviews in Biochemistry and Molecular Biology, vol. 30, no. 4, pp. 275–349,
1995.
[7] S. Y. Mei, W. Fei, and S. G. Zhou, “Gene ontology based transfer learning for protein
subcellular localization,” BMC Bioinformatics, vol. 12, pp. 44, 2011.
[8] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning,
Springer-Verlag, 2001.
[9] Z. C. Wu, X. Xiao, and K. C. Chou, “iLoc-Plant: A multi-label classifier for predicting
the subcellular localization of plant proteins with both single and multiple sites,”
Molecular BioSystems, vol. 7, pp. 3287–3297, 2011.