Download alternatively-spliced protein sequences derived

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

United Kingdom National DNA Database wikipedia , lookup

Human genome wikipedia , lookup

Primary transcript wikipedia , lookup

Genetic code wikipedia , lookup

NEDD9 wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Protein moonlighting wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Point mutation wikipedia , lookup

Multiple sequence alignment wikipedia , lookup

Alternative splicing wikipedia , lookup

Metagenomics wikipedia , lookup

Genomics wikipedia , lookup

Sequence alignment wikipedia , lookup

Transcript
BIOINFORMATICS APPLICATIONS NOTE
Vol. 16 no. 11 2000
Pages 1048–1049
VARSPLIC: alternatively-spliced protein
sequences derived from SWISS-PROT and
TrEMBL
Paul Kersey ∗, Henning Hermjakob and Rolf Apweiler
EMBL Outstation Hinxton, The European Bioinformatics Institute (EBI), Wellcome
Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Received on April 5, 2000; revised on July 11, 2000; accepted on August 1, 2000
Abstract
Summary: The program varsplic.pl uses information
present in the SWISS-PROT and TrEMBL databases to
create new records for alternatively spliced isoforms.
These new records can be used in similarity searches.
Availability: The program is available at ftp:// ftp.ebi.
ac.uk/ pub/ software/ swissprot/ , together with regularly
updated output files.
Contact: [email protected]
Introduction
Many proteins exist in more than one isoform, one
cause of which is differential splicing: up to 30% of
human genes are believed to exist in alternatively spliced
isoforms. Isoforms may differ quite considerably from
one another, with potentially less than 50% sequence
similarity. In the SWISS-PROT and TrEMBL databases
(Bairoch and Apweiler, 2000), one sequence (usually that
of the longest isoform) is displayed for each protein:
information about splicing variants is contained in the
feature table, and identified by use of the feature key
‘VARSPLIC’. VARSPLIC features contain the following
information:
(i) the numerical position of the first and last amino
acid of a region of protein sequence present in the
parental isoform that is absent from, or added to, in
one or more alternative isoforms;
(ii) the sequence of this region in the parental isoform,
and the alternative/additional sequence that is
present in the alternative isoform(s); or
(iii) if the alternative isoform(s) contain a simple deletion of sequence, the comment ‘MISSING’ is applied instead;
(iv) a list of the isoforms in which this variation is found.
Reflecting the interest in alternative splicing, an ‘Alternative Splicing Database’ has been set up by Gelfand and
∗ To whom correspondence should be addressed.
1048
colleagues (Gelfand et al., 1999; Dralyuk et al., 2000).
This database displays a subset of SWISS-PROT records,
selected according to the presence of certain words that
(usually) indicate the existence of alternative splicing
variants in SWISS-PROT. The sequence of alternative
isoforms is not calculated. Additionally, the authors have
used a clustering algorithm to uncover instances where
two separate SWISS-PROT records describe isoforms
of the same protein. However, suggested clusters can
only be verified using information already present in
SWISS-PROT.
The results of similarity searches performed against a
protein database (using algorithms such as FASTA and
BLAST) will clearly be affected by the choice of the
isoform whose sequence is displayed for each protein in
the database. Improved results might be obtained if such
comparisons were run against a database containing all
known isoforms, i.e. including more than one sequence
per protein where that protein is known to exist in different
isoforms. One alternative might be to search a redundant
database, in which separate sequences submitted for one
protein have not been merged. However, if such a database
is searched, it will not be immediately apparent whether
the sequences with highest similarity to the query are
derived from the same protein or from different proteins;
nor whether alternative sequences from the same protein
represent genuine splice variants or errors (caused by
frameshifts, etc.). Curation of entries in SWISS-PROT
and TrEMBL adds extra information to the raw sequence,
clearly describing alternatively spliced protein sequences
encoded by the same gene. To exploit this information, we
have written the program varsplic.pl.
The program
Varsplic.pl is written in PERL and requires the Swissknife
package (Hermjakob et al., 1999). The program will parse
a database in SWISS-PROT format, and generate new
records in FASTA format, one for each isoform that is
described but not explicitly displayed in an extant record.
c Oxford University Press 2000
VARSPLIC expansion of SWISS-PROT entries
Table 1. A statistical breakdown of the results obtained from running the
program varsplic.pl on the SWISS-PROT and TrEMBL databases (SWISSPROT release 39 and updates until 11-Jul-2000; TrEMBL release 14 minus
data integrated into SWISS-PROT as of 11-Jul-2000)
Number of new
variants /record
0
1
2
3
4
5–13
SWISS-PROT (87 252 records)
Number of
records
% of
records
85 969
914
208
91
23
47
98.53
1.05
0.24
0.10
0.03
0.05
TrEMBL
(296 907 records)
Number of
% of
records
records
296 632
204
32
17
6
6
99.91
0.07
0.01
0.01
0.00
0.00
Number of
new
variants
% of
new
variants
Analysis of new variants
% non-identity
between parent
and variant
sequence
0–5
5–10
10–15
15–20
20–25
25–30
30–35
35–40
40–45
45–50
50–100
Number of
new
variants
625
414
235
166
125
98
73
49
34
28
148
% of
new
variants
31.31
20.74
11.77
8.32
6.26
4.91
3.66
2.45
1.70
1.40
7.41
133
60
57
30
24
24
19
11
15
8
56
31.00
13.99
13.29
6.99
5.59
5.59
4.43
2.56
3.50
1.86
13.05
The user may specify to generate new records for a new
isoform only if its sequence dissimilarity from its ‘parent’
sequence exceeds a stated threshold.
Alternatively spliced isoforms in SWISS-PROT and
TrEMBL
The results obtained from running the program on
SWISS-PROT and TrEMBL (SWISS-PROT release 39
and updates until 11-Jul-2000; TrEMBL release 14 minus
data integrated into SWISS-PROT as of 11-Jul-2000) are
shown in Table 1. 1996 new records from SWISS-PROT,
and 429 new records from TrEMBL, were generated.
A significant number of variants show considerable
sequence dissimilarity from the parent record. For all
new variants of sequences in SWISS-PROT, the mean
percentage sequence difference from the parent sequence
is 16.20%, and median percentage sequence difference
7.58%; the mean length difference between a variant and
its parent sequence is 106 residues, though the spread of
length differences is broad, with a standard deviation of
225 residues. Over one third of all additional sequences
(928 sequences) are variants of human proteins; other
species with many annotated variants include Mus musculus (496 variants), Rattus norvegicus (346 variants),
Drosophila melanogaster (223 variants) and Gallus gallus
(79 variants). These sequences might easily be missed by
similarity searches performed against the parent database.
Information on splicing variants is not explicitly provided by genome sequencing projects (in contrast to raw
protein sequence data). Despite this, the number of known
splicing variants annotated in the public databases is increasing : between November 1999 and July 2000, the
number of such variants annotated in SWISS-PROT and
TrEMBL rose by 16.8%. Use of the program varsplic.pl
allows for the inclusion of these sequences in biological
analyses, while ensuring that clear information about the
relationship of sequences to each other and access to unified annotation remain available.
Future developments
The program will be re-run, and the output made available,
on each new release of the SPTR (SWISS-PROT +
TrEMBL + TrEMBL new) non-redundant database at
ftp://ftp.ebi.ac.uk/pub/databases/sp tr nrdb/ and ftp://ftp.
expasy.ch/databases/sp tr nrdb/.
References
Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein
sequence database and its supplement TrEMBL. Nucleic Acids
Res., 28, 45–48.
Dralyuk,I., Brudno,M., Gelfand,M.S., Zorn,M. and Dubchak,I.
(2000) Nucleic Acids Res., 28, 296–297.
Gelfand,M.S., Dubchak,I., Dralyuk,I. and Zorn,M. (1999) Nucleic
Acids Res., 27, 301.
Hermjakob,H., Fleischmann,W. and Apweiler,R. (1999)
Swissknife—‘lazy parsing’ of SWISS-PROT entries. Bioinformatics, 15, 771–772.
1049