Download alternatively-spliced protein sequences derived

BIOINFORMATICS APPLICATIONS NOTE Vol. 16 no. 11 2000 Pages 1048–1049 VARSPLIC: alternatively-spliced protein sequences derived from SWISS-PROT and TrEMBL Paul Kersey ∗, Henning Hermjakob and Rolf Apweiler EMBL Outstation Hinxton, The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK Received on April 5, 2000; revised on July 11, 2000; accepted on August 1, 2000 Abstract Summary: The program varsplic.pl uses information present in the SWISS-PROT and TrEMBL databases to create new records for alternatively spliced isoforms. These new records can be used in similarity searches. Availability: The program is available at ftp:// ftp.ebi. ac.uk/ pub/ software/ swissprot/ , together with regularly updated output files. Contact: [email protected] Introduction Many proteins exist in more than one isoform, one cause of which is differential splicing: up to 30% of human genes are believed to exist in alternatively spliced isoforms. Isoforms may differ quite considerably from one another, with potentially less than 50% sequence similarity. In the SWISS-PROT and TrEMBL databases (Bairoch and Apweiler, 2000), one sequence (usually that of the longest isoform) is displayed for each protein: information about splicing variants is contained in the feature table, and identified by use of the feature key ‘VARSPLIC’. VARSPLIC features contain the following information: (i) the numerical position of the first and last amino acid of a region of protein sequence present in the parental isoform that is absent from, or added to, in one or more alternative isoforms; (ii) the sequence of this region in the parental isoform, and the alternative/additional sequence that is present in the alternative isoform(s); or (iii) if the alternative isoform(s) contain a simple deletion of sequence, the comment ‘MISSING’ is applied instead; (iv) a list of the isoforms in which this variation is found. Reflecting the interest in alternative splicing, an ‘Alternative Splicing Database’ has been set up by Gelfand and ∗ To whom correspondence should be addressed. 1048 colleagues (Gelfand et al., 1999; Dralyuk et al., 2000). This database displays a subset of SWISS-PROT records, selected according to the presence of certain words that (usually) indicate the existence of alternative splicing variants in SWISS-PROT. The sequence of alternative isoforms is not calculated. Additionally, the authors have used a clustering algorithm to uncover instances where two separate SWISS-PROT records describe isoforms of the same protein. However, suggested clusters can only be verified using information already present in SWISS-PROT. The results of similarity searches performed against a protein database (using algorithms such as FASTA and BLAST) will clearly be affected by the choice of the isoform whose sequence is displayed for each protein in the database. Improved results might be obtained if such comparisons were run against a database containing all known isoforms, i.e. including more than one sequence per protein where that protein is known to exist in different isoforms. One alternative might be to search a redundant database, in which separate sequences submitted for one protein have not been merged. However, if such a database is searched, it will not be immediately apparent whether the sequences with highest similarity to the query are derived from the same protein or from different proteins; nor whether alternative sequences from the same protein represent genuine splice variants or errors (caused by frameshifts, etc.). Curation of entries in SWISS-PROT and TrEMBL adds extra information to the raw sequence, clearly describing alternatively spliced protein sequences encoded by the same gene. To exploit this information, we have written the program varsplic.pl. The program Varsplic.pl is written in PERL and requires the Swissknife package (Hermjakob et al., 1999). The program will parse a database in SWISS-PROT format, and generate new records in FASTA format, one for each isoform that is described but not explicitly displayed in an extant record. c Oxford University Press 2000 VARSPLIC expansion of SWISS-PROT entries Table 1. A statistical breakdown of the results obtained from running the program varsplic.pl on the SWISS-PROT and TrEMBL databases (SWISSPROT release 39 and updates until 11-Jul-2000; TrEMBL release 14 minus data integrated into SWISS-PROT as of 11-Jul-2000) Number of new variants /record 0 1 2 3 4 5–13 SWISS-PROT (87 252 records) Number of records % of records 85 969 914 208 91 23 47 98.53 1.05 0.24 0.10 0.03 0.05 TrEMBL (296 907 records) Number of % of records records 296 632 204 32 17 6 6 99.91 0.07 0.01 0.01 0.00 0.00 Number of new variants % of new variants Analysis of new variants % non-identity between parent and variant sequence 0–5 5–10 10–15 15–20 20–25 25–30 30–35 35–40 40–45 45–50 50–100 Number of new variants 625 414 235 166 125 98 73 49 34 28 148 % of new variants 31.31 20.74 11.77 8.32 6.26 4.91 3.66 2.45 1.70 1.40 7.41 133 60 57 30 24 24 19 11 15 8 56 31.00 13.99 13.29 6.99 5.59 5.59 4.43 2.56 3.50 1.86 13.05 The user may specify to generate new records for a new isoform only if its sequence dissimilarity from its ‘parent’ sequence exceeds a stated threshold. Alternatively spliced isoforms in SWISS-PROT and TrEMBL The results obtained from running the program on SWISS-PROT and TrEMBL (SWISS-PROT release 39 and updates until 11-Jul-2000; TrEMBL release 14 minus data integrated into SWISS-PROT as of 11-Jul-2000) are shown in Table 1. 1996 new records from SWISS-PROT, and 429 new records from TrEMBL, were generated. A significant number of variants show considerable sequence dissimilarity from the parent record. For all new variants of sequences in SWISS-PROT, the mean percentage sequence difference from the parent sequence is 16.20%, and median percentage sequence difference 7.58%; the mean length difference between a variant and its parent sequence is 106 residues, though the spread of length differences is broad, with a standard deviation of 225 residues. Over one third of all additional sequences (928 sequences) are variants of human proteins; other species with many annotated variants include Mus musculus (496 variants), Rattus norvegicus (346 variants), Drosophila melanogaster (223 variants) and Gallus gallus (79 variants). These sequences might easily be missed by similarity searches performed against the parent database. Information on splicing variants is not explicitly provided by genome sequencing projects (in contrast to raw protein sequence data). Despite this, the number of known splicing variants annotated in the public databases is increasing : between November 1999 and July 2000, the number of such variants annotated in SWISS-PROT and TrEMBL rose by 16.8%. Use of the program varsplic.pl allows for the inclusion of these sequences in biological analyses, while ensuring that clear information about the relationship of sequences to each other and access to unified annotation remain available. Future developments The program will be re-run, and the output made available, on each new release of the SPTR (SWISS-PROT + TrEMBL + TrEMBL new) non-redundant database at ftp://ftp.ebi.ac.uk/pub/databases/sp tr nrdb/ and ftp://ftp. expasy.ch/databases/sp tr nrdb/. References Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL. Nucleic Acids Res., 28, 45–48. Dralyuk,I., Brudno,M., Gelfand,M.S., Zorn,M. and Dubchak,I. (2000) Nucleic Acids Res., 28, 296–297. Gelfand,M.S., Dubchak,I., Dralyuk,I. and Zorn,M. (1999) Nucleic Acids Res., 27, 301. Hermjakob,H., Fleischmann,W. and Apweiler,R. (1999) Swissknife—‘lazy parsing’ of SWISS-PROT entries. Bioinformatics, 15, 771–772. 1049

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download alternatively-spliced protein sequences derived