Title: Overview of splicing relevant databases
Pierre de la Grange
GenoSplice technology, Centre Hayem, Hôpital Saint-Louis, Paris, France
*Address correspondence to: Pierre de la Grange, GenoSplice technology, Centre Hayem, Hôpital Saint-Louis, 1
avenue Claude Vellefaux, 75010 Paris, France; tel: +33 (0) 157 276 839; fax: +33 (0) 157 276 831; E-mail:
[email protected]
1. Abstract
Alternative splicing is the main mechanism that increases transcriptome diversity by generating multiple RNA
isoforms from a single gene. It affects more than 90% of human genes and is altered in many diseases. Other
mechanisms also increase transcriptome diversity: for example, at least 81% of genes are subject to alternative
transcription initiation and 60% undergo alternative polyadenylation. Around 10% of human genes may
produce more than 10 different transcripts (i.e., with a different exon content). The large number and wide
biological impact of alternative transcripts have created a high demand for tools enabling their identification,
classification, functional annotation and expression profiling. To meet this demand, several alternative
splicing databases have been developed based on large-scale mappings or assemblies of transcribed sequences.
2. Theoretical background
2.1. Alternative splicing databases: interest
Alternative splicing affects more than 90% of human genes [1] and is altered in many diseases [2] (see chapters
10 and 11). To study the regulation of gene expression, including splicing regulation, researchers need
tools and information to guide and interpret their experiments. Alternative splicing databases can meet
several of these needs by gathering and organizing genomic and transcriptomic data and by providing tools
that predict many regulatory features (e.g., tissue expression).
2.2. Alternative splicing databases: common strategy
Most alternative splicing databases are based on the same strategy: the exon content of transcripts is retrieved
by aligning the transcript sequences with each other or against the corresponding genomic sequence. Transcript
sequences are downloaded from publicly available databanks: EMBL, GenBank and DDBJ [3,4,5]. Among these
sequences, “full-length” complementary DNAs (flcDNA) make it possible to define the whole exon content of the
different gene products. The first large-scale generation of Expressed Sequence Tags (ESTs) took place at the
beginning of the 1990s. ESTs are single reads from clone extremities, derived from normal or pathological tissue
collections, and provide the major information source for computational detection of alternative splicing
patterns. Other kinds of sequences are obtained by large-scale approaches: Sequence Tagged Sites (STS), Genome
Survey Sequences (GSS) and High Throughput Genomic Sequences (HTGS). Transcript-to-genomic alignments are
performed using dedicated bioinformatics tools that vary in sensitivity, specificity and speed. The most widely
used tools are BLAT [10], SIM4 [7], GMAP [8], SPA [9] and POA [6].
Alternative events are defined by comparing the exon content of transcripts from the same gene. Integrating
these data in a user-friendly web interface is crucial to make the information easy to access.
Many other information sources are often integrated (e.g., protein information from SwissProt).
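As a rough illustration of this comparison step, the following Python sketch takes the exon content of several
transcripts of one gene, already expressed as genomic coordinates, and reports internal exons that are skipped by
at least one other transcript. The transcript names, the coordinates and the function find_cassette_exons are
hypothetical; real databases apply far more elaborate rules (alternative splice sites, terminal exons, intron
retention).

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates of one exon


def find_cassette_exons(transcripts: Dict[str, List[Exon]]) -> List[Exon]:
    """Return internal exons that are skipped by at least one other transcript.

    An exon counts as skipped when another transcript contains an intron
    (the gap between two consecutive exons) that spans it entirely.
    """
    cassette = set()
    for name, exons in transcripts.items():
        ordered = sorted(exons)
        for exon in ordered[1:-1]:  # internal exons only
            for other_name, other in transcripts.items():
                if other_name == name:
                    continue
                other_sorted = sorted(other)
                # Introns of the other transcript: gaps between consecutive exons.
                for (_, left_end), (right_start, _) in zip(other_sorted, other_sorted[1:]):
                    if left_end < exon[0] and exon[1] < right_start:
                        cassette.add(exon)
    return sorted(cassette)


# Hypothetical exon content of three transcripts from the same gene.
transcripts = {
    "tx1": [(100, 200), (300, 400), (500, 600), (700, 800)],
    "tx2": [(100, 200), (500, 600), (700, 800)],  # skips exon 300-400
    "tx3": [(100, 200), (300, 400), (700, 800)],  # skips exon 500-600
}

print(find_cassette_exons(transcripts))
```

In this toy data set, the two middle exons of tx1 are each skipped by one other transcript, so both are reported
as candidate cassette exons.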
2.3. Description of Alternative splicing databases
More than 30 alternative splicing databases have been developed in recent years. However, each of these
databases has its own specificities and there is no “perfect database”: two or three should be used together.
Table 1 presents a selection of 14 databases with a brief description, advantages, and reference for each.
2.4. The UCSC genome browser
In addition to specialist databases on alternative splicing, other bioinformatic tools are very useful for
studying the regulation of gene expression at the exon level. One of the best known and most useful is the UCSC
Genome Browser [25]. This site contains the reference sequence and working draft
assemblies for a large collection of genomes. The Genome Browser provides dozens of aligned annotation
tracks that have been computed at UCSC or have been provided by outside collaborators. In addition to these
standard tracks, users can upload their own annotation data for temporary display in the
browser (see “Protocol” section).
3. Protocol
Since each database provides many different options, it is not possible to describe here how to use each of
them. For most of them, detailed documentation is available on their website or within the
corresponding publication (see Table 1). An example of the use of FAST DB from GenoSplice technology
is provided in the next section. The remainder of this section describes how to create a custom track
with the UCSC Genome Browser (explanations taken from the UCSC website).
Genome Browser annotation tracks are based on files in line-oriented format. Each line in the file defines a
display characteristic for the track or defines a data item within the track. Annotation files contain three types
of lines: browser lines, track lines, and data lines. To construct an annotation file and display it in the Genome
Browser, follow these steps:
Step 1: Format the data set
Formulate your data set as a tab-separated file using one of the formats supported by the Genome Browser:
GFF, BEDGRAPH, GTF, PSL, BED, bigBed, WIG, bigWig, MAF and microarray (see the UCSC website for more
details about these formats).
Step 2: Define the Genome Browser display characteristics
Add one or more optional browser lines to the beginning of your formatted data file to configure the overall
display of the Genome Browser when it initially shows your annotation data
(genome.ucsc.edu/goldenPath/help/customTrack.html#lines). Browser lines allow you to configure such things
as the genome position that the Genome Browser will initially open to, the width of the display, and the
configuration of the other annotation tracks that are shown (or hidden) in the initial display.
Step 3: Define the annotation track display characteristics
Following the browser lines and preceding the formatted data, add a track line
(genome.ucsc.edu/goldenPath/help/customTrack.html#TRACK) to define the display attributes for your
annotation data set. Track lines enable you to define annotation track characteristics such as the name,
description, colors, initial display mode, use score, etc.
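A minimal sketch of such an annotation file is given below, generated with a short Python script so that the three
types of lines are explicit. The track name, description, exon names, scores and all coordinates are placeholders
chosen for illustration (they are not real gene coordinates); the browser and track attributes used here
(position, hide, name, description, visibility, color, useScore) should be checked against the UCSC custom track
documentation cited above.

```python
# Browser lines: configure the initial Genome Browser view (all optional).
browser_lines = [
    "browser position chr4:1000-5000",  # placeholder region, not real gene coordinates
    "browser hide all",                 # hide the other annotation tracks
]

# Track line: display attributes of the custom track.
track_line = (
    'track name="myExons" description="Example alternative exons" '
    "visibility=2 color=200,0,0 useScore=1"
)

# Data lines in BED format: chrom, start, end, name, score, strand.
bed_records = [
    ("chr4", 1200, 1350, "exonA", 900, "+"),
    ("chr4", 2100, 2240, "exonB", 500, "+"),
    ("chr4", 4300, 4480, "exonC", 200, "+"),
]
data_lines = ["\t".join(str(field) for field in record) for record in bed_records]

# Browser lines come first, then the track line, then the data lines.
annotation = "\n".join(browser_lines + [track_line] + data_lines) + "\n"
print(annotation)
```

The printed text can be saved to a plain-text file and submitted through the custom track page of the Genome
Browser.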
4. Example of an experiment
Figures 1 to 4 show several screenshots from FAST DB [16,26] for the human PDLIM5 gene. In addition to
the options shown, FAST DB provides many others, such as tissue-specificity analysis using EST
expression data, prediction of microRNA binding sites, prediction of NMD regulation, prediction of transcription
and splicing factor binding sites, etc. FAST DB also provides direct links to many other databases (SwissProt,
PubMed, Entrez Gene, OMIM, EnsEMBL, UCSC, other alternative splicing databases, etc.). Even before the FAST
DB update that added several options and tools, Lerivray et al. rated it the
most useful and user-friendly alternative splicing database [27].
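The EST-based tissue-specificity analysis mentioned above can be illustrated with a small Python sketch that
computes, for each tissue, the fraction of informative ESTs supporting inclusion of an alternative exon. The EST
counts below are invented for illustration and the function inclusion_rate is hypothetical; this is not the
method implemented in FAST DB.

```python
from typing import Dict, Tuple

# Hypothetical EST counts per tissue: (ESTs including the exon, ESTs skipping it).
est_counts: Dict[str, Tuple[int, int]] = {
    "heart": (18, 2),
    "muscle": (12, 3),
    "brain": (1, 9),
    "liver": (0, 6),
}


def inclusion_rate(included: int, skipped: int) -> float:
    """Fraction of informative ESTs that support exon inclusion."""
    total = included + skipped
    return included / total if total else float("nan")


rates = {tissue: inclusion_rate(*counts) for tissue, counts in est_counts.items()}

# Print tissues from the highest to the lowest inclusion rate.
for tissue, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{tissue}\t{rate:.2f}")
```

With few ESTs per tissue such rates are very noisy, so they are best treated as hypotheses to be validated
experimentally (e.g., by RT-PCR).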
5. Troubleshooting
Alternative splicing regulation is known to depend on developmental stage, tissue type and various stimuli
[28,29]. However, very few cell types, developmental stages or stimuli have been studied in a genome-wide
manner. Consequently, the main limitation of the approaches used by alternative splicing databases is that
the transcript sequences in publicly available databanks almost certainly represent only a fraction of the
transcripts that exist in vivo.
Moreover, although not all possible transcript sequences are available in databanks, their number grows
every day, so regular updating of specialist databases such as alternative splicing databases is crucial. However,
due to technical and time limitations, many databases are not regularly updated.
The amount of information available in alternative splicing databases depends on how recently they were updated,
but also on the selection of raw data used to define gene exon structures and alternative events. For example,
EST data contain a wide variety of experimental artifacts that can lead to incorrect prediction of alternative
splicing. To reduce the number of such artifacts, some databases have set up several filters. Stringent
filters yield fewer but highly confident data; conversely, fewer (or no) filters yield
much more data but with many artifacts. For this reason, it is advisable to use two or three
different databases to obtain an overview of all the information available for a given gene in terms of splicing events.
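The trade-off between filter stringency and the amount of data can be sketched in Python as follows. The events,
the EST counts and the function filter_events are hypothetical, and a minimum number of supporting ESTs is only
one of many possible filters; it is not claimed to be the filter used by any particular database.

```python
from typing import Dict, List

# Hypothetical predicted events with the number of ESTs supporting each one.
events: List[Dict] = [
    {"event": "exon 4 skipping", "supporting_ests": 12},
    {"event": "exon 7 skipping", "supporting_ests": 1},
    {"event": "intron 2 retention", "supporting_ests": 2},
    {"event": "alternative 3' splice site in exon 9", "supporting_ests": 5},
]


def filter_events(events: List[Dict], min_ests: int) -> List[Dict]:
    """Keep only the events supported by at least `min_ests` ESTs."""
    return [event for event in events if event["supporting_ests"] >= min_ests]


# A higher threshold keeps fewer, better-supported events.
for threshold in (1, 2, 5):
    kept = filter_events(events, threshold)
    print(f"min_ests={threshold}: {len(kept)} event(s) kept")
```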
Finally, another crucial aspect, which is not limited to alternative splicing databases, concerns the
standardization of data. For example, the same exon can be numbered differently depending on the
database (e.g., exon #4 of one gene in a given database corresponds to exon #6 of the same gene in another
database). This becomes a real problem when researchers need to compare or share their results, in particular when
publishing their data. As done by HUGO for gene names and symbols [30], efforts should be made to
standardize information regarding the exon/intron structure and alternative events of known genes. Currently, the
best way to avoid this problem is to refer to the actual exon sequence.
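Referring to the exon sequence also makes it possible to translate exon numbers between databases, as in the
following Python sketch. The exon numbers and the (shortened) sequences are invented, and the function
map_exon_numbers is hypothetical; in practice the sequences would be retrieved from the databases being compared.

```python
from typing import Dict

# Hypothetical exon numbering of one gene in two databases (sequences shortened).
db_a: Dict[int, str] = {3: "ATGGCCAAG", 4: "GTTCTGGAA", 5: "CCGATTTGA"}
db_b: Dict[int, str] = {5: "ATGGCCAAG", 6: "GTTCTGGAA", 7: "CCGATTTGA", 8: "TTGACCGGA"}


def map_exon_numbers(source: Dict[int, str], target: Dict[int, str]) -> Dict[int, int]:
    """Map exon numbers of `source` to those of `target` via identical sequences."""
    by_sequence = {sequence.upper(): number for number, sequence in target.items()}
    return {
        number: by_sequence[sequence.upper()]
        for number, sequence in source.items()
        if sequence.upper() in by_sequence
    }


# Exon 4 in the first database corresponds to exon 6 in the second one.
print(map_exon_numbers(db_a, db_b))
```

Exact sequence matching is a deliberate simplification: exons whose boundaries differ slightly between databases
would need to be matched by genomic coordinates or by alignment instead.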
Figure legends
Table 1: Description of relevant alternative splicing databases
Figure 1: Options from FAST DB. Example with the human PDLIM5 gene
A. Main page of FAST DB for the PDLIM5 gene. The exon/intron gene structure is displayed with known
alternative events in red. In particular, exons 10 to 12 are known to be multiple-cassette exons and
exons 9 and 18 are two alternative terminal exons for this gene.
B. Tissue-specificity of the multiple-cassette exons 10 to 12 using EST data. These exons seem to be
specifically included in muscle and heart tissues (blue bars) and skipped in the other tissues (red bars).
C. The in silico PCR option of FAST DB facilitates primer design for RT-PCR validations and
predicts the expected product sizes and sequences.
D. FAST DB helps predict the functional consequences of alternative events by providing protein
domain predictions through direct links to specialist databases (e.g., SMART). In this example, the short
form ending in exon 9 is predicted to be translated into a protein containing one PDZ domain. The long form
(ending in exon 18) is predicted to be translated into a protein containing one PDZ domain and three LIM
domains.
References
[1] Wang E.T., Sandberg R., Luo S., Khrebtukova I., Zhang L., Mayr C., Kingsmore S.F., Schroth G.P., Burge C.B. (2008). Alternative isoform
regulation in human tissue transcriptomes. Nature. 456(7221):470-6
[2] Venables J.P. (2004). Aberrant and alternative splicing in cancer. Cancer Res. 64(21):7647-54
[3] Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Sayers E.W. (2009). GenBank. Nucleic Acids Res. 37(Database issue):D26-31
[4] Sugawara H., Ogasawara O., Okubo K., Gojobori T., Tateno Y. (2008). DDBJ with new system and face. Nucleic Acids Res. 36(Database
issue):D22-4
[5] Kulikova T., Akhtar R., Aldebert P., Althorpe N., Andersson M., Baldwin A., Bates K., Bhattacharyya S., Bower L., Browne P., et al.
(2007). EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Res. 35(Database issue):D16-20
[6] Grasso C., Lee C. (2004). Combining partial order alignment and progressive multiple sequence alignment increases alignment speed
and scalability to very large alignment problems. Bioinformatics. 20(10):1546-56
[7] Florea L., Hartzell G., Zhang Z., Rubin G.M., Miller W. (1998). A computer program for aligning a cDNA sequence with a genomic DNA
sequence. Genome Res. 8(9):967-74
[8] Wu T.D., Watanabe C.K. (2005). GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics.
21(9):1859-75
[9] Van Nimwegen E., Paul N., Sheridan R., Zavolan M. (2006). SPA: a probabilistic algorithm for spliced alignment. PLoS Genet. 2(4):e24.
[10] Kent W.J. (2002). BLAT--the BLAST-like alignment tool. Genome Res. 12(4):656-64
[11] Kim N., Alekseyenko A.V., Roy M., Lee C. (2007). The ASAP II database: analysis and comparative genomics of alternative splicing in 15
animal species. Nucleic Acids Res. 35(Database issue):D93-8
[12] Koscielny G., Le Texier V., Gopalakrishnan C., Kumanduri V., Riethoven J.J., Nardone F., Stanley E., Fallsehr C., Hofmann O., Kull M., et al.
(2009). ASTD: The Alternative Splicing and Transcript Diversity database. Genomics. 93(3):213-20
[13] Dralyuk I., Brudno M., Gelfand M.S., Zorn M., Dubchak I. (2000). ASDB: database of alternatively spliced genes. Nucleic Acids Res.
28(1):296-7
[14] Nagasaki H., Arita M., Nishizawa T., Suwa M., Gotoh O. (2006). Automated classification of alternative splicing and
transcriptional initiation and construction of visual database of classified patterns. Bioinformatics. 22:1211-6
[15] Kim P., Kim N., Lee Y., Kim B., Shin Y., Lee S. (2005). ECgene: genome annotation for alternative splicing. Nucleic Acids Res.
33(Database issue):D75-9
[16] de la Grange P., Dutertre M., Correa M., Auboeuf D. (2007). A new advance in alternative splicing databases: from catalogue to
detailed analysis of regulation of expression and function of human alternative splicing variants. BMC Bioinformatics. 8:180.
[17] Takeda J., Suzuki Y., Nakao M., Kuroda T., Sugano S., Gojobori T., Imanishi T. (2007). H-DBAS: alternative splicing database of
completely sequenced and manually annotated full-length cDNAs based on H-Invitational. Nucleic Acids Res. 35(Database issue):D104-9
[18] Holste D., Huo G., Tung V., Burge C.B. (2006). HOLLYWOOD: a comparative relational database of alternative splicing. Nucleic Acids
Res. 34(Database issue):D56-62
[19] Zheng C.L., Kwon Y.S., Li H.R., Zhang K., Coutinho-Mansfield G., Yang C., Nair T.M., Gribskov M., Fu X.D. (2005). MAASE: an alternative
splicing database designed for supporting splicing microarray applications. RNA. 11(12):1767-76
[20] Huang Y.H., Chen Y.T., Lai J.J., Yang S.T., Yang U.C. (2002). PALS db: Putative Alternative Splicing database. Nucleic Acids Res. 30(1):186-90
[21] Huang H.D., Horng J.T., Lee C.C., Liu B.J. (2003). ProSplicer: a database of putative alternative splicing information derived from
protein, mRNA and expressed sequence tag sequence data. Genome Biol. 4(4):R29
[22] Huang H.D., Horng J.T., Lin F.M., Chang Y.C., Huang C.C. (2005). SpliceInfo: an information repository for mRNA alternative splicing in
human genome. Nucleic Acids Res. 33(Database issue):D80-5
[23] Krause A., Haas S.A., Coward E., Vingron M. (2002). SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein.
Nucleic Acids Res. 30(1):299-300
[24] Hiller M., Nikolajewa S., Huse K., Szafranski K., Rosenstiel P., Schuster S., Backofen R., Platzer M. (2007). TassDB: a database of
alternative tandem splice sites. Nucleic Acids Res. 35(Database issue):D188-92
[25] Kuhn R.M., Karolchik D., Zweig A.S., Wang T., Smith K.E., Rosenbloom K.R., Rhead B., Raney B.J., Pohl A., Pheasant M. et al. (2009). The
UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 37(Database issue):D755-61
[26] de la Grange P., Dutertre M., Martin N., Auboeuf D. (2005). FAST DB: a website resource for the study of the expression regulation of
human gene products. Nucleic Acids Res. 33(13):4276-84
[27] Lerivray H., Méreau A., Osborne H.B. (2006). Our favourite alternative splice site. Biol Cell. 98(5):317-21
[28] Johnson J.M., Castle J., Garrett-Engele P., Kan Z., Loerch P.M., Armour C.D., Santos R., Schadt E.E., Stoughton R., Shoemaker D.D.
(2003). Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science. 302(5653):2141-4.
[29] Pan Q., Shai O., Misquitta C., Zhang W., Saltzman A.L., Mohammad N., Babak T., Siu H., Hughes T.R., Morris Q.D., et al. (2004).
Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Mol Cell. 16(6):929-41
[30] Eyre T.A., Ducluzeau F., Sneddon T.P., Povey S., Bruford E.A., Lush M.J. (2006). The HUGO Gene Nomenclature Database, 2006
updates. Nucleic Acids Res. 34(Database issue):D319-21
Abbreviations
EMBL European Molecular Biology Laboratory
DDBJ DNA Data Bank of Japan
flcDNA full-length complementary DNA
EST Expressed Sequence Tag
STS Sequence Tagged Site
GSS Genome Survey Sequence
HTGS High Throughput Genomic Sequence
ASAP Alternative Splicing Annotation Project
ASTD Alternative Splicing and Transcript Diversity Database
ASDB Alternative Splicing DataBase
ASTRA Alternative Splicing and Transcription Archives
FAST DB Friendly Alternative Splicing and Transcripts DataBase
H-DBAS Human-transcriptome Database for Alternative Splicing
MAASE Manually Annotated Alternatively Spliced Events Database
PALSdb Putative Alternative Splicing DataBase
UCSC University of California, Santa Cruz
Glossary
“Full-length” complementary DNA (flcDNA): a cDNA is considered full length until a longer one is found.
Expressed Sequence Tag (EST): a single read from a clone extremity, derived from normal or pathological tissue
collections; ESTs provide the major information source for computational detection of alternative splicing patterns.