A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence
Paola Bonizzoni, Graziano Pesole*, Raffaella Rizzi
DISCo, University of Milan-Bicocca, Italy
*Department of Physiology and Biochemistry, University of Milan, Italy
Supported by FIRB Bioinformatics: Genomics and Proteomics
15-20 September, WABI03

Outline
- Gene structure and alternative splicing (AS)
- Problem definition and algorithm
- ASPic program
- Experimental results and discussion

Mechanism of Splicing
[Figure: double-stranded DNA (5' to 3') is transcribed into a pre-mRNA containing exon 1, exon 2 and exon 3; splicing by the spliceosome removes the introns and yields the mRNA splicing product (exon 1, exon 2, exon 3). An EST is an Expressed Sequence Tag (cDNA).]

Modes of Alternative Splicing
[Figure: a genomic sequence with exons 1, 2, 3 and the introns between them, spliced according to a first, a second and a third splicing mode.]

Modes of Alternative Splicing
[Figure: two further modes, labelled "Competing 5'–3'" and "Exclusive exons" (exons 1, 2, 2b, 3).]

Why is AS important?
- AS occurs in 59% of human genes (Graveley, 2001)
- AS expands protein diversity (it generates multiple transcripts from a single gene)
- AS is tissue-specific (Graveley, 2001)
- AS is related to human diseases

Motivations
Regulation of AS is still an open problem.
We NEED tools to:
- predict alternative splicing forms
- analyze such a mechanism by a representation of splicing forms

What is available?
Fast programs to produce a single EST alignment to a genomic sequence:
- Spidey (Wheelan et al., 2001)
- Squall (Ogasawara & Morishita, 2002)
But predicting the exon-intron gene structure is complicated, because:
- sequencing errors in ESTs make it difficult to locate splice sites by alignment
- duplications and repeated sequences may produce more than one possible EST alignment

Open Problems
- Formal definition of the AS prediction problem …
- Combined analysis of EST alignments related to the same gene, by agreeing ESTs to a common exon-intron gene structure
- Optimization criteria

Formal Definitions
Def 1. Genomic sequence: G = I_1 f_1 I_2 f_2 I_3 f_3 … I_n f_n I_{n+1}, where I_i (i = 1, 2, …, n+1) are introns and f_i (i = 1, 2, …, n) are exons.
Def 2. Exon factorization of G: G_E = f_1 f_2 f_3 … f_n.
Def 3. An EST factorization of an EST S compatible with G_E is S = s_1 s_2 … s_k such that there exist 1 ≤ i_1 < i_2 < … < i_k ≤ n with:
- s_t = f_{i_t} for t = 2, 3, …, k-1
- s_1 is a suffix of f_{i_1} and s_k is a prefix of f_{i_k}
Variant with errors:
- edit(s_t, f_{i_t}) ≤ error for t = 2, 3, …, k-1
- edit(s_1, suff(f_{i_1})) ≤ error and edit(s_k, pref(f_{i_k})) ≤ error
Splice variant:
- s_t = suff(f_{i_t}) or s_t = pref(f_{i_t})

The Problem
Input
- A genomic sequence G
- A set of EST sequences S = {S1, S2, …, Sn}
Output
- An exon factorization G_E of G (G_E = f_1, f_2, …, f_n) and a set of EST factorizations compatible with G_E
Objective: minimize n

Example
[Figure: a genomic sequence G with candidate exons built from the segments A1, A2, B, C1, C2, D1, D2, and an EST set S = {S1, S2, S3} factorized over them; the labels "7 exons" and "4" appear on the slide.]

Results
- MEFC is MAX-SNP-hard (linear reduction from NODE-COVER)
- Heuristic algorithm: an iterative process factorizes each EST, backtracking to recompute previous EST factors if they are not compatible with G_E

The algorithm
Iterative j-th step: partial EST factorization of S_i (compute factor s_ij)
[Figure: the ESTs S_{i-1} and S_i with their factors placed on the exons e_1, e_2, …, e_m of G.]
- if Compatible(e_m, exon_list) then add e_m to exon_list;
- otherwise try to place s_ij elsewhere;
- if not possible then backtrack.
After placing all the factors s_ij for the set S, place the external factors.

The algorithm (more details)
Compute factor s_ij
[Figure: EST S_i with factors s_i1, …, s_ij-1, s_ij; the factor s_ij is split into components c1, …, c5 and placed on G between an ag and a gt site; the label s_ij y marks part of the factor.]
- s_ij can be divided into n components c_k (k = 1, 2, …, n)
- At least one of these components is error-free and can be placed on G
- The algorithm searches for a perfect match of c_k on G, for k from 1 to (n-1)
- Suppose that c_1 has no perfect match on G and that c_2 has a perfect match on G
- Find the rightmost canonical ag gt pattern on the left of c_2 on G such that the edit distance between s_ij y and the genomic substring from ag to gt is bounded
- Then the entire factor s_ij can be placed on G

ASPic (Alternative Splicing PredICtion)
Input
- A minimum length of an exon
- A maximum number of exons in the exon factorization of the genomic sequence
- An error percentage
- A genomic sequence
- An EST set (or cluster)
Output
- A text file for all EST alignments
- An HTML file for the exon factorization of the genomic sequence

ASPic data validation
Validation database: ASAP (Lee et al., 2003)
ASPic input:
- Genomic sequences from the ASAP database
- EST clusters of human chromosome 1 from the UniGene database

Experimental Results
[Table columns: genomic sequence (official gene name); ASAP introns detected by ASPic; novel introns detected by ASPic; introns shift detected.]

Execution times
Pentium IV, 1600 MHz, 256 MB, running Linux

An example of data (gene HNRPR)
Positions are counted from 0 for ASPic and from 1 for ASAP.
ASPic finds a novel intron from 2144 to 5333, confirmed by 18 EST sequences.

An example of data (gene HNRPR, intron 2144-5333)
[Table columns: EST ID; left and right exons; genomic ends of the two exons; EST ends of the two exons.]

WEB site
Project leaders: Prof. Paola Bonizzoni, Prof. Graziano Pesole
Software design lead: Raffaella Rizzi
Web site, graphical representation, data analysis and other contributions: Gabriele Ravanelli, Francesco Perego, Anna Redondi, Francesca Rossin, Gianluca Dellavedova

THANK YOU!
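
Backup sketch for Def 3 (Formal Definitions slide)
The compatibility condition of Def 3, including the variant with a bounded number of errors, can be illustrated with a short Python check. This is a minimal sketch under stated assumptions, not the ASPic implementation: the names `edit_distance` and `is_compatible`, the 0-based `indices` argument and the toy sequences are all hypothetical. The factors and the exon indices are taken as given, so the sketch only verifies a factorization and does not search for one as the heuristic algorithm does.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with the standard recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def best_suffix_distance(s: str, f: str) -> int:
    """Smallest edit distance between s and any suffix of f."""
    return min(edit_distance(s, f[p:]) for p in range(len(f) + 1))


def best_prefix_distance(s: str, f: str) -> int:
    """Smallest edit distance between s and any prefix of f."""
    return min(edit_distance(s, f[:p]) for p in range(len(f) + 1))


def is_compatible(factors: Sequence[str], exons: Sequence[str],
                  indices: List[int], error: int = 0) -> bool:
    """Check Def 3 for factors s_1..s_k mapped to exons f_{i_1}..f_{i_k}.

    Assumptions of this sketch: `indices` are 0-based positions in `exons`;
    with k = 1 both the suffix and the prefix condition are applied literally;
    error = 0 gives the exact definition, error > 0 the variant with errors.
    """
    k = len(factors)
    if k == 0 or k != len(indices):
        return False
    if any(not 0 <= i < len(exons) for i in indices):
        return False
    # The exon indices must be strictly increasing: i_1 < i_2 < ... < i_k.
    if any(a >= b for a, b in zip(indices, indices[1:])):
        return False
    for t, (s, i) in enumerate(zip(factors, indices)):
        f = exons[i]
        first, last = (t == 0), (t == k - 1)
        if first and best_suffix_distance(s, f) > error:
            return False  # s_1 must (approximately) be a suffix of f_{i_1}
        if last and best_prefix_distance(s, f) > error:
            return False  # s_k must (approximately) be a prefix of f_{i_k}
        if not first and not last and edit_distance(s, f) > error:
            return False  # inner factors must (approximately) equal the exon
    return True


# Toy data, invented for illustration only.
exons = ["ACGTAC", "GGCT", "TTAGCA"]
print(is_compatible(["TAC", "GGCT", "TTA"], exons, [0, 1, 2]))
print(is_compatible(["TAC", "GGAT", "TTA"], exons, [0, 1, 2], error=1))
```

An EST that skips the middle exon, such as the factors "TAC" and "TTA" mapped to the first and third exon, is also accepted. In that sense a single exon factorization can explain several splicing forms.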