Finding genes in human using the mouse
Finding genes in mouse using the human
Lior Pachter
Department of Mathematics, U.C. Berkeley

The Gene Finding Problem
[Figure: gene structure along the DNA from 5’ to 3’. Labels: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4; Promoter (TATA); Translation Initiation (ATG); Splice site (GGTGAG); Splice site (CAG); Pyrimidine tract; Branchpoint (CTGAC); Stop codon (TAG/TGA/TAA); polyA signal]

Approaches to Gene Recognition
• Naïve (mid-80s to mid-90s): ORFfinder, BLAST..
• Statistical de novo: Genie (96), Genscan (97), FGENESH..
• Systems: Ensembl..

“Ask not what mathematics can do for biology, ask what biology can do for mathematics” - Stanislaw Ulam

Difficulty of naïve approaches
n = number of acceptor splice sites
m = number of donor splice sites
Number of gene structures = F_{n+m+1} (Fibonacci number): 1, 1, 2, 3, 5, 8, 13, 21, 34…

Statistical gene finding
TAATATG TCC AC GG G TATTGA G CA TTG TAC AC GG G G TATTGA G C ATG TAA TGAA
TAAT ATG TCC AC GG G TATTGAG CATTG TAC AC GG G G TATTGAG C ATG TAA TGAA

Using GHMMs for ab-initio gene finding
In practice, we have an observed sequence:
TAATATG TCC AC GG G TATTGA G CA TTG TAC AC GG G G TATTGA G C ATG TAA TGAA
Predict genes by estimating the hidden state sequence:
TAAT ATG TCC AC GG G TATTGAG CATTG TAC AC GG G G TATTGAG C ATG TAA TGAA
Usual solution: the single most likely sequence of hidden states (Viterbi).

Results
• High sensitivity / low specificity
• Exon / Intron length distributions
• Identification of GC isochore - gene richness dep.
• Splice site models

Comparative Gene Finding
http://www-gsd.lbl.gov/vista/

Comparison of 1196 orthologous mRNAs (Makalowski et al., 1996)
• Sequence identity:
  – exons: 84.6%
  – protein: 85.4%
  – 5’ UTRs: 67%
  – 3’ UTRs: 69%
• 27 proteins were 100% identical.

Comparison of 117 complete genes (Batzoglou/Pachter et al. 1999)
• 95% of genes have an equal number of coding exons
  Exceptions: Spermidine Synthase, Lymphotoxin Beta
• 73% of coding exons have equal length
• 95% of coding exons have length equal mod 3
• Intron conservation: 35%
• Intron length ratio longer/shorter: 1.5

SLAM - alignment & gene finding
• Input:
  – Pair of syntenic sequences (FASTA).
• Output:
  – CDS and CNS predictions in both sequences.
  – Protein predictions.
  – Protein and CNS alignment.
http://bio.math.berkeley.edu/slam/

SLAM components
• Splice site detector
  – VLMM
• Intron and intergenic regions
  – 2nd order Markov chain
  – independent geometric lengths
• Coding sequence
  – PHMM on protein level
  – generalized length distribution
• Conserved non-coding sequence
  – PHMM on DNA level

[Figure: Input and Output]

What have we learned from comparative gene finding?
• conservation is a stronger splice site indicator than consensus
• intron lengths have diverged
• gene structure conservation is more powerful than sequence conservation for prediction
• consensus for GC splice sites

SLAM whole genome run
• Align the genomes
• Construct a synteny map
• Chop up into SLAMable pieces
• Run SLAM
• Collate results

Alignment project: http://zilla.lbl.gov/
Linux cluster with 15 PCs at 1.2 GHz, 750 MB of RAM
Three days to align the entire mouse genome against the human genome

Finding regulatory regions
Godzilla - Experimentally defined enhancer (beta-enolase)
Gene name: Enolase
http://lemur.lbl.gov/vistatrack/

Experimental gene verification with RT-PCR
[Figure labels: predicted intron; primer; Intron > 1000bp]

Aligning human/mouse
Exons > 60bp

SLAM CNS data

Single exon data

Acknowledgments
Marina Alexandersson - Gothenburg, Sweden (SLAM)
Nick Bray - LBNL/UCB math (Avid alignment program)
Simon Cawley - Affymetrix (SLAM)
Olivier Couronne - LBNL (Godzilla)
Colin Dewey - Berkeley (SLAM)
Alex Poliakov - LBNL (Godzilla, VISTA)
Chuck Sugnet - UCSC (SLAM)
Inna Dubchak - LBNL
Eddy Rubin - LBNL
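
To get a feel for how quickly the count on the "Difficulty of naïve approaches" slide grows, here is a minimal sketch that simply evaluates F_{n+m+1}. It takes the slide's formula as given rather than deriving it, and uses the indexing F_1 = F_2 = 1 implied by the listed sequence 1, 1, 2, 3, 5, 8…. The function names are illustrative and not from the talk.

```python
def fib(k):
    """k-th Fibonacci number with F_1 = F_2 = 1, matching the slide's listing."""
    a, b = 0, 1  # F_0, F_1
    for _ in range(k):
        a, b = b, a + b
    return a


def gene_structure_count(n_acceptors, m_donors):
    """Number of gene structures according to the slide: F_{n+m+1}."""
    return fib(n_acceptors + m_donors + 1)


if __name__ == "__main__":
    # The count grows exponentially in the total number of candidate sites,
    # which is why enumerating every structure is impractical.
    for sites in (0, 10, 20, 40, 80):
        n = sites // 2
        m = sites - n
        print(sites, gene_structure_count(n, m))
```

Because the Fibonacci numbers grow geometrically, even a modest number of candidate splice sites rules out exhaustive enumeration, which motivates the statistical approaches on the following slides.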
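
The "Using GHMMs" slide names Viterbi decoding as the usual way to estimate the hidden state sequence. The sketch below illustrates the idea on an ordinary two-state HMM, not a generalized HMM with explicit length distributions as used by the gene finders in the talk. The state labels ("N" for non-coding, "E" for coding) and all probabilities are made-up toy values chosen only for illustration.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most likely hidden-state path for a non-empty observation string."""
    # score[s]: best log-probability of any path that ends in state s
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    backptrs = []
    for sym in obs[1:]:
        new_score, ptr = {}, {}
        for s in states:
            # best predecessor state for landing in s at this position
            prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            ptr[s] = prev
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][sym]))
        score = new_score
        backptrs.append(ptr)
    # trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


# Toy parameters (hypothetical): coding regions are GC-rich, non-coding AT-rich.
STATES = ("N", "E")
START = {"N": 0.5, "E": 0.5}
TRANS = {"N": {"N": 0.9, "E": 0.1}, "E": {"N": 0.1, "E": 0.9}}
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

if __name__ == "__main__":
    print(viterbi("AAAAGGGGGG", STATES, START, TRANS, EMIT))
```

Working in log space avoids numerical underflow on long sequences. A GHMM decoder would additionally maximize over segment lengths for each state, which this sketch omits.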