Download DNA

Finding genes in human using the mouse
Finding genes in mouse using the human

Lior Pachter
Department of Mathematics
U.C. Berkeley

The Gene Finding Problem

[Figure: gene structure along the DNA, 5' to 3': Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4. Labeled signals:]
• Promoter: TATA
• Splice site: GGTGAG
• Translation initiation: ATG
• Splice site: CAG
• Pyrimidine tract
• Branchpoint: CTGAC
• polyA signal
• Stop codon: TAG/TGA/TAA

Approaches to Gene Recognition
• Naïve (mid-80s to mid-90s): ORFfinder, BLAST..
• Statistical de novo: Genie (96), Genscan (97), FGENESH..
• Systems: Ensembl..

“Ask not what mathematics can do for biology, ask what biology can do for mathematics” - Stanislaw Ulam

Difficulty of naïve approaches
n = number of acceptor splice sites
m = number of donor splice sites
Number of gene structures = F_{n+m+1} (Fibonacci number): 1, 1, 2, 3, 5, 8, 13, 21, 34, …

Statistical gene finding
TAATATG TCC AC GG G TATTGA G CA TTG TAC AC GG G G TATTGA G C ATG TAA TGAA
TAAT ATG TCC AC GG G TATTGAG CATTG TAC AC GG G G TATTGAG C ATG TAA TGAA

Using GHMMs for ab-initio gene finding
In practice, we have an observed sequence:
TAATATG TCC AC GG G TATTGA G CA TTG TAC AC GG G G TATTGA G C ATG TAA TGAA
Predict genes by estimating the hidden state sequence:
TAAT ATG TCC AC GG G TATTGAG CATTG TAC AC GG G G TATTGAG C ATG TAA TGAA
Usual solution: the single most likely sequence of hidden states (Viterbi).

Results
• High sensitivity / low specificity
• Exon / intron length distributions
• Identification of GC isochore - gene richness dependence
• Splice site models

Comparative Gene Finding
http://www-gsd.lbl.gov/vista/

Comparison of 1196 orthologous mRNAs (Makalowski et al., 1996)
• Sequence identity:
  – exons: 84.6%
  – protein: 85.4%
  – 5' UTRs: 67%
  – 3' UTRs: 69%
• 27 proteins were 100% identical.

Comparison of 117 complete genes (Batzoglou/Pachter et al. 1999)
• 95% of genes have an equal number of coding exons
  Exceptions: Spermidine Synthase, Lymphotoxin Beta
• 73% of coding exons have equal length
• 95% of coding exons have length equal mod 3
• Intron conservation: 35%
• Intron length ratio (longer/shorter): 1.5

SLAM: alignment & gene finding
• Input:
  – Pair of syntenic sequences (FASTA).
• Output:
  – CDS and CNS predictions in both sequences.
  – Protein predictions.
  – Protein and CNS alignment.
http://bio.math.berkeley.edu/slam/

SLAM components
• Splice site detector
  – VLMM
• Intron and intergenic regions
  – 2nd order Markov chain
  – independent geometric lengths
• Coding sequence
  – PHMM on protein level
  – generalized length distribution
• Conserved non-coding sequence
  – PHMM on DNA level

[Figure: SLAM input and output]

What have we learned from comparative gene finding?
• conservation is a stronger splice site indicator than consensus
• intron lengths have diverged
• gene structure conservation is more powerful than sequence conservation for prediction
• consensus for GC splice sites

SLAM whole genome run
• Align the genomes
• Construct a synteny map
• Chop up into SLAMable pieces
• Run SLAM
• Collate results

Alignment project: http://zilla.lbl.gov/
Linux cluster with 15 1.2 GHz PCs, 750 MB of RAM
Three days to align the entire mouse genome against the human genome

Finding regulatory regions
[Figure: Godzilla view of an experimentally defined enhancer (beta-enolase); gene name: Enolase]
http://lemur.lbl.gov/vistatrack/

Experimental gene verification with RT-PCR
[Figure labels: predicted intron, primer]
• Intron > 1000 bp
• Aligning human/mouse
• Exons > 60 bp

SLAM CNS data

Single exon data

Acknowledgments
Marina Alexandersson - Gothenburg, Sweden (SLAM)
Nick Bray - LBNL/UCB math (Avid alignment program)
Simon Cawley - Affymetrix (SLAM)
Olivier Couronne - LBNL (Godzilla)
Colin Dewey - Berkeley (SLAM)
Alex Poliakov - LBNL (Godzilla, VISTA)
Chuck Sugnet - UCSC (SLAM)
Inna Dubchak - LBNL
Eddy Rubin - LBNL
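
The count on the "Difficulty of naïve approaches" slide, F_{n+m+1}, can be evaluated directly to see how quickly exhaustive enumeration becomes infeasible. The sketch below uses the slide's indexing (F_1 = F_2 = 1). The brute-force check is an illustration under an assumption that is not stated in the slides: sites strictly alternate donor, acceptor, donor, acceptor along the sequence, and a gene structure is any set of introns, each running from a chosen donor to the next chosen acceptor. The function names are hypothetical.

```python
from itertools import combinations


def fibonacci(k: int) -> int:
    """Return F_k with the slide's indexing: F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def count_structures_bruteforce(sites: str) -> int:
    """Count intron sets in a string of 'D' (donor) and 'A' (acceptor) sites.

    Assumption (not from the slides): a structure is a selection of sites
    that, read left to right, alternates D, A, D, A, ... so that every
    intron is a (donor, acceptor) pair. Selecting no sites (no introns)
    counts as one structure.
    """
    total = 0
    # Only even-sized selections can pair every donor with an acceptor.
    for size in range(0, len(sites) + 1, 2):
        for chosen in combinations(range(len(sites)), size):
            if all(sites[i] == "DA"[pos % 2] for pos, i in enumerate(chosen)):
                total += 1
    return total


if __name__ == "__main__":
    # 10 acceptor and 10 donor sites: F_21 = 10946 candidate structures.
    print(fibonacci(10 + 10 + 1))
    # Under the alternating-sites assumption the brute force agrees.
    print(count_structures_bruteforce("DA" * 3), fibonacci(3 + 3 + 1))
```

Because the count grows exponentially in the number of candidate splice sites, enumeration is only practical for toy inputs like the one above.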

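
The "Using GHMMs" slide predicts genes by taking the single most likely hidden state sequence (Viterbi). The sketch below shows that decoding step for a plain two-state HMM, not the generalized HMM with explicit length distributions that the slides describe. The state names ("C" for coding, "N" for non-coding) and every probability are made-up values chosen only for illustration.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for an observation sequence."""
    # best[s] = log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        prev = best
        best, pointers = {}, {}
        for s in states:
            p = max(states, key=lambda q: prev[q] + log_trans[q][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][symbol]
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


def logs(table):
    """Convert a dict of probabilities into a dict of log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Toy model (all numbers are assumptions, not estimates from real genes).
STATES = ["C", "N"]
LOG_START = logs({"C": 0.5, "N": 0.5})
LOG_TRANS = {"C": logs({"C": 0.9, "N": 0.1}), "N": logs({"C": 0.1, "N": 0.9})}
LOG_EMIT = {
    "C": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),  # GC-rich "coding"
    "N": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),  # AT-rich "non-coding"
}

if __name__ == "__main__":
    score, path = viterbi("GCGCGCATATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print("".join(path))
```

Working in log space avoids numerical underflow on long sequences. A generalized HMM would additionally score the length of each segment, which this sketch omits.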