Finding genes in human using the mouse
Finding genes in mouse using the human
Lior Pachter
Department of Mathematics
U.C. Berkeley
The Gene Finding Problem
[Figure: gene structure on DNA, 5’ to 3’. Labels: promoter (TATA); translation initiation (ATG); exons 1-4 separated by introns 1-3; splice site (GGTGAG); branchpoint (CTGAC); pyrimidine tract; splice site (CAG); stop codon (TAG/TGA/TAA); polyA signal.]
Approaches to Gene Recognition
• Naïve (mid-1980s to mid-1990s): ORFfinder, BLAST, ...
• Statistical de novo: Genie (1996), Genscan (1997), FGENESH, ...
• Systems: Ensembl, ...
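To make the naïve approach concrete, here is a minimal sketch of an open reading frame scan. It is not the actual ORFfinder tool: the function name `find_orfs` and the `min_codons` parameter are hypothetical, it looks only at the forward strand, and it ignores introns entirely, which is one reason such scans struggle on eukaryotic genes.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_orfs(seq, min_codons=1):
    """Return sorted (start, end) pairs, end exclusive, of ATG...stop
    stretches on the forward strand. min_codons is the minimum number
    of codons before the stop codon, start codon included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG not yet closed by a stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATGACC"))  # [(2, 11)], i.e. ATGAAATGA
```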
“Ask not what mathematics can do for biology,
ask what biology can do for mathematics” - Stanislaw Ulam
Difficulty of naïve approaches
n = number of acceptor splice sites
m = number of donor splice sites
Number of gene structures = F_{n+m+1} (Fibonacci number)
1, 1, 2, 3, 5, 8, 13, 21, 34, …
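A small sketch shows how quickly this count grows. It assumes the indexing F_1 = F_2 = 1, matching the sequence listed above; the function names are illustrative only.

```python
def fibonacci(k):
    """k-th Fibonacci number with F_1 = F_2 = 1 (and F_0 = 0)."""
    a, b = 0, 1  # F_0, F_1
    for _ in range(k):
        a, b = b, a + b
    return a

def num_gene_structures(n, m):
    """Count from the slide: F_{n+m+1} for n acceptor and m donor sites."""
    return fibonacci(n + m + 1)

print([fibonacci(k) for k in range(1, 10)])  # [1, 1, 2, 3, 5, 8, 13, 21, 34]
print(num_gene_structures(10, 10))           # 10946
```

Even 10 candidate sites of each type give more than ten thousand structures, so enumerating them one by one does not scale.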
Statistical gene finding
TAATATG TCC AC GG G TATTGA G CA TTG TAC AC GG G G TATTGA G C ATG TAA TGAA
TAAT ATG TCC AC GG G TATTGAG CATTG TAC AC GG G G TATTGAG C ATG TAA TGAA
Using GHMMs for ab initio gene finding
In practice, we have an observed sequence:
TAATATG TCC AC GG G TATTGA G CA TTG TAC AC GG G G TATTGA G C ATG TAA TGAA
Predict genes by estimating the hidden state sequence:
TAAT ATG TCC AC GG G TATTGAG CATTG TAC AC GG G G TATTGAG C ATG TAA TGAA
Usual solution: single most likely sequence of hidden states (Viterbi).
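The Viterbi idea can be sketched on a plain two-state HMM. This is a simplification, not a GHMM: there are no explicit state length distributions, and the states ("N" for non-coding, "C" for coding) and all probabilities below are made-up toy values chosen only so that GC-rich stretches favor "C".

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state path for a plain discrete HMM (log space)."""
    log = math.log
    # best[s]: log-probability of the best path ending in state s
    best = {s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            r = max(states, key=lambda q: prev[q] + log(trans_p[q][s]))
            best[s] = prev[r] + log(trans_p[r][s]) + log(emit_p[s][symbol])
            pointers[s] = r
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), best[last]

# Toy parameters (assumptions, not estimated from data)
states = ["N", "C"]
start_p = {"N": 0.9, "C": 0.1}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {"N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
          "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

path, logp = viterbi("AATTGGCCGGAT", states, start_p, trans_p, emit_p)
print(path)  # NNNNCCCCCCNN
```

Working in log space avoids numerical underflow on long sequences, which matters at genomic scale.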
Results
• High sensitivity / low specificity
• Exon / Intron length distributions
• Identification of the dependence of gene richness on GC isochore
• Splice site models
Comparative Gene Finding
http://www-gsd.lbl.gov/vista/
Comparison of 1196 orthologous mRNAs
(Makalowski et al., 1996)
• Sequence identity:
– exons: 84.6%
– protein: 85.4%
– 5’ UTRs: 67%
– 3’ UTRs: 69%
• 27 proteins were 100% identical.
Comparison of 117 complete genes
(Batzoglou, Pachter et al., 1999)
• 95% of genes have an equal number of coding exons
Exceptions:
Spermidine Synthase
Lymphotoxin Beta
• 73% of coding exons have equal length
• 95% of coding exons have length equal mod 3
• Intron conservation 35%
• Intron length ratio longer/shorter: 1.5
SLAM: alignment & gene finding
• Input:
– Pair of syntenic sequences (FASTA).
• Output:
– CDS and CNS predictions in both sequences.
– Protein predictions.
– Protein and CNS alignment.
http://bio.math.berkeley.edu/slam/
SLAM components
• Splice site detector
– VLMM
• Intron and intergenic regions
– 2nd order Markov chain
– independent geometric lengths
• Coding sequence
– PHMM on protein level
– generalized length distribution
• Conserved non-coding sequence
– PHMM on DNA level
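The intron and intergenic component listed above (2nd order Markov chain, geometric lengths) can be illustrated with a short sketch. This is not SLAM's code: the function names, the pseudocount smoothing, and the tiny training string are assumptions for illustration, and sequences are assumed to be uppercase A/C/G/T only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_second_order(seqs, pseudocount=1.0):
    """Estimate P(next base | previous two bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in ALPHABET:
        for b in ALPHABET:
            ctx = a + b
            total = sum(counts[ctx][c] for c in ALPHABET) + pseudocount * len(ALPHABET)
            model[ctx] = {c: (counts[ctx][c] + pseudocount) / total for c in ALPHABET}
    return model

def log_likelihood(seq, model):
    """Log-probability of seq given its first two bases."""
    return sum(math.log(model[seq[i - 2:i]][seq[i]]) for i in range(2, len(seq)))

def geometric_log_prob(length, p):
    """log P(L = length) for a geometric length, P(L = l) = (1 - p)^(l - 1) * p."""
    return (length - 1) * math.log(1 - p) + math.log(p)

model = train_second_order(["ACGTACGTACGT"])
print(model["AC"]["G"])  # 4/7: "AC" is followed by "G" three times, plus smoothing
```

A geometric length distribution corresponds to a state with a constant self-transition probability, which is why it is cheap to model, in contrast to the generalized length distribution used for coding sequence.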
What have we learned from comparative gene finding?
• conservation is a stronger splice site indicator than consensus
• intron lengths have diverged
• gene structure conservation is more powerful than sequence conservation for prediction
• consensus for GC splice sites
SLAM whole genome run
• Align the genomes
• Construct a synteny map
• Chop up into SLAMable pieces
• Run SLAM
• Collate results
Alignment project: http://zilla.lbl.gov/
Linux cluster with 15 1.2 GHz PCs, 750 MB of RAM
Three days to align the entire mouse genome against the human genome
Finding regulatory regions
Godzilla
– Experimentally defined enhancer (beta-enolase)
Gene name: Enolase
http://lemur.lbl.gov/vistatrack/
Experimental gene verification with RT-PCR
[Figure labels: predicted intron; primer]
Intron > 1000 bp
Aligning human/mouse
Exons > 60 bp
SLAM CNS data
Single exon data
Acknowledgments
Marina Alexandersson - Gothenburg, Sweden (SLAM)
Nick Bray - LBNL/UCB math (Avid alignment program)
Simon Cawley - Affymetrix (SLAM)
Olivier Couronne - LBNL (Godzilla)
Colin Dewey - Berkeley (SLAM)
Alex Poliakov - LBNL (Godzilla, VISTA)
Chuck Sugnet - UCSC (SLAM)
Inna Dubchak - LBNL
Eddy Rubin - LBNL