Gene Finding
Based partly on slides by Hedi Hegyi, Irene Liu
“The Central Dogma”
DNA → (Transcription) → RNA → (Translation) → Protein
Gene Finding in Prokaryotes
Reminder: The Genetic Code
1 start codon, 3 stop codons
Finding Genes in Prokaryotes
Gene structure
High gene density
– ~85% coding in E. coli
=> Is every ORF a gene?
Finding ORFs
Many more ORFs than genes
– In E. coli one finds 6500 ORFs, while there are only 4290 genes.
In random DNA, a stop codon occurs once every 64/3 ≈ 21 codons on average.
The average protein is ~300 codons long.
=> Search for long ORFs.
Problems:
– Short genes
– Overlapping long ORFs on opposite strands
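The long-ORF search above can be sketched as follows. This is a minimal illustration, not a production gene finder: it assumes ATG is the only start codon and TAA/TAG/TGA are the stops, and the names `find_orfs` and `min_codons` are hypothetical choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Yield (strand, frame, start, end, n_codons) for each ATG...stop ORF.

    start and end are 0-based, end-exclusive offsets on the scanned strand.
    n_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # offset of the first ATG seen since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        yield strand, frame, start, i + 3, n_codons
                    start = None
```

Calling `list(find_orfs(genome, min_codons=100))` scans all six reading frames. A high threshold suppresses most chance ORFs but also discards genuinely short genes, which is the first problem listed above.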
Codon Frequencies
Coding DNA is not random:
– In random DNA, expect a Leu : Ala : Trp ratio of 6 : 4 : 1
– In real proteins, the ratio is 6.9 : 6.5 : 1
Different frequencies for different species.
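The 6 : 4 : 1 expectation for random DNA follows from the number of codons assigned to each amino acid in the standard genetic code. A short sketch that counts them (the variable names are illustrative; the amino-acid string is the standard code with codons enumerated in T, C, A, G order):

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that encode each amino acid (or stop).
codons_per_aa = Counter(CODON_TABLE.values())
leu, ala, trp = codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"]
```

With equally likely bases, each codon is equally likely, so the expected Leu : Ala : Trp ratio is simply `leu : ala : trp`.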
Using Codon Frequencies/Usage
Assume each codon is independent.
For codon abc, calculate its frequency f(abc) in coding regions.
Given a coding sequence $a_1 b_1 c_1, \ldots, a_{n+1} b_{n+1} c_{n+1}$, calculate
$$p_1 = f(a_1 b_1 c_1) \times f(a_2 b_2 c_2) \times \cdots \times f(a_n b_n c_n)$$
$$p_2 = f(b_1 c_1 a_2) \times f(b_2 c_2 a_3) \times \cdots \times f(b_n c_n a_{n+1})$$
$$p_3 = f(c_1 a_2 b_2) \times f(c_2 a_3 b_3) \times \cdots \times f(c_n a_{n+1} b_{n+1})$$
The probability that the $i$th reading frame is the coding frame:
$$P_i = \frac{p_i}{p_1 + p_2 + p_3}$$
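The reading-frame calculation above can be sketched in code. The products are accumulated as sums of logarithms to avoid numerical underflow on long sequences. The function name and the `pseudo` frequency given to codons missing from the table are assumptions of this example, not part of the slides.

```python
import math


def frame_probabilities(seq, codon_freq, pseudo=1e-6):
    """Return [P1, P2, P3], the probability of each reading frame.

    codon_freq maps a codon string to its frequency f(abc) in known
    coding regions. Codons absent from the table get the small value
    `pseudo` (an assumption made here to avoid log(0)).
    """
    seq = seq.upper()
    n = len(seq) // 3 - 1  # the sequence holds n + 1 codons; each frame uses n
    if n < 1:
        raise ValueError("need at least two complete codons")
    log_p = []
    for offset in range(3):  # offsets 0, 1, 2 correspond to p1, p2, p3
        total = 0.0
        for k in range(n):
            codon = seq[offset + 3 * k: offset + 3 * k + 3]
            total += math.log(codon_freq.get(codon, pseudo))
        log_p.append(total)
    # Normalise in log space: P_i = p_i / (p_1 + p_2 + p_3).
    top = max(log_p)
    weights = [math.exp(lp - top) for lp in log_p]
    norm = sum(weights)
    return [w / norm for w in weights]
```

In practice `codon_freq` would be estimated from annotated genes of the same species, since codon usage differs between species.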
C+G Content
C+G content (“isochore”) has a strong effect on gene density, gene length, etc.
– < 43% C+G : 62% of genome, 34% of genes
– >57% C+G : 3-5% of genome, 28% of genes
Gene density in C+G-rich regions is 5 times higher than in moderate-C+G regions and 10 times higher than in A+T-rich regions
– The amount of intronic DNA is 3 times higher in A+T-rich regions (both intron length and intron number).
– Etc…
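C+G content is straightforward to measure along a sequence. The sketch below computes it in sliding windows and bins each window using the 43% and 57% cut-offs quoted above; the window size, step, function names and class labels are illustrative assumptions.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=1000, step=500):
    """Yield (start, gc_fraction) for each full window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_content(seq[start:start + window])


def isochore_class(gc):
    """Bin a C+G fraction using the 43% and 57% thresholds from the slide."""
    if gc < 0.43:
        return "low"
    if gc > 0.57:
        return "high"
    return "moderate"
```

A gene finder can use the class of the surrounding window to choose parameters, for example expecting denser, more compact genes in "high" regions.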
RNA Transcription
Not all ORFs are expressed.
Transcription depends on regulatory regions.
Common regulatory region – the promoter
RNA polymerase binds tightly to a specific
DNA sequence in the promoter called the
binding site.
Prokaryotic Promoter
One type of RNA polymerase.
Positional Weight Matrix
For TATA box:
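A positional weight matrix can be built from aligned binding sites and then slid along a sequence to score candidate sites. The sketch below uses log-odds scores against a uniform background. The four aligned sites in the usage note are made-up toy data, not real TATA-box counts, and the function names, pseudocount and background value are assumptions of this example.

```python
import math


def build_pwm(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds PWM from equal-length aligned sites (A, C, G, T only)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the log-odds scores of window against the matrix."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (score(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )
```

For example, `best_hit(build_pwm(["TATAAA", "TATATA", "TATAAT", "TATAAA"]), "GCGCTATAAAGCGC")` reports position 4, where the toy consensus occurs. Positive scores mean a window looks more like the motif than like background.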
Gene Finding in Eukaryotes
Coding density
Eukaryote gene structure
• Gene length: 30kb, coding region: 1-2kb
• Binding site: ~6bp; ~30bp upstream of TSS
• Average of 6 exons, 150bp long
• Huge variance:
– dystrophin: 2.4Mb long
– Blood coagulation factor: 26 exons, 69bp to 3106bp; intron 22 contains another unrelated gene
Splicing
Splicing: the removal of the introns.
Performed by complexes called spliceosomes,
containing both proteins and snRNA.
The snRNA recognizes the splice sites through
RNA-RNA base-pairing
Recognition must be precise: a 1nt error can shift the reading frame, making nonsense of the message.
Many genes undergo alternative splicing, which changes the protein produced.
Splice Sites
Gene prediction programs
Scan the sequence in all 6 reading frames:
1. Start and stop codons
2. Long ORF
3. Codon usage
4. GC content
5. Gene features: promoter, terminator, poly-A sites, exons and introns, …
Figure: reading frames +1, +2 and +3.
Gene prediction programs
Genscan:
Predict location and gene features.
Can handle a few genes in one sequence
http://genes.mit.edu/GENSCAN.html
Gene prediction programs
Results:
Genscan
Burge and Karlin, Stanford, 1997
Before The Human Genome Project
– No alignments available
– Estimated human gene count was 100,000
First program to do well on realistic
sequences
– Long, multiple genes in both orientations
Pretty good sensitivity but poor specificity
– 70% Sn, 40% Sp
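Sn and Sp can be computed by comparing predicted and annotated coding positions. The sketch below assumes the convention common in gene-prediction benchmarks, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP), so that "specificity" is the fraction of predictions that are correct. The slide does not say at which level (nucleotide, exon or gene) its figures were measured; this example works at the nucleotide level, and the function name is hypothetical.

```python
def sensitivity_specificity(predicted, annotated):
    """Nucleotide-level (Sn, Sp) from two collections of coding positions.

    Sn = TP / (TP + FN): fraction of annotated coding bases that were predicted.
    Sp = TP / (TP + FP): fraction of predicted coding bases that are annotated
    (the gene-prediction convention assumed here).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)
    sn = true_pos / len(annotated) if annotated else 0.0
    sp = true_pos / len(predicted) if predicted else 0.0
    return sn, sp
```

For instance, if positions 100-199 are annotated as coding and a predictor calls positions 130-304, then 70 of the 100 annotated bases are found and 70 of the 175 predicted bases are correct, giving Sn = 0.7 and Sp = 0.4.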
GenScan Output
An end to ab initio prediction
ab initio gene prediction is inaccurate.
High false positive rates for most predictors.
Rarely used as a final product
• Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
• Used as a starting point for refinement/verification
Comparative Genomics
Use homologous sequences:
1. Annotated genes
2. mRNA sequences
3. Protein sequences
4. ESTs
ESTs
EST – Expressed Sequence Tag. ESTs are short sequences obtained from cDNA (derived from mRNA).
Transcript-based prediction
Align transcript data to genomic sequence
using a pair-wise sequence comparison.
Figure: gene model, with aligned EST and cDNA tracks.
Comparative Gene Predictions
Exons are more conserved than introns
• From a study of 1196 genes:
– exons: 84.6%
– protein: 85.4%
– introns: 35%
– 5’ UTRs: 67%
– 3’ UTRs: 69%
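The slide does not state how these conservation percentages were computed. Assuming they are percent sequence identity between aligned regions of two genomes, a minimal sketch of that measure is shown below; ignoring gapped columns is one of several possible conventions, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment.

    Both strings must come from the same alignment and so have equal length.
    Columns where either sequence has a gap are skipped (an assumption of
    this sketch; other definitions count gaps as mismatches).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x != gap and y != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)
```

Applied separately to exon, intron and UTR alignments, a measure like this would show exons scoring well above introns, which is the signal comparative gene predictors exploit.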
Gene Prediction using expressed sequences
Improvement over previously existing methods, in particular when predicting CDS:
– there is an increasingly rich representation of the transcript content of the human genome in public databases
Improvements in the ability to call the coding bases, but in particular
– in connecting exons into transcripts
– in predicting alternative splice forms
Dual genome predictors
Use statistical inference and other ‘informant’ genomes: SLAM, DoubleScan, Twinscan, SGP-2, GenomeScan, etc.
Exon prediction, such as ExoFish.
Twinscan
Korf, Flicek, Duan, Brent, Washington University
in St. Louis, 2001
Similar to GENSCAN, except it uses another
informant sequence as comparison.
– For Human, the informant is normally mouse
Slightly more sensitive than GENSCAN, much
more specific
– Exon sensitivity/specificity about 75%
Nature, 2005
Nature Methods, 2005
Other predictions in UCSC 2009
N-SCAN – extension of TwinScan
– Allows for more species (currently uses
mouse)
– A richer model of sequence evolution
N-SCAN PASA-EST
– Combines conservation with evidence based
on ESTs
Other predictions in UCSC 2009
CONTRAST (Stanford, 2007)
– 11 informants (rhesus, mouse, cow,...,chicken)
– Machine learning two-phase approach – first predict
exons and then combine them
Start and stop codon
classifier accuracy
increases as
informants are added.
Splice site detection
accuracy increases as
informants are added.
Dark Matter in the Genome
Tiling arrays for human chromosomes 20 and 22:
47% of positive probes outside exons
• 22% in introns
• 25% in intergenic regions
What could they be?
• Novel protein-coding genes
• Novel non-coding genes
• Antisense transcription
• Alternative isoforms
• Biological ‘artifacts’
• False positives
Johnson et al. 2005 TRENDS in Genetics 21:93-102
Predicting non-coding RNA?
The clues we used so far are useless!
Not clear which properties can be exploited
Sequence features such as promoters are too weak
Histone modifications + conservation are the key