Download G

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Exome sequencing wikipedia , lookup

Non-coding RNA wikipedia , lookup

RNA polymerase II holoenzyme wikipedia , lookup

Eukaryotic transcription wikipedia , lookup

Gene desert wikipedia , lookup

Genomic imprinting wikipedia , lookup

Ridge (biology) wikipedia , lookup

Non-coding DNA wikipedia , lookup

Gene expression wikipedia , lookup

Molecular evolution wikipedia , lookup

Gene regulatory network wikipedia , lookup

Community fingerprinting wikipedia , lookup

Gene expression profiling wikipedia , lookup

Endogenous retrovirus wikipedia , lookup

Gene wikipedia , lookup

Genome evolution wikipedia , lookup

Transcriptional regulation wikipedia , lookup

RNA-Seq wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Promoter (genetics) wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Transcript
What is a Gene?
CS426: Gene Prediction
Niranjan Nagarajan
Motivation to find them!
• Definition: An inheritable trait associated
with a region of DNA that codes for a
protein or specifies an RNA molecule
which in turn has an influence on some
characteristic phenotype of the organism.
Transcription and Translation
• Need to know all the letters in order to understand
the language of the cell.
• Experiments are expensive and time consuming.
• Computational techniques promise to leverage
existing knowledge about genes to predict new
genes at a fraction of the cost and time.
• The sequencing of numerous genomes (100
microbial and several eukaryotic) provides us the
raw information to tackle this task.
1
Eukaryotic Gene Structure
5’ - Promoter
Exon1
UTR
Intron1
Exon2
Terminator – 3’
splice
UTR
splice
transcription
Poly A
translation
protein
Homology
• Protein databases:
Gene finding approaches
• Find genes by matching to known genes or
proteins.
• Use information about transcribed elements in
cDNA and EST databases.
• Model the various start and stop signals
recognized by the transcription machinery to
detect the signals in the genome.
• Study the difference in composition of intergenic
regions and genes to train systems to distinguish
them.
Homology continued …
Transcript databases: Wider coverage and gives hints
about alternative splicing. However sometimes gives only
partial information and is error prone and noisy.
– Good alignment using Smith-Waterman or BLAST can
detect putative exons.
– About 50% of the genes can be detected this way.
– Problems with partial alignment and UTRs.
• Compare genomes:
– Assumption that coding regions are more conserved
than non-coding regions.
– Sometimes conservation may not cover the entire exon
or extend over to introns as well.
2
Intrinsic Approaches
• In prokaryotic sequences it is enough to search for
ORFs i.e. dna fragments between a start and a stop
codon.
• HMMs are used to exploit differences in GC
content, hexamer frequency and base occurrence
periodicity.
• Three-periodic Markov models of order five work
well for exons.
• Separate markov models for introns, UTRs and
intergenic regions helps to boost performance.
Signal Sensors continued …
• Usually done by aligning
“known” fragments and
modeling them with
ƒ Positional Weight
Matrices
ƒ Hidden Markov Models
Signal Sensors
• To detect promoter regions, splice sites,
translation starts and stops.
• Can be used where homology based
approaches fail.
• Usually the signals are weak and can only
help in refining predictions from other
methods.
Combining Them
• Homology based methods define boundaries
poorly. Combining them with signal sensors
improves predictions.
• Spliced alignment: local alignment of putative
exons with gap opening and closing determined by
the presence of splice signals.
• Intrinsic information about the putative exons as
well as global information about their location can
also be integrated into this setup.
• A HMM framework can also be used to combine
various sources of information.
3
Pitfalls
Future directions
• Long genes increase the complexity of the
search process.
• Long introns weaken the assumptions made
by alignment algorithms.
• Short exons are hard to detect.
• Presence of overlapping genes, noncanonical splice sites and alternative start
sites.
• Expert systems that combine predictions
from various methods: biologically there is
no single model for a gene.
• Signal sensors that can handle noncanonical cases.
• Improved identification of promoter
sequences.
• Identification of functional RNA genes.
Credits
• Current methods of gene prediction, their
strengths and weaknesses, Catherine Mathe
et al., Nucleic Acids Research, 2002.
• DNA Sequence Analysis (presentation by
Amir Mitchell.)
4