Bioinformatics 2008
Ab Initio Gene Calling: How Can a Gene be Real if I Haven’t Seen It Before?
Readings: pp. 197-200 in Introduction to Bioinformatics Algorithms, glimmerI.pdf
In the not-so-old days, biologists sequenced a DNA fragment only after they had
established that the fragment carried a particular gene. Now, entire genomes
are being sequenced at incredible speeds. Yet obtaining a complete genome sequence is not
the end of the road. Understanding what the sequence is telling us is not a trivial matter, even
for a seemingly easy question such as where the genes are in the sequence. There are two basic
ways to find genes in a sequence. One way is to use BLAST to look for similarity between the
unknown sequence and known genes from other organisms. We now know the basics of how the BLAST
algorithm works and how we can use it most effectively. The other way to find genes, called ab
initio, makes no assumptions about which genes should be in the genome or what they look like.
The simplest ab initio method is to look for open reading frames (ORFs). We have already met
a web-accessible program that does this task: ORFFinder, available at the NCBI website.
ORFFinder
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
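To make the idea of an ORF scan concrete, here is a minimal Python sketch. It is not how ORFFinder itself is implemented. It assumes the standard genetic code, ATG as the only start codon, and a sequence containing only A, C, G, and T; the function names and the min_codons parameter are made up for illustration.

```python
# Minimal ORF scan: a sketch under the assumptions stated above.
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) for each ATG...stop stretch in one forward frame.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons from the start codon up to, but not including, the stop.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START:
            start = i                      # first ATG opens the ORF
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3)
            start = None                   # look for the next ORF


def find_orfs(seq, min_codons=1):
    """Scan all six reading frames.

    Returns tuples (strand, frame, start, end, orf_sequence). For the
    "-" strand, the coordinates refer to the reverse-complemented string.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end, s[start:end]))
    return results


if __name__ == "__main__":
    for hit in find_orfs("ccatgaaatagGG"):
        print(hit)
```

Raising min_codons is the usual way to discard very short ORFs.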
Looking for ORFs is straightforward, but use the following questions to make sure you feel
comfortable with the approach and its limitations.
SQ1. What is an ORF?
SQ2. What defines the boundaries of an ORF?
SQ3. How often would you expect to find a start codon? A stop codon? What are the
assumptions behind your estimates?
SQ4. Does looking for ORFs overestimate or underestimate the number of real genes?
The next level of ab initio analysis includes additional information available about the genome
itself. Several of the most popular programs are listed below. In the simplest terms, these
programs ask “what do known genes from the organism of interest have in common?” (training
problem) and then “do any substrings within the genome sequence share these common
features?” (unknown evaluation problem). The known genes may be a few genes identified the
old-fashioned way years ago or a set of the longest ORFs found in the genome sequence that
seem highly unlikely to have arisen by chance. Either group or a combination of the two acts as a
training data set for the program to determine the key characteristics of genes in the organism
of interest. The program can then analyze a new sequence from that organism and compute
the probability that a given ORF is a real gene.
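Why would the longest ORFs be unlikely to arise by chance? A rough back-of-the-envelope calculation is sketched below. It assumes that bases occur independently of one another, which real genomes do not strictly obey, and the function names are illustrative only.

```python
# Chance that a random stretch of codons contains no stop codon.
# Assumption: each base is drawn independently with the given frequencies.

def stop_probability(freqs):
    """Probability that a random codon is TAA, TAG, or TGA."""
    a, g, t = freqs["A"], freqs["G"], freqs["T"]
    return t * a * a + t * a * g + t * g * a


def prob_no_stop(n_codons, p_stop=3 / 64):
    """Probability that n_codons consecutive random codons are all non-stop.

    The default p_stop of 3/64 corresponds to equal base frequencies.
    """
    return (1 - p_stop) ** n_codons


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    p = stop_probability(uniform)
    for n in (50, 100, 300):
        print(n, prob_no_stop(n, p))
```

The probability shrinks geometrically with length, which is why a set of very long ORFs is a reasonable stand-in for known genes when building a training set.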
Glimmer & GlimmerM
http://www.tigr.org/softlab/glimmer/glimmer.html
GeneMark & GeneMark.hmm
http://opal.biology.gatech.edu/GeneMark/
GENSCAN (for eukaryotes)
http://genes.mit.edu/GENSCAN.html
Other eukaryote-specific programs
http://igs-server.cnrs-mrs.fr/igs/banbury/programs.html
What characteristics can genes, each of which codes for a different protein, share just because
they come from the same genome? Let’s look at one program, Glimmer, developed at TIGR, to
get some ideas (Salzberg et al., 1998, Nucleic Acids Research 26:544-548).
Glimmer Training Problem Phase:
(1) Input a data set (previously identified genes or the longest ORFs found in the
genome).
(2) Using the data set, calculate for each nucleotide (A, C, G, T) the frequency with
which each of the four nucleotides precedes it. For example, given a C, what fraction
of the dinucleotides XC within known genes are AC? CC? GC? TC?
SQ5. Given that the organism has a given base composition, could not one predict the
frequencies of AC, CC, GC, and TC? What assumption would we have to make?
(3) Do the same calculations for each nucleotide, taking into account the two preceding
nucleotides.
SQ6. Think about the genetic code. How does it relate to step (3)?
(4) Do the same calculations for each nucleotide, taking into account the three
preceding nucleotides.
(5) Do the same calculations for each nucleotide, taking into account the four preceding
nucleotides.
(6) Do the same calculations for each nucleotide, taking into account the five preceding
nucleotides.
(7) Do the same calculations for each nucleotide, taking into account the six preceding
nucleotides.
(8) Do the same calculations for each nucleotide, taking into account the seven
preceding nucleotides.
(9) Do the same calculations for each nucleotide, taking into account the eight
preceding nucleotides.
(10) Develop a scoring function that will assign a probability score for an unknown
sequence, looking at 8 bases at a time and taking steps 1 base long, based on
the data obtained in steps 2-9 above. However, only those frequencies based on
a set minimum number of occurrences in the training set are used.
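The training steps above can be sketched in Python as follows. This is a simplified illustration and not Glimmer's actual code. It counts which base follows each context of 0 to 8 preceding bases, then scores a sequence using, for each base, the longest context seen at least min_count times in the training set. Glimmer itself combines contexts of different lengths in a more sophisticated way, so the simple back-off rule, the add-one smoothing, and all of the names here are assumptions made for the example.

```python
import math
from collections import defaultdict

MAX_ORDER = 8   # longest context, matching steps (2)-(9)
BASES = "ACGT"


def train(genes, max_order=MAX_ORDER):
    """Return counts[context][base]: how often `base` followed `context`."""
    counts = defaultdict(lambda: defaultdict(int))
    for gene in genes:
        gene = gene.upper()
        for i, base in enumerate(gene):
            for k in range(max_order + 1):
                if i - k < 0:
                    break                   # not enough preceding bases
                counts[gene[i - k:i]][base] += 1
    return counts


def base_probability(counts, context, base, min_count=10):
    """Estimate P(base | context) from the longest well-sampled context."""
    for k in range(len(context), -1, -1):
        ctx = context[len(context) - k:]    # last k bases of the context
        seen = counts.get(ctx)
        total = sum(seen.values()) if seen else 0
        if total >= min_count or (k == 0 and total > 0):
            # add-one smoothing so unseen bases never get probability zero
            return (seen.get(base, 0) + 1) / (total + len(BASES))
    return 1 / len(BASES)                   # no training data at all


def log_score(counts, seq, max_order=MAX_ORDER, min_count=10):
    """Sum of log-probabilities, moving along the sequence one base at a time."""
    seq = seq.upper()
    total = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - max_order):i]
        total += math.log(base_probability(counts, context, base, min_count))
    return total


if __name__ == "__main__":
    toy_genes = ["ATGGCGGCGGCGGCGTAA"] * 20   # made-up training set
    model = train(toy_genes)
    print(log_score(model, "GCGGCGGCG"), log_score(model, "TTTTTTTTT"))
```

A sequence that resembles the training genes gets a higher (less negative) score than one that does not. The min_count argument plays the role of the "set minimum number of occurrences" in step (10).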
Glimmer Unknown Evaluation Problem Phase:
(1) Input an unknown string (could be a part or the entire genome sequence of the
same organism used for the training set).
(2) For a potential ORF, calculate the score for that substring in all six reading frames.
If the reading frame corresponding to the ORF itself has the highest score and
the score is greater than a set minimum cut-off, then accept the ORF as a
putative gene.
SQ7. Remind yourself, why are there 6 reading frames?
(3) If two ORFs in different reading frames are both accepted, but they overlap by more
than a set cut-off, then compare the scores of the two frames only in the
region of overlap and keep only the higher-scoring gene.
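The evaluation steps can be sketched along the same lines. To keep the example self-contained, a toy codon-frequency scorer stands in for the trained scoring function; it is not Glimmer's model. The function names, the use of an average per-codon score, and the restriction of the overlap check to two ORFs on the same strand are all assumptions made for illustration.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def train_codon_model(genes):
    """Toy stand-in scorer: smoothed log-probability of each codon."""
    counts = Counter()
    for g in genes:
        g = g.upper()
        counts.update(g[i:i + 3] for i in range(0, len(g) - 2, 3))
    total = sum(counts.values()) + 64        # add-one over 64 codons
    return lambda codon: math.log((counts[codon] + 1) / total)


def mean_score(codons, codon_logp):
    return sum(codon_logp(c) for c in codons) / max(1, len(codons))


def frame_scores(region, codon_logp):
    """Scores for frames 0-2 (forward) and 3-5 (reverse complement)."""
    region = region.upper()
    scores = []
    for strand in (region, reverse_complement(region)):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            scores.append(mean_score(codons, codon_logp))
    return scores


def accept_orf(orf, codon_logp, cutoff):
    """Step (2): the ORF's own frame (frame 0) must win and beat the cutoff."""
    scores = frame_scores(orf, codon_logp)
    return scores[0] == max(scores) and scores[0] > cutoff


def overlap_score(genome, gene, lo, hi, codon_logp):
    """Score gene's frame using only whole codons inside [lo, hi)."""
    first = lo + (gene[0] - lo) % 3          # first codon boundary >= lo
    codons = [genome[i:i + 3] for i in range(first, hi - 2, 3)]
    return mean_score(codons, codon_logp)


def resolve_overlap(genome, gene_a, gene_b, codon_logp, max_overlap):
    """Step (3) for two same-strand genes given as (start, end) pairs."""
    genome = genome.upper()
    lo, hi = max(gene_a[0], gene_b[0]), min(gene_a[1], gene_b[1])
    if hi - lo <= max_overlap:
        return [gene_a, gene_b]              # overlap is tolerated
    score_a = overlap_score(genome, gene_a, lo, hi, codon_logp)
    score_b = overlap_score(genome, gene_b, lo, hi, codon_logp)
    return [gene_a] if score_a >= score_b else [gene_b]


if __name__ == "__main__":
    gene = "ATG" + "GCG" * 10 + "TAA"        # made-up training gene
    logp = train_codon_model([gene])
    print(accept_orf(gene, logp, cutoff=-3.0))
    print(resolve_overlap(gene, (0, 36), (1, 34), logp, max_overlap=10))
```

Both the acceptance cut-off and the overlap cut-off are tunable values, matching the "set minimum cut-off" and "set cut-off" in steps (2) and (3).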
Glimmer is a little more complicated than this, but hopefully you get the picture. We can talk
more about it in class.