Download Assembling the Sequence of the Genome

Bioinformatics 2008

Ab Initio Gene Calling: How Can a Gene be Real if I Haven’t Seen It Before?

Readings: pp. 197-200 in Introduction to Bioinformatics Algorithms, glimmerI.pdf

In the not-so-old days, biologists sequenced a DNA fragment only after they had established that the fragment carried a particular gene. Now, entire genomes are being sequenced at incredible speeds. Yet obtaining a complete genome sequence is not the end of the road. Understanding what the sequence is telling us is not a trivial matter, even for a seemingly easy question such as where the genes are in the sequence.

There are two basic ways to find genes in a sequence. One way is to use BLAST to look for similarity between the unknown sequence and known genes from other organisms. We now know the basics of how the BLAST algorithm works and how we can use it most effectively. The other way to find genes, called ab initio, makes no assumptions about which genes should be in the genome or what they look like.

The simplest ab initio method is to look for open reading frames (ORFs). We have already met a web-accessible program that does this task: ORFFinder, available at the NCBI website.

ORFFinder http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Looking for ORFs is straightforward, but use the following questions to make sure you feel comfortable with the approach and its limitations.

SQ1. What is an ORF?
SQ2. What defines the boundaries of an ORF?
SQ3. How often would you expect to find a start codon? A stop codon? What are the assumptions behind your estimates?
SQ4. Does looking for ORFs overestimate or underestimate the number of real genes?

The next level of ab initio analysis includes additional information available about the genome itself. Several of the most popular programs are listed below.
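Before turning to those programs, it may help to see how little is involved in the basic ORF search described above. The following Python sketch is an illustration only, not the way ORFFinder itself is implemented. It assumes that ATG is the only start codon, that the three standard stop codons apply, and that the sequence contains only A, C, G and T. The function names and the min_codons parameter are hypothetical.

```python
# Minimal six-frame ORF scan (illustrative; not ORFFinder's implementation).
START = "ATG"                      # assumption: ATG is the only start codon
STOPS = {"TAA", "TAG", "TGA"}      # standard stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for each ORF on one strand.

    start is the index of the first base of the start codon;
    end is exclusive and includes the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None                   # look for the next ORF in this frame

def find_orfs(seq, min_codons=100):
    """Scan all six reading frames.

    Coordinates on the '-' strand are relative to the reverse complement.
    """
    seq = seq.upper()
    found = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_codons)]
    found += [("-", f, s, e)
              for f, s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return found
```

Raising or lowering min_codons changes how many ORFs are reported, which is a useful way to think about SQ4.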
In the simplest terms, these programs ask “what do known genes from the organism of interest have in common?” (the training problem) and then “do any substrings within the genome sequence share these common features?” (the unknown evaluation problem). The known genes may be a few genes identified the old-fashioned way years ago, or a set of the longest ORFs found in the genome sequence that seem highly unlikely to have arisen by chance. Either group, or a combination of the two, acts as a training data set from which the program determines the key characteristics of genes in the organism of interest. The program can then analyze a new sequence from that organism and compute the probability that a given ORF is a real gene.

Glimmer & GlimmerM http://www.tigr.org/softlab/glimmer/glimmer.html
GeneMark & GeneMark.hmm http://opal.biology.gatech.edu/GeneMark/
GENSCAN (for eukaryotes) http://genes.mit.edu/GENSCAN.html
other eukaryotic-specific programs http://igs-server.cnrs-mrs.fr/igs/banbury/programs.html

What characteristics can genes, each of which codes for a different protein, share just because they come from the same genome? Let’s look at one program, Glimmer, developed at TIGR, to get some ideas (Salzberg et al., Nucleic Acids Research 26:544-548).

Glimmer Training Problem Phase:

(1) Input a data set (previously identified genes or the longest ORFs found in the genome).

(2) Using the data set, calculate for each nucleotide (A, C, G, T) the frequency with which each nucleotide precedes it. For example, given a C, what fraction of the XC pairs within known genes are AC? CC? GC? TC?

SQ5. Given that the organism has a given base composition, could one not predict the frequencies of AC, CC, GC, and TC? What assumption would we have to make?

(3) Do the same calculations for each nucleotide, taking into account the two preceding nucleotides.

SQ6. Think about the genetic code. How does it relate to step (3)?
(4) Do the same calculations for each nucleotide, taking into account the three preceding nucleotides.

(5) Do the same calculations for each nucleotide, taking into account the four preceding nucleotides.

(6) Do the same calculations for each nucleotide, taking into account the five preceding nucleotides.

(7) Do the same calculations for each nucleotide, taking into account the six preceding nucleotides.

(8) Do the same calculations for each nucleotide, taking into account the seven preceding nucleotides.

(9) Do the same calculations for each nucleotide, taking into account the eight preceding nucleotides.

(10) Develop a scoring function that will assign a probability score for an unknown sequence, looking at 8 bases at a time and taking steps 1 base long, based on the data obtained in steps 2-9 above. However, only those frequencies based on a set minimum number of occurrences in the training set are used.

Glimmer Unknown Evaluation Problem Phase:

(1) Input an unknown string (could be a part or the entire genome sequence of the same organism used for the training set).

(2) For a potential ORF, calculate the score for that substring in all six reading frames. If the reading frame corresponding to the ORF itself has the highest score and the score is greater than a set minimum cut-off, then accept the ORF as a putative gene.

SQ7. Remind yourself, why are there 6 reading frames?

(3) If two ORFs in different reading frames both are accepted, but they overlap by more than a set cut-off, then compare the score for each of the two frames only in the region of overlap and keep only the higher-scoring gene.

Glimmer is a little more complicated than this, but hopefully you get the picture. We can talk more about it in class.
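The two phases above can be sketched in Python to make the bookkeeping concrete. This is a simplified illustration, not Glimmer's actual code, and it rests on several assumptions that are not in the handout. Counts are kept separately for the three codon positions so that a sequence can be scored in different frames. When a context is too rare, the sketch simply backs off to the longest context seen at least min_count times, whereas Glimmer combines context lengths in a more elaborate way. Add-one smoothing and log-probabilities are used for convenience. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def train(genes, max_order=8):
    """Training phase: count how often each base follows each context
    of length 0..max_order, separately for each codon position."""
    counts = defaultdict(lambda: defaultdict(int))
    for gene in genes:
        for i, base in enumerate(gene):
            for order in range(min(i, max_order) + 1):
                counts[(i % 3, gene[i - order:i])][base] += 1
    return counts

def base_log_prob(counts, pos, context, base, min_count):
    """Use the longest context observed at least min_count times."""
    for order in range(len(context), -1, -1):
        seen = counts.get((pos, context[len(context) - order:]))
        if seen and sum(seen.values()) >= min_count:
            # Add-one smoothing keeps unseen bases from scoring log(0).
            return math.log((seen.get(base, 0) + 1) / (sum(seen.values()) + 4))
    return math.log(0.25)  # no usable statistics: fall back to uniform

def score(counts, seq, offset=0, max_order=8, min_count=5):
    """Log-probability of seq, assuming seq[0] sits at codon position offset."""
    total = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - max_order):i]
        total += base_log_prob(counts, (i + offset) % 3, context, base, min_count)
    return total

def six_frame_scores(counts, orf, **kw):
    """Evaluation phase: score the candidate in its own frame ('+0')
    and in the five alternative frames."""
    rc = reverse_complement(orf)
    scores = {f"+{k}": score(counts, orf, k, **kw) for k in range(3)}
    scores.update({f"-{k}": score(counts, rc, k, **kw) for k in range(3)})
    return scores
```

Under these assumptions, a candidate ORF would be accepted when the "+0" entry is the largest of the six scores and also exceeds a chosen cut-off. Resolving overlaps, as in step (3) of the evaluation phase, would amount to calling score on just the overlapping region for each of the two competing ORFs.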
