Cracking the Second Genetic Code: Sequence Patterns in Noncoding DNA
Jeff Elhai
Center for the Study of Biological Complexity

Forty years ago, a series of experiments culminated in the elucidation of the genetic code, a means of translating the information within the DNA of genes into the information within protein. We now know that only 3% of human DNA is within genes. How do we understand the remaining 97%? The situation is somewhat more comfortable with bacterial genomes, where 70 to 80% of the DNA encodes protein, but the puzzle remains as to the function of the noncoding DNA.

A time-honored method of breaking codes is to catalog those symbols that appear much less frequently than expected and also those that appear much more frequently than expected. My colleagues and I have pursued this strategy, focusing on the genomes of cyanobacteria of the genus Nostoc, which are among the largest bacterial genomes known.

All possible tetrameric, pentameric, and hexameric sequences were counted within the genome of Nostoc PCC 7120 and compared with statistical expectation (Fig. 1). Some were profoundly underrepresented, and most of these turned out to be palindromes, double-stranded DNA sequences that read the same left-to-right on one strand as right-to-left on the other (e.g., GGATCC paired with CCTAGG). A large fraction of highly underrepresented hexamers are recognition sites for restriction enzymes encoded by Nostoc PCC 7120 or other Nostoc-like strains. Since restriction enzymes are used by bacteria as a defense mechanism against foreign DNA, this result suggests that these cyanobacteria have freely exchanged DNA in recent evolutionary time and calls into question the traditional definition of species as it applies to cyanobacteria.

[Figure 1 plot. x-axis: Bias (Rho); y-axis: Frequency of occurrence. Series: palindromic hexamers, nonpalindromic hexamers. Regions labeled: Underrepresented, Overrepresented.]

Figure 1.
Distribution of hexameric DNA sequences in the genome of Nostoc PCC 7120. Nonpalindromic (red, dashed) and palindromic (green, solid) sequences were evaluated as to their frequency relative to the frequencies of all included sequences, measured by the statistic rho. Thick lines represent sequences that are recognized by restriction enzymes found in some strain of Nostoc-like cyanobacteria.

Sequences that appear more frequently than statistically expected have also pointed to significant biological functions. The genome of Nostoc punctiforme was found to possess more than a thousand instances of tandem heptameric repeats and hundreds of dispersed repeats, typically 24 bp in length. These 24-bp repeats were usually flanked by tandem heptameric repeats. We have found many instances where the 24-bp repeats have evidently served as sites for intragenomic recombination (Fig. 2). Such rampant recombination would be expected to randomize gene order within the genome, and we have indeed observed this by comparing gene order between the genomes of different Nostoc strains. Repeated sequences are therefore likely to play an important role in the evolution of the genome.

heptamer repeats |← 24-bp repeat →| heptamer repeats
A ...TTAATCTAAAATCCAAAATGGTTCGACTGAGCGAAGCCGAAGTCCAAAATCCAAAATTAGGGGA...
B ...ATGCTGTGCGATGTCTACGATGGTCACTGAGCGCAGCCGTACCACTTCGTGGAAGCAAGCTACG...
C ...TCAATCTAAAATCCAAAATTGTTCGACTGAGCGTAGCCGTACCCCTCCGGGGAAGCAAGCTACG...

Figure 2. Recombination between dispersed instances of a 24-bp repeat in the genome of Nostoc punctiforme. Three highly separated regions of the genome are shown. Yellow and gray highlight a sequence found in the genome approximately 200 times. Magenta and purple highlight a 7-bp tandem repeat (yAAAATy) that frequently flanks the 24-bp repeat. Dark cyan highlights a nonrepetitive sequence sometimes found flanking the 24-bp repeat. Recombination between sequences A and B can explain the structure of sequence C.
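
The counting strategy behind Figure 1 and the flanking-repeat pattern in Figure 2 can be illustrated with a minimal Python sketch. It is not the analysis used for the poster. The text does not define the rho statistic, so the ratio below (observed count divided by the count expected if bases occurred independently at their genome-wide frequencies) is an assumption, and the real statistic may correct for shorter subsequences as well. All function names are hypothetical, and y in yAAAATy is read as the IUPAC code for a pyrimidine (C or T).

```python
import re
from collections import Counter
from itertools import product
from math import prod

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Sequence A as printed in Figure 2 (flanking ellipses dropped).
SEQ_A = "TTAATCTAAAATCCAAAATGGTTCGACTGAGCGAAGCCGAAGTCCAAAATCCAAAATTAGGGGA"

# yAAAATy with y = C or T; the lookahead lets overlapping copies be reported.
HEPTAMER = re.compile(r"(?=([CT]AAAAT[CT]))")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq: str) -> bool:
    """A DNA palindrome is identical to its reverse complement."""
    return seq == reverse_complement(seq)


def kmer_bias(genome: str, k: int) -> dict[str, float]:
    """Observed/expected ratio for every possible k-mer over ACGT.

    ASSUMPTION: the expectation uses single-base frequencies only,
    which may differ from the rho statistic used in the poster.
    """
    n_windows = len(genome) - k + 1
    base_counts = Counter(genome)
    observed = Counter(genome[i:i + k] for i in range(n_windows))
    bias = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        expected = n_windows * prod(base_counts[b] / len(genome) for b in kmer)
        if expected > 0:  # skip k-mers containing a base absent from the genome
            bias[kmer] = observed[kmer] / expected
    return bias


def find_heptamer_repeats(seq: str) -> list[int]:
    """Return 0-based start positions of every yAAAATy match."""
    return [m.start() for m in HEPTAMER.finditer(seq)]


if __name__ == "__main__":
    toy_genome = "ACGT" * 10  # made-up sequence, not Nostoc data
    bias = kmer_bias(toy_genome, 2)
    # List the least frequent dimers first, marking palindromes.
    for kmer, ratio in sorted(bias.items(), key=lambda kv: kv[1])[:5]:
        print(kmer, round(ratio, 2), "palindrome" if is_palindrome(kmer) else "")
    print(find_heptamer_repeats(SEQ_A))
```

Sorting the ratios in ascending order puts the most underrepresented k-mers first, which is where the palindromic restriction sites would be expected to appear in a real genome. Enumerating every possible k-mer, and not only those observed, matters here because a sequence that never occurs is the most underrepresented of all.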