* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download A Purine-Pyrimidine Classification Scheme of the Genetic Code
Survey
Document related concepts
Transcript
A Purine-Pyrimidine Classification Scheme of the Genetic Code Although containing the same information, our new classification scheme of the genetic code is simpler than the common representation as a threedimensional matrix: it contains just 32 instead of 64 fields. Moreover, it shows known patterns in the code more clearly than the common scheme. Above all, with the help of our new scheme we could identify new patterns never seen before. This gives rise to some speculations about the origin and early evolution of the genetic code. We hypothesize that coding started in a binary doublet manner and developed via a quaternary doublet code to our contemporary quaternary triplet code. Most interestingly, it may be possible to discover traces of the old binary coding in present-day genomes. Thomas Wilhelm, Svetlana Nikolajewa The genetic code specifies how the information contained in the nucleic acids DNA and RNA is translated into the correct sequence of amino acids building the highly specific proteins. Up to the three termination codons UGA, UA(G/A) (standard code), each nucleotide triplet stands for exactly one amino acid, the methionin codon AUG is also the start codon. The genetic code is comma-free and non-overlapping. It is usually represented as a three-dimensional matrix in which the four rows stand for the first base and the four columns for the second base. To show the third dimension (the third base) in the plane figure, each of the 16 boxes is again divided into four fields, giving together 64 entries (Fig.1). For a long time one assumed that the genetic code is universal for all life forms on earth. Today there are at least 16 slightly deviating different codes known (www.ncbi.nlm.nih.gov/Taxonomy/Utils/w printgc.cgi). However, it is generally believed that all these deviations are later descendants of the earlier standard code. Not surprisingly, non-standard codes are only found in small genomes, nearly all of them in mitochondria known to have by far the smallest genomes. Since the early days of the discovery of the genetic code non-random patterns have been searched in the code for providing information about its origin and early evolution. In 1965 Nirenberg finished his famous project of deciphering the code. At that time most scientists believed that the code is the result of pure chance and hence does not need any further evolutionary explanation. Crick [1] formulated the corresponding “frozen accident” hypothesis which was widely accepted for many years. However, today it is assumed that at least some hints of possible evolutionary scenarios can be found in our contemporary code. The top-down approach, which we are following here, analyzes patterns in the code and tries to infer appropriate chemical and selective forces. The bottom-up approach, on the other hand, is rooted in biochemistry and aims at constructing plausible scenarios for the origin of coding. It has been appreciated for a long time that the genetic code assigns similar amino acids to similar codons. Two different rationales have been presented: first, mutation and translation error minimization [2], and second, similar amino acids tend to directly interact with similar RNA sequences [3]. It was stated that “the canonical code is at or very close to a global optimum for error minimization” [4]. It has also been proposed that instead of the actual codons, some of their derivatives, such as the anticodons or codon-anticodon duplexes were the original amino acid binding motifs. It is also possible that the original amino acid recognition took place at the tRNA acceptor stem. Szathmary [5] proposed that amino acid-RNA allocation took place even BIOforum Europe 06/2004, pp 46–49, GIT VERLAG GmbH & Co. KG, Darmstadt, www.gitverlag.com/go/bioint Fig. 1: The common representation of the standard genetic code (mRNA triplets in the mRNA reading direction (5'_3')). Shaded regions show codon families. before the appearance of tRNA. He also gave a possible evolutionary scenario for the development of an anticodon hairpin to a longer structure with an operational code at the acceptor stem. However, the first codon position seems to be correlated with amino acid biosynthetic pathways and to their evolution as evaluated by synthetic “primordial soup” experiments. The second position is correlated with the hydropathic properties of the amino acids, and the degeneracy of the third position could be related to the molecular weight or size of the amino acids [6]. Lagerkvist [7] observed that codon families (the amino acid of a codon family is uniquely determined by the first two nucleotides of a codon) have a much higher probability to appear in the left part of the common illustration scheme (cf. Fig. 1). He also found that “strong” codons (the first two nucleotides in the codon are G and/or C) always represent codon families, while “weak” codons (A and/or U as the first two nucleotides) never do so. “Mixed” codons in the right part of the scheme never represent codon families, whereas mixed codons in the left part always stand for a codon family. The New Classification Scheme of the Genetic Code Most amino acid properties show no clear pattern in the common scheme of the genetic code. Recently we proposed a new classification scheme [8 and www.imb-jena.de/~sweta/genetic_code], based on a binary purine(1)-pyrimidine(0) coding (Fig. 2). It shows known regularities more clearly than the common scheme and it even highlights some new patterns. There are three possible variants of a binary coding scheme for the genetic code: One could group the bases (i) according to base-pairs (A,U = 1, G,C = 0), (ii) according to keto- and aminobases (G,U = 1, A,C = 0), and (iii) according to purines and pyrimidines (A,G = 1, C,U = 0). In such a simplified code eight different binary triplets exist: 000, 001, ..., 111. Each of these binary triplets represents eight different codons, e.g. in our coding scheme 000 stands for CCC, CCU, ..., UUU. The purine-pyrimidine coding is superior to the other two variants, because it is the only one that allows the genetic code to be represented using just four columns (Fig. 2). The reason for this vast simplification in our scheme is that for the third position in each triplet it only matters if it is a purine or a pyrimidine. Given the primary purine-pyrimidine coding, we have again two different possibilities to sort the first two bases per row: one can use either of the remaining two binary codings, according to basepairs or according to keto- and aminobases as a sort criterion inside the rows. We have chosen the base-pairs for sorting inside rows, because only this reveals the following regularities of the genetic code: (i) All codon families group together, i.e. they are not scattered allover the table. (ii) More importantly, the codon strength classification directly corresponds to the columns in our scheme (cf. Fig. 2). Thus, in the first column the first two bases complementary pair with 6 hydrogen bonds, in the sec- Fig. 2: The purine(1)-pyrimidine(0) classification scheme of the standard genetic code. The third base is given in parenthesis. If there are differences between the standard code and any other code, the number of deviations from the standard code is indicated. For instance, in the UG(G/A) field, 0/9 indicates that UGG encodes for Trp in all codes, but UGA is not the termination codon in 9 of the 16 non-standard codes. In some bacteria the 21st amino acid, selenocysteine, can also be encoded by UGA. Shaded regions show codon families. The point in the center indicates the perfect point symmetry corresponding to Halitsky’s family – nonfamily symmetry operation [9]. The thick horizontal line marks the symmetry axis for codon-anticodon symmetry. ond and third column with 5, and in the fourth column with just 4 hydrogen bonds. For all these reasons our classification scheme of the genetic code is superior to all similiar ones. Our new scheme shows some fascinating regularities. We can, for instance, better understand the number of different tRNAs in some organisms. In the simplest case one should expect one tRNA per coding field in our scheme. Exactly this happens in the case of vertebrate mitochondria. It is known that animal mitochondria contain exactly 22 different tRNAs. In vertebrate mitochondria UA1 and AG1 are stop codons. Thus there are exactly 22 fields for amino acids left: the 8 codon families plus 14 remaining fields. Interestingly, the 22 tRNAs in animal mitochondria correspond 1:1 to these 22 fields. The amino acids of the nine "strong groups" (mutually evolutionary conserved, based on the alignment score matrix PAM250, cf. http://bioinfolab.unl.edu/emlab/documents/clustalx_doc/clustalw.txt) very closely group together in our scheme, more closely than in the standard scheme. That means neighboring amino acids in our schme have a higher probability to be aligned to each other in genome comparisons than neighboring amino acids in the standard scheme. Our new scheme also led us to detect hitherto unknown regularities of amino acid properties in the genetic code. Jungck [10] collected 15 different measures of amino acid properties. For all of these we arranged a table with 8 rows and 4 columns corresponding to our scheme. Amazingly, the column sums of nearly all measures are perfectly correlated to the corresponding codon-anticodon binding strength. For instance, the first column harbours more polar amino acids, the last column less polar ones and the mixed codon fields are in between. Similarly, the bulkiness and the specific volume increases continuously from the first to the last column. Evolution of the Genetic Code The observed regularities inspire to some speculations about the early evolution of the genetic code. Thus the strong correlation between amino acid properties and codon strength implies that the first two positions together (and not the second position alone as speculated by others) must have been important for the amino acid – codon assignment in the early evolution of the code. It therefore also could be that just the first two nucleotides of a codon (or anticodon) show specific binding affinity to the corre- sponding amino acid (maybe important in the process of the code formation). Nowadays one assumes that "the code probably underwent a process of expansion from relatively few amino acids to the modern complement of 20" [11]. Can we find some hints in our scheme indicating coding of less than 20 amino acids in ancient times? Indeed, there is a high redundancy for each second row. This gives rise to the speculation that in the early days of code evolution just the first two bases of the triplet were coding. The reading frame, however, arguably always comprised three letters. In any way, a quaternary doublet can encode at most 16 amino acids, or 15 plus one termination codon (some bacteria exist that do not possess any stop codon). In this context it is interesting to note that Asn, Gln, Met, Trp, and Tyr seem to be newer amino acids. Since the discovery of the genetic code it is speculated that the first genetic material contained only a single base-pairing unit [1]. Recently, for the first time a ribozyme was found composed of only one purine and one pyrimidine [12]. Assuming a binary doublet code, it is tempting to speculate which four amino acids, one per two consecutive rows, were the first encoded ones. In the first two rows Ser seems to be the oldest amino acid, and in the third and fourth row Ala. The 01-rows obviously contain no really old amino acid while the 11rows contain more than one: Gly, Asp, Glu. However, Gly is biochemically built from Ser, so Ser can be assumed as prior. It could be that in the beginning of nucleic acid – amino acid assignment Asp and Glu competed for the 11-doublet. Of course, code transfer from one amino acid to another might also have occurred. Conclusions We have found a concise scheme for the genetic code that is superior to similar schemes for different reasons. It shows nice patterns and symmetries and even so far unknown regularities in the code. We are now studying the fascinating question whether we still can find traces of doublet coding or even binary coding in contemporary genomes. References are available from the authors. Dr. Thomas Wilhelm Svetlana Nikolajewa Theoretical Systems Biology Institute of Molecular Biotechnology Beutenbergstr. 11 07745 Jena, Germany [email protected] [email protected]