Bioscience Reports 1, 3-18 (1981)
Printed in Great Britain

Determination of nucleotide sequences in DNA

Nobel lecture, 8 December 1980

Frederick SANGER

Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, U.K.

In spite of the important role played by DNA sequences in living matter, it is only relatively recently that general methods for their determination have been developed. This is mainly because of the very large size of DNA molecules, the smallest being those of the simple bacteriophages such as φX174 (which contains about 5000 base pairs). It was therefore difficult to develop methods with such complicated systems. There are however some relatively small RNA molecules - notably the transfer RNAs of about 75 nucleotides - and these were used for the early studies on nucleic acid sequences (1).

Following my work on amino acid sequences in proteins (2) I turned my attention to RNA and, with G. G. Brownlee and B. G. Barrell, developed a relatively rapid small-scale method for the fractionation of 32P-labelled oligonucleotides (3). This became the basis for most subsequent studies of RNA sequences. The general approach used in these studies, and in those on proteins, depended on the principle of partial degradation. The large molecules were broken down, usually by suitable enzymes, to give smaller products which were then separated from each other and their sequences determined. When sufficient results had been obtained they were fitted together by a process of deduction to give the complete sequence. This approach was necessarily rather slow and tedious, often involving successive digestions and fractionations, and it was not easy to apply it to the larger DNA molecules.
When we first studied DNA, some significant sequences of about 50 nucleotides in length were obtained with this method (4,5), but it seemed that to be able to sequence genetic material a new approach was desirable and we turned our attention to the use of copying procedures.

Copying Procedures

In the RNA field these procedures had been pioneered by C. Weissmann and his colleagues (6) in their studies on the RNA sequence of the bacteriophage Qβ. Qβ contains a replicase that will synthesize a complementary copy of the single-stranded RNA chain, starting from its 3' end. These workers devised elegant procedures involving pulse-labelling with radioactively labelled nucleotides, from which sequences could be deduced.

For DNA sequences we have used the enzyme DNA polymerase, which copies single-stranded DNA as shown in Fig. 1. The enzyme requires a primer, which is a single-stranded oligonucleotide having a sequence that is complementary to, and therefore able to hybridize with, a region on the DNA being sequenced (the template). Mononucleotide residues are added sequentially to the 3' end of the primer from the corresponding deoxynucleoside triphosphates, making a complementary copy of the template DNA. By using triphosphates containing 32P in the α position, the newly synthesized DNA can be labelled.

[Fig. 1. Specificity requirements for DNA polymerase.]

Abbreviations: The abbreviations C, A, and G are used to describe both the ribonucleotides and the deoxyribonucleotides, according to context.

© Nobel Foundation 1981
In the early experiments synthetic oligonucleotides were used as primers, but after the discovery of restriction enzymes it was more convenient to use fragments resulting from their action as they were much more easily obtained.

The copying procedure was used initially to prepare a short specific region of labelled DNA which could then be subjected to partial digestion procedures. One of the difficulties of sequencing DNA was to find specific methods for breaking it down into small fragments. No suitable enzymes were known that would recognize only one nucleotide. However, Berg, Fancher, and Chamberlin (7) had shown earlier that under certain conditions it was possible to incorporate ribonucleotides, in place of the normal deoxyribonucleotides, into DNA chains with DNA polymerase. Thus, for instance, if copying were carried out using ribo CTP and the other three deoxynucleoside triphosphates, a chain could be built up in which the C residues were in the ribo form. Bonds involving ribonucleotides could be broken by alkali under conditions where those involving the deoxynucleotides were not, so that a specific splitting at C residues could be obtained. Using this method we were able to extend our sequencing studies to some extent (8). However, extensive fractionations and analyses were still required.

The 'Plus and Minus' Method

In the course of these experiments we needed to prepare DNA copies of high specific radioactivity, and in order to do this the highly labelled substrates had to be present in low concentrations.
Thus if [α-32P]-dATP was used for labelling, its concentration was much lower than that of the other three triphosphates, and frequently when we analysed the newly synthesized DNA chains we found that they terminated at a position immediately before that at which an A should have been incorporated. Consequently a mixture of products was produced all having the same 5' end (the 5' end of the primer) and terminating at the 3' end at the position of the A residues. If these products could be fractionated on a system that separated only on the basis of the chain length, the pattern of their distribution on fractionation would be proportional to the distribution of the A's along the DNA chain. And this, together with the distribution of the other three mononucleotides, is the information required for sequence determination. Initial experiments carried out with J. E. Donelson suggested that this approach could be the basis for a more rapid method, and it was found that good fractionations according to size could be obtained by ionophoresis on acrylamide gels.

The method described above met with only limited success, but we were able to develop two modified techniques that depended on the same general principle and these provided a much more rapid and simpler method of DNA sequence determination than anything we had used before (9). This, which is known as the 'plus and minus' technique, was used to determine the almost complete sequence of the DNA of bacteriophage φX174, which contains 5386 nucleotides (10).

The 'Dideoxy' Method

More recently we have developed another, similar method which uses specific chain-terminating analogues of the normal deoxynucleoside triphosphates (11). This method is both quicker and more accurate than the plus and minus technique.
It was used to complete the sequence of φX174 (12) and to determine the sequence of a related bacteriophage, G4 (13), and has now been applied to mammalian mitochondrial DNA.

The analogues most widely used are the dideoxynucleoside triphosphates (Fig. 2). They are the same as the normal deoxynucleoside triphosphates but lack the 3' hydroxyl group. They can be incorporated into a growing DNA chain by DNA polymerase but act as terminators because, once they are incorporated, the chain contains no 3' hydroxyl group and so no other nucleotide can be added.

[Fig. 2. Diagram showing chain termination with dideoxythymidine triphosphate (ddTTP). The top line shows the DNA polymerase-catalysed reaction of the normal deoxynucleoside triphosphate (TTP) with the 3' terminal nucleotide of the primer; the bottom line, the corresponding reaction with ddTTP.]

The principle of the method is summarised in Fig. 3. Primer and template are denatured to separate the two strands of the primer, which is usually a restriction enzyme fragment, and then annealed to form the primer-template complex. The mixture is then divided into four samples. One (the T sample) is incubated with DNA polymerase in the presence of a mixture of ddTTP (dideoxythymidine triphosphate) and a low concentration of TTP, together with the other three deoxynucleoside triphosphates (one of which is labelled with 32P) at normal concentration. As the DNA chains are built up on the 3' end of the primer the position of the T's will be filled, in most cases by the normal substrate T and extended further, but occasionally by ddT and terminated. Thus at the end of incubation there remains a mixture of chains terminating with T at their 3' end but all having the same 5' end (the 5' end of the primer). Similar incubations are carried out in the presence of each of the other three dideoxy derivatives, giving mixtures terminating at the positions of C, A, and G respectively, and the four mixtures are fractionated in parallel by electrophoresis on acrylamide gel under denaturing conditions. This system separates the chains according to size, the small ones moving quickly and the large ones slowly. As all the chains in the T mixture end at T, the relative positions of the T's in the chain will define the relative sizes of the chains, and therefore their relative positions on the gel after fractionation. The actual sequence can then simply be read off from an autoradiograph of the gel (Fig. 4).

[Fig. 3. Principle of the chain-terminating method. The template (3' TAGCAACT 5') is annealed to the primer and copied by DNA polymerase in four parallel incubations, each containing the four normal triphosphates and one dideoxy analogue (ddTTP, ddCTP, ddGTP, or ddATP); the terminated chains from the T, C, G, and A incubations are then separated by acrylamide gel electrophoresis.]

The method is comparatively rapid and accurate, and sequences of up to about 300 nucleotides from the 3' end of the primer can usually be determined. Considerably longer sequences have been read off, but these are usually less reliable.

One problem with the method is that it requires single-stranded DNA as template. This is the natural form of the DNA in the bacteriophages φX174 and G4, but most DNA is double-stranded, and it is frequently difficult to separate the two strands. One way of overcoming this was devised by A. J. H. Smith (14).
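The way a sequence is read off the four lanes can be illustrated with a small Python sketch. It is only an idealised model under stated assumptions: every possible terminated chain is assumed to be present, chain length is counted from the 3' end of the primer, and the function names (`synthesized_strand`, `termination_lanes`, `read_gel`) are hypothetical names chosen for this illustration. The template is the one drawn in Fig. 3.

```python
# Idealised model of the chain-terminating method (illustrative names only).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesized_strand(template_3_to_5):
    """Return the newly made strand (5'->3') copied from a template read 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def termination_lanes(new_strand):
    """For each dideoxy incubation, list the chain lengths that end in that base."""
    lanes = {base: [] for base in "TCGA"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the bands from shortest to longest, as on an autoradiograph."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "TAGCAACT"  # 3' -> 5', as drawn in Fig. 3
lanes = termination_lanes(synthesized_strand(template))
print(lanes)
print(read_gel(lanes))
```

Each lane holds only chain lengths, yet sorting all bands by length across the four lanes recovers the order of the bases, which is why a fractionation that separates purely by size is sufficient.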
If the double-stranded linear DNA is treated with exonuclease III (a double-strand-specific 3' exonuclease), each chain is degraded from its 3' end, as shown in Fig. 5, giving rise to a structure that is largely single-stranded and can be used as template for DNA polymerase with suitable small primers. This method is particularly suitable for use with fragments cloned in plasmid vectors and has been used extensively in the work on human mitochondrial DNA.

[Fig. 4. Autoradiograph of a DNA sequencing gel. The origin is at the top, and migration of the DNA chains, according to size, is downwards. The gel on the left has been run for 2.5 h and that on the right for 5 h with the same polymerization mixtures.]

[Fig. 5. Degradation of double-stranded DNA with exonuclease III.]

Cloning in Single-stranded Bacteriophage

Another method of preparing suitable template DNA that is being more widely used is to clone fragments in a single-stranded bacteriophage vector (15-17). This approach is summarised in Fig. 6. Various vectors have been described. We have used a derivative of bacteriophage M13 developed by Gronenborn and Messing (16) which contains an insert of the β-galactosidase gene with an EcoRI restriction enzyme site in it. The presence of β-galactosidase in a plaque can be readily detected by using a suitable colour-forming substrate (X-gal). The presence of an insert in the EcoRI site destroys the β-galactosidase gene, giving rise to a colourless plaque.

[Fig. 6. Diagram illustrating the cloning of DNA fragments in the single-stranded bacteriophage vector M13mp2 (16) and sequencing the insert with a flanking primer. The vector carries a restriction enzyme site within the β-galactosidase gene (blue plaques); it is cut with the restriction enzyme and the DNA fragment is inserted.]

Besides being a simple and general method of preparing single-stranded DNA this approach has other advantages. One is that it is possible to use the same primer on all clones. Heidecker et al. (18) prepared a 96-nucleotide-long restriction fragment derived from a position in the M13 vector flanking the EcoRI site (see Fig. 6). This can be used to prime into, and thus determine, a sequence of about 200 nucleotides in the inserted DNA. Smaller synthetic primers have now been prepared (19,20) which allow longer sequences to be determined. The approach that we have used is to prepare clones at random from restriction enzyme digests and determine the sequence with the flanking primer. Computer programs (21) are then used to store, overlap, and arrange the data.

Another important advantage of the cloning technique is that it is a very efficient and rapid method of fractionating fragments of DNA. In all sequencing techniques, both for proteins and nucleic acids, fractionation has been an important step, and major progress has usually been dependent on the development of new fractionation methods. With the new rapid methods for DNA sequencing, fractionation is still important, and as the sequencing procedure itself is becoming more rapid more of the work has involved fractionation of the restriction enzyme fragments by electrophoresis on acrylamide. This becomes increasingly difficult as larger DNA molecules are studied and may involve several successive fractionations before pure fragments are obtained. In the new method these fractionations are replaced by a cloning procedure. The mixture is spread on an agar plate and grown.
Each clone represents the progeny of a single molecule and is therefore pure, irrespective of how complex the original mixture was. It is particularly suitable for studying large DNAs. In fact there is no theoretical limit to the size of DNA that could be sequenced by this method.

We have applied the method to fragments from mitochondrial DNA (22,23) and to bacteriophage λ DNA. Initially, new data can be accumulated very quickly (under ideal conditions at about 500-1000 nucleotides a day). However, at later stages much of the data produced will be in regions that have already been sequenced, and progress then appears to be much slower. Nevertheless we find that most new clones studied give some useful data, either for correcting or for confirming old sequences. Thus in the work with bacteriophage λ DNA we have about 90% of the molecule identified in sequences and most of the new clones we study contribute some new information. In most studies on DNA one is concerned with identifying the reading frames for protein genes, and to do this the sequence must be correct. Errors can readily occur in such extensive sequences and confirmation is always necessary. We usually consider it necessary to determine the sequence of each region on both strands of the DNA.
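The slowing of progress at later stages can be illustrated with a small simulation. This is a rough sketch under assumptions that are not taken from the lecture: clones are assumed to start at uniformly random positions on a linear molecule, each yielding a fixed-length read, and the molecule length (50,000) and the function name `simulate_coverage` are purely illustrative. The read length of 200 follows the figure quoted above for the flanking primer.

```python
import random


def simulate_coverage(genome_len, read_len, n_clones, seed=0):
    """Return the fraction of the molecule covered after each successive clone.

    Assumption: every clone gives one read of fixed length starting at a
    uniformly random position (an idealisation of random cloning).
    """
    rng = random.Random(seed)
    covered = [False] * genome_len
    n_covered = 0
    fractions = []
    for _ in range(n_clones):
        start = rng.randrange(genome_len - read_len + 1)
        for i in range(start, start + read_len):
            if not covered[i]:
                covered[i] = True
                n_covered += 1
        fractions.append(n_covered / genome_len)
    return fractions


fractions = simulate_coverage(genome_len=50_000, read_len=200, n_clones=1000)
for n in (100, 250, 500, 1000):
    print(f"after {n:4d} clones: {fractions[n - 1]:.1%} covered")
```

In a run of this kind the first few hundred clones add mostly new sequence, while later clones fall largely in regions already read, which is the behaviour described in the text and one reason a more specific method is attractive for the final gaps.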
Although in theory it would be possible to complete a sequence determination solely by the random approach, it is probably better to use a more specific method to determine the final remaining nucleotides in a sequencing study. Various methods are possible (22,24), but all are slow compared with the random cloning approach.

Bacteriophage φX174 DNA

The first DNA to be completely sequenced by the copying procedures was from bacteriophage φX174 (10,12) - a single-stranded circular DNA, 5386 nucleotides long, which codes for ten genes. The most unexpected finding from this work was the presence of 'overlapping' genes. Previous genetic studies had suggested that genes were arranged in a linear order along the DNA chains, each gene being encoded by a unique region of the DNA. The sequencing studies indicated however that there were regions of the φX DNA that were coding for two genes. This is made possible by the nature of the genetic code. Since a sequence of three nucleotides (a codon) codes for one amino acid, each region of DNA can theoretically code for three different amino acid sequences, depending on where translation starts. This is illustrated in Fig. 7. The reading frame or phase in which translation takes place is defined by the position of the initiating ATG codon, following which nucleotides are simply read off three at a time. In φX there is an initiating ATG within the gene coding for the D protein, but in a different reading frame. Consequently this initiates an entirely different sequence of amino acids, which is that of the E protein. Fig. 8 shows the genetic map of φX. The E gene is completely contained within the D gene and the B gene within the A.
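The point that one stretch of DNA can be read in three phases can be sketched in Python as follows. The example sequence is invented for illustration and is not taken from φX174; the table is the standard genetic code written in the usual compact TCAG order, and `translate` is a hypothetical helper name.

```python
# Standard genetic code in TCAG order; "*" marks a termination codon.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate dna three bases at a time, starting at offset `frame` (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "ATGGCACGT"  # invented example sequence
for frame in range(3):
    print(frame, translate(dna, frame))
```

The three phases give three unrelated amino acid sequences (one-letter codes) from the same nucleotides, which is the property that allows the E gene to lie within the D gene.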
Further studies (25) on the related bacteriophage, G4, revealed the presence of a previously unidentified gene, which was called K. This overlaps both the A and C genes, and there is a sequence of four nucleotides that codes for part of all three proteins, A, C, and K, using all of its three possible reading frames. It is uncertain whether overlapping genes are a general phenomenon or whether they are confined to viruses, whose survival may depend on their rate of replication and therefore on the size of the DNA: with overlapping genes more genetic information can be concentrated in a given-sized DNA.

[Fig. 7. Diagram illustrating how one DNA sequence can code for three different amino acid sequences. The dots indicate the positions of triplet codons coding for the amino acids.]

Further details of the sequence of bacteriophage φX174 DNA have been published elsewhere (10,12).

[Fig. 8. Gene map of φX174 DNA.]

Mammalian Mitochondrial DNA

Mitochondria contain a small double-stranded DNA (mtDNA) which codes for two ribosomal RNAs (rRNAs), 22-23 transfer RNAs (tRNAs), and about 10-13 proteins which appear to be components of the inner mitochondrial membrane and are somewhat hydrophobic. Other proteins of the mitochondria are encoded by the nucleus of the cell and specifically transported into the mitochondria. Using the above methods we have determined the nucleotide sequence of human mtDNA (23) and almost the complete sequence of bovine mtDNA.
The sequence revealed a number of unexpected features which indicated that the transcription and translation machinery of mitochondria is rather different from that of other biological systems.

The genetic code in mitochondria

Hitherto it has been believed that the genetic code was universal. No differences were found in the Escherichia coli, yeast, or mammalian systems that had been studied. Our initial sequence studies were on human mtDNA. No amino acid sequences of the proteins that were encoded by human mtDNA were known. However, Steffens and Buse (26) had determined the sequence of subunit II of cytochrome oxidase (CO II) from bovine mitochondria, and Barrell, Bankier, and Drouin (27) found that a region of the human mtDNA that they were studying had a sequence that would code for a protein homologous to this amino acid sequence - indicating that it most probably was the gene for the human CO II. Surprisingly the DNA sequence contained TGA triplets in the reading frame of the homologous protein. According to the normal genetic code, TGA is a termination codon and if it occurs in the reading frame of a protein the polypeptide chain is terminated at that position. It was noted that in the positions where TGA occurred in the human mtDNA sequence, tryptophan was found in the bovine protein sequence. The only possible conclusion seemed to be that in mitochondria TGA was not a termination codon but was coding for tryptophan. It was similarly concluded that ATA, which normally codes for isoleucine, was coding for methionine.
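The effect of these two reassignments can be shown with a short Python sketch. It assumes only the two changes just described (TGA read as tryptophan, ATA read as methionine); the test sequence is invented, and the names `STANDARD`, `MITOCHONDRIAL`, and `translate` are illustrative.

```python
# Standard genetic code in TCAG order; "*" marks a termination codon.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Only the two reassignments described in the text are applied here.
MITOCHONDRIAL = dict(STANDARD, TGA="W", ATA="M")


def translate(dna, table):
    """Translate dna in phase 0 with the given codon table, keeping '*' for stops."""
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


dna = "ATATGAGGC"  # invented example: ATA TGA GGC
print(translate(dna, STANDARD))       # I*G
print(translate(dna, MITOCHONDRIAL))  # MWG
```

Read with the normal code the frame is interrupted at TGA, whereas with the modified table it continues through tryptophan, mirroring the comparison between the human DNA sequence and the bovine protein.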
As these studies were based on a comparison of a human DNA with a bovine protein, the possibility that the differences were due to some species variation, although unlikely, could not be completely excluded. For a conclusive determination of the mitochondrial code it was necessary to compare the DNA sequence of a gene with the amino acid sequence of the protein it was coding for. This was done by Young and Anderson (28), who isolated bovine mtDNA, determined the sequence of its CO II gene, and confirmed the above differences.

Fig. 9 shows the human and bovine mitochondrial genetic code and the frequency of use of the different codons in human mitochondria. All codons are used with the exception of UAA and UAG, which are terminators, and AGA and AGG, which normally code for arginine. This suggests that AGA and AGG are probably also termination codons in mitochondria. Further support for this is that no tRNA recognising the codons has been found (see below) and that some of the unidentified reading frames found in the DNA sequence appear to end with these codons.

[Fig. 9. The human and bovine mitochondrial genetic code, showing the coding properties of the tRNAs (boxed codons), the frequency of use of the different codons in human mitochondria, and the total number of codons used in the whole genome shown in Fig. 10. (One methionine tRNA has been detected, but as there is some uncertainty about the number present and their coding properties, these codons are not boxed.)]

In parallel with our studies on mammalian mtDNA, Tzagoloff and his colleagues (29,30) were studying yeast mtDNA. They also found changes in the genetic code, but surprisingly they are not all the same as those found in mammalian mitochondria. These differences are summarised in Table 1.

Table 1. Coding changes in mitochondria

                       Amino acid coded
Codon       Normal   Mammalian mitochondria   Yeast mitochondria
TGA         Term     Trp                      Trp
AUA         Ile      Met                      Ile
CTN         Leu      Leu                      Thr
AGA, AGG    Arg      Term ?                   Arg
CGN         Arg      Arg                      Term ?

Transfer RNAs

Transfer RNAs have a characteristic base-pairing structure which can be drawn in the form of the 'cloverleaf' model. By examining the DNA sequence for cloverleaf structures and using a computer program (31), it was possible to identify genes coding for the mt-tRNAs. Besides the cloverleaf structure, normal cytoplasmic tRNAs have a number of so-called 'invariable' features which are believed to be important to their biological function. Most of the mammalian mt-tRNAs are anomalous in that some of these invariable features are missing.
The most bizarre is one of the serine tRNAs in which a complete loop of the cloverleaf structure is missing (32,33). Nevertheless it functions as a tRNA.

In normal cytoplasmic systems at least 32 tRNAs are required to code for all the amino acids. This is related to the 'wobble' effect. Codon-anticodon relationships in the first and second positions of the codons are defined by the normal base-pairing rules, but in the third position G can pair with U. The result of this is that one tRNA can recognise two codons. There are many cases in the genetic code where all four codons starting with the same two nucleotides code for the same amino acid. These are known as 'family boxes'. The situation for the alanine family box is shown in Table 2, indicating that with the normal wobble system two tRNAs are required to code for the four alanine codons.

Table 2. Normal coding properties of alanine tRNAs

Codon   Anticodon (wobble)   Anticodon (mitochondria)
GCU     GGC                  UGC
GCC     GGC                  UGC
GCA     UGC                  UGC
GCG     UGC                  UGC

Note that the first position of the codon pairs with the third position of the anticodon and vice versa, e.g.
5' GCU 3' (codon)
3' CGG 5' (anticodon)

Only 22 tRNA genes could be found in mammalian mtDNA, and for all the family boxes there was only one, which had a T in the position corresponding to the third position of the codon (34).
It seems very unlikely that none of the other predicted tRNAs would have been detected, and the most feasible explanation is that in mitochondria one tRNA can recognise all four codons in a family box and that a U in the first position of the anticodon can pair with U, C, A, or G in the third position of the codon. Clearly, in boxes in which two of the codons code for one amino acid and two for a different one, there must be two different tRNAs and the wobble effect still applies. Such tRNAs are found, as expected, in the mitochondrial genes. The coding properties of the mt-tRNAs are shown in Fig. 9. Similar conclusions have been reached by Heckman et al. (35) and by Bonitz et al. (36), working respectively on Neurospora and yeast mitochondria.

Distribution of protein genes

Mitochondrial DNA was known to code for three of the subunits of cytochrome oxidase, probably three subunits of the ATPase complex, cytochrome b, and a number of other unidentified proteins. In order to identify the protein-coding genes, the DNA was searched for reading frames, i.e. long stretches of DNA containing no termination codons in one of the phases and thus being capable of coding for long polypeptide chains. Such reading frames should start with an initiation codon, which in normal systems is nearly always ATG, and end with a termination codon. Fig. 10 summarises the distribution of the reading frames on the DNA, and these are believed to be the genes coding for the proteins.
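A search of this kind can be sketched in Python as below. This is a minimal illustration and not the program used in the work described: it scans the three phases of one strand only, uses the normal initiation and termination codons by default, and the function name `find_reading_frames`, its parameters, and the test sequence are all hypothetical.

```python
def find_reading_frames(dna, min_codons=3,
                        starts=("ATG",), stops=("TAA", "TAG", "TGA")):
    """Return (start, end) index pairs of open reading frames on one strand.

    A frame begins at the first initiation codon met in a given phase and
    ends with the next in-phase termination codon; `end` is exclusive and
    includes the terminator. Frames with fewer than `min_codons` coding
    codons are ignored.
    """
    found = []
    for phase in range(3):
        start = None
        for i in range(phase, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found


dna = "CCATGGCTAAATTTTAGGC"  # invented example sequence
for start, end in find_reading_frames(dna):
    print(start, end, dna[start:end])
```

Because the initiation and termination codons are passed as parameters, the same sketch could be run with a different set of codons where the genetic code differs from the normal one.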
The gene for CO II was identified from the amino acid sequence as described above, for subunit I of cytochrome oxidase from amino acid sequence studies on the bovine protein by J. E. Walker (personal communication), and CO III, cytochrome b, and, probably, ATPase 6 were identified by comparison with the DNA sequences of the corresponding genes in yeast mitochondria. Tzagoloff and his colleagues were able to identify these genes in yeast by genetic methods. It has not yet been possible to assign proteins to the other reading frames.

One unusual feature of the mtDNA is that it has a very compact structure. The reading frames coding for the proteins and the rRNA genes appear to be flanked by the tRNA genes with no, or very few, intervening nucleotides. This suggests a relatively simple model for transcription, in which a large RNA is copied from the DNA and the tRNAs are cut out by a processing enzyme, and this same processing leads to the production of the rRNAs and the messenger RNAs (mRNAs), most of which will be monocistronic. Strong support for this model comes from the work of Attardi (37,38), who has identified the RNA sequence at the 5' and 3' ends of the mRNAs, thus locating them on the DNA sequence. One consequence of this arrangement is that the initiation codon is at, or very near, the 5' end of the mRNAs.
This suggests that there must be a different mechanism of initiation from that found in other systems. In bacteria there is usually a ribosomal binding site before the initiating ATG codon, whereas in eucaryotic systems the 'cap' structure on the 5' end of the mRNA appears to have a similar function and the first ATG following the cap acts as initiator. It seems that mitochondria may have a more simple, and perhaps more primitive, system with the translation machinery recognising simply the 5' end of the mRNA. Another unique feature of mitochondria is that ATA and possibly ATT can act as initiator codons as well as ATG.

Fig. 10. Gene map of human mtDNA deduced from the DNA sequence. Boxed regions are the predicted reading frames for the proteins. URF = unidentified reading frame. tRNA genes are denoted by the one-letter amino acid code and are either L-strand coded or H-strand coded. Numbers above the genes show the scale in nucleotides and those below, the predicted number between genes. *Indicates that termination codons are created by polyadenylation of the mRNA.
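As a small illustration of the initiation rule suggested in the text, the following Python sketch checks whether an mRNA carries one of the mitochondrial initiator codons (ATG, ATA or possibly ATT) at, or very near, its 5' end. The function name and the `max_offset` window are assumptions introduced here; the text says only "at, or very near", without giving a figure.

```python
MITO_INITIATORS = ("ATG", "ATA", "ATT")   # ATT is described only as "possibly" used


def initiator_offset(mrna, max_offset=3, initiators=MITO_INITIATORS):
    """Return the offset of the first initiator codon found within
    max_offset nucleotides of the 5' end, or None if there is none.

    Hypothetical helper; max_offset is an illustrative threshold.
    """
    mrna = mrna.upper().replace("U", "T")     # accept RNA or DNA notation
    for offset in range(max_offset + 1):
        if mrna[offset:offset + 3] in initiators:
            return offset
    return None
```

For example, `initiator_offset("AUGACC")` returns 0 and `initiator_offset("CAUAGGC")` returns 1, while a sequence with no initiator in the first few positions gives `None`.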
On the basis of the above model, some of the mRNAs will not contain termination codons at their 3' ends after the tRNAs are cut out. However, they have T or TA at the 3' end. The mRNAs are generally found polyadenylated at their 3' ends, and this process will necessarily give rise to the codon TAA to terminate those protein reading frames.

The compact structure of the mammalian mitochondrial genome is in marked contrast to that of yeast, which is about five times as large and yet codes for only about the same number of proteins and RNAs. In yeast the genes are separated by long AT-rich stretches of DNA with no obvious biological function. There are also insertion sequences within some of the yeast genes, whereas these appear to be absent from mammalian mtDNA.

References

1. Holley, R.W., Les Prix Nobel, p. 183 (1968).
2. Sanger, F., Les Prix Nobel, p. 134 (1958).
3. Sanger, F., Brownlee, G.G., & Barrell, B.G., J. Mol. Biol. 13, 373 (1965).
4. Robertson, H.D., Barrell, B.G., Weith, H.L., & Donelson, J.E., Nature New Biol. 241, 38 (1973).
5. Ziff, E.B., Sedat, J.W., & Galibert, F., Nature New Biol. 241, 34 (1973).
6. Billeter, M.A., Dahlberg, J.E., Goodman, H.M., Hindley, J., & Weissmann, C., Nature, 224, 1083 (1969).
7. Berg, P., Fancher, H., & Chamberlin, M., Symposium on 'Informational Macromolecules', pp. 467-483, Academic Press, New York & London (1963).
8. Sanger, F., Donelson, J.E., Coulson, A.R., Kössel, H., & Fischer, D., Proc. Natl. Acad. Sci. U.S.A. 70, 1209 (1973).
9. Sanger, F., & Coulson, A.R., J. Mol. Biol. 94, 441 (1975).
10. Sanger, F., Air, G.M., Barrell, B.G., Brown, N.L., Coulson, A.R., Fiddes, J.C., Hutchison, C.A., Slocombe, P.M., & Smith, M., Nature, 265, 687 (1977).
11. Sanger, F., Nicklen, S., & Coulson, A.R., Proc. Natl. Acad. Sci. U.S.A. 74, 5463 (1977).
12. Sanger, F., Coulson, A.R., Friedmann, T., Air, G.M., Barrell, B.G., Brown, N.L., Fiddes, J.C., Hutchison, C.A., Slocombe, P.M., & Smith, M., J. Mol. Biol. 125, 225 (1978).
13. Godson, G.N., Barrell, B.G., Staden, R., & Fiddes, J.C., Nature, 276, 236 (1978).
14. Smith, A.J.H., Nucl. Acids Res. 6, 831 (1979).
15. Barnes, W.M., Gene, 5, 127 (1979).
16. Gronenborn, B., & Messing, J., Nature, 272, 375 (1978).
17. Herrmann, R., Neugebauer, K., Schaller, H., & Zentgraf, H., in The Single-stranded DNA Phages (Denhardt, D.T., Dressler, D., & Ray, D.S., eds.), pp. 473-476, Cold Spring Harbor Laboratory, New York (1978).
18. Heidecker, G., Messing, J., & Gronenborn, B., Gene, 10, 69 (1980).
19. Anderson, S., Gait, M.J., Mayol, L., & Young, I.G., Nucl. Acids Res. 8, 1731 (1980).
20. Gait, M.J., Unpublished results (1980).
21. Staden, R., Nucl. Acids Res. 8, 3673 (1980).
22. Sanger, F., Coulson, A.R., Barrell, B.G., Smith, A.J.H., & Roe, B.A., J. Mol. Biol. 143, 161 (1980).
23. Barrell, B.G., Anderson, S., Bankier, A.T., de Bruijn, M.H.L., Chen, E., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., Schreier, P.H., Smith, A.J.H., Staden, R., & Young, I.G., 31st Mosbach Colloq., Biological Chemistry and Organelle Formation, in press.
24. Winter, G., & Fields, S., Nucl. Acids Res. 8, 1965 (1980).
25. Shaw, D.C., Walker, J.E., Northrop, F.D., Barrell, B.G., Godson, G.N., & Fiddes, J.C., Nature, 272, 510 (1978).
26. Steffans, G.J., & Buse, G., Hoppe-Seyler's Z. Physiol. Chem. 360, 613 (1979).
27. Barrell, B.G., Bankier, A.T., & Drouin, J., Nature, 282, 189 (1979).
28. Young, I.G., & Anderson, S., Gene, in press.
29. Macino, G., Coruzzi, G., Nobrega, F.G., Li, M., & Tzagoloff, A., Proc. Natl. Acad. Sci. U.S.A. 76, 3784 (1979).
30. Macino, G., & Tzagoloff, A., Proc. Natl. Acad. Sci. U.S.A. 76, 131 (1979).
31. Staden, R., Nucl. Acids Res. 8, 817 (1980).
32. de Bruijn, M.H.L., Schreier, P.H., Eperon, I.C., Barrell, B.G., Chen, E.Y., Armstrong, P.W., Wong, J.F.H., & Roe, B.A., Nucl. Acids Res. 8, 5213 (1980).
33. Arcari, P., & Brownlee, G.G., Nucl. Acids Res., in press.
34. Barrell, B.G., Anderson, S., Bankier, A.T., de Bruijn, M.H.L., Chen, E., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., Schreier, P.H., Smith, A.J.H., Staden, R., & Young, I.G., Proc. Natl. Acad. Sci. U.S.A. 77, 3164 (1980).
35. Heckman, J.E., Sarnoff, J., Alzner-De Weerd, B., Yin, S., & RajBhandary, U.L., Proc. Natl. Acad. Sci. U.S.A. 77, 3159 (1980).
36. Bonitz, S.G., Berlani, R., Coruzzi, G., Li, M., Macino, G., Nobrega, F.G., Nobrega, M.P., Thalenfeld, B.E., & Tzagoloff, A., Proc. Natl. Acad. Sci. U.S.A. 77, 3167 (1980).
37. Montoya, J., Ojala, D., & Attardi, G., Nature, in press.
38. Ojala, D., Montoya, J., & Attardi, G., Nature, in press.