Evidence for Variable Selective Pressures at a Large Secondary Structure of the Human Mitochondrial DNA Control Region

Filipe Pereira,* Pedro Soares,‡ João Carneiro,* Luísa Pereira,*§ Martin B. Richards,‡ David C. Samuels,∥ and António Amorim*

*Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal; Faculdade de Ciências da Universidade do Porto, Porto, Portugal; ‡Institute of Integrative and Comparative Biology, Faculty of Biological Sciences, University of Leeds, Leeds, United Kingdom; §Medical Faculty, University of Porto, Porto, Portugal; and ∥Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University

A combined effect of functional constraints and random mutational events is responsible for the sequence evolution of the human mitochondrial DNA (mtDNA) control region. Most studies targeting this noncoding segment usually focus on its primary sequence information disregarding other informative levels such as secondary or tertiary DNA conformations. In this work, we combined the most recent developments in DNA folding calculations with a phylogenetic comparative approach in order to investigate the formation of intrastrand secondary structures in the human mtDNA control region. Our most striking results are those regarding a new cloverleaf-like secondary structure predicted for a 93-bp stretch of the control region 5′-peripheral domain. Randomized sequences indicated that this structure has a more negative folding energy than the average of random sequences with the same nucleotide composition. In addition, a sliding window scan across the complete mitochondrial genome revealed that it stands out as having one of the highest folding potentials. Moreover, we detected several lines of evidence of both negative and positive selection on this structure with high levels of conservation at the structure-relevant stem regions and the occurrence of compensatory base changes in the primate lineage.
In the light of previous data, we discuss the possible involvement of this structure in mtDNA replication and/or transcription. We conclude that maintenance of this structure is responsible for the observed heterogeneity in the rate of substitution among sites in part of the human hypervariable region I and that it is a hot spot for the 3′ end of human mtDNA deletions.

Introduction

The mammalian mitochondrial genome consists of a closed circular double-stranded molecule devoted to the coding of key subunits of the oxidative phosphorylation system. The designation of individual strands of the mitochondrial DNA (mtDNA) molecule as heavy (H) strand and light (L) strand reflects their different buoyant densities in a cesium chloride gradient due to a strand bias in base composition (the H strand is guanine rich, whereas the L strand is guanine poor) (Shadel and Clayton 1997; Taanman 1999; Spelbrink 2003).

In terms of coding efficiency, the human mitochondrial genome displays an exceptional organizational economy. Coding sequences are usually found contiguous to each other and some protein-coding genes even overlap, as observed for the human ATPase 6 and ATPase 8 genes. In addition, some termination codons are not completely encoded by the mtDNA but are instead posttranscriptionally generated by polyadenylation of the corresponding mRNAs (Shadel and Clayton 1997; Taanman 1999; Spelbrink 2003).

In marked contrast with the evolutionary pressure to reduce the mitochondrial content is the persistence of a noncoding segment responsible for regulation of the mtDNA replication and transcription, known as control region. The denomination of displacement loop (D loop) is frequently used in the literature as synonymous with control region. However, it refers to the three-stranded DNA structure formed in the control region due to the premature arrest of H-strand synthesis near the control region 5′ end (fig. 1).
The functions of the D loop and its relation with the mtDNA replication are still unknown. In this manuscript, control region and D loop will not be used as synonyms.

The increasing amount of available sequences revealed a similar control region structural organization across mammals, with size variation ranging between 880 and 1400 bp (Sbisa et al. 1997). One important issue relating to these observations is thus to identify which structural and/or functional domains are under selective pressures strong enough to maintain a large noncoding segment in a genetic system evolving in general terms toward a decrease in genome size. The presence of binding sites for nuclear-encoded factors, known to regulate mtDNA maintenance and expression, provides a partial explanation for this question. Among these are the H-strand origin of replication sites (OH), transcription initiation sites and promoter regions (Montoya et al. 1982; Chang and Clayton 1985; Hixson and Clayton 1985) and conserved elements with possible regulatory functions, namely the termination-associated sequences (Doda et al. 1981) and conserved sequence blocks (Walberg and Clayton 1981; fig. 1). However, a clear explanation is still missing for a large number of control region stretches for which no regulatory element has been identified so far. Also still unexplained is the observed heterogeneity in the rate of substitution among some control region sites, well documented in hypervariable regions I and II (HVR-I and HVR-II) of the human mtDNA (Excoffier and Yang 1999; Meyer et al. 1999).

Key words: mtDNA control region, secondary structures, mutational heterogeneity, mtDNA deletions.
E-mail: [email protected].
Mol. Biol. Evol. 25(12):2759–2770. 2008 doi:10.1093/molbev/msn225 Advance Access publication October 9, 2008
© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]
Substantial evidence has begun to emerge that molecular processes such as transcription, replication, and recombination are partially regulated by the formation of DNA secondary structures (e.g., Forsdyke 1995; Seffens and Digby 1999; Katz and Burge 2003; Cheung 2004). For instance, hairpin or cruciform structures have been identified as recognition sites for the binding of several transcription factors. The formation of local intrastrand secondary structures during transcription can also modulate site-specific mutation rates (Wright 2000; Wright et al. 2003; Hoede et al. 2006). In this context, a DNA secondary structure refers to the molecular-folded structure formed when a single-stranded DNA molecule folds back upon itself. Typically, unpaired and mispaired bases in a secondary structure form single-stranded loops that are more vulnerable to mutations than paired bases in double-stranded stem regions (e.g., Wright 2000; Hoede et al. 2006). It has previously been shown that some control region segments have the ability to form DNA secondary structures in humans and other vertebrate species, namely rat, mouse, cow, and Xenopus (Brown et al. 1986).

FIG. 1. Schematic representation of the human mtDNA control region. The locations of the 13 predicted larger secondary structures (A–M) are indicated by dotted lines. Orientation and H-strand replication initiation sites (OH) are indicated according to the strand-asynchronous replication mechanism. Localization of the displacement loop (D loop), phenylalanine transfer RNA (tRNAPhe), proline transfer RNA (tRNAPro), light-strand promoter (LSP), heavy-strand promoter (HSP), transcription initiation sites (L, H1, and H2), conserved sequence blocks (CSBs), and termination-associated sequences (TAS) are also indicated. Numbers according to the human mtDNA reference sequence (Anderson et al. 1981).
Two methods are commonly employed in the prediction of nucleic acid secondary structures: phylogenetic approaches using information from homologous and alignable sequences from different organisms and free energy minimization methods based on the calculation of the overall free energy difference in the folding of a nucleic acid molecule (Zuker 2000). Although used for years with success for RNA studies, efficient algorithms for DNA folding prediction have only recently become available and incorporated into computer software packages (SantaLucia 1998; Markham and Zuker 2005).

The recent developments in DNA folding prediction, as well as the increasing number of available control region sequences, prompted us to reevaluate the mtDNA control region potential for intrastrand secondary structure formation. We identified 13 potential structures (fig. 1, A–M), and several lines of evidence support the in vivo formation of one of them (A) located at the control region 5′ domain. We have therefore analyzed the involvement of this structure in replication and/or translational regulatory mechanisms, mutation rate heterogeneity, and mtDNA deletions.

Material and Methods

Phylogenies and Complete Sequence Data Set

A human mtDNA phylogenetic tree was reconstructed using the reduced median algorithm (Bandelt et al. 1995) implemented in the "Network 4.2.0.1" software (http://www.fluxus-engineering.com). Based on the relative mutation rate of the mutations and diversity indices of the clades, we selected a single tree from the Network output. The tree combines a total of 2,196 complete mtDNA sequences from macrohaplogroups M, N, and L (African) retrieved from published and unpublished studies (supplementary table S1, Supplementary Material online). A sample of 427 mtDNA control region sequences representing the main human lineages were assembled into a single database (supplementary table S2, Supplementary Material online).
Similar databases were also constructed using all available and complete control region sequences from Pan troglodytes, Gorilla gorilla, and Pongo pygmaeus (supplementary table S2, Supplementary Material online). The consensus sequence for each species was obtained from the sequence alignment using the ClustalW software implemented by the BioEdit program (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Mitochondrial tRNA gene sequences used in this study were derived from the human mtDNA reference sequence (Anderson et al. 1981; Andrews et al. 1999). Sequence numbering is in accordance with the L-strand human mtDNA reference sequence (Anderson et al. 1981).

Nucleic Acid Secondary Structures Prediction

Sequences were submitted to the DNA "mfold" web server (version 3.1; http://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/dna-form1.cgi) for DNA secondary structure prediction by free energy minimization using nearest neighbor thermodynamic rules (SantaLucia 1998; Markham and Zuker 2005). Computations were performed using "mfold" default parameters for folding temperature (37 °C), ionic conditions ([Na+] = 1.0 M; [Mg++] = 0 M) and "window size parameter" of 15 for the complete control region sequence and 2 for tRNAs and for the largest control region secondary structure. DNA folding was limited to structures in which the maximum distance between the two bases initiating and terminating the primary stem is 100 nucleotides. In two control region segments, the maximum distance was reduced to 50 nt positions in order to obtain a better resolution, preventing the formation of structures with larger loops. RNA folding free energies were determined using the RNA mfold software (Markham and Zuker 2005). All folding free energies are expressed in kilocalories per mole (kcal/mol).
Consensus Folding of Aligned Sequences

A different approach to predict the existence of conserved secondary structure in a set of aligned RNA or DNA sequences was developed by Washietl and Hofacker (2004) and uses modified dynamic programming algorithms that add a covariance term to standard free energy calculations. All 16 different haplotypes identified for structure A were aligned and submitted to the Alifold interface (http://rna.tbi.univie.ac.at/cgi-bin/alifold.cgi) using default parameters for DNA secondary structures calculations.

Simulation of DNA Sequence Evolution along the Primate Phylogenetic Tree

We simulated the evolution of nucleotide sequences along the primate phylogeny in the Seq-Gen program (Rambaut and Grassly 1997). The input tree was generated in the DNA maximum likelihood program implemented by the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html) using aligned sequences for the mtDNA structure A region from Homo sapiens, P. troglodytes, G. gorilla, and P. pygmaeus. Twenty data sets of four sequences 93 nt long were generated in the Seq-Gen program using default parameters and the HKY85 model of nucleotide substitution (Hasegawa et al. 1985).

Phylogenetic Analysis of the Human Mitochondrial Tree and Relative Mutation Rates

We used our reconstructed worldwide human mtDNA tree to estimate the number and age of mutations in structure A and an adjacent region. Although it is not possible to date mutations directly by the phylogenetic approach, it is possible to date clades within the tree, allowing us to assign mutations to a particular window of time. In our approach, if a mutation occurred on a tip branch, it was dated to an age of between zero years and the age of the closest ancestral node in the tree.
Alternatively, if a mutation occurred on a branch leading to a particular subclade, it was dated to between the age of that subclade and the age of the ancestral node (a detailed description of the methods used can be found in the Supplementary Material online). To illustrate the procedure, we present a small portion of human mtDNA phylogenetic tree in supplementary figure S1 (Supplementary Material online; the Pacific haplogroup M29/Q).

It is well known that considering the age of an individual mutation provides only weak evidence for either neutrality or deleteriousness because slightly deleterious mutations can persist for a long time in a population and even become fixed at the species level (Kimura 1983). A young mutation per se does not present any evidence on its viability for persistence. However, vital information can be obtained from the overall distribution of these mutations on the tree. If a mutation is present at high frequency in young branches but is absent in older branches or in a significantly lower proportion than in those young branches, we can assume that this mutation is under selective constraints.

We compared mutations in different groups: within and outside structure A and at paired and unpaired sites within structure A. If we exclude purifying selection and assume similar mutation rates in two adjacent regions of the same size, the probability of detecting a mutation in any of these regions should be similar. Moreover, if mutations are randomly distributed throughout the tree, no differences between mutations in these adjacent regions should be detected in different clades and at different time depths of the tree. Therefore, major differences in variation could be the result of different evolutionary constraints. We used a mutation rate of 1 mutational event in every 5,138 years in the entire coding region to date clades within the tree (Mishmar et al. 2003).
We estimated the ages of control region mutations directly from the age of the clade in which they were observed as previously described.

Assessment of Selective Effects

To test for selection, we also calculated Tajima's (1989) D values for different regions of the human mtDNA on a data set of 474 complete mtDNA sequences using the DnaSP 4.0 software (Rozas et al. 2003). It was not possible to include the complete data set of 2,196 sequences in order to calculate significant values due to input limitations of the DnaSP software. Therefore, the data set included all African L (non-M or N) sequences (n = 274) used in the phylogenetic analysis (because these present a greater time depth, allowing for selection effects to impact more strongly) and a random sample of non-African haplogroup M (n = 100) and N (n = 100) sequences.

Complete mtDNA Sliding Window Scan for Secondary Structures' Potential

A Python algorithm was written to generate a set of sequence fragments with the same sequence length as the largest control region secondary structure (93 bp), overlapped by 1 bp, spanning the complete human mtDNA reference sequence. All nucleotide windows were submitted to the DNA "mfold" web server for secondary structure prediction.

Random Sequences

Two types of sequence randomization procedures were used to detect possible statistical biases in the prediction of secondary structures. The first method used was based on a simple random nucleotide permutation, keeping the sequence nucleotide composition constant (mononucleotide shuffling) (Stothard 2000). Additionally, random sequences with the same dinucleotide frequency (dinucleotide shuffling) were generated using the Altschul–Erickson shuffle algorithm, implemented by the "Dishuffle" interface (http://clavius.bc.edu/%7Eclotelab/RNAdinucleotideShuffle/), which derives a first-order Markov model from the conditional probabilities found in the target sequence (Altschul and Erickson 1985).
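The window generation and the mononucleotide shuffling described above can be illustrated with a minimal Python sketch. It is not the authors' DNAux code: the function names `sliding_windows` and `mononucleotide_shuffle` are hypothetical, the sequence is treated as linear rather than circular, and a default shift of one nucleotide between consecutive windows is an assumption about how the scan was stepped. The folding step itself (the mfold web server) is not reproduced.

```python
import random


def sliding_windows(sequence, window=93, step=1):
    """Yield (start, fragment) pairs over a linear sequence.

    start is 1-based. step=1 is an assumption about how the scan
    advances; the sequence is not wrapped around as a circle.
    """
    for start in range(0, len(sequence) - window + 1, step):
        yield start + 1, sequence[start:start + window]


def mononucleotide_shuffle(sequence, rng):
    """Return a random permutation of the sequence.

    Length and nucleotide composition are preserved; base order
    (and hence dinucleotide frequency) is not.
    """
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


# Toy input only; this is not the mtDNA reference sequence.
toy_sequence = "ACGT" * 30
windows = list(sliding_windows(toy_sequence))
rng = random.Random(2008)  # fixed seed so the shuffles are repeatable
shuffled = [mononucleotide_shuffle(windows[0][1], rng) for _ in range(5)]
```

Each fragment or shuffled sequence would then be passed to the folding program, which returns the minimum free energy used in the comparisons.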
Estimation of Statistical Significance

The consensus nucleotide composition of the control region segment in which structure A was predicted to occur (table 1) was used to generate 1,000 shuffled sequences according to both randomization procedures. Minimum free energies were predicted for these sequences as described before.

Table 1
Characteristics of All Predicted Secondary Structures with More than 10 Paired Bases (A–M) in the Human mtDNA Control Region

Structure  Positions    Domains and regulatory elements  Length  Paired  Unpaired  ΔG L strand(a)  ΔG H strand(a)  ΔG L-strand mRNA(a)  Haplotypes(b)
A          16028–16120  HVR-I, D-loop 3′ end             93      60      33        −11.17          −10.40          −26.23               16
B          16124–16160  HVR-I, TAS (partial)             37      16      21        −0.47           −0.99           −6.50                22
C          16310–16332  HVR-I                            23      16      7         −1.24           −0.96           −2.30                20
D          16360–16373  HVR-I                            14      10      4         −0.74           −1.47           −2.70                8
E          16377–16391  HVR-I, central domain            15      10      5         −4.00           −4.12           −7.80                4
F          16481–16513  Central domain                   33      22      11        −1.30           −1.12           −1.70                5
G          16566–015    Central domain                   19      14      5         −0.80           0.31            −1.20                1
H          025–055      Central domain                   31      14      17        −1.32           −1.33           −9.80                2
I          061–097      HVR-II                           37      22      15        −5.69           −3.13           −5.40                18
J          116–149      HVR-II                           34      20      14        −2.53           −3.45           −5.70                9
K          181–226      HVR-II, OH, CSB1 (partial)       46      32      14        −5.27           −2.82           −4.60                53
L          249–279      HVR-II                           31      14      17        −3.85           −4.32           −13.60               9
M          376–414      LSP (partial), L                 39      24      15        −2.38           −2.30           −7.30                4

NOTE. TAS, termination-associated sequences; CSB, conserved sequence blocks; LSP, light-strand promoter.
(a) Free energy difference (ΔG, kcal/mol) associated with the secondary structure folding at 37 °C.
(b) Number of haplotypes as defined within the region of each structure.

To determine if the folding energy predicted for structure A was significantly different from that of randomized sequences, the average and standard deviations (SDs) of folding energies were used to calculate the Z score (or segment score) (Seffens and Digby 1999; Workman and Krogh 1999; Katz and Burge 2003).
The Z score reveals how many units of the SD the folding energy for the native sequence is above or below the average of the randomized sequences. A negative Z score indicates that the native structure has a greater secondary structure potential (more negative folding energy) than the average of the random sequences. The significance level of calculated Z scores was estimated according to the procedure developed by Workman and Krogh (1999) to account for possible deviations from a normal distribution of Z scores with mean 0 and SD 1. A Python script using the random module was written to generate 5,000 sets of 101 shuffled sequences, randomly selected from each group of 1,000 mono- and dinucleotide shuffled sequences. For each set of 101 sequences, a Z score was calculated for one randomly selected sequence (test sequence) using the average and SD of the remaining 100. From the 5,000 bootstrap replicates, the fraction of random sequences with a Z score lower than that of the native structure gives the P value for the sequence (Workman and Krogh 1999). Z scores and sequence P values were similarly obtained for all mitochondrial tRNA genes derived from the human mtDNA reference sequence. All Python scripts developed for this study are part of the DNAux software platform available at http://www.portugene.com/software.html.

Results

Identification of 13 mtDNA Control Region Secondary Structures

Thirteen secondary structures with more than 10 paired bases, varying in length from 14 to 93 bp, were identified in the human mtDNA L-strand consensus sequence (supplementary fig. S2, Supplementary Material online). Structures were named from A to M according to their location in the control region (from 5′ to 3′), and their characteristics are summarized in table 1. The free energy variation on the formation of each structure was found to correlate with the structure's length, due to the expected increase in folding energies with the larger number of bonding interactions (r² = 0.64; supplementary fig. S3, Supplementary Material online). Approximately equivalent folding energies were obtained for the H-strand sequence with the exception of structure G with a positive value (ΔG = 0.31) (table 1).

Because mRNA transcripts from both mtDNA strands include sections of the control region complementary sequence, a possible formation of secondary structures at the mRNA level is not completely excluded. Therefore, secondary structures and folding energies were predicted for the L-strand mRNA transcript (table 1 and supplementary fig. S4, Supplementary Material online). All secondary structures predicted for the mtDNA L strand were also observed in the mRNA transcript with minor structural differences. The most significant difference was observed for structure K, with the predicted mRNA structure showing 16 fewer paired bases than the homologous mtDNA structure. A similar structural conformation was found for structure A in mRNA and mtDNA strands with four major stem regions and a total of 60 paired bases (supplementary fig. S4, Supplementary Material online). In general, greater negative folding energies were observed for the structures in the mRNA transcript than in the mtDNA strands, with structure A presenting the most stable conformation (ΔG = −26.23).

FIG. 2. Graphic representing the positions in the human mitochondrial genome (16023–16250) in relation to estimates of the lowest possible age of the mutations at that position detected in the mitochondrial tree. The area is proportional to the number of occurrences of mutations at those time depths. Different colors represent different classes of sites.

High Levels of Conservation at Some Control Region Secondary Structures

If the formation of these structures were associated with any regulatory or structural process, a selective pressure to maintain their conformation should be expected at the sequence level.
The degree of conservation of each structure was indirectly assessed by searching a database comprising 427 human sequences. The search was successively conducted for each structure previously identified (A–M). Low intraspecific variability values were observed for most structures, varying from one haplotype in structure G to 53 in structure K (table 1). If we consider that a nucleotide substitution in a secondary structure stem region will lead, in most cases, to a disruption of a pairing and consequently to a less stable structural conformation, we should expect a low frequency of polymorphic positions at functionally important stem regions when compared with those at unpaired nucleotide regions. Indeed, our results show that structures A, B, D, H, I, J, K, and M do have a higher proportion of polymorphic positions in the unpaired nucleotides than in the stem regions (supplementary fig. S5, Supplementary Material online). This difference attains statistical significance (P value = 0.0104) for the mtDNA L-strand structure A but not for the mRNA homologous structure (P value = 0.210) because two unpaired polymorphic positions in the mtDNA structure are found paired in the larger stem region of the mRNA structure (the UG pair near the interior loop) (supplementary fig. S4, Supplementary Material online).

It is possible that some minor secondary structures do not attain statistically significant results due to their shorter length. Therefore, we performed a pool analysis combining all structures with the exception of structure A. This analysis did not yield a significant result (P value = 0.804). The problem of combining different structures (as if they were a larger one) is that it may subsume any possible distinctive significant result that an individual structure might have (even if it were not detected due to its shorter length). We also combined the sequence of different structures to calculate its pooled folding potential.
The result proved to be highly artificial and does not contribute to understanding the secondary structure potential of each individual sequence.

In order to verify if the observed distinct haplotypes, defined within the region of each structure, are responsible for different secondary conformations, we calculated the number of paired bases observed in each one. Our results showed that a similar structural conservation is achieved for most structures even when considering different haplotypic sequences. The consensus folding of aligned sequences also demonstrated that the conformation predicted for structure A is maintained in all different haplotypes (supplementary fig. S6, Supplementary Material online). Lower values were found for structures B, C, and K. When there was a change in the number of paired bases, the most common was the loss of one paired base, while the gain of one paired base was less frequent (supplementary fig. S7, Supplementary Material online).

Structure A Is under Strong Selective Pressure

We compared the conservation status of structure A with the remainder of HVR-I in terms of relative mutation rates and in terms of the age of the mutations themselves, using the worldwide human mtDNA tree reconstructed from 2,196 sequences. We plotted each of the mutations (defined by its position in the sequence) against the age estimate of the clade each defines (or zero, if the mutation occurred at a tip of the tree) for structure A and its flanking region up to position 16250 (fig. 2). The extent of overlapping data points (mutations at the same position and of the same age) was indicated by the area of the circles. The density of mutations in the tree for structure A (225 mutations over 93 bp) was much lower (2.69 times lower) than for its flanking region (828 mutations in 127 bp), a difference that is statistically significant (P value < 0.0001). Additionally, we compared the proportion of mutations that showed
evidence of being older than 10,000 years within (12% in 225) and outside (24% in 828) structure A, also yielding a statistically significant value (P value = 0.0010). The relative difference is even higher when considering mutations that are more than 20,000 years old (4% vs. 14%, P value = 0.0003). This indicates that the region comprising structure A is under the effect of a stronger selective pressure than its flanking region.

When considering mutations within structure A, the number of observed mutations in paired sites was proportionally lower (5.24 times) than at unpaired sites (P value < 0.0001), showing again evidence of different evolutionary constraints in paired and unpaired sites. However, the proportion of mutations with evidence of being older than 10,000 years is similar for unpaired and paired sites (12%). The same was observed for ages higher than 20,000 years (4%). This can easily be explained by assuming the existence of strong selective constraints at certain unpaired sites of structure A (for instance, they can interact with other molecules). Thus, the number of older mutations at unpaired sites is relatively lower than the number of recent mutations (and therefore, more similar to that found in stem regions) because there was enough time for the elimination of older "weakly deleterious" mutations by purifying selection. The same effect is probably not so evident in stem regions because paired sites are under a stronger selective constraint to maintain the pair that rapidly removes mutated bases.

Testing for Selection: Tajima's D Statistic

We calculated the Tajima's D statistic for 93-bp intervals overlapping at 83 bp for an overall region of ~1,000 bp, between positions 15508 and 16510 using the L* (i.e., without M and N) sequences (199) and a random selection of 200 M and N sequences from the worldwide human mtDNA database.
The more negative the value of the statistic, the higher the probability that the stretch is under purifying selection. The values are negative and significant for the majority of the coding region, as expected for a functionally constrained region (cytochrome b, the threonine tRNA and the proline tRNA) (supplementary fig. S8, Supplementary Material online). Strikingly, the control region interval presenting the most negative Tajima's D value is the one where structure A is predicted to occur (nucleotide positions 16028–16120), with a value of −2.182 (P value < 0.01). This result suggests that this region has most likely undergone negative selection, although other stretches of the control region may also have done so. To further test the relevance of the formation of the secondary structure, we compared stretches with paired and unpaired sites in structure A. Although both stretches presented significantly negative values (−1.980 and −1.812, P value < 0.05), regions with paired sites present a more negative value.

Secondary Structure Conservation in the Primate Lineage

An insight into the degree of conservation of these structures could be obtained by considering their interspecies structural variability. Control region consensus sequences from P. troglodytes, G. gorilla, and P. pygmaeus were compared with the human control region secondary structures by registering the number of nucleotide sequence alterations. A proportion lower than 20% was found for structures A, F, G, H, I, and J (supplementary fig. S9 [Supplementary Material online] upper part of the chart). As expected, a clear correlation between the number of variable positions and the degree of phylogenetic divergence was observed. The effect of these positional variants in relation to structures defined in humans was evaluated in terms of the loss of base pair stacks because structural stability is highly dependent on the number of complementary hydrogen-bonded nucleotides between paired strands.
Structures A, F, G, H, and J were the ones maintaining a lower proportion of base pair stack losses in all species (<20%) (supplementary fig. S9 [Supplementary Material online], lower part of the chart).

Strong supporting evidence for the formation of a secondary structure in a particular region is the phenomenon of compensatory base changes (CBCs), in which a mutation in one strand of a stem is compensated by a mutation in the complementary strand, preserving the overall structure. CBCs were detected in structure A (fig. 3), between P. troglodytes and human (16047/16057, exchanging AT for a GC pair) and P. pygmaeus and human (16063/16078, exchanging CG for a TA pair), as well as in structure L, between P. troglodytes and human (253/275, exchanging TA for a CG pair). To test the possibility of a random accumulation of CBCs in structure A, we simulated the evolution of 93-bp sequences along the primate phylogeny. Two CBCs were detected between H. sapiens and P. troglodytes in a total of 20 simulated trees. Therefore, the occurrence of one CBC in the P. troglodytes lineage is not significantly different from what would be expected by chance (P = 0.1), suggesting that accumulation of CBCs simply by random events should not be completely ruled out. In any case, these two CBCs were observed in trees with a substantially higher sequence divergence between human and chimp (14 and 15 polymorphisms) than the real one observed between consensus sequences (nine substitution polymorphisms). This could explain the occurrence of CBCs in simulated trees because a high number of polymorphic positions substantially increases the probability of compensatory substitutions. It is also important to note that the lower mutability of stem regions in structure A makes the occurrence of CBCs a rare event, a fact that is not considered in simulated nucleotide sequences.
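The CBC criterion described above (both members of a stem pair differ between two species, yet the pair stays complementary in each) can be expressed as a short check. The sketch below is only an illustration under stated assumptions: the function name `find_cbcs`, the restriction to Watson-Crick A-T and G-C pairs, the 0-based pair indices, and the toy 9-nt hairpin sequences are all hypothetical and are not taken from the control region data.

```python
# Watson-Crick DNA pairs only (an assumption of this sketch).
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def find_cbcs(reference, other, pairs):
    """Return stem pairs where both bases differ between the two aligned
    sequences while the pair remains complementary in each sequence."""
    hits = []
    for i, j in pairs:
        ref_pair = (reference[i], reference[j])
        oth_pair = (other[i], other[j])
        both_changed = ref_pair[0] != oth_pair[0] and ref_pair[1] != oth_pair[1]
        if both_changed and ref_pair in WATSON_CRICK and oth_pair in WATSON_CRICK:
            hits.append((i, j, "".join(ref_pair), "".join(oth_pair)))
    return hits


# Toy hairpin (hypothetical): a 3-bp stem closed by a 3-nt loop.
stem_pairs = [(0, 8), (1, 7), (2, 6)]
reference = "ACGTTTCGT"
other = "GCGTATCGC"  # the A-T pair at (0, 8) becomes G-C; loop position 4 also differs
print(find_cbcs(reference, other, stem_pairs))
```

A substitution on only one side of a stem, or a change in a loop, is not reported, which mirrors the distinction the text draws between compensatory changes and ordinary substitutions.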
The influence of CBCs on the folding potential of structure A was investigated by calculating the folding energy of the consensus human structure with the P. troglodytes and P. pygmaeus CBCs. The folding energy associated with the formation of these structures was −9.74 and −11.75, respectively. As expected, no structural alteration was observed in either sequence (data not shown). Therefore, CBCs account for only a small difference in the observed folding potential (1.43 and 0.58, respectively), meaning that structure A is highly stable with either a GC or an AT pair in the stem regions. Among different human lineages, none of these CBCs were detected, possibly due to the low coalescence time of modern human mtDNA.

FIG. 3. Graphical representation of the human mtDNA control region secondary structure A (at positions 16028–16120). The gray scale represents the number of mutations at each position observed in a human mtDNA phylogenetic tree reconstructed from 2,196 complete mtDNA genome sequences. Arrows indicate base substitutions obtained from a database of 427 human mtDNA sequences; dashed circles indicate Pan troglodytes and Pongo pygmaeus CBCs; dashed bent arrows delimit a region with a high incidence of 3′ ends for mtDNA deletions (at positions 16067–16078). The trinucleotide stop point of D-loop strand synthesis is also shown.

Structure A Has a Greater Folding Potential than Random Sequences

The folding potential of any nucleic acid sequence depends on three fundamental characteristics: length, base composition, and base order. By fixing length and base composition, it is possible to evaluate the importance of base order for the formation of a particular secondary structure. In addition to a conventional mononucleotide shuffling, we also generated 1,000 random sequences with the same dinucleotide frequency as structure A.
The dinucleotide frequency could be important for the accuracy of secondary structure prediction because the stacking energy of a base pair also depends on neighboring nucleotides (Workman and Krogh 1999; Katz and Burge 2003; Clote et al. 2005). Both mono- and dinucleotide-shuffled sequences of structure A presented an average folding energy (−6.85 and −7.88, respectively) higher than the one obtained for the native sequence (−11.17). Only 40 of 1,000 mononucleotide-shuffled sequences and 64 of 1,000 dinucleotide-shuffled sequences had lower folding energies than structure A (table 2). Z scores obtained with both randomization procedures (−1.93 and −1.54 for mono- and dinucleotide shuffling, respectively) indicated that structure A has a more negative folding energy than the average of the random sequences. The bootstrap procedure used to assess the statistical significance of the Z scores shows that, at a significance level of 0.05, the difference was significant for the set of mononucleotide-shuffled sequences (P = 0.041) and not significant for the dinucleotide shuffling procedure (P = 0.068). P values calculated with this bootstrap procedure were similar to the ones obtained assuming a standard normal distribution, suggesting that Z score values can be approximated well by this type of probability distribution (table 2).

We also tested the influence of dinucleotide composition on a larger mtDNA control region fragment, including structure A and its 93-bp flanking segment. This analysis was restricted to the noncoding control region to avoid possible biases toward potential secondary structures in coding regions (the 5′-flanking region of structure A comprises coding domains: the proline tRNA and cytochrome b genes). The folding energy predicted for the 186-bp control region fragment was ΔG = −14.55 kcal/mol.
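For the structure A comparison summarized in table 2, the arithmetic behind the Z scores and the normal-approximation P values can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names are hypothetical, the P value is taken as the lower tail of a standard normal distribution, and only mononucleotide shuffling is shown (a dinucleotide-preserving shuffle requires a dedicated algorithm such as that of Altschul and Erickson 1985, and the folding energies themselves would still have to come from a folding program).

```python
import random
from math import erfc, sqrt


def shuffle_mononucleotide(seq, rng):
    """Random permutation of seq: same length and base composition, new base order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def z_score(native_dg, random_mean, random_sd):
    """Standardized distance of the native folding energy from the random set."""
    return (native_dg - random_mean) / random_sd


def lower_tail_p(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * erfc(-z / sqrt(2))


NATIVE_DG = -11.17  # structure A, kcal/mol (table 2 note)
# Mean and SD of the 1,000 random sequences, as reported in table 2.
summary = {"mononucleotide": (-6.85, 2.24), "dinucleotide": (-7.88, 2.13)}
for model, (mean, sd) in summary.items():
    z = z_score(NATIVE_DG, mean, sd)
    print(f"{model}: Z = {z:.2f}, P = {lower_tail_p(z):.3f}")
```

The values obtained this way are close to the Z scores and normal-approximation P values in table 2; small differences in the last digit can arise from rounding of the reported means and standard deviations.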
A set of 1,000 random sequences with the same dinucleotide frequency as the 186-bp fragment was generated using the Dishuffle interface, and folding energies were calculated in the mfold web server (mean ΔG of −12.91; SD of 2.79). We estimated the significance level of the calculated Z score (P = 0.255) assuming a normal distribution of Z scores with mean 0 and SD 1.

Table 2
Folding Energies, Z Scores, and P Values for 1,000 Random Sequences with the Same Mono- and Dinucleotide Composition as Structure A

Randomization model        Mean ΔG (kcal/mol)   SD     Sequences with ΔG < −11.17   Z score   P value, N(0,1)   P value, bootstrap
Mononucleotide shuffling   −6.85                2.24   40 (4.0%)                    −1.93     0.027             0.041
Dinucleotide shuffling     −7.88                2.13   64 (6.4%)                    −1.54     0.062             0.068

NOTE. Structure A: ΔG = −11.17 kcal/mol.

Structure A Is Not Associated with a High G + C Content

It has been suggested that a high G + C content could favor the formation of more stable secondary structures (e.g., Galtier and Lobry 1997). There are at least two possible ways to consider the relationship between G + C content and secondary structures: 1) a high G + C content could be selected for in order to maintain a strong folding potential in a particular sequence or 2) a particular sequence may have a strong folding potential because it has a high G + C content that increases the number of paired bases and the structure stability. Supplementary figure S10 (Supplementary Material online) shows a sliding window analysis of the compositional bias along the human mtDNA control region, obtained from the 427-sequence database and calculated in the Mesquite software (http://mesquiteproject.org). It is clear that there is no significant G + C bias in the region where structure A is formed: it presents values around the mean for the entire control region (the green line). Moreover, structure A has a balanced nucleotide distribution (23 Ts, 27 As, 26 Cs, and 17 Gs).
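The base counts just quoted can be turned into a G + C fraction directly, and the same quantity can be computed along a sequence in sliding windows. The sketch below is illustrative only: `sliding_gc` is a hypothetical helper with arbitrary window settings, and it is not the Mesquite analysis used for supplementary figure S10.

```python
# Base counts reported in the text for the 93-nt structure A.
counts = {"T": 23, "A": 27, "C": 26, "G": 17}
length = sum(counts.values())
gc_fraction = (counts["G"] + counts["C"]) / length
print(length, round(gc_fraction, 3))


def sliding_gc(seq, window, step):
    """G + C fraction in each window of `window` nt, advancing by `step` nt."""
    fractions = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions.append((chunk.count("G") + chunk.count("C")) / window)
    return fractions
```

The counts sum to the 93 nt of structure A and give a G + C fraction slightly below one half, in line with the description of the composition as balanced.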
Therefore, we conclude that there is no evidence for the existence of selective pressure to increase the number of GC pairs in structure A. It seems that the folding energy associated with the formation of structure A does not require the presence of more C:G pairs to reach its stability. Conversely, there is no evidence supporting the idea that structure A is formed simply due to a high G + C content in the control region.

Z scores and P values were calculated for 1,000 random sequences with the same mono- and dinucleotide composition as tRNA genes. Values obtained for structure A were compared with the ones calculated for mitochondrial tRNA genes (supplementary fig. S11, Supplementary Material online).

Structure A Is One of the Mitochondrial Regions with Higher Folding Potential

A complete scan of the mitochondrial genome was performed to determine how commonly 93-bp segments (i.e., of the length of structure A) have a higher folding potential. Among nearly 16,500 windows analyzed for minimum folding energies, only 22 regions were identified with a folding potential higher than that observed for structure A (supplementary table S4 and fig. S12, Supplementary Material online). The frequency of 93-bp windows according to their folding energy is represented in figure 4. The window representing structure A is at one extreme tail of the distribution (P = 0.029). As expected, most of the 22 regions detected belong to rRNA and tRNA domains. In agreement with the previously calculated free energy per base for tRNA genes (supplementary table S4, Supplementary Material online), the region including the cysteine tRNA gene presented a lower folding energy than structure A.
Other mtDNA segments with tRNA genes were also detected as having lower folding energy than structure A (e.g., tRNA-Ser/tRNA-Asp), but this resulted from the overlap of tRNA genes in the same 93-bp scanning window.

Comparison between Structure A and Mitochondrial tRNA Genes

A good way to evaluate the robustness of the secondary structure A prediction is to compare its folding potential with those obtained for mtDNA fragments known to form functional secondary structures. Previously, Clote et al. (2005), using a database with 530 tRNAs, showed that these structural sequences display low Z scores (mean Z score of −1.59) compared with random sequences of the same dinucleotide frequency, in accordance with their known folding potential. Therefore, folding energies were computed for all mitochondrial tRNA genes. Normalized free energies (free energy per base) for each tRNA gene and for structure A were calculated to avoid biases due to the correlation between free energy variation and structure length (supplementary fig. S3, Supplementary Material online). The results showed that only the gene for the cysteine tRNA had a lower free energy per base than structure A, −0.135 and −0.120, respectively (supplementary table S3, Supplementary Material online).

FIG. 4. Distribution of folding energy values calculated for all 93-bp windows spanning the complete human mitochondrial genome. Folding energy for structure A is indicated by an arrow.

The structure A folding energy is even lower than that predicted for most rRNA regions (supplementary fig. S12, Supplementary Material online). Two control region 93-bp windows presented folding energies as low as structure A, but each one includes two structures (I, J and C, D) that independently have a lower folding potential. Additionally, 10 regions with high folding potential were detected inside protein-coding genes.
Nevertheless, as can be observed in supplementary figure S13 (Supplementary Material online), the folding energy associated with these regions is the result of different secondary structures (most of them shorter than 20 bp) that together contribute to the overall folding potential of the segment. In any case, three larger structures of more than 60 bp were observed in the ND1, ND4L, and ND5 genes. It cannot be excluded that some human mitochondrial-encoded mRNAs might present a substantial folding potential, as previously noticed for yeast (Saccharomyces cerevisiae) and fungal (Podospora anserina) mitochondrial mRNAs (Katz and Burge 2003).

Discussion

The analysis of mtDNA has become established as a powerful tool in population genetics, forensics, systematics, and evolutionary studies. Most investigations are based only on the primary sequence with little attention paid to other possible informative levels such as secondary or tertiary conformations. A region particularly predisposed to the formation of such structures is the mtDNA control region, due to the occurrence of extensive single-stranded DNA stretches during mtDNA replication or in the three-stranded D-loop structure (fig. 1). In this work, we have combined recent developments in DNA folding prediction with phylogenetic information from a human mtDNA tree reconstructed from 2,196 sequences in order to identify and characterize all human control region secondary structures. When considering complementary DNA strands, cruciforms or dual hairpin-folded structures are also possible. Indeed, this might be the case for the mtDNA control region where highly stable structures were found in complementary regions of both L and H strands (table 1). In any case, it is important to mention that the stability of these structures can always be increased through the interaction of specific proteins.
The clearest evidence supporting the in vivo formation of a secondary structure was obtained for structure A, predicted for the control region 5′ domain, 93 nucleotides long and with 60 paired bases (fig. 3). The evidence comprises a higher proportion of polymorphic positions in unpaired nucleotides than in stem regions, the low density of mutations in the human mitochondrial tree, significantly negative Tajima's D values, the existence of CBCs among primates, and a lower proportion of base pair stack losses in primate species. Negative Z scores were obtained from 1,000 random sequences with the same mono- and dinucleotide composition as structure A, indicating that this structure has a more negative folding energy than the average of the random sequences (table 2). Computed P values of Z scores showed a significant difference between structure A folding energy and that of mononucleotide-shuffled random sequences (P = 0.041). However, a higher P value was calculated from the set of random sequences with the same dinucleotide composition (P = 0.068). With respect to this, Workman and Krogh (1999) showed that the predicted free energy of 46 mRNAs was not significantly different from random sequences with the same dinucleotide distribution. This finding was supported by P values substantially higher than the one calculated for structure A: only 6 of 46 mRNAs analyzed in that study presented lower P values than structure A (Workman and Krogh 1999). Higher P values were also observed for 18 of 22 mitochondrial tRNA genes analyzed in this study (supplementary table S3 and fig. S11, Supplementary Material online).
Additionally, the P value of the 186-bp fragment, including structure A and a flanking region (P = 0.255), was higher than the one calculated for structure A alone (P = 0.062), demonstrating that the folding potential of the 186-bp control region fragment depends almost wholly on the formation of structure A and that the dinucleotide composition per se is insufficient to explain the formation of such a large structure. By using a larger fragment, we can observe that the region where structure A is formed has a folding potential that does not depend on the background control region dinucleotide composition. Furthermore, structure A presented a lower folding energy than most mtDNA tRNA genes and rRNA sequence domains, standing out as one of the regions with the most negative folding energy observed in a sliding window scan across the complete mitochondrial genome (supplementary table S4 and fig. S12, Supplementary Material online).

The putative formation of a shorter hairpin structure with 83 nt and 34 paired bases, partially overlapping structures A and B at the control region 5′ end, has been previously reported (Brown et al. 1986). However, a considerably less negative folding energy was associated with the formation of that structure (ΔG = −4.4) than with structure A (ΔG = −11.17). A possible explanation for this discrepancy could be the use in this work of the most recent DNA folding algorithms, which have considerably improved the quality of folding predictions (SantaLucia 1998; Markham and Zuker 2005).

The various criteria used to detect the putative in vivo formation of control region secondary structures revealed that, with the exception of structure A, there was insufficient evidence to consider the remaining structures as biologically significant (supplementary table S5, Supplementary Material online).
Nevertheless, the possibility that some minor secondary structures do actually exist in vivo should not be completely ruled out. It is probable that statistically significant results were not reached for some structures simply due to their shorter length. It is interesting to note that five of six H-strand synthesis initiation sites (OH) are found on J, K, and two minor secondary structures (Chang and Clayton 1985) and that structure M overlaps part of the light-strand promoter, which includes the starting site of L-strand transcription (supplementary fig. S2, Supplementary Material online).

Possible Roles of Structure A in mtDNA Replication and Transcription

Two conflicting models have been put forward to explain the mtDNA replication mechanism. According to the classic strand-asynchronous mechanism, replication of the H strand initiates unidirectionally at different nucleotide positions (OH) mapped across the control region 3′ domain (Shadel and Clayton 1997). It was demonstrated that synthesis of most of these newly replicative chains is prematurely arrested at a single trinucleotide stop point, near the control region 5′ end (Doda et al. 1981), forming a three-stranded DNA structure known as D loop or displacement loop (fig. 1). The presence of structure A near the control region 5′ end, enclosing the D-loop termination stop point, and structure A's predicted 2-dimensional conformation suggest that this structure may act as a structural barrier for the replicative enzymatic assembly or as a recognition site for molecules involved in the premature arrest of H-strand elongation and D-loop formation (fig. 3). However, because recent experiments have led to the suggestion of an alternative bidirectional strand-coupled model postulating that replication initiates bidirectionally in a sequence cluster located at the control region 5′ domain (Yasukawa et al.
2005), a possible role for structure A in replication initiation events should not be ruled out.

An additional functional role that could be attributed to structure A regards the processing of polycistronic RNA precursors to produce mature transcripts (fig. 1). It has been proposed that tRNA secondary structures provide the punctuation marks for the correct mRNA processing (Ojala et al. 1980). The observation that structure A can be folded into a stable tRNA-like cloverleaf conformation on the nascent mRNA derived from the L strand (ΔG = −26.23) suggests that it may act as a punctuation mark for the processing of the flanking proline tRNA gene (supplementary fig. S4, Supplementary Material online). Although the precise location and mechanism for H- and L-strand transcription termination is unknown, a relatively high number of RNAs possessing stable poly(A) tails were found to end at several control region sites (Slomovic et al. 2005). Strikingly, structure A is located in a hot spot of polyadenylated sites corresponding to the 3′ end of H-strand transcripts (Slomovic et al. 2005). This observation suggests that the formation of such a large secondary structure in the DNA template during transcription may induce the transcription termination of several full-length H-strand mRNAs. In any case, the nonsignificant difference between the proportion of polymorphic positions in unpaired and paired nucleotides observed for the mRNA structure A reinforces the idea of a stronger selective pressure at the DNA level (a significant difference was found for the mtDNA structure A).

MtDNA Control Region Mutational Heterogeneity: the Effect of Selection on a DNA Secondary Structure

The observed heterogeneity in the rate of substitution among control region sites could be readily explained by the existence of conserved regulatory elements and protein-binding sites.
Nevertheless, in a large number of different control region stretches, no regulatory element capable of explaining the observed mutation variability has so far been identified. Whereas in protein-coding genes the observed heterogeneity could be explained by selective constraints acting on translated products, in noncoding segments it could only be explained by structural or other functional roles for the DNA molecule itself. The formation of thermodynamically stable secondary structures is one important factor in this perspective. Indeed, because mutational events are more likely to occur during replication in single-stranded DNA due to, for instance, damage from oxygen radicals, one should expect that paired bases in a secondary structure (e.g., in stem regions) are better protected from alteration than those in unpaired regions (e.g., in loops) (Wright 2000; Wright et al. 2002; Hoede et al. 2006).

To investigate whether the observed mutational heterogeneity documented for the human HVR-I is in any way related to structure A, we analyzed the number of observed mutations in a phylogenetic tree containing 2,196 human mtDNA sequences with more than 10,000 independent mutations (figs. 2 and 3). This method provides a well-defined tree structure by using complete sequences and overcomes major drawbacks of previous methods for the estimation of mutation rates, which lacked a phylogenetic framework (Meyer et al. 1999; Pesole and Saccone 2001; for a review of pitfalls, see Bandelt et al. 2006). It is clear from figure 2 that the region where structure A was predicted to form has a much lower density of mutations than its flanking region. Additionally, the relative difference in these densities increases at greater time depths, strongly suggesting a differential pressure of purifying selection on the structure and its 3′-flanking region.
Moreover, the number of bases with zero mutation occurrences in the tree is much higher in stem regions (45%) than in the unpaired stretches (20%). The mutational conservation is more pronounced in the two peripheral stem domains, whereas positions with a higher mutation rate are predominantly found in loop regions (fig. 3). These results indicate that, at least for some control region stretches, the combination of selective pressures and the protective effect of double-stranded DNA in secondary conformations can significantly contribute to explaining the observed mutation rate heterogeneity.

Secondary Structure A Encloses the Hot Spot for the 3′ End of mtDNA Deletions

Deletions in the human mtDNA are responsible for a number of genetic diseases ranging from mild myopathies to severe multisystem disorders (Brockington et al. 1993; Chinnery et al. 2000). It is commonly accepted that mtDNA deletions are usually associated with the presence of short direct repeats in the flanking area of the deletion break point (Mita et al. 1990; Samuels et al. 2004; Yui and Matsuura 2006). Most human mtDNA deletions are believed to occur during mtDNA replication as a result of DNA replication slippage events with slipped strand mispairing involving two 13-bp direct repeats (Samuels et al. 2004). As a result, the distribution of the 5′ and 3′ ends of most mtDNA deletions matches the location of those two 13-bp repeats. However, for most deletions with no flanking repeats associated, and even some with flanking repetitive regions, the 3′-end distribution does not correspond with any of those 13-bp repeats. Interestingly, we identified an almost perfect match between the peak in the distribution for the 3′ ends (at positions 16067–16078) and the control region stretch where structure A was predicted to be formed (at positions 16028–16120; fig. 3). The data set of Samuels et al.
(2004) included 111 reported unique deletions with a 3′ end in the control region. Of these 111 deletions, 108 had their 3′ end located in structure A (two-tailed P < 0.0001). Furthermore, these deletion end points were very highly localized to a specific region within structure A, the loop from 16067 to 16074 (fig. 3), with 76 of the deletion ends occurring within this loop (two-tailed P < 0.0001). In view of the fact that this correspondence is highly significant for deletions without involvement of a direct repeat mechanism, it is possible that other factors, such as the formation of secondary structures, may trigger the formation of some mtDNA deletions. Presumably, the formation of a large secondary structure at the leading strand of the replication fork may cause the transient dissociation of the replicative enzymatic assembly, thereby increasing the likelihood of the slipped strand mispairing events that are a well-known cause of deletions.

Concluding Remarks

Genetic evidence for both negative (less variation in structure-relevant stems) and positive (compensatory changes in stem regions) selection on the human mtDNA control region emphasizes the importance that stable secondary structures may have in mitochondrial genome evolution. The distribution of such structures along both control region peripheral domains may well be the explanation for the persistence of such a large noncoding segment in a genome governed by a strong evolutionary pressure for removing dispensable sequences. A correct prediction of DNA secondary structures is of crucial importance for the development of more efficient and directed experimental research aimed at a better understanding of the complex protein–nucleic acid interactions involved in mtDNA replication and transcription, regulated by nuclear-encoded factors.
More detailed investigation should also provide valuable insights toward the understanding of the putative role of secondary structures in the generation of mtDNA deletions. A better knowledge of the generation of diversity and its heterogeneity across control region sequences is also of great importance for molecular evolutionary and phylogenetic research.

Supplementary Material

Supplementary tables S1–S5 and figures S1–S13 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

This work was partially supported by a research grant to F.P. (SFRH/BD/19585/2004) from Fundação para a Ciência e a Tecnologia. P.S. was supported by a Marie Curie Early Stage Training Grant. Instituto de Patologia e Imunologia Molecular da Universidade do Porto is partially supported by “Programa Operacional Ciência e Inovação 2010” (POCI 2010), VI Programa Quadro (2002–2006).

Literature Cited

Altschul SF, Erickson BW. 1985. Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Mol Biol Evol. 2:526–538. Anderson S, Bankier AT, Barrell BG, et al. (14 co-authors). 1981. Sequence and organization of the human mitochondrial genome. Nature. 290:457–465. Andrews RM, Kubacka I, Chinnery PF, Lightowlers R, Turnbull D, Howell N. 1999. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet. 23:147. Bandelt HJ, Forster P, Sykes BC, Richards MB. 1995. Mitochondrial portraits of human populations using median networks. Genetics. 141:743–753. Bandelt HJ, Kong QP, Richards M, Macaulay V. 2006. Estimation of mutation rates and coalescence times: some caveats. In: Bandelt HJ, Macaulay V, Richards M, editors. Human mitochondrial DNA and the evolution of Homo sapiens, Series: Nucleic Acids and Molecular Biology vol. 18. Berlin: Springer-Verlag. p. 47–92. Brockington M, Sweeney MG, Hammans SR, Morgan-Hughes JA, Harding AE. 1993.
A tandem duplication in the D-loop of human mitochondrial DNA is associated with deletions in mitochondrial myopathies. Nat Genet. 4:67–71. Brown GG, Gadaleta G, Pepe G, Saccone C, Sbisa E. 1986. Structural conservation and variation in the D-loop-containing region of vertebrate mitochondrial DNA. J Mol Biol. 192:503–511. Chang DD, Clayton DA. 1985. Priming of human mitochondrial DNA replication occurs at the light-strand promoter. Proc Natl Acad Sci USA. 82:351–355. Cheung AK. 2004. Palindrome regeneration by template strand-switching mechanism at the origin of DNA replication of porcine circovirus via the rolling-circle melting-pot replication model. J Virol. 78:9016–9029. Chinnery PF, Johnson MA, Wardell TM, Singh-Kler R, Hayes C, Brown DT, Taylor RW, Bindoff LA, Turnbull DM. 2000. The epidemiology of pathogenic mitochondrial DNA mutations. Ann Neurol. 48:188–193. Clote P, Ferre F, Kranakis E, Krizanc D. 2005. Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA. 11:578–591. Doda JN, Wright CT, Clayton DA. 1981. Elongation of displacement-loop strands in human and mouse mitochondrial DNA is arrested near specific template sequences. Proc Natl Acad Sci USA. 78:6116–6120. Excoffier L, Yang Z. 1999. Substitution rate variation among sites in mitochondrial hypervariable region I of humans and chimpanzees. Mol Biol Evol. 16:1357–1368. Forsdyke DR. 1995. A stem-loop “kissing” model for the initiation of recombination and the origin of introns. Mol Biol Evol. 12:949–958. Galtier N, Lobry JR. 1997. Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J Mol Evol. 44:632–636. Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape split by a molecular clock of mitochondrial DNA. J Mol Evol. 22:160–174. Hixson JE, Clayton DA. 1985.
Initiation of transcription from each of the two human mitochondrial promoters requires unique nucleotides at the transcriptional start sites. Proc Natl Acad Sci USA. 82:2660–2664. Hoede C, Denamur E, Tenaillon O. 2006. Selection acts on DNA secondary structures to decrease transcriptional mutagenesis. PLoS Genet. 2:e176. Katz L, Burge CB. 2003. Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res. 13:2042–2051. Kimura M. 1983. The neutral theory of molecular evolution. Cambridge: Cambridge University Press. Markham NR, Zuker M. 2005. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res. 33:W577–W581. Meyer S, Weiss G, von Haeseler A. 1999. Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA. Genetics. 152:1103–1110. Mishmar D, Ruiz-Pesini E, Golik P, et al. (13 co-authors). 2003. Natural selection shaped regional mtDNA variation in humans. Proc Natl Acad Sci USA. 100:171–176. Mita S, Rizzuto R, Moraes CT, Shanske S, Arnaudo E, Fabrizi GM, Koga Y, DiMauro S, Schon EA. 1990. Recombination via flanking direct repeats is a major cause of large-scale deletions of human mitochondrial DNA. Nucleic Acids Res. 18:561–567. Montoya J, Christianson T, Levens D, Rabinowitz M, Attardi G. 1982. Identification of initiation sites for heavy-strand and light-strand transcription in human mitochondrial DNA. Proc Natl Acad Sci USA. 79:7195–7199. Ojala D, Merkel C, Gelfand R, Attardi G. 1980. The tRNA genes punctuate the reading of genetic information in human mitochondrial DNA. Cell. 22:393–403. Pesole G, Saccone C. 2001. A novel method for estimating substitution rate variation among sites in a large dataset of homologous DNA sequences. Genetics. 157:859–865. Rambaut A, Grassly NC. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 13:235–238. 
Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R. 2003. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics. 19:2496–2497. Samuels DC, Schon EA, Chinnery PF. 2004. Two direct repeats cause most human mtDNA deletions. Trends Genet. 20:393–398. SantaLucia J Jr. 1998. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci USA. 95:1460–1465. Sbisa E, Tanzariello F, Reyes A, Pesole G, Saccone C. 1997. Mammalian mitochondrial D-loop region structural analysis: identification of new conserved sequences and their functional and evolutionary implications. Gene. 205:125–140. Seffens W, Digby D. 1999. mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res. 27:1578–1584. Shadel GS, Clayton DA. 1997. Mitochondrial DNA maintenance in vertebrates. Annu Rev Biochem. 66:409–435. Slomovic S, Laufer D, Geiger D, Schuster G. 2005. Polyadenylation and degradation of human mitochondrial RNA: the prokaryotic past leaves its mark. Mol Cell Biol. 25:6427–6435. Spelbrink JN. 2003. Replication, repair, and recombination of mitochondrial DNA. In: Holt I, editor. Genetics of mitochondrial diseases. Oxford: Oxford Medical Publications. p. 3–25. Stothard P. 2000. The sequence manipulation suite: JavaScript programs for analyzing and formatting protein and DNA sequences. Biotechniques. 28:1102–1104. Taanman JW. 1999. The mitochondrial genome: structure, transcription, translation and replication. Biochim Biophys Acta. 1410:103–123. Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 123:585–595. Walberg MW, Clayton DA. 1981. Sequence and properties of the human KB cell and mouse L cell D-loop regions of mitochondrial DNA. Nucleic Acids Res. 9:5411–5421. Washietl S, Hofacker IL. 2004. 
Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J Mol Biol. 342:19–30. Workman C, Krogh A. 1999. No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res. 27:4816–4822. Wright BE. 2000. A biochemical mechanism for nonrandom mutations and evolution. J Bacteriol. 182:2993–3001. Wright BE, Reimers JM, Schmidt KH, Reschke DK. 2002. Hypermutable bases in the p53 cancer gene are at vulnerable positions in DNA secondary structures. Cancer Res. 62:5641–5644. Wright BE, Reschke DK, Schmidt KH, Reimers JM, Knight W. 2003. Predicting mutation frequencies in stem-loop structures of derepressed genes: implications for evolution. Mol Microbiol. 48:429–441. Yasukawa T, Yang MY, Jacobs HT, Holt IJ. 2005. A bidirectional origin of replication maps to the major noncoding region of human mitochondrial DNA. Mol Cell. 18:651–662. Yui R, Matsuura ET. 2006. Detection of deletions flanked by short direct repeats in mitochondrial DNA of aging Drosophila. Mutat Res. 594:155–161. Zuker M. 2000. Calculating nucleic acid secondary structure. Curr Opin Struct Biol. 10:303–310. Connie Mulligan, Associate Editor. Accepted October 2, 2008.