Evidence for Variable Selective Pressures at a Large Secondary Structure
of the Human Mitochondrial DNA Control Region
Filipe Pereira,* Pedro Soares,‡ João Carneiro,* Luísa Pereira,*§ Martin B. Richards,‡
David C. Samuels,‖ and António Amorim*
*Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal; Faculdade de Ciências da Universidade
do Porto, Porto, Portugal; ‡Institute of Integrative and Comparative Biology, Faculty of Biological Sciences, University of Leeds,
Leeds, United Kingdom; §Medical Faculty, University of Porto, Porto, Portugal; and ‖Virginia Bioinformatics Institute, Virginia
Polytechnic Institute and State University
A combined effect of functional constraints and random mutational events is responsible for the sequence evolution of the
human mitochondrial DNA (mtDNA) control region. Most studies targeting this noncoding segment usually focus on its
primary sequence information disregarding other informative levels such as secondary or tertiary DNA conformations. In
this work, we combined the most recent developments in DNA folding calculations with a phylogenetic comparative
approach in order to investigate the formation of intrastrand secondary structures in the human mtDNA control region. Our
most striking results are those regarding a new cloverleaf-like secondary structure predicted for a 93-bp stretch of the control
region 5′-peripheral domain. Randomized sequences indicated that this structure has a more negative folding energy than
the average of random sequences with the same nucleotide composition. In addition, a sliding window scan across the
complete mitochondrial genome revealed that it stands out as having one of the highest folding potentials. Moreover, we
detected several lines of evidence of both negative and positive selection on this structure, with high levels of conservation at
the structure-relevant stem regions and the occurrence of compensatory base changes in the primate lineage. In the light of
previous data, we discuss the possible involvement of this structure in mtDNA replication and/or transcription. We conclude
that maintenance of this structure is responsible for the observed heterogeneity in the rate of substitution among sites in part
of the human hypervariable region I and that it is a hot spot for the 3′ end of human mtDNA deletions.
Introduction
The mammalian mitochondrial genome consists of
a closed circular double-stranded molecule devoted to
the coding of key subunits of the oxidative phosphorylation
system. The designation of individual strands of the mitochondrial DNA (mtDNA) molecule as heavy (H) strand and
light (L) strand reflects their different buoyant densities in
a cesium chloride gradient due to a strand bias in base composition (the H strand is guanine rich, whereas the L strand
is guanine poor) (Shadel and Clayton 1997; Taanman 1999;
Spelbrink 2003).
In terms of coding efficiency, the human mitochondrial genome displays an exceptional organizational economy. Coding sequences are usually found contiguous to
each other and some protein-coding genes even overlap,
as observed for the human ATPase 6 and ATPase 8 genes.
In addition, some termination codons are not completely
encoded by the mtDNA but are instead posttranscriptionally generated by polyadenylation of the corresponding
mRNAs (Shadel and Clayton 1997; Taanman 1999;
Spelbrink 2003). In marked contrast with the evolutionary
pressure to reduce the mitochondrial content is the persistence of a noncoding segment responsible for regulation of
mtDNA replication and transcription, known as the control
region. The term displacement loop (D loop) is
frequently used in the literature as synonymous with control
region. However, it refers to the three-stranded DNA structure formed in the control region due to the premature arrest
of H-strand synthesis near the control region 5′ end (fig. 1).
The functions of the D loop and its relation with mtDNA
replication are still unknown. In this manuscript, control region and D loop will not be used as synonyms.
Key words: mtDNA control region, secondary structures, mutational
heterogeneity, mtDNA deletions.
Mol. Biol. Evol. 25(12):2759–2770. 2008
doi:10.1093/molbev/msn225
Advance Access publication October 9, 2008
© The Author 2008. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
The increasing amount of available sequences revealed a similar control region structural organization
across mammals, with size variation ranging between
880 and 1400 bp (Sbisa et al. 1997). One important issue
relating to these observations is thus to identify which structural and/or functional domains are under selective pressures strong enough to maintain a large noncoding
segment in a genetic system evolving in general terms toward a decrease in genome size. The presence of binding
sites for nuclear-encoded factors, known to regulate
mtDNA maintenance and expression, provides a partial explanation for this question. Among these are the H-strand
origin of replication sites (OH), transcription initiation sites
and promoter regions (Montoya et al. 1982; Chang and
Clayton 1985; Hixson and Clayton 1985) and conserved
elements with possible regulatory functions, namely the
termination-associated sequences (Doda et al. 1981) and conserved sequence blocks (Walberg and Clayton 1981; fig. 1).
However, a clear explanation is still missing for a large number of control region stretches for which no regulatory element has been identified so far. Also still unexplained is
the observed heterogeneity in the rate of substitution among
some control region sites, well documented in hypervariable
regions I and II (HVR-I and HVR-II) of the human mtDNA
(Excoffier and Yang 1999; Meyer et al. 1999).
Substantial evidence has begun to emerge that molecular processes such as transcription, replication, and recombination are partially regulated by the formation of DNA
secondary structures (e.g., Forsdyke 1995; Seffens and
Digby 1999; Katz and Burge 2003; Cheung 2004). For instance, hairpin or cruciform structures have been identified
as recognition sites for the binding of several transcription
factors. The formation of local intrastrand secondary structures during transcription can also modulate site-specific
mutation rates (Wright 2000; Wright et al. 2003; Hoede
et al. 2006). In this context, a DNA secondary structure refers to the folded structure formed when a single-stranded DNA molecule folds back upon itself. Typically,
unpaired and mispaired bases in a secondary structure form
single-stranded loops that are more vulnerable to mutations
than paired bases in double-stranded stem regions (e.g.,
Wright 2000; Hoede et al. 2006).
FIG. 1.—Schematic representation of the human mtDNA control region. The locations of the 13 predicted larger secondary structures (A–M) are
indicated by dotted lines. Orientation and H-strand replication initiation sites (OH) are indicated according to the strand-asynchronous replication
mechanism. Localization of the displacement loop (D loop), phenylalanine transfer RNA (tRNAPhe), proline transfer RNA (tRNAPro), light-strand
promoter (LSP), heavy-strand promoter (HSP), transcription initiation sites (L, H1, and H2), conserved sequence blocks (CSBs), and termination-associated sequences (TAS) are also indicated. Numbers according to the human mtDNA reference sequence (Anderson et al. 1981).
It has previously been shown that some control region
segments have the ability to form DNA secondary structures
in humans and other vertebrate species, namely rat, mouse,
cow, and Xenopus (Brown et al. 1986). Two methods are
commonly employed in the prediction of nucleic acid secondary structures: phylogenetic approaches using information from homologous and alignable sequences from
different organisms and free energy minimization methods
based on the calculation of the overall free energy difference
in the folding of a nucleic acid molecule (Zuker 2000).
Although used for years with success for RNA studies, efficient
algorithms for DNA folding prediction have only recently
become available and incorporated into computer software
packages (SantaLucia 1998; Markham and Zuker 2005).
The recent developments in DNA folding prediction,
as well as the increasing number of available control region
sequences, prompted us to reevaluate the mtDNA control
region potential for intrastrand secondary structure formation. We identified 13 potential structures (fig. 1, A–M), and
several lines of evidence support the in vivo formation of
one of them (A) located at the control region 5′ domain. We
have therefore analyzed the involvement of this structure in
replication and/or transcriptional regulatory mechanisms, mutation rate heterogeneity, and mtDNA deletions.
Material and Methods
Phylogenies and Complete Sequence Data Set
A human mtDNA phylogenetic tree was reconstructed
using the reduced median algorithm (Bandelt et al. 1995)
implemented in the "Network 4.2.0.1" software (http://www.fluxus-engineering.com). Based on the relative mutation rates of the mutations and the diversity indices of the clades,
we selected a single tree from the Network output. The tree
combines a total of 2,196 complete mtDNA sequences from
macrohaplogroups M, N, and L (African) retrieved from
published and unpublished studies (supplementary table
S1, Supplementary Material online).
A sample of 427 mtDNA control region sequences
representing the main human lineages were assembled into
a single database (supplementary table S2, Supplementary
Material online). Similar databases were also constructed
using all available and complete control region sequences
from Pan troglodytes, Gorilla gorilla, and Pongo
pygmaeus (supplementary table S2, Supplementary Material online). The consensus sequence for each species was
obtained from the sequence alignment using the ClustalW
software implemented in the BioEdit program (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). Mitochondrial
tRNA gene sequences used in this study were derived from
the human mtDNA reference sequence (Anderson et al.
1981; Andrews et al. 1999). Sequence numbering is in accordance with the L-strand human mtDNA reference
sequence (Anderson et al. 1981).
Nucleic Acid Secondary Structures Prediction
Sequences were submitted to the DNA "mfold" web
server (version 3.1; http://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/dna-form1.cgi) for DNA secondary structure prediction by free energy minimization using nearest neighbor thermodynamic rules (SantaLucia
1998; Markham and Zuker 2005). Computations were performed using "mfold" default parameters for folding temperature (37 °C) and ionic conditions ([Na+] = 1.0 M; [Mg++] =
0 M), with a "window size parameter" of 15 for the complete
control region sequence and 2 for tRNAs and for the
largest control region secondary structure. DNA folding
was limited to structures in which the maximum distance
between the two bases initiating and terminating the primary stem is 100 nucleotides. In two control region segments, the maximum distance was reduced to 50 nt
in order to obtain a better resolution by preventing
the formation of structures with larger loops. RNA folding
free energies were determined using the RNA mfold software (Markham and Zuker 2005). All folding free energies are expressed in kilocalories per mole (kcal/mol).
Consensus Folding of Aligned Sequences
A different approach to predict the existence of conserved secondary structure in a set of aligned RNA or DNA
sequences was developed by Washietl and Hofacker (2004)
and uses modified dynamic programming algorithms that
add a covariance term to standard free energy calculations.
All 16 different haplotypes identified for structure A were
aligned and submitted to the Alifold interface (http://rna.tbi.univie.ac.at/cgi-bin/alifold.cgi) using default parameters
for DNA secondary structure calculations.
Simulation of DNA Sequence Evolution along the
Primate Phylogenetic Tree
We simulated the evolution of nucleotide sequences
along the primate phylogeny in the Seq-Gen program
(Rambaut and Grassly 1997). The input tree was generated
in the DNA maximum likelihood program implemented by
the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html) using aligned sequences for the mtDNA
structure A region from Homo sapiens, P. troglodytes,
G. gorilla, and P. pygmaeus. Twenty data sets of four sequences 93 nt long were generated in the Seq-Gen program
using default parameters and the HKY85 model of nucleotide substitution (Hasegawa et al. 1985).
Phylogenetic Analysis of the Human Mitochondrial Tree
and Relative Mutation Rates
We used our reconstructed worldwide human mtDNA
tree to estimate the number and age of mutations in structure
A and an adjacent region. Although it is not possible to date
mutations directly by the phylogenetic approach, it is possible to date clades within the tree, allowing us to assign
mutations to a particular window of time. In our approach,
if a mutation occurred on a tip branch, it was dated to an age
of between zero years and the age of the closest ancestral
node in the tree. Alternatively, if a mutation occurred on
a branch leading to a particular subclade, it was dated to
between the age of that subclade and the age of the ancestral
node (a detailed description of the methods used can be
found in the Supplementary Material online). To illustrate
the procedure, we present a small portion of human mtDNA
phylogenetic tree in supplementary figure S1 (Supplementary Material online; the Pacific haplogroup M29/Q).
It is well known that considering the age of an individual mutation provides only weak evidence for either
neutrality or deleteriousness because slightly deleterious
mutations can persist for a long time in a population and
even become fixed at the species level (Kimura 1983).
A young mutation per se does not present any evidence
on its viability for persistence. However, vital information
can be obtained from the overall distribution of these mutations on the tree. If a mutation is present at high frequency
in young branches but is absent in older branches or in a significantly lower proportion than in those young branches,
we can assume that this mutation is under selective constraints.
We compared mutations in different groups: within
and outside structure A and at paired and unpaired sites
within structure A. If we exclude purifying selection and
assume similar mutation rates in two adjacent regions of
the same size, the probability of detecting a mutation in
any of these regions should be similar. Moreover, if mutations are randomly distributed throughout the tree, no differences between mutations in these adjacent regions
should be detected in different clades and at different time
depths of the tree. Therefore, major differences in variation
could be the result of different evolutionary constraints.
We used a mutation rate of 1 mutational event in every
5,138 years in the entire coding region to date clades within
the tree (Mishmar et al. 2003). We estimated the ages of
control region mutations directly from the age of the clade
in which they were observed as previously described.
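The dating rules described in this section can be sketched in Python as follows. This is only an illustration and not the authors' implementation: the function names are hypothetical, and converting a clade's average number of coding-region mutations into years by simple multiplication is an assumption (the full procedure is given in the Supplementary Material online).

```python
YEARS_PER_CODING_MUTATION = 5138  # rate quoted in the text (Mishmar et al. 2003)


def clade_age_years(mean_coding_mutations):
    """Convert an average number of coding-region mutations into years.

    Assumption: the age scales linearly with the mean mutation count.
    """
    return mean_coding_mutations * YEARS_PER_CODING_MUTATION


def mutation_age_window(ancestor_age, subclade_age=None):
    """Return the (minimum, maximum) age in years assigned to a mutation.

    A mutation on a tip branch (subclade_age is None) is dated between
    zero and the age of the closest ancestral node. A mutation on a branch
    leading to a subclade is dated between the age of that subclade and
    the age of the ancestral node.
    """
    lower = 0 if subclade_age is None else subclade_age
    if lower > ancestor_age:
        raise ValueError("a subclade cannot be older than its ancestral node")
    return (lower, ancestor_age)
```

For example, `mutation_age_window(20000, 8000)` assigns the mutation to the interval between 8,000 and 20,000 years, whereas `mutation_age_window(20000)` treats it as a tip mutation with a minimum age of zero.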
Assessment of Selective Effects
To test for selection, we also calculated Tajima’s
(1989) D values for different regions of the human mtDNA
on a data set of 474 complete mtDNA sequences using the
DnaSP 4.0 software (Rozas et al. 2003). It was not possible
to include the complete data set of 2,196 sequences in order
to calculate significant values due to input limitations of
the DnaSP software. Therefore, the data set included all
African L (non-M or N) sequences (n = 274) used in
the phylogenetic analysis (because these present a greater
time depth, allowing for selection effects to impact more
strongly) and a random sample of non-African haplogroup
M (n = 100) and N (n = 100) sequences.
Complete mtDNA Sliding Window Scan for Secondary
Structures’ Potential
A PYTHON algorithm was written to generate a set of
sequence fragments with the same sequence length as the
largest control region secondary structure (93 bp), overlapped by 1 bp, spanning the complete human mtDNA
reference sequence. All nucleotide windows were submitted to the DNA ‘‘mfold’’ web server for secondary structure
prediction.
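A window generator of this kind might look like the sketch below. It is not the original script: the function name is hypothetical, the genome is treated as a linear string, and "overlapped by 1 bp" is read here as consecutive windows starting 1 bp apart, which is an assumption (the `step` argument can be changed if a different reading is intended).

```python
def sliding_windows(sequence, window=93, step=1):
    """Yield (start_position, fragment) pairs along a sequence.

    Positions are 1-based. The sequence is treated as linear, so windows
    that would wrap around the circular genome are not produced
    (an assumption of this sketch).
    """
    for start in range(0, len(sequence) - window + 1, step):
        yield start + 1, sequence[start:start + window]


# Toy usage on a short sequence with a 4-nt window.
fragments = list(sliding_windows("ACGTACGTAC", window=4))
```

Each fragment would then be folded separately, and its minimum free energy recorded against its start position.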
Random Sequences
Two types of sequence randomization procedures were
used to detect possible statistical biases in the prediction of
secondary structures. The first method used was based on
a simple random nucleotide permutation, keeping the sequence nucleotide composition constant (mononucleotide
shuffling) (Stothard 2000). Additionally, random sequences
with the same dinucleotide frequency (dinucleotide shuffling) were generated using the Altschul–Erickson shuffle algorithm, implemented by the "Dishuffle" interface (http://clavius.bc.edu/%7Eclotelab/RNAdinucleotideShuffle/), which
derives a first-order Markov model from the conditional probabilities found in the target sequence (Altschul and Erickson
1985).
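The first of these procedures, mononucleotide shuffling, can be illustrated with a few lines of Python. The sketch below is an assumption about how such a permutation could be written, with hypothetical function names. It does not reproduce the Altschul–Erickson dinucleotide shuffle, which requires a more elaborate algorithm.

```python
import random
from collections import Counter


def mononucleotide_shuffle(sequence, rng):
    """Return a random permutation of the sequence.

    Nucleotide composition is preserved exactly, while dinucleotide
    frequencies are not.
    """
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


def shuffled_set(sequence, n, seed=0):
    """Generate n shuffled versions of a sequence with a reproducible seed."""
    rng = random.Random(seed)
    return [mononucleotide_shuffle(sequence, rng) for _ in range(n)]


# Every shuffled sequence has the same base counts as the input.
native = "AACCGGTTAC"
assert all(Counter(s) == Counter(native) for s in shuffled_set(native, 5))
```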
Estimation of Statistical Significance
The consensus nucleotide composition of the control
region segment in which structure A was predicted to occur
(table 1) was used to generate 1,000 shuffled sequences according to both randomization procedures. Minimum free
energies were predicted for these sequences as described
before.
Table 1
Characteristics of All Predicted Secondary Structures with More than 10 Paired Bases (A–M) in the Human mtDNA Control
Region

                                                                                  ΔG (kcal/mol)a
Predicted   Nucleotide     Control Region Domains                                ----------------------------------  Number of
Structure   Positions      and Regulatory Elements      Length  Paired  Unpaired  L Strand  H Strand  L-Strand mRNA  Haplotypesb
(A)         16028–16120    HVR-I, D-loop 3′ end         93      60      33        −11.17    −10.40    −26.23         16
(B)         16124–16160    HVR-I, TAS (partial)         37      16      21        −0.47     −0.99     −6.50          22
(C)         16310–16332    HVR-I                        23      16      7         −1.24     −0.96     −2.30          20
(D)         16360–16373    HVR-I                        14      10      4         −0.74     −1.47     −2.70          8
(E)         16377–16391    HVR-I, central domain        15      10      5         −4.00     −4.12     −7.80          4
(F)         16481–16513    Central domain               33      22      11        −1.30     −1.12     −1.70          5
(G)         16566–015      Central domain               19      14      5         −0.80     0.31      −1.20          1
(H)         025–055        Central domain               31      14      17        −1.32     −1.33     −9.80          2
(I)         061–097        HVR-II                       37      22      15        −5.69     −3.13     −5.40          18
(J)         116–149        HVR-II                       34      20      14        −2.53     −3.45     −5.70          9
(K)         181–226        HVR-II, OH, CSB1 (partial)   46      32      14        −5.27     −2.82     −4.60          53
(L)         249–279        HVR-II                       31      14      17        −3.85     −4.32     −13.60         9
(M)         376–414        LSP (partial), L             39      24      15        −2.38     −2.30     −7.30          4

NOTE.—TAS, termination-associated sequences; CSB, conserved sequence blocks; LSP, light-strand promoter.
a Free energy difference associated with the secondary structure folding at 37 °C.
b Number of haplotypes as defined within the region of each structure.
To determine if the folding energy predicted for structure A was significantly different from that of randomized
sequences, the average and standard deviations (SDs) of
folding energies were used to calculate the Z score (or segment score) (Seffens and Digby 1999; Workman and Krogh
1999; Katz and Burge 2003). The Z score reveals how many
units of the SD the folding energy for the native sequence is
above or below the average of the randomized sequences.
A negative Z score indicates that the native structure has
a greater secondary structure potential (more negative folding energy) than the average of the random sequences.
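As an illustration, the Z score defined above can be computed as in the following sketch. The function name is hypothetical, and the use of the sample SD (rather than the population SD) is an assumption, since the text does not specify which was used.

```python
import statistics


def z_score(native_energy, shuffled_energies):
    """Number of SDs by which the native folding energy differs from
    the mean folding energy of the randomized sequences.

    A negative value means the native sequence folds with a more
    negative energy than the random average.
    """
    mean = statistics.mean(shuffled_energies)
    sd = statistics.stdev(shuffled_energies)  # sample SD (assumption)
    if sd == 0:
        raise ValueError("randomized energies have zero variance")
    return (native_energy - mean) / sd
```

With toy energies of −5, −6, and −7 kcal/mol for the randomized sequences (mean −6, SD 1), a native energy of −8 kcal/mol gives a Z score of −2.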
The significance level of calculated Z scores was estimated according to the procedure developed by Workman
and Krogh (1999) to account for possible deviations from
a normal distribution of Z scores with mean 0 and SD 1.
A PYTHON script using the random module was written
to generate 5,000 sets of 101 shuffled sequences, randomly
selected from each group of 1,000 mono- and dinucleotide
shuffled sequences. For each set of 101 sequences, a Z score
was calculated for one randomly selected sequence (test sequence) using the average and SD of the remaining 100.
From the 5,000 bootstrap replicates, the fraction of random sequences with a Z score lower than that of the native
structure gives the P value for the sequence (Workman and
Krogh 1999). Z scores and sequence P values were similarly obtained for all mitochondrial tRNA genes derived
from the human mtDNA reference sequence. All PYTHON
scripts developed for this study are part of the DNAux software platform available at http://www.portugene.com/software.html.
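The resampling procedure described above can be sketched as follows. This is a reconstruction from the description in the text, not the DNAux code: the function and parameter names are hypothetical, and the sample SD is assumed.

```python
import random
import statistics


def empirical_p_value(native_z, shuffled_energies, n_sets=5000,
                      set_size=101, seed=0):
    """Estimate the P value of a native Z score by resampling.

    For each replicate, set_size energies are drawn without replacement
    from the pool of shuffled-sequence energies. One of them acts as the
    test sequence and is scored against the mean and SD of the others.
    The P value is the fraction of test sequences whose Z score is lower
    than the native Z score.
    """
    rng = random.Random(seed)
    lower = 0
    for _ in range(n_sets):
        subset = rng.sample(shuffled_energies, set_size)
        test_energy, background = subset[0], subset[1:]
        sd = statistics.stdev(background)  # sample SD (assumption)
        if sd == 0:
            continue  # degenerate replicate, no Z score can be computed
        z = (test_energy - statistics.mean(background)) / sd
        if z < native_z:
            lower += 1
    return lower / n_sets
```

Because the P value is derived from the empirical distribution of Z scores, it does not rely on the Z scores being normally distributed.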
Results
Identification of 13 mtDNA Control Region Secondary
Structures
Thirteen secondary structures with more than 10
paired bases, varying in length from 14 to 93 bp, were identified in the human mtDNA L-strand consensus sequence
(supplementary fig. S2, Supplementary Material online).
Structures were named from A to M according to their location in the control region (from 5′ to 3′), and their characteristics are summarized in table 1. The free energy
variation on the formation of each structure was found to
correlate with the structure’s length, due to the expected increase in folding energies with the larger number of bonding interactions (r² = 0.64; supplementary fig. S3,
Supplementary Material online). Approximately equivalent
folding energies were obtained for the H-strand sequence
with the exception of structure G with a positive value
(ΔG = 0.31) (table 1).
Because mRNA transcripts from both mtDNA strands
include sections of the control region complementary sequence, a possible formation of secondary structures at
the mRNA level is not completely excluded. Therefore, secondary structures and folding energies were predicted for
the L-strand mRNA transcript (table 1 and supplementary
fig. S4, Supplementary Material online). All secondary
structures predicted for the mtDNA L strand were also observed in the mRNA transcript with minor structural differences. The most significant difference was observed for
structure K, with the predicted mRNA structure showing
16 fewer paired bases than the homologous mtDNA structure.
A similar structural conformation was found for structure A
in mRNA and mtDNA strands with four major stem regions
and a total of 60 paired bases (supplementary fig. S4, Supplementary Material online). In general, more negative
folding energies were observed for the structures in the
mRNA transcript than in the mtDNA strands, with structure A
presenting the most stable conformation (ΔG = −26.23).
High Levels of Conservation at Some Control Region
Secondary Structures
If the formation of these structures were associated
with any regulatory or structural process, a selective pressure to maintain their conformation should be expected at
the sequence level. The degree of conservation of each
structure was indirectly assessed by searching a database
comprising 427 human sequences. The search was successively conducted for each structure previously identified
(A–M). Low intraspecific variability values were observed
for most structures, varying from one haplotype in structure
G to 53 in structure K (table 1).
FIG. 2.—Graphic representing the positions in the human mitochondrial genome (16023–16250) in relation to estimates of the lowest possible age
of the mutations at that position detected in the mitochondrial tree. The area is proportional to the number of occurrences of mutations at those time
depths. Different colors represent different classes of sites.
If we consider that a nucleotide substitution in a secondary structure stem region will lead, in most cases, to
a disruption of a pairing and consequently to a less stable
structural conformation, we should expect a low frequency
of polymorphic positions at functionally important stem regions when compared with those at unpaired nucleotide regions. Indeed, our results show that structures A, B, D, H, I,
J, K, and M do have a higher proportion of polymorphic
positions in the unpaired nucleotides than in the stem regions
(supplementary fig. S5, Supplementary Material online).
This difference attains statistical significance (P value =
0.0104) for the mtDNA L-strand structure A but not for
the mRNA homologous structure (P value = 0.210) because
two unpaired polymorphic positions in the mtDNA structure
are found paired in the larger stem region of the mRNA structure (the UG pair near the interior loop) (supplementary
fig. S4, Supplementary Material online).
It is possible that some minor secondary structures do
not attain statistically significant results due to their shorter
length. Therefore, we performed a pooled analysis combining
all structures with the exception of structure A. This analysis did not yield a significant result (P value = 0.804). The
problem with combining different structures (as if they were
a larger one) is that it may mask any distinctive
significant result that an individual structure might have
(even if it were not detected due to its shorter length).
We also combined the sequences of different structures to
calculate their pooled folding potential. The result proved
to be highly artificial and does not contribute to understanding the secondary structure potential of each individual sequence.
In order to verify if the observed distinct haplotypes,
defined within the region of each structure, are responsible
for different secondary conformations, we calculated the
number of paired bases observed in each one. Our results
showed that a similar structural conservation is achieved for
most structures even when considering different haplotypic
sequences. The consensus folding of aligned sequences also
demonstrated that the conformation predicted for structure
A is maintained in all different haplotypes (supplementary
fig. S6, Supplementary Material online). Lower values were
found for structures B, C, and K. When there was a change
in the number of paired bases, the most common was the
loss of one paired base, while the gain of one paired base
was less frequent (supplementary fig. S7, Supplementary
Material online).
Structure A Is under Strong Selective Pressure
We compared the conservation status of structure A
with the remainder of HVR-I in terms of relative mutation
rates and in terms of the age of the mutations themselves,
using the worldwide human mtDNA tree reconstructed
from 2,196 sequences. We plotted each of the mutations
(defined by its position in the sequence) against the age estimate of the clade each defines (or zero, if the mutation
occurred at a tip of the tree) for structure A and its flanking
region up to position 16250 (fig. 2). The extent of overlapping data points (mutations at the same position and of the
same age) was indicated by the area of the circles. The density of mutations in the tree for structure A (225 mutations
over 93 bp) was much lower (2.69 times lower) than for its
flanking region (828 mutations in 127 bp), a difference that
is statistically significant (P value < 0.0001). Additionally,
we compared the proportion of mutations that showed
evidence of being older than 10,000 years within (12% in
225) and outside (24% in 828) structure A, also yielding
a statistically significant value (P value = 0.0010). The relative difference is even higher when considering mutations
that are more than 20,000 years old (4% vs. 14%, P value =
0.0003). This indicates that the region comprising structure A
is under the effect of a stronger selective pressure than its
flanking region.
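The density comparison reported above follows directly from the mutation counts and region lengths given in the text. A short worked calculation, with a hypothetical helper name, is shown below. It reproduces only the ratio, not the significance test, which is not specified in this section.

```python
def mutation_density(n_mutations, length_bp):
    """Mutations observed in the tree per base pair of sequence."""
    return n_mutations / length_bp


structure_a = mutation_density(225, 93)   # structure A
flanking = mutation_density(828, 127)     # flanking region up to 16250
fold_difference = flanking / structure_a

print(f"{structure_a:.2f} vs {flanking:.2f} mutations per bp "
      f"(ratio {fold_difference:.2f})")
# prints: 2.42 vs 6.52 mutations per bp (ratio 2.69)
```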
When considering mutations within structure A, the
number of observed mutations in paired sites was proportionally lower (5.24 times) than at unpaired sites (P value <
0.0001), showing again evidence of different evolutionary
constraints in paired and unpaired sites. However, the proportion of mutations with evidence of being older than
10,000 years is similar for unpaired and paired sites
(12%). The same was observed for ages greater than
20,000 years (4%). This can easily be explained by assuming the existence of strong selective constraints at certain
unpaired sites of structure A (for instance, they can interact
with other molecules). Thus, the number of older mutations
at unpaired sites is relatively lower than the number of recent mutations (and therefore, more similar to that found in
stem regions) because there was enough time for the elimination of older ‘‘weakly deleterious’’ mutations by purifying selection. The same effect is probably not so evident in
stem regions because paired sites are under a stronger selective constraint to maintain the pair that rapidly removes
mutated bases.
Testing for Selection—Tajima’s D Statistic
We calculated Tajima’s D statistic for 93-bp intervals overlapping by 83 bp over a region of ~1,000
bp, between positions 15508 and 16510, using the L* (i.e.,
without M and N) sequences (199) and a random selection
of 200 M and N sequences from the worldwide human
mtDNA database. The more negative the value of the
statistic, the higher the probability that the stretch is under
purifying selection. The values are negative and significant
for the majority of the coding region, as expected for a functionally constrained region (cytochrome b, the threonine
tRNA and the proline tRNA) (supplementary fig. S8,
Supplementary Material online). Strikingly, the control
region interval presenting the most negative Tajima’s D
value is the one where structure A is predicted to occur (nucleotide positions 16028–16120), with a value of −2.182
(P value < 0.01). This result suggests that this region has most
likely undergone negative selection, although other
stretches of the control region may also have done so.
To further test the relevance of the formation of the secondary structure, we compared stretches with paired and unpaired sites in structure A. Although both stretches
presented significantly negative values (−1.980 and
−1.812, P value < 0.05), the stretch with paired sites presents
a more negative value.
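For readers unfamiliar with the statistic, the sketch below computes Tajima's D from a set of aligned sequences using the standard formulas of Tajima (1989). It is an illustrative implementation and not the DnaSP code used in this study. The function name is hypothetical, and every alignment column with more than one distinct character is counted as a single segregating site, which is a simplifying assumption.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for equal-length aligned sequences (n >= 4)."""
    n = len(sequences)
    if n < 4:
        raise ValueError("at least four sequences are required")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # S: number of segregating (polymorphic) alignment columns.
    seg_sites = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if seg_sites == 0:
        raise ValueError("no segregating sites; D is undefined")

    # pi: average number of pairwise differences.
    n_pairs = n * (n - 1) / 2
    pi = sum(sum(a != b for a, b in zip(s1, s2))
             for s1, s2 in combinations(sequences, 2)) / n_pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

An excess of rare variants, as expected under purifying selection, lowers the average pairwise difference relative to the number of segregating sites and therefore yields a negative D.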
Secondary Structure Conservation in the Primate Lineage
An insight into the degree of conservation of these
structures could be obtained by considering their interspecies structural variability. Control region consensus sequences from P. troglodytes, G. gorilla, and P. pygmaeus were
compared with the human control region secondary structures by registering the number of nucleotide sequence alterations. A proportion lower than 20% was found for
structures A, F, G, H, I, and J (supplementary fig. S9 [Supplementary Material online] upper part of the chart). As expected, a clear correlation between the number of variable
positions and the degree of phylogenetic divergence was
observed.
The effect of these positional variants in relation to
structures defined in humans was evaluated in terms of
the loss of base pair stacks because structural stability is
highly dependent on the number of complementary
hydrogen–bonded nucleotides between paired strands.
Structures A, F, G, H, and J were the ones maintaining
a lower proportion of base pair stack losses in all species
(<20%) (supplementary fig. S9 [Supplementary Material
online] lower part of the chart).
Strong supporting evidence for the formation of a secondary structure in a particular region is the phenomenon of
compensatory base changes (CBCs), in which a mutation in
one strand of a stem is compensated by a mutation in the
complementary strand in order to preserve its overall structure. CBCs were detected in structure A (fig. 3), between
P. troglodytes and human (16047/16057, exchanging AT
for a GC pair) and P. pygmaeus and human (16063/
16078, exchanging CG for a TA pair), as well as for structure L, between P. troglodytes and human (253/275, exchanging TA for a CG pair).
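The bookkeeping behind base pair stack losses and CBCs can be sketched as follows. This is only an illustrative classification under stated assumptions: the reference pairs are taken to be Watson-Crick, wobble pairs such as G-T are counted as lost, and the 10-nt hairpin and its pair list are invented for the example (they are not structure A).

```python
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def classify_pairs(reference, other, pairs):
    """Sort stem pairs into conserved, compensatory (CBC), or lost.

    `pairs` holds 0-based (i, j) index pairs taken from the reference
    structure. Reference pairs are assumed to be Watson-Crick, so a
    different Watson-Crick pair implies that both bases changed.
    """
    result = {"conserved": [], "cbc": [], "lost": []}
    for i, j in pairs:
        ref_pair = (reference[i], reference[j])
        new_pair = (other[i], other[j])
        if new_pair == ref_pair:
            result["conserved"].append((i, j))
        elif new_pair in WATSON_CRICK:
            result["cbc"].append((i, j))
        else:
            result["lost"].append((i, j))
    return result


def stack_loss_fraction(result):
    """Proportion of reference pairs that no longer pair in the other species."""
    total = sum(len(v) for v in result.values())
    return len(result["lost"]) / total


# Hypothetical hairpin: three stem pairs closing a four-base loop.
human = "GACTTTTGTC"
other = "GGCTTTTGCA"   # A-T replaced by G-C at (1, 8); G-C broken at (0, 9)
stem = [(0, 9), (1, 8), (2, 7)]
summary = classify_pairs(human, other, stem)
print(summary, stack_loss_fraction(summary))
```

In this toy case the pair at (1, 8) is an AT-to-GC exchange of the kind reported for 16047/16057, whereas the pair at (0, 9) counts as a stack loss.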
To test the possibility of a random accumulation of
CBCs in structure A, we simulated the evolution of 93 bp
sequences along the primate phylogeny. Two CBCs were
detected between H. sapiens and P. troglodytes in a total
of 20 simulated trees. Therefore, the occurrence of one
CBC in the P. troglodytes lineage is not significantly different from what would be expected by chance (P = 0.1),
suggesting that accumulation of CBCs simply by random
events should not be completely ruled out. In any case,
these two CBCs were observed in trees with a substantially
higher sequence divergence between human and chimp
(14 and 15 polymorphisms) than the real one observed
between consensus sequences (nine substitution polymorphisms). This fact could explain the occurrence of CBCs
in simulated trees because a high number of polymorphic
positions substantially increases the probability of having
compensatory substitutions. It is also important to notice
that the lower mutability of stem regions in structure A
makes the occurrence of CBCs a rare event—a fact that
is not considered in simulated nucleotide sequences.
The influence of CBCs on the folding potential of
structure A was investigated by calculating the folding
energy of the consensus human structure with the P.
troglodytes and P. pygmaeus CBCs. The folding energy associated with the formation of these structures was −9.74
and −11.75, respectively. As expected, no structural alteration was observed in either sequence (data not shown).
Therefore, CBCs only account for a small difference in
the observed folding potential (1.43 and 0.58, respectively), meaning that structure A is highly stable with either
a GC or AT pair in the stem regions.
mtDNA Secondary Structures 2765
FIG. 3.—Graphical representation of the human mtDNA control region secondary structure A (at positions 16028–16120). The gray scale
represents the number of mutations at each position observed in a human mtDNA phylogenetic tree reconstructed from 2,196 complete mtDNA genome
sequences. Arrows indicate base substitutions obtained from a database of 427 human mtDNA sequences; dashed circles indicate Pan troglodytes and
Pongo pygmaeus CBCs; dashed bent arrows delimit a region with a high incidence of 3′ ends for mtDNA deletions (at positions 16067–16078).
The trinucleotide stop point of D-loop strand synthesis is also shown.
Among different human lineages, none of these CBCs
were detected, possibly due to the short coalescence time of
modern human mtDNA.
Structure A Has a Greater Folding Potential than
Random Sequences
The folding potential of any nucleic acid sequence depends on three fundamental characteristics: length, base
composition, and base order. By fixing length and base composition, it is possible to evaluate the importance of base
order for the formation of a particular secondary structure.
In addition to a conventional mononucleotide shuffling,
we also generated 1,000 random sequences with the same dinucleotide frequency as structure A. The dinucleotide frequency could be important for the accuracy of secondary
structure prediction because stacking energies of a single base
pair also depend on neighboring nucleotides (Workman and
Krogh 1999; Katz and Burge 2003; Clote et al. 2005). Both
mono- and dinucleotide-shuffled sequences of structure A
presented an average folding energy (−6.85 and −7.88, respectively) higher than the one obtained for the native sequence (−11.17). Only 40 of 1,000 mononucleotide
shuffled sequences and 64 of 1,000 dinucleotide random sequences had lower folding energies than structure A (table 2).
Z scores obtained with both randomization procedures
(−1.93 and −1.54 for mono- and dinucleotide shuffling,
respectively) indicated that structure A has a more negative
folding energy than the average of the random sequences.
The bootstrap procedure used to assess the Z score statistical
significance shows that the difference was significant for the
set of mononucleotide-shuffled sequences (P value = 0.041)
and not significant for the dinucleotide shuffling procedure
(P value = 0.068), at a significance level of 0.05. P values
calculated with this bootstrap procedure were similar to the
ones obtained assuming a standard normal distribution, suggesting that Z score values can be approximated well by this
type of probability distribution (table 2).
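The Z score calculation described above can be sketched from the summary statistics reported in table 2. The helper names are assumptions made for this example, and the shuffling helper only covers the mononucleotide case; the dinucleotide-preserving shuffle and the folding step itself (performed by the authors with dedicated tools) are not reproduced here.

```python
import math
import random


def shuffle_mononucleotide(seq, rng):
    """Random permutation of seq; base composition is preserved exactly."""
    return "".join(rng.sample(list(seq), len(seq)))


def z_score(native, random_mean, random_sd):
    """Distance of the native energy from the random mean, in SD units."""
    return (native - random_mean) / random_sd


def lower_tail_p(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def empirical_p(random_energies, native):
    """Fraction of random sequences with an energy at least as low as native."""
    return sum(e <= native for e in random_energies) / len(random_energies)


# Summary values taken from table 2 (kcal/mol).
z_mono = z_score(-11.17, -6.85, 2.24)
z_di = z_score(-11.17, -7.88, 2.13)
print(round(z_mono, 2), round(lower_tail_p(z_mono), 3))
print(round(z_di, 2))

# Composition check on a toy sequence (hypothetical, not structure A).
rng = random.Random(0)
toy = "ACGTACGGTTAACC"
print(sorted(shuffle_mononucleotide(toy, rng)) == sorted(toy))
```

With the mononucleotide values, the sketch returns a Z score of about −1.93 and a normal-approximation P value of about 0.027, in line with table 2; the count-based fraction (40 of 1,000) corresponds to `empirical_p` applied to the full list of random energies.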
We also tested the influence of dinucleotide composition on a larger mtDNA control region fragment, including
structure A and its 93-bp flanking segment. This analysis
was restricted to the noncoding control region to avoid possible biases toward potential secondary structures in coding
regions (the 5′-flanking region of structure A comprises
coding domains: the proline tRNA and cytochrome
b genes). The folding energy predicted for the 186-bp control region fragment was ΔG = −14.55 kcal/mol. A set of
1,000 random sequences with the same dinucleotide frequency as the 186-bp fragment was generated using the
Dishuffle interface, and folding energies were calculated
in the mfold web server (mean ΔG of −12.91; SD of
2.79). We estimated the significance level of the calculated
Z score (P value = 0.255) assuming a normal distribution
of Z scores with mean 0 and SD 1.
2766 Pereira et al.
Table 2
Folding Energies, Z Scores, and P Values for 1,000 Random Sequences with the Same Mono- and Dinucleotide Composition as
Structure A
Randomization Model       ΔG Mean     Standard   Number of Sequences  Z Score  P Value (assuming  P Value (bootstrap
                          (kcal/mol)  Deviation  with ΔG < −11.17              N ~ (0,1))         procedure)
Mononucleotide shuffling  −6.85       2.24       40 (4.0%)            −1.93    0.027              0.041
Dinucleotide shuffling    −7.88       2.13       64 (6.4%)            −1.54    0.062              0.068
NOTE.—Structure A: ΔG = −11.17 kcal/mol.
Structure A Is Not Associated with a High G + C
Content
It has been suggested that a high G + C content could
favor the formation of more stable secondary structures
(e.g., Galtier and Lobry 1997). There are at least two possible ways to consider the relationship between G + C content and secondary structures: 1) a high G + C content
could be selected for in order to maintain a strong folding
potential in a particular sequence or 2) a particular sequence
may have a strong folding potential because it has a high
G + C content that increases the number of paired bases and
the structure stability. Supplementary figure S10 (Supplementary Material online) shows a sliding window analysis of the compositional bias along the human mtDNA
control region, obtained from the 427-sequence database,
calculated in the Mesquite software (http://mesquiteproject.org).
It is clear that there is no significant G + C bias in the
region where structure A is formed; it presents values
around the mean for the entire control region (the green
line). Moreover, structure A has a balanced nucleotide distribution (23 Ts, 27 As, 26 Cs, and 17 Gs). Therefore, we
conclude that there is no evidence for the existence of selective pressure to increase the number of GC pairs in structure A. It seems that the folding energy associated with the
formation of structure A does not require the presence of
more C:G pairs to reach its stability. Conversely, there is no
evidence supporting the idea that structure A is formed simply due to a high G + C content in the control region.
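The compositional argument can be checked with a few lines of code. The sketch below computes the G + C fraction from the base counts reported above and shows a generic sliding-window version; the function names are assumptions, and the sequence built from the counts has an arbitrary base order (only its composition matches structure A).

```python
def gc_fraction(seq):
    """Proportion of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, width, step=1):
    """G + C fraction for each window of `width` bases, moving by `step`."""
    return [gc_fraction(seq[i:i + width])
            for i in range(0, len(seq) - width + 1, step)]


# Composition reported for structure A: 23 Ts, 27 As, 26 Cs, and 17 Gs.
# The base ORDER below is arbitrary; only the counts come from the text.
composition_only = "T" * 23 + "A" * 27 + "C" * 26 + "G" * 17
print(len(composition_only), round(gc_fraction(composition_only), 3))

# Sliding-window example on a toy sequence (hypothetical).
print(sliding_gc("GGCCAATT", width=4, step=2))
```

The reported counts sum to 93 bases and give a G + C fraction of 43/93, i.e., about 46%, consistent with the absence of a marked G + C bias.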
Comparison between Structure A and Mitochondrial
tRNA Genes
A good way to evaluate the robustness of the secondary structure A prediction is to compare its folding potential
with those obtained for mtDNA fragments known to form
functional secondary structures. Previously, Clote et al.
(2005), using a database with 530 tRNAs, showed that these
structural sequences display low Z scores (mean Z score of
−1.59) compared with random sequences of the same dinucleotide frequency, in accordance with their known folding potential. Therefore, folding energies were computed
for all mitochondrial tRNA genes. Normalized free energies
(free energy per base) for each tRNA gene and for structure
A were calculated to avoid biases due to the correlation between free energy variation and structure length (supplementary fig. S3, Supplementary Material online). The
results showed that only the gene for the cysteine tRNA
had a lower free energy per base than structure A,
−0.135 and −0.120, respectively (supplementary table
S3, Supplementary Material online).
Z scores and P values were calculated for 1,000 random sequences with the same mono- and dinucleotide composition as tRNA genes. Values obtained for structure A
were compared with the ones calculated for mitochondrial
tRNA genes (supplementary fig. S11, Supplementary
Material online).
Structure A Is One of the Mitochondrial Regions with
Higher Folding Potential
A complete scan of the mitochondrial genome was
performed to determine how commonly 93-bp segments
(i.e., of the length of structure A) have a higher folding potential. Among nearly 16,500 windows analyzed for minimum folding energies, only 22 regions were identified with
a folding potential higher than that observed for structure A
(supplementary table S4 and fig. S12, Supplementary Material online). The frequency of 93-bp windows according
to their folding energy is represented in figure 4. The window representing structure A is at one extreme tail of the
distribution (P value = 0.029).
FIG. 4.—Distribution of folding energy values calculated for all 93-bp
windows spanning the complete human mitochondrial genome. Folding
energy for structure A is indicated by an arrow.
As expected, most of the 22 regions detected belong to
rRNA and tRNA domains. In agreement with the previously calculated free energy per base for tRNA genes (supplementary table S4, Supplementary Material online), the
region including the cysteine tRNA gene presented a lower
folding energy than structure A. Other mtDNA segments
with tRNA genes were also detected as having lower folding energy than structure A (e.g., tRNA–Ser/tRNA–Asp),
but resulted from the overlap of tRNA genes in the same
93-bp scanning window.
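The window scan can be sketched generically as below. The scoring function is a deliberately crude, hypothetical stand-in (it is not a thermodynamic model and not the folding software used in this study); any callable returning a folding energy for a window could be passed instead. For a 16,569-bp genome and 93-bp windows, a linear scan yields 16,477 windows and a circular scan 16,569, consistent with the "nearly 16,500" windows mentioned above.

```python
def scan_windows(genome, width, score, circular=False):
    """Score every window of `width` bases; optionally wrap a circular genome."""
    n = len(genome)
    if circular:
        extended = genome + genome[:width - 1]
        starts = range(n)
    else:
        extended = genome
        starts = range(n - width + 1)
    return [score(extended[i:i + width]) for i in starts]


def lower_tail_fraction(scores, target):
    """Fraction of windows whose score is at least as low as `target`."""
    return sum(s <= target for s in scores) / len(scores)


def toy_energy(window):
    """Hypothetical stand-in for a folding-energy call; NOT a real model."""
    return -float(window.count("G") + window.count("C"))


genome = "ACGTTGCAAGGCCTTAACGT" * 3   # toy sequence, not mtDNA
scores = scan_windows(genome, width=8, score=toy_energy)
print(len(scores), lower_tail_fraction(scores, min(scores)))
```

The position of a given window in the empirical distribution (the lower-tail fraction) is the quantity that places structure A at one extreme of figure 4.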
The structure A folding energy is even lower than that
predicted for most rRNA regions (supplementary fig. S12,
Supplementary Material online). Two control region 93-bp
windows presented folding energies as low as structure A,
but each one includes two structures (I, J and C, D) that
independently have a lower folding potential. Additionally,
10 regions with high folding potential were detected inside
protein-coding genes. Nevertheless, as can be observed in
supplementary figure S13 (Supplementary Material online),
the folding energy associated with these regions is the result
of different secondary structures (most of them shorter than
20 bp) that together contribute to the overall folding potential of the segment. In any case, three larger structures of
more than 60 bp were observed in the ND1, ND4L, and
ND5 genes. It cannot be excluded that some human
mitochondrial–encoded mRNA might present a substantial folding potential, as previously noticed for yeast
(Saccharomyces cerevisiae) and fungal (Podospora anserina) mitochondrial mRNAs (Katz and Burge 2003).
Discussion
The analysis of mtDNA has become established as
a powerful tool in population genetics, forensics, systematics, and evolutionary studies. Most investigations are based
only on the primary sequence with little attention paid to
other possible informative levels such as secondary or tertiary conformations. A region particularly predisposed to
the formation of such structures is the mtDNA control region, due to the occurrence of extensive single-stranded
DNA stretches during mtDNA replication or in the
three-stranded D-loop structure (fig. 1).
In this work, we have combined recent developments
in DNA folding prediction with phylogenetic information
from a human mtDNA tree reconstructed from 2,196 sequences in order to identify and characterize all human
control region secondary structures. When considering
complementary DNA strands, cruciforms or dual hairpin–
folded structures are also possible. Indeed, this might be the
case for the mtDNA control region where highly stable
structures were found in complementary regions of both L
and H strands (table 1). In any case, it is important to mention
that the stability of these structures can always be increased
through the interaction of specific proteins.
The clearest evidence supporting the in vivo formation
of a secondary structure was obtained for structure A, predicted for the control region 5′ domain, 93 nucleotides long
and with 60 paired bases (fig. 3). The evidence comprises:
a higher proportion of polymorphic positions observed in
unpaired nucleotides when compared with stem regions,
the low density of mutations in the human mitochondrial
tree, significantly negative Tajima’s D values, the existence
of CBCs among primates, and a lower proportion of base
pair stack losses in primate species.
Negative Z scores were obtained from 1,000 random
sequences with the same mono- and dinucleotide composition of structure A, indicating that this structure has a more
negative folding energy than the average of the random sequences
(table 2). Computed P values of Z scores showed
a significant difference between structure A folding energy
and that of monoshuffled random sequences (P value =
0.041). However, a higher P value was calculated from
the set of random sequences with the same dinucleotide
composition (P value = 0.068). With respect to this,
Workman and Krogh (1999) showed that the predicted free
energy of 46 mRNAs was not significantly different from
random sequences with the same dinucleotide distribution.
This hypothesis was supported by P values substantially
higher than the one calculated for structure A—only 6 of
46 mRNAs analyzed in that study presented lower P values
than structure A (Workman and Krogh 1999). Higher P values were also observed for 18 of 22 mitochondrial tRNA
genes analyzed in this study (supplementary table S3
and fig. S11, Supplementary Material online). Additionally,
the P value of the 186-bp fragment, including structure A
and a flanking region (P value = 0.255), was higher than
the one calculated for structure A alone (P value = 0.062),
demonstrating that the folding potential of the 186-bp control region fragment depends almost wholly on the formation of structure A and that the dinucleotide composition
per se is insufficient to explain the formation of such a large
structure. By using a larger fragment, we can observe that
the region where structure A is formed has a folding potential that does not depend on the background control region
dinucleotide composition.
Furthermore, structure A presented a lower folding energy than most mtDNA tRNA genes and rRNA sequence
domains, standing out as one of the regions with the most
negative folding energy observed in a sliding window scan
across the complete mitochondrial genome (supplementary
table S4 and fig. S12, Supplementary Material online).
The putative formation of a shorter hairpin structure
with 83 nt and 34 paired bases, partially overlapping structures A and B at the control region 5′ end, has been previously reported (Brown et al. 1986). However, a
considerably less negative folding energy was associated with
the formation of that structure (ΔG = −4.4)
when compared with structure A (ΔG = −11.17). A possible explanation for this discrepancy could be the use
in this work of the most recent developments in programming algorithms for DNA folding calculations that have
considerably improved the quality of folding predictions
(SantaLucia 1998; Markham and Zuker 2005).
The various criteria used to detect the putative in vivo
formation of control region secondary structures revealed
that, with the exception of structure A, there was insufficient evidence to consider the remaining structures as
biologically significant (supplementary table S5, Supplementary Material online). Nevertheless, the possibility that
some minor secondary structures do actually exist in vivo
should not be completely ruled out. It is probable that statistically significant results were not reached for some structures simply due to their shorter length. It is interesting to
note that five of six H-strand synthesis initiation sites (OH)
are found on J, K, and two minor secondary structures
(Chang and Clayton 1985) and that structure M overlaps
part of the light-strand promoter which includes the starting
site of L-strand transcription (supplementary fig. S2, Supplementary Material online).
Possible Roles of Structure A in mtDNA Replication and
Transcription
Two conflicting models have been put forward to explain the mtDNA replication mechanism. According to the
classic strand-asynchronous mechanism, replication of the
H strand initiates unidirectionally at different nucleotide positions (OH) mapped across the control region 3′ domain
(Shadel and Clayton 1997). It was demonstrated that synthesis of most of these newly replicative chains is prematurely arrested at a single trinucleotide stop point, near the
control region 5′ end (Doda et al. 1981), forming a three-stranded DNA structure known as D loop or displacement
loop (fig. 1). The presence of structure A near the control
region 5′ end, enclosing the D-loop termination stop point,
and structure A’s predicted 2-dimensional conformation,
suggest that this structure may act as a structural barrier
for the replicative enzymatic assembly or as a recognition
site to molecules involved in the premature arrest of H-strand elongation and D-loop formation (fig. 3). However,
because recent experiments have led to the suggestion of an
alternative bidirectional strand-coupled model postulating
that replication initiates bidirectionally in a sequence cluster
located at the control region 5′ domain (Yasukawa et al.
2005), a possible role for structure A in replication initiation
events should not be ruled out.
An additional functional role that could be attributed
to structure A regards the processing of polycistronic RNA
precursors to produce mature transcripts (fig. 1). It has been
proposed that tRNA secondary structures provide the punctuation marks for the correct mRNA processing (Ojala et al.
1980). The observation that structure A can be folded into
a stable tRNA-like cloverleaf conformation on the nascent
mRNA derived from the L strand (ΔG = −26.23) suggests
that it may act as a punctuation mark for the processing of
the flanking proline tRNA gene (supplementary fig. S4,
Supplementary Material online).
Although the precise location and mechanism for H- and
L-strand transcription termination is unknown, a relatively
high number of RNAs possessing stable poly(A) tails were
found to end at several control region sites (Slomovic et al.
2005). Strikingly, structure A is located in a hot spot of polyadenylated sites corresponding to the 3′ end of H-strand transcripts (Slomovic et al. 2005). This observation suggests that
the formation of such a large secondary structure in the DNA
template during transcription may induce the transcription
termination of several full-length H-strand mRNAs.
In any case, the nonsignificant difference between the
proportion of polymorphic positions in unpaired and paired
nucleotides observed for the mRNA structure A reinforces the
idea of a stronger selective pressure at the DNA level (a significant difference was found for the mtDNA structure A).
MtDNA Control Region Mutational Heterogeneity: the
Effect of Selection on a DNA Secondary Structure
The observed heterogeneity in the rate of substitution
among control region sites could be readily explained by
the existence of conserved regulatory elements and
protein-binding sites. Nevertheless, in a large number of
different control region stretches no regulatory element
capable of explaining the observed mutation variability
has so far been identified. Whereas in protein-coding genes,
the observed heterogeneity could be explained by selective
constraints acting on translated products, in noncoding segments, it could only be explained by structural or other
functional roles for the DNA molecule itself. The formation
of thermodynamically stable secondary structures is one
important factor in this perspective. Indeed, because mutational events are more likely to occur during replication in
single-stranded DNA due to, for instance, damage from oxygen radicals, one should expect that paired bases in a secondary structure (e.g., in stem regions) are better protected
from alteration than those in unpaired regions (e.g., in
loops) (Wright 2000; Wright et al. 2002; Hoede et al. 2006).
To investigate if the observed mutational heterogeneity documented for the human HVR-I is in any way related
with structure A, we have analyzed the values of observed
mutations in a phylogenetic tree containing 2,196 human
mtDNA sequences with more than 10,000 independent mutations (figs. 2 and 3). This method provides a well-defined
tree structure by using complete sequences and overcomes
major drawbacks of previous methods for the estimation of
mutation rates which lacked a phylogenetic framework
(Meyer et al. 1999; Pesole and Saccone 2001; for a review
of pitfalls, see Bandelt et al. 2006).
It is clear from the analysis of figure 2 that the region
where structure A was predicted to form has a much lower
density of mutations than its flanking region. Additionally,
the relative difference in these densities increases at greater
time depths, strongly suggesting a differential pressure of purifying selection on the structure and its 3′-flanking region.
Moreover, the number of bases with zero mutation occurrences in the tree is much higher in stem regions (45%)
than in the unpaired stretches (20%). The mutational conservation is more pronounced in the two peripheral stem
domains, whereas positions with a higher mutation rate
are predominantly found in loop regions (fig. 3). These results indicate that, at least for some control region stretches,
the combination of selective pressures and the protective
effect of double-stranded DNA in secondary conformations
can significantly contribute to explaining the observed mutation rate heterogeneity.
Secondary Structure A Encloses the Hot Spot for the 3′
End of mtDNA Deletions
Deletions in the human mtDNA are responsible for
a number of genetic diseases ranging from mild myopathies
to severe multisystem disorders (Brockington et al. 1993;
Chinnery et al. 2000). It is commonly accepted that accumulation of mtDNA deletions is usually found associated
with the presence of short direct repeats in the flanking area
of the deletion break point (Mita et al. 1990; Samuels et al.
2004; Yui and Matsuura 2006). Most human mtDNA deletions are believed to occur during mtDNA replication as
a result of DNA replication slippage events with slipped
strand mispairing involving two 13-bp direct repeats
(Samuels et al. 2004). As a result, the distribution of the
5′ and 3′ ends of most mtDNA deletions matches the location
of those two 13-bp repeats. However, for most deletions
with no flanking repeats associated, and even some with
flanking repetitive regions, the 3′-end distribution does not
correspond with any of those 13-bp repeats. Interestingly,
we identified an almost perfect match between the peak in
the distribution for the 3′ ends (at positions 16067–16078)
and the control region stretch where structure A was predicted to be formed (at positions 16028–16120; fig. 3).
The data set of Samuels et al. (2004) included 111 reported
unique deletions with a 3′ end in the control region. Of
these 111 deletions, 108 had their 3′ end located in structure
A (two-tailed P value < 0.0001). Furthermore, these deletion end points were very highly localized to a specific region within structure A, the loop from 16067–16074 (fig.
3), with 76 of the deletion ends occurring within this loop
(two-tailed P value < 0.0001).
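To give a sense of how extreme this localization is, the sketch below evaluates an exact binomial upper tail. It should be read with caution: the text does not state which test produced the reported P values, so the null model used here is purely an assumption (3′ ends placed uniformly along a control region taken to be 1,122 bp long, positions 16024–576), and the calculation is one-tailed rather than two-tailed. The function name is likewise hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


# ASSUMED null model: each 3' end falls uniformly within a 1,122-bp control
# region, so the chance of landing in the 93-bp structure A is 93/1122.
p_null = 93 / 1122

# Counts from the text: 108 of 111 deletions end inside structure A.
tail = binomial_upper_tail(108, 111, p_null)
print(tail < 1e-4)
```

Under this simplified null the tail probability is far below 0.0001, which agrees in direction with the significance reported above, although it is not a reproduction of the authors' test.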
In view of the fact that this correspondence is highly
significant for deletions without involvement of a direct repeat mechanism, it is clearly possible that other factors,
such as the formation of secondary structures, may trigger
the formation of some mtDNA deletions. Presumably, the
formation of a large secondary structure at the leading
strand of the replication fork may cause the transient dissociation of the replicative enzymatic assembly, therefore
increasing the likelihood of the slipped strand mispairing
events that are a well-known cause of deletions.
Concluding Remarks
Genetic evidence for both negative (less variation in
structure-relevant stems) and positive (compensatory
changes in stem regions) selection on the human mtDNA
control region emphasizes the important role that stable secondary structures may play in mitochondrial genome evolution.
The distribution of such structures along both control region peripheral domains may well be the explanation for
the persistence of such a large noncoding segment in a genome governed by a strong evolutionary pressure for removing dispensable sequences.
A correct prediction of DNA secondary structures is of
crucial importance for the development of more efficient
and directed experimental research aimed at a better understanding of the complex protein–nucleic acid interactions
involved in mtDNA replication and transcription, regulated
by nuclear-encoded factors. More detailed investigation
should also provide valuable insights toward the understanding of the putative role of secondary structures in
the generation of mtDNA deletions. A better knowledge
of the generation of diversity and its heterogeneity across
control region sequences is also of great importance for molecular evolutionary and phylogenetic research.
Supplementary Material
Supplementary tables S1–S5 and figures S1–S13 are
available at Molecular Biology and Evolution online
(http://www.mbe.oxfordjournals.org/).
Acknowledgments
This work was partially supported by a research grant
to F.P. (SFRH/BD/19585/2004) from Fundação para
a Ciência e a Tecnologia. P.S. was supported by a Marie
Curie Early Stage Training Grant. Instituto de Patologia e
Imunologia Molecular da Universidade do Porto is partially supported by ‘‘Programa Operacional Ciência e
Inovação 2010’’ (POCI 2010), VI Programa Quadro
(2002–2006).
Literature Cited
Altschul SF, Erickson BW. 1985. Significance of nucleotide
sequence alignments: a method for random sequence
permutation that preserves dinucleotide and codon usage.
Mol Biol Evol. 2:526–538.
Anderson S, Bankier AT, Barrell BG, et al. (14 co-authors).
1981. Sequence and organization of the human mitochondrial
genome. Nature. 290:457–465.
Andrews RM, Kubacka I, Chinnery PF, Lightowlers R,
Turnbull D, Howell N. 1999. Reanalysis and revision of the
Cambridge reference sequence for human mitochondrial
DNA. Nat Genet. 23:147.
Bandelt HJ, Forster P, Sykes BC, Richards MB. 1995.
Mitochondrial portraits of human populations using median
networks. Genetics. 141:743–753.
Bandelt HJ, Kong QP, Richards M, Macaulay V. 2006.
Estimation of mutation rates and coalescence times: some
caveats. In: Bandelt HJ, Macaulay V, Richards M, editors.
Human mitochondrial DNA and the evolution of Homo
sapiens, Series: Nucleic Acids and Molecular Biology vol. 18.
Berlin: Springer-Verlag. p. 47–92.
Brockington M, Sweeney MG, Hammans SR, Morgan-Hughes JA, Harding AE. 1993. A tandem duplication in the
D-loop of human mitochondrial DNA is associated with
deletions in mitochondrial myopathies. Nat Genet. 4:67–71.
Brown GG, Gadaleta G, Pepe G, Saccone C, Sbisa E. 1986.
Structural conservation and variation in the D-loop-containing
region of vertebrate mitochondrial DNA. J Mol Biol.
192:503–511.
Chang DD, Clayton DA. 1985. Priming of human mitochondrial
DNA replication occurs at the light-strand promoter. Proc Natl
Acad Sci USA. 82:351–355.
Cheung AK. 2004. Palindrome regeneration by template strand-switching mechanism at the origin of DNA replication of
porcine circovirus via the rolling-circle melting-pot replication model. J Virol. 78:9016–9029.
Chinnery PF, Johnson MA, Wardell TM, Singh-Kler R, Hayes C,
Brown DT, Taylor RW, Bindoff LA, Turnbull DM. 2000. The
epidemiology of pathogenic mitochondrial DNA mutations.
Ann Neurol. 48:188–193.
Clote P, Ferre F, Kranakis E, Krizanc D. 2005. Structural RNA
has lower folding energy than random RNA of the same
dinucleotide frequency. RNA. 11:578–591.
Doda JN, Wright CT, Clayton DA. 1981. Elongation of
displacement-loop strands in human and mouse mitochondrial
DNA is arrested near specific template sequences. Proc Natl
Acad Sci USA. 78:6116–6120.
Excoffier L, Yang Z. 1999. Substitution rate variation among
sites in mitochondrial hypervariable region I of humans and
chimpanzees. Mol Biol Evol. 16:1357–1368.
Forsdyke DR. 1995. A stem-loop ‘‘kissing’’ model for the
initiation of recombination and the origin of introns. Mol Biol
Evol. 12:949–958.
Galtier N, Lobry JR. 1997. Relationships between genomic G+C
content, RNA secondary structures, and optimal growth
temperature in prokaryotes. J Mol Evol. 44:632–636.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape split by a molecular clock of mitochondrial DNA. J Mol
Evol. 22:160–174.
Hixson JE, Clayton DA. 1985. Initiation of transcription from
each of the two human mitochondrial promoters requires
unique nucleotides at the transcriptional start sites. Proc Natl
Acad Sci USA. 82:2660–2664.
Hoede C, Denamur E, Tenaillon O. 2006. Selection acts on DNA
secondary structures to decrease transcriptional mutagenesis.
PLoS Genet. 2:e176.
Katz L, Burge CB. 2003. Widespread selection for local RNA
secondary structure in coding regions of bacterial genes.
Genome Res. 13:2042–2051.
Kimura M. 1983. The neutral theory of molecular evolution.
Cambridge: Cambridge University Press.
Markham NR, Zuker M. 2005. DINAMelt web server for nucleic
acid melting prediction. Nucleic Acids Res. 33:W577–W581.
Meyer S, Weiss G, von Haeseler A. 1999. Pattern of nucleotide
substitution and rate heterogeneity in the hypervariable
regions I and II of human mtDNA. Genetics. 152:1103–1110.
Mishmar D, Ruiz-Pesini E, Golik P, et al. (13 co-authors). 2003.
Natural selection shaped regional mtDNA variation in
humans. Proc Natl Acad Sci USA. 100:171–176.
Mita S, Rizzuto R, Moraes CT, Shanske S, Arnaudo E,
Fabrizi GM, Koga Y, DiMauro S, Schon EA. 1990.
Recombination via flanking direct repeats is a major cause
of large-scale deletions of human mitochondrial DNA.
Nucleic Acids Res. 18:561–567.
Montoya J, Christianson T, Levens D, Rabinowitz M, Attardi G.
1982. Identification of initiation sites for heavy-strand and
light-strand transcription in human mitochondrial DNA. Proc
Natl Acad Sci USA. 79:7195–7199.
Ojala D, Merkel C, Gelfand R, Attardi G. 1980. The tRNA genes
punctuate the reading of genetic information in human
mitochondrial DNA. Cell. 22:393–403.
Pesole G, Saccone C. 2001. A novel method for estimating
substitution rate variation among sites in a large dataset of
homologous DNA sequences. Genetics. 157:859–865.
Rambaut A, Grassly NC. 1997. Seq-Gen: an application for the
Monte Carlo simulation of DNA sequence evolution along
phylogenetic trees. Comput Appl Biosci. 13:235–238.
Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R. 2003.
DnaSP, DNA polymorphism analyses by the coalescent and
other methods. Bioinformatics. 19:2496–2497.
Samuels DC, Schon EA, Chinnery PF. 2004. Two direct repeats
cause most human mtDNA deletions. Trends Genet. 20:393–398.
SantaLucia J Jr. 1998. A unified view of polymer, dumbbell, and
oligonucleotide DNA nearest-neighbor thermodynamics. Proc
Natl Acad Sci USA. 95:1460–1465.
Sbisa E, Tanzariello F, Reyes A, Pesole G, Saccone C. 1997.
Mammalian mitochondrial D-loop region structural analysis:
identification of new conserved sequences and their functional
and evolutionary implications. Gene. 205:125–140.
Seffens W, Digby D. 1999. mRNAs have greater negative
folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res. 27:1578–1584.
Shadel GS, Clayton DA. 1997. Mitochondrial DNA maintenance
in vertebrates. Annu Rev Biochem. 66:409–435.
Slomovic S, Laufer D, Geiger D, Schuster G. 2005.
Polyadenylation and degradation of human mitochondrial RNA:
the prokaryotic past leaves its mark. Mol Cell Biol.
25:6427–6435.
Spelbrink JN. 2003. Replication, repair, and recombination
of mitochondrial DNA. In: Holt I, editor. Genetics of
mitochondrial diseases. Oxford: Oxford Medical Publications.
p. 3–25.
Stothard P. 2000. The sequence manipulation suite: JavaScript
programs for analyzing and formatting protein and DNA
sequences. Biotechniques. 28:1102–1104.
Taanman JW. 1999. The mitochondrial genome: structure,
transcription, translation and replication. Biochim Biophys
Acta. 1410:103–123.
Tajima F. 1989. Statistical method for testing the neutral
mutation hypothesis by DNA polymorphism. Genetics. 123:
585–595.
Walberg MW, Clayton DA. 1981. Sequence and properties of the
human KB cell and mouse L cell D-loop regions of
mitochondrial DNA. Nucleic Acids Res. 9:5411–5421.
Washietl S, Hofacker IL. 2004. Consensus folding of aligned
sequences as a new measure for the detection of functional
RNAs by comparative genomics. J Mol Biol. 342:19–30.
Workman C, Krogh A. 1999. No evidence that mRNAs have
lower folding free energies than random sequences with the
same dinucleotide distribution. Nucleic Acids Res. 27:
4816–4822.
Wright BE. 2000. A biochemical mechanism for nonrandom
mutations and evolution. J Bacteriol. 182:2993–3001.
Wright BE, Reimers JM, Schmidt KH, Reschke DK. 2002.
Hypermutable bases in the p53 cancer gene are at vulnerable
positions in DNA secondary structures. Cancer Res.
62:5641–5644.
Wright BE, Reschke DK, Schmidt KH, Reimers JM, Knight W.
2003. Predicting mutation frequencies in stem-loop structures
of derepressed genes: implications for evolution. Mol Microbiol. 48:429–441.
Yasukawa T, Yang MY, Jacobs HT, Holt IJ. 2005.
A bidirectional origin of replication maps to the major
noncoding region of human mitochondrial DNA. Mol Cell.
18:651–662.
Yui R, Matsuura ET. 2006. Detection of deletions flanked by
short direct repeats in mitochondrial DNA of aging Drosophila. Mutat Res. 594:155–161.
Zuker M. 2000. Calculating nucleic acid secondary structure.
Curr Opin Struct Biol. 10:303–310.
Connie Mulligan, Associate Editor
Accepted October 2, 2008