Phylogenetic Network and Physicochemical Properties of Nonsynonymous Mutations in the Protein-Coding Genes of Human Mitochondrial DNA

Jukka S. Moilanen and Kari Majamaa
Biocenter and Department of Neurology, University of Oulu, Oulu, Finland

Theories on molecular evolution predict that phylogenetically recent nonsynonymous mutations should contain more non-neutral amino acid replacements than ancient mutations. We analyzed 840 complete coding-region human mitochondrial DNA (mtDNA) sequences for nonsynonymous mutations and evaluated the mutations in terms of the physicochemical properties of the amino acids involved. We identified 465 distinct missense and 6 nonsense mutations. 48% of the amino acid replacements changed polarity, 26% size, 8% charge, 32% aliphaticity, 13% aromaticity, and 44% hydropathy. The reduced-median networks of the amino acid changes revealed relatively few differences between the major continent-specific haplogroups, but a high variation and highly starlike phylogenies within the haplogroups. Some 56% of the mutations were private, and 25% were homoplasic. Nonconservative changes were more common than expected among the private mutations but less common among the homoplasic mutations. The asymptotic maximum of the number of nonsynonymous mutations in European mtDNA was estimated to be 1,081. The results suggested that amino acid replacements in the periphery of phylogenetic networks are more deleterious than those in the central parts, indicating that purifying selection prevents the fixation of some alleles.

Introduction

The human mitochondrial genome (mtDNA) has genes coding for 2 rRNAs, 22 tRNAs, and 13 subunits of the respiratory chain complexes (MTND1, MTND2, MTND3, MTND4, MTND4L, MTND5, MTND6, MTATP6, MTATP8, MTCO1, MTCO2, MTCO3, and MTCYB). The protein-coding genes occupy 68% of the genome, and therefore a random nucleotide substitution has a high probability of being nonsynonymous and of leading to amino acid replacement.
The neutral (Kimura 1968) and the nearly neutral (Ohta 1992) theories of molecular evolution predict that a certain proportion of nonsynonymous mutations will be neutral in effect, whereas the rest will be more or less deleterious. Several studies have demonstrated an excess of nonsynonymous mutations within species as compared with variation between species (Nachman et al. 1996; Rand and Kann 1996; Hasegawa, Cao, and Yang 1998; Nachman 1998; Fry 1999), and this finding has been interpreted as suggesting selection against mildly deleterious mutations, which prevents their fixation. Furthermore, direct measurements of the intergenerational substitution rate in human mtDNA have yielded rates higher than the estimates derived from phylogenetic analyses, suggesting that a significant fraction of mutations is removed by selection (Parsons et al. 1997). The effects of nonsynonymous mutations depend both on the position of the amino acid replacement in the protein sequence and on the physicochemical properties of the amino acids involved. The genetic code appears to have evolved toward minimizing changes in physicochemical properties, which also affect the rate of nonsynonymous substitutions (Xia and Li 1998), suggesting that amino acid replacements resulting in a dissimilar amino acid are generally more deleterious than replacements resulting in an amino acid with similar properties. If the hypothesis of selection against mildly deleterious mutations is correct, phylogenetically recent mutations should contain more deleterious mutations and more dissimilar amino acid replacements than the older ones. On the one hand, there are many examples of pathogenic single-nucleotide mutations in mtDNA. In addition, there is evidence that certain combinations of otherwise harmless polymorphisms in mitochondrial lineages may be associated with susceptibility to complex diseases (Wallace, Brown, and Lott 1999; Chinnery et al. 2000; Ruiz-Pesini et al. 
2000), or with successful aging (De Benedictis et al. 1999). Their effect is most likely due to changes in the amino acid sequences of the protein-coding genes. On the other hand, several studies have failed to make the distinction between a pathogenic mutation and a haplotype-associated neutral polymorphism (Herrnstadt et al. 2002a). For these reasons, knowledge of the nature and phylogenetic relationships of amino acid haplotypes in the human mitochondrial genome is also important in clinical practice.

Although the number of complete mtDNA sequences available has grown exponentially (Finnilä et al. 2000; Ingman et al. 2000; Elson et al. 2001; Finnilä, Lehtonen, and Majamaa 2001; Maca-Meyer et al. 2001; Herrnstadt et al. 2002a), marking the start of mitochondrial population genomics (Hedges 2000), the functional consequences of the numerous variations in these sequences have not yet received much attention. We report here on the characterization of the nonsynonymous mutations in 840 complete human mitochondrial coding region sequences in terms of their physicochemical properties, and on the construction of a phylogenetic network for the amino acid sequences of all 13 protein-coding genes. Furthermore, the physicochemical properties of the amino acid replacements were compared according to their positions in the network to assess the hypothesis of selection against mildly deleterious replacements.

Key words: human mitochondrial DNA, molecular evolution, population genetics, amino acid substitution, phylogenetics, neutral theory.
E-mail: [email protected].
Mol. Biol. Evol. 20(8):1195–1210. 2003
DOI: 10.1093/molbev/msg121
© 2003 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Materials and Methods

Alignment of mtDNA Sequences

Human mtDNA sequences were obtained from public sources (table 1).
Table 1
Available Complete Mitochondrial DNA Coding Region Sequences

Id (a)     Identifier (b)                      Population             Notes
Refseq     mitomapRCRS                                                Reference sequence in this study (c)
F1-F192                                        Finnish                Population (Finnilä, Lehtonen and Majamaa 2001)
G1-G33     GenBank AF382013.1-AF381981.1       Diverse                Population (Maca-Meyer et al. 2001)
G34        GenBank NC_001807.4                 African                Latest GenBank reference sequence (d)
G35        GenBank J01415.1                                           Historical reference sequence
G36        GenBank AB055387.1                  Japanese               Cardiomyopathy patient (Shin et al. 2000)
G37-G89    GenBank AF347015.1-AF346963.1       Diverse                Population (Ingman et al. 2000)
G90-G92    GenBank E27671.1-E27669.1           Japanese               Mitochondrial diabetes (e)
G93        GenBank D38112.1                    African                Population (Horai et al. 1995)
G94        GenBank X93334.1                    Swedish                Population (Arnason, Xu and Gullberg 1996)
G95        GenBank V00662.1                                           Historical reference sequence (Anderson et al. 1981)
M1-M560    MtDNA1-mtDNA560                     Diverse from UK, USA   Population and patients (f)

(a) Sequence identifiers used in this study.
(b) Sequence identifiers in public files.
(c) The MITOMAP reference sequence, a modified version of the 2001 Revised CRS (Andrews et al. 1999; available at http://www.mitomap.org).
(d) Identical to G37, has 41 differences relative to mitomapRCRS.
(e) Sequences with 3243A>G, 3423T>G and 3426A>G but otherwise similar to G95.
(f) Includes patients with type 2 diabetes and neurodegenerative disorders (Herrnstadt et al. 2002a; available at http://www.mitokor.com/science/560mtdnas.php). Transitions at nucleotide position 5262 were added to two MitoKor sequences according to the published erratum (Herrnstadt et al. 2002b). M104 is from the CCL2 HeLa cell culture (Herrnstadt et al. 2002c).

Historical or current reference sequences (G34, G35, G95) were excluded from the analyses, and sequences G90–G92 were excluded as they only demonstrate variation at positions 3243, 3423, 3426, and 11447 and are otherwise identical to G95, including its errors.
The CCL2 HeLa sequence (M104) was excluded because of an unusually high rate of divergence (Herrnstadt et al. 2002c). The remaining 840 complete coding region sequences were compared with the MITOMAP reference sequence using the diffseq utility of the EMBOSS software package (Rice, Longden, and Bleasby 2000). The sequences were aligned with the reference sequence, and the nucleotide data of all sequences with position information were stored in a relational SQL database. Comparison of the stored sequences with the original ones by means of the diffseq utility did not reveal any errors. The nucleotide sequences of the protein-coding genes were extracted according to the MITOMAP mtDNA function locations, and noncoding sections between the genes were ignored. The SQL query language and the programming language Perl were used for sequence alignment and subsequence extraction.

Identification of Nonsynonymous Mutations

Amino acid translations of protein-coding genes were obtained by the methods provided by the Bio::PrimarySeqI interface of Bioperl (available at http://www.bioperl.org), and nonsynonymous changes were subsequently identified. The neighboring changes of each mtDNA variant were examined in order to identify multiple nucleotide substitutions within a single codon. Observed nonsense mutations were verified manually from the original DNA sequences. Comparison of the amino acid translations of 325 mtDNA mutations with those found in MITOMAP led to the identification of six discrepancies, whereupon manual examination of the sequences indicated that all were errors in MITOMAP.

Construction of Reduced-Median Networks

Six reduced-median networks (Bandelt et al. 1995) were constructed from all the nonsynonymous mtDNA mutations in the 840 sequences to infer the protein-level phylogeny in African, Asian, and European haplogroup clusters. All the coding region variation (Finnilä, Lehtonen, and Majamaa 2001; Herrnstadt et al.
2002a) was used to assign each sequence to one of the six networks. Sequence assignment was verified by comparing the identified mutations against those displayed in the published networks. Because the actual content of the Finnish and the corrected MitoKor sequences was used, the comparison led to the identification of an unpublished error in the haplogroup H skeleton network (Herrnstadt et al. 2002a), in which two sequences had been marked with the wrong identifiers (45 and 530), which belonged to two haplogroup J sequences. Furthermore, we found that the transition at nucleotide position 14097 in two sequences (F162 and F163) was incorrectly shown as 14096 (Finnilä, Lehtonen, and Majamaa 2001). Ten GenBank sequences could not be unambiguously assigned to any of the major African, Asian, or European haplogroups and were included in the Asian haplogroup cluster network, since they were of Asian or Pacific origin. The sequences were converted to a binary data matrix by considering transitions and transversions as distinct entities (Bandelt et al. 1995). Reduced median networks were constructed from the binary data using Network 2.1 (available at http://fluxus-engineering.com). All binary characters were weighted equally, including transitions and transversions, and the default reduction threshold r = 2 (Bandelt, Macaulay, and Richards 2000) was used in the analysis.

Characterization of Amino Acid Replacements

The amino acids involved in the nonsynonymous mutations were characterized in terms of six physicochemical properties relevant to protein evolution (Xia and Li 1998), namely polarity, size, isoelectric point, aliphatic and aromatic nature, and hydropathy.
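As a rough illustration of this kind of categorization, the following Python sketch classifies a replacement by three of the six properties (hydropathy sign, aliphaticity, aromaticity) and computes a 19-residue sliding-window hydropathy average, using the definitions spelled out in this section. It is not the authors' Perl/EMBOSS pipeline. The function names `changed_properties` and `window_hydropathy` are hypothetical, the hydropathy values are the published Kyte-Doolittle indices, and the choice to leave windows undefined near either end of a protein is an assumption made for this example.

```python
# Kyte-Doolittle hydropathy indices (Kyte and Doolittle 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
ALIPHATIC = set("ILV")   # aliphatic side chains, as defined in the text
AROMATIC = set("FHWY")   # aromatic rings, as defined in the text


def changed_properties(ref, alt):
    """Return the property categories that differ between two residues.

    Hypothetical helper: covers only the three categories whose
    definitions need no table beyond the Kyte-Doolittle scale.
    """
    changes = set()
    if (KD[ref] > 0) != (KD[alt] > 0):          # hydrophobic vs hydrophilic
        changes.add("hydropathy")
    if (ref in ALIPHATIC) != (alt in ALIPHATIC):
        changes.add("aliphaticity")
    if (ref in AROMATIC) != (alt in AROMATIC):
        changes.add("aromaticity")
    return changes


def window_hydropathy(protein, width=19):
    """Mean hydropathy of `width` neighboring residues at each position.

    Positions whose window would run past either end get None
    (an assumption of this sketch).
    """
    half = width // 2
    means = []
    for i in range(len(protein)):
        if i < half or i + half >= len(protein):
            means.append(None)
        else:
            segment = protein[i - half:i + half + 1]
            means.append(sum(KD[aa] for aa in segment) / width)
    return means


if __name__ == "__main__":
    print(changed_properties("L", "Q"))
    print(window_hydropathy("A" * 19)[9])
```

A position would then be assigned to a hydrophobic region when its window mean exceeds the mean over all windows of the same gene.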
We defined amino acids with polarity ≥8.6 (Grantham 1974) as polar, amino acids with a side chain molecular volume ≤61 Å³ (Grantham 1974) as small, amino acids with an isoelectric point ≥7.59 as positively charged, amino acids with an isoelectric point ≤3.22 as negatively charged (Alff-Steinberger 1969), amino acids with aliphatic side chains (I, L, and V) as aliphatic, amino acids with aromatic rings (F, H, W, and Y) as aromatic, and amino acids with a negative Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982) as hydrophilic, and those with a positive index as hydrophobic. Amino acid replacements were then assigned to categories according to changes in these physicochemical properties. Furthermore, each replacement was defined as conservative or nonconservative according to the BLOSUM62 matrix used for sequence comparisons (Henikoff and Henikoff 1992), nonconservative replacements having a negative value in the matrix (Cargill et al. 1999).

The distribution of mutations within genes was assessed by identifying hydrophobic and hydrophilic regions of genes. These regions were defined by comparing the average hydropathy of each 19-amino acid segment to the mean of all segments for the respective gene. The average Kyte-Doolittle hydropathy index of 19 neighboring amino acids was calculated for each amino acid position according to the MITOMAP reference sequence and by reference to the pepinfo utility of the EMBOSS package. We used this segment size because it has been shown to be a good value for identifying transmembrane regions (Kyte and Doolittle 1982).

Contingency Table Analysis

The nonsynonymous mutations were counted as differences relative to the reference sequence and without correcting for multiple hits; that is, each mutation was counted once regardless of the number of its occurrences in the networks.
This approach results in an underestimate of the true number of mutations that have occurred during human mtDNA evolution, but despite this disadvantage, the method was used here to avoid the confounding effect of the expected high degree of homoplasy. Private mutations occupying the peripheral tips of phylogeny were inferred from alleles that were present in only one sequence, whereas homoplasic mutations were inferred from the presence of a mutation in >1 lineage in the networks. Since each mutation was counted only once, it was possible to classify each amino acid replacement at a given sequence position unambiguously as private/nonprivate and homoplasic/nonhomoplasic. Alternatively, it could have been possible to infer each occurrence of a homoplasic mutation from the phylogeny and to count the occurrences separately, but with this approach the frequencies of mutation categories among homoplasic mutations would have been inflated by the subset of mutations that were highly homoplasic, and would also have depended on the method and parameters used in the phylogenetic reconstruction.

The frequencies of the mutation categories among private amino acid replacements, homoplasic replacements, and replacements in hydrophobic regions were compared with those among the remaining ones using Fisher's exact test as implemented in R 1.4.1 (Ihaka and Gentleman 1996; available at http://cran.r-project.org/), which computes the exact value of P and the conditional maximum likelihood estimate of the odds ratio. This test was used because small cell frequencies were expected, and the two-tailed test was used because no particular direction of differences was assumed a priori. Sample estimates of the odds ratio were similar to the reported conditional maximum likelihood estimates, as differences were observed only in the second to fourth decimal positions.
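The comparison just described can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the R `fisher.test` routine used in the study: it returns the sample odds ratio rather than the conditional maximum likelihood estimate, and the function names are hypothetical. The example counts are the nonconservative/conservative split among private and nonprivate replacements reported in table 4. The same sketch evaluates the adjusted significance level for multiple comparisons reported with table 4 (P < 0.00223), assuming n = 23 comparisons; that count is inferred here from the number of rows in table 4 and is not stated in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, P). P sums the hypergeometric probabilities
    of all tables with the observed margins that are no more probable than
    the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )
    return (a * d) / (b * c), min(p_value, 1.0)


def adjusted_alpha(alpha, n):
    """Solve 1 - (1 - alpha_c)**n = alpha for alpha_c."""
    return 1 - (1 - alpha) ** (1 / n)


if __name__ == "__main__":
    # Private: 88 nonconservative of 261; nonprivate: 45 of 207 (table 4).
    odds, p = fisher_exact_two_sided(88, 261 - 88, 45, 207 - 45)
    print(f"sample OR = {odds:.2f}, two-sided P = {p:.4f}")
    # n = 23 is an assumption inferred from the row count of table 4.
    print(f"adjusted alpha = {adjusted_alpha(0.05, 23):.5f}")
```

Exact integer binomial coefficients are used so that the large margins of this data set do not cause overflow before the final division.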
Inflated type I error rate due to multiple comparisons was assessed by obtaining the adjusted significance level ($\alpha_c$) from $1 - (1 - \alpha_c)^n = 0.05$, where n is the number of comparisons and 0.05 is the significance level corresponding to the 95% confidence limit.

Rate of Detection of New Mutations in European Sequences

An estimate for the cumulative rate of discovery of new nonsynonymous mutations in the 647 European sequences was derived by taking 500 random permutations and examining the sequences contained in each consecutively, calculating for each sequence the cumulative sum of mutations that had not occurred in the previous sequences. The sequences were sampled without replacement. The arithmetic mean of the cumulative sums of the 500 permutations was plotted, and statistical models having an asymptotic maximum were fitted to this mean curve by the nonlinear least squares method to provide an estimate of the total number of nonsynonymous mutations in European mtDNA and to predict the number of sequences required for identifying most of the mutations.

Results

Mutations in the Protein-Coding Genes of mtDNA

A total of 988 synonymous, 465 nonsynonymous missense, and 6 nonsense mutations were identified in the protein-coding genes of 840 complete coding region mtDNA sequences, when the mutations were counted as differences relative to the reference sequence. One-third (32%) of all mutations were nonsynonymous (table 2). MTATP6 and MTATP8 had the highest proportion of nonsynonymous mutations among all mutations (52.5% and 51.3%, respectively), whereas MTND4L, MTCO2, and MTND3 had the lowest (19.2%, 22.2%, and 22.2%, respectively). Several sequences were detected in which two mutations co-occurred in one codon, including 4769A>G and 4767A>G in sequences M175, M222, M385, and M409; 8574C>T and 8572G>A in M533; 8703C>T and 8701A>G in G42; 10400C>T and 10398A>G in 53 sequences; and 14767T>C and 14766C>T in M455.
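Whether two nucleotide positions fall in the same codon can be checked with simple integer arithmetic, as in the sketch below. The gene start coordinates are an assumption: they are taken from the commonly used revised Cambridge Reference Sequence annotation and are not given in the text, and `codon_index` and `same_codon` are hypothetical helper names. The sketch ignores the overlapping reading frames of adjacent genes.

```python
# First nucleotide of each gene.
# ASSUMPTION: coordinates follow the commonly used rCRS annotation;
# they are not stated in the article itself.
GENE_START = {
    "MTND2": 4470,
    "MTATP6": 8527,
    "MTND3": 10059,
    "MTCYB": 14747,
}


def codon_index(gene, position):
    """1-based codon number of a nucleotide position within a gene."""
    return (position - GENE_START[gene]) // 3 + 1


def same_codon(gene, pos_a, pos_b):
    """True if both nucleotide positions belong to the same codon."""
    return codon_index(gene, pos_a) == codon_index(gene, pos_b)


if __name__ == "__main__":
    for gene, pos_a, pos_b in [
        ("MTND2", 4767, 4769),
        ("MTATP6", 8701, 8703),
        ("MTND3", 10398, 10400),
        ("MTCYB", 14766, 14767),
    ]:
        print(gene, pos_a, pos_b, codon_index(gene, pos_a),
              same_codon(gene, pos_a, pos_b))
```

Under these assumed coordinates, positions 10398, 8701, and 14766 map to codons 114, 59, and 7, which agrees with the replacement labels MTND3:T114A, MTATP6:T59A, and MTCYB:T7I used in the Results.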
All these pairs consisted of a nonsynonymous and a synonymous mutation. Furthermore, each amino acid replacement resulted from a specific nucleotide substitution, as we found no instances where two different nonsynonymous mutations had caused an identical amino acid change. The most common amino acid replacement was an A-T change in either direction, followed in decreasing frequency by I-V, I-T, and F-L (fig. 1).

Table 2
Synonymous and Nonsynonymous Mutations in the 840 mtDNA Sequences, by Genes

Gene      Length (a)   Synonymous (b)   Nonsynonymous (c)   Total
MTND1        956            78                38             116
MTND2      1,042            91                36             127
MTCO1      1,542           122                40             162
MTCO2        684            63                18              81
MTATP8       207            19                20              39
MTATP6       681            57                63             120
MTCO3        784            67                32              99
MTND3        346            35                10              45
MTND4L       297            21                 5              26
MTND4      1,378           124                38             162
MTND5      1,812           158                75             233
MTND6        525            54                29              83
MTCYB      1,141           105                70             175
Total     11,341           988               471           1,459

NOTE. MTND, NADH dehydrogenase; MTCO, cytochrome c oxidase; MTATP, ATP synthase; MTCYB, cytochrome b.
(a) Gene length in nucleotides.
(b) Number of synonymous mutations.
(c) Number of nonsynonymous mutations.
Mutations were counted as differences relative to the reference sequence and without correcting for multiple hits. Sums over all genes do not equal the totals due to overlapping regions between MTATP6 and MTATP8, MTATP6 and MTCO3, and MTND4 and MTND4L.

FIG. 1. Matrix of amino acid replacements for the 840 mtDNA sequences. The area of each circle is proportional to the frequency of distinct replacements between the respective amino acids. For reference, the number of replacements between T and A is 98, and that between K and N is 1.

Nonsense Mutations

Two mutations in the initiator codon of the MTND1 gene were identified. 3308T>C was present in 10 sequences (G20, G66, M158, M165, M192, M215, M293, M379, M386, and M514). This mutation has been identified in the chimpanzee (Arnason, Xu, and Gullberg 1996), but in humans it was originally reported in a patient with bilateral striatal necrosis and MELAS (Campos et al. 1997). However, experimental data have suggested that 3308T>C does not affect the synthesis of the MTND1 polypeptide and that any methionine codon close to the 5′ end of a mitochondrial mRNA may serve as a translation initiator (Fernandez-Moreno et al. 2000). Our phylogenetic analysis indicated that the mutation represents a polymorphism in the African haplogroup L1b, as suggested previously (Rocha et al. 1999), but it was also present in another branch of the African network, indicating that it has arisen more than once. 3308T>A, resulting in a codon for lysine, was found in the sequence M170, which also harbored 3312dupC, a single-nucleotide duplication in the second codon. The sequence of the first three codons was therefore AAACCCCATG instead of ATACCCATG. A third initiator codon mutation was observed in MTND5, where 12338T>C (M339) led to a codon for threonine. Methionine occupies position 3 in MTND1 and MTND5 and probably serves as a translation initiator in the presence of [3308T>A; 3312dupC] and 12338T>C.

FIG. 2. Collapsed network of the continent-specific major haplogroup clusters. The central haplotype from each cluster is shown. Each dashed rectangle indicates the figure containing the expanded network for the respective cluster. The mutations are shown as amino acid changes relative to the MITOMAP reference sequence (refseq). Outgroup, sequence G37. +, a homoplasic mutation.

FIG. 3. Reduced median network of nonsynonymous mutations in Asian haplogroup clusters. The mutations are shown as amino acid changes relative to the MITOMAP reference sequence (refseq). Outgroup, sequence G37. Squares, links to the networks of other haplogroup clusters. @, a back mutation; +, a homoplasic mutation; CM, cardiomyopathy. The weights of all the characters in the analysis were equal. Some branch lengths have been distorted to increase legibility.
Sequence identifiers are shown inside the nodes. F, Finnish sequences; M, MitoKor sequences. The origin of each GenBank sequence, denoted with the letter G, is given next to the sequence identifier. PNG, Papua-New Guinea.

Eight sequences (F45, F46, F47, F48, F49, G46, M6, and M426) harbored 7444G>A in the stop codon of MTCO1, leading to the translation of KQK, which has been suggested to increase the penetrance of primary mutations in Leber's hereditary optic neuropathy (Brown et al. 1995). A single-nucleotide deletion 6577delG in the middle of MTCO1 in G36 led to G225E and caused a premature termination of translation with an open reading frame for 28 amino acids (EETPFYTNTYSDFSVTLKFMFLSYQASE). The sequence G36 also harbored 12192G>A, which has been reported to be associated with cardiomyopathy (Shin et al. 2000; MIM 590040), although this variant is a polymorphism in the Finnish population (Finnilä, Lehtonen, and Majamaa 2001). Assuming that the frameshift mutation 6577delG (G225fsX28) is not an error in the published sequence, the mutation might provide an alternative explanation for cardiomyopathy in G36.

FIG. 4. Reduced median network of nonsynonymous mutations in the African haplogroup cluster. See the legend of figure 3 for explanation of symbols.

FIG. 5. Reduced median network of nonsynonymous mutations in the European haplogroup cluster IWX. See the legend of figure 3 for explanation of symbols.

Reduced-Median Networks of Nonsynonymous Mutations

Reduced-median networks of Asian and African haplogroups and the European haplogroup clusters IWX, KU, JT, and HV were constructed using information on all nonsynonymous mutations in the 840 sequences and by placing the African sequence G37 as an outgroup. The African, Asian, and European major haplogroups were found to be closely related in their amino acid sequences (fig. 2).
The center of the Asian network consisted of a reticulation formed by MTATP6:A20T, MTCYB:S172N, and MTATP6:T59A (fig. 3). The central node of haplogroup L (fig. 4) and the common root of haplogroups D, E, and M were found to belong to this reticulation and had an identical amino acid sequence. Only two amino acid changes separated haplogroups C and Z from L. The central nodes of the European haplogroup clusters IWX (fig. 5) and KU (fig. 6) and the Asian haplogroup B1 had an identical amino acid sequence, which was separated from haplogroup L by MTATP6:T59A and MTND3:T114A. Additional amino acid replacements separated the other European haplogroup clusters JT (fig. 7) and HV (fig. 8) and the Asian haplogroups A and B2 from this node. The major haplogroups in all the ethnic groups were clearly discernible. Amino acid sequences formed highly starlike phylogenies with major center nodes in all the haplogroup clusters. Thirteen of the 20 amino acid replacements that distinguished the major haplogroups were homoplasic (fig. 2) and 18 of 20 were conservative. MTND4:P140S and MTCYB:T7I were homoplasic and nonconservative.

FIG. 6. Reduced median network of nonsynonymous mutations in the European haplogroup cluster KU. See the legend of figure 3 for explanation of symbols.

Characterization of Amino Acid Replacements

Half of the amino acid replacements (48%) involved a change in polarity, and hydropathy was changed in 44% of the replacements. Only four replacements (MTND1:L289Q, MTATP6:V21E, MTND5:Q546L, and MTND5:L555Q) involved changes between the seven most hydrophilic amino acids and the seven most hydrophobic ones, defined as a change of at least −1.7 to +1.7 or vice versa on the Kyte-Doolittle scale. Changes in polarity and hydropathy were followed in decreasing frequency by changes in aliphaticity (32%), size (26%), aromaticity (13%), and charge (8.3%).
Of the amino acid replacements, 133 (28%) were nonconservative according to the BLOSUM62 matrix (table 3). The distribution of amino acid replacements among the 13 protein-coding genes suggested that the mutations were not distributed randomly across or between genes (fig. 9). The mutations were quite evenly distributed in MTATP6, MTATP8, MTCO3, MTND3, MTND4L, and MTND6, whereas each of the remaining genes had at least one region which appeared relatively conserved as compared to the other regions of the gene. Apparent mutational hotspots, or nonconstrained regions, were identified in both hydrophobic and hydrophilic regions. An excess of amino acid replacements in MTND6 (22/29, 76%) were private (P = 0.03, Fisher's exact test), but no comparable deviations from the expected proportion of 56% were identified among the other genes.

FIG. 7. Reduced median network of nonsynonymous mutations in the European haplogroup cluster JT. See the legend of figure 3 for explanation of symbols.

FIG. 8. Reduced median network of nonsynonymous mutations in the European haplogroup cluster HV. See the legend of figure 3 for explanation of symbols. Inset, additional nodes with private amino acid changes and connecting only to the center of the network ("HV").

Contingency Table Analysis

Of the replacements, 261 (56%) were private, whereas 207 replacements (44%) were present in more than one sequence. Nonconservative changes were more common among the private replacements than among the nonprivate ones (P = 0.005, Fisher's exact test). Changes in size, charge, aliphaticity, and aromaticity were also more common among the private replacements than among the nonprivate ones, but these differences were not significant (table 4). Of the 468 amino acid replacements, 116 (25%) were homoplasic, indicating that they had arisen multiple times during human evolution.
Table 3
Properties of the 468 Amino Acid Replacements Detected in the 840 mtDNA Sequences

Category of change           N (a)        Direction of change (b)        N (a)
Polarity                     224 (.48)    Polar → nonpolar               104 (.22)
                                          Nonpolar → polar               120 (.26)
Size                         123 (.26)    Small → large                   53 (.11)
                                          Large → small                   70 (.15)
Hydropathy                   207 (.44)    Hydrophobic → hydrophilic      112 (.24)
                                          Hydrophilic → hydrophobic       95 (.20)
Charge                        39 (.08)    Neutral → positive              10 (.02)
                                          Neutral → negative              13 (.03)
                                          Positive → neutral               8 (.02)
                                          Positive → negative              0 (0)
                                          Negative → neutral               6 (.01)
                                          Negative → positive              2 (.004)
Aliphaticity                 151 (.32)    Aliphatic → nonaliphatic        82 (.18)
                                          Nonaliphatic → aliphatic        69 (.15)
Aromaticity                   59 (.13)    Aromatic → nonaromatic          36 (.08)
                                          Nonaromatic → aromatic          23 (.05)
Nonconservative (c)          133 (.28)
Private (d)                  261 (.56)
Homoplasic (e)               116 (.25)
Hydrophobic location (f)     239 (.51)
Hydrophilic location (g)     192 (.41)

(a) Number of mutations in the category. Proportion from the total number of mutations is shown in parentheses.
(b) Direction is shown relative to the reference sequence.
(c) Mutation with a negative value in the BLOSUM62 matrix.
(d) Mutation observed in only one sequence.
(e) Mutation observed in ≥2 lineages.
(f) Average hydropathy index of 19 neighboring amino acids is higher than the mean for the respective gene.
(g) Average hydropathy index is lower than the mean of the respective gene.

Nonconservative changes were less common than expected among the homoplasic replacements (P = 0.002). A change from an aliphatic to a nonaliphatic amino acid or vice versa occurred in 25 homoplasic replacements (22%) and in 126 (36%) of the non-homoplasic ones (P = 0.004), while an aromatic amino acid was replaced by a nonaromatic one or vice versa in 8 homoplasic replacements (7%) and in 51 (14%) of the non-homoplasic ones (P = 0.04). Replacements between small and large amino acids were also less common in the homoplasic group (P = 0.04).
The other types of changes did not differ in frequency between the homoplasic and non-homoplasic replacements (table 4).

The mean hydropathy indices were 1.006 for MTATP6, −0.401 for MTATP8, 0.725 for MTCO1, 0.432 for MTCO2, 0.411 for MTCO3, 0.673 for MTCYB, 0.662 for MTND1, 0.596 for MTND2, 1.075 for MTND3, 0.705 for MTND4, 1.376 for MTND4L, 0.563 for MTND5, and 1.036 for MTND6. The average hydropathy calculated for 19 neighboring amino acids was not defined for 37 amino acid replacements that were near either end of the subunit. Of the remaining 431 replacements, 239 (55%) were among the 1,843 positions located in regions that were more hydrophobic than the mean, whereas 192 (45%) were among the 1,712 positions located in the hydrophilic regions. The amino acid replacements in hydrophobic regions altered the amino acid charge less often than those in hydrophilic regions and were more often conservative, whereas replacements between aliphatic and nonaliphatic amino acids were more frequent among those in hydrophobic regions than among those in hydrophilic regions (table 4). Amino acid content differed between the hydrophobic and hydrophilic regions: 103/381 (27%) of the charged amino acids (D, E, H, K, R) and 697/1,065 (65%) of the aliphatic amino acids (I, L, V) in the reference sequence were located in hydrophobic regions of genes.

Rate of Detection of New Mutations in 647 European Sequences

Because private replacements were common among the 840 sequences, we set out to estimate the total number of nonsynonymous mutations that may be present in the population. The rate of detection of new mutations was calculated from 500 permutations of the 647 European sequences harboring 301 distinct nonsynonymous mutations. The Weibull growth curve provided the best fit with the mean of the cumulative sums (fig. 10). The asymptotic maximum of the number of nonsynonymous mutations in European mtDNA was estimated to be 1,081 (standard error 7.3).
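The permutation step behind this estimate can be sketched as follows. The code is a minimal illustration of the mean cumulative discovery curve only; it does not fit the Weibull growth model, which would require a nonlinear least squares routine. The function name `mean_accumulation`, the fixed random seed, and the toy input are all assumptions made for this example and do not reproduce the study data.

```python
import random


def mean_accumulation(samples, n_perm=500, seed=0):
    """Mean cumulative number of distinct mutations over random orderings.

    `samples` is a list of sets, one set of mutation labels per sequence.
    Each permutation visits every sequence once (sampling without
    replacement) and records how many distinct mutations have been seen.
    """
    rng = random.Random(seed)
    totals = [0] * len(samples)
    for _ in range(n_perm):
        order = list(samples)
        rng.shuffle(order)
        seen = set()
        for i, mutations in enumerate(order):
            seen |= mutations
            totals[i] += len(seen)
    return [total / n_perm for total in totals]


if __name__ == "__main__":
    # Toy data: four sequences carrying three distinct mutations in total.
    toy = [{"A20T"}, {"A20T", "T59A"}, set(), {"S172N"}]
    print(mean_accumulation(toy))
```

The last point of the curve always equals the number of distinct mutations in the sample (301 for the 647 European sequences), and an asymptotic model fitted to the whole curve extrapolates beyond it.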
The 301 mutations detected in 647 European sequences therefore encompass approximately 28% of all nonsynonymous mutations that may be present in European populations. Assuming that mutation identification continues to follow the estimated model, 12,200 sequences will be required to identify 90% of the 1,081 mutations and 18,100 sequences to identify 95%. Similar predictions for non-European sequences were not feasible because of the small number of Asian and African sequences known.

FIG. 9. Distribution of amino acid replacements and hydropathic regions in the 13 mtDNA-encoded proteins. The x-axis shows the amino acid position, and the y-axis shows a common scale for hydropathy and amino acid dissimilarity. Curve, the average Kyte-Doolittle hydropathy index for 19 neighboring amino acids; positive values indicate hydrophobic regions. ×, private replacement; +, homoplasic replacement; ○, other replacement. Negative values for amino acid replacements indicate nonconservative changes and positive values indicate conservative changes according to the BLOSUM62 matrix. Histogram, the number of distinct amino acid changes within a window of 50 amino acid positions plotted at the median position of the window. One unit on the y-axis scale corresponds to 10 amino acid changes.

Discussion

We found 1,459 distinct mutations in the protein-coding genes of 840 complete human mtDNA coding region sequences, when the mutations were counted as differences relative to the reference sequence. One-third of the mutations were nonsynonymous. The frequency of changes in the physicochemical properties of the respective amino acids was high, suggesting that such changes are quite common in human mtDNA and that evaluation of the pathogenicity of an amino acid replacement should not rely solely on these structural considerations.
The differences between the frequencies of the particular types of changes are inherent consequences of differences in the frequencies of individual amino acid replacements (fig. 1), which in turn depend on several factors, including sequence composition (Naylor, Collins, and Brown 1995), variable substitution rates and selective constraints among sites and substitutions (Xia 1998; Tourasse and Li 2000; McClellan and McCracken 2001), and the tendency of the genetic code to prefer substitutions between similar amino acids over dissimilar ones (Haig and Hurst 1991). The mitochondrial genome differs from nuclear genes in several properties, including amino acid composition (Naylor, Collins, and Brown 1995) and genetic code (Barrell, Bankier, and Drouin 1979; Knight, Landweber, and Yarus 2001). The proportion of nonconservative amino acid replacements out of all replacements (28.4%) was nevertheless not appreciably different from that in 106 nuclear genes (Cargill et al. 1999), where 36% were nonconservative (odds ratio 1.4, 95% confidence interval 0.95–2.03, P = 0.07; Fisher’s exact test).

The reduced-median networks of the nonsynonymous mutations provided a comprehensive description of the intraspecies protein-level phylogeny in humans. The phylogenetic signal of synonymous mutations was lost, because only the nonsynonymous mutations were considered, but the various haplogroups were still discernible. Disregarding synonymous mutations may even improve the accuracy of a phylogenetic network (Naylor and Brown 1997). Many branches in the full networks (Finnilä, Lehtonen, and Majamaa 2001; Herrnstadt et al. 2002a) contain at least one nonsynonymous mutation, and the branches were also shown clearly in the present networks. Exceptions to this pattern included the root of haplogroups H and V, which was a single node, because all the nucleotide differences between these haplogroups were synonymous. Furthermore, the central nodes of several major haplogroups (U2 and B1; L and the root of D, E, and M) had identical amino acid sequences.

Table 4
Comparisons of Categories of the 468 Amino Acid Replacements

Category of Change(a)    Private(b)    Nonprivate(b)    OR(c)    95% CI(d)    P Value(e)
                         N = 261       N = 207
Polarity                 122 (.47)     102 (.49)        0.90     0.62–1.32    0.64
Size                      74 (.28)      49 (.24)        1.28     0.82–1.99    0.29
Hydropathy               110 (.42)      97 (.47)        0.83     0.56–1.21    0.35
Charge                    25 (.10)      14 (.07)        1.46     0.71–3.13    0.31
Aliphaticity              93 (.36)      58 (.28)        1.42     0.94–2.16    0.09
Aromaticity               37 (.14)      22 (.11)        1.39     0.77–2.56    0.27
Nonconservative           88 (.34)      45 (.22)        1.83     1.18–2.85    0.005*
Hydrophobic location     141 (.54)      98 (.47)        1.24     0.83–1.86    0.28

Category of Change(a)    Homoplasic(b)    Nonhomoplasic(b)    OR(c)    95% CI(d)    P Value(e)
                         N = 116          N = 352
Polarity                  61 (.53)        163 (.46)           1.29     0.83–2.00    0.28
Size                      22 (.19)        101 (.29)           0.58     0.33–1.00    0.04*
Hydropathy                57 (.49)        150 (.43)           1.30     0.83–2.03    0.24
Charge                    11 (.09)         28 (.08)           1.21     0.53–2.62    0.57
Aliphaticity              25 (.22)        126 (.36)           0.49     0.29–0.82    0.004*
Aromaticity                8 (.07)         51 (.14)           0.44     0.17–0.97    0.04*
Nonconservative           20 (.17)        113 (.32)           0.44     0.25–0.76    0.002**
Hydrophobic location      51 (.44)        188 (.53)           0.73     0.46–1.17    0.17

Category of Change(a)    Hydrophobic Location(b)    Hydrophilic Location(b)    OR(c)    95% CI(d)    P Value(e)
                         N = 239                    N = 192
Polarity                 117 (.49)                   90 (.47)                  1.09     0.73–1.62    0.70
Size                      64 (.27)                   50 (.26)                  1.04     0.66–1.64    0.91
Hydropathy               110 (.46)                   84 (.44)                  1.10     0.73–1.64    0.70
Charge                     8 (.03)                   26 (.14)                  0.22     0.08–0.52    0.0001**
Aliphaticity              94 (.39)                   46 (.24)                  2.05     1.32–3.21    0.0009**
Aromaticity               29 (.12)                   27 (.14)                  0.84     0.46–1.55    0.57
Nonconservative           56 (.23)                   64 (.33)                  0.61     0.39–0.96    0.02*

a See the footnote to table 3 for explanation of categories.
b Number of amino acid replacements of the respective type. Proportions are shown in parentheses.
c Odds ratio.
d 95% confidence interval for odds ratio.
e Probability of the null hypothesis that OR is 1 (Fisher’s exact test).
* P < 0.05.
** P < 0.00223, which corresponds to the 95% significance level adjusted for multiple comparisons.
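The odds ratios in table 4 follow from 2 × 2 counts. As a minimal sketch, the Python code below takes the nonconservative row of the private versus nonprivate comparison (88 of 261 and 45 of 207) and computes the sample (cross-product) odds ratio together with a two-sided Fisher's exact P value, obtained by summing the probabilities of all tables that are no more likely than the observed one. The function names are hypothetical, and the conventions may differ from those behind the published values, so small differences in the later decimals are possible.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )


if __name__ == "__main__":
    # Nonconservative replacements: private 88 of 261, nonprivate 45 of 207.
    table = (88, 261 - 88, 45, 207 - 45)
    print(round(odds_ratio(*table), 2))
    print(fisher_exact_two_sided(*table))
```

The same two functions can be applied to any other row of the table by substituting the corresponding counts and group sizes.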
The major haplogroups were found to be closely related in their amino acid sequences, with relatively few replacements separating their center nodes, but the variation within haplogroups was high, resulting in starlike phylogenies. More than half of the observed amino acid changes were present in only one sequence, giving rise to rare amino acid haplotypes. This finding is analogous to earlier observations of an excess of nonsynonymous mutations within species, as compared with variation between species (Nachman et al. 1996; Rand and Kann 1996; Hasegawa, Cao, and Yang 1998; Nachman 1998; Fry 1999). This is usually assumed to result from purifying selection against slightly deleterious alleles, which prevents their fixation. Such mildly deleterious mutations should reside in the periphery of phylogenetic networks. This hypothesis was supported by the present comparison of private replacements and nonprivate ones, which revealed that nonconservative changes are more frequent among the private replacements.

The frequency of homoplasic mutations in human mtDNA has been found to be high (Finnilä, Lehtonen, and Majamaa 2001; Herrnstadt et al. 2002a). We found here that homoplasy among nonsynonymous mutations is also common, as one-fourth of all amino acid replacements were homoplasic. Interestingly, the homoplasic replacements included fewer nonconservative replacements and fewer replacements involving small, aliphatic, and aromatic amino acids. This observation suggests that physicochemical properties determine, at least in part, whether amino acid replacements are removed by selection or whether they persist long enough to be observed in separate lineages in the phylogeny, that is, whether they become homoplasic. Homoplasic replacements are therefore not confined exclusively to nonconstrained amino acid positions. Most ancient amino acid replacements distinguishing the major haplogroups were observed in other parts of the phylogeny as well, and all but two were conservative, which is consistent with their neutrality.

Our findings support the assumption that amino acid replacements resulting in dissimilar amino acid properties are generally more deleterious than replacements resulting in similar properties. However, the effects of nonsynonymous mutations depend also on the position of the amino acid replacement in the protein sequence. Nonsynonymous mutations were found to occupy both hydrophobic and hydrophilic regions of genes, when the regions were defined according to the average hydropathy for the respective gene. Mutations in hydrophobic regions involved fewer changes in charge and more changes in aliphaticity than expected and were less often nonconservative than mutations in hydrophilic regions, but such differences are confounded by the differences in the amino acid composition of the respective regions. Even if it is accepted that the hydrophobic regions may be generally more conserved than hydrophilic regions (Naylor, Collins, and Brown 1995), the distribution of amino acid changes among genes (fig. 9) suggested that not all hydrophobic regions are alike. Several amino acid replacements were identified in the fifth, eleventh, and twelfth hydrophobic domains of MTCO1, for example, but none were identified in the seventh or eighth. Although it may eventually be possible to determine the degree and nature of the constraints on each region, and perhaps even on each position in mtDNA, the distribution of nonsynonymous mutations along the genes is still relatively sparse, suggesting that even larger numbers of sequences and polymorphisms will be required for detailed identification and characterization of functionally constrained and nonconstrained regions in human mtDNA.

The cumulative rate of detection of new nonsynonymous mutations in European sequences was found to follow the Weibull growth curve model, the estimated parameters suggesting that 19 times the current number of mtDNA sequences will be required to identify 90% of the nonsynonymous mutations that may be present in European populations.

FIG. 10. Identification of new nonsynonymous mutations in 647 European sequences. 500 permutations of the sequence order (Index) and the cumulative sum of mutations not observed in previous sequences in each permutation were obtained. Solid curve, the mean of the 500 cumulative sum curves. The largest and lowest value of nonsynonymous mutations at the corresponding index observed in any permutation are shown above and below the mean curve. Dashed curve, the Weibull growth curve $y = a - b \exp[-\exp(d)\, x^{e}]$ fitted to the mean curve by the nonlinear least squares method and using all 647 data points. The fitted curve with parameters a = 1080.84 (SE 7.3), b = 1080.31 (SE 7.3), d = –5.435 (SE 0.0033), and e = 0.6664 (SE 0.00075) is superimposed almost perfectly on the mean curve (residual sum of squares = 16.86). a indicates the asymptotic maximum of the Weibull growth curve.

In conclusion, the results of this descriptive analysis of 471 nonsynonymous mutations showed that nonconservative changes were more common among private replacements and nonhomoplasic replacements than among nonprivate and homoplasic ones, and that a similar trend was evident in certain physicochemical characteristics of replacements, suggesting a role for selection against these in the evolution of the protein-coding genes of mtDNA. Selection presumably varies between genes, functional domains, and sites, however, and even more sequences will be required for reliable mapping of constrained and nonconstrained regions. Assessment of the pathogenicity of an amino acid change should not rely on single structural considerations, because changes in physicochemical properties such as hydropathy, size, charge, and polarity are common in the mtDNA-encoded proteins in humans.
The entire mtDNA genome should be screened to exclude other mutations when a particular variant is suspected of being pathogenic, and a population-genetic approach should be adopted to recognize neutral variants that are present in populations. The reduced-median networks and the tabulation of physicochemical properties of amino acid changes presented here should therefore also have practical applications.

Supplementary Material

The complete table of nonsynonymous mutations in the 840 sequences, their amino acid translations, and their physicochemical properties is provided as online Supplementary Material. Links to updated versions of the table may appear at http://cc.oulu.fi/~jukkamoi/mtres/.

Acknowledgments

This work was supported by grants from the Sigrid Juselius Foundation, the Maud Kuistila Memorial Foundation, and the Research Council for Health, Academy of Finland.

Literature Cited

Alff-Steinberger, C. 1969. The genetic code and error transmission. Proc. Natl. Acad. Sci. USA 64:584–591.
Anderson, S., A. T. Bankier, B. G. Barrell et al. (14 co-authors). 1981. Sequence and organization of the human mitochondrial genome. Nature 290:457–465.
Andrews, R. M., I. Kubacka, P. F. Chinnery, R. N. Lightowlers, D. M. Turnbull, and N. Howell. 1999. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat. Genet. 23:147.
Arnason, U., X. Xu, and A. Gullberg. 1996. Comparison between the complete mitochondrial DNA sequences of Homo and the common chimpanzee based on nonchimeric sequences. J. Mol. Evol. 42:145–152.
Bandelt, H. J., P. Forster, B. C. Sykes, and M. B. Richards. 1995. Mitochondrial portraits of human populations using median networks. Genetics 141:743–753.
Bandelt, H. J., V. Macaulay, and M. Richards. 2000. Median networks: speedy construction and greedy reduction, one simulation, and two case studies from human mtDNA. Mol. Phylogenet. Evol. 16:8–28.
Barrell, B. G., A. T. Bankier, and J. Drouin. 1979. A different genetic code in human mitochondria. Nature 282:189–194.
Brown, M. D., A. Torroni, C. L. Reckord, and D. C. Wallace. 1995. Phylogenetic analysis of Leber’s hereditary optic neuropathy mitochondrial DNA’s indicates multiple independent occurrences of the common mutations. Hum. Mutat. 6:311–325.
Campos, Y., M. A. Martin, J. C. Rubio, M. C. Gutierrez del Olmo, A. Cabello, and J. Arenas. 1997. Bilateral striatal necrosis and MELAS associated with a new T3308C mutation in the mitochondrial ND1 gene. Biochem. Biophys. Res. Commun. 238:323–325.
Cargill, M., D. Altshuler, J. Ireland et al. (17 co-authors). 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22:231–238.
Chinnery, P. F., G. A. Taylor, N. Howell, R. M. Andrews, C. M. Morris, R. W. Taylor, I. G. McKeith, R. H. Perry, J. A. Edwardson, and D. M. Turnbull. 2000. Mitochondrial DNA haplogroups and susceptibility to AD and dementia with Lewy bodies. Neurology 55:302–304.
De Benedictis, G., G. Rose, G. Carrieri et al. (13 co-authors). 1999. Mitochondrial DNA inherited variants are associated with successful aging and longevity in humans. FASEB J. 13:1532–1536.
Elson, J. L., R. M. Andrews, P. F. Chinnery, R. N. Lightowlers, D. M. Turnbull, and N. Howell. 2001. Analysis of European mtDNAs for recombination. Am. J. Hum. Genet. 68:145–153.
Fernandez-Moreno, M. A., B. Bornstein, Y. Campos, J. Arenas, and R. Garesse. 2000. The pathogenic role of point mutations affecting the translational initiation codon of mitochondrial genes. Mol. Genet. Metab. 70:238–240.
Finnilä, S., I. E. Hassinen, L. Ala-Kokko, and K. Majamaa. 2000. Phylogenetic network of the mtDNA haplogroup U in Northern Finland based on sequence analysis of the complete coding region by conformation-sensitive gel electrophoresis. Am. J. Hum. Genet. 66:1017–1026.
Finnilä, S., M. S. Lehtonen, and K. Majamaa. 2001. Phylogenetic network for European mtDNA. Am. J. Hum. Genet. 68:1475–1484.
Fry, A. J. 1999. Mildly deleterious mutations in avian mitochondrial DNA: evidence from neutrality tests. Evolution 53:1617–1620.
Grantham, R. 1974. Amino acid difference formula to help explain protein evolution. Science 185:862–864.
Haig, D., and L. D. Hurst. 1991. A quantitative measure of error minimization in the genetic code. J. Mol. Evol. 33:412–417.
Hasegawa, M., Y. Cao, and Z. Yang. 1998. Preponderance of slightly deleterious polymorphisms in mitochondrial DNA: nonsynonymous/synonymous rate ratio is much higher within species than between species. Mol. Biol. Evol. 15:1499–1505.
Hedges, S. B. 2000. A start for population genomics. Nature 408:652–653.
Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915–10919.
Herrnstadt, C., J. L. Elson, E. Fahy et al. (11 co-authors). 2002a. Reduced-median-network analysis of complete mitochondrial DNA coding-region sequences for the major African, Asian, and European haplogroups. Am. J. Hum. Genet. 70:1152–1171.
Herrnstadt, C., J. L. Elson, E. Fahy et al. (11 co-authors). 2002b. Reduced-median-network analysis of complete mitochondrial DNA coding-region sequences for the major African, Asian, and European haplogroups [erratum]. Am. J. Hum. Genet. 71:448–449.
Herrnstadt, C., G. Preston, R. Andrews, P. Chinnery, R. N. Lightowlers, D. M. Turnbull, I. Kubacka, and N. Howell. 2002c. A high frequency of mtDNA polymorphisms in HeLa cell sublines. Mutat. Res. 501:19–28.
Horai, S., K. Hayasaka, R. Kondo, K. Tsugane, and N. Takahata. 1995. Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc. Natl. Acad. Sci. USA 92:532–536.
Ihaka, R., and R. Gentleman. 1996. R: a language for data analysis and graphics. J. Comp. Graph. Stat. 5:299–314.
Ingman, M., H. Kaessmann, S. Pääbo, and U. Gyllensten. 2000. Mitochondrial genome variation and the origin of modern humans. Nature 408:708–713.
Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217:624–626.
Knight, R. D., L. F. Landweber, and M. Yarus. 2001. How mitochondria redefine the code. J. Mol. Evol. 53:299–313.
Kyte, J., and R. F. Doolittle. 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157:105–132.
Maca-Meyer, N., A. M. Gonzáles, J. M. Larruga, C. Flores, and V. M. Cabrera. 2001. Major genomic mitochondrial lineages delineate early human expansions. BMC Genetics 2:13.
McClellan, D. A., and K. G. McCracken. 2001. Estimating the influence of selection on the variable amino acid sites of the cytochrome b protein functional domains. Mol. Biol. Evol. 18:917–925.
Nachman, M. W. 1998. Deleterious mutations in animal mitochondrial DNA. Genetica 102–103:61–69.
Nachman, M. W., W. M. Brown, M. Stoneking, and C. F. Aquadro. 1996. Nonneutral mitochondrial DNA variation in humans and chimpanzees. Genetics 142:953–963.
Naylor, G. J., and W. M. Brown. 1997. Structural biology and phylogenetic estimation. Nature 388:527–528.
Naylor, G. J., T. M. Collins, and W. M. Brown. 1995. Hydrophobicity and phylogeny. Nature 373:565–566.
Ohta, T. 1992. The nearly neutral theory of molecular evolution. Annu. Rev. Ecol. Syst. 23:263–286.
Parsons, T. J., D. S. Muniec, K. Sullivan et al. (11 co-authors). 1997. A high observed substitution rate in the human mitochondrial DNA control region. Nat. Genet. 15:363–368.
Rand, D. M., and L. M. Kann. 1996. Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humans. Mol. Biol. Evol. 13:735–748.
Rice, P., I. Longden, and A. Bleasby. 2000. EMBOSS: the European molecular biology open software suite. Trends Genet. 16:276–277.
Rocha, H., C. Flores, Y. Campos, J. Arenas, L. Vilarinho, F. M. Santorelli, and A. Torroni. 1999. About the "pathological" role of the mtDNA T3308C mutation... Am. J. Hum. Genet. 65:1457–1459.
Ruiz-Pesini, E., A. C. Lapena, C. Diez-Sanchez et al. (11 co-authors). 2000. Human mtDNA haplogroups associated with high or reduced spermatozoa motility. Am. J. Hum. Genet. 67:682–696.
Shin, W. S., M. Tanaka, J. Suzuki, C. Hemmi, and T. Toyo-oka. 2000. A novel homoplasmic mutation in mtDNA with a single evolutionary origin as a risk factor for cardiomyopathy. Am. J. Hum. Genet. 67:1617–1620.
Tourasse, N. J., and W. H. Li. 2000. Selective constraints, amino acid composition, and the rate of protein evolution. Mol. Biol. Evol. 17:656–664.
Wallace, D. C., M. D. Brown, and M. T. Lott. 1999. Mitochondrial DNA variation in human evolution and disease. Gene 238:211–230.
Xia, X. 1998. The rate heterogeneity of nonsynonymous substitutions in mammalian mitochondrial genes. Mol. Biol. Evol. 15:336–344.
Xia, X., and W. H. Li. 1998. What amino acid properties affect protein evolution? J. Mol. Evol. 47:557–564.

Wolfgang Stephan, Associate Editor
Accepted March 10, 2003