* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Human mutations in glucose 6-phosphate dehydrogenase reflect
Epigenetics of neurodegenerative diseases wikipedia , lookup
Non-coding DNA wikipedia , lookup
Genome (book) wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Expanded genetic code wikipedia , lookup
Designer baby wikipedia , lookup
Neuronal ceroid lipofuscinosis wikipedia , lookup
Genome evolution wikipedia , lookup
Metagenomics wikipedia , lookup
Human genome wikipedia , lookup
Computational phylogenetics wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Sequence alignment wikipedia , lookup
Koinophilia wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Helitron (biology) wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome editing wikipedia , lookup
Oncogenomics wikipedia , lookup
Genetic code wikipedia , lookup
Microevolution wikipedia , lookup
Human mutations in glucose 6-phosphate dehydrogenase reflect evolutionary history ROSARIO NOTARO,* ADEYINKA AFOLAYAN,*,† AND LUCIO LUZZATTO*,1 *Department of Human Genetics, Memorial Sloan-Kettering Cancer Center, New York, NY 10021, USA; and †Department of Biochemistry, Obafemi Awolowo University, Ile-Ife, Nigeria ABSTRACT Glucose 6-phosphate dehydrogenase (G6PD) is a cytosolic enzyme encoded by a housekeeping X-linked gene whose main function is to produce NADPH, a key electron donor in the defense against oxidizing agents and in reductive biosynthetic reactions. Inherited G6PD deficiency is associated with either episodic hemolytic anemia (triggered by fava beans or other agents) or life-long hemolytic anemia. We show here that an evolutionary analysis is a key to understanding the biology of a housekeeping gene. From the alignment of the amino acid (aa) sequence of 52 glucose 6-phosphate dehydrogenase (G6PD) species from 42 different organisms, we found a striking correlation between the aa replacements that cause G6PD deficiency in humans and the sequence conservation of G6PD: two-thirds of such replacements are in highly and moderately conserved (50 –99%) aa; relatively few are in fully conserved aa (where they might be lethal) or in poorly conserved aa, where presumably they simply would not cause G6PD deficiency. This is consistent with the notion that all human mutants have residual enzyme activity and that null mutations are lethal at some stage of development. Comparing the distribution of mutations in a human housekeeping gene with evolutionary conservation is a useful tool for pinpointing amino acid residues important for the stability or the function of the corresponding protein. In view of the current explosive increase in full genome sequencing projects, this tool will become rapidly available for numerous other genes.— Notaro, N., Afolayan, A., Luzzatto, L. Human mutations in glucose 6-phosphate dehydrogenase reflect evolutionary history. FASEB J. 14, 485– 494 (2000) Key Words: G6PD deficiency 䡠 G6PD variants 䡠 housekeeping genes 䡠 human mutants 䡠 evolution Glucose 6-phosphate dehydrogenase (G6PD) is a cytosolic enzyme whose main function is to produce NADPH, a key electron donor in the defense against oxidizing agents and in reductive biosynthetic reactions (1). In humans, genetically determined G6PD deficiency is associated with either episodic hemolytic anemia, triggered by fava beans 0892-6638/00/0014-0485/$02.25 © FASEB or other agents (referred to hereinafter as mild phenotype), or life-long hemolytic anemia (referred to hereafter as severe phenotype) (2). Among 122 variants characterized at the molecular level, those with severe phenotype are all rare, but those with mild phenotype have become common in many populations as a result of malaria selection (1). G6PD was first identified in the early 1930s in yeast (3) and human red blood cells (4), and since then in all organisms that have been tested for this enzyme activity as well as in all tissues and cell types from higher animals and plants (5); hence, it can be regarded as a prototype example of a housekeeping enzyme. Structural and functional features of the G6PD promoter conform to those of a typical housekeeping gene (6 – 8). The G6PD gene from Homo sapiens was the first to be cloned (9). Since then, 52 complete G6PD coding sequences have become available; of these, 43 were sought out deliberately and 9 have emerged in the last 2 years from the complete sequencing of microorganisms. From these sequence data we have derived a general idea of the evolution of G6PD over a Myr time scale (macroevolution). We used this evolutionary information to understand the distribution of human mutations that affect enzyme activity (microevolution). MATERIALS AND METHODS G6PD sequences For this study we have selected only complete G6PD sequences for which there was an available open reading frame from a start to a stop codon. The following 49 G6PD complete coding sequences have been obtained from GENBANK by keyword search or by BLAST search using human G6PD sequence as probe (in parentheses, GENBANK sequence ID): Actinobacillus actinomycetemcomitans (1651208); Anabaena sp. (988293); Aquifex aeolicus (2983137); Arabidopsis thaliana 1, chloroplast (3021305); Arabidopsis thaliana 2 (1174336); Aspergillus niger (870831); Bacillus subtilis (1303961); Borrelia 1 Correspondence: Department of Human Genetics, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021, USA. E-mail: [email protected] 485 burgdorferi (2688568); Caenorhabditis elegans (1321752); Ceratitis capitata (460877); Chlamydia trachomatis (1791245); Cricetulus griseus (2828743); Drosophila melanogaster (157470); Emericella nidulans (642160); Erwinia chrysanthemi (397854); Escherichia coli (1788158); Fugu rubripes (2734869); Haemophilus influenzae (1573543); Helicobacter pylori (2314250); Homo sapiens (31543); Kluyveromyces lactis (5539); Leuconostoc mesenteroides (149631); Macropus robustus (560549); Medicago sativa (603219); Mus musculus G6PD1 (51114) and G6PD2 (1806126); Mycobacterium tuberculosis zwf (2117215) and zwf2 (2131049); Nicotiana tabacum cytosolic TCG6 (3021508), cytosolic TCG9 (3021510), chloroplast TPG16 (AJ001771), chloroplast TPG18 (3021532) and chloroplast (1480344) G6PD; Nostoc sp. (558505); Petroselinum crispum cytosolic cG6PDH1 (2352921), cytosolic cG6PDH2 (2352923) and chloroplast pG6PDH (2352919); Pichia jadinii (348417); Plasmodium falciparum (438212); Pseudomonas aeruginosa (2935661); Rattus norvegicus (56196); Saccharomyces cerevisiae (171545); Solanum tuberosum cytosolic (471345) and chloroplast (1197385) G6PD; Spinacia oleracea, chloroplast (2276344); Synechococcus sp. (988286); Synechocystis sp. (1652530); Treponema pallidum (3322776); Zymomonas mobilis (155591). Two additional complete sequences (Deinococcus radiodurans and Streptococcus pnaeumoniae) were retrieved by BLAST search from the archive of The Institute of Genomic Research. The complete sequence from Rhodobacter capsulatus was retrieved from the archive of University of Chicago. The full genomes of Archaeoglobus fulgidus, Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Mycoplasma genitalium, Mycoplasma pneumoniae, and Rickettsia prowazekii are listed in GENBANK and were searched by using BLAST; the full-sequence Pyrococcus horikoshii was searched by using BLAST on the server of National Institute of Technology and Evaluation, Tokyo, Japan. We have not included the sequence of a rabbit microsomal G6PD (10); it is likely that this sequence belongs to the non-glucose-6-phosphate-specific hexose-6-phosphate dehydrogenase previously characterized from rat liver microsomes (11). Alignment and sequence analysis The sequences were aligned using the program PIMA 1.4 (12) (Human Genome Center, Baylor College of Medicine, Houston, Tex.: http://dot.imgen.bcm.tmc.edu:9331/multialign.html). The alignment was manually refined. The percentages of identity and similarity were calculated from the alignment using the program GeneDoc (13). For aa similarity, we adopted the criteria of the Dayhoff’s PAM 250 matrix (14): 1) DEQHN, 2) FY, 3) KR, 4) SAT, and 5) LIVM. For physicochemical grouping, we adopted GeneDoc criteria in which aa are classified in twelve groups based on three subsequent hierarchical levels: the first level is based on size, the second is based on electrical charge for polar aa, the third is based on aromaticity for nonpolar aa (15, 16). Phylogenetic tree The evolutionary tree of G6PD was derived from the alignment of the 52 sequences. Distances between sequences were calculated from the alignment by using PROTDIST (according to Dayhoff’s PAM 250 matrix; ref 14); the tree was then generated by using KITSCH and drawn by using TREEVIEW (17). The programs PROTDIST and KITSCH are part of the PHYLIP 3.57 package (18). Amino acid solvent accessibility The 3-dimensional structure of human G6PD was previously modeled on the crystal structure of L. mesenteroides G6PD 486 Vol. 14 March 2000 (19). Specifically, residues 27–512 of human G6PD are aligned to residues 1– 486 of L. mesenteroides. The coordinates of this model have been used to calculate the solvent accessibility (SA) of individual aa residues in the monomeric structure of human G6PD using Swisse-PdbViewer (20) program (http://www.expasy.ch/spdbv/mainpage.htm). Amino acids with SA ⬍ 10% are regarded as buried and amino acids with SA ⱖ 10% are regarded as exposed (21). Human mutations One hundred twenty-two mutations or combination of mutations of human G6PD have recently been tabulated (22): 114 variants have missense mutations; of these, 106 have a single mutation, 7 have two mutations, and 1 has three mutations. Six variants have in-frame deletions (four single aa deletions, one 2 aa, and one 8 aa deletion). One variant has a nonsense mutation and one has a splicing site mutation. We have made use of only 118 variants from this database (Fig. 4) because no clinical data are available on the other four. Of these 118 variants, 3 do not affect enzyme activity; all others entail enzyme deficiency: of these, 57 are associated with chronic nonspherocytic hemolytic anemia (WHO class I, severe phenotype) and 58 are associated with the risk of acute hemolytic anemia (WHO classes II and III, mild phenotype). In studying the relationship between human mutations and evolutionary conservation, we have confined the analysis to only the 99 single missense mutations and the 4 single amino acid deletions. RESULTS G6PD in living organisms We have retrieved from available databases a total of 52 sequences clearly homologous to human G6PD, which we have used as reference. Of these, 50 were annotated as G6PD and 2 were not. When all of these sequences are aligned by using conventional algorithms (Fig. 1), we find that the mammalian sequences have ⬃94% identity (by comparison, homology among mammalian hemoglobins is of the order of 70%); when these are compared with those of lower vertebrates, invertebrates, and microorganisms, it is not surprising that the degree of identity decreases but remains higher than 20% (Table 1). The percentage of similarity among all of the 52 sequences ranges between 43 to 98%. A long stretch of 141 AA (163–303 in human) is aligned without a single gap in all sequences (except H. pylori and A. aeolicus). Some G6PD species in plants have aminoterminal extensions related to their localization in chloroplasts; G6PD from Plasmodium falciparum has an intriguing amino-terminal extension (23) that may have a different enzymatic function (24), but in the G6PD-like portion the homology is of the same order of magnitude as that found in other organisms. In our retrieval of G6PD sequences, it was of special interest that no coding sequence recognizable as G6PD could be found in seven microorgan- The FASEB Journal NOTARO ET AL. Figure 1. Macroevolution of the G6PD molecule. This diagram reflects a synoptic analysis of 52 different G6PD sequences. Human G6PD (italics) is used as the reference sequence and the even exons are underlined and numbered: the coding sequence starts in exon 2. In the lettering of human G6PD, aa solvent accessibility (SA) is reported with the following color code: blue, buried aa (SA ⬍10%); red, exposed aa (SA⫽10%). Immediately above the human sequence we show the similarity consensus for the 52 sequences analyzed (similarity), and above that the identity consensus sequence (identity). At each position, the most frequent aa is shown, with a coding for its frequency (white uppercase in blue field: 100%; black uppercase in green field: 76 –99%; white lowercase in red field: 51–75%); hyphens are entered in the two consensus lines for aa with ⬍50% conservation. For aa similarity we have adopted the criteria of the Dayhoff PAM 250 matrix (14): DEQHN, FY, KR, SAT, and LIVM. Conserved blocks described in the text are boxed and identified with roman numerals. Above the identity consensus sequence, runs of underlined a and b indicate ␣-helices and -strands respectively, as found in the Leuconostoc mesenteroides molecule (sec. structure) (19). The structure of the first 24 aa and the last 3 aa (lowercase) is not known for L. mesenteroids. isms whose genome has been fully sequenced. Four of these (Archaeoglobus fulgidus, Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Pyrococcus horikoshii) are Archea, which grow in an oxygenMACRO- AND MICROEVOLUTION OF G6PD poor habitat. The other three are Eubacteria with a small and defective genome, two of which (Mycoplasma genitalium and M. pneumoniae) are cell surface parasites, whereas one (Rickettsia prowazekii) is an 487 TABLE 1. Identity and similarity among G6PD sequencesa HUMAN MOUSE FUGRU DROME CAEEL SOLTU YEAST SYNP7 ECOLI LEUME AQUAE HUMAN 515 98 89 76 76 70 64 54 53 49 45 MOUSE 93 515 89 77 76 71 64 54 53 50 45 FUGRU 76 75 514 75 74 69 63 52 51 51 42 DROME 59 59 59 523 73 66 64 53 50 48 44 CAEEL 56 55 55 54 522 68 64 56 52 50 45 SOLTU 49 49 48 44 44 511 64 55 53 50 47 YEAST 41 42 43 42 43 46 505 56 55 54 46 SYNP7 32 32 30 29 32 32 31 524 58 54 44 ECOLI 31 31 30 28 31 32 30 38 491 59 47 LEUME 28 27 28 27 28 27 30 30 31 486 43 AQUAE 23 23 22 20 22 23 23 25 24 24 431 I D E N T I T Y SIMILARITY a Percentage of identity and similarity were determined on the alignment of the 52 G6PD sequences (8) with the program GeneDoc (25) according to the Dayhoff matrix PAM 250 (26). Identity is shown in the upper right portion of the matrix; similarity is shown in the lower left portion of the matrix. In the highlighted diagonal the number of amino acids for each sequence is shown. Abbreviations as follows: HUMAN: H. sapiens; MOUSE: M. musculus; FUGRU: F. rubripes; DROME: D. melanogaster; CAEEL: C. elegans; SOLTU: S. tuberosum (cytosol); YEAST: S. cerevisiae; SYNP7: Synechoccus sp.; ECOLI: E. coli; LEUME: L. mesenteroides; AQUAE: A. aeolicus. obligate intracellular parasite that might be able to capitalize on the G6PD activity of the host cell. The homology between G6PD sequences from Eubacteria and Eukaryota clearly supports a common evolutionary origin of G6PD throughout living organisms. Unlike tissue-specific genes such as globins, a housekeeping gene such as G6PD allows us to outline evolutionary relationships for most of the living organism, the only exception so far being the Archea. A dendrogram based on economy principles is consonant with taxonomy (Fig. 2), and some specific points are noteworthy. 1) Metazoa, fungi, and plants all separate early on from bacteria. 2) Vertebrates separate cleanly from invertebrates. 3) The chloroplast G6PD sequences of the plants separate early from the cytosolic G6PD species of the plants and from G6PD of fungi and metazoa. 4) Rather than being nearer metazoa, the G6PD sequences of fungi seem to fall between chloroplast G6PD and cytosolic G6PD. Sequence conservation and 3-dimensional structure In all of the 52 known G6PD species, there are 25 identical aa and 56 similar aa. The total number of aa is 515 in human and between 425 and 604 in most species. If we apply broader criteria of similarity, whereby aa are classified into 12 groups sharing some physicochemical properties, there are in fact 143 conserved aa. The overall percent homology among G6PD sequences is of course only a rough measure of evolutionary conservation, because homology is not uniform throughout the sequence. The degree of homology tends to decrease toward both the NH2 terminus and the carboxyl terminus, perhaps because these regions are allowed greater flexibility without prejudice to the overall conformation of the molecule. Similarity con488 Vol. 14 March 2000 servation is shaped in blocks; we have identified 12 by visual inspection (Fig. 1). Although we cannot yet fully explain the significance of each conserved block, we can offer a rationale in most cases. Block I is the NADP binding region (residues 38 – 47 of human G6PD); it has a characteristic dinucleotide pocket (GXXGXX), flanked on either side by additional conserved hydrophobic amino acids (e.g., IIM 35–37, P 50 and P 62, L 55, and L 61) and a KKK motif. Block IV comprises the catalytic site, which surrounds a completely conserved lysine; in human G6PD this is K 205, previously shown to be essential for enzyme activity though not for substrate binding (25). To understand the significance of other regions of high conservation, we must consider the conformation of the protein. The crystal structure has been fully solved for L. mesenteroides G6PD (26) and human G6PD (27). The latter agrees well with a previously published model (19); therefore, it is reasonable to presume that the 3-dimensional structure is essentially conserved. With reference to the human G6PD model (19), we find that the conserved blocks III and V are facing the active center (see above). Block X contributes much of the subunit interface. Blocks II, VII, VIII, IX, and XI are part of the hydrophobic core. The carboxyl-terminal ends of block IV and block XI also contribute to the subunit interface. Blocks VI and XII include amphipatic ␣ helices. To test the notion that surface accessibility is significantly related to evolutionary conservation, we consider four groups of aa based on similarity conservation (fully conserved, 100%; highly conserved, 75–99%; moderately conserved, 50 –74%; poorly conserved, ⬍50%). We find that buried aa residues are over-represented and that exposed aa residues are under-represented among those with a similarity higher than 75% (Fig. 3) (2⫽50; P⬍0.001). The FASEB Journal NOTARO ET AL. served. In addition, the mutations in human G6PD are spread throughout the sequence. Only one discrete cluster emerges in the aa range 380 – 410 that corresponds to the subunit interface in the enzymatically active G6PD dimer (19), To assess how human mutations associated with G6PD deficiency relate in general to the evolutionary history of the G6PD sequence, we refer again to the four similarity conservation groups defined above and see that a distinct pattern emerges (Table 2; Fig. 5). Fully conserved amino acids and poorly conserved amino acids are under-represented in G6PDdeficient mutants. By contrast, highly and moderately conserved amino acids are over-represented in G6PD-deficient mutants. This skewed distribution of mutations among the different amino acid conservation group is statistically significant (2⫽9.36; P⬍0.03). In 84% (83/99) of cases, regardless of the degree of evolutionary conservation, aa replacements in human mutants do not respect similarity; in 68% (67/99) of cases, the aa replaced in the mutants are not found in any nonhuman G6PD—if so, they probably would not cause G6PD deficiency. In support of this notion, we noticed that in 9 of the 16 cases (56%) where a human mutation does respect similarity, the mutated residue is normal in another species; but of the 83 cases where the human mutation does not respect similarity, there are only 23 (28%) where the mutated residue is normal in another species. Figure 2. Evolutionary tree of G6PD derived from the alignment of the 52 sequences (see Fig. 1 and Materials and Methods). Distances between sequences were calculated from the alignment by using PROTDIST according to Dayhoff’s PAM 250 matrix (14); the tree was then generated by using KITSCH and drawn by using TREEVIEW (17). The programs PROTDIST and KITSCH are part of the PHYLIP 3.57 package (18). Cyt, cytosol; Chl: chloroplast. #The author who reported the M. sativa G6PD sequence (603219) did not state whether the sequence is encoding for the G6PD present in the cytosol or in the chloroplast. From the alignment and from this phylogenetic tree, we infer that the sequence almost certainly encodes for a cytosolic G6PD. $The author who reported A. thaliana 2 G6PD sequence (1174336) did not state whether the sequence is encoding for the G6PD present in the cytosol or in the chloroplast. From the alignment and from this phylogenetic tree, we infer that the sequence almost certainly encodes for a chloroplast G6PD. Mutations in human G6PD We next analyzed the known mutations of human G6PD (microevolution) in relation to the macroevolutionary conservation of the protein sequence (Fig. 4). The majority of mutations causing G6PD deficiency are missense, and obvious ‘null’ mutations (early nonsense mutations, mutations destroying the reading frame, mutations in the substrate binding or coenzyme binding regions) have never been obMACRO- AND MICROEVOLUTION OF G6PD DISCUSSION The G6PD gene is unique among housekeeping genes because of the existence of a large number of human mutants that have been discovered, of which many are polymorphic rather than sporadic. Analyzing this system therefore affords an opportunity to relate the naturally occurring genetic variation in the human species (microevolution) to sequence information available for G6PD from more than 50 other organisms (macroevolution). G6PD is not ubiquitous Knowledge of the complete sequence of the genome of several microorganisms has been one of the most significant advances in genomics research in the past 5 years. A remarkable implication is that for the first time it is possible to ask directly not only what genes can be found in an organism, but also what genes are lacking. To our surprise, we found no recognizable G6PD sequence in seven microorganisms whose genome had been fully sequenced. Indeed, three of the microorganisms that lack G6PD are Eubacteria, 489 Figure 3. Conservation and 3-dimensional structure. Each bar represents the distribution of G6PD amino acids among the various conservation categories (see text) for exposed aa, total aa, and buried aa, respectively. with small and ‘defective’ genomes, and parasitic habit (Mycoplasma genitalium and M. pneumoniae, Rickettsia prowazekii). We surmise that all three might be able to capitalize on the G6PD activity of the host cell. The other four microorganisms that lack G6PD (Archaeoglobus fulgidus, Methanococcus jannaschii, Meth- Figure 4. Microevolution of the G6PD molecule. This diagram reflects a synoptic analysis of 52 different G6PD sequences and of 118 human mutations. Human G6PD is used as the reference sequence and the exons are in black boxes; the coding sequence starts in exon 2. In the lettering of human G6PD, aa similarity (cfr. Fig. 1) is reported with the following color code: White letters in blue field, 100%; black letters in green field, 76 –99%; white letters in red field, 51–75%; black lowercase for aa with ⬍50% conservation. Below the human sequence, aa replacements in human mutants (mutations) are shown in uppercase in a black field if they have a severe clinical phenotype, in boldface underlined lowercase if they have a mild clinical phenotype, in italics lowercase if they do not have any known clinical phenotype, and in lowercase if they have been observed only in association with other mutations; ⌬ indicates a deleted residue; Z indicates a nonsense mutation; 兰 indicates a mutation that interferes with splicing. The human mutations are from ref 22 490 Vol. 14 March 2000 The FASEB Journal NOTARO ET AL. TABLE 2. Human G6PD deficiency mutations relate to evolutionary conservationa Amino acid residue with increasing similarity conservation, n (%) 1) G6PD aa sequence 2) Mutations, mildc 3) Mutations, severeb 4) Mutations, totalh ⬍50% 50–75% 76–99% 100% Total 151 (29.3) 10e (18.6) 9d (18.4) 19 (18.4) 128f (24.9) 18 (33.3) 13 (26.5) 31g (30.1) 180f (34.9) 22 (40.7) 24e (49) 46g (44.7) 56 (10.9) 4 (7.4) 3 (6.1) 7 (6.8) 515 54 49 103 a Of 115 known human mutations that affect G6PD activity (see Fig. 1), we have included in this analysis only 103: the 99 missense mutations b that are not found in combination with other mutations and the 4 single amino acid deletions. Associated with chronic nonspherocytic c d hemolytic anemia (WHO class I). Associated with the risk of acute hemolytic anemia (WHO class II and III). Of these nine mutations, two are in frame deletions and seven are in the subunit interface; it appears that elsewhere a simple replacement of a nonconserved aa is most e f unlikely to cause a severe phenotype. Of these mutations, one is an in-frame deletion. The aa in these two cells add up to 308, or g h 59.8% of the entire sequence. The mutations in these two cells add up to 77, or 74.8% of the total number of mutations analyzed. The frequencies of mutations observed in the four conservation classes (line 4) when compared with those expected based on the respective aa 2 frequencies (line 1) yielded a value of 9.36 (P ⬍ 0.03). anobacterium thermoautotrophicum, Pyrococcus horikoshii) are Archea, which grow in an oxygen-poor habitat; we surmise that they do not need G6PD because they do not need to defend themselves against oxidative stress. These evolutionary findings, while showing that life without G6PD is possible, provide independent confirmation for the notion derived from targeting the G6PD gene in mouse embryonic stem cells—namely, that in the organisms that do have G6PD, its only indispensable function consists not in pentose synthesis, but rather in supplying reductive potential in the form of NADPH (28). NADPH in turn serves as a defense against oxidative stress as well as for the synthesis of nitric oxide (29, 30). Phylogenesis The phylogenetic tree we obtained for G6PD clearly supports a common evolutionary origin throughout living organisms and is consonant with taxonomy (Fig. 2). Our findings on the evolution of the G6PD gene are reminiscent of those that have been reported with other housekeeping genes, most of which are ancient in evolutionary history (31): namely, the degree of homology bears a broadly inverse correlation to evolutionary distances, and certain critical regions are nearly completely conserved. P. falciparum seems to be at an evolutionary dead-end, as is often the case with parasitic protozoa. The absence of G6PD in all the Archea that have been fully sequenced is intriguing, but not unique. Eight hundred sixty-four clusters of orthologous groups (COG) have been defined on the basis of consistent patterns of sequence similarities when comparing protein sequences encoded in eight complete genomes from six major phylogenetic lineages (32). Each COG consists of individual proteins or groups of paralogs from at least three lineages. Indeed, 26% of COG show the same phylogenetic pattern of G6PD (presence in Bacteria and Eukarya, absence in Archaea) (33). This phylogenetic pattern suggests that G6PD gene has been lost in the common ancestor of Archaea. Alternatively, the common ancestor of Archea and Eukaria did not have G6PD, and Eukaria later acquired G6PD gene by horizontal transfer from Bacteria. Figure 5. Conservation and G6PD variants. The upper bar represents the distribution of G6PD mutants among the various conservation categories (see text). For comparison, the lower bar represents the distribution of all amino acids in human G6PD among the various conservation categories. MACRO- AND MICROEVOLUTION OF G6PD 491 Evolutionary conservation and 3-dimensional structure In general, one would expect in an enzyme that is a globular protein that the active center and the hydrophobic core would show a high degree of conservation (34), and we have validated this principle in the case of G6PD: 64.2% of buried (but only 35.8% of exposed) amino acid residues are highly or fully conserved (Fig. 1, Fig. 3; P⬍0.001). In addition, we note that in a homodimer the subunit interface is structurally special because each amino acid within that region is represented twice, in symmetrical positions, on the contact surface. As a result, any amino acid replacement within this region will produce two nearby changes in the structure of the protein. In fact, we have found that the subunit interface probably accounts for the majority of the conserved blocks in the G6PD sequence on a longrange evolutionary scale, suggesting that constraints against sequence change are greater than average in that region. Genetic variation in human G6PD Comparing the distribution of mutations in a human gene with its evolutionary history is a powerful tool for pinpointing the function of domains and even of individual aa within a protein. To do this, it is useful to consider the range of potential phenotypic (clinical) consequences of different types of mutations in a housekeeping gene such as G6PD. 1) Null mutations have never been observed. We have obtained mice heterozygous for G6PD deficiency from G6PD null embryonic stem cells (28), but no hemizygous G6PD-deficient mice; we have recently found that the condition is lethal at an early stage in embryonic development (35). 2) Mutations that compromise drastically substrate binding, coenzyme binding, or the catalytic mechanism would be expected to affect G6PD function in all cells to approximately the same extent and they ought to cluster in the respective regions of the sequence; no such clusters are seen (Fig. 4). We presume that mutations with such drastic effects may frequently be lethal. 3) Mutations that cause instability of the G6PD protein molecule would be expected to produce marked deficiency of G6PD in red cells (much more than in other cells), because these cells have a long life span after they have lost capacity for protein synthesis (36). The majority of known human G6PD-deficient variants fall into this category. 4) The only cluster of mutations emerges between aa 380 and 410, and corresponds to the subunit interface in the enzymatically active G6PD dimer (19). Moreover, nearly all of the mutations in this cluster cause a severe phenotype (class I); indeed, 34% of all class I mutations fall 492 Vol. 14 March 2000 within this 6% of the entire sequence. This highlights the critical role of precisely fitting noncovalent interactions for the formation and stability of the dimeric molecule. The distribution of mutations in relationship to the evolutionary history of the G6PD sequence shows a definite and peculiar pattern: more than two-thirds of the aa replacements that cause G6PD deficiency in humans are in highly and moderately conserved aa, whereas relatively few are in fully or poorly conserved aa (Fig. 5, P⬍0.03). Fully conserved amino acids are under-represented in G6PD-deficient mutants, presumably because of a higher probability that their replacement may be lethal. Poorly conserved amino acids are also under-represented, presumably because in many cases their replacement may not cause G6PD deficiency and therefore such mutations may go undetected. In fact, of the 19 mutants in this group, 7 are in the subunit interface region (see above) and 3 are in-frame deletions rather than missense mutations; of the remaining, none has a severe phenotype, confirming that replacements of these nonconserved amino acids are generally well tolerated. By contrast, highly and moderately conserved amino acids are over-represented in G6PDdeficient mutants; indeed, two-thirds of these mutants affect such amino acids residues, as though the consequence of their replacement is often serious enough to cause instability but not sufficiently serious to be lethal. This finding is a novel modification of the notion that mutations associated with genetic defects tend to affect the amino acids residues that are most conserved. Human polymorphic mutations vs. human sporadic mutations Extensive databases on human disease genes in several cases comprise 100 or more mutations (37). However, G6PD is the only case of a housekeeping gene in which many (about one-half; see Table 2) of the known mutations are both potentially pathogenic and polymorphic2 as a result of malaria selection (38, 39). Indeed, each of the mutant alleles concerned constitutes an example of balanced polymorphism, whereby the deleterious phenotypic consequences of the mutation are balanced by the resistance that it confers against Plasmodium falciparum. By contrast, we can presume that each of the sporadic G6PD mutants has remained sporadic because its phenotypic consequences are too deleterious to be balanced by the resistance it might confer 2 The other pertinent case is of course that of hemoglobin, produced by two highly tissue specific (nonhousekeeping) genes. Polymorphic mutants of both the ␣ and the  globin genes are known, although they are not as numerous as the G6PD mutants. The FASEB Journal NOTARO ET AL. against Plasmodium falciparum. If we analyze separately the severe sporadic mutations and the mild polymorphic mutations against the backdrop of macroevolution, we find that the former are slightly more frequent in the 76 –99% conservation bracket and the latter are more frequent in the 50 –75% conservation bracket (see Table 2). However, with the number of mutants known thus far, the difference is not yet significant. 5. 6. 7. 8. CONCLUSION 9. The evolutionary conservation of a housekeeping gene such as G6PD is greater than that of tissuespecific genes, presumably because the latter may require more specific adaptation to the physiology of individual organisms. Total lack of G6PD is almost certainly lethal in mammals, but partial G6PD deficiency confers biological advantage with respect to malaria. Most of the aa replacements causing G6PD deficiency take place in positions that are conserved, but less than fully conserved. Thus, the range of permissible G6PD mutations in the microevolution of the human species bears a clear imprint of the macroevolutionary history of this gene. With increasing numbers of full genome sequences becoming available, the approach we have used for the analysis of G6PD mutations is likely to be generally applicable to genes (particularly housekeeping genes) underlying other human diseases. 10. 11. 12. 13. 14. 15. 16. 17. We thank Dr. Margaret Adams for providing the coordinates of the human G6PD model, Dr. Mike Rosen for valuable help with the analysis of the 3-dimensional structure, Drs. Nicola Pavletich and Dr. Martin Wiedmann for critical reading of the manuscript, and Dr. Howard T. Thaler for helpful suggestions. We also thank all colleagues of the G6PD groups in Ibadan (Nigeria), Napoli (Italy), and London (UK). This work was supported by MSKCC and NIH (PO1 HL 59312). R.N. is on leave of absence from the Hematology Division of Federico II University, Naples, Italy. Note added in proof: In a publication that appeared after this paper was submitted, Cheng et al. [J. Biomed. Sci. (1999) 6, 106 –114] have reported an independent analysis of human mutations in relation to amino acid conservation among 23 G6PD sequences from different organisms; some of their conclusions are in good agreement with ours. 18. 19. 20. 21. 22. 23. REFERENCES 24. 1. Luzzatto, L., and Mehta, A. (1995) Glucose 6-phosphate dehydrogenase deficiency In The Metabolic and Molecular Basis of Inherited Disease (Scriver, C. R., Beaudet, A. L., Sly, W. S., and Valle, D., eds) pp. 3367–3398, McGraw-Hill, New York 2. Beutler, E. (1991) Glucose 6-phosphate dehydrogenase deficiency. N. Engl. J. Med. 324, 169 –174 3. Warburg, O., and Christian, W. (1932) Über ein neues Oxydationsferment und sein Absorptionsspektrum. Biochem. Z. 254, 438 – 458 4. Warburg, O., and Christian, W. (1931) Über Aktivierung der robinsonschen Hexosemono-phosphorsäure in roten Blutzellen MACRO- AND MICROEVOLUTION OF G6PD 25. 26. and die Gewinnung aktivierender Fermentlösung. Biochem. Z. 242, 206 –227 Luzzatto, L., and Battistuzzi, G. (1985) Glucose-6-phosphate dehydrogenase. Adv. Hum. Genet. 14, 217–329 Martini, G., Toniolo, D., Vulliamy, T., Luzzatto, L., Dono, R., Viglietto, G., Paonessa, G., D’Urso, M., and Persico, M. G. (1986) Structural analysis of the X-linked gene encoding human glucose 6-phosphate dehydrogenase. EMBO J. 5, 1849 –1855 Philippe, M., Larondelle, Y., Lemaigre, F., Mariame, B., Delhez, H., Mason, P., Luzzatto, L., and Rousseau, G. G. (1994) Promoter function of the human glucose-6-phosphate dehydrogenase gene depends on two GC boxes that are cell specifically controlled. Eur. J. Biochem. 226, 377–384 Ursini, M. V., Scalera, L., and Martini, G. (1990) High levels of transcription driven by a 400 bp segment of the human G6PD promoter. Biochem. Biophys. Res. Commun. 170, 1203–1209 Persico, M. G., Viglietto, G., Martini, G., Toniolo, D., Paonessa, G., Moscatelli, C., Dono, R., Vulliamy, T., Luzzatto, L., and D’Urso, M. (1986) Isolation of human glucose-6-phosphate dehydrogenase (G6PD) cDNA clones: primary structure of the protein and unusual 5⬘ non-coding region. Nucleic Acids Res. 14, 2511–2522; 7822 Ozols, J. (1993) Isolation and the complete amino acid sequence of lumenal endoplasmic reticulum glucose-6-phosphate dehydrogenase. Proc. Natl. Acad. Sci. USA 90, 5302–5306 Hino, Y., and Minakami, S. (1982) Hexose-6-phosphate dehydrogenase of rat liver microsomes. Isolation by affinity chromatography and properties. J. Biol. Chem. 257, 2563–2568 Smith, R. F., and Smith, T. F. (1990) Automatic generation of primary sequence patterns from sets of related protein sequences. Proc. Natl. Acad. Sci. USA 87, 118 –122 Nicholas, K. B., and Nicholas, B. J. (1997) GeneDoc. Pittsburgh Supercomputer Center Dayhoff, M. O. (1979) Atlas of Protein Sequences and Structure, Vol. 5, Suppl. 3, National Biomedical Research Foundation, Washington, D.C. Nicholas, K. B., and Nicholas, H. B. J. (1997) Analysis and visualization of genetic variant. http://wwwcriscom/⬃Ketchup/genedocshtml Taylor, W. R. (1986) The classification of amino acid conservation. J. Theor. Biol. 119, 205–218 Page, R. D. (1996) TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12, 357–358 Felsenstein, J. (1989) PHYLIP—phylogeny inference package (version 3.2). Cladistic 5, 164 –166 Naylor, C. E., Rowland, P., Basak, A. K., Gover, S., Mason, P. J., Bautista, J. M., Vulliamy, T. J., Luzzatto, L., and Adams, M. J. (1996) Glucose 6-phosphate dehydrogenase mutations causing enzyme deficiency in a model of the tertiary structure of the human enzyme. Blood 87, 2974 –2982 Guex, N., and Peitsch, M. C. (1997) SWISS-MODEL and the Swiss-dbViewer: an environment for comparative protein modeling. Electrophoresis 18, 2714 –2723 Goldman, N., Thorne, J. L., and Jones, D. T. (1998) Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149, 445– 458 Vulliamy, T., Luzzatto, L., Hirono, A., and Beutler, E. (1997) Hematologically important mutations: glucose-6-phosphate dehydrogenase. Blood Cells Mol. Dis. 23, 302–313 O’Brien, E., Kurdi-Haidar, B., Wanachiwanawin, W., Carvajal, J. L., Vulliamy, T. J., Cappadoro, M., Mason, P. J., and Luzzatto, L. (1994) Cloning of the glucose 6-phosphate dehydrogenase gene from Plasmodium falciparum. Mol. Biochem. Parasitol. 64, 313–326 Scopes, D. A., Bautista, J. M., Vulliamy, T. J., and Mason, P. J. (1997) Plasmodium falciparum glucose-6-phosphate dehydrogenase (G6PD)-the N-terminal portion is homologous to a predicted protein encoded near to G6PD in Haemophilus influenzae. Mol. Microbiol. 23, 847– 848 Bautista, J. M., Mason, P. J., and Luzzatto, L. (1995) Human glucose-6-phosphate dehydrogenase. Lysine 205 is dispensable for substrate binding but essential for catalysis. FEBS Lett. 366, 61– 64 Rowland, P., Basak, A. K., Gover, S., Levy, H. R., and Adams, M. J. (1994) The 3-dimensional structure of glucose 6-phos- 493 27. 28. 29. 30. 31. 32. 33. 494 phate dehydrogenase from Leuconostoc mesenteroides refined at 2.0 A resolution. Structure 2, 1073–1087 Au, S. W., Naylor, C. E., Gover, S., Vandeputte-Rutten, L., Scopes, D. A., Mason, P. J., Luzzatto, L., Lam, V. M., and Adams, M. J. (1999) Solution of the structure of tetrameric human glucose 6-phosphate dehydrogenase by molecular replacement. Acta Crystallogr. D. Biol. Crystallogr. 55, 826 – 834 Pandolfi, P. P., Sonati, F., Rivi, R., Mason, P., Grosveld, F., and Luzzatto, L. (1995) Targeted disruption of the housekeeping gene encoding glucose 6-phosphate dehydrogenase (G6PD): G6PD is dispensable for pentose synthesis but essential for defense against oxidative stress. EMBO J. 14, 5209 –5215 Tsai, K. J., Hung, I. J., Chow, C. K., Stern, A., Chao, S. S., and Chiu, D. T. (1998) Impaired production of nitric oxide, superoxide, and hydrogen peroxide in glucose 6-phosphate-dehydrogenase-deficient granulocytes. FEBS Lett. 436, 411– 414 Nathan, C. (1992) Nitric oxide as a secretory product of mammalian cells. FASEB J. 6, 3051–3064 Straus, D., and Gilbert, W. (1985) Genetic engineering in the Precambrian: structure of the chicken triosephosphate isomerase gene. Mol. Cell. Biol. 5, 3497–3506 Tatusov, R. L., Koonin, E. V., and Lipman, D. J. (1997) A genomic perspective on protein families. Science 278, 631– 637 Tatusov, R. L., Koonin, E. V. K., and Lipman, D. J. (1998) A natural system of gene families from complete genomes. http:// www.ncbi.nlm.nih.gov/COG Vol. 14 March 2000 34. 35. 36. 37. 38. 39. Chothia, C., and Lesk, A. M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823– 826 Longo, L., Rosti, V., Gaboli, M., Tabarini, D., Pandolfi, P. P., and Luzzatto, L. (1997) Mice heterozygous for complete glucose 6-phosphate dehydrogenase (G6PD) deficiency are viable and show somatic cell selection. Blood 90 (Suppl. 1), 408a Mason, P. J. (1996) New insights into G6PD deficiency. Br. J. Haematol. 94, 585–591 Krawczak, M., and Cooper, D. N. (1997) The human gene mutation database. Trends Genet. 13, 121–122 Luzzatto, L. (1979) Genetics of red cells and susceptibility to malaria. Blood 54, 961–976 Ruwende, C., Khoo, S. C., Snow, R. W., Yates, S. N., Kwiatkowski, D., Gupta, S., Warn, P., Allsopp, C. E., Gilbert, S. C., Peschu, N., et al. (1995) Natural selection of hemi- and heterozygotes for G6PD deficiency in Africa by resistance to severe malaria. Nature (London) 376, 246 –249 The FASEB Journal Received for publication February 28, 1999. Revised for publication September 16, 1999. NOTARO ET AL.