Natural Selection on the gag, pol, and env Genes of Human Immunodeficiency Virus 1 (HIV-1)

Stacy A. Seibert, Carina Y. Howell, Marianne K. Hughes, and Austin L. Hughes
Department of Biology and Institute of Molecular Evolutionary Genetics, Pennsylvania State University

Natural selection on polymorphic protein-coding loci of human immunodeficiency virus-1 (HIV-1), the more geographically widespread of the two viruses causing human acquired immune deficiency syndrome (AIDS), was studied by estimating the rates of nucleotide substitution per site in comparisons among alleles classified in families of related alleles on the basis of a phylogenetic analysis. In the case of gag, pol, and gp41, the rate of synonymous substitution generally exceeded that of nonsynonymous substitution, indicating that these genes are subject to purifying selection. However, in the case of several of the variable (V) regions of the gp120 gene, especially V2 and V3, comparisons within and between families often showed a significantly higher rate of nonsynonymous than of synonymous nucleotide substitution. This pattern of nucleotide substitution indicates that positive Darwinian selection has acted to diversify these regions at the amino acid level. The V regions have been identified as probable epitopes for antibody recognition; therefore, avoidance of such recognition seems likely to be the basis for positive selection on these regions. By contrast, regions of HIV-1 proteins identified as epitopes for T cell recognition show no evidence of positive selection and are often highly conserved at the amino acid level. These results suggest that selection favoring avoidance of T cell recognition has not been a major factor in the history of HIV-1 and thus that avoidance of T cell recognition is not likely to be a major factor in the pathogenesis of AIDS.
Key words: env gene, escape mutants, gag gene, HIV-1, pol gene, positive selection.

Address for correspondence and reprints: Austin L. Hughes, Department of Biology, Mueller Laboratory, Pennsylvania State University, University Park, Pennsylvania 16802; E-mail: austin@hugaus.bio.psu.edu.

Mol. Biol. Evol. 12(5):803-813. 1995. © 1995 by The University of Chicago. All rights reserved. 0737-4038/95/1205-0009$02.00

Introduction

Human immunodeficiency virus 1 (HIV-1), the more geographically widespread of the two viruses causing human acquired immunodeficiency syndrome (AIDS), is known to exhibit high levels of genetic polymorphism even within a single patient (Hahn et al. 1986; Fisher et al. 1988). Since viruses with RNA genomes are known to have high mutation rates (Holland et al. 1982), one possible explanation for HIV-1 polymorphism is that it is selectively neutral and that the high level of polymorphism is a consequence of the high mutation rate. An alternative hypothesis is that at least some HIV-1 polymorphisms are selectively maintained and that this natural selection arises as a result of host immune defenses against the virus (Simmonds et al. 1990; Phillips et al. 1991; Holmes et al. 1992). So far it has been difficult to decide between these two hypotheses.

A powerful method of discriminating between positive Darwinian selection and neutral polymorphism is to compare the rates of synonymous and nonsynonymous nucleotide substitutions per site (Hughes and Nei 1988, 1989). In the case of positive selection favoring diversity at the amino acid level, the rate of nonsynonymous substitution is found to exceed that of synonymous substitution. Under purifying selection, as occurs in the case of most protein-coding genes, the synonymous rate is higher (Kimura 1977). Simmonds et al.
(1990) analyzed sequences of a highly variable region (V3) of the surface protein gp120 from HIV-1 and reported that the ratio of synonymous substitutions per site to nonsynonymous substitutions per site is 0.67, which is consistent with the hypothesis that this polymorphism is maintained by positive selection. However, these authors did not report the results of any statistical tests of the difference between rates of synonymous and nonsynonymous substitution; therefore, they did not rule out the possibility that the observed bias toward nonsynonymous substitution in this region was due to chance. By contrast, an analysis of a cytotoxic T cell (CTL) epitope in the gag protein of the simian immunodeficiency virus SIVmac failed to show statistically significant evidence of positive Darwinian selection, although considerable polymorphism was observed in this region and the occurrence of mutations eliminating CTL recognition was documented (Chen et al. 1992).

To study the question of positive selection on HIV-1 genes in more detail, we estimated rates of synonymous and nonsynonymous substitution per site in the gag, pol, and env genes of HIV-1 from published sequences. Because the host's immune system has been hypothesized to be a source of selection favoring diversity of HIV proteins, we analyzed separately regions reported to be involved in immune recognition. Even when an enhanced rate of nonsynonymous substitution is observed in comparisons of closely related sequences, the same pattern may not be observed when more distantly related sequences are compared (Hughes and Nei 1988, 1989; Tanaka and Nei 1989). Presumably this occurs because selectively favored nonsynonymous substitutions become saturated over time, allowing the rate of synonymous substitution to overtake the rate of nonsynonymous substitution (Hughes and Nei 1988).
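The saturation effect described above can be illustrated with the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which converts an observed proportion p of differing sites into an estimated number of substitutions per site and is the correction applied in Nei and Gojobori's (1986) method. The following is a minimal sketch rather than the authors' code; the function name `jukes_cantor` and the sample proportions are illustrative assumptions, not values from the paper.

```python
import math


def jukes_cantor(p):
    """Estimated substitutions per site from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75); at 0.75 the sites are fully saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Illustrative proportions only (hypothetical, not data from the paper).
for p in (0.05, 0.30, 0.60, 0.74):
    print(f"p = {p:.2f}  ->  d = {jukes_cantor(p):.3f}")
```

As p approaches 0.75 the estimate grows without bound, so further substitutions at sites that have already changed are barely reflected in the observed proportion of differences. This is why a class of sites that changes quickly can appear to stop accumulating differences in comparisons of distantly related sequences.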
Because of this saturation effect, we expected that evidence of positive selection on HIV-1 genes would be most apparent in comparisons of closely related genes. In order to identify such families of genes, we conducted phylogenetic analyses of HIV-1 gag, pol, and env genes.

The vertebrate immune system includes two different systems for molecular recognition that operate in quite different ways. T cell receptors (present on both CTL and helper T cells) recognize short peptides derived from intracellularly processed foreign proteins that are bound and presented on the cell surface by class I (in the case of CTL) and class II (in the case of helper T cells) major histocompatibility complex (MHC) molecules. Immunoglobulins, by contrast, recognize extracellular foreign antigens in their native state. By analyzing putative T cell and immunoglobulin epitopes separately, we obtained information regarding the relative importance of selection by these two components of immune recognition on HIV proteins.

Methods

Sequences Analyzed

The genomes of HIV-1 and HIV-2 and related lentiviruses contain three major genes, gag, pol, and env; proteins encoded by these genes make up the bulk of the infective virion (Arnold and Arnold 1991). In each case, the initial product of translation is a polyprotein that is then broken down into separate proteins. The gag gene encodes the virion structural proteins p17 (matrix), p24 (capsid), and p7/p9 (nucleocapsid) (figs. 1 and 2A). The pol gene encodes protease, the component proteins of reverse transcriptase (p66, p51, and RNAse H), and integrase (figs. 1 and 2B). The env gene encodes the envelope glycoproteins gp120 and gp41 (figs. 1 and 2C). Glycoprotein gp120 recognizes the CD4 receptor and thus enables entry into CD4+ T cells, and gp120 is also the primary target of anti-HIV antibodies (Arnold and Arnold 1991). Comparison of gp120 sequences has revealed five hypervariable regions (V1-V5; fig. 2C), and prediction of antigens likely to be recognized by host immunoglobulins suggested that these antigens would mainly be found in the hypervariable regions (Modrow et al. 1987). Simmonds et al. (1990) computed rates of synonymous and nonsynonymous nucleotide substitution per site in the V3 and flanking region and reported that the nonsynonymous rate was higher. However, they did not report the results of any statistical test of the difference in rates, and because they studied PCR-amplified sequences in this region, they could not compare the rates of nucleotide substitution in V3 with those in other regions of gp120.

FIG. 1. HIV virion. Abbreviations: IN, integrase; PR, protease; RNA, genomic RNA; RT, reverse transcriptase. Redrawn with changes from Arnold and Arnold 1991.

Regions of the HIV proteins that are bound by host MHC molecules and presented to T cells, or T cell epitopes (TCE), have been identified experimentally by the methods of cellular immunology or by elution and direct sequencing of MHC-bound peptides. The former approach generally does not determine the exact peptide bound by the MHC molecule but rather a broader region that presumably contains one or more such peptides. To test the hypothesis that natural selection acting on HIV proteins favors evasion of T cell recognition (Phillips et al. 1991), we compared rates of nucleotide substitution in the remainder of the genes with those in TCE identified (1) by cellular methods (Schrier et al. 1988; Clerici et al. 1989; Wahren et al. 1989; Krowka et al. 1990; De Groot et al. 1991; Johnson et al. 1991; Phillips et al. 1991; Chen et al. 1992) or (2) by elution of peptides (Huet et al. 1990; Takashi et al.
1991; Tsomides et al. 1991; Dai et al. 1992; Johnson et al. 1992, 1993) (fig. 2).

FIG. 2. Amino acid sequence of component proteins encoded by (A) gag, (B) pol, and (C) env from the LAI isolate of HIV-1 (Wain-Hobson et al. 1985). Vertical lines indicate boundaries between proteins. T cell epitopes identified by cellular methods are underlined; those identified by elution of peptides are marked with +. Variable regions V1-V5 of env gp120 are overlined. Sources for protein boundaries are as follows: (A) p17, p24, and p7/p9 (McCutchan et al. 1992). (B) Protease (Loeb et al. 1989), reverse transcriptase (Jupp et al. 1991), integrase (Vink and Plasterk 1993). (C) gp120 and gp41 (Cheng-Mayer et al. 1990).

FIG. 4. Phylogenetic tree of pol sequences based on proportion amino acid difference (p). Sequence identification, abbreviations, and tests of internal branch length are as in fig. 3.

DNA sequences for gag, env, and pol genes of HIV-1, HIV-2, and SIV were collected from the GenBank database. The numbers of sequences used were as follows: 45 gag sequences, 46 gp120 sequences, 35 gp41 sequences, and 33 pol sequences.
The sequences are identified by accession number in the phylogenetic trees (figs. 3-6).

FIG. 3. Phylogenetic tree of gag sequences based on proportion amino acid difference (p). Sequences are identified by their GenBank accession number. Abbreviations are as follows: agm, African green monkey; mac, macaque; stm, stump-tailed macaque; FIV, feline immunodeficiency virus. Tests of the hypothesis that the length of an internal branch is equal to zero: *P < 0.05; **P < 0.01; ***P < 0.001.

Statistical Methods

Sequences were aligned at the amino acid level by the CLUSTAL V program (Higgins et al. 1992) using default settings, and in some cases were corrected by eye to improve the alignment. When a given set of sequences was compared, any codon at which the alignment postulated a gap in any sequence was excluded from each set of pairwise comparisons, so that a comparable data set was used in each comparison. Phylogenetic trees were constructed by the neighbor-joining method (Saitou and Nei 1987) on the basis of the proportion of amino acid difference; amino acid sequences were used because synonymous nucleotide sites were saturated in many comparisons. The statistical significance of internal branches in phylogenetic trees was tested by Rzhetsky and Nei's (1992) method.

FIG. 5. Phylogenetic tree of gp120 sequences based on proportion amino acid difference (p). All sequences are HIV-1; tests of internal branch length are as in fig. 3.

FIG. 6. Phylogenetic tree of gp41 sequences based on proportion amino acid difference (p). All sequences are HIV-1; tests of internal branch length are as in fig. 3.

The phylogenetic analyses were used to identify families of closely related sequences, most of which corresponded to clusters supported by statistically significant internal branches in the trees. Within these families, we estimated the number of synonymous nucleotide substitutions per site (dS) and the number of nonsynonymous nucleotide substitutions per site (dN) using Nei and Gojobori's (1986) method. Standard errors of dS and dN for sets of pairwise comparisons were estimated by Nei and Jin's (1989) method.

To test for natural selection acting on gag- and pol-encoded proteins, dS and dN were estimated separately for TCE and for the remainder of the gene (fig. 2A-2B). In the case of env, TCE mentioned in the literature include some that overlap the V3, V4, and V5 regions of gp120 (fig. 2C). We estimated dS and dN for V1-V5 of gp120; for TCE in gp120 other than those overlapping one of the V regions; for the remainder of gp120; for TCE in gp41 (Schrier et al. 1988; Wahren et al. 1989); and for the remainder (fig. 2C).

Table 1
Numbers of Synonymous (dS) and Nonsynonymous (dN) Nucleotide Substitutions per 100 Sites (±SE) in Comparisons of HIV-1 gag Alleles

                       REMAINDER                        TCE
                dS            dN                dS            dN
Family 1        11.2 ± 2.1    2.9 ± 0.5***      15.2 ± 2.4    3.2 ± 0.5***
Family 2         8.8 ± 2.1    3.0 ± 0.6**       13.8 ± 2.5    6.0 ± 0.9**
Family 3         3.7 ± 1.1    1.9 ± 0.4          4.0 ± 1.2    2.1 ± 0.4
Family 4        55.7 ± 6.5    5.8 ± 0.9***      36.0 ± 4.4    7.4 ± 0.9***
Family 5        21.9 ± 3.4    5.7 ± 0.9***      21.6 ± 3.3    9.6 ± 1.1***
Family 6        10.9 ± 1.4    2.9 ± 0.4***      12.4 ± 1.4    3.7 ± 0.4***
Family 7        13.2 ± 2.6    5.1 ± 0.8**       17.9 ± 3.0    6.1 ± 0.9***
Overall mean    15.5 ± 4.8    3.5 ± 1.2*        18.4 ± 2.2    5.3 ± 0.6***

NOTE. Families are as in fig. 3. Standard errors of mean dS and dN are computed by Nei and Jin's (1989) method. Tests of the hypothesis that dS = dN: *P < 0.05; **P < 0.01; ***P < 0.001.

Table 2
Numbers of Synonymous (dS) and Nonsynonymous (dN) Nucleotide Substitutions per 100 Sites (±SE) in Comparisons of pol Polyprotein Gene Alleles

                          FAMILY 1 (HIV-1)              FAMILY 2 (HIV-2)               FAMILY 3 (SIV)
                          dS            dN              dS             dN              dS            dN
Protease                  13.4 ± 3.3    2.1 ± 0.6***    28.2 ± 4.6     4.3 ± 0.9***    33.1 ± 6.5    2.9 ± 0.8***
Reverse transcriptase:
  TCE                     17.1 ± 5.9    3.1 ± 1.1*      72.1 ± 23.0    3.2 ± 1.5**     18.3 ± 5.1    4.3 ± 1.3**
  Remainder               22.2 ± 1.3    2.6 ± 0.2***    46.3 ± 2.7     4.1 ± 0.4***    30.8 ± 3.3    3.6 ± 0.5***
Integrase:
  TCE                     11.2 ± 3.1    2.5 ± 0.7**      3.3 ± 3.4     0.0 ± 0.0        6.6 ± 3.0    0.9 ± 0.5
  Remainder               19.7 ± 2.1    1.6 ± 0.3***     6.6 ± 0.9     0.7 ± 0.2***     4.9 ± 1.2    0.8 ± 0.3***

NOTE. Families are as in fig. 4. Standard errors of mean dS and dN are computed by Nei and Jin's (1989) method. Tests of the hypothesis that dS = dN: *P < 0.05; **P < 0.01; ***P < 0.001.

Results

The phylogenetic tree of gag sequences (fig. 3) is rooted by use of a sequence from feline immunodeficiency virus (FIV) as an outgroup. As with previous phylogenetic analyses (Yokoyama 1991), HIV-1 and HIV-2 clustered separately, with HIV-2 closer to genes from simian viruses. The trees for pol (fig. 4), gp120 (fig. 5), and gp41 (fig. 6) were rooted by placing the root in the longest internal branch because in these cases FIV sequences were too distant for reliable alignment. In the case of pol, HIV-2 sequences clustered more closely with simian sequences (fig. 4).

In the case of gag, pol, and gp120, families of closely related sequences were identified for analysis of rates of nucleotide substitution. Each such family constituted a cluster of closely related sequences, and in most cases the cluster was supported by a statistically significant internal branch (figs. 3-5). The same families that were identified for gp120 (designated A-E in fig. 5) were seen on the gp41 tree (fig. 6), except for family E, for which gp41 sequences were not available.

Numbers of nucleotide substitutions per synonymous site and per nonsynonymous site were estimated within seven families of gag sequences (table 1). Both in regions identified as TCE and in the remainder of the gene, dS exceeded dN, and this difference was statistically significant for all but one family (table 1). This pattern is indicative of purifying selection. In the case of pol, rates of nucleotide substitution were estimated within three families, one from HIV-1, one from HIV-2, and one from SIV (table 2). In each case, dS was greater than dN both in TCE and in the remainder of the protease, reverse transcriptase, and integrase genes, and the difference was statistically significant for all comparisons but two (table 2).

The pattern of nucleotide substitution seen in the case of the env gene differed from that of either pol or gag. In families B, C, and E, dN was significantly greater than dS in the V2 region; and in families B and C, dN was significantly greater than dS in the V3 region (table 3). Family C was remarkable in that it showed dN significantly greater than dS in the V4 region (table 3). When overall means were computed for the five gp120 families, mean dN was significantly greater than mean dS in the case of V2 and V3; in the case of V1, V4, and V5 and the TCE, dS and dN did not differ significantly. In the remainder of gp120 dS was significantly greater than dN (table 3).
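Statements such as "dS was significantly greater than dN" rest on a test of the hypothesis dS = dN using the reported means and standard errors. The source does not say which test statistic was used, so the following is only a minimal sketch under two assumptions: a normal approximation, and independence of the dS and dN estimates. The names `z_test` and `stars` are hypothetical; the input values are the first dS/dN pair reported for gag families 1-3 in table 1.

```python
import math


def z_test(d_s, se_s, d_n, se_n):
    """Two-sided normal-approximation test of dS = dN from means and standard errors."""
    z = (d_s - d_n) / math.sqrt(se_s ** 2 + se_n ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return z, p


def stars(p):
    """Map a P value to the significance symbols used in the tables."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


# (dS, SE, dN, SE) per 100 sites, taken from table 1 (gag), first pair of columns.
REPORTED = {
    "family 1": (11.2, 2.1, 2.9, 0.5),
    "family 2": (8.8, 2.1, 3.0, 0.6),
    "family 3": (3.7, 1.1, 1.9, 0.4),
}

for name, values in REPORTED.items():
    z, p = z_test(*values)
    print(f"{name}: Z = {z:.2f}, P = {p:.4f} {stars(p)}")
```

Under these assumptions the sketch gives ***, ** and a nonsignificant result for the three pairs, which agrees with the symbols printed in table 1. A positive Z indicates dS > dN (purifying selection), and a negative Z indicates dN > dS.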
Unlike the V2 and V3 regions of gp120, gp41 showed dS significantly greater than dN in both TCE and the remainder of the gene in overall means for families A-D (table 3).

Table 4 shows comparisons of V1-V5, the remainder of gp120, and gp41 among the four families (A-D) for which gp120 and gp41 sequences were available. In three of six pairwise comparisons, dN was significantly greater than dS in V2. Likewise, in two comparisons, dN was significantly greater than dS in V1. However, no such pattern was seen in the case of V3-V5. In the remainder of gp120 and in gp41, dS was greater than dN in all comparisons, and the difference was significant in two cases for gp120 and in five cases for gp41 (table 4). When TCE were compared between families, dS was always greater than dN, although this difference was generally not significant (data not shown); therefore, data for the TCE were not included in table 4.

Table 4
Number of Synonymous (dS) and Nonsynonymous (dN) Nucleotide Substitutions per 100 Sites (±SE) in Comparisons between Families of HIV env Genes

               FAMILY A vs. FAMILY B            FAMILY A vs. FAMILY C            FAMILY A vs. FAMILY D
               dS             dN                dS             dN                dS             dN
gp120:
  V1            0.0 ± 0.0     34.6 ± 15.2*      18.0 ± 13.4    27.4 ± 6.6        15.8 ± 9.9     29.4 ± 5.9
  V2            1.3 ± 1.1     11.7 ± 3.7**       1.0 ± 1.0     21.6 ± 5.5***     10.8 ± 7.1     13.3 ± 4.0
  V3           19.2 ± 11.4    15.3 ± 5.0        18.9 ± 11.3    15.7 ± 5.3        12.1 ± 6.5     27.9 ± 7.5
  V4            7.8 ± 8.1     12.2 ± 5.6        35.0 ± 27.1    56.9 ± 19.0       51.8 ± 31.3    29.9 ± 14.6
  V5           17.2 ± 13.9    47.5 ± 15.8       21.2 ± 22.3     5.9 ± 3.3        37.9 ± 27.7    32.0 ± 10.1
  Remainder     5.8 ± 1.4      3.9 ± 0.6        11.9 ± 2.0      9.6 ± 0.9         9.2 ± 2.0      4.2 ± 0.7*
gp41            4.8 ± 1.0      2.7 ± 0.4         6.9 ± 1.4      3.1 ± 0.4*        9.8 ± 1.7      2.6 ± 0.4***

NOTE. Standard errors of mean dS and dN are computed by Nei and Jin's (1989) method. Tests of the hypothesis that dS = dN: *P < 0.05; **P < 0.01; ***P < 0.001.

Recent analysis of self peptides bound by human class I MHC molecules (i.e., peptides derived from self proteins and presented by class I MHC molecules on the surface of noninfected cells) showed that these peptides tend to be derived from proteins that have been highly conserved evolutionarily and from the most conserved regions of these proteins (Hughes and Hughes 1995). Furthermore, these peptides tend to be derived from relatively hydrophobic portions of proteins that are hydrophilic overall (Hughes and Hughes 1995). To test whether HIV-derived peptides bound by host class I MHC molecules have similar characteristics, we compared these peptides with their source proteins (table 5). So that we would have a measure of evolutionary conservation that was comparable among different genes, we made comparisons between sequences derived from two complete HIV genomes (LAI and ELI) (table 5). Since LAI (accession number K02013) and ELI (K03454) are not closely related (figs. 3-6), this comparison provides an indication of the degree of conservation over a long period of time relative to the deepest branch point of available HIV-1 sequences.

The results showed that the peptides were generally hydrophobic, as measured by the percentage of highly hydrophobic residues, and in most cases more hydrophobic than the remainder of the source protein (table 5). In three of the five peptides examined, there were no nonsynonymous differences between LAI and ELI in the gene region encoding the peptide but a substantial number of nonsynonymous differences outside the peptide (table 5). In the other two cases, dN in the peptide and the remainder of the gene did not differ significantly (table 5). In these cases, the amino acid differences that were observed in the peptides between LAI and ELI were generally conservative ones that did not affect the overall hydrophobicity of the peptide to a great extent (table 5).
Taken together, the data in table 5 suggest that, far from being subject to positive selection favoring diversity, peptides from HIV-1 proteins that are bound by the host class I MHC are generally derived from relatively conserved regions, and some are derived from highly conserved protein regions.

Discussion

Our analyses indicate that gp41, gag, and pol are subject to purifying selection overall. Furthermore, TCE derived from these proteins not only are not subject to positive selection favoring diversity at the amino acid level but actually tend to be derived from portions of the source proteins that are subject to purifying selection and thus are relatively conserved evolutionarily.

In the case of gp120, by contrast, positive Darwinian selection has acted to favor diversity at the amino acid level in certain regions. However, this positive selection has not acted in the same way on all families of gp120 genes. Positive selection was most frequently found on the V2 region. There was evidence of positive selection on this region in the B, C, and E families and in three of six comparisons between families (tables 3 and 4). In addition, there was evidence of positive selection on the V3 region in two families and on the V4 region in one family (table 3). Therefore, it appears that positive selection on gp120 has an episodic or opportunistic character. In other words, members of a given HIV-1 lineage, sharing a common ancestry and perhaps certain characteristics of their environment, can be temporarily subject to a type of selection that does not necessarily occur in the case of other such lineages. Differences in clinical stage and/or host genotype may contribute to differences in the selective environment.

It seems likely that the source of positive selection on gp120 is the vertebrate immune system and that diversity at the amino acid level is favored because it reduces or evades immune recognition by the host.
The V regions in which positive selection was found are putative immunoglobulin epitopes, and the V3 region has also been implicated as a TCE. Therefore, it seems possible that such selection can be caused by both the T cell and the immunoglobulin components of the host immune system. However, there are several lines of evidence suggesting that selection to avoid immunoglobulin recognition has been the predominant mode of positive selection on gp120. First, positive selection was never observed on any TCE except V3, which is also a putative immunoglobulin epitope. Second, in most families, gag and pol TCE showed significant evidence of purifying selection (tables 1-2); significant evidence of purifying selection was also found in the case of TCE in one gp41 family (table 3). Furthermore, three of the five known class I MHC-bound peptides from HIV-1 have been highly conserved over evolutionary time, including one each from pol, gp41, and gp120 (table 5).

Phillips and McMichael (1993) state that "the appearance of sequence variation which alters the ability of cytotoxic T cells to recognize antigens of the virus is good evidence that this form of immunity exerts a selective force." However, the mere occurrence of such mutations is not in itself evidence that they have been selectively favored. A significantly higher rate of nonsynonymous than of synonymous nucleotide substitution provides much more convincing evidence of positive selection. Alternatively, the hypothesis that an escape mutation has been positively selected can be supported by a study of the viral population within a given host that shows fixation of such a mutation, as recently shown for a hepatitis C virus mutation by Weiner et al. (1995).

Analysis of known class I MHC-bound peptides suggests that these are derived from often conserved, relatively hydrophobic regions (table 5). In this respect, these pathogen-derived peptides resemble self protein-derived peptides (Hughes and Hughes 1995). Hughes and Hughes (1995) suggested that any mechanism that leads to binding of relatively hydrophobic peptides derived from overall relatively hydrophilic proteins will tend to select evolutionarily conserved peptides, because such hydrophobic regions are often functionally important. Moreover, any mechanism that leads to binding of predominantly conserved peptides will be to the host's advantage in that it will minimize the likelihood that escape mutants arise in parasite populations (Hughes and Hughes 1995). HIV, however, is able to evade the immune system of its host despite the fact that relatively conserved peptides are presented to CTL by MHC molecules. This suggests that evasion of CTL cannot be the only mechanism involved in HIV's successful evasion of immune recognition. Indeed, the evidence of strong purifying selection in the case of many TCE is not consistent with the hypothesis that positive selection favoring avoidance of CTL recognition is a major factor in the pathogenesis of AIDS.

Table 4 (continued)

               FAMILY B vs. FAMILY C            FAMILY B vs. FAMILY D            FAMILY C vs. FAMILY D
               dS             dN                dS             dN                dS             dN
gp120:
  V1           17.1 ± 17.1    27.5 ± 11.0        0.0 ± 0.0     15.2 ± 8.8         8.0 ± 7.5     31.6 ± 8.0*
  V2            0.0 ± 0.0     15.5 ± 4.4***      4.7 ± 4.8     12.6 ± 4.0         7.0 ± 5.0     10.2 ± 3.0
  V3           11.8 ± 8.4      6.1 ± 2.3        11.4 ± 7.4     18.0 ± 5.3         8.0 ± 4.8     22.1 ± 6.1
  V4           21.9 ± 18.6    63.4 ± 20.1       18.0 ± 20.9    54.8 ± 22.0       10.5 ± 15.3    39.0 ± 16.6
  V5            4.2 ± 10.6    40.0 ± 18.7       24.1 ± 19.0    56.3 ± 21.4       34.1 ± 24.6    19.2 ± 11.2
  Remainder    12.3 ± 2.1      9.9 ± 0.9         8.6 ± 1.9      3.9 ± 0.6*       13.5 ± 2.3     10.0 ± 0.9
gp41            9.1 ± 1.4      3.8 ± 0.5***      9.6 ± 1.7      3.5 ± 0.5***      8.3 ± 1.4      3.3 ± 0.5***

Table 5
Percent of Highly Hydrophobic Residues (% HFO) and Nonsynonymous Substitutions per 100 Sites (dN) in HIV-1 Peptides Eluted from MHC Class I Molecules

                                                      % HFO                  dN(b)
GENE   PROTEIN   PEPTIDE(a)                           Peptide   Remainder    Peptide          Remainder
gag    p24       134-142   LAI   KRWIILGLNKIV         58.3      30.6          3.6 ± 3.6        7.2 ± 0.7
                           ELI   -----V------
env    gp120     346-353   LAI   FNCGGEFF             50.0      33.1          0.0 ± 0.0***    17.7 ± 1.4
                           ELI   --------
       gp41      73-82     LAI   ERYLKDQQLL           40.0      40.1          0.0 ± 0.0***     7.9 ± 1.1
                           ELI   ----------
       gp41      259-269   LAI   RLRDLLLIVTR          54.5      40.1         18.3 ± 9.5        7.9 ± 1.1
                           ELI   _____J___AV_
pol    RT        310-318   LAI   ILKEPVHGV            44.4      34.1          0.0 ± 0.0***     2.7 ± 0.5
                           ELI   ---------

(a) Numbering of residues and % HFO are based on the LAI sequences (GenBank accession number K02013). Differences between LAI and ELI (accession number K03454) are shown. Highly hydrophobic residues are C, F, I, L, M, V, W, Y.
(b) dN is computed between LAI and ELI sequences. Tests of the hypothesis that dN in the peptide equals that in the remainder of the protein: ***P < 0.001.

Acknowledgments

This research was supported by grants R01-GM34940 and K04-GM00614 from the National Institutes of Health to A.L.H.

LITERATURE CITED

ARNOLD, E., and G. F. ARNOLD. 1991. Human immunodeficiency virus structure: implications for antiviral design. Adv. Virus Res. 39:1-87.
CHEN, Z. W., L. SHEN, M. D. MILLER, S. H. GHIM, A. L. HUGHES, and N. L. LETVIN. 1992. Cytotoxic T lymphocytes do not appear to select for mutations in an immunodominant epitope of simian immunodeficiency virus gag. J. Immunol. 149:4060-4066.
CHENG-MAYER, C., M. QUIROGA, J. W. TUNG, D. DINA, and J. LEVY. 1990. Viral determinants of human immunodeficiency virus type 1 T-cell or macrophage tropism, cytopathogenicity, and CD4 antigen modulation. J. Virol. 64:4390-4398.
CLERICI, M., N. I. STOCKS, R. A. ZAJAC, R. N. BOSWELL, D. C. BERSTEIN, D. L. MANN, G. M. SHEARER, and J. A. BERZOFSKY. 1989. Interleukin-2 production used to detect antigen peptide recognition by T-helper lymphocytes from asymptomatic HIV-positive individuals. Nature 339:383-385.
DAI, L. C., K. WEST, R. LITTUA, K. TAKASHI, and F. A. ENNIS. 1992. Mutation of human immunodeficiency virus type 1 at amino acid 585 on gp41 results in loss of killing by CD8+ A24-restricted cytotoxic T lymphocytes. J. Virol. 66:3151-3154.
DE GROOT, A. S., M. CLERICI, A. HOSMALIN, S. H. HUGHES, D. BARND, C. W. HENDRIX, R. HOUGHTON, G. M. SHEARER, and J. A. BERZOFSKY. 1991. Human immunodeficiency virus reverse transcriptase T helper epitopes identified in mice and humans: correlation with a cytotoxic T cell epitope. J. Infect. Dis. 164:1058-1065.
FISHER, A. G., B. ENSOLI, D. LOONEY, A. ROSE, R. C. GALLO, M. S. SAAG, G. M. SHAW, B. H. HAHN, and F. WONG-STAAL. 1988. Biologically diverse molecular variants within a single HIV-1 isolate. Nature 334:440-444.
HAHN, B., G. M. SHAW, M. E. TAYLOR, R. R. REDFIELD, P. D. MARKHAM, S. Z. SALAHUDDIN, F. WONG-STAAL, R. C. GALLO, E. S. PARKS, and W. D. PARKS. 1986. Genetic variation in HTLV-III/LAV over time in patients with AIDS or at risk for AIDS. Science 232:1548-1553.
HIGGINS, D. G., A. J. BLEASBY, and R. FUCHS. 1992. Clustal V: improved software for multiple sequence alignment. Comput. Appl. Biosci. 8:189-191.
HOLLAND, J., K. SPINDLER, F. HORODYSKI, E. GRABAU, S. NICHOL, and S. VAN DE POL. 1982. Rapid evolution of RNA genomes. Science 215:1577-1585.
HOLMES, E. C., L. Q. ZHANG, P. SIMMONDS, C. A. LUDLAM, and A. J. LEIGH BROWN. 1992. Convergent and divergent sequence evolution in the surface envelope glycoprotein of human immunodeficiency virus type 1 within a single infected patient. Proc. Natl. Acad. Sci. USA 89:4835-4839.
HUET, S., D. F. NIXON, J. B. ROTHBARD, A. TOWNSEND, S. A. ELLIS, and A. J. MCMICHAEL. 1990. Structural homologies between two HLA B27-restricted peptides suggest residues important for interaction with HLA B27. Int. Immunol. 2:311-316.
HUGHES, A. L., and M. K. HUGHES. 1995. Self peptides bound by HLA class I molecules are derived from highly conserved regions of a set of evolutionarily conserved proteins. Immunogenetics 41:257-262.
HUGHES, A. L., and M. NEI. 1988. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335:167-170.
HUGHES, A. L., and M. NEI. 1989. Nucleotide substitution at major histocompatibility complex class II loci: evidence for overdominant selection. Proc. Natl. Acad. Sci. USA 86:958-962.
JOHNSON, R. D., A. TROCHA, T. M. BUCHANAN, and B. D. WALKER. 1992. Identification of overlapping HLA class I-restricted cytotoxic T cell epitopes in a conserved region of the human immunodeficiency virus type 1 envelope glycoprotein: definition of minimum epitopes and analysis of the effects of sequence variation. J. Exp. Med. 175:961-971.
JOHNSON, R. D., A. TROCHA, T. M. BUCHANAN, and B. D. WALKER. 1993. Recognition of a highly conserved region of human immunodeficiency virus type 1 gp120 by an HLA-Cw4-restricted cytotoxic T-lymphocyte clone. J. Virol. 67:438-445.
JOHNSON, R. D., A. TROCHA, L. YANG, G. P. MAZZARA, D. L. PANICALI, T. M. BUCHANAN, and B. D. WALKER. 1991. HIV-1 gag-specific cytotoxic T lymphocytes recognize multiple highly conserved epitopes: fine specificity of the gag-specific response defined by using unstimulated peripheral blood mononuclear cells and cloned effector cells. J. Immunol. 147:1512-1521.
JUPP, R. A., L. H. PHYLIP, J. S. MILLS, S. F. J. LE GRICE, and J. KAY. 1991. Mutating P2 and P1 residues at cleavage junctions in the HIV-1 pol polyprotein. Effects on hydrolysis by HIV-1 proteinase. FEBS Lett. 283:180-184.