Download during the Somatic Hypermutation Process Trends in Antibody

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Gene therapy wikipedia , lookup

Gene desert wikipedia , lookup

Pathogenomics wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Expanded genetic code wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Zinc finger nuclease wikipedia , lookup

Gene expression programming wikipedia , lookup

Non-coding DNA wikipedia , lookup

History of genetic engineering wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Neuronal ceroid lipofuscinosis wikipedia , lookup

Population genetics wikipedia , lookup

Saethre–Chotzen syndrome wikipedia , lookup

Human genome wikipedia , lookup

Genomics wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Genome (book) wikipedia , lookup

Metagenomics wikipedia , lookup

Gene expression profiling wikipedia , lookup

RNA-Seq wikipedia , lookup

Genome evolution wikipedia , lookup

Gene wikipedia , lookup

Genome editing wikipedia , lookup

Oncogenomics wikipedia , lookup

Microsatellite wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Epistasis wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Genetic code wikipedia , lookup

Helitron (biology) wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Microevolution wikipedia , lookup

Frameshift mutation wikipedia , lookup

Mutation wikipedia , lookup

Designer baby wikipedia , lookup

Point mutation wikipedia , lookup

Transcript
Trends in Antibody Sequence Changes
during the Somatic Hypermutation Process
This information is current as
of June 15, 2017.
Subscription
Permissions
Email Alerts
J Immunol 2006; 177:333-340; ;
doi: 10.4049/jimmunol.177.1.333
http://www.jimmunol.org/content/177/1/333
This article cites 32 articles, 12 of which you can access for free at:
http://www.jimmunol.org/content/177/1/333.full#ref-list-1
Information about subscribing to The Journal of Immunology is online at:
http://jimmunol.org/subscription
Submit copyright permission requests at:
http://www.aai.org/About/Publications/JI/copyright.html
Receive free email-alerts when new articles cite this article. Sign up at:
http://jimmunol.org/alerts
The Journal of Immunology is published twice each month by
The American Association of Immunologists, Inc.,
1451 Rockville Pike, Suite 650, Rockville, MD 20852
Copyright © 2006 by The American Association of
Immunologists All rights reserved.
Print ISSN: 0022-1767 Online ISSN: 1550-6606.
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
References
Louis A. Clark, Skanth Ganesan, Sarah Papp and Herman W.
T. van Vlijmen
The Journal of Immunology
Trends in Antibody Sequence Changes during the Somatic
Hypermutation Process
Louis A. Clark,1 Skanth Ganesan, Sarah Papp, and Herman W. T. van Vlijmen1
T
he versatility of the mammalian adaptive immune system
rests on its ability to generate high-affinity Abs to foreign
Ags. The set of naive B cells expresses a repertoire of
surface-bound Ab precursors called B cell receptors or BCRs, a
subset of which can bind most new Ags with low affinity. To make
this possible, the V(D)J recombination process has evolved to mix
V, D, and J genes during cell development to provide a large
sequence diversity in the germline L and H chains. Considering
only the number of human V, D, and J genes and neglecting a gain
in diversity from the joining mechanisms, one gets an estimate of
320 L chain variable domain and 11,000 H chain variants. The
assumption that any H chain can pair with any L chain yields 3.5 ⫻
106 theoretical specificities (1). Additional adaptability comes
from the somatic hypermutation process, where the recombined
Ab genes undergo error-prone replication during an in vivo selection process. Those B cells that produce Abs with higher affinities
for their Ags selectively proliferate, leading to a further enormous
increase in diversity and a fine tuning of the affinity.
Much of somatic hypermutation research to date has focused on
defining the molecular mechanism. This aspect has been reviewed
recently (2–5) and, although much remains to be learned, certain
sequence-related aspects are clear. The enzyme AID (activationinduced cytidine deaminase) deamidates cytidine residues, converting them to uracil and initiating the somatic hypermutation
process (6). Further processing of the local DNA may involve
excision of the uracil base and repair by certain error-prone polymerases (see reviews in Refs. 3–5). The overall process favors
Biogen Idec Inc., Cambridge, MA 02142
Received for publication November 4, 2005. Accepted for publication April 19, 2006.
The costs of publication of this article were defrayed in part by the payment of page
charges. This article must therefore be hereby marked advertisement in accordance
with 18 U.S.C. Section 1734 solely to indicate this fact.
1
Address correspondence and reprint requests to Dr. Louis A. Clark and Dr. Herman
W. T. van Vlijmen, Biogen Idec Inc., 14 Cambridge Center, Cambridge, MA 02142.
E-mail addresses: [email protected] and [email protected]
Copyright © 2006 by The American Association of Immunologists, Inc.
single-base transitions over transversions at an ⬃3:1 ratio (7). Insertions and deletions also occur but are considerably less common
(8, 9). Certain four-base DNA sequence motifs, called hotspots,
are correlated with the mutation locations. The two most commonly cited four-base motifs are RGYW (10) and its inverse repeat, WRCY (11), where R denotes a purine base, Y a pyrimidine
base, and W an A or a T base.
It is not always clear how the mutations selected during the
affinity maturation process contribute to improving binding affinity
or other properties such as selectivity or stability. Only a few studies exist that look at the effect of single somatic hypermutations
(12). One expects the residues in contact with the Ag to be the
most critical, but often large affinity improvement come from nonsurface mutations. Daugherty et al. (13) used phage display on
Escherichia coli to improve the affinity of a single-chain Fv molecule to cardiac glycoside digoxigenin and found that all of the
affinity enhancements occurred at non-Ag-contacting residues.
Similarly, Zahnd et al. (14) used a ribosome display to improve
single-chain Fv binding to a peptide Ag and found that the most
effective mutation was not at the interface with the Ag. From our
own work redesigning Ab-Ag interfaces (34), we find that even
Ag-contacting mutation effects are not always rationalizable or
predictable from three-dimensional structure-based energetic calculations. Thus, there is great potential to learn from the natural
affinity maturation process.
The work presented here seeks to better define the results of the
somatic hypermutation process and to investigate sequence-related
aspects of protein-protein recognition. Given a mature sequence, a
probable germline sequence resulting from the recombination of
species-specific V, D, and J genes (15, 16) can be derived using
sequence matching and consideration of the known mechanistic
rules of V(D)J recombination (see Materials and Methods). Once
the germline chain sequence is known, simple comparison with the
aligned mature sequence yields the position and type of mutation
that occurred during the affinity maturation process.
0022-1767/06/$02.00
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
Probable germline gene sequences from thousands of aligned mature Ab sequences are inferred using simple computational
matching to known V(D)J genes. Comparison of the germline to mature sequences in a structural region-dependent fashion allows
insights into the methods that nature uses to mature Abs during the somatic hypermutation process. Four factors determine the
residue type mutation patterns: biases in the germline, accessibility from single base permutations, location of mutation hotspots,
and functional pressures during selection. Germline repertoires at positions that commonly contact the Ag are biased with
tyrosine, serine, and tryptophan. These residue types have a high tendency to be present in mutation hotspot motifs, and their
abundance is decreased during maturation by a net conversion to other types. The heavy use of tyrosines on mature Ab interfaces
is thus a reflection of the germline composition rather than being due to selection during maturation. Potentially stabilizing
changes such as increased proline usage and a small number of double cysteine mutations capable of forming disulfide bonds are
ascribed to somatic hypermutation. Histidine is the only residue type for which usage increases in each of the interface, core, and
surface regions. The net overall effect is a conversion from residue types that could provide nonspecific initial binding into a
diversity of types that improve affinity and stability. Average mutation probabilities are ⬃4% for core residues, ⬃5% for surface
residues, and ⬃12% for residues in common Ag-contacting positions, excepting the those coded by the D gene. The Journal of
Immunology, 2006, 177: 333–340.
334
Tomlinson et al. (17) have previously used a similar type of
analysis to analyze the diversity of amino acids at specific positions in the germline and mature Ab sequences. They found that
the frequency of somatic hypermutation and the diversity of the
germline sequences is highest in the CDRs. Rather than focus on
the mutation frequencies, we examine the type of mutation and its
functional implications deduced from the location in the structure.
The results indicate that residue type changes during the somatic
hypermutation process are significant and have underlying functional rationales.
Materials and Methods
Germline V(D)J fitting procedure
the best gene are always assumed to be identical with those in the mature
sequence, and no mutations are recorded.
Algorithm testing was performed in cases where the V(D)J fitting procedure was performed manually. Additionally, D gene fitting was tested on
cases where the DNA sequence was known. In the small number of cases
examined, translating from the protein sequence to the most probable DNA
sequence was sufficient to distinguish between available D gene sequences.
Typically, one or two related D genes had match scores clearly separated
from all other lower-scoring D gene matches
In some cases the algorithm clearly returns nonsensical matches, as
indicated by long stretches of poor base matches. The failures can be traced
back to humanized Abs, to Abs from species outside the human and mouse
libraries used, or to insertions/deletions. Germline sequence determinations
requiring ⬎20-amino acid mutations relative to the mature sequence are
discarded on the assumption that the problem is inappropriate to the sequence or that the algorithm has failed. Sequences with more than five
consecutive mutated residues are discarded for similar reasons. Some sequences have so little information in the D and J gene regions that fitting
is impractical. For this reason, solutions having mutations at ⬎40% of the
D and J gene region positions are also discarded.
Calculation of mutation probabilities
A simple estimate for the probability of making a transition from residue
type i to residue type j (Pij) with a single base change can be calculated
using the codon definitions and their known usage frequencies. A given
number of codons, M (N) code for each residue type i (j). For a given i to
j transition, the probability is the sum of individual possible codon m to n
transitions (Equation 1).
冘冘冉 冊
M
Pij ⫽
N
m⫽1 n⫽1
1
f ⌬
12 im mn
(1)
The factor (1/12) is the probability of picking one of the three bases in the
codon (1/3) times the probability of picking the correct change (1/4). The
frequency ( fim) is the fraction of residue type i that is coded using codon
m in human genes. If codons m and n differ by only a single base, then such
a transition could occur, and the function ⌬mn is unity. This simple transition probability (Pij) can be renormalized using the residue type usage in
the germline genes to form the background of Fig. 1A.
An estimate for the probability of random mutations leading to a cysteine pair capable of forming a new disulfide bond (Pds) can be calculated
from the probability of making a single cysteine mutation (Pcys ⫽ 1/19),
the probability of a residue pair being close enough to form a disulfide
(Ppair) and the average number of mutation pairs per chain (Npairs) (Equation 2).
Pds ⫽ NpairsPcysPcysPpair
(2)
An average of eight mutations per chain yields 28 unique pairs (Npairs).
Approximately 4% (Ppair ⯝ 0.04) of residue pairs have C␤-C␤ separation
distances of ⬍6Å. This leads to an estimate of 68 disulfide-producing
events (Pds ⯝ 0.003) in the ⬃22,320 sequences examined.
Regional assignments from sequence position
All Ab sequences were first aligned to the set (AAAAA, Aho’s Amazing
Atlas of Antibody Anatomy) provided by Honegger et al. (19). We use this
alignment and numbering system because it leaves gaps for almost all loop
lengths, gives a structural correspondence between light and heavy positions, and provides numerical values for all positions. The Kabat system
uses alphanumeric numbering for some loop positions and can be cumbersome in some cases.
Position definitions were derived from Protein Data Bank structures of
Abs in contact with peptide or protein Ags. Ab-Ag interface positions are
occupied by residues that face the Ag and are within 12 Å of the nearest Ag
atom for at least one Protein Data Bank structure. Because the positions
were assigned based on a small set of structures, residues occupying Ab-Ag
positions do not always contact the Ag for every sequence and may be
solvent exposed. Residue positions on the VL-VH interface were defined
similarly using an 8-Å cutoff but were required to meet this requirement in
at least half of the structures. Core positions are those that point inward to
the core of the variable domain and typically expose ⬍10 Å2. The surface
positions are taken to be the remainder of the occupied positions. Some
positions may have more than one assignment, particularly those that may
both contact the Ag and be present at the VL-VH interface.
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
The V(D)J germline fitting algorithm is intended to provide reasonable
solutions for the germline Ab sequence based on the mature protein sequence for the large majority of cases. It is recognized that no such fitting
procedure can be completely free of ambiguities. In some cases, the length
of the D region or the lack of residues in many publicly available sequences
gives insufficient information for a certain fit. The flexible joining mechanisms of the natural process also limit the reliability of the method. For
example, the natural N nucleotide addition process inserts random bases
into the sequence, making the concept of a precursor gene at those positions inapplicable. In all situations, the default algorithm behavior is to
assume that there was no mutation and that the germline sequence at ambiguous positions is identical with the mature sequence.
The gene fitting algorithms for the H and L chains share some basic
features. In each case, the mature protein sequence is converted to a most
probable DNA sequence using mouse or human codon probability tables.
Comparisons with the germline V(D)J gene sequences must be done with
the DNA sequence, because the germline genes may be read in multiple
reading frames (only D and J genes) depending on their position in the final
sequence and because some sequences, especially D genes, are too short to
reliably match using the reduced information in the protein sequences.
Readers interested in the most exhaustive germline fitting procedure for the
D gene region should consult the work of Monod et al. (18). Once the most
probable mature DNA sequence is determined, V, J, and, if appropriate, D
genes from the known library of genes are sequentially fit to the sequence.
V(D)J genes are taken from the ImMunoGeneTics (IMGT) database (15,
16). The fitting is done exhaustively for individual gene possibilities by
sliding the library gene along the mature DNA sequence and recording the
number of matching bases at each offset. Although simple and easy to code,
this approach is more reliable than a BLAST search because of the short
sequences involved. It is affordable because of the small number of possible germline V(D)J genes; analysis of a L/H chain pair takes a couple
minutes. Fitting for the whole data set can be done in 1–2 days on a
computer cluster and could be done considerably faster if the method and
code were to be optimized. It should be noted that a useful, fast, and
accessible algorithm for blasting against the V genes (IgBLAST) is available on the National Center for Biotechnology Information website
(www.ncbi.nlm.nih.gov/igblast).
The L and H chain matching is handled using a similar exhaustive procedure. In each case the V genes are matched first near the N terminus of
the sequence, and the best match is retained. The unmatched portion of the
sequence is then extracted, and the best J gene match is found using a
similar procedure. For the L chains, this process is repeated for each possible V gene to account for the possibility that an overly long V gene was
chosen because it simply adds extra base matches near the C-terminal end
rather than improving the final fit. There are cases where a V gene with a
lower match score leads to a better overall fit because it allows room for a
better-fitting J gene. The H chain is handled similarly; for each possible V
gene match the entire set of possible J genes is placed at the end of the
sequence. For each V-J combination, the remaining unmatched sequence is
fit to the D gene library. Most but not all H chains have a D gene segment
inserted between the V gene and J gene portions.
Variability in the natural joining mechanisms is accounted for by allowing two amino acids worth of flexibility during fitting. For example,
after the V gene segment is fit, the unmatched portion to which the J gene
will be fit is extended two residues (six bases) of mature DNA sequence
into the V gene region. This is done to allow for the possibility that part of
the germline V gene was deleted during splicing. For a similar reason the
unfit stretch of DNA in the D gene region is extended six bases into the
bordering V and J gene regions. Even with this flexibility, many D genes
are often excluded because the gap between the V and J genes is too small.
Conversely, if there is a gap between segments, then bases unmatched by
TRENDS IN SOMATIC HYPERMUTATIONS
The Journal of Immunology
335
Results
A large number of mature Ab chain sequences are available from
public databases (20 –22), including Ig sequences from IgBLAST
(subset of the “nr” database with similarity of at least 50% over
one-third of germline V gene length). After removal of duplicates
and sequences not labeled as mouse or human, 23,116 heavy and
11,095 ␭ or ␬ variable domain sequences were fit to known germline V, D, and J genes using the methodology described in the
Materials and Methods section. Seventy-two percent of the light
and 62% of the heavy sequences yielded plausible (see Materials
and Methods section for definition) germline solutions and contributed to the results presented in this section. Approximately half
of the sequences that contribute to the final dataset are human. All
results are for the combined mouse and human datasets unless
noted. Due to the higher uncertainty of fitting the short D gene
segments to the mature sequence, all mutation information from
the D gene region is omitted unless specifically noted.
A comparison of the residue mutation type frequency with that
observed in regular evolutionary relationships shows some basic
similarity. Fig. 1A shows the prevalence of residue type change
mutations in the full dataset (circles) using a color scale. For comparison, the expected mutation frequency has been calculated assuming a single DNA base change per codon and human codon
usage. These expected frequencies are visible as the background
color in each box. There is a clear qualitative relationship between
expected and observed frequencies. An additional comparison with
expected mutation frequencies comes from sequence alignment
scoring matrices. Mutation frequencies can be derived from a multiple sequence alignment by observing residue type variability at a
given position. If this information is compiled in a residue typespecific fashion and averaged over all positions, then a global view
of how easily a given residue type can substitute for another can be
compiled. The BLOSUM62 data provides such mutation frequencies derived from protein sequences with ⬎62% sequence homology (23).
There is a similarly loose relationship between the observed
frequency and the mutation frequency expected from the
BLOSUM62 data as shown in Fig. 1B. From these tests we conclude that the residue type changes during somatic hypermutation
are similar overall to those seen during evolution but may show
deviation in the finer details.
In this overall dataset, there is some conservation of hydrophobicity (clustering around the diagonal in Fig. 1A), a relatively small
number of mutations involving proline, cysteine, and tryptophan
amino acids, and a number of scattered higher-probability mutation types. Comparison of observed Ab (Fig. 1A, overlaid circles)
with the expected (Fig. 1A, background squares) mutations indicates that at least some of the higher-probability mutation types,
such as T3 S and A3 V, have similar frequencies. Others, such as
F7L and Y3 G transitions, are less frequently and more frequently seen, respectively, in Abs. Further trends will become
more pronounced in region-specific subsets of the data.
There is often a clear imbalance in the frequency of mutations
from a given residue type to another (X3 Y) compared with the
reverse (Y3 X). This imbalance is apparent in the lack of symmetry in the matrix and is presented in Fig. 2, which shows the
same data processed to show deviations from the average of X3 Y
and Y3 X entries. Part of the imbalance can be accounted for by
differences in codon usage. For example, K3 R mutations are 3.3
times more common than R3 K mutations, and this difference
could be explained by the 6:2 ratio of R to K codons and slightly
FIGURE 1. Residue type change mutations for the full dataset. A, Number of germline (left column) to mature (top row) mutations for the full
dataset, including human and mouse L and H chains. Example: the red
circle in the center can be interpreted to mean that the S3 T
(germline3mature) type mutation occurs with a high frequency during the
somatic hypermutation process. Amino acids are ordered by hydrophobicity according to the octanol scale. The background color in each box represents the expected pair mutation frequency assuming human codon usage, one base change per codon, and germline residue type frequencies (see
Materials and Methods). B, Comparison between observed residue type
change mutation frequency and that expected from the data used to derive
the BLOSUM62 scoring matrix (23).
lower relative usage of arginine in the germline. Other imbalances,
such as A3 V, have no codon number or germline usage bias. For
comparison, the BLOSUM matrices and other residue substitution
scoring matrices are symmetric by construction.
Fig. 3 shows the probability of finding a mutation at a given
position in the variable domain of the L chain or the H chain. Each
residue position has been assigned to an Ab structural region using
the sequence alignment and crystal structures (see Materials and
Methods). There is a clear relationship between the position assignment and the mutation probability. Ab-Ag interface and solvent-exposed surface positions are more likely to be targets of
productive (nonsilent) mutations. The distributions agree qualitatively with those from Tomlinson et al. (17), but it is now possible
to see that the previously unanalyzed H3 loop has more mutations
than the other CDR loops even when mutations in the D gene
region are omitted. In agreement with these results, we also see a
sizable mutation probability (⬎10%) for the first three positions on
the N terminus of the L chain. Despite being outside of the traditional CDR loops, these residues are often in contact with the Ag
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
Full dataset: mutation type and location
336
TRENDS IN SOMATIC HYPERMUTATIONS
(results not shown). Average mutation frequencies broken down
by region are 4, 7, 12, and 5% for the core, VL-VH interface,
Ab-Ag interface, and surface, respectively.
The germline sequences were analyzed for four-base RGYW
and WRCY (see Ref. 3 for definitions) intrinsic mutation hotspots.
There is a proximity relationship between hotspot and mutation
position. Fig. 4 shows the distance expressed in the number of
bases between the mutated codon center and the nearest base of a
hotspot. For example, a distance of two bases means that the codon
precedes the four-base hotspot and that they are adjacent with no
bases separating them. At a distance of ⫺1, the codon is after the
hotspot with its first base forming the last base of the hotspot. The
mean distances to the RGWY and WRCY hotspots are 12 and 10
bases, respectively, and mutations can be seen to be much more
likely near hotspots. There is a clear periodicity of 3 in Fig. 4. This
means that the first three bases (for example, RGY) of the hotspots
tend to be in-frame. Specific analysis shows that 62% of RGYW
hotspots and 70% of WRCY hotspots are in-frame.
Despite the tendency for mutations to be near hotspots within
specific Ab sequences, there is no apparent relationship between
the overall frequency of mutation at a given position and the probability of a hotspot occurring at that position. The CDRs do not
have a higher density of intrinsic mutation hotspots (data not
shown). It is not rare for the probability of finding a hotspot to
approach unity, meaning that some residue positions are nearly
always part of hotspots.
FIGURE 3. Probability of finding a mutation at a given position in the
variable domain of the L (top panel) and H (bottom panel) chains. Residues
are numbered using the Honegger numbering scheme (19). The CDR loops
are framed in red lines, and position definitions are given as colored bars
below the histogram. The gray bars include the less reliable D gene region
predictions for the H chain.
Fig. 5 gives the distribution of number of mutations within each
region. The residue positions were distributed into the Ab-Ag interface, VL-VH interface, core, and surface regions using analysis
of Protein Data Bank Ab-Ag complex structure as described in the
Materials and Methods section. It can be seen that the mutations
Region-specific mutation type and location
The region of the variable domain where a mutation occurs
strongly determines its effect, if any, on the function of the Ab. To
a first approximation, the residues on the Ab-Ag interface determine specificity and affinity, whereas those in the core partially
determine stability. Mutations at the interface between the two
variable domains will also contribute to stability (24) and binding
affinity (25) and may contribute to the relative orientation of the
domains. Surface mutations outside the interfaces may affect solvation properties and aggregation propensity but are expected to
have little effect as long as polar surface area is maintained.
FIGURE 4. Distance from the mutation to nearest mutation hotspot.
The distance is calculated as the difference between the base number at the
center of the mutated codon and the closest base in the nearest hotspot of
a given type. A periodicity of 3 is clearly seen, indicating that the beginning of the hotspots tend to be in-frame. The zero-distance peak implies
that the codon center is inside the hotspot. It has been normalized to account for the four degenerate possibilities.
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
FIGURE 2. Deviations in amino acid somatic hypermutation symmetries from germline (left column) to mature (top row) mutations for the full
dataset. Entry values (Mij) are calculated from those in Fig. 1 (Aij) using
Mij ⫽ (Aij ⫺ Aji)/(0.5Aij ⫹ 0.5Aji). Original matrix entries were zeroed
(filtered out) if Aij or Aji was ⬍5% of the maximum entry. Amino acids are
ordered by hydrophobicity according to the octanol scale.
The Journal of Immunology
337
FIGURE 5. Number of mutations per variable domain broken into contributions from each region. All D gene mutations are omitted from the
statistics.
FIGURE 6. Region-specific residue compositions of mature sequences
with amino acids ranked according to the octanol hydrophobicity scale.
Data from an analysis by Lo Conte et al. (27) of amino acid contributions
to Ab-Ag interface area are displayed as lines.
FIGURE 7. Amino acid composition changes during the mouse somatic
hypermutation process expressed as percentage changes from the germline.
For reference, the expected direction (⫹ or ⫺) of net change from the
simple model (see Materials and Methods) is shown above each bar. Error
bars are at one SD. Bars for rare residue types (⬍50 examples in applicable
structural region of germline) are omitted.
ing net changes occur at the Ab-Ag interface. Tyrosine residues
undergo the greatest change in absolute numbers, partially because
they are so prevalent in the germline structure. They are reduced
by 13% (10% in mouse and 18% in human) from their germline
frequency. The loss is distributed throughout the variable domain
but is more pronounced in the CDRs and, in particular, at Agcontacting positions of the H3 loop. Tyrosine is over-represented
in the germline JH genes for both human and mouse. This has been
noted by Zemlin et al. (Fig. 3 of Ref. 27). One of the six human JH
genes (JH6) and its alleles can code for strings of 4 – 6 tyrosines.
The JH6 gene is used in 22% of the germline sequence solutions
found in our analysis. The JH4 gene can code for two tyrosines
separated by two residues and is used in 37% of human H chain
germline gene fitting solutions. This overall tyrosine change is
large enough to skew the average side chain size change at the H
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
on the Ab-Ag interface span the broadest range, with a mean of 3.1
mutations per chain. Surface and VL-VH mutations are less frequent at 2.3 and 1.5 mutations per chain on average. Core mutations are least frequent at 1.3 mutations per chain. The distributions for the L and H chains are qualitatively similar (not shown).
For the H chain, region-specific averages are ⬃30% higher than
those in the L chain. On average, L and H chains have 6.8 and 8.6
mutations per chain, respectively.
As a reference point for describing composition changes, Fig. 6
shows the region-specific amino acid compositions in the mature
sequence. The germline compositions are qualitatively similar. As
should be expected, polar residues are favored on the surface and
hydrophobic residues are favored in the core regions. One of the
largest contributions to Ab-Ag interface composition comes from
tyrosine at 15%. Glycine and serine are also strong contributors at
11 and 18%, respectively. As a comparison for only the interface
with the Ag, the data of Lo Conte et al. (26) from an analysis of
Ab-Ag crystal structures are also shown. Their numbers are the
percentages of the interface area contributed by each amino acid
type and, thus, are lower in value for smaller residues such as
glycine and serine.
The net creation and removal of amino acid types from the
germline to the mature sequence varies considerably by residue
type. Figs. 7 and 8 give the percentage increase of each amino acid
type in the overall, surface, core, and Ab-Ag interface regions for
mouse and human sequences, respectively. Some of the most strik-
338
chain to a net loss of 3.4 heavy atoms over the whole Ab-Ag
interface. For comparison, there is a net gain of 1.1 heavy atoms in
the corresponding region of the L chain. Glutamine, serine, and
tryptophan residues are also depleted on the interface.
Several residue types are used more frequently in the mature Ab
on the interface with the Ag. Phenylalanine is the most enhanced
residue at the interface, increasing by ⬃80% (31% in mouse and
104% in human) from the germline composition. Proline usage
also increases during maturation by 42% (32% in mouse and 50%
in human), as does histidine (36% in mouse and 82% in human).
The residues showing these large increases in mature Ab use are
almost absent in the germline but occur significantly in mature
sequences (see Fig. 6).
Proline usage changes in L and H chains are seen mainly in turn
regions and are qualitatively very similar between chains. Residue
positions 15, 48, and 98 (Kabat light, 14, 40, and 80; Kabat heavy,
14, 41, and 84) are in the ␤-hairpin turns nearest the constant
domain and show pronounced mutation activity. The turn in the L3
and H3 loops tends to gain prolines at position 135 (Kabat light,
95e; Kabat heavy, 100h). Similarly, prolines tend to appear at position 52 (Kabat light, 44; Kabat heavy, 45), which is a kinked
position in the strand that forms part of the VL-VH interface.
Other interesting composition trends occur during maturation.
The most striking is an increase in the relative number of cysteine
residues on the surface. The low abundance of cysteines on the
surface means that 181 net mutation events in the human dataset
translate to a 178% increase (see Fig. 8B). One expects that additional cysteines could lead to creation of nonnative disulfide bonds
and inhibit expression. Presumably, the new cysteines observed in
this work do not have a destabilizing effect large enough to be
selected against. In a small number (⬍50) of cases, two cysteines
are apparently created, and in 12 cases they have sequence positions that could lead to the formation of an additional stabilizing
disulfide bond. A probability estimate of this event occurring randomly indicates that six times more potential disulfide bonds
should have been observed (see Materials and Methods). The discrepancy could be partially attributed to the observed order of
magnitude lower probability of mutating to a cysteine than the
random mutation probability used in the calculation. Histidine residues are gained consistently in each of regions for both species
except on the surface of the mouse Abs. There is an ⬃80% increase in histidine usage on both the surface and Ab-Ag of human
Abs during maturation. For the mouse, the increase on the Ab-Ag
interface is less pronounced at 36%. This increase in the number of
histidines is the largest of the core residue position changes (Figs.
7C and 8C), where comparatively little composition change
is seen.
It is reasonable to ask whether the residue types in which the
hotspots occur (RGYW or WRCY) bias the residue composition
changes. According to Fig. 4, the highest probability location for
a mutation is inside a hotspot. Consequently, one might expect a
tendency for residue types that are consistently mutated to be more
frequently present in hotspots. Fig. 9 shows the fraction of each
germline residue type in each of the RGYW or WRCY motifs.
More than 50% of the tyrosine residues in the germline are located
in WRCY motifs. Generally, the residue types that are lost on the
Ab-Ag interface (glutamine, serine, tyrosine, and tryptophan) all
have some of the highest propensities to be coded in hotspots.
FIGURE 9. Tendencies to find specific residue types in the mutation
hotspot motifs. The bar heights are the fraction of a specific germline
residue type that is coded within the four-base (two incomplete residues)
motif.
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
FIGURE 8. Amino acid composition changes during the human somatic
hypermutation process expressed as percentage changes from the germline.
For reference, the expected direction (⫹ or ⫺) of net change from the
simple model (see Materials and Methods) is shown above each bar. Error
bars are at one SD.
TRENDS IN SOMATIC HYPERMUTATIONS
The Journal of Immunology
339
FIGURE 10. Intrinsic mutation probabilities for
germline residue types on the Ab-Ag interface. Numbers
given in the scale bar are the probabilities to mutate to
the mature (column index) residue, given that the germline (row index) residue is a specific type. The tyrosine
stripe is most pronounced in the H chain mutation map.
There are higher than background tendencies for mutations to take tyrosine (Y) in the germline to aspartate
(D), glycine (G), asparagine (N), serine (S), and phenylalanine (F). The background is identical with that
used in Fig. 1, but is not normalized by germline residue
type usage.
Discussion
With the results presented here, it is possible to speculate on the
underlying reasons for the mutation propensities and what they tell
us about Ab maturation. Why is tyrosine, serine, and tryptophan
usage on the Ab-Ag interface decreased during the maturation process? Similarly, why is there a substantial relative increase in the
usage of histidine, proline, and phenylalanine?
One of the most striking results shown is the overall tendency
for germline tyrosine, serine, and tryptophan residues to mutate to
other residue types. Germline genes are biased to place these versatile residue types on the Ab-Ag interface. We speculate that the
tyrosines are there to promote low-affinity binding of the naive
germline Abs to new Ags. Tyrosine can provide hydrogen bonding
and substantial hydrophobic interactions. Tryptophan is the most
hydrophobic of residues and arguably the best for nonspecific interactions. Serine can hydrogen bond while minimizing potential
steric conflicts. Both tyrosine and serine have a polar character and
are likely to help maintain solution stability. In support of these
ideas, Fellouse et al. (28, 29) have used phage display techniques
to show that tyrosine and a combination of small flexible residues
are sufficient to recognize a number of Ags, including human and
mouse vascular endothelial growth factor molecules.
Not only does nature bias the interface with promiscuous binding residue types, but it provides a maturation strategy capable of
making efficient and productive changes to them. Tyrosine, serine,
and glycine residues are well represented in germline mutation
hotspots (Fig. 9), increasing their tendency to mutate. The presence
of serines in hotspots has been previously noted and suggested to
be a strategy for improving the targeting of CDRs (30). If one
examines into which residue types the tyrosines mutate, one sees
that they most often become the small residues glycine and serine.
Aspartate and asparagine, two short-chain charged and uncharged
residues, are also favored, perhaps to provide ion-pair and H-bond
interactions. Phenylalanine may be favored because of its similarity to tyrosine and greater hydrophobicity. Tryptophan often becomes arginine, glycine, serine, and leucine. It may be converted
to eliminate steric conflicts and minimize nonspecific interactions
that could lead to aggregation.
The low usage of histidine and methionine at the Ab-Ag interface (Fig. 6) is puzzling. Next to proline and cysteine, which often
play specific structural and functional roles, histidine and methionine are least used on the interface. The work of Lo Conte et al.
(26) shows that these residues are used less on Ab-Ag interfaces
than on other protein-protein interfaces. One possible explanation
for the low use of histidine and methionine is the hypothesis that
they have only recently been introduced into the amino acid repertoire on an evolutionary time scale (31). These amino acids may
be underused relative to their potential. If under-utilization in the
germline can explain their low usage, then one would expect their
usage to increase in the mature Abs. Figs. 7 and 8 shows that this
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
The most important region for Ab affinity maturation is the interface with the Ag. As seen in Figs. 7A and 8A, trends in the
residue usage are apparent, but no information is offered about the
specific mutations that occur from given germline residue types.
Fig. 10 shows the intrinsic mutation probabilities for residues on
the interface between the Ab and the Ag. On the Ab-Ag interface,
some pronounced residue type changes are E7D and K3 R,
which are charge conserving. Others, such as N7D and Q3 E,
conserve size but change charge or change predominantly the hydrophobicity, such as S7T and F3 L.
A germline tyrosine residue at a potentially Ag-contacting position has a 13% chance to mutate to another residue type (see
Figs. 7A and 8A for species-specific numbers). This tendency is
also apparent in Fig. 10, which shows that tyrosines in the germline tend to have a substantial intrinsic tendency to mutate. In the
unnormalized Ab-Ag interface data, the horizontal tyrosine stripe
is the most pronounced feature. Conversions from tyrosine and
serine to other residue types are the dominant mutation types occurring on the Ab-Ag interface. When germline tyrosine is mutated, it has a tendency to convert to aspartate, histidine, glycine,
asparagine, serine, and phenylalanine. Of these residue types, each
can be reached with a single DNA base change, except glycine.
When a germline serine is mutated, larger residues such as asparagine and threonine, both still capable of hydrogen bonding, are
common replacements. Tryptophans are often converted to
leucines and tyrosines, but also may become small residues such as
serine and glycine.
340
Acknowledgments
The advice and supporting environment provided to L.A.C. as postdoctoral
fellow at Biogen Idec are greatly appreciated. Helpful comments from
Juswinder Singh and Alexey Lugovskoy are appreciated.
Disclosures
L. A. Clark is a current employee and S. Ganesan, S. Papp, and H. W. T.
van Vlijmen are past employees of Biogen Idec Inc. L. A. Clark and
H. W. T. van Vlijmen are stockholders in the company.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
References
1. Janeway, C. A., P. Travers, M. Walport, and M. Shlomchik. 2001. Immunobiology, 5th Ed. Garland, New York, p. 132.
2. Diaz, M., and P. Casali. 2002. Somatic immunoglobulin hypermutation. Curr.
Opin. Immunol. 14: 235–240.
3. Franklin, A., and R. V. Blanden. 2004. On the molecular mechanism of somatic
hypermutation of rearranged immunoglobulin genes. Immunol. Cell Biol. 82:
557–567.
4. Neuberger, M. S., J. M. Di Noia, R. C. Beale, G. T. Williams, Z. Yang, and
C. Rada. 2005. Somatic hypermutation at A.T pairs: polymerase error versus
dUTP incorporation. Nat. Rev. Immunol. 5: 171–178.
5. Samaranayake, M., J. M. Bujnicki, M. Carpenter, and A. S. Bhagwat. 2006.
Evaluation of molecular models for the affinity maturation of antibodies: roles of
cytosine deamination by AID and DNA repair. Chem. Rev. 106: 700 –719.
6. Pham, P., R. Bransteitter, J. Petruska, and M. F. Goodman. 2003. Processive
AID-catalysed cytosine deamination on single-stranded DNA simulates somatic
hypermutation. Nature 424: 103–107.
7. Betz, A. G., C. Rada, R. Pannell, C. Milstein, and M. S. Neuberger. 1993. Passenger transgenes reveal intrinsic specificity of the antibody hypermutation mech-
29.
30.
31.
32.
33.
34.
anism: clustering, polarity, and specific hot spots. Proc. Natl. Acad. Sci. USA 90:
2385–238.
Wilson, P. C., O. D. Bouteiller, Y. J. Liu, K. Potter, J. Banchereau, J. D. Capra,
and V. Pascual. 1998. Somatic hypermutation introduces insertions and deletions
into immunoglobulin v genes. J. Exp. Med. 187: 59 –70.
de Wildt, R. M., W. J. van Venrooij, G. Winter, R. M. Hoet, and I. M. Tomlinson.
1999. Somatic insertions and deletions shape the human antibody repertoire.
J. Mol. Biol. 294: 701–710.
Rogozin, I. B., and N. A. Kolchanov. 1992. Somatic hypermutagenesis in immunoglobulin genes. II. influence of neighbouring base sequences on mutagenesis. Biochim. Biophys. Acta 1171: 11–18.
Doerner, T., S. J. Foster, N. L. Farner, and P. E. Lipsky. 1998. Somatic hypermutation of human immunoglobulin H chain genes: targeting of RGYW motifs
on both DNA strands. Eur. J. Immunol. 28: 3384 –3396.
England, P., R. Nageotte, M. Renard, A. L. Page, and H. Bedouelle. 1999. Functional characterization of the somatic hypermutation process leading to antibody
D1.3, a high affinity antibody directed against lysozyme. J. Immunol. 162:
2129 –2136.
Daugherty, P. S., G. Chen, B. L. Iverson, and G. Georgiou. 2000. Quantitative
analysis of the effect of the mutation frequency on the affinity maturation of single
chain Fv antibodies. Proc. Natl. Acad. Sci. USA 97: 2029 –2034.
Zahnd, C., S. Spinelli, B. Luginbuhl, P. Amstutz, C. Cambillau, and
A. Plückthun. 2004. Directed in vitro evolution and crystallographic analysis of
a peptide-binding single chain antibody fragment (scFv) with low picomolar affinity. J. Biol. Chem. 279: 18870 –18877.
Lefranc, M. P., V. Giudicelli, C. Ginestoux, N. Bosc, G. Folch, D. Guiraudou,
J. Jabado-Michaloud, S. Magris, D. Scaviner, V. Thouvenin, et al. 2004. IMGTONTOLOGY for immunogenetics and immunoinformatics (genes are from ImMunoGeneTics website: imgt.cines.fr). In Silico Biol. 4: 17–29.
Lefranc, M.-P., and G. Lefranc. 2001. The Immunoglobulin Facts Book. Academic Press, London.
Tomlinson, I. M., G. Walter, P. T. Jones, P. H. Dear, E. L. Sonnhammer, and
G. Winter. 1996. The imprint of somatic hypermutation on the repertoire of
human germline V genes. J. Mol. Biol. 256: 813– 817.
Monod, M. Y., V. Giudicelli, D. Chaume, and M. P. Lefranc. 2004. IMGT/
Junction Analysis: the first tool for the analysis of the immunoglobulin and T cell
receptor complex V-J and V-D-J JUNCTIONs. Bioinformatics 20: I379 –I385.
Honegger, A., and A. Plückthun. 2001. Yet another numbering scheme for immunoglobulin variable domains: an automatic modeling and analysis tool. J. Mol.
Biol. 309: 657– 670.
Johnson, G., and T. T. Wu. 2001. Kabat database and its applications: future
directions (www.kabatdatabase.com). Nucleic Acids Res. 29: 205–206.
Wu, C. H., L. S. Yeh, H. Huang, L. Arminski, J. Castro-Alvear, Y. Chen, Z. Hu,
P. Kourtesis, R. S. Ledley, B. E. Suzek, et al. 2003. The protein information
resource. Nucleic Acids Res. 31: 345–347.
Bairoch, A., R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro,
E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al. 2005. The universal protein
resource (UniProt). Nucleic Acids Res. 33: D154 –D159.
Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution matrices from
protein blocks (ftp.ncbi.nih.gov/repository/blocks/unix/blosum/BLOSUM/).
Proc. Nat. Acad. Sci. USA 89: 10915–10919.
Worn, A., and A. Plückthun. 1999. Different equilibrium stability behavior of
ScFv fragments: identification, classification, and improvement by protein engineering. Biochemistry 38: 8739 – 8750.
Khalifa, M. B., M. Weidenhaupt, L. Choulier, J. Chatellier, N. Rauffer-Bruyere,
D. Altschuh, and T. Vernet. 2000. Effects on interaction kinetics of mutations at
the VH-VL interface of Fabs depend on the structural context. J. Mol. Recognit.
13: 127–139.
Lo Conte, L., C. Chothia, and J. Janin. 1999. The atomic structure of proteinprotein recognition sites. J. Mol. Biol. 285: 2177–2198.
Zemlin, M., M. Klinger, J. Link, C. Zemlin, K. Bauer, J. A. Engler,
H. W. J. Schroeder, and P. M. Kirkham. 2003. Expressed murine and human
CDR-H3 intervals of equal length exhibit distinct repertoires that differ in their
amino acid composition and predicted range of structures. J. Mol. Biol. 334:
733–749.
Fellouse, F. A., C. Wiesmann, and S. S. Sidhu. 2004. Synthetic antibodies from
a four-amino-acid code: a dominant role for tyrosine in antigen recognition. Proc.
Natl. Acad. Sci. USA 101: 12467–12472.
Fellouse, F. A., B. Li, D. M. Compaan, A. A. Peden, S. G. Hymowitz, and
S. S. Sidhu. 2005. Molecular recognition by a binary code. J. Mol. Biol. 348:
1153–1162.
Wagner, S. D., C. Milstein, and M. S. Neuberger. 1995. Codon bias targets
mutation. Nature 376: 732.
Jordan, I. K., F. A. Kondrashov, I. A. Adzhubei, Y. I. Wolf, E. V. Koonin,
A. S. Kondrashov, and S. Sunyaev. 2005. A universal trend of amino acid gain
and loss in protein evolution. Nature 433: 633– 638.
Wedemayer, G. J., P. A. Patten, L. H. Wang, P. G. Schultz, and R. C. Stevens.
1997. Structural insights into the evolution of an antibody combining site. Science
276: 1665–169.
Ghiara, J. B., E. A. Stura, R. L. Stanfield, A. T. Profy, and I. A. Wilson. 1994.
Crystal structure of the principal neutralization site of HIV-1. Science 264:
82– 85.
Clark, L. A., P. A. Boriack-Sjodin, J. Eldredge, C. Fitch, B. Friedman, K. J. Hanf,
M. Jarpe, S. F. Liparoto, Y. Li, A. Lugovskoy, et al. 2006. Affinity enhancement
of an in vivo matured therapeutic antibody using structure-based computational
design. Protein Sci. 15: 949-960.
Downloaded from http://www.jimmunol.org/ by guest on June 15, 2017
is consistently the case for histidine in most regions, but only at the
Ab-Ag interface for methionine. A more structurally oriented hypothesis can also be used to explain low methionine usage on the
interface. Methionine has a long side chain and is likely to lose
comparatively more entropy on binding relative to its gain in enthalpy, resulting in net lower binding affinity. Relative to methionine, histidine usage may be favored at interfaces, because it can
provide both hydrogen bonding and hydrophobic interactions.
One may also speculate on the usage of certain amino acids to
modify loop characteristics with benefits for Ab stability and affinity. Proline residues are frequently created on the Ab-Ag interface (Figs. 7A and 8A), particularly in the H3 loop where they
could stabilize beneficial loop conformations. Loop preorganization should reduce the entropy cost of binding and increase affinity
(see Ref. 32 for a possible example). Similarly, Ab maturation
sometimes leads to the introduction of disulfide bonds, which may
stabilize the Ig fold or result in subtle conformational changes that
lead to higher affinity. An example may be the anti-gp120 peptide
complex structure (1ACY), where an additional disulfide is formed
between CDR H1 and H2, which actually contacts the Ag (33). It
is also notable that many of the excess tyrosines on the Ab-Ag
interface of the germline chains are converted to glycines (see Fig.
10). In this case, the creation of glycines may allow for the backbone flexibility necessary for the remaining tyrosines or other
nearby Ag-directed side chains to better contact the Ag.
Ab-Ag interaction systems are ideal for examining factors and
strategies for improving binding affinity at protein-protein interfaces. The trends uncovered and discussed in this work illuminate
the strategies that nature uses to bias immature Ab properties and
subsequently refine them during the affinity maturation process.
Residue type changes are clearly biased (Fig. 1A), by which
codons are accessible given a single base change. Serine, tyrosine,
and tryptophan are over-represented in the germline at common
Ab-Ag positions and are some of the most frequently coded in the
known mutation hotspot motifs. Remaining influences on the residue type changes are presumed to result from functional selection
during the affinity maturation process. These functional selection
aspects make the somatic hypermutation process a good model for
evolution on an accelerated time scale.
TRENDS IN SOMATIC HYPERMUTATIONS