Letter to the Editor
Local Rates of Recombination Are Positively Correlated with GC Content in the Human
Genome
Stephanie M. Fullerton,*1 Antonio Bernardo Carvalho,*† and Andrew G. Clark*
*Institute of Molecular Evolutionary Genetics, Department of Biology, Pennsylvania State University; and †Departamento de
Genética, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
The GC content of human DNA varies widely
across the genome, ranging from 30% to 60%, and regions of hundreds of kilobases (often referred to as isochores [Bernardi 2000]) may have relatively homogeneous base compositions. This compositional heterogeneity appears to be very widespread in eukaryotes (Nekrutenko and Li 2000) and may represent an important
level of genome organization, insofar as gene density
(Zoubak, Clay, and Bernardi 1996), gene length (Duret,
Mouchiroud, and Gautier 1995), and patterns of codon
usage (Sharp et al. 1995), as well as the distribution of
different classes of repetitive elements (Soriano, Meunier-Rotival, and Bernardi 1983; Duret, Mouchiroud,
and Gautier 1995), are all correlated with GC content.
Despite intensive investigation, the underlying cause(s)
of the observed heterogeneity remains contested, with
two major hypotheses competing: Bernardi has suggested that selection is primarily responsible for maintaining
the observed patterns (Bernardi and Bernardi 1986; Bernardi 2000), a view supported by recent analysis of
polymorphism in the human MHC cluster (Eyre-Walker
1999), whereas the balance of opinion has favored systematic mutational bias as the ultimate cause (Filipski
1987; Sueoka 1988; Wolfe, Sharp, and Li 1989; Francino and Ochman 1999). Ultimately, however, it is not
clear why either selection for increased GC content or
mutation bias should promote such marked local variation in genomic nucleotide content.
A third hypothesis, namely, that recombination
may explain, or at least contribute to, the observed compositional differences (Holmquist 1992; Eyre-Walker
1993; Charlesworth 1994), has received less attention.
The idea that recombination may be involved in determining variation in GC content arose from the observation of an association between recombination and GC-rich chromosomal regions. The clustering of chromosomal rearrangements in isochores with high GC contents was described first by Bernardi (1989) and
subsequently by Holmquist (1992). Eyre-Walker (1993)
demonstrated a statistically significant positive correlation between overall chromosomal GC content and chiasmata density, which accorded both with Ikemura and
Wada’s (1991) finding that third-position GC content
was positively correlated with chiasmata density and
with Holmquist’s (1992) observation that GC-rich chromosomal bands were chiasmata-dense. Although the
coarse level at which the association was identified prevented determination of the causality underlying the relationship, GC-biased repair of gene conversion was
suggested as the most likely explanation (Holmquist
1992; Eyre-Walker 1993). Another author, commenting
on the same data, suggested instead the possible indirect
effects of recombination via its impact on the efficacy
of natural selection (Charlesworth 1994). This evolutionary explanation posited that changes in GC content
promoted by natural selection would be confined to
those regions of the genome with a recombination rate
high enough to overcome the effects of random genetic
drift (the so-called “Hill-Robertson effect”; Hill and
Robertson 1966; Felsenstein 1974), thereby explaining
the correlation as selectively rather than mutationally determined. There were not enough data available at the
time to resolve the issue, however, so the apparent association has remained only incompletely described and
largely unexplained. In particular, the scale of the association has remained unknown in the absence of data
on the relationship between GC content and recombination rate within chromosomes.
1 Present address: Department of Human Genetics, University of Chicago.
Abbreviations: cM, centimorgans; DSB, double-strand break.
Key words: recombination, GC content, isochores, Hill-Robertson effect, mutation.
Address for correspondence and reprints: Stephanie M. Fullerton, Department of Human Genetics, University of Chicago, 920 East 58th Street, CLSC 501, Chicago, Illinois 60637. E-mail: [email protected].
Mol. Biol. Evol. 18(6):1139–1142. 2001
© 2001 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
With the human genome project nearly complete
and detailed sequence and genetic map data readily
available, we chose to revisit the reported association,
both to confirm that the rough correlation described
originally still holds and to determine whether, with
more information, it is possible to disentangle the forces
likely to contribute to the inferred relationship. To examine the association of recombination and GC content
in the human genome, we used intron sequences for
genes in GenBank for which genetic map positions were
known, and band-specific estimates of recombination
rate derived using the Genetic Location Database (LDB)
(Collins et al. 1996; http://cedar.genetics.soton.ac.uk/publicphtml/ldb.html). The GC content of intronic sequences has been shown to be correlated both with the
third-codon-position GC content of associated genes and
with the GC content of the surrounding isochore (Clay
et al. 1996); thus, intron sequences provide a convenient
measure of chromosomal GC content. Only genes with
defined map information were considered, and redundant sequences were eliminated prior to analysis with
the program GPURGE from Falling Rain Genomics
(http://www.fallingrain.com/publicserver/index.html).
Recombination rates (in cM/Mb) were estimated from
the integrated genetic and physical map data contained
in the LDB. The ultimate source of the genetic map in
the LDB is the CEPH family study (Dausset et al.
1990), although data from a large variety of additional
sources (e.g., Broman et al. 1998) were added (see Collins et al. [1996] and on-line citations for further information). The resulting genetic map compares favorably
with other published map data (e.g., Dib et al. 1996),
while incorporating a larger number of informative meioses than any individual map. The human linkage map
is unlikely to be substantively improved in the near future, as this would require collection of more extended
pedigrees and massive typing, as with the 8,000+ markers typed for the CEPH panel (Broman et al. 1998).
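As an illustration of the compositional measure used here, the following is a minimal Python sketch of how the GC content of a single intron sequence might be computed. The function name gc_content and the decision to ignore ambiguous bases (e.g., N) are assumptions made for illustration; they are not taken from the original analysis.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G or C among the unambiguous bases of seq.

    Assumption: ambiguous symbols (N, R, Y, ...) are excluded from both
    the numerator and the denominator.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total
```

For example, gc_content("GGCCAATT") gives 0.5, and a sequence consisting only of ambiguous symbols raises an error rather than returning a misleading value.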
Recombination rates were estimated on a per-band
basis (at 0.1 band resolution), using the range of observed sex-averaged genetic distances and corresponding physical map distances for each of 721 bands or
band segments. For each band interval (e.g., 11p14.1 or
11p14.2, or 11p14 if sub-band information was unavailable), we identified markers at either end of the interval
for which external physical location data (based on restriction mapping evidence) were known, i.e., markers
with values in the “phymb” column of the LDB map.
We then took the difference in the physical map locations of these markers as the estimate of the physical
distance for the band and used the corresponding estimated genetic map locations for the same markers as
our estimates of the genetic distance for the interval (female and male distances were derived separately and
then averaged to obtain the final sex-averaged distance).
While the overall accuracy of these estimates was limited to some extent by the availability of physical data,
we felt that this approach was preferable to using the
composite physical locations in the LDB, which are derived assuming proportionality of cytogenetic band
width and DNA content (Collins et al. 1996). Short-range inaccuracies in the resulting recombination estimates were ameliorated by assigning to each band the
average of a three-interval window encompassing adjacent band estimates. Three-band smoothing increased
the observed correlation relative to that found using the
unaveraged data (r = 0.2869, as discussed below, vs. r = 0.2330 unaveraged).
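The per-band estimation described above can be sketched as follows in Python. The function names band_rate and smooth3 are hypothetical, and the treatment of the first and last band of a chromosome (averaging over the neighbors that exist) is an assumption, as the handling of chromosome ends is not specified in the text.

```python
def band_rate(phys_start_mb, phys_end_mb, female_cm, male_cm):
    """Sex-averaged recombination rate (cM/Mb) for one band interval.

    female_cm and male_cm are (start, end) genetic map positions, in cM,
    of the two flanking markers on the female and male maps.
    """
    physical = phys_end_mb - phys_start_mb
    if physical <= 0:
        raise ValueError("physical distance must be positive")
    female = female_cm[1] - female_cm[0]
    male = male_cm[1] - male_cm[0]
    genetic = (female + male) / 2.0  # derive separately, then average
    return genetic / physical


def smooth3(rates):
    """Average each band with its adjacent bands (three-interval window).

    Assumption: terminal bands are averaged over the two values available.
    """
    smoothed = []
    for i in range(len(rates)):
        window = rates[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

With flanking markers 2 Mb apart and spanning 5 cM on the female map and 1 cM on the male map, band_rate returns 1.5 cM/Mb. Wider windows (five, seven, or nine bands) would follow the same pattern with a larger slice.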
For the set of 8,244 introns examined, there was a
significant positive correlation between the estimated
rates of recombination and GC content (r = 0.2829, P
≪ 0.001). When the GC contents of introns from the
same gene were averaged, the number of observations
collapsed to 1,531 genes, but virtually the same relationship was observed (r = 0.2869, P ≪ 0.001; fig. 1).
Adjacent bands had correlated GC contents and local
recombination rates, reducing the degrees of freedom.
Autocorrelation analysis showed that this correlation
disappeared (P > 0.05) when we took every 25th gene
or a sparser sampling, yet the correlations between GC
content and recombination rate persisted (mean r =
0.2986, P(bootstrap) < 0.001). In our analysis, the correlation of GC content with recombination was smaller
than the correlation reported with chromosomal chiasmata density (i.e., r = 0.524; Eyre-Walker 1993). Uncertainty in our estimates of recombination does not appear to explain this discrepancy, as we found little difference in the observed correlation coefficients when estimates derived from averaging over larger numbers of
bands (five, seven, or nine) were used (data not shown).
Instead, the difference is more likely to reflect the substantively different data used in each study and, in particular, the much lower resolution of the whole chromosome-chiasmata relationship examined by Eyre-Walker (1993).
FIG. 1. Relationship between recombination and GC content for
1,531 genes (assembled from 8,244 introns in 1,358 human GenBank
accessions). GC content and local recombination rate have a significant
positive correlation (r = +0.2869, P ≪ 0.001).
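The steps of this analysis (averaging intron GC content within genes, computing the correlation, and thinning to every 25th gene) can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, the within-gene average is assumed to be an unweighted mean, and the bootstrap test is not reproduced.

```python
import math
from collections import defaultdict


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def average_by_gene(records):
    """Collapse (gene_id, intron_gc) pairs to one mean GC value per gene."""
    values = defaultdict(list)
    for gene, gc in records:
        values[gene].append(gc)
    return {gene: sum(v) / len(v) for gene, v in values.items()}


def thin(items, step=25, offset=0):
    """Keep every step-th item of a list ordered by map position."""
    return items[offset::step]
```

Thinning with different offsets yields several sparse subsamples; averaging the correlation over them is one way a mean r for the thinned data could be obtained.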
We also examined the association between GC content and recombination rate within chromosomes. First,
we removed the effect of chromosomal location from
both GC content and recombination by performing ANOVAs using chromosome as the factor, and GC content
and recombination rate as the dependent variables. As
expected (Nekrutenko and Li 2000), we found that chromosomes differed in their mean GC levels (P ≪ 0.001).
They also differed significantly in their mean recombination rates (P ≪ 0.001). We then saved the residuals
from both ANOVAs and computed their correlation. We
again found a very significant correlation (r = 0.122, P
≪ 0.001) and concluded that GC content and recombination were correlated within chromosomes. This observation was confirmed by a per-chromosome analysis,
in which we found that the correlation of recombination
rate with GC content was significantly >0 (P < 0.05)
in 9 of the 23 chromosomes (table 1). Data paucity may
explain the failure to observe a significant positive correlation in all cases. Together, these new results confirm
that GC content is higher in chromosomal regions with
higher recombination rates, and they suggest that local
recombination rate is an important indicator of compositional heterogeneity in the human genome.
Table 1
Correlation Between Recombination and GC Content for Individual Human Chromosomes

Chromosome    r         P        n(a)
6             0.351     0.000    215
19            0.403     0.000    195
X             0.451     0.000    159
16            0.471     0.000    134
22            0.012     0.898    119
1            -0.005     0.960     94
20            0.449     0.000     74
11           -0.019     0.879     69
17            0.202     0.098     68
14           -0.085     0.530     57
7             0.049     0.734     51
2             0.111     0.475     44
4             0.414     0.009     39
12           -0.253     0.130     37
3            -0.447     0.010     32
21            0.291     0.141     27
9            -0.396     0.045     26
5             0.438     0.025     26
8             0.653     0.001     23
10            0.126     0.608     19
18            0.508     0.134     10
15           -0.035     0.934      8
13            0.882     0.048      5

(a) Number of genes considered in the correlation.

Given the new evidence confirming a positive correlation between recombination and GC content in humans, it is relevant to ask whether these data shed any
additional light on the causes of the observed association. As described above, the original correlation was
previously explained as arising either directly, from mutational differences associated with biased gene conversion (Holmquist 1992; Eyre-Walker 1993), or indirectly,
as a consequence of natural selection for increased GC
content (Charlesworth 1994). If high GC content were
selectively favored, due either to its higher inherent thermal stability in warm-blooded organisms (Bernardi and
Bernardi 1986) or perhaps to its role in promoting CpG
islands (Eyre-Walker 1999) and/or the more active chromatin structure these islands promote (Mucha et al.
2000), we might expect to observe higher GC content
and the strongest relationship to recombination in regions with the highest rates of exchange. Invoking natural selection would be consistent with the inferred role
of selection on patterns of polymorphism in the human
MHC cluster (Eyre-Walker 1999), which falls in a region of the genome with a higher than average rate of
recombination (the estimated rate for chromosome band
6p21.3 is 2.13 cM/Mb). On the other hand, the impact
of selection on the correlations between GC and recombination rate depends critically on the parameters of mutation rate, effective size, and local recombination rates,
and plausible arguments can be made to generate the
reverse pattern. Similarly, in our ignorance of the relationship between local recombination and patterns of
mutation that may change GC content, stronger correlations in regions of higher recombination may also be
generated by a mutational mechanism.
Either of the above explanations assumes that patterns of recombination in the human genome are ultimately responsible for determining GC content differences. It is, of course, possible that the direction of causation is the reverse, such that differences in genome
composition instead determine differences in the rate of
recombination. One piece of evidence in support of this
idea is the association of recombination hot-spot activity
with the occurrence of double-strand breaks (DSBs) in
regions of open chromatin (Wahls 1998), which are often GC-rich (Svetlova et al. 1998; Mucha et al. 2000).
Although available data suggest that DNA accessibility
per se is not sufficient for hot-spot activity (Wu and
Lichten 1995; Wahls 1998), the recent observation of
the nonrandom association of meiotic DSBs with regions of high GC content in Saccharomyces cerevisiae
(Gerton et al. 2000) suggests a direct relationship, which
may prove causal. However, the extent to which we
would expect such hot-spot behavior to be reflected in
the long-range estimates of recombination derived here
is unclear.
Of course, either class of explanation leaves unexplained why heterogeneity exists in either recombination rate or GC content to begin with. It is possible,
for example, that both recombination rate and GC content are coincidentally related to an unknown third factor, such as replication timing or chromatin structure,
which may ultimately be responsible for generating variation in both rates of recombination and DNA sequence
composition. Relevant to this suggestion is the demonstration of a direct link between DNA replication and
meiotic DSB formation in S. cerevisiae (Borde, Goldman, and Lichten 2000). If, as Borde, Goldman, and
Lichten (2000) suggest, replication of a genomic region
is a necessary prerequisite for DSB occurrence, differences in replication origin distribution and firing could
have an important effect on the local recombination environment. Differences in the availability of nucleotide
precursors at different stages of DNA replication have
also previously been suggested as an explanation for GC
content variation (e.g., Wolfe, Sharp, and Li 1989), and
at least one GC-content transition region has been
shown to correspond closely to a genomic region in
which replication timing switches (Tenzen et al. 1997;
but see Eyre-Walker [1992] and Watanabe et al. [2000]
for cases in which the relationship between replication
timing and GC content has proved to be less clear-cut).
Therefore, it is possible that the replication structure of
the human genome has significant, independent effects
on both variables. Without further information about the
precise biological relationship of recombination to GC-rich DNA, it is impossible to rule out the influence of
such an additional factor.
Our analysis clearly demonstrates that the originally suggested correlation between recombination and GC
content, described previously at the chromosomal level
alone, is present within chromosomes and thus on a genomewide scale at the DNA sequence level. This correlation has important implications for how we interpret
and understand other genomewide correlations with GC
content and suggests that GC content may serve as a
useful marker of local recombination frequency, with
potential applications in human genomic analysis as
well as in surveys of DNA polymorphism and linkage
disequilibrium (e.g., Przeworski, Hudson, and Di Rienzo
2000). Untangling of the precise causes of the association awaits fine-scale analysis of GC content boundary
regions, such as the recently reported investigation of
linkage disequilibrium differences across an L1-H2 isochore transition boundary on chromosome 17 (Eisenbarth et al. 2000). Full genome sequences of other primates would also allow a much richer picture of variation in patterns of mutation across the genome, which
may ultimately allow rejection of mutation as an explanation for the GC-recombination correlation.
Acknowledgments
We thank C. Mugnier for providing collated
GenAtlas (http://bisance.citi2.fr/GENATLAS) genic
map data, and E. T. Dermitzakis, A. Eyre-Walker, B. P.
Lazzaro, K. L. Montooth, and an anonymous reviewer
for helpful comments. This work was supported in part
by grants from the Pew Latin American Fellows Program (A.B.C.) and the National Heart, Lung, and Blood
Institute (S.M.F. and A.G.C., grant HL58239).
LITERATURE CITED
BERNARDI, G. 1989. The isochore organization of the human
genome. Annu. Rev. Genet. 23:637–661.
———. 2000. Isochores and the evolutionary genomics of vertebrates. Gene 241:3–17.
BERNARDI, G., and G. BERNARDI. 1986. Compositional constraints and genome evolution. J. Mol. Evol. 24:1–11.
BORDE, V., A. S. GOLDMAN, and M. LICHTEN. 2000. Direct
coupling between meiotic DNA replication and recombination initiation. Science 290:806–809.
BROMAN, K. W., J. C. MURRAY, V. C. SHEFFIELD, R. L. WHITE,
and J. L. WEBER. 1998. Comprehensive human genetic
maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet. 63:861–869.
CHARLESWORTH, B. 1994. Genetic recombination: patterns in
the genome. Curr. Biol. 4:182–184.
CLAY, O., S. CACCIO, S. ZOUBAK, D. MOUCHIROUD, and G.
BERNARDI. 1996. Human coding and noncoding DNA:
compositional correlations. Mol. Phylogenet. Evol. 5:2–12.
COLLINS, A., J. FREZAL, J. TEAGUE, and N. E. MORTON. 1996.
A metric map of humans: 23,500 loci in 850 bands. Proc.
Natl. Acad. Sci. USA 93:14771–14775.
DAUSSET, J., H. CANN, D. COHEN, M. LATHROP, J. M. LALOUEL, and R. WHITE. 1990. Centre d’étude du polymorphisme humain (CEPH): collaborative genetic mapping of the
human genome. Genomics 6:575–577.
DIB, C., S. FAURE, C. FIZAMES et al. (14 co-authors). 1996. A
comprehensive genetic map of the human genome based on
5,264 microsatellites. Nature 380:152–154.
DURET, L., D. MOUCHIROUD, and C. GAUTIER. 1995. Statistical
analysis of vertebrate sequences reveals that long genes are
scarce in GC-rich isochores. J. Mol. Evol. 40:308–317.
EISENBARTH, I., G. VOGEL, W. KRONE, W. VOGEL, and G. ASSUM. 2000. An isochore transition in the NF1 gene region
coincides with a switch in the extent of linkage disequilibrium. Am. J. Hum. Genet. 67:873–880.
EYRE-WALKER, A. 1992. Evidence that both G + C rich and
G + C poor isochores are replicated early and late in the
cell cycle. Nucleic Acids Res. 20:1497–1501.
———. 1993. Recombination and mammalian genome evolution. Proc. R. Soc. Lond. B Biol. Sci. 252:237–243.
———. 1999. Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152:675–683.
FELSENSTEIN, J. 1974. The evolutionary advantage of recombination. Genetics 78:737–756.
FILIPSKI, J. 1987. Correlation between molecular clock ticking,
codon usage, fidelity of DNA repair, chromosome banding
and chromatin compactness in germline cells. FEBS Lett.
217:184–186.
FRANCINO, M. P., and H. OCHMAN. 1999. Isochores result from
mutation not selection. Nature 400:30–31.
GERTON, J. L., J. DERISI, R. SHROFF, M. LICHTEN, P. O.
BROWN, and T. D. PETES. 2000. Inaugural article: global
mapping of meiotic recombination hotspots and coldspots
in the yeast Saccharomyces cerevisiae. Proc. Natl. Acad.
Sci. USA 97:11383–11390.
HILL, W. G., and A. ROBERTSON. 1966. The effect of linkage
on limits to artificial selection. Genet. Res. 8:269–294.
HOLMQUIST, G. P. 1992. Chromosome bands, their chromatin
flavors, and their functional features. Am. J. Hum. Genet.
51:17–37.
IKEMURA, T., and K. WADA. 1991. Evident diversity of codon
usage patterns of human genes with respect to chromosome
banding patterns and chromosome numbers; relation between nucleotide sequence data and cytogenetic data. Nucleic Acids Res. 19:4333–4339.
MUCHA, M., K. LISOWSKA, A. GOC, and J. FILIPSKI. 2000. Nuclease-hypersensitive chromatin formed by a CpG island in
human DNA cloned as an artificial chromosome in yeast.
J. Biol. Chem. 275:1275–1278.
NEKRUTENKO, A., and W.-H. LI. 2000. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 10:1986–1995.
PRZEWORSKI, M., R. R. HUDSON, and A. DI RIENZO. 2000.
Adjusting the focus on human variation. Trends Genet. 16:
296–302.
SHARP, P. M., M. AVEROF, A. T. LLOYD, G. MATASSI, and J.
F. PEDEN. 1995. DNA sequence evolution: the sounds of
silence. Philos. Trans. R. Soc. Lond. B Biol. Sci. 349:241–
247.
SORIANO, P., M. MEUNIER-ROTIVAL, and G. BERNARDI. 1983.
The distribution of interspersed repeats is nonuniform and
conserved in the mouse and human genomes. Proc. Natl.
Acad. Sci. USA 80:1816–1820.
SUEOKA, N. 1988. Directional mutation pressure and neutral
molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653–
2657.
SVETLOVA, E., N. AVRIL-FOURNOUT, G. IRA, P. DESCHAVANNE,
and J. FILIPSKI. 1998. DNase-hypersensitive sites in yeast
artificial chromosomes containing human DNA. Mol. Gen.
Genet. 257:292–298.
TENZEN, T., T. YAMAGATA, T. FUKAGAWA, K. SUGAYA, A. ANDO,
H. INOKO, T. GOJOBORI, A. FUJIYAMA, K. OKUMURA, and
T. IKEMURA. 1997. Precise switching of DNA replication
timing in the GC content transition area in the human major
histocompatibility complex. Mol. Cell. Biol. 17:4043–4050.
WAHLS, W. P. 1998. Meiotic recombination hotspots: shaping
the genome and insights into hypervariable minisatellite
DNA change. Curr. Top. Dev. Biol. 37:37–75.
WATANABE, Y., T. TENZEN, Y. NAGASAKA, H. INOKO, and T.
IKEMURA. 2000. Replication timing of the human X-inactivation center (XIC) region: correlation with chromosome
bands. Gene 252:163–172.
WOLFE, K. H., P. M. SHARP, and W. H. LI. 1989. Mutation
rates differ among regions of the mammalian genome. Nature 337:283–285.
WU, T. C., and M. LICHTEN. 1995. Factors that affect the location and frequency of meiosis-induced double-strand
breaks in Saccharomyces cerevisiae. Genetics 140:55–66.
ZOUBAK, S., O. CLAY, and G. BERNARDI. 1996. The gene distribution of the human genome. Gene 174:95–102.
ADAM EYRE-WALKER, reviewing editor
Accepted February 15, 2001