Recent natural selection in human noncoding sequences
Heather A. Lawson, Joel Martin, David C. King, Belinda Giardine, Webb Miller, Hiroshi
Akashi, and Ross C. Hardison
Center for Comparative Genomics and Bioinformatics, Huck Institutes of Life Sciences,
and Departments of Anthropology, Biology, Computer Science and Engineering,
Biochemistry and Molecular Biology, The Pennsylvania State University, University
Park, PA 16802
Running Title: Recent selection in noncoding sequences
Contact information for corresponding author:
Ross Hardison
Email: [email protected]
Phone: 814-863-0113
FAX: 814-863-7024
Abstract
It has long been speculated that adaptive changes in noncoding gene regulatory
sequence, rather than in coding sequence, underlie many of the phenotypic differences
between humans and other primates. Therefore, identification of noncoding regions
bearing the signature of recent positive or negative selection is an important step toward
understanding the evolutionary history of our species. Using the ratios of polymorphic
sites to divergent sites in noncoding sequences relative to those in a widely distributed
model for neutral DNA (repeat sequences ancestral to catarrhines) to evaluate deviation
from neutral expectation, we examine the ability of current polymorphism data
(dbSNP126 filtered for quality) and divergence from chimpanzee and rhesus genomes
to provide useful information in this variation of the McDonald-Kreitman (MKAR) test.
The test can be conducted on almost half the human genome, and at a false discovery
rate of 10%, only 0.26% of the genome shows significant deviation from neutrality. The
windows implicated as non-neutral by the MKAR test are supported by other lines of
evidence for selection: they show significant overlap with genes previously implicated as
targets of recent selection, and the non-neutral windows suggestive of positive selection
are enriched in genomic duplications and in intervals showing evidence of positive
selection by Tajima’s D. The intervals deviating from neutrality overlap little when
divergence is determined with chimpanzee versus rhesus, revealing distinctive
signatures on different evolutionary time scales that may reflect the unique adaptive
histories among these lineages. Some of the non-neutral noncoding segments contain
DNase hypersensitive sites; these are strong candidates for further experimental
investigation of the role of gene regulation in human evolution. Results of this study are
a useful addition to comparative genomics resources aimed at finding functional
noncoding DNA sequences.
Introduction
Only a small fraction of mammalian genomes, roughly 5% (1-4), is thought to carry out
functions conserved in all mammals. This fraction should be subject to negative
(purifying) selection and thus can be detected as sequences showing significantly less
change relative to neutrally evolving DNA over the span of mammalian evolution. Other
functional sequences show significantly more change relative to neutral expectations
along certain mammalian lineages and hence appear to be under positive (adaptive)
selection. Such a phylogenetic approach has been used to identify protein-coding
genes showing evidence of selection on the human lineage (5-9). Most standard tests
for lineage-specific selection have been applied to protein-coding regions, e.g. the
McDonald-Kreitman test (10), relative rate tests (11), and a variety of tests based on
differences in synonymous and nonsynonymous sites, e.g. DN/DS (12).
However, functional sequences under selection include both coding and non-coding
sequences, such as cis-regulatory modules, and identifying signatures of selection in
non-coding DNA is an important line of inquiry into mammalian evolution in general, and
into human evolution in particular. For example, mutations in a regulatory region near
the lactase (LCT) gene are associated with selection for adult lactase persistence (13),
and a mutation in the promoter region of the Duffy (FY) gene has been associated with
selection for P. vivax malarial resistance in African populations (14). Other recent
dramatic examples of presumptive positive selection in non-coding regions are the
highly accelerated regions of the human genome (15, 16), and the conserved noncoding
sequences found to be undergoing accelerated evolution by examining human-specific
substitutions in interspecies comparisons (17). Human chromosomal intervals
suggestive of positive selection have been identified by skews in the frequency
spectrum of polymorphisms (e.g. Tajima’s D in the study of Carlson et al. 2006) and
extended regions of homozygosity (Akey ….). These latter studies utilize polymorphism
frequency data in both coding and noncoding DNA sequences, and the results tend to
be interpreted as recent selection in coding regions.
Another approach to finding DNA segments under recent selection compares the
number of polymorphic sites (reflecting mutation rate) in one species with the number of
sites divergent in related species (reflecting the fixation rate for mutations). Neutral sites
are expected to accumulate intraspecific changes (polymorphisms) and interspecies
changes (divergences) at proportional rates, e.g. a neutral region that mutates
frequently in a population will also show a high number of divergent sites when compared
with a related species. McDonald and Kreitman introduced an explicit test of deviation
from neutral expectation in protein-coding DNA sequences (10) by comparing counts of
polymorphisms and interspecies divergent sites at nonsynonymous (test) positions
within coding regions with counts of polymorphisms and divergences at synonymous
sites, which are assumed to be neutral. Most test sites should show a ratio of
polymorphism to divergence counts indistinguishable from that at neutral sites (18);
these will have a ratio of polymorphism to divergence relative to that at neutral sites (rpd)
of around 1 [Akashi, 1995 #67]. Test sites under positive selection will diverge more
between species than the neutral sites, giving an rpd less than 1. Conversely, test sites
under negative selection will diverge less between species than the neutral sites,
producing an rpd larger than 1. Population-level processes such as nonrandom mating,
population growth, and population bottlenecks can also affect the rpd (19) but, in
general, selection will differentially affect specific sequences whereas population-level
processes will affect all sequences in a genome (20).
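
As an illustration of the rpd statistic described above, the following minimal Python sketch computes the ratio for one set of counts and reports the direction of selection it would suggest. The function names and the example counts are hypothetical and are not taken from the analysis pipeline; the sketch only restates the definition given in the text, and it ignores statistical significance.

```python
def rpd(test_poly, test_div, neutral_poly, neutral_div):
    """Ratio of polymorphism to divergence at test sites, relative to
    the same ratio at neutral sites. Returns None when undefined."""
    if test_div == 0 or neutral_poly == 0 or neutral_div == 0:
        return None
    return (test_poly / test_div) / (neutral_poly / neutral_div)


def suggested_direction(value):
    """Interpret an rpd value as described in the text (no significance test)."""
    if value is None:
        return "undefined"
    if value > 1:
        return "negative"   # less divergence than expected at test sites
    if value < 1:
        return "positive"   # more divergence than expected at test sites
    return "neutral"


# Hypothetical counts: 10 polymorphic and 40 divergent test sites,
# 20 polymorphic and 40 divergent neutral sites.
example = rpd(10, 40, 20, 40)
print(example, suggested_direction(example))
```

With these made-up counts the test sites carry proportionally more divergence than the neutral sites, so the ratio falls below 1.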
In this report, we explore the ability of current genome-wide data to formulate a useful
McDonald-Kreitman test for recent selection in noncoding DNA sequences in humans.
The human genome was partitioned into windows large enough to get sufficiently high
counts for polymorphism and divergence; we found that 10kb windows gave a good
balance between resolution and coverage of the genome (about half is covered).
Repetitive DNA ancestral to the species compared was used as the model for neutral
sites, because these sites occur frequently and are interspersed with nonrepetitive DNA
on a 10kb scale. Divergences of human from two primates, the closely related
chimpanzee and the more distant rhesus macaque, were both used to examine effects
of selection over two different time scales. The largest set of human polymorphism data
(dbSNP126) was used because it was less biased than the HapMap polymorphic sites;
more comprehensive and less biased polymorphism data can be included as they
become available.
Our results indicate that the current application of the test does give insights into
evidence suggestive of recent selection on non-coding DNA sequences in the half of the
human genome examined. The segments of DNA that show statistically significant
deviation from neutral expectation by this test also appear to be biologically meaningful,
because they overlap with several loci whose protein products appear to be under
selection in humans. In contrast to the observations for recent selection in coding
sequences, the non-neutral segments of non-coding DNA show an rpd suggestive of
negative selection more frequently than positive selection. The non-neutral DNA
segments that overlap with DNase hypersensitive sites in chromatin are particularly
strong candidates for gene regulatory regions under recent selection. The MKAR test is
a useful and timely addition to comparative whole-genome resources aimed at finding
functional non-coding DNA sequences.
Results
McDonald-Kreitman test in noncoding sequences
We implemented a variation of the McDonald-Kreitman test (MKAR test) to search for
evidence of non-neutrality in equally-sized segments of DNA across the human
genome. While developing this, we investigated many issues that affect the outcome of
the test. First, we decided to exclude certain sites in the human genome. Coding
portions of exons were masked so that only non-coding DNA sequences contribute
to the test, and the hypermutable CpG dinucleotides are not included. The next issue
was the choice of the neutral model. Previous studies [Waterston, 2002 #54; Hardison,
2003 #15] had shown that sites in interspersed repeats ancestral to mouse and human
behave similarly to other sites that are good models for neutral DNA, but they are far
more frequent and are highly interspersed with nonrepetitive noncoding DNA, which are
the sites we want to investigate for evidence of recent selection. We evaluated the
effectiveness of repeats deduced to be ancestral to euarchontoglires (e.g. rodents and
primates), simians and catarrhines (Old World Monkeys, apes and humans) as the
neutral model, and chose those ancestral to catarrhines (Fig. 1) because their higher
abundance led to greater coverage of the human genome. Other models for neutral
sites that are highly abundant, such as intergenic or intronic DNA
[The_Chimpanzee_Sequencing_and_Analysis_Consortium, 2005 #56] could not be
used because they cover the potential targets for recent selection.
The closest relative to humans with a sequenced genome is chimpanzee, and
divergences between human and chimp were computed in one version of the MKAR
test. This gives insight into selection over the past 6 million years, but the two species
do share some polymorphic sites
[The_Chimpanzee_Sequencing_and_Analysis_Consortium, 2005 #56]. Thus
divergence from the Old World Monkey rhesus macaque [Gibbs, 2007 #76] was also
computed, and the resulting version of the MKAR test gives information on possible
selection over the past 23 million years since the divergence between rhesus and
human.
The choice of polymorphism data was largely determined by the size of available
datasets. Build 126 of dbSNP has 5,608,719 validated entries that map to a single
location in the human genome. This is the largest collection of SNPs available and it
was chosen for the current implementations of the MKAR test. We considered
restricting the polymorphisms to those genotyped in the HapMap project
[The_International_HapMap_Consortium, 2005 #33], but these were under-represented
in the more recent repeats (Fig. 2), as might be expected from the requirement that they
produce unambiguous genotyping signals. The entries in dbSNP are biased toward the
more common SNPs [Clark, 2005 #34]. However, this bias affects both our test and AR
classes of sites, and the test is not dependent on frequency of polymorphisms, but
rather counts of polymorphic and divergent sites. Thus we expect the bias toward
common SNPs to not preclude an effective MKAR test. Nevertheless, it will be important
to repeat the test on more comprehensive SNP datasets as they become available.
With these selections of the target regions in the human genome, the neutral model, the
comparison species and the source of polymorphism data, we could then evaluate the
size of windows for genome partitioning. The windows must be large enough for the
ARs to be intercalated with the test sites and to have sufficient counts of polymorphic
and divergent sites, yet small enough to give an informative resolution across the
human genome. After examining window sizes from 1kb to 50kb, we found that almost
half the 10kb non-overlapping windows had the desirable properties, whereas other
sizes compromised coverage or resolution to an unacceptable extent. An assumption of
the McDonald-Kreitman test is that the classes of sites compared are intercalated and
hence have the same genealogical history. The catarrhine ARs are widely dispersed; all
but 0.24% of the 10kb windows have ARs, and the windows have an average of 14
individual repeats (containing an average of 4371 AR nucleotide sites) per window.
These individual repeats are interspersed with nonrepetitive DNA, with an average
distance between them of 444 bp. This distance is much smaller than the 10kb window,
thus supporting an appropriate level of interspersion of neutral and test sites for the
MKAR test. We restricted the MKAR test to windows with a minimum expected count of
five for each category. Of the total of 281,871 windows for the human genome assembly
(hg18), 139,701 had sufficient counts for the MKAR test with divergence from
chimpanzee. Thus using 10kb windows allowed the test to be run in 46% of the human
genome.
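
The count filter described above can be sketched as follows. This is a minimal illustration, assuming that the expected count for each cell of the 2x2 table (test or AR sites by polymorphism or divergence) is computed in the usual way under independence, as row total times column total divided by the grand total. The function names and the example counts are hypothetical.

```python
def min_expected_count(test_poly, test_div, ar_poly, ar_div):
    """Smallest expected cell count of the 2x2 table under independence."""
    total = test_poly + test_div + ar_poly + ar_div
    if total == 0:
        return 0.0
    row_totals = (test_poly + test_div, ar_poly + ar_div)   # TEST, AR
    col_totals = (test_poly + ar_poly, test_div + ar_div)   # polymorphism, divergence
    return min(r * c / total for r in row_totals for c in col_totals)


def has_sufficient_counts(test_poly, test_div, ar_poly, ar_div, threshold=5.0):
    """True if every expected cell count reaches the threshold (five in the text)."""
    return min_expected_count(test_poly, test_div, ar_poly, ar_div) >= threshold


# Hypothetical windows: the first is kept, the second is too sparse.
print(has_sufficient_counts(10, 40, 20, 40))
print(has_sufficient_counts(2, 3, 1, 4))
```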
For each window, the numbers of polymorphic and divergent sites were counted in the
ARs and in the rest of the unmasked portion of the window. The ratio of polymorphism
to divergence in the non-AR sites divided by that ratio for AR sites gives the relative
ratio, called rpd, for that window. The significance of the deviation from neutral
expectation was evaluated by Fisher’s exact test followed by FDR correction for multiple
testing [Storey, 2002 #25]. For this report, the 737 windows passing an FDR threshold
of 0.10 are considered to deviate significantly from neutrality (based on divergence from
chimpanzee), and they will be called non-neutral. It is likely that other windows not
passing this FDR threshold may have informative signals suggestive of selection, and
thus the full data on rpd, p-values and FDR values for each window tested by MKAR are
made available as a custom track on the UCSC Genome Browser (49) at
http://genome-test.cse.ucsc.edu/ (hg18 March 2006 assembly, Track Group = “Experimental Tracks”,
track name = “McDonald-Kreitman AR 4”). {We will request that these tracks be
included in the UCSC Genome Browser once this report is accepted for publication.}
The data for all 139,701 windows with sufficient counts for the test can be downloaded
from http://www.bx.psu.edu/~ross/dataset/MKARchimpEnufCtsPfdr.txt.
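
The per-window significance test can be sketched in a few lines of standard-library Python. The sketch below computes a two-sided Fisher's exact p-value for a window's 2x2 table and then adjusts the p-values across windows. Note one substitution: the adjustment shown is the Benjamini-Hochberg procedure, used here only as a simpler stand-in for the FDR method cited in the text, and it will not reproduce those values exactly. Function names and the example tables are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]],
    e.g. rows = TEST/AR sites, columns = polymorphic/divergent counts."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more probable than the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical windows given as (test_poly, test_div, ar_poly, ar_div).
windows = [(8, 2, 1, 5), (10, 40, 20, 40), (30, 30, 30, 30)]
pvals = [fisher_exact_two_sided(*w) for w in windows]
fdr = benjamini_hochberg(pvals)
non_neutral = [w for w, q in zip(windows, fdr) if q <= 0.10]
print(pvals, fdr, non_neutral)
```

The exact test is appropriate here because counts per window are small; the multiple-testing step matters because more than a hundred thousand windows are tested.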
Features of windows with non-neutral noncoding DNA
The results of the MKAR test for many windows differ when divergence is computed
relative to chimpanzee or rhesus (Table 1). A substantially larger number of windows
pass the FDR threshold when divergence is determined from chimpanzee, and only half
the non-neutral windows observed with rhesus divergence are in common with those
observed with chimp divergence. Thus, different regions of the genome show unique
signatures of selection over the two evolutionary timeframes of approximately 6 million
(chimp) and 23 million years ago (rhesus). We will focus on the windows that show
significant results with comparison to chimpanzee for most of our examples, but it is
important to keep in mind that other windows are significant when the comparison is
with rhesus.
For the non-neutral windows, the direction of selection can be inferred from the value of
rpd, with values greater than 1 suggestive of negative selection and values less than 1
suggestive of positive selection. Most of the non-neutral windows show an rpd
suggestive of negative selection (Fig. 3); this is the case whether divergence from
chimpanzee or from rhesus is considered (Fig. 4). The dominance of negative selection
was also observed in a genome-wide survey of selection in protein-coding regions
based on a comparison of rates of substitution between human and chimpanzee at
synonymous and nonsynonymous sites [Bustamante, 2005 #41]. In both our MKAR
tests and the Bustamante et al. [, 2005 #41] study, the divergence from a related
species contributes to the signal being evaluated, and negative selection has had
sufficient time to impact the signal. In contrast, tests for selection based on allele
frequencies or other data only in humans may be influenced more by the
disadvantageous alleles still in the human population and thus provide a stronger signal
for positive selection [Nielsen, 2007 #98]. {Hiroshi - am I misinterpreting or mis-stating
this? - rh} The limited overlap in windows detected as non-neutral when divergence is
considered from chimpanzee versus rhesus is found regardless of the suggested
direction of selection (Fig. 4).
The non-neutral windows detected by this test tend to cluster in certain regions of the
human genome, including many with segmental duplications (Fig. 3). For example, the
pericentromeric regions of chromosomes 1, 2, 4, 9, 10, 15 and 16 are enriched in both
non-neutral windows and segmental duplications. These two features overlap in many
other chromosomal regions, such as chromosomal band 11p15.4, which contains a large
family of olfactory receptor genes that show a clear signal for selection {ref}. Genes with
duplicated copies and regions with segmental duplications tend to be targets of recent
positive selection {refs}, and thus one may expect that the windows deviating from
neutrality by the MKAR test would be enriched for signals suggestive of positive
selection. Indeed, this was observed (Table __). Almost 30% of the windows suggestive
of positive selection overlap with segmental duplications, compared with only 6% of all
the windows tested by MKAR and 15% of the windows suggestive of recent negative
selection. The extensive overlap between windows deviating from neutrality and regions
with segmental duplication, along with the concordance in direction of selection, lends
support for the validity of the MKAR test. The large number of windows deviating
significantly from neutral expectation by the MKAR test in these bands supports the validity
of the test.
Several of the windows that are suggestive of positive selection by the MKAR test are
congruent with an independent assay for evidence of recent positive selection in the
human genome. Carlson et al. {ref} used HapMap (?) data to find chromosomal
intervals with a site frequency spectrum skewed toward rare alleles, as measured by
Tajima’s D {ref}. Contiguous regions with negative values for Tajima’s D are consistent
with the expectation for positive selection. In the genome-wide study of Carlson et al.,
about 10% of the windows have negative values for Tajima’s D. Of the windows
suggestive of positive selection by the MKAR test, 14% of them overlap with intervals
suggestive of positive selection by Tajima’s D, whereas about 10% of all non-neutral
windows by MKAR overlap with these intervals with negative values of Tajima’s D.
As a group, the non-neutral 10kb windows have a similar amount of overlap with
functional regions as do windows for the entire genome (Table 2). The sets of functional
regions compiled throughout the human genome include two groups of experimentally
determined functional regions, i.e. DNase hypersensitive sites observed in CD4+ cells
(DHS) [Boyle, 2008 #99], which are markers for most gene-regulatory regions (in that
cell line) and exons of protein-coding genes [Hsu, 2006 #74], plus a set of DNA
segments predicted to be involved in gene regulation based on two independent
analyses of multi-species alignments [PRPs \Miller, 2007 #2231]. A similar amount of
overlap with functional features is seen for all the windows with sufficient counts for the
MKAR tests and the ones that are non-neutral, regardless of whether the overlap is
measured as fraction of windows that overlap the functional segments or as the portion
of DNA in the intervals composed of functional segments (Table 2). Interestingly, the
non-neutral windows suggestive of recent positive selection in humans show a
significant reduction in overlap with the two functional categories that are changing
more slowly over the span of mammalian evolution [Miller, 2007 #2231]. This lends
support to the evolutionary inferences based on the MKAR test, and suggests that some
positively selected noncoding DNA elements are separated from negatively selected
ones by at least the window size used here (10kb). In addition, this analysis shows that
the half of the genome lacking sufficient counts of polymorphism and/or divergence for
the current version of the MKAR test is slightly enriched in functional regions (Table 2).
Comparison with genes implicated as targets of recent selection
A good test of the effectiveness of this implementation of the MKAR test is to examine
overlap with protein-coding genes previously implicated as being under recent positive
or negative selection. While there is little concurrence among results of recent studies
identifying genes affected by positive selection on the human lineage (35), these
differences result at least in part from different types of evolutionary signals and distinct
time-frames examined [Nielsen, 2007 #88]. Thus we looked at all the genes implicated
in recent selection, and tested for overlap and proximity to the windows with noncoding
DNA deviating significantly from neutrality in the MKAR test. An important question is
whether or not a similar signal of selection is found in the noncoding regions of genes
deviating from neutrality as is seen in their coding regions.
Of the 737 non-neutral windows (using chimp divergence), 11 overlap with genes
previously implicated as targets of recent selection, whereas random expectation is that
6 will overlap. Conversely, of the 420 genes in our compilation of genes under recent
selection, 12 overlap with windows with non-neutral noncoding DNA, compared to 6
expected for randomly chosen windows. In contrast, extension of the windows to 100kb
gives the same level of overlap as expected for random intervals. This shows that
noncoding sequences within about 10 kb of coding sequences under selection also tend
to show evidence of recent selection (p-value from a chi-squared test = ______), but
this effect does not extend to distances of 100kb. This congruence with the results of
multiple analyses of selection in coding genes lends support to the utility of the results
of analysis of noncoding DNA by MKAR.
Within these non-neutral windows overlapping genes previously identified as targets of
recent selection in the human lineage, several cases were found in which the direction
of selection appears to be the same in both protein-coding and noncoding DNA. For
simplicity of presentation, these examples are limited to those observed by the MKAR
test using divergence from chimpanzee, and additional examples with divergence from
rhesus are in ______ Supplementary Table ___. Three genes previously identified as
being positively selected on the human lineage overlap non-neutral windows also
suggestive of positive selection in noncoding DNA. These are PCDH15 (37), linked to
inherited forms of deafness, ABCC1 (38), an anion transporter related to immune
function in cells, and SLC30A9 (37), which is expressed in human embryonic lung. ____
genes showing evidence of negative selection on the human lineage overlapped
windows also suggestive of negative selection on noncoding DNA. These include
IQGAP2, HLA-DPA1, and CNTN5. Although considerably more experiments are
required to test this hypothesis, these examples are consistent with the expectation for
loci in which both the protein-coding exons and the proximal DNA sequences
involved in regulation are undergoing similar adaptive evolution (coding and noncoding
sequences with signals suggestive of positive selection) or purifying selection (both
classes showing signals for negative selection).
Several examples of different directions of selection in coding versus noncoding DNA
sequences were also observed. DOCK1, a signaling factor required for neurite
outgrowth and showing evidence of negative selection (6), overlaps an MKAR window
suggestive of positive selection. Two genes with evidence of positive selection
overlapped non-neutral windows suggestive of negative selection. These are TLL2,
associated with bone morphogenesis (5, 9), and MCPH1, mutations in which are
associated with primary microcephaly (39, 40). It is important to note, however, that
other work finds evidence of negative selection in MCPH1 (6), and quite a bit of
controversy surrounds claims of human-specific adaptive evolution in this gene (35).
Despite the ambiguities for MCPH1, it is certainly possible for the directions of selection
on coding sequences to differ from that in noncoding sequences. For example, changes
in a protein structure may be adaptive in one lineage, but the tissues and developmental
stages at which the protein is produced remain constant across different lineages.
Candidates for cis-regulatory modules under selection in the human lineage
The MKAR test is executed across the human genome, and thus it will find non-neutral
windows regardless of their position with respect to genes. Although many
cis-regulatory modules (CRMs) are within 5 kb of transcription start sites for genes [Birney,
2007 #100], others can be hundreds of kb or more from the promoters they regulate,
particularly for genes under complex regulatory control such as those underlying
embryonic development (43). We have detected strong signals suggestive of positive
and negative selection in noncoding regions both proximal to genes and in gene-sparse
regions that may house CRMs. These regions are good candidates for further
investigation of potential regulatory elements, and this should be most productive when
combined with biochemical data indicative of gene regulatory regions.
The DNaseI hypersensitive sites (DHSs) in chromatin of CD4+ T-cells [Boyle, 2008 #99]
comprise a good dataset for a feature frequently associated with cis-regulatory
modules, i.e. a discrete region with altered chromatin structure such that the
transcription machinery (and nucleases added experimentally) have access to the DNA.
DHSs have been identified as the locations of regulatory elements such as
promoters, silencers, enhancers and locus-control regions (47, 48). Tables 4 and 5 list
MKAR windows both significantly deviating from neutrality and containing DNaseI
hypersensitive sites. Fig. 6 illustrates a non-neutral window with noncoding DNA
significantly suggestive of positive selection. This window also contains a DHS along
with many segments of high regulatory potential [Taylor, 2006 #27]. The DNaseI
hypersensitive site in this window is 183 bp long and located 112 bp upstream of the
gene MTRF1L, which encodes a protein related to a mitochondrial translational release
factor. The human-chimp pairwise alignment for this site has 3 diverged bases and the
human-rhesus alignment has 17 diverged bases (Fig. 7), but no SNPs are in this DHS
in the dataset used for our tests. The excess of divergence relative to polymorphism in
this site is a signature of positive selection. It is reasonable to speculate that this site
has undergone adaptive evolution since the human lineage split from those of the
other catarrhines, although more experimentation is necessary to understand the precise
function of this DNaseI hypersensitive site. The MTRF1L gene is similar to the MTRF1
gene, presumably arising by gene duplication. Duplicated genes are more likely to
undergo adaptive changes, and this may be an example of adaptive change in its
pattern of expression. {Do the GNF tracks suggest any differences in the patterns of
expression?}
Discussion
{This is just a holding area for various ideas now.}
Results of the MKAR test can be used to find noncoding regions throughout the human
genome that are candidate targets of natural selection over the past 23Myr
(comparisons with rhesus) or 6Myr (comparisons with chimp) of primate evolution.
These are potentially functional noncoding sequences, some of which may be CRMs.
By leveraging polymorphism counts and recent divergence, the MKAR test explores a
phylogenetic span, and likely some functional regions, different from those investigated
by studies of conservation among non-primate mammalian species or conservation
from mammals to other vertebrates, such as birds and fish. Thus the results of this test
are a useful addition to comparative genomics resources aimed at finding functional
DNA sequences. These results also point to candidates for noncoding regions that play
a role in understanding evolution of the human lineage. Follow-up of these candidates
promises to be a fruitful area of further investigation.
Previous similar modifications to the McDonald-Kreitman test have been successfully
used to evaluate evidence of selection in non-coding DNA that is not truly intercalated in
Drosophila, separating sites into transcription factor binding (test) sites and
non-transcription factor binding (neutral) sites (29, 30), or into noncoding (test) and
synonymous (neutral) sites (31). Our study represents the first such modification to
evaluate evidence for negative and positive selection in noncoding DNA spanning the
entire human genome, including regions far from genes. Furthermore, by comparing
counts in the noncoding test sites to those in AR sites for each 10kb window
individually, we capture local variations in neutral evolutionary rates.
Compare with Tajima’s D analysis and extended homozygosity data?
However, the MKAR test does not resolve the actual sequence under selection. Rather
it shows the 10kb window housing the sequence having the signature of selection. The
source of the signal, and potential target of selection, could be the putative CRM, or it
could be the surrounding region. As our understanding of molecular evolution of
noncoding DNA in general, and of CRMs in particular, improves, we will be able to
refine our methods of searching for signatures of selection in these regions and tease
out the source sequence deviating from neutrality.
Methods
Datasets examined
Divergence data for the MKAR test was generated from three-way whole-genome
alignments of human-chimp-macaque computed on the March 2006 human assembly
(hg18), the March 2006 chimp assembly (panTro2), and the January 2006 macaque
assembly (rheMac2) using MULTIZ (50). Alignments are available for bulk download
from the UCSC Genome Browser (http://genome.ucsc.edu/). Polymorphism data for the
MKAR test was obtained from two sources: 1) dbSNP build 126 (51) downloaded from
the UCSC Genome Table Browser and filtered for SNPs having a known validation
status that map to a single genomic location; and 2) The International HapMap Project
(33) non-redundant public release #22 (http://www.hapmap.org/index.html). All
polymorphism variants and coordinates are for the forward (+) strand. Intervals
corresponding to interspersed repeats ancestral to human and macaque, defined as
LTR, SINE, LINE and DNA elements excluding AluY, L1PA1-L1PA7, and L1HS, were
obtained from the UCSC Genome Table Browser using the Galaxy metaserver (52)
(http://g2.trac.bx.psu.edu/). These ARs were filtered to remove MER121 family
elements, shown to be evolving non-neutrally (24), and so-called ‘exapted repeats’,
conserved non-exonic elements deposited by mobile elements and shown to have
acquired functional regulatory roles (25). All coding exon start and stop intervals were
obtained from the UCSC Genome Table Browser UCSC Genes track (53).
DNaseI hypersensitive site coordinates were downloaded
(http://research.nhgri.nih.gov/DNaseHS/chimp2006) and start and stop coordinates
were converted from the July 2003 human genome assembly (hg16) to the March 2006
assembly (hg18) using the UCSC genome coordinate conversion tool, liftOver. Custom
Python scripts were used to extract sites overlapping significant MKAR windows
(passing 10% FDR threshold).
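The overlap step described above can be sketched as follows. This is a minimal illustration, not the custom scripts used in the study: the function names, the 0-based half-open coordinate convention, and the dictionary of per-window q-values are all assumptions made for the example.

```python
WINDOW_SIZE = 10_000  # 10 kb non-overlapping windows, as used for the MKAR test


def windows_overlapping(start, end, window_size=WINDOW_SIZE):
    """Return indices of the fixed-size windows touched by the interval [start, end).

    Coordinates are assumed to be 0-based and half-open (an assumption of this sketch).
    """
    if end <= start:
        return []
    first = start // window_size
    last = (end - 1) // window_size
    return list(range(first, last + 1))


def sites_in_significant_windows(sites, q_values, fdr=0.10):
    """Keep sites that overlap at least one window with q-value <= fdr.

    sites    : list of (chrom, start, end) tuples, e.g. lifted-over DNaseI sites
    q_values : hypothetical dict mapping (chrom, window_index) -> q-value;
               windows that could not be tested are simply absent
    """
    kept = []
    for chrom, start, end in sites:
        hits = windows_overlapping(start, end)
        if any(q_values.get((chrom, w), 1.0) <= fdr for w in hits):
            kept.append((chrom, start, end))
    return kept
```

A site that straddles a window boundary is tested against both windows, so it is kept if either one passes the threshold.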
MKAR Test
CpG sites were masked out of the multiple alignments according to the ‘inclusive’
definition, where all sites are divided into CpG and non-CpG sites (54) in a pairwise
comparison (human-chimp and human-macaque). CpG sites are those that are CG in
at least one species. Non-CpG sites are those not CG in either species.
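The ‘inclusive’ CpG definition can be illustrated with a short sketch. It assumes that the two aligned sequences are given as equal-length strings and that a CG dinucleotide is recognized in adjacent alignment columns; the function name is hypothetical, and the treatment of gaps is a simplification rather than a description of the actual pipeline.

```python
def cpg_mask(seq_a, seq_b):
    """Flag aligned columns that are CpG sites under the 'inclusive' definition.

    A column is flagged if it is part of a CG dinucleotide in at least one of
    the two aligned sequences. Returns a list of booleans, one per column.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    mask = [False] * len(seq_a)
    for seq in (seq_a.upper(), seq_b.upper()):
        for i in range(len(seq) - 1):
            # Both the C and the G of a CG dinucleotide are masked.
            if seq[i] == "C" and seq[i + 1] == "G":
                mask[i] = True
                mask[i + 1] = True
    return mask
```

Columns where the mask is False are the non-CpG sites that remain available for SNP and divergence counts.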
Non-overlapping windows of size 10kb were tiled across the genome. A 2X2
contingency table was generated for each window by apportioning SNP counts and
divergence counts within a window into AR and TEST categories. A SNP count is
registered when a SNP is present in an aligned position. A divergence count is
registered either when aligned bases differ and no SNP is present at an aligned
position, or when aligned bases differ and a SNP is present, but all the known SNP
variants differ from the aligned base. Thus, at any given aligned base, the MKAR test
registers one of four possibilities: 1) a SNP count, or 2) a divergence count, or 3) both a
SNP and a divergence count, or 4) neither a SNP nor a divergence count (FIG 8). An
AR site is registered when either a SNP or a diverged base falls in an ancestral repeat
within a window. A TEST site is registered when either a SNP or a diverged base falls
outside an ancestral repeat within a window. Gaps or regions that do not align in either species
pair (human-chimp or human-macaque) are not considered.
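The counting rules above can be sketched per aligned site as follows. The sketch reads "the aligned base" as the base of the comparison species, which is an interpretation rather than something stated explicitly, and the function names and input layout are hypothetical.

```python
from collections import Counter


def classify_site(human_base, outgroup_base, snp_alleles=None):
    """Return (snp_count, divergence_count) as 0/1 flags for one aligned site.

    snp_alleles is the set of known alleles if a SNP maps to this position,
    otherwise None or empty.
    """
    has_snp = bool(snp_alleles)
    differs = human_base != outgroup_base
    if has_snp:
        # Divergence is counted only if no known variant matches the
        # comparison-species base (interpretation assumed in this sketch).
        diverged = differs and outgroup_base not in snp_alleles
    else:
        diverged = differs
    return int(has_snp), int(diverged)


def window_table(sites):
    """Accumulate the 2X2 table for one window.

    sites: iterable of (human_base, outgroup_base, snp_alleles, in_ar) tuples.
    Returns a Counter keyed by (class, kind), with class in {"AR", "TEST"}
    and kind in {"snp", "div"}.
    """
    table = Counter()
    for human, outgroup, alleles, in_ar in sites:
        if "-" in (human, outgroup):  # gapped or unaligned positions are skipped
            continue
        snp, div = classify_site(human, outgroup, alleles)
        site_class = "AR" if in_ar else "TEST"
        table[(site_class, "snp")] += snp
        table[(site_class, "div")] += div
    return table
```

The four return values of `classify_site`, (1, 0), (0, 1), (1, 1) and (0, 0), correspond to the four possibilities enumerated in the text.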
The minimum expected value for cells in each 2X2 contingency table is calculated
based on the apportionment of SNP and divergence counts for each window and, for
those windows passing a minimum expected value threshold of 5.0 for each cell, a
Fisher’s Exact Test is used to assess the significance of the table’s deviation from neutrality. The
false discovery rate (FDR) multiple-testing correction procedure is applied to the ranked
distribution of Fisher’s Exact Test p-values as implemented in R, using the qvalue
library and the bootstrap method of estimating the proportion of true null hypotheses
(28).
All MKAR results can be viewed and analyzed as a custom track on the UCSC Genome
Browser (49) at http://hgwdev-giardine.cse.ucsc.edu/cgi-bin/hgTracks (track name =
“McDonald-Kreitman AR”), and in Supplementary Materials.
{I took this from the primary text of the previous version - please merge it in with the rest
of the methods, removing any redundancy.}
The estimated rate of neutral substitution in ARs co-varies with that of four-fold
degenerate sites (21), and ARs are frequently used (1, 2, 22), and indeed have been
quantified (23), as a good model of neutral molecular evolution. Still, there is some
evidence that certain repetitive elements have evolved non-neutrally (24, 25) and we
removed these elements from our neutral class of sites. We mask out the coding
portions of exons so each window’s test regions are composed only of presumptive
non-coding DNA. Additionally, we mask out hypermutable CpG dinucleotide sites,
which may lead to underestimates of true divergence in the MKAR test, particularly
when divergence is reckoned with rhesus (26, 27). A 2X2 contingency table is
constructed for each window binning polymorphism and divergence counts in AR and in
non-coding test sites and, for windows passing a minimum expected value threshold of
5 counts for each cell of each window’s contingency table, deviation from neutrality is
evaluated for significance by a Fisher’s Exact Test. To correct for multiple testing, a q-value is computed from the ranked distribution of Fisher’s Exact p-values and used to
estimate the false discovery rate (FDR) (28).
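The step from ranked p-values to q-values can be sketched as below. This is a simplified stand-in for the R qvalue library: the proportion of true null hypotheses, pi0, is passed in as a parameter rather than estimated by the bootstrap method, and with the default pi0 = 1.0 the result reduces to Benjamini-Hochberg adjusted p-values. The function name is hypothetical.

```python
def q_values(p_values, pi0=1.0):
    """Compute q-values from a list of p-values.

    pi0 is the assumed proportion of true null hypotheses. It is NOT estimated
    here; supplying pi0 < 1 mimics a Storey-type estimate, and pi0 = 1.0 gives
    the more conservative Benjamini-Hochberg adjustment.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        candidate = pi0 * p_values[i] * m / rank
        running_min = min(running_min, candidate)
        q[i] = running_min
    return q
```

Windows whose q-value is at most 0.10 would then be the ones reported at a 10% FDR.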
An assumption of the McDonald-Kreitman test is that the classes of sites compared are
intercalated and hence have the same genealogical history. While the classes of sites
examined here (noncoding DNA and ARs) are certainly not intercalated to the extent
that synonymous and nonsynonymous amino acid sites are in a coding sequence, the
ARs examined here are genomically ubiquitous as well as quite evenly distributed with
an average distance of 443.7 nucleotides between AR sequences in each window
(Supplementary Table 7). This rather homogeneous intercalation suggests the risk of
type-1 error due to partial linkage is minimal.
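One way to obtain a per-window spacing statistic of this kind is sketched below. How the reported average was actually computed is not specified here, so the half-open interval convention, the merging of overlapping repeats, and the function name are assumptions.

```python
def mean_gap(intervals):
    """Mean distance between consecutive AR intervals within one window.

    intervals: list of (start, end) tuples, assumed 0-based and half-open.
    Overlapping or abutting intervals are merged first. Returns None when
    fewer than two separate intervals remain.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    if len(merged) < 2:
        return None
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(merged, merged[1:])]
    return sum(gaps) / len(gaps)
```

Averaging this value over all tested windows would give a genome-wide summary comparable in spirit to the figure quoted above.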
When interpreting any test for selection using polymorphism and divergence data, one
must be aware of the limitations of the input data. Some of the SNPs in dbSNP126
come from datasets with known ascertainment biases (32); however, these biases affect
both our test and AR classes of sites, so their effects on the MKAR results are negligible.
A more serious concern is the technical difficulty of genotyping SNPs in ARs, particularly
in the younger repetitive sequences (33), leading to an under-representation of SNPs in
our neutral class of sites. We examined SNP density using both dbSNP126 and
HapMap in the ARs tested here binned by percent divergence from a consensus
sequence. An individual AR’s percent divergence from an overall consensus sequence,
which is assumed to approximate the ancestral repetitive sequence, is a proxy for AR
age. As expected, we found that SNP density is indeed higher in older repeats, regardless
of which human polymorphism dataset is examined. However, the distribution of SNP
density is both greater and more even across all our ARs using dbSNP126 (FIG 2), and
so in this main text we report results generated using human polymorphism data from
dbSNP126. The MKAR test was run on both SNP datasets, however, and there are
highly significant positive correlations between results generated with dbSNP126 and with
HapMap (Supplementary Tables 1 and 2; Supplementary Figs 1 and 2). It will be
important to continue to evaluate signals for selection as more polymorphism data
become available, especially less biased data. The divergence data also have
limitations, largely in the quality of the assemblies of the comparison species, but also in
the contribution of non-orthologous regions to the alignments. As genome assemblies
and alignments improve, it will be important to re-run these tests. Nevertheless, the
MKAR test is a useful and timely addition to comparative whole-genome resources
aimed at finding functional non-coding DNA sequences.
{The comparison we had in the first submission seemed to me to have value - shouldn’t
we continue to include it? The text follows.}
When interpreting any test for selection using polymorphism and divergence
data, one must be aware of the limitations in the input data. Some of the data in
dbSNP126 come from datasets with known biases toward more frequent alleles (22,
23). This bias will apply equally to both AR and nonAR sites, so the possible effect on
the MKAR results may not be large. One approach to evaluating the magnitude of the
effect of this ascertainment bias is to compare MKAR results obtained using dbSNP126
data with results obtained using a polymorphism dataset that should be less biased
toward frequent alleles. This dataset combined resequenced HapMap data (22) with
HapMap Phase II data in the 10 resequenced ENCODE (24) regions. For the 417 10kb
windows that could be tested in these resequenced ENCODE regions, the p-value for
deviation of windows from neutrality correlated between the MKAR tests with an r of
0.535 (using divergence from chimp). A similar comparison using divergence from
rhesus gave an r of 0.538. The NI values correlated as well, with r = 0.531 and 0.539 for
divergence from chimpanzee and rhesus, respectively (p-value < 2.2 × 10^-16 in both
cases). Thus even with known biases in the current polymorphism data, the test is fairly
robust. However, it will be important to continue to evaluate signals for recent selection
as more polymorphism data become available, especially less biased data. The
divergence data also have limitations, largely in the quality of the assemblies of the
comparison species, but also in the contribution of non-orthologous regions to the
alignments (for example, see Fig. 3). As genome assemblies and alignments improve, it
will be important to re-run these tests.
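The dataset comparison described above amounts to a Pearson correlation between per-window values (p-values or NI values) from two runs of the test. A minimal sketch follows; the function name and the list-based input are assumptions, and the values used in the usage note are made up for illustration, not taken from the analysis.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)
```

For example, `pearson_r([1, 2, 3, 4], [1, 3, 2, 4])` evaluates to 0.8. In the comparison above, each list would hold one value per window that could be tested with both polymorphism datasets.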
Literature Cited
1. R. H. Waterston et al., Nature 420, 520 (2002).
2. F. Chiaromonte et al., Cold Spring Harb Symp Quant Biol 68, 245 (2003).
3. K. Lindblad-Toh et al., Nature 438, 803 (2005).
4. A. Siepel et al., Genome Res 15, 1034 (2005).
5. A. G. Clark et al., Science 302, 1960 (2003).
6. C. D. Bustamante et al., Nature 437, 1153 (2005).
7. The Chimpanzee Sequencing and Analysis Consortium, Nature 437, 69 (2005).
8. R. A. Gibbs et al., Science 316, 222 (2007).
9. R. Nielsen et al., PLoS Biol 3, e170 (2005).
10. J. H. McDonald, M. Kreitman, Nature 351, 652 (1991).
11. A. L. Hughes, M. Nei, Nature 335, 167 (1988).
12. Z. Yang, J. P. Bielawski, Trends in Ecology and Evolution 15, 496 (2000).
13. S. A. Tishkoff et al., Nat Genet 39, 31 (2007).
14. M. T. Hamblin, A. Di Rienzo, Am J Hum Genet 66, 1669 (2000).
15. K. S. Pollard et al., PLoS Genet 2 (2006).
16. K. S. Pollard et al., Nature 443, 167 (2006).
17. S. Prabhakar, J. P. Noonan, S. Paabo, E. M. Rubin, Science 314, 786 (2006).
18. H. Akashi, Genetics 139, 1067 (1995).
19. A. Eyre-Walker, Genetics 162, 2017 (2002).
20. M. W. Nachman, in Evolutionary Genetics: Concepts and Case Studies C. W. Fox, J. B. Wolf, Eds. (Oxford University Press, Oxford, 2006) pp. 103-118.
21. R. C. Hardison et al., Genome Res 13, 13 (2003).
22. E. M. Kvikstad, S. Tyekucheva, F. Chiaromonte, K. D. Makova, PLoS Comput Biol 3, 1772 (2007).
23. G. Lunter, C. P. Ponting, J. Hein, PLoS Comput Biol 2, e5 (2006).
24. M. Kamal, X. Xie, E. S. Lander, Proc Natl Acad Sci U S A 103, 2740 (2006).
25. C. B. Lowe, G. Bejerano, D. Haussler, Proc Natl Acad Sci U S A 104, 8005 (2007).
26. J. Meunier, L. Duret, Mol Biol Evol 21, 984 (2004).
27. R. D. Hernandez, S. H. Williamson, C. D. Bustamante, Mol Biol Evol 24, 1792 (2007).
28. J. D. Storey, Journal of the Royal Statistical Society Series B-Statistical Methodology 64, 479 (2002).
29. M. Z. Ludwig, M. Kreitman, Mol Biol Evol 12, 1002 (1995).
30. D. L. Jenkins, C. A. Ortori, J. F. Brookfield, Proc Biol Sci 261, 203 (1995).
31. P. Andolfatto, Nature 437, 1149 (2005).
32. A. G. Clark, M. J. Hubisz, C. D. Bustamante, S. H. Williamson, R. Nielsen, Genome Res 15, 1496 (2005).
33. The International HapMap Consortium, Nature 437, 1299 (2005).
34. D. M. Rand, L. M. Kann, Mol Biol Evol 13, 735 (1996).
35. R. Nielsen, I. Hellmann, M. Hubisz, C. Bustamante, A. G. Clark, Nat Rev Genet 8, 857 (2007).
36. P. C. Sabeti et al., Science 312, 1614 (2006).
37. P. C. Sabeti et al., Nature 449, 913 (2007).
38. E. C. Walsh et al., Hum Genet 119, 92 (2006).
39. P. D. Evans, J. R. Anderson, E. J. Vallender, S. S. Choi, B. T. Lahn, Hum Mol Genet 13, 1139 (2004).
40. Y. Q. Wang, B. Su, Hum Mol Genet 13, 1131 (2004).
41. M. Ashburner et al., Nat Genet 25, 25 (2000).
42. T. Beissbarth, T. P. Speed, Bioinformatics 20, 1464 (2004).
43. O. S. Akbari et al., Development 135, 123 (2008).
44. L. A. Lettice et al., Proc Natl Acad Sci U S A 99, 7548 (2002).
45. M. A. Nobrega, I. Ovcharenko, V. Afzal, E. M. Rubin, Science 302, 413 (2003).
46. J. Taylor et al., Genome Res (2006).
47. G. E. Crawford et al., Nature Methods 3, 503 (2006).
48. The ENCODE Project Consortium, Science 306, 636 (2004).
49. W. J. Kent et al., Genome Res 12, 996 (2002).
50. M. Blanchette et al., Genome Res 14, 708 (2004).
51. S. T. Sherry et al., Nucleic Acids Res 29, 308 (2001).
52. B. Giardine et al., Genome Res 15, 1451 (2005).
53. F. Hsu et al., Bioinformatics 22, 1036 (2006).
54. J. Taylor, S. Tyekucheva, M. Zody, F. Chiaromonte, K. D. Makova, Mol Biol Evol 23, 565 (2006).