Genetics: Early Online, published on March 26, 2016 as 10.1534/genetics.115.181594
Uncovering adaptation from sequence data: lessons from genome resequencing of four cattle breeds

Simon Boitard1,2, Mekki Boussaha1, Aurélien Capitan1,3, Dominique Rocha1, and Bertrand Servin4
1 GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, 78350, France
2 Institut de Systématique, Évolution, Biodiversité ISYEB - UMR 7205 - CNRS & MNHN & UPMC & EPHE, Ecole Pratique des Hautes Etudes, Sorbonne Universités, Paris, 75005, France
3 Union Nationale des Coopératives d’Elevage et d’Insémination Animale, Paris, 75595, France
4 GenPhySE, Université de Toulouse, INRA, INPT, INP-ENVT, Castanet-Tolosan, 31326, France
Copyright 2016.
Running title: Detecting adaptation in cattle using resequencing data
Keywords: FST, domestication, linkage disequilibrium, next generation sequencing, selective sweeps.
Corresponding author: Simon Boitard
Mailing address: INRA GABI, Domaine de Vilvert, 78352 Jouy-en-Josas, France.
Phone: 0033 1 40 79 81 66
E-mail: [email protected]
Abstract
Detecting the molecular basis of adaptation is one of the major questions in population genetics. With the advance in sequencing technologies, nearly complete interrogation of genome-wide polymorphisms in multiple populations is becoming feasible in some species, with the expectation that it will extend quickly to new ones. Here, we investigate the advantages of sequencing for the detection of adaptive loci in multiple populations, exploiting a recently published dataset in cattle (Bos taurus). We used two different approaches to detect statistically significant signals of positive selection: a within-population approach aimed at identifying hard selective sweeps and a population-differentiation approach that can capture other selection events such as soft or incomplete sweeps. We show that the two methods are complementary in that they indeed capture different kinds of selection signatures. Our study confirmed some of the well-known adaptive loci in cattle (e.g. MC1R, KIT, GHR, PLAG1, NCAPG/LCORL) and detected some new ones (e.g. ARL15, PRLR, CYP19A1, PPM1L). Compared to genome scans based on medium or high-density SNP data, we found that sequencing offered an increased detection power and a higher resolution in the localization of selection signatures. In several cases, we could even pinpoint the underlying causal adaptive mutation, or at least a very small number of possible candidates (e.g. MC1R, PLAG1). Our results on these candidates suggest that a vast majority of adaptive mutations are likely to be regulatory rather than protein coding variants.
Introduction
Detecting the molecular basis of adaptation in natural species is one of the major questions in population genetics. With the spectacular progress of genotyping and sequencing technologies, genome-wide scans for positive selection have been performed in multiple species and populations within the last decade. Livestock species provide a considerable resource for these selection scans, because they have been subjected to strong artificial selection since their initial domestication, leading to a large variety of breeds with distinct morphology, coat color or specialized production. Besides, the economic value of these species and the need to improve them have motivated the development of standardized single nucleotide polymorphism (SNP) chips and the genotyping of millions of animals using these chips, providing considerable data for population genetics analyses. For instance, in taurine cattle, at least 21 genomic scans for selection have already been published and were reviewed in Gutierrez-Gil et al. (2015). Numerous genomic scans for selection have also been published in other livestock species; see Gouveia et al. (2014) for a review.

The regions detected by these studies are generally convincing, because they contain interesting positional and functional candidate genes (e.g. Fariello et al. (2014)) and/or are statistically enriched with genes from regulation pathways related to production traits (e.g. Flori et al. (2009)). Nevertheless, these regions often span several megabases and typically include up to tens of genes, so determining the exact gene(s) under selection in each region, and even more the causal mutation(s), remains difficult from these studies. This might change with the recent advent of next generation sequencing technologies, which allow the characterization of a very large proportion of the variants in the genome of a species. However, although several examples of genomic scans for selection based on large sequencing samples are already available in livestock species such as pig (Rubin et al., 2012; Li et al., 2013), cattle (Qanbari et al., 2014) or chicken (Rubin et al., 2010; Roux et al., 2015), the advantage of using sequencing data for detecting selection signatures has still not been widely discussed.

Another interesting question is the small overlap observed between studies, even when these studies focus on similar populations (Biswas and Akey, 2006; Qanbari et al., 2011). This can largely be explained by the fact that many different detection methods have been applied, whose sensitivity depends on the type of selection event: recent or old, complete or ongoing, from a new variant (i.e. “hard”) or standing variation (i.e. “soft”), etc. These differences between methods have been described by several studies (Biswas and Akey, 2006; Sabeti et al., 2006; Gouveia et al., 2014). However, the practical implications of these differences when comparing the regions detected by different studies, or by different approaches within the same study, are rarely discussed.

Here we detected selection signatures in 4 European taurine cattle breeds (Angus, Fleckvieh, Holstein and Jersey), using large samples of sequencing data which have been recently published by the 1000 bull genomes project (Daetwyler et al., 2014). We used two different statistical approaches: a within-breed approach detecting genomic regions with low genetic diversity (Boitard et al., 2009), and a between-population approach detecting genomic regions with large allele (Bonhomme et al., 2010) or haplotype (Fariello et al., 2013) frequency differences between breeds. These two analyses provided regions whose genomic features are significantly different from what would be expected under neutrality, even when accounting for the effects of past population size changes, population structure and gene flow on the four breeds of our study. We show that applying the above methods to sequencing data improves the detection power of selection signatures and considerably reduces the length of detected regions. In some particular situations, it even allows the identification of the exact mutation under selection. We also provide a detailed characterization of the regions that are detected only by the within-population approach (in one or several populations), only by the between-population approach, or jointly by the two approaches.
Results
Genomic regions with low within-population diversity: hard sweep signatures
We looked for hard sweep signatures within each breed using the method of Boitard et al. (2009), as described in the Methods. This method detects regions showing an excess of low and high frequency derived alleles compared to the rest of the genome. Although this pattern is typically expected under a hard sweep scenario, i.e. when a new mutation appears in the population and goes to fixation due to positive selection, it may also arise from purely demographic events, in particular bottlenecks. In order to test whether our analysis could be influenced by such false positive signals, we first simulated genomic samples under two different neutral demographic models that reproduce the genetic diversity of the breeds under study, and applied the method of Boitard et al. (2009) to these samples.
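The frequency pattern described above can be illustrated with a toy computation. The sketch below is not the HMM of Boitard et al. (2009); it only classifies derived allele frequencies in a window as low, intermediate or high and reports the share of extreme classes. The function names and the 0.1 and 0.9 cutoffs are assumptions chosen for illustration.

```python
def frequency_classes(derived_freqs, low=0.1, high=0.9):
    """Count sites with low, intermediate and high derived allele frequency.

    The cutoffs `low` and `high` are hypothetical, not taken from the paper.
    """
    counts = {"low": 0, "intermediate": 0, "high": 0}
    for freq in derived_freqs:
        if freq <= low:
            counts["low"] += 1
        elif freq >= high:
            counts["high"] += 1
        else:
            counts["intermediate"] += 1
    return counts


def extreme_fraction(derived_freqs, low=0.1, high=0.9):
    """Fraction of sites whose derived allele is either rare or nearly fixed."""
    counts = frequency_classes(derived_freqs, low, high)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["low"] + counts["high"]) / total
```

A window whose extreme fraction is much larger than the genome-wide value would show the kind of skew expected after a hard sweep, although, as noted in the text, a bottleneck can produce a similar skew.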
Analysis of neutral samples. In the first demographic model, we considered the joint history of the four breeds and accounted for several important features of this history: (i) the shared ancestry of the four breeds, which diverged recently from an ancestral pool of European domestic animals, (ii) the population size differences between breeds since their divergence, and (iii) the possible existence of gene flow between breeds. Moreover, because recent studies suggested that effective population size in taurine cattle strongly declined since domestication (MacLeod et al., 2013; Boitard et al., 2016), we allowed one population size change in the ancestral population. We estimated the parameters of this model from the joint allele frequency spectra observed in our data for all breed pairs, using the approach of Excoffier et al. (2013) (see Methods for more details), and obtained the demography shown in Figure 1.

In this estimated demography, the ancestral population size change was not a decline related to domestication, but an older expansion occurring in the wild population, approximately 120,000 years before present. However, a very strong population decline was found at the time when the four breeds diverged, from the order of 100,000 individuals to the order of 100 individuals. Interestingly, the estimated divergence time (500 years before present) was consistent with a geographic isolation process starting a few hundred years before the strict separation of these populations, induced by the creation of modern breeds (Felius et al., 2011). Besides, the order of magnitude of estimated recent effective sizes (100) and the ranking of breeds according to these sizes were consistent with previous studies (Leroy et al., 2013; The Bovine HapMap Consortium, 2009; MacLeod et al., 2013; Boitard et al., 2016). Thus, this simple model seemed to provide a reasonable approximation of the demography of the four breeds under study. When genomic samples were simulated from this model, the average number of sweeps detected per genome was equal to 0 in Fleckvieh and Holstein, 3.75 (± 1.93) in Angus and 17.25 (± 4.15) in Jersey.

We performed the same test using a second demographic model, which was estimated in another study (Boitard et al., 2016) based on the same dataset as that considered here. This model treats each breed independently of the others, so it does not account for shared ancestry or gene flow. However, it accounts for the variations of population size over time more accurately than the previous model, because population size in each breed is modelled as a stepwise process with 21 time windows (Figure 6 in Boitard et al. (2016)). The population size within each time window is estimated from the allele frequency and linkage disequilibrium patterns observed in the breed, using an approximate Bayesian computation approach. When genomic samples were simulated from this second model, the average number of sweeps detected per genome was even lower than with the first model: 0 in Fleckvieh, Holstein and Angus, and 0.75 (± 0.87) in Jersey.
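A stepwise population size history of the kind used in the second model can be represented as a piecewise-constant function of time. The sketch below shows one possible representation; the function name, the argument layout and the three-window example values are hypothetical and are not the estimates of Boitard et al. (2016), whose model uses 21 windows per breed.

```python
from bisect import bisect_right


def population_size(generations_ago, window_starts, sizes):
    """Return the population size in the time window containing `generations_ago`.

    `window_starts` lists the start of each window (sorted, first value 0) and
    `sizes` gives one size per window; the last window extends indefinitely.
    """
    if len(window_starts) != len(sizes):
        raise ValueError("one size is needed per time window")
    if generations_ago < 0:
        raise ValueError("time must be non-negative")
    index = bisect_right(window_starts, generations_ago) - 1
    return sizes[index]


# Hypothetical three-window history (illustrative values only).
example_starts = [0, 10, 100]
example_sizes = [100, 1000, 10000]
```

With this layout, `population_size(5, example_starts, example_sizes)` falls in the most recent window and a time of 10 falls in the second one, because each window includes its start.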
Overall, these results indicate that the number of false hard sweep signals detected by the method of Boitard et al. (2009) should be negligible in Angus, Fleckvieh and Holstein, and relatively small in Jersey, even when accounting for the demography of these breeds.

Overview of the detected signals. When analyzing the cattle data with the same approach, we detected 1,057 hard sweep signals: 226, 384, 316 and 131 in Holstein, Angus, Jersey and Fleckvieh respectively. According to the simulation results presented above, the false discovery rate associated with this analysis should be below 7% in Jersey (21.4/316), and close to 0 in the other breeds. The size of detected regions ranged from 8.2 kb to 948 kb, with a median of 78.7 kb. Some signals overlapped between breeds, so that after merging them we obtained 798 sweeps that were unique to one of the breeds (159, 297, 249 and 93 respectively) and 118 that were shared between at least two breeds. Overall this provided 916 regions covering about 4.3% of the autosomal genome. Among these 916 regions, 450 included no (protein coding) gene, 268 included a single gene, 154 included between 2 and 5 genes, and 44 included more than 5 genes, with a maximum of 19 genes. Overall, 1,088 genes were included in sweep windows, which represents about 5.7% of all annotated genes in the bovine genome, so there was a slight enrichment of protein coding genes within sweep regions. The list of all detected regions and of genes included in these regions is given in File S1.
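The counts in this overview can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text; reading 21.4 as the simulated Jersey mean plus its reported ± term (17.25 + 4.15) is an assumption, as is the use of a simple ratio of percentages to express the gene enrichment.

```python
# Hard sweep signals per breed, before merging overlapping signals.
signals_per_breed = {"Holstein": 226, "Angus": 384, "Jersey": 316, "Fleckvieh": 131}
total_signals = sum(signals_per_breed.values())

# Regions after merging: breed-specific ones plus those shared by >= 2 breeds.
unique_per_breed = {"Holstein": 159, "Angus": 297, "Jersey": 249, "Fleckvieh": 93}
unique_regions = sum(unique_per_breed.values())
shared_regions = 118
merged_regions = unique_regions + shared_regions

# Regions broken down by number of protein coding genes they contain.
regions_by_gene_count = {"0": 450, "1": 268, "2-5": 154, ">5": 44}

# Assumption: 21.4 is the simulated mean plus the reported +/- term in Jersey.
jersey_false_upper = 17.25 + 4.15
jersey_fdr_bound = jersey_false_upper / signals_per_breed["Jersey"]

# Share of genes in sweeps relative to the share of the genome covered.
gene_enrichment = 0.057 / 0.043

print(total_signals, merged_regions, round(jersey_fdr_bound, 4), round(gene_enrichment, 2))
```

The totals agree with the text (1,057 signals and 916 merged regions), the Jersey bound comes out just under 7%, and the gene share is roughly 1.3 times the genome share, in line with the slight enrichment mentioned above.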
In a recent genome scan for selection focusing on the Fleckvieh breed (Qanbari et al., 2014), the 43 Fleckvieh sequences considered in our study were analyzed using the CLR method (Nielsen et al., 2005) and the iHS method (Voight et al., 2006). Seventy-three hard sweep signals were found with the former approach and 67 with the latter. Since the HMM approach used in this study aims at capturing the same allele frequency patterns as the CLR method, we checked whether our results in Fleckvieh were consistent with those in Qanbari et al. (2014). To this end, we compared the distribution of CLR p-values within regions associated with selective sweeps to their distribution in the rest of the genome. Figure 2 (left) plots the ratio of the two densities (on a log2 scale) for increasing levels of significance of the CLR test. It shows a very strong enrichment of low CLR p-values in the sweep windows we detected in Fleckvieh, compared to the rest of the genome. We performed the same analysis with iHS p-values and found a similarly strong enrichment (Figure S1), which can be explained by the fact that iHS also tries to detect hard sweep patterns, even if the information used (the length of haplotypes) is different.
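The enrichment comparison described above can be sketched as follows. This is a simplified stand-in for the density ratio of Figure 2: instead of estimating densities, it compares the fraction of p-values at or below each threshold inside and outside sweep regions, on a log2 scale. The function name and the input layout are assumptions.

```python
import math


def log2_enrichment(p_inside, p_outside, thresholds):
    """Log2 ratio of the fraction of small p-values inside vs outside sweep regions.

    Returns a dict mapping each threshold to the log2 ratio, or to None when
    one of the two fractions is zero and the ratio is undefined.
    """
    if not p_inside or not p_outside:
        raise ValueError("both p-value lists must be non-empty")
    result = {}
    for threshold in thresholds:
        frac_in = sum(p <= threshold for p in p_inside) / len(p_inside)
        frac_out = sum(p <= threshold for p in p_outside) / len(p_outside)
        if frac_in > 0 and frac_out > 0:
            result[threshold] = math.log2(frac_in / frac_out)
        else:
            result[threshold] = None
    return result
```

Positive values indicate that small p-values are over-represented inside the detected sweep windows, and negative values indicate a depletion.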
We also observed an enrichment, albeit of lower intensity, of low CLR and iHS p-values in the sweep windows detected in breeds other than Fleckvieh. For example, Figure 2 (right) shows the enrichment in low Fleckvieh CLR p-values in selective sweep regions detected only in the Holstein breed. A similar trend was observed in selective sweep regions specific to the Jersey and Angus breeds (not shown). Hence some of the hard sweeps detected in one breed have probably also taken place in the other breeds, but to a slightly lower extent that did not lead to a significant signal. These signatures must be related to favorable alleles that either started to increase in frequency before the divergence of the breeds or were selected in parallel in different breeds. However, selection signatures that are specific to one breed are interesting because they illustrate the importance of this breed for cattle functional diversity (Gutierrez-Gil et al., 2015). We therefore derived a way of finding clear breed-specific sweeps as follows.

Hard sweep regions specific to one population. The HMM approach “tags” a region as selected by reconstructing a hidden state at each position of the genome. Based on the observation above, we suspected that some regions were not tagged as selected, but might still have a non-negligible probability of being adaptive under the HMM model, explaining the enrichment patterns observed in Figure 2. To investigate this possibility, we derived a statistic (TOR) quantifying the strength of evidence for selection in a breed, measured as the log odds ratio of selection over neutrality in a region (see Methods for details). As expected, the TOR in one breed showed clearly different distributions in regions tagged as selected in this breed and in regions where no selection was detected in any breed (Figure S2). In regions where selection was detected in another breed, the distribution of TOR was skewed toward higher values compared to clearly neutral regions (Figure S2). We exploited this to call breed-specific sweeps, i.e. regions where TOR was unambiguously consistent with the neutral density in all other breeds (see Methods for details). Fifty-five breed-specific regions were detected, and we could check that this time the sweeps specific to Holstein did not show any enrichment in low Fleckvieh CLR p-values (Figure S3). The 12 sweeps exhibiting the most contrasted patterns for TOR are listed in Table S1.

Hard sweep regions shared by all populations. We also looked for sweep signals shared by all breeds, as they might correspond to older selection events, predating the divergence of the 4 breeds considered here and possibly related to initial cattle domestication. We found only one region with a sweep detected in all 4 breeds, but we also considered regions where a sweep was detected in 3 breeds and where allele frequencies in the fourth one were almost consistent with a sweep. This provided 8 candidate regions, 4 of which include a single gene (Table 1).

Several of these genes represent natural targets of selection in cattle, as they are related to husbandry, metabolism or fertility. On chromosome 1, we found evidence for selection in a region 10 kb upstream of the OLIG1 gene, encompassing a lincRNA, orthologous to the human gene LINC00945, whose expression has been shown to be associated with polledness in Holstein and Fleckvieh (Allais-Bonnet et al., 2013). PPM1L is a protein phosphatase that has been shown to be involved in the response to exercise in humans (Tonevitsky et al., 2013). SLC25A33, located in the middle of one of the shared sweep windows, encodes a mitochondrial pyrimidine nucleotide transporter and is essential for mitochondrial DNA and RNA metabolism in humans (Di Noia et al., 2014). CYP19A1 encodes the key enzyme for estrogen biosynthesis. Many studies have documented its role during the development of bovine follicles, and it has been found to be more abundant in bovine cells of twinners versus controls (Echternkamp et al., 2012). To illustrate the allele frequency patterns observed in such regions, allele frequencies in the polled locus region are provided in Figure 3.
Genomic regions with large genetic differentiation between breeds
We applied two approaches to detect genome regions that exhibit outlying divergence in single site (Bonhomme et al., 2010) or haplotype (Fariello et al., 2013) frequencies between populations. The first step in these two approaches is to estimate a population tree summarizing the neutral history of the populations under study. For our dataset, this population tree (Figure 4) was approximately star-shaped (i.e. breeds essentially evolved in parallel from an ancestral population), although we estimated a small shared history between the Fleckvieh and Jersey populations. Fixation indices, represented by the branch length from the root to the tips of the tree, had similar values in all populations, that of Jersey being slightly higher. The genome-wide level of differentiation between populations was rather large, with between-population FST values ranging from 0.22 to 0.4.

When genetic drift is large, single marker tests are expected to have low power because even large allele frequency differences can be explained by drift alone. This was indeed the case here for the single marker differentiation analysis (FLK): the smallest observed p-value was 6 × 10^-7, which corresponded to an FDR of about 10% when applying the approach of Storey and Tibshirani (2003). Although it is not clear how to correct p-values for correlation between markers in such a setting (genome-wide differentiation based tests), even this smallest p-value did not provide clear evidence of selection. This does not mean that there is no selection in these data, only that genuine selection signatures cannot be discriminated from background noise caused by drift when looking at sites independently. However, as illustrated later in this study, given a region where a selection signature has been found, FLK can help in identifying the mutations that have likely been under selection.
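The link between a p-value and an FDR level can be made concrete with a q-value computation in the spirit of Storey and Tibshirani (2003). The sketch below is a simplification, not the authors' pipeline: it estimates the proportion of true nulls from a single tuning value `lam` (an assumption; the published procedure smooths this estimate over a grid of values), and it ignores the correlation between markers mentioned above.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Simplified q-values: pi0 estimated at one lambda, then a step-up pass.

    q(p_i) is the minimum over thresholds t >= p_i of pi0 * m * t / #{p <= t}.
    """
    m = len(pvalues)
    if m == 0:
        return []
    # Estimated proportion of true null hypotheses, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is a running minimum.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        candidate = pi0 * m * pvalues[index] / rank
        running_min = min(running_min, candidate)
        qvalues[index] = running_min
    return qvalues
```

With such q-values, calling significant all tests whose q-value is at most 0.10 targets an FDR of about 10%; when even the smallest p-value only reaches that level, as reported for FLK here, no site stands out clearly.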
The hapFLK method (Fariello et al., 2013) is similar to FLK, but it incorporates LD information through the exploitation of a multilocus LD model (Scheet and Stephens, 2006). Because hapFLK combines information across multiple sites, it has been shown to have better detection power than single site statistics (Fariello et al., 2013, 2014). This was confirmed when applied to this dataset, as we could confidently find 67 significant regions using an FDR threshold of 15%. hapFLK has been shown to be robust to bottlenecks and to a certain extent to gene flow (Fariello et al., 2013). We confirmed this by simulating haplotypes under the demographic model estimated by the approach of Excoffier et al. (2013) (Figure 1) and calculating hapFLK on the simulated data (see Methods). While the fixation indices computed from the simulated samples were very close to the ones computed from our data (Table S2), hapFLK p-values obtained from simulated samples did not lead to any signal called significant with the Storey and Tibshirani (2003) approach (Figure S4).

Ten of the significant regions detected by hapFLK likely resulted from assembly errors and were thus not considered in the rest of our analysis (Table S3). The cumulated length of the remaining 57 regions, listed and annotated in Table S4, was 9.1 megabases, i.e. 0.36% of the total autosome length. Detected regions spanned from a few hundred base pairs to more than one megabase with a median length of about 20 kilobases (Figure S5). The median size of detected regions was thus considerably smaller than that obtained in previous studies where hapFLK was applied to 60K data in sheep (Fariello et al., 2014; Kijas, 2014). Nineteen of the regions encompassed at least one gene while 38 contained no gene. In total, 82 genes were included in hapFLK regions, which represents about 0.4% of the total number of genes in the genome, corresponding to a small enrichment in protein coding genes in hapFLK signatures.

To investigate whether sequencing improves detection power with hapFLK, we thinned the dataset by considering only SNPs present on the Illumina BovineHD BeadChip. We found (Figure 5) that the excess of small p-values was much larger when applying hapFLK to all sites identified in the 1000 bull genomes project than when applying it only to the SNPs that are included in the SNP chip. Note that this was not the case with FLK, where detection power was low with both sequencing and SNP chip data due to the amount of drift, as already discussed above (Figure S6).
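The thinning step amounts to keeping only the sequence variants whose coordinates appear on the chip. A minimal sketch is given below, assuming each variant is a dictionary with hypothetical `chrom` and `pos` keys and the chip content is available as (chromosome, position) pairs; the real file formats used in the study are not described here.

```python
def thin_to_chip(variants, chip_positions):
    """Keep only variants whose (chromosome, position) is present on the chip.

    `variants` is a list of dicts with at least "chrom" and "pos" keys
    (a hypothetical layout); input order is preserved.
    """
    chip = set(chip_positions)  # set lookup keeps the filter linear in time
    return [v for v in variants if (v["chrom"], v["pos"]) in chip]
```

Running the same test on the full and on the thinned variant lists then allows the two p-value distributions to be compared, as done in Figure 5.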
Hard sweep regions showing a strong differentiation signal
Eight selection signatures were found with both the HMM and the hapFLK analyses, pointing out regions that combine a low diversity within at least one breed and a strong haplotypic differentiation between breeds (Table 2). Although this is significantly more than expected if the two kinds of signals were independent (p = 1.4 × 10^-4), it is somewhat surprising to observe that many hard sweeps were not detected with hapFLK.

A prevalent reason for this is the large genome-wide level of differentiation between the four breeds, which reduces the power of differentiation based tests (Fariello et al., 2013). Indeed, hard sweep regions detected in one or two populations showed a clear enrichment in low hapFLK p-values (Figure 6). This indicates that many hard sweeps exhibit a mild differentiation signal, although the power to detect them with hapFLK is not sufficient. In addition, some hard sweep regions did not show any differentiation signal, because the same haplotype was fixed or at least increased in frequency in all populations. This is typically the case of hard sweep regions detected in three or four populations, for which there was a depletion of low hapFLK p-values (Figure 6). This may also concern regions where a hard sweep was detected in one or two populations, but where the swept haplotype was also at quite high frequency in other populations, as already discussed above.

Among the regions with evidence for both a hard-sweep and an extreme differentiation signature, the top three corresponded to genes and mutations of known phenotypic effects that recapitulate the most obvious phenotypic divergence of the four breeds in this dataset. The most differentiated region corresponded to the KIT gene, which has been shown to be associated with white spotting patterns in the Holstein (Hayes et al., 2010) and the Fleckvieh (Qanbari et al., 2014), while the Jersey and the Angus are non-spotted breeds. The next most differentiated region harbored the MC1R gene, for which previous studies have identified two causal polymorphisms for coat color (Klungland et al., 1995). Finally, the third signature was a small genomic region comprising the PLAG1 gene (Figure 7, top). Hard sweeps identified in this region indicate a past selection event affecting the Angus and the Holstein breeds, whereas no evidence of selection was found in the Fleckvieh and the Jersey breeds (Table 2), as can also be seen from heterozygosity patterns in the region (Figure 7, bottom). The region surrounding PLAG1 was previously demonstrated to harbour a QTL for calving ease in the Fleckvieh (Pausch et al., 2011) and one for stature in a Holstein × Jersey cross (Karim et al., 2011). Our results are consistent with these studies and suggest that the allele favoring high stature was selected in Holstein and Angus, but not in Jersey and Fleckvieh.

The other common regions include the ASIP gene, which plays a crucial role in adipocyte development and seems to be expressed in a wide set of tissues in different cattle breeds (Albrecht et al., 2012), and the ANKRD55 region, which is strongly associated with autoimmune disorders in humans (in particular multiple sclerosis (Stahl et al., 2010)) and type 2 diabetes (Morris et al., 2012).
hapFLK signatures of soft or incomplete sweeps
Apart from the 8 signatures above, none of the other hapFLK signatures matched hard sweeps detected by the HMM approach. These signatures most likely resulted from incomplete sweeps for which the selected allele did not reach fixation, or soft sweeps where selection targeted an allele that was already at intermediate frequency in the population. Indeed, such signals do not lead to the skewed allele frequency patterns that are looked for by the approach of Boitard et al. (2009). This hypothesis was confirmed by examining haplotype diversity patterns in hapFLK signatures, which typically show haplotypes of large but not fixed frequency in at least one population (e.g. Figure S7 for region 1 in Table S4).

Two signatures are located close to the homologous region of a human promoter region, near ROBO1 on the one hand and the prolactin receptor gene PRLR on the other, hinting that the causal mutation is likely regulatory in nature. ROBO1 has been shown to be involved in early follicular development in sheep (Dickinson and Duncan, 2010; Dickinson et al., 2010), while prolactin and its receptor are involved in a large range of biological functions (development, metabolism, immunology, reproduction, etc.) (Bole-Feysot et al., 1998).

The PRLR gene lies ≈ 6 Mb from the growth hormone receptor (GHR) gene, itself close (≈ 140 kb) to another hapFLK signature. It has been shown that two QTLs affecting milk, fat and protein yield segregate near these two genes in a Finnish dairy cattle population (Viitala et al., 2006). Our results suggest that these QTLs might be in regulatory regions of these genes and that they have responded to selection in populations from the 1000 bull genomes project. We found another selection signature, in Jersey, within ARL15, a gene potentially involved in milk production (Figure S8). Seven highly differentiated variants (FLK p-value < 10^-5) were located in an ARL15 intron. ARL15 is a protein of unknown function that has been shown to be strongly associated with adiponectin levels in humans (Richards et al., 2009). Adiponectin is a hormone involved in glucose metabolism, with low concentrations of adiponectin being associated with insulin resistance. In dairy cows, insulin resistance is maintained in early lactation, favoring mammary glucose uptake. Giesy et al. (2012) showed that the reduction of plasma adiponectin concentration in early lactating cows was not associated with changes in the adiponectin expression itself. Our results could suggest a potential role of ARL15 in this process, with a particular adaptation of the Jersey dairy cattle at this gene.

Apart from the PLAG1 signature, several other signatures include genes involved in morphology and growth: RUNX3 (Yoshida et al., 2004; Soung do et al., 2007), STARD3NL (Rivadeneira et al., 2009) and RASSF2 (Song et al., 2012) are involved in bone development, NCAPG and/or LCORL match a large QTL for many growth traits in cattle, horse and sheep (Eberlein et al., 2009; Weikard et al., 2010; Lindholm-Perry et al., 2011; Signer-Hasler et al., 2012; Makvandi-Nejad et al., 2012; Bongiorni et al., 2012; Metzger et al., 2013; Tetens et al., 2013; Lindholm-Perry et al., 2013; Kijas, 2014; Xu et al., 2015; Randhawa et al., 2015; Sahana et al., 2015) and CTNNBL1 (Liu et al., 2008) is associated with obesity traits in humans.
Identifying causal adaptive polymorphisms
As demonstrated above, genomic scans for selection based on sequencing data have a higher detection power than those based on genotyping chip data, and locate the selection signatures with higher precision. This is an expected outcome of the higher marker density, and the same could be said when comparing high density to medium density chip data. But a more fundamental difference between dense SNP chip data and sequencing data is that, with the latter, one can reasonably expect the causal polymorphism under selection to be included in the observed data. Clearly, not all selection signatures can be related to a single polymorphism, and even when they can, this polymorphism might be absent from our data due to insufficient coverage or remaining calling issues. Still, the favorable situation where a single polymorphism leads to a selective advantage and is present in the data should also occur. One natural question is thus to determine whether such variants can be identified only from genetic diversity patterns. We show below that this is indeed possible.

HapFLK signals
Haplotype frequency differences detected by hapFLK typically result from the increase in frequency of one particular allele in a population due to positive selection, which implies the increase in frequency of one or several haplotypes carrying this allele by genetic hitchhiking. Thus, in regions detected by hapFLK, the causal polymorphism under selection should be the one with the largest allele frequency differentiation, and a natural strategy to detect this polymorphism is to look at the variants with the largest FLK value. Two of the regions identified by hapFLK validate this strategy.
Within the MC1R selection signature, we found three polymorphisms with clear outlying FLK values
(Figure 8), and two of these corresponded to the known causal mutations mentioned previously:
a single base mutation at position 14757910 (rs109688013), responsible for the black pigmentation in
Holstein and Angus breeds, and a single base deletion at position 14757924 (rs110710422), responsible
for the red pigmentation in the Fleckvieh breed (Klungland et al., 1995). The third outlying polymorphism
(rs110494166) in the region was located at position 14678403, within an intron of the nearby
FANCA gene, and exhibited the same allele frequencies as rs110710422.
In the PLAG1 region, the causal mutation was shown to be one of 8 candidate QTNs (Karim
et al., 2011). We found 7 polymorphisms exhibiting a high level of differentiation between breeds
in this region based on the FLK statistic (p-value below $10^{-4}$) (Figure 7 middle, Table S5). Out
of these 7 candidates, 6 were common with the candidate QTNs listed in Table 2 of Karim
et al. (2011). rs134215421 and rs109815800 showed particularly extreme allele frequency differences
between Holstein and Angus on the one hand and Jersey and Simmental on the other hand. These
two mutations are located at the 3' end of the PLAG1 gene, rs134215421 being 1 kb downstream and
rs109815800 within an intron of the PLAG1 gene. One SNP (rs134029466) was not identified as a
potential QTN in Karim et al. but showed a strong signal for differentiation. Two of the mutations in
Karim et al., rs209821678 and rs210030313, which are considered by the authors as the most serious
candidates because they affect a highly conserved element, were not available in the 1000 bull genomes
project. One is a VNTR, a kind of polymorphism that is hardly callable from short-read sequence
data, and the other one is a SNP that lies 44 base pairs from the VNTR, exhibited very low quality
scores and was therefore not called in the 1000 bull genomes data. While these polymorphisms cannot
be considered as disqualified based on our study, their positions, highlighted in Figure 7, lie in a region
where the differentiation signal was not significantly elevated.
Hard sweep signals
Hard sweep signals are expected when an allele goes from very low frequency to almost fixation in a
population due to positive selection. In these regions, detecting the causal selected variant only from
genetic data of the swept population is impossible, because all physical positions show either extreme
allele frequencies or no polymorphism at all. However, if we assume that other sampled populations
evolved neutrally in the region, alleles that were initially at low frequency in the swept population
have likely remained at relatively low frequency in these other populations. This should result in high
genetic differentiation at and around the selected polymorphism, which again can be detected using
the FLK statistic.
For all hard sweep regions except the ones shared by all populations, we thus tried to identify the
selected site by looking for polymorphisms with a high FLK value ($p \le 10^{-4}$). To ensure that the
detected polymorphisms could indeed be the causal ones, we further required a high allele frequency
(≥ 75%) in the swept population(s), and only in this (these) population(s). We found only 12 sweep regions
exhibiting such causal candidates (Tables 3 and S6). Again, this small proportion is related to the
fact that in most cases the selected allele must actually be at relatively high frequency even in
non-swept populations, due to undetected ongoing sweeps or just random drift.
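The two filters described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the record layout (`flk_pvalue`, `freq`), the function name and the example values are hypothetical, and `freq` is assumed to hold the frequency of the allele that is major in the swept breed.

```python
def causal_candidates(variants, swept, p_max=1e-4, freq_min=0.75):
    """Keep variants with FLK p-value <= p_max whose candidate allele is
    at frequency >= freq_min in the swept breed(s), and only in those."""
    kept = []
    for v in variants:
        if v["flk_pvalue"] > p_max:
            continue
        # Breeds in which the candidate allele is at high frequency.
        high = {breed for breed, f in v["freq"].items() if f >= freq_min}
        if high == set(swept):
            kept.append(v)
    return kept

# Hypothetical variants in a region where a sweep was detected in Jersey.
example = [
    {"pos": 1, "flk_pvalue": 1e-6,
     "freq": {"Angus": 0.10, "Fleckvieh": 0.05, "Holstein": 0.20, "Jersey": 0.95}},
    {"pos": 2, "flk_pvalue": 1e-6,   # also frequent in Angus: rejected
     "freq": {"Angus": 0.80, "Fleckvieh": 0.05, "Holstein": 0.20, "Jersey": 0.90}},
    {"pos": 3, "flk_pvalue": 1e-2,   # FLK p-value too large: rejected
     "freq": {"Angus": 0.10, "Fleckvieh": 0.05, "Holstein": 0.20, "Jersey": 0.95}},
]
selected = causal_candidates(example, {"Jersey"})
```

Requiring set equality rather than a simple threshold in the swept breed is what encodes the "and only in these populations" condition.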
The regions detected by this approach include those of KIT, PLAG1 and MC1R. In all these
regions, a very limited number of potentially causal polymorphisms was detected, and in the case of
MC1R these candidates included the true causal variants. This provides a validation of the detection
strategy considered here and suggests that other regions of Table 3 should also contain interesting
putative causal polymorphisms.
Of particular interest, a sweep region on chromosome 20 included a single causal candidate at
position 39872347 (Figures 9 and S9). This polymorphism is located downstream of SLC45A2, a gene
that has been associated with fertility (Killeen et al., 2014) and residual feed intake (Karisa et al., 2013)
in cattle, and with pigmentation in several other species including human (Sturm, 2009; Stefanaki et al.,
2013; Morice-Picard et al., 2014), dog (Wijesena and Schmutz, 2015) or tiger (Xu et al., 2013). It is
also located downstream of RXFP3, a gene known to be involved in food intake regulation and body
weight in mice (Ganella et al., 2012; Smith et al., 2014).
Another interesting region was found on chromosome 22, where two strong candidate polymorphisms
were located 30 kb upstream of MAGI1 (Figures S10 and S11). Although not reported in Table 3
due to p-values slightly greater than $10^{-4}$, 5 other suggestive polymorphisms are located within MAGI1
introns. One of these, at position 36011838, is located within a highly conserved region according to
a multiple alignment of 36 eutherian mammals (www.ensembl.org). At this position, all considered
species carry allele G, which is also the most frequent allele in Angus, Fleckvieh and Holstein, while
allele A swept in Jersey. MAGI1 is a scaffolding protein present in tight junctions of epithelial cells
and might be involved in nervous system functions. Adaptive selection around this gene was already
reported in some West African cattle breeds (Gautier et al., 2009).
A more complex situation was observed on chromosome 7, where 8 and 16 candidate polymorphisms
were found in two sweep windows 300 kb apart. Among all these candidates, the highest
FLK value was reached at position 26459812, in an intergenic region between SLC27A6 and FBN2.
Interestingly, this polymorphism was located in a very conserved region, and it was the only one in
this case among all candidates in the sweep window. Allele T was found in almost all species at
this position and was almost fixed in Angus, Fleckvieh and Holstein, while allele A swept in Jersey.
SLC27A6 has been shown to be related to fatty acid metabolism in cattle (Bionaz and Loor, 2008; Nafikov
et al., 2013), while FBN2 has been related to several developmental processes, including for instance
bone formation, in mice (Nistala et al., 2010).
Strong candidate mutations were also found in two regions without protein-coding genes on the
current bovine annotation, and in general we note that all candidate polymorphisms were located in
intergenic or regulatory regions. This implies that validating the effect of these polymorphisms will be
difficult, but it also highlights the potential of selective sweep studies to improve genome annotation.
Hard sweep signals shared by all breeds
In regions where hard sweep signals were shared by all breeds (Table 1), identifying causal polymorphisms
only from our dataset is impossible. Indeed, the majority of polymorphisms have similar
diversity patterns, with a low minor allele frequency in all 4 breeds. Besides, positions that were
monomorphic in our data set are also good candidates, since they might result from the complete
fixation of a favorable allele in the 4 breeds.
In order to identify potential causal variants in these regions, one possible approach can be to use
data from related species, looking for highly conserved positions for which the major (or the only)
allele of our bovine data set is absent in other mammals. To illustrate this approach, we implemented
the following procedure. Based on a multiple alignment of the bovine reference sequence with 10 other
mammal reference sequences (Rocha et al., 2014), we selected the positions where (i) the bovine allele
(or the major bovine allele in the case of polymorphic positions) was distinct from the yak allele, (ii)
the bovine allele was unobserved among the 10 other species, and (iii) the yak allele was observed in
more than 7 (out of 9) other species, including at least buffalo or sheep (the two closest species). Such
positions represent convincing candidates for two reasons. First, alleles that are observed in other
mammal species must have been segregating in the bovine for a very long time and are thus very
unlikely to produce a hard sweep pattern, even if they became positively selected at some point in
the bovine history. Second, highly conserved positions are more likely functional, and thus subject to
selection, than less conserved ones.
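As an illustration, conditions (i) to (iii) can be written as a single predicate applied to each aligned position. This is a minimal sketch, not the procedure actually run by the authors: the function name, the dictionary layout and the species labels `sp3` to `sp9` are hypothetical placeholders (only yak, buffalo and sheep are named in the text), and alignment gaps are ignored.

```python
def is_conserved_candidate(bovine, yak, others):
    """bovine: (major) bovine allele; yak: yak allele;
    others: allele of each of the 9 remaining species at this position."""
    if bovine == yak:                      # (i) bovine differs from yak
        return False
    if bovine in others.values():          # (ii) bovine allele seen nowhere else
        return False
    carriers = {sp for sp, allele in others.items() if allele == yak}
    if len(carriers) <= 7:                 # (iii) yak allele in more than 7 of 9
        return False
    # ... including at least one of the two closest species.
    return bool(carriers & {"buffalo", "sheep"})

# Hypothetical aligned position: every other species carries G.
others = {"buffalo": "G", "sheep": "G", "sp3": "G", "sp4": "G", "sp5": "G",
          "sp6": "G", "sp7": "G", "sp8": "G", "sp9": "G"}
candidate = is_conserved_candidate("A", "G", others)
```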
Among the 1,382,681 positions included in the regions of Table 1, only 91 satisfied the 3 conditions
above. Further removing 7 positions for which the minor bovine allele was at quite high frequency, we
finally obtained 84 causal candidates (Table S7). All sweep regions except one exhibited convincing
causal variants located in coding or regulatory regions, but more work will be needed to determine
the exact causal variants in these regions.
Discussion
We performed in this study a genomic scan for selection in European taurine cattle, based on large
samples of sequencing data from four different breeds. We used two detection approaches, based
respectively on the genetic diversity within breeds and on the genetic differentiation between breeds,
and compared the signals detected by these two approaches.
One important conclusion from our analysis is that sequencing data represents a great opportunity
in the context of genomic scans for selection. Indeed, the detection power of hapFLK was higher when
applied to sequencing data than when applied to high-density chip data (Figure 5). This was consistent
with previous studies looking for selection signatures in cattle (Ma et al., 2015) and humans (Liu et al.,
2014), which also found, based on computer simulations, that higher power could be expected from
sequencing data compared to SNP chip data. The localization of selection signatures was also found
more accurate with sequencing data, both for hapFLK and the within population approach, as the
median size of detected regions (a few tens of kilobases) was considerably lower than with SNP chip
data. Note that the reduced size of detected regions does not mean that we detected older selection
events. It just comes from the fact that, in many regions, only the most significant part of the
sweep was identified. For instance, the selection events detected by hapFLK are quite recent, because
they must have occurred more recently than breed divergence (approximately 500 years before present
according to our demographic analysis). Still, the region with excess differentiation, which is captured
by hapFLK, is generally smaller than the sweep itself, because part of the swept haplotype can be
shared between breeds. An example can be seen in the ARL15 signature (Figure
S8). Importantly, the higher precision provided by the use of NGS data reduces the number of genes
included in each window, potentially allowing an easier exploration of the molecular functions driving
selection.
Since sequencing data capture a large proportion of the SNPs and small indels segregating in a
sample, genomic scans for selection based on such data can be expected to identify even the causal
polymorphism under selection in a given sweep region. We demonstrated that this is indeed the case in
some regions, where the few variants with the highest allele frequency differentiation between breeds,
as measured by the FLK statistic, included the causal variant. For instance, in the MC1R region, the
two mutations that have been shown to affect coat color in the breeds of our study were included in the
top three FLK values. Similarly, in the PLAG1 region, a good overlap was observed between variants
with top FLK values and the QTNs found by Karim et al. (2011) in a GWAS analysis on stature. In
several other regions, variants with top FLK values also led to promising candidate polymorphisms
(Tables 3 and S6), which were often located in highly conserved regions and/or in the vicinity of genes
with interesting metabolic functions. However, it is important to point out that identifying the selected
variant from FLK values will essentially be possible in population-specific hard sweep scenarios, where
the favorable allele has almost fixed in the selected population(s), and has remained absent or at
very low frequency in the other sampled populations. In all other situations, and for instance those
where the favorable allele is at intermediate frequency also in neutral populations, the FLK value of
the causal variant will be reached just by chance by many neutral variants, so identifying the causal
variant will require additional data, for instance phenotypes related to the selection constraint.
Our study also illustrated the small overlap between the selection signatures detected by the within
and the between population approach, and allowed us to better characterize the regions detected by each
of these approaches. Among the 916 hard sweep regions detected by the within-population approach,
only 8 were detected by hapFLK. One reason is that, in many of the regions where a hard sweep
was detected, the swept allele was likely at quite high frequency also in the other breeds, reducing
the differentiation signal. Indeed, in regions where a sweep was detected in one of the breeds, the
$T_{OR}$ distribution in the other breeds was shifted toward that of swept regions, even if this signal
was not strong enough to be detected by the HMM (Figure S2). This can be explained by the long
shared history between the 4 breeds considered in this study, which have a common geographic origin
and have been strictly isolated only in the last 200 years. Convergent selection of similar alleles in
different breeds may also have occurred, as the same traits were selected in these breeds, but this
is not the most parsimonious hypothesis. Finally, the small proportion of hard sweeps detected by
hapFLK was also likely related to a power issue, because none of the most breed-specific hard sweeps
listed in Table S1 was detected by this approach. Although such regions clearly showed some signal
of differentiation, this signal was not strong enough to be detected by hapFLK, because the very
high levels of genetic drift in cattle breeds imply that only extremely differentiated regions can be
considered as significantly under selection. Actually, we can see from Table 2 that in most hard
sweep regions detected by hapFLK, there was not one hard sweep in a single population, but at least
two sweeps (either complete or incomplete) involving distinct haplotypes in the four breeds, which
represents an even stronger differentiation signal (Figure S12). We expect the 1000 bull genomes
dataset to grow by inclusion of new populations, which will most likely increase the power of differentiation
approaches such as FLK and hapFLK by adding information on breeds more closely related to each
other than in the initial release.
On the other hand, among the 57 selection signatures detected by hapFLK, 49 were not detected by
the within-population approach. This comes from the fact that this approach is specifically designed
to detect hard sweep signals, while hapFLK also detects incomplete or soft sweeps (Fariello et al., 2013,
2014). For the 49 regions detected by hapFLK and not by the within-breed approach, the evidence
of a hard sweep, as measured by statistic $T_{OR}$, was always very low in all breeds, indicating that the
signal detected by hapFLK had nothing to do with a hard sweep in one of the breeds. For example,
in the ARL15 signature, the swept haplotype clearly segregates at moderate frequency (∼85%) in
the Jersey population (Figure S8). This ability of hapFLK to detect incomplete and soft sweeps is
extremely interesting for the study of livestock species. Indeed, recent intensive selection in modern
livestock breeds has mainly targeted polygenic traits such as milk or meat production, which is much
more likely to produce soft and incomplete sweep patterns than hard sweep patterns (Pritchard et al.,
2010; Hernandez et al., 2011).
Conclusion
Our study illustrates how sequencing data offer great advantages for the detection of adaptive loci:
higher power and better precision in localizing adaptive genes, and even in some rare cases causal
mutations. However, we found that many of our strongest signals and mutations lie in non-coding
regions of the genome, hinting that a majority of adaptive mutations are regulatory in nature. A
better annotation of genomes, such as obtained by high-throughput post-genomic approaches, in
combination with phenotyping in large population samples will be key in uncovering the biological
basis of adaptation.
Materials and Methods
Samples and sequencing
A total of 234 genome sequences were obtained from the 1000 bull genomes project, Run II (Daetwyler
et al., 2014). These included 129 Holsteins (125 Black and 4 Red), 43 Fleckviehs, 47 Angus, and 15
Jerseys. We considered all these sequences except the 4 Red Holsteins. We based our analyses on the
phased and corrected autosomal data produced by the 1000 bull genomes project (Daetwyler et al.,
2014). These data included 27,535,425 bi-allelic SNPs and 1,507,728 bi-allelic indels.
Choice of unrelated animals
To remove potential biases arising from sample size heterogeneity between breeds and inbreeding
within breeds, we selected a subset of 25 unrelated animals in Holstein, Fleckvieh and Angus, while
keeping all 15 Jersey animals. Within each breed, we computed the genetic relationship matrix (GRM)
of all available animals based on SNPs from Chromosome 1 with minor allele frequency (MAF) above
10%, using the GCTA 1.04 software (Yang et al., 2011). We then selected unrelated animals as follows.
First, we removed all animals with inbreeding value (the diagonal term of the GRM) above 1.5 (this
threshold was chosen because inbreeding values above 1.5 clearly appeared as outliers compared to the
rest of the distribution, Figure S13). Second, we considered all animal pairs with genetic relationship
above 0.3 in absolute value and removed the most inbred animal of each pair. Third, we performed
a hierarchical clustering of the remaining animals based on the distance $d_{ij} = M - \mathrm{GRM}_{i,j}$, where
$\mathrm{GRM}_{i,j}$ is the genetic relationship between animals i and j and M is the maximum value of the GRM,
and sampled the 25 most distant animals.
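The first two filtering steps and the distance used for clustering can be sketched as below. This is a hedged illustration only: the function names and the toy matrix are hypothetical, the order in which related pairs are processed (most related first) is an assumption the text does not specify, M is assumed to be taken over the retained animals, and the hierarchical clustering and final sampling of 25 animals are not reproduced.

```python
def select_unrelated(grm, inbreeding_max=1.5, kinship_max=0.3):
    """Steps 1 and 2: drop inbreeding outliers, then drop the more inbred
    animal of every pair whose relationship exceeds kinship_max."""
    n = len(grm)
    keep = {i for i in range(n) if grm[i][i] <= inbreeding_max}
    related = sorted(
        ((abs(grm[i][j]), i, j) for i in keep for j in keep
         if i < j and abs(grm[i][j]) > kinship_max),
        reverse=True,  # assumption: handle the most related pairs first
    )
    for _, i, j in related:
        if i in keep and j in keep:
            keep.discard(i if grm[i][i] >= grm[j][j] else j)
    return sorted(keep)

def clustering_distances(grm, keep):
    """Step 3 input: d_ij = M - GRM_ij over the retained animals."""
    m = max(grm[i][j] for i in keep for j in keep)
    return {(i, j): m - grm[i][j] for i in keep for j in keep if i < j}

# Toy symmetric GRM for 4 animals (hypothetical values).
grm = [
    [1.0, 0.5, 0.0, 0.1],
    [0.5, 1.2, 0.0, 0.0],
    [0.0, 0.0, 1.6, 0.0],
    [0.1, 0.0, 0.0, 1.1],
]
kept = select_unrelated(grm)          # animal 2 is an outlier, animal 1 is related to 0
dist = clustering_distances(grm, kept)
```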
Estimation of a demographic model for the joint history of the four breeds
In order to model the demographic history of the four breeds under study, we assumed that these breeds
diverged simultaneously from a common ancestral population, $T_{DIV}$ generations ago. Although the
population tree estimated by FLK (Figure 4) would suggest a slightly more recent divergence between
Fleckvieh and Jersey, we considered that this difference was negligible and preferred reducing the
number of parameters to be estimated. Based on previous studies suggesting that effective population
size in taurine cattle strongly declined since domestication (MacLeod et al., 2013; Boitard et al.,
2016), we allowed one population size change from $N_{ANC}$ (the ancestral population size) to $N_{DOM}$
(the “domestic” population size) in the ancestral population, $T_{ANC}$ generations ago. The divergence
between breeds implied a second population size change: after this event, each breed i was assumed
to have a specific population size $N_i$, which possibly differed from $N_{DOM}$. Finally, we assumed that
each breed received, every generation, a proportion m of migrants from any of the other breeds.
We estimated the parameters of this model using the composite likelihood approach implemented
in fastsimcoal2 (Excoffier et al., 2013), which is based on the joint Site Frequency Spectrum (SFS).
In our dataset, this joint SFS had a very high dimension (51×51×51×31). Consequently, we rather
considered the collection of joint SFS obtained for all 6 population pairs (Angus × Fleckvieh, Angus
× Holstein …), following the recommendations of the authors. We computed these observed joint SFS
from our data using custom scripts and provided them as input to fastsimcoal2, version 5.2.8 (May
2015). We performed 50 independent EM estimations and selected that with the highest composite
likelihood. For each estimation, we used the folded SFS (option -m) and the default settings -n100000
-N100000 -M0.001 -l10 -L40. In fastsimcoal2, time is scaled in generations. Times in years were
obtained by assuming a generation time of 5 years.
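To make the dimensions concrete: 25 diploid animals give 50 chromosomes, hence 51 allele-count classes (0 to 50), and 15 Jersey animals give 31 classes, which matches the 51×51×51×31 joint spectrum. The sketch below builds one folded pairwise joint SFS from per-SNP allele counts. It is not the authors' script and does not claim to reproduce the exact fastsimcoal2 input format; the function name, the folding rule for ties (left unflipped) and the example counts are assumptions.

```python
from itertools import combinations

def joint_folded_sfs(counts1, counts2, n1, n2):
    """counts1/counts2: per-SNP counts of the same allele in two breeds
    sampled at n1 and n2 chromosomes. Returns an (n1+1) x (n2+1) table
    indexed by the counts of the allele that is minor in the pooled pair."""
    sfs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for a1, a2 in zip(counts1, counts2):
        if a1 + a2 > (n1 + n2) / 2:        # counted allele is the major one
            a1, a2 = n1 - a1, n2 - a2      # fold onto the minor allele
        sfs[a1][a2] += 1
    return sfs

# Chromosome numbers per breed in the subsample used here.
sizes = {"Angus": 50, "Fleckvieh": 50, "Holstein": 50, "Jersey": 30}
pairs = list(combinations(sizes, 2))       # the 6 population pairs

# Two hypothetical SNPs for the Angus x Jersey pair.
sfs = joint_folded_sfs([1, 49], [0, 30], sizes["Angus"], sizes["Jersey"])
```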
Detection of genomic regions with low within-breed diversity
Model. We looked for hard sweep signatures within each breed using the Hidden Markov Model
(HMM) of Boitard et al. (2009). In this model, only bi-allelic variants are considered and the derived
allele frequency at variant i, denoted $Y_i$, is taken as the observed state of the HMM at this position.
Each variant i is also assumed to have a hidden state $X_i$, which can take 3 different values: “Selection”,
for variants that are very close to a swept site, “Neutral”, for variants that are far away from any
swept site, and “Intermediate” for variants in between. These three values are associated with different
allele frequency distributions. The “Neutral” allele frequency distribution is estimated using all
variants in the genome, assuming most of them have indeed evolved under neutrality. Allele frequency
distributions in states “Intermediate” and “Selection” are deduced from this “Neutral” distribution
using the derivations in Nielsen et al. (2005), and are typically more skewed toward extreme allele
frequencies. The hidden states $X_i$ form a Markov chain along the genome with a per bp probability p
of switching state, so that close variants tend to be in the same hidden state. Under this HMM, the
most likely sequence of hidden states can be predicted from the sequence of observed states using the
Viterbi algorithm. Each set of consecutive variants with predicted state “Selection” is called a sweep
window. The method of Boitard et al. (2009) is implemented in the freq-hmm program, available at
https://forge-dga.jouy.inra.fr/projects/pool-hmm.
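The decoding step can be illustrated with a generic three-state Viterbi pass. This is a simplified sketch and not the freq-hmm implementation: the emission tables are hypothetical, the initial state distribution is assumed uniform, and a switch is assumed to land on either of the other two states with equal probability, whereas the transition structure of Boitard et al. (2009) may differ.

```python
import math

STATES = ("Neutral", "Intermediate", "Selection")

def viterbi(obs, positions, emission, p_switch):
    """obs: observed frequency class per variant; positions: bp coordinates
    (strictly increasing); emission[state][cls]: strictly positive
    probabilities; p_switch: per-bp probability of switching state."""
    k = len(STATES)
    score = [math.log(1.0 / k) + math.log(emission[s][obs[0]]) for s in STATES]
    back = []
    for t in range(1, len(obs)):
        dist = positions[t] - positions[t - 1]
        stay = (1.0 - p_switch) ** dist        # no switch over dist bp
        move = (1.0 - stay) / (k - 1)          # assumed split between other states
        new_score, pointers = [], []
        for j, s in enumerate(STATES):
            cands = [score[i] + math.log(stay if i == j else move) for i in range(k)]
            best = max(range(k), key=cands.__getitem__)
            new_score.append(cands[best] + math.log(emission[s][obs[t]]))
            pointers.append(best)
        score = new_score
        back.append(pointers)
    state = max(range(k), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):            # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return [STATES[i] for i in reversed(path)]

# Toy emissions over two classes: 0 = intermediate frequency, 1 = extreme.
emission = {"Neutral": [0.99, 0.01], "Intermediate": [0.5, 0.5],
            "Selection": [0.01, 0.99]}
obs = [0] * 5 + [1] * 5
positions = [1000 * i for i in range(10)]
path = viterbi(obs, positions, emission, p_switch=1e-5)
```

Consecutive variants decoded as "Selection" would then be merged into one sweep window.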
Implementation. Ancestral Bovinae alleles at 448,289 SNPs included in the Illumina BovineHD
BeadChip were obtained from Utsunomiya et al. (2013). To check this information we also aligned the
bovine sequence against the sequence of three Bovidae species (Rocha et al., 2014). For 365,146 SNPs,
the ancestral allele in our alignment was consistent with that reported by Utsunomiya et al. (2013),
so we used this information for sweep detection. For all other SNPs, we used a folded allele frequency
distribution, i.e. allele frequencies $Y_i$ and $1 - Y_i$ were considered as the same observed state. Indels
were not included at this stage of the analysis (they were only considered when looking at candidate
polymorphisms within the region).
To reduce computation time, we estimated the neutral SFS in each breed using only 5% of the
SNPs from each chromosome, which were selected at random. These SFS are shown in Figure S14.
The type I error of the above method, i.e. the probability that it detects a sweep window in a
population that has evolved under neutrality, depends on parameter p (see Boitard et al. (2009) for
more details). To control the genome-wide number of false positives, we simulated 5,000 samples of
length 500kb under neutral evolution using ms (Hudson, 2002) and adjusted p so that sweeps were
detected in only 0.1% of these samples. We performed this calibration for each breed, using the same
sample size and proportion of unfolded sites as in the data used for sweep detection. Parameter θ was
also estimated from this data using Watterson's estimator (Watterson, 1975) and ρ was taken equal
to θ/10, which is rather low compared to current estimates in cattle (Sandor et al., 2012; Ma et al.,
2015). As the detection sensitivity of the HMM method increases when ρ/θ decreases (Boitard et al.,
2009), our adjusted value of p should be conservative. A 2.5Gb genome such as the bovine one
(focusing on autosomes) is equivalent to 5,000 windows of 500kb, so we expected no more than 5 false
positive signals over the genome with this value of p.
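Two quantities of this calibration are simple enough to check numerically: Watterson's estimator, θ_W = S / a_n with a_n = Σ_{i=1}^{n-1} 1/i for S segregating sites among n chromosomes, and the expected genome-wide number of false positives. The snippet is a sketch; the function and variable names are ours and the example value of S is hypothetical.

```python
def watterson_theta(n_segregating, n_chromosomes):
    """Watterson's estimator: S divided by the harmonic sum a_n."""
    a_n = sum(1.0 / i for i in range(1, n_chromosomes))
    return n_segregating / a_n

# Hypothetical window: 3 segregating sites among 3 chromosomes, a_3 = 1.5.
theta_example = watterson_theta(3, 3)

# Genome-wide false positives implied by the per-window calibration.
genome_bp = 2.5e9          # autosomal genome length assumed in the text
window_bp = 500e3          # length of each simulated sample
per_window_rate = 0.001    # sweeps detected in 0.1% of neutral samples
n_windows = genome_bp / window_bp
expected_false_positives = n_windows * per_window_rate
```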
Influence of demography on sweep detection. We evaluated the robustness of the above
detection approach under two neutral demographic scenarios: the multi-population model estimated
by fastsimcoal2 (Figure 1) and the single-population model estimated in Boitard et al. (2016). For
each model, we simulated 20,000 samples of length 500kb using ms, assuming a mutation rate and
a recombination rate of 1e-8 per bp and generation, and we applied the HMM procedure to these
samples. For the multi-population model, each simulated sample included genomes from the 4 breeds,
which were split in order to apply the HMM within each population. For the single-population
scenario, breed-specific samples were directly simulated independently of each other, each breed
being associated with a different population size history. For each scenario and breed, the total length of
simulated samples was equivalent to that of 4 cattle genomes. Consequently, the estimated proportion
m of false positive signals per genome was given by the total number of sweeps detected in the
simulations, divided by four, and the variance of this estimation was equal to m/4. The confidence
interval of this estimation was approximated by
$$[m - 2\sigma(m);\; m + 2\sigma(m)] = [m - \sqrt{m};\; m + \sqrt{m}]$$
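The variance m/4 and the resulting interval follow if one assumes, as the stated variance suggests, that the total number S of sweeps detected over the 4 simulated genome equivalents is Poisson distributed. The derivation below is a sketch under that assumption, where S is a symbol introduced here for illustration:

```latex
m = \frac{S}{4}, \qquad
\mathrm{Var}(S) = \mathbb{E}[S] \approx 4m, \qquad
\mathrm{Var}(m) = \frac{\mathrm{Var}(S)}{16} \approx \frac{m}{4}, \qquad
\sigma(m) \approx \frac{\sqrt{m}}{2}, \qquad
2\sigma(m) \approx \sqrt{m}.
```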
Comparing hard sweep signatures detected in different populations. For a given
variant i, the evidence for a hard sweep window occurring around this variant in population j was
measured by the statistic $T_{OR}(i, j) = \log_{10}(q_{ij}/(1 - q_{ij}))$, where $q_{ij}$ was the posterior probability
of hidden state “Selection” returned by the forward-backward algorithm applied to the HMM in
population j. When considering a given region of the genome, the evidence for a hard sweep in
population j was quantified by the median of $T_{OR}(i, j)$ over the variants of the region. In order to
detect selection signatures that are really breed-specific, we computed for each breed the distribution
of $T_{OR}$ in 3 different classes of regions: (i) those where a hard sweep was detected in this breed, (ii)
those where a hard sweep was detected in another breed, and (iii) those where no hard sweep was
detected (Figure S2). Obviously, class (i) and class (iii) regions lead to very different distributions, with
lower $T_{OR}$ values in the latter (i.e. in the completely neutral regions). In addition, the distribution of
$T_{OR}$ in class (ii) regions was not similar to that found in class (iii) regions, but was shifted towards
that of class (i) regions. Based on these observations, we therefore considered that a hard sweep was
breed-specific when, in all other breeds, the value of $T_{OR}$ of the region was below a given quantile q
of the class (iii) distribution. For q = 0.5, 55 breed-specific sweeps were detected (listed in File S2)
and for q = 0.25, 12 breed-specific regions were detected.
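The statistic and the breed-specificity rule can be written compactly as below. This is an illustrative sketch: the function names and example numbers are hypothetical, posterior probabilities are assumed to lie strictly between 0 and 1, and the threshold (a quantile of the class (iii) distribution) is assumed to be computed elsewhere and passed in.

```python
import math
from statistics import median

def t_or(q):
    """Log10 odds of hidden state 'Selection' for posterior probability q."""
    return math.log10(q / (1.0 - q))

def region_evidence(posteriors):
    """Evidence for a hard sweep in a region: median T_OR over its variants."""
    return median(t_or(q) for q in posteriors)

def is_breed_specific(evidence_other_breeds, threshold):
    """True when the region's T_OR is below the threshold in every other breed."""
    return all(t < threshold for t in evidence_other_breeds.values())

# Hypothetical posteriors in one region, and region-level T_OR in other breeds.
evidence = region_evidence([0.5, 0.5, 0.9])
specific = is_breed_specific({"Angus": -3.0, "Holstein": -2.5}, threshold=-2.0)
```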
Detection of genomic regions with large differentiation between breeds
We applied two methods for the detection of genomic regions exhibiting large genetic differentiation
between populations: FLK (Bonhomme et al., 2010), a single marker approach based on allele frequency
differences, and its haplotypic extension hapFLK (Fariello et al., 2013), which exploits the
Linkage Disequilibrium information to capture differences of haplotype frequencies.
Kinship matrix. In contrast with the FST statistic, FLK and hapFLK account for the population
history through a kinship matrix, which captures (i) differences in effective population sizes between
populations and (ii) possible shared ancestry between populations. The kinship matrix is inferred
from a population tree, with branch length expressed in units of drift, i.e. measured in fixation indices
$f \approx t/2N$, where t is the number of generations from the root and N the effective population size.
We estimated the population tree using neighbour-joining on the Reynolds genetic distances between
populations (see Bonhomme et al. (2010) for details). We used the ancestral allele reconstruction of
Utsunomiya et al. (2013) to root the population tree and estimate the population kinship matrix.
FLK. For the single marker analysis, we performed the FLK test on all variants and computed
p-values using the theoretical $\chi^2(3)$ distribution, which was a good fit to the observed distribution
(Figure S15).
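For 3 degrees of freedom the chi-square upper tail has a closed form, P(X > x) = erfc(√(x/2)) + √(2x/π) e^{−x/2}, so FLK p-values can be computed with the standard library alone. The function name below is ours and the sketch makes no claim about how the authors computed their p-values.

```python
import math

def chi2_3df_sf(x):
    """Upper-tail probability of a chi-square variable with 3 degrees
    of freedom, using the closed form available for odd df."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    return math.erfc(math.sqrt(z)) + 2.0 * math.sqrt(z / math.pi) * math.exp(-z)

# An FLK value close to the 95% critical value of chi2(3).
p_example = chi2_3df_sf(7.815)
```

With this distribution, the threshold p ≤ 10⁻⁴ used for candidate polymorphisms corresponds to FLK values of about 21.1 or more.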
hapFLK. For the hapFLK test, to save computation time, we removed variants that had low minor
allele frequency (< 10%) in all breeds. Note that as the analysis looks for signal of differentiation,
removing these variants does not preclude detection. In subsequent reanalysis of small genomic regions,
we kept all variants in the analysis. hapFLK makes use of the local clustering approach in
Scheet and Stephens (2006) to model haplotype diversity. This model requires specifying a number
of haplotype clusters as input parameter. Using the cross validation procedure implemented in the
fastPHASE software, we found that 15 clusters provided the smallest imputation error rate. hapFLK
can be computed on unphased or phased genotype data. Genotype calling made use of imputation
approaches, and data was therefore already phased (Daetwyler et al., 2014). We computed hapFLK
both on the haplotype data and on the genotype (but imputed) data and found the two analyses
provided similar results (not shown).
The distribution of hapFLK is not known, but, from theoretical arguments hapFLK is a deviance
statistic. However, the variance parameter of this deviance is not known. If this parameter was known
then hapFLK should follow a $\chi^2((N - 1)(K - 1))$ distribution, where N is the number of populations
and K the number of haplotype clusters. Building on this fact, we compared a set of quantiles (from
0.05 to 0.95 every 0.05) of the $\chi^2(42)$ distribution, $t_q$, to the observed quantiles of the hapFLK statistic,
$o_q$. We found that the relationship between $t_q$ and $o_q$ was very close to linear (Figure S16). Thus, we
used the parameter of the linear model to scale the hapFLK statistic to a $\chi^2(42)$ distribution which
was then used to compute p-values.
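The scaling idea can be sketched as a least-squares fit of the observed quantiles on the theoretical ones, followed by inversion of the fitted line. This is not the authors' script: the direction of the regression (o_q = a + b·t_q), the function names and the use of the Wilson-Hilferty approximation for chi-square quantiles (chosen here only to stay within the standard library) are all assumptions.

```python
from statistics import NormalDist

def chi2_quantile_wh(prob, df):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def fit_line(t, o):
    """Least-squares intercept a and slope b of o = a + b * t."""
    mt, mo = sum(t) / len(t), sum(o) / len(o)
    b = sum((x - mt) * (y - mo) for x, y in zip(t, o)) / sum((x - mt) ** 2 for x in t)
    return mo - b * mt, b

def scale(value, a, b):
    """Map a raw hapFLK value onto the chi-square scale."""
    return (value - a) / b

# Theoretical quantiles t_q of chi2(42) from 0.05 to 0.95 every 0.05.
probs = [k / 20 for k in range(1, 20)]
t_q = [chi2_quantile_wh(p, 42) for p in probs]

# Hypothetical observed quantiles that are exactly linear in t_q.
o_q = [2.0 + 3.0 * x for x in t_q]
a, b = fit_line(t_q, o_q)
```

In practice o_q would be the empirical quantiles of the genome-wide hapFLK values, and p-values would be taken from the upper tail of the chi-square distribution at the scaled value.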
Extracting significantly differentiated regions. In order to call significant regions, we applied
the approach of Storey and Tibshirani (2003), aimed at controlling the False Discovery Rate
(FDR), at the 15% level, i.e. we called significant the variants with q-values < 0.15, and called selection
signatures the regions where more than one variant was significant. We note however that our level of
control of the FDR for the regions themselves may not be well calibrated, as the tests are correlated.
This procedure still provides a significance threshold such that, among marker discoveries, there is a
six-fold enrichment in favor of statistics under the alternative (at an FDR of 15%, about 85 discoveries
are expected to be true for every 15 false ones).
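The calling procedure can be illustrated with a short sketch. It is a simplification and not the implementation used in the study: the proportion of true null hypotheses π0 is estimated at a single tuning value λ = 0.5 rather than with the smoothing over a grid of λ values proposed by Storey and Tibshirani (2003), and the rule used to group significant variants into regions (a maximum gap between consecutive significant positions, parameter max_gap) is hypothetical, since the grouping rule is not detailed here.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimated proportion of true nulls from p-values above lam, capped at 1."""
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def qvalues(pvalues, pi0=None):
    """q-values: pi0 * m * p / rank, made monotone from the largest p-value down."""
    m = len(pvalues)
    if pi0 is None:
        pi0 = estimate_pi0(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


def call_regions(positions, significant, max_gap):
    """Group significant variants separated by at most max_gap; keep groups of 2+.

    positions must be sorted along one chromosome. Returns (start, end, n) tuples.
    max_gap is a hypothetical parameter introduced for this illustration.
    """
    regions, current = [], []
    for pos, sig in zip(positions, significant):
        if not sig:
            continue
        if current and pos - current[-1] > max_gap:
            if len(current) > 1:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) > 1:
        regions.append((current[0], current[-1], len(current)))
    return regions
```

A typical use would be `significant = [q < 0.15 for q in qvalues(pvalues)]` followed by `call_regions(positions, significant, max_gap)` for each chromosome.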
Influence of demography on hapFLK. To evaluate the robustness to demography of hapFLK
and our scaling approach, we performed 10,000 simulations of 50 kb windows under the population
model estimated with fastsimcoal2. We applied the same testing procedure as with the real data,
including filtering out SNPs with low minor allele frequencies in all breeds. The resulting hapFLK
distribution was different from the one observed on real data; in particular, it showed a depletion
of low hapFLK values compared to real data, i.e. on real data some regions look more similar
between breeds than on simulated data. The reasons for this are not clear, but we note that (i)
fastsimcoal2 estimation does not use haplotype or linkage disequilibrium information, so the simulated
haplotype patterns are not necessarily expected to fit the data, (ii) simulations assume homogeneous
recombination and mutation rates, which does not hold on real data, and (iii) common background
/ purifying selection between breeds might reduce differentiation in some genome regions. Despite
the lesser hapFLK variance in simulations, scaling the hapFLK distribution to a χ2 with 14 degrees
of freedom provided a very good fit (see Figure S4). Applying the Storey and Tibshirani (2003)
approach to estimate the proportion of alternative hypotheses in the resulting p-value distribution led to
an estimate of 0 (i.e. π̂0 = 1). A python script to perform the scaling of hapFLK to χ2 distributions
is now available on the hapFLK webpage: https://forge-dga.jouy.inra.fr/projects/hapflk/documents
References
Albrecht E, Komolka K, Kuzinski J, Maak S, 2012. Agouti revisited: Transcript quantification of the asip gene in bovine tissues related to protein expression and localization. PLoS ONE 7:e35282.
Allais-Bonnet A, Grohs C, Medugorac I, Krebs S, Djari A, Graf A, Fritz S, Seichter D, Baur A, Russ I, et al., 2013. Novel insights into the bovine polled phenotype and horn ontogenesis in Bovidae. PLoS ONE 8:e63512.
Bionaz M, Loor JJ, 2008. Acsl1, agpat6, fabp3, lpin1, and slc27a6 are the most abundant isoforms in bovine mammary tissue and their expression is affected by stage of lactation. The Journal of nutrition 138:1019–1024.
Biswas S, Akey JM, 2006. Genomic insights into positive selection. Trends in Genetics 22:437–446.
Boitard S, Rodríguez W, Jay F, Mona S, Austerlitz F, 2016. Inferring population size history from large samples of genome-wide molecular data - an approximate bayesian computation approach. PLoS Genet 12:1–36.
Boitard S, Schlotterer C, Futschik A, 2009. Detecting selective sweeps: a new approach based on hidden markov models. Genetics 181:1567–1578.
Bole-Feysot C, Goffin V, Edery M, Binart N, Kelly PA, 1998. Prolactin (prl) and its receptor: actions, signal transduction pathways and phenotypes observed in prl receptor knockout mice. Endocrine Reviews 19:225–268. PMID: 9626554.
Bongiorni S, Mancini G, Chillemi G, Pariset L, Valentini A, 2012. Identification of a short region on chromosome 6 affecting direct calving ease in Piedmontese cattle breed. PLoS ONE 7:e50137.
Bonhomme M, Chevalet C, Servin B, Boitard S, Abdallah J, Blott S, SanCristobal M, 2010. Detecting selection in population trees: the lewontin and krakauer test extended. Genetics 186:241–262.
Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brondum RF, Liao X, Djari A, Rodriguez SC, Grohs C, et al., 2014. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat Genet 46:858–865.
Di Noia MA, Todisco S, Cirigliano A, Rinaldi T, Agrimi G, Iacobazzi V, Palmieri F, 2014. The human slc25a33 and slc25a36 genes of solute carrier family 25 encode two mitochondrial pyrimidine nucleotide transporters. Journal of Biological Chemistry 289:33137–33148.
Dickinson RE, Duncan WC, 2010. The SLIT-ROBO pathway: a regulator of cell function with implications for the reproductive system. Reproduction 139:697–704.
Dickinson RE, Hryhorskyj L, Tremewan H, Hogg K, Thomson AA, McNeilly AS, Duncan WC, 2010. Involvement of the SLIT/ROBO pathway in follicle development in the fetal ovary. Reproduction 139:395–407.
Eberlein A, Takasuga A, Setoguchi K, Pfuhl R, Flisikowski K, Fries R, Klopp N, Furbass R, Weikard R, Kuhn C, 2009. Dissection of genetic factors modulating fetal growth in cattle indicates a substantial role of the non-SMC condensin I complex, subunit G (NCAPG) gene. Genetics 183:951–64.
Echternkamp S, Aad P, Eborn D, Spicer L, 2012. Increased abundance of aromatase and follicle stimulating hormone receptor mrna and decreased insulin-like growth factor-2 receptor mrna in small ovarian follicles of cattle selected for twin births. Journal of animal science 90:2193–2200.
Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M, 2013. Robust demographic inference from genomic and snp data. PLoS Genet 9:e1003905.
Fariello MI, Boitard S, Naya H, SanCristobal M, Servin B, 2013. Detecting signatures of selection through haplotype differentiation among hierarchically structured populations. Genetics 193:929–941.
Fariello MI, Servin B, Tosser-Klopp G, Rupp R, Moreno C, Cristobal MS, Boitard S, Consortium ISG, 2014. Selection signatures in worldwide sheep populations. PLoS ONE 9:e103813.
Felius M, Koolmees PA, Theunissen B, Consortium ECGD, Lenstra JA, 2011. On the breeds of cattle - historic and current classifications. Diversity 3:660–692.
Flori L, Fritz S, Jaffrézic F, Boussaha M, Gut I, Heath S, Foulley JL, Gautier M, 2009. The genome response to artificial selection: A case study in dairy cattle. PLoS ONE 4:e6595.
Ganella DE, Ryan PJ, Bathgate RA, Gundlach AL, 2012. Increased feeding and body weight gain in rats after acute and chronic activation of rxfp3 by relaxin-3 and receptor-selective peptides: functional and therapeutic implications. Behavioural pharmacology 23:516–525.
Gautier M, Flori L, Riebler A, Jaffrezic F, Laloe D, Gut I, Moazami-Goudarzi K, Foulley JL, 2009. A whole genome bayesian scan for adaptive genetic divergence in west african cattle. BMC Genomics 10:550.
Giesy SL, Yoon B, Currie WB, Kim JW, Boisclair YR, 2012. Adiponectin deficit during the precarious glucose economy of early lactation in dairy cows. Endocrinology 153:5834–44.
Gouveia JJdS, Silva MVGBd, Paiva SR, Oliveira SAMPd, 2014. Identification of selection signatures in livestock species. Genetics and Molecular Biology 37:330–342.
Gutierrez-Gil B, Arranz JJ, Wiener P, 2015. An interpretive review of selective sweep studies in bos taurus cattle populations: identification of unique and shared selection signals across breeds. Frontiers in Genetics 6.
Hayes BJ, Pryce J, Chamberlain AJ, Bowman PJ, Goddard ME, 2010. Genetic architecture of complex traits and accuracy of genomic prediction: coat colour, milk-fat percentage, and type in holstein cattle as contrasting model traits. PLoS Genetics 6:e1001139.
Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, Sella G, Przeworski M, et al., 2011. Classic selective sweeps were rare in recent human evolution. Science 331:920–924.
Hudson R, 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.
Karim L, Takeda H, Lin L, Druet T, Arias JA, Baurain D, Cambisano N, Davis SR, Farnir F, Grisart B, et al., 2011. Variants modulating the expression of a chromosome domain encompassing plag1 influence bovine stature. Nature genetics 43:405–413.
Karisa B, Thomson J, Wang Z, Stothard P, Moore S, Plastow G, 2013. Candidate genes and single nucleotide polymorphisms associated with variation in residual feed intake in beef cattle. Journal of animal science 91:3502–3513.
Kijas JW, 2014. Haplotype-based analysis of selective sweeps in sheep. Genome 57:433–7.
Killeen AP, Morris DG, Kenny DA, Mullen MP, Diskin MG, Waters SM, 2014. Global gene expression in endometrium of high and low fertility heifers during the mid-luteal phase of the estrous cycle. BMC Genomics 15:234.
Klungland H, Vage D, Gomez-Raya L, Adalsteinsson S, Lien S, 1995. The role of melanocyte-stimulating hormone (msh) receptor in bovine coat color determination. Mammalian genome 6:636–639.
Leroy G, Mary-Huard T, Verrier E, Danvy S, Charvolin E, Danchin-Burge C, 2013. Methods to estimate effective population size using pedigree data: Examples in dog, sheep, cattle and horse. Genetics Selection Evolution 45:1.
Li M, Tian S, Jin L, Zhou G, Li Y, Zhang Y, Wang T, Yeung CK, Chen L, Ma J, et al., 2013. Genomic analyses identify distinct patterns of selection in domesticated pigs and tibetan wild boars. Nature genetics 45:1431–1438.
Lindholm-Perry AK, Kuehn LA, Oliver WT, Sexten AK, Miles JR, Rempel LA, Cushman RA, Freetly HC, 2013. Adipose and muscle tissue gene expression of two genes (NCAPG and LCORL) located in a chromosomal region associated with cattle feed intake and gain. PLoS ONE 8:e80882.
Lindholm-Perry AK, Sexten AK, Kuehn LA, Smith TP, King DA, Shackelford SD, Wheeler TL, Ferrell CL, Jenkins TG, Snelling WM, et al., 2011. Association, effects and validation of polymorphisms within the NCAPG - LCORL locus located on BTA6 with feed intake, gain, meat and carcass traits in beef cattle. BMC Genet 12:103.
Liu X, Saw WY, Ali M, Ong RTH, Teo YY, 2014. Evaluating the possibility of detecting evidence of positive selection across asia with sparse genotype data from the hugo pan-asian snp consortium. BMC Genomics 15:332.
Liu YJ, Liu XG, Wang L, Dina C, Yan H, Liu JF, Levy S, Papasian CJ, Drees BM, Hamilton JJ, et al., 2008. Genome-wide association scans identified ctnnbl1 as a novel gene for obesity. Human Molecular Genetics 17:1803–1813.
Ma L, O’Connell JR, VanRaden PM, Shen B, Padhi A, Sun C, Bickhart DM, Cole JB, Null DJ, Liu GE, et al., 2015. Cattle sex-specific recombination and genetic control from a large pedigree analysis. PLoS Genet 11:e1005387.
MacLeod IM, Larkin DM, Lewin HA, Hayes BJ, Goddard ME, 2013. Inferring demography from runs of homozygosity in whole-genome sequence, with correction for sequence errors. Molecular Biology and Evolution 30:2209–2223.
Makvandi-Nejad S, Hoffman GE, Allen JJ, Chu E, Gu E, Chandler AM, Loredo AI, Bellone RR, Mezey JG, Brooks SA, et al., 2012. Four loci explain 83% of size variation in the horse. PLoS ONE 7:e39929.
Metzger J, Schrimpf R, Philipp U, Distl O, 2013. Expression levels of LCORL are associated with body size in horses. PLoS ONE 8:e56497.
Morice-Picard F, Lasseaux E, Cailley D, Gros A, Toutain J, Plaisant C, Simon D, François S, Gilbert-Dussardier B, Kaplan J, et al., 2014. High-resolution array-cgh in patients with oculocutaneous albinism identifies new deletions of the tyr, oca2, and slc45a2 genes and a complex rearrangement of the oca2 gene. Pigment cell & melanoma research 27:59–71.
Morris AP, Voight BF, Teslovich TM, Ferreira T, Segre AV, Steinthorsdottir V, Strawbridge RJ, Khan H, Grallert H, Mahajan A, et al., 2012. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44:981–90.
Nafikov R, Schoonmaker J, Korn K, Noack K, Garrick D, Koehler K, Minick-Bormann J, Reecy J, Spurlock D, Beitz D, 2013. Association of polymorphisms in solute carrier family 27, isoform a6 (slc27a6) and fatty acid-binding protein-3 and fatty acid-binding protein-4 (fabp3 and fabp4) with fatty acid composition of bovine milk. Journal of dairy science 96:6007–6021.
Nielsen R, Williamson L, Kim Y, Hubisz M, Clark A, Bustamante C, 2005. Genomic scans for selective sweeps using SNP data. Genome Research 15:1566–1575.
Nistala H, Lee-Arteaga S, Smaldone S, Siciliano G, Carta L, Ono RN, Sengle G, Arteaga-Solis E, Levasseur R, Ducy P, et al., 2010. Fibrillin-1 and -2 differentially modulate endogenous tgf-β and bmp bioavailability during bone formation. The Journal of cell biology 190:1107–1121.
Pausch H, Flisikowski K, Jung S, Emmerling R, Edel C, Götz KU, Fries R, 2011. Genome-wide association study identifies two major loci affecting calving ease and growth-related traits in cattle. Genetics 187:289–297.
Pritchard JK, Pickrell JK, Coop G, 2010. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Current Biology 20:R208–R215.
Qanbari S, Gianola D, Hayes B, Schenkel F, Miller S, Moore S, Thaller G, Simianer H, 2011. Application of site and haplotype-frequency based approaches for detecting selection signatures in cattle. BMC Genomics 12:318.
Qanbari S, Pausch H, Jansen S, Somel M, Strom TM, Fries R, Nielsen R, Simianer H, 2014. Classic selective sweeps revealed by massive sequencing in cattle. PLoS Genet 10:e1004148.
Randhawa IA, Khatkar MS, Thomson PC, Raadsma HW, 2015. Composite Selection Signals for Complex Traits Exemplified Through Bovine Stature Using Multi-Breed Cohorts of European and African Bos taurus. G3 (Bethesda).
Richards JB, Waterworth D, O’Rahilly S, Hivert MF, Loos RJ, Perry JR, Tanaka T, Timpson NJ, Semple RK, Soranzo N, et al., 2009. A genome-wide association study reveals variants in ARL15 that influence adiponectin levels. PLoS Genet 5:e1000768.
Rivadeneira F, Styrkarsdottir U, Estrada K, Halldorsson BV, Hsu YH, Richards JB, Zillikens MC, Kavvoura FK, Amin N, Aulchenko YS, et al., 2009. Twenty bone-mineral-density loci identified by large-scale meta-analysis of genome-wide association studies. Nat Genet 41:1199–206.
Rocha D, Billerey C, Samson F, Boichard D, Boussaha M, 2014. Identification of the putative ancestral allele of bovine single-nucleotide polymorphisms. Journal of Animal Breeding and Genetics 131:483–486.
Roux PF, Boitard S, Blum Y, Parks B, Montagner A, Mouisel E, Djari A, Esquerré D, Désert C, Boutin M, et al., 2015. Combined qtl and selective sweep mappings with coding snp annotation and cis-eqtl analysis revealed park2 and jag2 as new candidate genes for adiposity regulation. G3: Genes|Genomes|Genetics 5:517–529.
Rubin CJ, Megens HJ, Barrio AM, Maqbool K, Sayyab S, Schwochow D, Wang C, Carlborg Ö, Jern P, Jørgensen CB, et al., 2012. Strong signatures of selection in the domestic pig genome. Proceedings of the National Academy of Sciences 109:19529–19536.
Rubin CJ, Zody MC, Eriksson J, Meadows JR, Sherwood E, Webster MT, Jiang L, Ingman M, Sharpe T, Ka S, et al., 2010. Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464:587–591.
Sabeti P, Schaffner S, Fry B, Lohmueller J, Varilly P, Shamovsky O, Palma A, Mikkelsen T, Altshuler D, Lander E, 2006. Positive natural selection in the human lineage. Science 312:1614–1620.
Sahana G, Hoglund JK, Guldbrandtsen B, Lund MS, 2015. Loci associated with adult stature also affect calf birth survival in cattle. BMC Genet 16:47.
Sandor C, Li W, Coppieters W, Druet T, Charlier C, Georges M, 2012. Genetic variants in rec8, rnf212, and prdm9 influence male recombination in cattle. PLoS Genet 8:e1002854.
Scheet P, Stephens M, 2006. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics 78:629–644.
Signer-Hasler H, Flury C, Haase B, Burger D, Simianer H, Leeb T, Rieder S, 2012. A genome-wide association study reveals loci influencing height and other conformation traits in horses. PLoS ONE 7:e37282.
Smith CM, Chua BE, Zhang C, Walker AW, Haidar M, Hawkes D, Shabanpoor F, Hossain MA, Wade JD, Rosengren KJ, et al., 2014. Central injection of relaxin-3 receptor (rxfp3) antagonist peptides reduces motivated food seeking and consumption in c57bl/6j mice. Behavioural brain research 268:117–126.
Song H, Kim H, Lee K, Lee DH, Kim TS, Song JY, Lee D, Choi D, Ko CY, Kim HS, et al., 2012. Ablation of Rassf2 induces bone defects and subsequent haematopoietic anomalies in mice. EMBO J 31:1147–59.
Soung do Y, Dong Y, Wang Y, Zuscik MJ, Schwarz EM, O’Keefe RJ, Drissi H, 2007. Runx3/AML2/Cbfa3 regulates early and late chondrocyte differentiation. J Bone Miner Res 22:1260–70.
Stahl EA, Raychaudhuri S, Remmers EF, Xie G, Eyre S, Thomson BP, Li Y, Kurreeman FA, Zhernakova A, Hinks A, et al., 2010. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat Genet 42:508–14.
Stefanaki I, Panagiotou OA, Kodela E, Gogas H, Kypreou KP, Chatzinasiou F, Nikolaou V, Plaka M, Kalfa I, Antoniou C, et al., 2013. Replication and predictive value of snps associated with melanoma and pigmentation traits in a southern european case-control study. PLoS ONE 8:e55712.
Storey JD, Tibshirani R, 2003. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100:9440–9445.
Sturm RA, 2009. Molecular genetics of human pigmentation diversity. Human molecular genetics 18:R9–R17.
Tetens J, Widmann P, Kuhn C, Thaller G, 2013. A genome-wide association study indicates LCORL/NCAPG as a candidate locus for withers height in German Warmblood horses. Anim Genet 44:467–71.
The Bovine HapMap Consortium, 2009. Genome-wide survey of snp variation uncovers the genetic structure of cattle breeds. Science 324:528–532.
Tonevitsky AG, Maltseva DV, Abbasi A, Samatov TR, Sakharov DA, Shkurnikov MU, Lebedev AE, Galatenko VV, Grigoriev AI, Northoff H, 2013. Dynamically regulated mirna-mrna networks revealed by exercise. BMC physiology 13:9.
Utsunomiya YT, Perez O’Brien AM, Sonstegard TS, Van Tassell CP, do Carmo AS, Mészáros GA, Sölkner J, Garcia JAF, 2013. Detecting loci under recent positive selection in dairy and beef cattle by combining different genome-wide scan methods. PLoS ONE 8:e64280.
Viitala S, Szyda J, Blott S, Schulman N, Lidauer M, Mäki-Tanila A, Georges M, Vilkki J, 2006. The role of the bovine growth hormone receptor and prolactin receptor genes in milk, fat and protein production in finnish ayrshire dairy cattle. Genetics 173:2151–2164.
Voight BF, Kudaravalli S, Wen X, Pritchard JK, 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72.
Watterson G, 1975. On the number of segregating sites in genetical models without recombination. Theor Popul Biol 7:256–276.
Weikard R, Altmaier E, Suhre K, Weinberger KM, Hammon HM, Albrecht E, Setoguchi K, Takasuga A, Kuhn C, 2010. Metabolomic profiles indicate distinct physiological pathways affected by two loci with major divergent effect on Bos taurus growth and lipid deposition. Physiol Genomics 42A:79–88.
Wijesena HR, Schmutz SM, 2015. A missense mutation in slc45a2 is associated with albinism in several small long haired dog breeds. Journal of Heredity p. esv008.
Xu L, Bickhart DM, Cole JB, Schroeder SG, Song J, Tassell CP, Sonstegard TS, Liu GE, 2015. Genomic signatures reveal new evidences for selection of important traits in domestic cattle. Mol Biol Evol 32:711–25.
Xu X, Dong GX, Hu XS, Miao L, Zhang XL, Zhang DL, Yang HD, Zhang TY, Zou ZT, Zhang TT, et al., 2013. The genetic basis of white tigers. Current Biology 23:1031–1035.
Yang J, Lee SH, Goddard ME, Visscher PM, 2011. Gcta: A tool for genome-wide complex trait analysis. The American Journal of Human Genetics 88:76–82.
Yoshida CA, Yamamoto H, Fujita T, Furuichi T, Ito K, Inoue K, Yamana K, Zanma A, Takada K, Ito Y, et al., 2004. Runx2 and Runx3 are essential for chondrocyte maturation, and Runx2 regulates limb growth through induction of Indian hedgehog. Genes Dev 18:952–63.
[Figure 1 diagram labels: Angus N=214; Fleckvieh N=411; Holstein N=378; Jersey N=193; ancestral populations N=137,775 and N=73,042; time points: present, 500 years, 123,465 years]
Figure 1: Joint demography of the four cattle breeds, estimated from our
data using the approach of Excoffier et al. (2013). Population sizes correspond to
the number of haploid individuals. Parameter values correspond to the EM iteration
with the highest likelihood, but similar values were obtained from the second and
third best EM iterations.
Figure 2: Comparison between HMM and CLR results. Proportion of Fleckvieh CLR p-values (obtained from Qanbari et al. (2014)) in sweep windows vs other
windows in the genome. On the left panel, the sweep windows considered were
those detected in Fleckvieh, independently of what happened in other breeds. On
the right panel, the sweep windows considered were those detected only in Holstein.
[Figure 2 panels: "overlap with Fleckvieh sweeps" and "overlap with Holstein sweeps"; x-axis: p-value; y-axis: enrichment in sel. sweeps (log2 scale)]
Figure 3: Allele frequencies in the sweep region at the polled locus. For
SNPs where the ancestral allele is known (in red), the frequency is that of the derived
allele. For other SNPs (in black) the frequency is that of the minor allele (among
all breeds). Vertical red bars delimit the union of detected regions among the four
breeds.
Figure 4: Population tree estimated on the 1000 bull genomes data. Branch
length is measured in units of drift (≈ t/2N, where t is the time in generations and
N the effective population size).
Figure 5: Influence of NGS on hapFLK detection power. PP plot of the
hapFLK test applied to SNPs of the Illumina BovineHD SNP chip (left) or to all
sites of the 1000 bull genomes project (right)
[Figure 6 panels: "Sweep in 1 population", "Sweep in 2 populations", "Sweep in 3 populations", "Sweep in 4 populations"; x-axis: p-value; y-axis: enrichment in sel. sweeps (log2 scale)]
Figure 6: Distribution of hapFLK within hard sweep regions. Comparison
of the distribution of hapFLK p-values in hard sweep regions identified within populations and in the rest of the genome. The plot shows the ratio of the p-values
distribution in hard sweeps detected in 1 to 4 populations to their distribution on
the part of the genome where no hard sweep is detected. y-axis is on a log2 scale
and x-axis on a log10 scale.
[Figure 7 panels: hapFLK and FLK, −log10(p-value) against position (Kbp, 24800 to 25200), with genes RPS20, PLAG1 and CHCHD7 marked; heterozygosity (0.0 to 0.5) against position (Kbp) for Jersey, Fleckvieh, Holstein and Angus]
Figure 7: Selection signature around PLAG1. HapFLK (top) and FLK (middle) p-values (log10 scale) for the selection signature around PLAG1, and local
heterozygosity in the four breeds (bottom) for the same region. On the top and
middle panels, genes are indicated by purple full rectangles, and red full triangles
correspond to the candidate QTNs of Karim et al. (2011).
Figure 8: Selection signature around MC1R. HapFLK (gray) and FLK (black)
p-values (log10 scale) for the selection signature around MC1R (purple vertical line).
Red triangles highlight the two known causal mutations for red and black coat color
in cattle.
Figure 9: Allele frequencies in the sweep region including SLC45A2 and
RXFP3. For SNPs where the ancestral allele is known (in red), the frequency is
that of the derived allele. For other SNPs (in black) the frequency is that of the
minor allele (among all breeds).
chr   start (Mb)   end (Mb)   Genes
1     1.781        1.818      lincRNA, Polled locus (Allais-Bonnet et al., 2013)
1     107.452      107.557    PPM1L
1     107.571      107.749    ARL14
5     68.675       68.751     SLC41A2
7     4.574        4.745      FKBP8, ELL, ISYNA1, SSBP4, LRRC25, GDF15
10    59.148       59.338     CYP19A1
16    44.672       44.956     CLSTN1, PIK3CD, TMEM201, SLC25A33
16    45.644       45.903     RERE, SLC45A1
Table 1: Sweep regions shared among all breeds.
       Hard sweep region               hapFLK
BTA    start    end      population    start    end      p-value        Candidate genes
6      71.439   71.558   F             70.332   71.607   1.2 × 10^-12   KIT
18     14.755   14.963   F             14.305   14.872   2.1 × 10^-8    MC1R
14     24.805   25.076   H,A           24.937   25.070   2.2 × 10^-7    PLAG1
10     5.736    5.843    A             5.736    5.782    4.9 × 10^-6    intergenic
13     64.149   64.197   A             63.879   64.546   2.0 × 10^-5    ASIP
20     22.923   23.203   J             23.110   23.125   4.2 × 10^-5    ANKRD55
7      43.436   43.542   A,J           43.473   43.474   7.0 × 10^-5    OR cluster
24     14.006   14.040   F             14.021   14.023   8.2 × 10^-5    intergenic
Table 2: Hard sweep selection signatures associated with significant differentiation signals. Signatures are ordered by increasing hapFLK p-value (i.e. decreasing significance). Region
coordinates are expressed in Megabases on assembly UMD 3.1. Population abbreviations: A, Angus; F, Fleckvieh; H, Holstein; J, Jersey.
chr   start (Mb)   end (Mb)   sel pop   nb mut   pval FLK     sel freq   genes
6     71.440       71.560     F         2        10^-5        0.92       intergenic
7     25.430       26.000     J         8        3 × 10^-5    0.77       CHSY3, KIAA1024L, ADAMTS19
7     26.280       27.060     J         16       10^-5        0.83       SLC27A6, FBN2, SLC12A2
14    24.810       25.080     H,A       7        3 × 10^-5    0.04       LYN, RPS20, PLAG1, CHCHD7, ENSBTAG00000039031, MOS
18    14.760       14.960     F         3        10^-5        0.95       TUBB6, TUBB3, DEF8, LOC532875, DBNDD1, GAS8, SHCBP1, MC1R
20    24.230       24.740     J         1        3 × 10^-5    0.77       LOC530348, SNX18, COX8A, LOC783202, HSPB3
20    25.030       25.390     J         21       2 × 10^-5    0.80       intergenic
20    26.640       27.230     J         2        3 × 10^-5    0.77       intergenic
20    39.830       39.970     J         1        10^-5        0.80       SLC45A2, ADAMTS12, RXFP3
22    35.680       35.790     J         1        3 × 10^-5    0.77       intergenic
22    35.960       36.030     J         2        5 × 10^-5    0.90       MAGI1
27    4.140        4.230      J         1        10^-4        0.90       intergenic
Table 3: Private hard sweep regions including candidate mutations. Horizontal lines are used to group closely related sweep windows, which might result
from the same selection event.