Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Genetics: Early Online, published on March 26, 2016 as 10.1534/genetics.115.181594 1 Uncovering adaptation from sequence data: lessons from genome 2 resequencing of four cattle breeds 3 Simon Boitard1,2 , Mekki Boussaha1 , Aurélien Capitan1,3 , Dominique Rocha1 , and Bertrand 4 Servin4 1 5 6 2 Institut de Systématique, Évolution, Biodiversité ISYEB - UMR 7205 - CNRS & MNHN & UPMC & EPHE, Ecole Pratique des Hautes Etudes, Sorbonne Universités, Paris, 75005, France 7 3 8 9 GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, 78350, France 4 Union Nationale des Coopératives d’Elevage et d’Insémination Animale, Paris, 75595, France GenPhySE, Université de Toulouse, INRA, INPT, INP-ENVT, Castanet-Tolosan, 31326, France 1 Copyright 2016. 10 Running title: Detecting adaptation in cattle using resequencing data 11 Keywords: FST, domestication, linkage disequilibrium, next generation sequencing, selective sweeps. 12 Corresponding author: Simon Boitard 13 Mailing adress: INRA GABI, Domaine de Vilvert, 78352 Jouy-en-Josas, France. 14 Phone: 0033 1 40 79 81 66 15 E-mail: [email protected] 2 Abstract 16 17 Detecting the molecular basis of adaptation is one of the major questions in population genetics. 18 With the advance in sequencing technologies, nearly complete interrogation of genome-wide poly- 19 morphisms in multiple populations is becoming feasible in some species, with the expectation that it 20 will extend quickly to new ones. Here, we investigate the advantages of sequencing for the detection 21 of adaptive loci in multiple populations, exploiting a recently published dataset in cattle (Bos tau- 22 rus). We used two different approaches to detect statistically significant signals of positive selection: a 23 within-population approach aimed at identifying hard selective sweeps and a population-differentiation 24 approach that can capture other selection events such as soft or incomplete sweeps. We show that 25 the two methods are complimentary in that they indeed capture different kind of selection signatures. 26 Our study confirmed some of the well known adaptive loci in cattle (e.g. MC1R, KIT, GHR, PLAG1, 27 NCAPG/LCORL) and detected some new ones (e.g. ARL15, PRLR, CYP19A1, PPM1L). Compared 28 to genome scans based on medium or high-density SNP data, we found that sequencing offered an 29 increased detection power and a higher resolution in the localization of selection signatures. In sev- 30 eral cases, we could even pinpoint the underlying causal adaptive mutation, or at least a very small 31 number of possible candidates (e.g. MC1R, PLAG1). Our results on these candidates suggest that a 32 vast majority of adaptive mutations are likely to be regulatory rather than protein coding variants. 33 Introduction 34 Detecting the molecular basis of adaptation in natural species is one of the major questions in popula- 35 tion genetics. With the spectacular progress of genotyping and sequencing technologies, genome-wide 36 scans for positive selection have been performed in multiple species and populations within the last 37 decade. Livestock species provide a considerable ressource for these selection scans, because they have 38 been subjected to strong artificial selection since their initial domestication, leading to a large variety 39 of breeds with disctinct morphology, coat color or specialized production. Besides, the economic value 40 of these species and the need to improve them has motivated the developement of standardized single 41 nucleotide polymorphism (SNP) chips and the genotyping of millions of animals using these chips, 42 providing considerable data for population genetics analyzes. For instance, in taurine cattle, at least 43 21 genomic scans for selection have already been published and were reviewed in Gutierrez-Gil et al. 44 (2015). Numerous genomic scans for selection have also been published in other livestock species, see 45 Gouveia et al. (2014) for a review. 46 The regions detected by these studies are generally convincing, because they contain interesting 47 positional and functionnal candidate genes (e.g. Fariello et al. (2014)) and / or are statistically 48 enriched with genes from regulation pathways related to production traits (e.g. Flori et al. (2009)). 49 Nevertheless, these regions often span several Megabases and typically include up to tens of genes, so 50 determining the exact gene(s) under selection in each region, and even more the causal mutation(s), 51 remains difficult from these studies. This might change with the recent advent of next generation 52 sequencing technologies, which allows to characterize a very large proportion of the variants in the 53 genome of a species. However, although several examples of genomic scans for selection based on large 54 sequencing samples are already available in livestock species such as pig (Rubin et al., 2012; Li et al., 3 55 2013), cattle (Qanbari et al., 2014) or chicken (Rubin et al., 2010; Roux et al., 2015), the advantage 56 of using sequencing data for detecting selection signatures has still not been widely discussed. 57 Another interesting question is the small overlap observed between studies, even when these studies 58 focus on similar populations (Biswas and Akey, 2006; Qanbari et al., 2011). This can largely be 59 explained by the fact that many different detection methods have been applied, whose sensitivity 60 depends on the type of selection event: recent or old, complete or ongoing, from a new variant (i.e. 61 “hard”) or standing variation (i.e. “soft”) etc. These differences between methods have been described 62 by several studies (Biswas and Akey, 2006; Sabeti et al., 2006; Gouveia et al., 2014). However, the 63 pratical implications of these differences when comparing the regions detected by different studies, or 64 by different approaches within a same study, are rarely discussed. 65 Here we detected selection signatures in 4 European taurine cattle breeds (Angus, Fleckvieh, 66 Holstein and Jersey), using large samples of sequencing data which have been recently published by 67 the 1000 bull genomes project (Daetwyler et al., 2014). We used two different statistical approaches: 68 a within-breed approach detecting genomic regions with low genetic diversity (Boitard et al., 2009), 69 and a between population approach detecting genomic regions with large allele (Bonhomme et al., 70 2010) or haplotype (Fariello et al., 2013) frequency differences between breeds. These two analyzes 71 provided regions whose genomic features are significantly different from what would be expected 72 under neutrality, even when accounting for the effects of past population size changes, population 73 structure and gene flow on the four breeds of our study. We show that applying the above methods 74 to sequencing data improves the detection power of selection signatures and reduces considerably the 75 length of detected regions. In some particular situations, it even leads to identify the exact mutation 76 under selection. We also provide a detailed characterization of the regions that are detected only 77 by the within-population approach (in one or several populations), only by the between population 78 approach, or jointly by the two approaches. 79 Results 80 Genomic regions with low within-population diversity: hard sweeps signa- 81 tures 82 We looked for hard sweep signatures within each breed using the method of Boitard et al. (2009), as 83 described in the Methods. This method detects regions showing an excess of low and high frequency 84 derived alleles compared to the rest of the genome. Although this pattern is typically expected under 85 a hard sweep scenario, i.e. when a new mutation appears in the population and goes to fixation 86 due to positive selection, it may also arise from purely demographic events, in particular bottlenecks. 87 In order to test if our analysis could be influenced by such false positive signals, we first simulated 88 genomic samples under two different neutral demographic models allowing to reproduce the genetic 89 diversity of the breeds under study, and applied the method of Boitard et al. (2009) to these samples. 90 Analysis of neutral samples In the first demographic model, we considered the joint history 91 of the four breeds and accounted for several important features of this history: (i) the shared ancestry 4 92 of the four breeds, which diverged recently from an ancestral pool of european domestic animals, (ii) 93 the population size differences between breeds since their divergence, and (iii) the possible existence 94 of gene flow between breeds. Moreover, because recent studies suggested that effective population size 95 in taurine cattle strongly declined since domestication (MacLeod et al., 2013; Boitard et al., 2016), 96 we allowed one population size change in the ancestral population. We estimated the parameters 97 of this model from the joint allele frequency spectra observed in our data for all breed pairs, using 98 the approach of Excoffier et al. (2013) (see Methods for more details), and obtained the demography 99 shown in Figure 1. 100 In this estimated demography, the ancestral population size change was not a decline related to 101 domestication, but an older expansion occuring in the wild population, approximately 120,000 years 102 before present. However, a very strong population decline was found at the time where the four 103 breeds diverged, from an order of 100,000 individuals to an order of 100 individuals. Interestingly, 104 the estimated divergence time (500 years before present) was consistent with a geographic isolation 105 process starting a few hundred years before the strict separation of these populations, induced by 106 the creation of modern breeds (Felius et al., 2011). Besides, the order of magnitude of estimated 107 recent effective sizes (100) and the ranking of breeds according to these sizes were consistent with 108 previous studies (Leroy et al., 2013; The Bovine HapMap Consortium, 2009; MacLeod et al., 2013; 109 Boitard et al., 2016). Thus, this simple model seemed to provide a reasonable approximation of the 110 demography of the four breeds under study. When genomic samples were simulated from this model, 111 the average number of sweeps detected per genome was equal to 0 in Fleckvieh and Holstein, 3.75 (± 112 1.93) in Angus and 17.25 (± 4.15) in Jersey. 113 We performed the same test using a second demographic model, which was estimated in another 114 study (Boitard et al., 2016) based on the same dataset as that considered here. This model treats each 115 breed independently of the others, so it does not account for shared ancestry or gene flow. However, 116 it accounts for the variations of population size over time more accurately than the previous model, 117 because population size in each breed is modelled as a stepwise process with 21 time windows (Figure 118 6 in Boitard et al. (2016)). The population size within each time window is estimated from the allele 119 frequency and linkage disequilibrium patterns observed in the breed, using an approximate Bayesian 120 computation approach. When genomic samples were simulated from this second model, the average 121 number of sweeps detected per genome was even lower than with the first model: 0 in Fleckvieh, 122 Holstein and Angus, and 0.75 (± 0.87) in Jersey. 123 Overall, these results indicate that the number of false hard sweep signals detected by the method 124 of Boitard et al. (2009) should be negligible in Angus, Fleckvieh and Holstein, and relatively small in 125 Jersey, even when accouting for the demography of these breeds. 126 Overview of the detected signals When analyzing the cattle data with the same approach, 127 we detected 1,057 hard sweep signals: 226, 384, 316 and 131 in Holstein, Angus, Jersey and Fleckvieh 128 respectively. According to the simulation results presented above, the false discovery rate associated 129 with this analysis should be below 7% in Jersey (21.4/316), and close to 0 in the other breeds. The 130 size of detected regions ranged from 8.2kb to 948kb, with a median of 78.7kb. Some signals were 131 overlapping between breeds so that after merging them we obtained 798 sweeps that were unique to 5 132 one of the breeds (159, 297, 249 and 93 respectively) and 118 that were shared between at least two 133 breeds. Overall this provided 916 regions covering about 4.3% of the autosomal genome. Among these 134 916 regions, 450 included no (protein coding) gene, 268 included a single gene, 154 included between 135 2 and 5 genes, and 44 included more than 5 genes, with a maximum of 19 genes. Overall, 1088 genes 136 were included in sweeps windows, which represents about 5.7% of all annotated genes in the bovine 137 genome, so there was a slight enrichment of protein coding genes within sweep regions. The list of all 138 detected regions and of genes included in these regions is given in File S1. 139 In a recent genome scan for selection focusing on the Fleckvieh breed (Qanbari et al., 2014), the 140 43 Fleckvieh sequences considered in our study were analyzed using the CLR method (Nielsen et al., 141 2005) and the iHS method (Voight et al., 2006). Seventy-three hard sweep signals were found with the 142 former approach and 67 with the latter. Since the HMM approach used in this study aims at capturing 143 the same allele frequency patterns than the CLR method, we checked if our results in Fleckvieh were 144 consistent with those in Qanbari et al. (2014). To this end, we compared the distributions of CLR 145 p-values within regions associated with selective sweeps to their distribution on the rest of the genome. 146 Figure 2 (left) plots the ratio of the two densities (on a log2 scale) for increasing levels of significance 147 of the CLR test. It shows a very strong enrichment of low CLR p-values in the sweep windows we 148 detected in Fleckvieh, compared to the rest of the genome. We performed the same analysis with iHS 149 p-values and found a similarly strong enrichment (Figure S1), which can be explained by the fact that 150 iHS also tries to detect hard sweep patterns, even if the information used (the length of haplotypes) 151 is different. 152 We also observed an enrichment, albeit of lower intensity, of CLR and iHS low p-values in the 153 sweep windows detected in other breeds than the Fleckvieh. For example, Figure 2 (right) shows the 154 enrichment in low Fleckvieh CLR p-values in selective sweep regions detected only in the Holstein 155 breed. A similar trend was observed in selective sweep regions specific to the Jersey and Angus breeds 156 (not shown). Hence some of the hard sweeps detected in one breed also have probably taken place 157 in the other breeds, but to a slightly lower extent that did not lead to a significant signal. These 158 signatures must be related to favorable alleles that either started to increase in frequency before the 159 divergence of the breeds or were selected in parallel in different breeds. However, selection signatures 160 that are specific to one breed are interesting because they illustrate the importance of this breed for 161 cattle functional diversity (Gutierrez-Gil et al., 2015). We therefore derived a way of finding clear 162 breed-specific sweeps as follows. 163 Hard sweep regions specific to one population The HMM approach “tags” a region as 164 selected by reconstructing a hidden state at each position of the genome. Based on the observation 165 above, we suspected that some regions were not tagged as selected, but might still have a non negligible 166 probability of being adaptive under the HMM model, explaining the enrichment patterns observed 167 in Figure 2. To investigate this possibility, we derived a statistic (TOR ) quantifying the strength of 168 evidence for selection in a breed, measured as the log odds ratio of selection over neutrality in a region 169 (see methods for details). As expected, the TOR in one breed showed clearly different distributions in 170 regions tagged as selected in this breed and in regions where no selection was detected in any breed 171 (Figure S2). In regions where selection was detected in another breed, the distribution of TOR was 6 172 skewed toward higher values compared to clearly neutral regions (Figure S2). We exploited this to call 173 breed-specific sweeps regions where TOR was unambiguously consistent with the neutral density in all 174 other breeds (see methods for details). Fifty-five breed-specific regions were detected, and we could 175 check that this time the sweeps specific to Holstein did not show any enrichment in low Fleckvieh 176 CLR p-values (Figure S3). The 12 sweeps exhibiting the most contrasted patterns for TOR are listed 177 in Table S1. 178 Hard sweep regions shared by all populations We also looked for sweep signals shared 179 by all breeds, as they might correspond to older selection events, anterior to the divergence of the 4 180 breeds considered here and possibly related to initial cattle domestication. We found only one region 181 with a sweep detected in all 4 breeds, but we also considered regions where a sweep was detected in 182 3 breeds and where allele frequencies in the fourth one were almost consistent with a sweep. This 183 provided 8 candidate regions, 4 of which include a single gene (Table 1). 184 Several of these genes represent natural selection targets in cattle, as they are related to husbandry, 185 metabolism or fertility. On chromosome 1, we found evidence for selection in a region 10Kb upstream 186 the OLIG1 gene, encompassing a lincRNA, orthologous to the Human gene LINC00945, which ex- 187 pression has been shown to be associated with polledness in Holstein and Fleckvieh (Allais-Bonnet 188 et al., 2013). PPM1L is a protein phosphatase that has been shown to be involved in the response to 189 exercice in humans (Tonevitsky et al., 2013). SLC25A33, located in the middle of one of the shared 190 sweeps windows, encodes for a mitochondrial pyrimidine nucleotide transporters and is essential for 191 mitochondrial DNA and RNA metabolism in humans (Di Noia et al., 2014). CYP19A1 encodes the 192 key enzyme for estrogen biosynthesis. Many studies have documented its role during the development 193 of bovine follicles, and it has been found more abundant in bovine cells of twinners versus controls 194 (Echternkamp et al., 2012). To illustrate the allele frequency patterns observed in such regions, allele 195 frequencies in the polled locus region are provided in Figure 3. 196 Genomic regions with large genetic differentiation between breeds. 197 We applied two approaches to detect genome regions that exhibit outlying divergence in single site 198 (Bonhomme et al., 2010) or haplotype (Fariello et al., 2013) frequencies between populations. The 199 first step in these two approaches is to estimate a population tree summarizing the neutral history 200 of the populations under study. For our dataset, this population tree (Figure 4) was approximately 201 star shaped, (i.e. breeds essentially evolved in parallel from an ancestral population), although we 202 estimated a small shared history between the Fleckvieh and Jersey populations. Fixation indices, 203 represented by the branch length from the root to the tips of the tree, had similar values in all 204 populations, the Jersey’s one being slightly higher. The genome-wide level of differentiation between 205 populations was rather large, with between population FST values ranging from 0.22 to 0.4. 206 When genetic drift is large, single marker tests are expected to have low power because even large 207 allele frequency differences can be explained by drift alone. This was indeed the case here for the single 208 marker differentiation analysis (FLK): the smallest observed p-value was 6.10−7 , which corresponded 209 to an FDR of about 10% when applying the approach of Storey and Tibshirani (2003). Although it 210 is not clear how to correct p-values for correlation between markers in such a setting (genome-wide 7 211 differentiation based tests), even this smallest p-value did not provide clear evidence of selection. This 212 does not mean that there is no selection in these data, only that genuine selection signatures cannot be 213 discriminated from background noise provoked by drift when looking at sites independently. However, 214 as illustrated later in this study, given a region where a selection signature has been found, FLK can 215 help in identifying the mutations that have likely been under selection. 216 The hapFLK method (Fariello et al., 2013) is similar to FLK, but it incorporates LD information 217 through the exploitation of a multilocus LD model (Scheet and Stephens, 2006). Because hapFLK 218 combines information across multiple sites, it has been shown to have better detection power than 219 single site statistics (Fariello et al., 2013, 2014). This was confirmed when applied to this dataset, 220 as we could confidently find 67 significant regions using a FDR threshold of 15%. hapFLK has been 221 shown to be robust to bottlenecks and to a certain extent to gene flow (Fariello et al., 2013). We 222 confirmed this by simulating haplotypes under the demographic model estimated by the approach of 223 Excoffier et al. (2013) (Figure 1) and calculating hapFLK on the simulated data (see Methods). While 224 the fixation indices computed from the simulated samples were very close to the ones computed from 225 our data (Table S2), hapFLK p-values obtained from simulated samples did not lead to any signal 226 called significant with the Storey and Tibshirani (2003) approach (Figure S4). 227 Ten of the significant regions detected by hapFLK likely resulted from assembly errors and were 228 thus not considered in the rest of our analysis (Table S3). The cumulated length of the remaining 57 229 regions, listed and annotated in table S4, was 9.1 Megabases, 0.36 % of the total autosome length. 230 Detected regions spanned from a few hundred basepairs to more than one megabase with a median 231 length of about 20 kilobases (Figure S5). The median size of detected regions was thus considerably 232 smaller than that obtained in previous studies where hapFLK was applied to 60K data in sheep 233 (Fariello et al., 2014; Kijas, 2014). Nineteen of the regions encompassed at least one gene while 38 234 contained no gene. In total, 82 genes were included in hapFLK regions, which represents about 0.4% 235 of the total number of genes in the genome, corresponding to a small enrichment in protein coding 236 genes in hapFLK signatures. 237 To investigate whether sequencing improves detection power with hapFLK, we thinned the dataset 238 by considering only SNPs present on the Illumina BovineHD BeadChip. We found (Figure 5) that the 239 excess of small p-values was much larger when applying hapFLK to all sites identified in the 1000 bull 240 genomes project, than when applying it only to the SNPs that are included in the SNP chip. Note 241 that this was not the case with FLK, where detection power was low with both sequencing or SNP 242 chip data due to the amount of drift, as already discussed above (Figure S6). 243 Hard sweep regions showing a strong differentiation signal 244 Eight selection signatures were found with both the HMM and the hapFLK analyzes, pointing out 245 regions that combine a low diversity within at least one breed and a strong haplotypic differentiation 246 beween breeds (Table 2). Although this is significantly more than expected if the two kind of signals 247 were independent (p = 1.4 10−4 ), it is somewhat surprising to observe that many hard sweeps were 248 not detected with hapFLK. 249 A prevalent reason for this is the large genome-wide level of differentiation between the four breeds, 8 250 which reduces power of differentiation based tests (Fariello et al., 2013). Indeed, hard sweep regions 251 detected in one or two populations showed a clear enrichment in low hapFLK p-values (Figure 6). 252 This indicates that many hard sweeps exhibit a mild differentiation signal, although the power to 253 detect them with hapFLK is not sufficient. In addition, some hard sweep regions did not show any 254 differentiation signal, because the same haplotype was fixed or at least increased in frequency in all 255 populations. This is typically the case of hard sweep regions detected in three or four populations, 256 for which there was a depletion of low hapFLK p-values (Figure 6). This may also concern regions 257 where an hard sweep was detected in one or two populations, but where the swept haplotype was also 258 at quite high frequency in other populations, as already discussed above. 259 Among the regions with evidence for both a hard-sweep and an extreme differentiation signature, 260 the top three corresponded to genes and mutations of known phenotypic effects that recapitulate the 261 most obvious phenotypic divergence of the four breeds in this dataset. The most differentiated region 262 corresponded to the KIT gene, which has been shown to be associated with white spotting patterns 263 in the Holstein (Hayes et al., 2010) and the Fleckvieh (Qanbari et al., 2014), while the Jersey and 264 the Angus are non spotted breeds. The next most differentiated region harbored the MC1R gene, 265 for which previous studies have identified two causal polymorphisms for coat color (Klungland et al., 266 1995). Finally, the third signature was a small genomic region comprising the PLAG1 gene (Figure 267 7, top). Hard sweeps identified in this region indicate a past selection event affecting the Angus and 268 the Holstein breeds, whereas no selection was evidenced in the Fleckvieh and the Jersey breeds (Table 269 2), as can also be seen from heterozygosity patterns in the region (Figure 7, bottom). The region 270 surrounding PLAG1 was previously demonstrated to harbour a QTL for calving ease in the Fleckvieh 271 (Pausch et al., 2011) and one for stature in a Holstein × Jersey cross (Karim et al., 2011). Our 272 results are consistent with these studies and suggest that the allele favoring high stature was selected 273 in Holstein and Angus, but not in Jersey and Fleckvieh. 274 The other common regions include the ASIP gene, which plays a crucial role in adipocyte devel- 275 opment and seems to be expressed in a wide set of tissues in different cattle breeds (Albrecht et al., 276 2012) and the ANKRD55 region that is strongly associated with autoimmune disorders in humans (in 277 particular multiple sclerosis (Stahl et al., 2010)) and type 2 diabetes (Morris et al., 2012). 278 hapFLK signatures of soft or incomplete sweeps 279 Apart from the 8 signatures above, none of the other hapFLK signatures matched hard sweeps detected 280 by the HMM approach. These signatures most likely resulted from incomplete sweeps for which the 281 selected allele did not reach fixation, or soft sweeps where selection targeted an allele that was already 282 at intermediate frequency in the population. Indeed, such signals do not lead to the skewed allele 283 frequency patterns that are looked for by the approach of Boitard et al. (2009). This hypothesis 284 was confirmed by examining haplotype diversity patterns in hapFLK signatures, which typically show 285 haplotypes of large but not fixed frequency in at least one population (e.g. Figure S7 for region 1 in 286 Table S4). 287 Two signatures are located close to the homologous region of a human promoter region, near 288 ROBO1 on one hand and the prolactin receptor gene PRLR on the other, hinting that the causal 9 289 mutation is likely regulatory in nature. ROBO1 has been shown to be involved in early follicular 290 development in sheep (Dickinson and Duncan, 2010; Dickinson et al., 2010), while the prolactin and its 291 receptor are involved in a large range of biological functions (development, metabolism, immunology, 292 reproduction ...) (Bole-Feysot et al., 1998). 293 The PRLR gene lies ≈ 6 Mb to the growth hormone receptor (GHR) gene, itself close (≈ 140Kb) to 294 another hapFLK signature. It has been evidenced that two QTLs affecting milk, fat and protein yield 295 segregate near these two genes in a Finnish dairy cattle population (Viitala et al., 2006). Our results 296 suggest that these QTLs might be in regulatory regions of these genes and that they have responded 297 to selection in populations from the 1000 bull genomes. We found another selection signature, in 298 Jersey, within a gene potentially involved in milk production (Figure S8), ARL15. Seven highly 299 differentiated variants (FLK p-value < 10−5 ) were located in an ARL15 intron. ARL15 is a protein of 300 unknown function that has been shown to be strongly associated with adiponectin levels in humans 301 (Richards et al., 2009). Adiponectin is an hormone involved in glucose metabolism, low concentration 302 of adiponectin being associated with insulin resistance. In dairy cows, insulin resistance is maintained 303 in early lactation, favoring mammary glucose uptake. Giesy et al. (2012) showed that the reduction 304 of plasma adiponectin concentration in early lactating cows was not associated to changes in the 305 adiponectin expression itself. Our results could suggest a potential role of ARL15 in this process, with 306 a particular adaptation of the Jersey dairy cattle at this gene. 307 Apart from the PLAG1 signature, several other include genes involved in morphology and growth: 308 RUNX3 (Yoshida et al., 2004; Soung do et al., 2007), STARD3NL (Rivadeneira et al., 2009) and 309 RASSF2 (Song et al., 2012) are involved in bone development, NCAPG and/or LCORL match a large 310 QTL for many growth traits in cattle, horse and sheep (Eberlein et al., 2009; Weikard et al., 2010; 311 Lindholm-Perry et al., 2011; Signer-Hasler et al., 2012; Makvandi-Nejad et al., 2012; Bongiorni et al., 312 2012; Metzger et al., 2013; Tetens et al., 2013; Lindholm-Perry et al., 2013; Kijas, 2014; Xu et al., 313 2015; Randhawa et al., 2015; Sahana et al., 2015) and CTNNBL1 (Liu et al., 2008) is associated with 314 obesity traits in humans. 315 Identifying causal adaptive polymorphisms 316 As demonstrated above, genomic scans for selection based on sequencing data have a higher detection 317 power than those based on genotyping chip data, and locate the selection signatures with higher 318 precision. This is an expected outcome of the higher marker density, and the same could be said when 319 comparing high density to medium density chip data. But a more fundamental difference between 320 dense SNP chip data and sequencing data is that, with the latter, one can reasonably expect the causal 321 polymorphism under selection to be included in the observed data. Clearly, not all selection signatures 322 can be related to a single polymorphism, and even in this case this polymorphism might be absent in 323 our data due to insufficient coverage or remaining calling issues. Still, the favorable situation where 324 a single polymorphism leads to a selective advantage and is present in the data should also occur. 325 One natural question is thus to determine if such variants can be identified only from genetic diversity 326 patterns. We show below that this is indeed possible. 10 327 HapFLK signals 328 Haplotype frequency differences detected by hapFLK typically result from the increase in frequency of 329 one particular allele in a population due to positive selection, which implied the increase in frequency 330 of one or several haplotypes carrying this allele by genetic hitchhiking. Thus, in regions detected by 331 hapFLK, the causal polymorphism under selection should be the one with the largest allele frequency 332 differentiation, and a natural strategy to detect this polymorphism is to look at the variants with the 333 largest FLK value. Two of the regions identified by hapFLK validate this strategy. 334 Within the MC1R selection signature, we found three polymorphisms with clear outlying FLK val- 335 ues (Figure 8), and two of these corresponded to the known causal mutations mentionned previously: 336 a single base mutation at position 14757910 (rs109688013), responsible for the black pigmentation in 337 Holstein and Angus breeds, and a single base deletion at position 14757924 (rs110710422), responsible 338 for the red pigmentation in the Fleckvieh breed (Klungland et al., 1995). The third outlying polymor- 339 phism (rs110494166) in the region was located at position 14678403, within an intron of the nearby 340 FANCA gene, and exhibited the same allele frequencies as rs110710422. 341 In the PLAG1 region, the causal mutation was shown to be one of 8 candidate QTNs (Karim 342 et al., 2011). We found 7 polymorphisms exhibiting a high level of differentiation between breeds 343 in this region based on the FLK statistic (p-value below 10−4 ) (Figure 7 middle, Table S5). Out 344 of these 7 candidates, 6 were common with the candidate QTNs listed in the Table 2 of Karim 345 et al. (2011). rs134215421 and rs109815800 showed particularly extreme allele frequency differences 346 between Holstein and Angus on the one hand and Jersey and Simmental on the other hand. These 347 two mutations are located at the 3’ end of the PLAG1 gene, rs134215421 being 1Kb downstream and 348 rs109815800 within and intron of the PLAG1 gene. One SNP (rs134029466) was not identified as a 349 potential QTN in Karim et al. but showed a strong signal for differentiation. Two of the mutations in 350 Karim et al., rs209821678 and rs210030313, which are considered by the authors as the most serious 351 candidates because they affect a highly conserved element, were not available in the 1000 bull genomes 352 project. One is a VNTR, a kind of polymorphism that is hardly callable from short-read sequence 353 data and the other one is a SNP that lies 44 base pair from the VNTR, exhibited very low quality 354 scores and was therefore not called in the 1000 bull genomes data. While these polymorphisms cannot 355 be considered as disqualified based on our study, their positions, highlighted in Figure 7, lie in a region 356 where the differentiation signal was not significantly elevated. 357 Hard sweep signals 358 Hard sweep signals are expected when an allele goes from very low frequency to almost fixation in a 359 population due to positive selection. In these regions, detecting the causal selected variant only from 360 genetic data of the swept population is impossible, because all physical positions show either extreme 361 allele frequencies or no polymorphism at all. However, if we assume that other sampled populations 362 evolved neutrally in the region, alleles that were initially at low frequency in the swept population 363 have likely remained at relatively low frequency in these other populations. This should result in high 364 genetic differentiation at and around the selected polymorphism, which again can be detected using 365 the FLK statistic. 11 366 For all hard sweep regions except the ones shared by all populations, we thus tried to identify the 367 selected site by looking for polymorphisms with a high FLK value (p ≤ 10−4 ). To ensure that the 368 detected polymorphisms could indeed be the causal ones, we further required a high allele frequency 369 (≥ 75%) in the swept population(s), and only in this(ese) one(s). We found only 12 sweep regions 370 exhibiting such causal candidates (Tables 3 and S6). Again, this small proportion is related to the 371 fact that in most cases the selected allele must actually be at relatively high frequency even in non 372 swept populations, due to undetected ongoing sweeps or just random drift. 373 The regions detected by this approach include those of KIT, PLAG1 and MC1R. In all these 374 regions, a very limited number of potentially causal polymorphisms was detected, and in the case of 375 MC1R these candidates included the true causal variants. This provides a validation of the detection 376 strategy considered here and suggests that other regions of Table 3 should also contain interesting 377 putative causal polymorphisms. 378 Of particular interest, a sweep region on chromosome 20 included a single causal candidate at 379 position 39872347 (Figures 9 and S9). This polymorphism is located downstream SLC45A2, a gene 380 that has been associated to fertility (Killeen et al., 2014) and residual feed intake (Karisa et al., 2013) 381 in cattle, and to pigmentation in several other species including human (Sturm, 2009; Stefanaki et al., 382 2013; Morice-Picard et al., 2014), dog (Wijesena and Schmutz, 2015) or tiger (Xu et al., 2013). It is 383 also located downstream RXFP3, a gene known to be involved in food intake regulation and body 384 weight in mice (Ganella et al., 2012; Smith et al., 2014) 385 Another interesting region was found on chromosome 22, where two strong candidate polymor- 386 phisms were located 30kb upstream MAGI1 (Figures S10 and S11). Although not reported in Table 3 387 due to p-values slightly greater than 10−4 , 5 other suggestive polymorphisms are located within MAGI1 388 introns. One of these, at position 36011838, is located within a highly conserved region according to 389 a multiple alignment of 36 eutherian mammals (www.ensembl.org). At this position, all considered 390 species carry allele G, which is also the most frequent allele in Angus, Fleckvieh and Holstein, while 391 allele A swept in Jersey. MAGI1 is a scaffolding protein present in tight junction of epithelial cells 392 and might be implied in nervous system functions. Adaptive selection around this gene was already 393 reported in some West African cattle breeds (Gautier et al., 2009). 394 A more complex situation was observed on chromosome 7, where 8 and 16 candidate polymorphisms 395 were found in two sweep windows distant from 300kb. Among all these candidates, the highest 396 FLK value was reached at position 26459812, in an intergenic region between SLC27A6 and FBN2. 397 Interestingly, this polymorphism was located in a very conserved region, and it was the only one in 398 this case among all candidates in the sweep window. Allele T was found in almost all species at 399 this position and was almost fixed in Angus, Fleckvieh and Holstein, while allele A swept in Jersey. 400 SLC27A6 has been shown related to fatty acid metabolism in cattle (Bionaz and Loor, 2008; Nafikov 401 et al., 2013), while FBN2 has been related to several development processes, including for instance 402 bone formation, in mice (Nistala et al., 2010). 403 Strong candidate mutations were also found in two regions without protein-coding genes on the 404 current bovine annotation, and in general we note that all candidate polymorphisms were located in 405 intergenic or regulatory regions. This implies that validating the effect of these polymorphisms will be 406 difficult, but this also outlines the potential of selective sweep studies to improve genome annotation. 12 407 Hard sweep signals shared by all breeds 408 In regions where hard sweep signals were shared by all breeds (Table 1), identifying causal poly- 409 morphisms only from our dataset is impossible. Indeed, the majority of polymorphisms have similar 410 diversity patterns, with a low minor allele frequency in all 4 breeds. Besides, positions that were 411 monomorphic in our data set are also good candidates, since they might result from the complete 412 fixation of a favorable allele in the 4 breeds. 413 In order to identify potential causal variants in these regions, one possible approach can be to use 414 data from related species, looking for highly conserved positions for which the major (or the only) 415 allele of our bovine data set is absent in other mammals. To illustrate this approach, we implemented 416 the following procedure. Based on a multiple alignment of the bovine reference sequence with 10 other 417 mammal reference sequences (Rocha et al., 2014), we selected the positions where (i) the bovine allele 418 (or the major bovine allele in the case of polymorphic positions) was distinct from the yak allele, (ii) 419 the bovine allele was unobserved among the 10 other species, and (iii) the yak allele was observed in 420 more than 7 (out of 9) other species, including at least buffalo or sheep (the two closest species). Such 421 positions represent convincing candidates for two reasons. First, alleles that are observed in other 422 mammal species must have been segregating in the bovine for a very long time and are thus very 423 unlikely to produce a hard sweep pattern, even if they became positively selected at some point in 424 the bovine history. Second, highly conserved positions are more likely functional, and thus subject to 425 selection, than less conserved ones. 426 Among the 1,382,681 positions included in the regions of Table 1, only 91 satisfied the 3 conditions 427 above. Further removing 7 positions for which the minor bovine allele was at quite high frequency, we 428 finally obtained 84 causal candidates (Table S7). All sweep regions except one exhibited convincing 429 causal variants located in coding or regulatory regions, but more work will be needed to determine 430 the exact causal variants in these regions. 431 Discussion 432 We performed in this study a genomic scan for selection in European taurine cattle, based on large 433 samples of sequencing data from four different breeds. We used two detection approaches, based 434 respectively on the genetic diversity within breeds and on the genetic diferentiation between breeds, 435 and compared the signals detected by these two approaches. 436 One important conclusion from our analysis is that sequencing data represents a great opportunity 437 in the context of genomic scans for selection. Indeed, the detection power of hapFLK was higher when 438 applied to sequencing data than when applied to high-density chip data (Figure 5). This was consistent 439 with previous studies looking for selection signatures in cattle (Ma et al., 2015) and humans (Liu et al., 440 2014), which also found, based on computer simulations, that higher power could be expected from 441 sequencing data compared to SNP chip data. The localization of selection signatures was also found 442 more accurate with sequencing data, both for hapFLK and the within population approach, as the 443 median size of detected regions (a few tens of kilobases) was considerably lower than with SNP chip 444 data. Note that the reduced size of detected regions does not mean that we detected older selection 13 445 events. It just comes from the fact that, in many regions, only the most significant part of the 446 sweep was identified. For instance, the selection events detected by hapFLK are quite recent, because 447 they must have occured more recently than breed divergence (approximately 500 years before present 448 according to our demographic analysis). Still, the region with excess differentiation, which is captured 449 by hapFLK, is generally smaller than the sweep itself, because part of the swept haplotype can be 450 shared between breeds. An example of such a signature can be seen in the ARL15 signature (Figure 451 S8). Importantly, the higher precision provided by the use of NGS data reduces the number of genes 452 included in each window, allowing potentially an easier exploration of the molecular functions driving 453 selection. 454 Since sequencing data capture a large proportion of the SNPs and small indels segregating in a 455 sample, genomic scans for selection based on such data can be expected to identify even the causal 456 polymorphism under selection in a given sweep region. We demonstrated that this is indeed the case in 457 some regions, where the few variants with the highest allele frequency differentiation between breeds, 458 as measured by the FLK statistic, included the causal variant. For instance, in the MC1R region, the 459 two mutations that have been shown to affect coat color in the breeds of our study, were included in the 460 top three FLK values. Similarly, in the PLAG1 region, a good overlap was observed between variants 461 with top FLK values and the QTNs found by Karim et al. (2011) in a GWAS analysis on stature. In 462 several other regions, variants with top FLK values also lead to promising candidate polymorphisms 463 (Tables 3 and S6), which were often located in highly conserved regions or/and in the vicinity of genes 464 with interesting metabolic functions. However, it is important to point out that identifying the selected 465 variant from FLK values will essentially be possible in population-specific hard sweep scenarios, where 466 the favorable allele has almost fixed in the selected population(s), and has remained absent or at 467 very low frequency in the other sampled populations. In all other situations, and for instance those 468 where the favorable allele is at intermediate frequency also in neutral populations, the FLK value of 469 the causal variant will be reached just by chance by many neutral variants, so identifying the causal 470 variant will require additional data, for instance phenotypes related to the selection constraint. 471 Our study also illustrated the small overlap between the selection signatures detected by the within 472 and the between population approach, and allowed to better characterize the regions detected by each 473 of these approaches. Among the 916 hard sweep regions detected by the within-population approach, 474 only 8 were detected by hapFLK. One reason is that, in many of the regions where an hard sweep 475 was detected, the swept allele was likely at quite high frequency also in the other breeds, reducing 476 the differentiation signal. Indeed, in regions where a sweep was detected in one of the breeds, the 477 TOR distribution in the other breeds was shifted toward that of swept regions, even if this signal 478 was not strong enough to be detected by the HMM (Figure S2). This can be explained by the long 479 shared history between the 4 breeds considered in this study, which have a common geographic origin 480 and have been strictly isolated only in the last 200 years. Convergent selection of similar alleles in 481 different breeds may also have occurred, as the same traits were selected in these breeds, but this 482 is not the most parcimonious hypothesis. Finally, the small proportion of hard sweeps detected by 483 hapFLK was also likely related to a power issue, because none of the most breed-specific hard sweeps 484 listed in Table S1 was detected by this approach. Although such regions clearly showed some signal 485 of differentiation, this signal was not strong enough to be detected by hapFLK, because the very 14 486 high levels of genetic drift in cattle breeds implies that only extremely differentiated regions can be 487 considered as significantly under selection. Actually, we can see from Table 2 that in most hard 488 sweep regions detected by hapFLK, there was not one hard sweep in a single population, but at least 489 two sweeps (either complete or incomplete) implying distinct haplotypes in the four breeds, which 490 represents an even stronger differentiation signal (Figure S12). We expect the 1000 bull genomes 491 dataset to grow by inclusion of new populations, that will most likely increase power of differentiation 492 approaches such as FLK and hapFLK by adding information on breeds more closely related to each 493 other than on the initial release. 494 On the other hand, among the 57 selection signatures detected by hapFLK, 49 were not detected by 495 the within-population approach. This comes from the fact that this approach is specifically designed 496 to detect hard sweep signals, while hapFLK also detects incomplete or soft sweeps (Fariello et al., 2013, 497 2014). For the 49 regions detected by hapFLK and not by the within-breed approach, the evidence 498 of a hard sweep, as measured by statistic TOR , was always very low in all breeds, indicating that the 499 signal detected by hapFLK had nothing to do with a hard sweep in one of the breeds. For example, 500 in the ARL15 signature, the swept haplotype clearly segregates at moderate frequency (∼85%) in 501 the Jersey population (Figure S8). This ability of hapFLK to detect incomplete and soft sweeps is 502 extremely interesting for the study of livestock species. Indeed, recent intensive selection in modern 503 livestock breeds has mainly targeted polygenic traits such as milk or meat production, which is much 504 more likely to produce soft and incomplete sweep patterns than hard sweep patterns (Pritchard et al., 505 2010; Hernandez et al., 2011). 506 Conclusion 507 Our study illustrates how sequencing data offer great advantages for the detection of adaptive loci: 508 higher power and better precision in localizing adaptive genes, and even in some rare cases causal 509 mutations. However, we found that many of our strongest signals and mutations lie in non-coding 510 regions of the genome, hinting that a majority of adaptive mutations are regulatory in nature. A 511 better annotation of genomes, such as obtained by high-throughput post-genomic approaches, in 512 combination with phenotyping in large population samples will be key in uncovering the biological 513 basis of adaptation. 514 Materials and Methods 515 Samples and sequencing 516 A total of 234 genome sequences were obtained from the 1000 bull genomes project, Run II (Daetwyler 517 et al., 2014). These included 129 Holsteins (125 Black and 4 Red) , 43 Fleckviehs, 47 Angus, and 15 518 Jerseys. We considered all these sequences except the 4 Red Holsteins. We based our analyzes on the 519 phased and corrected autosomal data produced by the 1000 bull genomes project (Daetwyler et al., 520 2014). These data included 27,535,425 bi-allelic SNPs and 1,507,728 bi-allelic indels. 15 521 Choice of unrelated animals 522 To remove potential biases arising from sample size heterogeneity between breeds and inbreeding 523 within breeds, we selected a subset of 25 unrelated animals in Holstein, Fleckvieh and Angus, while 524 keeping all 15 Jersey animals. Within each breed, we computed the genetic relationship matrix (GRM) 525 of all available animals based on SNPs from Chromosome 1 with minor allele frequency (MAF) above 526 10%, using the GCTA 1.04 software (Yang et al., 2011). We then selected unrelated animals as follows. 527 First, we removed all animals with inbreeding value (the diagonal term of the GRM) above 1.5 (this 528 threshold was chosen because inbreeding values above 1.5 clearly appeared as outliers compared to the 529 rest of the distribution, Figure S13). Second, we considered all animal pairs with genetic relationship 530 above 0.3 in absolute value and removed the most inbred animal of each pair. Third, we performed 531 a hierarchical clustering of the remaining animals based on the distance dij = M − GRMi,j , where 532 GRMi,j is the genetic relationship between animals i and j and M is the maximum value of the GRM, 533 and sampled the 25 most distant animals. 534 Estimation of a demographic model for the joint history of the four breeds 535 In order to model the demographic history of the four breeds under study, we assumed that these breeds 536 diverged simultaneously from a common ancestral population, TDIV generations ago. Although the 537 population tree estimated by FLK (Figure 4) would suggest a slightly more recent divergence between 538 Fleckvieh and Jersey, we considered that this difference was negligible and preferred reducing the 539 number of parameters to be estimated. Based on previous studies suggesting that effective population 540 size in taurine cattle strongly declined since domestication (MacLeod et al., 2013; Boitard et al., 541 2016), we allowed one population size change from NAN C (the ancestral population size) to NDOM 542 (the “domestic” population size) in the ancestral population, TAN C generations ago. The divergence 543 between breeds implied a second population size change: after this event, each breed i was assumed 544 to have a specific population size Ni , which possibly differed from NDOM . Finally, we assumed that 545 each breed received, every generation, a proportion m of migrants from any of the other breeds. 546 We estimated the parameters of this model using the composite likelihood approach implemented 547 in fastsimcoal2 (Excoffier et al., 2013), which is based on the joint Site Frequency Spectrum (SFS). 548 In our dataset, this joint SFS had a very high dimension (51×51×51×31). Consequently, we rather 549 considered the collection of joint SFS obtained for all 6 population pairs (Angus × Fleckvieh, Angus 550 × Holstein …), following the recommandations of the authors. We computed these observed joint SFS 551 from our data using home-made scripts and provided them as input to fastsimcoal2, version 5.2.8 (May 552 2015). We performed 50 independent EM estimations and selected that with the highest composite 553 likelihood. For each estimation, we used the folded SFS (option -m) and the default settings -n100000 554 -N100000 -M0.001 –l10 -L40. In fastsimcoal2, time is scaled in generations. Time in years were 555 obtained by assuming a generation time of 5 years. 556 Detection of genomic regions with low within-breed diversity 557 Model We looked for hard sweep signatures within each breed using the Hidden Markov Model 558 (HMM) of Boitard et al. (2009). In this model, only bi-allelic variants are considered and the derived 16 559 allele frequency at variant i, denoted Yi , is taken as the observed state of the HMM at this position. 560 Each variant i is also assumed to have a hidden state Xi , which can take 3 different values: “Selection”, 561 for variants that are very close to a swept site, “Neutral”, for variants that are far away from any 562 swept site, and “Intermediate” for variants in between. These three values are associated with differ- 563 ent allele frequency distributions. The “Neutral” allele frequency distribution is estimated using all 564 variants in the genome, assuming most of them have indeed evolved under neutrality. Allele frequency 565 distributions in states “Intermediate” and “Selection” are deduced from this “Neutral” distribution 566 using the derivations in Nielsen et al. (2005), and are typically more skewed toward extreme allele 567 frequencies. The hidden states Xi form a Markov chain along the genome with a per bp probability p 568 of switching state, so that close variants tend to be in the same hidden state. Under this HMM, the 569 most likely sequence of hidden states can be predicted from the sequence of observed states using the 570 Viterbi algorithm. Each set of consecutive variants with predicted state “Selection” is called a sweep 571 window. The method of Boitard et al. (2009) is implemented in the freq-hmm program, available at 572 https://forge-dga.jouy.inra.fr/projects/pool-hmm. 573 Implementation Ancestral Bovinae alleles at 448,289 SNPs included in the Illumina BovineHD 574 BeadChip were obtained from Utsunomiya et al. (2013). To check this information we also aligned the 575 bovine sequence against the sequence of three Bovidae species (Rocha et al., 2014). For 365,146 SNPs, 576 the ancestral allele in our alignment was consistent with that reported by Utsunomiya et al. (2013) 577 so we used this information for sweep detection. For all other SNPs, we used a folded allele frequency 578 distribution, i.e allele frequencies Yi and 1 − Yi were considered as the same observed state. Indels 579 were not included at this stage of the analysis (they were only considered when looking at candidate 580 polymorphisms within the region). 581 582 To reduce computation time, we estimated the neutral SFS in each breed using only 5% of the SNPs from each chromosome, which were selected at random. These SFS are shown in Figure S14. 583 The type I error of the above method, i.e. the probability that it detects a sweep window in a 584 population that has evolved under neutrality, depends on parameter p (see Boitard et al. (2009) for 585 more details). To control the genome wide number of false positives, we simulated 5,000 samples of 586 length 500kb under neutral evolution using ms (Hudson, 2002) and adjusted p so that sweeps were 587 detected in only 0.1% of these samples. We performed this calibration for each breed, using the same 588 sample size and proportion of unfolded sites as in the data used for sweep detection. Parameter θ was 589 also estimated from this data using Waterson’s estimator (Watterson, 1975) and ρ was taken equal 590 to θ/10, which is rather low compared to current estimates in cattle (Sandor et al., 2012; Ma et al., 591 2015). As the detection sensitivity of the HMM method increases when ρ/θ decreases (Boitard et al., 592 2009), our adjusted value of p should be conservative. Assuming a 2.5Gb genome as that of bovine 593 (focusing on autosomes) is equivalent to 5,000 windows of 500kb, so we expected no more than 5 false 594 positive signals over the genome with this value of p. 595 Influence of demography on sweep detection We evaluated the robustness of the above 596 detection approach under two neutral demographic scenarios: the multi-population model estimated 597 by fastsimcoal2 (Figure 1) and the single-population model estimated in (Boitard et al., 2016). For 17 598 each model, we simulated 20,000 samples of length 500kb using ms, assuming a mutation rate and 599 a recombination rate of 1e-8 per bp and generation, and we applied the HMM procedure to these 600 samples. For the multi-population model, each simulated sample included genomes from the 4 breeds, 601 which were splitted in order to apply the HMM within each population. For the single-population 602 scenario, breed-specific samples were directly simulated independently of each others, each breed 603 being associated to a different population size history. For each scenario and breed, the total length of 604 simulated samples was equivalent to that of 4 cattle genomes. Consequently, the estimated proportion 605 m of false positive signals per genome was given by the total number of sweeps detected in the 606 simulations, divided by four, and the variance of this estimation was equal to m/4. The confidence 607 interval of this estimation was approximated by [m − 2σ(m); m + 2σ(m)] = [m − √ √ m; m + m] 608 Comparing Hard-sweep signatures detected in different populations For a given 609 variant i, the evidence for a hard sweep window occuring around this variant in population j was 610 measured by the statistic TOR (i, j) = log10 (qij /(1 − qij )), where qij was the posterior probability 611 of hidden state “Selection” returned by the backward-forward algorithm applied to the HMM in 612 population j. When considering a given region of the genome, the evidence for a hard sweep in 613 population j was quantified by the median of TOR (i, j) over the variants of the region. In order to 614 detect selection signatures that are really breed-specific, we computed for each breed the distribution 615 of TOR in 3 different classes of regions : (i) those where a hard sweep was detected in this breed, (ii) 616 those where a hard sweep was detected in another breed, and (iii) those where no hard sweep was 617 detected (Figure S2). Obviously, class (i) and class (iii) regions lead to very different distributions, with 618 lower TOR values in the latter (i.e. in the completely neutral regions). In addition, the distribution of 619 TOR in class (ii) regions was not similar to that found in class (iii) regions, but was shifted towards 620 that of class (i) regions. Based on these observations, we therefore considered that a hard sweep was 621 breed-specific when, in all other breeds, the value of TOR of the region was below a given quantile q 622 of the class (iii) distribution. For q = 0.5, 55 breed-specific sweeps were detected (listed in File S2) 623 and for q = 0.25 , 12 breed-specific regions were detected. 624 Detection of genomic regions with large differentiation between breeds 625 We applied two methods for the detection of genomic regions exhibiting large genetic differentiation 626 between populations: FLK (Bonhomme et al., 2010), a single marker approach based on allele fre- 627 quency differences, and its haplotypic extension hapFLK (Fariello et al., 2013), which exploits the 628 Linkage Disequilibrium information to capture differences of haplotype frequencies. 629 Kinship matrix In contrast with the FST statistic, FLK and hapFLK account for the population 630 history through a kinship matrix, which captures (i) differences in effective population sizes between 631 populations and (ii) possible shared ancestry between populations. The kinship matrix is inferred 632 from a population tree, with branch length expressed in units of drift, i.e. measured in fixation indices 633 f ≈ t/2N where t is the number of generations from the root and N the effective population size. 18 634 We estimated the population tree using neighbour-joining on the Reynold’s genetic distances between 635 populations (see Bonhomme et al. (2010) for details). We used the ancestral allele reconstruction of 636 Utsunomiya et al. (2013) to root the population tree and estimate the population kinship matrix. 637 FLK For the single marker analysis, we performed the FLK test on all variants and computed 638 p-values using the theoretical χ2 (3) distribution, which was a good fit to the observed distribution 639 (Figure S15). 640 hapFLK For the hapFLK test, to save computation time, we removed variants that had low minor 641 allele frequency (< 10%) in all breeds. Note that as the analysis looks for signal of differentiation, 642 removing these variants does not preclude detection. In subsequent reanalysis of small genomic re- 643 gions, we kept all variants in the analysis. hapFLK makes use of the local clustering approach in 644 Scheet and Stephens (2006) to model haplotype diversity. This model requires specifying a number 645 of haplotype clusters as input parameter. Using the cross validation procedure implemented in the 646 fastPHASE software, we found that 15 clusters provided the smallest imputation error rate. hapFLK 647 can be computed on unphased or phased genotype data. Genotype calling made used of imputation 648 approaches, and data was therefore already phased (Daetwyler et al., 2014). We computed hapFLK 649 both on the haplotype data and on the genotype (but imputed) data and found the two analyzes 650 provided similar results (not shown). 651 The distribution of hapFLK is not known, but, from theoretical arguments hapFLK is a deviance 652 statistic. However, the variance parameter of this deviance is not known. If this parameter was known 653 then hapFLK should follow a χ2 ((N − 1)(K − 1)) distribution, where N is the number of populations 654 and K the number of haplotype clusters. Building on this fact, we compared a set of quantiles (from 655 0.05 to 0.95 every 0.05) of the χ2 (42) distribution tq to the observed quantiles of the hapFLK statistic 656 oq . We found that the relationship between tq and oq was very close to linear (Figure S16). Thus, we 657 used the parameter of the linear model to scale the hapFLK statistic to a χ2 (42) distribution which 658 was then used to compute p-values. 659 Extracting significantly differentiated regions In order to call significant regions, we ap- 660 plied the approach of Storey and Tibshirani (2003) aimed at controlling the False Discovery Rate 661 (FDR), at the 15% level, i.e. we called significant variants with q-values < 0.15 and selection signa- 662 tures regions where more than one variant was called significant. We note however that our level of 663 control of the FDR for the regions themselves may not be well calibrated as the tests are correlated. 664 This procedure still provides a significant threshold so that among marker discoveries there is a six 665 fold enrichment in favor of the number of statistics under the alternative. 666 Influence of demography on hapFLK To evaluate the robustness to demography of hapFLK 667 and our scaling approach, we performed 10,000 simulations of 50Kb windows under the population 668 model estimated with fastsimcoal2. We applied the same testing procedure as with the real data, 669 including filtering out SNPs with low minor allele frequencies in all breeds. The resulting hapFLK 670 distribution was different from the one observed on real data, in particular it showed a depletion 671 of low hapFLK values compared to real data, i.e. on real data some regions look more similar 19 672 between breeds than on simulated data. The reasons for this are not clear, but we note that (i) 673 fastsimcoal2 estimation does not use haplotype or linkage disequilibrium information so the haplotype 674 patterns simulated are not expected to necessarily fit the data (ii) simulations assume homogeneous 675 recombination and mutation rates, which does not hold on real data and (iii) common background 676 / purifying selection between breeds might reduce differentiation in some genome regions. Despite 677 the lesser hapFLK variance in simulations, scaling the hapFLK distribution to a χ2 with 14 degrees 678 of freedom provided a very good fit (see Figure S4). Applying the Storey and Tibshirani (2003) 679 approach to estimate proportion of alternative hypotheses in the resulting p-value distribution lead to 680 an estimate of 0 (i.e. π̂0 = 1). A python script to perform the scaling of hapFLK to χ2 distributions 681 is now available on the hapFLK webpage : https://forge-dga.jouy.inra.fr/projects/hapflk/documents 682 References 683 Albrecht E, Komolka K, Kuzinski J, Maak S, 2012. Agouti revisited: Transcript quantification of the 684 asip gene in bovine tissues related to protein expression and localization. PLoS ONE 7:e35282. 685 Allais-Bonnet A, Grohs C, Medugorac I, Krebs S, Djari A, Graf A, Fritz S, Seichter D, Baur A, 686 Russ I, et al., 2013. Novel insights into the bovine polled phenotype and horn ontogenesis in 687 <italic>bovidae</italic>. PLoS ONE 8:e63512. 688 Bionaz M, Loor JJ, 2008. Acsl1, agpat6, fabp3, lpin1, and slc27a6 are the most abundant isoforms 689 in bovine mammary tissue and their expression is affected by stage of lactation. The Journal of 690 nutrition 138:1019–1024. 691 Biswas S, Akey JM, 2006. Genomic insights into positive selection. Trends in Genetics 22:437 – 446. 692 Boitard S, Rodríguez W, Jay F, Mona S, Austerlitz F, 2016. Inferring population size history from 693 large samples of genome-wide molecular data - an approximate bayesian computation approach. 694 PLoS Genet 12:1–36. 695 696 Boitard S, Schlotterer C, Futschik A, 2009. Detecting selective sweeps: a new approach based on hidden markov models. Genetics 181:1567–1578. 697 Bole-Feysot C, Goffin V, Edery M, Binart N, Kelly PA, 1998. Prolactin (prl) and its receptor: actions, 698 signal transduction pathways and phenotypes observed in prl receptor knockout mice. Endocrine 699 Reviews 19:225–268. PMID: 9626554. 700 701 702 703 Bongiorni S, Mancini G, Chillemi G, Pariset L, Valentini A, 2012. Identification of a short region on chromosome 6 affecting direct calving ease in Piedmontese cattle breed. PLoS One 7:e50137. Bonhomme M, Chevalet C, Servin B, Boitard S, Abdallah J, Blott S, SanCristobal M, 2010. Detecting selection in population trees: the lewontin and krakauer test extended. Genetics 186:241–262. 704 Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brondum RF, Liao X, Djari A, 705 Rodriguez SC, Grohs C, et al., 2014. Whole-genome sequencing of 234 bulls facilitates mapping of 706 monogenic and complex traits in cattle. Nat Genet 46:858–865. 20 707 Di Noia MA, Todisco S, Cirigliano A, Rinaldi T, Agrimi G, Iacobazzi V, Palmieri F, 2014. The 708 human slc25a33 and slc25a36 genes of solute carrier family 25 encode two mitochondrial pyrimidine 709 nucleotide transporters. Journal of Biological Chemistry 289:33137–33148. 710 711 Dickinson RE, Duncan WC, 2010. The SLIT-ROBO pathway: a regulator of cell function with implications for the reproductive system. Reproduction 139:697–704. 712 Dickinson RE, Hryhorskyj L, Tremewan H, Hogg K, Thomson AA, McNeilly AS, Duncan WC, 2010. 713 Involvement of the SLIT/ROBO pathway in follicle development in the fetal ovary. Reproduction 714 139:395–407. 715 Eberlein A, Takasuga A, Setoguchi K, Pfuhl R, Flisikowski K, Fries R, Klopp N, Furbass R, Weikard R, 716 Kuhn C, 2009. Dissection of genetic factors modulating fetal growth in cattle indicates a substantial 717 role of the non-SMC condensin I complex, subunit G (NCAPG) gene. Genetics 183:951–64. 718 Echternkamp S, Aad P, Eborn D, Spicer L, 2012. Increased abundance of aromatase and follicle 719 stimulating hormone receptor mrna and decreased insulin-like growth factor-2 receptor mrna in 720 small ovarian follicles of cattle selected for twin births. Journal of animal science 90:2193–2200. 721 Excoffier L, Dupanloup I, Huerta-Sánchez E, Sousa VC, Foll M, 2013. Robust demographic inference 722 from genomic and snp data. PLoS Genet 9:e1003905. 723 Fariello MI, Boitard S, Naya H, SanCristobal M, Servin B, 2013. Detecting signatures of selection 724 through haplotype differentiation among hierarchically structured populations. Genetics 193:929– 725 941. 726 727 728 729 730 731 Fariello MI, Servin B, Tosser-Klopp G, Rupp R, Moreno C, Cristobal MS, Boitard S, Consortium ISG, 2014. Selection signatures in worldwide sheep populations. PLoS ONE 9:e103813. Felius M, Koolmees PA, Theunissen B, Consortium ECGD, Lenstra JA, 2011. On the breeds of cattle—historic and current classifications. Diversity 3:660–692. Flori L, Fritz S, Jaffrézic F, Boussaha M, Gut I, Heath S, Foulley JL, Gautier M, 2009. The genome response to artificial selection: A case study in dairy cattle. PLoS ONE 4:e6595. 732 Ganella DE, Ryan PJ, Bathgate RA, Gundlach AL, 2012. Increased feeding and body weight gain 733 in rats after acute and chronic activation of rxfp3 by relaxin-3 and receptor-selective peptides: 734 functional and therapeutic implications. Behavioural pharmacology 23:516–525. 735 Gautier M, Flori L, Riebler A, Jaffrezic F, Laloe D, Gut I, Moazami-Goudarzi K, Foulley JL, 2009. A 736 whole genome bayesian scan for adaptive genetic divergence in west african cattle. BMC Genomics 737 10:550. 738 739 740 741 Giesy SL, Yoon B, Currie WB, Kim JW, Boisclair YR, 2012. Adiponectin deficit during the precarious glucose economy of early lactation in dairy cows. Endocrinology 153:5834–44. Gouveia JJdS, Silva MVGBd, Paiva SR, Oliveira SAMPd, 2014. Identification of selection signatures in livestock species. Genetics and Molecular Biology 37:330 – 342. 21 742 Gutierrez-Gil B, Arranz JJ, Wiener P, 2015. An interpretive review of selective sweep studies in 743 bos taurus cattle populations: identification of unique and shared selection signals across breeds. 744 Frontiers in Genetics 6. 745 Hayes BJ, Pryce J, Chamberlain AJ, Bowman PJ, Goddard ME, 2010. Genetic architecture of complex 746 traits and accuracy of genomic prediction: coat colour, milk-fat percentage, and type in holstein 747 cattle as contrasting model traits. PLoS Genetics 6:e1001139. 748 749 750 751 Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, Sella G, Przeworski M, et al., 2011. Classic selective sweeps were rare in recent human evolution. science 331:920–924. Hudson R, 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338. 752 Karim L, Takeda H, Lin L, Druet T, Arias JA, Baurain D, Cambisano N, Davis SR, Farnir F, Grisart 753 B, et al., 2011. Variants modulating the expression of a chromosome domain encompassing plag1 754 influence bovine stature. Nature genetics 43:405–413. 755 Karisa B, Thomson J, Wang Z, Stothard P, Moore S, Plastow G, 2013. Candidate genes and single 756 nucleotide polymorphisms associated with variation in residual feed intake in beef cattle. Journal 757 of animal science 91:3502–3513. 758 Kijas JW, 2014. Haplotype-based analysis of selective sweeps in sheep. Genome 57:433–7. 759 Killeen AP, Morris DG, Kenny DA, Mullen MP, Diskin MG, Waters SM, 2014. Global gene expression 760 in endometrium of high and low fertility heifers during the mid-luteal phase of the estrous cycle. 761 BMC genomics 15:234. 762 Klungland H, Vage D, Gomez-Raya L, Adalsteinsson S, Lien S, 1995. The role of melanocyte- 763 stimulating hormone (msh) receptor in bovine coat color determination. Mammalian genome 6:636– 764 639. 765 Leroy G, Mary-Huard T, Verrier E, Danvy S, Charvolin E, Danchin-Burge C, 2013. Methods to 766 estimate effective population size using pedigree data: Examples in dog, sheep, cattle and horse. 767 Genetics Selection Evolution 45:1. 768 Li M, Tian S, Jin L, Zhou G, Li Y, Zhang Y, Wang T, Yeung CK, Chen L, Ma J, et al., 2013. Genomic 769 analyses identify distinct patterns of selection in domesticated pigs and tibetan wild boars. Nature 770 genetics 45:1431–1438. 771 Lindholm-Perry AK, Kuehn LA, Oliver WT, Sexten AK, Miles JR, Rempel LA, Cushman RA, Freetly 772 HC, 2013. Adipose and muscle tissue gene expression of two genes (NCAPG and LCORL) located 773 in a chromosomal region associated with cattle feed intake and gain. PLoS One 8:e80882. 774 Lindholm-Perry AK, Sexten AK, Kuehn LA, Smith TP, King DA, Shackelford SD, Wheeler TL, Ferrell 775 CL, Jenkins TG, Snelling WM, et al., 2011. Association, effects and validation of polymorphisms 776 within the NCAPG - LCORL locus located on BTA6 with feed intake, gain, meat and carcass traits 777 in beef cattle. BMC Genet 12:103. 22 778 Liu X, Saw WY, Ali M, Ong RTH, Teo YY, 2014. Evaluating the possibility of detecting evidence of 779 positive selection across asia with sparse genotype data from the hugo pan-asian snp consortium. 780 BMC genomics 15:332. 781 Liu YJ, Liu XG, Wang L, Dina C, Yan H, Liu JF, Levy S, Papasian CJ, Drees BM, Hamilton JJ, 782 et al., 2008. Genome-wide association scans identified ctnnbl1 as a novel gene for obesity. Human 783 Molecular Genetics 17:1803–1813. 784 Ma L, O’Connell JR, VanRaden PM, Shen B, Padhi A, Sun C, Bickhart DM, Cole JB, Null DJ, 785 Liu GE, et al., 2015. Cattle sex-specific recombination and genetic control from a large pedigree 786 analysis. PLoS Genet 11:e1005387. 787 MacLeod IM, Larkin DM, Lewin HA, Hayes BJ, Goddard ME, 2013. Inferring demography from runs 788 of homozygosity in whole-genome sequence, with correction for sequence errors. Molecular Biology 789 and Evolution 30:2209–2223. 790 Makvandi-Nejad S, Hoffman GE, Allen JJ, Chu E, Gu E, Chandler AM, Loredo AI, Bellone RR, 791 Mezey JG, Brooks SA, et al., 2012. Four loci explain 83% of size variation in the horse. PLoS One 792 7:e39929. 793 794 Metzger J, Schrimpf R, Philipp U, Distl O, 2013. Expression levels of LCORL are associated with body size in horses. PLoS One 8:e56497. 795 Morice-Picard F, Lasseaux E, Cailley D, Gros A, Toutain J, Plaisant C, Simon D, François S, Gilbert- 796 Dussardier B, Kaplan J, et al., 2014. High-resolution array-cgh in patients with oculocutaneous 797 albinism identifies new deletions of the tyr, oca2, and slc45a2 genes and a complex rearrangement 798 of the oca2 gene. Pigment cell & melanoma research 27:59–71. 799 Morris AP, Voight BF, Teslovich TM, Ferreira T, Segre AV, Steinthorsdottir V, Strawbridge RJ, Khan 800 H, Grallert H, Mahajan A, et al., 2012. Large-scale association analysis provides insights into the 801 genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44:981–90. 802 Nafikov R, Schoonmaker J, Korn K, Noack K, Garrick D, Koehler K, Minick-Bormann J, Reecy J, 803 Spurlock D, Beitz D, 2013. Association of polymorphisms in solute carrier family 27, isoform a6 804 (slc27a6) and fatty acid-binding protein-3 and fatty acid-binding protein-4 (fabp3 and fabp4) with 805 fatty acid composition of bovine milk. Journal of dairy science 96:6007–6021. 806 807 Nielsen R, Williamson L, Kim Y, Hubisz M, Clark A, Bustamante C, 2005. Genomic scans for selective sweeps using SNP data. Genome Research 15:1566–1575. 808 Nistala H, Lee-Arteaga S, Smaldone S, Siciliano G, Carta L, Ono RN, Sengle G, Arteaga-Solis E, 809 Levasseur R, Ducy P, et al., 2010. Fibrillin-1 and-2 differentially modulate endogenous tgf-β and 810 bmp bioavailability during bone formation. The Journal of cell biology 190:1107–1121. 811 Pausch H, Flisikowski K, Jung S, Emmerling R, Edel C, Götz KU, Fries R, 2011. Genome-wide 812 association study identifies two major loci affecting calving ease and growth-related traits in cattle. 813 Genetics 187:289–297. 23 814 815 Pritchard JK, Pickrell JK, Coop G, 2010. The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Current Biology 20:R208–R215. 816 Qanbari S, Gianola D, Hayes B, Schenkel F, Miller S, Moore S, Thaller G, Simianer H, 2011. Appli- 817 cation of site and haplotype-frequency based approaches for detecting selection signatures in cattle. 818 BMC Genomics 12:318. 819 820 Qanbari S, Pausch H, Jansen S, Somel M, Strom TM, Fries R, Nielsen R, Simianer H, 2014. Classic selective sweeps revealed by massive sequencing in cattle. PLoS Genet 10:e1004148. 821 Randhawa IA, Khatkar MS, Thomson PC, Raadsma HW, 2015. Composite Selection Signals for 822 Complex Traits Exemplified Through Bovine Stature Using Multi-Breed Cohorts of European and 823 African Bos taurus. G3 (Bethesda) . 824 Richards JB, Waterworth D, O’Rahilly S, Hivert MF, Loos RJ, Perry JR, Tanaka T, Timpson NJ, 825 Semple RK, Soranzo N, et al., 2009. A genome-wide association study reveals variants in ARL15 826 that influence adiponectin levels. PLoS Genet 5:e1000768. 827 Rivadeneira F, Styrkarsdottir U, Estrada K, Halldorsson BV, Hsu YH, Richards JB, Zillikens MC, 828 Kavvoura FK, Amin N, Aulchenko YS, et al., 2009. Twenty bone-mineral-density loci identified by 829 large-scale meta-analysis of genome-wide association studies. Nat Genet 41:1199–206. 830 Rocha D, Billerey C, Samson F, Boichard D, Boussaha M, 2014. Identification of the putative ancestral 831 allele of bovine single-nucleotide polymorphisms. Journal of Animal Breeding and Genetics 131:483– 832 486. 833 Roux PF, Boitard S, Blum Y, Parks B, Montagner A, Mouisel E, Djari A, Esquerré D, Désert C, 834 Boutin M, et al., 2015. Combined qtl and selective sweep mappings with coding snp annotation 835 and cis-eqtl analysis revealed park2 and jag2 as new candidate genes for adiposity regulation. G3: 836 Genes| Genomes| Genetics 5:517–529. 837 Rubin CJ, Megens HJ, Barrio AM, Maqbool K, Sayyab S, Schwochow D, Wang C, Carlborg �, Jern P, 838 Jørgensen CB, et al., 2012. Strong signatures of selection in the domestic pig genome. Proceedings 839 of the National Academy of Sciences 109:19529–19536. 840 Rubin CJ, Zody MC, Eriksson J, Meadows JR, Sherwood E, Webster MT, Jiang L, Ingman M, 841 Sharpe T, Ka S, et al., 2010. Whole-genome resequencing reveals loci under selection during chicken 842 domestication. Nature 464:587–591. 843 844 845 846 847 Sabeti P, Schaffner S, Fry B, Lohmueller J, Varilly P, Shamovsky O, Palma A, Mikkelsen T, Altshuler D, Lander E, 2006. Positive natural selection in the human lineage. science 312:1614–1620. Sahana G, Hoglund JK, Guldbrandtsen B, Lund MS, 2015. Loci associated with adult stature also affect calf birth survival in cattle. BMC Genet 16:47. Sandor C, Li W, Coppieters W, Druet T, Charlier C, Georges M, 2012. Genetic variants in 848 <italic>rec8</italic>, <italic>rnf212</italic>, and <italic>prdm9</italic> influence male re- 849 combination in cattle. PLoS Genet 8:e1002854. 24 850 Scheet P, Stephens M, 2006. A fast and flexible statistical model for large-scale population genotype 851 data: applications to inferring missing genotypes and haplotypic phase. The American Journal of 852 Human Genetics 78:629–644. 853 Signer-Hasler H, Flury C, Haase B, Burger D, Simianer H, Leeb T, Rieder S, 2012. A genome-wide 854 association study reveals loci influencing height and other conformation traits in horses. PLoS One 855 7:e37282. 856 Smith CM, Chua BE, Zhang C, Walker AW, Haidar M, Hawkes D, Shabanpoor F, Hossain MA, Wade 857 JD, Rosengren KJ, et al., 2014. Central injection of relaxin-3 receptor (rxfp3) antagonist peptides 858 reduces motivated food seeking and consumption in c57bl/6j mice. Behavioural brain research 859 268:117–126. 860 Song H, Kim H, Lee K, Lee DH, Kim TS, Song JY, Lee D, Choi D, Ko CY, Kim HS, et al., 2012. 861 Ablation of Rassf2 induces bone defects and subsequent haematopoietic anomalies in mice. EMBO 862 J 31:1147–59. 863 Soung do Y, Dong Y, Wang Y, Zuscik MJ, Schwarz EM, O’Keefe RJ, Drissi H, 2007. 864 Runx3/AML2/Cbfa3 regulates early and late chondrocyte differentiation. 865 22:1260–70. J Bone Miner Res 866 Stahl EA, Raychaudhuri S, Remmers EF, Xie G, Eyre S, Thomson BP, Li Y, Kurreeman FA, Zher- 867 nakova A, Hinks A, et al., 2010. Genome-wide association study meta-analysis identifies seven new 868 rheumatoid arthritis risk loci. Nat Genet 42:508–14. 869 Stefanaki I, Panagiotou OA, Kodela E, Gogas H, Kypreou KP, Chatzinasiou F, Nikolaou V, Plaka M, 870 Kalfa I, Antoniou C, et al., 2013. Replication and predictive value of snps associated with melanoma 871 and pigmentation traits in a southern european case-control study. PloS one 8:e55712. 872 873 874 875 876 Storey JD, Tibshirani R, 2003. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100:9440–9445. Sturm RA, 2009. Molecular genetics of human pigmentation diversity. Human molecular genetics 18:R9–R17. Tetens J, Widmann P, Kuhn C, Thaller G, 2013. A genome-wide association study indicates 877 LCORL/NCAPG as a candidate locus for withers height in German Warmblood horses. Anim 878 Genet 44:467–71. 879 880 The Bovine HapMap Consortium, 2009. Genome-wide survey of snp variation uncovers the genetic structure of cattle breeds. Science 324:528–532. 881 Tonevitsky AG, Maltseva DV, Abbasi A, Samatov TR, Sakharov DA, Shkurnikov MU, Lebedev AE, 882 Galatenko VV, Grigoriev AI, Northoff H, 2013. Dynamically regulated mirna-mrna networks re- 883 vealed by exercise. BMC physiology 13:9. 25 884 Utsunomiya YT, Perez OBrien AM, Sonstegard TS, Van Tassell CP, do Carmo AS, Mészáros GA, 885 SÃlkner J, Garcia JAF, 2013. Detecting loci under recent positive selection in dairy and beef cattle 886 by combining different genome-wide scan methods. PLoS ONE 8:e64280. 887 Viitala S, Szyda J, Blott S, Schulman N, Lidauer M, MÃki-Tanila A, Georges M, Vilkki J, 2006. The 888 role of the bovine growth hormone receptor and prolactin receptor genes in milk, fat and protein 889 production in finnish ayrshire dairy cattle. Genetics 173:2151–2164. 890 891 892 893 Voight BF, Kudaravalli S, Wen X, Pritchard JK, 2006. A map of recent positive selection in the human genome. PLoS Biol 4:e72. Watterson G, 1975. On the number of segregating sites in genetical models without recombination. Theor Popul Biol 7:256–276. 894 Weikard R, Altmaier E, Suhre K, Weinberger KM, Hammon HM, Albrecht E, Setoguchi K, Takasuga 895 A, Kuhn C, 2010. Metabolomic profiles indicate distinct physiological pathways affected by two loci 896 with major divergent effect on Bos taurus growth and lipid deposition. Physiol Genomics 42A:79–88. 897 Wijesena HR, Schmutz SM, 2015. A missense mutation in slc45a2 is associated with albinism in 898 several small long haired dog breeds. Journal of Heredity p. esv008. 899 Xu L, Bickhart DM, Cole JB, Schroeder SG, Song J, Tassell CP, Sonstegard TS, Liu GE, 2015. 900 Genomic signatures reveal new evidences for selection of important traits in domestic cattle. Mol 901 Biol Evol 32:711–25. 902 903 904 905 Xu X, Dong GX, Hu XS, Miao L, Zhang XL, Zhang DL, Yang HD, Zhang TY, Zou ZT, Zhang TT, et al., 2013. The genetic basis of white tigers. Current Biology 23:1031–1035. Yang J, Lee SH, Goddard ME, Visscher PM, 2011. Gcta: A tool for genome-wide complex trait analysis. The American Journal of Human Genetics 88:76 – 82. 906 Yoshida CA, Yamamoto H, Fujita T, Furuichi T, Ito K, Inoue K, Yamana K, Zanma A, Takada K, Ito 907 Y, et al., 2004. Runx2 and Runx3 are essential for chondrocyte maturation, and Runx2 regulates 908 limb growth through induction of Indian hedgehog. Genes Dev 18:952–63. 26 Angus N=214 Fleckvieh N=411 Ancestral population N=137,775 Ancestral population N=73,042 Holstein N=378 Jersey N=193 Present 500 years 123,465 years Figure 1: Joint demography of the four cattle breeds, estimated from our data using the approach of Excoffier et al. (2013). Population sizes correspond to the number of haploid individuals. Parameter values correspond to the EM iteration with the highest likelihood, but similar values were obtained from the second and third best EM iterations. 27 ● ● ● 0.1 % 1% 5% 10 % 90 % 75 % 50 % 6 5 4 ● ● ●●● ●● ●●● p−value ● ● ● ● ● ● ● ● ● ● p−value Figure 2: Comparison between HMM and CLR results. Proportion of Fleckvieh CLR p-values (obtained from Qanbari et al. (2014)) in sweep windows vs other windows in the genome. On the left panel, the sweep windows considered were those detected in Fleckvieh, independently of what happened in other breeds. On the right panel, the sweep windows considered were those detected only in Holstein. 28 0.1 % ● ● ● 1% ● ●● 90 % 75 % 50 % 1 0 ● ● ● ● ● ● ● ● 5% 2 ●● ● ● ●● 10 % ●● ● 3 3 ●●●● 2 ●● 1 5 4 ● 0 ● ● −2 ● ● Enrichment in sel. sweeps (log2 scale) 6 overlap with Holstein sweeps −2 Enrichment in sel. sweeps (log2 scale) overlap with Fleckvieh sweeps Figure 3: Allele frequencies in the sweep region at the polled locus. For SNPs where the ancestral allele is known (in red), the frequency is that of the derived allele. For other SNPs (in black) the frequency is that of the minor allele (among all breeds). Vertical red bars delimit the union of detected regions among the four breeds. 29 Figure 4: Population tree estimated on the 1000 bull genomes data. Branch length is measured in units of drift (≈ t/2N, where t is the time in generations and N the effective population size). 30 Figure 5: Influence of NGS on hapFLK detection power. PP plot of the hapFLK test applied to SNPs of the Illumina BovineHD SNP chip (left) or to all sites of the 1000 bull genomes project (right) 31 1 2 3 4 5 −1 0.1 % 1% 10 % 5% 90 75 % % 50 % 0.1 % ● 1% 5% 10 % p−value ● ● ● 1 2 3 4 5 ● ● ● ● ● ● ● 10 % −3 ● ● 50 % ● ● 90 % 75 % ● ● ● −1 ● −3 ● ● ● −5 ● Enrichment in sel. sweeps (log2 scale) −1 ● ● 5% 1% 90 75 % % 50 % 10 % 5% Sweep in 4 populations 1 2 3 4 5 Sweep in 3 populations ● ● ● p−value ● ● 90 % 75 % 50 % ● ● ●● ● ●●● ●● ● ● ● ● ●● ● ● ● ●●● ● ● ●●● ●● ●●● ● ●● ● ● ● ● ● ● ● ●● ● ● ● p−value ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −3 4 ● ●●●●●●●● ●●● ●●●●● ● ●● ● ● ●● ●● ●●●●● ● ● ● ● ● ● ● −5 ● −4 −2 0 2 ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● Enrichment in sel. sweeps (log2 scale) 6 ●● ● ● ●● ●● −5 Enrichment in sel. sweeps (log2 scale) Sweep in 2 populations −6 Enrichment in sel. sweeps (log2 scale) Sweep in 1 population p−value Figure 6: Distribution of hapFLK within hard sweep regions. Comparison of the distribution of hapFLK p-values in hard sweep regions identified within populations and in the rest of the genome. The plot shows the ratio of the p-values distribution in hard sweeps detected in 1 to 4 populations to their distribution on the part of the genome where no hard sweep is detected. y-axis is on a log2 scale and x-axis on a log10 scale. 32 hapFLK PLAG1 CHCHD7 0 1 2 3 4 5 6 RPS20 24800 24900 25000 25100 25200 Position (Kbp) FLK RPS20 PLAG1 3 0 1 2 −log10(p−value) 4 5 CHCHD7 24800 24900 25000 25100 25200 Position (Kbp) 0.5 Heterozygosity 0.2 0.3 0.4 Jersey Fleckvieh 0.0 0.1 Holstein Angus 24800 24900 25000 25100 25200 Position (Kbp) Figure 7: Selection signature around PLAG1. HapFLK (top) and FLK (middle) p-values (log10 scale) for the selection signature around PLAG1, and local heterozygosity in the four breeds (bottom) for the same region. On the top and middle panels, genes are indicated by purple full rectangles, and red full triangles correspond to the candidate QTNs of Karim et al. (2011). 33 Figure 8: Selection signature around MC1R. HapFLK (gray) and FLK (black) p-values (log10 scale) for the selection signature around MC1R (purple vertical line). Red triangle highlight the two known causal mutations for red and black coat color in cattle. 34 Figure 9: Allele frequencies in the sweep region including SLC45A2 and RXFP3. For SNPs where the ancestral allele is known (in red), the frequency is that of the derived allele. For other SNPs (in black) the frequency is that of the minor allele (among all breeds). 35 chr 1 start (Mb) 1.781 end (Mb) 1.818 1 1 5 7 107.452 107.571 68.675 4.574 107.557 107.749 68.751 4.745 10 16 59.148 44.672 59.338 44.956 16 45.644 45.903 Genes lincRNA, Polled locus (AllaisBonnet et al., 2013) PPM1L ARL14 SLC41A2 FKBP8, ELL, ISYNA1, SSBP4, LRRC25, GDF15 CYP19A1 CLSTN1, PIK3CD, TMEM201, SLC25A33 RERE, SLC45A1 Table 1: Sweep regions shared among all breeds. 36 BTA 6 18 14 10 13 20 7 24 Hard sweep region end population start 71.439 71.558 F 14.755 14.963 F 24.805 25.076 H,A 5.736 5.843 A 64.149 64.197 A 22.923 23.203 J 43.436 43.542 A,J 14.006 14.040 F start 70.332 14.305 24.937 5.736 63.879 23.110 43.473 14.021 hapFLK end p-value 71.607 1.2 10−12 14.872 2.1 10−8 25.070 2.2 10−7 5.782 4.9 10−6 64.546 2.0 10−5 23.125 4.2 10−5 43.474 7.0 10−5 14.023 8.2 10−5 Candidate genes KIT MC1R PLAG1 intergenic ASIP ANKRD55 OR cluster intergenic Table 2: Hard sweep selection signatures associated to significant differentiation signals. Signatures are ordered by decreasing hapFLK p-values. Region coordinates are expressed in Megabases on assembly UMD 3.1. Population abbreviation: Angus, Fleckvieh, Holstein, Jersey 37 chr start (Mb) end (Mb) 71.560 26.000 sel pop F J nb mut 2 8 pval FLK 10−5 3.10−5 sel freq 0.92 0.77 6 7 71.440 25.430 7 14 26.280 24.810 27.060 25.080 J H,A 16 7 10−5 3.10−5 0.83 0.04 18 14.760 14.960 F 3 10−5 0.95 20 24.230 24.740 J 1 3.10−5 0.77 20 20 20 25.030 26.640 39.830 25.390 27.230 39.970 J J J 21 2 1 2.10−5 3.10−5 10−5 0.80 0.77 0.80 22 22 27 35.680 35.960 4.140 35.790 36.030 4.230 J J J 1 2 1 3.10−5 5.10−5 10−4 0.77 0.90 0.90 genes intergenic CHSY3, KIAA1024L, ADAMTS19 SLC27A6, FBN2, SLC12A2 LYN, RPS20, PLAG1, CHCHD7 ENSBTAG00000039031, MOS TUBB6, TUBB3, DEF8, LOC532875 DBNDD1, GAS8, SHCBP1, MC1R LOC530348, SNX18, COX8A, LOC783202, HSPB3 intergenic intergenic SLC45A2, ADAMTS12, RXFP3 intergenic MAGI1 intergenic Table 3: Private hard sweep regions including candidate mutations. Horizontal lines are used to group closely related sweep windows, which might result from the same selection event. 38