Download Text S1.

Supplement to Fraser et al., “Systematic detection of polygenic cis-regulatory evolution”

Notes on the test of selection

We note that the F2/eQTL mapping version of our test is readily applicable to any species in which F2 populations can be produced and genotyped. This generally requires inbred parental lines derived from independent populations, or a haploid/diploid life cycle (such as S. cerevisiae). At present this includes nearly all major model organisms (mouse, rat, nematode, fruit fly, Arabidopsis, S. pombe, etc.), underscoring the general applicability of our test. In addition, outbred individuals could be used as the parents in the F1 RNA-seq version of our method, as long as these individuals come from distinct populations (so that adaptive differences could have accumulated) and a sufficient number of sequence differences are known for the two parents (or at least the two parental populations) to allow measurement of allele-specific expression for a significant fraction of the genome.

As stated in the main text, nearly all previous tests of selection require either 1) information from neutral sites in order to assess when neutrality can be rejected; 2) assumptions about population demography that are violated by bottlenecks or other common demographic scenarios; 3) assumptions about mutation rates or the distribution of fitness effects of mutations; or 4) some combination of these. To name just a handful of these tests that depend on such assumptions: dN/dS, McDonald-Kreitman, and others require assuming neutrality of synonymous sites, which is often violated in real data; polymorphism frequency spectrum-based tests (such as Tajima's D, Fay and Wu's H, Fst, etc.), haplotype-based methods (e.g. iHS, ref. 2) and the McDonald-Kreitman test are sensitive to bottlenecks and other irregular population demographics (e.g. refs 3-4); and Poisson Random Field is sensitive to many assumptions about demography and the distribution of selection coefficients (ref. 5).
Because the present test (like Orr’s, ref. 1) focuses only on the directionality of differences between lineages, it requires no such assumptions. Put another way, there is no known mechanism by which changes in population size, mutation rate, or fitness effects of new mutations could cause some gene sets to accumulate an excess (compared to all other genes) of independent cis-regulatory mutations that act in the same direction, aside from effects on the selective forces acting on those gene sets (which is what we are measuring). Although most new mutations are in fact down-regulating (see section “Note on positive selection vs. relaxed negative selection” below), and so a simple increase in mutation rate is expected to increase the number of down-regulating mutations in any gene set, it will increase this number for all other genes as well. Our method of choosing an equal number of B6-upregulated vs. CAST-upregulated genes for use in our scan will therefore not be affected by changes in mutation rate, even in the presence of a bias in directionality of new mutations.

The difference in the number of cis-upregulated genes from B6 vs. CAST is an unbiased estimate of the number of genes in a particular gene set that are under selection, under the assumption that selection was not also acting to upregulate genes in the lineage with fewer cis-upregulated genes (in which case our estimate would be conservative). The rationale for this is that under neutrality, given that we use an equal number of B6-cis-upregulated and CAST-cis-upregulated genes as input to our scan, the expectation is for an equal number of cis-upregulating alleles from each lineage. Since our scan searches for the largest deviations from this expectation, performing it on a single cohort would result in over-estimates of the true difference, due to the Beavis effect (or “winner’s curse”).
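The winner's curse described above can be illustrated with a toy simulation. The sketch below is not the analysis used in this study; it assumes, purely for illustration, that every gene set is neutral, that each gene's cis-upregulating allele is assigned to B6 or CAST with probability 0.5, and that a second cohort provides an independent draw for the same gene set. The function names, the number of gene sets, and the set size are all hypothetical.

```python
import random


def neutral_excess(genes_per_set, rng):
    """B6 minus CAST count of cis-upregulated genes for one neutral gene set.

    Under the assumed null, the B6 count is Binomial(genes_per_set, 0.5);
    counting the set bits of a random integer draws exactly that.
    """
    up_b6 = bin(rng.getrandbits(genes_per_set)).count("1")
    return up_b6 - (genes_per_set - up_b6)


def simulate_scan(n_sets=100, genes_per_set=50, n_repeats=200, seed=0):
    """Average excess of the top-scoring set in a discovery and a replication cohort."""
    rng = random.Random(seed)
    discovery, replication = [], []
    for _ in range(n_repeats):
        # Discovery cohort: scan all gene sets and keep the largest B6 excess.
        discovery.append(max(neutral_excess(genes_per_set, rng) for _ in range(n_sets)))
        # Replication cohort: an independent draw for the selected set
        # (all sets are exchangeable under this toy null).
        replication.append(neutral_excess(genes_per_set, rng))
    return sum(discovery) / n_repeats, sum(replication) / n_repeats


mean_discovery, mean_replication = simulate_scan()
print(f"mean excess in discovery cohort:   {mean_discovery:.1f}")
print(f"mean excess in replication cohort: {mean_replication:.1f}")
```

In this toy setting the true excess is zero for every set, so the discovery-cohort estimate for the top-scoring set is inflated by selection alone, while the estimate from the independent draw is centered near zero.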
However, when similar results are seen across all four cohorts (or each pair of cohorts from each sex, in the case of sex-specific effects), any such bias becomes extremely unlikely, for the same reason that replication studies of QTLs are not subject to the Beavis effect.

The Morris Water Maze

The Morris Water Maze (MWM) is a widely used tool for testing learning and memory in mice. In this test, mice are placed in a circular pool of water in which an invisible submerged platform is the only means by which the mice can stay above water without swimming. In one version of the test, the mice must find the platform by chance during the initial training trials, after which the platform is removed from the pool. Those mice that recall the location of the platform will tend to spend more time swimming in its expected location, whereas those with poor memory will swim randomly. Two measures of memory accuracy are the fraction of time the mice spend in the correct quadrant of the pool, and the number of times the mice swim over the former location of the platform, during a one-minute trial period. CAST and SPRET each showed essentially no memory in this test, spending 28% and 27% of their time in the correct quadrant, respectively (compared to 25% expected by chance; ref. 6). B6 spent 42% of its time in the correct quadrant, which is significantly greater than 25%. During the trial period B6 crossed the platform's former location an average of 5.9 times, while CAST and SPRET were well under half of this, at 2.4 and 2.0 crossings. Therefore it is apparent that B6 outperforms CAST and SPRET in this memory test. In fact, B6 outperformed all 12 other strains tested by over 50% in the number of platform location crossings, a significant difference (ref. 6).

Another version of the MWM does not remove the platform, but instead records the time required for the mice to find the platform, after initial training.
In this test, B6 mice took an average of 41 seconds to find the platform on the first day, but this dropped to 28 seconds on the second day and 23 seconds by the third (ref. 7). Therefore B6 showed increasing memory of the platform location as the experiment progressed. In contrast, CAST again showed no capacity for learning or memory, spending an average of 56-58 seconds swimming on all three days of the trial (and this is actually misleadingly low, since the experiment was stopped and the mice were placed on the platform by the experimenter after 60 seconds of swimming), with no improvement and thus no apparent memory of the platform's location (ref. 7).

Testing effects of SNPs on microarray and RNA-seq data

SNPs overlapping microarray probes could disrupt hybridization, leading to false cis-eQTL with a directional bias towards higher B6 expression (since the arrays were designed to the B6 genome sequence). If this occurred preferentially in some gene sets, it could lead to a false-positive result in our test. To test for this possibility, we conducted two tests. First, we compiled a list of our array probes that overlap B6/CAST SNPs, and asked whether this set of probes was enriched in any of our significant gene sets (Table 1). No enrichment was seen (hypergeometric p > 0.1 for all gene sets). Second, we excluded these probes from our analysis, and tested whether the same enrichments were observed. In all cases this exclusion had only a negligible effect (the same gene sets were found at high- and medium-confidence). Together these analyses indicate that SNPs disrupting probe hybridization are unlikely to explain our results.

For RNA-seq the situation is somewhat different. When measuring allele-specific expression in an F1 hybrid, annotated SNPs are key to differentiating alleles, but unannotated SNPs will appear to be sequencing errors when the non-reference (CAST) allele is observed.
While a small number (1-2) of errors/unannotated SNPs can be tolerated when aligning short reads, this effect may still cause CAST alleles to be under-represented. To test whether this may cause the B6-upregulated gene sets we observed, we reasoned that while a gene set may well be enriched for SNPs, it is very unlikely to be enriched only for unannotated SNPs (considering that millions of SNPs are already known, and were discovered without regard to the functions of genes they are in, so statistical power and SNP ascertainment bias should not pose problems). Therefore we used annotated SNP enrichment as a proxy for unannotated SNP enrichment, and tested whether either of the two significant gene sets from our RNA-seq analysis (memory and calmodulin binding) was enriched for known SNPs. Neither was (hypergeometric p > 0.2 for both), indicating that their B6-upregulation is unlikely to be an artifact due to unannotated SNPs.

Detecting subtle phenotypic effects

In Figure 4 we show several examples where eQTLs for growth regulators colocalize with QTLs for naso-anal length. This approach can only detect major-effect QTL, given our sample size of 442 F2 mice. One approach to detect more subtle effects is to sum the total number of each F2 individual's B6 alleles at all growth regulator cis-eQTL, and compare this sum to the mass of each F2 individual; a relationship might be found even when each locus is not predictive in isolation. Although this sum was a significant predictor of mass, this was also true for randomly chosen genetic markers, suggesting that too many minor-effect loci exist for this approach to be informative.
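As a minimal sketch of the allele-sum approach described above (not the code used in this study), the example below computes each individual's total number of B6 alleles over a chosen set of loci, correlates that sum with mass, and compares the result to randomly chosen marker sets of the same size. All data are simulated: the genotype matrix, the unlinked markers, the highly polygenic toy trait, and the choice of the first 20 markers as stand-ins for growth-regulator cis-eQTL are assumptions made only for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def b6_allele_sum(genotypes, loci):
    """Per-individual total of B6 alleles (0, 1 or 2 per marker) over the chosen loci."""
    return [sum(row[j] for j in loci) for row in genotypes]


def random_marker_correlations(genotypes, mass, n_loci, n_draws, rng):
    """Correlation of the allele sum with mass for randomly chosen marker sets."""
    n_markers = len(genotypes[0])
    return [
        pearson(b6_allele_sum(genotypes, rng.sample(range(n_markers), n_loci)), mass)
        for _ in range(n_draws)
    ]


# Simulated data (hypothetical, for illustration only).
rng = random.Random(1)
n_mice, n_markers = 442, 200
# F2 genotypes: each unlinked marker carries 0, 1 or 2 B6 alleles in a 1:2:1 ratio.
genotypes = [[rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_markers)]
             for _ in range(n_mice)]
# Highly polygenic toy trait: every marker has a small effect on mass.
mass = [sum(row) * 0.1 + rng.gauss(0, 1) for row in genotypes]

candidate_loci = list(range(20))  # stand-in for the growth-regulator cis-eQTL
observed = pearson(b6_allele_sum(genotypes, candidate_loci), mass)
null = random_marker_correlations(genotypes, mass, len(candidate_loci), 200, rng)
# Fraction of random marker sets that predict mass at least as well.
empirical_p = sum(r >= observed for r in null) / len(null)
print(f"observed r = {observed:.2f}, empirical p = {empirical_p:.2f}")
```

Comparing against random marker sets, rather than against zero correlation, is the design choice that matters here: when a trait is influenced by many minor-effect loci, almost any marker set carries some predictive signal, so a candidate set is informative only if it does better than random sets of the same size.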
Note on trait ascertainment bias

For our previous test of lineage-specific selection on gene expression, choosing genes for the test based on having large expression differences between parental lines would introduce another type of ascertainment bias, since it would enrich for genes that are targeted by multiple “reinforcing” eQTLs (see supplement of ref. 8). We note that this type of filtering is not as much of a problem for the current gene set-based test, since the genes used in this test are ranked based on the strength of their cis-eQTL only. Nevertheless, we recommend not using this type of filtering for this test, as there are some scenarios where requiring a strong parental difference could bias results.

It may appear that for any given gene set, the neutral expectation should be based upon the difference in parental expression levels for each gene in that set, following Orr’s method for adjusting for ascertainment bias (ref. 1). However, this is not the case, as can be demonstrated by the following example. Flips of unbiased and biased coins can be used to represent neutral and selected gene sets, respectively. For a neutral gene set, the cis-upregulating allele is equally likely to come from each parent, so it is perfectly modeled by a fair coin. For sets of unbiased coin flips (with any number of flips per set), the number of “heads” will follow a binomial distribution with 50% heads and 50% tails expected. Biased coin flips will instead yield a biased result: for example, 80% heads are expected if the coin is 80% biased towards heads (e.g., ~80 heads and ~20 tails after 100 flips). The goal of our method is to distinguish between these biased and unbiased sets.
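The coin-flip example can be made concrete with an exact binomial tail probability. The sketch below is illustrative only; the function name `upper_tail` is hypothetical. It evaluates the 80 heads / 20 tails outcome against a fair-coin null and, for contrast, against a null that has been set to the observed 80% bias.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


heads, flips = 80, 100

# Neutral expectation: each flip (cis-upregulating allele) is 50/50.
p_fair_null = upper_tail(heads, flips, 0.5)

# Null "adjusted" to the observed bias: the same outcome is now unremarkable.
p_adjusted_null = upper_tail(heads, flips, 0.8)

print(f"P(>= {heads} heads | fair coin)     = {p_fair_null:.2e}")
print(f"P(>= {heads} heads | 80%-heads coin) = {p_adjusted_null:.2f}")
```

Against the fair-coin null the 80/20 outcome is extremely improbable, whereas against the 80%-heads null it is entirely typical, so a null matched to the observed bias cannot flag the set as biased.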
If we were to adjust each set for the total bias (analogous to the parental difference in expression levels in the absence of trans-acting changes) before applying our method, we would be controlling for the very signal we wish to detect. For example, adjusting the 80 heads/20 tails result to say that our neutral (or unbiased) expectation is 80/20 for this set would guarantee that we would not detect the bias in this set. This illustrates why it is not appropriate to adjust each gene set based on the parental gene expression levels; as long as these are free of ascertainment bias as discussed above, the neutral expectation is a binomial distribution with 50% expected cis-upregulation from each strain’s alleles. For further discussion of ascertainment bias issues in sign tests, see the supplement of ref. 8.

Note on positive selection vs. relaxed negative selection

The following paragraph is reproduced (with some modifications) from the supplement of ref. 8. It is important to note that this test of lineage-specific selection cannot distinguish between positive selection for altered gene expression levels vs. a relaxation of negative selection, combined with a bias in the directionality of mutational effects. The following scenario illustrates how relaxed negative selection can lead to a pattern of cis-eQTL with biased directionality in a gene set. Imagine a gene set whose expression is under strong negative selection in one lineage, so that no eQTL accumulate in this lineage, but (for whatever reason) is under no selection in another lineage. In the unselected lineage, mutations causing cis-eQTL will accumulate. If these neutral mutations are equally likely (under the null) to be up- or down-regulating, then the selection test will be a faithful indicator of positive selection. However, if they are biased in one direction, then this will appear as an excess of cis-eQTL acting in one direction.
Such a bias is likely to exist for new mutations (prior to selection) to down-regulate gene expression. This can be seen in two ways. First, since the vast majority of random nucleotide sequences do not drive significant levels of transcription, it stands to reason that mutations bringing a cis-regulatory region closer to a random sequence will tend, on average, to down-regulate any transcribed gene. More direct evidence for this comes from saturation mutagenesis studies where every possible base substitution or single-base deletion is engineered into a promoter region, and the resulting gene expression is measured. For both mammalian and bacteriophage promoters, the vast majority of mutations that affect expression result in down-regulation (96.1% in three mammalian promoters and 99.8% in three bacteriophage promoters, at a 2-fold change cutoff; ref. 9). Although occasional mutations can result in up-regulation, the observation of consistent cis-acting up-regulation of genes in a gene set along one lineage likely indicates positive selection. We note that this relaxation of selection effect applies equally to Orr's test; thus Orr’s test may be more appropriately referred to as a test of lineage-specific selection, rather than a test of positive selection.

We note that we do not expect that fixed mutations will necessarily be so overwhelmingly down-regulating, since these are the very biased subset that has survived the gauntlet of selection and drift. For the purposes of our arguments, the relevant quantity is the fraction of new mutations that are down-regulating, and so our use of saturation mutagenesis data is appropriate.

The McDonald-Kreitman (MK) test (ref. 10), a widely used test of selection, is quite similar to ours in that either relaxed negative selection or positive selection can result in the same effects. For the MK test, this fact has been noted [e.g. ref. 11 and references therein], but it is almost universally ignored when applying the test; the result is nearly always assumed to reflect positive, and not relaxed negative, selection. In any case, it is fair to say that our test reflects positive selection in much the same way as the MK test, in the sense that neither can distinguish positive from relaxed negative selection.

Note on overlap between cis-eQTLs from separate cohorts

In describing Figure 2 we note that the specific genes implicated as cis-eQTLs within the gene sets shown (mitochondria in Fig 2a and adult locomotory behavior in Fig 2b) show extensive overlap. In the mitochondria gene set, there are 112-126 B6-upregulated genes in each cohort (some of these are within 2 Mb of one another in the genome, so were excluded from analysis to ensure independence of cis-eQTLs; thus Fig 2a shows numbers lower than this). 83 of these are found as targets of cis-eQTLs in all four cohorts, and an additional 29 are sex-specific (11 to females and 18 to males), as they appear in both cohorts of one sex, but neither of the other. In the adult locomotory behavior gene set, there are 12-14 CAST-upregulated genes. Eight of these are found as targets of cis-eQTLs in all four cohorts, and another two are sex-specific (one in males and one in females). For a complete list of genes from Fig 2, see Supplemental Table 1.

Supplemental References

1. Orr HA. Genetics 149, 2099-2104 (1998).
2. Voight BF, et al. PLoS Biol 4, e72 (2006).
3. Eyre-Walker A. Genetics 162, 2017 (2002).
4. Macpherson JM, et al. Mol Biol Evol 25, 1025 (2008).
5. Sawyer SA, Hartl DL. Genetics 132, 1161 (1992).
6. Brown RE, Wong AA. Learn Mem 14, 134-144 (2007).
7. Le Roy I, et al. Behav Brain Res 95, 135-142 (1998).
8. Fraser HB, et al. PNAS 107, 2977 (2010).
9. Patwardhan RP, et al. Nat Biotechnol 27, 1173 (2009).
10. McDonald JH, Kreitman M. Nature 351, 652 (1991).
11. Hughes AL. Heredity 99, 364 (2007).
