Sequence based characterization of structural variation in the mouse genome

Binnaz Yalcin1†, Kim Wong2†, Avigail Agam1†, Martin Goodson1†, Thomas M. Keane2, Leo Goodstadt1, Amarjit Bhomra1, Polinka Hernandez-Pliego1, Helen Whitley1, James Cleak1, Deborah Janowitz1, Richard Mott1, David J. Adams2,*, Jonathan Flint2,*

1The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK
2The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
3MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK

†Co-first authors

*Correspondence to:
Dr. David Adams, Wellcome Trust Sanger Institute, Hinxton, Cambs, CB10 1SA, UK. Ph: +44 (0) 1223 86862; Fax: +44 (0) 1223 494919; Email: [email protected]
Prof. Jonathan Flint, Wellcome Trust Centre for Human Genetics, Oxford, OX3 7BN, UK. Ph: +44 (0) 1865 287512; Fax: +44 (0) 1865 287501; Email: [email protected]

Abstract

The importance of structural variants (SVs) in DNA as a cause of quantitative variation and as a contributor to disease is unknown, but without knowing how many SVs there are, and how they arise, it is difficult to discover what they do. Combining experimental and automated analysis of the mouse genome sequence, we identified 0.7M SVs in thirteen classical and four wild-derived inbred mouse strains. The majority of SVs are less than 1 kilobase in size and 98% are deletions or insertions. The breakpoints of 58% were mapped to base pair resolution allowing us to confirm that insertion of retrotransposons causes more than half of SVs. Yet, despite their prevalence, SVs are less likely than other sequence variants to cause gene-expression or quantitative phenotypic variation. We identified only 22 SVs that disrupt coding exons, acting as rare variants of large effect on gene function. One third of the genes so affected have immunological functions.
Our catalogue provides a starting point for the analysis of the most dynamic and complex regions of genomes from a genetically tractable model organism. [178 words]

Introduction

Structural variation (SV) is believed to be widespread in mammalian genomes {Iafrate, 2004 #91;Korbel, 2007 #93;Sebat, 2004 #90;Tuzun, 2005 #92} and an important cause of disease {Buchanan, 2008 #88;Hurles, 2008 #75}, but just how abundant and important structural variants (SVs) are in shaping phenotypic variation remains unclear. Understanding what SVs do depends on understanding what they are, where they occur and how they arise: large, recurrent SVs (SVs occurring in multiple individuals with clustering of breakpoints and sharing a common interval and size) coinciding with genes are far more likely to contribute to phenotypic variation than small nonrecurrent SVs within intergenic regions.

The preeminent organism for modeling the relationship between phenotype and genotype, including SVs, is the mouse, but our catalogue of SVs in this animal is incomplete. Estimates of SV numbers, and of the proportion of the mouse genome they occupy, vary considerably, from figures of a few hundred to over 7,000 {Li, 2004 #322; Snijders, 2005 #320; Graubert, 2007 #128; Egan, 2007 #98; Cutler, 2007 #321; Quinlan, 2010 #52}, affecting from 3.2% to more than 10% of the genome {Cahan, 2009 #270; Henrichsen, 2009 #269}. Incompleteness and inconsistencies are largely due to reliance on differential hybridization of genomic DNA to oligonucleotide arrays (REF TO AGAM, 2010?), a technology blind to some SV categories (such as inversions and insertions) with only limited ability to detect others (segmental duplications and transposable elements). Sequence-based methods of SV detection, with higher resolution and greater sensitivity, have so far had limited application {Akagi, 2008 #99;Quinlan, 2010 #52}.

Along with SV catalogues, we need to know how SVs arise, as this will tell us what SVs may or may not do.
Recurrent events, hitting genes, will have different consequences from stably inherited SVs in intergenic regions. The major molecular mechanism producing SVs in the mouse genome is believed to be retrotransposition {Akagi, 2008 #99;Quinlan, 2010 #52}, which may account for more than 80% of SVs between 100 nucleotides and 10 kilobases in length {Akagi, 2008 #99}. In cell culture, about 10% of LINE-1 insertions delete DNA {Gilbert, 2002 #136;Symer, 2002 #135}, a process that also occurs in mouse genomic DNA {Garvey, 2002 #137}. While this suggests SV formation is recurrent {Egan, 2007 #98}, it is not known to what extent retrotransposons, or other mechanisms of SV formation, contribute to mouse phenotypic variation and disease.

What we know about the impact of SVs on phenotypes in the mouse comes primarily from analyses of gene expression. Up to 28% of the between-strain variation in gene expression in hematopoietic stem and progenitor cells has been attributed to SVs {Cahan, 2009 #131}; for genes lying within SVs, the latter account for between 66% and 74% of between-strain expression variation in kidney, liver, lung and testis {Henrichsen, 2009 #89}. If the genome is replete with SVs, and given that their influence on gene expression could extend up to 500 Kb from their margins {Henrichsen, 2009 #89}, then SVs might be responsible for a considerable fraction of heritable gene expression variance. Since gene expression variation is believed to contribute to variation in phenotypes in the whole organism {Schadt, 2005 #315}, SVs may turn out to have a major role in the genetic determination of all aspects of mouse, and mammalian, biology.

In this paper we use next generation sequencing to address three critical questions: what is the extent and complexity of SV in the mouse genome, what are the likely mechanisms for SV formation, and to what extent do SVs contribute to phenotypic variation?
We report the identification, using short-read paired-end mapping, of 0.7M SVs in 17 inbred strains of mice. By analyzing breakpoint sequence we infer the mechanisms of formation and assess their relative impact on shaping a mammalian genome. Our molecular characterization of SVs in the mouse genome allows us to determine the extent to which SVs contribute to genetic and phenotypic diversity.

Results

SV identification

We identified almost three quarters of a million SVs, relative to the reference genome C57BL/6J, in 17 mouse strains, far more than previously recognized (Fig. 1a) and consisting of a greater variety of molecular structures (Fig. 1b and 1c). To understand why we found more, and to explain our results, we start by explaining how we went about finding SVs.

We combined visual inspection of short-read sequencing data with molecular validation to improve automated SV detection across the genome. We used two criteria to identify SVs manually: read depth and anomalous paired-end mapping (PEM). We did this using data from the mouse’s smallest chromosome (19) in its entirety, and a random set of other chromosomal regions, for eight strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). Based on read depth and PEM we expected to find eleven patterns that classify SVs. We refer to these as type H (“High-confidence”) patterns (H1-H11: Supplementary Fig. 1). For example, some deletions and inversions leave precise, easily identifiable signatures (Fig. 1d). In addition, we found ten patterns whose interpretation was ambiguous. We refer to these as type Q (“Query”) patterns (Q1-Q10: Supplementary Fig. 1, Fig. 1e).

We investigated the molecular structure of all 21 patterns using a PCR strategy (Supplementary Fig. 2, Supplementary Methods). We designed 575 pairs of primers (Supplementary Table 1) and successfully amplified 538 SV regions across eight classical inbred strains (Supplementary Table 2).
Our categorization of predicted SV structures, based on manual inspection of PEM patterns, resulted in the confident identification of an SV for nineteen of the 21 patterns in all instances that we examined by PCR (Supplementary Table 3). Two patterns were always false (Q6 and Q10), and arose because of the presence of a retrotransposed pseudogene giving mapping errors. Recognizing these patterns, we were able to predict underlying SV structure with high confidence. PCR confirmed that 12 patterns were indicative of a single SV and six patterns indicative of multiple adjacent SVs (Supplementary Table 3). However, SVs of type Q7 (45 cases) were due to a variable number tandem repeat, for which we could not predict the number of repeats or molecular structure.

Available automated methods to identify SVs are unable to differentiate all 19 PEM patterns, and may also classify some SVs incorrectly; for example, the PEM patterns of linked insertions (Q5 and Q9: Supplementary Fig. 1) are similar to those for inversions or deletions. Therefore we adapted automated methods to recognize 15 types (Q1, Q2, Q3 and Q7 could not be unambiguously identified) identified by manual inspection and PCR validation (Supplementary Methods and {Wong, 2010 #64}).

Sensitivity and specificity analyses

We established false positive and false negative rates for the automated analysis in three ways. First, we used our manually identified set of SVs on chromosome 19 (Supplementary Table 4) where we found 932 deletions (684 type H and 248 type Q), 15 inversions (2 type H and 13 type Q) and three copy number gains (all type H). Automated analysis of chromosome 19 detected between 83% and 86% of manually-called deletions (at least 50 bp in size) depending on the strain (Supplementary Table 5a). The false positive rate ranges from 3.1% to 4.6% (Supplementary Table 5b).
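As an illustration of how rates of this kind can be computed, the following minimal sketch compares automated deletion calls with a manually curated set on a single chromosome. The 50% reciprocal-overlap matching rule, the function names and the toy coordinates are assumptions made for the example; they are not the matching criteria used in this study.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, same chromosome


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Overlap length as a fraction of the longer of the two intervals.

    A value >= x means each interval is covered by the other to at least x.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[1] - a[0], b[1] - b[0])


def call_rates(manual: List[Interval], automated: List[Interval],
               min_ro: float = 0.5) -> Tuple[float, float]:
    """Return (sensitivity, false positive rate).

    Sensitivity: fraction of manual calls matched by an automated call.
    False positive rate: fraction of automated calls with no manual match.
    """
    detected = sum(
        any(reciprocal_overlap(m, a) >= min_ro for a in automated)
        for m in manual
    )
    unsupported = sum(
        not any(reciprocal_overlap(a, m) >= min_ro for m in manual)
        for a in automated
    )
    return detected / len(manual), unsupported / len(automated)


# Hypothetical toy data: four curated deletions, three automated calls.
manual_calls = [(100, 500), (1000, 1400), (5000, 5600), (9000, 9200)]
automated_calls = [(110, 490), (1050, 1420), (7000, 7300)]
sensitivity, fp_rate = call_rates(manual_calls, automated_calls)
print(f"sensitivity={sensitivity:.2f}, false positive rate={fp_rate:.2f}")
```

A stricter or looser overlap threshold changes both rates, which is one reason the matching rule needs to be reported alongside the figures.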
Second, to ensure that our sensitivity and specificity analyses were not vitiated because we used chromosome 19 as a training set for the automated analysis, we derived a second, smaller, set of manually curated deletions from a randomly chosen 10 Mb region (101 Mb to 111 Mb) from chromosome 3 in the strain C3H/HeJ. Automated analysis of this region correctly identified 43 (82.7%) and called 2 false positive deletions (4.4%). Third, we investigated the false negative rate for the automated detection of deletions across the genome using a PCR validation set of 267 simple deletions. Consistent with the chromosome 19 and chromosome 3 analyses, we found that the false negative rate for deletions was between 9% and 15% (Supplementary Table 6a).

We could not assess the performance of automated analysis to detect SV types other than deletions using calls derived from manual inspection of chromosome 19 because so few of these rearrangements were called. To do this, we turned to PCR-based validation of insertions, inversions and tandem duplications (n=62 to n=76) and found that the average false negative rate was higher than for deletions, ranging from 21% to 33% per strain (Supplementary Table 6b). Automated analysis was less successful in detecting the more complex rearrangements, with 25% to 38% false negative rates (n=46 to n=54).

SV categories

The results of the detection and classification of 711,923 SVs across the entire genome of 17 strains are shown in Table 1. There are on average 26,000 SVs in classical inbred strains, and 92,000 in wild derived inbred strains, affecting 1.2% (33.0 Mb) and 3.7% (98.6 Mb) of the genome respectively. SVs smaller than 100 bp are excluded because, below this size, it is difficult to determine whether the deviation in distance between two paired-end reads is due to variation in the library insert size distribution or due to paired ends flanking a structural variant.
However, we know from the chromosome 19 analysis that there are relatively few SVs in this size range. Therefore our catalogue does not omit any abundant classes of SV.

Table 1 classifies SVs into two groups: 99.4% are simple and 0.6% are complex. Simple SVs include those whose biological interpretation is straightforward: insertions, deletions and inversions. We separately identify one type of insertion, a copy number gain, consisting of non-repetitive DNA that is present in multiple copies in other strains. When this sequence occurs immediately adjacent to its original, it is annotated as a tandem duplication.

It is less clear to what extent the more complex categories we found represent different biological categories. Complex SVs consist of a mixture of events that abut each other. Sometimes the mixture is simply because two or more simple SVs occur next to each other: given the density of deletions in the genome, the 2,132 deletions that we found separated by less than 250 bp could have occurred by chance (two of our PEM patterns, H3 and Q5). However, we recognize as a separate category SVs that are immediately adjacent to each other, with no intervening DNA, since we suspect that these might be the progeny of a single biological process (marked as Del+Ins and Del+Ins/Inv in Table 1). Thus, intriguingly, we noted that half of the inversions co-occur with an insertion or deletion (Fig. 3a). We also separately identify an SV within a copy number gain (termed “nested” in Table 1) since the probability of coincidence is less than one event per genome.

SV formation

Homology at SV breakpoints, as well as the content of sequence within SVs and the SV’s ancestral state, was used to infer the likely mechanism of formation for simple SVs (Supplementary Fig. 3). To obtain breakpoint sequence, we performed de novo local assembly at 81.3% of deletions and 74.2% of non-transposable element insertions {Wong, 2010 #64}.
Comparison of 1,314 predicted breakpoints to the breakpoints delineated by PCR and sequencing (Supplementary Table 7; Supplementary Methods) revealed that 57.7% of breakpoint predictions are exact and 86.5% are within 20 bp (Supplementary Table 8a). In cases where the local assembly strategy failed, we relied on breakpoints obtained from mapping reads to the reference genome: 83.3% are within 100 bp of the actual breakpoint (Supplementary Table 8b). Breakpoint accuracy for insertions, inversions and copy number gains is presented in Supplementary Tables 8c, 8d and 8e, respectively.

Using rat and SPRET/EiJ as outgroups, we classified 19% of SVs as ancestral deletions, 57% as ancestral insertions and the remainder (24%) were indeterminate. We examined the sequence features of 40 SVs that failed the outgroup analysis and found that in every case the regions contained highly repetitive DNA, consisting primarily of transposon and transposon related sequence.

Classification of SVs and their size characteristics are summarized in Figure 2. The main class of SVs consists of those formed by mechanisms involving retrotransposons (LINEs (25%), LTRs (14%) and SINEs (15%)), followed by variable number tandem repeats (VNTRs) (15%), pseudogenes (2%) and other mechanisms not involving retrotransposons (29%) (Fig. 2a). We found that the median length of all SVs is 349 bp, with modes at 100 bp and 6,400 bp, LINE insertions comprising the majority of the larger insertions (Fig. 2b). Estimates of the proportion of SVs categorized as VNTRs fell to 4.9% when we required the whole of the SV to be overlapped by a VNTR and the flanking sequence to have VNTR content of the same periodicity (>= 5 bp). Outgroup analysis showed that the transposon-derived SVs arose, as expected, almost exclusively from ancestral insertion events (98.8%).
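The breakpoint accuracy figures quoted above reduce to comparing each predicted breakpoint with its PCR-validated position. A minimal sketch of that comparison follows; the helper name, the 20 bp default tolerance passed as a parameter, and the coordinates are hypothetical and serve only to show the arithmetic.

```python
from typing import Sequence, Tuple


def breakpoint_accuracy(predicted: Sequence[int], validated: Sequence[int],
                        tolerance: int = 20) -> Tuple[float, float]:
    """Return (fraction exact, fraction within `tolerance` bp).

    `predicted` and `validated` are paired lists of breakpoint coordinates
    for the same SVs, in the same order.
    """
    if len(predicted) != len(validated):
        raise ValueError("predicted and validated must be paired")
    offsets = [abs(p - v) for p, v in zip(predicted, validated)]
    n = len(offsets)
    exact = sum(d == 0 for d in offsets) / n
    near = sum(d <= tolerance for d in offsets) / n
    return exact, near


# Hypothetical coordinates for five breakpoints.
predicted_bp = [1000, 2500, 4010, 8000, 9150]
validated_bp = [1000, 2500, 4000, 8003, 9000]
exact, near = breakpoint_accuracy(predicted_bp, validated_bp)
print(f"exact={exact:.1%}, within 20 bp={near:.1%}")
```

Note that the "within tolerance" fraction includes the exact matches, as it does in the percentages reported in the text.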
As expected, microhomology (12-16 bp) surrounds the breakpoints of LINE and SINE derived SVs (known as target site duplication) and shorter (6-8 bp) stretches are associated with LTR SVs (Fig. 2c). Non-repeat mediated SVs are mainly a result of ancestral deletion events (79%), and are associated with short microhomologies, up to 7 bp in length, consistent with either microhomology-mediated break-induced replication (MMBIR) or microhomology-mediated end joining (MMEJ). Table 2 gives genome-wide estimates of mechanisms of SV formation. MORE HERE?

We found that in all cases the presence of SNPs in the microhomology region was correlated with the presence of the SV (Fig. 3b). In every case the SNP elongates the microhomology. However, this phenomenon is rare: we only observed five (4.5%) cases amongst our 113 manually-curated ancestral deletions (Supplementary Table 7) where a SNP and SV formation cosegregate. We found a similar relationship between a SNP formed in a target site duplication and the presence of an ancestral insertion (Fig. 3c). 15 ancestral insertions (16%) had SNPs or short indels within their target site duplication, coincident with the insertion (Supplementary Table 7).

Given their potential role in disease {Stankiewicz, 2010 #325}, we were interested to document the occurrence of recurrent SVs, those that arise at the same genomic locus independently in unrelated individuals. Nonallelic homologous recombination (NAHR) is the major mechanism for recurrent SVs {Gu, 2008 #352}, while fork stalling and template switching (FoSTeS) and/or microhomology-mediated break-induced replication (MMBIR) mechanisms may be important for non-recurrent SVs {Zhang, 2009 #26}. We looked for SVs occurring at the same locus, but with different breakpoints, indicating independent origins.
Using the SV breakpoints obtained from PCR sequencing (over 4,000 breakpoints; Supplementary Table 7), we found that in the classical strains, only 2.5% of deletions at the same locus had different breakpoint sequences. However, within all 17 strains, we found multiple alleles at 12% of SVs, due almost entirely to the presence of different alleles originating from the wild-derived inbred strains (Supplementary Table 7). Consistent with the low frequency of recurrent SVs, breakpoint features associated with NAHR are rare. Using Vmatch (http://www.vmatch.de/) to detect similarity between the sequences flanking and internal to all deletion SVs, we estimated that 0.25% of deletions are due to NAHR, when we required a signature of >=200 bp of >=90% sequence identity. Two analyses, therefore, indicate that recurrent SVs are rare.

Impact of SVs on gene function

We assessed the impact of SVs on phenotypes in three ways: (i) the relationship between the position of SVs and the position of genes; (ii) changes in expression of genes overlapping, or nearby, an SV; (iii) association between SVs and phenotypes in an outbred population of mice.

Across all strains, SVs overlap 10,291 genes, reducing to 5,115 genes when only the classical laboratory strains are considered (Supplementary Table 10). We investigated whether this represented enrichment or depletion by comparing the number of SVs that overlapped genes to a distribution obtained by permutation (Supplementary Table 11). We found that SVs are, in all strains except C57BL/6N, significantly depleted (P<0.01) in genes (fold change 0.91), introns (mean fold change 0.93), exons (fold change 0.22; including C57BL/6N) and promoter regions (fold change 0.77). However, we found that SINE insertions are significantly enriched in the introns of genes (P<0.01, fold change 1.34).

The relative depletion of SVs within genes implies a proportionate deficit in their phenotypic consequences.
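The permutation comparison behind the fold changes reported above can be illustrated with a small sketch. This is an assumption about the general approach, not the procedure used in this study: it shuffles SV start positions uniformly along one toy chromosome, keeps SV lengths fixed, and reports the fold change (observed overlaps divided by the mean of the permuted overlaps) with a one-sided empirical P-value for depletion. All names and coordinates are hypothetical.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open


def count_overlaps(svs: List[Interval], genes: List[Interval]) -> int:
    """Number of SVs that overlap at least one gene."""
    return sum(
        any(s < g_end and g_start < e for g_start, g_end in genes)
        for s, e in svs
    )


def permutation_test(svs: List[Interval], genes: List[Interval],
                     chrom_len: int, n_perm: int = 1000,
                     seed: int = 0) -> Tuple[int, float, float]:
    """Return (observed overlaps, fold change, empirical P for depletion)."""
    rng = random.Random(seed)
    observed = count_overlaps(svs, genes)
    permuted = []
    for _ in range(n_perm):
        shuffled = []
        for s, e in svs:
            length = e - s
            # Place the SV uniformly so that it stays on the chromosome.
            start = rng.randrange(0, chrom_len - length + 1)
            shuffled.append((start, start + length))
        permuted.append(count_overlaps(shuffled, genes))
    mean_perm = sum(permuted) / n_perm
    fold = observed / mean_perm if mean_perm > 0 else float("nan")
    # One-sided: how often do permutations show as few overlaps as observed?
    p_depletion = (1 + sum(c <= observed for c in permuted)) / (n_perm + 1)
    return observed, fold, p_depletion


# Toy example: one gene covers half the chromosome; all SVs lie outside it.
genes = [(0, 50_000)]
svs = [(60_000 + i * 1000, 60_000 + i * 1000 + 200) for i in range(20)]
observed, fold, p = permutation_test(svs, genes, chrom_len=100_000)
print(observed, fold, p)
```

A realistic analysis would also need to preserve features that uniform shuffling ignores, such as chromosome-specific SV density and unmappable regions; the sketch leaves these out for brevity.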
We confirmed this hypothesis first by finding that SVs are relatively unlikely to be the cause of cis-acting expression QTLs. We examined 833 cis-acting eQTLs that influence expression of transcripts from the hippocampus and liver of outbred mice derived from eight of the sequenced strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J) {Huang, 2009 #330}. Applying a test that discriminates variants that are likely to be functional from those that are not {Yalcin, 2005 #338}, we found that SVs are significantly less likely to be causal variants (0.3% compared to 0.9%, P < 2.2E-13, χ² = 53.8).

We next asked whether SVs were enriched among those variants most likely to contribute to variation in the abundance of transcripts obtained from the whole brain RNAseq analysis of 15 strains {Trapnell, 2010 #342}. Taking into account the relationship between strains, we tested for association between 11,245 transcripts and all variants lying within 25 Kb upstream and downstream of the transcript’s genomic locus. 5,337 transcripts had a single variant with a minimum P-value for association, which we assumed would include true causal variants. Figure 4a shows the distance to the transcriptional start site from the position of the single most significant variant, for these 5,337 transcripts, regardless of the P-value for association. Figure 4b shows the same information but only for variants where the P-value is less than 0.0001. There is a significant enrichment of variants with low P-values lying less than 2,000 bp upstream of the transcriptional start site (P = 0.00013, χ² = 59.3, df = 25). None of the 225 variants were SVs, significantly less than the frequency of SVs in the complete set of variants tested (χ² = 4.8, P = 0.025 by simulation).

The proportion of variation in gene expression attributable to SVs is small.
Figure 4c shows scaled variances in gene expression from brain RNAseq data measured between and within strains for five categories of SVs. Assuming variation within strains is due to environmental factors and variation between strains is due to both environmental and genetic factors, the difference between the two variances is a measure of heritability and we can apportion that attributable to the SVs {Henrichsen, 2009 #89}. No category of SV accounts for more than 10% of the heritability.

Since many transcripts overlap multiple small SVs (median of 3, maximum of 216), we hypothesized that heritability might be related to the amount of gene overlapped. For each transcript we summed the amount of DNA overlapping a gene and expressed this as a proportion of the total length of the gene. Overlap proportions of 50% or more make a disproportionately large contribution to heritability: at loci with SVs in this category, SVs contribute to 25% of the variance, compared to 7.8% for transcripts where SVs overlap less than 50% of the gene. However, large overlaps (50% or more) are rare, affecting less than 3% of transcripts. Thus while SVs make a modest contribution to the overall heritability of expression variance, at individual transcripts, they may be the main cause of between-strain differences in expression.

We also observed only small effects on gene expression from SVs that lie outside a gene. Figure 4d shows between and within strain variances for SVs lying at distances from less than 2 Kb to more than 40 Kb from transcripts with no SV overlap (the density of SVs meant that we found too few transcripts further than 60 Kb from an SV to analyze). For these analyses we measured the closest distance to either the start or end of the gene. Heritability attributable to SVs within 2 Kb of the gene is 2%, and falls as the distance from the gene increases.
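The variance comparison described above can be sketched for a single transcript as follows. The definitions used here are assumptions for illustration: within-strain variance is taken as the mean of the per-strain sample variances, between-strain variance as the sample variance of the strain means, and their difference as the genetic component. The scaling applied in Figure 4 and the apportioning of heritability to SVs are not reproduced, and the expression values are invented.

```python
from statistics import mean, variance
from typing import Dict, List, Tuple


def variance_components(
    expr_by_strain: Dict[str, List[float]]
) -> Tuple[float, float, float]:
    """Return (between-strain variance, within-strain variance, difference).

    Each strain needs at least two replicate measurements, and at least
    two strains are required.
    """
    within = mean(variance(values) for values in expr_by_strain.values())
    between = variance([mean(values) for values in expr_by_strain.values()])
    return between, within, between - within


# Hypothetical expression values for one transcript in three strains.
expression = {
    "A/J": [1.0, 3.0],
    "DBA/2J": [5.0, 7.0],
    "LP/J": [9.0, 11.0],
}
between, within, genetic = variance_components(expression)
print(between, within, genetic)
```

Repeating the calculation separately for transcripts grouped by SV category, or by distance to the nearest SV, would give the kind of per-category comparison that the text describes.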
SVs are unlikely to be the causative variant at QTLs, as we know from genetic association with 100 phenotypes measured in over 2,000 heterogeneous stock (HS) mice {Valdar, 2006 #177}. We applied a test of functionality {Yalcin, 2005 #338} to 281,246 SVs where we were certain that the strain distribution pattern (SDP) was correct (Supplementary Methods). Relatively high rates of SV mutation {Egan, 2007 #98} might invalidate the imputation (the HS animals are at least 60 generations distant from the sequenced strains), so we genotyped 100 HS animals using a high-density array ({Agam, 2010 #54}, Supplementary Methods). 194 deletions could be genotyped on the array (with an additional 47 deletions when we allow for non-segregating SVs in the HS). In every case imputation correctly predicted the logP obtained from ANOVA carried out using the array-based genotypes. We identified 290 QTLs where SVs were among the variants most likely to be functional, but in all these cases the SVs were only a subset of the total number of functional variants. Just as with the cis-expression QTLs, we found a small but significant deficit in SVs among the functional variants (0.36% compared to 0.54% among the non-functional, P < 1E-16, χ² = 72.1).

SVs that affect phenotypes

While SVs make a relatively small contribution to the total amount of quantitative phenotypic variation, at a small number of loci they are the cause of variation. We identified two categories of SVs that have large functional consequences: the first are those at QTLs, and the second are those that disrupt the coding exons of genes.

As shown in our companion paper {Keane, 2011 #345}, larger effect QTLs are more likely to arise from SVs (and see Supplementary Figures 4a, 4b and 4c). We identified 12 QTLs where the SV overlapped a gene with its flanking region (2 Kb up or downstream of a gene), and where the QTL effect size is in the top 5% of the distribution.
Table 3 lists these SVs, the genes they affect and the putative phenotype with which they are associated. Complementation of the deletion of the H2-Ea promoter confirmed the effect of this SV on the T-cell phenotype (). Eps15 -/- male mice exhibited a significantly lower activity in the open field arena (Supplementary Fig. 5) compared to matched wild type male mice. Further work is needed to confirm the other candidate genes.

We identified 22 SVs that affect coding exons, including 5 SVs that encompass a gene (or several) in its entirety. Table 4 gives positional information about these SVs, the genes they affect (genes that are affected in their entirety are indicated by an asterisk), how they formed, their strain distribution pattern (SDP) and their known function as reported in the current literature. Remarkably, a third of the genes affected by an SV are involved in immunity and infection. Five of the 22 SVs are already known {Best, 1996 #346;Boyden, 2008 #347; Morrison, 2002 #348; Nelson, 2005 #349; Persson, 1999 #350}; the remaining 17 SVs are novel. The sequence level data, on multiple strains, casts new light on all five known cases: for example, the known deletion in Fv1 is in fact a deletion with an insertion, and the mutation in the Skint locus is in fact an insertion, as is the mutation in the Trim locus (Fig. 5a).

Discussion

Our results are important in three respects: first we find an unexpectedly large number of SVs with diverse molecular architecture, thus providing a catalogue of the most dynamic and variable regions of the mouse genome. Second, we identify breakpoints at nucleotide level resolution, giving a genome wide picture of how SVs originate. Third, we demonstrate that, despite their abundance, SVs make relatively little functional impact, as assessed by their effects on gene expression and phenotypic variation in the whole animal.
We were able to find more SVs, of greater complexity, because we relied on manual inspection of the PEM results, combined with molecular validation, before using automated calling methods. Previous studies have revealed the noisiness of sequence-based methods of SV calling {Kidd, 2010 #49;Korbel, 2007 #93;Quinlan, 2010 #52}, due in part to the multiplicity of forms and the presence of insertions, deletions and inversions often in close proximity to each other, and the difficulty of mapping sequence reads back to repetitive genomes. Nevertheless, we have shown here that it is possible to calibrate automated methods to generate genome-wide SV calls of high accuracy.

The SVs we find have two distinguishing characteristics: first, typically they are small. For deletions, whose size we know accurately, the median is 385 bp. In comparison, the median size of SVs in a recent high-density array analysis of the genomes of 20 laboratory strains was 9 Kb {Cahan, 2009 #131} and about 1.9 Kb from a PEM analysis of DBA/2 {Quinlan, 2010 #52}. Second, their density means that we frequently find regions with high concentrations of small rearrangements. These two features emphasize the need for methods of SV identification at base pair, or near base pair resolution. Otherwise not only are many SVs missed, but those recognized are misclassified: a mixture of small deletions and insertions will be mistaken for a large SV of a single type {Agam, 2010 #54}.

Our second important finding is the catalogue of SV mechanisms based on breakpoint sequence. We were able to map almost 60% of deletions to base pair resolution, allowing us to classify SVs by the mechanism that created them. We find that the primary origin of structural variation between mouse strains is attributable to LINE-1 retrotransposons.
For reasons still unexplained, mice differ from humans, in whom LINE-1 retrotransposition comes third after microhomology-mediated processes and nonallelic homologous recombination as the predominant processes in generating SVs {Kidd, 2010 #49}. In contrast to human SV studies, the great majority of SVs we have discovered are non-recurrent rearrangements, based on two observations: among the classical strains, only 2.5% of deletions at the same locus had different breakpoint sequences and less than 1% of deletions are due to NAHR, the mechanism thought to be responsible for the majority of recurrent SVs in humans {Stankiewicz, 2010 #325;Gu, 2008 #352}.

Our third important observation is that SVs have relatively little impact on gene function. Results from human genome-wide association studies have revealed that common SNPs (minor allele frequency > 5%) explain only a part of trait heritability, suggesting that SVs might be a major unrecognized contributor to phenotypic variation {Manolio, 2009 #318}. Available evidence has not yet resolved whether or not this is so. Analysis of human lymphoblastoid cell lines attributed at least 8.5-17.7% of heritable gene expression variation to CNVs {Stranger, 2007 #84}. Importantly, this heritability was not shared with common SNPs, potentially making CNVs a contributor to the missing heritability of GWAS. In mice, SVs overlapping a gene were estimated to contribute to a substantial proportion of between-strain expression variance (up to 74%) {Henrichsen, 2009 #89}, which, together with the prevalence of SVs in the genome, implies that they might be responsible for a considerable fraction of heritable gene expression variance. If the genetic basis of gene expression were a model for understanding the molecular basis of other phenotypes, then SVs would be a major player.
However, two recent analyses of the association between SVs and disease phenotypes in humans provide little support for this view: common SVs are no more likely than common SNPs to contribute to phenotypic variation {Conrad, 2010 #44;Craddock, 2010 #319}. However, rare CNVs (minor allele frequency < 5%) of large effect (odds ratio > 2), that could not be detected using the technologies available, might still be important contributors.

Our findings make three important contributions to this debate. First, we find that SVs overlapping a gene make a small contribution to variation in gene expression, accounting for less than 10%, and we find limited evidence that they affect the expression of flanking genes. This might be due to our analysis of very large numbers of small SVs, but we find that even when SVs overlap more than 50% of a gene they account for less than a third of the heritability. The most likely explanation is that previous array based studies conflated under one apparently large SV the effects of numerous smaller rearrangements together with regions of diploid DNA, containing other variants that influenced gene expression.

Second, our analysis of the phenotypic consequences of SVs on QTLs for multiple phenotypes also points to a relative deficit of SVs as the molecular basis of complex phenotypes. By working with an outbred population where all chromosomes are descended from known progenitors, imputation effectively reconstitutes the genomes of all animals, so that we can detect the effects of all variants, both common and rare. Our results indicate that common and rare SVs make less of a contribution to phenotypic variation than we would expect given their abundance in the genome. However, the outbred population we tested is derived from inbred progenitors whose homozygosity will have purged their genomes of variants that could be maintained in heterozygous freely mating populations.

Third, we identified 22 SVs that delete one or more exons.
These SVs, with large effects on a phenotype, are the equivalent of rare variants found in human populations. In mouse populations they are very rare indeed. Our analysis has highlighted those QTLs where SVs are likely to be the responsible molecular lesion. Encouragingly, our computational predictions include a promoter deletion whose role we have recently confirmed through transgenesis {Yalcin, 2010 #317}. This is important because genetic association studies typically implicate SNPs as the causative variant at a QTL. Biological insight into a phenotype, however, requires discovering which gene is involved, still a major challenge if the starting point is a SNP. The task is considerably easier when an SV is identified as the causative variant, particularly if the SV removes a coding segment, effectively creating a null allele, which is now relatively straightforward to model in mice. Thus the discovery of causal SVs is likely to provide biological insights out of proportion to their relatively small contribution to phenotypic variance.
Acknowledgements
We thank Adam Whitley, Giles Durrant, Andrew Marc Hammond, Danica Joy Fabrigar, Lucia Chen, Martina Johannesson and Enzhao Cong for helping B.Y. with various laboratory-based work. This project was supported by the Medical Research Council, UK and the Wellcome Trust. D.J.A. is supported by Cancer Research UK.
Author contributions
D.J.A. and J.F. conceived the study and directed the research.
Figure Legends
Figure 1. Identification of structural variants. a) Venn diagrams showing the overlap between deletion SVs (relative to C57BL/6J) detected in our study (blue) and those published elsewhere (Agam et al., 2010 in red and Quinlan et al. in green), in DBA/2J. b) Blue boxes represent deletions, pink boxes insertions, orange boxes inversions and yellow boxes duplications; all types of structural variants are relative to the reference genome sequence.
We found six basic types of structural variant: deletion (del), insertion (ins), inversion (inv), tandem duplication (dup), inverted tandem duplication (not drawn here) and dispersed duplication. c) Additionally, eight complex types of structural variant were found: deletion with an insertion (del+ins), linked deletion (normal copy of small length flanked by two deletions), deletion within a duplication (del in dup), inversion with flanking deletion(s) (for example del+inv+del), inversion with an insertion (inv+ins), inversion within a duplication (inv in dup), a linked insertion (linked ins), where the inserted sequence is copied from another location in the vicinity of the insertion site, and an inverted linked ins (not drawn here), which has a similar pattern to a linked insertion but with the inserted sequence inverted. d) Example of the paired-end mapping (PEM) pattern of a del+inv+del. Green arrows represent primers used for PCR amplification and sequencing reactions. Primer names provide their positional information relative to the reference genome. Black arrows joined by a curved line represent paired-end reads, whereas single black arrows represent singleton reads. Grey straight lines indicate mapping of the test reads onto the reference genome. When the inversion is smaller than the insert size, paired-end reads will flank both deletions and the inversion, as shown here. In other cases, decreased read depth will indicate flanking deletions. e) Example of the PEM pattern of an inv+ins, with PCR data across the eight classical strains. Hyperladder II is used as the molecular marker. Amplicon size for BALB/cJ, C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than for the other strains, indicative of the insertion. A complete list of PEM patterns is given in Supplementary Figure 1, with examples and PCR validation data.
Figure 2. Classification of structural variants. LEGEND TO ADD
Figure 3. Breakpoint analysis of a complex SV.
a) A complex SV, involving several genomic rearrangements including an inversion, a deletion, a short insertion and a copy number gain (CNG), is displayed relative to its genic location along Zbtb10, the zinc finger and BTB domain containing 10 gene. PCR amplification using forward (F) and reverse (R) primers revealed an AT insertion at the first breakpoint (J1), followed by an inversion of 125 bp that encompasses an inverted copy number gain of the 22 bp preceding J1, as seen in J2. Finally, breakpoint 3 (J3) revealed a deletion of 813 bp. Using RepeatMasker, a SINE element was found to be part of the deletion. b) PCR picture of the amplification using F and R primers (primer sequences available in Supplementary Table xxx). Hyperladder II was used as the size marker. C57BL/6N and LP/J show a normal size of 1604 bp, whereas A/J, AKR/J, BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp. c) Sequencing data across the J1, J2 and J3 breakpoints. A colour code is used to indicate each type of SV: blue is used for the 22 bp inverted copy number gain, green for the inversion and red for the deletion. When the test strain matches the reference strain, both are in the same colour.
b) Relationship between SNP and SV formation. a) Relationship between SNP and ancestral deletion formation. Two SNPs lying on the 6 bp microhomology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522) correlated with the presence of the SV. On the left, PCR amplification of the SV is shown across the eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). Hyperladder II was used as the DNA molecular weight marker. Some strains show a smaller amplicon compared to other strains. On the right, sequencing traces are shown for a test strain (A/J) and the reference strain (C57BL/6N). Note that the traces of all other test strains are identical to the one shown here. An asterisk is used to emphasize the microhomology of 6 bp (GAACTA).
The presence of two SNPs (C->G and T->A) in all test strains (here only shown in A/J) is associated with the presence of the ancestral deletion. b) Relationship between SNP and ancestral insertion formation. PCR data are shown on the right, with amplification in A/J, AKR/J and BALB/cJ. Strains with the ancestral insertion (C57BL/6N, CBA/J, DBA/2J and LP/J) failed to amplify because of the amplicon size. The insertion is a LINE on chromosome 13 (119,134,049-119,135,126). On the left, the sequencing trace is shown over the target site duplication (TSD) for a strain that does not have the ancestral insertion. The TSD is 17 bp (AAGAATGTCAGCAAAGT) and, at the 12th position, a SNP (G->C) is observed in all the strains that have the insertion.
Figure 4: Effect of structural variants on gene expression. a) and b) Genetic association between 5,337 transcripts and sequence variants lying within 50 kilobases. a) shows the distance from the transcriptional start site to the most significant variant at those loci where there is a single peak of association. Results are shown regardless of the P-value of the association. The majority are non-significant and, as expected, show no enrichment closer to the gene. b) shows the same distances, but only for variants at loci where the single peak of association has a P-value less than 0.0001. None of the variants within the peak at the transcriptional start site are structural variants. c) Between-strain (grey boxes) and within-strain (white boxes) gene expression variances for transcripts which are not overlapped by any structural variant (No SV) and for those which are overlapped by one of five types of structural variant: deletions (Dels), insertions (Ins), copy number gains (Gains), inversions (Inv) and complex rearrangements (Complex). The difference between the two variances is a measure of heritability. d) Effect of distance from the transcript on gene expression variances. Grey boxes are between-strain and white boxes are within-strain variances.
The figure shows standardized variances of gene expression for transcripts with structural variants at distances from less than 2 Kb to more than 40 Kb from either the start or end of the transcript.
Tables
Table 1. Structural variants greater than 100 bp in 17 inbred strains

                 Simple                           Complex
Strain           Del     CNG   Inv   Ins          Del+Ins   Nested   Inv+Del/Ins
129P2/OlaHsd     16292   57    74    15604        105       27       68
129S1/SvImJ      17307   70    88    11516        73        32       67
129S5/SvEvBrd    16089   72    67    8970         43        41       58
A/J              16190   69    92    12184        61        28       67
AKR/J            15806   88    89    14576        88        13       82
BALB/cJ          14859   82    87    10551        48        17       58
C3H/HeJ          16062   94    94    12100        90        16       76
C57BL/6NJ        164     44    6     213          0         3        1
CAST/EiJ         50978   361   224   34122        133       239      265
CBA/J            16996   79    83    10867        64        16       78
DBA/2J           17478   67    83    10559        55        29       75
LP/J             16964   64    88    12745        64        30       69
NOD/ShiLtJ       17047   51    116   13244        53        16       79
NZO/HlLtJ        15429   62    71    9445         33        23       62
PWK/PhJ          54147   96    272   35098        184       60       268
SPRET/EiJ        91295   112   470   64304        463       110      554
WSB/EiJ          22154   88    97    12521        64        37       105

Del=deletion; CNG=copy number gain; Inv=inversion; Ins=insertion; Del+Ins=deletion plus insertion; Nested=SV in a CNG region; Linked Ins/Del=linked insertion or linked deletion; Inv+Del/Ins=inversion plus deletion(s) or inversion plus insertion.
Table 2. Sequence features at SV breakpoints and inferred mechanism. In a, the percentage of each sequence feature at precise breakpoints is given per category of ancestral SV (insertion, deletion, inversion, CNG and multiple events). In b, the percentage of each inferred mechanism is given relative to all SV regions presented in a. Empty cells indicate that the feature is not applicable, and all abbreviations are listed in the Supplementary Glossary.
Table 3.
QTLs associated with SVs

Phenotype                                   Chr   SV start    SV stop     Ancestral event   Gene            SV overlap   LogP
Mean platelet volume                        1     175158884   175158885   insertion         Fcer1a          upstream     52.833
OFT Total activity                          2     144402772   144402974   SINE insertion    Sec23b          intron       15.721
Hippocampus cellular proliferation marker   4     49690364    49690365    SINE insertion    Grin3a          intron       20.119
Home cage activity                          4     108951264   108951265   ERV insertion     Eps15           upstream     15.922
T-cells: %CD3                               4     130038389   130038390   SINE insertion    Snrnp40         intron       12.129
Wound healing                               7     90731819    90731820    ERV insertion     Tmc3            upstream     22.216
Red cells: mean cellular haemoglobin        7     111398000   111480000   insertion         Trim5           exon         13.016
Red cells: mean cellular haemoglobin        7     111504957   111505193   deletion          Trim30b         UTR          12.806
Red cells: mean cellular volume             8     87957244    87957245    LINE insertion    4921524J17Rik   upstream     18.141
Serum urea concentration                    11    115106122   115106250   deletion          Tmem104         UTR          13.404
Hippocampus cellular proliferation marker   13    113783196   113783359   deletion          Gm6320          upstream     17.456
T-cells: CD4/CD8 ratio                      17    34483680    34483681    deletion          H2-Ea           upstream     82.858

Start and stop coordinates are given for build 37 of the mouse genome, so that insertions into the reference are given as consecutive base pairs (columns headed SV start and SV stop). The part of the gene overlapped is reported in the column headed SV overlap. LogP is the negative logarithm of the P-value for association between the SV and the phenotype as assessed in outbred HS mice {Valdar, 2006 #177}.
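As a reading aid for the LogP column of Table 3, the following minimal sketch converts a LogP value back to a P-value and vice versa. It assumes the logarithm is base 10, which the legend does not state, and the function names `logp_to_pvalue` and `pvalue_to_logp` are illustrative and not taken from the paper.

```python
import math


def logp_to_pvalue(logp: float) -> float:
    """Convert a negative log P-value back to a P-value.

    Assumption: the logarithm is base 10 (the table legend does not say).
    """
    return 10.0 ** (-logp)


def pvalue_to_logp(p: float) -> float:
    """Convert a P-value to its negative base-10 logarithm."""
    return -math.log10(p)


if __name__ == "__main__":
    # LogP values copied from Table 3 for three of the listed genes.
    for gene, logp in [("Fcer1a", 52.833), ("Snrnp40", 12.129), ("H2-Ea", 82.858)]:
        print(f"{gene}: LogP = {logp} -> P = {logp_to_pvalue(logp):.3e}")
```

Under this base-10 assumption, each additional unit of LogP corresponds to a tenfold smaller P-value, so the ranking of associations by LogP is the same as the ranking by significance.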
Table 4: SVs affecting coding regions.

MGI gene name   SV Chromosome   SV Start Bp   SV Stop Bp   SV Length   Ancestral State   Known function
Olfr1055        2               86179904      86186982     7079        ins               Olfaction
Fcrl5           3               87245076      87245948     873         del               Infection and immunity
Nes             3               87780517      87780656     140         VNTR              Brain development
Pglyrp3         3               91831864      91835385     3522        del               Infection and immunity
Sknit4,3,9*     4               111731004     112272814    541811      ins               Infection and immunity
Fv1             4               147244399     147245739    1341        del+ins           Infection and immunity
Ugt2b38         5               87850560      87854999     4440        complex           Metabolism
Klrb1a          6               128559593     128559740    148         ins               Infection and immunity
Klri2           6               129689528     129691211    1684        del               Infection and immunity
Tas2r120*       6               132593115     132613777    20663       del               Taste
Tas2r103        6               132985439     132986718    1280        del               Taste
Zfp607*         7               28646762      28671649     24888       del               DNA-binding
Krtap5-5        7               149415125     149415279    155         VNTR              Hair formation
Defb8           8               19447386      19450577     3192        ins+del           Infection and immunity
Zfp872          9               22004938      22004981     44          del               DNA-binding
Olfr913         9               38402597      38403495     899         del               Olfaction
Rtp3            9               110888888     110889308    421         VNTR              Bone density
Nlrp1c*         11              71046193      71101410     55218       ins               Embryonic development
Fam110c         12              31759374      31759699     326         VNTR              Cell spreading and migration
Olfr234         15              98328544      98328892     349         del               Olfaction
Krtap16-1       16              88874115      88874526     412         VNTR              Hair formation
Amd2*           18              64607761      64609669     1909        ins               Biosynthesis of polyamines

Strain distribution pattern (one row per strain; the 22 values follow the gene order of the rows above):

A/J             0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1
AKR/J           0 0 0 0 0 1 1 1 0 1 0 0 1 1 0 1 0 0 0 0 0 0
BALB/cJ         0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 1 1
C3H/HeJ         0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 1
C57BL/6J        0 0 1 0 1 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 1
CBA/J           0 0 0 0 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1
DBA/2J          0 0 0 0 0 1 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 1
LP/J            0 0 1 0 0 1 2 1 0 1 0 0 1 1 1 0 0 0 1 0 1 0
129P2/OlaHsd    1 0 1 0 0 1 2 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0
129S1/SvImJ     1 0 1 0 0 1 2 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0
129S5/SvEvBrd   1 0 1 0 0 1 2 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0
NOD/ShiLtJ      0 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 1 0
NZO/HlLtJ       0 0 0 0 1 1 1 1 0 0 1 1 1 1 0 0 0 0 1 0 0 1
CAST/EiJ        1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0
PWK/PhJ         0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
SPRET/EiJ       0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
WSB/EiJ         0 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1

MGI is Mouse Genome Informatics. Ins: insertion; del: deletion; VNTR: variable number tandem repeat. The strain distribution pattern relative to the ancestral state is given for all strains: “1” referring to presence, “0” to absence and “2” to an additional allele. * indicates that the SV overlaps the entire gene.
Figure 1. Identification of structural variants.
Figure 2. Classification of structural variants.
Figure 3. Breakpoint analysis of a complex SV and relationship between SNP and SV formation
Figure 4
Methods
An outline of the methods applied in this paper is provided in the supplementary Methods.
References