Sequence based characterization of structural variation in the mouse genome

Binnaz Yalcin1, Kim Wong2, Avigail Agam1, Martin Goodson1, Christoffer Nellaker3, Thomas M. Keane2, Leo Goodstadt1, Amarjit Bhomra1, Polinka Hernandez-Pliego1, Helen Whitley1, James Cleak1, Deborah Janowitz1, Richard Mott1, Chris P. Ponting3, David J. Adams2,*, Jonathan Flint2,*

1The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK
2The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
3MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK

†Co-first authors

*Correspondence to:
Dr. David Adams, Wellcome Trust Sanger Institute, Hinxton, Cambs, CB10 1SA, UK. Ph: +44 (0) 1223 86862. Fax: +44 (0) 1223 494919. Email: [email protected]
Prof. Jonathan Flint, Wellcome Trust Centre for Human Genetics, Oxford, OX3 7BN, UK. Ph: +44 (0) 1865 287512. Fax: +44 (0) 1865 287501. Email: [email protected]

Abstract

The origins and functional impact of structural variants (SVs) in mammalian genomes remain poorly understood. Complete genome sequences of thirteen classical and four wild-derived inbred strains of mice, combined with extensive experimental validation, allowed us to identify more SVs (0.7M) and to recognize more categories than hitherto reported. SVs affect 1.2% and 3.7% of the genome of the classical and wild strains, respectively. The majority of SVs are small (median size 385 base pairs) and while most (97.9%) are deletions and insertions, 1.6% have a complex structure and 0.5% are duplications and inversions. Base pair resolution of breakpoints allowed us to infer the origins of SVs genome-wide: retrotransposition is the commonest mechanism (54%), followed by non-homologous end-joining (NHEJ) processes (29%) and fork stalling and template switching (FoSTeS) mechanisms (17%).
Sequence features at the actual breakpoints of 250 SVs revealed a plethora of mutational processes during SV formation, including microdeletions of 1-289 bp and microinsertions of 1-107 bp. A large proportion of inversions (63%) co-segregated with deletions, suggesting a loop-type mechanism of formation. Analysis of gene expression and phenotypic variation shows that SVs make less of a contribution to phenotypic variation than we would expect given their abundance in the genome. We identify 15 loci where a structural variant is likely to be the molecular lesion responsible for the quantitative trait locus. Our catalogue provides a starting point for the analysis of the most dynamic and complex regions of mammalian genomes in a genetically tractable model organism.

Introduction

Structural variation (SV) in the mammalian genome is known to be abundant and to contribute to phenotypic variation and disease. There has been considerable progress in assessing its extent and complexity1-4, phenotypic impact5,6 and the responsible molecular mechanisms7,8 in the human genome, but much less is known about structural variants (SVs) in the mouse, the preeminent organism for modeling how genetic lesions give rise to disease in mammals. In this paper we use next generation sequencing to address three critical questions: what is the extent and complexity of SV in the mouse genome, what are the likely mechanisms for its formation, and what are its phenotypic consequences?

Current catalogues of mouse SVs are incomplete9-11. They are based on differential hybridization of genomic DNA to oligonucleotide arrays (array comparative genome hybridization (aCGH)), which are blind to some SV categories (such as inversions and insertions), have limited ability to detect others (segmental duplications and transposable elements), and cannot provide the breakpoint resolution required to assess the mechanisms of SV formation.
Sequence based methods of SV detection, with higher resolution and greater sensitivity, have so far had limited application in the mouse. Results are available for only four classical strains12,13. These indicate that 85% of insertions between 100 nucleotides and 10 kilobases are due to retrotransposition, and that non-allelic homologous recombination (NAHR) is rare. However we lack a comprehensive analysis of structural variation between inbred mouse strains.

Even less is known about the impact of SVs on phenotypes, even though there are several indications that they may be important. First, given the predominant role of retrotransposition in SV formation, even low levels of activity could have a large phenotypic impact. In cell culture about 10% of L1 insertions delete DNA14,15, a process that is also documented in mouse genomic DNA16. Second, SVs overlapping a gene are estimated to contribute up to 74% of between-strain gene-expression variance. The high prevalence of SVs in the genome, and the observation that SVs influence the expression of genes lying up to 500 Kb from their margins17, suggest that SVs might be responsible for a considerable fraction of heritable gene expression variance. The phenotypic impact of SVs could extend further, since gene expression variation is believed to contribute to variation in phenotypes in the whole organism (Schadt).

Here we report the identification, using short-read paired-end mapping, of 0.7M SVs in 17 inbred strains of mice. By analyzing breakpoint sequence we infer the mechanisms of formation and assess their relative impact on shaping a mammalian genome. Our molecular characterization of SVs in the mouse genome allows us to determine the extent to which SVs contribute to genetic and phenotypic diversity.

Results

SV identification

We identified almost three quarters of a million SVs in 17 mouse strains, far more than previously recognized (Fig.
1a) and consisting of a greater variety of molecular structures (Fig. 1b and 1c). To explain why, we first describe how we went about finding SVs. We combined visual inspection of the data with molecular validation to improve automated SV detection across the genome. We used two criteria to identify SVs manually: read depth and anomalous paired-end mapping (PEM). We did this using data from the mouse's smallest chromosome (19) in its entirety and a random set of other chromosomal regions, for eight strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J).

Based on read depth and PEM we expected to find eleven patterns that classify SVs. We refer to these as type H patterns (H1-H11: Supplementary Fig. 1). For example, a deletion is indicated by a reduction in read depth and by reads where one of the pair aligns to one side of the deletion and its mate to the other. Because the two ends are sequenced on opposite strands, the two reads will point towards each other (H1). By contrast, paired-end reads pointing in the same direction together with an unchanged read depth (except at the breakpoints) indicate an inversion (H4).

We were surprised to discover an additional ten patterns (of read depth and PEM) whose interpretation was ambiguous. We refer to these as type Q patterns (Q1-Q10: Supplementary Fig. 1). For instance, we found examples of a reduction in read depth without paired-end reads flanking the putative deletion (Q2), and putative inversions where reads mapped to only one of the breakpoints (Q8; Fig. 1d). We investigated the molecular structure of all 21 patterns using a PCR strategy (Supplementary Fig. 2, online Methods). We designed 575 pairs of primers (Supplementary Table 1) and amplified 538 unique SV regions across eight classical inbred strains (Supplementary Table 2).
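The H1 and H4 read-pair signatures described above can be illustrated with a minimal sketch. It is not the calling method used in this study: the `ReadPair` record, the function name and the insert-size thresholds (`mean_insert`, `max_dev`) are hypothetical values chosen for illustration, the span is measured between leftmost mapped positions for simplicity, and read depth, the second criterion, is not modelled.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record for one mapped read pair."""
    pos1: int      # leftmost mapped position of the first read
    strand1: str   # '+' or '-'
    pos2: int      # leftmost mapped position of its mate
    strand2: str   # '+' or '-'


def classify_pair(pair, mean_insert=400, max_dev=150):
    """Return a coarse label for one read pair (thresholds are assumptions)."""
    # Order the two reads by mapped position.
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    span = right[0] - left[0]
    if left[1] == right[1]:
        # Both reads point in the same direction: inversion-like (H4).
        return "inversion_candidate"
    if left[1] == "+" and right[1] == "-":
        # Reads point towards each other, the expected orientation.
        if span > mean_insert + max_dev:
            # Correct orientation but too far apart: deletion-like (H1).
            return "deletion_candidate"
        return "concordant"
    # Reads point away from each other.
    return "other_anomalous"


for p in [ReadPair(1000, "+", 1350, "-"),
          ReadPair(1000, "+", 3000, "-"),
          ReadPair(1000, "+", 1300, "+")]:
    print(classify_pair(p))
```

A real caller would cluster many such pairs per locus and combine them with read depth before assigning one of the H or Q patterns; a single anomalous pair is weak evidence on its own.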
Our categorization of predicted SV structures, based on manual inspection of PEM patterns, resulted in the confident identification of an SV for nineteen of the 21 patterns in all 532 instances that we examined by PCR (Supplementary Table 3). Two patterns were always false (Q6 and Q10), and arose because of the presence of a pseudogene giving mapping errors. Recognizing these patterns, we were able to predict the underlying SV structure with high confidence. PCR confirmed the structure for 18 patterns (Supplementary Table 3): twelve patterns were indicative of a single SV and six patterns indicative of multiple adjacent SVs (complex SVs). The exception is type Q7, attributable in each case to the presence of a variable number tandem repeat, where we could not unequivocally predict the number of repeats and its molecular structure.

Available automated methods to identify SVs are unable to differentiate all of the 21 PEM patterns we identified, and may also classify some SVs incorrectly; for example, the PEM patterns of linked insertions (Q5 and Q9: Supplementary Fig. 1) are similar to those for inversions or deletions. Therefore we adapted automated methods to recognize 17 types (Q1, Q2, Q3 and Q7 excluded) identified by manual inspection and PCR validation (Fig. 1, Supplementary Fig. 1, Supplementary Methods and ref. 18).

The results of the detection and classification of 711,923 SVs across the entire genome of 17 strains are shown in Table 1. SVs smaller than 100 bp are excluded because, below this size, it is difficult to determine whether a deviation in the distance between two paired-end reads is due to variation in the library insert size distribution or to paired ends flanking a structural variant. There are on average 26,393 SVs in classical inbred strains, and 92,205 in wild-derived inbred strains, affecting 1.2% (33.0 Mb) and 3.7% (98.6 Mb) of the genome, respectively.
SVs with complex PEM patterns account for 1.4% to 1.8% of all SVs identified in each strain, except in C57BL/6NJ, whose genome is almost devoid of complex PEM patterns. The majority of SVs in all strains are deletions and insertions with simple PEM patterns; however, we have been able to identify a small subset of complex SVs. Inversions are rare, and occur concurrently with a deletion or an insertion about 50% of the time. Copy number gains are also rare (~100 per genome of each type), and cover from 1.8 Mb (NOD/ShiLtJ) to 16 Mb (CAST/EiJ) of the genome. We also observe a small number of inversions and deletions that occur in regions of copy number gain.

Sensitivity and specificity analyses

We established false positive and false negative rates for the automated analysis in three ways. First, we used our manually identified set of SVs on chromosome 19, where we found 932 deletions (684 type H and 248 type Q), 15 inversions (2 type H and 13 type Q) and three copy number gains (all type H). Automated analysis of chromosome 19 detected between 83% and 86% of manually-called deletions (at least 50 bp in size), depending on the strain (Supplementary Table 5a). The false positive rate ranges from 3.1% to 4.6% (Supplementary Table 5b). Although the false negative rate per strain ranges from 14% to 17% when considering all types of deletions excluding type Q7, it should be noted that the automated analysis accurately identified 94.8% of sites containing a deletion.

Second, to ensure that our sensitivity and specificity analyses were not vitiated because we used SVs from chromosome 19 as a training set for the automated analysis, we derived a second, smaller set of manually curated deletions from a randomly chosen 10 Mb region (101 Mb to 111 Mb) of chromosome 3 in the strain C3H/HeJ. Automated analysis of this region identified 43 of the manually curated deletions (82.7%) and called 2 false positive deletions (4.4%).
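The chromosome 3 percentages can be reproduced with a short calculation. This is a sketch under two assumptions that the text does not state explicitly: that the manually curated set contained 52 deletions (inferred from 43 being 82.7%), and that the false positive rate is the fraction of automated calls that were false, 2 of 45. The function names are illustrative only.

```python
def sensitivity(true_pos, manual_total):
    """Fraction of manually curated SVs recovered by the automated calls."""
    return true_pos / manual_total


def false_positive_fraction(false_pos, true_pos):
    """Fraction of automated calls with no manually curated counterpart."""
    return false_pos / (true_pos + false_pos)


MANUAL_TOTAL = 52   # assumption: inferred from 43 deletions = 82.7%
TRUE_POS = 43       # curated deletions recovered (from the text)
FALSE_POS = 2       # false positive calls (from the text)

sens = sensitivity(TRUE_POS, MANUAL_TOTAL)
fpf = false_positive_fraction(FALSE_POS, TRUE_POS)

print(f"sensitivity:             {100 * sens:.1f}%")        # 82.7%
print(f"false negative rate:     {100 * (1 - sens):.1f}%")  # 17.3%
print(f"false positive fraction: {100 * fpf:.1f}%")         # 4.4%
```

Under these assumptions the false negative rate is simply one minus the sensitivity, which is why the chromosome 19 detection rates of 83% to 86% correspond to false negative rates of 14% to 17%.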
Third, we investigated the false negative rate for the automated detection of deletions across the genome using our PCR validation data of 267 simple deletions. Consistent with the chromosome 19 and chromosome 3 analyses, we found that the false negative rate for deletions excluding type Q7 was between 9% and 15% (Supplementary Table 6a).

We could not assess the performance of automated analysis to detect SV types other than deletions using calls derived from manual inspection of chromosome 19, because so few of these rearrangements were called. To do this, we turned to PCR-based validation and found that the average false negative rate was higher than for deletions, ranging from 17% to 55% (Supplementary Table 6b). Automated analysis was less successful in detecting the more complex rearrangements: of the 58 PCR validated complex SVs only 33 were found.

SV mechanism of formation

Sequences within a structural variant (known mobile elements, pseudogenes and variable number tandem repeats (VNTRs)) or sequence flanking its breakpoints reveal the mechanism by which an SV arose. Accurate sequence-based SV classification has two requirements: nucleotide level resolution of the breakpoint and the ability to distinguish unambiguously between historical insertions and deletions. Both of these factors are crucial for accurately determining the mechanism by which a structural variant was formed.

We successfully carried out de novo local assembly and breakpoint refinement of 81.3% of deletions and 74.2% of non-transposable element insertions as described in ref. 18 (genome-wide breakpoint localization of transposable element insertions is reported in our companion paper (Keane et al, 2011)). Comparison of 1,314 breakpoints to the breakpoints delineated by PCR and sequencing (Supplementary Methods) revealed that 57.7% of breakpoints are exact and 86.5% are within 20 bp (Supplementary Table 7a). We assessed the accuracy of deletion breakpoints in cases where the local assembly strategy failed.
We found that 83.3% are within 100 bp of the actual breakpoint (Supplementary Table 7b). Breakpoint accuracy for insertions, inversions and copy number gains is presented in Supplementary Tables 7c, 7d and 7e, respectively.

Using rat as an outgroup, we classified 19% of SVs as ancestral deletions and 57% as ancestral insertions; the remainder (24%) were indeterminate. We examined the sequence features of 40 SVs that failed the outgroup analysis and found that in every case the regions contained highly repetitive DNA, consisting primarily of transposon and transposon related sequence.

[spretus and manual classification - Compared to the classification using rat, we found relatively more ancestral deletions (35%) when using Spretus as an outgroup. Finally we estimated the error rates of the classification by manual inspection of the sequence at 250 SVs (Supplementary Figure 2).]

The mechanism of formation of SVs that arise independently at the same location in the genome (recurrent SVs) is thought to be different from the mechanism of formation of non-recurrent SVs19,20. However we found little evidence for recurrent SVs in the mouse. We used a set of 241 deletions whose breakpoints we amplified and sequenced, and identified recurrent SVs by finding different breakpoint sequences at the same genomic location in different strains. Within the eight classical strains, size differences were found at 2.5% (6/241) of SVs. Within all 17 strains, we found multiple alleles at 12% of SVs, due almost entirely to the presence of different alleles originating from the wild-derived inbred strains. In two cases we observed four alleles at an SV locus, and in one case, on chromosome 10, five alleles. At this locus AKR/J, CAST/Ei, PWK/Ph and SPRET/Ei all have SVs with different breakpoints (Supplementary Table 8).

Classification of SVs and their size characteristics are summarized in Figure 2.
We found that the median length of all SVs is 349 bp, with modes at 100 bp and 6400 bp, LINE insertions comprising the majority of the larger insertions (Fig. 2b). The cutoff at 100 bp is artifactual in that we did not include variants of less than 100 bp as SVs. We observed a lower density of SVs on the X chromosome than on the autosomes (4.97 per Mb compared to an average of 14.45 per Mb, s.e. = 3.02).

The largest class of SVs consists of those formed by mechanisms involving retrotransposons (LINEs (25%), LTRs (14%) and SINEs (15%)), followed by mechanisms not involving retrotransposons (29%), VNTRs (15%) and pseudogenes (2%). Outgroup analysis showed that the transposon-derived structural variants arose almost exclusively from ancestral insertion events (98.8%). Non-repeat mediated SVs were mainly a result of ancestral deletion events (79%).

Consistent with their role in the creation of SVs (), we confirmed that segmental duplications (SDs) in the mouse genome are more likely to contain SVs. Deletions larger than 1 Kb are twice as likely to occur within segmental duplications. However we were surprised to see that deletions smaller than this are depleted in SDs in all strains (fold change range = 0.43-0.64 and 0.3-0.39, in classical and wild-derived strains, respectively).

We found microhomology (12-16 bp) surrounding the breakpoints of LINE and SINE derived SVs and shorter (6-8 bp) stretches associated with LTR SVs (Fig. 2c). Non-repeat mediated SVs were associated with short segments of up to 7 bp in length (Fig. 2c). We found no evidence of longer microhomology for non-repeat mediated SV formation, consistent with a major role for fork stalling and template switching (FoSTeS). In total we found an excess of SVs with microhomology of 2 bp or over compared to random expectation (60% of the sample) (Fig. 2c). We found no difference in the microhomology profile of ancestral insertions compared to ancestral deletions (Supplementary Fig. 3).
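To make the notion of breakpoint microhomology concrete, the sketch below measures it for a single deletion on a reference sequence. It is an illustration under stated assumptions, not the procedure used to produce Fig. 2c: the function name, the 0-based half-open coordinates and the toy sequence are hypothetical. Microhomology is taken here as the number of bases by which the deletion breakpoint could be shifted left or right without changing the resulting sequence.

```python
def microhomology_length(ref, start, end):
    """Microhomology (bp) at a deletion that removes ref[start:end].

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    # Bases at the start of the deleted segment that match the bases
    # immediately after the deletion (breakpoint can slide right).
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    # Bases just before the deletion that match the bases at the end of
    # the deleted segment (breakpoint can slide left).
    left = 0
    while start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1
    return left + right


# Toy example: left flank GGAT, deleted ACCTTGCA, right flank ACCGT.
# The deleted segment and the right flank both begin with ACC.
ref = "GGAT" + "ACCTTGCA" + "ACCGT"
print(microhomology_length(ref, 4, 12))  # 3
```

For deletions inside tandem repeats this measure can become long, which is one reason VNTR-associated variants are better treated as a separate class.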
Automated analysis may miss unexpected features of breakpoint sequence. Therefore we examined 249 SVs where breakpoints were obtained by PCR and sequencing and made two observations. First, we identified cases where the machinery of SV formation has resulted in a complex mixture of insertions and deletions (Fig. 3a). While almost all SVs due to LINE, ERV, SINE and VNTR insertions do not show any missing nucleotides at their breakpoints (95%, 93.3%, 92.3% and 92.3% respectively), there are rare cases (4 LINEs, 2 ERVs, 1 SINE and 1 VNTR) in which the insertion machinery also deletes nucleotides (from 1 bp to 289 bp). For three LINEs, the presence of an ancestral microdeletion is directly linked to the absence of the target site duplication (TSD), suggesting a dual mechanism of SV formation: a union of DSB repair processes and LINE retrotransposition. Five of the eight inversions for which we had breakpoint sequence data had deletions right next to the inversion (62.5%).

The second unexpected result to emerge from exact breakpoint sequence of ancestral deletions was that in all cases the presence of SNPs in the microhomology region was correlated with the presence of the SV (Fig. 3b). In every case the SNP elongates the microhomology. This phenomenon is rare: we only observed five (4.5%) cases amongst our 113 ancestral deletions where a SNP and SV formation co-segregate. We found a similar relationship between a SNP formed in the TSD and the presence of an ancestral insertion (Figure 3b). 15 ancestral insertions (16%) had SNPs or short indels within their TSD, coincident with an insertion (Supplementary Table 8).

Impact of SVs on gene function

In this section we report results of assessing the impact of SVs on phenotypes in three ways: (i) the relationship between the position of SVs and the position of genes; (ii) changes in expression of genes overlapping, or nearby, an SV; (iii) association between SVs and phenotypes in an outbred population of mice.
Across all strains, 10,291 genes are partially or completely deleted, reducing to 5,115 genes when only the classical laboratory strains are considered. The introns of 9,802 genes are affected (laboratory strains: 4,885), as are 4,530 promoter regions (laboratory strains: 1,641), and 1,631 genes have deleted exons (laboratory strains: 648) (Supplementary Table 9). We investigated whether this represented an enrichment or a depletion by counting the number of SVs that overlapped genes and then comparing this to a null distribution of the expected number of overlaps, obtained by permutation (Supplementary Table 10). We found that relative deletions are, in all strains except C57BL/6N, significantly depleted (P<0.01) in genes (fold change 0.91), introns (mean fold change 0.93), exons (fold change 0.22; including C57BL/6N) and promoter regions (fold change 0.77). However, examining the overlap between genes and the subset of relative deletions that correspond to inserted transposable elements in the reference genome, we found that SINEs are significantly enriched in the introns of genes (P<0.01, fold change 1.34).

The relative depletion of SVs within genes implies a proportionate deficit in their phenotypic consequences, an expectation upheld by analyses of SV impact on gene expression and phenotypes measured in the whole animal. First, we find that SVs are relatively unlikely to be the cause of cis-acting expression QTLs. We examined 833 cis-acting eQTLs that influence expression of transcripts from the hippocampus of outbred mice derived from eight of the sequenced strains (Huang 2009). Applying a test that discriminates between variants that are likely to be functional and those that are not (Yalcin), we find that SVs constitute 0.3% of all causal variants, compared to 0.9% of all non-causal variants (χ² = 53.8, P < 2.2E-13). Second, the heritability attributable to SV effects on gene expression is small.
Figure 4a shows scaled variances in gene expression from brain RNAseq data measured between and within strains for five categories of SVs. Assuming variation within strains is due to environmental factors and variation between strains is due to both environmental and genetic factors, the difference between the two variances is a measure of heritability. No category of SV accounts for more than 10% of the heritability. Since many transcripts overlap multiple small SVs (median of 3, maximum of 216), we hypothesized that heritability might be related to the amount of gene overlapped. For each transcript we summed the amount of deleted DNA and expressed this as a proportion of the total length of the gene. Overlap proportions of 50% or more make a disproportionately large contribution to heritability: 25% of the variance is attributable to SVs in this category, compared to 7.8% for transcripts where SVs overlap less than 50% of the gene. However overlaps of this size are rare, affecting less than 3% of transcripts.

We also observed only small effects on gene expression from SVs that lie outside a gene. Figure 4b shows between and within strain variances for SVs lying at distances from less than 2 Kb to more than 40 Kb from transcripts with no SV overlap (the density of SVs meant that we found too few transcripts with SVs 60 Kb or more distant to analyze). For these analyses we measured the closest distance to either the start or end of the gene. Heritability attributable to SVs within 2 Kb of the gene is 2%, and falls as the distance from the gene increases. We observed a non-significant increase in the most distant category (greater than 40 Kb).

Third, SVs are unlikely to be responsible for a QTL for a phenotype measured in the whole animal.
We know this from genetic association with 100 phenotypes measured in over 2,000 heterogeneous stock (HS) mice, animals that are descended from eight of the sequenced strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J)21. As described in our companion paper, we used imputation to genotype SVs and then applied a test that discriminates between variants that are likely to be functional and those that are not22. We were thus able to test 281,246 SVs where we were certain that the strain distribution pattern (SDP) was correct (Supplementary Methods). Relatively high rates of SV mutation (Hall NG) might invalidate the imputation (the HS animals are at least 60 generations distant from the sequenced strains), so we genotyped 100 HS animals using a high-density array (Agam and Supplementary Methods). 194 deletions could be genotyped on the array (with an additional 47 deletions when we allow for non-segregating SVs in the HS). In every case imputation correctly predicted the logP obtained from ANOVA carried out using the array-based genotypes.

We identified 290 QTLs where SVs were among the variants most likely to be functional, but in all these cases the SVs were only a subset of the total number of functional variants. Just as with the cis-expression QTLs, we found a small but significant deficit in SVs among the functional variants (0.36% compared to 0.54% among the non-functional, χ² = 72.1, P < 1E-16). 460 functional SVs overlap 245 genes (by at least one base pair). There was evidence to suggest that functional SVs are enriched for genes compared to non-functional SVs (Supplementary Figure XA; χ² = 9.7, P = 0.002), but there was no evidence for enrichment of exons or promoter regions (18 functional SVs overlap exons, and 58 functional SVs overlap promoter regions; Supplementary Figure XB and XC).

However there are loci where the functional variant is most likely to be an SV. As shown in our companion paper, larger effect QTLs are more likely to arise from SVs.
Our prior analysis also suggests that larger effect QTLs are likely to involve known functional regions. We identified 15 QTLs where the SV overlapped an exon or flanking region (2 Kb up or downstream of a gene), and where the QTL effect size is in the top 5% of the distribution. Table X lists these SVs, the genes they affect and the putative phenotype with which they are associated. Complementation of the deletion of the H2-Ea promoter (recorded as an insertion in Table X because the reference strain carries the deletion) confirmed the effect of this SV on the T-cell phenotype (); further work is needed to confirm the others.

Discussion

Our results are important in three respects. First, we find an unexpectedly large number of SVs with remarkably diverse molecular architecture, thus providing a catalogue of the most dynamic and variable regions of the mouse genome. Second, we identify breakpoints at nucleotide level resolution, giving a genome wide picture of how SVs originate. Third, we demonstrate that, despite their abundance, SVs have relatively little functional impact, as assessed by their effects on gene expression and phenotypic variation in the whole animal.

We were able to find more, and more complex, SVs because we relied on manual inspection of the PEM results, combined with molecular validation, before using automated calling methods. Previous studies have revealed the noisiness of sequence-based methods of SV calling (), due in part to the multiplicity of forms, the presence of insertions, deletions and inversions often in close proximity to each other, and the difficulty of mapping sequence reads back to repetitive genomes, which is particularly challenging when repetitive sequences act as a nursery for SVs. Nevertheless, we have shown here that it is possible to identify and classify up to 19 SV types and thereby calibrate automated methods to generate genome-wide SV calls of superior accuracy.
The SVs we find have two distinguishing characteristics. First, they are typically small. For relative deletions, whose size we know accurately, the median is 385 bp. In comparison, the median size of SVs in a recent high-density array analysis of the genomes of 12 laboratory strains and wild-derived mice was 61 Kb (Henrichsen 2009), and about 1.9 Kb in a PEM analysis of DBA/2J (Quinlan 2010). Second, their density means that we frequently find regions with high concentrations of small rearrangements. These two features emphasize the need for methods of SV identification at base pair, or near base pair, resolution. Otherwise not only are many SVs missed, but those recognized are misclassified: a mixture of small deletions and insertions will be mistaken for a large SV of a single type.

It should also be noted that we do not report SVs of less than 100 bp, an arbitrary limit imposed by the sequencing technology. Variants of less than 100 bp are described in our companion paper, but the sensitivity and specificity of the methods used for their detection do not approach those for the PEM SV detection described here. Variants in this size range remain poorly characterized by current sequencing technologies.

Our second important finding is the catalogue of SV mechanisms based on breakpoint sequence. We were able to map almost 60% of relative deletions to base pair resolution, allowing us to classify SVs by the mechanism that created them. We find that the primary origin of structural variation between mouse strains is attributable to L1 retrotransposons. For reasons still unexplained, mice differ from humans, in whom L1 retrotransposition comes third after microhomology-mediated processes and non-allelic homologous recombination as the predominant processes in generating SVs (Eichler 2010). Investigation of the precise breakpoint sequence revealed that in 4% of cases the machinery of SV formation gives rise to a complex mixture of insertions and deletions.
We also find evidence that in 5% of cases SNPs that extend regions of microhomology at breakpoints occur in strains with an SV: since we found no instances where a SNP in the microhomology region occurs without the SV, we assume that the formation of the deletion and the SNP are related.

It should be noted that, in contrast to human SV studies, we do not distinguish between SVs found in multiple unrelated individuals (recurrent rearrangements) and non-recurrent rearrangements. NAHR is believed to be the major mechanism responsible for the former, while fork stalling and template switching (FoSTeS) and/or microhomology-mediated break-induced replication (MMBIR) mechanisms may be important for the latter (Zhang 2009). We have difficulty distinguishing recurrent SVs because the 17 strains we have sequenced are not completely unrelated, which means we cannot separate recurrent SVs from those that are identical by descent. On one hand, the 13 laboratory strains are derived from a relatively small set of founders; on the other hand, the wild-derived strains include animals that are very distantly related, to the point of being separate species in the case of SPRET/EiJ.

Our third important observation is that SVs have relatively little impact on gene function. This question receives attention because results from genome-wide association studies (GWAS) have revealed that common SNPs (minor allele frequency > 5%) explain only a part of trait heritability, suggesting that SVs might be a major unrecognized contributor to phenotypic variation (). Available evidence has not yet resolved whether or not this is so. Analysis of human lymphoblastoid cell lines attributed at least 8.5-17.7% of heritable gene expression variation to CNVs (Stranger 2007). Importantly, this heritability was not shared with common SNPs, potentially making CNVs a contributor to the missing heritability of GWAS.
In mice, SVs overlapping a gene were estimated to contribute a substantial proportion of between-strain expression variance (up to 74%), which, when put together with the prevalence of SVs in the genome, implies that they might be responsible for a considerable fraction of heritable gene expression variance. If the genetic basis of gene expression were a model for understanding the molecular basis of other phenotypes, then SVs would be a major player. Two recent analyses of the association between SVs and disease phenotypes in humans provide little support for this view: common SVs are no more likely than common SNPs to contribute to phenotypic variation (WTCCC 2010, Conrad 2010). However rare CNVs (minor allele frequency < 5%) of large effect (odds ratio > 2), which could not be detected using the technologies available, might still be important contributors.

Our findings make two important contributions to this debate. First, we find that SVs overlapping a gene make a much smaller contribution than expected, not much more than 10%, and we find limited evidence that they affect the expression of flanking genes. This might be due to our analysis of very large numbers of small SVs, but we find that even when SVs overlap more than 50% of the extent of the gene they account for less than a third of the heritability. The most likely explanation is that previous array based studies conflated under one apparently large SV the effects of numerous smaller rearrangements together with regions of diploid DNA containing other variants that influenced gene expression.

Second, our analysis of the phenotypic consequences of SVs on QTLs for multiple phenotypes also points to a relative deficit of SVs as the molecular basis of complex phenotypes.
By working with an outbred population where all chromosomes are descended from known progenitors, imputation effectively reconstitutes the genomes of all animals, so that we can detect the effects of all variants, both common and rare. Our results indicate that common and rare SVs make less of a contribution to phenotypic variation than we would expect given their abundance in the genome. However, this conclusion may not apply to other species, or indeed other populations of mice, because selection during inbreeding and homozygosity will purge the genome of variants that could be maintained in heterozygous, freely mating populations.

Finally, it should be stressed that SVs are the responsible molecular lesion at some QTLs. Our analysis has highlighted those where this is most likely to be so. Encouragingly, our computational predictions include a promoter deletion whose role we have recently confirmed through transgenesis (). This is important because genetic association studies typically implicate SNPs as the causative variant at a QTL. Biological insight into a phenotype, however, requires discovering which gene is involved, which is still a major challenge if the starting point is a SNP. The task is considerably easier when an SV is identified as the causative variant, particularly if the SV removes a gene segment, effectively creating a null allele, which is now relatively straightforward to model in mice (). Thus the discovery of causal SVs is likely to provide biological insights out of proportion to their relatively small contribution to phenotypic variance.

Acknowledgements
We thank Adam Whitley, Giles Durrant, Andrew Marc Hammond, Danica Joy Fabrigar, Lucia Chen, Martina Johannesson and Enzhao Cong for helping B.Y. with various laboratory-based work. This project was supported by The Medical Research Council, UK and the Wellcome Trust. DJA is supported by Cancer Research UK.

Author contributions
D.J.A. and J.F. conceived the study and directed the research.
Figure Legends

Figure 1. Identification of structural variants.
a) Venn diagrams showing the overlap between deletion SVs (relative to C57BL/6J) detected in our study (blue) and those published elsewhere (Agam et al., 2010 in red and Quinlan et al. in green), in DBA/2J.
b) Blue boxes represent deletions, pink boxes insertions, orange boxes inversions and yellow boxes duplications; all types of structural variants are relative to the reference genome sequence. We found six basic types of structural variant: deletion (del), insertion (ins), inversion (inv), tandem duplication (dup), inverted tandem duplication (not drawn here) and dispersed duplication.
c) Additionally, eight complex types of structural variant were found: deletion with an insertion (del+ins), linked deletion (normal copy of small length flanked by two deletions), deletion within a duplication (del in dup), inversion with flanking deletion(s) (for example del+inv+del), inversion with an insertion (inv+ins), inversion within a duplication (inv in dup), a linked insertion (linked ins) where the inserted sequence is copied from another location in the vicinity of the insertion site, and an inverted linked ins (not drawn here), which has a similar pattern to a linked insertion but with the inserted sequence being inverted.
d) Example of the paired-end mapping (PEM) pattern of a del+inv+del. Green arrows represent primers used for PCR amplification and sequencing reactions. Primer names provide their positional information, relative to the reference genome. Black arrows joined by a curved line represent paired-ends, whereas single black arrows represent singleton reads. Grey straight lines indicate mapping of the test reads onto the reference genome. When the inversion is smaller than the insert size, paired-end reads will flank both deletions and the inversion, as shown here. In other cases, decreased read depth will indicate flanking deletions.
e) Example of the PEM pattern of an inv+ins, with PCR data across the eight classical strains. Hyperladder II is used as the molecular marker. Amplicon size for BALB/cJ, C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than for the other strains, indicative of the insertion. The inversion is revealed by sequencing. The complete list of patterns is drawn in Supplementary Figure 1, with examples and PCR validation data.

Figure 2. Classification of structural variants.
LEGEND TO ADD

Figure 3. Breakpoint analysis of a complex SV.
a) A complex SV, involving several genomic rearrangements including an inversion, deletion, short insertion and copy number gain (CNG), is displayed relative to its genic location along Zbtb10, the zinc finger and BTB domain containing 10 gene. PCR amplification using forward (F) and reverse (R) primers revealed an AT insertion at the first breakpoint J1, followed by an inversion of 125 bp which encompasses an inverted copy number gain of the 22 bp preceding J1, as seen in J2. Finally, breakpoint 3 (J3) revealed a deletion of 813 bp. Using RepeatMasker, a SINE element was found to be part of the deletion.
b) Gel image of the PCR amplification using F and R primers (primer sequences available in Supplementary Table xxx). Hyperladder II was used as the size marker. C57BL/6N and LP/J show a normal size of 1604 bp, whereas A/J, AKR/J, BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp.
c) Sequencing data across the J1, J2 and J3 breakpoints. A colour code is used to indicate each type of SV: blue is used for the 22 bp inverted copy number gain, green for the inversion and red for the deletion. When the test strain matches the reference strain, both are in the same colour.
b) Relationship between SNP and SV formation.
a) Relationship between SNP and ancestral deletion formation. Two SNPs lying on the 6 bp microhomology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522) correlated with the presence of the SV.
On the left, PCR amplification of the SV is shown across the eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). Hyperladder II was used as the DNA molecular weight marker. Some strains show a smaller amplicon compared to other strains. On the right, sequencing traces are shown for a test strain (A/J) and the reference strain (C57BL/6N). Note that the traces of all other test strains are identical to the one shown here. An asterisk is used to emphasize the microhomology of 6 bp (GAACTA). The presence of two SNPs (C->G and T->A) in all test strains (here only shown in A/J) is associated with the presence of the ancestral deletion.
b) Relationship between SNP and ancestral insertion formation. PCR data are shown on the right, with amplification in A/J, AKR/J and BALB/cJ. Strains with the ancestral insertion (C57BL/6N, CBA/J, DBA/2J and LP/J) have failed to amplify because of amplicon size. The insertion is a LINE on chromosome 13 (119,134,049-119,135,126). On the left, a sequencing trace is shown over the TSD for a strain that does not have the ancestral insertion. The TSD is 17 bp (AAGAATGTCAGCAAAGT) and at the 12th position, a SNP (G->C) is observed in all the strains that have the insertion.

Tables
Table 1.
Structural variants in 17 inbred strains

Strain; Simple: Del, CNG, Inv, Ins; Complex: Del+Ins, Nested, Linked Ins/Del, Inv+Del/Ins
129P2/OlaHsd 16184 57 74 15476 102 27 239 68
129S1/SvImJ 17169 70 88 11375 67 32 285 67
129S5/SvEvBrd 15967 72 67 8885 40 41 210 58
A/J 16078 69 92 12065 55 28 237 67
AKR/J 15692 88 89 14434 84 13 260 82
BALB/cJ 14761 82 87 10462 43 17 192 58
C3H/HeJ 15952 94 94 11960 86 16 254 76 164 44 0 3 5 1
CAST/EiJ 50637 361 224 33650 120 239 826 265
CBA/J 16878 79 83 10759 60 16 230 78
DBA/2J 17314 67 83 10427 45 29 306 75
LP/J 16834 64 88 12608 60 30 271 69
NOD/ShiLtJ 16903 51 116 13088 46 16 307 79
NZO/HlLtJ 15355 62 31 23 168 62
PWK/PhJ 53612 96 272 34553 160 60 1104 268
SPRET/EiJ 90533 112 470 63147 426 110 1956 554
WSB/EiJ 22028 88 37 265 105
C57BL/6NJ 6 71 208 9353 97 12386 60

Listed are structural variants with a minimum size of 100 bp. In addition to the four main SV types with simple PEM patterns, we have elucidated a number of complex patterns (Figure 1). Complex SVs are identified from local assembly analysis and overlap assessment of SV calls (see Supplemental Methods). We differentiate between insertions, where we can determine the insertion points from read pair patterns and local assembly, and copy number gains (CNG), where a duplication is inferred from an increase in read depth. CNGs include tandem duplications, which are inferred from both read depth and read pair evidence. There is minimal overlap between the insertions and the copy number gains, since the insertion discovery algorithms find de novo insertions and TE insertions. Del=deletion; CNG=copy number gain; Inv=inversion; Ins=insertion; Del+Ins=deletion plus insertion; Nested=SV in a CNG region; Linked Ins/Del=linked insertion or linked deletion; Inv+Del/Ins=inversion plus deletion(s) or inversion plus insertion.

Table 2. Sequence features at SV breakpoints and inferred mechanism
Table 2.
Sequence features at SV breakpoints and inferred mechanism. In a, the percentage of each sequence feature at precise breakpoints is given per category of ancestral SV (insertion, deletion, inversion, CNG and multiple events). In b, the percentage of each inferred mechanism is given relative to all SV regions presented in a. Empty cells indicate that the feature is not applicable, and all abbreviations are listed in the Supplementary Glossary.

Table 3. QTL associated with SVs
Phenotype Red cells: mean cellular haemoglobin Wound healing Weight Red cells: mean cellular haemoglobin Red cells: mean cellular volume Mean platelet volume T-cells: CD4/CD8 ratio T-cells: %CD3 Hippocampus cellular proliferation marker Hippocampus cellular proliferation marker OFT Total activity Home cage activity Serum urea concentration Red cells: mean cellular haemoglobin OFT Total activity Chr Start Stop Structural variant 7 7 3 111415000 90731810 104731236 111415000 90731831 104731238 DEL INS|IAPTypeI|ERV INS|SINE|SINE 9 8 1 107952000 87957077 175158883 107960000 87957262 175158895 GAIN INS|LINE|LINE INS 17 4 34483680 130038389 34483692 130038391 4 49690361 49690364 13 13 4 11 113783195 116943933 108951256 115106125 7 2 Strains Gene A, BALB, C3H, CBA A, AKR, BALB C3H Trim5 TMC3 St7l e u d Gmppb 4921524J17Rik Fcer1a d u u INS INS|SINE|SINE C3H A, C3H, CBA, DBA A A, AKR, C3H, CBA, DBA LP H2e-a Snrnp40 u u INS|SINE|SINE CBA, DBA Grin3a d 113783359 116944422 108951281 115106247 DEL DEL INS|IAPTypeI|ERV DEL Gm6320 Gm6404 Eps15 Tmem104 u u u e 111504853 111505190 INSi A A, C3H, DBA, LP A,DBA, LP BALB AKR, A, CBA, BALBc, DBA Trim30b e 144402762 144402974 DEL|SINE DBA, LP Sec23b e

Figure 1. Identification of structural variants.
Figure 2. Classification of structural variants.
Figure 3. Breakpoint analysis of a complex SV and relationship between SNP and SV formation

Methods
An outline of the methods applied in this paper is provided in the Supplementary Methods.

References
1 Iafrate, A.
J. et al. Detection of large-scale variation in the human genome. Nat Genet 36, 949-951 (2004).
2 Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420-426 (2007).
3 Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525-528 (2004).
4 Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat Genet 37, 727-732 (2005).
5 Buchanan, J. A. & Scherer, S. W. Contemplating effects of genomic structural variation. Genet Med 10, 639-647 (2008).
6 Hurles, M. E., Dermitzakis, E. T. & Tyler-Smith, C. The functional impact of structural variation in humans. Trends Genet 24, 238-245 (2008).
7 Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59-65 (2011).
8 Perry, G. H. et al. The fine-scale and complex architecture of human copy-number variation. Am J Hum Genet 82, 685-695 (2008).
9 Agam, A. et al. Elusive copy number variation in the mouse genome. PLoS One 5 (2010).
10 Cahan, P., Li, Y., Izumi, M. & Graubert, T. A. The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nat Genet 41, 430-437 (2009).
11 Graubert, T. A. et al. A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet 3, e3, doi:10.1371/journal.pgen.0030003 (2007).
12 Akagi, K., Li, J., Stephens, R. M., Volfovsky, N. & Symer, D. E. Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res 18, 869-880 (2008).
13 Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 20, 623-635 (2010).
14 Gilbert, N., Lutz-Prigge, S. & Moran, J. V. Genomic deletions created upon LINE-1 retrotransposition. Cell 110, 315-325 (2002).
15 Symer, D. E. et al. Human L1 retrotransposition is associated with genetic instability in vivo. Cell 110, 327-338 (2002).
16 Garvey, S. M., Rajan, C., Lerner, A. P., Frankel, W. N. & Cox, G. A. The muscular dystrophy with myositis (mdm) mouse mutation disrupts a skeletal muscle-specific domain of titin. Genomics 79, 146-149 (2002).
17 Henrichsen, C. N. et al. Segmental copy number variation shapes tissue transcriptomes. Nat Genet 41, 424-429 (2009).
18 Wong, K., Keane, T. M., Stalker, J. & Adams, D. J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 11, R128 (2010).
19 Egan, C. M., Sridhar, S., Wigler, M. & Hall, I. M. Recurrent DNA copy number variation in the laboratory mouse. Nat Genet 39, 1384-1389 (2007).
20 Zhang, F., Carvalho, C. M. & Lupski, J. R. Complex human chromosomal and genomic rearrangements. Trends Genet 25, 298-307 (2009).
21 Valdar, W. et al. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet 38, 879-887 (2006).
22 Yalcin, B. et al. Genetic dissection of a behavioral quantitative trait locus shows that Rgs2 modulates anxiety in mice. Nat Genet 36, 1197-1202 (2004).