Sequence based characterization of structural variation in the mouse genome

Binnaz Yalcin1†, Kim Wong2†, Avigail Agam1†, Martin Goodson1†, Thomas M. Keane2, Xiangchao Gan1, Christoffer Nellåker3, Leo Goodstadt1, Jérôme Nicod1, Amarjit Bhomra1, Polinka Hernandez-Pliego1, Helen Whitley1, James Cleak1, Rebekah Dutton1, Deborah Janowitz1, Richard Mott1, David J. Adams2,*, Jonathan Flint2,*

1The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK
2The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
3MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK

†Co-first authors

*Correspondence to:

Dr. David Adams
Wellcome Trust Sanger Institute
Hinxton, Cambs, CB10 1SA, UK
Ph: +44 (0) 1223 86862
Fax: +44 (0) 1223 494919
Email: [email protected]

Dr. Jonathan Flint
Wellcome Trust Centre for Human Genetics
Oxford, OX3 7BN, UK
Ph: +44 (0) 1865 287512
Fax: +44 (0) 1865 287501
Email: [email protected]

Abstract

The extent to which structural variants (SVs) cause quantitative variation and contribute to disease is unknown. Without knowing how many SVs there are, and how they arise, it is difficult to discover what they do. Combining experimental with automated analyses of the mouse genome sequence, we identified 0.71M SVs at 0.28M sites in the genomes of thirteen classical and four wild-derived inbred mouse strains. The majority of SVs are less than 1 kilobase in size and 98% are deletions or insertions. The breakpoints of 0.16M SVs were mapped to base pair resolution, allowing us to infer that insertion of retrotransposons causes more than half of SVs. Yet, despite their prevalence, SVs are less likely than other sequence variants to cause gene expression or quantitative phenotypic variation. We identified 24 SVs that disrupt coding exons, acting as rare variants of large effect on gene function. One third of the genes so affected have immunological functions. Our catalogue provides a starting point for the analysis of the most dynamic and complex regions of genomes from a genetically tractable model organism.

Introduction

Structural variation is believed to be widespread in mammalian genomes1-5 and is an important cause of disease6-8, but just how abundant and important structural variants (SVs) are in shaping phenotypic variation remains unclear. Understanding what SVs do depends on understanding what they are, where they occur and how they arise: large SVs that keep recurring and coincide with genes are far more likely to contribute to phenotypic variation than small non-recurrent SVs within intergenic regions.

The preeminent organism for modeling the relationship between phenotype and genotype, including SVs, is the mouse, but our catalogue of SVs in this animal is incomplete. Estimates of SV numbers and of the proportion of the mouse genome they occupy vary considerably, from figures of a few hundred to over 7,000 (refs 9-13), affecting from 3.2% to more than 10% of the genome14-16. Incompleteness and inconsistencies are largely due to reliance on differential hybridization of genomic DNA to oligonucleotide arrays 14, a technology blind to some SV categories (such as inversions and insertions) and with only limited ability to detect others (segmental duplications and transposable elements). Sequence based methods of SV detection, with higher resolution and greater sensitivity, have so far had limited application12,17.

Along with SV catalogues, we need to know how SVs arise, as this will tell us what SVs may or may not do. The major molecular mechanism producing SVs in the mouse genome is believed to be retrotransposition 12,17, which may account for more than 80% of SVs between 100 nucleotides and 10 kilobases in length17. In cell culture, about 10% of LINE-1 insertions delete DNA18,19, a process that also occurs in mouse genomic DNA20.
It is not known to what extent retrotransposons, or other mechanisms of SV formation, contribute to mouse phenotypic variation and disease. What we know about the burden of SVs’ impact on phenotypes in the mouse comes primarily from analyses of gene expression 15,16,21. Up to 28% of the between-strain variation in gene expression in hematopoietic stem and progenitor cells has been attributed to SVs15; for genes lying within SVs, the latter account for between 66% and 74% of between-strain expression variation in kidney, liver, lung and testis16. If the genome is replete with SVs, and given that their influence on gene expression could extend up to 500 Kb from their margins16, then SVs might be responsible for a considerable fraction of heritable gene expression variance. Since gene expression variation is believed to contribute to variation in phenotypes in the whole organism 21, SVs may turn out to have a major role in the genetic determination of many aspects of mouse biology.

We used next generation sequencing to address three critical questions: what are the extent and complexity of SVs in the mouse genome, what are the likely mechanisms of SV formation, and to what extent do SVs contribute to phenotypic variation? Our molecular characterization of SVs in the mouse genome allows us to determine the extent to which SVs contribute to genetic and phenotypic diversity.

Results

SV identification

Using short-read paired-end mapping, we found SVs at 0.28M sites in the mouse genome, amounting to 0.71M SVs in 17 inbred strains of mice: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J, LP/J, NOD/ShiLtJ, NZO/HlLtJ, 129S5SvEvBrd, 129P2/OlaHsd, 129S1/SvImJ, WSB/EiJ, PWK/PhJ, CAST/EiJ and SPRET/EiJ. Our catalogue contains far more SVs than previously recognized (Fig. 1a) and consists of a greater variety of molecular structures (Fig. 1b&1c). To explain why we found more, we start by describing how we went about finding SVs.
We combined visual inspection of short-read sequencing data with molecular validation to improve automated SV detection across the genome. We used two criteria to identify SVs manually: read depth and anomalous paired-end mapping (PEM). We did this using data from the mouse’s smallest chromosome (19) in its entirety, and a random set of other chromosomal regions, for eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J), the founder strains of the heterogeneous stock (HS) population22. Based on read depth and PEM we expected to find eleven patterns that classify SVs. We refer to these as type H (“High-confidence”) patterns (H1-H11: Supplementary Fig. 1). For example, some deletions and inversions leave precise, easily identifiable signatures (Fig. 1d). In addition, we found ten patterns whose interpretation was ambiguous. We refer to these as type Q (“Questionable”) patterns (Q1-Q10: Supplementary Fig. 1, Fig. 1e).

We investigated the molecular structure of all 21 patterns using a PCR strategy (Supplementary Fig. 2, Supplementary Methods). We designed 742 pairs of primers and successfully amplified 662 SV sites across the eight strains (Supplementary Table 1). Our categorization of predicted SV structures, based on manual inspection of PEM patterns, resulted in the confident identification of an SV for nineteen of the 21 patterns in all instances that we examined by PCR (Supplementary Table 2). Two patterns were always false (Q6 and Q10), and arose because of the presence of a retrotransposed pseudogene giving mapping errors. Recognizing these patterns, we were able to predict underlying SV structure with high confidence. PCR confirmed that 12 patterns were indicative of a single SV and six patterns indicative of multiple adjacent SVs (Supplementary Table 2). However, SVs of type Q7 (55 cases) were due to a variable number tandem repeat, for which we could not predict the number of repeats or molecular structure.
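To make the idea of anomalous paired-end mapping concrete, the following is a minimal sketch of how a single read pair might be labelled from its mapped span and strand orientation. It is not the 21-pattern H/Q scheme described above; it uses only the widely used conventions for forward-reverse paired-end libraries. The `ReadPair` type, the function name, the label strings and the insert-size parameters (mean 300 bp, SD 30 bp, 3 SD cutoff) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    pos1: int      # reference position of the leftmost read (hypothetical field)
    strand1: str   # '+' or '-'
    pos2: int      # reference position of the rightmost read, pos2 >= pos1
    strand2: str   # '+' or '-'


def classify_pair(pair, mean_insert=300, sd_insert=30, n_sd=3):
    """Label one read pair by orientation and mapped span (assumed thresholds)."""
    span = pair.pos2 - pair.pos1
    # Both reads on the same strand: consistent with an inversion breakpoint.
    if pair.strand1 == pair.strand2:
        return "inversion-like"
    # Everted orientation (reverse read first): consistent with a tandem duplication.
    if pair.strand1 == "-" and pair.strand2 == "+":
        return "tandem-duplication-like"
    # Normal orientation: compare the span with the expected insert size.
    if span > mean_insert + n_sd * sd_insert:
        return "deletion-like"      # sample lacks sequence present in the reference
    if span < mean_insert - n_sd * sd_insert:
        return "insertion-like"     # sample carries sequence absent from the reference
    return "concordant"


print(classify_pair(ReadPair(1000, "+", 5000, "-")))
```

A real caller would cluster many such pairs and combine them with read depth, as the text describes, before reporting an SV.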
Available automated methods to identify SVs are unable to differentiate all 19 PEM patterns, and may also classify some SVs incorrectly; for example, the PEM patterns of linked insertions (Q5 and Q9: Supplementary Fig. 1) are similar to those for inversions or deletions. Therefore we adapted automated methods to recognize 15 of the types identified by manual inspection and PCR validation (Q1, Q2, Q3 and Q7 could not be unambiguously identified; Supplementary Methods and ref. 23).

Sensitivity and specificity analyses

We established false positive and false negative rates for the automated analysis in three ways. First, we used our manually identified set of SVs on chromosome 19 (Supplementary Table 3), where we found 1,017 deletions (756 type H and 261 type Q), 15 inversions (2 type H and 13 type Q) and three copy number gains (all type H). False negative rates per strain range from 15% to 20% (Supplementary Table 4a); false positive rates range from 3.7% to 4.8% (Supplementary Table 4b).

Second, to ensure that our sensitivity and specificity analyses were not vitiated because we used chromosome 19 as a training set for the automated analysis, we derived a second, smaller, set of manually curated deletions from a randomly chosen 10 Mb region (101 Mb to 111 Mb) from chromosome 3 in strain C3H/HeJ. Automated analysis of this region correctly identified 43/49 (87.7%) and called 2 false positive deletions (4.1%).

Third, we investigated the false negative rate for the automated detection of deletions across the genome using a PCR validation dataset of 267 simple deletions (Supplementary Table 1). Consistent with the chromosome 19 and chromosome 3 analyses, we found that the false negative rate for deletions was between 6% and 11% (Supplementary Table 5a). Few non-deletion SVs were manually detected on chromosome 19, so we turned to PCR-based validation of insertions, inversions and tandem duplications (n = 106 to n = 136 SVs per strain).
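The chromosome 3 check above is simple arithmetic, sketched here for readers who want to reproduce it. The function name is hypothetical, and dividing the false positive count by the number of manually curated deletions is an assumption made because it reproduces the 4.1% quoted in the text.

```python
def detection_rates(true_positives, manual_total, false_positives):
    """Return (sensitivity, false negative rate, false positive fraction)."""
    sensitivity = true_positives / manual_total
    false_negative_rate = 1.0 - sensitivity
    # Assumption: false positives are expressed relative to the manual call set.
    false_positive_fraction = false_positives / manual_total
    return sensitivity, false_negative_rate, false_positive_fraction


sens, fnr, fpf = detection_rates(true_positives=43, manual_total=49, false_positives=2)
print(f"sensitivity {sens:.2%}, false negatives {fnr:.2%}, false positives {fpf:.2%}")
```

Direct division gives a sensitivity of about 87.76%, which the text reports as 87.7%.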
We found that the average false negative rate was higher than for deletions, ranging from 24% to 31% per strain (Supplementary Table 5b). Automated analysis was less successful in detecting the more complex rearrangements, with 35% to 54% false negative rates (n = 33 to n = 41 per strain); however, excluding SVs with Q2 and Q3 PEM patterns, the rates range from 5% to 35% (n = 18 to n = 24 SVs per strain).

SV categories

The results of the detection and classification of 711,923 SVs across the entire genome of 17 strains are shown in Table 1. There are, on average, 26,000 SVs in classical inbred strains, and 92,000 in wild-derived inbred strains, affecting 1.2% (33.0 Mb) and 3.7% (98.6 Mb) of the genome respectively (Supplementary Table 6). Importantly, differences between frequencies of SVs in wild-derived and classical strains are due to differences in the amounts of accessible genome, and also to differences in our ability to detect SVs in the wild-derived strains. Using our chromosome 19 experimental analysis of SVs, we have estimated the false negative rate across the four wild-derived strains to be 24% (28% in SPRET/EiJ) as opposed to 17% in the eight classical inbred strains (Supplementary Table 4c).

It proved difficult to obtain robust estimates of SVs smaller than 100 bp. We generated SV calls using both Pindel24 and Dindel25. Across the whole genome, we found 33,779 and 149,854 deletions using Pindel and Dindel, respectively. To explain such large differences between the two methods, we visually inspected all calls from each method along the whole of chromosome 19. We found a false positive rate of about 1% for the Dindel set and 14% for Pindel calls. We then inspected a 7.2 Mb region and found that a large number of manual calls were missed by both Pindel and Dindel (67/75) but also found that 42 deletions found by Pindel or Dindel were missed by visual inspection.
We confirmed by PCR that the 75 deletions found by manual inspection are in all cases real and smaller than 100 bp (Supplementary Table 1). Size distribution of all deletions occurring in this 7.2 Mb region is plotted in Supplementary Figure 3. Our best estimate of the rate of SVs between 10 and 100 bp is based on combining manual and automated methods, which, for the 7.2 Mb region, yields 117 deletions. Assuming that this region is typical, the rest of the genome (in classical laboratory strains) will contain approximately 49,000 deletions in this size range.

Table 1 classifies SVs greater than 100 bp into two groups: 99.4% are simple and 0.6% are complex. Simple SVs include those whose structural interpretation is straightforward: insertions, deletions and inversions. We separately identify one type of insertion, a copy number gain, consisting of non-repetitive DNA that is present in multiple copies in other strains. When this sequence occurs immediately adjacent to its original, it is annotated as a tandem duplication.

It is less clear to what extent the more complex categories we found represent different structures. Complex SVs consist of a mixture of events that abut each other. Sometimes the mixture arises because two or more simple SVs occur next to each other: given the density of deletions in the genome, the 2,021 deletions that we found separated by less than 250 bp could have occurred by chance (H3: Supplementary Fig. 1). However we recognize as a separate structure SVs that are immediately adjacent to each other, with no intervening DNA, since these might be the progeny of a single process (marked as del+ins and del+ins/inv in Table 1). Intriguingly, half of the inversions co-occur with an insertion or deletion, or in rare cases with both an insertion and deletion (Fig. 2a). We also separately identify an SV within a copy number gain (termed “nested” in Table 1) since the probability of coincidence is less than one event per genome.
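The genome-wide extrapolation of 10 to 100 bp deletions from the 7.2 Mb region can be sketched as a density calculation. The text does not state the genome length used, so the 3,000 Mb value below is purely an illustrative assumption; the second calculation instead shows which length the quoted figure of approximately 49,000 would imply.

```python
REGION_MB = 7.2               # size of the inspected region (from the text)
DELETIONS_IN_REGION = 117     # combined manual and automated calls (from the text)

density_per_mb = DELETIONS_IN_REGION / REGION_MB


def extrapolate(genome_mb):
    """Scale the regional deletion density to a genome of the given length."""
    return density_per_mb * genome_mb


# Assumed genome length, for illustration only.
print(round(extrapolate(3000)))

# Genome length implied by the reported estimate of about 49,000 deletions.
implied_genome_mb = 49000 / density_per_mb
print(round(implied_genome_mb))
```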
SV formation

Microhomology at SV breakpoints, as well as the content of sequence within SVs and the SV’s ancestral state, was used to infer the likely mechanism of formation for simple SVs. To obtain breakpoint sequence, we performed de novo local assembly at 80.3% of deletions23. Comparison of 1,314 predicted deletion breakpoints to the breakpoint delineated by PCR and sequencing (Supplementary Table 1; Supplementary Methods) revealed that 57.7% of breakpoint predictions are exact and 86.5% are within 20 bp (Supplementary Table 7a). In cases where the local assembly strategy failed, we relied on the original breakpoint estimates obtained from the mapping of reads to the reference genome: 83.3% of these estimates are within 100 bp of the actual breakpoint (Supplementary Table 7b). Breakpoint accuracy for insertions, inversions and copy number gains is presented in Supplementary Tables 7c, 7d and 7e, respectively.

To obtain genome-wide estimates of the contribution of each mechanism to SV formation, we used sequence data from relative deletions (that is, relative to C57BL/6J). We have highly accurate breakpoint sequence for this sample, which should be unbiased with respect to ancestry. Using rat as an outgroup, we classified 19% of relative deletion SVs as ancestral deletions, 57% as ancestral insertions and the remainder (24%) were indeterminate (Supplementary Methods).

Classification of SVs and their size characteristics are summarized in Figure 3. SVs are most often due to retrotransposons (LINEs (25%), LTRs (14%) and SINEs (15%)), followed by variable number tandem repeats (VNTRs) (15%) and pseudogenes (2%). Other mechanisms, not involving retrotransposons, account for 29% of SVs. The median length of all SVs is 349 bp, with modes at 100 bp and 6,400 bp, LINE insertions comprising the majority of the larger insertions (Fig. 3a). Outgroup analysis showed that the transposon-derived SVs arose almost exclusively from ancestral insertion events (98.8%).
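As a quick consistency check on the mechanism breakdown above, the percentages can be tallied directly. The dictionary keys are shorthand labels chosen for this sketch; the values are those quoted in the text.

```python
# Percentage of SVs attributed to each mechanism, as quoted in the text.
mechanism_percent = {
    "LINE": 25,
    "LTR": 14,
    "SINE": 15,
    "VNTR": 15,
    "pseudogene": 2,
    "other": 29,
}

# Retrotransposon classes named in the text.
retrotransposon_percent = sum(mechanism_percent[k] for k in ("LINE", "LTR", "SINE"))
total_percent = sum(mechanism_percent.values())

print(retrotransposon_percent, total_percent)
```

The three retrotransposon classes together account for 54%, in line with the abstract's statement that retrotransposon insertion causes more than half of SVs.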
Duplicated sequences of 12-16 bp (known as target site duplications, TSDs) surround the breakpoints of LINE- and SINE-derived SVs, and shorter (6-8 bp) duplicated sequences are associated with LTR SVs (Fig. 3b). Non-repeat-mediated SVs are mainly a result of ancestral deletion events (79%), and are associated with short microhomologies, up to 7 bp in length, consistent with either microhomology-mediated break-induced replication (MMBIR)26,27 or microhomology-mediated end joining (MMEJ)28.

A substantial proportion of SVs caused by LINE, ERV and SINE insertions do not show any missing nucleotides at their breakpoints (93.3%, 93.3% and 92.3% respectively; Table 2). However, we found rare cases (4 LINEs, 2 ERVs and 1 SINE) in which the insertion machinery also deletes nucleotides. Missing sequence ranged from 1 bp to 289 bp. The presence of an ancestral microdeletion is directly linked to the absence of the TSD for three LINEs. Of the 113 ancestral deletions, 36 (32%) had from 1 bp to 107 bp of inserted sequence at the breakpoint, in addition to the deletion.

Unexpectedly, in all cases the presence of SNPs in the microhomology region was correlated with the presence of the SV (Fig. 2b). The SNP elongates the microhomology, or, alternatively, the microhomology reflects a hypermutable state associated with break induced replication around the SV29. However this phenomenon is rare: we only observed five (4.5%) cases amongst our 113 manually-curated ancestral deletions (Supplementary Table 8) where a SNP and SV formation co-segregate. We found a similar relationship between a SNP formed at a target site duplication and the presence of an ancestral insertion. Fifteen ancestral insertions (16%) had SNPs or short indels within their target site duplication, coincident with the insertion (Supplementary Table 8).
Given their potential role in human disease30, we were interested to document the occurrence of recurrent SVs, those that arise at the same genomic locus independently in unrelated individuals. Non-allelic homologous recombination (NAHR) is the major mechanism for recurrent SVs31, while fork stalling and template switching (FoSTeS) and/or microhomology-mediated break-induced replication (MMBIR) mechanisms may be important for nonrecurrent SVs26,27. We looked for SVs occurring at the same locus in different strains, but with different breakpoints, indicating independent origins. Using the SV breakpoints obtained from PCR sequencing (249 SV sites in eight strains that account for over 4,000 breakpoints; Supplementary Table 8), we found that in the classical strains, only 2.5% of deletions at the same locus had different breakpoint sequences. However within all 17 strains, we found multiple alleles at 12% of SVs, due almost entirely to the presence of different alleles originating from the wild-derived inbred strains (Supplementary Table 8). Consistent with the low frequency of recurrent SVs, breakpoint features associated with NAHR are rare. We estimated that 0.13% of deletions are due to NAHR, when we required a signature of ≥200 bp of ≥90% sequence identity. Two analyses, therefore, indicate that recurrent SVs are rare.

Impact of SVs on gene function

We assessed the impact of SVs on phenotypes in three ways: (i) the relationship between the position of SVs and the position of genes; (ii) changes in expression of genes overlapping, or nearby, an SV; (iii) association between SVs and phenotypes in an outbred population of mice.

Across all strains, SVs overlap 10,291 genes. We found that SVs are, in all strains except C57BL/6N, significantly depleted (P<0.01) in genes (fold change 0.91). However, we found that SINE insertions are significantly enriched in the introns of genes (P<0.01, fold change 1.34).
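The NAHR criterion quoted above (at least 200 bp of at least 90% sequence identity) can be illustrated with a small check on two flanking sequences. This is a sketch under a strong simplifying assumption: the two sequences are taken to be already aligned, gap-free and of equal length. How the flanks were actually aligned is not described in this section, and the function names are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (gap-free alignment assumed)")
    if not seq_a:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def has_nahr_signature(seq_a, seq_b, min_len=200, min_identity=0.90):
    """True if the flanks meet the length and identity thresholds quoted in the text."""
    if len(seq_a) < min_len or len(seq_a) != len(seq_b):
        return False
    return percent_identity(seq_a, seq_b) >= min_identity


flank = "ACGT" * 50                       # 200 bp toy sequence
diverged = "TGCA" * 5 + flank[20:]        # same length, 20 mismatches (90% identity)
print(has_nahr_signature(flank, diverged))
```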
The relative depletion of SVs within genes implies a proportionate deficit in their phenotypic consequences. We found this to be true for the effect of SVs on gene expression, by estimating heritability attributable to SVs. Variation within strains is due to environmental factors and variation between strains is due to both environmental and genetic factors, so the difference between the two variances is a measure of genetic effect (heritability). Figure 4a shows scaled variances in gene expression from brain RNA-Seq data measured between and within strains for five categories of SVs. Variances for between strain variation are clearly larger than for within strain variances, indicating that SVs do have an impact on expression, but how big an impact? We estimated the proportion of heritability attributable to SVs16 and found that no category accounts for more than 10%.

To determine if these results were specific to brain tissue, we analysed gene expression data for the eight founder strains of the HS population (n = 5 for each) from liver, measured on Illumina arrays32. Mean heritability attributable to an SV, for transcripts overlapping one or more SVs, was 9.5%.

Since many transcripts overlap multiple small SVs (median of 3, maximum of 216), we hypothesized that SV heritability might be related to the amount of gene overlapped. For each transcript we summed the amount of DNA overlapping a gene and expressed this as a proportion of the total length of the gene. SVs that overlap 50% or more of a gene make a large contribution to heritability: in brain tissue, such SVs contribute to 25% of the variance, compared to 7.8% for transcripts where SVs overlap less than 50% of the gene. However, large overlaps (50% or more) are rare, affecting less than 3% of transcripts. Thus while SVs make a modest contribution to the overall heritability of expression variance, at individual transcripts they may be the main cause of between-strain differences in expression.
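The variance argument above (within-strain variance reflects environment, between-strain variance reflects environment plus genetics) can be sketched for a single transcript as follows. This is not the estimator of ref. 16, whose details are not given here. As labelled assumptions, the between-strain variance is approximated by the variance of all animals pooled across strains, the within-strain variance by the mean of the per-strain sample variances, and the result is clipped at zero. The function name and the toy expression values are hypothetical.

```python
from statistics import mean, variance


def heritability_sketch(expr_by_strain):
    """Rough (between - within) / between estimate for one transcript.

    expr_by_strain maps a strain name to a list of expression values,
    one per animal (at least two animals per strain).
    """
    # Environmental component: average sample variance within each strain.
    within = mean(variance(values) for values in expr_by_strain.values())
    # Environmental plus genetic component: variance across all animals pooled.
    pooled = [v for values in expr_by_strain.values() for v in values]
    between = variance(pooled)
    if between == 0:
        return 0.0
    return max(0.0, (between - within) / between)


toy = {"strainA": [1.0, 2.0, 3.0], "strainB": [11.0, 12.0, 13.0]}
print(round(heritability_sketch(toy), 3))
```

When strains differ strongly relative to the spread within each strain, as in the toy data, the estimate approaches 1; when strains are indistinguishable it is 0.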
We also found that SVs outside a gene have only small effects on gene expression. Figure 4b shows between and within strain variances for SVs lying at distances from less than 2 Kb to more than 40 Kb from brain transcripts with no SV overlap (the density of SVs meant that we found too few transcripts further than 60 Kb from an SV to analyze). Heritability attributable to SVs within 2 Kb of the gene is 2%, and falls as the distance from the gene increases.

We considered whether the lower estimates we obtained for the effect of SVs, compared to those obtained from array based assays, might be due to the differences in the way SVs were assessed. Using SVs from a genomewide array based assessment of SVs in 12 classical strains, we calculated within and between strain variances16. Results, shown in Figure 4a, demonstrate a larger difference between within-strain and between-strain variances than seen using SVs from our sequence analysis. SVs assessed by arrays contribute to 25% of the variance in gene expression. Differences in the heritability estimates are thus due in part to the differences in the way SVs are called.

Our third observation of the phenotypic impact of SVs is that they are unlikely to be the causative variant at QTLs, as we know from genetic association with 100 phenotypes measured in over 2,000 heterogeneous stock (HS) mice22. We applied a test of functionality33 to 281,246 SVs where we were certain that the strain distribution pattern (SDP) was correct (Supplementary Methods). We identified 290 QTLs where SVs were among the variants most likely to be functional, but in all these cases the SVs were only a subset of the total number of functional variants. We found a small but significant deficit in SVs among the functional variants (0.36% compared to 0.54% among the non-functional, P < 1E-16, χ2 = 72.1).
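The reported chi-square statistic of 72.1 can be checked against the quoted P-value bound with the standard closed form for the chi-square survival function with one degree of freedom. The degrees of freedom are not stated in the text; one is assumed here, as for a 2x2 comparison of SV versus non-SV variants among functional versus non-functional variants.

```python
import math


def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


p_value = chi2_sf_1df(72.1)
print(p_value < 1e-16)
```

Under that assumption the P-value is on the order of 1E-17, consistent with the bound reported.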
While SVs make a relatively small contribution to the total amount of quantitative phenotypic variation, at a small number of QTLs they are the cause of variation. As shown in our companion paper34, larger effect QTLs are more likely to arise from SVs (and see Supplementary Figure 4a, 4b and 4c). We identified 12 QTLs where the SV overlapped a gene with its flanking region (2 Kb up or downstream of a gene), and where the QTL effect size is in the top 5% of the distribution. Table 3 lists these SVs, the genes they affect and the putative phenotype with which they are associated. Complementation of the deletion of the H2-Ea promoter has confirmed the effect of this SV on the T-cell phenotype35. In one other case we have evidence in favour of a causative role for the SV: Eps15 -/- male mice exhibited a significantly lower locomotor activity (Supplementary Fig. 5) compared to matched wild type male mice, indicating that the SV is likely the cause of the QTL.

SVs that disrupt coding exons

There are relatively few examples where an SV can be said unequivocally to delete one, or more, coding exons. Without nucleotide resolution accuracy it is often impossible to be certain whether the breakpoint of an SV lies within an exon, so to find SVs overlapping exons we used our most accurate and complete category of SV calls: deletions relative to the reference strain. Using this list, we started with 210 that overlap exons from Ensembl (Build 58); after removing pseudogenes, and anything not annotated as 'protein coding', we were left with just 24 SVs that affect coding exons, including six that encompass a gene (or several) in its entirety. Table 4 gives positional information for these SVs, the gene they affect (genes that are affected in their entirety are indicated by an asterisk), how they formed, their strain distribution pattern (SDP) and their known function as reported in the current literature.
Five of the 24 SVs are already known36-40; the remaining 19 SVs are novel. Remarkably, a third of the genes affected by these SVs are involved in immunity and infection. Figure 2c gives an example of how our data expands current knowledge of the molecular architecture of these SVs. The antiviral genes Trim5 and Trim12a are for the first time revealed as unique to C57BL/6J, due to segmental duplication41. All the other strains contain only the Trim12c gene. Therefore the mouse contains a unique homologue of the human TRIM5 gene, similarly to the rat, and the expansion of the Trim12 genes appeared only in the C57BL/6J lineage. A second example is our analysis of the beta defensin 8 gene (Defb8), another immune related gene. Two alleles have been identified and differ by 3 base pair changes in the second exon42,43. Our analysis reveals that these documented exonic changes are linked to a previously undetected 3,192 bp deletion that includes the first exon of the gene.

Discussion

Our results are important in three respects: first, we find an unexpectedly large number of SVs with diverse molecular architecture, thus providing a catalogue of the most dynamic and variable regions of the mouse genome. Second, we identify breakpoints at nucleotide level resolution, giving a genome wide picture of how SVs originate. Third, we demonstrate that, despite their abundance, SVs make relatively little functional impact, as assessed by their effects on gene expression and phenotypic variation in the whole animal.

We were able to find more SVs, of greater complexity, because we relied on manual inspection of the PEM results, combined with molecular validation, before using automated calling methods.
Previous studies have revealed the noisiness of sequence based methods of SV calling12,44,45, due in part to the multiplicity of forms and the presence of insertions, deletions and inversions often in close proximity to each other, and the difficulty of mapping sequence reads back to repetitive genomes. Nevertheless, we have shown here that it is possible to calibrate automated methods to generate genomewide SV calls of high accuracy.

The SVs we find have two distinguishing characteristics: first, they are small. For deletions, whose size we know accurately, the median is 349 bp. In comparison, the median size of SVs in a recent high-density array analysis of the genomes of 20 laboratory strains was 9 Kb15 and about 1.9 Kb from a PEM analysis of DBA/2J12. Our size estimate is actually an upper limit, since it does not include SVs less than 100 bp. While the latter category currently presents a challenge to sequence technologies, we estimate that there may be 63,000 in the classical laboratory strains, in which case an upper estimate of the median size of SVs is 180 bp. Second, SV density means that we frequently find regions with high concentrations of small rearrangements. These two features emphasize the need for methods of SV identification at base pair, or near base pair, resolution. Otherwise not only are many SVs missed, but those recognized are misclassified: a mixture of small deletions and insertions will be mistaken for a large SV of a single type 14.

Our second important finding is the catalogue of SV mechanisms based on breakpoint sequence. We were able to map almost 60% of deletions to base pair resolution, allowing us to classify SVs by the mechanism that created them. We find that the primary origin of structural variation between mouse strains is attributable to LINE-1 retrotransposons.
Mice differ from humans, in whom LINE-1 retrotransposition comes third after microhomology-mediated processes and nonallelic homologous recombination as the predominant processes in generating SVs44. In contrast to human SV studies, the great majority of SVs we have discovered are non-recurrent rearrangements, based on two observations: among the classical strains, only 2.5% of deletions at the same locus had different breakpoint sequences and less than 1% of deletions are due to NAHR, the mechanism thought to be responsible for the majority of recurrent SVs in humans30.

Our third important observation is that SVs have relatively little impact on gene function. SVs overlapping a gene have been estimated to contribute to a substantial proportion of between-strain expression variance (28% in hematopoietic stem and progenitor cells15; 38% in brain and 66-74% in heart, kidney, liver, lung and testis16). If, as these results suggest, SVs contribute to a third or a half of variation in transcript abundance, then, assuming gene expression contributes to phenotypic variation, SVs will likely have a major role in the genetic determination of all aspects of mouse biology. Available evidence has not yet resolved whether or not this is so. Our findings add to this debate in three ways.

First, we find that SVs overlapping a gene account for less than 10% of variation in gene expression. This value is between three and four times smaller than that found by studies using expression arrays15,16. We think the most likely explanation for the disparity is that array-based studies overestimate the contribution of SVs by conflating under one apparently large SV the effects of numerous smaller rearrangements together with regions of diploid DNA that also contain other variants influencing gene expression (Figure 4a).

Second, few SVs overlap exons. From our set of relative deletions we identified 24 that delete one or more exons.
These SVs, with large effects on a phenotype, are the equivalent of rare variants found in human populations. In mouse populations they are very rare indeed. Since the frequency of insertions is equal to that of deletions, and since these two categories make up 98% of all SVs, we predict that there may be only about 50 SVs that directly overlap exons, or about 0.2% of the total burden of SVs in the genome. Third, our analysis of the phenotypic consequences of SVs on QTLs for multiple phenotypes points to a relative deficit of SVs as the molecular basis of complex phenotypes. By working with an outbred population where all chromosomes are descended from known progenitors, imputation effectively reconstitutes the genomes of all animals, so that we can detect the effects of all variants, both common and rare. Our results indicate that SVs make less of a contribution to phenotypic variation than we would expect given the amount of the genome they affect. For the classical laboratory strains, summing the number of bases involved, SNPs and indels affect 0.5% of the genome (this is a maximum estimate, assuming that indels have a size of 10 bp; many are smaller than this), while on average 33 Mb (2.5%) of each classical laboratory strain falls into structurally variant regions of the genome. This implies that SVs are about five-fold more likely to have phenotypic consequences than SNPs and indels combined. Yet we find that SVs contribute only 10% of heritability, not the 80% implied by the genomic size argument.

It is important to note that conclusions based on our analysis of an outbred mouse population may not apply to other outbred populations (such as humans, where there is continuing debate over the contribution of SVs to phenotypic variation [1,46,47]). The population we tested is derived from inbred progenitors whose homozygosity will have purged their genomes of variants that could otherwise be maintained in heterozygous, freely mating populations.
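The genomic size argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes, as the argument does, that the expected share of phenotypic effect is proportional to the fraction of the genome affected, and the variable names are ours rather than part of the analysis. It also reproduces the estimate of about 50 exon-overlapping SVs from the 24 exonic deletions.

```python
# Worked arithmetic for the genomic size argument (illustrative sketch).
# Inputs are the percentages quoted in the text; all names are hypothetical.

snp_indel_pct = 0.5   # % of genome affected by SNPs and indels (maximum estimate)
sv_pct = 2.5          # % of genome falling into structurally variant regions

# Amount of sequence affected by SVs relative to SNPs and indels.
fold_difference = sv_pct / snp_indel_pct

# Share of heritability expected if effect were proportional to bases affected
# (an assumption of the argument, not a measured quantity).
expected_sv_share = sv_pct / (sv_pct + snp_indel_pct)

observed_sv_share = 0.10  # share of heritability attributed to SVs in the text

# Exon-overlapping SVs: 24 deletions, insertions assumed equally frequent,
# and deletions plus insertions making up 98% of all SVs.
exonic_deletions = 24
predicted_exonic_svs = 2 * exonic_deletions / 0.98

print(f"fold difference: {fold_difference:.1f}")
print(f"expected SV share: {expected_sv_share:.0%}")
print(f"observed SV share: {observed_sv_share:.0%}")
print(f"predicted exon-overlapping SVs: {predicted_exonic_svs:.0f}")
```

Under these assumptions the expected share comes out at roughly 83%, which the text rounds to 80%, and the predicted count of about 49 is consistent with the quoted figure of about 50.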
Nevertheless, despite their relative rarity in the mouse genome, SVs that cause phenotype change are likely to provide biological insights out of proportion to their relatively small contribution to phenotypic variance. Biological insight into a phenotype requires discovering which genes are involved. The task is considerably easier if the SV removes a coding segment, effectively creating a null allele. We expect that the alleles we have described will provide a starting point for investigating the relationship between phenotype and genotype in mice.

Methods Summary

SV discovery. We used a combination of four computational methods: split-read mapping [24], mate-pair analysis [48], single-end cluster analysis (SECluster and RetroSeq, unpublished), and read-depth [49]. These methods identify deletions, insertions, inversions and copy number gains. We also derived methods to recognize other types of rearrangements, such as inversion plus insertion or inversion plus deletion, newly revealed from our experimental analysis.

Experimental analysis. We inspected short-read sequencing data using LookSeq [50] and manually detected SVs across mouse chromosome 19 in its entirety and a random set of other chromosomal regions. We analysed molecular structures of these SVs at nucleotide-level resolution using PCR and Sanger-based sequencing.

Outgroup analysis. We used the rat as an outgroup species to classify each mouse SV as either an ancestral deletion or an ancestral insertion. We predicted the ancestral state in the rat by estimating the size of the region in the rat genome that was homologous to the region that encompassed the mouse SV.

SV classification. We developed a machine learning method to classify SVs. The method used a random forest classifier, trained using sequence features within the SVs. Microhomology between breakpoints was determined by recording the longest sequence of bases that was identical between each breakpoint of each SV.

Functional impact of SVs.
We tested whether an SV is likely to be functional using merge analysis [33]. The variances of expression data were calculated using ANOVA in the statistical software R, using formulae described in ref. 16, and also by comparing a model where the expression value is explained by the strain to a model in which the expression is explained by strain and whether or not the animal has an SV.

Full methods are provided in Supplementary Information.

References

1. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704-712 (2010).
2. Durbin, R. M. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073 (2010).
3. Kidd, J. M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453, 56-64 (2008).
4. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59-65 (2011).
5. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444-454 (2006).
6. Buchanan, J. A. & Scherer, S. W. Contemplating effects of genomic structural variation. Genet Med 10, 639-647 (2008).
7. Hurles, M. E., Dermitzakis, E. T. & Tyler-Smith, C. The functional impact of structural variation in humans. Trends Genet 24, 238-245 (2008).
8. Zhang, F., Gu, W., Hurles, M. E. & Lupski, J. R. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10, 451-481 (2009).
9. Cutler, G., Marshall, L. A., Chin, N., Baribault, H. & Kassner, P. D. Significant gene content variation characterizes the genomes of inbred mouse strains. Genome Res 17, 1743-1754 (2007).
10. Egan, C. M., Sridhar, S., Wigler, M. & Hall, I. M. Recurrent DNA copy number variation in the laboratory mouse. Nat Genet 39, 1384-1389 (2007).
11. Graubert, T. A. et al. A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet 3, e3 (2007).
12. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 20, 623-635 (2010).
13. Snijders, A. M. et al. Mapping segmental and sequence variations among laboratory mice using BAC array CGH. Genome Res 15, 302-311 (2005).
14. Agam, A. et al. Elusive copy number variation in the mouse genome. PLoS One 5 (2010).
15. Cahan, P., Li, Y., Izumi, M. & Graubert, T. A. The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nat Genet 41, 430-437 (2009).
16. Henrichsen, C. N. et al. Segmental copy number variation shapes tissue transcriptomes. Nat Genet 41, 424-429 (2009).
17. Akagi, K., Li, J., Stephens, R. M., Volfovsky, N. & Symer, D. E. Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res 18, 869-880 (2008).
18. Gilbert, N., Lutz-Prigge, S. & Moran, J. V. Genomic deletions created upon LINE-1 retrotransposition. Cell 110, 315-325 (2002).
19. Symer, D. E. et al. Human L1 retrotransposition is associated with genetic instability in vivo. Cell 110, 327-338 (2002).
20. Garvey, S. M., Rajan, C., Lerner, A. P., Frankel, W. N. & Cox, G. A. The muscular dystrophy with myositis (mdm) mouse mutation disrupts a skeletal muscle-specific domain of titin. Genomics 79, 146-149 (2002).
21. Schadt, E. E. et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 37, 710-717 (2005).
22. Valdar, W. et al. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet 38, 879-887 (2006).
23. Wong, K., Keane, T. M., Stalker, J. & Adams, D. J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 11, R128 (2010).
24. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865-2871 (2009).
25. Albers, C. A. et al. Dindel: Accurate indel calls from short-read data. Genome Res 21, 961-973 (2011).
26. Hastings, P. J., Ira, G. & Lupski, J. R. A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet 5, e1000327 (2009).
27. Zhang, F. et al. The DNA replication FoSTeS/MMBIR mechanism can generate genomic, genic and exonic complex rearrangements in humans. Nat Genet 41, 849-853 (2009).
28. Ma, J. L., Kim, E. M., Haber, J. E. & Lee, S. E. Yeast Mre11 and Rad1 proteins define a Ku-independent mechanism to repair double-strand breaks lacking overlapping end sequences. Mol Cell Biol 23, 8820-8828 (2003).
29. Deem, A. et al. Break-induced replication is highly inaccurate. PLoS Biol 9, e1000594 (2011).
30. Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annu Rev Med 61, 437-455 (2010).
31. Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends Genet 18, 74-82 (2002).
32. Huang, G. J. et al. High resolution mapping of expression QTLs in heterogeneous stock mice in multiple tissues. Genome Res 19, 1133-1140 (2009).
33. Yalcin, B., Flint, J. & Mott, R. Using progenitor strain information to identify quantitative trait nucleotides in outbred mice. Genetics 171, 673-681 (2005).
34. Keane, T. Sequence variation amongst 17 laboratory and wild-derived mouse genomes and its affect on gene regulation and phenotypic variation. Nature (2011).
35. Yalcin, B. et al. Commercially available outbred mice for genome-wide association studies. PLoS Genet 6 (2010).
36. Best, S., Le Tissier, P., Towers, G. & Stoye, J. P. Positional cloning of the mouse retrovirus restriction gene Fv1. Nature 382, 826-829 (1996).
37. Boyden, L. M. et al. Skint1, the prototype of a newly identified immunoglobulin superfamily gene cluster, positively selects epidermal gammadelta T cells. Nat Genet 40, 656-662 (2008).
38. Nelson, T. M., Munger, S. D. & Boughter, J. D., Jr. Haplotypes at the Tas2r locus on distal chromosome 6 vary with quinine taste sensitivity in inbred mice. BMC Genet 6, 32 (2005).
39. Persson, K., Heby, O. & Berger, F. G. The functional intronless S-adenosylmethionine decarboxylase gene of the mouse (Amd-2) is linked to the ornithine decarboxylase gene (Odc) on chromosome 12 and is present in distantly related species of the genus Mus. Mamm Genome 10, 784-788 (1999).
40. Wu, B. et al. Mutations in sterol O-acyltransferase 1 (Soat1) result in hair interior defects in AKR/J mice. J Invest Dermatol 130, 2666-2668 (2010).
41. Tareen, S. U., Sawyer, S. L., Malik, H. S. & Emerman, M. An expanded clade of rodent Trim5 genes. Virology 385, 473-483 (2009).
42. Bauer, F. et al. Structure determination of human and murine beta-defensins reveals structural conservation in the absence of significant sequence similarity. Protein Sci 10, 2470-2479 (2001).
43. Taylor, K. et al. Defensin-related peptide 1 (Defr1) is allelic to Defb8 and chemoattracts immature DC and CD4+ T cells independently of CCR6. Eur J Immunol 39, 1353-1360 (2009).
44. Kidd, J. M. et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837-847 (2010).
45. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420-426 (2007).
46. Craddock, N. et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713-720 (2010).
47. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747-753 (2009).
48. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-681 (2009).
49. Simpson, J. T., McIntyre, R. E., Adams, D. J. & Durbin, R. Copy number variant detection in inbred strains from short read sequence data. Bioinformatics 26, 565-567 (2010).
50. Manske, H. M. & Kwiatkowski, D. P. LookSeq: a browser-based viewer for deep sequencing data. Genome Res 19, 2125-2132 (2009).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature. Supplementary Information contains Supplementary Figures and Tables, additional Methods, and Supplementary References.

Acknowledgements. We thank Adam Whitley, Giles Durrant, Andrew Marc Hammond, Danica Joy Fabrigar, Lucia Chen, Martina Johannesson, Enzhao Cong and Glòria Blázquez for helping B.Y. with various laboratory-based work. We also thank Chris P. Ponting for comments on the manuscript. This project was supported by The Medical Research Council, UK and the Wellcome Trust. D.J.A. is supported by Cancer Research UK.

Author contributions. D.J.A. and J.F. conceived the study and directed the research. J.F. wrote the core of the paper. K.W. and T.K. performed the genome-wide SV discovery and local assembly for SV breakpoint resolution. K.W. carried out the sensitivity and specificity analyses. K.W. and B.Y. liaised regularly to integrate experimental work into the genome-wide SV discovery pipeline. This resulted in a highly accurate map of SVs across the mouse genome, essential to downstream analyses. A.B., P.H.P., H.W., J.C., R.D. and D.J. carried out experimental work, led by B.Y. A.B. and B.Y. analysed Sanger-based sequencing data, resolved SV breakpoints at nucleotide-level resolution and inferred mechanism of SV formation. M.G. performed the genome-wide SV mechanism of formation and outgroup analysis, with contributions from A.A. and B.Y. J.F. analysed functional impact of SVs on expression and phenotypes, with contributions from A.A. C.N., L.G., J.N. and R.M. carried out additional analyses. B.Y. characterised function of individual SV examples.
Author information. Data sets described here will be available under study accession number estd118 from the Database of Genomic Variants archive (DGVa) at http://www.ebi.ac.uk/dgva/page.php. Reprints and permissions information is available at www.nature.com/reprints. Readers are welcome to comment on the online version of this article at www.nature.com/nature. Correspondence and requests for materials should be addressed to J.F. ([email protected]).

Tables

Table 1: Structural variants greater than 100 bp in 17 inbred strains

               Simple                    Complex
Strain            del  gain  inv    ins  del+ins  nested  inv+del/ins
129P2/OlaHsd    16292    57   74  15604      105      27           68
129S1/SvImJ     17307    70   88  11516       73      32           67
129S5/SvEvBrd   16089    72   67   8970       43      41           58
A/J             16190    69   92  12184       61      28           67
AKR/J           15806    88   89  14576       88      13           82
BALB/cJ         14859    82   87  10551       48      17           58
C3H/HeJ         16062    94   94  12100       90      16           76
C57BL/6N          164    44    6    213        0       3            1
CAST/EiJ        50978   361  224  34122      133     239          265
CBA/J           16996    79   83  10867       64      16           78
DBA/2J          17478    67   83  10559       55      29           75
LP/J            16964    64   88  12745       64      30           69
NOD/ShiLtJ      17047    51  115  13244       53      16           79
NZO/HlLtJ       15429    62   71   9445       33      23           62
PWK/PhJ         54147    96  272  35098      184      60          268
SPRET/EiJ       91295   112  470  64304      463     110          552
WSB/EiJ         22154    88   97  12521       64      37          105

del: deletion; gain: copy number gain; inv: inversion; ins: insertion; del+ins: deletion plus insertion; nested: SV in a copy number gain region; inv+del/ins: inversion plus deletion(s) or inversion plus insertion.

Table 2: SV classification and inferred mechanism of formation

a Sequence features at breakpoints LINE ERV SINE Target site duplication (TSD) none 4-10 bp 11-20 bp >20 bp 6.7% 13.3% 78.3% 1.7% 6.7% 93.3% 0.0% 0.0% 0.0% 15.4% 84.6% 0.0% Microdeletion none 1-34 bp >200 bp 93.3% 5.0% 1.7% 93.3% 6.7% 0.0% 92.3% 7.7% 0.0% Microhomology none 1-2 bp 3-25 bp 26-200 bp >200 bp b Inferred mechanisms Total Retrotransposition LINE Retrotransposition ERV Retrotransposition SINE Retrotransposition SRS MMEJ, MMBIR NHEJ SSA NAHR FoSTeS/others 30 13 Complex CNG 92.3% 7.7% 0.0% 0.0% 0.0% 23.1% 76.9% 0.0% 60 Inversion VNTR Microinsertion none 1-10 bp 11-50 bp >51 bp Total (249 SV regions) Deletion Insertion Ancestral Events 13 15.9% 15.0% 67.3% 0.9% 0.9% 50.0% 12.5% 37.5% 0.0% 0.0% 12.5% 37.5% 50.0% 0.0% 0.0% 0.0% 50.0% 50.0% 0.0% 0.0% 68.1% 23.0% 8.0% 0.9% 62.5% 37.5% 0.0% 0.0% 87.5% 12.5% 0.0% 0.0% 50.0% 50.0% 0.0% 0.0% 113 8 8 4 30.5% 13.3% 0.4% 0.4% 0.8% 3.2% 3.2% 1.6% 24.1% 12.0% 5.2% 5.2%

This detailed classification is based on the 249 SVs resolved at nucleotide-level resolution (Supplementary Table 8). MMEJ: Microhomology-mediated end joining; NHEJ: Non-homologous end joining; FoSTeS: fork stalling and template switching; MMBIR: Microhomology-mediated break-induced replication; NAHR: Non-allelic homologous recombination; SRS: Serial replication slippage; SSA: Single strand annealing; CNG: Copy number gain.

Table 3: QTLs associated with SVs

Phenotype                                  Chr  Start       Stop        Ancestral event      SDP       Gene           Gene region  Merge LogP
Mean platelet volume                       1    175158884*  175158885*  ins (large)          10000000  Fcer1a         upstream     52.8
OFT Total activity                         2    144402760   144402971   SINE ins             11111100  Sec23b         intron       15.7
Hippocampus cellular proliferation marker  4    49690362    49690363    del (137 bp)         11111001  Grin3a         intron       20.1
Home cage activity                         4    108951263   108951264   IAP ins (~6400 bp)   10000011  Eps15          upstream     15.9
T-cells: %CD3                              4    130038388   130038389   SINE ins (202 bp)    00000001  Snrnp40        intron       12.1
Wound healing                              7    90731819    90731820    IAP ins (~6400 bp)   11100000  Tmc3           upstream     22.2
Red cells: mean cellular haemoglobin       7    111397607   111479433   ins                  00001000  Trim5          exon         13.0
Red cells: mean cellular haemoglobin       7    111504989   111505193   del                  11110111  Trim30b        UTR          12.8
Red cells: mean cellular volume            8    87957244    87957245    LINE ins (~500 bp)   10010110  4921524J17Rik  upstream     18.1
Serum urea concentration                   11   115106127   115106250   del                  00100000  Tmem104        UTR          13.4
Hippocampus cellular proliferation marker  13   113783196   113783359   del                  10000000  Gm6320         upstream     17.5
T-cells: CD4/CD8 ratio                     17   34483681    34483682    del (629 bp)         00001001  H2-Ea          upstream     82.9

ins: insertion; del: deletion; SINE: Short INterspersed repeat Elements; LINE: Long INterspersed repeat Elements; IAP: Intracisternal A Particle. Start and stop coordinates are given for build 37 of the mouse genome, so that insertions into the reference are given as consecutive base pairs (columns headed Start and Stop). Unless there is a plus sign (“+”), coordinates refer to the exact coordinates as delineated by Sanger PCR sequencing. SDP is the Strain Distribution Pattern of the ancestral event (“1” refers to presence and “0” to the absence of the event) in the following strain order: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J. The part of the gene overlapped is reported in the column headed Gene region. LogP is the negative logarithm of the P-value for association between the SV and the phenotype as assessed in outbred HS mice [22].

Table 4: SVs affecting coding regions

MGI gene      Chr  SV start    SV stop     Ancestral event          SDP                Known function
Soat1         1    158394620   158401436   del                      01000000000000000  Hair interior defects
Olfr1055      2    86179898    86186982    IAP ins                  00001000111001000  Olfactory
Fcrl5         3    87245084    87245947    del                      00000000000100000  Infection and immunity
Nes           3    87780530    87780662    VNTR                     00001001111000001  Brain development
Pglyrp3       3    91831862    91835385    del                      00000000000100000  Infection and immunity
Skint4,3,9*   4    111731004+  112272814+  ins                      00001000000010000  Infection and immunity
Fv1           4    147244398   147245739   del                      01010111111110101  Infection and immunity
Ugt2b38       5    87850554    87854999    del                      00000001111000020  Metabolism
Klrb1a        6    128559593   128559740   del                      00000000000100110  Infection and immunity
Klri2         6    129689526   129691211   del                      00000000111000000  Infection and immunity
Tas2r120*     6    132580541   132613777   del+linked ins (326 bp)  11110111000000000  Taste
Tas2r103      6    132985563   132986696   del                      00000000000010000  Taste
Zfp607*       7    28646761    28671650    del                      00000000000010000  DNA-binding
Krtap5-5      7    149415121   149415210   VNTR                     11011111111111001  Hair formation
Trim5,12a*    7    111397607   111479433   ins                      00001000000000000  Infection and immunity
Defb8         8    19447465    19450575    ins+del (54 bp)          01001001000010001  Infection and immunity
Zfp872        9    22004856    22005023    VNTR                     11111110111011111  DNA-binding
Olfr913       9    38402589    38403498    del                      11110010111100000  Olfactory
Rtp3          9    110889280   110889465   VNTR                     00001000000000000  Bone density
Nlrp1c*       11   71046193+   71101410+   ins                      10001000000001100  Embryonic development
Fam110c       12   31759321    31759461    VNTR                     00001011111111110  Cell migration
Olfr234       15   98328544    98328861    del                      10010000000000000  Olfactory
Krtap16-1     16   88874294    88874392    VNTR                     10111002222122221  Hair formation
Amd2*         18   64607747    64609669    ins                      10111110000010001  Biosynthesis of polyamines

MGI: Mouse Genome Informatics; ins: insertion; del: deletion; VNTR: Variable Number Tandem Repeat; IAP: Intracisternal A Particle. The Strain Distribution Pattern (SDP) relative to the ancestral event is given for all strains: “1” referring to presence, “0” to absence and “2” to an additional allele, in the following strain order: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J, LP/J, 129P2/OlaHsd, 129S1/SvImJ, 129S5/SvEvBrd, NOD/ShiLtJ, NZO/HlLtJ, CAST/EiJ, PWK/PhJ, SPRET/EiJ and WSB/EiJ. * indicates that the structural variant overlaps the entire gene. Unless there is a plus sign (“+”), coordinates refer to the exact coordinates as delineated by Sanger PCR sequencing.

Figure Legends

Figure 1: Identification of structural variants
a) Venn diagram showing the overlap between deletion SVs (relative to C57BL/6J) detected in our study (blue) and those published elsewhere (Agam et al., 2010 in red and Quinlan et al. in green), in DBA/2J.
b) Basic rearrangements: deletion (del), insertion (ins), inversion (inv), tandem duplication (tandem dup) and other types of copy number gains. Inverted tandem duplication is drawn in Supplementary Figure 1 (H9). Linked insertion (linked ins) is an insertion where the inserted sequence is copied from nearby. Inverted linked insertion (Q9; drawn in Supplementary Fig. 1) has a similar pattern to a linked insertion but the inserted sequence is inverted.
c) Complex rearrangements: deletion co-occurring with an insertion (del+ins), linked deletion (normal copy of small size flanked by two deletions), deletion within a gain (del in gain), inversion with flanking deletions (del+inv+del), inversion with an insertion (inv+ins) and inversion within a gain (inv in gain).
d) PEM pattern of a del+inv+del. Green arrows represent primers used for PCR amplification and sequencing reactions.
e) PEM pattern of an inv+ins, with PCR data across the eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J). Hyperladder II is used as the molecular size marker.
Amplicon size for BALB/cJ, C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than in the other strains, indicative of the insertion. A complete list of PEM patterns is given in Supplementary Figure 1.

Figure 2: Experimental analysis of SVs
a) A complex SV, involving several genomic rearrangements including an inversion, a deletion and two small insertions, is displayed relative to its genic location along Zbtb10 (zinc finger and BTB domain containing 10). PCR amplification using forward (F) and reverse (R) primers revealed an AT insertion at the first breakpoint (B1), followed by an inversion of 125 bp that comprises an inverted linked insertion of a 22 bp region, as seen in B2. The third breakpoint (B3) revealed a deletion of 813 bp. Hyperladder II was used as the size marker. C57BL/6J and LP/J show a normal size of 1604 bp, whereas A/J, AKR/J, BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp.
b) Relationship between SNP and SV formation. Two SNPs lying on the 6 bp microhomology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522) correlated with the presence of the SV. Sequencing traces are shown for a test strain (A/J) and the reference strain (C57BL/6J). Note that the traces of all other test strains are identical to the one shown here. An asterisk is used to emphasize the microhomology of 6 bp (GAACTA). The presence of two SNPs (C->G and T->A) in all test strains (here only shown in A/J) is associated with the presence of the ancestral deletion.
c) Schematic representation of the Trim6 to Trim30 gene cluster on chromosome 7. Boxes represent the sequential positions of the Trim6, Trim34, Trim5/12 and Trim30 genes. The Trim5 and Trim12a genes are present only in the C57BL/6J genome, having arisen by segmental duplication of the Trim12c gene, which is present in all 17 strains. The flanking Trim34 and Trim30 genes do not vary between strains.

Figure 3: Classification of structural variants
a) Histogram of lengths for each deletion SV class.
b) Microhomology surrounding SV breakpoints. SVs were classified as in (a) and the longest length of microhomology between both breakpoints was recorded.

Figure 4: Impact of SVs on gene expression
a) Within-strain (grey boxes) and between-strain (white boxes) gene expression variances for transcripts which are not overlapped by any structural variant (No SV) and for those which are. Within-strain variance is due to environmental effects; between-strain variance is due to environmental and genetic effects. The difference between the two variances is a measure of heritability. Six categories are shown: deletions (Dels), insertions (Ins), copy number gains (Gains), inversions (Inv), complex rearrangements (Complex), and SVs (of any class) identified by an array analysis (Array:SVs).
b) Effect of distance from the transcript on gene expression variances. Grey boxes are within-strain and white boxes are between-strain variances. The figure shows standardized variances of gene expression for transcripts with structural variants at distances from less than 2 Kb to more than 40 Kb from either the start or end of the transcript.
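
The microhomology measurement described in the legend to Figure 3b can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes a deletion given as 0-based, half-open coordinates on a reference string, and it counts the bases over which the deletion could be shifted left or right without changing the resulting sequence. The function name and the toy sequences are hypothetical; the 6 bp motif GAACTA is borrowed from Figure 2b purely as an example.

```python
def deletion_microhomology(ref: str, start: int, end: int) -> int:
    """Return the microhomology length (bp) at the breakpoints of a deletion.

    ref   -- reference sequence containing the deleted segment
    start -- 0-based index of the first deleted base
    end   -- 0-based index of the first base after the deletion
    """
    if not 0 <= start < end <= len(ref):
        raise ValueError("deletion coordinates must lie within the reference")

    # Bases at the start of the deleted segment that match the bases
    # immediately after the deletion.
    right = 0
    while (start + right < end and end + right < len(ref)
           and ref[start + right] == ref[end + right]):
        right += 1

    # Bases at the end of the deleted segment that match the bases
    # immediately before the deletion.
    left = 0
    while (end - 1 - left >= start and start - 1 - left >= 0
           and ref[end - 1 - left] == ref[start - 1 - left]):
        left += 1

    # Assumption of this sketch: cap at the deletion length so that
    # tandem repeats are not counted twice.
    return min(left + right, end - start)


# Toy reference: the deleted segment starts with GAACTA and is followed
# by a second copy of GAACTA.
toy_ref = "TTT" + "GAACTA" + "CCCCC" + "GAACTA" + "GGG"
print(deletion_microhomology(toy_ref, 3, 14))   # breakpoint called at the left edge
print(deletion_microhomology(toy_ref, 6, 17))   # same deletion, shifted call
print(deletion_microhomology("AAAACCCCGGGGTTTT", 4, 8))  # blunt-ended deletion
```

The two calls on the toy reference describe the same deletion at different positions and both report 6 bp, while the blunt-ended deletion reports 0. Summing the left and right extensions makes the result independent of where within the microhomology the caller happened to place the breakpoint.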