Sequence based characterization of structural variation in the mouse genome

Introduction

Structural variation (SV) in the mammalian genome is known to be abundant and to contribute to phenotypic variation and disease. There has been considerable progress in assessing its extent (), phenotypic impact () and the responsible molecular mechanisms () in the human genome, but much less is known about SV in the mouse, currently the preeminent organism for modeling how genetic lesions give rise to disease in mammals. In this paper we use next generation sequencing to address three critical questions: what is the extent of SV in the mouse genome, what are the likely mechanisms for its formation, and what are its phenotypic consequences?

Current catalogues of mouse SVs are based on differential hybridization of genomic DNA to oligonucleotide arrays (array comparative genome hybridization, aCGH). While aCGH can interrogate entire genomes, it is blind to some SV categories (such as inversions and insertions) and has a limited ability to detect others (segmental duplications and transposable elements). Estimates of the proportion of the mouse genome affected by SVs range from 3% (CAHAN et al. 2009) to over 10% (HENRICHSEN et al. 2009), with three to four fold more deletions than duplications detected in the most recent genome-wide aCGH experiments (Cahan, Agam).

Assessing the potential mechanism of SV formation requires much higher resolution than aCGH affords, ideally down to the base pair. Sequence based methods, such as short-read paired end mapping (PEM), have the requisite level of resolution and have been used to identify 7,196 SVs and 3,316 breakpoint sequences. These data, from a comparison of two laboratory strains (C57BL/6J and DBA/2J), indicate that most variation is due to retrotransposition () and that mechanisms of SV formation require little or no homology, so that non-allelic homologous recombination is rare.
Here we report the identification, using short-read sequencing, of more than half a million SVs in 17 inbred strains of mice. By analyzing breakpoint sequence we infer the mechanisms of formation and assess their relative impact on shaping a mammalian genome. Our molecular characterization of SVs in the mouse genome is a starting point for determining the extent to which SVs contribute to genetic and phenotypic diversity.

Results

SV identification

Variation in the expected number of reads mapping to the reference sequence was used to identify copy number variation, while deviations from the expected distance between paired reads, and from their expected orientation, were used to determine the type of structural variant (Supplementary Methods). It has been difficult to determine how well computational implementations of these approaches perform, as validation methods are typically labour intensive; furthermore, it has become clear that many structural variants are complex, involving combinations of deletions, insertions and inversions (AGAM et al. ; QUINLAN et al.). We therefore decided to identify a benchmark set of SVs to use as validation for detection. We manually designed 352 unique pairs of primers (uninformative primers excluded) and successfully amplified 329 SVs covering a deletion. [BINNAZ CURRENTLY WORKING ON THIS].

We manually examined reads mapping to the entire chromosome 19 (the smallest mouse chromosome), in addition to a random set of other chromosomal regions, for eight strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J), and used two criteria to identify SVs: read depth and anomalous read pairs. Although the basis of PEM has been described in detail in a previous study (MEDVEDEV et al. 2009), to understand our findings it is important to appreciate the six expected patterns of read architecture (Supplementary Figure XX).
For deletions we expect a reduction in read depth and supporting read pairs flanking the deleted region; for copy number gains, an increase in read depth; for tandem copy number gains, in addition, supporting read pairs with opposing orientation spanning the length of the duplicated region; for inverted copy number gains, an increase in read depth and supporting read pairs pointing in the same direction; for insertions, singleton reads flanking the SV (the paired end will be inside the inserted, unsequenced, DNA); and for inversions, unchanged read depth with both ends of the read pairs pointing in the same direction. In addition to the six expected patterns we observed an additional XX (?6). By amplifying and sequencing DNA at the putative SVs we were able to demonstrate that XX anomalous patterns were indicative of a true SV and YY were not. Anomalous patterns were due to additional complexity within the SV and to repeat content at the SV, which led to read mis-mapping.

We identified 693 deletions with the expected read architecture (Supplementary Table xx); we tested 100 by PCR in the eight strains and all validated. We refer to this set as the high confidence deletion set. We identified 600 deletions with anomalous read architecture; we tested 100 and about half validated by PCR. True positives tended to show multiple read pair architectures, indicating a complex combination of deletion, inversion, insertions and gains (fig). False positives tended to be due to mismapped reads caused by repetitive sequence (fig). Chromosome 19 contained only three copy number gains and two inversions, all of which fitted the expected patterns of read architecture.

Automated analysis of chromosome 19 detected, on average, 92.5% of the high confidence deletions. The false positive rate, again for high confidence deletions, ranges from 2.5% to 3.7%.
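The six expected read architectures listed earlier in this section can be written down as a small lookup, which makes explicit how read depth and read-pair evidence combine. This is only a sketch: the label strings and the function name `classify_read_architecture` are hypothetical, and the assumption that insertions leave read depth unchanged is ours (the text specifies only the flanking singleton reads).

```python
from typing import Optional

# Keys: (read-depth change, anomalous read-pair signal). Values: SV type.
# All labels are hypothetical names chosen for this illustration.
EXPECTED_PATTERNS = {
    ("decrease", "flanking_pairs"): "deletion",
    ("increase", None): "copy number gain",
    ("increase", "opposing_orientation_pairs"): "tandem copy number gain",
    ("increase", "same_direction_pairs"): "inverted copy number gain",
    # Assumption: depth is treated as unchanged for insertions.
    ("unchanged", "flanking_singletons"): "insertion",
    ("unchanged", "same_direction_pairs"): "inversion",
}


def classify_read_architecture(depth_change: str,
                               pair_signal: Optional[str]) -> str:
    """Return the expected SV type, or 'anomalous' if no pattern matches."""
    return EXPECTED_PATTERNS.get((depth_change, pair_signal), "anomalous")


if __name__ == "__main__":
    print(classify_read_architecture("decrease", "flanking_pairs"))
    print(classify_read_architecture("decrease", "same_direction_pairs"))
```

Any combination outside the six expected patterns falls through to "anomalous", which corresponds to the category that required PCR follow-up.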
Although the false negative rate per strain ranges from 5.6% (AKR/J) to 8.5% (LP/J), the automated analysis accurately identified 98.2% of sites containing a high confidence deletion (with a concomitant false positive rate of < 1%). Automated analysis was less successful in detecting the more complex rearrangements: of the 50 PCR validated complex deletions, XX were found. [PLEASE CHECK WITH KIM]. [Figure showing SVs on chr 19? Is this useful?]

We investigated the false positive rate of the automated analysis of deletions across the genome by PCR validation of 700 deletions. Consistent with the chromosome 19 analysis, we found that the false positive rate for simple deletions was very low (less than X%). For other types of deletion the rate was higher: YY%. [MORE HERE: BINNAZ/KIM]

We could not assess the performance of the automated analysis in detecting SV types other than deletions because so few were found by manual inspection of chromosome 19. However, we could estimate the false positive rates by randomly choosing 24 of each type for PCR validation. We found that XX% of inversions were false positives, YY% of gains, and ZZ% of insertions.

Genome-wide SVs [Kim]

We searched the entire genome of all strains computationally and identified a total of 1.4M SVs in the 17 strains (Table 1). There are on average XX SVs in classical inbred strains, affecting YY% of the genome. We find four times as many deletions in DBA/2J [What about the comparison to CAST?] as prior studies (Figure X), owing to higher sequence coverage, SV calls from multiple algorithms, and the ability to find shorter variants (Figure Y). Relative to the reference we found two to three times as many insertions as deletions; inversions and copy number gains are very rare (100 per genome of each type). [We need to connect this to the chr 19 work.
Chr 19 data was used to develop/fine-tune methods]

Sequence content of SVs [Avi]

Based on read architecture and sequence composition we classified SVs and observed that, consistent with previous analyses, the majority (~60%) coincide with retrotransposons (how was this determined? Is overlap necessary at a certain length? Results needed from Avi) (LINEs X%, LTRs X%, SINEs Y%), suggesting that transposable elements are the main cause of SV in the mouse genome. The size distribution of SVs reflects this pattern: we see peaks at 6 kb (LINEs), at 250 bp (SINEs), and at 5 and 9 kb (LTRs) (Figure XX). VNTRs constitute XX%. X% of LINEs are polymorphic; if a high proportion, this could be important.

Distribution of SVs across the genome

Despite a number of aCGH studies (CAHAN, AGAM etc), the landscape and distribution of SV in the mouse genome remain poorly characterized. We found that XX% of the genome is constituted of SV. Is the distribution of SV enriched at telomeres? Are there differences between strains? Relationship between segmental duplications and SVs: should find a 2-4 fold enrichment of SVs in regions of segmental duplication.

Complex SVs

Given our observation of clustered SVs in very close proximity, either of the same SV type (for example two deletions very close to each other) or of different SV types (for example an inversion followed directly by a deletion), we attempted to find them automatically across the genome.

SV mechanism

Our next aim was to infer the mechanism by which SVs formed. Inferring SV mechanism accurately requires two things: 1) single-nucleotide resolution of the breakpoints and 2) knowledge of the ancestral state of the SV. Our analysis of SV mechanism of formation fulfilled both criteria. We began by randomly selecting 249 SV regions, polymorphic across eight classical inbred strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J).
Each selected SV type had a size distribution representative of the genome (Supplementary Figure 1 shows the size distribution of deletions, copy number gains and inversions amongst the 249 SV regions selected). We determined the sequence at the breakpoints of the 249 polymorphic SV regions across the eight classical inbred strains (see Methods). We assembled and manually examined contigs encompassing sequence across 1,926 (EXPLAIN BINNAZ) SV breakpoints for each test strain. We successfully identified the exact start and end position of each SV, and characterized the sequence features at the breakpoints by ascertaining the extent and content of the homology and any additional or missing sequence (microinsertion and microdeletion) at the breakpoint.

We predicted the ancestral state of each SV using SPRET/EiJ as the outgroup strain: we assume a deletion is ancestral if it is not found in SPRET/EiJ but is present in one or more of the classical strains (Example A in Table 2). Conversely, if there is a deletion in SPRET/EiJ but not in the reference genome, then we assume the SV is an insertion (Example B in Table 2). However, we found that in 26 cases out of 249 (~10% of the total) the inferred ancestral state is inconsistent with breakpoint features. In example C (Table 2), using SPRET/EiJ as an outgroup suggested the presence of an ancestral deletion, whereas a target site duplication (TSD) at the breakpoint suggested an ancestral insertion. Similarly, in example D (Table 2), using SPRET/EiJ suggested the presence of an ancestral insertion, whereas 3 bp of microhomology at the breakpoint (TTA) suggested an ancestral deletion. Using rat sequence to determine the ancestral state validated both the ancestral insertion on chromosome 14 and the ancestral deletion on chromosome 15. With the exception of 2 cases (<1%), all of the other 24 inconsistencies arising from the use of SPRET/EiJ as an outgroup could be reconciled by combining breakpoint sequence and rat sequence.
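The decision logic described above (outgroup call first, then a consistency check against breakpoint features, then rat sequence to settle conflicts) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name, its boolean arguments and the "unresolved" label are hypothetical, and the trailing asterisk simply mirrors the Table 2 convention for calls resolved with rat sequence.

```python
from typing import Optional


def infer_ancestral_state(outgroup_has_deletion: bool,
                          has_tsd: bool = False,
                          has_microhomology: bool = False,
                          rat_state: Optional[str] = None) -> str:
    """Return 'DEL' or 'INS' (with '*' if resolved via rat), or 'unresolved'."""
    # Outgroup rule: if SPRET/EiJ also lacks the sequence, the reference
    # is assumed to have gained it (insertion); otherwise a deletion.
    outgroup_call = "INS" if outgroup_has_deletion else "DEL"

    # Breakpoint features: a TSD points to an insertion, microhomology
    # to a deletion. With both or neither, they give no verdict here.
    if has_tsd and not has_microhomology:
        feature_call = "INS"
    elif has_microhomology and not has_tsd:
        feature_call = "DEL"
    else:
        feature_call = None

    if feature_call is None or feature_call == outgroup_call:
        return outgroup_call
    # Conflict: fall back on rat sequence when it is available.
    if rat_state is not None:
        return rat_state + "*"
    return "unresolved"


if __name__ == "__main__":
    print(infer_ancestral_state(False))                               # like Example A
    print(infer_ancestral_state(True))                                # like Example B
    print(infer_ancestral_state(False, has_tsd=True, rat_state="INS"))            # like Example C
    print(infer_ancestral_state(True, has_microhomology=True, rat_state="DEL"))   # like Example D
```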
Inferred ancestral state of all 249 SV regions is shown in Supplementary Table 1.

Inferring SV mechanism of formation based on breakpoint features

SV mechanism of formation is typically inferred by examining the sequence features of its breakpoints. For example, 200 bp of sequence identity is thought to be required for NAHR (INOUE and LUPSKI 2002), whereas much shorter homology (microhomology) has often been associated with end joining processes, such as MMEJ. Delineation of retroelements is also facilitated by the presence of a flanking target site duplication (TSD), with a poly(A) tail or poly(T) head for LINE and SINE elements, and with dual or mono long terminal repeats (LTRs) for ERV elements. Variable number of tandem repeat (VNTR) polymorphism is also easily identifiable from its repetitive structure.

We classified the 249 SVs by inferred mechanism of formation (a flowchart for our method is presented in Supplementary Figure 2). Table 3b describes the inferred mechanism for the 249 SV regions. Retrotransposition is the commonest mechanism (41.7%, comprising 24.5% LINE, 12% ERV and 5.2% SINE retrotranspositions), followed by microhomology-mediated end joining processes (31.3%), non-microhomology-mediated end joining (13.3%), replication-based mechanisms such as FoSTeS (<10%), VNTR expansion (5.2%), SSA (0.4%) and NAHR (0.4%).

Most SVs caused by LINE, ERV, SINE and VNTR insertions do not show any missing nucleotides at their breakpoints (95%, 93.3%, 92.3% and 92.3% respectively). However, we found rare cases (4 LINEs, 2 ERVs, 1 SINE and 1 VNTR) in which the insertion machinery also deletes nucleotides. Missing sequence ranged from 1 bp to 289 bp. We found that the presence of an ancestral microdeletion is directly linked to the absence of the TSD for three LINEs. This would suggest a dual mechanism of SV formation that combines DSB repair processes and LINE retrotransposition.
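As an illustration of how microhomology at a deletion breakpoint can be measured and binned, the sketch below counts bases shared between the ends of the deleted segment and the retained flanks, then assigns the size classes used in Table 3 (none, 1-2 bp, 3-25 bp, 26-200 bp, >200 bp). The function names are hypothetical and this is not the flowchart of Supplementary Figure 2. The two 14 bp strings in the usage example are the SV-start and distal-flank fragments that appear to belong to Example D of Table 2 (that assignment is an assumption); only the start of the SV is available there, so only the prefix comparison is made.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bases shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_suffix_len(a: str, b: str) -> int:
    """Number of trailing bases shared by a and b."""
    return common_prefix_len(a[::-1], b[::-1])


def microhomology_length(left_flank: str, deleted: str, right_flank: str) -> int:
    """Direct microhomology around a deletion in the undeleted allele.

    Bases shared between the end of the left flank and the end of the
    deleted segment, plus bases shared between the start of the deleted
    segment and the start of the right flank, are positions over which
    the breakpoint could slide without changing the deleted allele.
    """
    total = (common_suffix_len(left_flank, deleted)
             + common_prefix_len(deleted, right_flank))
    return min(total, len(deleted))


def microhomology_class(n: int) -> str:
    """Size classes as listed under Class 3 of Table 3."""
    if n == 0:
        return "none"
    if n <= 2:
        return "1-2 bp"
    if n <= 25:
        return "3-25 bp"
    if n <= 200:
        return "26-200 bp"
    return ">200 bp"


if __name__ == "__main__":
    # Fragments assumed to correspond to Example D (start of SV, distal flank).
    shared = common_prefix_len("TTAGTGTTTTGTCA", "TTAATGAATTATTA")
    print(shared, microhomology_class(shared))
```

With full-length sequences, `microhomology_length` would be called with both flanks so that homology on either side of the deleted segment is counted.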
About half of the SVs without LINE, ERV or SINE elements (69% of ancestral deletions, 37.5% of inversions and 50% of both CNGs and multiple events) have microhomology of 3 bp to 25 bp, suggesting a microhomology-mediated mutational process. We found several patterns of microhomology: direct, palindromic, inverted and a complex combination of these. Of the 3-25 bp microhomologies, 70% are direct, 13.3% inverted, 10% complex and 6.6% palindromic. Longer sequence identity (>26 bp) is rarer than shorter sequence identity (<3 bp). Half of the inversion breakpoints are blunt ended, followed by ancestral deletions (15.9%) and CNGs (12.5%). Of the 113 ancestral deletions, 36 (32%) had from 1 bp to 107 bp of inserted sequence at the breakpoint, in addition to the deletion.

Common breakpoint features

We assessed all 1,926 breakpoints for common sequence features. LINE retrotransposition is associated with a TTTCT motif in the TSD (P value < 1.3E-8).

Genome-wide breakpoint identification

We next estimated the accuracy of reconstructing breakpoint sequence for all SVs directly from next-generation sequencing reads, without PCR-based Sanger sequencing data. For each deletion and insertion breakpoint we implemented local assembly [PLEASE MORE FROM KIM]. Comparison with the 1,926 breakpoints delineated using PCR-based Sanger sequencing revealed that predicted breakpoints were on average within +/- 50 bp (?) of the actual breakpoint [CHECK WITH KIM]. This is insufficient to robustly identify microhomology-mediated processes such as MMEJ or SSA, for which precise sequence is required, but is sufficient to estimate NAHR (for which we search for 200 bp of sequence identity) and VNTR expansion. Genome-wide, 0.5% of SVs have sequence features consistent with NAHR and 6% with VNTR.

Relationship between SNP and SV formation

Our analysis of breakpoint sequence features in multiple strains allowed us to look for a relationship between sequence variants (SNPs or short indels) and SV formation.
In particular, we asked whether sequence variants at breakpoints were associated with SV formation. In our set of ancestral deletions for which we have base pair resolution data, whenever SNPs were present in the microhomology region they were correlated with the presence of the SV (Figure 3a). In all such cases, the SNP lengthens the microhomology. Since we do not find instances where a SNP in the microhomology region occurs without the deletion, we assume that the formation of the deletion and of the SNP are related. This phenomenon is rare: we saw only five cases (4.5%) amongst our 113 ancestral deletions where SNP and SV formation co-segregate. We found a similar relationship between a SNP formed in the TSD and the presence of an ancestral insertion (Figure 3b). 15 ancestral insertions (16%) had SNPs or short indels within their TSD, coincident with an insertion. Details are given in Supplementary Table 1. These SNPs are ideal candidates to tag SVs for genotyping purposes, but their close proximity to SV breakpoints may make genotyping difficult (it should be noted that none of these SNPs were identified by short-read sequencing).

Origin of SV breakpoints

We asked whether SVs that overlap in different strains could have arisen more than once. We inferred independent origins when the position of the breakpoint is different, so that, for example, one strain may have a 3 kb deletion while in another only 1 kb is missing. Within the eight classical strains, size differences between SVs at the same locus were found at six SV regions out of 241 (2.5%). We found no case with more than three alleles at one SV locus. However, when we expanded our analysis to look at all 17 strains, we found multiple alleles at 12% of SVs, due almost entirely to the presence of different alleles in the wild-derived inbred strains.
In two cases we observed four alleles at an SV locus, and in one case five alleles: on chromosome 10, AKR/J, CAST/Ei, PWK/Ph and Spretus/Ei all have SVs with different breakpoints (Supplementary Table 1).

Inversions

Inversions are more complicated than deletions, insertions and CNGs, and little is known about their mechanism of formation. They require at least two double-strand chromosomal breaks (DSBs), as opposed for example to deletions, which only require one DSB. Here we characterized the breakpoints of 8 inversions at nucleotide resolution. Five (62.5%) have deletions immediately adjacent to the inversion. An example is provided in Figure x.

Impact of SVs on gene function

We addressed the impact of SVs on gene function in two ways. First, we examined the impact on gene expression, and second we looked at the association between SVs and phenotypic variation.

GENE EXPRESSION
Genes affected by deletions: how many genes are affected? Evidence from mRNA: deficit of exons involved; one fusion gene. Get list of genes and confirm by qPCR. Impact of inversions: no evidence of fusion genes? Do SVs affect gene expression outside the SV regions (Reymond experiment)?

PHENOTYPES
Second, we asked whether SVs are associated with any of the 98 [I used 122??] phenotypes assessed in genetically heterogeneous stock (HS) mice descended from eight inbred strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). Genetic mapping has identified 843 quantitative trait loci (QTL) (VALDAR et al. 2006a; VALDAR et al. 2006b). The phenotypes were chosen to target three diseases: anxiety, type II diabetes and asthma. We cannot directly genotype all of the SVs in the HS mice. Instead, for an SV genotyped in the progenitor strains, we infer the allele probabilities in each HS animal by estimating the haplotype probabilities at the SV locus (VALDAR et al. 2006a), and then merging these probabilities within each of the possible genotypes.
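The merging step can be illustrated with a small sketch. It assumes that, for one HS animal at one locus, a probability is available for each ordered or unordered pair of founder haplotypes, and that the strain distribution pattern (SDP) assigns each founder an allele of 0 or 1. The function names and the dictionary layout are hypothetical; the actual haplotype reconstruction is described in VALDAR et al. 2006a and is not reproduced here.

```python
from typing import Dict, List, Tuple


def sv_genotype_probabilities(
        diplotype_probs: Dict[Tuple[str, str], float],
        sdp: Dict[str, int]) -> List[float]:
    """Collapse founder-pair probabilities into P(0), P(1), P(2) copies of the SV.

    diplotype_probs maps a pair of founder strains to the probability that
    the animal carries that pair at the locus; sdp maps each founder to
    0 (SV absent) or 1 (SV present).
    """
    probs = [0.0, 0.0, 0.0]
    for (founder_a, founder_b), p in diplotype_probs.items():
        probs[sdp[founder_a] + sdp[founder_b]] += p
    return probs


def expected_dosage(genotype_probs: List[float]) -> float:
    """Expected number of SV alleles, usable as a single-marker predictor."""
    return genotype_probs[1] + 2.0 * genotype_probs[2]


if __name__ == "__main__":
    # Toy values for illustration only, not data from this study.
    sdp = {"A/J": 1, "LP/J": 0}
    diplotypes = {("A/J", "A/J"): 0.5, ("A/J", "LP/J"): 0.3, ("LP/J", "LP/J"): 0.2}
    gp = sv_genotype_probabilities(diplotypes, sdp)
    print(gp, expected_dosage(gp))
```

Because founders sharing the same SV allele are pooled, the result depends only on the SDP, which is why SVs with uncertain or incomplete SDPs cannot be tested this way.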
We then use the allele probabilities to conduct a single marker analysis between each variant and each phenotype [REF]; this gives us a logP value for each SV/phenotype test, indicating how consistent the variant is with the phenotype. In this manner, we tested ~166,000 SVs for which we were certain that the SDP was correct (including deletions, copy number gains, insertions and inversions) (refer to Kim’s method for producing SDPs; in addition we only included SVs where there was no missing data in any of the HS founder strain columns), XXX SNPs and YYY InDels (Jonathan?), for association with each phenotype.

We extracted all SVs that lie under a QTL peak and have a logP that exceeds that of the QTL peak; we reasoned that these are the SVs that are most likely either to be, or to tag, the causal variant under the QTL. In total, there are 737 potentially causal SVs, lying under 395 QTL, for 80 of the phenotypes (Figure 1).

We overlaid these causal SVs with protein coding genes. 291 causal SVs overlap (by at least 1 bp) 180 genes (Supplemental Table causal_SV_overlap_genes.xlsx). However, there was no evidence to suggest that causal SVs are enriched for genes compared to non-causal SVs (Figure 2A). Similarly, we overlaid the SVs with the exons and promoter regions of protein coding genes (promoter regions were defined as 2 kb upstream and downstream of the transcription start site). 11 causal SVs overlap 12 exons (Supplemental Table causal_SV_overlap_exon.xlsx), and 47 causal SVs overlap 48 promoter regions (Supplemental Table causal_SV_overlap_gene_promoters.xlsx). There was no evidence to suggest that causal SVs are enriched for either exons or promoter regions, compared to non-causal SVs (Figures 2B and C, respectively).

Discussion

We find XX more than anyone else
Did we get this right?
Inferred mechanisms are consistent with other papers (Quinlan) and different from human
We are the first to relate sequence variants to SVs
We find little effect on gene function: due to homozygosity and selection for inbreeding?

Recent studies of the mutation spectrum in human and mouse SV found a similar figure (33%) (CONRAD et al. ; QUINLAN et al.), suggesting that similar mutational processes occur in both human and mouse SV formation. 12.5% of CNGs have additional sequence (1-10 bp) at the breakpoint, compared with 37.5% of inversions and 50% of multiple events. We next looked at the correlation between Class 3 and Class 4 SVs and found that microinsertion at the breakpoint is enriched in blunt ended SVs (ratio 2.5).

References

AGAM, A., B. YALCIN, A. BHOMRA, M. CUBIN, C. WEBBER et al., Elusive copy number variation in the mouse genome. PLoS One 5: e12839.
CAHAN, P., L. E. GODFREY, P. S. EIS, T. A. RICHMOND, R. R. SELZER et al., 2008 wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data. Nucleic Acids Res 36: e41.
CAHAN, P., Y. LI, M. IZUMI and T. A. GRAUBERT, 2009 The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nat Genet 41: 430-437.
CONRAD, D. F., C. BIRD, B. BLACKBURNE, S. LINDSAY, L. MAMANOVA et al., Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat Genet 42: 385-391.
HENRICHSEN, C. N., N. VINCKENBOSCH, S. ZOLLNER, E. CHAIGNAT, S. PRADERVAND et al., 2009 Segmental copy number variation shapes tissue transcriptomes. Nat Genet 41: 424-429.
INOUE, K., and J. R. LUPSKI, 2002 Molecular mechanisms for genomic disorders. Annu Rev Genomics Hum Genet 3: 199-242.
MEDVEDEV, P., M. STANCIU and M. BRUDNO, 2009 Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6: S13-20.
QUINLAN, A. R., R. A. CLARK, S. SOKOLOVA, M. L. LEIBOWITZ, Y.
ZHANG et al., Genomewide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 20: 623-635.
VALDAR, W., L. C. SOLBERG, D. GAUGUIER, S. BURNETT, P. KLENERMAN et al., 2006a Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genetics 38: 879-887.
VALDAR, W., L. C. SOLBERG, D. GAUGUIER, W. O. COOKSON, J. N. RAWLINS et al., 2006b Genetic and environmental effects on complex traits in mice. Genetics 174: 959-984.

Tables

Table 1. Structural variants in 17 inbred strains.

Strain          Deletion  Insertion  Inversion  Gain  Other
129P2/OlaHsd      16,402     86,805         46    57     29
129S1/SvImJ       17,385     42,156         46    70     30
129S5/SvEvBrd     16,154     39,240         53    72     26
A/J               15,885     73,909         46    88     27
AKR/J             16,258     42,327         49    69     21
BALB/cJ           14,898     45,038         45    82     21
C3H/HeJ           16,148     68,161         52    94     33
C57BL/6N             167      2,697          3    44      0
CAST/EiJ          51,304    107,912        128   361    108
CBA/J             17,066     54,044         46    79     33
DBA/2J            17,531     36,753         54    67     31
LP/J              17,030     47,770         47    64     30
NOD/ShiLtJ        17,078     22,651         55    51     27
NZO               15,479     30,535         47    62     29
PWK/PhJ           54,312    103,968        158    96    108
Spretus/EiJ       91,729    172,997        282   112    230
WSB/EiJ           22,231     57,042         53    88     51

Listed are the total numbers of structural variants with a minimum size of 100 bp in the 17 inbred strains. Here we differentiate between insertions, where we can determine the insertion points from read pair patterns and local assembly, and copy number gains, where a duplication is inferred from an increase in read depth. Copy number gains include tandem duplications, which are inferred from both read depth and read pair evidence. There is minimal overlap between the insertions and the copy number gains, since the insertion discovery algorithm considers only read pairs in which one mate is unmapped (i.e. de novo insertions). Included in 'Other' are those SVs which appear to be composed of more than one SV. These include deletions with insertions, and inversions with deletions.
Table 2. Inferring ancestral state using sequences flanking SV breakpoints.

Example  Chromosome  Bp start position  Bp end position  B1 flanking sequence  SV sequence       B2 flanking sequence  PEM signature  Ancestral state
A        2           46111101           46114057         ..AAGGA               GGATGTATGTATGT..  GGAAACGTAACTC..       DEL            DEL
B        1           7900914            7901149          ..ACCCC               CTCCCCTGTTGCGG..  CTCCCCTACTGTA..       DEL            INS
C        14          70138175           70138776         ..CATTG               CTCTCTGCTTCTTT..  TTCTCTGCTTCTTG..      DEL            INS*
D        15          91970970           91974384         ..TTTGG               TTAGTGTTTTGTCA..  TTAATGAATTATTA..      DEL            DEL*

Strain distribution pattern columns (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/1J, LP/J), values in source order: LP/J 1 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 1 1 1; DBA/1J 1 1 1 0; CBA/J 0 1 0 0; C57BL/6J 1 1 1 1.

Examples A, B, C and D are taken from our list of 249 SVs resolved to base pair resolution (Supplementary Table 1). The first three columns give the chromosome, start position and end position of the SV in bp. Columns 4, 5 and 6 give a small stretch of the sequences flanking the SV breakpoints, as well as the first 10 bp of the SV. Note that the full sequence of each SV is given in Supplementary Table 1. The columns entitled A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J give the strain distribution pattern (SDP) of the SV, with “0” indicating the absence and “1” the presence of the SV. The penultimate column is the PEM (paired-end mapping) signature relative to the reference genome. The last column gives the inferred ancestral state relative to either SPRET/EiJ or Rattus norvegicus, the latter indicated by an asterisk. Microhomology at breakpoints is highlighted in red and target site duplication (TSD) in green.

Table 3. Sequence features at SV breakpoints and inferred mechanism.

a  Sequence features at breakpoints

                                  Ancestral insertion               Ancestral events
                                  LINE    ERV     SINE    VNTR      Deletion  Inversion  CNG     Multiple
Class 1. Target site duplication
  none                            6.7%    6.7%    0.0%
  4-10 bp                         13.3%   93.3%   15.4%
  11-20 bp                        78.3%   0.0%    84.6%
  >20 bp                          1.7%    0.0%    0.0%
Class 2. Microdeletion
  none                            95.0%   93.3%   92.3%   92.3%
  1-34 bp                         5.0%    6.7%    7.7%    7.7%
  >200 bp                         1.7%    0.0%    0.0%    0.0%
Class 3. Microhomology
  none                                                    0.0%      15.9%     50.0%      12.5%   0.0%
  1-2 bp                                                  0.0%      15.0%     12.5%      37.5%   25.0%
  3-25 bp                                                 23.1%     69.0%     37.5%      50.0%   50.0%
  26-200 bp                                               76.9%     0.9%      0.0%       0.0%    0.0%
  >200 bp                                                 0.0%      0.9%      0.0%       0.0%    0.0%
Class 4. Microinsertion
  none                                                              69.9%     62.5%      87.5%   25.0%
  1-10 bp                                                           23.0%     37.5%      12.5%   50.0%
  11-50 bp                                                          8.0%      0.0%       0.0%    0.0%
  >51 bp                                                            0.9%      0.0%       0.0%    0.0%
Total (249 SV regions)            60      30      13      13        113       8          8       4

b  Inferred mechanisms

Retrotransposition (total)        41.7%
  LINE retrotransposition         24.5%
  ERV retrotransposition          12.0%
  SINE retrotransposition         5.2%
MMEJ                              31.3%
NMMEJ                             13.3%
SSA                               0.4%
NAHR                              0.4%
VNTR                              5.2%
SRS, FoSTeS/others, multiple      0.8%, 3.2%, 3.2%, 1.2% (values in source order)

In a, the percentage of each sequence feature at the precise breakpoint is given per category of ancestral SV (insertion, deletion, inversion, CNG and multiple events). In b, the percentage of each inferred mechanism is given relative to all SV regions presented in a. Empty cells indicate that the feature is not applicable, and all abbreviations are listed in the Supplementary Glossary.

Figure Legends

Figure 1. Venn diagrams showing the overlap between SVs detected in our study (in DBA/2J and CAST/Ei) and those published elsewhere. A: Venn diagram showing the overlap between DBA/2J deletions (relative to C57BL/6J) found in our study (blue circle), those found in another sequencing based analysis (Quinlan et al., green circle) and those found in a high density aCGH experiment (2.1 million probes, Agam et al., red circle). B: Similarly for copy number gains. [Need to add sentence explaining that the figures show overlap for merged SV-regions rather than pure SV calls.]

Figure 2. Breakpoint analysis of a complex SV.
a) A complex SV, involving several genomic rearrangements including an inversion, a deletion, a short insertion and a copy number gain (CNG), is displayed relative to its genic location along Zbtb10 (zinc finger and BTB domain containing 10). PCR amplification using forward (F) and reverse (R) primers revealed an AT insertion at the first breakpoint (J1), followed by an inversion of 125 bp which encompasses an inverted copy number gain of the 22 bp preceding J1, as seen at J2. Finally, breakpoint 3 (J3) revealed a deletion of 813 bp. Using RepeatMasker, a SINE element was found to be part of the deletion. b) Gel image of the PCR amplification using the F and R primers (primer sequences available in Supplementary Table xxx). Hyperladder II was used as the size marker. C57BL/6N and LP/J show the normal size of 1604 bp, whereas A/J, AKR/J, BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp. c) Sequencing data across the J1, J2 and J3 breakpoints. A colour code is used to indicate each type of SV: blue for the 22 bp inverted copy number gain, green for the inversion and red for the deletion. When the test strain matches the reference strain, both are in the same colour.

Figure 3. Relationship between SNP and SV formation. a) Relationship between SNP and ancestral deletion formation. Two SNPs lying in the 6 bp microhomology of a 64 bp ancestral deletion (chr12:27,040,459-27,040,522) correlated with the presence of the SV. On the left, PCR amplification of the SV is shown across the eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). Hyperladder II was used as the DNA molecular weight marker. Some strains show a smaller amplicon than others. On the right, sequencing traces are shown for a test strain (A/J) and the reference strain (C57BL/6N). Note that the traces of all other test strains are identical to the one shown here. Asterisks mark the 6 bp microhomology (GAACTA).
The presence of two SNPs (C->G and T->A) in all test strains (here only shown in A/J) is associated with the presence of the ancestral deletion. b) Relationship between SNP and ancestral insertion formation. PCR data are shown on the right, with amplification in A/J, AKR/J and BALB/cJ. Strains with the ancestral insertion (C57BL/6N, CBA/J, DBA/2J and LP/J) failed to amplify because of the size of the product. The insertion is a LINE on chromosome 13 (119,134,049-119,135,126). On the left, the sequencing trace is shown over the TSD for a strain that does not have the ancestral insertion. The TSD is 17 bp (AAGAATGTCAGCAAAGT) and at the 12th position a SNP (G->C) is observed in all the strains that have the insertion.

Figure 1. [Venn diagram image]

Figure 2. [Image. Panel a: schematic of Zbtb10 (zinc finger and BTB domain containing 10 gene; 34.73 kb, forward strand) showing primers F and R, the AT insertion at B1, the 125 bp inversion with the 22 bp CNG at B2, the 813 bp deletion at B3 and the B2_Mm2 SINE repeat. Panel b: gel with lanes labelled Ladder, AJ, AKR, BALB, C3H, C57, CBA, DBA, LP. Panel c: sequences as follows.]

B1  ProxRef  TATCAGCCTTTGTCTTCAGGCTCAGC--TCTATCAGTTTATT
    DistRef  CTGTTCATATCCCAGGTGTTGGGATTACAGGCATGTGTCAC
    Test     TATCAGCCTTTGTCTTCAGGCTCAGCATGTGACACATGCCT
B2  ProxRef  TCCCAGGTGTTGGGATTACAGGCATGTGTCACTAGGA
    DistRef  TCTATCAGTTTATTGTGGTCCTTCTGTATATAGCTCAGAATG
    Test     AATAAACTGATAGAGCTGAGCCTGAAGACAAAGGCT------
B3  ProxRef  CATAGTGAGACTTCCCATCCAGAAAGGGAAGGTAAAACCCA
    DistRef  TAGGAAATGGAAGTGCTGCTTGTTTATAAATCTGATGGACG
    Test     ------------------------GAAAGGGAAGGTAAAACCCA

Figure 3. [Image. Panels a and b: gels with lanes labelled Ladder, AJ, AKR, BALB, C3H, C57, CBA, DBA, LP, and sequencing traces for AJ and C57 with asterisks marking the B1 and B2 breakpoint regions.]