Sequence based characterization of structural variation in the mouse genome

Introduction

Structural variation (SV) in the mammalian genome is known to be abundant and to contribute to phenotypic variation and disease. There has been considerable progress in assessing its extent and complexity (), phenotypic impact () and the responsible molecular mechanisms () in the human genome, but much less is known about SV in the mouse, currently the preeminent organism for modeling how genetic lesions give rise to disease in mammals. In this paper we use next generation sequencing to address three critical questions: what is the extent and complexity of SV in the mouse genome, what are the likely mechanisms for its formation, and what are its phenotypic consequences?

Current catalogues of mouse SVs are based on differential hybridization of genomic DNA to oligonucleotide arrays (array comparative genome hybridization (aCGH)). While aCGH can interrogate entire genomes, it is blind to some SV categories (such as inversions and insertions), and has a limited ability to detect others (segmental duplications and transposable elements). Estimates of the proportion of the mouse genome affected by SVs range from 3% (CAHAN et al. 2009) to over 10% (HENRICHSEN et al. 2009), with three- to four-fold more deletions than duplications detected in the most recent genome-wide aCGH experiments (Cahan, Agam).

Assessing the potential mechanism of SV formation requires much higher resolution than aCGH affords, ideally down to the base pair. Sequence-based methods, such as short-read paired-end mapping (PEM), have the requisite level of resolution and have been used to identify 7,196 SVs and 3,316 breakpoint sequences. These data, from comparison of two laboratory strains (C57BL/6J and DBA/2J), indicate that most variation is due to retrotransposition () and that mechanisms of SV formation require little or no homology, so that non-allelic homologous recombination is rare.
A small number of SVs are associated with known phenotypic abnormalities (Table). Genome-wide information about the impact of SVs on phenotypes is limited to analyses of their impact on transcript abundance. These studies have demonstrated not only that SVs alter the expression of genes that they overlap, but also that SVs influence the expression of genes lying up to 500 Kb from their margins.

Here we report the identification, using short-read sequencing, of 1.4M SVs in 17 inbred strains of mice. By analyzing breakpoint sequence we infer the mechanisms of formation and assess their relative impact on shaping a mammalian genome. Our molecular characterization of SVs in the mouse genome is a starting point to determine the extent to which SVs contribute to genetic and phenotypic diversity.

Results

SV identification

We identified a total of xx M SVs in the 17 strains, found more SVs than other studies of the same genomes (for example four times as many deletions in DBA/2J (QUINLAN et al.)), and discovered a greater variety of molecular structures than previously reported (Fig. 1). To explain these results, we start by describing how we went about finding SVs.

We combined visual inspection of the data with molecular validation to improve automated SV detection across the genome. We used two criteria to identify SVs manually: read depth and anomalous paired-end mapping (PEM). We did this using data from the mouse's smallest chromosome (19) in its entirety and a random set of other chromosomal regions, for eight strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). We expected to find eleven patterns, based on read depth and PEM, to classify SVs (H1-H11; Fig. S1). For example, a deletion is indicated by a reduction in read depth together with read pairs in which one read aligns to one side of the deletion and its mate to the other.
Because the two ends are sequenced on opposite strands, the two reads will point towards each other (H1_del). By contrast, paired-end reads pointing in the same direction with an unchanged read depth (except at the breakpoints) indicate an inversion (H4_inv). However, we were surprised to discover an additional ten patterns whose interpretation was ambiguous (Q1-Q10; Fig. S1). For instance, we found examples of a reduction in read depth without paired-end reads flanking the putative deletion (Q2_del), and putative inversions where reads mapped to only one of the breakpoints (Q8_inv; Fig. 1D).

We investigated the molecular structure of all 21 patterns using a PCR strategy (Fig. S2). We designed 484 pairs of primers and amplified 447 unique SV regions across eight classical inbred strains (Table S1). PCR products and sequencing demonstrated that twelve patterns were indicative of a simple SV, seven of a complex SV and two of a false SV (Table S2).

Based on manual inspection and classification of PEM patterns, we identified a benchmark set of SVs on chromosome 19. We identified 684 deletions with the expected read architecture (Table S3), amongst which 317 were classified as “H1_del” (high confidence deletions of non-repetitive sequences), 353 as “H2_del” (high confidence deletions of repeat elements), and 14 as “H3_del” (linked deletions). Chromosome 19 contained only two inversions and three gains, all of which fitted the expected patterns of read architecture. We refer to this set as the high-confidence SV set. We tested 15 of these high-confidence SVs by PCR in the eight strains and all validated. We identified 248 deletions (Q1-Q7) and 13 inversions (Q8-Q9) with ambiguous read architecture (a detailed breakdown of each category is provided in Table S3). We tested 62 of these, distributed evenly across the categories, and found, as expected, that 5 were false, 15 were insertions instead, 16 were simple and 26 complex.
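The two unambiguous read-pair signatures described above (H1_del and H4_inv) can be illustrated with a minimal per-pair rule. This is only a sketch under stated assumptions: the `ReadPair` fields, the `classify_pair` function and the `max_insert` threshold are hypothetical names introduced here, not part of the pipeline used in this study, and a real call would combine many supporting pairs with read depth.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair (hypothetical structure, for illustration only)."""
    left_pos: int        # reference position of the upstream read
    right_pos: int       # reference position of the downstream read
    left_forward: bool   # True if the upstream read maps to the forward strand
    right_forward: bool  # True if the downstream read maps to the forward strand


def classify_pair(pair: ReadPair, max_insert: int) -> str:
    """Label a single pair by orientation and mapped distance.

    max_insert is an assumed upper bound on the library insert size.
    """
    span = pair.right_pos - pair.left_pos
    if pair.left_forward == pair.right_forward:
        # Both reads point in the same direction: inversion-like (H4_inv).
        return "inversion_candidate"
    if pair.left_forward and not pair.right_forward:
        # Reads point towards each other, as expected for a normal pair.
        if span > max_insert:
            # Mates map further apart than the library allows: deletion-like (H1_del).
            return "deletion_candidate"
        return "concordant"
    # Reads point away from each other: not covered by the two patterns above.
    return "other"


if __name__ == "__main__":
    print(classify_pair(ReadPair(1000, 6000, True, False), max_insert=500))
    print(classify_pair(ReadPair(1000, 1300, True, True), max_insert=500))
```

A per-pair label of this kind is only evidence; as the ambiguous Q patterns show, the same signature can arise from more than one underlying structure.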
To search the entire genome of all 17 strains, we used a combination of four computational methods: split-read mapping (YE et al. 2009), mate-pair analysis (CHEN et al. 2009), single-end cluster analysis (SECluster and RetroSeq, unpublished), and read depth (SIMPSON et al. 2010) (Supplementary Methods and (WONG et al. 2010)). These methods identify deletions, insertions and inversions based on PEM patterns, and copy number changes from read depth. However, they are unable to differentiate basic PEM patterns (e.g. inversion) from complex ones (e.g. inversion plus a deletion), and some SVs are incorrectly classified. For example, the PEM patterns of linked insertions (Q5_del and Q9_inv; Fig. S1) are similar to those for inversions or deletions. Therefore, to further classify the original SV calls, we derived methods to recognize most of the non-basic, but meaningful, PEM patterns shown in Fig. 1 and Fig. S1 (Supplementary Methods).

The results of the detection and classification of XXM SVs are shown in Table 1. SVs smaller than 100 bp are excluded because, below this size, it is difficult to determine whether a deviation in the distance between two paired-end reads is due to variation in the library insert size distribution or to the paired ends flanking an SV. There are on average XX SVs in classical inbred strains, and 165,816 in wild-derived inbred strains, affecting 1.2% (33.0 Mb) and 3.7% (98.6 Mb) of the genome, respectively. SVs with complex PEM patterns account for X% to X% of all SVs identified in each strain, except in C57BL/6NJ, whose genome is almost devoid of complex PEM patterns. The majority of SVs in all strains are deletions and insertions with simple PEM patterns; however, we have been able to identify a small subset that occur as complex SVs. Inversions are rare, and occur concurrently with a deletion or an insertion about 50% of the time.
Copy number gains are also rare (~100 per genome of each type), and cover from 1.8 Mb (NOD/ShiLtJ) to 16 Mb (CAST/EiJ) of the genome. We also observe a small number of inversions and deletions that occur in regions of copy number gain.

Sensitivity and specificity analyses

Automated analysis of chromosome 19 detected between 83.2% and 88% of validated deletions (at least 50 bp in size), depending on the strain. The false positive rate ranges from 5% to 7.2% (Table S4a). Although the false negative rate per strain ranges from 12% to 16.8% when considering all types of simple deletions, it should be noted that the automated analysis accurately identified 91% of sites containing a high confidence deletion (Table S4b).

To ensure that our sensitivity and specificity analyses were not biased by our use of SVs from chromosome 19 as a training set, we derived a second, smaller, set of manually curated deletions from a randomly chosen 10 Mb region (101 Mb to 111 Mb) of chromosome 3 in strain C3H/HeJ. Automated analysis of this region identified 43 of the curated deletions (82.7%) and called 2 false deletions (4.4%). We also investigated the false negative rate for the automated detection of deletions across the genome using our PCR validation data of 267 simple deletions. Consistent with the chromosome 19 and chromosome 3 analyses, we found that the false negative rate for simple deletions was between 18% and 19.9% (Table S5a).

We could not assess the performance of automated analysis in detecting SV types other than deletions because so few were found by manual inspection of chromosome 19. However, we estimated false negative rates by using PCR-validated insertions, inversions and gains. The average rate was higher than for deletions, ranging from 17% to 55% (Table S5b). Automated analysis was less successful in detecting the more complex rearrangements: of the 58 PCR-validated complex SVs, XX were found.
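The percentages quoted for the chromosome 3 region can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name `detection_rates` is hypothetical, and the size of the curated set (52 deletions) is not stated in the text; it is the value implied by 43 detections being 82.7%, and is used here purely as an assumption. The 4.4% figure is reproduced as false calls divided by all calls (2 of 45).

```python
def detection_rates(true_found: int, true_total: int, false_calls: int) -> dict:
    """Summarise automated detection against a manually curated truth set.

    true_found  : curated deletions recovered by the automated analysis
    true_total  : size of the curated set (assumed value, see lead-in)
    false_calls : automated calls that are not in the curated set
    """
    sensitivity = true_found / true_total
    return {
        "sensitivity_pct": round(100 * sensitivity, 1),
        "false_negative_pct": round(100 * (1 - sensitivity), 1),
        # Fraction of all automated calls that are false.
        "false_call_pct": round(100 * false_calls / (true_found + false_calls), 1),
    }


if __name__ == "__main__":
    # 43 recovered, 2 false calls (from the text); 52 curated deletions is an
    # assumption inferred from the reported 82.7%.
    print(detection_rates(true_found=43, true_total=52, false_calls=2))
    # {'sensitivity_pct': 82.7, 'false_negative_pct': 17.3, 'false_call_pct': 4.4}
```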
Genome-wide breakpoint localization

We next determined the accuracy of reconstructing deletion and insertion breakpoints from next-generation sequencing reads by local assembly and breakpoint refinement, as described in (WONG et al. 2010). Comparison of 848 breakpoints (from 424 deletions that were detected computationally) with the actual breakpoints delineated by PCR and sequencing (Supplementary Methods) revealed that predicted deletion breakpoints were on average within +/- 18 bp of the actual breakpoint, with a median of 0; 56% of breakpoints are exact and 77% are within 10 bp (Table S6a). ADD INSERTIONS (Table S6b). Breakpoint accuracy for SV types other than deletions and insertions is presented in Table S6c.

The existence of the target site duplication (TSD) in LINE sequence afforded a convenient way to determine how accurate our breakpoint estimates were. For each LINE breakpoint we found the longest stretch of microhomology in the region 50 bp around the breakpoint and assumed this was the TSD. Since the ends of the TSDs correspond to true breakpoint junctions, base pair level breakpoint resolution implies that the predicted breakpoint junction should fall close to the end of each TSD. We found that the vast majority of TSDs fall exactly on the base pair predicted as being the breakpoint junction (Fig. S3).

Outgroup analysis

We predicted the ancestral state of each SV across the mouse genome using rat as an outgroup: we assume a deletion is ancestral if it is found in the rat but is absent in one or more of the classical strains (Example A in Table 2). Conversely, if there is a deletion in classical strains, but not in the reference genome, then we assume the SV is an insertion (Example B in Table 2).
However, we found that in 26 cases out of 249 (~10% of the total), the inferred ancestral state is inconsistent with breakpoint features. In example C (Table 2), using SPRET/EiJ as an outgroup suggested the presence of an ancestral deletion, whereas a target site duplication (TSD) at the breakpoint suggested an ancestral insertion. Similarly, in example D (Table 2), using SPRET/EiJ suggested the presence of an ancestral insertion, whereas 3 bp of microhomology at the breakpoint (TTA) suggested an ancestral deletion. Using rat sequence to determine the ancestral state validated both the ancestral insertion on chromosome 14 and the ancestral deletion on chromosome 15. With the exception of 2 cases (<1%), all of the other 24 inconsistencies using SPRET/EiJ as an outgroup could be reconciled by combining breakpoint sequence and rat sequence. The inferred ancestral state of all 249 SV regions is shown in Supplementary Table 5.

We recorded for each relative deletion class the length of the longest segment of microhomology within the 100 bp region centred on each predicted SV breakpoint and compared this to random expectation (Figure 2). As expected, LINE elements were associated with >15 bp microhomology. SINE elements and pseudogenes depend on the LINE integration machinery and exhibited a similar microhomology profile to LINEs. LTR elements had much shorter segments of microhomology, again corresponding to the known mechanism of LTR formation. Breakpoints surrounding VNTRs had longer sequences of microhomology, presumably reflecting degenerate tandem repeat sequence surrounding each breakpoint.

SV mechanism

The mechanism of SV formation is typically inferred by examining the sequence features of its breakpoints. For example, 200 bp of sequence identity is thought to be required for non-allelic homologous recombination (NAHR) (INOUE and LUPSKI 2002), whereas much shorter homology (microhomology) has often been associated with end-joining processes, such as microhomology-mediated end joining (MMEJ).
Delineation of retroelements is also facilitated by the presence of a flanking target site duplication (TSD), with a poly(A) tail or poly(T) head for LINE and SINE elements, and with dual or mono long terminal repeats (LTR) for ERV elements. Variable number of tandem repeat (VNTR) polymorphism is also easily identifiable from its repetitive structure. However, without knowing whether an SV is more likely to be an ancestral deletion or insertion, it is difficult to infer mechanism appropriately. Therefore in our analyses we used the classification based on the outgroup analysis described above in combination with the sequence features at the breakpoint.

We classified the 249 SVs by inferred mechanism of formation (a flowchart for our method is presented in Supplementary Figure 2). Table 3b describes the inferred mechanism for the 249 SV regions. Retrotransposition is the commonest mechanism (41.7%, comprising 24.5% LINE, 12% ERV and 5.2% SINE retrotranspositions), followed by microhomology-mediated end joining processes (31.3%), non-microhomology-mediated end joining (13.3%), replication-based mechanisms such as fork stalling and template switching (FoSTeS) (<10%), VNTR expansion (5.2%), single-strand annealing (SSA) (0.4%) and NAHR (0.4%).

Most SVs caused by LINE, ERV, SINE and VNTR insertions do not show any missing nucleotides at their breakpoints (95%, 93.3%, 92.3% and 92.3% respectively). However, we found rare cases (4 LINEs, 2 ERVs, 1 SINE and 1 VNTR) in which the insertion machinery also deletes nucleotides. Missing sequence ranged from 1 bp to 289 bp. We found that the presence of an ancestral microdeletion is directly linked to the absence of the TSD for three LINEs. This would suggest a dual mechanism of SV formation, combining double-strand break (DSB) repair processes and LINE retrotransposition.
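Microhomology at a deletion junction, one of the breakpoint features used above to separate end-joining mechanisms from blunt-ended events, can be measured directly from the reference sequence. The sketch below is a simplified, junction-anchored measure of direct microhomology only; it is not the window-based search described in this paper (longest microhomology within a region around the breakpoint), and the function name and coordinate convention (0-based, end-exclusive) are assumptions made for illustration.

```python
def deletion_microhomology(ref: str, start: int, end: int) -> int:
    """Length of direct microhomology at the junction of a deletion.

    The deleted segment is ref[start:end] (0-based, end exclusive).
    Microhomology is counted as bases that are identical at both
    breakpoints, so the deletion could be shifted by that many bases
    without changing the resulting sequence.
    """
    deleted_len = end - start

    # Bases at the start of the deleted segment that match the bases
    # immediately after the deletion.
    right = 0
    while (right < deleted_len and end + right < len(ref)
           and ref[start + right] == ref[end + right]):
        right += 1

    # Bases immediately before the deletion that match the bases at the
    # end of the deleted segment.
    left = 0
    while (left < deleted_len and start - left - 1 >= 0
           and ref[start - left - 1] == ref[end - left - 1]):
        left += 1

    # Tandem repeats can make left + right exceed the deletion length.
    return min(left + right, deleted_len)


if __name__ == "__main__":
    # Toy reference: the deleted segment begins with TTA and the sequence
    # just after the deletion also begins with TTA.
    ref = "CCCAC" + "TTAGGGGGG" + "TTACATG"
    print(deletion_microhomology(ref, 5, 14))  # 3
```

A value of 0 corresponds to a blunt-ended junction; inverted and palindromic microhomology would need separate comparisons against the reverse complement.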
Half of the SVs without LINE, ERV or SINE elements (69% of ancestral deletions, 37.5% of inversions and 50% of both copy number gains (CNG) and multiple events) have microhomology of between 3 bp and 25 bp, suggesting a microhomology-mediated mutational process. We found several patterns of microhomology: direct, palindromic, inverted and a complex combination of these. Of the 3-25 bp microhomologies, 70% are direct, 13.3% inverted, 10% complex and 6.6% palindromic. Longer sequence identity (>26 bp) is rarer than smaller sequence identity (<3 bp). Half of the inversion breakpoints are blunt ended, followed by ancestral deletions (15.9%) and CNGs (12.5%). Of the 113 ancestral deletions, 36 (32%) had from 1 bp to 107 bp of inserted sequence at the breakpoint, in addition to the deletion.

GENOME WIDE: Genome-wide, 0.5% of SVs have sequence features consistent with NAHR and 6% with VNTR. Martin stuff??

Relationship between SNP and SV formation

Our analysis of breakpoint sequence features in multiple strains allowed us to look for a relationship between sequence variants (SNPs or short indels) and SV formation. In particular, we asked whether sequence variants at breakpoints were associated with SV formation. In our set of ancestral deletions for which we have base pair resolution data, we observed in all cases that the presence of SNPs in the microhomology region was correlated with the presence of the SV (Figure 3a). In all cases, the presence of the SNP elongates the microhomology. Since we do not find instances where a SNP in the microhomology region occurs without the deletion, we assume that the formation of the deletion and of the SNP are related. This phenomenon is rare: we only saw five (4.5%) cases amongst our 113 ancestral deletions where SNP and SV formation co-segregate. We found a similar relationship between a SNP formed in the TSD and the presence of an ancestral insertion (Figure 3b).
Fifteen ancestral insertions (16%) had SNPs or short indels within their TSD, coincident with an insertion. Details are given in Supplementary Table 1. These SNPs are ideal candidates to tag SVs for genotyping purposes, but their close proximity to SV breakpoints may make genotyping difficult (it should be noted that none of these SNPs were identified by short-read sequencing).

Origin of SV breakpoints

We asked whether SVs that overlap in different strains could have arisen more than once. We inferred independent origins when the position of the breakpoint is different, so that for example one strain may have a 3 kb deletion, while in another only 1 kb is missing. Within the eight classical strains, size differences between SVs at the same locus were found at six SV regions out of 241 (2.5%). We found no case with more than three alleles at one SV locus. However, when we expanded our analysis to look at all 17 strains, we found multiple alleles at 12% of SVs, due almost entirely to the presence of different alleles in the wild-derived inbred strains. In two cases we observed four alleles at an SV locus, and in one case five alleles: on chromosome 10, AKR/J, CAST/Ei, PWK/Ph and Spretus/Ei all have SVs with different breakpoints (Supplementary Table 1).

Inversions

Inversions are more complicated than deletions, insertions and CNGs, and little is known about their mechanism of formation. They require at least two double-strand chromosomal breakages, as opposed for example to deletions, which require only one DSB. Here we characterized the breakpoints of 8 inversions at nucleotide-level resolution. Five of them (62.5%) have deletions immediately adjacent to the inversion. An example is provided in Figure x.
Impact of SVs on gene function

We assessed the impact of SVs on phenotypes in three ways: (i) we examined the relationship between the position of SVs and the position of genes; (ii) we looked for changes in expression of genes overlapping, or nearby, an SV; (iii) we tested by genetic association for a relationship between SVs and 98 phenotypes in an outbred population of mice.

We investigated the enrichment and depletion of SVs in genes by counting the number of SVs that overlapped genes and then comparing this to a null distribution of the expected number of overlaps, obtained by permutation. Consistent with earlier studies () we found that relative deletions are depleted in genes, introns, exons and promoter regions (P < 0.01) and that tandem duplications are more likely to include exons (P < 0.05, 1.7 to 3.3 fold depending on the strain).

We also made a number of novel observations about the relationship between SVs and genes. First, we found a slight, but significant, enrichment of small (<1000 bp) relative deletions in genes in five of the classical inbred strains (129P2, 129S1, 129S5, DBA/2J and LP/J), and in all of the wild-derived strains (F.C. range 1.03-1.07, P <= 0.01). In three of the wild-derived strains we found a larger enrichment (F.C. ~1.2, P < 0.01) of VNTR deletions in genes. We found that deletions are significantly underrepresented in genes (). We found no significant relationship between any other class of SV and gene location.

Tandem duplications are enriched for genes (check, and cf. Eichler) - AVI

Genes affected by deletions - how many genes are affected? Evidence from mRNA - deficit of exons involved - one fusion gene.

Exons of 1,901 genes are partly or completely deleted by SVs for all strains, and 781 genes for laboratory strains (table).

GO analyses to confirm “Genes involved in immunity and defense, sensory perception, cell adhesion and signal transduction seem to be especially prone to deletion (see also refs.
1,3,18)”

We expect that larger SVs are more likely to have a functional impact than smaller ones (simply because their larger size means they are more likely to include a functional element). While this prediction holds for deletions, we were surprised to find an enrichment of small deletions within introns.

Impact of inversions - no evidence of fusion genes?

Gene expression

Results globally - effect at a distance, and analysed for different sizes.

Relationship between SVs and phenotypic variation

To attribute a phenotypic consequence to the SVs we carried out genetic association with phenotypes measured in over 2,000 heterogeneous stock (HS) mice, animals that are descended from eight of the sequenced strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J) (Valdar 2006). The large number of recombinants that have accumulated since the founding of the HS means that QTLs are mapped to an average region of 3 Mb. The HS is unique not only for its high resolution and the number of QTLs that have been mapped (843) (Valdar 2006), but also for the diversity of traits analysed, including disease models (asthma, anxiety and type 2 diabetes), as well as haematological, immunological, biochemical and anatomical assays.

As described in our companion paper, we used imputation to genotype SVs and then applied a test that discriminates between variants that are likely to be functional and those that are not (Yalcin 2005). We were thus able to test ~166,000 SVs for which we were certain that the strain distribution pattern (SDP) was correct (including deletions, copy number gains, insertions and inversions) (refer to Kim's method for producing SDPs; in addition we only included SVs where there were no missing data in any of the HS founder strains).
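The permutation approach used in the gene-overlap analysis (counting SVs that overlap genes and comparing the count with a null distribution) can be sketched as follows. This is a minimal illustration on a single chromosome, assuming SVs are relocated uniformly at random while keeping their lengths; the function names, the placement scheme and the use of one-sided empirical P values are assumptions made here and may differ from the procedure actually used.

```python
import random


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def count_overlapping(svs, genes):
    """Number of SVs that overlap at least one gene."""
    return sum(any(overlaps(s, e, gs, ge) for gs, ge in genes) for s, e in svs)


def permutation_test(svs, genes, chrom_len, n_perm=1000, seed=0):
    """Compare the observed SV-gene overlap count with randomly placed SVs."""
    rng = random.Random(seed)
    observed = count_overlapping(svs, genes)
    null = []
    for _ in range(n_perm):
        shuffled = []
        for s, e in svs:
            length = e - s
            # Place an SV of the same length uniformly along the chromosome.
            new_start = rng.randrange(0, chrom_len - length + 1)
            shuffled.append((new_start, new_start + length))
        null.append(count_overlapping(shuffled, genes))
    mean_null = sum(null) / n_perm
    fold_change = observed / mean_null if mean_null > 0 else float("inf")
    # One-sided empirical P values with the usual +1 correction.
    p_depleted = (1 + sum(n <= observed for n in null)) / (n_perm + 1)
    p_enriched = (1 + sum(n >= observed for n in null)) / (n_perm + 1)
    return observed, fold_change, p_depleted, p_enriched


if __name__ == "__main__":
    genes = [(1_000, 5_000), (20_000, 26_000)]          # toy gene intervals
    svs = [(1_200, 1_500), (30_000, 30_400), (21_000, 21_300)]
    print(permutation_test(svs, genes, chrom_len=100_000, n_perm=500))
```

A fold change below 1 with a small `p_depleted` would correspond to depletion of SVs in genes, and a fold change above 1 with a small `p_enriched` to enrichment.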
We were concerned that the relatively high rates of CNV mutation might invalidate the imputation (the HS animals are at least 60 generations distant from the sequenced strains), so we genotyped 100 HS animals using a high-density array (). A total of 217 deletions could be genotyped on the array (with an additional 50 deletions when we allow for non-segregating SVs in the HS). We compared results by determining whether the imputation correctly predicted the SV.

We identified 331 QTLs where the logP of the SV is among the highest (and therefore the SV is among the variants most likely to be functional). In all these cases the SV was only one among a number of variants with the highest score. Since, as shown in our companion paper, larger effect QTLs are more likely to arise from SVs, we decided to look at QTLs with the largest effect size. Our prior analysis also suggests that larger effect QTLs are likely to involve exonic regions. We identified 16 QTLs where the SV overlapped an exon, and where the QTL effect size is in the top 5% of the distribution. Table X lists these SVs, the genes they affect and the putative phenotype with which they are associated. None of the genes has been previously associated with these phenotypes.

Discussion

We find XX more than anyone else
Did we get this right? V
Inferred mechanisms are consistent with other papers (Quinlan) and different from human
We are the first to relate sequence variants to SVs
We find little effect on gene function - due to homozygosity and selection for inbreeding?

Recent studies of mutation spectrum in human and mouse SV found a similar figure (33%) (CONRAD et al.; QUINLAN et al.), suggesting that similar mutational processes occur in both human and mouse SV formation. Additional sequence (1-10 bp) at the breakpoint is found in 12.5% of CNGs, 37.5% of inversions and 50% of multiple events.
We next looked at the correlation between Class 3 and Class 4 SVs and found that microinsertion at the breakpoint is enriched in blunt ended SVs (ratio 2.5).

Most SV studies have only been able to identify basic structural variation, such as deletion or insertion. Here we were able to discover complex genomic rearrangements between the genome of the reference mouse strain (C57BL/6J) and the genomes of 16 other inbred strains, using PEM patterns (Figure 1 and Supplementary Figure 1). It is important to appreciate the extent of the mouse genome that has undergone a complex rearrangement, for several reasons. First, it is reasonable to assume that a complex structure will correlate with a complex mechanism of formation such as FoSTeS. Second, genotyping complex structural polymorphism by sequencing might prove difficult, since the new analytical frameworks that have started to emerge base the allelic state of the population on a simple molecular structure (HANDSAKER et al.).

References

CAHAN, P., Y. LI, M. IZUMI and T. A. GRAUBERT, 2009 The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nat Genet 41: 430-437.
CHEN, K., J. W. WALLIS, M. D. MCLELLAN, D. E. LARSON, J. M. KALICKI et al., 2009 BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6: 677-681.
CONRAD, D. F., C. BIRD, B. BLACKBURNE, S. LINDSAY, L. MAMANOVA et al., Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat Genet 42: 385-391.
HANDSAKER, R. E., J. M. KORN, J. NEMESH and S. A. MCCARROLL, Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet.
HENRICHSEN, C. N., N. VINCKENBOSCH, S. ZOLLNER, E. CHAIGNAT, S. PRADERVAND et al., 2009 Segmental copy number variation shapes tissue transcriptomes. Nat Genet 41: 424-429.
INOUE, K., and J. R. LUPSKI, 2002 Molecular mechanisms for genomic disorders. Annu Rev Genomics Hum Genet 3: 199-242.
QUINLAN, A. R., R. A. CLARK, S. SOKOLOVA, M. L. LEIBOWITZ, Y. ZHANG et al., Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 20: 623-635.
SIMPSON, J. T., R. E. MCINTYRE, D. J. ADAMS and R. DURBIN, 2010 Copy number variant detection in inbred strains from short read sequence data. Bioinformatics 26: 565-567.
WONG, K., T. M. KEANE, J. STALKER and D. J. ADAMS, 2010 Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 11: R128.
YE, K., M. H. SCHULZ, Q. LONG, R. APWEILER and Z. NING, 2009 Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25: 2865-2871.
ZERBINO, D. R., G. K. MCEWEN, E. H. MARGULIES and E. BIRNEY, 2009 Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4: e8407.

Tables

Table 1. Structural variants in 17 inbred strains

Strain            Deletion  Insertion  Inversion  Gain  Other
129P2/OlaHsd        16,402     86,805         46    57     29
129S5/SvEvBrd       17,385     42,156         46    70     30
129S1/SvImJ         16,154     39,240         53    72     26
A/J                 15,885     73,909         46    88     27
AKR/J               16,258     42,327         49    69     21
BALB/cJ             14,898     45,038         45    82     21
C3H/HeJ             16,148     68,161         52    94     33
C57BL/6N               167      2,697          3    44      0
CAST/EiJ            51,304    107,912        128   361    108
CBA/J               17,066     54,044         46    79     33
DBA/2J              17,531     36,753         54    67     31
LP/J                17,030     47,770         47    64     30
NOD/ShiLtJ          17,078     22,651         55    51     27
NZO                 15,479     30,535         47    62     29
PWK/PhJ             54,312    103,968        158    96    108
Spretus/EiJ         91,729    172,997        282   112    230
WSB/EiJ             22,231     57,042         53    88     51

Listed are the total numbers of structural variants with a minimum size of 100 bp in the 17 inbred strains.
Here we differentiate between insertions, where we can determine the insertion points from read pair patterns and local assembly, and copy number gains, where a duplication is inferred from an increase in read depth. Copy number gains include tandem duplications, which are inferred from both read depth and read pair evidence. There is minimal overlap between the insertions and the copy number gains, since the insertion discovery algorithm considers only read pairs in which one mate is unmapped (i.e. de novo insertions). Included in 'Other' are those SVs which appear to be composed of more than one SV. These include deletions with insertions, and inversions with deletions.

Table 2. Inferring ancestral state using sequences flanking SV breakpoints. Examples A, B, C and D are taken from our list of 249 SVs resolved to base pair resolution (Supplementary Table 1). The first three columns give the chromosome, start position and end position of the SV in bp. Columns 4, 5 and 6 give a small stretch of sequence flanking the SV breakpoints, as well as the first 10 bp of the SV. Note that the full sequence of each SV is given in Supplementary Table 1. The columns entitled A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J give the strain distribution pattern (SDP) of the SV, with “0” indicating the absence and “1” the presence of the SV. The penultimate column is the PEM (paired-end mapping) signature relative to the reference genome. The last column gives the inferred ancestral state relative to either SPRET/EiJ or Rattus norvegicus, indicated by an asterisk. Microhomology at breakpoints is highlighted in red and target site duplication (TSD) in green.

Table 3. Sequence features at SV breakpoints and inferred mechanism.
In a, the percentage of each sequence feature at the precise breakpoint is given per category of ancestral SV (insertion, deletion, inversion, CNG and multiple events). In b, the percentage of each inferred mechanism is given relative to all SV regions presented in a. Empty cells indicate that the feature is not applicable, and all abbreviations are listed in the Supplementary Glossary.

Table: Phenotypes associated with 16 SVs

Phenotype                                  chr  SV                                     geneid              Gene
OFT Total activity                         11   ins.stop.11.29121681.29121779          ENSMUSG00000020463  SMEK homolog 2, suppressor of mek1
T-cells: CD4 Intensity                     17   ins.stop.17.36419891.36419987          ENSMUSG00000023083  histocompatibility 2, M region locus 10.2
Hippocampus cellular proliferation marker  1    ins.stop.1.136200143.136200245         ENSMUSG00000026458  Ppfia4
OFT Total activity                         2    del_noINS.stop.2.144402762.144402974   ENSMUSG00000027429  SEC23B
Serum urea concentration                   8    ins.start.8.35158082.35158184          ENSMUSG00000031516  dynactin 6
Serum Low density lipoproteins             4    ins.stop.4.108219592.108220677         ENSMUSG00000034610  zinc finger, CCHC domain containing 11
Red cells: mean cellular volume            8    ins.start.8.88460934.88461027          ENSMUSG00000036879  phosphorylase kinase beta
Adrenal Weight                             7    ins.start.7.112753048.112753150        ENSMUSG00000036989  Trim3
Hippocampus cellular proliferation marker  6    ins.stop.6.127443414.127443516         ENSMUSG00000037997  poly (ADP-ribose) polymerase family
Red cells: mean cellular volume            11   ins.stop.11.5189017.5189119            ENSMUSG00000041961  Znrf3
Hippocampus cellular proliferation marker  13   ins.stop.13.114014000.114015996        ENSMUSG00000042385  granzyme K
Serum urea concentration                   11   del_noINS.stop.11.115106125.115106247  ENSMUSG00000045980  transmembrane protein 104
Red cells: mean cellular haemoglobin       7    ins.stop.7.111511629.111511632         ENSMUSG00000052749  Trim30b
T-cells: CD4/CD8 ratio                     6    del_noINS.stop.6.71763250.71763885     ENSMUSG00000052962  mitochondrial ribosomal protein L35
Serum Low density lipoproteins             1    ins.stop.1.175961679.175961765         ENSMUSG00000054203  interferon activated gene
Red cells: mean cellular haemoglobin       11   ins.stop.11.58664774.58664817          ENSMUSG00000068869  predicted gene

Figure Legends

Figure 1. Types of structural variant. Blue boxes represent deletions, pink boxes insertions, orange boxes inversions and yellow boxes duplications; all types of structural variants are relative to the reference genome sequence. A) We found six basic types of structural variant: deletion (del), insertion (ins), inversion (inv), tandem duplication (dup), inverted tandem duplication (not drawn here) and dispersed duplication. B) Additionally, eight complex types of structural variant were found: deletion with an insertion (del+ins), linked deletion (normal copy of small length flanked by two deletions), deletion within a duplication (del in dup), inversion with flanking deletion(s) (for example del+inv+del), inversion with an insertion (inv+ins), inversion within a duplication (inv in dup), a linked insertion (linked ins), where the inserted sequence is copied from another location in the vicinity of the insertion site, and an inverted linked insertion (not drawn here), which has a similar pattern to a linked insertion but with the inserted sequence inverted. C) Example of the paired-end mapping (PEM) pattern of a del+inv+del. Green arrows represent primers used for PCR amplification and sequencing reactions. Primer names provide their positional information, relative to the reference genome. Black arrows joined by a curved line represent paired ends, whereas single black arrows represent singleton reads. Grey straight lines indicate mapping of the test reads onto the reference genome. When the inversion is smaller than the insert size, paired-end reads will flank both deletions and the inversion, as shown here. In other cases, decreased read depth will indicate flanking deletions.
D) Example of the PEM pattern of an inv+ins, with PCR data across the eight classical strains. Hyperladder II is used as the molecular weight marker. The amplicon for BALB/cJ, C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than in the other strains, indicative of the insertion. The inversion is revealed by sequencing. The complete list of patterns is drawn in Supplementary Figure 1, with examples and PCR data.

Figure 2. Venn diagrams showing the overlap between SVs detected in our study (in DBA/2J and CAST/Ei) and those published elsewhere. A: Venn diagram showing the overlap between DBA/2J deletions (relative to C57BL/6J) found in our study (blue circle), those found in another sequencing-based analysis (Quinlan et al., green circle) and those found in a high-density aCGH experiment (2.1 million probes, Agam et al., red circle). B: Similarly for copy number gains. [Need to add sentence explaining that the figures show overlap for merged SV-regions rather than pure SV calls.]

Figure 3. Breakpoint analysis of a complex SV. a) A complex SV, involving several genomic rearrangements including an inversion, a deletion, a short insertion and a copy number gain (CNG), is displayed relative to its location within Zbtb10, the zinc finger and BTB domain containing 10 gene. PCR amplification using forward (F) and reverse (R) primers revealed an AT insertion at the first breakpoint (J1), followed by an inversion of 125 bp which encompasses an inverted copy number gain of the 22 bp preceding J1, as seen at J2. Finally, breakpoint 3 (J3) revealed a deletion of 813 bp. Using RepeatMasker, a SINE element was found to be part of the deletion. b) PCR image of the amplification using the F and R primers (primer sequences available in Supplementary Table xxx). Hyperladder II was used as the size marker. C57BL/6N and LP/J show the normal size of 1604 bp, whereas A/J, AKR/J, BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp. c) Sequencing data across the J1, J2 and J3 breakpoints.
A colour code is used to indicate each type of SV: blue for the 22 bp inverted copy number gain, green for the inversion and red for the deletion. Where the test strain matches the reference strain, both are shown in the same colour.

Figure 4. Relationship between SNP and SV formation. a) Relationship between SNP and ancestral deletion formation. Two SNPs lying in the 6 bp microhomology of a 64 bp ancestral deletion (chr12:27,040,459-27,040,522) correlate with the presence of the SV. On the left, PCR amplification of the SV is shown across the eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). Hyperladder II was used as the DNA molecular weight marker. Some strains show a smaller amplicon than the others. On the right, sequencing traces are shown for a test strain (A/J) and the reference strain (C57BL/6N). Note that the traces of all other test strains are identical to the one shown here. An asterisk marks the 6 bp microhomology (GAACTA). The presence of two SNPs (C->G and T->A) in all test strains (shown here only for A/J) is associated with the presence of the ancestral deletion. b) Relationship between SNP and ancestral insertion formation. PCR data are shown on the right, with amplification in A/J, AKR/J and BALB/cJ. Strains with the ancestral insertion (C57BL/6N, CBA/J, DBA/2J and LP/J) failed to amplify because of the amplicon size. The insertion is a LINE on chromosome 13 (119,134,049-119,135,126). On the left, a sequencing trace is shown over the TSD for a strain that does not have the ancestral insertion. The TSD is 17 bp (AAGAATGTCAGCAAAGT) and, at the 12th position, a SNP (G->C) is observed in all the strains that have the insertion.
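The notion of breakpoint microhomology used in Table 2 and Figure 4 can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function name `microhomology_length`, the 0-based half-open coordinates and the toy sequence are all assumptions made for illustration. Only the 6 bp motif GAACTA is taken from the Figure 4 legend; the flanking bases are invented.

```python
def microhomology_length(ref: str, start: int, end: int) -> int:
    """Count bases of microhomology around the deletion ref[start:end].

    Coordinates are 0-based and half-open (an assumption of this sketch).
    Microhomology is counted in both directions, because a deletion whose
    breakpoints lie within a short repeat can be placed at several
    equivalent positions.
    """
    # Forward: bases at the start of the deleted segment that are
    # identical to the bases immediately after the deletion.
    fwd = 0
    while (start + fwd < end and end + fwd < len(ref)
           and ref[start + fwd] == ref[end + fwd]):
        fwd += 1

    # Backward: bases immediately before the deletion that are
    # identical to the last bases of the deleted segment.
    bwd = 0
    while (start - 1 - bwd >= 0 and end - 1 - bwd >= start
           and ref[start - 1 - bwd] == ref[end - 1 - bwd]):
        bwd += 1

    return fwd + bwd


# Hypothetical toy reference: the deleted segment begins with GAACTA,
# and the same motif follows the deletion. All other bases are invented.
left_flank = "TTGCA"
deleted = "GAACTACCGTTGCATCGT"
right_flank = "GAACTATGGC"
ref = left_flank + deleted + right_flank
start = len(left_flank)
end = start + len(deleted)

print(microhomology_length(ref, start, end))
```

For the toy sequence the function reports the six bases of the shared GAACTA motif. A real analysis would also need to account for sequencing errors and for SNPs inside the microhomology, such as the two described in Figure 4a.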