Sequence based characterization of structural variation in the mouse genome
Introduction
Structural variation in the mammalian genome is known to be abundant and to
contribute to phenotypic variation and disease. There has been considerable
progress assessing its extent and complexity (), phenotypic impact () and the
responsible molecular mechanisms () in the human genome, but much less is
known about SV in the mouse, currently the preeminent organism for
modeling how genetic lesions give rise to disease in mammals. In this paper
we use next generation sequencing to address three critical questions: what is
the extent and complexity of SV in the mouse genome, what are the likely
mechanisms for its formation, and what are its phenotypic consequences?
Current catalogues of mouse SVs are based on differential
hybridization of genomic DNA to oligonucleotide arrays (array comparative
genome hybridization (aCGH)). While array CGH can interrogate entire
genomes, it is blind to some SV categories (such as inversions and
insertions), and has a limited ability to detect others (segmental duplications
and transposable elements). Estimates of the proportion of the mouse
genome affected by SVs range from 3% (CAHAN et al. 2009) to over 10%
(HENRICHSEN et al. 2009), with three to four fold more deletions than
duplications detected in the most recent genome-wide aCGH experiments
(Cahan, Agam).
Assessing the potential mechanism of SV formation requires much
higher resolution than aCGH affords, ideally down to the base pair. Sequence
based methods, such as short-read paired end mapping (PEM), have the
requisite level of resolution and have been used to identify 7,196 SVs and
3,316 breakpoint sequences. These data, from comparison of two laboratory
strains (C57BL/6J and DBA/2J), indicate that most variation is due to
retrotransposition () and that mechanisms of SV formation require little or no
homology, so that non–allelic homologous recombination is rare.
A small number of SVs are associated with known phenotypic
abnormalities (Table). Genome-wide information about the impact of SVs on
phenotypes is limited to analyses of their impact on transcript abundance.
These studies have demonstrated not only that SVs alter the expression of
genes that they overlap, but also that SVs influence the expression of genes
lying up to 500 kb from their margins.
Here we report the identification, using short-read sequencing, of 1.4M
SVs in 17 inbred strains of mice. By analyzing breakpoint sequence we infer
the mechanisms of formation and assess their relative impact on shaping a
mammalian genome. Our molecular characterization of SVs in the mouse
genome is a starting point to determine the extent to which SVs contribute to
genetic and phenotypic diversity.
Results
SV identification
We identified a total of xx M SVs in the 17 strains, found more SVs than other
studies of the same genomes (for example four times as many deletions in
DBA/2J (QUINLAN et al.)), and discovered a greater variety of molecular
structures than previously reported (Fig. 1). To understand why, and to explain
our results, we start by explaining how we went about finding SVs.
We combined visual inspection of the data with molecular validation to
improve automated SV detection across the genome. We used two criteria to
identify SVs manually: read depth and anomalous paired-end mapping (PEM).
We did this using data from the mouse’s smallest chromosome (19) in its
entirety and a random set of other chromosomal regions, for eight strains (A/J,
AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J).
We expected to find eleven patterns, based on read depth and PEM, to
classify SVs (H1-H11; Fig. S1). For example a deletion is indicated by a
reduction in read depth and observing reads where one of the pair aligns to
one side of the deletion, and its mate to the other. Because the two ends are
sequenced on opposite strands, the direction of the two reads will be towards
each other (H1_del). By contrast, paired-end reads pointing in the same
direction and an unchanged read depth (except at the breakpoints) indicate
an inversion (H4_inv). However, we were surprised to discover an additional
ten patterns whose interpretation was ambiguous (Q1-Q10; Fig. S1). For
instance we found examples of a reduction in read depth coverage without
paired-end reads flanking the putative deletion (Q2_del), and putative
inversions where reads mapped to only one of the breakpoints (Q8_inv; Fig.
1D).
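The two expected signatures described above (H1_del and H4_inv) can be illustrated with a minimal sketch. It is not the detection code used in this study: the `ReadPair` record, the `classify_pair` function and the `max_insert` threshold are hypothetical names introduced for illustration, and read depth, which the text uses as a second criterion, is not modelled.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair (hypothetical record for illustration)."""
    pos1: int      # leftmost aligned position of the first mate
    strand1: str   # '+' or '-'
    pos2: int      # leftmost aligned position of the second mate
    strand2: str   # '+' or '-'


def classify_pair(pair, max_insert):
    """Return a coarse label for one read pair based on orientation and span.

    max_insert is an assumed upper bound on the library insert size.
    """
    # Order the two mates by reference coordinate.
    (lpos, lstrand), (rpos, rstrand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if lstrand == rstrand:
        # Both mates point in the same direction: inversion-like signature.
        return "inversion-like"
    if lstrand == "+" and rstrand == "-":
        # Mates point towards each other; a span larger than the library
        # allows suggests that sequence between them is missing.
        if rpos - lpos > max_insert:
            return "deletion-like"
        return "concordant"
    # Mates point away from each other; not covered by this sketch.
    return "other"


print(classify_pair(ReadPair(1000, "+", 6000, "-"), max_insert=600))
print(classify_pair(ReadPair(1000, "+", 5000, "+"), max_insert=600))
```

In practice a single anomalous pair is weak evidence; a caller would normally require a cluster of pairs with the same signature, together with the read-depth change expected for that SV type.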
We investigated the molecular structure of all 21 patterns using a PCR
strategy (Fig. S2). We designed 484 pairs of primers and amplified 447 unique
SV regions across eight classical inbred strains (Table S1). PCR products and
sequencing demonstrated that twelve patterns were indicative of a simple SV,
seven of a complex SV and two of a false SV (Table S2).
Based on manual inspection and classification of PEM patterns, we
identified a benchmark set of SVs on chromosome 19. We identified 684
deletions with the expected read architecture (Table S3), amongst which 317
were classified as “H1_del” (high confidence deletions of non-repetitive
sequences), 353 as “H2_del” (high confidence deletions of repeat elements),
and 14 as “H3_del” (linked deletions). Chromosome 19 contained only two
inversions and three gains, all of which fitted expected patterns of read
architecture. We refer to this set as a high-confidence SV set. We tested 15 of
these high-confidence SVs by PCR in the eight strains and all validated. We
identified 248 deletions (Q1-Q7) and 13 inversions (Q8-Q9) with ambiguous
read architecture (a detailed breakdown of each category is provided in Table
S3). We tested 62 of these, distributed evenly across the categories, and, as
expected, found that 5 were false, 15 were insertions instead, 16 were simple
and 26 complex.
To search the entire genome of all 17 strains, we used a combination
of four computational methods: split-read mapping (YE et al. 2009), mate-pair
analysis (CHEN et al. 2009), single-end cluster analysis (SECluster and
RetroSeq, unpublished), and read-depth (SIMPSON et al. 2010) (Supplementary
Methods and (WONG et al. 2010)). These methods identify deletions, insertions
and inversions based on PEM patterns, and copy number changes from read
depth. However, they are unable to differentiate basic PEM patterns (e.g.
inversion) from complex ones (e.g. inversion plus a deletion), and some SVs
are incorrectly classified. For example, the PEM patterns of linked insertions
(Q5_del and Q9_inv; Fig. S1) are similar to those for inversions or deletions.
Therefore, to further classify the original SV calls, we derived methods to
recognize most of the non-basic, but meaningful, PEM patterns shown in Fig.
1 and Fig. S1 (Supplementary Methods).
The results of the detection and classification of XXM SVs are shown in
Table 1. SVs smaller than 100bp are excluded, as, below this, it is difficult to
determine whether the deviation in distance between two paired end reads is
due to variation in the library insert size distribution or due to paired ends
flanking a SV. There are on average XX SVs in classical inbred strains, and
165,816 in wild derived inbred strains, affecting 1.2% (33.0Mb) and 3.7%
(98.6Mb) of the genome, respectively. SVs with complex PEM patterns
account for X% to X% of all SVs identified in each strain, except in
C57BL/6NJ, whose genome is almost void of complex PEM patterns. The
majority of SVs in all strains are deletions and insertions with simple PEM
patterns; however, we have been able to identify a small subset that occur
as complex SVs. Inversions are rare, and occur concurrently with a deletion
or an insertion about 50% of the time. Copy number gains are also rare (~100
per genome of each type), and cover from 1.8 Mb (NOD/ShiLtJ) to 16 Mb
(CAST/EiJ) of the genome. We also observe a small number of inversions
and deletions that occur in regions of copy number gain.
Sensitivity and specificity analyses
Automated analysis of chromosome 19 detected between 83.2% and 88% of
validated deletions (at least 50 bp in size), depending on the strain. The false
positive rate ranges from 5% to 7.2% (Table S4a). Although the false negative
rate per strain ranges from 12% to 16.8% when considering all types of simple
deletions, it should be noted that the automated analysis accurately identified
91% of sites containing a high confidence deletion (Table S4b).
To ensure that our sensitivity and specificity analyses were not vitiated
because we used SVs from chromosome 19 as a training set, we derived a
second, smaller, set of manually curated deletions from a randomly chosen 10
Mb region (101Mb to 111Mb) from chromosome 3 in strain C3H/HeJ.
Automated analysis of this region identified 43 of the curated deletions
(82.7%) and called 2 false deletions (4.4%). We also investigated the false
negative rate for the
automated detection of deletions across the genome using our PCR validation
data of 267 simple deletions. Consistent with the chromosome 19 and
chromosome 3 analyses we found that the false negative rate for simple
deletions was between 18% and 19.9% (Table S5a).
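The percentages quoted for the chromosome 3 region can be reproduced with a short calculation. This is a sketch under two stated assumptions: the total of 52 curated deletions is not given in the text and is inferred here from 43 being 82.7%, and the 4.4% figure is treated as the share of false calls among all calls (2 of 45), which is the reading that matches the reported number.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of curated SVs recovered by the automated analysis."""
    return true_pos / (true_pos + false_neg)


def false_call_fraction(true_pos, false_pos):
    """Fraction of automated calls that are not in the curated set."""
    return false_pos / (true_pos + false_pos)


# Chromosome 3 test region in C3H/HeJ.
curated_total = 52   # ASSUMPTION: inferred from 43 / 0.827, not stated in the text
found = 43           # curated deletions recovered automatically
false_calls = 2      # automated calls absent from the curated set

sens = sensitivity(found, curated_total - found)
fcf = false_call_fraction(found, false_calls)

print(round(100 * sens, 1))        # 82.7
print(round(100 * (1 - sens), 1))  # false negative rate, 17.3
print(round(100 * fcf, 1))         # 4.4
```

The false negative rate is simply one minus the sensitivity, which is why the 83.2% to 88% detection rates on chromosome 19 correspond to the 12% to 16.8% false negative rates quoted above.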
We could not assess the performance of automated analysis to detect
SV types other than deletions because so few were found by manual
inspection of chromosome 19. However we estimated false negative rates by
using PCR validated insertions, inversions and gains. The average rate was
higher than for deletions, ranging from 17% to 55% (Table S5b). Automated
analysis was less successful in detecting the more complex rearrangements –
of the 58 PCR validated complex SVs, XX were found.
Genome-wide breakpoint localization
We next determined the accuracy of reconstructing deletion and
insertion breakpoints from next-generation sequencing reads by local assembly
and breakpoint refinement as described in (WONG et al. 2010). Comparison of 848
breakpoints (from 424 deletions that were detected computationally) to the
actual breakpoints delineated by PCR and sequencing (Supplementary
Methods) revealed that predicted deletion breakpoints lay, on average, within
+/- 18 bp of the actual breakpoint, with a median offset of 0 bp; 56% of
breakpoints are exact and 77% are within 10 bp (Table S6a). ADD
INSERTIONS (Table S6b). Breakpoint accuracy for SV types other than
deletions and insertions is presented in Table S6c.
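The summary statistics reported for breakpoint accuracy (mean deviation, median, fraction exact and fraction within 10 bp) can be computed from a list of signed offsets between predicted and validated breakpoints. The sketch below uses made-up offsets purely for illustration; `summarize_offsets` and its `window` parameter are hypothetical names, and the values are not taken from Table S6a.

```python
from statistics import median


def summarize_offsets(offsets, window=10):
    """Summarize signed offsets (predicted minus validated position, in bp)."""
    n = len(offsets)
    abs_off = [abs(o) for o in offsets]
    return {
        "median": median(offsets),                 # signed median offset
        "mean_abs": sum(abs_off) / n,              # average distance from truth
        "frac_exact": sum(a == 0 for a in abs_off) / n,
        "frac_within": sum(a <= window for a in abs_off) / n,
    }


# Illustrative offsets only (not data from this study).
example_offsets = [0, 0, 0, 2, -3, 0, 15, -40]
print(summarize_offsets(example_offsets))
```

A median of zero together with a much larger mean deviation, as reported above, is what one would see when most breakpoints are exact and a minority are displaced by tens of base pairs.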
The existence of the TSD in LINE sequence afforded a convenient way
to determine how accurate our breakpoint estimates were. For each LINE
breakpoint we found the longest length of microhomology in the region 50 bp
around the breakpoint and assumed this was the TSD. Since the ends of the
TSDs correspond to true breakpoint junctions, base pair level breakpoint
resolution implies that predicted breakpoint junction should fall close to the
end of each TSD. We found that the vast majority of TSDs fall exactly on the
base pair predicted as being the breakpoint junction (Fig. S3).
Outgroup analysis
We predicted the ancestral state of each SV across the mouse genome using
rat as an outgroup: we assume a deletion is ancestral if it is found in the rat
but is absent in one or more of the classical strains (Example A in Table 2).
Conversely if there is a deletion in classical strains, but not in the reference
genome, then we assume the SV is an insertion (Example B in Table 2).
However we found that in 26 cases out of 249 (~10% of the total), the inferred
ancestral state is inconsistent with breakpoint features. In example C (Table 2),
using SPRET/EiJ as an outgroup suggested the presence of an ancestral
deletion, whereas a target site duplication (TSD) at the breakpoint suggested an
ancestral insertion. Similarly, in example D (Table 2), using SPRET/EiJ
suggested the presence of an ancestral insertion, whereas 3 bp of microhomology
at the breakpoint (TTA) suggested an ancestral deletion. Using rat sequence to
determine the ancestral state validated both the ancestral insertion on
chromosome 14 and the ancestral deletion on chromosome 15. With the
exception of 2 cases (<1%), all of the other 24 inconsistencies using
SPRET/EiJ as an outgroup could be reconciled by combining breakpoint
sequence and rat sequence. Inferred ancestral state of all 249 SV regions is
shown in Supplementary Table 5.
We recorded for each relative deletion class the length of the longest
segment of microhomology within the 100bp region centred on each predicted
SV breakpoint and compared this to random expectation (Figure 2). As
expected, LINE elements were associated with >15bp microhomology. SINE
elements and pseudo-genes depend on the LINE integration machinery and
exhibited a similar microhomology profile to LINEs. LTR elements had much
shorter segments of microhomology, again corresponding to the known
mechanism of LTR formation. Breakpoints surrounding VNTRs had longer
sequences of microhomology, presumably reflecting degenerate tandem
repeat sequence surrounding each breakpoint.
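To make the notion of breakpoint microhomology concrete, the sketch below measures it for a simple deletion on a reference sequence. Note that this is a simplification and not the procedure described above: instead of searching for the longest matching segment within a 100 bp window, it counts only the bases at the junction itself, that is, how far the start of the deleted segment matches the sequence just after the deletion, plus how far the end of the deleted segment matches the sequence just before it. The function name and the toy sequences are assumptions for illustration.

```python
def junction_microhomology(ref, start, end):
    """Length of microhomology at a deletion of ref[start:end].

    Coordinates are 0-based with an exclusive end. The result is capped at
    the deletion length.
    """
    n = len(ref)
    # Forward: start of the deleted segment versus the right flank.
    fwd = 0
    while (start + fwd < end and end + fwd < n
           and ref[start + fwd] == ref[end + fwd]):
        fwd += 1
    # Backward: end of the deleted segment versus the left flank.
    back = 0
    while (start - 1 - back >= 0 and end - 1 - back >= start
           and ref[start - 1 - back] == ref[end - 1 - back]):
        back += 1
    return min(fwd + back, end - start)


# Toy reference: left flank + deleted segment + right flank.
left, deleted, right = "GGGCC", "TTACCGGA", "TTAGCT"
ref = left + deleted + right
print(junction_microhomology(ref, len(left), len(left) + len(deleted)))  # 3
```

In the toy example the deleted segment and the right flank both begin with TTA, giving 3 bp of microhomology, the same length as in example D above. A target site duplication would appear in the same way, as a longer stretch shared between the two ends of the inserted element's flanks.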
SV mechanism
SV mechanism of formation is typically inferred by examining the sequence
features of its breakpoints. For example, 200 bp of sequence identity is
thought to be required for NAHR (INOUE and LUPSKI 2002), whereas much
smaller homology (microhomology) has often been associated with end-joining
processes, such as MMEJ. Delineation of retroelements is also
facilitated by the presence of flanking target site duplication (TSD), with
poly(A) tail or poly(T) head for LINE and SINE elements, and with dual or
mono long terminal repeat (LTR) for ERV elements. Variable number of
tandem repeat (VNTR) polymorphism is also easily identifiable from its
repetitive structure. However, without knowing whether an SV is more likely
to be an ancestral deletion or insertion it is difficult to infer mechanism
appropriately. Therefore in our analyses we used the classification based on
outgroup analysis described above in combination with the sequence features
at the breakpoint.
We classified the 249 SV by inferred mechanism of formation (a
flowchart for our method is presented in Supplementary Figure 2). Table 3b
describes the inferred mechanism for the 249 SV regions. Retrotransposition
is the commonest mechanism (41.7%, comprising 24.5% LINE, 12% ERV and
5.2% SINE retrotranspositions), followed by microhomology-mediated end
joining processes (31.3%), non-microhomology-mediated end joining (13.3%),
replication-based mechanisms such as FoSTeS (<10%), VNTR expansion
(5.2%), SSA (0.4%) and NAHR (0.4%).
A substantial proportion of SVs caused by LINE, ERV, SINE and VNTR
insertions do not show any missing nucleotides at their breakpoints (95%,
93.3%, 92.3% and 92.3% respectively). However, we found rare cases (4
LINEs, 2 ERVs, 1 SINE and 1 VNTR) in which the insertion machinery
also deletes nucleotides. Missing sequence ranged from 1 bp to 289 bp. We
found that the presence of an ancestral microdeletion is directly linked to the
absence of the TSD for three LINEs. This would suggest a dual mechanism of
SV formation, a union between DSB repair processes and LINE
retrotransposition.
About half of the SVs without LINE, ERV or SINE elements (69% of ancestral
deletions, 37.5% of inversions and 50% of both CNGs and multiple events) have
microhomology of between 3 bp and 25 bp, suggesting a
microhomology-mediated mutational process. We found several patterns of
microhomology: direct, palindromic, inverted and a complex combination of
these. Of the 3-25 bp microhomologies, 70% are direct, 13.3% inverted, 10%
complex and 6.6% palindromic. Longer sequence identity (>26 bp) is rarer
than shorter sequence identity (<3 bp). Half of the inversion breakpoints are
blunt ended, compared with 15.9% of ancestral deletions and 12.5% of CNGs. Of the
113 ancestral deletions, 36 (32%) had from 1 bp to 107 bp of inserted
sequence at the breakpoint, in addition to the deletion.
GENOME WIDE:
Genome-wide, 0.5% of SVs have sequence features consistent with
NAHR and 6% with VNTR. .. Martin stuff??
Relationship between SNP and SV formation
Our analysis of breakpoint sequence features in multiple strains allowed us to
look for a relationship between sequence variants (SNPs or short indels) and
SV formation. In particular, we addressed the question as to whether
sequence variants at breakpoints were associated with SV formation. In our
set of ancestral deletions for which we have base pair resolution data, we
observed in all cases that the presence of SNPs in the microhomology region
was correlated with the presence of the SV (Figure 3a).
In all cases, presence of the SNP elongates the microhomology. Since
we do not find instances where a SNP in the microhomology region occurs
without the deletion, we assume that the formation of deletion and SNP are
related. This phenomenon is rare: we only saw five (4.5%) cases amongst our
113 ancestral deletions where SNP and SV formation co-segregate. We found
a similar relationship between a SNP formed in the TSD and the presence of
an ancestral insertion (Figure 3b). 15 ancestral insertions (16%) had SNPs or
short indels within their TSD, coincident with an insertion. Details are given in
Supplementary Table 1. These SNPs are ideal candidates to tag SVs for
genotyping purposes; but their close proximity to SV breakpoints may make
genotyping difficult (it should be noted that none of these SNPs were
identified by short-read sequence).
Origin of SV breakpoints
We asked whether SVs that overlap in different strains could have arisen
more than once. We inferred independent origins when the position of the
breakpoint is different, so that for example one strain may have a 3 kb
deletion, while in another only 1 kb is missing. Within the eight classical
strains, size differences between SVs at the same locus were found at six SV
regions out of 241 (2.5%). We found no case with more than three alleles at
one SV locus. However, when we expanded our analysis to look at all 17 strains,
we found multiple alleles at 12% of SVs, due almost entirely to the presence
of different alleles in the wild-derived inbred strains. In two cases, we
observed four alleles at an SV locus and, in one case, five alleles: on
chromosome 10, AKR/J, CAST/EiJ, PWK/PhJ and SPRET/EiJ all have SVs with
different breakpoints (Supplementary Table 1).
Inversions
Inversions are more complicated than deletions, insertions and CNG, with
little known about their mechanism of formation. They require at least two
double-strand chromosomal breakages, as opposed for example to deletions
that only require one DSB. Here we characterized the breakpoints of 8
inversions at nucleotide-level resolution. Five of them (62.5%) have deletions
immediately adjacent to the inversion. An example is provided in Figure x.
Impact of SVs on gene function
We assessed the impact of SVs on phenotypes in three ways: (i) we examined
the relationship between the position of SVs and the position of genes; (ii) we
looked for changes in expression of genes overlapping, or nearby, an SV; (iii)
we tested by genetic association for a relationship between SVs and 98
phenotypes in an outbred population of mice.
We investigated the enrichment and depletion of SVs in genes by
counting the number of SVs that overlapped genes and then comparing this to
a null distribution of the expected number of overlaps, obtained by
permutation. Consistent with earlier studies () we found that relative deletions
are depleted in genes, introns, exons and promoter regions (P<0.01) and that
tandem duplications are more likely to include exons (P < 0.05, 1.7 to 3.3 fold
depending on the strain).
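The enrichment test described above, comparing the observed number of SV-gene overlaps with a null distribution obtained by permutation, can be sketched as follows. The exact permutation scheme used in this study is not specified here, so the sketch makes its own assumptions: a single chromosome of length `chrom_len`, SVs relocated uniformly at random while keeping their lengths, and half-open intervals. All function and parameter names are hypothetical.

```python
import random


def overlaps_any(start, end, features):
    """True if the half-open interval [start, end) overlaps any feature."""
    return any(start < f_end and f_start < end for f_start, f_end in features)


def count_overlaps(svs, features):
    """Number of SVs that overlap at least one feature."""
    return sum(overlaps_any(s, e, features) for s, e in svs)


def permutation_test(svs, features, chrom_len, n_perm=1000, seed=0):
    """Compare observed SV-feature overlaps with randomly relocated SVs.

    Returns (observed, fold_change, p_enrichment, p_depletion).
    """
    rng = random.Random(seed)
    observed = count_overlaps(svs, features)
    null = []
    for _ in range(n_perm):
        shuffled = []
        for s, e in svs:
            length = e - s
            # ASSUMPTION: uniform placement along one chromosome.
            new_start = rng.randint(0, chrom_len - length)
            shuffled.append((new_start, new_start + length))
        null.append(count_overlaps(shuffled, features))
    mean_null = sum(null) / n_perm
    fold = observed / mean_null if mean_null else float("inf")
    # Empirical one-sided p-values with the usual +1 correction.
    p_enrich = (1 + sum(x >= observed for x in null)) / (n_perm + 1)
    p_deplete = (1 + sum(x <= observed for x in null)) / (n_perm + 1)
    return observed, fold, p_enrich, p_deplete


genes = [(100, 200), (5000, 5600)]
svs = [(150, 180), (5100, 5200), (9000, 9100)]
print(permutation_test(svs, genes, chrom_len=10000, n_perm=200))
```

A fold change below 1 with a small depletion p-value corresponds to the depletion of relative deletions in genes reported above, and a fold change above 1 with a small enrichment p-value to the enrichment of tandem duplications for exons. A genome-wide analysis would also need to respect chromosome boundaries and unmappable regions, which this sketch ignores.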
We also made a number of novel observations about the relationship
between SVs and genes. First, we found a slight, but significant, enrichment
of small (<1000 bp) relative deletions in genes in five of the classical inbred
strains (129P2, 129S1, 129S5 DBA/2J and LP/J), and in all of the wild-derived
strains (F.C. range 1.03 – 1.07, P<=0.01). In three of the wild-derived strains
we found a larger enrichment (F.C. ~1.2, P<0.01) of VNTR deletions in genes.
We found that deletions are significantly underrepresented in genes (). We
found no significant relationship between any other class of SV and gene
location. Tandem duplications are enriched for genes (check – and cf Eichler)
– AVI
Genes affected by deletions -
how many genes are affected?
Evidence from mRNA – deficit of exons involved. – one fusion gene.
Exons of 1,901 genes are partly or completely deleted by SVs for all strains,
and 781 genes for laboratory strains (table ) .
GO analyses to confirm “Genes involved in immunity and defense,
sensory perception, cell adhesion and signal transduction seem to be
especially prone to deletion (see also refs. 1,3,18)”
We expect that larger SVs are more likely to have a functional impact than
smaller (simply because their larger size means they are more likely to
include a functional element). While this prediction is true for deletions, we
were surprised to find that there is an enrichment of small deletions within
introns.
Impact of inversions – no evidence of fusion genes?
Gene expression
Results globally – effect at a distance, and analysed for different sizes.
Relationship between SVs and phenotypic variation
To attribute a phenotypic consequence to the SVs we carried out
genetic association with phenotypes measured in over 2,000 heterogeneous
stock (HS) mice, animals that are descended from eight of the sequenced
strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J)
{Valdar, 2006 #1047}. The large number of recombinants that have
accumulated since the founding of the HS means that QTLs are mapped to an
average region of 3 Mb. The HS is not only unique for its high resolution and
the number of QTLs that have been mapped (843) {Valdar, 2006 #1047}, but
also for the diversity of traits analysed, including disease models (asthma,
anxiety and type 2 diabetes), as well as haematological, immunological,
biochemical and anatomical assays.
As described in our companion paper, we used imputation to genotype
SVs and then applied a test that discriminates between variants that are likely
to be functional and those that are not {Yalcin, 2005 #1087}. We were thus
able to test ~166,000 SVs where we were certain that the SDP was correct
(including deletions, copy number gains, insertions and inversions) (refer to
Kim’s method for producing SDPs; in addition we only included SVs where
there were no missing data in any of the HS founder strains).
We were concerned that the relatively high rates of CNV mutation
might invalidate the imputation (the HS animals are at least 60 generations
distant from the sequenced strains), so we genotyped 100 HS animals using a
high-density array (). 217 deletions could be genotyped on the array (with an
additional 50 deletions when we allow for non-segregating SVs in the HS). We
compared results by determining whether the imputation correctly predicted
the SV
We identified 331 QTLs where the logP of the SV is among the highest
(and therefore the SV is among the variants most likely to be functional). In all
these cases the SV was only one among a number of variants with the
highest score. Since, as shown in our companion paper, larger effect QTLs
are more likely to arise from SVs, we decided to look at QTLs with the largest
effect size. Our prior analysis also suggests that larger effect QTLs are likely
to involve exonic regions. We identified 16 QTLs where the SV overlapped an
exon, and where the QTL effect size is in the top 5% of the distribution. Table
X lists these SVs, the genes they affect and the putative phenotype with which
they are associated. None of the genes has been previously associated with
these phenotypes.
Discussion
We find XX more than anyone else
Did we get this right?
Inferred mechanisms are consistent with other papers (Quinlan) and different
from human
We are the first to relate sequence variants to SVs
We find little effect on gene function – due to homozygosity and selection for
inbreeding?
Recent studies of mutation spectrum in human and mouse SV found a similar
figure (33%)(CONRAD et al.; QUINLAN et al.), suggesting similar mutational
processes occur in both human and mouse SV formation. 12.5% of CNG
have additional sequences (1-10 bp) at the breakpoint, followed by 37.5% for
inversions and 50% for multiple events. We next looked at the correlation
between Class 3 and Class 4 SVs and found that microinsertion at the
breakpoint is enriched in blunt ended SVs (ratio 2.5).
Most SV studies have only been able to identify basic structural variation,
such as deletion or insertion. Here we were able to discover complex genomic
rearrangement between the genome of the reference mouse strain
(C57BL/6J) and the genome of 16 other inbred strains, using PEM patterns
(Figure 1 and Supplementary Figure 1). It is important to appreciate the extent
of the mouse genome that has undergone a complex rearrangement, for
several reasons. First, it is reasonable to assume that a complex structure will
correlate with a complex mechanism of formation such as FoSTeS. Second,
genotyping complex structural polymorphism by sequencing might prove
difficult since new analytical frameworks that have started to emerge have
based the allelic state of the population on a simple molecular structure
(HANDSAKER et al.).
References
CAHAN, P., Y. LI, M. IZUMI and T. A. GRAUBERT, 2009 The impact of copy number
variation on local gene expression in mouse hematopoietic stem and
progenitor cells. Nat Genet 41: 430-437.
CHEN, K., J. W. WALLIS, M. D. MCLELLAN, D. E. LARSON, J. M. KALICKI et al., 2009
BreakDancer: an algorithm for high-resolution mapping of genomic structural
variation. Nat Methods 6: 677-681.
CONRAD, D. F., C. BIRD, B. BLACKBURNE, S. LINDSAY, L. MAMANOVA et al.,
Mutation spectrum revealed by breakpoint sequencing of human germline
CNVs. Nat Genet 42: 385-391.
HANDSAKER, R. E., J. M. KORN, J. NEMESH and S. A. MCCARROLL, Discovery and
genotyping of genome structural polymorphism by sequencing on a population
scale. Nat Genet.
HENRICHSEN, C. N., N. VINCKENBOSCH, S. ZOLLNER, E. CHAIGNAT, S. PRADERVAND
et al., 2009 Segmental copy number variation shapes tissue transcriptomes.
Nat Genet 41: 424-429.
INOUE, K., and J. R. LUPSKI, 2002 Molecular mechanisms for genomic disorders.
Annu Rev Genomics Hum Genet 3: 199-242.
QUINLAN, A. R., R. A. CLARK, S. SOKOLOVA, M. L. LEIBOWITZ, Y. ZHANG et al.,
Genome-wide mapping and assembly of structural variant breakpoints in the
mouse genome. Genome Res 20: 623-635.
SIMPSON, J. T., R. E. MCINTYRE, D. J. ADAMS and R. DURBIN, 2010 Copy number
variant detection in inbred strains from short read sequence data.
Bioinformatics 26: 565-567.
WONG, K., T. M. KEANE, J. STALKER and D. J. ADAMS, 2010 Enhanced structural
variant and breakpoint detection using SVMerge by integration of multiple
detection methods and local assembly. Genome Biol 11: R128.
YE, K., M. H. SCHULZ, Q. LONG, R. APWEILER and Z. NING, 2009 Pindel: a pattern
growth approach to detect break points of large deletions and medium sized
insertions from paired-end short reads. Bioinformatics 25: 2865-2871.
ZERBINO, D. R., G. K. MCEWEN, E. H. MARGULIES and E. BIRNEY, 2009 Pebble and
rock band: heuristic resolution of repeats and scaffolding in the velvet short-read
de novo assembler. PLoS One 4: e8407.
Tables
Table 1. Structural variants in 17 inbred strains
Strain          Deletion  Insertion  Inversion  Gain  Other
129P2/OlaHsd      16,402     86,805         46    57     29
129S5/SvEvBrd     17,385     42,156         46    70     30
129S1/SvImJ       16,154     39,240         53    72     26
A/J               15,885     73,909         46    88     27
AKR/J             16,258     42,327         49    69     21
BALB/cJ           14,898     45,038         45    82     21
C3H/HeJ           16,148     68,161         52    94     33
C57BL/6N             167      2,697          3    44      0
CAST/EiJ          51,304    107,912        128   361    108
CBA/J             17,066     54,044         46    79     33
DBA/2J            17,531     36,753         54    67     31
LP/J              17,030     47,770         47    64     30
NOD/ShiLtJ        17,078     22,651         55    51     27
NZO               15,479     30,535         47    62     29
PWK/PhJ           54,312    103,968        158    96    108
Spretus/EiJ       91,729    172,997        282   112    230
WSB/EiJ           22,231     57,042         53    88     51
Table 1. Structural variants in 17 inbred strains. Listed are the total numbers of
structural variants with a minimum size of 100 bp in the 17 inbred strains. Here we
differentiate between insertions, where we can determine the insertion points from
read pair patterns and local assembly, and copy number gains, where a duplication
is inferred from an increase in read depth. Copy number gains include tandem
duplications, which are inferred from both read depth and read pair evidence. There
is minimal overlap between the insertions and the copy number gains, since the
insertion discovery algorithm considers only read pairs in which one mate is
unmapped (i.e. de novo insertions). Included in 'Other' are those SVs which appear to
comprise more than one SV. These include deletions with insertions, and
inversions with deletions.
Table 2. Inferring ancestral state using sequences flanking SV breakpoints.
Examples A, B, C and D are taken from our list of 249 SVs resolved to base
pair resolution (Supplementary Table 1). The first three columns give
chromosome, start position and end position of the SV in bp. Columns 4, 5 and
6 give a small stretch of sequence flanking the SV breakpoints, as well as the
first 10 bp of the SV. Note that the full sequence of each SV is given in
Supplementary Table 1. Columns entitled A/J, AKR/J, BALB/cJ, C3H/HeJ,
C57BL/6N, CBA/J, DBA/2J and LP/J give the strain distribution pattern
(SDP) of the SV, with “0” indicating the absence and “1” the presence of the
SV. The penultimate column is the PEM (Paired-End Mapping) signature relative to
the reference genome. The last column gives the inferred ancestral state
relative to either SPRET/EiJ or Rattus norvegicus indicated by an asterisk.
Microhomology at breakpoints is highlighted in red and target site duplication
(TSD) in green.
Table 3. Sequence features at SV breakpoints and inferred mechanism. In a,
the percentage of each sequence feature at precise breakpoint is given per
category of ancestral SV (insertion, deletion, inversion, CNG and multiple
events). In b, the percentage of each inferred mechanism is given relative to
all SV regions presented in a. Empty cells indicate that a feature is not applicable; all
abbreviations are listed in the Supplementary Glossary.
Table: Phenotypes associated with 16 SVs
Phenotype | chr | SV | geneid | Gene
OFT Total activity | 11 | ins.stop.11.29121681.29121779 | ENSMUSG00000020463 | SMEK homolog 2, suppressor of mek1
T-cells: CD4 Intensity | 17 | ins.stop.17.36419891.36419987 | ENSMUSG00000023083 | histocompatibility 2, M region locus 10.2
Hippocampus cellular proliferation marker | 1 | ins.stop.1.136200143.136200245 | ENSMUSG00000026458 | Ppfia4
OFT Total activity | 2 | del_noINS.stop.2.144402762.144402974 | ENSMUSG00000027429 | SEC23B
Serum urea concentration | 8 | ins.start.8.35158082.35158184 | ENSMUSG00000031516 | dynactin 6
Serum Low density lipoproteins | 4 | ins.stop.4.108219592.108220677 | ENSMUSG00000034610 | zinc finger, CCHC domain containing 11
Red cells: mean cellular volume | 8 | ins.start.8.88460934.88461027 | ENSMUSG00000036879 | phosphorylase kinase beta
Adrenal Weight | 7 | ins.start.7.112753048.112753150 | ENSMUSG00000036989 | Trim3
Hippocampus cellular proliferation marker | 6 | ins.stop.6.127443414.127443516 | ENSMUSG00000037997 | poly (ADP-ribose) polymerase family
Red cells: mean cellular volume | 11 | ins.stop.11.5189017.5189119 | ENSMUSG00000041961 | Znrf3
Hippocampus cellular proliferation marker | 13 | ins.stop.13.114014000.114015996 | ENSMUSG00000042385 | granzyme K
Serum urea concentration | 11 | del_noINS.stop.11.115106125.115106247 | ENSMUSG00000045980 | transmembrane protein 104
Red cells: mean cellular haemoglobin | 7 | ins.stop.7.111511629.111511632 | ENSMUSG00000052749 | Trim30b
T-cells: CD4/CD8 ratio | 6 | del_noINS.stop.6.71763250.71763885 | ENSMUSG00000052962 | mitochondrial ribosomal protein L35
Serum Low density lipoproteins | 1 | ins.stop.1.175961679.175961765 | ENSMUSG00000054203 | interferon activated gene
Red cells: mean cellular haemoglobin | 11 | ins.stop.11.58664774.58664817 | ENSMUSG00000068869 | predicted gene
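The SV identifiers in the table appear to pack five dot-separated fields into one string. The sketch below splits such an identifier, assuming (this is not stated in the table) that the fields are, in order, the SV type, a start/stop tag, the chromosome and two coordinates. The names SVId and parse_sv_id are hypothetical and used for illustration only.

```python
from typing import NamedTuple


class SVId(NamedTuple):
    """Fields of an SV identifier; the field meanings are assumed, not documented."""
    sv_type: str  # e.g. "ins" or "del_noINS"
    tag: str      # "start" or "stop"; its exact meaning is not given in the table
    chrom: str    # chromosome name, kept as text
    first: int    # first coordinate in the identifier
    second: int   # second coordinate in the identifier


def parse_sv_id(sv_id: str) -> SVId:
    """Split an identifier such as 'ins.stop.11.29121681.29121779' into fields."""
    parts = sv_id.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated fields, got {len(parts)}: {sv_id!r}")
    sv_type, tag, chrom, first, second = parts
    return SVId(sv_type, tag, chrom, int(first), int(second))


example = parse_sv_id("del_noINS.stop.2.144402762.144402974")
print(example.chrom, example.second - example.first)  # prints: 2 212
```

Keeping the chromosome as text rather than an integer means the same parser would also accept names such as X, should they occur.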
Figure Legends
Figure 1. Types of structural variant. Blue boxes represent deletions, pink
boxes insertions, orange boxes inversions and yellow boxes duplications; all
types of structural variants are relative to the reference genome sequence. A)
We found six basic types of structural variant: deletion (del), insertion (ins),
inversion (inv), tandem duplication (dup), inverted tandem duplication (not
drawn here) and dispersed duplication. B) Additionally, eight complex types of
structural variant were found: deletion with an insertion (del+ins), linked
deletion (normal copy of small length flanked by two deletions), deletion within
a duplication (del in dup), inversion with flanking deletion(s) (for example
del+inv+del), inversion with an insertion (inv+ins), inversion within a
duplication (inv in dup), a linked insertion (linked ins) where the inserted
sequence is copied from another location in the vicinity of the inserted site
and an inverted linked ins (not drawn here) which has a similar pattern to a
linked insertion but with the inserted sequence being inverted. C) Example of
paired-end mapping (PEM) pattern of a del+inv+del. Green arrows represent
primers used for PCR amplification and sequencing reactions. Primer names
provide their positional information, relative to the reference genome. Black
arrows joined by a curved line represent paired ends, whereas single
black arrows represent singleton reads. Grey straight lines indicate the mapping
of the test reads onto the reference genome. When the inversion is smaller
than the insert size, paired-end reads flank both deletions and the inversion,
as shown here. In other cases, decreased read depth indicates the flanking
deletions. D) Example of PEM pattern of an inv+ins, with PCR data across the
eight classical strains. Hyperladder II was used as the molecular size marker.
The amplicon for BALB/cJ, C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger
than for the other strains, indicative of the insertion. The inversion was
revealed by sequencing. The complete list of patterns is shown in
Supplementary Figure 1, with examples and PCR data.
Figure 2. Venn diagrams showing the overlap between SVs detected in our
study (in DBA/2J and CAST/Ei) and those published elsewhere. A: Overlap
between DBA/2J deletions (relative to C57BL/6J) found in our study (blue
circle), those found in another sequencing-based analysis (Quinlan et al.,
green circle) and those found in a high-density aCGH experiment (2.1 million
probes, Agam et al., red circle). B: The same comparison for copy number
gains.
The diagrams show the overlap for merged SV regions rather than for
individual SV calls.
Figure 3. Breakpoint analysis of a complex SV. a) A complex SV involving
several genomic rearrangements (an inversion, a deletion, a short insertion
and a copy number gain (CNG)) is displayed relative to its location within
Zbtb10, the zinc finger and BTB domain containing 10 gene. PCR amplification
using forward (F) and reverse (R) primers revealed an AT insertion at the
first breakpoint (J1), followed by an inversion of 125 bp that encompasses an
inverted copy number gain of the 22 bp preceding J1, as seen at J2. Finally,
the third breakpoint (J3) revealed a deletion of 813 bp. RepeatMasker
identified a SINE element within the deleted sequence. b) Gel image of the
PCR amplification using the F and R primers (primer sequences available in
Supplementary Table xxx). Hyperladder II was used as the size marker.
C57BL/6N and LP/J show the reference-size band of 1604 bp, whereas A/J,
AKR/J, BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band of 793 bp. c)
Sequencing data across the J1, J2 and J3 breakpoints. A colour code indicates
each type of SV: blue for the 22 bp inverted copy number gain, green for the
inversion and red for the deletion. Where the test strain matches the
reference strain, both are shown in the same colour.
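The two band sizes in panel b can be compared with the events listed in panel a. The sketch below is only a consistency check under a stated assumption: that the 813 bp deletion and the 2 bp AT insertion are the only events that change the amplicon length, and that the inversion and the inverted copy leave the net length unchanged. The legend does not say this, and the function name expected_amplicon_bp is hypothetical.

```python
REFERENCE_AMPLICON_BP = 1604   # band size in C57BL/6N and LP/J (from the legend)
DELETION_BP = 813              # deletion found at J3 (from the legend)
INSERTION_BP = len("AT")       # short insertion found at J1 (from the legend)


def expected_amplicon_bp(reference_bp: int, deleted_bp: int, inserted_bp: int) -> int:
    """Amplicon length after removing deleted bases and adding inserted bases.

    Assumes no other event changes the net length (an assumption, see lead-in).
    """
    return reference_bp - deleted_bp + inserted_bp


variant_bp = expected_amplicon_bp(REFERENCE_AMPLICON_BP, DELETION_BP, INSERTION_BP)
print(variant_bp)  # prints: 793
```

Under that assumption the result equals the smaller band size reported for the six variant strains.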
Figure 4. Relationship between SNP and SV formation. a) Relationship
between SNP and ancestral deletion formation. Two SNPs lying within the 6 bp
microhomology of an ancestral deletion of 64 bp
(chr12:27,040,459-27,040,522) correlated with the presence of the SV. On the
left, PCR amplification of the SV is shown across the eight classical strains
(A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J).
Hyperladder II was used as the DNA molecular weight marker. Some strains show
a smaller amplicon than the others. On the right, sequencing traces are shown
for a test strain (A/J) and the reference strain (C57BL/6N). The traces of
all other test strains are identical to the one shown here. Asterisks mark
the 6 bp microhomology (GAACTA). The presence of two SNPs (C->G and T->A) in
all test strains (shown here only for A/J) is associated with the presence of
the ancestral deletion. b) Relationship between SNP and ancestral insertion
formation. PCR data are shown on the right, with amplification in A/J, AKR/J
and BALB/cJ. Strains with the ancestral insertion (C57BL/6N, CBA/J, DBA/2J
and LP/J) failed to amplify because of the amplicon size. The insertion is a
LINE on chromosome 13 (119,134,049-119,135,126). On the left, a sequencing
trace is shown over the TSD for a strain that does not have the ancestral
insertion. The TSD is 17 bp (AAGAATGTCAGCAAAGT) and, at the 12th position, a
SNP (G->C) is observed in all the strains that have the insertion.
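The lengths quoted in this legend can be checked against its coordinates and sequences. The sketch below assumes that each pair of coordinates is an inclusive start and end on the reference genome; the helper name span_bp is hypothetical.

```python
def span_bp(start: int, end: int) -> int:
    """Length in bp of an interval, assuming inclusive start and end coordinates."""
    return end - start + 1


# Panel a: ancestral deletion on chromosome 12 and its microhomology.
deletion_bp = span_bp(27_040_459, 27_040_522)
microhomology = "GAACTA"

# Panel b: LINE insertion on chromosome 13 and its target site duplication.
line_bp = span_bp(119_134_049, 119_135_126)
tsd = "AAGAATGTCAGCAAAGT"

print(deletion_bp)         # prints: 64
print(len(microhomology))  # prints: 6
print(line_bp)             # prints: 1078
print(len(tsd))            # prints: 17
print(tsd[12 - 1])         # 12th base of the TSD as printed in the legend: C
```

With inclusive coordinates the deletion spans 64 bp, which matches the legend, and the printed TSD is 17 bases long as stated.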
Figure 1. Types of structural variant.
Figure 2. Venn diagrams showing the overlap between SVs detected in our
study (in DBA/2J) and those published elsewhere.
Figure 3. Breakpoint analysis of a complex SV.
Figure 4. Relationship between SNP and SV formation