Sequence based characterization of structural variation in the mouse
genome
Introduction
Structural variation in the mammalian genome is known to be abundant and to
contribute to phenotypic variation and disease. There has been considerable
progress assessing its extent (), phenotypic impact () and the responsible
molecular mechanisms () in the human genome, but much less is known
about SV in the mouse, currently the preeminent organism for modeling how
genetic lesions give rise to disease in mammals. In this paper we use next
generation sequencing to address three critical questions: what is the extent
of SV in the mouse genome, what are the likely mechanisms for its formation,
and what are its phenotypic consequences?
Current catalogues of mouse SVs are based on differential
hybridization of genomic DNA to oligonucleotide arrays (array comparative
genomic hybridization (aCGH)). While aCGH can interrogate entire
genomes, it is blind to some SV categories (such as inversions and
insertions), and has a limited ability to detect others (segmental duplications
and transposable elements). Estimates of the proportion of the mouse
genome affected by SVs range from 3% (CAHAN et al. 2009) to over 10%
(HENRICHSEN et al. 2009), with three to four fold more deletions than
duplications detected in the most recent genome-wide aCGH experiments
(Cahan, Agam).
Assessing the potential mechanism of SV formation requires much
higher resolution than aCGH affords, ideally down to the base pair. Sequence-based
methods, such as short-read paired-end mapping (PEM), have the
requisite level of resolution and have been used to identify 7,196 SVs and
3,316 breakpoint sequences. These data, from comparison of two laboratory
strains (C57BL/6J and DBA/2J), indicate that most variation is due to
retrotransposition () and that mechanisms of SV formation require little or no
homology, so that non-allelic homologous recombination is rare.
Here we report the identification, using short-read sequencing, of more
than half a million SVs in 17 inbred strains of mice. By analyzing breakpoint
sequence we infer the mechanisms of formation and assess their relative
impact on shaping a mammalian genome. Our molecular characterization of
SVs in the mouse genome is a starting point to determine the extent to which
SVs contribute to genetic and phenotypic diversity.
Results
SV identification
Variation in the expected number of reads mapping to the reference sequence
was used to identify copy number variation, while deviations from the expected
distance between reads, and the orientation of reads, were used to determine
the type of structural variant (Supplementary Methods). It has been difficult
to determine how well computational implementations of these approaches
perform, as validation methods are typically labour intensive; furthermore, it
has become clear that many structural variants are complex, involving
combinations of deletions, insertions and inversions (AGAM et al.; QUINLAN et
al.). We therefore decided to identify a benchmark set of SVs to use as
validation for detection.
We manually designed 352 unique pairs of primers (uninformative
primers excluded) and successfully amplified 329 SVs, each covering a deletion.
[BINNAZ CURRENTLY WORKING ON THIS].
We manually examined reads mapping to the entire chromosome 19
(the smallest mouse chromosome) in addition to a random set of other
chromosomal regions for eight strains (A/J, AKR/J, BALB/cJ, C3H/HeJ,
C57BL/6N, CBA/J, DBA/2J and LP/J), and used two criteria to identify SVs:
read depth and anomalous read pairs. Although the basis of PEM has been
described in detail in a previous study (MEDVEDEV et al. 2009), to understand
our findings it is important to appreciate the six expected patterns of read
architecture (Supplementary Figure XX). For deletions we expected a
reduction in read depth and supporting read pairs flanking the deleted region;
for gains we looked for an increase in read depth; for tandem copy number
gains we looked in addition for supporting read pairs with opposing orientation
spanning the length of the duplicated region; for inverted copy number gains
we expected an increase in read depth and supporting read pairs pointing in the same
direction; for insertions we looked for singleton reads flanking the SV (the
other read of the pair will lie inside the inserted, unsequenced, DNA); and for inversions
we expected unchanged read depth, with sequences from both ends of read pairs
pointing in the same direction.
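The six expected read architectures can be summarised as a small lookup, sketched here in Python. This is only an illustration of the rules listed above, not the pipeline described in the Supplementary Methods; the class name, the category labels and the assumption that insertions leave read depth unchanged are ours.

```python
from dataclasses import dataclass


@dataclass
class ReadEvidence:
    depth: str  # "loss", "gain" or "unchanged" relative to expected read depth
    pairs: str  # "flanking", "opposing", "same_direction", "singleton" or "none"


# Hypothetical encoding of the six expected read architectures.
# Assumption: insertions are taken to leave read depth unchanged.
EXPECTED_PATTERNS = {
    ("loss", "flanking"): "deletion",
    ("gain", "none"): "copy number gain",
    ("gain", "opposing"): "tandem copy number gain",
    ("gain", "same_direction"): "inverted copy number gain",
    ("unchanged", "singleton"): "insertion",
    ("unchanged", "same_direction"): "inversion",
}


def classify(evidence: ReadEvidence) -> str:
    """Return the expected SV type, or 'anomalous' for any other combination."""
    return EXPECTED_PATTERNS.get((evidence.depth, evidence.pairs), "anomalous")
```

Any combination outside the six entries is reported as anomalous, which corresponds to the patterns that required PCR follow-up.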
In addition to the six expected patterns we observed an additional XX
(?6). By amplifying and sequencing DNA at the putative SVs we were able to
demonstrate that XX anomalous patterns were indicative of a true SV and YY
were not. Anomalous patterns were due to additional complexity within the SV
and repeat content at the SV, leading to read mis-mapping.
We identified 693 deletions with the expected read architecture
(Supplementary Table xx); we tested 100 by PCR in the eight strains and all
validated; we refer to this set as a high confidence deletion set. We identified
600 deletions with anomalous read architecture; we tested 100, and about half
validated by PCR. True positives tended to possess
multiple read pair architectures, indicating a complex combination of deletion,
inversion, insertions and gains (fig). False positives tended to result from mis-mapped reads caused by repetitive sequence (fig). Chromosome 19 contained
only three copy number gains and two inversions, all of which fitted expected
patterns of read architecture.
Automated analysis of chromosome 19 detected, on average, 92.5% of
the high confidence deletions. The false positive rate, again for high
confidence deletions, ranges from 2.5% to 3.7%. Although the false negative
rate per strain ranges from 5.6% (AKR/J) to 8.5% (LP/J), it should be noted
that the automated analysis accurately identified 98.2% of sites containing a
high confidence deletion (with a concomitant false positive rate of < 1%).
Automated analysis was less successful in detecting the more complex
rearrangements – of the 50 PCR validated complex deletions XX were found.
[PLEASE CHECK WITH KIM]. [Figure showing SVs on chr 19? Is this useful?]
We investigated the false positive rate of the automated analysis of
deletions across the genome by PCR validation of 700 deletions. Consistent
with the chromosome 19 analysis we found that the false positive rate for
simple deletions was very low (less than X%). For other types of deletion the
rate was higher – YY%. [ MORE HERE – BINNAZ/KIM]
We could not assess the performance of automated analysis to detect
SV types other than deletions because so few were found by manual
inspection of chromosome 19. However, we could estimate the false positive
rates by randomly choosing 24 of each type for PCR validation. We found that
XX% of inversions were false positives, YY% of gains, and ZZ% of insertions.
Genome-wide SVs
[Kim]
We searched the entire genome of all strains computationally and
identified a total of 1.4M SVs in the 17 strains (Table 2). There are on average
XX SVs in classical inbred strains, affecting YY% of the genome. We find four
times as many deletions in DBA/2J [What about the comparison to CAST?]
than prior studies (Figure X) due to higher sequence coverage, SV calls from
multiple algorithms, and the ability to find shorter variants (Figure Y).
Relative to the reference we found two to three times as many insertions as
deletions; inversions and copy number gains are very rare (100 per genome
of each type). [We need to connect this to the chr 19 work. Chr 19 data was
used to develop/ fine-tune methods]
Sequence content of SVs
[Avi]
Based on read architecture and sequence composition we classified
SVs and observed that, consistent with previous analyses, the majority
(~60%) coincide with retrotransposons (how was this determined? Is overlap
necessary at a certain length? Results needed from Avi) (LINEs X%, LTRs
X%, SINEs Y%), suggesting transposable elements are the main cause of SV
in the mouse genome. The size distribution of SVs reflects this
pattern, with peaks at 6 Kb for LINEs, 250 bp for SINEs, and 5 and
9 Kb for LTRs (Figure XX). VNTRs constitute XX%.
X% of LINEs are polymorphic – if a high proportion could be important
Distribution of SVs across the genome
Despite a number of aCGH studies (CAHAN, AGAM etc.), the landscape and
distribution of SV in the mouse genome remain poorly characterized. We
found that XX% of the genome is constituted of SV.
Is distribution of SV enriched at telomeres?
Are there differences
between strains?
Relationship between segmental duplications and SVs – should find a
2 – 4 fold enrichment of SVs in regions of segmental duplication
Complex SVs
Given our observation of clustered SVs (at very close proximity), either from
the same SV type (for example two deletions very close to each other) or from
different SV types (for example an inversion followed directly by a deletion),
we attempted to find them automatically across the genome.
SV mechanism
Our next aim was to infer the mechanism by which SVs formed. Inferring SV
mechanism accurately requires two things: 1) single-nucleotide
resolution of breakpoint delineation and 2) the ancestral state of the SV. Our
analysis of SV mechanism of formation fulfilled both criteria. We began by
randomly selecting 249 SV regions, polymorphic across eight classical inbred
strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and
LP/J). Each selected SV type had a size distribution representative of the
genome (Supplementary Figure 1 shows the size distribution of deletions,
copy number gains and inversions amongst the 249 SV regions selected).
We determined the sequence at the breakpoints of 249 polymorphic
SV regions across eight classical inbred strains (see Methods). We
assembled and manually examined contigs encompassing sequence across
1926 (EXPLAIN BINNAZ) SV breakpoints for each test strain. We
successfully identified the exact start and end position of the SV, and
characterized the sequence features at the break points by ascertaining the
extent and content of the homology and any additional or missing sequence
(microinsertion and microdeletion) at the breakpoint.
We predicted the ancestral state of each SV using SPRET/EiJ as the
outgroup strain: we assume a deletion is ancestral if it is not found in
SPRET/EiJ but is present in one or more of the classical strains (Example A in
Table 2). Conversely if there is a deletion in SPRET/EiJ, but not in the
reference genome, then we assume the SV is an insertion (Example B in
Table 2). However, we found that in 26 cases out of 249 (~10% of the total),
the inferred ancestral state is inconsistent with breakpoint features. In example
C (Table 2), using SPRET/EiJ as an outgroup suggested an
ancestral deletion, whereas the target site duplication (TSD) at the breakpoint suggested
an ancestral insertion. Similarly, in example D (Table 2), using SPRET/EiJ
suggested an ancestral insertion, whereas the 3 bp microhomology at the
breakpoint (TTA) suggested an ancestral deletion. Using rat sequence to
determine the ancestral state validated both the ancestral insertion on
chromosome 14 and the ancestral deletion on chromosome 15. With the
exception of 2 cases (<1%), all of the other 24 inconsistencies using
SPRET/EiJ as an outgroup could be reconciled by combining breakpoint
sequence and rat sequence. Inferred ancestral state of all 249 SV regions is
shown in Supplementary Table 1.
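The reconciliation logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function names are hypothetical, the variant is assumed to appear as a deletion relative to the reference, and the 3 bp microhomology cut-off is our choice.

```python
from typing import Optional


def outgroup_call(absent_in_outgroup: bool) -> str:
    """Polarise a variant that appears as a deletion relative to the reference.

    If the sequence is also missing from the outgroup, the reference copy is
    taken to be an ancestral insertion; otherwise the event is a deletion.
    """
    return "insertion" if absent_in_outgroup else "deletion"


def breakpoint_call(has_tsd: bool, microhomology_bp: int) -> Optional[str]:
    """Vote from breakpoint features (the 3 bp cut-off is an assumption)."""
    if has_tsd:
        return "insertion"
    if microhomology_bp >= 3:
        return "deletion"
    return None


def ancestral_state(absent_in_spret: bool, has_tsd: bool, microhomology_bp: int,
                    absent_in_rat: Optional[bool] = None) -> str:
    """Combine the SPRET/EiJ call, breakpoint features and, if needed, rat."""
    primary = outgroup_call(absent_in_spret)
    features = breakpoint_call(has_tsd, microhomology_bp)
    if features is None or features == primary:
        return primary
    if absent_in_rat is None:
        return "unresolved"
    secondary = outgroup_call(absent_in_rat)
    return secondary if secondary == features else "unresolved"
```

With inputs shaped like example C (sequence present in SPRET/EiJ, TSD at the breakpoint, sequence absent in rat) the sketch returns an insertion; with inputs shaped like example D (absent in SPRET/EiJ, 3 bp microhomology, present in rat) it returns a deletion.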
Inferring SV mechanism of formation based on breakpoint features
SV mechanism of formation is typically inferred by examining the sequence
features of its breakpoints. For example, 200 bp of sequence identity is
thought to be required for NAHR (INOUE and LUPSKI 2002), whereas much
shorter homology (microhomology) has often been associated with end-joining processes, such as MMEJ. Delineation of retroelements is also
facilitated by the presence of flanking target site duplication (TSD), with
a poly(A) tail or poly(T) head for LINE and SINE elements, and with dual or
mono long terminal repeat (LTR) for ERV elements. Variable number of
tandem repeat (VNTR) polymorphism is also easily identifiable from its
repetitive structure.
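A decision rule of this kind might look like the sketch below. It is not the flowchart of Supplementary Figure 2: the function name, the order of the tests and the assignment of 26-199 bp of identity to SSA are assumptions made for illustration, with the size classes borrowed from Table 3.

```python
from typing import Optional


def infer_mechanism(homology_bp: int,
                    has_tsd: bool = False,
                    retroelement: Optional[str] = None,
                    tandem_repeat: bool = False) -> str:
    """Assign a candidate mechanism from breakpoint features (illustrative)."""
    # Retroelements flanked by a TSD are called first.
    if retroelement in {"LINE", "SINE", "ERV"} and has_tsd:
        return f"{retroelement} retrotransposition"
    # A repetitive, tandem structure indicates VNTR expansion or contraction.
    if tandem_repeat:
        return "VNTR"
    # Remaining events are split by the length of homology at the breakpoint.
    if homology_bp >= 200:
        return "NAHR"
    if homology_bp >= 26:
        return "SSA"  # assumption: intermediate identity is assigned to SSA
    if homology_bp >= 3:
        return "MMEJ"
    return "NMMEJ"
```

For example, `infer_mechanism(3)` yields MMEJ, whereas a blunt-ended breakpoint (`infer_mechanism(0)`) yields NMMEJ.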
We classified the 249 SV by inferred mechanism of formation (a
flowchart for our method is presented in Supplementary Figure 2). Table 3b
describes the inferred mechanism for the 249 SV regions. Retrotransposition
is the commonest mechanism (41.7%, comprising 24.5% LINE, 12% ERV and
5.2% SINE retrotranspositions), followed by microhomology-mediated end
joining processes (31.3%), non-microhomology-mediated end joining (13.3%),
replication-based mechanisms such as FoSTeS (<10%), VNTR expansion
(5.2%), SSA (0.4%) and NAHR (0.4%).
A substantial proportion of SVs caused by LINE, ERV, SINE and VNTR
insertions do not show any missing nucleotides at their breakpoints (95%,
93.3%, 92.3% and 92.3% respectively). However, we found rare cases (4
LINEs, 2 ERVs, 1 SINE and 1 VNTR) in which the insertion machinery
also deleted nucleotides. Missing sequence ranged from 1 bp to 289 bp. We
found that the presence of an ancestral microdeletion is directly linked to the
absence of the TSD for three LINEs. This would suggest a dual mechanism of
SV formation, a union between DSB repair processes and LINE
retrotransposition.
Half (69% for ancestral deletions, 37.5% for inversions and 50% for
both CNG and multiple events) of the SVs without LINE, ERV or SINE
elements have a microhomology of 3 bp to 25 bp, suggesting a
microhomology-mediated mutational process. We found several patterns of
microhomology: direct, palindromic, inverted and a complex combination of
these. Of the 3-25 bp microhomologies, 70% are direct, 13.3% inverted, 10%
complex and 6.6% palindromic. Longer sequence identity (>26 bp) is rarer
than shorter sequence identity (<3 bp). Half of inversion breakpoints are blunt
ended, followed by ancestral deletions (15.9%) and CNGs (12.5%). Of the
113 ancestral deletions, 36 (32%) had from 1 bp to 107 bp of inserted
sequence at the breakpoint, in addition to the deletion.
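Direct microhomology at a deletion breakpoint can be measured by comparing the first bases of the deleted sequence with the first bases of the distal flank, as in the hedged sketch below. The function name is ours, only this one side of the breakpoint is examined, and the example strings are the two fragments in Table 2 that begin with the TTA microhomology reported for example D.

```python
def direct_microhomology(sv_start: str, distal_flank: str) -> str:
    """Return the bases shared by the start of the SV and of the distal flank."""
    shared = 0
    for sv_base, flank_base in zip(sv_start.upper(), distal_flank.upper()):
        if sv_base != flank_base:
            break
        shared += 1
    return sv_start[:shared].upper()


# Fragments beginning with TTA, as they appear in Table 2.
example = direct_microhomology("TTAGTGTTTTGTCA", "TTAATGAATTATTA")
```

An empty result corresponds to a blunt-ended breakpoint on the side examined; inverted, palindromic and complex microhomologies would need additional comparisons that are not shown here.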
Common breakpoint features
We assessed all 1,926 breakpoints for common sequence features. LINE
retrotransposition is associated with a TTTCT motif in the TSD (P value < 1.3E-8).
Genome-wide breakpoint identification
We next estimated the accuracy of reconstructing breakpoint sequence for all
SVs directly from NG sequencing reads without PCR-based Sanger
sequencing data. For each deletion and insertion breakpoint we implemented
local assembly [PLEASE MORE FROM KIM]. Comparison with the 1,926
breakpoints delineated using PCR-based Sanger sequencing revealed that
reconstructed breakpoints were, on average, within +/- 50 bp (?) of the actual
breakpoint [CHECK WITH KIM]. This is insufficient to robustly identify
microhomology-mediated processes such as MMEJ or SSA for which precise
sequence is required but is sufficient to estimate NAHR (for which we search
for 200 bp sequence identity) and VNTR expansion. Genome-wide, 0.5% of
SVs have sequence features consistent with NAHR and 6% with VNTR.
Relationship between SNP and SV formation
Our analysis of breakpoint sequence features in multiple strains allowed us to
look for a relationship between sequence variants (SNPs or short indels) and
SV formation. In particular, we addressed the question as to whether
sequence variants at breakpoints were associated with SV formation. In our
set of ancestral deletions for which we have base pair resolution data, we
observed in all cases that the presence of SNPs in the microhomology region
was correlated with the presence of the SV (Figure 3a).
In all cases, presence of the SNP elongates the microhomology. Since
we do not find instances where a SNP in the microhomology region occurs
without the deletion, we assume that the formation of deletion and SNP are
related. This phenomenon is rare: we only saw five (4.5%) cases amongst our
113 ancestral deletions where SNP and SV formation co-segregate. We found
a similar relationship between a SNP formed in the TSD and the presence of
an ancestral insertion (Figure 3b). 15 ancestral insertions (16%) had SNPs or
short indels within their TSD, coincident with an insertion. Details are given in
Supplementary Table 1. These SNPs are ideal candidates to tag SVs for
genotyping purposes, but their close proximity to SV breakpoints may make
genotyping difficult (it should be noted that none of these SNPs were
identified by short-read sequencing).
Origin of SV breakpoints
We asked whether SVs that overlap in different strains could have arisen
more than once. We inferred independent origins when the position of the
breakpoint is different, so that for example one strain may have a 3 kb
deletion, while in another only 1 kb is missing. Within the eight classical
strains, size differences between SVs at the same locus were found at six SV
regions out of 241 (2.5%). We found no case with more than three alleles at
one SV locus. However, when we expanded our analysis to look at all 17 strains,
we found multiple alleles at 12% of SVs, due almost entirely to the presence
of different alleles in the wild-derived inbred strains. In two cases, we
observed four alleles at an SV locus, and in one case five alleles: on
chromosome 10 AKR/J, CAST/Ei, PWK/Ph and Spretus/Ei all have SVs with
different breakpoints (Supplementary Table 1).
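Counting alleles at a locus then amounts to grouping strains by their breakpoint coordinates, as in this small sketch. The function name and the coordinates in the usage example are hypothetical, and whether the reference state is counted as an additional allele is left open.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Breakpoints = Tuple[int, int]  # (start, end) of the SV in bp


def alleles_at_locus(calls: Dict[str, Breakpoints]) -> Dict[Breakpoints, List[str]]:
    """Group strains that share identical breakpoints at one SV locus."""
    groups: Dict[Breakpoints, List[str]] = defaultdict(list)
    for strain, breakpoints in calls.items():
        groups[breakpoints].append(strain)
    return dict(groups)


# Hypothetical coordinates: a 3 kb deletion in two strains, 1 kb in a third.
locus = alleles_at_locus({
    "AKR/J": (1000, 4000),
    "PWK/Ph": (1000, 4000),
    "CAST/Ei": (1000, 2000),
})
```

Here `locus` contains two SV alleles, one shared by AKR/J and PWK/Ph and one private to CAST/Ei.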
Inversions
Inversions are more complicated than deletions, insertions and CNG, with
little known about their mechanism of formation. They require at least two
double-strand chromosomal breakages, as opposed for example to deletions
that only require one DSB. Here we characterized the breakpoints of 8
inversions at nucleotide-level resolution. Five (62.5%) have deletions immediately
adjacent to the inversion. An example is provided in Figure x.
Impact of SVs on gene function
We addressed the question of the impact of SVs on gene function in two
ways. First, we examined the impact on gene expression, and second we
looked at the association between SVs and phenotypic variation.
GENE EXPRESSION
Genes affected by deletions -
how many genes are affected?
Evidence from mRNA – deficit of exons involved. – one fusion gene.
Get list of genes and confirm by qPCR.
Impact of inversions – no evidence of fusion genes?
Do SVs affect gene expression outside the SV regions (Reymond
experiment)?
PHENOTYPES
Second, we asked whether SVs are associated with any of the 98 [I
used 122??] phenotypes assessed in genetically heterogeneous stock (HS)
mice descended from eight inbred strains (A/J, AKR/J, BALB/cJ, C3H/HeJ,
C57BL/6N, CBA/J, DBA/2J and LP/J). Genetic mapping has identified 843
quantitative trait loci (QTL) (VALDAR et al. 2006a; VALDAR et al. 2006b). The
phenotypes were chosen to target three diseases: anxiety; type II diabetes
and asthma.
We cannot directly genotype all of the SVs in the HS mice. Instead, for
an SV genotyped in the progenitor strains, we infer the allele probabilities in
each HS animal by estimating the haplotype probabilities at the SV locus
(VALDAR et al. 2006a), and then merging these probabilities within each of the
possible genotypes. We then use the allele probabilities to conduct a single
marker analysis between each variant and each phenotype [REF]; this gives
us a logP value for each SV/phenotype test, indicating how consistent the
variant is with the phenotype.
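The merging step can be illustrated as follows. This is a sketch under assumptions, not the implementation of VALDAR et al.: the function name is ours, the haplotype reconstruction is assumed to supply a probability for each pair of founder haplotypes, and the SV is assumed to be biallelic with a known strain distribution pattern (SDP).

```python
from typing import Dict, List, Tuple


def sv_genotype_probabilities(diplotype_probs: Dict[Tuple[str, str], float],
                              sdp: Dict[str, int]) -> List[float]:
    """Merge founder-pair probabilities into P(0, 1 or 2 copies of the SV).

    diplotype_probs maps (founder1, founder2) to a probability;
    sdp maps each founder strain to 1 (carries the SV) or 0 (does not).
    """
    merged = [0.0, 0.0, 0.0]
    for (founder1, founder2), prob in diplotype_probs.items():
        merged[sdp[founder1] + sdp[founder2]] += prob
    return merged


# Hypothetical animal with ancestry from two founders at the SV locus.
probs = sv_genotype_probabilities(
    {("A/J", "A/J"): 0.5, ("A/J", "LP/J"): 0.3, ("LP/J", "LP/J"): 0.2},
    {"A/J": 1, "LP/J": 0},
)
```

In this toy case the animal carries zero, one or two copies of the SV with probabilities of approximately 0.2, 0.3 and 0.5; these merged probabilities are what would enter the single marker test.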
In this manner, we tested ~166,000 SVs for which we were certain that
the strain distribution pattern (SDP) was correct (including deletions, copy number gains, insertions and
inversions) (refer to Kim’s method for producing SDPs; in addition we only
included SVs where there was no missing data in any of the HS founder strain
columns), XXX SNPs and YYY InDels (Jonathan?), for association with each
phenotype. We extracted all SVs that lie under a QTL peak and have a logP
that exceeds that of the QTL peak; we reasoned that these are the SVs that
are most likely to either be, or tag, the causal variant under the QTL. In total,
there are 737 potentially causal SVs, lying under 395 QTL, for 80 of the
phenotypes (Figure 1).
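The selection of potentially causal SVs can be written as a simple filter, sketched below. The record types and field names are hypothetical, and "lying under a QTL peak" is interpreted here as overlapping the QTL interval on the same chromosome, which is an assumption.

```python
from typing import List, NamedTuple


class QTL(NamedTuple):
    chrom: str
    start: int
    end: int
    peak_logp: float


class SVTest(NamedTuple):
    sv_id: str
    chrom: str
    start: int
    end: int
    logp: float


def candidate_svs(qtl: QTL, tests: List[SVTest]) -> List[str]:
    """Return SVs that overlap the QTL interval and exceed the peak logP."""
    return [
        t.sv_id
        for t in tests
        if t.chrom == qtl.chrom
        and t.start <= qtl.end
        and t.end >= qtl.start
        and t.logp > qtl.peak_logp
    ]
```

An SV inside the interval whose logP is below that of the QTL peak is therefore not retained, however close it lies to the peak.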
We overlaid these causal SVs with protein coding genes. 291 causal
SVs overlap (by at least 1 bp) 180 genes (Supplemental Table
causal_SV_overlap_genes.xlsx). However, there was no evidence to suggest
that causal SVs are enriched for genes compared to non-causal SVs (Figure
2A). Similarly, we overlaid the SVs with the exons and promoter regions of
protein coding genes (promoter regions were defined as 2 Kb upstream and
downstream of the transcription start site). 11 causal SVs overlap 12 exons
(Supplemental Table causal_SV_overlap_exon.xlsx), and 47 causal SVs
overlap 48 promoter regions (Supplemental Table
causal_SV_overlap_gene_promoters.xlsx). There was no evidence to suggest
that causal SVs are enriched for either exons or promoter regions, compared
to non-causal SVs (Figures 2B and C, respectively).
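The two overlap definitions used above (at least 1 bp of overlap, and promoters spanning 2 Kb either side of the transcription start site) can be made explicit as in the sketch below. It assumes 1-based inclusive coordinates; the function names are ours.

```python
from typing import Tuple


def overlap_bp(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of shared bases between two 1-based inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int,
             min_bp: int = 1) -> bool:
    """True if the intervals share at least min_bp bases."""
    return overlap_bp(a_start, a_end, b_start, b_end) >= min_bp


def promoter_region(tss: int, flank: int = 2000) -> Tuple[int, int]:
    """Promoter defined as `flank` bp upstream and downstream of the TSS."""
    return max(1, tss - flank), tss + flank
```

For instance, an SV ending at position 200 overlaps a gene starting at position 200 by exactly 1 bp and is counted, whereas an SV ending at position 199 is not.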
Discussion
We find XX more than anyone else
Did we get this right?
Inferred mechanisms are consistent with other papers (Quinlan) and different
from human
We are the first to relate sequence variants to SVs
We find little effect on gene function – due to homozygosity and selection for
inbreeding?
Recent studies of mutation spectrum in human and mouse SV found a similar
figure (33%)(CONRAD et al. ; QUINLAN et al.), suggesting similar mutational
processes occur in both human and mouse SV formation. 12.5% of CNG
have additional sequences (1-10 bp) at the breakpoint, followed by 37.5% for
inversions and 50% for multiple events. We next looked at the correlation
between Class 3 and Class 4 SVs and found that microinsertion at the
breakpoint is enriched in blunt ended SVs (ratio 2.5).
References
AGAM, A., B. YALCIN, A. BHOMRA, M. CUBIN, C. WEBBER et al., 2010 Elusive copy number
variation in the mouse genome. PLoS One 5: e12839.
CAHAN, P., L. E. GODFREY, P. S. EIS, T. A. RICHMOND, R. R. SELZER et al., 2008 wuHMM: a
robust algorithm to detect DNA copy number variation using long
oligonucleotide microarray data. Nucleic Acids Res 36: e41.
CAHAN, P., Y. LI, M. IZUMI and T. A. GRAUBERT, 2009 The impact of copy number
variation on local gene expression in mouse hematopoietic stem and
progenitor cells. Nat Genet 41: 430-437.
CONRAD, D. F., C. BIRD, B. BLACKBURNE, S. LINDSAY, L. MAMANOVA et al., 2010 Mutation
spectrum revealed by breakpoint sequencing of human germline CNVs.
Nat Genet 42: 385-391.
HENRICHSEN, C. N., N. VINCKENBOSCH, S. ZOLLNER, E. CHAIGNAT, S. PRADERVAND et al.,
2009 Segmental copy number variation shapes tissue transcriptomes. Nat
Genet 41: 424-429.
INOUE, K., and J. R. LUPSKI, 2002 Molecular mechanisms for genomic disorders.
Annu Rev Genomics Hum Genet 3: 199-242.
MEDVEDEV, P., M. STANCIU and M. BRUDNO, 2009 Computational methods for
discovering structural variation with next-generation sequencing. Nat
Methods 6: S13-20.
QUINLAN, A. R., R. A. CLARK, S. SOKOLOVA, M. L. LEIBOWITZ, Y. ZHANG et al., 2010 Genome-wide mapping and assembly of structural variant breakpoints in the
mouse genome. Genome Res 20: 623-635.
VALDAR, W., L. C. SOLBERG, D. GAUGUIER, S. BURNETT, P. KLENERMAN et al., 2006a
Genome-wide genetic association of complex traits in heterogeneous
stock mice. Nat Genet 38: 879-887.
VALDAR, W., L. C. SOLBERG, D. GAUGUIER, W. O. COOKSON, J. N. RAWLINS et al., 2006b
Genetic and environmental effects on complex traits in mice. Genetics
174: 959-984.
Tables
Strain           Deletion   Insertion   Inversion   Gain   Other
129P2/OlaHsd     16,402     86,805      46          57     29
129S1/SvImJ      17,385     42,156      46          70     30
129S5/SvEvBrd    16,154     39,240      53          72     26
A/J              15,885     73,909      46          88     27
AKR/J            16,258     42,327      49          69     21
BALB/cJ          14,898     45,038      45          82     21
C3H/HeJ          16,148     68,161      52          94     33
C57BL/6N         167        2,697       3           44     0
CAST/EiJ         51,304     107,912     128         361    108
CBA/J            17,066     54,044      46          79     33
DBA/2J           17,531     36,753      54          67     31
LP/J             17,030     47,770      47          64     30
NOD/ShiLtJ       17,078     22,651      55          51     27
NZO              15,479     30,535      47          62     29
PWK/PhJ          54,312     103,968     158         96     108
Spretus/EiJ      91,729     172,997     282         112    230
WSB/EiJ          22,231     57,042      53          88     51
Table 1. Structural variants in 17 inbred strains. Listed are the total numbers of structural variants with a minimum size of 100 bp in
the 17 inbred strains. Here we differentiate between insertions, where we can determine the insertion points from read pair patterns
and local assembly, and copy number gains, where a duplication is inferred from an increase in read depth. Copy number gains
include tandem duplications, which are inferred from both read depth and read pair evidence. There is minimal overlap between the
insertions and the copy number gains, since the insertion discovery algorithm considers only read pairs in which one mate is
unmapped (i.e. de novo insertions). Included in 'Other' are those SVs which appear to be comprised of more than one SV. These
include: deletions with insertions, and inversions with deletions.
LP/J
1
1
0
1
0
0
0
0
1
1
0
1
0
0
0
1
0
1
1
1
Table 2. Inferring ancestral state using sequences flanking SV breakpoints.
Examples A, B, C and D are taken from our list of 249 SVs resolved to base
pair resolution (Supplementary Table 1). The first three columns give
chromosome, start position and end position of the SV in bp. Columns 4, 5 and
6 give a short stretch of sequence flanking the SV breakpoints, as well as the
first 10 bp of the SV. Note that the full sequence of each SV is given in
Supplementary Table 1. Columns entitled A/J, AKR/J, BALB/cJ, C3H/HeJ,
C57BL/6N, CBA/J, DBA/2J and LP/J give the strain distribution pattern
(SDP) of the SV, with “0” indicating the absence and “1” the presence of the
SV. The penultimate column is the PEM (Paired-End Mapping) signature relative to
the reference genome. The last column gives the inferred ancestral state
relative to either SPRET/EiJ or Rattus norvegicus (the latter indicated by an asterisk).
Microhomology at breakpoints is highlighted in red and target site duplication
(TSD) in green.
Example  Chromosome  Bp start position  Bp end position  B1 flanking sequence  SV sequence       B2 flanking sequence  PEM signature  Ancestral state
A        2           46111101           46114057         ..AAGGA               GGATGTATGTATGT..  GGAAACGTAACTC..       DEL            DEL
B        1           7900914            7901149          ..ACCCC               CTCCCCTGTTGCGG..  CTCCCCTACTGTA..       DEL            INS
C        14          70138175           70138776         ..CATTG               CTCTCTGCTTCTTT..  TTCTCTGCTTCTTG..      DEL            INS*
D        15          91970970           91974384         ..TTTGG               TTAGTGTTTTGTCA..  TTAATGAATTATTA..      DEL            DEL*
Strain columns (values as extracted; alignment of values to strains and examples is uncertain): DBA/1J 1 1 1 0; CBA/J 0 1 0 0; C57BL/6J 1 1 1 1; C3H/HeJ; BALB/cJ; A/J; AKR/J
a Sequence features at breakpoints
                          Ancestral events
                          Insertion
                          LINE      ERV       SINE      VNTR      Deletion  Inversion CNG       Multiple
Class 1. Target site duplication
  none                    6.7%      6.7%      0.0%
  4-10 bp                 13.3%     93.3%     15.4%
  11-20 bp                78.3%     0.0%      84.6%
  >20 bp                  1.7%      0.0%      0.0%
Class 2. Microdeletion
  none                    95.0%     93.3%     92.3%     92.3%
  1-34 bp                 5.0%      6.7%      7.7%      7.7%
  >200 bp                 1.7%      0.0%      0.0%      0.0%
Class 3. Microhomology
  none                                                  0.0%      15.9%     50.0%     12.5%     0.0%
  1-2 bp                                                0.0%      15.0%     12.5%     37.5%     25.0%
  3-25 bp                                               23.1%     69.0%     37.5%     50.0%     50.0%
  26-200 bp                                             76.9%     0.9%      0.0%      0.0%      0.0%
  >200 bp                                               0.0%      0.9%      0.0%      0.0%      0.0%
Class 4. Microinsertion
  none                                                            69.9%     62.5%     87.5%     25.0%
  1-10 bp                                                         23.0%     37.5%     12.5%     50.0%
  11-50 bp                                                        8.0%      0.0%      0.0%      0.0%
  >51 bp                                                          0.9%      0.0%      0.0%      0.0%
Total (249 SV regions)    60        30        13        13        113       8         8         4

b Inferred mechanisms
Total Retrotransposition (41.7%)
  LINE Retrotransposition   24.5%
  ERV Retrotransposition    12.0%
  SINE Retrotransposition   5.2%
MMEJ                        31.3%
NMMEJ                       13.3%
SSA                         0.4%
NAHR                        0.4%
VNTR                        5.2%
SRS, FoSTeS/others, multiple: 0.8%, 3.2%, 3.2%, 1.2% (assignment of values to rows unclear)
Table 3. Sequence features at SV breakpoints and inferred mechanism. In a,
the percentage of each sequence feature at the precise breakpoint is given per
category of ancestral SV (insertion, deletion, inversion, CNG and multiple
events). In b, the percentage of each inferred mechanism is given relative to
all SV regions presented in a. Empty cells indicate that a feature is not applicable, and all
abbreviations are listed in the Supplementary Glossary.
Figure Legends
Figure 1. Venn diagrams showing the overlap between SVs detected in our
study (in DBA/2J and CAST/Ei) and those published elsewhere. A: Venn
diagram showing the overlap between DBA/2J deletions (relative to
C57BL/6J) found in our study (blue circle) and those found in another
sequencing based analysis (Quinlan et al., green circle) and a high density
aCGH experiment (2.1 million probes, Agam et al., red circle). B: Similarly for
copy number gains.
[Need to add sentence explaining that the figures show overlap for merged
SV-regions rather than pure SV calls.]
Figure 2. Breakpoint analysis of a complex SV. a) Complex SV, involving
several genomic rearrangements including an inversion, deletion, short
insertion and copy number gain (CNG), is displayed relative to its genic
location along Zbtb10, a Zinc finger and BTB domain containing 10 gene.
PCR amplification using forward (F) and reverse (R) primers revealed an AT
insertion at the first breakpoint J1, followed by an inversion of 125 bp which
encompasses an inverted copy number gain of the 22 bp preceding J1, as
seen in J2. Finally breakpoint 3 (J3) revealed a deletion of 813 bp. Using
RepeatMasker, a SINE element was found to be part of the deletion. b) PCR
picture of the amplification using F and R primers (primer sequences available
in Supplementary Table xxx). Hyperladder II was used as the size marker.
C57BL/6N and LP/J show a normal size of 1604 bp, whereas A/J, AKR/J,
BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp. c)
Sequencing data across J1, J2 and J3 breakpoints. A colour code is used to
indicate each type of SV: blue is used for the 22 bp inverted copy number
gain, green for the inversion and red for the deletion. When the test strain
matches the reference strain, both are in the same color.
Figure 3. Relationship between SNP and SV formation. a) Relationship
between SNP and ancestral deletion formation. Two SNPs lying in the 6 bp
microhomology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522) correlated with the presence of the SV. On the left, PCR
amplification of the SV is shown across the eight classical strains (A/J, AKR/J,
BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). Hyperladder II was
used as the DNA molecular weight marker. Some strains show a smaller
amplicon compared to other strains. On the right, sequencing traces are
shown for a test strain (A/J) and the reference strain (C57BL/6N). Note that all
other test strains traces are identical to the one shown here. Asterisk is used
to emphasize the microhomology of 6 bp (GAACTA). The presence of two
SNPs (C->G and T->A) in all test strains (here only shown in A/J) is
associated with the presence of the ancestral deletion. b) Relationship
between SNP and ancestral insertion formation. PCR data is shown on the
right with amplification in A/J, AKR/J and BALB/cJ. Strains with the
ancestral insertion (C57BL/6N, CBA/J, DBA/2J and LP/J) failed to
amplify due to size. The insertion is a LINE on chromosome 13 (119,134,049-119,135,126). On the left, the sequencing trace is shown over the TSD for a strain
that doesn’t have the ancestral insertion. The TSD is 17 bp
(AAGAATGTCAGCAAAGT) and at the 12th position, a SNP (G->C) is
observed in all the strains that have the insertion.
Figure 1. Venn diagrams showing the overlap between SVs detected in our
study (in DBA/2J) and those published elsewhere.
Figure 2. Breakpoint analysis of a complex SV.
a) Zbtb10 (Zinc finger and BTB domain containing 10 gene), 34.73 Kb, forward strand. Labels between primers F and R: 22 bp, AT (B1); Inversion (125 bp), 22 bp CNG (B2); Deletion (813 bp) (B3); B2_Mm2 SINE repeat.
b) Gel lane labels: Ladder, AJ, AKR, BALB, C3H, C57, CBA, DBA, LP, Ladder.
c) Sequence across breakpoints:
B1
ProxRef  TATCAGCCTTTGTCTTCAGGCTCAGC - - TCTATCAGTTTATT
DistRef  CTGTTCATATCCCAGGTGTTGGGATTACAGGCATGTGTCAC
Test     TATCAGCCTTTGTCTTCAGGCTCAGCATGTGACACATGCCT
B2
ProxRef  TCCCAGGTGTTGGGATTACAGGCATGTGTCACTAGGA
DistRef  TCTATCAGTTTATTGTGGTCCTTCTGTATATAGCTCAGAATG
Test     AATAAACTGATAGAGCTGAGCCTGAAGACAAAGGCT - - - - - -
B3
ProxRef  CATAGTGAGACTTCCCATCCAGAAAGGGAAGGTAAAACCCA
DistRef  TAGGAAATGGAAGTGCTGCTTGTTTATAAATCTGATGGACG
Test     - - - - - - - - - - - - - - - - - - - - - - - -GAAAGGGAAGGTAAAACCCA
Figure 3. Relationship between SNP and SV formation
a) Gel lane labels: Ladder, AJ, AKR, BALB, C3H, C57, CBA, DBA, LP, Ladder. Sequencing traces for AJ (B1-2) and C57 (B1, B2), with asterisks above the traces.
b) Gel lane labels: Ladder, AJ, AKR, BALB, C3H, C57, CBA, DBA, LP, Ladder. Sequencing trace across B1-2, with asterisks above the trace.