Sequence based characterization of structural variation in the mouse genome
Binnaz Yalcin1†, Kim Wong2†, Avigail Agam1†, Martin Goodson1†, Thomas M. Keane2,
Xiangchao Gan1, Christoffer Nellåker3, Leo Goodstadt1, Jérôme Nicod1, Amarjit
Bhomra1, Polinka Hernandez-Pliego1, Helen Whitley1, James Cleak1, Rebekah
Dutton1, Deborah Janowitz1, Richard Mott1, David J. Adams2,*, Jonathan Flint2,*
1 The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK
2 The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
3 MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics,
University of Oxford, South Parks Road, Oxford OX1 3QX, UK
† Co-first authors
*Correspondence to:

Dr. David Adams
Wellcome Trust Sanger Institute
Hinxton, Cambs, CB10 1SA, UK
Ph: +44 (0) 1223 86862
Fax: +44 (0) 1223 494919
Email: [email protected]

Dr. Jonathan Flint
Wellcome Trust Centre for Human Genetics
Oxford, OX3 7BN, UK
Ph: +44 (0) 1865 287512
Fax: +44 (0) 1865 287501
Email: [email protected]
Abstract
The extent to which structural variants (SVs) cause quantitative variation and
contribute to disease is unknown. Without knowing how many SVs there are,
and how they arise, it is difficult to discover what they do. Combining
experimental with automated analyses of the mouse genome sequence, we
identified 0.71M SVs at 0.28M sites in the genomes of thirteen classical and
four wild-derived inbred mouse strains. The majority of SVs are less than 1
kilobase in size and 98% are deletions or insertions. The breakpoints of
0.16M SVs were mapped to base pair resolution allowing us to infer that
insertion of retrotransposons causes more than half of SVs. Yet, despite their
prevalence, SVs are less likely than other sequence variants to cause gene
expression or quantitative phenotypic variation. We identified 24 SVs that
disrupt coding exons, acting as rare variants of large effect on gene function.
One third of the genes so affected have immunological functions. Our
catalogue provides a starting point for the analysis of the most dynamic and
complex regions of genomes from a genetically tractable model organism.
Introduction
Structural variation is believed to be widespread in mammalian genomes1-5
and is an important cause of disease6-8, but just how abundant and important
structural variants (SVs) are in shaping phenotypic variation remains unclear.
Understanding what SVs do depends on understanding what they are, where
they occur and how they arise: large SVs that keep recurring and coincide
with genes are far more likely to contribute to phenotypic variation than small
non-recurrent SVs within intergenic regions.
The preeminent organism for modeling the relationship between
phenotype and genotype, including SVs, is the mouse, but our catalogue of
SVs in this animal is incomplete. Estimates of SV numbers and of the proportion
of the mouse genome they occupy vary considerably, from a few
hundred to over 7,000 SVs (refs 9-13), affecting from 3.2% to more than 10% of the
genome14-16. Incompleteness and inconsistencies are largely due to reliance
on differential hybridization of genomic DNA to oligonucleotide arrays14, a
technology blind to some SV categories (such as inversions and insertions)
and with only limited ability to detect others (segmental duplications and
transposable elements). Sequence based methods of SV detection, with
higher resolution and greater sensitivity, have so far had limited
application12,17.
Along with SV catalogues, we need to know how SVs arise, as this will
tell us what SVs may or may not do. The major molecular mechanism
producing SVs in the mouse genome is believed to be retrotransposition12,17,
which may account for more than 80% of SVs between 100 nucleotides and 10
kilobases in length17. In cell culture, about 10% of LINE-1 insertions delete
DNA18,19, a process that also occurs in mouse genomic DNA20. It is not known
to what extent retrotransposons, or other mechanisms of SV formation,
contribute to mouse phenotypic variation and disease.
What we know about the impact of SVs on phenotypes in the
mouse comes primarily from analyses of gene expression15,16,21. Up to 28% of
the between-strain variation in gene expression in hematopoietic stem and
progenitor cells has been attributed to SVs15; for genes lying within SVs, the
latter account for between 66% and 74% of between-strain expression variation
in kidney, liver, lung and testis16. If the genome is replete with SVs, and given
that their influence on gene expression could extend up to 500 Kb from their
margins16, then SVs might be responsible for a considerable fraction of
heritable gene expression variance. Since gene expression variation is
believed to contribute to variation in phenotypes in the whole organism21, SVs
may turn out to have a major role in the genetic determination of many
aspects of mouse biology.
We used next generation sequencing to address three critical
questions: what are the extent and complexity of SVs in the mouse genome,
what are the likely mechanisms of SV formation, and to what extent do SVs
contribute to phenotypic variation? Our molecular characterization of SVs in
the mouse genome allows us to determine the extent to which SVs contribute
to genetic and phenotypic diversity.
Results
SV identification
Using short-read paired-end mapping, we found SVs at 0.28M sites in the
mouse genome, amounting to 0.71M SVs in 17 inbred strains of mice: A/J,
AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J, LP/J, NOD/ShiLtJ,
NZO/HlLtJ, 129S5SvEvBrd, 129P2/OlaHsd, 129S1/SvImJ, WSB/EiJ,
PWK/PhJ, CAST/EiJ and SPRET/EiJ. Our catalogue contains far more SVs
than previously recognized (Fig. 1a) and consists of a greater variety of
molecular structures (Fig. 1b&1c). To explain why we found more, we start by
describing how we went about finding SVs.
We combined visual inspection of short-read sequencing data with
molecular validation to improve automated SV detection across the genome.
We used two criteria to identify SVs manually: read depth and anomalous
paired-end mapping (PEM). We did this using data from the mouse’s smallest
chromosome (19) in its entirety, and a random set of other chromosomal
regions, for eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J,
CBA/J, DBA/2J and LP/J), the founder strains of the heterogeneous stock (HS)
population22.
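
The two criteria used for manual inspection can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the three-standard-deviation insert-size cutoff and the depth-ratio thresholds are all illustrative assumptions.

```python
def classify_read_pair(insert_size, same_strand, lib_mean, lib_sd, n_sd=3.0):
    """Label one mapped read pair by its paired-end mapping (PEM) signature.

    n_sd is an assumed cutoff: how many standard deviations the mapped insert
    size may differ from the library mean before the pair counts as anomalous.
    """
    if same_strand:
        # Both mates on the same strand: the orientation seen at inversions.
        return "inversion-like"
    if insert_size > lib_mean + n_sd * lib_sd:
        # Mates map too far apart: sequence is missing from the sample.
        return "deletion-like"
    if insert_size < lib_mean - n_sd * lib_sd:
        # Mates map too close together: the sample carries extra sequence.
        return "insertion-like"
    return "concordant"


def classify_depth(window_depth, genome_depth, low=0.5, high=1.5):
    """Label a window by read depth relative to the genome-wide average.

    The low/high ratio thresholds are assumptions chosen for illustration.
    """
    ratio = window_depth / genome_depth
    if ratio < low:
        return "depth-loss"
    if ratio > high:
        return "depth-gain"
    return "normal"
```

In practice a site would be called only when several anomalous pairs and the local read depth agree, which is the kind of joint evidence the patterns described below summarize.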
Based on read depth and PEM we expected to find eleven patterns
that classify SVs. We refer to these as type H (“High-confidence”) patterns
(H1-H11: Supplementary Fig. 1). For example, some deletions and
inversions leave precise, easily identifiable signatures (Fig. 1d). In addition,
we found ten patterns whose interpretation was ambiguous. We refer to these
as type Q (“Questionable”) patterns (Q1-Q10: Supplementary Fig. 1, Fig.
1e). We investigated the molecular structure of all 21 patterns using a PCR
strategy (Supplementary Fig. 2, Supplementary Methods). We designed
742 pairs of primers and successfully amplified 662 SV sites across the eight
strains (Supplementary Table 1).
Our categorization of predicted SV structures, based on manual
inspection of PEM patterns, resulted in the confident identification of an SV for
nineteen of the 21 patterns in all instances that we examined by PCR
(Supplementary Table 2). Two patterns were always false (Q6 and Q10),
and arose because of the presence of a retrotransposed pseudogene giving
mapping errors.
Recognizing these patterns, we were able to predict underlying SV
structure with high confidence. PCR confirmed that 12 patterns were
indicative of a single SV and six patterns indicative of multiple adjacent SVs
(Supplementary Table 2). However, SVs of type Q7 (55 cases) were due to
a variable number tandem repeat, for which we could not predict the number
of repeats or molecular structure.
Available automated methods to identify SVs are unable to differentiate
all 19 PEM patterns, and may also classify some SVs incorrectly; for example,
the PEM patterns of linked insertions (Q5 and Q9: Supplementary Fig. 1) are
similar to those for inversions or deletions. Therefore we adapted automated
methods to recognize the 15 types that manual inspection and PCR validation
showed could be identified unambiguously (Q1, Q2, Q3 and Q7 could not)
(Supplementary Methods and ref. 23).
Sensitivity and specificity analyses
We established false positive and false negative rates for the automated
analysis in three ways. First, we used our manually identified set of SVs on
chromosome 19 (Supplementary Table 3) where we found 1,017 deletions
(756 type H and 261 type Q), 15 inversions (2 type H and 13 type Q) and
three copy number gains (all type H). False negative rates per strain range
from 15% to 20% (Supplementary Table 4a); false positive rates range from
3.7% to 4.8% (Supplementary Table 4b). Second, to ensure that our
sensitivity and specificity analyses were not vitiated because we used
chromosome 19 as a training set for the automated analysis, we derived a
second, smaller, set of manually curated deletions from a randomly chosen 10
Mb region (101 Mb to 111 Mb) from chromosome 3 in strain C3H/HeJ.
Automated analysis of this region correctly identified 43/49 (87.7%) and called
2 false positive deletions (4.1%). Third, we investigated the false negative rate
for the automated detection of deletions across the genome using a PCR
validation dataset of 267 simple deletions (Supplementary Table 1).
Consistent with the chromosome 19 and chromosome 3 analyses, we found
that the false negative rate for deletions was between 6% and 11%
(Supplementary Table 5a).
Few non-deletion SVs were manually detected on chromosome 19, so
we turned to PCR-based validation of insertions, inversions and tandem
duplications (n = 106 to n = 136 SVs per strain). We found that the average
false negative rate was higher than for deletions, ranging from 24% to 31%
per strain (Supplementary Table 5b). Automated analysis was less
successful in detecting the more complex rearrangements, with 35% to 54%
false negative rates (n = 33 to n = 41 per strain). However, excluding SVs with
Q2 and Q3 PEM patterns, the rates range from 5% to 35% (n = 18 to n = 24
SVs per strain).
SV categories
The results of the detection and classification of 711,923 SVs across the
entire genome of 17 strains are shown in Table 1. There are, on average,
26,000 SVs in classical inbred strains, and 92,000 in wild derived inbred
strains, affecting 1.2% (33.0 Mb) and 3.7% (98.6 Mb) of the genome
respectively (Supplementary Table 6). Importantly, differences between
frequencies of SVs in wild-derived and classical strains are due to differences
in the amounts of accessible genome, and also to differences in our ability to
detect SVs in the wild-derived strains. Using our chromosome 19
experimental analysis of SVs, we have estimated the false negative rate across
the four wild-derived strains to be 24% (28% in SPRET/EiJ) as opposed to
17% in the eight classical inbred strains (Supplementary Table 4c).
It proved difficult to obtain robust estimates of SVs smaller than 100 bp.
We generated SV calls using both Pindel24 and Dindel25. Across the whole
genome, we found 33,779 and 149,854 deletions using Pindel and Dindel,
respectively. To explain such large differences between the two methods, we
visually inspected all calls from each method along the whole of chromosome
19. We found a false positive rate of about 1% for the Dindel calls and 14% for
the Pindel calls. We then inspected a 7.2 Mb region and found that a large
number of manual calls were missed by both Pindel and Dindel (67/75), but
also that 42 deletions found by Pindel or Dindel were missed by visual
inspection. We confirmed by PCR that the 75 deletions found by manual
inspection are in all cases real and smaller than 100 bp (Supplementary
Table 1). Size distribution of all deletions occurring in this 7.2 Mb region is
plotted in Supplementary Figure 3.
Our best estimate of the rate of SVs between 10 and 100 bp is based
on combining manual and automated methods, which, for the 7.2 Mb region,
yields 117 deletions. Assuming that this region is typical, the rest of the
genome (in classical laboratory strains) will contain approximately 49,000
deletions in this size range.
Table 1 classifies SVs greater than 100 bp into two groups: 99.4% are
simple and 0.6% are complex. Simple SVs include those whose structural
interpretation is straightforward: insertions, deletions and inversions. We
separately identify one type of insertion, a copy number gain, consisting of
non-repetitive DNA that is present in multiple copies in other strains. When
this sequence occurs immediately adjacent to its original, it is annotated as a
tandem duplication.
It is less clear to what extent the more complex categories we found
represent different structures. Complex SVs consist of a mixture of events that
abut each other. Sometimes the mixture arises because two or more simple
SVs occur next to each other: given the density of deletions in the genome,
the 2,021 deletions that we found separated by less than 250 bp could have
occurred by chance (H3: Supplementary Fig. 1). However, we recognize as a
separate structure SVs that are immediately adjacent to each other, with no
intervening DNA, since these might be the progeny of a single process
(marked as del+ins and del+ins/inv in Table 1). Intriguingly, half of the
inversions co-occur with an insertion or deletion, or in rare cases with both an
insertion and deletion (Fig. 2a). We also separately identify an SV within a
copy number gain (termed “nested” in Table 1) since the probability of
coincidence is less than one event per genome.
SV formation
Microhomology at SV breakpoints, as well as the content of sequence within
SVs and the SV’s ancestral state, was used to infer the likely mechanism of
formation for simple SVs. To obtain breakpoint sequence, we performed de
novo local assembly at 80.3% of deletions23. Comparison of 1,314 predicted
deletion breakpoints to the breakpoint delineated by PCR and sequencing
(Supplementary Table 1; Supplementary Methods), revealed that 57.7% of
breakpoint predictions are exact and 86.5% are within 20 bp (Supplementary
Table 7a). In cases where the local assembly strategy failed, we relied on the
original breakpoint estimates obtained from the mapping of reads to the
reference genome: 83.3% of these estimates are within 100 bp of the actual
breakpoint (Supplementary Table 7b). Breakpoint accuracy for insertions,
inversions and copy number gains is presented in Supplementary Tables 7c,
7d and 7e, respectively.
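
The accuracy figures above are proportions of predicted breakpoints that fall exactly on, or within a set distance of, the validated breakpoint. A minimal Python sketch of that comparison follows; the function name and the pairing of predictions to validations by list position are assumptions made for illustration.

```python
def breakpoint_accuracy(predicted, validated, tolerance=20):
    """Return (fraction exact, fraction within `tolerance` bp).

    predicted and validated are equal-length lists of breakpoint coordinates;
    element i of one list is compared with element i of the other.
    """
    if len(predicted) != len(validated) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    offsets = [abs(p - v) for p, v in zip(predicted, validated)]
    exact = sum(1 for d in offsets if d == 0) / len(offsets)
    within = sum(1 for d in offsets if d <= tolerance) / len(offsets)
    return exact, within
```

Calling the function with `tolerance=100` gives the corresponding measure for breakpoints estimated from read mapping alone.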
To obtain genome-wide estimates of the contribution of each
mechanism to SV formation, we used sequence data from relative deletions
(that is relative to C57BL/6J). We have highly accurate breakpoint sequence
for this sample, which should be unbiased with respect to ancestry. Using rat
as an outgroup, we classified 19% of relative deletion SVs as ancestral
deletions, 57% as ancestral insertions and the remainder (24%) were
indeterminate (Supplementary Methods).
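
The logic of polarizing a relative deletion with an outgroup can be sketched in Python as follows. This is a simplified illustration rather than the procedure in the Supplementary Methods; the function names and the three-valued input are assumptions.

```python
from collections import Counter


def classify_relative_deletion(present_in_outgroup):
    """Polarize a deletion called relative to the reference strain.

    present_in_outgroup: True if the sequence missing from the strain (and
    present in the reference) is found at the orthologous outgroup locus,
    False if it is absent there, None if the locus cannot be scored.
    """
    if present_in_outgroup is None:
        return "indeterminate"
    if present_in_outgroup:
        # Outgroup and reference both carry the sequence: the strain lost it.
        return "ancestral deletion"
    # Outgroup lacks the sequence: it was gained on the reference lineage.
    return "ancestral insertion"


def tally_ancestral_states(outgroup_states):
    """Count each class over an iterable of True/False/None observations."""
    return Counter(classify_relative_deletion(s) for s in outgroup_states)
```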
Classification of SVs and their size characteristics are summarized in
Figure 3. SVs are most often due to retrotransposons (LINEs (25%), LTRs
(14%) and SINEs (15%)), followed by variable number tandem repeats
(VNTRs) (15%) and pseudogenes (2%). Other mechanisms, not involving
retrotransposons, account for 29% of SVs. The median length of all SVs is
349 bp, with modes at 100 bp and 6,400 bp, LINE insertions comprising the
majority of the larger insertions (Fig. 3a). Outgroup analysis showed that the
transposon-derived SVs arose almost exclusively from ancestral insertion
events (98.8%). Duplicated sequences of 12-16 bp (known as target site
duplications, TSDs) surround the breakpoints of LINE and SINE derived SVs, and
shorter (6-8 bp) duplicated sequences are associated with LTR SVs (Fig. 3b).
Non-repeat mediated SVs are mainly a result of ancestral deletion events (79%),
and are associated with short microhomologies, up to 7 bp in length,
consistent with either microhomology-mediated break-induced replication
(MMBIR)26,27 or microhomology-mediated end joining (MMEJ)28.
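
Microhomology at a deletion breakpoint can be measured directly from the reference sequence once the breakpoint is known to the base pair. The Python sketch below shows one common way to do this, by counting how far the deletion can be shifted without changing the resulting sequence; the function name and the 0-based, half-open coordinates are assumptions, and this is not necessarily the procedure used here.

```python
def microhomology_length(ref, start, end):
    """Length of microhomology around a deletion of ref[start:end].

    The deletion can be shifted one base to the right when ref[start] equals
    ref[end], and one base to the left when ref[start - 1] equals
    ref[end - 1]. The total number of possible shifts equals the length of
    the identical sequence shared by the two breakpoints.
    """
    if not 0 <= start < end <= len(ref):
        raise ValueError("invalid deletion coordinates")
    max_shift = end - start  # cap each direction at the deletion length
    right = 0
    while (right < max_shift and end + right < len(ref)
           and ref[start + right] == ref[end + right]):
        right += 1
    left = 0
    while (left < max_shift and start - 1 - left >= 0
           and ref[start - 1 - left] == ref[end - 1 - left]):
        left += 1
    return left + right
```

For example, deleting `ACTCCCC` from `GGGACTCCCCACTTTT` (start 3, end 10) leaves the same product wherever the cut falls within the repeated `ACT`, so the function reports a microhomology of 3 bp.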
The great majority of SVs caused by LINE, ERV and SINE
insertions do not show any missing nucleotides at their breakpoints (93.3%,
93.3% and 92.3% respectively; Table 2). However, we found rare cases (4
LINEs, 2 ERVs and 1 SINE) in which the insertion machinery also deletes
nucleotides. Missing sequence ranged from 1 bp to 289 bp. The presence of
an ancestral microdeletion is directly linked to the absence of the TSD for
three LINEs. Of the 113 ancestral deletions, 36 (32%) had from 1 bp to 107
bp of inserted sequence at the breakpoint, in addition to the deletion.
Unexpectedly, in all cases the presence of SNPs in the microhomology
region was correlated with the presence of the SV (Fig. 2b). The SNP
elongates the microhomology, or, alternatively, the microhomology reflects a
hypermutable state associated with break induced replication around the
SV29. However this phenomenon is rare: we only observed five (4.5%) cases
amongst our 113 manually-curated ancestral deletions (Supplementary
Table 8) where a SNP and SV formation co-segregate. We found a similar
relationship between a SNP formed at a target site duplication and the
presence of an ancestral insertion. Fifteen ancestral insertions (16%) had
SNPs or short indels within their target site duplication, coincident with the
insertion (Supplementary Table 8).
Given their potential role in human disease30, we were interested to
document the occurrence of recurrent SVs, those that arise at the same
genomic locus independently in unrelated individuals. Non-allelic homologous
recombination (NAHR) is the major mechanism for recurrent SVs31, while fork
stalling and template switching (FoSTeS) and/or microhomology-mediated
break-induced replication (MMBIR) mechanisms may be important for non-recurrent SVs26,27.
We looked for SVs occurring at the same locus in different strains, but
with different breakpoints, indicating independent origins. Using the SV
breakpoints obtained from PCR sequencing (249 SV sites in eight strains that
account for over 4,000 breakpoints; Supplementary Table 8), we found that
in the classical strains, only 2.5% of deletions at the same locus had different
breakpoint sequences. However within all 17 strains, we found multiple alleles
at 12% of SVs, due almost entirely to the presence of different alleles
originating from the wild-derived inbred strains (Supplementary Table 8).
Consistent with the low frequency of recurrent SVs, breakpoint features
associated with NAHR are rare. We estimated that 0.13% of deletions are due
to NAHR, when we required a signature of ≥200 bp of ≥90% sequence
identity. Two analyses, therefore, indicate that recurrent SVs are rare.
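
Both tests for recurrence reduce to simple checks on breakpoint data, sketched here in Python. The tuple layout, the `locus_id` field and the function names are hypothetical; only the 200 bp and 90% thresholds come from the text.

```python
from collections import defaultdict


def loci_with_multiple_alleles(calls):
    """Return loci at which strains carry deletions with different breakpoints.

    calls is an iterable of (locus_id, strain, start, end) tuples. Different
    breakpoints at one locus are taken here as a sign of independent origin.
    """
    alleles = defaultdict(set)
    for locus_id, _strain, start, end in calls:
        alleles[locus_id].add((start, end))
    return {locus for locus, seen in alleles.items() if len(seen) > 1}


def has_nahr_signature(homology_length, identity,
                       min_length=200, min_identity=0.90):
    """True if flanking homology is long and similar enough to suggest NAHR."""
    return homology_length >= min_length and identity >= min_identity
```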
Impact of SVs on gene function
We assessed the impact of SVs on phenotypes in three ways: (i) the
relationship between the position of SVs and the position of genes; (ii)
changes in expression of genes overlapping, or nearby, an SV; (iii)
association between SVs and phenotypes in an outbred population of mice.
Across all strains, SVs overlap 10,291 genes. We found that SVs are,
in all strains except C57BL/6N, significantly depleted (P<0.01) in genes (fold
change 0.91). However, we found that SINE insertions are significantly
enriched in the introns of genes (P<0.01, fold change 1.34).
The relative depletion of SVs within genes implies a proportionate
deficit in their phenotypic consequences. We found this to be true for the
effect of SVs on gene expression, by estimating heritability attributable to
SVs. Variation within strains is due to environmental factors and variation
between strains is due to both environmental and genetic factors, so the
difference between the two variances is a measure of genetic effect
(heritability). Figure 4a shows scaled variances in gene expression from brain
RNA-Seq data measured between and within strains for five categories of
SVs. Variances for between strain variation are clearly larger than for within
strain variances, indicating that SVs do have an impact on expression, but
how big an impact?
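
The comparison of within-strain and between-strain variance can be made concrete with a standard one-way analysis of variance. The Python sketch below is a textbook estimator for a balanced design, offered as an illustration rather than as the calculation behind Figure 4; it covers only the variance partition, not the step that attributes heritability to SVs, and the function name and input layout are assumptions.

```python
def heritability_from_strains(expression):
    """Estimate the share of expression variance that lies between strains.

    expression maps strain name -> list of replicate measurements. Every
    strain needs the same number (at least 2) of replicates.
    """
    groups = list(expression.values())
    k = len(groups)
    n = len(groups[0]) if groups else 0
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need >= 2 strains with equal replicate counts >= 2")
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    # Mean squares from a one-way analysis of variance.
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (k * (n - 1))
    # The between-strain mean square carries environmental plus genetic
    # variance, so subtracting the within-strain term isolates the genetic part.
    genetic = max((ms_between - ms_within) / n, 0.0)
    total = genetic + ms_within
    return genetic / total if total > 0 else 0.0
```

When replicates within a strain agree perfectly and strains differ, the estimate is 1; when strain means are identical, it is 0.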
We estimated the proportion of heritability attributable to SVs16 and
found that no category accounts for more than 10%. To determine if these
results were specific to brain tissue, we analysed gene expression data for the
eight founder strains of the HS population (n = 5 for each) from liver,
measured on Illumina arrays32. Mean heritability attributable to an SV, for
transcripts overlapping one or more SVs, was 9.5%. Since many transcripts
overlap multiple small SVs (median of 3, maximum of 216), we hypothesized
that SV heritability might be related to the proportion of the gene overlapped. For
each transcript we summed the length of SV sequence overlapping the gene and
expressed this as a proportion of the total length of the gene. SVs that overlap
50% or more of a gene make a large contribution to heritability: in brain tissue,
such SVs contribute to 25% of the variance, compared to 7.8% for transcripts
where SVs overlap less than 50% of the gene. However, large overlaps (50%
or more) are rare, affecting less than 3% of transcripts. Thus while SVs make
a modest contribution to the overall heritability of expression variance, at
individual transcripts they may be the main cause of between-strain
differences in expression.
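
The overlap proportion used above can be computed from interval coordinates alone. A minimal Python sketch follows, assuming half-open (start, end) coordinates on one chromosome and merging overlapping SVs so that no base is counted twice; that merging step and the function name are assumptions, not details taken from the study.

```python
def sv_overlap_fraction(gene_start, gene_end, svs):
    """Fraction of the gene [gene_start, gene_end) covered by at least one SV.

    svs is a list of (start, end) half-open intervals.
    """
    if gene_end <= gene_start:
        raise ValueError("gene must have positive length")
    # Clip each SV to the gene and drop those that do not overlap it.
    clipped = sorted(
        (max(s, gene_start), min(e, gene_end))
        for s, e in svs
        if s < gene_end and e > gene_start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (gene_end - gene_start)
```

A transcript would fall into the large-overlap class discussed above when this value is 0.5 or more.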
We also found that SVs outside a gene have only small effects on gene
expression. Figure 4b shows between and within strain variances for SVs
lying at distances from less than 2 Kb to more than 40 Kb from brain
transcripts with no SV overlap (the density of SVs meant that we found too
few transcripts further than 60 Kb from an SV to analyze). Heritability
attributable to SVs within 2 Kb of the gene is 2%, and falls as the distance
from the gene increases.
We considered whether the lower estimates we obtained for the effect
of SVs, compared to those obtained from array based assays, might be due to
the differences in the way SVs were assessed. Using SVs from a genome-wide
array based assessment of SVs in 12 classical strains, we calculated
within and between strain variances16. Results, shown in Figure 4a,
demonstrate a larger difference between within-strain and between-strain
variances than seen using SVs from our sequence analysis. SVs assessed by
arrays contribute to 25% of the variance in gene expression. Differences in
the heritability estimates are thus due in part to the differences in the way SVs
are called.
Our third observation of the phenotypic impact of SVs is that they are
unlikely to be the causative variant at QTLs, as we know from genetic
association with 100 phenotypes measured in over 2,000 heterogeneous
stock (HS) mice22. We applied a test of functionality33 to 281,246 SVs where
we were certain that the strain distribution pattern (SDP) was correct
(Supplementary Methods). We identified 290 QTLs where SVs were among
the variants most likely to be functional, but in all these cases the SVs were
only a subset of the total number of functional variants. We found a small but
significant deficit in SVs among the functional variants (0.36% compared to
0.54% among the non-functional, P < 1E-16, χ² = 72.1).
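
A deficit of this kind is commonly tested on a 2x2 table of variant class (SV or other) against functional status. The Python sketch below computes the Pearson chi-square statistic for such a table; that this is the exact test applied here is an assumption, and the function name is hypothetical. The underlying counts are not given in the text, so none are supplied.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
    [[a, b], [c, d]].

    For example, rows could be functional / non-functional variants and
    columns SV / other variant.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a non-zero total")
    return n * (a * d - b * c) ** 2 / denom
```

The statistic has one degree of freedom, so large values correspond to very small P values.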
While SVs make a relatively small contribution to the total amount of
quantitative phenotypic variation, at a small number of QTLs they are the
cause of variation. As shown in our companion paper34, larger effect QTLs are
more likely to arise from SVs (and see Supplementary Figure 4a, 4b and
4c). We identified 12 QTLs where the SV overlapped a gene with its flanking
region (2 Kb up or downstream of a gene), and where the QTL effect size is in
the top 5% of the distribution. Table 3 lists these SVs, the genes they affect
and the putative phenotype with which they are associated. Complementation
of the deletion of the H2-Ea promoter has confirmed the effect of this SV on
the T-cell phenotype35. In one other case we have evidence in favour of a
causative role for the SV: Eps15 -/- male mice exhibited significantly lower
locomotor activity (Supplementary Fig. 5) compared to matched wild type
male mice, indicating that the SV is likely the cause of the QTL.
SVs that disrupt coding exons
There are relatively few examples where an SV can be said unequivocally to
delete one, or more, coding exons. Without nucleotide resolution accuracy it is
often impossible to be certain whether the breakpoint of an SV lies within an
exon, so to find SVs overlapping exons we used our most accurate and
complete category of SV calls: deletions relative to the reference strain. Using
this list, we started with 210 that overlap exons from Ensembl (Build 58); after
removing pseudogenes, and anything not annotated as 'protein coding', we
were left with just 24 SVs that affect coding exons, including six that
encompass a gene (or several) in its entirety. Table 4 gives positional
information for these SVs, the genes they affect (genes that are affected in their
entirety are indicated by an asterisk), how they formed, their strain distribution
pattern (SDP) and their known function as reported in the current literature.
Five of the 24 SVs are already known36-40; the remaining 19 SVs are
novel. Remarkably, a third of the genes affected by these SVs are involved in
immunity and infection. Figure 2c gives an example of how our data expands
current knowledge of the molecular architecture of these SVs. The antiviral
genes Trim5 and Trim12a are for the first time revealed as unique to
C57BL/6J, due to segmental duplication41. All the other strains contain only
the Trim12c gene. Therefore the mouse contains a unique homologue of the
human TRIM5 gene, similarly to the rat, and the expansion of the Trim12
genes appeared only in the C57BL/6J lineage. A second example is our
analysis of the beta defensin 8 gene (Defb8), another immune related gene. Two
alleles have been identified that differ by three base-pair changes in the second
exon42,43. Our analysis reveals that these documented exonic changes are
linked to a previously undetected 3,192 bp deletion that includes the first exon
of the gene.
Discussion
Our results are important in three respects: first we find an unexpectedly large
number of SVs with diverse molecular architecture, thus providing a catalogue
of the most dynamic and variable regions of the mouse genome. Second, we
identify breakpoints at nucleotide level resolution, giving a genome wide
picture of how SVs originate. Third, we demonstrate that, despite their
abundance, SVs make relatively little functional impact, as assessed by their
effects on gene expression and phenotypic variation in the whole animal.
We were able to find more SVs, of greater complexity, because we
relied on manual inspection of the PEM results, combined with molecular
validation, before using automated calling methods. Previous studies have
revealed the noisiness of sequence based methods of SV calling12,44,45, due
in part to the multiplicity of forms and the presence of insertions, deletions and
inversions often in close proximity to each other, and the difficulty of mapping
sequence reads back to repetitive genomes. Nevertheless, we have shown
here that it is possible to calibrate automated methods to generate genome-wide
SV calls of high accuracy.
The SVs we find have two distinguishing characteristics: first, they are
small. For deletions, whose size we know accurately, the median is 349 bp. In
comparison, the median size of SVs in a recent high-density array analysis of
the genomes of 20 laboratory strains was 9 Kb15 and about 1.9 Kb from a
PEM analysis of DBA/2J12. Our size estimate is actually an upper limit, since it
does not include SVs less than 100 bp. While the latter category currently
presents a challenge to sequence technologies, we estimate that there may be
63,000 in the classical laboratory strains, in which case an upper estimate of
the median size of SVs is 180 bp. Second, SV density means that we
frequently find regions with high concentrations of small rearrangements.
These two features emphasize the need for methods of SV identification at
base pair, or near base pair resolution. Otherwise not only are many SVs
missed, but those recognized are misclassified: a mixture of small deletions
and insertions will be mistaken for a large SV of a single type14.
Our second important finding is the catalogue of SV mechanisms
based on breakpoint sequence. We were able to map almost 60% of deletions
to base pair resolution, allowing us to classify SVs by the mechanism that
created them. We find that the primary origin of structural variation between
mouse strains is attributable to LINE-1 retrotransposons. Mice differ from
humans, in whom LINE-1 retrotransposition comes third after
microhomology-mediated processes and nonallelic homologous recombination as the
predominant processes in generating SVs44. In contrast to human SV studies,
the great majority of SVs we have discovered are non-recurrent
rearrangements, based on two observations: among the classical strains, only
2.5% of deletions at the same locus had different breakpoint sequences and
less than 1% of deletions are due to NAHR, the mechanism thought to be
responsible for the majority of recurrent SVs in humans30.
Our third important observation is that SVs have relatively little impact
on gene function. SVs overlapping a gene have been estimated to contribute
to a substantial proportion of between-strain expression variance (28% in
hematopoietic stem and progenitor cells15; 38% in brain and 66-74% in heart,
kidney, liver, lung and testis16). If, as these results suggest, SVs contribute to
a third or a half of variation in transcript abundance, then, assuming gene
expression contributes to phenotypic variation, SVs will likely have a major
role in the genetic determination of all aspects of mouse biology. Available
evidence has not yet resolved whether or not this is so.
Our findings add to this debate in three ways. First, we find that SVs
overlapping a gene account for less than 10% of variation in gene expression.
This value is between three and four times smaller than that found by studies
using expression arrays15,16. We think the most likely explanation for the
disparity is that array-based studies overestimate the contribution of SVs by
conflating under one apparently large SV the effects of numerous smaller
rearrangements together with regions of diploid DNA that also contain other
variants influencing gene expression (Figure 4a).
Second, few SVs overlap exons. From our set of relative deletions we
identified 24 that delete one or more exons. These SVs, with large effects on
a phenotype, are the equivalent of rare variants found in human populations.
In mouse populations they are very rare indeed. Since the frequency of
insertions is equal to that of deletions, and since these two categories make
up 98% of all SVs, we predict that there may be only about 50 SVs that
directly overlap exons, or about 0.2% of the total burden of SVs in the
genome.
Third, our analysis of the phenotypic consequences of SVs on QTLs for
multiple phenotypes points to a relative deficit of SVs as the molecular basis
of complex phenotypes. By working with an outbred population where all
chromosomes are descended from known progenitors, imputation effectively
reconstitutes the genomes of all animals, so that we can detect the effects of
all variants, both common and rare. Our results indicate that SVs make less of
a contribution to phenotypic variation than we would expect given the amount
of the genome they affect. For the classical laboratory strains, summing the
number of bases involved, SNPs and indels affect 0.5% of the genome (this is
a maximum estimate, assuming that indels have a size of 10 bp; many are
smaller than this), while on average 33 Mb (2.5%) of each classical laboratory
strain falls into structurally variant regions of the genome. This implies that
SVs are about five-fold more likely to have phenotypic consequences than the
combined effect of SNPs and indels. Yet we find that SVs contribute only 10%
of heritability, not the roughly 80% (2.5% out of a combined 3.0% of the
genome) implied by the genomic size argument.
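The genomic size argument can be checked with a few lines of arithmetic. This sketch uses only the two percentages quoted in the text; the variable names are illustrative, and the assumption that every variant base is equally likely to matter is the simplification the argument itself relies on.

```python
# Fractions of the genome affected, as quoted in the text.
snp_indel_fraction = 0.005  # SNPs and indels: 0.5% (a maximum estimate)
sv_fraction = 0.025         # SVs: 2.5% of each classical laboratory strain

# Fold difference in the amount of sequence affected.
fold = sv_fraction / snp_indel_fraction

# Share of all variant bases that fall in SVs. If every variant base were
# equally likely to matter, this would be the SV share of heritability.
expected_share = sv_fraction / (sv_fraction + snp_indel_fraction)

print(f"fold difference: {fold:.1f}")
print(f"expected SV share of heritability: {expected_share:.0%}")
```

The expected share comes out a little above four fifths, which is the order of magnitude that the observed 10% is being contrasted with.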
It is important to note that conclusions based on our analysis of an
outbred mouse population may not apply to other outbred populations (such
as in human, where there is continuing debate over the contribution of SVs to
phenotypic variation1,46,47). The population we tested is derived from inbred
progenitors whose homozygosity will have purged their genomes of variants
that could otherwise be maintained in heterozygous freely mating populations.
Nevertheless, despite their relative rarity in the mouse genome, SVs that
cause phenotypic change are likely to provide biological insights out of
proportion to their relatively small contribution to phenotypic variance. Biological
insight into a phenotype requires discovering which genes are involved. The
task is considerably easier if the SV removes a coding segment, effectively
creating a null allele. We expect that the alleles we have described will
provide a starting point for investigating the relationship between phenotype
and genotype in mice.
Methods Summary
SV discovery. We used a combination of four computational methods: split-read
mapping24, mate-pair analysis48, single-end cluster analysis (SECluster
and RetroSeq, unpublished), and read-depth49. These methods identify
deletions, insertions, inversions and copy number gains. We also derived
methods to recognize other types of rearrangements, such as inversion plus
insertion or inversion plus deletion, newly revealed from our experimental
analysis.
Experimental analysis. We inspected short-read sequencing data using
LookSeq50 and manually detected SVs across mouse chromosome 19 in its
entirety and a random set of other chromosomal regions. We analysed
molecular structures of these SVs at nucleotide-level resolution using PCR
and Sanger-based sequencing.
Outgroup analysis. We used the rat as an outgroup species to classify each
mouse SV as either an ancestral deletion or an ancestral insertion. We
predicted the ancestral state in the rat by estimating the size of the region in
the rat genome that was homologous to the region that encompassed the
mouse SV.
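One way to picture the outgroup comparison is as a choice between two expected sizes for the homologous rat region. The sketch below is an illustration only: the text states that the size of the rat region was estimated, but the decision rule, the function name and the example numbers here are assumptions. It considers an SV seen as a deletion relative to the reference and asks whether the rat region looks more like the allele that carries the sequence or the allele that lacks it.

```python
def classify_ancestral_event(ref_span_bp, sv_length_bp, rat_span_bp):
    """Classify an SV by comparing the rat region with the two mouse alleles.

    ref_span_bp:  size of the mouse region on the allele carrying the sequence
    sv_length_bp: number of bases that differ between the two mouse alleles
    rat_span_bp:  estimated size of the homologous rat region
    """
    size_with_sequence = ref_span_bp
    size_without_sequence = ref_span_bp - sv_length_bp

    distance_with = abs(rat_span_bp - size_with_sequence)
    distance_without = abs(rat_span_bp - size_without_sequence)

    if distance_with == distance_without:
        return "unresolved"
    if distance_with < distance_without:
        # Rat carries the sequence, so the shorter mouse allele lost it.
        return "ancestral deletion"
    # Rat lacks the sequence, so the longer mouse allele gained it.
    return "ancestral insertion"


# Invented sizes, for illustration only.
print(classify_ancestral_event(2000, 500, 1980))
print(classify_ancestral_event(2000, 500, 1510))
```

A real analysis would also need to handle regions with no clear rat homologue and size differences caused by unrelated events in the rat lineage, which this sketch ignores.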
SV classification. We developed a machine learning method to classify SVs.
The method used a random forest classifier, trained using sequence features
within the SVs. Microhomology between breakpoints was determined by
recording the longest sequence of bases that was identical between each
breakpoint of each SV.
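The microhomology measurement can be sketched as a longest-common-substring search between the sequences flanking the two breakpoints. This is a minimal illustration under that assumption, not the authors' implementation; the function name and the flanking sequences are invented, and only the GAACTA hexamer is borrowed from the example in Figure 2b.

```python
def longest_shared_run(left_flank, right_flank):
    """Return the longest run of identical bases found in both flanks."""
    left = left_flank.upper()
    right = right_flank.upper()
    best_len = 0
    best_end = 0  # end index (exclusive) of the best run in `left`
    prev = [0] * (len(right) + 1)
    for i in range(1, len(left) + 1):
        curr = [0] * (len(right) + 1)
        for j in range(1, len(right) + 1):
            if left[i - 1] == right[j - 1]:
                # Extend the run that ended at the previous base of each flank.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return left[best_end - best_len:best_end]


# Invented flanks that share the hexamer GAACTA.
left_flank = "TTCAGGAACTAGCCT"
right_flank = "ACGTGAACTATTGCA"

shared = longest_shared_run(left_flank, right_flank)
print(shared, len(shared))
```

The dynamic-programming table keeps only two rows, so memory use grows with the length of one flank rather than with the product of both.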
Functional impact of SVs. We tested whether an SV is likely to be functional
using merge analysis33. The variances of expression data were calculated
using ANOVA in the statistical software R using formulae described in16 and
also by comparing a model where the expression value is explained by the
strain, to a model in which the expression is explained by strain and whether
or not the animal has an SV.
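The within-strain and between-strain variances used in this analysis can be illustrated with a one-way ANOVA. The sketch below is written in Python rather than R, does not reproduce the formulae of reference 16, and uses invented expression values; the function name is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (between-strain mean square, within-strain mean square, F).

    groups: dict mapping a strain name to a list of expression values.
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    n_groups = len(groups)
    n_total = len(all_values)

    # Variation of the strain means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of animals around their own strain mean.
    ss_within = sum(
        (x - mean(values)) ** 2
        for values in groups.values()
        for x in values
    )

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between, ms_within, ms_between / ms_within


# Invented expression values for one transcript in three strains.
expression = {
    "A/J": [1.0, 2.0, 3.0],
    "AKR/J": [4.0, 5.0, 6.0],
    "LP/J": [7.0, 8.0, 9.0],
}

ms_between, ms_within, f_ratio = one_way_anova(expression)
print(ms_between, ms_within, f_ratio)
```

A large ratio of between-strain to within-strain mean square indicates that strain identity, and therefore genetic background, accounts for much of the variation in that transcript.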
Full methods are provided in Supplementary Information.
References
1. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704-712 (2010).
2. Durbin, R. M. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073 (2010).
3. Kidd, J. M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453, 56-64 (2008).
4. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59-65 (2011).
5. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444-454 (2006).
6. Buchanan, J. A. & Scherer, S. W. Contemplating effects of genomic structural variation. Genet Med 10, 639-647 (2008).
7. Hurles, M. E., Dermitzakis, E. T. & Tyler-Smith, C. The functional impact of structural variation in humans. Trends Genet 24, 238-245 (2008).
8. Zhang, F., Gu, W., Hurles, M. E. & Lupski, J. R. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10, 451-481 (2009).
9. Cutler, G., Marshall, L. A., Chin, N., Baribault, H. & Kassner, P. D. Significant gene content variation characterizes the genomes of inbred mouse strains. Genome Res 17, 1743-1754 (2007).
10. Egan, C. M., Sridhar, S., Wigler, M. & Hall, I. M. Recurrent DNA copy number variation in the laboratory mouse. Nat Genet 39, 1384-1389 (2007).
11. Graubert, T. A. et al. A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet 3, e3 (2007).
12. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 20, 623-635 (2010).
13. Snijders, A. M. et al. Mapping segmental and sequence variations among laboratory mice using BAC array CGH. Genome Res 15, 302-311 (2005).
14. Agam, A. et al. Elusive copy number variation in the mouse genome. PLoS One 5 (2010).
15. Cahan, P., Li, Y., Izumi, M. & Graubert, T. A. The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nat Genet 41, 430-437 (2009).
16. Henrichsen, C. N. et al. Segmental copy number variation shapes tissue transcriptomes. Nat Genet 41, 424-429 (2009).
17. Akagi, K., Li, J., Stephens, R. M., Volfovsky, N. & Symer, D. E. Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res 18, 869-880 (2008).
18. Gilbert, N., Lutz-Prigge, S. & Moran, J. V. Genomic deletions created upon LINE-1 retrotransposition. Cell 110, 315-325 (2002).
19. Symer, D. E. et al. Human l1 retrotransposition is associated with genetic instability in vivo. Cell 110, 327-338 (2002).
20. Garvey, S. M., Rajan, C., Lerner, A. P., Frankel, W. N. & Cox, G. A. The muscular dystrophy with myositis (mdm) mouse mutation disrupts a skeletal muscle-specific domain of titin. Genomics 79, 146-149 (2002).
21. Schadt, E. E. et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 37, 710-717 (2005).
22. Valdar, W. et al. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet 38, 879-887 (2006).
23. Wong, K., Keane, T. M., Stalker, J. & Adams, D. J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 11, R128 (2010).
24. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865-2871 (2009).
25. Albers, C. A. et al. Dindel: Accurate indel calls from short-read data. Genome Res 21, 961-973 (2011).
26. Hastings, P. J., Ira, G. & Lupski, J. R. A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet 5, e1000327 (2009).
27. Zhang, F. et al. The DNA replication FoSTeS/MMBIR mechanism can generate genomic, genic and exonic complex rearrangements in humans. Nat Genet 41, 849-853 (2009).
28. Ma, J. L., Kim, E. M., Haber, J. E. & Lee, S. E. Yeast Mre11 and Rad1 proteins define a Ku-independent mechanism to repair double-strand breaks lacking overlapping end sequences. Mol Cell Biol 23, 8820-8828 (2003).
29. Deem, A. et al. Break-induced replication is highly inaccurate. PLoS Biol 9, e1000594 (2011).
30. Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annu Rev Med 61, 437-455 (2010).
31. Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends Genet 18, 74-82 (2002).
32. Huang, G. J. et al. High resolution mapping of expression QTLs in heterogeneous stock mice in multiple tissues. Genome Res 19, 1133-1140 (2009).
33. Yalcin, B., Flint, J. & Mott, R. Using progenitor strain information to identify quantitative trait nucleotides in outbred mice. Genetics 171, 673-681 (2005).
34. Keane, T. Sequence variation amongst 17 laboratory and wild-derived mouse genomes and its affect on gene regulation and phenotypic variation. Nature (2011).
35. Yalcin, B. et al. Commercially available outbred mice for genome-wide association studies. PLoS Genet 6 (2010).
36. Best, S., Le Tissier, P., Towers, G. & Stoye, J. P. Positional cloning of the mouse retrovirus restriction gene Fv1. Nature 382, 826-829 (1996).
37. Boyden, L. M. et al. Skint1, the prototype of a newly identified immunoglobulin superfamily gene cluster, positively selects epidermal gammadelta T cells. Nat Genet 40, 656-662 (2008).
38. Nelson, T. M., Munger, S. D. & Boughter, J. D., Jr. Haplotypes at the Tas2r locus on distal chromosome 6 vary with quinine taste sensitivity in inbred mice. BMC Genet 6, 32 (2005).
39. Persson, K., Heby, O. & Berger, F. G. The functional intronless S-adenosylmethionine decarboxylase gene of the mouse (Amd-2) is linked to the ornithine decarboxylase gene (Odc) on chromosome 12 and is present in distantly related species of the genus Mus. Mamm Genome 10, 784-788 (1999).
40. Wu, B. et al. Mutations in sterol O-acyltransferase 1 (Soat1) result in hair interior defects in AKR/J mice. J Invest Dermatol 130, 2666-2668 (2010).
41. Tareen, S. U., Sawyer, S. L., Malik, H. S. & Emerman, M. An expanded clade of rodent Trim5 genes. Virology 385, 473-483 (2009).
42. Bauer, F. et al. Structure determination of human and murine beta-defensins reveals structural conservation in the absence of significant sequence similarity. Protein Sci 10, 2470-2479 (2001).
43. Taylor, K. et al. Defensin-related peptide 1 (Defr1) is allelic to Defb8 and chemoattracts immature DC and CD4+ T cells independently of CCR6. Eur J Immunol 39, 1353-1360 (2009).
44. Kidd, J. M. et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837-847 (2010).
45. Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420-426 (2007).
46. Craddock, N. et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713-720 (2010).
47. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747-753 (2009).
48. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-681 (2009).
49. Simpson, J. T., McIntyre, R. E., Adams, D. J. & Durbin, R. Copy number variant detection in inbred strains from short read sequence data. Bioinformatics 26, 565-567 (2010).
50. Manske, H. M. & Kwiatkowski, D. P. LookSeq: a browser-based viewer for deep sequencing data. Genome Res 19, 2125-2132 (2009).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature. Supplementary Information contains Supplementary
Figures and Tables, additional Methods, and Supplementary References.
Acknowledgements
We thank Adam Whitley, Giles Durrant, Andrew Marc Hammond, Danica Joy
Fabrigar, Lucia Chen, Martina Johannesson, Enzhao Cong and Glòria
Blázquez for helping B.Y. with various laboratory-based work. We also thank
Chris P. Ponting for comments on the manuscript. This project was supported
by The Medical Research Council, UK and the Wellcome Trust. D.J.A. is
supported by Cancer Research UK.
Author contributions
D.J.A. and J.F. conceived the study and directed the research. J.F. wrote the
core of the paper. K.W. and T.K. performed the genome-wide SV discovery
and local assembly for SV breakpoint resolution. K.W. carried out the
sensitivity and specificity analyses. K.W. and B.Y. liaised regularly to integrate
experimental work into the genome-wide SV discovery pipeline. This resulted in a
highly accurate map of SVs across the mouse genome, essential to
downstream analyses. A.B., P.H.P., H.W., J.C., R.D. and D.J. carried out
experimental work, led by B.Y. A.B. and B.Y. analysed Sanger-based
sequencing data, resolved SV breakpoints at nucleotide-level resolution and
inferred mechanism of SV formation. M.G. performed the genome-wide analysis
of SV mechanism of formation and the outgroup analysis, with contributions
from A.A. and B.Y. J.F. analysed the functional impact of SVs on expression and
phenotypes, with contributions from A.A. C.N., L.G., J.N. and R.M. carried out
additional analyses. B.Y. characterised the function of individual SV examples.
Author information
Data sets described here will be available under study accession number
estd118 from the Database of Genomic Variants archive (DGVa) at
http://www.ebi.ac.uk/dgva/page.php. Reprints and permissions information is
available at www.nature.com/reprints. Readers are welcome to comment on
the online version of this article at www.nature.com/nature. Correspondence
and requests for materials should be addressed to J.F.
Tables
Table 1: Structural variants greater than 100 bp in 17 inbred strains
                Simple                       Complex
Strain          del     gain   inv   ins     del+ins   nested   inv+del/ins
129P2/OlaHsd    16292   57     74    15604   105       27       68
129S1/SvImJ     17307   70     88    11516   73        32       67
129S5/SvEvBrd   16089   72     67    8970    43        41       58
A/J             16190   69     92    12184   61        28       67
AKR/J           15806   88     89    14576   88        13       82
BALB/cJ         14859   82     87    10551   48        17       58
C3H/HeJ         16062   94     94    12100   90        16       76
C57BL/6N        164     44     6     213     0         3        1
CAST/EiJ        50978   361    224   34122   133       239      265
CBA/J           16996   79     83    10867   64        16       78
DBA/2J          17478   67     83    10559   55        29       75
LP/J            16964   64     88    12745   64        30       69
NOD/ShiLtJ      17047   51     115   13244   53        16       79
NZO/HlLtJ       15429   62     71    9445    33        23       62
PWK/PhJ         54147   96     272   35098   184       60       268
SPRET/EiJ       91295   112    470   64304   463       110      552
WSB/EiJ         22154   88     97    12521   64        37       105
Del: deletion; gain: copy number gain; inv: inversion; ins: insertion; del+ins:
deletion plus insertion; nested: SV in a copy number gain region; inv+del/ins:
inversion plus deletion(s) or inversion plus insertion.
Table 2: SV classification and inferred mechanism of formation
a Sequence features at breakpoints
LINE
ERV
SINE
Target site duplication (TSD)
none
4-10 bp
11-20 bp
>20 bp
6.7%
13.3%
78.3%
1.7%
6.7%
93.3%
0.0%
0.0%
0.0%
15.4%
84.6%
0.0%
Microdeletion
none
1-34 bp
>200 bp
93.3%
5.0%
1.7%
93.3%
6.7%
0.0%
92.3%
7.7%
0.0%
Microhomology
none
1-2 bp
3-25 bp
26-200 bp
>200 bp
b Inferred mechanisms
Total Retrotransposition
LINE Retrotransposition
ERV Retrotransposition
SINE Retrotransposition
SRS
MMEJ, MMBIR
NHEJ
SSA
NAHR
FoSTeS/others
30
13
Complex
CNG
92.3%
7.7%
0.0%
0.0%
0.0%
23.1%
76.9%
0.0%
60
Inversion
VNTR
Microinsertion
none
1-10 bp
11-50 bp
>51 bp
Total (249 SV regions)
Deletion
Insertion
Ancestral Events
13
15.9%
15.0%
67.3%
0.9%
0.9%
50.0%
12.5%
37.5%
0.0%
0.0%
12.5%
37.5%
50.0%
0.0%
0.0%
0.0%
50.0%
50.0%
0.0%
0.0%
68.1%
23.0%
8.0%
0.9%
62.5%
37.5%
0.0%
0.0%
87.5%
12.5%
0.0%
0.0%
50.0%
50.0%
0.0%
0.0%
113
8
8
4
30.5%
13.3%
0.4%
0.4%
0.8%
3.2%
3.2%
1.6%
24.1%
12.0%
5.2%
5.2%
This detailed classification is based on the 249 SVs resolved at nucleotide-level
resolution (Supplementary Table 8). MMEJ: Microhomology-mediated
end joining; NHEJ: Non-homologous end joining; FoSTeS: fork stalling and
template switching; MMBIR: Microhomology-mediated break-induced
replication; NAHR: Non-allelic homologous recombination; SRS: Serial
replication slippage; SSA: Single strand annealing; CNG: Copy number gain.
Table 3: QTLs associated with SVs
Phenotype                                  Chr  Start       Stop        Ancestral Event     SDP       Gene           Gene region  Merge LogP
Mean platelet volume                       1    175158884*  175158885*  ins (large)         10000000  Fcer1a         upstream     52.8
OFT Total activity                         2    144402760   144402971   SINE ins            11111100  Sec23b         intron       15.7
Hippocampus cellular proliferation marker  4    49690362    49690363    del (137 bp)        11111001  Grin3a         intron       20.1
Home cage activity                         4    108951263   108951264   IAP ins (~6400 bp)  10000011  Eps15          upstream     15.9
T-cells: %CD3                              4    130038388   130038389   SINE ins (202 bp)   00000001  Snrnp40        intron       12.1
Wound healing                              7    90731819    90731820    IAP ins (~6400 bp)  11100000  Tmc3           upstream     22.2
Red cells: mean cellular haemoglobin       7    111397607   111479433   ins                 00001000  Trim5          exon         13.0
Red cells: mean cellular haemoglobin       7    111504989   111505193   del                 11110111  Trim30b        UTR          12.8
Red cells: mean cellular volume            8    87957244    87957245    LINE ins (~500 bp)  10010110  4921524J17Rik  upstream     18.1
Serum urea concentration                   11   115106127   115106250   del                 00100000  Tmem104        UTR          13.4
Hippocampus cellular proliferation marker  13   113783196   113783359   del                 10000000  Gm6320         upstream     17.5
T-cells: CD4/CD8 ratio                     17   34483681    34483682    del (629 bp)        00001001  H2-Ea          upstream     82.9
Ins: insertion; del: deletion; SINE: Short INterspersed repeat Elements; LINE: Long INterspersed repeat Elements; IAP:
Intracisternal A Particle. Start and stop coordinates are given for build37 of the mouse genome, so that insertions into the
reference are given as consecutive base pairs (columns headed Start and Stop). Unless there is a plus sign (“+”),
coordinates refer to the exact coordinates as delineated by Sanger PCR sequencing. SDP is the Strain Distribution Pattern
of the ancestral event (“1” refers to presence and “0” to the absence of the event) in the following strain order: A/J, AKR/J,
BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J. The part of the gene overlapped is reported in the column headed
Gene region. LogP is the negative logarithm of the P-value for association between the SV and the phenotype as assessed in
outbred HS mice 22.
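As a reading aid, an SDP string can be decoded into strain names using the strain order given in the legend above. The helper below is a hypothetical convenience function, not part of the study's software; the two example strings are taken from the Eps15 and H2-Ea rows of the table.

```python
# Strain order as stated in the legend of Table 3.
CLASSICAL_STRAINS = [
    "A/J", "AKR/J", "BALB/cJ", "C3H/HeJ",
    "C57BL/6N", "CBA/J", "DBA/2J", "LP/J",
]


def strains_with_event(sdp, strain_order=CLASSICAL_STRAINS):
    """Return the strains marked "1" (ancestral event present) in an SDP."""
    if len(sdp) != len(strain_order):
        raise ValueError("SDP length does not match the number of strains")
    return [strain for strain, flag in zip(strain_order, sdp) if flag == "1"]


print(strains_with_event("10000011"))  # SDP listed for Eps15
print(strains_with_event("00001001"))  # SDP listed for H2-Ea
```

The same function can decode the 17-character patterns of Table 4 if the longer strain order from that legend is passed as `strain_order`, although the "2" code for an additional allele would need separate handling.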
Table 4: SVs affecting coding regions
MGI gene     Chr  SV start    SV stop     Ancestral Event          SDP                Known function
Soat1        1    158394620   158401436   del                      01000000000000000  Hair interior defects
Olfr1055     2    86179898    86186982    IAP ins                  00001000111001000  Olfactory
Fcrl5        3    87245084    87245947    del                      00000000000100000  Infection and immunity
Nes          3    87780530    87780662    VNTR                     00001001111000001  Brain development
Pglyrp3      3    91831862    91835385    del                      00000000000100000  Infection and immunity
Skint4,3,9*  4    111731004+  112272814+  ins                      00001000000010000  Infection and immunity
Fv1          4    147244398   147245739   del                      01010111111110101  Infection and immunity
Ugt2b38      5    87850554    87854999    del                      00000001111000020  Metabolism
Klrb1a       6    128559593   128559740   del                      00000000000100110  Infection and immunity
Klri2        6    129689526   129691211   del                      00000000111000000  Infection and immunity
Tas2r120*    6    132580541   132613777   del+linked ins (326 bp)  11110111000000000  Taste
Tas2r103     6    132985563   132986696   del                      00000000000010000  Taste
Zfp607*      7    28646761    28671650    del                      00000000000010000  DNA-binding
Krtap5-5     7    149415121   149415210   VNTR                     11011111111111001  Hair formation
Trim5,12a*   7    111397607   111479433   ins                      00001000000000000  Infection and immunity
Defb8        8    19447465    19450575    ins+del (54 bp)          01001001000010001  Infection and immunity
Zfp872       9    22004856    22005023    VNTR                     11111110111011111  DNA-binding
Olfr913      9    38402589    38403498    del                      11110010111100000  Olfactory
Rtp3         9    110889280   110889465   VNTR                     00001000000000000  Bone density
Nlrp1c*      11   71046193+   71101410+   ins                      10001000000001100  Embryonic development
Fam110c      12   31759321    31759461    VNTR                     00001011111111110  Cell migration
Olfr234      15   98328544    98328861    del                      10010000000000000  Olfactory
Krtap16-1    16   88874294    88874392    VNTR                     10111002222122221  Hair formation
Amd2*        18   64607747    64609669    ins                      10111110000010001  Biosynthesis of polyamines
MGI: Mouse Genome Informatics; ins: insertion; del: deletion; VNTR: Variable Number Tandem Repeat; IAP: Intracisternal A
Particle. The Strain Distribution Pattern (SDP) relative to the ancestral event is given for all strains: “1” referring to presence,
“0” to absence and “2” to an additional allele, in the following strain order: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N,
CBA/J, DBA/2J, LP/J, 129P2/OlaHsd, 129S1/SvImJ, 129S5/SvEvBrd, NOD/ShiLtJ, NZO/HlLtJ, CAST/EiJ, PWK/PhJ,
SPRET/EiJ and WSB/EiJ. An asterisk (*) indicates that the structural variant overlaps the entire gene. Unless there is a plus sign (“+”),
coordinates refer to the exact coordinates as delineated by Sanger PCR sequencing.
Figure Legends
Figure 1: Identification of structural variants
a) Venn diagram showing the overlap between deletion SVs (relative to
C57BL/6J) detected in our study (blue) and those published elsewhere (Agam
et al, 2010 in red and Quinlan et al in green), in DBA/2J.
b) Basic rearrangements: deletion (del), insertion (ins), inversion (inv), tandem
duplication (tandem dup) and other types of copy number gains. Inverted
tandem duplication is drawn in Supplementary Figure 1 (H9). Linked
insertion (linked ins) is an insertion where the inserted sequence is copied
from nearby. Inverted linked insertion (Q9; drawn in Supplementary Fig. 1)
has a similar pattern to a linked insertion but the inserted sequence is
inverted.
c) Complex rearrangements: deletion co-occurring with an insertion (del+ins),
linked deletion (normal copy of small size flanked by two deletions), deletion
within a gain (del in gain), inversion with flanking deletions (del+inv+del),
inversion with an insertion (inv+ins) and inversion within a gain (inv in gain).
d) PEM pattern of a del+inv+del. Green arrows represent primers used for
PCR amplification and sequencing reactions.
e) PEM pattern of an inv+ins, with PCR data across the eight classical strains
(A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J).
Hyperladder II is used as the molecular marker. Amplicon size for BALB/cJ,
C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than for the other strains,
indicative of the insertion. The complete list of PEM patterns is given in
Supplementary Figure 1.
Figure 2: Experimental analysis of SVs
a) A complex SV, involving several genomic rearrangements including an
inversion, a deletion and two small insertions, is displayed relative to its genic
location along Zbtb10, a zinc finger and BTB domain containing 10 gene.
PCR amplification using forward (F) and reverse (R) primers revealed an AT
insertion at the first breakpoint B1, followed by an inversion of 125 bp that
comprises an inverted linked insertion of the 22 bp region, as seen in B2. The
third breakpoint (B3) revealed a deletion of 813 bp. Hyperladder II was used
as the size marker. C57BL/6J and LP/J show a normal size of 1604 bp,
whereas A/J, AKR/J, BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller
band at 793 bp.
b) Relationship between SNP and SV formation. Two SNPs lying on the 6 bp
microhomology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522)
correlated with the presence of the SV. Sequencing traces are
shown for a test strain (A/J) and the reference strain (C57BL/6J). Note that the
traces of all other test strains are identical to the one shown here. An asterisk is used
to emphasize the microhomology of 6 bp (GAACTA). The presence of two
SNPs (C->G and T->A) in all test strains (here only shown in A/J) is
associated with the presence of the ancestral deletion.
c) Schematic representation of the Trim6 to Trim30 gene cluster on
chromosome 7. Boxes represent the sequential positions of the Trim6,
Trim34, Trim5/12 and Trim30 genes. The Trim5 and Trim12a genes are
present only in the C57BL/6J genome, where they arose by segmental
duplication of the Trim12c gene, which is present in all 17 strains. The
flanking Trim34 and Trim30 genes do not vary between strains.
Figure 3: Classification of structural variants
a) Histogram of lengths for each deletion SV class.
b) Microhomology surrounding SV breakpoints. SVs were classified as in (a)
and the longest length of microhomology between both breakpoints was
recorded.
Figure 4: Impact of SVs on gene expression
a) Within-strain (grey boxes) and between-strain (white boxes) gene
expression variances for transcripts which are not overlapped by any
structural variant (No SV) and for those which are. Within-strain variance is
due to environmental effects; between strain to environmental and genetic
effects. The difference between the two variances is a measure of heritability.
Six categories are shown: deletions (Dels), insertions (Ins), copy number
gains (Gains), inversions (Inv), complex rearrangements (Complex), and SVs
(of any class) identified by an array analysis (Array:SVs).
b) Effect of distance from the transcript on gene expression variances. Grey
boxes are within-strain and white boxes are between-strain variances. The
figure shows standardized variances of gene expression for transcripts with
structural variants at distances from less than 2 Kb to more than 40 Kb from
either the start or end of the transcript.