Sequence based characterization of structural variation in the mouse genome
Binnaz Yalcin1, Kim Wong2, Avigail Agam1, Martin Goodson1, Christoffer Nellaker3,
Thomas M. Keane2, Leo Goodstadt1, Amarjit Bhomra1, Polinka Hernandez-Pliego1,
Helen Whitley1, James Cleak1, Deborah Janowitz1, Richard Mott1, Chris P. Ponting3,
David J. Adams2,*, Jonathan Flint2,*
1The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK
2The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
3MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics,
University of Oxford, South Parks Road, Oxford OX1 3QX, UK
†Co-first authors
*Correspondence to:
Dr. David Adams
Wellcome Trust Sanger Institute
Hinxton, Cambs, CB10 1SA, UK
Ph: +44 (0) 1223 86862
Fax: +44 (0) 1223 494919
Email: [email protected]
Prof. Jonathan Flint
Wellcome Trust Centre for Human Genetics
Oxford, OX3 7BN, UK
Ph: +44 (0) 1865 287512
Fax: +44 (0) 1865 287501
Email: [email protected]
Abstract
The origins and functional impact of structural variants (SVs) in mammalian
genomes remain poorly understood. Complete genome sequence of thirteen
classical and four wild-derived inbred strains of mice combined with extensive
experimental validations allowed us to identify more SVs (0.7M) and to
recognize more categories than hitherto reported. SVs affect 1.2% and 3.7%
of the genome of the classical and wild strains, respectively. The majority of
SVs are small (median size 385 base pairs) and while most (97.9%) are
deletions and insertions, 1.6% have a complex structure and 0.5% include
duplications and inversions. Base pair resolution of breakpoints allowed the
inference of genome-wide structural variation (SV) origins: retrotransposition
is the commonest mechanism (54%), followed by non-homologous end-joining
(NHEJ) processes (29%) and fork stalling and template switching
(FoSTeS) mechanisms (17%). Sequence features at the actual breakpoints of
250 SVs revealed a plethora of mutational processes during SV formation
including microdeletions of 1-289 bp and microinsertions of 1-107 bp. A large
proportion of inversions (63%) co-segregated with deletions, suggesting a
loop-type mechanism of formation. Analysis of gene expression and
phenotypic variation shows that SVs make less of a contribution to phenotypic
variation than we would expect given their abundance in the genome. We
identify 15 loci where a structural variant is likely to be the molecular lesion
responsible for the quantitative trait locus. Our catalogue provides a starting
point for the analysis of the most dynamic and complex regions of mammalian
genomes in a genetically tractable model organism.
Introduction
Structural variation (SV) in the mammalian genome is known to be abundant
and to contribute to phenotypic variation and disease. There has been
considerable progress assessing its extent and complexity1-4, phenotypic
impact5,6 and the responsible molecular mechanisms7,8 in the human genome,
but much less is known about structural variants (SVs) in the mouse, the
preeminent organism for modeling how genetic lesions give rise to disease in
mammals. In this paper we use next generation sequencing to address three
critical questions: what is the extent and complexity of SV in the mouse
genome, what are the likely mechanisms for its formation, and what are its
phenotypic consequences?
Current catalogues of mouse SVs are incomplete9-11. They are based
on differential hybridization of genomic DNA to oligonucleotide arrays (array
comparative genome hybridization (aCGH)) which are blind to some SV
categories (such as inversions and insertions), have limited ability to detect
others (segmental duplications and transposable elements), and cannot
provide sufficient breakpoint resolution which is required to assess the
mechanisms of SV formation.
Sequence based methods of SV detection, with higher resolution and
greater sensitivity, have so far had limited application in the mouse. Results
so far are available for only four classical strains12,13. These indicate that 85%
of insertions between 100 nucleotides and 10 kilobases are due to
retrotransposition, and that non-allelic homologous recombination (NAHR) is
rare. However we lack a comprehensive analysis of structural variation
between inbred mouse strains.
Even less is known about the impact of SVs on phenotypes, even
though there are several indications that they may be important. First, given
the predominant role of retrotransposition in SV formation, even low levels of
activity could have a large phenotypic impact. In cell culture about 10% of L1
insertions delete DNA14,15, a process that is also documented in mouse
genomic DNA16. Second, SVs overlapping a gene are estimated to contribute
up to 74% of between-strain gene-expression variance. The high prevalence
of SVs in the genome, and the observation that SVs influence the expression
of genes lying up to 500 Kb from their margins17, suggest that SVs might be
responsible for a considerable fraction of heritable gene expression variance.
The phenotypic impact of SVs could extend further, since gene expression
variation is believed to contribute to variation in phenotypes in the whole
organism (Schadt).
Here we report the identification, using short-read paired-end mapping,
of 0.7M SVs in 17 inbred strains of mice. By analyzing breakpoint sequence
we infer the mechanisms of formation and assess their relative impact on
shaping a mammalian genome. Our molecular characterization of SVs in the
mouse genome allows us to determine the extent to which SVs contribute to
genetic and phenotypic diversity.
Results
SV identification
We identified almost three quarters of a million SVs in 17 mouse strains, far
more than previously recognized (Fig. 1a) and consisting of a greater variety
of molecular structures (Fig. 1b&1c). To explain these results, we first
describe how we identified SVs.
We combined visual inspection of the data with molecular validation to
improve automated SV detection across the genome. We used two criteria to
identify SVs manually: read depth and anomalous paired-end mapping (PEM).
We did this using data from the mouse’s smallest chromosome (19) in its
entirety and a random set of other chromosomal regions, for eight strains (A/J,
AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J).
Based on read depth and PEM we expected to find eleven patterns
that classify SVs. We refer to these as type H patterns (H1-H11:
Supplementary Fig. 1). For example a deletion is indicated by a reduction in
read depth and by observing reads where one of the pair aligns to one side of
the deletion, and its mate to the other. Because the two ends are sequenced
on opposite strands, the direction of the two reads will be towards each other
(H1). By contrast, paired-end reads pointing in the same direction and an
unchanged read depth (except at the breakpoints) indicate an inversion (H4).
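The two textbook signatures just described (H1 and H4) can be written as a simple rule on read orientation and apparent insert size. This is a minimal sketch for illustration, not the pipeline used in this study: the function name, the strand encoding and the insert-size thresholds (`expected_insert`, `tolerance`) are all assumptions, and read depth is not modelled.

```python
def classify_pair(strand_left, strand_right, insert_size,
                  expected_insert=400, tolerance=100):
    """Classify one mapped read pair by orientation and apparent insert size.

    strand_left / strand_right: '+' or '-' for the leftmost and rightmost read.
    expected_insert and tolerance are hypothetical library parameters.
    """
    facing = strand_left == '+' and strand_right == '-'
    if strand_left == strand_right:
        # Both reads point the same way: the inversion-like signature (H4).
        return "inversion-like"
    if facing and insert_size > expected_insert + tolerance:
        # Reads face each other but map too far apart: deletion-like (H1).
        return "deletion-like"
    if facing:
        return "concordant"
    return "other"


print(classify_pair('+', '-', 2000))  # pair spanning a putative deletion
print(classify_pair('+', '+', 400))   # pair spanning a putative inversion breakpoint
```

In practice a call would require several supporting pairs and a matching read depth signal, which this sketch leaves out.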
We were surprised to discover an additional ten patterns (of read depth
and PEM) whose interpretation was ambiguous. We refer to these as type Q
patterns (Q1-Q10: Supplementary Fig. 1). For instance we found examples
of a reduction in read depth coverage without paired-end reads flanking the
putative deletion (Q2), and putative inversions where reads mapped to only
one of the breakpoints (Q8; Fig. 1d). We investigated the molecular structure
of all 21 patterns using a PCR strategy (Supplementary Fig. 2, online
Methods). We designed 575 pairs of primers (Supplementary Table 1) and
amplified 538 unique SV regions across eight classical inbred strains
(Supplementary Table 2).
Our categorization of predicted SV structures, based on manual
inspection of PEM patterns, resulted in the confident identification of an SV for
nineteen of the 21 patterns in all 532 instances that we examined by PCR
(Supplementary Table 3). Two patterns were always false (Q6 and Q10),
and arose because of the presence of a pseudogene giving mapping errors.
Recognizing these patterns, we were able to predict the underlying SV
structure with high confidence. PCR confirmed the structure for 18 patterns
(Supplementary Table 3): twelve patterns were indicative of a single SV and
six patterns indicative of multiple adjacent SVs (complex SVs). The exception
is type Q7, attributable in each case to the presence of a variable number
tandem repeat, where we could not unequivocally predict the number of
repeats and its molecular structure.
Available automated methods to identify SVs are unable to differentiate
all of the 21 PEM patterns we identified, and may also classify some SVs
incorrectly; for example, the PEM patterns of linked insertions (Q5 and Q9:
Supplementary Fig. 1) are similar to those for inversions or deletions.
Therefore we adapted automated methods to recognize 17 types (Q1, Q2, Q3
and Q7 excluded) identified by manual inspection and PCR validation (Fig. 1,
Supplementary Fig. 1, Supplementary Methods and 18).
The results of the detection and classification of 711,923 SVs across
the entire genome of 17 strains are shown in Table 1. SVs smaller than 100
bp are excluded because, below this size, it is difficult to determine whether
the deviation in distance between two paired-end reads is due to variation in
the library insert size distribution or to paired ends flanking a structural variant.
There are on average 26,393 SVs in classical inbred strains, and 92,205 in
wild-derived inbred strains, affecting 1.2% (33.0 Mb) and 3.7% (98.6 Mb) of
the genome, respectively. SVs with complex PEM patterns account for 1.4%
to 1.8% of all SVs identified in each strain, except in C57BL/6NJ, whose
genome is almost void of complex PEM patterns.
The majority of SVs in all strains are deletions and insertions with
simple PEM patterns; however, we have been able to identify a small subset
of complex SVs. Inversions are rare, and occur concurrently with a deletion
or an insertion about 50% of the time. Copy number gains are also rare (~100
per genome of each type), and cover from 1.8 Mb (NOD/ShiLtJ) to 16 Mb
(CAST/EiJ) of the genome. We also observe a small number of inversions
and deletions that occur in regions of copy number gain.
Sensitivity and specificity analyses
We established false positive and false negative rates for the automated
analysis in three ways. First, we used our manually identified set of SVs on
chromosome 19 where we found 932 deletions (684 type H and 248 type Q),
15 inversions (2 type H and 13 type Q) and three copy number gains (all type
H). Automated analysis of chromosome 19 detected between 83% and 86% of
manually-called deletions (at least 50 bp in size) depending on the strain
(Supplementary Table 5a). The false positive rate ranges from 3.1% to 4.6%
(Supplementary Table 5b). Although the false negative rate per strain ranges
from 14% to 17% when considering all types of deletions excluding type Q7, it
should be noted that the automated analysis accurately identified 94.8% of
sites containing a deletion.
Second, to ensure that our sensitivity and
specificity analyses were not vitiated because we used SVs from
chromosome 19 as a training set for the automated analysis, we derived a
second, smaller, set of manually curated deletions from a randomly chosen 10
Mb region (101Mb to 111Mb) from chromosome 3 in the strain C3H/HeJ.
Automated analysis of this region identified 43 deletions (82.7%) and called 2 false
positive deletions (4.4%). Third, we investigated the false negative rate for the
automated detection of deletions across the genome using our PCR validation
data of 267 simple deletions. Consistent with the chromosome 19 and
chromosome 3 analyses we found that the false negative rate for deletions
excluding type Q7 was between 9% and 15% (Supplementary Table 6a).
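The chromosome 3 figures can be checked with a few lines of arithmetic. This sketch rests on two assumptions that are not stated in the text: that the manually curated set contained 52 deletions (inferred from 43 being 82.7%), and that the false positive rate is the fraction of automated calls that were false (which reproduces the reported 4.4%). The function name is hypothetical.

```python
def detection_rates(true_positives, false_positives, manual_total):
    """Return (sensitivity, false_negative_rate, false_positive_fraction)."""
    sensitivity = true_positives / manual_total
    false_negative_rate = 1 - sensitivity
    # Fraction of automated calls that were not in the manual set (assumed definition).
    false_positive_fraction = false_positives / (true_positives + false_positives)
    return sensitivity, false_negative_rate, false_positive_fraction


# manual_total=52 is inferred from the reported percentage, not given in the text.
sens, fnr, fpf = detection_rates(true_positives=43, false_positives=2, manual_total=52)
print(f"sensitivity {sens:.1%}, false negatives {fnr:.1%}, false positives {fpf:.1%}")
```

Under these assumptions the false negative rate for this region is about 17%, at the upper end of the chromosome 19 range.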
We could not assess the performance of automated analysis to detect
SV types other than deletions using calls derived from manual inspection of
chromosome 19 because so few of these rearrangements were called. To do
this, we turned to PCR-based validation and found that the average false
negative rate was higher than for deletions, ranging from 17% to 55%
(Supplementary Table 6b). Automated analysis was less successful in
detecting the more complex rearrangements: of the 58 PCR validated
complex SVs only 33 were found.
SV mechanism of formation
Sequences within a structural variant (known mobile elements, pseudogenes
and variable number tandem repeats (VNTRs)) or sequence flanking its
breakpoints reveal the mechanism by which an SV arose. Accurate sequence-based
SV classification has two requirements: nucleotide level resolution of
the breakpoint and the ability to distinguish unambiguously between historical
insertions and deletions. Both of these factors are crucial for accurately
determining the mechanism by which a structural variant was formed.
We successfully carried out de novo local assembly and breakpoint
refinement of 81.3% of deletions and 74.2% of non-transposable element
insertions as described in ref. 18 (genome-wide breakpoint localization of
transposable element insertions is reported in our companion paper (Keane et
al., 2011)). Comparison of 1,314 breakpoints to the breakpoints delineated by
PCR and sequencing (Supplementary Methods), revealed that 57.7% of
breakpoints are exact and 86.5% are within 20 bp (Supplementary Table 7a).
We assessed the accuracy of deletion breakpoints in cases where the local
assembly strategy failed. We found that 83.3% are within 100 bp of the actual
breakpoint (Supplementary Table 7b). Breakpoint accuracy for insertions,
inversions and copy number gains is presented in Supplementary Tables
7c, 7d and 7e, respectively.
Using rat as an outgroup, we classified 19% of SVs as ancestral
deletions, 57% as ancestral insertions and the remainder (24%) were
indeterminate. We examined the sequence features of 40 SVs that failed the
outgroup analysis and found that in every case the regions contained highly
repetitive DNA, consisting primarily of transposon and transposon related
sequence.
[spretus and manual classification - Compared to the classification
using rat, we found relatively more ancestral deletions (35%) when using
Spretus as an outgroup. Finally we estimated the error rates of the
classification by manual inspection of the sequence at 250 SVs
(Supplementary Figure 2).]
The mechanism of formation of SVs that arise independently at the
same location in the genome (recurrent SVs) is thought to be different from
the mechanism of formation of non-recurrent SVs19,20. However we found little
evidence for recurrent SVs in the mouse. We used a set of 241 deletions
whose breakpoints we amplified and sequenced, and identified recurrent SVs
from finding different breakpoint sequences at the same genomic location in
different strains. Within the eight classical strains, size differences were found
at 2.5% (6/241) of SVs. Within all 17 strains, we found multiple alleles at 12%
of SVs, due almost entirely to the presence of different alleles originating from
the wild-derived inbred strains. In two cases we observed four alleles at an
SV locus, and in one case, on chromosome 10, five alleles. At this locus
AKR/J, CAST/Ei, PWK/Ph and SPRET/Ei all have SVs with different
breakpoints (Supplementary Table 8).
Classification of SVs and their size characteristics are summarized in
Figure 2. We found that the median length of all SVs is 349 bp, with modes
at 100 bp and 6400 bp, and with LINE insertions comprising the majority of the
larger insertions (Fig. 2b). The cutoff at 100 bp is artifactual in that we did not
include variants of less than 100 bp as SVs. We observed a lower density of
SVs on the X chromosome than the autosomes (4.97 per Mb compared to an
average of 14.45 per Mb, s.e. = 3.02).
The largest class of SVs consists of those formed by mechanisms
involving retrotransposons (LINEs (25%), LTRs (14%) and SINEs (15%)),
followed by mechanisms not involving retrotransposons (29%), VNTRs (15%)
and pseudogenes (2%). Outgroup analysis showed that the transposon-derived
structural variants arose almost exclusively from ancestral insertion
events (98.8%). Non-repeat mediated SVs were mainly a result of ancestral
deletion events (79%). Consistent with their role in the creation of SVs (), we
confirmed that segmental duplications (SDs) in the mouse genome are more
likely to contain SVs. Deletions larger than 1 Kb are twice as likely to occur
within segmental duplications. However, we were surprised to see that
deletions smaller than this are depleted in SDs in all strains (fold
change range = 0.43-0.64 and 0.3-0.39, in classical and wild-derived strains,
respectively).
We found microhomology (12-16bp) surrounding the breakpoints of
LINE and SINE derived SVs and shorter (6-8bp) stretches associated with
LTR SVs (Fig. 2c). Non-repeat mediated SVs were associated with short
segments of up to 7bp in length (Fig. 2c). We found no evidence of longer
micro-homology for non-repeat mediated SV formation, consistent with a
major role for fork stalling and template switching (FoSTeS). In total we found
an excess of SVs with microhomology of 2 bp or over compared to random
expectation (60% of the sample) (Fig. 2c). We found no difference in the
micro-homology profile of ancestral insertions compared to ancestral deletions
(Supplementary Fig. 3).
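Microhomology at a deletion breakpoint can be measured directly from the reference sequence once the breakpoint is known at base pair resolution. The following is a minimal sketch under stated assumptions, not the method used in this study: the function name and the 0-based, half-open coordinate convention are hypothetical, and no correction is made for tandem repeats, where the count can exceed the deletion length.

```python
def microhomology_length(ref, start, end):
    """Bases of microhomology for a deletion of ref[start:end].

    Counts how far the deletion could be shifted right or left without
    changing the resulting sequence: identical bases at the two junctions.
    """
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    left = 0
    while start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1
    return left + right


# Toy sequence: the deleted segment starts with "CT" and is followed by "CT".
print(microhomology_length("GGTCTAAAAACTCC", 3, 10))
```

A random-expectation baseline, as referred to in the text, could be obtained by applying the same function to deletions placed at random positions.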
Automated analysis may miss unexpected features of breakpoint
sequence. Therefore we examined 249 SVs where breakpoints were obtained
by PCR and sequencing and made two observations. First, we identified
cases where the machinery of SV formation has resulted in a complex mixture
of insertions and deletions (Fig. 3a). While almost all SVs due to LINE, ERV,
SINE and VNTR insertions do not show any missing nucleotides at their
breakpoints (95%, 93.3%, 92.3% and 92.3% respectively), there are rare
cases (4 LINEs, 2 ERVs, 1 SINE and 1 VNTR) during which the insertion
machinery also deletes nucleotides (from 1 bp to 289 bp). For three LINEs,
the presence of an ancestral microdeletion is directly linked to the absence of
the target site duplication (TSD), suggesting a dual mechanism of SV
formation: union between DSB repair processes and LINE retrotransposition.
Five of the eight inversions for which we had breakpoint sequence data had
deletions right next to the inversion (62.5%).
The second unexpected result to emerge from exact breakpoint
sequence of ancestral deletions was that in all cases the presence of SNPs in
the microhomology region was correlated with the presence of the SV (Fig.
3b). In every case the SNP elongates the microhomology. This phenomenon
is rare: we only observed five (4.5%) cases amongst our 113 ancestral
deletions where a SNP and SV formation co-segregate. We found a similar
relationship between a SNP formed in the TSD and the presence of an
ancestral insertion (Figure 3b). Fifteen ancestral insertions (16%) had SNPs or
short indels within their TSD, coincident with an insertion (Supplementary
Table 8).
Impact of SVs on gene function
In this section we report results of assessing the impact of SVs on phenotypes
in three ways: (i) the relationship between the position of SVs and the position
of genes; (ii) changes in expression of genes overlapping, or nearby, an SV;
(iii) association between SVs and phenotypes in an outbred population of
mice.
Across all strains, 10,291 genes are partially or completely deleted;
reducing to 5,115 genes when only the classical laboratory strains are
considered. The introns of 9,802 genes are affected (laboratory strains:
4,885), as are 4,530 promoter regions (laboratory strains: 1,641), and 1,631
genes have deleted exons (laboratory strains: 648) (Supplementary Table 9). We
investigated whether this represented an enrichment or a depletion by
counting the number of SVs that overlapped genes and then comparing this to
a null distribution of the expected number of overlaps, obtained by
permutation (Supplementary Table 10). We found that relative deletions are,
in all strains except C57BL/6N, significantly depleted (P<0.01) in genes (fold
change 0.91), introns (mean fold change 0.93), exons (fold change 0.22;
including C57BL/6N) and promoter regions (fold change 0.77). However,
examining the overlap between genes and the subset of relative deletions that
correspond to inserted transposable elements in the reference genome, we
found that SINEs are significantly enriched in the introns of genes (P<0.01,
fold change 1.34).
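The fold changes reported above are ratios of the observed overlap count to the count expected under a permutation null. One way such a test could be set up is sketched below. The text does not describe the permutation scheme, so everything here is an assumption: SVs are re-placed uniformly on a single linear genome with their lengths preserved, intervals are half-open, and the function names and parameters are hypothetical.

```python
import random


def count_overlaps(svs, genes):
    """Number of SVs (start, end) overlapping at least one gene interval."""
    return sum(
        any(s < g_end and g_start < e for g_start, g_end in genes)
        for s, e in svs
    )


def permutation_fold_change(svs, genes, genome_length, n_perm=1000, seed=0):
    """Return (observed, expected, fold_change, p_depleted)."""
    rng = random.Random(seed)
    observed = count_overlaps(svs, genes)
    null_counts = []
    for _ in range(n_perm):
        shuffled = []
        for s, e in svs:
            length = e - s
            # Re-place each SV uniformly, keeping its length (assumed null model).
            new_start = rng.randrange(0, genome_length - length + 1)
            shuffled.append((new_start, new_start + length))
        null_counts.append(count_overlaps(shuffled, genes))
    expected = sum(null_counts) / n_perm
    # One-sided empirical P value for depletion.
    p_depleted = sum(c <= observed for c in null_counts) / n_perm
    fold = observed / expected if expected > 0 else float("nan")
    return observed, expected, fold, p_depleted
```

A fold change below 1 with a small `p_depleted` would correspond to the depletion reported for genes, exons and promoter regions. A realistic null would also need to respect chromosome boundaries and mappability.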
The relative depletion of SVs within genes implies a proportionate
deficit in their phenotypic consequences, an expectation upheld by analyses
of SV impact on gene expression and phenotypes measured in the whole
animal. First, we find that SVs are relatively unlikely to be the cause of
cis-acting expression QTLs. We examined 833 cis-acting eQTLs that influence
expression of transcripts from the hippocampus of outbred mice derived from
eight of the sequenced strains (Huang 2009). Applying a test that
discriminates between variants that are likely to be functional and those that
are not (Yalcin) we find that SVs constitute 0.3% of all causal variants,
compared to 0.9% of all non-causal variants (P < 2.2E-13, χ2 = 53.8).
Second, the heritability attributable to SV effects on gene expression is
small. Figure 4a shows scaled variances in gene expression from brain
RNAseq data measured between and within strains for five categories of SVs.
Assuming variation within strains is due to environmental factors and variation
between strains is due to both environmental and genetic factors, the
difference between the two variances is a measure of heritability. No
category of SV accounts for more than 10% of the heritability. Since many
transcripts overlap multiple small SVs (median of 3, maximum of 216), we
hypothesized that heritability might be related to the amount of gene
overlapped. For each transcript we summed the amount of deleted DNA and
expressed this as a proportion of the total length of the gene. Overlap
proportions of 50% or more make a disproportionately large contribution to
heritability: 25% of the variance is attributable to SVs in this category,
compared to 7.8% for transcripts where SVs overlap less than 50% of the
gene. However, overlaps of this size are rare, affecting less than 3% of
transcripts.
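The variance comparison behind this heritability measure can be illustrated for a single transcript. This is a minimal sketch, not the analysis behind Figure 4: the function name, the input layout (replicate expression values per strain) and the use of population variances are assumptions, and the scaling applied in the figure is not reproduced.

```python
from statistics import mean, pvariance


def variance_difference(expression_by_strain):
    """Return (between, within, between - within) for one transcript.

    expression_by_strain maps a strain name to its replicate expression values.
    """
    strain_means = [mean(values) for values in expression_by_strain.values()]
    # Between-strain variance: spread of the strain means (genetic + environmental).
    between = pvariance(strain_means)
    # Within-strain variance: average spread among replicates (environmental).
    within = mean([pvariance(values) for values in expression_by_strain.values()])
    return between, within, between - within


# Hypothetical values for two strains with two replicates each.
print(variance_difference({"strainA": [1.0, 1.2], "strainB": [3.0, 3.2]}))
```

A difference near zero indicates that strain identity explains little beyond replicate noise for that transcript.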
We also observed only small effects on gene expression from SVs that
lie outside a gene. Figure 4b shows between and within strain variances for
SVs lying at distances from less than 2 Kb to more than 40 Kb from
transcripts with no SV overlap (the density of SVs meant that we found too
few transcripts with SVs 60 Kb or more distant to analyze). For these
analyses we measured the closest distance to either the start or end of the
gene. Heritability attributable to SVs within 2 Kb of the gene is 2%, and falls
as the distance from the gene increases. We observed a non-significant
increase in the most distant category (greater than 40 Kb).
Third, SVs are unlikely to be responsible for a QTL for a phenotype
measured in the whole animal. We know this from genetic association with
100 phenotypes measured in over 2,000 heterogeneous stock (HS) mice,
animals that are descended from eight of the sequenced strains (A/J, AKR/J,
BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J)21. As described in
our companion paper, we used imputation to genotype SVs and then applied
a test that discriminates between variants that are likely to be functional and
those that are not22. We were thus able to test 281,246 SVs where we were
certain that the SDP was correct (Supplementary Methods). Relatively high
rates of SV mutation (Hall NG) might invalidate the imputation (the HS
animals are at least 60 generations distant from the sequenced strains), so
we genotyped 100 HS animals using a high-density array (Agam and
Supplementary Methods). 194 deletions could be genotyped on the array
(with an additional 47 deletions when we allow for non-segregating SVs in the
HS). In every case imputation correctly predicted the logP obtained from
ANOVA carried out using the array-based genotypes.
We identified 290 QTLs where SVs were among the variants most
likely to be functional, but in all these cases the SVs were only a subset of the
total number of functional variants. Just as with the cis-expression QTLs we
found a small but significant deficit in SVs among the functional variants
(0.36% compared to 0.54% among the non-functional, P < 1E-16, χ2 = 72.1).
460 functional SVs overlap 245 genes (by at least one base pair). There was
evidence to suggest that functional SVs are enriched for genes compared to
non-functional SVs (Supplemental Figure XA; χ2 = 9.7, P = 0.002), but there
was no evidence for enrichment of exons or promoter regions (18 functional
SVs overlap exons, and 58 functional SVs overlap promoter regions;
Supplemental Figure XB and XC).
However there are loci where the functional variant is most likely to be
an SV. As shown in our companion paper, larger effect QTLs are more likely
to arise from SVs. Our prior analysis also suggests that larger effect QTLs
are likely to involve known functional regions. We identified 15 QTLs where
the SV overlapped an exon or flanking region (2 Kb up or downstream of a
gene), and where the QTL effect size is in the top 5% of the distribution.
Table X lists these SVs, the genes they affect and the putative phenotype
with which they are associated. Complementation of the deletion of the H2-Ea
promoter (recorded as insertion in table X as the reference strain carries the
deletion) confirmed the effect of this SV on the T-cell phenotype (); further
work is needed to confirm the others.
Discussion
Our results are important in three respects: first we find an unexpectedly large
number of SVs with remarkably diverse molecular architecture, thus providing
a catalogue of the most dynamic and variable regions of the mouse genome.
Second, we identify breakpoints at nucleotide level resolution, giving a
genome wide picture of how SVs originate. Third, we demonstrate that,
despite their abundance, SVs make relatively little functional impact, as
assessed by their effects on gene expression and phenotypic variation in the
whole animal.
We were able to find more, and more complex, SVs because we relied
on manual inspection of the PEM results, combined with molecular validation,
before using automated calling methods. Previous studies have revealed the
noisiness of sequence-based methods of SV calling (), due in part to the
multiplicity of forms, the presence of insertions, deletions and inversions
often in close proximity to each other, and the difficulty of mapping sequence
reads back to repetitive genomes, which is particularly challenging when
repetitive sequences act as a nursery for SVs. Nevertheless, we have shown here that
it is possible to identify and classify up to 19 SV types and thereby calibrate
automated methods to generate genome-wide SV calls of superior accuracy.
The SVs we find have two distinguishing characteristics: first, typically
they are small. For relative deletions, whose size we know accurately, the
median is 385 bp. In comparison, the median size of SVs in a recent
high-density array analysis of the genomes of 12 laboratory strains and
wild-derived mice was 61 Kb (Henrichsen 2009) and about 1.9 Kb from a PEM
analysis of DBA/2J (Quinlan 2010). Second, their density means that we
frequently find regions with high concentrations of small rearrangements.
These two features emphasize the need for methods of SV identification at
base pair, or near base pair resolution. Otherwise not only are many SVs
missed, but those recognized are misclassified: a mixture of small deletions
and insertions will be mistaken for a large SV of a single type.
It should also be noted that we do not report SVs smaller than 100 bp, an
arbitrary limit imposed by the sequencing technology. Variants smaller than
100 bp are described in our companion paper, but the sensitivity and
specificity of the methods used for their detection do not approach those for
the PEM SV detection described here. Variants in this size range remain
poorly characterized by current sequencing technologies.
Our second important finding is the catalogue of SV mechanisms
based on breakpoint sequence. We were able to map almost 60% of relative
deletions to base pair resolution, allowing us to classify SVs by the
mechanism that created them. We find that the primary origin of structural
variation between mouse strains is attributable to L1 retrotransposons. For
reasons still unexplained, mice differ from humans, in whom L1
retrotransposition comes third after microhomology-mediated processes and
nonallelic homologous recombination as the predominant processes in
generating SVs (Eichler 2010). Investigation of the precise breakpoint
sequence revealed that in 4% of cases the machinery of SV formation gives rise
to a complex mixture of insertions and deletions. We also find evidence that in
5% of cases, SNPs, extending regions of microhomology at breakpoints,
occur in strains with an SV: since we found no instances where a SNP in the
microhomology region occurs without the SV, we assume that the formation of
deletion and SNP are related.
It should be noted that, in contrast to human SV studies, we do not
distinguish between SVs found in multiple unrelated individuals (recurrent
rearrangements) and non-recurrent rearrangements. NAHR is believed to be
the major mechanism responsible for the former while fork stalling and
template switching (FoSTeS) and/or microhomology-mediated break-induced
replication (MMBIR) mechanisms may be important for the latter (Zhang
2009). We have difficulty distinguishing recurrent SVs because the 17 strains
we have sequenced are not completely unrelated, which means we cannot
separate recurrent SVs from those that are identical by descent. On one
hand, the 13 laboratory strains are derived from a relatively small set of
founders; on the other hand, the wild-derived strains include animals that are
very distantly related, to the point of being separate species in the case of
SPRET/EiJ.
Our third important observation is that SVs have relatively little impact
on gene function. This question receives attention because results from
genome-wide association studies (GWAS) have revealed that common SNPs
(minor allele frequency > 5%) explain only a part of trait heritability suggesting
that SVs might be a major unrecognized contributor to phenotypic variation ().
Available evidence has not yet resolved whether or not this is so. Analysis of
human lymphoblastoid cell lines attributed at least 8.5-17.7% of heritable
gene expression variation to CNVs (Stranger 2007).
Importantly, this
heritability was not shared with common SNPs, potentially making CNVs a
contributor to the missing heritability of GWAS. In mice, SVs overlapping a
gene were estimated to contribute to a substantial proportion of
between-strain expression variance (up to 74%), which, when put together with the
prevalence of SVs in the genome, implies that they might be responsible for a
considerable fraction of heritable gene expression variance. If the genetic
basis of gene expression were a model for understanding the molecular basis
of other phenotypes, then SVs would be a major player. Two recent analyses
of the association between SVs and disease phenotypes in humans provide
little support for this view: common SVs are no more likely than common
SNPs to contribute to phenotypic variation (WTCCC 2010, Conrad 2010).
However, rare CNVs (minor allele frequency < 5%) of large effect (odds ratio >
2), which could not be detected using the technologies available, might still be
important contributors.
Our findings make two important contributions to this debate. First, we
find that SVs overlapping a gene make a much smaller contribution than
expected, not much more than 10%, and we find limited evidence that they
affect the expression of flanking genes. This might be due to our analysis of
very large numbers of small SVs, but we find that even when SVs overlap
more than 50% of the extent of the gene they account for less than a third of
the heritability. The most likely explanation is that previous array-based
studies conflated under one apparently large SV the effects of numerous
smaller rearrangements together with regions of diploid DNA containing other
variants that influenced gene expression.
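The 50% threshold mentioned above depends on how much of a gene's extent an SV covers. The following is a minimal sketch of that calculation, assuming 1-based inclusive coordinates on the same chromosome; the function name and the example coordinates are hypothetical and are not taken from the analysis pipeline used in this study.

```python
def gene_overlap_fraction(gene_start, gene_end, sv_start, sv_end):
    """Return the fraction of the gene's extent covered by the SV.

    Assumes 1-based inclusive coordinates on the same chromosome
    (an assumption made for this illustration).
    """
    gene_length = gene_end - gene_start + 1
    # Length of the shared interval; negative when the intervals are disjoint.
    shared = min(gene_end, sv_end) - max(gene_start, sv_start) + 1
    return max(0, shared) / gene_length


# Hypothetical 1,000 bp gene with three hypothetical SVs.
half = gene_overlap_fraction(1000, 1999, 1500, 2500)
none = gene_overlap_fraction(1000, 1999, 3000, 4000)
full = gene_overlap_fraction(1000, 1999, 900, 2100)
```

Under these assumptions, an SV would count as overlapping more than 50% of a gene when the returned fraction exceeds 0.5.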
Second, our analysis of the phenotypic consequences of SVs on QTLs
for multiple phenotypes also points to a relative deficit of SVs as the molecular
basis of complex phenotypes. By working with an outbred population where
all chromosomes are descended from known progenitors, imputation
effectively reconstitutes the genomes of all animals, so that we can detect the
effects of all variants, both common and rare. Our results indicate that
common and rare SVs make less of a contribution to phenotypic variation
than we would expect given their abundance in the genome. However, this
conclusion may not apply to other species, or indeed other populations of
mice, because selection of inbreeding and homozygosity will purge the
genome of variants that could be maintained in heterozygous freely mating
populations.
Finally, it should be stressed that SVs are the responsible molecular
lesion at some QTLs. Our analysis has highlighted those where this is most
likely to be so. Encouragingly, our computational predictions include a
promoter deletion whose role we have recently confirmed through
transgenesis (). This is important because genetic association studies
typically implicate SNPs as the causative variant at a QTL. Biological insight
into a phenotype, however, requires discovering which gene is involved, still a
major challenge if the starting point is a SNP. The task is considerably easier
when an SV is identified as the causative variant, particularly if the SV
removes a gene segment, effectively creating a null allele, now relatively
straightforward to model in mice (). Thus the discovery of causal SVs is likely
to provide biological insights out of proportion to their relatively small
contribution to phenotypic variance.
Acknowledgements
We thank Adam Whitley, Giles Durrant, Andrew Marc Hammond, Danica Joy
Fabrigar, Lucia Chen, Martina Johannesson and Enzhao Cong for helping B.Y.
with laboratory work. This project was supported by The
Medical Research Council, UK and the Wellcome Trust. DJA is supported by
Cancer Research UK.
Author contributions
D.J.A. and J.F. conceived the study and directed the research.
Figure Legends
Figure 1. Identification of structural variants. a) Venn diagrams showing the
overlap between deletion SVs (relative to C57BL/6J) detected in our study
(blue) and those published elsewhere (Agam et al., 2010 in red and Quinlan et
al. in green), in DBA/2J. b) Blue boxes represent deletions, pink boxes
insertions, orange boxes inversions and yellow boxes duplications; all types of
structural variants are relative to the reference genome sequence. We found
six basic types of structural variant: deletion (del), insertion (ins), inversion
(inv), tandem duplication (dup), inverted tandem duplication (not drawn here)
and dispersed duplication. c) Additionally, eight complex types of structural
variant were found: deletion with an insertion (del+ins), linked deletion (normal
copy of small length flanked by two deletions), deletion within a duplication
(del in dup), inversion with flanking deletion(s) (for example del+inv+del),
inversion with an insertion (inv+ins), inversion within a duplication (inv in dup),
a linked insertion (linked ins) where the inserted sequence is copied from
another location in the vicinity of the inserted site and an inverted linked ins
(not drawn here) which has a similar pattern to a linked insertion but with the
inserted sequence being inverted. d) Example of paired-end mapping (PEM)
pattern of a del+inv+del. Green arrows represent primers used for PCR
amplification and sequencing reactions. Primer names provide their positional
information, relative to the reference genome. Black arrows attached with a
curved line represent paired-ends, whereas single black arrows represent
singleton reads. Grey straight lines indicate mapping of the test reads onto the
reference genome. When the inversion is smaller than the insert size,
paired-end reads will flank both deletions and inversion, as shown here. In other
cases, decreased read depth will indicate flanking deletions. e) Example of
PEM pattern of an inv+ins, with PCR data across the eight classical strains.
Hyperladder II is used as the molecular size marker. The amplicon size for BALB/cJ,
C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than for the other strains,
indicative of the insertion. The inversion is revealed by sequencing. The complete list
of patterns is drawn in Supplementary Figure 1, with examples and PCR
validation data.
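The paired-end mapping patterns described in this legend rest on two properties of each read pair: the relative orientation of the two reads and the distance between them compared with the library insert size. The sketch below illustrates the commonly used signatures for a forward-reverse library. It is not the detection pipeline used in this study; the class, the function, the threshold of three standard deviations and the example coordinates are all assumptions made for illustration, and the span is simplified to the distance between leftmost mapped positions.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record of a mapped read pair."""
    pos1: int      # leftmost mapped coordinate of read 1
    strand1: str   # '+' or '-'
    pos2: int      # leftmost mapped coordinate of read 2
    strand2: str   # '+' or '-'


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Assign a read pair to a simple PEM signature.

    Assumes a forward-reverse library, so a concordant pair has the
    leftmost read on '+' and the rightmost read on '-'.
    """
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    span = right[0] - left[0]
    if left[1] == right[1]:
        return "inversion"            # both reads map to the same strand
    if left[1] == "-" and right[1] == "+":
        return "tandem_duplication"   # everted (outward-facing) pair
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"             # reads map further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"            # reads map closer together than expected
    return "concordant"


# Hypothetical library with a 300 bp mean insert size and 30 bp standard deviation.
examples = [
    classify_pair(ReadPair(1000, "+", 1300, "-"), 300, 30),
    classify_pair(ReadPair(1000, "+", 5000, "-"), 300, 30),
    classify_pair(ReadPair(1000, "+", 1100, "-"), 300, 30),
    classify_pair(ReadPair(1000, "+", 1300, "+"), 300, 30),
    classify_pair(ReadPair(1000, "-", 1300, "+"), 300, 30),
]
```

Complex patterns such as del+inv+del combine several of these signatures at one locus, which is why they also require read depth and local assembly to resolve.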
Figure 2. Classification of structural variants.
LEGEND TO ADD
Figure 3. Breakpoint analysis of a complex SV. a) Complex SV, involving
several genomic rearrangements including an inversion, deletion, short
insertion and copy number gain (CNG), is displayed relative to its genic
location along Zbtb10, the zinc finger and BTB domain containing 10 gene.
PCR amplification using forward (F) and reverse (R) primers revealed an AT
insertion at the first breakpoint (J1), followed by an inversion of 125 bp which
encompasses an inverted copy number gain of the 22 bp preceding J1, as
seen in J2. Finally, breakpoint 3 (J3) revealed a deletion of 813 bp. Using
RepeatMasker, a SINE element was found to be part of the deletion. b) PCR
picture of the amplification using F and R primers (primer sequences available
in Supplementary Table xxx). Hyperladder II was used as the size marker.
C57BL/6N and LP/J show a normal size of 1604 bp, whereas A/J, AKR/J,
BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp. c)
Sequencing data across J1, J2 and J3 breakpoints. A colour code is used to
indicate each type of SV: blue is used for the 22 bp inverted copy number
gain, green for the inversion and red for the deletion. When the test strain
matches the reference strain, both are in the same colour.
b) Relationship between SNP and SV formation. a) Relationship between
SNP and ancestral deletion formation. Two SNPs lying on the 6 bp
microhomology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522)
correlated with the presence of the SV. On the left, PCR
amplification of the SV is shown across the eight classical strains (A/J, AKR/J,
BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). Hyperladder II was
used as DNA molecular weight marker. Some strains show a smaller
amplicon compared to other strains. On the right, sequencing traces are
shown for a test strain (A/J) and the reference strain (C57BL/6N). Note that the
traces of all other test strains are identical to the one shown here. An asterisk is used
to emphasize the microhomology of 6 bp (GAACTA). The presence of two
SNPs (C->G and T->A) in all test strains (here only shown in A/J) is
associated with the presence of the ancestral deletion. b) Relationship
between SNP and ancestral insertion formation. PCR data is shown on the
right with amplification in A/J, AKR/J and BALB/cJ. Strains with the
ancestral insertion (C57BL/6N, CBA/J, DBA/2J and LP/J) have failed to
amplify due to size. The insertion is a LINE on chromosome 13
(119,134,049-119,135,126). On the left, a sequencing trace is shown over the TSD
for a strain that doesn't have the ancestral insertion. The TSD is 17 bp
(AAGAATGTCAGCAAAGT) and at the 12th position, a SNP (G->C) is
observed in all the strains that have the insertion.
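Microhomology at a deletion breakpoint, such as the 6 bp GAACTA motif in this legend, can be measured as the number of positions by which the deleted interval can be shifted without changing the resulting allele. The sketch below is one possible way to compute this, assuming 0-based half-open coordinates; the function name and the flanking sequence in the example are hypothetical, and only the GAACTA motif is taken from the legend.

```python
def microhomology_length(ref, start, end):
    """Length of microhomology around a deletion of ref[start:end].

    Counts how far the deleted interval can slide right and left
    while still producing the same deleted allele.
    Coordinates are 0-based and half-open (an assumption for this sketch).
    """
    # Slide right: first deleted bases match the bases just after the deletion.
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    # Slide left: bases just before the deletion match the last deleted bases.
    left = 0
    while start - left - 1 >= 0 and ref[start - left - 1] == ref[end - left - 1]:
        left += 1
    return left + right


# Hypothetical reference: the GAACTA motif flanks both ends of the deleted interval.
reference = "TTTT" + "GAACTA" + "CCCCCCCC" + "GAACTA" + "GGGG"
homology = microhomology_length(reference, 4, 18)
```

In this toy sequence the deletion removes one copy of the motif plus the intervening bases, so the breakpoint can be placed anywhere within the 6 bp motif.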
Tables
Table 1. Structural variants in 17 inbred strains
                Simple                      Complex
Strain          Del     CNG   Inv   Ins     Del+Ins  Nested  Linked Ins/Del  Inv+Del/Ins
129P2/OlaHsd    16184   57    74    15476   102      27      239             68
129S1/SvImJ     17169   70    88    11375   67       32      285             67
129S5/SvEvBrd   15967   72    67    8885    40       41      210             58
A/J             16078   69    92    12065   55       28      237             67
AKR/J           15692   88    89    14434   84       13      260             82
BALB/cJ         14761   82    87    10462   43       17      192             58
C3H/HeJ         15952   94    94    11960   86       16      254             76
164
44
0
3
5
1
CAST/EiJ        50637   361   224   33650   120      239     826             265
CBA/J           16878   79    83    10759   60       16      230             78
DBA/2J          17314   67    83    10427   45       29      306             75
LP/J            16834   64    88    12608   60       30      271             69
NOD/ShiLtJ      16903   51    116   13088   46       16      307             79
NZO/HlLtJ
15355
62
31
23
168
62
PWK/PhJ         53612   96    272   34553   160      60      1104            268
SPRET/EiJ       90533   112   470   63147   426      110     1956            554
WSB/EiJ
22028
88
37
265
105
C57BL/6NJ
6
71
208
9353
97 12386
60
Table 1: Structural variants in 17 inbred strains. Listed are structural variants
with a minimum size of 100 bp. In addition to the four main SV types with
simple PEM patterns, we have elucidated a number of complex patterns
(Figure 1). Complex SVs are identified from local assembly analysis and
overlap assessment of SV calls (see Supplementary Methods). We
differentiate between insertions, where we can determine the insertion points
from read pair patterns and local assembly, and copy number gains (CNG),
where a duplication is inferred from an increase in read depth. CNGs include
tandem duplications, which are inferred from both read depth and read pair
evidence. There is minimal overlap between the insertions and the copy
number gains, since the insertion discovery algorithms find de novo insertions
and TE insertions. Del=deletion; CNG=copy number gain; Inv=inversion;
Ins=insertion; Del+Ins=deletion plus insertion; Nested=SV in a CNG region;
Linked Ins/Del=linked insertion or linked deletion; Inv+Del/Ins=inversion plus
deletion(s) or inversion plus insertion.
Table 2. Sequence features at SV breakpoints and inferred mechanism
Table 2. Sequence features at SV breakpoints and inferred mechanism. In a,
the percentage of each sequence feature at precise breakpoint is given per
category of ancestral SV (insertion, deletion, inversion, CNG and multiple
events). In b, the percentage of each inferred mechanism is given relative to
all SV regions presented in a. Empty cells indicate that a category is not applicable; all
abbreviations are listed in the Supplementary Glossary.
Table 3. QTL associated with SVs
Phenotype                                  Chr  Start      Stop       Structural variant  Strains                  Gene
Red cells: mean cellular haemoglobin       7    111415000  111415000  DEL                 A, BALB, C3H, CBA        Trim5          e
Wound healing                              7    90731810   90731831   INS|IAPTypeI|ERV    A, AKR, BALB             TMC3           u
Weight                                     3    104731236  104731238  INS|SINE|SINE       C3H                      St7l           d
Red cells: mean cellular haemoglobin       9    107952000  107960000  GAIN                C3H                      Gmppb          d
Red cells: mean cellular volume            8    87957077   87957262   INS|LINE|LINE       A, C3H, CBA, DBA         4921524J17Rik  u
Mean platelet volume                       1    175158883  175158895  INS                 A                        Fcer1a         u
T-cells: CD4/CD8 ratio                     17   34483680   34483692   INS                 A, AKR, C3H, CBA, DBA    H2e-a          u
T-cells: %CD3                              4    130038389  130038391  INS|SINE|SINE       LP                       Snrnp40        u
Hippocampus cellular proliferation marker  4    49690361   49690364   INS|SINE|SINE       CBA, DBA                 Grin3a         d
Hippocampus cellular proliferation marker  13   113783195  113783359  DEL                 A                        Gm6320         u
OFT Total activity                         13   116943933  116944422  DEL                 A, C3H, DBA, LP          Gm6404         u
Home cage activity                         4    108951256  108951281  INS|IAPTypeI|ERV    A, DBA, LP               Eps15          u
Serum urea concentration                   11   115106125  115106247  DEL                 BALB                     Tmem104        e
Red cells: mean cellular haemoglobin       7    111504853  111505190  INSi                AKR, A, CBA, BALBc, DBA  Trim30b        e
OFT Total activity                         2    144402762  144402974  DEL|SINE            DBA, LP                  Sec23b         e
Figure 1. Identification of structural variants.
Figure 2. Classification of structural variants.
Figure 3. Breakpoint analysis of a complex SV and relationship between SNP
and SV formation
Methods
An outline of the methods applied in this paper is provided in the
Supplementary Methods.
References
1. Iafrate, A. J. et al. Detection of large-scale variation in the human
   genome. Nat Genet 36, 949-951 (2004).
2. Korbel, J. O. et al. Paired-end mapping reveals extensive structural
   variation in the human genome. Science 318, 420-426 (2007).
3. Sebat, J. et al. Large-scale copy number polymorphism in the human
   genome. Science 305, 525-528 (2004).
4. Tuzun, E. et al. Fine-scale structural variation of the human genome.
   Nat Genet 37, 727-732 (2005).
5. Buchanan, J. A. & Scherer, S. W. Contemplating effects of genomic
   structural variation. Genet Med 10, 639-647 (2008).
6. Hurles, M. E., Dermitzakis, E. T. & Tyler-Smith, C. The functional
   impact of structural variation in humans. Trends Genet 24, 238-245
   (2008).
7. Mills, R. E. et al. Mapping copy number variation by population-scale
   genome sequencing. Nature 470, 59-65 (2011).
8. Perry, G. H. et al. The fine-scale and complex architecture of human
   copy-number variation. Am J Hum Genet 82, 685-695 (2008).
9. Agam, A. et al. Elusive copy number variation in the mouse genome.
   PLoS One 5 (2010).
10. Cahan, P., Li, Y., Izumi, M. & Graubert, T. A. The impact of copy
    number variation on local gene expression in mouse hematopoietic
    stem and progenitor cells. Nat Genet 41, 430-437 (2009).
11. Graubert, T. A. et al. A high-resolution map of segmental DNA copy
    number variation in the mouse genome. PLoS Genet 3, e3,
    doi:10.1371/journal.pgen.0030003 (2007).
12. Akagi, K., Li, J., Stephens, R. M., Volfovsky, N. & Symer, D. E.
    Extensive variation between inbred mouse strains due to endogenous
    L1 retrotransposition. Genome Res 18, 869-880 (2008).
13. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural
    variant breakpoints in the mouse genome. Genome Res 20, 623-635
    (2010).
14. Gilbert, N., Lutz-Prigge, S. & Moran, J. V. Genomic deletions created
    upon LINE-1 retrotransposition. Cell 110, 315-325 (2002).
15. Symer, D. E. et al. Human L1 retrotransposition is associated with
    genetic instability in vivo. Cell 110, 327-338 (2002).
16. Garvey, S. M., Rajan, C., Lerner, A. P., Frankel, W. N. & Cox, G. A.
    The muscular dystrophy with myositis (mdm) mouse mutation disrupts
    a skeletal muscle-specific domain of titin. Genomics 79, 146-149
    (2002).
17. Henrichsen, C. N. et al. Segmental copy number variation shapes
    tissue transcriptomes. Nat Genet 41, 424-429 (2009).
18. Wong, K., Keane, T. M., Stalker, J. & Adams, D. J. Enhanced structural
    variant and breakpoint detection using SVMerge by integration of
    multiple detection methods and local assembly. Genome Biol 11, R128
    (2010).
19. Egan, C. M., Sridhar, S., Wigler, M. & Hall, I. M. Recurrent DNA copy
    number variation in the laboratory mouse. Nat Genet 39, 1384-1389
    (2007).
20. Zhang, F., Carvalho, C. M. & Lupski, J. R. Complex human
    chromosomal and genomic rearrangements. Trends Genet 25, 298-307 (2009).
21. Valdar, W. et al. Genome-wide genetic association of complex traits in
    heterogeneous stock mice. Nat Genet 38, 879-887 (2006).
22. Yalcin, B. et al. Genetic dissection of a behavioral quantitative trait
    locus shows that Rgs2 modulates anxiety in mice. Nat Genet 36, 1197-1202 (2004).