Sequence based characterization of structural variation in the mouse
genome
Binnaz Yalcin1†, Kim Wong2†, Avigail Agam1†, Martin Goodson1†, Thomas M. Keane2,
Leo Goodstadt1, Amarjit Bhomra1, Polinka Hernandez-Pliego1, Helen Whitley1,
James Cleak1, Deborah Janowitz1, Richard Mott1, David J. Adams2,*, Jonathan
Flint2,*
1The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK
2The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
3MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK
†Co-first authors
*Correspondence to:
Dr. David Adams, Wellcome Trust Sanger Institute, Hinxton, Cambs, CB10 1SA, UK. Ph: +44 (0) 1223 86862; Fax: +44 (0) 1223 494919; Email: [email protected]
Prof. Jonathan Flint, Wellcome Trust Centre for Human Genetics, Oxford, OX3 7BN, UK. Ph: +44 (0) 1865 287512; Fax: +44 (0) 1865 287501; Email: [email protected]
Abstract
The importance of structural variants (SVs) in DNA as a cause of quantitative
variation and as a contributor to disease is unknown, but without knowing how
many SVs there are, and how they arise, it is difficult to discover what they do.
Combining experimental and automated analysis of the mouse genome
sequence, we identified 0.7M SVs in thirteen classical and four wild-derived
inbred mouse strains. The majority of SVs are less than 1 kilobase in size and
98% are deletions or insertions. The breakpoints of 58% were mapped to
base pair resolution allowing us to confirm that insertion of retrotransposons
causes more than half of SVs. Yet, despite their prevalence, SVs are less
likely than other sequence variants to cause gene-expression or quantitative
phenotypic variation. We identified only 22 SVs that disrupt coding exons,
acting as rare variants of large effect on gene function. One third of the genes
so affected have immunological functions. Our catalogue provides a starting
point for the analysis of the most dynamic and complex regions of genomes
from a genetically tractable model organism.
Introduction
Structural variation (SV) is believed to be widespread in mammalian genomes
{Iafrate, 2004 #91;Korbel, 2007 #93;Sebat, 2004 #90;Tuzun, 2005 #92} and
an important cause of disease{Buchanan, 2008 #88;Hurles, 2008 #75}, but
just how abundant and important structural variants (SVs) are in shaping
phenotypic variation remains unclear. Understanding what SVs do depends
on understanding what they are, where they occur and how they arise: large,
recurrent SVs (SVs occurring in multiple individuals with clustering of
breakpoints and sharing a common interval and size) coinciding with genes
are far more likely to contribute to phenotypic variation than small non-recurrent SVs within intergenic regions.
The preeminent organism for modeling the relationship between
phenotype and genotype, including SVs, is the mouse, but our catalogue of
SVs in this animal is incomplete. Estimates of SV numbers and the proportion
of the mouse genome they occupy vary considerably, from figures of a few
hundred to over 7,000 {Li, 2004 #322; Snijders, 2005 #320; Graubert, 2007
#128; Egan, 2007 #98; Cutler, 2007 #321; Quinlan, 2010 #52}, affecting from
3.2% to more than 10% of the genome {Cahan, 2009 #270; Henrichsen, 2009
#269}. Incompleteness and inconsistencies are largely due to reliance on
differential hybridization of genomic DNA to oligonucleotide arrays (REF TO
AGAM, 2010?), a technology blind to some SV categories (such as inversions
and insertions) with only limited ability to detect others (segmental
duplications and transposable elements). Sequence based methods of SV
detection, with higher resolution and greater sensitivity, have so far had
limited application {Akagi, 2008 #99;Quinlan, 2010 #52}.
Along with SV catalogues, we need to know how SVs arise, as this will
tell us what SVs may or may not do. Recurrent events, hitting genes, will have
different consequences from stably inherited SVs in intergenic regions. The
major molecular mechanism producing SVs in the mouse genome is believed
to be retrotransposition {Akagi, 2008 #99;Quinlan, 2010 #52}, which may
account for more than 80% of SVs between 100 nucleotides and 10 kilobases in
length {Akagi, 2008 #99}. In cell culture, about 10% of LINE-1 insertions
delete DNA{Gilbert, 2002 #136;Symer, 2002 #135}, a process that also
occurs in mouse genomic DNA{Garvey, 2002 #137}. While this suggests SV
formation is recurrent {Egan, 2007 #98}, it is not known to what extent
retrotransposons, or other mechanisms of SV formation, contribute to mouse
phenotypic variation and disease.
What we know about the impact of SVs on phenotypes in the
mouse comes primarily from analyses of gene expression. Up to 28% of the
between-strain variation in gene expression in hematopoietic stem and
progenitor cells has been attributed to SVs {Cahan, 2009 #131}; for genes
lying within SVs, the latter account for between 66% and 74% of between-strain
expression variation in kidney, liver, lung and testis {Henrichsen, 2009 #89}. If
the genome is replete with SVs, and given that their influence on gene
expression could extend up to 500 Kb from their margins{Henrichsen, 2009
#89}, then SVs might be responsible for a considerable fraction of heritable
gene expression variance. Since gene expression variation is believed to
contribute to variation in phenotypes in the whole organism {Schadt, 2005
#315} SVs may turn out to have a major role in the genetic determination of all
aspects of mouse, and mammalian, biology.
In this paper we use next generation sequencing to address three
critical questions: what is the extent and complexity of SV in the mouse
genome, what are the likely mechanisms for SV formation, and to what extent
do SVs contribute to phenotypic variation? We report the identification, using
short-read paired-end mapping, of 0.7M SVs in 17 inbred strains of mice. By
analyzing breakpoint sequence we infer the mechanisms of formation and
assess their relative impact on shaping a mammalian genome. Our molecular
characterization of SVs in the mouse genome allows us to determine the
extent to which SVs contribute to genetic and phenotypic diversity.
Results
SV identification
We identified almost three quarters of a million SVs, relative to the reference
genome C57BL/6J, in 17 mouse strains, far more than previously recognized
(Fig. 1a) and consisting of a greater variety of molecular structures (Fig.
1b&1c). To understand why we found more, and to explain our results, we
start by explaining how we went about finding SVs.
We combined visual inspection of short-read sequencing data with
molecular validation to improve automated SV detection across the genome.
We used two criteria to identify SVs manually: read depth and anomalous
paired-end mapping (PEM). We did this using data from the mouse’s smallest
chromosome (19) in its entirety, and a random set of other chromosomal
regions, for eight strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J,
DBA/2J and LP/J).
Based on read depth and PEM we expected to find eleven patterns
that classify SVs. We refer to these as type H (“High-confidence”) patterns
(H1-H11: Supplementary Fig. 1). For example, some deletions and
inversions leave precise, easily identifiable signatures (Fig. 1d). In addition,
we found ten patterns whose interpretation was ambiguous. We refer to these
as type Q (“Query”) patterns (Q1-Q10: Supplementary Fig. 1, Fig. 1e). We
investigated the molecular structure of all 21 patterns using a PCR strategy
(Supplementary Fig. 2, Supplementary Methods). We designed 575 pairs
of primers (Supplementary Table 1) and successfully amplified 538 SV
regions across eight classical inbred strains (Supplementary Table 2).
Our categorization of predicted SV structures, based on manual
inspection of PEM patterns, resulted in the confident identification of an SV for
nineteen of the 21 patterns in all instances that we examined by PCR
(Supplementary Table 3). Two patterns were always false (Q6 and Q10),
and arose because of the presence of a retrotransposed pseudogene giving
mapping errors.
Recognizing these patterns, we were able to predict underlying SV
structure with high confidence. PCR confirmed that 12 patterns were
indicative of a single SV and six patterns indicative of multiple adjacent SVs
(Supplementary Table 3). However, SVs of type Q7 (45 cases) were due to
a variable number tandem repeat, for which we could not predict the number
of repeats or molecular structure.
Available automated methods to identify SVs are unable to differentiate
all 19 PEM patterns, and may also classify some SVs incorrectly; for example,
the PEM patterns of linked insertions (Q5 and Q9: Supplementary Fig. 1) are
similar to those for inversions or deletions. Therefore we adapted automated
methods to recognize 15 types (Q1, Q2, Q3 and Q7 could not be
unambiguously identified) identified by manual inspection and PCR validation
(Supplementary Methods and {Wong, 2010 #64}).
Sensitivity and specificity analyses
We established false positive and false negative rates for the automated
analysis in three ways. First, we used our manually identified set of SVs on
chromosome 19 (Supplementary Table 4) where we found 932 deletions
(684 type H and 248 type Q), 15 inversions (2 type H and 13 type Q) and
three copy number gains (all type H). Automated analysis of chromosome 19
detected between 83% and 86% of manually-called deletions (at least 50 bp in
size) depending on the strain (Supplementary Table 5a). The false positive
rate ranges from 3.1% to 4.6% (Supplementary Table 5b). Second, to
ensure that our sensitivity and specificity analyses were not vitiated because
we used chromosome 19 as a training set for the automated analysis, we
derived a second, smaller, set of manually curated deletions from a randomly
chosen 10 Mb region (101Mb to 111Mb) from chromosome 3 in the strain
C3H/HeJ. Automated analysis of this region correctly identified 43 (82.7%)
and called 2 false positive deletions (4.4%). Third, we investigated the false
negative rate for the automated detection of deletions across the genome
using PCR validation data for 267 simple deletions. Consistent with the
chromosome 19 and chromosome 3 analyses we found that the false
negative rate for deletions was between 9% and 15% (Supplementary Table
6a).
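The chromosome 3 figures can be reproduced with a short calculation. This is a minimal sketch, not the authors' pipeline: the count of 9 missed deletions is inferred from the reported 43 correct calls at 82.7% (a manually curated set of 52), and the reported 4.4% is treated here as the fraction of automated calls that are false (2 of 45), which is an assumption about how the rate was defined.

```python
def detection_rates(true_positives, false_negatives, false_positives):
    """Return (sensitivity, false_call_fraction) for a set of SV calls.

    sensitivity         = TP / (TP + FN): share of curated SVs recovered
    false_call_fraction = FP / (TP + FP): share of automated calls that are wrong
    """
    curated_total = true_positives + false_negatives
    called_total = true_positives + false_positives
    if curated_total == 0 or called_total == 0:
        raise ValueError("need at least one curated SV and one automated call")
    return true_positives / curated_total, false_positives / called_total


# Chromosome 3 region in C3H/HeJ: 43 correct calls and 2 false positives are
# reported in the text; 9 missed deletions is inferred (assumption).
sensitivity, false_calls = detection_rates(43, 9, 2)
print(f"sensitivity = {sensitivity:.1%}, false calls = {false_calls:.1%}")
```

The false negative rate is simply one minus the sensitivity, so the same function covers the PCR-based estimates as well.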
We could not assess the performance of automated analysis to detect
SV types other than deletions using calls derived from manual inspection of
chromosome 19 because so few of these rearrangements were called. To do
this, we turned to PCR-based validation of insertions, inversions and tandem
duplications (n=62 to n=76) and found that the average false negative rate
was higher than for deletions, ranging from 21% to 33% per strain
(Supplementary Table 6b). Automated analysis was less successful in
detecting the more complex rearrangements, with 25% to 38% false negative
rates (n=46 to n=54).
SV categories
The results of the detection and classification of 711,923 SVs across the
entire genome of 17 strains are shown in Table 1. There are on average
26,000 SVs in classical inbred strains, and 92,000 in wild derived inbred
strains, affecting 1.2% (33.0 Mb) and 3.7% (98.6 Mb) of the genome
respectively. SVs smaller than 100 bp are excluded, as below this, it is difficult
to determine whether the deviation in distance between two paired end reads
is due to variation in the library insert size distribution or due to paired ends
flanking a structural variant. However we know from the chromosome 19
analysis that there are relatively few SVs in this size range. Therefore our
catalogue does not omit any abundant classes of SV.
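The size cut-off can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the library spans, the three-standard-deviation threshold and the function name are hypothetical and are not taken from the calling method described in the Supplementary Methods.

```python
import statistics


def is_anomalous_span(span, library_spans, n_sd=3.0):
    """Flag a read pair whose mapped span lies outside the library distribution.

    library_spans: mapped spans (bp) of concordant read pairs (hypothetical data)
    n_sd:          number of standard deviations tolerated (hypothetical threshold)
    """
    mean = statistics.mean(library_spans)
    sd = statistics.stdev(library_spans)
    return abs(span - mean) > n_sd * sd


# Hypothetical library with a mean span of 400 bp.
library = [380, 400, 420, 390, 410, 360, 440, 400]

# A pair flanking a 50 bp deletion maps about 50 bp further apart than expected,
# which is within the library noise; a 500 bp deletion stands out clearly.
print(is_anomalous_span(400 + 50, library))   # small deletion
print(is_anomalous_span(400 + 500, library))  # large deletion
```

With a wider library distribution the detectable size rises accordingly, which is why a fixed lower bound is a pragmatic choice.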
Table 1 classifies SVs into two groups: 99.4% are simple and 0.6% are
complex. Simple SVs include those whose biological interpretation is
straightforward: insertions, deletions and inversions. We separately identify
one type of insertion, a copy number gain, consisting of non-repetitive DNA
that is present in multiple copies in other strains. When this sequence occurs
immediately adjacent to its original, it is annotated as a tandem duplication. It
is less clear to what extent the more complex categories we found represent
different biological categories. Complex SVs consist of a mixture of events
that abut each other. Sometimes the mixture is simply because two or more
simple SVs occur next to each other: given the density of deletions in the
genome, the 2,132 deletions that we found separated by less than 250 bp
could have occurred by chance (two of our PEM patterns, H3 and Q5).
However we recognize as a separate category SVs that are immediately
adjacent to each other, with no intervening DNA, since we suspect that these
might be the progeny of a single biological process (marked as Del+Ins and
Del+Ins/Inv in Table 1). Thus, intriguingly, we noted that half of the inversions
co-occur with an insertion or deletion (Fig. 3a). We also separately identify an
SV within a copy number gain (termed “nested” in Table 1) since the
probability of coincidence is less than one event per genome.
SV formation
Homology at SV breakpoints, as well as the content of sequence within SVs
and the SV’s ancestral state, was used to infer the likely mechanism of
formation for simple SVs (Supplementary Fig. 3). To obtain breakpoint
sequence, we performed de novo local assembly at 81.3% of deletions and
74.2% of non-transposable element insertions {Wong, 2010 #64}. Comparison
of 1,314 predicted breakpoints to the breakpoints delineated by PCR and
sequencing (Supplementary Table 7; Supplementary Methods) revealed
that 57.7% of breakpoint predictions are exact and 86.5% are within 20 bp
(Supplementary Table 8a). In cases where the local assembly strategy
failed, we relied on breakpoints obtained from mapping reads to the reference
genome: 83.3% are within 100 bp of the actual breakpoint (Supplementary
Table 8b). Breakpoint accuracy for insertions, inversions and copy number
gains is presented in Supplementary Tables 8c, 8d and 8e, respectively.
Using rat and SPRET/EiJ as outgroups, we classified 19% of SVs as
ancestral deletions, 57% as ancestral insertions and the remainder (24%)
were indeterminate. We examined the sequence features of 40 SVs that failed
the outgroup analysis and found that in every case the regions contained
highly repetitive DNA, consisting primarily of transposon and transposon
related sequence.
Classification of SVs and their size characteristics are summarized in
Figure 2. The main class of SVs consists of those formed by mechanisms
involving retrotransposons (LINEs (25%), LTRs (14%) and SINEs (15%)),
followed by variable number tandem repeats (VNTRs) (15%) and
pseudogenes (2%) and other mechanisms not involving retrotransposons
(29%) (Fig. 2a). We found that the median length of all SVs is 349 bp, with
modes at 100 bp and 6,400 bp, LINE insertions comprising the majority of the
larger insertions (Fig. 2b). Estimates of the proportion of SVs categorized as
VNTRs fell to 4.9% when we required the whole of the SV to be overlapped
by a VNTR and for the flanking sequence to have VNTR content of the same
periodicity (>= 5bp). Outgroup analysis showed that the transposon-derived
SVs arose, as expected, almost exclusively from ancestral insertion events
(98.8%). As expected, microhomology (12-16 bp) surrounds the breakpoints
of LINE and SINE derived SVs (known as target site duplication) and shorter
(6-8 bp) stretches associated with LTR SVs (Fig. 2c). Non-repeat mediated
SVs are mainly a result of ancestral deletion events (79%), and are
associated with short microhomologies, up to 7 bp in length, consistent with
either microhomology-mediated break-induced replication (MMBIR) or
microhomology-mediated end joining (MMEJ). Table 2 gives genome-wide
estimates of mechanisms of SV formation.
We found that in all cases the presence of SNPs in the microhomology
region was correlated with the presence of the SV (Fig. 3b). In every case the
SNP elongates the microhomology. However this phenomenon is rare: we
only observed five (4.5%) cases amongst our 113 manually-curated ancestral
deletions (Supplementary Table 7) where a SNP and SV formation co-segregate. We found a similar relationship between a SNP formed in a target
site duplication and the presence of an ancestral insertion (Fig. 3c). 15
ancestral insertions (16%) had SNPs or short indels within their target site
duplication, coincident with the insertion (Supplementary Table 7).
Given their potential role in disease {Stankiewicz, 2010 #325}, we were
interested to document the occurrence of recurrent SVs, those that arise at
the same genomic locus independently in unrelated individuals. Nonallelic
homologous recombination (NAHR) is the major mechanism for recurrent
SVs {Gu, 2008 #352}, while fork stalling and template switching (FoSTeS)
and/or microhomology-mediated break-induced replication (MMBIR)
mechanisms may be important for non-recurrent SVs {Zhang, 2009 #26}.
We looked for SVs occurring at the same locus, but with different
breakpoints, indicating independent origins. Using the SV breakpoints
obtained from PCR sequencing (over 4,000 breakpoints; Supplementary
Table 7), we found that in the classical strains, only 2.5% of deletions at the
same locus had different breakpoint sequences. However within all 17 strains,
we found multiple alleles at 12% of SVs, due almost entirely to the presence
of different alleles originating from the wild-derived inbred strains
(Supplementary Table 7). Consistent with the low frequency of recurrent
SVs, breakpoint features associated with NAHR are rare. Using Vmatch
(http://www.vmatch.de/) to detect similarity between the sequences flanking
and internal to all deletion SVs, we estimated that 0.25% of deletions are due
to NAHR, when we required a signature of >=200bp of >=90% sequence
identity. Two analyses, therefore, indicate that recurrent SVs are rare.
Impact of SVs on gene function
We assessed the impact of SVs on phenotypes in three ways: (i) the
relationship between the position of SVs and the position of genes; (ii)
changes in expression of genes overlapping, or nearby, an SV; (iii)
association between SVs and phenotypes in an outbred population of mice.
Across all strains, SVs overlap 10,291 genes, reducing to 5,115 genes
when only the classical laboratory strains are considered (Supplementary
Table 10). We investigated whether this represented enrichment or depletion
by comparing the number of SVs that overlapped genes to a distribution
obtained by permutation (Supplementary Table 11). We found that SVs are,
in all strains except C57BL/6N, significantly depleted (P<0.01) in genes (fold
change 0.91), introns (mean fold change 0.93), exons (fold change 0.22;
including C57BL/6N) and promoter regions (fold change 0.77). However, we
found that SINE insertions are significantly enriched in the introns of genes
(P<0.01, fold change 1.34).
The relative depletion of SVs within genes implies a proportionate
deficit in their phenotypic consequences. We confirmed this hypothesis first by
finding that SVs are relatively unlikely to be the cause of cis-acting expression
QTLs. We examined 833 cis-acting eQTLs that influence expression of
transcripts from the hippocampus and liver of outbred mice derived from eight
of the sequenced strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J,
DBA/2J and LP/J) {Huang, 2009 #330}. Applying a test that discriminates
variants that are likely to be functional from those that are not
{Yalcin, 2005 #338}, we found that SVs are significantly less likely to be
causal variants (0.3% compared to 0.9%, P < 2.2E-13, χ² = 53.8).
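A comparison of two proportions of this kind is commonly summarised with a chi-square statistic on a 2×2 table. The sketch below assumes a Pearson chi-square with one degree of freedom and no continuity correction; the text does not state which variant of the test was used, and the counts are hypothetical because only percentages are reported.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction)
    for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: rows are likely-causal and unlikely-causal variants,
# columns are SVs and all other variants.
causal_sv, causal_other = 3, 997
noncausal_sv, noncausal_other = 90, 9910
statistic = chi_square_2x2(causal_sv, causal_other, noncausal_sv, noncausal_other)
print(f"chi-square = {statistic:.2f}")
```

The statistic grows with the total number of variants tested, so a small difference in proportions can still yield a very small P-value when hundreds of thousands of variants are compared.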
We next asked whether SVs were enriched among those variants most
likely to contribute to variation in the abundance of transcripts obtained from
the whole brain RNAseq analysis of 15 strains {Trapnell, 2010 #342}. Taking
into account the relationship between strains, we tested for association
between 11,245 transcripts and all variants lying within 25 Kb upstream and
downstream of the transcript’s genomic locus. 5,337 transcripts had a single
variant with a minimum P-value for association, which we assumed would
include true causal variants. Figure 4a shows the distance to the
transcriptional start site from the position of the single most significant variant,
for these 5,337 transcripts, regardless of the P-value for association. Figure
4b shows the same information but only for variants where the P-value is less
than 0.0001. There is a significant enrichment of variants with low P-values
lying less than 2,000 bp upstream of the transcriptional start site (P =
0.00013, χ² = 59.3, df = 25). None of the 225 variants were SVs, significantly
fewer than expected from the frequency of SVs in the complete set of variants
tested (χ² = 4.8, P = 0.025 by simulation).
The proportion of variation in gene expression attributable to SVs is
small. Figure 4c shows scaled variances in gene expression from brain
RNAseq data measured between and within strains for five categories of SVs.
Assuming variation within strains is due to environmental factors and variation
between strains is due to both environmental and genetic factors, the
difference between the two variances is a measure of heritability and we can
apportion that attributable to the SVs {Henrichsen, 2009 #89}. No category
of SV accounts for more than 10% of the heritability. Since many transcripts
overlap multiple small SVs (median of 3, maximum of 216), we hypothesized
that heritability might be related to the amount of gene overlapped. For each
transcript we summed the amount of SV sequence overlapping a gene and expressed
this as a proportion of the total length of the gene. Overlap proportions of 50%
or more make a disproportionately large contribution to heritability: at loci with
SVs in this category, SVs contribute to 25% of the variance, compared to
7.8% for transcripts where SVs overlap less than 50% of the gene. However,
large overlaps (50% or more) are rare, affecting less than 3% of transcripts.
Thus while SVs make a modest contribution to the overall heritability of
expression variance, at individual transcripts, they may be the main cause of
between-strain differences in expression.
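The overlap proportion used above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: coordinates are treated as half-open intervals, and overlapping SVs are merged so that no base is counted twice, which is an assumption the text does not spell out.

```python
def overlap_proportion(gene_start, gene_end, svs):
    """Fraction of the gene interval [gene_start, gene_end) covered by SVs.

    svs: iterable of (start, end) half-open SV intervals on the same chromosome.
    """
    # Keep only SVs that touch the gene, clipped to the gene boundaries.
    clipped = sorted(
        (max(start, gene_start), min(end, gene_end))
        for start, end in svs
        if start < gene_end and end > gene_start
    )
    covered = 0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / (gene_end - gene_start)


# Hypothetical gene of 1,000 bp with four nearby SVs, one of them outside the gene.
proportion = overlap_proportion(
    1000, 2000, [(900, 1200), (1100, 1300), (1800, 2500), (3000, 3100)]
)
print(proportion, proportion >= 0.5)
```

Transcripts could then be split at the 50% threshold described in the text before comparing between-strain and within-strain variances.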
We also observed only small effects on gene expression from SVs that
lie outside a gene. Figure 4d shows between and within strain variances for
SVs lying at distances from less than 2 Kb to more than 40 Kb from
transcripts with no SV overlap (the density of SVs meant that we found too
few transcripts further than 60 Kb from an SV to analyze). For these analyses
we measured the closest distance to either the start or end of the gene.
Heritability attributable to SVs within 2 Kb of the gene is 2%, and falls as the
distance from the gene increases.
SVs are unlikely to be the causative variant at QTLs, as we know from
genetic association with 100 phenotypes measured in over 2,000
heterogeneous stock (HS) mice {Valdar, 2006 #177}. We applied a test of
functionality {Yalcin, 2005 #338} to 281,246 SVs where we were certain that
the strain distribution pattern (SDP) was correct (Supplementary Methods).
Relatively high rates of SV mutation {Egan, 2007 #98} might invalidate the
imputation (the HS animals are at least 60 generations distant from the
sequenced strains), so we genotyped 100 HS animals using a high-density
array ({Agam, 2010 #54}, Supplementary Methods). 194 deletions could be
genotyped on the array (with an additional 47 deletions when we allow for
non-segregating SVs in the HS). In every case imputation correctly predicted
the logP obtained from ANOVA carried out using the array-based genotypes.
We identified 290 QTLs where SVs were among the variants most
likely to be functional, but in all these cases the SVs were only a subset of the
total number of functional variants. Just as with the cis-expression QTLs, we
found a small but significant deficit in SVs among the functional variants
(0.36% compared to 0.54% among the non-functional, P < 1E-16, χ² = 72.1).
SVs that affect phenotypes
While SVs make a relatively small contribution to the total amount of
quantitative phenotypic variation, at a small number of loci they are the cause
of variation. We identified two categories of SVs that have large functional
consequences: the first are those at QTLs, and the second are those that
disrupt the coding exons of genes.
As shown in our companion paper {Keane, 2011 #345}, larger effect
QTLs are more likely to arise from SVs (see also Supplementary Figures 4a,
4b and 4c). We identified 12 QTLs where the SV
overlapped a gene with its flanking region (2 Kb up or downstream of a gene),
and where the QTL effect size is in the top 5% of the distribution. Table 3 lists
these SVs, the genes they affect and the putative phenotype with which they
are associated. Complementation of the deletion of the H2-Ea promoter
confirmed the effect of this SV on the T-cell phenotype (). Eps15 -/- male mice
exhibited a significantly lower activity in the open field arena (Supplementary
Fig. 5) compared to matched wild type male mice. Further work is needed to
confirm the other candidate genes.
We identified 22 SVs that affect coding exons including 5 SVs that
encompass a gene (or several) in its entirety. Table 4 gives positional
information about these SVs, the genes they affect (genes that are affected in
their entirety are indicated by an asterisk), how they formed, their strain
distribution pattern (SDP) and their known function as reported in the current
literature. Remarkably, a third of the genes affected by an SV are involved in
immunity and infection. Five of the 22 SVs are already known{Best, 1996
#346;Boyden, 2008 #347; Morrison, 2002 #348; Nelson, 2005 #349; Persson,
1999 #350}; the remaining 17 SVs are novel.
The sequence-level data, on multiple strains, cast new light on all five
known cases. For example, the known deletion in Fv1 is in fact a deletion with
an insertion, and the mutations in the Skint and Trim loci are in fact
insertions (Fig. 5a).
Discussion
Our results are important in three respects: first we find an unexpectedly large
number of SVs with diverse molecular architecture, thus providing a catalogue
of the most dynamic and variable regions of the mouse genome. Second, we
identify breakpoints at nucleotide level resolution, giving a genome wide
picture of how SVs originate. Third, we demonstrate that, despite their
abundance, SVs make relatively little functional impact, as assessed by their
effects on gene expression and phenotypic variation in the whole animal.
We were able to find more SVs, of greater complexity, because we
relied on manual inspection of the PEM results, combined with molecular
validation, before using automated calling methods. Previous studies have
revealed the noisiness of sequence-based methods of SV calling {Kidd, 2010
#49;Korbel, 2007 #93;Quinlan, 2010 #52}, due in part to the multiplicity of
forms and the presence of insertions, deletions and inversions often in close
proximity to each other, and the difficulty of mapping sequence reads back to
repetitive genomes. Nevertheless, we have shown here that it is possible to
calibrate automated methods to generate genome-wide SV calls of high
accuracy.
The SVs we find have two distinguishing characteristics: first, typically
they are small. For deletions, whose size we know accurately, the median is
385 bp. In comparison, the median size of SVs in a recent high-density array
analysis of the genomes of 20 laboratory strains was 9 Kb {Cahan, 2009
#131} and about 1.9 Kb from a PEM analysis of DBA/2{Quinlan, 2010 #52}.
Second, their density means that we frequently find regions with high
concentrations of small rearrangements. These two features emphasize the
need for methods of SV identification at base pair, or near base pair
resolution. Otherwise not only are many SVs missed, but those recognized
are misclassified: a mixture of small deletions and insertions will be mistaken
for a large SV of a single type {Agam, 2010 #54}.
Our second important finding is the catalogue of SV mechanisms
based on breakpoint sequence. We were able to map almost 60% of deletions
to base pair resolution, allowing us to classify SVs by the mechanism that
created them. We find that the primary origin of structural variation between
mouse strains is attributable to LINE-1 retrotransposons. For reasons still
unexplained, mice differ from humans in whom LINE-1 retrotransposition
comes third after microhomology-mediated processes and nonallelic
homologous recombination as the predominant processes in generating SVs
{Kidd, 2010 #49}. In contrast to human SV studies, the great majority of SVs we
have discovered are non-recurrent rearrangements, based on two
observations: among the classical strains, only 2.5% of deletions at the same
locus had different breakpoint sequences and less than 1% of deletions are
due to NAHR, the mechanism thought to be responsible for the majority of
recurrent SVs in humans {Stankiewicz, 2010 #325;Gu, 2008 #352}.
Our third important observation is that SVs have relatively little impact
on gene function. Results from human genome-wide association studies have
revealed that common SNPs (minor allele frequency > 5%) explain only a part
of trait heritability suggesting that SVs might be a major unrecognized
contributor to phenotypic variation{Manolio, 2009 #318}. Available evidence
has not yet resolved whether or not this is so. Analysis of human
lymphoblastoid cell lines attributed at least 8.5-17.7% of heritable gene
expression variation to CNVs{Stranger, 2007 #84}. Importantly, this heritability
was not shared with common SNPs, potentially making CNVs a contributor to
the missing heritability of GWAS. In mice, SVs overlapping a gene were
estimated to contribute to a substantial proportion of between-strain
expression variance (up to 74%){Henrichsen, 2009 #89}, which, together with
the prevalence of SVs in the genome, implies that they might be responsible
for a considerable fraction of heritable gene expression variance. If the
genetic basis of gene expression were a model for understanding the
molecular basis of other phenotypes, then SVs would be a major player.
However, two recent analyses of the association between SVs and disease
phenotypes in humans provide little support for this view: common SVs are no
more likely than common SNPs to contribute to phenotypic variation {Conrad,
2010 #44;Craddock, 2010 #319}. However rare CNVs (minor allele frequency
< 5%) of large effect (odds ratio > 2), that could not be detected using the
technologies available, might still be important contributors.
Our findings make three important contributions to this debate. First,
we find that SVs overlapping a gene make a small contribution to variation in
gene expression, accounting for less than 10%, and we find limited evidence
that they affect the expression of flanking genes. This might be due to our
analysis of very large numbers of small SVs, but we find that even when SVs
overlap more than 50% of a gene they account for less than a third of the
heritability. The most likely explanation is that previous array based studies
conflated under one apparently large SV the effects of numerous smaller
rearrangements together with regions of diploid DNA, containing other
variants that influenced gene expression.
Second, our analysis of the phenotypic consequences of SVs on QTLs
for multiple phenotypes also points to a relative deficit of SVs as the molecular
basis of complex phenotypes. By working with an outbred population where
all chromosomes are descended from known progenitors, imputation
effectively reconstitutes the genomes of all animals, so that we can detect the
effects of all variants, both common and rare. Our results indicate that
common and rare SVs make less of a contribution to phenotypic variation
than we would expect given their abundance in the genome. However, the
outbred population we tested is derived from inbred progenitors whose
homozygosity will have purged their genomes of variants that could be
maintained in heterozygous freely mating populations.
Third, we identified 22 SVs that delete one or more exons. These SVs,
with large effects on a phenotype, are the equivalent of rare variants found in
human populations. In mouse populations they are very rare indeed:
Our analysis has highlighted those QTLs where SVs are likely to be the
responsible molecular lesion. Encouragingly, our computational predictions
include a promoter deletion whose role we have recently confirmed through
transgenesis{Yalcin, 2010 #317}. This is important because genetic
association studies typically implicate SNPs as the causative variant at a QTL.
Biological insight into a phenotype, however, requires discovering which gene
is involved, still a major challenge if the starting point is a SNP. The task is
considerably easier when an SV is identified as the causative variant,
particularly if the SV removes a coding segment, effectively creating a null
allele, which is now relatively straightforward to model in mice. Thus the discovery of
causal SVs is likely to provide biological insights out of proportion to their
relatively small contribution to phenotypic variance.
Acknowledgements
We thank Adam Whitley, Giles Durrant, Andrew Marc Hammond, Danica Joy
Fabrigar, Lucia Chen, Martina Johannesson and Enzhao Cong for helping B.Y.
with various laboratory-based work. This project was supported by The
Medical Research Council, UK and the Wellcome Trust. DJA is supported by
Cancer Research UK.
Author contributions
D.J.A. and J.F. conceived the study and directed the research.
Figure Legends
Figure 1. Identification of structural variants. a) Venn diagrams showing the
overlap between deletion SVs (relative to C57BL/6J) detected in our study
(blue) and those published elsewhere (Agam et al, 2010 in red and Quinlan et
al in green), in DBA/2J. b) Blue boxes represent deletions, pink boxes
insertions, orange boxes inversions and yellow boxes duplications; all types of
structural variants are relative to the reference genome sequence. We found
six basic types of structural variant: deletion (del), insertion (ins), inversion
(inv), tandem duplication (dup), inverted tandem duplication (not drawn here)
and dispersed duplication. c) Additionally, eight complex types of structural
variant were found: deletion with an insertion (del+ins), linked deletion (normal
copy of small length flanked by two deletions), deletion within a duplication
(del in dup), inversion with flanking deletion(s) (for example del+inv+del),
inversion with an insertion (inv+ins), inversion within a duplication (inv in dup),
a linked insertion (linked ins) where the inserted sequence is copied from
another location in the vicinity of the inserted site and an inverted linked ins
(not drawn here) which has a similar pattern to a linked insertion but with the
inserted sequence being inverted. d) Example of paired-end mapping (PEM)
pattern of a del+inv+del. Green arrows represent primers used for PCR
amplification and sequencing reactions. Primer names provide their positional
information, relative to the reference genome. Black arrows attached with a
curved line represent paired-ends, whereas single black arrows represent
singleton reads. Grey straight lines indicate mapping of the test reads onto the
reference genome. When the inversion is smaller than the insert size, paired-end reads will flank both deletions and the inversion, as shown here. In other
cases, decreased read depth will indicate flanking deletions. e) Example of
PEM pattern of an inv+ins, with PCR data across the eight classical strains.
Hyperladder II is used as the molecular size marker. Amplicon size for BALB/cJ,
C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than the other strains,
indicative of the insertion. A complete list of PEM patterns is given in
Supplementary Figure 1, with examples and PCR validation data.
Figure 2. Classification of structural variants.
LEGEND TO ADD
Figure 3. Breakpoint analysis of a complex SV. a) Complex SV, involving
several genomic rearrangements including an inversion, deletion, short
insertion and copy number gain (CNG), is displayed relative to its genic
location along Zbtb10, the zinc finger and BTB domain containing 10 gene.
PCR amplification using forward (F) and reverse (R) primers revealed an AT
insertion at the first breakpoint (J1), followed by an inversion of 125 bp that
encompasses an inverted copy number gain of the 22 bp preceding J1, as
seen in J2. Finally, breakpoint 3 (J3) revealed a deletion of 813 bp. Using
RepeatMasker, a SINE element was found to be part of the deletion. b) PCR
picture of the amplification using F and R primers (primer sequences available
in Supplementary Table xxx). Hyperladder II was used as the size marker.
C57BL/6N and LP/J show a normal size of 1604 bp, whereas A/J, AKR/J,
BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp. c)
Sequencing data across J1, J2 and J3 breakpoints. A colour code is used to
indicate each type of SV: blue is used for the 22 bp inverted copy number
gain, green for the inversion and red for the deletion. When the test strain
matches the reference strain, both are in the same colour.
b) Relationship between SNP and SV formation. a) Relationship between
SNP and ancestral deletion formation. Two SNPs lying on the 6 bp
microhomology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522) correlated with the presence of the SV. On the left, PCR
amplification of the SV is shown across the eight classical strains (A/J, AKR/J,
BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J and LP/J). HyperladderII was
used as DNA molecular weight marker. Some strains show a smaller
amplicon compared to other strains. On the right, sequencing traces are
shown for a test strain (A/J) and the reference strain (C57BL/6N). Note that all
other test strains traces are identical to the one shown here. Asterisk is used
to emphasize the microhomology of 6 bp (GAACTA). The presence of two
SNPs (C->G and T->A) in all test strains (here only shown in A/J) is
associated with the presence of the ancestral deletion. b) Relationship
between SNP and ancestral insertion formation. PCR data is shown on the
right with amplification in A/J, AKR/J and BALB/cJ. Strains with the
ancestral insertion (C57BL/6N, CBA/J, DBA/2J and LP/J) have failed to
amplify due to size. The insertion is a LINE on chromosome 13 (119,134,049-119,135,126). On the left, the sequencing trace is shown over the TSD for a strain
that does not have the ancestral insertion. The TSD is 17 bp
(AAGAATGTCAGCAAAGT) and at the 12th position, a SNP (G->C) is
observed in all the strains that have the insertion.
Figure 4: Effect of structural variants on gene expression
a) and b) Genetic association between 5,337 transcripts and sequence
variants lying within 50 kilobases. a) shows the distance from the
transcriptional start site to the most significant variant at those loci where
there is a single peak of association. Results are shown regardless of the P-value of the association. The majority are non-significant and, as expected,
show no enrichment closer to the gene. b) shows the same distances, but
only for variants at loci where the single peak of association has a P-value
less than 0.0001. None of the variants within the peak at the transcriptional
start site are structural variants.
c) Between-strain (grey boxes) and within-strain (white boxes) gene
expression variances for transcripts which are not overlapped by any
structural variant (No SV) and for those which are overlapped by one of five
types of structural variant: deletions (Dels), insertions (Ins), copy number
gains (Gains), inversions (Inv) and complex rearrangements (Complex). The
difference between the two variances is a measure of heritability.
d) Effect of distance from the transcript on gene expression variances. Grey
boxes are between-strain and white boxes are within-strain variances. The
figure shows standardized variances of gene expression for transcripts with
structural variants at distances from less than 2 Kb to more than 40 Kb from
either the start or end of the transcript.
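The variance comparison described for panels c) and d) can be illustrated with a minimal sketch. This is not the authors' pipeline (which is outlined in the Supplementary Methods); it assumes replicate expression measurements per strain for a single transcript, and the function name strain_variances and the toy values are hypothetical. Here the between-strain variance is taken as the sample variance of the strain means and the within-strain variance as the mean of the per-strain sample variances, which is only one of several possible estimators.

```python
from statistics import mean, variance


def strain_variances(expr_by_strain):
    """Return (between_strain, within_strain) variances for one transcript.

    expr_by_strain maps a strain name to a list of replicate expression
    values (at least two replicates per strain, at least two strains).
    """
    # Between-strain: sample variance of the per-strain mean expression.
    strain_means = [mean(values) for values in expr_by_strain.values()]
    between = variance(strain_means)
    # Within-strain: average of the sample variances inside each strain.
    within = mean([variance(values) for values in expr_by_strain.values()])
    return between, within


# Hypothetical, made-up expression values (not data from the study).
toy_expression = {
    "A/J": [1.0, 1.2, 0.8],
    "AKR/J": [2.0, 2.2, 1.8],
    "BALB/cJ": [3.0, 3.2, 2.8],
}

between, within = strain_variances(toy_expression)
# The excess of between-strain over within-strain variance is the quantity
# the legend describes as a measure of heritability.
excess = between - within
print(round(between, 3), round(within, 3), round(excess, 3))
```

In this toy case the strains differ much more from each other than replicates differ within a strain, so the excess is large; a transcript with no heritable component would give an excess near zero.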
Tables
Table 1. Structural variants greater than 100bp in 17 inbred strains
                 Simple                        Complex
Strain           Del     CNG   Inv   Ins       Del+Ins   Nested   Inv+Del/Ins
129P2/OlaHsd     16292   57    74    15604     105       27       68
129S1/SvImJ      17307   70    88    11516     73        32       67
129S5/SvEvBrd    16089   72    67    8970      43        41       58
A/J              16190   69    92    12184     61        28       67
AKR/J            15806   88    89    14576     88        13       82
BALB/cJ          14859   82    87    10551     48        17       58
C3H/HeJ          16062   94    94    12100     90        16       76
C57BL/6NJ        164     44    6     213       0         3        1
CAST/EiJ         50978   361   224   34122     133       239      265
CBA/J            16996   79    83    10867     64        16       78
DBA/2J           17478   67    83    10559     55        29       75
LP/J             16964   64    88    12745     64        30       69
NOD/ShiLtJ       17047   51    116   13244     53        16       79
NZO/HlLtJ        15429   62    71    9445      33        23       62
PWK/PhJ          54147   96    272   35098     184       60       268
SPRET/EiJ        91295   112   470   64304     463       110      554
WSB/EiJ          22154   88    97    12521     64        37       105
Del=deletion; CNG=copy number gain; Inv=inversion; Ins=insertion;
Del+Ins=deletion plus insertion; Nested=SV in a CNG region; Linked
Ins/Del=linked insertion or linked deletion; Inv+Del/Ins=inversion plus
deletion(s) or inversion plus insertion.
Table 2. Sequence features at SV breakpoints and inferred mechanism
In a, the percentage of each sequence feature at precise breakpoints is given per
category of ancestral SV (insertion, deletion, inversion, CNG and multiple
events). In b, the percentage of each inferred mechanism is given relative to
all SV regions presented in a. Empty cells indicate that the feature is not applicable; all
abbreviations are listed in the Supplementary Glossary.
Table 3. QTLs associated with SVs
Phenotype                                  Chr  SV start   SV stop    Ancestral Event  Gene           SV overlap  LogP
Mean platelet volume                       1    175158884  175158885  insertion        Fcer1a         upstream    52.833
OFT Total activity                         2    144402772  144402974  SINE insertion   Sec23b         intron      15.721
Hippocampus cellular proliferation marker  4    49690364   49690365   SINE insertion   Grin3a         intron      20.119
Home cage activity                         4    108951264  108951265  ERV insertion    Eps15          upstream    15.922
T-cells: %CD3                              4    130038389  130038390  SINE insertion   Snrnp40        intron      12.129
Wound healing                              7    90731819   90731820   ERV insertion    Tmc3           upstream    22.216
Red cells: mean cellular haemoglobin       7    111398000  111480000  insertion        Trim5          exon        13.016
Red cells: mean cellular haemoglobin       7    111504957  111505193  deletion         Trim30b        UTR         12.806
Red cells: mean cellular volume            8    87957244   87957245   LINE insertion   4921524J17Rik  upstream    18.141
Serum urea concentration                   11   115106122  115106250  deletion         Tmem104        UTR         13.404
Hippocampus cellular proliferation marker  13   113783196  113783359  deletion         Gm6320         upstream    17.456
T-cells: CD4/CD8 ratio                     17   34483680   34483681   deletion         H2-Ea          upstream    82.858
Start and stop coordinates are given for build37 of the mouse genome, so that insertions into the reference are given as
consecutive base pairs (columns headed SV start and SV stop). The part of the gene overlapped is reported in the column
headed SV overlap. LogP is the negative logarithm of the P-value for association between the SV and the phenotype as
assessed in outbred HS mice {Valdar, 2006 #177}.
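Because the LogP column is a negative logarithm, converting between LogP and the underlying P-value is a one-line calculation. The sketch below assumes a base-10 logarithm, which the table note does not state explicitly; the function names are hypothetical and the two LogP values are taken from the table above.

```python
import math


def logp_to_pvalue(logp, base=10.0):
    """Convert a negative-log P-value back to a P-value.

    Assumption: the logarithm is base 10 unless another base is passed.
    """
    return base ** (-logp)


def pvalue_to_logp(p_value, base=10.0):
    """Convert a P-value (0 < p <= 1) to its negative logarithm."""
    return -math.log(p_value, base)


# LogP values reported in Table 3 for two of the listed genes.
for gene, logp in [("Fcer1a", 52.833), ("H2-Ea", 82.858)]:
    print(gene, logp, logp_to_pvalue(logp))
```

Larger LogP values correspond to smaller P-values, so under this assumption the H2-Ea association is the strongest in the table.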
Table 4: SVs affecting coding regions.
MGI gene name  SV Chromosome  SV Start Bp  SV Stop Bp  SV Length  Ancestral State  Known function
Olfr1055       2              86179904     86186982    7079       ins              Olfaction
Fcrl5          3              87245076     87245948    873        del              Infection and immunity
Nes            3              87780517     87780656    140        VNTR             Brain development
Pglyrp3        3              91831864     91835385    3522       del              Infection and immunity
Sknit4,3,9*    4              111731004    112272814   541811     ins              Infection and immunity
Fv1            4              147244399    147245739   1341       del+ins          Infection and immunity
Ugt2b38        5              87850560     87854999    4440       complex          Metabolism
Klrb1a         6              128559593    128559740   148        ins              Infection and immunity
Klri2          6              129689528    129691211   1684       del              Infection and immunity
Tas2r120*      6              132593115    132613777   20663      del              Taste
Tas2r103       6              132985439    132986718   1280       del              Taste
Zfp607*        7              28646762     28671649    24888      del              DNA-binding
Krtap5-5       7              149415125    149415279   155        VNTR             Hair formation
Defb8          8              19447386     19450577    3192       ins+del          Infection and immunity
Zfp872         9              22004938     22004981    44         del              DNA-binding
Olfr913        9              38402597     38403495    899        del              Olfaction
Rtp3           9              110888888    110889308   421        VNTR             Bone density
Nlrp1c*        11             71046193     71101410    55218      ins              Embryonic development
Fam110c        12             31759374     31759699    326        VNTR             Cell spreading and migration
Olfr234        15             98328544     98328892    349        del              Olfaction
Krtap16-1      16             88874115     88874526    412        VNTR             Hair formation
Amd2*          18             64607761     64609669    1909       ins              Biosynthesis of polyamines

Table 4 (continued): strain distribution pattern.
MGI gene name  A/J  AKR/J  BALB/cJ  C3H/HeJ  C57BL/6J  CBA/J  DBA/2J  LP/J  129P2/OlaHsd  129S1/SvImJ  129S5/SvEvBrd  NOD/ShiLtJ  NZO/HlLtJ  CAST/EiJ  PWK/PhJ  SPRET/EiJ  WSB/EiJ
Olfr1055       0    0      0        0        0         0      0       0     1             1            1              0           0          1         0        0          0
Fcrl5          0    0      0        0        0         0      0       0     0             0            0              1           0          0         0        0          0
Nes            0    0      0        0        1         0      0       1     1             1            1              0           0          0         0        0          1
Pglyrp3        0    0      0        0        0         0      0       0     0             0            0              1           0          0         0        0          0
Sknit4,3,9*    0    0      0        0        1         0      0       0     0             0            0              0           1          0         0        0          0
Fv1            0    1      0        1        0         1      1       1     1             1            1              1           1          0         1        0          1
Ugt2b38        1    1      1        1        1         1      1       2     2             2            2              1           1          1         1        1          0
Klrb1a         1    1      1        1        1         1      1       1     1             1            1              0           1          1         0        0          1
Klri2          0    0      0        0        0         0      0       0     1             1            1              0           0          0         0        0          0
Tas2r120*      1    1      1        1        0         1      1       1     0             0            0              0           0          0         0        0          0
Tas2r103       0    0      0        0        0         0      0       0     0             0            0              0           1          0         0        0          0
Zfp607*        0    0      0        0        0         0      0       0     0             0            0              0           1          0         0        0          0
Krtap5-5       1    1      1        0        1         1      1       1     1             1            1              1           1          1         0        0          1
Defb8          0    1      0        0        1         0      0       1     0             0            0              0           1          0         0        0          1
Zfp872         0    0      0        0        0         0      0       1     0             0            0              1           0          0         0        0          0
Olfr913        1    1      1        1        0         0      1       0     1             1            1              1           0          0         0        0          0
Rtp3           0    0      0        0        1         0      0       0     0             0            0              0           0          0         0        0          0
Nlrp1c*        1    0      0        0        1         0      0       0     0             0            0              0           0          1         1        0          0
Fam110c        0    0      0        0        1         0      1       1     1             1            1              1           1          1         1        1          0
Olfr234        1    0      0        1        0         0      0       0     0             0            0              0           0          0         0        0          0
Krtap16-1      1    0      1        1        1         0      0       1     0             0            0              1           0          0         0        0          1
Amd2*          1    0      1        1        1         1      1       0     0             0            0              0           1          0         0        0          1

MGI is Mouse Genome Informatics. Ins: insertion; del: deletion; VNTR: variable number tandem repeat. The strain distribution pattern
relative to the ancestral state is given for all strains: “1” referring to presence, “0” to absence and “2” to an additional allele. * indicates
that the SV overlaps the entire gene.
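Two conventions in Table 4 lend themselves to a short check. The listed SV lengths are consistent with inclusive coordinates (stop minus start plus one; for example 86179904 and 86186982 give 7079), and the strain distribution pattern uses the codes 0, 1 and 2 defined in the table note. The sketch below is illustrative only: the function names, the code-to-label mapping and the placeholder strain names are hypothetical.

```python
# Labels for the pattern codes described in the table note.
PATTERN_CODES = {0: "absent", 1: "present", 2: "additional allele"}


def sv_length(start_bp, stop_bp):
    """Length of an SV whose start and stop coordinates are both inclusive."""
    if stop_bp < start_bp:
        raise ValueError("stop coordinate precedes start coordinate")
    return stop_bp - start_bp + 1


def decode_pattern(strains, codes):
    """Map each strain name to a readable label for its pattern code."""
    if len(strains) != len(codes):
        raise ValueError("one code is required per strain")
    return {strain: PATTERN_CODES[code] for strain, code in zip(strains, codes)}


# Coordinates of the first SV listed in Table 4.
example_length = sv_length(86179904, 86186982)

# Placeholder strain names and codes, purely for illustration.
example_calls = decode_pattern(["strain_A", "strain_B", "strain_C"], [0, 1, 2])
print(example_length, example_calls)
```

A check of this kind is a quick way to catch rows whose coordinates and length have been transposed or mistyped.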
Figure 1. Identification of structural variants.
Figure 2. Classification of structural variants.
Figure 3. Breakpoint analysis of a complex SV and relationship between SNP
and SV formation
Figure 4
Methods
An outline of the methods applied in this paper is provided in the
supplementary Methods.
References