The impact of transposable element variants on mouse genomes and genes
Christoffer Nellåker*1, Thomas M. Keane*2, Binnaz Yalcin3, Kim Wong2, Avigail Agam1,3, Jonathan
Flint3, David J. Adams2, Wayne N. Frankel4, Chris P. Ponting1
*contributed equally
1. MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of
Oxford, South Parks Road, Oxford, OX1 3QX, UK
2. Experimental Cancer Genetics, the Wellcome Trust Sanger Institute, Wellcome Trust Genome
Campus, Hinxton, Cambridge, CB10 1HH, UK
3. Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3
7BN, UK
4. The Jackson Laboratory, Bar Harbor, ME 04609, USA
Transposable element-derived (TE) sequence dominates the landscape of mammalian genomes
and can modulate gene function by dysregulating transcription and translation. Virtually all TEs
present in the C57BL/6J reference mouse genome are drawn from three distinct classes, namely
short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs) and
members of the endogenous retrovirus (ERV) superfamily. Despite their high prevalence, the
relative contribution of TEs to quantitative traits and gene expression variation is largely
unknown. Using whole genome sequences from 13 classical laboratory mice and 4 inbred strains
derived from wild mice, we developed a catalogue containing approximately 100,000 polymorphic
TE variants (TEVs). SINE variants show tendencies to lie in G+C-rich sequence and to be inserted in
the flanking regions of genes, whereas LINE or ERV variants occur preferentially in more A+T-rich
sequence. In general, TEVs tend to be depleted near to transcriptional start sites, in and near
exons, and more particularly, LINEs are depleted within the introns of transcription factor genes,
which we assume to be a consequence of purifying selection. Within introns, we find only
approximately half the expected number of ERV TEVs that are inserted in the sense transcriptional
orientation which, again, appears to reflect past episodes of negative selection. Nevertheless,
using both RNA-seq expression and QTL data derived from a cross between eight of the sequenced
strains we were unable to demonstrate frequent phenotypic effects of TEVs among mouse strains.
Our findings indicate that although TE insertion alleles are often highly deleterious, surviving
variants which segregate among mouse strains only infrequently contribute to phenotypic effects.
Author summary
Only a tiny fraction of a mammalian genome encodes functional proteins. A much larger fraction
(~50%) originates from transposable elements (TEs), which are
supposedly “junk” DNA and genomic parasites. We ask whether differences in
physical, biochemical or behavioural traits among mouse strains can be attributed to differences in
their TEs. We examined over 100,000 TEs that inserted into ancestors of these mouse strains over
the past 5 million years. Skewed distributions in different genomic regions and orientation are
strong signs of selection implying that 20-30% of recent TE insertions have had deleterious
functional consequences. Despite this we were unable to demonstrate frequent effects of TE
variants on these mouse strains’ traits. We conclude that although many TEs have negative
functional consequences after their insertion into the genome, those retained contribute little to
physiological differences among laboratory mice.
Introduction
The debris of ancient transposable element-derived (TE) sequence dominates the landscape of
mammalian genomes and new TE insertions can modulate the transcription, translation or function
of genes 1-7. Functional effects of TE insertions include their regulation of transcription by acting as
alternate promoters, enhancer elements, antisense transcripts or transcriptional silencers. TEs can
alter splice sites or RNA editing, provide alternate poly-adenylation signals or exons, modify
chromatin structure or alter translation. Furthermore, TE insertion has been suggested to be a
mechanism by which new coregulatory networks arise 1-7.
TEs are conventionally classified on the basis of their transposition mechanism. Class I
retrotransposons propagate in the host genome through an RNA intermediate, which a
reverse transcriptase converts back to DNA before insertion into the genome. Class II DNA
transposons do not have any RNA intermediate and translocate with the aid of transposases and
DNA polymerase. The overwhelming majority, over 96%, of TEs in the mouse genome are of the
retrotransposon type 8. These are further classified into three distinct classes: short interspersed
nuclear elements (SINEs), long interspersed nuclear elements (LINEs) and the endogenous retrovirus
(ERV) superfamily. The ERVs are ancient remnants of exogenous virus infections, comprising internal
sequence encoding viral genes flanked by long terminal repeats (LTRs).
TEs provide a potential source of variants that are detrimental to host viability and promote disease.
Indeed, several TE variants (TEVs) have been observed to contribute to mouse traits. For example,
one allele of the agouti locus in mouse contains an intra-cisternal A particle (IAP) sequence upstream
of the promoter which causes ectopic expression of the agouti protein leading to variation in fur
colour, obesity, diabetes and tumour susceptibility 9,10. Evidence that TEs have often been
detrimental derives principally from a bias in the orientation, with respect to gene transcription, by
which TEs have been observed within introns. Human intronic ERVs and LINEs, but not SINEs, show
a tendency to disrupt expression when inserted into introns in the gene’s transcriptional sense
orientation 11. In mice there are over 50 instances of phenotypes due to spontaneous insertional
mutagenesis by ERVs, with one class of functional variants (ETn) showing a strong bias to be in the
sense transcriptional orientation 6. This orientation bias is attributed to cryptic splice acceptor
usage and/or inefficient read-through of the ERV LTR, which contains its own regulatory signals 6.
TEs that are present in the C57BL/6J reference genome assembly exhibit this orientation bias (ref)
which indicates that TE insertions have often been deleterious over rodent evolution. TEVs, on the
other hand, which were inserted only during recent Mus evolution, may show a reduced orientation
bias. This is because weakly detrimental TEVs that have been inserted, in the sense orientation, over
recent evolution may not have had sufficient numbers of generations to be effectively purged from
the population. In addition, deleterious TEVs present in laboratory mice may be maintained owing
to their artificial inbreeding. It is thus plausible that TEVs contribute substantially to the genetic load
and phenotypic variation among inbred and wild mice.
Elsewhere, we report the generation and analysis of over a terabase of raw sequence from the
genomes of 17 mouse strains {REF MAIN PAPER}. This sequencing project examined 13 inbred
laboratory mouse strains (129P2/OlaHsd, 129S1/SvImJ, 129S5/SvEvBrd, A/J, AKR/J, BALB/cJ,
C3H/HeJ, C57BL/6NJ, CBA/J, DBA/2J, LP/J, NOD/ShiLtJ and NZO/HiLtJ) and 4 wild derived strains
(CAST/EiJ, PWK/PhJ, WSB/EiJ and SPRET/EiJ) which together encompass approximately 2My of
evolutionary divergence 12. From this data set, 0.71 million structural variants have been identified
{REF STRUCT PAPER}. Concomitantly, RNA-seq data were generated from whole brain tissue from
each of the 17 mouse strains, thereby allowing comparisons between gene expression and genotypic
differences. With the new data from the mouse strain sequencing project high resolution
physiological and expression quantitative trait loci (pQTL and eQTL, respectively) have been further
refined {REF QTL PAPER} from the earlier work of Flint et al. 13.
Here we report the first genome-wide study of TEV contributions to trait variation among a large
number of mouse strains. Previous studies examined two ERV families in four strains (IAP or
ETn/MusD elements in C57BL/6J, A/J, DBA/2J, and 129X1/SvJ) 14, in particular focusing on intron
insertions 15. By contrast, we consider the extent to which SINE, LINE and ERV TE variants
contribute to genomic and phenotypic differences among 17 mouse strains. If TE variants were to
contribute to mouse trait variation then enrichments of TEVs in these pQTLs and eQTLs, and
depletions in functional elements, would be expected.
RESULTS AND DISCUSSION
Genome Landscape of Recently Inserted TEVs
We first predicted 103,798 TEVs (28,951 SINEs, 40,074 LINEs and 34,773 ERVs) among the 17 mouse
strains, an order of magnitude higher than previous studies 14,16. We employed two approaches to
TEV discovery: SVMerge, which combines the results of four methods of structural variation
prediction 17, and RetroSeq (Methods). After filtering, SVMerge predicted 44,401 insertions within
the lineage of the C57BL/6J reference strain, whereas the RetroSeq method inferred 59,397 TEV
insertions occurring outside of this lineage (Figure 1a). We refer to these as B6+ and B6- TEVs,
respectively. By further classifying TEVs according to type and class, we determined that virtually all
mouse strain TEVs are drawn from families that were previously observed to be active 18 (Figure 1,
Supplemental Figure 1). As expected, there are approximately equal numbers of B6+ and B6- variants,
and TEVs are more frequent in strains that are wild-derived (SPRET/EiJ, PWK/PhJ and
CAST/EiJ; 13.8-22.4 per Mb) than in the laboratory strains (4.2-6.3 per Mb; Figure 1a). Using PCR,
we determined the TEV set to be associated with both low false positive rates (9-11%) and low false
negative rates (13%-39%) (Supplemental Tables 3 and 4). Furthermore, by conservatively assuming
the three 129-derived substrains to be largely monomorphic, we further estimated false negative
rates as being between 9.5%-19.9% (Supplemental Tables 3 and 4). Deep sequencing of these 17
genomes thus has made substantial improvements in the numbers and accuracies of TEV calls 14.
We next placed TE insertions within a primary phylogeny of these mouse strains (Figure 1a) in order
to estimate the rates of expansion of TE families over time (Figure 1c-e). This analysis revealed the
historic expansion of ERV families, in particular intracisternal A particles (IAPs), in laboratory strains.
ERVs were seen to contribute between 29% and 39% of all TEVs in the sequenced strains, more than
twice the proportion previously proposed 6 (Figure 1c-e).
Different ERV families contribute to greatly differing extents to these mouse genomes (Figure 2a).
The relative proportions of polymorphic and apparently fixed ERVs also vary considerably among
these families (Figure 2b), in part reflecting the ages of past exogenous viral infections. The family of
MuLV elements, for example, arose recently and thus is found in a smaller number of copies, a
higher fraction of which are variable (Figure 2a-b). ERVs are prone to recombination
between their flanking long terminal repeat (LTR) sequences. To estimate this recombination rate in
different ERV families, we mapped TEVs to the mouse strain primary phylogeny and observed
increasing proportions of solo LTR elements with increasing phylogenetic divergence (Figure 2c). IAP
elements are notable for the rapidity with which they recombine, with a half-life we estimate to be
approximately 800,000 years (Figure 2c). This is similar to a previous estimate derived from a single
MuLV element within the dilute locus of DBA/2J mice 19. By contrast, the older family of ETn
elements is associated with a much slower rate of recombination.
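To make the half-life estimate concrete, the following minimal sketch assumes that conversion of full-length proviruses to solo LTRs follows simple first-order (exponential) decay. That model is an assumption made here for illustration, as are the function names; the text itself reports only the estimated half-life.

```python
import math

def provirus_fraction_remaining(age_years, half_life_years=800_000):
    """Expected fraction of insertions still full-length after age_years,
    assuming first-order decay of proviruses into solo LTRs (an assumption)."""
    return 0.5 ** (age_years / half_life_years)

def half_life_from_observation(age_years, solo_ltr_fraction):
    """Invert the same model: the half-life implied by observing a given
    solo-LTR fraction among insertions of a given age."""
    if not 0.0 < solo_ltr_fraction < 1.0:
        raise ValueError("solo_ltr_fraction must be strictly between 0 and 1")
    remaining = 1.0 - solo_ltr_fraction
    return age_years * math.log(0.5) / math.log(remaining)

# With an 800,000-year half-life, half of the proviruses of that age
# would be expected to have recombined to solo LTRs.
print(provirus_fraction_remaining(800_000))
print(half_life_from_observation(800_000, 0.5))
```

Under this model, a slower-recombining family such as ETn would show a smaller solo LTR fraction at the same phylogenetic depth, and hence a longer implied half-life.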
TEV density is influenced by chromosome, by local nucleotide composition (G+C content), and by
position relative to functional sequence, such as exons. LINEs are known to be greatly enriched in
the X chromosome of the C57BL/6J reference strain 8. Nevertheless, a small study of four mouse
strains (A/J, DBA/2J, 129S1/SvImJ and 129X1/SvJ) proposed that polymorphic LINEs are depleted on
their X chromosomes 16. Our study shows that ERV and SINE TEVs are depleted on the X
chromosomes; by contrast, the density of chromosome X LINE TEVs is little different from those for
the autosomes (Figure 2e). Nevertheless, owing to the X chromosome having an effective
population size that is 75% of the autosomes, TEV integration rates should be correspondingly lower
on the X than on the autosomes 20. After accounting for this effect, LINE TEVs are indeed elevated in
density on the mouse strains’ X chromosomes.
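The adjustment described above can be illustrated with a short calculation. The sketch assumes that the expected X-linked density is simply the autosomal density scaled by the 0.75 effective population size ratio; the function name and the input densities are hypothetical and are not values from this study.

```python
def x_enrichment(x_density, autosome_density, ne_ratio=0.75):
    """Observed X density relative to the density expected after scaling the
    autosomal density by the X:autosome effective population size ratio."""
    if autosome_density <= 0 or ne_ratio <= 0:
        raise ValueError("autosome_density and ne_ratio must be positive")
    expected_x_density = autosome_density * ne_ratio
    return x_density / expected_x_density

# Hypothetical densities (TEVs per Mb), chosen only to show the arithmetic:
# equal raw densities imply a relative elevation on the X after adjustment.
print(round(x_enrichment(x_density=5.0, autosome_density=5.0), 2))
```

Under this assumption, a raw X chromosome density equal to the autosomal density corresponds to an approximately 1.33-fold elevation on the X.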
LINE TEVs show a bias for being located in (A+T)-rich sequence, whilst SINE TEVs tend to reside in
(G+C)-rich sequence (Figure 3a), as was shown by sequencing of the C57BL/6J mouse reference
genome 8. As was noted then, these opposing tendencies are perplexing since LINEs and SINEs insert
using the same endonuclease. One possible resolution to this puzzle is suggested by ‘older’ SINEs
showing G+C distributions that differ from those for more recently inserted SINEs, perhaps because
SINEs in A+T-rich sequence are more readily deleted 21. If so, then we would expect G+C
distributions for recent TEV insertions to differ from those for all TE insertions. Indeed, it has been
proposed that observed genomic distributions will differ substantially from the original insertion site
preferences, owing to a combination of selection and genetic drift 15. Nevertheless, for the G+C
distributions of different TEV classes we observed no substantial differences between the set of all,
primarily ancient, TEs in the reference genome, and the set of recently inserted TEVs (Figure 3a).
We also observed ERV TEVs to be more heterogeneous than SINEs or LINEs in their (G+C)-bias, with
MuLV TEVs being as enriched in high (G+C)-sequence as SINEs (Figure 3a).
Purifying Selection on TE Insertions within Genes
TEVs from all three classes show strong and significant depletions in protein-coding gene exons
(Figure 3b), implying that such insertions are strongly deleterious. In our analyses that determined
the significance of these depletions, we implemented a genome-wide association procedure that
accounted for three potentially confounding effects, namely the differential rates of TE insertion
across (a) the (G+C)-content spectrum (Figure 3a), (b) different chromosomes (Figure 2e), and (c)
sequence of varying length (Methods; 22). We tested for the over- or under-representation of TEVs
across the genome within introns, or the flanking sequences of protein coding genes, or within
intergenic sequence. No significant differences were found in the densities of SINE, LINE or ERV
TEVs between first, middle or last introns. However, SINE TEVs were enriched in flanking and
intronic sequence — opposite to the trends seen for LINE TEVs — and ERV TEVs were strongly
depleted in introns (Figure 3b). These results confirm a previous observation that intronic SINEs are
less often deleterious than the other TE classes 23. We interpret the significant deficits of ERV or LINE
TEVs in introns as indicating that many are deleterious and thus have been selectively purged over
these strains’ evolutionary history. Our observations agree with previous findings that LINE TE
insertions are less well tolerated within gene-rich sequence 20.
We then considered whether intronic TEV densities are higher in genes from particular functional
classes than others, again accounting for chromosome and nucleotide composition biases. The
introns of genes with essential housekeeping functions, such as transcription and chromatin binding
factors, or that, when disrupted, result in embryogenesis phenotypes, were observed to be
significantly and strongly depleted in LINE and ERV TEVs (Figure 3c,d). TE intronic insertion variants
are thus likely to often dysregulate such genes. In contrast, housekeeping genes that are highly
expressed in most tissues show significant enrichments of intronic SINE TEVs, which we assume
reflects a bias for TEs to be inserted into actively transcribed genes, whilst less transcriptionally
active, more tissue-specific genes show significant depletions (Figure 3c).
Next, we calculated the orientation bias (OB = [TEVs in sense orientation]/[all TEVs]) for the 20,001
intronic TEVs that we sampled. If TE insertions are not frequently deleterious, or they are only
mildly deleterious, then we would not expect to observe this bias. Instead, a strong orientation bias
was evident for each of the three TE classes (OB = 32.6%, 41.7%, 41.6%, for ERV, LINE and SINE TEVs
respectively). The strong bias for ERVs is consistent with these elements being depleted in introns
(Figure 3b). Assuming that TEVs inserted in the antisense orientation are not under selection, then
approximately 50% of all ERV insertions in the sense orientation into the introns of protein coding
genes are deleterious, as are about one-third of LINE or SINE sense insertions.
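The arithmetic behind these estimates can be made explicit. The sketch below assumes, as in the text, that antisense insertions are not under selection, and additionally that insertions originally occurred at equal rates in both orientations; the function name is illustrative and is not part of the study's analysis code.

```python
def fraction_sense_purged(ob):
    """Fraction of sense-orientation insertions removed by selection.

    ob is the orientation bias: surviving sense TEVs / all surviving TEVs.
    With equal initial insertion rates and neutral antisense insertions,
    the surviving antisense count stands in for the original sense count.
    """
    if not 0.0 < ob <= 0.5:
        raise ValueError("expected 0 < OB <= 0.5")
    return 1.0 - ob / (1.0 - ob)

# OB values reported in the text for intronic TEVs.
for te_class, ob in [("ERV", 0.326), ("LINE", 0.417), ("SINE", 0.416)]:
    print(f"{te_class}: {fraction_sense_purged(ob):.1%} of sense insertions purged")
```

Under these assumptions, the reported OB values correspond to roughly 52% of sense ERV insertions and 28-29% of sense LINE or SINE insertions having been purged.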
The large set of TEVs in this study allowed us to infer whether the location of a TEV within a gene
structure affects the strength with which it is purged from the population. Orientation bias was
significantly stronger for ERV TEVs within middle or last introns, and SINE TEVs within first introns
(Figure 4a). These differences among introns would be difficult to reconcile with competing models
that explain orientation bias as being due solely to a mutational preference for insertion into the
antisense strand, perhaps as a consequence of transcription-coupled repair 24. We find orientation
bias to not be significantly different between genes with high or low brain expression (data not
shown) or between TEVs that are relatively young or old (Figure 4b), or between solo LTRs and
proviral LTRs (data not shown).
By comparing the orientation bias for TEVs with the corresponding bias for TEs that appear to be
fixed among these mouse strains, we were able to infer the rapidity with which purifying selection on
each TEV class occurs. Orientation bias was not significantly different between apparently fixed and
variant ERV TEs, indicating that deleterious sense inserted ERV TEVs are purged very rapidly from the
mouse population (Table 1). By contrast, the bias was significantly and substantially stronger for
apparently fixed LINE TEs than for recently-inserted LINE TEVs, implying that purifying selection on
sense inserted LINE TEVs tends to be less strong. It is unclear what underlies the small, yet
significant, increase in the strength of this bias for SINE TEVs relative to apparently fixed SINE TEs.
Purifying Selection on TE Insertion Depends on Proximity to Functional Elements
Strong purifying selection of TEVs from all three classes, and in both transcriptional orientations, is
evident in sequence near (<0.5kb) to the transcriptional start sites of genes (Figure 4a). Purifying
selection of deleterious TEVs appears less strong near the 3’ ends of genes. A significant increase in SINE
TEVs in the vicinity of genes 25, on the other hand, is consistent with a previous proposal that SINEs
preferentially insert within genes that are expressed in the germ line 26. This is likely to account for
the enrichment of SINE TEVs upstream and downstream (1-10kb) of genes (Figure 5a).
A recent study of 161 mouse ERV TEVs identified their strongest intronic orientation bias to be in the
close vicinity of exon boundaries 27. Using our much larger set of 20,001 intronic TEVs we confirmed
this finding, and then extended it to include LINE TEVs (Figure 5b). SINE TEVs exhibit a reduced
orientation bias near to exons, thus appearing to be less deleterious; their depletion within the
interiors of introns appears to reflect an effect of G+C composition 27.
Impact of TEVs on quantitative traits
TEV insertion biases with respect to orientation and gene functional class then permitted us to
predict functional TEVs, more specifically TEVs that might either alter gene expression or underlie
quantitative trait variation. We collated a set of 128 candidate functional TEVs, and a second set of
4,055 candidate non-functional TEVs (see Methods, Table 2). TEVs were selected based upon the
evolutionary signals of selection inferred from orientation, genomic location and GO term
association biases. For example, we assumed that TEVs associated with strong biases, which would
indicate strong negative selection (Figure 3b,c and Figure 4a), would have the strongest effect on
function. For each set in turn we first compared gene expression levels, acquired from an RNA-Seq
experiment of brain samples, of single strains with or without the TEV. Neither set, and neither ERV,
LINE nor SINE TEV sets separately, were significantly associated with expression level changes (data
not shown).
Next, we considered the contribution of candidate functional TEVs to about 100 mouse quantitative
traits whose associated loci have been further refined using a merge analysis (Keane et al.,
submitted). Surprisingly, only 1 TEV coincided with any merge quantitative trait locus (QTL).
Consequently, we have no evidence that TEVs commonly affect molecular function or organismal
phenotype. These findings are consistent with results, which we describe elsewhere (Keane et al.,
submitted), that most QTLs, which are of small effect, are less likely to be caused by structural
variants, including TEVs, than point substitutions or short insertions/deletions.
Nevertheless, QTLs of large effect are more likely to be caused by structural variants, and Figure 6a,b
provides 2 examples of TEVs predicted to have large effects (Keane et al., submitted). For neither of
these examples, however, was evidence available for a TEV significantly affecting expression levels.
Between these extremes of a few large effect functional TEVs, and the majority of TEVs that appear
to be without significant functional effect, there will be a set of currently unknown functional TEVs
of modest effect. Our study should now facilitate the identification and validation of such variants,
which will be valuable additions to those previously documented 6 (Wayne’s Table: Supplemental
Table).
Conclusions
Our results assist in distinguishing the minority of TEVs with profound negative effects on organismal
reproductive fitness from the majority of TEVs which have little or no effects on fitness. Deleterious
TEVs are enriched within, and in the close vicinity of, exons, in particular for genes of nuclear
proteins with transcription factor activity. Relative to ERVs and LINEs, SINE TEVs appear more
benign. These findings imply that deleterious TEVs are not frequently segregating among wild
mouse populations, and founder populations of laboratory mice. By contrast, many TEVs arising de
novo in mice are likely to result in nonviability with one-third of all intronic ERV insertions being
purged rapidly from the mouse population. Considering de novo TEVs, rather than more ancient
TEVs segregating among mouse lines, should thus contribute most to the discovery of gene function
and phenotypic variation.
Methods
Sequencing data
Raw sequencing data were generated from 13 classical (129P2/OlaHsd, 129S1/SvImJ,
129S5/SvEvBrd, A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6NJ, CBA/J, DBA/2J, LP/J, NOD/ShiLtJ and
NZO/HiLtJ) and 4 wild-derived (CAST/EiJ, PWK/PhJ, WSB/EiJ and SPRET/EiJ) laboratory strains as part
of the Mouse Genomes Project [Keane et al. submitted]. Briefly, 1,239 Gb of mapped sequence was
generated using the Illumina GAIIx platform 28 providing an average of 27.6-fold sequence coverage
across 17 genomes. Paired end reads were a mixture of 37 bp, 54 bp, 76 bp and 108 bp in length,
with fragments being 150-600 bp in length. Accession numbers for the raw sequencing data are
given in Supplementary Table 1.
TE classification and ERV probes
Differences in nomenclature and classification groupings between RepBase/RepeatMasker and
colloquial descriptions used in the literature can be difficult to resolve. In this study, we classify TEs
into DNA elements, SINEs, LINEs and ERVs. ERVs have been further classified into the following
families: IAP, ETn, VL30, MaLR, RLTR10, RLTR1B, RLTR45, IS2 and MuLV. A complete listing of
RepBase 29/RepeatMasker (ref) classifications and conventional ERV super-classes corresponding to
these families is provided in Supplementary Table 2.
B6+ Calling Algorithms
Structural variant (SV) deletions in all inbred strains were detected using four methods: split-read
mapping (Pindel) 30, mate-pair analysis (BreakDancer, release-0.0.1r61 31), single-end cluster analysis
(SECluster and RetroSeq, unpublished) and read-depth (CND 32). Following merger of these calls into
a non-redundant set, computational validation by local assembly and breakpoint refinement was
performed. Details of the complete pipeline, SVMerge, are described elsewhere 17.
SV calls were intersected with the RepeatMasker 33 track of mm9 downloaded from UCSC on
2010-07-20 (http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/), and SVs found in multiple
strains were linked. SVs were then classified as B6+ TEVs based on the following criteria: they must
contain TE sequence as annotated by RepeatMasker, must be a deletion with respect to the C57BL/6J
reference assembly, and must have sequence annotated as TE within 50 bp of the SV breakpoints. TEVs were
then further classified into superfamilies and TE structure types: LINE, fragment of a LINE, SINE, DNA
transposons, LTR bound element or “complex”. The LTR bound elements were further subdivided
based on structure: Solo LTR (containing only a single LTR), LTR-int (the deleted sequence if
comparing a provirus element to a Solo-LTR), Provirus (an intact ERV with two LTRs and internal
sequence), Pseudoelement (ERV with partial LTR on either end and/or poly-A tail), Hybrid Provirus
(multiple RepeatMasker subfamily annotations within one repeat) or Hybrid Pseudoelement. A
flowchart of the classification criteria can be found in Supplementary Figure 3.
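The three B6+ criteria can be sketched as a simple filter. This is a minimal illustration and not the pipeline's code: the interval representation, the function names, and the reading of the third criterion (some overlapping TE annotation must lie within 50 bp of each of the two SV breakpoints) are assumptions made here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in reference coordinates

def distance_to_interval(pos: int, interval: Interval) -> int:
    """Distance in bp from a position to the nearest base of an interval."""
    start, end = interval
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0

def is_b6_plus_candidate(sv: Interval, is_deletion: bool,
                         te_annotations: List[Interval],
                         max_dist: int = 50) -> bool:
    """Apply the three criteria: deletion relative to the reference,
    overlap with annotated TE sequence, and TE annotation within
    max_dist bp of both breakpoints (assumed interpretation)."""
    if not is_deletion:
        return False
    sv_start, sv_end = sv
    overlapping = [te for te in te_annotations
                   if te[0] < sv_end and te[1] > sv_start]
    if not overlapping:
        return False
    near_start = any(distance_to_interval(sv_start, te) <= max_dist
                     for te in overlapping)
    near_end = any(distance_to_interval(sv_end, te) <= max_dist
                   for te in overlapping)
    return near_start and near_end

# A deletion whose breakpoints sit 10 bp outside a single annotated TE.
print(is_b6_plus_candidate((1000, 1300), True, [(1010, 1290)]))
```

Testing the two breakpoints separately allows elements annotated as several RepeatMasker records, such as a provirus with two LTRs and internal sequence, to satisfy the criterion jointly.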
Of 145,429 SVs that are absent from the C57BL/6J assembly, 11% did not appear to contain TE
sequence, 21.2% TE sequences did not coincide with the SV breakpoints and 33.4% were denoted as
being Complex, meaning that they contained either simple repeats or multiple events of TE insertion.
The categories DNA transposons (n=283), LTR-int (n=317), Pseudoelement (n=12), Hybrid
Pseudoelement (n=36) and Complex were subsequently disregarded in further analyses. The
remainder were classified into the TE repeat families corresponding to the probes used in the
B6- calls (Supplemental Table 2). Although most TE families were classified based on their
RepeatMasker annotations, LINE elements were classified as LINE fragments (hereafter referred to
as LINE_frag) if the length of the element was less than 5 kb. This cut-off was determined
from the local minimum between the frequency distribution curves of small fragments and the clustering
of LINEs of lengths near the canonical size of 6.4 kb (data not shown).
B6- Calling Algorithms
TE insertions that were present in any strain but absent from the C57BL/6J mm9 reference sequence
(denoted as B6- calls) were identified using RetroSeq (https://github.com/tk2/RetroSeq). RetroSeq
seeks inconsistently mapped read pairs where one end is mapped confidently (referred to as anchor
reads) but either the other end is not mapped to the reference, or it is mapped to a distant location
on the reference with low mapping quality. The non-mapping or distantly mapped mates are then
aligned to the ERV probes (Supplemental Table 2). RetroSeq requires the anchoring read to have a
minimum mapping quality of 30 and at least 10 independent read pairs to support a call. Alignments
to Repbase were performed with SSAHA2 34 with a minimum of 80% identity and hit length of 36bp.
RetroSeq clusters the supporting read anchors to produce variant calls to approximately 1-2kb
resolution. The initial seed call windows were subject to further checking as follows. To identify the
putative breakpoints, we scanned the region for positions with coverage lower than 10 reads and
positions with low coverage mismatches (false alignments at the breakpoints can appear as false
SNPs). For each putative breakpoint, we checked the ratio of forward/reverse orientated anchor
reads at either side of the breakpoint. For the breakpoint to be accepted, we required there to be at
least 10 forward orientated anchors within 450bp upstream and 10 reverse orientated anchors
within 450bp downstream. We also required the ratio of forward-to-reverse anchors in the
450bp upstream and 450bp downstream to be less than 2-to-1. Furthermore, we required the
distance from the final forward orientated upstream anchor to the first reverse orientated
downstream anchor to be less than 120bp. We then removed any calls that occurred within 50bp of
a region annotated by Repeatmasker as a ‘simple_repeat’ or ‘low_complexity’ or a SINE, LINE or ERV
element in the mm9 reference.
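The breakpoint acceptance rules above can be summarised in a short sketch. It is an illustration under stated assumptions and is not RetroSeq's implementation: anchors are represented as hypothetical (position, strand) tuples, the 2-to-1 ratio is tested symmetrically, and the gap is measured between single anchor coordinates.

```python
def accept_breakpoint(breakpoint, anchors, window=450, min_anchors=10,
                      max_ratio=2.0, max_gap=120):
    """Return True if a putative breakpoint passes the anchor-read checks.

    anchors: iterable of (position, strand) tuples, strand "+" or "-".
    """
    # Forward anchors upstream and reverse anchors downstream of the breakpoint.
    fwd = sorted(p for p, s in anchors
                 if s == "+" and breakpoint - window <= p <= breakpoint)
    rev = sorted(p for p, s in anchors
                 if s == "-" and breakpoint < p <= breakpoint + window)
    if len(fwd) < min_anchors or len(rev) < min_anchors:
        return False
    # Assumed symmetric test: neither side may outnumber the other 2-to-1.
    ratio = max(len(fwd), len(rev)) / min(len(fwd), len(rev))
    if ratio >= max_ratio:
        return False
    # Distance from the final forward anchor to the first reverse anchor.
    return rev[0] - fwd[-1] < max_gap

forward = [(p, "+") for p in range(900, 1000, 10)]   # 10 anchors, 900..990
reverse = [(p, "-") for p in range(1010, 1110, 10)]  # 10 anchors, 1010..1100
print(accept_breakpoint(1000, forward + reverse))
```

The subsequent removal of calls within 50bp of annotated repeats is a separate filter and is not shown here.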
Due to the differences in the sequencing depth across the strains, it was necessary to carry out a
computational genotyping step in order to correct for false negatives in the strains with lower
sequencing coverage. For each TEV call and each strain that the call was not made in, we examined
the reads 300bp upstream and downstream and counted the number of putative anchor reads
(single end mapping). If there were at least 5 forward orientated anchor reads upstream and 5
reverse orientated anchor reads downstream, then we called the genotype present in the strain.
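The genotyping step can be sketched as follows, assuming the anchor reads within 300bp of each call have already been counted per strain. The data structure and function name are hypothetical and serve only to illustrate the rule of at least 5 forward anchors upstream and 5 reverse anchors downstream.

```python
def rescue_genotypes(called_strains, anchor_counts, min_each_side=5):
    """Return strain -> presence call for one TEV.

    called_strains: strains in which the TEV was originally called.
    anchor_counts: strain -> (forward anchors upstream, reverse anchors
    downstream), counted within 300 bp of the call (counting assumed
    to have been done beforehand).
    """
    genotypes = {strain: True for strain in called_strains}
    for strain, (fwd_up, rev_down) in anchor_counts.items():
        if strain in genotypes:
            continue  # original calls are kept as they are
        genotypes[strain] = (fwd_up >= min_each_side
                             and rev_down >= min_each_side)
    return genotypes

# Hypothetical counts for a TEV originally called only in A/J.
counts = {"A/J": (12, 11), "LP/J": (6, 5), "CBA/J": (7, 2)}
print(rescue_genotypes({"A/J"}, counts))
```

The threshold is half that required for an initial call, which is how lower-coverage strains can recover insertions they would otherwise miss.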
B6- Orientation
To determine the sense/antisense orientation of the B6- elements, we carried out local de
novo assembly with Velvet 35 of the reads that mapped less than 600bp upstream and downstream
(including their mates) of the putative breakpoint. We realigned the contigs to the reference with
SSAHA2 34 and detected contigs that align incompletely at the breakpoints. We aligned the
unmatched part of the contig to the ERV probe set in order to determine the orientation status of
the element.
B6- Size Estimation
In order to get an accurate estimate of the sizes of the B6- TEVs, we generated a single sequencing
‘jumping’ library with an estimated fragment size of 3kb and sequenced 50bp from each end of the
fragments in a single HiSeq lane per strain for 13 strains (129P2, 129S1/SvImJ, 129S5, A/J, BALB/cJ,
C3H/HeJ, CAST/EiJ, CBA/J, DBA/2J, LP/J, NZO/SHiLtJ, PWK/PhJ, WSB/EiJ).
Briefly, mate pair (3kb) libraries were prepared from 10μg mouse genomic DNA using a hybrid
SOLiD/Illumina library protocol developed by L. Shirley and M. Quail at the Wellcome Trust Sanger
Institute. Mouse genomic DNA was sheared to approx 3kb fragments using a Digilab Hydroshear and
the 2 x 50bp mate-paired library was constructed using the nick translation protocol (SOLiD 3 Plus
System Library Preparation Guide 2009) as described elsewhere 36. Immediately following the S1
nuclease/T7 exonuclease digest, the biotinylated mate-pair fragments were purified and ligated to
appropriate adapters (Integrated DNA technologies), enriched by PCR then size-selected exactly as
described in the Illumina mate-pair library v2 Sample Preparation Guide.
We mapped these reads to the reference genome using smalt
(http://www.sanger.ac.uk/resources/software/smalt/) and estimated the physical coverage from
these lanes to be between 30- and 40-fold per strain. For each B6- TEV in the above strains, if there were
more than 2 read pairs spanning the insertion breakpoint, then we estimated the size of the element
to be <3kb. This information was used to assign an approximate size status to the LINE and ERV calls.
As validation of this approach, we observed that almost all (>95%) of SINE calls were spanned.
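The spanning-pair rule can be illustrated as follows. This is a sketch only: each mate pair is represented by a hypothetical `(left_read_end, right_read_start)` tuple in reference coordinates, and the label returned for insertions that are not spanned is a placeholder, since the text assigns an explicit size only to spanned elements.

```python
from typing import List, Tuple

def count_spanning_pairs(breakpoint: int, pairs: List[Tuple[int, int]]) -> int:
    """Count mate pairs whose two reads lie on opposite sides of the breakpoint."""
    return sum(1 for left_end, right_start in pairs
               if left_end <= breakpoint <= right_start)

def size_status(breakpoint: int, pairs: List[Tuple[int, int]]) -> str:
    """More than 2 spanning pairs implies the element fits inside a ~3kb fragment."""
    if count_spanning_pairs(breakpoint, pairs) > 2:
        return "<3kb"
    return "not spanned"  # placeholder label, not a size estimate from the source

# Toy mate pairs from a ~3kb library around a breakpoint at position 1000.
toy_pairs = [(100, 3100), (500, 3400), (900, 3900), (2000, 5000)]
print(size_status(1000, toy_pairs))
```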
B6+ Validation
We estimated the false positive (FP) rate of the B6+ calls to be 2.6%, taken as the percentage of
TEVs belonging to families considered to be inactive in the mouse lineage. For the B6+ calls, the
false negative (FN) rate was estimated against the high confidence manual validation sets
for chromosome 19 in eight of the strains and the 250 selected PCR validated SVs described in
(Yalcin et al, in preparation) (Supplemental Table 3). In these PCR sets no false positives were
detected.
In order to estimate false negative rates, we made a conservative assumption that the three
129-derived substrains (129P2/OlaHsd, 129S1/SvImJ and 129S5/SvEvBrd) are monomorphic for any TEV.
Therefore, we counted the number of TEVs where there was a call made in two out of the three
129-derived substrains and assumed the missing call to be a false negative. For B6+ calls, we obtained FN
estimates of 13.3%, 14.4%, and 9.5% for SINE, LINE, and ERV classes respectively (Supplemental
Table 3).
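The substrain-based false negative estimate can be sketched as below. The text does not state the denominator used, so this example assumes, purely for illustration, that every site called in at least two of the three 129-derived substrains contributes three expected calls. The data structure and site names are hypothetical.

```python
from typing import Dict, Set

SUBSTRAINS_129 = frozenset({"129P2/OlaHsd", "129S1/SvImJ", "129S5/SvEvBrd"})

def fn_rate_from_substrains(calls: Dict[str, Set[str]],
                            substrains: frozenset = SUBSTRAINS_129) -> float:
    """Estimate the FN rate assuming the substrains are monomorphic for every TEV.

    calls: hypothetical mapping of TEV site id -> set of strains with a call.
    Assumption: denominator = expected calls at sites seen in >= 2 substrains.
    """
    expected = missed = 0
    for strains_called in calls.values():
        n_called = len(strains_called & substrains)
        if n_called >= 2:
            expected += len(substrains)
            missed += len(substrains) - n_called  # the missing call is counted as FN
    return missed / expected if expected else float("nan")

# Toy call set (site identifiers are made up for the example).
toy_calls = {
    "site1": {"129P2/OlaHsd", "129S1/SvImJ", "129S5/SvEvBrd"},
    "site2": {"129P2/OlaHsd", "129S1/SvImJ"},
    "site3": {"129P2/OlaHsd"},
    "site4": {"129S1/SvImJ", "129S5/SvEvBrd", "A/J"},
}
print(fn_rate_from_substrains(toy_calls))
```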
B6- Validation
To measure the false positive rates of the B6- TEV calls, we carried out a total of 53, 34 and 47 random
PCRs across the SINE, LINE, and ERV superfamilies respectively. Primers were designed using Primer3 37
and purchased from MWG (Germany). For each insertion call, several independent PCR reactions
were carried out including two reactions with Hotstar Taq (Qiagen), and a reaction with LongRange
PCR Kit (Qiagen). Reactions were performed as previously described 38. PCR gel images were then
taken to assess the performance of each PCR reaction. PCR results are listed in Supplementary Table 4a and
4b. From these data, we estimate FP rates to be 9%, 11%, and 0% for the three superfamilies (SINE,
LINE, ERV), respectively. Using the same approach, we estimated the false negative rates amongst
the 129-derived substrains to be 28%, 32% and 12%, respectively.
Expression Comparison Methods
Thomas could you field this please? Figure 6 uses it... and some “data not shown” does too.
Log transformed ANOVA
Availability of Calls
The full set of TEV calls has been submitted to DGVa at the EBI
(http://www.ebi.ac.uk/dgva/page.php; accession numbers pending). In the interim, pre-accession
calls can be downloaded from ftp://ftp.sanger.ac.uk/pub/mouse_genomes/current_svs.
Distribution of TEVs across a phylogeny representing a primary subspecies history of 18 mouse
strains.
Ignoring incomplete lineage sorting, we calculated an approximate phylogenetic tree of mouse
strains that would allow us to infer the insertion of a TEV (Figure 1). Using Seqboot, Mix and
Consense from the Phylip package 39, we considered the TEVs to be discrete morphologies and
performed 100 bootstraps. All nodes of the resulting consensus tree were established with 100%
reliability. TEVs were mapped to the last node where the strain distribution pattern of the variant
was a subset of the strains included.
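The assignment of a TEV to a node can be illustrated with sets: the variant is placed at the smallest clade whose strains contain its strain distribution pattern. The clade names and strain labels below are toy placeholders and do not represent the tree in Figure 1.

```python
from typing import Dict, FrozenSet, Optional, Set

def assign_to_node(sdp: Set[str], clades: Dict[str, FrozenSet[str]]) -> Optional[str]:
    """Return the most recent node whose descendant strains include the whole SDP.

    In a tree, all clades containing a given strain set are nested, so the
    smallest such clade is the last node at which the insertion could have occurred.
    """
    containing = [(len(members), name) for name, members in clades.items()
                  if sdp <= members]
    return min(containing)[1] if containing else None

# Hypothetical four-strain topology: ((strain_1, strain_2), strain_3), strain_4.
toy_clades = {
    "root": frozenset({"strain_1", "strain_2", "strain_3", "strain_4"}),
    "N1": frozenset({"strain_1", "strain_2", "strain_3"}),
    "N2": frozenset({"strain_1", "strain_2"}),
}
print(assign_to_node({"strain_1", "strain_3"}, toy_clades))
```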
Structure and activities of ERV families.
From the RepeatMasker 33 track of mm9, the number of bases belonging to each ERV family was
calculated as the total amount of sequence annotated as one of the RepBase identifiers
(Supplemental Table 2). The proportion of ERVs that are TEVs was estimated from the number of bases
in the B6+ TEV calls (Figure 2b). From the B6+ structure classifications of the ERV TEVs mapped to
the primary phylogeny (Figure 1a) within the C57BL/6J lineage (Figure 1c), the percentage of solo-LTRs
was calculated (Figure 2c,d). For the set of all TEVs, the average autosomal densities (TEV/bp) and
the individual chromosome density ratios across all strains were calculated (chromosome density /
autosome density; Figure 2e).
Genome-wide nucleotide composition, gene structure and gene annotation biases for TEV
occurrence.
C57BL/6J genomic TEs were identified from the RepeatMasker 33 track of mm9 by excluding
consecutive annotations of TEs in the same subfamily and by concatenating LTR bound sequences of
a proviral structure. The local GC content was calculated from the 20kb of sequence surrounding the
TE (Figure 3a).
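A minimal sketch of the local GC calculation follows. It assumes that the 20kb window is split as 10kb on each side of the TE and that the TE sequence itself is excluded; neither detail is stated in the text. Ambiguous bases are ignored.

```python
def local_gc_content(sequence: str, te_start: int, te_end: int,
                     flank: int = 10_000) -> float:
    """GC fraction of the sequence flanking a TE at [te_start, te_end).

    Assumptions: `flank` bases are taken on each side (20kb in total by default)
    and the TE itself is left out of the window.
    """
    left = sequence[max(0, te_start - flank):te_start]
    right = sequence[te_end:te_end + flank]
    window = (left + right).upper()
    gc = window.count("G") + window.count("C")
    acgt = gc + window.count("A") + window.count("T")  # skips N and other codes
    return gc / acgt if acgt else float("nan")

# Toy sequence: GC-rich left flank, 4bp placeholder element, AT-rich right flank.
toy_seq = "GC" * 5 + "NNNN" + "AT" * 5
print(local_gc_content(toy_seq, 10, 14, flank=10))
```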
To calculate TEV density in exons, introns, 5kb flanks of genes and intergenic regions we
used the Genomic Association Tester (GAT)
(http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/gat/contents.html) (Figure 3b). GAT
calculates an expected count by running randomised simulations of the input data. The simulated data
replicate genomic space, chromosome and isochore distributions and so provide unbiased
measures of the null expectation. Sample data counts are compared to a hypergeometric
distribution from simulated dataset measurements. Multiple testing corrections were applied with
the Benjamini-Hochberg method. GAT was also used within the various genomic spaces to test
association of TEVs to GO-slim terms 40 (Figure 3e) and MGI overarching phenotypes 41-43 (Figure 3d).
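For readers unfamiliar with the multiple testing correction named above, the Benjamini-Hochberg adjustment can be sketched in a few lines. This is a generic illustration of the method, not GAT's own code, and the input p-values are invented for the example.

```python
from typing import List

def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values for four hypothetical annotation tests.
toy_pvalues = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(toy_pvalues))
```

An association would then be reported when its adjusted value falls below the chosen false discovery rate.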
Densities and orientations of TEVs with respect to the transcriptional (sense) direction of mouse
genes.
Orientation bias was calculated as the percentage of TEVs in the sense orientation. Dividing introns
into first, middle and last positions in genes showed skewed distributions of TEV orientations. In
For Figure 4a we performed Chi-squared tests, which showed significant differences in the
occurrences of SINE and ERV TEVs depending on intronic space.
From the RepeatMasker 33 track of mm9 the divergence from the prototypical sequence of each B6+
TEV was taken as a measure of age of the insertion. The percent divergence was divided into bins
and the trend plotted (Figure 4b).
Densities of TEVs in the proximity of gene and exon boundaries.
TEV densities were calculated upstream, downstream (Figure 5a) and in introns near exon
boundaries (Figure 5b) for genes with orthologs in humans. Sense and antisense orientations,
relative to the direction of gene transcription, were considered separately and TEVs with unresolved
orientation were excluded from this analysis. The most 5’ base of the TEV was taken as its
position. For each span (upstream, downstream or intronic) the maximum distance (half way to the
next gene or exon) was recorded. Distance from the exon or gene was divided into bins and each bin
scaled using the maximum number of bases in each span present in the genome. Using these scaled
bin sizes we calculated the density of TEVs (number of TEVs at the relevant distances / average
number of bases at those distances). The bin sizes used in Figure 5a,b were taken from the Fibonacci
series which was found to visualize the data better than linear or logarithmic scaling.
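The Fibonacci binning can be sketched as follows. The upper limit of 46,368 bp used in the Figure 5 legend is itself a Fibonacci number, so the series of bin edges ends exactly there. The helper names are hypothetical, and the scaling of each bin by the number of bases available at that distance is not reproduced here.

```python
from bisect import bisect_right
from typing import List

def fibonacci_bin_edges(max_distance: int) -> List[int]:
    """Bin edges 0, 1, 2, 3, 5, 8, ... up to the first value >= max_distance."""
    edges = [0, 1, 2]
    while edges[-1] < max_distance:
        edges.append(edges[-1] + edges[-2])
    return edges

def bin_index(distance: int, edges: List[int]) -> int:
    """Index of the half-open bin [edges[i], edges[i+1]) containing the distance."""
    return bisect_right(edges, distance) - 1

edges = fibonacci_bin_edges(46_368)
print(edges[:8], edges[-1], bin_index(4, edges))
```

Bin widths grow with distance, so short distances near the gene or exon boundary are resolved finely while distant regions are pooled.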
Impact of TEVs on quantitative traits
To assess the contribution of TEVs to quantitative traits, two lists of TEVs were compiled for each
superfamily: one with high and one with very low predicted likelihood of having functional effects.
TEVs were selected based upon structure and the signals of selection found in orientation, genomic
location and GO term association biases. The “high probability” set were intact (proviral ERV, LINEs
> 5Kb, Figure 2d) TEVs in sense from introns (Figure 3b) with the highest orientation biases (Figure
4a), in genes associated to GO-terms found to be depleted for the TEV (Figure 3c). The “very low
probability” set were degraded TEVs (solo-LTR ERV, LINE_frag) in intergenic regions associated to
GO terms found to be enriched for the TEV (Table 2).
Figure Legends
Figure 1. Distribution of TEVs across a phylogeny representing a primary subspecies history of 18
mouse strains. a) This phylogeny (left), which averages across these strains’ known phylogenetic
discordances, was imputed by considering TEVs to be discrete morphologies. All nodes were
supported by bootstrap values of 100%. Numbers of B6+ and B6- TEV insertions are shown (right),
within and without the C57BL/6J lineage, respectively. b,d) Proportions of TEV classes (ERVs, LINEs
and SINEs) across the inferred phylogeny for C57BL/6J (b) and C3H/HeJ (d) lineages. LINE elements
are further divided into full length insertions (>5kb) and smaller fragments of LINEs (LINE_frag). c,e)
Proportions of ERV families across the inferred phylogeny for C57BL/6J or C57BL/6NJ (c) and
C3H/HeJ or CBA/J (e) lineages. For example, ‘AB’ indicates insertion events inferred to have occurred
after the divergence of SPRET/EiJ from all other strains. Numbers of predicted insertions are given
below.
Figure 2. Structure and activities of ERV families. a) Numbers of bases within 9 ERV families present
in the C57BL/6J reference genome. b) Proportions of C57BL/6J bases belonging to each ERV TEV
family relative to apparently fixed ERV sequence. These proportions reflect the variable ages and
historical activities of ERV families. c) Percentage of TEV ERVs predicted as having a solo-LTR
structure within the C57BL/6J or C57BL/6NJ lineages projected onto the phylogenetic tree from
Figure 1. An older ERV insertion shows a greater tendency to be a solo-LTR than a canonical proviral
form. This tendency is approximately proportional to the age of the TEVs, with the exception of ETn
and IAP, which have a higher percentage of solo-LTRs for their age than the other families. d) Proportions
of proviral and solo-LTR structures for each TEV ERV family. e) Boxplot of densities of TEVs per chromosome
relative to average density on the autosomes. Outliers are labelled with the corresponding chromosome. The X
chromosome is expected to have a lower density of TEV occurrences than the autosomes due to the
single copy in male germ cells, partially counterbalanced by the lower rate of gene conversion
events due to the lower recombination rate. A corresponding depletion in density is only observed
for ERV and SINE TEVs. LINE and LINE_frag TEVs occur in the same proportions on the X as on the
autosomes, indicating an enrichment of LINE on the X above the expected.
Figure 3. Genome-wide nucleotide composition, gene structure and gene annotation biases for TEV
occurrence. a) Cumulative distributions of TEV families according to their genomic GC context. SINE
TEVs tend to occur in GC-rich sequence while LINE and ERV TEVs each show an AT preference, with
the notable exception of the MuLV family which is biased towards GC. TEVs showed no differences
in these biases compared to all TEs in the reference genome assembly. b) Having accounted for
these GC biases, TEVs are substantially and significantly depleted (red shades) in exons and introns,
with the notable exception of SINEs that are enriched (green shades) in intronic regions. Both SINE
and ERV TEVs are enriched in 5Kb upstream and downstream flanking regions of genes, while LINEs
are depleted. c, d) Gene annotations that are significantly enriched (green shades) or depleted (red
shades) in intronic or intergenic TEV insertions having accounted for GC content, and intronic or
intergenic lengths, and after correcting for multiple tests. Gene annotations are from either the
Gene Ontology (slim set) (c) or the Mouse Genome Informatics phenotypes associated with gene
disruptions (d), and are shown when at least one significant association (p < 10^-6) was observed. SINE
TEVs show a pattern of enrichments and depletions that is the complement of the patterns for LINE
and ERV TEVs.
Figure 4. Densities and orientations of TEVs with respect to the transcriptional (sense) direction of
mouse genes. a) Orientation bias within first, middle and last introns of protein coding genes. All
TEV types occur preferentially in the antisense orientation, with the ERV TEV bias being the
strongest. ERV TEVs show a lower bias in the first introns of genes (p<0.001, determined from the
Chi-squared distribution). SINE TEVs show significantly stronger orientation bias in the first introns of protein
coding genes (p<0.001). The orientation biases of intronic TEVs do not appear to correlate with
expression levels of the host gene (data not shown). b) Orientation biases are not significantly
different between ‘young’ and ‘old’ TEVs, as categorised using percentage sequence divergence
from the repeat consensus sequence (X-axis).
Figure 5. Densities of TEVs in the proximity of gene and exon boundaries. a) Densities of TEVs (full
lines) or of TEs from the reference C57BL/6J assembly (dashed lines) 5´ of genes’ transcriptional start
sites (left panels) or 3´ of genes’ poly-adenylation signals (right panels). b) Densities of intronic TEVs
(full lines) or of TEs from the reference C57BL/6J assembly (dashed lines) 5´ of exons (left panels) or
3´ of exons (right panels). The top two panels of each (a) and (b) represent TEVs and TEs that occur
in the transcriptional sense orientation, whereas the bottom two panels represent those present in
the antisense orientation. For each family, the densities of TEVs (y-axis) present within distance bins
(x-axis) from the gene are shown relative to the TEV density observed within 46,368 bps of the gene.
All TEVs and TEs are depleted in close proximity to the 5’ end of genes, but SINEs are enriched upstream
(~500bp-10kbp) of genes. No significant effects of TEV orientation on density distributions in the
vicinity of genes were observed. A difference in density profiles of sense and antisense TEVs is
observed in proximity to exon boundaries.
Figure 6. Example of TEVs with high probability of having direct functional consequences as
determined by merge QTL analysis. In the {REF SV PAPER} the top 12 candidate SVs for having
functional consequences as determined by merge QTL analysis were reported. Two of these that are
also classified as TEVs here are exemplified. a) An IAP insertion event in the strains 129P2/OlaHsd,
129S1/SvImJ, 129S5/SvEvBrd, A/J, DBA/2J, LP/J and WSB/EiJ. The TEV is 1620bp upstream of the
gene Eps15 and is associated with home cage activity phenotypes. c) An IAP insertion event found in
the strains A/J, BALB/cJ, C3H/HeJ and CBA/J. The TEV is 1630bp upstream of the gene Tmc3 and is associated
with wound healing phenotypes. Both IAP elements are of a proviral intact structure in the sense
orientation relative to the neighbouring gene. Neither can be linked to any significant difference in
RNA-seq expression levels from brain tissue.
Table 1. Orientation bias of TEVs and apparently fixed TEs in the mouse genome. For each
superfamily of TEs the number of intronic variants within mouse genes with human orthologs was
counted. B6+ TEVs and apparently fixed C57BL/6J TEs were counted as either sense or antisense
with respect to the gene’s transcriptional orientation. Significant differences between the
orientation biases of apparently fixed and variant TEs were identified using a Chi-squared test.
Table 2. TEV “high” vs. “very low” probability of functional consequences sets and occurrence in
merge QTLs. Using the strong signals of selection to identify TEVs likely to cause functional
consequences, we find no association between these TEVs and phenotypic functional consequences, as
determined by Chi-squared tests.
                        Sense   Antisense   %Sense   Chi-squared test p-value
ERV TEV                  2042        4176    32.8%   0.61
Apparently fixed ERV    61857      128263    32.5%
LINE TEV                 3803        5336    41.6%   3.5 x 10^-33
Apparently fixed LINE   81655      148497    35.5%
SINE TEV                 4192        5528    43.1%   1.2 x 10^-4
Apparently fixed SINE  282011      343528    45.1%
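The Table 1 comparison of TEV and apparently fixed TE orientations can be illustrated with a Pearson Chi-squared test on each 2x2 table of sense and antisense counts. The sketch below assumes no continuity correction, which the text does not specify, and uses the identity that the upper tail of a Chi-squared distribution with one degree of freedom equals erfc(sqrt(x/2)). Under these assumptions the computed p-values should land close to those reported in Table 1.

```python
import math
from typing import Tuple

def chi2_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson Chi-squared statistic and p-value (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # survival function for 1 d.f.
    return chi2, p_value

# Counts from Table 1: (TEV sense, TEV antisense, fixed sense, fixed antisense).
TABLE1 = {
    "ERV": (2042, 4176, 61857, 128263),
    "LINE": (3803, 5336, 81655, 148497),
    "SINE": (4192, 5528, 282011, 343528),
}

for name, (a, b, c, d) in TABLE1.items():
    chi2, p = chi2_2x2(a, b, c, d)
    print(f"{name}: TEV sense {100 * a / (a + b):.1f}%, "
          f"fixed sense {100 * c / (c + d):.1f}%, chi2 = {chi2:.2f}, p = {p:.2g}")
```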
        High probability                 Very low probability
        total   merge   non-merge        total   merge   non-merge
SINE        9       0           9         1281       0        1281
LINE       68       1          67         1364      29        1335
ERV        51       0          51         1358      23        1335
References
1. Gogvadze, E. & Buzdin, A. Retroelements and their impact on genome evolution and functioning. Cell Mol Life Sci 66, 3727-42 (2009).
2. Shapiro, J.A. Mobile DNA and evolution in the 21st century. Mob DNA 1, 4.
3. Belancio, V.P., Hedges, D.J. & Deininger, P. Mammalian non-LTR retrotransposons: for better or worse, in sickness and in health. Genome Res 18, 343-58 (2008).
4. Cordaux, R. & Batzer, M.A. The impact of retrotransposons on human genome evolution. Nat Rev Genet 10, 691-703 (2009).
5. Stocking, C. & Kozak, C.A. Murine endogenous retroviruses. Cell Mol Life Sci 65, 3383-98 (2008).
6. Maksakova, I.A. et al. Retroviral elements and their hosts: insertional mutagenesis in the mouse germ line. PLoS Genet 2, e2 (2006).
7. Hedges, D.J. & Deininger, P.L. Inviting instability: Transposable elements, double-strand breaks, and the maintenance of genome integrity. Mutat Res 616, 46-59 (2007).
8. Waterston, R.H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-62 (2002).
9. Duhl, D.M., Vrieling, H., Miller, K.A., Wolff, G.L. & Barsh, G.S. Neomorphic agouti mutations in obese yellow mice. Nat Genet 8, 59-65 (1994).
10. Morgan, H.D., Sutherland, H.G., Martin, D.I. & Whitelaw, E. Epigenetic inheritance at the agouti locus in the mouse. Nat Genet 23, 314-8 (1999).
11. Smit, A.F. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev 9, 657-63 (1999).
12. Chevret, P., Veyrunes, F. & Britton-Davidian, J. Molecular phylogeny of the genus Mus (Rodentia: Murinae) based on mitochondrial and nuclear data. Biological Journal of the Linnean Society 84, 417-427 (2005).
13. Flint, J., Valdar, W., Shifman, S. & Mott, R. Strategies for mapping and cloning quantitative trait genes in rodents. Nat Rev Genet 6, 271-86 (2005).
14. Zhang, Y., Maksakova, I.A., Gagnier, L., van de Lagemaat, L.N. & Mager, D.L. Genome-wide assessments reveal extremely high levels of polymorphism of two active families of mouse endogenous retroviral elements. PLoS Genet 4, e1000007 (2008).
15. Zhang, Y., Romanish, M. & Mager, D. Distributions of Transposable Elements Reveal Hazardous Zones in Mammalian Introns. PLoS Comput Biol 7 (2011).
16. Akagi, K., Li, J., Stephens, R.M., Volfovsky, N. & Symer, D.E. Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res 18, 869-80 (2008).
17. Wong, K., Keane, T.M., Stalker, J. & Adams, D.J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 11, R128.
18. Goodier, J.L. & Kazazian, H.H., Jr. Retrotransposons revisited: the restraint and rehabilitation of parasites. Cell 135, 23-35 (2008).
19. Seperack, P.K., Strobel, M.C., Corrow, D.J., Jenkins, N.A. & Copeland, N.G. Somatic and germline reverse mutation rates of the retrovirus-induced dilute coat-color mutation of DBA mice. Proc Natl Acad Sci U S A 85, 189-92 (1988).
20. Kvikstad, E.M. & Makova, K.D. The (r)evolution of SINE versus LINE distributions in primate genomes: sex chromosomes are important. Genome Res 20, 600-13.
21. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
22. Ponjavic, J., Ponting, C.P. & Lunter, G. Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs. Genome Res 17, 556-65 (2007).
23. Cordaux, R., Lee, J., Dinoso, L. & Batzer, M.A. Recently integrated Alu retrotransposons are essentially neutral residents of the human genome. Gene 373, 138-44 (2006).
24. Svejstrup, J.Q. Mechanisms of transcription-coupled DNA repair. Nat Rev Mol Cell Biol 3, 21-9 (2002).
25. Medstrand, P., van de Lagemaat, L.N. & Mager, D.L. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res 12, 1483-95 (2002).
26. Warnefors, M., Pereira, V. & Eyre-Walker, A. Transposable elements: insertion pattern and impact on gene expression evolution in hominids. Mol Biol Evol 27, 1955-62.
27. Zhang, Y., Romanish, M.T. & Mager, D.L. Distributions of transposable elements reveal hazardous zones in Mammalian introns. PLoS Comput Biol 7, e1002046 (2011).
28. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-9 (2008).
29. Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110, 462-7 (2005).
30. Ye, K., Schulz, M.H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865-71 (2009).
31. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-81 (2009).
32. Simpson, J.T., McIntyre, R.E., Adams, D.J. & Durbin, R. Copy number variant detection in inbred strains from short read sequence data. Bioinformatics 26, 565-7.
33. Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4, Unit 4 10 (2009).
34. Ning, Z., Cox, A.J. & Mullikin, J.C. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-9 (2001).
35. Zerbino, D.R. Using the Velvet de novo assembler for short-read sequencing technologies. Curr Protoc Bioinformatics Chapter 11, Unit 11 5.
36. McKernan, K.J. et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res 19, 1527-41 (2009).
37. Rozen, S. & Skaletsky, H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 132, 365-86 (2000).
38. Yalcin, B. et al. Genetic dissection of a behavioral quantitative trait locus shows that Rgs2 modulates anxiety in mice. Nat Genet 36, 1197-202 (2004).
39. Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.6. (2005).
40. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000).
41. Blake, J.A., Bult, C.J., Kadin, J.A., Richardson, J.E. & Eppig, J.T. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 39, D842-8.
42. Finger, J.H. et al. The mouse Gene Expression Database (GXD): 2011 update. Nucleic Acids Res 39, D835-41.
43. Krupke, D.M., Begley, D.A., Sundberg, J.P., Bult, C.J. & Eppig, J.T. The Mouse Tumor Biology database. Nat Rev Cancer 8, 459-65 (2008).