The functional impact of sequence variation on phenotypes in
17 mouse genomes.
The sequence and analysis of 17 laboratory and wild-derived mouse
genomes
Abstract
Introduction
Genetic analysis of the laboratory mouse plays a key role in our
understanding of mammalian biology and of human disease. Indeed, amongst
eukaryotes, inbred strains of laboratory mice provide a unique resource of
genetically uniform stocks to interpret the relationship between sequence and
phenotypic variation. Genetic and phenotypic diversity is remarkably large,
due in part to the origin of classical inbred strains from different subspecies
(1,2), and in part to homozygous genomes unmasking recessive mutations.
The availability of inbred strains derived from wild caught mice has provided
access to even more genetic variability (3). Furthermore, in addition to the 300
or so inbred strains (4-6), mouse geneticists have developed, and use, strain
hybrids. When these are inbred they provide a renewable source of the
identical allelic combinations, essential for investigating traits with low
heritability or the effects of multiple environments (7).
The naturally occurring sequence variation that in part underlies
phenotypic differences between strains has been identified through three
complementary methods: first, a relatively small number of variants with
known functional consequences have been found through mapping and
cloning of classical Mendelian mutations in mice (8); second, a larger number
of variants has been identified by mapping and sequencing of loci that
contribute to quantitative variation, but the attribution of function to any of
them is unclear; third, catalogues of sequence variants have been created (9,10).
Databases currently hold XX million single nucleotide polymorphisms (SNPs),
YY short insertions and deletions (indels) and 10,938 structural variants (SVs).
However, it is not known to what extent these catalogues are complete, nor
how many of the variants have functional consequences, nor how they might
exert their effect. Only the complete genome sequences of inbred strains will
provide the raw material necessary to address these questions and to begin
dissecting the path from sequence to phenotypic variant.
New technologies have brought within our grasp the ability to sequence
individual mammalian genomes. In this paper we apply these technologies to
sequence seventeen mouse genomes, chosen to address a number of aims:
first, to assess the extent of sequence diversity between strains and to
apportion that diversity between different classes of sequence variant;
second, to compare sequence diversity present within the classical inbred
strains to that in wild-derived mice; third, to investigate the functional
consequences of sequence variation. Specifically for the latter we asked
whether we could find sequence variants contributing to quantitative variation
at more than 800 quantitative trait loci (QTLs) mapped at high resolution in
mice descended from eight classical inbred strains (11); in addition we set out
to determine the sequence variants that contribute to variation in gene
expression and whether these might also explain phenotypic variation.
Results
Data generation and variant discovery
We generated 738 Gb of raw sequence from the following classical laboratory
mouse inbred strains: C3H/HeJ, CBA/J, A/J, AKR/J, DBA/2J, LP/J, BALB/cJ,
NZO/HiLtJ and NOD/ShiLtJ, and 238 Gb from the wild-derived strains
CAST/EiJ, SPRET/EiJ, WSB/EiJ and PWK/PhJ. In addition we generated
261 Gb of sequence from three 129 strains (129S5SvEvBrd, 129P2/OlaHsd and
129S1/SvImJ) and xx Gb from C57BL/6NJ. This provided an average of 25x
sequence coverage across these 17 genomes; reads were mostly 54 bp, paired,
and fragments were between 150 and 600 bp in length (Table 1). Sequence
variants were defined as differences from the reference strain (C57BL/6J;
mm9/m37). Overall we identified 148.3 million single nucleotide
polymorphisms (SNPs) at 65.2 million unique sites, 21.7 million
insertions/deletions (indels) at 8.8 million unique sites and ZZ structural
variants (SVs) including YY transposon insertions. We provide an overview of
all variants called in this study and discuss what we found and missed, for
two categories of variant: SNPs and indels. Analyses of SVs and mobile
genetic elements are described in accompanying papers.
SNPs and short indels
As shown in Table 1 we managed to identify virtually all previously described
variants, and also to generate a catalog of new variants of unprecedented
resolution and accuracy. Results presented in the table include only those
SNPs from reads whose average mapping quality was greater than 40 (12).
We refer to these regions of the genome as “callable” (Supplementary
Information).
We established the accuracy and sensitivity of the SNP calls in three
ways. First, in comparison with dbSNP we found that 99% of genotypes
common to both are identical. Of the 1% of genotype mismatches,
approximately 60% are private to one study and are likely mistakes in dbSNP;
the remainder were ambiguous sites with adjacent SNPs and indels. Second,
we compared our calls to the mouse Perlegen dataset (9), in which nine of the
strains we sequenced (129S1/SvImJ, A/J, AKR/J, BALB/cJ, C3H/HeJ,
DBA/2J, NOD/ShiLtJ, WSB/EiJ and CAST/EiJ) were genotyped using
whole-genome hybridization arrays. SNPs that mapped to the same genomic
location had identical genotypes in 99.8% of cases. Third, we compared our
calls to SNPs identified in the NOD/ShiLtJ strain from eighteen large insert
clones (BACs): 17.5 Mb was sequenced using standard methods to an
estimated accuracy of one error per 100,000 bp (13). SNPs were then called
from 1kb fragments that mapped back to the reference genome: 11.6 Mb of
sequence aligned uniquely onto the NCBI37 mouse reference (66% of the
total). From this we estimate that 1.4% of our calls are false positives and
1.4% are false negatives.
We identified far fewer indels than SNPs (20M compared to 148M) and
with lower confidence. We relied for validation on comparison with the
NOD/ShiLtJ BAC sequences (the Perlegen arrays do not detect indels) and
estimated false positive and negative rates to be 1.4% and 35.9%
respectively.
The use of inbred strains for genetic mapping experiments makes the
absence of a variant just as important as its presence. Therefore we
determined to what extent we could call every position in the genome. We
start by considering those regions that map to the reference (we consider
strain specific sequences below) and the limitations imposed by restricting our
attention to “callable” regions of the genomes. We then describe approaches
to overcome that restriction.
Distribution of variants across the genome
Calling missing genotypes
The false negative rates quoted above do not take into account
variants that lie in regions of low mappability (where the mapping quality score
is < 40). Between 11.2% and 16.1% of each strain’s genome is excluded from
analysis by this filter. Thus 2% of the Perlegen SNPs, 2.3% of dbSNP entries
and 8.9% of the NOD/ShiLtJ BAC sequence lie in our uncallable regions.
Uncallable regions tend to be more divergent than the genome average: BAC
regions from the NOD/ShiLtJ strain that correspond to uncallable sequence
are xx% more likely to contain variant SNPs.
The pattern of missing data in our dataset is not random: if it were, we
would expect XX% of genotypes to be incomplete across the eight
HS strains, whereas we find YY%. Uncallable regions of the genome largely
consist of repeated sequence, present in the same location in each strain,
rendering these regions inaccessible to the current sequencing methodology.
Stochastic variation in coverage and read quality did not significantly affect
our ability to accurately call variants.
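The expectation under random missingness can be made concrete with a short sketch. It assumes, purely for illustration, that each strain's uncallable positions fall independently of every other strain's, so that the fraction of sites lacking a genotype in at least one strain is one minus the product of the per-strain callable fractions. The per-strain values below are hypothetical placeholders chosen within the 11.2-16.1% range quoted above; they are not measured values.

```python
from math import prod


def expected_incomplete_fraction(uncallable_fractions):
    """Fraction of sites expected to lack a genotype in at least one strain.

    Assumes missingness is independent between strains (an illustrative
    assumption, not a property of the real data).
    """
    return 1.0 - prod(1.0 - f for f in uncallable_fractions)


# Hypothetical per-strain uncallable fractions for eight strains.
example_fractions = [0.112, 0.120, 0.130, 0.135, 0.140, 0.150, 0.155, 0.161]
print(f"{expected_incomplete_fraction(example_fractions):.3f}")
```

If uncallable regions are instead largely shared between strains, as is the case for repeats present at the same location, the observed fraction will fall well below this independent expectation.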
We took two approaches to complete the sequence. First, using
comparisons with the BAC sequence from the NOD/ShiLtJ strain, we were
able to assess the error rates under less stringent mapping parameters by
reducing the mapping quality score threshold below 40. We found that with
evidence in favour of the reference genotype, even dropping the quality score
from 40 to 10 incurred an error rate of only 0.005X%. This is not true when
reads support a variant: a quality score of 10 would then result in a much
higher error rate (XX%). Thus by selectively setting thresholds of XX10 we
could fill in YY (XX%) missing SNP calls. Second, we used imputation to call
missing genotypes. This approach is effective when applied to classical strains
because of their derivation from a limited number of founders. By treating
each strain as a mosaic of the founder strains it is possible to impute missing
genotypes by using data from strains that have the same haplotype.
Imputation identified XX missing genotypes (Table 1).
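The idea of borrowing a genotype from strains that share the same local haplotype can be sketched as follows. This is a deliberately simplified illustration and not the imputation method used in this study: it assumes a fixed window of flanking sites, requires exact matches at every flanking site called in both strains, and imputes only when all matching strains agree. The function name, the window size, the `min_shared` threshold and the strain labels are all hypothetical.

```python
def impute_site(genotypes, target, site, window=2, min_shared=2):
    """Impute a missing genotype from strains sharing the local haplotype.

    genotypes: dict mapping strain name -> list of alleles (None = missing).
    Returns the imputed allele, or None if no unambiguous donor exists.
    """
    tgt = genotypes[target]
    lo, hi = max(0, site - window), min(len(tgt), site + window + 1)
    # Flanking sites where the target strain itself has a call.
    flank = [i for i in range(lo, hi) if i != site and tgt[i] is not None]
    alleles = set()
    for strain, calls in genotypes.items():
        if strain == target or calls[site] is None:
            continue
        shared = [i for i in flank if calls[i] is not None]
        # A donor must match the target at every shared flanking site.
        if len(shared) >= min_shared and all(calls[i] == tgt[i] for i in shared):
            alleles.add(calls[site])
    return alleles.pop() if len(alleles) == 1 else None


# Hypothetical toy data: strainA is missing the call at index 2.
toy = {
    "strainA": ["A", "C", None, "G", "T"],
    "strainB": ["A", "C", "T", "G", "T"],
    "strainC": ["G", "C", "A", "G", "C"],
}
print(impute_site(toy, "strainA", 2))
```

Here only strainB matches strainA at all flanking sites, so its allele is used; when donors with the same flanking haplotype disagree, the sketch declines to impute.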
Heterozygote variant calls
All of the mouse strains we sequenced as part of this project have been
maintained by inbreeding to limit genetic drift, and are homozygous at every
locus with the exception of sites in the genome where de novo mutations have
occurred and have not yet been fixed. Supplementary Table x and
Supplementary Figure Y show the location and number of heterozygous
base calls. There was a significant enrichment of heterozygous calls in
regions of copy number gain (regions that are duplicated in one strain and
collapsed back on the reference) and in uncallable regions of the genome,
representing 12% and 21% of heterozygous calls respectively. We genotyped
167 putative heterozygous variants that did not fall into these regions; none
was confirmed, indicating that most if not all of these calls are false positives.
Differences in the patterns of variation between mouse strains
The major difference observed in single nucleotide variation between the
strains was in the number of variants found in the laboratory strains of mice
relative to the wild-derived strains (Table 1). On average we
observed between 4 and 5 million SNPs for the laboratory strains and 6.9 million,
19.4 million, 20.2 million and 40.1 million SNPs for the wild-derived strains
WSB/EiJ, PWK/PhJ, CAST/EiJ and SPRET/EiJ, respectively. As expected,
variants found in the laboratory strains were distributed in a block-like pattern
across the genome, reflecting the haplotype structure of the laboratory mouse
(Supplementary Figure x). Fewer SNPs were called from C57BL/6NJ, from
which we called only 0.8% as many SNPs as from the other laboratory strains.
11.1% of these (4146) were found in all strains and are therefore likely to be
reference errors. In keeping with the expected molecular spectrum of mutations,
G:C→A:T and A:T→G:C transitions predominated (Supplementary Figure x).
The laboratory strains of mice carried few private SNPs, representing around
5.6% of all SNPs called in each strain (Supplementary Figure x). Some of
the private variation in the laboratory strains clustered into discrete regions,
suggesting that a strain had acquired a block of sequence not represented in
the other strains, whereas most of the variants were distributed genome-wide,
suggesting de novo evolution of these variants since the divergence of the
strains (Supplementary Figure x).
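The notion of a private SNP used above can be illustrated with a small sketch. It assumes that a SNP is private when exactly one strain in the panel carries it, and that each SNP is identified by chromosome, position and alternate allele; the function name and the toy data are hypothetical and are not drawn from the study.

```python
from collections import Counter


def private_snp_fraction(snps_by_strain):
    """Per-strain fraction of SNPs carried by no other strain in the panel.

    snps_by_strain: dict mapping strain -> set of (chrom, pos, alt) tuples.
    """
    carriers = Counter(snp for snps in snps_by_strain.values() for snp in snps)
    fractions = {}
    for strain, snps in snps_by_strain.items():
        private = sum(1 for snp in snps if carriers[snp] == 1)
        fractions[strain] = private / len(snps) if snps else 0.0
    return fractions


# Hypothetical toy panel of two strains.
panel = {
    "strainA": {("1", 100, "A"), ("1", 200, "T")},
    "strainB": {("1", 100, "A")},
}
print(private_snp_fraction(panel))
```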
Strain specific sequence
The second category of sequence that is hard to access with the method we
have used lies in regions that are not present in the C57BL/6J reference
genome. These will appear as SV insertions in the genome, described in our
accompanying paper. To analyze the content of this sequence we took read
pairs that could not be mapped and assembled these into contigs (Figure 1).
Overall we identified XX Mb of sequence in this way. Unsurprisingly, more is
found in the wild-derived strains than in the classical laboratory strains, but
the distribution is not uniform even among the latter. Analysis of these
sequences revealed that much of it mapped to mouse sequence found in databases
but not present in the reference genome, representing haplotypes from other
strains, and some mapped to rabbit and rat. Further analysis of these
sequences revealed that they were largely composed of x, y, z.
Type of variation across 17 mouse genomes
Introduce the Circos plots etc. table bringing together all the TE, SVs
Phylogenetic analysis
One of the unanswered questions in mouse genetics is the
phylogenetic history of the laboratory mouse. While attempts to assess this
have been made using genotyping data, here we used the complete
autosomal sequences of M. m. musculus (PWK/PhJ), M. m. domesticus
(WSB/EiJ), M. m. castaneus (CAST/EiJ), and M. spretus (SPRET/EiJ), which
we aligned to the rat genome as an outgroup, and conducted a Bayesian
concordance analysis (14). Individual phylogenies at most (94%) of the
43,255 loci were inferred with high statistical confidence (posterior probability
> 0.8), suggesting only a limited contribution of phylogenetic error to genomic
patterns. We observed substantial phylogenetic discordance among M. m.
musculus, M. m. domesticus, and M. m. castaneus (Figure XX). A plurality of
loci supported a M. m. musculus/M. m. castaneus primary subspecies history
(concordance factor (CF) = 37.9%; 95% credibility interval (CI) = 37.8-38.0%)
and two co-minor histories were supported by equal numbers of loci (CF =
30.3%; 95% CI = 30.2-30.4%; and CF = 30.2%; 95% CI = 30.1-30.3%),
consistent with theoretical models of incomplete lineage sorting (15-17).
Phylogenetic switching occurred over a short physical scale and median locus
sizes paralleled the three phylogenetic histories (primary history: 40,975 bp;
co-minor histories: 33,626 bp and 33,412 bp). We also found evidence of
phylogenetic discordance involving M. spretus: 12.1% of loci did not place
this species as the outgroup to a M. musculus subspecies clade. The X
chromosome showed phylogenetic discordance, but to a lesser extent than
the autosomes. The percentage of X-linked loci that did not position M.
spretus as the outgroup was reduced to 7.9%.
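For orientation, the standard three-taxon coalescent result behind the incomplete lineage sorting models cited above gives the probability that a gene tree matches the species tree as 1 - (2/3)exp(-T), where T is the internal branch length in coalescent units, with each of the two discordant topologies expected at (1/3)exp(-T). The sketch below inverts this relationship for the primary concordance factor. It is a rough illustration only and not part of the analysis reported here: it assumes a pure incomplete lineage sorting model with no introgression, and it ignores the loci showing discordance involving M. spretus.

```python
from math import exp, log


def internal_branch_length(cf_primary):
    """Internal branch length T (coalescent units) implied by a primary CF.

    Inverts CF = 1 - (2/3) * exp(-T); valid only for 1/3 < CF < 1.
    """
    if not (1.0 / 3.0 < cf_primary < 1.0):
        raise ValueError("primary concordance factor must lie in (1/3, 1)")
    return -log(1.5 * (1.0 - cf_primary))


def expected_minor_cf(t):
    """Expected frequency of each discordant topology for branch length t."""
    return exp(-t) / 3.0


t = internal_branch_length(0.379)  # primary CF reported above
print(f"T = {t:.3f}, expected minor CF = {expected_minor_cf(t):.4f}")
```

Under these assumptions a primary concordance factor of 37.9% corresponds to a very short internal branch, of roughly 0.07 coalescent units, which fits the picture of closely spaced subspecies divergences.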
The functional impact of nucleotide variation
Transcriptome
We generated transcriptome sequence data for 15 of the 17 mouse strains
and used these data to ask if there are novel strain specific genes. Table X
shows a breakdown of these genes by strain. Over-represented GO terms for
these genes are also provided in Table X. Intriguingly X of these genes have
conserved orthologs in other organisms.
Novel genes in strains and genes missing in strains
Allele specific gene expression – transcriptomes and ChIP
Phenotypes – known Mendelians (unknown Mendelians?)
Numerous genetic and sequence differences are already known between
laboratory mouse strains; these can be used to validate the strain sequences,
and can be extended using these data. The most obvious phenotypic
difference between strains is coat colour. Four strains (A/J, AKR/J, BALB/cJ
and NOD/ShiLtJ) are albino (Tyrc) and contain the same G-C transversion in
the tyrosinase gene at 7:94641553 (Jackson and Bennett). Unsurprisingly the
four strains also share over 100 SNPs within the Tyr gene, different from the
reference sequence, reflecting the common origin of the mutation, but there
are also numerous differences between them, indicating accumulated
variation in the time since the original mutation event. Three strains are
homozygous for the brown mutation of Tyrp1 (A/J, BALB/cJ and DBA/2J). All
contain the causative mutation, G-A at 4:80481306, and the known linked
second nonsynonymous coding SNP (Zdarsky et al.). Interestingly this SNP is
also found in WSB/EiJ, along with many other shared SNPs, few of which are
seen in other strains, indicating an M. m. domesticus origin of the brown
mutant chromosome. We can extend known mutations to additional strains with
these data. The DNA polymerase iota (Poli) gene in all 129 strains contains a
premature stop codon which ablates function (McDonald et al. 2003). The
strain sequence data identify this same mutation, previously undescribed, in
LP/J mice.
Quantitative phenotypes – our unstoppable bid to find quantitative trait
nucleotides….
Discussion

We recovered a primary phylogenetic history placing M. m. musculus
and M. m. castaneus as sister subspecies, as supported by an earlier
study using high-density SNP sequence data (White et al. 2009). In
contrast to the earlier SNP data set, the complete sequence data
revealed co-minor histories supported by equal proportions of the
autosomes, as predicted by theoretical models of incomplete lineage
sorting (Pamilo and Nei 1988, Maddison 1997, Rosenberg 2002, Baum
2007). This difference between the two studies was probably caused
by ascertainment bias against M. m. castaneus in the Perlegen SNP
data set (Yang et al. 2007).

We found phylogenetic discordance involving M. spretus, despite the
longer divergence time separating this species from the M. musculus
subspecies clade. This discordance might reflect incomplete lineage
sorting, introgression, or both. This result suggests caution when using
M. spretus as an outgroup in phylogenetic analyses of M. musculus
subspecies.

Phylogenetic switching occurred over a short physical scale. As
predicted by theory (Slatkin and Pollack 2006), locus lengths roughly
corresponded to the scale of linkage disequilibrium in current house
mouse populations (Laurie et al. 2007).

Widespread phylogenetic discordance among the three house mouse
subspecies challenges the assignment of subspecific ancestry across
the genomes of the classical inbred laboratory strains. This task will
require genome sequences from population samples of the three
house mouse subspecies.
Methods
Sequencing
Strain Selection: We selected the 17 most widely used mouse strains for
sequencing, including the progenitors of the Heterogeneous Stock (HS) (A/J,
AKR/J, BALB/cJ, CBA/J, C3H/HeJ, DBA/2J and LP/J) and of the Collaborative
Cross (A/J, 129S1/SvImJ, NOD/ShiLtJ, NZO/HiLtJ, WSB/EiJ, CAST/EiJ and
PWK/PhJ). We also included C57BL/6NJ, which is the background on which the
Knockout Mouse Project (KOMP) and the European Conditional Mouse
Mutagenesis Program (EUCOMM) are generating targeted embryonic stem cells
for all genes in the mouse genome; three 129-derived strains (129S5SvEvBrd,
129P2/OlaHsd and 129S1/SvImJ), which are the backgrounds on which the vast
majority of mouse knockouts have been made; and several wild-derived strains
(CAST/EiJ, PWK/PhJ and SPRET/EiJ). All of the mice that were sequenced were
females from pedigreed colonies at the Jackson Laboratory, with the
exception of 129P2/OlaHsd and 129S5SvEvBrd, which came from pedigreed
colonies at MRC-Harwell and The Wellcome Trust Sanger Institute
respectively. A list of the pedigreed mouse IDs is provided in the
Supplementary Information.
Source of DNA for Sequencing and DNA library preparation: Liver tissue
was used to extract DNA for sequencing using standard procedures. Prior to
sequencing all DNA samples were genotyped with a genome-wide marker
panel to confirm they were from the desired strain. For each strain we
generated multiple “no-PCR” libraries for sequencing on the Illumina GAII
sequencing platform as described previously (18). A breakdown of the
sequencing metrics for each strain is provided in Supplementary Table X.
Each lane of sequence was re-genotyped prior to downstream analysis.
Generation and sequencing of NOD/ShiLtJ bacterial artificial
chromosomes (BAC): We constructed NOD/ShiLtJ BAC clone libraries as
described previously (19). Eighteen BACs from these libraries were shotgun and
capillary sequenced, generating contigs with a total length of 17.6 Mb.
Accession numbers for these contigs are provided in the Supplementary
Information. These BACs were finished to a predicted error rate of 1 error
per 100,000 bases (13).
Transcriptome libraries and sequencing: RNA was extracted using Trizol from
the whole brain of the sequenced mouse and of a female sibling at eight
weeks of age. RNA of RIN >8 was then used to generate transcriptome
libraries, which were sequenced on the Illumina platform, generating around X
million 76 bp paired-end reads for each strain (Supplementary Table
X). Each lane of sequence was re-genotyped prior to downstream analysis.
For allele specific gene expression analysis C57BL/6J and DBA/2J mice were
intercrossed and livers from female F1 mice collected for RNA extraction and
sequencing as described above.
Computational Methods
Sequence Genotype Check
We checked the genotype of each lane of sequence by comparing the
concordance with existing whole-genome genotypes from the Perlegen set.
Sequence Alignment
The reads from each lane were aligned to the NCBI37 mouse reference
sequence using MAQ v0.7.1-6 (12). A BAM file was produced for each lane,
then the lanes for each library were merged into a library BAM, library PCR
duplicates were removed with samtools (http://samtools.sourceforge.net), and
then the library BAMs were merged to produce a single BAM file per strain (Li
et al., 2009).
SNP Calling
SNPs were called from the individual strain BAM files with four different SNP
callers: samtools varFilter (20), Genome Analysis Toolkit (21), iMR
(Xiangchao and Mott, unpublished) and QCALL (ref). The parameters used for
each caller are in Supplementary Table X. We defined uncallable regions as
those positions where the average mapping quality at the position was less
than Q40 (12) or the sequencing depth was higher than 200. The final set of
SNPs consisted of those called by two or more callers at positions that
passed the uncallable filter.
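The filtering logic described above can be sketched in a few lines. The thresholds (Q40, depth 200, two callers) come from the text; everything else is an assumption made for illustration, including the data layout, the function names, the caller labels and the requirement that callers agree on the alternate allele before their calls count as the same SNP.

```python
from collections import Counter

MIN_MAPQ = 40      # positions with mean mapping quality below this are uncallable
MAX_DEPTH = 200    # positions with depth above this are uncallable
MIN_CALLERS = 2    # a SNP must be reported by at least this many callers


def is_callable(mean_mapq, depth):
    """Apply the uncallable filter to a single position."""
    return mean_mapq >= MIN_MAPQ and depth <= MAX_DEPTH


def consensus_snps(calls_by_caller, site_stats):
    """Keep SNPs supported by enough callers at callable positions.

    calls_by_caller: dict caller -> set of (chrom, pos, alt) tuples.
    site_stats: dict (chrom, pos) -> (mean_mapq, depth).
    """
    support = Counter(snp for calls in calls_by_caller.values() for snp in calls)
    kept = set()
    for snp, n_callers in support.items():
        stats = site_stats.get(snp[:2])
        if n_callers >= MIN_CALLERS and stats is not None and is_callable(*stats):
            kept.add(snp)
    return kept


# Hypothetical toy input with three callers.
calls = {
    "callerA": {("1", 100, "T"), ("1", 200, "G")},
    "callerB": {("1", 100, "T"), ("1", 300, "C")},
    "callerC": {("1", 300, "C")},
}
stats = {("1", 100): (60.0, 30), ("1", 200): (60.0, 30), ("1", 300): (20.0, 30)}
print(consensus_snps(calls, stats))
```

In the toy input only the SNP at position 100 survives: the call at 200 has a single supporting caller, and the call at 300 lies at a position with low mapping quality.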
Short Indel Calling
Short indels were called using two different algorithms: Dindel (Albers et al.,
2010) and iMR (Xiangchao and Mott, unpublished). The final indel calls were the
union of the candidate calls.
Variant Calling from finished NOD/ShiLtJ BACs
Finished NOD/ShiLtJ sequence was fragmented into 1 kb sequences, which were
mapped back against the reference genome assembly using BWA700. SNP
and short indel differences were called from the 1 kb fragments that aligned
uniquely onto the NCBI37 mouse reference. Taking these calls as a gold
standard, we measured the false positive and false negative rates of the SNP
and short indel calls.
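The comparison against the BAC-derived gold standard can be sketched as below. The definitions are assumptions made for illustration: the false positive rate is taken as the fraction of our calls absent from the gold standard, the false negative rate as the fraction of gold-standard variants we did not call, and both sets are assumed to be restricted to the same uniquely aligned regions. The fragmenting helper assumes non-overlapping 1 kb pieces, which the text does not specify. All function names are hypothetical.

```python
def fragment(sequence, size=1000):
    """Split a finished sequence into consecutive non-overlapping pieces."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def error_rates(called, gold):
    """Return (false_positive_rate, false_negative_rate).

    called, gold: sets of (chrom, pos, alt) tuples over the same regions.
    """
    false_pos = len(called - gold)
    false_neg = len(gold - called)
    fp_rate = false_pos / len(called) if called else 0.0
    fn_rate = false_neg / len(gold) if gold else 0.0
    return fp_rate, fn_rate


# Hypothetical toy sets: one spurious call and two missed variants.
called = {("1", 10, "A"), ("1", 20, "C"), ("1", 30, "G")}
gold = {("1", 20, "C"), ("1", 30, "G"), ("1", 40, "T"), ("1", 50, "A")}
print(error_rates(called, gold))
```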
Structural variant Calling
A detailed summary of structural variant and transposon element calling is
provided in accompanying papers.
Variant Imputation
NF
De novo assembly of novel sequence
The reads for each strain were mapped to the m37 reference genome
assembly using MAQ (0.7.1-6), and PCR duplicates were removed.
Unmapped read pairs (where neither the read nor its mate pair mapped to the
reference) were extracted from the BAM files and assemblies were built using
serial ABySS (1.1.1) with a k-mer size of 31 to produce unmapped contigs
for each strain. These contigs were matched against x, y and z using XX.
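Selecting read pairs in which neither end mapped can be expressed with the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The sketch below works on SAM-formatted text lines and is an illustration only; it assumes the alignment files set these bits according to the SAM specification, and the function names and toy records are hypothetical.

```python
READ_PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def both_ends_unmapped(flag):
    """True when the read is paired and neither it nor its mate mapped."""
    required = READ_PAIRED | READ_UNMAPPED | MATE_UNMAPPED
    return flag & required == required


def unmapped_pair_names(sam_lines):
    """Collect names of read pairs with both ends unmapped."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if both_ends_unmapped(int(fields[1])):
            names.add(fields[0])
    return names


# Hypothetical SAM records: pair1 is fully unmapped, pair2 is properly mapped.
records = [
    "@HD\tVN:1.0",
    "pair1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "pair1\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "pair2\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
]
print(unmapped_pair_names(records))
```

Pairs in which only one end is unmapped are deliberately excluded, matching the definition given above.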
Transcriptome Analysis with TopHat and Cufflinks
Each BAM file was converted to two FASTQ files (one per sequenced end)
and TopHat was deployed to map data from each library to the genome
including splice sites annotated in Ensembl and UCSC gene structures,
known mRNAs, and expressed sequence tags, and to exhaustively search for
novel splice sites. Once splice sites were defined at the library level TopHat
was re-run combining all splice sites from all libraries across all of the strains
so that reads from all of the libraries had the opportunity to map over the
same splice sites. Cufflinks was then used to quantify expression of all
Ensembl genes/transcripts across all libraries and gene models were
generated for each strain. Cuffcompare was then used to build a set of
consensus transcripts, which were used to quantify expression of every
RNA-seq-based transcript in each strain. Cuffdiff was then used to identify
significant differences in transcript isoform abundance, expression at the gene
level, TSS usage, and coding sequence (Ensembl models only).
Phylogenetic Analysis
The NCBI build 37 mouse genome sequence was aligned to version 3.4 of the
rat genome using Mercator to build a one-to-one collinear orthology map and
using MAVID to produce nucleotide-level alignments on the collinear blocks
(Dewey 2007). Consensus sequences of CAST/EiJ, PWK/PhJ, WSB/EiJ, and
SPRET/EiJ were mapped to the alignment and gaps were filled with N’s.
Collinear blocks were partitioned into 43,255 loci using a minimum description
length algorithm with a maximum cost of 4.8 (Ané and Sanderson 2005).
Following the procedure in White et al. (2009), a separate Bayesian
phylogenetic analysis was conducted for each locus and the posterior
distribution of concordance factors across loci was estimated using Bayesian
Concordance Analysis (Ané et al. 2007). The prior distribution of gene tree
concordance was set at =1.
Using an extreme prior of complete
independence among loci (=infinity) altered the concordance factors slightly,
but did not change their rank orders.
References
1. Paigen, K. (2003) One hundred years of mouse genetics: an intellectual
history. II. The molecular revolution (1981-2002). Genetics, 163, 1227-1235.
2. Paigen, K. (2003) One hundred years of mouse genetics: an intellectual
history. I. The classical period (1902-1980). Genetics, 163, 1-7.
3. Wade, C.M. and Daly, M.J. (2005) Genetic variation in laboratory mice.
Nat Genet, 37, 1175-1180.
4. Beck, J.A., Lloyd, S., Hafezparast, M., Lennon-Pierce, M., Eppig, J.T.,
Festing, M.F. and Fisher, E.M. (2000) Genealogies of mouse inbred strains.
Nat Genet, 24, 23-25.
5. Atchley, W.R. and Fitch, W. (1993) Genetic affinities of inbred mouse
strains of uncertain origin. Mol Biol Evol, 10, 1150-1169.
6. Atchley, W.R. and Fitch, W.M. (1991) Gene trees and the origins of
inbred strains of mice. Science, 254, 554-558.
7. Churchill, G.A., Airey, D.C., Allayee, H., Angel, J.M., Attie, A.D.,
Beatty, J., Beavis, W.D., Belknap, J.K., Bennett, B., Berrettini, W. et al.
(2004) The Collaborative Cross, a community resource for the genetic
analysis of complex traits. Nat Genet, 36, 1133-1137.
8. Zheng, Z., Schmidt-Ott, K.M., Chua, S., Foster, K.A., Frankel, R.Z.,
Pavlidis, P., Barasch, J., D'Agati, V.D. and Gharavi, A.G. (2005) A
Mendelian locus on chromosome 16 determines susceptibility to doxorubicin
nephropathy in the mouse. Proc Natl Acad Sci U S A, 102, 2502-2507.
9. Frazer, K.A., Eskin, E., Kang, H.M., Bogue, M.A., Hinds, D.A., Beilharz,
E.J., Gupta, R.V., Montgomery, J., Morenzoni, M.M., Nilsen, G.B. et al.
(2007) A sequence-based variation map of 8.27 million SNPs in inbred mouse
strains. Nature, 448, 1050-1053.
10. Cunningham, F., Rios, D., Griffiths, M., Smith, J., Ning, Z., Cox, T.,
Flicek, P., Marin-Garcia, P., Herrero, J., Rogers, J. et al. (2006)
TranscriptSNPView: a genome-wide catalog of mouse coding variation.
Nat Genet, 38, 853.
11. Valdar, W., Solberg, L.C., Gauguier, D., Burnett, S., Klenerman, P.,
Cookson, W.O., Taylor, M.S., Rawlins, J.N., Mott, R. and Flint, J. (2006)
Genome-wide genetic association of complex traits in heterogeneous stock
mice. Nat Genet, 38, 879-887.
12. Li, H., Ruan, J. and Durbin, R. (2008) Mapping short DNA sequencing
reads and calling variants using mapping quality scores. Genome Res, 18,
1851-1858.
13. Chain, P.S., Grafham, D.V., Fulton, R.S., Fitzgerald, M.G., Hostetler,
J., Muzny, D., Ali, J., Birren, B., Bruce, D.C., Buhay, C. et al. (2009)
Genomics. Genome project standards in a new era of sequencing. Science,
326, 236-237.
14. Ane, C., Larget, B., Baum, D.A., Smith, S.D. and Rokas, A. (2007)
Bayesian estimation of concordance among gene trees. Mol Biol Evol, 24,
412-426.
15. Pamilo, P. and Nei, M. (1988) Relationships between gene trees and
species trees. Mol Biol Evol, 5, 568-583.
16. Maddison, D.R., Swofford, D.L. and Maddison, W.P. (1997) NEXUS: an
extensible file format for systematic information. Syst Biol, 46, 590-621.
17. Rosenberg, N.A. (2002) The probability of topological concordance of
gene trees and species trees. Theor Popul Biol, 61, 225-247.
18. Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P.,
Milton, J., Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell,
H.R. et al. (2008) Accurate whole human genome sequencing using reversible
terminator chemistry. Nature, 456, 53-59.
19. Steward, C.A., Humphray, S., Plumb, B., Jones, M.C., Quail, M.A.,
Rice, S., Cox, T., Davies, R., Bonfield, J., Keane, T.M. et al. (2010)
Genome-wide end-sequenced BAC resources for the NOD/MrkTac and NOD/ShiLtJ
mouse genomes. Genomics, 95, 105-110.
20. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,
Marth, G., Abecasis, G. and Durbin, R. (2009) The Sequence Alignment/Map
format and SAMtools. Bioinformatics, 25, 2078-2079.
21. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K.,
Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M. et al.
(2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing
next-generation DNA sequencing data. Genome Res, 20, 1297-1303.
22. Ning, Z., Cox, A.J. and Mullikin, J.C. (2001) SSAHA: a fast search
method for large DNA databases. Genome Res, 11, 1725-1729.