The functional impact of sequence variation on phenotypes in 17 mouse genomes.
The sequence and analysis of 17 laboratory and wild-derived mouse genomes

Abstract

Introduction

Genetic analysis of the laboratory mouse plays a key role in our understanding of mammalian biology and of human disease. Indeed, amongst eukaryotes, inbred strains of laboratory mice provide a unique resource of genetically uniform stocks to interpret the relationship between sequence and phenotypic variation. Genetic and phenotypic diversity is remarkably large, due in part to the origin of classical inbred strains from different subspecies (1,2), and in part to homozygous genomes unmasking recessive mutations. The availability of inbred strains derived from wild-caught mice has provided access to even more genetic variability (3). Furthermore, in addition to the 300 or so inbred strains (4-6), mouse geneticists have developed, and use, strain hybrids. When these are inbred they provide a renewable source of identical allelic combinations, essential for investigating traits with low heritability or the effects of multiple environments (7).

The naturally occurring sequence variation that in part underlies phenotypic differences between strains has been identified through three complementary methods: first, a relatively small number of variants with known functional consequences have been found through mapping and cloning of classical Mendelian mutations in mice (8); second, a larger number of variants have been identified by mapping and sequencing of loci that contribute to quantitative variation, but the attribution of function to any of them is unclear; third, catalogues of sequence variants have been created (9,10). Databases currently hold XX million single nucleotide polymorphisms (SNPs), YY short insertions and deletions (indels) and 10,938 structural variants (SVs).
However, it is not known to what extent these catalogues are complete, nor how many of the variants have functional consequences, nor how they might exert their effect. Only the complete genome sequence of inbred strains will provide the raw material necessary to address these questions and to begin dissecting the path from sequence to phenotypic variant.

New technologies have brought within our grasp the ability to sequence individual mammalian genomes. In this paper we apply these technologies to sequence seventeen mouse genomes, chosen to address a number of aims: first, to assess the extent of sequence diversity between strains and to apportion that diversity between different classes of sequence variant; second, to compare sequence diversity present within the classical inbred strains to that in wild-derived mice; third, to investigate the functional consequences of sequence variation. Specifically for the latter we asked whether we could find sequence variants contributing to quantitative variation at more than 800 quantitative trait loci (QTLs) mapped at high resolution in mice descended from eight classical inbred strains (11); in addition we set out to determine the sequence variants that contribute to variation in gene expression and whether these might also explain phenotypic variation.

Results

Data generation and variant discovery

We generated 738 Gb of raw sequence from the following classical laboratory mouse inbred strains: C3H/HeJ, CBA/J, A/J, AKR/J, DBA/2J, LP/J, BALB/cJ, NZO/HiLtJ and NOD/ShiLtJ, and 238 Gb from the wild-derived strains CAST/EiJ, SPRET/EiJ, WSB/EiJ and PWK/PhJ. In addition we generated 261 Gb of sequence from three 129-derived strains (129S5SvEvBrd, 129P2/OlaHsd and 129S1/SvImJ) and xx Gb from C57BL/6NJ. This provided an average of 25x sequence coverage across these 17 genomes; reads were mostly 54bp and paired, and fragments were between 150 and 600bp in length (Table 1).
Sequence variants were defined as differences from the reference strain (C57BL/6J; mm9/m37). Overall we identified 148.3 million single nucleotide polymorphisms (SNPs) at 65.2 million unique sites, 21.7 million insertions/deletions (indels) at 8.8 million unique sites and ZZ structural variants (SVs) including YY transposon insertions. We provide an overview of all variants called in this study and discuss what we found and missed for two categories of variant: SNPs and indels. Analysis of SVs and mobile genetic elements is described in accompanying papers.

SNPs and short indels

As shown in Table 1 we managed to identify virtually all previously described variants, and also to generate a catalog of new variants of unprecedented resolution and accuracy. Results presented in the table include only those SNPs from reads whose average mapping quality was greater than 40 (12). We refer to these regions of the genome as “callable” (Supplementary Information).

We established the accuracy and sensitivity of the SNP calls in three ways. First, in comparison with dbSNP we found that 99% of genotypes common to both are identical. Of the 1% of genotype mismatches, approximately 60% are private to one study and are likely mistakes in dbSNP; the remainder were ambiguous sites with adjacent SNPs and indels. Second, we compared our calls to the mouse Perlegen dataset (9) in which nine of the strains we sequenced (129S1/SvImJ, A/J, AKR/J, BALB/cJ, C3H/HeJ, DBA/2J, NOD/ShiLtJ, WSB/EiJ, and CAST/EiJ) were genotyped using whole-genome hybridization arrays. SNPs that mapped to the same genomic location had identical genotypes in 99.8% of cases. Third, we compared our calls to SNPs identified in the NOD/ShiLtJ strain from eighteen large insert clones (BACs): 17.5 Mb was sequenced using standard methods to an estimated accuracy of one error per 100,000 bp (13).
SNPs were then called from 1kb fragments that mapped back to the reference genome: 11.6 Mb of sequence aligned uniquely onto the NCBI37 mouse reference (66% of the total). From this we estimate that 1.4% of our calls are false positives and 1.4% are false negatives.

We identified far fewer indels than SNPs (20M compared to 148M) and with lower confidence. We relied for validation on comparison with the NOD/ShiLtJ BAC sequences (the Perlegen data do not include indels) and estimated false positive and false negative rates to be 1.4% and 35.9%, respectively.

The use of inbred strains for genetic mapping experiments makes the absence of a variant just as important as its presence. Therefore we determined to what extent we could call every position in the genome. We start by considering those regions that map to the reference (we consider strain specific sequences below) and the limitations imposed by restricting our attention to “callable” regions of the genomes. We then describe approaches to overcome that restriction.

Distribution of variants across the genome

Calling missing genotypes

The false negative rates quoted above do not take into account variants that lie in regions of low mappability (where the mapping quality score is < 40). Between 11.2% and 16.1% of each strain’s genome is excluded from analysis using this filter. Thus 2% of the Perlegen SNPs and 2.3% of dbSNPs are in our uncallable regions, as is 8.9% of the NOD/ShiLtJ BAC sequence. Uncallable regions tend to be more divergent than the genome average. BAC regions from the NOD/ShiLtJ strain which correspond to uncallable sequence are xx% more likely to contain variant SNPs. The pattern of missing data in our dataset is not random: if it were, we would expect XX% of genotypes to be incomplete across the eight HS strains, whereas we find YY%.
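The comparison between expected and observed missingness can be sketched as follows. This is a minimal illustration, not the authors' calculation: it assumes that a genotype counts as "incomplete" when a site is uncallable in at least one strain, and that "random" means uncallable positions fall independently in each strain. The function names and the toy masks are hypothetical.

```python
from math import prod


def expected_incomplete_fraction(uncallable_rates):
    """Fraction of sites expected to be uncallable in at least one strain,
    ASSUMING uncallable positions are placed independently in each strain."""
    for p in uncallable_rates:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each rate must lie in [0, 1]")
    # P(at least one strain uncallable) = 1 - P(all strains callable)
    return 1.0 - prod(1.0 - p for p in uncallable_rates)


def observed_incomplete_fraction(callable_masks):
    """callable_masks: one equal-length list of booleans per strain,
    True where the site is callable in that strain."""
    n_sites = len(callable_masks[0])
    incomplete = sum(1 for site in zip(*callable_masks) if not all(site))
    return incomplete / n_sites


# Hypothetical toy data: two strains that are uncallable at the SAME sites,
# as would happen if repeats made identical regions inaccessible everywhere.
toy_masks = [
    [True, True, False, False],
    [True, True, False, False],
]
toy_rates = [mask.count(False) / len(mask) for mask in toy_masks]

expected = expected_incomplete_fraction(toy_rates)  # 0.75 under independence
observed = observed_incomplete_fraction(toy_masks)  # 0.5 because the gaps coincide
```

When uncallable regions coincide across strains, the observed fraction falls below the independence expectation, which is the kind of departure from randomness the paragraph above describes.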
Uncallable regions of the genome largely consist of repeated sequence, present in the same location in each strain, rendering these regions inaccessible to the current sequencing methodology. Stochastic variation in coverage and read quality did not significantly affect our ability to accurately call variants.

We took two approaches to complete the sequence. First, using comparisons with the BAC sequence from the NOD/ShiLtJ strain, we assessed the error rates of less stringent mapping parameters by reducing the mapping quality score threshold below 40. We found that with evidence in favour of the reference genotype, even dropping the quality score from 40 to 10 incurred an error rate of only 0.005X%. This is not true when reads support a variant, in which case a quality score of 10 would result in much higher error rates (XX%). Thus by selectively setting thresholds of XX10 we could fill in YY (XX%) missing SNP calls.

Second, we used imputation to call missing genotypes. This approach is effective when applied to classical strains because of their derivation from a limited number of founders. By treating each strain as a mosaic of the founder strains it is possible to impute missing genotypes by using data from strains that have the same haplotype. Imputation identified XX missing genotypes (Table 1).

Heterozygote variant calls

All of the mouse strains we sequenced as part of this project have been maintained by inbreeding to limit genetic drift, and are homozygous at every locus with the exception of sites in the genome where de novo mutations have occurred and have not yet been fixed. Supplementary Table x and Supplementary Figure Y show the location and number of heterozygous base calls.
There was a significant enrichment of heterozygous calls in regions of copy number gain (regions that are duplicated in one strain and collapsed back onto the reference) and in uncallable regions of the genome, representing 12% and 21% of heterozygous calls, respectively. We genotyped 167 putative heterozygous variants that did not fall into these regions; none was confirmed, indicating that most if not all of these calls are false positives.

Differences in the patterns of variation between mouse strains

The major difference observed in the single nucleotide variation between the strains was in the number of variants found in the laboratory strains of mice relative to those strains that were wild-derived (Table 1). On average we observed between 4 and 5 million SNPs for the laboratory strains and 6.9 million, 19.4 million, 20.2 million and 40.1 million SNPs for the wild-derived strains WSB/EiJ, PWK/PhJ, CAST/EiJ and SPRET/EiJ, respectively. As expected, variants found in the laboratory strains were distributed in a block-like pattern across the genome, reflecting the haplotype structure of the mouse (Supplementary Figure x). Fewer SNPs were called from C57BL/6NJ, from which we called 0.8% as many SNPs as from the other laboratory strains. 11.1% of these (4146) were found in all strains and are therefore likely to be reference errors. In keeping with the expected molecular spectrum of mutations, G:C→A:T and A:T→G:C transitions predominated (Supplementary Figure x). The laboratory strains of mice carried few private SNPs, representing around 5.6% of all SNPs called in each strain (Supplementary Figure x). Some of the private variation in the laboratory strains clustered into discrete regions, suggesting that a strain had acquired a block of sequence not represented in the other strains, while most of the variants were distributed genome-wide, suggesting de novo evolution of these variants since the divergence of the strains (Supplementary Figure x).
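The notion of a private SNP used above can be made concrete with a short sketch. It assumes, as an illustration only, that a SNP is private to a strain when no other sequenced strain carries the same allele at the same position; the function name, the tuple layout and the strain labels are hypothetical and not taken from the study.

```python
from collections import Counter


def private_snp_fraction(snps_by_strain):
    """snps_by_strain: dict mapping strain name -> set of (chrom, pos, alt).

    Returns, for each strain, the fraction of its SNPs carried by no other
    strain in the input (the 'private' fraction)."""
    carriers = Counter()
    for sites in snps_by_strain.values():
        carriers.update(sites)  # each set contributes one count per SNP

    fractions = {}
    for strain, sites in snps_by_strain.items():
        if not sites:
            fractions[strain] = 0.0
            continue
        n_private = sum(1 for snp in sites if carriers[snp] == 1)
        fractions[strain] = n_private / len(sites)
    return fractions


# Hypothetical toy call sets for three made-up strains.
toy_calls = {
    "strainA": {("1", 100, "T"), ("1", 250, "G"), ("2", 40, "C"), ("2", 90, "A")},
    "strainB": {("1", 100, "T"), ("1", 250, "G")},
    "strainC": {("1", 100, "T"), ("3", 15, "A")},
}
toy_fractions = private_snp_fraction(toy_calls)
# strainA: 2 of 4 SNPs are private; strainB: none; strainC: 1 of 2.
```

Keying on the allele as well as the position means that two strains with different alternate alleles at one site are each counted as carrying a private variant; keying on position alone would be an equally defensible choice.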
Strain specific sequence

The second category of sequence that is hard to access with the method we have used lies in regions that are not present in the C57BL/6J reference genome. These will appear as SV insertions in the genome, described in our accompanying paper. To analyze the content of this sequence we took read pairs that could not be mapped and assembled these into contigs (Figure 1). Overall we identified XXMb of sequence in this way. Unsurprisingly, more is found in the wild-derived strains than in the classical laboratory strains, but the distribution is not uniform even among the latter. Analysis of these sequences revealed that much of it mapped to mouse sequence found in databases but not present in the reference genome, representing haplotypes from other strains, and some mapped to rabbit and rat. Further analysis of these sequences revealed that they were largely composed of x, y, z.

Type of variation across 17 mouse genomes

Introduce the Circos plots etc. table bringing together all the TE, SVs

Phylogenetic analysis

One of the unanswered questions in mouse genetics is the phylogenetic history of the laboratory mouse. While attempts to assess this have been made using genotyping data, here we used the complete autosomal sequences of M. m. musculus (PWK/PhJ), M. m. domesticus (WSB/EiJ), M. m. castaneus (CAST/EiJ), and M. spretus (SPRET/EiJ), which we aligned to the rat genome as an outgroup, and conducted a Bayesian concordance analysis (14). Individual phylogenies at most (94%) of the 43,255 loci were inferred with high statistical confidence (posterior probability > 0.8), suggesting only a limited contribution of phylogenetic error to genomic patterns. We observed substantial phylogenetic discordance among M. m. musculus, M. m. domesticus, and M. m. castaneus (Figure XX). A plurality of loci supported a M. m. musculus/M. m.
castaneus primary subspecies history (concordance factor (CF) = 37.9%; 95% credibility interval (CI) = 37.8-38.0%) and two co-minor histories were supported by equal numbers of loci (CF = 30.3%; 95% CI = 30.2-30.4%; and CF = 30.2%; 95% CI = 30.1-30.3%), consistent with theoretical models of incomplete lineage sorting (15-17). Phylogenetic switching occurred over a short physical scale and median locus sizes paralleled the three phylogenetic histories (primary history: 40,975 bp; co-minor histories: 33,626 bp and 33,412 bp). We also found evidence of phylogenetic discordance involving M. spretus: 12.1% of loci did not place this species as the outgroup to a M. musculus subspecies clade. The X chromosome showed phylogenetic discordance, but to a lesser extent than the autosomes. The percentage of X-linked loci that did not position M. spretus as the outgroup was reduced to 7.9%.

The functional impact of nucleotide variation

Transcriptome

We generated transcriptome sequence data for 15 of the 17 mouse strains and used these data to ask if there are novel strain specific genes. Table X shows a breakdown of these genes by strain. Over-represented GO terms for these genes are also provided in Table X. Intriguingly X of these genes have conserved orthologs in other organisms.

Novel genes in strains and genes missing in strains

Allele specific gene expression – transcriptomes and ChIP

Phenotypes – known Mendelians (unknown Mendelians?)

Numerous genetic and sequence differences are already known between laboratory mouse strains, which can be used to validate the strain sequences, and which can be extended using this data. The most obvious phenotypic difference between strains is coat colour. Four strains (A/J, AKR/J, BALB/cJ, and NOD/ShiLtJ) are albino (Tyrc) and contain the same G-C transversion in the tyrosinase gene at 7:94641553 (Jackson and Bennett).
Unsurprisingly, the four strains also share over 100 SNPs within the Tyr gene, different from the reference sequence, reflecting the common origin of the mutation, but there are also numerous differences between them, indicating accumulated variation in the time since the original mutation event. Three strains are homozygous for the brown mutation of Tyrp1 (A/J, BALB/cJ and DBA/2J). All contain the causative mutation, G-A at 4:80481306, and the known linked second nonsynonymous coding SNP (Zdarsky et al). Interestingly, this SNP is also found in WSB/EiJ, along with many other shared SNPs, few of which are seen in other strains, indicating an M. m. domesticus origin of the brown mutant chromosome.

We can extend known mutations to additional strains with this data. The DNA polymerase iota (Poli) gene in all 129 strains contains a premature stop codon which ablates function (McDonald et al 2003). The strain sequence data identify this same mutation, previously undescribed, in LP/J mice.

Quantitative phenotypes – our unstoppable bid to find quantitative trait nucleotides….

Discussion

We recovered a primary phylogenetic history placing M. m. musculus and M. m. castaneus as sister subspecies, as supported by an earlier study using high-density SNP sequence data (White et al. 2009). In contrast to the earlier SNP data set, the complete sequence data revealed co-minor histories supported by equal proportions of the autosomes, as predicted by theoretical models of incomplete lineage sorting (Pamilo and Nei 1988, Maddison 1997, Rosenberg 2002, Baum 2007). This difference between the two studies was probably caused by ascertainment bias against M. m. castaneus in the Perlegen SNP data set (Yang et al. 2007). We found phylogenetic discordance involving M. spretus, despite the longer divergence time separating this species from the M. musculus subspecies clade. This discordance might reflect incomplete lineage sorting, introgression, or both.
This result suggests caution when using M. spretus as an outgroup in phylogenetic analyses of M. musculus subspecies. Phylogenetic switching occurred over a short physical scale. As predicted by theory (Slatkin and Pollack 2006), locus lengths roughly corresponded to the scale of linkage disequilibrium in current house mouse populations (Laurie et al. 2007). Widespread phylogenetic discordance among the three house mouse subspecies challenges the assignment of subspecific ancestry across the genomes of the classical inbred laboratory strains. This task will require genome sequences from population samples of the three house mouse subspecies.

Methods

Sequencing

Strain Selection: We selected the 17 most widely used mouse strains for sequencing, including the Heterogeneous Stock (HS) (A/J, AKR/J, BALB/cJ, CBA/J, C3H/HeJ, DBA/2J and LP/J) and Collaborative Cross (A/J, 129S1/SvImJ, NOD/ShiLtJ, NZO/HiLtJ, WSB/EiJ, CAST/EiJ and PWK/PhJ) progenitors. We also included C57BL/6NJ, which is the background on which the Knockout Mouse Project (KOMP) and the European Conditional Mouse Mutagenesis Program (EUCOMM) are generating targeted embryonic stem cells for all genes in the mouse genome; three 129-derived strains (129S5SvEvBrd, 129P2/OlaHsd and 129S1/SvImJ), which are the backgrounds on which the vast majority of mouse knockouts have been made; and several wild-derived strains (CAST/EiJ, PWK/PhJ and SPRET/EiJ). All of the mice that were sequenced were females from pedigreed colonies from the Jackson Laboratory, with the exception of 129P2/OlaHsd and 129S5SvEvBrd, which came from pedigreed colonies from MRC-Harwell and The Wellcome Trust Sanger Institute, respectively. A list of the pedigreed mouse IDs is provided in the Supplementary Information.

Source of DNA for Sequencing and DNA library preparation: Liver tissue was used to extract DNA for sequencing using standard procedures.
Prior to sequencing all DNA samples were genotyped with a genome-wide marker panel to confirm they were from the desired strain. For each strain we generated multiple “no-PCR” libraries for sequencing on the Illumina GAII sequencing platform as described previously (18). A breakdown of the sequencing metrics for each strain is provided in Supplementary Table X. Each lane of sequence was re-genotyped prior to downstream analysis.

Generation and sequencing of NOD/ShiLtJ bacterial artificial chromosomes (BAC): We constructed a NOD/ShiLtJ BAC clone library as described previously (19). Eighteen BACs from this library were shotgun and capillary sequenced, generating contigs with a total length of 17.6 Mb. Accession numbers for these contigs are provided in the Supplementary Information. These BACs were finished to a predicted error rate of one error per 100,000 bases (13).

Transcriptome libraries and sequencing: RNA was extracted using Trizol from the whole brain of the sequenced mouse and a female sibling at eight weeks of age. RNA of RIN >8 was then used to generate transcriptome libraries, which were sequenced on the Illumina platform, generating around X million 76bp paired-end Illumina reads for each strain (Supplementary Table X). Each lane of sequence was re-genotyped prior to downstream analysis. For allele specific gene expression analysis, C57BL/6J and DBA/2J mice were intercrossed and livers from female F1 mice were collected for RNA extraction and sequencing as described above.

Computational Methods

Sequence Genotype Check

We checked the genotype of each lane of sequence by measuring its concordance with existing whole-genome genotypes from the Perlegen set.

Sequence Alignment

The reads from each lane were aligned to the NCBI37 mouse reference sequence using MAQ v0.7.1-6 (12).
A BAM file was produced for each lane, the lanes for each library were merged into a library BAM, library PCR duplicates were removed with samtools (http://samtools.sourceforge.net), and the library BAMs were then merged to produce a single BAM file per strain (Li et al, 2009).

SNP Calling

SNPs were called from the individual strain BAM files with four different SNP callers: samtools varFilter (20), Genome Analysis Toolkit (21), iMR (Xiangchao and Mott, unpublished) and QCALL (ref). The parameters used for each caller are in Supplementary Table X. We defined uncallable regions as those positions where the average mapping quality at the position was less than Q40 (12) or the sequencing depth was higher than 200. The final set of SNPs consisted of SNPs at positions that passed the uncallable filter and that were called by two or more callers.

Short Indel Calling

Short indels were called using two different algorithms: Dindel (Albers et al, 2010) and iMR (Xiangchao and Mott, unpublished). The final indel calls were the union of the candidate calls.

Variant Calling from finished NOD/ShiLtJ BACs

Finished NOD/ShiLtJ sequence was fragmented into 1kb sequences, which were mapped back against the reference genome assembly using BWA700. SNP and short indel differences were called from the 1kb fragments that aligned uniquely onto the NCBI37 mouse reference. Taking these calls as a gold standard, we measured the false positive and false negative rates of the SNP and short indel calls.

Structural variant Calling

A detailed summary of structural variant and transposon element calling is provided in accompanying papers.

Variant Imputation

NF

De novo assembly of novel sequence

The reads for each strain were mapped to the m37 reference genome assembly using MAQ (0.7.1-6), and PCR duplicates were removed.
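Returning to the SNP Calling step described above, the filtering rule (callable position, and support from at least two callers) can be sketched as follows. This is an illustrative sketch rather than the pipeline's actual code: the function names, the input layout and the caller labels are hypothetical, and only the thresholds (Q40, depth 200, two callers) come from the text.

```python
from collections import Counter

MIN_MEAN_MAPQ = 40   # positions with average mapping quality below Q40 are uncallable
MAX_DEPTH = 200      # positions with sequencing depth above 200 are uncallable
MIN_CALLERS = 2      # a SNP must be reported by two or more callers


def is_callable(mean_mapq, depth):
    """Apply the uncallable-region definition to a single position."""
    return mean_mapq >= MIN_MEAN_MAPQ and depth <= MAX_DEPTH


def consensus_snps(calls_by_caller, site_stats):
    """calls_by_caller: dict caller name -> set of (chrom, pos, alt).
    site_stats: dict (chrom, pos) -> (mean_mapq, depth).

    Returns the SNPs reported by at least MIN_CALLERS callers at callable
    positions. Positions without statistics are treated as uncallable
    (an assumption made for this sketch)."""
    support = Counter()
    for calls in calls_by_caller.values():
        support.update(calls)  # one vote per caller per SNP

    kept = set()
    for (chrom, pos, alt), n_callers in support.items():
        stats = site_stats.get((chrom, pos))
        if stats is None:
            continue
        if n_callers >= MIN_CALLERS and is_callable(*stats):
            kept.add((chrom, pos, alt))
    return kept


# Hypothetical toy input with three made-up callers.
toy_calls = {
    "callerA": {("1", 100, "A"), ("1", 200, "T")},
    "callerB": {("1", 100, "A"), ("1", 300, "G")},
    "callerC": {("1", 200, "T"), ("1", 300, "G"), ("1", 400, "C")},
}
toy_stats = {
    ("1", 100): (55, 30),    # callable
    ("1", 200): (20, 35),    # low mapping quality
    ("1", 300): (50, 250),   # excessive depth
    ("1", 400): (60, 28),    # callable, but only one caller
}
toy_final = consensus_snps(toy_calls, toy_stats)
```

In the toy input only the SNP at position 100 survives: the others fail on mapping quality, depth, or caller support.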
Unmapped read pairs (where neither the read nor its mate pair mapped to the reference) were extracted from the BAM files and assemblies were built using serial ABySS (1.1.1) with a k-mer size of 31 to produce unmapped contigs for each strain. These contigs were matched against x, y and z using XX.

Transcriptome Analysis with TopHat and Cufflinks

Each BAM file was converted to two FASTQ files (one per sequenced end) and TopHat was used to map data from each library to the genome, including splice sites annotated in Ensembl and UCSC gene structures, known mRNAs, and expressed sequence tags, and to exhaustively search for novel splice sites. Once splice sites were defined at the library level, TopHat was re-run combining all splice sites from all libraries across all of the strains so that reads from all of the libraries had the opportunity to map over the same splice sites. Cufflinks was then used to quantify expression of all Ensembl genes/transcripts across all libraries, and gene models were generated for each strain. Cuffcompare was then used to build a set of consensus transcripts, which were used to quantify expression of every RNAseq-based transcript in each strain. Cuffdiff was then used to identify significant differences in transcript isoform abundance, expression at the gene level, TSS usage, and coding sequence (Ensembl models only).

Phylogenetic Analysis

The NCBI build 37 mouse genome sequence was aligned to version 3.4 of the rat genome using Mercator to build a one-to-one collinear orthology map and using MAVID to produce nucleotide-level alignments on the collinear blocks (Dewey 2007). Consensus sequences of CAST/EiJ, PWK/PhJ, WSB/EiJ, and SPRET/EiJ were mapped to the alignment and gaps were filled with N’s. Collinear blocks were partitioned into 43,255 loci using a minimum description length algorithm with a maximum cost of 4.8 (Ané and Sanderson 2005). Following the procedure in White et al.
(2009), a separate Bayesian phylogenetic analysis was conducted for each locus and the posterior distribution of concordance factors across loci was estimated using Bayesian Concordance Analysis (Ané et al. 2007). The prior distribution of gene tree concordance was set at α = 1. Using an extreme prior of complete independence among loci (α = infinity) altered the concordance factors slightly, but did not change their rank orders.

References

1. Paigen, K. (2003) One hundred years of mouse genetics: an intellectual history. II. The molecular revolution (1981-2002). Genetics, 163, 1227-1235.
2. Paigen, K. (2003) One hundred years of mouse genetics: an intellectual history. I. The classical period (1902-1980). Genetics, 163, 1-7.
3. Wade, C.M. and Daly, M.J. (2005) Genetic variation in laboratory mice. Nat Genet, 37, 1175-1180.
4. Beck, J.A., Lloyd, S., Hafezparast, M., Lennon-Pierce, M., Eppig, J.T., Festing, M.F. and Fisher, E.M. (2000) Genealogies of mouse inbred strains. Nat Genet, 24, 23-25.
5. Atchley, W.R. and Fitch, W. (1993) Genetic affinities of inbred mouse strains of uncertain origin. Mol Biol Evol, 10, 1150-1169.
6. Atchley, W.R. and Fitch, W.M. (1991) Gene trees and the origins of inbred strains of mice. Science, 254, 554-558.
7. Churchill, G.A., Airey, D.C., Allayee, H., Angel, J.M., Attie, A.D., Beatty, J., Beavis, W.D., Belknap, J.K., Bennett, B., Berrettini, W. et al. (2004) The Collaborative Cross, a community resource for the genetic analysis of complex traits. Nat Genet, 36, 1133-1137.
8. Zheng, Z., Schmidt-Ott, K.M., Chua, S., Foster, K.A., Frankel, R.Z., Pavlidis, P., Barasch, J., D'Agati, V.D. and Gharavi, A.G. (2005) A Mendelian locus on chromosome 16 determines susceptibility to doxorubicin nephropathy in the mouse. Proc Natl Acad Sci U S A, 102, 2502-2507.
9. Frazer, K.A., Eskin, E., Kang, H.M., Bogue, M.A., Hinds, D.A., Beilharz, E.J., Gupta, R.V., Montgomery, J., Morenzoni, M.M., Nilsen, G.B. et al. (2007) A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature, 448, 1050-1053.
10. Cunningham, F., Rios, D., Griffiths, M., Smith, J., Ning, Z., Cox, T., Flicek, P., Marin-Garcia, P., Herrero, J., Rogers, J. et al. (2006) TranscriptSNPView: a genome-wide catalog of mouse coding variation. Nat Genet, 38, 853.
11. Valdar, W., Solberg, L.C., Gauguier, D., Burnett, S., Klenerman, P., Cookson, W.O., Taylor, M.S., Rawlins, J.N., Mott, R. and Flint, J. (2006) Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet, 38, 879-887.
12. Li, H., Ruan, J. and Durbin, R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res, 18, 1851-1858.
13. Chain, P.S., Grafham, D.V., Fulton, R.S., Fitzgerald, M.G., Hostetler, J., Muzny, D., Ali, J., Birren, B., Bruce, D.C., Buhay, C. et al. (2009) Genomics. Genome project standards in a new era of sequencing. Science, 326, 236-237.
14. Ane, C., Larget, B., Baum, D.A., Smith, S.D. and Rokas, A. (2007) Bayesian estimation of concordance among gene trees. Mol Biol Evol, 24, 412-426.
15. Pamilo, P. and Nei, M. (1988) Relationships between gene trees and species trees. Mol Biol Evol, 5, 568-583.
16. Maddison, D.R., Swofford, D.L. and Maddison, W.P. (1997) NEXUS: an extensible file format for systematic information. Syst Biol, 46, 590-621.
17. Rosenberg, N.A. (2002) The probability of topological concordance of gene trees and species trees. Theor Popul Biol, 61, 225-247.
18. Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R. et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53-59.
19. Steward, C.A., Humphray, S., Plumb, B., Jones, M.C., Quail, M.A., Rice, S., Cox, T., Davies, R., Bonfield, J., Keane, T.M. et al. (2010) Genome-wide end-sequenced BAC resources for the NOD/MrkTac and NOD/ShiLtJ mouse genomes. Genomics, 95, 105-110.
20. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079.
21. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M. et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20, 1297-1303.
22. Ning, Z., Cox, A.J. and Mullikin, J.C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res, 11, 1725-1729.