Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Algorithmic challenges in mammalian whole genome assembly Serafim Batzoglou Department of Computer Science Stanford University James H. Clark Center 318 Campus Drive, RM S-266 Stanford CA 94305-5428 [email protected] Abstract Sequencing the human, mouse, and rat genomes was accomplished more quickly than had seemed feasible a mere 8 years ago. The accelerating pace of sequencing has been made possible by the development of large-scale laboratory operations, assembly algorithms, and less so by changes in the underlying sequencing technologies. This short review summarizes the computational challenges of whole genome assembly and the algorithms and software systems that enabled the recent advances in sequencing of mammalian genomes. Keywords: DNA sequencing, whole genome shotgun assembly, physical mapping, contig, scaffold, overlap-layout-consensus The basic methodology used today for sequencing a large genome is double-barreled shotgun sequencing. Shotgun sequencing, introduced by Sanger and colleagues early on (Sanger et al. 1977), involves first obtaining a redundant representation of a genomic segment with sequenced reads, and then assembling the reads into contigs on the basis of sequence overlap. Doublebarreled shotgun sequencing enhances this method by introducing the notion of paired-end reads that are obtained by sequencing both ends of a fragment of known size, and thus provide useful long-range linking information (Edwards et al. 1990). It is relatively straight-forward to cover an arbitrarily long genome with sequenced reads, so that few gaps remain (e.g., by the Lander-Waterman model (1988) a 10-fold redundancy coverage of the genome with 500-long reads yields less than 1 gap in every 106 base pairs). Sequencing a genome is simple then, if defined as obtaining a representation of the majority of nucleotides in a genome. What makes sequencing challenging is the subsequent computational problem of correctly assembling the reads into the original sequence. Repeats, sequencing errors, and their relationship. Assembly would be easy were it not for repeats. If genomes had the properties of random strings on four letters {A, C, G, T}, then it would be unlikely for two reads to share a significant overlap unless they represented overlapping segments in the genome – a true overlap. Unfortunately, nearly half a mammalian genome’s length consists of repeats and true overlaps can be confused with repeat-induced overlaps. The main challenge of assembly is to distinguish between true overlaps, where reads should be merged, and repeat-induced overlaps, where merging should be (usually) avoided to prevent misassemblies – assembled sequence that does not occur in the genome. Repeats come in many varieties. The number of copies ranges from 2, such as in gene duplications, to about 106 in the case of ALU repeats in the human genome. Sequence similarity between copies can be almost 100%, such as in recent duplications, or very low, such as in ancient transposon lineages that have become inactive. Our ability to distinguish between true and repeat-induced overlaps is intimately tied to the sequencing error rate. In essence, if more than a proportion r of sequencing errors is unlikely, then any overlap with more than 2r mismatches and gaps is likely to be repeatinduced. Therefore, increasing read accuracy reduces the repeat content of a genome for the purposes of assembly. Two main ways to deal with repeats: clustering the reads, and pairing the reads. There are broadly two methods for assembling long genomic regions in the presence of repeats. The first method is clone-based sequencing, whereby the reads are being clustered by performing shotgun sequencing on medium-sized segments such as Bacterial Artificial Chromosome (BAC) clones of length ~200Kb. With this method the effective repeat content is drastically cut because most segments that are repeated across the genome are unique within such short regions. Moreover, the number of reads to be assembled is reduced by a factor of 104, and therefore the computational assembly problem becomes much simpler. The second method is double-barreled whole-genome shotgun (WGS) sequencing, whereby paired reads are obtained from both ends of inserts of various sizes (Edwards et al. 1990). Paired reads can resolve repeats by jumping across them: when taken from plasmid inserts longer than 5,000 bp they can resolve LINE repeats of that length; paired reads from longer inserts such as cosmids and BACs can resolve duplications, help build longer scaffolds, and verify the large-scale accuracy of an assembly. Clone-based sequencing. Clone-based sequencing was used to sequence several genomes including those of the yeast Saccharomyces cerevisiae (Oliver et al. 1992, Dujon 1996), Caenorhabditis elegans (The C. elegans Sequencing Consortium 1998), Arabidopsis thaliana (The Arabidopsis Genome Initiative 2000, and Human (International Human Genome Sequencing Consortium 2001). Most applications of clone-based sequencing were performed under the “map first, sequence second”, or physical mapping approach. Physical mapping involves constructing a complete physical map of a large set of clones covering the genome with redundancy, and then selecting a minimal tiling subset of those clones to be fully sequenced. After sequencing and assembly, errors in the original map, gaps, and misassemblies can be spotted, and if necessary corrected with additional targeted sequencing. This approach was successfully used to generate physical maps for sequencing the S. cerevisiae and C. elegans genomes (Coulson et al. 1986; Olson et al. 1986). The human genome was also sequenced by the public consortium using this approach, although the first draft large-scale assembly was still a challenging computational task: ordering and orienting the assembled contigs into a global genome assembly required integration of various sources of information such as sequence overlap, mRNA and EST fragments, and endsequencing of BACs. All these pieces of evidence were put together by a specialized system called GigAssembler (Kent and Haussler 2001). Physical mapping is by no means the only possible way to perform clone-based sequencing. Other methods are possible but less explored, such as the walking approach (Venter et al. 1996, Batzoglou et al. 1999, Roach et al. 1999, Roach et al. 2000), or light shotgun sequencing of completely unmapped random clones, whereby a collection of clones covering the genome with redundancy C1, are sequenced to redundancy C2, for a total sequencing depth C1C2. As clustering directly addresses the repeat challenge by effectively eliminating most repeats, assembly under any of the above schemes is in principle easier than under double-barreled sequencing. Whole genome shotgun sequencing. Shotgun sequencing was initially applied by Sanger to the genome of bacteriophage (Sanger et al. 1980, 1982). Over the years, shotgun sequencing has been applied to genomic segments of increasing length. In the 1980s it was applied to plasmids, chloroplasts (Ohyama et al. 1986), viruses (Goebel et al. 1990), and cosmid clones of length roughly 40 kbp, which were considered to be the limit of this approach. In the 1990s, mitochondria (Oda et al. 1992), bacterial artificial chromosomes of length 200 Kbp carrying genomic DNA from human or other organisms, as well as the entire 1,800 Kbp genome of the bacterium H. influenzae (Fleischmann et al. 1995), were sequenced. Compared to clone-based shotgun sequencing, WGS poses a much harder computational challenge, especially in large, repeat-rich genomes. When a WGS strategy was proposed for sequencing a mammalian genome such as human (Weber and Myers 1997) scientists debated whether such as strategy would succeed (Green 1997). Today we know that this is possible, as demonstrated by WGS assemblies of several complex genomes such as Drosophila melanogaster (Myers et al. 2000), human (Venter et al. 2001), and mouse (Mouse Genome Sequencing Consortium 2002, Mural et al. 2002). These successes have brought computer science to the forefront of genomics: WGS assembly became possible through advances in algorithms and software systems, rather than in either computer processor speed or sequencing accuracy. Overview of WGS assembly. The prevailing methodology for WGS assembly is the three-stage overlap-layout-consensus strategy (Peltola et al. 1984, Kececioglu 1991, Huang 1992, Kececioglu and Myers 1995). During the overlap phase all read-to-read sequence overlaps are detected. During layout, subsets of the overlaps are selected in order to merge reads into longer scaffolds of ordered and oriented reads. Those scaffolds are converted into sequence through a multiple alignment during the consensus phase. One of the most successful systems employing this threephase strategy is PHRAP (Green 1994), which has been the standard for BAC-size assembly but is unable to handle whole mammalian-size genomes. Current WGS assemblers include the Celera assembler (Myers et al. 2000), Euler (Pevzner et al. 2001), Arachne (Batzoglou et al. 2002, Jaffe et al. 2003), Phusion (Mullikin et al. 2003), Atlas (Havlak et al. 2004), and other. Most of these systems, at least implicitly, use the overlap-layout-consensus strategy. In the following paragraphs, we will briefly describe each of the assembly phases, focusing on the challenges they pose and on the algorithms that current WGS assemblers employ to address these challenges. We will be using the following terminology: a contig is a multiple alignment of reads, which can be converted into contiguous genomic sequence during the consensus phase. A supercontig is a structure of ordered and oriented contigs, usually derived by linking contigs according to paired reads. Some papers refer to ultracontigs as higher structures of supercontigs that are derived by various sources of linking information, such as paired reads from large inserts, mapping of BAC clones, or even cross-species comparison. The general term for assembled structures is the scaffold, a term usually interchangeable with supercontig. As the layout phase is the most complex, a useful conceptual (and implementation) division of this phase is in two sub-phases: (1) contig layout, during which most assembly systems strive to identify repeat boundaries and connect the reads into contigs that are as long as possible without crossing those boundaries, as that would induce misassemblies; and (2) supercontig assembly, perhaps the hardest phase of all, where contigs are chained into longer structures by use of paired reads crossing contig boundaries, and subsequently gaps are filled and misassemblies are identified and remedied. The overlap phase. In the overlap phase, read-to-read sequence overlaps are detected. Note that since read orientation is not known, all overlaps between reads in forward or reverse complement direction are detected. The nature of the data requires several preprocessing tasks to take place prior to overlap detection, such as trimming of reads to remove bases with quality scores that are too low, and filtering of contamination from sequencing/cloning vector or mitochondrial sequences. Next, in the case of large-scale WGS a significant algorithmic challenge is to find the overlaps efficiently. The strategies that have been employed rely on detecting exact word matches between the reads. Celera, Phusion, and Arachne, all perform a global step of indexing all exact fixed-length word occurrences in the reads, and then aligning all pairs of reads that share a word. Prior to alignment, filtering of very frequent words is necessary to avoid a huge number of pairwise read matches. Phusion and Arachne mask all highly repetitive k-mers, while Celera masks repeats, a step that is inherently more conservative and will discard additional spurious matches, but is also more complicated. Alignment is done with fast procedures that rely on the fact that only almost exact sequence identity between two reads can induce an overlap. Phusion and Arachne’s original indexing strategies were more time-efficient, but required more memory, than Celera’s strategy. Arachne performs this step in several passes, trading off speed for memory. Upon detecting potential read overlaps, most assemblers filter them so as to remove the ones that are very likely to be repeat induced. Arachne and Euler in particular, use a sequencing error-correction strategy that greatly simplifies subsequent assembly. In Arachne, the main idea is that a sequencing error reveals itself as a singleton base aligned to several other reads that all agree to a different base – in contrast to a repeat polymorphism, which should be shared by several reads. Euler’s error correction is based on finding and modifying low-frequency words in the reads, and is generally more aggressive, resulting in a drastic reduction in the number of errors by as much as a factor of 50. After error correction, Arachne aggressively discards read-toread overlaps that disagree in high-quality bases, because such overlaps are overwhelmingly likely to be repeat-induced. Most assemblers at this stage also discard likely chimeric reads, i.e., reads that are formed by two distant segments of the genome being glued together. The signature of such reads is that they overlap other reads to the left and right, but no overlap spans the chimeric joint. Contig layout. The prevailing way of representing all read overlaps is the overlap graph (Myers 1995), where reads are nodes, and overlaps are labeled edges that contain information on the location and (optionally) quality of overlap. Only full overlaps between reads x and y are stored in the graph: partial overlaps that end somewhere within the two sequences are clearly repeatinduced, and not stored in the graph. Repeat boundaries can be detected in an overlap graph by finding a read x that extends to two reads y and z to the left (or right), where y and z do overlap one another. The Celera assembler and Arachne use this overlap graph to merge reads into contigs up to the boundaries of repeats, thus joining all local interval subgraphs into contigs. To assemble contigs, Arachne employs a further heuristic: paired pairs of reads, defined as two plasmids of similar insert size with sequence overlaps occurring at both ends, are identified and merged, except under a heuristic condition that detects and avoids the instance of long repeats. Paired pairs help to cross high-fidelity repeats of length slightly longer than a read length. Phusion instead clusters the reads, and then passes the clusters to local assembly calls to PHRAP. One advantage of this approach is the use of PHRAP for low-level assembly, because PHRAP has been highly tuned to work well with real data. Most assemblers use several additional heuristics during contig layout, in order to extend contigs as far as possible. For example, sequencing errors near the end of one read are identified as such by noticing that the read does not overlap any other read on that side. Contig breaks due to such sequencing errors are then crossed. Also, contig breaks due to repeats that are shorter than one read length are resolved by finding reads that span the different repeat copies. Precisely, a repeat can be resolved if each copy, except for at most one, is spanned by a read. The resulting contigs constructed by Celera and Arachne are either unique, representing sequence that occurs once in the genome, or repetitive, representing a high-fidelity repeat sequence whose reads were merged together. In Arachne, the repeat has to exhibit nearly 100% fidelity. Contigs are classified as unique or repetitive by measuring the density of reads within a contig: repetitive contigs are likely to be twice or more times as dense as unique contigs (Myers et al. 2000). Supercontig layout. Once contigs are constructed and classified as unique or repetitive, the Celera and Arachne assemblers link them into larger structures, called scaffolds or supercontigs. One read-pair link can be an error, but two or more mutually confirming links that connect a pair of contigs are very unlikely by chance, and – in general – can be safely used to join the contigs into supercontigs. Most assemblers first join the unique contigs while excluding likely repeat ones, to produce supercontigs that represent correctly ordered an oriented unique sequence with gaps. Then, gaps are filled by recruiting the repeat contigs. This last stage of repeat resolution is rather complicated, and involves a combination of several heuristic approaches. The Celera assembler fills-in repeats in a series of increasingly aggressive steps, by recruiting first repetitive contigs that contain two or more read pair links to flanking unique contigs, then those that contain one link and some sequence overlap, and finally the rest conditionally on some heuristics. Arachne attempts to walk through the gap between two unique contigs by a shortest paths search that requires anchors spaced at most every 2Kb, where an anchor is a repetitive contig with read-pair links to the flanking unique contigs. This heuristic in practice results in reasonably accurate repeat reconstruction, except in the case of high-fidelity tandem repeats, which are often collapsed and show up as deletions with respect to the true sequence. Derivation of consensus sequence. Once a layout has been obtained, deriving consensus sequence is a relatively straight-forward task. Starting from the leftmost read of each contig, consensus sequence is derived from the multiple alignment of reads. Conservative quality scores are given to each consensus base according to the quality and agreement of bases in the multiple alignment column. The Celera assembler will additionally report positions that appear polymorphic, and give an estimate of the likelihood that the polymorphism is real versus repeatinduced. Discussion on existing WGS assemblers. An elegant formulation of sequence reconstruction as an Eüler Path Problem on a de Bruijn graph (Pevzner 1989), originally proposed in the context of sequencing by hybridization, has led to the development of the Eüler assembler (Pevzner et al. 2001). According to that approach, a graph is constructed whose nodes are the k-long words in reads, and whose edges connect pairs of k-long words that share a (k–1)-long overlap. Read overlaps are thus implicitly found in an efficient manner similar to Arachne’s and Phusion’s approach. Error correction is done directly on the k-long words, by modifying words that are too infrequent. Assembly then reduces to finding the correct Eülerian path—there are usually a large number of Eülerian Paths. An Eülerian Path is a path that uses every edge exactly once. To achieve that, a series of graph simplifications are taken based on the reads and on the read pair links. This approach has the advantage of clean formulation, and demonstrably strong performance in resolving repeats – or identifying all assembly ambiguities due to repeats. One challenge is to develop an implementation that is efficient enough for whole mammalian genome assembly. The de Bruijn graph provides the same branching structure (repeat representation) as Myers’s original overlap graph but at k-mer resolution rather than fragment resolution, and provides more precise repeat boundaries at the cost of extra memory usage. Several assemblers exist for large-scale application. The Celera assembler, Arachne, and Jazz, effectively use the overlap-layout-consensus paradigm combined with read-pair information. Phusion follows a different approach, motivated by the fact that PHRAP is a well tried and effective approach for small-scale assembly: Phusion first clusters the reads efficiently, and then sends several local assembly problems to PHRAP. The assemblies are then merged and corrected according to read-pair information. Arachne and Phusion were both used for assembling the Mouse genome, with comparable performance (Jaffe et al. 2003, Mullikin et al. 2003). Atlas is a suite of programs for assembling a genome with an approach that combines clone-based and WGS data, and was used to assemble the rat genome (Rat Genome Sequencing Project Consortium 2004). Sequencing and assembly in the future. A few years ago whole-genome shotgun assembly was considered impossible for large and repeat-rich genomes. Today it is routinely applied, thanks to the development of computational methods led by the Celera assembly team and by public efforts. The success of WGS assembly contributed to the interest in bioinformatics methods within the biology community. Although the assembly research pertinent to current sequencing technology seems to be reaching diminishing returns, additional open problems are coming from the development of new and revolutionary sequencing technologies that promise to deliver the “$1000 genome” within a decade (http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04003.html). Computationally, such technologies can be largely divided in two categories: (1) those that produce short, accurate reads cheaply, such as Pyrosequencing (Ronaghi et al. 1998, Ronaghi 2001) or other technologies (Mitra and Church 1999, Mitra et al. 2003, Braslavsky et al. 2003); and (2) those that produce very long reads, but potentially with very high error rates (see Pennisi 2002). Short reads should be harder to assemble, even if mate pair information is available. The reason is, again, effective repeat content: fewer exact repeats will be completely spanned by a read length. However, if reads are clustered without sacrificing the method’s speed and low cost (e.g., by following a random clone-based sequencing approach that does not involve physical mapping), assembly should be possible. Long reads will present interesting alignment and largescale indexing problems. In general, once the technological hurdles are overcome and a concrete computational problem is formulated, we can be confident that computational biologists will solve it and help ultimately make the $1,000 genome a reality. Acknowledgements The author would like to thank Granger Sutton for helpful comments on the first draft of this manuscript. References Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. Journal of Molecular Biology 215:403-410. The Arabidopsis Genome Initiative (AGI). 2000. Analysis of the genome of the flowering plant Arabidopsis thaliana. Nature 408(6814):796-815. Batzoglou S, Mesirov JP, Berger B, Lander ES. Sequencing a genome by walking with cloneends: A mathematical analysis. Genome Research 9:1163-1174, 1999. Braslavsky I, Hebert B, Kartalov E, Quake S. 2003. Sequence information can be obtained from single DNA molecules. PNAS 100:3960-3964. The C. elegans Sequencing Consortium. 1998. Genome sequencing of the nematode C. elegans: a platform for investigating biology. Science 282(5396):2012-8. Coulson A, Sulsto J, Brenner S, Karn J. 1986. Towards a physical map of the genome of the nematode Caenorhabditis elegans. Proceedings of the National Academy of Sciences 83:78217825. Edwards A, Voss H, Rice P, Civitello A, Stegemann J, Schwager C, Zimmermann J, Erfle H, Caskey CT, Ansorge W. 1990 Automated DNA sequencing of the human HPTR locus. Genomics 6: 593–608. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–511. Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND et al. 2003. The genome sequence of the filamentous fungus Neurospora crassa. Nature 422: 859 – 868. Goebel SJ, Johnson GP, Perkus ME, Davis SW, Winslow JP, Paoletti E. 1990. The complete DNA sequence of Vaccinia virus. Virology 179: 247–266. Green P. 1994. PHRAP documentation. http://www.phrap.org. Green P. 1997. Against a whole-genome shotgun. Genome Research 7: 410–417. Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA. 2004. The Atlas Genome Assembly System. Genome Res. 14: 721-732. Huang X. 1992. A contig assembly program based on sensitive detection of fragment overlaps. Genomics 14:18–25. International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921. Jaffe DB, Butler J, Gnerre S, Mauceli E, Lindblad-Toh K, Mesirov JP, Zody M, Lander ES. 2003.Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2. Genome Research 13: 91-96. Kececioglu J. 1991. Exact and approximation algorithms for DNA sequence reconstruction. PhD thesis, Department of Computer Science, University of Arizona, Tuscon, Arizona. Kececioglu J, Myers E. 1995. Exact and approximate algorithms for the sequence reconstruction problem. Algorithmica 13: 7-51. Kent WJ, Haussler D. 2001. Assembly of the working draft of the human genome with GigAssembler. Genome Research 11: 1541-1548. Lander ES, Waterman MS. 1988. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2:231-239. Mitra R and Church G. 1999. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Research 27:1-6. Mitra R, Schendure J, Olejnik J, Church G. Fluorescent in situ sequencing on polymerase colonies. Analytical Biochemistry 320:55-65. Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420:520-562. Mullikin JC, Ning Z. 2003. The Phusion assembler. Genome Research 13: 81-90. Mural RJ, Adams MD, Myers EW, Smith HO, et al. A comparison of whole-genome shotgunderived mouse chromosome 16 and the human genome. Science 296:1661–1671, 2002. Myers EW. 1995. Towards simplifying and accurately formulating fragment assembly. Journal of Computational Biology 2: 275-290. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, et al. 2000. A whole-genome assembly of Drosophila. Science 287: 2196–2204. Oda K, Katsuyuki Y, Ohta E, Nakamura Y, Takemura M, Nozato N, Akashi K, Kanegae T, Ogura Y, Kohchi T, Ohyama K. 1992. Gene organization deduced from the complete sequence of Liverwort Marchantia polymorpha mitochondrial DNA. Journal of Molecular Biology 223:1–7. Ohyama K, Fukuzawa H, Kohchi T, Shirai H, Sano T, Sano S, Umesono K, Shiki Y, Takeuchi M, Chang Z et al. 1986. Chloroplast gene organization deduced from complete sequence of livewort Marchantia polymorpha chloroplast DNA. Nature 322: 572–574. Olso MV, Dutchik JE, Graham MY, Brodeur GM, Helms C, Frank M, MacCollin M, Scheinman R, Frank T. 1986. Random-clone strategy for genome restriction mapping in yeast. Proceedings of the National Academy of Sciences 83:7826-7830. Peltola H, Soderlund H, Ukkonen E. 1984. SEQUAID: a DNA sequence assembly program based on a mathematical model. Nucleic Acids Research 12: 307–321. Pennisi E. 2002. Gene researchers hunt bargains, fixer-uppers. Science 298: 735-736. Pevzner PA. 1989. L-tuple DNA sequencing: computer analysis. Journal of Biomolecular Structural Dynamics 7: 63-73. Pevzner PA, Tang H, Waterman MS. 2001. An Eulerian path approach to DNA fragment assembly. PNAS 98:9748-9753. Rat Genome Sequencing Project Consortium. 2004. Genome sequence of the Brown Norway Rat yields insights into mammalian evolution. Nature (in press). Roach JC, Thorsson V, Siegel AF. 2000. Parking strategies for genome sequencing. Genome Research 10:1020-1030. Roach JC, Siegel AF, Trask B, van den Engh G, Hood L. 1999. Gaps in the Human Genome Project. Nature 401: 843-845. Ronaghi M. 2001. Pyrosequencing sheds light on DNA sequencing. Genome Research 11:3-11. Ronaghi M, Uhlen M, Nyrén P. 1998. A sequencing method based on real-time pyrophosphate. Science 281:363-365. Sanger, F., Coulson, A.R., Hong, G.F., Hill, D.F., and Peterson, G.B. 1982. Nucleotide sequence of bacteriophage _ DNA. J. Mol. Biol. 162: 729–773. Sanger, F., Nicklen, S., and Coulsen, A.R. 1977. DNA sequencing with chain terminating inhibitors. Proc. Natl. Acad. Sci. 74: 5463–5467. Venter JC, Smith HO, Hood L. 1996. A new strategy for genome sequencing. Nature 381:364366. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG et al. 2001. The sequence of the human genome. Science 291: 1304-1351. Weber JL, Myers EW. 1997. Human whole-genome shotgun sequencing. Genome Research 7: 401–409.