Download Batzoglou`s review

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Algorithmic challenges in mammalian whole genome assembly
Serafim Batzoglou
Department of Computer Science
Stanford University
James H. Clark Center
318 Campus Drive, RM S-266
Stanford CA 94305-5428
[email protected]
Abstract
Sequencing the human, mouse, and rat genomes was accomplished more quickly than had
seemed feasible a mere 8 years ago. The accelerating pace of sequencing has been made possible
by the development of large-scale laboratory operations, assembly algorithms, and less so by
changes in the underlying sequencing technologies. This short review summarizes the
computational challenges of whole genome assembly and the algorithms and software systems
that enabled the recent advances in sequencing of mammalian genomes.
Keywords: DNA sequencing, whole genome shotgun assembly, physical mapping,
contig, scaffold, overlap-layout-consensus
The basic methodology used today for sequencing a large genome is double-barreled shotgun
sequencing. Shotgun sequencing, introduced by Sanger and colleagues early on (Sanger et al.
1977), involves first obtaining a redundant representation of a genomic segment with sequenced
reads, and then assembling the reads into contigs on the basis of sequence overlap. Doublebarreled shotgun sequencing enhances this method by introducing the notion of paired-end reads
that are obtained by sequencing both ends of a fragment of known size, and thus provide useful
long-range linking information (Edwards et al. 1990).
It is relatively straight-forward to cover an arbitrarily long genome with sequenced reads, so that
few gaps remain (e.g., by the Lander-Waterman model (1988) a 10-fold redundancy coverage of
the genome with 500-long reads yields less than 1 gap in every 106 base pairs). Sequencing a
genome is simple then, if defined as obtaining a representation of the majority of nucleotides in a
genome. What makes sequencing challenging is the subsequent computational problem of
correctly assembling the reads into the original sequence.
Repeats, sequencing errors, and their relationship. Assembly would be easy were it not for
repeats. If genomes had the properties of random strings on four letters {A, C, G, T}, then it
would be unlikely for two reads to share a significant overlap unless they represented overlapping
segments in the genome – a true overlap. Unfortunately, nearly half a mammalian genome’s
length consists of repeats and true overlaps can be confused with repeat-induced overlaps. The
main challenge of assembly is to distinguish between true overlaps, where reads should be
merged, and repeat-induced overlaps, where merging should be (usually) avoided to prevent
misassemblies – assembled sequence that does not occur in the genome. Repeats come in many
varieties. The number of copies ranges from 2, such as in gene duplications, to about 106 in the
case of ALU repeats in the human genome. Sequence similarity between copies can be almost
100%, such as in recent duplications, or very low, such as in ancient transposon lineages that
have become inactive. Our ability to distinguish between true and repeat-induced overlaps is
intimately tied to the sequencing error rate. In essence, if more than a proportion r of sequencing
errors is unlikely, then any overlap with more than 2r mismatches and gaps is likely to be repeatinduced. Therefore, increasing read accuracy reduces the repeat content of a genome for the
purposes of assembly.
Two main ways to deal with repeats: clustering the reads, and pairing the reads. There are
broadly two methods for assembling long genomic regions in the presence of repeats. The first
method is clone-based sequencing, whereby the reads are being clustered by performing shotgun
sequencing on medium-sized segments such as Bacterial Artificial Chromosome (BAC) clones of
length ~200Kb. With this method the effective repeat content is drastically cut because most
segments that are repeated across the genome are unique within such short regions. Moreover, the
number of reads to be assembled is reduced by a factor of 104, and therefore the computational
assembly problem becomes much simpler. The second method is double-barreled whole-genome
shotgun (WGS) sequencing, whereby paired reads are obtained from both ends of inserts of
various sizes (Edwards et al. 1990). Paired reads can resolve repeats by jumping across them:
when taken from plasmid inserts longer than 5,000 bp they can resolve LINE repeats of that
length; paired reads from longer inserts such as cosmids and BACs can resolve duplications, help
build longer scaffolds, and verify the large-scale accuracy of an assembly.
Clone-based sequencing. Clone-based sequencing was used to sequence several genomes
including those of the yeast Saccharomyces cerevisiae (Oliver et al. 1992, Dujon 1996),
Caenorhabditis elegans (The C. elegans Sequencing Consortium 1998), Arabidopsis thaliana (The
Arabidopsis Genome Initiative 2000, and Human (International Human Genome Sequencing
Consortium 2001). Most applications of clone-based sequencing were performed under the “map
first, sequence second”, or physical mapping approach. Physical mapping involves constructing a
complete physical map of a large set of clones covering the genome with redundancy, and then
selecting a minimal tiling subset of those clones to be fully sequenced. After sequencing and
assembly, errors in the original map, gaps, and misassemblies can be spotted, and if necessary
corrected with additional targeted sequencing. This approach was successfully used to generate
physical maps for sequencing the S. cerevisiae and C. elegans genomes (Coulson et al. 1986;
Olson et al. 1986). The human genome was also sequenced by the public consortium using this
approach, although the first draft large-scale assembly was still a challenging computational task:
ordering and orienting the assembled contigs into a global genome assembly required integration
of various sources of information such as sequence overlap, mRNA and EST fragments, and endsequencing of BACs. All these pieces of evidence were put together by a specialized system
called GigAssembler (Kent and Haussler 2001). Physical mapping is by no means the only
possible way to perform clone-based sequencing. Other methods are possible but less explored,
such as the walking approach (Venter et al. 1996, Batzoglou et al. 1999, Roach et al. 1999, Roach
et al. 2000), or light shotgun sequencing of completely unmapped random clones, whereby a
collection of clones covering the genome with redundancy C1, are sequenced to redundancy C2,
for a total sequencing depth C1C2. As clustering directly addresses the repeat challenge by
effectively eliminating most repeats, assembly under any of the above schemes is in principle
easier than under double-barreled sequencing.
Whole genome shotgun sequencing. Shotgun sequencing was initially applied by Sanger to the
genome of bacteriophage  (Sanger et al. 1980, 1982). Over the years, shotgun sequencing has
been applied to genomic segments of increasing length. In the 1980s it was applied to plasmids,
chloroplasts (Ohyama et al. 1986), viruses (Goebel et al. 1990), and cosmid clones of length
roughly 40 kbp, which were considered to be the limit of this approach. In the 1990s,
mitochondria (Oda et al. 1992), bacterial artificial chromosomes of length 200 Kbp carrying
genomic DNA from human or other organisms, as well as the entire 1,800 Kbp genome of the
bacterium H. influenzae (Fleischmann et al. 1995), were sequenced. Compared to clone-based
shotgun sequencing, WGS poses a much harder computational challenge, especially in large,
repeat-rich genomes. When a WGS strategy was proposed for sequencing a mammalian genome
such as human (Weber and Myers 1997) scientists debated whether such as strategy would
succeed (Green 1997). Today we know that this is possible, as demonstrated by WGS assemblies
of several complex genomes such as Drosophila melanogaster (Myers et al. 2000), human
(Venter et al. 2001), and mouse (Mouse Genome Sequencing Consortium 2002, Mural et al.
2002). These successes have brought computer science to the forefront of genomics: WGS
assembly became possible through advances in algorithms and software systems, rather than in
either computer processor speed or sequencing accuracy.
Overview of WGS assembly. The prevailing methodology for WGS assembly is the three-stage
overlap-layout-consensus strategy (Peltola et al. 1984, Kececioglu 1991, Huang 1992, Kececioglu
and Myers 1995). During the overlap phase all read-to-read sequence overlaps are detected.
During layout, subsets of the overlaps are selected in order to merge reads into longer scaffolds of
ordered and oriented reads. Those scaffolds are converted into sequence through a multiple
alignment during the consensus phase. One of the most successful systems employing this threephase strategy is PHRAP (Green 1994), which has been the standard for BAC-size assembly but
is unable to handle whole mammalian-size genomes. Current WGS assemblers include the Celera
assembler (Myers et al. 2000), Euler (Pevzner et al. 2001), Arachne (Batzoglou et al. 2002, Jaffe
et al. 2003), Phusion (Mullikin et al. 2003), Atlas (Havlak et al. 2004), and other. Most of these
systems, at least implicitly, use the overlap-layout-consensus strategy.
In the following paragraphs, we will briefly describe each of the assembly phases,
focusing on the challenges they pose and on the algorithms that current WGS assemblers employ
to address these challenges. We will be using the following terminology: a contig is a multiple
alignment of reads, which can be converted into contiguous genomic sequence during the
consensus phase. A supercontig is a structure of ordered and oriented contigs, usually derived by
linking contigs according to paired reads. Some papers refer to ultracontigs as higher structures
of supercontigs that are derived by various sources of linking information, such as paired reads
from large inserts, mapping of BAC clones, or even cross-species comparison. The general term
for assembled structures is the scaffold, a term usually interchangeable with supercontig. As the
layout phase is the most complex, a useful conceptual (and implementation) division of this phase
is in two sub-phases: (1) contig layout, during which most assembly systems strive to identify
repeat boundaries and connect the reads into contigs that are as long as possible without crossing
those boundaries, as that would induce misassemblies; and (2) supercontig assembly, perhaps the
hardest phase of all, where contigs are chained into longer structures by use of paired reads
crossing contig boundaries, and subsequently gaps are filled and misassemblies are identified and
remedied.
The overlap phase. In the overlap phase, read-to-read sequence overlaps are detected. Note that
since read orientation is not known, all overlaps between reads in forward or reverse complement
direction are detected. The nature of the data requires several preprocessing tasks to take place
prior to overlap detection, such as trimming of reads to remove bases with quality scores that are
too low, and filtering of contamination from sequencing/cloning vector or mitochondrial
sequences. Next, in the case of large-scale WGS a significant algorithmic challenge is to find the
overlaps efficiently. The strategies that have been employed rely on detecting exact word matches
between the reads. Celera, Phusion, and Arachne, all perform a global step of indexing all exact
fixed-length word occurrences in the reads, and then aligning all pairs of reads that share a word.
Prior to alignment, filtering of very frequent words is necessary to avoid a huge number of
pairwise read matches. Phusion and Arachne mask all highly repetitive k-mers, while Celera
masks repeats, a step that is inherently more conservative and will discard additional spurious
matches, but is also more complicated. Alignment is done with fast procedures that rely on the
fact that only almost exact sequence identity between two reads can induce an overlap. Phusion
and Arachne’s original indexing strategies were more time-efficient, but required more memory,
than Celera’s strategy. Arachne performs this step in several passes, trading off speed for
memory.
Upon detecting potential read overlaps, most assemblers filter them so as to remove the
ones that are very likely to be repeat induced. Arachne and Euler in particular, use a sequencing
error-correction strategy that greatly simplifies subsequent assembly. In Arachne, the main idea is
that a sequencing error reveals itself as a singleton base aligned to several other reads that all
agree to a different base – in contrast to a repeat polymorphism, which should be shared by
several reads. Euler’s error correction is based on finding and modifying low-frequency words in
the reads, and is generally more aggressive, resulting in a drastic reduction in the number of
errors by as much as a factor of 50. After error correction, Arachne aggressively discards read-toread overlaps that disagree in high-quality bases, because such overlaps are overwhelmingly
likely to be repeat-induced. Most assemblers at this stage also discard likely chimeric reads, i.e.,
reads that are formed by two distant segments of the genome being glued together. The signature
of such reads is that they overlap other reads to the left and right, but no overlap spans the
chimeric joint.
Contig layout. The prevailing way of representing all read overlaps is the overlap graph (Myers
1995), where reads are nodes, and overlaps are labeled edges that contain information on the
location and (optionally) quality of overlap. Only full overlaps between reads x and y are stored
in the graph: partial overlaps that end somewhere within the two sequences are clearly repeatinduced, and not stored in the graph. Repeat boundaries can be detected in an overlap graph by
finding a read x that extends to two reads y and z to the left (or right), where y and z do overlap
one another. The Celera assembler and Arachne use this overlap graph to merge reads into
contigs up to the boundaries of repeats, thus joining all local interval subgraphs into contigs. To
assemble contigs, Arachne employs a further heuristic: paired pairs of reads, defined as two
plasmids of similar insert size with sequence overlaps occurring at both ends, are identified and
merged, except under a heuristic condition that detects and avoids the instance of long repeats.
Paired pairs help to cross high-fidelity repeats of length slightly longer than a read length.
Phusion instead clusters the reads, and then passes the clusters to local assembly calls to PHRAP.
One advantage of this approach is the use of PHRAP for low-level assembly, because PHRAP
has been highly tuned to work well with real data. Most assemblers use several additional
heuristics during contig layout, in order to extend contigs as far as possible. For example,
sequencing errors near the end of one read are identified as such by noticing that the read does not
overlap any other read on that side. Contig breaks due to such sequencing errors are then crossed.
Also, contig breaks due to repeats that are shorter than one read length are resolved by finding
reads that span the different repeat copies. Precisely, a repeat can be resolved if each copy, except
for at most one, is spanned by a read.
The resulting contigs constructed by Celera and Arachne are either unique, representing sequence
that occurs once in the genome, or repetitive, representing a high-fidelity repeat sequence whose
reads were merged together. In Arachne, the repeat has to exhibit nearly 100% fidelity. Contigs
are classified as unique or repetitive by measuring the density of reads within a contig: repetitive
contigs are likely to be twice or more times as dense as unique contigs (Myers et al. 2000).
Supercontig layout. Once contigs are constructed and classified as unique or repetitive, the
Celera and Arachne assemblers link them into larger structures, called scaffolds or supercontigs.
One read-pair link can be an error, but two or more mutually confirming links that connect a pair
of contigs are very unlikely by chance, and – in general – can be safely used to join the contigs
into supercontigs. Most assemblers first join the unique contigs while excluding likely repeat ones,
to produce supercontigs that represent correctly ordered an oriented unique sequence with gaps.
Then, gaps are filled by recruiting the repeat contigs. This last stage of repeat resolution is rather
complicated, and involves a combination of several heuristic approaches. The Celera assembler
fills-in repeats in a series of increasingly aggressive steps, by recruiting first repetitive contigs
that contain two or more read pair links to flanking unique contigs, then those that contain one
link and some sequence overlap, and finally the rest conditionally on some heuristics. Arachne
attempts to walk through the gap between two unique contigs by a shortest paths search that
requires anchors spaced at most every 2Kb, where an anchor is a repetitive contig with read-pair
links to the flanking unique contigs. This heuristic in practice results in reasonably accurate
repeat reconstruction, except in the case of high-fidelity tandem repeats, which are often
collapsed and show up as deletions with respect to the true sequence.
Derivation of consensus sequence. Once a layout has been obtained, deriving consensus
sequence is a relatively straight-forward task. Starting from the leftmost read of each contig,
consensus sequence is derived from the multiple alignment of reads. Conservative quality scores
are given to each consensus base according to the quality and agreement of bases in the multiple
alignment column. The Celera assembler will additionally report positions that appear
polymorphic, and give an estimate of the likelihood that the polymorphism is real versus repeatinduced.
Discussion on existing WGS assemblers. An elegant formulation of sequence reconstruction as
an Eüler Path Problem on a de Bruijn graph (Pevzner 1989), originally proposed in the context of
sequencing by hybridization, has led to the development of the Eüler assembler (Pevzner et al.
2001). According to that approach, a graph is constructed whose nodes are the k-long words in
reads, and whose edges connect pairs of k-long words that share a (k–1)-long overlap. Read
overlaps are thus implicitly found in an efficient manner similar to Arachne’s and Phusion’s
approach. Error correction is done directly on the k-long words, by modifying words that are too
infrequent. Assembly then reduces to finding the correct Eülerian path—there are usually a large
number of Eülerian Paths. An Eülerian Path is a path that uses every edge exactly once. To
achieve that, a series of graph simplifications are taken based on the reads and on the read pair
links. This approach has the advantage of clean formulation, and demonstrably strong
performance in resolving repeats – or identifying all assembly ambiguities due to repeats. One
challenge is to develop an implementation that is efficient enough for whole mammalian genome
assembly. The de Bruijn graph provides the same branching structure (repeat representation) as
Myers’s original overlap graph but at k-mer resolution rather than fragment resolution, and
provides more precise repeat boundaries at the cost of extra memory usage.
Several assemblers exist for large-scale application. The Celera assembler, Arachne, and Jazz,
effectively use the overlap-layout-consensus paradigm combined with read-pair information.
Phusion follows a different approach, motivated by the fact that PHRAP is a well tried and
effective approach for small-scale assembly: Phusion first clusters the reads efficiently, and then
sends several local assembly problems to PHRAP. The assemblies are then merged and corrected
according to read-pair information. Arachne and Phusion were both used for assembling the
Mouse genome, with comparable performance (Jaffe et al. 2003, Mullikin et al. 2003). Atlas is a
suite of programs for assembling a genome with an approach that combines clone-based and
WGS data, and was used to assemble the rat genome (Rat Genome Sequencing Project
Consortium 2004).
Sequencing and assembly in the future. A few years ago whole-genome shotgun assembly was
considered impossible for large and repeat-rich genomes. Today it is routinely applied, thanks to
the development of computational methods led by the Celera assembly team and by public efforts.
The success of WGS assembly contributed to the interest in bioinformatics methods within the
biology community. Although the assembly research pertinent to current sequencing technology
seems to be reaching diminishing returns, additional open problems are coming from the
development of new and revolutionary sequencing technologies that promise to deliver the
“$1000 genome” within a decade (http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-04003.html). Computationally, such technologies can be largely divided in two categories: (1) those
that produce short, accurate reads cheaply, such as Pyrosequencing (Ronaghi et al. 1998, Ronaghi
2001) or other technologies (Mitra and Church 1999, Mitra et al. 2003, Braslavsky et al. 2003);
and (2) those that produce very long reads, but potentially with very high error rates (see Pennisi
2002). Short reads should be harder to assemble, even if mate pair information is available. The
reason is, again, effective repeat content: fewer exact repeats will be completely spanned by a
read length. However, if reads are clustered without sacrificing the method’s speed and low cost
(e.g., by following a random clone-based sequencing approach that does not involve physical
mapping), assembly should be possible. Long reads will present interesting alignment and largescale indexing problems. In general, once the technological hurdles are overcome and a concrete
computational problem is formulated, we can be confident that computational biologists will
solve it and help ultimately make the $1,000 genome a reality.
Acknowledgements
The author would like to thank Granger Sutton for helpful comments on the first draft of this
manuscript.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool.
Journal of Molecular Biology 215:403-410.
The Arabidopsis Genome Initiative (AGI). 2000. Analysis of the genome of the flowering plant
Arabidopsis thaliana. Nature 408(6814):796-815.
Batzoglou S, Mesirov JP, Berger B, Lander ES. Sequencing a genome by walking with cloneends: A mathematical analysis. Genome Research 9:1163-1174, 1999.
Braslavsky I, Hebert B, Kartalov E, Quake S. 2003. Sequence information can be obtained from
single DNA molecules. PNAS 100:3960-3964.
The C. elegans Sequencing Consortium. 1998. Genome sequencing of the nematode C. elegans: a
platform for investigating biology. Science 282(5396):2012-8.
Coulson A, Sulsto J, Brenner S, Karn J. 1986. Towards a physical map of the genome of the
nematode Caenorhabditis elegans. Proceedings of the National Academy of Sciences 83:78217825.
Edwards A, Voss H, Rice P, Civitello A, Stegemann J, Schwager C, Zimmermann J, Erfle H,
Caskey CT, Ansorge W. 1990 Automated DNA sequencing of the human HPTR locus. Genomics
6: 593–608.
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ,
Tomb JF, Dougherty BA, Merrick JM et al. 1995. Whole-genome random sequencing and
assembly of Haemophilus influenzae Rd. Science 269: 496–511.
Galagan JE, Calvo SE, Borkovich KA, Selker EU, Read ND et al. 2003. The genome sequence of
the filamentous fungus Neurospora crassa. Nature 422: 859 – 868.
Goebel SJ, Johnson GP, Perkus ME, Davis SW, Winslow JP, Paoletti E. 1990. The complete
DNA sequence of Vaccinia virus. Virology 179: 247–266.
Green P. 1994. PHRAP documentation. http://www.phrap.org.
Green P. 1997. Against a whole-genome shotgun. Genome Research 7: 410–417.
Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA. 2004. The
Atlas Genome Assembly System. Genome Res. 14: 721-732.
Huang X. 1992. A contig assembly program based on sensitive detection of fragment overlaps.
Genomics 14:18–25.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of
the human genome. Nature 409:860-921.
Jaffe DB, Butler J, Gnerre S, Mauceli E, Lindblad-Toh K, Mesirov JP, Zody M, Lander ES.
2003.Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2. Genome
Research 13: 91-96.
Kececioglu J. 1991. Exact and approximation algorithms for DNA sequence reconstruction. PhD
thesis, Department of Computer Science, University of Arizona, Tuscon, Arizona.
Kececioglu J, Myers E. 1995. Exact and approximate algorithms for the sequence reconstruction
problem. Algorithmica 13: 7-51.
Kent WJ, Haussler D. 2001. Assembly of the working draft of the human genome with
GigAssembler. Genome Research 11: 1541-1548.
Lander ES, Waterman MS. 1988. Genomic mapping by fingerprinting random clones: a
mathematical analysis. Genomics 2:231-239.
Mitra R and Church G. 1999. In situ localized amplification and contact replication of many
individual DNA molecules. Nucleic Acids Research 27:1-6.
Mitra R, Schendure J, Olejnik J, Church G. Fluorescent in situ sequencing on polymerase
colonies. Analytical Biochemistry 320:55-65.
Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of
the mouse genome. Nature 420:520-562.
Mullikin JC, Ning Z. 2003. The Phusion assembler. Genome Research 13: 81-90.
Mural RJ, Adams MD, Myers EW, Smith HO, et al. A comparison of whole-genome shotgunderived mouse chromosome 16 and the human genome. Science 296:1661–1671, 2002.
Myers EW. 1995. Towards simplifying and accurately formulating fragment assembly. Journal of
Computational Biology 2: 275-290.
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry
CM, Reinert KHJ, Remington KA, et al. 2000. A whole-genome assembly of Drosophila. Science
287: 2196–2204.
Oda K, Katsuyuki Y, Ohta E, Nakamura Y, Takemura M, Nozato N, Akashi K, Kanegae T,
Ogura Y, Kohchi T, Ohyama K. 1992. Gene organization deduced from the complete sequence of
Liverwort Marchantia polymorpha mitochondrial DNA. Journal of Molecular Biology 223:1–7.
Ohyama K, Fukuzawa H, Kohchi T, Shirai H, Sano T, Sano S, Umesono K, Shiki Y, Takeuchi M,
Chang Z et al. 1986. Chloroplast gene organization deduced from complete sequence of livewort
Marchantia polymorpha chloroplast DNA. Nature 322: 572–574.
Olso MV, Dutchik JE, Graham MY, Brodeur GM, Helms C, Frank M, MacCollin M, Scheinman
R, Frank T. 1986. Random-clone strategy for genome restriction mapping in yeast. Proceedings
of the National Academy of Sciences 83:7826-7830.
Peltola H, Soderlund H, Ukkonen E. 1984. SEQUAID: a DNA sequence assembly program based
on a mathematical model. Nucleic Acids Research 12: 307–321.
Pennisi E. 2002. Gene researchers hunt bargains, fixer-uppers. Science 298: 735-736.
Pevzner PA. 1989. L-tuple DNA sequencing: computer analysis. Journal of Biomolecular
Structural Dynamics 7: 63-73.
Pevzner PA, Tang H, Waterman MS. 2001. An Eulerian path approach to DNA fragment
assembly. PNAS 98:9748-9753.
Rat Genome Sequencing Project Consortium. 2004. Genome sequence of the Brown Norway Rat
yields insights into mammalian evolution. Nature (in press).
Roach JC, Thorsson V, Siegel AF. 2000. Parking strategies for genome sequencing. Genome
Research 10:1020-1030.
Roach JC, Siegel AF, Trask B, van den Engh G, Hood L. 1999. Gaps in the Human Genome
Project. Nature 401: 843-845.
Ronaghi M. 2001. Pyrosequencing sheds light on DNA sequencing. Genome Research 11:3-11.
Ronaghi M, Uhlen M, Nyrén P. 1998. A sequencing method based on real-time pyrophosphate.
Science 281:363-365.
Sanger, F., Coulson, A.R., Hong, G.F., Hill, D.F., and Peterson, G.B. 1982. Nucleotide sequence
of bacteriophage _ DNA. J. Mol. Biol. 162: 729–773.
Sanger, F., Nicklen, S., and Coulsen, A.R. 1977. DNA sequencing with chain terminating
inhibitors. Proc. Natl. Acad. Sci. 74: 5463–5467.
Venter JC, Smith HO, Hood L. 1996. A new strategy for genome sequencing. Nature 381:364366.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG et al. 2001. The sequence of the
human genome. Science 291: 1304-1351.
Weber JL, Myers EW. 1997. Human whole-genome shotgun sequencing. Genome Research 7:
401–409.