DNA sequencing - University of Louisville Bioinformatics

Christy M. Bogard
Introduction
• A look at where we have been, where we are
currently, and where we are going
The Sequence Concept in Biology
• The importance of sequencing of biological
macromolecules was first demonstrated with
Sanger’s studies of insulin.
– His work showed that amino acid residues are joined
in linear polypeptides to form proteins.
– Sanger recognized that his research did not and could
not make any claims about the arrangement of the
residues.
– Consequently, when Watson and Crick proposed the
double-helix structure of DNA, they pointed out that their
structure placed no constraints on the sequence of a
DNA molecule, but that it suggested a mechanism for
sequence replication.
Delays in DNA Sequencing
• 15 years elapsed between discovery of the DNA
double helix structure and the first experimental
determination of a DNA sequence.
– Different DNA molecules were chemically too
similar to separate easily.
– DNA chains were much longer than their
protein counterparts, making complete sequencing
seem unapproachable.
– The 20 amino acids of proteins vary widely in their
properties, which made peptides easier to separate; with
only 4 nucleotides, separating DNA fragments seemed
a harder problem.
– No base-specific DNases were known.
Progression with RNA
• RNA molecules, while similar in structure, did not
share all of these drawbacks.
– Transfer RNAs (tRNAs) were small and individual types
could be purified.
– RNases with base specificity were known.
– Yeast alanine tRNA was the first nucleic acid
molecule to be sequenced.
– Models for tRNA structure could be deduced by
assuming base pairing analogous to that found in the DNA
double helix.
From RNA to DNA
• The genome of the bacteriophage ΦX174 was
the first DNA molecule to be purified to
homogeneity. This was accomplished by
equilibrium buoyant density centrifugation.
• ΦX174 is a single-stranded circular molecule; phage
lambda DNA, a linear molecule with
cohesive ends, was the first DNA molecule to be
successfully sequenced.
– Wu and Kaiser measured the incorporation of
radiolabeled nucleotides by E. coli DNA polymerase in
reactions that extended the 3′ termini to fill in the
complementary cohesive-end sequences.
Early Days of DNA Sequencing
• Discovery of type II restriction enzymes by Smith
and coworkers. These enzymes recognized and
cleaved DNA at specific, short nucleotide
sequences (4-6 bp), providing a way to cut large
DNA molecules into smaller pieces that could be
separated by size using gel electrophoresis.
Gel-Based Sequencing Methods
• 1975: Sanger introduced the ‘plus and minus’ method
for DNA sequencing, which used polyacrylamide
gels to separate the products of primed synthesis
by DNA polymerase in order of increasing chain
length.
– Problem: the length of a homopolymer run could not
be read directly and had to be estimated.
• Maxam and Gilbert developed a method that
used polyacrylamide gels to resolve bands that
terminated at each base throughout the target
sequence. The chemical reactions cleaved at purines,
at pyrimidines, and preferentially at A and at C.
Gel-Based Sequencing Methods
• 1977: Sanger developed the ‘dideoxy method’.
It uses chain-terminating nucleotide analogs rather
than subsets of the four natural dNTPs to cause
base-specific termination of primed DNA
synthesis. Unlike the ‘plus and minus’ method, bands
were produced at each nucleotide in a run.
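The base-specific termination idea can be sketched in a few lines of Python. This is a toy model, not a description of any laboratory protocol: it assumes every possible termination product is present and perfectly resolved, and the names dideoxy_lanes and read_gel are hypothetical.

```python
# Toy model of dideoxy (chain-termination) sequencing.
# Assumption: each of the four reactions yields one fragment for every
# position where its base occurs, and the gel resolves every length.

def dideoxy_lanes(synthesized):
    """For each base, list the lengths of fragments that end in that base."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized, start=1):
        lanes[base].append(position)
    return lanes

def read_gel(lanes):
    """Read the sequence by walking the bands from shortest to longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

strand = "GATTTACA"             # newly synthesized strand (made-up example)
lanes = dideoxy_lanes(strand)   # the T lane holds lengths 3, 4 and 5
recovered = read_gel(lanes)     # same sequence as `strand`
```

Because each T in the run TTT gives its own band, the run length is read directly rather than estimated.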
Sequences, Sequences, Sequences
• Unlike amino acid sequences, ΦX DNA sequences
could be interpreted in terms of the genetic code.
• Analysis of mutations in genes identified by
traditional phage genetics, combined with amino
acid sequence data, allowed phage genes to be located on the
DNA sequence.
• For the first time, a DNA sequence revealed long
open reading frames that could be assigned to
genes identified by traditional methods.
• It was clear that significant portions of the
genome were translated in more than one
reading frame to produce two different proteins.
Sequences, Sequences, Sequences
• With the introduction of gel-based sequencing
methods, the rate of DNA sequencing advanced.
• Likewise, the useful read length of dideoxy
sequencing increased from about 100 to about
400 nucleotides, helped by:
– The use of very thin sequencing gels
– 35S labeling of DNA, which gives sharper bands than
32P because it emits lower-energy beta particles
The Birth of Bioinformatics
• Beginning with ΦX, the management and analysis
of sequence data became a major undertaking.
• McCallum wrote the first programs to help with
the compilation and analysis of DNA sequences
– Compiled and numbered the complete sequence
– Allowed editing of previously compiled sequence
– Searched the sequence for specific short sequences or
families of sequences (restriction sites)
– Translated the sequence in all reading frames
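Two of the tasks listed above, searching for restriction sites and translating in all reading frames, might look like this in modern Python. This is only an illustrative sketch and not a reconstruction of the original programs. The function names are hypothetical, the example sequence is made up, and translating both strands (six frames) is an assumption about what "all reading frames" covers.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def find_sites(sequence, site):
    """Return 1-based positions where `site` occurs, overlaps included."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(site, start + 1)
    return positions

def reverse_complement(sequence):
    """Return the complementary strand, read 5' to 3'."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(sequence, frame):
    """Translate one strand starting at offset `frame` (0, 1 or 2)."""
    codons = (sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

def six_frame_translation(sequence):
    """Translate three frames on each strand; '*' marks a stop codon."""
    frames = {}
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for frame in range(3):
            frames[f"{strand}{frame + 1}"] = translate(seq, frame)
    return frames

dna = "ATGGCCGAATTCTGA"                    # made-up example sequence
eco_ri_hits = find_sites(dna, "GAATTC")    # GAATTC is the EcoRI site
frames = six_frame_translation(dna)
```

Here eco_ri_hits lists the 1-based start of each site, and frames maps labels such as "+1" or "-2" to the translated protein string.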
The Birth of Bioinformatics
• Dayhoff established:
– A protein sequence database
– The first collection of nucleotide sequence information
• NIH created GenBank
• Methods for alignment and comparison followed:
– FASTA and BLAST made it practical to identify genes
in a new sequence by comparing it with sequences
currently in the database.
Automated Sequencing Factories
• 1986: Caltech and ABI published the first report
of automated DNA sequencing; data could
be collected directly by a computer without
autoradiography of the sequencing gel.
– A differently labeled primer was used in each of the
four dideoxy sequencing reactions.
– The reactions were combined and electrophoresed in a single
polyacrylamide tube gel; as the bands passed a
detector, each of the four colors could be distinguished.
Automated Sequencing Factories
• NIH: Venter set up a sequencing facility with 6
automated sequencers and 2 Catalyst robots.
• 1992: Venter established The Institute for
Genomic Research (TIGR) to expand the sequencing
operation to 30 sequencers and 17 robots.
– First factory with teams dedicated to different steps in
the sequencing process.
– Data analysis added to each phase to quickly detect
and correct sequencing problems.
Cellular Genomes
• 1995: Craig Venter’s group (TIGR) reported the
complete genome sequences of two bacterial
species: Haemophilus influenzae and
Mycoplasma genitalium.
– H. influenzae gave the first glimpse of complete
information set for a living organism.
– M. genitalium sequence showed us an approximation
to the minimal set of genes required for cellular life.
• H. influenzae introduced the whole genome
shotgun (WGS) method for sequencing cellular
organisms.
– DNA randomly fragmented and cloned. Clones
sequenced at random and reassembled by computer.
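The "reassembled by computer" step can be illustrated with a toy greedy overlap merger. This is a deliberately simplified stand-in, not the assembler used for H. influenzae: it assumes error-free reads, a genome without repeats, and full coverage. The example genome, the read positions, and the function names are all made up.

```python
def overlap(a, b, min_length):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_length`.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            break  # nothing overlaps any more; leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]
    return contigs

genome = "ATGGCGTACCTAGTTCAA"   # made-up 18-base "genome"
# Four overlapping 8-base reads, listed out of order to mimic random sampling.
reads = [genome[start:start + 8] for start in (6, 0, 10, 3)]
contigs = greedy_assemble(reads)
```

For these reads the merger returns a single contig equal to the original sequence. Real projects must also cope with repeats, sequencing errors, and gaps in coverage, which this sketch ignores.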
Cellular Genomes
• Adoption of the ‘paired ends’ strategy is perhaps
the most important improvement to shotgun sequencing.
• The automated sequencing procedure used on H.
influenzae used denatured (melted) double-stranded DNA as
template, whereas the human cytomegalovirus (HCMV) project had to use
single-stranded vectors. With double-stranded
templates, one could sequence each clone from
both ends.
• TIGR assembler – designed to handle thousands
of sequence reads involved in even the smallest
cellular genome projects.
Cellular Genomes
• Advancements led to a steady stream of completed
genome sequences.
– E. coli
– Bacillus subtilis
– Saccharomyces cerevisiae
– Caenorhabditis elegans
– Drosophila melanogaster
– Eventually humans
• 1996: ABI introduced the first commercial DNA
sequencer that used capillary electrophoresis
rather than slab gel
• 1998: ABI Prism 3700 with 96 capillaries
Sequencing the Human Genome
• 1985: Robert Sinsheimer formally organized a
meeting on human genome sequencing at
University of California, Santa Cruz
• 1985: DeLisi and Smith commissioned first Santa
Fe conference funded by DOE to study the
feasibility of a Human Genome Initiative
• 1990: DOE and NIH presented a 5-year US Human Genome
Project plan to Congress; the project was estimated to take
15 years and cost about $3 billion
• The publicly funded effort became an
international collaboration between sequencing
centers in US, Europe, and Japan – each focusing
on a particular region of the genome.
Sequencing the Human Genome
• 1999: The Human Genome Project celebrated
passing the billion base-pair mark and the first
completely sequenced human chromosome,
chromosome 22.
• 2000: Clinton and Blair publicly announced draft
versions of the human genome sequence
• 2001: public draft human genome sequences
published in Science and Nature.
Next Generation Seq. Technology
• Newly-emerging methods are, for the first time,
challenging the supremacy of the dideoxy
method.
– Massively parallel – the number of sequence reads
from a single experiment is vastly greater
– Pyrosequencing: shotgun sequencing of whole
genomes without cloning in E.coli or any host cell.
– Solexa technology: uses chain-terminating nucleotides
whose chain termination is reversible.
– Nanopore Sequencing
Genomic Medicine
• All disease has a genetic basis, whether in genes
inherited by the affected individual,
environmentally induced genetic changes, or the
genes of a pathogen and their interaction with
the infected individual.
• Sequencing of the human genome and major
pathogens is beginning to have an impact
– Diagnosis, treatment, and prevention of disease
– Potential targets of drug therapy and vaccine
candidates
– Predicted era of personalized medicine
Metagenomics
• We lack a comprehensive view of the genetic
diversity on Earth because only a very small
fraction of microbes found in nature have been
grown in pure culture.
• Metagenomics focuses on isolating DNA directly
from environmental samples and sequencing it,
without attempting to culture the organisms from
which it comes.
• Metagenomics is currently being applied to study
microbial populations in many environments, such
as the human gut.
Looking to the Future
• The amount of nucleotide sequence in databases
increased exponentially, by nine orders of
magnitude, from 1965 to 2005. This corresponds to an
average doubling time of about 16 months.
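The doubling time follows from the two figures on the slide, assuming steady exponential growth over the whole period. A short check (the function name is hypothetical):

```python
import math

def doubling_time_months(fold_increase, years):
    """Average doubling time, assuming steady exponential growth."""
    doublings = math.log2(fold_increase)   # 10**9 is roughly 2**29.9
    return years * 12 / doublings

months = doubling_time_months(1e9, 2005 - 1965)
print(f"about {months:.0f} months per doubling")
```

Nine orders of magnitude is just under 30 doublings, and 480 months divided by about 30 doublings gives roughly 16 months each.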
Interpreting our growth
• Inflections in the curve correspond to technical
innovations, suggesting we are on the verge
of the next generation of massively parallel
sequencers.
• It appears possible that methods for collecting
sequence data could soon outstrip our capacity
to adequately analyze that data, making
fundamental advances in computation and
bioinformatics essential to our continued progress.
A bit of trivia:
• 1953: Double-helix structure:
– Watson was a 24-year-old postdoctoral fellow.
– Crick was still a grad student in his mid-30s.
• Brains are not enough; remember your courage!
– “Once you get your courage up and believe that you
can do important things, then you can” ~Hamming~
• Always Believe and Doubt your Hypothesis
– Believe enough to move forward, doubt enough to find
the errors