DNA Sequencing and Gene Analysis
Determining DNA Sequence
• Two methods were invented around 1976, but only one is still widely used:
the chain-termination method invented by Fred Sanger.
– After discussing Sanger sequencing, we will go over the newer
pyrosequencing method.
• Uses DNA polymerase to synthesize a second DNA strand that is
labeled. DNA polymerase always adds new bases to the 3’ end of
a primer that is base-paired to the template DNA.
• Also uses chain terminator nucleotides: dideoxy nucleotides
(ddNTPs), which lack the -OH group on the 3' carbon of the
deoxyribose. When DNA polymerase inserts one of these ddNTPs
into the growing DNA chain, the chain terminates, as nothing can
be added to its 3' end.
Sanger Sequencing Reaction
• The template DNA is usually single-stranded DNA, which can be produced from
plasmid cloning vectors that contain the origin of replication from a
single-stranded bacteriophage such as M13 or fd. The primer is complementary
to the region in the vector adjacent to the multiple cloning site.
• Sequencing is done by having 4 separate reactions, one for each DNA base.
• All 4 reactions contain the 4 normal dNTPs, but each reaction also contains
one of the ddNTPs.
• In each reaction, DNA polymerase starts creating the second strand beginning
at the primer.
• When DNA polymerase reaches a base for which some ddNTP is present, the chain
will either:
– terminate if a ddNTP is added, or
– continue if the corresponding dNTP is added.
– Which one happens is random, based on the ratio of dNTP to ddNTP in the tube.
• However, all the second strands in, say, the A tube will end at some A base:
you get a collection of DNAs that end at each of the A's in the region being
sequenced.
Electrophoresis
• The newly synthesized DNA from the 4 reactions is then run (in separate
lanes) on an electrophoresis gel.
• The DNA bands fall into a ladder-like sequence, spaced one base apart. The
actual sequence can be read from the bottom of the gel up.
• Automated sequencers use 4 different fluorescent dyes as tags attached to the
dideoxy nucleotides and run all 4 reactions in the same lane of the gel.
– Today’s sequencers use capillary electrophoresis instead of slab gels.
– Radioactive nucleotides (32P) are used for non-automated sequencing.
• Sequencing reactions usually produce about 500-1000 bp of good sequence.
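The four-tube logic and the bottom-up reading of the gel can be illustrated with a toy Python sketch. It assumes ideal conditions (every possible terminated fragment is present, no errors, perfect one-base resolution), and the function names `sanger_tubes` and `read_ladder` are made up for this example.

```python
def sanger_tubes(new_strand):
    """Fragment lengths expected in each ddNTP tube.

    new_strand is the strand synthesized from the primer (5' to 3').
    A fragment ends at every position where that tube's ddNTP could be added.
    """
    tubes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand.upper(), start=1):
        tubes[base].append(length)
    return tubes


def read_ladder(tubes):
    """Read the sequence from the shortest fragment (bottom of gel) to the longest."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


tubes = sanger_tubes("GATTACA")
print(tubes["A"])          # fragment lengths in the ddATP tube
print(read_ladder(tubes))  # sequence recovered from the ladder
```

Note that the ladder gives the sequence of the newly synthesized strand, which is the complement of the template.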
Pyrosequencing
• Recently a number of faster and cheaper sequencing methods have been
developed.
– The Archon X prize (2006): “the first Team that can build a device and use it to
sequence 100 human genomes within 10 days or less, with an accuracy of no
more than one error in every 100,000 bases sequenced, with sequences
accurately covering at least 98% of the genome, and at a recurring cost of no
more than $10,000 (US) per genome.”
– Probably the most widely used new methods involve the pyrosequencing
biochemical reactions (invented by Nyren and Ronaghi in 1996), with the
massively parallel microfluidics technology invented by the 454 Life Sciences
company. We can call this combined technology “454 sequencing”.
• Applications:
– sequencing of whole bacterial genomes in a single run
– sequencing genomes of individuals
– metagenomics: sequencing DNA extracted from environmental samples
– looking for rare variants in a single amplified region, in tumors or viral infections
– transcriptome sequencing: total cellular mRNA converted to cDNA.
Pyrosequencing Biochemistry
• In DNA synthesis, a dNTP is attached to the 3’ end of the growing DNA strand.
The two phosphates on the end are released as pyrophosphate (PPi).
• ATP sulfurylase uses PPi and adenosine 5’-phosphosulfate to make ATP.
– ATP sulfurylase is normally used in sulfur assimilation: it converts ATP and
sulfate to adenosine 5’-phosphosulfate and PPi. However, the reaction is
reversed in pyrosequencing.
• Luciferase is the enzyme that causes fireflies to glow. It uses luciferin and
ATP as substrates, converting luciferin to oxyluciferin and releasing visible
light.
– The amount of light released is proportional to the number of nucleotides
added to the new DNA strand.
• After the reaction has completed, apyrase is added to destroy any leftover
dNTPs.
More Pyrosequencing
• The four dNTPs are added one at a time, with apyrase
degradation and washing in between.
• The amount of light released is proportional to the
number of bases added. Thus, if the sequence has 2
A’s in a row, both get added and twice as much light is
released as would have happened with only 1 A.
• The pyrosequencing machine cycles between the 4
dNTPs many times, building up the complete sequence.
About 300 bp of sequence is possible (as compared to
800-1000 bp with Sanger sequencing).
• The light is detected with a charge-coupled device
(CCD) camera, similar to those used in astronomy.
• YouTube animation (with music!):
http://www.youtube.com/watch?v=kYAGFrbGl6E
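The cycling of dNTPs and the "twice as much light for 2 A's" rule can be sketched in Python. This is an idealized toy, not how any instrument's software works: the flow order `"TACG"`, the number of flows, and the function names are assumptions made for illustration, and real signals are noisy rather than whole numbers.

```python
from itertools import cycle


def simulate_flowgram(new_strand, flow_order="TACG", n_flows=12):
    """Ideal light signal per flow: the number of bases added in that flow."""
    signals = []
    pos = 0
    flows = cycle(flow_order)
    for _ in range(n_flows):
        base = next(flows)
        run = 0
        # A run of identical bases is extended completely within a single flow.
        while pos < len(new_strand) and new_strand[pos] == base:
            run += 1
            pos += 1
        signals.append(float(run))
    return signals


def decode_flowgram(signals, flow_order="TACG"):
    """Turn light intensities back into bases by rounding to whole numbers."""
    bases = []
    for base, signal in zip(cycle(flow_order), signals):
        bases.append(base * round(signal))
    return "".join(bases)


signals = simulate_flowgram("TGAAC")
print(signals)                   # the AA run shows up as a signal of 2.0
print(decode_flowgram(signals))  # recovers the synthesized strand
```

Because the run length has to be estimated from light intensity, long runs of the same base are where rounding is most likely to go wrong.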
454 Technology
• To start, the DNA is sheared into 300-800 bp fragments, and the ends are
“polished” by removing any unpaired bases at the ends.
• Adapters are added to each end. The DNA is made single stranded at this
point.
• One adapter contains biotin, which binds to a streptavidin-coated bead. The
ratio of beads to DNA molecules is controlled so that most beads get only a
single DNA attached to them.
• Oil is added to the beads and an emulsion is created. PCR is then performed,
with each aqueous droplet forming its own micro-reactor. Each bead ends up
coated with about a million identical copies of the original DNA.
More 454 Technology
• After the emulsion PCR has been performed, the oil is removed, and the beads
are put into a “picotiter” plate. Each well is just big enough to hold a
single bead.
• The pyrosequencing enzymes are attached to much smaller beads, which are then
added to each well.
• The plate is then repeatedly washed with each of the four dNTPs, plus other
necessary reagents, in a repeating cycle.
• The plate is coupled to a fiber optic chip. A CCD camera records the light
flashes from each well.
Illumina/Solexa Sequencing
• Another high throughput Next Generation Sequencing method.
• http://www.youtube.com/watch?v=77r5p8IBwJk&NR=1
• http://www.youtube.com/watch?v=HtuUFUnYB9Y
• First we will discuss the sequencing reaction itself, and then deal
with how it is done on many sequences simultaneously.
Illumina Sequencing Chemistry
• This method uses the basic Sanger idea of “sequencing by synthesis” of the
second strand of a DNA molecule. Starting with a primer, new bases are added
one at a time, with fluorescent tags used to determine which base was added.
• The fluorescent tags block the 3’-OH of the new nucleotide, and so the next
base can only be added when the tag is removed.
– So, unlike pyrosequencing, you never have to worry about how many adjacent
bases of the same type are present.
• The cycle is repeated 50-100 times.
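Since exactly one tagged base is added per cycle, reading the sequence amounts to asking which dye was seen in each cycle. A minimal Python sketch of that idea follows; the intensity values are made up and the function name `call_bases` is hypothetical. Real base callers do considerably more than pick the brightest channel, and this sketch ignores all of that.

```python
def call_bases(cycles):
    """Pick the brightest of the four dye channels in each cycle."""
    return "".join(max(channels, key=channels.get) for channels in cycles)


# One dict of (made-up) channel intensities per sequencing cycle.
cycles = [
    {"A": 0.91, "C": 0.03, "G": 0.04, "T": 0.02},
    {"A": 0.05, "C": 0.02, "G": 0.88, "T": 0.05},
    {"A": 0.02, "C": 0.04, "G": 0.90, "T": 0.04},
]
print(call_bases(cycles))
```

Note that the two G's in a row are simply two separate cycles, which is why homopolymer runs are not a special case here.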
Illumina Massively Parallel System
• The idea is to put 2 different
adapters on each end of the
DNA, then bind it to a slide
coated with the
complementary sequences
for each primer. This allows
“bridge PCR”, producing a
small spot of amplified DNA
on the slide.
• The slide contains millions of
individual DNA spots. The
spots are visualized during
the sequencing run, using
the fluorescence of the
nucleotide being added.
Sequence Assembly
• DNA is sequenced in very small fragments: at most, 1000 bp. Compare this to
the size of the human genome: 3,000,000,000 bp. How to get the complete
sequence?
• In the early days (1980’s), genome sequencing was done by chromosome walking
(aka primer walking): sequence a region, then make primers from the ends to
extend the sequence. Repeat until the target gene was reached.
– The cystic fibrosis gene was identified by walking about 500 kbp from a
closely linked genetic marker, a process that took a long time and was very
expensive.
– Still useful for fairly short DNA molecules, say 1-10 kbp.
Shotgun Sequencing
• Shotgun sequencing is what is typically done today: DNA is fragmented
randomly and enough fragments are sequenced so each base is read 10 times or
more on average. The overlapping fragments (“reads”) are then assembled into
a complete sequence.
• For large genomes, hierarchical shotgun sequencing is a useful technique:
first break up the genome into an ordered set of cloned fragments
(scaffolds), usually BAC clones. Each BAC is shotgun sequenced separately.
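To get a feel for what "each base is read 10 times on average" means in practice, here is a back-of-the-envelope Python sketch. The read count follows directly from the definition of average coverage. The estimate of the unsequenced fraction is an added assumption not made in these slides: it treats reads as landing uniformly at random, so that the number of reads covering a base is roughly Poisson distributed.

```python
import math


def reads_needed(genome_bp, read_bp, coverage):
    """Number of reads so that each base is read `coverage` times on average."""
    return math.ceil(coverage * genome_bp / read_bp)


def fraction_unsequenced(coverage):
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)


n = reads_needed(genome_bp=3_000_000_000, read_bp=1000, coverage=10)
print(n)                         # 30 million reads of 1000 bp
print(fraction_unsequenced(10))  # roughly 4.5e-05
```

Even at 10-fold coverage the random model leaves a small fraction of bases unread, and real genomes do worse than the model because some regions are hard to clone or sequence.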
Assembly Problems
• In principle, assembling a sequence is just a matter of finding overlaps and
combining them.
• In practice:
– most genomes contain multiple copies of many sequences,
– there are random mutations (either naturally occurring cell-to-cell variation or
generated by PCR or cloning),
– there are sequencing errors and misreadings,
– sometimes the cloning vector itself is sequenced,
– sometimes miscellaneous junk DNA gets sequenced.
• Getting rid of vector sequences is easy once you recognize the problem: just
check for them.
• Repeat sequence DNA is very common in eukaryotes, and sequencing highly
repeated regions (such as centromeres) remains difficult even now. High
quality sequencing helps a lot: small variants can be reliably identified.
• Sequencing errors, bad data, random mutations, etc. were originally dealt
with by hand alignment and human judgment. However, this became impractical
when dealing with the Human Genome Project.
• This led to the development of automated methods. The most useful were the
phred/phrap programs developed by Phil Green and collaborators at Washington
University in St. Louis.
Phred Quality Scores
• Phred is a program that assigns a quality score to each base in a sequence.
These scores can then be used to trim bad data from the ends, and to
determine how good an overlap actually is.
– There are much improved algorithms now, but the phred quality score is still
widely used.
• Phred scores are logarithmically related to the probability of an error: a
score of 10 means a 10% error probability, 20 means a 1% chance, 30
means a 0.1% chance, etc.
– Q = -10 log10(P), where Q is the phred score and P is the probability that the
base was called incorrectly.
– A score of 20 is generally considered the minimum acceptable score.
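The score-to-probability relationship is easy to check numerically. The Python sketch below converts in both directions and shows one simple way scores could be used to trim bad ends. The trimming rule (drop bases from each end until one reaches the threshold) is a simplification chosen for illustration, not the method phred or phrap actually uses, and all function names are hypothetical.

```python
import math


def error_probability(q):
    """Probability that a base call with phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def trim_ends(bases, scores, min_q=20):
    """Drop bases from both ends until a score of at least min_q is reached."""
    start, end = 0, len(bases)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return bases[start:end]


print(error_probability(30))  # about 0.001, i.e. a 0.1% chance of error
print(phred_score(0.01))      # about 20
print(trim_ends("ACGTTGCA", [8, 12, 25, 40, 38, 30, 15, 9]))
```

The default threshold of 20 is the minimum acceptable score mentioned above.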
Phred Algorithm
• Phred uses Fourier analysis (decomposing the data into a series of sine
waves) to examine chromatogram trace data.
– First, find the expected position of each peak, assuming they are supposed to
be evenly spaced, with no compressions or other factors altering peak
positions.
– Next, find the actual center and area of each peak.
– Finally, match observed peaks to expected. This involves splitting some
peaks and ignoring others.
• Assigning a quality score involves comparing various parameters determined
for each peak with data that was generated from known sequences run under a
wide variety of conditions (i.e., based on ad hoc observations and not
theory).
• Since all four traces (A, C, G, T) are examined separately, phred generates
a best-guess sequence. Phred output is a file where each line contains one
base and its quality score.
Combining Sequences with Phrap
• Phrap first examines all reads for matching “words”: short sections of
identical sequence. The matching words need to be in the same order and
spacing.
– Sequences in both orientations are examined, using the reverse-complement
sequence if necessary.
• The entire sequences of pairs with matching words are then aligned using the
Smith-Waterman algorithm (a standard technique we will look at later).
• Phrap then looks for discrepancies in the combined sequences, using phred
scores to decide between alternatives. Phrap generates quality scores from
the combined phred data.
– Sequencing errors are not necessarily random: homopolymeric regions (several
of the same base in a row) are notoriously tricky to sequence accurately. Using
the opposite strand often helps resolve these regions. Using a different
sequencing technology or chemistry can also help.
• Sequences are combined with a greedy algorithm: all pairs of fragments are
scored for the length and quality of their overlap region, and then the
largest and best-matched pair is merged. This process is repeated until some
minimum score is reached.
• The result is a set of contigs: reads assembled into a continuous DNA
sequence.
– The ideal result is the entire chromosome assembled into a single contig.
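The greedy merging step can be illustrated with a deliberately stripped-down Python sketch. This is not phrap: it scores overlaps by exact-match length only, ignores quality scores and reverse complements, and skips the word search and Smith-Waterman alignment entirely. The function names and the `min_len` cutoff are assumptions for the example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the longest overlap; return the contigs."""
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no remaining overlap meets the minimum score
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]
print(greedy_assemble(reads))  # a single contig if everything overlaps
```

With repeated sequences in the reads, a greedy merge like this can join the wrong pair, which is one reason repeats are listed among the assembly problems.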
Finishing the Sequence
• Shotgun sequencing of random DNA fragments necessarily misses some
regions altogether.
– Also, for sequencing methods that involve cloning (Sanger), certain regions are
impossible to clone: they kill the host bacteria.
• Thus it is necessary to close gaps between contigs, and to re-sequence
areas with low quality scores. This process is called finishing. It can take
up to 1/2 of all the effort involved in a genome sequencing project.
• Mostly hand work: identify the bad areas and sequence them by primer
walking.
– Sometimes using alternative sequencing chemistries (enzymes, dyes,
terminators, dNTPs) can resolve a problematic region.
• Once a sequence is completed, it is usually analyzed by finding the
genes and other features on it: annotation.
• Submission of the annotated sequence to GenBank allows
everyone access to it: the final step in the scientific method.
Single Nucleotide Polymorphisms
• Looking at many individuals, you can see that most bases in their DNA are the
same in everyone. However, some bases are different in different individuals.
These changes are single nucleotide polymorphisms (SNPs).
• SNPs are found everywhere in the genome, and they are inherited in a regular
Mendelian fashion. These characteristics make them good markers for finding
disease genes and determining their inheritance.
• There are lots of ways to detect SNPs, many of which are easy to automate.
• Primer extension: make a primer 1 base short of the SNP site, and then extend
the primer using DNA polymerase with nucleotides having different fluorescent
tags.
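The definition of a SNP (a position where individuals differ) can be shown with a few lines of Python. This sketch compares already-sequenced DNA rather than modeling primer extension, and it assumes the sequences are the same length, already aligned, free of insertions and deletions, and free of sequencing errors. The sequences and the function name are made up.

```python
def find_snps(aligned_seqs):
    """Return (position, alleles) for columns where the sequences differ."""
    snps = []
    for pos, column in enumerate(zip(*aligned_seqs)):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            snps.append((pos, alleles))
    return snps


# The same short region from four (made-up) individuals.
individuals = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ACATTAGC",
    "ACGTTCGC",
]
print(find_snps(individuals))  # positions are 0-based
```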
Gene Detection
• It is surprisingly hard to be sure that a given genomic
sequence is a gene: that it is ever expressed as RNA.
• Protein-coding regions are open reading frames (ORFs):
they don’t contain stop codons.
• But, human genes often contain long introns and very
short exons, and some parts of genes are introns in one
cell type but exons in other cell types. So, finding all the
pieces of a gene can be a challenge.
• Three questions:
– is a given DNA sequence ever expressed?
– is the sequence expressed in a given cell type or set of
conditions?
– what is the intron/exon structure of the sequence?
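The idea that protein-coding regions are stretches without stop codons can be sketched in Python. This toy scans only the three forward reading frames of a DNA sequence, uses the three standard stop codons, and does not require a start codon; the `min_codons` cutoff and the function name are assumptions for the example. It knows nothing about introns, which is exactly why ORF scanning alone is not enough for human genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_free_stretches(seq, min_codons=3):
    """Return (frame, start, end) for stop-free stretches in the 3 forward frames."""
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    found.append((frame, start, pos))
                start = pos + 3  # next stretch begins after the stop codon
        # Stretch running to the last complete codon in this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - start) // 3 >= min_codons:
            found.append((frame, start, end))
    return found


for frame, start, end in stop_free_stretches("ATGAAACCCGGGTAAATG"):
    print(frame, start, end)
```

Short stop-free stretches turn up in every frame by chance, so in practice only long ones are taken as evidence of a gene.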
Evolutionarily Conserved Sequences
• When looking across different species, most DNA sequences are not conserved.
• However, the exons of genes are often highly conserved, because their
function is necessary for life.
• Zoo blot: a Southern blot containing genomic DNA from many species. Probe it
with the sequence in question: exons will hybridize with other species’ DNA,
while introns and non-gene DNA won’t.
• Computer-based homology search: BLAST search. Do similar sequences appear in
the nucleotide databases? Especially chimpanzee and mouse, which have
complete genome sequences available.
Detecting Gene Expression
• Northern blots: RNA
extracted from
various tissues or
experimental
conditions, run on an
electrophoresis gel,
then probed with a
specific DNA
sequence.
Detecting Gene Expression
• Real time PCR:
– first convert all mRNA in
a sample to cDNA using
reverse transcriptase,
– then amplify the region of
interest using specific
primers.
– Use a fluorescent probe
to detect and quantitate
the specific product as it
is being made by the
PCR reaction.
– the two components of
the fluorescent tag
interact to quench each
other. When one part is
removed by the Taq
polymerase, the
quenching stops and
fluorescence can be
detected.
Expressed Sequence Tags
• ESTs are cDNA clones that have had a single round of sequencing
done from one end.
• First extract mRNA from a given tissue. Then convert it to cDNA and
clone.
• Sequence thousands of EST clones and save the results in a
database.
• A search can then show whether your sequence was expressed in
that tissue.
– quantitation issues: some mRNAs are present in much higher
concentration than others. Many EST libraries are “normalized” by
removing duplicate sequences.
• Also can get data on transcription start sites and exon/intron
boundaries by comparing to genomic DNA
– but sometimes need to obtain the clone and sequence the rest of it
yourself.
RNA Seq
• This is a new method, published in 2008. It is probably the method of choice
today for analyzing RNA content. Also called whole transcriptome shotgun
sequencing.
• Very simple: isolate messenger RNA, break it into 200-300 base fragments,
reverse transcribe, then perform large scale sequencing using 454, Illumina,
or another massively parallel sequencing technology.
– RNA sequences are then compared to genomic sequences to find which gene is
expressed and also exon boundaries.
– Exon boundaries are a problem with very short reads: you might only have a
few bases of overlap to one of the exons.
• As with all RNA methods, which RNAs are present depends on the tissue
analyzed and external conditions like environmental stress or disease state.
• Get info on copy number over a much wider range than microarrays. Also
detects SNPs.
• New techniques in DNA/RNA technology
are being developed constantly. The main
goal is to increase reliability and decrease
cost. Primarily the aim is to automate as
much as possible.
• Just a few techniques we are not going to
discuss: RACE, SAGE, differential
display, S1 nuclease protection
Protein Methods
• It is important to be sure that the protein product of a gene is made, and to
know where in the tissue or cell it is made, and how much is made.
• Most protein detection is based on either making antibodies to the protein of
interest, or by making a fusion protein: your protein fused to a fluorescent
protein.
– GFP: green fluorescent protein. Isolated from jellyfish. Several variants
give different colors. It still works when it is fused to other proteins.
• Often done in conjunction with confocal microscopy: examining the same image
with visible light and fluorescence.
Two-Dimensional Gel Electrophoresis
• 2D gels are a way of separating proteins into individual spots that can be
individually analyzed.
– Proteins are first separated by their isoelectric point and then by their
molecular weight.
– Individual protein spots can then be identified using mass spectrometry.
• Isoelectric focusing separates proteins by their isoelectric point, the pH at
which the net surface charge is zero.
– IEF uses a mixture of ampholytes, chemical compounds that contain both acidic
and basic groups.
– When an electric field is applied, the ampholytes move to a position in the
gel where their net charge is zero, and in the process they set up a pH
gradient.
– Proteins also move down the pH gradient until they reach a pH where they have
no net charge, their isoelectric point.
– Isoelectric focusing is thus an equilibrium process: proteins move to a
stable position and stay there. (But in practical terms, the gradient breaks
down over time).
SDS-PAGE
• SDS-PAGE is a method for separating proteins according to their molecular
weight.
– SDS = sodium dodecyl sulfate (a.k.a. sodium lauryl sulfate), a detergent that
unfolds proteins and coats them in charged molecules so that their charge to
mass ratio is essentially identical.
• “Native” gel electrophoresis uses undenatured proteins, which vary greatly
in charge to mass ratio.
– SDS denaturation isn’t perfect: some proteins behave anomalously.
– PAGE = polyacrylamide gel electrophoresis
2D gels
• First, isoelectric focusing is performed on a protein sample, running the
proteins through a narrow tube or strip of acrylamide.
• Then the IEF gel is placed on top of an SDS gel, allowing the proteins to be
separated by their molecular weight at right angles to the isoelectric point
separation.
• Then the gel is stained with a general protein stain such as Coomassie Blue
or silver stain.
– A Western blot involves transferring the separated proteins onto a membrane,
where specific proteins can be identified by antibody binding.
• A couple of issues:
– While a cell might contain up to 100,000 proteins, at best only 3000 spots
can be resolved.
– Proteins expressed at a low level (such as regulatory proteins) don’t show up
well: spot size is proportional to the amount present in the cell.
– Special techniques are needed for membrane proteins, which aren’t easily
solubilized by the usual techniques.
– Comparing spots between 2D gels requires image analysis software (and
well-standardized running conditions).
Mass Spectrometry
• The general principle of mass spectrometry is that if you ionize a group of
atoms or molecules, you can separate them on the basis of charge to mass
ratio, by accelerating them in an electric field in a vacuum.
– The original mass spectrometers were used to separate isotopes, based on
slightly different masses.
• During the ionization process, proteins tend to break up in characteristic
ways, producing “molecular ions” whose molecular weights can be measured
very precisely.
• Since you are generally working with an already sequenced genome, you can
predict the size of fragments that will be generated by any gene. Thus you
can identify the gene product by matching the actual fragments with the list
of predicted fragments.
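Predicting fragments from a known sequence can be sketched in Python. The example below assumes a trypsin-like rule (cut after every Lys or Arg), ignores any exceptions to that rule and missed cleavages, and reports peptide lengths only; predicting masses would additionally need a table of residue masses, which is not included here. The protein sequence and function name are made up.

```python
def tryptic_peptides(protein):
    """Split a protein after every Lys (K) or Arg (R)."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)  # C-terminal peptide with no K or R at its end
    return peptides


predicted = tryptic_peptides("MKTAYIAKQRQISFVK")
print(predicted)
print([len(p) for p in predicted])
```

Matching a gene product then comes down to comparing the observed fragment list against lists like this one, predicted for every gene in the genome.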
More Mass Spectrometry
• For most protein work, the proteins are first digested into small fragments
(say, 5-10 amino acids), separated by HPLC (high performance liquid
chromatography), and then run individually through the mass spec.
– Protein sequencing and older protein identification methods also start with
proteolytic digestion.
– Endopeptidases that digest at known sites are used, such as trypsin (cleaves
after Lys or Arg) and chymotrypsin (cleaves after Phe, Trp, or Tyr).
• Ionizing the peptide needs to be done rather gently. One common technique is
MALDI (matrix-assisted laser desorption/ionization). The proteins are mixed
with the matrix molecules, which efficiently absorb the UV laser energy and
encourage ionization of the proteins. When irradiated with the laser, the
matrix molecules vaporize along with the protein, but their small size makes
them easy to detect and ignore.
• Time-of-flight mass spectrometry is generally used (so the whole thing is
MALDI-TOF). The molecular ions are accelerated in an electric field, and the
time it takes them to cross a chamber of known length depends on their mass
(actually, their mass to charge ratio: flight time increases with its square
root). This technique works well for the wide range of sizes seen with
peptides.
• Sample comparisons can be done by labeling one sample with a heavy, stable
isotope such as 13C or 15N. The samples are mixed before 2D electrophoresis
and they co-migrate on the gel. However, mass spec can easily resolve them.
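The time-of-flight idea can be made concrete with a little physics in Python. An ion of charge z accelerated through a voltage V gains kinetic energy zeV, so its speed is sqrt(2zeV/m) and its flight time over a drift length L is L divided by that speed. The accelerating voltage and tube length below are made-up illustrative values, not the settings of any particular instrument, and the sketch ignores everything that happens before the ion enters the drift tube.

```python
import math

DALTON_KG = 1.66053906660e-27        # kilograms per dalton
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs


def flight_time(mass_da, charge=1, voltage=20_000.0, tube_m=1.0):
    """Seconds for an ion to drift tube_m meters after acceleration through voltage."""
    mass_kg = mass_da * DALTON_KG
    speed = math.sqrt(2 * charge * ELEMENTARY_CHARGE * voltage / mass_kg)
    return tube_m / speed


for mass in (1000, 1010, 4000):
    print(mass, flight_time(mass))
```

Flight time grows with the square root of the mass to charge ratio, so a peptide four times heavier takes twice as long, and a doubly charged ion arrives sooner than a singly charged ion of the same mass.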
Antibodies
• If you inject rabbits (usually) with your protein, the rabbit will develop an
immune response against it. This means that a set of immunoglobulin (Ig)
molecules that specifically bind to your protein will be produced. Mostly
IgG, the main immunoglobulin that circulates in the blood.
• Your protein is acting as an antigen. Each Ig molecule binds to a specific
epitope on the protein.
• The blood serum from the rabbit contains polyclonal antibodies: several
different Ig proteins that bind to different epitopes on your protein.
• Monoclonal antibodies can be made in mice. Each monoclonal antibody is a
single IgG molecule, and so it will bind to a single epitope on your protein.
• Monoclonal antibodies are made by fusing individual Ig-producing spleen cells
to myeloma cells, which creates an immortal Ig-producing cell line, called a
hybridoma.
Western Blots
• E. M. Southern, inventor of the Southern blot, is a real person.
However, his cool name has been hijacked to name the
Northern blot (i.e. running RNA instead of DNA on the gel), and
the Western blot (running protein on the gel).
– The Western blot is also called a “protein immunoblot”.
• Proteins are separated by gel electrophoresis.
– Can be done with denatured proteins (SDS gel), which separates proteins by
molecular weight, or under nondenaturing conditions (no SDS), which separates
proteins by their surface charge and size.
• The proteins on the gel are then blotted onto a nitrocellulose membrane.
• The specific protein of interest is detected using antibodies.
– The first antibody is made in a rabbit (usually) by injecting the specific
protein of interest. It binds to the protein on the blot.
– The presence of the first antibody is detected by using a labeled second
antibody that has been conjugated (covalently bound) to a fluorescent tag or
to an easily detected enzyme.
– The second antibody is made by injecting rabbit immunoglobulins into a sheep.