Sequencing the Human Genome

The Structure of
Proteins and DNA
Pauling 1951
Crick & Watson 1953
The History of Genome Mapping
1955: Fred Sanger produces first amino-acid sequencing
of a protein (insulin)
1956: Tjio, Levan determine the number of human chromosomes.
1961: Brenner, Jacob, Meselson discover role of mRNA
in making proteins.
1963: Wu and Kaiser map first DNA sequence (12 basepairs).
1966: Nirenberg, Khorana, Ochoa map codon-amino acid
connections.
1968: Meselson, Smith, Wilcox, Kelley discover the use
of restriction enzymes.
1975: Gilbert, Maxam develop the “clone-contig” method
of producing DNA sequences.
1977: Sanger improves on process by the use of dideoxynucleotides and DNA polymerase.
1978: Sanger maps entire genome of ΦX174 virus (5386
base-pairs).
1980: Sanger maps the human mitochondrial genome (16,569 base-pairs).
How Do We Sequence a Protein?
Example chain: M (methionine), C (cysteine), C (cysteine), G (glycine), A (alanine), P (proline), G (glycine), P (proline), T (threonine)
Try Breaking It Up
Amino acid counts: 1 A, 2 C's, 2 G's, 1 M, 2 P's, 1 T
[Figure panels: "The Protein", "Protein Broken Apart", "Another Breakdown"]
Sanger’s Technique
for Protein Mapping
1. Attach a molecule to the end of the protein
to make it glow yellow.
2. Use a mild detergent to break the protein
apart into random pieces.
3. Put the mixture into a separator material
(gel) and let the pieces sink.
4. Smaller molecules sink faster, so the molecules
separate by length.
5. Find the glowing molecules at each level,
and analyze their amino acid content.
6. Put the protein back together amino acid
by amino acid. Problem: There was a limit
to the number of amino acids in a row that
could be separated.
7. Solution: Use digestive enzymes to break
the protein chain at specific places, then
analyze each piece as above.
The Sanger Method
[Figure panels: "The Protein With Molecule Attached", "Protein Broken Apart", "Another Breakdown"]
Mapping DNA
Some key players in DNA mapping:
restriction enzymes: Cut DNA at specific points,
depending upon the sequence at that point.
DNA polymerase: Replicates complementary
DNA (cDNA) strands from a single strand
of DNA.
primer: Short sequence of single-strand DNA
that can start the DNA polymerase off at
some point in the main DNA strand.
dideoxynucleotides: Artificial A, C, G, T molecules
that serve two functions: first, to tag
the growing DNA with a colored dye (a
different one for each letter), and second,
to cause the DNA polymerase to
stop building the DNA at that point.
The Basic Method of Sequencing a Small
(700-900 bp) Strand of DNA
The Sanger Sequencing Method:
1. Put many copies of the DNA strand, together with lots of primers, DNA polymerase, regular nucleotides and a smaller
number of special dideoxynucleotides, into
a warm broth. Shake well.
2. The DNA polymerase starts a copying reaction on the strands of DNA, constructing a strand of cDNA by grabbing either a nucleotide or a dideoxynucleotide
from the broth. The sequencing process
starts.
3. As long as normal nucleotides are attached,
the cDNA continues to grow. When one
of the dideoxynucleotides is attached,
however, the process stops, leaving a strand
of cDNA beginning with a specific starting sequence and ending with a single
dideoxynucleotide.
The Sequence Reading Process:
4. After letting this process go a specific
amount of time, place the mixture into
a gel.
5. Draw the mixture through the gel, letting
each piece sink to the right level based on
its size.
6. Read the dye colors at the various levels,
to get the base at the end of the fragment of each length.
7. Put the whole thing together to get the
sequence of the DNA strand.
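The reading steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it works on the newly built strand directly, pretends that exactly one terminated copy of every length ends up in the mixture, and reads the last letter of each fragment instead of a dye color. The strand GATTACA and the function names are made up for the example.

```python
import random

def terminated_fragments(strand):
    """One chain-terminated copy per stopping point: every prefix of the strand."""
    return [strand[:n] for n in range(1, len(strand) + 1)]

def read_gel(fragments):
    """Sort fragments by length (the gel), then read each fragment's last base (the dye)."""
    return "".join(f[-1] for f in sorted(fragments, key=len))

strand = "GATTACA"            # hypothetical strand, chosen only for illustration
mixture = terminated_fragments(strand)
random.shuffle(mixture)       # the mixture holds the fragments in no particular order
recovered = read_gel(mixture)
print(recovered)              # GATTACA
```

Sorting by length is what the gel does physically, so the order in which the fragments were produced does not matter.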
Getting Manageable Pieces of DNA to
Sequence
1. Break each chromosome apart at known locations into pieces of about 150,000
bps each, and carry each piece in a bacterial artificial chromosome (BAC).
2. “Shock” these into E. coli bacteria, and let them replicate the BACs to
any desired degree.
3. Take each BAC and cut it into manageable
pieces, using restriction enzymes.
4. Clone (artificially replicate) these pieces, so
as to have enough to work with. This is
known as PCR, or polymerase chain reaction.
5. Put the pieces into a bath that unwinds and
separates them into single strands.
6. Perform the Sanger sequencing process to
obtain the sequence of each piece of DNA.
7. “Put these pieces back together” to form
the entire DNA sequence.
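Step 3 above, cutting with a restriction enzyme, can be sketched as a string operation. The sketch assumes the enzyme EcoRI, which recognizes GAATTC and cuts between the G and the first A; it looks at one strand only and ignores the complementary strand. The input sequence and the function name `digest` are invented for illustration.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut dna at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])   # the piece after the last cut
    return fragments

pieces = digest("AAGAATTCCCGAATTCTT")   # hypothetical sequence with two EcoRI sites
print(pieces)                            # ['AAG', 'AATTCCCG', 'AATTCTT']
```

Because the cut points depend only on the sequence, every copy of the same BAC is cut into the same pieces.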
Shotgun Sequencing
A major computational tool in all large genome
sequencing projects is the shotgun technique
of sequencing. Instead of always sequencing a
genome from known locations (a difficult and
time-consuming job), you sequence from many
different locations, and try to put the sequence
back together.
Steps of the Shotgun Technique
1. Break the DNA you are trying to sequence
into arbitrary smaller fragments of 700–900
base pairs, so that there is a large
amount of overlapping among the fragments.
2. Sequence each fragment by using the Sanger (dideoxynucleotide tagging) technique given above.
3. Take pairs of fragments, and match up the
overlapping right- and left-hand ends letter
by letter to grow longer and longer multifragment subsequences that are consistent
with all of the contained fragments.
4. If the overlapping of the fragments is sufficiently large, then there will be a unique sequence of the correct size that is “strongly”
consistent with the set of smaller fragments.
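Step 3 can be sketched as a greedy merge: repeatedly join the two pieces with the longest end-to-end overlap. This is a minimal sketch, not the assembler used by either genome project. It assumes error-free reads that all come from the same strand, and the three short fragments below are invented for illustration (real fragments are 700–900 base pairs).

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Merge the best-overlapping pair until no overlaps remain."""
    unique = list(dict.fromkeys(fragments))            # drop exact duplicates
    contigs = [f for f in unique
               if not any(f != g and f in g for g in unique)]  # drop contained pieces
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break                                      # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

result = greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"])
print(result)   # ['ATGCGTACGTTAGC']
```

Greedy merging can go wrong when the DNA contains repeats, which is one reason a large amount of overlap among the fragments is needed.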
Coverage
The key to obtaining a unique DNA sequence
from a set of DNA fragments is to ensure a
sufficient amount of coverage of the DNA you
are trying to sequence by the fragments.
k-fold coverage: Ensures that at least k of your
fragments cover each base pair of the DNA
sequence.
Mapping the human genome requires a coverage of between 5- and 10-fold to ensure reasonable accuracy.
Example
Suppose you had the following set of 8 fragments:
ATCG CCA CCAT CCCC
CGC CGCC GCC TCG
And you wish to find a sequence with 2-fold coverage.
The unique (10-base) sequence that has 2-fold coverage
is
CGCCCCATCG
----------
  CCCC TCG
    CCA
CGCC  ATCG
 GCC
CGC CCAT
2343333322
Note that with two additional CG fragments, we could
obtain 3-fold coverage of the same sequence:
CGCCCCATCG
----------
  CCCC TCG
    CCA
CGCC  ATCG
 GCC
CGC CCAT
CG
        CG
3443333333
On the other hand, if the original set of fragments consisted of two CC fragments instead of one CCCC fragment,
we could also obtain a sequence having 3-fold coverage.
How?
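The coverage counts in the example can be checked mechanically. The sketch below takes each fragment together with the position where it is placed in the diagram, counts how many fragments cover each base, and reports the smallest count as the fold of coverage. The function name and the (offset, fragment) layout are choices made for this illustration.

```python
def coverage(sequence, placements):
    """Count, for each base of sequence, how many placed fragments cover it."""
    counts = [0] * len(sequence)
    for offset, fragment in placements:
        if sequence[offset:offset + len(fragment)] != fragment:
            raise ValueError(f"{fragment} does not match at offset {offset}")
        for i in range(offset, offset + len(fragment)):
            counts[i] += 1
    return counts

SEQ = "CGCCCCATCG"
PLACED = [(2, "CCCC"), (7, "TCG"), (4, "CCA"), (0, "CGCC"),
          (6, "ATCG"), (1, "GCC"), (0, "CGC"), (4, "CCAT")]

two_fold = coverage(SEQ, PLACED)
three_fold = coverage(SEQ, PLACED + [(0, "CG"), (8, "CG")])   # two extra CG fragments

print("".join(map(str, two_fold)), min(two_fold))       # 2343333322 2
print("".join(map(str, three_fold)), min(three_fold))   # 3443333333 3
```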
The History of the Human
Genome Project
1984: Department of Energy needs information on genetic defects caused by chemical agents.
The International Commission for Protection Against Environmental Mutagens and
Carcinogens suggests that a map of the
human genome would be important in this
endeavor.
1986: Renato Dulbecco, in an editorial in Science, suggests a national effort at reconstructing the human genome. DOE sets
up Santa Fe Workshop to pursue the issue. National Academy of Sciences sets up
a blue-ribbon panel to discuss the project.
National Institutes of Health belatedly starts
discussions.
1988: NAS report appears, stressing multidisciplinary participation of labs across the
country. The House Energy and Commerce Committee decides that the government should fund such an effort.
1990: Joint public effort launched, at an estimated cost of $3 billion, by the International Human Genome Mapping Consortium, jointly administered by NIH and
DOE and involving 20 labs and hundreds
of scientists.
1998: Celera Genomics, under the direction
of Craig Venter, becomes the first private
company to enter the race. It worked almost independently of the HGP.
February, 2001: IHGMC and Celera announce
jointly in Nature and Science, respectively, the draft map of the human genome.
This consisted of 94% of the genome,
26,000 reported genes with 30,000-40,000
total genes suspected.
A Comparison of Techniques
Organization
HGP: Public, 20 laboratories and many
hundreds of people.
Celera: Private, 1 laboratory and about
65 people (and 40 high-speed computers).
Technique:
HGP: Clone contig. Separate genome
into clone libraries with known locations,
and shotgun sequence each library element. Better control of gene locations,
but significant startup time to obtain the
associated chromosomal maps.
Celera: Whole-genome shotgun. Sequence entire chromosomes by shotgun
method. More computer intensive, and
also needs more coverage.
Source of the genome
HGP: 5 donors chosen from hundreds of
candidates.
Celera: 21 donors.
Both groups were anonymous and chosen
from varied ethnic groups.
Time frame:
HGP: 1990–2000, but actual mapping done
between 1999 and 2000.
Celera: 1998–2000.
Publication:
HGP: Nature. In addition, newly sequenced
sections were made public on the web within
24 hours of sequencing.
Celera: Science. Celera’s intention is to
sell or patent further information about
the human genome.
Computer time: Celera reported 30,000 CPU
hours for assembly of fragments into a
single genome.
Number of genes: 25,000–35,000 for both studies, accounting for only about 3% of the
entire genome sequence.
Coverage: 90-94% of the genes mapped in
both studies (and 25% of the entire genome).
Comparison of results: Hard to judge, since
presentation of the two studies is different. Preliminary studies indicate at least
a 99% match between the two sequences.
Current Accomplishments
“Complete” sequencing of the HG: 99% of
the euchromatic (gene-containing) portion of the HG has been sequenced with
99.99% accuracy, and with no gaps in this
region greater than 150,000 bps. Current
estimate of number of genes: 20,000–25,000.
All chromosomes have been completely sequenced: The last chromosome (#1) was
sequenced in May 2006.
Other genomes sequenced: 180 different species
have been sequenced, including lots of
bacteria, E. coli, brewer’s yeast, roundworm,
fruit fly, mosquito, mouse, rat, dog (at
NC State), chimpanzee, orangutan, elephant, cat, chicken, and many others.
Cost (human genome):
2003: $3 billion
2012: $1700