Download Sequencing a genome - Information Services and Technology

Sequencing a genome
Definition
• Determining the identity and order of
nucleotides in the genetic material (usually
DNA, sometimes RNA) of an organism
Basic problem
• Genomes are large (typically millions or
billions of base pairs)
• Current technology can only reliably ‘read’
a short stretch – typically hundreds of base
pairs
Elements of a solution
• Automation – over the past decade, the
amount of hand-labor in the ‘reads’ has been
steadily and dramatically reduced
• Assembly of the reads into sequences is an
algorithmic and computational problem
A human drama
• There are competing methods of assembly
• The competing – public and private –
sequencing teams used competing assembly
methods
Assembly
• Putting sequenced fragments of DNA
into their correct chromosomal
positions
BAC
• Bacterial artificial chromosome:
bacterial DNA spliced with a medium-sized
fragment of a genome (100 to
300 kb) to be amplified in bacteria and
sequenced
Contig
• Contiguous sequence of DNA created
by assembling overlapping sequenced
fragments of a chromosome (whether
natural or artificial, as in BACs)
Cosmid
• DNA from a bacterial virus spliced
with a small fragment of a genome (45
kb or less) to be amplified and
sequenced
Directed sequencing
• Successively sequencing DNA from
adjacent stretches of chromosome
Draft sequence
• Sequence with lower accuracy than a
finished sequence; some segments are
missing or in the wrong order or
orientation
EST
• Expressed sequence tag: a unique
stretch of DNA within a coding region
of a gene; useful for identifying
full-length genes and as a landmark for
mapping
Exon
• Region of a gene’s DNA that encodes a
portion of its protein; exons are
interspersed with noncoding introns
Genome
• The entire chromosomal genetic
material of an organism
Intron
• Region of a gene’s DNA that is not
translated into a protein
Kilobase (kb)
• Unit of DNA equal to 1000 bases
Locus
• Chromosomal location of a gene or
other piece of DNA
Megabase (mb)
• Unit of DNA equal to 1 million bases
PCR
• Polymerase chain reaction: a technique
for amplifying a piece of DNA quickly
and cheaply
Physical map
• A map of the locations of identifiable
markers spaced along the
chromosomes; a physical map may
also be a set of overlapping clones
Plasmid
• Loop of bacterial DNA that replicates
independently of the chromosomes;
artificial plasmids can be inserted into
bacteria to amplify DNA for
sequencing
Regulatory region
• A segment of DNA that controls
whether a gene will be expressed and
to what degree
Repetitive DNA
• Sequences of varying lengths that occur
in multiple copies in the genome; they
make up much of the genome
Restriction enzyme
• An enzyme that cuts DNA at specific
sequences of base pairs
RFLP
• Restriction fragment length
polymorphism: genetic variation in the
length of DNA fragments produced by
restriction enzymes; useful as markers
on maps
Scaffold
• A series of contigs that are in the right
order but are not necessarily connected
in one continuous stretch of sequence
Shotgun sequencing
• Breaking DNA into many small pieces,
sequencing the pieces, and assembling
the fragments
STS
• Sequence tagged site: a unique stretch
of DNA whose location is known;
serves as a landmark for mapping and
assembly
YAC
• Yeast artificial chromosome: yeast
DNA spliced with a large fragment of a
genome (up to 1 mb) to be amplified in
yeast cells and sequenced
Readings
• Myers, “Whole Genome DNA Sequencing,”
http://www.cs.arizona.edu/people/gene/PAPERS/whole.IEEE.pdf
• Venter et al., “The Sequence of the Human Genome,”
Science, 16 Feb 2001, Vol. 291, No. 5507, 1304 (parts 1 & 2)
• Waterston, Lander, Sulston, “On the sequencing of the
human genome,” PNAS, 19 March 2002, Vol. 99, No. 6,
3712-3716
• Myers et al., “On the sequencing and assembly of the
human genome,”
www.pnas.org/cgi/doi/10.1073/pnas.092136699
Hierarchical sequencing
• Create a high-level physical map, using
ESTs and STSs
• Shred genome into overlapping clones
• Multiply clones in BACs
• ‘shotgun’ each clone
• Read each ‘shotgunned’ fragment
• Assemble the fragments
Physical map
Whole genome sequencing (WGS)
• Make multiple copies of the target
• Randomly ‘shotgun’ each target, discarding
very big and very small pieces
• Read each fragment
• Reassemble the ‘reads’
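The WGS steps above can be imitated on a toy scale. The following is a minimal sketch, not the procedure any sequencing team used: it assumes a random 2,000-base target, samples fixed-length reads at uniformly random positions, and reports what fraction of the target the reads cover. The names shotgun_reads and covered_fraction, and all the sizes, are illustrative assumptions. It also skips the fragment-size selection step and reads straight from the target.

```python
import random

def shotgun_reads(target, num_reads, read_len, seed=0):
    """Sample reads of a fixed length from uniformly random start positions."""
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(target) - read_len + 1)
        reads.append((start, target[start:start + read_len]))
    return reads

def covered_fraction(target_len, reads):
    """Fraction of target positions that lie inside at least one read."""
    covered = [False] * target_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            covered[i] = True
    return sum(covered) / target_len

# Toy target: 2,000 random bases (an assumption, not real genomic data).
rng = random.Random(1)
target = "".join(rng.choice("ACGT") for _ in range(2000))

# 100 reads of 100 bases is 5x average coverage of the toy target.
reads = shotgun_reads(target, num_reads=100, read_len=100)
print(covered_fraction(len(target), reads))
```

A real assembler never sees the start positions; they are kept here only so that coverage can be measured.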
Hierarchical v. whole-genome
The fragment assembly problem
• Aim: infer the target from the reads
• Difficulties –
– Incomplete coverage. Leaves contigs separated
by gaps of unknown size.
– Sequencing errors. Rate increases with length
of read. Less than some bound ε.
– Unknown orientation. Don’t know whether to
use read or its Watson-Crick complement.
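The unknown-orientation difficulty can be made concrete with a small sketch. It assumes error-free reads and exact suffix/prefix matching, which real assemblers cannot rely on. The helper names reverse_complement, overlap and best_join are hypothetical. When joining two reads, the second read is tried both as given and as its Watson-Crick complement, and the orientation with the longer overlap wins.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Watson-Crick complement of seq, read in the opposite direction."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def best_join(a, b, min_len=3):
    """Try b in both orientations; return (overlap_len, merged) or None."""
    best = None
    for candidate in (b, reverse_complement(b)):
        n = overlap(a, candidate, min_len)
        if n and (best is None or n > best[0]):
            best = (n, a + candidate[n:])
    return best

# Toy target and two reads; the second read comes from the opposite strand.
target = "ACGTTGCAAGGCT"
read_a = target[:8]                       # "ACGTTGCA"
read_b = reverse_complement(target[5:])   # opposite-strand read of "GCAAGGCT"

print(best_join(read_a, read_b))
```

In this toy case the join only succeeds after the second read is flipped, and the merged sequence equals the target.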
Scaling and computational
complexity
• Increasing size of target G.
– 1990 – 40 kb (one cosmid)
– 1995 – 1.8 mb (H. influenzae)
– 2001 – 3,200 mb (H. sapiens)
The repeat problem
• Repeats
– Bigger G means more repeats
– Complex organisms have more repetitive
elements
– Small repeats may appear multiple times in a
read
– Long repeats may be bigger than reads (no
unique region)
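Why long repeats hurt can be seen by counting read-length substrings that occur more than once in a target. This is a sketch under toy assumptions: the target string, the 6-base repeat unit and the helper name repeated_kmers are all made up for illustration. A read whose sequence occurs at several positions cannot be placed uniquely; a read longer than the repeat picks up unique flanking sequence.

```python
from collections import Counter

def repeated_kmers(target, k):
    """Substrings of length k that occur more than once in target."""
    counts = Counter(target[i:i + k] for i in range(len(target) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}

# Toy target with one 6-base repeat unit appearing twice (an assumption).
unit = "ACGTAC"
target = "TTG" + unit + "GGA" + unit + "CCT"

# Reads shorter than the repeat: some read sequences fit in two places.
print(repeated_kmers(target, 4))

# Reads longer than the repeat: every read sequence is unique.
print(repeated_kmers(target, 7))
```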
Gaps
• Read length $L_R$ hasn’t changed much
• $\rho = L_R / G$ gets steadily smaller
• Gaps ~ $R e^{-\rho R}$, where $R$ is the number of reads (Lander & Waterman)
How deep must coverage be?
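A rough feel for the answer comes from the simplified Lander-Waterman estimate: with R reads of length L_R on a target of size G, average coverage is R × L_R / G and the expected number of gaps is about R × exp(−coverage). The sketch below evaluates that estimate. The 3,200 mb target size is taken from the scaling slide; the 500-base read length and the function names are assumptions. The estimate ignores the minimum overlap needed to join two reads, so it is optimistic.

```python
import math

def expected_gaps(num_reads, read_len, genome_len):
    """Simplified Lander-Waterman estimate: gaps ~ R * exp(-R * L_R / G)."""
    coverage = num_reads * read_len / genome_len
    return num_reads * math.exp(-coverage)

def reads_for_coverage(coverage, read_len, genome_len):
    """Number of reads needed to reach a given average coverage."""
    return math.ceil(coverage * genome_len / read_len)

G = 3_200_000_000   # target size from the scaling slide (3,200 mb)
L_R = 500           # assumed read length, "hundreds of base pairs"

for coverage in (1, 5, 10):
    R = reads_for_coverage(coverage, L_R, G)
    print(coverage, R, round(expected_gaps(R, L_R, G)))
```

Each extra unit of coverage cuts the expected gaps per read by a factor of e, but the number of reads grows linearly, so some gaps remain at any practical depth.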
Double-barreled shotgun
sequencing
• Choose longer fragments (say, 2 x L_R)
• Read both ends
• Such fragments probably span gaps
• This gives an approximate size of the gap
• This links contigs into scaffolds
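The gap-size step is simple arithmetic once a fragment's two end reads land on different contigs. The sketch below assumes the fragment length is known approximately, that contig A lies to the left of contig B, and that both reads have already been placed. The function name and parameters are hypothetical. The fragment length minus the part lying in contig A, minus the part lying in contig B, is the estimated gap.

```python
from statistics import mean

def estimate_gap(fragment_len, contig_a_len, left_start, right_end):
    """Estimate the gap between contig A and contig B from one fragment.

    left_start: position in contig A where the fragment begins.
    right_end:  position in contig B where the fragment ends.
    """
    in_contig_a = contig_a_len - left_start
    in_contig_b = right_end
    return fragment_len - in_contig_a - in_contig_b

# One fragment of about 1,000 bases: 300 bases fall in contig A
# (which is 1,000 bases long) and 550 bases fall in contig B.
print(estimate_gap(1000, 1000, 700, 550))

# Several spanning fragments give a steadier estimate when averaged.
pairs = [(1000, 1000, 700, 550), (1000, 1000, 820, 670), (1000, 1000, 600, 450)]
print(mean(estimate_gap(*p) for p in pairs))
```

Because real fragment lengths vary around their nominal size, each single estimate is noisy; averaging over many spanning fragments is one common way to tighten it.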
Genomic results
HGSC v Celera results
To do or not to do?
• “The idea is gathering momentum. I shiver
at the thought.” – David Baltimore, 1986
• “If there is anything worth doing twice, it’s
the human genome.” – David Haussler,
2000
Public or private?
• “This information is so important that it
cannot be proprietary.” – C Thomas Caskey,
1987
• “If a company behaves in what scientists
believe is a socially responsible manner,
they can’t make a profit.” – Robert
Cook-Deegan, 1987
HW for Feb 17
• Comment on these assertions (500-1000
words):
– WLS – “Our analysis indicates that the Celera
paper provides neither a meaningful test of the
WGS approach nor an independent sequence of
the human genome.”
– Venter – “This conclusion is based on incorrect
assumptions and flawed reasoning.”