Download class02Sequencing-03.. - Department of Computer Science • NJIT

Sequencing a genome
Approximate Molecular Dynamics:
New Algorithms with Applications in Protein Folding
Author: Qun (Marc) Ma
Predicting the 3D native structures of proteins from the
known amino acid sequence, i.e., protein folding, has
become pressing in structural genomics and computational
biology. Though it is plausible to use molecular dynamics
(MD) simulations to study the folding of proteins, the
currently available methodologies are incapable of
addressing the timescale problems.
In this talk, I will describe the recent advances in the
development of two new multiscale integrators that allow
very large time steps (and thus "approximate" molecular
dynamics)
Definition
• Determining the identity and order of
nucleotides in the genetic material – usually
DNA, sometimes RNA, of an organism
Basic problem
• Genomes are large (typically millions or
billions of base pairs)
• Current technology can only reliably ‘read’
a short stretch – typically hundreds of base
pairs
Elements of a solution
• Automation – over the past decade, the
amount of hand-labor in the ‘reads’ has
been steadily and dramatically reduced
• Assembly of the reads into sequences is an
algorithmic and computational problem
A human drama
• There are competing methods of assembly
• The competing – public and private –
sequencing teams used competing assembly
methods
Assembly:
• Putting sequenced fragments of DNA
into their correct chromosomal
positions
BAC
• Bacterial artificial chromosome:
bacterial DNA spliced with a medium-sized fragment of a genome (100 to
300 kb) to be amplified in bacteria and
sequenced.
Contig
• Contiguous sequence of DNA created
by assembling overlapping sequenced
fragments of a chromosome (whether
natural or artificial, as in BACs)
Cosmid
• DNA from a bacterial virus spliced
with a small fragment of a genome (45
kb or less) to be amplified and
sequenced
Directed sequencing
• Successively sequencing DNA from
adjacent stretches of chromosome
Draft sequence
• Sequence with lower accuracy than a
finished sequence; some segments are
missing or in the wrong order or
orientation
EST
• Expressed sequence tag: a unique
stretch of DNA within a coding region
of a gene; useful for identifying full-length genes and as a landmark for
mapping
Exon
• Region of a gene’s DNA that encodes a
portion of its protein; exons are
interspersed with noncoding introns
Genome
• The entire chromosomal genetic
material of an organism
Intron
• Region of a gene’s DNA that is not
translated into a protein
Kilobase (kb)
• Unit of DNA equal to 1000 bases
Locus
• Chromosomal location of a gene or
other piece of DNA
Megabase (mb)
• Unit of DNA equal to 1 million bases
PCR
• Polymerase chain reaction: a technique
for amplifying a piece of DNA quickly
and cheaply
Physical map
• A map of the locations of identifiable
markers spaced along the
chromosomes; a physical map may
also be a set of overlapping clones
Plasmid
• Loop of bacterial DNA that replicates
independently of the chromosomes;
artificial plasmids can be inserted into
bacteria to amplify DNA for
sequencing
Regulatory region
• A segment of DNA that controls
whether a gene will be expressed and
to what degree
Repetitive DNA
• Sequences of varying lengths that occur
in multiple copies in the genome; they
make up much of the genome
Restriction enzyme
• An enzyme that cuts DNA at specific
sequences of base pairs
RFLP
• Restriction fragment length
polymorphism: genetic variation in the
length of DNA fragments produced by
restriction enzymes; useful as markers
on maps
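The two entries above (restriction enzyme, RFLP) can be illustrated with a small sketch. It assumes the well-known enzyme EcoRI, which recognizes GAATTC and cuts after the G; the two allele sequences and the helper name `digest` are hypothetical, made up for this illustration.

```python
def digest(dna: str, site: str, cut_offset: int) -> list[str]:
    """Cut dna at every occurrence of site; the cut falls cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    return fragments

# EcoRI recognizes GAATTC and cuts between G and AATTC.
ECORI_SITE, ECORI_OFFSET = "GAATTC", 1

# Hypothetical alleles of one locus: allele_b has lost the second site
# through a single base change (GAATTC -> GAGTTC).
allele_a = "ATATGAATTCGGCCGAATTCTTAA"
allele_b = "ATATGAATTCGGCCGAGTTCTTAA"

lengths_a = [len(f) for f in digest(allele_a, ECORI_SITE, ECORI_OFFSET)]
lengths_b = [len(f) for f in digest(allele_b, ECORI_SITE, ECORI_OFFSET)]
print(lengths_a)  # three fragments
print(lengths_b)  # two fragments: the length pattern differs between alleles
```

The differing fragment-length patterns are what make the locus usable as a marker.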
Scaffold
• A series of contigs that are in the right
order but are not necessarily connected
in one continuous stretch of sequence
Shotgun sequencing
• Breaking DNA into many small pieces,
sequencing the pieces, and assembling
the fragments
STS
• Sequence tagged site: a unique stretch
of DNA whose location is known;
serves as a landmark for mapping and
assembly
YAC
• Yeast artificial chromosome: yeast
DNA spliced with a large fragment of a
genome (up to 1 mb) to be amplified in
yeast cells and sequenced
Readings
• Myers, “Whole Genome DNA Sequencing,”
www.cs.arizona.edu/people/gene/PAPERS/whole.IEEE.pdf
• Venter et al., “The Sequence of the Human Genome,”
Science, 16 Feb 2001, Vol. 291, No. 5507, 1304 (parts 1 & 2)
• Waterston, Lander, Sulston, “On the sequencing of the
human genome,” PNAS, March 19, 2002, Vol. 99, No. 6,
3712-3716
• Myers et al., “On the sequencing and assembly of the
human genome,”
www.pnas.org/cgi/doi/10.1073/pnas.092136699
Hierarchical sequencing
• Create a high-level physical map, using
ESTs and STSs
• Shred genome into overlapping clones
• Multiply clones in BACs
• ‘shotgun’ each clone
• Read each ‘shotgunned’ fragment
• Assemble the fragments
Physical map
Whole genome sequencing (WGS)
• Make multiple copies of the target
• Randomly ‘shotgun’ each target, discarding
very big and very small pieces
• Read each fragment
• Reassemble the ‘reads’
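The first three WGS steps can be mimicked in a toy simulation. This is only a sketch under stated assumptions: the target is a random string, piece lengths are drawn from an exponential distribution (a hypothetical size model), and each retained piece is treated as fully read, whereas real reads cover only part of a fragment. The function name `shotgun` and all parameter values are illustrative.

```python
import random

def shotgun(target: str, copies: int, mean_len: int,
            min_len: int, max_len: int, rng: random.Random) -> list[str]:
    """Break `copies` copies of target at random points and keep mid-sized pieces."""
    fragments = []
    for _ in range(copies):
        pos = 0
        while pos < len(target):
            # Hypothetical size model: exponential piece lengths around mean_len.
            size = max(1, int(rng.expovariate(1 / mean_len)))
            piece = target[pos:pos + size]
            # Discard very small and very big pieces.
            if min_len <= len(piece) <= max_len:
                fragments.append(piece)
            pos += size
    return fragments

rng = random.Random(0)
target = "".join(rng.choice("ACGT") for _ in range(2000))
reads = shotgun(target, copies=8, mean_len=100, min_len=50, max_len=200, rng=rng)

# Average number of reads covering each base of the target.
coverage = sum(len(r) for r in reads) / len(target)
print(len(reads), round(coverage, 2))
```

Because each copy is broken at different random points, reads from different copies overlap, and those overlaps are what the reassembly step relies on.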
Hierarchical v. whole-genome
The fragment assembly problem
• Aim: infer the target from the reads
• Difficulties –
– Incomplete coverage. Leaves contigs separated
by gaps of unknown size.
– Sequencing errors. The error rate increases with
read length, but is assumed to stay below some bound ε.
– Unknown orientation. Don’t know whether to
use read or its Watson-Crick complement.
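A minimal sketch of how an assembler might cope with unknown orientation: for every pair of reads it tries both strands and greedily merges the pair with the longest suffix-prefix overlap. Greedy merging is one simple heuristic, not necessarily the method used by either sequencing team, and the sketch assumes error-free reads. The target and reads are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Watson-Crick complement, read in the opposite direction."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str], min_len: int = 4) -> list[str]:
    """Repeatedly merge the two contigs with the longest overlap, trying both strands."""
    contigs = list(reads)
    while True:
        best = (0, None, None, None)  # (overlap length, i, j, merged sequence)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                for oa in (a, reverse_complement(a)):
                    for ob in (b, reverse_complement(b)):
                        k = overlap(oa, ob, min_len)
                        if k > best[0]:
                            best = (k, i, j, oa + ob[k:])
        k, i, j, merged = best
        if k == 0:
            return contigs  # no remaining overlaps: what is left is separated by gaps
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]

target = "CACCAGAACATACCCGAA"                  # hypothetical target
reads = [target[0:9],                          # forward strand
         reverse_complement(target[5:14]),     # read from the opposite strand
         target[10:18]]
result = greedy_assemble(reads)
print(result)
```

The assembly can only be recovered up to strand: the single contig returned is either the target or its reverse complement.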
Scaling and computational
complexity
• Increasing size of target G.
– 1990 – 40kb (one cosmid)
– 1995 – 1.8 mb (H. influenzae)
– 2001 – 3,200 mb (H. sapiens)
The repeat problem
• Repeats
– Bigger G means more repeats
– Complex organisms have more repetitive
elements
– Small repeats may appear multiple times in a
read
– Long repeats may be bigger than reads (no
unique region)
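Why long repeats are troublesome can be seen in a toy example: a read that lies entirely inside a repeat matches the target in more than one place, while a read that reaches into unique flanking sequence can be placed unambiguously. The sequences and the helper `placements` are hypothetical.

```python
def placements(read: str, target: str) -> list[int]:
    """All positions in target where read matches exactly."""
    return [i for i in range(len(target) - len(read) + 1)
            if target.startswith(read, i)]

repeat = "CCGTAAGCTT"                              # hypothetical 10-base repeat
target = "AAAT" + repeat + "GGGCATC" + repeat + "TTTA"

read_in_repeat = repeat[2:8]      # shorter than the repeat, no unique region
read_with_flank = target[2:10]    # starts in unique sequence, runs into the repeat

print(placements(read_in_repeat, target))   # two candidate positions
print(placements(read_with_flank, target))  # one position
```

An assembler that sees only `read_in_repeat` cannot tell which copy of the repeat it came from.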
Gaps
• Read length LR hasn’t changed much
• w = LR /G gets steadily smaller
• Gaps ~ R e^{-wR}, where R is the number of reads (Waterman & Lander)
How deep must coverage be?
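One way to explore this question is to evaluate the gap estimate from the previous slide, Gaps ~ R e^{-wR} with w = LR / G, at several coverage depths (coverage is wR, the average number of reads over each base). The sketch below uses the simplified form of the estimate, which ignores the minimum overlap needed to join two reads. The target size matches the 1.8 mb example above; the read length of 500 bases and the coverage values are assumptions chosen for illustration.

```python
import math

def reads_for_coverage(genome_bp: int, read_len_bp: int, coverage: float) -> int:
    """Number of reads R such that R * LR / G reaches the requested coverage."""
    return math.ceil(coverage * genome_bp / read_len_bp)

def expected_gaps(genome_bp: int, read_len_bp: int, n_reads: int) -> float:
    """Simplified estimate: Gaps ~ R * exp(-w * R), with w = LR / G."""
    w = read_len_bp / genome_bp
    return n_reads * math.exp(-w * n_reads)

G = 1_800_000   # target size in bases
LR = 500        # assumed read length in bases

for c in (1, 2, 5, 8, 10):
    R = reads_for_coverage(G, LR, c)
    print(f"coverage {c:>2}x  reads {R:>6}  expected gaps {expected_gaps(G, LR, R):8.1f}")
```

Under this estimate the expected number of gaps falls off exponentially with coverage, but the number of reads grows only linearly, which is why several-fold coverage is the usual target.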
Double-barreled shotgun
sequencing
• Choose longer fragments (say, 2 x LR)
• Read both ends
• Such fragments probably span gaps
• This gives an approximate size of the gap
• This links contigs into scaffolds
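The gap-size step can be sketched as simple arithmetic: if a fragment of roughly known length starts in one contig and ends in the next, whatever part of the fragment is not accounted for by the two contigs must lie in the gap. The function name, the coordinates, and the fragment length of 2000 bases are all hypothetical.

```python
from statistics import mean

def gap_estimate(fragment_len: int, left_contig_len: int,
                 left_start: int, right_end: int) -> int:
    """
    Estimate the gap between two contigs from one fragment read at both ends.

    fragment_len    approximate length of the fragment
    left_contig_len length of the contig containing the first end read
    left_start      position in the left contig where the fragment begins
    right_end       position in the right contig where the fragment ends
    """
    covered_in_left = left_contig_len - left_start
    return fragment_len - covered_in_left - right_end

# Three hypothetical fragments linking the same pair of contigs:
# (fragment length, start in left contig, end in right contig)
LEFT_CONTIG_LEN = 5000
links = [(2000, 4200, 900), (2000, 4600, 1150), (2000, 3900, 500)]

estimates = [gap_estimate(f, LEFT_CONTIG_LEN, s, e) for f, s, e in links]
print(estimates, round(mean(estimates)))
```

Fragment lengths are only approximately known, so averaging over several linking fragments gives a steadier estimate than any single one.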
Genomic results
HGSC v Celera results
To do or not to do?
• “The idea is gathering momentum. I shiver
at the thought.” – David Baltimore, 1986
• “If there is anything worth doing twice, it’s
the human genome.” – David Haussler,
2000
Public or private?
• “This information is so important that it
cannot be proprietary.” – C Thomas Caskey,
1987
• “If a company behaves in what scientists
believe is a socially responsible manner,
they can’t make a profit.” – Robert Cook-Deegan, 1987
HW for Feb 19
• Comment on these assertions (500-1000
words):
– WLS – “Our analysis indicates that the Celera
paper provides neither a meaningful test of the
WGS approach nor an independent sequence of
the human genome.”
– Venter – “This conclusion is based on incorrect
assumptions and flawed reasoning.”
• Lesk, Exercise 2.15, problem 2.3