CSCI2950-C
Lecture 3
DNA Sequencing and
Fragment Assembly
http://cs.brown.edu/courses/csci2950-c/
Outline
• EULER fragment assembly
• Mate-pairs, scaffolding and copy
number
• Next-generation DNA Sequencing
• Cancer Genome Sequencing
Whole Genome Shotgun
Sequencing
• Genome is cut many times at random
• Fragments are cloned into plasmids (2 – 10 Kbp) or cosmids (40 Kbp)
• Forward-reverse paired reads (mate pair): ~500 bp from each end of a clone, separated by a known distance
Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and
contigs into supercontigs
Consensus: derive the DNA
sequence and correct read errors
..ACGATTACAATAGGTT..
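The overlap step can be illustrated with a minimal sketch. It assumes error-free reads and exact suffix-prefix matches only; the function name `overlap` and the `min_len` threshold are illustrative, and real assemblers use error-tolerant alignment instead.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len exists.
    Assumption: reads are error-free, so only exact matches count.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if n > 0 and a[-n:] == b[:n]:
            return n
    return 0


reads = ["ATGGC", "GGCGTG", "GTGCA"]
# Overlap of every ordered pair of distinct reads (edges of an overlap graph).
overlaps = {(a, b): overlap(a, b) for a in reads for b in reads if a != b}
```

Pairs with a nonzero value would become edges of the overlap graph used in the layout step.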
Approaches to Fragment Assembly
Find a path visiting every VERTEX exactly
once in the OVERLAP graph:
Hamiltonian path problem
NP-complete: no efficient algorithms are known
Approaches to Fragment
Assembly (cont’d)
Find a path visiting every EDGE exactly once
in the REPEAT graph:
Eulerian path problem
Linear time algorithms are known
EULER - A New Approach to
Fragment Assembly
• Traditional “overlap-layout-consensus”
technique has a high rate of mis-assembly
• EULER uses the Eulerian Path approach
borrowed from “sequencing by hybridization”
(SBH)
• Fragment assembly without repeat masking
can be done in linear time with greater
accuracy
Sequencing by Hybridization
(SBH)
• Build a microarray with all 4^l DNA
sequences of length l (l ~ 20)
• For DNA sequence s, measure l-mer
composition
l-mer composition
Def: Given a string s of length n, Spectrum ( s, l ) is the
unordered multiset of all (n – l + 1)
l-mers in s
• The order of individual elements in
Spectrum ( s, l ) does not matter
• For s = TATGGTGC all of the following are
equivalent representations of
Spectrum ( s, 3 ):
{TAT, ATG, TGG, GGT, GTG, TGC}
{ATG, GGT, GTG, TAT, TGC, TGG}
{TGG, TGC, TAT, GTG, GGT, ATG}
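As a small sketch of this definition, the spectrum can be computed with a multiset. Here `spectrum` is an illustrative name and `collections.Counter` stands in for the unordered multiset.

```python
from collections import Counter


def spectrum(s, l):
    """Return Spectrum(s, l): the multiset of the n - l + 1 l-mers of s."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


spec = spectrum("TATGGTGC", 3)
# Order does not matter: different listings describe the same multiset.
same_1 = Counter(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"])
same_2 = Counter(["TGG", "TGC", "TAT", "GTG", "GGT", "ATG"])
```

Both `same_1` and `same_2` compare equal to `spec`, mirroring the equivalent representations listed above.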
The SBH Problem
• Goal: Reconstruct a string from its l-mer
composition
• Input: A multiset S, representing all l-mers
from an (unknown) string s
• Output: String s such that Spectrum ( s,l )
=S
SBH: Eulerian Path Approach
S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to ( l – 1 ) – mers : { AT, TG, GC, GG, GT,
CA, CG }
Edges correspond to l – mers from S
de Bruijn graph of S
[Figure: de Bruijn graph on vertices AT, TG, GC, GG, GT, CA, CG]
Path visited every EDGE once
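The construction can be sketched as follows, assuming the graph is stored as an adjacency list keyed by (l – 1)-mers; the name `de_bruijn` is illustrative.

```python
from collections import defaultdict


def de_bruijn(lmers):
    """Build the de Bruijn graph: one edge prefix -> suffix per l-mer."""
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])  # (l-1)-mer prefix -> (l-1)-mer suffix
    return dict(graph)


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
graph = de_bruijn(S)
# Vertices are all (l-1)-mers that occur as a prefix or a suffix of an l-mer.
vertices = set(graph) | {w for targets in graph.values() for w in targets}
```

For this S the sketch yields the seven vertices listed on the slide and one edge per 3-mer.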
SBH: Eulerian Path Approach
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Two different paths give different sequence reconstructions:
[Figure: the de Bruijn graph of S drawn twice, each with a different Eulerian path]
Path 1 spells ATGGCGTGCA
Path 2 spells ATGCGTGGCA
Euler Theorem
• A graph is balanced if for every vertex
the number of incoming edges equals
the number of outgoing edges:
in(v)=out(v)
• Theorem: A connected graph is
Eulerian if and only if each of its vertices
is balanced.
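The balance condition is easy to check directly. The sketch below assumes a directed multigraph given as a list of (u, v) edges; `imbalance` and `is_balanced` are illustrative names.

```python
from collections import Counter


def imbalance(edges):
    """Return in(v) - out(v) for every vertex of a directed multigraph."""
    d = Counter()
    for u, v in edges:
        d[u] -= 1  # the edge leaves u
        d[v] += 1  # the edge enters v
    return dict(d)


def is_balanced(edges):
    """True if in(v) = out(v) for every vertex."""
    return all(x == 0 for x in imbalance(edges).values())


cycle = [("a", "b"), ("b", "c"), ("c", "a")]
path = [("AT", "TG"), ("TG", "GC"), ("GC", "CA")]
```

Here `cycle` is balanced, while `path` is not: its first vertex has one more outgoing edge and its last vertex one more incoming edge, which is the situation of an Eulerian path rather than an Eulerian cycle.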
Euler Theorem: Proof
• Eulerian → balanced
for every edge entering v (incoming
edge) there exists an edge leaving v
(outgoing edge). Therefore
in(v)=out(v)
• Balanced → Eulerian
???
Algorithm for Constructing an Eulerian
Cycle
a. Start with an arbitrary vertex v
and form an arbitrary cycle
with unused edges until a
dead end is reached. Since
the graph is balanced this
dead end is necessarily the
starting point, i.e., vertex v.
Algorithm for Constructing an Eulerian Cycle
(cont’d)
b. If cycle from (a) above is
not an Eulerian cycle, it
must contain a vertex w,
which has untraversed
edges. Perform step (a)
again, using vertex w as
the starting point. Once
again, we will end up in
the starting vertex w.
Algorithm for Constructing an Eulerian Cycle
(cont’d)
c. Combine the
cycles from (a)
and (b) into a
single cycle and
iterate step (b).
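Steps (a) to (c) can be sketched as below, assuming a connected, balanced graph stored as an adjacency list. The name `eulerian_cycle` is illustrative. This version rescans the cycle for a vertex with unused edges, so it is simple rather than linear-time; linear-time variants keep the cycle in a linked list or use a stack.

```python
def eulerian_cycle(graph, start):
    """Eulerian cycle of a connected, balanced directed graph.

    graph: dict mapping each vertex to a list of its successors.
    """
    unused = {v: list(ws) for v, ws in graph.items()}  # edges not yet traversed

    def walk(v):
        # Step (a): follow unused edges until a dead end; in a balanced
        # graph the dead end is the vertex the walk started from.
        cyc = [v]
        while unused.get(cyc[-1]):
            cyc.append(unused[cyc[-1]].pop())
        return cyc

    cycle = walk(start)
    while True:
        for i, w in enumerate(cycle):
            if unused.get(w):            # step (b): w still has untraversed edges
                cycle[i:i + 1] = walk(w)  # step (c): splice the new cycle in at w
                break
        else:
            return cycle                 # no vertex has unused edges left


g = {"a": ["b"], "b": ["c", "a"], "c": ["b"]}
cyc = eulerian_cycle(g, "a")
```

In this example the first walk returns to "a" after two edges, and the remaining cycle through "c" is spliced in at "b".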
Overlap Graph: Hamiltonian
Approach
Each vertex represents a read from the original sequence.
Vertices from repeats are connected to many others.
[Figure: overlap graph with three copies of a repeat]
Find a path visiting every VERTEX exactly once: Hamiltonian path problem
Overlap Graph: Eulerian Approach
[Figure: graph in which the three copies of the repeat are placed together]
Placing each repeat edge
together gives a clear
progression of the path
through the entire sequence.
Find a path visiting every EDGE
exactly once:
Eulerian path problem
Multiple Repeats
[Figure: graph with two repeats, Repeat1 and Repeat2, each occurring twice]
Can be easily
constructed with any
number of repeats
Repeat Graph
(a) DNA sequence with a triple
repeat R;
(b) the layout graph;
(c) construction of the de
Bruijn graph by gluing
repeats;
(d) de Bruijn graph.
Pevzner P. A. et al. PNAS 2001;98:9748-9753
Building Repeat Graph
• Problem: Construct the repeat graph from
a collection of reads.
• Solution: Break the reads into smaller
pieces.
Building Repeat Graph
• Reads are sampled from the original
sequence at lengths that biologists can
sequence with a high level of certainty.
• They are then broken again into k-mers
EULER Fragment Assembly
Approach
• Input: Reads s1, …, sN
• Further subdivide reads into k-mers (k =
20)
• Build repeat graph on resulting k-mers
• Each read is a path in the resulting graph.
• Solve Eulerian Superpath Problem.
Given an Eulerian graph and a collection of paths in
this graph, find an Eulerian path in this graph that
contains all these paths as subpaths.
Repeat Graph
Vertices correspond to ( k – 1 ) – mers in each read
Edges correspond to k – mers in each read
Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
[Figure: repeat graph on vertices AT, TG, GC, GG, GT, CA, CG]
Two Eulerian paths (visit every EDGE once):
ATGCGTGGCA
ATGGCGTGCA
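A short sketch of how the reads in this example yield the edges and how each read becomes a path in the graph. The helper names `kmers` and `read_path` are illustrative.

```python
def kmers(read, k):
    """All k-mers of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def read_path(read, k):
    """Vertices ((k-1)-mers) visited by a read in the repeat graph."""
    return kmers(read, k - 1)


reads = ["ATGGC", "GGCGTG", "GTGCA"]
# Edges of the repeat graph: the distinct 3-mers over all reads.
edges = {kmer for read in reads for kmer in kmers(read, 3)}
paths = {read: read_path(read, 3) for read in reads}
```

With k = 3, `edges` contains exactly the eight 3-mers listed above, and the read ATGGC follows the vertices AT, TG, GG, GC.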
Reads in Repeat Graph
Example: S = ATGGCGTGCA
Reads = {ATGGC, GGCGTG, GTGCA}
3-mers = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }
Eulerian superpath: an Eulerian path that contains set
of paths (reads) as subpaths.
[Figure: repeat graph on vertices AT, TG, GC, GG, GT, CA, CG, with the reads drawn as paths]
ATGCGTGGCA
ATGGCGTGCA
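For an example this small, the superpath condition can be checked by brute force. The sketch below is not the EULER algorithm: it enumerates every Eulerian path by backtracking (exponential in general) and keeps those whose spelled string contains every read. The function names are illustrative.

```python
def eulerian_strings(kmers):
    """Strings spelled by all Eulerian paths that use each k-mer exactly once."""
    k = len(kmers[0])
    results = set()

    def extend(text, remaining):
        if not remaining:
            results.add(text)
            return
        for i, kmer in enumerate(remaining):
            if kmer[:-1] == text[-(k - 1):]:  # edge leaves the current vertex
                extend(text + kmer[-1], remaining[:i] + remaining[i + 1:])

    for i, kmer in enumerate(kmers):
        extend(kmer, kmers[:i] + kmers[i + 1:])
    return results


def eulerian_superpaths(kmers, reads):
    """Eulerian paths that contain every read as a subpath (substring)."""
    return {s for s in eulerian_strings(kmers) if all(r in s for r in reads)}


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
reads = ["ATGGC", "GGCGTG", "GTGCA"]
all_paths = eulerian_strings(S)
consistent = eulerian_superpaths(S, reads)
```

Of the two Eulerian paths, only ATGGCGTGCA contains all three reads, so the reads resolve the ambiguity in this example.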
Additional challenges in EULER
Approach
1. Errors in reads
2. Reverse-complement of DNA string
3. Using mate-pair information to simplify the
repeat graph.
4. Multiplicities of edges generally unknown
(Copy number problem).
Sequencing Errors
• If a read contains an error, the error
is propagated to every 20-mer broken
from that read that covers the
erroneous position.
Sequencing Errors
• However, the error will not be present in
20-mers from other reads that cover the same position.
“Consensus first” approach
• Let T = {all l-tuples appearing in > M reads}
• A string s is called a T-string if all its l-tuples
belong to T.
• Spectral Alignment Problem. Given a string
s and a spectrum T, find the minimum
number of mutations in s that transform s into
a T-string.
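These definitions can be sketched in a few lines. The sketch assumes that "mutations" means substitutions only and solves the Spectral Alignment Problem by exhaustive search up to a small bound; the function names and the `max_mutations` parameter are illustrative, and this is not the method used in EULER.

```python
from collections import Counter
from itertools import combinations, product


def solid_tuples(reads, l, M):
    """T = all l-tuples appearing in more than M reads."""
    counts = Counter()
    for read in reads:
        # Use a set so each l-tuple is counted at most once per read.
        counts.update({read[i:i + l] for i in range(len(read) - l + 1)})
    return {t for t, c in counts.items() if c > M}


def is_t_string(s, T, l):
    """True if every l-tuple of s belongs to T."""
    return all(s[i:i + l] in T for i in range(len(s) - l + 1))


def spectral_alignment(s, T, l, max_mutations=2):
    """Fewest substitutions (up to max_mutations) turning s into a T-string.

    Returns (number_of_mutations, corrected_string), or None if not found.
    """
    for m in range(max_mutations + 1):
        for positions in combinations(range(len(s)), m):
            for letters in product("ACGT", repeat=m):
                t = list(s)
                for p, c in zip(positions, letters):
                    t[p] = c
                t = "".join(t)
                if is_t_string(t, T, l):
                    return m, t
    return None


reads = ["ATGGCGT", "ATGGCGT", "ATGGCGT", "ATGCCGT"]  # last read has one error
T = solid_tuples(reads, l=3, M=1)
fixed = spectral_alignment("ATGCCGT", T, l=3)
```

Here the erroneous read is corrected with a single substitution, since the 3-tuples created by the error occur in only one read and are therefore excluded from T.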
Sequencing Errors
• Solving Spectral Alignment Problem
attempts to eliminate most point
mutation errors before reconstructing
the original sequence.
• Not perfect!
Forward and Reverse
Complements
[Figure: double-stranded DNA, one strand running 5’ to 3’ and the other 3’ to 5’]
We obtain reads from both strands of DNA. Do not
know strand of origin.
s = CAGT s’ = ACTG (reverse complement)
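A minimal sketch of the reverse complement, together with a helper that adds the reverse complement of every read to the input, as an assembler that does not know the strand of origin might do. The names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Complement each base, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]


def with_reverse_complements(reads):
    """Each read followed by its reverse complement."""
    return [r for read in reads for r in (read, reverse_complement(read))]


rc = reverse_complement("CAGT")
both = with_reverse_complements(["CAGT", "ATGGC"])
```

For s = CAGT this gives ACTG, matching the slide.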
Forward and Reverse
Complements
[Figure: double-stranded DNA, strands labeled 5’ to 3’ and 3’ to 5’]
In the EULER assembler, include the reverse
complement of each read.
“assume that S contains a complement
of every read and that the de Bruijn
graph can be partitioned into two
subgraphs (the “canonical” one and its
reverse complement)”
Alternative approaches using bidirected graphs.
Using Mate-Pair Information
Repeats and other ambiguities lead to
tangles in repeat graph
[Figure: tangle with paths labeled 1, 2, 3, 4]
1 → 3 and 2 → 4
OR
1 → 4 and 2 → 3?
A repeat v1 … vn and a
system of paths
overlapping with this
repeat
Using Mate-Pair Information
Mate-pair (r1, r2) gives a pair of positions in G.
Find a path P in G from r1 to r2.
[Figure: tangle with paths 1, 2, 3, 4; reads r1 and r2 are marked in the graph, with labels l(r1, r2) and d(r1, r2)]
If there is a unique path P with d(r1, r2) ≈ l(r1, r2), the length of the mate
pair, then use P as a “long read” in the superpath algorithm.
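The path search can be sketched as a bounded depth-first search. This is a simplification under several assumptions: r1 and r2 are treated as vertices, every edge has a positive length, and "≈" is modeled by a tolerance `tol`; all names and the toy graph are hypothetical.

```python
def paths_of_length(graph, start, end, lo, hi):
    """All paths start -> end whose total edge length lies in [lo, hi].

    graph: dict vertex -> list of (successor, edge_length), lengths > 0.
    """
    found = []

    def dfs(v, length, path):
        if length > hi:
            return  # positive edge lengths guarantee termination
        if v == end and length >= lo:
            found.append(path)
        for w, edge_len in graph.get(v, []):
            dfs(w, length + edge_len, path + [w])

    dfs(start, 0, [start])
    return found


def mate_pair_path(graph, r1, r2, mate_len, tol):
    """Return the path from r1 to r2 if exactly one has length near mate_len."""
    candidates = paths_of_length(graph, r1, r2, mate_len - tol, mate_len + tol)
    return candidates[0] if len(candidates) == 1 else None


# Toy tangle: two routes from r1 to r2, of lengths 700 and 2200.
g = {
    "r1": [("x", 100)],
    "x": [("y", 500), ("z", 2000)],
    "y": [("r2", 100)],
    "z": [("r2", 100)],
}
unique = mate_pair_path(g, "r1", "r2", mate_len=700, tol=100)
ambiguous = mate_pair_path(g, "r1", "r2", mate_len=1500, tol=1000)
```

With a tight tolerance only the short route matches and can serve as a "long read"; with a loose one both routes match and the mate pair does not resolve the tangle.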
Using Mate-Pair Information
Scaffolding
Using Mate-Pair Information
Copy number problem
Let d(v) = indegree – outdegree
Balanced graph: d(v) = 0 for all v.
Goal: Introduce multiplicities on edges so that graph is
balanced.
Copy number problem
Goal: Introduce multiplicities on edges so that graph is
balanced.
Use as few extra edges as possible.
Balance each vertex by adding edge multiplicities
Assign flow f(e) to each edge such that d(v) = 0 for all
vertices.
Copy number problem
Let d(v) = indegree – outdegree
Balanced graph: d(v) = 0 for all v.
Graph G = (V, E, w). Weights w(e) = 1 for all e.
Copy Number Problem (Pevzner & Tang 2001): For an
edge e in G, find a flow minimizing the multiplicity f(e) of e.
Copy number problem
Copy Number Problem (Pevzner & Tang 2001): For an
edge e in G, find a flow minimizing the multiplicity f(e) of e.
Min-flow Max-cut Theorem: For a directed acyclic graph
G = (V, E, w) with lower capacity bounds:
min flow from v to w =
capacity of the maximum cut separating v from w
Copy number problem
Copy Number Problem (Pevzner & Tang 2001): For an
edge e in G, find a flow minimizing the multiplicity f(e) of e.
Min-cost circulation (See Myers 2005):
Assign cost c(e) = 1 to each edge.
min Σc(e) f(e) such that
f(e) ≥ w(e) for all e.
d(v) = 0 for all vertices.
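Written out as an optimization problem, the min-cost circulation above reads as follows. This is a restatement under the slide's definitions, writing E for the edge set and reading d(v) = 0 as flow conservation (flow into v equals flow out of v).

```latex
\begin{aligned}
\min_{f} \quad & \sum_{e \in E} c(e)\, f(e) \\
\text{subject to} \quad & f(e) \ge w(e) && \text{for all } e \in E, \\
& \sum_{e \text{ into } v} f(e) \;-\; \sum_{e \text{ out of } v} f(e) = 0 && \text{for all } v \in V .
\end{aligned}
```

With c(e) = 1 and w(e) = 1 for every edge, the objective is the total edge multiplicity, so the optimum balances the graph with as few extra edge copies as possible.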
Next-generation sequence
platforms
• 454
– http://www.454.com/enabling-technology/index.asp
• Illumina
– http://www.illumina.com/pages.ilmn?ID=203
• ABI Solid
– solid.appliedbiosystems.com
Polony Sequencing
Polony sequencing—Assembly
• Resulting reads are likely to look different than Sanger reads:
– Short (currently 100 to 200 bp)
– Low error rates, except in homopolymeric runs (AAA…, CCC…, etc.)
– It is currently not known how to do paired reads on a chip. Maybe very soon!
454 Sequencing
Illumina Sequencing
Nanopore Sequencing
http://www.mcb.harvard.edu/branton/index.htm
Nanopore Sequencing—Assembly
• Resulting reads are likely to look different than Sanger reads:
– Long (perhaps 10,000bp-1,000,000bp)
– High error rate (perhaps 10% – 30%)
– Two colors?
  • A/ CTG
  • AT/ CG
  • AG/ CT
• How can we assemble under such conditions?
Some future directions for
sequencing
1. Personalized genome sequencing
• Find your ~1,000,000 single nucleotide polymorphisms (SNPs)
• Find your rearrangements
• Goals:
– Link genome with phenotype
– Provide personalized diet and medicine
– (???) designer babies, big-brother insurance companies
• Timeline:
– Inexpensive sequencing: 2010-2015
– Genotype–phenotype association: 2010-???
– Personalized drugs: 2015-???
Some future directions for
sequencing
2. Environmental sequencing
• Find your flora: organisms living in your body
– External organs: skin, mucous membranes
– Gut, mouth, etc.
– Normal flora: >200 species, >trillions of individuals
– Flora–disease, flora–non-optimal health associations
• Timeline:
– Inexpensive research sequencing: today
– Research & associations: within next 10 years
– Personalized sequencing: 2015+
• Find diversity of organisms living in different environments
– Hard to isolate
– Assembly of all organisms at once
Some future directions for
sequencing
3. Organism sequencing
• Sequence a large fraction of all organisms
• Deduce ancestors
– Reconstruct ancestral genomes
– Synthesize ancestral genomes
– Clone: Jurassic Park!
• Study evolution of function
– Find functional elements within a genome
– How those evolved in different organisms
– Find how modules/machines composed of many genes evolved
DNA Sequencing – Recap
[Timeline: 1975 to 2015]
• Gel electrophoresis
– Predominant, old technology by F. Sanger
• Whole genome strategies
– Physical mapping
– Walking
– Shotgun sequencing
• Computational fragment assembly
• The future: new sequencing technologies
– Pyrosequencing, single molecule methods, …
– Assembly techniques
• Future variants of sequencing
– Resequencing of humans
– Cancer genome sequencing
– Microbial and environmental sequencing
Cell Division and Mutation
Single nucleotide
change
Copy number
Structural
Rearrangements in Cancer
1) Change gene
structure, create
novel fusion genes
Gleevec targets ABL-BCR
fusion
2) Alter gene regulation
Burkitt’s lymphoma
IMAGE CREDIT: Gregory Schuler, NCBI, NIH, Bethesda, MD
Cancer Genomes
Fusion gene in >50% prostate cancer patients
(Tomlins et al. Science Oct. 2005)
End Sequence Profiling (ESP)
C. Collins and S. Volik (2003)
1) Pieces of cancer
genome: clones (100-250 kb).
Cancer DNA
2) Sequence ends of
clones (500bp).
Human DNA
x
y
3) Map end sequences
to human genome.
Each clone corresponds to pair of end sequences (ES pair) (x,y).
Retain clones that correspond to a unique ES pair.
End Sequence Profiling (ESP)
C. Collins and S. Volik (2003)
1) Pieces of cancer
genome: clones (100-250 kb).
Cancer DNA
2) Sequence ends of
clones (500bp).
L
Human DNA
x
y
3) Map end sequences
to human genome.
Valid ES pairs
• Lmin ≤ y – x ≤ Lmax, where Lmin (Lmax) is the minimum (maximum) clone size.
• Convergent orientation.
End Sequence Profiling (ESP)
C. Collins and S. Volik (2003)
1) Pieces of cancer
genome: clones (100-250 kb).
Cancer DNA
2) Sequence ends of
clones (500bp).
L
Human DNA
[Figure: end sequences x and y mapped to the human genome near breakpoints a and b]
3) Map end sequences
to human genome.
Invalid ES pairs
• Putative rearrangement in cancer
• ES directions toward breakpoints (a,b):
Lmin ≤ |x-a| + |y-b| ≤ Lmax
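The two conditions can be sketched as simple predicates. This assumes x and y are coordinates in bp on the same chromosome and that orientation is supplied as a boolean; the function names and the example coordinates are hypothetical.

```python
def classify_es_pair(x, y, convergent, l_min, l_max):
    """Valid if Lmin <= y - x <= Lmax and the ends have convergent orientation."""
    if convergent and l_min <= y - x <= l_max:
        return "valid"
    return "invalid"


def consistent_breakpoints(x, y, a, b, l_min, l_max):
    """Check Lmin <= |x - a| + |y - b| <= Lmax for candidate breakpoints (a, b)."""
    return l_min <= abs(x - a) + abs(y - b) <= l_max


L_MIN, L_MAX = 100_000, 250_000  # clone size range from the slide (100-250 kb)

normal = classify_es_pair(1_000_000, 1_150_000, True, L_MIN, L_MAX)
rearranged = classify_es_pair(1_000_000, 5_000_000, True, L_MIN, L_MAX)
# For the invalid pair, test one candidate breakpoint pair (a, b).
fits = consistent_breakpoints(1_000_000, 5_000_000, 1_080_000, 4_930_000, L_MIN, L_MAX)
```

The second pair maps too far apart to come from a single clone of the reference genome, and the candidate breakpoints account for it because the two distances sum to a plausible clone length.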
Sources
• Serafim Batzoglou
http://ai.stanford.edu/~serafim/CS262_2006/ (Sequencing slides)
• http://bioalgorithms.info (Euler slides)