Sequence Assembly
Fall 2016
BMI/CS 576
www.biostat.wisc.edu/bmi576/
Colin Dewey
[email protected]

The sequencing problem
• We want to determine the identity of the base pairs that make up:
  – A single large molecule of DNA
  – The genome of a single cell
  – The genome of an individual organism
  – The genome of a species
• But we can’t (currently) “read” off the sequence of an entire molecule all at once

The strategy: substrings
• We do have the ability to read or detect short pieces (substrings) of DNA
  – Sanger sequencing: 500-700 bp/read
  – Hybridization arrays: 8-30 bp/probe
  – Latest technologies:
    • 454 Genome Sequencer FLX: 250-600 bp/read
    • Illumina Genome Analyzer: 35-300 bp/read
    • Pacific Biosciences: ~10,000 bp/read

Sanger sequencing
• Classic sequencing technique: “Chain-termination method”
[Figure: DNA polymerase extending a tagged primer along the template strand]
  primer:   TGATGCGTAGATCGATGC
  template: ACTACGCATCTAGCTACGTACGTACGTACGTTAGCTGAC
• Replication terminated by inclusion of dideoxynucleotide (ddNTP)
[Figure: extension stops once ddATP is incorporated]
  primer:   TGATGCGTAGATCGATGCA
  template: ACTACGCATCTAGCTACGTACGTACGTACGTTAGCTGAC

Sequencing gels
• Run replication in four separate test tubes
  – Each with some concentration of one of ddATP, ddTTP, ddGTP, or ddCTP
• Depending on when a ddNTP is included, different length fragments are synthesized
• Fragments separated by length with electrophoresis gel
• Sequence can be read from bands on gel

Universal DNA arrays
• Array with all possible oligonucleotides (short DNA sequences) of a certain length as probes
• Sample is labeled and then washed over array
• Hybridization is detected from labels
[Figure: tagged sample strands hybridizing to complementary probes attached to the chip]

Reading a DNA array
[Figure: DNA array image]

Latest technologies
• 454:
  – “Sequencing by synthesis”
  – Light emitted and detected on addition of a nucleotide by polymerase
  – 400-600 Mb per 10-hour run
• Illumina
  – Also “sequencing by synthesis”
  – ~100 Gb/day on one machine
  – Uses fluorescently-labeled reversible nucleotide terminators
  – Like Sanger, but detects added nucleotides with laser after each step

Latest technologies
• Pacific Biosciences:
  – “Sequencing by synthesis”
  – Single molecule sequencing
  – Detects addition of single fluorescently-labeled nucleotides by an immobilized DNA polymerase
  – Real-time: reads bases at the rate of DNA polymerase
  – 4 hours for sequencing, with reads up to 60 kb long

Oxford Nanopore
• Emerging technology
• Pocket-sized
• High error rate
• Currently in “community” program

Shotgun Sequencing Fragment Assembly
1. Multiple copies of sample DNA
2. Randomly fragment DNA
3. Sequence sample of fragments
4. Assemble reads

Two sequencing paradigms
1. Fragment assembly
  – For technologies that produce “reads”
    • Sanger, 454, Illumina, etc.
2. Spectral assembly
  – For technologies that produce “spectra”
    • Universal DNA arrays
  – Read data can also be “converted” to spectra
The two paradigms are actually closely related

The fragment assembly problem
• Given: A set of reads (strings) {s1, s2, …, sn}
• Do: Determine a large string s that “best explains” the reads
• What do we mean by “best explains”?
• What assumptions might we require?

Shortest superstring problem
• Objective: Find a string s such that
  – all reads s1, s2, …, sn are substrings of s
  – s is as short as possible
• Assumptions:
  – Reads are 100% accurate
  – Identical reads must come from the same location on the genome
  – “best” = “simplest”

Shortest superstring example
• Reads: {ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG}
• Shortest superstring (length 10), with each read aligned beneath it:
TCGACGCGTA
TCG
 CGA
  GAC
   ACG
    CGC
     GCG
      CGT
       GTA

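The example above can be checked mechanically. The following is a minimal sketch; the helper name `is_superstring` is an illustrative assumption, not something from the slides.

```python
reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
candidate = "TCGACGCGTA"

def is_superstring(s, reads):
    """Return True if every read occurs somewhere in s (hypothetical helper)."""
    return all(r in s for r in reads)

print(is_superstring(candidate, reads), len(candidate))
```

This only verifies that the candidate contains every read; it does not prove that no shorter superstring exists.
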
Algorithms for shortest superstring problem
• This problem turns out to be NP-complete
• Simple greedy strategy:
    while # strings > 1 do
        merge two strings with maximum overlap
    loop
• Conjectured to give string with length ≤ 2 × minimum length
  • “2-approximation”
• Other algorithms will require graph theory…

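The greedy strategy can be sketched in a few lines of Python. This is an illustrative sketch, not the course's reference implementation: the function names are assumptions, ties between equally good merges are broken arbitrarily, and reads are assumed to be error-free.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    # Drop duplicates and reads contained inside another read.
    strings = list(dict.fromkeys(reads))
    strings = [s for s in strings if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        # Pick the ordered pair (a, b) with the maximum overlap.
        k, a, b = max(
            ((overlap(a, b), a, b) for a in strings for b in strings if a != b),
            key=lambda item: item[0],
        )
        merged = a + b[k:]
        # Remove anything now contained in the merged string (including a and b).
        strings = [s for s in strings if s not in merged] + [merged]
    return strings[0]

reads = ["ACG", "CGA", "CGC", "CGT", "GAC", "GCG", "GTA", "TCG"]
result = greedy_superstring(reads)
print(result, len(result))
```

The result always contains every read, but because the choice is only locally optimal it is not guaranteed to reach the minimum length.
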
Graph Basics
• A graph (G) consists of vertices (V) and edges (E)
  G = (V, E)
• Edges can either be directed (directed graphs)
[Figure: directed graph on vertices 1, 2, 3, 4]
• or undirected (undirected graphs)
[Figure: undirected graph on vertices 1, 2, 3, 4]

Vertex degrees
• The degree of a vertex: the number of edges incident to that vertex
• For directed graphs, we also have the notion of
  – indegree: The number of incoming edges
  – outdegree: The number of outgoing edges
[Figure: directed graph on vertices 1, 2, 3, 4]
  degree(v2) = 3
  indegree(v2) = 1
  outdegree(v2) = 2

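As a small illustration of these definitions, the sketch below counts degrees from an edge list. The edge list is hypothetical (the slide's figure is not available); it was chosen only so that vertex 2 has the degrees stated above.

```python
from collections import Counter

# Hypothetical directed edges (u, v), meaning u -> v.
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]

outdegree = Counter(u for u, _ in edges)
indegree = Counter(v for _, v in edges)

def degree(v):
    """Total number of edges incident to v."""
    return indegree[v] + outdegree[v]

print(degree(2), indegree[2], outdegree[2])
```
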
Overlap graph
• For a set of sequence reads S, construct a directed weighted graph G = (V, E, w)
  – with one vertex per read (vi corresponds to si)
  – edges between all vertices (a complete graph)
  – w(vi, vj) = overlap(si, sj) = length of longest suffix of si that is a prefix of sj

Overlap graph example
• Let S = {AGA, GAT, TCG, GAG}
[Figure: complete directed graph on vertices AGA, GAT, TCG, GAG, with each edge labeled by its overlap (0, 1, or 2)]

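The edge weights for S = {AGA, GAT, TCG, GAG} can be recomputed directly from the definition of overlap. A minimal sketch, with illustrative function names:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads):
    """Map each ordered pair of distinct reads to its overlap weight."""
    return {(a, b): overlap(a, b) for a in reads for b in reads if a != b}

S = ["AGA", "GAT", "TCG", "GAG"]
w = overlap_graph(S)
for (a, b), weight in w.items():
    print(f"{a} -> {b}: {weight}")
```

Computing all pairwise overlaps this way takes time quadratic in the number of reads, which is one reason practical assemblers require a minimum overlap length and use indexing.
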
Assembly as Hamiltonian Path
• Hamiltonian Path: path through graph that visits each vertex exactly once
[Figure: overlap graph on vertices AGA, GAT, TCG, GAG with a Hamiltonian path highlighted]
Path: AGAGATCG

Shortest superstring as TSP
• minimize superstring length ⇒ minimize Hamiltonian path length in overlap graph with edge weights negated
[Figure: overlap graph on vertices AGA, GAT, TCG, GAG with negated edge weights (0, -1, or -2)]
Path: GAGATCG
Path length: -5
String length: 7
• This is essentially the Traveling Salesman Problem (also NP-complete)

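For a set as small as S = {AGA, GAT, TCG, GAG}, the Hamiltonian path formulation can be solved by trying every ordering of the reads. The sketch below does this by brute force; it is exponential in the number of reads and is meant only to illustrate the relationship between path weight and string length. Function names are assumptions.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def spell(path):
    """Merge reads along a path, collapsing each pairwise overlap."""
    s = path[0]
    for a, b in zip(path, path[1:]):
        s += b[overlap(a, b):]
    return s

def path_weight(path):
    """Path length under negated overlap weights."""
    return -sum(overlap(a, b) for a, b in zip(path, path[1:]))

S = ["AGA", "GAT", "TCG", "GAG"]
best = min(permutations(S), key=path_weight)
print(best, spell(best), path_weight(best))
```

The string length equals the total read length plus the (negative) path weight: 12 + (-5) = 7 here. More than one ordering can reach the minimum, so which optimal path is reported depends on enumeration order.
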
The Greedy Algorithm
• Let G be a graph with fragments as vertices, and no edges to start
• Create a queue, Q, of overlap edges, with edges in order of increasing weight (weights are negated overlaps, so the largest overlaps come first)
• While G is disconnected
  – Pop the next possible edge e = (u, v) off of Q
  – If outdegree(u) = 0 and indegree(v) = 0 and e does not create a cycle
    • Add e to G

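The steps above can be sketched as follows. This is an illustrative sketch under a few assumptions: reads are distinct, ties between equal-weight edges are broken by input order, and cycle detection uses a simple union-find over the growing path components. All names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_layout(reads):
    # Queue of edges, largest overlap first (i.e. increasing negated weight).
    queue = sorted(((a, b) for a in reads for b in reads if a != b),
                   key=lambda e: -overlap(*e))
    succ, pred = {}, {}                 # chosen outgoing / incoming edge per read
    parent = {r: r for r in reads}      # union-find forest for cycle detection

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for u, v in queue:
        if len(succ) == len(reads) - 1:  # G is connected: one path over all reads
            break
        if u not in succ and v not in pred and find(u) != find(v):
            succ[u], pred[v] = v, u
            parent[find(u)] = find(v)

    # Walk the resulting path from the read with no incoming edge.
    cur = next(r for r in reads if r not in pred)
    s = cur
    while cur in succ:
        nxt = succ[cur]
        s += nxt[overlap(cur, nxt):]
        cur = nxt
    return s

print(greedy_layout(["AGA", "GAT", "TCG", "GAG"]))
```

On this small example the greedy choice happens to recover a shortest superstring, but that is not guaranteed in general.
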
Greedy Algorithms
• Definition: An algorithm that always takes the best immediate, or local, solution while finding an answer.
• Greedy algorithms find the overall, or globally, optimal solution for some optimization problems, but may find less-than-optimal solutions for some instances of other problems.
Paul E. Black, "greedy algorithm", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 2 February 2005. http://www.itl.nist.gov/div897/sqg/dads/HTML/greedyalgo.html

Greedy Algorithm Examples
• Kruskal’s Algorithm for Minimum Spanning Tree
  – Minimum spanning tree: a set of n-1 edges that connects a graph of n vertices and that has minimal total weight
  – Kruskal’s algorithm adds the edge that connects two components with the smallest weight at each step
    • Proven to give an optimal solution
• Traveling Salesman Problem
  – Greedy algorithm chooses to visit closest vertex at each step
  – Can give far-from-optimal answers

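Simplifications of overlap graph
• Require minimum length for overlap
• Linear chain compression
[Figure: the chain AGA → GAC → ACT is compressed into the single vertex AGACT]
• Transitive edge removal
[Figure: reads AGAT, GATC, ATCG; the edge AGAT → ATCG is implied by AGAT → GATC → ATCG and is removed]
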
Sequencing by Hybridization (SBH)
• SBH array has probes for all possible k-mers
• For a given DNA sample, array tells us whether each k-mer is PRESENT or ABSENT in the sample
• The set of all k-mers present in a string s is called its spectrum
• Example:
  – s = ACTGATGCAT
  – spectrum(s, 3) = {ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC}

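The spectrum definition translates directly into code. A minimal sketch (the function name mirrors the slide's notation):

```python
def spectrum(s, k):
    """Set of all k-mers that occur in the string s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

print(sorted(spectrum("ACTGATGCAT", 3)))
```

Because the spectrum is a set, it records only presence or absence: a k-mer that occurs several times in s appears once.
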
Example DNA Array
Sample: ACTGATGCAT
Spectrum (k=4): {ACTG, ATGC, CTGA, GATG, GCAT, TGAT, TGCA}
[Table: 16 × 16 array; rows give the first two bases of each 4-mer probe and columns the last two, both in the order AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT]
Cells marked X (row/column): AC/TG, AT/GC, CT/GA, GA/TG, GC/AT, TG/AT, TG/CA

SBH Problem
• Given: A set S of k-mers
• Do: Find a string s, such that spectrum(s, k) = S
  {ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC} → ?

SBH as Eulerian path
• Could use Hamiltonian path approach, but not useful due to NP-completeness
• Instead, use Eulerian path approach
• Eulerian path: A path through a graph that traverses every edge exactly once
• Construct graph with all (k-1)-mers as vertices
• For each k-mer in spectrum, add edge from vertex representing first k-1 characters to vertex representing last k-1 characters

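The graph construction can be sketched as an adjacency list keyed by (k-1)-mer. This sketch stores only vertices that have at least one outgoing edge, rather than all possible (k-1)-mers; the function name is an assumption.

```python
from collections import defaultdict

def sbh_graph(kmers):
    """Adjacency list: prefix (k-1)-mer -> list of suffix (k-1)-mers."""
    adj = defaultdict(list)
    for kmer in sorted(kmers):
        adj[kmer[:-1]].append(kmer[1:])
    return dict(adj)

S = {"ACT", "ATG", "CAT", "CTG", "GAT", "GCA", "TGA", "TGC"}
graph = sbh_graph(S)
for u, vs in graph.items():
    print(u, "->", ", ".join(vs))
```

Each k-mer contributes exactly one edge, so the graph for this spectrum has 8 edges.
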
Properties of Eulerian graphs
• It will be easier to consider Eulerian cycles: Eulerian paths that form a cycle
• Graphs that have an Eulerian cycle are simply called Eulerian
• Theorem: A connected directed graph is Eulerian if and only if each of its vertices is balanced
• A vertex v is balanced if indegree(v) = outdegree(v)
• There is a polynomial-time algorithm for finding Eulerian cycles!

Seven Bridges of Königsberg
Euler answered the question: “Is there a walk through the city that traverses each bridge exactly once?”

Eulerian cycle algorithm
• Start at any vertex v, traverse unused edges until returning to v
• While the cycle is not Eulerian
  – Pick a vertex w along the cycle for which there are untraversed outgoing edges
  – Traverse unused edges until ending up back at w
  – Join two cycles into one cycle

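The cycle-joining procedure above can be sketched as follows, assuming the input is a connected directed graph in which every vertex is balanced. The adjacency-list representation and function names are illustrative choices.

```python
def eulerian_cycle(adj):
    """adj maps each vertex to a list of successors; returns a vertex list."""
    unused = {u: list(vs) for u, vs in adj.items()}

    def walk(v):
        # Follow unused edges from v until stuck. In a balanced graph the
        # walk can only get stuck back at v.
        tour = [v]
        while unused.get(tour[-1]):
            tour.append(unused[tour[-1]].pop())
        return tour

    cycle = walk(next(iter(adj)))
    i = 0
    while i < len(cycle):
        w = cycle[i]
        if unused.get(w):
            # w still has untraversed edges: find a new cycle through w
            # and splice it into the current cycle in place of w.
            cycle[i:i + 1] = walk(w)
        else:
            i += 1
    return cycle

adj = {"a": ["b"], "b": ["c", "a"], "c": ["b"]}
print(eulerian_cycle(adj))
```

In this example the first walk returns a → b → a, leaving the edges through c unused; the second walk b → c → b is then spliced in at b.
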
Joining cycles
[Figure: a cycle through v, a second cycle through a vertex w on the first cycle, and the single joined cycle]

Eulerian Path -> Eulerian Cycle
• If a graph has an Eulerian Path starting at s and ending at t then
  – All vertices must be balanced, except for s and t which may have |indegree(v) – outdegree(v)| = 1
  – If s and t are not balanced, add an edge from t to s to balance them
    • Graph now has an Eulerian cycle which can be converted to an Eulerian path by removal of the added edge

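The path-to-cycle reduction can be sketched end to end on the SBH spectrum used in these slides. This is a sketch under stated assumptions: the graph is connected and has exactly one vertex with an extra outgoing edge (s) and one with an extra incoming edge (t). The cycle finder here is a compact stack-based variant rather than the explicit cycle-joining formulation, and all names are illustrative.

```python
from collections import defaultdict

def eulerian_cycle(edges, start):
    """Stack-based Eulerian cycle for a balanced, connected edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())
        else:
            cycle.append(stack.pop())
    return cycle[::-1]

def eulerian_path(edges):
    balance = defaultdict(int)          # outdegree - indegree
    for u, v in edges:
        balance[u] += 1
        balance[v] -= 1
    s = next(v for v, b in balance.items() if b == 1)
    t = next(v for v, b in balance.items() if b == -1)
    cycle = eulerian_cycle(edges + [(t, s)], s)   # add balancing edge t -> s
    # Cut the cycle at the added edge to recover a path from s to t.
    i = next(j for j in range(len(cycle) - 1)
             if (cycle[j], cycle[j + 1]) == (t, s))
    return cycle[i + 1:] + cycle[1:i + 1]

kmers = ["ACT", "ATG", "CAT", "CTG", "GAT", "GCA", "TGA", "TGC"]
edges = [(k[:-1], k[1:]) for k in kmers]
path = eulerian_path(edges)
text = path[0] + "".join(v[-1] for v in path[1:])
print(text)
```

This spectrum admits more than one Eulerian path, so the reconstructed string need not equal the original sample ACTGATGCAT; any string returned has the same 3-mer spectrum.
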
SBH graph example
{ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC}
[Figure: graph with vertices AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT and one edge per 3-mer in the spectrum]

SBH difficulties
• In practice, sequencing by hybridization is hard
  – Arrays are often inaccurate -> incorrect spectra
    • False positives/negatives
  – Need long probes to deal with repetitive sequence
    • But the number of probes needed is exponential in the length of the probes!
    • There is a limit to the number of probes per array (currently between 1-10 million probes/array)

K-mer spectrum approach with read data (de Bruijn approach)
• Generate spectrum from set of all k-mers contained within reads
• Choose k to be small enough such that the majority of the genome’s k-mers will be found within the reads
• Particularly useful for short-read data, such as that produced by Illumina
• Made popular by methods such as Euler and Velvet

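Generating a spectrum from reads is a small extension of the single-string case. The sketch below also keeps k-mer counts, which are not required by the slide but are often useful for spotting repeats and likely errors. The reads are hypothetical, chosen to cover the sample ACTGATGCAT.

```python
from collections import Counter

def read_spectrum(reads, k):
    """Count every k-mer occurring in any read (reads shorter than k add nothing)."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

reads = ["ACTGAT", "GATGCAT"]   # hypothetical reads
counts = read_spectrum(reads, 3)
print(sorted(counts.items()))
```

Here the 3-mer GAT is counted twice because it falls in the region where the two reads overlap.
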
Difficulties with de Bruijn approach
• Not all k-mers may be contained within the reads even if reads completely cover the genome
• DNA repeats result in k-mers that are present in multiple copies across the genome
• Reads often have sequencing errors!

Fragment assembly challenges
• Read errors
  – Complicates computing read overlaps
• Repeats
  – Roughly half of the human genome is composed of repetitive elements
  – Repetitive elements can be long (1000s of bp)
  – Human genome
    • 1 million Alu repeats (~300 bp)
    • 200,000 LINE repeats (~1000 bp)

Overlap-Layout-Consensus
• Most common assembler strategy for long reads
  1. Overlap: Find all significant overlaps between reads, allowing for errors
  2. Layout: Determine path through overlapping reads representing assembled sequence
  3. Consensus: Correct for errors in reads using layout

Consensus
Layout:
                    GTATCGTAGCTGACTGCGCTGC
                 ATCGTCTCGTAGCTGACTGCGCTGC
            ATCGTATCGAATCGTAG
TGACTGCGCTGCATCGTATCGTATC
Consensus:
TGACTGCGCTGCATCGTATCGTATCGTAGCTGACTGCGCTGC

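The consensus step can be illustrated with a simple per-column majority vote over the layout. This is a sketch only: the read offsets below are inferred by aligning each read to the consensus string on the slide and are an assumption, and real assemblers typically use multiple alignment and quality scores rather than a plain vote.

```python
from collections import Counter

def consensus(layout):
    """layout: list of (offset, read) pairs; returns the majority base per column."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Assumes every column is covered by at least one read.
    return "".join(col.most_common(1)[0][0] for col in columns)

layout = [
    (20, "GTATCGTAGCTGACTGCGCTGC"),
    (17, "ATCGTCTCGTAGCTGACTGCGCTGC"),
    (12, "ATCGTATCGAATCGTAG"),
    (0, "TGACTGCGCTGCATCGTATCGTATC"),
]
print(consensus(layout))
```

Two reads each contain one base that disagrees with the others covering the same column, and the vote outvotes both.
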
Whole Genome Sequencing
• Two main strategies:
  1. Clone-by-clone mapping
    • Fragment genome into large pieces, insert into BACs (Bacterial Artificial Chromosomes)
    • Choose tiling set of BACs: overlapping set that covers entire genome
    • Shotgun sequence the BACs
  2. Whole-genome shotgun
    • Shotgun sequence the entire genome at once

Assembly in practice
• Assembly methods used in practice are complex
  – But generally follow one of the two approaches
    • Reads as vertices
    • Reads as edges (or paths of edges)
• Assemblies do not typically give whole chromosomes
  – Instead they give a set of “contigs”
  – contig: contiguous piece of sequence from overlapping reads
  – contigs can be ordered into scaffolds with extra information (e.g., paired-end reads)

Cloning and Paired-end reads
1. DNA fragment
2. Insert fragment into vector
3. Transform bacteria with vector and grow
4. Sequence ends of insert using flanking primers

Paired-end read advantages
• Scaffolding: layout of adjacent, but not overlapping, contigs
[Figure: paired-end reads linking contigs into a scaffold]
• Gap filling:
[Figure: paired-end reads spanning a gap]

Sequence assembly summary
• Two general algorithmic strategies
  – Overlap graph Hamiltonian paths
  – Eulerian paths in k-mer graphs
• Biggest challenge
  – Repeats!
    • Large genomes have a lot of repetitive sequence
• Sequencing strategies
  – Clone-by-clone: break the problem into smaller pieces which have fewer repeats
  – Whole-genome shotgun: use paired-end reads to assemble around and inside repeats