Sequence Assembly
Fall 2016
BMI/CS 576
www.biostat.wisc.edu/bmi576/
Colin Dewey
[email protected]

The sequencing problem
• We want to determine the identity of the base pairs that make up:
  – A single large molecule of DNA
  – The genome of a single cell
  – The genome of an individual organism
  – The genome of a species
• But we can't (currently) "read" off the sequence of an entire molecule all at once

The strategy: substrings
• We do have the ability to read or detect short pieces (substrings) of DNA
  – Sanger sequencing: 500-700 bp/read
  – Hybridization arrays: 8-30 bp/probe
  – Latest technologies:
    • 454 Genome Sequencer FLX: 250-600 bp/read
    • Illumina Genome Analyzer: 35-300 bp/read
    • Pacific Biosciences: ~10,000 bp/read

Sanger sequencing
• Classic sequencing technique: "Chain-termination method"
[Figure: DNA polymerase extending a tagged primer, TGATGCGTAGATCGATGC, along the template ACTACGCATCTAGCTACGTACGTACGTACGTTAGCTGAC]
• Replication terminated by inclusion of dideoxynucleotide (ddNTP)
[Figure: after inclusion of ddATP, the terminated strand is TGATGCGTAGATCGATGCA on the same template]

Sequencing gels
• Run replication in four separate test tubes
  – Each with some concentration of one of ddATP, ddTTP, ddGTP, or ddCTP
• Depending on when a ddNTP is included, fragments of different lengths are synthesized
• Fragments separated by length with electrophoresis gel
• Sequence can be read from bands on gel

Universal DNA arrays
• Array with all possible oligonucleotides (short DNA sequences) of a certain length as probes
• Sample is labeled and then washed over array
• Hybridization is detected from labels
[Figure: chip with probes ATCGAA, ATCGAC, ATCGAG, ATCGAT, ATCGCC, …; a tagged sample fragment is hybridized to one probe]

Reading a DNA array
[Figure: DNA array image]

Latest technologies
• 454:
  – "Sequencing by synthesis"
  – Light emitted and detected on addition of a nucleotide by polymerase
  – 400-600 Mb/10 hour run
• Illumina
  – Also "sequencing by synthesis"
  – ~100 Gb/day on one machine
  – Uses fluorescently-labeled reversible nucleotide terminators
  – Like Sanger, but detects added nucleotides with laser after each step

Latest technologies
• Pacific Biosciences:
  – "Sequencing by synthesis"
  – Single molecule sequencing
  – Detects addition of single fluorescently-labeled
    nucleotides by an immobilized DNA polymerase
  – Real-time: reads bases at the rate of DNA polymerase
  – 4 hours for sequencing with reads up to 60 kb long

Oxford Nanopore
• Emerging technology
• Pocket-sized
• High error rate
• Currently in "community" program

Shotgun Sequencing and Fragment Assembly
1. Multiple copies of sample DNA
2. Randomly fragment DNA
3. Sequence sample of fragments
4. Assemble reads

Two sequencing paradigms
1. Fragment assembly
  – For technologies that produce "reads"
    • Sanger, 454, Illumina, etc.
2. Spectral assembly
  – For technologies that produce "spectra"
    • Universal DNA arrays
  – Read data can also be "converted" to spectra
The two paradigms are actually closely related

The fragment assembly problem
• Given: A set of reads (strings) {s1, s2, …, sn}
• Do: Determine a large string s that "best explains" the reads
• What do we mean by "best explains"?
• What assumptions might we require?

Shortest superstring problem
• Objective: Find a string s such that
  – all reads s1, s2, …, sn are substrings of s
  – s is as short as possible
• Assumptions:
  – Reads are 100% accurate
  – Identical reads must come from the same location on the genome
  – "best" = "simplest"

Shortest superstring example
• Reads: {ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG}
• Shortest superstring (length 10)
    TCGACGCGTA
    TCG
     CGA
      GAC
       ACG
        CGC
         GCG
          CGT
           GTA

Algorithms for the shortest superstring problem
• This problem turns out to be NP-complete
• Simple greedy strategy:
    while # strings > 1 do
        merge two strings with maximum overlap
    loop
• Conjectured to give a string with length ≤ 2 × minimum length
  – "2-approximation"
• Other algorithms will require graph theory…

Graph Basics
• A graph (G) consists of vertices (V) and edges (E): G = (V, E)
• Edges can either be directed (directed graphs)
[Figure: directed graph on vertices 1, 2, 3, 4]
• or undirected (undirected graphs)
[Figure: undirected graph on vertices 1, 2, 3, 4]

Vertex degrees
• The degree of a vertex: the # of edges incident to that vertex
• For directed graphs, we also have the notion of
  – indegree: The number of incoming edges
  – outdegree: The number of outgoing edges
[Figure: directed graph on vertices 1, 2, 3, 4, with degree(v2) = 3, indegree(v2) = 1, outdegree(v2) = 2]

Overlap graph
• For a set of sequence reads S, construct a
  directed weighted graph G = (V, E, w)
  – with one vertex per read (vi corresponds to si)
  – edges between all vertices (a complete graph)
  – w(vi, vj) = overlap(si, sj) = length of longest suffix of si that is a prefix of sj

Overlap graph example
• Let S = {AGA, GAT, TCG, GAG}
[Figure: complete directed overlap graph on AGA, GAT, TCG, GAG. Edge weights: AGA→GAT 2, AGA→GAG 2, GAG→AGA 2, GAT→TCG 1, TCG→GAT 1, TCG→GAG 1, GAG→GAT 1; all other edges 0]

Assembly as Hamiltonian Path
• Hamiltonian path: path through graph that visits each vertex exactly once
[Figure: the overlap graph for S with a Hamiltonian path marked]
Path: AGAGATCG

Shortest superstring as TSP
• Minimize superstring length ⇒ minimize Hamiltonian path length in overlap graph with edge weights negated
[Figure: the overlap graph for S with negated edge weights]
Path: GAGATCG
Path length: -5
String length: 7
• This is essentially the Traveling Salesman Problem (also NP-complete)

The Greedy Algorithm
• Let G be a graph with fragments as vertices, and no edges to start
• Create a queue, Q, of overlap edges, with edges in order of increasing weight (using the negated weights, so the largest overlap comes first)
• While G is disconnected
  – Pop the next possible edge e = (u, v) off of Q
  – If outdegree(u) = 0 and indegree(v) = 0 and e does not create a cycle
    • Add e to G

Greedy Algorithms
• Definition: An algorithm that always takes the best immediate, or local, solution while finding an answer.
• Greedy algorithms find the overall, or globally, optimal solution for some optimization problems, but may find less-than-optimal solutions for some instances of other problems.
Paul E. Black, "greedy algorithm", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 2 February 2005.
http://www.itl.nist.gov/div897/sqg/dads/HTML/greedyalgo.html

Greedy Algorithm Examples
• Kruskal's Algorithm for Minimum Spanning Tree
  – Minimum spanning tree: a set of n-1 edges that connects a graph of n vertices and that has minimal total weight
  – Kruskal's algorithm adds the edge that connects two components with the smallest weight at each step
    • Proven to give an optimal solution
• Traveling Salesman Problem
  – Greedy algorithm chooses to visit closest vertex at each step
  – Can give far-from-optimal answers

Simplifications of overlap graph
• Require minimum length for overlap
• Linear chain compression
[Figure: the chain AGA → GAC → ACT is compressed into the single vertex AGACT]
• Transitive edge removal
[Figure: reads AGAT, GATC, ATCG; the edge AGAT → ATCG is implied by AGAT → GATC → ATCG and is removed]

Sequencing by Hybridization (SBH)
• SBH array has probes for all possible k-mers
• For a given DNA sample, array tells us whether each k-mer is PRESENT or ABSENT in the sample
• The set of all k-mers present in a string s is called its spectrum
• Example:
  – s = ACTGATGCAT
  – spectrum(s, 3) = {ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC}

Example DNA Array
Sample: ACTGATGCAT
Spectrum (k=4): {ACTG, ATGC, CTGA, GATG, GCAT, TGAT, TGCA}
[Figure: 16 × 16 array with rows and columns labeled AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT; an X marks each 4-mer present in the sample]

SBH Problem
• Given: A set S of k-mers
• Do: Find a string s, such that spectrum(s, k) = S
{ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC} → ?

SBH as Eulerian path
• Could use Hamiltonian path approach, but not useful due to NP-completeness
• Instead, use Eulerian path approach
• Eulerian path: A path through a graph that traverses every edge exactly once
• Construct graph with all (k-1)-mers as vertices
• For each k-mer in spectrum, add edge from vertex representing first k-1 characters to vertex representing last k-1 characters

Properties of Eulerian graphs
• It will be easier to consider Eulerian cycles: Eulerian paths that form a cycle
• Graphs that have an Eulerian cycle are simply called Eulerian
• Theorem: A connected directed graph is Eulerian if and only if each of its vertices is balanced
• A vertex v is balanced if indegree(v) = outdegree(v)
• There is a polynomial-time algorithm for finding Eulerian cycles!
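
As an illustration of the spectrum definition, the (k-1)-mer graph construction, and the balance condition, here is a minimal Python sketch. The function names `spectrum`, `sbh_graph`, and `degree_imbalance` are hypothetical and are not taken from the course materials; the graph is stored as a plain adjacency list, which is one of several reasonable choices.

```python
from collections import defaultdict


def spectrum(s, k):
    """Return the set of all k-mers present in the string s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def sbh_graph(kmers):
    """Build the SBH graph: vertices are (k-1)-mers, and each k-mer
    contributes one edge from its first k-1 characters to its last k-1."""
    edges = defaultdict(list)
    for kmer in sorted(kmers):
        edges[kmer[:-1]].append(kmer[1:])
    return edges


def degree_imbalance(edges):
    """Map each vertex to outdegree - indegree (0 means balanced)."""
    diff = defaultdict(int)
    for u, targets in edges.items():
        for v in targets:
            diff[u] += 1
            diff[v] -= 1
    return dict(diff)


kmers = spectrum("ACTGATGCAT", 3)
imbalance = degree_imbalance(sbh_graph(kmers))
unbalanced = {v: d for v, d in imbalance.items() if d != 0}
print(unbalanced)
```

For the example spectrum, every vertex is balanced except AC (one extra outgoing edge) and AT (one extra incoming edge), so the graph has an Eulerian path from AC to AT but no Eulerian cycle.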
Seven Bridges of Königsberg
Euler answered the question: "Is there a walk through the city that traverses each bridge exactly once?"

Eulerian cycle algorithm
• Start at any vertex v, traverse unused edges until returning to v
• While the cycle is not Eulerian
  – Pick a vertex w along the cycle for which there are untraversed outgoing edges
  – Traverse unused edges until ending up back at w
  – Join two cycles into one cycle

Joining cycles
[Figure: a cycle through v and a second cycle starting at w on the first cycle are joined into one cycle]

Eulerian Path -> Eulerian Cycle
• If a graph has an Eulerian path starting at s and ending at t, then
  – All vertices must be balanced, except for s and t, which may have |indegree(v) – outdegree(v)| = 1
  – If s and t are not balanced, add an edge from t to s to balance them
• Graph now has an Eulerian cycle, which can be converted to an Eulerian path by removal of the added edge

SBH graph example
{ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC}
[Figure: graph on the vertices AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT with one edge per 3-mer in the spectrum]

SBH difficulties
• In practice, sequencing by hybridization is hard
  – Arrays are often inaccurate -> incorrect spectra
    • False positives/negatives
  – Need long probes to deal with repetitive sequence
    • But the number of probes needed is exponential in the length of the probes!
    • There is a limit to the number of probes per array (currently between 1-10 million probes/array)

K-mer spectrum approach with read data (de Bruijn approach)
• Generate spectrum from set of all k-mers contained within reads
• Choose k to be small enough such that the majority of the genome's k-mers will be found within the reads
• Particularly useful for short-read data, such as that produced by Illumina
• Made popular by methods such as Euler and Velvet

Difficulties with de Bruijn approach
• Not all k-mers may be contained within the reads even if reads completely cover the genome
• DNA repeats result in k-mers that are present in multiple copies across the genome
• Reads often have sequencing errors!
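
To make the Eulerian path idea concrete for an error-free spectrum, the following Python sketch reconstructs a string from a set of k-mers. It uses the stack-based form of Hierholzer's algorithm, which is a compact variant of the cycle-joining procedure described in the slides rather than a literal transcription of it. The function name `eulerian_string` is hypothetical, and the sketch assumes the k-mer graph is connected and actually has an Eulerian path.

```python
from collections import defaultdict


def eulerian_string(kmers):
    """Spell a string that uses every k-mer edge exactly once.

    Assumes the (k-1)-mer graph is connected and has an Eulerian path.
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # outdegree - indegree per vertex
    for kmer in sorted(kmers, reverse=True):
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    # Start at the vertex with one extra outgoing edge; if every vertex
    # is balanced (Eulerian cycle), any vertex with an edge will do.
    starts = [v for v, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(out_edges))

    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            # Follow an unused edge out of u.
            stack.append(out_edges[u].pop())
        else:
            # No unused edges left at u: it is final in the path.
            path.append(stack.pop())
    path.reverse()

    # Overlapping (k-1)-mers: first vertex, then one new character each.
    return path[0] + "".join(v[-1] for v in path[1:])


S = {"ACT", "ATG", "CAT", "CTG", "GAT", "GCA", "TGA", "TGC"}
print(eulerian_string(S))
```

A spectrum can admit more than one Eulerian path: both ACTGATGCAT and ACTGCATGAT have the spectrum S above, so the reconstruction is not unique in general, and which one is returned depends on the order in which edges are followed.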
Fragment assembly challenges
• Read errors
  – Complicate computing read overlaps
• Repeats
  – Roughly half of the human genome is composed of repetitive elements
  – Repetitive elements can be long (1000s of bp)
  – Human genome
    • 1 million Alu repeats (~300 bp)
    • 200,000 LINE repeats (~1000 bp)

Overlap-Layout-Consensus
• Most common assembler strategy for long reads
1. Overlap: Find all significant overlaps between reads, allowing for errors
2. Layout: Determine path through overlapping reads representing assembled sequence
3. Consensus: Correct for errors in reads using layout

Consensus
[Figure: layout of overlapping reads with the consensus sequence below; the alignment offsets are not recoverable]
Layout (reads):
    GTATCGTAGCTGACTGCGCTGC
    ATCGTCTCGTAGCTGACTGCGCTGC
    ATCGTATCGAATCGTAG
    TGACTGCGCTGCATCGTATCGTATC
Consensus:
    TGACTGCGCTGCATCGTATCGTATCGTAGCTGACTGCGCTGC

Whole Genome Sequencing
• Two main strategies:
1. Clone-by-clone mapping
    • Fragment genome into large pieces, insert into BACs (Bacterial Artificial Chromosomes)
    • Choose tiling set of BACs: overlapping set that covers entire genome
    • Shotgun sequence the BACs
2. Whole-genome shotgun
    • Shotgun sequence the entire genome at once

Assembly in practice
• Assembly methods used in practice are complex
  – But generally follow one of the two approaches
    • Reads as vertices
    • Reads as edges (or paths of edges)
• Assemblies do not typically give whole chromosomes
  – Instead, they give a set of "contigs"
  – contig: contiguous piece of sequence from overlapping reads
  – contigs can be ordered into scaffolds with extra information (e.g., paired-end reads)

Cloning and Paired-end reads
1. Start with a DNA fragment
2. Insert fragment into vector
3. Transform bacteria with vector and grow
4. Sequence ends of insert using flanking primers

Paired-end read advantages
• Scaffolding: layout of adjacent, but not overlapping, contigs
[Figure: contigs joined into a scaffold by paired-end reads]
• Gap filling:
[Figure: gap between contigs]

Sequence assembly summary
• Two general algorithmic strategies
  – Overlap graph Hamiltonian paths
  – Eulerian paths in k-mer graphs
• Biggest challenge
  – Repeats!
    • Large genomes have a lot of repetitive sequence
• Sequencing strategies
  – Clone-by-clone: break the problem into smaller pieces which have fewer repeats
  – Whole-genome shotgun: use paired-end reads to assemble around and inside repeats
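
For the overlap-graph strategy, the simple greedy merge for the shortest superstring problem ("merge two strings with maximum overlap" until one string remains) can be sketched in Python as follows. The names `overlap` and `greedy_superstring` are hypothetical, reads are assumed to be error-free, and ties between equally good overlaps are broken arbitrarily by index; this is an illustrative sketch, not an assembler.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads):
    """Greedily merge the pair of strings with maximum overlap."""
    strings = sorted(set(reads))
    # Reads contained in another read add nothing to the superstring.
    strings = [s for s in strings
               if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        # Best ordered pair (i, j): suffix of strings[i] overlaps
        # prefix of strings[j] by n characters.
        n, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(strings)
            for j, b in enumerate(strings)
            if i != j
        )
        merged = strings[i] + strings[j][n:]
        strings = [s for k, s in enumerate(strings) if k not in (i, j)]
        strings.append(merged)
    return strings[0]


reads = ["AGA", "GAT", "TCG", "GAG"]
print(greedy_superstring(reads))
```

On the four-read example from the overlap graph slides, this sketch recovers a superstring of length 7, matching the optimal Hamiltonian path there. In general the greedy result is only conjectured to be within a factor of 2 of the shortest superstring, and repeats in real genomes make the shortest superstring itself a poor model.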