CSCI2950-C Lecture 3: DNA Sequencing and Fragment Assembly
http://cs.brown.edu/courses/csci2950-c/

Outline
• EULER fragment assembly
• Mate-pairs, scaffolding and copy number
• Next-generation DNA sequencing
• Cancer genome sequencing

Whole Genome Shotgun Sequencing
• The genome is cut many times at random and the pieces are cloned into plasmids (2–10 Kbp) and cosmids (40 Kbp).
• Each insert yields a forward-reverse pair of reads (a mate pair), ~500 bp from each end, separated by a known distance.

Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
• Overlap: find potentially overlapping reads
• Layout: merge reads into contigs and contigs into supercontigs
• Consensus: derive the DNA sequence and correct read errors (..ACGATTACAATAGGTT..)

Approaches to Fragment Assembly
• Find a path visiting every VERTEX exactly once in the OVERLAP graph: Hamiltonian path problem
• NP-complete: no efficient algorithm is known

Approaches to Fragment Assembly (cont’d)
• Find a path visiting every EDGE exactly once in the REPEAT graph: Eulerian path problem
• Linear-time algorithms are known

EULER - A New Approach to Fragment Assembly
• The traditional “overlap-layout-consensus” technique has a high rate of mis-assembly
• EULER uses the Eulerian path approach borrowed from “sequencing by hybridization” (SBH)
• Fragment assembly without repeat masking can be done in linear time with greater accuracy

Sequencing by Hybridization (SBH)
• Build a microarray with all 4^l DNA sequences of length l (l ~ 20)
• For a DNA sequence s, measure its l-mer composition

l-mer composition
• Def: Given a string s of length n, Spectrum(s, l) is the unordered multiset of all (n – l + 1) l-mers in s
• The order of individual elements in Spectrum(s, l) does not matter
• For s = TATGGTGC, all of the following are equivalent representations of Spectrum(s, 3):
  {TAT, ATG, TGG, GGT, GTG, TGC}
  {ATG, GGT, GTG, TAT, TGC, TGG}
  {TGG, TGC, TAT, GTG, GGT, ATG}

The SBH Problem
• Goal: Reconstruct a string from its l-mer composition
• Input: A multiset S, representing all l-mers from an (unknown) string s
• Output: A string s such that Spectrum(s, l) = S

SBH: Eulerian Path Approach
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
• Vertices correspond to (l – 1)-mers: {AT, TG, GC, GG, GT, CA, CG}
• Edges correspond to l-mers from S
• [Figure: de Bruijn graph of S on the vertices AT, TG, GC, GG, GT, CA, CG]
• Path visited every EDGE once

SBH: Eulerian Path Approach
S = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
Two different paths give different sequence reconstructions:
• ATGGCGTGCA
• ATGCGTGGCA

Euler Theorem
• A graph is balanced if for every vertex the number of incoming edges equals the number of outgoing edges: in(v) = out(v)
• Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.

Euler Theorem: Proof
• Eulerian → balanced: for every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore in(v) = out(v).
• Balanced → Eulerian: follows from the construction below.

Algorithm for Constructing an Eulerian Cycle
a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is balanced, this dead end is necessarily the starting point, i.e., vertex v.
b. If the cycle from (a) is not an Eulerian cycle, it must contain a vertex w that has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.
c. Combine the cycles from (a) and (b) into a single cycle and iterate step (b).

Overlap Graph: Hamiltonian Approach
• Each vertex represents a read from the original sequence.
• Vertices from repeats are connected to many others.
• [Figure: overlap graph with three copies of a repeat]
• Find a path visiting every VERTEX exactly once: Hamiltonian path problem

Overlap Graph: Eulerian Approach
• [Figure: the three repeat copies glued into one edge]
• Placing each repeat edge together gives a clear progression of the path through the entire sequence.
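The SBH example and the cycle-building steps (a)-(c) above can be sketched in a few lines of Python. This is a minimal illustration, not the EULER implementation: the function names `de_bruijn`, `eulerian_path` and `spell` are made up for this sketch, the stack-based traversal is the standard Hierholzer variant of the cycle-merging idea, and the input is assumed to admit an Eulerian path.

```python
from collections import defaultdict


def de_bruijn(spectrum):
    """Map each (l-1)-mer prefix to the list of (l-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for lmer in spectrum:
        graph[lmer[:-1]].append(lmer[1:])
    return graph


def eulerian_path(graph):
    """Return one Eulerian path as a list of vertices (assumes one exists)."""
    out_edges = {v: list(ws) for v, ws in graph.items()}
    indeg = defaultdict(int)
    for ws in out_edges.values():
        for w in ws:
            indeg[w] += 1
    vertices = set(out_edges) | set(indeg)
    # Start at the vertex with one extra outgoing edge, if there is one;
    # otherwise the graph is balanced and any vertex with edges will do.
    start = next((v for v in vertices
                  if len(out_edges.get(v, [])) - indeg[v] == 1),
                 next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges.get(v):
            # Keep walking along unused edges until a dead end is reached.
            stack.append(out_edges[v].pop())
        else:
            # Dead end: this vertex is final; back up and splice in detours.
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Spell the string obtained by walking the path of (l-1)-mers."""
    return path[0] + "".join(v[-1] for v in path[1:])


S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
reconstruction = spell(eulerian_path(de_bruijn(S)))
```

For this S the result is one of the two reconstructions listed on the slide; which one depends on the order in which edges are tried, which is exactly the ambiguity that reads and mate pairs are later used to resolve.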
• Find a path visiting every EDGE exactly once: Eulerian path problem

Multiple Repeats
• [Figure: sequence with repeats Repeat1 and Repeat2, each occurring twice]
• Can be easily constructed with any number of repeats

Repeat Graph
(a) DNA sequence with a triple repeat R; (b) the layout graph; (c) construction of the de Bruijn graph by gluing repeats; (d) de Bruijn graph.
Pevzner P. A. et al. PNAS 2001;98:9748-9753

Building Repeat Graph
• Problem: Construct the repeat graph from a collection of reads.
• Solution: Break the reads into smaller pieces.

Building Repeat Graph
• Reads are taken from the original sequence at lengths that allow biologists a high level of certainty.
• They are then broken again into k-mers.

EULER Fragment Assembly Approach
• Input: Reads s1, …, sN
• Further subdivide reads into k-mers (k = 20)
• Build the repeat graph on the resulting k-mers
• Each read is a path in the resulting graph.
• Solve the Eulerian Superpath Problem: given an Eulerian graph and a collection of paths in this graph, find an Eulerian path in this graph that contains all these paths as subpaths.

Repeat Graph
• Vertices correspond to (k – 1)-mers in each read
• Edges correspond to k-mers in each read
• Example: S = ATGGCGTGCA, Reads = {ATGGC, GGCGTG, GTGCA}, 3-mers = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
• [Figure: repeat graph on the vertices AT, TG, GC, GG, GT, CA, CG]
• Two Eulerian paths (visit every EDGE once): ATGCGTGGCA and ATGGCGTGCA

Reads in Repeat Graph
• Example: S = ATGGCGTGCA, Reads = {ATGGC, GGCGTG, GTGCA}, 3-mers = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
• Eulerian superpath: an Eulerian path that contains a set of paths (reads) as subpaths.
• Of the two Eulerian paths ATGCGTGGCA and ATGGCGTGCA, only ATGGCGTGCA contains all three reads.

Additional challenges in EULER Approach
1. Errors in reads
2. Reverse-complement of DNA string
3. Using mate-pair information to simplify the repeat graph.
4. Multiplicities of edges generally unknown (copy number problem).

Sequencing Errors
• If a read contains an error, the error is carried into every k-mer (k = 20) broken from that read that covers the erroneous position.
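A small Python sketch makes the propagation of a read error into its k-mers concrete. The read, the error position and k = 4 are made up for illustration (EULER uses k = 20), the helper names are hypothetical, and only a single substitution in an otherwise identical read is assumed.

```python
def kmers(read, k):
    """All k-mers of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def corrupted_kmers(true_read, observed_read, k):
    """Start positions of k-mers that differ between the true and observed read."""
    pairs = zip(kmers(true_read, k), kmers(observed_read, k))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


# Hypothetical 12-base read with one substitution at position 5 (G -> T).
true_read = "ATGGCGTGCAAT"
observed = "ATGGCTTGCAAT"
bad = corrupted_kmers(true_read, observed, k=4)
# bad == [2, 3, 4, 5]: the four 4-mers that cover position 5
```

A single wrong base away from the ends of the read therefore creates k erroneous k-mers, each of which would add a spurious edge to the repeat graph if left uncorrected.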
Sequencing Errors
• However, the error will generally not be present in the other reads that cover the same 20-mer.

“Consensus first” approach
• Let T = {all l-tuples appearing in > M reads}
• A string s is called a T-string if all its l-tuples belong to T.
• Spectral Alignment Problem: given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string.

Sequencing Errors
• Solving the Spectral Alignment Problem attempts to eliminate most point-mutation errors before reconstructing the original sequence.
• Not perfect!

Forward and Reverse Complements
• [Figure: the two strands of DNA, 5’ to 3’ and 3’ to 5’]
• We obtain reads from both strands of DNA and do not know the strand of origin.
• s = CAGT, s’ = ACTG (reverse complement)

Forward and Reverse Complements
• In the EULER assembler, include the reverse complement of each read.
• “assume that S contains a complement of every read and that the de Bruijn graph can be partitioned into two subgraphs (the “canonical” one and its reverse complement)”
• Alternative approaches use bidirected graphs.

Using Mate-Pair Information
• Repeats and other ambiguities lead to tangles in the repeat graph.
• [Figure: a repeat v1 … vn and a system of paths 1, 2, 3, 4 overlapping with this repeat]
• 1-3 and 2-4, OR 1-4 and 2-3?

Using Mate-Pair Information
• A mate pair (r1, r2) gives a pair of positions in G.
• Find a path P in G from r1 to r2.
• If there is a unique path P whose length d(r1, r2) ≈ l(r1, r2), the length of the mate pair, then use P as a “long read” in the superpath algorithm.

Using Mate-Pair Information (figure only)

Scaffolding (figure only)

Using Mate-Pair Information (figure only)

Copy number problem
• Let d(v) = indegree - outdegree.
• Balanced graph: d(v) = 0 for all v.
• Goal: Introduce multiplicities on edges so that the graph is balanced. Use as few extra edges as possible.
• Balance each vertex by adding edge multiplicities: assign a flow f(e) to each edge such that d(v) = 0 for all vertices.
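The balance condition d(v) = 0 can be checked directly once multiplicities are attached to edges. The sketch below is only an illustration: the four-vertex graph, the vertex names and the helper functions are hypothetical, and the multiplicity that balances the graph is supplied by hand, not found by a flow algorithm.

```python
from collections import defaultdict


def imbalance(multiplicity):
    """Return d(v) = indegree - outdegree, counting each edge f(e) times."""
    d = defaultdict(int)
    for (u, v), f in multiplicity.items():
        d[u] -= f  # edge leaves u
        d[v] += f  # edge enters v
    return dict(d)


def is_balanced(multiplicity):
    """True if d(v) = 0 for every vertex."""
    return all(x == 0 for x in imbalance(multiplicity).values())


# Hypothetical repeat graph: the genome traverses the repeat edge (p, q)
# twice, but the graph stores every edge once (unit weights w(e) = 1).
# The edge (t, s) closes the path into a cycle.
unit = {("s", "p"): 1, ("p", "q"): 1, ("q", "p"): 1,
        ("q", "t"): 1, ("t", "s"): 1}

# Raising the multiplicity of the repeat edge to 2 balances p and q.
fixed = dict(unit)
fixed[("p", "q")] = 2
```

With unit weights, p has one more incoming than outgoing edge and q the reverse; setting f(p, q) = 2 is the smallest change that restores d(v) = 0 everywhere in this example.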
Copy number problem
• Let d(v) = indegree - outdegree. Balanced graph: d(v) = 0 for all v.
• Graph G = (V, E, w) with weights w(e) = 1 for all e.
• Copy Number Problem (Pevzner & Tang 2001): For an edge e in G, find a flow minimizing the multiplicity f(e) of e.
• Min-flow Max-cut Theorem: For a directed acyclic graph G = (V, E, w) with lower capacity bounds, the minimum flow from v to w equals the capacity of the maximum cut separating v from w.
• Min-cost circulation (see Myers 2005): Assign cost c(e) = 1 to each edge and minimize Σ c(e) f(e) such that f(e) ≥ w(e) for all e and d(v) = 0 for all vertices.

Next-generation sequence platforms
• 454 – http://www.454.com/enabling-technology/index.asp
• Illumina – http://www.illumina.com/pages.ilmn?ID=203
• ABI SOLiD – solid.appliedbiosystems.com

Polony Sequencing (figure only)

Polony sequencing: Assembly
• Resulting reads are likely to look different than Sanger reads:
  – Short (currently 100 to 200 bp)
  – Low error rates, except in homopolymeric runs (AAA…, CCC…, etc.)
  – Currently, it is not known how to do paired reads on a chip. Maybe very soon!

454 Sequencing (figure only)

Illumina Sequencing (figure only)

Nanopore Sequencing
http://www.mcb.harvard.edu/branton/index.htm

Nanopore Sequencing: Assembly
• Resulting reads are likely to look different than Sanger reads:
  – Long (perhaps 10,000 bp to 1,000,000 bp)
  – High error rate (perhaps 10% to 30%)
  – Two colors? A/CTG, AT/CG, or AG/CT
• How can we assemble under such conditions?

Some future directions for sequencing
1. Personalized genome sequencing
• Find your ~1,000,000 single nucleotide polymorphisms (SNPs)
• Find your rearrangements
• Goals:
  – Link genome with phenotype
  – Provide personalized diet and medicine
  – (???) designer babies, big-brother insurance companies
• Timeline:
  – Inexpensive sequencing: 2010-2015
  – Genotype–phenotype association: 2010-???
  – Personalized drugs: 2015-???

Some future directions for sequencing
2. Environmental sequencing
• Find your flora: organisms living in your body
  – External organs: skin, mucous membranes
  – Gut, mouth, etc.
  – Normal flora: >200 species, >trillions of individuals
  – Flora–disease, flora–non-optimal health associations
• Timeline:
  – Inexpensive research sequencing: today
  – Research & associations: within next 10 years
  – Personalized sequencing: 2015+
• Find diversity of organisms living in different environments
  – Hard to isolate
  – Assembly of all organisms at once

Some future directions for sequencing
3. Organism sequencing
• Sequence a large fraction of all organisms
• Deduce ancestors
  – Reconstruct ancestral genomes
  – Synthesize ancestral genomes
  – Clone: Jurassic park!
• Study evolution of function
  – Find functional elements within a genome
  – How those evolved in different organisms
  – Find how modules/machines composed of many genes evolved

DNA Sequencing: Recap (timeline from 1975 to 2015)
• Gel electrophoresis
  – Predominant, old technology by F. Sanger
• Whole genome strategies
  – Physical mapping
  – Walking
  – Shotgun sequencing
• Computational fragment assembly
• The future: new sequencing technologies
  – Pyrosequencing, single molecule methods, …
  – Assembly techniques
• Future variants of sequencing
  – Resequencing of humans
  – Cancer genome sequencing
  – Microbial and environmental sequencing

Cell Division and Mutation
• Single nucleotide change
• Copy number

Structural Rearrangements in Cancer
1) Change gene structure, create novel fusion genes: Gleevec targets the BCR-ABL fusion
2) Alter gene regulation: Burkitt’s lymphoma
IMAGE CREDIT: Gregory Schuler, NCBI, NIH, Bethesda, MD

Cancer Genomes
• Fusion gene in >50% of prostate cancer patients (Tomlins et al., Science, Oct. 2005)

End Sequence Profiling (ESP)
C. Collins and S. Volik (2003)
1) Pieces of cancer genome: clones (100-250 kb).
2) Sequence ends of clones (500 bp).
3) Map end sequences to human genome.
• Each clone corresponds to a pair of end sequences (ES pair) (x, y) on the human DNA.
• Retain clones that correspond to a unique ES pair.

Valid ES pairs
• Lmin ≤ y – x ≤ Lmax, where Lmin (Lmax) is the minimum (maximum) size of a clone.
• Convergent orientation.

Invalid ES pairs
• Putative rearrangement in cancer
• ES directions point toward breakpoints (a, b): Lmin ≤ |x – a| + |y – b| ≤ Lmax

Sources
• Serafim Batzoglou, http://ai.stanford.edu/~serafim/CS262_2006/ (Sequencing slides)
• http://bioalgorithms.info (Euler slides)