* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Genome Rearrangements, Synteny, and Comparative Mapping
Genomic library wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Non-coding DNA wikipedia , lookup
Ridge (biology) wikipedia , lookup
Saethre–Chotzen syndrome wikipedia , lookup
Gene therapy of the human retina wikipedia , lookup
Oncogenomics wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
X-inactivation wikipedia , lookup
Human genetic variation wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Copy-number variation wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Genomic imprinting wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Genetic engineering wikipedia , lookup
Transposable element wikipedia , lookup
Minimal genome wikipedia , lookup
Human genome wikipedia , lookup
Gene therapy wikipedia , lookup
Pathogenomics wikipedia , lookup
Gene nomenclature wikipedia , lookup
Public health genomics wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
History of genetic engineering wikipedia , lookup
Gene desert wikipedia , lookup
Gene expression profiling wikipedia , lookup
Helitron (biology) wikipedia , lookup
Genome editing wikipedia , lookup
Genome (book) wikipedia , lookup
Microevolution wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genome evolution wikipedia , lookup
Gene expression programming wikipedia , lookup
Genome Rearrangements, Synteny, and Comparative Mapping CSCI 4830: Algorithms for Molecular Biology Debra S. Goldberg Turnip vs Cabbage • Share a recent common ancestor • Look and taste different Turnip vs Cabbage • Comparing mtDNA gene sequences yields no evolutionary information • 99% similarity between genes • These surprisingly identical gene sequences differed in gene order • This study helped pave the way to analyzing genome rearrangements in molecular evolution Turnip vs Cabbage: Different mtDNA Gene Order • Gene order comparison: Turnip vs Cabbage: Different mtDNA Gene Order • Gene order comparison: Turnip vs Cabbage: Different mtDNA Gene Order • Gene order comparison: Turnip vs Cabbage: Different mtDNA Gene Order • Gene order comparison: Turnip vs Cabbage: Gene Order Comparison Before After • Evolution is manifested as the divergence in gene order Transforming Cabbage into Turnip Reversals 1 2 3 9 8 4 7 1, 2, 3, -8, 4, -7, 5, -6, 6, -5, 7, -4, 8, 9, 10 10 6 5 Blocks represent conserved genes. In the course of evolution or in a clinical context, blocks 1,…,10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10. Types of Mutations Types of Rearrangements Inversion/Reversal 1 2 3 4 5 6 1 2 -5 -4 -3 6 Types of Rearrangements Translocation 1 2 3 45 6 1 26 4 53 Types of Rearrangements Fusion 1 2 3 4 5 6 1 2 3 4 5 6 Fission Genome rearrangements Mouse (X chrom.) Unknown ancestor ~ 75 million years ago Human (X chrom.) • What are the similarity blocks and how to find them? • What is the architecture of the ancestral genome? • What is the evolutionary scenario for transforming one genome into the other? Why do we care? SKY (spectral karyotyping) Robertsonian Translocation 13 14 Robertsonian Translocation 13 14 • Translocation of chromosomes 13 and 14 • No net gain or loss of genetic material: normal phenotype. • Increased risk for an abnormal child or spontaneous pregnancy loss Philadelphia Chromosome 9 22 Philadelphia Chromosome 9 22 • A translocation between chromosomes 9 and 22 (part of 22 is attached to 9) • Seen in about 90% of patients with Chronic myelogenous leukemia (CML) Colon cancer Colon Cancer Comparative maps Waardenburg’s Syndrome: Mouse Provides Insight into Human Genetic Disorder • Characterized by pigmentary dysphasia • Gene implicated linked to human chromosome 2 • It was not clear where exactly on chromosome 2 Waardenburg’s syndrome and splotch mice • A breed of mice (with splotch gene) had similar symptoms caused by the same type of gene as in humans • Scientists identified location of gene responsible for disorder in mice • Finding the gene in mice gives clues to where same gene is located in humans Reversals: Example p=12345678 12543678 Reversals: Example p=12345678 12543678 12546378 Reversals and Gene Orders • Gene order represented by permutation p: p = p 1 ------ p i-1 p i p i+1 ------ p j-1 p j p j+1 ----- p n r(i,j) p 1 ------ p i-1 p j p j-1 ------ p i+1 p i p j+1 ----- pn Reversal r ( i, j ) reverses (flips) the elements from i to j in p Reversal Distance Problem • Goal: Given two permutations, find shortest series of reversals to transform one into another • Input: Permutations p and s • Output: A series of reversals r1,…rt transforming p into s, such that t is minimum • t - reversal distance between p and s • d(p, s) = smallest possible value of t, given p, s Sorting By Reversals Problem • Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n ) • Input: Permutation p • Output: A series of reversals r1, … rt transforming p into the identity permutation such that t is minimum • min t =d(p ) = reversal distance of p Sorting by reversals Example: 5 steps Step Step Step Step Step Step 0: p 2 -4 1: 2 3 2: 2 3 3: 2 3 4: -8 -7 5: g 1 2 -3 4 4 4 -6 3 5 5 5 5 -5 4 -8 -8 6 6 -4 5 -7 -7 7 7 -3 6 -6 -6 8 8 -2 7 1 1 1 -1 -1 8 Sorting by reversals Example: 4 steps Step Step Step Step Step 0: p 2 -4 -3 1: 2 3 4 2: -5 -4 -3 3: -5 -4 -3 4: g 1 2 3 5 5 -2 -2 4 -8 -8 -8 -1 5 -7 -7 -7 6 6 -6 -6 -6 7 7 1 1 1 8 8 What is the reversal distance for this permutation? Can it be sorted in 3 steps? Pancake Flipping Problem • Chef prepares unordered stack of pancakes of different sizes • The waiter wants to sort (rearrange) them, smallest on top, largest at bottom • He does it by flipping over several from the top, repeating this as many times as necessary Christos Papadimitrou and Bill Gates flip pancakes Sorting By Reversals: A Greedy Algorithm • Unsigned permutations • Example: permutation p = 1 2 3 6 4 5 • First three elements are already in order • prefix(p) = length of already sorted prefix – prefix(p) = 3 • Idea: increase prefix(p) at every step Greedy Algorithm: An Example • Doing so, p can be sorted 123645 123465 123456 • Number of steps to sort permutation of length n is at most (n – 1) Greedy Algorithm: Pseudocode SimpleReversalSort(p) 1 for i 1 to n – 1 2 j position of element i in p (i.e., pj = i) 3 if j ≠i 4 p p * r(i, j) 5 output p 6 if p is the identity permutation 7 return Analyzing SimpleReversalSort • SimpleReversalSort does not guarantee the smallest number of reversals and takes five steps on p = 6 1 2 3 4 5 : •Step •Step •Step •Step •Step 1: 2: 3: 4: 5: 1 1 1 1 1 6 2 2 2 2 2 6 3 3 3 3 3 6 4 4 4 4 4 6 5 5 5 5 5 6 Analyzing SimpleReversalSort (cont’d) • But it can be sorted in two steps: p = 612345 – Step 1: 5 4 3 2 1 6 – Step 2: 1 2 3 4 5 6 • So, SimpleReversalSort(p) is not optimal • Optimal algorithms are unknown for many problems; approximation algorithms used Approximation Algorithms • These algorithms find approximate solutions rather than optimal solutions • The approximation ratio of an algorithm A on input p is: A(p) / OPT(p) where A(p) -solution produced by algorithm A OPT(p) - optimal solution of the problem Approximation Ratio / Performance Guarantee • Approximation ratio (performance guarantee) of algorithm A: max approximation ratio of all inputs of size n – For algorithm A that minimizes objective function (minimization algorithm): •max|p| = n A(p) / OPT(p) – For maximization algorithm: •min|p| = n A(p) / OPT(p) Adjacencies and Breakpoints p = p1p2p3…pn-1pn • A pair of elements p i and p i + 1 are adjacent if pi+1 = pi + 1 • For example: p=1 9 3 4 7 8 2 6 5 • (3, 4) or (7, 8) and (6,5) are adjacent pairs Breakpoints: An Example There is a breakpoint between any adjacent element that are non-consecutive: p=1 9 3 4 7 8 2 6 5 • Pairs (1,9), (9,3), (4,7), (8,2) and (2,5) form breakpoints of permutation p • b(p) - # breakpoints in permutation p Adjacency & Breakpoints •An adjacency – consecutive •A breakpoint – not consecutive Extend π with π0 = 0 and πn+1 = n+1 adjacencies π=5 6 2 1 3 4 0 5 6 2 1 3 4 7 breakpoints Extending Permutations • Add p 0 =0 and p n + 1=n+1 at ends of p Example: p=1 9 3 4 7 8 2 6 5 Extending with 0 and 10 p = 0 1 9 3 4 7 8 2 6 5 10 Note: A new breakpoint was created after extending Reversal Distance and Breakpoints Each reversal eliminates at most 2 breakpoints. p =2 3 1 4 6 5 0 0 0 0 2 1 1 1 3 3 2 2 1 2 3 3 4 4 4 4 6 6 6 5 5 5 5 6 7 7 7 7 b(p) = 5 b(p) = 4 b(p) = 2 b(p) = 0 This implies: reversal distance ≥ #breakpoints / 2 Sorting By Reversals: A Better Greedy Algorithm BreakPointReversalSort(p) 1 while b(p) > 0 2 Among all possible reversals, choose reversal r minimizing b(p • r) 3 p p • r(i, j) 4 output p 5 return Problem: this algorithm may work forever Strips • Strip: an interval between two consecutive breakpoints in a permutation – Decreasing strip: elements in decreasing order (e.g. 6 5) – Increasing strip: elements in increasing order (e.g. 7 8) 0 1 9 4 3 7 8 2 5 6 10 – Consider single-element strips decreasing except strips 0 and n+1 are increasing Reducing the Number of Breakpoints Theorem 1: If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i.e. b(p • r) < b(p) ) Things To Consider • For p = 1 4 6 5 7 8 3 2 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – Choose decreasing strip with the smallest element k in p (k = 2 in this case) Things To Consider (cont’d) • For p = 1 4 6 5 7 8 3 2 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – Choose decreasing strip with the smallest element k in p (k = 2 in this case) Things To Consider (cont’d) • For p = 1 4 6 5 7 8 3 2 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – Choose decreasing strip with the smallest element k in p (k = 2 in this case) – Find k – 1 in the permutation Things To Consider (cont’d) • For p = 1 4 6 5 7 8 3 2 0 1 4 6 5 7 8 3 2 9 b(p) = 5 – Choose decreasing strip with the smallest element k in p (k = 2 in this case) – Find k – 1 in the permutation – Reverse segment between k and k-1: 0 1 2 3 8 7 5 6 4 9 b(p) = 4 Reducing the Number of Breakpoints Again – If there is no decreasing strip, there may be no reversal r that reduces the number of breakpoints (i.e. b(p • r) ≥ b(p) for any reversal r). – By reversing an increasing strip ( # of breakpoints stay unchanged ), we will create a decreasing strip at the next step. Then the number of breakpoints will be reduced in the next step (theorem 1). Things To Consider (cont’d) • There are no decreasing strips in p, for: p =0 1 2 5 6 7 3 4 8 b(p) = 3 p •r(3,4) = 0 1 2 5 6 7 4 3 8 b(p) = 3 r(3,4) does not change the # of breakpoints r(3,4) creates a decreasing strip thus guaranteeing that the next step will decrease the # of breakpoints. ImprovedBreakpointReversalSort ImprovedBreakpointReversalSort(p) 1 while b(p) > 0 2 if p has a decreasing strip 3 Among all possible reversals, choose reversal r 4 5 that minimizes b(p • r) else Choose a reversal r that flips an increasing strip in p 6 p p • r 7 output p 8 return Performance Guarantee • ImprovedBreakPointReversalSort is an approximation algorithm with a performance guarantee of at most 4 – It eliminates at least one breakpoint in every two steps; at most 2b(p) steps – Approximation ratio: 2b(p) / d(p) – Optimal algorithm eliminates at most 2 breakpoints in every step: d(p) b(p) / 2 – Performance guarantee: ( 2b(p) / d(p) ) [ 2b(p) / (b(p) / 2) ] = 4 Signed Permutations • Up to this point, reversal sort algorithms sorted unsigned permutations • But genes have directions… so we should consider signed permutations 5’ p = 3’ 1 -2 - 3 4 -5 Signed Permutation • Genes are directed fragments of DNA • Genes in the same position but different orientations do not have same gene order • These two permutations are not equivalent gene sequences 1 -1 2 2 3 4 5 -3 -4 -5 Signed permutations are easier! Polynomial time (optimal) algorithm is known Genome rearrangements Mouse (X chrom.) Unknown ancestor ~ 75 million years ago Human (X chrom.) • What are the similarity blocks and how to find them? • What is the architecture of the ancestral genome? • What is the evolutionary scenario for transforming one genome into the other? Genome rearrangements Mouse (X chrom.) Unknown ancestor ~ 75 million years ago Human (X chrom.) • What are the similarity blocks and how to find them? • What is the architecture of the ancestral genome? • What is the evolutionary scenario for transforming one genome into the other? Comparative maps A brief history • Chromosome comparisons – no information about genes • 1920’s: Sturtevant, Weinstein • Today: many organisms, many uses • Humans: – primates, mouse, cat, dog, zebrafish, ... – Alzheimer, cancers, diabetes, obesity, ... Why construct comparative maps? • Identify & isolate genes – Crops: drought resistance, yield, nutrition... – Human: disease genes, drug response,… • Infer ancestral relationships • Discover principles of evolution – Chromosome – Gene family • “key to understanding the human genome” Map construction Go from this to this 3S 8L Maize 1 (target), Rice (base) Wilson et al. Genetics 1999 10L 3L Why automate? • Time consuming, laborious – Needs to be redone frequently • Codify a common set of principles • Nadeau and Sankoff: warn of “arbitrary nature of comparative map construction” Input/Output • Input: – genetic maps of 2 species – marker/gene correspondences (homologs) • Output: – a comparative map • homeologies identified A natural model? Maize 1 Rice 3S 8L 10L 3L Maize 1 (target), Rice (base) Wilson et al. Genetics 1999 Scoring 10L s 3L bcd207a (10L) cdo94b (10L) bcd386a (10L) isu78 (5L) csu77 (10L) cdo98b (10L) rz630e (3L) rz403 (3L) cdo795a (3L) isu92b (3L) m Assumptions • Accept published marker order • All linkage groups of base are unique • Simplistic homeology criteria • At least one homeologous region A natural model? Dynamic programming • li = location of homolog to marker i • S[i,a] = penalty (score) for an optimal labeling of the submap from marker i to the end, when labeling begins with label a 1 ... a i ... n Recurrence relation • S[n,a] = m (a, ln) ... ... a n ln S[i,a] = m (a, li) + min (S[i+1,b] + s (a,b) ) bL a b ... i i+1 li li+1 ... n ln Problem with linear model s=2 a-b-c motif: a a a b a b b b a b score: 2s = 4 c b c c c a-b-a motif: a a a score: 3m = 3 b a a a The stack model d c a b c e f b b • Segment at top of the stack can be: – pushed (remembered), later popped – replaced • Push and replace cost s -- pop is free. Scoring 7L s 9L “free” pop 7L uaz265a isu136 isu151 rz509b cdo59c rz698c bcd1087a rz206b bcd1088c csu40 cdo786a csu154 isu113a csu17 cdo337 rz530a (7L) (2L) (7L) (7L) (7L) (9L) (9L) (9L) (9L) (3S) (9L) (7L) (7L) (7L) (3L) (7L) m m m Dynamic programming • S[i,j,a] = score for an optimal labeling of: – submap from marker i to marker j – when labeling begins with label a -i.e., marker i is labeled a 1 ... a i ... j ... n Recurrence relation • S[i,i,a] = m (a, li) • S[i,j,a] = min: 1 ... a i i+1 ... n • m (a, li) + min (S[i+1,j,b] + s (a,b) ) bL 1 ... a b i i+1 ... • min S[i,k,a] + S[k+1,j,a] i<k<j 1 ... a a i ... k+1 ... j ... n n Advantage: output similar to experts’ Maize 6 (target), Rice (base) Advantage: proposes testable hypotheses • New relations predicted; greater resolution maps confirm Maize 7 (target), Rice (base) Ahn-Tanksley ‘93 Ahn-Tanksley data Wilson et. al. ‘99 Advantage: infers evolutionary events Wilson et al. Maize 1 (target) Rice (base) Stack Problem: Incomplete input • Gene order not always fully resolved. • Co-located genes can be ordered to give most parsimonious labeling. 8p 19p 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 33.0 Atp6b1 (8p) Comp (19) Jak3 (19p) Jund1 (19p) Lpl (8p) Mel (19p) Npy1r (4q) Pde4c (19) Slc18a1 (8p) Srebf1 (17p) = Atp6b1 (8p) Lpl (8p) Slc18a1 (8p) Npy1r (4q) Srebf1 (17p) Comp (19) Jak3 (19p) Jund1 (19p) Mel (19p) Pde4c (19) 8p 19p The reordering algorithm • Uses a compression scheme – Within a megalocus, group genes by location of related gene. – Order these groups – First, last groups interact with nearby genes – Any ordering of internal groups is equally parsimonious The reordering algorithm The reordering algorithm Definitions extended to distance to a set A of labels (a, A) = 0 1 if a A, otherwise li = set of labels matching markers in megalocus i S = set of megalocus start nodes Definitions • p(i,a,b) gives the total mismatched marker and segment boundary penalties attributed to “hidden markers” – i is index for megalocus – a and b are labels for megalocus ends – Do any markers in megalocus match a, b? • No: don’t penalize in recurrence and p(i,a,b) Recurrence relation S[i,i,a] = m (a, li) 1 ... a i i+1 ... n S[i,j,a] = min: m (a, li) + min (S[i+1,j,b] + s (a,b) + p(i,a,b)) bL 1 ... a b i i+1 ... n min S[i,k,a] + S[k+1,j,a] i<k<j k S 1 ... a a i ... k+1 ... j ... n Results: Fewer mismatches stack Mouse 5 (target) Human (base) reordering Results: Mismatches placed between segments stack Mouse 8 (target) Human (base) reordering Results: Detects new segments stack Mouse 13 (target) Human (base) reordering Summary • Global view • Finds optimal comparative map – Arranges markers in most parsimonious way • Biologically meaningful results • Robust – not species-specific – high/low resolution, genetic/physical maps – stable to errors in marker order