
Computational Molecular Biology
Multiple Sequence Alignment

My T. Thai [email protected]

Sequence Alignment
• Problem Definition:
  • Given: 2 DNA or protein sequences
  • Find: Best match between them
• What is an Alignment:
  • Given: 2 strings S and S'
  • Goal: Make the lengths of S and S' the same by inserting spaces (--; sometimes denoted ∆) into these strings

  A  -- T  C  -- A
  -- C  T  C  A  A

Matches, Mismatches and Indels
• Match: two aligned, identical characters in an alignment
• Mismatch: two aligned, unequal characters
• Indel: a character aligned with a space

  A  A  C  T  A  C  T  -- C  C  T  A  A  C  A  C  T  -- --
  -- -- C  T  C  C  T  A  C  C  T  -- -- T  A  C  T  T  T

  10 matches, 2 mismatches, 7 indels

Basic Algorithmic Problem
• Find the alignment of the two strings that:
  • maximizes m, where m = (# matches − # mismatches − # indels)
  • or minimizes m, where m is the SP-score of the alignment
• m defines the similarity of the two strings; the problem is also called Optimal Global Alignment
• Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character

Multiple Sequence Alignment
• Problem Definition:
  • Similar to the sequence alignment problem, but the input has more than 2 strings
• Challenges:
  • NP-hard, MAX-SNP-hard
  • Approximation guarantee factor: 2 − 2/k, where k is the number of input sequences
  • More work is needed to reduce the time and space complexity

Sum of Pairs Score (SP-Score)
• Given a finite alphabet Σ and the extended alphabet Σ ∪ {∆}, where ∆ denotes a space
• Consider k sequences over Σ that we want to align. After an alignment, each sequence has length l
• A score d is assigned to each pair of letters.

SP-Score
• The SP-Score of an alignment A is defined as follows:
  • Consider a matrix of l columns and k rows, where the rows represent the sequences and the columns represent the letters
  • The SP-Score is the sum of the scores of all columns
  • The score of each column is the sum of the scores of all distinct unordered pairs of letters in the column
  • Equivalently, it can be viewed as the sum of the pairwise sequence alignment values
• Find an (optimal) alignment that minimizes the SP-Score

Proving that MSA with a Metric SP-Score is NP-hard

Some Notations

Some Basic Properties
• Lemma 1: Let s1, s2 be two sequences over Σ such that l1 = |s1|, l2 = |s2|, l2 ≥ l1, and there are m symbols of s1 that are not in s2. Then every alignment of the set {s1, s2} has at least m + l2 − l1 mismatches

The Construction
• Reduce vertex cover (or node cover) to MSA.
• Vertex cover:
  • Instance: A graph G = (V, E) and an integer k ≤ |V|
  • Question: Is there a vertex cover V1 of G of size k or less?
• MSA:
  • Instance: A set S = {s1, …, sn} of finite sequences over a fixed alphabet Σ, an SP-score, and an integer C
  • Question: Is there a multiple alignment of the sequences in S that has value C or less?

SP-Score (alphabet of size 6)

The Reduction
• The constructed set of sequences includes T, a set of C2 sequences t, and X, which contains C1 sequences x(k), where C1 and C2 will be determined later

An Example

Intuition
• By the above construction, an optimal alignment A of S is obtained when A satisfies certain properties (such an alignment is called a standard alignment)
• The value of a standard alignment is bounded by a given threshold C only when G has a vertex cover of size k
• How to obtain this:
  • Force the d's of the test sequences to be aligned with the b's of the edge sequences
  • Only one b of each edge sequence can be aligned to a d
  • The number of such alignments determines the value of the alignment

Standard Alignment
• Let U_S and U_{S,X} denote the upper bounds of D(A_S) and D(A_{S,X}), respectively
• By Corollary 8 and Lemma 9, the standard alignment has value not greater than D_SD + U_S + U_{S,X}
  • where D_SD = D(A_X) + D(A_T) + D(A_{X,T}) + D(A_{S,T}) over a standard alignment A
• Now, letting C1 > U_S and C2 > U_S + U_{S,X}, we can prove that an optimal alignment must be a standard one

• Show the NP-hardness of MSA for any scoring matrix in a broad class M
• Show that there is a scoring matrix M0 such that MSA for M0 is MAX-SNP-hard

Interesting Observation
• By brute force, an optimal MSA is found to contain very few gaps
• This suggests the study of gap limitations:
  • Impose an upper bound on the number of gaps one can insert during the alignment
• Special cases:
  • Gap-0: No gaps are allowed, but we can shift the strings for an alignment (insert gaps at the beginning or at the end of a string)
  • Gap-0-1: a gap-0 alignment in which the gap at the beginning or at the end of each string is exactly one space

Problem Definition
• Given a finite alphabet Σ = {a1, …, aw}
• Scoring matrix M = (s_{i,j}) of size (w+1) × (w+1), with 0 ≤ i, j ≤ w
  • For i, j > 0, s_{i,j} represents the penalty for aligning ai with aj
  • For i > 0, s_{0,i} and s_{i,0} are called indel penalties
  • Gap opening penalties (in addition to the indel penalties) apply when aligning ai with the first or last ∆ in a string of ∆'s

Generic Scoring Matrix
• In the generic scoring matrix M, Σ = {A, T}, x, y, z are fixed nonnegative numbers, and u > max{0, vA, vT} holds
• Let M2 be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M
• Let M1 be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with z > vT
• Let M be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix M with y > u and z > vT

Theorem 1:
(a) The gap-0-1 multiple alignment problem is NP-hard for every scoring matrix M in M2.
(b) The gap-0 multiple alignment problem is NP-hard for every M in M1.
(c) The multiple alignment problem is NP-hard for every M in the class M.
Note that the class M is quite broad and covers most scoring schemes used in biological applications.

Reduction
• Reduce from MAX-CUT-B:
  • Given G = (V, E), where k = |V| and each vertex has degree at most B
  • Find a partition of V into two disjoint sets that maximizes the number of edges crossing these two sets
• Given a graph G = (V, E) with k vertices v0, …, v_{k−1} and l edges e0, …, e_{l−1}, we construct a set of k² sequences t0, …, t_{k²−1} as follows:

Reduction
• For each vertex vi, construct a sequence ti such that
  • for each edge em = {vh, vi} incident at vi, h < i, and each n < k⁵, the character t_{i,j} at the designated position j is set [equation not recovered], where t_{i,j} represents the character at the j-th position in ti
  • For all other j, let t_{i,j} = T
• For i ≥ k, set ti = T T T … T, with length k¹²l

An Example

Proof of Theorem 1(a)
• We will show that a gap-0-1 alignment partitions V into two disjoint subsets V0 and V1:
  • V0: all vertices vi such that ti remains in place (a space is appended at the end)
  • V1: all vertices vi such that ti shifts to the right
• Thus, based on the alignment, we can find the cut. Conversely, based on the cut, we can find the alignment
• What remains is to prove that, if k is sufficiently large, the optimal gap-0-1 alignment yields a partition of V with maximum edge cut.

Proof of Theorem 1(a)
• Let c denote the size of the cut determined by the alignment A
• Consider all the sequences ti after the alignment A:
  • The total indel penalty is O(k⁴) (indels appear only in the first and last columns of the SP-score matrix)
  • The total number of mismatches before the alignment is 3k⁵l(k² − 1)
  • To maximally reduce this number:
    • 1 A-A match removes 2 A-T mismatches
    • For each edge (vh, vi), if its endpoints are in different subsets of the partition, then a total of k⁵ A-A matches between sequences th and ti are created
    • No other A-T mismatches can be eliminated
• Thus the SP-score is:
  k¹²l · vT · k²(k² − 1)/2 + 3k⁵l(u − vT)(k² − 1) − c·k⁵(2u − vA − vT) + O(k⁴)

Theorem 2
Consider the following scoring matrix M0 for the alphabet Σ0 = {A, T, C}.
(a) The gap-0-1 MSA problem is MAX-SNP-hard
(b) The gap-0 MSA problem is MAX-SNP-hard
(c) The MSA problem is MAX-SNP-hard

MAX-SNP-hard Proof
• To prove that a problem A' is MAX-SNP-hard, we need to L-reduce a problem A, which is known to be MAX-SNP-hard, to A'
• L-reduction:
  • There are two polynomial-time algorithms f, g and constants a, b > 0 such that for each instance I of A:
    • f produces an instance I' = f(I) of A' such that OPT(I') ≤ a·OPT(I)
    • Given any solution of I' with cost c', g produces a solution of I with cost c such that |c − OPT(I)| ≤ b·|c' − OPT(I')|

Proof of Theorem 2
• To prove that MSA (with the scoring matrix M0 given before) is MAX-SNP-hard:
  • L-reduce MAX-CUT-B to another optimization problem, called A', which in turn L-reduces to a scaled version of MSA
• Problem A':
  • Given a graph G = (V, E) with bounded degree B. For every partition P = {V0, V1}, let cp be the size of the cut determined by P.
  • Find the partition P of V that minimizes dp = 3|E| − 2cp

Show A' is MAX-SNP-hard
• Let f and g be identity functions
• Setting a = 3B and b = 2, we can easily prove the two properties of the L-reduction, since:
  • cp ≥ |E|/B and dp = 3|E| − 2cp ≤ 3|E|
  • Any increase of cp by 1 decreases dp by 2

Show A' L-reduces to scaled MSA
• The construction is similar to the one above.
• Similar to the proof of Theorem 1, we obtain the optimal SP-score.
• If the SP-score is scaled by a factor of k^(−5/2) for an MSA of k sequences, then A' L-reduces to MSA.

GENETIC ALGORITHMS

How do GAs work?
• Create a population of random solutions
• Use natural selection:
  • crossover and mutation to improve the solutions
• Stop when some criterion is satisfied, such as:
  • No improvement in the fitness function
  • The improvement is less than a certain threshold
  • The number of iterations exceeds a certain threshold

Terms and Definitions
• Chromosomes
  • Potential solutions
• Population
  • Collection of chromosomes
• Generations
  • Successive populations
• Crossover
  • Exchange of genes between two chromosomes
• Mutation
  • Random change of one or more genes in a chromosome
• Elitism
  • Copy the best solutions without doing crossover or mutation
• Offspring
  • New chromosome created by crossover between two parent chromosomes
• Fitness function
  • Measures how "good" a chromosome is
• Encoding scheme
  • How do we represent every chromosome/gene?
  • Binary, combination, syntax trees.

Why are GAs attractive?
• No need for a problem-specific algorithm to solve the given problem. Only the fitness function is required to evaluate the quality of the solutions.
• Implicitly a parallel technique that can be implemented efficiently on powerful parallel computers for demanding large-scale problems.

Basic Outline of a GA
• The initial population is composed of random chromosomes and is called the first generation
• Evaluate the fitness of each chromosome in the population
• Create a new population:
  • Select two parent chromosomes from the population according to their fitness
  • Crossover (with some probability) to form a new offspring
  • Mutation (with some probability) to mutate the new offspring
  • Place the new offspring in the new population
• The process is repeated until a satisfactory solution evolves

Operations
Mutation Operation:
• Modify a single parent
• Try to avoid local minima

Let's see some running examples
• Minimum of a function:
  • http://cs.felk.cvut.cz/~xobitko/ga/example_f.html
• Elitism:
  • http://cs.felk.cvut.cz/~xobitko/ga/params.html
• The travelling salesman problem:
  • http://cs.felk.cvut.cz/~xobitko/ga/tspexample.html

Multiple Sequence Alignment
• The fitness function is used to compare the different alignments
  • Based on the number of matching symbols and the number and size of gaps
  • Also called the cost function
• Different weights for different types of matches
• Gap costs
• The fitness function
  • can be simple and count the total number of matching symbols
  • can be complicated and consider the type of matching symbols, location in the sequence, neighboring symbols, etc.

Approximation Algorithms

Scoring method
• Score zero for a match or for two opposing spaces
• Score one for a mismatch or for a character opposite a space
Assumptions:
• Assume that two opposing spaces have value zero
• Assume the other values satisfy the triangle inequality
  • s(x,z) ≤ s(x,y) + s(y,z)
  • s(x,z) is the cost of transforming character x into character z

Objective Functions
• Two objective functions
• SP
  • The sum of the values of the pairwise alignments induced by an alignment A
• TA
  • Using the topology of the tree, map the strings to the nodes of the tree
  • The sum of the selected pairwise alignment values is called the tree alignment value

Center Star Method
• For a set X of k strings
• Choose a center string Xc of X that minimizes Σj≠c D(Xc,Xj)
• Let M = min Σj≠c D(Xc,Xj)
• The center star is a star tree of k nodes with the center node labeled Xc and each of the k−1 remaining nodes labeled by a distinct string in X \ {Xc}
• If Xi and Xj are strings labeling adjacent nodes of a tree T, then the alignment of Xi and Xj induced by A(T) has value D(Xi,Xj)

Center Star Method – Alg Ac
• Compute an optimal alignment for each pair (Xc, Xj), for all j ≠ c
• Over all of these pairwise alignments, let:
  • s0 = maximum number of spaces placed before the first character of Xc
  • sf = maximum number of spaces placed after the last character of Xc
  • si = maximum number of spaces placed between Xc(i) and Xc(i+1)
• For Xc, insert s0, si, and sf spaces at the beginning, between Xc(i) and Xc(i+1), and at the end of Xc, respectively. Call the result X'c
• Then, for each Xj, compute the optimal alignment with X'c without modifying X'c

Analysis
• Let d(Xi,Xj) be the value of the pairwise alignment of Xi and Xj induced by Ac, and D(Xi,Xj) the value of their optimal pairwise alignment
• d(Xi,Xj) ≥ D(Xi,Xj)
• V(Ac) = Σi<j d(Xi,Xj)
• V(Ac) is at most twice the value of the optimal multiple alignment of X
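
The center star procedure Ac can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: it assumes the unit-cost scoring described above (0 for a match or two opposing spaces, 1 otherwise), uses "-" as the space symbol, assumes at least two input strings that do not themselves contain "-", and merges the pairwise alignments by padding with extra spaces ("once a gap, always a gap") instead of re-aligning each Xj against X'c. The names align_pair, center_star, and sp_value are hypothetical helpers introduced only for this sketch.

```python
from itertools import combinations

GAP = "-"  # assumed space symbol; input strings must not contain it


def align_pair(a, b):
    """Optimal global alignment under unit costs: (cost, a_aligned, b_aligned)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                           dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1)
    # Trace back one optimal alignment.
    i, j, ra, rb = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ra.append(a[i - 1]); rb.append(GAP); i -= 1
        else:
            ra.append(GAP); rb.append(b[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(ra)), "".join(reversed(rb))


def gap_profile(center_aligned):
    """Number of spaces before each center character, plus the trailing run."""
    profile, run = [], 0
    for ch in center_aligned:
        if ch == GAP:
            run += 1
        else:
            profile.append(run)
            run = 0
    profile.append(run)
    return profile


def center_star(strings):
    """Return (index of center string, list of aligned rows)."""
    k = len(strings)
    cost = [[0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        cost[i][j] = cost[j][i] = align_pair(strings[i], strings[j])[0]
    c = min(range(k), key=lambda i: sum(cost[i]))  # minimizes sum of D(Xc, Xj)
    center = strings[c]
    pairs = {j: align_pair(center, strings[j])[1:] for j in range(k) if j != c}
    profiles = {j: gap_profile(ca) for j, (ca, _) in pairs.items()}
    # slots[s] plays the role of s0, si, sf: the maximum over all pairwise alignments.
    slots = [max(p[s] for p in profiles.values()) for s in range(len(center) + 1)]

    rows = [None] * k
    rows[c] = "".join(GAP * slots[s] + center[s:s + 1] for s in range(len(center) + 1))
    for j, (_, xa) in pairs.items():
        out, pos = [], 0
        for s in range(len(center) + 1):
            own = profiles[j][s]
            out.append(xa[pos:pos + own])         # characters opposite spaces in Xc
            out.append(GAP * (slots[s] - own))    # padding up to the common width
            pos += own
            if s < len(center):
                out.append(xa[pos])               # column holding center[s]
                pos += 1
        rows[j] = "".join(out)
    return c, rows


def sp_value(rows):
    """SP value under unit costs; two opposing spaces cost 0."""
    return sum(a != b for r1, r2 in combinations(rows, 2) for a, b in zip(r1, r2))
```

Because padding only adds columns in which both Xc and Xj hold a space, the alignment of Xc with each Xj induced by the result keeps its optimal value D(Xc,Xj), which is the property the analysis relies on.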

Analysis
• Lemma 3.1: For any 2 strings Xi, Xj, we have:
  d(Xi,Xj) ≤ d(Xi,Xc) + d(Xc,Xj) = D(Xi,Xc) + D(Xc,Xj)
  • This follows from the triangle inequality

Analysis
• Let A* be the optimal multiple alignment of the k strings X
• Define: V(A*) = Σi<j d*(Xi,Xj)

Analysis
• Theorem 3.1: V(Ac) / V(A*) ≤ 2(k−1)/k < 2

Disadvantages
• Requires all pairwise alignments
• Computationally expensive
• Faster alternative: randomized alignments
  • Randomly select a string Xi
  • Build a multiple alignment with the star centered at Xi
  • Select the best multiple alignment A from p such stars
  • At most (k−1)p pairwise alignments need to be computed

Randomized Alignments
• Theorem 3.2: For any r > 1, let e(r) be the expected number of stars that need to be chosen at random before the value of the best resulting alignment is within a factor of 2 + 1/(r−1) of the optimal alignment. Then e(r) ≤ r.
• e(r) is independent of k and of the length of the strings.

Proof of Theorem 3.2
• For r = 2: for each string Xi define M(i) = Σj D(Xi,Xj); then M(c) = M
• From Theorem 3.1, Σ(i,j) D(Xi,Xj) = Σi M(i) ≤ 2(k−1)M, so the average value of M(i) is less than 2M
• Since min M(i) = M, the median of M(i) is less than 3M
• The expected number of centers selected before one has M(i) at most the median is 2

Proof
• Suppose the median is ∂M for 1 ≤ ∂ ≤ 3. Then Σ(i,j) D(Xi,Xj) ≥ kM/2 + k∂M/2
• The value of the alignment obtained from any below-median star is ≤ 2(k−1)∂M
• Therefore, the error ratio for this star is ≤ 2∂ / (1/2 + ∂/2)
• When ∂ = 3, the error ratio is 3.
• So we have e(2) ≤ 2

Proof
• Now generalize this proof for r > 2
• At least k/r stars have M(i) less than or equal to (2r−1)M/(r−1)
  • The minimum M(i) is M
  • The mean is less than 2M
• The expected number of stars to pick before one has M(i) < ∂M is r, for 1 ≤ ∂ ≤ (2r−1)/(r−1)
• Error ratio = 2∂ / [1/r + (r−1)∂/r]
• (2r−1)/(r−1) = 2 + 1/(r−1)

Theorem 3.3
• Picking p stars at random, the best resulting alignment will have value within a factor of 2 + 1/(r−1) of the optimal with probability at least 1 − [(r−1)/r]^p

Center Star Method
• Proof
  • From Theorem 3.2, suppose the median value was actually 3M
  • Then M(i) = M for half the stars and M(i) = 3M for the other half
  • Σ(i,j) D(Xi,Xj) = 2kM
  • The optimal SP alignment can be obtained from any center string Xi with M(i) = M
  • The probability of selecting such a string is one-half

Tree Alignment Method
• Typical approach:
  • first find a multiple alignment and then build a tree showing the evolutionary derivations
• Another approach (called tree alignment):
  • first choose the topology of the tree and then map the strings to the nodes of the tree
  • The alignment consists of the pairwise alignments of the strings at the ends of the edges of the tree

Formal Definitions
• Let K be an input set of k strings
• Let K' ⊇ K be a set of strings containing K
• An evolutionary tree TK' for K is a tree:
  • with at least k nodes
  • in which each string in K' labels exactly one node and each node gets exactly one label in K'
• The value of TK' is V(TK') = Σ D(X,Y), summed over the edges (X,Y) of the tree
• The problem is to find a set of strings K' and a tree TK' for K that minimizes V(TK')
• The alignment value D(X,Y) is interpreted as the minimum "cost" to transform string X into string Y
• The sum of the alignment values of the edges gives the evolutionary cost implied by the tree.

Method
• Let G be a graph with k nodes, each labeled with a distinct string in K
• Each edge (X,Y) has weight D(X,Y)
• Find the MST of G. This MST is an evolutionary tree for K

Analysis
• Let T* denote the optimal evolutionary tree for K.
• To prove: V(MST)/V(T*) < 2
• Let C be a traversal of the edges of T* that traverses every edge exactly once in each direction
• Let C1, …, Ck be the order in which C encounters the strings of K
• Let V(C) = D(Ck,C1) + Σi<k D(Ci,Ci+1)

Analysis
• Corollary 4.1: V(C) ≤ 2V(T*)
• Let D(Ci*,Ci*+1) be the largest distance between any adjacent strings in the traversal C
• Lemma 4.2: V(MST) ≤ V(C) − D(Ci*,Ci*+1) ≤ V(C) − V(C)/k

Analysis
• Theorem 4.1: For any set K of k strings, we have V(MST)/V(T*K) ≤ 2(k−1)/k < 2
• Theorem 4.2: V(MST)/V(T*K) ≤ ((k−1)/k) · V(C)/V(T*K) ≤ 2(k−1)/k
• Corollary 4.2: V(T*K) ≥ k·V(MST) / (2(k−1))

Constrained MSA

Motivation
General SP MSA problem:
• NP-completeness has already been established
• Approximation algorithms have been developed
• Heuristics are also available
Constrained MSA:
• Biologists often have additional knowledge of the data (e.g. active site residues)
• This additional knowledge can specify matches at certain locations
• Models allow users to provide additional constraints

Definition of CMSA Problem
• Suppose that P = p1p2 … pα is a common subsequence of S1, S2, …, Sk
• The constrained multiple sequence alignment of S with respect to P is:
  • an MSA A with the constraint that there are α columns in A, c1, c2, …, cα with c1 < c2 < … < cα, such that the characters of column ci, 1 ≤ i ≤ α, are all equal to pi.

Optimal CPSA Dynamic Algorithm

Time and Space Complexities

CMSA
The improvement of CPSA in turn improves the time and space complexity of Progressive CMSA from O(αkn⁴) and O(αn⁴) to O(αk²n²) and O(αn²).

Optimal CMSA
This Optimal CMSA algorithm involves the creation of a matrix with k+1 dimensions. (Assume δ(x,y) is the distance function and satisfies the triangle inequality.)
• Let D(i1, …, ik; γ) be the optimal CMSA score for {S1[1..i1], …, Sk[1..ik]} where P[1..γ] is aligned in γ columns.
• Then the optimal alignment score is D(n1, …, nk; α), where ni = |Si|.
Computing D:
• D({0}^k; 0) = 0
• Let εj ∈ {0, 1}, where εjSj[ij] denotes Sj[ij] if εj = 1 and a space if εj = 0, and let δ(x1, …, xk) = Σ1≤i<j≤k δ(xi, xj).
• D(i1, i2, …, ik; γ) is the minimum of:
  • if S1[i1] = … = Sk[ik] = P[γ]: D(i1 − 1, …, ik − 1; γ − 1) + δ(S1[i1], …, Sk[ik])
  • min over ε ∈ {0,1}^k, ε ≠ 0^k, of D(i1 − ε1, …, ik − εk; γ) + δ(ε1S1[i1], …, εkSk[ik])
These values can be computed using dynamic programming.

CMSA (Center Star)
The Center-Star method proposed for the general MSA problem can be modified to apply to the CMSA problem.
• Consider each sequence as the center, Sc. Consider each list of positions at which Sc is aligned with P.
• Find the center Sc with the minimum star-sum score.
• Create a constrained alignment matrix by merging the constrained pairwise sequence alignments between Sc and Sj.

CMSA (Center Star)
The recurrence of Thm. 3.1 is only slightly modified:

Example
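
For k = 2 the Optimal CMSA recurrence above reduces to constrained pairwise sequence alignment, which can be sketched as a three-dimensional dynamic program. This is a minimal illustration under stated assumptions, not the optimized CPSA algorithm referred to in the slides: the function name cpsa_score, the "-" space symbol, and the default unit-cost δ are choices made only for this sketch, and the table uses O(n1·n2·α) time and space.

```python
GAP = "-"  # assumed space symbol


def unit_delta(x, y):
    """Assumed distance: 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1


def cpsa_score(s1, s2, p, delta=unit_delta):
    """Optimal constrained pairwise alignment score D(n1, n2; alpha).

    Returns float('inf') if p cannot be placed, i.e. p is not a common
    subsequence of s1 and s2.
    """
    inf = float("inf")
    n1, n2, alpha = len(s1), len(s2), len(p)
    # D[i][j][g]: best score for s1[:i], s2[:j] with p[:g] aligned in g columns.
    D = [[[inf] * (alpha + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for g in range(alpha + 1):
                if i == 0 and j == 0 and g == 0:
                    continue
                best = inf
                # Constrained column: both characters equal P[g].
                if i and j and g and s1[i - 1] == s2[j - 1] == p[g - 1]:
                    best = D[i - 1][j - 1][g - 1] + delta(s1[i - 1], s2[j - 1])
                # Ordinary columns: epsilon = (1,1), (1,0), (0,1).
                if i and j:
                    best = min(best, D[i - 1][j - 1][g] + delta(s1[i - 1], s2[j - 1]))
                if i:
                    best = min(best, D[i - 1][j][g] + delta(s1[i - 1], GAP))
                if j:
                    best = min(best, D[i][j - 1][g] + delta(GAP, s2[j - 1]))
                D[i][j][g] = best
    return D[n1][n2][alpha]
```

With an empty constraint P the function returns the ordinary optimal pairwise alignment score; a non-empty P can only keep the score the same or increase it, since it restricts the set of admissible alignments.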
