
Bioinformatics Algorithms and Data Structures
Chapter 11.8: Gaps
Lecturer: Dr. Rose
Slides by: Dr. Rose
February 6, 2003
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology

Gaps
• Our investigation of alignment has focused on:
  – Matches
  – Mismatches
  – Spaces
• An important concept is that of gaps.
• Defn. A gap is a maximal consecutive run of spaces in a single string of a given alignment.
• Q: Can a single space be a gap?
• A: Yes, if there are no adjacent spaces.

• Gaps can occur:
  – Before the first character of a string
  – After the last character of a string
  – Inside a string
• Example: c t g c g g g - - - g g t a a a t - g c g g - a g a g g - a a a
• Q: How many gaps are there?
• A: 5

• Q: Other than our recognition of gaps, did the preceding example show anything new?
• A: No.
• Q: Then what motivates the introduction of this concept?
• A: We can include a gap term in the objective function for computing alignment.
• So???
• So we can influence the distribution of gaps.

• Analogy: specifying the location of the hole is critical to donut making; otherwise you'll end up with a berliner.
• Example: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - k W_g$
• In this objective fcn, each gap contributes the constant weight Wg irrespective of the gap length.
• The variable k is the number of gaps in the alignment.

• Recall, a space in an alignment corresponds to an insertion or deletion in the edit transcript.
• A gap corresponds to an atomic insertion or deletion of an entire substring.
• Biologically, mutations are such atomic events.
  – A single mutation can create a gap.
  – The size of the gap can vary over a large range with equal likelihood.

Sources of mutation mentioned by the textbook:
  – Unequal cross-over in meiosis → insertion in one string and corresponding deletion in the other.
    • http://www4.ncsu.edu:8030/unity/users/b/bnchorle/www/
  – DNA slippage: slippage during replication, resulting in the repetition of a substring
  – Retrovirus insertions
  – Translocations of DNA between chromosomes

• Common gaps in aligned strings can be used to deduce evolutionary history
  – Mutations at the single character level are frequent.
  – Does anybody know what these are called?
    → This makes it difficult to determine evolutionary relationships at the DNA sequence level.
  – Large gaps occur less frequently.
    → Gap features can be used to recognize similarity over long periods of time.
  – See Figure 11.6 for an example of a gap as an alignment feature.

• Consider:
  – An alignment should reflect the cost of mutational events transforming one string into the other.
  – A single mutation can produce a gap of more than one space.
• Consequently:
  – The distribution of spaces into gaps should follow a plausible model.
  – Gap weights should be modeled to reflect biological meaning.

Motivation: cDNA Matching
• Preliminaries:
  – A single gene comprises exons and introns
  – Exons are the coding parts of the gene
  – Introns are the noncoding parts between exons
• Gene expression:
  – RNA is transcribed from DNA
    • DNA:A → RNA:U (uracil)
    • DNA:C → RNA:G
    • DNA:G → RNA:C
    • DNA:T → RNA:A

• Gene expression continued:
  – RNA is transformed into mRNA (messenger RNA)
    • The introns are excised
    • The remaining exons are concatenated
  – The resulting mRNA leaves the cell nucleus
  – A ribosome:
    • Translates the mRNA into the corresponding protein by
      – parsing the mRNA into codons
      – assembling amino acids in the order specified by the codons.
    • The resulting sequence of amino acids is the protein

• Imagine that you have the mRNA for a protein and want to find the corresponding gene.
  – (Wet biology) Take the mRNA and create complementary DNA (cDNA).
    • Map mRNA:U → cDNA:A
  – Note: cDNA differs from DNA in several respects
    • cDNA does not contain intron substrings
    • The nucleotides in cDNA complement the nucleotides in the corresponding DNA, i.e., A↔T and C↔G

  – (Wet biology) Hybridize the cDNA with the DNA
    • In hybridization, complementary nucleotides try to match up, i.e., A↔T and C↔G
    • Sections of the cDNA will hybridize with the corresponding sections of DNA.
    • The non-hybridizing segments are gaps
    • Possibly corresponding to introns

• Now imagine that you have the mRNA sequence for a protein and want to find the corresponding gene without doing wet biology.
  – Take the mRNA and create complementary DNA (cDNA).
    • Map mRNA:U → cDNA:A with a computer
    • While we are at it, compile a library of each cDNA string that we create for future use.

  – Align (hybridize) the cDNA with the DNA
    • We assume that the relevant genome has been sequenced.
    • We have a short string (cDNA) and a very long string, the genome.
    • Align complementary nucleotides in the two strings, i.e., A↔T and C↔G
    • Sections of the cDNA will align (hybridize) with the corresponding sections of the genome.
    • The non-aligning (non-hybridizing) segments are gaps
    • Possibly corresponding to introns

• Q: What kind of objective fcn do we need to align cDNA with DNA?
• Features:
  – Small penalties for spaces
    • Q: Why does this matter?
    • A: Large penalties would force the cDNA to bunch up, not allowing gaps for introns

• Features: continued
  – Large penalties for mismatches
    • Some mismatches are unavoidable (sequencing error)
    • Long sequences of mismatches must be avoided
  – Positive values for matches
    • We want to reward exon matches
  – Gap penalties

• Gap penalties
  – Q: Assume matches score positive (+), mismatches strongly negative (--), and spaces mildly negative (-). What happens if there is no gap penalty?
  – A: The alignment would be the longest common subsequence.
    → Match of ALL characters in the cDNA string
    → Match of cDNA with noncoding DNA
    → Tells us nothing about the position of the exons

• Gap penalties continued
  – Soln: augment the objective fcn with a gap term
  – Complication: pseudogenes

Motivation: Pseudogenes
• Pseudogenes
  – Nonworking inexact copy of a gene
  – Conceptually:
    • a trial gene not ready for prime time or
    • a failed gene mutation
  – The pseudogene may be very far from the actual gene

• Pseudogenes: processed pseudogenes
  – Contain only exon substrings
  – Introns have been removed & exons concatenated
  – Theory: mRNA that is transcribed back into DNA and inserted at a random position.
  – Problem:
    • Assume the DNA might contain the pseudogene & the working gene
    • How can processed pseudogenes be located?

Gap Weights
• Q: What types of gap weight can we choose from?
• A: The textbook lists four general types:
  – Constant gap weight
  – Affine gap weight
  – Convex gap weight
  – Arbitrary gap weight

• Constant gap weight: simplest
  – No cost for an individual space
  – Each gap is assigned a constant weight Wg
  – Operator-weight objective fcn: Wm(#matches) - Wms(#mismatches) - Wg(#gaps)
    where Wm = match weight & Wms = mismatch weight
  – Alphabet-weight objective fcn: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - W_g(\#\text{gaps})$
    Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet.

• Affine gap weight
  – Extend the constant gap weight with a charge for each space, Ws.
    • Wg is the gap initiation charge
    • Ws is the gap extension charge
  – Gap weight is given by the affine function Wg + qWs, where q is the number of spaces in the gap.
  – Operator-weight objective fcn: Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)

  – Affine alphabet-weight objective fcn: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - W_g(\#\text{gaps}) - W_s(\#\text{spaces})$
    Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet.
  – An important question is what the values of Wg and Ws should be.
    • Obviously, this is related to the similarity matrix, s().
    • Textbook says FASTA uses Wg = 10 and Ws = 2 for protein sequences

• Convex gap weight
  – Idea: each additional space contributes less than the one before it
  – Example: $W_g + \log_e q$
  – Longer gaps are still penalized more, but only slightly

Arbitrary Gap Weight
• Arbitrary gap weight
  – The gap weight is an arbitrary function, w(q), of the gap length q.
  – Obviously, the preceding weight types are special cases of the arbitrary gap weight model

Arbitrary gap weight recurrence:
Three types of alignments for S1[1..i] and S2[1..j]
1. S1(i) aligns to the left of S2(j) → S1 ends with a gap.
   • Let E(i, j) be the maximal value for alignment case 1.
2. S1(i) aligns to the right of S2(j) → S2 ends with a gap.
   • Let F(i, j) be the maximal value for alignment case 2.
3. S1(i) coaligns with S2(j).
   • Let G(i, j) be the maximal value for alignment case 3.
Let V(i, j) be the maximal value of E(i, j), F(i, j), & G(i, j).

We have the following recurrences:
• V(i, j) = max[E(i, j), F(i, j), G(i, j)],
• G(i, j) = V(i - 1, j - 1) + s(S1(i), S2(j)), where S1(i), S2(j) are co-aligned.
• $E(i, j) = \max_{0 \le k \le j-1} [V(i, k) - w(j - k)]$ → S1 ends with a gap.
• $F(i, j) = \max_{0 \le l \le i-1} [V(l, j) - w(i - l)]$ → S2 ends with a gap.

The base conditions are:
V(i, 0) = -w(i), V(0, j) = -w(j),
E(i, 0) = -w(i), F(0, j) = -w(j),
G(0, 0) = 0, but G(i, j) is undefined if exactly one of i or j is 0.
If end spaces are free then end gaps are free and: V(i, 0) = 0, V(0, j) = 0

Up until this point all dynamic programming examples have had complexity $O(nm)$.
Q: What is the complexity of computing V(i, j)?
A: $O(nm^2 + n^2m)$
Q: Why does the consideration of gaps require $O(nm^2 + n^2m)$?
A: Previous computations depended only on the 3 adjacent cells. Considering gaps entails considering all preceding cells in the row and column.

Thm. If |S1| = n and |S2| = m, the recurrences can be solved in $O(nm^2 + n^2m)$ time.
Proof. (n+1)(m+1) cells in the table are filled. To fill cell (i, j):
  – E(i, j) examines j cells of row i: $\max_{0 \le k \le j-1} [V(i, k) - w(j - k)]$.
    A row entails $m(m+1)/2 = O(m^2)$ work to evaluate E for that row.
  – F(i, j) examines i cells of column j: $\max_{0 \le l \le i-1} [V(l, j) - w(i - l)]$.
    A column entails $n(n+1)/2 = O(n^2)$ work to evaluate F for that column.
  – G(i, j) examines one other cell.
Since there are n rows and m columns, the total is $O(nm^2 + n^2m)$.

Affine Gap Weight
• $O(nm^2 + n^2m)$ is expensive.
• The affine gap weight model supports $O(nm)$ computation.
• Recall, we want to maximize the operator-weight objective fcn:
  Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)
As before, three types of alignments:
1. S1(i) aligns to the left of S2(j) → S1 ends with a gap.
2. S1(i) aligns to the right of S2(j) → S2 ends with a gap.
3. S1(i) coaligns with S2(j).
We will use E(i, j), F(i, j), G(i, j) & V(i, j), but we will modify the gap weight.

Q: How can the cost be reduced from $O(nm^2 + n^2m)$ to $O(nm)$?
A: The affine model sets a constant cost per space.
Q: How does this help?
A: It is not necessary to do row ($O(m^2)$) & column ($O(n^2)$) searches.
→ The cost of a gap depends only on how large it is, not on where it occurs, and every additional space adds the same Ws. Each cell therefore only needs to know whether the adjacent cell already ends in a gap (extend it) or not (open a new one).

The base conditions where end gaps are included are:
V(i, 0) = E(i, 0) = -Wg - iWs,
V(0, j) = F(0, j) = -Wg - jWs
If end spaces are free then end gaps are free and: V(i, 0) = V(0, j) = 0

We have the following recurrences:
V(i, j) = max[E(i, j), F(i, j), G(i, j)],
G(i, j) = V(i - 1, j - 1) + Wm, if S1(i) = S2(j)
G(i, j) = V(i - 1, j - 1) - Wms, if S1(i) ≠ S2(j)
E(i, j) = max[E(i, j - 1), V(i, j - 1) - Wg] - Ws → S1 ends with a gap.
F(i, j) = max[F(i - 1, j), V(i - 1, j) - Wg] - Ws → S2 ends with a gap.
Notice that each recurrence examines only a constant number of cells.

The textbook explains E(i, j) in detail. Let's consider
F(i, j) = max[F(i - 1, j), V(i - 1, j) - Wg] - Ws
→ F(i, j) is the case where S2 ends with a gap.
The recurrence considers two cases:
1. S2(j) is exactly one place to the left of S1(i).
   A new gap opens opposite S1(i), so F(i, j) = V(i - 1, j) - Wg - Ws
2. S2(j) is to the left of S1(i - 1).
   The same gap that is aligned with S1(i - 1) extends to S1(i), so F(i, j) = F(i - 1, j) - Ws
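
The arbitrary and affine recurrences above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the textbook's code: the function names `align_arbitrary` and `align_affine` are hypothetical, end gaps are charged, and only the optimal value V(n, m) is returned (no traceback). One deliberate deviation is labeled in the code: E(i, 0) and F(0, j) are set to negative infinity rather than to the base values listed above, so that a gap in one string followed immediately by a gap in the other pays two initiation charges. With that choice the affine result agrees with the arbitrary recurrence for w(q) = Wg + qWs.

```python
from typing import Callable

NEG_INF = float("-inf")


def align_arbitrary(s1: str, s2: str,
                    w: Callable[[int], float],
                    s: Callable[[str, str], float]) -> float:
    """Optimal alignment value for an arbitrary gap weight w(q).

    Every cell scans its whole row and column: O(nm^2 + n^2 m) time.
    """
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -w(i)                      # V(i, 0) = -w(i)
    for j in range(1, m + 1):
        V[0][j] = -w(j)                      # V(0, j) = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g = V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1])
            # E: S1 ends with a gap covering S2[k+1..j], k = 0..j-1
            e = max(V[i][k] - w(j - k) for k in range(j))
            # F: S2 ends with a gap covering S1[l+1..i], l = 0..i-1
            f = max(V[l][j] - w(i - l) for l in range(i))
            V[i][j] = max(e, f, g)
    return V[n][m]


def align_affine(s1: str, s2: str,
                 wm: float, wms: float, wg: float, ws: float) -> float:
    """Optimal alignment value for the affine gap weight Wg + q*Ws.

    Every cell looks at a constant number of neighbors: O(nm) time.
    """
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    # Assumption of this sketch: E(i, 0) and F(0, j) stay at -inf.
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -wg - i * ws               # V(i, 0) = -Wg - i*Ws
    for j in range(1, m + 1):
        V[0][j] = -wg - j * ws               # V(0, j) = -Wg - j*Ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g = V[i - 1][j - 1] + (wm if s1[i - 1] == s2[j - 1] else -wms)
            # Extend the gap ending at (i, j-1), or open a new one there.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            # Extend the gap ending at (i-1, j), or open a new one there.
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(E[i][j], F[i][j], g)
    return V[n][m]
```

For example, `align_affine("acgt", "at", wm=2, wms=1, wg=2, ws=1)` aligns a with a and t with t and pays for one gap of two spaces, giving 2 + 2 - (2 + 2·1) = 0.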
