Bioinformatics Algorithms and Data Structures
Chapter 11.8: Gaps
Lecturer: Dr. Rose
Slides by: Dr. Rose
February 6, 2003
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology

Gaps
• Our investigation of alignment has focused on:
  – Matches
  – Mismatches
  – Spaces
• An important concept is that of gaps.
• Defn. A gap is a maximal consecutive run of spaces in a single string of a given alignment.
• Q: Can a single space be a gap?
• A: Yes, if there are no adjacent spaces.

Gaps
• Gaps can occur:
  – Before the first character of a string
  – After the last character of a string
  – Inside a string
• Example: c t g c g g g - - - g g t a a a t - g c g g - a g a g g - a a a
• Q: How many gaps are there?
• A: 5

Gaps
• Q: Other than our recognition of gaps, did the preceding example show anything new?
• A: No.
• Q: Then what motivates the introduction of this concept?
• A: We can include a gap term in the objective function for computing alignment.
• So???
• So we can influence the distribution of gaps.

Gaps
• Analogy: specifying the location of the hole is critical to donut making; otherwise you’ll end up with a Berliner.
• Example: $\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - k W_g$
• In this objective fcn, each gap contributes the constant weight Wg irrespective of the gap length.
• The variable k indicates the number of gaps in the alignment.

Gaps
• Recall, a space in an alignment corresponds to an insertion or deletion in the edit transcript.
• A gap corresponds to an atomic insertion or deletion of an entire substring.
• Biologically, mutations are such atomic events.
  – A single mutation can create a gap
  – The size of the gap can vary over a large range with equal likelihood.

Gaps
Sources of mutation mentioned by textbook:
  – Unequal cross-over in meiosis: insertion in one string and corresponding deletion in the other.
    • http://www4.ncsu.edu:8030/unity/users/b/bnchorle/www/
  – DNA slippage: slippage in the replication procedure, resulting in the repetition of a substring
  – Retrovirus insertions
  – Translocations of DNA between chromosomes

Gaps
• Common gaps in aligned strings can be used to deduce evolutionary history
  – Mutations at the single-character level are frequent.
  – Does anybody know what these are called?
  – Their frequency makes it difficult to determine evolutionary relationships at the DNA sequence level.
  – Large gaps occur less frequently, so gap features can be used to recognize similarity over long periods of time.
  – See Figure 11.6 for an example of a gap as an alignment feature

Gaps
• Consider:
  – An alignment should reflect the cost of mutational events transforming one string into the other.
  – A single mutation can produce a gap of more than one space
• Consequently:
  – Distribution of spaces into gaps should follow a plausible model
  – Gap weights should be modeled to reflect biological meaning

Motivation: cDNA Matching
• Preliminaries:
  – A single gene is composed of exons and introns
  – Exons are the coding parts of the gene
  – Introns are the noncoding parts between exons
• Gene expression:
  – RNA is transcribed from DNA
    • DNA:A → RNA:U (uracil)
    • DNA:C → RNA:G
    • DNA:G → RNA:C
    • DNA:T → RNA:A

Motivation: cDNA Matching
• Gene expression continued:
  – RNA is transformed into mRNA (messenger RNA)
    • The introns are excised
    • The remaining exons are concatenated
  – The resulting mRNA leaves the cell nucleus
  – A ribosome:
    • Translates the mRNA into the corresponding protein by
      – parsing the mRNA into codons
      – assembling amino acids in the order specified by the codons.
    • The resulting sequence of amino acids is the protein

Motivation: cDNA Matching
• Imagine that you have the mRNA for a protein and want to find the corresponding gene.
  – (Wet biology) Take the mRNA and create complementary DNA (cDNA).
    • Map mRNA:U → cDNA:A
  – Note: cDNA differs from DNA in several respects
    • cDNA does not contain intron substrings
    • The nucleotides in cDNA complement the nucleotides in the corresponding DNA, i.e., A↔T and C↔G

Motivation: cDNA Matching
  – (Wet biology) Hybridize the cDNA with the DNA
    • In hybridization, complementary nucleotides try to match up, i.e., A↔T and C↔G
    • Sections of the cDNA will hybridize with the corresponding sections of DNA.
    • The non-hybridizing segments are gaps
    • Possibly corresponding to introns

Motivation: cDNA Matching
• Now imagine that you have the mRNA sequence for a protein and want to find the corresponding gene without doing wet biology.
  – Take the mRNA and create complementary DNA (cDNA).
    • Map mRNA:U → cDNA:A with a computer
    • While we are at it, compile a library of each cDNA string that we create, for future use.

Motivation: cDNA Matching
  – Align (hybridize) the cDNA with the DNA
    • We assume that the relevant genome has been sequenced.
    • We have a short string (cDNA) and a very long string, the genome.
    • Align complementary nucleotides in the two strings, i.e., A↔T and C↔G
    • Sections of the cDNA will align (hybridize) with the corresponding sections of the genome.
    • The non-aligning (non-hybridizing) segments are gaps
    • Possibly corresponding to introns

Motivation: cDNA Matching
• Q: What kind of objective fcn do we need to align cDNA with DNA?
• Features:
  – Small penalties for spaces
    • Q: Why does this matter?
    • A: Large penalties would force the cDNA to bunch up, not allowing gaps for introns

Motivation: cDNA Matching
• Features: continued
  – Large penalties for mismatches
    • Some mismatches are unavoidable (sequencing error)
    • Long sequences of mismatches must be avoided
  – Positive values for matches
    • We want to reward exon matches
  – Gap penalties

Motivation: cDNA Matching
• Gap penalties
  – Q: Assume matches score positive (+), mismatches strongly negative (--), and spaces mildly negative (-). What happens if there is no gap penalty?
  – A: The alignment would be the longest common subsequence.
    • Match of ALL characters in the cDNA string
    • Match of cDNA with noncoding DNA
    • Tells us nothing about the position of the exons

Motivation: cDNA Matching
• Gap penalties continued
  – Soln: augment objective fcn with a gap term
  – Complication: pseudogenes

Motivation: Pseudogenes
• Pseudogenes
  – Nonworking inexact copy of a gene
  – Conceptually:
    • a trial gene not ready for prime time, or
    • a failed gene mutation
  – The pseudogene may be very far from the actual gene

Motivation: Pseudogenes
• Pseudogenes: processed pseudogenes
  – Contain only exon substrings
  – Introns have been removed & exons concatenated
  – Theory: mRNA that is re-transcribed back into DNA and inserted into a random position.
  – Problem:
    • Assume the DNA might contain the pseudogene & the working gene
    • How can processed pseudogenes be located?

Gap Weights
• Q: What types of gap weight can we choose from?
• A: The textbook lists four general types:
  – Constant gap weight
  – Affine gap weight
  – Convex gap weight
  – Arbitrary gap weight

Gap Weights
• Constant gap weight: simplest
  – No cost for individual space
  – Gaps are assigned a constant weight Wg
  – Operator-weight objective fcn: Wm(#matches) - Wms(#mismatches) - Wg(#gaps),
    where Wm = match weight & Wms = mismatch weight
  – Alphabet-weight objective fcn: $\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - W_g(\#\text{gaps})$
    Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet.

Gap Weights
• Affine gap weight
  – Extend the constant gap weight with a charge for each space, Ws.
    • Wg is the gap initiation charge
    • Ws is the gap extension charge
  – Gap weight is given by the affine function Wg + qWs, where q is the number of spaces in the gap.
  – Operator-weight objective fcn: Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)

Gap Weights
  – Affine alphabet-weight objective fcn: $\sum_{i=1}^{l} s(S_1'(i), S_2'(i)) - W_g(\#\text{gaps}) - W_s(\#\text{spaces})$
    Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet.
  – An important question is what the values of Wg and Ws should be.
    • Obviously, this is related to the similarity matrix, s().
    • Textbook says FASTA uses Wg = 10 and Ws = 2 for protein sequences

Gap Weights
• Convex gap weight
  – Idea: additional spaces contribute less
  – Example: $W_g + \log_e q$
  – Longer gaps are somewhat penalized

Arbitrary Gap Weight
• Arbitrary gap weight
  – The gap weight is an arbitrary function, w(q), of the gap length q.
  – Obviously, the preceding weight types are subcases of the arbitrary gap weight model

Arbitrary Gap Weight
Arbitrary gap weight recurrence:
Three types of alignments for S1[1..i] and S2[1..j]:
1. S1(i) aligns to the left of S2(j), S1 ends with a gap.
   • Let E(i, j) be the maximal value for alignment case 1.
2. S1(i) aligns to the right of S2(j), S2 ends with a gap.
   • Let F(i, j) be the maximal value for alignment case 2.
3. S1(i) coaligns with S2(j).
   • Let G(i, j) be the maximal value for alignment case 3.
Let V(i, j) be the maximal value of E(i, j), F(i, j), & G(i, j).

Arbitrary Gap Weight
We have the following recurrences:
• V(i, j) = max[E(i, j), F(i, j), G(i, j)],
• G(i, j) = V(i - 1, j - 1) + s(S1(i), S2(j)), where S1(i), S2(j) are co-aligned.
• $E(i, j) = \max_{0 \le k \le j-1} [V(i, k) - w(j - k)]$, S1 ends with a gap.
• $F(i, j) = \max_{0 \le l \le i-1} [V(l, j) - w(i - l)]$, S2 ends with a gap.

Arbitrary Gap Weight
The base conditions are:
V(i, 0) = -w(i), V(0, j) = -w(j),
E(i, 0) = -w(i), F(0, j) = -w(j),
G(0, 0) = 0, but G(i, j) is undefined if only i or j is 0.
If end spaces are free then end gaps are free and:
V(i, 0) = 0, V(0, j) = 0

Arbitrary Gap Weight
Up until this point all dynamic programming examples have had complexity O(nm).
Q: What is the complexity of V(i, j)?
A: $O(nm^2 + n^2 m)$
Q: Why does the consideration of gaps require $O(nm^2 + n^2 m)$?
A: Previous computations depended only on the 3 adjacent cells. Considering gaps entails considering all preceding cells in the row and column.

Arbitrary Gap Weight
Thm. If |S1| = n and |S2| = m, the recurrences can be solved in $O(nm^2 + n^2 m)$ time.
Proof. (n+1) * (m+1) cells in the table are filled. To fill cell (i, j):
  – E(i, j) examines j cells of row i, $\max_{0 \le k \le j-1} [V(i, k) - w(j - k)]$.
    A row entails $m(m+1)/2 = O(m^2)$ to evaluate E for that row.
  – F(i, j) examines i cells of column j, $\max_{0 \le l \le i-1} [V(l, j) - w(i - l)]$.
    A column entails $n(n+1)/2 = O(n^2)$ to evaluate F for that column.
  – G(i, j) examines one other cell.
Since there are n rows and m columns, the total is $O(nm^2 + n^2 m)$.

Affine Gap Weight
• $O(nm^2 + n^2 m)$ is expensive.
• The affine gap weight model supports O(nm) computation.
• Recall, we want to maximize the operator objective fcn:
  Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)
As before, three types of alignments:
1. S1(i) aligns to the left of S2(j), S1 ends with a gap.
2. S1(i) aligns to the right of S2(j), S2 ends with a gap.
3. S1(i) coaligns with S2(j).
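For comparison with the affine model, the arbitrary-gap-weight recurrences can be sketched in Python as below. This is a minimal illustration rather than the textbook's code: the names `arbitrary_gap_score`, `match_score`, and `convex_w` are hypothetical, the scoring values are arbitrary, and only the optimal value (not the alignment itself) is computed. The two inner `max` scans over row i and column j are what produce the cubic running time.

```python
import math


def arbitrary_gap_score(s1, s2, s, w):
    """Best global alignment value of s1 and s2 for a gap weight w(q).

    s(x, y) scores two co-aligned characters; w(q) is the weight of a
    gap of q spaces. Both are supplied by the caller (assumptions of
    this sketch).
    """
    n, m = len(s1), len(s2)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned against nothing is one gap.
    for i in range(1, n + 1):
        V[i][0] = -w(i)
    for j in range(1, m + 1):
        V[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # G: S1(i) and S2(j) are co-aligned.
            g = V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1])
            # E: S1 ends with a gap; scan all earlier cells of row i.
            e = max(V[i][k] - w(j - k) for k in range(j))
            # F: S2 ends with a gap; scan all earlier cells of column j.
            f = max(V[l][j] - w(i - l) for l in range(i))
            V[i][j] = max(e, f, g)
    return V[n][m]


def match_score(x, y):
    # Hypothetical alphabet weights: +2 for a match, -1 for a mismatch.
    return 2 if x == y else -1


def convex_w(q):
    # Hypothetical convex gap weight of the form Wg + log_e(q), Wg = 1.
    return 1 + math.log(q)


print(arbitrary_gap_score("acgggt", "act", match_score, convex_w))
```

With these weights the best alignment keeps the three inserted characters in a single gap, since splitting them into two gaps would pay the constant term twice.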
We will use E(i, j), F(i, j), G(i, j) & V(i, j), but we will modify the gap weight.

Affine Gap Weight
Q: How can the cost be reduced from $O(nm^2 + n^2 m)$ to O(nm)?
A: The affine model sets a constant cost per space.
Q: How does this help?
A: It is not necessary to do row ($O(m^2)$) & column ($O(n^2)$) searches.
   It doesn’t matter where the gaps occur, only how large they are.

Affine Gap Weight
The base conditions where end gaps are included are:
V(i, 0) = E(i, 0) = -Wg - iWs,
V(0, j) = F(0, j) = -Wg - jWs,
If end spaces are free then end gaps are free and:
V(i, 0) = V(0, j) = 0

Affine Gap Weight
We have the following recurrences:
V(i, j) = max[E(i, j), F(i, j), G(i, j)],
G(i, j) = V(i - 1, j - 1) + Wm, if S1(i) = S2(j)
G(i, j) = V(i - 1, j - 1) - Wms, if S1(i) ≠ S2(j)
E(i, j) = max[E(i, j - 1), V(i, j - 1) - Wg] - Ws, S1 ends with a gap.
F(i, j) = max[F(i - 1, j), V(i - 1, j) - Wg] - Ws, S2 ends with a gap.
Notice that each recurrence entails examining recurrences for a constant number of cells.

Affine Gap Weight
The textbook explains E(i, j) in detail.
Let’s consider F(i, j) = max[F(i - 1, j), V(i - 1, j) - Wg] - Ws
F(i, j) is the case where S2 ends with a gap. The recurrence considers two cases:
1. S2(j) is exactly one place to the left of S1(i).
   A new gap is aligned with S1(i), so F(i, j) = V(i - 1, j) - Wg - Ws
2. S2(j) is to the left of S1(i - 1).
   The same gap that is aligned with S1(i - 1) extends to S1(i), so F(i, j) = F(i - 1, j) - Ws
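The affine recurrences above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the textbook's implementation: the function name `affine_gap_score` and the example weights are hypothetical, end gaps are charged, and only the optimal value is returned. One deliberate deviation: the sketch sets E(i, 0) and F(0, j) to negative infinity instead of the boundary values listed on the slide, so that the first gap opened in the other string always pays its own Wg.

```python
NEG_INF = float("-inf")


def affine_gap_score(s1, s2, wm, wms, wg, ws):
    """Best global alignment value under the affine gap weight Wg + q*Ws.

    wm: match weight, wms: mismatch weight,
    wg: gap initiation charge, ws: gap extension charge.
    """
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S1 ends with a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S2 ends with a gap
    V[0][0] = 0
    # Base conditions with end gaps charged.
    # Assumption of this sketch: E[i][0] and F[0][j] stay at -infinity.
    for i in range(1, n + 1):
        V[i][0] = -wg - i * ws
    for j in range(1, m + 1):
        V[0][j] = -wg - j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # G: S1(i) and S2(j) are co-aligned.
            step = wm if s1[i - 1] == s2[j - 1] else -wms
            g = V[i - 1][j - 1] + step
            # E: extend the gap ending at (i, j-1), or open a new one.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            # F: extend the gap ending at (i-1, j), or open a new one.
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(E[i][j], F[i][j], g)
    return V[n][m]


# Three matches (3 * 2) minus one gap of three spaces (2 + 3 * 1).
print(affine_gap_score("acgggt", "act", wm=2, wms=1, wg=2, ws=1))  # prints 1
```

Each cell reads only a constant number of neighbouring cells, so the table is filled in O(nm) time, in contrast to the row and column scans needed for an arbitrary w(q).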