Bioinformatics Algorithms and Data Structures
Chapter 11.8: Gaps
Lecturer: Dr. Rose
Slides by: Dr. Rose
February 6, 2003
UNIVERSITY OF SOUTH CAROLINA
College of Engineering & Information Technology
Gaps
• Our investigation of alignment has focused on:
– Matches
– Mismatches
– Spaces
• An important concept is that of gaps.
• Defn. A gap is a maximal consecutive run of
spaces in a single string of a given alignment.
• Q: Can a single space be a gap?
• A: Yes, if there are no adjacent spaces.
Gaps
• Gaps can occur:
– Before the first character of a string
– After the last character of a string
– Inside a string
• Example:
t g c g g g - - - g g t a a a t
- g c g g - a g a g g - a a a
• Q: How many gaps are there?
• A: 5
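Counting gaps follows directly from the definition: each maximal run of spaces in one row is one gap. A minimal Python sketch, assuming "-" marks a space; the function name `count_gaps` and the two rows are hypothetical and are not the alignment from the slide:

```python
import re

def count_gaps(row, space="-"):
    """Count the maximal runs of the space character in one row of an alignment."""
    return len(re.findall(re.escape(space) + "+", row))

# Hypothetical alignment used only for illustration.
top = "ctttaac--a-ac"
bottom = "c---cacccat-c"

total_gaps = count_gaps(top) + count_gaps(bottom)
total_spaces = top.count("-") + bottom.count("-")
print(total_gaps, total_spaces)  # 4 gaps made up of 7 spaces
```

A single space with no adjacent spaces counts as one gap, for example `count_gaps("a-b")` is 1.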
Gaps
• Q: Other than our recognition of gaps, did the
preceding example show anything new?
• A: No.
• Q: Then what motivates the introduction of this
concept?
• A: We can include a gap term in the objective
function for computing alignment.
• So???
• So we can influence the distribution of gaps.
Gaps
• Analogy: specifying the location of the hole is critical to donut making; otherwise you'll end up with a berliner.
• Example: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - k W_g$
• In this objective function, each gap contributes the constant weight Wg, irrespective of the gap length.
• The variable k indicates the number of gaps in the alignment.
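The objective above can be evaluated for a given alignment as a sketch. Assumptions: "-" marks a space, columns containing a space score 0 (as in the constant gap weight slide later on), and the scoring function `s`, the weight 3, and the two rows are hypothetical:

```python
import re

def constant_gap_value(row1, row2, s, w_g, space="-"):
    """Sum s over all columns, then subtract w_g once per gap (k gaps in total)."""
    if len(row1) != len(row2):
        raise ValueError("rows of an alignment must have equal length")
    column_sum = sum(s(a, b) for a, b in zip(row1, row2))
    run = re.escape(space) + "+"
    k = len(re.findall(run, row1)) + len(re.findall(run, row2))
    return column_sum - k * w_g

def s(a, b):
    """Hypothetical scores: space columns 0, match +2, mismatch -1."""
    if a == "-" or b == "-":
        return 0
    return 2 if a == b else -1

# 5 matches, 1 mismatch, 4 gaps: 2*5 - 1 - 4*3 = -3
print(constant_gap_value("ctttaac--a-ac", "c---cacccat-c", s, w_g=3))
```

Because the gap term depends only on k, lengthening an existing gap costs nothing extra under this objective.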
Gaps
• Recall, a space in an alignment corresponds to an
insertion or deletion in the edit transcript.
• A gap corresponds to an atomic insertion or
deletion of an entire substring.
• Biologically, mutations are such atomic events.
– A single mutation can create a gap
– The size of the gap can vary over a large range with
equal likelihood.
Gaps
Sources of mutation mentioned by the textbook:
– Unequal cross-over in meiosis → insertion in one string and corresponding deletion in the other.
• http://www4.ncsu.edu:8030/unity/users/b/bnchorle/www/
– DNA slippage: slippage during replication, resulting in the repetition of a substring
– Retrovirus insertions
– Translocations of DNA between chromosomes
Gaps
• Common gaps in aligned strings can be used to deduce evolutionary history
– Mutations at the single character level are frequent.
– Does anybody know what these are called?
⇒ This makes it difficult to determine evolutionary relationships at the DNA sequence level.
– Large gaps occur less frequently.
⇒ Gap features can be used to recognize similarity over long periods of time.
– See Figure 11.6 for an example of a gap as an alignment feature
Gaps
• Consider:
– An alignment should reflect the cost of mutational
events transforming one string into the other.
– A single mutation can produce a gap of more than one
space
• Consequently:
– Distribution of spaces into gaps should follow a
plausible model
– Gap weights should be modeled to reflect biological
meaning
Motivation: cDNA Matching
• Preliminaries:
– A single gene comprises exons and introns
– Exons are the coding part of the gene
– Introns are the noncoding parts between exons
• Gene expression:
– RNA is transcribed from DNA
• DNA:A → RNA:U (uracil)
• DNA:C → RNA:G
• DNA:G → RNA:C
• DNA:T → RNA:A
Motivation: cDNA Matching
• Gene expression continued:
– RNA is transformed into mRNA (messenger RNA)
• The introns are excised
• The remaining exons are concatenated
– The resulting mRNA leaves the cell nucleus
– A ribosome:
• Translates the mRNA into the corresponding protein by
– parsing the mRNA into codons
– assembling amino acids in the order specified by the codons.
• The resulting sequence of amino acids is the protein
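The two ribosome steps, parsing into codons and assembling amino acids, can be sketched as follows. The table holds only four entries of the standard genetic code, chosen for illustration; the function names are hypothetical:

```python
# Partial codon table (four entries of the standard genetic code).
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "Stop"}

def parse_codons(mrna):
    """Split the mRNA into consecutive triplets, ignoring a trailing partial codon."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

def translate(mrna):
    """Assemble amino acids in codon order until a stop codon is reached."""
    protein = []
    for codon in parse_codons(mrna):
        amino = CODON_TABLE[codon]  # KeyError for codons outside the partial table
        if amino == "Stop":
            break
        protein.append(amino)
    return protein

print(translate("AUGUUUGGCUAA"))  # ['Met', 'Phe', 'Gly']
```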
Motivation: cDNA Matching
• Imagine that you have the mRNA for a protein and want to find the corresponding gene.
– (Wet biology) Take the mRNA and create complementary DNA (cDNA).
• Map mRNA:U → cDNA:A
– Note: cDNA differs from DNA in several respects
• cDNA does not contain intron substrings
• The nucleotides in cDNA complement the nucleotides in the corresponding DNA, i.e., A ↔ T and C ↔ G
Motivation: cDNA Matching
– (Wet biology) Hybridize the cDNA with the DNA
• In hybridization, complementary nucleotides try to match up, i.e., A ↔ T and C ↔ G
• Sections of the cDNA will hybridize with the corresponding sections of DNA.
• The non-hybridizing segments are gaps
• Possibly corresponding to introns
Motivation: cDNA Matching
• Now imagine that you have the mRNA sequence for a protein and want to find the corresponding gene without doing wet biology.
– Take the mRNA and create complementary DNA (cDNA).
• Map mRNA:U → cDNA:A with a computer
• While we are at it, compile a library of each cDNA string that we create for future use.
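Creating cDNA "with a computer" amounts to a character mapping, and the library can be a dictionary. A minimal sketch: the names `mrna_to_cdna` and `add_to_library` and the key "example-protein" are hypothetical, and strand orientation is ignored here (the mapping is applied base by base, as on the slide):

```python
# Complement mapping from mRNA bases to cDNA bases (U pairs with A).
MRNA_TO_CDNA = {"A": "T", "C": "G", "G": "C", "U": "A"}

def mrna_to_cdna(mrna):
    """Map each mRNA base to its complementary DNA base."""
    return "".join(MRNA_TO_CDNA[base] for base in mrna.upper())

def add_to_library(library, name, mrna):
    """Store the cDNA string under a name so it can be reused later."""
    library[name] = mrna_to_cdna(mrna)
    return library[name]

library = {}
print(add_to_library(library, "example-protein", "AUGCCA"))  # TACGGT
```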
Motivation: cDNA Matching
– Align (hybridize) the cDNA with the DNA
• We assume that the relevant genome has been sequenced.
• We have a short string (cDNA) and a very long string, the genome.
• Align complementary nucleotides in the two strings, i.e., A ↔ T and C ↔ G
• Sections of the cDNA will align (hybridize) with the corresponding sections of the genome.
• The non-aligning (non-hybridizing) segments are gaps
• Possibly corresponding to introns
Motivation: cDNA Matching
• Q: What kind of objective fcn do we need to
align cDNA with DNA?
• Features:
– Small penalties for spaces
• Q: Why does this matter?
• A: Large penalties would force the cDNA to bunch up, not allowing gaps for introns
Motivation: cDNA Matching
• Features: continued
– Large penalties for mismatches
• Some mismatches are unavoidable (sequencing
error)
• Long sequences of mismatches must be avoided
– Positive values for matches
• We want to reward exon matches
– Gap penalties
Motivation: cDNA Matching
• Gap penalties
– Q: Assume: match +, mismatch --, space -. What happens if there is no gap penalty?
– A: The alignment would be the longest common subsequence.
⇒ Match of ALL characters in the cDNA string
⇒ Match of cDNA with noncoding DNA
⇒ Tells us nothing about the position of the exons
Motivation: cDNA Matching
• Gap penalties continued
– Soln: augment objective fcn with a gap term
– Complication: pseudogenes
Motivation: Pseudogenes
• Pseudogenes
– Nonworking inexact copy of a gene
– Conceptually:
• a trial gene not ready for prime time or
• a failed gene mutation
– The pseudogene may be very far from the actual gene
Motivation: Pseudogenes
• Pseudogenes: processed pseudogenes
– Contain only exon substrings
– Introns have been removed & exons concatenated
– Theory: mRNA that is reverse-transcribed back into DNA and inserted into a random position.
– Problem:
• Assume the DNA might contain the pseudogene & the working gene
• How can processed pseudogenes be located?
Gap Weights
• Q: What types of gap weight can we choose
from?
• A: The textbook lists four general types:
– Constant gap weight
– Affine gap weight
– Convex gap weight
– Arbitrary gap weight
Gap Weights
• Constant gap weight: simplest
– No cost for individual space
– Gaps are assigned a constant weight Wg
– Operator-weight objective function:
Wm(#matches) – Wms(#mismatches) – Wg(#gaps)
where Wm = match weight & Wms = mismatch weight
– Alphabet-weight objective function:
$\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - W_g(\#\text{gaps})$
Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet.
Gap Weights
• Affine gap weight
– Extend the constant gap weight with a charge for each
space, Ws.
• Wg is the gap initiation charge
• Ws is the gap extension charge
– Gap weight is given by the affine function Wg + qWs, where q is the number of spaces in the gap.
– Operator-weight objective fcn:
Wm(#matches) – Wms(#mismatches) – Wg(#gaps) - Ws(#spaces)
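The operator-weight objective can be evaluated for a fixed alignment by counting its four kinds of events. A minimal sketch, assuming "-" marks a space; the function name, the weights, and the two rows are hypothetical:

```python
import re

def affine_operator_value(row1, row2, w_m, w_ms, w_g, w_s, space="-"):
    """Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces) for one alignment."""
    matches = mismatches = 0
    for a, b in zip(row1, row2):
        if a == space or b == space:
            continue  # space columns are charged through the gap terms only
        if a == b:
            matches += 1
        else:
            mismatches += 1
    run = re.escape(space) + "+"
    gaps = len(re.findall(run, row1)) + len(re.findall(run, row2))
    spaces = row1.count(space) + row2.count(space)
    return w_m * matches - w_ms * mismatches - w_g * gaps - w_s * spaces

# 5 matches, 1 mismatch, 4 gaps, 7 spaces: 10 - 1 - 12 - 7 = -10
print(affine_operator_value("ctttaac--a-ac", "c---cacccat-c", 2, 1, 3, 1))
```

Setting `w_s=0` gives the constant gap weight objective as a special case.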
Gap Weights
– Affine alphabet-weight objective function:
$\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - W_g(\#\text{gaps}) - W_s(\#\text{spaces})$
Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet.
– An important question is what the values of Wg and Ws should be.
• This depends on the similarity matrix, s().
• The textbook says FASTA uses Wg = 10 and Ws = 2 for protein sequences
Gap Weights
• Convex gap weight
– Idea: each additional space contributes less than the one before it
– Example: Wg + log_e(q)
– Longer gaps are penalized only slightly more than shorter ones
Arbitrary Gap Weight
• Arbitrary gap weight
– The gap weight is an arbitrary function, w(q),
of the gap length.
– Obviously, the preceding weight types are
subcases of the arbitrary gap weight model
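Each weight type can be written as a function w(q) of the gap length q, which makes the "subcase" relationship concrete. A minimal sketch; the factory names and the weights 10 and 2 are illustrative assumptions:

```python
import math

def constant_gap(w_g):
    """w(q) = Wg: the same charge for any gap of length q >= 1."""
    return lambda q: w_g

def affine_gap(w_g, w_s):
    """w(q) = Wg + q*Ws: initiation charge plus a charge per space."""
    return lambda q: w_g + q * w_s

def log_gap(w_g):
    """w(q) = Wg + ln(q): each additional space adds less than the previous one."""
    return lambda q: w_g + math.log(q)

for name, w in [("constant", constant_gap(10)),
                ("affine", affine_gap(10, 2)),
                ("log", log_gap(10))]:
    print(name, [round(w(q), 2) for q in (1, 2, 5, 10)])
```

Any of these callables can be passed wherever an arbitrary gap weight function w is expected.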
Arbitrary Gap Weight
Arbitrary gap weight recurrence:
Three types of alignments for S1[1..i] and S2[1..j]:
1. S1(i) aligns to the left of S2(j) ⇒ S1 ends with a gap.
• Let E(i, j) be the maximal value for alignment case 1.
2. S1(i) aligns to the right of S2(j) ⇒ S2 ends with a gap.
• Let F(i, j) be the maximal value for alignment case 2.
3. S1(i) coaligns with S2(j).
• Let G(i, j) be the maximal value for alignment case 3.
Let V(i, j) be the maximal value of E(i, j), F(i, j), & G(i, j).
Arbitrary Gap Weight
We have the following recurrences:
• $V(i, j) = \max[E(i, j), F(i, j), G(i, j)]$
• $G(i, j) = V(i - 1, j - 1) + s(S_1(i), S_2(j))$, where S1(i) and S2(j) are co-aligned.
• $E(i, j) = \max_{0 \le k \le j-1} [V(i, k) - w(j - k)]$ ⇒ S1 ends with a gap.
• $F(i, j) = \max_{0 \le l \le i-1} [V(l, j) - w(i - l)]$ ⇒ S2 ends with a gap.
Arbitrary Gap Weight
The base conditions are:
V(i, 0) = -w(i),
V(0, j) = -w(j),
E(i, 0) = -w(i),
F(0, j) = -w(j),
G(0, 0) = 0,
but G(i, j) is undefined if exactly one of i and j is 0.
If end spaces are free then end gaps are free and:
V(i, 0) = 0,
V(0, j) = 0
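The recurrences and the base conditions with end gaps charged can be put together in a short table-filling sketch. Assumptions: only the V table is stored, since E, F and G at a cell depend on V alone; the names `align_arbitrary_gap` and `s` and the example weights are hypothetical; and, like the recurrence itself, the code lets one gap follow another in the same string, so it assumes w(a + b) ≤ w(a) + w(b), which holds for constant and affine weights with nonnegative Wg:

```python
def align_arbitrary_gap(s1, s2, s, w):
    """Return V(n, m), the optimal global alignment value for gap weight function w(q)."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)          # S1[1..i] aligned against one gap
    for j in range(1, m + 1):
        V[0][j] = -w(j)          # S2[1..j] aligned against one gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g = V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1])
            e = max(V[i][k] - w(j - k) for k in range(j))   # scan row i
            f = max(V[l][j] - w(i - l) for l in range(i))   # scan column j
            V[i][j] = max(e, f, g)
    return V[n][m]

def s(a, b):
    """Hypothetical scores: match +2, mismatch -1."""
    return 2 if a == b else -1

# Best alignment is acgt / a--t: two matches (4) minus one gap of length 2 (1 + 2) = 1
print(align_arbitrary_gap("acgt", "at", s, lambda q: 1 + q))
```

The two inner `max` scans over a row and a column are what push the running time beyond O(nm).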
Arbitrary Gap Weight
Up until this point, all dynamic programming examples have had complexity O(nm).
Q: What is the complexity of computing V(i, j)?
A: O(nm^2 + n^2 m)
Q: Why does the consideration of gaps require O(nm^2 + n^2 m)?
A: Previous computations depended only on the 3 adjacent cells. Considering gaps entails considering all preceding cells in the row and column.
Arbitrary Gap Weight
Thm. If |S1| = n and |S2| = m, the recurrences can be solved in O(nm^2 + n^2 m) time.
Proof.
(n+1) * (m+1) cells in the table are filled.
To fill cell (i, j):
– E(i, j) examines j cells of row i: $\max_{0 \le k \le j-1} [V(i, k) - w(j - k)]$.
Evaluating E for a whole row takes m(m+1)/2 = O(m^2) steps.
– F(i, j) examines i cells of column j: $\max_{0 \le l \le i-1} [V(l, j) - w(i - l)]$.
Evaluating F for a whole column takes n(n+1)/2 = O(n^2) steps.
– G(i, j) examines one other cell.
Since there are n rows and m columns, the total is O(nm^2 + n^2 m).
Affine Gap Weight
• O(nm^2 + n^2 m) is expensive.
• The affine gap weight model supports O(nm) computation.
• Recall, we want to maximize the operator-weight objective function:
Wm(#matches) – Wms(#mismatches) – Wg(#gaps) - Ws(#spaces)
As before, three types of alignments:
1. S1(i) aligns to the left of S2(j) ⇒ S1 ends with a gap.
2. S1(i) aligns to the right of S2(j) ⇒ S2 ends with a gap.
3. S1(i) coaligns with S2(j).
We will use E(i, j), F(i, j), G(i, j) & V(i, j), but we will modify the gap weight.
Affine Gap Weight
Q: How can the cost be reduced from O(nm^2 + n^2 m) to O(nm)?
A: The affine model sets a constant cost per space.
Q: How does this help?
A: It is not necessary to do row (O(m^2)) & column (O(n^2)) searches.
⇒ It doesn't matter where a gap started, only whether the current space opens a new gap or extends an existing one.
Affine Gap Weight
The base conditions where end gaps are included are:
V(i, 0) = E(i, 0) = -Wg - iWs,
V(0, j) = F(0, j) = -Wg - jWs,
If end spaces are free then end gaps are free and:
V(i, 0) = V(0, j) = 0
Affine Gap Weight
We have the following recurrences:
V(i, j) = max[E(i, j), F(i, j), G(i, j)]
G(i, j) = V(i - 1, j - 1) + Wm, if S1(i) = S2(j)
G(i, j) = V(i - 1, j - 1) - Wms, if S1(i) ≠ S2(j)
E(i, j) = max[E(i, j - 1), V(i, j - 1) – Wg] - Ws ⇒ S1 ends with a gap.
F(i, j) = max[F(i - 1, j), V(i - 1, j) – Wg] - Ws ⇒ S2 ends with a gap.
Notice that each recurrence examines only a constant number of cells.
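A table-filling sketch of these recurrences, with end gaps charged, shows the O(nm) structure: every cell is computed from a constant number of neighbours. Assumptions: the name `align_affine` and the example weights are hypothetical, and the sketch initialises E(i, 0) and F(0, j) to negative infinity rather than to the V boundary values, so that a gap opened right after a boundary gap in the other string still pays Wg:

```python
def align_affine(s1, s2, w_m, w_ms, w_g, w_s):
    """Return V(n, m) for the affine gap model (maximising the operator-weight objective)."""
    n, m = len(s1), len(s2)
    neg_inf = float("-inf")
    V = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in S1
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in S2
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -w_g - i * w_s
    for j in range(1, m + 1):
        V[0][j] = -w_g - j * w_s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                g = V[i - 1][j - 1] + w_m
            else:
                g = V[i - 1][j - 1] - w_ms
            # Either extend the gap ending at the neighbouring cell or open a new one.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - w_g) - w_s
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - w_g) - w_s
            V[i][j] = max(E[i][j], F[i][j], g)
    return V[n][m]

# Best alignment is acgt / a--t: 2 matches (4) - one gap (1) - two spaces (2) = 1
print(align_affine("acgt", "at", w_m=2, w_ms=1, w_g=1, w_s=1))
```

Only the previous row and the current row are actually read, so the three tables could be reduced to rolling rows if memory matters.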
Affine Gap Weight
The textbook explains E(i, j) in detail.
Let's consider F(i, j) = max[F(i - 1, j), V(i - 1, j) – Wg] - Ws
⇒ F(i, j) is the case where S2 ends with a gap.
The recurrence considers two cases:
1. S2(j) is exactly one place to the left of S1(i).
A new gap begins opposite S1(i), so F(i, j) = V(i - 1, j) – Wg - Ws
2. S2(j) is to the left of S1(i - 1).
The gap already aligned with S1(i - 1) extends to S1(i), so F(i, j) = F(i - 1, j) - Ws