Chapter 3
Computational Molecular Biology
Michael Smith
Sequence Comparison


Sequence comparison is the most important
operation in computational biology
Consists of finding which parts of the
sequences are alike and which parts differ
Similarity and Alignment

Similarity


Gives a measure of how similar sequences are
Alignment

A way of placing sequences one above the other in
order to make clear the correspondence between
similar characters or substrings
Sequence Comparison


Want best alignment between two or more
sequences
Global Comparison


Alignment involving entire sequences
Local Comparison


Alignment involving substrings
Semi-Global Comparison


Aligning prefixes and suffixes of the sequences
All can be solved by Dynamic Programming
Global Comparison

Consider the following DNA sequences
GACGGATTAG
GATCGGAATAG

Are they similar?
 After alignment, similarities are more obvious
GA-CGGATTAG
GATCGGAATAG
Alignment and Score

Alignment, more precise definition
Insertion of spaces in arbitrary locations along the
sequences so that they end up with the same size
 No column can be entirely composed of spaces


Score
Measure of similarity
 Each column receives +1 for a match, -1 for a mismatch,
or -2 for a space
 Sum values to get score
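The scoring rule above can be sketched in a few lines of Python. The function name `alignment_score` and the use of "-" for a space are choices made for this illustration, not part of the slides.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2  # column values stated on the slide

def alignment_score(row_s: str, row_t: str) -> int:
    """Sum the column values of an alignment given as two equal-length rows."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist only of spaces")
        if a == "-" or b == "-":
            total += SPACE        # one of the two rows has a space here
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total

# 9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6
print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))
```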

Dynamic Programming



Solving an instance of a problem by taking
advantage of already computed solutions for
smaller instances of the problem
Main algorithmic approach used in sequence
alignment
Figure 3.1, 3.2
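A minimal sketch of the dynamic programming idea, assuming the +1, -1, -2 column values given earlier. Entry a[i][j] holds the best score for aligning the first i characters of s with the first j characters of t, and each entry is computed from three smaller, already solved instances. The function names are hypothetical.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2

def similarity_matrix(s: str, t: str) -> list[list[int]]:
    """Fill the (len(s)+1) x (len(t)+1) table of best prefix-alignment scores."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * SPACE          # s[1..i] against an empty prefix of t
    for j in range(1, n + 1):
        a[0][j] = j * SPACE          # empty prefix of s against t[1..j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            a[i][j] = max(a[i - 1][j - 1] + p,   # s[i] matched with t[j]
                          a[i - 1][j] + SPACE,   # s[i] matched with a space
                          a[i][j - 1] + SPACE)   # space matched with t[j]
    return a

def global_score(s: str, t: str) -> int:
    """Best global alignment score is the bottom-right entry."""
    return similarity_matrix(s, t)[len(s)][len(t)]

print(global_score("GACGGATTAG", "GATCGGAATAG"))  # 6
```

The two nested loops visit each of the (m+1)(n+1) entries once, which is where the O(mn) time and space bounds come from.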
Optimal Alignments





From Figure 3.1, start at (m,n) and follow
arrows to (0,0)
Each arrow gives one column of the alignment
If arrow is horizontal, it corresponds to a
column with a space in s matched with t[j]
If arrow is vertical, it corresponds to s[i]
matched with a space in t
If arrow is diagonal, s[i] is matched with t[j]
Optimal Alignments

Many alignments are possible, depending on
which arrow is given priority
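The arrow-following procedure can be sketched as below. This assumes the +1, -1, -2 scoring used earlier and rebuilds the table inside the function so the block stands alone. The `priority` parameter is a hypothetical way to express which arrow is preferred when several are valid.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2

def traceback_alignment(s, t, priority=("diagonal", "vertical", "horizontal")):
    """Return one optimal alignment (two rows) chosen according to `priority`."""
    m, n = len(s), len(t)
    pair = lambda x, y: MATCH if x == y else MISMATCH
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * SPACE
    for j in range(1, n + 1):
        a[0][j] = j * SPACE
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i - 1][j - 1] + pair(s[i - 1], t[j - 1]),
                          a[i - 1][j] + SPACE, a[i][j - 1] + SPACE)
    top, bottom = [], []
    i, j = m, n                       # start at (m, n), walk back to (0, 0)
    while i > 0 or j > 0:
        for move in priority:
            if (move == "diagonal" and i > 0 and j > 0
                    and a[i][j] == a[i - 1][j - 1] + pair(s[i - 1], t[j - 1])):
                top.append(s[i - 1]); bottom.append(t[j - 1])
                i -= 1; j -= 1
                break
            if move == "vertical" and i > 0 and a[i][j] == a[i - 1][j] + SPACE:
                top.append(s[i - 1]); bottom.append("-")   # s[i] with a space in t
                i -= 1
                break
            if move == "horizontal" and j > 0 and a[i][j] == a[i][j - 1] + SPACE:
                top.append("-"); bottom.append(t[j - 1])   # space in s with t[j]
                j -= 1
                break
    return "".join(reversed(top)), "".join(reversed(bottom))

for row in traceback_alignment("GACGGATTAG", "GATCGGAATAG"):
    print(row)
```

Changing the order of the names in `priority` may produce a different alignment, but every alignment obtained this way has the same optimal score.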
Local Comparison




A local alignment between s and t is an
alignment between a substring of s and a
substring of t
Goal : find the highest scoring local alignment
between two sequences
Variation of basic algorithm (Figure 3.2)
Each entry (i,j) holds the highest score of an alignment
between a suffix of s[1..i] and a suffix of t[1..j] (page 55)
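A minimal sketch of the local variation, assuming the same +1, -1, -2 values. Compared with the global recurrence, each entry may also take the value 0 (the empty alignment), and the answer is the largest entry anywhere in the table. Only two rows are kept because this sketch returns the score alone; `local_score` is a hypothetical name.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2

def local_score(s: str, t: str) -> int:
    """Highest score over all alignments between a substring of s and one of t."""
    best = 0
    prev = [0] * (len(t) + 1)         # first row: empty suffixes score 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)      # first column is also 0
        for j in range(1, len(t) + 1):
            p = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            cur[j] = max(0,                    # start a fresh alignment here
                         prev[j - 1] + p,
                         prev[j] + SPACE,
                         cur[j - 1] + SPACE)
            best = max(best, cur[j])
        prev = cur
    return best

print(local_score("AAAGGG", "TTTGGGCC"))  # 3, from the shared substring GGG
```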
Semi-Global Comparison



Score alignments ignoring some of the end spaces in
the sequences
End spaces are those that appear before the first or
after the last character in a sequence
For example,
CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

If we aligned the sequences in the usual way, then
CAGCACTTGGATTCTCGG
CAGC-----G-T----GG
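One way to sketch this with the basic algorithm, assuming that end spaces are free at both ends of both sequences (the slides say only "some of the end spaces", so this choice is an assumption). Leading end spaces are made free by initializing the first row and column to 0, and trailing end spaces by taking the best entry in the last row or last column.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2

def semiglobal_score(s: str, t: str) -> int:
    """Best alignment score when spaces before or after either sequence cost 0."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # zeros: leading end spaces are free
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            a[i][j] = max(a[i - 1][j - 1] + p,
                          a[i - 1][j] + SPACE,
                          a[i][j - 1] + SPACE)
    best_last_row = max(a[m])                   # trailing spaces in s are free
    best_last_col = max(row[n] for row in a)    # trailing spaces in t are free
    return max(best_last_row, best_last_col)

print(semiglobal_score("ACGT", "CG"))  # 2: CG sits inside ACGT at no extra cost
```

For the example above, the first alignment scores 3 once its end spaces are ignored (6 matches, 1 mismatch, 1 internal space), while the second scores -12 because all 10 of its spaces are internal.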
Extensions to Basic Algorithm



Basic algorithm has O(mn) time complexity and uses space
on the order of O(mn)
Possible to improve space complexity from quadratic to linear
at the expense of roughly doubling processing time
Can be accomplished by using a Divide and Conquer
strategy

Divide the problem into small subproblems and later
combine the solutions to obtain a solution for the whole
problem
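A sketch of the divide and conquer idea, assuming the +1, -1, -2 scoring used earlier. The function names are hypothetical. `last_row` computes scores keeping only two rows at a time; `align_linear_space` splits s in the middle, uses a forward and a backward pass to find where t should be split, and recurses on the two halves.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2

def pair(a, b):
    return MATCH if a == b else MISMATCH

def last_row(s, t):
    """Scores of aligning all of s with every prefix of t, using O(len(t)) space."""
    prev = [j * SPACE for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * SPACE] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = max(prev[j - 1] + pair(s[i - 1], t[j - 1]),
                         prev[j] + SPACE,
                         cur[j - 1] + SPACE)
        prev = cur
    return prev

def align_linear_space(s, t):
    """Return one optimal global alignment as two rows."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # With these scores, pairing the single character always beats two
        # spaces, so put it against a matching character of t if one exists.
        j = t.find(s) if s in t else 0
        return "-" * j + s + "-" * (len(t) - j - 1), t
    mid = len(s) // 2
    left = last_row(s[:mid], t)                   # forward pass on first half
    right = last_row(s[mid:][::-1], t[::-1])      # backward pass on second half
    split = max(range(len(t) + 1), key=lambda j: left[j] + right[len(t) - j])
    top1, bot1 = align_linear_space(s[:mid], t[:split])
    top2, bot2 = align_linear_space(s[mid:], t[split:])
    return top1 + top2, bot1 + bot2

for row in align_linear_space("GACGGATTAG", "GATCGGAATAG"):
    print(row)
```

Each level of recursion redoes score computations that the quadratic-space version performs once, which is the source of the extra processing time mentioned above.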
Gap Penalty Functions



A gap is a run of consecutive spaces
When mutations occur, a single block of k consecutive
spaces (one gap) is more likely than k isolated spaces
The previously discussed scoring method is not
appropriate in this case
Gap Penalty Functions

For example,
A------ATTCCTTCCTTCC
AAAGAGAATTCCTTCCTTCC
 Scoring is done at a block level, not a column
level
A  ------  ATTCCTTCCTTCC
A  AAGAGA  ATTCCTTCCTTCC
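Block-level scoring can be sketched as follows. The slides do not give a specific penalty function, so the affine form w(k) = h + g*k and the values h = 2, g = 1 are assumptions chosen only for illustration. Blocks of character pairs are scored column by column, and each gap is charged once according to its length.

```python
from itertools import groupby

def gap_penalty(k: int, h: int = 2, g: int = 1) -> int:
    """Hypothetical affine penalty w(k) = h + g*k for a gap of k spaces."""
    return h + g * k

def block_score(row_s: str, row_t: str, match: int = 1, mismatch: int = -1) -> int:
    """Score an alignment block by block instead of column by column."""
    def kind(column):
        a, b = column
        if a == "-":
            return "gap_in_s"
        if b == "-":
            return "gap_in_t"
        return "chars"

    total = 0
    # groupby collects maximal runs of columns of the same kind (the blocks)
    for state, columns in groupby(zip(row_s, row_t), key=kind):
        columns = list(columns)
        if state == "chars":
            total += sum(match if a == b else mismatch for a, b in columns)
        else:
            total -= gap_penalty(len(columns))   # one charge for the whole gap
    return total

# 14 matches and one gap of length 6: 14 - (2 + 6) = 6
print(block_score("A------ATTCCTTCCTTCC", "AAAGAGAATTCCTTCCTTCC"))
```

With these assumed values, two isolated spaces cost 2 * w(1) = 6 while one gap of two spaces costs w(2) = 4, so grouped spaces are favored.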
Multiple Sequences




Multiple sequence alignment is a generalization of
the two sequence case
Multiple alignment of s1, s2, ..., sk is obtained by
inserting spaces in the sequences in such a way
to make them all the same size
No column is made entirely of spaces
Figure 3.10
Scoring Multiple Sequences


Need a function that inputs amino acid
sequences and returns a score
The function must have two properties
Must be independent of the order of its arguments. For
example, if a column has I,V,- the same score should
be produced if the order is -,V,I
 Should reward the presence of many equal residues
and penalize unequal residues and spaces

Sum-of-Pairs (SP)



Sum-of-Pairs (SP) satisfies the properties
Sum of pairwise scores of all pairs of symbols
in a column
SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) +
p(-,I) + p(-,V) + p(I,V)
where p(a,b) is pairwise score of a and b
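A short sketch of the SP computation. The pairwise values used here (+1 for equal residues, -1 for unequal residues, -2 for a residue against a space, 0 for two spaces) are assumptions for illustration; a real protein scoring scheme would supply its own p(a,b).

```python
from itertools import combinations

def p(a: str, b: str) -> int:
    """Hypothetical pairwise score; "-" stands for a space."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def sp_score(column) -> int:
    """Sum of p over all unordered pairs of symbols in one column."""
    return sum(p(a, b) for a, b in combinations(column, 2))

def sp_alignment_score(rows) -> int:
    """Sum of column SP-scores for a multiple alignment given as equal-length rows."""
    return sum(sp_score(col) for col in zip(*rows))

# p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V) = -2 + 1 - 1 - 2 - 2 - 1
print(sp_score("I-IV"))  # -7
```

Because p is symmetric and every unordered pair is counted once, reordering the symbols in a column does not change the result, which is the first required property.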
Algorithm Paradigm





Dynamic programming is used again
Basic algorithm can be used, but there will be
problems
In two sequence case, complexity is O(n^2)
For k sequence case, complexity is O(n^k)
Can take a really long time if k is large
Algorithm Paradigm

Must reduce the number of cells to
compute

Apply a heuristic to reduce the number of
computed cells
Star Alignments


Building a multiple alignment based on pairwise
alignments between a fixed sequence and all
others
Fixed sequence is the center of the star
Star Alignments

Example
a = ATTGCCATT
b = ATGGCCATT
c = ATCCAATTTT
d = ATCTTCTT
e = ACTGACC
Select a as the center of the star
Star Alignments

Align
a with b
a with c
a with d
a with e
Star Alignments








ATTGCCATT
ATGGCCATT

ATTGCCATT--
ATC-CAATTTT

ATTGCCATT
ATCTTC-TT

ATTGCCATT
ACTGACC--
Star Alignments

Combine results

ATTGCCATT--
ATGGCCATT--
ATC-CAATTTT
ATCTTC-TT--
ACTGACC----
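The combining step can be sketched as below. It assumes the pairwise alignments with the center are already available as pairs of rows; the four pairs are written out here with the center a padded to ATTGCCATT-- in its alignment with c, which is an assumption made so that both rows of each pair have equal length. Whenever one alignment has a space in the center and the other does not, a column of spaces is added to the rows that lack it. `merge_star` is a hypothetical name.

```python
def merge_star(pairwise):
    """Merge (center_row, other_row) alignments that share the same center."""
    rows = list(pairwise[0])                  # rows[0] is always the center
    for c_row, o_row in pairwise[1:]:
        cur = rows[0]
        merged = [[] for _ in rows]
        new_row = []
        i = j = 0
        while i < len(cur) or j < len(c_row):
            gap_cur = i < len(cur) and cur[i] == "-"
            gap_new = j < len(c_row) and c_row[j] == "-"
            if gap_cur and not gap_new:
                # space already in the merged center: pad the new sequence
                for out, row in zip(merged, rows):
                    out.append(row[i])
                new_row.append("-")
                i += 1
            elif gap_new and not gap_cur:
                # space only in the new pairwise center: pad all existing rows
                for out in merged:
                    out.append("-")
                new_row.append(o_row[j])
                j += 1
            else:
                # same center character (or a space in both): keep the column
                for out, row in zip(merged, rows):
                    out.append(row[i])
                new_row.append(o_row[j])
                i += 1
                j += 1
        rows = ["".join(r) for r in merged] + ["".join(new_row)]
    return rows

pairs = [("ATTGCCATT", "ATGGCCATT"),
         ("ATTGCCATT--", "ATC-CAATTTT"),
         ("ATTGCCATT", "ATCTTC-TT"),
         ("ATTGCCATT", "ACTGACC--")]
for row in merge_star(pairs):
    print(row)
```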




Database Search



Databases exist for searching and comparing
protein and DNA sequences
Methods described work, but may take too long
and be impractical for searching large databases
Novel and faster methods have been developed
PAM Matrix


When scoring protein sequences, the +1, -1, -2
scheme may not be sufficient
Amino acids have properties that influence the
likelihood that they will be substituted in an
evolutionary scenario
PAM Matrix



Point Accepted Mutations
A 1-PAM matrix is suitable for comparing
sequences that are 1 unit of evolution apart
A 250-PAM matrix is suitable for comparing
sequences that are 250 units of evolution apart
PAM Matrix




Markovian in nature
Need the probability of occurrence for each amino acid
Probability transition matrix
Score matrix
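A toy sketch of how these pieces could fit together, under stated assumptions: a two-letter alphabet with invented numbers stands in for the 20 amino acids, the k-PAM transition matrix is taken to be the k-th power of the 1-PAM matrix (the Markov assumption), and the score is the commonly described log-odds form 10 * log10(M^k[a][b] / p[b]). None of the numbers below come from real data.

```python
import math

def mat_mul(x, y):
    """Plain matrix product of two square lists of lists."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_power(m, k):
    """k-th power of m by repeated multiplication (k steps of evolution)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result

def score_matrix(m1, freqs, k):
    """Log-odds scores from the k-step transition matrix and letter frequencies."""
    mk = mat_power(m1, k)
    size = len(freqs)
    return [[10 * math.log10(mk[a][b] / freqs[b]) for b in range(size)]
            for a in range(size)]

# Invented 1-step transition probabilities and frequencies for two letters
toy_m1 = [[0.99, 0.01],
          [0.01, 0.99]]
toy_freqs = [0.5, 0.5]

for k in (1, 250):
    print(k, score_matrix(toy_m1, toy_freqs, k))
```

In this toy setting the scores for identical letters are positive and those for different letters negative, and both shrink toward 0 as k grows, since distant sequences carry less information about their common origin.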
BLAST




Most frequently used programs to search
sequence databases
Acronym for Basic Local Alignment Search Tool
Returns a list of high scoring segment pairs
between the query sequence and sequences in
the database
http://www.ncbi.nlm.nih.gov
FAST



Another family of programs for sequence
database search
http://www.rcsb.org/pdb/index.html
BLAST and FAST use PAM matrices