Sequence alignment I.
Sándor Pongor
With slides adapted from David Judge, Jack Leunissen and Christoph
Sensen
September 27, 2016
Last lectures


Representations (unstructured, structured,
mixed).
Core operations:



Comparison gives 1) Proximity measures (similarities,
distances) 2) Motifs (from pairwise and multiple
alignment of sequences). Main distance and similarity
measures
Aggregation of numbers, vectors, sequences (distance
matrices, trees, heatmaps, multiple alignments)
Projections onto sequences 1D plots
This lecture (Part I-2)





Edit distance (refresh)
Substitution matrices (PAM, BLOSUM, how to
build your own)
The two basic sequence alignment algorithms
Algorithm types according to how we do it 1:
exhaustive and heuristic, global and local.
Algorithm types according to what we compare 2:
two sequences, seq. vs dbase, seq vs. genome,
many seqs vs. genome, etc
Applications
The tree of bioinformatics:
core, branches, leaves
[Figure labels: Bioinfo algorithms; Core data, core principles]
New branches and leaves every year…
Application example: communication in bacteria
The input of a sequence alignment algorithm


1) Two sequences
2) A scoring scheme (a score formula AND a
scoring (substitution) matrix)*
*This is applicable also to the comparison of 3D structures of
macromolecules, or any other type of linear object descriptions
(namely: macromolecules have a linear backbone)
The steps of sequence alignment

1) Find all possible alignments (mappings) between two sequences and find the best one according to some “quick” score.*
2) Calculate a final quantitative score** for the best alignment (matching).
* This sketch symbolizes local alignment to indicate that there are many possible mappings. The same is true for global alignments as well.
** Usually an approximate edit distance with a scoring matrix, which we discussed last time.
The results of sequence alignment


1) A score (similarity score or distance)
2) A motif (common subsequence, consensus
description…)
The human mind also describes similarity in terms of patterns and
scores. But the patterns are stored in the human memory in a
smart way…
Motif: AGACXTGA.CTGA
Sequence similarity score
Range of alignment or High
Scoring Pair (HSP)

The score S is a sum of costs assigned to identities and mismatches,
minus a penalty for gaps. Costs are stored in the substitution matrix.
The gap penalty is usually a sum of gap-opening and gap-extension costs.
2017.05.06..
TÁMOP – 4.1.2-08/2/A/KMR-2009-0006
Alignment score

$$\text{Score} = \sum_{\text{start}}^{\text{end}} \text{Similarity\_weights} - \sum \text{Penalties}$$

$$\text{(Gap) penalty} = \text{Gap\_init} + \text{Gap\_length} \times \sum_{\text{start}}^{\text{end}} \text{gap}$$
Gap penalty functions

• Linear: penalty rises monotonically with the length of the gap
• Affine: penalty has a gap-opening and a separate length component
• Probabilistic: penalties may depend upon the neighboring residues
• Other functions: penalty first rises fast, but levels off at greater length values

No dramatic differences. Affine gaps are widely used.
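The gap penalty functions listed above can be sketched in a few lines. This is a minimal illustration, not the formula of any particular program: the function names and the default parameter values are assumptions chosen for the example, and the affine form follows the slide's Gap_init + Gap_length × length_of_gap.

```python
import math


def linear_gap_penalty(length, per_residue=1.0):
    """Linear: the penalty grows in proportion to the gap length."""
    return per_residue * length


def affine_gap_penalty(length, gap_init=10.0, gap_length=0.5):
    """Affine: a one-off opening cost plus a cost per gapped position."""
    if length <= 0:
        return 0.0
    return gap_init + gap_length * length


def logarithmic_gap_penalty(length, gap_init=10.0, scale=2.0):
    """One possible 'other' function: rises fast, then levels off."""
    if length <= 0:
        return 0.0
    return gap_init + scale * math.log(length)
```

With the illustrative defaults, a gap of length 3 costs 3.0 under the linear scheme and 11.5 under the affine scheme, so the affine scheme makes opening a gap expensive but extending it cheap.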

A simple example (alignment without gaps):
For a match/mismatch we look up the value in the
substitution matrix. The matrix is a lookup table…
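As a hedged sketch of the lookup described above, the score of a gap-free alignment is just the sum of the matrix entries for each aligned pair. The dictionary layout and the names `ungapped_score` and `IDENTITY` are assumptions made for this example; the toy matrix is the simple identity scheme (1 for a match, 0 for a mismatch).

```python
def ungapped_score(seq_a, seq_b, matrix):
    """Sum the substitution-matrix entries over all aligned positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


# Toy lookup table: 1 for identical nucleotides, 0 otherwise.
IDENTITY = {(a, b): (1 if a == b else 0) for a in "ACGT" for b in "ACGT"}

score = ungapped_score("ATGCTTATAGG", "ATGCTTCTGGG", IDENTITY)
```

Swapping `IDENTITY` for a PAM or BLOSUM table changes the scores but not the procedure.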
Substitution matrices in details




The substitution matrix (also called scoring matrix) contains
costs for amino acid identities and substitutions in an
alignment.
For amino acids, it is a 20x20 symmetrical matrix that can be
constructed from pairwise alignments of related sequences.
“Related” means either
a) evolutionary relatedness described by an “approved”
evolutionary tree (Dayhoff’s PAM matrices)
b) any sequence similarity as described in the PROSITE
database (Henikoff’s BLOSUM matrices)
Groups of related sequences can be organized into a multiple
alignment for calculation of the matrix elements.
Substitution matrices (cost matrices)

Calculation of scoring matrices from multiple alignments.
ASDESKLVV
ATDDATLSI
ASDSERITV
f(S/T)=3 (S/T pairs occur in columns 2 and 8)
f(S)=5, f(T)=3
Matrix elements are calculated from the observed and expected frequencies (using a “log odds” principle). E.g. for S/T (indicated by red):

$$M(S/T) = \log\left[\frac{f(S/T)}{f(S) \times f(T)}\right]$$

S/T denotes that S is aligned with T or T with S.
The values are calculated from many multiple alignments (not just one). The log odds values in the matrix are then normalized to a given range depending on the application (e.g. -5 to +15, for historical reasons; the range does not matter much).
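The counting behind this example can be sketched as follows. The code counts residues and column-wise residue pairs in the toy alignment and plugs the raw counts into the log-odds formula as written above. The function names and the use of the natural logarithm are assumptions for illustration; production matrices are built from normalized frequencies over many alignments.

```python
import math
from collections import Counter
from itertools import combinations


def count_frequencies(alignment):
    """Count residues and unordered residue pairs within each column."""
    residue_counts = Counter()
    pair_counts = Counter()
    for sequence in alignment:
        residue_counts.update(sequence)
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    return residue_counts, pair_counts


def log_odds(a, b, residue_counts, pair_counts):
    """log[f(a/b) / (f(a) * f(b))] using raw counts, as on the slide."""
    observed = pair_counts[tuple(sorted((a, b)))]
    if observed == 0:
        raise ValueError("pair never observed; log-odds is undefined")
    return math.log(observed / (residue_counts[a] * residue_counts[b]))


alignment = ["ASDESKLVV", "ATDDATLSI", "ASDSERITV"]
residues, pairs = count_frequencies(alignment)
m_st = log_odds("S", "T", residues, pairs)
```

For this alignment the counts reproduce f(S)=5, f(T)=3 and f(S/T)=3, so `m_st` is log(3/15).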
The problem of making a substitution matrix



Problem: To make a matrix you need a multiple alignment, but to make a multiple alignment you need a matrix. (A Münchausen problem.)
The first generation solution: Make multiple alignments by hand, using known proteins. Very tedious; this gives the so-called PAM matrix.
The second generation solution is to make multiple alignments with a program using the PAM matrix, and then extract a large statistics from conserved regions; this is the so-called BLOSUM matrix.
[Figure: matrix, all entries × 10^4]


Pam_1 = 1% of amino acids mutate
Pam_30 = (Pam_1)^30 (matrix multiplication)
PAM 250 (the higher the numbers, the higher the divergence found)
Note: chemically similar amino acids are near each other …
[Figure: PAM 250 matrix with amino acids grouped as small, polar, basic, large, aromatic]
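The relation Pam_30 = (Pam_1)^30 is ordinary repeated matrix multiplication of a mutation probability matrix. The sketch below shows this on a made-up two-letter alphabet; `TOY_PAM1` is a hypothetical matrix invented for the example (1% mutation per step), not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Hypothetical two-letter "PAM 1": each letter mutates with probability 0.01.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam30 = mat_pow(TOY_PAM1, 30)
toy_pam250 = mat_pow(TOY_PAM1, 250)
```

Each row still sums to 1 after multiplication, and the diagonal shrinks as the power grows, which is the sense in which higher PAM numbers describe higher divergence.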
Scoring Matrices used today

BLOSUM Matrices (most often used)




Developed by Henikoff & Henikoff (1992)
BLOcks SUbstitution Matrix
Derived from the BLOCKS database
PAM Matrices



Developed by Schwartz and Dayhoff (1978)
Point Accepted Mutation
Derived from manual alignments of closely related
proteins
PAM versus BLOSUM

PAM:
• First useful scoring matrix for protein
• Assumed a Markov Model of evolution (i.e. all sites equally mutable and independent)
• Derived from small, closely related proteins with ~15% divergence

BLOSUM:
• Much later entry to matrix “sweepstakes”
• No evolutionary model is assumed
• Built from sequence blocks taken from PROSITE (functionally similar segments of proteins)
• Uses much larger, more diverse set of protein sequences (30% - 90% ID)
PAM versus BLOSUM

PAM:
• Higher PAM numbers to detect more remote sequence similarities
• Lower PAM numbers to detect high similarities
• 1 PAM ~ 1 million years of divergence
• Errors in PAM 1 are scaled 250X in PAM 250

BLOSUM:
• Lower BLOSUM numbers to detect more remote sequence similarities
• Higher BLOSUM numbers to detect high similarities
• Sensitive to structural and functional substitution
• Errors in BLOSUM arise from errors in alignment
PAM Matrices
PAM 40 - prepared by multiplying PAM 1 by
itself for a total of 40 times
best for short alignments with high similarity
 PAM 120 - prepared by multiplying PAM 1 by
itself for a total of 120 times
best for general alignment
 PAM 250 - prepared by multiplying PAM 1 by
itself for a total of 250 times
best for detecting distant sequence similarity

BLOSUM Matrices
BLOSUM 90 - prepared from BLOCKS
sequences with >90% sequence ID
best for short alignments with high similarity
 BLOSUM 62 - prepared from BLOCKS
sequences with >62% sequence ID
best for general alignment (default)
 BLOSUM 30 - prepared from BLOCKS
sequences with >30% sequence ID
best for detecting weak local alignments

Scores
Slide by David Landsman, NCBI

Sequence 1:   V    D    S    –    C    Y
Sequence 2:   V    E    T    L    C    F
BLOSUM62:    +4   +2   +1  -12   +9   +3   = 7
PAM30:       +7   +2    0  -10  +10   +2   = 11
Nucleic acid matrices

Needleman-Wunsch *
     A   C   G   T
A   10   0   0   0
C    0  10   0   0
G    0   0  10   0
T    0   0   0  10

Smith-Waterman *
     A   C   G   T
A   10  -9  -9  -9
C   -9  10  -9  -9
G   -9  -9  10  -9
T   -9  -9  -9  10

1) The magnitudes of the elements are relative and can be scaled.
2) Other heuristic matrices can be easily constructed. Identity matrix: diagonal = 1, rest = 0. Or one can penalize certain associations by assigning a large negative value to them, etc.
*These are the names of two classical algorithms, to be discussed in the next section.
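Constructing such heuristic matrices is a small exercise. The sketch below builds a match/mismatch table and optionally overrides individual pairs with a large negative value; the function name `make_matrix` and the `overrides` parameter are assumptions introduced for this illustration.

```python
def make_matrix(alphabet, match, mismatch, overrides=None):
    """Build a symmetric scoring table as a dict keyed by residue pairs."""
    matrix = {(a, b): (match if a == b else mismatch)
              for a in alphabet for b in alphabet}
    # Overrides are applied symmetrically so that M[a, b] == M[b, a].
    for (a, b), score in (overrides or {}).items():
        matrix[(a, b)] = score
        matrix[(b, a)] = score
    return matrix


identity = make_matrix("ACGT", 1, 0)
strict = make_matrix("ACGT", 10, -9)
# Hypothetical example: forbid A/T pairings with a large negative value.
penalized = make_matrix("ACGT", 10, -9, overrides={("A", "T"): -100})
```

Because only relative magnitudes matter, `identity` scaled by 10 gives the same ranking of alignments as the first matrix shown above.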
Dot plots
Visual comparison of sequences
A method of visualizing matching
positions in biological sequences
This presentation was created
by David and Paul Judge
ACCTGCCCTGTCCAGCTTACATGCATGCTTATAGGGGCATTTTACAT
Graphical comparison of sequences using “Dotplots”.
Basic Principles.
1) Write sequences of length n on two perpendicular axes. This defines an n x n matrix, called the dot matrix or alignment matrix.
2) Imagine that we now put a red dot at those positions (i, j) where the nucleotides x(i) and y(j) are identical.
3) If the two sequences are identical, the diagonal will be red: x(i) = y(i) all along the sequences.
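The basic principle can be sketched directly: build a matrix with one cell per pair of positions and mark the cells where the two residues are identical. The names `dot_matrix` and `render` are assumptions for this example, and the text rendering with `*` and `.` stands in for the red dots of the figure.

```python
def dot_matrix(x, y):
    """Return rows indexed by positions of y, columns by positions of x."""
    return [[x_res == y_res for x_res in x] for y_res in y]


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if dot else "." for dot in row)
                     for row in matrix)


plot = dot_matrix("ACGT", "ACGT")
picture = render(plot)
```

For two identical sequences every diagonal cell is marked, which is the unbroken diagonal described in point 3.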
ACCTGCCGATTCCATATTACGCATGCTTCTGGGTTACCGTTCAGGGCATTTTACATGTGCTG
ACCTGCCCTGTCCAGCTTACATGCATGCTTATAGGGGCATTTTACAT
Graphical comparison of sequences using “Dotplots”.
Basic Principles.
4) If 10 nucleotides are deleted in sequence y at a certain position, but the two are otherwise identical, then after the point of deletion y(i) = x(i + 10).
We can view this in two ways:
y(i) = x(i + 10): a 10 nt insertion in x
or
y(i - 10) = x(i): a 10 nt deletion in y
ACCTGCCGATTCCATATTACGCATGCTTCTGGGTTACCGTTCAGGGCATTTTACATGTGCTG
ACCTGCCCTGTCCAGCTTACATGCATGCTTATAGGGGCATTTTACAT
Graphical comparison of sequences using “Dotplots”.
Basic Principles.
Diagonal runs of dots indicate similar regions.
A “word size” (11, say)
A “scoring scheme” (1 for a match, 0 for a mismatch, say):
    A  T  G  C
A   1  0  0  0
T   0  1  0  0
G   0  0  1  0
C   0  0  0  1
A “cut-off score” (8, say)
Example for one pair of words:
ATGCTTATAGG
ATGCTTCTGGG
1+1+1+1+1+1+0+1+0+1+1 = 9
Summary: Dotplots provide a comprehensive overview but NO detail.
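The word-based procedure can be sketched as a double loop over word start positions. The function name `word_dotplot` and the rule "plot a dot when the word score is greater than or equal to the cut-off" are assumptions for this illustration; the defaults follow the values used as examples on the slide (word size 11, cut-off 8, 1 for a match, 0 for a mismatch).

```python
def word_dotplot(x, y, word_size=11, cutoff=8, match=1, mismatch=0):
    """Return (i, j) start positions whose words score at least `cutoff`."""
    dots = []
    for i in range(len(x) - word_size + 1):
        for j in range(len(y) - word_size + 1):
            score = sum(match if x[i + k] == y[j + k] else mismatch
                        for k in range(word_size))
            if score >= cutoff:
                dots.append((i, j))
    return dots


dots = word_dotplot("ATGCTTATAGG", "ATGCTTCTGGG")
strict_dots = word_dotplot("ATGCTTATAGG", "ATGCTTCTGGG", cutoff=11)
```

The example pair of words scores 9, so it produces a dot at cut-off 8 but not when a perfect score of 11 is required.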
Matching bit or character strings: Hamming distance
A
1: 01010010
    |||||
2: 11010001
D = 3
B
1: BIRD
     ||
2: WORD
D = 2
• The Hamming distance is the number of exchanges necessary to
turn one string of bits or characters into another one (the number of
positions not connected with a straight line). The two strings are of
identical length and no alignment is done.
• The exchanges in character strings can have different costs, stored
in a lookup table. In this case the value of the Hamming distance will
be the sum of costs, rather than the number of the exchanges.
WE USE THIS IN DOT PLOT
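A minimal sketch of both variants described above: the plain count of differing positions, and the sum of costs taken from a lookup table. The function name `hamming_distance` and the optional `cost` dictionary are assumptions for this example.

```python
def hamming_distance(a, b, cost=None):
    """Number (or total cost) of exchanges between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    if cost is None:
        return sum(1 for p, q in zip(a, b) if p != q)
    # With a lookup table, each exchange contributes its own cost.
    return sum(cost[(p, q)] for p, q in zip(a, b) if p != q)


bits = hamming_distance("01010010", "11010001")
words = hamming_distance("BIRD", "WORD")
```

These calls give the distances of the two examples above, 3 for the bit strings and 2 for BIRD/WORD.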
Graphical comparison of sequences using “Dotplots”.
Scoring Schemes.
DNA: Simplest Scheme is the Identity Matrix.
    A  T  G  C
A   1  0  0  0
T   0  1  0  0
G   0  0  1  0
C   0  0  0  1
More complex matrices can be used.
For example, the default EMBOSS DNA scoring matrix is:
     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5
The use of negative numbers is only pertinent when these matrices are used for computing textual alignments.
Using a wider spread of scores eases the expansion of the scoring matrix to sensibly include ambiguity codes.
Graphical comparison of sequences using “Dotplots”.
Scoring Schemes.
[Figure: DNA scoring matrix extended to the ambiguity codes (A, C, G, T, S, W, R, Y, K, M, B, V, H, D, N, U), overlaid on a protein scoring matrix; the individual values are not recoverable from this transcript.]
Using a wider spread of scores eases the expansion of the scoring matrix to sensibly include ambiguity codes.

IUB DNA Alphabet
Code   Meaning       Bases
A      A             A
C      C             C
G      G             G
T/U    T             T
M      aMino         A|C
R      puRine        A|G
W      Weak          A|T
S      Strong        C|G
Y      pYrimidine    C|T
K      Keto          G|T
V      not T         A|C|G
H      not G         A|C|T
D      not C         A|G|T
B      not A         C|G|T
N      aNy           A|C|G|T
For Protein sequence dotplots more complex scoring schemes are required.
Scores must reflect far more than alphabetic identity.
Graphical comparison of sequences using “Dotplots”.
Faster plots for perfect matches.
To detect perfectly matching words, a dotplot program has a choice of strategies
1) Select a scoring scheme
    A  T  G  C
A   1  0  0  0
T   0  1  0  0
G   0  0  1  0
C   0  0  0  1
and a word size (11, say).
For every pair of words, compute a word match score in the normal way.
Only if the maximum possible cut-off score (11) is achieved: celebrate with a dot.
ATGCTTATAGG
ATGCTTATAGG
1+1+1+1+1+1+1+1+1+1+1 = 11
If the maximum possible cut-off score (still 11) is not achieved: do not celebrate with a dot.
ATGCTTATAGG
ATGCTTCTGGG
1+1+1+1+1+1+0+1+0+1+1 = 9
Graphical comparison of sequences using “Dotplots”.
Faster plots for perfect matches.
To detect perfectly matching words, a dotplot program has a choice of strategies
2) OR: for every pair of words, see if the letters are exactly the same.
If they are: celebrate with a dot.
ATGCTTATAGG
ATGCTTATAGG
(all 11 letters identical)
If they are not: do not celebrate with a dot.
ATGCTTATAGG
ATGCTTCTGGG
(letters 7 and 9 differ)
To detect exactly matching words, fast character string matching can replace
laborious computation of match scores to be compared with a cut-off score
Many packages include a dotplot option specifically for detecting exactly
matching words.
Particular advantage when seeking strong matches in long DNA sequences.
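One common way to find exactly matching words quickly is to index all words of one sequence in a hash table and look up the words of the other. This is a hedged sketch of that idea, not the method of any specific package; the function name `exact_word_dots` is an assumption for this example.

```python
from collections import defaultdict


def exact_word_dots(x, y, word_size=11):
    """Return (i, j) pairs where the words of length `word_size` are identical."""
    # Index every word of x by its start position.
    index = defaultdict(list)
    for i in range(len(x) - word_size + 1):
        index[x[i:i + word_size]].append(i)
    # Look up each word of y; no per-letter score has to be computed.
    dots = []
    for j in range(len(y) - word_size + 1):
        for i in index.get(y[j:j + word_size], ()):
            dots.append((i, j))
    return dots


hits = exact_word_dots("ACGTACGT", "TACG", word_size=4)
```

The lookup replaces the comparison of every word pair, which is why the approach pays off for long DNA sequences.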
Graphical comparison of sequences using “Dotplots”.
Dotplot parameters.
There are three parameters to consider for a dotplot:
1)The scoring scheme.
2)The cut-off score
3)The word size
Graphical comparison of sequences using “Dotplots”.
Dotplot parameters.
The Scoring scheme.
DNA
Usually, DNA Scoring schemes award a fixed reward for each matched
pair of bases and a fixed penalty for each mismatched pair of bases.
Choosing between such scoring schemes will affect only the choice of
a sensible cut-off score and the way ambiguity codes are treated.
Protein
Protein scoring schemes differ in the evolutionary distance assumed
between the proteins being compared. The choice is rarely crucial
for dotplot programs.
Graphical comparison of sequences using “Dotplots”.
Dotplot parameters.
The Cut-off score.
The higher the cut-off score the fewer dots will be plotted.
But, each dot is more likely to be significant.
The lower the cut-off score the more dots will be plotted.
But, dots are more likely to indicate a chance match (noise).
Graphical comparison of sequences using “Dotplots”.
Dotplot parameters.
The Cut-off score.
Scoring Scheme: PAM 250, Word Size: 25, Cut-off score: 5, 10, 20, 30
[Figure: dotplots of the same pair of sequences at the four cut-off scores. At the highest cut-off, 4 clear strong regions are apparent. As the cut-off is lowered, the 4 regions become clearer and some other weaker features appear; then more “features”, probably noise, appear. At the lowest value the cut-off is too low: too much noise to see the original 4 clear regions.]
Graphical comparison of sequences using “Dotplots”.
Dotplot parameters. The Word size.
Large words can miss small matches.
Smaller words pick up smaller features.
The smallest “features” are often just “noise”.
Graphical comparison of sequences using “Dotplots”.
Dotplot parameters. The Word size.
For sequences with regions of small
matching features.
Small words pick small features individually.
Larger words show matching
regions more clearly.
The lack of detail can be
an advantage
Graphical comparison of sequences using “Dotplots”.
Dotplot parameters. The Word size.
Using a relatively large word size of 25, features are drawn with a broad brush. Detail can be missed if a broad overview is the objective.
Superimposing a plot with a smaller word size of 11 shows the emergence of extra dots. In this case probably all noise.
Displaying the word 11 plot alone shows that major features are drawn more “carefully”. Arguably, less usefully.
Graphical comparison of sequences using “Dotplots”.
Other uses of dotplots.
Detection of Repeats
Graphical comparison of sequences using “Dotplots”.
Other uses of dotplots.
Detection of Stem Loops
What you should know
• What a dot plot is
• Parameters (scoring scheme, cut-off
score, word size)
• Appearance of related regions (between
sequences)
• Repeats within sequences
• Palindromes (within sequences)
• Programs
The End.
Example
Bacterial sensor protein binds communication signal,
then binds to DNA and initiates transcription
[Figure: sensor protein with a signal-binding and a DNA-binding domain; labels: Signal, Signal binding, DNA-binding, RNA polymerase, DNA]
The “normal” domain architecture of the sensor
protein
Signal binding
DNA-binding
Example
Normal sequence
Inverted sequence
Normal and shuffled (inverted) bacterial sensor proteins can
fulfill the same function
Do we see the difference by dot plot?
[Figure: dot plot with the normal sequence on one axis and the inverted sequence on the other]
The two fundamental sequence alignment
algorithms
Global and local alignment
Pairwise alignment – the simplest case


Why?
We have two (protein or DNA) sequences
originating from a common ancestor
The purpose of an alignment is to line up all
positions that originate from the same position in
the ancestral sequence
Pairwise alignment – the simplest case

The purpose of an alignment is to line up all
residues that were derived from the same residue
position in the ancestral gene or protein in two
sequences
gap = insertion or deletion
Types of algorithms according to how we do it:
Global and local
Global
Local
Local similarities, e.g. between multidomain proteins…
Global similarities, e.g. between single domains…
Global alignment

Align two sequences from “head to toe”, i.e.



from 5’ ends to 3’ ends
from N-termini to C-termini
Exhaustive algorithm published by:


Needleman, S.B. and Wunsch, C.D. (1970)
“A general method applicable to the search for
similarities in the amino acid sequence of two proteins”
J. Mol. Biol. 48:443-453.
“Exhaustive” means: all cases tested so the result (the
alignment) is guaranteed to be optimal.
Local alignment

Locate region(s) with high degree of similarity in
two sequences

Exhaustive algorithm published by:

Smith, T.F. and Waterman, M.S. (1981)
“Identification of common molecular subsequences”
J. Mol. Biol. 147:195-197.
Global Alignment
• Simple rules:
– Match (i,j) =
• 1, if residue (i) = residue (j); else 0
– Gap = 1
– Score (i,j) = Maximum of
• Score (i+1,j+1) + Match (i,j)
• Score (i+1,j) + Match (i,j) - Gap
• Score (i,j+1) + Match (i,j) - Gap
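As a hedged illustration, the sketch below fills the score matrix from the bottom-right corner, as the rules above do, but uses the common textbook form of the Needleman-Wunsch recurrence in which a gap step only subtracts the gap penalty and does not add Match(i,j). That choice is an assumption of this example, so its scores need not coincide with the matrices shown on the slides. The function name `global_score` is also hypothetical.

```python
def global_score(x, y, match=1, mismatch=0, gap=1):
    """Best global alignment score of x and y (textbook Needleman-Wunsch).

    score[i][j] is the best score for aligning the suffixes x[i:] and y[j:].
    """
    n, m = len(x), len(y)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: aligning a suffix against nothing costs one gap per residue.
    for i in range(n - 1, -1, -1):
        score[i][m] = score[i + 1][m] - gap
    for j in range(m - 1, -1, -1):
        score[n][j] = score[n][j + 1] - gap
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            pair = match if x[i] == y[j] else mismatch
            score[i][j] = max(score[i + 1][j + 1] + pair,  # align x[i] with y[j]
                              score[i + 1][j] - gap,       # x[i] against a gap
                              score[i][j + 1] - gap)       # y[j] against a gap
    return score[0][0]
```

For example, `global_score("acgt", "agt")` is 2: three matches minus one gap. Every cell of the matrix is visited, which is what makes the method exhaustive.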
Global Alignment
[Figure: the score matrix for the sequences aacttgagc and ctgagt, filled step by step starting from the bottom-right corner. The boundary row holds 0, -1, …, -9 and the boundary column 0, -1, …, -6. The final slides show the completed matrix and the traceback; the individual cell values are not reliably recoverable from this transcript.]
Resulting alignment:
aacttgagc
--ct-gagt
Local Alignment
• Simple rules:
– Match (i,j) =
• 1, if residue (i) = residue (j); else 0
– Gap = 1
– Score (i,j) = Maximum of
• Score (i+1,j+1) + Match (i,j)
• Score (i+1,j) + Match (i,j) - Gap
• Score (i,j+1) + Match (i,j) - Gap
• 0
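The only change from the global rules is the extra option 0, which lets an alignment start and stop anywhere. The sketch below shows this with the common textbook form of the Smith-Waterman recurrence (gap steps subtract the penalty without adding Match(i,j)), filled from the bottom-right corner. This form is an assumption of the example, so its scores need not match the slide's matrices; the function name `local_score` is hypothetical.

```python
def local_score(x, y, match=1, mismatch=0, gap=1):
    """Best local alignment score of x and y (textbook Smith-Waterman)."""
    n, m = len(x), len(y)
    # Boundary row and column stay at 0: a local alignment may end anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            pair = match if x[i] == y[j] else mismatch
            score[i][j] = max(0,                           # start afresh here
                              score[i + 1][j + 1] + pair,  # align x[i] with y[j]
                              score[i + 1][j] - gap,       # x[i] against a gap
                              score[i][j + 1] - gap)       # y[j] against a gap
            best = max(best, score[i][j])
    return best
```

Unlike the global case, the result is the maximum over the whole matrix rather than the corner cell, because the best local alignment can begin at any pair of positions.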
Local Alignment
[Figure: the local-alignment score matrix for the same two sequences (aacttgagc and ctgagt). No cell is negative, the boundary row and column are 0, and the maximum score is 5. Two slides show the traceback from the maximum with the gap in two equivalent positions; the individual cell values are not reliably recoverable from this transcript.]
Resulting local alignment:
cttgag
c-tgag
PAM250
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
Global Alignment
• Advanced rules:
– Match (i,j) =
• W(i,j), where W is a score matrix, like PAM250
– Gap =
• Gap_init + Gap_length × length_of_gap
– Score (i,j) = Maximum of
• Score (i+1,j+1) + Match (i,j)
• Score (i+1,j) + Match (i,j) - Gap
• Score (i,j+1) + Match (i,j) - Gap
Local Alignment
• Advanced rules:
– Match (i,j) =
• W(i,j), where W is a score matrix, like PAM250
– Gap =
• Gap_init + Gap_length × length_of_gap
– Score (i,j) = Maximum of
• Score (i+1,j+1) + Match (i,j)
• Score (i+1,j) + Match (i,j) - Gap
• Score (i,j+1) + Match (i,j) - Gap
• 0
Concepts learnt in this lecture

Alignments can be exhaustive or heuristic



Exhaustive, also called dynamic programming, if the problem does
not need many resources (e.g. we have few
sequences to align)
Heuristic, for realistic problems where time is an issue
Alignments can be global and local

Global: from beginning to end

Local: pinpoint highly similar regions (more realistic)
What methods to select according to the
time/resources we have?

If we have time/resources, we can try exhaustive
algorithms. This is an option with supercomputers
or GPU implementations…

For realistic problems (and realistic resources) we
need heuristic alignments that restrict the search
space to a manageable size… at the price of
losing some accuracy.
Alignment heuristics (examples)

Search space reduction 1: Pre-filter the sequences to be
aligned. Rationale: comparing very different sequences
makes no biological sense. Brute-force filtering is efficient.

Search space reduction 2: Filter out obviously useless
alignments. This means leaving out the corners of the SW or
NW search matrices.
Only those around the
diagonal make sense.
The corners look like this: [figure not preserved in this transcript]
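Leaving out the corners amounts to computing only the cells within a fixed distance of the main diagonal (a "banded" alignment). The sketch below is one way to do this under stated assumptions: it uses the textbook Needleman-Wunsch recurrence filled from the top-left corner, and the function name `banded_global_score` and the `band` parameter are hypothetical. For clarity it still allocates the full matrix; a real implementation would store only the band.

```python
def banded_global_score(x, y, band, match=1, mismatch=0, gap=1):
    """Global alignment score using only cells with |i - j| <= band."""
    n, m = len(x), len(y)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final corner")
    unreachable = float("-inf")
    score = [[unreachable] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        # Only visit columns inside the band around the diagonal.
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = unreachable
            if i > 0 and j > 0:
                pair = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + pair)
            if i > 0:
                best = max(best, score[i - 1][j] - gap)
            if j > 0:
                best = max(best, score[i][j - 1] - gap)
            score[i][j] = best
    return score[n][m]
```

The work drops from about n × m cells to about n × (2 × band + 1), at the price that alignments needing a longer shift than `band` are never considered.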
Bacterial sensor protein binds communication signal,
then binds to DNA and initiates transcription
[Figure: sensor protein with a signal-binding and a DNA-binding domain; labels: Signal, Signal binding, DNA-binding, RNA polymerase, DNA]
The “normal” domain architecture of the sensor
protein
Signal binding
DNA-binding
Normal and shuffled (inverted) bacterial sensor proteins can
fulfill the same function
Do we see the difference by simple (raw) pairwise alignment?
Local alignment
(Smith Waterman)
Normal
Inverted
Identical sequences match at each
amino acid – we show them as a
series of “|” symbols
Global alignment
(Needleman-Wunsch)
Pairwise alignment by itself sees
only the similarity of the larger
domain; the smaller one is lost
(empty line, no hits)
Normal sequence
Inverted sequence
Normal and shuffled (inverted) bacterial sensor proteins can
fulfill the same function
Do we see the difference by dot plot?
[Figure: dot plot with the normal sequence on one axis and the inverted sequence on the other]
The distance matrix has the info, just the
alignment algorithm does not pick it up!!!
[Figure: heat map of the Smith-Waterman matrix next to the dot plot; in both panels the axes are the normal sequence and the inverted sequence]
What have we learnt?

Sequence scoring matrices (PAM, BLOSUM, unitary, and
how to make one’s own…)

Dot plots

The two basic algorithms: global alignment (Needleman-Wunsch)
and local alignment (Smith-Waterman)

Classifying alignment methods (how to align): exhaustive,
heuristic, local, global



Global alignment, exhaustive: Needleman-Wunsch
Local alignment, exhaustive: Smith-Waterman
Heuristics: simple examples