Sequence Alignment
CSCE 769 Guest Lecture
November 1, 2012
Stephanie Irausquin, PhD
Sequence Alignment: Definition and Importance
● Sequence alignment is a process in which at least two homologous sequences are compared; it involves the identification of insertions or deletions that might have occurred in either lineage since their divergence from a common ancestor
● A powerful tool for discovering biological function and establishing evolutionary relationships
Sequence Alignment
● The same principles for sequence alignment can be used to align both nucleotide and amino acid sequences
● More reliable alignments are usually obtained by using amino acid sequences
  1. Amino acids change less frequently during evolution than nucleotides
  2. There are 20 amino acids and only 4 nucleotides, so the probability for 2 sites to be identical by chance is lower at the amino acid level than at the nucleotide level
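As a rough illustration of point 2, the chance that two independent sites carry the same residue is the sum of the squared residue frequencies. The sketch below assumes equal frequencies for all residues, which real sequences do not have; `chance_identity` is a name chosen for this example only.

```python
def chance_identity(freqs):
    """Probability that two independently drawn sites carry the same residue."""
    return sum(f * f for f in freqs)

# Assumption: every residue is equally frequent (a simplification).
nucleotide = chance_identity([1 / 4] * 4)
amino_acid = chance_identity([1 / 20] * 20)

print(f"nucleotides: {nucleotide:.2f}, amino acids: {amino_acid:.2f}")
```

Under this simplification the chance identity is 1/4 for nucleotides and 1/20 for amino acids.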
Sequence Alignment
● A DNA sequence alignment consists of a series of paired bases (one base from each sequence)
● There are 3 types of aligned pairs
  1. Match – it is assumed that the nucleotide at this site has not changed since the divergence between the two sequences
  2. Mismatch – at least one substitution has occurred in one of the sequences since their divergence from each other
  3. Gap – a deletion has occurred in one sequence, or an insertion has occurred in the other (the …)
Types of Alignment
● Manual
● Dot matrix
● Distance and similarity methods
● Alignment algorithms
Manual Alignment
● A reasonable alignment by visual inspection can be obtained using either specialized alignment editors or plain text editors, when there are few gaps and the two sequences are not too different from each other
● Advantages: uses the brain and allows direct integration of additional data (e.g. domain structure)
● Disadvantages: is subjective, and results cannot be compared to those derived using other methods
Dot Matrix
● The two sequences to be aligned are written out as column and row headings of a two-dimensional matrix
● A dot is put in the dot matrix plot at a position where the nucleotides in the two sequences are identical
[Figure: example dot matrix plot with the two sequences as row and column headings]
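The construction described above can be sketched in a few lines. This is a minimal illustration, not a plotting tool: the function names and the two short input sequences are made up for the example.

```python
def dot_matrix(rows, cols):
    """Grid of booleans: True wherever the row and column characters are identical."""
    return [[r == c for c in cols] for r in rows]


def render(rows, cols):
    """Text rendering: column headings on top, row headings on the left."""
    lines = ["  " + " ".join(cols)]
    for r, flags in zip(rows, dot_matrix(rows, cols)):
        lines.append(r + " " + " ".join("•" if hit else "." for hit in flags))
    return "\n".join(lines)


# Illustrative input: two short DNA sequences.
print(render("TCAGACGATTG", "TCGGAGCTG"))
```

Runs of dots along a diagonal mark stretches where the two sequences agree without gaps.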
Dot Matrix
● Advantages:
  – a simple method
  – can reveal important information about the evolution of sequences
● Disadvantages:
  – may become very cluttered
  – may require human intervention to recognize patterns
  – may not be reliable
  – limited to two sequences
Dot Matrix Examples
[Figures a.) and b.): example dot matrix plots]
Distance and similarity methods
● The best possible alignment between two sequences is the one which minimizes the numbers of mismatches and gaps
● However, reducing the number of mismatches usually results in an increase in the number of gaps (and vice versa)
Distance and similarity methods
Considering the following example:
Seq1: TCAGACGATTG    LengthSeq1=11
Seq2: TCGGAGCTG      LengthSeq2=9
● We can reduce the number of mismatches to 0, but the number of gaps in this case is 6:
Seq1: TCAG-ACG-ATTG
Seq2: TC-GGA-GC-T-G
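The counts quoted for this alignment can be checked mechanically. The sketch below classifies every column as a match, mismatch, or gap; `count_pairs` is a name chosen for this example.

```python
def count_pairs(a, b, gap="-"):
    """Return (matches, mismatches, gaps) for two aligned, equal-length strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            gaps += 1          # one residue is paired with a gap symbol
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


counts = count_pairs("TCAG-ACG-ATTG", "TC-GGA-GC-T-G")
print(counts)  # (7, 0, 6)
```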
Distance and similarity methods
Our example, yet again:
Seq1: TCAGACGATTG    LengthSeq1=11
Seq2: TCGGAGCTG      LengthSeq2=9
● Conversely, we can reduce the number of gaps to a single gap having the minimum possible size |LengthSeq1 – LengthSeq2| = 2 nucleotides, which increases the number of mismatches to 5:
Seq1: TCAGACGATTG
Distance and similarity methods
Our example, yet again:
Seq1: TCAGACGATTG    LengthSeq1=11
Seq2: TCGGAGCTG      LengthSeq2=9
● We can also choose an alignment that minimizes neither the number of gaps nor the number of mismatches. In the case below, the number of gaps is 4 and the number of mismatches is 2:
Seq1: TCAG-ACGATTG
Distance and similarity methods
● Which of the three alignments is preferable?
● In order to determine that, we need to find a common denominator (the gap penalty) that allows us to compare gaps and mismatches
● Gap penalty – a factor (or set of factors) by which gap values (the numbers and lengths of gaps) are multiplied in order to make the gaps equivalent in value to mismatches
  – Based on how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of …
Distance and similarity methods
● For any given alignment, we can calculate a distance or dissimilarity index (D) as:
$$D = \sum_i m_i y_i + \sum_k w_k z_k$$
where $y_i$ is the number of mismatches of type $i$, $m_i$ is the mismatch penalty for an $i$-type of mismatch, $z_k$ is the number of gaps of length $k$, and $w_k$ is a positive number representing the penalty for gaps of length $k$
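To see how D ranks the three candidate alignments, here is a minimal sketch. Several things in it are assumptions made for illustration: a single mismatch type with penalty 1, a gap penalty of w_k = 1 + k, and the Seq2 rows of alignments II and III, which are one possible choice reproducing the counts quoted in the lecture (5 mismatches with one gap of length 2, and 2 mismatches with 4 gaps).

```python
from itertools import groupby


def gap_lengths(a, b, gap="-"):
    """Lengths of the maximal gap runs in either aligned sequence."""
    lengths = []
    for seq in (a, b):
        for is_gap, run in groupby(seq, key=lambda ch: ch == gap):
            if is_gap:
                lengths.append(len(list(run)))
    return lengths


def count_mismatches(a, b, gap="-"):
    return sum(1 for x, y in zip(a, b) if gap not in (x, y) and x != y)


def dissimilarity(a, b, mismatch_penalty, gap_penalty):
    """D = m * y + sum of w_k over all gaps, with a single mismatch type."""
    return (mismatch_penalty * count_mismatches(a, b)
            + sum(gap_penalty(k) for k in gap_lengths(a, b)))


alignments = {
    "I": ("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"),
    "II": ("TCAGACGATTG", "TCGGAGCTG--"),      # assumed Seq2 row
    "III": ("TCAG-ACGATTG", "TC-GGA-GCTG-"),   # assumed Seq2 row
}
# Hypothetical penalties: m = 1 per mismatch, w_k = 1 + k per gap of length k.
D = {name: dissimilarity(a, b, 1, lambda k: 1 + k)
     for name, (a, b) in alignments.items()}
print(D)  # {'I': 12, 'II': 8, 'III': 10}
```

With these particular penalties alignment II has the smallest D; a different choice of penalties can change the ranking, which is why the gap penalty has to be chosen with care.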
Distance and similarity methods
● In the most frequently used gap penalty systems, it is assumed that the gap penalty includes two components:
  1. Gap-opening penalty
  2. Gap-extension penalty
● Further complications in the gap penalty system may be introduced by distinguishing among different mismatches (e.g. between amino acids)
  – Leu and Ile vs Arg and Glu
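A two-component gap penalty can be sketched as below. The linear form w_k = opening + extension × k is one common convention and is an assumption here, as are the numeric defaults; the slides do not specify either.

```python
def gap_penalty(k, opening=3.0, extension=0.5):
    """Assumed linear form: w_k = opening + extension * k for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return opening + extension * k


one_long_gap = gap_penalty(4)          # a single gap of length 4
four_short_gaps = 4 * gap_penalty(1)   # four separate gaps of length 1
print(one_long_gap, four_short_gaps)   # 5.0 14.0
```

Because the opening component is paid once per gap, this kind of scheme favors a few long gaps over many short ones.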
BLOSUM
● BLOSUM (BLOcks SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins
● First introduced in a paper by Henikoff and Henikoff
  – Scanned very conserved regions of protein families and counted the relative frequencies of amino acids and their substitution probabilities
  – Calculated a log-odds score for each of the possible substitutions of the 20 standard amino acids
● Several sets of matrices exist
BLOSUM50 Substitution Matrix
Amino acid codes: A Ala, C Cys, D Asp, E Glu, F Phe, G Gly, H His, I Ile, K Lys, L Leu, M Met, N Asn, P Pro, Q Gln, R Arg, S Ser, T Thr, V Val, W Trp, Y Tyr
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R  -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N  -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D  -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C  -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q  -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E  -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G   0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H  -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I  -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
L  -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
K  -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
M  -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
F  -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
P  -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
S   1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
W  -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
Y  -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
V   0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5
$$s_{x,y} = \log \frac{p_{xy}}{p_x \, p_y}$$
$p_{xy}$ is the probability that x and y are evolutionarily related.
$p_x$ is the probability of occurrence of x.
$p_y$ is the probability of occurrence of y.
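A quick numeric illustration of the log-odds score follows. The probabilities are made up for the example, and base 2 is an assumption (the formula above leaves the base of the logarithm open); published matrices are additionally scaled and rounded to integers.

```python
import math


def log_odds(p_xy, p_x, p_y, base=2.0):
    """s(x, y) = log( p_xy / (p_x * p_y) ); the base is an assumption of this sketch."""
    return math.log(p_xy / (p_x * p_y), base)


# Hypothetical probabilities, for illustration only.
favoured = log_odds(0.01, 0.05, 0.05)         # pair seen 4x more often than by chance
disfavoured = log_odds(0.000625, 0.05, 0.05)  # pair seen 4x less often than by chance
print(round(favoured, 3), round(disfavoured, 3))
```

A pair that is aligned more often than chance predicts gets a positive score, and one aligned less often gets a negative score, which is the sign pattern visible in the matrix.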
Sequence Alignment Algorithms
● The purpose of any alignment algorithm is to choose the alignment associated with the smallest D from all possible alignments
● The number of possible alignments can be very large
● Fortunately, there are computer alignment algorithms for finding the optimal alignment between two sequences
● Fundamentally, there are two different types of alignment algorithms:
  1. Global (Needleman-Wunsch)
  2. Local (Smith-Waterman)
Global Alignment: Needleman-Wunsch
● Every letter of each sequence is aligned to a letter or gap
● Alignment takes place in a 2D matrix
● Each cell corresponds to a pairing of one letter from each sequence and contains a score derived from a scoring scheme along with a corresponding pointer
● The algorithm contains three major phases (initialization, fill, and trace-back)
● In order to examine each phase, let's align the words HEAGAWGHE and PAWHEAE using a gap penalty of -8 and match/mismatch scores from the BLOSUM50 matrix
Global Alignment: Needleman-Wunsch
● Initialization
  – Values for the first row and column are assigned
  – The score of each cell is set to the gap penalty (-8) multiplied by the distance from the origin
           H    E    A    G    A    W    G    H    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72
P    -8
A   -16
W   -24
H   -32
E   -40
A   -48
E   -56
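The initialization phase can be sketched as follows. The function name `init_matrix` and the choice of which word runs horizontally are assumptions of this example.

```python
def init_matrix(vertical, horizontal, gap=-8):
    """Score matrix with the first row and column set to gap * distance from the origin."""
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        F[i][0] = gap * i
    for j in range(cols):
        F[0][j] = gap * j
    return F


F = init_matrix("PAWHEAE", "HEAGAWGHE")
print(F[0])                   # first row: 0, -8, ..., -72
print([row[0] for row in F])  # first column: 0, -8, ..., -56
```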
Global Alignment: Needleman-Wunsch
● Fill
  – Three scores are computed for each cell
    ● Diagonal Score – sum of the diagonal cell score and the score for a match/mismatch (BLOSUM50 matrix)
    ● Horizontal Score – sum of the cell to the left and the gap penalty
    ● Vertical Score – sum of the above cell and the gap penalty
  – The entire matrix is then filled by assigning for each cell the max score (obtained from the 3 computed scores) and corresponding pointer
(P->H) Diagonal Score {0 + (-2) = -2}
(P->H) Horizontal Score {-8 + (-8) = -16}
(P->H) Vertical Score {-8 + (-8) = -16}
(P->H) Max Score = -2
           H    E    A    G    A    W    G    H    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72
P    -8   -2
A   -16
W   -24
H   -32
E   -40
A   -48
E   -56
Global Alignment: Needleman-Wunsch
● Fill (next cell)
(P->E) Diagonal Score {-8 + (-1) = -9}
(P->E) Horizontal Score {-2 + (-8) = -10}
(P->E) Vertical Score {-16 + (-8) = -24}
(P->E) Max Score = -9
           H    E    A    G    A    W    G    H    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72
P    -8   -2   -9
A   -16
W   -24
H   -32
E   -40
A   -48
E   -56
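The two worked cells can be reproduced with a small helper. `cell_scores` is a name chosen for this sketch; the substitution scores -2 (P with H) and -1 (P with E) are the BLOSUM50 entries quoted in the worked examples, and ties are broken in favor of the diagonal here, which is an arbitrary choice.

```python
GAP = -8


def cell_scores(diag, left, above, substitution, gap=GAP):
    """Return the three candidate scores for a cell and the pointer to the best one."""
    candidates = {
        "diagonal": diag + substitution,
        "horizontal": left + gap,
        "vertical": above + gap,
    }
    pointer = max(candidates, key=candidates.get)  # first maximum wins on ties
    return candidates, pointer


ph, ph_pointer = cell_scores(diag=0, left=-8, above=-8, substitution=-2)
pe, pe_pointer = cell_scores(diag=-8, left=-2, above=-16, substitution=-1)
print(ph, ph_pointer)
print(pe, pe_pointer)
```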
Global Alignment: Needleman-Wunsch
● Fill
  – Continue calculating max score for all cells along with corresponding pointer
           H    E    A    G    A    W    G    H    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72
P    -8   -2   -9  -17  -25  -33  -41  -49  -57  -65
A   -16  -10   -3   -4  -12  -20  -28  -36  -44  -52
W   -24  -18  -11   -6   -7  -15   -5  -13  -21  -29
H   -32  -14  -18  -13   -8   -9  -13   -7   -3  -11
E   -40  -22   -8  -16  -16   -9  -12  -15   -7    3
A   -48  -30  -16   -3  -11  -11  -12  -12  -15   -5
E   -56  -38  -24  -11   -6  -12  -14  -15  -12   -9
Global Alignment: Needleman-Wunsch
● Trace-back
  – Allows one to recover the alignment from the matrix
  – Trace back your transition from the bottom right corner to the top left corner by referring back to the completed matrix
[Figure: the completed score matrix, repeated from the fill step]
Global Alignment: Needleman-Wunsch
● Trace-back
  – Horizontal transition represents a gap in the vertical sequence
  – Vertical transition represents a gap in the horizontal sequence
  – Diagonal transition represents a match or mismatch between the corresponding characters of the two sequences
[Figure: the completed score matrix, repeated from the fill step]
Final Alignment:
H E A G A W G H - E
- - P - A W H E A E
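Putting the three phases together gives the following minimal sketch of a global alignment with a linear gap penalty. It is an illustration under stated assumptions, not a reference implementation: `PAIRS` holds only the BLOSUM50 entries needed for the two example words, the function names are invented for the example, and ties are broken in the order diagonal, horizontal, vertical.

```python
# BLOSUM50 entries for the residues of the two example words only (symmetric).
PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(x, y):
    return PAIRS["".join(sorted(x + y))]


def needleman_wunsch(vertical, horizontal, gap=-8):
    n, m = len(vertical), len(horizontal)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Initialization: gap penalty times the distance from the origin.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = gap * i, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = gap * j, "left"
    # Fill: keep the best of the three candidate scores and its pointer.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + score(vertical[i - 1], horizontal[j - 1]), "diag"),
                (F[i][j - 1] + gap, "left"),
                (F[i - 1][j] + gap, "up"),
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda o: o[0])
    # Trace-back: from the bottom right corner to the top left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            top.append(horizontal[j - 1]); bottom.append(vertical[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":   # horizontal move: gap in the vertical sequence
            top.append(horizontal[j - 1]); bottom.append("-")
            j -= 1
        else:                  # vertical move: gap in the horizontal sequence
            top.append("-"); bottom.append(vertical[i - 1])
            i -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("PAWHEAE", "HEAGAWGHE")
print(result)  # (-9, 'HEAGAWGH-E', '--P-AWHEAE')
```

The score of -9 is the value in the bottom right cell of the completed matrix. A different tie-breaking order could return another alignment with the same score.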
Local Alignment: Smith-Waterman
● A slight modification of the Needleman-Wunsch algorithm:
  – Edges of the matrix are initialized to zero
  – Max score is never less than zero; no pointer is recorded unless the score is greater than zero
  – Trace-back starts from the highest score in the matrix and ends at a score of zero
Local Alignment: Smith-Waterman
● Again, let's align the words HEAGAWGHE and PAWHEAE using the same scoring scheme:
  – gap penalty of -8
  – match score and mismatch penalty to be determined using the BLOSUM50 matrix
        H   E   A   G   A   W   G   H   E
    0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0
H   0  10   2   0   0   0  12  18  22  14
E   0   2  16   8   0   0   4  10  18  28
A   0   0   8  21  13   5   0   4  10  20
E   0   0   6  13  18  12   4   0   4  16
  – Start from the largest score and trace back to determine the best local alignment
  – Horizontal transition represents a gap in the vertical sequence
  – Vertical transition represents a gap in the horizontal sequence
  – Diagonal transition represents a match or mismatch between the corresponding characters of the two sequences
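The local variant can be sketched by changing three things relative to the global procedure: zero edges, scores clamped at zero, and a trace-back that starts at the largest score and stops at zero. As before, this is an illustrative sketch: `PAIRS` contains only the BLOSUM50 entries needed for the two words, and the function names and tie-breaking order are assumptions.

```python
# BLOSUM50 entries for the residues of the two example words only (symmetric).
PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(x, y):
    return PAIRS["".join(sorted(x + y))]


def smith_waterman(vertical, horizontal, gap=-8):
    n, m = len(vertical), len(horizontal)
    F = [[0] * (m + 1) for _ in range(n + 1)]       # edges stay at zero
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + score(vertical[i - 1], horizontal[j - 1]), "diag"),
                (F[i][j - 1] + gap, "left"),
                (F[i - 1][j] + gap, "up"),
            ]
            value, move = max(options, key=lambda o: o[0])
            if value > 0:                            # otherwise keep 0 and no pointer
                F[i][j], ptr[i][j] = value, move
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace-back from the highest score until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while F[i][j] > 0:
        move = ptr[i][j]
        if move == "diag":
            top.append(horizontal[j - 1]); bottom.append(vertical[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":
            top.append(horizontal[j - 1]); bottom.append("-")
            j -= 1
        else:
            top.append("-"); bottom.append(vertical[i - 1])
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local = smith_waterman("PAWHEAE", "HEAGAWGHE")
print(local)  # (28, 'AWGHE', 'AW-HE')
```

The maximum of 28 corresponds to the largest entry in the matrix, and the recovered region pairs AWGHE with AW-HE.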
Local Alignment: Smith-Waterman
● Does it matter what “word”/sequence is horizontal/vertical?
● To answer this question, let's align PAWHEAE (horizontal) to HEAGAWGHE (vertical) using the same scoring scheme as before:
  – gap penalty of -8
  – match score and mismatch penalty to be determined using the BLOSUM50 matrix
        P   A   W   H   E   A   E
    0   0   0   0   0   0   0   0
H   0   0   0   0  10   2   0   0
E   0   0   0   0   2  16   8   6
A   0   0   5   0   0   8  21  13
G   0   0   0   2   0   0  13  18
A   0   0   5   0   0   0   5  12
W   0   0   0  20  12   4   0   4
G   0   0   0  12  18  10   4   0
H   0   0   0   4  22  18  10   4
E   0   0   0   0  14  28  20  16
  – Start from the largest score and trace back to determine the best local alignment
  – Horizontal transition represents a gap in the vertical sequence
So does it matter what “word”/sequence is horizontal/vertical? No, it does not. Either way, the final alignment is the same and is considered to be the “optimal” alignment
        P   A   W   H   E   A   E
    0   0   0   0   0   0   0   0
H   0   0   0   0  10   2   0   0
E   0   0   0   0   2  16   8   6
A   0   0   5   0   0   8  21  13
G   0   0   0   2   0   0  13  18
A   0   0   5   0   0   0   5  12
W   0   0   0  20  12   4   0   4
G   0   0   0  12  18  10   4   0
H   0   0   0   4  22  18  10   4
E   0   0   0   0  14  28  20  16

        H   E   A   G   A   W   G   H   E
    0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0
H   0  10   2   0   0   0  12  18  22  14
E   0   2  16   8   0   0   4  10  18  28
A   0   0   8  21  13   5   0   4  10  20
E   0   0   6  13  18  12   4   0   4  16
Final Alignment:
H E A G A W G H E - -
- - - P A W - H E A E
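The orientation question can also be checked numerically. The sketch below computes only the best local score, keeping two rows of the matrix at a time, and runs it with the two words swapped. It relies on the substitution scores being symmetric and on the same gap penalty applying in both directions; `PAIRS` again holds only the BLOSUM50 entries needed for these two words, and the function names are invented for the example.

```python
# BLOSUM50 entries for the residues of the two example words only (symmetric).
PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(x, y):
    return PAIRS["".join(sorted(x + y))]


def local_score(vertical, horizontal, gap=-8):
    """Best Smith-Waterman score, computed row by row."""
    prev = [0] * (len(horizontal) + 1)
    best = 0
    for x in vertical:
        cur = [0]
        for j, y in enumerate(horizontal, start=1):
            cur.append(max(0,
                           prev[j - 1] + score(x, y),  # diagonal
                           cur[j - 1] + gap,           # horizontal
                           prev[j] + gap))             # vertical
        best = max(best, max(cur))
        prev = cur
    return best


one_way = local_score("HEAGAWGHE", "PAWHEAE")
other_way = local_score("PAWHEAE", "HEAGAWGHE")
print(one_way, other_way)  # 28 28
```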
Global or Local?
● When is a global alignment more useful?
  – When sequences in a query set are similar and close in size
● When is a local alignment more useful?
  – When sequences in a query set are dissimilar but suspected to contain regions of similarity
When sequences (amino acid or nucleotide) are sufficiently similar, there is no difference between local and global alignments
Helpful Charts
AA chart: http://sofbiology.blogspot.com/2010/12/proteinsynthesis-amino-acid-table.html
IUPAC chart: http://www.bioinformatics.org/sms/iupac.html
Except where otherwise noted (i.e. items on the slide labeled “Helpful Charts”),
most information contained in this presentation was obtained from:
Graur, Dan and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Second
Edition. Sunderland, Massachusetts: Sinauer Associates, Inc., Publishers, 2000.
Some of the information related to global & local alignment algorithms was
obtained from and can be accessed at:
http://etutorials.org/Misc/blast/Part+II+Theory/Chapter+3.+Sequence+Alignment/