LSM3241: Bioinformatics and
Biocomputing
Lecture 4: Sequence analysis methods
revisited
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1,
National University of Singapore
Sequence Analysis Methods
2
Gene and Protein Sequence Alignment
as a Mathematical Problem:
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best Alignment (one gap):
ATTCTTGC
ATCCTATTCTAGC
Bad Alignment (two gaps):
AT TCTT GC
ATCCTATTCTAGC
What is a good alignment?
3
How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)
a1 a2 a3 - - x - b1 b2 b3 - - y - -
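The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `column_score` and `alignment_score` are made up here, and the two aligned rows are one possible alignment of the sequences CTTAACT and CGGATCAT.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # values from the scoring scheme above


def column_score(x: str, y: str) -> int:
    """w(x, y) for one alignment column; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(column_score(x, y) for x, y in zip(row_a, row_b))


# 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8
print(alignment_score("CTTAAC-T", "CGGATCAT"))  # 14
```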
4
Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
An alignment of a and b:
C---TTAACT
CGGATCA--T
[Figure labels: match (e.g. C over C), mismatch (e.g. T over C), insertion gap, deletion gap]
5
Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure: alignment graph with sequence b across the top and sequence a down the side; the insertion gap and the deletion gap are marked on the path.]
6
Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure, built up over five slides: the alignment is drawn column by column as a path through the alignment graph.]
11
Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure: the alignment shown as a pathway through the alignment graph.]
12
Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
CTTAACTCGGATCAT
[Figure: a second alignment drawn as a path through the alignment graph.]
13
Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
CTTAACTCGGATCAT
[Figure: the pathway of this alignment through the alignment graph.]
14
Use of graph to generate alignments
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure, repeated on three slides: a different pathway through the alignment graph is highlighted each time, and each pathway spells out one alignment.]
-CTTAACT
CGGATCAT

-C--TTAACT
CGGATC-AT-

CTTAACT - - - - CGGATCAT
17
Which pathway is better?
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: alignment graph with several alternative pathways highlighted.]
Multiple pathways, each with its own alignment score
18
Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure, built up over four slides: the running score is written at each node along the pathway.]
8
8 - 3 = 5
5 - 3 = 2
2 - 3 = -1
-1 + 8 = 7
7 - 5 = 2
2 + 8 = 10
10 - 3 = 7
7 - 3 = 4
Alignment score: 4 + 8 = 12
22
An optimal alignment
-- the alignment of maximum score
• Let A = a1a2…am and B = b1b2…bn.
• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj
• With proper initializations, Si,j can be computed as follows:

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
23
Computing Si,j
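[Figure: cell (i, j) of the score matrix is reached by a diagonal edge of weight w(ai,bj), a vertical edge of weight w(ai,-), or a horizontal edge of weight w(-,bj); the optimal alignment score is Sm,n.]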
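The recurrence for Si,j can be sketched as a small dynamic-programming routine (this scheme is commonly known as the Needleman-Wunsch algorithm). The sketch below assumes the lecture's scores (match 8, mismatch -5, gap -3) and gap-penalty initialization of the first row and column; the function name `global_score_matrix` is illustrative.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # scoring scheme from the lecture


def w(x: str, y: str) -> int:
    """Score for aligning two residues (neither is a gap symbol)."""
    return MATCH if x == y else MISMATCH


def global_score_matrix(a: str, b: str) -> list:
    """Fill S[i][j], the best score for aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: a[:i] against gaps only
        S[i][0] = S[i - 1][0] + GAP
    for j in range(1, n + 1):          # first row: b[:j] against gaps only
        S[0][j] = S[0][j - 1] + GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j] + GAP,                         # a_i against a gap
                S[i][j - 1] + GAP,                         # gap against b_j
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),   # a_i against b_j
            )
    return S


S = global_score_matrix("CTTAACT", "CGGATCAT")
print(S[1][1], S[1][2], S[2][1], S[2][2])  # 8 5 5 3
print(S[7][8])                             # 14, the optimal global score
```

The matrix has (m+1) x (n+1) entries, so time and memory both grow with the product of the two sequence lengths.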
24
Initializations
Gap symbol: -3
S0,0 = 0
S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24
S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21
[Table: score matrix with rows C T T A A C T (sequence a) and columns C G G A T C A T (sequence b); only the first row and first column are filled in, with the values above.]
25
S1,1 = ?
Match: 8
Mismatch: -5
Gap symbol: -3
[Table: score matrix with the first row and column initialized and S1,1 marked "?".]
Option 1:
S1,1 = S0,0 + w(a1, b1)
= 0 + 8 = 8
Option 2:
S1,1 = S0,1 + w(a1, -)
= -3 - 3 = -6
Option 3:
S1,1 = S1,0 + w(-, b1)
= -3 - 3 = -6
Optimal:
S1,1 = 8
26
S1,2 = ?
Match: 8
Mismatch: -5
Gap symbol: -3
[Table: score matrix with S1,1 = 8 filled in and S1,2 marked "?".]
Option 1:
S1,2 = S0,1 + w(a1, b2)
= -3 - 5 = -8
Option 2:
S1,2 = S0,2 + w(a1, -)
= -6 - 3 = -9
Option 3:
S1,2 = S1,1 + w(-, b2)
= 8 - 3 = 5
Optimal:
S1,2 = 5
27
S2,1 = ?
Match: 8
Mismatch: -5
Gap symbol: -3
[Table: score matrix with S1,1 = 8 and S1,2 = 5 filled in and S2,1 marked "?".]
Option 1:
S2,1 = S1,0 + w(a2, b1)
= -3 - 5 = -8
Option 2:
S2,1 = S1,1 + w(a2, -)
= 8 - 3 = 5
Option 3:
S2,1 = S2,0 + w(-, b1)
= -6 - 3 = -9
Optimal:
S2,1 = 5
28
S2,2 = ?
Match: 8
Mismatch: -5
Gap symbol: -3
[Table: score matrix with S1,1 = 8, S1,2 = 5 and S2,1 = 5 filled in and S2,2 marked "?".]
Option 1:
S2,2 = S1,1 + w(a2, b2)
= 8 - 5 = 3
Option 2:
S2,2 = S1,2 + w(a2, -)
= 5 - 3 = 2
Option 3:
S2,2 = S2,1 + w(-, b2)
= 5 - 3 = 2
Optimal:
S2,2 = 3
29
S3,5 = ?
           C    G    G    A    T    C    A    T
      0   -3   -6   -9  -12  -15  -18  -21  -24
C    -3    8    5    2   -1   -4   -7  -10  -13
T    -6    5    3    0   -3    7    4    1   -2
T    -9    2    0   -2   -5    ?
A   -12
A   -15
C   -18
T   -21
30
S3,5 = 5
           C    G    G    A    T    C    A    T
      0   -3   -6   -9  -12  -15  -18  -21  -24
C    -3    8    5    2   -1   -4   -7  -10  -13
T    -6    5    3    0   -3    7    4    1   -2
T    -9    2    0   -2   -5    5    2   -1    9
A   -12   -1   -3   -5    6    3    0   10    7
A   -15   -4   -6   -8    3    1   -2    8    5
C   -18   -7   -9  -11    0   -2    9    6    3
T   -21  -10  -12  -14   -3    8    6    4   14
Optimal score: S7,8 = 14
31
C T T A A C - T
C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
[Table: the completed score matrix, with the traceback path from S7,8 = 14 back to S0,0 highlighted.]
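Recovering an alignment from the filled matrix is done by tracing back from Sm,n to S0,0, at each cell following a predecessor that explains the cell's value. The sketch below is one way to do this; the name `global_align` and the tie-breaking order (diagonal first, then vertical, then horizontal) are choices made here, not taken from the slides. With other tie-breaking rules a different alignment of equal score may be returned.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # scoring scheme from the lecture


def w(x, y):
    return MATCH if x == y else MISMATCH


def global_align(a, b):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * GAP
    for j in range(1, n + 1):
        S[0][j] = j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + GAP,
                          S[i][j - 1] + GAP)

    # Traceback from the bottom-right corner to the top-left corner.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + GAP:
            row_a.append(a[i - 1]); row_b.append("-")   # a_i against a gap
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])   # gap against b_j
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("CTTAACT", "CGGATCAT")
print(score)   # 14
print(top)     # CTTAAC-T
print(bottom)  # CGGATCAT
```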
32
Local vs. Global Sequence Alignment:
Example:
DNA sequence a: ATTCTTGC
DNA sequence b: ATCCTATTCTAGC
Local Alignment (one gap):
ATTCTTGC
ATCCTATTCTAGC
Global Alignment (two gaps):
AT TCTT GC
ATCCTATTCTAGC
Gaps ignored in local alignments
Gaps counted in global alignments
33
Global Alignment vs. Local Alignment
• global alignment: All sections are counted
• local alignment: Only local sections (normally separated by gaps) are counted
34
An optimal local alignment
• Si,j: the score of an optimal local alignment ending at ai and bj
• With proper initializations, Si,j can be computed as follows:

$$
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
35
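The local recurrence (commonly known as the Smith-Waterman algorithm) differs from the global one only in the extra option 0 and in where the traceback starts and stops. A minimal sketch, assuming the lecture's scores; the name `local_align` and the tie-breaking order in the traceback are choices made for this illustration.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # scoring scheme from the lecture


def w(x, y):
    return MATCH if x == y else MISMATCH


def local_align(a, b):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + GAP,
                          S[i][j - 1] + GAP)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Traceback from the best cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + GAP:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = local_align("CTTAACT", "CGGATCAT")
print(score)   # 18
print(top)     # A-C-T
print(bottom)  # ATCAT
```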
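Initializations
Match: 8
Mismatch: -5
Gap symbol: -3
S0,j = 0 for j = 0, …, 8
Si,0 = 0 for i = 0, …, 7
[Table: score matrix with rows C T T A A C T (sequence a) and columns C G G A T C A T (sequence b); the first row and first column are all 0.]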
36
S1,1 = ?
Match: 8
Mismatch: -5
Gap symbol: -3
[Table: score matrix with the first row and column set to 0 and S1,1 marked "?".]
Option 1:
S1,1 = S0,0 + w(a1, b1)
= 0 + 8 = 8
Option 2:
S1,1 = S0,1 + w(a1, -)
= 0 - 3 = -3
Option 3:
S1,1 = S1,0 + w(-, b1)
= 0 - 3 = -3
Option 4:
S1,1 = 0
Optimal:
S1,1 = 8
37
local alignment
Match: 8
Mismatch: -5
Gap symbol: -3
           C    G    G    A    T    C    A    T
      0    0    0    0    0    0    0    0    0
C     0    8    5    2    0    0    8    5    2
T     0    5    3    0    0    8    5    3   13
T     0    2    0    0    0    8    5    2   11
A     0    0    0    0    8    5    3    ?
A     0
C     0
T     0
38
local alignment
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18
           C    G    G    A    T    C    A    T
      0    0    0    0    0    0    0    0    0
C     0    8    5    2    0    0    8    5    2
T     0    5    3    0    0    8    5    3   13
T     0    2    0    0    0    8    5    2   11
A     0    0    0    0    8    5    3   13   10
A     0    0    0    0    8    5    2   11    8
C     0    8    5    2    5    3   13   10    7
T     0    5    3    0    2   13   10    8   18
The best score: 18
39
BLAST
Basic Local Alignment Search Tool
Procedure:
• Divide all sequences into overlapping constituent words (size k)
• Build the hash table for Sequence a.
• Scan Sequence b for hits.
• Extend hits.
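The first three steps of the procedure can be sketched with a plain dictionary as the hash table. This is a simplified illustration that only finds exact word matches; it does not use a similarity matrix or a score threshold. The function names and the choice k = 3 are assumptions made for the example.

```python
from collections import defaultdict


def build_word_table(seq, k):
    """Map every overlapping word of length k to its start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def find_hits(seq_a, seq_b, k):
    """Scan seq_b and report (i, j) pairs where seq_a[i:i+k] == seq_b[j:j+k]."""
    table = build_word_table(seq_a, k)
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in table.get(seq_b[j:j + k], ()):
            hits.append((i, j))
    return hits


hits = find_hits("ATTCTTGC", "ATCCTATTCTAGC", k=3)
print(hits)  # [(0, 5), (1, 6), (2, 7)]
```

All three hits lie on the same diagonal (j - i = 5), which is the kind of signal the extension step then follows up.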
40
BLAST
Basic Local Alignment Search Tool
Step 1:
Hash table for
sequence A
41
Amino acid similarity matrix PAM 120
Instead of using the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.
42
Amino acid similarity matrix PAM 250
This is a more widely used score matrix for ranking the level of similarity of two amino acids. It is derived from more diverse data sets and a larger number of statistical steps.
43
Amino acid similarity matrix Blosum 45
The Blosum matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly related proteins. In principle, Blosum matrices should be more realistic for comparing distantly related proteins, but may introduce error for conventional proteins.
44
BLAST
Basic Local Alignment Search Tool
45
BLAST
Basic Local Alignment Search Tool
Step 2:
Use all of the 2-letter words in the query sequence to scan against the database sequence and mark those with score > 8.
Note: marked points can be on the diagonal and off-diagonal.
Word-pair scores labelled in the figure: LN:LN=9, NF:NY=8, GW:PW=10
46
BLAST
Step 2: Scan sequence b for hits.
47
BLAST
Step 2: Scan sequence b for hits.
Step 3: Extend hits.
Terminate if the score of the extension fades away.
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
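One way to make "terminate if the score fades away" concrete is a drop-off rule: keep extending a hit without gaps and stop once the running score falls more than a fixed amount below the best score seen so far. The sketch below is an illustration under that assumption; the parameter `x_drop` and its default of 10, the +8/-5 scores, and the function name `extend_hit` are choices made here, not values from the lecture.

```python
def extend_hit(a, b, i, j, k, match=8, mismatch=-5, x_drop=10):
    """Ungapped extension of a word hit a[i:i+k] ~ b[j:j+k].

    Returns (best_score, (a_start, a_end), (b_start, b_end)), end exclusive.
    """
    def s(x, y):
        return match if x == y else mismatch

    best = sum(s(a[i + t], b[j + t]) for t in range(k))  # score of the seed word

    # Extend to the right, remembering the best-scoring extension length.
    right, cur, t = 0, best, 0
    while i + k + t < len(a) and j + k + t < len(b):
        cur += s(a[i + k + t], b[j + k + t])
        t += 1
        if cur > best:
            best, right = cur, t
        elif best - cur > x_drop:      # score has faded away: stop
            break

    # Extend to the left in the same way.
    left, cur, t = 0, best, 0
    while i - 1 - t >= 0 and j - 1 - t >= 0:
        cur += s(a[i - 1 - t], b[j - 1 - t])
        t += 1
        if cur > best:
            best, left = cur, t
        elif best - cur > x_drop:
            break

    return best, (i - left, i + k + right), (j - left, j + k + right)


# Extend the 3-letter hit at a[2:5] == b[7:10] == "TCT".
print(extend_hit("ATTCTTGC", "ATCCTATTCTAGC", 2, 7, 3))
# (51, (0, 8), (5, 13)): ATTCTTGC against ATTCTAGC, 7 matches and 1 mismatch
```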
48
Multiple sequence alignment (MSA)
• The multiple sequence alignment problem is to
simultaneously align more than two sequences.
Seq1: GCTC    GC-TC
Seq2: AC      A---C
Seq3: GATC    G-ATC
49
Multiple sequence alignment MSA
50
How to score an MSA?
• Sum-of-Pairs (SP-score)
Score(GC-TC, A---C, G-ATC)
= Score(GC-TC, A---C)
+ Score(GC-TC, G-ATC)
+ Score(A---C, G-ATC)
51
How to score an MSA?
• Sum-of-Pairs (SP-score)
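Score(GC-TC, A---C, G-ATC)
= Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)
Score(GC-TC, A---C) = -5 - 3 + 8 - 3 + 8 = 5
Score(GC-TC, G-ATC) = 8 - 3 - 3 + 8 + 8 = 18
Score(A---C, G-ATC) = -5 + 8 - 3 - 3 + 8 = 5
SP-score = 5 + 18 + 5 = 28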
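The sum-of-pairs calculation can be reproduced with a short sketch. One assumption is needed to match the arithmetic above: a column that pairs two gap symbols is scored like a match (+8), which is what the pairwise sums on this slide imply. The function names are illustrative.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 8, -5, -3


def w(x, y):
    """Column score; identical symbols (including gap with gap) score as a match."""
    if x == y:
        return MATCH      # assumption inferred from the slide: '-' with '-' gives +8
    if x == "-" or y == "-":
        return GAP
    return MISMATCH


def pair_score(row1, row2):
    """Score of the pairwise alignment induced by two rows of the MSA."""
    return sum(w(x, y) for x, y in zip(row1, row2))


def sp_score(rows):
    """Sum-of-pairs score: add the induced pairwise scores over all row pairs."""
    return sum(pair_score(r1, r2) for r1, r2 in combinations(rows, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print([pair_score(r1, r2) for r1, r2 in combinations(msa, 2)])  # [5, 18, 5]
print(sp_score(msa))                                            # 28
```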
52
Position Specific Iterated BLAST
• PSI-BLAST is a rather permissive alignment
tool and it can find more distantly related
sequences than FASTA or BLAST
• In many cases it is much more sensitive to weak but biologically relevant sequence similarities.
53
Position Specific Iterated BLAST
PSI-BLAST is used for:
• Distant homology detection
• Fold assignment: profile-profile comparison
• Domain identification
• Evolutionary Analysis (e.g. tree building)
• Sequence Annotation / function assignment
• Profile export to other programs
• Sequence clustering
• Structural genomics target selection
54
Position Specific Iterated BLAST
• Collect all database sequence segments that
have been aligned with query sequence with
E-value below set threshold (default 0.001,
but all sequences with E<10 are displayed
for manual inclusion)
• Construct position specific scoring matrix for
collected sequences. Rough idea:
– Align all sequences to the query sequence as the
template.
– Assign weights to the sequences
– Construct position specific scoring matrix
• Iterate
55
How PSI-BLAST works
Take a sequence
MGLLTREIF--ILQQ
Search for similar sequences in a full sequence database
Sequences are multiply aligned
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
FGLLRT-I-T-YMTN
-GLVRT-I---LGLE
-RLTRD-I---LGLY
FGLLRT-I---YMTQ
FGLLRT-I---FMTS
Construct a profile, and represent conservation in each position numerically
A 029001100003200
C 000070000000000
.
.
Y 002000080202000
Profile holds more information than a single sequence: use the profile to retrieve additional sequences
Search for similar sequences in a full sequence database using profile
New sequences in the multiple alignment
Construct new profile
A 027005101003200
C 000070000000000
.
.
Y 202000060202000
After several iterations of this procedure we have:
• Sequence information, including links to annotation
• Several sets of multiple alignments.
• Profiles, derived by us or by PSI-BLAST
• Threshold information (alignment statistics)
Consensus sequence
• A sequence where each position is defined by majority vote based on multiple sequence alignment. Use consensus sequence for database search.
PEAINYGRFTPFS I KSDVW
57
Flow chart of PSI-BLAST
MGLLTREIF--ILQQ
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ
A 029001100003200
C 000070000000000
.
.
Y 002000080202000
Take a sequence
Search for similar sequences in a full sequence database
Sequences are multiply aligned
Construct a profile, and represent conservation in each
position numerically
Profile holds more information than a single sequence:
use the profile to retrieve additional sequences
New iteration
A 029001100003200
C 000070000000000
.
.
Y 002000080202000
FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS
A 027005101003200
C 000070000000000
.
.
Y 202000060202000
Next New iteration……
Using profile to search for similar sequences in a full
sequence database
New sequences in the multiple alignments
Construct a new profile
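The "construct a profile" step can be illustrated with simple per-column counts, and a consensus sequence can be read off by majority vote. This is only a sketch of the idea: PSI-BLAST itself builds a weighted position-specific scoring matrix, which is not reproduced here. The function names are illustrative, and counting the gap symbol as an ordinary character is an assumption made for this example.

```python
from collections import Counter


def column_counts(msa):
    """Return one Counter per alignment column (a simple count profile)."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned rows must have the same length")
    return [Counter(row[c] for row in msa) for c in range(length)]


def consensus(msa):
    """Majority vote per column; the gap symbol '-' takes part in the vote."""
    return "".join(col.most_common(1)[0][0] for col in column_counts(msa))


msa = [
    "MGLLTREIF--ILQQ",
    "FGLGRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "FGLLRT-I---YMTQ",
]
profile = column_counts(msa)
print(profile[1]["G"], profile[3]["L"])  # 4 2
print(consensus(msa))                    # FGLLRT-I---YMTQ
```

Fully conserved columns (such as the G in the second position) stand out in the counts, which is the extra information a profile carries compared with a single sequence.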
58
PSI-BLAST
NCBI PSI-BLAST tutorial :
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
59
Summary of Today’s lecture
• Sequence alignment methods revisited:
– Pair-wise alignment
– Multiple sequence alignment
– BLAST
– PSI-BLAST
• Use of PSI-BLAST to probe protein function
84