Sequence similarity search II
Searching for remote homologies
(How) can we decide if two sequences really have the same function?
Homologous proteins = come from a common origin => have the same function
Last Universal Common Ancestor
Homology
Rule of thumb:
-Proteins are homologous if 25%-35% identical
-DNA sequences are homologous if 70% identical
Can we always go by the rules?
Alignment between the worm and human arrestin
VERY SIGNIFICANT , NOT HIGH IDENTITY
Assessing whether proteins are functionally homologous
High levels of the protein RBP4 (retinol-binding protein 4) were found to be correlated with childhood obesity
RBP4 = carrier of vitamin A in the blood
RBP4 (retinol binding) and PAEP (pregnancy protein): E value = 0.49; identity = 24%
Are they functionally homologous?
The lipocalin protein family (each dot is a protein)
[Figure labels: RBP4, retinol-binding protein, PAEP, apolipoprotein D, odorant-binding protein]
Are PAEP and RBP4 functionally homologous?
They belong to the same protein family = have a common ancestor
Their functions have probably diverged
BUT …
Is identity the right way to score?
The 20 Amino Acids
Sequence alignment based on amino acid (AA) similarity
TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
||    +    |||| +|| |||  |   +|         |     |    |       |
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
| |    | +| | | +|+ ||  || |+                + |  |  ||  | +
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
       ++|||    |  + ++ |    |  |               + ||++|+|
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

| = identity 45/178 = 25%
+ = similarity 63/178 = 35%
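A minimal sketch of how the two percentages under the alignment can be computed from a pair of aligned rows. The function name `alignment_stats`, the toy rows, and the one-entry `similar_pairs` set are illustrative assumptions and are not taken from the slides. Similar columns are counted together with identical ones, which is how the 63/178 figure appears to be counted.

```python
def alignment_stats(row_a, row_b, similar_pairs=frozenset()):
    """Return (identical, identical_or_similar, total_columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = 0
    similar = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
        elif frozenset((a, b)) in similar_pairs:
            similar += 1
    return identical, identical + similar, len(row_a)


# Hypothetical toy data: D and E are treated as the only similar pair.
toy_similar = frozenset({frozenset("DE")})
identical, positives, total = alignment_stats("MGYDE-K", "MGYEEAK", toy_similar)
print(f"identity   {identical}/{total} = {100 * identical / total:.0f}%")
print(f"similarity {positives}/{total} = {100 * positives / total:.0f}%")
```

Which pairs count as similar depends on the scoring matrix in use, so in practice `similar_pairs` would be derived from the positive off-diagonal entries of that matrix.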
Scoring system for amino acid mismatches
How do we define the scoring system?
Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other
Protein X E. coli   ...M  G  Y  D  E
Protein X yeast     ...M  G  Y  D  E
Protein X worm      ...M  G  Y  E  E
Protein X Chicken   ...M  G  Y  D  E
Protein X Mouse     ...M  G  Y  Q  E
Protein X Pig       ...M  G  Y  D  E
Protein X Monkey    ...M  G  Y  E  E
Protein X Human     ...M  G  Y  E  E
In the fourth column (D/E), E & D are found in 7/8 sequences
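The counting behind the 7/8 observation can be sketched in a few lines. The column below is the D/E column of the Protein X alignment read from E. coli down to human; the helper name `pair_counts` is an assumption made for illustration. Counting every pair of residues in a column is the raw material from which substitution scores are later derived.

```python
from collections import Counter
from itertools import combinations

# The D/E alignment column, read top to bottom (E. coli ... human).
column = ["D", "D", "E", "D", "Q", "D", "E", "E"]


def pair_counts(col):
    """Count unordered residue pairs observed within one alignment column."""
    pairs = Counter()
    for a, b in combinations(col, 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


counts = Counter(column)
de_fraction = (counts["D"] + counts["E"]) / len(column)
print(counts["D"], counts["E"], counts["Q"], de_fraction)  # 4 3 1 0.875
print(pair_counts(column)[("D", "E")])                     # 12 D/E pairs
```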
[Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E)]
PAM - Point Accepted Mutations
• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• “Accepted” mutations – do not negatively affect a protein’s fitness
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)
Margaret Dayhoff
1925-1983
Basic matrix (example)
normalized probabilities multiplied by 10000
   Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901
Log Odds Matrices
– Calculate odds ratio for each substitution
• Divide the frequency of the substitution by the product of the frequencies of the two amino acids: f(aa1>aa2) / (f(aa1)*f(aa2))
– Take the average of the ratios for converting A to B and converting B to A
– Convert the ratio to log10 and multiply by 10
– Result: Symmetric log-odds matrix
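The steps above can be followed literally in a short sketch. The function name `log_odds_score` and every frequency passed to it are made-up illustration values, not Dayhoff's data, and rounding to the nearest integer is an assumption about how the final matrix entries are presented.

```python
import math


def log_odds_score(f_ab, f_ba, f_a, f_b):
    """Symmetric log-odds score for the amino acid pair (a, b).

    f_ab, f_ba: observed frequencies of the a->b and b->a substitutions
    f_a, f_b:   background frequencies of the two amino acids
    """
    ratio_ab = f_ab / (f_a * f_b)          # odds ratio for a -> b
    ratio_ba = f_ba / (f_a * f_b)          # odds ratio for b -> a
    mean_ratio = (ratio_ab + ratio_ba) / 2  # averaging makes the score symmetric
    return round(10 * math.log10(mean_ratio))


# Hypothetical frequencies: a pair seen twice as often as chance predicts
# gets a positive score; one seen five times less often gets a negative score.
print(log_odds_score(0.010, 0.010, 0.10, 0.05))  # 3
print(log_odds_score(0.001, 0.001, 0.10, 0.05))  # -7
```

A score of 0 therefore means the pair is observed exactly as often as expected by chance.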
PAM250 log-odds matrix
Entry (i,j): the score of aligning amino acid i against amino acid j.
Similar amino acids have a high score
Entry (i,i) is greater than any entry (i,j), j≠i.
The entries on the diagonal are not always identical
The different PAM Matrices
• There are different PAM matrices (PAM1 - PAM250). They are derived from one another: PAM N is obtained by multiplying the PAM1 matrix by itself N times
• Low PAM matrices are suitable for strong local similarities (Arrestin worm vs Arrestin Human)
• High PAM matrices are suitable for weak similarities (RBP4 and PAEP)
– PAM120 recommended for general use
– PAM60 for close relations
– PAM250 for distant relations
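A rough sketch of the repeated multiplication, using a made-up two-letter alphabet instead of the real 20x20 PAM1 matrix. The names `mat_mul`, `mat_power`, and `toy_pam1` are assumptions for illustration, and the toy matrix is written so that each row sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply matrix m by itself n times (n = 0 gives the identity)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1" for a two-letter alphabet: about 1-2% change per step.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

After many steps the rows become nearly identical, which reflects why high PAM numbers model distant relationships: the identity of the starting residue carries little information any more.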
BLOSUM = BLOcks SUbstitution Matrix
Steven and Jorja G. Henikoff (1992)
• Based on BLOCKS database
Families of proteins with similar function
• Ungapped local alignment
– Each block is generated from a local alignment
– Counts amino acids observed in same column
– Symmetrical model of substitution
AABCDA… BBCDA
DABCDA. A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA… BBCCC
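Counting amino acids observed in the same column of an ungapped block can be sketched as follows. The three-row block and the name `count_column_pairs` are hypothetical. Pairs are stored in sorted order so that A/B and B/A fall into the same cell, which is what makes the substitution model symmetrical.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(block):
    """Count unordered residue pairs, column by column, in an ungapped block."""
    counts = Counter()
    for column in zip(*block):              # iterate over alignment columns
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical ungapped block of three sequences, three columns wide.
toy_block = ["AAB",
             "ABB",
             "AAC"]

pairs = count_column_pairs(toy_block)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

These raw pair counts would then be turned into frequencies and log-odds scores; that later step is not shown here.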
BLOSUM Matrices
BLOSUM 62
• Different BLOSUMn matrices are calculated independently from different BLOCKS
• BLOSUMn is based on blocks in which sequences more than n percent identical are clustered together, so the counted sequences are at most n percent identical.
Selecting a BLOSUM Matrix
• For BLOSUMn, a higher n is suitable for sequences which are more similar
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
QUIZ
• The score for ARG-LYS in BLOSUM 45 is 2; what will the score be for the same pair in BLOSUM 80?
A. 2
B. 3
C. 1
D. -1
Remote homologues
• Sometimes BLAST isn’t enough.
• For example, when searching for homologs in large and diverse protein families, and/or when looking for homology between poorly conserved proteins in very distant species (E. coli vs human)
What do we do?
PSI-BLAST
General Idea:
- Builds specialized scoring matrices which are specific to the family of interest
- Generates a position-specific scoring matrix
PSI-BLAST
STEPS:
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a specialized multiple sequence alignment
[3] Creates a “profile” of the specialized alignment, scoring each position independently: a position-specific scoring matrix (PSSM)
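Step [3] can be illustrated with a deliberately simplified profile builder. This is a sketch under stated assumptions, not PSI-BLAST's actual procedure: it uses a uniform background, a flat pseudocount of 1 per residue, no sequence weighting, and scores in half-bit units. The function name `build_pssm` and the four-sequence alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Build one row of log-odds scores per alignment column.

    Assumptions: uniform background frequencies, flat pseudocounts,
    and score = round(2 * log2(observed / background)).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            row[aa] = round(2 * math.log2(freq / background))
        pssm.append(row)
    return pssm


# Hypothetical multiple sequence alignment (one string per sequence).
toy_alignment = ["WKD", "WRE", "WKD", "WKT"]
toy_pssm = build_pssm(toy_alignment)
for position, row in enumerate(toy_pssm, start=1):
    best = max(row, key=row.get)
    print(position, best, row[best])
```

Because every column gets its own row of scores, a conserved W in column 1 is rewarded more strongly than the mixed K/R column 2, which a single family-wide matrix cannot express.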
R,I,K   C   D,E,T   K,R,T   N,L,Y,G
       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
 1 M  -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
 2 K  -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
 3 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 4 V   0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
 5 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 6 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 7 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 8 L  -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
 9 L  -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
10 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
11 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
12 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
13 W  -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
14 A   3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
15 A   2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
16 A   4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
...
37 S   2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
38 G   0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
39 T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
40 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
41 Y  -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
42 A   4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
PSI-BLAST (continued)
[4] The PSSM is used as a query against the database
[5] PSI-BLAST estimates statistical significance (E values)
[6] Repeat steps [4] and [5] iteratively, typically 3-5 times.
At each new search, a new profile is used as the query.
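Step [4], using the PSSM as the query, can be sketched as sliding the profile along a database sequence and summing one score per position. The five rows below are positions 1 to 5 (M, K, W, V, W) of the PSSM shown earlier; the function names and the database sequence `GGMKWVWGG` are made up for illustration. Real PSI-BLAST performs a gapped, heuristic search and attaches E values, none of which is modelled here.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Positions 1-5 of the PSSM: (query residue, scores in COLUMNS order).
PSSM_ROWS = [
    ("M", [-1, -2, -2, -3, -2, -1, -2, -3, -2, 1, 2, -2, 6, 0, -3, -2, -1, -2, -1, 1]),
    ("K", [-1, 1, 0, 1, -4, 2, 4, -2, 0, -3, -3, 3, -2, -4, -1, 0, -1, -3, -2, -3]),
    ("W", [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3]),
    ("V", [0, -3, -3, -4, -1, -3, -3, -4, -4, 3, 1, -3, 1, -1, -3, -2, 0, -3, -1, 4]),
    ("W", [-3, -3, -4, -5, -3, -2, -3, -3, -3, -3, -2, -3, -2, 1, -4, -3, -3, 12, 2, -3]),
]


def score_window(window):
    """Sum the position-specific scores of an ungapped window."""
    return sum(scores[COLUMNS.index(residue)]
               for (_, scores), residue in zip(PSSM_ROWS, window))


def best_window(sequence):
    """Return (best score, start index) over all ungapped placements."""
    width = len(PSSM_ROWS)
    return max((score_window(sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


print(score_window("MKWVW"))     # 37: the query segment itself
print(score_window("MEWVW"))     # 38: E outscores K at position 2 in this profile
print(best_window("GGMKWVWGG"))  # (37, 2)
```

The second line shows what a profile adds over a fixed matrix: at position 2 the family prefers E to the query's own K, so a hit carrying E scores higher than an exact copy of the query.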
Searching for remote homology using PSI-BLAST
Another member of the lipocalin family is β-lactoglobulin
[Figure: RBP4 and β-lactoglobulin]
PSI-BLAST alignment of RBP4 (retinol binding protein) and β-lactoglobulin: iteration 1
Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27  VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
           V+ENFD  ++ G WY + +K P      + I A +S+ E G +    K         ++
Sbjct: 33  VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87  ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
            D  GT          ++  +PAK +++++ +          +WI+ TDY+ YA+ YSC
Sbjct: 83  PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163
               L ++D      + ++  R+P  LPPE
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158
PSI-BLAST alignment of RBP4 and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)
Query: 4   VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
           V  L+ LA  A      +  +F         V+ENFD  ++ G WY + +K P      +
Sbjct: 2   VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60

Query: 56  NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
            I A +S+ E G +    K       + D   + V      ++  +PAK +++++ +
Sbjct: 61  CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112

Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
                  +WI+ TDY+ YA+ YSC     L ++D      + ++  R+P  LPPE
Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159
PSI-BLAST alignment of RBP4 and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3   WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
            V  L+ LA  A              +  S  V+ENFD  ++ G WY + K
Sbjct: 1   MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55  DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
           + I A +S+ E G +    K          V  +     ++  +PAK +++++ +
Sbjct: 60  NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
                +WI+ TDY+ YA+ YSC            + ++  R+P  LPPE
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
The lipocalin protein family (each dot is a protein)
[Figure labels: β-lactoglobulin, retinol-binding protein, RBP4, apolipoprotein D, odorant-binding protein]
The universe of lipocalins (each dot is a protein)
[Figure labels: retinol-binding protein, apolipoprotein D, odorant-binding protein]
Scoring matrices let you focus on the big (or small) picture
[Figure panels around retinol-binding protein: PAM250, PAM30, Blosum80, Blosum45]
PSI-BLAST is more powerful than changing scoring functions
VERY good for remote homologies