Multiple Alignment - Cayetano Heredia University

Multiple Sequence Alignment
Pairwise sequence alignment is the most
fundamental operation of bioinformatics
• It is used to decide if two proteins (or genes)
are related structurally or functionally
• It is used to identify domains or motifs that
are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes
Pairwise alignment: protein sequences
can be more informative than DNA
• protein is more informative (20 vs 4 characters);
many amino acids share related biophysical properties
• codons are degenerate: changes in the third position
often do not alter the amino acid that is specified
• protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein,
and then used in pairwise alignments
Page 54
Pairwise alignment: protein sequences
can be more informative than DNA
• DNA can be translated into six potential proteins
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
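The six reading frames on this slide can be reproduced with a short Python sketch. The function names are illustrative and not from the slides; the standard genetic code is assumed, and `*` marks a stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific code table is needed for this example).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three forward frames followed by three reverse-strand frames."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


DNA = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(DNA):
    print(frame)
```

Each of the six protein strings can then be used as a query in a protein-level pairwise alignment.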
Definitions
Homology
Similarity attributed to descent from a common ancestor.
Identity
The extent to which two (nucleotide or amino acid)
sequences are invariant.
RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
               + K ++ + + + GTW++ MA + L + A
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
Page 44
Definitions: two types of homology
Orthologs
Homologous sequences in different species
that arose from a common ancestral gene
during speciation; may or may not be responsible
for a similar function.
Paralogs
Homologous sequences within a single species
that arose by gene duplication.
Page 43
Orthologs: members of a gene (protein) family in various organisms.
This tree shows RBP orthologs.
[Tree figure: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, human, mouse, rat, horse, pig, cow, rabbit; scale bar: 10 changes]
Page 43
Paralogs: members of a gene (protein) family within a species
[Tree figure: apolipoprotein D, retinol-binding protein 4, complement component 8, alpha-1 microglobulin/bikunin, prostaglandin D2 synthase, progestagen-associated endometrial protein, odorant-binding protein 2A, neutrophil gelatinase-associated lipocalin, lipocalin 1; scale bar: 10 changes]
Page 44
Pairwise alignment of retinol-binding protein
and β-lactoglobulin
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
. ||| | . |. . . | : .||||.:| :
1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
: | | | | :: | .| . || |: || |.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
|| ||. | :.|||| | . .|
94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
. | | | : || . | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Page 46
retinol-binding protein (NP_006735)
β-lactoglobulin (P02754)
Page 42
Definitions
Similarity
The extent to which nucleotide or protein sequences
are related. It is based upon identity plus
conservation.
Identity
The extent to which two sequences are invariant.
Conservation
Changes at a specific position of an amino acid or
(less commonly, DNA) sequence that preserve the
physico-chemical properties of the original residue.
Pairwise alignment of retinol-binding protein
and β-lactoglobulin: symbols
Identity (bar)
Somewhat similar (one dot)
Very similar (two dots)
Internal gap
Terminal gap
Page 46
Multiple sequence alignment of ‘orthologs’:
glyceraldehyde 3-phosphate dehydrogenases
fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
Page 49
Multiple sequence alignment of
human lipocalin ‘paralogs’
lipocalin 1                 ~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM
odorant-binding protein 2a  LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF
progestagen-assoc. endo.    TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR
apolipoprotein D            VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV
retinol-binding protein     VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF
neutrophil gelatinase-ass.  LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF
prostaglandin D2 synthase   VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL
alpha-1-microglobulin       VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW
complement component 8      PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...
Page 49
Calculation of an alignment score
PAM matrices:
Point-accepted mutations
PAM matrices are based on global alignments
of closely related proteins.
The PAM1 is the matrix calculated from comparisons
of sequences with no more than 1% divergence.
Other PAM matrices are extrapolated from PAM1.
All the PAM data come from closely related proteins
(>85% amino acid identity)
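Extrapolating from PAM1 amounts to matrix multiplication: the mutation probability matrix for PAM-n is the PAM1 probability matrix raised to the n-th power. The sketch below uses a made-up two-letter alphabet with a 1% chance of change per PAM unit. The numbers are hypothetical and are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(matrix)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    base = matrix
    while n:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical "PAM1" for a two-letter alphabet: each row sums to 1.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the diagonal is close to 0.5, so the toy matrix barely distinguishes the two letters. This is the sense in which a high-numbered PAM matrix is tolerant of mismatches. The scoring matrices used in practice are log-odds values derived from such probabilities.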
Comparing two proteins with a PAM1 matrix
gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix
is not forgiving of mismatches, and penalizes them
severely. Using this matrix you can find almost no match.
hsrbp,  136 CRLLNLDGTC
btlact,   3 CLLLALALTC
            * ** *  **
A PAM250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%
hsrbp,  26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact, 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                 *     ****  *        * *   *   *               **  *
hsrbp,  86 --CADMVGTFTDTEDPAKFKM
btlact, 80 GECAQKKIIAEKTKIPAVFKI
             **        *  ** **
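The identity and gap figures quoted for the PAM250 alignment can be recomputed from the two aligned rows. A minimal sketch, assuming identity and gap frequency are both taken over all alignment columns; the function name is illustrative.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Return (percent identity, percent gap columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(row_a, row_b))
    identical = sum(1 for a, b in columns if a == b and a != gap)
    gapped = sum(1 for a, b in columns if gap in (a, b))
    return 100 * identical / len(columns), 100 * gapped / len(columns)


# The two rows of the PAM250 alignment, with both display blocks joined.
HSRBP = ("RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV"
         "--CADMVGTFTDTEDPAKFKM")
BTLACT = ("QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN"
          "GECAQKKIIAEKTKIPAVFKI")

identity, gap_frequency = alignment_stats(HSRBP, BTLACT)
print(f"{identity:.1f}% identity in {len(HSRBP)} columns; "
      f"gap frequency {gap_frequency:.1f}%")
```

With 20 identical columns and 3 gapped columns out of 81, this gives the 24.7% identity and 3.7% gap frequency reported with the alignment.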
Page 60
BLOSUM Matrices
BLOSUM matrices are based on local alignments.
BLOSUM stands for blocks substitution matrix.
BLOSUM62 is a matrix calculated from comparisons of
sequences sharing no more than 62% identity (more similar
sequences are clustered and counted as one).
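Each entry of a BLOSUM-style matrix is a log-odds score: the observed frequency of an aligned residue pair compared with the frequency expected by chance. The sketch below assumes half-bit units (scale factor 2), as commonly used for BLOSUM62. The frequencies are invented for illustration and do not come from the BLOCKS database.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))


# Hypothetical frequencies: both residues occur with frequency 0.1.
print(log_odds_score(0.02, 0.1, 0.1))    # pair seen twice as often as chance
print(log_odds_score(0.005, 0.1, 0.1))   # pair seen half as often as chance
```

A positive score means the pair is seen more often than chance in related sequences, and a negative score means it is seen less often.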
Page 60
PAM matrices reflect different degrees of divergence
[Figure labels: rat versus mouse RBP; rat versus bacterial lipocalin; PAM250]
Ancestral sequence: ACCCTAC

Site  Ancestor  Sequence 1     Sequence 2     Type of change
1     A         A              A              no change
2     C         C              C --> A        single substitution
3     C         C              C --> A --> T  multiple substitutions
4     C         C --> G        C --> A        coincidental substitutions
5     T         T --> A        T --> A        parallel substitutions
6     A         A --> C --> T  A --> T        convergent substitutions
7     C         C              C --> T --> C  back substitution

Li (1997) p.70
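The point of this figure can be checked by counting: many more substitutions occurred than can be seen by comparing the two present-day sequences. A small sketch, assuming the first list of changes belongs to Sequence 1 and the second to Sequence 2; the variable names are illustrative.

```python
# Per-site histories read off the figure ("-->" separates successive states).
SEQ1_HISTORY = ["A", "C", "C", "C-->G", "T-->A", "A-->C-->T", "C"]
SEQ2_HISTORY = ["A", "C-->A", "C-->A-->T", "C-->A", "T-->A", "A-->T", "C-->T-->C"]


def substitutions(history):
    """Number of substitution events recorded in a per-site history."""
    return sum(len(site.split("-->")) - 1 for site in history)


def final_sequence(history):
    """Present-day sequence: the last state at every site."""
    return "".join(site.split("-->")[-1] for site in history)


actual = substitutions(SEQ1_HISTORY) + substitutions(SEQ2_HISTORY)
observed = sum(a != b for a, b in zip(final_sequence(SEQ1_HISTORY),
                                      final_sequence(SEQ2_HISTORY)))
print(final_sequence(SEQ1_HISTORY), final_sequence(SEQ2_HISTORY))
print(f"{actual} substitutions occurred, {observed} differences are observed")
```

Twelve substitutions leave only three visible differences, which is why observed divergence underestimates evolutionary distance.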
                                 homologous        non-homologous
                                 sequences         sequences
Sequences reported as related    True positives    False positives
Sequences reported as unrelated  False negatives   True negatives

Sensitivity: ability to find true positives
Specificity: ability to minimize false positives
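The two measures are usually computed from the four cells of the table above. A minimal sketch using the standard definitions; the counts are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truly homologous sequences that were reported as related."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-homologous sequences correctly reported as unrelated."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search: 100 true homologs and 900 unrelated sequences.
print(sensitivity(true_pos=90, false_neg=10))
print(specificity(true_neg=895, false_pos=5))
```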
Outline
-Why Do We Need Multiple Sequence Alignment ?
-The progressive Alignment Algorithm
-A possible Strategy…
-Potential Difficulties
Pre-requisite
-How Do Sequences Evolve?
-How can We COMPARE Sequences ?
-How can We ALIGN Sequences ?
What is A Multiple Sequence Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. ::: .: .. . : . . * . *: *
chite
wheat
trybr
mouse
AATAKQNYIRALQEYERNGGANKLKGEYNKAIAAYNKGESA
AEKDKERYKREM--------AKDDRIRYDNEMKSWEEQMAE
*
: .* . :
Structural Criteria:
Residues are arranged so that those playing a similar role end up in the
same column.
Evolution Criteria:
Residues are arranged so that those having the same ancestor end up in
the same column.
Phylogenic
Relation
Functional
Relation
How Can I Use A Multiple Sequence Alignment?
Extrapolation Beyond The Twilight Zone
Unknown sequence aligned with SwissProt entries (chite, wheat, trybr): Homology?
Less than 30% identity, BUT conserved where it MATTERS
Extrapolation
Prosite Patterns
P-K-R-[PA]-x(1)-[ST]…
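A Prosite pattern maps directly onto a regular expression. The sketch below converts only the part of the pattern that is visible on the slide (the rest is elided there and is not guessed here) and searches the four aligned sequences after removing gaps. The converter handles single residues, `x`, `[..]`, `{..}` and repeat counts; it is an illustration, not a complete Prosite parser.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a dash-separated Prosite pattern into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {AB} = any residue except A or B
            core = "[^" + core[1:-1] + "]"
        if low:                              # (n) or (n,m) repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(aligned_row, pattern):
    """Return the first match in the ungapped sequence, or None."""
    match = re.search(prosite_to_regex(pattern), aligned_row.replace("-", ""))
    return match.group(0) if match else None


VISIBLE_PATTERN = "P-K-R-[PA]-x(1)-[ST]"   # only the prefix shown on the slide
ROWS = {
    "chite": "---ADKPKRPLSAYMLWLNS",
    "wheat": "--DPNKPKRAPSAFFVFMGE",
    "trybr": "KKDSNAPKRAMTSFMFFSSD",
    "mouse": "-----KPKRPRSAYNIYVSE",
}
for name, row in ROWS.items():
    print(name, find_motif(row, VISIBLE_PATTERN))
```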
Profiles And HMMs
-More Sensitive
-More Specific
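A profile stores, for every alignment column, how often each residue occurs, so it keeps information that a pattern discards. The sketch below builds plain column frequencies from the first 20 columns of the chite/wheat/trybr/mouse alignment. Real profile and HMM tools such as HMMER add sequence weighting, pseudocounts and log-odds scoring, which are omitted here.

```python
from collections import Counter

ROWS = [
    "---ADKPKRPLSAYMLWLNS",   # chite
    "--DPNKPKRAPSAFFVFMGE",   # wheat
    "KKDSNAPKRAMTSFMFFSSD",   # trybr
    "-----KPKRPRSAYNIYVSE",   # mouse
]


def column_profile(rows, gap="-"):
    """Per-column residue frequencies, ignoring gap characters."""
    profile = []
    for column in zip(*rows):
        residues = [r for r in column if r != gap]
        counts = Counter(residues)
        profile.append({aa: n / len(residues) for aa, n in counts.items()})
    return profile


profile = column_profile(ROWS)
for position, frequencies in enumerate(profile[5:10], start=6):
    print(position, frequencies)
```

Columns 7 to 9 (P, K, R) are fully conserved, while column 10 is split evenly between P and A. A pattern would write that column as `[PA]`, but the profile also records the proportions.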
Phylogeny
-Evolution
-Paralogy/Orthology
Structure Prediction
Column Constraint
Evolution Constraint
Structure Constraint
PsiPred or PHD for secondary structure prediction: 75% accurate.
Threading is improving but is not yet as good.
Automatic Multiple
Sequence Alignment methods
are not always perfect…
You know better…
With your big BRAIN
Why Is It Difficult To Compute A multiple Sequence
Alignment?
A CROSSROAD PROBLEM
BIOLOGY:
What is A Good Alignment
COMPUTATION:
What is THE Good Alignment
The Biological Problem.
How to Evaluate an Alignment
-A nice set of Sequences
-Substitution Matrix (Blosum)
-Gap Penalties.
-An Evaluation Function
Column: A A A C C
Sums of Pairs: Cost=6
Over-estimation of the Substitutions
Easy to compute
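The sums-of-pairs cost of a column is the sum of the costs of every pair of residues in it. The sketch below assumes the column on the slide holds three A and two C residues and that each mismatched pair costs 1 and each matched pair costs 0. Those unit costs are an assumption chosen because they reproduce the Cost=6 on the slide.

```python
from itertools import combinations


def sum_of_pairs_cost(column, mismatch_cost=1):
    """Sum the cost over all unordered pairs of residues in one column."""
    return sum(mismatch_cost for a, b in combinations(column, 2) if a != b)


def alignment_cost(rows, mismatch_cost=1):
    """Total sums-of-pairs cost of an alignment given as equal-length rows."""
    return sum(sum_of_pairs_cost(column, mismatch_cost) for column in zip(*rows))


print(sum_of_pairs_cost("AAACC"))   # 3 x 2 = 6 mismatched A/C pairs
```

A single A to C substitution on a tree could explain this column, yet the measure counts six mismatches. This is the over-estimation the slide refers to.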
The COMPUTATIONAL Problem.
Producing the Alignment
-A nice set of Sequences
-Substitution Matrix (Blosum)
-Gap Penalties.
-An Evaluation Function
-An Alignment Algorithm
Will It Work
?
GLOBAL Alignment
HOW CAN I ALIGN MANY SEQUENCES
2 Globins => 1 min
3 Globins => 2 hours
4 Globins => 10 days
5 Globins => 3 years
6 Globins => 300 years
7 Globins => 30,000 years
8 Globins => 3 million years
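The growth above follows from the size of the dynamic programming lattice: an exact alignment of N sequences of length L fills about (L+1)^N cells, and each cell has up to 2^N - 1 predecessor cells. A sketch of that arithmetic, assuming a length of 150 residues as a rough stand-in for a globin; that length is an assumption, not a figure from the slides.

```python
def dp_cells(num_sequences, length):
    """Cells in the N-dimensional dynamic programming lattice."""
    return (length + 1) ** num_sequences


def predecessors_per_cell(num_sequences):
    """Each cell can be reached from up to 2^N - 1 neighbouring cells."""
    return 2 ** num_sequences - 1


LENGTH = 150  # assumed sequence length
for n in range(2, 9):
    work = dp_cells(n, LENGTH) * predecessors_per_cell(n)
    print(f"{n} sequences: about {work:.2e} cell updates")
```

Every added sequence multiplies the work by a few hundred, which is why exact methods are limited to a handful of short sequences.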
The Progressive
Multiple Alignment
Algorithm
(Clustal W)
Making An Alignment
Any Exact Method would be TOO SLOW
We will use a Heuristic Algorithm.
Progressive Alignment Algorithm is the most Popular
-ClustalW
-Greedy Heuristic (No Guarantee).
-Fast
Progressive Alignment
Feng and Doolittle, 1987
Clustering
Progressive Alignment
Dynamic Programming Using A Substitution Matrix
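The pairwise step used at each stage of a progressive alignment is global dynamic programming. Below is a minimal Needleman-Wunsch sketch with a flat match/mismatch score and a linear gap penalty. The scoring values are placeholders; ClustalW itself uses a substitution matrix and separate gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: best of diagonal (align), up (gap in b), left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair_score(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, out_a, out_b = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```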
Progressive Alignment
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS:
•Substitution Matrix.
•Penalties (Gop, Gep).
•Sequence Weight.
•Tree making Algorithm.
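The order in which sequences are merged comes from a guide tree built by clustering pairwise distances. The sketch below uses simple average-linkage clustering on a made-up distance table. The sequence names and distances are hypothetical, and ClustalW's own tree-building step is not reproduced here.

```python
from itertools import combinations


def average_distance(members_a, members_b, dist):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(a, b) for a in members_a for b in members_b]
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


def guide_tree(names, dist):
    """Greedy average-linkage clustering; returns the tree as nested tuples."""
    clusters = [(name, (name,)) for name in names]   # (subtree, member names)
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: average_distance(clusters[pair[0]][1],
                                                     clusters[pair[1]][1], dist))
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical distances (for example 1 - fractional identity).
TOY_DISTANCES = {
    frozenset(("A", "B")): 0.1, frozenset(("A", "C")): 0.5,
    frozenset(("B", "C")): 0.5, frozenset(("A", "D")): 0.9,
    frozenset(("B", "D")): 0.9, frozenset(("C", "D")): 0.8,
}
print(guide_tree(["A", "B", "C", "D"], TOY_DISTANCES))
```

With this tree A and B are aligned first, then C is added, then D. An error made in an early merge is never revisited, which is the greedy weakness noted above.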
Progressive Alignment
When Does It Work
Works Well When Phylogeny is Dense
No outlier Sequence.
Image: River Crossing
Progressive Alignment
When Doesn’t It Work
CLUSTALW (Score=20, Gop=-1, Gep=0, M=1)
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
CORRECT (Score=24)
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST ---- CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
Intermediate pairwise alignments:
GARFIELD THE LAST FAT CAT
GARFIELD THE FAST CAT ---
GARFIELD THE VERY FAST CAT
-------- THE ---- FAT CAT
Building the Right
Multiple Sequence
Alignment.
Recognizing The Right Sequences When you Meet
Them…
Gathering Sequences: BLAST
Common Mistake:
Sequences Too Closely Related
PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**
-IDENTICAL SEQUENCES BRING NO INFORMATION FOR THE MULTIPLE
SEQUENCE ALIGNMENT
-MULTIPLE SEQUENCE ALIGNMENTS THRIVE ON DIVERSITY…
Selecting Diverse Sequences (Opus II)
Respect Information!
PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT   ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
           : :*. .*::::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT   IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
This alignment is not informative about the relation
between TPCC_MOUSE and the rest of the sequences.
-A better Spread of the
Sequences is needed
Selecting Diverse Sequences (Opus II)
PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE
           : *: .: . .* .:*. * ** *: * : * :* * **:**

PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA
           :** .*:.* .* *: ** :: .* **** **::** **
-A REASONABLE Model Now Exists.
-Going Further:Remote Homologues.
Aligning Remote Homologues
PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG   -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
           : ::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
           : . .: .. . *: * : * :* : .*:*: :** .

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
           :: .. :: : :: .* :.** *. :** ::
Some
Guidelines
…
Do Not Use Too Many Sequences…
Reading Your Alignment
WHAT MAKES A GOOD ALIGNMENT…
-THE MORE DIVERGENT THE SEQUENCES, THE BETTER
-THE FEWER INDELS, THE BETTER
-NICE UNGAPPED BLOCKS SEPARATED WITH INDELS
-DIFFERENT CLASSES OF RESIDUES WITHIN A BLOCK:
•Completely Conserved
•Conserved For Size and Hydropathy
•Conserved For Size or Hydropathy
-THE ULTIMATE EVALUATION IS A MATTER OF PERSONAL JUDGEMENT
AND KNOWLEDGE.
Potential Difficulties
DO NOT OVERTUNE!!!
DO NOT PLAY WITH PARAMETERS IF YOU KNOW THE
ALIGNMENT YOU WANT: MAKE IT YOURSELF!
chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
      ***. :*: .: .. . : . . * . *: *
TUNING or NOT TUNING!!!
-PARAMETERS TO TUNE USUALLY INCLUDE:
•GOP/ GEP
•MATRIX
•SENSITIVITY Vs SPEED
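GOP and GEP are the gap opening and gap extension penalties of an affine gap model. The sketch below assumes the convention cost = GOP + GEP × length. Some programs charge the extension only from the second gap position onward, so the exact formula depends on the tool. The penalty values are hypothetical.

```python
def affine_gap_cost(length, gop, gep):
    """Cost of one gap of the given length under an affine gap model."""
    if length <= 0:
        return 0.0
    return gop + gep * length


GOP, GEP = 10.0, 0.5   # hypothetical penalties
one_long_gap = affine_gap_cost(4, GOP, GEP)
four_short_gaps = 4 * affine_gap_cost(1, GOP, GEP)
print(one_long_gap, four_short_gaps)
```

Because opening a gap is expensive and extending it is cheap, one long indel is preferred over several scattered single-residue gaps. Raising GOP pushes the aligner toward the ungapped blocks described earlier.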
Substitution Matrices
(Etzold et al. 1993)
Gonnet    61.7 %
Blosum50  59.7 %
Pam250    59.2 %
[Plot axes: GOP, GEP]
-MOST METHODS ARE TUNED FOR WORKING WELL ON AVERAGE
-PARAMETER BEHAVIOUR DOES NOT NECESSARILY FOLLOW THE
THEORY (e.g. Substitution Matrices).
-A GOOD ALIGNMENT IS USUALLY ROBUST (i.e. Changes little).
-TUNE IF YOU WANT TO CONVINCE YOURSELF.
KEEP A BIOLOGICAL PERSPECTIVE
DIFFERENT PARAMETERS
chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS
      * *** .:: ::... : * . . . : * . *: *
WRONG ALIGNMENT !!!
REPEATS
THERE IS A PROBLEM WHEN TWO SEQUENCES DO NOT CONTAIN THE
SAME NUMBER OF REPEATS
IT IS THEN BETTER TO MANUALLY EXTRACT THE REPEATS AND TO
ALIGN THEM. INDIVIDUAL REPEATS CAN BE RECOGNIZED USING DOTTER
Naming Your Sequences The Right Way
Choosing the right
Method
Simultaneous Alignments : MSA
1) Set bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment within the hyperspace
-Few Small Closely Related Sequences.
-Memory and CPU hungry
-Do Well When They Can Run.
Dialign II
1) Identify the best chain of segments on each pair of sequences. Assign a P-value to each segment pair.
2) Re-evaluate each segment pair according to its consistency with the others.
3) Assemble the alignment according to the segment pairs.
-May Align Too Few Residues
-No Gap Penalty
-Does well with ESTs
Iterative Methods
7.16.1 Progressive
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
Mixing Local and Global Alignments
Local Alignment
Global Alignment
Extension
Multiple Sequence Alignment
What Is BaliBase / Which Method?
Source: BaliBase, Thompson et al, NAR, 1999
PROBLEM                        Strategy
Even Phylogenetic Spread       ClustalW, T-Coffee, MSA, DCA
One Outlier Sequence           T-Coffee
Two Distantly Related Groups   PrrP, T-Coffee
Long Internal Indel            Dialign, T-Coffee
Long Terminal Indel            Dialign, T-Coffee
Methods / Situations
1-Carrillo and Lipman:
-MSA, DCA.
-Few Small Closely Related Sequences.
-Do Well When They Can Run.
2-Segment Based:
-DIALIGN, MACAW.
-May Align Too Few Residues
-Good For Long Indels
3-Iterative:
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and Sensitive
Conclusion
Multiple Alignment
-The BEST alignment Method:
Your Brain
The Right Data
-The Best Evaluation Procedure:
Experimental Data (SwissProt)
-Choosing The Sequences Well is Important
-Beware of repeated elements
Editing Multiple
Alignments



There are a variety of tools that can be
used to modify a multiple alignment.
These programs can be very useful in
formatting and annotating an alignment
for publication.
An editor can also be used to make
modifications by hand to improve
biologically significant regions in a
multiple alignment created by one of the
automated alignment programs.
BioEdit
Editors on the Web
Check out CINEMA (Colour INteractive Editor for Multiple Alignments)
-It is an editor created completely in Java (old browsers beware)
-It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module
http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.1
Addresses
Some URLs
EMBL-EBI
http://www.ebi.ac.uk/clustalw/
BCM Search Launcher: Multiple Alignment
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)
http://www.ibc.wustl.edu/service/msa/