F1: Multiple alignment and its meaning
There is alignment, and there is multiple alignment…
Multiple alignment - what it means (F1)
a Sequence: 1 string of nucleotide or amino acid characters
an Alignment (or Pair-Wise Alignment): 2 sequences side by side in a “good” way
a Multiple (Sequence) Alignment: 3 or more sequences aligned
Why multiple alignment?
Multiple alignments often say more
Of course! More data tends to say more than less data
Example:
Consider 3 amino acids matched in a pair-wise alignment
– Suppose 1st matches, 2nd & 3rd do not match
– What can we now understand?
But multiple alignment reveals that
1st is conserved across homologs
2nd varies greatly across homologs
3rd is conserved across only some homologs
What can we now understand?
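The three-position example above can be made concrete with a toy alignment. This is a minimal sketch, assuming a made-up alignment of five hypothetical homologs; the data and the name `column_conservation` are illustrative and do not come from the lecture or from Westhead et al.

```python
from collections import Counter

def column_conservation(alignment):
    """For each column, return (most common residue, fraction of rows showing it)."""
    n_rows = len(alignment)
    result = []
    for column in zip(*alignment):  # walk the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / n_rows))
    return result

# Hypothetical aligned homologs, three positions each (made-up data).
toy_alignment = [
    "HAS",
    "HKS",
    "HLS",
    "HET",
    "HWT",
]

for position, (residue, fraction) in enumerate(column_conservation(toy_alignment), start=1):
    print(position, residue, fraction)
```

Comparing only the first and fourth rows pair-wise shows one match and two mismatches. The per-column fractions over all five rows separate the fully conserved 1st position from the variable 2nd and the partly conserved 3rd.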
Inference from multiple alignments
See Fig. 1 p. 84, Westhead et al.
What is conserved across all homologs?
What might explain such conservation?
Inference from multiple alignments II
See Westhead Fig. 1 p. 84 again
It is a serine protease –
what is a protease? Why is it useful?
What is conserved across all homologs?
What might explain such conservation?
Preservation of _______
Preservation of _______
Preservation of protein function
Preservation of protein structure
Active sites (i.e. binding sites) are hard to mutate successfully (why?)
Structure-defining amino acids are hard to change successfully (why?)
How to do multiple alignment?
Answer 1: use existing software
“Clustal” is one
Answer 2: understand an algorithm
Clustal uses “progressive alignment”
• What is the “technician” answer?
• What is the “scientist” answer?
• Which is “better”?
Progressive alignment
Start with a set of sequences to align
Repeat
    Find the most closely related pair of sequences
        Use a fast, approximate method like FASTA
    Align them as well as possible
        Use a slow, high-quality method like Smith-Waterman
    Add to (or start) a phylogenetic tree
    Delete the pair from the set, add their merge
Until done
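The loop above can be sketched in a few dozen lines. This is a hedged illustration, not Clustal: neither FASTA nor Smith-Waterman is implemented, and a small global (Needleman-Wunsch style) dynamic program stands in for both the "find the closest pair" and the "align them" steps. The scoring constants, the use of the raw alignment score as the measure of relatedness, and all function names are assumptions made for the example.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed toy scores, not a real substitution matrix

def column_score(col_a, col_b):
    """Average pairwise score between two alignment columns (tuples of characters)."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" or b == "-":
                total += GAP
            else:
                total += MATCH if a == b else MISMATCH
    return total / (len(col_a) * len(col_b))

def align_profiles(rows_a, rows_b):
    """Globally align two groups of already-aligned rows; return (score, merged rows)."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    n, m = len(cols_a), len(cols_b)
    # score[i][j]: best score for the first i columns of A against the first j of B
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + GAP,
                score[i][j - 1] + GAP,
            )
    # Trace back from the bottom-right corner to recover the merged columns.
    gap_a, gap_b = ("-",) * len(rows_a), ("-",) * len(rows_b)
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1])):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return score[n][m], ["".join(chars) for chars in zip(*merged)]

def progressive_alignment(named_sequences):
    """Return (merge tree as nested tuples, dict of name -> aligned row)."""
    # Each cluster is (tree, names in row order, aligned rows).
    clusters = [(name, [name], [seq]) for name, seq in named_sequences.items()]
    while len(clusters) > 1:
        best = None
        # "Most closely related" is taken to mean the highest alignment score (an assumption).
        for x, y in combinations(range(len(clusters)), 2):
            pair_score, rows = align_profiles(clusters[x][2], clusters[y][2])
            if best is None or pair_score > best[0]:
                best = (pair_score, x, y, rows)
        _, x, y, rows = best
        merged = ((clusters[x][0], clusters[y][0]), clusters[x][1] + clusters[y][1], rows)
        # Delete the pair from the set, add their merge.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    tree, names, rows = clusters[0]
    return tree, dict(zip(names, rows))

# Made-up input sequences.
toy = {"seqA": "GATTACA", "seqB": "GATTTACA", "seqC": "CCGGTT"}
tree, aligned = progressive_alignment(toy)
print(tree)
for name, row in aligned.items():
    print(name, row)
```

The nested tuples record the order of the merges, which is the tree the slide refers to. Once a gap is placed in an early merge it is never revisited, which is one thing to keep in mind when asking what could go wrong.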
Progressive alignment
(single linkage variation)
Start with a set of sequences to align
Repeat
    Find the most closely related pair of sequences
        Use a fast, approximate method like FASTA
        How do we define “most closely related”?
    Align them as well as possible
        Use a slow, high-quality method like Smith-Waterman
    Add to (or start) a phylogenetic tree
    Delete the pair from the set, add their merge
Until done
Progressive alignment
(you can design your own variation)
Start with a set of sequences to align
Repeat
    Find the most closely related pair of sequences
        (How to define “most closely related”?)
        Use a fast, approximate method like FASTA
    Align them as well as possible
        Use a slow, high-quality method like Smith-Waterman
    Add to (or start) a phylogenetic tree
    Delete the pair from the set, add their merge
Until done
Progressive alignment II
Let’s do an example
Note the phylogenetic tree that results!
Name a few organisms and we’ll assume results
of applying the component algorithms as needed
Cool
What could go wrong, algorithmically?
MSA and databases (F2)
Obtaining multiple sequence alignments is computationally expensive
What if there are a dozen proteins to align?
So, we’d like to store results in a DB
…so we don’t have to reinvent the wheel every time
Consensus sequences are one way
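On the question of a dozen proteins: a rough sketch of why aligning everything at once gets expensive. It assumes the direct generalization of pair-wise dynamic programming, which needs one table cell per combination of positions across all sequences. The length of 300 residues and the function name are arbitrary choices for illustration.

```python
def exact_msa_table_cells(n_sequences, length):
    """Cells in the dynamic-programming table when all sequences are aligned at once."""
    return (length + 1) ** n_sequences

for n in (2, 3, 12):
    print(n, exact_msa_table_cells(n, 300))
```

The count grows exponentially with the number of sequences. That is the motivation for heuristics built from a series of pair-wise alignments, and for storing the results once they have been computed.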
Consensus Sequences
Given a multiple alignment
That is, a set of aligned strings (Fig. 1 p. 84,
Westhead et al.)
Store a summarizing “consensus sequence”
Use ‘x’ for places which lack consensus
Consensus: e.g. 30% or more of the sequences agree
What is the consensus sequence for Fig. 1?
Do consensus sequences lose information?
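A minimal sketch of the consensus rule described above, assuming a made-up alignment rather than the textbook figure. Taking only the single most common residue per column, and treating a column dominated by gaps as lacking consensus, are assumptions of this example; the 30% default follows the slide.

```python
from collections import Counter

def consensus(alignment, threshold=0.3):
    """Most common residue per column if its share meets the threshold, else 'x'."""
    n_rows = len(alignment)
    out = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        agrees = residue != "-" and count / n_rows >= threshold
        out.append(residue if agrees else "x")
    return "".join(out)

# Hypothetical aligned homologs (made-up data).
toy_alignment = ["HAS", "HKS", "HLS", "HET", "HWT"]

print(consensus(toy_alignment))        # 30% threshold, as on the slide
print(consensus(toy_alignment, 0.7))   # a stricter threshold, chosen for contrast
```

The output is a single string, so how strongly each position agreed, and which alternatives appeared, is no longer recoverable from it.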
Consensus Sequences (cont.)
Given a multiple alignment
That is, a set of aligned strings (fig 1 p. 84)
Store a single summarizing “sequence”
Use ‘x’ for places which lack consensus
What is the consensus sequence for fig. 1?
Consensus sequences throw away a lot of info, so a better solution is needed
PROSITE
A database of multiple alignments
– See the consensus textbook - wikipedia
Alignments are described more flexibly
than consensus sequences
Examples
(from p. 88, Westhead et al.)
[LIVM]-[ST]-A-[STAG]-H-C
…[GSTAPIMVQH]-x(2)-G-[DE]…
N-{P}-[ST]-{P}
Why the “{P}”s?
Is any information lost?
A limitation (info lost): no proportions associated with the variations
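The limitation can be illustrated with two made-up alignment columns. In this sketch the bracket notation is produced mechanically from the residues seen in a column, which is an assumption for illustration only; real PROSITE patterns are curated by hand. The function names are hypothetical.

```python
from collections import Counter

def column_proportions(column):
    """Share of each residue in one alignment column."""
    counts = Counter(column)
    return {residue: counts[residue] / len(column) for residue in sorted(counts)}

def as_residue_class(column):
    """Bracket notation listing which residues occur, with no proportions."""
    return "[" + "".join(sorted(set(column))) + "]"

# Two hypothetical columns from two different alignments (made-up data).
mostly_s = "SSSSSSSSST"
even_mix = "SSSSSTTTTT"

for column in (mostly_s, even_mix):
    print(as_residue_class(column), column_proportions(column))
```

Both columns reduce to the same bracket expression even though one is almost entirely S and the other is an even split.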
Consensus sequence notations
Consider the examples (from p. 88, Westhead et al.)
[LIVM]-[ST]-A-[STAG]-H-C
…[GSTAPIMVQH]-x(2)-G-[DE]…
N-{P}-[ST]-{P}
Write down a hypothetical PROSITE
sequence and let’s all decode it…
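One way to practice decoding is to translate a pattern into a regular expression. The sketch below covers only the parts of the notation used in the examples above (single residues, `[...]`, `{...}`, `x`, and repeat counts such as `x(2)` or `x(2,4)`); it is not a full PROSITE parser. The function name and the test sequences are made up.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (a subset of the notation) into a Python regex."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                        # x: any amino acid
        elif core.startswith("[") and core.endswith("]"):
            regex = core                       # [ST]: S or T, same as in a regex
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"    # {P}: anything except P
        elif len(core) == 1 and core.isalpha():
            regex = core                       # a single required residue
        else:
            raise ValueError(f"unsupported element: {element!r}")
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

print(prosite_to_regex("[LIVM]-[ST]-A-[STAG]-H-C"))
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(prosite_to_regex("x(2)-G-[DE]"))

# Made-up sequences: the second has P right after N, which {P} forbids.
glyco = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
print(bool(glyco.search("AANGSAA")), bool(glyco.search("AANPSAA")))
```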
PRINTS, BLOCKS
These are also multiple alignment databases
Consider a family of related proteins
Some regions are likely highly conserved
PRINTS calls these motifs
BLOCKS calls these, uh, blocks
No gaps allowed! (Prosite permits x(2,4))
A set of motifs for a family is a fingerprint
So gaps come into play to give fingerprints
PRINTS, BLOCKS
Consider a family of related proteins
A set of highly conserved regions (motifs) for a family is a fingerprint
So gaps come into play to give fingerprints
Why not allow gaps in a motif?
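The idea of ungapped motifs separated by stretches of any length can be sketched as a search for several motifs in a fixed order. This is only an illustration of the concept: the motifs are written as simple regular expressions, matching is exact rather than scored, and both the motifs and the sequences are invented. It is not how PRINTS or BLOCKS actually search.

```python
import re

def matches_fingerprint(sequence, motifs):
    """True if every motif occurs, in the given order and without overlap."""
    position = 0
    for motif in motifs:
        found = re.compile(motif).search(sequence, position)
        if found is None:
            return False
        position = found.end()  # the next motif must start after this one ends
    return True

# Hypothetical fingerprint of two ungapped motifs (made-up data).
fingerprint = ["G[DE]S", "HC"]

print(matches_fingerprint("AAGDSAAAAHCAA", fingerprint))
print(matches_fingerprint("AAHCAAAAGDSAA", fingerprint))
```

Each motif is matched without internal gaps, while the distance between motifs is free to vary from one family member to the next.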
Families of protein domains (F3)
Even PRINTS, BLOCKS, & PROSITE have too little information about families
Proteins tend to be built of domains
A domain is a chunk or “module” that is in many different proteins
The fact that proteins share a domain makes them related:
They are related by virtue of sharing domain x
Protein domain families
Many proteins share e.g. the PH domain
This domain’s sequence details vary
…but are lumped into the PH domain family
PH – Pleckstrin Homology
A subsequence from a protein matches a given domain better or worse
If it matches well enough the subsequence is in that domain family
(see e.g. fig 1 p. 93, Westhead et al.)
What is the score of one of the sequences? A random sequence?
How could we make a cladogram from the figure?
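One common way to give a subsequence a score against a family is a position-specific log-odds profile. The sketch below assumes an ungapped toy alignment, a uniform background over the 20 amino acids, and a pseudocount of 1; none of these choices come from the lecture or from the textbook figure, whose own scoring scheme is not reproduced here. All names and data are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_log_odds_profile(alignment, pseudocount=1.0):
    """One dict per column: residue -> log2(observed frequency / background frequency)."""
    background = 1 / len(AMINO_ACIDS)          # assumed uniform background
    denominator = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / denominator) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def profile_score(profile, subsequence):
    """Sum of per-position log-odds for an ungapped subsequence of the same length."""
    return sum(column[aa] for column, aa in zip(profile, subsequence))

# Hypothetical family alignment (made-up data).
family = ["HAS", "HKS", "HLS", "HET", "HWT"]
profile = build_log_odds_profile(family)

print(profile_score(profile, "HAS"))   # a family member
print(profile_score(profile, "PGP"))   # an unrelated made-up subsequence
```

A positive total means the subsequence looks more like the family than like background; a threshold on this score is one way to decide what "matches well enough" means.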
Reference: The Amino Acid Abbreviations
Ala   A   Alanine
Arg   R   Arginine
Asn   N   Asparagine
Asp   D   Aspartic acid (Aspartate)
Cys   C   Cysteine
Gln   Q   Glutamine
Glu   E   Glutamic acid (Glutamate)
Gly   G   Glycine
His   H   Histidine
Ile   I   Isoleucine
Leu   L   Leucine
Lys   K   Lysine
Met   M   Methionine
Phe   F   Phenylalanine
Pro   P   Proline
Ser   S   Serine
Thr   T   Threonine
Trp   W   Tryptophan
Tyr   Y   Tyrosine
Val   V   Valine
Asx   B   Aspartic acid or Asparagine
Glx   Z   Glutamine or Glutamic acid
Xaa   X   Any amino acid
TERM      termination codon
Let’s Review CATH and SCOP
…since we had rushed it
(lecture11notes.pdf)