There is alignment, and there is multiple alignment…
Multiple alignment - what it means (F1)

a Sequence: 1 string of nucleotide or amino acid characters
an Alignment (or Pair-Wise Alignment): 2 sequences side by side in a “good” way
a Multiple (Sequence) Alignment: 3 or more sequences aligned
Why multiple alignment?

Multiple alignments often say more
– Of course! More data tends to say more than less data
Example:
Consider 3 amino acids matched in a pair-wise alignment
– Suppose 1st matches, 2nd & 3rd do not match
– What can we now understand?
But multiple alignment reveals that
– 1st is conserved across homologs
– 2nd varies greatly across homologs
– 3rd is conserved across only some homologs
What can we now understand?
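The three-position example above can be sketched in a few lines of Python. This is only an illustration: the four aligned "homologs" are made up, and "conservation" is assumed here to mean the fraction of sequences that share the most common residue in a column.

```python
from collections import Counter

def column_conservation(alignment):
    """For each column, return the fraction of sequences sharing the most common residue."""
    n = len(alignment)
    fractions = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        fractions.append(most_common_count / n)
    return fractions

# Hypothetical toy alignment: four homologs, three aligned positions each.
toy_alignment = ["SAG", "SLG", "SKT", "SWT"]
for position, fraction in enumerate(column_conservation(toy_alignment), start=1):
    print(position, fraction)
```

In this toy data the 1st position is fully conserved, the 2nd varies in every sequence, and the 3rd is shared by only half of the homologs, which a single pair-wise alignment could not show.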
Inference from multiple alignments
See Fig. 1 p. 84, Westhead et al.
What is conserved across all homologs?
What might explain such conservation?
Inference from multiple alignments II
See Westhead Fig. 1 p. 84 again
It is a serine protease
– what is a protease? Why is it useful?
What is conserved across all homologs?
What might explain such conservation?
– Preservation of _______
– Preservation of _______
Inference from multiple alignments II
Preservation of protein function
– Active sites (i.e. binding sites) are hard to mutate successfully (why?)
Preservation of protein structure
– Structure-defining amino acids are hard to change successfully (why?)
How to do multiple alignment?
Answer 1: use existing software
– “Clustal” is one
Answer 2: understand an algorithm
– Clustal uses “progressive alignment”
• What is the “technician” answer?
• What is the “scientist” answer?
• Which is “better”?
Progressive alignment


Start with a set of sequences to align
Repeat
  Find the most closely related pair of sequences
    – Use a fast, approximate method like FASTA
  Align them as well as possible
    – Use a slow, high-quality method like Smith-Waterman
  Add to (or start) a phylogenetic tree
  Delete the pair from the set, add their merge
Until done
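The loop above can be sketched in Python. This is a toy under stated assumptions, not Clustal's actual method: `difflib.SequenceMatcher` stands in for a fast, approximate comparison like FASTA; a global dynamic-programming alignment (in the style of Needleman-Wunsch, not the local Smith-Waterman method) stands in for the slow, high-quality step; "most closely related" is assumed to mean the best score between any two members of two groups; and the scores (+1 match, -1 mismatch, -2 gap) and the organism sequences are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations

GAP = "-"

def similarity(cluster_a, cluster_b):
    """Fast, approximate relatedness: best ratio between any two members."""
    return max(
        SequenceMatcher(None, a.replace(GAP, ""), b.replace(GAP, "")).ratio()
        for a in cluster_a["rows"]
        for b in cluster_b["rows"]
    )

def column_score(col_a, col_b):
    """Average residue-vs-residue score between two alignment columns."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == GAP and y == GAP:
                total += 0
            elif x == y:
                total += 1
            else:
                total -= 1
    return total / (len(col_a) * len(col_b))

def align_profiles(rows_a, rows_b, gap=-2):
    """Globally align two blocks of already-aligned rows (slow but exact DP)."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    n, m = len(cols_a), len(cols_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner, padding with gap columns.
    out_a, out_b = [[] for _ in rows_a], [[] for _ in rows_b]
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1])
        ):
            take_a, take_b = True, True
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            take_a, take_b = True, False
        else:
            take_a, take_b = False, True
        col_a = cols_a[i - 1] if take_a else GAP * len(rows_a)
        col_b = cols_b[j - 1] if take_b else GAP * len(rows_b)
        for row, ch in zip(out_a, col_a):
            row.append(ch)
        for row, ch in zip(out_b, col_b):
            row.append(ch)
        i, j = i - int(take_a), j - int(take_b)
    return ["".join(reversed(row)) for row in out_a + out_b]

def progressive_alignment(named_sequences):
    """Repeatedly merge the closest pair; the nested tuples record the tree."""
    clusters = [{"rows": [seq], "tree": name} for name, seq in named_sequences.items()]
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: similarity(*pair))
        merged = {"rows": align_profiles(a["rows"], b["rows"]),
                  "tree": (a["tree"], b["tree"])}
        clusters = [c for c in clusters if c is not a and c is not b] + [merged]
    return clusters[0]

# Hypothetical toy sequences, named after organisms for illustration only.
toy = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACTTACG", "fly": "GGTTAC"}
result = progressive_alignment(toy)
print(result["tree"])
for row in result["rows"]:
    print(row)
```

The nested tuples in `result["tree"]` are the guide tree built as a side effect of merging, and the rows are listed in the same left-to-right order as its leaves. One design consequence worth noticing: a gap placed in an early merge is never revisited in later merges.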
Progressive alignment (single linkage variation)
Start with a set of sequences to align
Repeat
  Find the most closely related pair of sequences
    – Use a fast, approximate method like FASTA
    – How do we define “most closely related”?
  Align them as well as possible
    – Use a slow, high-quality method like Smith-Waterman
  Add to (or start) a phylogenetic tree
  Delete the pair from the set, add their merge
Until done
Progressive alignment (you can design your own variation)
Start with a set of sequences to align
Repeat
  Find the most closely related pair of sequences
  (How to define “most closely related”?)
    – Use a fast, approximate method like FASTA
  Align them as well as possible
    – Use a slow, high-quality method like Smith-Waterman
  Add to (or start) a phylogenetic tree
  Delete the pair from the set, add their merge
Until done
Progressive alignment II
Let’s do an example
– Name a few organisms and we’ll assume results of applying the component algorithms as needed
Note the phylogenetic tree that results!
– Cool
What could go wrong, algorithmically?
MSA and databases (F2)
Obtaining multiple sequence alignments is computationally expensive
– What if there are a dozen proteins to align?
So, we’d like to store results in a DB
– …so we don’t have to reinvent the wheel every time
Consensus sequences are one way
Consensus Sequences

Given a multiple alignment
– That is, a set of aligned strings (Fig. 1 p. 84, Westhead et al.)
Store a summarizing “consensus sequence”
– Use ‘x’ for places which lack consensus
– Consensus: e.g. 30% or more of the sequences agree
What is the consensus sequence for Fig. 1?
Do consensus sequences lose information?
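The consensus rule above can be sketched as follows. The aligned strings are made up (they are not the sequences of Fig. 1), and two details are assumptions: when several residues pass the threshold the most common one is kept, and a column dominated by gaps ('-') is written as 'x'.

```python
from collections import Counter

def consensus_sequence(alignment, threshold=0.3):
    """Summarize aligned strings: keep the most common residue per column, else 'x'."""
    consensus = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) >= threshold:
            consensus.append(residue)
        else:
            consensus.append("x")
    return "".join(consensus)

# Hypothetical toy alignment of four sequences.
toy_alignment = ["GDSGGP", "GDSGAP", "GDSCLP", "ADTGWK"]
print(consensus_sequence(toy_alignment))        # 30% rule from the slide
print(consensus_sequence(toy_alignment, 0.8))   # a stricter, assumed threshold
```

Comparing the two printed strings shows how much the summary depends on the threshold, and in both cases the minority residues in each column are no longer recoverable.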
Consensus Sequences (cont.)

Given a multiple alignment
– That is, a set of aligned strings (fig 1 p. 84)
Store a single summarizing “sequence”
– Use ‘x’ for places which lack consensus
What is the consensus sequence for fig. 1?
Consensus sequences throw away a lot of info, so a better solution is needed
PROSITE



A database of multiple alignments
– See the consensus textbook - wikipedia
Alignments are described more flexibly than consensus sequences
Examples (from p. 88, Westhead et al.)
  [LIVM]-[ST]-A-[STAG]-H-C
  …[GSTAPIMVQH]-x(2)-G-[DE]…
  N-{P}-[ST]-{P}
Why the “{P}”s?
Is any information lost?
PROSITE


Alignments are described more flexibly than consensus sequences
Examples (from p. 88, Westhead et al.)
  [LIVM]-[ST]-A-[STAG]-H-C
  …[GSTAPIMVQH]-x(2)-G-[DE]…
  N-{P}-[ST]-{P}
Why the “{P}”s?
A limitation (info lost): no proportions associated with the variations
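To make the limitation concrete, here is a small sketch of what "proportions" would look like if they were kept. The aligned strings are made up; the idea is simply to store, for each column, the fraction of sequences carrying each residue.

```python
from collections import Counter

def column_frequencies(alignment):
    """For each column, map every residue seen there to its proportion."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile

# Hypothetical toy alignment; its first column would be written [LIV] in a pattern.
toy_alignment = ["LSAS", "ISAT", "VTAA", "LSAG"]
for position, freqs in enumerate(column_frequencies(toy_alignment), start=1):
    print(position, freqs)
```

A pattern element such as [LIV] records only that L, I and V all occur at a position. The table printed here also records that, in this toy data, L is twice as frequent as either of the others.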
Consensus sequence notations

Consider the examples (from p. 88, Westhead et al.)
  [LIVM]-[ST]-A-[STAG]-H-C
  …[GSTAPIMVQH]-x(2)-G-[DE]…
  N-{P}-[ST]-{P}
Write down a hypothetical PROSITE sequence and let’s all decode it…
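One way to practice decoding is to translate a pattern into a regular expression. The sketch below handles only the notation used in the examples above, plus the x(2,4) range form: single letters, [..] for allowed residues, {..} for forbidden residues, x for any residue, and (n) or (n,m) repeat counts. The function name and the test strings are hypothetical, and other PROSITE syntax (such as anchors) is not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.strip(".").split("-"):
        repeat = ""
        # Split off a trailing repeat count such as (2) or (2,4).
        counted = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if counted:
            element, low, high = counted.group(1), counted.group(2), counted.group(3)
            repeat = "{%s,%s}" % (low, high) if high else "{%s}" % low
        if element == "x":
            core = "."                          # any amino acid
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # anything except these residues
        else:
            core = element                      # a single letter or an [..] set
        regex_parts.append(core + repeat)
    return "".join(regex_parts)

for pattern in ["[LIVM]-[ST]-A-[STAG]-H-C", "[GSTAPIMVQH]-x(2)-G-[DE]", "N-{P}-[ST]-{P}"]:
    print(pattern, "->", prosite_to_regex(pattern))

# Hypothetical test strings: the first contains N, non-P, S, non-P; the second has P after N.
print(bool(re.search(prosite_to_regex("N-{P}-[ST]-{P}"), "AANGSAA")))
print(bool(re.search(prosite_to_regex("N-{P}-[ST]-{P}"), "AANPSAA")))
```

The second example is used here without its leading and trailing "…", since those mark parts of the pattern that the slide leaves out.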
PRINTS, BLOCKS


These are also multiple alignment databases
Consider a family of related proteins
– Some regions are likely highly conserved
– PRINTS calls these motifs
– BLOCKS calls these, uh, blocks
– No gaps allowed! (Prosite permits x(2,4))
A set of motifs for a family is a fingerprint
– So gaps come into play to give fingerprints
PRINTS, BLOCKS


Consider a family of related proteins
A set of highly conserved regions (motifs) for a family is a fingerprint
– So gaps come into play to give fingerprints
Why not allow gaps in a motif?
Families of protein domains (F3)
Even PRINTS, BLOCKS, & PROSITE have too little information about families
Proteins tend to be built of domains
– A domain is a chunk or “module” that is in many different proteins
– The fact that proteins share a domain makes them related:
  they are related by virtue of sharing domain x
Protein domain families

Many proteins share e.g. the PH domain
– PH – Pleckstrin Homology
– This domain’s sequence details vary
– …but are lumped into the PH domain family
A subsequence from a protein matches a given domain better or worse
– If it matches well enough the subsequence is in that domain family
– (see e.g. fig 1 p. 93, Westhead et al.)
What is the score of one of the sequences? A random sequence?
How could we make a cladogram from the figure?
Reference: The Amino Acid Abbreviations

Ala   A   Alanine
Arg   R   Arginine
Asn   N   Asparagine
Asp   D   Aspartic acid (Aspartate)
Cys   C   Cysteine
Gln   Q   Glutamine
Glu   E   Glutamic acid (Glutamate)
Gly   G   Glycine
His   H   Histidine
Ile   I   Isoleucine
Leu   L   Leucine
Lys   K   Lysine
Met   M   Methionine
Phe   F   Phenylalanine
Pro   P   Proline
Ser   S   Serine
Thr   T   Threonine
Trp   W   Tryptophan
Tyr   Y   Tyrosine
Val   V   Valine
Asx   B   Aspartic acid or Asparagine
Glx   Z   Glutamine or Glutamic acid
Xaa   X   Any amino acid
TERM      termination codon
Let’s Review CATH and SCOP

…since we had rushed it
(lecture11notes.pdf)