Introduction to bioinformatics
lecture 8
Deriving amino acid exchange matrices (II)
and Multiple sequence alignment (I)
Summary of Dayhoff’s PAM matrices

Derived from global alignments of closely related sequences.

Matrices for greater evolutionary distances are extrapolated
from those for lesser ones.

The number with the matrix (PAM40, PAM100) refers to the
evolutionary distance; greater numbers are greater distances.

Several later groups have attempted to extend Dayhoff's
methodology or re-apply her analysis using later databases
with more examples.

Extensions of Dayhoff’s methodology:
> Jones, Thornton and coworkers used the same methodology as
Dayhoff but with modern databases (CABIOS 8:275).
> Gonnet and coworkers (Science 256:1443) used a slightly different
(but theoretically equivalent) methodology.
> Henikoff & Henikoff (Proteins 17:49) compared these two newer
versions of the PAM matrices with Dayhoff's originals.
The BLOSUM matrices
(BLOcks SUbstitution Matrix)

The BLOSUM series of matrices were created by Steve
Henikoff and colleagues (PNAS 89:10915).

Derived from local, un-gapped alignments of distantly
related sequences.

All matrices are directly calculated; no extrapolations
are used.

Again: the observed frequency of each pair is compared
to the expected frequency (which is essentially the
product of the frequencies of each residue in the
dataset).
Then: Log-odds matrix.
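
The observed-versus-expected comparison can be sketched in a few lines. This is a minimal illustration, assuming a toy two-letter alphabet and made-up pair counts (not Blocks data); the function name `log_odds` and the use of unscaled log2 values are assumptions of the sketch, not the published BLOSUM procedure.

```python
import math
from collections import Counter

def log_odds(pair_counts):
    """Return {(a, b): log2(observed / expected)} for unordered residue pairs.

    pair_counts maps a sorted tuple (a, b) to how often that pair was
    seen in aligned columns (toy numbers here, not real BLOSUM counts).
    """
    total = sum(pair_counts.values())
    # Observed pair frequencies.
    q = {pair: n / total for pair, n in pair_counts.items()}
    # Residue frequencies: each pair contributes two residues.
    p = Counter()
    for (a, b), f in q.items():
        p[a] += f / 2
        p[b] += f / 2
    scores = {}
    for (a, b), f in q.items():
        # Expected frequency is the product of the residue frequencies
        # (doubled for a != b, because the pair is unordered).
        e = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(f / e)
    return scores

# Hypothetical counts: identical pairs are over-represented.
toy_counts = {("A", "A"): 40, ("A", "B"): 20, ("B", "B"): 40}
scores = log_odds(toy_counts)
```

Pairs seen more often than expected by chance get a positive score, and pairs seen less often get a negative one.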
The Blocks Database

The Blocks Database contains multiple alignments of
conserved regions in protein families.

Blocks are multiply aligned un-gapped segments corresponding
to the most highly conserved regions of proteins.

The blocks for the BLOCKS database are made automatically
by looking for the most highly conserved regions in groups of
proteins represented in the PROSITE database. These blocks
are then calibrated against the SWISS-PROT database to
obtain a measure of the random distribution of matches. It is
these calibrated blocks that make up the BLOCKS database.

The database can be searched by e-mail and World Wide Web
(WWW) servers (http://blocks.fhcrc.org/help) to classify protein
and nucleotide sequences.
The Blocks Database
[Figure: gapless alignment blocks]
The BLOSUM series

BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80,
85, 90.

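The number after the matrix (BLOSUM62) refers to the
percent identity threshold used to cluster the sequences
within each block (in the BLOCKS database): sequences that
are >=62% identical are clustered and counted as a single
sequence, so that closely related sequences do not dominate
the substitution counts.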

No extrapolations are made in going to higher
evolutionary distances

High number - closely related sequences
Low number - distant sequences

BLOSUM62 is the most popular: best for general
alignment.
The log-odds matrix for BLOSUM62
PAM versus BLOSUM

PAM: Based on an explicit evolutionary model.
BLOSUM: Based on empirical frequencies.

PAM: Derived from small, closely related proteins with ~15% divergence.
BLOSUM: Uses a much larger, more diverse set of protein sequences (30-90% ID).

PAM: Higher PAM numbers to detect more remote sequence similarities.
BLOSUM: Lower BLOSUM numbers to detect more remote sequence similarities.

PAM: Errors in PAM 1 are scaled 250X in PAM 250.
BLOSUM: Errors in BLOSUM arise from errors in alignment.
Comparing exchange matrices

To compare amino acid exchange matrices, the
"Entropy" value can be used. This is a relative entropy
value (H) which describes the amount of information
available per aligned residue pair.
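
As a rough sketch of how this relative entropy can be computed: H is the sum over residue pairs of the observed pair frequency times the log-odds score for that pair. The names `relative_entropy`, `q` (observed pair frequencies) and `e` (expected pair frequencies) are assumptions of this example, and the numbers are toy values.

```python
import math

def relative_entropy(q, e):
    """H = sum over pairs of q * log2(q / e), in bits per aligned pair."""
    return sum(
        q[pair] * math.log2(q[pair] / e[pair])
        for pair in q
        if q[pair] > 0  # a pair that is never observed contributes nothing
    )

# Toy two-letter example: observed versus expected pair frequencies.
q = {("A", "A"): 0.4, ("A", "B"): 0.2, ("B", "B"): 0.4}
e = {("A", "A"): 0.25, ("A", "B"): 0.5, ("B", "B"): 0.25}
H = relative_entropy(q, e)
```

H is zero when the observed frequencies equal the expected ones, and grows as the matrix becomes more specific to closely related sequences.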
Specialized matrices
 Claverie (J.Mol.Biol 234:1140) developed a set of
substitution matrices designed explicitly for finding
possible frameshifts in protein sequences.
These matrices are designed solely for use in protein-protein
comparisons; they should not be used with programs which
blindly translate DNA (e.g. BLASTX, TBLASTN).
Specialized matrices

Rather than starting from alignments generated by
sequence comparison, Risler et al (1988) and later
Overington et al (1992) only considered proteins for
which an experimentally determined three dimensional
structure was available.

They then aligned similar proteins on the basis of their
structure rather than sequence and used the resulting
sequence alignments as their database from which to
gather substitution statistics. In principle, the Risler or
Overington matrices should give more reliable results
than either PAM or BLOSUM. However, the
comparatively small number of available protein
structures (particularly in the Risler et al study)
limited the reliability of their statistics.

Overington et al (1992) developed further matrices
that consider the local environment of the amino acids.
A note on reliability

All these matrices are designed using standard
evolutionary models.

It is important to understand that evolution is not the
same for all proteins, not even for the same regions of
proteins.

No single matrix performs best on all sequences. Some
are better for sequences with few gaps, and others are
better for sequences with fewer identical amino acids.

Therefore, when aligning sequences, applying a general
model to all cases is not ideal. Rather, re-adjustment
can be used to make the general model better fit the
given data.
Pair-wise alignment quality
versus sequence identity
(Vogt et al., JMB 249, 816-831,1995)
Summary

If ORF exists, then align at protein level.

Amino acid substitution matrices reflect the log-odds ratio
between the evolutionary and random model and can
therefore help in determining homology via the alignment score.

The evolutionary and random models depend on the
generalized data used to derive them. This is not an ideal
solution.

Apart from the PAM and BLOSUM series, a great
number of further matrices have been developed.

Matrices have been made based on DNA, protein
structure, information content, etc.

For local alignment, BLOSUM62 is often superior; for
distant (global) alignments, BLOSUM50, GONNET, or
(still) PAM250 work well.

Remember that gap penalties are always a problem;
unlike the matrices themselves, there is no formal way
to calculate their values. You can follow
recommended settings, but these are based on trial
and error and not on a formal framework.
Biological definitions for
related sequences

Homologues are similar sequences that have been derived
from a common ancestor sequence. Homologues can be
described as either orthologues or paralogues.

Orthologues are similar sequences in two different
organisms that have arisen due to a speciation event.
Orthologues typically retain identical or similar functionality
throughout evolution.

Paralogues are similar sequences within a single organism
that have arisen due to a gene duplication event.

Xenologues are similar sequences that were not inherited
by vertical descent, but rather have arisen out of
horizontal transfer events through symbiosis, viruses, etc.
So this means …
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
Multiple sequence alignment
 Sequences can be conserved across species and perform
similar or identical functions.
> hold information about which regions have high mutation
rates over evolutionary time and which are evolutionarily
conserved;
> identification of regions or domains that are critical to
functionality.
 Sequences can be mutated or rearranged to perform an
altered function.
> which changes in the sequences have caused a change in
the functionality.
Multiple sequence alignment: the idea is to take three or more
sequences and align them so that the greatest number of similar
characters are aligned in the same column of the alignment.
What to ask yourself

How do we get a multiple alignment?
(three or more sequences)

What is our aim?
– Do we go for max accuracy, least
computational time or the best compromise?

What do we want to achieve each time?
Sequence-sequence alignment
[Figure: a sequence aligned against a sequence]
Multiple alignment methods
 Multi-dimensional dynamic programming
> extension of pairwise sequence alignment.
 Progressive alignment
> incorporates phylogenetic information to guide the
alignment process
 Iterative alignment
> corrects for problems with progressive alignment by
repeatedly realigning subgroups of sequences
Simultaneous multiple alignment
Multi-dimensional dynamic programming
The combinatorial explosion

2 sequences of length n: n^2 comparisons.

The number of comparisons increases exponentially with the
number of sequences, i.e. n^N, where n is the length of the
sequences and N is the number of sequences.

Impractical for even a small number of short sequences.
Multi-dimensional dynamic
programming (Murata et al., 1985)
[Figure: dynamic programming lattice with axes labelled Sequence 1 and Sequence 2]
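
To make the combinatorial explosion concrete, the short sketch below tabulates the slide's estimate of n to the power N comparisons for sequences of 250 residues. The function name `dp_cells` is made up for this example, and the count is an order-of-magnitude estimate rather than an exact cell count for any particular implementation.

```python
def dp_cells(n, N):
    """Estimated number of comparisons for N sequences of length n."""
    return n ** N

# Growth for sequences of ~250 residues as sequences are added.
for N in (2, 3, 5, 8):
    print(f"N = {N}: {dp_cells(250, N):.3e} comparisons")
```

Going from two to three sequences already multiplies the work by the sequence length, which is why exact multi-dimensional dynamic programming is limited to a handful of sequences.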
The MSA approach

MSA (Lipman et al., 1989, PNAS 86, 4412)

MSA restricts the amount of memory by computing bounds that
approximate the centre of a multi-dimensional hypercube.

Calculate all pair-wise alignment scores.
Use the scores to predict a tree.
Calculate pair weights based on the tree (lower bound).
Produce a heuristic alignment based on the tree.
Calculate the maximum weight for each sequence pair (upper bound).
Determine the spatial positions that must be calculated to obtain
the optimal alignment.
Perform the optimal alignment.
Report the weight found compared to the maximum weight previously
found (measure of divergence).

Extremely slow and memory intensive.
Max 8-9 sequences of ~250 residues.
The DCA approach
 DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2), 67-73)
 Each sequence is cut in two behind
a suitable cut position somewhere
close to its midpoint.
 This way, the problem of aligning
one family of (long) sequences is
divided into the two problems of
aligning two families of (shorter)
sequences.
 This procedure is re-iterated until
the sequences are sufficiently short.
 Optimal alignment by MSA.
 Finally, the resulting short
alignments are concatenated.
So in effect …
[Figure: dynamic programming lattice with axes labelled Sequence 1 and Sequence 2]
The progressive alignment method

Underlying idea: usually we are interested in aligning
families of sequences that are evolutionarily related.

Principle: construct an approximate phylogenetic tree
for the sequences to be aligned and then build up the
alignment by progressively adding sequences in the
order specified by the tree.

But before going into details, some notes on multiple
alignment profiles …
How to represent a block of sequences?
 Historically: consensus sequence – single
sequence that best represents the amino acids
observed at each alignment position.
 Modern methods: Alignment profile –
representation that retains the information
about frequencies of amino acids observed at
each alignment position.
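
The difference between the two representations can be sketched as follows. This is a minimal example on a made-up, un-gapped block of five short sequences; the function names `column_profile` and `consensus` are hypothetical, and ties in the consensus are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

def column_profile(block):
    """Per-column residue frequencies for equal-length, un-gapped sequences."""
    n_seqs = len(block)
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        profile.append({res: c / n_seqs for res, c in counts.items()})
    return profile

def consensus(profile):
    """Most frequent residue per column (ties broken alphabetically)."""
    return "".join(min(col, key=lambda r: (-col[r], r)) for col in profile)

# Toy block: the first column holds A, A, V, V, L.
block = ["AVC", "AVC", "VLC", "VLD", "LLD"]
profile = column_profile(block)
print(consensus(profile))
print(profile[0])
```

The consensus keeps only one residue per column, whereas the profile retains that the first column is 40% A, 40% V and 20% L.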
Multiple alignment profiles
(Gribskov et al. 1987)
 Gribskov created a probe: group of typical sequences of
functionally related proteins that have been aligned by
similarity in sequence or three-dimensional structure (in
his case: globins & immunoglobulins).
 Then he constructed a profile, which consists of a
sequence position-specific scoring matrix M(p,a)
composed of 21 columns and N rows (N = length of
probe).
 The first 20 columns of each row specify the score for
finding, at that position in the target, each of the 20
amino acid residues. An additional column contains a
penalty for insertions or deletions at that position (gap-opening and gap-extension).
Multiple alignment profiles

i      Core region    Gapped region    Core region
A      fA..           fA..             fA..
C      fC..           fC..             fC..
D      fD..           fD..             fD..
...
W      fW..           fW..             fW..
Y      fY..           fY..             fY..
-      Gapo, gapx     Gapo, gapx       Gapo, gapx

Position dependent gap penalties
Profile building
Example: each aa is represented as a frequency; gap penalties as weights.

i                (three consecutive alignment positions)
A                0.5    0.3    0
C                0      0.1    0.5
D                0      0      0.2
...
W                0      0.3    0.1
Y                0.5    0.3    0.2
Gap penalties    1.0    0.5    1.0

Position dependent gap penalties
Profile-sequence alignment
[Figure: a sequence aligned against a profile with rows ACD……VWY]
Sequence to profile alignment
Alignment column: A, A, V, V, L
Profile position: 0.4 A, 0.2 L, 0.4 V
Score of amino acid L in sequence that is aligned against
this profile position:
Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
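
The score formula above can be checked with a few lines of code. This is a sketch only: the three exchange values in `S` are hand-entered for illustration (they are meant to resemble BLOSUM62 entries, but should be checked against a real matrix file), and the names `s` and `residue_vs_profile` are assumptions of the example.

```python
# Tiny symmetric excerpt of an exchange matrix (illustrative values only).
S = {("A", "L"): -1, ("L", "L"): 4, ("L", "V"): 1}

def s(x, y):
    """Look up the exchange score for an unordered residue pair."""
    return S[tuple(sorted((x, y)))]

def residue_vs_profile(residue, column):
    """Frequency-weighted score of one residue against one profile column."""
    return sum(freq * s(residue, aa) for aa, freq in column.items())

# Profile position from the slide: 0.4 A, 0.2 L, 0.4 V.
column = {"A": 0.4, "L": 0.2, "V": 0.4}
score = residue_vs_profile("L", column)
print(round(score, 3))
```

With these illustrative values the score works out to 0.4 * (-1) + 0.2 * 4 + 0.4 * 1.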
Profile-profile alignment
[Figure: a profile with rows A, C, D, ..., Y aligned against a second profile with rows ACD……VWY]
Profile to profile alignment
Profile 1 position: 0.4 A, 0.2 L, 0.4 V
Profile 2 position: 0.75 G, 0.25 S
Match score of these two alignment columns using the a.a. frequencies at the
corresponding profile positions:
Score = 0.4*0.75*s(A,G) + 0.2*0.75*s(L,G) + 0.4*0.75*s(V,G)
      + 0.4*0.25*s(A,S) + 0.2*0.25*s(L,S) + 0.4*0.25*s(V,S)
s(x,y) is the value in the amino acid exchange matrix (e.g. PAM250, Blosum62) for
amino acid pair (x,y)
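
The profile-to-profile match score is the same weighted sum taken over every pair of residues from the two columns. The sketch below assumes hand-entered, illustrative exchange values in `S` (intended to resemble BLOSUM62 entries; verify against a real matrix before relying on them) and a hypothetical function name `column_vs_column`.

```python
# Illustrative exchange values for the six residue pairs in the example.
S = {
    ("A", "G"): 0, ("G", "L"): -4, ("G", "V"): -3,
    ("A", "S"): 1, ("L", "S"): -2, ("S", "V"): -2,
}

def s(x, y):
    """Look up the exchange score for an unordered residue pair."""
    return S[tuple(sorted((x, y)))]

def column_vs_column(col1, col2):
    """Sum of f1 * f2 * s(a, b) over all residue pairs of two profile columns."""
    return sum(
        f1 * f2 * s(a, b)
        for a, f1 in col1.items()
        for b, f2 in col2.items()
    )

col1 = {"A": 0.4, "L": 0.2, "V": 0.4}
col2 = {"G": 0.75, "S": 0.25}
score = column_vs_column(col1, col2)
print(round(score, 3))
```

A sequence-to-profile score is the special case in which one of the two columns contains a single residue with frequency 1.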
So, for scoring profiles …
 Think of sequence-sequence alignment.
 Same principles but more information for each position.
Reminder:

The sequence pair alignment score S comes from the
sum of the positional scores M(aa_i, aa_j) (i.e. the
substitution matrix values at each alignment position
minus penalties if applicable).

Profile alignment scores are exactly the same, but the
positional scores are more complex.