Download Multiple Alignment

Document related concepts

Cre-Lox recombination wikipedia , lookup

Genetic code wikipedia , lookup

Promoter (genetics) wikipedia , lookup

Protein structure prediction wikipedia , lookup

Molecular evolution wikipedia , lookup

Non-coding DNA wikipedia , lookup

Multilocus sequence typing wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Point mutation wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Homology modeling wikipedia , lookup

Structural alignment wikipedia , lookup

Transcript
Multiple Sequence
Alignment
Terminology



Motif: the biological object one attempts
to model - a functional or structural
domain, active site, phosphorylation site
etc.
Pattern: a qualitative motif description
based on a regular expression-like syntax
Profile: a quantitative motif description assigns a degree of similarity to a
potential match
Global Alignment


Global algorithms are often not effective for
highly diverged sequences and do not reflect
the biological reality that two sequences may
only share limited regions of conserved
sequence.
Sometimes two sequences may be derived from
ancient recombination events where only a
single functional domain is shared.
What is Multiple Sequence
Alignment (MSA)?
• Multiple sequence alignment (MSA) can be seen as a
generalization of Pairwise Sequence Alignment instead of aligning two sequences, n sequences are
aligned simultaneously, where n is > 2
• Definition: A multiple sequence alignment is an
alignment of n > 2 sequences obtained by inserting
gaps (“-”) into sequences such that the resulting
sequences have all length L and can be arranged in a
matrix of N rows and L columns where each column
represents a homologous position (each column
corresponds to a specific residue in the 'prototypical'
protein)
Multiple Sequence Alignment



MSA applies both to nucleotide and amino
acid sequences
To construct a multiple alignment, one
may have to introduce gaps in sequences
at positions where there were no gaps in
the corresponding pairwise alignment.
This means that multiple alignments
typically contain more gaps than any given
pair of aligned sequences
How to optimize alignment
algorithms?

Use structural information:
 reading
frame
 protein structure


Sequence elements are not truly
independent but related by phylogenic
descent
Sequences often contain highly conserved
regions
Optimize alignment algorithms
Sequences often contain highly conserved regions
These regions can be used for an initial alignment
By analyzing a number of small, independent fragments,
the algorithmic complexity can be drastically reduced!
Pairwise Alignment

The alignment of two sequences
(DNA or protein) is a relatively
straightforward computational
problem.
The big-O notation
•One of the most important properties of an
algorithm is how its execution time increases as the
problem is made larger. By a larger problem, we mean
more sequences to align, or longer sequences to align.
•This is the so-called algorithmic (or computational)
complexity of the algorithm
•There is a notation to describe the algorithmic
complexity, called the big-O notation.
•If we have a problem size (number of input data
points) n, then an algorithm takes O(n) time if the
time increases linearly with n.
•If the algorithm needs time proportional to the
square of n, then it is O(n2)
The big-O notation


It is important to realize that an algorithm that
is quick on small problems may be totally useless
on large problems if it has a bad O() behavior.
As a rule of thumb one can use the following
characterizations, where n is the size of the
problem, and c is a constant:
O(c)
O(log n)
O(n)
O(n2)
O(n3)
O(cn)
utopian
excellent
very good
not so good
pretty bad
disaster
The big-O notation
• To compute a N-wise alignment, the
algorithmic complexity is something like
O(c2n), where c is a constant, and n is
the number of sequences.
• This is a big-O disaster!
The best solution is Dynamic
Programming.
Multiple Sequence Alignment




In pairwise alignments, you have a twodimensional matrix with the sequences
on each axis.
The number of operations required to locate
the best “path” through the matrix is
approximately proportional to the product of
the lengths of the two sequences
A possible general method would be to extend
the pairwise alignment method into a
simultaneous N-wise alignment, using a
complete dynamical-programming algorithm in
N dimensions.
Algorithmically, this is not difficult to do
Dynamic Programming


Dynamic Programming is a very general
programming technique.
It is applicable when a large search space can
be structured into a succession of stages, such
that:
 the
initial stage contains trivial solutions to subproblems
 each partial solution in a later stage can be
calculated by recurring a fixed number of partial
solutions in an earlier stage
 the
final stage contains the overall solution
Multiple Alignments


In theory, making an optimal alignment
between two sequences is computationally
straightforward (Smith-Waterman
algorithm), but aligning a large number of
sequences using the same method is
almost impossible.
The problem increases exponentially with
the number of sequences involved
(the product of the sequence lengths)
Optimal Alignment


For a given group of sequences, there is
no single "correct" alignment, only an
alignment that is "optimal" according to
some set of calculations.
Determining what alignment is best for a
given set of sequences is really up to the
judgement of the investigator.
Why we do multiple alignments?
 In
order to characterize protein families,
identify shared regions of homology in a multiple
sequence alignment; (this happens generally when
a sequence search revealed homologies to several
sequences).
 Determination of the consensus sequence of
several aligned sequences.
 Consensus sequences can help to develop a
sequence “finger print” which allows the
identification of members of distantly related
protein family (motifs)
 MSA can help us to reveal biological facts about
proteins, like analysis of the secondary/tertiary
structure)
Why we do multiple alignments?

Crucial for genome sequencing:





Random fragments of a large molecule are sequenced and
those that overlap are found by a multiple sequence alignment
program.
There should be one correct alignment that corresponds to
the genomic sequence rather than a range of possibilities
Sequence may be from one strand of DNA or the other, so
complements of each sequence must also be compared
Sequence fragments will usually overlap, but by an unknown
amount and in some cases, one sequence may be included within
another
All of the overlapping pairs of sequence fragments must be
assembled into large composite genome sequence
An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-ATLVCLISDFYPGA--VTVAWKADS-AALGCLVKDYFPEP--VTVSWNSG--VSLTCLVKGFYPSD--IAVEWWSNG--
Three Types of Algorithms

Progressive: ClustalW

Iterative: Muscle

Concistency Based: T-Coffee and
Probcons
Progressive Multiple Alignment


The most practical and widely used method in
multiple sequence alignment is the hierarchical
extensions of pairwise alignment methods.
The principal is that multiple alignments is
achieved by successive application of pairwise
methods.
Choosing sequences for alignment



The more sequences to align the better.
Don’t include similar (>80%) sequences.
Sub-groups should be pre-aligned separately,
and one member of each subgroup should be
included in the final multiple alignment.
Progressive Pairwise Methods


Most of the available multiple
alignment programs use some sort of
incremental or progressive method
that makes pairwise alignments, then
adds new sequences one at a time to
these aligned groups.
This is an approximate or heuristic
method!
Multiple Alignment Method



Compare all sequences pairwise.
Perform cluster analysis on the pairwise data to
generate a hierarchy for alignment. This may be in the
form of a binary tree or a simple ordering
Build the multiple alignment by first aligning the most
similar pair of sequences, then the next most similar
pair and so on. Once an alignment of two sequences has
been made, then this is fixed. Thus for a set of
sequences A, B, C, D having aligned A with C and B with
D the alignment of A, B, C, D is obtained by comparing
the alignments of A and C with that of B and D using
averaged scores at each aligned position.
Gap Penalties





In the MSA scoring scheme, a penalty is subtracted
for each gap introduced into an alignment because the
gap increases uncertainty into an alignment
The gap penalty is used to help decide whether or not
to accept a gap or insertion in an alignment
Biologically, it should in general be easier for a
sequence to accept a different residue in a position,
rather than having parts of the sequence chopped
away or inserted. Gaps/insertions should therefore
be more rare than point mutations (substitutions)
In general, the lower the gapping penalties, the more
gaps and more identities are detected but this should
be considered in relation to biological significance
Most MSA programs allow for an adjustment of gap
penalties
The PILEUP Algorithm




First, PILEUP calculates approximate pairwise
similarity scores between all sequences to be
aligned, and they are clustered into a
dendrogram (tree structure).
Then the most similar pairs of sequences are
aligned.
Averages (similar to consensus sequences) are
calculated for the aligned pairs.
New sequences and clusters of sequences are
added one by one, according to the branching
order in the dendrogram.
Choosing sequences for MSA



As far as possible, try to align sequences
of similar length.
Pileup can align sequences of up to 5000
residues, with 2000 gaps (total 7000
characters).
Pileup is a good program only for similar
(close) sequences.
PileUp considerations




PileUp does global multiple alignment, and
therefore is good for a group of similar
sequences.
PileUp will fail to find the best local region of
similarity (such as a shared motif) among
distant related sequences.
PileUp always aligns all of the sequences you
specified in the input file, even if they are not
related.
The alignment can be degraded if some of the
sequences are only distantly related.
PILEUP Considerations




Since the alignment is calculated on a
progressive basis, the order of the initial
sequences can affect the final alignment.
PILEUP parameters: 2 gap penalties (gap
insert and gap extend) and an amino acid
comparison matrix.
PILEUP will refuse to align sequences that
require too many gaps or mismatches.
PILEUP will take quite a while to align more
than about 10 sequences
CLUSTAL




CLUSTAL is a stand-alone (i.e. not integrated
into GCG) multiple alignment program that is
superior in some respects to PILEUP
Works by progressive alignment: it aligns a pair
of sequences then aligns the next one onto the
first pair
Most closely related sequences are aligned first,
and then additional sequences and groups of
sequences are added, guided by the initial
alignments
Uses alignment scores to produce a phylogenetic
tree
CLUSTAL




Aligns the sequences sequentially, guided
by the phylogenetic relationships
indicated by the tree
Gap penalties can be adjusted based on
specific amino acid residues, regions of
hydrophobicity, proximity to other gaps,
or secondary structure
Is available with a great web interface:
http://www.ebi.ac.uk/clustalw/
Also available in Biology Workbench
Multiple Alignment tools on the
Web



There are a variety of multiple alignment
tools available for free on the web.
CLUSTAL is available from a number of
sites (with a variety of restrictions)
Other algorithms are available too
Muscle Algorithm: Using The
Iteration
Consistency Based Algorithms:
T-Coffee

Gotoh (1990)


Martin Vingron (1991)




Concistency
Agglomerative Assembly
T-Coffee (2000, Notredame)



Dot Matrices Multiplications
Accurate but too stringeant
Dialign (1996, Morgenstern)


Iterative strategy using concistency
Concistency
Progressive algorithm
ProbCons (2004, Do)

T-Coffee with a Bayesian Treatment
T-Coffee and Consistency…
T-Coffee and Consistency…
T-Coffee and Consistency…
T-Coffee and Consistency…
T-Coffee and Consistency…
APPROXIMATE
FAST
ACCURATE
SLOW
Some URLs

EMBL-EBI
http://www.ebi.ac.uk/clustalw/

BCM Search Launcher: Multiple
Alignment
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multialign.html

Multiple Sequence Alignment for
Proteins (Wash. U. St. Louis)
http://www.ibc.wustl.edu/service/msa/
Editing and displaying
alignments

Sequence editors are used for:
 manual
alignment/editing of sequences
 visualization of data
 data management
 import/export of data
 graphical enhancement of data for
presentations
Editing Multiple Alignments



There are a variety of tools that can be
used to modify a multiple alignment.
These programs can be very useful in
formatting and annotating an alignment
for publication.
An editor can also be used to make
modifications by hand to improve
biologically significant regions in a
multiple alignment created by one of the
automated alignment programs.
Displaying a multiple alignment
in GCG



There are several programs to display the
multiple alignment prettily.
The Pretty program prints sequences with
their columns aligned and can display a
consensus for the alignment, allowing you to
look at relationships among the sequences.
The PrettyBox program displays the
alignment graphically with the conserved
regions of the alignment as shaded boxes.
The output is in Postscript format.
Example of PrettyBox Output
GCG alignment editors



Alignments produced with PILEUP (or
CLUSTAL) can be adjusted with LINEUP.
Nicely shaded printouts can be produced
with PRETTYBOX
GCG's SeqLab X-Windows interface has a
superb multiple sequence editor - the
best editor of any kind.
Other editors


The MACAW and SeqVu program for
Macintosh and GeneDoc and DCSE for
PCs are free and provide excellent editor
functionality.
Many “comprehensive” molecular biology
programs include multiple alignment
functions:

MacVector, OMIGA, Vector NTI, and
GeneTool/PepTool all include a built-in version of
CLUSTAL
SeqVu
CINEMA

CINEMA (Colour INteractive Editor for
Multiple Alignments)
 It
is an editor created completely in
JAVA (old browsers beware)
 It includes a fully functional version of
CLUSTAL, BLAST, and a DotPlot module
http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
Informative Colors









By default, the alignment is coloured crudely
according to residue type (proline and glycine
have special structural properties, particularly in
membrane proteins, so are grouped separately;
similarly for cysteine, which is often involved in
disulphide bond formation):
Polar positive H, K, R
Blue
Polar negative D, E
Red
Polar neutral S, T, N, Q
Green
Non-polar aliphatic A, V, L, I, M
White
Non-polar aromatic F, Y, W
Purple
P, G
Brown
C
Yellow
Special characters B, Z, X, Grey