Multiple Sequence Alignment
Alignment can be easy or difficult

Easy:

GCGGCCCA
GCGGCCCA
GCGTTCCA
GCGTCCCA
GCGGCGCA
********

TTGACATG
TTGACATG
TTGACATG
TTGACATG
TTGACATC
********

TCAGGTAGTT
TCAGGTAGTT
TCAGCTGGTT
TCAGCTAGTT
TTAGCTAGTT
**********

GGTGG
GGTGG
GGTGG
GGTGG
GGTGA
*****

AACCG
AAGCC
ACGCG
ACGCG
ACGCG
*****

Difficult due to insertions or deletions (indels):

CCGGGG---A
CCGGTG--GT
-CTAGG---A
-CTAGGGAAC
-CTCTG---A
??????????
Homology: Definition
• Homology: similarity that is the result of inheritance from a
common ancestor. The identification and analysis of homologies is
central to phylogenetic systematics.
• An alignment is a hypothesis of positional homology between
bases or amino acids.
Multiple Sequence Alignment - Goals
• To generate a concise, information-rich summary of
sequence data.
• Sometimes used to illustrate the dissimilarity
between a group of sequences.
• Alignments can be treated as models that can be
used to test hypotheses.
• Does this model of events accurately reflect known
biological evidence?
Alignment of 16S rRNA can be guided by secondary structure
<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            **        ***        * ** ** *                 **
Alignment of 16S rRNA sequences from different bacteria
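The "match" row of the 16S rRNA example marks columns in which every sequence carries the same base. A minimal Python sketch of that bookkeeping follows. The function name `match_line` is hypothetical, and treating the '-' and '=' characters as layout separators rather than residues is an assumption made for this example.

```python
def match_line(rows):
    """Return a string with '*' under every column where all rows agree."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        # Assumption: only letters count as residues; '-' and '=' are separators.
        conserved = len(set(column)) == 1 and column[0].isalpha()
        marks.append("*" if conserved else " ")
    return "".join(marks)


rows = [
    "UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA",  # Thermus ruber
    "UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA",  # Th. thermophilus
    "UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA",  # E.coli
    "UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA",  # Ancyst.nidulans
    "UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA",  # B.subtilis
    "UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA",  # Chl.aurantiacus
]
line = match_line(rows)
print(line)
```

Columns that are fully conserved across distant species are usually the safest anchors when the rest of the alignment is adjusted by eye.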
Protein Alignment may be guided by Tertiary Structure Interactions
[Figure: Escherichia coli DjlA protein and Homo sapiens DjlA protein]
Multiple Sequence Alignment - Methods
3 main methods of alignment:
• Manual
• Automatic
• Combined
Manual Alignment - reasons
• Might be carried out because:
– Alignment is easy.
– There is some extraneous information
(structural).
– Automated alignment methods have
encountered the local minimum problem.
– An automated alignment method can be
“improved”.
Dynamic programming
2 methods:
• Dynamic programming
– Consider 2 protein sequences of 100 amino acids in length.
– If it takes 100^2 seconds to exhaustively align these sequences,
then it will take 100^3 seconds to align 3 sequences, 100^4 to align
4 sequences, etc.
– Aligning 20 sequences exhaustively would take more time than the
universe has existed.
• Progressive alignment
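The growth described above can be checked with a few lines of arithmetic. The sketch below takes the slide's premise at face value (100^n seconds for n sequences of length 100). The constant for the age of the universe, roughly 13.8 billion years, is an assumption added for the comparison, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumption: roughly 13.8 billion years


def exhaustive_seconds(n_sequences, length=100):
    """Time for an exhaustive alignment under the slide's premise: length ** n seconds."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    verdict = "longer" if years > AGE_OF_UNIVERSE_YEARS else "shorter"
    print(f"{n:2d} sequences: {years:.3g} years ({verdict} than the age of the universe)")
```

The exponent is the number of sequences, which is why a heuristic such as progressive alignment is needed for anything beyond a handful of sequences.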
Progressive Alignment
• Devised by Feng and Doolittle in 1987.
• Essentially a heuristic method and as such
is not guaranteed to find the ‘optimal’
alignment.
• Requires (n-1)+(n-2)+...+1 = n(n-1)/2 pairwise
alignments as a starting point.
• Most successful implementation is Clustal
(Des Higgins)
Overview of ClustalW Procedure

Quick pairwise alignment: calculate distance matrix

             1    2    3    4    5
1 Hbb_Human  -
2 Hbb_Horse  .17  -
3 Hba_Human  .59  .60  -
4 Hba_Horse  .59  .59  .13  -
5 Myg_Whale  .77  .77  .75  .75  -

Neighbor-joining tree (guide tree)
[Figure: guide tree for Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale, with the alignment steps numbered 1 to 4]

Progressive alignment following guide tree (alpha-helices)

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
ClustalW- Pairwise Alignments
• First perform all possible pairwise
alignments between each pair of
sequences. There are (n-1)+(n-2)+...+1 = n(n-1)/2 possibilities.
• Calculate the ‘distance’ between each pair
of sequences based on these isolated
pairwise alignments.
• Generate a distance matrix.
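The three steps above can be sketched in Python. This is an illustration under simplifying assumptions, not ClustalW's actual procedure: the sequences are taken to be already aligned and of equal length, and the distance is simply the fraction of differing columns. The names `p_distance` and `distance_matrix` are hypothetical, and the toy sequences are four rows from the first "easy" block earlier in these slides.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of compared columns that differ (columns with a gap are skipped)."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no comparable columns")
    mismatches = sum(1 for x, y in columns if x != y)
    return mismatches / len(columns)


def distance_matrix(seqs):
    """Build a symmetric distance matrix as a dict of dicts."""
    dist = {name: {name: 0.0} for name in seqs}
    for a, b in combinations(seqs, 2):  # every unordered pair exactly once
        dist[a][b] = dist[b][a] = p_distance(seqs[a], seqs[b])
    return dist


toy = {
    "seq1": "GCGGCCCA",
    "seq2": "GCGTTCCA",
    "seq3": "GCGTCCCA",
    "seq4": "GCGGCGCA",
}
n = len(toy)
n_pairs = len(list(combinations(toy, 2)))  # equals n * (n - 1) // 2
dist = distance_matrix(toy)
print(n_pairs, dist["seq1"]["seq2"], dist["seq2"]["seq3"])
```

With four sequences there are six pairs. The count grows quadratically, which is cheap compared with the exponential cost of exhaustive multiple alignment.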
Path Graph for aligning two sequences
Scoring scheme:
• Match: +1
• Mismatch: 0
• Indel: -1

Possible alignment (column scores 1, 1, 0, 1, 0, -1; score for this path = 2):
GATTC-
GAATTC

Optimal Alignment 1 (column scores 1, 1, -1, 1, 1, 1; alignment score: 4):
GA-TTC
GAATTC

Optimal Alignment 2 (column scores 1, -1, 1, 1, 1, 1; alignment score: 4):
G-ATTC
GAATTC
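The path graph corresponds to a dynamic-programming table in which each cell holds the best score of any path reaching it. Below is a minimal sketch of global pairwise alignment (the Needleman-Wunsch approach) using the slide's scoring scheme of match +1, mismatch 0, indel -1. The function names are hypothetical, and the traceback returns only one of the equally good optimal alignments.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Score an existing alignment column by column."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def needleman_wunsch(a, b, match=1, mismatch=0, indel=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel
    for j in range(1, cols):
        score[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + indel, score[i][j - 1] + indel)

    # Trace back from the bottom-right corner, preferring diagonal moves.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, aligned_a, aligned_b = needleman_wunsch("GATTC", "GAATTC")
print(best)
print(aligned_a)
print(aligned_b)
```

Because two paths tie at the best score, which one is reported depends only on the tie-breaking order in the traceback. This is one source of the "arbitrary alignment" problem mentioned later in these slides.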
ClustalW- Guide Tree
• Generate a Neighbor-Joining
‘guide tree’ from these pairwise
distances.
• This guide tree gives the order
in which the progressive
alignment will be carried out.
Neighbor joining method
•The neighbor joining method is a greedy heuristic which
joins, at each step, the two closest sub-trees that are not
already joined.
•It is based on the minimum evolution principle.
•One of the important concepts in the NJ method is
neighbors, which are defined as two taxa that are
connected by a single node in an unrooted tree.
[Figure: neighbors A and B connected through Node 1]
What is required for the Neighbour joining method?
A distance matrix (PAM distances):

PAM        Spinach  Rice   Mosquito  Monkey  Human
Spinach    0.0      84.9   105.6     90.8    86.3
Rice       84.9     0.0    117.8     122.4   122.6
Mosquito   105.6    117.8  0.0       84.7    80.8
Monkey     90.8     122.4  84.7      0.0     3.3
Human      86.3     122.6  80.8      3.3     0.0
First Step
PAM distance 3.3 (Human - Monkey) is the minimum. So we'll
join Human and Monkey to MonHum and we'll calculate the new
distances.
[Figure: Human and Monkey joined as Mon-Hum; Mosquito, Spinach and Rice not yet joined]
Calculation of New Distances
After we have joined two species in a subtree we have to compute the
distances from every other node to the new subtree. We do this with a
simple average of distances:
Dist[Spinach, MonHum]
= (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2
= (90.8 + 86.3)/2 = 88.55
[Figure: distance from Spinach to the Mon-Hum subtree, which contains Human and Monkey]
Next Cycle

PAM        Spinach  Rice   Mosquito  MonHum
Spinach    0.0      84.9   105.6     88.6
Rice       84.9     0.0    117.8     122.5
Mosquito   105.6    117.8  0.0       82.8
MonHum     88.6     122.5  82.8      0.0

[Figure: Mosquito joined to Mon-Hum as Mos-(Mon-Hum); Rice and Spinach not yet joined]
Penultimate Cycle

PAM        Spinach  Rice   MosMonHum
Spinach    0.0      84.9   97.1
Rice       84.9     0.0    120.2
MosMonHum  97.1     120.2  0.0

[Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]
Last Joining

PAM        SpinRice  MosMonHum
SpinRice   0.0       108.7
MosMonHum  108.7     0.0

[Figure: final tree (Spin-Rice)-(Mos-(Mon-Hum))]
Unrooted Neighbor-Joining Tree
[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Spinach and Rice]
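The worked example can be reproduced with a short Python sketch. It follows the simplified procedure used in these slides: join the pair with the smallest distance, then average the two members' distances to every other node. Full neighbor joining instead chooses the pair using distances corrected by each taxon's net divergence from all the others, which this sketch does not do. The function name `join_by_average` and the cluster-naming scheme are hypothetical.

```python
def join_by_average(matrix):
    """Greedy clustering: join the closest pair, then average their distances."""
    dist = {a: dict(row) for a, row in matrix.items()}  # work on a copy
    history = []
    while len(dist) > 1:
        # Closest pair of distinct nodes (x < y avoids the diagonal and duplicates).
        a, b = min(
            ((x, y) for x in dist for y in dist if x < y),
            key=lambda pair: dist[pair[0]][pair[1]],
        )
        joined_at = dist[a][b]
        new = f"({a},{b})"
        others = [k for k in dist if k not in (a, b)]
        # Simple average of the two members' distances to each remaining node.
        new_row = {k: (dist[a][k] + dist[b][k]) / 2 for k in others}
        del dist[a], dist[b]
        for k in others:
            del dist[k][a], dist[k][b]
            dist[k][new] = new_row[k]
        new_row[new] = 0.0
        dist[new] = new_row
        history.append((a, b, joined_at))
    return history


names = ["Spinach", "Rice", "Mosquito", "Monkey", "Human"]
table = [
    [0.0, 84.9, 105.6, 90.8, 86.3],
    [84.9, 0.0, 117.8, 122.4, 122.6],
    [105.6, 117.8, 0.0, 84.7, 80.8],
    [90.8, 122.4, 84.7, 0.0, 3.3],
    [86.3, 122.6, 80.8, 3.3, 0.0],
]
pam = {name: dict(zip(names, row)) for name, row in zip(names, table)}
history = join_by_average(pam)
for a, b, d in history:
    print(f"join {a} + {b} at distance {d:.4f}")
```

The join order matches the slides: Human with Monkey, then Mosquito, then Spinach with Rice, then the two groups. The sketch keeps unrounded intermediate values, so its last distance is 108.6125, whereas the slides report 108.7 from averaging the rounded values 97.1 and 120.2.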
Multiple Alignment- First pair
• Align the two most closely-related
sequences first.
• This alignment is then ‘fixed’ and
will never change. If a gap is to be
introduced subsequently, then it will
be introduced in the same place in
both sequences, but their relative
alignment remains unchanged.
ClustalW- Decision time
• Next, consult the guide tree to see what alignment is
performed next.
– Option 1: align a third sequence to the first two,
or
– Option 2: align two entirely different sequences to each other.
ClustalW- Alternative 1
If a third sequence is
aligned to the first two,
then when a gap has to be
introduced to improve the
alignment, the existing pair
and the new sequence are each
treated as a single sequence.
ClustalW- Alternative 2
• If, on the other hand,
two separate sequences
have to be aligned
together, then the first
pairwise alignment is
placed to one side and the
pairwise alignment of the
other two is carried out.
ClustalW- Progression
• The alignment is progressively
built up in this way, with each
step being treated as a pairwise
alignment, sometimes with each
member of a ‘pair’ having more
than one sequence.
ClustalW- Good points/Bad points
• Advantages:
– Speed.
• Disadvantages:
– No objective function.
– No way of quantifying whether or not
the alignment is good
– No way of knowing if the alignment is
‘correct’.
ClustalW-Local Minimum
• Potential problems:
– Local minimum problem. If an
error is introduced early in
the alignment process, it is
impossible to correct this
later in the procedure.
– Arbitrary alignment.
Increasing the sophistication of
the alignment process
• Should we treat all the sequences in the
same way? - even though some
sequences are closely-related and some
sequences are distant relatives.
• Should we treat all positions in the
sequences as though they were the
same? - even though they might have
different functions and different
locations in the 3-dimensional structure.
ClustalW- Caveats
• Sequence weighting
• Varying substitution matrices
• Residue-specific gap penalties and reduced
penalties in hydrophilic regions (external regions of
protein sequences), encourage gaps in loops rather
than in core regions.
• Positions in early alignments where gaps have been
opened receive locally reduced gap penalties to
encourage openings in subsequent alignments
Sequence weighting
• First we must be able to categorise sequences
according to whether they have close relatives or
if they are distantly-related to the other
sequences (calculated directly from the guide
tree).
• Weights are normalised, so that the largest
weight is 1.
• Closely-related sequences have a large amount of
the same information, so they are downweighted.
• These weights are multiplication factors.
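The normalisation and the use of weights as multiplication factors can be sketched as follows. The raw weights below are invented purely for illustration; in ClustalW they are derived from the guide tree, a step this sketch does not attempt. Both function names are hypothetical.

```python
def normalise_weights(raw_weights):
    """Scale weights so that the largest one equals 1."""
    largest = max(raw_weights.values())
    return {name: weight / largest for name, weight in raw_weights.items()}


def weighted_pair_score(score, weight_a, weight_b):
    """Use the two sequence weights as multiplication factors on a residue-pair score."""
    return score * weight_a * weight_b


# Invented raw weights: two near-identical sequences share information,
# so each gets a smaller weight than the lone distant sequence.
raw = {"close_1": 0.2, "close_2": 0.2, "distant": 0.5}
weights = normalise_weights(raw)
print(weights)
print(weighted_pair_score(4, weights["close_1"], weights["distant"]))
```

Downweighting near-duplicates keeps a cluster of very similar sequences from dominating the scores when groups are aligned to each other.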
ClustalW- User-supplied values
• Two penalties are set by the user
(there are default values, but you
should know that it is possible to
change these).
• GOP- Gap Opening Penalty is the cost
of opening a gap in an alignment.
• GEP- Gap Extension Penalty is the cost
of extending this gap.
ClustalW- Manipulation of
penalties
• Although GOP and GEP are set by the
user, the program attempts to manipulate
these according to the following criteria:
– Dependence on the weight matrix.
– Dependence on the similarity of the sequences:
the percent identity of the sequences is used
as a scaling factor to increase the GOP for
closely-related sequences and decrease it for
more distantly-related sequences.
ClustalW
• Dependence on the length of the sequences:
– The program uses the formula
– GOP -> (GOP + log(MIN(N,M))) * (average residue mismatch
score) * (percent identity scaling factor), where N and M are
the lengths of the two sequences
– The logarithm of the length of the shortest sequence is
used as a scaling factor to increase the GOP with
increasing length
• Dependence on the difference in lengths of the
two sequences:
• GEP-> GEP*(1.0+|log(N/M)|)
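The two adjustments can be written as small Python functions. This is a sketch of the formulas as stated on the slide, not ClustalW's source code. The base of the logarithm is not given here, so the natural logarithm is an assumption, and the function and parameter names are hypothetical.

```python
import math


def adjusted_gop(gop, n, m, avg_mismatch_score, identity_scaling):
    """GOP -> (GOP + log(min(N, M))) * (avg mismatch score) * (identity scaling)."""
    # Assumption: natural logarithm; the slide does not state the base.
    return (gop + math.log(min(n, m))) * avg_mismatch_score * identity_scaling


def adjusted_gep(gep, n, m):
    """GEP -> GEP * (1.0 + |log(N / M)|)."""
    return gep * (1.0 + abs(math.log(n / m)))


# Equal lengths leave the extension penalty unchanged.
print(adjusted_gep(0.2, 100, 100))
# Very different lengths raise it, discouraging long gaps in the shorter sequence.
print(adjusted_gep(0.2, 300, 100))
# The opening penalty grows with the length of the shorter sequence.
print(adjusted_gop(10.0, 100, 150, 1.0, 1.0), adjusted_gop(10.0, 400, 450, 1.0, 1.0))
```

The absolute value makes the GEP adjustment symmetric, so it does not matter which of the two sequences is the longer one.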
Position-Specific gap penalties
• Before any pair of (groups of) sequences is aligned, a
table of GOPs is generated for each position in the two
(sets of) sequences.
• The GOP is manipulated in a position-specific manner, so
that it can vary over the sequences.
• If there is a gap at a position, the GOP and GEP penalties
are lowered and the other rules do not apply.
• This makes gaps more likely at positions where gaps
already exist.
Discouraging too many gaps
• If there is no gap opened, then the GOP is increased if the
position is within 8 residues of an existing gap.
• This discourages gaps that are too close together.
• At any position within a run of hydrophilic residues, the GOP
is decreased.
• These runs usually indicate loop regions in protein structures.
• A run of 5 hydrophilic residues is considered to be a
hydrophilic stretch.
• The default hydrophilic residues are:
– D, E, G, K, N, Q, P, R, S
– But this can be changed by the user.
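Finding the hydrophilic stretches described above is a simple run-length scan. The sketch below returns the positions that fall inside a run of at least 5 residues from the default set. How much the GOP is lowered at those positions is not stated here, so the sketch only locates them. The function name is hypothetical.

```python
HYDROPHILIC = frozenset("DEGKNQPRS")  # default set listed on the slide


def hydrophilic_positions(seq, min_run=5, hydrophilic=HYDROPHILIC):
    """Return the 0-based positions inside runs of >= min_run hydrophilic residues."""
    positions = set()
    start = None
    # The trailing sentinel is not hydrophilic, so it closes a run at the end.
    for i, residue in enumerate(seq + "#"):
        if residue in hydrophilic:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                positions.update(range(start, i))
            start = None
    return positions


print(sorted(hydrophilic_positions("AVLDEGKNAVL")))  # run DEGKN has length 5
print(sorted(hydrophilic_positions("AVDEGKAV")))     # run DEGK is too short
```

Passing a different `hydrophilic` set mirrors the option, mentioned above, for the user to change the default residues.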
Divergent Sequences
• The most divergent sequences (most different, on
average from all of the other sequences) are usually the
most difficult to align.
• It is sometimes better to delay their alignment until later
(when the easier sequences have already been aligned).
• The user has the choice of setting a cutoff (default is
40% identity).
• This will delay the alignment until the others have been
aligned.
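One way to picture the cutoff is to compute each sequence's average percent identity to all the others and postpone those that fall below the threshold. The sketch below does this. The averaging rule is an interpretation of the first bullet above and may differ from what ClustalW does internally, the function name is hypothetical, and the identity values are invented for illustration.

```python
def split_by_divergence(identity, cutoff=40.0):
    """Return (align_first, delayed) based on mean percent identity to the others."""
    align_first, delayed = [], []
    for name, row in identity.items():
        others = [value for other, value in row.items() if other != name]
        mean_identity = sum(others) / len(others)
        # Assumption: a sequence is delayed when its mean identity is below the cutoff.
        (delayed if mean_identity < cutoff else align_first).append(name)
    return align_first, delayed


# Invented percent identities for three similar sequences and one outlier.
identity = {
    "A": {"A": 100.0, "B": 85.0, "C": 80.0, "D": 30.0},
    "B": {"A": 85.0, "B": 100.0, "C": 82.0, "D": 28.0},
    "C": {"A": 80.0, "B": 82.0, "C": 100.0, "D": 35.0},
    "D": {"A": 30.0, "B": 28.0, "C": 35.0, "D": 100.0},
}
align_first, delayed = split_by_divergence(identity)
print(align_first, delayed)
```

Here only D averages below 40% identity, so it would be added after A, B and C have been aligned to each other.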
Advice on progressive alignment
• Progressive alignment is a mathematical process that is
completely independent of biological reality.
• Can be a very good estimate
• Can be an impossibly poor estimate.
• Requires user input and skill.
• Treat cautiously
• Can be improved by eye (usually)
• Often helps to have colour-coding.
• Depending on the use, the user should be able to make a
judgement on those regions that are reliable or not.
• For phylogeny reconstruction, only use those positions whose
hypothesis of positional homology is unimpeachable
Alignment of protein-coding
DNA sequences
• It is not very sensible to align the DNA
sequences of protein-coding genes.
Sequences:
ATGCTGTTAGGG
ATGCTCGTAGGG
Possible DNA alignment:
ATGCT-GTTAGGG
ATGCTCGTA-GGG
The result might be highly implausible and might not reflect
what is known about biological processes.
It is much more sensible to translate the sequences to their
corresponding amino acid sequences, align these protein
sequences and then put the gaps in the DNA sequences according
to where they are found in the amino acid alignment.
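The last step, putting gaps into the DNA according to the protein alignment, can be sketched as below. Each gap in the aligned protein becomes three gap characters in the DNA, so the reading frame is never broken. The function name `thread_gaps` is hypothetical, and the gapped protein row in the second call is an invented illustration (the two example sequences translate to MLLG and MLVG, which align without gaps).

```python
def thread_gaps(aligned_protein, dna):
    """Insert '---' into dna wherever aligned_protein has a gap ('-')."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for residue in aligned_protein if residue != "-")
    if len(dna) % 3 != 0 or n_residues != len(codons):
        raise ValueError("DNA must contain exactly one codon per residue")
    remaining = iter(codons)
    pieces = []
    for residue in aligned_protein:
        # A protein gap spans a whole codon; otherwise consume the next codon.
        pieces.append("---" if residue == "-" else next(remaining))
    return "".join(pieces)


# ATG CTG TTA GGG translates to M L L G; ATG CTC GTA GGG to M L V G.
print(thread_gaps("MLLG", "ATGCTGTTAGGG"))
print(thread_gaps("MLVG", "ATGCTCGTAGGG"))
# Invented example: a gap after the first residue in the protein alignment.
print(thread_gaps("M-LLG", "ATGCTGTTAGGG"))
```

Because gaps are only ever inserted in multiples of three, the aligned DNA can still be translated column by column.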
Manual Alignment- software
GDE- The Genetic Data Environment (UNIX)
CINEMA- Java applet available from:
– http://www.biochem.ucl.ac.uk
Seqapp/Seqpup- Mac/PC/UNIX available from:
– http://iubio.bio.indiana.edu
SeAl for Macintosh, available from:
– http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from:
– http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html