Sequence Alignment
• Only things that are homologous
should be compared in a
phylogenetic analysis
• Homologous – sharing a common
ancestor
• This is true for morphological
characters and must also be true
for molecular characters or the
entire analysis is meaningless
• Two different types of homology –
– Paralogous sequences are
homologous due to duplication
– Orthologous sequences are
homologous due to speciation
• Paralogous comparisons can be
useful but in most cases we are
interested in orthologous
comparisons
Sequence Alignment
• Only things that are homologous should be compared in a
phylogenetic analysis
• It is relatively easy to determine orthology/paralogy in the cases of
genes/sequences
• We must determine homology for each and every nucleotide/amino
acid position within the sequences.
• This is accomplished via sequence alignment
• It is THE CRITICAL first step in most phylogenetic analyses
• Always remember that the sequence alignment itself is a
hypothesis about the homology of multiple positions in a set of
protein or nucleotide sequences
Sequence Alignment
• A multiple sequence alignment
aims to find homology among as
many residues in a group of
sequences as possible
• Most of the time, in order to align
the sequences gaps must be
introduced
• Gaps represent indels (insertion/deletion events) that have
presumably occurred since the
sequences diverged in evolutionary
history
• All of this works on the
assumption that they began as the
same sequence and diverged
over time due to mutations
(substitutions, insertions,
deletions).
Sequence Alignment
• The problem of repeats
• Repeated sequences, such as simple sequence repeats (SSRs), make alignment difficult
Sequence Alignment
• Substitutions – point changes in sequences over time
• Sequence identity – the number of identical residues in an alignment
divided by the number of aligned positions
• Gapped positions are not counted, so identity can be a misleading number
• Example: an amino acid sequence alignment
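The identity calculation above can be sketched in a few lines. This is a minimal illustration, assuming the two sequences are already aligned to the same length and that "-" marks a gap; the function name `sequence_identity` is hypothetical.

```python
def sequence_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Identical residues divided by aligned (ungapped) positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if a == b:
            identical += 1
    return identical / compared if compared else 0.0


# Two gapped columns are skipped, so the remaining six columns all match.
print(sequence_identity("NAL-SQLN", "NALMSQ-N"))  # prints 1.0
```

The example reports 100% identity even though the alignment contains two indels, which is why the number can mislead.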
Sequence Alignment
• Note the indels – They represent an assumption that there has been
an insertion and/or deletion in one or both sequences relative to
each other (we can rarely know which it is for sure)
• Note the blocks of identical residues – They likely represent
functionally important amino acids
– Functional importance can be structural or enzymatic or both
Sequence Alignment
• Amino acid alignments have an advantage over DNA/RNA
alignments
• Side chains of amino acids can be grouped according to chemical
properties (basic, acidic, polar, nonpolar, charged, uncharged,
hydrophobic, hydrophilic)
• Evolutionary theory suggests that substitutions to chemically similar amino
acids will be tolerated more readily than more drastic changes
Sequence Alignment
• We can take advantage of this
pattern to inform and aid the
process of aligning protein
sequences
• Dayhoff et al. (1978) developed a
matrix to inform alignments based
on assigning weights to various
substitutions
• Based on 1572 observed changes
in closely related protein sequences
• Higher weight – less likely change
PAM
Sequence Alignment
• Most modern analyses use
variations of a BLOSUM (BLOcks
SUbstitution Matrix) by
Henikoff and Henikoff (1992)
• High number = likely substitution
• The idea is to find an alignment
with the highest score.
BLOSUM62
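Scoring an ungapped alignment with a substitution matrix might look like the sketch below. The scores in `TOY_SCORES` are made-up illustrative values, not entries copied from BLOSUM62 or PAM, and the function names are hypothetical.

```python
# Toy substitution scores (hypothetical values, NOT taken from BLOSUM62).
# Higher numbers stand for more likely substitutions.
TOY_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("L", "I"): 2,    # chemically similar side chains
    ("L", "D"): -4,   # dissimilar side chains
    ("I", "D"): -3,
}


def pair_score(a, b, scores=TOY_SCORES):
    """Look up a symmetric score for residues a and b."""
    return scores[(a, b)] if (a, b) in scores else scores[(b, a)]


def ungapped_score(seq_a, seq_b, scores=TOY_SCORES):
    """Sum the substitution scores over every aligned column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b, scores) for a, b in zip(seq_a, seq_b))


print(ungapped_score("LID", "LLD"))  # 4 + 2 + 6 = 12
print(ungapped_score("LID", "DLL"))  # -4 + 2 - 4 = -6
```

An alignment program would search for the arrangement that maximizes this kind of total.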
Sequence Alignment
• Gaps
• Gaps are introduced to help maximize an alignment score
• Gaps can easily be added willy-nilly by alignment programs
– Think about it – to obtain the highest score, just keep moving along the sequence
to which you are aligning until you find a matching base
• Gap penalties – subtractions from the alignment scores when a gap
is introduced
• GP = g + hl
• g = gap opening penalty, h = gap extension penalty, l = gap length
• No real biological justification for the formula
• In reality the origin of the gap must be taken into account but no
models exist to do this
• The best scoring alignment may not reflect biological reality
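The gap penalty formula GP = g + hl can be applied to each run of gap characters in an aligned sequence. The sketch below assumes "-" marks a gap; the default values g = 10 and h = 0.5 are arbitrary choices for illustration, not recommended settings.

```python
import re


def gap_penalty(length, g=10.0, h=0.5):
    """GP = g + h*l for a single gap of length l (zero if there is no gap)."""
    return g + h * length if length > 0 else 0.0


def total_gap_penalty(aligned_seq, g=10.0, h=0.5, gap="-"):
    """Sum GP over every contiguous run of gap characters."""
    runs = re.findall(re.escape(gap) + "+", aligned_seq)
    return sum(gap_penalty(len(run), g, h) for run in runs)


print(total_gap_penalty("AC---GT"))  # one gap of length 3: 10 + 0.5*3 = 11.5
print(total_gap_penalty("A-C-G-T"))  # three gaps of length 1: 3 * 10.5 = 31.5
```

With these values, one long gap costs less than the same number of gap positions scattered along the sequence, which discourages adding gaps freely.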
Sequence Alignment
• Gaps mean something – what that is is subject to debate
• Most software ignores gaps by default; other programs use them, but with no
biological model to support their weight
• All of the previous information applies in some ways to DNA/RNA
sequence alignments
• Nucleic acids form secondary structures and may have blocks of
conserved sequence
• Some nucleotides are more likely to change to other nucleotides
Sequence Alignment
• Multiple alignment algorithms
• Dot-matrix sequence comparison
• A dot-plot is constructed
[Figure: dot plot comparing the sequences MNALSQLN (horizontal axis) and NALMSQNH (vertical axis); a dot marks each position where the two residues match]
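A dot plot for the two example sequences can be built with a simple comparison of every residue pair. This is a minimal sketch without the window or threshold filtering that dot-plot tools commonly add; the function names are hypothetical.

```python
def dot_plot(seq_x, seq_y):
    """Rows follow seq_y, columns follow seq_x; True where residues match."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the matrix as text, with '*' for a match and '.' otherwise."""
    matrix = dot_plot(seq_x, seq_y)
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, matrix):
        lines.append(y + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("MNALSQLN", "NALMSQNH"))
```

Runs of '*' along a diagonal correspond to stretches that align without gaps.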



Sequence Alignment
• Dot-matrix sequence comparison
• Gaps are indicated by deviations from a diagonal
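[Figure: dot plot of MNALSQLN (horizontal) against NALMSQNH (vertical), annotated as follows]
• Stage 1:
– Sort the ends out
• Stage 2:
– Align middle
– Use triangles to indicate gaps
• One triangle indicates that M matches with a gap; the other indicates that L
matches with a gap
• Alignment of the middle region:
NAL-SQLN
NALMSQ-N
• Full alignment:
MNAL-SQLN-
-NALMSQ-NH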
Sequence Alignment
• Dot-matrix sequence comparison
• Same for nucleotide alignments
Sequence Alignment
• Dot-matrix sequence comparison
• Method is great for getting an
overall picture of the quality of the
alignment and for identifying
features of the sequences
• Detecting exons and similar genes
in divergent taxa
Sequence Alignment
• Dot-matrix sequence comparison
• Detecting repetitive sequences
• Self align using a dot matrix
Sequence Alignment
• Dynamic programming
• Keep in mind that until now, we’ve only been talking about TWO
sequences
• Dynamic programming can be used to find scores for all possible
pairs of aligned residues and all possible pairs of sequences
• A score for each pair (Dij) is calculated and all possible Dij’s are
summed to get a score.
• Sequence pairs can be weighted to give preference to more reliable
pairs
• Time and memory requirements grow exponentially with the number
of sequences
• Prohibitive for more than a few sequences
• Some problems with Dynamic programming can be overcome by
using short subsection alignments (instead of global alignments) via
DIALIGN (Morgenstern, 1999)
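The idea of summing a score over every pair of sequences can be sketched as a sum-of-pairs score for an existing multiple alignment. The match, mismatch, and gap values and the third sequence in the example are hypothetical, and this only scores a given alignment; it does not search for the best one.

```python
from itertools import combinations


def column_pair_score(a, b, match=1, mismatch=-1, gap_cost=-2, gap="-"):
    """Score one aligned column of one sequence pair (toy values)."""
    if a == gap and b == gap:
        return 0  # gap opposite gap carries no information
    if a == gap or b == gap:
        return gap_cost
    return match if a == b else mismatch


def pairwise_score(row_i, row_j):
    """Score for one pair of aligned sequences (a D_ij)."""
    return sum(column_pair_score(a, b) for a, b in zip(row_i, row_j))


def sum_of_pairs(msa, weights=None):
    """Sum D_ij over all pairs; weights maps (i, j) to a pair weight."""
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        w = 1.0 if weights is None else weights.get((i, j), 1.0)
        total += w * pairwise_score(msa[i], msa[j])
    return total


msa = ["NAL-SQLN", "NALMSQ-N", "NAL-SQ-N"]  # third row is made up
print(sum_of_pairs(msa))                 # 2 + 4 + 4 = 10.0
print(sum_of_pairs(msa, {(0, 1): 0.5}))  # 1 + 4 + 4 = 9.0
```

The number of pairs grows quadratically, but finding the optimal alignment of all sequences at once is what becomes prohibitive.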
Sequence Alignment
• Progressive alignments
• Typically, we are trying to find the phylogeny given the sequences
• It would make it easier to align the sequences if we knew the
phylogeny
• Build a quick and dirty guide tree and use it as the basis for the
alignment
• Fast and reasonably reliable
• Align all possible pairs, generate genetic distances and build a guide
tree
• Build the multiple sequence alignment by following the branching
order of the tree from the most similar sequences to the least similar
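The guide-tree step can be illustrated with a small clustering sketch. It assumes pairwise distances are already available (the values below are invented) and uses average-linkage clustering as a stand-in; real alignment programs may build their guide trees with other methods.

```python
from itertools import combinations


def guide_order(names, dist):
    """Return the merge order of an average-linkage guide tree.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name,) for name in names]
    merges = []

    def average(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters: most similar sequences first.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Hypothetical distances for four sequences A-D.
names = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 0.1, frozenset("CD"): 0.3,
    frozenset("AC"): 0.6, frozenset("AD"): 0.7,
    frozenset("BC"): 0.5, frozenset("BD"): 0.8,
}
for left, right in guide_order(names, dist):
    print(left, "+", right)
```

A progressive aligner would align A with B first, then C with D, and finally align the two sub-alignments to each other.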
Sequence Alignment
• Progressive alignments
• ClustalW and ClustalX
• ClustalX is just ClustalW with a built-in GUI
• Uses a progressive method
• Downweights sequences according to guide tree relatedness
• Can vary the weight matrix for protein sequences automatically
according to relatedness of the sequences
• Limitation - Final results are highly dependent on initial alignments
– Initial alignments are always incorporated into the final result - that is, once a
sequence has been aligned into the MSA, its alignment is not considered further.
This approximation improves efficiency at the cost of accuracy.
Sequence Alignment
• Progressive alignments
• T-Coffee
• Corrects an inherent problem of progressive alignments –
– Early alignment mistakes cannot be corrected later in the process
• Calculates pairwise alignments by combining the direct alignment of
the pair with indirect alignments that align each sequence of the pair
to a third sequence.
• Uses the output from other local alignment programs to find multiple
regions of local alignment between two sequences.
• The resulting alignment and phylogenetic tree are used as a guide to
produce new and more accurate weighting factors.
• Slower but more accurate than Clustal
Sequence Alignment
• Iterative alignments
• Work similarly to progressive methods but repeatedly realign the
initial sequences as well as adding new sequences to the growing
MSA.
• Iterative methods can return to previously calculated pairwise
alignments or sub-MSAs incorporating subsets of the query
sequence as a means of optimizing a general objective function such
as finding a high-quality alignment score.
Sequence Alignment
• Iterative alignments
• The software package PRRN/PRRP uses a hill-climbing algorithm to
optimize its MSA alignment score and iteratively corrects both
alignment weights and locally divergent or "gappy" regions of the
growing MSA.
• PRRP performs best when refining an alignment previously
constructed by a faster method. The alignment of individual motifs is
then achieved with a matrix representation similar to a dot-matrix plot
in a pairwise alignment.
• MUSCLE (multiple sequence alignment by log-expectation) improves
on progressive methods with a more accurate distance measure to
assess the relatedness of two sequences. The distance measure is
updated between iteration stages.
Sequence Alignment
• Hidden Markov model alignments
• Use probabilistic models of substitution and indel occurrence.
• Do not always reach the same alignment during multiple runs