Sequence Alignment

• Only things that are homologous should be compared in a phylogenetic analysis
• Homologous – sharing a common ancestor
• This is true for morphological characters and must also be true for molecular characters, or the entire analysis is meaningless
• Two different types of homology:
  – Paralogous sequences are homologous due to duplication
  – Orthologous sequences are homologous due to speciation
• Paralogous comparisons can be useful, but in most cases we are interested in orthologous comparisons

• It is relatively easy to determine orthology/paralogy in the case of whole genes/sequences
• We must also determine homology for each and every nucleotide/amino acid position within the sequences
• This is accomplished via sequence alignment
• It is THE CRITICAL first step in most phylogenetic analyses
• Always remember that the sequence alignment itself is a hypothesis about the homology of multiple positions in a set of protein or nucleotide sequences

• A multiple sequence alignment aims to find homology among as many residues in a group of sequences as possible
• Most of the time, gaps must be introduced in order to align the sequences
• Gaps represent indels (insertion/deletion events) that have presumably occurred since the sequences diverged in evolutionary history
• All of this works on the assumption that the sequences began as the same sequence and diverged over time due to mutations (substitutions, insertions, deletions)
• The problem of repeats
• Repeated nucleotides, such as simple sequence repeats (SSRs), make alignment difficult

• Substitutions – point changes in sequences over time
• Sequence identity – the number of identical residues in an alignment divided by the number of aligned positions
• Gaps are not counted, so it can be a misleading number
• Example: an amino acid sequence alignment

• Note the indels
  – They represent an assumption that there has been an insertion and/or deletion in one or both sequences relative to each other (we can rarely know for sure which it is)
• Note the blocks of identical residues
  – They likely represent functionally important amino acids
  – Functional importance can be structural, enzymatic, or both

• Amino acid alignments have an advantage over DNA/RNA alignments
• Side chains of amino acids can be grouped according to chemical properties (basic, acidic, polar, nonpolar, charged, uncharged, hydrophobic, hydrophilic)
• Evolutionary theory suggests that substitutions to similar amino acids will be tolerated more readily than more drastic changes

• We can take advantage of this pattern to inform and aid the process of aligning protein sequences
• Dayhoff et al. (1978) developed a matrix to inform alignments based on assigning weights to various substitutions
• Based on 1572 observed changes in closely related protein sequences
• Higher weight – less likely change
[Figure: PAM matrix]

• Most modern analyses use variations of the BLOSUM (BLOcks SUbstitution Matrix) matrices of Henikoff and Henikoff (1992)
• High number = likely substitution
• The idea is to find the alignment with the highest score
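The definition of sequence identity given earlier in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `percent_identity` and the toy alignment are assumptions made for the example, and gaps are assumed to be written as `-`.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical residues divided by gap-free aligned columns, as a percentage."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared


# Toy alignment (hypothetical): 4 of the 10 columns contain a gap.
row_a = "MNAL-SQLN-"
row_b = "-NALMSQ-NH"
print(percent_identity(row_a, row_b))  # 100.0
```

The toy alignment shows why the number can mislead: every gap-free column matches, so the identity is 100%, even though four of the ten columns required an indel.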
[Figure: BLOSUM62 matrix]

• Gaps
• Gaps are introduced to help maximize an alignment score
• Left unpenalized, gaps can easily be added willy-nilly by alignment programs
  – Think about it: to obtain the highest score, just keep moving along the sequence to which you are aligning until you find a matching base
• Gap penalties – subtractions from the alignment score when a gap is introduced
• GP = g + hl
• g = gap opening penalty, h = gap extension penalty, l = gap length
• No real biological justification for the formula
• In reality the origin of the gap must be taken into account, but no models exist to do this
• The best-scoring alignment may not reflect biological reality

• Gaps mean something – what that is is subject to debate
• Most software ignores gaps by default; other programs utilize them, but with no biological model to support their weight
• All of the previous information applies in some ways to DNA/RNA sequence alignments
• Nucleic acids form secondary structures and may have blocks of conserved sequence
• Some nucleotides are more likely to change to other nucleotides

• Multiple alignment algorithms
• Dot-matrix sequence comparison
• A dot-plot is constructed
[Figure: dot-plot comparing the sequences MNALSQLN and NALMSQNH]

• Dot-matrix sequence comparison
• Gaps are indicated by deviations from a diagonal
• Stage 1:
  – Sort the ends out
• Stage 2:
  – Align middle
  – Use triangles to indicate gaps
[Figure: annotated dot-plot of MNALSQLN against NALMSQNH, marking where M matches with a gap and where L matches with a gap]
Middle of the alignment:
NAL-SQLN
NALMSQ-N
Full alignment:
MNAL-SQLN-
-NALMSQ-NH

• Dot-matrix sequence comparison
• Same for nucleotide alignments

• Dot-matrix sequence comparison
• Method is great for getting an overall picture of the quality of the alignment and for identifying features of the sequences
• Detecting exons and similar genes in divergent taxa
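The dot-plot construction described above can be sketched as a small text renderer. This is an illustrative sketch only: the function name `dot_plot`, the use of `*` for a match and `.` for a mismatch, and the choice of which sequence goes on which axis are all assumptions made for the example.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """Return text rows; each row marks where one residue of seq_y matches seq_x."""
    rows = ["  " + " ".join(seq_x)]  # header row: seq_x along the top
    for residue_y in seq_y:
        cells = ["*" if residue_y == residue_x else "." for residue_x in seq_x]
        rows.append(residue_y + " " + " ".join(cells))
    return rows


# The two short peptides used in the dot-plot slides.
for row in dot_plot("MNALSQLN", "NALMSQNH"):
    print(row)
```

In the printed grid, the run of matches for N-A-L sits one column off the main diagonal, while S-Q sits on it. Shifts of that kind between diagonal runs are what the slides read as gaps.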
• Dot-matrix sequence comparison
• Detecting repetitive sequences
• Self-align a sequence using a dot matrix

• Dynamic programming
• Keep in mind that until now, we've only been talking about TWO sequences
• Dynamic programming can be used to find scores for all possible pairs of aligned residues and all possible pairs of sequences
• A score for each pair (Dij) is calculated, and all possible Dij's are summed to get an overall score
• Sequence pairs can be weighted to give preference to more reliable pairs
• Time and memory requirements grow exponentially with the number of sequences
• Prohibitive for more than a few sequences
• Some problems with dynamic programming can be overcome by aligning short local segments (instead of computing global alignments) via DIALIGN (Morgenstern, 1999)

• Progressive alignments
• Typically, we are trying to find the phylogeny given the sequences
• It would be easier to align the sequences if we knew the phylogeny
• Build a quick and dirty guide tree and use it as the basis for the alignment
• Fast and reasonably reliable
• Align all possible pairs, generate genetic distances and build a guide tree
• Build the multiple sequence alignment by following the branching order of the tree, from the most similar sequences to the least similar

• Progressive alignments
• ClustalW and ClustalX
• ClustalX is just ClustalW with a built-in GUI
• Uses a progressive method
• Downweights sequences according to guide tree relatedness
• Can vary the weight matrix for protein sequences automatically according to the relatedness of the sequences
• Limitation: final results are highly dependent on the initial alignments
  – Initial alignments are always incorporated into the final result; that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy.
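Both the dynamic programming and the progressive methods above rest on aligning two sequences at a time. A minimal sketch of that pairwise step is the classic Needleman-Wunsch global alignment, shown below. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions rather than a PAM or BLOSUM matrix, and the gap cost is linear, which corresponds to GP = g + hl with g = 0.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("MNALSQLN", "NALMSQNH")
print(aligned_a)
print(aligned_b)
print(total)
```

The table has (n + 1) × (m + 1) cells for two sequences. Extending the same idea to k sequences needs a k-dimensional table, which is why the slides describe exact dynamic programming as prohibitive for more than a few sequences.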
• Progressive alignments
• T-Coffee
• Corrects an inherent problem of progressive alignments:
  – Early alignment mistakes cannot be corrected later in the process
• Calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that align each sequence of the pair to a third sequence
• Uses the output of local alignment programs to find multiple regions of local alignment between two sequences
• The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors
• Slower but more accurate than Clustal

• Iterative alignments
• Work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA
• Iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequence as a means of optimizing a general objective function, such as finding a high-quality alignment score

• Iterative alignments
• The software package PRRN/PRRP uses a hill-climbing algorithm to optimize its MSA alignment score and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA
• PRRP performs best when refining an alignment previously constructed by a faster method. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment.
• MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences. The distance measure is updated between iteration stages.

• Hidden Markov model alignments
• Use probabilistic models of substitution and indel occurrence
• Do not always reach the same alignment during multiple runs
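The "general objective function" that iterative methods optimize can be illustrated with a sum-of-pairs score that applies the gap penalty GP = g + hl from the gap slides. This is a sketch under stated assumptions: sum-of-pairs is one common choice of objective, not necessarily the one PRRP or MUSCLE uses, and the function names and the values `match=1`, `mismatch=-1`, `g=2`, `h=1` are hypothetical.

```python
from itertools import combinations


def gap_penalty(length: int, g: float = 2.0, h: float = 1.0) -> float:
    """GP = g + h*l for a single gap of l positions."""
    return g + h * length


def pair_score(row_a: str, row_b: str, match: float = 1.0, mismatch: float = -1.0,
               g: float = 2.0, h: float = 1.0) -> float:
    """Score two rows of an MSA: substitution scores minus one GP per gap run."""
    score = 0.0
    run_owner = None  # which row holds the currently open gap run: "a", "b", or None
    run_length = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # column is empty for this pair, so it is skipped
        owner = "a" if x == "-" else "b" if y == "-" else None
        if run_owner is not None and owner != run_owner:
            score -= gap_penalty(run_length, g, h)  # close the previous gap run
            run_length = 0
        if owner is not None:
            run_length += 1
        else:
            score += match if x == y else mismatch
        run_owner = owner
    if run_owner is not None:
        score -= gap_penalty(run_length, g, h)  # close a gap run at the end
    return score


def sum_of_pairs(msa: list[str]) -> float:
    """Add up pair_score over every pair of rows in the alignment."""
    return sum(pair_score(a, b) for a, b in combinations(msa, 2))


print(pair_score("MNAL-SQLN-", "-NALMSQ-NH"))  # -6.0
print(sum_of_pairs(["AC-T", "ACGT", "A--T"]))  # -3.0
```

An iterative method in the hill-climbing style would propose a change to the alignment, recompute a score of this kind, and keep the change only if the score improves.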
