Sequence Alignment
• Only things that are homologous should be compared in a phylogenetic analysis
• Homologous – sharing a common ancestor
• This is true for morphological characters and must also be true for molecular characters, or the entire analysis is meaningless
• Two different types of homology
  – Paralogous sequences are homologous due to duplication
  – Orthologous sequences are homologous due to speciation
• Paralogous comparisons can be useful, but in most cases we are interested in orthologous comparisons

Sequence Alignment
• Only things that are homologous should be compared in a phylogenetic analysis
• It is relatively easy to determine orthology/paralogy in the case of genes/sequences
• We must determine homology for each and every nucleotide/amino acid position within the sequences
• This is accomplished via sequence alignment
• It is THE CRITICAL first step in most phylogenetic analyses
• Always remember that the sequence alignment itself is a hypothesis about the homology of multiple positions in a set of protein or nucleotide sequences

Sequence Alignment
• A multiple sequence alignment aims to find homology among as many residues in a group of sequences as possible
• Most of the time, gaps must be introduced in order to align the sequences
• Gaps represent indels (insertion/deletion events) that have presumably occurred since the sequences diverged in evolutionary history
• All of this works on the assumption that the sequences began as the same sequence and diverged over time due to mutations (substitutions, insertions, deletions)
Sequence Alignment
• The problem of repeats
• Repeated nucleotides (simple sequence repeats, SSRs) make alignment difficult

Sequence Alignment
• Substitutions – point changes in sequences over time
• Sequence identity – the number of identical residues in an alignment divided by the number of aligned positions
• Gaps are not counted, so it can be a misleading number
• Example: an amino acid sequence alignment

Sequence Alignment
• Note the indels
  – They represent an assumption that there has been an insertion and/or deletion in one or both sequences relative to each other (we can rarely know which it is for sure)
• Note the blocks of identical residues
  – They likely represent functionally important amino acids
  – Functional importance can be structural or enzymatic or both

Sequence Alignment
• Amino acid alignments have an advantage over DNA/RNA alignments
• Side chains of amino acids can be grouped according to chemical properties (basic, acidic, polar, nonpolar, charged, uncharged, hydrophobic, hydrophilic)
• Evolutionary theory suggests that substitutions to similar amino acids will be tolerated more readily than more drastic changes

Sequence Alignment
• We can take advantage of this pattern to inform and aid the process of aligning protein sequences
• Dayhoff et al. (1978) developed a matrix to inform alignments based on assigning weights to various substitutions
• Based on 1572 observed changes in closely related protein sequences
• Higher weight – less likely change
[Figure: PAM matrix]

Sequence Alignment
• Most modern analyses use variations of the BLOSUM (BLOcks SUbstitution Matrix) matrices of Henikoff and Henikoff (1992)
• High number = likely substitution
• The idea is to find an alignment with the highest score
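The sequence identity definition given in these slides can be made concrete with a short sketch. The function name, the `count_gaps` flag, and the toy sequences below are illustrative assumptions, not part of the original slides; the second call shows why ignoring gapped columns can make identity a misleading number.

```python
def sequence_identity(seq_a, seq_b, count_gaps=False, gap="-"):
    """Identity of two pre-aligned sequences of equal length.

    By default, columns containing a gap are skipped, matching the slide's
    definition (identical residues / aligned positions). With
    count_gaps=True, every column counts toward the denominator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        has_gap = a == gap or b == gap
        if has_gap and not count_gaps:
            continue  # gapped columns are not counted as aligned positions
        compared += 1
        if not has_gap and a == b:
            identical += 1
    return identical / compared if compared else 0.0


# Hypothetical toy alignment with two indels.
row_1 = "ACD-EFG"
row_2 = "ACDHE-G"
identity_ignoring_gaps = sequence_identity(row_1, row_2)
identity_counting_gaps = sequence_identity(row_1, row_2, count_gaps=True)
```

With gaps ignored, the toy pair is scored as fully identical even though two indels separate the sequences; counting the gapped columns lowers the value to 5 of 7 columns.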
[Figure: BLOSUM62 matrix]

Sequence Alignment
• Gaps
• Gaps are introduced to help maximize an alignment score
• Gaps can easily be added willy-nilly by alignment programs
  – Think about it: to obtain the highest score, just keep moving along the sequence to which you are aligning until you find a matching base
• Gap penalties – subtractions from the alignment score when a gap is introduced
• GP = g + hl
• g = gap opening penalty, h = gap extension penalty, l = gap length
• No real biological justification for the formula
• In reality the origin of the gap must be taken into account, but no models exist to do this
• The best-scoring alignment may not reflect biological reality

Sequence Alignment
• Gaps mean something – what that is is subject to debate
• Most software ignores gaps by default; other programs utilize them, but with no biological model to support their weight
• All of the previous information applies in some ways to DNA/RNA sequence alignments
• Nucleic acids form secondary structures and may have blocks of conserved sequence
• Some nucleotides are more likely to change to certain other nucleotides

Sequence Alignment
• Multiple alignment algorithms
• Dot-matrix sequence comparison
• A dot-plot is constructed
[Figure: dot plot comparing the sequences MNALSQLN and NALMSQNH]

Sequence Alignment
• Dot-matrix sequence comparison
• Gaps are indicated by deviations from a diagonal
[Figure: dot plot of MNALSQLN against NALMSQNH; annotations mark where M matches with a gap and where L matches with a gap]
• Stage 1: Sort the ends out
• Stage 2: Align the middle, using triangles to indicate gaps
  NAL-SQLN
  NALMSQ-N
• Resulting alignment:
  MNAL-SQLN-
  -NALMSQ-NH

Sequence Alignment
• Dot-matrix sequence comparison
• Same for nucleotide alignments

Sequence Alignment
• Dot-matrix sequence comparison
• Method is great for getting an overall picture of the quality of the alignment and for identifying features of the sequences
• Detecting exons and similar genes in divergent taxa
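The dot-matrix comparison described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular program: the function names are assumptions, and it uses exact residue matches only (no window or threshold filtering). It uses the two sequences from the slides, MNALSQLN and NALMSQNH.

```python
def dot_plot(seq_x, seq_y):
    """Boolean matrix: rows follow seq_y, columns follow seq_x."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Text rendering of the dot plot, '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


def diagonal_runs(matrix, min_len=2):
    """Return (x_start, y_start, length) for each diagonal run of matches."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    runs = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not matrix[r][c]:
                continue
            # Skip cells that merely continue a run started further up-left.
            if r > 0 and c > 0 and matrix[r - 1][c - 1]:
                continue
            length = 0
            while (r + length < n_rows and c + length < n_cols
                   and matrix[r + length][c + length]):
                length += 1
            if length >= min_len:
                runs.append((c, r, length))
    return runs


plot = dot_plot("MNALSQLN", "NALMSQNH")
runs = diagonal_runs(plot)
print(render("MNALSQLN", "NALMSQNH"))
```

The two runs found (NAL and SQ) lie on different diagonals, which is the deviation from a single diagonal that signals a gap between them.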
Sequence Alignment
• Dot-matrix sequence comparison
• Detecting repetitive sequences
• Self-align using a dot matrix

Sequence Alignment
• Dynamic programming
• Keep in mind that until now, we've only been talking about TWO sequences
• Dynamic programming can be used to find scores for all possible pairs of aligned residues and all possible pairs of sequences
• A score for each pair (Dij) is calculated and all possible Dij's are summed to get a score
• Sequence pairs can be weighted to give preference to more reliable pairs
• Time and memory requirements grow exponentially with the number of sequences
• Prohibitive for more than a few sequences
• Some problems with dynamic programming can be overcome by using short subsection alignments (instead of global alignments) via DIALIGN (Morgenstern, 1999)

Sequence Alignment
• Progressive alignments
• Typically, we are trying to find the phylogeny given the sequences
• It would be easier to align the sequences if we knew the phylogeny
• Build a quick and dirty guide tree and use it as the basis for the alignment
• Fast and reasonably reliable
• Align all possible pairs, generate genetic distances and build a guide tree
• Build the multiple sequence alignment by following the branching order of the tree from the most similar sequences to the least similar

Sequence Alignment
• Progressive alignments
• ClustalW and ClustalX
• ClustalX is just ClustalW with a built-in GUI
• Uses a progressive method
• Downweights sequences according to guide tree relatedness
• Can vary the weight matrix for protein sequences automatically according to relatedness of the sequences
• Limitation – Final results are highly dependent on initial alignments
  – Initial alignments are always incorporated into the final result; that is, once a sequence has been aligned into the MSA, its alignment is not considered further. This approximation improves efficiency at the cost of accuracy.
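Progressive methods start by aligning all possible pairs, and each pairwise alignment is typically scored by dynamic programming. The sketch below computes the best global alignment score for two sequences using the gap penalty GP = g + hl from the earlier slide. It is a hedged illustration using the standard three-table affine-gap recurrences, not the code of ClustalW or any other program; the function names and the simple match/mismatch scoring are assumptions.

```python
NEG_INF = float("-inf")


def simple_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def global_alignment_score(seq_a, seq_b, score, gap_open, gap_extend):
    """Best global alignment score with gap cost gap_open + gap_extend * length."""
    n, m = len(seq_a), len(seq_b)
    # match[i][j]: best score with seq_a[i-1] aligned to seq_b[j-1]
    # gap_b[i][j]: best score with seq_a[i-1] aligned to a gap
    # gap_a[i][j]: best score with seq_b[j-1] aligned to a gap
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_a = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap_b = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        gap_b[i][0] = -(gap_open + gap_extend * i)  # leading gap of length i
    for j in range(1, m + 1):
        gap_a[0][j] = -(gap_open + gap_extend * j)  # leading gap of length j
    open_cost = gap_open + gap_extend  # cost of a new gap of length 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(seq_a[i - 1], seq_b[j - 1])
            match[i][j] = s + max(match[i - 1][j - 1],
                                  gap_a[i - 1][j - 1],
                                  gap_b[i - 1][j - 1])
            gap_b[i][j] = max(match[i - 1][j] - open_cost,
                              gap_a[i - 1][j] - open_cost,
                              gap_b[i - 1][j] - gap_extend)
            gap_a[i][j] = max(match[i][j - 1] - open_cost,
                              gap_b[i][j - 1] - open_cost,
                              gap_a[i][j - 1] - gap_extend)
    return max(match[n][m], gap_a[n][m], gap_b[n][m])


identical_score = global_alignment_score("ACGT", "ACGT", simple_score, 2, 1)
one_gap_score = global_alignment_score("ACGT", "AGT", simple_score, 2, 1)
```

In the second call, the best alignment pairs three matching bases and opens one gap of length 1, so the three matches are exactly cancelled by the penalty g + h = 3. The tables grow with the product of the two sequence lengths, which is why extending the same idea to many sequences at once becomes prohibitive.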
Sequence Alignment
• Progressive alignments
• T-Coffee
• Corrects an inherent problem of progressive alignments
  – Early alignment mistakes cannot be corrected later in the process
• Calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that align each sequence of the pair to a third sequence
• Uses the output from other local alignment programs to find multiple regions of local alignment between two sequences
• The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate weighting factors
• Slower but more accurate than Clustal

Sequence Alignment
• Iterative alignments
• Work similarly to progressive methods but repeatedly realign the initial sequences as well as adding new sequences to the growing MSA
• Iterative methods can return to previously calculated pairwise alignments or sub-MSAs incorporating subsets of the query sequences as a means of optimizing a general objective function such as finding a high-quality alignment score

Sequence Alignment
• Iterative alignments
• The software package PRRN/PRRP uses a hill-climbing algorithm to optimize its MSA alignment score and iteratively corrects both alignment weights and locally divergent or "gappy" regions of the growing MSA
• PRRP performs best when refining an alignment previously constructed by a faster method. The alignment of individual motifs is then achieved with a matrix representation similar to a dot-matrix plot in a pairwise alignment.
• MUSCLE (multiple sequence alignment by log-expectation) improves on progressive methods with a more accurate distance measure to assess the relatedness of two sequences. The distance measure is updated between iteration stages.

Sequence Alignment
• Hidden Markov model alignments
• Use probabilistic models of substitution and indel occurrence
• Do not always reach the same alignment during multiple runs
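The iterative methods described in these slides need an objective function to optimize. One simple possibility, in the spirit of the summed pairwise scores (Dij) mentioned under dynamic programming, is a sum-of-pairs score over the columns of an MSA. The sketch below is an assumption for illustration only: the function names, the flat per-position gap score, and the toy alignment are hypothetical, and real programs such as PRRP or MUSCLE use more elaborate, weighted objectives.

```python
from itertools import combinations


def match_mismatch(a, b):
    """Hypothetical residue scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def sum_of_pairs(alignment, score, gap_score=-1, gap="-"):
    """Sum pairwise scores over every column of a multiple alignment.

    A residue paired with a gap contributes gap_score; two gaps in the
    same column contribute nothing. This is a flat per-position gap cost,
    a simplification of the affine penalty g + h*l.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of the alignment must have equal length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == gap and b == gap:
                continue
            if a == gap or b == gap:
                total += gap_score
            else:
                total += score(a, b)
    return total


# Hypothetical three-sequence alignment.
toy_msa = ["AC-T", "ACGT", "A-GT"]
toy_score = sum_of_pairs(toy_msa, match_mismatch)
```

An iterative refinement loop could realign part of the MSA, recompute this score, and keep the change only if the score improves, which is the hill-climbing idea in outline.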