
Chapter 5 Multiple Sequence Alignment
•Multiple alignment is an extension of pairwise alignment in which more than two sequences are aligned
•This alignment provides insights not possible in pairwise alignments, such as
•Conserved sequence patterns
•Conserved and functionally critical amino acid residues
•Prerequisite for phylogenetic analyses
•Prediction of protein secondary and tertiary structures
•Design of degenerate PCR primers

Scoring Function
•The purpose of multiple alignment is to line up sequences so that a maximum number of residues from each sequence are matched according to a scoring function
•The scoring function is generally based on “sum of pairs” (SP)
•The SP score is the sum of all pairwise scores for all residues in the alignment

Blosum62 substitution matrix
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
C  9 -1 -1 -3  0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2
S -1  4  1 -1  1  0  1  0  0  0 -1 -1  0 -1 -2 -2 -2 -2 -2 -3
T -1  1  4  1 -1  1  0  1  0  0  0 -1  0 -1 -2 -2 -2 -2 -2 -3
P -3 -1  1  7 -1 -2 -2 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4
A  0  1 -1 -1  4  0 -2 -2 -1 -1 -2 -1 -1 -1 -1 -1  0 -2 -2 -3
G -3  0  1 -2  0  6  0 -1 -2 -2 -2 -2 -2 -3 -4 -4 -3 -3 -3 -2
N -3  1  0 -1 -1 -2  6  1  0  0  1  0  0 -2 -3 -3 -3 -3 -2 -4
D -3  0  1 -1 -2 -1  1  6  2  0  1 -2 -1 -3 -3 -4 -3 -3 -3 -4
E -4  0  0 -1 -1 -2  0  2  5  2  0  0  1 -2 -3 -3 -2 -3 -2 -3
Q -3  0  0 -1 -1 -2  0  0  2  5  0  1  1  0 -3 -2 -2 -3 -1 -2
H -3 -1  0 -2 -2 -2 -1 -1  0  0  8  0 -1 -2 -3 -3 -3 -1  2 -2
R -3 -1 -1 -2 -1 -2  0 -2  0  1  0  5  2 -1 -3 -2 -3 -3 -2 -3
K -3  0  0 -1 -1 -2  0 -1  1  1 -1  2  5 -1 -3 -2 -2 -3 -2 -3
M -1 -1 -1 -2 -1 -3 -2 -3 -2  0 -2 -1 -1  5  1  2  1  0 -1 -1
I -1 -2 -2 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3  1  4  2  3  0 -1 -3
L -1 -2 -2 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2  2  2  4  1  0 -1 -2
V -1 -2 -2 -2 -2  0 -3 -3 -3 -2 -2 -3 -3 -2  1  3  4 -1 -1 -3
F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3  0  0  0 -1  6  3  1
Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1  2 -2 -2 -1 -1 -1 -1  3  7  2
W -2 -3 -3 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3  1  2 11

Sequence 1: G K N
Sequence 2: T R N
Sequence 3: S H E
Column 1: G:T = 1, T:S = 1, G:S = 0, Total: 2
Column 2: K:R = 2, R:H = 0, K:H = -1, Total: 1
Column 3: N:N = 6, N:E = 0, N:E = 0, Total: 6
SP score: 2 + 1 + 6 = 9
Thus the alignment is 2^9 = 512 times more likely than by random chance

Exhaustive Algorithms
Brute Force Algorithm
•Like dynamic programming, it searches for the best solution by examining every possible solution
•Pairwise alignment uses a 2D matrix
•For N sequences, use an N-dimensional matrix
•The number of calculations increases exponentially (N×N×N×N×…)
•Generally only useful for <=10 short sequences

Divide and Conquer Alignment (DCA)
•Identify regional similarities in multiple sequences
•Do a brute force alignment of the similar regions
•Join the independently aligned regions
•http://bibiserv.techfak.uni-bielefeld.de/dca/

Heuristic Algorithm
Progressive Alignment Method
•Pairwise alignment by Needleman-Wunsch of all pairs
•Records similarity scores of aligned pairs
•Scores entered into a matrix
•Guide tree constructed that reflects similarity between aligned pairs
•Most closely related sequences re-aligned with Needleman-Wunsch
•Different substitution matrices are selected depending on the evolutionary distance between the sequences to be aligned
•Aligned pair converted to a “consensus sequence” with fixed gaps
•The consensus sequence is treated as an ordinary sequence in the next step, a pairwise alignment with the most closely related sequence in the guide tree
•The next “consensus sequence” is calculated and the process repeated until all sequences are aligned
•Most famous: ClustalW (command line) and ClustalX (GUI)
•http://www.ebi.ac.uk/Tools/clustalw2/index.html
•Download and install ClustalW from ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.0.9/
•Spend a few minutes entering sequences and doing alignments
•ClustalW uses gap penalties that are context sensitive:
•Gaps are penalized more heavily close to runs of hydrophobic amino acids (more likely to be in internal conserved regions of a protein) than next to hydrophilic regions or G, which are likely to be on the outside in loops
•Weighting scheme: closely related sequences are given a lower weighting score
•The weighting score depends on the branch length divided by the number of shared branches
•This minimizes a possible dominating effect of common sequences

Drawbacks and Solutions
•Based on global alignment, thus only sequences of similar length can be aligned
•The long gaps required to align sequences of dissimilar length are penalized
•“Greedy” algorithm: once gaps are introduced, they stay in subsequent consensus sequences

T-Coffee
•Tree-based Consistency Objective Function for alignment Evaluation
•http://www.ebi.ac.uk/Tools/t-coffee/
•http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
•Performs global alignment with Clustal
•Local pairwise alignment with Lalign
•Global and ten best local alignments are pooled to form a library
•All pairwise alignments are then aligned with a third possible sequence
•Distance matrix calculated to build a guide tree
•Guide tree used for final multiple alignment
•Does not get “stuck” in sub-optimal initial alignments
•Slower than Clustal

dbClustal
•First performs a BLASTP search for a query sequence
•Aligned pairs are analyzed to obtain anchor points (local conserved regions) using a program called Ballast
•Global alignment generated by Clustal, weighted toward the anchor points
•Initial local alignment minimizes errors in divergent sequences
•Multiple alignment subsequently evaluated by NorMD, which removes poorly aligned sequences
•http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustal+noid

Partial Order Alignment (POA)
•http://bioinformatics.ucla.edu/poa/
•Multiple alignments performed on more and more sequences from a list
•Identical residues condensed to nodes
•Each new sequence aligned with each sequence of the graph model
•Eliminates the problem of error fixation
•Faster and more accurate than Clustal

PRALINE
•http://zeus.cs.vu.nl/programs/pralinewww/
•Builds profiles of the sequences to be aligned
•Profiles generated by PSI-BLAST
•Because profiles contain information on close relatives, divergent sequences are more accurately aligned
•Program can incorporate protein secondary structure
•Very sophisticated but very slow

Iterative Alignment
PRRN
•Finds the optimal solution by iteratively modifying sub-optimal solutions
•http://prrn.ims.u-tokyo.ac.jp/
•Multiple alignment is performed on the whole group of sequences
•Sequences randomly distributed into two groups
•Dynamic programming applied to consensus sequences derived from each group
•The random split is repeated and another round of dynamic programming alignment performed
•This is repeated until the alignment score no longer increases
•A multiple alignment of the sequences is then performed again
•Process repeated until the multiple alignment score no longer improves

Iterative Alignment
DIALIGN2
•http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=dialign
•Breaks all sequences down into segments, and performs alignment between segments
•High-scoring segments are progressively assembled into larger and larger blocks
•The score of an alignment is calculated from the blocks and not from individual residues
•Sequence regions between blocks are left unaligned
•Well suited to alignment of divergent sequences

Practical Issues
•DNA alignments are based on only 4 nucleotides, and are less reliable than protein sequence alignments
•Alignment of DNA sequences does not consider functional issues, such as gene boundaries
•Insertion of gaps may “break” codons or cause frameshifts that would not be tolerated in the protein, which is functional nonsense
•Thus, it is always better to align protein sequences
•Possible to convert DNA to amino acid sequence, then align, and then decode back to DNA
•RevTrans (http://www.cbs.dtu.dk/services/RevTrans/)
•PROTA2DNA (missing link…)

Editing and Format
•Most alignment programs require final editing by a human to ensure that there are no problems in functionality
•Finding badly aligned regions
•Removing nonsensical gaps etc.
•http://www.mbio.ncsu.edu/bioEdit/bioedit.html
•Need to convert one sequence format to another: http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi/
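
Sum-of-pairs sketch
The sum-of-pairs calculation from the Scoring Function slide can be reproduced in a few lines of Python. This is a minimal sketch, not a full scorer: the `PAIR_SCORES` table holds only the pair scores quoted in that worked example (not a complete substitution matrix), the helper names are hypothetical, and gaps are not handled.

```python
from itertools import combinations

# Pair scores copied from the worked example on the Scoring Function slide.
# Only the pairs used there are listed; this is NOT a full substitution matrix.
PAIR_SCORES = {
    ("G", "T"): 1, ("T", "S"): 1, ("G", "S"): 0,
    ("K", "R"): 2, ("R", "H"): 0, ("K", "H"): -1,
    ("N", "N"): 6, ("N", "E"): 0,
}


def pair_score(a, b):
    """Look up a pair score in either order (hypothetical helper)."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def column_score(column):
    """Sum the scores of every pair of residues in one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(alignment):
    """Sum-of-pairs score: add the column scores over all columns."""
    return sum(column_score(col) for col in zip(*alignment))


# The three aligned sequences from the slide, one string per sequence.
alignment = ["GKN", "TRN", "SHE"]
column_totals = [column_score(col) for col in zip(*alignment)]
total = sp_score(alignment)
print(column_totals, total)
```

With three sequences each column contributes three pairs. In general a column of N sequences contributes N(N-1)/2 pairs.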

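Codon threading sketch
The Practical Issues slide suggests aligning at the protein level and then decoding back to DNA. The sketch below illustrates that last step under simplifying assumptions: each DNA sequence is an in-frame coding sequence with no stop codon, and its translation matches the aligned protein (this is not checked). The function name `thread_codons` and the toy sequences are made up for illustration, and this is not a description of how RevTrans is implemented.

```python
def thread_codons(aligned_protein, dna, gap="-"):
    """Map an aligned protein row back to DNA, one codon per residue.

    Each gap in the protein row becomes three gap characters, so gaps
    never split a codon or shift the reading frame.
    """
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be 3x the number of residues")
    # Walk through the DNA one codon at a time.
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    out = []
    for ch in aligned_protein:
        out.append(gap * 3 if ch == gap else next(codons))
    return "".join(out)


# Toy protein alignment and the (assumed) coding sequences behind it.
protein_alignment = ["MK-L", "MKAL"]
dna_sequences = ["ATGAAACTG", "ATGAAAGCGCTG"]

codon_alignment = [
    thread_codons(protein_row, dna)
    for protein_row, dna in zip(protein_alignment, dna_sequences)
]
for row in codon_alignment:
    print(row)
```

Because every protein gap expands to exactly three DNA gap characters, the resulting DNA alignment keeps all codons intact.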