Sequence alignment I.
Sándor Pongor
With slides adapted from David Judge, Jack Leunissen and Christoph Sensen
September 27, 2016

Last lectures
Representations (unstructured, structured, mixed).
Core operations: Comparison gives
1) Proximity measures (similarities, distances)
2) Motifs (from pairwise and multiple alignment of sequences).
Main distance and similarity measures
Aggregation of numbers, vectors, sequences (distance matrices, trees, heatmaps, multiple alignments)
Projections onto sequences
1D plots

This lecture (Part I-2)
Edit distance (refresh)
Substitution matrices (PAM, BLOSUM, how to build your own)
The two basic sequence alignment algorithms
Algorithm types according to how we do it 1: exhaustive and heuristic, global and local.
Algorithm types according to what we compare 2: two sequences, seq. vs. database, seq. vs. genome, many seqs. vs. genome, etc.
Applications

The tree of bioinformatics: core, branches, leaves
Bioinfo algorithms
Core data, core principles
New branches and leaves every year…
Application example: Communication in bacteria

The input of a sequence alignment algorithm
1) Two sequences
2) A scoring scheme (a score formula AND a scoring (substitution) matrix)*
* This also applies to the comparison of 3D structures of macromolecules, or of any other linear object descriptions (macromolecules have a linear backbone).

The steps of sequence alignment
1) Find all possible alignments (mappings) between two sequences and find the best one according to some “quick” score.*
2) Calculate a final quantitative score** for the best alignment (matching).
* This sketch symbolizes local alignment to indicate that there are many possible mappings. The same is true for global alignments as well.
** Usually an approximate edit distance with a scoring matrix, which we discussed last time.

The results of sequence alignment
1) A score (similarity score or distance)
2) A motif (common subsequence, consensus description…)
The human mind also describes similarity in terms of patterns and scores. But the patterns are stored in human memory in a smart way…
Motif: AGACXTGA.CTGA

Sequence similarity score
Range of alignment or High Scoring Pair (HSP)
The score S is a sum of costs assigned to identities and mismatches, minus a penalty for gaps. Costs are stored in the substitution matrix. The gap penalty is usually a sum of gap-opening and gap-extension costs.
2017.05.06. TÁMOP – 4.1.2-08/2/A/KMR-2009-0006

Alignment score
$$\text{Score} = \sum_{\text{start}}^{\text{end}} \text{Similarity\_weights} - \text{Penalties}$$
$$\text{(Gap) penalty} = \text{Gap\_init} + \sum_{\text{gap start}}^{\text{end}} \text{Gap\_length}$$

Gap penalty functions
Linear: penalty rises monotonically with the length of the gap
Affine: penalty has a gap-opening and a separate length component
Probabilistic: penalties may depend upon the neighboring residues
Other functions: penalty first rises fast, but levels off at greater length values
No dramatic differences. Affine gaps are widely used.

A simple example (alignment without gaps):
For a match/mismatch we look up the value in the substitution matrix. The matrix is a lookup table…

Substitution matrices in detail
The substitution matrix (also called scoring matrix) contains costs for amino acid identities and substitutions in an alignment.
For amino acids, it is a 20x20 symmetrical matrix that can be constructed from pairwise alignments of related sequences.
“Related” means either
a) evolutionary relatedness described by an “approved” evolutionary tree (Dayhoff’s PAM matrices), or
b) any sequence similarity as described in the PROSITE database (Henikoff’s BLOSUM matrices).
Groups of related sequences can be organized into a multiple alignment for calculation of the matrix elements.
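The score formula above (substitution-matrix weights summed over the aligned positions, minus gap penalties) can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, the +5/-4 DNA weights and the default gap costs (gap_init = 10, gap_length = 1) are assumptions chosen only for the example.

```python
def affine_gap_penalty(length, gap_init=10, gap_length=1):
    """Penalty for one gap: an opening cost plus a cost per gapped position."""
    if length <= 0:
        return 0
    return gap_init + gap_length * length


def alignment_score(row_a, row_b, weights, gap_init=10, gap_length=1):
    """Score two equal-length alignment rows; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    run = 0  # length of the gap run currently being read
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            run += 1
            continue
        # A residue pair closes any open gap run, so charge for it now.
        score -= affine_gap_penalty(run, gap_init, gap_length)
        run = 0
        score += weights[(a, b)]  # lookup in the substitution matrix
    score -= affine_gap_penalty(run, gap_init, gap_length)  # gap at the very end
    return score


# Hypothetical DNA weights: +5 for an identity, -4 for a mismatch.
WEIGHTS = {(a, b): (5 if a == b else -4) for a in "ACGT" for b in "ACGT"}

print(alignment_score("ACGTACGT", "ACGAACGT", WEIGHTS))  # 31: 7 identities, 1 mismatch
print(alignment_score("ACGTTACG", "ACG--ACG", WEIGHTS))  # 18: 6 identities, one gap of length 2
```

With an affine penalty, one gap of length 2 costs 12 here, whereas two separate gaps of length 1 would cost 22; this is why affine schemes favor few long gaps over many short ones.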
Substitution matrices (cost matrices)
Calculation of scoring matrices from multiple alignments.
ASDESKLVV
ATDDATLSI
ASDSERITV
f(S/T)=3, f(S)=5, f(T)=3
Matrix elements are calculated from the observed and expected frequencies (using a “log odds” principle). E.g. for S/T (indicated in red):
$$M(S/T) = \log \frac{f(S/T)}{f(S)\, f(T)}$$
S/T denotes that S is aligned with T or T with S. The values are calculated from many multiple alignments (not just one). The log-odds values in the matrix are then normalized to a given range depending on the application (e.g. -5 to +15, for historical reasons; the range does not matter much).

The problem of making a substitution matrix (a Münchausen problem)
Problem: To make a matrix you need a multiple alignment, but to make a multiple alignment you need a matrix.
The first-generation solution: make multiple alignments by hand, using known proteins. Very tedious; this gives the so-called PAM matrix.
The second-generation solution: make multiple alignments with a program using the PAM matrix, and then extract large-scale statistics from conserved regions; this gives the so-called BLOSUM matrix.
[Figure: matrix image; surviving label: “All entries 104”]

PAM matrices
Pam_1 = 1% of amino acids mutate
Pam_30 = (Pam_1)^30 (matrix multiplication)
PAM 250 (the higher the number, the higher the divergence)
[Figure: PAM 250 matrix with amino acids grouped as small, polar, basic, large, aromatic]
Note: chemically similar amino acids are near each other.

Scoring matrices used today
BLOSUM matrices (most often used)
Developed by Henikoff & Henikoff (1992)
BLOcks SUbstitution Matrix
Derived from the BLOCKS database
PAM matrices
Developed by Schwartz and Dayhoff (1978)
Point Accepted Mutation
Derived from manual alignments of closely related proteins

PAM versus BLOSUM
PAM:
First useful scoring matrix for proteins
Assumed a Markov model of evolution (i.e. all sites equally mutable and independent)
Derived from small, closely related proteins with ~15% divergence
BLOSUM:
Much later entry to the matrix “sweepstakes”
No evolutionary model is assumed
Built from sequence blocks taken from PROSITE (functionally similar segments of proteins)
Uses a much larger, more diverse set of protein sequences (30% - 90% ID)

PAM versus BLOSUM
PAM:
Higher PAM numbers to detect more remote sequence similarities
Lower PAM numbers to detect high similarities
1 PAM ~ 1 million years of divergence
Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM:
Lower BLOSUM numbers to detect more remote sequence similarities
Higher BLOSUM numbers to detect high similarities
Sensitive to structural and functional substitution
Errors in BLOSUM arise from errors in alignment

PAM Matrices
PAM 40 - prepared by raising PAM 1 to the 40th power (repeated matrix multiplication); best for short alignments with high similarity
PAM 120 - prepared by raising PAM 1 to the 120th power; best for general alignment
PAM 250 - prepared by raising PAM 1 to the 250th power; best for detecting distant sequence similarity

BLOSUM Matrices
BLOSUM 90 - prepared from BLOCKS sequences clustered at 90% sequence ID (more similar sequences count as one); best for short alignments with high similarity
BLOSUM 62 - prepared from BLOCKS sequences clustered at 62% sequence ID; best for general alignment (default)
BLOSUM 30 - prepared from BLOCKS sequences clustered at 30% sequence ID; best for detecting weak local alignments

Scores (slide by David Landsman, NCBI)
            V    D    S    –    C    Y
            V    E    T    L    C    F    Total
BLOSUM62   +4   +2   +1  -12   +9   +3      7
PAM30      +7   +2    0  -10  +10   +2     11

Nucleic acid matrices
Needleman-Wunsch *
     A   C   G   T
A   10   0   0   0
C    0  10   0   0
G    0   0  10   0
T    0   0   0  10

Smith-Waterman *
     A   C   G   T
A   10  -9  -9  -9
C   -9  10  -9  -9
G   -9  -9  10  -9
T   -9  -9  -9  10

1) The magnitudes of the elements are relative and can be scaled.
2) Other heuristic matrices can be easily constructed. Identity matrix: diagonal = 1, rest = 0. Or one can penalize certain associations by assigning a large negative value to them, etc.
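Returning to the log-odds construction of matrix elements, the counts quoted for the three-sequence toy alignment (f(S/T)=3, f(S)=5, f(T)=3) can be reproduced with a short sketch. The helper names are hypothetical, the natural logarithm is an assumption (the slide does not state a base), and the raw counts are plugged into the formula exactly as on the slide; a real matrix would pool frequencies from many alignments and rescale the result.

```python
import math
from itertools import combinations

# Toy multiple alignment from the slide (one string per sequence).
ALIGNMENT = ["ASDESKLVV", "ATDDATLSI", "ASDSERITV"]


def residue_count(alignment, residue):
    """f(X): how many times residue X occurs anywhere in the alignment."""
    return sum(seq.count(residue) for seq in alignment)


def pair_count(alignment, x, y):
    """f(X/Y): number of sequence pairs with X aligned to Y (either order) in a column."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if {a, b} == {x, y}:
                total += 1
    return total


def log_odds(f_xy, f_x, f_y):
    """M(X/Y) = log(f(X/Y) / (f(X) * f(Y))), using the natural log (assumption)."""
    return math.log(f_xy / (f_x * f_y))


f_st = pair_count(ALIGNMENT, "S", "T")
f_s = residue_count(ALIGNMENT, "S")
f_t = residue_count(ALIGNMENT, "T")
print(f_st, f_s, f_t)            # 3 5 3, the counts given on the slide
print(log_odds(f_st, f_s, f_t))  # unnormalized value, before rescaling to the matrix range
```

The S/T pairs come from the second column (S, T, S gives two pairs) and the eighth column (V, S, T gives one pair).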
* These are the names of two classical algorithms, to be discussed in the next section.

Dot plots
Visual comparison of sequences
A method of visualizing matching positions in biological sequences
This presentation was created by David and Paul Judge

Graphical comparison of sequences using “Dotplots”. Basic Principles.
1) Write two sequences of length n along two perpendicular axes. This defines an n x n matrix, called the dot matrix or alignment matrix.
2) Imagine that we now put a red dot at those positions (i, j) where the nucleotides x(i) and y(j) are identical.
3) If the two sequences are identical, the main diagonal will be red: x(i) = y(i) all along the sequences.
Example sequences:
ACCTGCCCTGTCCAGCTTACATGCATGCTTATAGGGGCATTTTACAT
ACCTGCCGATTCCATATTACGCATGCTTCTGGGTTACCGTTCAGGGCATTTTACATGTGCTG
4) If 10 nucleotides are deleted in sequence y at a certain position, but the two are otherwise identical, then after the point of deletion y(i) = x(i+10). We can view this in two ways:
y(i) = x(i+10): insertion of 10 nt in x
or y(i-10) = x(i): deletion in y

Diagonal runs of dots indicate similar regions.
Summary: Dotplots provide a comprehensive overview but NO detail.

Words, rather than single letters, are compared using:
A “word size” (11, say)
A “scoring scheme” (1 for a match, 0 for a mismatch, say):
     A   T   G   C
A    1   0   0   0
T    0   1   0   0
G    0   0   1   0
C    0   0   0   1
Example word pair:
ATGCTTATAGG
ATGCTTCTGGG
1+1+1+1+1+1+0+1+0+1+1 = 9
A “cut-off score” (8, say): a dot is plotted for a word pair only if its score reaches the cut-off.

Matching bit or character strings: Hamming distance
A  1: 01010010
   2: 11010001    D12 = 3
B  1: BIRD
   2: WORD        D12 = 2
• The Hamming distance is the number of exchanges necessary to turn one string of bits or characters into another one (the number of positions not connected with a straight line). The two strings are of identical length and no alignment is done.
• The exchanges in character strings can have different costs, stored in a lookup table. In this case the value of the Hamming distance will be the sum of costs, rather than the number of exchanges. WE USE THIS IN DOT PLOT.

Graphical comparison of sequences using “Dotplots”. Scoring Schemes.
DNA: The simplest scheme is the identity matrix.
     A   T   G   C
A    1   0   0   0
T    0   1   0   0
G    0   0   1   0
C    0   0   0   1
More complex matrices can be used. For example, the default EMBOSS DNA scoring matrix is:
     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5
The use of negative numbers is only pertinent when these matrices are used for computing textual alignments.
Using a wider spread of scores eases the expansion of the scoring matrix to sensibly include ambiguity codes.
[Figure: scoring matrix extended to the IUB ambiguity codes (A C G T S W R Y K M B V H D N U), overlaid with a protein scoring matrix; the individual values are not recoverable from the extracted text.]

IUB DNA Alphabet
Code   Meaning        Bases
A      A              A
C      C              C
G      G              G
T/U    T              T
M      aMino          A|C
R      puRine         A|G
W      Weak           A|T
S      Strong         C|G
Y      pYrimidine     C|T
K      Keto           G|T
V      not T          A|C|G
H      not G          A|C|T
D      not C          A|G|T
B      not A          C|G|T
N      aNy            A|C|G|T

For protein sequence dotplots, more complex scoring schemes are required. Scores must reflect far more than alphabetic identity.

Graphical comparison of sequences using “Dotplots”. Faster plots for perfect matches.
To detect perfectly matching words, a dotplot program has a choice of strategies.
1) Select a scoring scheme (the identity matrix: 1 for a match, 0 for a mismatch) and a word size (11, say). For every pair of words, compute a word match score in the normal way.
If the maximum possible score (11) is achieved, e.g. ATGCTTATAGG against ATGCTTATAGG:
1+1+1+1+1+1+1+1+1+1+1 = 11. Celebrate with a dot.
If the maximum possible score (still 11) is not achieved, e.g. ATGCTTATAGG against ATGCTTCTGGG:
1+1+1+1+1+1+0+1+0+1+1 = 9. Do not celebrate with a dot.
2) OR: for every pair of words, see if the letters are exactly the same.
If they are (✓✓✓✓✓✓✓✓✓✓✓): celebrate with a dot.
If they are not (✓✓✓✓✓✓✗✓✗✓✓): do not celebrate with a dot.
To detect exactly matching words, fast character string matching can replace the laborious computation of match scores to be compared with a cut-off score.
Many packages include a dotplot option specifically for detecting exactly matching words. This is a particular advantage when seeking strong matches in long DNA sequences.

Graphical comparison of sequences using “Dotplots”. Dotplot parameters.
There are three parameters to consider for a dotplot:
1) The scoring scheme
2) The cut-off score
3) The word size

Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Scoring scheme.
DNA
Usually, DNA scoring schemes award a fixed reward for each matched pair of bases and a fixed penalty for each mismatched pair of bases.
Choosing between such scoring schemes will affect only the choice of a sensible cut-off score and the way ambiguity codes are treated.
Protein
Protein scoring schemes differ in the evolutionary distance assumed between the proteins being compared. The choice is rarely crucial for dotplot programs.

Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Cut-off score.
The higher the cut-off score, the fewer dots will be plotted. But each dot is more likely to be significant.
The lower the cut-off score, the more dots will be plotted. But dots are more likely to indicate a chance match (noise).

Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Cut-off score.
Scoring scheme: PAM 250, word size: 25, cut-off scores: 5, 10, 20, 30.
[Figure: the same dotplot drawn at the four cut-off scores. Panel captions, as far as recoverable: four strong regions are clearly apparent; the four regions become clearer and some other, weaker features appear; more “features”, probably noise, appear, obscuring the original four clear regions; cut-off too low, too much noise to see the interesting regions.]

Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Word size.
Large words can miss small matches.
Smaller words pick up smaller features.
The smallest “features” are often just “noise”.

Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Word size.
For sequences with regions of small matching features:
Small words pick up small features individually.
Larger words show matching regions more clearly. The lack of detail can be an advantage.

Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Word size.
Using a relatively large word size of 25, features are drawn with a broad brush. Detail can be missed.
Superimposing a plot with a smaller word size of 11 shows the emergence of extra dots. In this case, probably all noise.
Displaying the word size 11 plot alone shows that the major features are drawn more “carefully”. Arguably, less usefully if a broad overview is the objective.
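The three dotplot parameters discussed above (scoring scheme, cut-off score, word size) can be tried out with a minimal sketch. The function names and the text rendering are hypothetical choices for illustration; the scoring scheme is the identity matrix (1 for a match, 0 for a mismatch), and the sequences and the example word pair are the ones shown on the slides.

```python
def word_score(word_x, word_y, match=1, mismatch=0):
    """Sum of per-position scores for two words of equal length."""
    return sum(match if a == b else mismatch for a, b in zip(word_x, word_y))


def dotplot(seq_x, seq_y, word_size=11, cutoff=8):
    """Return the set of (i, j) word start positions whose score reaches the cut-off."""
    dots = set()
    for i in range(len(seq_x) - word_size + 1):
        word_x = seq_x[i:i + word_size]
        for j in range(len(seq_y) - word_size + 1):
            if word_score(word_x, seq_y[j:j + word_size]) >= cutoff:
                dots.add((i, j))
    return dots


def render(dots, len_x, len_y, word_size):
    """Draw the dots as text: rows follow seq_y, columns follow seq_x."""
    rows = []
    for j in range(len_y - word_size + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for i in range(len_x - word_size + 1)))
    return "\n".join(rows)


x = "ACCTGCCCTGTCCAGCTTACATGCATGCTTATAGGGGCATTTTACAT"
y = "ACCTGCCGATTCCATATTACGCATGCTTCTGGGTTACCGTTCAGGGCATTTTACATGTGCTG"

print(word_score("ATGCTTATAGG", "ATGCTTCTGGG"))  # 9, the word pair from the slides
for cutoff in (11, 9, 7):
    # Lowering the cut-off can only add dots, never remove them.
    print(cutoff, len(dotplot(x, y, word_size=11, cutoff=cutoff)))
print(render(dotplot(x, y, word_size=11, cutoff=9), len(x), len(y), 11))
```

Setting the cut-off equal to the word size reproduces the "perfect matches only" strategy; in that case a plain string comparison of the two words would give the same dots faster.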
Graphical comparison of sequences using “Dotplots”. Other uses of dotplots.
Detection of Repeats
[Figures: dotplots illustrating repeats]
Detection of Stem Loops
[Figures: dotplots illustrating stem loops]

What you should know
• What a dot plot is
• Parameters (scoring scheme, cut-off score, word size)
• Appearance of related regions (between sequences)
• Repeats within sequences
• Palindromes (within sequences)
• Programs
The End.

Example
Bacterial sensor protein binds communication signal, then binds to DNA and initiates transcription.
[Figure labels: Signal, Signal binding, DNA binding, RNA polymerase, DNA]
The “normal” domain architecture of the sensor protein: Signal binding, DNA-binding

Example
Normal sequence, inverted sequence
Normal and shuffled (inverted) bacterial sensor proteins can fulfill the same function.
Do we see the difference by dot plot?
[Figure: dot plot; axes: Normal sequence, Inverted sequence]

The two fundamental sequence alignment algorithms
Global and local alignment

Pairwise alignment – the simplest case
Why?
We have two (protein or DNA) sequences originating from a common ancestor.
The purpose of an alignment is to line up all residues that were derived from the same residue position in the ancestral gene or protein in the two sequences.
gap = insertion or deletion

Types of algorithms according to how we do it: Global and local
Global: global similarities, e.g. between single domains…
Local: local similarities, e.g. between multidomain proteins…

Global alignment
Align two sequences from “head to toe”, i.e. from 5’ ends to 3’ ends, or from N-termini to C-termini.
Exhaustive algorithm published by:
Needleman, S.B. and Wunsch, C.D. (1970) “A general method applicable to the search for similarities in the amino acid sequence of two proteins” J. Mol. Biol. 48:443-453.
“Exhaustive” means: all cases are tested, so the result (the alignment) is guaranteed to be optimal.

Local alignment
Locate region(s) with a high degree of similarity in two sequences.
Exhaustive algorithm published by:
Smith, T.F. and Waterman, M.S. (1981) “Identification of common molecular subsequences” J. Mol. Biol. 147:195-197.
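To make the difference between the two exhaustive algorithms concrete, here is a minimal score-only sketch of both. It uses the common textbook recurrences with a linear gap penalty, in which the match score is added only on the diagonal move, so intermediate values need not coincide with matrices worked by hand under other rule variants. The function names and the defaults (match = 1, mismatch = 0, gap = 1) are assumptions made for the example.

```python
def global_score(x, y, match=1, mismatch=0, gap=1):
    """Needleman-Wunsch: best score for aligning all of x with all of y."""
    prev = [-gap * j for j in range(len(y) + 1)]  # row 0: y aligned against gaps only
    for i in range(1, len(x) + 1):
        cur = [-gap * i]  # column 0: x aligned against gaps only
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,     # align x[i-1] with y[j-1]
                           prev[j] - gap,       # x[i-1] against a gap
                           cur[j - 1] - gap))   # y[j-1] against a gap
        prev = cur
    return prev[-1]


def local_score(x, y, match=1, mismatch=0, gap=1):
    """Smith-Waterman: best score over all pairs of substrings; never below 0."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere: this is what makes it local.
            cell = max(0, prev[j - 1] + s, prev[j] - gap, cur[j - 1] - gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(global_score("ctgagt", "aacttgagc"))
print(local_score("ctgagt", "aacttgagc"))
```

The only structural differences are the border initialization (gap penalties versus zeros), the extra 0 in the maximum, and where the final score is read (last cell versus best cell anywhere).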
Global Alignment
• Simple rules:
– Match (i,j) = 1, if residue (i) = residue (j); else 0
– Gap = 1
– Score (i,j) = Maximum of
  • Score (i+1,j+1) + Match (i,j)
  • Score (i+1,j) + Match (i,j) - Gap
  • Score (i,j+1) + Match (i,j) - Gap

Global Alignment
[Slides: the matrix for the sequences ctgagt and aacttgagc is filled step by step from the lower right corner. The border row and column hold the accumulated gap penalties (0, -1, -2, …); the first computed cell (t against c) is 0.]
Completed matrix:
       a   a   c   t   t   g   a   g   c   -
c      3   4   5   4  -2  -1  -1  -2  -4  -6
t      0   1   2   4   4  -2   0  -1  -4  -5
g     -2  -1   0   1   2   3   1   0  -3  -4
a     -2  -2  -3  -2  -1   0   2   0  -2  -3
g     -5  -4  -3  -3  -2  -1   0   1  -1  -2
t     -6  -5  -4  -3  -3  -3  -2  -1   0  -1
-     -9  -8  -7  -6  -5  -4  -3  -2  -1   0
Resulting global alignment:
a a c t t g a g c
- - c t - g a g t

Local Alignment
• Simple rules:
– Match (i,j) = 1, if residue (i) = residue (j); else 0
– Gap = 1
– Score (i,j) = Maximum of
  • Score (i+1,j+1) + Match (i,j)
  • Score (i+1,j) + Match (i,j) - Gap
  • Score (i,j+1) + Match (i,j) - Gap
  • 0

Local Alignment
Completed matrix:
       a   a   c   t   t   g   a   g   c   -
c      3   4   5   4   3   1   0   0   1   0
t      1   2   3   4   4   2   1   0   0   0
g      2   1   0   1   2   3   1   1   0   0
a      2   2   1   0   1   1   2   0   0   0
g      0   0   1   1   0   1   0   1   0   0
t      0   0   0   1   1   0   0   0   0   0
-      0   0   0   0   0   0   0   0   0   0
Resulting local alignments (two equivalent placements of the gap):
c t t g a g        c t t g a g
c t - g a g        c - t g a g

PAM250
Upper triangle of the symmetrical matrix. Column order: C S T P A G N D E Q H R K M I L V F Y W. Each row starts at the diagonal element.
C: 12 0 -2 -3 -2 -3 -4 -5 -5 -5 -3 -4 -5 -5 -2 -6 -2 -4 0 -8
S: 2 1 1 1 1 1 0 0 -1 -1 0 0 -2 -1 -3 -1 -3 -3 -2
T: 3 0 1 0 0 0 0 -1 -1 -1 0 -1 0 -2 0 -3 -3 -5
P: 6 1 -1 -1 -1 -1 0 0 0 -1 -2 -2 -3 -1 -5 -5 -6
A: 2 1 0 0 0 0 -1 -2 -1 -1 -1 -2 0 -4 -3 -6
G: 5 0 1 0 -1 -2 -3 -2 -3 -3 -4 -1 -5 -5 -7
N: 2 2 1 1 2 0 1 -2 -2 -3 -2 -4 -2 -4
D: 4 3 2 1 -1 0 -3 -2 -4 -2 -6 -4 -7
E: 4 2 1 -1 0 -2 -2 -3 -2 -5 -4 -7
Q: 4 3 1 1 -1 -2 -2 -2 -5 -4 -5
H: 6 2 0 -2 -2 -2 -2 -2 0 -3
R: 6 3 0 -2 -3 -2 -4 -4 2
K: 5 0 -2 -3 -2 -5 -4 -3
M: 6 2 4 2 0 -2 -4
I: 5 2 4 1 -1 -5
L: 6 2 2 -1 -2
V: 4 -1 -2 -6
F: 9 7 0
Y: 10 0
W: 17

Global Alignment
• Advanced rules:
– Match (i,j) = W(i,j), where W is a score matrix, like PAM250
– Gap = Gap_init + Gap_length × length_of_gap
– Score (i,j) = Maximum of
  • Score (i+1,j+1) + Match (i,j)
  • Score (i+1,j) + Match (i,j) - Gap
  • Score (i,j+1) + Match (i,j) - Gap

Local Alignment
• Advanced rules:
– Match (i,j) = W(i,j), where W is a score matrix, like PAM250
– Gap = Gap_init + Gap_length × length_of_gap
– Score (i,j) = Maximum of
  • Score (i+1,j+1) + Match (i,j)
  • Score (i+1,j) + Match (i,j) - Gap
  • Score (i,j+1) + Match (i,j) - Gap
  • 0

Concepts learnt in this lecture
Alignments can be exhaustive or heuristic.
Exhaustive (also called dynamic programming): when resources are not a limitation (e.g. we have few sequences to align).
Heuristic: for realistic problems where time is an issue.
Alignments can be global or local.
Global: from beginning to end.
Local: pinpoint highly similar regions (more realistic).

What methods to select according to the time/resources we have?
If we have time/resources, we can try exhaustive algorithms. This is an option with supercomputers or GPU implementations…
For realistic problems (and realistic resources) we need heuristic alignments that restrict the search space to a manageable size, at the price of losing some accuracy.

Alignment heuristics (examples)
Search space reduction 1: Pre-filter sequences to be aligned. Rationale: comparing very different sequences makes no biological sense. Brute force filtering is efficient.
Search space reduction 2: Filter out obviously useless alignments.
This means leaving out the corners of the SW or NW search matrices: only the cells around the diagonal make sense.
The corners look like this:
[Figure: search matrix with the corners excluded]

Example
Bacterial sensor protein with a signal-binding and a DNA-binding domain.
Normal and shuffled (inverted) bacterial sensor proteins can fulfill the same function.
Do we see the difference by simple (raw) pairwise alignment?
Local alignment (Smith-Waterman), normal vs. inverted: identical sequences match at each amino acid; we show them as a series of “|” symbols.
Global alignment (Needleman-Wunsch), normal vs. inverted.
Pairwise alignment by itself sees only the similarity of the larger domain; the smaller one is lost (empty line, no hits).

Do we see the difference by dot plot?
[Figure: dot plot; axes: Normal sequence, Inverted sequence]
The distance matrix has the info, just the alignment algorithm does not pick it up!!!
[Figure: heat map of the Smith-Waterman matrix next to the dot plot; axes: Normal sequence, Inverted sequence]

What have we learnt?
Sequence scoring matrices (PAM, BLOSUM, unitary, and how to make one’s own…)
Dot plots
The two basic algorithms: global alignment (Needleman-Wunsch), local alignment (Smith-Waterman)
Classifying alignment methods (how to align): exhaustive, heuristic, local, global
Global alignment, exhaustive: Needleman-Wunsch
Local alignment, exhaustive: Smith-Waterman
Heuristics: simple examples
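As one simple example of such a heuristic, the "leave out the corners" idea (search space reduction 2) can be sketched as a banded global alignment: only cells within a fixed distance of the main diagonal are computed. This is an illustration under stated assumptions, not a definitive implementation: the function name, the band parameter, and the textbook recurrence with match = 1, mismatch = 0, gap = 1 are choices made for the example.

```python
NEG_INF = float("-inf")


def banded_global_score(x, y, band, match=1, mismatch=0, gap=1):
    """Global alignment score using only cells (i, j) with |i - j| <= band."""
    n, m = len(x), len(y)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    # Row 0: y residues aligned against gaps, restricted to the band.
    prev = {j: -gap * j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = -gap * i  # column 0: x residues aligned against gaps
                continue
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = NEG_INF
            if j - 1 in prev:   # diagonal move: align x[i-1] with y[j-1]
                best = max(best, prev[j - 1] + s)
            if j in prev:       # x[i-1] against a gap
                best = max(best, prev[j] - gap)
            if j - 1 in cur:    # y[j-1] against a gap
                best = max(best, cur[j - 1] - gap)
            cur[j] = best
        prev = cur
    return prev[m]


print(banded_global_score("ctgagt", "aacttgagc", band=3))  # narrow band
print(banded_global_score("ctgagt", "aacttgagc", band=9))  # band covers the whole matrix
```

The work drops from roughly n × m cells to roughly n × (2 × band + 1). The price is that an optimal path leaving the band is missed, which is the loss of accuracy that heuristics trade for speed.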