Download BLOSUM and PAM Matrix

DNA sequences alignment measurement
Lecture 13

Introduction
• Measuring the “strength” of an alignment
• Nucleic acid and amino acid substitutions
• Measuring alignment gaps

Measurement of aligned sequences
• When aligning sequences (DNA/AA) it is assumed that:
  – they have a common ancestor;
  – the differences between the sequences are the result of mutations;
  – important areas such as coding sequences (CDS) will be conserved, i.e. there is a bias “against” mutations in these areas;
  – there is also a bias in the types of mutation: substitutions are more likely than insertions/deletions.
• The dot plot gives a visual representation of aligned regions. But how do we measure the strength of these alignments?

Measurement of aligned sequences
• One way is to count the mismatches: the “difference” between the sequences.
  – Hamming distance: the number of mismatched positions between two strings of equal length.
    agtc
    cgta
    Distance is 2 (give another example)
• If the sequences (strings) are not of equal length then use:
  – The Levenshtein distance: the minimum number of edit operations (alter/insert/delete) required to turn one string into another:
    ag-tcc
    cgctca
    What is the Levenshtein distance?
• The latter technique has the advantage of allowing the inclusion of gaps.

Measurement of matching
• But how biologically plausible are these approaches to measuring “differences” between sequences (strings)?
• Mismatches between DNA sequences differ from mismatches between plain strings:
  – the probabilities of substitution, insertion and deletion are not the same;
  – certain types of mutation (inversions, translocations, duplications, ...) complicate the assessment of similarity; e.g. how would you treat tandem repeats or inverted repeats?

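The two distance measures from the slides above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `hamming_distance` and `levenshtein_distance` are chosen here, and every edit operation is assumed to cost 1, which is exactly the biological simplification questioned on the "Measurement of matching" slide.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of alter/insert/delete operations turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # alter x into y (free if equal)
            ))
        previous = current
    return previous[-1]


print(hamming_distance("agtc", "cgta"))  # 2, as on the slide
print(levenshtein_distance("agtcc", "cgctca"))
```

The second call uses the ungapped form of the slide's example (`ag-tcc` without the gap character), so running it is one way to check an answer to the slide's question.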
Nucleic acid mutations
• In sequence alignment we are trying to determine whether the differences (or similarities) arose because of:
  – chance (random mutations), or
  – a common origin (degree of conservation).
• One approach would be to count the percentage of matches, but there is also a need to include the bias associated with possible substitutions.
• However, similarity does not necessarily imply a common ancestor, or vice versa; Zvelebil and Baum (2008, p. 74) suggest this can occur in convergent/divergent evolution.
• So the findings of alignment tests need to be put in context (bats and birds both have wings...).

Alignment scoring methods
• In general, each matching position is given a score; the alignment with the largest total score is optimal and is chosen. However, suboptimal alignments may also need to be considered.
• The most basic approach is to measure the percentage similarity.
• Given that not all “changes” occur with equal probability, there is a need to develop a nucleotide substitution matrix.

Nucleotide scoring matrix
• It is known that certain mutations are more likely to occur than others: e.g. transitions (a<->g, c<->t) are more common than transversions (e.g. a<->c, g<->t).
• However, since this difference in probability is insignificant relative to the chance of a mutation occurring at all, it is mostly ignored.
• [Figure: a typical scoring matrix for nucleotides. Adapted from Baxevanis p. 303]

Nucleic acid scoring matrix
• The values are based on the probability of a type of substitution occurring (the expected value); this includes a nucleotide substituting with itself.
• These expected values are calculated as the ratio:
  – number of “observed changes” / number of changes “due to chance”
• These values are obtained by examining large numbers of DNA sequences.

Nucleic acid scoring matrix
• Then calculate 10 * log10(“expected value”).
• This ensures that the expected values of adjacent nucleotides can be added, rather than multiplied, when determining the alignment score.

Nucleic acid scoring matrix
• An expected value of 1 indicates that the substitution has the same chance of occurring as it would if it were occurring randomly.
• A value greater than 1 indicates a bias in favour of the substitution.
• A value less than 1 indicates a bias against the substitution.
• A matrix value (score) of 5 corresponds to what expected value?

Measuring protein similarity
• Deriving a matrix for proteins is more complex because:
  – there are 20 amino acids, so there is a much larger set of possible substitutions;
  – the amino acids have properties that affect the structure, and so the function, of the protein.
• Therefore substitutions can be conserved or semi-conserved.
• Observation shows that conserved substitutions are more common:
  – conserved: e.g. hydrophobic <-> hydrophobic;
  – semi-conserved: e.g. hydrophilic <-> hydrophobic.

Dot plot matrix: imperfect match
• Some alignments require gaps to increase the matching score; the gaps are used to represent insertion/deletion mutations.
• The diagram shows that most of the two sequences are aligned. Gaps in the plot indicate areas of non-alignment or mismatch: gaps or substitutions.
• [Figure: dot plot of two sequences. Adapted from: dotplot example]

Measurement of alignment gaps
• Gaps represent insertions and deletions.
• Baxevanis (2005) suggests that no more than “one gap in 20 pairs is a good rule of thumb”.
• Gaps in alignments are penalised, i.e. given a negative scoring value.
• The penalty associated with using gaps depends on:
  – opening the gap (introducing an insertion or deletion);
  – extending the gap (as opposed to opening a new gap);
  – the length of the gap (the number of deletions/insertions).

Gap penalties
• There is no overall agreement on what values should be assigned to gap penalties (Zvelebil and Baum 2008).
• The purpose of inserting a gap is to increase the strength of the alignment.
• So choosing a high gap penalty will eliminate alignments with gaps, while if the penalty is too low then alignments with more and larger gaps will be chosen.
• The value should also depend on how closely “related” the aligned sequences must be:
  – sequences requiring a very strict match would use a high gap penalty;
  – alignments between distantly related species would use a low gap penalty.

Potential exam questions
• What is the purpose of measuring the strength of an alignment? (3 marks)
• Explain two differences between analysing a string (sequence) and a DNA string. (4 marks)
• Describe how you would measure the similarity between two DNA sequences. (10 marks)
• Discuss the use of gap penalties in a sequence alignment score. (13 marks)

References
• Baxevanis, A.D. (2005) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, chapter 11. Wiley.
• Lesk, A. (2008) Introduction to Bioinformatics, 3rd edition. Oxford University Press.
• Zvelebil and Baum (2008) Understanding Bioinformatics.
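Worked sketch: the scoring ideas from the lecture (the 10 * log10 conversion of an observed/chance ratio, and opening versus extending a gap) can be tried out in Python. Everything numeric here is an assumption made for illustration: the `MATCH`, `MISMATCH`, `GAP_OPEN` and `GAP_EXTEND` values are hypothetical and are not taken from the Baxevanis matrix, and the convention that a gap of length n costs one opening penalty plus n - 1 extension penalties is only one of several in use.

```python
import math

# Hypothetical values for illustration only; not the Baxevanis matrix.
MATCH, MISMATCH = 5, -4
GAP_OPEN, GAP_EXTEND = -10, -1


def log_odds_score(observed: float, by_chance: float) -> float:
    """Score = 10 * log10(observed changes / changes due to chance)."""
    return 10 * math.log10(observed / by_chance)


def odds_ratio(score: float) -> float:
    """Invert a score back to the ratio ("expected value") behind it."""
    return 10 ** (score / 10)


def alignment_score(row_a: str, row_b: str) -> int:
    """Add up the column scores of two already-aligned rows ('-' is a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # row holding the gap run we are currently inside, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # First gap column pays the opening penalty, later ones extend it.
            score += GAP_EXTEND if gap_row == row else GAP_OPEN
            gap_row = row
        else:
            score += MATCH if x == y else MISMATCH
            gap_row = None
    return score


print(odds_ratio(0))                         # 1.0: no bias either way
print(alignment_score("ag-tcc", "cgctca"))   # -3 with the values above
```

Because the scores are logarithms, the column scores are simply added. Raising `GAP_OPEN` towards zero makes gapped alignments score better, which is the trade-off described on the gap penalties slide.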
