Download Lesson 2 - Laboratory of Molecular Modelling

Lesson 2
Aligning sequences and searching databases

Homology and sequence alignment

Homology
Homology = similarity between objects due to common ancestry.
Hund = Dog, Schwein = Pig

Sequence homology
Similarity between sequences as a result of common ancestry.

VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG

Sequence alignment
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.

Why align?

VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV

1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).
2. Required for evolutionary studies (e.g., tree reconstruction).
3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).
4. Given sequenced DNA from an unknown region, to align it to the genome.

Insertions, deletions, and substitutions

Sequence alignment
If two sequences share a common ancestor, for example human and dog hemoglobin, we can represent their evolutionary relationship using a tree.

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

[Figure: tree joining VLSPAV-WAKV and VLSEAVLWAKV at their common ancestor]

Perfect match
A perfect match suggests that no change has occurred since the common ancestor (although this is not always the case).

A substitution
A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it occurred).

Indel
Option 1: The ancestor had L and it was lost here. In such a case, the event was a deletion.
[Figure: the same tree, with ancestor VLSEAVLWAKV]

Indel
Option 2: The ancestor was shorter and the L was inserted here. In such a case, the event was an insertion.
[Figure: the same tree, with the shorter ancestor VLSEAVWAKV and an inserted L]

Indel
Normally, given two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.

Deletion?  VLSPAV-WAKV
Insertion? VLSEAVLWAKV

Indels in protein coding genes
Indels in protein-coding genes are often 3 bp, 6 bp, 9 bp, etc. long (a multiple of three preserves the reading frame).
Gene search: searching for indels of length 3K (K = 1, 2, 3, …) can help algorithms that search a genome for coding regions.

Global and local pairwise alignments

Global vs. Local
• Global alignment: finds the best alignment across the entire two sequences.

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

• Local alignment: finds regions of similarity in parts of the sequences.

ADLG    CDRYFQ
||||    |||| |
ADLG    CDRYYQ

Global alignment forces alignment in regions that differ.
Local alignment returns only regions of good alignment.

Global alignment
[Figure: global alignment of PTK2 (protein tyrosine kinase 2) of human and rhesus monkey]

Proteins are composed of domains
[Figure: human PTK2, with labels Domain A, Domain B, protein tyrosine kinase domain]

Protein tyrosine kinase domain
In leukocytes, a different gene for tyrosine kinase is expressed.
[Figure: domain diagram with labels Domain A, Domain X, protein tyrosine kinase domain]

The sequence similarity is restricted to a single domain
[Figure: PTK2 (Domain A, protein tyrosine kinase domain, Domain B) compared with leukocyte TK (Domain X, protein tyrosine kinase domain)]

Global alignment of PTK and LTK
[Figure: alignment output]

Local alignment of PTK and LTK
[Figure: alignment output]

Conclusions
Use global alignment when the two sequences share the same overall sequence arrangement.
Use local alignment to detect regions of similarity.
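
The global/local distinction can be made concrete with a small dynamic-programming sketch. The slides do not name an algorithm, so this is an assumption: the code uses the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, with illustrative scores (match +1, mismatch -2, gap -1). The function names `global_score` and `local_score` are hypothetical, and only the best score is returned, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best score for aligning ALL of a against ALL of b (Needleman-Wunsch)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # a[i-1] aligned with b[j-1]
                            prev[j] + gap,       # a[i-1] aligned with a gap
                            curr[j - 1] + gap))  # b[j-1] aligned with a gap
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best score over all pairs of substrings of a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere, so poorly
            # matching regions are simply left out.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    x, y = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
    print("global:", global_score(x, y))
    print("local: ", local_score(x, y))
```

With these scores, the local score for the ADLG.../CDRY... example is never lower than the global score, because the local version may discard the poorly matching middle region that the global version is forced to align.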
How alignments are computed

Pairwise alignment

AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes:
2 mismatches
4 indels (gaps)
10 perfect matches

Choosing an alignment for a pair of sequences
Many different alignments are possible for two sequences:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

Which alignment is better?

Scoring system (naïve)
Perfect match: +1
Mismatch: -2
Indel (gap): -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

Higher score → better alignment

Alignment scoring: scoring of sequence similarity
Assumes independence between positions: each position is considered separately.
Scores each position:
• Positive if identical (match)
• Negative if different (mismatch or gap)
Total score = sum of position scores. It can be positive or negative.

Scoring systems

Scoring system
• In the example above, the choice of +1 for a match, -2 for a mismatch, and -1 for a gap is quite arbitrary
• Different scoring systems → different alignments
• We want a good scoring system…

Scoring matrix
• Represents the scoring system as an n x n table or matrix (n is the number of letters in the alphabet: n = 4 for nucleotides, n = 20 for amino acids)
• Symmetric

     A    G    C    T
A    2
G   -6    2
C   -6   -6    2
T   -6   -6   -6    2

DNA scoring matrices
• Uniform substitutions between all nucleotides (rows: from; columns: to; diagonal = match, off-diagonal = mismatch):

     A    G    C    T
A    2
G   -6    2
C   -6   -6    2
T   -6   -6   -6    2

DNA scoring matrices
Can take into account biological phenomena such as:
• Transition-transversion

Amino-acid scoring matrices
• Take into account physico-chemical properties

Scoring gaps (I)
In advanced algorithms, two gaps of one amino acid each are given a different score than one gap of two amino acids. This is done by giving a penalty to each gap that is opened.
Gap extension penalty < Gap opening penalty

Scoring gaps (II)
The dependency between the penalty and the length of the gap need not be linear.

Linear penalty:

AGGGTTC-GA
AGGGTTCTGA
Score = -2

AGGGTT--GA
AGGGTTCTGA
Score = -4

AGGGT---GA
AGGGTTCTGA
Score = -6

AGGG----GA
AGGGTTCTGA
Score = -8

Non-linear penalty:

AGGGTTC-GA
AGGGTTCTGA
Score = -4

AGGGTT--GA
AGGGTTCTGA
Score = -6

AGGGT---GA
AGGGTTCTGA
Score = -7

AGGG----GA
AGGGTTCTGA
Score = -8

PAM and BLOSUM

Amino-acid substitution matrices
• Actual substitutions:
  - Based on empirical data
  - Commonly used by many bioinformatics programs
  - PAM & BLOSUM

Protein matrices: actual substitutions
The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.

M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E

In the fourth column, E and D are found in 7 of the 8 sequences.

PAM Matrix - Point Accepted Mutations
• The Dayhoff PAM matrix is based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity, so alignment was easy and reliable).
• Counted the number of substitutions per amino-acid pair (20 x 20)
• Found that common substitutions occurred between chemically similar amino acids

PAM Matrices
• Family of matrices: PAM 80, PAM 120, PAM 250
• The number on the PAM matrix represents evolutionary distance
• Larger numbers are for larger distances

Example: PAM 250
Similar amino acids have a greater score.

PAM - limitations
• Based on a single, limited dataset
• Examines proteins with few differences (85% identity)
• Based mainly on small globular proteins, so the matrix is biased

BLOSUM
• Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset
• BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

BLOSUM: Blocks Substitution Matrix
• Based on the BLOCKS database
  - ~2000 blocks from 500 families of related proteins
  - Families of proteins with identical function
• Blocks are short conserved patterns of 3-60 amino acids without gaps

AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC

BLOSUM
• Each block represents a sequence alignment with a different identity percentage
• For each block, the amino-acid substitution rates were calculated to create the BLOSUM matrix

BLOSUM Matrices
• BLOSUMn is built from blocks in which sequences sharing at least n percent identity are clustered and counted as a single sequence
• BLOSUM62 represents closer sequences than BLOSUM45

Example: BLOSUM62
Derived from blocks in which sequences sharing at least 62% identity were clustered.

PAM vs. BLOSUM
Roughly equivalent matrices (more distant sequences toward the bottom):

PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45

Intermediate summary
1. Scoring system = substitution matrix + gap penalty.
2. Used for both global and local alignment.
3. For amino acids, there are two types of substitution matrices: PAM and BLOSUM.
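
The first point of the summary (scoring system = substitution scores + gap penalty) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `score_alignment` applies the naive +1 / -2 / -1 scheme from the scoring slide to an alignment that is already given, `linear_gap_penalty` reproduces the linear example (-2 per gap position), and `affine_gap_penalty` shows one common way to make opening a gap cost more than extending it. All function names and the affine values (-4 to open, -1 to extend) are assumptions chosen for the example; they do not reproduce the non-linear scores shown on the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, gap=-1):
    """Sum per-column scores of two equal-length gapped strings ('-' = gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # indel column
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


def linear_gap_penalty(length, per_position=-2):
    """Linear scheme: every gap position costs the same."""
    return length * per_position


def affine_gap_penalty(length, gap_open=-4, gap_extend=-1):
    """Affine scheme (assumed values): the first position of a gap pays
    gap_open, each further position pays the smaller gap_extend."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))
    print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))
    print([linear_gap_penalty(n) for n in range(1, 5)])
    print([affine_gap_penalty(n) for n in range(1, 5)])
```

Under the affine scheme, one gap of two positions (-5) is penalized less than two separate gaps of one position each (-8), which is the behavior described under "Scoring gaps (I)".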
