* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download PDF handout
Survey
Document related concepts
Gene expression wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Metalloprotein wikipedia , lookup
Non-coding DNA wikipedia , lookup
Amino acid synthesis wikipedia , lookup
Community fingerprinting wikipedia , lookup
Biochemistry wikipedia , lookup
Proteolysis wikipedia , lookup
Biosynthesis wikipedia , lookup
Genetic code wikipedia , lookup
Catalytic triad wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Protein structure prediction wikipedia , lookup
Molecular evolution wikipedia , lookup
Transcript
C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Bioinformatics Introduction to bioinformatics 2007 “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) Lecture 5 Pair-wise Sequence Alignment “Nothing in bioinformatics makes sense except in the light of Biology” Divergent evolution Divergent evolution Ancestral sequence: ABCD ACCD (B C) ACCD AB─D or Ancestral sequence: ABCD ABD (C ø) ACCD (B C) mutation deletion ACCD A─BD Pairwise Alignment ACCD AB─D or ABD (C ø) mutation deletion ACCD A─BD Pairwise Alignment true alignment What can be observed about divergent evolution (a) G (b) Convergent evolution • Often with shorter motifs (e.g. active sites) • Motif (function) has evolved more than once independently, e.g. starting with two very different sequences adopting different folds • Sequences and associated structures remain different, but (functional) motif can become identical • Classical example: serine proteinase and chymotrypsin G Ancestral sequence G Sequence 1 1: ACCTGTAATC 2: ACGTGCGATC * ** D = 3/10 (fraction different sites (nucleotides)) C A One substitution one visible Sequence 2 (c) G C Two substitutions one visible (d) G G A A Two substitutions none visible A Back mutation not visible G 1 Serine proteinase (subtilisin) and chymotrypsin • Different evolutionary origins • Similarities in the reaction mechanisms. Chymotrypsin, subtilisin and carboxypeptidase C have a catalytic triad of serine, aspartate and histidine in common: serine acts as a nucleophile, aspartate as an electrophile, and histidine as a base. • The geometric orientations of the catalytic residues are similar between families, despite different protein folds. • The linear arrangements of the catalytic residues reflect different family relationships. For example the catalytic triad in the chymotrypsin clan is ordered HDS, but is ordered DHS in the subtilisin clan and SDH in the carboxypeptidase clan. Serine proteinase (subtilisin) and chymotrypsin H D S chymotrypsin D H S serine proteinase S D H carboxypeptidase C Catalytic triads Read http://www.ebi.ac.uk/interpro/potm/2003_5/Page1.htm Serine proteinase (subtilisin) and chymotrypsin Serine proteinase (subtilisin) and chymotrypsin There is also divergent evolution.. Proc Natl Acad Sci U S A. 2000 December 19; 97(26): 14097– 14102. The structure of aspartyl dipeptidase reveals a unique fold with a Ser-His-Glu catalytic triad Kjell Håkansson,* Andrew H.-J. Wang,† and Charles G. Miller*‡ Serine proteinase (subtilisin) and chymotrypsin A protein sequence alignment MSTGAVLIY--TSILIKECHAMPAGNE-------GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---*** **** **** ** ****** 2 Searching for similarities Intermezzo: what is a domain A domain is a: What is the function of the new gene? The “lazy” investigation (i.e., no biologial experiments, just bioinformatics techniques): – Find a set of similar protein sequences to the unknown sequence – Identify similarities and differences • Compact, semi-independent unit (Richardson, 1981). • Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). • Recurring functional and evolutionary module (Bork, 1992). “Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977). – For long proteins: first identify domains Protein domains recur in different combinations The DEATH Domain (DD) • Present in a variety of Eukaryotic proteins involved with cell death. • Six helices enclose a tightly packed hydrophobic core. • Some DEATH domains form homotypic and heterotypic dimers. Structural domain organisation can intricate… Pyruvate kinase Phosphotransferase β barrel regulatory domain α/β barrel catalytic substrate binding domain http://www.mshri.on.ca/pawson α/β nucleotide binding domain 1 continuous + 2 discontinuous domains Evolutionary and functional relationships Reconstruct evolutionary relation: •Based on sequence -Identity (simplest method) -Similarity •Homology (common ancestry: the ultimate goal) •Other (e.g., 3D structure) Functional relation: Sequence Structure Function Searching for similarities Common ancestry is more interesting: Makes it more likely that genes share the same function Homology: sharing a common ancestor – a binary property (yes/no) – it’s a nice tool: When (an unknown) gene X is homologous to (a known) gene G it means that we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion. 3 How to go from DNA to protein sequence How to go from DNA to protein sequence 6-frame translation using the codon table (last lecture): A piece of double stranded DNA: 5’ attcgttggcaaatcgcccctatccggc 3’ 5’ attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’ DNA direction is from 5’ to 3’ 3’ taagcaaccgtttagcggggataggccg 5’ Evolution and three-dimensional protein structure information Bioinformatics tool Isocitrate dehydrogenase: Algorithm The distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution) Data tool Biological Interpretation (model) Dean, A. M. and G. B. Golding: Pacific Symposium on Bioinformatics 2000 Example today: Pairwise sequence alignment needs sense of evolution Global dynamic programming MDAGSTVILCFVG Evolution M D A A S T I L C G S How to determine similarity Frequent evolutionary events at the DNA level: 1. Substitution 2. Insertion, deletion Search matrix MDAGSTVILCFVGMDAAST-ILC--GS Amino Acid Exchange Matrix We will restrict ourselves to these events 3. Duplication 4. Inversion Gap penalties (open,extension) 4 Dynamic programming Scoring alignments – Substitution (or match/mismatch) A protein sequence alignment MSTGAVLIY--TSILIKECHAMPAGNE-------GGILLFHRTHELIKESHAMANDEGGSNNS * * * **** *** A DNA sequence alignment attcgttggcaaatcgcccctatccggccttaa att---tggcggatcg-cctctacgggcc---*** **** **** ** ****** Dynamic programming ∑ s (a , b ) - ∑ l i j gp(k) = gapinit + k⋅gapextension k • proteins – Gap penalty • Linear: gp(k)=ak • Affine: gp(k)=b+ak • Concave, e.g.: gp(k)=log(k) The score for an alignment is the sum of the scores of all alignment columns DNA: define a score for match/mismatch of letters Simple: Scoring alignments Sa,b = • DNA Nk • gp (k ) affine gap penalties A C G T A 1 -1 -1 -1 C -1 1 -1 -1 G -1 -1 1 -1 T -1 -1 -1 1 Used in genome alignments: A C G T A 91 -114 -31 -123 C -114 100 -125 -31 G -31 -125 100 -114 T -123 -31 -114 91 Dynamic programming Amino acid exchange matrices Scoring alignments T D W V T A L K T D W L - - I K 20×20 How do we get one? 20×20 10 Amino Acid Exchange Matrix 1 Affine gap penalties (open, extension) Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)-Po-2Px + +s(L,I)+s(K,K) And how do we get associated gap penalties? First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure 5 A 2 R -2 6 N 0 0 2 D 0 -1 2 PAM250 matrix 4 C -2 -4 -4 -5 12 Q 0 1 1 2 -5 4 E 0 -1 1 3 -5 2 4 G 1 -3 0 1 -3 -1 0 2 1 -3 1 -2 H -1 2 3 5 6 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 1 0 -5 1 amino acid exchange matrix (log odds) 0 -2 6 K -1 3 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 1 0 -1 -1 -3 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 -3 -1 -2 -5 0 -1 -3 0 1 0 -2 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 3 0 -6 -2 -5 17 7 -5 -3 -3 B 0 -1 2 3 -4 1 2 0 1 -2 -3 Z 0 0 1 3 -5 3 3 -1 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3 A R N D Q E H K P S B Z I L 2 -1 -1 -1 0 10 0 -2 -2 -2 -2 -2 -2 -1 -2 G 2 -2 6 V C 4 Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation 5 P W -6 0 -1 -1 0 -2 -3 1 -2 -5 -1 M F 0 0 -6 -2 W Y V 2 Pair-wise alignment T D W V T A L K T D W L - - I K Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n ~ = n 22n (2n)! (n!)2 Amino acids are not equal: 1. Some are easily substituted because they have similar: • physico-chemical properties • structure 2. Some mutations between amino acids occur more often due to similar codons 4 0 -5 -3 -2 T Amino acid exchange matrices √πn 2 sequences of 300 a.a.: ~1088 alignments 2 sequences of 1000 a.a.: ~10600 alignments! To say the same more statistically… •To perform statistical analyses on messages or sequences, we need a reference model. •The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner. •This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment. •Given a probability distribution, Pi, for the letters in a i.i.d. message, the probability of seeing a particular sequence of letters i, j, k, ... n is simply Pi Pj Pk···Pn. •As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log Pi + log Pj + log Pk+ ··· + log Pn. •In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score. The two above observations give us ways to define substitution matrices Technique to overcome the combinatorial explosion: Dynamic Programming • Alignment is simulated as Markov process, all sequence positions are seen as independent • Chances of sequence events are independent – Therefore, probabilities per aligned position need to be multiplied – Amino acid matrices contain so-called log-odds values (log10 of the probabilities), so probabilities can be summed Sequence alignment History of Dynamic Programming algorithm 1970 Needleman-Wunsch global pair-wise alignment Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53. 1981 Smith-Waterman local pair-wise alignment Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197. 6 Pairwise sequence alignment Global dynamic programming Global dynamic programming MDAGSTVILCFVG j-1 j Evolution M D A A S T I L C G S i-1 i Value from residue exchange matrix Amino Acid Exchange Matrix Search matrix H(i,j) = Max Gap penalties (open,extension) MDAGSTVILCFVGMDAAST-ILC--GS This is a recursive formula Global dynamic programming Global dynamic programming PAM250, Gap =6 (linear) S P E A R E 0 -6 -12 -18 -24 -30 -36 S -6 2 -4 -10 -16 -22 -28 0 H -12 -4 2 -3 -9 -14 -20 -24 3 0 A -18 -10 -3 0 -1 -7 -13 -18 -1 4 K -24 -16 -9 -3 -1 2 -4 -12 E -30 -22 -15 -5 -3 -2 6 -6 -30 -24 -18 -12 -6 0 S P E A R E S 2 1 0 1 0 0 H -1 0 1 -1 2 1 A 1 1 0 2 -2 K 0 -1 0 -1 E 0 -1 4 0 These values are copied from the PAM250 matrix (see earlier slide) Affine gap penalties j-1 i-1 Gap opening penalty The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps Si,j = si,j + Max Higgs & Attwood, p. 124 Max{S0<x<i-1, j-1 - Pi - (i-x-1)Px} Si-1,j-1 Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px} Gap extension penalty Easy DP recipe for using affine gap penalties j-1 Global dynamic programming Gapo=10, Gape=2 D W V T A L K 0 -12 -14 -16 -18 -20 -22 -24 T -12 8 -9 -6 -5 -9 -11 -14 5 D -14 0 9 2 2 3 -5 -3 -34 10 6 W -16 -13 25 11 5 4 9 0 -21 6 14 5 V -18 -10 -4 37 21 19 19 15 -16 7 5 13 L -20 -14 -2 23 46 31 37 26 1 K -22 -12 -9 17 33 53 39 50 14 -34 -29 -1 17 39 27 50 D W V T A L K T 8 3 8 11 9 9 8 D 12 1 6 8 8 4 8 W 1 25 2 3 2 6 V 6 2 12 8 8 L 4 6 10 9 K 8 5 6 8 These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) H(i-1,j-1) + S(i,j) diagonal H(i-1,j) - g vertical H(i,j-1) - g horizontal The extra bottom row and rightmost column give the final global alignment scores i-1 • M[i,j] is optimal alignment (highest scoring alignment until [i,j]) • Check – preceding row until j-2: apply appropriate gap penalties – preceding row until i-2: apply appropriate gap penalties – and cell[i-1, j-1]: apply score for cell[i-1, j-1] 7 Global dynamic programming DP is a two-step process • Forward step: calculate scores • Trace back: start at highest score and reconstruct the path leading to the highest score – These two steps lead to the highest scoring alignment (the optimal alignment) – This is guaranteed when you use DP! Semi-global dynamic programming Semi-global pairwise alignment - two examples with different gap penalties These values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) • Global alignment: all gaps are penalised • Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised End-gaps MSTGAVLIY--TS-------GGILLFHRTSGTSNS Global score is 65 – 10 – 1*2 –10 – 2*2 End-gaps Local dynamic programming (Smith & Waterman, 1981) Semi-global pairwise alignment LCFVMLAGSTVIVGTR Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – One sequence much longer than the other E D A S T I L C G S Negative numbers Amino Acid Exchange Matrix Search matrix Danger: if gap penalties high -- really bad alignments for divergent sequences AGSTVIVG A-STILCG Gap penalties (open, extension) 8 Local dynamic programming (Smith & Waterman, 1981) Local dynamic programming j-1 i-1 Gap opening penalty Si,j = Max Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px} Si,j + Si-1,j-1 Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px} 0 Gap extension penalty Dot plots • Way of representing (visualising) sequence similarity without doing dynamic programming (DP) • Make same matrix, but locally represent sequence similarity by averaging using a window Comparing two sequences We want to be able to choose the best alignment between two sequences. A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis. Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold They can identify insertions, deletions, inversions 9