Protein and DNA Sequence Analysis, Part 1
Fritz Roth
BCMP 201, Spring 2008

Overview: Sequence Analysis Lectures
- Sequence Analysis I
  - Case Study w/ BLAST
  - Aligning a pair of sequences
  - Scoring aligned sequences
- Sequence Analysis II
  - Searching large sequence databases
  - Representing and finding sequence patterns
  - Aligning multiple sequences

Motivation: A Firehose of Sequence
- Whole-genome projects, as of Mar 2007:
  - 542 eubacteria
  - 39 archaebacteria
  - 24 microbial 'metagenomes'
  - 60 eukaryotes (of which 21 are multicellular)
  - *does not include many 'draft' genomes
- Not whole-genome:
  [Figure: GenBank growth, 0 to 100 billion base pairs, 1997-2005]
http://www.ebi.ac.uk/genomes/index.html
http://www.ncbi.nlm.nih.gov/Genbank/genbankgrowth.jpg

Sequence Analysis: What's the Use?
- Find genes
- Infer protein function
- Infer evolutionary history
- Infer subcellular localization
- Infer gene regulation
- Infer protein-protein interactions

Identity vs. Similarity
- Identity: Extent to which residues in aligned sequences are invariant.
- Similarity: Extent to which residues in aligned sequences have similar properties. Need not have diverged from common ancestor.

Homology in Two Flavors
- Homology: Similarity due to descent from common ancestor. Need not have same function.
- Orthologs: Homologs diverged by speciation. Sequences with 'mutual best hit' relationship.
- Paralogs: Homologs diverged by gene duplication. Share function less often.

Similar sequence… Same function?
[Figure legend]
- o: general similarity
- X: non-enzyme, same functional class
- p: enzyme, same functional class
- --p--: enzyme, same precise function
- --X--: non-enzyme, same precise function
Caveats: Only for single domain proteins.
From Wilson, Kreychman, and Gerstein, J Mol Biol, 2000

Outline: Sequence Analysis I
- BLAST Case study
- Pairwise sequence alignment
  - Global vs. local alignment
  - Dot plots
  - Smith-Waterman method
- Scoring aligned sequences

Case Study
- Methanococcus jannaschii: Archaebacterium from undersea thermal vent
- Genome sequenced, 1996
- MJ0577: predicted protein based on GeneMark software

  1 MSVMYKKILY PTDFSETAEI ALKHVKAFKT LKAEEVILLH VIDEREIKKR DIFSLLLGVA
 61 GLNKSVEEFE NELKNKLTEE AKNKMENIKK ELEDVGFKVK DIIVVGIPHE EIVKIAEDEG
121 VDIIIMGSHG KTNLKEILLG SVTENVIKKS NKPVLVVKRK NS

(Case study adapted from http://www.ncbi.nih.gov/Education/BLASTinfo/tut2.html)

Case Study: BLAST Results
[Figure: BLAST hit list; labeled hits: self; no known function]

Case Study: BLAST Results
[Figure: BLAST hit list; labeled hit: Putative Filament Protein]

Case Study: Filament protein?
[Figure label: Twilight Zone]
- Predicted coiled-coil region (COILS)
- Manually filtered the query
- Subsequent BLAST run did not retrieve this protein

Case Study: BLAST Results
[Figure: BLAST hit list; labeled hit: Cationic amino acid transporter]

Case Study: BLAST with Cationic amino acid transporter
[Figure labels: Cationic amino acid transporter; M. jannaschii Protein MJ0577; Twilight Zone]
- The transporter entry is 780 aa long, so MJ0577 (~160 aa) is not likely to perform the same function
- It may share a domain, but we'll drop this lead for now

Case Study: BLAST conclusion?
- Not very satisfying

Position-Specific Iterated BLAST (PSI-BLAST)
[Figure] From http://bioweb.pasteur.fr/seqanal/blast/

Case study: Filament protein?
[Figure label: Putative Filament Protein]
- Remember, this hit dropped out when predicted coiled-coil is masked
- Another approach to verify:
  - Run PSI-BLAST using filament protein as query
  - MJ0577 is below threshold in BLAST stage
  - MJ0577 is recovered in first profile scan
  - By this criterion, MJ0577 and filament protein are related, but we're still worried about coiled-coil region

Case Study: Universal stress protein?
[Figure labels: Cationic amino acid transporter; Universal stress proteins]

Case Study: Following up the universal stress protein lead
- BLAST search with E. coli UspA does not yield MJ0577 as significant.
- First PSI-BLAST iteration yields MJ0577 and some of its closest relatives.
- Tentative Prediction: MJ0577 is a universal stress protein
- E. coli Universal stress protein trivia
  - Important for survival in stationary phase
  - Is phosphorylated at either serine or threonine
  - Autophosphorylates inefficiently in vitro

Case Study: Experimental Followup
- Usually, function before structure
- Zarembinski et al. solved MJ0577 structure
- Test case for "structural genomics"
- Structure similarity search yielded nothing new.
- However… a bound ATP!
- This supports stress protein prediction somewhat (remember autophosphorylation)

Outline: Sequence Analysis I
- BLAST Case study
- Pairwise sequence alignment
  - Global vs. local alignment
  - Dot plots
  - Smith-Waterman method
- Scoring aligned sequences

Global vs. Local Alignment
Global Alignment: Alignment of sequences over their entire length.

LGPSTKQFGKGSAS-RIWDN
|      |||| |  |  |
LNQIERSFGKG-AIMRLGDA

Local Alignment: Alignment of some portion of sequences.

-------FGKGSA-------
       |||| |
-------FGKG-A-------

Sequence Alignment: The Brute Force Approach
- If no gaps, slide the sequences past each other until you get the best score:

     ABCDEFGHI
     ||||
  WXYABCDZE

- For two n-residue sequences, this takes ~2n evaluations
- With gaps, scoring all alignments of two 100 aa proteins takes $\binom{2n}{n} = \frac{(2n)!}{(n!)^2}$, or ~$10^{59}$ evaluations
- $10^{59}$ evaluations on a 1 GHz computer -> longer than lifetime of universe!

Dot Plots: Local Alignment the Old-Fashioned Way
- A dot for each residue match
[Figure: dot plot, both axes 5' to 3']
[Figure: Hemoglobin alpha vs beta chain ("Dotter"), both axes N-terminal to C-terminal]

Dot Plots: Filtering
[Figure: two panels, "A dot for each match" and "A dot only for 2 matches", both axes 5' to 3']
Window length 31; Match score: +5; Mismatch score: -4
From http://lectures.molgen.mpg.de/Pairwise/DotPlots/

Dot Plot Summary
Advantages
- Simple
- Visual
Drawbacks
- Isn't automated (try a whole genome!)
- Hard to find optimal alignment if gaps are included

Global Alignment
- In 1970, Needleman and Wunsch solved global alignment in $O(n^3)$ using dynamic programming. ($O(n^3)$ means running time is proportional to $n^3$, where n is input data size)
- In 1982, Gotoh solved it in $O(n^2)$.

Local Alignment: Smith-Waterman
Smith-Waterman: A dynamic programming algorithm developed in 1981 that aligns two sequences of length n and m in O(nm).
Dynamic Programming: Finding the optimal solution to a problem by reusing optimal solutions of similar (but smaller) problems.

Local Alignment: Smith-Waterman
Sub-problem: What's the best possible score for two aligned segments that end at a particular position?
[Score matrix, still empty: columns ∆ A G C C T; rows ∆ A T G C C A T]

A-G
| |
ATG

-----AG
     |
ATGCCAT

Local Alignment: Smith-Waterman
How to Calculate Top Score

$\mathrm{TopScore}(i,j) = \max \begin{cases} 0 \\ \mathrm{TopScore}(i-1,j-1) + \mathrm{Match}(i,j) \\ \mathrm{TopScore}(i-1,j) - \mathrm{Gap}(1) \\ \mathrm{TopScore}(i,j-1) - \mathrm{Gap}(1) \end{cases}$

$\mathrm{Match}(i,j) = \begin{cases} 3 & \text{for match} \\ -1 & \text{for mismatch} \end{cases}$

$\mathrm{Gap}(1) = 2$

Local Alignment: Smith-Waterman
All subproblems solved!

     ∆  A  G  C  C  T
∆    0  0  0  0  0  0
A    0
T    0
G    0
C    0
C    0
A    0
T    0

Local Alignment: Smith-Waterman

     ∆  A  G  C  C  T
∆    0  0  0  0  0  0
A    0  3  1  0  0  0
T    0  1  2  0  0  3
G    0  0  4  2  0  1
C    0  0  2  7  5
C    0  0  0  5 10
A    0  3  1  3  8
T    0  1  2  1  6

What is the best alignment?
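The TopScore recurrence can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `smith_waterman_scores` and `best_cells` are made up for this example, it uses the lecture's scores (match 3, mismatch -1, Gap(1) = 2), and it only fills the score matrix without tracing back the alignment.

```python
def smith_waterman_scores(a, b, match=3, mismatch=-1, gap=2):
    """Fill the Smith-Waterman TopScore matrix for a (rows) against b (columns)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Row 0 and column 0 are the empty-prefix boundary (the slides' ∆), fixed at 0.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment here
                score[i - 1][j - 1] + s,  # extend by aligning a[i-1] with b[j-1]
                score[i - 1][j] - gap,    # gap in b
                score[i][j - 1] - gap,    # gap in a
            )
    return score


def best_cells(score):
    """Return the highest score and every (row, column) cell that attains it."""
    best = max(max(row) for row in score)
    cells = [(i, j) for i, row in enumerate(score)
             for j, value in enumerate(row) if value == best]
    return best, cells


if __name__ == "__main__":
    table = smith_waterman_scores("ATGCCAT", "AGCCT")
    for row in table:
        print(" ".join(f"{value:2d}" for value in row))
    print(best_cells(table))
```

For the lecture's sequences (rows ATGCCAT, columns AGCCT) the top score is 11, in the cell where the final T of each sequence is aligned. Tracing back from that cell recovers the optimal local alignment.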
Local Alignment: Smith-Waterman

A-GCC-T
| ||| |
ATGCCAT

     ∆  A  G  C  C  T
∆    0  0  0  0  0  0
A    0  3  1  0  0  0
T    0  1  2  0  0  3
G    0  0  4  2  0  1
C    0  0  2  7  5  3
C    0  0  0  5 10  8
A    0  3  1  3  8  9
T    0  1  2  1  6 11

Optimal Alignments

A-GCC-T
| ||| |
ATGCCAT

A-GCCTA
| ||| |
ATGCC-A

     ∆  A  G  C  C  T  A
∆    0  0  0  0  0  0  0
A    0  3  1  0  0  0  3
T    0  1  2  0  0  3  1
G    0  0  4  2  0  1  2
C    0  0  2  7  5  3  1
C    0  0  0  5 10  8  6
A    0  3  1  3  8  9 11
T    0  1  2  1  6 11  9
G    0  0  4  2  4  9 10

Outline: Sequence Analysis I
- Case study
- Pairwise sequence alignment
- Scoring aligned sequences
  - Log-odds scoring
  - Substitution matrices
  - Gap penalties

Scoring an Alignment

YKKILYGPTD--FSETA
  | |  |||  ||| |
FRKVLF-PTDGGFSEGA

Approaches to Similarity
- % Identity: exact match
- Similar codons
- Similar chemical characteristics (polar, non-polar, bulky, etc.)
- Conservation: frequently substituted amino acids are similar

Conservation-based Scoring

YKIL
 | |
FKVL

Two Models: Random vs. Diverged
- $q(i)$: Prob. of residue by chance
- $q(i) \cdot q(j)$: Prob. of residue pair by chance
- $p(i \Leftrightarrow j)$: Prob. of residue pair if diverged from common ancestor
- Odds ratio of Y and F is:
  $\mathrm{Odds}_{YF} = \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)}$
- Odds ratio of entire sequence is:
  $\mathrm{Odds}_{seq} = \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)} \cdot \frac{p(K \Leftrightarrow K)}{q(K) \cdot q(K)} \cdot \frac{p(I \Leftrightarrow V)}{q(I) \cdot q(V)} \cdot \frac{p(L \Leftrightarrow L)}{q(L) \cdot q(L)}$

Log-Odds Scores

YKIL
 | |
FKVL

- Log-odds score of Y and F is:
  $S_{YF} = \log \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)}$
- Log-odds score of entire sequence is:
  $S_{seq} = \log \left[ \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)} \times \frac{p(K \Leftrightarrow K)}{q(K) \cdot q(K)} \times \frac{p(I \Leftrightarrow V)}{q(I) \cdot q(V)} \times \frac{p(L \Leftrightarrow L)}{q(L) \cdot q(L)} \right]$

PAM? BLOSUM?
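A small numerical sketch may help make the odds and log-odds definitions concrete. The probabilities below are invented purely for illustration and are not taken from the lecture or from any real substitution matrix. The helper names and the choice of base-10 logarithms are also assumptions. The point is only that the log of the whole-alignment odds equals the sum of the per-position log-odds scores.

```python
import math

# Hypothetical values, made up for this example only.
# q[x]: probability of residue x by chance.
# p[(x, y)]: probability of the aligned pair x <-> y under the "diverged" model.
q = {"Y": 0.03, "F": 0.04, "K": 0.06, "I": 0.06, "V": 0.07, "L": 0.10}
p = {("Y", "F"): 0.004, ("K", "K"): 0.016, ("I", "V"): 0.012, ("L", "L"): 0.037}


def odds(x, y):
    """Odds ratio of one aligned pair: p(x <-> y) / (q(x) * q(y))."""
    return p[(x, y)] / (q[x] * q[y])


def log_odds(x, y):
    """Log-odds score of one aligned pair (base 10 assumed here)."""
    return math.log10(odds(x, y))


def alignment_odds(top, bottom):
    """Odds ratio of an ungapped alignment: product over aligned positions."""
    result = 1.0
    for x, y in zip(top, bottom):
        result *= odds(x, y)
    return result


def alignment_log_odds(top, bottom):
    """Log-odds score of an ungapped alignment: sum over aligned positions."""
    return sum(log_odds(x, y) for x, y in zip(top, bottom))


if __name__ == "__main__":
    print(math.log10(alignment_odds("YKIL", "FKVL")))
    print(alignment_log_odds("YKIL", "FKVL"))
```

Both printed values agree up to floating-point rounding, which is why alignment programs can add per-pair scores from a table instead of multiplying probabilities.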
Log-Odds Scores

YKIL
 | |
FKVL

- Log-odds score of Y and F is:
  $S_{YF} = \log \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)}$
- Log-odds score of entire sequence is:
  $S_{seq} = \log \left[ \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)} \times \frac{p(K \Leftrightarrow K)}{q(K) \cdot q(K)} \times \frac{p(I \Leftrightarrow V)}{q(I) \cdot q(V)} \times \frac{p(L \Leftrightarrow L)}{q(L) \cdot q(L)} \right]$
  $= \log \frac{p(Y \Leftrightarrow F)}{q(Y) \cdot q(F)} + \log \frac{p(K \Leftrightarrow K)}{q(K) \cdot q(K)} + \log \frac{p(I \Leftrightarrow V)}{q(I) \cdot q(V)} + \log \frac{p(L \Leftrightarrow L)}{q(L) \cdot q(L)}$
  $= S_{YF} + S_{KK} + S_{IV} + S_{LL}$

Substitution Matrix = Log-Odds Table
[Figure]

Percent Accepted Mutation (PAM) Matrix
- PAM: A unit of evolutionary distance at which proteins have diverged an average of 1% (1 amino acid per 100).
- Developed by Margaret Dayhoff ~1978 based on 71 protein families.
- The log-odds scoring matrix for sequences with ~1% divergence is called PAM1.

PAM: How Do You Build It?
- Align sequences that are at least 85% identical.
- Reconstruct phylogenetic trees
- Infer ancestral sequences (71 trees w/ 1572 exchanges)
  [Figure: example trees with sequences ACGH, DBGH, ADIJ, CBIJ, ABGH, ABIJ; branch labels B-C, A-D, I-G, J-H, B-D, A-C, I-L]
- Count aligned residue pairs at every step

PAM: How Do You Build It?
Make a matrix M1 with probability of substitution after one PAM unit of time:

$M_{ij}(t) = p(i \mid j, t)$

For example, $M_{ij}(1)$ =

            j
         A     B     C
i   A  .900  .090  .010
    B  .045  .950  .005
    C  .001  .001  .998

Adapted from Wheeler, BioComputing Hypertext Coursebook

PAM: How Do You Build It?
We can then extrapolate to any evolutionary time: $M(2) = [M(1)]^2$; $M(3) = [M(1)]^3$; etc.

         M(1)              M(20)             M(200)            M(1000)
       A    B    C       A    B    C       A    B    C       A    B    C
A    .900 .090 .010    .310 .548 .142    .120 .243 .637    .077 .154 .769
B    .045 .950 .005    .274 .613 .113    .122 .248 .631    .077 .154 .769
C    .001 .001 .998    .014 .023 .963    .064 .126 .810    .077 .154 .769

Relating M to log-odds

$\mathrm{PAM}_{ij}(t) = \log \frac{M_{ij}(t)}{q(i)} = \log \frac{M_{ij}(t) \cdot q(j)}{q(i) \cdot q(j)} = \log \frac{p(i \mid j, t) \cdot q(j)}{q(i) \cdot q(j)} = \log \frac{p(i, j \mid t)}{q(i) \cdot q(j)}$

The PAM250
[Figure]

The Trouble With PAM
- Rare substitutions not observed in PAM1 (36/190 substitutions not observed!).
- Errors in PAM1 are magnified by extrapolation.
- Distant sequences usually have islands (blocks) of conserved residues. (Substitution not equally likely over entire sequence.)
Adapted from Wheeler, BioComputing Hypertext Coursebook

BLOSUM
BLOSUM (Blocks Substitution) Matrices
- Developed by Henikoff and Henikoff in 1992
- Does not extrapolate from close homologs.
- Uses BLOCKS, a collection of ungapped multiple alignments

ID   HOMSERKINASE; BLOCK
AC   PR00958A; distance from previous block=(8,20)
DE   Homoserine kinase signature
BL   adapted; width=16; seqs=18; 99.5%=874; strength=1260
KHSE_FREDI|P04947 (14) TTANLGPGFDCIGAAL  46
KHSE_SYNY3|P73646 (11) TTANIGPGFDCLGAAL  43
KHSE_BRELA|P07128 (17) SSANLGPGFDTLGLAL  36
KHSE_CORGL|P08210 (17) SSANLGPGFDTLGLAL  36
KHSE_MYCTU|Q10603 (20) SSANLGPGFDSVGLAL  37
KHSE_MYCLE|P45836 (18) SSANLGPGFDSIGLAL  38
KHSE_BACSU|P04948 (15) STANLGPGFDSVGMAL  42
O32121            (15) STANLGPGFDSVGMAL  42
KHSE_STRPN|P72535 ( 8) TSANIGPGFDSVGVAV  48
O67332            ( 9) TTTNFGSGFDTFGLAL  91
KHSE_LACLA|P52991 ( 8) TSANLGAGFDSIGIAV  68
KHSE_YEAST|P17423 (11) SSANIGPGYDVLGVGL  85
O43056            (11) SSANIGPGFDVLGMSL  46
KHSE_ECOLI|P00547 ( 9) SSANMSVGFDVLGAAV  62
KHSE_HAEIN|P44504 ( 9) SSANISVGFDTLGAAI  68
KHSE_SERMA|P27722 ( 9) SIGNVSVGFDVLGAAV 100
O25690            ( 8) TSANLGPGFDCLGLSL  46
KHSE_METJA|Q58504 (14) TSANLGVGFDVFGLCL  60

BLOSUM: How do you build it?
Procedure
- Aligned, ungapped sequence blocks from Blocks database.
- Tally observed frequency of each amino acid pair among aligned residues.
Adapted from Wheeler, BioComputing Hypertext Coursebook

BLOSUM
Remember the log-odds score:

$S_{ij} = \log \frac{p(i \Leftrightarrow j)}{q(i) \cdot q(j)}$

How do I calculate $p(i \Leftrightarrow j)$?

BLOSUM
Count all possible residue pairings (e.g. there are eleven A-B pairs)

BLOCK
ABC
ABC
BAC
AAB
ABA

      A   B   C
A     7  11   3
B         3   3
C             3

$p(A \Leftrightarrow B) = \frac{\#AB\ \text{substitutions}}{\text{Total}\ \#\ \text{substitutions}} = \frac{11}{30} = .37$
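The pair tally for the toy block (ABC, ABC, BAC, AAB, ABA) can be reproduced with a short script. This is a sketch of the simple, unclustered count only, not of the full BLOSUM procedure. The function names are invented for this example, and the log-odds step assumes a base-10 logarithm and takes q(A) = q(B) = 0.33 from the lecture's worked example rather than estimating them from the block.

```python
from collections import Counter
from itertools import combinations
import math


def pair_counts(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        # Every pair of sequences contributes one residue pair per column.
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


def pair_frequency(counts, x, y):
    """Observed frequency of the pair x <-> y among all tallied pairs."""
    total = sum(counts.values())
    return counts[tuple(sorted((x, y)))] / total


block = ["ABC", "ABC", "BAC", "AAB", "ABA"]
counts = pair_counts(block)
p_ab = pair_frequency(counts, "A", "B")
# Background frequencies as given in the lecture's example (assumed, not computed).
s_ab = math.log10(p_ab / (0.33 * 0.33))

if __name__ == "__main__":
    print(dict(counts))
    print(round(p_ab, 2), round(s_ab, 2))
```

Each column of five sequences yields ten pairs, so the three columns give 30 pairs in total, of which 11 are A-B pairs.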
$p(A \Leftrightarrow B) = \frac{\#AB\ \text{substitutions}}{\text{Total}\ \#\ \text{substitutions}} = \frac{11}{30} = .37$

If $q(A) = .33$ and $q(B) = .33$, then

$S_{AB} = \log \frac{p(A \Leftrightarrow B)}{q(A) \cdot q(B)} = \log \frac{0.37}{0.33 \cdot 0.33} = \log(3.4) = 0.53$

Adapted from Wheeler, BioComputing Hypertext Coursebook

Procedure
- Aligned, ungapped sequence blocks from Blocks database.
- Cluster similar sequences
- Tally observed frequency of each amino acid pair among aligned residues, counting members of a cluster fractionally toward the tally.

BLOSUM: "Tuning" for distant homology
- Cluster similar sequences
- Tally fractional aligned residue pairs (e.g. ½ of an "AB" pair)

BLOCK
ABC
ABC
BAC
AAB
ABA

      A   B   C
A     4   7   2
B         1   2
C             1

$p(A \Leftrightarrow B) = \frac{\#AB\ \text{substitutions}}{\text{Total}\ \#\ \text{substitutions}} = \frac{7}{17} = .41$

BLOSUM: Comments
- Clustering is analogous to increasing PAM distance.
- Clustering threshold 80% -> BLOSUM 80.
- Good statistics:
  - $1.25 \times 10^6$ pairs contributed
  - Least frequent pair observed 2369 times!

BLOSUM and PAM correspondence
[Figure]

The Affine Gap Penalty
- Gap(L) = -b - (L-1) · e
- Where…
  - L is the gap length
  - b is the gap opening penalty
  - e is the gap extension penalty
- Default for BLAST is
  - gap-opening penalty of 11
  - gap extension penalty of 1
Name comes from the affine transformation y=ax+b

Links to Smith-Waterman applets
- Applet that shows Smith-Waterman alignment for DNA sequences:
  http://www.cs.pdx.edu/~ps/CapStone03/dynvis/SimilarityApplet.html
  (select "local alignment", and match, mismatch, and gap scores of 3, -1, and -1 to be consistent with the lecture)
- Applet that shows Smith-Waterman on protein sequences:
  http://www.cs.auckland.ac.nz/~cam/bio/swnw.html
  (Try "Smith-Waterman", "Affine Gap Model", "Blosum 62", Gap opening cost 9, Gap extension cost 9. You may want to hit "New Alignment" a few times until you get an alignment that scores well enough to be interesting.)
- Email [email protected] with questions

Summary
- Basics of homology
- Case study
  - BLAST
  - PSI-BLAST
- Pairwise sequence alignment
  - Global vs. local
  - Dot plots
  - Smith-Waterman
- Defining sequence similarity
  - Log-odds scoring
  - Substitution matrices
  - Gap penalties