Sequence similarity search II
Searching for remote homologies

(How) can we decide if two sequences really have the same function?

Homologous proteins = come from a common origin => have the same function
Last Universal Common Ancestor

Homology
Rule of thumb:
- Proteins are homologous if 25%-35% identical
- DNA sequences are homologous if 70% identical
Can we always go by the rules?

Alignment between the worm and human arrestin
VERY SIGNIFICANT, NOT HIGH IDENTITY

Assessing whether proteins are functionally homologous
High levels of the protein RBP4 (retinol binding protein 4) were found to be correlated with childhood obesity.
RBP4 = carrier of vitamin A in the blood
RBP4 (retinol binding) and PAEP (pregnancy protein): E value = 0.49; identity = 24%
Are they functionally homologous?

The lipocalins protein family (each dot is a protein)
[Figure labels: RBP4 (retinol-binding protein), PAEP, apolipoprotein D, odorant-binding protein]
Are they functionally homologous?
PAEP and RBP4 belong to the same protein family = have a common ancestor.
Their functions have probably diverged.
BUT …

Is identity the right way to score?
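The rule of thumb above can be tried out with a short sketch. It assumes the two strings are rows of one pairwise alignment (equal length, '-' for gaps) and divides by the full alignment length, as in the "Identities = 40/150" style of report. The function names, the toy sequences, and the choice of 25% (the lower end of the 25%-35% range) as the protein cutoff are illustrative assumptions, not part of the lecture.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: both strings are rows of one pairwise alignment, so they
    have equal length and use '-' for gaps. The denominator is the full
    alignment length, gap columns included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)


def passes_rule_of_thumb(identity_pct: float, is_protein: bool) -> bool:
    """Rule of thumb from the slide; 25% (protein) and 70% (DNA) cutoffs
    are used here as an assumption about how to read the stated ranges."""
    threshold = 25.0 if is_protein else 70.0
    return identity_pct >= threshold


# Hypothetical toy alignment, not taken from the lecture.
row_a = "ACDE-GHK"
row_b = "ACNEFGHR"
pct = percent_identity(row_a, row_b)
print(f"{pct:.1f}% identical; passes protein rule: {passes_rule_of_thumb(pct, True)}")
```

A pair such as RBP4 and PAEP (24% identity) would fail this check, which is exactly why the lecture goes on to ask whether identity alone is the right score.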
The 20 Amino Acids

Sequence alignment based on AA similarity

TQSPSSLSASVGDTVTITCRASQSISTYLNWYQQKP----GKAPKLLIYAASSSQSGVPS
TQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADS

RFSGSGSGTDFTLTINSLQPEDFATYYCQ---------------QSYSTPHFSQGTKLEI
RRSLWDQG-NFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTL

---KRTVAAPSVFIFPPSDEQLKSGTASVVCLLN---------NFYPREAKVQWKVD
TLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKID

Identity: 45/178 = 25%
Similarity: 63/178 = 35%

Scoring system for amino acid mismatches

How do we define the scoring system
Given an alignment of closely related sequences we can score the relation between amino acids based on how frequently they substitute each other

Protein X E. coli    ... M G Y D E
Protein X yeast      ... M G Y D E
Protein X worm       ... M G Y E E
Protein X chicken    ... M G Y D E
Protein X mouse      ... M G Y Q E
Protein X pig        ... M G Y D E
Protein X monkey     ... M G Y E E
Protein X human      ... M G Y E E

In the fourth column, E and D are found in 7 of 8 sequences.

D/E
[Figure: chemical structures of aspartate (Asp, D) and glutamate (Glu, E)]

PAM - Point Accepted Mutations
• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• "Accepted" mutations – do not negatively affect a protein's fitness
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)

Margaret Dayhoff 1925-1983

Basic matrix (example): normalized probabilities multiplied by 10000

       A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

Log Odds Matrices
– Calculate odds ratio for each substitution
  • Divide the frequency of the substitution by the frequency of each amino acid: f(aa1->aa2) / (f(aa1) * f(aa2))
– Take average of ratio for converting A to B and converting B to A
– Convert ratio to log10 and multiply by 10
– Result: symmetric log-odds matrix

PAM250 log-odds matrix
Entry (i,j): the score of aligning amino acid i against amino acid j.
Similar amino acids have a high score.
Entry (i,i) is greater than any entry (i,j), j ≠ i.
The entries on the diagonal are not always identical.

The different PAM Matrices
• There are different PAM matrices (PAM1 - PAM250).
• The matrices are derived from each other by multiplying the PAM1 matrix by itself N times
• Low PAM matrices are suitable for strong local similarities (arrestin worm vs arrestin human)
• High PAM matrices are suitable for weak similarities (RBP4 and PAEP)
  – PAM120 recommended for general use
  – PAM60 for close relations
  – PAM250 for distant relations

BLOSUM = BLOcks SUbstitution Matrix
Steven and Jorja G. Henikoff (1992)
• Based on the BLOCKS database: families of proteins with similar function
• Ungapped local alignment
  – Each block is generated from a local alignment
  – Counts amino acids observed in the same column
  – Symmetrical model of substitution
[Figure: schematic sequences sharing conserved ungapped blocks]

BLOSUM Matrices
[Figure: BLOSUM 62]
• Different BLOSUMn matrices are calculated independently from different BLOCKS
• BLOSUMn is based on blocks that are at most n percent identical.

Selecting a BLOSUM Matrix
• For BLOSUMn, higher n is suitable for sequences which are more similar
  – BLOSUM62 recommended for general use
  – BLOSUM80 for close relations
  – BLOSUM45 for distant relations

QUIZ
• The score for ARG-LYS in BLOSUM 45 is 2. What will the score be for the same pair in BLOSUM 80?
  A. 2
  B. 3
  C. 1
  D. -1

Remote homologues
• Sometimes BLAST isn't enough.
• This happens when searching for homologs in large and diverse protein families and/or when looking for homology between poorly conserved proteins in very distant species (E. coli vs human)
What do we do?
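Returning to the PAM construction: the statement that higher PAM matrices are obtained by multiplying PAM1 by itself N times, followed by the log-odds conversion, can be sketched in a few lines. This is a minimal illustration under stated assumptions: a hypothetical three-letter alphabet with made-up probabilities stands in for the 20 amino acids, rows are taken as the original residue, and the score is computed as 10*log10(M[i][j] / f[j]), which equals the substitution frequency divided by both residue frequencies. Because the toy matrix is symmetric, the averaging step is not needed here.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def pam_n(pam1, n):
    """Return the PAM1 probability matrix multiplied by itself n times."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


def log_odds(matrix, freqs):
    """Convert substitution probabilities to 10*log10 odds scores."""
    size = len(freqs)
    return [
        [10.0 * math.log10(matrix[i][j] / freqs[j]) for j in range(size)]
        for i in range(size)
    ]


# Hypothetical toy "PAM1": three residue types, each row sums to 1.
# These numbers are illustrative only, not Dayhoff's data.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]
TOY_FREQS = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform background

toy_pam250 = pam_n(TOY_PAM1, 250)
for row in log_odds(toy_pam250, TOY_FREQS):
    print([round(score, 1) for score in row])
```

In the toy PAM1 the 0<->1 substitution is more frequent than 0<->2, so it receives the higher score, mirroring "many i<->j substitutions => high score s(i,j)".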
PSI-BLAST
General idea:
- Builds specialized scoring matrices which are specific to the family of interest
- Generates a position-specific scoring matrix

PSI-BLAST STEPS:
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a specialized multiple sequence alignment
[3] Creates a "profile" of the specialized alignment, scored for each position independently: the position-specific scoring matrix (PSSM)

[Figure: alignment columns with different residue sets: R,I,K | C | D,E,T | K,R,T | N,L,Y,G]

PSSM (rows: query position and residue; columns: amino acids)

       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
 1 M  -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
 2 K  -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
 3 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 4 V   0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
 5 W  -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 6 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 7 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 8 L  -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
 9 L  -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
10 L  -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
11 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
12 A   5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
13 W  -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
14 A   3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
15 A   2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
16 A   4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
...
Positions 37 S, 38 G, 39 T, 40 W, 41 Y, 42 A (values in extraction order): 2 0 0 -3 -2 4 -1 -3 -1 -3 -2 -2 0 -1 0 -4 -2 -2 -1 -2 -1 -5 -3 -2 -1 -3 -1 -3 -3 -1 0 -2 -1 -2 -2 -1 0 0 -1 -2 -3 -2 6 -2 -4 -4 -1 -2 -2 -1 -1 -3 -3 -3 -3 -2 -2 -3 2 -2 -1 -1 0 -2 -2 -2 0 -2 -1 -3 -2 -1 -2 -3 -1 -2 -1 -1 -3 -4 -2 1 3 -3 -1 4 1 -3 -2 0 -2 -3 -1 1 5 -3 -4 -3 -3 12 -3 -2 -2 2 -1 1 0 -3 -2 -3 -2 2 7 -2 -2 -4 0 -3 -1 0

PSI-BLAST Continue…
[4] The PSSM is used as a query against the database
[5] PSI-BLAST estimates statistical significance (E values)
[6] Repeat steps [4] and [5] iteratively, typically 3-5 times. At each new search, a new profile is used as the query.
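The idea behind step [3], scoring each alignment column independently, can be sketched as follows. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped toy alignment, a uniform background frequency, and a fixed pseudocount, whereas the real program weights sequences and uses more elaborate pseudocounts. The names `build_pssm` and `score_segment` are hypothetical. The first two toy segments (GTWYA, GRWYE) appear in the RBP4 and β-lactoglobulin sequences of the lecture; the other two are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Build one log2-odds row per column of an ungapped alignment.

    Assumptions: all sequences have equal length, contain only the 20
    standard residues, and the background frequency is uniform (1/20).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(row)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment as long as the PSSM."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must match the PSSM length")
    return sum(row[aa] for row, aa in zip(pssm, segment))


# Toy family: first two segments occur in the lecture's sequences,
# the last two are hypothetical additions.
toy_alignment = ["GTWYA", "GRWYE", "GKWFA", "GTWYA"]
toy_pssm = build_pssm(toy_alignment)
print(round(score_segment(toy_pssm, "GTWYA"), 2))
print(round(score_segment(toy_pssm, "AAAAA"), 2))
```

Unlike a single PAM or BLOSUM matrix, the same residue gets different scores at different positions: in this toy profile W scores high only in the invariant third column.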
Searching for remote homology using PSI-BLAST
Another member of the lipocalins family is β-lactoglobulin
[Figure labels: RBP4, β-lactoglobulin]

PSI-BLAST alignment of RBP4 (retinol binding protein) and β-lactoglobulin: iteration 1
Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)

Query:  27 VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
Sbjct:  33 VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query:  87 ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
Sbjct:  83 PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158

PSI-BLAST alignment of RBP4 and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)

Query:   4 VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
Sbjct:   2 VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60

Query:  56 NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
Sbjct:  61 CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112

Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159

PSI-BLAST alignment of RBP4 and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)

Query:   3 WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
Sbjct:   1 MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query:  55 DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
Sbjct:  60 NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159

Iteration 1 compared with iteration 3: Score = 46.2 bits, Expect = 2e-04 versus Score = 159 bits, Expect = 1e-38.

The lipocalins protein family (each dot is a protein)
[Figure labels: β-lactoglobulin, retinol-binding protein RBP4, apolipoprotein D, odorant-binding protein]

The universe of lipocalins (each dot is a protein)
[Figure labels: retinol-binding protein, apolipoprotein D, odorant-binding protein]

Scoring matrices let you focus on the big (or small) picture
[Figure panels: PAM250, PAM30, Blosum80, Blosum45; each centered on retinol-binding protein]
PSI-BLAST is more powerful than changing scoring functions
VERY good for remote homologies
[Figure label: retinol-binding protein]