Big Data: Protein Functional Prediction – BLAST
1042 Data Science in Practice, Week 16, 06/06
Jia-Ming Chang, http://www.cs.nccu.edu.tw/~jmchang/course/1042/datascience/
The slides are for educational purposes only. If there is any infringement, please contact me and we will correct it immediately.

Dataset for Homework 4
• Performance comparison for archaeal proteins
• Yu, N.Y. et al. (2010) PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics, 26, 1608-15.

Protein Subcellular Localization Prediction
• Input: a prokaryotic protein sequence, e.g. MPLDLYNTLTRRKERF…; output: its subcellular localization site.
1. Chang, J.-M., Su, E.C.-Y., Lo, A., Chiu, H.-S., Sung, T.-Y. and Hsu, W.-L. (2008) PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins, 72, 693-710.
2. Chang, J.-M., Taly, J.-F., Erb, I., Sung, T.-Y., Hsu, W.-L., Tang, C.Y., Notredame, C. and Su, E.C.-Y. (2013) Efficient and Interpretable Prediction of Protein Functional Classes by Correspondence Analysis and Compact Set Relations. PLoS One, 8, e75542.

Document Classification
• A classifier assigns documents to categories.

Salton's Vector Space Model (Gerald Salton; the bag-of-words model)
• Represent each document by a high-dimensional vector in the space of words.
• Example document (Journal of Artificial Intelligence Research): "JAIR is a refereed journal, covering all areas of Artificial Intelligence, which is distributed free of charge over the internet."
"Each volume of the journal is also published by AI Access Foundation …"
• Its bag-of-words vector (word: count): learning 0, Journal 2, Intelligence 3, text 0, agent 0, internet 1, webwatcher 0, perlS 0, …, volume 1

Term-Document Matrix
• The term-document matrix A is an m × n matrix, where m is the number of terms and n is the number of documents; entry a_ij is the weight of term t_i in document d_j.

Vectors in Term Space
• A query document is predicted by its 1-nearest neighbor under cosine similarity.

Term Weighting by TF-IDF
• The term frequency tf(t_i, d) = n_i / Σ_k n_k gives a measure of the importance of the term t_i within the particular document d, with n_i being the number of occurrences of the considered term and the denominator being the number of occurrences of all terms in d.
• The inverse document frequency is obtained by dividing the number of all documents by the number of documents containing the term t_i: idf(t_i) = log(|D| / |{d : t_i ∈ d}|), where |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents where the term t_i appears.
• tfidf = tf × idf

Feature Reduction
• The best choice of axes shows the most variation in the data.
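The vector-space pipeline on these slides (bag-of-words counts, TF-IDF weighting, 1-nearest-neighbor classification by cosine similarity) can be sketched in plain Python; the documents and labels in the usage example are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw texts into TF-IDF-weighted bag-of-words vectors."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # idf(t) = log(|D| / number of documents containing t)
    idf = {t: math.log(n_docs / sum(1 for c in counts if t in c)) for t in vocab}
    vectors = []
    for c in counts:
        total = sum(c.values())          # tf(t, d) = n_t / sum_k n_k
        vectors.append([c[t] / total * idf[t] for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_1nn(query_vec, vectors, labels):
    """Predict the label of the most cosine-similar training document."""
    sims = [cosine(query_vec, v) for v in vectors]
    return labels[max(range(len(sims)), key=sims.__getitem__)]
```

For example, vectorizing three toy documents together and classifying the third against the first two returns the label of whichever neighbor shares the most weighted terms.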
=> Found by linear algebra: Singular Value Decomposition (SVD). [Figure: true plot in k dimensions vs. reduced-dimensionality plot]

System Architecture (PSLDoc: Protein Subcellular Localization prediction by document classification)
• The input sequence MPLDLYNTLT… is run through PSI-BLAST to obtain a position-specific score matrix. [Figure: 20-row PSSM for the query sequence]
• Gapped-Dipeptide Representation: terms A0A, A1A, A2A, A3A, A4A, A5A, …, Y5Y with weights, e.g. {0.81396, 0.78755, 0.788206, 0.799535, 0.784058, 0.742093, …, 0.437457}
• PLSA reduction maps the term vector to a topic vector, e.g. {0.012103, 0.014095, 0.015480, 0.018894, …, 0.003121}
• Five classifiers SVM_CP, SVM_IM, SVM_PP, SVM_OM, SVM_EC; the highest probability gives the predicted localization site.

PSLDoc 2
• Same pipeline: PSI-BLAST PSSM → gapped-dipeptide representation → PLSA
reduction {0.012103, 0.014095, 0.015480, 0.018894, …, 0.003121} → SVM_CP, SVM_IM, SVM_PP, SVM_OM, SVM_EC → highest probability → predicted localization site

Term Weighting Scheme – TF: Position-Specific Score Matrix
• A PSSM is constructed from a multiple alignment of the highest-scoring hits in the BLAST search. [Figure: PSSM rows for residues 1 M, 2 P, 3 L, 4 D, 5 L, …, 78 N, 79 T, 80 L, 81 T over the 20 amino acids A R N D C Q E G H I L K M F P S T W Y V]

Database Size (NCBI non-redundant (NR); UniProt release 15.15, 2010)
Data Set    No. of sequences
UniRef50    3,077,464
UniRef90    6,544,144
UniRef100   9,865,668
UniProt     11,009,767
NCBI NR     10,565,004

Feature Reduction – Topic Model
• Terms and documents are linked through latent concepts (e.g. the terms "economic", "imports", "trade" map to the latent concept TRADE).

Probabilistic Latent Semantic Analysis (PLSA)
• The joint probability of a term w and a document d is modeled as
  P(w, d) = P(d) Σ_{z∈Z} P(w | z) P(z | d)
  where z is a latent variable with a "small" number of states, P(w | z) are the concept expression probabilities, and P(z | d) are the document-specific mixing proportions.
• The parameters can be estimated by maximizing the likelihood function with the EM algorithm.
• Hofmann, T. (2001) Unsupervised Learning by Probabilistic Latent Semantic Analysis. Mach. Learn., 42(1-2), 177-196.
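A minimal EM fit of this PLSA model can be sketched with numpy. The tiny term-document count matrix below is invented for illustration; the E- and M-steps follow the standard PLSA updates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy term-document count matrix n(d, w): 4 documents x 6 terms.
n_dw = np.array([[4, 3, 0, 0, 1, 0],
                 [3, 5, 1, 0, 0, 0],
                 [0, 0, 4, 6, 0, 2],
                 [0, 1, 3, 5, 0, 3]], dtype=float)
D, W = n_dw.shape
Z = 2  # number of latent topics

# Random initialisation; each distribution is normalised to sum to 1.
p_w_given_z = rng.random((Z, W)); p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
p_z_given_d = rng.random((D, Z)); p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: P(z|d,w) ∝ P(w|z) P(z|d)
    post = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]   # shape (D, Z, W)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate P(w|z) and P(z|d) from the weighted counts n(d,w) P(z|d,w).
    weighted = n_dw[:, None, :] * post
    p_w_given_z = weighted.sum(axis=0)
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = weighted.sum(axis=2)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

# The rows of p_z_given_d are the reduced (topic-space) document features.
```

In PSLDoc the "documents" are proteins and the "terms" are gapped dipeptides, so each row of `p_z_given_d` plays the role of the reduced feature vector fed to the SVMs.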
PLSA Model Fitting
• Likelihood function: L = Σ_d Σ_w n(d, w) log P(d, w)
• E-step: the probability that a term w in a particular document d is explained by the class corresponding to z:
  P(z | d, w) = P(w | z) P(z | d) / Σ_{z'} P(w | z') P(z' | d)
• M-step: re-estimate the parameters from these posteriors:
  P(w | z) ∝ Σ_d n(d, w) P(z | d, w) and P(z | d) ∝ Σ_w n(d, w) P(z | d, w)

[Figure: PLSA feature reduction maps vectors from term space (Term 1 … Term 5) to topic space (Topic 1 … Topic 3)]

Gapped-Peptide Signature
• The site-topic preference of topic z for a site l = average{ P(z | d) : protein d belongs to class l }, collected in a site-topic preference matrix.
• For each site, 10 preferred topics are chosen according to preference confidence (= the 1st site-topic preference − the 2nd site-topic preference).
• For each topic, the 5 most frequent gapped-dipeptides are selected.

Classifier – Support Vector Machines
• Support Vector Machines (SVM) via the LIBSVM software.
• Five 1-vs-rest SVM classifiers corresponding to the five localization sites: SVM_CP vs. non-CP, SVM_IM vs. non-IM, SVM_PP vs. non-PP, SVM_OM vs. non-OM, SVM_EC vs. non-EC.
• Kernel: Radial Basis Function (RBF).
• Parameter selection: c (cost) and γ (gamma) are optimized by five-fold cross-validation.
• Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001.
  Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Prediction Confidence
• The confidence of the final predicted class:
  Prediction Confidence = the largest probability − the second largest probability
  e.g. if SVM_CP gives the largest probability and SVM_OM the second largest, Prediction Confidence = SVM_CP − SVM_OM.
[Figure: overall accuracy (%) increases with prediction confidence, binned from [0-0.1) up to [0.9-1]]

Prediction Threshold (1/3)
• If the prediction confidence is below the threshold, the prediction is "Unknown"; otherwise the predicted localization site is reported.
[Figure: precision vs. recall as the prediction threshold varies from 0 to 0.9; the value above each point denotes the corresponding prediction threshold]

Prediction Threshold (2/3)
Loc. Sites   PSLDoc_PreThr=0.7       PSLDoc_PreThr=0.3       PSORTb v.2.0
             Precision   Recall      Precision   Recall      Precision   Recall
CP           97.30       77.70       94.92       87.41       92.86       70.14
IM           98.91       88.35       97.94       92.23       95.33       92.56
PP           96.19       73.19       93.00       81.88       95.50       69.20
OM           99.46       93.61       98.41       95.14       97.38       94.88
EC           95.57       79.47       91.57       85.79       97.40       78.95
Overall      97.89       83.66       95.77       89.27       95.82       82.62

Prediction Threshold (3/3)
*The threshold is set such that the coverage is similar to that of PSLT.
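The confidence rule above (largest minus second-largest probability, with a reject threshold that outputs "Unknown") is straightforward to implement; the probability values in the example are invented:

```python
def predict_with_confidence(probs, threshold=0.7):
    """probs: mapping from localization site to classifier probability.
    Returns (site, confidence); the site is 'Unknown' below the threshold."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (best_site, best_p), (_, second_p) = ranked[0], ranked[1]
    confidence = best_p - second_p     # largest minus second-largest
    if confidence < threshold:
        return "Unknown", confidence
    return best_site, confidence

# Invented example probabilities from the five one-vs-rest SVMs:
probs = {"CP": 0.92, "IM": 0.05, "PP": 0.10, "OM": 0.15, "EC": 0.02}
site, conf = predict_with_confidence(probs, threshold=0.7)
# site == "CP": confidence = 0.92 - 0.15 = 0.77, which clears the 0.7 threshold
```

Raising the threshold trades recall for precision, exactly as in the precision-recall table above.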
PSLDoc: Open Questions
• PSI-BLAST PSSM step: how to efficiently search? How to directly infer?
• Gapped-dipeptide representation (A0A, A1A, A2A, …, Y5Y) → PLSA reduction → SVM_CP/IM/PP/OM/EC → highest probability → predicted localization site: how to intuitively predict?
[Figure: the PSLDoc pipeline annotated with these questions]

PSLDoc 2

Correspondence Analysis (CA)
• CA may be defined as a special case of principal components analysis (an eigenvector method).
• The matrix Y is decomposed using the generalized singular value decomposition under the constraints imposed by the matrices M (masses for the rows) and W (weights for the columns).
• This is illustrated by the analysis of the columns of matrix X, or equivalently by the rows of the transposed matrix X^T. Because the factor scores obtained for the rows and the columns have the same variance (i.e., they have the same "scale"), it is possible to plot them in the same space.
• http://www.universityoftexasatdallascomets.com/~herve/abdi-CorrespondenceAnaysis2010-pretty.pdf
[Figure: correspondence analysis of the Gram-negative data set; points mark proteins (IM, OM, CP, EC, PP), * marks gapped-dipeptides, and gapped-dipeptide signatures are highlighted]
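A common way to compute CA factor scores is the SVD of the standardized residual matrix; this numpy sketch follows that textbook formulation (the row masses and column weights play the roles of M and W above), on an invented proteins-by-gapped-dipeptide count table:

```python
import numpy as np

def correspondence_analysis(N):
    """CA factor scores via SVD of the standardized residuals of a
    contingency table N (rows x columns of non-negative counts)."""
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column weights
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = U * sv / np.sqrt(r)[:, None]     # row factor scores
    G = Vt.T * sv / np.sqrt(c)[:, None]  # column factor scores
    return F, G, sv

# Invented toy table: 4 "proteins" x 3 "gapped dipeptides".
N = np.array([[8.0, 1, 0], [6, 2, 1], [0, 1, 9], [1, 0, 7]])
F, G, sv = correspondence_analysis(N)
```

With this scaling, the weighted variance of the row scores and of the column scores on each axis both equal the squared singular value, which is why rows and columns can be plotted in the same space.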
Compact Set
[Figure: pairwise distance matrix for S1…S6 and the resulting groupings]
• C is a compact set if min{ E(v_i, v_k) : v_i ∈ C, v_k ∈ V \ C } > max{ D(v_i, v_j) : v_i, v_j ∈ C },
  i.e. every distance leaving the set is larger than every distance within it.

Hierarchical Clustering
[Figure: the same distance matrix clustered as a compact set tree vs. single-linkage clustering over s1…s6]

Compact Set Algorithm
• Input: a connected undirected graph G = (V, E), where V represents proteins and the edge weight E(v_i, v_j) is the distance between the two proteins v_i and v_j, measured as the Euclidean distance in the CA-reduced space.
• Output: all the compact sets in G.
  – Step 1: construct a Kruskal merging ordering tree T_Kru of G (CONSTRUCT_TKru).
  – Step 2: verify all candidate sets.
• Time = O(L + M + M log N), where M is the number of edges, N is the number of vertices, and L is the sum of the sizes of all compact sets.

CS+1NN on Gram-Negative PSORTdb
• http://psort.org/psortb/index.html
• Peabody, M.A. et al. (2016) PSORTdb: expanding the bacteria and archaea protein subcellular localization database to better reflect diversity in cell envelope structures. Nucleic Acids Res., 44, D663-8.

BLAST
• Basic local alignment search tool. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) J. Mol. Biol., 215(3), 403-410.
• One of the top 100 most-cited papers: http://www.nature.com/news/the-top-100-papers-1.16224#/interactive

What is BLAST?
• BLAST is to nucleotide/protein sequence databases as Google is to the Internet.
Credit: David Fristrom, Bibliographer/Librarian, Science and Engineering Library, [email protected]

Alignment
AACGTTTCCAGTCCAAATAGCTAGGC
===--===  =-===-==-======
AACCGTTC  TACAATTACCTAGGC
• Hits (+1): 18; Misses (−2): 5; Gaps (existence −2, extension −1): 1 gap of length 3
• Score = 18 × 1 + 5 × (−2) − 2 − 2 = 4

Global Alignment
• Compares the total length of two sequences.
• Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48(3), 443-53.

Local Alignment
• Compares segments of sequences.
• Finds cases when one sequence is a part of another sequence, or they only match in parts.
• Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147(1), 195-7.

Dynamic Programming Alignment (adapted from Cedric Notredame)
• Arrange everything in a table: cell (i, j) holds the score of aligning the prefix 1…i of one sequence with the prefix 1…j of the other, computed from cells (i−1, j−1), (i−1, j) and (i, j−1).
• Fill up the matrix, then deliver the alignment by trace-back from the optimal alignment score.
• Smith and Waterman (SW) = LOCAL alignment; the GLOBAL variant aligns the full lengths.

Search Tool
• By aligning a query sequence against all sequences in a database, alignment can be used to search the database for similar sequences.
• But alignment algorithms are slow.

What is BLAST?
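BLAST is a fast approximation to exactly this dynamic programming. For reference, the exact local-alignment recurrence (Smith-Waterman) can be sketched in a few lines; the match/mismatch values follow the slide's toy scoring, while the flat per-character gap penalty is a simplification of the existence/extension scheme:

```python
def smith_waterman(a, b, match=1, mismatch=-2, gap=-2):
    """Best local alignment score of strings a and b.
    H[i][j] = best score of an alignment ending at a[i-1], b[j-1],
    floored at 0 (the 'local' part of local alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best
```

Running this for a query against every database sequence is what makes the exact method too slow for large databases, motivating the heuristic below.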
• A quick, heuristic alignment algorithm.
• Divides the query sequence into short words, initially looks only for (exact) matches of these words, then tries extending the alignment.
• Much faster, but can miss some alignments.
• Altschul, S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215(3), 403-10.

What is BLAST?
• Basic Local Alignment Search Tool
• BLAST is a program designed for RAPIDLY comparing your sequence with every sequence in a database and REPORTING the most SIMILAR sequences. (Adapted from Cedric Notredame)

Database Search
[Figure: a query is run through a comparison engine against a database; each local alignment is reported with an E-value, e.g. 1.10e-20, 1.10e-100, 1.10e-2]
• E-value: how many alignments of such a score do we expect by chance?
• Four ingredients: 1. query, 2. comparison engine (LOCAL alignment), 3. database, 4. statistical evaluation (E-value).
• PROBLEM: LOCAL ALIGNMENT (SW) IS TOO SLOW.

BLAST: a Heuristic Smith and Waterman
• BLAST = 3 STEPS:
  1. Decide who will be compared. This is where BLAST SAVES TIME and where it LOSES HITS; most BLAST parameters refer to this step.
  2. Check the most promising hits.
  3. Compute the E-value of the most interesting hits.

Inside BLAST
• Step 1: finding the worthy words. List all the 3-amino-acid (3AA) words that can be found in the database; words whose score against the query is > T (e.g. ACT, REL, RSL, LKP, TVF) are kept, and low-scoring words (AAA, AAC, AAD, …, YYY) are discarded.
• Step 2: eliminate the database sequences that do not contain any interesting word.
  [Figure: the "interesting" words (ACT, RSL, TVF, …) are looked up in the database; sequences containing at least one such word are kept as hits]

Inside BLAST: the End
• Step 3: extension of the hits. When 2 hits fall on the same diagonal of the query-versus-database-sequence matrix and are distant by less than X, the alignment is extended by limited dynamic programming.

BLAST Statistics
• Raw score: the sum of the substitution scores and gap penalties; not very informative.
• p-value (derived statistic): the probability of finding an alignment with such a score by chance; the lower, the better.
(Adapted from Cedric Notredame)

Any Question?
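As a recap, the three BLAST steps (seed with query words, filter sequences containing a seed, extend the surviving hits) can be caricatured in a short sketch; the word length, the mini-database, and the use of exact word matches in place of the "score > T" word list are all simplifying assumptions:

```python
def words(seq, k=3):
    """All overlapping k-letter words of a sequence."""
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def blast_toy(query, database, k=3):
    """Step 1: seed words from the query (exact matches stand in for the
    'score > T' list).  Step 2: keep only database sequences containing at
    least one seed.  Step 3 (toy stand-in for extension by limited dynamic
    programming): rank the surviving hits by the number of shared words."""
    seeds = words(query, k)                            # step 1
    hits = {name: seq for name, seq in database.items()
            if words(seq, k) & seeds}                  # step 2
    return sorted(hits, key=lambda n: -len(words(hits[n], k) & seeds))  # step 3

db = {"s1": "MPLDLYNTLT", "s2": "GGGGGGGGGG", "s3": "AAMPLDAA"}
print(blast_toy("MPLDL", db))  # → ['s1', 's3']; s2 shares no 3-letter word
```

The time saving (and the possible loss of hits) comes entirely from step 2: sequences with no seed word are never aligned at all.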