* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download IvDimitrov_slides
Survey
Document related concepts
Ancestral sequence reconstruction wikipedia , lookup
Western blot wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
List of types of proteins wikipedia , lookup
Cell-penetrating peptide wikipedia , lookup
Protein adsorption wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Protein (nutrient) wikipedia , lookup
Intrinsically disordered proteins wikipedia , lookup
Amino acid synthesis wikipedia , lookup
Biosynthesis wikipedia , lookup
Genetic code wikipedia , lookup
Expanded genetic code wikipedia , lookup
Transcript
School of Pharmacy Medical University of Sofia Application of machine learning techniques for allergenicity prediction Ivan Dimitrov 2nd Regional Conference “Supercomputing Applications in Science and Industry” Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011 Allergen processing pathways C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005, 271-283 FAO and WHO Codex alimentarius guidelines for evaluating potential allergenicity for novel proteins A query protein is potentially allergenic if it: has an identity of 6 to 8 contiguous amino acids or has > 35% sequence similarity over a window of 80 amino acids when compared with known allergens. Codex Principles and Guidelines on Foods Derived from Biotechnology. 2003 Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization. Bioinformatics approaches to allergen prediction 1. Sequence-alignment search of query protein Extensive databases of known allergen proteins and the FAO/WHO guidelines - Structural Database of Allergenic Proteins - Allermatch Characteristics: -High sensitivity (true positives/(true positives + false negatives)) - Produce many false positives and low precision (true positives/(true positives + false positives)) - Discovery of novel antigens is restricted by their lack of similarity to known allergens. Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362 Fiers et al. BMC Bioinformatics 2004, 5, 133 Bioinformatics approaches to allergen prediction 2. Identification of conserved allergenicity-related linear motifs - Comparing allergens to non-allergens by MEME motif discovery tool - Clustering of known allergens, wavelet analysis and hidden Markov model - Automated Selection of Allergen-Representative Peptides (DASARP). - Motif search by Support Vector Machines (SVM), MEME/MAST, IgE epitopes and Allergen-Representative Peptides (ARP) - Iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine Both approaches are based on the assumption that the allergenicity is a linearly coded property. Stadler and Stadler FASEB J. 2003, 17, 1141-1143 Li et al. Bioinformatics 2004, 20, 2572-2578. Björklund et al. Bioinformatics. 2005, 21, 39–50 Saha and Raghava Nucleic Acids Research,2006,34, 202-209 Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861 AIM of the study To create an alignment-free method for in silico identification of allergens based on the main chemical properties of amino acid sequences and implement it to a web server. Obstacles: The choice of an appropriate descriptors to represent the physicochemical properties of amino acid sequences. Allergens are proteins with different length. The z-scales …Phe – Arg – Trp… z1 z2 hydrophobicity molecular size z3 polarity z1 z2 z3 -4.22 1.94 1.08 Hellberg et al. J. Med. Chem. 1987; 30, 1126-1135 z1 z2 z3 3.62 2.60 -3.60 z1 z2 z3 -4.36 3.94 0.69 ACC transformation Auto-covariance ACC jj (lag ) n lag Z j ,i Z j ,i lag i n lag Cross-covariance ACC jk j k (lag ) n lag i Z j ,i Z k ,i lag n lag j, k are the zscales (j=1,2,3); i is the amino acid positions; n is the number of amino acids in the sequence; protein Phe – Arg – Trp – Phe – Arg – Trp z1 z2 z3 - z1 z2 z3 - z1 z2 z3 – z1 z2 z3 - z1 z2 z3 – z1 z2 z3 /5 ACC11(1) z1 z2 z3 - z1 z2 z3 - z1 z2 z3 – z1 z2 z3 - z1 z2 z3 – z1 z2 z3 /5 Wold et al. Anal. Chim. Acta 1993, 277:239-225 ACC13(1) Preliminary study 595 food allergens from CSL allergen database 595 non-allergens from NCBI database Training set 475 food allergens 475 non-allergens ACC transformation of z descriptors matrix with 45 variables (32 x 5) and 950 observations statistical methods, machine learning PLS - discriminant analysis Logistic regression Naïve - Bayes algorithm Decision tree algorithm k Nearest Neighbours http://allergen.csl.gov.uk http://www.ncbi.nlm.nih.gov/ Test set 120 food allergens 120 non-allergens external validation Sensitivity Specificity Accuracy Results from preliminary study sensitivit y TP TP FN specificit y TN TN FP accuracy TP TN TP FP TN FN TP – true positive, FP – false positive TN – true negative, FN – false negative 100 90 80 70 % 60 Sensitivity,% 50 Specificity,% Accuracy,% 40 30 20 10 0 PLS-DA Logistic regression Decision tree Naïve-Bayes Algorithm kNN(k=3) kNN(k=5) Web servers on the test set Test set 120 food allergens 120 non-allergens Algpred Sensitivity Specificity Accuracy - SVM with single aa composition - SVM with dipeptide composition Evaller APPEL Allerhunter 100 90 80 70 % 60 Sensitivity,% 50 Specificity,% 40 Accuracy,% 30 20 10 0 ALGPRED (svm, single aa composition) ALGPRED (svm, dipeptide composition) EVALLER APPELL ALLERHUNTER kNN(5) Server Saha and Raghava Nucleic Acids Research,2006,34, 202-209. Barrio et al., Nucleic Acids Research 2007, 35, 694-700 http://jing.cz3.nus.edu.sg/cgi-bin/APPEL Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861 Conclusions from the preliminary study 1. The model developed by the k Nearest Neighbors method shows the best performance on the test set comparing to the other methods. It has a good balance between specificity and sensitivity, and the highest accuracy. kNN was used further in the study. 2. The server Allerhunter is the best performing among the known servers for allergen prediction. kNN needs some more improvements. 3. A great misbalance exists between sensitivity and specificity for almost all servers. This indicates that the dataset needs some improvement too. The kNN algorithm Training set 475 allergens, 475 non-allergens Unknown protein ACC transformation of z descriptors ACC transformation of z descriptors vector with 45 variables (32 x 5) matrix of 45 variables (32 x 5) and 950 observations Calculate the Euclidian distance between the vector and each observation Sort the distance by value in ascending order Determine the k nearest neighbours Determine the class of unknown allergen according to the majority of nearest neighbours Next: Extend the data sets CSL allergen database, FARRP allergen database SDAP database, ADFS database 684 food, 1157 inhalant, 553 toxins, venom or salivary allergens Allergen species NCBI database Create local database Proteins from allergen species Blasts search against all allergens 684 non-allergen from food origin 1157 non-allergens from inhalant origin 553 non-allergens from species with toxins, venom or salivary allergens http://allergen.csl.gov.uk http://www.allergenonline.org/ http://fermi.utmb.edu/SDAP/ http://allergen.nihs.go.jp/ADFS/index.jsp http://www.ncbi.nlm.nih.gov/ Next: kNN optimization 684 food allergens 684 non-allergens 100 528 allergens 528 non-allergens machine learning k nearest neighbours Test set 95 156 allergens 156 non-allergens 90 85 % Training set external validation 80 sensitivity 75 specificity 70 accuracy 65 60 55 50 3 Sensitivity Specificity Accuracy 5 7 9 11 13 k nearest neigbours 15 17 19 kNN models 684 food allergens 684 non-allergens Test set 156 allergens 156 non-allergens 1157 inhalant allergens 1157 non-allergens Training set 528 allergens 528 non-allergens external validation Training set 933 allergens 933 non-allergens external validation external validation k NN k=3 k NN k=3 Sensitivity Specificity Accuracy Test set 224 allergens 224 non-allergens kNN models 100 90 80 70 60 sensitivity 50 specificity accuracy 40 30 20 10 0 kNN, food training and kNN, food training set kNN, inhalant training kNN inhalant training test set on inhalant test set and test set set on food test set kNN aggregated training and test set AllerTOP web tool for allergenicity prediction Training set 1952 food, inhalant and others allergens and 1952 non-allergens ACC transformation of z descriptors kNN model external validation AllerTOP http://www.pharmfac.net/alletop Servers performance on united testset United test set of 441 food and inhalant allergens and 441 non-allergens 100 90 80 70 60 sensitivity 50 specificity accuracy 40 30 20 10 0 AllerTOP(KNN, K=3) Allerhunter AlgPred, svm amino acid decomposition AlgPred, svm dipeptide decomposition AlgPred (ARP) Two of the servers from preliminary studies: Appel and Evaller were not available during recent study. The results for Allerhunter server are achieved with smaller testset due to its incapability to work with short sequences (<21 amino acids) Conclusions 1. An alignment-free method for in silico prediction of allergens based on the main physicochemical properties of proteins was developed. 2. The method uses z descriptors for representation of amino acids in the protein sequences and ACC transformation for conversion of proteins into uniform vectors. 3. The k Nearest Neighbours clustering method showed the best performance among the other algorithms for classification tested in the study: PLS discriminant analysis, Logistic regression, Naïve - Bayes and Decision Tree algorithm. 4. The k NN algorithm was optimized and its performance was compared to the freely available web servers for prediction of allergens. 5. The kNN algorithm was implemented on a web server, freely available on: http://www.pharmfac.net/allertop Drug Design Group School of Pharmacy Medical University of Sofia Irini Doytchinova Ivan Dimitrov Mariyana Atanasova Panaiot Garnev Acknowledgements Darren R. Flower Aston University, Birmingham, UK Funding: National Research Fund, Ministry of Education and Science, Bulgaria, Grant 02-1/2009