
Machine learning assisted target selection for structural genomics

Pawel Smialowski, Antonio J. Martin-Galiano, Jürgen Cox(a) and Dmitrij Frishman

Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany
(a) Max Planck Institute of Biochemistry, 82152 Martinsried, Germany

1. Structural genomics

Structural genomics aims to determine the three-dimensional structure of all proteins of a given organism, either by experimental methods such as X-ray crystallography and NMR spectroscopy or by computational approaches such as homology modeling. The high-throughput genomics approach places new demands on both experimentation and target selection.

Target selection strategies:
• Protein function, metabolic pathway
• Experimental amenability of proteins

Figure: Population of instances along the structural genomics pipeline. The numbers were taken from TargetDB (status: March 2006). Out of 83596 targets selected by structural genomics consortia, structures were determined and deposited with the PDB for only 2830 (3.4%).

2. Input data

PDB - Protein structure database; provides structural information on proteins, peptides and nucleic acids. http://www.rcsb.org/pdb/

TargetDB - Target registration database; provides status and tracking information on the progress of sample preparation and solution of structures. http://targetdb.pdb.org

3. Correlating sequence with experimental properties

• Each sequence from the positive and the negative dataset was represented by frequencies of amino acids, di-, and tripeptides.
• The classification algorithm was constructed as a two-layered structure in which the output of the primary classifiers, support vector machines (SVM) with a Gaussian kernel operating on peptide frequencies, was combined by a second-level Naive Bayes classifier.
• Naive Bayes was used to integrate results from classifiers trained on different sets of attributes.
• Ten-times cross-validation (training and evaluation) was used to measure the performance of the classifiers.
• Feature selection was performed using a wrapper method with Naive Bayes as the base classifier.
• Crystallizability of small proteins: accuracy ~67%
• Solubility of non-membrane proteins: accuracy ~72%
• Expressability of membrane proteins: accuracy ~65%

4. Crystallizability of globular proteins

SECRET - A SEquence-based CRystallizability EvaluaTor

Crystallization is a major time-limiting step in structure determination. It is multidimensional, non-linear and non-conserved:
• Crystallization is highly influenced by the experimental setup.
• Small changes in sequence can strongly influence crystallizability and/or crystal quality.
• There is no evolutionary pressure on crystallization.

Positive dataset
Proteins with NMR (Nuclear Magnetic Resonance) structures that also have an X-ray structure, or bear a high sequence similarity (BLAST 75-100% identity, 10% length difference) to proteins with X-ray determined structures.

Negative dataset
Proteins with NMR structures only (based on a BLAST bit score cutoff of 30).

Sequence length: 46 to 200 amino acids

Performance
The meta-classifier had an accuracy of 67% (65% and 69% for the positive and negative class, respectively). As the Receiver Operating Characteristic (ROC) graph shows, it correctly classified 70% of negative instances while misclassifying only 35% of positive instances into the negative class.

Figure: ROC curves for the positive and negative classes (x-axis: FP rate, FN rate; y-axis: TP rate, TN rate). TP, FP, TN and FN rates are as described in the solubility section.

5. Solubility of globular proteins

PROSO - A PROtein SOlubility predictor

Protein solubility is a prerequisite for most biophysical studies: NMR, X-ray, DSC, dynamic light scattering, Biacore, ITC, chromatography, spectrophotometry.

Protein solubility is defined here as the ability of a protein to be heterologously expressed in E. coli in a soluble fraction or to be tractable to standard refolding procedures. Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics projects.

Positive dataset
• Proteins from PDB that were expressed in E. coli using a plasmid.
• Proteins from TargetDB having a status of at least "Soluble".
• Soluble proteins from the literature.

Negative dataset
• Persistently recalcitrant expressed proteins from TargetDB. All proteins identical to those in PDB were removed.

Sequence length: 20 to 1900 amino acids

Performance
• An overall prediction accuracy of 71.2% (74.5% for the soluble, 68% for the insoluble class)
• Matthews Correlation Coefficient (MCC) = 0.425

Figure: Receiver operating characteristic (ROC) of our method for protein solubility prediction (PROSO). Curves for the soluble and the insoluble classes are drawn individually. True positive rate (TP rate) and true negative rate (TN rate) are the number of correctly classified instances from the respective class divided by the number of all instances from that class. False positive rate (FP rate) and false negative rate (FN rate) are defined as the number of incorrectly classified instances from the respective class divided by the number of all members of that class.

6. Heterologous expression of transmembrane proteins

MEMEX - A MEMbrane protein EXperimental properties predictor

Integral membrane proteins are challenging targets for structure determination due to the substantial experimental difficulties involved in their sample preparation. Our study constitutes the first attempt to predict their experimental behavior in the context of high-throughput structure determination. At present, the work is hindered by the severely limited amount of data available for training the classifiers.

Positive dataset
Integral membrane proteins amenable to a given experimental step, based on the TargetDB database.

Negative dataset
Integral membrane proteins persistently recalcitrant to a given experimental step, based on the TargetDB database.

Figure: The number of amenable and recalcitrant targets at different stages of experimental structure determination.

Classification
Naive Bayes classifiers were built using as input 72 features of protein sequences, including amino acid composition, occurrence of amino acid groups, ratios between residue groups, and hydrophobicity measures.

Figure: Summary of protein sequence features.

Performance
An overall prediction accuracy of 70%, 64% and 60% was achieved for the cloning, expression and solubilization stages of the structural genomics pipeline, respectively, with an acceptable balance between the performance for the positive and negative class. Our predictors produce results 37-59% better than random, with p-values of 10^-3 to 10^-10 as judged by their respective success rates (contingency table analysis). Thus our classifiers are immediately useful for designing rational target selection strategies and may be instrumental in setting priorities on members of membrane protein families according to their predicted experimental behavior.

Figure: Performance of the Bayesian classifier.

7. Web servers

ALL CLASSIFIERS INTEGRATED INTO ONE WEB PAGE
• EXPROPRIATOR - An EXPerimental PROtein PRoperties predicTOR http://binfo.bio.wzw.tum.de/proj/expropriator/index.html

INDIVIDUAL CLASSIFIERS
• MEMEX - A MEMbrane protein EXperimental properties predictor http://webclu.bio.wzw.tum.de:8080/memex
• SECRET - A SEquence-based CRystallizability EvaluaTor http://webclu.bio.wzw.tum.de:8080/secret
• PROSO - A PROtein SOlubility predictor http://webclu.bio.wzw.tum.de:8080/proso

References

• Martin-Galiano, A.J. and Frishman, D. (2006a) Defining the fold space of membrane proteins: the CAMPS database, Proteins, 64, 906-922.
• Martin-Galiano, A.J., Smialowski, P. and Frishman, D. (2006b) Predicting experimental properties of integral membrane proteins by a Naive Bayes approach, Proteins (submitted).
• Smialowski, P., Martin-Galiano, A., Mikolajka, A., Girschick, T., Holak, T.A. and Frishman, D. (2006a) Protein solubility: sequence based prediction and experimental verification. Technical University Munich.
• Smialowski, P., Schmidt, T., Cox, J., Kirschner, A. and Frishman, D. (2006b) Will my protein crystallize? A sequence-based predictor, Proteins, 62, 343-355.
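
The per-class rates defined in the solubility section, together with overall accuracy and the Matthews Correlation Coefficient, can be computed from a 2x2 confusion table. The sketch below is illustrative only: the counts are hypothetical, chosen as 1000 instances per class so that the per-class accuracies match the reported 74.5% (soluble) and 68% (insoluble). The poster does not state the actual class sizes, and the function names are our own.

```python
import math


def class_rates(tp, fn, tn, fp):
    """Per-class rates: each count is divided by the size of its own class."""
    positives = tp + fn  # all instances of the positive (soluble) class
    negatives = tn + fp  # all instances of the negative (insoluble) class
    return {
        "tp_rate": tp / positives,
        "fn_rate": fn / positives,
        "tn_rate": tn / negatives,
        "fp_rate": fp / negatives,
    }


def accuracy(tp, fn, tn, fp):
    """Fraction of all instances that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def mcc(tp, fn, tn, fp):
    """Matthews Correlation Coefficient; returns 0.0 if a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts (assumption: balanced classes of 1000 instances each).
counts = dict(tp=745, fn=255, tn=680, fp=320)

rates = class_rates(**counts)
overall = accuracy(**counts)
coefficient = mcc(**counts)

print(rates)
print(f"accuracy = {overall:.4f}, MCC = {coefficient:.3f}")
```

Under this balanced-class assumption the accuracy is 0.7125 and the MCC is about 0.426, close to the reported 71.2% and 0.425; the small difference would be expected if the real classes are not exactly balanced.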

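Section 3 represents each sequence by frequencies of amino acids, di- and tripeptides. A minimal sketch of such a representation follows. It assumes overlapping windows, the 20 standard residues, and normalization by the number of valid windows; the poster does not specify these details, and the example sequence and function names are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def peptide_frequencies(sequence, k):
    """Relative frequency of every possible k-peptide in the sequence.

    Windows overlap; windows containing a non-standard residue are skipped
    (both choices are assumptions, not taken from the poster).
    """
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    valid = [w for w in windows if all(c in AMINO_ACIDS for c in w)]
    counts = Counter(valid)
    total = len(valid)
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(AMINO_ACIDS, repeat=k)
    }


def feature_sets(sequence):
    """One attribute set per peptide length: 20, 400 and 8000 features."""
    return {k: peptide_frequencies(sequence, k) for k in (1, 2, 3)}


# Hypothetical ten-residue example sequence.
features = feature_sets("MKTAYIAKQR")
print(len(features[1]), len(features[2]), len(features[3]))
print(features[1]["K"], features[2]["MK"])
```

Keeping the three attribute sets separate mirrors the two-layer design described in section 3, where each primary classifier is trained on one set and a second-level classifier combines their outputs.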