* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download EGEE07_FP_October1st2007
Protein design wikipedia , lookup
Bimolecular fluorescence complementation wikipedia , lookup
List of types of proteins wikipedia , lookup
Circular dichroism wikipedia , lookup
Structural alignment wikipedia , lookup
Alpha helix wikipedia , lookup
Protein folding wikipedia , lookup
Protein domain wikipedia , lookup
Protein purification wikipedia , lookup
Protein moonlighting wikipedia , lookup
Western blot wikipedia , lookup
Homology modeling wikipedia , lookup
Rosetta@home wikipedia , lookup
Protein mass spectrometry wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Intrinsically disordered proteins wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
Biological applications in GRID: the EUChinaGRID experience F. Polticelli – University Roma Tre EUChinaGRID WP4-Applications Manager Budapest, 1.10.2007 FP6−2004−Infrastructures−6-SSA-026634 Outline EUChinaGrid Overview The structural genomics challenge Biological Applications in EUChinaGRID • The “never born proteins” • Protein structure prediction using GRID – Rosetta integration within the Genius portal – Early/Late stage integration in the Gridsphere portal • Structure validation using GRID – AMBER deployment on GRID Conclusions and perspectives • function recognition and catalytic site identification tools • In silico structural genomics 2/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 EUChinaGRID Overview Aim • provide support actions to foster the integration and interoperability of the Grid infrastructures in Europe (EGEE) and China (CNGrid). • promote the migration of new applications on the Grid infrastructures by training new user communities and supporting the adoption of grid tools for scientific applications. Applications • validate the intercontinental infrastructure using scientific applications • facilitate porting of new applications relevant for scientific and industrial collaboration between Europe and China. • three main application fields: – EGEE Applications (ATLAS and CMS) – Astroparticle Physics applications (the ARGO experiment) – Biology applications (“Never born Proteins) 3/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 The structural genomics challenge The combination of the 20 natural amino acids in a specific sequence in a protein chain dictates the three-dimensional structure of the protein Protein function is linked to the specific three-dimensional arrangement of amino acids functional groups. With the advancement of molecular biology techniques a huge amount of information on protein sequences has been made available but far less information is available on structure and function of these proteins. Prediction of protein structure and function is a key instrument to better understand the protein folding principles and successfully exploit the information provided by the “genomic revolution”. 4/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 The test case: Never born proteins With 20 different comonomers, a protein chain of just 60 amino acids can theoretically exist in 2060 chemically and structurally unique combinations But the number of natural proteins (109 to a maximum of 1013) is just a tiny fraction of all possible proteins There exist a huge number of protein sequences that have never been exploited by biological systems, in other words and enormous number of “never born proteins” (NBP). These pose the following questions: – Which are the criteria with which the existing proteins have been selected? – Natural proteins have peculiar properties in terms for example of thermal stability, solubility in water or amino acid composition? – Can NBP be exploited for biomedical and/or biotechnological purposes? 5/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 Never born proteins and GRID The problem is tackled by a “high throughput” approach made feasible by the use of the GRID infrastructure A huge library of random amino acid sequences of fixed length is generated (n=70) “ab initio” protein structure prediction software is used. Analysis of the structural characteristics of the resulting proteins • frequency of compact and yet unknown folds • presence of putative catalytic sites • experimental validation on “interesting” cases 6/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 The tool: Rosetta abinitio Developed by David Baker – University of Washington Based on a “fragment assembly” strategy semi-empirical force field for the evaluation of the thermodinamics of the predicted structure Particularly successful in the prediction of novel folds in the CASP competitions (Critical Assesment of Structure Prediction) Rosetta abinitio has been deployed in GRID through the use of the GENIUS interface with the option of parametric jobs submission to run a large number of jobs (structure predictions) at the same time. 7/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 First step: Integration on the GILDA facility Single job execution on GILDA • A shell script has been prepared which: – registers the program executable and the required input files (fragment libraries and secondary structure prediction file) on the LFC catalog – calls the Rosetta executable and proceeds with workflow execution. • A JDL file was created to run the application on the GILDA working nodes which use the gLite middleware 8/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 Integration on the GENIUS web portal To facilitate the use of the Rosetta abinitio application within the grid environment by the computational biology community, the application was integrated within the GENIUS portal (https://glite-tutor.ct.infn.it). After MyProxy server initialization, input files and executable uploading, JDL file preparation, application running, run status monitoring and download of the output file are carried out from within the portal. Given the huge number of “never born proteins” to be simulated, a parametric JDL file automatic generation procedure has been set up within the GENIUS environment. More than 2x104 never born protein structures predicted so far 9/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 GENIUS screenshots 10/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 Never born proteins structure examples 11/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 Early/Late stage Developed by Irena Roterman group – Jagellonian University A program for protein folding simulation not structure prediction (complementary approach to Rosetta) based on early stage - statistics using a database of known sequences; late stage - energy minimization in alternating potentials; this stage is the most computationally expensive; Early/Late stage has been deployed in GRID through the use of the GridSphere Portal Framework and Gridwise Tech LCG API package, that provides access to gLite middleware. 12/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 Early/Late stage A self-containing bundle of programs and libraries needed for application running was created and registered in the LFC catalogue. A script was created to install the application on site each time a job is started. A JDL file was created to run application on the grid, that use the gLite middleware. Finally, to enable running the application for users that are not familiar with the grid, it was decided to integrate it in a web portal based on the GridSphere Portal Framework. 13/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 Never born proteins. What’s next ? Build consensus Experimental validation 14/tot • Nuclear magnetic resonance (NMR) data acquired in the NMR centre of Peking University • Experimental data contain all the information about the primary structure of the protein, about topology and bonds. • NMR structure calculation and refinement is an iterative process which, for a single protein, involves many starting structures, normally 200 structures per round, and each protein may need 10-30 (or more) rounds of calculations. Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 AMBER porting on GRID A simple .JDL file and a set of scripts to run the program have been developed • • • • • • Executable = "amber_serial.sh"; StdOutput = "testJob.out"; StdError = "testJob.err"; InputSandbox = {"amber_serial.sh","amber_test/amber_grid.tar"}; OutputSandbox = {"testJob.out","testJob.err","out.tar"}; Requirements = other.GlueCEUniqueID == "gridce.roma3.infn.it:2119/jobmanager-lcgpbs-grid"; The program is currently under testing by the Peking University NMR group 15/tot Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 What we have done Structural bioinformatics is a key instrument to exploit the huge amount of data available on human and pathogens genes. In the EUChinaGRID project we set up a system to predict the three-dimensional structure of a high number of protein sequences, to validate the predictions and to test them experimentally We are currently refining function recognition (ASSIST) and catalytic site identification tools (Early/Late Stage) What we plan to do In silico structural genomics of bacterial and viral pathogens • • • • 16/tot Low-cost activity High potential biomedical impact with small investments Application to endemic human and animal pathogens of developing countries Sinergy with pharmaceutical industry Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007 Acknowledgements - Prof. Luisi for the original idea of “never born proteins” - INFN Catania (Rosetta deployment) - Jagellonian Univ. (Early/Late Stage deployment) - INFN Roma Tre (AMBER deployment) FP6−2004−Infrastructures−6-SSA-026634 Thank you for your attention ! FP6−2004−Infrastructures−6-SSA-026634