Download EGEE07_FP_October1st2007

Biological applications in GRID: the EUChinaGRID experience F. Polticelli – University Roma Tre EUChinaGRID WP4-Applications Manager Budapest, 1.10.2007 FP6−2004−Infrastructures−6-SSA-026634 Outline  EUChinaGrid Overview  The structural genomics challenge  Biological Applications in EUChinaGRID • The “never born proteins” • Protein structure prediction using GRID – Rosetta integration within the Genius portal – Early/Late stage integration in the Gridsphere portal • Structure validation using GRID – AMBER deployment on GRID  Conclusions and perspectives • function recognition and catalytic site identification tools • In silico structural genomics  2/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 EUChinaGRID Overview  Aim • provide support actions to foster the integration and interoperability of the Grid infrastructures in Europe (EGEE) and China (CNGrid). • promote the migration of new applications on the Grid infrastructures by training new user communities and supporting the adoption of grid tools for scientific applications.  Applications • validate the intercontinental infrastructure using scientific applications • facilitate porting of new applications relevant for scientific and industrial collaboration between Europe and China. • three main application fields: – EGEE Applications (ATLAS and CMS) – Astroparticle Physics applications (the ARGO experiment) – Biology applications (“Never born Proteins)  3/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 The structural genomics challenge  The combination of the 20 natural amino acids in a specific sequence in a protein chain dictates the three-dimensional structure of the protein  Protein function is linked to the specific three-dimensional arrangement of amino acids functional groups.  With the advancement of molecular biology techniques a huge amount of information on protein sequences has been made available but far less information is available on structure and function of these proteins.  Prediction of protein structure and function is a key instrument to better understand the protein folding principles and successfully exploit the information provided by the “genomic revolution”.  4/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 The test case: Never born proteins  With 20 different comonomers, a protein chain of just 60 amino acids can theoretically exist in 2060 chemically and structurally unique combinations  But the number of natural proteins (109 to a maximum of 1013) is just a tiny fraction of all possible proteins  There exist a huge number of protein sequences that have never been exploited by biological systems, in other words and enormous number of “never born proteins” (NBP). These pose the following questions: – Which are the criteria with which the existing proteins have been selected? – Natural proteins have peculiar properties in terms for example of thermal stability, solubility in water or amino acid composition? – Can NBP be exploited for biomedical and/or biotechnological purposes?  5/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 Never born proteins and GRID  The problem is tackled by a “high throughput” approach made feasible by the use of the GRID infrastructure  A huge library of random amino acid sequences of fixed length is generated (n=70)  “ab initio” protein structure prediction software is used.  Analysis of the structural characteristics of the resulting proteins • frequency of compact and yet unknown folds • presence of putative catalytic sites • experimental validation on “interesting” cases  6/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 The tool: Rosetta abinitio  Developed by David Baker – University of Washington  Based on a “fragment assembly” strategy  semi-empirical force field for the evaluation of the thermodinamics of the predicted structure  Particularly successful in the prediction of novel folds in the CASP competitions (Critical Assesment of Structure Prediction)  Rosetta abinitio has been deployed in GRID through the use of the GENIUS interface with the option of parametric jobs submission to run a large number of jobs (structure predictions) at the same time.  7/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 First step: Integration on the GILDA facility  Single job execution on GILDA • A shell script has been prepared which: – registers the program executable and the required input files (fragment libraries and secondary structure prediction file) on the LFC catalog – calls the Rosetta executable and proceeds with workflow execution. • A JDL file was created to run the application on the GILDA working nodes which use the gLite middleware  8/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 Integration on the GENIUS web portal  To facilitate the use of the Rosetta abinitio application within the grid environment by the computational biology community, the application was integrated within the GENIUS portal (https://glite-tutor.ct.infn.it).  After MyProxy server initialization, input files and executable uploading, JDL file preparation, application running, run status monitoring and download of the output file are carried out from within the portal.  Given the huge number of “never born proteins” to be simulated, a parametric JDL file automatic generation procedure has been set up within the GENIUS environment.  More than 2x104 never born protein structures predicted so far  9/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 GENIUS screenshots  10/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 Never born proteins structure examples  11/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 Early/Late stage  Developed by Irena Roterman group – Jagellonian University  A program for protein folding simulation not structure prediction (complementary approach to Rosetta)  based on early stage - statistics using a database of known sequences;  late stage - energy minimization in alternating potentials; this stage is the most computationally expensive;  Early/Late stage has been deployed in GRID through the use of the GridSphere Portal Framework and Gridwise Tech LCG API package, that provides access to gLite middleware.  12/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 Early/Late stage  A self-containing bundle of programs and libraries needed for application running was created and registered in the LFC catalogue.  A script was created to install the application on site each time a job is started.  A JDL file was created to run application on the grid, that use the gLite middleware.  Finally, to enable running the application for users that are not familiar with the grid, it was decided to integrate it in a web portal based on the GridSphere Portal Framework.  13/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 Never born proteins. What’s next ?  Build consensus  Experimental validation  14/tot  • Nuclear magnetic resonance (NMR) data acquired in the NMR centre of Peking University • Experimental data contain all the information about the primary structure of the protein, about topology and bonds. • NMR structure calculation and refinement is an iterative process which, for a single protein, involves many starting structures, normally 200 structures per round, and each protein may need 10-30 (or more) rounds of calculations. Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 AMBER porting on GRID  A simple .JDL file and a set of scripts to run the program have been developed • • • • • • Executable = "amber_serial.sh"; StdOutput = "testJob.out"; StdError = "testJob.err"; InputSandbox = {"amber_serial.sh","amber_test/amber_grid.tar"}; OutputSandbox = {"testJob.out","testJob.err","out.tar"}; Requirements = other.GlueCEUniqueID == "gridce.roma3.infn.it:2119/jobmanager-lcgpbs-grid";  The program is currently under testing by the Peking University NMR group  15/tot  Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 What we have done  Structural bioinformatics is a key instrument to exploit the huge amount of data available on human and pathogens genes.  In the EUChinaGRID project we set up a system to predict the three-dimensional structure of a high number of protein sequences, to validate the predictions and to test them experimentally  We are currently refining function recognition (ASSIST) and catalytic site identification tools (Early/Late Stage) What we plan to do  In silico structural genomics of bacterial and viral pathogens • • • •  16/tot  Low-cost activity High potential biomedical impact with small investments Application to endemic human and animal pathogens of developing countries Sinergy with pharmaceutical industry Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007 Acknowledgements - Prof. Luisi for the original idea of “never born proteins” - INFN Catania (Rosetta deployment) - Jagellonian Univ. (Early/Late Stage deployment) - INFN Roma Tre (AMBER deployment) FP6−2004−Infrastructures−6-SSA-026634 Thank you for your attention ! FP6−2004−Infrastructures−6-SSA-026634

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download EGEE07_FP_October1st2007