BioCoRE and GEMS: Cyber Infrastructure for Cyber Chemistry
Jesús A. Izaguirre
Computer Science & Engineering
University of Notre Dame
with Kirby Vandivort
NIH Resource for Macromolecular Modeling and Bioinformatics
University of Illinois

Overview I
• Chemical applications such as virtual screening, protein kinetics and structure, and analysis and validation of molecular simulations require enormous resources that can be provided by CyberInfrastructure
• Successful solution of these problems requires collaborative approaches, also facilitated by CyberInfrastructure

BioCoRE and GEMS 3 October 2004

Overview II
To make CyberInfrastructure effective, the following issues must be addressed:
• Users of CyberInfrastructure need a data-centric way of managing their computations and data
• Distributed databases on the grid need to address the problem of reliability and fault tolerance of data

Overview III
• We will study examples of collaborative software that address these issues, primarily:
  – BioCoRE: A Collaboratory for Structural Biology
  – GEMS: Grid Enabled Molecular Simulations Toolset and Database

Sample CyberScience Projects
• Collaborative Biophysics: BioCoRE (K. Schulten, Illinois)
• Virtual Screening: The Screensaver Project (W.G. Richards, Oxford)
• Protein Kinetics: Folding@Home (V. Pande, Stanford)
• Distributed Database of Molecular Simulations: BioSimGrid (M. Sansom, Oxford)

What is BioCoRE?
BioCoRE: a collaborative work environment for biomedical research, research management and training. BioCoRE assists the entire research process, from talking with collaborators to performing simulations and collecting data, to preparing papers and reports.

Sharing Documents
With the BioFS and WebDAV, scientists can exchange and edit files from anywhere with a web connection.
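
As a rough illustration of what file exchange over WebDAV looks like on the wire, the sketch below builds a plain HTTP PUT request, which is how WebDAV stores a file at a URL. It is a generic WebDAV example, not BioCoRE's actual interface: the server URL, the file path, and the helper names `build_webdav_put` and `upload` are assumptions made for illustration, and authentication is left out.

```python
import urllib.request
from urllib.parse import quote, urljoin


def build_webdav_put(base_url: str, remote_path: str, payload: bytes,
                     content_type: str = "application/octet-stream") -> urllib.request.Request:
    """Build (but do not send) an HTTP PUT request that stores payload at remote_path."""
    # Percent-encode the path so spaces and similar characters are legal in the URL.
    url = urljoin(base_url.rstrip("/") + "/", quote(remote_path.lstrip("/")))
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def upload(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code reported by the server."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical server URL and file name, for illustration only.
req = build_webdav_put("https://biocore.example.org/biofs",
                       "project1/notes draft.txt",
                       b"simulation notes")
print(req.get_method(), req.full_url)
```

Calling `upload(req)` against a real WebDAV share would send the file; a server typically answers 201 when a new file is created and 204 when an existing one is overwritten. Building the request separately from sending it keeps the example inspectable without network access.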

Setting Up and Running Simulations
• NAMDCFG: A “Simulation Setup Wizard”
• Online help and error checking for NAMD input files
• Job submission to supercomputers simplified
• Job status monitored for easy retrieval
• Job data archived for future reference

Sharing Molecular Views
Using VMD and BioCoRE, collaborators may exchange and manipulate 3-D models of molecules.
• Emphasis on collaborative sessions
• Streamlined process of sharing views

Communicating
• Control Panel provides instant messaging and notifications
• BioCoRE also provides message boards, a Web site library, and a lab book

Programming Interface
• Provides a way for users to interact with BioCoRE programmatically
• Communication (Control Panel), shared states (VMD)
• WebDAV

Availability
• Free
• Can be accessed from the Illinois site, or server software can be installed locally
• Server software can be modified if necessary
• http://www.ks.uiuc.edu/Research/biocore/

Virtual Screening
• Combinatorial Complexity Lead Exploration
• Screen docking affinities based on a scoring function (interaction energies, RMSD, etc.)
• Modeled as an all-pairs problem
• Logically independent computational requirements are well suited for wide area grid distribution
[Figure: leads (ligands) L0001, L0002, L0003, L0004, L0005]

CyberInfrastructure Needs for Virtual Screening I
• Incorporate protein (receptor) flexibility
  – Use multiple protein structures (hierarchical representations and algorithms)
• Iterative refinement of results
  – Add new protein conformations to improve docking
  – Use higher resolution models for promising hits (integration of data and work flow)
  – Monitor status of results (not just jobs running)

CyberInfrastructure Needs for Virtual Screening II
• Manage computation and storage in the grid
  – Declarative rather than imperative specification
• Automate usage of algorithms / tools
  – Select software and optimal parameters for algorithms (recommender system)
  – Example: MDSimAid (http://mdsimaid.cse.nd.edu) selects optimal MD simulation protocol (limited options)

BioSimGrid
Mark S. P. Sansom, Oxford
• Database for biomolecular simulations
• Specifically: molecular dynamics trajectories
• Facilitates validation and analysis of simulations
• Provides “independence” from the specific simulation semantics (configuration parameters, architecture, simulation tools, etc.)
• Trajectory data stored in relational database tables per Data Schema
• Semi-Automated Deposition of trajectory files for certain formats (CHARMM, NAMD, etc.)
• Trajectory analysis modules
• Future goal to distribute database

CyberInfrastructure Needs for Distributed Databases I
• Metadata for trajectories
  – Simulation protocol, software, etc.
• Distribution on the grid
  – Storage fault tolerance / reliability
  – Scalable solution: reduce storage requirements and centralization

CyberInfrastructure Needs for Distributed Databases II
• Data-driven model for the user
  – Data organized around key themes (trajectories, molecules)
• Generic tools for developers
  – Applicable to different applications

Solving Integration Problem
• We need to capture the data flow and the work flow
  – Ecce project
  – XML metadata
  – Component architectures (e.g., JavaBeans, Common Component Architecture)

Solving Integration Problem
• BioCoRE (K. Schulten, Illinois)
  – Use of programming interface
  – Provides multiple services to applications (web file system, job management, shared visualization)

Solving Grid Management
• Current grid tools are task oriented: run this particular simulation code with these input files, etc.
  – Web portals are an incremental improvement over command line or stand-alone applications
• Problem: Controlling multiple resources
  – For example, create 10,000 tasks & keep track of the data, as might be needed for virtual screening or @home applications

Solving Grid Management with GIPSE
• GIPSE: Grid Interface for Parameter-driven Simulation Environments
  – Shift focus from management to research
  – Result-driven interface
  – Scripting capabilities

Solving Grid Management with GIPSE
• Simplify process
  – XML Data format
  – Missing “glue”
• Powerful searches
  – Optimizations
  – Control loops
[Figure labels: GEMS Toolset; HIV-1 Protease]

Solving Grid Management with GIPSE
• Manage data
  – Storage
  – Database retrieval
• Monitor progress
  – Status
  – Application-specific
[Figure labels: GEMS Toolset; HIV-1 Protease]

GEMS Database Toolset
• Grid Enabled Molecular Simulation
  – Data Centric
  – Wide area distributed storage
  – Researchers have data and resource autonomy
  – Simulation configuration, input data files, and output data files identified via XML
  – Centralized SQL locator
  – Availability via replication

Reliability and Leveraged Availability via Runtime Imaging
• Reliability of data storage is increased
• User can trade off availability versus storage volume
• Workspace data has 2-way redundancy by default
• Archival data has a 2-way redundancy of fewer snapshots, but saves the computational images
• For each computational run through the GEMS portal a comprehensive runtime image is created from which the simulation can automatically be regenerated.
• Runtime images include executable version and location, library requirements, hardware requirements, input files, and configuration parameters

Integration of Distributed Data Into New Simulations
• A grid distributed “make” based on a computational requirement over a set parameter sweep
  – Example: optimize MD simulation protocol
• Before starting the sweep a query determines data points that are up to date and those that require computation (including regeneration)
  – Example: keep current list of results of virtual screening as more computations are performed or targets and ligands added

Example: Validating Simulations
• Locate specific published simulation configurations for benchmarking
• Select pertinent input data files (pdb, psf, force fields, etc.) for direct utilization in a new simulation for purpose of comparison/contrast.
• Researcher B wants to vary certain parameters of Researcher A’s published simulation to test her new MD integrator

Acknowledgments
• Collaborators in GIPSE and GEMS:
  – Aaron Striegel
  – Doug Thain
  – Jeff Peng
• Students
  – Paul Brenner
  – Santanu Chatterjee
• Klaus Schulten
• BioCoRE Team:
  – Robert Brunner
  – Michael Bach
  – David Brandon
• BioCoRE funding from NIH
• Funding from NSF Career and Biocomplexity
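
Supplementary sketch. The grid-distributed “make” from the slide “Integration of Distributed Data Into New Simulations” can be illustrated with a small in-memory example: before a sweep over all target/ligand pairs, a query splits the pairs into those already computed with the current protocol and those that still need computation. This is a minimal sketch under stated assumptions, not the GEMS implementation; the function names (`config_key`, `plan_sweep`, `run_sweep`), the dictionary used as a result store, and the toy scoring function are all hypothetical.

```python
import hashlib
import json
from itertools import product


def config_key(params: dict) -> str:
    """Stable fingerprint of a parameter set (sorted JSON, SHA-256)."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_sweep(targets, ligands, protocol: dict, results: dict):
    """Split the all-pairs sweep into up-to-date points and points needing computation.

    results maps (target, ligand) -> {"key": fingerprint, "score": float}.
    A point is up to date only if it exists and was computed with the same protocol.
    """
    key = config_key(protocol)
    up_to_date, to_compute = [], []
    for pair in product(targets, ligands):
        record = results.get(pair)
        if record is not None and record["key"] == key:
            up_to_date.append(pair)
        else:
            to_compute.append(pair)
    return up_to_date, to_compute


def run_sweep(targets, ligands, protocol, results, score_fn):
    """Compute only the stale or missing points, then return the refreshed results."""
    key = config_key(protocol)
    _, to_compute = plan_sweep(targets, ligands, protocol, results)
    for target, ligand in to_compute:
        results[(target, ligand)] = {"key": key,
                                     "score": score_fn(target, ligand, protocol)}
    return results


def toy_score(target, ligand, protocol):
    """Toy stand-in for a docking run (hypothetical, not a real scoring function)."""
    return len(target) * protocol["grid_spacing"] + len(ligand)


protocol = {"grid_spacing": 0.5, "scoring": "toy"}
store = {}
run_sweep(["protA"], ["L0001", "L0002"], protocol, store, toy_score)

# A new ligand is added: only the new pair should need computation.
done, todo = plan_sweep(["protA"], ["L0001", "L0002", "L0003"], protocol, store)
print("up to date:", done)
print("to compute:", todo)
```

Keying each result on a fingerprint of the protocol means that changing any parameter marks every affected point as stale, which mirrors how `make` rebuilds targets whose inputs changed. A real deployment would replace the dictionary with a database query and would dispatch the pending pairs to grid resources.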