A Grid-based System for Microbial Genome Comparison and Analysis
Anil Wipat
University of Newcastle upon Tyne, UK

Motivation: Genome Comparison
- The past decade has seen the emergence of whole genome sequencing
- Whole genome sequences can reveal a great deal about the biology of an organism
- Comparing genomes is one of the most effective ways to exploit genome sequence information
  - Establishes the differences and similarities at the genetic level
  - Aids biologists in understanding pathogenicity, evolution, ecology, metabolism, etc.

Microbial Genome comparison commonly applied at different levels:
- All-against-all amino acid sequence comparisons between proteins
    Proteins (amino acid sequence MCSAKMQTR..)
    Proteins (amino acid sequence MSAKMPTR..)
- Nucleotide sequence comparison (whole genome)
    DNA (nucleotide sequence) (..atcggatcgtacgagcgatc..)
    DNA (nucleotide sequence) (..atcccatcgaacgagcgatc..)

Motivation: Genome Comparison
- The number of complete genome sequences is rapidly increasing as sequencing technology advances
- Sequence analysis and comparison is becoming more computationally intensive
- e.g. ~200 whole genomes have been sequenced
- Large scale genome comparison is already beyond the capability of many laboratories
- How are we going to handle all these genomes?
- New methods and technologies for genome comparison are required

Microbase Project Overview
- Aims to create a scalable, Grid-enabled analytical system to support microbial genome comparison
- Aims to support both the biological and bioinformatics community
- Funded by BBSRC Bioinformatics and e-Science & DTI
- Started April 2003
- Collaboration with microbiologists and industrial partners
  - Providing use cases
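
To make the scaling concern on the motivation slides concrete, the sketch below counts how many pairwise comparisons an all-against-all run needs as the number of genomes grows. It assumes each unordered pair of distinct genomes is compared once per tool. The slides do not say whether self-comparisons or both orderings are also run, so this is an illustration under that assumption, not Microbase's actual task count. The function name and genome identifiers are hypothetical.

```python
from itertools import combinations


def pairwise_comparisons(genome_ids):
    """Return every unordered pair of distinct genomes (assumed unit of work)."""
    return list(combinations(genome_ids, 2))


if __name__ == "__main__":
    for n in (10, 170, 200):
        genomes = [f"genome_{i}" for i in range(n)]  # placeholder identifiers
        pairs = pairwise_comparisons(genomes)
        # n * (n - 1) / 2 pairs, so the count grows quadratically with n
        print(f"{n} genomes -> {len(pairs)} pairwise comparisons")
```

Under this assumption the pair count is n(n-1)/2, so doubling the number of genomes roughly quadruples the work for every comparison tool that is run.
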

Microbase: Functionality
- A system that utilises Grid resources to automatically perform genome comparisons at nucleotide and protein levels
- An information repository that:
  - maintains and exposes the results of these comparisons to users as a base-level dataset
  - provides canned algorithms for analysis
- A Grid-enabled high-performance environment to execute remote user-specified computations
- Data integration with remote, Grid-enabled databases
  - e.g. Genomic, Metabolic, Protein Interaction, Gene Expression databases, etc.

MicrobaseLite: A Prototype
- The first prototype of the Microbase system
- Automatically performs all-against-all genome comparisons and exposes the resulting datasets
- Provides services for biologists to browse and query genome sequences and comparison results
- Helps the specification of the entire Microbase system and the derivation of use cases
- Implemented using a component-based architecture with Web services interfaces
- Also uses existing Grid technology: the myGrid Notification Service

MicrobaseLite: Datasets
- 170+ microbial genomes, including bacteria, archaea and eukaryota
  - Held in the GenomePool component
- Results of all-against-all nucleotide sequence comparison
  - Blastn, MUMmer
- Results of all-against-all protein sequence comparison
  - Blastp, Ssearch, PROmer
- Comparison results are held in the ComparisonPool component
- Object-oriented data model of interspecies genome rearrangements
  - The OGRE module component (current research)

MicrobaseLite: Architecture
[Figure: architecture diagram with a server side and a client side. Component labels: User Tools, Microbial Genome Pool, Request Builder, Client Proxies, Notification Proxy, Web Services Proxy, Response Receiver, Data Processing, Genome Comparison Pool, Notification Service, External Notification, Task Scheduler, Internal Notification, Protein Comparison, Genome Loader, Graphical Viewer, DNA Comparison, Post-processing, BIOSQL, Web Services, Comparison Database, Query, Query & Execution, Object Model Builder, Object-oriented Database, OGRE Module]

MicrobaseLite: Microbial Genome Pool
[Figure: Clients, Microbial Genome Pool (Genome Loader, BIOSQL), Notification Service (External Notification, Internal Notification), Comparison Pool]
- Provides a Web / Grid service based information repository of microbial genomes
- Maintains a database of 170+ microbial genomes
- A Web service implementation of BioJava interfaces
- Uses the myGrid Notification Service to notify registered clients of new genomes
- Available for use now with a prototype Web Service API

MicrobaseLite: Genome Comparison Pool
- Retrieves genomes from the Microbial Genome Pool automatically on notification
- Executes a variety of genome comparison tools: Blast, MUMmer, PROmer, MSPcrunch
- Incorporates a Task Scheduler for parallel processing
- Uses N1 Grid Engine (batch system) to dispatch comparison tasks to run on Linux clusters
- Comparison outputs are processed and stored in a relational database (MySQL)
[Figure: Genome Comparison Pool (Comparison Database, Task Scheduler, Post-processing, Protein & Nucleotide Comparison) with N1 Grid Engine dispatching to parallel cluster(s)]

Task Scheduler and scalability
[Figure: Task Scheduler workflow. Labels: Microbial Genome Pool, BIOSQL, Input, Pre-load, Job Creation, Threshold Control, Job Submission, N1 Grid Engine, Job Execution on workstations, Job State Checking, Output, Comparison Database, Genome Comparison Pool]

Execution times of all-against-all comparisons with 10 microbial genomes (Blastp, Blastn, MSPcrunch, MUMmer and PROmer):

Number of Processors    Execution Time (minutes)
1                       978.02
10                      103.03
20                      57.67
30                      48.48
40                      37.33

[Chart: execution time (minutes) plotted against number of processors]

MicrobaseLite: User Tools
- Demonstration graphical tools under development
- Genome Browser allows users to view genomes, the comparison results and the results of canned algorithms
- Deployed at the client side, operating via Web services

Vision for the full Microbase System
- Continue to explore scalability issues using MicrobaseLite as a platform
- Towards seamless scalability
  - Harnessing of remote clusters on demand
- A system for the submission and enactment of remotely conceived code or workflows for user-defined comparative analysis
- Investigating the integration of the Taverna core to enact SCUFL workflows within Microbase

Conclusions
- Microbase aims to exploit Grid resources to provide a scalable system for microbial genome comparison
- MicrobaseLite produced as a prototype and demonstrator application for the biologist/bioinformatician
- Work now underway on the full Microbase: a system to support remotely conceived computations

Acknowledgements
- The Microbase Team: Anil Wipat, Yudong Sun, Matthew Pocock, Keith Flanagan, Pete Lee, and Paul Watson
- The Microbase User Requirements/Use case contributors
- myGrid project (particularly Southampton and EBI)
- The industrial supporters: NonLinear Dynamics, NCIMB, Arrow Therapeutics, Angel Biotech, Complement Genomics, ACS Dobfar, AstraZeneca
- See www.microbase.org.uk

Microbial Genome comparison commonly applied at two levels:
- All-against-all amino acid sequence comparisons between proteins
    Proteins (amino acid sequence MCSAKMQTR..)
    Proteins (amino acid sequence MSAKMPTR..)
- Nucleotide sequence comparison (whole genome)
    DNA (nucleotide sequence) (..atcggatcgtacgagcgatc..)
    DNA (nucleotide sequence) (..atcccatcgaacgagcgatc..)
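
The execution times reported on the Task Scheduler and scalability slide can be turned into speedup and parallel efficiency figures. The sketch below uses the five timings from that slide and the standard definitions speedup = T(1) / T(p) and efficiency = speedup / p. The variable and function names are illustrative and are not part of Microbase.

```python
# Execution times in minutes from the Task Scheduler slide, keyed by processor count.
TIMINGS = {1: 978.02, 10: 103.03, 20: 57.67, 30: 48.48, 40: 37.33}


def speedup_table(timings):
    """Return (processors, speedup, efficiency) rows relative to the 1-processor run."""
    baseline = timings[1]
    rows = []
    for procs in sorted(timings):
        speedup = baseline / timings[procs]
        rows.append((procs, speedup, speedup / procs))
    return rows


if __name__ == "__main__":
    for procs, speedup, efficiency in speedup_table(TIMINGS):
        print(f"{procs:>2} processors: speedup {speedup:5.2f}, efficiency {efficiency:.2f}")
```

On these figures the speedup is about 9.5 at 10 processors and about 26.2 at 40, so efficiency falls from roughly 0.95 to roughly 0.65 as processors are added.
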

OGRE: Object-oriented Genome REarrangements Model
- A dataset that captures genomic rearrangements between microorganisms
- Object-oriented (OO) concepts and formalism are being used to classify the results of the nucleotide sequence comparison
- An ontology and OO conceptual model is being developed to describe chromosomal rearrangements and to define objects that can represent them
- Algorithms developed to recognise defined rearrangement features in nucleotide sequence comparison data
- Objects made persistent in an OO database

MicrobaseLite: OGRE Module
[Figure: Comparison Pool feeding the OGRE Module (Object Model Builder, Object-oriented Database, Query & Execution), exposed through Web Services]
- Performs object-oriented analysis and storage of genome rearrangements
- An OO dataset captures genomic rearrangements revealed through nucleotide sequence comparison
- Made persistent in an OO database
- Provides a Web services interface for external users to query and analyse the OO dataset
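
As a rough illustration of the idea of representing rearrangements as objects, the sketch below models one aligned region from a nucleotide comparison and flags inversions. Everything here is an assumption made for illustration: the slides do not give the OGRE class names, its schema, or which rearrangement types it recognises. The sketch also assumes a coordinate convention in which a reverse-orientation match is reported with its query start greater than its query end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One aligned region between a reference and a query genome (hypothetical schema)."""
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int


@dataclass(frozen=True)
class Rearrangement:
    """Base class for objects that describe a rearrangement feature (hypothetical)."""
    match: Match


class Inversion(Rearrangement):
    """The query region aligns to the reference in reverse orientation."""


def find_inversions(matches):
    """Flag matches whose query coordinates run backwards (assumed convention)."""
    return [Inversion(m) for m in matches if m.query_start > m.query_end]


if __name__ == "__main__":
    matches = [Match(1, 100, 1, 100), Match(200, 300, 900, 800)]
    for inversion in find_inversions(matches):
        print("inversion at reference", inversion.match.ref_start, "-", inversion.match.ref_end)
```

A class hierarchy of this kind would let further rearrangement types be added as subclasses and queried uniformly, which is one plausible reading of the OO approach described on the slide.
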