A Grid-based System for
Microbial Genome Comparison
and Analysis
Anil Wipat
University of Newcastle upon
Tyne, UK
Motivation: Genome Comparison

The past decade has seen the emergence of whole
genome sequencing

Whole genome sequences can reveal a great deal about
the biology of an organism

Comparing genomes is one of the most effective ways to
exploit genome sequence information

Establishes the differences and similarities at the genetic
level

Aids biologists in understanding pathogenicity, evolution,
ecology, metabolism, etc.
Microbial genome comparison is commonly applied
at different levels:
Proteins: all-against-all amino acid sequence
comparisons between proteins
(e.g. MCSAKMQTR.. against MSAKMPTR..)
DNA: whole-genome nucleotide sequence comparison
(e.g. ..atcggatcgtacgagcgatc.. against ..atcccatcgaacgagcgatc..)
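To make the nucleotide-level comparison concrete, here is a minimal sketch that scores the two DNA fragments shown on the slide by position-wise identity. The function name `percent_identity` is an assumption introduced for illustration. This toy measure only works for equal-length, ungapped sequences, so it is not a stand-in for the alignment tools used in practice.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this toy measure needs equal-length, ungapped sequences")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# The two DNA fragments shown on the slide.
dna_1 = "atcggatcgtacgagcgatc"
dna_2 = "atcccatcgaacgagcgatc"

print(f"{percent_identity(dna_1, dna_2):.2f}")  # prints 0.85 (17 of 20 positions match)
```

The two protein fragments on the slide differ in length, so comparing them would need an alignment step that allows gaps.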
Motivation: Genome Comparison

The number of complete genome sequences is rapidly
increasing as sequencing technology advances

e.g. ~200 whole genomes have been sequenced

Sequence analysis and comparison is becoming more
computationally intensive

Large-scale genome comparison is already beyond the capability
of many laboratories
How are we going to handle all these genomes?

New methods and technologies for genome comparison are
required.
Microbase Project Overview

Aims to create a scalable, Grid-enabled analytical
system to support microbial genome comparison.

Aims to support both the biological and bioinformatics
community.

Funded by BBSRC Bioinformatics and e-Science &
DTI

Started April 2003.

Collaboration with microbiologists and industrial
partners

Providing use cases.
Microbase: Functionality

A system that utilises Grid resources to automatically perform
genome comparisons at nucleotide and protein levels

An information repository that:


maintains and exposes the results of these comparisons to users as a
base level dataset
provides canned algorithms for analysis

A Grid-enabled high-performance environment to execute remote
user-specified computations

Data integration with remote, Grid-enabled databases

e.g. Genomic, Metabolic, Protein Interaction, Gene Expression
databases, etc…
MicrobaseLite: A Prototype

The first prototype of the Microbase system

Automatically performs all-against-all genome comparisons and
exposes the resulting datasets

Provides services for biologists to browse and query genome
sequences and comparison results

Helps to specify the full Microbase system and to
derive use cases

Implemented using a Component-based architecture with Web
services interfaces

Also uses existing Grid technology – myGrid Notification Service
MicrobaseLite: Datasets


170+ microbial genomes, including
bacteria, archaea and eukaryota

Held in the GenomePool component

Results of all-against-all nucleotide sequence comparison

Blastn, MUMmer

Results of all-against-all protein sequence comparison

Blastp, Ssearch, Promer

Held in the ComparisonPool component

Object-oriented data model of interspecies genome rearrangements

The OGRE module component (current research)
MicrobaseLite: Architecture
[Architecture diagram, labels only.
Client side: User Tools, Request Builder, Response Receiver, Data Processing, Client Proxies (Notification Proxy, Web Services Proxy)
Server side: Microbial Genome Pool (Genome Loader, BIOSQL), Notification Service (External Notification, Internal Notification), Genome Comparison Pool (Task Scheduler, Protein Comparison, DNA Comparison, Post-processing, Comparison Database), OGRE Module (Object Model Builder, Object-oriented Database, Query & Execution)
Other labels: Graphical Viewer, Query, Web Services]
MicrobaseLite: Microbial Genome Pool

Provides a Web / Grid service based
information repository of microbial
genomes

Maintains a database of 170+ microbial
genomes

A web-service implementation of BioJava
interfaces

Uses the myGrid Notification Service to
notify registered clients of new genomes

Available for use now with a prototype
API

[Diagram labels: Clients, Web Service API, Microbial Genome Pool (Genome Loader, BIOSQL), Notification Service (External Notification, Internal Notification), Comparison Pool]
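The notification pattern described above can be sketched as a small publish/subscribe example. This is an in-process illustration under stated assumptions, not the myGrid Notification Service API: the class names, the `genome.added` topic, and the method names are all hypothetical. It assumes self-comparisons are not queued.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


class NotificationService:
    """Hypothetical in-process stand-in for a topic-based notification service."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> None:
        for callback in self._subscribers[topic]:
            callback(message)


class GenomePool:
    """Holds genome identifiers and announces each newly loaded genome."""

    def __init__(self, notifier: NotificationService) -> None:
        self.genomes: List[str] = []
        self._notifier = notifier

    def load_genome(self, genome_id: str) -> None:
        self.genomes.append(genome_id)
        self._notifier.publish("genome.added", genome_id)


class ComparisonPool:
    """Queues a comparison of every new genome against all genomes seen so far."""

    def __init__(self, notifier: NotificationService) -> None:
        self.known: List[str] = []
        self.pending: List[Tuple[str, str]] = []
        notifier.subscribe("genome.added", self.on_new_genome)

    def on_new_genome(self, genome_id: str) -> None:
        self.pending.extend((genome_id, other) for other in self.known)
        self.known.append(genome_id)


notifier = NotificationService()
comparison_pool = ComparisonPool(notifier)
genome_pool = GenomePool(notifier)
for name in ["A", "B", "C"]:
    genome_pool.load_genome(name)

print(comparison_pool.pending)  # prints [('B', 'A'), ('C', 'A'), ('C', 'B')]
```

Because each arrival is compared only against genomes already known, the all-against-all dataset can grow incrementally rather than being recomputed from scratch.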
MicrobaseLite: Genome Comparison Pool

Retrieves genomes from the Microbial
Genome Pool automatically on Notification

Executes a variety of genome comparison
tools: Blast, MUMmer, Promer, MSPcrunch

Incorporates a Task Scheduler for parallel
processing

Uses N1 Grid Engine (batch system) to
dispatch comparison tasks to run on
Linux clusters

Comparison outputs processed and stored
into a relational database (MySQL).

[Diagram labels: Genome Comparison Pool (Task Scheduler, Protein & Nucleotide Comparison, Post-processing, Comparison Database), N1 Grid Engine, Parallel Cluster(s)]
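The dispatch-and-store flow above can be sketched as follows, with heavy substitutions that should be read as assumptions. A local thread pool stands in for N1 Grid Engine. An in-memory SQLite database stands in for MySQL. A shared 4-mer count stands in for the real comparison tools. The genome names, table layout and function names are invented for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Set, Tuple

# Toy sequences invented for illustration; real inputs would be whole genomes.
GENOMES = {
    "genome_a": "atcggatcgtacgagcgatc",
    "genome_b": "atcccatcgaacgagcgatc",
    "genome_c": "ggatcgtacgagcgatcatc",
}


def kmers(seq: str, k: int = 4) -> Set[str]:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compare(task: Tuple[str, str]) -> Tuple[str, str, int]:
    """Placeholder for one comparison job: counts 4-mers shared by two genomes."""
    query, subject = task
    shared = len(kmers(GENOMES[query]) & kmers(GENOMES[subject]))
    return query, subject, shared


def run_all_against_all(max_workers: int = 4) -> sqlite3.Connection:
    # Job creation: one task per unordered pair of genomes.
    tasks = list(combinations(sorted(GENOMES), 2))
    # Job submission and execution: a thread pool stands in for the batch system.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(compare, tasks))
    # Post-processing: store one row per finished comparison.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE comparison (query_id TEXT, subject_id TEXT, shared_kmers INTEGER)"
    )
    db.executemany("INSERT INTO comparison VALUES (?, ?, ?)", results)
    db.commit()
    return db


db = run_all_against_all()
for row in db.execute("SELECT query_id, subject_id, shared_kmers FROM comparison"):
    print(row)
```

The number of tasks grows as n(n-1)/2 with the number of genomes, which is why the comparison step benefits from being spread across many processors.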
Task Scheduler and scalability
Execution times of all-against-all comparisons
with 10 microbial genomes
(Blastp, Blastn, MSPcrunch, MUMmer and PROmer)

Number of Processors    Execution Time (minutes)
1                       978.02
10                      103.03
20                      57.67
30                      48.48
40                      37.33

[Chart: execution time (minutes, 0 to 1000) against number of processors (1 to 40)]

[Diagram labels: Task Scheduler (Job Creation, Job Submission, Job State Checking, Threshold Control); Input: Microbial Genome Pool (BIOSQL), Pre-load; N1 Grid Engine; Genome Comparison Pool; Job Execution on Workstations; Output: Comparison Database]
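The timings above can be turned into speedup and parallel efficiency figures with a few lines of arithmetic. The sketch below uses only the five measurements from the slide. Speedup is taken relative to the single-processor run, and efficiency is speedup divided by processor count. The variable names are illustrative.

```python
# Execution times in minutes, keyed by number of processors (from the slide).
timings = {1: 978.02, 10: 103.03, 20: 57.67, 30: 48.48, 40: 37.33}

baseline = timings[1]
speedup = {p: baseline / t for p, t in timings.items()}
efficiency = {p: speedup[p] / p for p in timings}

for p in sorted(timings):
    print(f"{p:>2} processors: speedup {speedup[p]:6.2f}, efficiency {efficiency[p]:.2f}")
```

On these numbers the speedup is roughly 9.5 on 10 processors and roughly 26 on 40, so efficiency falls from about 0.95 to about 0.66 as processors are added.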
MicrobaseLite: User Tools

Demonstration graphical
tools under development

Genome Browser allows
users to view genomes, the
comparison results and the
results of canned algorithms

Deployed on the client side,
operating via Web services
Vision for the full Microbase
System

Continue to explore scalability issues using
MicrobaseLite as a platform

Towards seamless scalability
Harnessing of remote clusters on demand

A system for the submission and enactment of remotely
conceived code or workflows for user-defined
comparative analysis

Investigating the integration of Taverna core to enact SCUFL
workflows within Microbase
Conclusions

Microbase aims to exploit Grid resources to provide a
scalable system for Microbial genome comparison

MicrobaseLite produced as a prototype and
demonstrator application for the
biologist/bioinformatician

Work now underway on the full Microbase - a system to
support remotely conceived computations
Acknowledgements

The Microbase Team:

Anil Wipat, Yudong Sun, Matthew Pocock, Keith Flanagan, Pete
Lee, and Paul Watson

The Microbase User Requirements/Use case contributors

The myGrid project (particularly Southampton and EBI)

The industrial supporters: NonLinear Dynamics, NCIMB, Arrow
Therapeutics, Angel Biotech, Complement Genomics, ACS Dobfar,
AstraZeneca

See www.microbase.org.uk
OGRE: Object-oriented Genome
REarrangements Model

A dataset that captures genomic rearrangements between
microorganisms

Object-oriented (OO) concepts and formalisms are being used to
classify the results of the nucleotide sequence comparison

An ontology and OO conceptual model are being developed to describe
chromosomal rearrangements and to define objects that can represent
them

Algorithms developed to recognise defined rearrangement features in
nucleotide sequence comparison data

Objects made persistent in an OO database
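As an illustration of the idea of representing rearrangements as objects, the sketch below defines a hypothetical `Match` record for one aligned region and an `Inversion` object derived from it. None of these class names, fields or rules come from OGRE itself. The detection rule, which treats every sufficiently long reverse-strand match as a candidate inversion, is a deliberately simple assumption. The coordinates are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Match:
    """One aligned region from a nucleotide comparison (hypothetical shape)."""
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    strand: str  # "+" same orientation, "-" reverse complement


@dataclass(frozen=True)
class Inversion:
    """A rearrangement object: a region present in reversed orientation."""
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int


def find_inversions(matches: List[Match], min_length: int = 1) -> List[Inversion]:
    """Treat each sufficiently long reverse-strand match as a candidate inversion."""
    inversions = []
    for m in matches:
        length = m.query_end - m.query_start + 1
        if m.strand == "-" and length >= min_length:
            inversions.append(
                Inversion(m.query_start, m.query_end, m.subject_start, m.subject_end)
            )
    return inversions


# Invented coordinates, for illustration only.
matches = [
    Match(1, 1000, 1, 1000, "+"),
    Match(1001, 1500, 1001, 1500, "-"),
    Match(1501, 3000, 1501, 3000, "+"),
]
candidates = find_inversions(matches, min_length=100)
print(candidates)
```

Other rearrangement classes, such as translocations or deletions, would need their own object types and recognition rules. Persisting the objects is left out of this sketch.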
MicrobaseLite: OGRE Module

Performs object-oriented analysis and
storage of genome rearrangements

An OO dataset captures genomic
rearrangements revealed through
nucleotide sequence comparison

Made persistent in an OO database

Provides a Web services interface for
external users to query and analyse
the OO dataset

[Diagram labels: Comparison Pool, OGRE Module (Object Model Builder, Object-oriented Database, Query & Execution), Web Services]