A Service-Oriented Data Integration
and Analysis Environment for In Silico Experiments and
Bioinformatics Research
Xiaorong Xiang
Gregory Madey
Jeanne Romero-Severson
HICSS 40, January 2007
Supported in part by the Indiana Center for Insect Genomics, with funds from the
Indiana 21st Century Research & Technology Fund
Overview
• Bioinformatics project: Computer Science and Biology
• Mother of Green (MoG): deep phylogeny with possible applications to malaria drug development
• Time-consuming data-driven research
• MoGServ: prototype web services solution
Goals
• Build an environment to support scientists’ investigations: Mother of Green
• Demonstrate effective practices in building a system based on a service-oriented architecture
• Provide a prototype for future research in the service-oriented architecture domain
Mother of Green
Phylogenomics of the P. falciparum Apicoplast
Indiana Center for Insect Genomics
An International Center of Excellence
University of Notre Dame
Purdue University
Indiana University
Mother of Green
• Malaria causes 1.5 - 2.7 million deaths every year
• 3,000 children under age five die of malaria every day
• Plasmodium falciparum causes human malaria
• Drug resistance a world-wide problem
• Targeted drug design through phylogenomics
P. falciparum
Mother of Green
• P. falciparum has three genomes
Nuclear, mitochondrial, plastid
• Animals and insects have only two
• Target the third genome
• No harm to animals
• New antimalarial drug
• High risk, high tech, high payoff
J. Romero-Severson
Department of Biological Sciences
Greg Madey & Xiaorong Xiang
Department of Computer Science & Engineering
Mother of Green
• Plastids are the third genome
• Intracellular organelles
• Terrestrial plants, algae, apicomplexans
• Functions in plants and algae
  - Photosynthesis
  - Oxidation of water
  - Reduction of NADP
  - Synthesis of ATP
  - Fatty acid biosynthesis
  - Aromatic amino acid biosynthesis
• Functions in apicomplexans?
Chloroplast in plant cell
plastid
Apicoplast in P. falciparum
Plastid in Toxoplasma sp.
Mother of Green
•The apicoplast appears to code for <30 proteins.
•Repair, replication and transcription proteins
•Why is the apicoplast essential?
Mother of Green
Phylogenomics
• Find the ancestors of the apicoplast
• Identify genes in the ancestors
• Determine gene function
• Look for these genes in the P. falciparum nucleus
• Then study regulatory mechanisms in candidate genes
Phylogenomics of plastids
• Very old lineage (> 2.5 billion years)
• Cyanobacterial ancestor
• Three main plastid lineages
  - Glaucophytes: group of freshwater algae; chloroplast resembles intact cyanobacteria
  - Chlorophytes: green plant lineage; chloroplast genome reduced; many chloroplast genes now in nuclear genome
  - Rhodophytes: red algal lineage; chloroplast genome bigger than in green plants
Oomycetes
Apicomplexans
One plastid origin
Phylogenomics of plastids
• One cyanobacterial ancestor ?
• Many?
• Lineages are not linear
[Figure: Multiple plastid origins. The process of endosymbiosis, with horizontal gene transfer (arrows) from the plastid to the nucleus. The nucleomorph is a remnant of the original endosymbiont nucleus. Diagram labels: cyanobacteria, endosymbiont, primitive eukaryote, nucleus, plastid, second eukaryote, nucleomorph, secondary endosymbionts, secondary nonphotosynthetic endosymbiont (plastid disappears).]
[Figure: Tertiary endosymbiosis, with horizontal gene transfer. Diagram labels: secondary endosymbiont, third eukaryote, tertiary endosymbionts, tertiary nonphotosynthetic endosymbiont (plastid disappears), P. falciparum.]
The information gathering problem
• Rapid accumulation of raw sequence information
  - ~100 sequenced chloroplast genomes
  - ~57 sequenced cyanobacterial genomes
  - Rate of accumulation is increasing
  - Information accumulates faster than analyses finish
  - Information in forms not readily accessible
• Solution
  - Semi-automated web services
  - “Smart” web services
  - Semantic web
Phylogenomics of the P. falciparum Apicoplast
• Extract data from public and private databases
  - Web services
• Choose a metric for sequence comparison
  - ClustalW and others
• Choose a method to infer genealogy
  - Maximum Likelihood (ML)
• Develop a strategy to use ML that is feasible
  - fastDNAml and others
Current research in SOA
• Too many standards on services and workflows?
• Many theoretical research papers on service discovery and service composition, applied in “demo” application domains such as travel arrangement services
• Not many practical implementations of systems for solving real-world problems
Bioinformatics today
• Rapidly accumulating data: DNA sequences, contigs, expression data, ontologies, annotations, etc.
• Standard Web-based data and online analysis tools
  - NCBI
  - DDBJ
  - EMBL
  - BLAST
  - CLUSTAL-W
• Non-standard, independently developed, heterogeneous data sources
• Data sharing and security
[Figure from the article “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade” by Folker Meyer, CTWatch Quarterly, August 2006, volume 2, number 3]
SOA in Bioinformatics
• Recent exposure of data & analysis tools as services
  - NCBI
  - DDBJ
  - EMBL
  - Others
• Several active in-progress middleware projects
  - myGrid
  - BioMoby
  - IRIS
• Community efforts needed to provide more shared and reliable services
• More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.
A typical in silico investigation: data-driven research
A: Query complete genome sequences for a given taxon
B: Query protein-coding genes for each genome sequence
C: Eliminate vector sequences
D: Sequence alignment
E: Phylogenetic analysis
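The five steps A to E can be chained so that each step's output feeds the next without manual copying. The following is a minimal sketch under stated assumptions, not the MoGServ implementation: the in-memory `GENOMES` and `GENES` tables, the `VECTOR_MARKER` string, and all function names are hypothetical, the alignment step only pads sequences to equal length (a real run would call a tool such as ClustalW), and the final step computes simple pairwise mismatch counts rather than a maximum likelihood tree.

```python
from itertools import combinations
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for remote data sources (assumption, toy data).
GENOMES: Dict[str, List[str]] = {"Cyanobacteria": ["genome1", "genome2"]}
GENES: Dict[str, List[str]] = {
    "genome1": ["ATGGCC", "GGATCCAT"],
    "genome2": ["ATGGCATT"],
}
VECTOR_MARKER = "GGATCC"  # hypothetical marker for vector contamination


def query_genomes(taxon: str) -> List[str]:
    """Step A: list complete genome identifiers for a taxon."""
    return GENOMES.get(taxon, [])


def query_coding_genes(genomes: List[str]) -> List[str]:
    """Step B: collect coding sequences for each genome."""
    return [seq for genome in genomes for seq in GENES.get(genome, [])]


def eliminate_vector_sequences(seqs: List[str]) -> List[str]:
    """Step C: drop sequences that contain the vector marker."""
    return [s for s in seqs if VECTOR_MARKER not in s]


def align(seqs: List[str]) -> List[str]:
    """Step D: toy alignment that right-pads with gaps to equal length."""
    width = max((len(s) for s in seqs), default=0)
    return [s.ljust(width, "-") for s in seqs]


def pairwise_distances(aligned: List[str]) -> Dict[Tuple[int, int], int]:
    """Step E: toy analysis counting mismatches per sequence pair."""
    return {
        (i, j): sum(a != b for a, b in zip(aligned[i], aligned[j]))
        for i, j in combinations(range(len(aligned)), 2)
    }


STEPS: List[Tuple[str, Callable[[Any], Any]]] = [
    ("A", query_genomes),
    ("B", query_coding_genes),
    ("C", eliminate_vector_sequences),
    ("D", align),
    ("E", pairwise_distances),
]


def run_pipeline(taxon: str) -> Dict[str, Any]:
    """Run steps A to E in order and record every intermediate result."""
    record: Dict[str, Any] = {}
    data: Any = taxon
    for name, step in STEPS:
        data = step(data)
        record[name] = data
    return record


record = run_pipeline("Cyanobacteria")
```

Keeping every intermediate result in `record` mirrors the experiment data recording that otherwise has to be done by hand.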
Time-consuming manual web-based operations
• Data collection
  - Copy & paste!
• Analysis tool usage
  - Copy & paste!
• Experiment data recording
  - Copy & paste!
• Repetitive experiments for scientific discovery
  - Copy & paste!
MoGServ system architecture
• MoGServ interface
  - Web interface
  - Application interface (coming soon)
• MoGServ middle layer
  - Data storage and access
  - Data and analysis services
  - Service and workflow registry
  - Indexing and querying metadata
  - Service and workflow enactment
  - Acting in two roles: service consumer and service provider
[Figure: MoGServ system architecture. Diagram labels: client (web interface, applications); application server; services access; MoGServ middle layer (data access services, data analysis services, service/workflow registry, job manager, job launcher, local data storage, metadata search, workflow/SOAP engines); data/services providers (NCBI, DDBJ, EMBL, others).]
Data storage and access services
• Local database
  - Integrating data from multiple data sources according to scientists’ interests
  - Supporting repetitive investigations against several subsets of sequences
  - Avoiding network traffic and service failures when retrieving data on the fly from public data sources
  - Accessing the data in the local database through services
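The idea of answering repeated requests from a local database instead of going back to a public source each time can be sketched as follows. This is an illustrative sketch only: it uses Python's built-in `sqlite3` rather than the PostgreSQL database the project uses, and the `LocalSequenceStore` class, its table layout, and the `fetch_remote` callable are hypothetical names introduced for the example.

```python
import sqlite3
from typing import Callable, List


class LocalSequenceStore:
    """Serve sequences from a local table, fetching remotely only on a miss."""

    def __init__(self, fetch_remote: Callable[[str], str], path: str = ":memory:"):
        self._fetch_remote = fetch_remote  # hypothetical remote accessor
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS sequences ("
            "accession TEXT PRIMARY KEY, source TEXT NOT NULL, sequence TEXT NOT NULL)"
        )

    def get(self, accession: str, source: str = "remote") -> str:
        row = self._db.execute(
            "SELECT sequence FROM sequences WHERE accession = ?", (accession,)
        ).fetchone()
        if row is not None:
            return row[0]  # local hit: no network traffic
        sequence = self._fetch_remote(accession)
        # Keep the source alongside the data so its origin can be tracked.
        self._db.execute(
            "INSERT INTO sequences (accession, source, sequence) VALUES (?, ?, ?)",
            (accession, source, sequence),
        )
        self._db.commit()
        return sequence


remote_calls: List[str] = []


def fake_remote(accession: str) -> str:
    """Stand-in for a public data source; records each call."""
    remote_calls.append(accession)
    return "ATGGCC"


store = LocalSequenceStore(fake_remote)
first = store.get("ACC0001", source="example-source")
second = store.get("ACC0001")
```

The second `get` call is served locally, so a temporary failure of the public source would not interrupt a repeated investigation over the same sequences.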
Service and workflow registry
• A table-based description with necessary properties
  - Text description
  - Service location
  - Input/output
  - Provider
  - Version
  - Algorithm
  - Invocation method
• Not intended for supporting service discovery or composition at the current stage
• To answer end users’ questions about their experiment results
  - Which algorithm was used to generate the data, and what is the source of the input data?
• A repository of services and workflows used by local application developers
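A table-based registry entry with the properties listed on this slide, and the kind of provenance question it is meant to answer, might look like the sketch below. The field names follow the slide; everything else (the `ServiceEntry` and `ResultRecord` classes, the `explain_result` helper, and the example values, including the URL) is hypothetical and only illustrates the idea.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ServiceEntry:
    """One registry row; fields mirror the properties listed on the slide."""
    name: str
    description: str
    location: str
    inputs: str
    outputs: str
    provider: str
    version: str
    algorithm: str
    invocation_method: str


@dataclass(frozen=True)
class ResultRecord:
    """Hypothetical link between an experiment result and what produced it."""
    result_id: str
    service_name: str
    input_source: str


def explain_result(registry: Dict[str, ServiceEntry], record: ResultRecord) -> str:
    """Answer: which algorithm produced this result, and from which input?"""
    entry = registry[record.service_name]
    return (
        f"{record.result_id}: algorithm={entry.algorithm} "
        f"(version {entry.version}, provider {entry.provider}); "
        f"input from {record.input_source}"
    )


# Example values are made up for illustration.
registry = {
    "align": ServiceEntry(
        name="align",
        description="Multiple sequence alignment",
        location="http://example.org/services/align",
        inputs="FASTA sequences",
        outputs="aligned sequences",
        provider="example provider",
        version="1.0",
        algorithm="ClustalW",
        invocation_method="SOAP",
    )
}
answer = explain_result(registry, ResultRecord("result-7", "align", "local database"))
```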
Indexing and querying metadata
• Metadata
  - Service and workflow description
  - Description of sequence data in order to track the origin of data
  - Experimental data: output, input, and intermediate data
• Indexing and querying with keyword
  - Lucene
  - Implemented as services
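Keyword indexing of metadata can be illustrated with a small inverted index. This sketch does not use Lucene and does not reproduce its API; it is a plain-Python stand-in showing the same syntactic keyword matching, and the `KeywordIndex` class and the sample metadata strings are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


class KeywordIndex:
    """Toy inverted index: keyword -> set of metadata record identifiers."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return records containing every keyword in the query (AND search)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


index = KeywordIndex()
index.add("service:align", "Multiple sequence alignment service using ClustalW")
index.add("data:seq42", "Chloroplast genome sequence retrieved from a public database")
index.add("workflow:phylo", "Workflow for alignment and phylogenetic analysis")

hits = index.search("alignment")
```

Because matching is purely syntactic, a query for "alignment" does not find a record that only says "aligned"; that gap is what a semantic description would address.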
Service and workflow enactment
[Figure: Service and workflow enactment. Diagram labels: input (task name, parameters); service/workflow registry (find the service/workflow definition using the task name); job launcher (form a job description); job manager (job information, timer); instances of workflow/service engines; output (job ID).]
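One way to read the enactment diagram is: look up the definition by task name, form a job, run it, and hand back a job ID that can be used to query job information later. The sketch below follows that reading under several assumptions: the `Job` and `JobManager` names are hypothetical, the registry is a plain dictionary of Python callables, the job runs synchronously inside `launch`, and the timer shown in the diagram is not modelled.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Job:
    """Job description plus the information recorded about its execution."""
    job_id: int
    task_name: str
    parameters: Dict[str, Any]
    status: str = "PENDING"
    output: Any = None


class JobManager:
    """Toy launcher and manager: resolves a task name and tracks the job."""

    def __init__(self, registry: Dict[str, Callable[..., Any]]) -> None:
        self._registry = registry
        self._ids = itertools.count(1)
        self.jobs: Dict[int, Job] = {}

    def launch(self, task_name: str, **parameters: Any) -> int:
        definition = self._registry[task_name]  # KeyError for unknown tasks
        job = Job(next(self._ids), task_name, parameters)
        self.jobs[job.job_id] = job
        try:
            job.output = definition(**parameters)
            job.status = "DONE"
        except Exception as exc:
            # Record the failure so it can be inspected later.
            job.output = repr(exc)
            job.status = "FAILED"
        return job.job_id


def uppercase_all(sequences):
    """Hypothetical task: normalise sequences to upper case."""
    return [s.upper() for s in sequences]


def always_fails():
    """Hypothetical task that simulates an unavailable service."""
    raise RuntimeError("service unavailable")


manager = JobManager({"normalise": uppercase_all, "broken": always_fails})
ok_id = manager.launch("normalise", sequences=["acgt"])
bad_id = manager.launch("broken")
```

Returning only the job ID keeps the caller decoupled from the engine that runs the job; status and output are read back from the manager.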
Implementation
• Development and deployment
  - J2EE
  - Tomcat 5.0.18
  - Axis 1_2RC2
• Database
  - PostgreSQL 8.1
• Index and search of metadata
  - Apache Lucene library
• Service implementation
  - Java2WSDL
  - Wrap command line applications with JLaunch library
• Workflow
  - Taverna workbench
  - Freefluo workflow engine
[Figure: A workflow created using the Taverna workbench tool]
Issues with the first prototype
• Metadata description
  - Solution
    - Index-based (keyword syntactic search)
    - Capture most properties to support the end users’ requirements
    - Support data provenance
  - Limitation
    - Similar to most services in the bioinformatics community
    - Lack of semantic description (goal => semantic search)
• Failure tolerance and recovery
  - Solution
    - Statically encode alternative services in the workflow to guard against service failure
    - Record the status of service and workflow execution in the database for a possible recovery strategy
    - Deploy multiple workflow engines to guard against hardware or network failure
  - Limitation
    - No dynamic service selection during execution time
    - More semantic description support
• Security
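The failure-tolerance approach described on this slide, a fixed list of alternative services plus a recorded execution status, can be sketched as below. This is an assumption-laden illustration rather than the project's code: `call_with_alternatives` and the two stand-in services are hypothetical, and the status log is a Python list where the slide describes a database.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Service = Callable[[Any], Any]


def call_with_alternatives(
    services: Sequence[Tuple[str, Service]],
    payload: Any,
    status_log: List[Tuple[str, str]],
) -> Any:
    """Try each statically listed service in order; log every outcome."""
    last_error: Optional[Exception] = None
    for name, service in services:
        try:
            result = service(payload)
        except Exception as exc:
            # Record the failure so a recovery strategy can inspect it later.
            status_log.append((name, f"FAILED: {exc}"))
            last_error = exc
            continue
        status_log.append((name, "OK"))
        return result
    raise RuntimeError("all alternative services failed") from last_error


def primary_service(sequence: str) -> str:
    """Stand-in for a remote service that is currently unreachable."""
    raise ConnectionError("down")


def backup_service(sequence: str) -> str:
    """Stand-in for an alternative provider of the same operation."""
    return sequence.upper()


log: List[Tuple[str, str]] = []
result = call_with_alternatives(
    [("primary", primary_service), ("backup", backup_service)], "acgt", log
)
```

The order of alternatives is fixed when the workflow is written, which matches the stated limitation: nothing here selects a service dynamically at execution time.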
Extension of the system
• Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data
• Integrate grid computing technologies to address the security and resource allocation issues
• Integrate semantic web technology to support end users’ workflow creation based on their knowledge of the scientific domain
Summary
• A practical demonstration of building a SOA-based system
• Applied in a bioinformatics application to study deep phylogeny
  - Easy and rapid extraction of DNA and protein sequences from public databases to a local database, which saves scientists months of repetitive searching, downloading, and data management.
  - Painless reformatting of the extracted data for commonly used analytical tools.
  - Preliminary data inspection and analysis using these tools within the web-services environment, which permits inspection of many conserved gene candidates and enables the investigator to rapidly determine the suitability of the chosen gene for deep phylogenetic analysis.
  - User-specified additions to the local database: users can upload their own sequences.
  - User-specified additions to the automated queries: a free-text search interface for constructing data sets of interest.
Thank You!
Questions?
The computational problem
• Phylogenetic trees
  - NP-hard
  - Poisoned by information conflict
• Phylogenies based on individual genes
  - Maximum likelihood models exist
  - Processes are parallelizable
  - Access to compute farms inadequate
  - Raw number-crunching power
  - Greedy
• Similar genealogies may be merged
  - Convergence not possible for all
  - Makes computational problem more daunting
[Figure: Workflow abstraction level, from high to low, across Phase 1 to Phase 4. Diagram labels: an input and an output for each phase; Task A, Task B, Task C; Service a, Service b, Service B’, Service c; an instance of each service.]