A Service-Oriented Data Integration and Analysis Environment for In-Silico Experiments and Bioinformatics Research
Xiaorong Xiang, Gregory Madey, Jeanne Romero-Severson
HICSS 40, January 2007
Supported in part by the Indiana Center for Insect Genomics, with funds from the Indiana 21st Century Research & Technology Fund

Overview
• Bioinformatics project: Computer Science and Biology
• Mother of Green (MoG): deep phylogeny with possible applications to malaria drug development
• Time-consuming data-driven research
• MoGServ: prototype web services solution

Goals
• Build an environment to support scientists' investigations – Mother of Green
• Demonstrate effective practices in building a system based on a service-oriented architecture
• Provide a prototype for future research in the service-oriented architecture domain

Mother of Green: Phylogenomics of the P. falciparum Apicoplast
Indiana Center for Insect Genomics, An International Center of Excellence
University of Notre Dame, Purdue University, Indiana University

Mother of Green
• Malaria causes 1.5 - 2.7 million deaths every year
• 3,000 children under age five die of malaria every day
• Plasmodium falciparum causes human malaria
• Drug resistance is a world-wide problem
• Targeted drug design through phylogenomics
[Figure: P. falciparum]

Mother of Green
• P. falciparum has three genomes: nuclear, mitochondrial, plastid
• Animals and insects have only two
• Target the third genome
• No harm to animals
• New antimalarial drug
• High risk, high tech, high payoff
J. Romero-Severson, Department of Biological Sciences
Greg Madey & Xiaorong Xiang, Department of Computer Science & Engineering

Mother of Green
• Plastids are the third genome
• Intracellular organelles
• Terrestrial plants, algae, apicomplexans
• Functions in plants and algae
  - Photosynthesis
  - Oxidation of water
  - Reduction of NADP
  - Synthesis of ATP
  - Fatty acid biosynthesis
  - Aromatic amino acid biosynthesis
• Functions in apicomplexans?
[Figure labels: Chloroplast in plant cell; plastid; Apicoplast in P. falciparum; Plastid in Toxoplasma sp.]

Mother of Green
• The apicoplast appears to code for <30 proteins.
• Repair, replication and transcription proteins
• Why is the apicoplast essential?

Mother of Green Phylogenomics
• Find the ancestors of the apicoplast
• Identify genes in the ancestors
• Determine gene function
• Look for these genes in the P. falciparum nucleus
• Then study regulatory mechanisms in candidate genes

Phylogenomics of plastids
• Very old lineage (> 2.5 billion years)
• Cyanobacterial ancestor
• Three main plastid lineages
  - Glaucophytes
    Group of freshwater algae
    Chloroplast resembles intact cyanobacteria
  - Chlorophytes
    Green plant lineage
    Chloroplast genome reduced
    Many chloroplast genes now in nuclear genome
  - Rhodophytes
    Red algal lineage
    Chloroplast genome bigger than in green plants
    Oomycetes
    Apicomplexans
[Figure: One plastid origin]

Phylogenomics of plastids
• One cyanobacterial ancestor?
• Many?
• Lineages are not linear
[Figure: Multiple plastid origins]

[Figure: The process of endosymbiosis. Horizontal Gene Transfer (arrows) from the plastid to the nucleus. The nucleomorph is a remnant of the original endosymbiont nucleus. Labels: Cyanobacteria; Primitive eukaryote (Nucleus, Endosymbiont plastid); Second eukaryote (Nucleus, Nucleomorph); Secondary endosymbionts; Secondary nonphotosynthetic endosymbiont (Plastid disappears)]

[Figure: Tertiary endosymbiosis. Horizontal Gene Transfer. Labels: Secondary endosymbiont; Third eukaryote; Tertiary endosymbionts; Tertiary nonphotosynthetic endosymbiont (Plastid disappears); P. falciparum]

The information gathering problem
• Rapid accumulation of raw sequence information
  - ~100 sequenced chloroplast genomes
  - ~57 sequenced cyanobacterial genomes
  - Rate of accumulation is increasing
  - Information accumulates faster than analyses finish
  - Information in forms not readily accessible
• Solution
  - Semi-automated web-services
  - "Smart" web-services
  - Semantic web

Phylogenomics of the P. falciparum Apicoplast
• Extract data from public and private databases
  - Web services
• Choose a metric for sequence comparison
  - ClustalW and others
• Choose a method to infer genealogy
  - Maximum Likelihood (ML)
• Develop a strategy to use ML that is feasible
  - fastDNAml and others

Current research in SOA
• Too many standards on services and workflows?
• Many theoretical research papers on service discovery and service composition, applied in "demo" application domains such as travel arrangement services
• Not many practical implementations of systems for solving real-world problems

Bioinformatics today
• Rapidly accumulating data: DNA sequences, contigs, expression data, ontologies, annotations, etc.
• Standard Web-based data and online analysis tools
  - NCBI, DDBJ, EMBL
  - BLAST, CLUSTAL-W
• Non-standard, independently developed, heterogeneous data sources
• Data sharing and security
[Figure from the article "Genome Sequencing vs. Moore's Law: Cyber Challenges for the Next Decade" by Folker Meyer, CTWatch Quarterly, August 2006, volume 2, number 3]

SOA in Bioinformatics
• Recent exposure of data & analysis tools as services
  - NCBI, DDBJ, EMBL, others
• Several active in-progress middleware projects
  - myGrid, BioMoby, IRIS
• Community efforts needed to provide more shared and reliable services
• More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.

A typical in-silico investigation – Data driven research
[Diagram steps]
  A: Query complete genome sequences for a given taxon
  B: Query protein coding genes for each genome sequence
  C: Eliminate vector sequences
  D: Sequence alignment
  E: Phylogenetic analysis
[Callout repeated between the steps: Copy & paste!]
• Time-consuming manual web-based operations
  - Data collection
  - Analysis tool usage
  - Experiment data recording
• Repetitive experiments for scientific discovery
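
To make the contrast with manual copy and paste concrete, the five steps A to E of the typical investigation can be chained as plain function calls. The following is only a minimal sketch under stated assumptions: the in-memory dictionaries, the function names, the toy "VECTOR" marker, the gap-padding "alignment", and the name-sorting "analysis" are all hypothetical placeholders, not the MoGServ services or the real tools (ClustalW, fastDNAml) named in the slides.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical in-memory stand-ins for the public databases (steps A and B).
GENOMES = {"Cyanobacteria": ["genomeX", "genomeY"]}
GENES = {
    "genomeX": [("psbA_X", "ATGACC"), ("vec_X", "GGGVECTOR")],
    "genomeY": [("psbA_Y", "ATGA")],
}

Gene = Tuple[str, str]  # (name, sequence)


def query_genomes(taxon: str) -> List[str]:
    """Step A: complete genome sequences for a given taxon."""
    return GENOMES.get(taxon, [])


def query_genes(genomes: List[str]) -> List[Gene]:
    """Step B: protein coding genes for each genome sequence."""
    return [gene for g in genomes for gene in GENES.get(g, [])]


def drop_vectors(genes: List[Gene]) -> List[Gene]:
    """Step C: eliminate vector sequences (toy rule: a literal marker)."""
    return [(name, seq) for name, seq in genes if "VECTOR" not in seq]


def align(genes: List[Gene]) -> List[Gene]:
    """Step D: toy alignment that right-pads with gaps.

    A real run would call an alignment tool instead.
    """
    width = max((len(seq) for _, seq in genes), default=0)
    return [(name, seq.ljust(width, "-")) for name, seq in genes]


def analyse(alignment: List[Gene]) -> List[str]:
    """Step E: placeholder for phylogenetic analysis; returns sorted names."""
    return sorted(name for name, _ in alignment)


STEPS: List[Tuple[str, Callable[[Any], Any]]] = [
    ("A", query_genomes),
    ("B", query_genes),
    ("C", drop_vectors),
    ("D", align),
    ("E", analyse),
]


def run_pipeline(taxon: str, steps=STEPS):
    """Feed each step's output into the next and keep every intermediate."""
    data: Any = taxon
    log = []
    for label, step in steps:
        data = step(data)
        log.append((label, data))  # intermediate data kept for later inspection
    return data, log


result, log = run_pipeline("Cyanobacteria")
```

Keeping every intermediate result in `log` mirrors the slide's point about experiment data recording: each hand-off that was a manual copy and paste becomes a recorded value that can be inspected or replayed.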
MoGServ system architecture
• MoGServ interface
  - Web interface
  - Application interface (coming soon)
• MoGServ middle layer
  - Data access storage
  - Data and analysis services
  - Service and workflow registry
  - Indexing and querying metadata
  - Service and workflow enactment
• Acting in two roles: service consumer and service provider

MoGServ System Architecture
[Figure: architecture diagram. Labels: Web Interface; Applications; Services Access Client; MoGServ Middle Layer (Application Server, Data Access Services, Data Analysis Services, Job Manager, Job Launcher, Service/Workflow Registry, Local Data Storage, Metadata Search, Workflow/SOAP Engines); Data/Services Providers (NCBI, DDBJ, EMBL, others)]

Data storage and access services
• Local database
  - Integrating data from multiple data sources according to scientists' interests
  - Supporting repetitive investigations against several subsets of sequences
  - Avoiding network traffic and service failure when retrieving data on-the-fly from public data sources
• Accessing the data in the local database by services

Service and workflow registry
• A table-based description with necessary properties
  - Text description
  - Service location
  - Input/output
  - Provider
  - Version
  - Algorithm
  - Invocation method
• Not intended for supporting service discovery or composition at the current stage
• To answer end-users' questions about their experiment results
  - Which algorithm was used to generate the data, and what is the source of the input data?
• A repository of services and workflows used by local application developers

Indexing and querying metadata
• Metadata
  - Service and workflow description
  - Description of sequence data in order to track the origin of data
  - Experimental data: output, input, and intermediate data
• Indexing and querying with keywords
  - Lucene
• Implemented as services

Service and workflow enactment
[Figure: enactment diagram. Labels: INPUT (Task Name, Parameters); Job Launcher (find the service/workflow definition using the task name; form a job description); Service/Workflow Registry; Job Manager (Job Information, Timer); Instances of Workflow/Service Engines; Output (Job ID)]
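
The enactment path described above (look up a definition in the registry by task name, form a job description, hand it to the job manager, return a job ID) can be sketched in a few lines. This is an illustrative sketch only, not the MoGServ implementation: the class names, the dictionary-based registry, the `gc_content` example task, and the inline execution are all assumptions. The registry columns follow the slide's list of properties, and the status field reflects the idea of recording execution status for later inspection.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class RegistryEntry:
    """One row of a table-based registry (hypothetical column subset)."""
    description: str
    location: str
    provider: str
    version: str
    algorithm: str
    invoke: Callable[..., Any]  # stand-in for the invocation method


class JobManager:
    """Runs jobs and keeps their status and output, keyed by job ID."""

    def __init__(self) -> None:
        self.jobs: Dict[int, Dict[str, Any]] = {}
        self._ids = itertools.count(1)

    def submit(self, description: Dict[str, Any],
               invoke: Callable[..., Any]) -> int:
        job_id = next(self._ids)
        record = dict(description, status="running", output=None)
        self.jobs[job_id] = record
        try:
            # Executed inline in this sketch; a deployment would hand the
            # job to a workflow or service engine instance instead.
            record["output"] = invoke(**description["params"])
            record["status"] = "done"
        except Exception as exc:  # keep the failure for a recovery strategy
            record["status"] = "failed"
            record["output"] = repr(exc)
        return job_id


class JobLauncher:
    """Finds the definition by task name and forms a job description."""

    def __init__(self, registry: Dict[str, RegistryEntry],
                 manager: JobManager) -> None:
        self.registry = registry
        self.manager = manager

    def launch(self, task_name: str, **params: Any) -> int:
        entry = self.registry[task_name]  # KeyError for an unknown task
        description = {
            "task": task_name,
            "params": params,
            # Copied from the registry so a result can later be traced back
            # to the algorithm, provider, and version that produced it.
            "algorithm": entry.algorithm,
            "provider": entry.provider,
            "version": entry.version,
        }
        return self.manager.submit(description, entry.invoke)


def gc_content(sequence: str) -> float:
    """Hypothetical analysis task: fraction of G and C bases."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


registry = {
    "gc_content": RegistryEntry(
        description="Fraction of G and C bases in a sequence",
        location="local",
        provider="example-lab",
        version="0.1",
        algorithm="base counting",
        invoke=gc_content,
    )
}
manager = JobManager()
launcher = JobLauncher(registry, manager)
job_id = launcher.launch("gc_content", sequence="ATGC")
```

Because the job record carries the algorithm, provider, and version taken from the registry, `manager.jobs[job_id]` can answer the end-user question raised in the registry slide: which algorithm generated this output, and from which input.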
Implementation
• Development and deployment
  - J2EE
  - Tomcat 5.0.18
  - Axis 1_2RC2
• Database
  - PostgreSQL 8.1
• Index and search of metadata
  - Apache Lucene library
• Service implementation
  - Java2WSDL
  - Wrap command line applications with JLaunch library
• Workflow
  - Taverna workbench
  - Freefluo workflow engine
[Figure: A workflow created using the Taverna workbench tool]

Issues with the first prototype
• Metadata description
  - Solution
    Index-based (keyword syntactic search)
    Capture most properties to support the end-users' requirements
    Support data provenance
  - Limitation
    Similar to most services in the bioinformatics community
    Lack of semantic description (goal => semantic search)
• Failure tolerance and recovery
  - Solution
    Statically encode alternative services in the workflow to guard against service failure
    Record status of the service and workflow execution in the database for a possible recovery strategy
    Deploy multiple workflow engines to guard against hardware or network failure
  - Limitation
    No dynamic service selection during execution time
• Security
• More semantic description support

Extension of the system
• Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data
• Integrate grid computing technologies to address the security and resource allocation issues
• Integrate semantic web technology to support end-users' workflow creation based on their knowledge of the scientific domain

Summary
• A practical demonstration of building a SOA-based system
• Applied in a bioinformatics application to study deep phylogeny
• Easy and rapid extraction of DNA and protein sequences from public databases to a local database, which saves scientists months of repetitive searching, downloading, and data management.
• Painless reformatting of the extracted data for commonly used analytical tools.
• Preliminary data inspection and analysis using these tools within the web-services environment, which permits inspection of many conserved gene candidates and enables the investigator to rapidly determine the suitability of the chosen gene for deep phylogenetic analysis.
• User-specified additions to the local database, which allow users to upload sequences into the local database.
• User-specified additions to the automated queries, which provide a free-text search interface for constructing data sets of interest.

Thank You! Questions?

The computational problem
• Phylogenetic trees
  - NP-hard
  - Poisoned by information conflict
• Phylogenies based on individual genes
  - Maximum likelihood models exist
  - Processes are parallelizable
  - Access to compute farms inadequate
  - RAW number-crunching power
  - Greedy
• Similar genealogies may be merged
  - Convergence not possible for all
  - Makes computational problem more daunting

[Figure: Workflow Abstraction Level, from High to Low, across Phase 1 to Phase 4. Labels: Input; Task A, Task B, Task C; Service a, Service b, Service B', Service c; Instance of Service a, Instance of Service b, Instance of Service B', Instance of Service c; Output]