Integromics: a grid-enabled platform for integration of advanced bioinformatics tools and data
Luca Corradi
BIO-Lab, DIST
University of Genoa

Integromics
• Cancer research goal: tailor treatment to the molecular profile of an individual patient's tumor
• Microarrays and other 'omic' technologies allow the study of tens of thousands of genes simultaneously
• The tools and methodologies in use lack standardization and repeatability
• Need for an "integromic" platform to:
  – Develop integrative ('integromic') analyses of the data
  – Combine tools available for genomics
Better results, higher quality of work

Focus on...
• How to exploit the backend gLite infrastructure and an HPC environment to integrate bioinformatics tools and data
• How a Grid portal can:
  – integrate heterogeneous tools and data
  – simplify user interaction through customized web interfaces
  – increase usability and efficiency
• Case study: example of correlation between genomics data and clinical data through a combination of processing tools provided by the platform

The challenges
• Manage large volumes of bioinformatics data
• Deal with complex issues such as different formats, distributed locations, time-consuming tasks, and computational needs
• Integrate heterogeneous tools and platforms
• Speed up the analysis process through automated methodologies
• Improve efficiency and quality of work
• Make the system usable and accessible

Microarray technology
• Computation of the expression values of thousands of genes at the same time
• Collection of microscopic DNA spots, representing single genes, arrayed on a solid surface by covalent attachment to chemically suitable matrices
• Estimation of the absolute value of gene expression

The use case
• Analyse large microarray datasets for breast cancer prognosis assessment
• Run several R/Bioconductor scripts
• Deploy a reusable and reliable service
• Avoid errors, increase repeatability
• Create a processing pipeline where new algorithms and data analysis techniques can be tested
• Create a set of “atomic” components that can be combined into workflows

Data Analysis Tools
R/Bioconductor
• Free software environment for statistical computing and graphics
• Bioconductor is a collection of R packages for the bioinformatics community
• Active user community
dChip
• Free software for analysis and visualization of gene expression data
Affymetrix Power Tools (APT)
• Cross-platform command-line programs that implement algorithms for analyzing Affymetrix GeneChip arrays

Parallel dChip execution
• Module 1
  – n jobs, each opening N/n files and normalizing them
  – Each job produces N/n CSV files (matching the input files)
• Module 2
  – m jobs, each opening all N CSV files and computing gene expression values for a certain group of genes
  – Each job produces one CSV file
• Module 3
  – One job opening the m expression files
  – It searches for differentially expressed genes and performs clustering of the results
[Diagram: N CEL files (CEL 1 ... CEL N, N/n per job) -> Module 1 jobs (Mod1 1 ... Mod1 n) -> N CSV files (CSV 1 ... CSV N) -> Module 2 jobs (Mod2 1 ... Mod2 m) -> m CSV files (CSV 1 ... CSV m) -> Mod3]

Parallel APT execution

The service
• Analyze large microarray datasets for breast cancer prognosis assessment
• Concatenate phenodata and expression results
• Mix of custom and R programs
• Automatic analysis and plot creation

The BioMedicalPortal
Based on EnginFrame, an industry-proven, production-grade grid portal (public and private, academic and industry customers worldwide)

BMPortal Architecture
[Architecture diagram, labels: gLite WLM, Secure Storage, GSAF, Client Apps, BMPortal, User Web Interface, AMGA, Grid, Clusters, APIs, non-Grid users, EnginFrame, Web Service Interface, (LSF, PBS, LL, etc.)]
[Architecture diagram labels, continued: local AMGA, Grid users, other Grids (NorduGrid, Globus, SRB, AliEn, etc.), other Grid DBs]
• Based on the EnginFrame product from NICE srl
• The data management and secure storage layers are based on the GSAF / Secure Storage APIs

BioMedicalPortal services
• User management, authentication and authorization services
• Data management (extension to metadata support on GRID)
• Job submission (GRID, local, remote cluster) and monitoring
• Support for every programming and scripting language
• Plugin strategy for application integration
• Web services interface
• Workflow management system
• Lots of software and applications already integrated
etc.

gLite plugin & GWT
• Authentication and authorization using VOMS (a client-side applet is coming)
• Job submission and monitoring, result retrieval and visualization
• Preference settings (RB, CE, …)
• Traditional LFC-based data management
• New Google Web Toolkit interfaces for GSAF integration via the Java API using VOMS credentials

Testbed architecture
[Diagram components: users (Win, LX, Mac, UX), BMPortal (gLite UI, EF Server & Agent), input files (primary, include), local or remote cluster (LSF), EGEE gLite infrastructure, application]
Flow shown in the diagram (numbered 1 to 8 on the slide):
• The user submits and monitors work via a standard web browser
• BMPortal checks input parameters and files, and submits a job to gLite
• The RB matches the user requirements with the available resources on the Grid
• The job starts
• Results are written to the input file directory
• Streaming output allows the user to monitor the progress of the job
• The job is done
• The user can check the job status, the results, or exit messages

Analysis /1
• EnginFrame Grid portal interface (web access)
• Input data selection (Affy .CEL files, phenodata, gene list)

Analysis /2
– Services execution & monitoring
– Users can come back after coffee

Analysis /3
• Result visualization in portal spooler area (txt files, images, etc.)
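
The portal-side part of this flow (checking the selected inputs, splitting the N .CEL files across n Module 1 jobs, and describing each job to gLite) could be sketched roughly as follows. This is only an illustration under stated assumptions: the slides do not show the portal's actual code or the job descriptions it generates. The function names, the `normalize.sh` script and the file names are hypothetical. The JDL attributes used (Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox) are standard gLite ones, and the .CEL files are assumed to be reachable from the worker node through grid storage rather than shipped in the sandbox.

```python
def check_inputs(cel_files, phenodata, gene_list):
    """Return a list of problems with the selected inputs (empty if none)."""
    problems = []
    if not cel_files:
        problems.append("no .CEL files selected")
    for name in cel_files:
        if not name.lower().endswith(".cel"):
            problems.append(f"not a .CEL file: {name}")
    if not phenodata:
        problems.append("phenodata file missing")
    if not gene_list:
        problems.append("gene list missing")
    return problems


def split_into_jobs(files, n_jobs):
    """Split `files` into at most `n_jobs` contiguous, near-equal chunks."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    base, extra = divmod(len(files), n_jobs)
    chunks, start = [], 0
    for i in range(n_jobs):
        # The first `extra` jobs take one additional file.
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append(files[start:start + size])
        start += size
    return chunks


def render_jdl(script, arguments):
    """Render a minimal gLite JDL text for one job (hypothetical layout)."""
    args = " ".join(arguments)
    return "\n".join([
        f'Executable = "{script}";',
        f'Arguments = "{args}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f'InputSandbox = {{"{script}"}};',
        'OutputSandbox = {"std.out", "std.err"};',
    ])


if __name__ == "__main__":
    # Hypothetical selection made in the portal form.
    cels = [f"sample{i}.CEL" for i in range(1, 6)]
    issues = check_inputs(cels, "phenodata.txt", "genes.txt")
    if issues:
        raise SystemExit("; ".join(issues))
    for chunk in split_into_jobs(cels, n_jobs=2):
        print(render_jdl("normalize.sh", chunk))
        print()
```

Contiguous chunks keep each job's output CSV files aligned with its input files, which matches the "N/n CSV files (matching the input files)" description of Module 1. Other partitioning schemes would work as well.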

Impact
• Aimed at biomedical researchers without specific computational skills
• The collaboration between molecular oncologists and software engineers allowed the system to be optimized without losing flexibility
• Scales the size of processed data beyond the limits of currently available desktop personal computers
• Following the Software as a Service paradigm, users can focus on experimental design rather than infrastructure

Atomic services
• Each processing step is an “atomic” service
• Services can be invoked one by one
• Services are currently composed using EnginFrame portal features and LSF scheduler tools
• But…

Current work (1)
• Visual and easy workflow (WF) monitoring
• Fully integrated with EnginFrame job monitoring and data access
• Useful for very long-running workflows
• User-designed “virtual experiments”

Current work (2)
Integration of new algorithms (multi-chip quality control, cross-platform data integration, etc.)

Current work (3)
Possibility to perform different analyses in parallel

Acknowledgements
• Part of this work was developed within the Italian FIRB project LITBIO (Laboratory for Interdisciplinary Technologies in Bioinformatics).
• Thanks are due to Ulrich Pfeffer and his functional genomics group at IST (National Institute for Cancer Research) of Genoa, Italy, for their support.

Thank you!