From Bio-Informatics towards e-BioScience

L.O. (Bob) Hertzberger
Computer Architecture and Parallel Systems Group
Department of Computer Science
Universiteit van Amsterdam

Background information experimental sciences
• There is a tendency to look ever deeper into:
  - Matter, e.g. physics
  - Universe, e.g. astronomy
  - Life, e.g. life sciences
• Instrumental consequences are an increase in detector:
  - Resolution & sensitivity
  - Automation & robotization
• Therefore experiments change in nature & become increasingly complex

Impact in the life sciences
• Impact of high-throughput methods, e.g. omics experimentation
  - genome ===> genomics

New technologies in Life Sciences research
[Figure (University of Amsterdam): components of the cell and the methodology/technology used to study them]
  - DNA: Genomics
  - RNA: Transcriptomics
  - Protein: Proteomics
  - Metabolites: Metabolomics

Omics impact

Impact in the life sciences
• Impact of high-throughput methods, e.g. omics experimentation
  - genome ===> genomics
• Instrumentation being used in omics experimentation:
  - Transcriptomics via, among others, micro-arrays
  - Proteomics via, among others, Mass Spectroscopy (MS)
  - Metabolomics via, among others, MS & Nuclear Magnetic Resonance (NMR)
• Results in:

Paradigm shift in Life sciences
• Past experiments were hypothesis driven
  - Evaluate hypothesis
  - Complement existing knowledge
• Present experiments are data driven
  - Discover knowledge from large amounts of data

Life sciences research: from gene to function
[Figure: from the gene in the cell nucleus to protein function]
  - Gene (DNA): whole-genome sequence projects
  - Gene expression by RNA synthesis (mRNA): genome-wide micro-array analysis
  - mRNA translation by protein synthesis (protein): "high-throughput" protein analysis
  - Protein function (function-1, function-2, function-n): prediction by bioinformatics, proof by laboratory research

Developments towards Bioinformatics & e-Science
• Experiments become increasingly complex
• Driven by an increase in detector developments
• Results in an increase in the amount and complexity of data
• Something has to be done to harness this development
  - Bio-informatics to translate data into useful biological, medical, pharmaceutical & agricultural knowledge

The what of Bioinformatics
Bioinformatics is redefining rules and scientific approaches, resulting in the 'new biology'. Within this new paradigm the traditional scientific boundaries are blurred, leaving no clear line between 'dry or computational' and 'wet-based' approaches.

Role of bioinformatics
[Figure: bioinformatics spans genomics (DNA), transcriptomics (RNA), proteomics (protein) and metabolomics (metabolites) in the cell, up to integrative/system biology. Bioinformatics layers: data generation/validation; data integration/fusion methodology; data usage/user interfacing]

Two sides of Bioinformatics
• The scientific responsibility to develop the underlying computational concepts and models to convert complex biological data into useful biological and chemical knowledge
• The technological responsibility to manage and integrate huge amounts of heterogeneous data sources from high-throughput experimentation
  - Need for e-Science support

Developments towards Bioinformatics & e-Science
• Experiments become increasingly complex
• Driven by an increase in detector developments
• Results in an increase in the amount and complexity of data
• Something has to be done to harness this development
  - Bio-informatics to translate data into useful biological, medical, pharmaceutical & agricultural knowledge
  - Virtualization of experimental resources enabling sharing & leading to e-BioScience

[Figure: layered view from applications to infrastructure. Labels: Life science application areas; Life science/genomics research consortia and industry; e-Bioscience and life science innovation domain; Bioinformatics; e-Bioscience & research infrastructure; Generic e-Science; ICT development and support; e-Science & research infrastructure; Grid infrastructure; Network infrastructure and computing capacity]

Why e-BioScience
• There is an increasing necessity to use results from other scientists, e.g. share data & information:

Re-use and sharing of biological data (2)
Information content of omics data is extremely high, however:
• Data are subject to noise, biological and technical variation
• How to induce biological principles from these genome-wide data sets?
  - Approach: develop methodology for "reverse engineering" of biological mechanisms
• Biggest challenge in bioinformatics today
Need for external data sources for in-silico experimentation
• Two practices for re-use and sharing of data:
  - Collectively compile huge amounts of relevant data and make these available to the community. Examples: bio-banking, compendia (e.g. NIH's Affymetrix SNP repository)
  - Re-use information from different and diverse experiments to discover phenomena

Re-use and sharing of biological data (2)
Compendium example: re-use and sharing of Huntington data
• Datasets: 404 Affymetrix Gene chips of measurements on extremely rare human brain samples (Hodges et al. Hum. Mol. Genetics, 2006)
• Available from NCBI GEO database (MIAME)
• Goal: find genes involved in Huntington's Disease
• Approach:
  - Reanalyze gene expression data
  - Combine genotype data and clinical data (e.g. using SigWin)
  - Extend experiments with own ChIP-on-chip data
Resource Identification software
Repository of relevant meta-information from:
• Data warehouses, e.g. GEO, ArrayExpress, Protein Interaction database
• Literature (mining of PubMed using Collexis)
• Information resources specialized in diseases, genes, proteins, e.g. OMIM, GenBank, Ensembl

Why e-BioScience
• There is an increasing necessity to use results from other scientists, e.g. share data & information:
  - Data repositories
  - Cohort studies in bio-banking
  - Biodiversity
• Expensive and complex equipment:
  - Mass Spectroscopy
  - MRI
  - Other

Problems for the realization of e-BioScience
• The Life Science field is still in an early stage of development and:
  - First principles are not understood at all
• As a consequence experimental methods are not well established and will not be for some time to come
• Because of the new forms of omics instrumentation there is a need for design-for-experimentation methods
  - Correct logging of the conditions under which experiments are done is lacking
  - Large amounts of data are produced that require, among other things, statistical techniques for interpretation
• As a consequence results are open to multiple interpretations

Problems for the realization of e-BioScience
• Problems for bioinformatics & e-Bioscience:
  - Rationalisation at this early stage is almost impossible
  - Pre-standardization & standardization are almost non-existent
  - Where there are standards they are inadequate because they are open to multiple interpretations (like MIAME for micro-arrays)
• In addition there are commercial end-user products that are difficult to integrate
• Users lack the training necessary to handle these complex experimental situations
• The only possible solution is to create a flexible experimentation environment for the end-users

Role of ICT in e-BioScience
• e-Science is a new form of science methodology complementing theoretical and experimental sciences.
• It uses generic methods and an ICT infrastructure to support this methodology.
  - Web services as a paradigm/way of using and accessing information
  - Grid as a method of accessing & sharing computing resources by virtualization
• What is missing in e-BioScience:
  - Connection between biological problem & e-Bioscience
  - User-oriented tools that can be re-used and extended
  - General model of ICT-based integration
  - Semantic support: ontologies and semantic support for workflows to make user knowledge explicit

Consequences for bioinformatics & e-BioScience
• A considerable amount of experimentation is necessary before a well-established methodology will emerge
• The VL-e approach might be a good model & produces an environment in which the necessary experimentation can be realized

Enhancing the scientific process: e-BioLab
Motivation:
• Interacting with the problem domain requires an environment in which the domain can be opened up and ideas, hunches and notions on the data and crude models of the biology can be visualized
• A tangible space in which biologists, aided by e-scientists, will have the full potential of VL-e at their disposal.
An actual laboratory in which:
• Problem domain experts (biologists, medical doctors) and scientists from enabling disciplines jointly and in a creative manner work on the analysis and design of -omics experiments.
Basic concept of e-BioLab:
• Problem domain experts can focus on the problem area because they are shielded from technical details by e-scientists.
• Viewpoints on the research question and the data can semi-instantaneously be expressed and visualized.
• Ideas and analyses can be retained and documented.
• Facilities for remote collaboration are present*.
[Figure: e-BioLab concept. Labels: Basic model of the biology; Small integration experiments; Readily accessible data + models + integration methods; Data mining; Easy visualization; Vague results; e-BioOperator; Biologists; e-BioScientist]
* Rauwerda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)

Enhancing the scientific process: e-BioLab (2)
Realization:
• Large high-resolution display (26.2 Mpixel) with high-bandwidth (10 Gbit/s) connection to render cluster
• Full access to computational facilities and Grid middleware of VL-e
• e-whiteboards and tablet PCs to share and store ideas
• High-definition video cameras for remote collaboration
• Highly adaptable lab configuration.
Research into:
• Problem Solving Environments for biology under study
• Formulation of scientific workflows that allow for sufficient interactivity and guarantee reproducibility
• Maintaining an electronic lab journal for e-science experimentation
• Methods for:
  - Information management of omics data
  - Biological domain interaction / resource identification
  - Modeling of biological information and knowledge
  - Remote scientific co-operation
  - Man-machine interaction

High resolution displays in e-bioscience
[Figure: display wall layout with three numbered regions. Panels: Remote whiteboard; Video remote collaboration; GSEA; SOM; Literature mining; Clustering; Gene lists; Interesting pathways; GO categories]
Example: concurrently display in a discussion with a remote partner
• Clustering results of microarray experiments
• Interesting pathways that are predominant in certain clusters
• Gene Ontology categories
• Results from literature mining
• Gene Set Enrichment of categories identified in literature mining
• Notions depicted on the e-whiteboards

Virtual Lab for e-Science research Philosophy
• Multidisciplinary research and development of related ICT infrastructure
• Generic application support
  - Application cases are drivers for computer & computational science and engineering research
  - Problem solving is partly generic and partly specific
  - Re-use of components via generic solutions whenever possible

[Figure: VL-e service layers. Grid services harness multi-domain distributed resources (technology push); generic e-Science services sit above them; domain-generic e-BioScience services (microarray pipeline, mass spectroscopy pipeline, pathway visualization, protein annotation) and domain-specific tools (micro-array transcriptomics pipeline, mass spectroscopy proteomics pipeline) sit on top (application pull)]

Bioinformatics methods in VL-e (1)
Example 1: an application-specific method modified by e-science into a generic one: SigWin*
• Starting point: application-specific method for detecting windows of increased gene expression on chromosomes** (implemented in C and Perl for SAGE technology)
• Motivation: broad interest from molecular biology in positional behaviour of any measurement data that can be mapped onto DNA sequences
• SigWin e-Science version: Grid-based modular workflow for detecting windows of significance in any sequence of values
  - Widely applicable, from gene expression to meteorology data
  - Modules reusable for alternative workflows, e.g. protein modification
  - Scalable to very large datasets
* Inda et al., 2nd IEEE International Conference on e-Science and Grid Computing (submitted)
** Versteeg et al., Genome Research, 2003

Bioinformatics methods: SigWin
[Figure: the significant window detector, a generalisation of the RIDGE method, applied to three inputs: human gene expression; DNA curvature of the Escherichia coli chromosome; temperature in Amsterdam]

Bioinformatics methods in VL-e (2)
Example 2: an application-specific method composed of generic and specific modules in a workflow: OligoRAP*
• Purpose: a re-annotation workflow for oligo libraries
• Motivation: rapidly evolving knowledge in genome analysis requires frequent re-assessment of the molecules which are used to measure gene expression.
• OligoRAP:
  - Uses a set of application-generic (BIOMOBY) BLAT and BLAST sequence alignment (web) services.
  - Uses an application-specific (BIOMOBY) annotation analysis service
  - BIOMOBY: de-facto standard for bio-informatics web services.
  - Joint work of sequence analysis lab and micro-array lab
Workflow:
• Adjustable filtering criteria make the quality level of oligos explicit
• Workflow provenance makes re-annotation reproducible.
* P. Neerincx, H. Rauwerda, F. Verster, A. Kommadath, T.M. Breit, J.A.M. Leunissen, Poster ISMB 2006

Virtual Lab for e-Science research Philosophy
• Multidisciplinary research and development of related ICT infrastructure
• Generic application support
  - Application cases are drivers for computer & computational science and engineering research
  - Problem solving is partly generic and partly specific
  - Re-use of components via generic solutions whenever possible
• Rationalization of experimental process
  - Reproducible & comparable
• Two research experimentation environments
  - Proof of concept for application experimentation
  - Rapid prototyping for computer & computational science experimentation

Medical Diagnosis and Imaging Problem Solving Environment
Partners:
• Universiteit van Amsterdam (UvA)
• Academisch Medisch Centrum (AMC)
• Vrije Universiteit Medisch Centrum (VUMC)
• Philips Research
• Philips Medical Systems
• TU Delft
• IBM
Applications:
1. Eddy current reduction
2. Matched Masked Bone Elimination
3. Functional brain imaging, DWI and fiber tracking
4. MR virtual colonoscopy
5. Parallel MEG data analyses
6. Grid-based data storage, retrieval and sharing
7. Interactive 3D medical visualization
Objective: To study the design and implementation of a PSE for medical diagnosis and imaging to support and enhance the clinical diagnostic and therapeutic decision process.

Brain Imaging and Fiber Tractography
• Diffusion Weighted Imaging (DWI)
  - Restricted Brownian motion results in anisotropy that can be measured
  - >= 6 measurements, reduced to a tensor per voxel
  - The eigenvector with the largest eigenvalue gives the diffusion vector
• Whole-volume fiber tracking can take many hours
  - Depends on size of volume and number of measurements per voxel
  - Suitable for parallelization
• Visualization techniques

Medical Diagnosis and Imaging Problem Solving Environment
Application specific services:
• Access to PACS, DICOM
• Interfaces to medical scanners (MRI)
• In-house developed algorithms:
  - Eddy Current Reduction
  - Matched Masked Bone Elimination
• Patient privacy
VL-e generic services:
• Provides:
  - Scientific visualization techniques
  - Image processing algorithms
• Uses:
  - Experiment editor
  - Parallel processing techniques
Grid services:
• Storage facilities (SRB)
• High Performance Computing platforms
• High Performance Visualization platforms
[Figure: VL-e Environment layers: Medical Applications; Virtual Laboratory; Grid Middleware; Surfnet]

Eddy current reduction
• Shear, magnification and translation as a result of residual currents in DWI
  - 2D matching to correct
  - Computationally expensive
• Parallelization through domain decomposition
  - Computing cycles via Grid
  - Integrated PACS solution
[Figure: Effects of residual eddy currents on Philips 3T Intera with DWI. Figure by Erik-Jan Vlieger, AMC.]

Medical Diagnosis and Imaging Problem Solving Environment
[Figure: VL experiment topology. Labels: Data retrieval, acquisition; Filtering, analyses, simulation; Image processing, data storage; 2D/3D visualization]

The situation in the Netherlands
• The Netherlands Bio-Informatics Center (NBIC) was set up as part of the Dutch genomics initiative, the Netherlands Genomics Initiative (NGI)
• Its aim was to organize bio-informatics in the Netherlands and to generate sufficient critical mass, and also to support the other genomics initiatives as a technology center
• Organizational structure:
  - Board of directors
    Dr van Kampen, scientific director
    Drs R. Kok, executive director
    Prof. Dr. Hertzberger, adjunct scientific director
  - Board of overseeing
  - International Advisory board
  - Scientific Committee
  - Program Steering Group

Current NBIC activities
• Currently NBIC runs three programs, and took the initiative for and participates in another three joint activities, besides collaborations such as with SURF (networking) and VL-e (e-Science):
• NBIC programs:
  - BioRange: a bio-informatics research program of 25 M$ & 25 M$ matching
  - BioAssist: a 10 M$ support program
  - BioWise: a 3 M$ education program
• Participation in:
  - Computational life sciences: a 5 M$ program with, among others, physics, chemistry and computational science
  - Pilot grid roll-out: a 3 M$ Grid roll-out & support with the Dutch Foundation for computing (NCF) and others
  - BIG GRID: a 35 M$ Grid and e-Science program in the Netherlands together with NCF, physics, VL-e and others

Program activities
• BioRange has four program lines:
  - Micro-array related bio-informatics
  - Proteomics related bio-informatics
  - Integrated bio-informatics
  - Informatics research for bio-informatics
• All program lines comprise a number of collaborative projects with participation of groups all over the Netherlands
• BioAssist runs two program lines:
  - Establishment of e-bioscience support environment
  - Establishment of generic e-science infrastructure
• In future, also an extension towards biomedical applications, as was illustrated

The VL-e infrastructure
[Figure: VL-e layers and environments. Layers: application specific services (Telescience, Medical Application, Bio-Informatics Applications; label "Application Potential"); generic services & Virtual Lab services (Virtual Laboratory); Grid & network services (Grid Middleware, Surfnet). Environments: VL-e Proof of Concept Environment; VL-e Certification Environment (test & certification of VL software, Grid middleware and compatibility); VL-e Experimental Environment (Virtual Lab rapid prototyping with interactive simulation, additional Grid services (OGSA services), network service (lambda networking))]

[Figure: e-Science roll-out. The VL-e Proof of Concept Environment (stable applications & VL-e components) and the VL-e Experimental Environment (unstable applications & VL-e components; rapid prototyping with interactive simulation, additional Grid services (OGSA services), network service (lambda networking)) are linked by application feedback. Other labels: BioAssist; Big Grid; Surfnet; Telescience; Bio Medical Applications; Virtual Laboratory; Grid Middleware; total 25 M$ support + 25 M$ matching; total 35 M$ support]

Conclusions
• Omics experiments change the face of life sciences
• Bioinformatics can be considered an essential enabler and is a form of e-Science
• It will help to realize the necessary paradigm shift in Life Science experimentation
• Better support of experimentation & optimal use of ICT infrastructure require rationalization of the experimentation process
• Information management is an essential technology
• Bioinformatics cannot be decoupled from e-Bioscience applications
• e-Bioscience also has to comprise biomedical applications
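
Illustration: window detection in the spirit of SigWin
The SigWin slides describe detecting windows of significance in any sequence of values. The sketch below is only an illustration of that idea, not the SigWin workflow itself. It assumes a moving-median statistic and a permutation-based null model; the function names, the window size and the toy data are all hypothetical choices made for this example.

```python
import random
from statistics import median


def window_medians(values, size):
    """Median of every contiguous window of `size` values."""
    return [median(values[i:i + size]) for i in range(len(values) - size + 1)]


def significant_windows(values, size, n_perm=200, alpha=0.05, seed=0):
    """Return start indices of windows whose median exceeds a null threshold.

    Assumption (not taken from the slides): the null model is a random
    shuffle of the same values, and the threshold is the (1 - alpha)
    quantile of the window medians collected over all shuffles.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.extend(window_medians(shuffled, size))
    null.sort()
    threshold = null[min(len(null) - 1, int((1 - alpha) * len(null)))]
    observed = window_medians(values, size)
    return [i for i, m in enumerate(observed) if m > threshold]


# Hypothetical toy sequence: flat background with one block of high values.
values = [1.0] * 40 + [5.0] * 10 + [1.0] * 40
hits = significant_windows(values, size=5)
print(hits)  # start positions of windows that overlap the high block
```

Because the detector only needs an ordered list of numbers, the same call would accept expression values along a chromosome or a temperature series, which is the generality the SigWin slides emphasize.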