Distributed, Modular Grid Software for Management and Exploration of Data in Patient-Centric Healthcare IT

Andrew Hart, NASA Jet Propulsion Laboratory
David Kale, Whittier VPICU, Children’s Hospital LA
Heather Kincaid, NASA Jet Propulsion Laboratory

Agenda
- Health Care Data Challenges for Large-scale Research
- Intro to Object Oriented Data Technology (OODT)
- Applications of OODT in distributed scientific data systems
  - NASA’s Planetary Data System
  - NCI’s Early Detection Research Network
  - Whittier Virtual Pediatric Intensive Care Unit (VPICU)
- OODT as Open Source
- Learning More & Keeping in Touch

Health care research
- Increasingly collaborative
- Increasingly geographically distributed
- Scale, Complexity, Cost drive cooperation
- Opportunities for discovery emerge through larger data sets
- Increasing need for technology to support “virtual organizations” carrying out distributed scientific research

OODT – What Is It?
“A data grid software infrastructure for constructing large-scale, distributed data-intensive systems”
- Reference Architecture
- Software Product Line
- Reusable Components
- Common Patterns
[Diagram: Object Oriented Data Technology Framework. Labels: OODT/Science Web Tools, Archive Client, Navigation Service, Archive Service, Profile Service, Product Service, Query Service, Bridge to External Services, Other Service 1, Other Service 2, Profile XML Data, Data System 1, Data System 2]

A Brief History of OODT
- Funded out of NASA’s Office of Space Science in 1998
- Funded to address critical software engineering challenges affecting the design of mission science data systems
- Designed, implemented, and refined over the past 7 years across multiple scientific domains:
  - Planetary Science
  - Earth Science
  - Cancer Research
  - Space Physics
  - Modeling and Simulation
  - Pediatric Intensive Care
- Runner-up for NASA Software of the Year in 2003

Principles behind OODT
- Division of Labor: Avoid making one component the workhorse, configurable
- Technology Independence: Guard against unexpected changes
  in the technology landscape
- Metadata as a first-class citizen: Descriptions of resources come in handy
- Separation of software and data models: Allow each to evolve independently
- Modular, domain-agnostic: Pick and choose from adaptable components with defined interfaces

OODT Core Framework Services
[Diagram: Object Oriented Data Technology Framework and its services]
- Archive Service: Ingest data + metadata, processing algorithms, workflow support
- Profile Service: Deliver metadata from an underlying data store
- Product Service: Deliver data from an underlying data store
- Query Service: Manage sets of profile servers
- Data Grid Service: Interfaces and tools for connecting distributed resources over the web

Applications of OODT: PDS
- Planetary Data System
- National Aeronautics and Space Administration
- http://pds.nasa.gov

NASA Planetary Data System
- Official NASA archive for all planetary data
- 9 nodes with data located at discipline sites
- All missions must add their data (required as part of the mission Announcement of Opportunity)
- Prior to October 2002, no ability to find and share data between PDS nodes

Planetary Data System: Distributed Planetary Science Archive
- Rings Node: Ames Research Center, Moffett Field, CA
- Geosciences Node: Washington University, St.
  Louis, MO
- Imaging Node: JPL and USGS, Pasadena, CA and Flagstaff, AZ
- THEMIS Data Node: Arizona State University, Tempe, AZ
- Central Node: Jet Propulsion Laboratory, Pasadena, CA
- Planetary Plasma Interactions Node: University of California Los Angeles, Los Angeles, CA
- Navigation Ancillary Information Node: Jet Propulsion Laboratory, Pasadena, CA
- Atmospheres Node: New Mexico State University, Las Cruces, NM
- Small Bodies Node: University of Maryland, College Park, MD

PDS Data

Key Challenges
Challenges to building a science data system for the PDS:
- NASA often flies unique, one-of-a-kind missions
- A static infrastructure won’t work: nodes and models change
- Data stored at PDS nodes differs dramatically in structure
- Missions are required to share science data results with the research community

PDS Data Architecture
- Distributed data system environment with federated governance
  - Each site maintains its own database and infrastructure
- Common domain information model (regularly updated) used to drive system implementations
  - Ontology and Common Data Elements (based on ISO/IEC 11179)
- Common query interface to distributed services
  - Implemented with OODT Query Handlers
- Software services that wrap existing data systems to share data
  - Implemented with OODT Product & Profile servers
- Publishing of data products to a common portal
  - Implemented using Resource Description Framework (RDF)

PDS Architecture Decomposition

Applications of OODT: EDRN
- Early Detection Research Network
  - Division of Cancer Prevention, National Cancer Institute
  - http://cancer.gov/edrn

EDRN Overview
- Focus: investigator-initiated, collaborative research on molecular, genetic and other biomarkers for cancer detection and risk assessment.
- Funded since 2000 by the Division of Cancer Prevention in the National Cancer Institute (NCI)
- 40+ geographically distributed centers performing parallel, complementary studies
- Strong emphasis on the role of informatics

EDRN Participants
- Biomarker Development Laboratories: Responsible for the development and characterization of new biomarkers or the refinement of existing biomarkers.
- Biomarker Reference Laboratories: Serve as a Network resource for clinical and laboratory validation of biomarkers, which includes technological development, quality control, refinement, and high throughput.
- Clinical Epidemiology and Validation Centers: Conduct clinical and epidemiological research regarding the clinical application of biomarkers.
- Data Management and Coordinating Center: Coordinate EDRN research activities, provide logistic support, conduct statistical and computational research for data analysis, and analyze data for validation.

OODT and EDRN
- OODT’s success led to interagency agreements with both NIH and NCI, resulting in:
- EDRN Informatics Center: Supports EDRN's efforts through the development of software systems for information management. Located at NASA Jet Propulsion Laboratory, Pasadena, CA.
  - Principal Investigator: Dan Crichton, JPL.

EDRN Data
- EDRN collects, generates, analyzes, and stores a wide variety of different data, including:
  - Specimen Inventories: Map specimens collected (blood, sputum, etc.)
    to patient characteristics
  - Studies and Publications: Information about studies conducted in the EDRN as well as published results (publications, outputs)
  - Biomarkers: Information about indicators of early disease
  - Science Data: Outputs of experiments on specimens, regarding biomarkers, driven by particular studies and protocols

EDRN Data Flow
- Moving beyond the local laboratory
- Scalability, interoperability

Case Study: ERNE
- ERNE: EDRN Resource Network Exchange
- Challenge: Overcome differences in local schema to develop a national distributed specimen information infrastructure
- All sites run different software and follow their own procedures
- Rely on a common information model for distributed querying, and provide site-specific mappings at each participant

ERNE Architecture

Connecting Research
- Designing the EDRN informatics architecture as a collection of well-defined components via OODT has simplified the process of building interfaces to non-EDRN systems
- Wrappers can be built to link non-EDRN systems
- Translators can be developed to deal with different semantic architectures
- caBIG
  - ERNE/caTissue Wrapper
- EDRN-Canary Collaboration
  - A cloud computing effort that shares raw science data via Amazon S3 between EDRN and the Canary group, which uses software from GenoLogics Life Sciences

EDRN Knowledge Environment
- Building a Semantic Bioinformatics Grid for the EDRN

Lessons From EDRN
- Architecture and a vision have been critical
  - Technology hasn’t been as critical
  - Keep it simple
- Science support has been critical
  - Getting buy-in and participation from domain experts is key
- Incremental development and deployment
  - Starting with a few sites was very helpful in understanding the issues
  - We had both development sites and observer sites initially
- The IRB process has been a big schedule driver
- Distributed architecture can be a challenge
  - Not all sites are up to maintaining the implementation
  - Loosely coupled architecture with simple interfaces helped
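
Sketch: common information model with site-specific mappings

The ERNE case study relies on a common information model for distributed querying, with each participant supplying its own mapping to local terms. The Python below is a minimal illustration of that idea only. It is not ERNE or OODT code, and every name in it (`COMMON_ELEMENTS`, `Site`, `federated_query`, the field names and specimen codes) is a hypothetical chosen for the example. Each site translates a query from common terms into its local schema, runs it against its own records, and maps the hits back to common terms.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical common data elements that every site agrees to answer in.
COMMON_ELEMENTS = {"specimen_type", "patient_age"}


@dataclass
class Site:
    """A participating site with its own schema and a mapping to the common model."""
    name: str
    field_map: Dict[str, str]            # common element -> local field name
    records: List[Dict[str, Any]]        # stand-in for the site's local database
    value_map: Dict[str, Dict[Any, Any]] = field(default_factory=dict)
    # value_map: common element -> {common value: local value}

    def query(self, criteria: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Translate common-model criteria to local terms, match, map back."""
        local_criteria = {}
        for element, value in criteria.items():
            if element not in self.field_map:
                return []  # this site does not hold the requested element
            local_value = self.value_map.get(element, {}).get(value, value)
            local_criteria[self.field_map[element]] = local_value
        hits = [r for r in self.records
                if all(r.get(k) == v for k, v in local_criteria.items())]
        return [self._to_common(r) for r in hits]

    def _to_common(self, record: Dict[str, Any]) -> Dict[str, Any]:
        """Express one local record in common-model terms."""
        common = {"site": self.name}
        for element, local_field in self.field_map.items():
            raw = record.get(local_field)
            reverse = {local: shared
                       for shared, local in self.value_map.get(element, {}).items()}
            common[element] = reverse.get(raw, raw)
        return common


def federated_query(sites: List[Site], criteria: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Send one common-model query to every site and merge the answers."""
    unknown = set(criteria) - COMMON_ELEMENTS
    if unknown:
        raise ValueError(f"not in the common model: {sorted(unknown)}")
    results = []
    for site in sites:
        results.extend(site.query(criteria))
    return results


# Two made-up sites with different field names and specimen codes.
SITE_A = Site(
    name="site_a",
    field_map={"specimen_type": "spec_type", "patient_age": "age"},
    records=[{"spec_type": "BLD", "age": 54}, {"spec_type": "SPT", "age": 61}],
    value_map={"specimen_type": {"blood": "BLD", "sputum": "SPT"}},
)
SITE_B = Site(
    name="site_b",
    field_map={"specimen_type": "SampleKind", "patient_age": "AgeYears"},
    records=[{"SampleKind": "blood", "AgeYears": 47}],
)

if __name__ == "__main__":
    for row in federated_query([SITE_A, SITE_B], {"specimen_type": "blood"}):
        print(row)
```

In this sketch the mappings live with each site, so a site can change its local schema without touching the common model or the other participants. That is one way to read the "loosely coupled architecture with simple interfaces" lesson above.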
Applications of OODT: VPICU
- Whittier Virtual Pediatric Intensive Care Unit
  - Children’s Hospital Los Angeles
  - http://picu.net
Collaboration between 85 multi-disciplinary pediatric intensive care units across the U.S.

Collaboration with VPICU
- Laura P. and Leland K. Whittier Virtual Pediatric Intensive Care Unit (VPICU), founded in 1998 by clinicians at CHLA
- Leverage advances in technology to:
  - Improve patient care
  - Educate practitioners
  - Conduct research
  - Reduce cost of providing care

VPICU Research Data
Secondary use of observational clinical (EHR, monitor, annotations) data

Ideal Research Data Set
- Manageable size, static
- Homogeneous
- Complete, standardized descriptions and annotations
- Available as single unit
- Complete, consistent
- Minimal usage restrictions

Real Health Care Data Set
- Massive, grows continuously
- Heterogeneous formats, types, etc.
- Incomplete, proprietary descriptions
- Fragmented across stores, organizational boundaries
- Incomplete, inconsistent
- Highly restricted (legal, privacy, ethical considerations)

VPICU Project Areas
- Data extraction and management: Take data from proprietary stores, make it accessible
- Transformation of data into knowledge: Process (and re-process) the data to extract insight
- Data-driven decision support: Develop tools that learn continuously from the data
- Distributed data-sharing over a national network: Enable research on scales previously impossible while maintaining security, privacy, compliance

Principles behind VPICU
- Decouple from (proprietary) vendor databases
- Integrate disparate data sources into a single model
- Dynamically (re)generate research database(s)
  - We don’t know for sure what queries will be most useful at the outset
- Provide web services for multi-faceted access to the data to enable discovery & analysis
- Support federation among multiple PICU sites

“Algorithm” for VPICU Data System
1. Develop a common Domain Ontology to describe the information space
2. Develop compute services that support extraction of data from existing databases
3. Identify mechanisms to integrate information objects from disparate repositories and map them to the common domain ontology
4. Construct a set of online research databases to enable data mining and analysis
5. Deploy a “data grid” infrastructure of hardware & software to facilitate utilization of the data environment at CHLA and beyond (external entities and applications)
6. Deploy a set of compute services to support data mining and analysis
7. Develop an architectural plan and roadmap for scaling and integrating other PICUs

VPICU Architecture
[Diagram: File-based storage]

VPICU Architecture (Proprietary data sources)
[Diagram labels: EHR, Homegrown, Clinical apps, Monitor data, File-based storage]
- Original data sources/stores at backend
- Proprietary schema
- Hardware that we don’t “own” or control
- Production systems (very load-sensitive)
- Legacy technologies (sometimes)
- Unreliable (can’t guarantee always available)
- Includes:
  - Hospital-wide commercial EHR system(s)
  - Homegrown critical care database
  - Specialized clinical applications
  - Raw bedside monitor data

VPICU Architecture (VPICU-owned resources)
[Diagram: File-based storage]
- Regular extraction of new data
- VPICU-controlled resources (our hardware and software)
- Transform to VPICU schema
- Link data belonging to same patient
- May contain PHI
  - Must be highly secure
- Data at this stage is normalized, stored in a format suitable for ingestion into any number of research databases

VPICU Architecture
[Diagram: File-based storage]
- Research databases
  - Application-specific
  - Optimized
  - Contain de-identified or anonymized data
- VPICU ontology, schema
- Access via configurable web services

What are “research databases?”
- Designed for specific research questions, analytical techniques
- Need not always be relational or databases at all
- Available via web interfaces and software services
  - Researcher using R can connect directly through R bindings
- Examples:
  - Relational database for traditional
    retrospective studies
  - Search engine over free text clinical notes, etc.
  - Patient/patient comparison, retrieval (find patient like this one)
  - Data-backed patient simulator for “testing” interventions

VPICU Architecture
[Diagram: File-based storage]

OODT and the VPICU Data System
1. Develop an Information Model (Ontology) to describe the domain
2. Develop compute services that support extraction of data from existing CHLA databases (OODT Query Handlers)
3. Identify mechanisms to integrate information objects from disparate repositories and map them to the common domain ontology (OODT CAS crawler, catalog services)
4. Construct a set of online research databases to enable data mining and analysis (OODT Catalog and Archive Services)
5. Deploy a “data grid” infrastructure of hardware & software to facilitate utilization of the data environment at CHLA and beyond (external entities and applications) (OODT Data Grid Services)
6. Deploy a set of compute services to support data mining and analysis
7. Develop an architectural plan and roadmap for scaling and integrating other PICUs

OODT as Open Source
- Jan 2010: OODT accepted as a podling in the Apache Software Foundation (ASF) Incubator
- First NASA software licensed and incubating within the ASF
- Learn more and track our progress at:
  - http://incubator.apache.org/projects/oodt.html
- Join the mailing list:
  - [email protected]
- Chat on IRC:
  - #oodt on irc.freenode.net

Acknowledgements
- Jet Propulsion Laboratory: Dan Crichton, Chris Mattmann, Sean Kelly, Steve Hughes, Amy Braverman, Thuy Tran
- National Cancer Institute: Sudhir Srivastava, Christos Patriotis, Don Johnsey
- Fred Hutchinson Cancer Research Center: Mark Thornquist, Ziding Feng, Jackie Dalhgren, Suzanna Reid
- Children’s Hospital Los Angeles: Randall Wetzel, Robinder Khemani, Paul Vee, Jeff Terry, Robert Kaptan, Doug Hallam