Overview and Implementation Strategy of the NIAID-Funded Bio-defense Proteomics Database System

Xianfeng Jeff Chen, Ph.D.
Research Investigator/Project Manager

Agenda Today
(1) Introduction
• VBI responsibility in Admin Center
• PRCs data type and organism
• Proteomics data submission and storage workflow
(2) Database Development
• VBI computing system architecture (CPU and storage)
• VBI database system prototype and functionality
• VBI existing database schema and status
• Example Y2H schema for design logic and case study
(3) Strategy on Knowledgebase Development
• Proposed data integration and knowledgebase construction

Introduction

Proteomics Data Management
[Diagram: Tasks of Proteomics Data Management, from raw data to processed data. Task labels: Data Analysis; Data Storage; QA/QC; Annotation; Visualization Tools; Interoperability; Curation. Responsibility labels: (VBI), (VBI/GU), (GU). Separate box: SOP, LIMS, & Adm DB (SSS).]

PRCs Major Data Type
Organization                          Major Data Type
University of Michigan                Microarray and mass spectrometry
Caprion                               Mass spectrometry
Harvard Proteomics Institute          Genomics and protein expression array
Albert Einstein College of Medicine   Mass spectrometry
PNNL                                  Mass spectrometry
Scripps                               NMR structural and X-ray crystal diffraction data
Myriad Genetics                       Yeast two-hybrid system

PRCs Organisms
Einstein    Toxoplasma gondii, Cryptosporidium parvum
Caprion     Brucella abortus
Harvard     Bacillus anthracis (protein array), Vibrio cholerae
Myriad      Bacillus anthracis (Y2H), Yersinia pestis, Francisella tularensis, vaccinia
PNNL        Orthopox (vaccinia and monkeypox), Salmonella typhimurium, Salmonella typhi
Scripps     SARS CoV
Michigan    Bacillus anthracis (TXP, MS) + host (human)

Proteomics Data Flow
[Diagram: data flow from the PRCs through VBI to the public.]
Stages: Data Sources (PRCs) -> Data Types -> Standard Format for Each Data Type (converting to standard format) -> Quality Assurance & Quality Control -> Relational Database (data modeling with decomposition) -> Public
Data types: 2D gels, protein array, LC, immunoaffinity purification, Y2H, MS & MS/MS, NMR, X-ray diffraction, cryoelectron microscopy, etc.
MIAME and MIAPE-like standards/SOP for data submission

Database Development

VBI Computing System
[Diagram: VBI computing system. Labels: Linux web server; Gimli; SUN (Solaris); Elenwe; TUOR; Application Server; Networked File Server; Relational Database Server; Data Storage (Project, Binary, Proteomics, Genomics, Software); PC users (Jeff, Wei, Chaitanya, Chendong, Ranjan, Oswald, Bruno); Proteomics team (Chendong, Jeff, Wei, Ranjan, Chaitanya); 7 PRCs.]

System Development in Q3 of 2005
[Diagram: three environments (Development, Test/Stage, Production), each with a web interface and a database.]

Proteomics Database Project Websites
Production: http://proteinbank.vbi.vt.edu/bprc
Test: http://proteinbankdev.gepasi.org/bprc/
Development: http://txue.bioinformatics.vt.edu:8080/bprc
             http://wsun.vbi.vt.edu:8080/bprc/

Production Website Instance
Dynamically generated webpage
Functionalities:
(1) Account management
(2) File and doc management
(3) News group and news update
(4) Textual data display
(5) 2D gel image data display
(6) Table and record query
(7) Data uploading and simple submission
(8) HTTP data downloading
(9) SFTP file transfer

Database Query
Search By Experiment
• Select experiment
• Retrieve list of bait protein and nucleotide, prey protein and nucleotide
• Links to details of bait and prey
  Example: Drosophila melanogaster
Search By Organism
• Escherichia coli
• Saccharomyces cerevisiae
• Homo sapiens
• Drosophila melanogaster
• Helicobacter pylori
• Caenorhabditis elegans
Search By Data Type
• Proteomics
• Genomics
• Microarray

Query for Scripps Sample Data
Search By Project/Experiment
• Scripps MS testing project
• Available peptide hit list
• Retrieve peak information and m/z & intensity list

Query for 2D Gel Data
Search By Experiment/Sample

Proteomics Database Architecture: Three Phases of Database Design
[Diagram: three design phases across the application, logical, and physical layers.]
• Process-oriented design: multiple schemas of disparate data (2D gel, MS, LC, NMR, X-ray diffraction, Y2H, protein array, cryoelectron microscopy, immunoaffinity purification)
• Normalized with key-value pairs: consolidate to one schema to remove redundancy
• Production design: stored procedures for the analysis pipeline; views (materialized views); final views

Proteomics Database Architecture: Three Database Instances
Phase 1 (Version 1, 0.5-1 year): Individual dataset modeling; disparate data with multiple schemas
Phase 2 (Version 2, 1-1.5 years): Consolidation into a few schemas; a normalized data model implemented as key-value pairs, highly decomposed
Phase 3 (Version 3, 2 years): Analysis pipeline procedures; logical layer with views for the user; physical layer
Database instances: Development, Test/stage, Production
Production:
1. Partially processed data
2. Data enhanced with knowledge
3. Interface less changeable
4. Curated/annotated data

Status of VBI Database Development
Schema            Development (Maturity)   Test/stage   Production
Adm               + (10/10)                +            +
2D Gel            + (10/10)                +            +
MS                + (10/10)                +            +
Interaction       + (9/10)                 +            -
Pathway           + (7/10)                 +            -
Data Repository   + (8/10)                 +            +
Y2H               + (10/10)                +            +
Genomics          + (10/10) (GUS)          +            +
Microarray        + (10/10) (AE)           +            +
Default Tablespace: Admin_data, Genomics_TBLS, Pathway_TBLS, Microarray_TBLS, Proteomics_TBLS.
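
Sketch: Key-Value Data Model with a User-Facing View
The Phase 2 design above describes a normalized data model implemented as key-value pairs, with a logical layer of views for the user. The following is a minimal sketch of that idea, not the project's actual schema. All table, column, and attribute names are hypothetical, and SQLite (from the Python standard library) stands in for the Oracle server named in these slides.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a key-value model and a pivot view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Physical layer: one row per experiment, one row per attribute.
    CREATE TABLE experiment (
        experiment_id INTEGER PRIMARY KEY,
        data_type     TEXT NOT NULL
    );
    CREATE TABLE experiment_attribute (
        experiment_id INTEGER NOT NULL REFERENCES experiment(experiment_id),
        attr_key      TEXT NOT NULL,
        attr_value    TEXT,
        PRIMARY KEY (experiment_id, attr_key)
    );
    -- Logical layer: pivot key-value rows back into named columns.
    CREATE VIEW y2h_experiment AS
    SELECT e.experiment_id,
           MAX(CASE WHEN a.attr_key = 'bait' THEN a.attr_value END) AS bait,
           MAX(CASE WHEN a.attr_key = 'prey' THEN a.attr_value END) AS prey
    FROM experiment e
    LEFT JOIN experiment_attribute a ON a.experiment_id = e.experiment_id
    WHERE e.data_type = 'Y2H'
    GROUP BY e.experiment_id;
    """)
    return conn


def add_experiment(conn, data_type, attributes):
    """Insert one experiment and its attributes; return the new experiment id."""
    cur = conn.execute(
        "INSERT INTO experiment (data_type) VALUES (?)", (data_type,)
    )
    experiment_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO experiment_attribute VALUES (?, ?, ?)",
        [(experiment_id, key, value) for key, value in attributes.items()],
    )
    return experiment_id


if __name__ == "__main__":
    db = build_demo_db()
    # Placeholder protein names, not real data.
    add_experiment(db, "Y2H", {"bait": "proteinA", "prey": "proteinB"})
    add_experiment(db, "MS", {"instrument": "example"})
    for row in db.execute("SELECT * FROM y2h_experiment"):
        print(row)
```

A new data type only adds attribute rows, not tables, which is one plausible reading of "consolidate to one schema to remove redundancy"; the view keeps the user-facing interface stable while the physical layer changes.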
Generic Experiment Data Components: Example of Database Design Logic
• Who (People)
• Where (Organization)
• Project (Goal)
• Materials and Methods (Metadata)
• Results (Raw Data)
• Conclusion and Hypothesis (Processed and Analyzed Data)

Y2H Data Component Modeling
[Diagram: Y2H data components. Labels: People; Project; Experiment; Sample; DNA/Protein Detail; Results; Conclusion; Hypothesis.]

Experiment Component Object Model
[Diagram: object model. Classes: Experiment; Experiment Design; Design Description; Experiment Factor; Factor Value; Ontology Entry.]
Ontology entries handle two annotation cases:
1) Fields with diverse choices, where existing ontologies can better capture the information
2) Fields that are essentially controlled vocabularies, limited in number of choices but likely to grow in the future or vary by technology type

Y2H Partial Database Schema
[Figure: partial Y2H database schema.]

Proteomics DB System Architecture
Batch Processing: (1) Data uploading; (2) Data validation; (3) Data analysis; (4) Data processing
Technologies: Perl, Java, JSP, CGI, Java JDBC, Perl DBI/DBD, ODBC
Components: Private File Server; Oracle Relational Database; Public File Server

System Architecture of Putative VBI Proteomics Knowledgebase: Data, Tool, Project, and Team Interoperability
[Diagram: layered architecture with security at each layer.]
• Web Display and Data Visualization
• Application Layer
• Service-Oriented Middleware with Process Control (temporary data)
• Virtual Database/Warehouse
• Data sources: Array Express, Mass Spectrometry, Two Component System, 2D Gel, Structure Data, Genomics Data

Strategy on Data Integration and Construction of Knowledge Warehouse

Biological Information Workflow
[Diagram: biological information workflow. Labels: Biological Research; Target Discovery; Diagnostics, Therapeutics & Vaccines; Data Management; Knowledge Management; Knowledge Generation; Curation and Annotation of Data; Cleaning, Processing Algorithms; Information Storage, Queries & DB Management.]

VBI PDC Project Phases
Phase I (first 2 years): Bio-IT Scope
• Raw data management
• Schema development
• Data visualization
• Data standardization
Phase II (3rd-4th years) and Phase III (5th year): Data integration, knowledge generation, knowledge management, knowledge presentation
• Integration at interface level
• Integration of data at DB level
• Interoperability of datasets
• Normalization and warehousing
• Predefined query
• Materialized view
• Comparative analysis
• Statistical analysis

Mapping the Proteome
(1) Yeast two-hybrid system: measures association between two proteins. Allows very high throughput.
(2) Mass spectrometry: allows identification of proteins within large complexes (2-100 proteins). Lower throughput.

Infer Complex Interaction Topology
[Diagram: Y2H analysis contributes binary interactions and MS analysis contributes n-ary interactions among proteins; both feed a complex interaction model in the knowledgebase. Other labels: PO4.]

Bacillus anthracis Data Organization
(1) Completed genome (Ames, Ames Ancestor, A2012): NCBI, TIGR
(2) Yeast two-hybrid interaction data: Myriad Genetics
(3) Mass spectrometry: Scripps and Caprion
(4) Microarray expression profiling: Univ. of Michigan
(5) Intraspecies and interspecies clustering: NCBI (COG) and TIGR
(6) Functional category assignment: GU (PIR)

Strategy for Knowledgebase Construction
(1) Annotation improvement
    (1) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.
    (2) Comparative genomics with two reference genomes: E. coli and yeast
(2) Identifying anchor points for data integration
    (1) Known metabolic pathways (E. coli and yeast)
    (2) Known signal transduction pathways
    (3) Known gene regulation machinery
    (4) Known protein-protein interaction maps

Data Integration
[Diagram: layered build-up of the knowledgebase, listed here from top to bottom.]
• Lay down microarray data to add co-expression patterns to the gene network
• Lay down MS multiple-interaction data to expand the network
• Lay down Y2H interaction data and expand the network
• Anchor on the knowledge network of the reference genomes (E. coli and yeast)
• Comparative genomics: improved annotation
• Genomics data
Putative Knowledgebase: No thing http://www.Bacillus_anthracis.org

Data Mining and Knowledge Augmentation
[Figure: interaction network. Key: Literature; Y2H analysis; Multi-Protein Complex; Curated; In-House Y2H; Both Curated + Y2H; MS analysis; Microarray.]

Acknowledgement
Name                     Role                              Organization
Dr. Jeff Chen            Project Manager/Investigator      VBI
Dr. Chendong Zhang       Senior Software Engineer          VBI
Dr. Steve Cammer         Bioinformatics Scientist          VBI
Dr. Oswald Crasta        Scientist and CI-Co-director      VBI
Susan Baker              DBA                               VBI
Jiang Lu                 DBA                               VBI
Ranjan Jha               Software Engineer                 VBI
Qiang Yu                 Software Engineer                 VBI
Jian Li                  Software Engineer                 VBI
Wei Sun                  Software Engineer                 VBI
Chaitanya Kommidi        Software Engineer                 VBI
Dr. Bruno Sobral         Co-PI                             VBI
Dr. Peter MacGarvey      Senior Bioinformatics Scientist   GU
Dr. Cathy Wu             Co-PI                             GU
Paula Yadvish            Web Coordinator                   SSS
Margaret Moore           PI                                SSS
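
Backup Sketch: Layering Evidence onto an Interaction Network
The Data Integration strategy earlier in this deck anchors on a reference network and then lays down Y2H, MS, and microarray data in turn. The following is a minimal sketch of that layering, not the project's pipeline. The function name, evidence labels, and protein identifiers are hypothetical. It also assumes that every pair of proteins in an MS-identified complex is treated as a candidate interaction, and that co-expression only annotates edges that already exist; the slides do not specify either choice.

```python
from collections import defaultdict
from itertools import combinations


def integrate(anchor_edges, y2h_pairs, ms_complexes, coexpressed_pairs):
    """Return {frozenset({a, b}): set of evidence labels} for each edge."""
    network = defaultdict(set)

    # 1. Anchor on interactions transferred from the reference genomes.
    for a, b in anchor_edges:
        network[frozenset((a, b))].add("anchor")

    # 2. Lay down Y2H data: each bait-prey hit is a binary interaction.
    for a, b in y2h_pairs:
        network[frozenset((a, b))].add("Y2H")

    # 3. Lay down MS data: expand each n-ary complex into pairwise edges
    #    (assumption: all members of a complex are linked to each other).
    for members in ms_complexes:
        for a, b in combinations(sorted(members), 2):
            network[frozenset((a, b))].add("MS")

    # 4. Lay down microarray data: annotate existing edges only, so that
    #    co-expression supports the network without creating new edges.
    for a, b in coexpressed_pairs:
        key = frozenset((a, b))
        if key in network:
            network[key].add("coexpression")

    return dict(network)


if __name__ == "__main__":
    # Placeholder identifiers, not real proteins.
    net = integrate(
        anchor_edges=[("P1", "P2")],
        y2h_pairs=[("P2", "P3")],
        ms_complexes=[{"P3", "P4", "P5"}],
        coexpressed_pairs=[("P1", "P2"), ("P1", "P5")],
    )
    for edge, evidence in net.items():
        print(sorted(edge), sorted(evidence))
```

Keeping the evidence labels per edge makes it possible to filter the network later by source, in the spirit of the key shown on the Data Mining and Knowledge Augmentation slide.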