Overview and Implementation Strategy of
the NIAID-Funded Bio-defense
Proteomics Database System
Xianfeng Jeff Chen Ph.D.
Research Investigator/Project Manager
Agenda Today
(1) Introduction
• VBI responsibility in Admin Center
• PRC data types and organisms
• Proteomics data submission and storage workflow
(2) Database Development
• VBI computing system architecture (CPU and storage)
• VBI database system prototype and functionality
• VBI existing database schema and status
• Example Y2H schema for design logic and case study
(3) Strategy on Knowledgebase Development
• Proposed data integration and knowledgebase construction
Introduction
Proteomics Data Management
Tasks of Proteomics Data Management
• Data Storage
• Data Analysis, QA/QC, Annotation, & Visualization Tools
• Interoperability & Curation
• SOP, LIMS, & Adm DB (SSS)
Diagram labels: RAW DATA, (processed data); groups: (VBI), (VBI/GU), (GU)
PRCs Major Data Type
Organization                          Major Data Type
University of Michigan                Microarray and mass spectrometry
Caprion                               Mass spectrometry
Harvard Proteomics Institute          Genomics and protein expression array
Albert Einstein College of Medicine   Mass spectrometry
PNNL                                  Mass spectrometry
Scripps                               NMR structural and X-ray crystal diffraction data
Myriad Genetics                       Yeast two-hybrid system
PRCs Organisms
Einstein    Toxoplasma gondii, Cryptosporidium parvum
Caprion     Brucella abortus
Harvard     Bacillus anthracis (Protein array), Vibrio cholerae
Myriad      Bacillus anthracis (Y2H), Yersinia pestis, Francisella tularensis, vaccinia
PNNL        Orthopox (vaccinia and monkeypox), Salmonella typhimurium, Salmonella typhi
Scripps     SARS CoV
Michigan    Bacillus anthracis (TXP, MS) + host (human)
Proteomics Data Flow
Diagram labels: PRCs, VBI, Public
Stages:
(1) Data Sources: PRCs
(2) Data Types: 2D gels, protein array, LC, immunoaffinity purification, Y2H, MS & MS/MS, NMR, X-ray, cryoelectron microscopy, X-ray diffraction, etc.
(3) Standard Format for Each Data Type: data modeling with decomposition, converting to standard format
(4) Quality Assurance & Quality Control (QA & QC)
(5) Relational Database
MIAME and MIAPE-like Standards/SOP for Data Submission
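The QA/QC and loading steps of the data flow can be sketched as a minimum-information check followed by an insert into a relational table. This is only an illustration: the required fields, the table layout, and the function names are hypothetical, and SQLite stands in for the production database.

```python
import sqlite3

# Hypothetical minimum-information fields per data type (not from the slides).
REQUIRED_FIELDS = {
    "MS": {"sample_id", "organism", "instrument", "peak_list"},
    "Y2H": {"sample_id", "organism", "bait", "prey"},
}


def qa_qc(record):
    """Return a list of problems; an empty list means the record passes."""
    data_type = record.get("data_type")
    if data_type not in REQUIRED_FIELDS:
        return [f"unknown data type: {data_type!r}"]
    missing = REQUIRED_FIELDS[data_type] - set(record)
    return [f"missing field: {name}" for name in sorted(missing)]


def load(records):
    """Insert records that pass QA/QC into an in-memory table; collect the rest."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE submission (sample_id TEXT, data_type TEXT, organism TEXT)"
    )
    rejected = []
    for rec in records:
        problems = qa_qc(rec)
        if problems:
            rejected.append((rec.get("sample_id"), problems))
            continue
        db.execute(
            "INSERT INTO submission VALUES (?, ?, ?)",
            (rec["sample_id"], rec["data_type"], rec["organism"]),
        )
    return db, rejected
```

Rejected submissions are returned together with their problems so that they can be sent back to the submitting center instead of being silently dropped.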
Database Development
VBI Computing System
[Diagram. Platforms: LINUX, SUN (Solaris). Components: Web Server, Application Server, Networked File Server, Relational Database Server, Data Storage. Other labels: Gimli, TUOR, Oswald, Bruno, Elenwe; Project, Binary, Proteomics Software; Genomics; Proteomics; 7 PRCs. PC users: Chendong, Jeff, Wei, Ranjan, Chaitanya.]
System Development in Q3 of 2005
Development
Web Interface
Database
Test/Stage
Production
Proteomics Database Project Websites
Production: http://proteinbank.vbi.vt.edu/bprc
Test: http://proteinbankdev.gepasi.org/bprc/
Development: http://txue.bioinformatics.vt.edu:8080/bprc
http://wsun.vbi.vt.edu:8080/bprc/
Production Website Instance
Dynamically generated webpage
Functionalities:
(1) Account management
(2) File and doc management
(3) News group and news update
(4) Textual data display
(5) 2D gel image data display
(6) Table and record query
(7) Data uploading and simple submission
(8) HTTP data downloading
(9) SFTP file transfer
Database Query
Search By Experiment
• Select Experiment
• Retrieve list of bait protein and nucleotide, prey protein and nucleotide
• Links to details of bait and prey
Search By Organism (example: Drosophila melanogaster)
• Escherichia coli
• Saccharomyces cerevisiae
• Homo sapiens
• Drosophila melanogaster
• Helicobacter pylori
• Caenorhabditis elegans
Search By Data Type
• Proteomics
• Genomics
• Microarray
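As a rough illustration of the experiment and organism searches, the sketch below runs both against a tiny in-memory database. The table names, column names, and sample rows are all hypothetical, and SQLite is used only so the example runs on its own; the real schema is not shown in the slides.

```python
import sqlite3


def make_demo_db():
    """Build a small demo database with made-up experiments and interactions."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT, organism TEXT);
        CREATE TABLE interaction (
            experiment_id INTEGER REFERENCES experiment(id),
            bait TEXT,
            prey TEXT
        );
        INSERT INTO experiment VALUES (1, 'demo-y2h-1', 'Drosophila melanogaster');
        INSERT INTO experiment VALUES (2, 'demo-y2h-2', 'Helicobacter pylori');
        INSERT INTO interaction VALUES (1, 'baitA', 'preyX');
        INSERT INTO interaction VALUES (1, 'baitA', 'preyY');
        INSERT INTO interaction VALUES (2, 'baitB', 'preyZ');
        """
    )
    return db


def bait_prey_by_experiment(db, experiment_name):
    """Search by experiment: list (bait, prey) pairs for one experiment."""
    rows = db.execute(
        """
        SELECT i.bait, i.prey
        FROM interaction i
        JOIN experiment e ON e.id = i.experiment_id
        WHERE e.name = ?
        ORDER BY i.bait, i.prey
        """,
        (experiment_name,),
    )
    return rows.fetchall()


def experiments_by_organism(db, organism):
    """Search by organism: list experiment names for one organism."""
    rows = db.execute(
        "SELECT name FROM experiment WHERE organism = ? ORDER BY name",
        (organism,),
    )
    return [name for (name,) in rows]
```

Both searches use bound parameters rather than string formatting, which matters for a web-facing query page.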
Query for Scripps Sample Data
Search By Project/Experiment
• Scripps MS testing project
• Available peptide hit list
• Retrieve peak information and m/z & intensity list
Query for 2D Gel Data
Search By Experiment/Sample
Proteomics Database Architecture
Three Phases of Database Design
(1) Process-Oriented: multiple schemas of disparate data (2D Gel, MS, LC, NMR, X-Ray Diffraction, Y2H, Protein Array, Cryoelectron Microscopy, Immunoaffinity Purification)
(2) Normalized with Key-value Pair: consolidate to one schema to remove redundancy
(3) Production Design: final views
Layers:
• Application Layer: stored procedures for the analysis pipeline
• Logical Layer: views (materialized views)
• Physical Layer
Proteomics Database Architecture
Three Database Instances: Development, Test/stage, Production
Phase 1 (Version 1, 0.5-1 year): Individual dataset modeling. Disparate data with multiple schemas.
Phase 2 (Version 2, 1-1.5 years): Consolidation into a few schemas. A normalized data model implemented as key-value pairs, highly decomposed.
Phase 3 (Version 3, 2 years): Analysis pipeline procedures. Logical layer with views for the user; physical layer.
Production:
1. Partially Processed Data
2. Data Enhanced with Knowledge
3. Interface Less Changeable
4. Curated/Annotated Data
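A minimal sketch of how a key-value physical layer and a user-facing view can fit together is given below. The table, view, and attribute names are hypothetical. SQLite has no materialized views, so a plain view stands in for the logical layer here.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Physical layer: one highly decomposed key-value table (hypothetical names).
    CREATE TABLE sample_attribute (
        sample_id  TEXT,
        attr_key   TEXT,
        attr_value TEXT,
        PRIMARY KEY (sample_id, attr_key)
    );

    -- Logical layer: a view that pivots the key-value rows back into columns.
    CREATE VIEW sample_view AS
    SELECT sample_id,
           MAX(CASE WHEN attr_key = 'organism'  THEN attr_value END) AS organism,
           MAX(CASE WHEN attr_key = 'data_type' THEN attr_value END) AS data_type
    FROM sample_attribute
    GROUP BY sample_id;
    """
)

db.executemany(
    "INSERT INTO sample_attribute VALUES (?, ?, ?)",
    [
        ("s1", "organism", "Bacillus anthracis"),
        ("s1", "data_type", "Y2H"),
        ("s2", "organism", "Yersinia pestis"),
    ],
)

# Users query the view and never see the key-value layout underneath.
rows = db.execute("SELECT * FROM sample_view ORDER BY sample_id").fetchall()
```

New attributes can be added as rows without altering the table, while the view keeps the interface stable for users. A missing attribute simply appears as NULL in the view.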
Status of VBI Database Development
Schema            Development (Maturity)   Test/stage   Production
Adm               + (10/10)                +            +
2D Gel            + (10/10)                +            +
MS                + (10/10)                +            +
Interaction       + (9/10)                 +            -
Pathway           + (7/10)                 +            -
Data Repository   + (8/10)                 +            +
Y2H               + (10/10)                +            +
Genomics          + (10/10) (GUS)          +            +
Microarray        + (10/10) (AE)           +            +
Default Tablespace: Admin_data, Genomics_TBLS, Pathway_TBLS, Microarray_TBLS, Proteomics_TBLS.
Generic Experiment Data Components: Example of Database Design Logic
Who (People)
Where (Organization)
Project (Goal)
Materials and Methods (Metadata)
Results (Raw Data)
Conclusion and Hypothesis (Processed and Analyzed Data)
Y2H Data Component Modeling
People
Experiment
Sample
Project
DNA /Protein
Detail
Results
Conclusion
Hypothesis
Experiment Component Object Model
Classes: Experiment, Experiment Design, Design Description, Experiment Factor, Factor Value, Ontology Entry
Ontology entries handle two annotation cases:
1) Cases with diverse choices, where existing ontologies better capture the information
2) What are essentially controlled vocabularies, which are limited in number of choices but might grow in the future or vary by technology type
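One way to picture the object model is as a handful of small classes in which every annotation is an ontology entry rather than free text. The sketch below is an assumption-laden illustration: the slide only names the boxes, so the field names and the relationships between classes are guesses, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class OntologyEntry:
    """A controlled term: a category plus one allowed value."""
    category: str
    value: str


@dataclass
class FactorValue:
    """One level of an experimental factor, expressed as an ontology entry."""
    value: OntologyEntry


@dataclass
class ExperimentFactor:
    name: str
    category: OntologyEntry
    factor_values: List[FactorValue] = field(default_factory=list)


@dataclass
class ExperimentDesign:
    description: str  # corresponds to the "Design Description" box
    design_types: List[OntologyEntry] = field(default_factory=list)
    factors: List[ExperimentFactor] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    design: ExperimentDesign


def factor_levels(experiment, factor_name):
    """Return the value strings recorded for one named factor."""
    for factor in experiment.design.factors:
        if factor.name == factor_name:
            return [fv.value.value for fv in factor.factor_values]
    return []


# Hypothetical example: a Y2H screen with one factor, the bait library.
library = ExperimentFactor(
    name="bait library",
    category=OntologyEntry("FactorCategory", "library"),
    factor_values=[
        FactorValue(OntologyEntry("Library", "genomic")),
        FactorValue(OntologyEntry("Library", "cDNA")),
    ],
)
demo = Experiment(
    name="demo Y2H screen",
    design=ExperimentDesign(
        description="Pairwise bait and prey screen",
        design_types=[OntologyEntry("DesignType", "binary interaction screen")],
        factors=[library],
    ),
)
```

Because a controlled vocabulary is just a set of `OntologyEntry` rows, it can grow or vary by technology type without changing the class definitions.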
Y2H Partial Database Schema
Proteomics DB System Architecture
Batch Processing
(1) Data uploading;
(2) Data validation;
(3) Data analysis;
(4) Data processing
Perl, Java
JSP, CGI, Java
JDBC, Perl DBI/DBD, ODBC
Private File Server
Oracle Relational Database
Public File Server
System Architecture of Putative VBI Proteomics Knowledgebase
------- Data, Tool, Project, and Team Interoperability
Web Display and Data Visualization
Security
Application Layer
Security
Service-Oriented MiddleWare with Process Control
Temporary data
Security
Virtual Database/ Warehouse
Security
Array Express, Mass Spectrometry, Two Component System, 2D Gel, Structure Data, Genomics Data
Strategy on Data Integration and
Construction of Knowledge Warehouse
Biological Information Workflow
Diagnostics,
Therapeutics &
Vaccines
Target Discovery
Biological Research
Knowledge Generation
Data Management
Knowledge Management
Curation and Annotation of Data
Cleaning, Processing Algorithms
Information Storage, Queries
& DB Management
VBI PDC Project Phases
Bio-IT Scope
Phase I (first 2 years)
• Raw data management
• Schema development
• Data visualization
• Data standardization
Phase II (3rd-4th years): Data integration
• Integration at interface level
• Integration of data at DB level
• Interoperability of datasets
• Normalization and warehousing
Phase III (5th year): Knowledge generation, knowledge management, knowledge presentation
• Predefined query
• Materialized view
• Comparative analysis
• Statistical analysis
Mapping the Proteome
(1) Yeast two-hybrid system
Measures association between two proteins.
Allows very high throughput.
(2) Mass spectrometry
Allows identification of proteins within large complexes (2-100 proteins).
Lower throughput.
Infer Complex Interaction Topology
Y2H Analysis: binary interactions
MS Analysis: n-ary interactions
Binary and n-ary interactions together: Complex Interaction Model
Other diagram labels: Knowledgebase, Proteins, PO4
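A simple way to combine the two kinds of evidence is to take each complex found by mass spectrometry and keep only the yeast two-hybrid pairs that fall inside it. The sketch below does that and nothing more. It is one possible reading of the slide, not the project's actual method, and the function and variable names are hypothetical.

```python
from itertools import combinations


def complex_topology(binary_interactions, complexes):
    """For each complex, list the member pairs supported by a binary interaction.

    binary_interactions: iterable of (protein, protein) pairs, e.g. from Y2H.
    complexes: dict mapping a complex name to a set of member proteins, e.g. from MS.
    """
    # Store pairs as frozensets so that (A, B) and (B, A) count as the same edge.
    edges = {frozenset(pair) for pair in binary_interactions}
    topology = {}
    for name, members in complexes.items():
        topology[name] = [
            pair
            for pair in combinations(sorted(members), 2)
            if frozenset(pair) in edges
        ]
    return topology


# Invented example data.
y2h_pairs = [("A", "B"), ("C", "B"), ("A", "D")]
ms_complexes = {"complex1": {"A", "B", "C"}}
result = complex_topology(y2h_pairs, ms_complexes)
```

In this example the complex contains A, B, and C, and the binary data suggest that B bridges A and C, since no direct A-C interaction was observed.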
Bacillus anthracis
Data (Organization)
(1) Completed Genome: Ames, Ames Ancestor, a2012 (NCBI, TIGR)
(2) Yeast two-hybrid interaction data (Myriad Genetics)
(3) Mass Spectrometry (Scripps and Caprion)
(4) Microarray expression profiling (Univ. of Michigan)
(5) Intraspecies and interspecies clustering (NCBI (COG) and TIGR)
(6) Functional category assignment (GU (PIR))
Strategy for Knowledgebase Construction
(1) Annotation Improvement
  (a) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.
  (b) Comparative genomics with two reference genomes: E. coli and yeast
(2) Identifying anchor points for data integration
  (a) Known metabolic pathways: E. coli and yeast
  (b) Known signal transduction pathways
  (c) Known gene regulation machinery
  (d) Known protein-protein interaction maps
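Phylogenetic profiling, one of the annotation methods listed above, compares the presence/absence pattern of each gene across a set of genomes; genes with similar patterns are candidates for a functional link. The sketch below uses Jaccard similarity and a fixed threshold. Both choices, along with the gene names and profiles, are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (sequences of 0/1)."""
    present_a = {i for i, bit in enumerate(profile_a) if bit}
    present_b = {i for i, bit in enumerate(profile_b) if bit}
    union = present_a | present_b
    if not union:
        return 0.0
    return len(present_a & present_b) / len(union)


def linked_pairs(profiles, threshold=0.75):
    """Return gene pairs whose profiles are at least `threshold` similar."""
    names = sorted(profiles)
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if jaccard(profiles[a], profiles[b]) >= threshold
    ]


# Each position is one genome: 1 if a homolog is present, 0 if not (invented data).
demo_profiles = {
    "geneA": (1, 1, 0, 1),
    "geneB": (1, 1, 0, 1),
    "geneC": (0, 0, 1, 0),
}
candidates = linked_pairs(demo_profiles)
```

Here geneA and geneB occur in exactly the same genomes and are reported as a candidate pair, while geneC shares no genome with either.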
Data Integration
Lay down microarray data to add co-expression patterns to the gene network
Lay down MS multiple-interaction data to expand the network
Lay down Y2H interaction data to expand the network
Anchor on the knowledge network of the reference genomes: E. coli and yeast
Comparative Genomics
Improved annotation
Genomics Data
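The layering can be pictured as a network whose edges remember which data sources support them. The sketch below starts from an anchor network and lays further evidence on top. The source labels, protein names, and the function name are hypothetical.

```python
from collections import defaultdict


def layer_evidence(*layers):
    """Merge (source, pairs) layers into one network.

    Returns a dict mapping each undirected edge (a frozenset of two names)
    to the set of sources that support it.
    """
    network = defaultdict(set)
    for source, pairs in layers:
        for a, b in pairs:
            network[frozenset((a, b))].add(source)
    return dict(network)


# Invented example: anchor on a reference network, then lay down Y2H and microarray data.
net = layer_evidence(
    ("reference", [("p1", "p2")]),
    ("Y2H", [("p2", "p1"), ("p2", "p3")]),
    ("microarray", [("p2", "p3")]),
)
```

Edges supported by several sources can then be ranked above those seen in only one data set.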
Putative Knowledgebase:
No thing
http://www.Bacillus_anthracis.org
Data Mining and Knowledge
Augmentation
Key:
Literature
Y2H analysis
Multi-Protein Complex
Curated
In-House Y2H
Both Curated + Y2H
MS analysis
Microarray
Acknowledgement
Name                    Role                              Organization
Dr. Jeff Chen           Project Manager/Investigator      VBI
Dr. Chendong Zhang      Senior Software Engineer          VBI
Dr. Steve Cammer        Bioinformatics Scientist          VBI
Dr. Oswald Crasta       Scientist and CI-Co-director      VBI
Susan Baker             DBA                               VBI
Jiang Lu                DBA                               VBI
Ranjan Jha              Software Engineer                 VBI
Qiang Yu                Software Engineer                 VBI
Jian Li                 Software Engineer                 VBI
Wei Sun                 Software Engineer                 VBI
Chaitanya Kommidi       Software Engineer                 VBI
Dr. Bruno Sobral        Co-PI                             VBI
Dr. Peter MacGarvey     Senior Bioinformatics Scientist   GU
Dr. Cathy Wu            Co-PI                             GU
Paula Yadvish           Web Coordinator                   SSS
Margaret Moore          PI                                SSS