Data Provenance Workshop
Natalia Maltsev, MCS, Argonne National Laboratory

Why Biotechnological Revolution?
So much data!!
- High-throughput technologies provide huge amounts of biological data:
    - Sequence data
    - Data describing functional networks (metabolism, regulation, gene expression)
    - Dynamic data
- Progress in computer science, computer technologies and bioinformatics makes it possible to analyze this data
- Hmmm… 98 published genomes, 652 on-going genomes

Biology in a Nutshell (for people with little knowledge but infinite intelligence)
- Genome (ROM): assembly code on how to build proteins
- Instructions: A, C, T, G
- 3 variables: amino acid
- Genome consists of genes
- Gene: object description
- Protein: object instantiation
- Enzymes: proteins that catalyze biochemical reactions
- Pathway: sequence of reactions
- Network (directed graph): set of pathways with metabolites as vertices and enzymes as edges
- Levels: Genomes, Gene Products, Structure & Function, Functions, Pathways & Physiology

Data: Classes
- Sequence data
    - DNA sequences, protein sequences: NCBI, GenBank, SwissProt, TIGR, sequencing projects
- Data describing networks
    - Metabolic networks (EMP database, KEGG, etc.)
    - Regulatory networks (Sentra, TransFac, etc.)
    - Gene expression data (experimental)
    - Other experimental data
- Dynamic data (experimental and literature)
- Organisms data
    - Phenotypic data
    - Physiological data

General Systems Biology Project Architecture
Stages of analysis:
1. Determine components of the system (assign functions to the genes)
2. Establish relationships between components: reconstruct biological networks (develop a static model)
3. Develop a dynamic model of the system

Step 1: Gene Functions Predictions, Gene Functions Assignments, Identification of Components
- Data sources:
    - Un-annotated genomes
    - Genome annotations from public databases
    - Experimental results
    - Sequence analysis results
- Problems:
    - Genome sequencing and assembly errors
    - Wrong assignment of functions to the genes

Step 2: Biological Networks Reconstructions, Static Models
- Data sources:
    - Metabolic data from public databases (EMP, KEGG, EcoCyc, Brenda, etc.)
    - Regulatory data from public databases (RegulonDB, Sentra, etc.)
    - Experimental results
    - Networks analysis results
- Problems:
    - Wrong/incomplete information about metabolic or regulatory networks
    - Wrong info from step 1

Step 3: Dynamic Models, Phenotypes Predictions, Biological Engineering
- Data sources:
    - Enzymatic and enzyme kinetic data from EMP
    - Experimental results
    - Networks analysis results
- Problems:
    - Wrong info from steps 1 & 2
    - Wrong dynamic data
    - Wrong procedures

Hmm… This microbe does everything wrong…

Data Sources
- Public and private databases (GenBank, SwissProt, EMP, KEGG, etc.)
- Results of data analysis
- Updates and versioning? (data and annotation updates, developed models)

Prediction of Gene Functions
Gene functions are predicted by comparing an unknown sequence with sequences of genes for which the functions are established.
- Seq1: function is alcohol dehydrogenase
- Seq2: function? Alcohol dehydrogenase?

Seq1_Mus.musculus   GSGITKGLGAGANPEVGRNAADEDRDALRAALEGSDMVFIAAGMGGGTGTGAAPVVAE
Seq2_Homo_sapiens   GSGITKGLGAGANPEVGRNSAEEDRDALRAALDGSDMVFIAAGMGGGTGTGAAPVVAE

Example 1: Gene Function Assignments
- Query sequence: function unknown!!!
- Bioinformatics tools: Blast, InterPro, Blocks
- Each tool returns a result: F1, F2, F3
- Knowledge base
- Voting algorithm:
    - F1 with probability P1
    - F2 with probability P2

An Example on Pathways Reconstruction
Enzyme     EC number    Evidence
Enzyme 1   5.1.1.5      present
Enzyme 2   1.13.12.2    not found
Enzyme 3   3.5.1.30     weak evidence for enzyme
Enzyme 4   2.6.1.48     present
Enzyme 5   1.2.1.20     present
- How reliably can we predict this pathway?
- What approach will increase our confidence the most?

Another Problem: Control of Data Flow
- ftp to NCBI, TREMBL for updates on annotated databases (e.g. nr, SwissProt, pdb)
    - Updated DB? (yes / no)
- ftp to NCBI, TIGR, JGI for new and updated genomes
    - New genome?
        - yes: download genome to Chiba City directory
        - no: check genome timestamp
    - Updated genome?
        - yes: download genome to Chiba City directory
        - no: exit

Data Acquisition (How reliable?)
- Example record:
    Organism Name: Corynebacterium_glutamicum
    Version and GI Number: NC_003450.1 GI:19551250
    Definition: Corynebacterium glutamicum, complete genome.
- Create multiple files for each genome containing information that will help the user decide whether to analyze the genome or not
- User interface: Genome Analyzer (which genomes to run through tools)
- Create multiple files for each tool and for each genome selected to be run through Chiba City (or TeraGrid)
- (RunOnChiba) Get information from each generated file to submit to Chiba in parallel

Data Analysis (How reliable?)
- (GetCDSinfo) Parse information from each gbk file of each genome. Output to Oracle databases
- (SubmitToChiba) Submit each genome and each tool to Chiba in parallel (capable of doing all genomes at the same time)
- Chiba or TeraGrid processing the jobs
- Create multiple files for each submitted job to check output, similar to the files created above
- Output generated

Data Storage (How reliable?)
- Example record:
    Organism Name: Corynebacterium_glutamicum
    Sequence Qty: 3456
    Path to fasta file: /nfs/chiba-homes01/……….
    Tool: ChibaBlocks
    Definition: Corynebacterium glutamicum, complete genome.
- Output correct? (yes / no)
- (OracleParsers) Parse output from each tool. Output to Oracle databases
- Oracle DB tables: GENOME, BLOCKS, BLAST, PFAM, CDS, UPDATE, etc.

What can provenance do?
(Shown over the General Systems Biology Project Architecture diagram, with the same data sources and problems for each step.)
- Help plan experiments by suggesting "weak" facts to be tested in a wet lab
- Find "weak" spots in a model
- Prioritize certain steps of model building
- Evaluate data flows
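
As an illustration of how the voting step of Example 1 could carry provenance and surface "weak" facts, here is a minimal Python sketch. It is not the workshop's implementation: the `ToolResult` record, the per-tool weights, the database labels and the 0.8 threshold are all assumptions chosen for illustration, and F1 and F2 are the placeholder function labels used on the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    """One prediction plus the provenance needed to trace it (hypothetical record)."""
    tool: str        # tool that produced the call, e.g. "Blast"
    database: str    # database the tool searched (illustrative label)
    function: str    # predicted function, e.g. "F1"
    weight: float    # assumed trust in this tool, chosen by the analyst


def vote(results):
    """Weighted vote: return {function: (probability, supporting tools)}."""
    if not results:
        return {}
    totals = defaultdict(float)
    support = defaultdict(list)
    for r in results:
        totals[r.function] += r.weight
        support[r.function].append(r.tool)
    grand_total = sum(totals.values())
    return {f: (totals[f] / grand_total, support[f]) for f in totals}


def weak_facts(ranked, threshold=0.8):
    """Functions whose probability is below the (assumed) threshold."""
    return sorted(f for f, (p, _) in ranked.items() if p < threshold)


# Illustrative inputs only: two tools agree on F1, one tool proposes F2.
results = [
    ToolResult("Blast", "nr", "F1", 2.0),
    ToolResult("InterPro", "InterPro", "F1", 1.0),
    ToolResult("Blocks", "Blocks", "F2", 1.0),
]
ranked = vote(results)
for function, (probability, tools) in ranked.items():
    print(function, probability, tools)
print("candidates for wet-lab testing:", weak_facts(ranked))
```

Because every vote keeps the name of the tool and database behind it, a low-probability assignment can be traced to its sources when a database is updated or a tool is rerun.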