Overview of Genome Databases
Peter D. Karp, Ph.D.
SRI International
[email protected]
www-db.stanford.edu/dbseminar/seminar.html

Talk Overview
- Definition of bioinformatics
- Motivations for genome databases
- Issues in building genome databases

Definition of Bioinformatics
- Computational techniques for management and analysis of biological data and knowledge
  - Methods for disseminating, archiving, interpreting, and mining scientific information
- Computational theories of biology
- Genome Databases is a subfield of bioinformatics

Motivations for Bioinformatics
- Growth in molecular-biology knowledge (literature)
- Genomics
  1. Study of genomes through DNA sequencing
  2. Industrial Biology

Example Genomics Datatypes
- Genome sequences
  - DOE Joint Genome Institute
    - 511M bases in Dec 2001
    - 11.97G bases since Mar 1999
- Gene and protein expression data
- Protein-protein interaction data
- Protein 3-D structures

Genome Databases
- Experimental data
  - Archive experimental datasets
  - Retrieving past experimental results should be faster than repeating the experiment
  - Capture alternative analyses
  - Lots of data, simpler semantics
- Computational symbolic theories
  - Complex theories become too large to be grasped by a single mind
  - The database is the theory
  - Biology is very much concerned with qualitative relationships
  - Less data, more complex semantics

Bioinformatics
- Distinct intellectual field at the intersection of CS and molecular biology
  - Distinct field because researchers in the field must know CS, biology, and bioinformatics
- Spectrum from CS research to biology service
- Rich source of challenging CS problems
- Large, noisy, complex data-sets and knowledge-sets
- Biologists and funding agencies demand working solutions

Bioinformatics Research
- algorithms + data structures = programs
- algorithms + databases = discoveries
- Combine sophisticated algorithms with the right content:
  - Properly structured
  - Carefully curated
  - Relevant data fields
  - Proper amount of data

Reference on Major Genome Databases
- Nucleic Acids Research Database Issue
- http://nar.oupjournals.org/content/vol30/issue1/
- 112 databases

Questions to Ask of a New Genome Database

What are Database Goals and Requirements?
- What problems will database be used to solve?
- Who are the users and what is their expertise?

What is its Organizing Principle?
- Different DBs partition the space of genome information in different dimensions
  - Experimental methods (Genbank, PDB)
  - Organism (EcoCyc, Flybase)

What is its Level of Interpretation?
- Laboratory data
- Primary literature (Genbank)
- Review (SwissProt, MetaCyc)
- Does DB model disagreement?

What are its Semantics and Content?
- What entities and relationships does it model?
- How does its content overlap with similar DBs?
- How many entities of each type are present?
- Sparseness of attributes and statistics on attribute values

What are Sources of its Data?
- Potential information sources
  - Laboratory instruments
  - Scientific literature
    - Manual entry
    - Natural-language text mining
  - Direct submission from the scientific community
    - Genbank
- Modification policy
  - DB staff only
  - Submission of new entries by scientific community
  - Update access by scientific community

What DBMS is Employed?
- None
- Relational
- Object oriented
- Frame knowledge representation system

Distribution / User Access
- Multiple distribution forms enhance access
- Browsing access with visualization tools
- API
- Portability

What Validation Approaches are Employed?
- None
- Declarative consistency constraints
- Programmatic consistency checking
- Internal vs external consistency checking
- What types of systematic errors might DB contain?
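
The difference between declarative consistency constraints and programmatic consistency checking can be made concrete with a small sketch. This is an illustration under stated assumptions, not part of the talk: the `gene` and `protein` tables, their columns, and the sample rows are hypothetical, and SQLite (via Python's standard `sqlite3` module) stands in for whatever DBMS a real genome database would use. Declarative constraints (`CHECK`, `REFERENCES`) are enforced by the DBMS on every insert; the programmatic check is a query run on demand that reports suspicious rows.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database whose schema carries declarative constraints."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- Hypothetical tables, for illustration only.
        CREATE TABLE gene (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            start_pos INTEGER NOT NULL CHECK (start_pos >= 1),
            end_pos   INTEGER NOT NULL,
            CHECK (end_pos >= start_pos)
        );
        CREATE TABLE protein (
            id      INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            gene_id INTEGER NOT NULL REFERENCES gene(id)
        );
        """
    )
    return conn


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


def genes_without_protein(conn: sqlite3.Connection) -> List[str]:
    """Programmatic check: list genes that no protein row refers to."""
    rows = conn.execute(
        """
        SELECT g.name
        FROM gene g
        LEFT JOIN protein p ON p.gene_id = g.id
        WHERE p.id IS NULL
        ORDER BY g.name
        """
    ).fetchall()
    return [name for (name,) in rows]
```

In this sketch a gene whose end coordinate precedes its start, or a protein pointing at a nonexistent gene, is rejected at insert time, while a gene with no protein is allowed into the database and only surfaces when `genes_without_protein` is run.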
Database Documentation
- Schema and its semantics
- Format
- API
- Data acquisition techniques
- Validation techniques
- Size of different classes
- Coverage of subject matter
- Sparseness of attributes
- Error rates
- Update frequency

Relationship of Database Field to Bioinformatics
- Scientists generally unaware of basic DB principles
  - Complex queries vs click-at-a-time access
  - Data model
  - Defined semantics for DB fields
  - Controlled vocabularies
  - Regular syntax for flatfiles
  - Automated consistency checking
- Most biologists take one programming class
- Evolution of typical genome database
- Finer points of DB research off their radar screen
- Handful of DB researchers work in bioinformatics

Database Field
- For many years, the majority of bioinformatics DBs did not employ a DBMS
  - Flatfiles were the rule
  - Scientists want to see the data directly
  - Commercial DBMSs too expensive, too complex
  - DBAs too expensive
- Most scientists do not understand
  - Differences between BA, MS, PhD in CS
  - CS research vs applications
  - Implications for project planning, funding, bioinformatics research

Recommendation
- Teaching scientists programming is not enough
- Teaching scientists how to build a DBMS is irrelevant
- Teach scientists basic aspects of databases and symbolic computing
  - Database requirements analysis
  - Data models, schema design
  - Knowledge representation, ontologies
  - Formal grammars
  - Complex queries
  - Database interoperability

BioSPICE Bioinformatics Database Warehouse
Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal Sonmez
SRI International
http://www.BioSPICE.org/

Project Goal
- Create a toolkit for constructing bioinformatics database warehouses that collect together a set of bioinformatics databases into one physical DBMS

Motivations
- Important bioinformatics problems require access to multiple bioinformatics databases
- Hundreds of bioinformatics databases exist
  - Nucleic Acids Research 30(1) 2002 – DB issue
  - Nucleic Acids Research DB list: 350 DBs at http://www3.oup.co.uk/nar/database/a/
- Different problems require different sets of databases

Motivations
- Combining multiple databases allows for data verification and complementation
- Simulation problems require access to data on pathways, enzymes, reactions, genetic regulation

Why is the Multidatabase Approach Not Sufficient?
- Multidatabase query approaches assume databases are in a DBMS
- Internet bandwidth limits query throughput
- Most sites that do operate DBMSs do not allow remote SQL access because of security and loading concerns
- Control data stability
- Need to capture, integrate and publish locally produced data of different types
- Multidatabase and Warehouse approaches complementary

Scenario 1
- BioSPICE scientist wants to model multiple metabolic pathways in a given organism
  - Enumerate pathways and reactions
  - What enzymes catalyze each reaction?
  - What genes code for each enzyme?
  - What control regions regulate each gene?

Approach
- Oracle and MySQL implementations
- Warehouse schema defines many bioinformatics datatypes
- Create loaders for public bioinformatics DBs
  - Parse file format for the DB
  - Semantic transformations
  - Insert database into warehouse tables
- Warehouse query access mechanisms
  - SQL queries via Perl, ODBC, OAA

Example: Swiss-Prot DB
- Version 40.0 describes 101K proteins in a 320MB file
- Each protein described as one block of records (an entry) in a large text file
- Loader tool parses file one entry at a time
- Creates new entries in a set of warehouse tables

Warehouse Schema
- Manages many bioinformatics datatypes simultaneously
  - Pathways, Reactions, Chemicals
  - Proteins, Genes, Replicons
  - Citations, Organisms
  - Links to external databases
- Each type of warehouse object implemented through one or more relational tables (currently 43)

Warehouse Schema
- Databases on our wish list:
  - Genbank (nucleotide sequences)
  - Protein expression database
  - Protein-protein interactions database
  - Gene expression database
  - NCBI Taxonomy database
  - Gene Ontology
  - CMR
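
Once everything sits in one physical DBMS, the first three Scenario 1 questions (which reactions belong to a pathway, which enzymes catalyze them, which genes code for those enzymes) reduce to a single SQL join. The sketch below is only an illustration: the table and column names are hypothetical stand-ins, not the actual warehouse schema (which the slides do not show in full), the rows are placeholder data, and SQLite replaces the Oracle and MySQL implementations named in the talk. The control-region question is left out because no table for it is described here.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for a few warehouse tables.
SCHEMA = """
CREATE TABLE pathway  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reaction (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE protein  (id INTEGER PRIMARY KEY, name TEXT,
                       gene_id INTEGER REFERENCES gene(id));
CREATE TABLE pathway_reaction   (pathway_id INTEGER, reaction_id INTEGER);
CREATE TABLE enzymatic_reaction (reaction_id INTEGER, protein_id INTEGER);
"""


def build_toy_warehouse() -> sqlite3.Connection:
    """Load placeholder rows; every name here is invented for the example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO pathway VALUES (?, ?)",
                     [(1, "pathway-A"), (2, "pathway-B")])
    conn.executemany("INSERT INTO reaction VALUES (?, ?)",
                     [(1, "R1"), (2, "R2"), (3, "R3")])
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [(1, "g1"), (2, "g2"), (3, "g3")])
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)",
                     [(1, "E1", 1), (2, "E2", 2), (3, "E3", 3)])
    conn.executemany("INSERT INTO pathway_reaction VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 3)])
    conn.executemany("INSERT INTO enzymatic_reaction VALUES (?, ?)",
                     [(1, 1), (2, 2), (3, 3)])
    conn.commit()
    return conn


def enzymes_and_genes(conn: sqlite3.Connection, pathway_name: str) -> list:
    """Return (reaction, enzyme, gene) triples for one pathway."""
    return conn.execute(
        """
        SELECT r.name, p.name, g.name
        FROM pathway pw
        JOIN pathway_reaction   pr ON pr.pathway_id = pw.id
        JOIN reaction           r  ON r.id = pr.reaction_id
        JOIN enzymatic_reaction er ON er.reaction_id = r.id
        JOIN protein            p  ON p.id = er.protein_id
        JOIN gene               g  ON g.id = p.gene_id
        WHERE pw.name = ?
        ORDER BY r.name, p.name
        """,
        (pathway_name,),
    ).fetchall()
```

Calling `enzymes_and_genes(build_toy_warehouse(), "pathway-A")` walks pathway to reaction to enzyme to gene in one local query, which is the kind of access that remote, per-database querying makes slow or impossible.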
Warehouse Schema
- Manages multiple datasets simultaneously
  - Dataset = Single version of a database
- Support alternative measurements and viewpoints
- Version comparison
- Multiple software tools or experiments that require access to different versions
- Each dataset is a warehouse entity
- Every warehouse object is registered in a dataset

Warehouse Schema
- Different databases storing the same biological types are coerced into same warehouse tables
- Design of most datatypes inspired by multiple databases
- Representational tricks to decrease schema bloat
  - Single space of primary keys
  - Single set of satellite tables such as for synonyms, citations, comments, etc.

Warehouse Schema Examples
- Protein data from Swiss-Prot, TrEMBL, KEGG, and EcoCyc all loaded into same relational tables
- Pathway data from MetaCyc and KEGG are loaded into the same relational tables

Example: Swiss-Prot DB

    ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
    AC   P23599;
    DT   01-NOV-1991 (Rel. 20, Created)
    DT   01-NOV-1991 (Rel. 20, Last sequence update)
    DT   15-DEC-1998 (Rel. 37, Last annotation update)
    DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14)
    DE   (ACC SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
    GN   ACS1 OR ACCW.
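
To show what "parsing one entry" involves, here is a minimal hand-written sketch that groups the lines of the entry above by their two-letter line code. It is not the project's loader: the function names are invented for this example, and it assumes the usual Swiss-Prot flat-file layout (line code in the first two columns, content from column six, `//` terminating an entry). The position of the break between the two DE lines in the sample is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

SAMPLE_ENTRY = """\
ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14)
DE   (ACC SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.
//
"""


def parse_entry(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group the content of each line under its two-letter line code."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if line.startswith("//"):  # end-of-entry marker
            break
        if len(line) < 2:
            continue  # skip blank lines
        code, content = line[:2], line[5:].strip()
        fields[code].append(content)
    return dict(fields)


def summarize(fields: Dict[str, List[str]]) -> dict:
    """Pull out a few values a loader might write to warehouse tables."""
    id_tokens = fields["ID"][0].split()
    accessions = [
        acc.strip()
        for line in fields.get("AC", [])
        for acc in line.split(";")
        if acc.strip()
    ]
    return {
        "entry_name": id_tokens[0],
        "length": int(id_tokens[-2]),  # token before the trailing "AA."
        "accessions": accessions,
        "description": " ".join(fields.get("DE", [])),
    }


record = summarize(parse_entry(SAMPLE_ENTRY.splitlines()))
```

A loader would repeat this per entry while streaming through the file, then apply its semantic transformations before inserting rows.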
How Swiss-Prot is Loaded into The Warehouse
- Register Swiss-Prot in Datasets table
- Create entry in Entry and Protein tables for each Swiss-Prot protein
- Satellite tables store
  - Protein synonyms, citations, comments, accession numbers, organism, sequence features, subunits/complexes, DB links

Protein Table

    CREATE TABLE Protein (
      WID                 NUMBER,         --The warehouse ID of this protein
      Name                VARCHAR2(500),  --Common name of the protein
      AASequence          VARCHAR2(4000), --Amino-acid sequence for this prote
      Charge              NUMBER,         --Charge of the chemical
      Fragment            CHAR(1),        --Is this protein a fragment or not,
      MolecularWeightCalc NUMBER,         --Molecular weight calculated from s
      MolecularWeightExp  NUMBER,         --Molecular Weight determined throug
      PICalc              VARCHAR2(50),   --pI calculated from its sequence.
      PIExp               VARCHAR2(50),   --pI value determined through experi
      DataSetWID          NUMBER          --Reference to the data set from whi
    );

Database Loaders
- Loader tool defined for each DB to be loaded into Warehouse
- Example loaders available in several languages
- Loaders
  - KEGG (C)
  - BioCyc collection of 15 pathway DBs (C)
  - Swiss-Prot (Java)
  - ENZYME (Java)

Terminology
- Model Organism Database (MOD) – DB describing genome and other information about an organism
- Pathway/Genome Database (PGDB) – MOD that combines information about
  - Pathways, reactions, substrates
  - Enzymes, transporters
  - Genes, replicons
  - Transcription factors, promoters, operons, DNA binding sites
- BioCyc – Collection of 15 PGDBs at BioCyc.org
  - EcoCyc, AgroCyc, YeastCyc

Loader Architecture
[Diagram. Box labels: Swiss-Prot Datafile; Grammar for Swiss-Prot; ANTLR Parser Generator; Parser for SwissProt; SQL Insert Commands; Oracle Loadable File]

Current Warehouse Contents

                          KEGG   ENZYME   SwissProt   BsubCyc   Warehouse Total
    Chemicals            7,284    2,952           0       576            10,812
    Genes                5,714        0      88,605     4,221            98,540
    Organisms               60        0     103,807         1           103,868
    Proteins             3,829    3,870     101,602     4,150           113,451
    Enzymatic Reactions  3,509        0           0       717             4,226
    Pathways             4,517        0           0       138             4,655
    Pathway Reactions   36,271        0           0       530            36,801

Example Warehouse Uses
- Check completeness of data sources
- Count reactions in ENZYME database with (and without) associated protein sequences in SWISS-PROT database:
  - 3870 reactions in ENZYME
  - 1662 reactions (43%) with a sequence in SWISS-PROT
  - 2208 reactions (57%) without a sequence in SWISS-PROT
- Count number of distinct non-partial EC numbers in SWISS-PROT:
  - 1554 distinct EC numbers in SWISS-PROT (non-partial)
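
The completeness check amounts to a set comparison between the EC numbers listed in ENZYME and the EC numbers annotated on Swiss-Prot proteins. The sketch below shows one way to express it, under assumptions: the function names and the toy EC lists are invented for illustration, and an EC number is treated as partial when any of its four fields is a dash (for example `3.4.-.-`). In the warehouse itself the same comparison would presumably be run as a SQL query over the loaded tables. The `percent` helper also reproduces the rounding of the figures quoted above.

```python
from typing import Iterable, Tuple


def is_partial(ec_number: str) -> bool:
    """An EC number such as '3.4.-.-' is only partially specified."""
    return "-" in ec_number.split(".")


def coverage(enzyme_ecs: Iterable[str],
             swissprot_ecs: Iterable[str]) -> Tuple[int, int]:
    """Count ENZYME reactions with and without a sequence-bearing protein."""
    enzyme = set(enzyme_ecs)
    # Distinct, fully specified EC numbers annotated on proteins.
    annotated = {ec for ec in swissprot_ecs if not is_partial(ec)}
    with_sequence = enzyme & annotated
    without_sequence = enzyme - annotated
    return len(with_sequence), len(without_sequence)


def percent(part: int, total: int) -> int:
    """Share of the total, rounded to a whole percent."""
    return round(100 * part / total)


# Toy input (hypothetical selection of EC numbers, duplicates included).
toy_enzyme = ["1.1.1.1", "4.4.1.14", "2.7.1.1"]
toy_swissprot = ["4.4.1.14", "3.4.-.-", "1.1.1.1", "1.1.1.1"]
with_seq, without_seq = coverage(toy_enzyme, toy_swissprot)

# Figures reported on the slide: 1662 + 2208 = 3870 reactions.
slide_with, slide_without, slide_total = 1662, 2208, 3870
slide_percentages = (percent(slide_with, slide_total),
                     percent(slide_without, slide_total))
```

With the slide's counts, `slide_percentages` comes out as 43 and 57, matching the percentages reported above.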