Interoperation of Molecular Biology Databases
Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
Menlo Park, CA
[email protected]
Main Message
SRI International
Bioinformatics
• Interoperation of molecular-biology databases is a challenging problem of critical importance
• DOE should initiate a program in interoperation of molecular biology databases
• Pursue both warehouse approach and multidatabase approach
• Major progress possible within 5 years
Motivations
• Important biological problems require access to multiple bioinformatics databases
• Different problems require different sets of databases
• Hundreds of bioinformatics databases exist
  - Nucleic Acids Research 32:2004, Database issue
  - Nucleic Acids Research DB list: http://www3.oup.co.uk/nar/database/a/
• 350 databases listed in 2002
• 560 databases listed in 2004
• Applications of integration include
  - Complex queries
  - Comparison of overlapping sources
  - Data mining
Bioinformatics Databases
• Tremendous progress in point-and-click access for biologist users
• Less progress toward providing a computable, interoperable infrastructure for large-scale data mining
• Every large-scale mining/learning problem requires time-consuming crafting of input/training datasets
Warehouse Approach vs. Multidatabase Approach
• Multidatabase query approaches assume databases are in a queryable DBMS
• Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
• Users want to control data stability
• Users want to control hardware applied to problem
• Internet bandwidth limits query throughput
• Users need to capture, integrate and publish locally produced data of different types
• Replicating and refreshing very large sources is expensive
• Multidatabase and Warehouse approaches complementary
SRI BioWarehouse
Project Goal
• Create a toolkit for constructing bioinformatics database warehouses that integrate sets of bioinformatics databases into one physical DBMS
BioWarehouse Approach
• Warehouse schema defines many bioinformatics datatypes
• Create loaders for public bioinformatics DBs
  - Parse file format for the DB
  - Apply semantic transformations
  - Insert database into warehouse tables
• Oracle and MySQL implementations
• Warehouse query access mechanisms
  - SQL queries via JDBC, Lisp, Perl, ODBC, OAA
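
The loader steps listed on this slide (parse the source file format, apply semantic transformations, insert into warehouse tables) followed by SQL query access might look roughly like the sketch below. This is an illustration only, not BioWarehouse code: it assumes Python with an in-memory SQLite database standing in for Oracle or MySQL, and the flat-file layout, the `protein` table, and all function names are hypothetical.

```python
import sqlite3

# Hypothetical two-record flat file; real sources use richer formats.
SAMPLE_FLATFILE = """\
ID   P1
DE   alcohol dehydrogenase
OS   escherichia coli
//
ID   P2
DE   Pyruvate kinase
OS   Escherichia coli
//
"""

def parse_records(text):
    """Step 1: parse the file format into one dict per record."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):          # record terminator
            if record:
                yield record
            record = {}
        elif line.strip():
            # Two-letter tag, three spaces, then the value.
            record[line[:2]] = line[5:].strip()

def transform(record):
    """Step 2: semantic transformation into warehouse conventions
    (here: rename fields and normalize organism capitalization)."""
    return {
        "accession": record["ID"],
        "name": record["DE"],
        "organism": record["OS"].capitalize(),
    }

def load(conn, rows):
    """Step 3: insert the transformed rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protein ("
        "accession TEXT PRIMARY KEY, name TEXT, organism TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (:accession, :name, :organism)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, [transform(r) for r in parse_records(SAMPLE_FLATFILE)])

# Query access: plain SQL against the warehouse tables.
proteins = conn.execute(
    "SELECT accession, name FROM protein WHERE organism = ? ORDER BY accession",
    ("Escherichia coli",),
).fetchall()
print(proteins)
```

Because the transformation step normalizes the organism name before insertion, one query finds both records even though the source file spelled the organism two different ways.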
Warehouse Schema
• Manages many bioinformatics datatypes simultaneously
  - Pathways, Reactions, Chemicals
  - Proteins, Genes, Replicons
  - Sequences, Sequence Features
  - Organisms, Taxonomic relationships
  - Computations (sequence matches)
  - Citations, Controlled vocabularies
  - Links to external databases
• Each type of warehouse object implemented through one or more relational tables (currently 43)
Warehouse Schema
• Manages multiple datasets simultaneously
  - Dataset = Single version of a database
  - Allows version comparison
  - Multiple software tools or experiments require access to different versions
• Each dataset is a warehouse entity
• Every warehouse object is registered in a dataset
• Different databases storing the same biological datatypes are coerced into same warehouse tables
• Design of most datatypes inspired by multiple databases
• Representational tricks to decrease schema bloat
  - Single space of primary keys
  - Single set of satellite tables such as for synonyms, citations, comments, etc.
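
The design points on this slide (datasets as entities, every object registered in a dataset, a single space of primary keys, one shared satellite table) can be sketched as follows. This is a hedged illustration, not the actual BioWarehouse schema: the table and column names (`warehouse_id`, `dataset`, `protein`, `synonym`) and the source name "ExampleDB" are hypothetical, and SQLite stands in for Oracle or MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One counter supplies primary keys for every table (single key space).
CREATE TABLE warehouse_id (id INTEGER PRIMARY KEY AUTOINCREMENT);
-- Each dataset (one version of one source database) is itself an entity.
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT, version TEXT);
-- Every object row records the dataset it is registered in.
CREATE TABLE protein (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER REFERENCES dataset(id),
    accession TEXT,
    name TEXT
);
-- One satellite table serves all object types: ids are unique
-- warehouse-wide, so object_id alone identifies the owning object.
CREATE TABLE synonym (object_id INTEGER, syn TEXT);
""")

def new_id(conn):
    """Draw the next key from the shared key space."""
    return conn.execute("INSERT INTO warehouse_id DEFAULT VALUES").lastrowid

def add_dataset(conn, name, version):
    wid = new_id(conn)
    conn.execute("INSERT INTO dataset VALUES (?, ?, ?)", (wid, name, version))
    return wid

def add_protein(conn, dataset_id, accession, name, synonyms=()):
    wid = new_id(conn)
    conn.execute(
        "INSERT INTO protein VALUES (?, ?, ?, ?)",
        (wid, dataset_id, accession, name),
    )
    conn.executemany(
        "INSERT INTO synonym VALUES (?, ?)", [(wid, s) for s in synonyms]
    )
    return wid

# Two versions of the same (hypothetical) source coexist in the same tables.
v1 = add_dataset(conn, "ExampleDB", "1")
v2 = add_dataset(conn, "ExampleDB", "2")
add_protein(conn, v1, "P1", "Pyruvate kinase", synonyms=["PK"])
add_protein(conn, v2, "P1", "Pyruvate kinase", synonyms=["PK"])
add_protein(conn, v2, "P2", "Alcohol dehydrogenase")

# Version comparison: accessions present in version 2 but not in version 1.
added = [row[0] for row in conn.execute(
    "SELECT accession FROM protein WHERE dataset_id = ? "
    "AND accession NOT IN "
    "(SELECT accession FROM protein WHERE dataset_id = ?) "
    "ORDER BY accession",
    (v2, v1),
)]
print(added)
```

Drawing every key from one counter is what lets a single `synonym` table serve all object types without a per-type copy; the same pattern would extend to citations and comments.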
Current Databases Supported by BioWarehouse
• BioCyc
  - 15 genomes and metabolic networks
• Swiss-Prot, TrEMBL
  - 1.3M proteins
• ENZYME
• KEGG
• NCBI Taxonomy
• CMR
  - 105 genomes, 250K genes, 250K proteins
• Applications:
  - DARPA BioSpice program on biological simulation
  - Study of sequence coverage of known enzymes
Summary
• Interoperation of molecular-biology databases is a challenging problem of critical importance
• DOE should initiate a program in interoperation of molecular biology databases
• Pursue both warehouse approach and multidatabase approach
• Major progress possible within 5 years