Creation and Maintenance
of GeneKeyDB
Research being conducted by
Kevin Kastner
Under the direction of
Dr. Erich Baker
The Problem
 There exist thousands of biomedical data
sources.
 In 2006, there were ~557 relevant public
resources in molecular biology.
 This number is growing rapidly.
 203 sources in 1999
 226 sources in 2000
 277 sources in 2001.
The Problem
 Traditional database approaches are too
structured.
 Scientific objects change identification over time.
 Gene names change over time.
 The Human Genome Nomenclature Database
(HUGO) contains 13,594 active symbols, 9,635
literature aliases, and 2,739 withdrawn symbols.
 SIR2L1 (withdrawn) is a synonym for SIRT1 and sir2-like 1.
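The renaming problem above can be illustrated with a small lookup that maps withdrawn symbols and aliases to an approved symbol. This is only a sketch: the class `SymbolResolver` and its methods are hypothetical names chosen for illustration, not part of GeneKeyDB, and the only mappings loaded are the SIRT1 synonyms mentioned on the slide.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a GeneKeyDB class): resolves withdrawn symbols
// and aliases to the currently approved gene symbol.
final class SymbolResolver {
    private final Map<String, String> aliasToApproved = new HashMap<>();

    // Register an alias or withdrawn symbol for an approved symbol.
    void addAlias(String alias, String approved) {
        aliasToApproved.put(normalize(alias), approved);
    }

    // Return the approved symbol, or the input unchanged if no alias is known.
    String resolve(String name) {
        return aliasToApproved.getOrDefault(normalize(name), name);
    }

    // Case-insensitive matching: trim whitespace and upper-case the key.
    private static String normalize(String s) {
        return s.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        SymbolResolver resolver = new SymbolResolver();
        resolver.addAlias("SIR2L1", "SIRT1");
        resolver.addAlias("sir2-like 1", "SIRT1");
        System.out.println(resolver.resolve("SIR2L1"));
        System.out.println(resolver.resolve("TP53"));
    }
}
```

A real system would load the alias table from a nomenclature source rather than hard-coding it, and would need a policy for aliases that point to more than one approved symbol.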
Scientific Object Identities
Hugo Name   GDB       GenAtlas  OMIM  GeneCards  LocusLink
TP53        1         33        52    22         13
P53         1 (same)  17        188   69         63
SIRT1       1         0         5     1          2
SIR2L1      0         0         1     1 (same)   1 (same)
The Solution
 GeneKeyDB
 A gene-centered relational database
developed to enhance data mining in
biological data sets.
 GeneKeyDB relies primarily on existing
database identifiers derived from community
databases (NCBI, GO, Ensembl, et al.) as
well as the known relationships among those
identifiers.
 Version 1 is already out!
 http://www.biomedcentral.com/1471-2105/6/72
Weaknesses of Version 1
 Can no longer be updated
 Complex queries must be made to the
database in order to obtain desired
information
Complex Queries
SELECT ll_xp_cdd.cdd_name, ll_np_cdd.cdd_name, organism
FROM ll_xp_cdd, ll_np_cdd, ll_locus
WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score
  AND ll_id IN
    (SELECT ll_id
     FROM ll_refseq_xm
     WHERE ll_refseq_xm_id IN
       (SELECT ll_refseq_xm_id
        FROM ll_xp_cdd, ll_np_cdd
        WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
  AND ll_id IN
    (SELECT ll_id
     FROM ll_refseq_nm
     WHERE ll_refseq_nm_id IN
       (SELECT ll_refseq_nm_id
        FROM ll_xp_cdd, ll_np_cdd
        WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score));
Current Research
 Creation of APIs to validate data in the
database and to make querying much
easier for the user.
 One-step updating of the database and
the information it contains.
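One possible reading of "validate data" is a format check on identifiers before they are stored. The sketch below assumes that reading and checks RefSeq-style accessions with the NM_, XM_, NP_, and XP_ prefixes, which match the table names used in the query slide. The class `AccessionValidator` is a hypothetical name, and the slides do not say which checks the actual API performs.

```java
import java.util.regex.Pattern;

// Hypothetical validator (an assumption, not the GeneKeyDB API): accepts
// RefSeq-style accessions such as "NM_000546" or "NM_000546.6".
final class AccessionValidator {
    // Prefix NM/XM/NP/XP, underscore, digits, optional ".version" suffix.
    private static final Pattern REFSEQ =
            Pattern.compile("^(NM|XM|NP|XP)_\\d+(\\.\\d+)?$");

    // Return true only for non-null strings that match the pattern exactly.
    static boolean isValid(String accession) {
        return accession != null && REFSEQ.matcher(accession).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("NM_000546.6"));
        System.out.println(isValid("TP53"));
    }
}
```

A format check of this kind only catches malformed identifiers. Confirming that an accession still exists in the source database would require a lookup against that database.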
API Alternative
// fxn(search_params, desired_info), returns ll_id
curated.cdd(score[ ],null)
curated_score[ ] ← score[ ]
locus_id1[ ] ← gaa.cdd((name[ ],score[ ]), score[ ])
gaa_name[ ] ← name[ ]
gaa_score[ ] ← score[ ]
locus_id2[ ] ← curated.cdd(name[ ],score[ ])
curated_name[ ] ← name[ ]
locus_id[ ] ← intersect(locus_id1[ ],locus_id2[ ])
locus(organism[ ], locus_id[ ])
print(gaa_name[ ], curated_name[ ], organism[ ])
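The `intersect` step in the pseudocode above could look roughly like the following in Java. This is a sketch under assumptions: the class `LocusIntersect` is a hypothetical name, locus IDs are assumed to be integers, and the IDs in `main` are made-up placeholders rather than real GeneKeyDB results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not the GeneKeyDB API): intersects two result sets
// of locus IDs returned by separate lookups.
final class LocusIntersect {
    // Return the IDs present in both inputs, without duplicates,
    // in the order they first appear in the first input.
    static List<Integer> intersect(Collection<Integer> first,
                                   Collection<Integer> second) {
        Set<Integer> result = new LinkedHashSet<>(first);
        result.retainAll(new HashSet<>(second));
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        // Placeholder IDs standing in for locus_id1[ ] and locus_id2[ ].
        List<Integer> locusId1 = Arrays.asList(101, 102, 103, 102);
        List<Integer> locusId2 = Arrays.asList(103, 102, 999);
        System.out.println(intersect(locusId1, locusId2));
    }
}
```

Doing the intersection in application code replaces the two nested `IN` subqueries of the SQL version, at the cost of transferring both ID lists out of the database first.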
External Implementations
 Some databases have APIs as well.
 Ensembl
 APIs are done in Perl.
 APIs for GeneKeyDB will be done in Java.
 More structured language.
 Easier to read.
The Future of GeneKeyDB
 GeneKeyDB will join even more external
and widely used databases together.
 Code for updating GeneKeyDB will tie into
database information that will change in
expected ways.
 Lowers the required number of code rewrites.
 GeneKeyDB will be dynamically updated.
The Future of GeneKeyDB
 APIs will also be written in Perl.
 Perl is used often, almost exclusively, by
biologists.
 Can have Perl APIs tie into Java APIs, rather
than creating all new ones.
Comments? Questions?
 http://genereg.ornl.gov/gkdb/