BODHI,
A Bio-diversity Database Pla(n)tform
Jayant Haritsa
Database Systems Lab
Supercomputer Education and Research Centre
Indian Institute of Science
BODHI
1
Team
• B. J. Srikanta (next talk)
• Prof. Madhav Gadgil, Prof. V. Nanjundiah (Centre for Ecological Sciences, IISc)
• Several Masters students
• Funded by DBT
Motivation
• GATT – Patent Laws
    - To be in place by 2005
• Loss
    - Neem
    - Basmati (estimated export value: Rs. 1,198 crore)
    - Turmeric
• Global and local efforts
    - GBIF (Global Biodiversity Information Facility)
    - Karnataka Bio-diversity Board [Deccan Herald - Aug 26 2000]
Bio-diversity Data
• Taxonomy of species
    - Phenetic (physical) characteristics
    - Phylogenetic (evolutionary) characteristics
• Habitat / Spatial distribution
    - Political Layout
    - Geographic Layout
    - Biospheres
• Genetic information
    - Bio-molecular sequences
    - Structural information
MULTI-DOMAIN QUERY
• Retrieve all plant species that share a common habitat, have identical inflorescence characteristics, and have a DNA sequence within a BLAST score of 80, with respect to “Michelia-champa”.
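A minimal Python sketch of how such a query combines three domains in a single predicate. Everything in it is an assumption made for illustration: the `Species` record layout, the sample records, and `similarity_score`, which is a toy percent-identity measure standing in for a real BLAST score. "Within BLAST score of 80" is read here as "score of at least 80".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Species:
    """Hypothetical record mixing spatial, taxonomic and genomic attributes."""
    name: str
    habitats: frozenset   # habitat identifiers (made-up labels)
    inflorescence: str    # inflorescence type (made-up labels)
    dna: str              # DNA sequence fragment (made-up data)


def similarity_score(a: str, b: str) -> int:
    """Toy stand-in for a BLAST score: percentage of matching positions."""
    if not a or not b:
        return 0
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches // max(len(a), len(b))


def multi_domain_query(db, reference_name, min_score=80):
    """Species sharing a habitat, inflorescence type and similar DNA with the reference."""
    ref = next(s for s in db if s.name == reference_name)
    return [
        s.name
        for s in db
        if s.name != reference_name
        and s.habitats & ref.habitats                        # spatial condition
        and s.inflorescence == ref.inflorescence             # taxonomic condition
        and similarity_score(s.dna, ref.dna) >= min_score    # genomic condition
    ]


# Made-up sample data, for illustration only.
DB = [
    Species("Michelia-champa", frozenset({"western-ghats"}), "solitary", "ACGTACGTAC"),
    Species("species-a", frozenset({"western-ghats", "nilgiris"}), "solitary", "ACGTACGTAA"),
    Species("species-b", frozenset({"deccan"}), "solitary", "ACGTACGTAC"),
    Species("species-c", frozenset({"western-ghats"}), "raceme", "ACGTACGTAC"),
    Species("species-d", frozenset({"western-ghats"}), "solitary", "TTTTTTTTTT"),
]

RESULT = multi_domain_query(DB, "Michelia-champa")
```

With these records only `species-a` passes all three conditions; each of the other records fails exactly one of them, which is the point of a multi-domain query.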
Difficulties:
• Complex range of data types
    - sets, hierarchies, aggregations, sequences, geometries, maps, audio, images …
• Multidimensional data
    - spatial (latitude, longitude, elevation) to proteins (hundreds of coordinates)
• Computationally-intensive operators
    - species relationships, spatial distributions, sequence alignments, ...
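To give a sense of why sequence alignment counts as a computationally intensive operator, here is a sketch of a Smith-Waterman style local alignment score with a linear gap penalty. It is the textbook dynamic program, not BODHI's operator, and the scoring parameters are arbitrary assumptions. The cost grows with the product of the two sequence lengths, and a database query would repeat it for every stored sequence.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, `local_alignment_score("TTACGTTT", "GGACGGG")` rewards only the shared `ACG` stretch and ignores the unrelated flanks.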
Current Solutions
• Small-Scale
    - MS-Access / FoxPro / Excel / ...
    - Pentium PCs
• Large-Scale
    - RDBMS: Oracle / DB2 / Informix / Sybase / …
    - Unix servers: Sun / SGI / IBM / HP / ...
Limitations:
• RDBMS approach of “the world is a flat collection of tables with simple attributes” suits financial applications, NOT scientific (biological) applications
• In particular, taxonomic / spatial / sequence / multimedia data modeling and processing are very cumbersome and coarse
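As a rough illustration of the modeling gap, the sketch below represents a taxonomic hierarchy as linked objects, so that "lineage of a species" and "all species under an order" are plain traversals. In a flat table holding one row per taxon with a parent-id column, the same questions typically need a self-join per level or a recursive query. The `Taxon` class and its methods are hypothetical and are not BODHI's data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Taxon:
    """Hypothetical node in a taxonomic hierarchy."""
    rank: str
    name: str
    parent: Optional["Taxon"] = None
    children: List["Taxon"] = field(default_factory=list)

    def add(self, rank, name):
        """Create a child taxon and link it in both directions."""
        child = Taxon(rank, name, parent=self)
        self.children.append(child)
        return child

    def lineage(self):
        """Names from the root of the hierarchy down to this taxon."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path[::-1]

    def descendants(self, rank):
        """Names of all taxa of the given rank below this one."""
        found = []
        for child in self.children:
            if child.rank == rank:
                found.append(child.name)
            found.extend(child.descendants(rank))
        return found


# Small sample hierarchy for illustration.
order = Taxon("order", "Magnoliales")
family = order.add("family", "Magnoliaceae")
genus = family.add("genus", "Michelia")
species = genus.add("species", "Michelia champaca")

LINEAGE = species.lineage()
SPECIES_IN_ORDER = order.descendants("species")
```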
Limitations (contd)
• Spatial and other applications are not within the database kernel but are connected externally. E.g., many GIS systems have ArcInfo and MS-Access hooked up in a “black-box” manner, or run BLAST/FASTA on sequence files generated from Oracle.
• Problem: Slow and ugly!
Is there Hope?
• Object-Oriented DBMS
    - “Natural” for biological applications
• High-performance data access methods
    - Path Dictionary Index, Multi-key Type Index, Pyramid Tree, ...
• High-performance specialized operators
    - spatial join, data mining, sequence processing, …
• XML = HTML + Semantics
Goals of BODHI
• Seamless integration of taxonomic, spatial and genomic data using OO technology
• Latest access methods and operators for all three types of data
• Utilize XML for data exchange
• Low-cost (ideally, free!)
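A small sketch of what XML-based data exchange could look like, using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`species`, `family`, `habitats`, `habitat`) are invented for this example; the slides do not specify BODHI's exchange schema.

```python
import xml.etree.ElementTree as ET


def species_to_xml(name, family, habitats):
    """Serialize one species record to an XML string (hypothetical schema)."""
    root = ET.Element("species", {"name": name})
    ET.SubElement(root, "family").text = family
    habitats_el = ET.SubElement(root, "habitats")
    for habitat in habitats:
        ET.SubElement(habitats_el, "habitat").text = habitat
    return ET.tostring(root, encoding="unicode")


def species_from_xml(text):
    """Parse a record produced by species_to_xml back into a dict."""
    root = ET.fromstring(text)
    return {
        "name": root.get("name"),
        "family": root.findtext("family"),
        "habitats": [h.text for h in root.iter("habitat")],
    }


# Made-up habitat labels, for illustration only.
XML_TEXT = species_to_xml("Michelia champaca", "Magnoliaceae", ["forest", "garden"])
RECORD = species_from_xml(XML_TEXT)
```

Because the tags carry the meaning of each field, a receiving system can rebuild the record without knowing how the sender stores it.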
Architecture of BODHI
The Internet
Client Interface Framework
Query Processor
Operations: Spatial | Object | Genome
Indexes: Spatial | Object | Genome
Models: Spatial | Taxonomy | Genome
Services: Spatial | Object | Sequence
OBJECT STORAGE MANAGER
Implementation of BODHI
The Internet
Client Interface Framework
Query Processor: λ-DB
Operations: Spatial (Overlaps, Contains, Closest, Within) | Object (Inheritance, Aggregation) | Genome (Alignment, BLAST, FASTA)
Indexes: Spatial (R*-tree, Hilbert R-tree) | Object (Multi-Key Type, Path-Dictionary) | Genome (??? Indexes, next talk)
Models: Spatial (Country, State, City, River, Road) | Taxonomy (Species, Genera, Family, Order) | Genome (DNA, Protein)
Services: Spatial | Object | Sequence
Basic Types (Point, Line, Polygon, Sets, Sequences, ...)
SHORE MICRO-KERNEL
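The spatial operator names on this slide can be illustrated on axis-aligned rectangles, the kind of bounding boxes that R-tree family indexes store. This is a simplified sketch, not BODHI's implementation: the `Rect` class, the coordinates, and the exact semantics chosen for each predicate are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box (hypothetical stand-in for a spatial object)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def overlaps(self, other):
        """True if the two rectangles share at least one point."""
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)

    def contains(self, other):
        """True if other lies entirely inside this rectangle."""
        return (self.xmin <= other.xmin and other.xmax <= self.xmax
                and self.ymin <= other.ymin and other.ymax <= self.ymax)

    def within(self, other):
        """True if this rectangle lies entirely inside other."""
        return other.contains(self)

    def distance(self, other):
        """Smallest Euclidean distance between the rectangles (0 if they overlap)."""
        dx = max(other.xmin - self.xmax, self.xmin - other.xmax, 0.0)
        dy = max(other.ymin - self.ymax, self.ymin - other.ymax, 0.0)
        return math.hypot(dx, dy)


def closest(target, candidates):
    """Name of the candidate rectangle nearest to target."""
    return min(candidates, key=lambda name: target.distance(candidates[name]))


# Made-up coordinates, for illustration only.
state = Rect(0, 0, 10, 10)
city = Rect(2, 2, 3, 3)
river = Rect(8, 5, 15, 6)
road = Rect(20, 20, 30, 21)

NEAREST_TO_CITY = closest(city, {"river": river, "road": road})
```

An index such as an R*-tree evaluates predicates like these on bounding boxes first, so that the exact geometry is checked only for the few candidates that survive.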
Query Flow
Project Status
• Prototype (minus Client Interface Framework) has been operational since last month!
• Platform: PIII-700MHz running Redhat Linux.
• For code, contact “[email protected]”
Performance Evaluation
• SEQUOIA 2000 spatial benchmark: competitive with Paradise GIS from Wisconsin
• Taxonomy + spatial queries: reasonably fast
• But genomics slows things down a lot due to absence of indexes (next talk)
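Without an index, every sequence query has to be compared against every stored sequence. One general way to cut that down, shown here purely as an illustration, is a k-mer inverted index that shortlists sequences sharing short substrings with the query before any expensive alignment is run. The slides defer BODHI's own genome indexes to the next talk, so nothing below should be read as that design; the function names and parameters are assumptions.

```python
from collections import defaultdict


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_index(sequences, k=3):
    """Map each k-mer to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, seq in sequences.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def candidates(index, query, k=3, min_shared=2):
    """Names of sequences sharing at least min_shared distinct k-mers with query."""
    counts = defaultdict(int)
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            counts[name] += 1
    return sorted(name for name, count in counts.items() if count >= min_shared)


# Made-up sequences, for illustration only.
SEQUENCES = {"s1": "ACGTACGT", "s2": "TTTTGGGG", "s3": "CCACGTAA"}
INDEX = build_kmer_index(SEQUENCES)
SHORTLIST = candidates(INDEX, "ACGTA")
```

Here only `s1` and `s3` would go on to a full alignment; `s2` shares no 3-mer with the query and is skipped.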
More details
• “Design and Implementation of a Biodiversity Information System”, Proc. of Intl. Conf. on Management of Data (COMAD), Pune, December 2000
• “The Building of BODHI, A Bio-diversity Database System”, TechRep-2001-02, DSL/SERC, IISc
• Available at http://dsl.serc.iisc.ernet.in
End of Talk