Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The MoBIoS Project
Molecular Biological Information System
Daniel P. Miranker
Dept. of Computer Sciences &
Center for Computational Biology
and Bioinformatics
University of Texas
Weijia Xu, Rui Mao, Will
Briggs, Smriti
Ramakrishnan, Shu Wang,
Lulu Zhang
Problem:
In Life Sciencses, database management
systems (DBMS) serve as glorified file
managers.
Little use of sophisticated data and pattern-based
retrieval
Real scientific and technological problems
When biological data is put in to an
RDBMS
• Primary data is stored in text or blob fields
– Annotations may be relational
Organism
Function
Sequence
Yeast
membrane
AACCGGTTT
Yeast
mitosis
TATCGAAA
E. Coli
membrane
AGGCCTA
• Data retrieval
– Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST,
Linear Data Scans, O(n),
Endemic in Life Sciences
Sequences:
Mass Spectra
proteomics
Small Molecules & Protein Structure
DNA, RNA, Protein databases
Protein interaction
Rational drug design
Pathways (graphs)
Phylogenies (graphs, trees in particular)
Scope: To Find Common Ground Both
Biology and DBMS’ Have to Move
DBMS
Biological
Information
System
Metric-Space Database as the Common Ground
Metric Space is
a pair, M=(D,d),
where
D is a set of points
d is [metric] distance function with the following
properties:
d(x,y) = d (y,x)
d(x, y) > 0, d(x,x) = 0
d(x,z) <= d(x,y) + d(y,z)
(symmetry)
(non negativity)
(triangle inequality)
x
y
z
Definition - By Analogy
A Metric-Space Database
Management System
Extend Relational DBMS
Special indexes for metricspaces
New data types
Biological information
system
A Spatial Database
Management System:
Life science data types
Extend relational DBMS
Special indexes for 2D and
3D data; k-d and R-trees
New data types
Geographic information
systems
Topographic maps
Buildings and the like
Develop index structures to support
distance & nearest-neighbor queries
• Well studied in main-memory
– But by no means a closed problem
• In databases (external/disk based methods)
– Embryonic
– Many myths
• Often assumed to be the basis of multimedia
database systems
How to build a metric-space
index
• Three algorithmic classes [Tasan, Ozsoyoglu 04]
– Vantage points
– Hyperplanes
– Bounding spheres
Vantage Point Method [Burkhard&Keller73]
Vantage Point Method
Choose a point,VP
And a radius, R
Vantage Point Method
• Given VP, R
The predicates
Choose a point,VP
• d(VP,x) < R
• d(VP,x) R
Divide the set into
two equal halves
• apply recursively
And a radius,R
Query, q, range r
r
q
Query, q, range r
if
• d(q,VP) > R + r
then
• all neighbors are
outside the sphere
VP
R
r
q
Multi-vantage point method
Multi-vantage point method
• Consider d(VPi, x) a
projection onto an axis
• Looks like a k-d tree
– Choose number k & d
Myths
• Solved problem; M-trees [Ciaccia et.al. 96, 97]
– I can’t get them to work on anything but their original
synthetic data generator
• Good choice for vantage points is to find
corners[Yianilos93] (farthest-first clustering)
– Might be true for euclidean spaces
– Early result, not true for our data
• High dimensional indexing always asymptotically
reduces to linear scans.
– Formal result based on an assumption of uniform data
distributions.
Comparison of Three Methods of Metric-Space Indexing
#dist. cal.: RBT VS. GHT VS. MVPT
18000
RBT
GHT
MVPT
16000
#dist cal.
14000
12000
10000
8000
6000
4000
2000
0
0
2
4
radius
6
8
10
8
10
#I/O, RBT VS. GHT. VS MVPT
800
RBT
GHT
MVPT
700
600
#IO
500
400
300
200
100
0
0
2
4
radius
6
Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT
Open problems
• Is there a general metric-space index structure that
is generally good for most work loads.
– We are optimistic mvp tree’s – further tuning will be a
useful answer
– Hyperplane methods are fair game – there is
circumstantial evidence that that is key component in
Google’s search engine.
• No work addresses clustering data pages on disk.
• Metric-space join algorithms
Biological Models are Usually
Based on Similarity
Similarity
• Biologist like scoring functions that reward each
similar feature with a positive number
• Intuitive
Distance:
• More Similar smaller numbers
• Identical 0
But Do Metric Models Capture Biology?
• Metrics are a subset of possible mathematical models
.
Sequence Problem 1
Sequence similarity based on weighted edit
distance
Accepted weight matrices, PAM &
BLOSSUM, are not metric
Log-odd matrices – negative values
Defy simple algebraic
normalization[TaylorJones93,Linialetal97]
Our First Result: mPAM [Xu&Miranker04]
Dayhoffetal’s PAM Derivation[74]
• Took a set of closely related
protein sequences
• Developed a phylogenetic tree
• Counted substitutions to
transform one sequence to
another
• Tree determines a measure of
time
PAM vs. mPAM: t = 1/f
Using original substitution counts
PAM: frequency of substitution
S(a,b|t) = log P(b|a,t)/qb
mPAM: expected time between
substitutions
D(a,b) = 1/log(1 – (P(a,x)P(b,x))
x
Sequence Problem 2
• Sequences long units (identity for storage
and retrieval)
– Genes
– Chromosomes
• Analysis comprises comparing small
substrings
Soln: Sequence View
• New view type
• Breaks sequences into q-grams
create SEQUENCEVIEW rice_sview as
SELECT CREATE FRAGMENTS (…, 3, 1)
FROM …
WHERE …
USING HAMMING-DISTANCE
Materialize as an Index
D(AAA)
≤2
Genomes
Rowid
Seq
R1
CAACA
R2
ATCAAA
R3
…
{
Rowd
Offset
Logical Fragment
R1
1
A C A
R1
2
R1
3
R1
4
…
{
A A C
A C A
…
R2
1
R2
2
R2
3
R2
4
…
C A A
…
A T C
T C A
C A A
A A A
…
…
D(ACA)
≤1
D(CAA)
≤0
D(ATC)
≤1
Status
• Started with McKoi
– A Java open source object-relational DBMS
– (Think of Postgress written in Java)
• Added
Biological data types
Metric-space index
Extending SQL engine (in progress)
Computed in MoBIoS
Compare Arabidopsis Genome X Rice Genome
1.
Rice
Locate nucleotide patterns of form
18 Matching
Nucleotides
Arab.
Rice Gap 400 – 3000 Long
Arab. Gap 400 – 3000 Long
primer pair candidate
2.
3.
Eliminate non-unique primer candidates
Merge overlapping primer candidates
•
Usual implementations O(n2), n = 109
18 Matching
Nucleotides
mSQL Query to locate candidate
primer pairs
SELECT merge(R1.fragment, A1.fragment)
FROM
G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
WHERE
distance(‘HAMMINGDISTANCE',
R1.fragment,
A1.fragment)
<=
1.0
distance(‘HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0 AND
(FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400 AND
(FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000 AND
(FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400 AND
(FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
GROUP BY R1.fragment, A1.fragment;
AND
Query
Plan
Arab. Genome,
Rice Genome, O(m)
O(n)
Offline: Build Sequence
View O(n log n)
Compare O(mlogn)
Indexed Nested Loop
Eliminate Duplicates
Eliminate Low Complexity
Primers (LZ compression)
Merge Overlapping Primers
~10,000 conserved
primer pairs candidates
Preliminary Results
• Found 13,418 possible primer pairs from MoBIoS
• 100 best candidates BLASTed for matches in GenBank
– 15 matched other plant genes and the primers
– At least 2 of 15 showed potential after PCR amplification against
Helianthus and Phalaenopsis.
MoBIoS Architecture
(Molecular Biological Information System)
Analysing Mass-Spectra
Spectrum = Histogram of Mass/Charge Ratios
of a collection peptides
Similarity = Shared peaks count = Inner Product
(0100101) • (0111100) = 2
Cosine Distance Approx. Inner Product
Drs= 1 – xrx’s/(x’rxr)1/2(x’sxs)1/2
shown store and retrieve mass-spectra
- using cosine distance, and it scales
mSQL Query for Protein
Identification by Mass-Spec.
Signature Database Look
SELECT
Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS,
mass_spectra MS
WHERE
MS.enzyme = DS.enzyme = E and
Cosine_Distance(S, MS.spectrum, range1) and
DS.accession_id = MS.accession_id = Prot.accesion_id
and
DS.ms_peak = P and MPAM250(PS, DS.sequence,
range2);
Matching Electrostatic Shape of Molecules
Still benefit from grid-services:
Intermittently, but regularly compile (recluster) the indices O(nlog n), n > 106
Rational drug design: O(log n) finite element solutions to traverse search tree.
Make a service call to the grid for these operations only
Mirror data contents to minimize I/O
Since need is intermittant, one grid serves many MoBIoS servers
recluster
MoBIoS
Server
G
R I
D
New index
Shape match (FEM)
Distance(real)
High speed I/O
Mirror DB-Contents
Hyper-planes [Ulhmann91]
• If d(x,h1) < d(x,h2) then x assigned to h1
h1
x
h2
Develop a Hierarchical Clustering
C
A
F
E
B
D
Hierarchy of Bounding spheres, (center, radius),
• Bounding spheres may overlap
• Inspired by R-trees