Download A Metric-Space Database Management Supporting Molecular Biology

The MoBIoS Project Molecular Biological Information System Daniel P. Miranker Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics University of Texas Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang Problem: In Life Sciencses, database management systems (DBMS) serve as glorified file managers.  Little use of sophisticated data and pattern-based retrieval  Real scientific and technological problems When biological data is put in to an RDBMS • Primary data is stored in text or blob fields – Annotations may be relational Organism Function Sequence Yeast membrane AACCGGTTT Yeast mitosis TATCGAAA E. Coli membrane AGGCCTA • Data retrieval – Filter DB, sequential dump, O(n), to utilities • E.g. BLAST, Linear Data Scans, O(n), Endemic in Life Sciences  Sequences:   Mass Spectra     proteomics Small Molecules & Protein Structure   DNA, RNA, Protein databases Protein interaction Rational drug design Pathways (graphs) Phylogenies (graphs, trees in particular) Scope: To Find Common Ground Both Biology and DBMS’ Have to Move DBMS Biological Information System Metric-Space Database as the Common Ground Metric Space is a pair, M=(D,d), where  D is a set of points  d is [metric] distance function with the following properties:     d(x,y) = d (y,x) d(x, y) > 0, d(x,x) = 0 d(x,z) <= d(x,y) + d(y,z) (symmetry) (non negativity) (triangle inequality) x y z Definition - By Analogy A Metric-Space Database Management System  Extend Relational DBMS     Special indexes for metricspaces New data types Biological information system  A Spatial Database Management System: Life science data types Extend relational DBMS    Special indexes for 2D and 3D data; k-d and R-trees New data types Geographic information systems   Topographic maps Buildings and the like Develop index structures to support distance & nearest-neighbor queries • Well studied in main-memory – But by no means a closed problem • In databases (external/disk based methods) – Embryonic – Many myths • Often assumed to be the basis of multimedia database systems How to build a metric-space index • Three algorithmic classes [Tasan, Ozsoyoglu 04] – Vantage points – Hyperplanes – Bounding spheres Vantage Point Method [Burkhard&Keller73] Vantage Point Method Choose a point,VP And a radius, R Vantage Point Method • Given VP, R The predicates Choose a point,VP • d(VP,x) < R • d(VP,x)  R Divide the set into two equal halves • apply recursively And a radius,R Query, q, range r r q Query, q, range r if • d(q,VP) > R + r then • all neighbors are outside the sphere VP R r q Multi-vantage point method Multi-vantage point method • Consider d(VPi, x) a projection onto an axis • Looks like a k-d tree – Choose number k & d Myths • Solved problem; M-trees [Ciaccia et.al. 96, 97] – I can’t get them to work on anything but their original synthetic data generator • Good choice for vantage points is to find corners[Yianilos93] (farthest-first clustering) – Might be true for euclidean spaces – Early result, not true for our data • High dimensional indexing always asymptotically reduces to linear scans. – Formal result based on an assumption of uniform data distributions. Comparison of Three Methods of Metric-Space Indexing #dist. cal.: RBT VS. GHT VS. MVPT 18000 RBT GHT MVPT 16000 #dist cal. 14000 12000 10000 8000 6000 4000 2000 0 0 2 4 radius 6 8 10 8 10 #I/O, RBT VS. GHT. VS MVPT 800 RBT GHT MVPT 700 600 #IO 500 400 300 200 100 0 0 2 4 radius 6 Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT Open problems • Is there a general metric-space index structure that is generally good for most work loads. – We are optimistic mvp tree’s – further tuning will be a useful answer – Hyperplane methods are fair game – there is circumstantial evidence that that is key component in Google’s search engine. • No work addresses clustering data pages on disk. • Metric-space join algorithms Biological Models are Usually Based on Similarity Similarity • Biologist like scoring functions that reward each similar feature with a positive number • Intuitive Distance: • More Similar  smaller numbers • Identical  0 But Do Metric Models Capture Biology? • Metrics are a subset of possible mathematical models . Sequence Problem 1 Sequence similarity based on weighted edit distance Accepted weight matrices, PAM & BLOSSUM, are not metric  Log-odd matrices – negative values  Defy simple algebraic normalization[TaylorJones93,Linialetal97] Our First Result: mPAM [Xu&Miranker04] Dayhoffetal’s PAM Derivation[74] • Took a set of closely related protein sequences • Developed a phylogenetic tree • Counted substitutions to transform one sequence to another • Tree determines a measure of time PAM vs. mPAM: t = 1/f Using original substitution counts  PAM: frequency of substitution S(a,b|t) = log P(b|a,t)/qb  mPAM: expected time between substitutions D(a,b) = 1/log(1 – (P(a,x)P(b,x)) x Sequence Problem 2 • Sequences long units (identity for storage and retrieval) – Genes – Chromosomes • Analysis comprises comparing small substrings Soln: Sequence View • New view type • Breaks sequences into q-grams create SEQUENCEVIEW rice_sview as SELECT CREATE FRAGMENTS (…, 3, 1) FROM … WHERE … USING HAMMING-DISTANCE Materialize as an Index D(AAA) ≤2 Genomes Rowid Seq R1 CAACA R2 ATCAAA R3 … { Rowd Offset Logical Fragment R1 1 A C A R1 2 R1 3 R1 4 … { A A C A C A … R2 1 R2 2 R2 3 R2 4 … C A A … A T C T C A C A A A A A … … D(ACA) ≤1 D(CAA) ≤0 D(ATC) ≤1 Status • Started with McKoi – A Java open source object-relational DBMS – (Think of Postgress written in Java) • Added  Biological data types  Metric-space index  Extending SQL engine (in progress) Computed in MoBIoS Compare Arabidopsis Genome X Rice Genome 1. Rice Locate nucleotide patterns of form 18 Matching Nucleotides Arab. Rice Gap 400 – 3000 Long Arab. Gap 400 – 3000 Long primer pair candidate 2. 3. Eliminate non-unique primer candidates Merge overlapping primer candidates • Usual implementations O(n2), n = 109 18 Matching Nucleotides mSQL Query to locate candidate primer pairs SELECT merge(R1.fragment, A1.fragment) FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2 WHERE distance(‘HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0 distance(‘HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0 AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400 AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000 AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400 AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000 GROUP BY R1.fragment, A1.fragment; AND Query Plan Arab. Genome, Rice Genome, O(m) O(n) Offline: Build Sequence View O(n log n) Compare O(mlogn) Indexed Nested Loop Eliminate Duplicates Eliminate Low Complexity Primers (LZ compression) Merge Overlapping Primers ~10,000 conserved primer pairs candidates Preliminary Results • Found 13,418 possible primer pairs from MoBIoS • 100 best candidates BLASTed for matches in GenBank – 15 matched other plant genes and the primers – At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis. MoBIoS Architecture (Molecular Biological Information System) Analysing Mass-Spectra Spectrum = Histogram of Mass/Charge Ratios of a collection peptides Similarity = Shared peaks count = Inner Product (0100101) • (0111100) = 2 Cosine Distance Approx. Inner Product Drs= 1 – xrx’s/(x’rxr)1/2(x’sxs)1/2 shown store and retrieve mass-spectra - using cosine distance, and it scales mSQL Query for Protein Identification by Mass-Spec. Signature Database Look SELECT Prot.accesion_id, Prot.sequence FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS WHERE MS.enzyme = DS.enzyme = E and Cosine_Distance(S, MS.spectrum, range1) and DS.accession_id = MS.accession_id = Prot.accesion_id and DS.ms_peak = P and MPAM250(PS, DS.sequence, range2); Matching Electrostatic Shape of Molecules Still benefit from grid-services:  Intermittently, but regularly compile (recluster) the indices O(nlog n), n > 106  Rational drug design: O(log n) finite element solutions to traverse search tree.  Make a service call to the grid for these operations only  Mirror data contents to minimize I/O  Since need is intermittant, one grid serves many MoBIoS servers recluster MoBIoS Server G R I D New index Shape match (FEM) Distance(real) High speed I/O Mirror DB-Contents Hyper-planes [Ulhmann91] • If d(x,h1) < d(x,h2) then x assigned to h1 h1 x h2 Develop a Hierarchical Clustering C A F E B D Hierarchy of Bounding spheres, (center, radius), • Bounding spheres may overlap • Inspired by R-trees

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download A Metric-Space Database Management Supporting Molecular Biology