The MoBIoS Project
Molecular Biological Information System

Daniel P. Miranker
Dept. of Computer Sciences & Center for Computational Biology and Bioinformatics
University of Texas
Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang

Problem
In the life sciences, database management systems (DBMS) serve as glorified file managers.
• Little use of sophisticated data and pattern-based retrieval
• Real scientific and technological problems

When biological data is put into an RDBMS
• Primary data is stored in text or blob fields
  – Annotations may be relational

    Organism   Function   Sequence
    Yeast      membrane   AACCGGTTT
    Yeast      mitosis    TATCGAAA
    E. Coli    membrane   AGGCCTA

• Data retrieval
  – Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST

Linear Data Scans, O(n), Endemic in Life Sciences
• Sequences: DNA, RNA, protein databases
• Mass spectra (proteomics)
• Small molecules & protein structure
• Protein interaction
• Rational drug design
• Pathways (graphs)
• Phylogenies (graphs, trees in particular)

Scope: To Find Common Ground
Both biology and DBMSs have to move.
[Diagram labels: DBMS; Biological Information System]

Metric-Space Database as the Common Ground
A metric space is a pair, M = (D, d), where
• D is a set of points
• d is a [metric] distance function with the following properties:
  – d(x,y) = d(y,x) (symmetry)
  – d(x,y) > 0 for x ≠ y, d(x,x) = 0 (non-negativity)
  – d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
[Diagram: triangle on points x, y, z]

Definition - By Analogy

A Metric-Space Database Management System
• Extend relational DBMS
• Special indexes for metric spaces
• New data types: life science data types
• Biological information system

A Spatial Database Management System
• Extend relational DBMS
• Special indexes for 2D and 3D data; k-d and R-trees
• New data types: topographic maps, buildings and the like
• Geographic information systems

Develop index structures to support distance & nearest-neighbor queries
• Well studied in main memory
  – But by no means a closed problem
• In databases (external/disk-based methods)
  – Embryonic
  – Many myths
• Often assumed to be the basis of multimedia database systems

How to build a metric-space index
• Three algorithmic classes [Tasan, Ozsoyoglu 04]
  – Vantage points
  – Hyperplanes
  – Bounding spheres

Vantage Point Method [Burkhard&Keller73]
• Choose a point, VP, and a radius, R
• Given VP and R, the predicates
  – d(VP,x) < R
  – d(VP,x) >= R
  divide the set into two equal halves
• Apply recursively

Query q, range r
• If d(q,VP) > R + r, then all neighbors are outside the sphere
[Diagram: sphere of radius R around VP; query q with range r]

Multi-vantage point method
• Consider d(VPi, x) a projection onto an axis
• Looks like a k-d tree
  – Choose number k & d

Myths
• Solved problem; M-trees [Ciaccia et al. 96, 97]
  – I can't get them to work on anything but their original synthetic data generator
• Good choice for vantage points is to find corners [Yianilos93] (farthest-first clustering)
  – Might be true for Euclidean spaces
  – Early result, not true for our data
• High-dimensional indexing always asymptotically reduces to linear scans
  – Formal result based on an assumption of uniform data distributions

Comparison of Three Methods of Metric-Space Indexing
[Plot: number of distance calculations vs. radius (0 to 10) for RBT, GHT, and MVPT]
[Plot: number of I/Os vs. radius (0 to 10) for RBT, GHT, and MVPT]
Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT

Open problems
• Is there a general metric-space index structure that is good for most workloads?
  – We are optimistic that MVP trees, with further tuning, will be a useful answer
  – Hyperplane methods are fair game; there is circumstantial evidence that they are a key component in Google's search engine
• No work addresses clustering data pages on disk.
• Metric-space join algorithms

Biological Models are Usually Based on Similarity
Similarity:
• Biologists like scoring functions that reward each similar feature with a positive number
• Intuitive
Distance:
• More similar: smaller numbers
• Identical: 0

But Do Metric Models Capture Biology?
• Metrics are a subset of possible mathematical models.

Sequence Problem 1
• Sequence similarity is based on weighted edit distance
• Accepted weight matrices, PAM & BLOSUM, are not metric
  – Log-odds matrices; negative values
  – Defy simple algebraic normalization [TaylorJones93, Linialetal97]

Our First Result: mPAM [Xu&Miranker04]

Dayhoff et al.'s PAM Derivation [74]
• Took a set of closely related protein sequences
• Developed a phylogenetic tree
• Counted substitutions to transform one sequence to another
• Tree determines a measure of time

PAM vs. mPAM: t = 1/f
Using original substitution counts
• PAM: frequency of substitution
  $S(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b}$
• mPAM: expected time between substitutions
  $D(a,b) = \frac{1}{\log\left(1 - \sum_x P(a,x)\,P(b,x)\right)}$

Sequence Problem 2
• Sequences are long units (identity for storage and retrieval)
  – Genes
  – Chromosomes
• Analysis comprises comparing small substrings

Soln: Sequence View
• New view type
• Breaks sequences into q-grams

    create SEQUENCEVIEW rice_sview as
    SELECT CREATE FRAGMENTS (…, 3, 1)
    FROM …
    WHERE …
    USING HAMMING-DISTANCE

Materialize as an Index

    Genomes
    Rowid   Seq
    R1      CAACA
    R2      ATCAAA
    R3      …

[Figure: each sequence is broken into rows of (Rowid, Offset, Logical Fragment); for example, R2 yields ATC, TCA, CAA, AAA at offsets 1 to 4. The fragments are indexed under distance predicates such as D(AAA) ≤ 2, D(ACA) ≤ 1, D(CAA) ≤ 0, D(ATC) ≤ 1.]

Status
• Started with McKoi
  – A Java open source object-relational DBMS
  – (Think of Postgres written in Java)
• Added
  – Biological data types
  – Metric-space index
  – Extending SQL engine (in progress)

Computed in MoBIoS
Compare Arabidopsis Genome X Rice Genome
1. Locate nucleotide patterns of the form
   [Diagram: in both Rice and Arab., two runs of 18 matching nucleotides separated by a gap 400 – 3000 long; each such pattern is a primer pair candidate]
2. Eliminate non-unique primer candidates
3. Merge overlapping primer candidates
• Usual implementations O(n^2), n = 10^9

mSQL Query to locate candidate primer pairs

    SELECT merge(R1.fragment, A1.fragment)
    FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
    WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0
      AND distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0
      AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400
      AND (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000
      AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400
      AND (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
    GROUP BY R1.fragment, A1.fragment;

Query Plan
• Inputs: Arab. genome, O(m); Rice genome, O(n)
• Offline: build sequence view, O(n log n)
• Compare: indexed nested loop, O(m log n)
• Eliminate duplicates
• Eliminate low-complexity primers (LZ compression)
• Merge overlapping primers
• Result: ~10,000 conserved primer pair candidates

Preliminary Results
• Found 13,418 possible primer pairs from MoBIoS
• 100 best candidates BLASTed for matches in GenBank
  – 15 matched other plant genes and the primers
  – At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis.

MoBIoS Architecture (Molecular Biological Information System)

Analysing Mass-Spectra
• Spectrum = histogram of mass/charge ratios of a collection of peptides
• Similarity = shared peaks count = inner product
  (0100101) • (0111100) = 2
• Cosine distance approximates the inner product
  $D_{rs} = 1 - \frac{x_r x_s'}{(x_r' x_r)^{1/2}\,(x_s' x_s)^{1/2}}$
• Shown: MoBIoS can store and retrieve mass spectra using cosine distance, and it scales

mSQL Query for Protein Identification by Mass-Spec.
Signature Database Look

    SELECT Prot.accession_id, Prot.sequence
    FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
    WHERE MS.enzyme = DS.enzyme = E
      and Cosine_Distance(S, MS.spectrum, range1)
      and DS.accession_id = MS.accession_id = Prot.accession_id
      and DS.ms_peak = P
      and MPAM250(PS, DS.sequence, range2);

Matching Electrostatic Shape of Molecules
Still benefit from grid services:
• Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6
• Rational drug design: O(log n) finite element solutions to traverse the search tree
• Make a service call to the grid for these operations only
• Mirror data contents to minimize I/O
• Since the need is intermittent, one grid serves many MoBIoS servers
[Diagram: the MoBIoS Server sends recluster and shape match (FEM) requests to the GRID, which returns a new index and a distance (real); high-speed I/O; mirrored DB contents]

Hyper-planes [Uhlmann91]
• If d(x,h1) < d(x,h2) then x is assigned to h1
[Diagram: point x between h1 and h2]

Develop a Hierarchical Clustering
[Diagram: points A, B, C, D, E, F grouped into clusters]
Hierarchy of bounding spheres, (center, radius)
• Bounding spheres may overlap
• Inspired by R-trees
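
The hyperplane rule and the bounding-sphere idea above can be sketched together in a few lines of Python. This is a minimal, one-level illustration under stated assumptions, not the MoBIoS implementation: the names `hamming`, `Sphere`, `hyperplane_split`, and `range_query` are hypothetical, the pivots h1 and h2 are chosen by hand, and Hamming distance on 3-grams stands in for whatever metric the index is built over. Each point goes to h1 when d(x,h1) < d(x,h2) and to h2 otherwise; each side is wrapped in a (center, radius) sphere, and a range query skips a sphere when the triangle inequality rules it out.

```python
from dataclasses import dataclass
from typing import Callable, List


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings (a metric)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


@dataclass
class Sphere:
    center: str         # pivot the members were assigned to
    radius: int         # largest distance from the pivot to any member
    members: List[str]


def hyperplane_split(points: List[str], h1: str, h2: str,
                     dist: Callable[[str, str], int] = hamming) -> List[Sphere]:
    """Assign x to h1 if d(x,h1) < d(x,h2), else to h2; bound each side by a sphere."""
    left = [x for x in points if dist(x, h1) < dist(x, h2)]
    right = [x for x in points if not dist(x, h1) < dist(x, h2)]
    return [
        Sphere(h, max((dist(h, x) for x in side), default=0), side)
        for h, side in ((h1, left), (h2, right))
    ]


def range_query(spheres: List[Sphere], q: str, r: int,
                dist: Callable[[str, str], int] = hamming) -> List[str]:
    """Return all indexed points within distance r of q."""
    hits: List[str] = []
    for s in spheres:
        # Triangle inequality: if d(q, center) > radius + r, every member
        # is farther than r from q, so the whole sphere can be skipped.
        if dist(q, s.center) > s.radius + r:
            continue
        hits.extend(x for x in s.members if dist(q, x) <= r)
    return hits


# Illustrative data: 3-grams of the kind a sequence view would produce.
fragments = ["CAA", "AAC", "ACA", "ATC", "TCA", "AAA"]
spheres = hyperplane_split(fragments, h1="AAA", h2="TCA")
exact = range_query(spheres, "CAA", 0)   # the sphere around TCA is pruned
near = range_query(spheres, "ACA", 1)    # both spheres must be scanned
```

A real index would apply the split recursively and pick pivots from the data; the sketch keeps a single level so the pruning test stays visible.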