Download Clustering Sequences in a Metric Space

Clustering Sequences in a Metric Space The MoBIoS Project Rui Mao, Daniel P. Miranker, Jacob N. Sarvela and Weijia Xu Department of Computer Sciences, University of Texas Austin, TX 78712, USA {rmao, miranker}@cs.utexas.edu Research supported in part by the Texas Higher Education Coordinating Board, Texas Advanced Research Program. Immediate Goal: Use Metric Space Indexing to Support Homology Search 1. 2. 3. Develop tree-based index structure to speed homology search. Maintain the use of an evolutionary model of similarity. Deliver full Smith-Waterman sensitivity. 2 Metric Space  A metric space[CPRZ97a] is a pair, M=(D,d), where D is a domain of indexing keys, and d is distance function with the following properties: d(Ox,Oy) = d (Oy,Ox) d(Ox,Oy) > 0, d(Ox,Ox) = 0 d(Ox,Oy) <= d(Ox,Oz) + d(Oz,Oy) (symmetry) (non negativity) (triangle inequality) 3 Metric Space Indexing  Metric space indexing exploits intrinsic clustering of the data.  Hierarchical structure avoids linear scan of entire database. in the best case leads to search time logarithmic to database size. 4 Challenges  Local alignment does not form a metric. Local Alignment produces a set of answers, a distance function produce a single number.  Popular evolutionary models (PAM) are not metrics. PAM Matrices are based on log-odds – Negative Values PAM Matices are Asymmetric – Let Pr(x,y) be the probability that amino acid x, mutates to amino acid y. – Pr(x,y)  Pr(y,x) More similar sequences score higher, not lower – Identical sequences must be distance 0 apart 5 From PAM to mPAM - Symmetry PAM: The computation of PAM matrix computed frequency of one amino acid mutating to another.  mPAM: We model that a pair of amino acids, one in each sequence, evolved from a common ancestor [Gonnet & Korostensky]  X Y Z The probability that amino acid Y and amino acid Z are from same ancestor amino acid x is: Pr(y,z)= f(x)Pr(x,y)Pr(x,z) 6 From PAM to mPAM – Distance vs. Similarity PAM computed logodds based on frequency of mutations  mPAM: Compute the expected time for a particular mutation to occur.  More frequent mutations will occur, on average, in less time. mPAM matrix 7 Computing Local Alignments from an Index  Divide the database into small fixed size pieces Build a metric-space index based on global alignment  Divide the query into small fixed size pieces For each query piece, use index to find results based on global alignment. – Like BLASTS hot spot index, but is fully sensitive  Chain the results together Intuitively like BLASTS extension of hot-spots Best algorithm is the last step of “A Sublinear Algorithm for Approximate Keyword Searching” [Myers94] 8 Initial Results: M-Tree  M-tree [CPRZ97b]is an open-source Metric- space indexing package.  Results for global alignment of Yeast peptide sequences of length 10.  Compare M-tree clustering result with farthestfirst traversal bulk load clustering result. 9 fraction of leaves visited to all the leaves vs. database size fraction of leaves visited to all the leaves 0.79 0.78 0.77 0.76 0.75 0.74 0.73 0 2000 4000 6000 8000 10000 12000 database size For a set of queries, the average fraction of number of leaves visited in the searching to the total leaf number decreases while the database size goes up. This shows that is the database is clustered well. 10 Covering radius vs. tree level of M-tree 350 radii of routing objects 300 average radius 250 min radius 200 max radius 150 100 50 0 0 1 2 3 4 5 6 7 8 tree lev el The covering radius of routing objects of one level of M-tree decreases while descending the tree. This shows the database the hierarchically clustered well. 11 Covering radius vs. tree level of farthest-first traversal bulk load radii of routing objects 350 average radius 300 min radius 250 max radius 200 150 100 50 0 0 2 4 tree level 6 8 The covering radius of routing objects of one database tree level created by farthestfirst traversal bulk load decreases while descending the tree. This shows the database the hierarchically clustered well. The radii here are significantly smaller than those of Mtree, which means that we can build a new index structure that is better than M-tree. 12 Long-term Goal: Biologists Need a New Kind of DBMS Traditional relational databases:  Data is dynamic  Workload: Biological databases  Data is write-once.  Workload: Ad-hoc queries based Data clustering (mining) Regular, exact, periodic queries – Billing – Customer service Transactional – inventory – bank accounts   Biological data types are non-relational Biological data types do cluster in metric-spaces Genomic/proteomic sequences Mass-Spectrometer signatures Molecular Models 13 MoBIoS Architecture (Molecular Biological Information System) 14 Storage Manager Metric-Space Index Structure:  Persistent representation Multiple hierarchical trees. Choice of metric distance functions, including user defined  Results in: Efficient clustering of the database Search time logarithmic to the database size 15 Query Engine MoBIoS SQL (M-SQL)  Built-in biological data types Sequence Mass-spectra data   Examples: Homology look-up Gene fusion experiment Embodies evolutionary semantics of bioinformatic investigation 16 M-SQL Program for MS/MS Protein Identifcation // Return proteins in the intersection of recorded spectra sufficiently // similar, range1, to the measured spectra of the first MS, and proteins // which have a digested fragment computed to be sufficiently similar in // sequence to the sequencing determined by the second MS // Database is loaded with genomic and proteomic information Create table protein_sequences (accesion_id int, sequence peptide, …, primary metrickey(sequence, mPAM250); Create table digested_sequences(accession_id int, fragment peptide, enzyme varchar, ms_peak int…, primary key(enzyme, accession_id); Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250); Create table mass_spectra(accession_id int, enzyme varchar, spectrum spectrum, primary metrickey(spectrum, cosine_distance); SELECT Prot.accesion_id, Prot.sequence FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS WHERE MS.enzyme = DS.enzyme = E and Cosine_Distance(S, MS.spectrum, range1) and DS.accession_id = MS.accession_id = Prot.accesion_id and DS.ms_peak = P and MPAM250(PS, DS.sequence, range2) 17 Mining Engine (No much idea???) Primitives for relating clusters to each other.  Possible syntax Gene expression  Protein family  18 Applications  Homology search  Proteomics MS/MS and Ion-Trap MS need both MS signature and sequence data to analyze results  Gene Expression Built in clustering algorithms  Sequence Assembly 19 Properties of Biological Databases Data is write-once.  Workload:  Ad-hoc retrieval queries based on evolutionary criteria Data clustering and categorization (mining)  Many biological data types are non-relational Genomic/proteomic sequences Structural and functional annotations to sequences. Mass-Spectrometer signatures Molecular Models 20 Existing Methods  BLAST Build index of the query sequence, linear scan the database  BLAT Build index of the database, search the database based on exact match of fixed length segments  SST Tree-structured index for vector space object 21 From PAM matrix to mPAM • PAM matrix is one of most commonly used substitution matrix to compute the similarity between two peptide sequence under an evolutional model. • PAM matrix can not be used directly for metric distance indexing technique. – Similarity score don’t have reflexivity properties. – There are negative values. – Doesn’t satisfy triangular inequality rules Figure-2 Log odds matrix for 250 PAMs. (DayHoff 1978) 22 Metric-Space Indexing to Speed Homology Search 1. 2. 3. 4. Split the database and build metric space index structure Split the query sequence Search the query segments in the metric indexing database Chain the search results 23 Results Drosophila distance distribution on mPAM #pairs 800 600 400 200 0 0 5000 10000 Distance 15000 24 Max. Abs. Log-odds of Sub-seq. Lengths 8 7 M.A.L. 6 5 4 3 2 1 0 0 20 40 Length 60 80 Min length: 3 Max length: 80 Threshold: ln10 (2.303) Segment number: 1M Trial number: 1M Bucket number: 100 Sequential search range: 80 25 Length:30 #DistCalcu VS Index size 40000 Radi us:0 5 30000 10 20000 15 #calcu 50000 10000 20 0 0 10000 20000 size 30000 40000 25 30 35 40 26 References • • • • • [CPRZ97a] P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Indexing metric spaces with M-tree. In Atti del Quinto Convegno Nazionale SEBD, Verona, Italy, June 1997. [CPRZ97b] P. Ciaccia, M. Patella, and P. Zezula, “M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces”. Proc. VLDB, 1997. [DSO78] Dayhoff M.O., Schwartz R. and Orcutt B.C. (1978) Atlas of protein sequence and structure. Vol. 5, Suppl. 3, Ed. M. O. Dayhoff. [MT] The M-Tree Project Homepage, http://wwwdb.deis.unibo.it/Mtree/index.html [SW81] Temple F. Smith and Mchael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981. 27

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Clustering Sequences in a Metric Space