Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
MoBIoS A Metric-space DBMS to Support Biological Discovery Presenter: Enohi I. Ibekwe Overview • • • • • • • • MoBIoS Project Motivation The challenge Established similarity measures Metric-space distance measure Disk-based metric tree index MoBIoS as a DBMS Application of MoBIoS MoBIoS Project • Molecular Biological Information System • Project at UT-Austin center for computational biology and bioinformatics. • DBMS based on metric-space indexing techniques, object-relational model of genomic and proteomic data types and a database query language that embodies the semantics of genomic and proteomic data. Motivation Develop a DBMS to power Biological Information System The Challenge • Established biological model of similarity measure do not form a metrics. • Scalable disk-based metric-indexes suffer from the Curse of dimensionality Established Similarity Measure (I) • Sequence Homology – – – – Query Sequence Database of sequences Substitution Matrix (PAM / BLOSUM) Similarity Measure – Global Sequence Alignment (Edit distance) – Local Sequence Alignment (Most important) Established Similarity Measure (II) • Local Sequence Alignment – A local sequence alignment query asks, given a query sequence S, a database of sequences T and a similarity matrix corresponding to an evolutionary model, return all subsequences of T that are sufficiently similar to a subsequence of S – Main issue: Result is a set of answer. • A metric distance function must return a single value for each pair of argument Established Similarity Measure (III) • Global Sequence Alignment – Given an alphabet A , a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment for two sequences s and t is to find a strings a and b which are obtained from s and t respectively by inserting spaces either into or at the ends of s and t and whose score computed using M is at a maximum (Similarity measure) over all pairs of such strings obtained from s and t. (example) – Issue: Result maybe negative since substitution matrix is based on log-odd probability. Similarity measure favors greater positive number. • Metric-space Distance measure (I) Homology Search • Query Sequence: Sub strings of length q (q-grams) • Database of sequences: Metric indexed records of fixed length q (indexed q-grams) strings. • Substitution Matrix (mPAM) • Similarity Measure (distance measure) – Local Alignments is computed from global alignment. Metric-space Distance measure (II) • mPAM substitution Matrix – Accepted Point Mutation Model. – PAM calculates scores based on frequency in which individual pairs of amino acids substituted for each other. – mPAM instead of calculating frequency of substitutions (PAM), computes expected time between substitution. – mPAM has been validated.(Validation) Metric-space Distance measure (III) • Computing Local Alignment from Global Alignment (Algorithm) – Offline 1. Divide database of sequence into sub strings (qgrams) 2. Build metric-space index structure on q-grams – Online 1. Divide query sequence into sub strings (q-grams) 2. Using global alignment as a distance function to match query q-grams. Disk-based metric-tree index • Phases • Initialization • Searching • Query performance metric • Number of disk I/O ( nodes visited) • Number of distance computation • Options Exploited • M-Tree • Generalized Hyper plane tree • MVP-Tree (optimal) Disk-based metric-tree index (initialization) • M-Tree initialization – Best case : O(nlogn); – worst case: O(n3) • Generalized Hyper plane (GH-Tree) initialization – Best case : O(nlogn); – worst case: O(n2) • • • GH-tree: Bi-direction M-Tree: Bottom-up In practice, both M-Tree and GHTree scale linearly Disk-based metric-tree index (Searching) MoBIoS as a DBMS (I) • Mckoi (Java RDBMS). – Plus metric-space indexing – Plus Biological data types – Plus biological semantics • Life science data store – Biological sequence data – Mass-spectrometry protein signature MoBIoS as a DBMS (III) • Language Extension – M-SQL • Data type Extension – Data type for Sequences (DNA,RNA,peptide) – Data type for Mass spectrum • Semantics Extension – Subsequence Operators – Local alignment MoBIoS as a DBMS (IV) • Semantics Extension – Similarity (metric distance) between data types • mPAM250 • Cosine distance • Lk norms • Keys Extension – Primary key (metrickey) – Index (metric) Application of MoBIoS (I) • MS/MS Protein Identification 1. Breakdown protein into fragments called peptide using a protease enzyme 2. Identify protein by using a mass-spectrometer to measure the mass-charge ratio of the fragments and comparing the experiment result to a database of precomputed spectra. Application of MoBIoS(II) M-SQL Solution Create table protein_sequences (accesion_id int, sequence peptide, primary metrickey(sequence, mPAM250); Create table digested_sequences (accession_id int, fragment peptide, enzyme varchar, ms_peak int, primary key(enzyme, Create index fragment_sequence on digested_sequences (fragment) metric(mPAM250); Create table mass_spectra (accession_id int, enzyme varchar, spectrum spectrum, primary metrickey(spectrum, cosine_distance); Application of MoBIoS(III) • M-SQL Solution SELECT Prot.accesion_id, Prot.sequence FROM protein_sequences Prot, digested_sequences DS,mass_spectra MS WHERE MS.enzyme = DS.enzyme = E and Cosine_Distance(S, MS.spectrum, range1) and DS.accession_id = MS.accession_id = Prot.accesion_id and BLAST vs MoBIoS MoBIoS 1. 2. 3. Molecular Biological Information System DBMS specialized for storage, retrieval and mining of biological data Sequence Database and query sequence is divided into q-grams and Database is indexed offline. BLAST 1. 2. 3. Basic Local Alignment Search Tool Utility specialized for retrieval and mining of biological data outside a database Only query sequence is divide and hot-point index is done at query time MoBIoS Demo • MoBIoS: http://ccvweb.csres.utexas.edu:9080/msfoun d/ccForm.jsp • PDB : http://www.rcsb.org/pdb/ Conclusion • Biological data is not random and very likely exhibit the intrinsic structure necessary for metric-space indexing to succeed. References • http://www.cs.utexas.edu/users/mobios/Publicat ions/miranker-mobios-final-03.pdf • http://www.cs.utexas.edu/users/mobios/Publicat ions/mao-bibe-03.pdf • http://www.cs.utexas.edu/users/mobios/ • http://www.mckoi.com/database/ Appendix Return Appendix I- Metric A metric-space is a set of objects S, with a distance function d, such that given any three objects x, y, z, 1. Non-Negativity d(x,y) > 0 for x = y; d(x,y) = 0 for x = y 2. Symmetry d(x,y) = d(y,x) 3. Triangular inequality d(x,y) + d(y,z) = d(x,y) Return Appendix II - Sequence • 2 RNA sequences from a DNA strand. Return Appendix III - PAM Percent Accepted Mutation(PAM) A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence. (e.g PAM250) A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odd probability. Return Appendix IV – PAM250 • At this evolutionary distance (250 substitutions per hundred residues) Return Appendix V - BLOSUM Blocks Substitution Matrix (BLOSUM) A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related ( e.g BLOSUM62) A unit to quantify the amount of evolutionary change in a protein sequence. Based on log-odd probability Return Appendix VI – BLOSUM62 • BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical Return Appendix VII – mPAM250 • Expected time based on 250 PAM distance as a unit. Return Appendix VIII – mPAM Validation • Based on benchmark query set by SmithWaterman. • Graph shows ROC50 values (Receiver Operating Characteristics) • Negative x- axis indicate mPAM has better performance Difference between ROC50 values using mPAM and PAM250 Return Appendix IX - Distance measure Global Sequence Alignment Given an alphabet A , a similarity substitution matrix M corresponding to an evolutionary model, the global sequence alignment for two sequences s and t is to find a strings a and b which are obtained from s and t respectively by inserting spaces either into or at the ends of s and t and whose score computed using M is at a maximum (Similarity measure) or minimum (distance measure) over all pairs of such strings obtained from s and t. Return Appendix X – Homology Search Build Index Structure(Offline) 1. 2. Divide the database sequences into a set of overlapping sub strings of length q (q-grams) with step size 1. Build a metric-space index D based on global alignment to support constant time lookup of exact match. Homology Search Query (Online) 1. 2. 3. Divide the query sequence W into overlapping sub string , F = {wi | i =0..| W |-q }, of length q with step size 1. For each wi in F, run range query Q(wi, r) against database D to find a set of matching q-grams, Ri = f i,j | d( f i,j , wi) <= r, f i,j E D wi E F }, where d is the distance function. Using a greedy heuristic algorithm to extend and chain all fragments in R0UR1U…Rw-t to deduce the result of homology search based on local alignment for query W Return Appendix XI - GSA Return