ACM SAC 2005 – Santa Fe, New Mexico

Incremental Interactive Mining of Constrained Association Rules from Biological Annotation Data

Imad Rahal, Dongmei Ren, Amal Perera, Hassan Najadat and William Perrizo, North Dakota State University, USA
Riad Rahhal, University of Iowa, USA
Willy Valdivia, Orion Integrated Biosciences, USA

* High-throughput techniques are producing massive quantities of bioinformatics data. Consequently, there is a need for analysis methodologies that scale to larger and larger datasets.
* In this paper we use association rule mining (ARM) to discover relationships in Saccharomyces cerevisiae (yeast) genomic data.
* ARM was first proposed for market basket research (MBR).
* ARM comes into its own when much of the data is categorical or when there is a very large number of dimensions.
* However, ARM has been noted for producing a large number of rules, which can overwhelm researchers.
* Frequent itemset mining (the first step in ARM) also provides indexing for attributes that appear often, for faster access to information.
We propose a new ARM technique which:
* Optimizes the rule-discovery process by giving biologists the flexibility of incorporating their knowledge into it.
* Reduces the overwhelming number of rules that match the specified minimum support and confidence thresholds.
* Operates in an incremental and interactive mode:
  * Interactive mining: allows new queries to be posed from old ones.
  * Incremental mining: uses previous results to answer new queries.
* Stores and processes data vertically.

Data Representation

The data used was extracted mostly from the MIPS database (Munich Information center for Protein Sequences). In the table below, the left column lists all considered features (feature groups) and the right column gives the number of distinct feature values in the extent domain of each feature.

Feature         Total values
pathway         80
EC              622
complexes       316
function        259
localization    43
protein class   191
phenotype       181
interactions    6347

Data Representation

* We built a binary gene-by-feature table.
* For a categorical feature, we consider each category as a separate attribute (column) by bit-mapping it.
* For numeric attributes and hierarchical categorical attributes, we used a bit vector for each bit position or hierarchy level (reducing the number of bit vectors by ~log(n)).
* The resulting table has a total of 8039 distinct feature bit vectors (corresponding to "items" in MBR) for 6374 yeast genes (corresponding to transactions in MBR).
* For processing and storage optimization, we use Predicate tree (P-tree) technology (patent pending) to store and process the resulting bit vectors vertically.

Current practice: structure data into horizontal records.
Process vertically (scans).

Horizontally structured records R(A1 A2 A3 A4), shown in base 2 and scanned vertically:

A1   A2   A3   A4
010  111  110  001
011  111  110  000
010  110  101  001
010  111  101  111
101  010  001  100
010  010  001  101
111  000  001  100
111  000  001  100

Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic P-tree. The bit slices (Rij is bit position j of attribute Ai) are:

R11  0 0 0 0 1 0 1 1
R12  1 1 1 1 0 1 1 1
R13  0 1 0 0 1 0 1 1
R21  1 1 1 1 0 0 0 0
R22  1 1 1 1 1 1 0 0
R23  1 1 0 1 0 0 0 0
R31  1 1 1 1 0 0 0 0
R32  1 1 0 0 0 0 0 0
R33  0 0 1 1 1 1 1 1
R41  0 0 0 1 1 1 1 1
R42  0 0 0 1 0 0 0 0
R43  1 0 1 1 0 1 0 0

Each slice Rij is compressed into a basic P-tree Pij (P11, P12, P13, P21, P22, P23, P31, P32, P33, P41, P42, P43). Processing then horizontally ANDs the basic P-trees.

Top-down construction of the 1-dimensional P-tree representation of R11, denoted P11, records the truth of the universal predicate "pure 1" in a tree, recursively on halves (1/2^1 subsets), until purity is achieved. For example, the compression of R11 into P11 goes as follows:

1. Whole is pure1? false: 0
2. Left half pure1? false: 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false: 0
4. Left half of right half? false: 0
5. Right half of right half? true: 1
6. Left half of left half of right half? true: 1
7. Right half of left half of right half? false: 0, and the branch ends.

For categorical attributes, a bitmap is formed for each category and then compressed into a P-tree.

Top-down construction of basic P-trees is best for understanding, but bottom-up construction is much more efficient.
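The seven top-down steps for R11 can be reproduced with a short sketch. This is only an illustration, not the P-tree API used by the authors: the nested-tuple representation (a pure-1 node is 1, a pure-0 node is 0, a mixed node is a pair of children) and the names build_ptree and count_ones are assumptions made for this example.

```python
def build_ptree(bits):
    """Top-down construction: recurse on halves until each piece is pure.

    Illustrative representation (an assumption, not the authors' format):
    1 = pure-1 node, 0 = pure-0 node, (left, right) = mixed node.
    """
    if all(b == 1 for b in bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def count_ones(node, size):
    """Count the 1 bits covered by a node that spans `size` positions."""
    if isinstance(node, tuple):
        left, right = node
        half = size // 2
        return count_ones(left, half) + count_ones(right, size - half)
    return size if node == 1 else 0


R11 = [0, 0, 0, 0, 1, 0, 1, 1]  # bit slice from the example above
P11 = build_ptree(R11)
print(P11)                        # (0, ((1, 0), 1))
print(count_ones(P11, len(R11)))  # 3
```

The left half collapses to a single pure-0 leaf, which is where the compression comes from: long pure runs in a bit slice cost one node regardless of their length.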
Bottom-up construction of the 1-dimensional P-tree P11 from R11 = 0 0 0 0 1 0 1 1 is done using in-order tree traversal and the collapsing of pure siblings.

2-Dimensional P-trees

2-dimensional P-trees are the natural choice for images; bottom-up construction applies here as well. Consider a 2-dimensional bit file (e.g., the high-order bit of the Green band):

1111110011111000111111001111111011110000111100001111000001110000

which, in spatial raster order, is:

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0

The P-tree is built using 2-dimensional Peano order.

Mining the Yeast Genome

A scientist interested in investigating the effect of one subset of the features on another, such as the effect of phenotype on function, would:
* Mine the frequent itemsets from the phenotype and function feature values separately (producing two independent sets of frequent itemsets).
* Perform a join on the two sets of frequent itemsets and produce a new set containing all frequent itemsets that combine the two features.

We assume the antecedent comes from one feature set and the consequent from the other; thus, each frequent itemset will produce at most one rule (if the confidence of that rule is high enough).

All subsequent queries that include phenotype and/or function would benefit from the frequent itemset mining already done.
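The phenotype and function workflow described above can be sketched on a toy table. This is a minimal illustration under stated assumptions, not the authors' implementation: plain Python integers stand in for compressed bit vectors, the eight-gene data is invented, the per-feature miner is a brute-force enumeration, and the names ones, frequent_itemsets and rules are hypothetical.

```python
from functools import reduce
from itertools import combinations

N_GENES = 8  # toy table: bit i of a vector marks whether gene i has the value

# Invented bit vectors, one per feature value (not real yeast annotations).
PHENOTYPE = {
    "cell cycle defects": 0b11110000,
    "stress response defects": 0b11010010,
}
FUNCTION = {
    "metabolism": 0b11110001,
    "transcription": 0b00001110,
}


def ones(vector):
    """Number of genes (1 bits) in a bit vector."""
    return bin(vector).count("1")


def frequent_itemsets(items, min_supp):
    """Brute-force mining within one feature: itemset -> ANDed bit vector."""
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            vector = reduce(lambda a, b: a & b, (items[name] for name in combo))
            if ones(vector) / N_GENES >= min_supp:
                found[frozenset(combo)] = vector
    return found


def rules(antecedents, consequents, min_supp, min_conf):
    """Join two sets of frequent itemsets; each frequent join gives at most one rule."""
    out = []
    for a_set, a_vec in antecedents.items():
        for c_set, c_vec in consequents.items():
            joined = a_vec & c_vec
            supp = ones(joined) / N_GENES
            if supp < min_supp:
                continue  # the combined itemset is not frequent
            conf = ones(joined) / ones(a_vec)
            if conf >= min_conf:
                out.append((a_set, c_set, supp, conf))
    return out


phen = frequent_itemsets(PHENOTYPE, min_supp=0.25)
func = frequent_itemsets(FUNCTION, min_supp=0.25)
strong = rules(phen, func, min_supp=0.25, min_conf=0.9)
for a_set, c_set, supp, conf in strong:
    print(sorted(a_set), "->", sorted(c_set), f"supp={supp:.3f} conf={conf:.2f}")
```

Because the antecedent side and the consequent side are fixed by the query, the inner loop tests one candidate rule per joined itemset instead of enumerating every way of splitting the itemset into antecedent and consequent.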
The Mining Algorithm

Input: rule query, minisupp and miniconf

Step 1: Mining of FISs from individual features
* For each relevant feature F, mine all frequent itemsets from the F-values separately.
* Using P-trees: the support of an itemset containing items F1 and F2 is obtained by ANDing P_F1 and P_F2 and performing the ROOTCOUNT operation on the result.
* Because the features are treated independently, the mining of the features involved can be done in parallel.

Step 2: Joining of feature FISs
* After separately mining all the frequent itemsets from the items of all selected features, we perform a join step.
* The join exploits the downward closure property of support with respect to itemset size: any itemset must have support greater than or equal to the support of any of its supersets, and thus no itemset can be frequent unless all of its subsets are also frequent.
* E.g., for phenotype → function: if the join of two frequent itemsets I_phenotype and I_function is a non-frequent itemset, then there is no need to join I_phenotype or any of its supersets with I_function or any of its supersets.

Step 3: Producing strong rules
* No enumeration of the different rules that could be derived from a frequent itemset is needed (the second step in traditional ARM).
* Note: computing the confidence of a rule is also efficient using P-trees: the confidence of a rule A → C is ROOTCOUNT(P_A AND P_C) / ROOTCOUNT(P_A).

Step 4: After the user examines the returned rules, s/he often wishes to issue a related but slightly different query. This can be viewed as the start of the interactive mode. Such new queries typically involve features that have already been included in a previous query.
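Since related queries tend to share features, one way to picture Step 4 is a cache of per-feature frequent itemsets that later queries consult before mining anything. The sketch below is an assumption-laden illustration, not the authors' code: the IncrementalMiner class, the brute-force mine_feature helper, the four-gene toy data and the use of a single fixed support count (a cache entry is only valid for the threshold it was mined with) are all hypothetical.

```python
from functools import reduce
from itertools import combinations

# Invented toy data over 4 genes: feature -> {feature value: bit vector}.
DATA = {
    "phenotype": {"cell cycle defects": 0b1111, "stress response defects": 0b0011},
    "function": {"metabolism": 0b1110, "energy": 0b0110},
    "localization": {"cytoplasm": 0b0111},
}


def ones(vector):
    """Number of 1 bits, standing in for a root count on a compressed tree."""
    return bin(vector).count("1")


def mine_feature(items, min_count):
    """Brute-force frequent itemsets of one feature: itemset -> ANDed bit vector."""
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            vector = reduce(lambda a, b: a & b, (items[name] for name in combo))
            if ones(vector) >= min_count:
                found[frozenset(combo)] = vector
    return found


class IncrementalMiner:
    """Caches per-feature frequent itemsets so related queries can reuse them."""

    def __init__(self, data, min_count):
        self.data = data
        self.min_count = min_count
        self.cache = {}  # feature name -> frequent itemsets already mined
        self.mined = []  # log of the features that actually had to be mined

    def frequent(self, feature):
        if feature not in self.cache:
            self.mined.append(feature)
            self.cache[feature] = mine_feature(self.data[feature], self.min_count)
        return self.cache[feature]

    def query(self, antecedent_feature, consequent_feature):
        """Return the frequent joins as (antecedent itemset, consequent itemset)."""
        joins = []
        for a_set, a_vec in self.frequent(antecedent_feature).items():
            for c_set, c_vec in self.frequent(consequent_feature).items():
                if ones(a_vec & c_vec) >= self.min_count:
                    joins.append((a_set, c_set))
        return joins


miner = IncrementalMiner(DATA, min_count=2)
first = miner.query("phenotype", "function")
second = miner.query("localization", "function")
print(miner.mined)  # ['phenotype', 'function', 'localization']
```

The second query mines only localization; the function itemsets come from the cache, so the cost of a follow-up query depends on the features it adds rather than on the whole table.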
Our approach incrementally builds on the results obtained so far to answer the new query.

The Mining Algorithm

For example, suppose that the user submits "localization → function" after "phenotype → function". All that needs to be done is to mine the frequent itemsets from localization and join them with those of function.

If a new query, "localization, phenotype → function", is submitted, we utilize all the frequent itemsets from the first request and join them with those derived from localization.

Algorithmic Details

For the generation of FISs, we utilize a previous P-tree ARM approach (Rahal, Denton and Perrizo, JIKM Journal, Dec. 2004 [13]) and store them in a (frequent) Set Enumeration (SE) tree containing all frequent itemsets.

[Figure: two example (frequent) SE trees, each rooted at Ø, labeled a) and b). One is for phenotype (nodes include Cell cycle defects, Stress response defects, Sensitivity to antibiotics) and one is for function (nodes include Metabolism, Transcription, Energy).]

Experimental Study

* Implementations were coded in C++ and executed on an Intel Pentium-4 2.4GHz processor workstation with 2GB RAM running Redhat Linux 9.0.
* All implementations use the P-tree API: http://midas.cs.ndsu.nodak.edu/~datasurg/ptree
* For our approach, we computed the total time for executing 5, 10, 15, 20 and 25 consecutive inter-related queries.
* Each query contains up to 3 features and uses at least one feature from a previous query.
* We compare with the standard approach (mine over all attribute values). For it, we include only the time needed to mine the whole dataset, not the time needed to scan the resulting set of rules for the subset of interest.
* We set the minimum confidence threshold to 90% and varied the minimum support threshold between 0.05% and 20%.

[Figure: Execution Time. Time (s), from 0 to 700, versus Support (%), from 20.0% down to 0.05%, for 5, 10, 15, 20 and 25 queries and for Brute Force.]

* The figure clearly shows the gain achieved by using our approach.
* The post-processing approach needs more than 620 seconds at the 5.9% support threshold.

[Figure: Frequent Itemsets. Number of Frequent Itemsets, from 0 to 1200000, versus Support (%), for 5, 10, 15, 20 and 25 queries and for Brute Force.]

* Biologists could go to very low support thresholds and mine frequent itemsets (and eventually rules) that would go undetected in the post-processing approach.

[Figure: Number of Rules. Rules, from 0 to 1000000, versus Support (%), for 5, 10, 15, 20 and 25 queries and for Brute Force.]

* The brute-force approach returned slightly less than a million rules at 5.9% support, most of which are irrelevant to the queries we selected.
* For our queries, interesting rules started to show up at support ~0.5%.
* For high support, mostly uninteresting and evident (trivial) rules appeared.
* The low-support range is where our results associated the yeast eIF2B factor with specific interactions within the cellular complex.

A significant portion of the rules were straightforward in the sense of providing only common knowledge, e.g., complex = cytoplasmic ribosomal large subunit → localization = cytoplasm.

Of significant interest to our biological collaborators was a set of rules pertinent to the yeast eukaryotic initiation factor 2B (eIF2B):

"complex = eIF2B (5 ORFs)" → "function = ribosome biogenesis"

eIF2B is a multi-subunit guanine nucleotide exchange factor which catalyzes the exchange of GDP bound to initiation factor eIF2 for GTP, generating active eIF2-GTP.
In humans, it is composed of five subunits: alpha, beta, delta, gamma and epsilon.

In yeast, the eIF2B factor mediates the exchange of a series of proteins bound to translation initiation, the process preceding formation of the peptide bond between the first two amino acids of a protein. Specifically, it catalyzes a vital regulatory step in the initiation of the translation of mRNA.

System (DataMIME™ data mining, NO NOISE)

http://www.cs.ndsu.nodak.edu/~datamine

[Diagram: "YOUR DATA" and "YOUR DATA MINING" connect over the Internet to the system. Components shown: DII (Data Integration Interface), with the Data Integration Language (DIL); DMI (Data Mining Interface), with the P-tree (Predicates) Query Language (PQL); and the Data Repository, a lossless, compressed, distributed, vertically structured database.]

Conclusion

* In this paper, we proposed a computational approach targeted at the analysis of yeast genome annotation data.
* It gives biologists the flexibility of incorporating domain knowledge, in the form of queries, thus helping them focus their analysis on specific features of interest.
* It optimizes the rule-discovery process by allowing operation in the interactive and incremental modes, and enables:
  * parallel processing,
  * reuse of mined results,
  * vertical, efficient storage and processing.

Future Directions

* Extend the features in our analyzed data, for example to include secondary protein structure information.
* Pursue similar analysis over different genomes, such as the human genome.
* A broader goal is to look for "inter-organism" association rules that are valid across organisms rather than within a single organism.