Facilitating Interactive Mining of Global and Local Association Rules
Abhishek Mukherji*, Elke A. Rundensteiner, Matthew O. Ward
Department of Computer Science, Worcester Polytechnic Institute, MA, USA.
*Samsung Research America, CA, USA.
11/03/2014

XmdvTool is an open source multivariate visual analytics tool developed at WPI with a series of NSF grants over the past 20 years (http://sourceforge.net/projects/xmdvtool/). This PhD research work was partly supported by NSF under grants IIS-0812027, CCF-0811510 and IIS-1117139.

Era of Big Data ... And we are DRIVING!
Volume, Variety, Velocity, Veracity
1. Where’s the Data in the Big Data Wave? Gerhard Weikum, Res. Director at Max Planck Institute, http://wp.sigmod.org/?p=786.
2. Analytic DB Technology for the Data Enthusiast. Pat Hanrahan, Stanford & Tableau, SIGMOD’12 Keynote Talk.

XmdvTool’s Efforts Towards This Paradigm Shift
Visualize Static Data
I. Visualize Stream & Sensor Data
SNIFTool & FireStream
Visualize Data Records
ViStream*
II. Visualize Mined Results
PARAS/FIRE
COLARM
*Di Yang et al., Interactive visual exploration of neighbor-based patterns in data streams, ACM SIGMOD’10 Demo.

Summary of Graduate Research Works
CAPE*
I. Stream & Sensor Data Processing
1. SNIFTool/FireStream: Discover Patterns in Live Stream [CIKM ’08, ICDE Demo ’07]
2. JAQPOT: High Velocity Streams MJoin Exec. [BNCOD ’11]
XMDVTool^
II. Interactive Mining
1. PARAS/FIRE [VLDB’13, SIGMOD’13, CIKM’13]
2. COLARM [EDBT’14]
III. Scalable Nugget-guided Hypothesis Testing
1. SPHINX: Evidence-Hypotheses Explor. [CIKM’13]
2. Iterative Multi-Evidence-Hypotheses Model
*http://davis.wpi.edu/dsrg/PROJECTS/CAPE/index.html
^http://davis.wpi.edu/xmdv/index.html

PARAS/FIRE: Interactive Visual Support for Parameter Space-Driven Mining of Global Rules [PVLDB 2013, SIGMOD 2013, CIKM 2013]
Joint work with Xika Lin, Christopher Ryan Botaish, Jason Whitehouse, Elke A. Rundensteiner, Matthew O. Ward
Department of Computer Science, Worcester Polytechnic Institute (WPI), MA, USA.

Association Rule Mining (ARM) Basics
Which customers to target for multi-car discount promos?

RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   No       0
400       34   Yes      2
500       38   Yes      2

<Age: 30..39> and <Married: Yes> → <NumCars: 2>
Support = 40%, Confidence = 100%
R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94.
R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, SIGMOD’96.

Motivation for Interactive Mining
[Diagram labels: Data Analyst, (minsupp, minconf), Data Miner, {ARs}]
Limitations
- Unacceptably long response time.
- Trial-and-error iterations.
- Forced to rerun for each subset.
Research Goals
- Improve turnaround times of mining queries.
- Provide parameter recommendations.
- Preprocess data to enable fast interactive mining experience.
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
C. Hidber, Online Association Rule Mining, SIGMOD’99.
B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.
M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03.
M. Kaya and R. Alhajj, Online mining of fuzzy multidimensional weighted association rules, Applied Intelligence’08.

The State-of-the-art in Online Rule Mining
[Figure: itemset lattice over {}, X, Y, Z, XY, XZ, YZ and XYZ, annotated with support counts]
I. Frequent Itemset Generation: Offline
II. Rule Generation: Online
Assumptions
1. Cost(Freq. Itemset Generation) >> Cost(Rule Generation),
2. Count(Itemsets) << Count(Rules).
^Cost per GB of RAM: $1000 (in 2000) → $25 (in 2012).
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03.
M. Kaya and R. Alhajj, Online mining of fuzzy multidimensional weighted association rules, Applied Intelligence’08.
^ http://www.jcmit.com/memoryprice.htm

Adjacency Lattice and Redundancy
[Figure: adjacency lattice over {}, X, Y, Z, XY, XZ, YZ and XYZ, annotated with support counts]
Simple Redundancy [(AUC) = XYZ, (A)XYZ (A)XYZ]
Strict Redundancy [(AUC)XYZ ⊃ (AUC)XY]
- Starting with maximal ancestors of XYZ, i.e., X, Y and Z. If (XYZ qualifies) Then skip XY and XZ as antecedent (simple) or consequent (strict).
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.

Research Challenges
PARAS: Preprocessing/Computational aspects
1. Instead of the itemsets, can we pre-store association rules to altogether avoid the online rule generation step?
2. Instead of the itemset index, can we have a direct look-up using (minsupp, minconf)?
3. How to handle Redundancy Relationships in the context of the parameterized Index?
FIRE: Visualization aspects
1. How should we visually present these mining results to the users?
2. How can we leverage these results to support interactive rule exploration?
3. Can we utilize some data visualization techniques to help users better understand these mined results?

Parameter Space Model (PARAS)
Stable Regions
[Figure: rules derived from the itemset lattice placed in the parameter space, with Support on the horizontal axis and Confidence on the vertical axis; rule locations, cut lines l1, l2 and l3, and stable regions S1 and S2]

+ Stable Regions {S} w/ Neighbors + Rules
S1, S2
Further, re-examining the redundancy definitions, we observed certain properties that enabled us to optimize computation and storage of redundancy information with respect to the parameter space.*
* Xika Lin contributed majorly to the redundancy results.
PARAS: Parameter Space Framework for Online Association Mining, VLDB 2013.
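
The support and confidence values quoted for the multi-car discount example, and the idea of a direct look-up by (minsupp, minconf), can be illustrated with a small sketch. This is not the PARAS implementation: the function names (`support_confidence`, `build_index`, `query`) and the two extra rules are hypothetical, chosen only for illustration. The sketch assumes that each rule is stored offline at its (support, confidence) location, and that two parameter settings returning the same rule set can be read as lying in the same stable region.

```python
# Records from the multi-car promo example: (RecordID, Age, Married, NumCars).
RECORDS = [
    (100, 23, "No", 1),
    (200, 25, "Yes", 1),
    (300, 29, "No", 0),
    (400, 34, "Yes", 2),
    (500, 38, "Yes", 2),
]

# Rule name -> (antecedent predicate, consequent predicate).
# Only the first rule appears in the slides; the other two are hypothetical.
RULES = {
    "Age 30..39 & Married=Yes -> NumCars=2": (
        lambda r: 30 <= r[1] <= 39 and r[2] == "Yes",
        lambda r: r[3] == 2,
    ),
    "Married=Yes -> NumCars=2": (
        lambda r: r[2] == "Yes",
        lambda r: r[3] == 2,
    ),
    "Married=No -> NumCars=1": (
        lambda r: r[2] == "No",
        lambda r: r[3] == 1,
    ),
}


def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n_ante = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / len(records)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


def build_index(records, rules):
    """Offline step: store every rule at its (support, confidence) location."""
    return {
        name: support_confidence(records, ante, cons)
        for name, (ante, cons) in rules.items()
    }


def query(index, minsupp, minconf):
    """Online step: return the rules that satisfy both thresholds."""
    return sorted(
        name
        for name, (supp, conf) in index.items()
        if supp >= minsupp and conf >= minconf
    )


INDEX = build_index(RECORDS, RULES)

if __name__ == "__main__":
    for name, location in INDEX.items():
        print(name, location)
    # Two different settings that select the same rule set.
    print(query(INDEX, 0.3, 0.9))
    print(query(INDEX, 0.4, 0.7))
```

In this toy index the settings (0.3, 0.9) and (0.4, 0.7) return the same single rule, so an analyst moving between them would see no change; lowering minconf to 0.6 crosses the location of a second rule and changes the result.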

Framework for Interactive Rule Exploration (FIRE)
[Screenshots: Mushroom dataset; Chess dataset]

All rules versus unique rules view

Unique + non-redundant rules view

Two-region Comparison

Rule Glyph View
Lined Glyph*; Filled Glyph; MDS layout Filled Glyph
{poisonous? = edible} → {gill-attachment = free, veil-type = partial, veil-color = white}
*M. O. Ward, A taxonomy of glyph placement strategies for multidimensional data visualization, Information Visualization 2002.

PARAS: Experimental Evaluation
Data sets
- Synthetic*: IBM Quest Generator (T10I4D100k and T10I4D5000k). Tx_Iy_Dz = x avg # of items per transaction, y x 1k total # of items, z transactions.
- Real~: Chess, Mushroom, Webdocs`.
* R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94.
~ A. Asuncion and D. Newman, UCI ML repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
` C. Lucchese, S. Orlando, R. Perego and F. Silvestri, Webdocs: a real-life huge transactional dataset, FIMI’04.

PARAS: Experimental Evaluation
Tested Algorithms^
w/ redundancy resolution
1. Apriori_RR
2. Eclat_RR
3. FP-Growth_RR
4. AdjLattice_RR
5. PARAS_RR
w/o redundancy resolution
1. Apriori
2. Eclat
3. FP-Growth
4. PARAS
^ C. Borgelt, Efficient apriori, eclat & fp-growth, http://www.borgelt.net.

PARAS: Experimental Methodologies
1. Average Online Processing Times (w/ and w/o RR).
- Varying minsupp, fixed minconf
- Fixed minsupp, varying minconf
2. Offline Preprocessing Times (AdjLatticeRR vs. PARAS)

1. Average Online Processing Times (T5000k)
[Plots: w/ RR; w/o RR]
For a large diversity of online queries, PARAS consistently outperforms the state-of-the-art competitors from the literature by 2 to 5 orders of magnitude over the tested datasets.

2. Pre-processing Times
Rule Generation
- T5000k = 4 sec
- Webdocs = 220 sec
Confirmed: Cost(Freq. Itemset Generation) >> Cost(Rule Generation)
PARAS requires ~10% extra offline preprocess time compared with AdjLatticeRR.
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.

FIRE: User Study
Questions
- Stable Region Usage Tests
T1: What are the most prominent rules by support and confidence?
T2: Which setting (out of a choice of 4) returns a different set of rules?
T3: Find the common and unique rules for two distinct parameter settings.
- Filter/Redundancy Test
T4: Find the most frequent characteristics of edible and poisonous mushrooms.
- Skyline View Test
T5: Find the parameter settings that produce top-k rules in the dataset, where k = 20, 50, 100.
Setup
- 22 subjects
- Mushroom and chess datasets
- Cached Rule Miner (CRM) versus FIRE
- Randomization to eliminate pre-knowledge

Mushroom Dataset: Tasks 1, 2 and 3
Overall, FIRE outperforms the competitor CRM approach such that the users can achieve similar or better accuracy while having to use significantly less time for the tasks.

Tasks 4 and 5
Overall, FIRE outperforms the competitor CRM approach such that the users can achieve similar or better accuracy while having to use significantly less time for the tasks.

Conclusion
We proposed a novel parameter space model, developed optimal algorithms and designed effective visualizations to facilitate interactive rule exploration by tackling challenges related to both computational and visualization aspects of online rule mining.
Gains of several orders of magnitude when using PARAS for online processing outweigh the one-time minimal offline preprocessing time and storage requirements.
Our user study establishes usability and effectiveness of the proposed features and interactions of the FIRE system in facilitating interactive rule mining.

Recent works at Samsung Research America
User Behavior Analysis via On-device Mobile Sensing
- Association rule mining over multi-modal mobile context data
- Unobtrusively learn sequential patterns of mobile users
“Typically, when I am home on Sunday nights, I call my parents”
MobileMiner: Mining Your Frequent Behavior Patterns On Your Phone. V. Srinivasan et al., ACM UbiComp 2014 (Best Paper Nominee), HotMobile 2013.
Mobile Sequence Miner: Adding Intelligence to Your Mobile Device via On-Device Sequential Pattern Mining. A. Mukherji et al., ACM MCSS Workshop in UbiComp 2014.

Thanks
Contact me with questions:
Abhishek Mukherji
Samsung Research America
[email protected]
