Facilitating Interactive Mining of
Global and Local Association Rules
Abhishek Mukherji*
Elke A. Rundensteiner
Matthew O. Ward
Department of Computer Science, Worcester Polytechnic Institute, MA, USA.
*Samsung Research America, CA, USA.
XmdvTool is an open-source multivariate visual analytics tool developed at WPI, supported by a series of NSF grants over the
past 20 years (http://sourceforge.net/projects/xmdvtool/).
This PhD research work was partly supported by NSF under grants IIS-0812027, CCF-0811510 and IIS-1117139.
Era of Big Data ... and we are DRIVING!
Volume
Variety
Velocity
Veracity
1. Where’s the Data in the Big Data Wave? Gerhard Weikum, Res. Director at Max Planck Institute, http://wp.sigmod.org/?p=786.
2. Analytic DB Technology for the Data Enthusiast. Pat Hanrahan, Stanford & Tableau, SIGMOD’12 Keynote Talk.
11/03/2014
XmdvTool’s Efforts Towards This Paradigm Shift
Visualize Static Data
I. Visualize Stream & Sensor Data
SNIFTool & FireStream
Visualize Data Records
ViStream*
II. Visualize Mined Results
PARAS/FIRE
COLARM
*Di Yang et al., Interactive visual exploration of neighbor-based patterns in data streams, ACM SIGMOD’10 Demo.
Summary of Graduate Research Works
CAPE*
I. Stream & Sensor Data Processing
1. SNIFTool/FireStream: Discover Patterns in Live Stream [CIKM ’08, ICDE Demo ’07]
2. JAQPOT: High Velocity Streams MJoin Execution [BNCOD ’11]
XmdvTool^
II. Interactive Mining
1. PARAS/FIRE [VLDB’13, SIGMOD’13, CIKM’13]
2. COLARM [EDBT’14]
III. Scalable Nugget-guided Hypothesis Testing
1. SPHINX: Evidence-Hypotheses Exploration [CIKM’13]
2. Iterative Multi-Evidence-Hypotheses Model
*http://davis.wpi.edu/dsrg/PROJECTS/CAPE/index.html
^http://davis.wpi.edu/xmdv/index.html
PARAS/FIRE: Interactive Visual Support for
Parameter Space-Driven Mining of Global Rules
[PVLDB 2013, SIGMOD 2013, CIKM 2013]
Joint work with
Xika Lin, Christopher Ryan Botaish, Jason Whitehouse,
Elke A. Rundensteiner, Matthew O. Ward
Department of Computer Science, Worcester Polytechnic Institute (WPI), MA, USA.
Association Rule Mining (ARM) Basics
Which customers to target for
multi-car discount promos?
RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   No       0
400       34   Yes      2
500       38   Yes      2
<Age: 30..39> and <Married: Yes> ⇒ <NumCars: 2>
Support = 40%, Confidence = 100%
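The support and confidence figures above can be checked directly against the five-record table. The snippet below is a minimal sketch; the function and variable names are illustrative and do not come from the talk.

```python
# Toy table from the slide: (RecordID, Age, Married, NumCars).
records = [
    (100, 23, "No", 1),
    (200, 25, "Yes", 1),
    (300, 29, "No", 0),
    (400, 34, "Yes", 2),
    (500, 38, "Yes", 2),
]


def antecedent(rec):
    """<Age: 30..39> and <Married: Yes>."""
    _, age, married, _ = rec
    return 30 <= age <= 39 and married == "Yes"


def consequent(rec):
    """<NumCars: 2>."""
    return rec[3] == 2


def support_confidence(rows, lhs, rhs):
    """Return (support, confidence) of the rule lhs => rhs over rows."""
    n_lhs = sum(1 for r in rows if lhs(r))
    n_both = sum(1 for r in rows if lhs(r) and rhs(r))
    support = n_both / len(rows)                     # fraction of all records
    confidence = n_both / n_lhs if n_lhs else 0.0    # fraction of matching lhs
    return support, confidence


support, confidence = support_confidence(records, antecedent, consequent)
print(f"support={support:.0%}, confidence={confidence:.0%}")
# prints: support=40%, confidence=100%
```

Records 400 and 500 are the only ones matching the antecedent, and both own two cars, which gives 2/5 support and 2/2 confidence.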
R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94.
R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, SIGMOD’96.
Motivation for Interactive Mining
[Diagram: the Data Analyst submits (minsupp, minconf) to the Data Miner, which returns {ARs}.]
Limitations
Unacceptably long response time.
Trial-and-error iterations.
Forced to rerun for each subset.
Research Goals
Improve turnaround times of mining queries.
Provide parameter recommendations.
Preprocess data to enable fast interactive mining experience.
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
C. Hidber. Online Association Rule Mining, SIGMOD’99.
B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.
M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03.
M. Kaya and R. Alhajj. Online mining of fuzzy multidimensional weighted association rules. Applied Intelligence’08.
The State-of-the-art in Online Rule Mining
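[Figure: itemset lattice over items X, Y and Z ({}, X, Y, Z, XY, XZ, YZ, XYZ), each node annotated with its support count out of 100 transactions.]
I. Frequent Itemset Generation: Offline
II. Rule Generation: Online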
Assumptions
1. Cost(Freq. Itemset Generation) >> Cost(Rule Generation),
2. Count(Itemsets) << Count(Rules).
^Cost per GB of RAM: from $1000 (in 2000) to $25 (in 2012).
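The two-phase split can be sketched as follows: itemset supports are computed once offline, and each (minsupp, minconf) request only runs the rule generation step over the stored counts. This is a minimal illustration and not the adjacency lattice implementation from the cited work; the support counts in `SUPPORT` are assumed values for a three-item example, and `generate_rules` is a hypothetical name.

```python
from itertools import combinations

# Offline phase result: support counts per itemset.
# Counts are assumed for illustration (out of 100 transactions).
SUPPORT = {
    frozenset(): 100,
    frozenset("X"): 80, frozenset("Y"): 60, frozenset("Z"): 40,
    frozenset("XY"): 40, frozenset("XZ"): 20, frozenset("YZ"): 20,
    frozenset("XYZ"): 10,
}
N = SUPPORT[frozenset()]


def generate_rules(minsupp, minconf):
    """Online phase: derive rules A => C from the pre-stored itemset counts."""
    rules = []
    for itemset, count in SUPPORT.items():
        if len(itemset) < 2 or count / N < minsupp:
            continue  # a rule needs a non-empty antecedent and consequent
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                a = frozenset(ante)
                conf = count / SUPPORT[a]
                if conf >= minconf:
                    lhs = "".join(sorted(a))
                    rhs = "".join(sorted(itemset - a))
                    rules.append((lhs, rhs, count / N, round(conf, 3)))
    return sorted(rules)


print(generate_rules(0.4, 0.5))
# prints: [('X', 'Y', 0.4, 0.5), ('Y', 'X', 0.4, 0.667)]
```

Even in this toy case, every request re-enumerates antecedents for each qualifying itemset, which is the online cost that the second assumption on the slide concerns.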
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03.
M. Kaya and R. Alhajj. Online mining of fuzzy multidimensional weighted association rules. Applied Intelligence’08.
^ http://www.jcmit.com/memoryprice.htm
Adjacency Lattice and Redundancy
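[Figure: adjacency lattice over items X, Y and Z ({}, X, Y, Z, XY, XZ, YZ, XYZ), each node annotated with its support count out of 100 transactions.]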
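Simple Redundancy: rules from the same itemset (A ∪ C = XYZ); e.g., XY ⇒ Z is redundant given X ⇒ YZ.
Strict Redundancy: rules from an itemset and one of its subsets (XY ⊂ XYZ); e.g., X ⇒ Y is redundant given X ⇒ YZ.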
Starting with maximal ancestors of XYZ, i.e., X, Y and Z.
If (XYZ qualifies)
Then skip XY and XZ as antecedent (simple) or consequent (strict).
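The two redundancy notions can be expressed as small set predicates. The sketch below follows a common reading of the definitions in the cited Aggarwal and Yu paper: a rule is simply redundant if it comes from the same itemset as a base rule but has a strictly larger antecedent, and strictly redundant if it comes from a proper subset itemset and its antecedent contains the base rule's antecedent. The helper names (`rule`, `simply_redundant`, `strictly_redundant`, `prune`) are hypothetical.

```python
def rule(antecedent, consequent):
    """Represent a rule A => C as a pair of frozensets of items."""
    return frozenset(antecedent), frozenset(consequent)


def simply_redundant(r, base):
    """Same itemset as base, but a strictly larger antecedent."""
    (a, c), (a0, c0) = r, base
    return (a | c) == (a0 | c0) and a0 < a


def strictly_redundant(r, base):
    """Itemset is a proper subset of base's, antecedent contains base's."""
    (a, c), (a0, c0) = r, base
    return (a | c) < (a0 | c0) and a0 <= a


def prune(rules):
    """Keep only the rules that are not redundant w.r.t. another rule."""
    return [
        r for r in rules
        if not any(
            simply_redundant(r, b) or strictly_redundant(r, b)
            for b in rules if b != r
        )
    ]


base = rule("X", "YZ")
candidates = [base, rule("XY", "Z"), rule("X", "Y"), rule("Y", "Z")]
kept = prune(candidates)  # X => YZ and Y => Z survive
```

In both cases the redundant rule has support and confidence at least as high as the base rule (for example, conf(XY ⇒ Z) = supp(XYZ)/supp(XY) ≥ supp(XYZ)/supp(X)), so it qualifies whenever the base rule does and carries no extra information.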
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
Research Challenges
PARAS: Preprocessing/Computational aspects
1. Instead of the itemsets, can we pre-store association rules to altogether
avoid the online rule generation step?
2. Instead of the itemset index, can we have a direct look-up using
(minsupp, minconf)?
3. How to handle Redundancy Relationships in the context of the
parameterized Index?
FIRE: Visualization aspects
1. How should we visually present these mining results to the users?
2. How can we leverage these results to support interactive rule exploration?
3. Can we utilize some data visualization techniques to help users better
understand these mined results?
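Parameter Space Model (PARAS)
Stable Regions
[Figure: the itemset lattice over X, Y and Z with support counts, shown next to the two-dimensional parameter space (axes: Support and Confidence, with ticks at 0.2, 0.5, 0.8 and 1). Rules are plotted at their (support, confidence) locations, e.g. (0.4, 0.67), (0.4, 0.5) and (0.1, 0.125). Lines l1, l2 and l3 and the stable regions S1 = S(0, 0.5) and S2 = S(0.2, 0) are marked.]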
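The idea of answering a (minsupp, minconf) request by direct look-up can be sketched with a coarse grid over the distinct rule locations. This is a simplification for illustration and not the index structure of the PARAS paper: the support counts in `COUNTS` are assumed values, and `rule_points` and `ParameterSpaceIndex` are hypothetical names. Every query that snaps to the same grid cell returns the same rule set, which is the property a stable region captures.

```python
from bisect import bisect_left
from itertools import combinations

# Assumed support counts (out of 100 transactions), for illustration only.
COUNTS = {"": 100, "X": 80, "Y": 60, "Z": 40,
          "XY": 40, "XZ": 20, "YZ": 20, "XYZ": 10}


def rule_points(counts):
    """Map every rule A=>C to its (support, confidence) location."""
    n = counts[""]
    points = {}
    for items, cnt in counts.items():
        for k in range(1, len(items)):
            for ante in combinations(items, k):
                a = "".join(ante)
                c = "".join(i for i in items if i not in ante)
                points[f"{a}=>{c}"] = (cnt / n, cnt / counts[a])
    return points


class ParameterSpaceIndex:
    """Offline: precompute one answer per grid cell. Online: two bisects."""

    def __init__(self, points):
        self.supps = sorted({s for s, _ in points.values()})
        self.confs = sorted({c for _, c in points.values()})
        self.cells = {
            (i, j): frozenset(
                r for r, (rs, rc) in points.items() if rs >= s and rc >= c
            )
            for i, s in enumerate(self.supps)
            for j, c in enumerate(self.confs)
        }

    def query(self, minsupp, minconf):
        # Snap the request to the first grid line at or above each threshold.
        i = bisect_left(self.supps, minsupp)
        j = bisect_left(self.confs, minconf)
        if i == len(self.supps) or j == len(self.confs):
            return frozenset()  # thresholds exceed every rule location
        return self.cells[(i, j)]


index = ParameterSpaceIndex(rule_points(COUNTS))
print(sorted(index.query(0.4, 0.5)))   # prints: ['X=>Y', 'Y=>X']
print(sorted(index.query(0.3, 0.45)))  # same cell, so the same rule set
```

With these assumed counts, Y ⇒ X lands at (0.4, 0.67) and X ⇒ Y at (0.4, 0.5), matching two of the locations shown in the figure.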
Stable Regions {S} w/ Neighbors + Rules
[Figure: stable regions S1 and S2.]
Further, re-examining the redundancy definitions, we observed certain
properties that enabled us to optimize computation and storage of
redundancy information with respect to the parameter space.*
* Xika Lin was the major contributor to the redundancy results.
PARAS: Parameter Space Framework for Online Association Mining, VLDB 2013.
Framework for Interactive Rule Exploration (FIRE)
Mushroom dataset
Chess dataset
All rules versus unique rules view
Unique + non-redundant rules view
Two-region Comparison
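One way to read a two-region comparison is as set operations over the rule sets returned for two parameter settings. The sketch below is illustrative only: `compare_regions` is a hypothetical helper, and the rule strings are made-up examples rather than results from the Mushroom or Chess datasets.

```python
def compare_regions(rules_a, rules_b):
    """Split two rule sets into common rules and rules unique to each."""
    a, b = set(rules_a), set(rules_b)
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


# Hypothetical rule sets for two (minsupp, minconf) settings.
region_1 = {"X=>Y", "Y=>X"}
region_2 = {"Y=>X", "Z=>X", "Z=>Y"}

diff = compare_regions(region_1, region_2)
print(sorted(diff["common"]))  # prints: ['Y=>X']
print(sorted(diff["only_a"]))  # prints: ['X=>Y']
print(sorted(diff["only_b"]))  # prints: ['Z=>X', 'Z=>Y']
```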
Rule Glyph View
Lined Glyph*
{poisonous? = edible}
{gill-attachment = free,
veil-type = partial,
veil-color = white}
Filled Glyph MDS layout
Filled Glyph
*M. O. Ward, A taxonomy of glyph placement strategies for multidimensional data visualization,
Information Visualization 2002.
PARAS: Experimental Evaluation
Data sets
Synthetic*: IBM Quest Generator (T10I4D100k and T10I4D5000k).
Tx_Iy_Dz: x = avg # of items per transaction, y × 1k = total # of items, z = # of transactions.
Real~: Chess, Mushroom, Webdocs`.
* R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94.
~ A. Asuncion and D. Newman, UCI ML repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
` C. Lucchese and S. Orlando and R. Perego and F. Silvestri, Webdocs: a real-life huge transactional dataset. FIMI’04.
PARAS: Experimental Evaluation
Tested Algorithms^
w/ redundancy resolution
1. Apriori_RR
2. Eclat_RR
3. FP-Growth_RR
4. AdjLattice_RR
5. PARAS_RR
w/o redundancy resolution
1. Apriori
2. Eclat
3. FP-Growth
4. PARAS
^ C. Borgelt, Efficient apriori, eclat & fp-growth, http://www.borgelt.net.
PARAS: Experimental Methodologies
1. Average Online Processing Times (w/ and w/o RR).
   Varying minsupp, fixed minconf
   Fixed minsupp, varying minconf
2. Offline Preprocessing Times (AdjLattice_RR vs. PARAS)
1. Average Online Processing times (T5000k)
w/ RR
w/o RR
For a large diversity of online queries, PARAS consistently outperforms the state-of-the-art
competitors from the literature by 2 to 5 orders of magnitude over the tested datasets.
2. Pre-processing Times
Rule Generation: T5000k = 4 sec, Webdocs = 220 sec
Confirmed: Cost(Freq. Itemset Generation) >> Cost(Rule Generation)
PARAS requires ~10% extra offline preprocessing time compared with AdjLattice_RR.
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.
FIRE: User Study
Questions
Stable Region Usage Tests
T1: What are the most prominent rules by support and confidence?
T2: Which setting (out of a choice of 4) returns a different set of rules?
T3: Find the common and unique rules for two distinct parameter settings.
Filter/Redundancy Test
T4: Find the most frequent characteristics of edible and poisonous mushrooms.
Skyline View Test
T5: Find the parameter settings that produce top-k rules in the dataset,
where k = 20, 50, 100.
22 subjects
Mushroom and chess datasets
Cached Rule Miner (CRM) versus FIRE
Randomization to eliminate pre-knowledge
Mushroom Dataset: Tasks 1, 2 and 3
Overall, FIRE outperforms the competing CRM approach: users achieve similar or better
accuracy while spending significantly less time on the tasks.
Tasks 4 and 5
Overall, FIRE outperforms the competing CRM approach: users achieve similar or better
accuracy while spending significantly less time on the tasks.
Conclusion
We proposed a novel parameter space model, developed optimal algorithms and
designed effective visualizations to facilitate interactive rule exploration by tackling
challenges related to both computational and visualization aspects of online rule mining.
Gains of several orders of magnitude when using PARAS for online processing outweigh
the one-time minimal offline preprocessing time and storage requirements.
Our user study establishes the usability and effectiveness of the proposed features and
interactions of the FIRE system in facilitating interactive rule mining.
Recent works at Samsung Research America
User Behavior Analysis via
On-device Mobile Sensing
Association rule mining over
multi-modal mobile context data
Unobtrusively learn sequential patterns of
mobile users
“Typically, when I am home on Sunday nights,
I call my parents”
MobileMiner: Mining Your Frequent Behavior Patterns On Your Phone
V Srinivasan et al., ACM UbiComp 2014 (Best Paper Nominee), HotMobile 2013.
Mobile Sequence Miner: Adding Intelligence to Your Mobile Device via On-Device Sequential Pattern Mining
A Mukherji et al., ACM MCSS Workshop in UbiComp 2014.
Thanks
Contact me with questions:
Abhishek Mukherji
Samsung Research America
[email protected]