Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
How to find good features from semi-structured raw data for classification
Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure

Feature Construction
- Most data mining and machine learning models assume structured data: (x1, x2, ..., xk) -> y, where the xi are independent variables and y is the dependent variable.
- y drawn from a discrete set: classification. y drawn from a continuous range: regression.
- When the feature vectors are good, the differences in accuracy among learners are small.
- Question: where do good features come from?

Frequent Pattern-Based Feature Extraction
- Data that do not come in pre-defined feature vectors: transactions, biological sequences, graph databases.
- Frequent patterns are good candidates for discriminative features. So, how do we mine them?
- [Figure: a discovered frequent sub-graph pattern shared by compounds NSC 4960, NSC 699181, NSC 40773, NSC 191370, and NSC 164863; example borrowed from a George Karypis presentation.]

Frequent Pattern Feature Vector Representation
- Each mined pattern (P1, P2, P3, ...) becomes a binary feature; each example (Data1, Data2, Data3, Data4, ...) is encoded by which patterns it contains, e.g., 110, 101, 110, 001, ...
- The constructed feature vectors can be fed to any classifier you can name: decision trees (DT), SVM, logistic regression (LR), ...
- Mining these predictive features is an NP-hard problem: 100 examples can yield up to 10^10 patterns, and most of them are useless.

Example
- 192 examples at 12% support (at least 12% of examples contain the pattern): 8,600 itemset patterns returned. 192 examples vs. 8,600 patterns?
- At 4% support: 92,000 patterns. 192 vs. 92,000??
- Most patterns have no predictive power and cannot be used to construct features.
- Our algorithm: only 20 highly predictive patterns suffice to construct a decision tree with about 90% accuracy.

Data in a "Bad" Feature Space
- Discriminative patterns are non-linear combinations of single features; they increase the expressive and discriminative power of the feature space.
- Example: data with features X and Y and class C that is non-linearly separable in (x, y). [Table of (X, Y, C) examples not reproduced here.]

New Feature Space (Solving the Problem)
- Construct a new binary feature F from the itemset {x = 0, y = 0} (equivalently the association rule x = 0 -> y = 0).
- With the added column F, the same data become linearly separable in (x, y, F). [Table of (X, Y, F, C) examples not reproduced here.]

Computational Issues
- A pattern is measured by its "frequency" or support, e.g., frequent sub-graphs with sup >= 10%, meaning at least 10% of examples contain the pattern.
- "Ordered" enumeration: one cannot enumerate the patterns at sup = 10% without first enumerating all patterns with support above 10%.
- NP-hard problem: easily up to 10^10 patterns for a realistic problem.
- Most patterns are non-discriminative, yet low-support patterns can have high discriminative power. Bad!
- Random sampling does not work since it is not exhaustive; because most patterns are useless, randomly sampling patterns (or blindly enumerating them without considering frequency) is useless.
- Mining on a small number of examples: with only a subset of the vocabulary the search is incomplete; with the complete vocabulary it does not help much and introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.

Conventional Procedure: Two-Step Batch Method
1. Mine frequent patterns (support above min_sup) from the dataset.
2. Select the most discriminative patterns (e.g., patterns 1, 2, and 4).
3. Represent the data in the feature space defined by those patterns.
4. Build classification models (DT, SVM, LR, any classifier you can name).
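To make the two-step batch method concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' code: the helper names (mine_frequent_itemsets, info_gain, two_step_batch), the brute-force itemset enumeration, and the toy data are assumptions made for this example; the default support (12%) and top-k (20) merely echo the numbers used in the slides.

```python
# Minimal sketch of the conventional two-step batch method (illustrative only).
from itertools import combinations
from collections import Counter
import math

from sklearn.tree import DecisionTreeClassifier


def mine_frequent_itemsets(transactions, min_sup, max_len=3):
    """Step 1: brute-force enumeration of itemsets with support >= min_sup."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t) / n
            if support >= min_sup:
                frequent.append(frozenset(cand))
    return frequent


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(pattern, transactions, labels):
    """Information gain of the binary test 'pattern occurs in the example'."""
    covered = [y for t, y in zip(transactions, labels) if pattern <= t]
    rest = [y for t, y in zip(transactions, labels) if not pattern <= t]
    n = len(labels)
    conditional = (len(covered) / n) * entropy(covered) + (len(rest) / n) * entropy(rest)
    return entropy(labels) - conditional


def two_step_batch(transactions, labels, min_sup=0.12, top_k=20):
    # Step 1: mine all frequent patterns above min_sup (the combinatorial explosion happens here).
    patterns = mine_frequent_itemsets(transactions, min_sup)
    # Step 2: select the top-k most discriminative patterns by information gain.
    patterns.sort(key=lambda p: info_gain(p, transactions, labels), reverse=True)
    selected = patterns[:top_k]
    # Step 3: represent each example as a binary pattern-occurrence vector.
    X = [[int(p <= t) for p in selected] for t in transactions]
    # Step 4: build a classification model on the constructed features.
    model = DecisionTreeClassifier().fit(X, labels)
    return selected, model


# Toy usage on made-up transactions; any classifier could replace the decision tree.
transactions = [frozenset(t) for t in ({"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"})]
labels = [1, 0, 0, 1]
selected, model = two_step_batch(transactions, labels, min_sup=0.25, top_k=5)
```

The next slides argue that Steps 1 and 2 are exactly where this baseline breaks down.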
Two Problems
- Mine step: combinatorial explosion. (1) Exponential explosion of the frequent-pattern set. (2) Patterns are not considered at all if min_support is not set small enough.
- Select step: issue of discriminative power. (3) Information gain is evaluated against the complete dataset, NOT on subsets of examples. (4) Correlation among patterns is not directly evaluated on their joint predictability.

Direct Mining & Selection via Model-based Search Tree
- Integrate the feature miner and the classifier into a single model-based search tree (MbT).

Basic Flow
- At the root node, run "Mine & Select" on the full dataset with a node-level support (e.g., P = 20%) and keep the most discriminative feature F based on information gain.
- Split the examples on whether the selected pattern occurs (Y/N) and recurse on each subset with the same node-level support, stopping when only few data remain.
- The result is a compact set of highly discriminative patterns and, at the same time, a tree-structured classifier; a sketch of this recursion is given after the graph-mining results below.

Divide-and-Conquer Based Frequent Pattern Mining
- Because the support threshold is applied to the shrinking data at each node, the effective global support can become extremely low, e.g., 10 x 20% / 10,000 = 0.02%.

Analyses (I)
1. Scalability (Theorem 1): an upper bound on the amount of mining, and a "scale-down" ratio showing how extremely low-support patterns are reached.
2. A bound on the number of returned features (Theorem 2).

Analyses (II)
3. Subspaces are important for discriminative patterns. Let C1 and C0 be the number of examples in class 1 and class 0, and let P1 and P0 be the number of examples in C1 and C0 that contain a pattern α. On the original set, α carries no information gain when it occurs at the same rate in both classes (P1/C1 = P0/C0), yet on subsets of the examples the same pattern can have positive information gain.
4. Non-overfitting.
5. Optimality under exhaustive search.

Experimental Studies: Itemset Mining (I) – Scalability Comparison
- Datasets: Adult, Chess, Hypo, Sick, Sonar.
- Number of patterns a conventional miner returns at the very low absolute support MbT reaches ("#Pat using MbT sup"): reported values include 95,507; 252,809; 423,439; 4,818,391; and effectively +∞ for one dataset.
- Ratio MbT #Pat / #Pat using MbT sup: 0.41%, ~0%, 0.0035%, 0.00032%, 0.00775% — MbT returns orders of magnitude fewer patterns.
- [Figure: log-scale comparison of DT vs. MbT pattern counts (Log #Pat) and absolute supports (Log AbsSupport) on the five datasets.]

Experimental Studies: Itemset Mining (II) – Accuracy of Mined Itemsets
- DT accuracy vs. MbT accuracy on Adult, Chess, Hypo, Sick, and Sonar (roughly 70%–100%): 4 wins and 1 loss for MbT, while using a much smaller number of patterns (log-scale #Pat comparison).

Experimental Studies: Itemset Mining (III) – Convergence
- [Figure: convergence behavior of MbT.]
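Before the graph-mining results, the Basic Flow and Divide-and-Conquer slides above can be made concrete with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: it reuses mine_frequent_itemsets() and info_gain() from the two-step sketch above, the 20% node-level support and the small-node stopping rule are taken from the slides' running example, and the dictionary-based tree representation is invented for this sketch.

```python
# Minimal sketch of the Model-based Search Tree recursion (illustrative only).
# Assumes mine_frequent_itemsets() and info_gain() from the two-step sketch above.
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_mbt(transactions, labels, node_sup=0.20, min_node_size=10):
    """Recursively 'Mine & Select' one discriminative pattern per node, then split."""
    # Stop on small or pure nodes ("Few Data" in the slides).
    if len(labels) <= min_node_size or len(set(labels)) == 1:
        return {"leaf": True, "label": majority(labels)}

    # Mine & Select on THIS node's data only: the support threshold is local,
    # so deep nodes reach extremely low global support (dynamic support).
    patterns = mine_frequent_itemsets(transactions, min_sup=node_sup)
    if not patterns:
        return {"leaf": True, "label": majority(labels)}
    best = max(patterns, key=lambda p: info_gain(p, transactions, labels))

    # Split on whether the selected pattern occurs (Y / N) and recurse.
    yes = [(t, y) for t, y in zip(transactions, labels) if best <= t]
    no = [(t, y) for t, y in zip(transactions, labels) if not best <= t]
    if not yes or not no:  # the pattern does not split this node; stop here
        return {"leaf": True, "label": majority(labels)}
    return {
        "leaf": False,
        "pattern": best,  # this node's contribution to the compact feature set
        "yes": build_mbt(*zip(*yes), node_sup=node_sup, min_node_size=min_node_size),
        "no": build_mbt(*zip(*no), node_sup=node_sup, min_node_size=min_node_size),
    }


def predict_mbt(tree, transaction):
    while not tree["leaf"]:
        tree = tree["yes"] if tree["pattern"] <= transaction else tree["no"]
    return tree["label"]
```

The union of the "pattern" fields over all internal nodes is the compact set of discriminative features, and the tree itself is the classifier; since the miner is just a parameter of the recursion, an itemset, sequence, or graph miner can be plugged in, which is the plug-and-play point made in the Summary.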
Experimental Studies: Graph Mining (I)
- 9 NCI anti-cancer screen datasets (URL: http://dtp.nci.nih.gov) and 2 AIDS anti-viral screen datasets from The PubChem Project (URL: pubchem.ncbi.nlm.nih.gov): H1 (CM+CA, 3.5%) and H2 (CA, 1%).
- Active (positive) class: around 1% – 8.3% of the examples. [Figure: example molecular structures.]

Experimental Studies: Graph Mining (II) – Scalability
- Number of patterns (DT #Pat vs. MbT #Pat, scale 0–1,800) and log absolute support (Log(DT Abs Support) vs. Log(MbT Abs Support)) compared on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, and H2. [Figure: bar charts of the comparison.]

Experimental Studies: Graph Mining (III) – AUC and Accuracy
- AUC, DT vs. MbT (roughly 0.5–0.8): 11 wins for MbT across NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, and H2.
- Accuracy, DT vs. MbT (roughly 0.88–1.0): 10 wins and 1 loss for MbT.

Experimental Studies: Graph Mining (IV) – MbT vs. Benchmarks
- AUC of MbT and DT compared against benchmark methods: 7 wins, 4 losses.

Summary
- Model-based Search Tree:
  - Integrated feature mining and construction.
  - Dynamic support: can mine patterns with extremely small support.
  - Is both a feature-construction method and a classifier.
  - Not limited to one type of frequent pattern: plug-and-play.
- Experimental results on itemset mining and graph mining.
- Software and datasets available from: www.cs.columbia.edu/~wfan
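The "dynamic support" point in the Summary is just the arithmetic already hinted at in the Divide-and-Conquer slide, stated generally. The formula below is a restatement of that example; the symbol names are ours, not the paper's.

```latex
% Effective global support of a pattern mined at a deep node of the search tree.
% s_node : node-level minimum support (e.g., 20%)
% n_node : number of examples reaching the node (e.g., 10)
% N      : total number of training examples (e.g., 10,000)
\[
  s_{\text{global}} \;=\; s_{\text{node}} \cdot \frac{n_{\text{node}}}{N}
  \qquad\text{e.g.}\qquad
  0.20 \times \frac{10}{10\,000} \;=\; 0.02\% ,
\]
% i.e., a pattern needs only 0.20 x 10 = 2 occurrences to be reported at that node,
% a support level no batch miner could afford globally.
```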