Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
How to find good features from semi-structured raw data for classification
Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure

Feature Construction
- Most data mining and machine learning models assume structured data: (x1, x2, ..., xk) -> y, where the xi are independent variables and y is the dependent variable.
- y drawn from a discrete set: classification. y drawn from a continuous range: regression.
- When the feature vectors are good, the differences in accuracy among learners are small.
- Question: where do good features come from?

Frequent Pattern-Based Feature Extraction
- Data that do not come in pre-defined feature vectors: transactions, biological sequences, graph databases.
- Frequent patterns are good candidates for discriminative features. So, how do we mine them?
- [Figure: a discovered frequent sub-graph pattern shared by compounds NSC 4960, NSC 699181, NSC 40773, NSC 191370, and NSC 164863; example borrowed from a George Karypis presentation.]

Frequent Pattern Feature Vector Representation
- Each mined pattern (P1, P2, P3, ...) becomes a binary feature; each example (Data1, Data2, Data3, Data4, ...) is encoded by which patterns it contains, e.g., 110, 101, 110, 001, ...
- The constructed feature vectors can be fed to any classifier you can name: decision trees (DT), SVM, logistic regression (LR), ...
- Mining these predictive features is an NP-hard problem: 100 examples can yield up to 10^10 patterns, and most of them are useless.

Example
- 192 examples at 12% support (at least 12% of examples contain the pattern): 8,600 itemset patterns returned. 192 examples vs. 8,600 patterns?
- At 4% support: 92,000 patterns. 192 vs. 92,000??
- Most patterns have no predictive power and cannot be used to construct features.
- Our algorithm: only 20 highly predictive patterns suffice to construct a decision tree with about 90% accuracy.

Data in a "Bad" Feature Space
- Discriminative patterns are non-linear combinations of single features; they increase the expressive and discriminative power of the feature space.
- Example: data with features X and Y and class C that is non-linearly separable in (x, y). [Table of (X, Y, C) examples not reproduced here.]

New Feature Space (Solving the Problem)
- Construct a new binary feature F from the itemset {x = 0, y = 0} (equivalently the association rule x = 0 -> y = 0).
- With the added column F, the same data become linearly separable in (x, y, F). [Table of (X, Y, F, C) examples not reproduced here.]

Computational Issues
- A pattern is measured by its "frequency" or support, e.g., frequent sub-graphs with sup >= 10%, meaning at least 10% of examples contain the pattern.
- "Ordered" enumeration: one cannot enumerate the patterns at sup = 10% without first enumerating all patterns with support above 10%.
- NP-hard problem: easily up to 10^10 patterns for a realistic problem.
- Most patterns are non-discriminative, yet low-support patterns can have high discriminative power. Bad!
- Random sampling does not work since it is not exhaustive; because most patterns are useless, randomly sampling patterns (or blindly enumerating them without considering frequency) is useless.
- Mining on a small number of examples: with only a subset of the vocabulary the search is incomplete; with the complete vocabulary it does not help much and introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.

Conventional Procedure: Two-Step Batch Method
1. Mine frequent patterns (support above min_sup) from the dataset.
2. Select the most discriminative patterns (e.g., patterns 1, 2, and 4).
3. Represent the data in the feature space defined by those patterns.
4. Build classification models (DT, SVM, LR, any classifier you can name).
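To make the two-step batch method concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' code: the helper names (mine_frequent_itemsets, info_gain, two_step_batch), the brute-force itemset enumeration, and the toy data are assumptions made for this example; the default support (12%) and top-k (20) merely echo the numbers used in the slides.

```python
# Minimal sketch of the conventional two-step batch method (illustrative only).
from itertools import combinations
from collections import Counter
import math

from sklearn.tree import DecisionTreeClassifier


def mine_frequent_itemsets(transactions, min_sup, max_len=3):
    """Step 1: brute-force enumeration of itemsets with support >= min_sup."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t) / n
            if support >= min_sup:
                frequent.append(frozenset(cand))
    return frequent


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(pattern, transactions, labels):
    """Information gain of the binary test 'pattern occurs in the example'."""
    covered = [y for t, y in zip(transactions, labels) if pattern <= t]
    rest = [y for t, y in zip(transactions, labels) if not pattern <= t]
    n = len(labels)
    conditional = (len(covered) / n) * entropy(covered) + (len(rest) / n) * entropy(rest)
    return entropy(labels) - conditional


def two_step_batch(transactions, labels, min_sup=0.12, top_k=20):
    # Step 1: mine all frequent patterns above min_sup (the combinatorial explosion happens here).
    patterns = mine_frequent_itemsets(transactions, min_sup)
    # Step 2: select the top-k most discriminative patterns by information gain.
    patterns.sort(key=lambda p: info_gain(p, transactions, labels), reverse=True)
    selected = patterns[:top_k]
    # Step 3: represent each example as a binary pattern-occurrence vector.
    X = [[int(p <= t) for p in selected] for t in transactions]
    # Step 4: build a classification model on the constructed features.
    model = DecisionTreeClassifier().fit(X, labels)
    return selected, model


# Toy usage on made-up transactions; any classifier could replace the decision tree.
transactions = [frozenset(t) for t in ({"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"})]
labels = [1, 0, 0, 1]
selected, model = two_step_batch(transactions, labels, min_sup=0.25, top_k=5)
```

The next slides argue that Steps 1 and 2 are exactly where this baseline breaks down.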
Two Problems
- Mine step: combinatorial explosion. (1) Exponential explosion of the frequent-pattern set. (2) Patterns are not considered at all if min_support is not set small enough.
- Select step: issue of discriminative power. (3) Information gain is evaluated against the complete dataset, NOT on subsets of examples. (4) Correlation among patterns is not directly evaluated on their joint predictability.

Direct Mining & Selection via Model-based Search Tree
- Integrate the feature miner and the classifier into a single model-based search tree (MbT).

Basic Flow
- At the root node, run "Mine & Select" on the full dataset with a node-level support (e.g., P = 20%) and keep the most discriminative feature F based on information gain.
- Split the examples on whether the selected pattern occurs (Y/N) and recurse on each subset with the same node-level support, stopping when only few data remain.
- The result is a compact set of highly discriminative patterns and, at the same time, a tree-structured classifier; a sketch of this recursion is given after the graph-mining results below.

Divide-and-Conquer Based Frequent Pattern Mining
- Because the support threshold is applied to the shrinking data at each node, the effective global support can become extremely low, e.g., 10 x 20% / 10,000 = 0.02%.

Analyses (I)
1. Scalability (Theorem 1): an upper bound on the amount of mining, and a "scale-down" ratio showing how extremely low-support patterns are reached.
2. A bound on the number of returned features (Theorem 2).

Analyses (II)
3. Subspaces are important for discriminative patterns. Let C1 and C0 be the number of examples in class 1 and class 0, and let P1 and P0 be the number of examples in C1 and C0 that contain a pattern α. On the original set, α carries no information gain when it occurs at the same rate in both classes (P1/C1 = P0/C0), yet on subsets of the examples the same pattern can have positive information gain.
4. Non-overfitting.
5. Optimality under exhaustive search.

Experimental Studies: Itemset Mining (I) – Scalability Comparison
- Datasets: Adult, Chess, Hypo, Sick, Sonar.
- Number of patterns a conventional miner returns at the very low absolute support MbT reaches ("#Pat using MbT sup"): reported values include 95,507; 252,809; 423,439; 4,818,391; and effectively +∞ for one dataset.
- Ratio MbT #Pat / #Pat using MbT sup: 0.41%, ~0%, 0.0035%, 0.00032%, 0.00775% — MbT returns orders of magnitude fewer patterns.
- [Figure: log-scale comparison of DT vs. MbT pattern counts (Log #Pat) and absolute supports (Log AbsSupport) on the five datasets.]

Experimental Studies: Itemset Mining (II) – Accuracy of Mined Itemsets
- DT accuracy vs. MbT accuracy on Adult, Chess, Hypo, Sick, and Sonar (roughly 70%–100%): 4 wins and 1 loss for MbT, while using a much smaller number of patterns (log-scale #Pat comparison).

Experimental Studies: Itemset Mining (III) – Convergence
- [Figure: convergence behavior of MbT.]
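Before the graph-mining results, the Basic Flow and Divide-and-Conquer slides above can be made concrete with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: it reuses mine_frequent_itemsets() and info_gain() from the two-step sketch above, the 20% node-level support and the small-node stopping rule are taken from the slides' running example, and the dictionary-based tree representation is invented for this sketch.

```python
# Minimal sketch of the Model-based Search Tree recursion (illustrative only).
# Assumes mine_frequent_itemsets() and info_gain() from the two-step sketch above.
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_mbt(transactions, labels, node_sup=0.20, min_node_size=10):
    """Recursively 'Mine & Select' one discriminative pattern per node, then split."""
    # Stop on small or pure nodes ("Few Data" in the slides).
    if len(labels) <= min_node_size or len(set(labels)) == 1:
        return {"leaf": True, "label": majority(labels)}

    # Mine & Select on THIS node's data only: the support threshold is local,
    # so deep nodes reach extremely low global support (dynamic support).
    patterns = mine_frequent_itemsets(transactions, min_sup=node_sup)
    if not patterns:
        return {"leaf": True, "label": majority(labels)}
    best = max(patterns, key=lambda p: info_gain(p, transactions, labels))

    # Split on whether the selected pattern occurs (Y / N) and recurse.
    yes = [(t, y) for t, y in zip(transactions, labels) if best <= t]
    no = [(t, y) for t, y in zip(transactions, labels) if not best <= t]
    if not yes or not no:  # the pattern does not split this node; stop here
        return {"leaf": True, "label": majority(labels)}
    return {
        "leaf": False,
        "pattern": best,  # this node's contribution to the compact feature set
        "yes": build_mbt(*zip(*yes), node_sup=node_sup, min_node_size=min_node_size),
        "no": build_mbt(*zip(*no), node_sup=node_sup, min_node_size=min_node_size),
    }


def predict_mbt(tree, transaction):
    while not tree["leaf"]:
        tree = tree["yes"] if tree["pattern"] <= transaction else tree["no"]
    return tree["label"]
```

The union of the "pattern" fields over all internal nodes is the compact set of discriminative features, and the tree itself is the classifier; since the miner is just a parameter of the recursion, an itemset, sequence, or graph miner can be plugged in, which is the plug-and-play point made in the Summary.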
Experimental Studies: Graph Mining (I)
- 9 NCI anti-cancer screen datasets (URL: http://dtp.nci.nih.gov) and 2 AIDS anti-viral screen datasets from The PubChem Project (URL: pubchem.ncbi.nlm.nih.gov): H1 (CM+CA, 3.5%) and H2 (CA, 1%).
- Active (positive) class: around 1% – 8.3% of the examples. [Figure: example molecular structures.]

Experimental Studies: Graph Mining (II) – Scalability
- Number of patterns (DT #Pat vs. MbT #Pat, scale 0–1,800) and log absolute support (Log(DT Abs Support) vs. Log(MbT Abs Support)) compared on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, and H2. [Figure: bar charts of the comparison.]

Experimental Studies: Graph Mining (III) – AUC and Accuracy
- AUC, DT vs. MbT (roughly 0.5–0.8): 11 wins for MbT across NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, and H2.
- Accuracy, DT vs. MbT (roughly 0.88–1.0): 10 wins and 1 loss for MbT.

Experimental Studies: Graph Mining (IV) – MbT vs. Benchmarks
- AUC of MbT and DT compared against benchmark methods: 7 wins, 4 losses.

Summary
- Model-based Search Tree:
  - Integrated feature mining and construction.
  - Dynamic support: can mine patterns with extremely small support.
  - Is both a feature-construction method and a classifier.
  - Not limited to one type of frequent pattern: plug-and-play.
- Experimental results on itemset mining and graph mining.
- Software and datasets available from: www.cs.columbia.edu/~wfan
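The "dynamic support" point in the Summary is just the arithmetic already hinted at in the Divide-and-Conquer slide, stated generally. The formula below is a restatement of that example; the symbol names are ours, not the paper's.

```latex
% Effective global support of a pattern mined at a deep node of the search tree.
% s_node : node-level minimum support (e.g., 20%)
% n_node : number of examples reaching the node (e.g., 10)
% N      : total number of training examples (e.g., 10,000)
\[
  s_{\text{global}} \;=\; s_{\text{node}} \cdot \frac{n_{\text{node}}}{N}
  \qquad\text{e.g.}\qquad
  0.20 \times \frac{10}{10\,000} \;=\; 0.02\% ,
\]
% i.e., a pattern needs only 0.20 x 10 = 2 occurrences to be reported at that node,
% a support level no batch miner could afford globally.
```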