From Feature Construction, to Simple but Effective Modeling, to Domain Transfer
Wei Fan, IBM T.J. Watson Research Center
www.cs.columbia.edu/~wfan, www.weifan.info, [email protected], [email protected]

Feature Vector
- Most data mining and machine learning models assume the following structured data: (x1, x2, ..., xk) -> y, where the xi's are independent variables and y is the dependent variable.
- y drawn from a discrete set: classification.
- y drawn from a continuous range: regression.

Frequent Pattern-Based Feature Construction
- Some data does not come in pre-defined feature vectors: transactions, biological sequences, graph databases.
- Frequent patterns are good candidates for discriminative features. So, how do we mine them?

A Discovered Pattern (FP: Sub-graph)
[Figure: a frequent sub-graph shared by compounds NSC 4960, NSC 699181, NSC 40773, NSC 191370 and NSC 164863; example borrowed from George Karypis' presentation.]

Computational Issues
- A pattern is measured by its "frequency", or support, e.g., frequent subgraphs with sup >= 10%.
- One cannot enumerate the patterns with sup = 10% without first enumerating all patterns with support > 10%.
- Random sampling does not work, since it is not exhaustive.
- An NP-hard problem.

Conventional Procedure: the Two-Step Batch Method
1. Mine frequent patterns (support > sup) from the dataset.
2. Select the most discriminative patterns (e.g., F1, F2, F4).
3. Represent the data in the feature space of the selected patterns (e.g., Data1 -> 110, Data2 -> 101, Data3 -> 110, Data4 -> 001, ...).
4. Build classification models: decision tree, SVM, logistic regression, any classifier you can name.
In short: feature construction followed by selection.

Two Problems
- Mine step: combinatorial explosion.
  1. Exponential explosion of candidate patterns.
  2. Patterns are not considered at all if min-support is not small enough.
- Select step: issue of discriminative power.
  3. Information gain is computed against the complete dataset, NOT on subsets of examples.
  4. Correlation among patterns is not directly evaluated via their joint predictability.

Direct Mining & Selection via Model-based Search Tree (MbT)
- Both a classifier and a feature miner. Basic flow: at each node, mine frequent patterns from the examples reaching that node with a local support threshold (e.g., P = 20%), select the most discriminative feature F based on information gain, split the data on F, and recurse on each branch; stop when few data remain (see the code sketch after the analyses below).
- This is divide-and-conquer based frequent pattern mining: a 20% local support on a node holding 10 of 10,000 examples corresponds to a global support of 10 x 20% / 10,000 = 0.02%.
- Output: a compact set of highly discriminative patterns.

Analyses (I)
1. Scalability of pattern enumeration: upper bound (Theorem 1) and "scale down" ratio.
2. Bound on the number of returned features.

Analyses (II)
3. Subspace pattern selection: original set vs. subset.
4. Non-overfitting.
5. Optimality under exhaustive search.
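To make the divide-and-conquer flow concrete, here is a minimal sketch of the Model-based Search Tree recursion in Python. It assumes a generic `mine_patterns(data, min_support)` routine supplied by any frequent-pattern miner (itemset or graph) and pattern objects with a `matches(x)` method; these names and the exact stopping rules are illustrative assumptions, not the authors' code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, pattern):
    """Information gain of splitting data on whether each example matches pattern."""
    yes = [y for x, y in data if pattern.matches(x)]
    no = [y for x, y in data if not pattern.matches(x)]
    n = len(data)
    return (entropy(yes + no)
            - (len(yes) / n) * entropy(yes)
            - (len(no) / n) * entropy(no))

def build_mbt(data, mine_patterns, local_sup=0.2, min_node=10):
    """Model-based Search Tree: mine patterns only on the examples reaching this
    node, keep the single most discriminative one, split, and recurse."""
    labels = [y for _, y in data]
    if len(data) <= min_node or entropy(labels) == 0.0:
        return {"leaf": True, "prediction": Counter(labels).most_common(1)[0][0]}
    # Local support threshold: a fraction of *this node's* data, so the
    # effective global support shrinks as the tree deepens.
    patterns = mine_patterns(data, min_support=local_sup)
    if not patterns:
        return {"leaf": True, "prediction": Counter(labels).most_common(1)[0][0]}
    best = max(patterns, key=lambda p: info_gain(data, p))
    yes = [(x, y) for x, y in data if best.matches(x)]
    no = [(x, y) for x, y in data if not best.matches(x)]
    if not yes or not no:
        return {"leaf": True, "prediction": Counter(labels).most_common(1)[0][0]}
    return {"leaf": False, "pattern": best,
            "yes": build_mbt(yes, mine_patterns, local_sup, min_node),
            "no": build_mbt(no, mine_patterns, local_sup, min_node)}
```

Because each recursive call mines only the examples that reached that node, a fixed 20% local support shrinks in global terms as the tree grows: on a node holding 10 of 10,000 examples it equals 10 x 0.2 / 10,000 = 0.02%, exactly the slide's arithmetic.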
Experimental Studies: Itemset Mining (I)
- Scalability comparison on Adult, Chess, Hypo, Sick and Sonar.
- [Charts: log number of patterns (DT vs. MbT) and log absolute support per dataset.]
- Enumerating all patterns at MbT's support level in batch mode returns very large pattern sets (95,507; 252,809; 423,439; 4,818,391; and +infinity on Chess), whereas MbT examines only between 0.00032% and 0.41% of those (~0% on Chess).

Experimental Studies: Itemset Mining (II)
- Accuracy of mined itemsets, MbT vs. decision tree: 4 wins, 1 loss across Adult, Chess, Hypo, Sick and Sonar, while MbT uses a much smaller number of patterns.

Experimental Studies: Itemset Mining (III)
- [Figure: convergence.]

Experimental Studies: Graph Mining (I)
- 9 NCI anti-cancer screen datasets (URL: http://dtp.nci.nih.gov); active (positive) class: around 1% - 8.3%.
- 2 AIDS anti-viral screen datasets (The PubChem Project, URL: pubchem.ncbi.nlm.nih.gov): H1: CM+CA, 3.5% positive; H2: CA, 1% positive.

Experimental Studies: Graph Mining (II)
- Scalability. [Charts: number of patterns (DT vs. MbT) and log absolute support on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1 and H2.]

Experimental Studies: Graph Mining (III)
- AUC, MbT vs. DT: 11 wins for MbT across the 11 datasets.
- Accuracy: 10 wins, 1 loss for MbT.

Experimental Studies: Graph Mining (IV)
- AUC of MbT and DT vs. benchmarks: 7 wins, 4 losses.

Summary
- Model-based Search Tree: integrated feature mining and construction; dynamic support; can mine patterns with extremely small support; is both a feature constructor and a classifier; not limited to one type of frequent pattern: plug-and-play.
- Experimental results on itemset mining and graph mining.
- New: found a DNA sequence not previously reported that can nonetheless be explained biologically.
- Code and datasets are available for download.

How to Train Models? (Parametric Methods)
- List of methods: logistic regression, probit models, naive Bayes, kernel methods, linear regression, RBF, mixture models.
- Even though the true distribution is unknown, we still assume that the data is generated by some known function, and estimate the parameters inside that function from the training data (or by cross-validation on the training data).
- Once the structure is fixed, learning becomes optimization to minimize errors: quadratic loss, exponential loss, or slack variables (written out below).
- There will probably always be mistakes unless: 1. the chosen model indeed generates the distribution, and 2. the data is sufficient to estimate its parameters.
- But what if you don't know which model to choose, or you use the wrong one?
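For reference, the three error criteria named above in their standard textbook forms (the slides only name them; these formulas are the usual definitions, not taken from the source):

```latex
\begin{align*}
\text{quadratic loss:} \quad & L(y, f(x)) = (y - f(x))^2 \\
\text{exponential loss:} \quad & L(y, f(x)) = \exp(-y\, f(x)), \quad y \in \{-1, +1\} \\
\text{slack variables (soft-margin SVM):} \quad
  & \min_{w, b, \xi} \ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
    \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0
\end{align*}
```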
How to Train Models II (Free-Form Methods)
- List of methods: decision trees, the RIPPER rule learner, CBA (association rules), clustering-based methods, ...
- Not quite sure of the exact function, so use a family of "free-form" functions given some "preference criteria".
- Preference criterion: the simplest hypothesis that fits the data is the best.
- Heuristics: info gain, Gini index, Kearns-Mansour, etc.; pruning: MDL pruning, reduced-error pruning, cost-based pruning.
- Truth: none of these purity-check functions guarantees accuracy on unseen test data; each only tries to build a smaller model.
- There will probably always be mistakes unless: the training data is sufficiently large, and the free-form function and criteria are appropriate.

Can Data Speak for Themselves?
- Make no assumption about the true model, neither parametric form nor free form.
- "Encode" the data in some rather "neutral" representation. Think of it like encoding numbers in the computer's binary representation: it cannot represent some numbers exactly, but it is accurate enough overall.
- Main challenge: avoid "rote learning" (do not remember all the details) and generalize. "Evenly" representing "numbers" corresponds to "evenly" encoding the "data".

Potential Advantages
- If the accuracy is quite good, then the method is quite "automatic and easy" to use. A no-brainer: data mining can be everybody's tool.

Encoding Data for Major Problems
- Classification: given a set of labeled data items, such as (amt, merchant category, outstanding balance, date/time, ...), where the label is fraud or non-fraud. Label: a set of discrete values. Classifier: predict whether a transaction is a fraud.
- Probability estimation: similar to the above setting, but estimate the probability that a transaction is a fraud. Difference: no truth is given, i.e., no true probability.
- Regression: given a set of valued data items, such as (zipcode, capital gain, education, ...), the value of interest is annual gross income. Target value: continuous.
- Several other on-going problems.

Encoding Data in Decision Trees
- Think of each tree as a way to "encode" the training data; obviously, each tree encodes the data differently.
- [Figure: iris data (petal width vs. petal length) partitioned by a decision tree into setosa, versicolor and virginica regions.]
- Why a tree? A decision tree records some common characteristics of the data, but not every piece of trivial detail.
- Subjective criteria that prefer some encodings over others are always ad hoc. So do not prefer anything: just encode randomly, minimize the differences by producing multiple encodings, and then "average" them.

Random Decision Tree to Encode Data (classification, regression, probability estimation)
- At each node, an un-used feature is chosen randomly.
  - A discrete feature is un-used if it has never been chosen previously on the decision path from the root to the current node.
  - A continuous feature can be chosen multiple times on the same decision path, but each time a different random threshold value is chosen.
- We stop when one of the following happens:
  - A node becomes too small (<= 3 examples).
  - The total height of the tree exceeds some limit, such as the total number of features.

Illustration of RDT
- Features: B1: {0,1}, B2: {0,1}, B3: continuous.
- B1 is chosen randomly at the root: test B1 == 0. On one branch, B2 is chosen randomly: test B2 == 0. Next, B3 is chosen randomly with a random threshold: test B3 < 0.3. Deeper on the same path, B3 may be chosen again with a new random threshold: test B3 < 0.6, and so on; B1 and B2, being discrete, can each appear only once per path.
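Below is a minimal sketch of the tree-growing procedure just illustrated, together with the averaging-based prediction described on the next slides. It assumes examples are (x, y) pairs with x indexable by feature id, and a `features` map describing each feature; all helper names are mine, and real RDT implementations differ in detail.

```python
import random

def count_labels(labels):
    """Histogram of class labels."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return counts

def build_rdt(data, features, depth=0, max_depth=None, min_node=3):
    """Grow one random decision tree. `features` maps feature id to either
    ('discrete', list_of_values) or ('continuous', (lo, hi))."""
    if max_depth is None:
        max_depth = len(features)         # height limit: number of features
    labels = [y for _, y in data]
    if len(data) <= min_node or depth >= max_depth or len(set(labels)) <= 1:
        return {"leaf": True, "counts": count_labels(labels)}
    f = random.choice(list(features))     # feature chosen randomly, not by purity
    kind, spec = features[f]
    if kind == "discrete":
        # a discrete feature is "used up" on this path once chosen
        child_feats = {g: s for g, s in features.items() if g != f}
        children = {v: build_rdt([(x, y) for x, y in data if x[f] == v],
                                 child_feats, depth + 1, max_depth, min_node)
                    for v in spec}
        return {"leaf": False, "feature": f, "kind": kind, "children": children}
    lo, hi = spec
    t = random.uniform(lo, hi)            # fresh random threshold each time;
    left = [(x, y) for x, y in data if x[f] < t]    # continuous features stay usable
    right = [(x, y) for x, y in data if x[f] >= t]
    return {"leaf": False, "feature": f, "kind": kind, "threshold": t,
            "children": {"lt": build_rdt(left, features, depth + 1, max_depth, min_node),
                         "ge": build_rdt(right, features, depth + 1, max_depth, min_node)}}

def predict_proba(tree, x):
    """Descend to a leaf and return P(y | x, tree) from the leaf's label counts."""
    node = tree
    while not node["leaf"]:
        if node["kind"] == "discrete":
            node = node["children"][x[node["feature"]]]
        else:
            node = node["children"]["lt" if x[node["feature"]] < node["threshold"] else "ge"]
    total = sum(node["counts"].values()) or 1   # empty leaves contribute nothing
    return {y: c / total for y, c in node["counts"].items()}

def ensemble_proba(trees, x):
    """RDT prediction: simply average the per-tree posteriors."""
    avg = {}
    for t in trees:
        for y, p in predict_proba(t, x).items():
            avg[y] = avg.get(y, 0.0) + p / len(trees)
    return avg
```

With, say, `trees = [build_rdt(train, feats) for _ in range(30)]`, `ensemble_proba(trees, x)` gives a multi-class probability estimate; 30 to 50 trees matches the sampling-theory recommendation on the slides that follow.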
Classification
- Example tree: Petal.Length < 2.45 -> setosa (50/0/0); otherwise Petal.Width < 1.75 -> versicolor (0/49/5), else virginica (0/1/45).
- An example x reaching the versicolor leaf gets posteriors P(setosa | x, θ) = 0, P(versicolor | x, θ) = 49/54, P(virginica | x, θ) = 5/54.

Regression
- Same tree structure, but each leaf holds the average target value of the examples in that leaf: setosa -> height = 10 in, versicolor -> height = 15 in, virginica -> height = 12 in. For instance, 15 in is the average value of all examples in the versicolor leaf.

Prediction
- Simply average over multiple trees.

Potential Advantages
- Training can be very efficient; particularly true for very large datasets.
- No cross-validation-based parameter estimation, as required by some parametric methods.
- Natural multi-class probability.
- Natural multi-label classification and probability estimation.
- Imposes very little about the structure of the model.

Reasons
- The true distribution P(y|X) is never known: is it an elephant?
- Each random tree is not a random guess of P(y|X): its structure is random, but its "node statistics" are not.
- Every random tree is consistent with the training data.
- Each tree is quite strong, not weak: if the distribution is the same, each random tree is by itself a rather decent model.

Expected Error Reduction
- Proven for quadratic loss, i.e., (P(y|X) - P(y|X, θ))^2 for probability estimation and (y - f(x))^2 for regression.
- General theorem: the expected quadratic loss of RDT (and of any other model averaging) is less than that of any combined model chosen "at random". (A sketch of the standard argument follows at the end of this part.)

Number of Trees
- Sampling theory: a random decision tree can be thought of as a sample from a large (infinite, when continuous features exist) population of trees.
- Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance; in most cases, 10 are enough.

Variance Reduction
- [Figure: variance reduction.]

Optimal Decision Boundary
- [Figure: from Tony Liu's thesis (supervised by Kai Ming Ting); RDT's boundary looks like the optimal boundary.]

Regression Decision Boundary
- GUIDE-computed function: broken and discontinuous; some points are far from the truth; some wrong ups and downs.
- RDT-computed function: smooth and continuous; close to the true function; all ups and downs caught.

Hidden Variable
- Limitations of GUIDE: one must decide the grouping variables and independent variables, a non-trivial task; if all variables are categorical, GUIDE becomes a single CART regression tree; strong assumptions and greedy-based search can sometimes lead to very unexpected results. [Figure: it grows like ...]

ICDM'08 Cup Crown Winner
- Nuclear ban monitoring: the RDT-based approach is the highest award winner.

Ozone Level Prediction (ICDM'06 Best Application Paper)
- Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ).
- [Charts: cross-validation results for SVM and AdaBoost under the 1-hr and 8-hr ozone criteria.]

Other Applications
- Credit card fraud detection; late and default payment prediction; intrusion detection; semiconductor process control; trading anomaly detection.

Conclusion
- Imposing a particular model form may not be a good idea when training highly accurate models for general-purpose data mining; it may not even be efficient for some forms of models.
- RDT has been shown to solve all three major problems in data mining (classification, probability estimation and regression) simply, efficiently and accurately.
- When the physical truth is unknown, RDT is highly recommended.
- Code and datasets are available for download.
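Before moving on to domain transfer, here is a sketch of the standard argument behind the expected-error-reduction claim above (my reconstruction of the usual decomposition, not the slides' proof). With trees f_θ drawn at random and the averaged model defined as the expectation over trees:

```latex
E_\theta\!\left[(y - f_\theta(x))^2\right]
  = \left(y - \bar{f}(x)\right)^2
  + E_\theta\!\left[\left(f_\theta(x) - \bar{f}(x)\right)^2\right]
  \;\ge\; \left(y - \bar{f}(x)\right)^2,
\qquad \bar{f}(x) = E_\theta\!\left[f_\theta(x)\right].
```

The gap is exactly the variance of the individual trees around their mean, which is why averaging a moderate number of strong, mutually consistent trees reduces quadratic loss.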
Standard Supervised Learning
- Train a classifier on labeled New York Times documents and test it on unlabeled New York Times documents: 85.5% accuracy.

In Reality...
- Labeled New York Times data is not available! Train on labeled Reuters documents instead and test on New York Times: 64.1%.

Domain Difference -> Performance Drop
- Ideal setting: train on NYT, test on NYT -> 85.5%.
- Realistic setting: train on Reuters, test on NYT -> 64.1%.

A Synthetic Example
- [Figure: training and test distributions with conflicting concepts, partially overlapping.]

Goal
- To unify, from multiple source domains (models), the knowledge that is consistent with the test domain. [Figure: multiple source domains and one target domain.]

Summary
- Transfer from one or multiple source domains; the target domain has no labeled examples.
- No re-training needed: rely on base models trained on each source domain; these base models are not necessarily developed for transfer-learning applications.

Locally Weighted Ensemble
- k training sets produce models M1, ..., Mk; each model outputs f_i(x, y) = P(Y = y | x, M_i), where x is the feature value and y the class label.
- For a test example x, combine with per-example weights: f^E(x, y) = Σ_{i=1..k} w_i(x) f_i(x, y), with Σ_{i=1..k} w_i(x) = 1.
- Predict y|x = argmax_y f^E(x, y). (A sketch of this combination appears after the graph-based heuristics below.)

Modified Bayesian Model Averaging
- Bayesian model averaging: P(y|x) = Σ_{i=1..k} P(M_i | D) P(y | x, M_i); the model posterior is computed once from the training data D.
- Modified for transfer learning: P(y|x) = Σ_{i=1..k} P(M_i | x) P(y | x, M_i); the weight becomes local to each test example x.

Global versus Local Weights
- [Table: for a series of test examples x with labels y, global weights wg are constant per model (0.3 for M1 and 0.7 for M2 on every example), while local weights wl vary from example to example.]
- Locally weighted scheme: the weight of each model is computed per example, and weights are determined by the models' performance on the test set, not the training set.

Synthetic Example Revisited
- [Figure: models M1 and M2, training and test with conflicting concepts, partially overlapping.]

Optimal Local Weights
- Example with two classes C1, C2 at a test example x: M1 predicts (0.9, 0.1), M2 predicts (0.4, 0.6), and the true distribution is f = (0.8, 0.2). Stacking the predictions as H = [0.9 0.4; 0.1 0.6] and solving H w = f subject to w1 + w2 = 1 gives w1 = 0.8, w2 = 0.2, so M1 (the closer model) receives the higher weight.
- In general, the optimal weights are the solution to such a regression problem.

Approximate Optimal Weights
- A model M should be assigned a higher weight at x if P(y | M, x) is closer to the true P(y|x); this is impossible to obtain exactly, since the true f is unknown!
- If some labeled examples exist in the target domain, use them to compute the weights.
- If none of the examples in the target domain are labeled, we need to make some assumptions about the relationship between feature values and class labels.

Clustering-Manifold Assumption
- Test examples that are closer in feature space are more likely to share the same class label.

Graph-based Heuristics
- Graph-based weight approximation: map the structures of the models onto the test domain.
- Local weight calculation: the weight of a model at x is proportional to the similarity between its neighborhood graph and the clustering structure around x.
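A minimal sketch of the locally weighted combination defined above, assuming each base model exposes `predict_proba(x)` returning a dict of class probabilities and that per-example weight functions have already been built (for instance by the graph-based heuristic sketched next); the names are illustrative:

```python
def lwe_predict(models, weight_fns, x):
    """Locally weighted ensemble:
       f^E(x, y) = sum_i w_i(x) * f_i(x, y),  with  sum_i w_i(x) = 1,
       predicting y|x = argmax_y f^E(x, y)."""
    w = [wf(x) for wf in weight_fns]
    total = sum(w) or 1.0
    w = [wi / total for wi in w]          # normalize so the weights sum to 1
    scores = {}
    for wi, m in zip(w, models):
        for y, p in m.predict_proba(x).items():
            scores[y] = scores.get(y, 0.0) + wi * p
    return max(scores, key=scores.get)
```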
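And a sketch of one way to realize the graph-based weight heuristic: score each model at x by how well its predicted labels agree with the clustering structure among x's nearest neighbors in the unlabeled test set. The k-NN/k-means construction and the agreement score below are simplified stand-ins for the paper's neighborhood-graph similarity, chosen only to make the idea runnable:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def local_weights(models, X_test, n_clusters=2, k=10):
    """Raw per-example agreement scores W[i, j]: how often, among the k nearest
    test-set neighbors of point i, model j's predictions agree with the
    clustering structure (same cluster <=> same predicted label)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_test)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_test)
    _, idx = nn.kneighbors(X_test)            # column 0 is the point itself
    preds = [m.predict(X_test) for m in models]
    W = np.zeros((len(X_test), len(models)))
    for j, pred in enumerate(preds):
        for i in range(len(X_test)):
            neigh = idx[i, 1:]
            same_cluster = clusters[neigh] == clusters[i]
            same_label = pred[neigh] == pred[i]
            W[i, j] = np.mean(same_cluster == same_label)
    return W                                   # normalize per row when combining
```

The local structure based adjustment on the next slides would sit on top of this: when every model's raw agreement score at some x falls below a threshold, discard the model outputs there and propagate the labels of x's test-set neighbors instead.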
Local Structure Based Adjustment
- Why is adjustment needed? It is possible that no model's structure is similar to the clustering structure at x; this simply means the training information conflicts with the true target distribution at x.
- How to adjust? Check whether the similarity is below a threshold; if so, ignore the training information and propagate the labels of x's neighbors in the test set to x.

Verify the Assumption
- The validity of the clustering-manifold assumption needs checking, but P(y|x) is unknown; one can, however, validate the assumption on the training set. This also guides the choice of an appropriate clustering algorithm.
- Findings from real data sets: the property is usually determined by the nature of the task. Positive cases: document categorization. Negative cases: sentiment classification.

Algorithm
1. Check the assumption.
2. Neighborhood graph construction.
3. Model weight computation.
4. Weight adjustment.

Data Sets
- Synthetic data sets.
- Spam filtering: public email collection -> personal inboxes (u01, u02, u03) (ECML/PKDD 2006).
- Text classification: the same top-level classification problems with different sub-fields in the training and test sets (Newsgroup, Reuters).
- Intrusion detection data: different types of intrusions in the training and test sets.

Baseline Methods
- One source domain, single models: Winnow (WNN), Logistic Regression (LR), Support Vector Machine (SVM); Transductive SVM (TSVM).
- Multiple source domains: SVM on each of the domains; TSVM on each of the domains.
- Merge all source domains into one (ALL): SVM, TSVM.
- Simple averaging ensemble: SMA.
- Locally weighted ensemble without local structure based adjustment: pLWE.
- Locally weighted ensemble: LWE.

Implementation
- Classification packages: SNoW, BBR, LibSVM, SVMlight.
- Clustering: the CLUTO package.

Performance Measures
- Prediction accuracy (0-1 loss) and mean squared error (squared loss).
- Area Under the ROC Curve (AUC): the tradeoff between true positive rate and false positive rate; ideally 1.

Experiments on Synthetic Data
- [Figure: training and test with conflicting concepts, partially overlapping.]

Spam Filtering
- Training set: public emails; test sets: personal emails from three users (U00, U01, U02).
- [Charts: accuracy and MSE for WNN, LR, SVM, SMA, TSVM, pLWE and LWE.]

20 Newsgroup
- Problems: C vs S, R vs T, R vs S, S vs T, C vs R, C vs T.
- [Charts: accuracy and MSE for the same methods.]

Reuters
- Problems: Orgs vs People (O vs Pe), Orgs vs Places (O vs Pl), People vs Places (Pe vs Pl).
- [Charts: accuracy and MSE for the same methods.]

Intrusion Detection
- Problems (normal vs. intrusions): Normal vs R2L (1), Normal vs Probing (2), Normal vs DOS (3).
- Tasks: 2 + 1 -> 3 (DOS); 3 + 1 -> 2 (Probing); 3 + 2 -> 1 (R2L).

Conclusions
- The locally weighted ensemble framework transfers useful knowledge from multiple source domains.
- Graph-based heuristics for computing the weights make the framework practical and effective.
- Code and datasets are available for download.

More Information
- www.weifan.info or www.cs.columbia.edu/~wfan for code, datasets and papers.