Machine Learning Part 2: Intermediate and Active Sampling Methods
Jaime Carbonell (with contributions from Pinar Donmez and Jingrui He)
Carnegie Mellon University
[email protected]
December, 2008
© 2008, Jaime G. Carbonell

Beyond “Standard” Learning
• Multi-Objective Learning
• Structuring Unstructured Data
  • Text Categorization
• Temporal Prediction
  • Cycle & trend detection
• Semi-Supervised Methods
  • Labeled + Unlabeled Data
  • Active Learning
  • Proactive Learning
• “Unsupervised” Learning
  • Predictor attributes, but no explicit objective
  • Clustering methods
  • Rare category detection

Multi-Objective Supervised Learning
• Several objectives to predict, with overlapping sets of predictor attributes
[Diagram: predictor attributes p1 ... p6 feeding objectives obj1, obj2, obj3]
• Independent predictions: each objective is solved ignoring the others
• Dependent predictions: results of earlier predictions partially feed the next round
• Dependent case: sequence the predictions. If there is feedback, cycle until stability (or for a fixed N rounds)

The Vector Space Model: How to Convert Text to “Data”
• Definitions of document and query vectors, where $w_j$ is the $j$th word and $c(w_j, d_i)$ counts the occurrences of $w_j$ in document $d_i$
• For topic categorization, use $w_{n+1}$ as the objective category to predict (e.g. “finance”, “sports”)
Vocabulary: $\{w_1, w_2, \dots, w_n\}$
$d_i = [c(w_1, d_i), c(w_2, d_i), \dots, c(w_n, d_i)]$
$q_i = [c(w_1, q_i), c(w_2, q_i), \dots, c(w_n, q_i)]$

Refinements to Word-Based Features
• Well-known methods
  • Stop-word removal (e.g., “it”, “the”, “in”, …)
  • Phrasing (e.g., “White House”, “heart attack”, …)
  • Morphology (e.g., “countries” => “country”)
• Feature Expansion
  • Query expansion (e.g., “cheap” => “inexpensive”, “discount”, “economic”, …)
• Feature Transformation & Reduction
  • Singular-value decomposition (SVD)
  • Linear discriminant analysis (LDA)
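The count vectors defined in the vector space model, combined with stop-word removal, can be sketched as below. This is a minimal illustration only: the stop-word set, the whitespace tokenizer, and the two sample documents are assumptions made for the example, not part of the original slides.

```python
from collections import Counter

# Hypothetical, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"it", "the", "in", "a", "of", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def build_vocabulary(documents):
    """Return the sorted distinct words w_1..w_n found across all documents."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def count_vector(text, vocabulary):
    """Return [c(w_1, d), ..., c(w_n, d)] for one document or query."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

# Illustrative documents (made up for this sketch).
docs = [
    "The market fell in the morning.",
    "The team won the match, and the market cheered.",
]
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]
```

A query is converted with the same `count_vector` call and the same vocabulary, so that query and document vectors share one coordinate system.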

Query-Document Similarity (For Retrieval and for kNN)
• Traditional “Cosine Similarity”:
  $Sim(q, d) = \dfrac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$, where $\lVert d \rVert = \sqrt{\sum_{i=1}^{n} d_i^2}$
• Each element in the query and document vectors is a word weight
• Rare words count more, e.g.: $d_i = \log_2\big(D_{all} / D_{freq}(word_i)\big)$
• Getting the top-k documents (or web pages) is done by:
  $Retrieve(q, k) = \operatorname{Arg\,max}_{d \in D}\,[k, Sim(d, q)]$

Multi-tier Text Categorization
[Category tree: News Event at the root; below it Terrorist Event (Bombing, Shooting) and Economic disaster (Asian Crisis, US tech crisis)]
• Given text, predict the category at each level
• Issue: What if we need to go beyond words as features?

Time Series Prediction Process
• Find leading indicators
  • “Predictor” variables from earlier epochs
• Code values per distinct time interval
  • E.g. “sales at t-1, at t-2, t-3 …”
  • E.g. “advertisement $ at t, t-1, t-2”
• Objective is to predict the desired variable at current or future epochs
  • E.g. “sales at t, t+1, t+2”
• Apply the machine learning methods you learned
  • Regression, d-trees, kNN, Bayesian, …

Time Series Prediction: caveat 1
Total Sales   2006    2007    2008
Q1            9.5M    11M     12M
Q2            8.5M    10M     11M
Q3            7.5M    8.5M    9.5M
Q4            11M     13M     ??
1. Determine periodic cycle
2. Find within-cycle trend
3. Find cross-cycle trend
4. Combine both components

Time Series Prediction: caveat 2
Sales         2006    2007    2008 (Total Airline Sales)
Q1            9.5M    11M     12M
Q2            8.5M    10M     11M
Q3            7.5M    8.5M    9.5M
Q4            11M     13M     ??
• Watch for exogenous variables! (The World Trade Center attack wreaked havoc with airline industry predictions)
• Less tragic and less obvious one-of-a-kind events too

Leveraging Existing Data Collecting Systems
1999 Influenza outbreak
• Influenza cultures
• Sentinel physicians
• WebMD queries about ‘cough’ etc.
• School absenteeism
• Sales of cough and cold meds
• Sales of cough syrup
• ER respiratory complaints
• ER ‘viral’ complaints
• Influenza-related deaths
[Figure: indicator signals by week, 1999-2000; Moore, 2002]

Adaptive Filtering over a Document Stream
[Diagram: a time line running from training documents (past) through unlabeled documents to test documents; the current document is judged “On-topic?” for Topic 1, Topic 2, Topic 3, …; on-topic and off-topic documents feed back as relevance feedback (RF)]

Classifier = Rocchio, Topic = Civil War (R76 in TREC10), Threshold = MLR
[Figure: MLR threshold function: locally linear, globally non-linear]

Time Series in a Nutshell
• Time-series prediction requires regression, except:
  • Historical data per time period (aka “epoch”)
  • Predictor attributes come from both current + earlier epochs
  • Objective attributes from earlier epochs become predictor attributes for the current epoch
• Process difference with normal machine learning
  • First detect cyclical patterns among epochs
  • Predict within a cycle
  • Predict cross-cycle using corresponding epochs only (then combine with the within-cycle prediction)

Active Learning
• Assume:
  • Very few “labeled” instances $\{x, y\}$
  • Very many “unlabeled” instances $\{x\}$
  • An omniscient “oracle” which can assign a label to an unlabeled instance
• Objective:
  • Select instances to label such that learning accuracy is maximized with the fewest oracle labeling requests

Active Learning (overall idea)
[Diagram: a data source supplies unlabeled data to the learning mechanism; the learning mechanism sends label requests to an expert, receives labeled data, learns a new model, and delivers output to the user]

Why is Active Learning Important?
• Labeled data volumes $\ll$ unlabeled data volumes
  • 1.2% of all proteins have known structures
  • .01% of all galaxies in the Sloan Sky Survey have consensus type labels
  • .0001% of all web pages have topic labels
• If labeling is costly, or limited, we want to select the points that will have maximal impact

Review of Supervised Learning
• Training data: $\{x_i, y_i\}_{i=1,\dots,k}$, with a simplifying assumption on the label space of $y$ [symbols lost in extraction]
• Functional space: $\{f_j \mid p_l\}$
• Fitness criterion: $\arg\min_{j,l} \sum_i \lvert y_i - f_{j,p_l}(x_i) \rvert$, combined with a term over $(f_j, p_l)$ [symbol lost in extraction]
• Variants: online learning, noisy data, …

Active Learning
• Training data: $\{x_i, y_i\}_{i=1,\dots,k} \cup \{x_i\}_{i=k+1,\dots,n}$, with oracle $O: x_i \to y_i$
  • Special case: $k = 0$
• Functional space: $\{f_j \mid p_l\}$
• Fitness criterion (a.k.a. loss function): $\arg\min_{j,l} \sum_i \lvert y_i - f_{j,p_l}(x_i) \rvert$, combined with a term over $(f_j, p_l)$ [symbol lost in extraction]
• Sampling strategy:
  $\hat{x} = \arg\min_{x_i \in \{x_{k+1},\dots,x_n\}} L\big(\hat{f}(x_{test}, y_{test}) \mid x_i \cup \{x_1,\dots,x_k\}\big)$

Sampling Strategies
• Random sampling (preserves distribution)
• Uncertainty sampling (Tong & Koller, 2000)
  • proximity to decision boundary
  • maximal distance to labeled x’s
• Density sampling (kNN-inspired; McCallum & Nigam, 2004)
• Representative sampling (Xu et al, 2003)
• Instability sampling (probability-weighted)
  • x’s that maximally change the decision boundary
• Ensemble Strategies
  • Boosting-like ensemble (Baram, 2003)
  • DUAL (Donmez & Carbonell, 2007)
    • Dynamically switches strategies from Density-Based to Uncertainty-Based by estimating the derivative of the expected residual error reduction

Which point to sample?
[Figure legend: Green = unlabeled, Red = class A, Brown = class B]

Density-Based Sampling
• Centroid of the largest unsampled cluster

Uncertainty Sampling
• Closest to the decision boundary

Maximal Diversity Sampling
• Maximally distant from labeled x’s

Ensemble-Based Possibilities
• Uncertainty + Diversity criteria
• Density + uncertainty criteria

Active Learning Issues
• Interaction of active sampling with underlying classifier(s).
• On-line sampling vs. batch sampling.
• Active sampling for rank learning and for structured learning (e.g. HMMs, sCRFs).
• What if the Oracle is fallible, or reluctant, or differentially expensive? This leads to proactive learning.
• How does noisy data affect active learning?
• What if we do not have even the first labeled point(s) for one or more classes? This leads to new class discovery.
• How to “optimally” combine A.L. strategies

Strategy Selection: No Universal Optimum
• Optimal operating range for AL sampling strategies differs
• How to get the best of both worlds?
• (Hint: ensemble methods, e.g. DUAL)

Motivation for DUAL
• Strength of DWUS: favors higher density samples close to the decision boundary, giving a fast decrease in error
• But! DWUS exhibits diminishing returns! Why?
  • Early iterations -> many points are highly uncertain
  • Later iterations -> points with high uncertainty are no longer in dense regions
  • DWUS wastes time picking instances with no direct effect on the error

How does DUAL do better?
• Runs DWUS until it estimates a cross-over:
  $\dfrac{\partial \hat{\epsilon}(DWUS)}{\partial x_t} \approx 0$
• Monitors the change in expected error at each iteration to detect when it is stuck in local minima:
  $\hat{\epsilon}(DWUS) = \dfrac{1}{n_t} \sum_i E[(\hat{y}_i - y_i)^2 \mid x_i]$
• DUAL uses a mixture model after the cross-over (saturation) point:
  $x_s^* = \arg\max_{i \in I_U} \; \pi \cdot E[(\hat{y}_i - y_i)^2 \mid x_i] + (1 - \pi) \cdot p(x_i)$
• Our goal should be to minimize the expected future error
  • If we knew the future error of Uncertainty Sampling (US) to be zero, then we’d force $\pi = 1$
  • But in practice, we do not know it
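The mixture criterion used after the cross-over point can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes binary labels in {0, 1} and a probabilistic classifier whose prediction is P(y=1|x), so that the expected squared error reduces to p(1-p); the function names and the toy numbers are made up for the example.

```python
def expected_squared_error(p_pos):
    """E[(y_hat - y)^2 | x] under the assumption y in {0, 1},
    y_hat = p_pos and y ~ Bernoulli(p_pos); this equals p_pos * (1 - p_pos)."""
    return p_pos * (1.0 - p_pos)

def dual_mixture_select(p_pos, density, pi):
    """Return (index, scores) of the unlabeled point maximizing
    pi * E[(y_hat - y)^2 | x_i] + (1 - pi) * p(x_i)."""
    scores = [
        pi * expected_squared_error(p) + (1.0 - pi) * d
        for p, d in zip(p_pos, density)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy unlabeled pool: posterior P(y=1|x_i) and density estimate p(x_i).
posteriors = [0.5, 0.9, 0.6]
densities = [0.1, 0.9, 0.5]

pure_uncertainty_pick, _ = dual_mixture_select(posteriors, densities, pi=1.0)
pure_density_pick, _ = dual_mixture_select(posteriors, densities, pi=0.0)
```

With the weight at 1 the rule reduces to uncertainty sampling, and with the weight at 0 it reduces to density sampling; intermediate weights trade the two off.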

More on DUAL
• After cross-over, US does better => the uncertainty score should be given more weight
• $\pi$ should reflect how well US performs
• $\pi$ can be calculated from the expected error of US on the unlabeled data* => $\hat{\epsilon}(US)$
• Finally, we have the following selection criterion for DUAL:
  $x_s^* = \arg\max_{i \in I_U} \; (1 - \hat{\epsilon}(US)) \cdot E[(\hat{y}_i - y_i)^2 \mid x_i] + \hat{\epsilon}(US) \cdot p(x_i)$
* US is allowed to choose data only from among the already sampled instances, and $\hat{\epsilon}(US)$ is calculated on the remaining unlabeled set

Results: DUAL vs DWUS
[Figure: results plots]

Paired Density-Based Sampling (Donmez & Carbonell, 2008)
• Desiderata
  • Balanced sampling from both (all) classes
  • Combine density-based with coverage-based
• Method
  • Non-Euclidean distance function:
    $d(x_i, x_j) = \dfrac{1}{\rho} \ln\Big(1 + \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} \big(e^{\rho \lVert p_k - p_{k+1} \rVert} - 1\big)\Big)$
  • Select maximally separated pairs of points based on maximizing a utility function

Paired Density Method (cont.)
• Utility function $U(i, j)$ [equation garbled in extraction; the recoverable pieces are a log term over the neighbors $k \in N_{x_i}$ with an exponential kernel on $\lVert x_i - x_k \rVert^2$ weighted by $\min_{y_k \in \{\pm 1\}} \hat{P}(y_k \mid x_k)$, the matching log term over $r \in N_{x_j}$, and a term scaled by $s$ involving $\min_{y_i \in \{\pm 1\}} \hat{P}(y_i \mid x_i)$ and $\min_{y_j \in \{\pm 1\}} \hat{P}(y_j \mid x_j)$]
• Select the two points that optimize utility and are maximally distant:
  $(i^*, j^*) = \arg\max_{i \ne j \in I_U} \lVert x_i - x_j \rVert^2 \cdot U(i, j)$

Results of Paired-Density Sampling
[Figure: results plots]

Active Learning model in NLP
[Diagram: an un-annotated corpus forms the unlabeled set; the active learner performs sample selection; selected samples are annotated and added to the active training set; the training data is used to build a model (parsing model, machine translation system, named entity recognition module, or word sense disambiguation model), which is evaluated on test data]

Word-Sense Disambiguation
• Needed in NLP for parsing, translation, search…
• Example:
  • “Line”: ax+by+c, rope, queue, track, …
  • “Banco”: bench, financial inst, sand bank, …
• Challenge: How to disambiguate from context
• Approach: Build ML classifier (sense = class)
• Problem: Insufficient training data
• Amelioration: Active Learning

Word Sense Disambiguation: Active Learning Methods
• Entropy Sampling
  • Vector $q$ represents the trained model’s predictions
  • $q_c$ = prediction probability of class $c$
  • Pick the example whose prediction vector displays the greatest entropy
• Margin Sampling
  • If $c$ and $c'$ are the two most likely categories, the margin is $q_c - q_{c'}$
  • Pick the example with the smallest margin

Word Sense Disambiguation: Experiment
• On 5 English verbs that had coarse-grained senses
• Double-blind tagging applied to 50 instances of the target word
• If the inter-tagger agreement (ITA) is < 90%, the sense entry is revised by adding examples and explanations

Word Sense Disambiguation Results
[Figure: results plots]

Active vs. Proactive Learning
ACTIVE LEARNING                      PROACTIVE LEARNING
All x’s cost the same to label       Labeling cost is f1(D(x),O)
Max number of labels                 Max labeling budget
Omniscient oracle                    Fallible oracles
  Never errs                           Errs with p(E(x)) ~ f2(D(x),O)
Indefatigable oracle                 Reluctant oracles
  Always answers                       Answers with p(A(x)) …
Single oracle                        Multiple oracles
  Oracle selection unnecessary         Joint optimization of oracle and instance selection
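The entropy and margin criteria described for word-sense disambiguation can be sketched as below. This is a minimal illustration under assumptions: each example comes with a prediction vector q over senses, natural-log entropy is used, and the three prediction vectors are made-up numbers chosen so that the two criteria pick different examples.

```python
import math

def entropy(q):
    """Shannon entropy (natural log) of a prediction vector q."""
    return -sum(p * math.log(p) for p in q if p > 0.0)

def margin(q):
    """Difference between the two largest class probabilities in q."""
    top, second = sorted(q, reverse=True)[:2]
    return top - second

def entropy_sampling(predictions):
    """Index of the example whose prediction vector has the greatest entropy."""
    return max(range(len(predictions)), key=lambda i: entropy(predictions[i]))

def margin_sampling(predictions):
    """Index of the example with the smallest margin between its top two classes."""
    return min(range(len(predictions)), key=lambda i: margin(predictions[i]))

# Hypothetical prediction vectors for three unlabeled examples, three senses each.
predictions = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.35, 0.25],  # spread across all senses: highest entropy
    [0.50, 0.50, 0.00],  # tie between two senses: smallest margin
]
entropy_pick = entropy_sampling(predictions)
margin_pick = margin_sampling(predictions)
```

The example shows that the two criteria need not agree: entropy looks at the whole vector, while margin looks only at the top two senses.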

Scenario 1: Reluctance
• 2 oracles:
  • reliable oracle: expensive but always answers with a correct label
  • reluctant oracle: cheap but may not respond to some queries
• Define a utility score as the expected value of information at unit cost:
  $U(x, k) = \dfrac{P(ans \mid x, k) \cdot V(x)}{C_k}$

How to estimate $\hat{P}(ans \mid x, k)$?
• Cluster unlabeled data using k-means
• Ask the reluctant oracle for the label of each cluster centroid. If
  • label received: increase $\hat{P}(ans \mid x, reluctant)$ of nearby points
  • no label: decrease $\hat{P}(ans \mid x, reluctant)$ of nearby points
  $\hat{P}(ans \mid x, reluctant) = \dfrac{0.5}{Z} \exp\Big(\dfrac{h(x_{c_t}, y_{c_t})}{2} \ln \dfrac{maxd - \lVert x_{c_t} - x \rVert}{\lVert x_{c_t} - x \rVert}\Big), \quad \forall x \in C_t$
• $h(x_c, y_c) \in \{1, -1\}$ equals 1 when a label is received, -1 otherwise
• The number of clusters depends on the clustering budget and the oracle fee

Algorithm for Scenario 1
[Figure: algorithm listing]

Scenario 2: Fallibility
• Two oracles:
  • One perfect but expensive oracle
  • One fallible but cheap oracle, which always answers
• Algorithm similar to Scenario 1 with slight modifications
• During exploration:
  • The fallible oracle provides the label with its confidence
  • Confidence = $\hat{P}(y \mid x)$ of the fallible oracle
  • If $\hat{P}(y \mid x) \in [0.45, 0.5]$ then we don’t use the label, but we still update $\hat{P}(correct \mid x, k)$

Scenario 3: Non-uniform Cost
• Uniform cost: fraud detection, face recognition, etc.
• Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc.
• 2 oracles:
  • Fixed-cost oracle
  • Variable-cost oracle:
    $C_{non\text{-}unif}(x) = 1 - \dfrac{\max_{y \in Y} \hat{P}(y \mid x) - \frac{1}{|Y|}}{1 - \frac{1}{|Y|}}$

Outline of Scenario 3
[Figure: algorithm outline]
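The utility score U(x, k) = P(ans | x, k) * V(x) / C_k from Scenario 1 suggests choosing the instance and the oracle jointly. The sketch below illustrates that joint choice under assumptions: the answer probabilities, information values, and oracle costs are supplied as plain lists, and all numbers are invented for the example.

```python
def select_instance_and_oracle(p_answer, value, cost):
    """Return ((i, k), utility) maximizing U(x_i, k) = P(ans|x_i,k) * V(x_i) / C_k.

    p_answer[i][k]: estimated probability that oracle k answers a query on x_i
    value[i]:       value of information V(x_i) of labeling x_i
    cost[k]:        cost C_k of querying oracle k
    """
    best, best_utility = None, float("-inf")
    for i, v in enumerate(value):
        for k, c in enumerate(cost):
            utility = p_answer[i][k] * v / c
            if utility > best_utility:
                best, best_utility = (i, k), utility
    return best, best_utility

# Hypothetical setup: oracle 0 is reliable but expensive, oracle 1 is cheap but reluctant.
p_answer = [
    [1.0, 0.9],  # x_0: the reluctant oracle is likely to answer
    [1.0, 0.2],  # x_1: the reluctant oracle is unlikely to answer
]
value = [0.6, 0.8]
cost = [5.0, 1.0]

(chosen_instance, chosen_oracle), chosen_utility = select_instance_and_oracle(
    p_answer, value, cost
)
```

Here the more informative instance x_1 loses to x_0 because the cheap oracle is likely to answer on x_0, which is the kind of trade-off the utility score is meant to capture.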

Underlying Sampling Strategy
• Conditional entropy based sampling, weighted by a density measure
• Captures the information content of a close neighborhood
• $U(x)$ [equation garbled in extraction; the recoverable pieces are a logarithm, the term $\min_{y \in \{\pm 1\}} \hat{P}(y \mid x, \hat{w})$, and an exponential kernel on $\lVert x - k \rVert_2^2$ over the close neighbors $k \in N_x$ of $x$, weighted by $\min_{y \in \{\pm 1\}} \hat{P}(y \mid k, \hat{w})$]

Results: Reluctance
[Figure: results plots]

Cost varies non-uniformly
[Figure: results plots; differences statistically significant (p<0.01)]

Proactive Learning in General
• Multiple Experts (a.k.a. Oracles)
  • Different areas of expertise
  • Different costs
  • Different reliabilities
  • Different availability
• What question to ask and whom to query?
  • Joint optimization of query & oracle selection
  • Referrals among Oracles (with referral fees)
  • Learn about Oracle capabilities as well as solving the Active Learning problem at hand

Unsupervised Learning in DM
• What does it mean to learn without an objective?
  • Explore the data for natural groupings
  • Learn association rules, and later examine whether they can be of any business use
• Illustrative examples
  • Market basket analysis, to later optimize shelf allocation & placements
  • Cascaded or correlated mechanical faults
  • Demographic grouping beyond known classes
  • Plan product bundling offers

Example Similarity Functions
• Determine a similarity metric
  • Euclidean, Cosine, KL-divergence
  $sim_{euclid}(d_i, d_j) = \sqrt{\sum_{k=1}^{n} (d_{i,k} - d_{j,k})^2}$
  $sim_{cos}(q, d_i) = \dfrac{q \cdot d_i}{\lVert q \rVert_2 \, \lVert d_i \rVert_2}$
• Determine a clustering algorithm
  • Incremental, agglomerative, K-means, …

Hierarchical Agglomerative Clustering Methods
Generic Agglomerative Procedure (Salton '89), which results in nested clusters via iterations:
1. Compute all pairwise document-document similarity coefficients
2. Place each of n documents into a class of its own
3. Merge the two most similar clusters into one;
   - replace the two clusters by the new cluster
   - recompute intercluster similarity scores w.r.t. the new cluster
   - if cluster radius > max-size, block further merging
4. Repeat the above step until there are only k clusters left (note k could = 1).

Group Agglomerative Clustering
[Figure: nine numbered points grouped into nested clusters]

K-Means Clustering
1. Select k seeds s.t. $d(k_i, k_j) > d_{min}$
2. Assign points to clusters by min dist.:
   $Cluster(p_i) = \operatorname{Argmin}_{s_j \in \{s_1,\dots,s_k\}} d(p_i, s_j)$
3. Compute new cluster centroids:
   $c_j = \dfrac{1}{n} \sum_{p_i \in j\text{th cluster}} p_i$
4. Reassign points to clusters (as in 2 above)
5. Iterate until no points change clusters

K-Means Clustering: Initial Data Points
Step 1: Select k random seeds s.t. $d(k_i, k_j) > d_{min}$
[Figure: data points with initial seeds (k=3)]

K-Means Clustering: First-Pass Clusters
Step 2: Assign points to clusters by min dist.: $Cluster(p_i) = \operatorname{Argmin}_{s_j \in \{s_1,\dots,s_k\}} d(p_i, s_j)$
[Figure: first-pass clusters around the initial seeds]

K-Means Clustering: Seeds -> Centroids
Step 3: Compute new cluster centroids: $c_j = \dfrac{1}{n} \sum_{p_i \in j\text{th cluster}} p_i$
[Figure: new centroids]

K-Means Clustering: Second Pass Clusters
Step 4: Recompute $Cluster(p_i) = \operatorname{Argmin}_{c_j \in \{c_1,\dots,c_k\}} d(p_i, c_j)$
Note: some data points are reassigned
[Figure: second-pass clusters around the centroids]

Cluster Optimization (finding “k”)
$k = \operatorname{Arg\,min}_{k \in [1,n]} \dfrac{\operatorname{average}\{d(x_i, x_j) : x_i, x_j \text{ in the same cluster}, i \ne j\}}{\operatorname{average}\{d(x_k, x_l) : x_k, x_l \text{ in different clusters}\}}$
$k = \operatorname{Arg\,min}_{k \in [1,n]} \dfrac{\frac{1}{k} \sum_{c \in C_k} \frac{1}{|c|^2} \sum_{x_i, x_j \in c} d(x_i, x_j)}{\frac{1}{k^2} \sum_{c_l, c_m \in C_k} d(cen(c_l), cen(c_m))}$
$k = \operatorname{Arg\,min}_{k \in [1,n]} \dfrac{k \sum_{c \in C_k} \frac{1}{|c|^2} \sum_{x_i, x_j \in c} d(x_i, x_j)}{\sum_{c_l, c_m \in C_k} d(cen(c_l), cen(c_m))}$
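The five K-means steps above can be sketched in plain Python as follows. This is a minimal sketch under assumptions: Euclidean distance, points given as tuples, seeds drawn from the data in random order until k of them are pairwise more than d_min apart, and a made-up six-point data set; the function names are hypothetical.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_seeds(points, k, d_min, rng):
    """Step 1: pick k seeds in random order, pairwise more than d_min apart."""
    seeds = []
    for p in rng.sample(points, len(points)):
        if all(dist(p, s) > d_min for s in seeds):
            seeds.append(p)
        if len(seeds) == k:
            return seeds
    raise ValueError("could not find k seeds separated by more than d_min")

def k_means(points, k, d_min=0.0, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = pick_seeds(points, k, d_min, rng)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest seed/centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Step 5: stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Two well-separated groups of points (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2, d_min=5.0)
```

Requiring the seeds to be more than d_min apart, as in step 1, keeps both seeds from landing in the same group in this toy example.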

Clustering for Novelty Detection
Functionality
• Build background model
  • Expected events (clusters)
• Find divergences
  • Individual outliers (but many false positives)
  • New mini-clusters (unmasked new-event detection)
  • Detect when a novel event is masked by ordinary ones
• Trigger alerts
  • Route & prioritize
  • Formulate hypotheses for analyst
Technology
• (Hierarchical) k-means
• Divergence metrics
  • Radial density gradients from cluster centroid
  • Temporally-adaptive distance measures
  • Secondary peaks in density function
• Modeling methods
  • Create analyst profiles
  • RETE-based SAMs methods (last PI-meeting ARGUS paper)

Cluster Evolution
[Figure panels: Constant Event, New Obfuscated Event, New Un-obfuscated Event, Growing Event]
[Formula garbled in extraction; it involves a density term in x, a mixing weight, and a maximum over j of a term in $r_j$]

Cluster Density Changes
[Figure panels: Constant Event, New Obfuscated Event, New Un-obfuscated Event, Growing Event; same formula as on the Cluster Evolution slide]

Novelty Detection and Profile Management
[Diagram: data streams feed novelty detection and a matcher; the matcher uses profiles; new profiles are produced with the analyst]

Results on Medical Data
• New mini-cluster analysis reveals outbreaks of:
  • Tularemia
  • Dengue Fever
  • Myiasis
  • Chagas Disease
• SARS outbreak simulation
  • Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
  • Graph shows the resulting secondary peak in the pulmonary disease density function

What’s Rare Category Detection
• Start de-novo
• Very skewed classes
  • Majority classes
  • Minority classes
• Labeling oracle
• Goal
  • Discover minority classes with a few label requests

Comparison with Outlier Detection
Rare classes                                 Outliers
A group of points                            A single point
Clustered                                    Scattered
Non-separable from the majority classes      Separable

Applications
• Fraud detection
• Network intrusion detection
• Astronomy
• Spam image detection

The Big Picture
[Diagram: raw data passes through feature extraction to a feature representation (relational, temporal), giving an unbalanced unlabeled data set; rare category detection and learning in unbalanced settings then produce a classifier]

Questions We Want to Address
• How to detect rare categories in an unbalanced, unlabeled data set with the help of an oracle?
• How to detect rare categories with different data types, such as graph data, stream data, etc.?
• How to do rare category detection with the least information about the data set?
• How to select relevant features for the rare categories?
• How to design effective classification algorithms which fully exploit the property of the minority classes (rare category classification)?

Notation
• Unlabeled examples: $S = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
• $m$ classes: $y_i \in \{1, \dots, m\}$
• $m-1$ rare classes: $p_2, \dots, p_m$
• One majority class: $p_1$, with $p_1 \gg p_c$ for $2 \le c \le m$
• Goal: find at least ONE example from each rare class by requesting a few labels

Assumptions
• The distribution of the majority class is sufficiently smooth
• Examples from the minority classes form compact clusters in the feature space
[Figure: example density curve]

Two Classes: NNDB
1. Calculate class-specific radius $r$
2. $\forall x_i \in S$: $NN(x_i, r) = \{x \mid \lVert x - x_i \rVert \le r\}$, $n_i = |NN(x_i, r)|$
3. $s_i = \max_{x_j \in NN(x_i, tr)} (n_i - n_j)$
4. Query $x = \arg\max_{x_i \in S} s_i$
5. Is $x$ from the rare class? If no, increase $t$ by 1 and return to step 3
6. If yes, output $x$

NNDB: Calculate Nearest Neighbors
$NN(x_i, r) = \{x \mid \lVert x - x_i \rVert \le r\}$, $n_i = |NN(x_i, r)|$
[Figure: scatter plots showing the radius $r$ around a point]

NNDB: Calculate the Scores
$s_i = \max_{x_j \in NN(x_i, tr)} (n_i - n_j)$
Query $x = \arg\max_{x_i \in S} s_i$
[Figure: scatter plot showing the radius $tr$]

NNDB: Pick the Next Candidate
• Increase $t$ by 1
• $s_i = \max_{x_j \in NN(x_i, (t+1)r)} (n_i - n_j)$
• Query $x = \arg\max_{x_i \in S} s_i$
[Figure: scatter plot showing the radius $(t+1)r$]

Why NNDB Works
• Theoretically
  • Theorem 1 [He & Carbonell 2007]: under certain conditions, with high probability, after a few iteration steps, NNDB queries at least one example whose probability of coming from the minority class is at least 1/3
• Intuitively
  • The score $s_i$ measures the change in local density
[Figure: scatter plot]

Multiple Classes: ALICE
• $m-1$ rare classes: $p_2, \dots, p_m$
• One majority class: $p_1$, with $p_1 \gg p_c$, $c \ne 1$
1. For each rare class $c$, $2 \le c \le m$
2. Have we already found examples from class $c$? If yes, move on to the next class
3. If no, run NNDB with prior $p_c$

Why ALICE Works
• Theoretically
  • Theorem 2 [He & Carbonell 2008]: under certain conditions, with high probability, in each outer loop of ALICE, after a few iteration steps in NNDB, ALICE queries at least one example whose probability of coming from one minority class is at least 1/3

Implementation Issues
• ALICE
  • Problem: repeatedly sampling from the same rare class
• MALICE
  • Solution: relevance feedback
  • Class-specific radius

Results on Synthetic Data Sets
[Figure: synthetic data scatter plot]

Summary of Real Data Sets
• Abalone
  • 4177 examples
  • 7-dimensional features
  • 20 classes
  • Largest class: 16.50%
  • Smallest class: 0.34%
• Shuttle
  • 4515 examples
  • 9-dimensional features
  • 7 classes
  • Largest class: 75.53%
  • Smallest class: 0.13%

Results on Real Data Sets
[Figures: Abalone and Shuttle; curves for MALICE, Interleave, and Random sampling]
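The NNDB steps listed for the two-class case can be sketched as below. This is a hedged illustration, not the authors' code. Two details are assumptions: the radius rule (r is taken as the smallest distance from any point to its K-th nearest neighbor, counting the point itself, with K = n * prior) and the choice to skip points that were already queried. The `is_rare` oracle and the synthetic grid data are hypothetical.

```python
import math

def nndb(points, is_rare, prior, max_rounds=20):
    """Return (index, rounds) of the first queried point the oracle calls rare,
    or (None, max_rounds) if none is found."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Step 1 (assumed rule): K = n * prior; r = smallest K-th nearest-neighbor
    # distance, where each point counts as its own first neighbor.
    k = max(2, min(n, round(n * prior)))
    r = min(sorted(row)[k - 1] for row in d)
    # Step 2: n_i = number of points within distance r of x_i.
    counts = [sum(1 for x in row if x <= r) for row in d]
    queried = set()
    for t in range(1, max_rounds + 1):
        best, best_score = None, -math.inf
        for i in range(n):
            if i in queried:  # assumption: never ask about the same point twice
                continue
            # Step 3: s_i = max over x_j in NN(x_i, t*r) of (n_i - n_j).
            s_i = max(counts[i] - counts[j] for j in range(n) if d[i][j] <= t * r)
            if s_i > best_score:
                best, best_score = i, s_i
        if best is None:
            break
        # Steps 4-6: query the top-scoring point; stop if it is rare,
        # otherwise enlarge the neighborhood (t + 1) and repeat.
        queried.add(best)
        if is_rare(best):
            return best, t
    return None, max_rounds

# Synthetic data: a smooth majority class on a 10 x 10 grid (indices 0..99)
# and a compact minority cluster of 5 points (indices 100..104).
majority = [(float(i), float(j)) for i in range(10) for j in range(10)]
minority = [(4.5, 4.5), (4.51, 4.5), (4.5, 4.51), (4.49, 4.5), (4.5, 4.49)]
points = majority + minority
found, rounds = nndb(points, is_rare=lambda i: i >= 100, prior=5 / len(points))
```

On this toy data the point at the center of the compact cluster has far more neighbors within r than the points around it, so it receives the largest score and is queried first.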

Imprecise priors
[Figures: Abalone and Shuttle; y-axis: Classes Discovered; x-axis: Number of Selected Examples; one curve per prior perturbation: -5%, -10%, -20%, 0, +5%, +10%, +20%]

Specially Designed Exponential Families [Efron & Tibshirani 1996]
• Favorable compromise between parametric and nonparametric density estimation
  $g_\beta(x) = g_0(x) \exp\big(\beta_0 + \beta_1^T t(x)\big)$
  • $g_\beta(x)$: estimated density
  • $g_0(x)$: carrier density
  • $\beta_0$: normalizing parameter
  • $\beta_1$: $p \times 1$ parameter vector
  • $t(x)$: $p \times 1$ vector of sufficient statistics

SEDER Algorithm
• Carrier density: kernel density estimator
• Sufficient statistics: $t(x) = \big[(x^1)^2, \dots, (x^d)^2\big]^T$
• To decouple the estimation of different parameters:
  • Decompose $\beta_0 = \sum_{j=1}^{d} \beta_0^j$
  • Relax the constraint so that each per-dimension factor integrates to 1 over $x^j$ [equation garbled in extraction; it combines a Gaussian kernel in $x^j - x_i^j$ with bandwidth $\sigma^j$ and an exponential factor in $\beta_{0i}^j$ and $\beta_1^j$]

Parameter Estimation
• Theorem 3 [To appear]: the maximum likelihood estimates $\hat{\beta}_{0i}^j$ and $\hat{\beta}_1^j$ of $\beta_{0i}^j$ and $\beta_1^j$ satisfy conditions stated for $j = 1, \dots, d$ [equations garbled in extraction; they involve sums over $k = 1, \dots, n$ and $i = 1, \dots, n$ of Gaussian kernel terms in $x_k^j - x_i^j$ and expectations $E_i^j\big[(x^j)^2\big]$]

Parameter Estimation cont.
• Let $b^j$ be a positive parameter defined from $\sigma^j$ and $\beta_1^j$ [definition garbled in extraction]
• For $j = 1, \dots, d$, $\hat{b}^j$ is obtained as a root of a quadratic with coefficients $A$, $B$, $C$, built from $B^2 - 4AC$ and $2A$ [coefficient definitions garbled in extraction; they involve $(\sigma^j)^2$, $\sum_{k=1}^{n} (x_k^j)^2$, and kernel-weighted sums over $i$ and $k$]
• A simplification of $\hat{b}^j$ holds in most cases [details garbled in extraction]

Scoring Function
• The estimated density $\tilde{g}_b(x)$ is an average over $i = 1, \dots, n$ of products over $j = 1, \dots, d$ of Gaussian factors centered at $b^j x_i^j$ [equation garbled in extraction]
• Scoring function: the norm of the gradient of the estimated density at each example $x_k$ [equation garbled in extraction]

Results on Synthetic Data Sets
[Figure: results plots]

Summary of Real Data Sets
Data Set      n      d    m    Largest Class   Smallest Class   Skew
Ecoli         336    7    6    42.56%          2.68%            Moderately skewed
Glass         214    9    6    35.51%          4.21%            Moderately skewed
Page Blocks   5473   10   5    89.77%          0.51%            Extremely skewed
Abalone       4177   7    20   16.50%          0.34%            Extremely skewed
Shuttle       4515   9    7    75.53%          0.13%            Extremely skewed

Moderately Skewed Data Sets
[Figures: Ecoli and Glass, each with a MALICE curve]

Extremely Skewed Data Sets
[Figures: Page Blocks, Abalone, and Shuttle, each with a MALICE curve]

Additional Notation
• $W$: $n \times n$ pair-wise similarity matrix
• $D$: $n \times n$ diagonal matrix, $D_{ii} = \sum_{j=1}^{n} W_{ij}$
• $W' = D^{-1/2} W D^{-1/2}$: normalized matrix
• $A = (I_{n \times n} - \alpha W')^{-1}$: global similarity matrix, where $I_{n \times n}$ is an identity matrix and $\alpha$ is a positive parameter close to 1

Global Similarity Matrix
$A = (I_{n \times n} - \alpha W')^{-1}$
• Better than the pair-wise similarity matrix for rare category detection

GRADE: Full Prior Information
1. For each rare class $c$, $2 \le c \le m$
2. Calculate class-specific similarity $a^c$
3. $\forall x_i \in S$: $NN(x_i, a^c) = \{x \mid A(x, x_i) \ge a^c\}$, $n_i^c = |NN(x_i, a^c)|$
4. $s_i = \max_{x_j \in NN(x_i, a^c / t)} (n_i^c - n_j^c)$
5. Query $x = \arg\max_{x_i \in S} s_i$
6. Is $x$ from class $c$? If no, increase $t$ by 1 and return to step 4 (relevance feedback)
7. If yes, output $x$

GRADE-LI: Less Prior Information
1. Calculate problem-specific similarity $a$
2. $\forall x_i \in S$: $NN(x_i, a) = \{x \mid A(x, x_i) \ge a\}$, $n_i = |NN(x_i, a)|$
3. $s_i = \max_{x_j \in NN(x_i, a / t)} (n_i - n_j)$
4. Query $x = \arg\max_{x_i \in S} s_i$
5. Is $x$ from a new class? If no, increase $t$ by 1 and return to step 3 (relevance feedback)
6. If yes, output $x$
7. Budget exhausted? If no, continue querying

Results on Real Data Sets
[Figures: Glass, Shuttle, Abalone, and Ecoli, each with a MALICE curve]

Applying Machine Learning for Data Mining in Business
• Step 1: Have a clear objective to optimize
• Step 2: Have sufficient data
• Step 3: Clean, normalize, clean data some more
• Step 4: Make sure there isn’t an easy solution (e.g. a small number of rules from an expert)
• Step 5: Do the Data Mining for real
• Step 6: Cross-validate, improve, go to step 5

Managing the Data Mining Process
• Ingredients for successful DM
  • Data (warehouse, stream, DBs, …)
  • Right problems (objectives, …)
  • Tools (Machine Learning tool suites, …)
  • People (analogy to surgical team: next slide)
• Estimate (size) problem, approach, progress
  • ROI (max, min, realistic)
• Determine if DM is likely the best approach
• Deploy team
• Evaluate intermediate results

The Data Mining Team
• The Administrator (manager & domain)
  • Pick problem, resources, ROI calc, monitor, …
• The Surgeon (ML specialist w/domain knowledge)
  • Select ML method, predictor atts, objective, …
• The Anesthesiologist (preparer)
  • Chief data specialist, sampling, coverage, …
• The Nurses (assistants)
  • DB manager, programmers, gofers, …
• The Medical Students
  • Prepare new surgeons: learn by doing

Need Some Domain Expertise
• Data Preparation
  • What are good candidate predictor att’s?
  • How to combine multiple objectives?
  • How to sample? (e.g. identify cyclic periods)
• Progress monitoring and results interpretation
  • How accurate must the prediction be?
  • Do we need more or different data?
  • Are we pursuing reasonable objective(s)?
• Application of DM after it is accomplished
  • Update of DM when/as the environment evolves

Typical Data Mining Pitfalls
• Insufficient data to establish predictive patterns
• Incorrect selection of predictor attributes
  • Statistics to the rescue (e.g. $\chi^2$ test)
• Unrealistic objectives (e.g. fraud recovery)
• Inappropriate ML method selection
• Data preparation problems
  • Failure to normalize across data sets
  • Systematic bias in original data collection
• Belief in DM as panacea or black magic
• Giving up too soon (very common)

Final Words on Data Mining
• Data Mining is:
  • 1/3 science (math, algorithms, …)
  • …and 1/3 engineering (data prep, analysis, …)
  • …and 1/3 “art” (experience really counts)
• 10 years ago it was mostly art
• 10 years from now it will be mostly engineering
• What to expect from the research labs?
  • Better supervised algorithms
  • Focus on unsupervised learning + optimization
  • Move to incorporate semi-structured (text) data

THANK YOU!