Learning and Making Decisions When Costs and Probabilities are Both Unknown
by B. Zadrozny and C. Elkan

Contents
• Introduce the problem
• Previous work
• Direct cost-sensitive decision making
• The dataset
• Estimating class membership probabilities
• Estimating costs
• Results and conclusions

Introduction
• Costs/benefits are the values assigned to classification decisions.
• Costs are often different for different examples.
• In cost-sensitive learning we are often interested in the rare class.
• Hence the problem of unbalanced data.

Cost-Sensitive Decisions
• Each training and test example has an associated cost.
• The general optimal prediction for an example x is
  $\arg\min_i \sum_j P(j \mid x)\, C(i, j, x)$
• Methods differ in how they estimate $P(j \mid x)$ and $C(i, j, x)$.
• Previous literature has assumed costs are known in advance and independent of examples:
  $C(i, j, x) = C(i, j, y)$ for all $x, y$.

MetaCost
• Estimation of $P(j \mid x)$: estimated on training examples only.
• Estimation of $C(i, j)$: example-independent.
• Training relabels each training example with its optimal label.
• Learns a classifier to predict this relabeling on test examples.

Direct Cost-Sensitive Decision Making
• Estimation of $P(j \mid x)$:
  – Average of naïve Bayes and decision tree estimates.
  – Estimated on both training and test sets.
• Estimation of $C(i, j, x)$:
  – Multiple linear regression.
  – Unbiased estimate using the Heckman procedure.
  – Example-dependent.
• Evaluation:
  – Compared against MetaCost and the KDD '98 competition results.
  – Using a large and difficult dataset, KDD '98.

MetaCost Implementation
• For the evaluation, MetaCost is adapted:
  – Class probability estimates are obtained by simple methods using decision trees.
  – Costs are made example-dependent during training.
• Adapted MetaCost vs. DCSDM:
  – DCSDM applies two models to each test example; MetaCost applies one.
  – In DCSDM, $P(j \mid x)$ is estimated on both training and test examples.

The Data Mining Task
• Data on persons who have donated in the past to a certain charity (KDD '98 competition).
  – Donor/non-donor labeling based on the last campaign.
• The task is to choose which people to ask for new donations.
• Training/test sets:
  – 95,412 records labeled donor or non-donor, with donation amounts.
  – 96,367 unlabeled records from the same donation campaign.

The Data Mining Task (cont.)
• Cost of soliciting: $0.68.
• Donations range from $1 to $200.
• 5% donors and 95% non-donors.
  – The very low response rate and the varying donation amounts make it hard to beat soliciting everyone.
• The dataset is hard:
  – It has already been filtered down to a reasonable set of prospects.
  – The task is to improve upon the unknown method that produced the set.

Applied DCSDM
• For KDD '98 we change C(i, j, x) to B(i, j, x): costs become benefits.

                      Actual Non-donor    Actual Donor
  Predict Non-donor   0                   0
  Predict Donor       -0.68               y(x) - 0.68

• B(1, 1, x) is example-dependent; previous literature replaced it with a constant.

Optimal Policy
• The expected benefit of not soliciting (i = 0):
  $P(j{=}0 \mid x)\,B(0,0,x) + P(j{=}1 \mid x)\,B(0,1,x) = 0$
• The expected benefit of soliciting (i = 1):
  $P(j{=}0 \mid x)\,B(1,0,x) + P(j{=}1 \mid x)\,B(1,1,x)$
  $= (1 - P(j{=}1 \mid x))(-0.68) + P(j{=}1 \mid x)\,(y(x) - 0.68)$
  $= P(j{=}1 \mid x)\,y(x) - 0.68$
• Optimal policy: solicit iff $P(j{=}1 \mid x)\,y(x) - 0.68 > 0$
  (a code sketch of this rule appears below, after the Smoothing slide).

Estimating P(j | x)
• Optimal decisions require $P(j \mid x)$.
• Class sizes may be highly unbalanced.
• Two methods proposed:
  – Decision trees: smoothing and curtailment.
  – Naïve Bayes: binning.

Problems with Decision Trees
• Decision trees assign as the score of each leaf the raw training frequency at that leaf, $p = k/n$.
• High bias:
  – Decision tree growing methods try to make leaves homogeneous, so the p's tend to be over- or under-estimates.
• High variance:
  – When n is small, p is not to be trusted.

Smoothing
• Pruning is no good for us.
• To make the estimates less extreme, replace
  $p = \frac{k}{n}$  with  $p' = \frac{k + b \cdot m}{n + m}$
• b: base rate; m: heuristic value (smoothing strength).
• Effect:
  – Where k and n are small, p' is essentially just the base rate.
  – If k and n are larger, p' is a combination of the base rate and the original score.
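A minimal sketch of the smoothing formula just above, in Python. The default m = 200 is an arbitrary placeholder, since the paper treats m as a heuristic choice.

```python
def smoothed_leaf_score(k: int, n: int, base_rate: float, m: float = 200.0) -> float:
    """m-estimate smoothing of a decision-tree leaf's raw frequency:
        p' = (k + b*m) / (n + m)
    k: positive training examples at the leaf, n: total examples at the
    leaf, base_rate: overall fraction of positives (b), m: smoothing
    strength (heuristic; the default 200 is an arbitrary placeholder).
    """
    return (k + base_rate * m) / (n + m)

# With few examples the score collapses to the base rate...
print(smoothed_leaf_score(k=2, n=2, base_rate=0.05))        # ~0.059, not 1.0
# ...with many examples it stays close to the raw frequency k/n.
print(smoothed_leaf_score(k=400, n=2000, base_rate=0.05))   # ~0.186 vs. 0.20
```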
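And, as referenced on the Optimal Policy slide, a sketch of the solicitation rule itself. This is an illustration rather than the paper's code: `p_donate` and `donation_amount` stand in for the calibrated probability estimate $P(j{=}1 \mid x)$ and the regression estimate $y(x)$ developed in the surrounding slides.

```python
MAILING_COST = 0.68  # cost of soliciting one person, in dollars

def expected_benefit_of_soliciting(p_donate: float, donation_amount: float) -> float:
    """Expected benefit of mailing: P(j=1|x) * y(x) - 0.68."""
    return p_donate * donation_amount - MAILING_COST

def should_solicit(p_donate: float, donation_amount: float) -> bool:
    """Mail if and only if the expected benefit of soliciting is positive
    (the expected benefit of not soliciting is 0)."""
    return expected_benefit_of_soliciting(p_donate, donation_amount) > 0

# Example: a 5% donation probability is worth mailing only if the
# expected donation exceeds 0.68 / 0.05 = $13.60.
print(should_solicit(0.05, 20.0))  # True:  0.05 * 20.0 = 1.00 > 0.68
print(should_solicit(0.05, 10.0))  # False: 0.05 * 10.0 = 0.50 < 0.68
```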
Smoothed Scores
[Figure: leaf scores before and after smoothing.]

Curtailment
• What if the leaves have enough training examples to be statistically reliable?
  – Then smoothing seems unnecessary.
• Curtailment searches through the tree and removes nodes where n < v.
• v is chosen either through cross-validation or a heuristic such as b·v = 10.

Curtailed Tree
[Figure: example of a curtailed decision tree.]

Curtailed Scores
[Figure: leaf scores after curtailment.]

Naïve Bayes Classifiers
• Naïve Bayes assumes that within any class the attribute values are independent variables.
• This assumption gives inaccurate probability estimates.
• Attributes tend to be positively correlated, so naïve Bayes estimates tend to be too extreme, i.e. close to zero or one.
• But they do rank examples well: if $n(x) \ge n(y)$ then $P(j{=}1 \mid x) \ge P(j{=}1 \mid y)$, where n(·) is the naïve Bayes score.

Calibrating Naïve Bayes Scores
• The histogram method (sketched in code below, after the Solution to Sample Selection Bias slide):
  – Sort the training examples by naïve Bayes score.
  – Divide the sorted set into b subsets of equal size, called bins.
  – For each bin, compute the lower and upper boundary scores.
• Given a new data point x:
  – Calculate n(x) and find the associated bin.
  – Let $P(j \mid x)$ = fraction of positive training examples in that bin.

Averaging Probability Estimates
• If probability estimates are partially uncorrelated, averaging them reduces their variance.
• Assuming all estimates have the same variance $\sigma^2$, the average estimate has variance
  $\sigma_{avg}^2 = \frac{1 + (N-1)\rho}{N}\,\sigma^2$
  where $\sigma^2$ is the individual classifier variance, N the number of classifiers, and $\rho$ the correlation factor among all classifiers.

Estimating the Donation Amount
• Solicit the person based on the policy
  $P(j{=}1 \mid x)\,y(x) - 0.68 > 0$
  where y(x) is the estimated donation amount.

Cost and Probability
• Good decisions: estimating costs well is more important than estimating probabilities.
• Why? The relative variation of cost across different examples is much greater than the relative variation of probabilities.
• Estimating the donation probability is difficult.
• Estimating the donation amount is easier, because past amounts are excellent predictors of future amounts.

Training and Test Data
• Two random processes: whether to donate, and how much to donate (the donation amount).

  Donation amount:   Donor      Non-donor
  Training data      known      -
  Test data          unknown    unknown

• The method used for estimating the donation amount is multiple linear regression (MLR).

Multiple Linear Regression
• Two attributes are used:
  – lastgift: dollar amount of the most recent gift.
  – ampergift: average gift amount in response to the last 22 promotions.
• A linear regression equation is used to estimate the donation amount.
• Only 46 of the 4,843 recorded donations exceed $50.
• Donors who donated at most $50 are used as input for the linear regression.

Problem of Sample Selection Bias
• Reasoning outside your learning space:

  Donation amount:   Donor      Non-donor
  Training data      known      -
  Test data          unknown    unknown

• Any donation-amount estimator is learned from the people who actually donated.
• This estimator is then applied to a different population, consisting of donors and non-donors.
• Donation amount and probability estimates are negatively correlated.

Solution to Sample Selection Bias
• Heckman's procedure:
  – Estimate the conditional probability P(j=1 | x) using a linear probit model.
  – Estimate y(x) on the training examples with j(x) = 1, including for each x a transformation of the estimated conditional probability.
• The authors' own procedure:
  – The conditional probability is learned using a decision tree or naïve Bayes classifier.
  – These probability estimates are added as additional attributes when estimating y(x).
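A sketch of the authors' own correction described just above: fit the donation-amount regression on donors only, with the estimated donation probability appended as an extra attribute, then apply it to the whole population. scikit-learn is an assumption here (the paper predates it), and the exact feature set is a placeholder.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_amount_model(X_donors, amounts, p_donate_donors):
    """Fit y(x) on donors only (e.g. columns lastgift, ampergift),
    appending the estimated donation probability as an extra attribute
    to compensate for learning from a donors-only sample."""
    X_aug = np.column_stack([X_donors, p_donate_donors])
    return LinearRegression().fit(X_aug, amounts)

def predict_amounts(model, X, p_donate):
    """Apply the donor-trained regressor to the full population of
    donors and non-donors, again appending the probability estimate."""
    return model.predict(np.column_stack([X, p_donate]))
```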
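And, as referenced on the Calibrating Naïve Bayes Scores slide, a sketch of the histogram method, assuming NumPy and 0/1 labels. The bin count and boundary handling are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def fit_bins(scores, labels, n_bins=10):
    """Histogram calibration: sort training examples by naive Bayes
    score, split them into equal-size bins, and record each bin's upper
    score boundary and its fraction of positive examples."""
    order = np.argsort(scores)
    score_bins = np.array_split(np.asarray(scores)[order], n_bins)
    label_bins = np.array_split(np.asarray(labels)[order], n_bins)
    upper_bounds = np.array([b[-1] for b in score_bins])   # per-bin upper boundary
    bin_rates = np.array([b.mean() for b in label_bins])   # P(j=1 | bin)
    return upper_bounds, bin_rates

def calibrated_probability(score, upper_bounds, bin_rates):
    """Map a new naive Bayes score n(x) to the positive fraction of its bin."""
    idx = np.searchsorted(upper_bounds, score)
    return bin_rates[min(idx, len(bin_rates) - 1)]

# Usage: scores come from a trained naive Bayes model, labels are in {0, 1}.
# bounds, rates = fit_bins(train_scores, train_labels)
# p = calibrated_probability(nb_score_of_x, bounds, rates)
```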
Experimental Results
[Table: test-set profits for direct cost-sensitive decision making vs. MetaCost, for each probability-estimation method, with and without Heckman's procedure.]

Experimental Results: Interpretation
• With Heckman's procedure, profit on the test set increases by $484 under every probability-estimation method.
  – This systematic improvement indicates that Heckman's procedure solves the problem of sample selection bias.
• MetaCost:
  – The best MetaCost result is $14,113.
  – The best direct cost-sensitive result is $15,329.
  – On average, MetaCost's test-set profit is $1,751 lower than that of direct cost-sensitive decision making.

Statistical Significance of Results
• 4,872 donors in the fixed test set, with an average donation of $15.62.
• A different test set drawn randomly from the same distribution would show a standard deviation of about $\sqrt{4872} \approx 70$ in the number of donors.
• This fluctuation causes a profit change of about $\sqrt{4872} \times 15.62 \approx \$1090$.
• A profit difference between two methods of less than $1090 is therefore not significant.

Conclusions
• Direct cost-sensitive learning is better than MetaCost.
• It provides a solution to the fundamental problem of costs being example-dependent.
• It identifies and solves the problem of sample selection bias for the KDD '98 dataset.

Questions?