Profiting from Data Mining
Bob Stine
Department of Statistics
The Wharton School, Univ of Pennsylvania
April 5, 2002
www-stat.wharton.upenn.edu/~bob
Overview

 Critical stages of the data mining process
- Choosing the right data, people, and problems
- Modeling
- Validation
 Automated modeling
- Feature creation and selection
- Exploiting expert knowledge, “insights”
 Applications
- Little detail – Biomedical: finding predictive risk factors
- More detail – Financial: predicting returns on the market
- Lots of detail – Credit: anticipating the onset of bankruptcy
Predicting Health Risk

 Who is at risk for a disease?
- Example: detect osteoporosis without the expense of an x-ray
 Goals
- Improving public health
- Savings on medical care
- Confirm an informal model with data mining
 Many types of features, many interested groups
- Clinical observations of doctors
- Laboratory measurements, “genetic”
- Self-reported behavior
 Missing data
Predicting the Stock Market

 Small, “hands-on” example
 Goals
- Better retirement savings?
- Money for that special vacation? College?
- Trade-offs: risk vs return
 Lots of “free” data
- Access to accurate historical time trends, macro factors
- Recent data more useful than older data
 “Simple” modeling technique
 Validation
Predicting the Market: Specifics

 Build a regression model
- Response is the return on the value-weighted S&P
- Use standard forward/backward stepwise (sketched below)
- Battery of 12 predictors with interactions
 Train the model during 1992-1996 (training data)
- Model captures most of the variation in 5 years of returns
- Retain only the most significant features (Bonferroni)
 Predict returns in 1997 (validation data)
 Another version in Foster, Stine & Waterman
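A minimal sketch of this kind of forward stepwise search with a Bonferroni-style cutoff, not the actual model from the talk. The data here are synthetic noise (78 candidates, matching 12 predictors plus their pairwise interactions), so a sound threshold should usually select nothing.

```python
# Forward stepwise regression with a Bonferroni-style cutoff (sketch).
import numpy as np

rng = np.random.default_rng(0)
n_train, p = 60, 78                        # 60 monthly returns; 12 predictors + 66 interactions
X = rng.normal(size=(n_train, p))
y = rng.normal(scale=0.03, size=n_train)   # pure noise: nothing real to find

def t_ratio(X, y, selected, j):
    """t-ratio of column j when added to the model holding the selected columns."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, k] for k in selected])
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]             # residualize y
    rx = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0] # residualize x_j
    b = (rx @ ry) / (rx @ rx)
    s2 = np.sum((ry - b * rx) ** 2) / (len(y) - len(selected) - 2)
    return b / np.sqrt(s2 / (rx @ rx))

threshold = np.sqrt(2 * np.log(p))          # Bonferroni-style cutoff, about 2.95 here
selected = []
while True:
    candidates = [j for j in range(p) if j not in selected]
    ts = [abs(t_ratio(X, y, selected, j)) for j in candidates]
    if max(ts) < threshold:
        break                               # nothing clears the bar: stop
    selected.append(candidates[int(np.argmax(ts))])
print("selected:", selected)                # with noise, usually an empty list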
Historical patterns?

[Figure: monthly value-weighted return (vwReturn) plotted by year, 1992-1998; the period after the training years is marked “?”]
Fitted model predicts...

[Figure: fitted and predicted returns by year, 1992-1998, with an annotation asking “Exceptional Feb return?”]
What happened?

[Figure: prediction error (Pred Error) by year, 1992-1998, with the training period marked]
Claimed versus Actual Error

[Figure: claimed vs actual squared prediction error as a function of the complexity of the model (10-100 predictors); the claimed error sits far below the actual error]
Over-confidence?

 Over-fitting
- Model fits the training data too well – better than it can predict the future.
- Greedy fitting procedure
“Optimization capitalizes on chance”
 Some intuition
- Coincidences
• Cancer clusters, the “birthday problem”
- Illustration with an auction
• What is the value of the coins in this jar?
Auctions and Over-fitting

What is the value of these coins?
Auctions and Over-fitting

 Auction a jar of coins to a class of MBA students
 Histogram shows the bids of 30 students
 Most were suspicious, but a few were not!
 Actual value is $3.85
 Known as the “Winner’s Curse”
 Similar to over-fitting: the best model is like the high bidder

[Figure: histogram of the 30 students’ bids]
Profiting from data mining?

 Where’s the profit in this?
- “Mining the miners” vs getting value from your data
- Lost opportunities
 Importance of domain knowledge
 Validation as a measure of success
- Prediction provides an explicit check
- Does your application predict something?
Pitfalls and Role of Management

Over-fitting is dominated by other issues…

 Management support
- Life in silos
- Coordination across domains
 Responsibility and reward
- Accountability
- Who gets the credit when it succeeds? Who suffers if the project is not successful?
Specific Potholes

 Moving targets
- “Let’s try this with something else.”
 Irrational expectations
- “I could have done better than that.”
 Not with my data
- “It’s our data. You can’t use it.”
- “You did not use our data properly.”
Back to a real application…

Emphasis on the statistical issues…
Predicting Bankruptcy

 Goal
- Reduce losses stemming from personal bankruptcy
 Possible strategies
- If we can identify those with the highest risk of bankruptcy, take some action
• Call them for a “friendly chat” about circumstances
• Unilaterally reduce the credit limit
 Trade-off
- Good customers borrow lots of money
- Bad customers also borrow lots of money
Predicting Bankruptcy

 “Needle in a haystack”
- 3,000,000 months of credit-card activity
- 2244 bankruptcies
- The simple predictor that everyone is OK looks pretty good: it is wrong only 0.07% of the time.
 What factors anticipate bankruptcy?
- Spending patterns? Payment history?
- Demographics? Missing data?
- Combinations of factors?
• Cash Advance + Las Vegas = Problem
 We consider more than 100,000 predictors!
Modeling: Predictive Models

 Build the model
Identify patterns in training data that predict future observations.
- Which features are real? Coincidental?
 Evaluate the model
How do you know that it works?
- During the model construction phase
• Only incorporate meaningful features
- After the model is built
• Validate by predicting new observations
Are all prediction errors the same?

 Symmetry
- Is over-predicting as costly as under-predicting?
- Managing inventories and sales
- Visible costs versus hidden costs
 Does a false positive = a false negative?
- Classification in data mining
- Credit modeling, flagging “risky” customers
- False positive: call a good customer “bad”
- False negative: fail to identify a “bad”
- Differential costs for different types of errors (see the sketch below)
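To make the differential-cost point concrete, here is a small sketch. The dollar costs are illustrative assumptions, not figures from the talk.

```python
# Cost-sensitive classification: with unequal error costs, the optimal
# cutoff on a predicted probability is no longer 0.5.
cost_fp = 50     # false positive: annoy (or lose) a good customer flagged as "bad"
cost_fn = 2000   # false negative: miss a customer who later goes bankrupt

# Flag when the expected cost of inaction exceeds that of action:
#   p * cost_fn > (1 - p) * cost_fp   =>   p > cost_fp / (cost_fp + cost_fn)
cutoff = cost_fp / (cost_fp + cost_fn)
print(f"flag customers with Pr(bankrupt) > {cutoff:.3f}")  # 0.024, far below 0.5
```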
Building a Predictive Model

So many choices…

 Structure: What type of model?
• Neural net
• CART, classification tree
• Additive model or regression spline
 Identification: Which features to use?
• Time lags, “natural” transformations
• Combinations of other features
 Search: How does one find these features?
• Brute force has become cheap.
Our Choices

 Structure
- Linear regression with nonlinearity via interactions
- All 2-way and some 3-way, 4-way interactions
- Missing data handled with indicators (interactions and indicators are sketched below)
 Identification
- Conservative standard error
- Comparison of conservative t-ratio to adaptive threshold
 Search
- Forward stepwise regression
- Coming: Dynamically changing list of features
• Good choice affects where you search next.
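A minimal sketch of this kind of feature construction, pairwise interactions plus missing-data indicators. The column names and values are hypothetical; the deck does not specify the actual variables.

```python
# Build interactions and missing-data indicators from a small frame (sketch).
import numpy as np
import pandas as pd
from itertools import combinations

df = pd.DataFrame({
    "balance":  [1200.0, np.nan, 300.0, 4500.0],
    "payments": [2, 0, np.nan, 1],
    "cash_adv": [0.0, 800.0, 50.0, np.nan],
})

features = pd.DataFrame(index=df.index)
for col in df.columns:
    features[col + "_missing"] = df[col].isna().astype(int)  # indicator keeps the signal
    features[col] = df[col].fillna(df[col].mean())           # impute, but keep the flag

for a, b in combinations(df.columns, 2):                     # all 2-way interactions
    features[f"{a}*{b}"] = features[a] * features[b]

print(features.shape)   # 3 raw + 3 indicators + 3 interactions = 9 columns
```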
Identifying Predictive Features

 Classical problem of “variable selection”
 Thresholding methods (compare t-ratio to a threshold; the cutoffs are compared below)
- Akaike information criterion (AIC)
- Bayes information criterion (BIC)
- Hard thresholding and Bonferroni
 Arguments for adaptive thresholds
- Empirical Bayes
- Information theory
- Step-up/step-down tests
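Each criterion above can be read, at least approximately, as a cutoff on the absolute t-ratio of a candidate predictor. A quick comparison, using sizes of the order quoted later in the deck:

```python
# Selection criteria as approximate |t-ratio| cutoffs (sketch).
import numpy as np

n, p = 600_000, 100_000                        # observations, candidate predictors
print("AIC        :", np.sqrt(2))              # add if t^2 > 2       (~1.41)
print("BIC        :", np.sqrt(np.log(n)))      # add if t^2 > log n   (~3.65)
print("Bonferroni :", np.sqrt(2 * np.log(p)))  # "hard" threshold     (~4.80)
```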
Adaptive Thresholding

 Threshold changes to conform to attributes of the data
- Easier to add features as more are found.
 Threshold for the first predictor
- Compare conservative t-ratio to Bonferroni.
- Bonferroni is about Sqrt(2 log p)
- If something significant is found, continue.
 Threshold for the second predictor
- Compare t-ratio to a reduced threshold
- New threshold is about Sqrt(2 log(p/2)) (see the sketch below)
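A small sketch of how this threshold sequence behaves: by the time the q-th predictor is considered, the bar has dropped to roughly Sqrt(2 log(p/q)), so finding signal makes further signal easier to admit.

```python
# Adaptive threshold sequence sqrt(2 log(p/q)) (sketch).
import numpy as np

p = 100_000                      # size of the candidate feature pool
for q in (1, 2, 10, 100, 1000):  # q-th predictor under consideration
    print(q, round(np.sqrt(2 * np.log(p / q)), 2))
# 1 -> 4.8, 2 -> 4.65, 10 -> 4.29, 100 -> 3.72, 1000 -> 3.03
```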
Adaptive Thresholding: Benefits

 Easy
As easy and fast as implementing the standard criterion used in stepwise regression.
 Theory
The resulting model is provably as good as the best Bayes model for the problem at hand.
 Real world
It works! Finds models with real signal, and stops when the signal runs out.
Bankruptcy Model: Construction

 Data: reserve 80% for validation
- Training data
• 600,000 months
• 458 bankruptcies
- Validation data
• 2,400,000 months
• 1786 bankruptcies
 Selection via adaptive thresholding
- Compare the sequence of t-statistics to Sqrt(2 log(p/q))
- Dynamic expansion of the feature space
Bankruptcy Model: Preview

 Predictors
- Initial search identifies 39
• Validation SS monotonically falls to 1650
• A linear fit can do no better than 1735
- Expanded search of higher interactions finds a bit more
• Nature of the predictors comprising the interactions
• Validation SS drops 10 more
 Validation: Lift chart
- Top 1000 candidates have 351 bankrupt
 More validation: Calibration
- Close to actual Pr(bankrupt) for most groups.
Bankruptcy Model: Fitting

 Where should the fitting process be stopped?

[Figure: residual sum of squares (SS) vs number of predictors (0-150)]
Bankruptcy Model: Fitting

 Our adaptive selection procedure stops at a model with 39 predictors.

[Figure: residual sum of squares (SS) vs number of predictors (0-150)]
Bankruptcy Model: Validation

 The validation indicates that the fit gets better while the model expands. Avoids over-fitting.

[Figure: validation sum of squares (SS) vs number of predictors (0-150)]
Bankruptcy Model: Linear?

 Choosing from linear predictors (no interactions) does not match the performance of the full search.

[Figure: validation sum of squares (SS) vs number of predictors, linear vs quadratic]
Bankruptcy Model: More?

 Searching higher-order interactions offers modest improvement.

[Figure: validation sum of squares (SS) vs number of predictors, quadratic vs cubic]
Lift Chart

 Measures how well the model classifies the sought-for group

Lift = (% bankrupt in DM selection) / (% bankrupt in all data)

 Depends on the rule used to label customers (see the sketch below)
- Very high threshold
Lots of lift, but few bankrupt customers are found.
- Lower threshold
Lift drops, but finds more bankrupt customers.
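A sketch of this lift calculation on synthetic scores and outcomes (the real model scores are not in the deck): score the accounts, take the top fraction, and compare the bankruptcy rate there to the base rate.

```python
# Lift = rate of the sought-for group among top-scored cases / base rate (sketch).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
bankrupt = rng.random(n) < 0.0008            # rare outcome, as in the talk
score = bankrupt * 0.3 + rng.random(n)       # loosely informative model score

def lift(score, outcome, top_frac):
    k = int(top_frac * len(score))
    top = np.argsort(score)[::-1][:k]        # highest-scored accounts
    return outcome[top].mean() / outcome.mean()

for frac in (0.001, 0.01, 0.10):
    print(f"top {frac:.1%}: lift = {lift(score, bankrupt, frac):.1f}")
# A high cutoff gives big lift on few accounts; lowering it finds more
# bankruptcies, but the lift shrinks.
```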
Generic Lift Chart

[Figure: % responders vs % chosen, comparing the model’s curve to the random-selection diagonal]
Bankruptcy Model: Lift

 Much better than diagonal!

[Figure: % found vs % contacted]
Calibration

 Classifier assigns a Prob(“BR”) rating to a customer.
 Weather forecast
 Among those classified as a 2/10 chance of “BR”, how many are BR?
 Closer to the diagonal is better. (A sketch of this check follows.)

[Figure: actual percentage vs claimed percentage]
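A sketch of this calibration check on synthetic predictions (assumed data, not the bankruptcy model’s output): bin accounts by the claimed probability and compare to the observed rate in each bin.

```python
# Calibration check: observed rate within bins of claimed probability (sketch).
import numpy as np

rng = np.random.default_rng(2)
claimed = rng.random(10_000)                     # model's Pr("BR") claims
actual = rng.random(10_000) < claimed ** 1.3     # a slightly miscalibrated truth

bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (claimed >= lo) & (claimed < hi)
    if mask.any():
        print(f"claimed {lo:.1f}-{hi:.1f}: actual rate {actual[mask].mean():.2f}")
# Well-calibrated claims keep every row near the bin midpoint (the diagonal);
# systematic gaps flag over- or under-prediction.
```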
Bankruptcy Model: Calibration

 Over-predicts risk above a claimed probability of 0.4

[Figure: calibration chart, actual vs claimed probability]
Summary of Bankruptcy Model

 Automatic, adaptive selection
- Finds patterns that predict new observations
- Predictive, but not easy to explain
 Dynamic feature set
- Current research
- Information theory allows changing the search space
- Finds more structure than a direct search could find
 Validation
- Essential for judging fit.
- Better than “hand-made” models that take years to create.
So, where’s the profit in DM?

 Automated modeling has become very powerful, avoiding problems of over-fitting.
 Role for expert judgment remains
- What data to use?
- Which features to try first?
- What are the economics of the prediction errors?
 Collaboration
- Data sources
- Data analysis
- Strategic decisions