Regression Models
• Fit data
  – Time-series data: forecast
  – Other data: predict

Use in Data Mining
• One of the major analytic models
  – Linear regression
    • The standard: ordinary least squares (OLS) regression
    • Can be used for discriminant analysis
    • Can apply stepwise regression
  – Nonlinear regression
    • More complex (but less reliable) data fitting
  – Logistic regression
    • When data are categorical (usually binary)
McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved

OLS (Ordinary Least Squares) Model
  Y = β0 + β1*X1 + β2*X2 + … + βn*Xn + ε
where
  Y is the dependent variable
  β0 is the intercept term
  β1 … βn are the n coefficients for the independent variables
  ε is the error term

OLS Regression
• Uses intercept and slope coefficients (βi) to minimize the squared error terms over all i observations
• Fits the data with a linear model
• Time-series data:
  – Observations over past periods
  – Best-fit line (in terms of minimizing the sum of squared errors)

Regression Output (page 101)
  R²: 0.987
  Intercept: 0.642  (t = 0.286, P = 0.776)
  Week:      5.086  (t = 53.27, P = 0.000)
  Requests = 0.642 + 5.086*Week

Time-Series Forecast
[Chart: regression forecast of Requests (0–300) against Week (0–60), plotting the actual requests and the fitted model line.]
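The time-series fit above can be sketched in a few lines of Python. This is an illustrative sketch, not the original data: the weekly request counts are synthetic, generated to roughly match the slide's fitted model (Requests = 0.642 + 5.086*Week) plus random noise.

```python
import numpy as np

# Synthetic weekly request counts, built around the slide's fitted line
# (Requests = 0.642 + 5.086 * Week); these are NOT the original data.
rng = np.random.default_rng(0)
week = np.arange(1, 53, dtype=float)                 # 52 weekly observations
requests = 0.642 + 5.086 * week + rng.normal(0.0, 5.0, size=week.size)

# OLS: choose intercept and slope that minimize the sum of squared errors.
X = np.column_stack([np.ones_like(week), week])      # intercept column + Week
beta, *_ = np.linalg.lstsq(X, requests, rcond=None)
intercept, slope = beta                              # slope should land near 5.086
```

With 52 noisy observations the recovered slope sits close to the generating value, which is the sense in which OLS produces the "best fit" line named on the slide.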
Regression Tests
• Fit:
  – SSE: sum of squared errors (synonym: SSR, sum of squared residuals)
  – R²: proportion of variance explained by the model
  – Adjusted R²: adjusts the calculation to penalize for the number of independent variables
• Significance:
  – F-test: test of overall model significance
  – t-test: test of significant difference between a model coefficient and zero
  – P: probability that the coefficient is zero (or at least on the other side of zero from the estimate)

Regression Model Tests
• SSE (sum of squared errors)
  – For each observation, subtract the model value from the observed value, square the difference, and total over all observations
  – By itself means nothing
  – Can be compared across models (lower is better)
  – Can be used to evaluate the proportion of variance in the data explained by the model
• R²
  – Ratio of the explained sum of squares (MSR) to the total sum of squares (SST)
    • SST = MSR + SSE
  – 0 ≤ R² ≤ 1

Multiple Regression
• Can include more than one independent variable
• Trade-off:
  – Too many variables: many spurious variables and overlapping information
  – Too few variables: miss important content
• Adding variables always increases R²
• Adjusted R² penalizes for additional independent variables

Example: Hiring Data
• Dependent variable: Sales
• Independent variables: Years of Education, College GPA, Age, Gender, College Degree

Regression Model
  Sales = 269025
          - 17148*YrsEd   (P = 0.175)
          -  7172*GPA     (P = 0.812)
          +  4331*Age     (P = 0.116)
          - 23581*Male    (P = 0.266)
          + 31001*Degree  (P = 0.450)
  R² = 0.252   Adjusted R² = -0.015
• Weak model; no independent variable significant at 0.10
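The fit statistics above (SSE, R², adjusted R²) follow directly from their definitions. A minimal sketch on toy data — the numbers are illustrative, not taken from the hiring example:

```python
import numpy as np

# Toy data: one independent variable plus an intercept.
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])                 # observed values
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])  # intercept + IV

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

sse = np.sum((y - fitted) ** 2)          # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
r2 = 1.0 - sse / sst                     # proportion of variance explained
n, k = X.shape[0], X.shape[1] - 1        # observations, independent variables
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)  # penalizes extra IVs
```

Because adjusted R² charges a price for each extra independent variable, it can fall (even below zero, as in the hiring model's -0.015) when added variables explain little.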
Improved Regression Model
  Sales = 173284
          - 9991*YrsEd  (P = 0.098*)
          + 3537*Age    (P = 0.141)
          - 18730*Male  (P = 0.328)
  R² = 0.218   Adjusted R² = 0.070

Logistic Regression
• Data are often ordinal or nominal
• Regression based on continuous numbers is not appropriate; dummy variables are needed
• Binary (either are or are not):
  – LOGISTIC REGRESSION (probability of either 1 or 0)
• Two or more categories:
  – DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)

Logistic Regression (continued)
• For dependent variables that are nominal or ordinal
• Probability of acceptance of case i to class j:
  Pj = 1 / (1 + e^-(β0 + Σ βi*xi))
• A sigmoidal function (in English, an S-curve from 0 to 1)

Insurance Claim Model
  Fraud = 81.824
          -  2.778*Age             (P = 0.789)
          - 75.893*Male            (P = 0.758)
          +  0.017*Claim           (P = 0.757)
          - 36.648*Tickets         (P = 0.824)
          +  6.914*Prior           (P = 0.935)
          - 29.362*Attorney Smith  (P = 0.776)
• Can get a probability by running the score through the logistic formula

Linear Discriminant Analysis
• Groups objects into a predetermined set of outcome classes
• Regression is one means of performing discriminant analysis
  – 2 groups: find a cutoff for the regression score
  – More than 2 groups: multiple cutoffs

Centroid Method (NOT regression)
• Binary data
• Divide the training set into two groups by binary outcome
  – Standardize the data to remove scales
• Identify the mean of each independent variable by group (the CENTROID)
• Calculate a distance function
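The logistic transform described above — running a linear score through the sigmoid — can be sketched directly. The scores passed in below are hypothetical, not values computed from the insurance claim coefficients:

```python
import math

def logistic(score: float) -> float:
    """Map a linear-model score onto a probability in (0, 1):
    P = 1 / (1 + e^-score), the S-curve named on the slide."""
    return 1.0 / (1.0 + math.exp(-score))

p_mid = logistic(0.0)     # a score of 0 maps to probability 0.5
p_high = logistic(4.0)    # large positive scores approach 1
p_low = logistic(-4.0)    # large negative scores approach 0
```

Whatever the linear score, the output stays between 0 and 1, which is why the slide recommends this transform when the dependent variable is a binary class rather than a continuous quantity.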
Fraud Data
  Age   Claim   Tickets   Prior   Outcome
  52    2000    0         1       OK
  38    1800    0         0       OK
  19    600     2         2       OK
  21    5600    1         2       Fraud
  41    4200    1         2       Fraud

Standardized & Sorted Fraud Data
  Age     Claim   Tickets   Prior   Outcome
  1       0.60    1         0.5     0
  0.9     0.64    1         1       0
  0       0.88    0         0       0
  0.633   0.707   0.667     0.500   0   (group 0 centroid)
  0.05    0       1         0       1
  1       0.16    1         0       1
  0.525   0.080   1.000     0.000   1   (group 1 centroid)

Distance Calculations
  New case: Age 0.50, Claim 0.30, Tickets 0, Prior 1
                 To group 0 (OK)           To group 1 (Fraud)
  Age 0.50       (0.633-0.5)² = 0.018      (0.525-0.5)² = 0.001
  Claim 0.30     (0.707-0.3)² = 0.166      (0.08-0.3)²  = 0.048
  Tickets 0      (0.667-0)²   = 0.445      (1-0)²       = 1.000
  Prior 1        (0.5-1)²     = 0.250      (0-1)²       = 1.000
  Totals                        0.879                      2.049

Discriminant Analysis with Regression
• Standardized data, binary outcomes
  Intercept      0.430   (P = 0.670)
  Age           -0.421   (P = 0.671)
  Gender         0.333   (P = 0.733)
  Claim         -0.648   (P = 0.469)
  Tickets        0.584   (P = 0.566)
  Prior Claims  -1.091   (P = 0.399)
  Attorney       0.573   (P = 0.607)
• R² = 0.804
• Cutoff: average of the group averages = 0.429

Case: Stepwise Regression
• Stepwise regression: automatic selection of independent variables
  – Look at F-scores of simple regressions
  – Add the variable with the greatest F-statistic
  – Check partial F-scores for adding each variable not in the model
  – Delete variables no longer significant
  – If no external variable is significant, quit
• Considered inferior to selection of variables by experts

Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association
• Data on 244,000 credit card accounts
  – 12-month period
  – 1 percent default
  – Cost of granting a loan that defaults: almost $5,000
  – Cost of denying a loan that would have paid: about $50
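The centroid-method distance calculation above can be reproduced directly: the squared Euclidean distance from the new standardized case to each group centroid, classifying into the nearer group.

```python
# Group centroids and the new case from the distance table
# (columns: Age, Claim, Tickets, Prior).
centroid_ok    = [0.633, 0.707, 0.667, 0.500]
centroid_fraud = [0.525, 0.080, 1.000, 0.000]
new_case       = [0.50, 0.30, 0.00, 1.00]

def sq_distance(case, centroid):
    """Sum of squared differences, matching the table's per-variable terms."""
    return sum((c - x) ** 2 for c, x in zip(centroid, case))

d_ok = sq_distance(new_case, centroid_ok)        # ~0.878 (table's 0.879 sums rounded terms)
d_fraud = sq_distance(new_case, centroid_fraud)  # ~2.049
label = "OK" if d_ok < d_fraud else "Fraud"      # nearer centroid wins
```

The new case lies much closer to the OK centroid (0.879 vs. 2.049), so the centroid method classifies it as OK.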
Data Treatment
• Divided the observations into 5 groups
  – Used one group for training; any smaller sample would have had problems due to insufficient default cases
  – Used 80% of the data for detailed testing
• Regression performed better than the C5 model
  – Even though C5 used the costs and regression didn't

Summary
• Regression is a basic classical model, with many forms
• Logistic regression is very useful in data mining
  – Outcomes are often binary
  – Can also be used on categorical data
• Can be used for discriminant analysis (to classify)