Chapter 6: Regression Algorithms in Data Mining
- Fit data
- Time-series data: forecast
- Other data: predict

Contents
- Describes OLS (ordinary least squares) regression and logistic regression
- Describes linear discriminant analysis and centroid discriminant analysis
- Demonstrates the techniques on small data sets
- Reviews real applications of each model
- Shows the application of the models to larger data sets

Use in Data Mining
- Telecommunication industry: turnover (churn)
- One of the major analytic models for classification problems
- Linear regression: the standard, ordinary least squares regression; can be used for discriminant analysis; stepwise regression can be applied
- Nonlinear regression: more complex (but less reliable) data fitting
- Logistic regression: used when data are categorical (usually binary)

OLS Model
  Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
where Y is the dependent variable, β₀ is the intercept term, β₁ … βₙ are the coefficients for the n independent variables, and ε is the error term.

OLS Regression
- Uses the intercept and slope coefficients (β) to minimize the squared error terms over all i observations
- Fits the data with a linear model
- Time-series data: observations over past periods
- Best-fit line (in terms of minimizing the sum of squared errors)

Regression Output
- R² = 0.987
- Intercept: 0.642 (t = 0.286, P = 0.776)
- Week: 5.086 (t = 53.27, P ≈ 0)
- Fitted model: Requests = 0.642 + 5.086 × Week

Example
[Worked example relating R², SSE, and SST]

Example
[Worked example, continued]

[Figure: a graph of the time-series model, Requests (X1) vs. Pred_lmreg_1 (X2)]

Time-Series Forecast
[Figure: time-series prediction]

Regression Tests
Fit:
- SSE: sum of squared errors (synonym: SSR, sum of squared residuals)
- R²: proportion of variance explained by the model
- Adjusted R²: adjusts the calculation to penalize for the number of independent variables
Significance:
- F-test: test of overall model significance
- t-test: test of significant difference between a model coefficient and zero
- P: probability that the coefficient is zero (or at least on the other side of zero from the estimate)
See page 103.

Regression Model Tests
- SSE (sum of squared errors): for each observation, subtract the model value from the observed value, square the difference, and total over all observations. By itself it means nothing, but it can be compared across models (lower is better) and used to evaluate the proportion of variance in the data explained by the model.
- R²: ratio of the explained sum of squares (MSR) to the total sum of squares (SST), where SST = MSR + SSE; 0 ≤ R² ≤ 1
See page 104.

Multiple Regression
- Can include more than one independent variable
- Trade-off: too many variables bring in spurious variables with overlapping information; too few variables miss important content
- Adding variables always increases R²
- Adjusted R² penalizes for additional independent variables

Example: Hiring Data
- Dependent variable: Sales
- Independent variables: Years of Education, College GPA, Age, Gender, College Degree
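The OLS fit and the R² test described above can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: the weekly request counts below are made-up numbers chosen to grow by roughly five per week, since the slides' actual data set is not reproduced here.

```python
# Simple one-variable OLS, fit by the closed-form least-squares solution,
# with R^2 computed as 1 - SSE/SST as in the chapter's definitions.

def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y, b0, b1):
    """R^2 = 1 - SSE/SST: proportion of variance explained by the model."""
    my = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

weeks = list(range(1, 11))
requests = [6, 11, 15, 22, 26, 31, 36, 40, 47, 51]  # illustrative data

b0, b1 = ols_fit(weeks, requests)
r2 = r_squared(weeks, requests, b0, b1)
print(f"Requests = {b0:.3f} + {b1:.3f}*Week, R^2 = {r2:.3f}")
```

Because the illustrative series rises almost exactly linearly, the fitted slope lands near 5 and R² near 1, mirroring the strong fit reported on the slide.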
See pages 104-105.

Regression Model
  Sales = 269025 − 17148×YrsEd − 7172×GPA + 4331×Age − 23581×Male + 31001×Degree
  (P-values: 0.175, 0.812, 0.116, 0.266, 0.450, respectively)
  R² = 0.252, Adjusted R² = −0.015
A weak model; no coefficient is significant at the 0.10 level.

Improved Regression Model
  Sales = 173284 − 9991×YrsEd + 3537×Age − 18730×Male
  (P-values: 0.098*, 0.141, 0.328, respectively)
  R² = 0.218, Adjusted R² = 0.070

Logistic Regression
- Data are often ordinal or nominal, so regression based on continuous numbers is not appropriate; dummy variables are needed
- Binary (either are or are not): LOGISTIC REGRESSION (probability of either 1 or 0)
- Two or more categories: DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)

Logistic Regression
- For dependent variables that are nominal or ordinal
- Probability of acceptance of case i to class j
- Sigmoidal function (in English, an S curve from 0 to 1):
  Pⱼ = 1 / (1 + e^−(β₀ + Σᵢ βᵢxᵢ))

Insurance Claim Model
  Fraud = 81.824 − 2.778×Age − 75.893×Male + 0.017×Claim − 36.648×Tickets + 6.914×Prior − 29.362×AttySmith
  (P-values: 0.789, 0.758, 0.757, 0.824, 0.935, 0.776, respectively)
- A probability can be obtained by running the score through the logistic formula
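Running a score through the logistic formula, as the insurance-claim slide suggests, looks like the sketch below. The coefficient values are the slide's; the variable names and the example inputs are my own illustrative labels, and the clamp is a defensive detail not mentioned in the text.

```python
import math

# The insurance-claim score pushed through the sigmoid P = 1/(1 + e^-score),
# giving a fraud probability between 0 and 1.
COEF = {
    "intercept": 81.824, "age": -2.778, "male": -75.893, "claim": 0.017,
    "tickets": -36.648, "prior": 6.914, "atty_smith": -29.362,
}

def fraud_probability(age, male, claim, tickets, prior, atty_smith):
    """Linear score from the fitted coefficients, mapped into (0, 1)."""
    score = (COEF["intercept"] + COEF["age"] * age + COEF["male"] * male
             + COEF["claim"] * claim + COEF["tickets"] * tickets
             + COEF["prior"] * prior + COEF["atty_smith"] * atty_smith)
    score = max(min(score, 500.0), -500.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical case: 25-year-old male, $2,000 claim, one ticket, no priors,
# not represented by attorney Smith.
p = fraud_probability(age=25, male=1, claim=2000, tickets=1, prior=0, atty_smith=0)
print(f"fraud probability: {p:.4f}")
```

Whatever the score, the sigmoid guarantees an output strictly between 0 and 1, which is why logistic regression suits binary outcomes better than OLS does.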
See pages 107-109.

Linear Discriminant Analysis
- Groups objects into a predetermined set of outcome classes
- Regression is one means of performing discriminant analysis
- 2 groups: find a cutoff for the regression score
- More than 2 groups: multiple cutoffs

Centroid Method (NOT regression)
- Binary data
- Divide the training set into two groups by binary outcome
- Standardize the data to remove scales
- Identify the mean of each independent variable by group (the CENTROID)
- Calculate a distance function

Fraud Data
  Age  Claim  Tickets  Prior  Outcome
  52   2000   0        1      OK
  38   1800   0        0      OK
  19    600   2        2      OK
  21   5600   1        2      Fraud
  41   4200   1        2      Fraud

Standardized & Sorted Fraud Data
  Age   Claim  Tickets  Prior  Outcome
  1.00  0.60   1.00     0.50   0 (OK)
  0.90  0.64   1.00     1.00   0 (OK)
  0.00  0.88   0.00     0.00   0 (OK)
  Centroid (OK):    0.633  0.707  0.667  0.500
  0.05  0.00   1.00     0.00   1 (Fraud)
  1.00  0.16   1.00     0.00   1 (Fraud)
  Centroid (Fraud): 0.525  0.080  1.000  0.000

Distance Calculations
  Variable  New   To 0 (OK)               To 1 (Fraud)
  Age       0.50  (0.633−0.50)² = 0.018   (0.525−0.50)² = 0.001
  Claim     0.30  (0.707−0.30)² = 0.166   (0.080−0.30)² = 0.048
  Tickets   0     (0.667−0)²    = 0.445   (1−0)²        = 1.000
  Prior     1     (0.500−1)²    = 0.250   (0−1)²        = 1.000
  Totals          0.879                   2.049
The new case is closer to the OK centroid (0.879 < 2.049), so it is classified as OK.

Discriminant Analysis with Regression
Standardized data, binary outcomes:
  Intercept      0.430  (P = 0.670)
  Age           −0.421  (P = 0.671)
  Gender         0.333  (P = 0.733)
  Claim         −0.648  (P = 0.469)
  Tickets        0.584  (P = 0.566)
  Prior Claims  −1.091  (P = 0.399)
  Attorney       0.573  (P = 0.607)
  R² = 0.804
Cutoff (average of the two group averages): 0.429

Case: Stepwise Regression
- Automatic selection of independent variables:
  - Look at the F scores of simple regressions; add the variable with the greatest F statistic
  - Check partial F scores for adding each variable not in the model
  - Delete variables that are no longer significant
  - If no external variable is significant, quit
- Considered inferior to selection of variables by experts

Credit Card Bankruptcy Prediction
- Foster & Stine (2004), Journal of the American Statistical Association
- Data on 244,000 credit card accounts over a 12-month period; 1 percent default
- Cost of granting a loan that defaults: almost $5,000
- Cost of denying a loan that would have paid: about $50

Data Treatment
- Divided the observations into 5 groups; used one for training (any smaller sample would have problems due to insufficient default cases)
- Used 80% of the data for detailed testing
- Regression performed better than the C5 model, even though C5 used the costs and regression didn't

Summary
- Regression is a basic classical model with many forms
- Logistic regression is very useful in data mining: outcomes are often binary, and it can also be used on categorical data
- Regression can be used for discriminant analysis, i.e., to classify
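The centroid-method distance calculation worked through earlier in the chapter can be reproduced directly. The centroids and the new case's standardized values are the slides' numbers; only the code around them is mine, and the printed totals differ from the slides' 0.879 and 2.049 by at most rounding.

```python
# Centroid-method classification: compare a new standardized observation
# (Age, Claim, Tickets, Prior) to each group's centroid by summed squared
# distance, and assign it to the nearer group.

OK_CENTROID    = [0.633, 0.707, 0.667, 0.500]
FRAUD_CENTROID = [0.525, 0.080, 1.000, 0.000]

def squared_distance(obs, centroid):
    """Sum of squared differences across the standardized variables."""
    return sum((c - o) ** 2 for o, c in zip(obs, centroid))

new_case = [0.50, 0.30, 0.0, 1.0]
d_ok = squared_distance(new_case, OK_CENTROID)        # slide total: 0.879
d_fraud = squared_distance(new_case, FRAUD_CENTROID)  # slide total: 2.049
label = "OK" if d_ok < d_fraud else "Fraud"
print(f"distance to OK = {d_ok:.3f}, to Fraud = {d_fraud:.3f} -> classify {label}")
```

Note that the comparison only works because the variables were standardized first; on raw scales, the Claim column (hundreds to thousands of dollars) would dominate the distance.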
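The forward part of the stepwise procedure described in the chapter can be sketched as follows. This is a simplified stand-in, not the chapter's algorithm: it uses a relative SSE-improvement threshold in place of the partial F test, and the tiny data set and 5% threshold are illustrative assumptions.

```python
# Forward stepwise selection sketch: at each step, try adding every remaining
# variable, keep the one that reduces SSE most, and quit when the best
# candidate no longer improves the fit enough.

def solve(a, b):
    """Solve a small linear system (the normal equations) by Gaussian elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def sse_with(cols, rows, y):
    """Fit OLS on the chosen columns (plus an intercept) and return the SSE."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def forward_stepwise(rows, y, min_gain=0.05):
    """Greedily add the variable that most reduces SSE; stop when the best
    remaining candidate improves SSE by less than min_gain (here 5%)."""
    chosen, remaining = [], list(range(len(rows[0])))
    best = sum((yi - sum(y) / len(y)) ** 2 for yi in y)  # intercept-only SST
    while remaining:
        trials = {c: sse_with(chosen + [c], rows, y) for c in remaining}
        c = min(trials, key=trials.get)
        if best - trials[c] < min_gain * best:
            break  # no remaining variable is worth adding; quit
        chosen.append(c)
        remaining.remove(c)
        best = trials[c]
    return chosen, best

# Illustrative data: y depends strongly on column 0, barely on column 1.
rows = [[1, 9], [2, 7], [3, 8], [4, 2], [5, 5]]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
chosen, sse = forward_stepwise(rows, y)
print("selected columns:", chosen, "SSE:", round(sse, 3))
```

The greedy add-then-stop structure is why the chapter can call the method "automatic", and the absence of any domain judgment in the loop is exactly why it is considered inferior to variable selection by experts.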