Chapter 13: Multiple Regression
Section 13.1: Using Several Variables to Predict a Response

Regression Models
The model that contains only two variables, x and y, is called a bivariate model:
μy = α + βx
Suppose there are two predictors, denoted by x1 and x2. This is called a multiple regression model:
μy = α + β1x1 + β2x2

Multiple Regression Model
The multiple regression model relates the mean μy of a quantitative response variable y to a set of explanatory variables x1, x2, ....
Example: For three explanatory variables, the multiple regression equation is:
μy = α + β1x1 + β2x2 + β3x3
The sample prediction equation with three explanatory variables is:
ŷ = a + b1x1 + b2x2 + b3x3

Example: Predicting Selling Price Using House Size and Number of Bedrooms
The data set "house selling prices" contains observations on 100 home sales in Florida in November 2003. A multiple regression analysis was done with selling price as the response variable and with house size and number of bedrooms as the explanatory variables.

Output from the analysis:
Table 13.3 Regression of Selling Price on House Size and Bedrooms. The regression equation is price = 60,102 + 63.0 house size + 15,170 bedrooms.

Prediction equation:
ŷ = 60,102 + 63.0x1 + 15,170x2
where y = selling price, x1 = house size, and x2 = number of bedrooms.

One house listed in the data set had house size = 1679 square feet and number of bedrooms = 3. Find its predicted selling price:
ŷ = 60,102 + 63.0(1679) + 15,170(3) = $211,389

Find its residual:
y − ŷ = 232,500 − 211,389 = 21,111
The residual tells us that the actual selling price was $21,111 higher than predicted.
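The prediction and residual calculations are easy to verify in code. A minimal Python sketch, using the coefficients from Table 13.3 (the helper function name is ours):

```python
# Prediction equation from Table 13.3: price = 60,102 + 63.0(house size) + 15,170(bedrooms)
def predicted_price(house_size_sqft, bedrooms):
    return 60_102 + 63.0 * house_size_sqft + 15_170 * bedrooms

y_hat = predicted_price(1679, 3)   # predicted selling price for the example house
residual = 232_500 - y_hat         # observed minus predicted
print(f"predicted: ${y_hat:,.0f}, residual: ${residual:,.0f}")
# predicted: $211,389, residual: $21,111
```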
The Number of Explanatory Variables
You should not use many explanatory variables in a multiple regression model unless you have lots of data. A rough guideline is that the sample size n should be at least 10 times the number of explanatory variables. For example, to use two explanatory variables, you should have at least n = 20.

Plotting Relationships
Always look at the data before doing a multiple regression. Most software has the option of constructing scatterplots on a single graph for each pair of variables. This is called a scatterplot matrix.

Figure 13.1 Scatterplot Matrix for Selling Price, House Size, and Number of Bedrooms. The middle plot in the top row has house size on the x-axis and selling price on the y-axis. The first plot in the second row reverses this, with selling price on the x-axis and house size on the y-axis. Question: Why are the plots of main interest the ones in the first row?

Interpretation of Multiple Regression Coefficients
The simplest way to interpret a multiple regression equation is to look at it in two dimensions as a function of a single explanatory variable. We can look at it this way by fixing values for the other explanatory variable(s).

Example using the housing data: Suppose we fix the number of bedrooms at x2 = 3. The prediction equation becomes:
ŷ = 60,102 + 63.0x1 + 15,170(3) = 105,612 + 63.0x1

Since the slope coefficient of x1 is 63.0, the predicted selling price increases, for houses with this number of bedrooms, by $63.00 for every additional square foot of house size. For a 100-square-foot increase in house size, the predicted selling price increases by 100(63.00) = $6300.

Summarizing the Effect While Controlling for a Variable
The multiple regression model assumes that the slope for a particular explanatory variable is identical for all fixed values of the other explanatory variables. For example, the coefficient of x1 in the prediction equation
ŷ = 60,102 + 63.0x1 + 15,170x2
is 63.0 regardless of whether we plug in x2 = 1, x2 = 2, or x2 = 3.

Figure 13.2 The Relationship Between ŷ and x1 for the Multiple Regression Equation ŷ = 60,102 + 63.0x1 + 15,170x2. This shows how the equation simplifies when the number of bedrooms is x2 = 1, x2 = 2, or x2 = 3. Question: The lines move upward (to higher ŷ-values) as x2 increases. How would you interpret this fact?
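A short sketch makes the "same slope, shifted intercept" point concrete: evaluating the prediction equation at x2 = 1, 2, and 3 yields three parallel lines in x1 (variable names here are ours):

```python
# Prediction equation: y_hat = 60,102 + 63.0*x1 + 15,170*x2
def predict(x1, x2):
    return 60_102 + 63.0 * x1 + 15_170 * x2

for bedrooms in (1, 2, 3):
    intercept = predict(0, bedrooms)            # the line's value at x1 = 0
    slope = predict(1, bedrooms) - intercept    # change in y_hat per square foot
    print(f"x2 = {bedrooms}: y_hat = {intercept:,.0f} + {slope:.1f} x1")
# The slope is 63.0 for every bedroom count; only the intercept shifts upward.
```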
Slopes in Multiple Regression and in Bivariate Regression
In multiple regression, a slope describes the effect of an explanatory variable while controlling for the effects of the other explanatory variables in the model. Bivariate regression has only a single explanatory variable. A slope in bivariate regression describes the effect of that variable while ignoring all other possible explanatory variables.

Importance of Multiple Regression
One of the main uses of multiple regression is to identify potential lurking variables and control for them by including them as explanatory variables in the model. Doing so can have a major impact on a variable's effect. When we control a variable, we keep that variable from influencing the associations among the other variables in the study.

Section 13.2: Extending the Correlation and R-Squared for Multiple Regression

Multiple Correlation
To summarize how well a multiple regression model predicts y, we analyze how well the observed y values correlate with the predicted ŷ values. The multiple correlation is the correlation between the observed y values and the predicted ŷ values. It is denoted by R.

For each subject, the regression equation provides a predicted value, so each subject has an observed y-value and a predicted y-value.
Table 13.4 Selling Prices and Their Predicted Values. These values refer to the two home sales listed in Table 13.1. The predictors are x1 = house size and x2 = number of bedrooms.

The correlation computed between all pairs of observed y-values and predicted y-values is the multiple correlation, R. The larger the multiple correlation, the better are the predictions of y by the set of explanatory variables.

The R-value always falls between 0 and 1. In this way, the multiple correlation R differs from the bivariate correlation r between y and a single variable x, which falls between −1 and +1.

R-squared
For predicting y, the square of R describes the relative improvement from using the prediction equation instead of using the sample mean, ȳ.
The error in using the prediction equation to predict y is summarized by the residual sum of squares: Σ(y − ŷ)²
The error in using ȳ to predict y is summarized by the total sum of squares: Σ(y − ȳ)²
The proportional reduction in error is:
R² = [Σ(y − ȳ)² − Σ(y − ŷ)²] / Σ(y − ȳ)²
The better the predictions are using the regression equation, the larger R² is. For multiple regression, R² is the square of the multiple correlation, R.

Example: Predicting House Selling Prices
For the 200 observations on y = selling price, x1 = house size, and x2 = number of bedrooms, a table called the ANOVA (analysis of variance) table was created. The table displays the sums of squares in the SS column.
Table 13.5 ANOVA Table and R-Squared for Predicting House Selling Price (in thousands of dollars) Using House Size (in thousands of square feet) and Number of Bedrooms.

The R² value can be computed from the sums of squares in the table:
R² = [Σ(y − ȳ)² − Σ(y − ŷ)²] / Σ(y − ȳ)² = (2,668,870 − 1,269,345) / 2,668,870 = 0.524

Using house size and number of bedrooms together to predict selling price reduces the prediction error by 52%, relative to using ȳ alone to predict selling price.

Find and interpret the multiple correlation:
R = √R² = √0.524 = 0.72
There is a moderately strong association between the observed and the predicted selling prices. House size and number of bedrooms are very helpful in predicting selling prices.
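The arithmetic from Table 13.5 can be reproduced directly; a minimal sketch, using the sums of squares quoted above:

```python
import math

total_ss = 2_668_870     # total sum of squares, sum of (y - ybar)^2
residual_ss = 1_269_345  # residual sum of squares, sum of (y - yhat)^2

r_squared = (total_ss - residual_ss) / total_ss  # proportional reduction in error
multiple_r = math.sqrt(r_squared)                # multiple correlation R
print(f"R^2 = {r_squared:.3f}, R = {multiple_r:.2f}")  # R^2 = 0.524, R = 0.72
```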
If we used a bivariate regression model to predict selling price with house size as the predictor, the r² value would be 0.51. If we used a bivariate regression model to predict selling price with number of bedrooms as the predictor, the r² value would be 0.1.

The multiple regression model has R² = 0.52, a similar value to using only house size as a predictor. There is clearly more to this prediction than using only one variable in the model. Interpretation of results is important. Larger lot sizes in this area could mean older homes with smaller size or fewer bedrooms or bathrooms.

Table 13.6 R² Value for Multiple Regression Models for y = House Selling Price.

Although R² goes up by only small amounts after house size is in the model, this does not mean that the other predictors are only weakly correlated with selling price. Because the predictors are themselves highly correlated, once one or two of them are in the model, the remaining ones don't help much in adding to the predictive power. For instance, lot size is highly positively correlated with number of bedrooms and with size of house. So, once number of bedrooms and size of house are included as predictors in the model, there's not much benefit to including lot size as an additional predictor.

Properties of R²
The previous example showed that R² for the multiple regression model was larger than r² for a bivariate model using only one of the explanatory variables. A key property of R² is that it cannot decrease when predictors are added to a model.

R² falls between 0 and 1. The larger the value, the better the explanatory variables collectively predict y.
R² = 1 only when all residuals are 0, that is, when all regression predictions are perfect.
R² = 0 when the correlation between y and each explanatory variable equals 0.
R² gets larger, or at worst stays the same, whenever an explanatory variable is added to the multiple regression model.
The value of R² does not depend on the units of measurement.

Section 13.3: Using Multiple Regression to Make Inferences

Inferences about the Population
Assumptions required when using a multiple regression model to make inferences about the population:
The regression equation truly holds for the population means. This implies that there is a straight-line relationship between the mean of y and each explanatory variable, with the same slope at each value of the other predictors.
The data were gathered using randomization.
The response variable y has a normal distribution at each combination of values of the explanatory variables, with the same standard deviation.
Inferences about Individual Regression Parameters
Consider a particular parameter, β1. If β1 = 0, the mean of y is identical for all values of x1, at fixed values of the other explanatory variables. So H0: β1 = 0 states that y and x1 are statistically independent, controlling for the other variables. This means that once the other explanatory variables are in the model, it doesn't help to have x1 in the model.

SUMMARY: Significance Test about a Multiple Regression Parameter
1. Assumptions: Each explanatory variable has a straight-line relation with μy, with the same slope for all combinations of values of the other predictors in the model. Data gathered using randomization. Normal distribution for y with the same standard deviation at each combination of values of the other predictors in the model.
2. Hypotheses: H0: β1 = 0, Ha: β1 ≠ 0. When H0 is true, y is independent of x1, controlling for the other predictors.
3. Test statistic: t = (b1 − 0)/se, where se is the standard error of b1.
4. P-value: Two-tail probability from the t-distribution of values larger than the observed t test statistic (in absolute value). The t-distribution has df = n − number of parameters in the regression equation.
5. Conclusion: Interpret the P-value in context; compare to the significance level if a decision is needed.

Example: What Helps Predict a Female Athlete's Weight?
The "College Athletes" data set comes from a study of 64 University of Georgia female athletes. The study measured several physical characteristics, including total body weight in pounds (TBW), height in inches (HGT), percent body fat (%BF), and age.

The results of fitting a multiple regression model for predicting weight using the other variables:
Table 13.10 Multiple Regression Analysis for Predicting Weight. Predictors are HGT = height, %BF = body fat, and age of subject.

Interpret the effect of age on weight in the multiple regression equation. Let ŷ = predicted weight, x1 = height, x2 = % body fat, and x3 = age. Then:
ŷ = −97.7 + 3.43x1 + 1.36x2 − 0.96x3

The slope coefficient of age is −0.96. For athletes having fixed values for x1 and x2, the predicted weight decreases by 0.96 pounds for a 1-year increase in age, and the ages vary only between 17 and 23.
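To see the equation in action, a quick sketch evaluating it for a hypothetical athlete (the input values below are ours, chosen only for illustration):

```python
# Prediction equation from Table 13.10: weight = -97.7 + 3.43*HGT + 1.36*%BF - 0.96*AGE
def predicted_weight(height_in, body_fat_pct, age_yr):
    return -97.7 + 3.43 * height_in + 1.36 * body_fat_pct - 0.96 * age_yr

# Hypothetical athlete: 66 inches tall, 18% body fat, 20 years old
print(f"{predicted_weight(66, 18, 20):.1f} lb")  # about 134.0 lb
```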
Run a hypothesis test to determine whether age helps to predict weight, if you already know height and percent body fat. Here are the steps:
1. Assumptions: The 64 female athletes were a convenience sample, not a random sample. Caution should be taken when making inferences about all female college athletes.
2. Hypotheses: H0: β3 = 0, Ha: β3 ≠ 0.
3. Test statistic: t = (b3 − 0)/se = −0.960/0.648 = −1.48
4. P-value: This value is reported in the output as 0.144.
5. Conclusion: The P-value of 0.144 does not give much evidence against the null hypothesis that β3 = 0. Age may not significantly predict weight if we already know height and % body fat.

Confidence Interval for a Multiple Regression Parameter
A 95% confidence interval for a slope parameter β in multiple regression equals:
estimated slope ± t.025(se)
The t-score has df = n − (number of parameters in the model). The assumptions are the same as for the t test.

Construct and interpret a 95% CI for β3, the effect of age while controlling for height and % body fat:
b3 ± t.025(se) = −0.96 ± 2.00(0.648) = −0.96 ± 1.30 = (−2.3, 0.3)

At fixed values of x1 and x2, we infer that the population mean of weight changes very little (and maybe not at all) for a 1-year increase in age. The confidence interval contains 0. Age may have no effect on weight, once we control for height and % body fat.
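The t statistic, its two-sided P-value, and the 95% confidence interval above can be reproduced with scipy (a sketch; b = −0.960 and se = 0.648 come from Table 13.10, and df = 64 − 4 = 60):

```python
from scipy import stats

b, se, df = -0.960, 0.648, 60            # slope estimate, standard error, n - #parameters

t = (b - 0) / se                         # test statistic for H0: beta3 = 0
p_value = 2 * stats.t.sf(abs(t), df)     # two-tailed P-value
t_crit = stats.t.ppf(0.975, df)          # t_.025 with df = 60 (about 2.00)
ci = (b - t_crit * se, b + t_crit * se)  # 95% confidence interval

print(f"t = {t:.2f}, P = {p_value:.3f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
# t = -1.48, P = 0.144, 95% CI = (-2.3, 0.3)
```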
Estimating Variability Around the Regression Equation
A standard deviation parameter, σ, describes the variability of the observations around the regression equation. Its sample estimate is:
s = √(residual SS / df) = √[ Σ(y − ŷ)² / (n − number of parameters in the regression equation) ]

Example: What Helps Predict a Female Athlete's Weight?
ANOVA table for the "College Athletes" data set:
Table 13.7 ANOVA Table for Multiple Regression Analysis of Athlete Weights.

For female athletes at particular values of height, % body fat, and age, estimate the standard deviation of their weights. Begin by finding the mean square error:
s² = residual SS / df = 6131.0 / 60 = 102.2
Notice that this value (102.2) appears in the MS column of the ANOVA table.

The standard deviation is s = √102.2 = 10.1. This value is also displayed in the ANOVA table. For athletes with certain fixed values of height, % body fat, and age, the weights vary with a standard deviation of about 10 pounds.

Insight: If the conditional distributions of weight are approximately bell-shaped, about 95% of the weight values fall within about 2s = 20 pounds of the true regression equation.

The Collective Effect of Explanatory Variables
Example: With 3 predictors in a model, we can check whether they collectively have an effect by testing:
H0: β1 = β2 = β3 = 0
Ha: At least one β parameter ≠ 0

The test statistic for H0 is denoted by F. It equals the ratio of the mean squares from the ANOVA table:
F = (mean square for regression) / (mean square error)

When H0 is true, the expected value of the F test statistic is approximately 1. When H0 is false, F tends to be larger than 1. The larger the F test statistic, the stronger the evidence against H0.

SUMMARY: F Test That All Beta Parameters = 0
1. Assumptions: The multiple regression equation holds. Data gathered randomly. Normal distribution for y with the same standard deviation at each combination of predictors.
2. Hypotheses: H0: β1 = β2 = ⋯ = 0; Ha: At least one β parameter ≠ 0.
3. Test statistic: F = (mean square for regression) / (mean square error).
4. P-value: Right-tail probability above the observed F test statistic value, from the F distribution with df1 = number of explanatory variables and df2 = n − (number of parameters in the regression equation).
5. Conclusion: The smaller the P-value, the stronger the evidence that at least one explanatory variable has an effect on y. If a decision is needed, reject H0 if the P-value ≤ significance level, such as 0.05.

Example: What Helps Predict a Female Athlete's Weight?
For the 64 female college athletes, the regression model for predicting y = weight using x1 = height, x2 = % body fat, and x3 = age is summarized in the ANOVA table (Table 13.7). Use the output in the table to test the hypothesis:
H0: β1 = β2 = β3 = 0
Ha: At least one β parameter ≠ 0

The observed F statistic is 40.5, and the corresponding P-value is 0.000. We can reject H0 at the 0.05 significance level. In summary, we conclude that at least one predictor has an effect on weight.

Insight: The F test tells us that at least one explanatory variable has an effect. If the explanatory variables are chosen sensibly, at least one should have some predictive power. The F test result tells us whether there is sufficient evidence to make it worthwhile to consider the individual effects, using t tests.
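As a check on the reported P-value, the right-tail F probability can be computed with scipy (a sketch; F = 40.5 is from the ANOVA table, with df1 = 3 predictors and df2 = 64 − 4 = 60):

```python
from scipy import stats

F, df1, df2 = 40.5, 3, 60          # F statistic; df = (#predictors, n - #parameters)
p_value = stats.f.sf(F, df1, df2)  # right-tail probability above the observed F
print(f"P-value = {p_value:.3g}")  # a tiny value, reported as 0.000 in the output
```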
The individual t tests identify which of the variables are significant (controlling for the other variables):
Table 13.10 Multiple Regression Analysis for Predicting Weight. Predictors are HGT = height, %BF = body fat, and age of subject.

If a variable turns out not to be significant, it can be removed from the model. In this example, age can be removed from the model.

Section 13.4: Checking a Regression Model Using Residual Plots

Assumptions for Inference with a Multiple Regression Model
1. The regression equation approximates well the true relationship between the predictors and the mean of y.
2. The data were gathered randomly.
3. y has a normal distribution with the same standard deviation at each combination of predictors.

Checking Shape and Detecting Unusual Observations
To check Assumption 3 (the conditional distribution of y is normal at any fixed values of the explanatory variables), construct a histogram of the standardized residuals. The histogram should be approximately bell-shaped. Nearly all the standardized residuals should fall between −3 and +3; any residual outside these limits is a potential outlier.

Example: House Selling Price
For the house selling price data, a MINITAB histogram of the standardized residuals for the multiple regression model predicting selling price by house size and lot size was created:
Figure 13.4 Histogram of Standardized Residuals for Multiple Regression Model Predicting Selling Price. Question: Give an example of a shape for this histogram that would indicate that a few observations are highly unusual.

The residuals are roughly bell-shaped about 0 and fall mostly between about −3 and +3. No severe nonnormality is indicated.

Plotting Residuals against Each Explanatory Variable
Plots of residuals against each explanatory variable help us check for potential problems with the regression model. Ideally, the residuals should fluctuate randomly about the horizontal line at 0. There should be no obvious change in trend or change in variation as the values of an explanatory variable x1 increase.

Figure 13.5 Possible Patterns for Residuals, Plotted Against an Explanatory Variable. Question: Why does the pattern in (b) suggest that the effect of x1 is not linear?
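A rough way to flag potential outliers in code, along the lines of the ±3 guideline above (a sketch with made-up numbers; for simplicity it standardizes by dividing each residual by the sample standard deviation of the residuals, ignoring the leverage adjustment that full standardized residuals use):

```python
import numpy as np

# Hypothetical observed and predicted responses (thousands of dollars)
y = np.array([232.5, 158.0, 120.9, 305.0, 199.5])
y_hat = np.array([211.4, 163.2, 130.0, 290.1, 204.8])

residuals = y - y_hat
z = residuals / residuals.std(ddof=1)  # crude standardization (no leverage adjustment)
outliers = np.abs(z) > 3               # flag anything beyond the +/-3 guideline
print(np.round(z, 2), "any flagged:", outliers.any())
```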
Section 13.5: Regression and Categorical Predictors

Indicator Variables
Regression models can specify categories of a categorical explanatory variable using artificial variables, called indicator variables. The indicator variable for a particular category is binary: it equals 1 if the observation falls into that category and 0 otherwise.

In the house selling prices data set, the condition of the house is a categorical variable, measured with two categories (good, not good). The indicator variable x for condition is:
x = 1 if the house is in good condition
x = 0 if the house is not in good condition

The regression model is then μy = α + βx, with x as just defined. Substituting the possible values 1 and 0 for x:
μy = α + β(1) = α + β, if the house is in good condition (so x = 1)
μy = α + β(0) = α, if the house is not in good condition (so x = 0)
The difference between the mean selling price for houses in good condition and not in good condition is:
(μy for good condition) − (μy otherwise) = (α + β) − α = β
The coefficient β of the indicator variable x is the difference between the mean selling prices for homes in good condition and for homes not in good condition.

Example: Including Condition in Regression for House Selling Price
Output from the regression model for selling price of a home using house size and condition:
Table 13.11 Regression Analysis of y = Selling Price Using x1 = House Size and x2 = Indicator Variable for Condition (Good, Not Good).

1. Find and plot the lines showing how predicted selling price varies as a function of house size, for homes in good condition or not in good condition.
2. Interpret the coefficient of the indicator variable for condition.

The regression equation from the MINITAB output is:
ŷ = 96,271 + 66.5x1 + 12,927x2

For homes not in good condition, x2 = 0, and the prediction equation simplifies to:
ŷ = 96,271 + 66.5x1 + 12,927(0) = 96,271 + 66.5x1

For homes in good condition, x2 = 1, and the prediction equation simplifies to:
ŷ = 96,271 + 66.5x1 + 12,927(1) = 109,198 + 66.5x1

Figure 13.7 Plot of Equation Relating ŷ = Predicted Selling Price to x1 = House Size, According to x2 = Condition (1 = Good, 0 = Not Good). Question: Why are the lines parallel?

Both lines have the same slope, 66.5. The line for homes in good condition is above the other line (not good) because its y-intercept is larger. This means that for any fixed value of house size, the predicted selling price is higher for homes in better condition. However, the P-value of 0.453 for the test for the coefficient of the indicator variable suggests that this difference is not statistically significant.
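A small sketch evaluating the fitted equation for both levels of the indicator makes the parallel lines explicit (the coefficients are those from Table 13.11; the function name is ours):

```python
# Fitted equation: y_hat = 96,271 + 66.5*x1 + 12,927*x2, where x2 = 1 for good condition
def predicted_price(house_size, good_condition):
    return 96_271 + 66.5 * house_size + 12_927 * good_condition

for x2 in (0, 1):
    print(f"x2 = {x2}: y_hat = {predicted_price(0, x2):,.0f} + 66.5 x1")
# x2 = 0: y_hat = 96,271 + 66.5 x1
# x2 = 1: y_hat = 109,198 + 66.5 x1   (same slope; intercept shifted by 12,927)
```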
Is There Interaction?
For two explanatory variables, interaction exists between them in their effects on the response variable when the slope of the relationship between y and one of them changes as the value of the other changes.

Example: Interaction in Effects on House Selling Price
Suppose the actual population relationship between house size and the mean selling price is:
μy = 100,000 + 50x1 for homes in good condition
μy = 80,000 + 35x1 for homes not in good condition
Then the slope for the effect of x1 differs for the two conditions, so there is interaction between house size and condition in their effects on selling price.
Figure 13.8 An Example of Interaction. There's a larger slope between selling price and house size for homes in good condition than for homes in other conditions.

How can you allow for interaction when you do a regression analysis? With two explanatory variables, one quantitative and one categorical, you can fit a separate regression line, with a different slope, relating y to the quantitative variable for each category of the categorical variable.

Section 13.6: Modeling a Categorical Response

Modeling a Categorical Response Variable
The regression models studied so far are designed for a quantitative response variable y. When y is categorical, a different regression model applies, called logistic regression.

Examples of Logistic Regression
A voter's choice in an election (Democrat or Republican), with explanatory variables: annual income, political ideology, religious affiliation, and race.
Whether a credit card holder pays their bill on time (yes or no), with explanatory variables: family income and the number of months in the past year that the customer paid the bill on time.

The Logistic Regression Model
Denote the possible outcomes for y as 0 and 1. Use the generic terms failure (for outcome = 0) and success (for outcome = 1). The population mean of the scores equals the population proportion of '1' outcomes (successes); that is, μy = p. The proportion p also represents the probability that a randomly selected subject has a successful outcome.

The straight-line model is usually inadequate, since a probability must fall between 0 and 1. A more realistic model has a curved S-shape instead of a straight-line trend. The regression equation that best models this S-shaped curve is known as the logistic regression equation.

Figure 13.10 Two Possible Regressions for a Probability p of a Binary Response Variable. A straight line is usually less appropriate than an S-shaped curve. Question: Why is the straight-line regression model for a binary response variable often poor?

A regression equation for an S-shaped curve for the probability of success p is:
p = e^(α + βx) / (1 + e^(α + βx))
This equation for p is called the logistic regression equation. Logistic regression is used when the response variable has only two possible outcomes (it's binary).
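The S-shaped formula translates directly into code; a minimal sketch (the function name is ours):

```python
import math

def logistic(x, alpha, beta):
    """Logistic regression curve: p = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

# With beta > 0, p rises in an S-shape from near 0 toward 1 as x increases
for x in (-6, -2, 0, 2, 6):
    print(x, round(logistic(x, alpha=0.0, beta=1.0), 3))
```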
Example: Travel Credit Cards
An Italian study with 100 randomly selected Italian adults considered factors that are associated with whether a person possesses at least one travel credit card. Table 13.12 shows results for the first 15 people on this response variable and on the person's annual income (in thousands of euros).
Table 13.12 Annual Income (in thousands of euros) and Whether Possess a Travel Credit Card. The response y equals 1 if a person has a travel credit card and equals 0 otherwise.

Let x = annual income and let y = whether the person possesses a travel credit card (1 = yes, 0 = no). Table 13.13 shows what software provides for conducting a logistic regression analysis.
Table 13.13 Results of Logistic Regression for Italian Credit Card Data.

Substituting the estimates of α and β into the logistic regression model formula yields:
p̂ = e^(−3.52 + 0.105x) / (1 + e^(−3.52 + 0.105x))

Find the estimated probability of possessing a travel credit card at the lowest and highest annual income levels in the sample, which were x = 12 and x = 65.

For x = 12 thousand euros, the estimated probability of possessing a travel credit card is:
p̂ = e^(−3.52 + 0.105(12)) / (1 + e^(−3.52 + 0.105(12))) = e^(−2.26) / (1 + e^(−2.26)) = 0.104/1.104 = 0.09

For x = 65 thousand euros, the estimated probability is:
p̂ = e^(−3.52 + 0.105(65)) / (1 + e^(−3.52 + 0.105(65))) = e^(3.305) / (1 + e^(3.305)) = 27.2485/28.2485 = 0.97

Insight: Annual income has a strong positive effect on having a travel credit card. The estimated probability of having a travel credit card changes from 0.09 to 0.97 as annual income changes over its range.
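Reusing the logistic helper sketched earlier with the estimates from Table 13.13 reproduces both probabilities:

```python
import math

def logistic(x, alpha, beta):
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

# Estimates from Table 13.13: alpha-hat = -3.52, beta-hat = 0.105
for income in (12, 65):  # lowest and highest incomes in the sample (thousands of euros)
    print(income, round(logistic(income, alpha=-3.52, beta=0.105), 2))
# 12 -> 0.09, 65 -> 0.97
```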
Example: Estimating Proportion of Students Who've Used Marijuana
A three-variable contingency table from a survey of senior high-school students is shown below. The students were asked whether they had ever used alcohol, cigarettes, or marijuana. We'll treat marijuana use as the response variable and cigarette use and alcohol use as explanatory variables.
Table 13.14 Alcohol, Cigarette, and Marijuana Use for High School Seniors.

Let y indicate marijuana use, coded 1 = yes, 0 = no. Let x1 be an indicator variable for alcohol use (1 = yes, 0 = no), and let x2 be an indicator variable for cigarette use (1 = yes, 0 = no).

Table 13.15 MINITAB Output for Estimating the Probability of Marijuana Use Based on Alcohol Use and Cigarette Use.

The logistic regression prediction equation is:
p̂ = e^(−5.31 + 2.99x1 + 2.85x2) / (1 + e^(−5.31 + 2.99x1 + 2.85x2))

For those who have not used alcohol or cigarettes, x1 = x2 = 0, and the estimated probability of marijuana use is:
p̂ = e^(−5.31) / (1 + e^(−5.31)) = 0.005

For those who have used both alcohol and cigarettes, x1 = x2 = 1, and the estimated probability of marijuana use is:
p̂ = e^(−5.31 + 2.99 + 2.85) / (1 + e^(−5.31 + 2.99 + 2.85)) = 0.629

SUMMARY: The probability that students have tried marijuana seems to depend greatly on whether they've used alcohol and/or cigarettes.
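The same two-predictor equation can be evaluated for all four alcohol/cigarette combinations; a sketch using the estimates from Table 13.15:

```python
import math

def p_hat(x1, x2):
    """Estimated probability of marijuana use, from the Table 13.15 fit."""
    z = -5.31 + 2.99 * x1 + 2.85 * x2
    return math.exp(z) / (1 + math.exp(z))

for alcohol in (0, 1):
    for cigarettes in (0, 1):
        print(f"alcohol={alcohol}, cigarettes={cigarettes}: {p_hat(alcohol, cigarettes):.3f}")
# The probability ranges from 0.005 (used neither) to 0.629 (used both).
```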