Linear Regression - math-b
Residuals Revisited

The linear model we are using assumes that the relationship between the two variables is a perfect straight line. The residuals are the part of the data that hasn’t been modeled.

Residuals

\[ Data = Model + Residual \]
Therefore,
\[ Residual = Data - Model \]
In symbols,
\[ e = y - \hat{y} \]

Residuals help tell us if the model makes sense. When a regression is appropriate, it should model the underlying relationship. Because nothing, in a sense, should be left behind, we usually plot the residuals in the hope of finding… nothing.

A scatterplot of the residuals versus the x-values should be the most boring scatterplot you’ve ever seen. It shouldn’t have any interesting features like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should have no bends, and it should have no outliers.

The Residual Standard Deviation

If the residuals show no interesting pattern when we plot them against x, we can look at how big they are. Remember, we are trying to make the residuals as small as possible. Since their mean is always zero, though, it’s only sensible to look at how much they vary. The standard deviation of the residuals, \(s_e\), gives us a measure of how much the points spread around the regression line. For this to make sense, the residuals should share the same underlying spread, so we check that the residual plot has about the same scatter throughout.

The Equal Variance Assumption is the assumption that the residuals have the same scatter throughout. The condition to check is the Does the Plot Thicken? Condition: we check to make sure the spread is about the same all along the line.
We can check that either in the original scatterplot of y against x or in the scatterplot of residuals.

The Residual SD

We estimate the standard deviation of the residuals in almost the way you would expect:

\[ s_e = \sqrt{\frac{\sum e^2}{n - 2}} \]

We do not subtract the mean because the mean of the residuals is \(\bar{e} = 0\).

For the BK foods, the SD of the residuals is 9.2 grams of fat. That looks about right in the scatterplot of residuals. The residual for the BK Broiler chicken was -11 grams, just over one SD.

It is a good idea to make a histogram of the residuals. If we see a unimodal, symmetric histogram, then we can apply the 68-95-99.7 rule to see how well the regression model describes the data.

\(R^2\): The Variation Accounted For

The variation of the residuals is key to assessing how well the model fits. Let’s look at that variation using our favorite fast food example. The total Fat has a standard deviation of 16.4 grams. The standard deviation of the residuals is 9.2 grams.

If the correlation were 1.0 and the model predicted Fat values perfectly, the residuals would all be zero and have no variation. On the other hand, if the correlation were zero, the model would simply predict 23.5 grams of Fat (the mean) for all menu items. The residuals from that prediction would just be the observed Fat values minus their mean. These residuals would have the same variability as the original data because, as we know, just subtracting the mean doesn’t change the spread.

How well does our BK regression model do?
We want to know how much of the variation is left in the residuals. Think of it as finding a percent between 0% and 100% that is the fraction of the variation left in the residuals. The squared correlation, \(r^2\), gives the fraction of the data’s variation accounted for by the model, and \(1 - r^2\) is the fraction of the original variation left in the residuals.

For the Burger King model, \(r^2 = 0.83^2 = 0.69\) and \(1 - r^2 = 1 - 0.69 = 0.31\). So 31% of the variability in total Fat has been left in the residuals. According to our linear model, 69% of the variability in fat content of Burger King sandwiches is accounted for by variation in protein content.

\(R^2\)

The regression model that predicts maximum wind speed in hurricanes based on the storm’s central pressure has \(R^2 = 77.3\%\). What does this say about the regression model?

An “R-squared” of 77.3% indicates that 77.3% of the variation in maximum wind speed can be accounted for by the hurricane’s central pressure. Other factors, such as temperature and whether the storm is over water or land, may explain some of the remaining variation.

Just Checking… House Price vs. Size

The \(R^2\) value is reported as 59.5% and the standard deviation of the residuals is 53.79.

A) What does the \(R^2\) value mean about the relationship between Price and Size?
Answer: Differences in the size of houses account for about 59.5% of the variation in the price of houses.

B) Is the correlation of Price and Size positive or negative? How do you know?
Answer: It’s positive. The correlation and the slope have the same sign.

C) If we measure house Size in square meters instead of square feet, would \(R^2\) change? Would the slope of the line change? Explain.
Answer: \(R^2\) would not change but the slope would. Slope depends on the units but correlation does not.

D) You find that your house in Saratoga is worth $100,000 more than the regression model predicts. Should you be very surprised?
Answer: No, the standard deviation of the residuals is 53.79 thousand dollars. We shouldn’t be surprised by any residual smaller than 2 standard deviations, and a residual of $100,000 is less than 2 × $53,790 = $107,580.

90,000 square feet, the size of 50 average-sized family homes. A final price tag of 75 million dollars. 30 bedrooms and 20 bathrooms. 20-car garage, three swimming pools. The amenities don't finish there. Also constructed is an adult movie theater with a balcony, four fireplaces, a formal dining room that seats 30, all 23 full baths with full-sized Jacuzzis, 160 triple-paned windows and Brazilian mahogany French-style doors that alone cost $4 million. The banquet kitchen features two large commercial gas stoves, four commercial built-in refrigerators and a Japanese-style steakhouse island that seats 12.

\(R^2\)

All regression analyses include the statistic \(R^2\). An \(R^2\) of 0 means that none of the variance in the data is in the model; all of it is still in the residuals. It would be hard to imagine using that sort of model for anything.

Aside: Is a correlation of 0.8 twice as strong as a correlation of 0.4? Not if you think in terms of \(R^2\). A correlation of 0.4 means an \(R^2\) of \(0.4^2 = 16\%\). A correlation of 0.8 gives an \(R^2\) of \(0.8^2 = 64\%\). So a correlation of 0.8 gives an \(R^2\) four times as large as a correlation of 0.4 does, and accounts for four times as much of the variability.

How Big Should \(R^2\) Be?

\(R^2\) is always between 0% and 100%, but what is a “good” \(R^2\) value?
There is no hard and fast rule on what the value of \(R^2\) should be; it will always be context dependent. For example, data from scientific experiments often have \(R^2\) values in the 80 to 90% range and even higher. Data from surveys can have \(R^2\) values as low as 50%, 40%, or even lower.

\(R^2\)

An \(R^2\) of 100% would be a perfect fit with no scatter around the line. In this case, \(s_e\) would be zero: all of the variance is accounted for by the model and none is left in the residuals. So, along with the slope and intercept for a regression, you should always report \(R^2\) so readers can judge for themselves how successful the regression is at fitting the data. \(R^2\) tells you whether the model is even worth thinking about.

Regression Assumptions and Conditions

Quantitative Variable Condition
Linearity Assumption: Straight Enough Condition
Outlier Condition
For the standard deviation of the residuals to summarize the scatter, all the residuals should share the same spread.
Equal Variance Assumption: Does the Plot Thicken? Condition

A Tale of Two Regressions

Be careful! Regression lines do not always do what you expect. Recall that for BK sandwiches,

\[ \widehat{fat} = 6.8 + 0.97\,protein \]

With this equation we estimated that a sandwich with 30 grams of protein would have 35.9 grams of fat. Suppose, though, we knew the fat content and wanted to predict the amount of protein.

Two Regressions

It might seem natural to solve the equation for protein using simple algebra:

\[ \widehat{fat} = 6.8 + 0.97\,protein \]
\[ protein = \frac{\widehat{fat} - 6.8}{0.97} \]

This isn’t even close to correct. Why? Because our original model is

\[ \hat{y} = b_0 + b_1 x \]

This solves for PREDICTED fat based on actual protein. Going the other way is trying to solve for PREDICTED protein based on actual fat.

If we want to predict protein from fat, we need to create that model.
The slope is the correlation times the SD of protein divided by the SD of fat:

\[ b_1 = \frac{r\,s_{protein}}{s_{fat}} = \frac{(0.83)(14.0)}{16.4} = 0.709 \]

grams of protein per gram of fat. Our equation will turn out to be

\[ \widehat{protein} = 0.55 + 0.709\,fat \]

so we would predict that a sandwich with 35.9 grams of fat should have 26.0 grams of protein, not the 30 grams that we used in the first equation to get that prediction.

Pg 195, 16, 22, 23, 25, 37
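The two-regressions comparison above can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the function and variable names are made up for illustration, and it uses only the summary values quoted in the slides (r = 0.83, SDs of 16.4 and 14.0 grams, and the intercepts 6.8 and 0.55).

```python
# Summary values quoted in the slides for the BK sandwiches.
r = 0.83           # correlation between protein and fat
sd_fat = 16.4      # standard deviation of fat, in grams
sd_protein = 14.0  # standard deviation of protein, in grams


def predict_fat(protein):
    """Regression of fat on protein, as given in the slides."""
    return 6.8 + 0.97 * protein


def invert_fat_line(fat):
    """Naive approach: algebraically solve the fat line for protein."""
    return (fat - 6.8) / 0.97


# Slope of the separate regression of protein on fat: r * s_protein / s_fat.
slope_protein_on_fat = r * sd_protein / sd_fat


def predict_protein(fat):
    """Regression of protein on fat, using the intercept quoted in the slides."""
    return 0.55 + slope_protein_on_fat * fat


fat_hat = predict_fat(30)
print(f"predicted fat for 30 g protein: {fat_hat:.1f} g")
print(f"slope of protein on fat: {slope_protein_on_fat:.3f}")
print(f"algebraic inversion at {fat_hat:.1f} g fat: {invert_fat_line(fat_hat):.1f} g protein")
print(f"regression prediction at {fat_hat:.1f} g fat: {predict_protein(fat_hat):.1f} g protein")
```

The algebraic inversion simply returns the 30 grams we started with, while the protein-on-fat regression predicts about 26.0 grams. The two slopes differ (0.97 versus 0.709) because each line minimizes residuals in its own response variable.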