Chapter 8: Linear Regression

Objectives & Learning Goals
Understand linear regression (linear modeling): create and interpret a linear regression model using technology, equations (when statistics are provided), and computer output; discuss slope and y-intercept in context; check conditions by graphing and interpreting residuals; make predictions within the range of the data using the created model.

Fast Food: Fat Versus Protein
r = 0.83; describe the association… “There is a strong, positive linear association between fat and protein. In general, as protein increases, so does fat. We may be concerned about these points as possible outliers…” We can say even more with a model…

The Linear Model
The linear model is the equation of the “line of best fit.”

Residual
Models aren’t perfect. No matter how “best” our line of best fit is, points will fall above and below the line. Residuals are the vertical distances from the best fit line to each data point. They are the parts of the relationship between the variables that our model can’t explain: the model’s errors.

More on Residuals
To find the residual of a data point, subtract the predicted value from the observed value: $\text{residual} = y - \hat{y}$ (actual minus predicted).

Residuals (cont.)
A negative residual means the model’s prediction is too large (the model overestimates). A positive residual means the predicted value is too small (the model underestimates). In this example, the linear model overestimates the fat content of a sandwich: it predicts 33 g, but the actual value is 25 g, so the residual is 25 - 33 = -8 g. Residuals are key because our technology uses them to determine which line is “best.”

“Best Fit” Means Least Squares
Some residuals are positive and others are negative; on average these errors cancel each other out. We can’t assess how well the line fits by adding up all the residuals (we would just get zero!). Solution?? Anyone?? Anyone?? Bueller?? Similar to what we did with standard deviation, we square the residuals, then add up the squares.
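The residual arithmetic and the "square, then add" idea can be sketched in a few lines of Python. This is only an illustration: the sandwich values (25 g actual, 33 g predicted) come from the slide, but the three-point data set and the two candidate lines are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
# Residual = observed - predicted, using the sandwich from the slide.
observed_fat = 25.0    # grams (actual value)
predicted_fat = 33.0   # grams (the model's prediction)
residual = observed_fat - predicted_fat  # negative, so the model overestimates


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y-hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (NOT from the slides).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 3.0]

# Two hypothetical candidate lines through the same data.
sse_line_a = sum_squared_residuals(xs, ys, intercept=2.0, slope=0.5)
sse_line_b = sum_squared_residuals(xs, ys, intercept=3.0, slope=0.0)

print(f"residual for the sandwich: {residual} g")
print(f"sum of squared residuals, line A: {sse_line_a}")
print(f"sum of squared residuals, line B: {sse_line_b}")
```

For both candidate lines the plain residuals add up to zero on this toy data, which is why the raw sum cannot tell them apart, while the sums of squares differ.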
The smaller the sum, the better the line. The line of best fit is the line that minimizes the sum of the squared residuals. We call this the least squares regression line (LSRL).

Conditions for Regression
We check the same conditions for regression as we did for correlation. These are??? Quantitative Variables Condition; Straight Enough Condition; Outlier Condition.

The Regression Line in Real Units
In Algebra we learned that the equation of a line can be written as $y = mx + b$. In Statistics we use slightly different notation: $\hat{y} = b_0 + b_1 x$, where $b_0$ is the y-intercept and $b_1$ is the slope. We use $\hat{y}$ to emphasize that the points that satisfy the equation are predictions made using a model, not data.

Back to Burgers! An Example
The LSRL and its regression equation are shown for the Burger King sandwich data. To predict the fat content (!!“fat hat”!!) for a BK sandwich with 30 g of protein, PLUG AND CHUG the formula (do it now)!! The result is 35.9 g. Conclude with a statement like: “According to our model, we expect a sandwich with 30 grams of protein to have about 36 g of fat.”

Class Example…
Country      Marijuana (%)   Other Drugs (%)
Czech Rep    22              4
Denmark      17              3
England      40              21
Finland      5               1
Ireland      37              16
Italy        19              8
Ireland      23              14
Norway       6               3
Portugal     7               3
Scotland     53              31
USA          34              24

The table above shows the % of city teens who use marijuana recreationally and the % of teens who use other, more dangerous drugs. Enter the data in your calculator lists and let’s build and analyze a model!!

Correlation and the Line of Best Fit
This figure shows the scatterplot of the z-scores for fat and protein. If a burger has an average protein content, it should also have an average fat content. So in z-score world, the LSRL passes through the origin, (0, 0). This means that in the “real” world the LSRL passes through $(\bar{x}, \bar{y})$. In z-score world, r is the slope of the LSRL!
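One way to check the claim that r is the slope in z-score world is to standardize the class-example data and fit the least squares line to the z-scores. The sketch below uses the marijuana and other-drugs percentages from the table; the helper name `z_scores` and the variable names are assumptions made for illustration, and the slope is computed from the usual least squares formula rather than with a calculator.

```python
from statistics import mean, stdev

# Percentages from the class-example table, in table order.
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]


def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


zx = z_scores(marijuana)
zy = z_scores(other_drugs)
n = len(zx)

# Correlation: sum of the products of paired z-scores, divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Least squares slope and intercept fitted directly to the z-scores.
slope_z = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
intercept_z = mean(zy) - slope_z * mean(zx)

print(f"r = {r:.4f}")
print(f"slope in z-score world = {slope_z:.4f}")
print(f"intercept in z-score world = {intercept_z:.4f}")
```

The slope matches r and the intercept is zero (up to floating-point rounding), so the standardized line passes through the origin.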
Moving one standard deviation away from the mean in x moves us r standard deviations away from the mean in y.

How Big Can Predicted Values Get?
Because we move r standard deviations in the y direction for every standard deviation in the x direction, and $-1 \le r \le +1$, each y-hat must be no farther from its mean (in standard deviations) than its corresponding x value is from its mean. This property of the linear model is called regression to the mean (predictions using the LSRL will tend toward the mean of y).

$R^2$: Accounting for the Model’s Variation
The variation in the residuals is the key to assessing how well the model fits. In our BK sandwich example, the standard deviation of the variable fat is 16.4 grams. The standard deviation of the residuals (the errors after we’ve applied the model) is only 9.2 grams.
[Figure: Original Data vs. After Model]

$R^2$: The Variation Accounted For (cont.)
If the correlation were 1.0 and the model predicted fat values perfectly, all the residuals would be zero (no variation). As it is, the correlation is 0.83: not perfect, but pretty good! The model’s residuals have less variation than total fat alone. So the model “explains” some of the variability in fat by accounting for protein content. (To build a model that accounts for the remaining errors we would use multiple regression, which is beyond the scope of this course.)

$R^2$: The Variation Accounted For (cont.)
The correlation coefficient squared, $r^2$, gives the percentage of the data’s variance accounted for by the model. For the BK model, $r^2 = 0.83^2 = 0.69$; this tells us that 69% of the variability in fat can be explained by variation in protein. The remaining 31% of the variability in fat is left in the residuals (“other” reasons).

How Big Should $R^2$ Be?
Well……(you guessed it!)……it depends! $R^2$ is always between 0% and 100%, and it is never larger than |r| (the two are equal only when r is 0, 1, or -1). What makes a “good” $R^2$ value depends on the kind of data you’re analyzing and what you want to do with it. Examples?
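The BK sandwich figures quoted above (r = 0.83, a standard deviation of 16.4 g for fat, and 9.2 g for the residuals) can be tied together with a short sketch. The comparison of the two standard deviations is only an approximate cross-check: it assumes the quoted values are rounded and ignores the slightly different divisors used for the two standard deviations.

```python
# Values quoted on the slides for the BK sandwich model.
r = 0.83             # correlation between protein and fat
sd_fat = 16.4        # standard deviation of fat, in grams
sd_residuals = 9.2   # standard deviation of the residuals, in grams

# R^2 as the squared correlation.
r_squared = r ** 2

# Rough cross-check: fraction of the variance left in the residuals,
# and what remains after subtracting it from 1.
unexplained_fraction = (sd_residuals / sd_fat) ** 2
approx_r_squared = 1 - unexplained_fraction

print(f"r^2 = {r_squared:.3f}")
print(f"1 - (s_e / s_y)^2 = {approx_r_squared:.3f}")
```

The two numbers agree to about two decimal places, which is what "69% explained, 31% left in the residuals" is summarizing.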
Along with the slope and intercept for a regression, you should always report $R^2$ so that readers can judge for themselves how successful the regression is at fitting the data.

AP Test Requirements
For the AP test, you are expected to be able to find and analyze regression equations THREE ways: 1) using summary statistics and formulas; 2) using technology (calculators) when given data; 3) from computer-generated regression output. Worksheet!!!

The Regression Line in Real Units
Since the linear model is built in “z-score world” (r), the line of best fit we use is a little more complicated than the ones you’re used to from your algebra courses… We want our model to be useful in real units, so we have to back out of z-score world. For the model $\hat{y} = b_0 + b_1 x$, we find the slope with $b_1 = r \frac{s_y}{s_x}$ and the y-intercept with $b_0 = \bar{y} - b_1 \bar{x}$.

Income vs. Housing Cost Example
A government organization is interested in building a model to predict a person’s housing costs based on the person’s income (using tax return data). They capture a sample of data and find that the mean income is $46,209 with a standard deviation of $7,004. The mean housing cost for this same sample is $324 with a standard deviation of $119; r = 0.62. Is a linear model appropriate? Find the regression equation. Explain what the slope and the y-intercept mean in context. Compute and interpret r-squared. The organization then decides to use this same data to predict a person’s income based on their housing costs. Find the new regression equation.

Reading Computer Output: HP vs. MPG
Write the regression equation now… $\widehat{mpg} = 38.4542 - 0.0918175\,(hp)$

More on Residuals
The linear model assumes the relationship between the two variables is a straight line. The residuals are the part of the data that hasn’t been accounted for by the model. Actual Data = Model + Residual, or Residual = Actual - Prediction (AP). In symbols: $e = y - \hat{y}$

Residuals (cont.)
Residuals help us to see whether the model is appropriate. When it is, there should be nothing interesting left behind in the residuals (just random error). After we fit a regression model, we ALWAYS make a scatterplot of the residuals vs. x or y-hat (both show the same pattern), hoping to find……….. well, nothing interesting.

Residuals (cont.)
The residual plot for the BK sandwich regression looks random (no pattern), and that’s good!

Residuals (cont.)
This residual plot shows a pattern, indicating that our assumption of linearity may be wrong.

Residuals (cont.)
Another “bad” residual plot…

Residual Standard Deviation
The standard deviation of the residuals, $s_e$, measures how spread out the points are around the regression line. The equation is: $s_e = \sqrt{\frac{\sum e^2}{n - 2}}$. Once again, we don’t actually calculate this manually; our technology does it for us!!!

The Residual Standard Deviation
Check the residual plot to make sure the residuals have about the same amount of scatter throughout (this is called the Equal Variance Assumption). Then examine a Normal probability plot or a histogram of the residuals to see whether their distribution is roughly unimodal and symmetric.

Reality Check: Does the Regression Make Sense?
When statistics are based on real data, the results of a statistical analysis should reinforce your common sense. If the results are surprising, then you’ve either learned something new about the world or your analysis is incorrect. Which do you think is more likely? When you perform a regression, think about what the coefficients mean and ask yourself whether they make sense.

What Can Go Wrong?
Don’t fit a straight line to a nonlinear relationship. Beware extraordinary points (y-values that stand off from the linear pattern or extreme x-values). Don’t infer that x causes y just because there is a good linear model for their relationship: association is not causation. Don’t choose a model based on $R^2$ alone. Get it?? http://xkcd.com/552/

What Have We Learned?
When the relationship between two quantitative variables is “straight enough”, a linear model can help summarize that relationship. The regression line doesn’t pass through all the points, but it is the “best” line because it minimizes the sum of squared residuals.

What Have We Learned? (cont.)
Correlation tells us lots of things: The slope of the line is based on the correlation, adjusted for the units of x and y. For each SD in x that we are away from x-bar, we expect to be r SDs in y away from y-bar. Since r is always between -1 and +1, each y-hat is no more SDs away from its mean than the corresponding x was (regression to the mean). $R^2$ tells us the percent of the variation in the response accounted for by the regression model; the rest is left in the residuals.
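As a recap of the summary-statistics method, the income vs. housing cost example from earlier can be worked in a short Python sketch. The means, standard deviations, and r are the ones given in the example; the function name `lsrl_from_summary` is a hypothetical helper, and the sketch assumes a linear model has already been judged appropriate.

```python
# Summary statistics given in the income vs. housing cost example.
r = 0.62
mean_income, sd_income = 46209.0, 7004.0     # dollars
mean_housing, sd_housing = 324.0, 119.0      # dollars


def lsrl_from_summary(r, mean_x, sd_x, mean_y, sd_y):
    """Return (b0, b1) for y-hat = b0 + b1 * x from summary statistics.

    Hypothetical helper: b1 = r * s_y / s_x and b0 = y-bar - b1 * x-bar.
    """
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Predict housing cost from income.
b0, b1 = lsrl_from_summary(r, mean_income, sd_income, mean_housing, sd_housing)

# Swap the roles: predict income from housing cost.
c0, c1 = lsrl_from_summary(r, mean_housing, sd_housing, mean_income, sd_income)

r_squared = r ** 2

print(f"housing-hat = {b0:.2f} + {b1:.5f} * income")
print(f"income-hat  = {c0:.2f} + {c1:.3f} * housing")
print(f"r^2 = {r_squared:.4f}")
```

Note that the second equation is not the first one solved for income: each line minimizes squared residuals in its own response variable, and the product of the two slopes equals $r^2$ rather than 1.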