Download Interpreting Regression

Interpreting Regression January 25, 2017 Contents • Standard error of the estimate • Homoscedasticity • Questions In the tutorial on prediction we used the regression line to predict values of y for values of x. That is, the regression line is a way of using your data to predict what an average y value should be for a given value of x. In this tutorial, we’ll take this one step further by defining not just what the average value of y should be, but how those values of y should be distributed around the regression line. If you’ve gone through the z table and the normal distribution tutorials then you should be familiar with how to use normal distributions to calculate areas and scores. This comes in useful here where we’ll use the assumption of homoscedasticity to estimate the distribution of scores around a regression line. Let’s work with a made-up example. Suppose that we measure the class attendance and the outdoor temperature for the 38 lectures of psych 315 this quarter. Here are some statistics for our made up data: (1) temperatures (x) are distributed with a mean of 60 and a standard deviation of 10 degrees, (2) attendance (y) is distributed with a mean of 65 and a standard deviation of 15 students, and (3) temperature and attendance correlates with a value of -0.5. Here’s a scatterplot and regression line of our data. For practice you could use the statistics above to derive the equation of the regression line. n = 38, r = -0.50 y = -0.75x + 110.06 Attendance 80 60 40 40 50 60 70 80 Temperature (F) Standard error of the estimate This tutorial is all about how well we can use our data to predict y from x. Intuitively, if all of the data points fall nicely along the regression line, then we can be pretty confident that this line will provide an accurate estimate of y from x. However, if 1 the line doesn’t fit well, then the points will be scattered all over the line, and we won’t be as confident about our prediction. Rememeber from the tutorial on prediction that we defined how well the regression line fit the data with the standard error of the estimate, Syx : syx = rP (y−y 0 )2 n where y 0 is the y-value of the regression line for each value of x. We discussed in the prediction tutorial that syx can be thought of as the standard deviation of the distribution of scores around the regression line. If you know the correlation, then there’s an easier way of calculating the standard error of the estimate: p Syx = Sy 1 − r2 q Syx = 15 1 − (−0.5)2 = 12.99 Look closely at this equation. What happens with the correlation, r, is either near 1 or -1? Syx gets close to zero. This should make sense; if the correlation is nearly perfect, then the data points are close to the line and therefore the standard deviation of the values around the line is near zero. On the other hand, if the correlation is zero, then Syx = Sy . That is, the standard deviation of the values around the regression line is the same as the standard deviation of the y-values. Again, this should make sense. If the correlation is zero, then the slope of the regression line is zero, which means that the regression line is simply y 0 = ȳ. In other words, if the correlation is zero, then the predicted value of y is just the mean of y. So it makes sense that the standard deviation around the regression line is just the standard deviation around the mean of y, which is sy . Homoscedasticity Next we’ll make an important assumption: the distribution of scores above and below the regression line is normally distributed around y 0 with a standard deviation equal to the standard error of the estimate (syx ). This assumption has a special name: homoscedasticity. With homoscedasticity, we know everything about the distribution of scores above and below the regression line. We know it’s normal, we know the mean (y’), and we know the standard deviation (syx ). This means that we can answer questions about the proportion of scores that are expected to fall above and below the regression line. Example 1: What is the expected attendance when the outdoor temperature is 70 degrees? This is found simply by finding the y-value on the regression line for x = 70: y 0 = mx + b = (−0.75)(70) + 110.06 = 57.49 2 90 Attendance 80 70 60 57.49 50 40 30 70 40 50 60 Temperature (F) 3 70 80 Example 2: What is the temperature for which the expected attendance is 60? This requires us to use the regression line, but solving for x for when y’ = 60: y 0 = 60 = (−0.75)x + 110.06 x= (60−110.06) −0.75 = 66.7 90 Attendance 80 70 60 60 50 40 30 66.7 40 50 60 70 Temperature (F) If the outdoor temperature is 66.7 degrees, we expect the attendance to be 60 students. 4 80 Example 3: On a day that the outdoor temperature is 66.7, what is the probability that 80 or more students will show up for class? We know that the distribution of attendance for x = 66.7 will be normal with a mean of y’=60 and a standard deviation of Syx = 12.99. So now we’re reduced to a normal distribution problem of finding the area under the normal distribution of mean 60 and standard deviation of 12.99 above a value of 80: area =0.0618 80 21.03 34.02 47.01 60 72.99 85.98 98.97 Attendance The z-score is therefore z = 80−60 12.99 = 1.54: area =0.0618 1.54 -3 -2 -1 0 1 2 3 z Using Table A, the area above z = 1.54 is 0.0618. So the probability of 80 or more students showing up on a 66.7 degree day is 0.0618. 5 Example 4: What does the outdoor temperature have to be so that there is a 75% chance of 60 or more students showing up? This is like the normal distribution tutorial problems where we convert from areas to scores. For those problems, we first find the z-score for the area, and then convert that to scores using: x = (z)(σ) + µ But now, we have scores distributed normally around a regression line with mean y’ and standard deviation Syx , so this means that: y = (z)(Syx ) + y 0 For this problem, the z-score for which 75% of the standard normal distribution lies above is -0.67: area =0.75 -0.67 -3 -2 -1 0 1 2 3 z To convert to scores, we know that z = -0.67, we know that y = 60, and we know that Syx = 12.99, so 60 = (−0.67)(12.99) + y 0 Solving for y’: y 0 = 60 − (−0.67)(12.99) = 68.7 area =0.75 60 68.7 Attendance Now that we know y’, we can find the corresponding value of x (temperature) on the regression line like we did for Example 2 above: x= (68.7−110.06) −0.75 = 55.07 degrees. 6 90 Attendance 80 70 68.7 60 50 40 30 55.07 40 50 60 70 80 Temperature (F) In summary, when it’s 55.07 degrees outside, then the distribution of attendance will have a mean of 68.7 and a standard deviation of 12.99. For this distribution of attendance, there is a 75 percent chance of 60 or more students showing up. 7 Questions It’s your turn again. Here are 30 problems and answers. 1) Let the mean of x be 78, the mean of y be 78 the standard deviation of x be 1.22 and the standard deviation of y be 1.05 and the correlation between x and y be 0.87 Find the value of x for which 62 percent of the values lie above y = 77.52. Answer: x = 77.44 2) Let the regression line for y on x be y’ = -1x + 139.94 and the standard error of the estimate be Syx = 0.21 For x = 63.21, find the value of y for which 62 percent of the values lie above. slope = −0.98( 1.05 1.03 ) = −1 regression line: y’ = -1X + 139.94 For x = 63.21, y’ = 76.73 For 62 percent q of area above, z = -0.31 syx = (1.05) 1 − (−0.98)2 = 0.21 y = (z)(syx ) + y 0 = (−0.31)(0.21) + 76.73 = 76.66 Answer: y = 76.66 3) Let the mean of x be 103, the mean of y be 95 the standard deviation of x be 1.69 and the standard deviation of y be 1.59 and the correlation between x and y be 0.05 For x = 102.45, find the value of y for which 91 percent of the values lie above. slope = 0.05( 1.59 1.69 ) = 0.05 regression line: y’ = 0.05X + 90.15 For x = 102.45, y’ = 95.27 For 91 percent q of area above, z = -1.34 syx = (1.59) 1 − (0.05)2 = 1.59 y = (z)(syx ) + y 0 = (−1.34)(1.59) + 95.27 = 93.14 Answer: y = 93.14 4) Let the regression line for y on x be y’ = -1.02x + 123.93 and the standard error of the estimate be Syx = 0.79 For x = 85.53, find the value of y for which 74 percent of the values lie below. slope = −0.91( 1.91 1.7 ) = −1.02 regression line: y’ = -1.02X + 123.93 For x = 85.53, y’ = 36.69 For 74 percent q of area below, z = 0.64 syx = (1.91) 1 − (−0.91)2 = 0.79 y = (z)(syx ) + y 0 = (0.64)(0.79) + 36.69 = 37.2 Answer: y = 37.2 8 5) Let the mean of x be 47, the mean of y be 35 the standard deviation of x be 1.76 and the standard deviation of y be 1.99 and the correlation between x and y be 0.97 For x = 47.02, find the value of y for which 72 percent of the values lie below. slope = 0.97( 1.99 1.76 ) = 1.1 regression line: y’ = 1.1X + -16.55 For x = 47.02, y’ = 35.17 For 72 percent q of area below, z = 0.58 syx = (1.99) 1 − (0.97)2 = 0.48 y = (z)(syx ) + y 0 = (0.58)(0.48) + 35.17 = 35.45 Answer: y = 35.45 6) Let the regression line for y on x be y’ = -0.2x + 56.41 Find the value of x for which y’ = 36.79 Answer: x= 98.09 7) Let the regression line for y on x be y’ = 0.86x + 35.87 and the standard error of the estimate be Syx = 0.82 Find the value of x for which 91 percent of the values lie below y = 60.03. Answer: x = 26.81 8) Let the regression line for y on x be y’ = -0.01x + 109.43 and the standard error of the estimate be Syx = 1.52 Find the value of x for which 27 percent of the values lie below y = 108.08. Answer: x = 42.08 9) Let the regression line for y on x be y’ = 1.16x + -17.85 and the standard error of the estimate be Syx = 0.84 Find the value of x for which 53 percent of the values lie below y = 37.93. Answer: x = 48.03 10) Let the mean of x be 87, the mean of y be 93 the standard deviation of x be 1.05 and the standard deviation of y be 1.76 and the correlation between x and y be -0.75 Find the value of x for which 2 percent of the values lie below y = 90.91. Answer: x = 86.57 11) Let the regression line for y on x be y’ = 0.83x + 40.89 9 and the standard error of the estimate be Syx = 1.09 For x = 82.47, find the value of y for which 77 percent of the values lie above. slope = 0.78( 1.74 1.63 ) = 0.83 regression line: y’ = 0.83X + 40.89 For x = 82.47, y’ = 109.34 For 77 percent q of area above, z = -0.74 syx = (1.74) 1 − (0.78)2 = 1.09 y = (z)(syx ) + y 0 = (−0.74)(1.09) + 109.34 = 108.53 Answer: y = 108.53 12) Let the regression line for y on x be y’ = 0.72x + -17.02 and the standard error of the estimate be Syx = 1.17 For x = 94.77, find the value of y for which 27 percent of the values lie below. slope = 0.68( 1.59 1.51 ) = 0.72 regression line: y’ = 0.72X + -17.02 For x = 94.77, y’ = 51.21 For 27 percent q of area below, z = -0.61 syx = (1.59) 1 − (0.68)2 = 1.17 y = (z)(syx ) + y 0 = (−0.61)(1.17) + 51.21 = 50.5 Answer: y = 50.5 13) Let the regression line for y on x be y’ = 0.12x + 21.56 and the standard error of the estimate be Syx = 1.28 For x = 28.27, find the value of y for which 80 percent of the values lie below. slope = 0.14( 1.29 1.47 ) = 0.12 regression line: y’ = 0.12X + 21.56 For x = 28.27, y’ = 24.95 For 80 percent q of area below, z = 0.84 syx = (1.29) 1 − (0.14)2 = 1.28 y = (z)(syx ) + y 0 = (0.84)(1.28) + 24.95 = 26.03 Answer: y = 26.03 14) Let the mean of x be 85, the mean of y be 65 the standard deviation of x be 1.5 and the standard deviation of y be 1.96 and the correlation between x and y be 0.78 Find the value of y’ for which x = 85.25 Answer: y’= 65.33 15) Let the regression line for y on x be y’ = -0.64x + 33.63 Find the value of y’ for which x = 22.53 Answer: y’= 19.21 10 16) Let the mean of x be 34, the mean of y be 28 the standard deviation of x be 1.71 and the standard deviation of y be 1.94 and the correlation between x and y be -0.36 Find the value of y’ for which x = 33.55 Answer: y’= 28.13 17) Let the mean of x be 49, the mean of y be 60 the standard deviation of x be 1.68 and the standard deviation of y be 1.39 and the correlation between x and y be -0.7 For x = 49.2, find the value of y for which 59 percent of the values lie below. slope = −0.7( 1.39 1.68 ) = −0.58 regression line: y’ = -0.58X + 88.38 For x = 49.2, y’ = 59.84 For 59 percent q of area below, z = 0.23 syx = (1.39) 1 − (−0.7)2 = 0.99 y = (z)(syx ) + y 0 = (0.23)(0.99) + 59.84 = 60.07 Answer: y = 60.07 18) Let the mean of x be 25, the mean of y be 108 the standard deviation of x be 1.96 and the standard deviation of y be 1.56 and the correlation between x and y be -0.18 Find the value of x for which y’ = 108.04 Answer: x= 25.28 19) Let the mean of x be 106, the mean of y be 23 the standard deviation of x be 1.49 and the standard deviation of y be 1.46 and the correlation between x and y be -0.6 Find the value of y’ for which x = 105.36 Answer: y’= 23.16 20) Let the regression line for y on x be y’ = -0.05x + 39.42 Find the value of y’ for which x = 89.7 Answer: y’= 34.94 21) Let the mean of x be 24, the mean of y be 105 the standard deviation of x be 1.9 and the standard deviation of y be 1.43 and the correlation between x and y be -0.18 Find the value of y’ for which x = 24.07 Answer: y’= 104.88 11 22) Let the regression line for y on x be y’ = -0.67x + 66.65 and the standard error of the estimate be Syx = 0.73 For x = 45.89, find the value of y for which 22 percent of the values lie above. slope = −0.73( 1.07 1.16 ) = −0.67 regression line: y’ = -0.67X + 66.65 For x = 45.89, y’ = 35.9 For 22 percent q of area above, z = 0.77 syx = (1.07) 1 − (−0.73)2 = 0.73 y = (z)(syx ) + y 0 = (0.77)(0.73) + 35.9 = 36.46 Answer: y = 36.46 23) Let the mean of x be 103, the mean of y be 75 the standard deviation of x be 1.45 and the standard deviation of y be 1.81 and the correlation between x and y be -0.57 Find the value of x for which y’ = 75.6 Answer: x= 102.38 24) Let the regression line for y on x be y’ = 0.51x + 39.15 and the standard error of the estimate be Syx = 1.13 Find the value of x for which 76 percent of the values lie below y = 71.31. Answer: x = 61.49 25) Let the mean of x be 92, the mean of y be 79 the standard deviation of x be 1.63 and the standard deviation of y be 1.44 and the correlation between x and y be 0.4 Find the value of x for which 99 percent of the values lie below y = 81.68. Answer: x = 91.74 26) Let the mean of x be 61, the mean of y be 20 the standard deviation of x be 1.29 and the standard deviation of y be 1.54 and the correlation between x and y be -0.17 For x = 61.03, find the value of y for which 42 percent of the values lie above. slope = −0.17( 1.54 1.29 ) = −0.2 regression line: y’ = -0.2X + 32.38 For x = 61.03, y’ = 20.17 For 42 percent q of area above, z = 0.2 syx = (1.54) 1 − (−0.17)2 = 1.52 y = (z)(syx ) + y 0 = (0.2)(1.52) + 20.17 = 20.47 Answer: y = 20.47 12 27) Let the regression line for y on x be y’ = 0.44x + 53.66 and the standard error of the estimate be Syx = 1.71 Find the value of x for which 74 percent of the values lie below y = 90.6. Answer: x = 81.48 28) Let the mean of x be 17, the mean of y be 81 the standard deviation of x be 1.52 and the standard deviation of y be 1.46 and the correlation between x and y be -0.02 Find the value of x for which y’ = 81 Answer: x= 16.52 29) Let the mean of x be 99, the mean of y be 64 the standard deviation of x be 1.92 and the standard deviation of y be 1.87 and the correlation between x and y be -0.72 Find the value of x for which y’ = 64.09 Answer: x= 99.04 30) Let the mean of x be 61, the mean of y be 98 the standard deviation of x be 1.45 and the standard deviation of y be 1.32 and the correlation between x and y be -0.12 For x = 61.82, find the value of y for which 53 percent of the values lie above. slope = −0.12( 1.32 1.45 ) = −0.11 regression line: y’ = -0.11X + 104.66 For x = 61.82, y’ = 97.86 For 53 percent q of area above, z = -0.08 syx = (1.32) 1 − (−0.12)2 = 1.31 y = (z)(syx ) + y 0 = (−0.08)(1.31) + 97.86 = 97.76 Answer: y = 97.76 13

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Interpreting Regression