Download Interpreting Regression

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Interpreting Regression
January 25, 2017
Contents
• Standard error of the estimate
• Homoscedasticity
• Questions
In the tutorial on prediction we used the regression line to predict values of y for values of x. That is, the regression line is
a way of using your data to predict what an average y value should be for a given value of x.
In this tutorial, we’ll take this one step further by defining not just what the average value of y should be, but how those
values of y should be distributed around the regression line.
If you’ve gone through the z table and the normal distribution tutorials then you should be familiar with how to use normal
distributions to calculate areas and scores. This comes in useful here where we’ll use the assumption of homoscedasticity to
estimate the distribution of scores around a regression line.
Let’s work with a made-up example. Suppose that we measure the class attendance and the outdoor temperature for the 38
lectures of psych 315 this quarter.
Here are some statistics for our made up data: (1) temperatures (x) are distributed with a mean of 60 and a standard
deviation of 10 degrees, (2) attendance (y) is distributed with a mean of 65 and a standard deviation of 15 students, and (3)
temperature and attendance correlates with a value of -0.5.
Here’s a scatterplot and regression line of our data. For practice you could use the statistics above to derive the equation of
the regression line.
n = 38, r = -0.50
y = -0.75x + 110.06
Attendance
80
60
40
40
50
60
70
80
Temperature (F)
Standard error of the estimate
This tutorial is all about how well we can use our data to predict y from x. Intuitively, if all of the data points fall nicely along
the regression line, then we can be pretty confident that this line will provide an accurate estimate of y from x. However, if
1
the line doesn’t fit well, then the points will be scattered all over the line, and we won’t be as confident about our prediction.
Rememeber from the tutorial on prediction that we defined how well the regression line fit the data with the standard error
of the estimate, Syx :
syx =
rP
(y−y 0 )2
n
where y 0 is the y-value of the regression line for each value of x.
We discussed in the prediction tutorial that syx can be thought of as the standard deviation of the distribution of scores
around the regression line.
If you know the correlation, then there’s an easier way of calculating the standard error of the estimate:
p
Syx = Sy 1 − r2
q
Syx = 15 1 − (−0.5)2 = 12.99
Look closely at this equation. What happens with the correlation, r, is either near 1 or -1? Syx gets close to zero. This
should make sense; if the correlation is nearly perfect, then the data points are close to the line and therefore the standard
deviation of the values around the line is near zero.
On the other hand, if the correlation is zero, then Syx = Sy . That is, the standard deviation of the values around the
regression line is the same as the standard deviation of the y-values. Again, this should make sense. If the correlation is zero,
then the slope of the regression line is zero, which means that the regression line is simply y 0 = ȳ. In other words, if the
correlation is zero, then the predicted value of y is just the mean of y. So it makes sense that the standard deviation around
the regression line is just the standard deviation around the mean of y, which is sy .
Homoscedasticity
Next we’ll make an important assumption: the distribution of scores above and below the regression line is normally distributed around y 0 with a standard deviation equal to the standard error of the estimate (syx ). This assumption has a special
name: homoscedasticity.
With homoscedasticity, we know everything about the distribution of scores above and below the regression line. We know
it’s normal, we know the mean (y’), and we know the standard deviation (syx ). This means that we can answer questions
about the proportion of scores that are expected to fall above and below the regression line.
Example 1: What is the expected attendance when the outdoor temperature is 70 degrees?
This is found simply by finding the y-value on the regression line for x = 70:
y 0 = mx + b = (−0.75)(70) + 110.06 = 57.49
2
90
Attendance
80
70
60
57.49
50
40
30
70
40
50
60
Temperature (F)
3
70
80
Example 2: What is the temperature for which the expected attendance is 60?
This requires us to use the regression line, but solving for x for when y’ = 60:
y 0 = 60 = (−0.75)x + 110.06
x=
(60−110.06)
−0.75
= 66.7
90
Attendance
80
70
60
60
50
40
30
66.7
40
50
60
70
Temperature (F)
If the outdoor temperature is 66.7 degrees, we expect the attendance to be 60 students.
4
80
Example 3: On a day that the outdoor temperature is 66.7, what is the probability that 80 or more students will show up
for class?
We know that the distribution of attendance for x = 66.7 will be normal with a mean of y’=60 and a standard deviation of
Syx = 12.99. So now we’re reduced to a normal distribution problem of finding the area under the normal distribution of
mean 60 and standard deviation of 12.99 above a value of 80:
area =0.0618
80
21.03
34.02
47.01
60
72.99
85.98
98.97
Attendance
The z-score is therefore z = 80−60
12.99 = 1.54:
area =0.0618
1.54
-3
-2
-1
0
1
2
3
z
Using Table A, the area above z = 1.54 is 0.0618. So the probability of 80 or more students showing up on a 66.7 degree day
is 0.0618.
5
Example 4: What does the outdoor temperature have to be so that there is a 75% chance of 60 or more students showing
up?
This is like the normal distribution tutorial problems where we convert from areas to scores. For those problems, we first
find the z-score for the area, and then convert that to scores using:
x = (z)(σ) + µ
But now, we have scores distributed normally around a regression line with mean y’ and standard deviation Syx , so this
means that:
y = (z)(Syx ) + y 0
For this problem, the z-score for which 75% of the standard normal distribution lies above is -0.67:
area =0.75
-0.67
-3
-2
-1
0
1
2
3
z
To convert to scores, we know that z = -0.67, we know that y = 60, and we know that Syx = 12.99, so
60 = (−0.67)(12.99) + y 0
Solving for y’:
y 0 = 60 − (−0.67)(12.99) = 68.7
area =0.75
60
68.7
Attendance
Now that we know y’, we can find the corresponding value of x (temperature) on the regression line like we did for Example
2 above:
x=
(68.7−110.06)
−0.75
= 55.07 degrees.
6
90
Attendance
80
70
68.7
60
50
40
30
55.07
40
50
60
70
80
Temperature (F)
In summary, when it’s 55.07 degrees outside, then the distribution of attendance will have a mean of 68.7 and a standard
deviation of 12.99. For this distribution of attendance, there is a 75 percent chance of 60 or more students showing up.
7
Questions
It’s your turn again. Here are 30 problems and answers.
1) Let the mean of x be 78, the mean of y be 78
the standard deviation of x be 1.22 and the standard deviation of y be 1.05
and the correlation between x and y be 0.87
Find the value of x for which 62 percent of the values lie above y = 77.52.
Answer: x = 77.44
2) Let the regression line for y on x be y’ = -1x + 139.94
and the standard error of the estimate be Syx = 0.21
For x = 63.21, find the value of y for which 62 percent of the values lie above.
slope = −0.98( 1.05
1.03 ) = −1
regression line: y’ = -1X + 139.94
For x = 63.21, y’ = 76.73
For 62 percent
q of area above, z = -0.31
syx = (1.05) 1 − (−0.98)2 = 0.21
y = (z)(syx ) + y 0 = (−0.31)(0.21) + 76.73 = 76.66
Answer: y = 76.66
3) Let the mean of x be 103, the mean of y be 95
the standard deviation of x be 1.69 and the standard deviation of y be 1.59
and the correlation between x and y be 0.05
For x = 102.45, find the value of y for which 91 percent of the values lie above.
slope = 0.05( 1.59
1.69 ) = 0.05
regression line: y’ = 0.05X + 90.15
For x = 102.45, y’ = 95.27
For 91 percent
q of area above, z = -1.34
syx = (1.59) 1 − (0.05)2 = 1.59
y = (z)(syx ) + y 0 = (−1.34)(1.59) + 95.27 = 93.14
Answer: y = 93.14
4) Let the regression line for y on x be y’ = -1.02x + 123.93
and the standard error of the estimate be Syx = 0.79
For x = 85.53, find the value of y for which 74 percent of the values lie below.
slope = −0.91( 1.91
1.7 ) = −1.02
regression line: y’ = -1.02X + 123.93
For x = 85.53, y’ = 36.69
For 74 percent
q of area below, z = 0.64
syx = (1.91) 1 − (−0.91)2 = 0.79
y = (z)(syx ) + y 0 = (0.64)(0.79) + 36.69 = 37.2
Answer: y = 37.2
8
5) Let the mean of x be 47, the mean of y be 35
the standard deviation of x be 1.76 and the standard deviation of y be 1.99
and the correlation between x and y be 0.97
For x = 47.02, find the value of y for which 72 percent of the values lie below.
slope = 0.97( 1.99
1.76 ) = 1.1
regression line: y’ = 1.1X + -16.55
For x = 47.02, y’ = 35.17
For 72 percent
q of area below, z = 0.58
syx = (1.99) 1 − (0.97)2 = 0.48
y = (z)(syx ) + y 0 = (0.58)(0.48) + 35.17 = 35.45
Answer: y = 35.45
6) Let the regression line for y on x be y’ = -0.2x + 56.41
Find the value of x for which y’ = 36.79
Answer: x= 98.09
7) Let the regression line for y on x be y’ = 0.86x + 35.87
and the standard error of the estimate be Syx = 0.82
Find the value of x for which 91 percent of the values lie below y = 60.03.
Answer: x = 26.81
8) Let the regression line for y on x be y’ = -0.01x + 109.43
and the standard error of the estimate be Syx = 1.52
Find the value of x for which 27 percent of the values lie below y = 108.08.
Answer: x = 42.08
9) Let the regression line for y on x be y’ = 1.16x + -17.85
and the standard error of the estimate be Syx = 0.84
Find the value of x for which 53 percent of the values lie below y = 37.93.
Answer: x = 48.03
10) Let the mean of x be 87, the mean of y be 93
the standard deviation of x be 1.05 and the standard deviation of y be 1.76
and the correlation between x and y be -0.75
Find the value of x for which 2 percent of the values lie below y = 90.91.
Answer: x = 86.57
11) Let the regression line for y on x be y’ = 0.83x + 40.89
9
and the standard error of the estimate be Syx = 1.09
For x = 82.47, find the value of y for which 77 percent of the values lie above.
slope = 0.78( 1.74
1.63 ) = 0.83
regression line: y’ = 0.83X + 40.89
For x = 82.47, y’ = 109.34
For 77 percent
q of area above, z = -0.74
syx = (1.74) 1 − (0.78)2 = 1.09
y = (z)(syx ) + y 0 = (−0.74)(1.09) + 109.34 = 108.53
Answer: y = 108.53
12) Let the regression line for y on x be y’ = 0.72x + -17.02
and the standard error of the estimate be Syx = 1.17
For x = 94.77, find the value of y for which 27 percent of the values lie below.
slope = 0.68( 1.59
1.51 ) = 0.72
regression line: y’ = 0.72X + -17.02
For x = 94.77, y’ = 51.21
For 27 percent
q of area below, z = -0.61
syx = (1.59) 1 − (0.68)2 = 1.17
y = (z)(syx ) + y 0 = (−0.61)(1.17) + 51.21 = 50.5
Answer: y = 50.5
13) Let the regression line for y on x be y’ = 0.12x + 21.56
and the standard error of the estimate be Syx = 1.28
For x = 28.27, find the value of y for which 80 percent of the values lie below.
slope = 0.14( 1.29
1.47 ) = 0.12
regression line: y’ = 0.12X + 21.56
For x = 28.27, y’ = 24.95
For 80 percent
q of area below, z = 0.84
syx = (1.29) 1 − (0.14)2 = 1.28
y = (z)(syx ) + y 0 = (0.84)(1.28) + 24.95 = 26.03
Answer: y = 26.03
14) Let the mean of x be 85, the mean of y be 65
the standard deviation of x be 1.5 and the standard deviation of y be 1.96
and the correlation between x and y be 0.78
Find the value of y’ for which x = 85.25
Answer: y’= 65.33
15) Let the regression line for y on x be y’ = -0.64x + 33.63
Find the value of y’ for which x = 22.53
Answer: y’= 19.21
10
16) Let the mean of x be 34, the mean of y be 28
the standard deviation of x be 1.71 and the standard deviation of y be 1.94
and the correlation between x and y be -0.36
Find the value of y’ for which x = 33.55
Answer: y’= 28.13
17) Let the mean of x be 49, the mean of y be 60
the standard deviation of x be 1.68 and the standard deviation of y be 1.39
and the correlation between x and y be -0.7
For x = 49.2, find the value of y for which 59 percent of the values lie below.
slope = −0.7( 1.39
1.68 ) = −0.58
regression line: y’ = -0.58X + 88.38
For x = 49.2, y’ = 59.84
For 59 percent
q of area below, z = 0.23
syx = (1.39) 1 − (−0.7)2 = 0.99
y = (z)(syx ) + y 0 = (0.23)(0.99) + 59.84 = 60.07
Answer: y = 60.07
18) Let the mean of x be 25, the mean of y be 108
the standard deviation of x be 1.96 and the standard deviation of y be 1.56
and the correlation between x and y be -0.18
Find the value of x for which y’ = 108.04
Answer: x= 25.28
19) Let the mean of x be 106, the mean of y be 23
the standard deviation of x be 1.49 and the standard deviation of y be 1.46
and the correlation between x and y be -0.6
Find the value of y’ for which x = 105.36
Answer: y’= 23.16
20) Let the regression line for y on x be y’ = -0.05x + 39.42
Find the value of y’ for which x = 89.7
Answer: y’= 34.94
21) Let the mean of x be 24, the mean of y be 105
the standard deviation of x be 1.9 and the standard deviation of y be 1.43
and the correlation between x and y be -0.18
Find the value of y’ for which x = 24.07
Answer: y’= 104.88
11
22) Let the regression line for y on x be y’ = -0.67x + 66.65
and the standard error of the estimate be Syx = 0.73
For x = 45.89, find the value of y for which 22 percent of the values lie above.
slope = −0.73( 1.07
1.16 ) = −0.67
regression line: y’ = -0.67X + 66.65
For x = 45.89, y’ = 35.9
For 22 percent
q of area above, z = 0.77
syx = (1.07) 1 − (−0.73)2 = 0.73
y = (z)(syx ) + y 0 = (0.77)(0.73) + 35.9 = 36.46
Answer: y = 36.46
23) Let the mean of x be 103, the mean of y be 75
the standard deviation of x be 1.45 and the standard deviation of y be 1.81
and the correlation between x and y be -0.57
Find the value of x for which y’ = 75.6
Answer: x= 102.38
24) Let the regression line for y on x be y’ = 0.51x + 39.15
and the standard error of the estimate be Syx = 1.13
Find the value of x for which 76 percent of the values lie below y = 71.31.
Answer: x = 61.49
25) Let the mean of x be 92, the mean of y be 79
the standard deviation of x be 1.63 and the standard deviation of y be 1.44
and the correlation between x and y be 0.4
Find the value of x for which 99 percent of the values lie below y = 81.68.
Answer: x = 91.74
26) Let the mean of x be 61, the mean of y be 20
the standard deviation of x be 1.29 and the standard deviation of y be 1.54
and the correlation between x and y be -0.17
For x = 61.03, find the value of y for which 42 percent of the values lie above.
slope = −0.17( 1.54
1.29 ) = −0.2
regression line: y’ = -0.2X + 32.38
For x = 61.03, y’ = 20.17
For 42 percent
q of area above, z = 0.2
syx = (1.54) 1 − (−0.17)2 = 1.52
y = (z)(syx ) + y 0 = (0.2)(1.52) + 20.17 = 20.47
Answer: y = 20.47
12
27) Let the regression line for y on x be y’ = 0.44x + 53.66
and the standard error of the estimate be Syx = 1.71
Find the value of x for which 74 percent of the values lie below y = 90.6.
Answer: x = 81.48
28) Let the mean of x be 17, the mean of y be 81
the standard deviation of x be 1.52 and the standard deviation of y be 1.46
and the correlation between x and y be -0.02
Find the value of x for which y’ = 81
Answer: x= 16.52
29) Let the mean of x be 99, the mean of y be 64
the standard deviation of x be 1.92 and the standard deviation of y be 1.87
and the correlation between x and y be -0.72
Find the value of x for which y’ = 64.09
Answer: x= 99.04
30) Let the mean of x be 61, the mean of y be 98
the standard deviation of x be 1.45 and the standard deviation of y be 1.32
and the correlation between x and y be -0.12
For x = 61.82, find the value of y for which 53 percent of the values lie above.
slope = −0.12( 1.32
1.45 ) = −0.11
regression line: y’ = -0.11X + 104.66
For x = 61.82, y’ = 97.86
For 53 percent
q of area above, z = -0.08
syx = (1.32) 1 − (−0.12)2 = 1.31
y = (z)(syx ) + y 0 = (−0.08)(1.31) + 97.86 = 97.76
Answer: y = 97.76
13
Related documents