Regression Line for Standardized Values (z_x, z_y)

AP Statistics Objectives Ch8
Find the Least Squares
Regression Line and interpret its
slope, y-intercept, and the
coefficients of correlation and
determination
Justify the regression model
using the scatterplot and
residual plot
Vocabulary
Model
Linear model
Residuals
Predicted value
Slope
Regression line
Regression to the mean
Intercept
$R^2$
Chapter 8 #1
Formulas used: $\hat{y} = b_0 + b_1 x$, $b_1 = \dfrac{r s_y}{s_x}$

      $\bar{x}$   $s_x$   $\bar{y}$   $s_y$   $r$     $\hat{y} = b_0 + b_1 x$
a)    10          2       20          3       0.5     $\hat{y} = 12.5 + 0.75x$
b)    2           0.06    7.2         1.2     -0.4    $\hat{y} = 23.2 - 8x$
c)    12          6       152         30      -0.8    $\hat{y} = 200 - 4x$
d)    2.5         1.2     ___         100     ___     $\hat{y} = -100 + 50x$
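The rows of Chapter 8 #1 can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides: the helper name `line_from_stats` is hypothetical, and the row d) values it prints are computed here from the slide's formulas rather than quoted from the deck.

```python
# Sketch: work through Chapter 8 #1 using b1 = r*s_y/s_x and b0 = y_bar - b1*x_bar.
# The helper name is hypothetical, chosen only for this illustration.

def line_from_stats(x_bar, s_x, y_bar, s_y, r):
    """Return (b0, b1) for the line y_hat = b0 + b1*x."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Rows a) and b): all five summary statistics are given, so find the line.
b0_a, b1_a = line_from_stats(10, 2, 20, 3, 0.5)
b0_b, b1_b = line_from_stats(2, 0.06, 7.2, 1.2, -0.4)

# Row c): given y_hat = 200 - 4x, x_bar = 12, s_x = 6, r = -0.8.
y_bar_c = 200 + (-4) * 12      # the line passes through (x_bar, y_bar)
s_y_c = (-4) * 6 / (-0.8)      # b1 = r*s_y/s_x rearranged for s_y

# Row d): given y_hat = -100 + 50x, x_bar = 2.5, s_x = 1.2, s_y = 100.
y_bar_d = -100 + 50 * 2.5      # the line passes through (x_bar, y_bar)
r_d = 50 * 1.2 / 100           # b1 = r*s_y/s_x rearranged for r

print(f"a) y_hat = {b0_a:.2f} + {b1_a:.2f}x")
print(f"b) y_hat = {b0_b:.2f} + ({b1_b:.2f})x")
print(f"c) y_bar = {y_bar_c:.1f}, s_y = {s_y_c:.1f}")
print(f"d) y_bar = {y_bar_d:.1f}, r = {r_d:.2f}")
```

Rows c) and d) rely on two facts from the notes: every regression line passes through the point of means, and the slope formula can be rearranged for whichever quantity is missing.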
Standardized Foot Length vs Height 2011
$\hat{z}_{Height} = 0.84\, z_{FootSize}$
The slope is the correlation.
$(\bar{x}, \bar{y})$ is on every regression line.
NOTE: (0,0) represents the mean of x and the mean of y.
Regression Line for Standardized Values $(z_x, z_y)$
$\hat{z}_y = r z_x$
$\hat{z}_y$ is the predicted z-score for the response variable
$z_x$ is the z-score for the explanatory variable
$r$ is the correlation coefficient
Regression Line for $(x, y)$
$\hat{y} = b_0 + b_1 x$
$\hat{y}$ is the predicted value of the response variable
$b_1$ is the slope: $b_1 = \dfrac{r s_y}{s_x}$
$b_0$ is the y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$
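The standardized form and the $(x, y)$ form describe the same line. A short derivation, added here as a sketch and using only the symbols already defined in these notes, starts from the z-score definitions $z_x = (x - \bar{x})/s_x$ and $\hat{z}_y = (\hat{y} - \bar{y})/s_y$:

```latex
\begin{align*}
\hat{z}_y &= r\, z_x \\
\frac{\hat{y} - \bar{y}}{s_y} &= r\, \frac{x - \bar{x}}{s_x} \\
\hat{y} &= \bar{y} + \frac{r s_y}{s_x}\,(x - \bar{x}) \\
\hat{y} &= \underbrace{\left(\bar{y} - b_1 \bar{x}\right)}_{b_0}
          + \underbrace{\frac{r s_y}{s_x}}_{b_1}\, x
\end{align*}
```

Setting $x = \bar{x}$ in the third line gives $\hat{y} = \bar{y}$, which is why the point of means lies on every regression line.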
Explanatory or Response
$\widehat{Height} = 32.86 + 1.36\,FootSize$
Now interpret the $R^2$.
$R^2 = 0.697$
According to the linear model, 69.7%
of the variability in height is accounted
for by variation in foot size.
Explanatory or Response
2011 data resulted in the following linear equations:
$\widehat{FootSize} = -9.55 + 0.51\,Height$
$\widehat{Height} = 32.86 + 1.36\,FootSize$
Residual Plot Example
$e = y - \hat{y}$
REMEMBER:
POSITIVE RESIDUALS are UNDERESTIMATES
NEGATIVE RESIDUALS are OVERESTIMATES
Assignment
CHAPTER 8
Part I: pp. 189-190 #2,4,8&10,12&14
Part II: pp. 190-192 #16,18,20,28&30
Chapter 7 Answers
a) #1 shows little or no association
b) #4 shows a negative association
c) #2 & #4 each show a linear
association
d) #3 shows a moderately strong,
curved association
e) #2 shows a very strong association
Chapter 7 Answers
a) -0.977
b) 0.736
c) 0.951
d) -0.021
Chapter 7 Answers
The researcher should have plotted the
data first. A strong, curved relationship
may have a very low correlation. In fact,
correlation is only a useful measure of the
strength of a linear relationship.
Chapter 7 Answers
If the association between GDP and infant mortality is
linear, a correlation of -0.772 shows a moderate,
negative association.
Chapter 7 Answers
Continent is a categorical variable. Correlation
measures the strength of linear associations between
quantitative variables.
Chapter 7 Answers
Correlation must be between -1 and 1, inclusive.
Correlation can never be 1.22.
Chapter 7 Answers
A correlation, no matter how strong, cannot prove a
cause-and-effect relationship.
Chapter 8 Vocabulary
1) Regression to the mean – each predicted
response variable (y) tends to be closer to the
mean (in standard deviations) than its
corresponding explanatory variable (x)
Chapter 8 Vocabulary
2) $\hat{y}$ – predicted value of the response variable
3) Residual – the difference between the actual
response value and the predicted response value
$e = y - \hat{y}$
4) Overestimate – produces a negative residual
5) Underestimate – produces a positive residual
Chapter 8 Vocabulary
6) Slope – rate of change
given in units of the response variable (y) per
unit of the explanatory variable (x)
7) Intercept – predicted response value when the
explanatory value is zero
8) $R^2$ (aka Coefficient of Determination) – must also
be interpreted when describing a regression model:
“According to the linear model, _____% of
the variability in _______ (response variable) is
accounted for by variation in ________
(explanatory variable)”
The remaining variation is left in the residuals
Chapter 8 Vocabulary
CONDITIONS FOR USING A LINEAR REGRESSION
1) Quantitative Variables – Check the variables
2) Straight Enough – Check the scatterplot 1st
(should be nearly linear)
- Check the residual plot next
(should be random scatter)
3) Outlier Condition- Any outliers need to be investigated
Chapter 8 Vocabulary
9) Residual Plot – a scatterplot of the residuals against
either x or $\hat{y}$
If you find a pattern in the Residual Plot, that
means the residuals (errors) are predictable. If the
residuals are predictable, then a better model
exists. ---- LINEAR MODEL IS NOT APPROPRIATE.
A residual plot is done with the RESIDUALS on the
y-axis. On the x-axis, put the explanatory variable.
NOTE: Some software packages will put $\hat{y}$ on the
x-axis. This does not change the presence
(or absence) of a pattern.
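To see what a patterned residual plot looks like numerically, here is a small sketch. The data are made up for illustration (points on the curve y = x², not from the slides), and `least_squares` is a hypothetical helper that fits the line directly from the data.

```python
# Sketch: residuals from a least-squares line fitted to curved, synthetic data.
# The data (y = x^2) are invented purely to produce a patterned residual plot.

def least_squares(xs, ys):
    """Return (b0, b1) of the least-squares line y_hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

b0, b1 = least_squares(xs, ys)

# Residual = actual - predicted, one per data point.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"x = {x}: residual = {e:+.1f}")
```

The residuals come out positive at both ends and negative in the middle. That U shape is the kind of pattern that signals a linear model is not appropriate, even though the residuals still sum to zero.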
What is the square root of 4?
Did you say 2? That is only half the answer.
Both 2 and -2 are square roots of 4, because
$(2)^2$ and $(-2)^2$ are both 4.
So what?
Important Note: The correlation is not given
directly in this software package. You need to look
in two places for it. Taking the square root of the
“R squared” (coefficient of determination) is not
enough. You must look at the sign of the slope
too. Positive slope is a positive r-value. Negative
slope is a negative r-value.
[Regression printout: Grad Rate vs. S/F Ratio, slope coefficient -0.07861]
So here you should note that the slope is negative.
The correlation will be negative too. Since $R^2$ is
0.482, $r$ will be $-\sqrt{0.482} = -0.694$.
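The two-step lookup (square root of R², then the sign of the slope) can be written as a tiny function. This is only a sketch; the function name is hypothetical, and the inputs are the R² and slope quoted above.

```python
import math

def correlation_from_r_squared(r_squared, slope):
    """Recover r from R^2, taking its sign from the regression slope."""
    magnitude = math.sqrt(r_squared)
    # copysign attaches the sign of the slope to the square root.
    return math.copysign(magnitude, slope)

r = correlation_from_r_squared(0.482, -0.07861)
print(round(r, 3))
```

With a positive slope and the same R², the function returns the positive root instead.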
Coefficient of Determination = $(0.694)^2 = 0.4816$
With the linear regression model,
48.2% of the variability in airline
fares is accounted for by the
variation in distance of the flight.
$b_1 = r\,\dfrac{s_y}{s_x} = 0.694 \cdot \dfrac{56.37}{497.8} = 0.0786$
The model predicts an increase of about 7.86 cents
in airfare for every additional mile.
Equivalently, it predicts an increase of about $7.86 for
every additional 100 miles.
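$b_1 = 0.0786$
$\hat{y} = b_0 + b_1 x$
Substituting the means $(\bar{x}, \bar{y}) = (853.7, 244.33)$, which lie on the line:
$244.33 = b_0 + (0.0786)(853.7)$
$b_0 = 244.33 - (0.0786)(853.7)$
$b_0 = 177.2292$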
The model predicts a flight of zero
miles will cost $177.23. The airline
may have built in an initial cost to
pay for some of its expenses.
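The slope and intercept arithmetic for the airfare model can be reproduced in a few lines. This sketch assumes that 853.7 and 244.33 are the mean distance and mean fare, and that 497.8 and 56.37 are the corresponding standard deviations, which is how the slides use them; the variable names are illustrative.

```python
# Sketch: airfare slope and intercept from the summary statistics on the slides.
r = 0.694
s_x, s_y = 497.8, 56.37        # assumed: SDs of distance (miles) and fare (dollars)
x_bar, y_bar = 853.7, 244.33   # assumed: mean distance and mean fare

b1 = round(r * s_y / s_x, 4)   # rounded to four places, as on the slide
b0 = y_bar - b1 * x_bar        # the line passes through (x_bar, y_bar)

print(f"b1 = {b1}")
print(f"b0 = {b0:.4f}")
```

Rounding the slope before computing the intercept mirrors the slide's arithmetic; carrying full precision would shift the intercept slightly.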
$b_1 = 0.0786$
$b_0 = 177.2292$
$\widehat{Airfare} = 177.2292 + 0.0786\,Distance$
$\widehat{Airfare} = 177.2292 + 0.0786(200) = \$192.95$
$\widehat{Airfare} = 177.2292 + 0.0786(2000) = \$334.43$
8. Using those estimates, draw the line on the scatterplot.
$\widehat{Airfare} = 177.2292 + 0.0786\,Distance$
$\widehat{Airfare} = 177.2292 + 0.0786(1719) = \$312.34$
$e = y - \hat{y} = 212 - 312.34 = -\$100.34$
12. In general, a positive residual means the model underestimated the actual value.
13. In general, a negative residual means the model overestimated the actual value.
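The predictions and the residual for the airfare model can be checked with a short script. It is a sketch using the rounded coefficients from the slides; the function names are hypothetical.

```python
# Sketch: predictions and a residual for the airfare model on the slides.

def predict_airfare(distance_miles, b0=177.2292, b1=0.0786):
    """Predicted fare in dollars for a flight of the given distance."""
    return b0 + b1 * distance_miles

def describe_residual(e):
    """Positive residual: the model underestimated; negative: it overestimated."""
    if e > 0:
        return "underestimate"
    if e < 0:
        return "overestimate"
    return "exact"

# Two predictions are enough to draw the line on the scatterplot.
for miles in (200, 2000):
    print(f"{miles} miles: ${predict_airfare(miles):.2f}")

# Residual for the 1719-mile flight with an actual fare of $212.
predicted = predict_airfare(1719)
e = 212 - predicted            # residual = actual - predicted
print(f"residual = {e:.2f} ({describe_residual(e)})")
```

The negative residual says the model predicted a higher fare than was actually charged for that flight.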
A linear model should be appropriate, because
1) the scatterplot shows a nearly linear form and
2) the residual plot shows random scatter.
The coefficient of determination is 0.482, so
the correlation coefficient is $\sqrt{0.482} = 0.694$
(positive, because the slope is positive).
This shows an association of moderate strength
for the model.
$150 for a flight of about 700 miles seems low
compared to the other fares.
“fare” is the response variable. Not all software will call it
the dependent variable.
Always look for “Constant” and what is listed beside it.
The column is headed “Variable”, and “dist”, listed
below “Constant”, is the explanatory variable.
Recall:
For y = 3x + 1 the coefficient of x is ‘3’.
For computer printouts this is the key
column for your regression model.
The “Coefficient” of the “Constant” is
the y-intercept for the linear
regression.
The “Coefficient” of the variable
“dist” is the slope for the linear
regression.
$\widehat{fare} = 177.215 + 0.078619\,distance$
5. Predict the airfare for a 1000-mile flight.
$\widehat{fare} = 177.215 + 0.078619(1000) = \$255.83$
When the explanatory and response variables are swapped,
$R^2$ doesn’t change, but the equation does.
$\widehat{distance} = -644.287 + 6.13101\,fare$
$\widehat{distance} = -644.287 + 6.13101(255.83) = 924.2$ miles
8. Residual? $e = y - \hat{y} = 924.2 - 1000 = -75.8$
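The fare and distance regressions are two different lines, not algebraic inverses of each other. The sketch below uses the printout coefficients quoted in these notes to show that feeding the predicted fare into the distance model does not return 1000 miles. It also checks a standard identity: the product of the two slopes equals R², since $(r s_y/s_x)(r s_x/s_y) = r^2$. The constant names are illustrative.

```python
# Sketch: regressing fare on distance vs. distance on fare.
B0_FARE, B1_FARE = 177.215, 0.078619   # fare_hat = B0_FARE + B1_FARE * distance
B0_DIST, B1_DIST = -644.287, 6.13101   # distance_hat = B0_DIST + B1_DIST * fare

fare_hat = B0_FARE + B1_FARE * 1000                    # predicted fare at 1000 miles
distance_hat = B0_DIST + B1_DIST * round(fare_hat, 2)  # feed that fare back in

# Solving the fare equation algebraically for distance WOULD give back 1000,
# but that is not what the distance-on-fare regression does.
inverted = (fare_hat - B0_FARE) / B1_FARE

slope_product = B1_FARE * B1_DIST                      # should be close to R^2

print(f"predicted fare:      ${fare_hat:.2f}")
print(f"predicted distance:  {distance_hat:.1f} miles")
print(f"algebraic inverse:   {inverted:.1f} miles")
print(f"product of slopes:   {slope_product:.3f}")
```

The predicted distance falls short of 1000 miles because each regression pulls its prediction toward the mean of its own response variable.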
Chp 8 #17
R squared = 92.4%
17a. What is the correlation between tar and
nicotine? (NOTE: scatterplot shows a strong
positive linear association.)
$r = +\sqrt{0.924} = 0.961$
17b. What would you predict about the average
nicotine content of cigarettes that are 2
standard deviations below average in tar
content?
$\hat{z}_{nicotine} = r z_{tar} = 0.961(-2) = -1.922$
I would predict that the nicotine content would
be 1.922 standard deviations below the
average.
17c. If a cigarette is 1 standard deviation above
average in nicotine content, what do you
suspect is true about its tar content?
$\hat{z}_{tar} = r z_{nicotine} = 0.961(1) = 0.961$
I would predict that the tar content would be
0.961 standard deviations above the average.
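The three parts of #17 follow the same pattern: take the positive square root of R², then multiply by the given z-score. A minimal sketch, with a hypothetical helper name:

```python
import math

# Positive root, because the scatterplot shows a positive association.
r = math.sqrt(0.924)

def predict_z(r, z_explanatory):
    """Predicted z-score of the response: z_hat_y = r * z_x."""
    return r * z_explanatory

z_nicotine = predict_z(r, -2)   # tar is 2 SDs below average
z_tar = predict_z(r, 1)         # nicotine is 1 SD above average

print(f"r = {r:.3f}")
print(f"predicted z for nicotine: {z_nicotine:.3f}")
print(f"predicted z for tar:      {z_tar:.3f}")
```

Each predicted z-score is slightly closer to 0 than the z-score it came from, because r is less than 1 in size. This is regression to the mean in action.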