Fat Versus Protein: An Example

The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:

[Scatterplot: total fat (g) vs. protein (g) for 30 Burger King menu items]
The Linear Model

• Correlation says “There seems to be a linear association between these two variables,” but it doesn’t tell what that association is.
• We can say more about the linear relationship between two quantitative variables with a model.
• A model simplifies reality to help us understand underlying patterns and relationships.
The Linear Model (cont.)
• The linear model is just an equation of a straight line through the data.
   • The points in the scatterplot don’t all line up, but a straight line can summarize the general pattern.
   • The linear model can help us understand how the values are associated.
Residuals
• The model won’t be perfect, regardless of the line we draw.
• Some points will be above the line and some will be below.
• The estimate made from a model is the predicted value (denoted as ŷ).
Residuals (cont.)
• The difference between the observed value and its associated predicted value is called the residual.
• To find the residuals, we always subtract the predicted value from the observed one:

residual = observed − predicted = y − ŷ
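As a quick illustration, here is a minimal Python sketch (the data values and the fitted line are invented for the example) that computes residuals exactly as defined above:

```python
import numpy as np

# Invented data and an assumed fitted line y-hat = 4 + 0.5x (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([5.1, 4.6, 6.2, 5.8])

y_hat = 4 + 0.5 * x      # predicted values from the model
residuals = y - y_hat    # residual = observed - predicted

print(residuals)         # [ 0.6 -0.4  0.7 -0.2]
```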
Residuals (cont.)
• A negative residual means the predicted value’s too big (an overestimate).
• A positive residual means the predicted value’s too small (an underestimate).
“Best Fit” Means Least Squares
• Some residuals are positive, others are negative, and, on average, they cancel each other out.
• So, we can’t assess how well the line fits by adding up all the residuals.
• Similar to what we did with deviations, we square the residuals and add the squares.
• The smaller the sum, the better the fit.
• The line of best fit is the line for which the sum of the squared residuals is smallest.
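To see the criterion in action, here is a short Python sketch using the three points from the worked example on the next slide; summing the residuals is not informative, but the sum of squared residuals clearly ranks the two candidate lines:

```python
import numpy as np

# The three points from the worked example that follows
x = np.array([0.0, 3.0, 6.0])
y = np.array([0.0, 10.0, 2.0])

def fit_summary(slope, intercept):
    """Return (sum of residuals, sum of squared residuals) for y-hat = intercept + slope*x."""
    residuals = y - (intercept + slope * x)
    return residuals.sum(), (residuals ** 2).sum()

print(fit_summary(0.5, 4))     # (-4.5, 61.25)  an eyeballed line
print(fit_summary(1 / 3, 3))   # (0.0, 54.0)    the least squares line fits better
```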
Least Squares Regression Line (LSRL)
• The line that gives the best fit to the data set
• The line that minimizes the sum of the squares of the deviations from the line

Example: consider the points (0, 0), (3, 10), and (6, 2), and start with an eyeballed line, ŷ = 0.5x + 4.

Point     Predicted value        Deviation from the line
(0, 0)    0.5(0) + 4 = 4         0 − 4 = −4
(3, 10)   0.5(3) + 4 = 5.5       10 − 5.5 = 4.5
(6, 2)    0.5(6) + 4 = 7         2 − 7 = −5

Sum of the squares = (−4)² + 4.5² + (−5)² = 61.25

What is the sum of the deviations from the line? Will it always be zero?

Now use a calculator to find the line of best fit:

ŷ = (1/3)x + 3

Point     Predicted value        Deviation (y − ŷ)
(0, 0)    (1/3)(0) + 3 = 3       0 − 3 = −3
(3, 10)   (1/3)(3) + 3 = 4       10 − 4 = 6
(6, 2)    (1/3)(6) + 3 = 5       2 − 5 = −3

Sum of the squares = (−3)² + 6² + (−3)² = 54

The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
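A minimal Python check of this example: numpy’s polyfit finds the least squares coefficients directly, and the LSRL’s residuals sum to zero:

```python
import numpy as np

x = np.array([0.0, 3.0, 6.0])
y = np.array([0.0, 10.0, 2.0])

# Degree-1 least squares fit; returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)        # 0.333..., 3.0  ->  y-hat = (1/3)x + 3

residuals = y - (intercept + slope * x)
print((residuals ** 2).sum())  # 54.0, smaller than the eyeballed line's 61.25
print(residuals.sum())         # ~0: the LSRL's deviations cancel exactly
```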
Least Squares Regression Line (LSRL)
You may see the equation written in the form:

ŷ = a + bx

• ŷ (y-hat) means the predicted y. Be sure to put the hat on the y!
The Least Squares Line
• The slope of the least squares line is b = r(s_y / s_x).
• The intercept is a = ȳ − b·x̄, so the least squares line always passes through the point (x̄, ȳ).
Example: The following statistics are found for the variables posted speed limit and the average number of accidents:

x̄ = 40, s_x = 11.6, ȳ = 18, s_y = 8.4, r = 0.9981

Find the LSRL and predict the number of accidents for a posted speed limit of 50 mph.

b = r(s_y / s_x) = 0.9981(8.4 / 11.6) ≈ 0.723, and a = ȳ − b·x̄ = 18 − 0.723(40) ≈ −10.92, so:

ŷ = 0.723x − 10.92
ŷ = 0.723(50) − 10.92 ≈ 25.23 accidents
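The same computation as a Python sketch, using the summary statistics given above:

```python
# Summary statistics from the example
x_bar, s_x = 40, 11.6
y_bar, s_y = 18, 8.4
r = 0.9981

b = r * (s_y / s_x)     # slope of the LSRL
a = y_bar - b * x_bar   # intercept: the LSRL passes through (x-bar, y-bar)
print(b, a)             # ~0.723 and ~-10.92

print(a + b * 50)       # predicted accidents at 50 mph: ~25.23
```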
Fat Versus Protein: An Example
• The regression line for the Burger King data fits the data well. The equation is:

ŷ = 6.8 + 0.97x

• Using the equation for our line, we can predict the fat content of a BK Broiler chicken sandwich with 30 g of protein as 6.8 + 0.97(30) = 35.9 grams of fat. Note that the actual fat content is about 25 g.
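The same prediction in Python, using the equation from the slide:

```python
fat_hat = 6.8 + 0.97 * 30   # predicted fat (g) for 30 g of protein
print(fat_hat)              # 35.9
print(25 - fat_hat)         # residual for the BK Broiler: about -10.9 g
```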
Interpretations
SLOPE is a rate of change: it tells how much the predicted response changes for a one-unit increase in the explanatory variable.
Y-INTERCEPT is the predicted response when the explanatory variable is zero. Sometimes it doesn’t have a meaningful interpretation. That doesn’t mean we shouldn’t know how to interpret it; it just sometimes doesn’t make sense in context, and it often falls outside the window of reasonable predictions.
Example: The ages (in months) and heights (in inches) of seven children are given:

Age (x), months:     16   24   42   60   75   102   120
Height (y), inches:  24   30   35   40   48    56    60

Find the LSRL. Interpret the slope and y-intercept in the context of the problem.
The LSRL is approximately ŷ = 20 + 0.34x.

Slope: for each one-month increase in age, the predicted height of a child increases by approximately 0.34 inches.
Y-intercept: the predicted height of a newborn (age 0 months) is 20 inches.
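A short Python sketch that fits the LSRL to the seven (age, height) pairs; scipy’s linregress returns the slope and intercept directly:

```python
from scipy.stats import linregress

age    = [16, 24, 42, 60, 75, 102, 120]   # months
height = [24, 30, 35, 40, 48,  56,  60]   # inches

fit = linregress(age, height)
print(fit.slope, fit.intercept)   # ~0.34 inches per month, intercept ~20 inches
```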
Using the same ages and heights of the seven children (ages in months):

Predict the height of a child who is 4.5 years old (54 months).
Predict the height of someone who is 20 years old (240 months).
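Continuing the sketch above, the predictions are one line each; note that 240 months lies far outside the observed ages of 16–120 months:

```python
def predict(months):
    return fit.intercept + fit.slope * months

print(predict(4.5 * 12))   # 54 months: ~38.9 inches, inside the data range
print(predict(20 * 12))    # 240 months: ~102.5 inches -- an extrapolation,
                           # well beyond the data, so not trustworthy
```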
Extrapolation: Reaching Beyond the Data
• Linear models give a predicted value for each case in the data.
• We cannot assume that a linear relationship in the data exists beyond the range of the data.
• Once we venture into new x territory, such a prediction is called an extrapolation.
Extrapolation
• The LSRL should not be used to predict y for values of x outside the data set.
• It is unknown whether the pattern observed in the scatterplot continues outside this range.
Residuals Revisited
• A residual is the vertical deviation between an observation and the LSRL.
• The sum of the residuals from the LSRL is always zero.
• error = observed − expected

residual = y − ŷ
Residual plot
• A residual plot is a scatterplot of the (x, residual) pairs.
• Residuals can be graphed against other statistics besides x.
• Its purpose is to tell whether a linear association exists between the x and y variables.
• If no pattern exists among the points in the residual plot, then the association is linear.
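A minimal matplotlib sketch of a residual plot, reusing the children’s age and height data from earlier; the horizontal line at 0 makes patterns easier to spot:

```python
import matplotlib.pyplot as plt
import numpy as np

age = np.array([16, 24, 42, 60, 75, 102, 120])
height = np.array([24, 30, 35, 40, 48, 56, 60])

slope, intercept = np.polyfit(age, height, 1)
residuals = height - (intercept + slope * age)

plt.scatter(age, residuals)     # the (x, residual) pairs
plt.axhline(0, color="gray")    # reference line at residual = 0
plt.xlabel("Age (months)")
plt.ylabel("Residual (inches)")
plt.show()
```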
[Two residual plots of residuals vs. x: one with no pattern, labeled “Linear,” and one with a pattern, labeled “Not linear”]
Example: One measure of the success of knee surgery is post-surgical range of motion for the knee joint following a knee dislocation. Is there a linear relationship between age and range of motion?

Age:              35   24   40   31   28   25   26   16   14   20   21   30
Range of motion: 154  142  137  133  122  126  135  135  108  120  127  122

Sketch a residual plot.

[Residual plot: residuals vs. age, showing no pattern]

Since there is no pattern in the residual plot, there is a linear relationship between age and range of motion.
Using the same age and range of motion data, plot the residuals against the predicted values ŷ. How does this residual plot compare to the previous one?

[Residual plot: residuals vs. ŷ]
Residual plots look the same whether the residuals are plotted against x or against ŷ: since ŷ is a linear function of x, switching the horizontal axis from x to ŷ only rescales the plot.
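A quick Python sketch demonstrating this with the knee data; the two residual plots differ only in their horizontal scale:

```python
import matplotlib.pyplot as plt
import numpy as np

age = np.array([35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30])
rom = np.array([154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122])

slope, intercept = np.polyfit(age, rom, 1)
y_hat = intercept + slope * age
residuals = rom - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.scatter(age, residuals)
ax1.set_xlabel("Age")
ax1.set_ylabel("Residual")
ax2.scatter(y_hat, residuals)
ax2.set_xlabel("Predicted range of motion")
plt.show()   # same pattern in both: y-hat is just a linear rescaling of x
```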
Assumptions and Conditions
• Quantitative Variables Condition:
   • Regression can only be done on two quantitative variables, so make sure to check this condition.
• Straight Enough Condition:
   • The linear model assumes that the relationship between the variables is linear.
   • A scatterplot will let you check that the assumption is reasonable.
Assumptions and Conditions (cont.)
• It’s a good idea to check linearity again after computing the regression, when we can examine the residuals.
• You should also check for outliers, which could change the regression.
• If the data seem to clump or cluster in the scatterplot, that could be a sign of trouble worth looking into further.
Interpreting Computer Regression Output
• A number of statistical software packages produce similar least-squares regression output. Be sure you can locate the slope b, the y-intercept a, and the values of s and r².

Computer-generated regression analysis of the knee surgery data:

Predictor    Coef      Stdev     T       P
Constant     107.58    11.12     9.67    0.000
Age          0.8710    0.4146    2.10    0.062

s = 10.42    R-sq = 30.6%    R-sq(adj) = 23.7%

What is the equation of the LSRL? Find the slope and y-intercept.

ŷ = 107.58 + 0.8710x

What are the correlation coefficient and the coefficient of determination?

r² = 0.306, so r = √0.306 ≈ 0.5532. Be sure to convert r² to a decimal before taking the square root, and NEVER use adjusted r²!
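A Python sketch that recovers the same quantities from the raw knee data (values match the output above up to rounding):

```python
import numpy as np
from scipy.stats import linregress

age = np.array([35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30])
rom = np.array([154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122])

fit = linregress(age, rom)
print(fit.intercept, fit.slope)   # ~107.58 and ~0.8710
print(fit.rvalue ** 2)            # R-sq ~0.306
print(fit.rvalue)                 # r ~0.5532

# s is the standard deviation of the residuals, with n - 2 degrees of freedom
residuals = rom - (fit.intercept + fit.slope * age)
s = np.sqrt((residuals ** 2).sum() / (len(age) - 2))
print(s)                          # ~10.42
```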
Outliers, Leverage, and Influence
• Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis.
   • Any point that stands away from the others can be called an outlier and deserves your special attention.
Outlier
• In a regression setting, an outlier is a data point with a large residual.
Influential Point
• A point that influences where the LSRL is located.
• If removed, it will significantly change the slope of the LSRL.
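A minimal sketch (with invented data, including one hypothetical far-out point) showing how refitting without a single point can change the slope dramatically:

```python
import numpy as np

# Invented data: five points near the line y = x, plus one unusual point at x = 20
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 2.0])

slope_all, _ = np.polyfit(x, y, 1)
slope_without, _ = np.polyfit(x[:-1], y[:-1], 1)

print(slope_all)       # near 0: the unusual point drags the line down
print(slope_without)   # ~1.0: a big change, so the point is influential
```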
Outliers, Leverage, and Influence (cont.)
The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election…
Outliers, Leverage, and Influence (cont.)
The red line shows the effects that one unusual point can have on a regression.
Outliers, Leverage, and Influence (cont.)
• The linear model doesn’t fit points with large residuals very well.
• Because they seem to be different from the other cases, it is important to pay special attention to points with large residuals.
Lurking Variables and Causation
• No matter how strong the association, no matter how straight the line, there is no way to conclude from a regression alone that one variable causes the other.
   • There’s always the possibility that some third variable is driving both of the variables you have observed.
• With observational data, as opposed to data from a designed experiment, there is no way to be sure that a lurking variable is not the cause of any apparent association.
Lurking Variables and Causation (cont.)
The following scatterplot shows that the average life expectancy for a country is related to the number of doctors per person in that country.
Lurking Variables and Causation (cont.)
This new scatterplot shows that the average life expectancy for a country is related to the number of televisions per person in that country.
Lurking Variables and Causation (cont.)
• Since televisions are cheaper than doctors, send TVs to countries with low life expectancies in order to extend lifetimes. Right?
• How about considering a lurking variable? That makes more sense…
   • Countries with higher standards of living have both longer life expectancies and more doctors (and TVs!).
   • If higher living standards cause changes in these other variables, improving living standards might be expected to prolong lives and increase the numbers of doctors and TVs.