Linear Regression - math-b

Residuals Revisited

 The linear model we
are using assumes that
the relationship
between the two
variables is a perfect
straight line.
 The residuals are the
part of the data that
hasn’t been modeled.
Residuals

 $Data = Model + Residual$
 Therefore,
 $Residual = Data - Model$
 In symbols,
$e = y - \hat{y}$
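As a minimal sketch of the residual formula, the Python below computes observed minus predicted for a few points. The protein/fat pairs are made up purely for illustration, and the helper name `residuals` is hypothetical; the line fat-hat = 6.8 + 0.97 protein is the Burger King model quoted in these slides.

```python
def residuals(xs, ys, intercept, slope):
    """Return e = y - y_hat for each point, where y_hat = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Made-up (hypothetical) protein/fat pairs in grams, for illustration only.
protein = [10.0, 20.0, 30.0]
fat = [18.0, 25.0, 38.0]

# Burger King line from the slides: fat_hat = 6.8 + 0.97 * protein
e = residuals(protein, fat, 6.8, 0.97)
```

A positive residual means the model underestimated that item's fat; a negative one means it overestimated.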
Residuals

 Residuals help tell us if
the model makes sense.
 When a regression is
appropriate, it should
model the underlying
relationship.
 Because nothing, in a
sense, should be left
behind, we usually plot
the residuals in the hope
of finding…nothing.
Residuals

 A scatterplot of the residuals
versus the x-values should
be the most boring
scatterplot you’ve ever seen.
 It shouldn’t have any
interesting features like a
direction or shape. It should
stretch horizontally, with
about the same amount of
scatter throughout. It should
have no bends, and it should
have no outliers.
Residuals

 If the residuals show no interesting pattern when we plot them against x, we can look at how big they are.
 Remember, we are trying to make the residuals as small as possible.
 Since their mean is always zero, though, it’s only sensible to look at how much they vary.
 The standard deviation of the residuals, $s_e$, gives us a measure of how much the points spread around the regression line.
 For this to make sense, the residuals should share the same underlying spread, so we check that the residual plot has about the same scatter throughout.
The Residual Standard Deviation

 The Equal Variance Assumption is the assumption that the residuals have the same scatter throughout.
 The condition to check is the Does the Plot Thicken? Condition.
 We check to make sure the spread is about the same all along the line. We can check that either in the original scatterplot of y against x or in the scatterplot of residuals.
The Residual SD

 We estimate the
standard deviation of
the residuals in almost
the way you would
expect:
 $s_e = \sqrt{\dfrac{\sum e^2}{n-2}}$
 We do not subtract the mean because the mean of the residuals is $\bar{e} = 0$.
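The formula above can be sketched in a few lines of Python. The function name `residual_sd` and the example residuals are assumptions made for illustration; they are not from the slides.

```python
import math

def residual_sd(resids):
    """s_e = sqrt(sum(e^2) / (n - 2)).

    No mean is subtracted because least-squares residuals average to zero.
    """
    n = len(resids)
    if n <= 2:
        raise ValueError("need more than two residuals")
    return math.sqrt(sum(e * e for e in resids) / (n - 2))

# Made-up residuals that sum to zero, for illustration only.
example = [3.0, -4.0, 1.0, 0.0]
s_e = residual_sd(example)  # sqrt(26 / 2) = sqrt(13)
```

Dividing by n - 2 rather than n - 1 reflects the two quantities (slope and intercept) estimated from the data.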
The Residual SD

$s_e = \sqrt{\dfrac{\sum e^2}{n-2}}$
 For the BK foods, the
SD of the residuals is
9.2 grams of fat. That
looks about right in the
scatterplot of residuals.
The residual for the BK Broiler chicken was -11 grams, just over one SD.
The Residual SD

 It is a good idea to
make a histogram of
the residuals. If we see
a unimodal, symmetric
histogram then we can
apply the 68-95-99.7
rule to see how well the
regression model
describes the data.
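A small sketch of how the 68-95-99.7 rule might be applied to the BK numbers quoted on the slides ($s_e$ = 9.2 grams, BK Broiler residual of -11 grams). It assumes the residual histogram really is unimodal and symmetric; the variable names are illustrative.

```python
s_e = 9.2        # residual SD for the BK foods (grams of fat), from the slides
broiler = -11.0  # residual for the BK Broiler chicken sandwich (grams)

# Ranges expected to hold roughly 68%, 95%, and 99.7% of the residuals,
# assuming a unimodal, symmetric histogram of residuals.
bands = {pct: (-k * s_e, k * s_e) for k, pct in [(1, 68), (2, 95), (3, 99.7)]}

# The BK Broiler residual measured in SDs (about -1.2).
z = broiler / s_e
```

Under that assumption, about 95% of the residuals should fall within roughly 18.4 grams of zero, so the BK Broiler's residual is not unusual.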
$R^2$ - The Variation Accounted For

 The variation of the
residuals is key to
assessing how well the
model fits.
 Let’s look at them using
our favorite fast food
example
 The total Fat has a
standard deviation of
16.4 grams.
 The standard deviation
of the residuals is 9.2
grams.
 If the correlation were 1.0
and the model predicted
Fat values perfectly, the
residuals would all be
zero and have no
variation.
 On the other hand, if the
correlation were zero the
model would simply
predict 23.5 grams of Fat
(the mean) for all menu
items.

 The residuals from that
prediction would just be
the observed Fat values
minus their mean.
 These residuals would
have the same variability
as the original data
because, as we know, just
subtracting the mean
doesn’t change the
spread.
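The claim that subtracting the mean leaves the spread unchanged can be checked with a short Python sketch. The fat values here are made up for illustration and are not the Burger King data.

```python
import statistics

# Made-up fat values in grams, for illustration only.
fat = [12.0, 19.0, 23.0, 31.0, 40.0]

mean_fat = statistics.mean(fat)

# Residuals when the "model" predicts the mean for every item.
resid = [y - mean_fat for y in fat]

sd_fat = statistics.stdev(fat)
sd_resid = statistics.stdev(resid)  # same spread as the original data
```

The residuals are centered at zero, but their standard deviation matches that of the original values.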
$R^2$ - The Variation Accounted For

 How well does our BK
regression model do?
 We want to know how
much of the variation is
left in the residuals.
 Think of it as finding a
percent between 0% and
100% that is the fraction
of the variation left in the
residuals.
 The squared correlation, $r^2$, gives the fraction of the data’s variation accounted for by the model, and $1 - r^2$ is the fraction of the original variation left in the residuals.
$R^2$ - The Variation Accounted For
 For the Burger King model, $r^2 = 0.83^2 = 0.69$ and $1 - r^2 = 1 - 0.69 = 0.31$.
 So 31% of the variability in total Fat has been left in the residuals.
 According to our linear model, 69% of the variability in the fat content of Burger King sandwiches is accounted for by variation in protein content.
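The arithmetic can be sketched in Python using only the summary numbers quoted on the slides (r = 0.83, SD of Fat = 16.4 grams, SD of residuals = 9.2 grams). The cross-check at the end is an approximation, since it ignores the small difference between the n - 1 and n - 2 denominators.

```python
r = 0.83      # correlation between fat and protein for the BK items
s_fat = 16.4  # SD of total fat (grams)
s_e = 9.2     # SD of the residuals (grams)

r_squared = r ** 2         # fraction of variation accounted for, about 0.69
left_over = 1 - r_squared  # fraction left in the residuals, about 0.31

# Rough cross-check: residual variance as a share of the original variance.
variance_ratio = (s_e / s_fat) ** 2  # also about 0.31
```

The two routes to "about 31% left over" agree to two decimal places, which is what we would hope for if the slides' numbers are consistent.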


$R^2$

 The regression model
that predicts maximum
wind speed in
hurricanes based on the
storm’s central pressure
has $R^2$ = 77.3%.
 What does this say
about the regression
model?
 An “R-squared” of 77.3%
indicates that 77.3% of the
variation in maximum
wind speed can be
accounted for by the
hurricane’s central
pressure. Other factors,
such as temperature and
whether the storm is over
water or land, may
explain some of the
remaining variation.
Just Checking…

 House Price vs. Size
The $R^2$ value is reported as 59.5% and the standard deviation of the residuals is 53.79 (in thousands of dollars).
 A) What does the $R^2$ value say about the relationship between Price and Size?
 Answer:
Differences in the size
of houses account for
about 59.5% of the
variation in the price of
houses.
 B) Is the correlation of
Price and Size positive or
negative? How do you
know?
 Answer:
It’s positive. The correlation
and the slope have the same
sign.
 C) If we measure house Size in square meters instead of square feet, would $R^2$ change? Would the slope of the line change? Explain.
 Answer:
$R^2$ would not change, but the slope would. Slope depends on the units, but correlation does not.
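To see the answer in action, here is a minimal Python sketch that fits the same made-up house data twice, once with Size in square feet and once in square meters. The data values, the helper `fit_line`, and the conversion constant are all assumptions for illustration; none of them come from the slides.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope, intercept, and R^2 for predicting y from x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    return slope, my - slope * mx, r ** 2

# Made-up house data: size in square feet, price in thousands of dollars.
size_sqft = [1200.0, 1500.0, 1800.0, 2400.0, 3000.0]
price = [210.0, 240.0, 330.0, 360.0, 480.0]

SQFT_PER_SQM = 10.7639  # approximate conversion factor
size_sqm = [s / SQFT_PER_SQM for s in size_sqft]

slope_ft, _, r2_ft = fit_line(size_sqft, price)
slope_m, _, r2_m = fit_line(size_sqm, price)
```

Here `r2_ft` and `r2_m` agree, while `slope_m` is `slope_ft` multiplied by the conversion factor, because each square meter covers more area than a square foot.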
 D) You find that your house in Saratoga is worth $100,000 more than the regression model predicts. Should you be very surprised?
 Answer:
No. The standard deviation of the residuals is 53.79 thousand dollars. We shouldn’t be surprised by any residual smaller than 2 standard deviations, and a residual of $100,000 is less than 2 × $53,790 = $107,580.
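The same check, sketched in Python with the numbers from the slide. The "2 standard deviations" cutoff is the rule of thumb used in the answer above, and the variable names are illustrative.

```python
s_e = 53.79       # residual SD, in thousands of dollars
residual = 100.0  # the house is worth $100,000 more than predicted

z = residual / s_e       # residual measured in SDs, about 1.86
surprising = abs(z) > 2  # rule of thumb: beyond 2 SDs would be surprising
```

Since the residual is a little under 2 SDs, `surprising` comes out `False`.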
$R^2$

 All regression analyses include the statistic $R^2$.
 An $R^2$ of 0 means that none of the variance in the data is in the model; all of it is still in the residuals. It would be hard to imagine using that sort of model for anything.
 Aside: Is a correlation of 0.8 twice as strong as a correlation of 0.4?
 Not if you think in terms of $R^2$.
 A correlation of 0.4 means an $R^2$ of $0.4^2 = 16\%$.
 A correlation of 0.8 gives an $R^2$ of $0.8^2 = 64\%$.
 A correlation of 0.8 gives an $R^2$ four times as strong as a correlation of 0.4 and accounts for four times as much of the variability.
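A quick Python check of the comparison, as a sketch; the variable names are illustrative.

```python
r_low, r_high = 0.4, 0.8

r2_low = r_low ** 2    # about 0.16
r2_high = r_high ** 2  # about 0.64

# Doubling the correlation quadruples R^2.
ratio = r2_high / r2_low  # about 4
```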
How Big Should $R^2$ Be?

 $R^2$ is always between 0% and 100%, but what is a “good” $R^2$ value?
 There is no hard and fast rule on what the value of $R^2$ should be; it will always be context dependent.
 For example, data from scientific experiments often have $R^2$ values in the 80 to 90% range and even higher.
 Data from surveys can have $R^2$ values as low as 50%, 40%, or even lower.
$R^2$

 An $R^2$ of 100% would be a perfect fit with no scatter around the line.
 In this case, $s_e$ would be zero: all of the variance would be accounted for by the model and none would be left in the residuals.
 So, along with the slope and intercept for a regression, you should always report $R^2$ so readers can judge for themselves how successful the regression is at fitting the data.
 $R^2$ tells you whether the model is even worth thinking about.
Regression Assumptions
and Conditions

 Quantitative Variable
Condition
 Linearity Assumption
 Straight Enough
Condition
 Outlier Condition
 For the standard
deviation of the residuals
to summarize the scatter,
all the residuals should
share the same spread.
 Equal Variance
Assumption
 Does the Plot Thicken?
Condition
A Tale of Two Regressions

 Be Careful! Regression
lines do not always do
what you expect:
 Recall for BK
sandwiches,
$\widehat{fat} = 6.8 + 0.97\,protein$
 With this equation we
estimated that a
sandwich with 30
grams of protein would
have 35.9 grams of fat.
 Suppose, though, we
knew the fat content
and wanted to predict
the amount of protein.
Two Regressions

 It might seem natural to
solve the equation for
protein using simple
algebra:
 $\widehat{fat} = 6.8 + 0.97\,protein$
 $protein = \dfrac{1}{0.97}\left(\widehat{fat} - 6.8\right)$
 This isn’t even close to
correct.
 Why?
 Because our original model is $\hat{y} = b_0 + b_1 x$.
 This solves for PREDICTED fat based on actual protein.
 Going the other way is
trying to solve for
PREDICTED protein
based on actual fat.
Two Regressions

 If we want to predict
protein from fat, we
need to create the
model.
 The slope is $b_1 = \dfrac{(0.83)(14.0)}{16.4} = 0.709$ grams of protein per gram of fat.
 Our equation will turn out to be
 $\widehat{protein} = 0.55 + 0.709\,fat$
so we would predict that a sandwich with 35.9 grams of fat should have 26.0 grams of protein, not the 30 grams that we used in the first equation to get that prediction.
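The two regressions can be compared side by side in a short Python sketch. The coefficients and summary statistics are the ones quoted on the slides; the function names are hypothetical and chosen only for this illustration.

```python
r = 0.83                       # correlation between fat and protein
s_fat, s_protein = 16.4, 14.0  # SDs in grams, as used in the slope formula

def fat_hat(protein):
    """Predicted fat from actual protein (first regression)."""
    return 6.8 + 0.97 * protein

def protein_hat(fat):
    """Predicted protein from actual fat (second regression)."""
    return 0.55 + 0.709 * fat

def protein_by_algebra(fat):
    """Naive inversion of the first equation; not a regression prediction."""
    return (fat - 6.8) / 0.97

slope_protein_on_fat = r * s_protein / s_fat  # about 0.709

f = fat_hat(30)                # 35.9 grams of fat
p_reg = protein_hat(f)         # about 26.0 grams of protein
p_alg = protein_by_algebra(f)  # back to 30 grams, as algebra must give
```

One way to see why the two lines differ: the product of the two slopes, 0.97 × 0.709, is about 0.69, which is $r^2$ rather than 1. The lines would be algebraic inverses of each other only if the correlation were perfect.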
Pg 195, 16, 22, 23,25,37