Linear Regression - math-b

The Whopper

 One Double Whopper
with cheese provides 53
grams of protein – all
the protein you need in
a day.
 It also supplies 1020
calories and 65 grams
of fat. The Daily Value
(based on a 2000-calorie
diet) for fat is 65 grams.
 How are fat and protein
related on the entire BK
Menu?
 The scatterplot for Fat
(grams) vs. Protein
(grams) shows a
positive, moderately
strong, linear
relationship.
Whopper, Association

 So, if you want 25 grams
of protein in your lunch,
how much fat should you
expect to consume at
Burger King?
 The correlation between
fat and protein is 0.83, a
sign that the linear
association in the
scatterplot is fairly strong.
 However, strength of
the relationship is only
part of the picture.
 The correlation says,
“The linear association
between these two
variables is fairly
strong” but it doesn’t
tell us what the line is.
Let’s Say More

 Yes, the relationship is
strong, but let’s say
something more: we can
model the relationship
with a line and give its
equation.
 This equation will let us
predict the fat content for
any Burger King food,
given its amount of
protein.
 How is this like what
we do with a Normal
Model?
Linear Model

 A linear model is an
equation of a straight
line through the data.
 Of course, no line can
go through all the
points, but a linear
model can summarize
the general pattern
with only a couple of
parameters.
 Like all models of the
real world, the model
will be wrong – wrong
in the sense that it
cannot match reality
exactly. But it can help
us understand how
these variables are
associated.
Residuals

 We want to find a line
that goes through our
data that comes closer to
all the points than any
other line.
 It may turn out that this
line doesn’t even hit a
single point! But it does
minimize the error
between the line and each
data point.
 For one example, our line
might predict the BK
Broiled Chicken
Sandwich with 30 grams
of protein should have 36
grams of fat, when in fact,
it actually has only 25
grams of fat.
 We call the estimate made
from a model the
predicted value and write
it as ŷ (called “y-hat”) to distinguish it
from the true value, y.
Residuals

 The difference between
the observed value, y,
and the predicted
value, ŷ, is called the
residual.
 The BK Broiled Chicken
residual would be y −
ŷ = 25 − 36 = −11 g of
fat.
 The residual tells us
how far off the model’s
prediction is at that
point.
 To find residuals we
always subtract the
predicted value from the
observed one.
 A negative residual
means the predicted
value is too big – an
overestimate.
 A positive residual
shows that the model
makes an
underestimate.
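The sign convention above can be sketched in a few lines of Python. This is only an illustration: the function names `residual` and `describe` are made up here, and the numbers are the BK Broiled Chicken values quoted on the slide.

```python
def residual(observed, predicted):
    """Residual = observed value minus predicted value (y - y_hat)."""
    return observed - predicted


def describe(res):
    """Say what the sign of a residual tells us about the prediction."""
    if res < 0:
        return "overestimate"   # prediction was too big
    if res > 0:
        return "underestimate"  # prediction was too small
    return "exact"


# BK Broiled Chicken Sandwich, numbers taken from the slide.
observed_fat = 25   # grams of fat actually in the sandwich
predicted_fat = 36  # grams of fat the line predicts for 30 g of protein

chicken_residual = residual(observed_fat, predicted_fat)
print(chicken_residual, describe(chicken_residual))
```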
“Best Fit” Means Least-Squares
 When we draw our line
through our scatter plot,
some residuals are positive
and some are negative.

 We can’t assess how well the
line fits by adding up all the
residuals – the positive and
negative ones will cancel
each other out.
 This is the same issue we
faced when calculating
Standard Deviation.
 So what did we do?
Best Fit = Least Squares

 We are going to square
the residuals!!!!!!!!!!!!!!!!!
 (Emphasis Added)
 Squaring:
Makes all the
values positive
Emphasizes the
large residuals
 The line of best fit is
the line for which the
sum of the squared
residuals is smallest,
the least squares line.
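A small numerical sketch may help show why plain sums fail and squared sums work. The data below are hypothetical and chosen only to keep the arithmetic small. The closed-form slope and intercept are the standard least-squares formulas, which these slides have not derived at this point.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Add up (observed - predicted)^2 over every data point."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not from the Burger King menu.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Closed-form least-squares line (standard formula, assumed here).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Residuals of the least-squares line: positives and negatives cancel.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
best = sum_squared_residuals(xs, ys, intercept, slope)

# Nudge the line in several directions; no nudged line should do better.
nudged = [
    sum_squared_residuals(xs, ys, intercept + db, slope + dm)
    for db in (-0.5, 0.0, 0.5)
    for dm in (-0.2, 0.0, 0.2)
]

print("sum of residuals:", round(sum(residuals), 10))
print("sum of squared residuals:", round(best, 4))
print("smallest among nudged lines:", round(min(nudged), 4))
```

The plain sum of residuals is essentially zero for the least-squares line, so it says nothing about fit, while the sum of squares is smallest for the unnudged line.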
Finding That Line

 What we know about
correlation can lead us
to the equation of the
linear model.
 Let’s look specifically at
a scatterplot of
standardized variables.
Finding the Line

 Let’s start in the center of
the plot – how much protein
and fat does the typical
Burger King food item
provide?
 The typical protein
content is the mean, x̄. What
fat content would we predict for an
item with average protein content?
 The answer, as you might
guess, is about average: ȳ.
 Why is that the case?
 So…
Our best fit line must
go through the point
(x̄, ȳ). In the plot of z-scores,
then, the line
passes through the
origin (0, 0).
Finding the Line

 A normal linear
equation can be written
in the form y=mx+b
 If it passes through the
origin, b=0, so the line
can be expressed as
y=mx where m is the
slope of the line.
 Note that our coordinates
are not written (x, y)
because they are z-scores,
so our points are written
(z_x, z_y), and we need to
indicate that the point on
the line corresponding to a
particular z_x is z_y:

z_y = m z_x
Finding the Line

 Now, many lines pass
through the origin, but
which one fits our data
the best?
…
 That is, which slope
determines the line that
minimizes the sum of
squared residuals?
 It turns out that the slope
that minimizes our
squared residuals is r
itself!!!!!!!!!!!!!!!!!!!!!!!!!!!!
…
 Once again, emphasis
added.
Finding the Line

 Wow! The equation for
the line is about as
simple as we could ever
hope for:
z_y = r * z_x
 What does it tell us?
 It says that moving one
standard deviation
from the mean in x we
can expect to move r
standard deviations
away from the mean in
y.
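The claim that the best slope for standardized data is r itself can be checked numerically. The sketch below uses hypothetical data and a brute-force scan over candidate slopes. The helper names `z_scores` and `sse` are invented for this illustration, and the sample standard deviation (dividing by n − 1) is assumed.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data, not from the Burger King menu.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

zx, zy = z_scores(xs), z_scores(ys)

# Correlation as the (n - 1)-averaged product of z-scores.
r = sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


def sse(slope):
    """Sum of squared residuals for the line z_y = slope * z_x."""
    return sum((b - slope * a) ** 2 for a, b in zip(zx, zy))


# Try every slope from -1.000 to 1.000 in steps of 0.001.
candidates = [i / 1000 for i in range(-1000, 1001)]
best_slope = min(candidates, key=sse)

print("r =", round(r, 4), " best slope on the grid =", best_slope)
```

The grid search lands on the candidate slope closest to r, which is what the slide asserts.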
z_y = r * z_x

 Let’s get specific:
For the sandwiches, the
correlation is 0.83
If we standardize both
protein and fat we can
write:
z_Fat = 0.83 * z_Protein
 This model tells us that
for every standard
deviation above (or
below) the mean a
sandwich is in protein,
we’ll predict its fat
content is 0.83 standard
deviations above (or
below) the mean fat
content.
z_y = r * z_x

 A double hamburger has
31 grams of protein, about
1 SD above the mean.
 Putting 1.0 in for z_Protein
in the model gives a z_Fat
value of 0.83. If you trust
the model, you’d expect
the fat content to be about
0.83 fat SDs above the
mean fat level.
 Moving one standard
deviation away from the
mean in x moves our
estimate r standard
deviations away from the
mean in y.
 That is to say for our
example, you’d expect the
fat content to be about
0.83 fat SDs above the
mean fat level.
r = 0, 1, or -1

 For r = 0, there is no
linear relationship. The
line is horizontal, and
no matter how many
standard deviations
you move in x, the
predicted value for y
doesn’t change.
 On the other hand, if r
= 1.0 or -1.0, there’s a
perfect linear
association. In this case,
moving one SD in x
moves the prediction
exactly one SD in y (in the
opposite direction when r = -1.0).
How Big Can Predicted
Values Get?

 A new student is to join
the class and you have
to guess his height.
 A reasonable guess
would be to guess the
mean height of male
students in the class.
 Now assume you are told
he is 2 SDs above mean
height in centimeters,
how tall would you guess
he is in inches?
 Well, height in inches and
height in centimeters are
perfectly correlated, so
you would guess 2 SDs in
inches above the mean.
 Now assume you are told
his GPA is 2 SDs above
the mean. What would
you guess his height to
be?
 There is little to no
correlation between height
and GPA so we would
still guess the mean
height of male students.
 Now, assume you are told
he’s 2 SDs above the mean
in shoe size. Now what
would you guess his height
to be?
 There is a positive
correlation between shoe size
and height. The correlation
isn’t perfect, so our guess
would be less than the 2
SDs above the mean we guessed in the
height in centimeters example,
but it would certainly be
higher than the 0 SDs we
guessed from the GPA
example.
How Big Can Predicted
Values Get?

 The height example
provides a key insight
into a general rule:
 Each predicted y value
tends to be closer to its
mean (in Standard
Deviations) than its
corresponding x value
was.
 This property of the
linear model is called
regression to the mean
and the line is called
the regression line.
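The three height guesses can be lined up with the rule z_y = r * z_x. In this sketch the correlations 1 and 0 come from the examples above, while 0.7 for shoe size is purely an assumed value for illustration; the slides give no number for it.

```python
def predicted_z(r, z_x):
    """Predicted z-score of y for a point z_x SDs from the mean in x."""
    return r * z_x


z_x = 2.0  # the new student is 2 SDs above the mean in the predictor

examples = {
    "height in cm -> height in inches (r = 1)": 1.0,
    "GPA -> height (r taken as 0)": 0.0,
    "shoe size -> height (r = 0.7 is an assumed value)": 0.7,
}

for label, r in examples.items():
    print(f"{label}: guess {predicted_z(r, z_x):.1f} SDs above mean height")
```

Because r is never larger than 1 in size, the predicted z-score is never farther from 0 than z_x was, which is the regression-to-the-mean property.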
Just Checking…

 A scatterplot of house
Price (in thousands of
dollars) versus Size (in
thousands of square feet)
for houses sold recently in
Saratoga Springs, NY
shows a relationship that
is straight, with only
moderate scatter and no
outliers. The correlation
between house Price and
Size is 0.77
 You go to an open house and
find that the house is 1 SD
above the mean in size. What
would you expect about its
price?
 You read an ad for a house
priced 2 SD below the mean.
What would you guess about
its size?
 A friend tells you about a
house whose size in square
meters is 1.5 SD above the
mean. What would you guess
about its size in square feet?
Page 192, # 1, 3, 5, 13, 15
The Regression Line in
Real Units

 We don’t always think
in terms of z-scores,
and in fact, most real-world
scenarios will
require you to keep
thinking of things in
their original units
(though it is vital you
understand the
significance of their z-scores as well).
 How much fat would you
predict for a double
hamburger with 31 grams of
protein?
 The mean for protein is near
17 grams and the SD is 14,
so that item is 1 SD above
the mean.
 Since r = 0.83, we predict the
fat content will be 0.83 SD
above the mean fat content.
The Regression Line in
Real Units

 Mean fat content is 23.5
grams and the SD for
fat content is 16.4
grams, so we predict
the double hamburger
will have:
 23.5 + 0.83 * 16.4 = 37.11
grams of fat.
 We can always convert
both x and y to z-scores,
find the correlation, use
z_y = r * z_x and then
convert z_y back to its
original units so that we
can understand the
prediction.
 But can this be done more
simply?
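The z-score route just described (standardize, multiply by r, convert back) can be sketched as follows. The means, SDs, and correlation are the values quoted on these slides; the function name `predict_fat_via_z` is made up for this example.

```python
# Summary statistics quoted on the slides.
r = 0.83
mean_protein, sd_protein = 17, 14    # grams
mean_fat, sd_fat = 23.5, 16.4        # grams


def predict_fat_via_z(protein_grams):
    """Predict fat by going through z-scores and back to grams."""
    z_protein = (protein_grams - mean_protein) / sd_protein  # standardize x
    z_fat = r * z_protein                                    # z_y = r * z_x
    return mean_fat + z_fat * sd_fat                         # back to grams


double_hamburger = predict_fat_via_z(31)
print(round(double_hamburger, 2), "grams of fat")
```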
The Regression Line in
Real Units

 Let’s re-write the
equation of the line for
protein and fat to be back
in terms of the original
units:
 ŷ = b0 + b1 x
 b0 is the y-intercept, the
value of the line where it
crosses the y-axis, and b1
is the slope.
 We find the slope using a
formula developed on
pages 175-176 of your book:
b1 = r s_y / s_x = (0.83 * 16.4 g fat) / (14 g protein) = 0.97 grams of fat per gram of protein
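As a quick arithmetic check, the slope formula b1 = r * s_y / s_x can be evaluated directly with the values quoted on the slides. It comes out to about 0.97 grams of fat per gram of protein, the value used in the slides that follow.

```python
r = 0.83           # correlation between protein and fat
sd_fat = 16.4      # s_y, grams of fat
sd_protein = 14    # s_x, grams of protein

# Slope in real units: grams of fat per gram of protein.
b1 = r * sd_fat / sd_protein
print(round(b1, 2))
```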
 Next, how do we find the
y-intercept b0? Remember
that the line has to go
through the mean-mean
point, (x̄, ȳ).
 That is, the model predicts
ȳ to be the value that
corresponds to x̄.
 We can put the means into
the equation and write ȳ =
b0 + b1 x̄
 Solving ȳ = b0 + b1 x̄
for b0 gives us:
b0 = ȳ − b1 x̄
The Regression Line in
Real Units

 b0 = ȳ − b1 x̄
 For our Burger King example this comes out to be
b0 = 23.5 g fat − 0.97 (g fat / g protein) * 17.2 g protein = 6.8 g fat
Putting this back into the regression equation gives:
𝑓𝑎𝑡 = 6.8 + 0.97 ∗ 𝑃𝑟𝑜𝑡𝑒𝑖𝑛
The slope of 0.97 means that an additional gram of protein is
associated with an additional 0.97 grams of fat, on average.
Less formally, we might say that BK sandwiches pack about
0.97 grams of fat per gram of protein.
Keep in mind that for slope, units matter!
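Putting the pieces together, the intercept calculation and the fitted line can be sketched in Python. The means and slope are the values from the slides, while the function name `predict_fat` and its default arguments are illustrative choices, not anything the slides define.

```python
mean_fat = 23.5       # y-bar, grams of fat
mean_protein = 17.2   # x-bar, grams of protein
b1 = 0.97             # slope, grams of fat per gram of protein

# The line passes through (x-bar, y-bar), so b0 = y-bar - b1 * x-bar.
b0 = mean_fat - b1 * mean_protein   # 6.816, which the slides round to 6.8


def predict_fat(protein_grams, intercept=6.8, slope=0.97):
    """Predicted fat (grams) from protein (grams), using the rounded model."""
    return intercept + slope * protein_grams


print("intercept:", round(b0, 1))
print("fat predicted for 25 g of protein:", round(predict_fat(25), 2))
print("fat predicted for 30 g of protein:", round(predict_fat(30), 1))
```

With 30 grams of protein the rounded model predicts roughly 36 grams of fat, matching the BK Broiled Chicken prediction used in the residuals example.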
Slope and Units

 The units of slope are
always the units of y per
units of x
 Changing units doesn’t
change the correlation but
does change the standard
deviations. The slope
introduces the units into
the equation by
multiplying the
correlation by the ratio of
s_y to s_x.
 If children grow an
average of 3 inches per
year that is the same as
growing 0.21
millimeters per day.
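Both points on this slide can be checked with a short sketch: the growth-rate conversion, and the fact that rescaling y changes the slope but not the correlation. The ages and heights below are made-up numbers for illustration only, and the helper functions `correlation` and `slope` are written here with the sample SD rather than taken from any library.

```python
from statistics import mean, stdev

INCHES_TO_MM = 25.4
DAYS_PER_YEAR = 365

# 3 inches per year expressed as millimeters per day.
growth_mm_per_day = 3 * INCHES_TO_MM / DAYS_PER_YEAR


def correlation(xs, ys):
    """Sample correlation: (n - 1)-averaged product of z-scores."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def slope(xs, ys):
    """Regression slope b1 = r * s_y / s_x, in units of y per unit of x."""
    return correlation(xs, ys) * stdev(ys) / stdev(xs)


# Made-up data: ages in years, heights in inches.
ages_years = [2, 4, 6, 8, 10]
heights_in = [34, 40, 45, 50, 55]
heights_mm = [h * INCHES_TO_MM for h in heights_in]

r_in = correlation(ages_years, heights_in)
r_mm = correlation(ages_years, heights_mm)
slope_in = slope(ages_years, heights_in)   # inches per year
slope_mm = slope(ages_years, heights_mm)   # millimeters per year

print("mm per day:", round(growth_mm_per_day, 2))
print("r unchanged:", round(r_in, 6), round(r_mm, 6))
print("slopes:", round(slope_in, 3), "in/yr vs", round(slope_mm, 3), "mm/yr")
```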
The Intercept

 What is the significance
of the intercept of the
BK regression line, 6.8?
 This is the predicted value
of y when x is zero.
 So, for BK items, the model
predicts 6.8 grams of fat
even for an item that
contains no protein.
Note!

 When using a
regression model it is
vital that we check the
same conditions for
regressions as we did
for correlation:
 Quantitative Variable
Condition
 Straight Enough
Condition
 Outlier Condition