Chapter 8
Linear Regression
Objectives & Learning Goals

Understand Linear Regression (linear modeling):
• Create and interpret a linear regression model using technology, equations (when statistics are provided), and computer output;
• Discuss slope and y-intercept in context;
• Check conditions by graphing and interpreting residuals;
• Make predictions within the range of data using the created model.
Fast Food - Fat Versus Protein


r = 0.83; describe the association…
• “There is a strong, positive linear association between fat and protein. In general, as protein increases, so does fat. We may be concerned about these points as possible outliers…”
We can say even more with a model…
The Linear Model

The linear model is the equation of the “line of best fit.”
Residual


Models aren’t perfect. No matter how “best” our line of best fit is, points will fall above and below the line.
Residuals are the vertical distances from the best fit line to each data point. They’re the parts of the relationship between the variables that our model can’t explain: the model’s errors.
More on Residuals
Residual

To find the residual of a data point, subtract the predicted
value from the observed value:
$\text{residual} = y - \hat{y} = \text{actual} - \text{predicted}$
Residuals (cont.)

A negative residual means
the model’s prediction is too
large (the model
overestimates).

A positive residual means
the predicted value is too
small (it underestimates).

In this example, the linear
model overestimates the fat
content of a sandwich (33g);
the actual value is 25g, so
the residual is negative 8g.
Residuals are key because our
technology uses them to
determine which line is “best.”
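The sandwich example above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are made up for illustration, and the two numbers are the ones read off the slide.

```python
# Values from the slide: the model predicts 33 g of fat,
# but the sandwich actually has 25 g.
actual_fat = 25.0      # observed value, y
predicted_fat = 33.0   # model's prediction, y-hat

# residual = actual - predicted
residual = actual_fat - predicted_fat

# A negative residual means the model overestimated;
# a positive residual means it underestimated.
if residual < 0:
    verdict = "overestimate"
elif residual > 0:
    verdict = "underestimate"
else:
    verdict = "exact"

print(residual, verdict)
```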
“Best Fit” Means Least Squares



Some residuals are positive and others are negative; on average, these errors cancel each other out.
We can’t assess how well the line fits by adding up all the residuals (we would just get zero!).
Solution?? Anyone?? Anyone?? Bueller??


Similar to what we did with standard deviation, we square the
residuals, then add up the squares. The smaller the sum, the
better the line.
The line of best fit is the line that minimizes the sum
of the squared residuals. We call this the least
squares regression line (LSRL).
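To see both ideas at once (the residuals cancel, and the LSRL has the smallest sum of squares), here is a minimal sketch. The five data points are hypothetical, invented only for this illustration, and `sse` is a helper name chosen here.

```python
from statistics import mean

# Hypothetical (made-up) data, used only to illustrate the idea.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def sse(b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Least squares slope and intercept.
x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("sum of residuals:", round(sum(residuals), 10))  # cancels out
print("SSE for the LSRL:", sse(b0, b1))
print("SSE with a nudged intercept:", sse(b0 + 0.1, b1))
print("SSE with a nudged slope:", sse(b0, b1 + 0.1))
```

Nudging either the slope or the intercept away from the least squares values makes the sum of squared residuals larger, which is what "best" means here.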
Conditions for Regression

We check the same conditions for regression
as we did for correlations. These are???

Quantitative Variables Condition

Straight Enough Condition

Outlier Condition
The Regression Line in Real Units

In Algebra we learned that the equation of a line can be written as? $y = mx + b$

In Statistics we use slightly different notation:
$\hat{y} = b_0 + b_1 x$

$b_0$ is the y-intercept
$b_1$ is the slope
We use ŷ to emphasize that the points that satisfy the equation are predictions made using a model, not data.
Back to Burgers! An Example

The LSRL is shown for the Burger King sandwich data. The regression equation is:
To predict the fat content (!!“fat hat”!!) for a BK sandwich with 30g of protein, PLUG AND CHUG the formula (do it now)!!
35.9 g
Conclude with a statement like: “According to our model,
we expect a sandwich with 30 grams of protein to have
about 36g of fat.”
Class Example…
Country      Marijuana (%)   Other Drugs (%)
Czech Rep    22              4
Denmark      17              3
England      40              21
Finland      5               1
Ireland      37              16
Italy        19              8
Ireland      23              14
Norway       6               3
Portugal     7               3
Scotland     53              31
USA          34              24
The table above shows the % of city teens who use
Marijuana recreationally and the % of teens who use
other, more dangerous drugs. Enter the data in your
calculator lists and let’s build and analyze a model!!
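If you want to check your calculator work, the same model can be built in Python. This is a sketch rather than the required method: it uses only the standard library, the list names are made up for illustration, and the data are the eleven rows of the class table.

```python
from statistics import mean, stdev

# Data copied from the class table (percent of teens).
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]   # x, explanatory
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]    # y, response

n = len(marijuana)
x_bar, y_bar = mean(marijuana), mean(other_drugs)
s_x, s_y = stdev(marijuana), stdev(other_drugs)

# Correlation: sum of the products of z-scores, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(marijuana, other_drugs)) / (n - 1)

# Slope and intercept of the least squares regression line.
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

print(f"r = {r:.3f}")
print(f"other-hat = {b0:.3f} + {b1:.3f} * marijuana")
```

The correlation comes out at about 0.93 and the slope at about 0.6, so the model predicts roughly 0.6 percentage points more use of other drugs for each additional percentage point of marijuana use.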
Correlation and the Line of Best Fit

This figure shows the
scatterplot of the z-scores for
fat and protein.

If a burger has an average
protein content, it should also
have an average fat content. So
in z-score world, the LSRL
passes through the origin: (0, 0).
This means that in the “real” world the LSRL passes through: $(\bar{x}, \bar{y})$

In z-score world, r is the slope of the LSRL!
Moving one standard deviation away
from the mean of x moves us r
standard deviations away from the
mean in y.
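A quick numerical check of the z-score claim, as a sketch only. It reuses the teen drug-use percentages from the class example because the burger data are not listed in these notes; `z_scores` is a helper name invented here.

```python
from statistics import mean, stdev

# Teen drug-use percentages from the class example.
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]


def z_scores(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


zx = z_scores(marijuana)
zy = z_scores(other_drugs)
n = len(zx)

# Correlation from z-scores.
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Least squares line fitted directly to the z-scores.
zx_bar, zy_bar = mean(zx), mean(zy)
slope_z = (sum((a - zx_bar) * (b - zy_bar) for a, b in zip(zx, zy))
           / sum((a - zx_bar) ** 2 for a in zx))
intercept_z = zy_bar - slope_z * zx_bar

# Predicted z-score of y is r times the z-score of x.
predicted_zy = [r * a for a in zx]

print(f"r = {r:.4f}, slope in z-score world = {slope_z:.4f}")
print(f"intercept in z-score world = {intercept_z:.4f}")
```

The fitted slope matches r and the intercept is zero (up to rounding), so the line passes through the origin in z-score world.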
How Big Can Predicted Values Get?

Because we move r standard deviations in the y direction for every standard deviation in the x direction, and -1 ≤ r ≤ +1, each y-hat is no farther from its mean (measured in standard deviations) than its corresponding x value is from its mean.

This property of the linear model is called
regression to the mean (predictions using
the LSRL will tend toward the mean of y).
R²: Accounting for the Model’s Variation

The variation in the
residuals is the key to
assessing how well the
model fits.

In our BK sandwich
example, the standard
deviation of the
variable fat is 16.4
grams.

The standard deviation
of the residuals (the
errors after we’ve
applied the model) is
only 9.2 grams.
[Figure: variation in fat in the original data compared with the variation left in the residuals after the model]
R²: The Variation Accounted For (cont.)

If the correlation were 1.0 and the
model predicted fat values
perfectly, all the residuals would
be zero (no variation).

As it is, the correlation is 0.83: not perfect, but pretty good!


The model’s residuals have less
variation than total fat alone. So
the model “explains” some of the
variability in fat by accounting for
protein content.
(To build a model that accounts for the remaining errors we would use multiple regression, which is beyond the scope of this course.)
R²: The Variation Accounted For (cont.)

The correlation coefficient squared (r²) gives the fraction of the data’s variance accounted for by the model.

For the BK model, r² = 0.83² = 0.69. This tells us that 69% of the variability in fat can be explained by variation in protein content.

The remaining 31% of the variability in fat is left in the residuals (“other” reasons).
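The 69% figure can be cross-checked against the two standard deviations quoted for the BK data (16.4 g for fat, 9.2 g for the residuals). This is a rough sketch: the slide values are rounded and the two standard deviations use slightly different divisors, so the two answers should agree only approximately. The variable names are made up for illustration.

```python
# Values quoted on the slides for the BK sandwich data.
r = 0.83          # correlation between protein and fat
s_fat = 16.4      # standard deviation of fat (grams)
s_resid = 9.2     # standard deviation of the residuals (grams)

# R-squared straight from the correlation.
r_squared = r ** 2

# Fraction of variance left in the residuals, and its complement.
fraction_left = (s_resid / s_fat) ** 2
fraction_explained = 1 - fraction_left

print(f"r^2 = {r_squared:.3f}")
print(f"1 - (s_resid / s_fat)^2 = {fraction_explained:.3f}")
```

Both routes give a value close to 0.69, which is why comparing the spread of the residuals with the spread of the original data is a sensible way to think about R².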
How Big Should R² Be?


Well……(you guessed it!)……it depends!
R² is always between 0% and 100% (and it is never larger than |r|; the two are equal only when r = 0, 1, or -1). What makes a “good” R² value depends on the kind of data you’re analyzing and what you want to do with it.


Examples?
Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data.
AP Test Requirements

For the AP test, you are expected to
be able to find / analyze regression
equations THREE ways:
1) Using summary statistics & formulas;
2) Using technology (calculators) when
given data;
3) From computer-generated regression
output.
Worksheet!!!
The Regression Line in Real Units


Since the linear model integrates “z-score
world” (r), the line of best fit we use is a little
more complicated than the ones you’re used
to from your algebra courses…
We want our model to be useful in real units so
we have to back out of z-score world...

$\hat{y} = b_0 + b_1 x$

To find a linear model’s slope, we use:
$b_1 = r \frac{s_y}{s_x}$

To find the model’s y-intercept:
$b_0 = \bar{y} - b_1 \bar{x}$
Income vs. Housing Costs Example
A governmental organization is interested in building a model to predict a person’s housing costs based on the person’s income (using tax return data). It captures a sample of data and finds that the mean income is $46,209 with a standard deviation of $7,004. The mean housing cost for this same sample is $324 with a standard deviation of $119; r = 0.62.
• Is a linear model appropriate?
• Find the regression equation.
• Explain what the slope and the y-intercept mean in context.
• Compute and interpret r-squared.
• The organization then decides to use this same data to predict a person’s income based on their housing costs. Find the new regression equation.
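Once you have worked the example by hand, the arithmetic can be checked with a short script. This is a sketch using only the summary statistics given in the problem; `lsrl_from_summary` is a helper name invented for this illustration.

```python
# Summary statistics from the example.
mean_income, sd_income = 46209.0, 7004.0
mean_housing, sd_housing = 324.0, 119.0
r = 0.62


def lsrl_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (b0, b1) for y-hat = b0 + b1 * x from summary statistics."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Predict housing cost (y) from income (x).
b0, b1 = lsrl_from_summary(mean_income, sd_income,
                           mean_housing, sd_housing, r)
r_squared = r ** 2

# Swap the roles: predict income (y) from housing cost (x).
c0, c1 = lsrl_from_summary(mean_housing, sd_housing,
                           mean_income, sd_income, r)

print(f"housing-hat = {b0:.2f} + {b1:.5f} * income")
print(f"r^2 = {r_squared:.4f}")
print(f"income-hat = {c0:.2f} + {c1:.3f} * housing")
```

Notice that the second equation is not just the first one solved for income: swapping the variables means recomputing the slope as r times the ratio of standard deviations in the other order.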

Reading Computer Output – HP vs. MPG
Write the regression equation now…
$\widehat{mpg} = 38.4542 - 0.0918175\,(hp)$
More on Residuals

The linear model assumes the relationship
between the two variables is a straight line. The
residuals are the part of the data that hasn’t
been accounted for by the model.
Actual Data = Model + Residual
or
Residual = Actual – Prediction (AP)
In symbols:
$e = y - \hat{y}$
Residuals (cont.)

Residuals help us to see whether the
model is appropriate. When it is, there
should be nothing interesting left behind in
the residuals (just random error).

After we fit a regression model, we ALWAYS make a scatterplot of the residuals vs. x or y-hat (both plots show the same pattern) hoping to find……….. well, nothing interesting.
Residuals (cont.)

The residual plot for the BK sandwich
regression looks random (no pattern) – and
that’s good!
Residuals (cont.)

This residual plot shows a pattern, indicating
that our assumption of linearity may be
wrong.
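Here is a small sketch of what a "pattern" in the residuals looks like numerically. The data are hypothetical: five points that lie exactly on the curve y = x², chosen so that a straight line is clearly the wrong model.

```python
from statistics import mean

# Hypothetical curved data: y = x squared (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]

# Fit the least squares line anyway.
x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
signs = ["+" if e > 0 else "-" for e in residuals]

for xi, e, s in zip(x, residuals, signs):
    print(f"x = {xi:.0f}  residual = {e:+.1f}  {s}")
```

The residuals are positive at both ends and negative in the middle. That U shape is exactly the kind of leftover structure a residual plot is meant to reveal.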
Residuals (cont.)

Another “bad” residual plot…
Residual Standard Deviation

The standard deviation of the residuals, se,
measures how spread out the points are around
the regression line. The equation is:
$s_e = \sqrt{\frac{\sum e^2}{n - 2}}$
Once again, we don’t actually calculate this manually; our technology does it for us!!!
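For anyone curious about what the technology is doing, here is a minimal sketch of the calculation. It uses the teen drug-use percentages from the class example (the burger data are not listed in these notes), and the variable names are made up for illustration.

```python
from math import sqrt
from statistics import mean

# Teen drug-use percentages from the class example.
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]   # x
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]    # y

n = len(marijuana)
x_bar, y_bar = mean(marijuana), mean(other_drugs)

# Least squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(marijuana, other_drugs))
      / sum((x - x_bar) ** 2 for x in marijuana))
b0 = y_bar - b1 * x_bar

# Residuals: actual minus predicted.
residuals = [y - (b0 + b1 * x) for x, y in zip(marijuana, other_drugs)]

# Standard deviation of the residuals, dividing by n - 2.
s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"s_e = {s_e:.2f} percentage points")
```

The result is a little under 4 percentage points, which is the typical size of the model's prediction error for these eleven countries.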
The Residual Standard Deviation

Examine the residual plot to make sure the residuals have about the same amount of scatter throughout (this is called the Equal Variance Assumption). A histogram or Normal probability plot of the residuals checks something different: whether the residuals are roughly Normal.
Reality Check: Does the Regression Make Sense?

When statistics are based on real data, the
results of a statistical analysis should reinforce
your common sense.

If the results are surprising, then you’ve either
learned something new about the world or your
analysis is incorrect.


Which do you think is more likely?
When you perform a regression, think about what
the coefficients mean and ask yourself whether
they make sense.
What Can Go Wrong?
• Don’t fit a straight line to a nonlinear relationship.
• Beware extraordinary points (y-values that stand off from the linear pattern or extreme x-values).
• Don’t infer that x causes y just because there is a good linear model for their relationship: association is not causation.
• Don’t choose a model based on R² alone.
Get it??

http://xkcd.com/552/
What Have We Learned?

When the relationship between two quantitative
variables is “straight enough”, a linear model can
help summarize that relationship.

The regression line doesn’t pass through all points, but
it is the “best” line because it minimizes the sum of
squared residuals.
What Have We Learned?

Correlation tells us lots of things:
• The slope of the line is based on the correlation, adjusted for the units of x and y.
• For each SD in x that we are away from x-bar, we expect to be r SDs in y away from y-bar.
• Since r is always between –1 and +1, each ŷ is fewer SDs away from its mean than the corresponding x was, unless r = ±1 (regression to the mean).
• R² tells us the percent of the variation in the response accounted for by the regression model; the rest is error.