Chapter 8
Linear Regression
Objectives & Learning Goals
Understand Linear Regression (linear modeling):
Create and interpret a linear regression model
using technology, equations (when statistics are
provided) and when given computer output:
Discuss slope and y-intercept in context;
Check conditions by graphing and interpreting residuals;
Make predictions within the range of data using the created model.
Fast Food - Fat Versus Protein
r = 0.83; describe the association…
“There is a strong, positive linear association
between fat and protein. In general, as protein
increases, so does fat. We may be concerned
about these points as possible outliers…”
We can say even more with a model…
The Linear Model
The linear model is the equation of the “line of best fit.”
Models aren’t perfect. No matter how “best” our line
of best fit is, points will fall above and below the line.
Residuals are the vertical distances from the best fit line to each
data point – they’re the parts of the relationship between
the variables that our model can’t explain – or model
More on Residuals
To find the residual of a data point, subtract the predicted
value from the observed value:
$\text{residual} = y - \hat{y} = \text{actual} - \text{predicted}$
Residuals (cont.)
A negative residual means
the model’s prediction is too
large (the model overestimates).
A positive residual means
the predicted value is too
small (it underestimates).
In this example, the linear
model overestimates the fat
content of a sandwich (33g);
the actual value is 25g, so
the residual is negative 8g.
Residuals are key because our
technology uses them to
determine which line is “best.”
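The sandwich example above can be checked in a couple of lines. This is only a minimal sketch; the function name `residual` is an illustrative choice, not something from the slides.

```python
def residual(actual, predicted):
    """Residual = observed value minus the model's predicted value."""
    return actual - predicted

# Sandwich from the example: 25 g of fat observed, 33 g predicted.
e = residual(actual=25, predicted=33)
print(e)  # -8: negative, so the model overestimated
```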
“Best Fit” Means Least Squares
Some residuals are positive and others are negative; on
average these errors cancel each other out.
We can’t assess how well the line fits by adding up all
the residuals (we would just get zero!).
Solution?? Anyone?? Anyone?? Bueller??
Similar to what we did with standard deviation, we square the
residuals, then add up the squares. The smaller the sum, the
better the line.
The line of best fit is the line that minimizes the sum
of the squared residuals. We call this the least
squares regression line (LSRL).
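To see the least squares idea in action, here is a small sketch on made-up data (not the Burger King data). The helper names `fit_lsrl` and `sum_sq_resid` are assumptions for illustration. It shows that the residuals of the LSRL add up to about zero and that nudging the line makes the sum of squared residuals larger.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sum_sq_resid(xs, ys, b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data (not the Burger King data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_lsrl(xs, ys)
resids = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(round(sum(resids), 10))  # residuals sum to (about) zero

best = sum_sq_resid(xs, ys, b0, b1)
# Nudging the intercept or the slope makes the sum of squares larger.
print(best < sum_sq_resid(xs, ys, b0 + 0.5, b1))
print(best < sum_sq_resid(xs, ys, b0, b1 + 0.1))
```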
Conditions for Regression
We check the same conditions for regression
as we did for correlations. These are???
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition
The Regression Line in Real Units
In Algebra we learned that the equation of a
line can be written as: $y = mx + b$
In Statistics we use slightly different notation:
$\hat{y} = b_0 + b_1 x$
$b_0$ is the y-intercept
$b_1$ is the slope
We use $\hat{y}$ (“y-hat”) to emphasize that the points that
satisfy the equation are predictions made
using a model, not data.
Back to Burgers! An Example
The LSRL is shown for the
Burger King sandwich
data. The regression equation
To predict the fat content (!!“fat hat”!!) for a BK sandwich
with 30g of protein, PLUG AND CHUG into the formula (do it
now)!!
35.9 g
Conclude with a statement like: “According to our model,
we expect a sandwich with 30 grams of protein to have
about 36g of fat.”
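The plug-and-chug step can be sketched in code. Note the loud assumption: the regression equation itself did not survive in the text above, so the intercept 6.8 and slope 0.97 below are illustrative values picked because they reproduce the 35.9 g answer. Treat them as placeholders, not as the slide's stated coefficients.

```python
# ASSUMPTION: the intercept and slope below are not given in the text above;
# they are illustrative values that reproduce the slide's 35.9 g answer.
B0 = 6.8    # assumed intercept, grams of fat
B1 = 0.97   # assumed slope, grams of fat per gram of protein

def predict_fat(protein_g):
    """Predicted fat ("fat hat") in grams for a given protein content."""
    return B0 + B1 * protein_g

fat_hat = predict_fat(30)
print(round(fat_hat, 1))  # 35.9
```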
Class Example…
[Table: country (e.g., Czech Rep), Marijuana (%), Other Drugs (%)]
The table above shows the % of city teens who use
Marijuana recreationally and the % of teens who use
other, more dangerous drugs. Enter the data in your
calculator lists and let’s build and analyze a model!!
Correlation and the Line of Best Fit
This figure shows the
scatterplot of the z-scores for
fat and protein.
If a burger has an average
protein content, it should also
have an average fat content. So
in z-score world, the LSRL
passes through the origin: (0, 0).
This means that in the “real”
world the LSRL passes through: $(\bar{x}, \bar{y})$
In z-score world, r is the slope
of the regression line!
Moving one standard deviation away
from the mean of x moves us r
standard deviations away from the
mean in y.
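A quick way to convince yourself that r is the slope in z-score world is to standardize both variables and fit the line. The sketch below uses made-up numbers (not the Burger King data), and the helper names are illustrative assumptions.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Pearson correlation r, computed from z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

def ls_slope(xs, ys):
    """Slope of the least squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return num / sum((x - x_bar) ** 2 for x in xs)

# Made-up illustrative data (not the Burger King data).
protein = [10, 15, 20, 25, 30, 35]
fat = [12, 22, 20, 35, 30, 44]

r = correlation(protein, fat)
slope_z = ls_slope(z_scores(protein), z_scores(fat))
print(round(r, 4), round(slope_z, 4))  # the two numbers match
```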
How Big Can Predicted Values Get?
Because we move r standard deviations in the y
direction for every standard deviation in the x direction,
and -1 ≤ r ≤ +1, each y-hat is no farther from
its mean (in standard deviations) than its corresponding
x value is from its own mean.
This property of the linear model is called
regression to the mean (predictions using
the LSRL will tend toward the mean of y).
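Regression to the mean is easy to tabulate. This sketch uses r = 0.83 from the fat and protein example; the function name `predicted_z_y` is an illustrative assumption.

```python
R = 0.83  # correlation between protein and fat from the example

def predicted_z_y(z_x, r=R):
    """Predicted z-score of y for a point z_x SDs from the mean of x."""
    return r * z_x

for z_x in (-2, -1, 0, 1, 2):
    z_y_hat = predicted_z_y(z_x)
    # |z_y_hat| never exceeds |z_x| because |r| <= 1
    print(z_x, round(z_y_hat, 2))
```

For instance, a sandwich 2 SDs above the mean in protein is predicted to be only 1.66 SDs above the mean in fat.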
R²: Accounting for the Model’s Variation
The variation in the
residuals is the key to
assessing how well the
model fits.
In our BK sandwich
example, the standard
deviation of the
variable fat is 16.4 grams.
The standard deviation
of the residuals (the
errors after we’ve
applied the model) is
only 9.2 grams.
R²: The Variation Accounted For (cont.)
If the correlation were 1.0 and the
model predicted fat values
perfectly, all the residuals would
be zero (no variation).
As it is, the correlation is 0.83: not perfect, but pretty good!
The model’s residuals have less
variation than total fat alone. So
the model “explains” some of the
variability in fat by accounting for
protein content.
To build a model that accounts
for the remaining errors we
would use multiple regression
– beyond the scope of this
R²: The Variation Accounted For (cont.)
The correlation coefficient squared (r²) gives
the percentage of the variance in y
accounted for by the model.
For the BK model, r² = 0.83² = 0.69. This tells us
that 69% of the variability in fat can be
explained by the linear model on protein.
The remaining 31% of the variability in fat is left in
the residuals (“other” reasons).
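The numbers quoted in the slides fit together, which can be checked with a short sketch. It uses only values from the text (r = 0.83, SD of fat 16.4 g, SD of residuals 9.2 g); the variable names are illustrative. The comparison is approximate because the two standard deviations use slightly different divisors.

```python
r = 0.83      # correlation from the example
s_fat = 16.4  # SD of fat, in grams
s_e = 9.2     # SD of the residuals, in grams

r_squared = r ** 2
# Share of fat's variance left over in the residuals. This is approximate:
# the two SDs use slightly different divisors (n - 1 versus n - 2).
leftover = (s_e / s_fat) ** 2

print(round(r_squared, 2))     # 0.69
print(round(1 - leftover, 2))  # close to r squared
```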
How Big Should R² Be?
Well……(you guessed it!)……it depends!
R² is always between 0% and 100% (and it is
never larger than |r|; the two are equal only when r = 0, 1, or -1). What
makes a “good” R² value depends on the kind of
data you’re analyzing and what you want to do
with it.
Along with the slope and intercept for a
regression, you should always report R² so
that readers can judge for themselves how
successful the regression is at fitting the data.
AP Test Requirements
For the AP test, you are expected to
be able to find / analyze regression
equations THREE ways:
1) Using summary statistics & formulas;
2) Using technology (calculators) when
given data;
3) From computer-generated regression output.
The Regression Line in Real Units
Since the linear model integrates “z-score
world” (r), the line of best fit we use is a little
more complicated than the ones you’re used
to from your algebra courses…
We want our model to be useful in real units so
we have to back out of z-score world...
To find a linear model’s slope, we use:
$\hat{y} = b_0 + b_1 x$
$b_1 = r \frac{s_y}{s_x}$
To find the model’s y-intercept:
$b_0 = \bar{y} - b_1 \bar{x}$
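For anyone wondering where the slope and intercept formulas come from, here is one way to back out of z-score world. It starts from the fact that, in z-scores, the predicted value is r times the z-score of x. The symbols $s_x$ and $s_y$ (standard deviations of x and y) and $\hat{z}_y$ (predicted z-score of y) are notation introduced here for the sketch.

```latex
\begin{align*}
\hat{z}_y &= r\, z_x \\
\frac{\hat{y} - \bar{y}}{s_y} &= r\, \frac{x - \bar{x}}{s_x} \\
\hat{y} &= \underbrace{\left(\bar{y} - r\frac{s_y}{s_x}\,\bar{x}\right)}_{b_0}
          + \underbrace{r\frac{s_y}{s_x}}_{b_1}\, x
\end{align*}
```

Reading off the last line gives the slope $b_1 = r\, s_y / s_x$ and the intercept $b_0 = \bar{y} - b_1 \bar{x}$.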
Income v. Housing Costs Example
Some governmental organization is interested in
building a model to predict a person’s housing costs
based on the person’s income (using tax return
data). They capture a sample of data and find that
the mean income is $46,209 with a standard
deviation of $7,004. The mean housing cost for this
same sample is $324 with a standard deviation of
$119; r = 0.62.
Is a linear model appropriate?
Find the regression equation.
Explain what the slope and the y-intercept mean in context.
Compute and interpret r-squared.
The organization then decides to use this same data to predict a
person’s income based on their housing costs. Find the new regression
equation.
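One way to work this example from the summary statistics is sketched below, assuming the usual formulas (slope = r times SD of y over SD of x; intercept = mean of y minus slope times mean of x). The function name `line_from_summary` is an illustrative assumption.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (b0, b1) for predicting y from x using summary statistics."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

income_mean, income_sd = 46209, 7004
housing_mean, housing_sd = 324, 119
r = 0.62

# Housing cost predicted from income.
b0, b1 = line_from_summary(income_mean, income_sd, housing_mean, housing_sd, r)
print(round(b0, 2), round(b1, 5))

# Income predicted from housing cost: swap the roles of x and y.
# Note the new slope is not simply 1 / b1.
c0, c1 = line_from_summary(housing_mean, housing_sd, income_mean, income_sd, r)
print(round(c0, 2), round(c1, 2))

print(round(r ** 2, 4))  # R squared is the same in both directions
```

The first slope comes out near 0.0105, so each additional $1,000 of income goes with roughly $10.50 more in predicted housing cost. The intercept is negative, which has no practical meaning because an income of $0 is far outside the range of the data.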
Reading Computer Output – HP vs. MPG
Write the regression equation now…
$\widehat{mpg} = 38.4542 - 0.0918175\,(hp)$
More on Residuals
The linear model assumes the relationship
between the two variables is a straight line. The
residuals are the part of the data that hasn’t
been accounted for by the model.
Actual Data = Model + Residual
Residual = Actual – Prediction (AP)
In symbols:
$e = y - \hat{y}$
Residuals (cont.)
Residuals help us to see whether the
model is appropriate. When it is, there
should be nothing interesting left behind in
the residuals (just random error).
After we fit a regression model, we
ALWAYS make a scatterplot of the
residuals vs. x or y-hat (both show the same
pattern) hoping to find……….. well,
nothing interesting.
Residuals (cont.)
The residual plot for the BK sandwich
regression looks random (no pattern) – and
that’s good!
Residuals (cont.)
This residual plot shows a pattern, indicating
that our assumption of linearity may be
Residuals (cont.)
Another “bad” residual plot…
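What a “bad” residual plot looks like can be shown without a graph. The sketch below fits a line to made-up curved data (y is exactly x squared) and prints the sign of each residual in order of x. The data and the helper name `fit_lsrl` are assumptions for illustration.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Made-up curved data: y is exactly x squared, so a line is the wrong model.
xs = list(range(8))
ys = [x ** 2 for x in xs]

b0, b1 = fit_lsrl(xs, ys)
resids = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = "".join("+" if e > 0 else "-" for e in resids)
print(signs)  # ++----++
```

Positive at both ends and negative in the middle is a pattern, not random scatter, so the straight-line model is leaving something interesting behind.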
Residual Standard Deviation
The standard deviation of the residuals, $s_e$,
measures how spread out the points are around
the regression line. The equation is:
$s_e = \sqrt{\frac{\sum e^2}{n - 2}}$
Once again, we don’t actually calculate
this manually; our technology does it
for us!!!
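For the curious, here is roughly what the technology is doing. This is a sketch assuming the usual definition of the residual standard deviation (square root of the sum of squared residuals divided by n - 2), run on made-up data rather than the Burger King data. The function name `residual_sd` is illustrative.

```python
from math import sqrt
from statistics import mean

def residual_sd(xs, ys):
    """Standard deviation of the residuals: sqrt(sum(e^2) / (n - 2))."""
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (len(xs) - 2))

# Made-up illustrative data (not the Burger King data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s_e = residual_sd(xs, ys)
print(round(s_e, 3))  # 0.894
```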
The Residual Standard Deviation
Examine the residual plot to make sure the residuals have
about the same amount of scatter throughout (this
is called the Equal Variance Assumption). A Normal Probability
Plot or a Histogram of the residuals checks a different
condition: that the residuals are roughly Normal.
Reality Check: Does the Regression Make Sense?
When statistics are based on real data, the
results of a statistical analysis should reinforce
your common sense.
If the results are surprising, then you’ve either
learned something new about the world or your
analysis is incorrect.
Which do you think is more likely?
When you perform a regression, think about what
the coefficients mean and ask yourself whether
they make sense.
What Can Go Wrong?
Don’t fit a straight line to a nonlinear relationship.
Beware extraordinary points (y-values
that stand off from the linear pattern or
extreme x-values).
Don’t infer that x causes y just
because there is a good linear model
for their relationship - association is
not causation.
Don’t choose a model based on R² alone.
Get it??
What Have We Learned?
When the relationship between two quantitative
variables is “straight enough”, a linear model can
help summarize that relationship.
The regression line doesn’t pass through all points, but
it is the “best” line because it minimizes the sum of
squared residuals.
What Have We Learned?
Correlation tells us lots of things:
The slope of the line is based on the
correlation, adjusted for the units of x and y.
For each SD in x that we are away from x-bar,
we expect to be r SDs in y away from y-bar.
Since r is always between -1 and +1, each y-hat is no more SDs away from its mean than the
corresponding x was (regression to the mean).
R² tells us the percent of the variation in the response
accounted for by the regression model; the rest
is left in the residuals.