Linear Regression
Models for Data
Correlation says “There seems to be a linear
association between these two variables,” but it
doesn’t tell us what that association is.
We can say more about the linear relationship
between two quantitative variables with a linear
model. The linear model is just an equation
which best describes the linear relationship of
the two variables involved.
What Do Models Do?
A model simplifies reality to help us
understand underlying patterns and
relationships.
Models require numbers, called
parameters, to specify them. Choosing
values for the parameters helps us mold
the model to fit a particular situation.
This figure shows one variable plotted against
another; the points suggest a line that we could
use to predict values of one variable from the other.
The model that we use is of the form
ŷ = mx + b
The ŷ’s are called predicted values and are
predicted from the x’s, the predictors.
The value m is the slope:
– larger magnitude m’s indicate steeper slopes,
– negative slopes show a negative association,
– and a slope of zero gives a horizontal line.
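As a sketch, the slope and intercept of such a model can be computed with NumPy’s least squares fit (the data here is hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical paired data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# A degree-1 polyfit returns the least squares slope m and intercept b
m, b = np.polyfit(x, y, 1)

# Predicted values come from the model: y_hat = m*x + b
y_hat = m * x + b
```

For these made-up points the fitted slope comes out to about 1.96 and the intercept to about 0.14.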
How Big Can Predicted Values Get?
Since r cannot be bigger than 1 (in
absolute value), each predicted y tends to
be closer to its mean (in standard
deviations) than its corresponding x was.
This property of the linear model is called
regression to the mean; the line is called
the regression line.
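One way to see this shrinkage numerically: the least squares slope can be written as r·(s_y/s_x), so a point one standard deviation above the mean in x is predicted to be only r standard deviations above the mean in y. A sketch with made-up data:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r = np.corrcoef(x, y)[0, 1]      # correlation between x and y
sx, sy = np.std(x), np.std(y)    # standard deviations

# The least squares slope equals r * (sy / sx)
slope = r * sy / sx

# In standard units, an x that is 1 SD above its mean is predicted
# to be only r SDs above the mean of y: regression to the mean.
predicted_z = r * 1.0
```

Here r = 0.8, so the predicted y is pulled toward its mean relative to where the x value sits.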
Residuals
No model is perfect, so there will be
differences between observed and
predicted values.
The difference between the observed
value and its associated predicted value is
called the residual.
The residual at each data value tells us
how far off the model’s prediction is at that
point.
Residuals (cont.)
[Figure: a data point and the regression line. The observed y value for a particular x sits above or below the predicted y value from the line; the residual is the difference between these two y values for the same x value.]
Residuals (cont.)
The linear model assumes that the
relationship between the two variables is a
perfect straight line. The residuals are the
part of the data that hasn’t been modeled.
So, Data = Model + Residual or
(equivalently) Residual = Data – Model.
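This identity is easy to verify in code; residuals are just observed minus predicted values (illustrative data):

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, b = np.polyfit(x, y, 1)
y_hat = m * x + b            # Model
residuals = y - y_hat        # Residual = Data - Model

# Data = Model + Residual holds exactly at every point
assert np.allclose(y, y_hat + residuals)
```

A bonus property of least squares: when the model includes an intercept, the residuals always sum to zero.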
Residuals (cont.)
When we ask how well the model fits, we
are really asking how much of the data is
still in the residuals.
Residuals (cont.)
Residuals also help us to see whether the
model makes sense.
We determine this by plotting the residuals
in the hope of finding…nothing.
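“Nothing” can be made precise: for a least squares line the residuals have mean zero and are uncorrelated with x by construction, so any visible pattern in a residual plot is structure the line failed to capture. A sketch (made-up data):

```python
import numpy as np

# Made-up data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# Least squares forces both quantities below to be (essentially) zero;
# a residual plot of (x, residuals) should therefore show no trend.
mean_resid = residuals.mean()
corr_with_x = np.corrcoef(x, residuals)[0, 1]
```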
R²—The Variation Accounted For
The variation in the residuals is the key to
assessing how well the model fits.
If the correlation were -1 or +1, the model would
predict perfectly, and the residuals would all be
zero and have no variation. We can’t do any
better than that!
At the other extreme, if the correlation were 0,
the model would simply predict the mean y-value
for each x-value. These residuals would
have the same variation as the original data.
But, neither extreme is likely. So…
R²—The Variation Accounted For (cont.)
The squared correlation, r², gives the fraction of
the data’s variance accounted for by the model.
Thus, 1 – r² is the fraction of the original variance
left in the residuals.
All regression analyses include this statistic,
although by tradition, it is written R² (pronounced
“R-squared”). An R² of 0 means that none of the
variance in the data is in the model; all of it is still
in the residuals.
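The two descriptions of R² agree numerically: the squared correlation equals one minus the residuals’ share of the total variation. A check on made-up data:

```python
import numpy as np

# Made-up data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

ss_res = np.sum(residuals ** 2)          # variation left in the residuals
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y

r_squared = 1 - ss_res / ss_tot          # fraction accounted for by the model
# r_squared equals r ** 2, the squared correlation
```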
How Big Should R² Be?
R² is always between 0% and 100%. What
makes a “good” R2 value depends on the
kind of data you are analyzing and on
what you want to do with it.
The standard deviation of the residuals
can give us more information about the
usefulness of the regression by telling us
how much scatter there is around the line.
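The residual standard deviation can be sketched as follows; one common convention for a simple regression divides by n − 2, since two parameters were estimated (illustrative data):

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

n = len(x)
# se has the same units as y: the typical size of a prediction error
se = np.sqrt(np.sum(residuals ** 2) / (n - 2))
```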
Reporting R²
Along with the slope and intercept for a
regression, you should always report R² so that
readers can judge for themselves how
successful the regression is at fitting the data.
Statistics is about variation, and R² measures the
success of the regression model in terms of the
fraction of the variation of y accounted for by the
regression.
Least Squares
Our regression line has the special
property that the variation of its residuals
is the smallest it can be for any straight
line model for these data. No other line
has this property.
Speaking mathematically, we say that this
line minimizes the sum of the squared
residuals. This is why it is often called the
least squares regression line.
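This minimizing property can be checked numerically: perturbing the fitted slope or intercept in any direction never decreases the sum of squared residuals (made-up data):

```python
import numpy as np

# Made-up data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, b = np.polyfit(x, y, 1)

def ssr(slope, intercept):
    """Sum of squared residuals for a candidate line."""
    return np.sum((y - (slope * x + intercept)) ** 2)

best = ssr(m, b)
# No nearby line beats the least squares line
for dm in (-0.1, 0.0, 0.1):
    for db in (-0.1, 0.0, 0.1):
        assert ssr(m + dm, b + db) >= best
```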
Assumptions and Conditions
Linearity Assumption:
– The linear model assumes that the
relationship between the variables is linear.
– A scatterplot will let you check that the
assumption is reasonable.
– The straight enough condition is satisfied if
the scatterplot looks reasonably straight.
Assumptions and Conditions (cont.)
It’s a good idea to check linearity again
after computing the regression when we
can examine the residuals.
You should also check for outliers, which
could change the regression.
If the data seem to clump or cluster in the
scatterplot, that could be a sign of trouble
worth looking into further.
Assumptions and Conditions (cont.)
If the scatterplot is not straight enough,
stop here.
– You can’t use a linear model for any two
variables, even if they are related.
– They must have a linear association or the
model won’t mean a thing.
Some nonlinear relationships can be
saved by re-expressing the data to make
the scatterplot more linear.
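For example, exponential growth is curved on the original scale but straight after taking logarithms, so a linear model fits the re-expressed data. A sketch with synthetic data:

```python
import numpy as np

x = np.arange(1.0, 8.0)
y = 2.0 * 1.5 ** x       # synthetic data: exactly exponential growth

# The scatterplot of (x, y) bends upward, but (x, log y) is a straight
# line, so the linear model applies to the re-expressed response log y.
m, b = np.polyfit(x, np.log(y), 1)
# The fit recovers the structure: m = log(1.5), b = log(2.0)
```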
Reality Check:
Is the Regression Reasonable?
Statistics don’t come out of nowhere. They are
based on data.
– The results of a statistical analysis should reinforce
your common sense, not fly in its face.
– If the results are surprising, then either you’ve learned
something new about the world or your analysis is
wrong.
When you perform a regression, think about the
coefficients and ask yourself whether they make
sense.
What Can Go Wrong?
Don’t fit a straight line to a nonlinear
relationship.
Beware extraordinary points (y-values that stand
off from the linear pattern or extreme x-values).
Don’t extrapolate beyond the data—the linear
model may no longer hold outside of the range
of the data.
Don’t infer that x causes y just because there is
a good linear model for their relationship—
association is not causation.
Key Concepts
If two quantitative variables appear to be linearly
related, we can find the regression line that best
fits the data.
Once we find the regression equation, we can
make predictions for y based on x-values in the
range of the data.
R² gives the fraction of the variability of y
accounted for by the least squares linear
regression on x.
Examine residuals to check whether a linear
model is appropriate.