Chapters 8-9
Summarizing Data: Paired Quantitative Data
• regression line (or least-squares line)
a straight-line model for the relationship between the
explanatory (x) and response (y) variables, often used to
produce a prediction ŷ of the response variable y for a
given value of x (the small “hat” over the variable indicates
that the quantity is not a measured value but rather a
predicted value of the response variable); also, the line that
minimizes the sum of the squared deviations between
the data points and the model line, with equation
ŷ = b0 + b1x,
having slope b1 = r · (sy/sx) and y-intercept b0 = ȳ − b1x̄.
[TI83: STAT Calc LinReg(a+bx)]
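As a sketch of how the slope and intercept formulas fit together, here is a minimal plain-Python computation of b1 = r·(sy/sx) and b0 = ȳ − b1x̄ (the data values are made up for illustration):

```python
import math

# Made-up paired data: x = explanatory variable, y = response variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
x_bar = sum(x) / n                      # mean of x
y_bar = sum(y) / n                      # mean of y
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))   # sample st. dev. of x
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))   # sample st. dev. of y
r = sum((xi - x_bar) * (yi - y_bar)
        for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)        # correlation

b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # y-intercept

def y_hat(x_val):
    """Predicted response ŷ = b0 + b1·x for a given x."""
    return b0 + b1 * x_val
```

For these particular numbers the fitted line works out to ŷ = 0.14 + 1.96x, and y_hat(x_bar) reproduces ȳ, since the line passes through the point of means.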
Assumptions for using the linear regression model
• Quantitative Variables Condition
• Straight Enough Condition
• Outlier Condition
Analyzing Paired Quantitative Data: Using the least-squares line
• The least-squares regression line is determined by minimizing y-deviations between the observed data values
and the corresponding predicted values, so switching
explanatory and response variables will generate a different least-squares line.
• The least-squares line always passes through the point
of means (x̄, ȳ). That is, the predicted response for
the average value of the explanatory variable x̄ will
equal the average value of the response variable.
• An increase in the value of x by one standard deviation
sx corresponds to a change in ŷ of r times a standard
deviation sy . Thus, since r lies between −1 and +1,
predicted values of ŷ will lie closer to their mean value
ȳ than the corresponding x values are from their mean
value x̄. (We say that the predicted ŷ values regress
towards their mean. This is why the least-squares line
is also called the regression line.)
• coefficient of determination (r² or R²)
measures the percentage of total variation in y values
that is due to their linear association with their corresponding x values.
• residual (Resid)
the deviation y − ŷ between the measured value of the
response variable and its corresponding predicted value
on the regression line; the mean of the residuals always
equals 0.
• residual plot
a scatterplot of pairs (ŷ, Resid), used to evaluate whether
a linear model is appropriate: if it is, the residual plot
should show no patterns or trends
[TI83: StatPlot, use Ylist:RESID]
• residual standard deviation (se)
a measure of how far a typical point can lie above or
below the regression line, or the size of a typical residual:
se = √( Σ(y − ŷ)² / (n − 2) )
[TI83: STAT TESTS LinRegTTest, find s]
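The residual definitions above can be sketched in plain Python on made-up data, computing each residual y − ŷ, checking that their mean is 0, and evaluating se:

```python
import math

# Made-up paired data; fit the least-squares line, then examine the residuals.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.1, 2.8, 4.3, 4.9, 6.2]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))      # slope
b0 = y_bar - b1 * x_bar                          # y-intercept

# Residual = observed y minus predicted ŷ, one per data point.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

mean_resid = sum(residuals) / n   # always 0, up to rounding error

# Residual standard deviation: the size of a typical residual.
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
```

Plotting x (or ŷ) against these residuals would give the residual plot described above; for a suitable linear model the points scatter without pattern around 0.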
Analyzing Paired Quantitative Data: Linear Regression “wisdom”
• Residual plots are an indispensable tool for assessing
the suitability of the linear model; the data should be
homogeneous, that is, there should not be subgroups
of the data that differ from each other in some respect
(such subgroups are often recognizable in a residual plot).
• The Straight Enough Condition warns us to check that
the scatterplot be reasonably straight to ensure that the
linear model is appropriate; deviations from straightness
are often more easily noticed in a residual plot.
• Regression formulas are often used to extrapolate,
that is, to make predictions for y corresponding to x
values beyond the range of the measured data but based
on trends within the range of the data; all such predictions are suspect, and the further one extrapolates, the
more suspect the prediction!
• The Outlier Condition warns us to be on guard for outliers in the data, points with large deviations in x or y,
or both; such points can be influential, in the sense
that the size of the correlation (hence also the regression formula) can change dramatically when that outlier
is removed from the data set.
• A residual plot can also identify outliers having high
leverage, the tendency to singlehandedly change the
direction of the regression line by a noticeable amount;
treat them in the same way as influential points.
• Outliers in the data need not be “bad”, and should not
be dismissed out of hand or discarded only so as to
strengthen the association between the variables; they
should rather be explained: let the data honestly speak
for itself.
• A high correlation does not necessarily signify a causative
relationship. There may be a strong association between variables without there being a cause/effect relation between them, since both the explanatory and
response variables might be influenced by a third lurking variable that has not been measured.
• Correlations between paired data sets based on averaged data smooth out much of the natural variation in
raw measurements and naturally tend to be very high;
predictions in these cases may be unreliable when applied to individual cases.