MATH 2560 C F03
Elementary Statistics I
LECTURE 9: Least-Squares Regression Line
and Equation
1 Outline
⇒ least-squares regression line (LSRL);
⇒ equation of the LSRL;
⇒ interpreting the LSRL;
⇒ correlation and regression;
2 Least-Squares Regression Line
=⇒ Our first aim is:
we need a way to draw a regression line that does not depend on our guess
as to where the line should go. We want the one line that comes as close as possible to the data points.
=⇒ Our second aim is:
we want a regression line that makes the prediction errors as small as
possible:
error = observed value − predicted value → minimize!
Figure 2.13 illustrates the idea.
=⇒ The most common way to make "as small as possible" precise is the LEAST-SQUARES idea.
Least-Squares Regression Line
A least-squares regression line of y on x is the line that makes the sum
of the squares of the vertical distances of the data points from the line as small as
possible.
Below we have the least-squares idea expressed as a mathematical problem.
Least-Squares Idea as a Mathematical Problem
1. There are n observations on two variables x and y :
(x1, y1), (x2, y2), ..., (xn, yn);
2. The line y = a + bx through the scatterplot of these observations
predicts the value of y corresponding to xi as
ŷi = a + bxi;
3. The predicted response ŷi will not be exactly the same as the actually
observed response yi ;
4. The prediction error at the point xi is:
error = observed yi − predicted ŷi;
5. The method of Least-Squares chooses the line that
makes the sum of the squares of these errors as small as possible;
6. Mathematical problem: find the values of the intercept a and the slope b that
minimize the following expression:
Σ(error)² = Σ(yi − ŷi)² = Σ(yi − a − bxi)² → minimize.
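As a concrete illustration (a sketch added here, not part of the original lecture; the data values are made up), the following Python snippet evaluates this sum of squared errors and checks that the least-squares line beats a slightly perturbed line:

    import numpy as np

    # Hypothetical data: n = 5 observations on x and y (made-up values).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    def sse(a, b):
        # Sum of squared prediction errors for the line y-hat = a + b*x.
        return np.sum((y - a - b * x) ** 2)

    # np.polyfit performs a least-squares fit; degree 1 returns slope, intercept.
    b, a = np.polyfit(x, y, 1)

    print(sse(a, b))        # the minimum achievable sum of squares
    print(sse(a + 0.1, b))  # any other line has a larger sum
    print(sse(a, b + 0.1))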
Equation for the LSRL
Equation of the Least-Squares Regression Line
1. Suppose we have data on an explanatory variable x and a response variable y for n
individuals;
2. The means and standard deviations of the sample data are x̄ and sx for x
and ȳ and sy for y, and the correlation between x and y is r;
3. The equation of the least-squares regression line of y on x is:
ŷ = a + bx
with slope
b = r (sy/sx)
and intercept
a = ȳ − bx̄.
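A minimal Python sketch of this recipe (added for illustration; the function name is my own):

    import numpy as np

    def lsrl(x, y):
        # Least-squares regression line of y on x; returns (a, b).
        r = np.corrcoef(x, y)[0, 1]             # correlation between x and y
        b = r * y.std(ddof=1) / x.std(ddof=1)   # slope b = r * (sy/sx)
        a = y.mean() - b * x.mean()             # intercept a = y-bar - b * x-bar
        return a, b

The same (a, b) would come from np.polyfit(x, y, 1), since both compute the ordinary least-squares fit.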
Example 2.13. Mean height of Kalama children (Table 2.7).
We calculate means, standard deviations for x and y, correlation r, slope
b, intercept a and give the equation of the least-squares line in this case:
1. Mean and Standard Deviation for x (age, in months):
x̄ = 23.5 months,
sx = 3.606 months;
2. Mean and Standard Deviation for y (height, in cm):
ȳ = 79.85 cm,
sy = 2.302 cm;
3. Correlation:
r = 0.9944;
4. Slope:
b = r (sy/sx) = 0.9944 × (2.302/3.606) = 0.6348 cm per month;
5. Intercept:
a = ȳ − bx̄ = 79.85 − (0.6348)(23.5) = 64.932 cm;
6. The equation of the least-squares line is:
ŷ = 64.932 + 0.6348x.
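To check the arithmetic, a short sketch reproducing these numbers from the reported summary statistics (the raw data of Table 2.7 are not shown here):

    # Summary statistics reported above for the Kalama data
    x_bar, s_x = 23.5, 3.606    # age, in months
    y_bar, s_y = 79.85, 2.302   # height, in cm
    r = 0.9944

    b = r * s_y / s_x           # slope: 0.6348 cm per month
    a = y_bar - b * x_bar       # intercept: 64.932 cm
    print(f"y-hat = {a:.3f} + {b:.4f} x")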
3 Interpreting the Regression Line
Interpreting the Least-Squares Regression Line
1. Slope b = r (sy/sx): this says that along the regression line,
a change of one standard deviation in x corresponds to a change of r
standard deviations in y (see the sketch after this list);
(The change in the predicted response ŷ equals the change in x when r = 1
or r = −1.
Otherwise, when −1 < r < 1, the change in ŷ, measured in standard deviations, is less than the change in x.)
2. The least-squares regression line always passes through the point (x̄, ȳ);
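A numerical sketch of the slope interpretation in item 1, using the Kalama summary statistics from Example 2.13:

    r, s_x, s_y = 0.9944, 3.606, 2.302   # Kalama summary statistics
    b = r * s_y / s_x                    # slope of the LSRL
    # A one-standard-deviation step in x moves the prediction by b * s_x,
    # which simplifies to r * s_y: exactly r standard deviations in y.
    print(b * s_x)   # 2.2891...
    print(r * s_y)   # 2.2891... (identical)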
Figure 2.14 displays the basic regression output for the Kalama data from
a graphing calculator and two statistical software packages.
4 Correlation and Regression
⇒ Least-squares regression looks at the distances of the data points from the
line only in the y direction.
Example 2.14. Expanding the Universe (Figure 2.15).
Figure 2.15 is a scatterplot of data that played a central role in the discovery
that the universe is expanding. Here r = 0.7842; hence, the relationship
between the distances from Earth of 24 spiral galaxies and the speeds at
which they are moving away from us is positive and linear.
Important Remark: Although there is only one correlation between
velocity and distance, regression of velocity on distance and regression of
distance on velocity give different lines.
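A sketch of this remark on simulated data (an illustration added here; the galaxy data themselves are not reproduced):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)              # stand-in for "distance"
    y = 2.0 * x + rng.normal(size=100)    # stand-in for "velocity"

    r = np.corrcoef(x, y)[0, 1]                    # one correlation...
    b_y_on_x = r * y.std(ddof=1) / x.std(ddof=1)   # ...but two slopes:
    b_x_on_y = r * x.std(ddof=1) / y.std(ddof=1)
    # If both regressions described the same line, b_y_on_x would equal
    # 1 / b_x_on_y; they differ unless r = 1 or r = -1.
    print(b_y_on_x, 1 / b_x_on_y)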
=⇒ There is a close connection between correlation and regression:
Connection between Correlation and Regression:
⇒ the slope of the least-squares line involves r;
⇒ the square of the correlation, r², is the fraction
of the variation in the values of y that is explained by the least-squares
regression of y on x.
Relationship between r and r²
⇒ When you report a regression, give r² as a measure of how successfully
the regression explains the response.
All the software outputs in Figure 2.14 include r².
⇒ The use of r² to describe the success of regression in explaining the
response y is very common: it rests on the fact that there are two
sources of variation in the responses y in a regression setting.
Example: Kalama children.
One reason the Kalama heights vary is that height changes with age;
the second reason is that the heights do not lie exactly on the line, but are scattered above and below it.
⇒ We use r² to measure variation along the line as a fraction of
the total variation in the response variable.
For a pictorial grasp of what r² tells us, look at Figure 2.16.
Both scatterplots resemble the Kalama data, but with many more observations. The least-squares regression line is the same as we computed from
the Kalama data.
In Figure 2.16(a), r = 0.994 and r² = 0.989.
In Figure 2.16(b), r = 0.921 and r² = 0.849. There is more scatter about
the fitted line, and here r² is smaller than in Figure 2.16(a).
5 More Specific Interpretation of r²
⇒ The squared correlation gives us the variance of the predicted
responses as a fraction of the variance of the actual responses:
r² = (variance of predicted values ŷ) / (variance of observed values y).
This fact is always true.
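A quick numerical check of this identity on simulated data (illustration only):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 1.5 * x + rng.normal(size=200)    # made-up linear data

    r = np.corrcoef(x, y)[0, 1]
    b = r * y.std(ddof=1) / x.std(ddof=1)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x                     # predicted responses

    # The squared correlation equals var(y-hat) / var(y).
    print(r**2)
    print(y_hat.var(ddof=1) / y.var(ddof=1))   # the same value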
Final Important Remark: The connections with correlation are special
properties of least-squares regression. They are not true for other methods of
fitting a line to data.
6 Summary
1. A regression line is a straight line that describes how a response variable
y changes as an explanatory variable x changes.
2. The most common method of fitting a line to a scatterplot is least
squares. The least-squares regression line is the straight line ŷ = a + bx
that minimizes the sum of the squares of the vertical distances of the observed
y-values from the line.
3. A regression line is used to predict the value of y for any value of x by
substituting this x into the equation of the line.
Extrapolation beyond the range of x values spanned by the data is
risky (see the first sketch at the end of this summary).
4. The slope b of a regression line ŷ = a + bx is the rate at which the
predicted response ŷ changes along the line as the explanatory variable x
changes.
Specifically, b is the change in ŷ when x increases by 1.
5. The intercept a of a regression line ŷ = a + bx is the predicted
response ŷ when the explanatory variable x = 0.
This prediction is of no statistical use unless x can actually take values
near 0.
The least-squares regression line of y on x is the line with slope b = r (sy/sx) and
intercept a = ȳ − bx̄. This line always passes through the point (x̄, ȳ).
6. Remarks. Correlation and regression are closely connected. The
correlation r is the slope of the least-squares regression line when we measure
both x and y in standardized units (see the second sketch below).
The square of the correlation, r², is the fraction of the variance of one
variable that is explained by least-squares regression on the other variable.
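Two closing sketches (added for illustration, not part of the original lecture). First, prediction and the risk of extrapolation, using the Kalama line of Example 2.13 (the chosen ages are hypothetical):

    # Prediction with the Kalama line y-hat = 64.932 + 0.6348 x
    a, b = 64.932, 0.6348

    print(a + b * 25)    # age 25 months, near the observed ages: about 80.8 cm
    # Extrapolating far beyond the data gives nonsense:
    print(a + b * 240)   # age 240 months (20 years): about 217 cm

Second, a check that r is the least-squares slope after standardizing both variables (simulated data):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=150)
    y = 0.8 * x + rng.normal(size=150)    # made-up data

    # Standardize both variables, then fit by least squares.
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    slope = np.polyfit(zx, zy, 1)[0]      # least-squares slope

    print(slope)
    print(np.corrcoef(x, y)[0, 1])        # equal, up to rounding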