Looking at Data–Relationships
2.3 Least-Squares Regression
© 2012 W. H. Freeman and Company
Explanatory and response variables
A response variable measures or records an outcome of a study. An
explanatory variable explains changes in the response variable.
Typically, the explanatory or independent variable is plotted on the x
axis, and the response or dependent variable is plotted on the y axis.
[Figure: scatterplot titled "Blood Alcohol as a function of Number of Beers." The y axis shows the response (dependent) variable, blood alcohol content in mg/ml (0.00 to 0.20); the x axis shows the explanatory (independent) variable, number of beers (0 to 10).]
Correlation tells us about the strength (amount of scatter) and direction of the linear relationship between two quantitative variables.
In addition, we would like to have a numerical description of how both variables vary together. For instance, is one variable increasing faster than the other one? And we would like to make predictions based on that numerical description.
But which line best
describes our data?
The regression line
A regression line is a straight line that describes how a response
variable y changes as an explanatory variable x changes.
We often use a regression line to predict the value of y for a given
value of x.
In regression, the distinction between explanatory and response
variables is important.
The regression line
The least-squares regression line is the unique line such that the sum
of the squared vertical (y) distances between the data points and the
line is as small as possible.
The distances between the points and the line are squared so that all are positive values and can be properly added (as in Pythagoras' theorem).
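A minimal sketch of this idea in Python (the data points are assumed toy values, not from the slides): any change to the least-squares slope or intercept increases the sum of squared vertical distances.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # assumed toy data
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

def sse(b0, b1):
    # sum of squared vertical (y) distances from the points to the line
    return np.sum((y - (b0 + b1 * x)) ** 2)

b1, b0 = np.polyfit(x, y, deg=1)           # least-squares slope and intercept

print(f"least-squares line: SSE = {sse(b0, b1):.4f}")
print(f"steeper slope:      SSE = {sse(b0, b1 + 0.1):.4f}")   # larger
print(f"shifted intercept:  SSE = {sse(b0 + 0.1, b1):.4f}")   # larger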
Properties
The least-squares regression line can be shown to have this equation:
ŷ = b0 + b1x
where ŷ is the predicted y value ("y hat"), b1 is the slope, and b0 is the y-intercept.
How to:
First we calculate the slope of the line, b1. From statistics we already know:
b1 = r (sy / sx)
where r is the correlation, sy is the standard deviation of the response variable y, and sx is the standard deviation of the explanatory variable x.
Once we know b1, the slope, we can calculate b0, the y-intercept:
b0 = ȳ − b1x̄
where x̄ and ȳ are the sample means of the x and y variables.
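As a worked sketch of these two formulas (the arrays x and y below are assumed toy data, not from the slides):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # assumed toy data
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

r = np.corrcoef(x, y)[0, 1]        # correlation r
sx = np.std(x, ddof=1)             # sample standard deviation of x
sy = np.std(y, ddof=1)             # sample standard deviation of y

b1 = r * sy / sx                   # slope: b1 = r (sy / sx)
b0 = y.mean() - b1 * x.mean()      # intercept: b0 = ȳ − b1·x̄

print(f"ŷ = {b0:.4f} + {b1:.4f} x")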
The equation completely describes the regression line.
To plot the regression line you only need to plug two x values into the equation, get the two ŷ values, and draw the line that goes through those two points.
Hint: the regression line always passes through the point (x̄, ȳ).
The points you use for drawing the regression line are derived from the equation. They are NOT points from your sample data (except by pure coincidence).
The distinction between explanatory and response variables is crucial in
regression. If you exchange y for x in calculating the regression line, you
will get the wrong line.
Regression examines the distance of all points from the line in the y
direction only.
Hubble telescope data about galaxies moving away from Earth: these two lines are the two regression lines, calculated either correctly (x = distance, y = velocity; solid line) or incorrectly (x = velocity, y = distance; dotted line).
Making predictions
The equation of the least-squares regression allows you to predict y
for any x within the range studied.
ŷ = 0.0144x + 0.0008
Nobody in the study drank 6.5
beers, but by finding the value
of ŷ from the regression line for
x = 6.5 we would expect a
blood alcohol content of 0.094
mg/ml.
ŷ = 0.0144 × 6.5 + 0.0008
ŷ = 0.0936 + 0.0008 = 0.0944 mg/ml
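The same arithmetic as a quick check (coefficients taken from the regression equation above):

b0, b1 = 0.0008, 0.0144        # coefficients from the fitted line above
y_hat = b0 + b1 * 6.5          # predicted blood alcohol content for x = 6.5 beers
print(f"{y_hat:.4f} mg/ml")    # 0.0944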
Inference for Regression
10.1 Simple Linear
Regression
© 2012 W.H. Freeman and Company
ŷ = 0.125x − 41.4
The data in a scatterplot are a random
sample from a population that may
exhibit a linear relationship between x
and y. A different sample would give a different plot.
Now we want to describe the population mean response µy as a function of the explanatory variable x: µy = β0 + β1x.
And to assess whether the observed relationship
is statistically significant (not entirely explained
by chance events due to random sampling).
Statistical model for linear regression
In the population, the linear regression equation is µy = β0 + β1x.
Sample data then fit the model:
Data = fit + residual
yi = (β0 + β1xi) + (εi)
where the εi are independent and Normally distributed N(0, σ).
Linear regression assumes equal variance of y (σ is the same for all values of x).
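A minimal simulation of this model may make it concrete; the values of β0, β1, and σ below are assumed for illustration, not estimates from any data in the slides.

import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0        # assumed population parameters
x = np.linspace(0, 10, 30)
eps = rng.normal(0.0, sigma, size=x.size)  # εi independent, N(0, σ), same σ for every x
y = beta0 + beta1 * x + eps                # data = fit + residual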
Estimating the parameters
µy = β0 + β1x
The intercept β0, the slope β1, and the standard deviation σ of y are the
unknown parameters of the regression model. We rely on the random
sample data to provide unbiased estimates of these parameters.
The value of ŷ from the least-squares regression line is really a prediction
of the mean value of y (µy) for a given value of x.
The least-squares regression line (ŷ = b0 + b1x) obtained from sample data
is the best estimate of the true population regression line (µy = β0 + β1x).
ŷ unbiased estimate for mean response µy
b0 unbiased estimate for intercept β0
b1 unbiased estimate for slope β1
The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the εi around the mean µy.
The regression standard error, s, for n sample data points is
calculated from the residuals (yi – ŷi):
s = √( Σ residual² / (n − 2) ) = √( Σ(yi − ŷi)² / (n − 2) )
s estimates the regression standard deviation σ (more precisely, s² is an unbiased estimate of σ²).
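A short sketch of this computation (x and y are assumed sample arrays; the fit comes from numpy's least-squares polyfit):

import numpy as np

def regression_standard_error(x, y):
    b1, b0 = np.polyfit(x, y, deg=1)       # least-squares fit
    residuals = y - (b0 + b1 * x)          # yi − ŷi
    n = len(x)
    return np.sqrt(np.sum(residuals ** 2) / (n - 2))   # s, with n − 2 degrees of freedom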
Conditions for inference
The observations are independent.
The relationship is indeed linear.
The standard deviation of y, σ, is the same for all values of x.
The response y varies normally around its mean.
Using residual plots to check for regression validity
The residuals (y−ŷ) give useful information about the contribution of
individual data points to the overall pattern of scatter.
We view the residuals in
a residual plot:
If residuals are scattered randomly around 0 with uniform variation, it
indicates that the data fit a linear model, have normally distributed
residuals for each value of x, and constant standard deviation σ.
Residuals are randomly scattered → good!
Curved pattern → the relationship is not linear.
Change in variability across the plot → σ is not equal for all values of x.
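A sketch of such a residual plot with matplotlib, on simulated data (all parameter values below are assumed):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)  # assumed linear toy data

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")   # residuals should scatter randomly around 0
plt.xlabel("x")
plt.ylabel("residual (y − ŷ)")
plt.show()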
What is the relationship between
the average speed a car is
driven and its fuel efficiency?
We plot fuel efficiency (in miles
per gallon, MPG) against average
speed (in miles per hour, MPH)
for a random sample of 60 cars.
The relationship is curved.
When speed is log transformed
(log of miles per hour, LOGMPH)
the new scatterplot shows a
positive, linear relationship.
Residual plot:
The spread of the residuals is
reasonably random—no clear pattern.
The relationship is indeed linear.
Normal quantile plot for residuals:
The plot is fairly straight, supporting
the assumption of normally distributed
residuals.
Data okay for inference.
Confidence interval for regression parameters
Estimating the regression parameters β0, β1 is a case of one-sample
inference with unknown population variance.
We rely on the t distribution, with n – 2 degrees of freedom.
A level C confidence interval for the slope, β1, has a margin of error proportional to the standard error of the least-squares slope:
b1 ± t* SEb1
A level C confidence interval for the intercept, β0, has a margin of error proportional to the standard error of the least-squares intercept:
b0 ± t* SEb0
t* is the t critical value for the t (n – 2) distribution with area C between –t* and +t*.
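A sketch of the interval for the slope; the formula SEb1 = s / √Σ(xi − x̄)² is the standard one (it is not shown on these slides), and scipy supplies the t critical value. The arrays x and y are assumed numpy sample arrays.

import numpy as np
from scipy import stats

def slope_confidence_interval(x, y, C=0.95):
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))  # regression standard error
    se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))         # standard error of the slope
    t_star = stats.t.ppf((1 + C) / 2, df=n - 2)              # t* for the t(n − 2) distribution
    return b1 - t_star * se_b1, b1 + t_star * se_b1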
Significance test for the slope
We can test the hypothesis H0: β1 = 0 versus a 1 or 2 sided alternative.
We calculate t = b1 / SEb1, which has the t(n − 2) distribution, to find the p-value of the test.
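The test, sketched under the same assumptions as the interval above; for comparison, scipy.stats.linregress reports the same two-sided p-value directly.

import numpy as np
from scipy import stats

def slope_t_test(x, y):
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))
    t = b1 / se_b1                             # t statistic for H0: β1 = 0
    p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided p-value from t(n − 2)
    return t, p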
Testing the hypothesis of no relationship
We may look for evidence of a significant relationship between
variables x and y in the population from which our data were drawn.
For that, we can test the hypothesis that the regression slope parameter β1 is equal to zero:
H0: β1 = 0 vs. Ha: β1 ≠ 0
Because the slope is b1 = r (sy / sx), testing H0: β1 = 0 also allows us to test the hypothesis of no correlation between x and y in the population.