Least Squares Regression
Fitting a Line to Bivariate Data
Correlation tells us about
strength (scatter) and direction
of the linear relationship
between two quantitative
variables.
In addition, we would like to have a numerical description of how both
variables vary together. For instance, is one variable increasing faster
than the other one? And we would like to make predictions based on that
numerical description.
But which line best
describes our data?
The regression line
The least-squares regression line is the unique line such that the sum
of the squared vertical (y) distances between the data points and the
line is the smallest possible.
Distances between the points and the line are squared so all are positive
values. This is done so that distances can be properly added
(as in the Pythagorean theorem).
Properties
The least-squares regression line can be shown to have this equation:
ŷ  b0  b1 x
where
b1  r
sy
sx
b0  y  b1 x
yˆ
is the predicted y value (y hat)
b1 is the slope
b0 is the y-intercept

“b0" is in units of y
"b1" is in units of y / units of x
How to:
First we calculate the slope of the line, b1;
from statistics we already know:

b1 = r · (sy / sx)

r is the correlation.
sy is the standard deviation of the response variable y.
sx is the standard deviation of the explanatory variable x.

Once we know b1, the slope, we can calculate b0, the y-intercept:

b0 = ȳ − b1x̄

where x̄ and ȳ are the sample means of the x and y variables.
This means that we don't have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on these equations.
But typically, we use a 2-var stats calculator or stats software.
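For readers working in Python rather than on a calculator, here is a minimal sketch (using numpy, with a small made-up dataset) of the same recipe: compute b1 from r, sy, and sx, then b0 from the means, and cross-check against numpy's own least-squares fit.

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

r = np.corrcoef(x, y)[0, 1]     # correlation
sx = x.std(ddof=1)              # sample standard deviation of x
sy = y.std(ddof=1)              # sample standard deviation of y

b1 = r * sy / sx                # slope:     b1 = r * (sy / sx)
b0 = y.mean() - b1 * x.mean()   # intercept: b0 = ybar - b1 * xbar
print(f"yhat = {b0:.3f} + {b1:.3f} x")

# Cross-check against numpy's least-squares fit (same line, by theory)
slope, intercept = np.polyfit(x, y, deg=1)
assert np.allclose([b1, b0], [slope, intercept])
```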
BEWARE!!!
Not all calculators and software use the same convention. Some use

ŷ = a + bx

while others use instead

ŷ = ax + b

Make sure you know what YOUR calculator gives you for a and b before
you answer homework or exam questions.
Software output
[Screenshots of regression output from two software packages, annotated to show where to read the intercept, the slope, and R² (whose square root gives the absolute value of r).]
The equation completely describes the regression line.
NOTE: The regression line always passes through the point with
coordinates (x̄, ȳ).
The distinction between explanatory and response variables is crucial in
regression. If you exchange y for x in calculating the regression line, you
will get the wrong line. Recall that b1 = r · (sy / sx).
Regression examines the distance of all points from the line in the y
direction only.
Hubble telescope data about
galaxies moving away from earth:
These two lines are the two
regression lines calculated either
correctly (x = distance, y = velocity,
solid line) or incorrectly (x =
velocity, y = distance, dotted line).
Correlation versus regression
The correlation is a measure of spread (scatter) in both the x and y
directions in the linear relationship. In regression we examine the
variation in the response variable (y) given change in the explanatory
variable (x).
Making predictions: interpolation
The equation of the least-squares regression allows us to predict y for
any x within the range studied. This is called interpolating.

ŷ = 0.0144x + 0.0008
Nobody in the study drank 6.5
beers, but by finding the value
of ŷ from the regression line for
x = 6.5 we would expect a
blood alcohol content of 0.094
mg/ml.
yˆ  0.0144 * 6.5  0.0008
yˆ  0.936  0.0008  0.0944 mg/ml
Year   Powerboats (in 1000's)   Dead Manatees
1977   447                      13
1978   460                      21
1979   481                      24
1980   498                      16
1981   513                      24
1982   512                      20
1983   526                      15
1984   559                      34
1985   585                      33
1986   614                      33
1987   645                      39
1988   675                      43
1989   711                      50
1990   719                      47

[Scatterplot of dead manatees vs. powerboat registrations, with fitted line ŷ = 0.125x − 41.4]
There is a positive linear relationship between the number of powerboats
registered and the number of manatee deaths.
The least squares regression line has the equation:
yˆ  0.125 x  41 .4
Thus if we were to limit the number of powerboat registrations to 500,000, what
could we expect for the number of manatee deaths?
yˆ  0.125(500)  41.4  yˆ  62.5  41.4  21.1
Roughly 21 manatees.
Extrapolation
Extrapolation is the use of a
regression line for predictions
outside the range of x values
used to obtain the line.
This can be a very stupid thing
to do, as seen here.
[Plot showing how far an extrapolated prediction can stray from the actual pattern.]
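A small hedged sketch of this caution in Python: a helper function (hypothetical, not from the slides) that refuses to predict outside the observed x range.

```python
import numpy as np

def predict(x_new, x, y):
    """Predict y at x_new from a least-squares fit, refusing to extrapolate."""
    if not (x.min() <= x_new <= x.max()):
        raise ValueError(
            f"x = {x_new} is outside the observed range "
            f"[{x.min()}, {x.max()}]; extrapolation is unreliable."
        )
    b1, b0 = np.polyfit(x, y, deg=1)  # slope and intercept
    return b0 + b1 * x_new
```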
The y intercept
Sometimes the y-intercept is not a realistic possibility. Here we have
negative blood alcohol content, which makes no sense…
But the negative value is
appropriate for the equation
of the regression line.
There is a lot of scatter in the
data, and the line is just an
estimate.
[Plot annotation: the y intercept of the fitted line falls at a negative blood alcohol value.]
R-squared = r2; the proportion of y-variation
explained by changes in x.
r2, the coefficient of determination, is the square of the correlation
coefficient.
r2 represents the proportion of
the variation in y (vertical scatter
from the regression line) that can
be explained by changes in x.
b1  r
sy
sx
r = −1, r² = 1
Changes in x explain 100% of the variations in y; y can be entirely
predicted for any given value of x.

r = 0, r² = 0
Changes in x explain 0% of the variations in y; the value(s) y takes
is (are) entirely independent of what value x takes.

r = 0.87, r² = 0.76
Changes in x explain 76% of the variations in y.
Here the change in x only
explains 76% of the change in
y. The rest of the change in y
(the vertical scatter, shown as
red arrows) must be explained
by something other than x.
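As a quick numerical illustration (a Python sketch with hypothetical data), r² computed as the squared correlation agrees with 1 minus the ratio of residual variation to total variation in y:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.2, 2.9, 2.7, 4.8, 5.1, 6.9, 6.4])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ss_resid = np.sum((y - y_hat) ** 2)      # vertical scatter around the line
ss_total = np.sum((y - y.mean()) ** 2)   # total variation in y

r = np.corrcoef(x, y)[0, 1]
print(r**2, 1 - ss_resid / ss_total)     # the two values agree
```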
Example: SAT scores
[Scatterplot: "SAT Mean per State vs % Seniors Taking Test". Mean SAT Score (roughly 820 to 1120) against % of Seniors Taking Test (0 to 80), with fitted line y = −2.2375x + 1023.4 and R² = 0.7542.]
SAT scores: calculations
x  33.882 sx  24.103 y  947.549 s y  62.1 r  .868
br
sy
sx
, a  y  bx
62.1
slope b  .868
 2.23635
24.103
intercept a  947.549  (2.236)33.882  1023.309
least squares prediction line yˆ  1023.309  2.236 x
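The same arithmetic in Python, taken straight from the summary statistics on this slide:

```python
r, sx, sy = -0.868, 24.103, 62.1
xbar, ybar = 33.882, 947.549

b = r * sy / sx       # slope: approximately -2.236
a = ybar - b * xbar   # intercept: approximately 1023.31
print(f"yhat = {a:.3f} + ({b:.3f})x")
```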
SAT scores: result
r² = (−.868)² = .7534

[Same scatterplot as above, with fitted line y = −2.2375x + 1023.4 and R² = 0.7542.]

About 75% of the variation in state mean SAT scores is explained by
differences in the % of seniors that take the test.
If 57% of NC seniors take the SAT, the predicted mean score is
yˆ  1023.309  2.23635(57)  895.84
r = 0.7, r² = 0.49
There is a great deal of variation in BAC for the
same number of beers drunk. A person’s blood
volume is a factor in the equation that was
overlooked here.
We changed number
of beers to number of
beers/weight of
person in lb.
r = 0.9, r² = 0.81
In the first plot, number of beers only explains
49% of the variation in blood alcohol content.
But number of beers / weight explains 81% of
the variation in blood alcohol content.
Additional factors contribute to variations in
BAC among individuals (like maybe some
genetic ability to process alcohol).
Grade performance
If class attendance explains 16% of the variation in grades, what is
the correlation between percent of classes attended and grade?
1. We need to make an assumption: attendance and grades are
positively correlated. So r will be positive too.
2. r2 = 0.16, so
r = +√0.16 = + 0.4
A weak correlation.
Transforming relationships
A scatterplot might show a clear relationship between two quantitative
variables, but issues of influential points or nonlinearity prevent us from
using correlation and regression tools.
Transforming the data – changing the scale in which one or both of the
variables are expressed – can make the shape of the relationship linear
in some cases.
Example: Patterns of growth are often exponential, at least in their initial
phase. Changing the response variable y into log(y) or ln(y) will transform
the pattern from an upward-curved exponential to a straight line.
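A minimal Python sketch of this transformation, using simulated exponential growth (hypothetical data): the raw counts curve upward, but their log is exactly linear in time.

```python
import numpy as np

t = np.arange(0, 241, 20)    # time in minutes
count = 2 ** (t / 20.0)      # population doubling every 20 minutes

# A straight-line fit is inappropriate for the curved raw counts,
# but log10(count) = (t / 20) * log10(2) is exactly linear in t.
log_count = np.log10(count)

b1, b0 = np.polyfit(t, log_count, deg=1)
print(b1, np.log10(2) / 20)  # fitted slope matches log10(2)/20, about 0.015
```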
Exponential bacterial growth
In ideal environments, bacteria multiply through binary fission. The
number of bacteria can double every 20 minutes in that way:
1 - 2 - 4 - 8 - 16 - 32 - 64 - …

[Two plots over time (0 to 240 min): bacterial count, which curves sharply upward, and log of bacterial count, which follows a straight line.]

log(2ⁿ) = n·log(2) ≈ 0.3n

Exponential growth 2ⁿ is not suitable for regression; taking the log
changes the growth pattern into a straight line.
Body weight and brain weight
in 96 mammal species
r = 0.86, but this is misleading.
The elephant is an influential point. Most
mammals are very small in comparison.
Without this point, r = 0.50 only.
Now we plot the log of brain weight
against the log of body weight.
The pattern is linear, with r = 0.96.
The vertical scatter is homogeneous
→ good for predictions of brain weight
from body weight (in the log scale).
Inference for least squares lines
Inference for simple linear regression
Simple linear regression model
Conditions for inference
Confidence interval for regression parameters
Significance test for the slope
Confidence interval for E(y) for a given x
Prediction interval for y for a given x
yˆ  0.125x  41.4
The data in a scatterplot are a random
sample from a population that may

exhibit a linear relationship between x
and y. Different sample  different plot.
Now we want to describe the population mean
response E(y) as a function of the explanatory
variable x: E(y)= b0 + b1x.
And to assess whether the observed relationship
is statistically significant (not entirely explained
by chance events due to random sampling).
Simple linear regression model
In the population, the linear regression equation is E(y) = β0 + β1x.
Sample data then fit the model:

Data = fit + residual
yi = (β0 + β1xi) + (εi)

where the εi are independent and Normally distributed N(0, σ).
Linear regression assumes equal standard deviation of y
(σ is the same for all values of x).

The intercept β0, the slope β1, and the standard deviation σ of y are the
unknown parameters of the regression model. We rely on the random
sample data to provide unbiased estimates of these parameters.
The value of ŷ from the least-squares regression line is really a prediction
of the mean value of y (E(y)) for a given value of x.

The least-squares regression line (ŷ = b0 + b1x) obtained from sample data
is the best estimate of the true population regression line (E(y) = β0 + β1x).

ŷ: unbiased estimate for mean response E(y)
b0: unbiased estimate for intercept β0
b1: unbiased estimate for slope β1
The population standard deviation σ
for y at any given value of x represents
the spread of the normal distribution of
the εi around the mean E(y).

The regression standard error, se, for n sample data points is
calculated from the residuals (yi − ŷi):

se = √( Σ residual² / (n − 2) ) = √( Σ(yi − ŷi)² / (n − 2) )

se is an unbiased estimate of the regression standard deviation σ.
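A Python sketch of this computation (hypothetical data again):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.2, 2.9, 2.7, 4.8, 5.1, 6.9, 6.4])

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)                  # y_i - yhat_i

n = len(x)
s_e = np.sqrt(np.sum(residuals**2) / (n - 2))  # regression standard error
print(s_e)
```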
Conditions for inference
- The observations are independent.
- The relationship is indeed linear.
- The standard deviation of y, σ, is the same for all values of x.
- The response y varies normally around its mean.
Using residual plots to check for regression validity
The residuals (y − ŷ) give useful information about the contribution of
individual data points to the overall pattern of scatter.
We view the residuals in
a residual plot:
If residuals are scattered randomly around 0 with uniform variation, it
indicates that the data fit a linear model, have normally distributed
residuals for each value of x, and constant standard deviation σ.
Residuals are randomly scattered → good!
Curved pattern → the relationship is not linear.
Change in variability across plot → σ not equal for all values of x.
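A minimal sketch of such a check in Python (matplotlib assumed available; hypothetical data):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.2, 2.9, 2.7, 4.8, 5.1, 6.9, 6.4])

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")   # look for random scatter around this line
plt.xlabel("x")
plt.ylabel("residual (y - yhat)")
plt.show()
```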
What is the relationship between
the average speed a car is
driven and its fuel efficiency?
We plot fuel efficiency (in miles
per gallon, MPG) against average
speed (in miles per hour, MPH)
for a random sample of 60 cars.
The relationship is curved.
When speed is log transformed
(log of miles per hour, LOGMPH)
the new scatterplot shows a
positive, linear relationship.
Residual plot:
The spread of the residuals is
reasonably random—no clear pattern.
The relationship is indeed linear.
But we see one low residual (3.8, −4)
and one potentially influential point
(2.5, 0.5).
Normal quantile plot for residuals:
The plot is fairly straight, supporting
the assumption of normally distributed
residuals.
→ Data okay for inference.
Standard Error for the Slope
Three aspects of the scatterplot affect the standard error
of the regression slope:
- spread around the line, se
- spread of x values, sx
- sample size, n.

The formula for the standard error (which you will
probably never have to calculate by hand) is:

SE(b1) = se / ( √(n − 1) · sx )
Confidence interval for β1
Estimating the regression parameters β0, β1 is a case of one-sample
inference with unknown population standard deviation.
→ We rely on the t distribution, with n − 2 degrees of freedom.
A level C confidence interval for the slope, β1, is proportional to the
standard error of the least-squares slope:

b1 ± t* SE(b1)

t* is the t critical for the t(n − 2) distribution with area C between −t* and +t*.
We estimate the standard error of b1 with

SE(b1) = se / ( √(n − 1) · sx ), where se = √( Σ(y − ŷ)² / (n − 2) )

n is the sample size, sx is the ordinary standard deviation of the x values.
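A Python sketch of the interval (scipy assumed available; hypothetical data):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.2, 2.9, 2.7, 4.8, 5.1, 6.9, 6.4])

n = len(x)
b1, b0 = np.polyfit(x, y, deg=1)
s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = s_e / (np.sqrt(n - 1) * x.std(ddof=1))

t_star = stats.t.ppf(0.975, df=n - 2)   # for a 95% confidence level
print(b1 - t_star * se_b1, b1 + t_star * se_b1)
```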
Confidence interval for β0
A level C confidence interval for the intercept, β0, is proportional to
the standard error of the least-squares intercept:

b0 ± t* SE(b0)

The intercept usually isn't interesting. Most hypothesis tests and
confidence intervals for regression are about the slope.
Hypothesis test for the slope
We may look for evidence of a significant relationship between
variables x and y in the population from which our data were drawn.
For that, we can test the hypothesis that the regression slope
parameter β1 is equal to zero.

H0: β1 = 0 vs. Ha: β1 ≠ 0

Since slope b1 = r · (sy / sx), testing H0: β1 = 0 also allows us to test
the hypothesis of no correlation between x and y in the population.
Note: A test of hypothesis for β0 is irrelevant (β0 is often not even achievable).
Hypothesis test for the slope (cont.)
We usually test the hypothesis H0: β1 = 0 vs. Ha: β1 ≠ 0, but we can
also test H0: β1 = 0 vs. Ha: β1 < 0 or H0: β1 = 0 vs. Ha: β1 > 0.
To do this we calculate the test statistic

t = (b1 − 0) / SE(b1)

Use the t dist. with n − 2 df to find the P-value of the test.
Note: Software typically provides two-sided p-values.
Using technology
Computer software runs all the computations for regression analysis.
Here is some software output for the car speed/gas efficiency example.
SPSS
[SPSS output, annotated: slope, intercept, p-values for tests of significance, confidence intervals.]
The t-test for regression slope is highly significant (p < 0.001). There is a
significant relationship between average car speed and gas efficiency.

Excel
[Excel output, annotated: the "intercept" row gives the intercept; the "logmph" row gives the slope.]

SAS
[SAS output, annotated: p-values for tests of significance, confidence intervals.]
Confidence Intervals and Prediction Intervals for Predicted Values
Once we have a useful regression, how can we indulge our natural
desire to predict, without being irresponsible?
Now we have standard errors; we can use those to construct a
confidence interval for the predictions and to report our uncertainty
honestly.

An Example: Body Fat and Waist Size
Consider an example that involves investigating the relationship in
adult males between % Body Fat and Waist size (in inches). Here is a
scatterplot of the data for 250 adult males of various ages:
[Scatterplot of % Body Fat vs. Waist size for 250 adult males.]
Confidence Intervals and Prediction Intervals for Predicted Values (cont.)
For our %body fat and waist size example, there are two questions we
could ask:
1. Do we want to know the mean %body fat for all men with a waist
size of, say, 38 inches?
2. Do we want to estimate the %body fat for a particular man with a
38-inch waist?

The predicted %body fat is the same in both questions, but we can
predict the mean %body fat for all men whose waist size is 38 inches
with a lot more precision than we can predict the %body fat of a
particular individual whose waist size happens to be 38 inches.
Confidence Intervals and Prediction Intervals for Predicted Values (cont.)
We start with the same prediction in both cases.
- We are predicting for a new individual, one that was not in the
original data set.
- Call his x-value xν.
- The regression predicts %body fat as ŷν = b0 + b1xν
Confidence Intervals and Prediction Intervals for Predicted Values (cont.)
Both intervals take the form

ŷν ± t*n−2 × SE

The SEs will be different for the two questions we have posed.
Confidence Intervals and Prediction Intervals for Predicted Values (cont.)
1. The standard error of the mean predicted value is:

SE(μ̂ν) = √( SE²(b1)·(xν − x̄)² + se²/n )

2. Individuals vary more than means, so the standard error for a single
predicted value is larger than the standard error for the mean:

SE(ŷν) = √( SE²(b1)·(xν − x̄)² + se²/n + se² )
Confidence Intervals and Prediction Intervals for Predicted Values (cont.)
Confidence interval for μν:

ŷν ± t*n−2 · √( SE²(b1)·(xν − x̄)² + se²/n )

Prediction interval for y:

ŷν ± t*n−2 · √( SE²(b1)·(xν − x̄)² + se²/n + se² )
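The two formulas side by side in a Python sketch (hypothetical data; xν chosen inside the observed range):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.2, 2.9, 2.7, 4.8, 5.1, 6.9, 6.4])
x_nu = 4.5                       # new x value to predict at

n = len(x)
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x_nu

s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
se_b1 = s_e / (np.sqrt(n - 1) * x.std(ddof=1))

# standard errors for the mean response and for one individual
se_mean = np.sqrt(se_b1**2 * (x_nu - x.mean())**2 + s_e**2 / n)
se_indiv = np.sqrt(se_b1**2 * (x_nu - x.mean())**2 + s_e**2 / n + s_e**2)

t_star = stats.t.ppf(0.975, df=n - 2)
print("95% CI for the mean response:", y_hat - t_star * se_mean, y_hat + t_star * se_mean)
print("95% PI for an individual:", y_hat - t_star * se_indiv, y_hat + t_star * se_indiv)
```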
Confidence Intervals for Predicted Values
Here's a look at the difference between predicting for a mean and
predicting for an individual.
The solid green lines near the regression line show the 95%
confidence intervals for the mean predicted value, and the dashed red
lines show the prediction intervals for individuals.
The solid green lines and the dashed red lines curve away from the
least squares line as x moves farther away from x̄.
More on confidence intervals for μy
As seen on the preceding slides, we can calculate a confidence
interval for the population mean of all responses y when x takes the
value x* (within the range of data tested): denote this expected value
E(y) by μy.
This interval is centered on ŷ, the unbiased estimate of μy.
The true value of the population mean μy at a particular value x* will
indeed be within our confidence interval in C% of all intervals calculated
from many different random samples.
The level C confidence interval for the mean response μy at a given
value x* of x is centered on ŷ (unbiased estimate of μy):

ŷ ± t*n−2 · SE(μ̂)

t* is the t critical for the t(n − 2) distribution with area C between
−t* and +t*.
A separate confidence interval is calculated for μy along all the values
that x takes. Graphically, the series of confidence intervals is shown as
a continuous band on either side of ŷ.
[Plot: 95% confidence bands for μy as x varies over all x values.]
More on prediction intervals for y
One use of regression is for predicting the value of y, ŷ, for any value
of x within the range of data tested: ŷ = b0 + b1x.
But the regression equation depends on the particular sample drawn.
More reliable predictions require statistical inference:
To estimate an individual response y for a given value of x, we use a
prediction interval.
If we randomly sampled many times, there would be many different
values of y obtained for a particular x, following N(0, σ) around the
mean response μy.
The level C prediction interval for a single observation on y when x
takes the value x* is:

ŷ ± t*n−2 · SE(ŷ)

t* is the t critical for the t(n − 2) distribution with area C between
−t* and +t*.
The prediction interval represents mainly the error from the normal
distribution of the residuals εi. Graphically, the series of prediction
intervals is shown as a continuous band on either side of ŷ.
[Plot: 95% prediction interval for ŷ as x varies over all x values.]
The confidence interval for μy contains, with C% confidence, the
population mean μy of all responses at a particular value x*.
The prediction interval contains C% of all the individual values
taken by y at a particular value x*.

[Plot showing both the 95% prediction interval for y (outer band) and the 95% confidence interval for μy (inner band).]

Estimating μy uses a smaller confidence interval than estimating
an individual in the population (the sampling distribution is narrower
than the population distribution).
1918 flu epidemics

Week      # Cases diagnosed   # Deaths reported
week 1    36                  0
week 2    531                 0
week 3    4233                130
week 4    8682                552
week 5    7164                738
week 6    2229                414
week 7    600                 198
week 8    164                 90
week 9    57                  56
week 10   722                 50
week 11   1517                71
week 12   1828                137
week 13   1539                178
week 14   2416                194
week 15   3148                290
week 16   3465                310
week 17   1440                149

[Line graphs of weekly incidence: # cases diagnosed (left axis, 0 to 10000) and # deaths reported (right axis, 0 to 800), over weeks 1 to 17.]

The line graph suggests that 7 to 9% of those diagnosed with the flu died
within about a week of diagnosis.
We look at the relationship between the number of deaths in a given week
and the number of new diagnosed cases one week earlier.
1918 influenza epidemic
1918 flu epidemic: relationship between the number of deaths in a given
week and the number of new diagnosed cases one week earlier. r = 0.91.

EXCEL
Regression Statistics
Multiple R           0.911
R Square             0.830
Adjusted R Square    0.82
Standard Error       85.07   (this is se)
Observations         16

             Coefficients    St. Error         t Stat   P-value   Lower 95%   Upper 95%
Intercept    49.292          29.845            1.652    0.1209    −14.720     113.304
FluCases0    0.072 (b1)      0.009 (SE(b1))    8.263    0.0000    0.053       0.091

The P-value for H0: β1 = 0 is very small → reject H0 → β1 is significantly
different from 0.
There is a significant relationship between the number of flu
cases and the number of deaths from flu a week later.
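This output can be reproduced (approximately) in Python with scipy, using the weekly counts from the table above and lagging the cases by one week:

```python
import numpy as np
from scipy import stats

cases = np.array([36, 531, 4233, 8682, 7164, 2229, 600, 164, 57,
                  722, 1517, 1828, 1539, 2416, 3148, 3465, 1440])
deaths = np.array([0, 0, 130, 552, 738, 414, 198, 90, 56,
                   50, 71, 137, 178, 194, 290, 310, 149])

# Deaths in a given week vs. cases diagnosed one week earlier (n = 16 pairs)
result = stats.linregress(cases[:-1], deaths[1:])
print(result.slope, result.intercept)   # approximately 0.072 and 49.3
print(result.rvalue, result.pvalue)     # r approximately 0.91, P-value near 0
```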
SPSS
[SPSS output for the same regression.]
CI for the mean weekly death count one week after x = 4000 flu cases
are diagnosed: μy within about 300–380 deaths.
Prediction interval for a weekly death count one week after x = 4000
flu cases are diagnosed: y within about 180–500 deaths.
[Plot: least squares regression line, with the 95% prediction interval for y (outer band) and the 95% confidence interval for μy (inner band).]
What is this?
A 90% prediction interval
for the height (above) and
a 90% prediction interval for
the weight (below) of male
children, ages 3 to 18.