Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Central limit theorem wikipedia , lookup

Transcript
CHAPTER 12
More About Regression
12.1
Inference for Linear
Regression
The Practice of Statistics, 5th Edition
Starnes, Tabor, Yates, Moore
Bedford Freeman Worth Publishers
Inference for Linear Regression
Learning Objectives
After this section, you should be able to:
 CHECK the conditions for performing inference about the slope b of
the population (true) regression line.
 INTERPRET the values of a, b, s, SEb, and r2 in context, and
DETERMINE these values from computer output.
 CONSTRUCT and INTERPRET a confidence interval for the slope
b of the population (true) regression line.
 PERFORM a significance test about the slope b of the population
(true) regression line.
The Practice of Statistics, 5th Edition
2
Introduction
When a scatterplot shows a linear relationship between a quantitative
explanatory variable x and a quantitative response variable y, we can
use the least-squares line fitted to the data to predict y for a given value
of x. If the data are a random sample from a larger population, we need
statistical inference to answer questions like these:
• Is there really a linear relationship between x and y in the
population, or could the pattern we see in the scatterplot
plausibly happen just by chance?
• In the population, how much will the predicted value of y change
for each increase of 1 unit in x? What’s the margin of error for
this estimate?
The Practice of Statistics, 5th Edition
3
Inference for Linear Regression
Below is a scatterplot of the duration and interval of time until the next
eruption of the Old Faithful geyser for all 222 recorded eruptions in a
single month. The least-squares regression line for this population of
data has been added to the graph. We call this the population regression
line (or true regression line) because it uses all the observations that
month.
Suppose we take an SRS of 20
eruptions from the population and
calculate the least - squares
regression line yˆ = a + bx for the
sample data. How does the slope
of the sample regression line
(also called the estimated
regression line) relate to the slope
of the population regression line?
The Practice of Statistics, 5th Edition
4
Sampling Distribution of b
The figures below show the results of taking three different SRSs of 20
Old Faithful eruptions in this month. Each graph displays the selected
points and the LSRL for that sample.
Notice that the slopes of the sample regression lines – 10.2, 7.7, and 9.5
– vary quite a bit from the slope of the population regression line, 10.36.
The Practice of Statistics, 5th Edition
5
Sampling Distribution of b
Confidence intervals and significance tests about the slope of the
population regression line are based on the sampling distribution of b,
the slope of the sample regression line.
Shape: We can see that the distribution
of b-values is roughly symmetric and
unimodal.
Center: The mean of the 1000 b-values
is 10.35. This value is quite close to the
slope of the population (true) regression
line, 10.36.
Spread: The standard deviation of the
1000 b-values is 1.29. Later, we will see
that the standard deviation of the
sampling distribution of b is actually
1.27.
The Practice of Statistics, 5th Edition
6
Sampling Distribution of b
Sampling Distribution of a Slope
Choose an SRS of n observations (x, y) from a population of size N
with least-squares regression line
predicted y = a + bx
Let b be the slope of the sample regression line. Then:
• The mean of the sampling distribution of b is µb = b.
• The standard deviation of the sampling distribution of b is
as long as the 10% Condition is satisfied.
• The sampling distribution of b will be approximately normal if the
values of the response variable y follow a Normal distribution for
each value of the explanatory variable x (the Normal condition).
The Practice of Statistics, 5th Edition
7
Conditions for Regression Inference
The regression model requires that for each possible value of the
explanatory variable x:
1.The mean value of the response variable µy falls on the population
(true) regression line µy = a + bx.
2.The values of the response variable y follow a Normal distribution
with common standard deviation s.
The Practice of Statistics, 5th Edition
8
Conditions for Regression Inference
Conditions for Regression Inference
Suppose we have n observations on an explanatory variable x and a
response variable y. Our goal is to study or predict the behavior of y for
given values of x.
• Linear: The actual relationship between x and y is linear. For any
fixed value of x, the mean response µy falls on the population (true)
regression line µy= α + βx.
• Independent: Individual observations are independent of each other.
When sampling without replacement, check the 10% condition.
• Normal: For any fixed value of x, the response y varies according to
a Normal distribution.
• Equal SD: The standard deviation of y (call it σ) is the same for all
values of x.
• Random: The data come from a well-designed random sample or
randomized experiment.
The Practice of Statistics, 5th Edition
9
How to Check Conditions for Inference
Start by making a histogram or Normal probability plot of the residuals
and also a residual plot.
• Linear: Examine the scatterplot to check that the overall pattern is roughly
linear. Look for curved patterns in the residual plot. Check to see that the
residuals center on the “residual = 0” line at each x-value in the residual plot.
• Independent: Look at how the data were produced. Random sampling and
random assignment help ensure the independence of individual observations.
If sampling is done without replacement, check the 10% condition.
• Normal: Make a stemplot, histogram, or Normal probability plot of the
residuals and check for clear skewness or other major departures from
Normality.
• Equal SD: Look at the scatter of the residuals above and below the “residual
= 0” line in the residual plot. The vertical spread of the residuals should be
roughly the same from the smallest to the largest x-value.
• Random: See if the data were produced by random sampling or a
randomized experiment.
The Practice of Statistics, 5th Edition
10
• See Tech Corner 9 if you don’t remember how to make a residual
plot.
• Watch the video, “The Helicopter Experiment”, p. 745
The Practice of Statistics, 5th Edition
11
Estimating the Parameters
When the conditions are met, we can do inference about the regression
model µy = α+ βx. The first step is to estimate the unknown parameters.
 If we calculate the least-squares regression line, the slope b is an
unbiased estimator of the population slope β, and the y-intercept a
is an unbiased estimator of the population y-intercept α.
 The remaining parameter is the standard deviation σ, which
describes the variability of the response y about the population
regression line.
The LSRL computed from the sample data estimates the population
regression line. So the residuals estimate how much y varies about the
population line. Because σ is the standard deviation of responses about
the population regression line, we estimate it by the standard deviation
of the residuals
åresiduals2 å(y - yˆ )2
s=
The Practice of Statistics, 5th Edition
n -2
=
i
i
n -2
12
Estimating the Parameters
In practice, we don’t know σ for the population regression line. So we
estimate it with the standard deviation of the residuals, s. Then we
estimate the spread of the sampling distribution of b with the standard
s
error of the slope:
SE b =
sx n -1
What happens if we transform the values of b by standardizing? Since
the sampling distribution of b is Normal, the statistic
z=
b-b
sb
has the standard Normal distribution.
The Practice of Statistics, 5th Edition
13
Estimating the Parameters
Replacing the standard deviation σb of the sampling distribution with its
standard error gives the statistic
b-b
t=
SE b
which has a t distribution with n - 2 degrees of freedom.
The Practice of Statistics, 5th Edition
14
Constructing a Confidence Interval
The confidence interval for β has the familiar form
statistic ± (critical value) · (standard deviation of statistic)
Because we use the statistic b as our estimate, the confidence interval is
b ± t* SEb
We call this a t interval for the slope.
t Interval for the Slope
When the conditions for regression inference are met, a level C
confidence interval for the slope β of the population (true) regression
line is
b ± t* SEb
In this formula, the standard error of the slope is
SE b =
sx
s
n -1
and t* is the critical value for the t distribution with df = n - 2 having area
C between -t* and t*.
The Practice of Statistics, 5th Edition
15
• Watch the video, “The Helicopter Experiment”, p. 748. Same
Helicopter but with a confidence interval for b.
• Watch Tech Corner 28 because we don’t really want to do this by
hand.
• Read Concept 1 and answer Concept 2, p. 230 – just the
Confidence Interval.
The Practice of Statistics, 5th Edition
16
Performing a Significance Test for the Slope
When the conditions for inference are met, we can use the slope b of
the sample regression line to construct a confidence interval for the
slope β of the population (true) regression line.
We can also perform a significance test to determine whether a
specified value of β is plausible. The null hypothesis has the general
form H0: β = hypothesized value. To do a test, standardize b to get the
test statistic:
test statistic =
t=
statistic - parameter
standard deviation of statistic
b - b0
SE b
To find the P-value, use a t distribution with n - 2 degrees of freedom.
The Practice of Statistics, 5th Edition
17
Performing a Significance Test for the Slope
t Test for the Slope
Suppose the conditions for inference are met. To test the hypothesis H0
: β = hypothesized value, compute the test statistic
b - b0
t=
SE b
Find the P-value by calculating the probability of getting a t statistic this
large or larger in the direction specified by the alternative hypothesis
Ha. Use the t distribution with df = n - 2.
The Practice of Statistics, 5th Edition
18
• Putting it all together. Watch the video of “Crying and IQ”, p. 754
• But let’s use our calculator, Tech Corner 29
• Can you do this? Answer Concept 2, Significance Test on p. 232
The Practice of Statistics, 5th Edition
19
Inference for Linear Regression
Section Summary
In this section, we learned how to…
 CHECK the conditions for performing inference about the slope b of
the population (true) regression line.
 INTERPRET the values of a, b, s, SEb, and r2 in context, and
DETERMINE these values from computer output.
 CONSTRUCT and INTERPRET a confidence interval for the slope b
of the population (true) regression line.
 PERFORM a significance test about the slope b of the population
(true) regression line.
The Practice of Statistics, 5th Edition
20