Download 12.1 Inference for Linear Regression

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Confidence interval wikipedia , lookup

Choice modelling wikipedia , lookup

German tank problem wikipedia , lookup

Time series wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Regression analysis wikipedia , lookup

Linear regression wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
12.1 Inference for Linear Regression
Objectives
SWBAT:
• CHECK the conditions for performing inference about the slope b of the
population (true) regression line.
• INTERPRET the values of a, b, s, SEb, and r2 in context, and DETERMINE
these values from computer output.
• CONSTRUCT and INTERPRET a confidence interval for the slope b of the
population (true) regression line.
• PERFORM a significance test about the slope b of the population (true)
regression line.
Many people believe that students learn better if they sit closer to the front of the classroom.
Does sitting closer cause higher achievement, or do better students simply choose to sit in the
front? To investigate, an AP Statistics teacher randomly assigned students to seat locations in
his classroom for a particular chapter and recorded the test score for each student at the end
of the chapter. The explanatory variable in this experiment is which row the student was
assigned (Row 1 is closest to the front and Row 7 is the farthest away). Do these data provide
convincing evidence that sitting closer causes students to get higher grades?
1) Describe the association shown in the scatterplot.
There is a negative, linear, moderate association between row and test score
2) Using the computer output, determine the equation of the least-square regression line.
3) Calculate the value of the correlation.
Negative because the slope is negative
4) Calculate and interpret the residual for a student who sat in Row 1 and scored 76.
The student scored 8.59 points lower than
expected, based on the row that they sit in.
5) Interpret the slope of the least-squares regression line.
Each 1 row increase, the score is predicted to decrease by 1.1171 points
6) Interpret the standard deviation of the residuals (s).
When we use the least-squares regression line to predict score using row, our predictions will
typically be off by 10.0673 points.
7) Interpret the value of r-squared.
About 4.7% of the variation in scores is accounted for by the linear model relating scores to
row (the other 95.3% is accounted for by other factors).
8) Explain why it was important to randomly assign the students to seats rather than letting each
student choose his or her own seats.
Randomly assigning seats decreases one of the factors of variability for this experiment. Students can
choose seats for a variety of reasons, such as where there friends are sitting, or perhaps based on
intelligence (i.e. the smarter students sit near the front). Randomly assigning seats eliminates this
factor from confounding our results.
9) What other factors did the teacher control? What is the purpose of keeping these variables the
same?
Same test, same instructional time, same teacher, same chapter, etc… The purpose is to reduce the
variation in response that is due to other factors besides seat location. This will increase r-squared
and decrease s and the standard error of the slope, making it easier to see the association.
10) Does the negative slope provide convincing evidence that sitting closer
causes higher achievement, or is it plausible that the association is due to the
chance variation in the random assignment?
We’ll soon be able to see how we can run this test…
What is the difference between a sample regression line and population (true) regression
line?
To the right is a scatterplot of the duration and
interval of time until the next eruption of the Old
Faithful geyser for all 222 recorded eruptions in a
single month. The least-squares regression line for
this population of data has been added to the graph.
We call this the population regression line (or true
regression line) because it uses all the observations
that month.
Suppose we take an SRS of 20 eruptions from the population and calculate the least-squares
regression line for the sample data. This would give us the equation of the sample
regression line.
What is the sampling distribution of b? What shape, center, and spread does it have?
• The sampling distribution of b shows the possible values of b that could occur by chance in
random sampling or random assignment, when the null hypothesis is true.
• For example:
• These are three different samples of 20 observations from the Old Faithful example.
Notice that the slopes of the sample regression lines, 10.2, 7.7, and 9.5, vary quite a
bit from the slope of the population regression line, 10.36.
Shape: We can see that the distribution of b-values is
roughly symmetric and unimodal.
Center: The mean of the 1000 b-values is 10.35. This value
is quite close to the slope of the population (true) regression
line, 10.36.
Spread: The standard deviation of the 1000 b-values is
1.29. Later, we will see that the standard deviation of the
sampling distribution of b is actually 1.27.
Sampling Distribution of a Slope
Choose an SRS of n observations (x, y) from a population of size N
with least-squares regression line
predicted y = 𝛼 + bx
Let b be the slope of the sample regression line. Then:
• The mean of the sampling distribution of b is µb = b.
• The standard deviation of the sampling distribution of b is
as long as the 10% Condition is satisfied.
• The sampling distribution of b will be approximately normal if the
values of the response variable y follow a Normal distribution for
each value of the explanatory variable x (the Normal condition).
What are the conditions for regression inference? How do we check them?
Conditions for Regression Inference
Suppose we have n observations on an explanatory variable x and a
response variable y. Our goal is to study or predict the behavior of y for
given values of x.
• Linear: The actual relationship between x and y is linear. For any
fixed value of x, the mean response µy falls on the population (true)
regression line µy= α + βx.
• Independent: Individual observations are independent of each other.
When sampling without replacement, check the 10% condition.
• Normal: For any fixed value of x, the response y varies according to
a Normal distribution.
• Equal SD: The standard deviation of y (call it σ) is the same for all
values of x.
• Random: The data come from a well-designed random sample or
randomized experiment.
To check the conditions:
Start by making a histogram or Normal probability plot of the residuals
and also a residual plot.
• Linear: Examine the scatterplot to check that the overall pattern is roughly
linear. Look for curved patterns in the residual plot. Check to see that the
residuals center on the “residual = 0” line at each x-value in the residual plot.
• Independent: Look at how the data were produced. Random sampling and
random assignment help ensure the independence of individual observations.
If sampling is done without replacement, check the 10% condition.
• Normal: Make a stemplot, histogram, or Normal probability plot of the
residuals and check for clear skewness or other major departures from
Normality.
• Equal SD: Look at the scatter of the residuals above and below the “residual
= 0” line in the residual plot. The vertical spread of the residuals should be
roughly the same from the smallest to the largest x-value.
• Random: See if the data were produced by random sampling or a
randomized experiment.
What are the three parameters in a regression model? What statistics do we use to estimate
them?
When the conditions are met, we can do inference about the regression model µy = α+ βx. The first
step is to estimate the unknown parameters.
 If we calculate the least-squares regression line, the slope b is an unbiased estimator of the
population slope β,
 the y-intercept a is an unbiased estimator of the population y-intercept α,
 And the remaining parameter is the standard deviation σ, which describes the variability of the
response y about the population regression line.
The LSRL computed from the sample data estimates the population regression line. So
the residuals estimate how much y varies about the population line. Because σ is the
standard deviation of responses about the population regression line, we estimate it by
the standard deviation of the residuals
s=
åresiduals
n-2
2
=
å(y
i
- yˆ i ) 2
n-2
What is the standard error of the slope? Is this on the formula sheet? How
do you interpret it?
In practice, we don’t know σ for the population regression line. So we estimate it with the
standard deviation of the residuals, s. Then we estimate the spread of the sampling
distribution of b with the standard error of the slope:
s
SEb =
sx n -1
• Formula sheet? Negative
• The interpretation is how much the slope of the estimated regression line varies
from the slope of the population regression line.
For the seating chart experiment, state and interpret the standard error of the
slope.
What is the formula for constructing a confidence interval for a slope? Where
do you get the value of t*? How many degrees of freedom should you use?
The confidence interval for β has the familiar form
You can get the t* value
from the t table or from
the calculator. The
calculator command is
invT(area, df). Remember
that on the calculator, the
area is always to the left
of a value. So a 95%
confidence interval would
have an area of 0.975.
statistic ± (critical value) · (standard deviation of statistic)
Because we use the statistic b as our estimate, the confidence interval is
b ± t* SEb
We call this a t interval for the slope.
t Interval for the Slope
When the conditions for regression inference are met, a level C
confidence interval for the slope β of the population (true) regression
line is
b ± t* SEb
In this formula, the standard error of the slope is
SEb =
s
sx n -1
and t* is the critical value for the t distribution with df = n - 2 having area
C between -t* and t*.
Let’s come back to the seating chart example and calculate a 95% confidence
interval.
df = 30 – 2 = 28
invT(area: 0.975, df: 28) = 2.048
Interpretation: We are 95% confident that the
interval from −3.0570 to 0.8228 captures the slope
of the true regression line relating a student’s test
score y and the student’s row number x.
Based on the interval, is there convincing evidence that seat location affects test scores?
Because the interval of plausible slopes includes 0, we do not have convincing evidence
that seat location affects test scores.
For their second-semester project, two AP Statistics students decided to investigate the effect of sugar
on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations.
All the carnations seemed equally healthy when they were selected. When the students got home,
they prepared 12 identical vases with exactly the same amount of water in each vase. They put one
tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in
3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the
same location, the students randomly assigned one flower to each vase and observed how many hours
each flower continued to look fresh. Here are the data and computer output. Construct and interpret
a 99% confidence interval for the slope of the true regression line.
State:
(continued on next slide)
For their second-semester project, two AP Statistics students decided to investigate the effect of sugar
on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations.
All the carnations seemed equally healthy when they were selected. When the students got home,
they prepared 12 identical vases with exactly the same amount of water in each vase. They put one
tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in
3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the
same location, the students randomly assigned one flower to each vase and observed how many hours
each flower continued to look fresh. Here are the data and computer output. Construct and interpret
a 99% confidence interval for the slope of the true regression line.
Linear: The scatterplot
shows a linear pattern, and
there is no obvious
leftover curvature in the
residual plot.
Scatterplot
Residual Plot
For their second-semester project, two AP Statistics students decided to investigate the effect of sugar
on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations.
All the carnations seemed equally healthy when they were selected. When the students got home,
they prepared 12 identical vases with exactly the same amount of water in each vase. They put one
tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in
3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the
same location, the students randomly assigned one flower to each vase and observed how many hours
each flower continued to look fresh. Here are the data and computer output. Construct and interpret
a 99% confidence interval for the slope of the true regression line.
Independent: All the flowers
were in different vases, so
knowing the hours of
freshness for one flower
shouldn’t provide additional
information about the hours
of freshness for other
flowers.
Normal: The histogram
of residuals does not
show any skewness or
outliers.
For their second-semester project, two AP Statistics students decided to investigate the effect of sugar
on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations.
All the carnations seemed equally healthy when they were selected. When the students got home,
they prepared 12 identical vases with exactly the same amount of water in each vase. They put one
tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in
3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the
same location, the students randomly assigned one flower to each vase and observed how many hours
each flower continued to look fresh. Here are the data and computer output. Construct and interpret
a 99% confidence interval for the slope of the true regression line.
Equal sd: Although the
variability of the residuals is
not constant at each value of x,
there is no systematic pattern
such as increasing variation as Random: Flowers were randomly assigned to the
x increases.
treatments.
(note: pg 744 shows a pattern of increasing variation)
For their second-semester project, two AP Statistics students decided to investigate the effect of sugar
on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations.
All the carnations seemed equally healthy when they were selected. When the students got home,
they prepared 12 identical vases with exactly the same amount of water in each vase. They put one
tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in
3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the
same location, the students randomly assigned one flower to each vase and observed how many hours
each flower continued to look fresh. Here are the data and computer output. Construct and interpret
a 99% confidence interval for the slope of the true regression line.
Do: df = 12 – 2 = 10
t* = 3.169
Conclude: We are 99% confident that the interval
of values from 9.043 to 21.357 captures the slope
of the true regression line relating hours of
freshness y to amount of sugar x.
Calculator:
Option G: LinRegTInt
Xlist: L1
Ylist: L2
Freq: 1
C-Level: 0.99
(9.0415, 21.359)
df = 10
Can you use your calculator for the DO step?
Are you kitten me right meow?
What is the standardized test statistic for a significance test for the slope? Is
this formula on the formula sheet?
t Test for the Slope
Suppose the conditions for inference are met. To test the hypothesis
H0 : β = hypothesized value, compute the test statistic
b - b0
t=
SEb
Find the P-value by calculating the probability of getting a t statistic this
large or larger in the direction specified by the alternative hypothesis
Ha. Use the t distribution with df = n - 2.
Negative formula sheet
Do customers who stay longer at buffets give larger tips? Charlotte, an AP statistics student
who worked at an Asian buffet, decided to investigate this question for her second semester
project. While she was doing her job as a hostess, she obtained a random sample of receipts,
which included the length of time (in minutes) the party was in the restaurant and the amount
of the tip (in dollars). Do these data provide convincing evidence that customers who stay
longer give larger tips?
a) Here is a scatterplot of the data with the
least-squares regression line added. Describe
what this graph tells you about the relationship
between the two variables.
There is a positive, linear, moderate to weak association between length of
time in a buffet restaurant and the amount of money left for the tip.
b) What is the equation of the
least-squares regression line for
predicting the amount of the tip
from the length of the stay?
or:
c) Interpret the slope of the least-squares
regression line in context.
For each additional minute that a party stays
at the buffet, the least-squares regression
line predicts an increase in the tip of $0.03
(3 cents).
d) Carry out an appropriate test to
answer the question.
Linear: The scatterplot shows a weak, positive, linear relationship between length of stay and tip amount.
The residual plot looks randomly scattered about the residual = 0 line.
Independent: Knowing one tip amount shouldn’t provide additional information about other tip amounts.
Since we are sampling without replacement, we must assume that there are more than 10(12) = 120 receipts
in the population from which these receipts were selected.
Normal: The histogram does not show any skewness or outliers.
Equal standard deviation: The residual plot shows a fairly equal amount of scatter around the residual = 0
line.
Random: The receipts were randomly selected.
Do: From the computer output:
Test statistic: t = 1.23
p-value: 0.247/2 = 0.1235
(Note: computer output ALWAYS
gives 2-sided p-values)
df = 12 – 2 = 10
Conclude: Since the p-value of 0.1235 is greater than the alpha of 0.05, we fail to reject the
null. There is not convincing evidence that parties who stay longer at buffets leave larger
tips.
Calculator:
LinRegTTest
This confirms the output.
Can you use your calculator to conduct a test for the slope?
What a great question to be our last note of the year.