* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download 12.1 Inference for Linear Regression
Survey
Document related concepts
Transcript
12.1 Inference for Linear Regression Objectives SWBAT: • CHECK the conditions for performing inference about the slope b of the population (true) regression line. • INTERPRET the values of a, b, s, SEb, and r2 in context, and DETERMINE these values from computer output. • CONSTRUCT and INTERPRET a confidence interval for the slope b of the population (true) regression line. • PERFORM a significance test about the slope b of the population (true) regression line. Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement, or do better students simply choose to sit in the front? To investigate, an AP Statistics teacher randomly assigned students to seat locations in his classroom for a particular chapter and recorded the test score for each student at the end of the chapter. The explanatory variable in this experiment is which row the student was assigned (Row 1 is closest to the front and Row 7 is the farthest away). Do these data provide convincing evidence that sitting closer causes students to get higher grades? 1) Describe the association shown in the scatterplot. There is a negative, linear, moderate association between row and test score 2) Using the computer output, determine the equation of the least-square regression line. 3) Calculate the value of the correlation. Negative because the slope is negative 4) Calculate and interpret the residual for a student who sat in Row 1 and scored 76. The student scored 8.59 points lower than expected, based on the row that they sit in. 5) Interpret the slope of the least-squares regression line. Each 1 row increase, the score is predicted to decrease by 1.1171 points 6) Interpret the standard deviation of the residuals (s). When we use the least-squares regression line to predict score using row, our predictions will typically be off by 10.0673 points. 7) Interpret the value of r-squared. About 4.7% of the variation in scores is accounted for by the linear model relating scores to row (the other 95.3% is accounted for by other factors). 8) Explain why it was important to randomly assign the students to seats rather than letting each student choose his or her own seats. Randomly assigning seats decreases one of the factors of variability for this experiment. Students can choose seats for a variety of reasons, such as where there friends are sitting, or perhaps based on intelligence (i.e. the smarter students sit near the front). Randomly assigning seats eliminates this factor from confounding our results. 9) What other factors did the teacher control? What is the purpose of keeping these variables the same? Same test, same instructional time, same teacher, same chapter, etc… The purpose is to reduce the variation in response that is due to other factors besides seat location. This will increase r-squared and decrease s and the standard error of the slope, making it easier to see the association. 10) Does the negative slope provide convincing evidence that sitting closer causes higher achievement, or is it plausible that the association is due to the chance variation in the random assignment? We’ll soon be able to see how we can run this test… What is the difference between a sample regression line and population (true) regression line? To the right is a scatterplot of the duration and interval of time until the next eruption of the Old Faithful geyser for all 222 recorded eruptions in a single month. The least-squares regression line for this population of data has been added to the graph. We call this the population regression line (or true regression line) because it uses all the observations that month. Suppose we take an SRS of 20 eruptions from the population and calculate the least-squares regression line for the sample data. This would give us the equation of the sample regression line. What is the sampling distribution of b? What shape, center, and spread does it have? • The sampling distribution of b shows the possible values of b that could occur by chance in random sampling or random assignment, when the null hypothesis is true. • For example: • These are three different samples of 20 observations from the Old Faithful example. Notice that the slopes of the sample regression lines, 10.2, 7.7, and 9.5, vary quite a bit from the slope of the population regression line, 10.36. Shape: We can see that the distribution of b-values is roughly symmetric and unimodal. Center: The mean of the 1000 b-values is 10.35. This value is quite close to the slope of the population (true) regression line, 10.36. Spread: The standard deviation of the 1000 b-values is 1.29. Later, we will see that the standard deviation of the sampling distribution of b is actually 1.27. Sampling Distribution of a Slope Choose an SRS of n observations (x, y) from a population of size N with least-squares regression line predicted y = 𝛼 + bx Let b be the slope of the sample regression line. Then: • The mean of the sampling distribution of b is µb = b. • The standard deviation of the sampling distribution of b is as long as the 10% Condition is satisfied. • The sampling distribution of b will be approximately normal if the values of the response variable y follow a Normal distribution for each value of the explanatory variable x (the Normal condition). What are the conditions for regression inference? How do we check them? Conditions for Regression Inference Suppose we have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x. • Linear: The actual relationship between x and y is linear. For any fixed value of x, the mean response µy falls on the population (true) regression line µy= α + βx. • Independent: Individual observations are independent of each other. When sampling without replacement, check the 10% condition. • Normal: For any fixed value of x, the response y varies according to a Normal distribution. • Equal SD: The standard deviation of y (call it σ) is the same for all values of x. • Random: The data come from a well-designed random sample or randomized experiment. To check the conditions: Start by making a histogram or Normal probability plot of the residuals and also a residual plot. • Linear: Examine the scatterplot to check that the overall pattern is roughly linear. Look for curved patterns in the residual plot. Check to see that the residuals center on the “residual = 0” line at each x-value in the residual plot. • Independent: Look at how the data were produced. Random sampling and random assignment help ensure the independence of individual observations. If sampling is done without replacement, check the 10% condition. • Normal: Make a stemplot, histogram, or Normal probability plot of the residuals and check for clear skewness or other major departures from Normality. • Equal SD: Look at the scatter of the residuals above and below the “residual = 0” line in the residual plot. The vertical spread of the residuals should be roughly the same from the smallest to the largest x-value. • Random: See if the data were produced by random sampling or a randomized experiment. What are the three parameters in a regression model? What statistics do we use to estimate them? When the conditions are met, we can do inference about the regression model µy = α+ βx. The first step is to estimate the unknown parameters. If we calculate the least-squares regression line, the slope b is an unbiased estimator of the population slope β, the y-intercept a is an unbiased estimator of the population y-intercept α, And the remaining parameter is the standard deviation σ, which describes the variability of the response y about the population regression line. The LSRL computed from the sample data estimates the population regression line. So the residuals estimate how much y varies about the population line. Because σ is the standard deviation of responses about the population regression line, we estimate it by the standard deviation of the residuals s= åresiduals n-2 2 = å(y i - yˆ i ) 2 n-2 What is the standard error of the slope? Is this on the formula sheet? How do you interpret it? In practice, we don’t know σ for the population regression line. So we estimate it with the standard deviation of the residuals, s. Then we estimate the spread of the sampling distribution of b with the standard error of the slope: s SEb = sx n -1 • Formula sheet? Negative • The interpretation is how much the slope of the estimated regression line varies from the slope of the population regression line. For the seating chart experiment, state and interpret the standard error of the slope. What is the formula for constructing a confidence interval for a slope? Where do you get the value of t*? How many degrees of freedom should you use? The confidence interval for β has the familiar form You can get the t* value from the t table or from the calculator. The calculator command is invT(area, df). Remember that on the calculator, the area is always to the left of a value. So a 95% confidence interval would have an area of 0.975. statistic ± (critical value) · (standard deviation of statistic) Because we use the statistic b as our estimate, the confidence interval is b ± t* SEb We call this a t interval for the slope. t Interval for the Slope When the conditions for regression inference are met, a level C confidence interval for the slope β of the population (true) regression line is b ± t* SEb In this formula, the standard error of the slope is SEb = s sx n -1 and t* is the critical value for the t distribution with df = n - 2 having area C between -t* and t*. Let’s come back to the seating chart example and calculate a 95% confidence interval. df = 30 – 2 = 28 invT(area: 0.975, df: 28) = 2.048 Interpretation: We are 95% confident that the interval from −3.0570 to 0.8228 captures the slope of the true regression line relating a student’s test score y and the student’s row number x. Based on the interval, is there convincing evidence that seat location affects test scores? Because the interval of plausible slopes includes 0, we do not have convincing evidence that seat location affects test scores. For their second-semester project, two AP Statistics students decided to investigate the effect of sugar on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations. All the carnations seemed equally healthy when they were selected. When the students got home, they prepared 12 identical vases with exactly the same amount of water in each vase. They put one tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in 3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the same location, the students randomly assigned one flower to each vase and observed how many hours each flower continued to look fresh. Here are the data and computer output. Construct and interpret a 99% confidence interval for the slope of the true regression line. State: (continued on next slide) For their second-semester project, two AP Statistics students decided to investigate the effect of sugar on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations. All the carnations seemed equally healthy when they were selected. When the students got home, they prepared 12 identical vases with exactly the same amount of water in each vase. They put one tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in 3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the same location, the students randomly assigned one flower to each vase and observed how many hours each flower continued to look fresh. Here are the data and computer output. Construct and interpret a 99% confidence interval for the slope of the true regression line. Linear: The scatterplot shows a linear pattern, and there is no obvious leftover curvature in the residual plot. Scatterplot Residual Plot For their second-semester project, two AP Statistics students decided to investigate the effect of sugar on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations. All the carnations seemed equally healthy when they were selected. When the students got home, they prepared 12 identical vases with exactly the same amount of water in each vase. They put one tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in 3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the same location, the students randomly assigned one flower to each vase and observed how many hours each flower continued to look fresh. Here are the data and computer output. Construct and interpret a 99% confidence interval for the slope of the true regression line. Independent: All the flowers were in different vases, so knowing the hours of freshness for one flower shouldn’t provide additional information about the hours of freshness for other flowers. Normal: The histogram of residuals does not show any skewness or outliers. For their second-semester project, two AP Statistics students decided to investigate the effect of sugar on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations. All the carnations seemed equally healthy when they were selected. When the students got home, they prepared 12 identical vases with exactly the same amount of water in each vase. They put one tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in 3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the same location, the students randomly assigned one flower to each vase and observed how many hours each flower continued to look fresh. Here are the data and computer output. Construct and interpret a 99% confidence interval for the slope of the true regression line. Equal sd: Although the variability of the residuals is not constant at each value of x, there is no systematic pattern such as increasing variation as Random: Flowers were randomly assigned to the x increases. treatments. (note: pg 744 shows a pattern of increasing variation) For their second-semester project, two AP Statistics students decided to investigate the effect of sugar on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations. All the carnations seemed equally healthy when they were selected. When the students got home, they prepared 12 identical vases with exactly the same amount of water in each vase. They put one tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in 3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the same location, the students randomly assigned one flower to each vase and observed how many hours each flower continued to look fresh. Here are the data and computer output. Construct and interpret a 99% confidence interval for the slope of the true regression line. Do: df = 12 – 2 = 10 t* = 3.169 Conclude: We are 99% confident that the interval of values from 9.043 to 21.357 captures the slope of the true regression line relating hours of freshness y to amount of sugar x. Calculator: Option G: LinRegTInt Xlist: L1 Ylist: L2 Freq: 1 C-Level: 0.99 (9.0415, 21.359) df = 10 Can you use your calculator for the DO step? Are you kitten me right meow? What is the standardized test statistic for a significance test for the slope? Is this formula on the formula sheet? t Test for the Slope Suppose the conditions for inference are met. To test the hypothesis H0 : β = hypothesized value, compute the test statistic b - b0 t= SEb Find the P-value by calculating the probability of getting a t statistic this large or larger in the direction specified by the alternative hypothesis Ha. Use the t distribution with df = n - 2. Negative formula sheet Do customers who stay longer at buffets give larger tips? Charlotte, an AP statistics student who worked at an Asian buffet, decided to investigate this question for her second semester project. While she was doing her job as a hostess, she obtained a random sample of receipts, which included the length of time (in minutes) the party was in the restaurant and the amount of the tip (in dollars). Do these data provide convincing evidence that customers who stay longer give larger tips? a) Here is a scatterplot of the data with the least-squares regression line added. Describe what this graph tells you about the relationship between the two variables. There is a positive, linear, moderate to weak association between length of time in a buffet restaurant and the amount of money left for the tip. b) What is the equation of the least-squares regression line for predicting the amount of the tip from the length of the stay? or: c) Interpret the slope of the least-squares regression line in context. For each additional minute that a party stays at the buffet, the least-squares regression line predicts an increase in the tip of $0.03 (3 cents). d) Carry out an appropriate test to answer the question. Linear: The scatterplot shows a weak, positive, linear relationship between length of stay and tip amount. The residual plot looks randomly scattered about the residual = 0 line. Independent: Knowing one tip amount shouldn’t provide additional information about other tip amounts. Since we are sampling without replacement, we must assume that there are more than 10(12) = 120 receipts in the population from which these receipts were selected. Normal: The histogram does not show any skewness or outliers. Equal standard deviation: The residual plot shows a fairly equal amount of scatter around the residual = 0 line. Random: The receipts were randomly selected. Do: From the computer output: Test statistic: t = 1.23 p-value: 0.247/2 = 0.1235 (Note: computer output ALWAYS gives 2-sided p-values) df = 12 – 2 = 10 Conclude: Since the p-value of 0.1235 is greater than the alpha of 0.05, we fail to reject the null. There is not convincing evidence that parties who stay longer at buffets leave larger tips. Calculator: LinRegTTest This confirms the output. Can you use your calculator to conduct a test for the slope? What a great question to be our last note of the year.