Practice Problems on Correlation & Simple Regression
1. Suppose that, across a sample of stores, the correlation coefficient between beer prices and beer
sales is -0.65. What does this number indicate?
(a) There is almost no variability in beer sales that is unexplained by beer price.
(b) More beer sales tend to go along with lower beer prices.
(c) As price increases by $1, beer sales decrease by 65%.
(d) All of the above are true.
Answer: (b)
The correlation is negative, so (b) is correct: higher sales go with lower prices (basic
economics also tells us this!)
Option (a) is false because while .65² = 42% of the variability in sales is explained by price,
the remainder (58%) is not explained by price. (We’ll discuss this interpretation of the
squared correlation next week.)
Option (c) is a kind of distorted interpretation of the regression slope, not the correlation
coefficient.
2. The purpose of a scatterplot is:
(a) To test for the significance of association in bivariate data.
(b) To calculate the correlation coefficient.
(c) To provide a visual picture of the relationship in bivariate data.
(d) To determine a confidence interval for the regression slope.
Answer: (c)
The scatterplot is a visual display of bivariate data.
3. The standard error of the sample regression slope tells you:
(a) Approximately how different the slope coefficient will be in different samples.
(b) Approximately how large the prediction errors are.
(c) Approximately how spread out the Y scores are.
(d) Approximately how much of the variability of Y is explained by X.
Answer: (a)
The standard error of any quantity is a measure of how different that quantity will be in
different samples. So (a) is the correct interpretation of the standard error of the sample
regression slope.
4. The correlation coefficient describes the _________ between 2 variables.
(a) strength of curved association
(b) strength of random association
(c) strength of linear association
(d) American Marketing Association
Answer: (c)
The correlation coefficient measures linear (straight-line) association: how close the points
in a scatterplot fall to a straight line.
5. R² is a measure used to describe the overall fit of the regression line. Which of the following
statements is/are correct about R²?
(a) In general, the closer R² is to 1, the better the fit of the regression line to the points in the
scatterplot.
(b) R² tells you the proportion of the points in the scatterplot that fall right on the regression line.
(c) R² will always decrease as you add new observations to your regression.
(d) All of the above are true statements about R².
Answer: (a)
Larger R² means a closer fit between the points and the regression line, so (a) is correct.
Option (b) is not true: for example, R² could be large even if no points fall right on
the regression line, so long as most of the points are close to the line.
Option (c) is also not true: there is no consistent relation between R² and the number of
points; new points can either increase or decrease R².
6. A cost accountant is developing a regression model to predict the total cost of producing a
batch of circuit boards as a function of the batch size. The independent and dependent variables
for this regression would be:
(a) IV: circuit board    DV: batch size
(b) IV: batch size       DV: total cost
(c) IV: average cost     DV: circuit board
(d) IV: total cost       DV: average cost
Answer: (b)
(The next 9 questions are based on the following information.)
Pete Estrian is looking to buy a used Honda Civic. He checks the Internet and finds a huge list of
Civics for sale in his area. He selects a random sample of 10 cars, ranging in age from 2 years
old to 15 years old. For each car, he enters the age (in years) and the offered sales price (in
thousands) into Excel. He runs a regression predicting price from age, and gets the following
(edited) output:
ANOVA
             df     SS      MS      F       Significance F
Regression   1      93.5    93.51   117.0   0.000005
Residual     8      6.4     0.80
Total        9      99.9

             Coefficients   Standard Error   t Stat   P-value
Intercept    12.10          0.60             20.2     0.00000004
Age          -0.80          0.07             -10.8    0.000005
7. What is the equation for the regression line?
(a) Predicted price = $12,100 - $800 * Age
(b) Predicted price = $12,100 - $600 * Age
(c) Predicted price = $20,200 - $10,800 * Age
(d) Predicted price = $12,100 - $70 * Age
Answer: (a)
Slope = -.80 thousands = -$800
Intercept = 12.10 thousands = $12,100
Predicted Price = $12,100 - $800 * Age
8. Car #5 in the sample was 10 years old and cost $4,000. Determine the predicted price and the
residual for this car.
(a) Predicted price = $11,300; residual = -$7,300
(b) Predicted price = $11,300; residual = $7,300
(c) Predicted price = $4,100; residual = $100
(d) Predicted price = $4,100; residual = -$100
Answer: (d)
Predicted price = $12,100 - $800*10 = $4,100
residual = $4,000 - $4,100 = -$100
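The arithmetic in this answer can be checked with a short Python sketch. The function names `predicted_price` and `residual` are illustrative choices, not part of the original problem set; the coefficients are the ones read off the regression output.

```python
# Coefficients from the regression output, converted from thousands to dollars.
INTERCEPT = 12_100  # dollars
SLOPE = -800        # dollars per additional year of age


def predicted_price(age):
    """Price predicted by the fitted line for a car of the given age (years)."""
    return INTERCEPT + SLOPE * age


def residual(actual, age):
    """Residual = actual price minus predicted price."""
    return actual - predicted_price(age)


print(predicted_price(10))   # predicted price for car #5
print(residual(4_000, 10))   # negative: the car is priced below the line
```

A negative residual means the data point sits below the regression line; a positive one means it sits above.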
9. Construct a 95% confidence interval for the drop in price associated with an additional year of
age.
(a) ($639, $961)
(b) ($667, $933)
(c) ($749, $851)
(d) Cannot be determined from the information given
Answer: (a)
This problem is asking for a confidence interval for the slope (or more accurately for a
confidence interval for “how negative” the slope is.)
From the t-table then, you get t=2.306 (for df=8)
To calculate the interval: -0.80 +/- (t from table) * .07
Using df=8:
-0.80 +/- 2.306 * .07 = (-.961, -.639)
Converting to dollars from thousands, and since the problem asks for the drop in price, we
can convert from negatives to positives, and thus we get ($639, $961)
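As a rough check on the interval, the calculation can be sketched in Python. The helper name `slope_confidence_interval` is hypothetical, and the critical value 2.306 is taken from the t-table lookup in the answer rather than computed.

```python
def slope_confidence_interval(slope, std_error, t_critical):
    """Return (lower, upper) for slope +/- t_critical * std_error."""
    margin = t_critical * std_error
    return slope - margin, slope + margin


# Slope and standard error are in thousands of dollars per year;
# 2.306 is the t-table value for df = 8 quoted in the answer.
lower, upper = slope_confidence_interval(-0.80, 0.07, 2.306)

# Report the drop in price as positive dollar amounts.
drop_low, drop_high = round(-upper * 1000), round(-lower * 1000)
print(drop_low, drop_high)
```

Flipping the signs swaps the endpoints, which is why the less negative bound becomes the lower end of the "drop in price" interval.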
10. What is the correlation between Age and Price?
(a) r= 0.97
(b) r= -0.97
(c) r= 0.80
(d) r= -0.80
Answer: (b)
We can get R² from the ANOVA table, and then use it to determine the correlation.
R² = SSRegression / SSTotal = 93.5 / 99.9 = .936
Take the square root to get +/- .97, but we need to make sure the sign of the correlation
matches the sign of the slope (both must be negative).
So the correlation is –0.97.
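The two-step logic here (magnitude from the sums of squares, sign from the slope) can be written as a small Python sketch. The function name `correlation_from_anova` is an illustrative assumption.

```python
import math


def correlation_from_anova(ss_regression, ss_total, slope):
    """Correlation r: magnitude sqrt(R^2), sign copied from the slope."""
    r_squared = ss_regression / ss_total
    return math.copysign(math.sqrt(r_squared), slope)


# Sums of squares and slope from the Civic regression output.
r = correlation_from_anova(93.5, 99.9, slope=-0.80)
print(round(r, 2))
```

`math.copysign` is used so the sign step cannot be forgotten: the square root alone is always positive.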
11. What is the typical difference between the predicted prices (based on the regression line) and
the actual prices for these cars?
(a) about $70
(b) about $600
(c) about $890
(d) about $2,530
Answer: (c)
The question is asking for how far the points are from the regression line – this is best
measured by the SD of the residuals, which can be calculated by taking the square root of
MSResidual from the ANOVA table: sqrt(.80) = .894 or about $890.
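For readers who want to see where MSResidual comes from, here is a minimal Python sketch that rebuilds it from the Residual row of the ANOVA table. The variable names are illustrative.

```python
import math

ss_residual = 6.4   # SS for the Residual row (thousands of dollars, squared)
df_residual = 8     # n - 2 = 10 - 2

ms_residual = ss_residual / df_residual   # mean squared residual
sd_residuals = math.sqrt(ms_residual)     # in thousands of dollars

print(round(ms_residual, 2), round(sd_residuals, 3))
```

Multiplying the result by 1,000 converts it from thousands of dollars to dollars.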
12. What does the p-value of 0.000005 tell us?
(a) It is not very plausible that the population regression line relating Price to Age is flat.
(b) There is strong evidence that the slope of the population regression line is not 0.
(c) There is strong evidence that knowing a Civic’s age would improve our prediction of its
price.
(d) All of the statements above are implied by the low p-value.
Answer: (d)
The low p-value says that there’s strong evidence that the population regression line is
probably not flat, and that the regression line improves our predictions. So (d) is the best
answer – all three given statements are true.
13. The average age of the cars in the sample is 7.1 years. What is the average price of the cars in
the sample?
(a) $5,000
(b) $6,420
(c) $7,190
(d) Cannot be determined from the information given.
Answer: (b)
The regression line always goes through the middle of the scatterplot, through the point
defined by the mean of X and the mean of Y. So if we plug the mean of X into the
regression equation, we’ll get the mean of Y:
Average price = $12,100 - $800 * average age = $12,100 - $800 * 7.1 = $6,420
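The claim that a least-squares line passes through the point of means can be illustrated with a small Python sketch. The data below are hypothetical (they are not Pete's actual sample), and `fit_line` is an illustrative helper that uses the usual least-squares formulas.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical ages (years) and prices (thousands); NOT Pete's actual sample.
ages = [2, 4, 5, 7, 9, 12, 15]
prices = [10.6, 9.1, 8.0, 6.3, 5.2, 2.4, 0.3]

intercept, slope = fit_line(ages, prices)
mean_age = sum(ages) / len(ages)
mean_price = sum(prices) / len(prices)

# Plugging the mean age into the fitted line reproduces the mean price.
print(abs((intercept + slope * mean_age) - mean_price) < 1e-9)
```

The property follows directly from how the intercept is defined (mean of Y minus slope times mean of X), which is why it holds for any data set.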
14. What is the best conclusion we can draw about Honda Civics that are 5 years old, based on
the information we have?
(a) We conclude that the average price of 5-year-old Civics is about $8,100, but we expect to see
some differences in prices for different 5-year-old Civics.
(b) We conclude that all 5-year-old Civics should cost the same, about $8,100.
(c) We conclude that all 5-year-old Civics should cost more than all 6-year-old Civics, although
we can’t be completely sure by how much.
(d) All of the above are equally valid conclusions.
Answer: (a)
The regression line tells us the average price for each level of Age. Thus, (a) is correct,
because the average price for Age=5 is $12,100 - $800 * 5 = $8,100. But there’s still
variability of the points around the regression line – in other words, the prices of 5-year old
Civics will vary around the average of $8,100.
Options (b) and (c) are not correct because of this variability around the regression line.
The regression line tells us about average prices (for cars of different Ages), not about the
exact prices of ALL cars of any particular Age.
15. Suppose instead that Pete had taken a second sample, consisting of 6 cars that ranged in age
from 4 to 8 years old, and suppose that he regressed Price on Age for this second sample. How
would the standard error of the regression slope be different for this second sample (compared to
the first sample described above)?
(a) The standard error of the regression slope would probably be larger for the second sample.
(b) The standard error of the regression slope would probably be about the same for both
samples.
(c) The standard error of the regression slope would probably be smaller for the second sample.
Answer: (a)
The second sample is smaller -- 6 cars versus 10 cars.
The second sample involves less spread out X-values than the first sample (X ranging from
4 to 8 rather than from 2 to 15).
The standard error of the regression slope gets larger as the sample size gets smaller, and as
the SD of the X’s gets smaller. (Remember, more data points are better, and wider data
points are better – they lead to smaller standard errors and thus more precise estimates.) So
the standard error of the slope in the second sample will be larger than before. The second
sample wouldn’t estimate the population regression slope as accurately as the first sample.
(The next 5 questions deal with the following information.)
Below is partial output from a regression predicting consumption of beef (called
BeefConsumption, and measured in pounds of beef per person annually) from the price of beef
(called BeefPrice, and measured in cents per pound). [The data are from the United States from
1925 to 1941. During this period, the price of beef ranged from about 55 to 80 cents per pound.]
ANOVA
             df     SS      MS      F      Significance F
Regression   1      166.0   166.0   19.6   0.0005
Residual     15     127.2   8.5
Total        16     293.1

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept    85.24          7.30             11.67    6.3E-09   69.68       100.80
BeefPrice    -0.47          0.11             -4.42    0.0005    -0.69       -0.24
16. In 1941, beef cost 56 cents per pound, and annual consumption of beef was 60.0 pounds per
person. Determine the predicted consumption of beef in 1941, and say whether the data point for
1941 is above or below the regression line.
(a) Predicted consumption of beef = 57.0; data point is above the regression line
(b) Predicted consumption of beef = 57.0; data point is below the regression line
(c) Predicted consumption of beef = 58.9; data point is above the regression line
(d) Predicted consumption of beef = 58.9; data point is below the regression line
Answer: (c)
Predicted consumption of beef = 85.24 - .47 * 56 = 58.9
Actual consumption is 60.0 which is above the prediction of 58.9. So the data point is above
the regression line (in other words, the residual is positive)
17. What is the best interpretation of the y-intercept (85.24) for this regression?
(a) When beef was free in 1941, everyone consumed about 85 pounds of beef per year.
(b) The intercept doesn’t tell us much, because 0 isn’t a reasonable value to plug in for the price
of beef.
(c) The intercept doesn’t tell us much, because the p-value is too low.
(d) Both (a) and (c).
Answer: (b)
The y-intercept tells us the prediction for Y (BeefConsumption) when X (BeefPrice) is 0.
But because beef was never free in the sample (the lowest price of 55 cents isn’t even close to
0) , we don’t want to extrapolate the regression line by plugging in 0 for BeefPrice.
18. Determine the correlation between BeefPrice and BeefConsumption.
(a) r = +0.75
(b) r = -0.75
(c) r = +0.47
(d) r = -0.47
Answer: (b)
R² = r² = SSRegression / SSTotal = 166.0 / 293.1 = .566
r = +/- sqrt(.566) = +/- .75
Because the slope is negative, the correlation is negative too. So r = -0.75
19. Determine the standard deviation of the residuals for this regression
(a) 2.9 pounds/person
(b) 0.11 pounds/person
(c) 8.5 pounds/person
(d) 0.75 pounds/person
Answer: (a)
SD of the residuals = sqrt(MSResidual) = sqrt (8.5) = 2.9
20. Can we reject the null hypothesis that BeefConsumption is unrelated to BeefPrice? (use
alpha=.05)
(a) Yes, because the p-value for the slope is very small (p=.0005)
(b) Yes, because the 95% confidence interval for the slope does not include 0.
(c) No, because the t-statistic for the slope is negative (t=-4.42)
(d) Both (a) and (b).
Answer: (d)
The p-value for the slope is less than .05, so we can reject the null hypothesis of a flat
regression line.
We can also reject that null hypothesis because the confidence interval for the slope does
not include 0: (-0.69 to -0.24)
Both (a) and (b) allow us to reject the null hypothesis of no relation between
BeefConsumption and BeefPrice.
21. Let X= the dosage of a stimulant (in milligrams) given to a patient, and let Y= the pulse rate
of the patient (in beats per minute). Data is gathered for 100 patients, and the resulting regression
equation is: predicted Y = 70 + 0.35 * X.
What is the best interpretation of the number 0.35?
(a) 35% of the variability in pulse rate is explained by the amount of stimulant.
(b) The correlation between amount of stimulant and pulse rate is 0.35.
(c) When a patient is given one additional milligram of stimulant, their pulse tends to increase
by about 0.35 beats per minute.
(d) When a patient is given 0.35 additional milligrams of stimulant, their pulse tends to increase
by about 1 beat per minute.
Answer: (c)
(c) is the correct interpretation of the regression slope; as X goes up by 1, Y goes up by .35
(The next 2 questions are based on the following information.)
22. Across a sample of data from New York City (one data point for each month), the correlation
between average monthly temperature (X) and number of homicides per month (Y) is observed to
be .40. The most appropriate interpretation is:
(a) As the temperature increases by 10 degrees, we expect to see 4 more homicides.
(b) For every additional 10 homicides, we expect temperature to increase by 4 degrees.
(c) Average temperature explains 16% of the variability in number of homicides per month.
(d) Average temperature explains 40% of the variability in number of homicides per month.
Answer: (c)
Only (c) uses the interpretation of the squared correlation as the percentage of variability in
the dependent variable that is explained by the independent variable (.40² = .16 = 16%)
(a) is incorrect because it implies that the slope of the regression line is .40. (A slope of .40 is
not the same as a correlation coefficient of .40.)
23. If the standard deviation of the monthly temperature measurements is 15 degrees, and the
standard deviation of number of homicides per month is 5, what is the slope of the regression line
predicting number of homicides from monthly temperature? (calculate directly)
X=temperature
Y = homicides
slope b = r * Sy/ Sx = .40 * 5 / 15 = 0.133
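The conversion from a correlation to a slope can be sketched in Python as follows. The function name `slope_from_correlation` is illustrative; the inputs are the values given in the question.

```python
def slope_from_correlation(r, sd_y, sd_x):
    """Regression slope for predicting Y from X: b = r * Sy / Sx."""
    return r * sd_y / sd_x


b = slope_from_correlation(r=0.40, sd_y=5, sd_x=15)
print(round(b, 3))        # homicides per additional degree
print(round(b * 10, 2))   # homicides per additional 10 degrees
```

The second line makes the contrast with the correlation concrete: a 10-degree increase goes with roughly 1.3 more homicides, not 4, so a correlation of .40 should not be read as a slope of .40.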
24. Which of the following is not an assumption of the simple linear regression model?
(a) The population mean of Y (for each level of X) is linearly related to X.
(b) The population variance of Y is the same for each level of X.
(c) The population distribution of Y for each level of X follows the normal distribution.
(d) The population correlation between X and Y is equal to 1.
Answer: (d)
The other three options are all assumptions of simple linear regression. (a) refers to the
Linearity assumption; (b) refers to the Equal Variance assumption, and (c) refers to the
Normality assumption. (d) implies that X and Y are perfectly associated, which is rarely
true (and certainly not something that we would assume to be true).
(The next 3 questions are based on the following information.)
25. A manager wants to predict the cost of travel (Y) for salespeople based on the number of
days (X) spent on each trip. Based on a sample of data for trips ranging from 1 to 5 days, the
following regression line is estimated: Y-hat = 180 + 130 X
What would be the best prediction for the cost for a trip lasting 3 days?
(a) $390
(b) $570
(c) $930
(d) Cannot be determined; the regression line shouldn't be used to predict costs for trips lasting
that long.
Answer: (b)
The data used to estimate the equation included trips lasting about 3 days, so we can
reasonably use the regression equation to make predictions for X=3:
Predicted cost = 180 + 130*3 = 180 + 390 = 570
26. What would be the best prediction for the cost for a trip lasting 10 days?
(a) $1300
(b) $1480
(c) $3100
(d) Cannot be determined; the regression line shouldn't be used to predict costs for trips lasting
that long.
Answer: (d)
The data used to estimate the equation did not include trips lasting 10 days, so we
shouldn’t use the regression line to extrapolate to X=10.
27. Suppose the average length of the trips in the sample is 4 days. What is the average cost of
travel for these trips? (calculate directly)
The regression line goes through the center of the scatterplot, the point defined by the mean
on X and the mean on Y. So if we plug the overall mean of X into the regression equation,
we will get the overall mean of Y:
average cost of travel = overall mean of Y = 180 + 130 * 4 = $700
28. The standard error of the slope of the regression line depends on which of the following
quantities?
(a) the number of observations in the sample
(b) the variance of the scores on the independent variable
(c) the variance of the residuals
(d) all of the above
Answer: (d)
The standard error of the regression slope (b) is
SE(b) = Se / (Sx * sqrt(n - 1))
where Se is the SD of the residuals, Sx is the SD of the scores on the independent
variable, and n is the sample size. Therefore, (d) is the best answer.
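To see how each quantity moves the standard error, here is a minimal Python sketch of the formula Se / (Sx * sqrt(n - 1)). The function name and all numeric inputs are hypothetical, chosen only to show the direction of each effect.

```python
import math


def slope_standard_error(sd_residuals, sd_x, n):
    """SE(b) = Se / (Sx * sqrt(n - 1))."""
    return sd_residuals / (sd_x * math.sqrt(n - 1))


# Hypothetical values chosen only to show the direction of each effect.
baseline = slope_standard_error(sd_residuals=2.0, sd_x=4.0, n=10)
fewer_points = slope_standard_error(sd_residuals=2.0, sd_x=4.0, n=6)
narrower_x = slope_standard_error(sd_residuals=2.0, sd_x=1.5, n=10)
noisier = slope_standard_error(sd_residuals=3.0, sd_x=4.0, n=10)

print(baseline < fewer_points, baseline < narrower_x, baseline < noisier)
```

Fewer observations, a narrower spread of X values, and noisier residuals each make the slope estimate less precise.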
29. There is a relationship between caffeine consumption and performance on early-morning
tests as follows: at very low and at very high levels of caffeine consumption, performance is
worse than performance at moderate levels of caffeine consumption. Which of the following is
the best estimate of the correlation coefficient that describes this relationship?
(a) r = -0.70
(b) r = 0.00
(c) r = +0.70
(d) r = +1.00
Answer: (b)
The question describes an upside-down U-shaped curve: low then high then low. This kind
of curve implies a very low correlation coefficient, because Y doesn’t tend to consistently
increase or decrease with X.
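A quick way to convince yourself of this is to compute the correlation for an idealized upside-down U. The data in this Python sketch are hypothetical (a perfectly symmetric curve), and `pearson_r` is an illustrative helper written from the definition of the correlation coefficient.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: caffeine level 0..8, performance peaks in the middle (4).
caffeine = list(range(9))
performance = [100 - (c - 4) ** 2 for c in caffeine]

r = pearson_r(caffeine, performance)
print(abs(r) < 1e-9)
```

Performance is completely determined by caffeine level in this toy data, yet the correlation is essentially zero, because the correlation only picks up straight-line association.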
(The next 9 questions are based on the following information.)
Mark E. Ting, a market researcher, is analyzing household earnings and spending data. He is
interested in relating monthly household grocery spending (in dollars) to annual household
income (in thousands of dollars). He gathers data for a sample of 9 households. Below is partial
output from his analysis.
                                   Mean     SD
Annual Income (thousands of $)     52.8     29.9
Monthly Grocery Spending ($)       137.2    59.8

Source       df     SS       MS       F
Regression   1      25543    25543    57.4
Residual     7      3112     445
Total        8      28656

             Coefficients   Standard Error   t Stat   P-value
Intercept    37.51          14.92            2.51     0.0401
Income       1.89           0.25             7.57     0.0001
30. What is the equation for the estimated regression line?
(a) Predicted monthly grocery spending = 37.51 + 1.89 Annual income
(b) Predicted monthly grocery spending = 1.89 + 37.51 Annual income
(c) Predicted monthly grocery spending = 14.92 + 0.25 Annual income
(d) Predicted monthly grocery spending = 1.89 + 0.25 Annual income
Answer: (a)
Using the “coefficients” column at the bottom of the regression output, we see that the
intercept term is 37.51 and the slope is 1.89.
31. One household in the sample has annual income of $50,000 and spends $125 monthly on
groceries. What is the residual for this household?
(a) -7
(b) 7
(c) 125
(d) 37.5
Answer: (a)
The residual is the actual value of Y minus the value predicted by the regression line:
Actual value: $125
Predicted value: 37.51 + 1.89*50 = $132.01
Residual: 125 – 132.01 = -7.01, or about -7
(Since income is measured in thousands, we plug 50 into the equation for an income of
50,000.)
32. What is the correlation between monthly grocery spending and annual income?
(a) .891
(b) .944
(c) 1.89
(d) Cannot be determined from the information given
Answer: (b)
The squared correlation is the proportion of variability explained:
SSRegression / SSTotal = 25543 / 28656 = .891
So the correlation coefficient is +/- sqrt(.891) = +/- .944.
(Make sure the sign of the coefficient matches the slope of the regression line. Since the
slope is positive, the correlation is positive too: r=.944).
33. In the ANOVA table, the F-statistic is 57.4. This F value is the test statistic used in testing
what null hypothesis?
(a) H0: The slope of the sample regression line is 0
(b) H0: The intercept of the sample regression line is 0
(c) H0: The slope of the population regression line is 0
(d) H0: The intercept of the population regression line is 0
Answer: (c)
F-stat in the ANOVA table for simple regression tests whether the population regression
equation is useless, or in other words, whether the population regression line is flat (same
test as the t-test for the IV). Flat line means a slope of 0. Best answer is therefore (c).
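One way to see that the F-test and the slope's t-test are the same test in simple regression is that F equals t squared. The Python sketch below checks this against the printed values from the three regression outputs in this problem set; because those values are rounded, the match is only approximate.

```python
# (t statistic for the slope, F statistic) pairs as printed in the three
# regression outputs in this problem set; both are rounded in the source.
outputs = {
    "Civic price on age": (-10.8, 117.0),
    "Beef consumption on price": (-4.42, 19.6),
    "Grocery spending on income": (7.57, 57.4),
}

for name, (t_slope, f_stat) in outputs.items():
    # In simple regression F equals t squared, up to rounding of the inputs.
    print(name, round(t_slope ** 2, 1), f_stat)
```

The sign of t is lost when squaring, which is why F is always positive and why the F-test cannot tell you the direction of the relationship.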
34. What is the sum of squared deviations between each household's actual grocery spending and
the predicted grocery spending based on the regression line?
(a) 25543
(b) 3112
(c) 28656
(d) 445
Answer: (b)
The question is asking for the sum of squared residuals, or sum of squared prediction
errors. This is the quantity we minimize when we fit the regression line (by the method of
least squares). The sum of squared residuals is given in the ANOVA table by SSResidual =
3112
35. What is the standard deviation of the residuals?
(a) 21.1
(b) 29.9
(c) 59.8
(d) Cannot be determined from the information given
Answer: (a)
SD of the residuals is determined by taking the square root of the MSResidual term in the
ANOVA table: SD of the residuals = sqrt (445) = 21.1
36. Construct a 90% confidence interval for the expected additional monthly grocery spending
for each additional $1000 of annual income.
(a) (0, 3.78)
(b) (1.31, 2.47)
(c) (1.42, 2.36)
(d) (9.23, 65.7)
Answer: (c)
“expected additional monthly grocery spending for each additional $1000 of annual
income”: this is a way of asking about the slope of the regression line (the change in average
Y for a 1-unit increase in X)
Our point estimate for the slope is 1.89, and the standard error is 0.25. The t-value from
the t-table for df=n-2=7 and a 90% interval is 1.895
90% CI: 1.89 +/- 0.25*1.895 = 1.89 +/- .47 = (1.42, 2.36)
37. Mark is especially interested in estimating the mean grocery spending for three types of
households: those with annual income of $20,000, those with annual income of $50,000, and
those with annual income of $100,000. He constructs three 95% confidence intervals for mean
monthly grocery spending, one for each of the three sub-populations of households. Which of
these confidence intervals will be the narrowest (i.e., for which set of households will the
estimation of mean grocery spending be the most precise?)
(a) The confidence interval for the $20,000 households will be narrowest.
(b) The confidence interval for the $50,000 households will be narrowest.
(c) The confidence interval for the $100,000 households will be narrowest.
(d) All three confidence intervals will be equally wide.
Answer: (b)
The standard error for the mean of Y for specific values of X gets bigger as you move
further from the overall mean of X. Therefore, the confidence intervals get wider as you
move further from the overall mean of X, and the narrowest interval is the one closest to the
mean of X (which is $52,800). And the confidence interval for the $100,000 households
(farthest from $52,800) will be the widest.
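The widening can be made concrete with a Python sketch. The formula inside `se_mean_response` is the standard textbook expression for the standard error of an estimated mean response in simple regression; it is not quoted in the problem set, so treat it as an assumption of this sketch. The inputs come from the grocery output (Se = sqrt(445), mean income 52.8, Sx = 29.9, n = 9).

```python
import math


def se_mean_response(x0, sd_residuals, mean_x, sd_x, n):
    """Standard error of the estimated mean of Y at X = x0.

    Standard simple-regression formula (not quoted in the problem set):
    Se * sqrt(1/n + (x0 - mean_x)^2 / ((n - 1) * Sx^2))
    """
    spread = (n - 1) * sd_x ** 2
    return sd_residuals * math.sqrt(1 / n + (x0 - mean_x) ** 2 / spread)


# Values from the grocery output; income is in thousands of dollars.
se_by_income = {
    x0: se_mean_response(x0, math.sqrt(445), 52.8, 29.9, 9)
    for x0 in (20, 50, 100)
}
for x0, se in se_by_income.items():
    print(x0, round(se, 1))
```

The interval width is proportional to this standard error, so the ordering of the three values is also the ordering of the three interval widths.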
38. The 95% confidence interval for the mean grocery spending for households with annual
income of $20,000 is calculated to be (50,101). What is the best interpretation of this interval?
(a) If a new household with income of $20,000 was observed, we would be 95% confident that
this household's actual grocery spending would fall in the interval.
(b) If we observed 100 new households with income of $20,000, about 95 of them would have
grocery spending falling in the interval.
(c) We are 95% confident that the mean grocery spending of all households with income of
$20,000 falls in this interval.
(d) About 95% of all households in the sample have annual incomes between $50,000 and
$101,000.
Answer: (c)
The confidence interval is intended to cover the mean (average) grocery spending for
households with $20,000 annual income. (a), (b), and (d) all incorrectly deal with grocery
spending for individual households, rather than the mean grocery spending across a set of
households.