Review Linear Regression t-tests
Source: Devore
Name_____________________________
Class period_____
For questions #1-3, complete the following.
a) Make a scatterplot and check for any obvious outliers.
b) Find the equation of the least squares regression line.
c) Calculate r and r² and explain what they mean.
d) Sketch the residual plot and determine if a linear equation is appropriate.
e) Interpret the slope and y-intercept in the regression equation.
f) Predict the y value for the given x value.
g) Test at the specified significance level if β is different from zero.
h) Find s_b.
i) Construct the specified confidence interval for β.
1. The accompanying data on fish survival and ammonia concentration are taken from the paper “Effects
of Ammonia on Growth and Survival of Rainbow Trout in Intensive Static Water Culture.”
Let x = ammonia exposure (mg/L) and y = percent survival.
x    10   10   20   20   25   27   27   31   50
y    85   92   85   96   87   80   90   59   62
f) let x = 30
g) use α = .05
i) 95% confidence interval
2. The paper “Root Dentine Transparency: Age Determination of Human Teeth Using Computerized
Densitometric Analysis” (Amer. J. Phys. Anthr. (1991):25-30) reported on an investigation of methods for
age determination based on tooth characteristics. With y = age (years) and x = percentage of root with
transparent dentine, the accompanying data are for anterior teeth:
x    15   19   31   39   41   44   47   48   55   65
y    23   52   65   55   32   60   78   59   61   60
f) let x = 25
g) use α = .02
i) 98% confidence interval
3. The heights (in inches) are listed for boys who were measured at 5 years of age and again at 18 years of
age.
Age 5     43   41   42   45   46
Age 18    68   66   66   70   71
f) let x = 44 inches
g) use α = .05
i) 95% confidence interval
4. a) What does a correlation coefficient of 1 mean? of 2 mean? of 0 mean?
b) Describe what finding the least squares regression line means.
5. a) The equation of a least squares regression line is ŷ = 2.4 + 3.7x. What is the residual for
the point (3, 12)? What does this tell us about the actual point?
b) What is the actual y value for x = 5 if the residual is 1.9?
6. A study concerning the relationship between the average ozone level y (in parts per million) and the
population of a city x (in millions) resulted in the following MINITAB output.
Predictor      Coef    Stdev   t-ratio       P
Constant      8.892    2.395      3.71   0.002
popn         16.650    1.910      8.72   0.000
s = 5.454    R-sq = 84.4%    R-sq(adj) = 83.3%
Analysis of Variance
SOURCE        DF       SS       MS       F       P
Regression     1   2260.5   2260.5   76.00   0.000
Error         14    416.4     29.7
Total         15   2676.9
a) What is the equation of the least squares regression line?
b) Estimate the mean ozone level for a city having population of 1 million people.
c) Determine the proportion of the observed variation in ozone levels that can be attributed to the
population of a city.
d) What is the correlation coefficient?
f) Is the regression equation useful in predicting the average ozone level using the population of the
city? Support your answer.
7. What is the difference between an influential point and an outlier?
8. Suppose a simple linear regression model is appropriate for describing the relationship between
x = the angle formed by the point of an elf’s ear and y = the elf’s IQ rating. The true regression equation is
believed to be y = 163 – .62x and σ = 5. What proportion of elves with a 20° ear point angle have an IQ
above 159? (Elves are very smart!) (pjs)
Multiple Choice Practice
9. A bivariate set of data relates the amount of annual salary raise and previous performance rating. The
least squares regression equation is ŷ = 1400 + 2000x where ŷ is the estimated raise and x is the
performance rating. Which of the following statements is not correct? (from AMSCO’s AP Stat pg 70 #2)
A) For each increase of one point in performance rating, the raise will increase on average by $2,000.
B) A rating of 0 will yield a predicted raise of $1,400.
C) The correlation coefficient of the data is positive.
D) All of the above are false.
10. Which of the following is not a consideration in determining the goodness of fit of a model?
(from AMSCO’s AP Stat pg 265 #13)
A) the value of r²
B) the slope of the residual plot
C) the existence of influential points
D) the existence of pattern in the residual plot
11. Suppose a data set is transformed using (x,y) → (x, lny) and a least squares linear regression
procedure is performed on the transformed data. If the residual plot of this regression shows a curved
pattern, which of the following is an appropriate conclusion? (from AMSCO’s AP Stat pg 305 #12)
A) A quadratic model should be used with the original data.
B) A square root transformation should be applied to the transformed data.
C) The correlation of the set of transformed data is 0.
D) The exponential transformation is not appropriate.
E) None of these is appropriate.
12. The monthly cost of a long distance phone call depends on many factors. A least squares regression
relating cost to time on line determines an equation of cost = 4.70 + 0.15(time in minutes). Which is not
correct? (from AMSCO’s AP Stat pg 305 #14)
A) Long distance service is predicted to cost 15 cents per minute.
B) The monthly fixed cost of long distance service is, on average, $4.70.
C) Adding another phone will raise cost 0.15.
D) Five dollars will cover the long distance charges if your only call is for two minutes.
13. A value in a bivariate set of data is an influential point if
(from AMSCO’s AP Stat pg 305 #15)
A) it has a large residual.
B) it is an outlier with respect to the values of the explanatory variables.
C) its removal creates a substantial change in r.
D) its removal changes the sign of r.
E) None of these.
For each data set below (14-17), find a) the best transformed linear equation (unless quadratic),
b) the equation in y = form, and c) make the prediction for the given x value. If the scatterplot appears
quadratic, simply use the quadratic regression function in your calculator. (pjs)
14.
x    .1    .2    .3    .6    .8    1.5   2
y    1     1.3   1.5   1.8   1.9   2.2   2.3
predict y when x = 1
15.
x    .2    1     2     3     4     6
y    12    9.1   5.9   4.8   6     15
predict y when x = 5
16.
x    2     7     12    16    21    30    40
y    11    6.7   4     2.6   1.5   .6    .2
predict y when x = 10
17.
x    1.8   3     4     6     8     7
y    .7    5.8   6     10    35    21
predict y when x = 2
Answers
1. b) ŷ = −.778x + 100.791
c) r = −.727 suggests a strong, negative, linear correlation between ammonia exposure and survival.
r² = .528 states that 52.8% of the total variation in survival percentage is explained by the amount
of ammonia exposure.
d) The residual plot has no pattern and supports a linear relationship.
[Residual plot: residuals vs. ammonia exposure]
e) Slope: −.778 means that for each additional mg/L of ammonia exposure, the predicted percent
survival decreases by about .778 percentage points. y-intercept: 100.791 means that if the ammonia
exposure is zero, the predicted survival is about 100%.
f) 77.46
g) conditions: * The scatterplot appears linear. Residual plot appears random.
* Independence is reasonable since the survival of one fish should not depend on another.
* The distribution of residuals is close enough to normal to continue. (Histogram must have scales and
labeled axes.)
* The spread about the line is nearly constant, which supports equal variance of y at each x value.
Ho: β = 0
Ha: β ≠ 0
t = −2.80
df = 7
critical values: t = ±2.365
p-value = .0265
se = 9.487
reject Ho since p-value < α
We have enough evidence to claim that the slope of the linear regression equation is not zero, and the
equation appears to be useful in predicting survival using ammonia exposure. The percent of rainbow
trout surviving decreases as the amount of ammonia exposure increases.
h) .278
i) (−1.435, −.121)
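The line, correlation, and prediction for question 1 can be checked by hand-coding the least squares formulas. The following is a minimal Python sketch using only the standard library; the function name `least_squares` and the variable names are illustrative choices, not part of the worksheet.

```python
import math

def least_squares(x, y):
    """Return (slope, intercept, r) for the line y-hat = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Question 1 data: ammonia exposure (mg/L) and percent survival
ammonia = [10, 10, 20, 20, 25, 27, 27, 31, 50]
survival = [85, 92, 85, 96, 87, 80, 90, 59, 62]

slope, intercept, r = least_squares(ammonia, survival)
predicted_at_30 = intercept + slope * 30

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"predicted survival at x = 30: {predicted_at_30:.2f}")
```

Both the slope and r come out negative, which matches the conclusion that survival decreases as ammonia exposure increases.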
2. b) ŷ = .555x + 32.081
c) r = .535 suggests a weak positive linear relationship between age and dentine transparency.
r² = .286 states that 28.6% of the total variation in age is associated with the variation in the
amount of root with transparent dentine.
d) The residual plot has no pattern and supports a linear relationship.
[Residual plot: residuals vs. % of root]
e) Slope: .555 means that for each additional percentage point of root with transparent dentine, the
predicted age increases by about .555 years. y-intercept: 32.081 means that if there is no root with
transparent dentine, the predicted age of the person is about 32 years.
f) 45.954
g) conditions: * The scatterplot appears linear. Residual plot appears random.
* Independence is reasonable since the age of one person based on teeth doesn’t affect another’s age.
* The spread about the line is nearly constant, which supports about equal variability of y at each
x value. No fanning effect noticed.
* The distribution of residuals is not so far from normal to completely
thwart our normal condition. Axes must have labels and scale.
(window set with x-scale = 4)
Ho: β = 0
Ha: β ≠ 0
t = 1.790
df = 8
critical values: t = ±2.896
p-value = .111
se = 14.299
fail to reject Ho since p-value > α
We do not have enough evidence to claim the slope of the linear regression equation is not zero. The
linear regression line does not appear to be a useful way to predict age based on percentage of root with
transparent dentine.
h) .310
i) (−.343, 1.453) Note: zero is part of the interval
3. b) ŷ = 1.081x + 21.267
c) r = .983 suggests a strong positive linear correlation between age 5 height and age 18 height.
r² = .967 states that 96.7% of the total variation in height at age 18 is associated with the variation
in height at age 5.
d) The residual plot has no pattern and supports a linear relationship, although there are so few points
that a pattern may still be present. (I just tried to save you the time of entering in more data.)
e) Slope: 1.081 means that the predicted height at age 18 increases by 1.081 inches for each additional
inch of height at age 5.
y-intercept: 21.267 is how tall a child would be predicted to be at 18 years if they were 0 inches tall at
5 years. This doesn’t really make sense in this problem. Note that most babies are about 21 inches long at birth.
f) 68.8 inches
g) conditions: * The scatterplot is linear. Residual plot appears random. (Not really, so the
conclusion may be suspect.)
* Independence is reasonable since the height of one child should not depend on another.
* The spread about the line is nearly constant, which supports about equal variability of y at each
x value. No fanning effect.
* The distribution of residuals is close enough to normal to continue. Axes must have labels and scale.
(window: x-scale = .2)
Ho: β = 0
Ha: β ≠ 0
t = 9.378
df = 3
critical values: t = ±3.182
p-value = .00257
se = .4782
reject Ho since p-value < α
We have enough evidence to claim that the slope of the linear regression equation is not zero. The
linear regression equation appears to be useful in predicting male height at 18 years given male height at 5
years. The height of a male at age 18 years increases as their height at age 5 years increases. Of course, the
equation becomes useless after that point (extrapolation).
h) .1153
i) (.714, 1.448)
4. a) A correlation coefficient of 1 or −1 means the points form a straight line. A correlation coefficient of
2 makes no sense since r is always between −1 and 1 (inclusive). A correlation coefficient of zero would
indicate no linear relationship between the variables.
b) To find the least squares regression line, one would calculate the deviation score from each point to
the regression line (residual), square it, and then find the sum of all the squares. Any other line drawn
through the points would give a different sum. The line of best fit is the one with the smallest (least)
sum of the squared deviation (residual) scores.
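The "least" in least squares can be illustrated numerically. This short Python sketch (the function name and the two comparison lines are arbitrary choices made for illustration) uses the question 3 heights and compares the sum of squared residuals for the answer-key line with two nearby lines.

```python
def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical deviations of the points from the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Question 3 data: height (inches) at age 5 and at age 18
age5 = [43, 41, 42, 45, 46]
age18 = [68, 66, 66, 70, 71]

# Answer-key line (rounded coefficients) and two arbitrary alternatives
sse_best = sum_squared_residuals(age5, age18, 21.267, 1.081)
sse_steeper = sum_squared_residuals(age5, age18, 21.267, 1.10)
sse_shifted = sum_squared_residuals(age5, age18, 22.267, 1.081)

print(f"least squares line: {sse_best:.3f}")
print(f"steeper line:       {sse_steeper:.3f}")
print(f"shifted line:       {sse_shifted:.3f}")
```

Any change to the slope or intercept makes the sum larger, which is what identifies the least squares line as the line of best fit.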
5. a) −1.5; The point is below the linear regression line.
b) 22.8
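Both parts of question 5 follow from the definition residual = actual y − predicted ŷ. A minimal Python sketch of the arithmetic (the names `predict`, `residual_a`, and `actual_y_b` are illustrative):

```python
def predict(x):
    """Least squares line from question 5: y-hat = 2.4 + 3.7x."""
    return 2.4 + 3.7 * x

# a) residual = actual y - predicted y for the point (3, 12)
residual_a = 12 - predict(3)

# b) actual y = predicted y + residual, with x = 5 and residual = 1.9
actual_y_b = predict(5) + 1.9

print(f"a) residual = {residual_a:.1f}")
print(f"b) actual y = {actual_y_b:.1f}")
```

A negative residual means the actual point lies below the regression line.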
6. a) ŷ = 8.892 + 16.650x
b) 25.54
c) .844
d) .919
f) Yes. Since the p-value is approximately 0 (reported as 0.000), we would reject Ho, which supports the
usefulness of the regression equation.
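The answers to question 6 can be read off or recomputed from the MINITAB output. The sketch below is a rough check in Python, assuming the values in the printed output; the variable names are illustrative.

```python
import math

# Values copied from the MINITAB output in question 6
coef_const, coef_popn = 8.892, 16.650
stdev_popn = 1.910
r_sq = 0.844
ss_regression, ss_error = 2260.5, 416.4
df_regression, df_error = 1, 14

# b) estimated mean ozone level for a population of 1 million
mean_ozone_at_1 = coef_const + coef_popn * 1

# d) r is the positive square root of R-sq because the slope is positive
r = math.sqrt(r_sq)

# t-ratio for the slope and F statistic, recomputed from the table entries
t_ratio = coef_popn / stdev_popn
f_stat = (ss_regression / df_regression) / (ss_error / df_error)

# c) R-sq also equals SS(Regression) / SS(Total)
r_sq_from_anova = ss_regression / (ss_regression + ss_error)

print(f"b) {mean_ozone_at_1:.3f}   d) r = {r:.3f}")
print(f"t = {t_ratio:.2f}, F = {f_stat:.2f}, R-sq = {r_sq_from_anova:.3f}")
```

With the unrounded coefficients, part b) comes to about 25.54.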
7. An influential point is horizontally separated from the bulk of the data and strongly influences the
slope of the line and the correlation coefficient. An outlier is separated from the bulk of the data
vertically.
8. 0.0465
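The answer to question 8 comes from a normal probability: under the simple linear regression model, y at a fixed x is assumed to be normally distributed about the true line with standard deviation σ. A minimal Python sketch using the standard library's `statistics.NormalDist` (variable names are illustrative):

```python
from statistics import NormalDist

# Mean IQ at a 20-degree ear angle from the true regression line y = 163 - .62x
mean_iq = 163 - 0.62 * 20  # 150.6

# IQ at this angle is modeled as normal with that mean and sigma = 5
iq_dist = NormalDist(mu=mean_iq, sigma=5)

# Proportion with IQ above 159 (z = (159 - 150.6) / 5 = 1.68)
proportion_above_159 = 1 - iq_dist.cdf(159)

print(f"{proportion_above_159:.4f}")
```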
9. D
10. B
11. D
12. C
13. B and maybe C
14. a) ŷ = 2.01 + .437(ln x)
b) ŷ = 2.01 + .437(ln x)
c) 2.01
15. a) nope, it’s quadratic
b) ŷ = 1.0268x² – 5.93x + 13.511
c) 9.53
16. a) ln ŷ = –.106x + 2.635
b) ŷ = 13.942(.9)^x
c) 4.861
17. a&b) I tried taking the ln of both x and y for a power function but it didn’t help much so I’m thinking
a piecewise function as follows: if x<6 then ŷ = 2.0475x – 1.951 ; if x≥6 then
ŷ = 12.5x – 65.5. I used the point (6,10) in both equations. So ŷ = 2.144 when x = 2.
Nonlinear Notes
Exponential: (x, y) → (x, ln y)
Logarithmic: (x, y) → (ln x, y)
Power: (x, y) → (ln x, ln y)
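As a rough illustration of these transformations, the Python sketch below fits a least squares line to transformed data for question 16 (exponential) and question 14 (logarithmic). The helper `fit_line` and all variable names are hypothetical choices for this example, and only the standard library is used.

```python
import math

def fit_line(u, v):
    """Least squares fit of v on u; returns (intercept, slope)."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    suu = sum((ui - u_bar) ** 2 for ui in u)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    slope = suv / suu
    return v_bar - slope * u_bar, slope

# Exponential model, question 16: regress ln y on x, then back-transform
x16 = [2, 7, 12, 16, 21, 30, 40]
y16 = [11, 6.7, 4, 2.6, 1.5, 0.6, 0.2]
a16, b16 = fit_line(x16, [math.log(y) for y in y16])
pred16 = math.exp(a16 + b16 * 10)  # prediction at x = 10

# Logarithmic model, question 14: regress y on ln x
x14 = [0.1, 0.2, 0.3, 0.6, 0.8, 1.5, 2]
y14 = [1, 1.3, 1.5, 1.8, 1.9, 2.2, 2.3]
a14, b14 = fit_line([math.log(x) for x in x14], y14)
pred14 = a14 + b14 * math.log(1)  # prediction at x = 1, where ln 1 = 0

print(f"16: ln y-hat = {b16:.3f}x + {a16:.3f}, y-hat(10) = {pred16:.2f}")
print(f"14: y-hat = {a14:.2f} + {b14:.3f} ln x, y-hat(1) = {pred14:.2f}")
```

The prediction for question 16 computed this way is slightly lower than 4.861 because the answer key rounds the base e^(–.106) to .9 before raising it to the 10th power.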
Formulas for Inference with Slope
t = b / s_b
Confidence Interval: b ± t* · s_b
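These formulas can be applied directly to raw data. The Python sketch below works through question 3, assuming the standard relationships s_e = √(SSE / (n − 2)) and s_b = s_e / √(Σ(x − x̄)²); the function name `slope_inference` is illustrative, and the critical value t* = 3.182 (df = 3, 95%) is taken from the answer key rather than computed.

```python
import math

def slope_inference(x, y, t_star):
    """Return (s_e, s_b, t, confidence interval) for the regression slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Standard deviation of the residuals, with n - 2 degrees of freedom
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    # Standard error of the slope
    s_b = s_e / math.sqrt(sxx)
    t = b / s_b
    margin = t_star * s_b
    return s_e, s_b, t, (b - margin, b + margin)

# Question 3 data: height (inches) at age 5 and at age 18
age5 = [43, 41, 42, 45, 46]
age18 = [68, 66, 66, 70, 71]

s_e, s_b, t, interval = slope_inference(age5, age18, t_star=3.182)
print(f"se = {s_e:.4f}, sb = {s_b:.4f}, t = {t:.3f}")
print(f"95% CI for the slope: ({interval[0]:.3f}, {interval[1]:.3f})")
```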