Answers to Homework 4
1. (a) Let p be the true probability of a home having a garage. The observed proportion
is $\hat{p} = .51$, and a 90% confidence interval for p has the form
$$\hat{p} \pm z_{.05}\sqrt{\hat{p}(1 - \hat{p})/n}.$$
Substituting in the observed values gives
$$.51 \pm (1.645)\sqrt{(.51)(.49)/104} = .51 \pm .081.$$
The exact interval, which we can get from Minitab, is (0.424780, 0.594031), which
is unsurprisingly not very different from the standard interval.
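As a check on the arithmetic, the standard (large-sample) interval can be sketched in a few lines of Python. The function name `wald_interval` is made up for this illustration; the normal critical value comes from the standard library's `statistics.NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(p_hat, n, conf_level):
    """Large-sample interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    alpha = 1.0 - conf_level
    # Upper alpha/2 critical value of the standard normal (z_.05 for 90%).
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Values from part (a): observed proportion .51, n = 104, 90% confidence.
lower, upper = wald_interval(0.51, 104, 0.90)
print(f"({lower:.3f}, {upper:.3f})")
```

The same function applies to the other proportion questions by changing the three arguments.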
(b) We can actually just get this from Minitab:

One-Sample T: Living space

Variable        N    Mean   StDev  SE Mean  95% CI
Living space  104  1.5664  0.5585   0.0548  (1.4578, 1.6750)

Of course, this is based on the usual formula $\bar{X} \pm t^{n-1}_{.025}\, s/\sqrt{n}$. This uses the exact
t-critical value of 1.98326 (based on 103 degrees of freedom), but if you used 1.984
(based on df = 100) you will get the virtually identical answer (1.4577, 1.6751).
This is in thousands of square feet, of course, so this is (1458, 1675) square feet.
(c) We now want a prediction interval, $\bar{X} \pm t^{103}_{.025}\, s\sqrt{1 + 1/n}$. A prediction interval
for living space is thus
$$1.5664 \pm (1.983)(.5585)\sqrt{1 + 1/104} = 1.5664 \pm 1.1128 = (0.4536, 2.6792),$$
or (454, 2679) square feet.
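The difference between the confidence interval in (b) and the prediction interval in (c) is easy to see when both are computed from the same summary statistics. The following is a minimal sketch; `t_intervals` is a hypothetical helper name, and the t critical value is passed in from the text because the Python standard library has no t quantile function.

```python
from math import sqrt


def t_intervals(mean, sd, n, t_crit):
    """Return (confidence interval for the mean, prediction interval)."""
    # Confidence interval: the uncertainty is in the sample mean only.
    ci_half = t_crit * sd / sqrt(n)
    # Prediction interval: adds the variability of one new observation.
    pred_half = t_crit * sd * sqrt(1.0 + 1.0 / n)
    return (mean - ci_half, mean + ci_half), (mean - pred_half, mean + pred_half)


# Summary statistics for living space (thousands of square feet), df = 103.
ci, pred = t_intervals(mean=1.5664, sd=0.5585, n=104, t_crit=1.98326)
print("confidence interval:", tuple(round(v, 4) for v in ci))
print("prediction interval:", tuple(round(v, 4) for v in pred))
```

The prediction half-width is roughly $\sqrt{n+1}$ times the confidence half-width, which is why it is about ten times wider here.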
(d) We are assuming that these data can be viewed as a random sample from a normal
distribution. This is of course not at all the case here, as a histogram of living
spaces shows a noticeable right tail:
© 2016, Jeffrey S. Simonoff
This probably has not had a strong effect on the confidence interval, as with a
sample this large from a not-greatly-nonnormal distribution the Central Limit
Theorem has likely taken over, but this has likely affected the prediction interval.
The obvious thing to do to try to fix the prediction interval is to take logs,
construct the interval in the logged scale, and then exponentiate the interval
endpoints to get back to the original scale. This will probably work better, since
logged living spaces are certainly closer to normally distributed:
This output gives us the sample mean and standard deviation of the logged variable:
Descriptive Statistics: Logged living space

Variable               N    Mean  SE Mean   StDev  Minimum
Logged living space  104  0.1687   0.0150  0.1532  -0.2807

Variable                 Q1  Median      Q3  Maximum
Logged living space  0.0804  0.1806  0.2637   0.6053

The prediction interval in the logged scale is
$$0.1687 \pm (1.983)(.1532)\sqrt{1 + 1/104} = 0.1687 \pm 0.3053 = (-0.1366, 0.474).$$
Antilogging the two endpoints gives the
prediction interval (730, 2979) square feet, being shifted up at both ends (with an
estimate of the “typical” living space being the geometric mean 1475 square feet).
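The log-then-antilog recipe can be sketched as follows. The sketch assumes the logs are base 10, which is consistent with the reported numbers (for example, $10^{0.1687} \approx 1.475$); the function name `log10_prediction_interval` is made up for this illustration.

```python
from math import sqrt


def log10_prediction_interval(log_mean, log_sd, n, t_crit):
    """Build the interval in the log10 scale, then antilog the results."""
    half_width = t_crit * log_sd * sqrt(1.0 + 1.0 / n)
    lower = 10 ** (log_mean - half_width)
    upper = 10 ** (log_mean + half_width)
    typical = 10 ** log_mean  # geometric mean in the original scale
    return lower, typical, upper


# Summary statistics of logged living space; the original variable is in
# thousands of square feet, so multiply by 1000 to report square feet.
lower, typical, upper = log10_prediction_interval(0.1687, 0.1532, 104, 1.983)
print([round(1000 * v) for v in (lower, typical, upper)])
```

Because the endpoints are exponentiated, the resulting interval is not symmetric around the geometric mean, and its lower end can never be negative.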
2. (a) Let p be the probability that a TV has HDR. The observed proportion
is $\hat{p} = .44$, and a 95% confidence interval for p has the form
$$\hat{p} \pm z_{.025}\sqrt{\hat{p}(1 - \hat{p})/n}.$$
Substituting in the observed values gives
$$.44 \pm (1.96)\sqrt{(.44)(.56)/50} = .44 \pm .138.$$
The exact interval, which we can get from Minitab, is (0.299907, 0.587456), which
is not very different from the approximate interval.
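For readers without Minitab, an exact binomial interval can be sketched with the Clopper-Pearson construction, which is the method statistical packages commonly report as "exact". Two assumptions are made here: that this is the method behind the Minitab figures, and that the observed count is 22 (that is, .44 × 50). The helper names are made up for this illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target, lo=0.0, hi=1.0, iters=60):
    """Bisection for f(p) = target, where f is increasing on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x, n, conf_level):
    """Exact interval for p, given x successes in n trials."""
    alpha = 1.0 - conf_level
    # Lower limit solves P(X >= x | p) = alpha/2 (increasing in p).
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2.0)
    # Upper limit solves P(X <= x | p) = alpha/2, i.e. P(X > x | p) = 1 - alpha/2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - alpha / 2.0)
    return lower, upper


# Assumed count: 22 of the 50 TVs have HDR (.44 * 50), 95% confidence.
lower, upper = clopper_pearson(22, 50, 0.95)
print(f"({lower:.6f}, {upper:.6f})")
```

Bisection is used only to keep the sketch free of third-party libraries; a beta quantile function would give the same limits directly.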
(b) The standard confidence interval is available from Minitab:

One-Sample T: Price

Variable   N  Mean  StDev  SE Mean  99% CI
Price     50  1052    972      137  (684, 1420)

(c) The prediction interval is $\bar{X} \pm t^{49}_{.005}\, s\sqrt{1 + 1/n}$, or
$$1052 \pm (2.68)(972)(1.01) = 1052 \pm 2631 = (-1579, 3683).$$
(d) This is a poor prediction interval, of course, as the lower end is less than zero.
The reason for this is clear from a histogram of the variable:
The variable is right-tailed, making the prediction interval invalid (the confidence
interval in part (b) is possibly also invalid, since the sample is possibly too small
to appeal to a Central Limit Theorem argument, but it’s difficult to know for
sure). The natural fix to consider is to take logs, construct the prediction interval
in the log scale, and then antilog back to the original scale. This should work
better, although interestingly enough the distribution of logged prices is noticeably
short-tailed:
Descriptive Statistics: Logged price

Variable       N    Mean  SE Mean   StDev  Minimum
Logged price  50  2.8250   0.0620  0.4385   2.1139

Variable          Q1  Median      Q3  Maximum
Logged price  2.4472  2.8451  3.1761   3.6021

The prediction interval in the logged scale is
$$2.825 \pm (2.68)(.4385)\sqrt{1 + 1/50} = 2.825 \pm 1.187 = (1.638, 4.012).$$
Antilogging the two endpoints gives the prediction
interval (43.5, 10280.2) (with an estimate of the “typical” price being the
geometric mean 668). Note that this seems too wide given the actual range of
prices, and that is because the observed logged prices are shorter-tailed than a
normal random variable.
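The remark that the interval seems too wide can be checked against the Minimum and Maximum in the Minitab output, antilogged back to the price scale. This is a small sketch that assumes base-10 logs (consistent with the reported endpoints); the variable names are made up for this illustration.

```python
# Minimum and Maximum of logged price, from the descriptive statistics.
log_min, log_max = 2.1139, 3.6021
# Prediction interval endpoints in the log10 scale, from the text.
log_low, log_high = 1.638, 4.012

observed_range = (10 ** log_min, 10 ** log_max)
prediction_interval = (10 ** log_low, 10 ** log_high)

print("observed prices:     %.0f to %.0f" % observed_range)
print("prediction interval: %.1f to %.1f" % prediction_interval)

# Ratio of upper to lower end: how many-fold each range spans.
print("observed fold range: %.0f" % (observed_range[1] / observed_range[0]))
print("interval fold range: %.0f" % (prediction_interval[1] / prediction_interval[0]))
```

The interval extends well beyond both the smallest and the largest observed price, which is what one would expect when the data are shorter-tailed than the normal distribution the interval assumes.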
3. (a) We are looking for a prediction interval, which takes the form
$\bar{X} \pm t^{67}_{.005}\, s\sqrt{1 + 1/n}$, or
$$10.3 \pm (2.65)(10.14)(1.008) = 10.3 \pm 27.1 = (-16.8, 37.4).$$
The interval goes into impossible negative values. In many years there is a small
number of attacks, but some years have many more. The solution of working in
the logged scale is problematic here, since there are years with zero attacks, and
the log of 0 is undefined.
(b) Let p be the true probability that there will be a fatality from an unprovoked
shark attack in a given year in Florida. The observed proportion of years with a
fatality is $\hat{p} = 14/68 = .206$, and a 99% confidence interval for p has the form
$$\hat{p} \pm z_{.005}\sqrt{\hat{p}(1 - \hat{p})/n}.$$
Substituting in the observed values gives
$$.206 \pm (2.58)\sqrt{(.206)(.794)/68} = .206 \pm .127.$$
The exact interval, which we can get from Minitab, is (0.096629, 0.357842). We
are assuming for the approximate interval that the sample size is large enough to
appeal to the normal approximation to the binomial, which is a little off (we can
see that from the difference between the exact and approximate intervals). More
seriously, we are assuming that the probability of at least one fatality is the same
for all years, and whether or not a fatality occurs in any year is independent of
that in any other year. Neither of these conditions is likely to hold. The number of
tourists and the population of Florida have both grown tremendously in the past
67 years, meaning that attacks (and therefore fatal attacks) are far more likely now
(implying increasing p). On the other hand, there is much better understanding
of shark attacks now, which would lead to fewer (fatal) attacks (implying decreasing
p). We could also expect that if a fatal attack occurs in one year people might
react by staying out of the ocean or being more careful in the next year, implying
a lack of independence in the occurrence of a fatal attack from one year to the
next. Note that the fact that the number of unprovoked attacks and whether or
not there is a fatal unprovoked attack are not independent of each other is not
directly a violation of assumptions; it is only relevant in the sense given above
(that the number of unprovoked attacks has changed over the years, and that in
turn changes the probability of a fatal unprovoked attack in a given year).