Answers to Homework 4

© 2016, Jeffrey S. Simonoff

1. (a) Let $p$ be the true probability of a home having a garage. The observed proportion of homes with a garage is $\hat{p} = .51$, and a 90% confidence interval for $p$ has the form

$$\hat{p} \pm z_{.05} \sqrt{\hat{p}(1 - \hat{p})/n}.$$

Substituting in the observed values gives

$$.51 \pm (1.645)\sqrt{(.51)(.49)/104} = .51 \pm .081.$$

The exact interval, which we can get from Minitab, is (0.424780, 0.594031), which is unsurprisingly not very different from the standard interval.

(b) We can actually just get this from Minitab:

    One-Sample T: Living space

    Variable        N    Mean   StDev  SE Mean  95% CI
    Living space  104  1.5664  0.5585   0.0548  (1.4578, 1.6750)

Of course, this is based on the usual formula $\bar{X} \pm t_{.025}^{n-1} s/\sqrt{n}$. This uses the exact t-critical value of 1.98326 (based on 103 degrees of freedom), but if you used 1.984 (based on df = 100) you would get the virtually identical answer (1.4577, 1.6751). This is in thousands of square feet, of course, so the interval is (1458, 1675) square feet.

(c) We now want a prediction interval, $\bar{X} \pm t_{.025}^{103} s\sqrt{1 + 1/n}$. A prediction interval for living space is thus

$$1.5664 \pm (1.983)(.5585)\sqrt{1 + 1/104} = 1.5664 \pm 1.1128 = (0.4536, 2.6792),$$

or (454, 2679) square feet.

(d) We are assuming that these data can be viewed as a random sample from a normal distribution. This is of course not at all the case here, as a histogram of living spaces shows a noticeable right tail:

[Figure: histogram of living space]

This probably has not had a strong effect on the confidence interval, as with a sample this large from a not-greatly-nonnormal distribution the Central Limit Theorem has likely taken over, but it has likely affected the prediction interval. The obvious thing to do to try to fix the prediction interval is to take logs, construct the interval in the logged scale, and then exponentiate the interval endpoints to get back to the original scale.
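The log-scale procedure just described can be sketched as follows. This is only an illustration under stated assumptions, not the Minitab workflow used for these answers: the function name `log_prediction_interval` and the sample values are hypothetical, base-10 logs are assumed, and the t critical value is passed in by the caller because the Python standard library has no t quantile function.

```python
import math
import statistics


def log_prediction_interval(values, t_crit):
    """Prediction interval built in the log10 scale, then back-transformed.

    values : strictly positive observations (the log of 0 is undefined)
    t_crit : t critical value for n - 1 degrees of freedom, supplied by the caller
    """
    logs = [math.log10(v) for v in values]
    n = len(logs)
    mean_log = statistics.mean(logs)
    sd_log = statistics.stdev(logs)  # sample standard deviation (n - 1 divisor)
    half_width = t_crit * sd_log * math.sqrt(1 + 1 / n)
    lower = 10 ** (mean_log - half_width)  # antilog of the lower endpoint
    upper = 10 ** (mean_log + half_width)  # antilog of the upper endpoint
    typical = 10 ** mean_log               # geometric mean of the original values
    return lower, typical, upper


# Hypothetical living spaces in thousands of square feet (NOT the homework data).
sample = [0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.7, 1.9, 2.4, 3.3]
T_CRIT_9DF_95 = 2.262  # two-sided 95% t critical value for 9 degrees of freedom

low, typical, high = log_prediction_interval(sample, T_CRIT_9DF_95)
print(f"95% prediction interval: ({low:.3f}, {high:.3f}); geometric mean {typical:.3f}")
```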
Working in the logged scale will probably work better, since logged living spaces are certainly closer to normally distributed:

[Figure: histogram of logged living space]

This output gives us the sample mean and standard deviation of the logged variable:

    Descriptive Statistics: Logged living space

    Variable               N    Mean  SE Mean   StDev  Minimum      Q1  Median      Q3  Maximum
    Logged living space  104  0.1687   0.0150  0.1532  -0.2807  0.0804  0.1806  0.2637   0.6053

The prediction interval in the logged scale is

$$0.1687 \pm (1.983)(.1532)\sqrt{1 + 1/104} = 0.1687 \pm 0.3053 = (-0.1366, 0.4740).$$

Antilogging the two endpoints (the logs are base 10) gives the prediction interval (730, 2979) square feet, which is shifted up at both ends relative to the interval in part (c) (with an estimate of the “typical” living space being the geometric mean, 1475 square feet).

2. (a) Let $p$ be the probability that a TV has HDR. The observed proportion of TVs with HDR is $\hat{p} = .44$, and a 95% confidence interval for $p$ has the form

$$\hat{p} \pm z_{.025} \sqrt{\hat{p}(1 - \hat{p})/n}.$$

Substituting in the observed values gives

$$.44 \pm (1.96)\sqrt{(.44)(.56)/50} = .44 \pm .138.$$

The exact interval, which we can get from Minitab, is (0.299907, 0.587456), which is not very different from the approximate interval.

(b) The standard confidence interval is available from Minitab:

    One-Sample T: Price

    Variable   N  Mean  StDev  SE Mean  99% CI
    Price     50  1052    972      137  (684, 1420)

(c) The prediction interval is $\bar{X} \pm t_{.005}^{49} s\sqrt{1 + 1/n}$, or

$$1052 \pm (2.68)(972)(1.01) = 1052 \pm 2631 = (-1579, 3683).$$

(d) This is a poor prediction interval, of course, as the lower end is less than zero. The reason for this is clear from a histogram of the variable:

[Figure: histogram of price]

The variable is right-tailed, making the prediction interval invalid (the confidence interval in part (b) is possibly also invalid, since the sample is possibly too small to appeal to a Central Limit Theorem argument, but it’s difficult to know for sure). The natural fix to consider is to take logs, construct the prediction interval in the log scale, and then antilog back to the original scale.
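As a check on the arithmetic, the prediction intervals computed so far can be reproduced from the reported summary statistics. The sketch below is only illustrative: the helper name `prediction_interval` is hypothetical, the t critical values are the rounded ones quoted in the answers, and the logs are assumed to be base 10 (which is consistent with the geometric mean quoted for Problem 1).

```python
import math


def prediction_interval(mean, sd, n, t_crit):
    """Normal-theory prediction interval: mean +/- t_crit * sd * sqrt(1 + 1/n)."""
    half_width = t_crit * sd * math.sqrt(1 + 1 / n)
    return mean - half_width, mean + half_width


# Problem 1(c): living space in thousands of square feet, raw scale.
living_raw = prediction_interval(1.5664, 0.5585, 104, 1.983)

# Problem 1, log scale: build the interval on the logs, then antilog the endpoints.
log_low, log_high = prediction_interval(0.1687, 0.1532, 104, 1.983)
living_back = (10 ** log_low, 10 ** log_high)  # assumes base-10 logs

# Problem 2(c): price, raw scale.
price_raw = prediction_interval(1052, 972, 50, 2.68)

print("living space, raw scale:", tuple(round(v, 4) for v in living_raw))
print("living space, via logs: ", tuple(round(v, 4) for v in living_back))
print("price, raw scale:       ", tuple(round(v) for v in price_raw))
```

The raw-scale price interval has a negative lower endpoint, whereas an interval built in the log scale and then antilogged is necessarily positive at both ends.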
Taking logs should work better, although interestingly enough the distribution of logged prices is noticeably short-tailed:

[Figure: histogram of logged price]

    Descriptive Statistics: Logged price

    Variable       N    Mean  SE Mean   StDev  Minimum      Q1  Median      Q3  Maximum
    Logged price  50  2.8250   0.0620  0.4385   2.1139  2.4472  2.8451  3.1761   3.6021

The prediction interval in the logged scale is

$$2.825 \pm (2.68)(.4385)\sqrt{1 + 1/50} = 2.825 \pm 1.187 = (1.638, 4.012).$$

Antilogging the two endpoints gives the prediction interval (43.5, 10280.2) (with an estimate of the “typical” price being the geometric mean, 668). Note that this seems too wide given the actual range of prices, and that is because the observed logged prices are shorter-tailed than a normal random variable.

3. (a) We are looking for a prediction interval, which takes the form $\bar{X} \pm t_{.005}^{67} s\sqrt{1 + 1/n}$, or

$$10.3 \pm (2.65)(10.14)(1.008) = 10.3 \pm 27.1 = (-16.8, 37.4).$$

The interval goes into impossible negative values. In many years there is a small number of attacks, but some years have many more, so the distribution is long right-tailed. The solution of working in the logged scale is problematic here, since there are years with zero attacks, and the log of 0 is undefined.

(b) Let $p$ be the true probability that there will be a fatality from an unprovoked shark attack in a given year in Florida. The observed proportion of years with a fatality is $\hat{p} = 14/68 = .206$, and a 99% confidence interval for $p$ has the form

$$\hat{p} \pm z_{.005} \sqrt{\hat{p}(1 - \hat{p})/n}.$$

Substituting in the observed values gives

$$.206 \pm (2.58)\sqrt{(.206)(.794)/68} = .206 \pm .127.$$

The exact interval, which we can get from Minitab, is (0.096629, 0.357842). We are assuming for the approximate interval that the sample size is large enough to appeal to the normal approximation to the binomial, which is a little off (we can see that from the difference between the exact and approximate intervals).
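The difference between the approximate and exact intervals in part (b) can be examined numerically. The sketch below assumes that Minitab's exact interval is the Clopper-Pearson interval; the function names are hypothetical, and the binomial tail probabilities are inverted by plain bisection rather than by any library routine.

```python
import math
from statistics import NormalDist


def wald_interval(x, n, conf):
    """Approximate interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def upper_tail(k, n, p):
    """P(X > k) for X ~ Binomial(n, p); increasing in p."""
    cdf = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1 - cdf


def solve_increasing(g, target, iters=60):
    """Bisection on [0, 1] for p with g(p) = target, where g is increasing."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, conf):
    """Exact interval (assumed here to match Minitab's 'exact' method)."""
    alpha = 1 - conf
    # Lower limit: P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: upper_tail(x - 1, n, p), alpha / 2)
    # Upper limit: P(X <= x | p) = alpha / 2, i.e. P(X > x | p) = 1 - alpha / 2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: upper_tail(x, n, p), 1 - alpha / 2)
    return lower, upper


# Problem 3(b): 14 of 68 years with a fatality, 99% confidence.
approx = wald_interval(14, 68, 0.99)
exact = clopper_pearson(14, 68, 0.99)
print("approximate:", tuple(round(v, 4) for v in approx))
print("exact:      ", tuple(round(v, 4) for v in exact))
```

Under these assumptions the exact interval sits higher than the approximate one at both ends, which reflects the asymmetry of the binomial distribution when the observed proportion is well below .5.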
More seriously, we are assuming that the probability of at least one fatality is the same for all years, and that whether or not a fatality occurs in any year is independent of that in any other year. Neither of these conditions is likely to hold. The number of tourists and the population of Florida have both grown tremendously in the past 67 years, meaning that attacks (and therefore fatal attacks) are far more likely now (implying increasing $p$). On the other hand, there is much better understanding of shark attacks now, which would lead to fewer (fatal) attacks (implying decreasing $p$). We could also expect that if a fatal attack occurs in one year, people might react by staying out of the ocean or being more careful in the next year, implying a lack of independence in the occurrence of a fatal attack from one year to the next.

Note that the fact that the number of unprovoked attacks and whether or not there is a fatal unprovoked attack are not independent of each other is not directly a violation of assumptions; it is only relevant in the sense given above (that the number of unprovoked attacks has changed over the years, and that in turn changes the probability of a fatal unprovoked attack in a given year).