Chapter 9 Exercises

1. Suppose X is a variable that follows the normal distribution with known standard deviation σ = 0.3 but unknown mean µ.
(a) Construct a 95% confidence interval for µ if a random sample of n = 16 observations of X has sample mean x̄ = 5.
(b) Suppose that we want the entire width of the confidence interval to be equal to 0.04. Find the sample size n needed.

2. A sample of size n = 100 of a variable Y is taken. The sample mean of these 100 observations is found to be ȳ = 1450. Assume that the population standard deviation is σ = 50.
(a) Construct a 95% confidence interval for µ, the population mean of Y.
(b) What sample size is needed so that the length of the interval is 10 with 95% confidence?

3. Five observations of a variable W are taken: 680, 705, 690, 783, and 702. Construct a 95% confidence interval for µ, the population mean of W. State any assumptions needed for this confidence interval to be valid.

4. In a rural area of a developing country, a survey is conducted to estimate the proportion, p, of households that have access to clean water. Out of the 1000 households surveyed, 650 reported that they have clean water.
(a) Construct a 95% confidence interval for p and state all assumptions.
(b) Find the sample size needed so that the margin of error will be ±0.02 with confidence level 95%.

5. A researcher wants to estimate the prevalence of a disease in a country. What sample size should be used if she desires to be 95% confident that the final estimate is within 0.05 of the true prevalence?

6. A student recorded the duration (T, in minutes) on 20 occasions when the course website was down:

5.9, 5.1, 5.7, 8.8, 10.2, 8.3, 3.5, 9.2, 8.5, 7.3,
19.6, 7.5, 0.3, 2.1, 2.0, 0.5, 0.9, 5.9, 0.4, 0.5

(a) The student wants to study the population mean duration of time that the website is down. She assumes that the data are normally distributed. Based on this assumption, find a 95% confidence interval for the mean duration.
(b) Suppose she is subsequently told that T actually follows an Exp(λ) distribution. Based on this new piece of information, find another 95% confidence interval.
(c) Compare the answers in (a) and (b) and discuss their differences.

7. The Farøes is a group of islands situated about halfway between Norway and Iceland. The islands have been a dependency of the Kingdom of Denmark since the 1300s. However, over the past few decades, there has been an increasing desire among inhabitants of the islands to seek independence from Denmark. A random sample of 1200 inhabitants is used in a survey, where each person gives their opinion (X) on whether the Farøe Islands should become an independent country. The survey results are as follows:

$$(x_1, \ldots, x_{1200}) = (\underbrace{0, 0, \ldots, 0}_{636\ \text{observations}},\ \underbrace{1, 1, \ldots, 1}_{564\ \text{observations}}),$$

where $x_i = 1$ if the i-th person supports independence and $x_i = 0$ otherwise. Suppose the observations are IID Bernoulli(p), where p represents the proportion of all inhabitants who want the islands to be independent.
(a) How many observations of X are there?
(b) Find the MLE, p̂, of p, based on the given data.
(c) Use the CLT to find a 95% confidence interval for p.
(d) What is the margin of error in your estimate?
(e) A local politician claims that there is enough evidence in the results to suggest that 50% of all inhabitants of the Farøe Islands want independence. Does your analysis shed some light on her comments?

8. The name of the Farøe Islands is derived from the word Føroyar, meaning “sheep”. Since Viking times, wool products have been of major importance for the subsistence of the islands. A particular farm owns a herd of sheep. The sheep are free to roam around the mountains surrounding the farm. The sheep are natural climbers and they can scale the steepest of slopes. However, occasionally, a sheep may be trapped and require rescue. Suppose that on 6 out of 120 days a sheep required rescue.
(a) Use the CLT to find a 95% confidence interval for p, the proportion of days when a sheep requires rescue. You may use some of the results you found in Question 1 to answer this question.
(b) What is the margin of error in your estimate?
(c) If it is desired to reduce the margin of error by a factor of 1/2, by how much does the sample size need to be increased? (That is, if $E_1$ is the margin of error under the current sample, then we want $E_2 = \frac{1}{2} E_1$ under the new sample size.)

9. Tourism accounts for a substantial part of the islands’ economy. Apart from the spectacular scenery and landscape, many visitors to the islands want to see the northern lights, or Aurora Borealis. Northern lights are displays of light formed from the collision of solar clouds and the Earth’s magnetic field and are best observed at night in the northern hemisphere. Let X be the duration (in minutes on the log-scale) of the display of lights on any particular occasion and suppose $X \sim N(\mu, \sigma^2)$, with $(\mu, \sigma^2)$ unknown. Suppose the durations (in minutes on the log-scale) of a random sample of 30 displays are recorded and they are:

5.3, 6.8, 5.1, 6.9, 6.2, 4.2, 5.7, 5.9, 3.7, 5.6,
6.0, 7.7, 3.4, 4.5, 5.9, 4.1, 4.5, 6.3, 5.8, 5.0,
7.1, 4.2, 5.4, 5.6, 5.3, 5.4, 7.9, 5.0, 4.9, 5.8

(a) Find the MLE, $(\hat\mu, \hat\sigma^2)$, of $(\mu, \sigma^2)$, based on the given data.
(b) Use the CLT to find a 95% confidence interval for µ.
(c) Express your 95% confidence interval from (b) in terms of minutes in its original scale.
(d) What should the sample size be if we want to reduce the margin of error (on the log-scale) to ±0.2?

10. A group of scientists studying global warming has arrived on the islands. They took 60 observations of the time (in days) between days when the temperature on the islands exceeded 10 degrees Celsius. Suppose the data are IID Exp(λ), where 1/λ represents the mean time between days with temperature exceeding 10 degrees Celsius. Suppose $\bar{x} = \frac{1}{60}\sum_{i=1}^{60} x_i = 19.8$ (in days).
(a) Find the MLE of λ, based on the given data.
(b) Use the CLT to find a 95% confidence interval for λ. You may use the fact that $\mathrm{var}(1/\bar{x}) \approx \lambda^2/n$ for reasonably large values of n. Note that in general $\mathrm{var}(1/\bar{x})$ does NOT equal $1/\mathrm{var}(\bar{x})$.
(c) Fifty years ago, λ = 0.04 (or mean time = 25 days). Does your analysis give evidence that λ has changed from 50 years ago?

11. Arguably the biggest industry on the islands is fishing and fish farming. Recently, fishermen have been complaining that their incomes are dropping due to competition and overfishing in the waters surrounding the islands. Data are collected to determine whether there is enough evidence to support the fishermen’s claims. The data consist of records of the catch X (in 1000 kg, same below) from m = 80 fishing trawlers five years ago and the catch Y from n = 70 fishing trawlers this year. Summary statistics of the data are given below:

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i = 175.3, \qquad \frac{1}{m}\sum_{i=1}^{m} (x_i - \bar{x})^2 = 1800.4,$$

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = 155.8, \qquad \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = 1182.8.$$

Assume $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$, where $(\mu_X, \sigma_X^2)$ and $(\mu_Y, \sigma_Y^2)$ are unknown. Furthermore, assume all data are independent of each other.
(a) Find the MLE of $\mu_X$ and $\mu_Y$ and hence find the MLE for $\mu_X - \mu_Y$, using the given data. You may use established results from Question 3.
(b) Let $\hat\mu_{X-Y}$ be the MLE of $\mu_X - \mu_Y$. Show that $\mathrm{var}(\hat\mu_{X-Y}) = \frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n}$. (Hint: Use the fact that $\mathrm{var}(\hat\mu_X) = \frac{\sigma_X^2}{m}$, $\mathrm{var}(\hat\mu_Y) = \frac{\sigma_Y^2}{n}$ and recall the rules of var(X + Y) in Chapter 5.)
(c) Use the CLT to find a 95% confidence interval for $\mu_X - \mu_Y$. Does your analysis give evidence that the amount of catch has depleted compared to five years ago?

ANSWERS

(1a) A 95% confidence interval is

$$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}} = 5 \pm 1.96\frac{0.3}{\sqrt{16}} = 5 \pm 0.147.$$

We are 95% confident that µ is between 4.853 and 5.147.

(b) The width of the confidence interval is

$$2 \times 1.96\frac{0.3}{\sqrt{n}}.$$

If we want the width to be no more than 0.04, then we find n such that

$$0.04 = 2 \times 1.96\frac{0.3}{\sqrt{n}} \;\Rightarrow\; n = \left(2 \times 1.96 \times \frac{0.3}{0.04}\right)^2 \approx 865.$$

(2a) A 95% confidence interval is

$$\bar{y} \pm 1.96\frac{\sigma}{\sqrt{n}} = 1450 \pm 1.96\frac{50}{\sqrt{100}} = 1450 \pm 9.8.$$

We are 95% confident that µ is between 1440.2 and 1459.8.

(b) The width of the confidence interval is

$$2 \times 1.96\frac{50}{\sqrt{n}}.$$

If we want the width to be no more than 10, then we find n such that

$$10 = 2 \times 1.96\frac{50}{\sqrt{n}} \;\Rightarrow\; n = \left(2 \times 1.96 \times \frac{50}{10}\right)^2 \approx 385.$$

(3) We may be able to consider a 95% confidence interval

$$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$$

if the sample comes from a normal distribution and σ is known. However, in this case, σ is unknown and, since the sample size n is small, we estimate σ using a sample estimate σ̂ and we replace 1.96 by a number from the t-table. Since n = 5, df = n − 1 = 4, the number we use is 2.776, hence a 95% confidence interval is

$$\bar{x} \pm 2.776\frac{\hat\sigma}{\sqrt{n}} \approx 712 \pm 2.776\frac{40.9}{\sqrt{5}} \approx 712 \pm 50.8,$$

where σ̂ is the sample standard deviation, $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ (this estimate is better than the alternative estimate $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, since n = 5 is quite small). Therefore, we are 95% confident that µ is between 661.2 and 762.8. The interval is valid under the assumption that the observations are a random sample from a normal distribution.

(4a) We assume there is a probability p that a household has access to clean water and that households have access to clean water independently of one another. A 95% confidence interval is

$$\hat{p} \pm 1.96\sqrt{\frac{p(1-p)}{n}}.$$

Since p is unknown, we estimate the margin of error using p̂, giving

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \frac{650}{1000} \pm 1.96\sqrt{\frac{\frac{650}{1000}\cdot\frac{350}{1000}}{1000}} = 0.65 \pm 0.0296.$$

We are 95% confident that p is between 0.62 and 0.68.

(b) The margin of error is

$$1.96\sqrt{\frac{p(1-p)}{n}}.$$

Since p is unknown, and the largest margin of error for a particular value of n occurs when p = 0.5, we find n such that

$$0.02 = 1.96\sqrt{\frac{0.5(1-0.5)}{n}} \;\Rightarrow\; n = \left(\frac{1.96}{0.02}\right)^2 (0.25) = 2401.$$

(5)
Since a prevalence is a proportion p, with 0 < p < 1 unknown, a confidence interval estimate has the form

$$\hat{p} \pm 1.96\sqrt{\frac{p(1-p)}{n}},$$

meaning that we are 95% confident that p lies within the margin of error, $1.96\sqrt{p(1-p)/n}$, of p̂. Using the same argument as in Question 4, we replace the unknown p by the value 0.5 that would lead to the largest margin of error, then we find n such that

$$0.05 = 1.96\sqrt{\frac{0.5(1-0.5)}{n}} \;\Rightarrow\; n = \left(\frac{1.96}{0.05}\right)^2 (0.25) \approx 385.$$

(6a) Based on the information, we may consider a 95% confidence interval

$$\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$$

if the sample comes from a normal distribution and σ is known. However, in this case, σ is unknown and, since the sample size n is small, we estimate σ using a sample estimate σ̂ and we replace 1.96 by a number from the t-table. For n = 20, df = n − 1 = 19, the number we use is 2.093, hence a 95% confidence interval is

$$\bar{x} \pm 2.093\frac{\hat\sigma}{\sqrt{n}} \approx 5.61 \pm 2.093\frac{4.71}{\sqrt{20}} \approx 5.61 \pm 2.20,$$

where σ̂ is a sample estimate of the population standard deviation. We use $\hat\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ here; alternatively, we could have used $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, but for small n the former is better. We are 95% confident that the mean is between 3.41 and 7.81.

(b) Assuming the observations follow an Exp(λ) distribution, the mean 1/λ can be estimated by 1/λ̂ = x̄. However, for an exponential distribution the standard deviation is also 1/λ; hence, we also use x̄ to estimate the standard deviation. So as long as the sample size is assumed to be “big”, an approximate 95% confidence interval is

$$\bar{x} \pm 1.96\frac{\bar{x}}{\sqrt{n}} \approx 5.61 \pm 1.96\frac{5.61}{\sqrt{20}} \approx 5.61 \pm 2.46.$$

We are 95% confident that the mean is between 3.15 and 8.07.

(c) Comparing (a) to (b), the main difference is the way the margin of error, $1.96\hat\sigma/\sqrt{n}$, is estimated. We aim to estimate that as well as possible. The estimate using σ̂ = x̄ is the MLE when the data follow an exponential distribution and hence (b) is better than (a) under that assumption.
Alternatively, σ̂ estimated by $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ is a simple sample standard deviation that requires no distributional assumption; furthermore, when the normality assumption holds, (a) gives a confidence interval with the correct level of confidence, and (a) is better than (b) because in that case σ̂ = x̄ is biased for σ. To conclude, we choose a confidence interval that utilizes the information that is given.

(7a) Each $x_i$, i = 1, ..., 1200, is an observation of X. Therefore, there are n = 1200 observations.

(b) Let p̂ be the MLE, then

$$\hat{p} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{564}{1200}.$$

(c) According to the CLT, in a random sample of size n, as long as n is large,

$$\hat{p} \sim N\left(p,\ \mathrm{var}(\hat{p}) = \frac{p(1-p)}{n}\right).$$

Therefore, using the CLT, a 95% confidence interval for p is

$$\hat{p} \pm 1.96\sqrt{\frac{p(1-p)}{n}} \approx \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \frac{564}{1200} \pm 1.96\sqrt{\frac{\frac{564}{1200}\left(1 - \frac{564}{1200}\right)}{1200}} = 0.47 \pm 0.0282.$$

(d) The margin of error is 0.0282.

(e) According to the 95% confidence interval, the level of support is between (0.47 − 0.0282, 0.47 + 0.0282) = (0.442, 0.498). Since the upper limit is less than 0.5, we can say that we are 95% certain that the politician is wrong.

(8a) The sample size is n = 120. Using the CLT, a 95% confidence interval for p is

$$\hat{p} \pm 1.96\sqrt{\frac{p(1-p)}{n}} \approx \hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \frac{6}{120} \pm 1.96\sqrt{\frac{\frac{6}{120}\left(1 - \frac{6}{120}\right)}{120}} = 0.05 \pm 0.0389953.$$

(b) The margin of error is 0.0389953.

(c) Let m be the new sample size, so we want

$$\underbrace{1.96\sqrt{\frac{p(1-p)}{m}}}_{\text{new margin of error}} = \frac{1}{2}\,\underbrace{\left(1.96\sqrt{\frac{p(1-p)}{n}}\right)}_{\text{old margin of error}}$$

$$\Rightarrow\ \frac{1}{\sqrt{m}} = \frac{1}{2}\cdot\frac{1}{\sqrt{n}} \ \Rightarrow\ \frac{1}{m} = \frac{1}{4}\cdot\frac{1}{n} \ \Rightarrow\ m = 4n.$$

The answer shows that the new sample size should be 480 = 4 × 120. Therefore, we need 4 times the original sample size to reduce the margin of error by a factor of 1/2. The general rule is, for a reduction of every factor of 1/2 in the margin of error, we require a 4-fold increase in the sample size.
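The 4-fold rule can be checked numerically. The following is a minimal Python sketch, not part of the original answers; the helper name `proportion_margin` is hypothetical, and the estimate p̂ = 6/120 is held fixed while the sample size changes, which is an assumption made purely for illustration.

```python
import math

def proportion_margin(p_hat, n, z=1.96):
    """CLT-based margin of error for a proportion: z * sqrt(p_hat * (1 - p_hat) / n)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 6 / 120                       # estimate from Question 8
e1 = proportion_margin(p_hat, 120)    # margin of error with the current sample
e2 = proportion_margin(p_hat, 480)    # margin of error with 4 times the sample size

print(f"E1 = {e1:.7f}")
print(f"E2 = {e2:.7f}")
print(f"E2 / E1 = {e2 / e1:.3f}")     # ratio of new to old margin of error
```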
For example, if we want to reduce the margin of error by a factor of 1/16, then since 16 = 2 × 2 × 2 × 2, we need to increase the sample size by 4 × 4 × 4 × 4 = 256 times.

(9a) Let $x_1, \ldots, x_n$ be IID $N(\mu, \sigma^2)$. The estimates $(\hat\mu, \hat\sigma^2)$ are:

$$\hat\mu = \bar{x} = 5.506667, \qquad \hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 1.202023,$$

so in terms of minutes, the mean duration is exp(5.506667), or about 246 minutes. (Strictly, the MLE of σ² uses the divisor n rather than n − 1, which would give about 1.162; the n − 1 version is the one used in the parts below.)

(b) Using the CLT,

$$\hat\mu \sim N\left(\mu,\ \mathrm{var}(\hat\mu) = \frac{\sigma^2}{n}\right).$$

Therefore, if we use the value from the t-table based on df = n − 1 = 30 − 1 = 29, a 95% confidence interval for µ is

$$\hat\mu \pm 2.045\sqrt{\frac{\sigma^2}{n}} \approx \hat\mu \pm 2.045\sqrt{\frac{\hat\sigma^2}{n}} = 5.506667 \pm 2.045\sqrt{\frac{1.202023}{30}} = 5.506667 \pm 0.4093446.$$

(c) From (b), the 95% confidence interval on the log-scale can be written as

(5.506667 − 0.4093446, 5.506667 + 0.4093446),

which, in terms of minutes, is

[exp(5.506667 − 0.4093446), exp(5.506667 + 0.4093446)] ≈ (163.6, 370.9).

(d) The expression for the margin of error (on the log-scale) is

$$2.045\sqrt{\frac{\sigma^2}{n}} \approx 0.4093446,$$

using the sample size of n = 30 and estimating σ² by σ̂² = 1.202023. To reduce the margin of error to ±0.2, we can approximate the new sample size by setting

$$2.045\sqrt{\frac{\sigma^2}{n}} = 0.2$$

and solving for n. The above equation gives:

$$\sqrt{\frac{\sigma^2}{n}} = \frac{0.2}{2.045} \ \Rightarrow\ \frac{\sigma^2}{n} = \left(\frac{0.2}{2.045}\right)^2 \ \Rightarrow\ \frac{1}{n} = \frac{1}{\sigma^2}\left(\frac{0.2}{2.045}\right)^2 \ \Rightarrow\ n = \frac{\sigma^2}{\left(\frac{0.2}{2.045}\right)^2}$$

$$\Rightarrow\ n \approx \frac{\hat\sigma^2}{\left(\frac{0.2}{2.045}\right)^2} = \frac{1.202023}{\left(\frac{0.2}{2.045}\right)^2} = 125.7 \approx 126.$$

(10a) The MLE of λ is:

$$\hat\lambda = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}} = \frac{1}{19.8} = 0.05050.$$

(b) Using the CLT, $\hat\lambda \sim N(\lambda, \mathrm{var}(\hat\lambda))$. Assuming n is large enough, and since we are using an MLE,

$$\mathrm{var}(\hat\lambda) \approx \frac{1}{n}\lambda^2.$$

But λ² is unknown, so we estimate the variance by $\frac{1}{n}\hat\lambda^2$. Therefore, using the CLT, a 95% confidence interval for λ is

$$\hat\lambda \pm 1.96\sqrt{\frac{\lambda^2}{n}} \approx \hat\lambda \pm 1.96\sqrt{\frac{\hat\lambda^2}{n}} = \frac{1}{19.8} \pm 1.96\sqrt{\frac{1/19.8^2}{60}} = 0.05050 \pm 0.01278.$$

(c) According to the 95% confidence interval, λ is between (0.05050 − 0.01278, 0.05050 + 0.01278) = (0.03773, 0.06328). Since the interval includes 0.04, we cannot say that the rate is different from 50 years ago.
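The calculation in (10a) to (10c) can be reproduced with a few lines of Python. This is a minimal sketch under the same approximation var(λ̂) ≈ λ̂²/n used above; the function name `exp_rate_ci` is hypothetical and is not part of the original answers.

```python
import math

def exp_rate_ci(n, xbar, z=1.96):
    """Approximate CLT-based interval for the rate of an Exp(lambda) sample.

    Uses lambda_hat = 1 / xbar and the plug-in variance lambda_hat**2 / n.
    """
    lam_hat = 1.0 / xbar
    margin = z * lam_hat / math.sqrt(n)
    return lam_hat, margin, (lam_hat - margin, lam_hat + margin)

# Values from Question 10: n = 60 observations with sample mean 19.8 days.
lam_hat, margin, (lower, upper) = exp_rate_ci(n=60, xbar=19.8)

print(f"lambda_hat = {lam_hat:.5f}, margin of error = {margin:.5f}")
print(f"approximate 95% interval: ({lower:.5f}, {upper:.5f})")
print("0.04 lies inside the interval:", lower <= 0.04 <= upper)
```

Because the interval is built from a normal approximation, it is only as reliable as the assumption that n = 60 is large enough for the CLT to apply.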
(11a) From Question 9, we know the MLE of $(\mu_X, \sigma_X^2)$ based on $(x_1, \ldots, x_m)$ are

$$\hat\mu_X = \bar{x} = 175.3, \qquad \hat\sigma_X^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \bar{x})^2 = 1800.4.$$

Similarly, the MLE of $(\mu_Y, \sigma_Y^2)$ based on $(y_1, \ldots, y_n)$ are

$$\hat\mu_Y = \bar{y} = 155.8, \qquad \hat\sigma_Y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 = 1182.8.$$

Therefore, an estimate for $\mu_X - \mu_Y$ is

$$\hat\mu_X - \hat\mu_Y = 175.3 - 155.8 = 19.5.$$

(b) Recall that in Chapter 5 we learned that, for independent random variables X and Y, var(X − Y) = var(X) + var(Y). Hence

$$\begin{aligned}
\mathrm{var}(\hat\mu_{X-Y}) &= \mathrm{var}(\hat\mu_X - \hat\mu_Y) \\
&= \mathrm{var}(\hat\mu_X) + \mathrm{var}(\hat\mu_Y) && \text{(X's and Y's are independent samples)} \\
&= \mathrm{var}(\bar{x}) + \mathrm{var}(\bar{y}) \\
&= \frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n} && \text{(from Question 3)}.
\end{aligned}$$

(c) Using the CLT for MLE,

$$\hat\mu_{X-Y} \sim N\left(\mu_X - \mu_Y,\ \mathrm{var}(\hat\mu_{X-Y})\right) = N\left(\mu_X - \mu_Y,\ \frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n}\right),$$

where the last result comes from (b). Therefore, using the CLT, a 95% confidence interval for $\mu_X - \mu_Y$ is

$$\begin{aligned}
\hat\mu_X - \hat\mu_Y \pm 1.96\sqrt{\frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n}} &\approx \hat\mu_X - \hat\mu_Y \pm 1.96\sqrt{\frac{\hat\sigma_X^2}{m} + \frac{\hat\sigma_Y^2}{n}} \\
&= 19.5 \pm 1.96\sqrt{\frac{1800.4}{80} + \frac{1182.8}{70}} \\
&= 19.5 \pm 12.30.
\end{aligned}$$

According to the 95% confidence interval, the mean difference is between (19.5 − 12.3, 19.5 + 12.3) = (7.2, 31.8). Since the lower limit of the interval is above zero, we are 95% certain that the average catch has decreased by more than 7200 kg from five years ago. So the claims from the fishermen are supported.
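As a cross-check of (11c), the interval can be recomputed from the summary statistics alone. The sketch below is illustrative only; the function name `diff_means_ci` is hypothetical, and it assumes the variance estimates given in the question are plugged in directly, as in the answer above.

```python
import math

def diff_means_ci(xbar, var_x, m, ybar, var_y, n, z=1.96):
    """CLT-based interval for mu_X - mu_Y from two independent samples.

    var_x and var_y are the plug-in variance estimates; the standard
    error is sqrt(var_x / m + var_y / n).
    """
    diff = xbar - ybar
    margin = z * math.sqrt(var_x / m + var_y / n)
    return diff, margin, (diff - margin, diff + margin)

# Summary statistics from Question 11 (catch in units of 1000 kg).
diff, margin, (lower, upper) = diff_means_ci(
    xbar=175.3, var_x=1800.4, m=80,
    ybar=155.8, var_y=1182.8, n=70,
)

print(f"estimated difference = {diff:.1f}, margin of error = {margin:.2f}")
print(f"approximate 95% interval: ({lower:.1f}, {upper:.1f})")
print("interval lies entirely above zero:", lower > 0)
```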