Informatics - DTU 02402 Introduction To Statistics 2010-2-01 LFF/lff

Solution exam 15. December 2008

References to "Probability and Statistics for Engineers" are given in the order [8th edition, 7th edition].

Question 1. We wish to determine whether the two variances are significantly different at the 10% significance level. Under the null hypothesis, the ratio between the variances follows an F-distribution, cf. p. [273, 287]. The larger variance is put in the numerator, so that the test statistic becomes
\[ F = \frac{2.6726^2}{2.6458^2}. \]
The variance in the numerator is that for women, and is thus based on 8 subjects, while the variance in the denominator is based on 7 subjects. Hence we need to use the F-distribution with (8-1, 7-1) degrees of freedom. Correct answer is 5.

Question 2. Here we are concerned with a difference of means based on small samples. Hence we may use the box on p. [254, 266]. We have $n_1 = 7$, $n_2 = 8$, $s_1 = 2.6458$, $s_2 = 2.6726$, and find $t_{\alpha/2}(n_1 + n_2 - 2) = t_{0.025}(13) = 2.16$ (in R, qt(0.025, 13) returns the corresponding lower quantile, -2.16). Correct answer is 2.

Question 3. The mean and standard deviation of albumin content in women are given as 43.5 and 2.6726, respectively. Let $X$ denote the albumin in a single randomly chosen woman. Then $X \sim N(43.5, 2.6726^2)$. We find
\[ P(X > 48) = 1 - P(X \le 48). \]
In R, this can be found using pnorm as follows:
    > 1-pnorm(48, mean=43.5, sd=2.6726)
    [1] 0.04611464
Multiplying this probability by 100000 gives the answer. Correct answer is 1.

Question 4. From p. [281, 296] we get that
\[ n = p(1 - p) \left( \frac{z_{\alpha/2}}{E} \right)^2 \]
where $n$ is the sample size that we are seeking and $E$ is the allowed error, i.e. 1 in this case. We find $z_{\alpha/2}$:
    > qnorm(.01/2)
    [1] -2.575829
Only the magnitude 2.576 matters, since the value is squared. Finally, we use that $p(1 - p)$ is equal to the variance in a binomial distribution with one trial, such that $p(1 - p) = 2.65^2$. Correct answer is 1.

Question 5. Refer to p. [362, 406]. To find the sum of squared residuals, we need the mean square and the degrees of freedom (df). The mean square is given in the question.
The value of df is found by noting that there are 18 observations and 3 treatments ($N = 18$, $k = 3$). Thus df = 15. Then we find that the sum of squared errors (SSE) is
\[ SSE = MSE \times (N - k) = 20.04 \times 15. \]
Correct answer is 5.

Question 6. The test statistic follows an F-distribution with $(k - 1, N - k)$ degrees of freedom, cf. p. [362, 406]. Since $N = 18$ and $k = 3$, we have degrees of freedom (2, 15). Correct answer is 4.

Question 7. The definition of the p-value is given on p. [231, 248]. Small p-values indicate that the observed data are very unlikely if the null hypothesis is true. This leads to rejection of the null hypothesis. In ANOVA the null hypothesis is that all group means are equal. Thus the p-value in the output, 5.649e-07, is evidence that the means of at least two groups differ. Correct answer is 5.

Question 8. The estimators of $\alpha$ and $\beta$ are given on p. [304, 340]. We find
\[ b = \frac{S_{xy}}{S_{xx}} = \frac{325.20}{42.00 \cdot 6} \approx 1.29 \]
\[ a = \bar{y} - b \cdot \bar{x} = 13.1143 - 1.29 \cdot 9.0 \approx 1.50 \]
The definitions of $S_{xx}$, $S_{yy}$, and $S_{xy}$ are given on p. [304, 340]. Correct answer is 4.

Question 9. The estimate of $\sigma^2$ is given on p. [308, 343]. We calculate:
\[ \hat{\sigma} = \sqrt{\frac{S_{yy} - S_{xy}^2 / S_{xx}}{n - 2}} = \sqrt{\frac{6 \cdot 70.5381 - 325.2^2 / (6 \cdot 42)}{5}} \approx 0.844 \]
Correct answer is 1.

Question 10. The slope is the parameter $\beta$. The confidence interval for $\beta$ is given on p. [311, 346]. We also use the test statistic for $b$ with $\beta = 0$, which is given in the output as "t value", and denoted by $t$ here. Information concerning $b$ is given in the output in the line beginning with "x2".
\[ b \pm t_{\alpha/2}(n - 2) \cdot \frac{\hat{\sigma}}{\sqrt{S_{xx}}} = b \pm t_{\alpha/2}(n - 2) \cdot \frac{b}{t} = 5.4117 \pm 2.365 \cdot 0.2258 = [4.88; 5.95] \]
We get 2.365 from the t-distribution with 7 degrees of freedom (the absolute value of qt(0.05/2,7)). We find that 7 degrees of freedom should be used through the following reasoning: The residual standard error has 7 degrees of freedom (read in the output). The degrees of freedom for the residual standard error is $n - 2$, cf. p. [310, 346]. Thus there were 9 observations.
Since we need the t-distribution with $n - 2$ degrees of freedom, the t-distribution with 7 degrees of freedom is used. Correct answer is 3.

Question 11. The p-value of 5.59e-08 is clearly less than $\alpha = 0.1\%$. Since the observed data are more unlikely than the specified level of 0.1%, the null hypothesis is rejected. Correct answer is 5.

Question 12. We use the limits of prediction given on p. [314, 350]. $S_{xx}$ cannot be found directly in the output. Instead we use the relation
\[ t = \frac{b - \beta}{\hat{\sigma}} \sqrt{S_{xx}} \]
where $\beta$ is 0, $t$ is the t value given in the output for the slope, and $\hat{\sigma}$ is the residual standard error. We get
\[ t^2 = \frac{b^2 S_{xx}}{\hat{\sigma}^2} \Leftrightarrow S_{xx} = \frac{t^2 \hat{\sigma}^2}{b^2} \]
    > 23.972^2 * 3.497^2/5.4117^2
    [1] 239.9564
To get the prediction limits we need to use $t_{\alpha/2}(n - 2) = t_{0.025}(7) = 2.365$ (in R, qt(0.025, 7) returns the corresponding lower quantile, -2.365). We can now find the prediction limits:
\[ (a + b x_0) \pm t_{\alpha/2} \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \]
\[ = (5.5178 + 9 \cdot 5.4117) \pm 2.365 \cdot 3.497 \sqrt{1 + \frac{1}{9} + \frac{(9 - 8)^2}{239.96}} \]
\[ = (5.5178 + 9 \cdot 5.4117) \pm 2.365 \sqrt{3.497^2 \left( 1 + \frac{1}{9} + \frac{1}{240} \right)} \]
Correct answer is 1.

Question 13. We use a test of randomness, p. [455, 329]. First we find the median of the residuals. The residuals in increasing order are
    -4.11 -3.40 -3.16 -0.98 0.08 0.66 1.81 3.87 5.24
The median is 0.08. We now identify the runs. All numbers above the median are given the symbol a, and those below the symbol b. Those equal to the median are taken out of the sample, cf. example p. [457, 330]. The residuals in their original order are
    0.08 0.66 -3.16 1.81 -4.11 3.87 5.24 -0.98 -3.40
We identify the runs a b a b aa bb. Thus $u = 6$, $n_1 = 4$, and $n_2 = 4$.
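The run counting above can be sketched in R as follows. This is only an illustration, assuming the residuals are entered in their original order exactly as listed; the variable names (res, kept, signs, runs) are chosen here and do not appear in the exam.

```r
# Residuals in their original order, as listed in the solution
res <- c(0.08, 0.66, -3.16, 1.81, -4.11, 3.87, 5.24, -0.98, -3.40)

med  <- median(res)          # middle value of the nine residuals
kept <- res[res != med]      # values equal to the median are taken out

# Label each remaining residual: "a" above the median, "b" below
signs <- ifelse(kept > med, "a", "b")

# Run-length encoding groups consecutive identical symbols into runs
runs <- rle(signs)
u  <- length(runs$lengths)   # number of runs
n1 <- sum(signs == "a")      # number of a's
n2 <- sum(signs == "b")      # number of b's

print(paste(signs, collapse = " "))
print(c(u = u, n1 = n1, n2 = n2))
```

Using rle avoids counting sign changes by hand, which is where mistakes typically creep in for longer sequences.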
Calculate $\mu_u$ and $\sigma_u$:
\[ \mu_u = \frac{2 \cdot 4 \cdot 4}{4 + 4} + 1 = 5 \]
\[ \sigma_u = \sqrt{\frac{2 \cdot 4 \cdot 4 (2 \cdot 4 \cdot 4 - 4 - 4)}{(4 + 4)^2 (4 + 4 - 1)}} = \sqrt{\frac{32 \cdot 24}{64 \cdot 7}} = \sqrt{\frac{4 \cdot 3}{7}} = \sqrt{\frac{12}{7}} \approx 1.309 \]
Thus the test statistic becomes
\[ \frac{u - \mu_u}{\sigma_u} = \frac{6 - 5}{1.309} \]
The probability of finding this, or a more extreme value for the test statistic, if the null hypothesis is true is
\[ P\left( Z < -\frac{6 - 5}{1.309} \right) + P\left( Z > \frac{6 - 5}{1.309} \right) = 2 \cdot P\left( Z > \frac{6 - 5}{1.309} \right) \approx 0.44 \]
which in R is 2*pnorm(1/1.309, lower.tail = FALSE). Thus it is quite likely (specifically, it will happen 44% of the time) that a value of the test statistic at least this extreme will be observed if the null hypothesis is true. Hence we do not reject the null hypothesis that the numbers are random. Correct answer is 5.

Question 14. Confidence intervals for proportions are discussed in section [10.1, 9.1]. We have that 12+23=35 experiments were performed on eggs taken from Fie (one experiment per egg, i.e. will it hatch or not). Out of these, 12 were successes. Hence the upper limit in the confidence interval becomes
\[ \frac{12}{35} + 1.96 \sqrt{\frac{\frac{12}{35} \left( 1 - \frac{12}{35} \right)}{35}} = \frac{12}{35} + 1.96 \sqrt{\frac{12 \cdot 23}{35 \cdot 35 \cdot 35}} \]
where 1.96 is used since this is the 97.5th percentile of the standard normal distribution. Correct answer is 2.

Question 15. This is a test for independence in a contingency table, described in section [10.3, 9.3]. Using the "statistic for test concerning difference among proportions" on p. [286, 301], we see that the test statistic is
\[ \chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \]
summed over all cells. The expected values are given in the table supplied for this question, while the observed values are given in the table supplied in question 14. Correct answer is 2.

Question 16. The critical value is found in the $\chi^2$-distribution with $(3 - 1) = 2$ degrees of freedom, cf. p. [285, 301]. Using $\alpha = 0.05$, we find the critical value in table 5 p. [517, 588]. Correct answer is 3.

Question 17. Let $X$ be a random variable denoting the points obtained for a particular question. This is -1 with probability 2/3 and 3 with probability 1/3.
By using the "mean of discrete probability distribution" p. [94, 116] and "computing formula for the variance" p. [99, 121], we calculate
\[ E(X) = -1 \cdot 2/3 + 3 \cdot 1/3 = 1/3 \]
\[ E(X^2) = \mu'_2 = (-1)^2 \cdot 2/3 + 3^2 \cdot 1/3 = 11/3 \]
\[ Var(X) = 11/3 - 1/9 = 33/9 - 1/9 = 32/9 \]
Now let $Y = \sum_{i=1}^{10} X_i$ where $X_i$ follows the same distribution as $X$ for each $i = 1, 2, \ldots, 10$. Finally, use the bottom box p. [153, 185]. Correct answer is 4.

Question 18. If no one knows the answer, the probability of getting the question right is 1/3 for each student. Let $X$ denote the number of students to answer the question correctly. Then $X \sim Bin(66, 1/3)$ under the null hypothesis $H_0$ that no one knows the answer. The alternative hypothesis is that some students do know the answer. Under $H_0$, $E(X) = 22$. We wish to test whether the true mean of $X$, $\mu_0$, is greater than 22. Hence we find the p-value as follows:
\[ P(X \ge 33) = 1 - P(X \le 32) \]
which in R is 1 - pbinom(32, 66, 1/3) ≈ 0.003741. If the normal approximation is used, we get 1 - pnorm(32.5, 22, sqrt(66 * 1/3 * 2/3)) ≈ 0.003056. Since the p-value is less than the specified level of significance, we reject the null hypothesis. Correct answer is 3.

Question 19. The samples are small, meaning that we cannot make any distributional assumptions. Instead, we use a non-parametric rank-sum test, section [14.3, 10.3]. First we assign ranks:

    rating:  1  1  1  1  1  2  2  2  2  2  2  2  3  3  3  3  3  4  4  4
    TV:      A  A  A  A  A  A  A  A  B  B  B  B  A  A  B  B  B  B  B  B
    rank:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    means:   3 | 9 | 15 | 19

    sumA = 5*3 + 3*9 + 2*15 = 72
    sumB = 4*9 + 3*15 + 3*19 = 138

We can now calculate:
\[ U_1 = 72 - \frac{10(10 + 1)}{2} = 72 - 55 = 17 \]
\[ \mu_{U_1} = \frac{10 \cdot 10}{2} = 50 \]
\[ \sigma_{U_1}^2 = \frac{10 \cdot 10 (10 + 10 + 1)}{12} = \frac{2100}{12} = 175 \]
\[ \text{test statistic:} \quad \frac{U_1 - \mu_{U_1}}{\sigma_{U_1}} = \frac{17 - 50}{\sqrt{175}} \approx -2.495 \]
We then find $P(Z < -2.495)$ = pnorm(-2.495) ≈ 0.006298. This is very small, and we reject the null hypothesis that the two TVs were rated equally. Correct answer is 5.

Question 20. We need to use the F ratio for treatments, p. [373, 419].
SS(Tr) = 194.25, and SSE = 34.25. Also, there are 5 treatments and 4 methods, such that we get (5-1)*(4-1)=12 degrees of freedom for the residual error. Correct answer is 1.

Question 21. Referring to the text on p. [361, 406] and p. [371, 418], we see that the mean square for the error ($34.25/12$) gives the variance of the error. The standard deviation is, as always, the square root of this. Correct answer is 5.

Question 22. If the methods are not taken into account, the variance explained by method will enter into the residual variance. That is, the sum of squares from method will be included in the sum of squares for error instead. Likewise, the degrees of freedom for error will be increased, and equal $N - k$ ($k$ is the number of treatments and $N$ the total number of observations), as in a one-way ANOVA. Correct answer is 4.

Question 23. Since we have many samples, we may use the "large sample confidence interval for p" p. [280, 295]. We have observed $x = 107$ successes out of a total of $n = 482$. We calculate:
\[ \frac{x}{n} \pm z_{\alpha/2} \sqrt{\frac{\frac{x}{n} \left( 1 - \frac{x}{n} \right)}{n}} = \frac{107}{482} \pm 1.645 \sqrt{\frac{\frac{107}{482} \left( 1 - \frac{107}{482} \right)}{482}} = \frac{107}{482} \pm 1.645 \sqrt{\frac{107 \cdot 375}{482^3}} \]
since $z_{\alpha/2} = z_{0.10/2} = z_{0.05} = 1.645$ (in R, qnorm(0.05) ≈ -1.645 is the corresponding lower quantile). Correct answer is 5.

Question 24. Denote the proportion reported on the 27/11/2008 by $p_1$ and that found earlier by $p_2$. We wish to test the null hypothesis $p_1 = p_2$ against the alternative $p_1 > p_2$. With $X_1 = 107$, $n_1 = 482$, $X_2 = 52$, and $n_2 = 322$, use p. [288, 304] to find the test statistic:
\[ \hat{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{107 + 52}{482 + 322} = 0.1978 \]
\[ Z = \frac{\frac{X_1}{n_1} - \frac{X_2}{n_2}}{\sqrt{\hat{p}(1 - \hat{p}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} = \frac{\frac{107}{482} - \frac{52}{322}}{\sqrt{0.1978 (1 - 0.1978) \left( \frac{1}{482} + \frac{1}{322} \right)}} \approx 2.110239 \]
Now we find the p-value, letting $Z \sim N(0, 1)$:
\[ P(Z > 2.110239) = 1 - P(Z < 2.110239) \]
which in R is 1 - pnorm(2.110239) ≈ 0.01741889. Since this p-value is low, we reject the null hypothesis and conclude that the proportion has increased. Correct answer is 3.

Question 25. Since it is assumed that the proportion is about the same as the current ($107/482 \approx 0.22$), we use the box "sample size determination" p. [281, 296].
The width of the confidence interval should be plus/minus 2 percentage points, i.e. plus/minus 0.02. Hence $E = 0.02$. With confidence level 95% we get $z_{\alpha/2} = z_{0.05/2} = z_{0.025} = -1.96$. Thus we find
\[ n = 0.22 \cdot 0.78 \cdot \left( \frac{1.96}{0.02} \right)^2 \]
Correct answer is 3.

Question 26. We need to find the distribution of the sum of eight random variables, where each follows the normal distribution with mean 100 and variance 1. Using p. [153-154, 185], and assuming that the weights of the pieces of chocolate are independent, we find
\[ X_i \sim N(100, 1), \quad i \in \{1, 2, \ldots, 8\} \]
\[ Y = \sum_{i=1}^{8} X_i \]
\[ E(Y) = E\left( \sum_{i=1}^{8} X_i \right) = \sum_{i=1}^{8} E(X_i) = \sum_{i=1}^{8} 100 = 800 \]
\[ Var(Y) = Var\left( \sum_{i=1}^{8} X_i \right) = \sum_{i=1}^{8} Var(X_i) = \sum_{i=1}^{8} 1 = 8 \]
Also, the sum of normally distributed variables is itself normally distributed. Thus $Y \sim N(800, 8)$. The standard deviation of $Y$ is $\sqrt{8} = 2\sqrt{2} \approx 2.83$. 2.5% of the probability mass lies to each side of the interval $[800 \pm 1.96 \cdot 2.83]$. Hence the correct distribution is the symmetric distribution which has almost all its mass between 794.5 and 805.5, but still some (2.5% on each side) mass outside of that interval. Correct answer is 1.

Question 27. The two lines indicating the 25 and 75 percentiles (right below and above the thick line indicating the mean, respectively) do not match the 25 and 75 percentiles of any of the distributions given above. The lines are symmetric around the mean, excluding distribution c. The 25 percentile is drawn at about 775 and the 75 percentile at about 825. None of the three symmetrical distributions look like they contain 50% of the probability mass between 775 and 825. Correct answer is 5.

Question 28. Using the pooled estimator of variance p. [252, 264] we find:
\[ \hat{\sigma}^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} = \frac{4 \cdot 5.2123^2 + 4 \cdot 2.1459^2}{8} \approx 15.88648 \approx 3.9858^2 \]
Correct answer is 2.

Question 29. Under the null hypothesis (that the variances are equal), the fraction $5.2123^2 / 2.1459^2$ follows an F distribution with (4, 4) degrees of freedom, cf. p. [273, 287].
Hence the critical value is 6.39, found in table 6(a) p. [518, 589]. Correct answer is 4.

Question 30. Cf. [p. 246 and 251, section 7.8], the two samples must both come from normal populations, have the same variance, and be randomly and independently chosen. The only unnecessary assumption is that the samples contain more than 15 observations each. Correct answer is 2.
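For Question 29, the table lookup can be cross-checked in R. The sketch below is only an illustration: it assumes that the tabulated value 6.39 is the upper 5% point of the F-distribution with (4, 4) degrees of freedom, and the variable names (s1, s2, f_obs, f_crit) are chosen here rather than taken from the exam.

```r
# Sample standard deviations from Questions 28 and 29
s1 <- 5.2123
s2 <- 2.1459

# Observed variance ratio, larger variance in the numerator
f_obs <- s1^2 / s2^2

# Assumption: 6.39 in table 6(a) is the upper 5% point of F(4, 4)
alpha  <- 0.05
f_crit <- qf(1 - alpha, df1 = 4, df2 = 4)

# The null hypothesis of equal variances is rejected only if the
# observed ratio exceeds the critical value
reject <- f_obs > f_crit

print(c(f_obs = f_obs, f_crit = f_crit))
print(reject)
```

Computing the quantile with qf instead of reading it from a table also makes it easy to repeat the comparison at another significance level.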