Informatics - DTU
02402 Introduction To Statistics
2010-2-01
LFF/lff
Solution exam 15. December 2008
References to ”Probability and Statistics for Engineers” are given in the order [8th edition,
7th edition].
Question 1 We wish to determine whether the two variances are significantly different on
a 10% significance level. Under the null hypothesis, the ratio between the variances
follows an F-distribution, cf. p. [273, 287].
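The largest variance is put in the numerator. The values 2.6726 and 2.6458 are standard
deviations (cf. Questions 2 and 3), so they are squared, and the test statistic becomes
$F = \frac{2.6726^2}{2.6458^2}$.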
The variance in the numerator is that for women, and thus based on 8 subjects while
the variance in the denominator is based on 7 subjects. Hence we need to use the F
distribution with (8-1, 7-1) degrees of freedom.
Correct answer is 5.
Question 2 Here we are concerned with a difference of means based on small samples.
Hence we may use the box on p. [254, 266]. We have $n_1 = 7$, $n_2 = 8$, $s_1 = 2.6458$,
$s_2 = 2.6726$, and find $t_{\alpha/2}(n_1 + n_2 - 2) = t_{0.025}(13) = 2.16$ (in R, qt(0.025, 13) returns the corresponding lower-tail value −2.16).
Correct answer is 2.
Question 3 The mean and standard deviation of albumin content in women are given as
43.5 and 2.6726, respectively. Let X denote the albumin in a single randomly chosen
woman. Then $X \sim N(43.5, 2.6726^2)$. We find
$P(X > 48) = 1 - P(X \le 48)$.
In R, this can be found using pnorm as follows:
> 1-pnorm(48, mean=43.5, sd=2.6726)
[1] 0.04611464
Multiplying this probability by 100000 gives the result.
Correct answer is 1.
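As a cross-check outside R, the tail probability in Question 3 can be sketched in Python using only the standard library. The variable names are illustrative and not part of the exam; `NormalDist.cdf` plays the role of `pnorm` here.

```python
from statistics import NormalDist

# Albumin in women, modelled as normal with the mean and sd quoted in Question 3.
albumin = NormalDist(mu=43.5, sigma=2.6726)

# P(X > 48) = 1 - P(X <= 48); cdf() corresponds to R's pnorm().
p_above_48 = 1 - albumin.cdf(48)

# Scale to a count per 100000 women, as the solution does.
per_100000 = 100000 * p_above_48

print(round(p_above_48, 4), round(per_100000))
```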
Question 4 From p. [281, 296] we get that $n = p(1-p)\left(\frac{z_{\alpha/2}}{E}\right)^2$, where $n$ is the sample size
that we are seeking. $E$ is the allowed error, i.e. 1 in this case. We find $z_{\alpha/2}$:
> qnorm(.01/2)
[1] -2.575829
Finally, we use that p(1 − p) is equal to the variance in a binomial distribution with
one trial, such that $p(1-p) = 2.65^2$.
Correct answer is 1.
Question 5 Refer to p. [362, 406]. To find the sum of squared residuals, we need the mean
square and the degrees of freedom (df).
The mean square is given in the question.
The value of df is found by seeing that there are 18 observations and 3 treatments
(N = 18, k = 3). Thus df=15.
Then we find that the sum of squared errors (SSE) is SSE = MSE ×(N − k) = 20.04
× 15.
Correct answer is 5.
Question 6 The test statistic follows an F-distribution with k−1, N −k degrees of freedom,
cf. p. [362, 406]. Since N = 18 and k = 3, we have degrees of freedom (2, 15).
Correct answer is 4.
Question 7 The definition of the p-value is given on p. [231, 248]. Small p-values indicate
that the observed data is very unlikely if the null hypothesis is true. This leads to
rejection of the null hypothesis. In ANOVA, the null hypothesis is that all group
means are equal. Thus the p-value in the output, 5.649e-07, is evidence that the means
of at least two groups differ.
Correct answer is 5.
Question 8 The estimators of α and β are given on p. [304, 340]. We find
$b = \frac{S_{xy}}{S_{xx}} = \frac{325.20}{42.00 \cdot 6} \approx 1.29$
$a = \bar{y} - b \cdot \bar{x} = 13.1143 - 1.29 \cdot 9.0 \approx 1.50$
The definitions of Sxx , Syy , and Sxy are given on p. [304, 340].
Correct answer is 4.
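The arithmetic in Question 8 can be retraced with a few lines of Python. This is only a sketch: the variable names are invented for illustration, and reading the factor 6 as n − 1 (so n = 7) is an assumption based on the form of the denominator.

```python
# Summary statistics quoted in the solution to Question 8.
s_xy = 325.20            # S_xy
s_xx = 42.00 * 6         # S_xx, assumed to be (n - 1) times the sample variance of x
x_bar, y_bar = 9.0, 13.1143

b = s_xy / s_xx          # least-squares slope estimate
a = y_bar - b * x_bar    # least-squares intercept estimate

print(round(b, 2), round(a, 2))
```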
Question 9 The estimate of $\sigma^2$ is given on p. [308, 343]. We calculate:
$\hat{\sigma} = \sqrt{\frac{S_{yy} - S_{xy}^2 / S_{xx}}{n - 2}} = \sqrt{\frac{6 \cdot 70.5381 - 325.2^2 / (6 \cdot 42)}{5}} \approx 0.844$
Correct answer is 1.
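A short Python sketch of the residual standard deviation in Question 9 follows. The names are illustrative, and n = 7 is an assumption inferred from the divisor 5 = n − 2.

```python
import math

n = 7                   # assumed, since the solution divides by n - 2 = 5
s_yy = 6 * 70.5381      # S_yy
s_xy = 325.2            # S_xy
s_xx = 6 * 42           # S_xx

# Residual variance is (S_yy - S_xy^2 / S_xx) / (n - 2); take the square root.
sigma_hat = math.sqrt((s_yy - s_xy**2 / s_xx) / (n - 2))

print(round(sigma_hat, 3))
```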
Question 10 The slope is the parameter β. The confidence interval for β is given on p.
[311, 346]. We also use the test statistic for b with β = 0, which is given in the output
as ”t value”, and denoted by t here.
Information concerning b is given in the output in the line beginning with ”x2”.
$b \pm t_{\alpha/2}(n-2) \cdot \frac{\hat{\sigma}}{\sqrt{S_{xx}}} = b \pm t_{\alpha/2}(n-2) \cdot \frac{b}{t} = 5.4117 \pm 2.365 \cdot 0.2258 = [4.88; 5.95]$
We get 2.365 from the t-distribution with 7 degrees of freedom (the absolute value of qt(0.05/2,7)). We find
that 7 degrees of freedom should be used through the following reasoning:
The residual standard error has 7 degrees of freedom (read in the output). The degrees
of freedom for the residual standard error is n − 2, cf. p. [310, 346]. Thus there were
9 observations. Since we need the t-distribution with n − 2 degrees of freedom, the
t-distribution with 7 degrees of freedom is used.
Correct answer is 3.
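The interval in Question 10 can be reproduced with the sketch below. The t value 23.972 is the figure quoted for the slope in Question 12, and the critical value 2.365 is copied from the text rather than computed, because the Python standard library has no t-distribution; all names are illustrative.

```python
b = 5.4117          # slope estimate from the output line "x2"
t_value = 23.972    # "t value" for the slope, as quoted in Question 12
t_crit = 2.365      # t_{0.025}(7), copied from the solution text

# With beta = 0 the test statistic is t = b / se(b), so se(b) = b / t.
se_b = b / t_value

lower = b - t_crit * se_b
upper = b + t_crit * se_b

print(round(se_b, 4), round(lower, 2), round(upper, 2))
```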
Question 11 The p-value of 5.59e-08 is clearly less than α = 0.1%. Since the observed
data is more unlikely than the specified level of 0.1% the null hypothesis is rejected.
Correct answer is 5.
Question 12 We use the limits of prediction given on p. [314, 350]. Sxx cannot be found
directly in the output. Instead we use the relation
$t = \frac{b - \beta}{\hat{\sigma}} \sqrt{S_{xx}}$
Where β is 0, t is the t value given in the output for the slope, and σ̂ is the residual
standard error. We get
$t^2 = \frac{b^2}{\hat{\sigma}^2} S_{xx} \Leftrightarrow S_{xx} = t^2 \frac{\hat{\sigma}^2}{b^2}$
> 23.972^2 * 3.497^2/5.4117^2
[1] 239.9564
To get the prediction limits we need to use $t_{\alpha/2}(n-2) = t_{0.025}(7) = 2.365$. We can
now find the prediction limits:
$(a + b x_0) \pm t_{\alpha/2} \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$
$= (5.5178 + 9 \cdot 5.4117) \pm 2.365 \cdot 3.497 \sqrt{1 + \frac{1}{9} + \frac{(9 - 8)^2}{239.96}}$
$= (5.5178 + 9 \cdot 5.4117) \pm 2.365 \sqrt{3.497^2 \left(1 + \frac{1}{9} + \frac{1}{240}\right)}$
Correct answer is 1.
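The two steps of Question 12 (recovering Sxx from the t value, then forming the prediction limits) can be sketched in Python as follows. The numbers are those appearing in the solution; the variable names are illustrative, and 2.365 is again copied from the text rather than computed.

```python
import math

a, b = 5.5178, 5.4117     # intercept and slope used in the solution
t_value = 23.972          # t value for the slope
sigma_hat = 3.497         # residual standard error
n, x0, x_bar = 9, 9, 8    # values that appear in the worked formula
t_crit = 2.365            # t_{0.025}(7), copied from the solution text

# Step 1: S_xx = t^2 * sigma_hat^2 / b^2 (from t = b * sqrt(S_xx) / sigma_hat).
s_xx = t_value**2 * sigma_hat**2 / b**2

# Step 2: prediction limits for a new observation at x0.
centre = a + b * x0
half_width = t_crit * sigma_hat * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)

print(round(s_xx, 2), round(centre - half_width, 2), round(centre + half_width, 2))
```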
Question 13 We use a test of randomness, p. [455, 329]. First we find the median of the
residuals. The residuals in increasing order are -4.11 -3.40 -3.16 -0.98 0.08 0.66 1.81
3.87 5.24. The median is 0.08.
We now identify the runs. All numbers above the median are given the symbol a, and
those below b. Those equal to the median are taken out of the sample, cf. example
p. [457, 330]. The residuals are 0.08 0.66 -3.16 1.81 -4.11 3.87 5.24 -0.98 -3.40. We
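identify the runs a b a b aa bb. Thus $u = 6$, $n_1 = 4$, and $n_2 = 4$. Calculate $\mu_u$ and $\sigma_u$:
$\mu_u = \frac{2 \cdot 4 \cdot 4}{4 + 4} + 1 = 5$
$\sigma_u = \sqrt{\frac{2 \cdot 4 \cdot 4 (2 \cdot 4 \cdot 4 - 4 - 4)}{(4 + 4)^2 (4 + 4 - 1)}} = \sqrt{\frac{32 \cdot 24}{64 \cdot 7}} = \sqrt{\frac{4 \cdot 3}{7}} = \sqrt{\frac{12}{7}} \approx 1.309$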
Thus the test statistic becomes
$\frac{u - \mu_u}{\sigma_u} = \frac{6 - 5}{1.309}$
The probability of finding this, or a more extreme value for the test statistic if the null
hypothesis is true is
$P\left(Z < -\frac{6 - 5}{1.309}\right) + P\left(Z > \frac{6 - 5}{1.309}\right) = 2 \cdot P\left(Z > \frac{6 - 5}{1.309}\right)$ = 2 * pnorm(1/1.309, lower.tail = FALSE) ≈ 0.44
Thus it is quite likely (specifically, will happen 44% of the time) that this value of the
test statistic will be observed if the null hypothesis is true. Hence we do not reject the
null hypothesis that the numbers are random.
Correct answer is 5.
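The runs test in Question 13 is mechanical enough to sketch in code. The Python below follows the steps described in the solution (median, a/b labels, run count, normal approximation); the variable names are illustrative and not part of the exam.

```python
import math
from statistics import NormalDist, median

# Residuals in their original order, as listed in the solution.
residuals = [0.08, 0.66, -3.16, 1.81, -4.11, 3.87, 5.24, -0.98, -3.40]

med = median(residuals)
# Label values above the median "a" and below "b"; drop values equal to it.
labels = ["a" if r > med else "b" for r in residuals if r != med]

n1, n2 = labels.count("a"), labels.count("b")
# A new run starts wherever the label changes.
u = 1 + sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)

mu_u = 2 * n1 * n2 / (n1 + n2) + 1
sigma_u = math.sqrt(
    2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
)

z = (u - mu_u) / sigma_u
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

print(u, n1, n2, round(sigma_u, 3), round(p_value, 2))
```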
Question 14 Confidence intervals for proportions are discussed in section [10.1, 9.1]. We
have that 12+23=35 experiments were performed on eggs taken from Fie (one experiment per egg, i.e. will it hatch or not). Out of these, 12 were successes. Hence the
upper limit in the confidence interval becomes
$\frac{12}{35} + 1.96 \sqrt{\frac{\frac{12}{35}\left(1 - \frac{12}{35}\right)}{35}} = \frac{12}{35} + 1.96 \sqrt{\frac{12 \cdot 23}{35 \cdot 35 \cdot 35}}$
where 1.96 is used since this is the 97.5th percentile of the standard normal distribution.
Correct answer is 2.
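A small Python sketch confirms that the two forms of the upper limit in Question 14 agree. The names are illustrative; `NormalDist().inv_cdf(0.975)` is used in place of the tabulated 1.96.

```python
import math
from statistics import NormalDist

successes, failures = 12, 23
n = successes + failures
p_hat = successes / n

z = NormalDist().inv_cdf(0.975)   # 97.5th percentile, about 1.96

# Upper limit in its textbook form ...
upper = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n)
# ... and in the simplified form with 12 * 23 / 35^3 under the square root.
upper_simplified = successes / n + z * math.sqrt(successes * failures / n**3)

print(round(upper, 3), round(upper_simplified, 3))
```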
Question 15 This is a test for independence in a contingency table, described in section
[10.3, 9.3]. Using the ”statistic for test concerning difference among proportions” on
p. [286, 301], we see that the test statistic is $\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$ summed over all
cells. The expected values are given in the table supplied for this question, while the
observed values are given in the table supplied in question 14.
Correct answer is 2.
Question 16 The critical value is found in the $\chi^2$-distribution with (3 − 1) = 2 degrees of
freedom, cf. p. [285, 301]. Using α = 0.05, we find the critical value in table 5 p. [517,
588].
Correct answer is 3.
Question 17 Let X be a random variable denoting the points obtained for a particular
question. This is -1 with probability 2/3 and 3 with probability 1/3. By using the ”mean
of discrete probability distribution” p. [94, 116] and ”computing formula for the
variance” p. [99, 121], we calculate
$E(X) = -1 \cdot 2/3 + 3 \cdot 1/3 = 1/3$
$E(X^2) = \mu'_2 = (-1)^2 \cdot 2/3 + 3^2 \cdot 1/3 = 11/3$
$Var(X) = 11/3 - 1/9 = 33/9 - 1/9 = 32/9$
Now let $Y = \sum_{i=1}^{10} X_i$ where $X_i$ follows the same distribution as X for each i =
1, 2, . . . , 10. Finally, use the bottom box p. [153, 185].
Correct answer is 4.
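The moments in Question 17 can be checked exactly with Python's `fractions` module. This is a sketch with illustrative names; the last step assumes the ten questions are answered independently, so that means and variances simply add.

```python
from fractions import Fraction

# Points for one question: -1 with probability 2/3, +3 with probability 1/3.
outcomes = {-1: Fraction(2, 3), 3: Fraction(1, 3)}

mean_x = sum(x * p for x, p in outcomes.items())
second_moment = sum(x**2 * p for x, p in outcomes.items())
var_x = second_moment - mean_x**2        # computing formula for the variance

# Sum of 10 questions, assumed independent: means and variances add.
n_questions = 10
mean_y = n_questions * mean_x
var_y = n_questions * var_x

print(mean_x, second_moment, var_x, mean_y, var_y)
```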
Question 18 If no one knows the answer, the probability of getting the question right is
1/3 for each student. Let X denote the number of students to answer the question
correctly. Then $X \sim Bin(66, 1/3)$ under the null hypothesis $H_0$ that no one knows the
answer. The alternative hypothesis is that some students do know the answer. Under
$H_0$, $E(X) = 22$. We wish to test whether the true mean of X, $\mu_0$, is greater than 22.
Hence we find the p-value as follows
$P(X \ge 33) = 1 - P(X \le 32)$ = 1 − pbinom(32, 66, 1/3) ≈ 0.003741
If the normal approximation is used, we get 1 − pnorm(32.5, 22, sqrt(66 · 1/3 · 2/3)) ≈
0.003056.
Since the p-value is less than the specified level of significance, we reject the null
hypothesis.
Correct answer is 3.
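Both p-values in Question 18 can be recomputed without R. The sketch below sums the binomial probabilities directly and then applies the continuity-corrected normal approximation; the helper `binom_pmf` and the other names are hypothetical, introduced only for this illustration.

```python
import math
from statistics import NormalDist

n, p = 66, 1 / 3

def binom_pmf(k: int) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Exact upper tail P(X >= 33), the counterpart of 1 - pbinom(32, 66, 1/3).
p_exact = sum(binom_pmf(k) for k in range(33, n + 1))

# Normal approximation with continuity correction (cut point 32.5).
approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
p_normal = 1 - approx.cdf(32.5)

print(round(p_exact, 6), round(p_normal, 6))
```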
Question 19 The samples are small, meaning that we cannot make any distributional
assumptions. Instead, we use a non-parametric rank-sum test, section [14.3, 10.3].
First we assign ranks:
rating: 1 1 1 1 1 2 2 2 2  2  2  2  3  3  3  3  3  4  4  4
TV:     A A A A A A A A B  B  B  B  A  A  B  B  B  B  B  B
rank:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
mean rank per rating: 3 | 9 | 15 | 19
sumA = 5*3 + 3*9 + 2*15 = 72
sumB = 4*9 + 3*15 + 3*19 = 138
We can now calculate:
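$U_1 = 72 - \frac{10(10 + 1)}{2} = 72 - 55 = 17$
$\mu_{U_1} = \frac{10 \cdot 10}{2} = 50$
$\sigma^2_{U_1} = \frac{10 \cdot 10 (10 + 10 + 1)}{12} = \frac{2100}{12} = 175$
test statistic: $\frac{U_1 - \mu_{U_1}}{\sigma_{U_1}} = \frac{17 - 50}{\sqrt{175}} \approx -2.495$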
We then find P (Z < −2.495) = pnorm(−2.495) ≈ 0.006298. This is very small, and
we reject the null hypothesis that the two TVs were rated equally.
Correct answer is 5.
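The rank-sum calculation in Question 19, including the averaged ranks for ties, can be sketched in Python as below. The two rating lists are read off the rank table in the solution; the helper `mean_rank` and the variable names are hypothetical.

```python
import math
from statistics import NormalDist

# Ratings per group as read off the rank table (10 observations each).
ratings_a = [1] * 5 + [2] * 3 + [3] * 2
ratings_b = [2] * 4 + [3] * 3 + [4] * 3

pooled = sorted(ratings_a + ratings_b)

def mean_rank(value: int) -> float:
    """Average of the 1-based ranks that `value` occupies in the pooled sample."""
    first = pooled.index(value) + 1
    last = first + pooled.count(value) - 1
    return (first + last) / 2

n1, n2 = len(ratings_a), len(ratings_b)
w1 = sum(mean_rank(v) for v in ratings_a)    # rank sum for group A
w2 = sum(mean_rank(v) for v in ratings_b)    # rank sum for group B

u1 = w1 - n1 * (n1 + 1) / 2
mu_u1 = n1 * n2 / 2
var_u1 = n1 * n2 * (n1 + n2 + 1) / 12

z = (u1 - mu_u1) / math.sqrt(var_u1)
p_lower = NormalDist().cdf(z)                # P(Z < z)

print(w1, w2, u1, round(z, 3), round(p_lower, 4))
```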
Question 20 We need to use the F ratio for treatments, p. [373, 419]. SS(Tr) = 194.25,
and SSE = 34.25. Also, there are 5 treatments and 4 methods, such that we get
(5-1)*(4-1)=12 degrees of freedom for the residual error.
Correct answer is 1.
Question 21 Referring to the text on p. [361, 406] and p. [371, 418], we see that mean
square for the error ($\frac{34.25}{12}$) gives the variance of the error. The standard deviation is,
as always, the square root of this.
Correct answer is 5.
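The quantities used in Questions 20 and 21 can be put together in one short Python sketch. The sums of squares and the design sizes are those quoted in the solution; the variable names are illustrative.

```python
import math

ss_treatment, ss_error = 194.25, 34.25
n_treatments, n_methods = 5, 4      # treatments and methods (blocks)

df_treatment = n_treatments - 1
df_error = (n_treatments - 1) * (n_methods - 1)

ms_treatment = ss_treatment / df_treatment
ms_error = ss_error / df_error      # estimate of the error variance

f_ratio = ms_treatment / ms_error   # F ratio for treatments
sd_error = math.sqrt(ms_error)      # error standard deviation

print(df_error, round(f_ratio, 2), round(sd_error, 3))
```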
Question 22 If the methods are not taken into account, the variance explained by method
will enter into the residual variance. That is, the sum of squares from method will be
included in the sum of squares for error instead. Likewise, the degrees of freedom for
error will be increased, and equal N-k (k is the number of treatments and N the total
number of observations), as in a one-way ANOVA.
Correct answer is 4.
Question 23 Since we have many samples, we may use the ”large sample confidence interval
for p” p. [280, 295]. We have observed x = 107 successes out of a total of n = 482.
We calculate:
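$\frac{x}{n} \pm z_{\alpha/2} \sqrt{\frac{\frac{x}{n}\left(1 - \frac{x}{n}\right)}{n}} = \frac{107}{482} \pm 1.645 \sqrt{\frac{\frac{107}{482}\left(1 - \frac{107}{482}\right)}{482}} = \frac{107}{482} \pm 1.645 \sqrt{\frac{107 \cdot 375}{482^3}}$
since $z_{\alpha/2} = z_{0.10/2} = z_{0.05} = 1.645$ (qnorm(0.05) ≈ −1.645 is the corresponding lower-tail value).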
Correct answer is 5.
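The 90% interval in Question 23 can be evaluated numerically with the following Python sketch. The names are illustrative; `inv_cdf(0.95)` supplies the upper 5% point of the standard normal distribution.

```python
import math
from statistics import NormalDist

x, n = 107, 482
p_hat = x / n

z = NormalDist().inv_cdf(0.95)     # upper 5% point, about 1.645
half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)

lower, upper = p_hat - half_width, p_hat + half_width

print(round(lower, 3), round(upper, 3))
```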
Question 24 Denote the proportion reported on the 27/11/2008 by p1 and that found
earlier by p2 . We wish to test the null hypothesis p1 = p2 against the alternative
p1 > p2 . Use p. [288, 304] to find the test statistic
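$\hat{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{107 + 52}{482 + 322} = 0.1978$
$Z = \frac{\frac{X_1}{n_1} - \frac{X_2}{n_2}}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{\frac{107}{482} - \frac{52}{322}}{\sqrt{0.1978 (1 - 0.1978)\left(\frac{1}{482} + \frac{1}{322}\right)}} \approx 2.110239$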
Now we find the p-value, letting Z ∼ N (0, 1).
P (Z > 2.110239) = 1 − P (Z < 2.110239) = 1 − pnorm(2.110239) ≈ 0.01741889
Since this p-value is low, we reject the null hypothesis and conclude that the proportion
has increased.
Correct answer is 3.
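A Python sketch of the two-proportion test in Question 24 follows. Taking 107/482 as the proportion reported on 27/11/2008 and 52/322 as the earlier one is an assumption based on the order of the terms in the test statistic; the names are illustrative. The solution rounds the pooled proportion to 0.1978, so the last digits differ slightly from an unrounded calculation.

```python
import math
from statistics import NormalDist

x1, n1 = 107, 482   # assumed: proportion reported on 27/11/2008
x2, n2 = 52, 322    # assumed: proportion found earlier

p_pooled = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z = (x1 / n1 - x2 / n2) / se
p_value = 1 - NormalDist().cdf(z)   # one-sided, alternative p1 > p2

print(round(p_pooled, 4), round(z, 3), round(p_value, 4))
```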
Question 25 Since it is assumed that the proportion is about the same as the current one
($\frac{107}{482} \approx 0.22$), we use the box ”sample size determination” p. [281, 296]. The width of
the confidence interval should be plus/minus 2 percentage points, i.e. plus/minus 0.02.
Hence $E = 0.02$. With confidence level 95% we get $z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$.
Thus we find
$n = 0.22 \cdot 0.78 \cdot \left(\frac{1.96}{0.02}\right)^2$
Correct answer is 3.
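The sample size expression in Question 25 evaluates as in the sketch below. The inputs are the rounded values used in the solution (0.22, 0.02 and 1.96); the variable names are illustrative.

```python
p_guess = 0.22      # rounded 107/482, as in the solution
margin = 0.02       # allowed error E
z = 1.96            # z_{0.025}, the tabulated value used in the solution

# n = p(1 - p) * (z / E)^2
n_required = p_guess * (1 - p_guess) * (z / margin) ** 2

print(round(n_required, 1))
```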
Question 26 We need to find the distribution of the sum of eight random variables, where
each follows the normal distribution with mean 100 and variance 1. Using p. [153-154,
185], and assuming that the weights of the pieces of chocolate are independent we find
$X_i \sim N(100, 1), \quad i \in \{1, 2, \ldots, 8\}$
$Y = \sum_{i=1}^{8} X_i$
$E(Y) = E\left(\sum_{i=1}^{8} X_i\right) = \sum_{i=1}^{8} E(X_i) = \sum_{i=1}^{8} 100 = 800$
$Var(Y) = Var\left(\sum_{i=1}^{8} X_i\right) = \sum_{i=1}^{8} Var(X_i) = \sum_{i=1}^{8} 1 = 8$
Also, the sum of independent normally distributed variables is itself normally distributed. Thus
$Y \sim N(800, 8)$. The standard deviation of Y is $\sqrt{8} = 2\sqrt{2} \approx 2.83$. 2.5% of the
probability mass lies to each side of the interval [800 ± 1.96 · 2.83]. Hence the correct
distribution is the symmetric distribution which has almost all its mass between 794.5
and 805.5, but still some (2.5% on each side) mass outside of that interval.
Correct answer is 1.
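The interval quoted in Question 26 can be obtained directly from the distribution of the sum. The Python sketch below assumes, as the solution does, that the eight weights are independent; the names are illustrative.

```python
import math
from statistics import NormalDist

n_pieces = 8
mean_piece, var_piece = 100, 1

mean_total = n_pieces * mean_piece
sd_total = math.sqrt(n_pieces * var_piece)   # variances add under independence

total = NormalDist(mu=mean_total, sigma=sd_total)
# Central 95% interval: 2.5% of the mass lies outside on each side.
lower, upper = total.inv_cdf(0.025), total.inv_cdf(0.975)

print(round(sd_total, 2), round(lower, 1), round(upper, 1))
```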
Question 27 The two lines indicating the 25 and 75 percentiles (right below and above the
thick line indicating the mean, respectively) do not match the 25 and 75 percentiles
of any of the distributions given above. The lines are symmetric around the mean,
excluding distribution c. The 25 percentile is drawn at about 775 and the 75 percentile
at about 825. None of the three symmetrical distributions look like they contain 50%
of the probability mass between 775 and 825.
Correct answer is 5.
Question 28 Using the pooled estimator of variance p. [252, 264] we find:
$\hat{\sigma}^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} = \frac{4 \cdot 5.2123^2 + 4 \cdot 2.1459^2}{8} \approx 15.88648 \approx 3.9858^2$
Correct answer is 2.
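The pooled variance in Question 28 can be recomputed as follows. This is a sketch: the sample sizes n1 = n2 = 5 are an assumption inferred from the factors 4 = n − 1 in the formula, and the names are illustrative.

```python
import math

n1 = n2 = 5                  # assumed from the factors 4 = n - 1 in the formula
s1, s2 = 5.2123, 2.1459      # sample standard deviations

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
pooled_sd = math.sqrt(pooled_var)

print(round(pooled_var, 5), round(pooled_sd, 4))
```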
Question 29 Under the null hypothesis (that the variances are equal), the fraction $\frac{5.2123^2}{2.1459^2}$
follows an F distribution with (4, 4) degrees of freedom, cf. p. [273, 287]. Hence the
critical value is 6.39, found in table 6(a) p. [518, 589].
Correct answer is 4.
Question 30 Cf. [p. 246 and 251, section 7.8], the two samples must both come from
normal populations, have the same variance, and be randomly and independently
chosen. The only unnecessary assumption is that the samples contain more than 15
observations each.
Correct answer is 2.