Exam Applied Statistics- Tuesday May 26, 2009, 9.00 in GOVI


This exam consists of 2 parts: part 1 is for all students,
part 2 is only for students who want to do test 1 or improve their result on it.
The use of a calculator is allowed and advised. A list of formulas and 3 tables are attached to this exam.
-- Success! --
Exercise 1
Overweight is public enemy number one in the western societies. A producer of a new slimming diet claims
to have developed an effective diet such that after two years more than 60% of the users, who have between
10 and 20 kg overweight, will have less than 5 kg overweight. The company hired university researchers to
substantiate this claim. The researchers chose at random 384 persons who had the appropriate overweight
and who were willing to use the diet. After two years 245 of the participants had an overweight of less than
5 kg. Does this result prove the claim of the producer?
Conduct a test at significance level 5%: use the 8 steps testing procedure and report the critical value(s).
Exercise 2
An insurance company collaborates with two garages for repairing damaged cars which are insured by the
company. To investigate which of the garages has the lowest repair rates the company asks both garages to
make an assessment for the repair of 7 arbitrarily chosen damaged cars. Here are the costs of repair given by
the two garages for the 7 cars:
car            1     2     3     4     5     6     7    sample mean   standard deviation
garage 1     15.2  20.4  19.0   6.2   8.0  12.6  10.6      13.14            5.36
garage 2     14.7  18.3  16.8   6.7   7.6  11.6   9.8
difference   +0.5  +2.1  +2.2  -0.5  +0.4  +1.0  +0.8       0.93            0.96
a. Compute the sample mean and sample standard deviation of garage 2 (which were accidentally left out).
b. Give a 95% confidence interval for the expected difference in repair rates. Explain which formula you
choose (and why).
c. Does the interval in part b. show that the null hypothesis of no difference can be rejected versus the
alternative that there is a difference? At what level of significance?
If you did not find the answer to part b. use interval (0.2, 1.6)
d. Does the interval in part b. show that the null hypothesis of no difference can be rejected versus the
alternative that garage 1 has higher repair rates? At what level of significance?
e. If we want to conduct a non parametric test to find out whether garage 1 has higher repair rates than
garage 2, then give: 1. The name of the test, 2. The hypotheses, 3. The value of the test statistic,
4. The p-value and 5. Your conclusion
Exercise 3
Many persons suffer from FNE (“Fear of Negative Evaluation”). To check whether the eating pattern
influences the level of FNE a psychologist conducted a survey, using two groups of female students: the first
group of students suffered from an eating disorder (bulimia) and the second group consists of students with
normal eating habits. He selected 11 students of both groups: all had to fill in a list of questions. The
answers were used to find their FNE-score. These were the results, including summarizing measures:
student                      1   2   3   4   5   6   7   8   9  10  11    mean   standard deviation
students with bulimia (x)   21  13  10  20  25  19  16  21  24  13  14   17.82        4.916
normal students (y)         13   6  16  13   8  19  23  18  11  15   7   13.55        5.336
z = x - y                    8   7  -6   7  17   0  -7   3  13  -2   7    4.27        7.525
a. Which t-test would you conduct to prove the conjecture of the researcher that women, who suffer from
bulimia, have a larger expected FNE-score? First state the statistical assumptions you consider
reasonable in this case. Then give the appropriate hypotheses, compute the value of the test statistic, give
the critical value(s) of the test (use α = 0.05) and indicate whether the null hypothesis should be rejected.
b. Which non parametric test would you choose as an alternative for the t-test at a? Name the test, point out
how the test statistic is defined (do not compute its value!) and whether the test is left, right or two sided.
Exercise 4
Green pea plants have different characteristics: the colour of the flowers can either be blue or red and the
shape of pollen can be either long or round. A biologist observed many plants and tried to find out whether
these characteristics can be considered independent. After observing 427 green pea plants he published the
given table:
                    Colour
                    Blue   Red   Total
Pollen   Long       296    27    323
         Round      19     85    104
Total               315    112   427
Conduct a test to show whether or not Pollen and Colour are independent at the 5% level. You can
restrict yourself to steps 3.-8. of the test procedure.
Exercise 5
The length (in inches) and the weight (in pounds) of the so called supermodels Niki Taylor, Nadia Averman,
Claudia Schiffer, Elle MacPherson, Christy Turlington, Bridget Hall, Kate Moss, Valerie Mazza and Kirsty
Hume are given in the following table:
x = Length (in)   71    70.5  71    72    70    70    66.5  70    71     $\bar{x}$ = 70.222   sx = 1.5434
y = Weight (lb)   125   119   128   128   119   127   105   123   115    $\bar{y}$ = 121      sy = 7.5333
Applying linear regression to these observations we find: $\hat{y} = -151.700 + 3.883x$
a. Compute $\sum (x_i - \bar{x})^2$ and the regression variance s2, in three decimals.
(If you cannot compute these values, use $\sum (x_i - \bar{x})^2$ = 19.1 and s2 = 23.8 resp. in part b. and c.)
b. Find a 95% confidence interval for the expected increase of the weight per inch.
c. Sylvia is another supermodel and her length is 69 inches. Estimate her weight by an interval having a
confidence level of 90%.
---------------------------------------------------------------------------------------
Part 2 (only for those who want to redo test 1)
Exercise 6 Consider the observations given in exercise 3:
a. Make a back-to-back stem-and-leaf plot of these two samples and comment on the shape of the two plots
(compare e.g. centre and variation).
b. Give a 5 number summary of the scores of the normal students and check whether there are any outliers.
Exercise 7 Consider the observations given in exercise 5.
a. Find the correlation coefficient r.
b. Give the proper interpretation of the value of r2.
c. Estimate the expected weight of a supermodel of 69 inches and of a supermodel of 75 inches.
d. How do the two values you computed in c differ with respect to reliability?
Exercise 8
The distribution of the salaries of hotel cleaners is not known but the mean salary, € 1100 per month, and
the standard deviation, € 240 per month, are. Compute (approximate) the probability that 25 arbitrarily
chosen hotel cleaners have a mean salary of at least € 1150 per month.
--------------------------------------------------------------------------------------------------------------------------
Marks: Mark exam = #points/4.5.
Exercise   1    2a  2b  2c  2d  2e   3a  3b   4    5a  5b  5c   Total
Points     8    2   4   2   2   4    5   3    6    2   4   3    45
Part 2 (test 1): Mark = #points/2.
Exercise   6a  6b   7a  7b  7c  7d   8    Total
Points     4   3    2   2   1   2    6    20
Final mark Applied Statistics = 0.4 × test 1 + 0.6 × exam, increased by 5% of the difference between each of the 4
assignment marks and this average, if this difference is positive.
Formulas
Testing procedure:
1. The research question (in words)
2. The statistical assumptions (model)
3. The hypotheses and level of significance
4. The test statistic and its distribution
5. The observed value (of the statistic)
6. The rejection region (for H0) or the p-value
7. The statistical conclusion
8. The conclusion in words (answer to the question)
X ~ B(n, p) => approximately, for large n (np(1-p) > 5): X ~ N(np, np(1-p))
Estimation:
MSE =
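Confidence Intervals: $\bar{x} \pm c \cdot \frac{s}{\sqrt{n}}$, where P(Tn-1 ≥ c) = α/2
Tests: if H0: µ = µ0 is true, then $T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$ ~ t(n -1)
if H0: … is true, then …
if H0: p = p0 is true, then $Z = \frac{X - np_0}{\sqrt{np_0(1 - p_0)}}$ ~ N(0, 1)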
Sign test: if H0: p = ½ is true, then X ~ B(n, ½) and, if n > 10, X is approximately N(½n, ¼n).
Linear Regression:
,
Solutions
Exam
Exercise 1
1. Is the diet successful after 2 years in more than 60% of the cases of 10-20 kg overweight?
2. X is the number of cases where the overweight is reduced after 2 years: X ~ B(384, p)
3. Test H0: p = 0.60 versus H1: p > 0.60, α = 0.05.
4. Test statistic: X ~ B(384, 0.60) if H0: p = 0.60 is true.
5. Observed x = 245
6. Right sided test: if X ≥ c, then reject H0.
P(X ≥ c | p = 0.6) ≤ 5% where, approximately, X ~ N(np, np(1 - p)) = N(230.4, 92.16)
Using continuity correction: $P(X \ge c) = P(X \ge c - 0.5) \approx P\left(Z \ge \frac{c - 0.5 - 230.4}{\sqrt{92.16}}\right) \le 5\%$ (Z ~ N(0,1))
N(0,1)-table => $\frac{c - 230.9}{\sqrt{92.16}} \ge 1.645$ => $c \ge 230.9 + 1.645 \cdot \sqrt{92.16} \approx 246.7$. So we choose c = 247
7. x = 245 < 247 = c => do not reject H0
8. The claim of the producer is not proven at 5% level
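The critical value in step 6 can be checked numerically. The following is a minimal sketch, assuming the normal approximation with continuity correction used above; the variable names are illustrative and not part of the exam.

```python
from math import ceil, sqrt
from statistics import NormalDist

n, p0, alpha = 384, 0.60, 0.05
observed = 245

mean = n * p0                    # 230.4
variance = n * p0 * (1 - p0)     # 92.16

# Right-sided test with continuity correction:
# P(X >= c) is approximated by P(Z >= (c - 0.5 - mean) / sd) and must be <= alpha.
z = NormalDist().inv_cdf(1 - alpha)           # about 1.645
c_bound = mean + 0.5 + z * sqrt(variance)     # about 246.7
critical_value = ceil(c_bound)                # smallest integer c meeting the bound

reject_h0 = observed >= critical_value
print(critical_value, reject_h0)
```

With x = 245 below the critical value, the sketch agrees with step 7: H0 is not rejected.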
Exercise 2
a. $\bar{x}_2$ = 12.21 and $s_2$ = 4.51
b. These are paired samples (two observations per car) => we use the one sample method for the differences:
95% confidence interval for the expected difference µZ:
$\bar{z} \pm c \cdot \frac{s_Z}{\sqrt{n}} = 0.93 \pm 2.447 \cdot \frac{0.96}{\sqrt{7}}$ = (0.04, 1.82),
where n = 7 and c = 2.447, such that P(T6 ≥ c) = α/2 = 0.025
c. Yes, µZ = 0 is not included in the 95% interval so at 5% level H0: µZ = 0 can be rejected versus H1: µZ ≠ 0
d. µZ = 0 is not included in the 95% interval so at 2.5% level H0: µZ = 0 can be rejected versus H1: µZ > 0
e. 1. The sign test 2. We test H0: p = ½ (the probability of a positive difference 1-2) versus H1: p > ½
3. Observed X = 6
4. p-value = P(X ≥ 6) = P(X = 6) + P(X = 7) = (7 + 1) · (½)^7 = 8/128 ≈ 0.0625 > 5% =>
5. do not reject H0
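The paired interval in b. and the sign test in e. can be reproduced as follows. This is a sketch only: the critical value 2.447 is copied from a t-table (6 degrees of freedom, tail probability 0.025) rather than computed, and all names are illustrative.

```python
from math import comb, sqrt
from statistics import mean, stdev

garage1 = [15.2, 20.4, 19.0, 6.2, 8.0, 12.6, 10.6]
garage2 = [14.7, 18.3, 16.8, 6.7, 7.6, 11.6, 9.8]
diffs = [a - b for a, b in zip(garage1, garage2)]

# Part a: summary statistics of garage 2.
mean2, sd2 = mean(garage2), stdev(garage2)

# Part b: one-sample t interval applied to the paired differences.
n = len(diffs)
z_bar, s_z = mean(diffs), stdev(diffs)
T_CRIT = 2.447  # table value, assumed: P(T6 >= c) = 0.025
margin = T_CRIT * s_z / sqrt(n)
ci = (z_bar - margin, z_bar + margin)

# Part e: sign test, X = number of positive differences, X ~ B(n, 1/2) under H0.
positives = sum(d > 0 for d in diffs)
p_value = sum(comb(n, k) for k in range(positives, n + 1)) / 2 ** n

print(round(mean2, 2), round(sd2, 2), ci, positives, p_value)
```

The t interval rejects at the 2.5% level while the sign test does not reject at 5%, which illustrates that the sign test uses less of the information in the data.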
Exercise 3
a. In this case we will conduct the two independent samples t-test for the equality of means, assuming that X1,...,X11
and Y1,...,Y11 are independent and where Xi ~ N($\mu_1, \sigma_1^2$) and Yi ~ N($\mu_2, \sigma_2^2$), where the µ's and σ's are unknown and possibly
different (you could assume the σ's to be equal considering the small difference between the sample standard
deviations).
We test H0: $\mu_1 = \mu_2$ versus H1: $\mu_1 > \mu_2$ with $T = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/11 + S_Y^2/11}}$.
The observed value is $t = \frac{17.82 - 13.55}{\sqrt{4.916^2/11 + 5.336^2/11}} \approx 1.95$ and, under H0, T is approximately t(10)-distributed
(10 = min (11-1, 11-1) )
Critical value for this right sided test: c = 1.812 from the t(10)-table at 5%. In this case T > c, so reject H0.
b. The Wilcoxon Rank Sum Test: W = the sum of the ranks of the X-values (the FNE-scores of the students with
eating disorder). Reject H0 for large values of W: a right sided test.
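The observed value of the two-sample statistic in a. can be recomputed from the raw scores. A minimal sketch, assuming the unpooled (unequal variances) form of the statistic and the table value 1.812 for t(10) at 5%; the names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

bulimia = [21, 13, 10, 20, 25, 19, 16, 21, 24, 13, 14]
normal = [13, 6, 16, 13, 8, 19, 23, 18, 11, 15, 7]

# Standard error of the difference of means without pooling the variances.
se = sqrt(stdev(bulimia) ** 2 / len(bulimia) + stdev(normal) ** 2 / len(normal))
t_obs = (mean(bulimia) - mean(normal)) / se

T_CRIT = 1.812  # table value, assumed: P(T10 >= c) = 0.05
reject_h0 = t_obs > T_CRIT
print(round(t_obs, 2), reject_h0)
```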
Exercise 4. In the table the expected values are computed using the formula
E = (row total) × (column total)/ n
                    Colour
                    Blue                  Red                  Total
Pollen   Long       O = 296, E = 238.3    O = 27, E = 84.7     323
         Round      O = 19, E = 76.7      O = 85, E = 27.3     104
Total               315                   112                  427
3. We test:
H0 : “independence of Pollen and Colour” versus
H1: “There is a relation between Pollen and Colour”, α = 5%
4. Test Statistic $X^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2_1$, if H0 : “independence of Pollen and Colour” is true
(the number of degrees of freedom is (2-1)(2-1) = 1)
5. Observed value X2 = 218.4
6. It is a right sided test: reject H0 if X2 ≥ c = 3.841, from the $\chi^2_1$-table, tail probability 5%
(or: p-value = P($\chi^2_1$ ≥ 218.4) < 0.0005)
7. X2 = 218.4 > c (or: p-value < α) => reject H0
8. A relation between Pollen and Colour is proven at 5%-level
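The expected counts and the chi-square statistic can be verified with a few lines of code. This is a sketch with illustrative names; it works with unrounded expected counts and gives a value of about 218.9, slightly different from the 218.4 reported above, which does not affect the conclusion.

```python
observed = [[296, 27],   # Long:  Blue, Red
            [19, 85]]    # Round: Blue, Red

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# E = (row total) * (column total) / n for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

CHI2_CRIT = 3.841  # table value, assumed: chi-square with 1 df, tail probability 5%
reject_h0 = chi2 >= CHI2_CRIT
print([[round(e, 1) for e in row] for row in expected], round(chi2, 1), reject_h0)
```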
Exercise 5
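a. $\sum (x_i - \bar{x})^2 = (n - 1) s_x^2 = 8 \cdot 1.5434^2 \approx$ 19.056 and the regression variance $s^2 = \frac{1}{n - 2} \sum (y_i - \hat{y}_i)^2 \approx 23.804$
b. A 95% confidence interval for the regression coefficient is:
$\hat{\beta}_1 \pm c \cdot \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} = 3.883 \pm 2.365 \cdot \sqrt{23.804/19.056} \approx (1.24, 6.53)$,
where P(T7 ≥ c) = 0.025, so c = 2.365, and $\hat{\beta}_1$ = 3.883 (given), $\sum (x_i - \bar{x})^2$ = 19.056 and s2 = 23.804 (see a)
c. This is not a confidence interval for the mean of all supermodels having a length of 69 inches but a prediction
interval for one individual supermodel (Sylvia):
$\hat{y}^* \pm c \cdot s \cdot \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$ ≈ (106.2, 126.3),
where P(T7 ≥ c) = 0.05, so c = 1.895, x* = 69 and $\hat{y}^*$ = 116.254.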
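The quantities in a. to c. can be recomputed from the raw data. The sketch below assumes ordinary least squares and takes the two critical values from a t-table with 7 degrees of freedom; all variable names are illustrative.

```python
from math import sqrt
from statistics import mean

x = [71, 70.5, 71, 72, 70, 70, 66.5, 70, 71]        # length (in)
y = [125, 119, 128, 128, 119, 127, 105, 123, 115]   # weight (lb)
n = len(x)
x_bar, y_bar = mean(x), mean(y)

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx                 # estimated slope
b0 = y_bar - b1 * x_bar        # estimated intercept

# Part a: regression variance = residual sum of squares / (n - 2).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)

# Part b: 95% confidence interval for the slope.
T_025 = 2.365  # table value, assumed: P(T7 >= c) = 0.025
slope_margin = T_025 * sqrt(s2 / sxx)
slope_ci = (b1 - slope_margin, b1 + slope_margin)

# Part c: 90% prediction interval for one individual at x* = 69.
T_05 = 1.895   # table value, assumed: P(T7 >= c) = 0.05
x_star = 69
y_star = b0 + b1 * x_star
pred_margin = T_05 * sqrt(s2 * (1 + 1 / n + (x_star - x_bar) ** 2 / sxx))
prediction_interval = (y_star - pred_margin, y_star + pred_margin)

print(round(sxx, 3), round(s2, 3), slope_ci, prediction_interval)
```

The extra term 1 under the square root is what distinguishes the prediction interval for an individual from a confidence interval for the expected weight at x* = 69.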
Exercise 6 a. The centre of the data is higher for bulimia; the variation is approximately the same.
b. Ordering the observations we get the following:
Rank          1   2   3   4    5    6    7    8    9    10   11
Observation   6   7   8   11   13   13   15   16   18   19   23
The 5 number summary is
minimum   Q1   M    Q3   maximum
6         8    13   18   23
The IQD = Q3 - Q1 = 18 - 8 = 10
The 1.5 IQD-rule says that observations outside (Q1 - 1.5 IQD, Q3 + 1.5 IQD) = (-7, 33) are outliers => there are
no outliers.
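The five number summary and the outlier check can be sketched as follows. The quartile convention is an assumption chosen to match the solution above: Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when n is odd.

```python
from statistics import median

scores = sorted([13, 6, 16, 13, 8, 19, 23, 18, 11, 15, 7])  # normal students
half = len(scores) // 2

lower = scores[:half]
# Skip the middle observation for odd n so it belongs to neither half.
upper = scores[half + 1:] if len(scores) % 2 else scores[half:]

q1, m, q3 = median(lower), median(scores), median(upper)
five_number_summary = (scores[0], q1, m, q3, scores[-1])

iqd = q3 - q1
fences = (q1 - 1.5 * iqd, q3 + 1.5 * iqd)
outliers = [s for s in scores if s < fences[0] or s > fences[1]]

print(five_number_summary, fences, outliers)
```

Other quartile conventions exist and can give slightly different values of Q1 and Q3 for small samples.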
Exercise 7
a. r = 0.7956
b. r2 = 0.6330: 63.3% of the variation of the weight of supermodels can be explained by (the linear
relation between weight and) the length
c. $\hat{y}(69)$ = 116.254 and $\hat{y}(75)$ = 139.554
d. x = 69 is within the interval of observed lengths (interpolation), so the predicted value is more reliable than at
x = 75, which is outside the interval: extrapolation. The question arises whether the linear relation
applies outside the interval of observations.
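The values in a. to c. follow directly from the data of exercise 5. A minimal sketch, assuming the least squares line; the helper `predict` is a hypothetical name introduced for illustration.

```python
from math import sqrt
from statistics import mean

x = [71, 70.5, 71, 72, 70, 70, 66.5, 70, 71]        # length (in)
y = [125, 119, 128, 128, 119, 127, 105, 123, 115]   # weight (lb)
x_bar, y_bar = mean(x), mean(y)

sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sxy / sqrt(sxx * syy)      # correlation coefficient
r_squared = r ** 2             # proportion of variation explained

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar


def predict(length_in):
    """Estimated expected weight (lb) at the given length (in)."""
    return b0 + b1 * length_in


inside_range = min(x) <= 69 <= max(x)    # interpolation
outside_range = not (min(x) <= 75 <= max(x))  # extrapolation
print(round(r, 4), round(r_squared, 4), predict(69), predict(75))
```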
Exercise 8
If the population, with expected value µ and variance σ2, is not normally distributed, then according to the Central Limit
Theorem the sample mean $\bar{X}$ is approximately N(µ, σ2/n)-distributed if n is large enough. So in this case the mean salary of 25
hotel cleaners is approximately normally distributed:
$\bar{X}$ ~ N(1100, $240^2/25$)
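The requested probability follows from this approximate distribution. The sketch below carries out that last step under the normal approximation stated above; the names are illustrative, and the result is an approximation, not an exact probability.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1100, 240, 25

# Approximate distribution of the sample mean: N(mu, sigma^2 / n),
# so the standard deviation of the mean is sigma / sqrt(n) = 48.
sample_mean_dist = NormalDist(mu, sigma / sqrt(n))

z = (1150 - mu) / (sigma / sqrt(n))            # standardised value, about 1.04
p_at_least_1150 = 1 - sample_mean_dist.cdf(1150)
print(round(z, 2), round(p_at_least_1150, 3))
```

So roughly 15% of samples of 25 cleaners would show a mean salary of at least 1150 per month under this approximation.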