Exam Applied Statistics - Tuesday May 26, 2009, 9.00 in GOVI

This exam consists of 2 parts: part 1 is for all students, part 2 is only for students who want to do test 1 or improve their result for test 1.
The use of a calculator is allowed and advised. A list of formulas and 3 tables are attached to this exam.

-- Success! --

Exercise 1
Overweight is public enemy number one in Western societies. A producer of a new slimming diet claims to have developed an effective diet: after two years, more than 60% of the users who are between 10 and 20 kg overweight will be less than 5 kg overweight. The company hired university researchers to substantiate this claim. The researchers chose at random 384 persons who had the appropriate overweight and who were willing to use the diet. After two years 245 of the participants were less than 5 kg overweight.
Does this result prove the claim of the producer? Conduct a test at significance level 5%: use the 8-step testing procedure and report the critical value(s).

Exercise 2
An insurance company collaborates with two garages for repairing damaged cars which are insured by the company. To investigate which of the garages has the lowest repair rates, the company asks both garages to make an assessment for the repair of 7 arbitrarily chosen damaged cars. Here are the costs of repair given by the two garages for the 7 cars:

car            1      2      3      4      5      6      7     sample mean   standard deviation
garage 1     15.2   20.4   19.0    6.2    8.0   12.6   10.6      13.14           5.36
garage 2     14.7   18.3   16.8    6.7    7.6   11.6    9.8
difference   +0.5   +2.1   +2.2   -0.5   +0.4   +1.0   +0.8       0.93           0.96

a. Compute the sample mean and sample standard deviation of garage 2 (which were accidentally left out).
b. Give a 95% confidence interval for the expected difference in repair rates. Explain which formula you choose (and why).
c. Does the interval in part b show that the null hypothesis of no difference can be rejected versus the alternative that there is a difference? At what level of significance?
If you did not find the answer to part b, use the interval (0.2, 1.6).
d. Does the interval in part b show that the null hypothesis of no difference can be rejected versus the alternative that garage 1 has higher repair rates? At what level of significance?
e. If we want to conduct a non-parametric test to find out whether garage 1 has higher repair rates than garage 2, then give:
   1. The name of the test
   2. The hypotheses
   3. The value of the test statistic
   4. The p-value
   5. Your conclusion

Exercise 3
Many persons suffer from FNE ("Fear of Negative Evaluation"). To check whether the eating pattern influences the level of FNE, a psychologist conducted a survey using two groups of female students: the first group suffered from an eating disorder (bulimia) and the second group consisted of students with normal eating habits. He selected 11 students from each group; all had to fill in a list of questions. The answers were used to find their FNE-score. These were the results, including summarizing measures:

student                      1    2    3    4    5    6    7    8    9   10   11    mean    standard deviation
students with bulimia (x)   21   13   10   20   25   19   16   21   24   13   14   17.82        4.916
normal students (y)         13    6   16   13    8   19   23   18   11   15    7   13.55        5.336
z = x - y                    8    7   -6    7   17    0   -7    3   13   -2    7    4.27        7.525

a. Which t-test would you conduct to prove the conjecture of the researcher that women who suffer from bulimia have a larger expected FNE-score? First state the statistical assumptions you consider reasonable in this case. Then give the appropriate hypotheses, compute the value of the test statistic, give the critical value(s) of the test (use α = 0.05) and indicate whether the null hypothesis should be rejected.
b. Which non-parametric test would you choose as an alternative for the t-test in a? Name the test, point out how the test statistic is defined (do not compute its value!) and state whether the test is left-, right- or two-sided.
Exercise 4
Green pea plants have different characteristics: the colour of the flowers can be either blue or red and the shape of the pollen can be either long or round. A biologist observed many plants and tried to find out whether these characteristics can be considered independent. After observing 427 green pea plants he published the following table:

                     Colour
                  Blue    Red    Total
Pollen   Long      296     27      323
         Round      19     85      104
Total              315    112      427

Conduct a test to show whether or not Pollen and Colour are independent at the 5% level. You can restrict yourself to steps 3-8 of the test procedure.

Exercise 5
The length (in inches) and the weight (in pounds) of the so-called supermodels Niki Taylor, Nadia Averman, Claudia Schiffer, Elle MacPherson, Christy Turlington, Bridget Hall, Kate Moss, Valerie Mazza and Kirsty Hume are given in the following table:

x = Length (in)    71    70.5   71    72    70    70    66.5   70    71
y = Weight (lb)   125   119    128   128   119   127   105    123   115

x̄ = 70.222, sx = 1.5434, ȳ = 121, sy = 7.5333

Applying linear regression to these observations we find: ŷ ≈ -151.70 + 3.8834·x
a. Compute Σ(xi - x̄)² and the regression variance s², in three decimals. (If you cannot compute these values, use Σ(xi - x̄)² = 19.1 and s² = 23.8, respectively, in parts b and c.)
b. Find a 95% confidence interval for the expected increase of the weight per inch.
c. Sylvia is another supermodel and her length is 69 inches. Estimate her weight by an interval having a confidence level of 90%.

---------------------------------------------------------------------------------------
Part 2 (only for those who want to redo test 1)

Exercise 6
Consider the observations given in exercise 3:
a. Make a back-to-back stem-and-leaf plot of these two samples and comment on the shape of the two plots (compare e.g. centre and variation).
b. Give a 5-number summary of the scores of the normal students and check whether there are any outliers.

Exercise 7
Consider the observations given in exercise 5.
a. Find the correlation coefficient r.
b. Give the proper interpretation of the value of r².
c. Estimate the expected weight of a supermodel of 69 inches and of a supermodel of 75 inches.
d. What is the difference between the values you computed in c with respect to reliability?

Exercise 8
The distribution of the salaries of hotel cleaners is not known, but the mean salary, € 1100 per month, and the standard deviation, € 240 per month, are known. Compute (approximate) the probability that 25 arbitrarily chosen hotel cleaners have a mean salary of at least € 1150 per month.

---------------------------------------------------------------------------------------
Marks:

Exam (Mark exam = #points/4.5)
Exercise   1    2a   2b   2c   2d   2e   3a   3b   4    5a   5b   5c   Total
Points     8    2    4    2    2    4    5    3    6    2    4    3    45

Part 2, test 1 (Mark = #points/2)
Exercise   6a   6b   7a   7b   7c   7d   8    Total
Points     4    3    2    2    1    2    6    20

Final mark Applied Statistics = 0.4 × test 1 + 0.6 × exam, increased by 5% of the difference between each of the 4 assignment marks and this average, if this difference is positive.

Formulas

Testing procedure:
1. The research question (in words)
2. The statistical assumptions (model)
3. The hypotheses and level of significance
4. The test statistic and its distribution
5. The observed value (of the statistic)
6. The rejection region (for H0) or the p-value
7. The statistical conclusion
8. The conclusion in words (answer to the question)

X ~ B(n, p) => X is approximately N(np, np(1-p))-distributed (appr. for large n: np(1-p) > 5)

Estimation: MSE =

Confidence intervals: x̄ ± c·s/√n, where P(Tn-1 ≥ c) = α/2

Tests:
if H0: µ = µ0 is true, then T = (X̄ - µ0)/(S/√n) ~ t(n-1)
if H0: p = p0 is true, then Z = (X - np0)/√(np0(1-p0)) is approximately ~ N(0, 1)

Sign test: if H0: p = ½ is true, then X ~ B(n, ½) and, if n > 10, X is approximately N(½n, ¼n).

Linear Regression:

Solutions Exam

Exercise 1
1. Is the diet successful after 2 years in more than 60% of the cases of 10-20 kg overweight?
2. X is the number of cases where the overweight is reduced to less than 5 kg after 2 years: X ~ B(384, p)
3. Test H0: p = 0.60 versus H1: p > 0.60, α = 0.05.
4. Test statistic: X ~ B(384, 0.60) if H0: p = 0.60 is true.
5. Observed x = 245
6.
Right-sided test: if X ≥ c, then reject H0, where c is chosen such that P(X ≥ c | p = 0.6) ≤ 5% and, approximately, X ~ N(np, np(1-p)) = N(230.4, 92.16).
   Using continuity correction: P(X ≥ c) = P(X ≥ c - 0.5) ≈ P(Z ≥ (c - 0.5 - 230.4)/√92.16) ≤ 5%, with Z ~ N(0, 1).
   From the N(0,1)-table: (c - 230.9)/√92.16 ≥ 1.645, so c ≥ 230.9 + 1.645·√92.16 ≈ 246.7. So we choose c = 247.
7. x = 245 < 247 = c => do not reject H0
8. The claim of the producer is not proven at the 5% level.

Exercise 2
a. For garage 2: sample mean = 12.21 and sample standard deviation = 4.51
b. These are paired samples (two observations per car) => we use the one-sample method for the differences.
   95% confidence interval for the expected difference µZ:
   z̄ ± c·sZ/√n = 0.93 ± 2.447·0.96/√7 ≈ (0.04, 1.82),
   where n = 7, z̄ = 0.93, sZ = 0.96 and c = 2.447, such that P(T6 ≥ c) = α/2 = 0.025
c. Yes, µZ = 0 is not included in the 95% interval, so at the 5% level H0: µZ = 0 can be rejected versus H1: µZ ≠ 0
d. µZ = 0 is not included in the 95% interval and the interval lies entirely above 0, so at the 2.5% level H0: µZ = 0 can be rejected versus H1: µZ > 0
e. 1. The sign test
   2. We test H0: p = ½ (the probability of a positive difference garage 1 - garage 2) versus H1: p > ½
   3. Observed X = 6 (6 of the 7 differences are positive)
   4. p-value = P(X ≥ 6) = P(X = 6) + P(X = 7) = (7 + 1)/2^7 = 8/128 ≈ 0.0625 > 5%
   5. Do not reject H0

Exercise 3
a. In this case we will conduct the two independent samples t-test for the equality of means, assuming that X1, ..., X11 and Y1, ..., Y11 are independent, where Xi ~ N(µ1, σ1²) and Yi ~ N(µ2, σ2²), and where the µ's and σ's are unknown and possibly different (you could assume the σ's to be equal, considering the small difference between the sample standard deviations).
   We test H0: µ1 = µ2 versus H1: µ1 > µ2 with T = (X̄ - Ȳ)/√(SX²/11 + SY²/11), which under H0 is approximately t(10)-distributed (10 = min(11-1, 11-1)).
   Observed value: t = (17.82 - 13.55)/√(4.916²/11 + 5.336²/11) ≈ 1.95
   Critical value for this right-sided test: c = 1.812 from the t-table at 5%.
   In this case t > c, so reject H0.
b. The Wilcoxon Rank Sum Test: W = the sum of the ranks of the X-values (the FNE-scores of the students with an eating disorder). Reject H0 for large values of W: a right-sided test.

Exercise 4
In the table the expected values are computed using the formula E = (row total) × (column total)/n.

                          Colour
                  Blue                  Red                  Total
Pollen   Long     O = 296, E = 238.3    O = 27, E = 84.7       323
         Round    O = 19,  E = 76.7     O = 85, E = 27.3       104
Total             315                   112                    427

3. We test H0: "independence of Pollen and Colour" versus H1: "There is a relation between Pollen and Colour", α = 5%
4. Test statistic X² = Σ (O - E)²/E, which is approximately χ²-distributed with (2-1)(2-1) = 1 degree of freedom if H0: "independence of Pollen and Colour" is true
5. Observed value X² = 218.4
6. It is a right-sided test: reject H0 if X² ≥ c = 3.841, from the χ²-table, tail probability 5% (or: p-value = P(χ² ≥ 218.4) < 0.0005)
7. X² = 218.4 > c (or: p-value < α) => reject H0
8. A relation between Pollen and Colour is proven at the 5% level.

Exercise 5
a. Σ(xi - x̄)² = (n - 1)·sx² = 8 · 1.5434² ≈ 19.056 and the regression variance s² = [Σ(yi - ȳ)² - β̂1²·Σ(xi - x̄)²]/(n - 2) ≈ 23.804
b. A 95% confidence interval for the regression coefficient β1 is:
   β̂1 ± c·s/√Σ(xi - x̄)² = 3.8834 ± 2.365·√(23.804/19.056) ≈ (1.24, 6.53),
   where β̂1 = 3.8834 (given), Σ(xi - x̄)² = 19.056 and s² = 23.804 (see a), and c = 2.365 such that P(T7 ≥ c) = 0.025
c. This is not a confidence interval for the mean of all supermodels having a length of 69 inches but a prediction interval for one individual supermodel (Sylvia):
   ŷ* ± c·s·√(1 + 1/n + (x* - x̄)²/Σ(xi - x̄)²) ≈ (106.2, 126.3), where x* = 69, ŷ* = 116.254 and c = 1.895 such that P(T7 ≥ c) = 0.05.

Exercise 6
a. The centre of the data is higher for bulimia; the variation is approximately the same.
b. Ordering the observations we get the following:

   Rank          1   2   3   4    5    6    7    8    9    10   11
   Observation   6   7   8   11   13   13   15   16   18   19   23

   The 5-number summary is:

   minimum   Q1   M    Q3   maximum
   6         8    13   18   23

   The IQD = Q3 - Q1 = 18 - 8 = 10. The 1.5 IQD-rule says that observations outside (Q1 - 1.5 IQD, Q3 + 1.5 IQD) = (-7, 33) are outliers => there are no outliers.

Exercise 7
a. r = 0.7956
b. r² = 0.6330: 63.3% of the variation of the weight of supermodels can be explained by (the linear relation between weight and) the length
c. For x = 69: ŷ = 116.254; for x = 75: ŷ = 139.554
d. x = 69 is within the interval of observed lengths (interpolation), so the predicted value is more reliable than at x = 75, which is outside the interval (extrapolation): the question arises whether the linear relation applies outside the interval of observations.

Exercise 8
If the population, with expected value µ and variance σ², is not normally distributed, then according to the Central Limit Theorem the sample mean X̄ is approximately N(µ, σ²/n)-distributed if n is large enough. So in this case the mean salary of 25 hotel cleaners is approximately N(1100, 240²/25)-distributed.
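The probability asked for in Exercise 8 follows from this normal approximation by standardizing: the standard deviation of the sample mean is 240/√25 = 48, so the threshold 1150 lies (1150 - 1100)/48 ≈ 1.04 standard deviations above the mean. The following is a minimal Python sketch of that last step, not part of the original solution; it assumes only the standard library class statistics.NormalDist, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu = 1100.0     # population mean salary per month (given)
sigma = 240.0   # population standard deviation per month (given)
n = 25          # number of hotel cleaners in the sample

# Central Limit Theorem: the sample mean is approximately
# N(mu, sigma^2 / n), so its standard deviation is sigma / sqrt(n).
se = sigma / sqrt(n)

# Standardized threshold and right-tail probability P(mean >= 1150).
z = (1150.0 - mu) / se
p = 1.0 - NormalDist(mu, se).cdf(1150.0)

print(f"z = {z:.2f}, P(mean >= 1150) is approximately {p:.3f}")
```

Under these assumptions the probability comes out at roughly 0.15, which agrees with looking up z ≈ 1.04 in the N(0,1)-table.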