AMS 5 – Statistics (Quiz 1)
10 August 2006
Due at the beginning of class on Tuesday 15 August 2006
Instructor: Dimitris Fouskakis
Student Name_______________________________________________

1. The conventional method of measuring the refractive status of an eye involves three quantities: (1) sphere power, (2) cylinder power, and (3) axis. Optometric researchers studied the variation in these three measures of refraction (Optometry and Vision Science, June 1995). Twenty-five successive refractive measurements were obtained on the eyes of over 100 university students. The cylinder power measurements for the left eye of one particular student (ID #11) are listed in the table:

-0.08 -0.16 -0.11 -0.20 -0.07
-0.09 -0.06 -0.08 -0.04 -0.06
-0.12 -0.16 -0.06 -1.07 -0.07
-0.12 -0.17 -0.15 -0.10 -0.09
-0.17 -0.17 -0.09 -0.16 -0.21

a) Calculate the mean and the median. Which one is a more appropriate numerical measure of center for the above data? (7 pts)
b) Do you find any extreme values in the data set? (2 pts)
c) If yes, delete it and calculate again the mean and the median. Which one is a more appropriate numerical measure of center for the new data? (7 pts)

Answers:
a) The median is the more appropriate measure of center for these data, because the mean is pulled toward the extreme value -1.07. After sorting the data we select the middle number (the 13th ordered observation), which is -0.11. The mean is -0.15.
b) Yes, the value -1.07 is an extreme one.
c) Without the outlier both the mean and the median are appropriate. The new mean is about -0.12 (-0.116 to three decimals) and the new median is -0.105, so the two are now much closer.

2. In a study to determine the effect of cigarette smoking on phenacetin metabolism, phenacetin is administered to 12 smokers and 14 nonsmokers. After two hours, plasma phenacetin level (µg/ml) is measured in all study participants. Phenacetin levels, smoking status and age for each participant were recorded.

Table 1: Grouped frequencies and relative frequencies for plasma phenacetin levels.
Plasma Phenacetin (µg/ml)   Non-Smokers               Smokers
                            Frequency   Percent (%)   Frequency   Percent (%)
0.005 – 0.505               1           7             5           42
0.505 – 1.005               1           7             2           17
1.005 – 1.505               2           14            2           17
1.505 – 2.005               3           22            1           8
2.005 – 2.505               3           22            0           0
2.505 – 3.005               2           14            1           8
3.005 – 3.505               1           7             0           0
3.505 – 4.005               1           7             1           8
Total                       14          100           12          100

a) Summary statistics for phenacetin levels (µg/ml) are presented in Table 2. Do these summary statistics suggest anything about the shape of the distribution for each group? (3 pts)

Table 2: Summary statistics for the phenacetin levels according to study group

Group         Mean   Median   Standard Deviation
Non-Smoking   2.05   2.03     0.90
Smoking       0.93   0.57     1.20

b) Draw the frequency and density histograms for each group. (8 pts)
c) Is it a good idea to plot the relative frequencies (percentages) in this example? (3 pts)
d) For comparing the two groups, is it a better idea to use the frequency or the relative frequency histograms, and why? (3 pts)
e) What do the statistical numerical and graphical summaries suggest about the qualitative effect of smoking on phenacetin concentration? (3 pts)
f) Suppose it is known that phenacetin metabolism increases with age. Summary statistics for age according to study group are given in the table below. How would this factor affect the conclusions made in the previous question? (3 pts)

Table 3: Summary statistics for age (in years) according to study group

Group         Mean   Median   Standard Deviation
Non-Smoking   47.2   46.5     15.1
Smoking       35.4   30.0     14.6

Answers:
a) The mean and the median of the levels taken from the non-smoking participants are nearly equal, which suggests that the distribution for this group is symmetric (or nearly symmetric). On the other hand, the mean of the phenacetin levels taken from the smoking participants is larger than the median, which suggests that the distribution for this group is skewed with a tail to the right.
b) [Figure: frequency and density histograms of plasma phenacetin, one pair of panels for non-smokers and one for smokers; horizontal axis "Plasma Phenacetin", vertical axes "Frequency" and "Density".]
c) Yes, because all the bins have the same width.
d) The relative frequency histograms, because they are not affected by the difference between the two sample sizes (14 non-smokers versus 12 smokers).
e) All statistical summaries suggest that cigarette smokers have lower plasma phenacetin concentrations than non-smokers.
f) If phenacetin metabolism increases with age, then plasma phenacetin levels would be expected to be higher in older individuals than in younger ones. Comparing the average age of smokers with the average age of non-smokers (Table 3), it appears that, in general, the non-smokers are older than the smokers. Thus, if age affects phenacetin metabolism, it may be that age, rather than smoking, accounts for the lower levels of phenacetin among smokers in this study. In other words, age is a confounding factor, since it is related to both smoking status and plasma phenacetin levels.

3. The following data are the failure times in hours of 14 transmissions from caterpillar tractors belonging to a particular American company:

2603 2320 2161 3286 3168 2376 3330
4159 2537 3814 2157 218 3168 3266

a) Calculate the median of the above data. (2 pts)
b) Without calculating the mean, say whether it is greater than, approximately equal to, or smaller than the median. (3 pts)
c) Do you think that the standard deviation provides a good measure of variability for the above data? If not, can you propose an alternative measure? Explain briefly. (5 pts)

Answers:
a) After sorting the data we take the middle value, which for 14 observations is the average of the 7th and 8th ordered observations: (2603 + 3168) / 2 = 2885.5.
b) Because of the extreme observation with the value 218, the mean will be lower than the median.
c) Again because of the outlier, the standard deviation is not a good measure of variability in this example.
An alternative would be the interquartile range, which is the difference between the 75th and 25th percentiles.

4. Among applicants to one law school in 1923, the average LSAT score was about 169, the SD was about 9 and the lowest score was 160. Did the LSAT scores follow the normal curve? (3 pts)

Answer: No. For example, the normal curve says that about 16% of the scores should be more than 1 SD below the average (that is, below 169 - 9 = 160), and none are.

5. The scores on a Math test of 20 second year students were as follows:

39 61 47 58 65 70 75 66 60 54
55 47 68 82 66 62 54 58 50 42

a) Draw a histogram. Would you say that the histogram follows the normal curve? (2 pts)
b) Calculate the mean and the standard deviation. (4 pts)
c) Using the normal curve, estimate the percentage of students who scored (i) below 50; (ii) between 52 and 58; and (iii) above 60. (9 pts)
d) One student's score was 0.7 SDs below average. About what percentage of the students had higher scores than he did? (3 pts)
e) Estimate the 60th percentile of the Math test. (2 pts)
f) Find the percentile of the student with score 42. (2 pts)

Answers:
a) [Figure: density histogram of the scores; horizontal axis "scores" from 40 to 80, vertical axis "Density".] The histogram roughly follows the normal curve.
b) The mean is 58.95 and the SD is 10.63.
c) (i) 0.20; (ii) 0.21; (iii) 0.46.
d) 0.76, that is, about 76%.
e) The 60th percentile of the normal curve is about 0.25 SDs above average, which is 0.25 × 10.63 ≈ 2.66 points above average. That gives 58.95 + 2.66 ≈ 61.6.
f) This student was (42 - 58.95) / 10.63 ≈ 1.59 SDs below average, at about the 5th percentile.

6. According to a study done at Kaiser Permanente in Walnut Creek, California, users of oral contraceptives have a higher rate of cervical cancer than nonusers, even after adjusting for age, education and marital status. Investigators concluded that the pill causes cervical cancer.

a) Is this a controlled experiment or an observational study? (3 pts)
b) Why did the investigators adjust for age, education and marital status?
(3 pts)
c) Women using the pill were likely to differ from non-users on another factor which affects the risk of cervical cancer. What factor is that? (3 pts)
d) Were the conclusions of the study justified by the data? Answer yes or no, and explain briefly. (2 pts)

Answers:
a) This is an observational study.
b) Rates of cervical cancer go up with age; women of different marital status have different patterns of sexual activity and are therefore exposed to different kinds of risk; similarly for education. In other words, these are potential confounders.
c) Pill users are more sexually active than non-users and have more partners. That seems to be what makes the rate of cervical cancer higher among pill users.
d) No, see c).

7. A die is rolled 10 times. Find the chance of
a) getting 10 sixes. (3 pts)
b) not getting 10 sixes. (3 pts)
c) all the rolls showing 5 spots or less. (3 pts)

Answers:
a) (1/6)^10 = 1/60,466,176.
b) 1 − (1/6)^10 = 1 − 1/60,466,176 ≈ 1.
c) (5/6)^10 ≈ 0.16 = 16%.

8. A die is weighted so that, when the die is thrown, a "6" is three times more likely to occur than each of the other numbers.
a) Find the probability of a "6", and of each of the other numbers, occurring. (3 pts)
b) Find the probability of obtaining an even number on a single throw. (3 pts)

Answers:
a) Let X = P(getting an ace) = P(getting a deuce) = ... = P(getting a five); then P(getting a 6) = 3 × X. Since the probabilities of all possible outcomes must add up to 1, we have 5X + 3X = 1 ⇒ X = 1/8. Therefore P(getting an ace) = P(getting a deuce) = ... = P(getting a five) = 1/8 and P(getting a 6) = 3/8.
b) P(getting an even number) = P(getting 2 or 4 or 6). Since these events are mutually exclusive, the addition rule for probabilities gives P(getting an even number) = 1/8 + 1/8 + 3/8 = 5/8 = 62.5%.
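
The arithmetic in the answers to questions 7 and 8 can be rechecked with exact fractions. The following is a minimal sketch in Python, not part of the original quiz; the variable names (`p_ten_sixes`, `weights`, `probs`, and so on) are illustrative choices. It assumes independent rolls for question 7 and encodes the weighting of question 8 as relative weights 1, 1, 1, 1, 1, 3.

```python
from fractions import Fraction

# Question 7: a fair die rolled 10 times, rolls assumed independent.
p_ten_sixes = Fraction(1, 6) ** 10          # every roll shows a six
p_not_ten_sixes = 1 - p_ten_sixes           # complement of the event above
p_all_five_or_less = Fraction(5, 6) ** 10   # no roll shows a six

# Question 8: faces 1-5 get relative weight 1, face 6 gets weight 3
# (illustrative encoding of "three times more likely").
weights = {face: 1 for face in range(1, 6)}
weights[6] = 3
total = sum(weights.values())               # 5 * 1 + 3 = 8
probs = {face: Fraction(w, total) for face, w in weights.items()}

# The events "2", "4" and "6" are mutually exclusive, so their chances add.
p_even = sum(probs[face] for face in (2, 4, 6))

print("P(10 sixes)          =", p_ten_sixes)
print("P(not 10 sixes)      =", float(p_not_ten_sixes))
print("P(all rolls <= 5)    =", round(float(p_all_five_or_less), 4))
print("P(1), P(6), P(even)  =", probs[1], probs[6], p_even)
```

Using `Fraction` keeps the results exact, so they can be compared directly with the fractions 1/8, 3/8 and 5/8 in the answer key rather than with rounded decimals.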