MASSEY UNIVERSITY
PALMERSTON NORTH CAMPUS

EXAMINATION FOR
161.120 INTRODUCTORY STATISTICS
161.130 BIOMETRICS

SEMESTER II 2003
________________________________________________

Time allowed: THREE (3) hours.

This paper comprises:
SECTION A, containing 30 multiple choice questions;
SECTION B, containing 3 questions.
An appendix of tables is at the back of the paper.
A Scantron card is provided for your answers to Section A.

Instructions: Attempt ALL of Section A and TWO (2) questions from Section B. Section A will be marked out of 30, and each question in Section B will be marked out of 15.

This examination contributes 50% (internal) or 70% (extramural) to the final assessment.
_________________________________________________________________________

SECTION A: The answers to the questions in this section must be entered on a Scantron card in pencil and handed in with your blue examination book.

A1 Soil samples were collected at thirty different sites in an agricultural area and the soil acidity (pH) was measured. The following stem-and-leaf plot shows the pH values, which range from 2.6 to 6.3.

Stems   Leaves
2       679
3       237789
4       1222446899
5       0556788
6       0233

The median acidity is:
a. 4.6
*b. 4.5
c. 4.4
d. 4.3
e. 4.2

A2 A _______________ shows the relationship between two variables.
a. box plot
b. bar chart
c. histogram
*d. scatter plot
e. pie chart

A3 To help interpret diagnostic tests, doctors need to understand the distribution of the test results for ‘normal’ people. The histogram below shows the plasma glucose concentrations (mg per dL) for 50 normal fasting people.

[Histogram of plasma glucose concentrations not reproduced]

What proportion of the people has plasma glucose levels below 95?
a. 0.09
b. 0.10
c. 0.16
*d. 0.34
e. 0.50

A4 A sample of 99 distances has a mean of 24 metres and a median of 24.5 metres. Unfortunately, it has just been discovered that an observation which was erroneously recorded as "30" actually had a value of "35". If we make this correction to the data, then:
a. the mean remains the same, but the median is increased
b. the mean and median remain the same
*c. the median remains the same, but the mean is increased
d. the mean and median are both increased
e. we do not know how the mean and median are affected without further calculations; but the variance is increased.

A5 The weights of the male and female students in a class are summarized in the following boxplots:

[Boxplots of Weight (pounds) for Males and Females, axis from 80 to 240, not reproduced]

Four of the following statements are correct. Which one is FALSE?
a. About 50% of the male students have weights between 150 and 185 lbs.
b. About 25% of female students have weights more than 130 lbs.
c. The median weight of male students is about 162 lbs.
d. The mean weight of female students is about 120 because their distribution is fairly symmetric.
*e. The male students have less variability than the female students.

A6 Four of the following statements are correct. Which one is FALSE?
*a. The numbers 1, 5, 9 have a smaller standard deviation than 101, 105, 109.
b. The numbers 3, 3, 3 have a standard deviation of 0.
c. The numbers 3, 4, 5 have the same standard deviation as 1003, 1004, 1005.
d. The standard deviation is a measure of spread around the centre of the data.
e. The standard deviation can only be computed for numerical data.

A7 A researcher wishes to calculate the average height of patients suffering from a particular disease. From patient records, the mean was computed as 156 cm, and standard deviation as 5 cm. Further investigation reveals that the scale was misaligned, and that all readings are 2 cm too large, e.g., a patient whose height is really 180 cm was measured as 182 cm. Furthermore, the researcher would like to work with statistics based on meters. The correct mean and standard deviation are:
a. 1.56m, .05m
*b. 1.54m, .05m
c. 1.56m, .03m
d. 1.58m, .05m
e. 1.58m, .07m

A8 The following information about the country of origin of immigrants to Australia was published in the Dominion Post on Fri August 15.

Country of origin of immigrants to Australia, in the year to June 30
Britain        12,510     Philippines      3,190
China           6,600     South Africa     4,600
Fiji            1,610     Taiwan           1,110
Former USSR     1,100     United States    1,320
India           5,780     Vietnam          2,570
Indonesia       3,030     Yugoslavia       1,630
New Zealand    12,370     Other           36,490

The graphical display of these data that makes it easiest to see the proportion of immigrants from Asia is…
a. A bar chart with the countries in the same order as presented in the table
b. A bar chart with the countries ordered in decreasing order of frequency
c. A pie chart with the countries ordered in decreasing order of frequency
d. A bar chart with the countries reordered so that the Asian countries are adjacent
*e. A pie chart with the countries reordered so that the Asian countries are adjacent

A9 When looking at a sequence of monthly postal revenue data, we note that the revenue is consistently highest in December. The high December revenue is an illustration of:
a. trend
*b. seasonal variation
c. random fluctuations
d. a cycle
e. an outlier

A10 For children between the ages of 18 months and 29 months, there is approximately a linear relationship between "height" and "age". From a data set of 100 children in this age group, the least squares line was
y = 64.93 + 0.63x
where y represents height (in centimeters) and x represents age (in months). One of the children, Joseph, is 22.5 months old and is 80 centimeters tall. What is Joseph's residual?
*a. +0.9
b. 79.1
c. -0.9
d. 56.6
e. 64.93

A11 A company that conducts regular political public opinion polls for a TV station has decided to increase the size of its random sample of voters from about 1500 people to about 4000 people. The effect of this increase is to:
a. reduce the bias of the estimate.
b. increase the standard error of the estimate.
*c. reduce the variability of the estimate.
d. increase the confidence interval width for the parameter.
e. have no effect since the population size is the same.

A12 Four of the following statements are correct. Which one is FALSE?
a. In a proper random sampling, every element of the population has a known (and often equal) chance of being selected.
b. The precision of a sample mean or sample proportion depends mainly upon the sample size (and not the population size) in a proper random sample.
c. Convenience sampling often leads to biases in estimates since the sample is often not representative of the population.
*d. In a telephone survey of households in New Zealand, a high sample size guarantees that the mean household income in the country can be accurately estimated.
e. The sampling distribution of the sample mean describes how the sample mean will vary among repeated samples.

A13 A new headache remedy was given to a group of 25 subjects who had headaches. Four hours after taking the new remedy, 20 of the subjects reported that their headaches had disappeared. From this information you conclude:
a. that the remedy is effective for the treatment of headaches.
*b. nothing, because there is no control group for comparison.
c. nothing, because the sample size is too small.
d. that the new treatment is better than aspirin.
e. that the remedy is not effective for the treatment of headaches.

A14 What is the best reason for performing a paired experiment rather than an experiment with two independent samples?
a. It is easier to do since we need fewer experimental units and each unit receives more than one treatment.
*b. It allows us to remove variation in the results caused by other factors since we can compare both treatments within the same experimental unit.
c. The calculations will be more accurate since we work only with the differences.
d. The paired t-test uses fewer degrees of freedom than the two-sample t-test.
e. It allows us to do more experiments since we use each experimental unit twice.

A15 The daily milk production of Guernsey cows is approximately normally distributed with a mean of 35 kg/day and a std. deviation of 6 kg/day. The probability that one day’s production for a single animal will be less than 28 kg is approximately:
*a. .12
b. .41
c. .09
d. .38
e. .62

A16 If the sampled population has mean 48 and standard deviation 16, then the mean and the standard deviation for the sampling distribution of x̄ for n = 16 are
      Mean   St devn
a.      4        1
b.     12        4
*c.    48        4
d.     48        1
e.     48       16

A17 The essence of the Central Limit Theorem is that:
*a. Irrespective of the distribution of the parent population, the distribution of the sample mean will be approximately normal, provided the sample size is large
b. Irrespective of the sample size, the sample mean will be normally distributed
c. Irrespective of the sample size, the population mean will be normally distributed
d. Provided the sample size is large, the distribution of the population from which the sample is selected will be normal
e. Provided the sample size is large, the distribution of the sample can be regarded as approximately normal

A18 Suppose that 30% of first year students in the University of Auckland live in flats. If 200 students are randomly selected, then the standard deviation of the number who live in flats will be approximately
a. 0.0011
b. 0.0324
c. 0.3
*d. 6.48
e. 42.0

A19 Minitab reports the following information about the weights in pounds of 143 bears, classified by gender (male=1, female=2)

Descriptive Statistics: Weight by Sex

Variable  Sex     N     Mean   Median   TrMean    StDev
Weight    1      99    214.0    180.0    208.9    119.7
          2      44   143.05   141.00   139.17    64.48

Variable  Sex  SE Mean  Minimum  Maximum       Q1       Q3
Weight    1       12.0     34.0    514.0    122.0    316.0
          2       9.72    26.00   356.00   114.00   164.50

The difference between the mean weights of the male and female bears is estimated to be 70.9 pounds.
The standard error of this estimate is...
*a. 15.4
b. 21.7
c. 55.2
d. 92.1
e. 136.0

Questions 20 and 21 relate to the following problem. A Massey University researcher wishes to investigate whether a new variety of wheat is more resistant to a disease than an old variety. It is known that this disease strikes approximately 15% of all plants of the old variety. A field experiment was conducted and 12 of the 120 experimental plants became infected.

A20 The null and alternative hypotheses are:
a. H0: π = 0.10   HA: π > 0.15
b. H0: π = 0.10   HA: π > 0.10
c. H0: π = 0.15   HA: π ≠ 0.15
*d. H0: π = 0.15   HA: π < 0.15
e. H0: π = 0.15   HA: π > 0.15

A21 The calculated value of the test statistic is:
a. z = -47.1
b. z = -0.39
c. z = -3.07
d. z = -1.83
*e. z = -1.53

A22 A study was carried out on the effectiveness of a grain additive in deterring pigs from eating the grain. 1000 pigs were selected for the study, with 500 assigned to the treatment group (grain laced with 1080+dye) and the remaining 500 assigned to the placebo group (grain laced with dye only). Minitab reports…

T-Test of difference = 0 (vs not =): T-Value = -5.42  P-Value = 0.000  DF = 499

The best conclusion is…
a. There is a large difference between the effects of the treatment and the placebo
b. There is strong evidence that the 1080 additive is very effective in altering the intake of grain by pigs
*c. There is strong evidence of a difference in intake between the treatment and placebo but the difference may be small
d. There is little evidence that the treatment has any effect on the intake of grain by pigs
e. There is evidence of a strong treatment effect

A23 Health researchers wish to investigate whether the tar content (milligrams) varies among four brands of cigarettes. Three packs of each brand were selected, and one cigarette from each pack was placed in a smoking machine to determine the tar content.
An analysis of variance was performed and here are the results (some parts are hidden):

Analysis of Variance for Tar
Source    DF       SS      MS       F       P
Brand      3   348.00   116.0   11.60   0.003
Error      8    80.00    10.0
Total     11   428.00

Which of the following is correct:
a. Because the p-value is small, there is evidence that all the brands differ from each other in the mean amount of tar present.
*b. Because the p-value is small, there is evidence that at least one brand has a different mean tar content from the other brands.
c. Because the p-value is small, there is no evidence that any of the brands differ in the mean tar content.
d. Because the p-value is small, there is no evidence that at least one brand has a different mean tar content from the other brands.
e. Because the p-value is small, there is evidence that all of the brands have the same mean tar content.

Questions 24 to 26 relate to the following. One concern about the depletion of the ozone layer is that the increase in UV light will decrease crop yields. An experiment was conducted in a greenhouse where 40 soybean plants were exposed to varying levels of UV, measured in Dobson units. At the end of the experiment the yield (kg) was measured. A regression analysis was performed in Minitab with the following results:

The regression equation is
Yield = 3.98 - 0.0463 UV

Predictor       Coef   SE Coef       T       P
Constant       3.980    0.0538   74.01   0.000
UV          -0.04629   0.01074   -4.31   0.001

A24 The least squares regression line is the line…
a. that minimizes the sum of the squared differences between the actual UV values and the predicted UV values.
b. that minimizes the sum of the residuals between the actual yield and the predicted yield.
*c. that minimizes the sum of the squared differences between the actual yield and the predicted yield.
d. that minimizes the sum of the squared residuals between the actual UV reading and the predicted UV reading.
e. that minimizes the total variation in the data.

A25 Which of the following is correct?
a. If the UV reading is increased by 1 Dobson unit, the yield is expected to increase by .0463 kg.
b. If the yield increases by 1 kg, the UV reading is expected to decline by .0463 Dobson units.
*c. The estimated yield is 3.98 kg when the UV reading is 0 Dobson units.
d. The predicted yield is 4.3 kg when the UV reading is 20 Dobson units.
e. The t-ratio 74.01 is used to test whether the estimated slope is different from zero.

A26 A 95% confidence interval for the slope is…
a. -0.046 ± 0.011
b. -0.046 ± 0.108
c. -0.046 ± 0.054
d. -0.046 ± 0.046
*e. -0.046 ± 0.021

Questions 27 to 29 relate to the following data set. In the paper "Color Association of Male and Female Fourth-Grade School Children" (J. Psych., 1988, 383-8), children were asked to indicate what emotion they associated with the color red. The response and the sex of the child are shown in the table below.

          anger   happy   love   pain   total
female       27      19     39     17     102
male         34      12     38     28     112
total        61      31     77     45     214

A27. Four of the following statements are correct. Which one is FALSE?
a. A lower percentage of girls associate the emotion "anger" with the color red than do boys.
b. More students associate the color red with the emotion "love" than with the emotion "anger".
c. Each student was classified by gender and by emotion association. Each student was counted in one and only one cell.
d. We will be unable to compute a correlation for this data because the variables are not both numerical variables.
*e. We compute conditional proportions (given gender) by dividing the cell counts by the table total, 214.

A28. The null hypothesis for a chi-squared test on the above data is:
*a. the emotion associated with red is independent of gender
b. gender is dependent upon the emotion associated with red
c. the probability of a child associating any of the emotions with red is related to gender
d. the number of children in each cell does not depend upon gender nor upon emotion
e. the color red is independent of the emotion associated with it and with gender.

A29.
Under this null hypothesis, the expected frequency for the cell corresponding to Anger and Males is:
a. 34.0
b. 55.7
c. 30.4
d. 29.1
*e. 31.9

A30 A survey was conducted to investigate the severity of rodent problems in egg and chicken operations. A random sample of 78 egg operators and 53 chicken operators was selected, and the operators were classified according to the extent of the rodent population. A Minitab analysis of the data gave the following output. (NB the first row of the contingency table corresponds to egg operators and the second row to chicken operators.)

Chi-Square Test: mild, moderate, severe

Expected counts are printed below observed counts

           mild   moderate   severe   Total
    1        26         37       15      78
          31.56      35.13    11.31

    2        27         22        4      53
          21.44      23.87     7.69

Total        53         59       19     131

Chi-Sq = 0.979 + 0.100 + 1.202 +
         1.440 + 0.147 + 1.768 = 5.635
DF = 2, P-Value = 0.060

The conclusion from the test is...
a. The severity of rodent problems is the same for egg and poultry operators.
b. The severity of rodent problems is different for egg and poultry operators
c. There is no evidence that the severity of rodent problems is different for egg and poultry operators.
*d. The evidence for a difference in severity of the rodent problem between egg and poultry operators is only weak.
e. There is strong evidence of a difference in severity of rodent problems between egg and poultry operators.

SECTION B: Answer two out of the three questions in this section.

B1. This question investigates traffic fatalities in New Zealand during 2001 and their relationship to the blood alcohol level of the drivers. The data used in the question were published by the Land Transport Safety Authority.

(a) The diagram below shows the distribution of ages of fatally injured drivers in 2001 and the numbers of these who were tested for the alcohol level in their blood.
[Figure not reproduced: bar chart of the number of fatally injured drivers (vertical axis "Drivers", 0 to 50) by age group (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60+), with "Tested" and "Not tested" shown separately]

(i) Ignoring initially the testing for blood alcohol level, critically discuss the effectiveness of this diagram as a way of showing how age is related to death. In particular, consider…

• Treatment of the category “60+”
Answer: Wider than the other categories, explaining the higher number of fatalities.

• The effectiveness of the graphic as a display of the data
Answer: The 3D enhancements are chartjunk.

• Whether a different type of display might have been better
Answer: Age is continuous, so a histogram would be better, though there is a problem with the “60+” age group. Perhaps treat it as “60 to 75”?

• Lurking variables that may explain the downward trend up to age 59
Answer: There are probably more kilometres driven by younger drivers, so the accident rate per km may not be higher for them.

(ii) In the 35-39 age group, 11 out of the 15 fatalities were tested for alcohol level, whereas in the 40-44 age group, 22 out of 25 were tested. Test whether the probability of getting tested is the same in both age groups.

Answer: The pooled p = 33/40 = 0.825.
z = (11/15 - 22/25) / sqrt(0.825 * 0.175 * (1/15 + 1/25)) = -1.18,
so there is no evidence that the probabilities are different.

(iii) Discuss the following diagram, taking into account what you have learned in (ii).

[Figure not reproduced: percentage (0% to 100%) of fatally injured drivers "Tested" and "Not tested" in each age group, 15-19 to 60+]

Answer: This is an effective way to show how the proportion tested depends on age. There is no obvious trend and a fair amount of variability. Since (ii) showed that two of the larger differences were not significantly different, we can conclude that there is little if any influence of age on whether the dead drivers were tested.

(b) Out of the total of 204 dead drivers who were tested, 43 had alcohol levels over 80 mg per 100 ml of blood (the legal limit).
(i) Find a 95% confidence interval for the probability that a tested driver is over the legal limit.

Answer: 43/204 ± 2 * sqrt(43 * 161 / 204^3) = 0.154 to 0.268

(ii) Use the confidence interval in (i) to find a 95% confidence interval for the number of the 63 untested drivers who were over the 80mg limit, assuming that they have the same distribution of blood alcohol levels as the tested drivers, and hence a 95% confidence interval for the total number of dead drivers over the limit in 2001.

Answer: For the untested drivers, 0.154*63 to 0.268*63 = 9.7 to 16.9, so for all drivers, the CI is 52.7 to 59.9.

(c) The table below describes only the fatally injured drivers who were tested for alcohol level. The deaths were classified by blood alcohol level and the time of day when the accident happened. (Note that the legal limit for driving is 80 mg per 100 ml of blood.)

                Blood alcohol level (mg per 100ml blood)
Time of day     Under 80   80 to 200   Over 200   Total
1am to 9am            49           8          5      62
9am to 5pm            77           2          6      85
5pm to 1am            35          18          4      57
Total                161          28         15     204

(i) Draw on graph paper a stacked bar chart that can be used to compare blood alcohol levels at the different times of day. Describe the pattern in this diagram in words that a traffic researcher might understand.

Answer:
[Figure not reproduced: stacked bar chart with one bar for each time of day (1am to 9am, 9am to 5pm, 5pm to 1am), divided into "Under 80", "80 to 200" and "Over 200", on a scale from 0% to 100%]

The proportion of deaths with extremely high alcohol levels does not seem to depend on the time of day, but the proportion with moderately high alcohol levels (80 to 200) is much higher between 5pm and 1am and is lowest between 9am and 5pm.

(ii) Minitab reports the following results from a chi-squared test on the data.
Expected counts are printed below observed counts

        Under 80   80 to 20   Over 200   Total
    1         49          8          5      62
           48.93       8.51       4.56

    2         77          2          6      85
           67.08      11.67       6.25

    3         35         18          4      57
           44.99       7.82       4.19

Total        161         28         15     204

Chi-Sq = 0.000 +  0.031 + 0.043 +
         1.466 +  8.010 + 0.010 +
         2.216 + 13.237 + 0.009 = 25.021
DF = 4, P-Value = 0.000
2 cells with expected counts less than 5.0

What are your conclusions from the test?

Answer: The alcohol levels of the dead drivers are different at different times of day. There is strong evidence that the pattern described in (i) is therefore not due to chance.
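
The hand calculations in B1 can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original paper: the function names (two_proportion_z, proportion_interval, chi_squared) are illustrative, and the interval uses the multiplier 2 from the model answer in (b)(i) rather than 1.96. It uses only the counts quoted in the question.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for H0: p1 = p2, as in B1(a)(ii)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se


def proportion_interval(x, n, multiplier=2.0):
    """Approximate 95% interval p +/- multiplier * sqrt(p(1-p)/n), as in B1(b)(i).

    The default multiplier of 2 is an assumption taken from the model answer.
    """
    p = x / n
    half_width = multiplier * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def chi_squared(observed):
    """Chi-squared statistic and degrees of freedom for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# B1(a)(ii): 11 of 15 tested (ages 35-39) versus 22 of 25 tested (ages 40-44)
z = two_proportion_z(11, 15, 22, 25)

# B1(b)(i): 43 of the 204 tested drivers were over the legal limit
low, high = proportion_interval(43, 204)

# B1(c)(ii): rows are times of day, columns are blood alcohol levels
alcohol_by_time = [[49, 8, 5], [77, 2, 6], [35, 18, 4]]
chi_sq, df = chi_squared(alcohol_by_time)

print(f"z = {z:.2f}")
print(f"95% CI: {low:.3f} to {high:.3f}")
print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}")
```

Run as written, the three results should agree with the model answers to the precision quoted there (z of about -1.18, an interval of about 0.154 to 0.268, and a chi-squared statistic of about 25.02 on 4 degrees of freedom).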