MASSEY UNIVERSITY
PALMERSTON NORTH CAMPUS
EXAMINATION FOR
161.120 INTRODUCTORY STATISTICS
161.130 BIOMETRICS
SEMESTER II 2003
________________________________________________
Time allowed: THREE (3) hours.
This paper comprises:
SECTION A,
containing 30 multiple choice questions;
SECTION B,
containing 3 questions.
An appendix of tables
is at the back of the paper.
A Scantron card is provided for your answers to Section A.
Instructions:
Attempt ALL of Section A
and
TWO (2) questions from Section B.
Section A will be marked out of 30,
and each question in Section B will be marked out of 15.
This examination contributes 50% (internal) or 70% (extramural) to the final assessment
_________________________________________________________________________
SECTION A:
The answers to the questions in this section must be entered on a Scantron card in pencil
and handed in with your blue examination book.
A1
Soil samples were collected at thirty different sites in an agricultural area and the soil
acidity (pH) was measured. The following stem-and-leaf plot shows the pH values, which
range from 2.6 to 6.3.
Stems   Leaves
2       679
3       237789
4       1222446899
5       0556788
6       0233
The median acidity is:
a.   4.6
*b.  4.5
c.   4.4
d.   4.3
e.   4.2
A2
A _______________ shows the relationship between two variables.
a.   box plot
b.   bar chart
c.   histogram
*d.  scatter plot
e.   pie chart
A3
To help interpret diagnostic tests, doctors need to understand the distribution of the test
results for ‘normal’ people. The histogram below shows the plasma glucose concentrations
(mg per dL) for 50 normal fasting people.
[Histogram: plasma glucose concentration (mg per dL) for the 50 people; figure not reproduced.]
What proportion of the people has plasma glucose levels below 95?
a.   0.09
b.   0.10
c.   0.16
*d.  0.34
e.   0.50
A4
A sample of 99 distances has a mean of 24 metres and a median of 24.5 metres.
Unfortunately, it has just been discovered that an observation which was erroneously
recorded as "30" actually had a value of "35". If we make this correction to the data, then:
a    the mean remains the same, but the median is increased
b    the mean and median remain the same
*c   the median remains the same, but the mean is increased
d    the mean and median are both increased
e    we do not know how the mean and median are affected without further calculations; but the
     variance is increased.
A5
The weights of the male and female students in a class are summarized in the following
boxplots:
[Boxplots of Weight (pounds) for Males and Females on a common axis running from 80 to 240;
figure not reproduced.]
Four of the following statements are correct. Which one is FALSE?
a.   About 50% of the male students have weights between 150 and 185 lbs.
b.   About 25% of female students have weights more than 130 lbs.
c.   The median weight of male students is about 162 lbs.
d.   The mean weight of female students is about 120 because their distribution is fairly
     symmetric.
*e.  The male students have less variability than the female students.
A6
Four of the following statements are correct. Which one is FALSE?
*a   The numbers 1, 5, 9 have a smaller standard deviation than 101, 105, 109.
b    The numbers 3, 3, 3 have a standard deviation of 0.
c    The numbers 3, 4, 5 have the same standard deviation as 1003, 1004, 1005.
d    The standard deviation is a measure of spread around the centre of the data.
e    The standard deviation can only be computed for numerical data.
A7
A researcher wishes to calculate the average height of patients suffering from a particular
disease. From patient records, the mean was computed as 156 cm, and standard deviation
as 5 cm. Further investigation reveals that the scale was misaligned, and that all readings are
2 cm too large, e.g., a patient whose height is really 180 cm was measured as 182 cm.
Furthermore, the researcher would like to work with statistics based on meters. The correct
mean and standard deviation are:
a    1.56 m, 0.05 m
*b   1.54 m, 0.05 m
c    1.56 m, 0.03 m
d    1.58 m, 0.05 m
e    1.58 m, 0.07 m
A8
The following information about the country of origin of immigrants to Australia was
published in the Dominion Post on Fri August 15.
Country of origin of immigrants to Australia, in the year to June 30
Britain          12,510
Philippines       3,190
China             6,600
South Africa      4,600
Fiji              1,610
Taiwan            1,110
Former USSR       1,100
United States     1,320
India             5,780
Vietnam           2,570
Indonesia         3,030
Yugoslavia        1,630
New Zealand      12,370
Other            36,490
The graphical display of these data that makes it easiest to see the proportion of immigrants
from Asia is…
a    A bar chart with the countries in the same order as presented in the table
b    A bar chart with the countries ordered in decreasing order of frequency
c    A pie chart with the countries ordered in decreasing order of frequency
d    A bar chart with the countries reordered so that the Asian countries are adjacent
*e   A pie chart with the countries reordered so that the Asian countries are adjacent
A9
When looking at a sequence of monthly postal revenue data, we note that the revenue is
consistently highest in December. The high December revenue is an illustration of:
When looking at a sequence of monthly postal revenue data, we note that the revenue is
consistently highest in December. The high December revenue is an illustration of:
a    trend
*b   seasonal variation
c    random fluctuations
d    a cycle
e    an outlier
A10
For children between the ages of 18 months and 29 months, there is approximately a linear
relationship between "height" and "age". From a data set of 100 children in this age group,
the least squares line was
y = 64.93 + 0.63x
where y represents height (in centimeters) and x represents age (in months). One of the
children, Joseph, is 22.5 months old and is 80 centimeters tall. What is Joseph's residual?
*a   +0.9
b    79.1
c    -0.9
d    56.6
e    64.93
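The marked answer to A10 can be checked with a minimal Python sketch. The function names below are illustrative only and are not part of the paper; the coefficients are the ones quoted in the question.

```python
# Fitted least squares line quoted in A10: height = 64.93 + 0.63 * age
INTERCEPT = 64.93  # cm
SLOPE = 0.63       # cm per month


def predicted_height(age_months):
    """Height (cm) given by the fitted line for an age in months."""
    return INTERCEPT + SLOPE * age_months


def residual(observed_height, age_months):
    """Residual = observed value minus fitted value."""
    return observed_height - predicted_height(age_months)


# Joseph: 22.5 months old, 80 cm tall
joseph_fitted = predicted_height(22.5)
joseph_residual = residual(80, 22.5)
print(round(joseph_fitted, 1), round(joseph_residual, 1))
```

A positive residual means the child is taller than the line predicts for his age.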
A11
A company that conducts regular political public opinion polls for a TV station has decided
to increase the size of its random sample of voters from about 1500 people to about 4000
people. The effect of this increase is to:
a    reduce the bias of the estimate.
b    increase the standard error of the estimate.
*c   reduce the variability of the estimate.
d    increase the confidence interval width for the parameter.
e    have no effect since the population size is the same.
A12
Four of the following statements are correct. Which one is FALSE?
a    In a proper random sampling, every element of the population has a known (and often equal)
     chance of being selected.
b    The precision of a sample mean or sample proportion depends mainly upon the sample size
     (and not the population size) in a proper random sample.
c    Convenience sampling often leads to biases in estimates since the sample is often not
     representative of the population.
*d   In a telephone survey of households in New Zealand, a high sample size guarantees that the
     mean household income in the country can be accurately estimated.
e    The sampling distribution of the sample mean describes how the sample mean will vary
     among repeated samples.
A13
A new headache remedy was given to a group of 25 subjects who had headaches. Four
hours after taking the new remedy, 20 of the subjects reported that their headaches had
disappeared. From this information you conclude:
a    that the remedy is effective for the treatment of headaches.
*b   nothing, because there is no control group for comparison.
c    nothing, because the sample size is too small.
d    that the new treatment is better than aspirin.
e    that the remedy is not effective for the treatment of headaches.
A14
What is the best reason for performing a paired experiment rather than an experiment with
two independent samples?
a    It is easier to do since we need fewer experimental units and each unit receives more than
     one treatment.
*b   It allows us to remove variation in the results caused by other factors since we can compare
     both treatments within the same experimental unit.
c    The calculations will be more accurate since we work only with the differences.
d    The paired t-test uses fewer degrees of freedom than the two-sample t-test.
e    It allows us to do more experiments since we use each experimental unit twice.
A15
The daily milk production of Guernsey cows is approximately normally distributed with a
mean of 35 kg/day and a std. deviation of 6 kg/day. The probability that one day’s
production for a single animal will be less than 28 kg is approximately:
*a   .12
b    .41
c    .09
d    .38
e    .62
A16
If the sampled population has mean 48 and standard deviation 16, then the mean and the
standard deviation for the sampling distribution of x̄ for n = 16 are
     Mean   St devn
a.   4      1
b.   12     4
*c.  48     4
d.   48     1
e.   48     16
A17
The essence of the Central Limit Theorem is that:
*a.  Irrespective of the distribution of the parent population, the distribution of the sample mean
     will be approximately normal, provided the sample size is large
b.   Irrespective of the sample size, the sample mean will be normally distributed
c.   Irrespective of the sample size, the population mean will be normally distributed
d.   Provided the sample size is large, the distribution of the population from which the sample is
     selected will be normal
e.   Provided the sample size is large, the distribution of the sample can be regarded as
     approximately normal
A18
Suppose that 30% of first year students in the University of Auckland live in flats. If 200
students are randomly selected, then the standard deviation of the number who live in flats
will be approximately
a    0.0011
b    0.0324
c    0.3
*d   6.48
e    42.0
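The three calculation questions above (A15, A16 and A18) can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative, and the binomial model for A18 is the usual assumption for a count from a random sample.

```python
import math
from statistics import NormalDist

# A15: P(X < 28) when X is Normal with mean 35 and standard deviation 6
p_below_28 = NormalDist(mu=35, sigma=6).cdf(28)

# A16: sampling distribution of the sample mean for n = 16
pop_mean, pop_sd, n = 48, 16, 16
mean_of_xbar = pop_mean               # same centre as the population
sd_of_xbar = pop_sd / math.sqrt(n)    # standard error of the mean

# A18: standard deviation of a binomial count, sqrt(n * p * (1 - p))
n_students, p_flat = 200, 0.30
sd_count = math.sqrt(n_students * p_flat * (1 - p_flat))

print(round(p_below_28, 2), mean_of_xbar, sd_of_xbar, round(sd_count, 2))
```

Note that option e in A18 (42.0) is the variance of the count, not its standard deviation.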
A19
Minitab reports the following information about the weights in pounds of 143 bears,
classified by gender (male=1, female=2)
Descriptive Statistics: Weight by Sex
Variable   Sex     N      Mean    Median   TrMean    StDev
Weight     1      99     214.0     180.0    208.9    119.7
           2      44    143.05    141.00   139.17    64.48

Variable   Sex   SE Mean   Minimum   Maximum       Q1       Q3
Weight     1        12.0      34.0     514.0    122.0    316.0
           2        9.72     26.00    356.00   114.00   164.50
The difference between the mean weights of the male and female bears is estimated to be
70.9 pounds. The standard error of this estimate is...
*a.  15.4
b.   21.7
c.   55.2
d.   92.1
e.   136.0
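A minimal Python sketch of the calculation behind A19, assuming the two samples of bears are independent so that the squared standard errors add. The numbers are read from the Minitab output; the variable names are illustrative.

```python
import math

# Values from the Minitab output (1 = male, 2 = female)
n_male, sd_male, se_male = 99, 119.7, 12.0
n_female, sd_female, se_female = 44, 64.48, 9.72

# Each "SE Mean" is StDev / sqrt(N); recompute as a consistency check
se_male_check = sd_male / math.sqrt(n_male)
se_female_check = sd_female / math.sqrt(n_female)

# Standard error of the difference between two independent means
se_difference = math.sqrt(se_male**2 + se_female**2)

print(round(se_male_check, 2), round(se_female_check, 2), round(se_difference, 1))
```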
Questions 20 and 21 relate to the following problem.
A Massey University researcher wishes to investigate whether a new variety of wheat is more
resistant to a disease than an old variety. It is known that this disease strikes approximately 15%
of all plants of the old variety. A field experiment was conducted and 12 of the 120
experimental plants became infected.
A20
The null and alternative hypotheses are:
a    H0: π = 0.10    HA: π > 0.15
b    H0: π = 0.10    HA: π > 0.10
c    H0: π = 0.15    HA: π ≠ 0.15
*d   H0: π = 0.15    HA: π < 0.15
e    H0: π = 0.15    HA: π > 0.15
A21
The calculated value of the test statistic is:
a    z = -47.1
b    z = -0.39
c    z = -3.07
d    z = -1.83
*e   z = -1.53
A22
A study was carried out on the effectiveness of a grain additive in deterring pigs from
eating the grain. 1000 pigs were selected for the study, with 500 assigned to the treatment
group (grain laced with 1080+dye) and the remaining 500 assigned to the placebo group
(grain laced with dye only). Minitab reports…
T-Test of difference = 0 (vs not =): T-Value = -5.42  P-Value = 0.000  DF = 499
The best conclusion is…
a.   There is a large difference between the effects of the treatment and the placebo
b.   There is strong evidence that the 1080 additive is very effective in altering the intake of
     grain by pigs
*c.  There is strong evidence of a difference in intake between the treatment and placebo but the
     difference may be small
d.   There is little evidence that the treatment has any effect on the intake of grain by pigs
e.   There is evidence of a strong treatment effect
A23
Health researchers wish to investigate whether the tar content (milligrams) varies among
four brands of cigarettes. Three packs of each brand were selected, and one cigarette from
each pack was placed in a smoking machine to determine the tar content. An analysis of
variance was performed and here are the results (some parts are hidden):
Analysis of Variance for Tar
Source     DF        SS       MS        F        P
Brand       3    348.00    116.0    11.60    0.003
Error       8     80.00     10.0
Total      11    428.00
Which of the following is correct:
a    Because the p-value is small, there is evidence that all the brands differ from each other in
     the mean amount of tar present.
*b   Because the p-value is small, there is evidence that at least one brand has a different mean
     tar content from the other brands.
c    Because the p-value is small, there is no evidence that any of the brands differ in the mean
     tar content.
d    Because the p-value is small, there is no evidence that at least one brand has a different
     mean tar content from the other brands.
e    Because the p-value is small, there is evidence that all of brands have the same mean tar
     content.
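The entries of the analysis of variance table in A23 follow from one another, which a short Python sketch can illustrate. It assumes the usual one-way layout (4 brands, 3 cigarettes each); the variable names are illustrative. The P-value is not recomputed here because the standard library has no F distribution.

```python
# One-way ANOVA bookkeeping for A23: 4 brands, 3 cigarettes per brand
n_brands, per_brand = 4, 3
n_total = n_brands * per_brand

df_brand = n_brands - 1        # between-brand degrees of freedom
df_error = n_total - n_brands  # within-brand degrees of freedom
df_total = n_total - 1

ss_brand, ss_error = 348.00, 80.00
ss_total = ss_brand + ss_error

ms_brand = ss_brand / df_brand  # mean square = SS / DF
ms_error = ss_error / df_error
f_stat = ms_brand / ms_error    # F = MS(Brand) / MS(Error)

print(df_brand, df_error, df_total, ss_total, ms_brand, ms_error, round(f_stat, 2))
```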
Questions 24 to 26 relate to the following.
One concern about the depletion of the ozone layer is that the increase in UV light will decrease
crop yields. An experiment was conducted in a green house where 40 soybean plants were
exposed to varying levels of UV, measured in Dobson units. At the end of the experiment the
yield (kg) was measured. A regression analysis was performed in Minitab with the following
results:
The regression equation is
Yield = 3.98 - 0.0463 UV
Predictor       Coef    SE Coef        T        P
Constant       3.980     0.0538    74.01    0.000
UV          -0.04629    0.01074    -4.31    0.001
A24
The least squares regression line is the line…
a    that minimizes the sum of the squared differences between the actual UV values and the
     predicted UV values.
b    that minimizes the sum of the residuals between the actual yield and the predicted yield.
*c   that minimizes the sum of the squared differences between the actual yield and the predicted
     yield.
d    that minimizes the sum of the squared residuals between the actual UV reading and the
     predicted UV reading.
e    that minimizes the total variation in the data.
A25
Which of the following is correct?
a    If the UV reading is increased by 1 Dobson unit, the yield is expected to increase by .0463
     kg.
b    If the yield increases by 1 kg, the UV reading is expected to decline by .0463 Dobson units.
*c   The estimated yield is 3.98 kg when the UV reading is 0 Dobson units.
d    The predicted yield is 4.3 kg when the UV reading is 20 Dobson units.
e    The t-ratio 74.01 is used to test whether the estimated slope is different from zero.
A26
A 95% confidence interval for the slope is…
a    –0.046 ± 0.011
b    –0.046 ± 0.108
c    –0.046 ± 0.054
d    –0.046 ± 0.046
*e   –0.046 ± 0.021
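The numbers behind A25 and A26 can be reproduced with a short Python sketch from the Minitab coefficients. The multiplier of 2 is an assumption standing in for the t critical value with 38 degrees of freedom (roughly 2.02); the variable names are illustrative.

```python
# Coefficients from the Minitab regression output
INTERCEPT = 3.980     # estimated yield (kg) at UV = 0
SLOPE = -0.04629      # change in yield (kg) per Dobson unit
SE_SLOPE = 0.01074

# Assumption: approximate 95% multiplier of 2 in place of the exact t value
MULTIPLIER = 2.0

half_width = MULTIPLIER * SE_SLOPE
slope_ci = (SLOPE - half_width, SLOPE + half_width)

# The t-ratio that tests the slope is the slope over its standard error
t_slope = SLOPE / SE_SLOPE

# Predicted yield at 20 Dobson units, for comparison with option d of A25
yield_at_20 = INTERCEPT + SLOPE * 20

print(round(half_width, 3), round(t_slope, 2), round(yield_at_20, 2))
```

The t-ratio for the slope is -4.31, not 74.01; the latter belongs to the constant, which is why option e of A25 is wrong.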
Questions 27 to 29 relate to the following data set.
In the paper "Color Association of Male and Female Fourth-Grade School Children" (J. Psych.,
1988, 383-8), children were asked to indicate what emotion they associated with the color red.
The response and the sex of the child are shown in the table below.
          anger   happy   love   pain   total
female       27      19     39     17     102
male         34      12     38     28     112
total        61      31     77     45     214
A27. Four of the following statements are correct. Which one is FALSE?
a    A lower percentage of girls associate the emotion "anger" with the color red than do boys.
b    More students associate the color red with the emotion "love" than with the emotion
     "anger".
c    Each student was classified by gender and by emotion association. Each student was
     counted in one and only one cell.
d    We will be unable to compute a correlation for this data because the variables are not both
     numerical variables.
*e   We compute conditional proportions (given gender) by dividing the cell counts by the table
     total, 214.
A28. The null hypothesis for a chi-squared test on the above data is:
*a   the emotion associated with red is independent of gender
b    gender is dependent upon the emotion associated with red
c    the probability of a child associating any of the emotions with red is related to gender
d    the number of children in each cell does not depend upon gender nor upon emotion
e    the color red is independent of the emotion associated with it and with gender.
A29. Under this null hypothesis, the expected frequency for the cell corresponding to Anger and
Males is:
a    34.0
b    55.7
c    30.4
d    29.1
*e   31.9
A30
A survey was conducted to investigate the severity of rodent problems in egg and chicken
operations. A random sample of 78 egg operators and 53 chicken operators was selected,
and the operators were classified according to the extent of the rodent population. A
Minitab analysis of the data gave the following output. (NB the first row of the contingency
table corresponds to egg operators and the second row to chicken operators.)
Chi-Square Test: mild, moderate, severe
Expected counts are printed below observed counts
           mild   moderate   severe   Total
    1        26         37       15      78
          31.56      35.13    11.31
    2        27         22        4      53
          21.44      23.87     7.69
Total        53         59       19     131
Chi-Sq = 0.979 + 0.100 + 1.202 +
         1.440 + 0.147 + 1.768 = 5.635
DF = 2, P-Value = 0.060
The conclusion from the test is...
a    The severity of rodent problems is the same for egg and poultry operators.
b.   The severity of rodent problems is different for egg and poultry operators
c    There is no evidence that the severity of rodent problems is different for egg and poultry
     operators.
*d   The evidence for a difference in severity of the rodent problem between egg and poultry
     operators is only weak.
e    There is strong evidence of a difference in severity of rodent problems between egg and
     poultry operators.
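The expected count in A29 and the chi-squared statistic in A30 both come from the rule expected = row total × column total / grand total. The Python sketch below applies that rule to the two tables; the helper names are hypothetical and the P-value is not recomputed.

```python
def expected_counts(observed):
    """Expected counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# A29: rows are female, male; columns are anger, happy, love, pain
red_emotions = [[27, 19, 39, 17], [34, 12, 38, 28]]
expected_male_anger = expected_counts(red_emotions)[1][0]  # 112 * 61 / 214

# A30: rows are egg, chicken operators; columns are mild, moderate, severe
rodents = [[26, 37, 15], [27, 22, 4]]
rodent_stat = chi_squared(rodents)
rodent_df = (len(rodents) - 1) * (len(rodents[0]) - 1)

print(round(expected_male_anger, 1), round(rodent_stat, 3), rodent_df)
```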
SECTION B:
Answer two out of the three questions in this section.
B1.
This question investigates traffic fatalities in New Zealand during 2001 and their
relationship to the blood alcohol level of the drivers. The data used in the question were
published by the Land Transport Safety Authority.
(a) The diagram below shows the distribution of ages of fatally injured drivers in 2001 and the
numbers of these who were tested for the alcohol level in their blood.
[Bar chart: number of Drivers (0 to 50) by Age group (15-19, 20-24, 25-29, 30-34, 35-39, 40-44,
45-49, 50-54, 55-59, 60+), each bar divided into Tested and Not tested; figure not reproduced.]
(i) Ignoring initially the testing for blood alcohol level, critically discuss the effectiveness
of this diagram as a way of showing how age is related to death. In particular,
consider…
• Treatment of the category “60+”
Wider than the other categories, explaining the higher number of fatalities
• The effectiveness of the graphic as a display of the data
The 3D enhancements are chartjunk
• Whether a different type of display might have been better
Age is continuous, so a histogram would be better, though there is a problem with
the “60+” age group. Perhaps treat it as “60 to 75”?
• Lurking variables that may explain the downward trend up to age 59
There are probably more kilometres driven by younger drivers, so the accident rate
per km may not be higher for them.
(ii) In the 35-39 age group, 11 out of the 15 fatalities were tested for alcohol level, whereas
in the 40-44 age group, 22 out of 25 were tested. Test whether the probability of getting
tested is the same in both age groups.
The pooled p = 33/40 = 0.825. z = (11/15 - 22/25) / root(0.825 * 0.175 * (1/15 + 1/25)) = -1.18,
so there is no evidence that the probabilities are different.
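The pooled two-proportion z test in this answer can be sketched in Python as follows. The function name is hypothetical, and the two-sided P-value relies on the normal approximation, which is rough for groups of only 15 and 25 drivers.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion in the standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# 35-39 age group: 11 of 15 tested; 40-44 age group: 22 of 25 tested
z = two_proportion_z(11, 15, 22, 25)

# Two-sided P-value from the standard normal distribution
p_value = 2 * NormalDist().cdf(-abs(z))

print(round(z, 2), round(p_value, 2))
```

A P-value well above 0.05 matches the conclusion that there is no evidence of a difference.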
(iii) Discuss the following diagram, taking into account what you have learned in (ii).
[Stacked bar chart: percentage of drivers (0% to 100%) Tested and Not tested in each Age group
(15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60+); figure not reproduced.]
This is an effective way to show how the proportion tested depends on age. There is no
obvious trend and a fair amount of variability. Since (ii) showed that one of the larger
differences (35-39 against 40-44) was not statistically significant, we can conclude that
there is little if any influence of age on whether the dead drivers were tested.
(b) Out of the total of 204 dead drivers who were tested, 43 had alcohol levels over 80 mg per
100 ml of blood (the legal limit).
(i) Find a 95% confidence interval for the probability that a tested driver is over the legal
limit.
43/204 ± 2 * root(43 * 161 / 204^3) = 0.154 to 0.268
(ii) Use the confidence interval in (i) to find a 95% confidence interval for the number of
the 63 untested drivers who were over the 80mg limit, assuming that they have the
same distribution of blood alcohol levels as the tested drivers, and hence a 95%
confidence interval for the total number of dead drivers over the limit in 2001.
For the untested drivers, 0.154*63 to 0.268*63 = 9.7 to 16.9, so for all drivers, the CI is 52.7
to 59.9.
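Both intervals in part (b) can be reproduced with a minimal Python sketch. It follows the model answer in using a multiplier of 2 rather than 1.96, and it carries the assumption stated in the question that untested drivers have the same distribution of alcohol levels as tested drivers. The variable names are illustrative.

```python
import math

tested, over_limit, untested = 204, 43, 63

# (i) Interval for the probability that a tested driver is over the limit
p_hat = over_limit / tested
se = math.sqrt(p_hat * (1 - p_hat) / tested)  # equals root(43 * 161 / 204^3)
MULTIPLIER = 2  # as in the model answer; 1.96 gives a slightly narrower interval
lower, upper = p_hat - MULTIPLIER * se, p_hat + MULTIPLIER * se

# (ii) Scale the interval to the 63 untested drivers, then add the 43 known cases
untested_lower, untested_upper = lower * untested, upper * untested
total_lower, total_upper = over_limit + untested_lower, over_limit + untested_upper

print(round(lower, 3), round(upper, 3))
print(round(untested_lower, 1), round(untested_upper, 1))
print(round(total_lower, 1), round(total_upper, 1))
```

Working from the unrounded limits gives 16.9 at the upper end for the untested drivers, as in the model answer, but 9.68 rather than 9.70 at the lower end. The totals agree with the model answer to one decimal place.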
(c) The table below describes only the fatally injured drivers who were tested for alcohol level.
The deaths were classified by blood alcohol level and the time of day when the accident
happened. (Note that the legal limit for driving is 80 mg per 100 ml of blood.)
                 Blood alcohol level (mg per 100ml blood)
Time of day      Under 80   80 to 200   Over 200   Total
1am to 9am             49           8          5      62
9am to 5pm             77           2          6      85
5pm to 1am             35          18          4      57
Total                 161          28         15     204
(i) Draw on graph paper a stacked bar chart that can be used to compare blood alcohol
levels at the different times of day. Describe the pattern in this diagram in words that a
traffic researcher might understand.
[Stacked bar chart: one horizontal bar from 0% to 100% for each time of day (1am to 9am,
9am to 5pm, 5pm to 1am), divided into Under 80, 80 to 200 and Over 200; figure not reproduced.]
The proportion of deaths with extremely high alcohol levels does not seem to depend on the
time of day, but the proportion with moderately high alcohol levels (80 to 200) is much
higher between 5pm and 1am and is lowest between 9am and 5pm.
(ii) Minitab reports the following results from a chi-squared test on the data.
Expected counts are printed below observed counts
        Under 80   80 to 20   Over 200   Total
    1         49          8          5      62
           48.93       8.51       4.56
    2         77          2          6      85
           67.08      11.67       6.25
    3         35         18          4      57
           44.99       7.82       4.19
Total        161         28         15     204
Chi-Sq = 0.000 +  0.031 + 0.043 +
         1.466 +  8.010 + 0.010 +
         2.216 + 13.237 + 0.009 = 25.021
DF = 4, P-Value = 0.000
2 cells with expected counts less than 5.0
What are your conclusions from the test?
There is strong evidence that the alcohol levels of the dead drivers differ between times of
day, so the pattern described in (i) is unlikely to be due to chance.
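To see which cells drive this result, the Minitab figures can be recomputed cell by cell. The Python sketch below is illustrative only: the variable names are not from the paper and the P-value is not recomputed.

```python
# Observed counts: rows are 1am to 9am, 9am to 5pm, 5pm to 1am;
# columns are Under 80, 80 to 200, Over 200
observed = [[49, 8, 5], [77, 2, 6], [35, 18, 4]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Contribution of each cell to the chi-squared statistic
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

statistic = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)
small_cells = sum(e < 5 for row in expected for e in row)
largest = max(max(row) for row in contributions)

print(round(statistic, 2), df, small_cells, round(largest, 2))
```

The two largest contributions both come from the "80 to 200" column (9am to 5pm and 5pm to 1am), which agrees with the pattern described in (i).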