Problem:  1-1  1-2  1-3  1-4  1-5  1-6  1-7  1-8  1-9  1-10
Answer:    c    e    d    b    a    a    d    c    e    b
1. Instructions: In problems 1-1 through 1-5, match each outcome with the letter that best
corresponds to its data type given below, and enter your answer in the appropriate cell above.
(a) Numerical: Continuous
(b) Numerical: Discrete
(c) Categorical: Ordinal
(d) Categorical: Nominal: Binary
(e) Categorical: Nominal: Not Binary
1-1. One of the questions on a health survey codes participants according to the following scale:
0 = Nonsmoker
1 = Smokes between 1-4 cigarettes per day
2 = Smokes between 5-8 cigarettes per day
3 = Smokes between 9-12 cigarettes per day
4 = Smokes between 13-16 cigarettes per day
5 = Smokes between 17-20 cigarettes per day
6 = Smokes more than 20 cigarettes per day
While the actual number of cigarettes smoked is “Numerical: Discrete,” the classification into labeled
groups turns it into “Categorical.” Furthermore, the categories have a natural order, 0 < 1 < 2 < 3 < 4 < 5 < 6
(i.e., members of any category smoke fewer cigarettes than members of the next category), so the correct
answer is (c).
1-2. The same survey asks participants to check which statement best applies:
1 = I smoke and drink.
2 = I smoke, but don't drink.
3 = I drink, but don't smoke.
4 = I neither smoke nor drink.
As above, these are not numerical measurements,
but rather “names” of categories. However, unlike
the previous scenario, there is no natural ordering
here. Inherently, “I smoke” is not measurably
better or worse than “I drink.” Finally, as there are
more than two categories, the correct answer is (e).
1-3. Each toaster in a random sample of 250 is tested to determine whether or not it is defective.
“Defective” versus “Non-defective” is a classification into two categories; hence the correct answer is (d).
1-4. The number of defective toasters in each of several random samples of 250 is recorded.
The values that this variable can assume are just the integers 0, 1, 2, 3, …, 249, 250, which are numerical
measurements (not categorical attributes), that do not form a continuous interval, say [0, 250], which
consists of all the real numbers between 0 and 250 inclusive. So the correct answer is (b).
1-5. An urban study investigates disease prevalence among children living in low-income
apartments. The amount of lead in different samples of wall paint taken from these
apartments is measured.
Clearly, these are numerical measurements. But unlike the previous scenario, the amount of lead can, in
principle, be measured to infinite precision. All of the numerical values between two such measurements
are also potential measurements. Therefore, this is a continuous numerical random variable, (a).
1-6. “Exact Body Temperature (°F)” in a population is an example of what kind of random variable?
(a) Numerical: Continuous, low variability
(b) Numerical: Continuous, high variability
(c) Numerical: Discrete, high variability
(d) Categorical: Ordinal, low variability
(e) None of the above
As above, this variable is a continuous numerical measurement. Moreover, in a typical population,
the amount of variability around 98.6°F is very low; there are very few – if any – healthy individuals
with even moderately outlying body temperatures.
1-7. It is known that patients with a certain form of cancer have a median survival time of 1
year without treatment. A medical study wishes to determine the median survival time of
patients on a particular therapy. From a random sample of 500 such patients, it is found
that the median is 5.0 years. The statistic (= “sample characteristic”) in this study is
(a) 50%
(b) 1 year
It is stated that the median survival time without treatment is 1 year, i.e., 50% of this
untreated population survives < 1 year (and 50% survives > 1 year). This population
is the “reference group”; neither of these two values refers to a random sample.
(c) 500 patients
This is only the “sample size” n, not an actual statistic calculated from the sample.
(d) 5 years
This is a “statistic” (i.e., “sample characteristic”) – a value derived from the sample.
(e) None of the above
1-8. Suppose the mean age of a random sample of 100 pregnant women is computed to be 26.7
years old. It then follows that the mean age difference from 26.7 is
(a) always positive
(b) always negative
(c) always zero
(d) unrestricted in sign
(e) None of the above
The phrase “mean age difference from 26.7” is simply the average of the deviations from the mean
age of 26.7 years. But the sum of those deviations is always equal to 0! Hence so is their average,
and the correct answer is (c).
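The zero-sum property of deviations is easy to check numerically. The sketch below uses a small set of hypothetical ages (invented for illustration, not the study's data) and confirms that the average deviation from the sample mean vanishes.

```python
from statistics import fmean

# Hypothetical ages, invented for illustration only (not the study's sample).
ages = [19, 22, 24, 26, 27, 29, 31, 35]

mean_age = fmean(ages)
deviations = [age - mean_age for age in ages]

# In exact arithmetic the deviations always cancel; with floats a tiny
# rounding residue is possible, so compare against a small tolerance.
mean_deviation = sum(deviations) / len(deviations)
print(f"mean = {mean_age:.3f}, mean deviation = {mean_deviation:.2e}")
```

Some deviations are positive and some negative, but they always balance, whatever the sample.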
1-9. In a certain city, the mean winter temperature in 2014 was 10°F, and in 2015 was 15°F.
Which of the following statements must be true?
(a) Every day in the winter of 2014 was colder than every day in the winter of 2015.
(b) The mode winter temperature in 2014 was less than the mode winter temperature in 2015.
(c) The median winter temperature in 2014 was less than the median winter temperature in 2015.
(d) The standard deviation of the winter temperature distribution in 2014 was less than the
standard deviation of the winter temperature distribution in 2015.
(e) None of the above
The mean is a summary measure of center for a population (or sample) of individuals.
In particular, without more information, it cannot be used to infer conclusions about
any of the following:
a specific individual, as in (a)
other measures of center, as in (b) or (c)
a measure of spread, such as standard deviation in (d).
It is easy to imagine counterexamples to each claim stated in (a), (b), (c), and (d).
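One such counterexample can be sketched in Python. The two temperature lists below are hypothetical (chosen only so that the means come out to 10°F and 15°F), and they contradict all four claims at once.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical daily temperatures (°F), invented so the means are 10 and 15.
winter_2014 = [20, 20, 20, 20, -30]
winter_2015 = [15, 15, 15, 15, 15]

summary = {
    year: {
        "mean": mean(temps),
        "median": median(temps),
        "mode": mode(temps),
        "sd": pstdev(temps),  # population standard deviation
    }
    for year, temps in (("2014", winter_2014), ("2015", winter_2015))
}

for year, stats in summary.items():
    print(year, stats)
```

Here 2014 has the lower mean, yet it contains days warmer than any day in 2015, and its median, mode, and standard deviation are all larger than those of 2015.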
1-10. Three masses, weighing 5 grams, 3 grams, and 2 grams, are placed, respectively, at the
2-inch, 6-inch, and 11-inch marks of a standard one-foot ruler, as shown. Where is the
balance point of this system of masses located?
(a) 4.5 inches
(b) 5.0 inches
(c) 5.5 inches
(d) 5.75 inches
(e) None of the above
This is quite literally an example of a “weighted average.” The random variable is X = “location on
ruler,” and each of the sample values x1 = 2, x2 = 6, x3 = 11 is weighted by its corresponding
relative frequency p1 = 5/10, p2 = 3/10, p3 = 2/10, respectively. Hence
x̄ = (2)(5/10) + (6)(3/10) + (11)(2/10) = 50/10 = 5 inches.
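The same balance-point calculation can be written as a short Python sketch; the variable names are illustrative only.

```python
# Masses (grams) and their positions (inches) on the ruler, from the problem.
masses = [5, 3, 2]
positions = [2, 6, 11]

# Balance point = weighted average of positions, weighted by mass.
total_mass = sum(masses)
balance_point = sum(m * x for m, x in zip(masses, positions)) / total_mass

print(f"balance point = {balance_point} inches")
```

Dividing by the total mass is equivalent to weighting each position by its relative frequency 5/10, 3/10, or 2/10.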
2. Shown below is the grouped frequency table and frequency histogram of Exam 1 scores in
one section of Stat 224/324 in the Fall 2015 semester, consisting of n = 133 students.
Intervals (“Bins”)    Frequency
(55, 60]                  2
(60, 65]                  3
(65, 70]                  1
(70, 75]                  4
(75, 80]                 12
(80, 85]                 10
(85, 90]                 26
(90, 95]                 42
(95, 100]                33
(a) Circle the shape of the distribution: symmetric, left-skewed, or right-skewed. (3 pts)
Left-skewed: the frequencies pile up in the highest bins, with a long tail toward the low scores.
(b) Calculate the relative frequency corresponding to the interval (95, 100]. Show all work. (3 pts)
Relative frequency = Frequency / n = 33/133 = 0.24812
(c) Calculate the density corresponding to the interval (95, 100]. Show all work. (3 pts)
Density = relative frequency / bin width = 0.24812 / (100 – 95) = 0.04962
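The relative frequency and density for the interval (95, 100] can be checked with a brief Python sketch built from the frequency table; the variable names are illustrative.

```python
# Bin edges and frequencies from the grouped frequency table.
bins = [(55, 60), (60, 65), (65, 70), (70, 75), (75, 80),
        (80, 85), (85, 90), (90, 95), (95, 100)]
freqs = [2, 3, 1, 4, 12, 10, 26, 42, 33]

n = sum(freqs)  # class size

# Last bin, (95, 100]
low, high = bins[-1]
relative_frequency = freqs[-1] / n
density = relative_frequency / (high - low)

print(f"n = {n}")
print(f"relative frequency = {relative_frequency:.5f}")
print(f"density = {density:.5f}")
```

Because every bin has width 5, each density is simply the relative frequency divided by 5, so the histogram bars enclose a total area of 1.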
(d) What proportion of the class can be estimated to have received a score strictly greater than 65?
Show all work.
(4 pts)
We could calculate, then sum, all the relative frequencies from the third row on, but it is far
easier to subtract the first two rows from 1. That is, 1 – (2 + 3) / 133 = 128 / 133 = 0.96241.
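Both routes to this proportion (summing the upper bins directly, or subtracting the first two bins from 1) can be compared in a small Python sketch using the table's frequencies.

```python
import math

# Frequencies for the bins (55, 60], (60, 65], ..., (95, 100].
freqs = [2, 3, 1, 4, 12, 10, 26, 42, 33]
n = sum(freqs)

# Direct route: add up every bin above 65 (third row onward).
direct = sum(freqs[2:]) / n

# Complement route: subtract the two bins at or below 65 from 1.
complement = 1 - sum(freqs[:2]) / n

print(f"direct = {direct:.5f}, complement = {complement:.5f}")
```

The two values agree up to floating-point rounding, which is why the complement shortcut is safe to use.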
(e) Since the exact score data are not given, all summary statistics must be estimated from the
grouped data. Which bin contains the grouped median score? Justify your answer!
(5 pts)
The median must divide the 133 scores so that 50% of them are below it, and 50% are above it.
The last interval (95, 100] contains 33 of them, or 33/133 = 24.81% < 50%. So the median
must lie below 95. However the interval just before it, (90, 95], contains 42 of them, which
makes a combined total of 42 + 33 = 75 scores, or 75/133 = 56.39% > 50%. So the median
must lie above 90. These two results together imply that the median must lie in (90, 95].
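The same conclusion follows from cumulative frequencies counted from the bottom bin upward. The Python sketch below takes the median to be the middle, (n + 1)/2-th, ordered score, which is an assumption of this sketch rather than a convention stated in the solution.

```python
from itertools import accumulate

bins = ["(55, 60]", "(60, 65]", "(65, 70]", "(70, 75]", "(75, 80]",
        "(80, 85]", "(85, 90]", "(90, 95]", "(95, 100]"]
freqs = [2, 3, 1, 4, 12, 10, 26, 42, 33]

n = sum(freqs)
middle_rank = (n + 1) / 2  # rank of the middle score (assumed convention)

# Running totals: how many scores fall at or below each bin's upper edge.
cumulative = list(accumulate(freqs))

# The median bin is the first one whose running total reaches the middle rank.
median_bin = next(b for b, c in zip(bins, cumulative) if c >= middle_rank)

print(f"cumulative counts: {cumulative}")
print(f"median bin: {median_bin}")
```

Counting from the top, as the solution does, or from the bottom, as here, must identify the same bin.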
(f) Without performing any further calculations, would the grouped mean be less than, greater
than, or equal to the grouped median in (e)? Justify your answer!
(5 pts)
Less than. The histogram is left-skewed, so the low scores in the long left tail pull the mean
down toward them, while the median is essentially unaffected.
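Although the question asks for reasoning without calculation, the claim can be verified afterwards. The Python sketch below estimates the grouped mean by letting each bin's midpoint stand in for its scores (a common convention, assumed here rather than taken from the solution) and compares it with the lower edge of the median bin (90, 95].

```python
# Bin midpoints stand in for the unknown exact scores (an assumption).
midpoints = [57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5, 92.5, 97.5]
freqs = [2, 3, 1, 4, 12, 10, 26, 42, 33]

n = sum(freqs)
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

median_bin_lower_edge = 90  # the median lies in (90, 95]

print(f"grouped mean = {grouped_mean:.2f}")
print("mean below median bin:", grouped_mean < median_bin_lower_edge)
```

The estimated mean falls below 90, so it lies below the entire median bin, consistent with a left-skewed distribution.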
(g) Suppose that a recording error had been made, and the two scores in (55, 60] are actually in
(60, 65]. What effect would this have on each of the following? Briefly state why.
(12 pts)
Median is unaffected: both scores remain below the median, so the 50%-50% split does not change.
Mean would increase, since the two higher scores now make the numerator sum larger.
Variance would decrease, since it is a measure of spread, and the distribution would now be less
spread out due to the disappearance of the first bin.
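These three effects can be checked numerically. The Python sketch below is an illustration under two assumptions not stated in the solution: bin midpoints represent the scores, and the variance uses the n - 1 denominator. The helper names are hypothetical.

```python
# Bin midpoints stand in for the unknown exact scores (an assumption).
midpoints = [57.5, 62.5, 67.5, 72.5, 77.5, 82.5, 87.5, 92.5, 97.5]
recorded = [2, 3, 1, 4, 12, 10, 26, 42, 33]
corrected = [0, 5, 1, 4, 12, 10, 26, 42, 33]  # two scores moved up one bin


def grouped_mean_and_variance(freqs):
    """Return the grouped mean and the (n - 1)-denominator variance."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
    return mean, variance


def median_bin_index(freqs):
    """Return the index of the bin holding the middle, (n + 1)/2-th, score."""
    target = (sum(freqs) + 1) / 2
    running = 0
    for index, f in enumerate(freqs):
        running += f
        if running >= target:
            return index


mean_before, var_before = grouped_mean_and_variance(recorded)
mean_after, var_after = grouped_mean_and_variance(corrected)

print(f"mean:     {mean_before:.3f} -> {mean_after:.3f}")
print(f"variance: {var_before:.3f} -> {var_after:.3f}")
print("median bin index:", median_bin_index(recorded), "->", median_bin_index(corrected))
```

The mean rises by 2(62.5 - 57.5)/133 = 10/133, the variance falls, and the median stays in the same bin.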
3. Prior probabilities of treatment: P(T1) = 0.6 and P(T2) = 0.4
Probabilities of symptoms, conditioned on treatment:
P(S1 | T1) = 0.70     P(S2 | T1) = 0.25      P(S3 | T1) = 0.05
P(S1 | T2) = 0.45     P(S2 | T2) = 0.375     P(S3 | T2) = 0.175
(a) From this, we obtain for the first cell, P(S1 ∩ T1) = P(S1 | T1) P(T1) = (0.7)(0.6) = 0.42, and
likewise for the five remaining cells. Thus...

                        Disease Symptoms
Treatment    Mild (S1)    Moderate (S2)    Severe (S3)    Total
T1             0.42           0.15            0.03         0.60
T2             0.18           0.15            0.07         0.40
Total          0.60           0.30            0.10         1.00
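The joint probability table can be reproduced with a short Python sketch that multiplies each conditional probability by the matching prior. The dictionary layout and names are illustrative choices.

```python
# Priors P(T) and conditionals P(S | T), as given in the problem.
prior = {"T1": 0.6, "T2": 0.4}
conditional = {
    "T1": {"S1": 0.70, "S2": 0.25, "S3": 0.05},
    "T2": {"S1": 0.45, "S2": 0.375, "S3": 0.175},
}
symptoms = ["S1", "S2", "S3"]

# Joint probabilities: P(S and T) = P(S | T) * P(T).
joint = {t: {s: conditional[t][s] * prior[t] for s in symptoms} for t in prior}

# Column totals give the marginal symptom probabilities P(S).
marginal = {s: sum(joint[t][s] for t in prior) for s in symptoms}

for t in prior:
    print(t, {s: round(joint[t][s], 2) for s in symptoms})
print("P(S):", {s: round(p, 2) for s, p in marginal.items()})
```

The row totals recover the priors 0.60 and 0.40, and the column totals give the symptom probabilities 0.60, 0.30, and 0.10.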
(b) Posterior probabilities of treatment, conditioned on symptoms:
From the first cell, we obtain P(T1 | S1) = P(T1 ∩ S1) / P(S1) = 0.42 / 0.60 = 0.7, and likewise for the others:
P(T1 | S1) = 0.7     P(T1 | S2) = 0.5     P(T1 | S3) = 0.3
P(T2 | S1) = 0.3     P(T2 | S2) = 0.5     P(T2 | S3) = 0.7
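The posterior probabilities follow from dividing each joint probability by its column total. A minimal Python sketch, starting from the joint table values in part (a), with illustrative names:

```python
# Joint probabilities P(T and S) from the table in part (a).
joint = {
    "T1": {"S1": 0.42, "S2": 0.15, "S3": 0.03},
    "T2": {"S1": 0.18, "S2": 0.15, "S3": 0.07},
}
symptoms = ["S1", "S2", "S3"]

# Marginal symptom probabilities P(S): the column totals.
marginal = {s: sum(row[s] for row in joint.values()) for s in symptoms}

# Posterior probabilities: P(T | S) = P(T and S) / P(S).
posterior = {t: {s: joint[t][s] / marginal[s] for s in symptoms} for t in joint}

for t, row in posterior.items():
    print(t, {s: round(p, 2) for s, p in row.items()})
```

For each symptom level, the two posteriors sum to 1, since every patient receives exactly one of the two treatments.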
(c) The prior probability of taking treatment T1 (low potency, low side effects) is 60%. However:
In the presence of mild disease symptoms (S1), the corresponding posterior probability of
taking treatment T1 increases to 70%.
In the presence of moderate disease symptoms (S2), the corresponding posterior probability
of taking treatment T1 decreases to 50%.
In the presence of severe disease symptoms (S3), the corresponding posterior probability of
taking treatment T1 decreases to 30%.
The prior probability of taking treatment T2 (high potency, high side effects) is 40%. However:
In the presence of mild disease symptoms (S1), the corresponding posterior probability of
taking treatment T2 decreases to 30%.
In the presence of moderate disease symptoms (S2), the corresponding posterior probability
of taking treatment T2 increases to 50%.
In the presence of severe disease symptoms (S3), the corresponding posterior probability of
taking treatment T2 increases to 70%.
(d) Among the 60% of patients with mild disease symptoms (S1), patients are more likely to choose
the weaker treatment with fewer side effects (T1) than the stronger treatment with more side
effects (T2). Among the 30% of patients with moderate disease symptoms (S2), the two treatments
are equally likely to be chosen. (This group can thus be taken as the “reference.”)
However, among the remaining 10% with severe disease symptoms (S3), patients seem
more likely to risk suffering more side effects for the benefit of the stronger treatment.