Department of Urban Studies and Planning
Massachusetts Institute of Technology
11.220 Quantitative Reasoning and Statistical Methods for Planning I
Spring 1999
Homework Set #4 - Solutions
Due:
Friday, April 16, 5:00 p.m. to Mark in 10-485.
[Total = 100 points]
Probability, Probability Distributions, and Statistical Estimation
Question 1
The director of a local pollution control board is concerned that a particular company in the area
may be illegally dumping certain chemical wastes into a river. Recent national studies have
indicated that such practices are relatively widespread; 15 percent of the companies that are
similar in nature to the local company do dump wastes illegally. Before undertaking a formal
inquiry, the director can authorize the staff to sample the water quality a short distance
downstream from the company’s factory. The staff estimates that these water samples will be
75% accurate in predicting excessive dumping when it is in fact occurring and 80% accurate in
predicting no illegal dumping when it is in fact not occurring.
[6]
(a)
Draw a complete probability tree for this problem and identify all of the nodes, branches,
and probabilities associated with it.
The following tree uses these abbreviations: D = Dumping, ND = Not Dumping, PY = sample
Predicts Yes (dumping), PN = sample Predicts No (dumping). The values p(D) = 0.15, p(PY|D) = 0.75,
and p(PN|ND) = 0.8 were provided in the question itself. Remember that the total probability of
each set of branches emanating from the same node is always = 1.0.
●
    p(D) = 0.15
        ●
            p(PY|D) = 0.75:   p(D and PY) = p(D)•p(PY|D) = (0.15)(0.75) = 0.1125    (A)
            p(PN|D) = 0.25:   p(D and PN) = p(D)•p(PN|D) = (0.15)(0.25) = 0.0375    (B)
    p(ND) = 0.85
        ●
            p(PY|ND) = 0.2:   p(ND and PY) = p(ND)•p(PY|ND) = (0.85)(0.2) = 0.17    (C)
            p(PN|ND) = 0.8:   p(ND and PN) = p(ND)•p(PN|ND) = (0.85)(0.8) = 0.68    (D)
[3]
(b)
What is the probability that the staff, having taken a sample, will predict “yes, there is
illegal dumping”?
Referring to the letters on the right hand side of the tree, the answer is given by adding the
probabilities associated with letters (A) and (C).
P(PY) = p(PY and D) + p(PY and ND) = 0.1125 + 0.17 = 0.2825
[3]
(c)
What is the probability that the company is in fact dumping illegally when the staff
predicts illegal dumping is occurring?
This is a conditional probability:
$$p(D \mid PY) = \frac{p(D \text{ and } PY)}{p(PY)} = \frac{0.1125}{0.2825} \approx 0.3982$$
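The tree arithmetic for Question 1 can be checked with a few lines of Python. This is only a sketch: the function and variable names are illustrative and do not come from the course materials, while the three input values are the ones given in the question.

```python
def predict_yes_and_posterior(p_dumping, p_yes_given_dumping, p_no_given_clean):
    """Return (p(PY), p(D|PY)) from the prior and the two sample accuracies."""
    p_clean = 1.0 - p_dumping                    # p(ND)
    p_yes_given_clean = 1.0 - p_no_given_clean   # p(PY|ND)
    joint_dumping_yes = p_dumping * p_yes_given_dumping   # branch (A)
    joint_clean_yes = p_clean * p_yes_given_clean         # branch (C)
    p_yes = joint_dumping_yes + joint_clean_yes           # (A) + (C)
    return p_yes, joint_dumping_yes / p_yes


# Values from the question: 15% of companies dump, 75% and 80% sample accuracy.
p_yes, p_dumping_given_yes = predict_yes_and_posterior(0.15, 0.75, 0.80)
print(f"p(PY)   = {p_yes:.4f}")
print(f"p(D|PY) = {p_dumping_given_yes:.4f}")
```

Even with a "yes" prediction, the chance of actual dumping stays below 50% because non-dumping companies are so much more common than dumping ones.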
Question 2
In the past few months, the Department of Urban Studies and Planning has submitted eight
research proposals to various government agencies for funding. Our past experience is that, on
average, one out of every ten proposals will be funded. In this case, because each of the
proposals is to a different agency, it seems reasonable to believe that approval or denial of each
proposal will have no bearing on the decision concerning the others.
[12]
(a)
Graph the probability histogram for the number of these eight proposals that will be
funded. Clearly label both axes.
This is a Binomial Probability situation. Each proposal either succeeds or fails (funded or not
funded), independently of the others. The general Binomial Probability of getting x successes in
n trials is:
$$p(X = x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}$$
We know that each proposal has an independent probability of success of 1 in 10, so p = 0.1, and
we’re asked to evaluate the probability of x proposals being successful in 8 trials,
for x = 0, 1, 2, …, 8.
The probability of 0 proposals being successful in 8 trials is:
$$p(0 \text{ successes in 8 trials}) = \binom{8}{0} p^0 (1-p)^{8-0} = \frac{8!}{0!\,(8-0)!}(1)(0.9)^{8-0} = \frac{8!}{(1)(8!)}(0.9)^8 = (1)(0.430) \approx 0.430$$
Similarly, the probability of 1 proposal being successful in 8 trials is:
$$p(1 \text{ success in 8 trials}) = \binom{8}{1} p^1 (1-p)^{8-1} = \frac{8!}{1!\,(8-1)!}(0.1)^1(0.9)^{8-1} = \frac{8 \cdot 7!}{(1)(7!)}(0.1)(0.9)^7 = (8)(0.1)(0.478) = (0.8)(0.478) \approx 0.383$$
Once again, the probability of 2 proposals being successful in 8 trials is:
$$p(2 \text{ successes in 8 trials}) = \binom{8}{2} p^2 (1-p)^{8-2} = \frac{8!}{2!\,(8-2)!}(0.1)^2(0.9)^{8-2} = \frac{8 \cdot 7 \cdot 6!}{(2 \cdot 1)(6!)}(0.01)(0.9)^6 = \frac{56}{2}(0.01)(0.531441) = (28)(0.00531441) = 0.14880348 \approx 0.149$$
The rest of the probabilities are calculated in the same way. The whole table, covering every
possible number of successes in 8 trials, follows:

Number of Successes (x)    Number of Trials (n)    Binomial Probability
0                          8                       0.430
1                          8                       0.383
2                          8                       0.149
3                          8                       0.033
4                          8                       0.005
5                          8                       0.000
6                          8                       0.000
7                          8                       0.000
8                          8                       0.000
The probability histogram, based on the above table, is:
[Figure: probability histogram. Horizontal axis: Number of Funded Proposals Out of Eight (0 to 8).
Vertical axis: Probability of being Funded (0 to 0.5).]
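The binomial table and a rough text version of the histogram can be reproduced with the short Python sketch below. The helper name `binomial_pmf` is illustrative and not part of the original solutions; `math.comb` needs Python 3.8 or later.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 8, 0.1  # eight proposals, each funded with probability 1 in 10
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, prob in enumerate(pmf):
    bar = "#" * round(prob * 100)  # one '#' per percentage point
    print(f"{x}  {prob:.3f}  {bar}")
```

The probabilities for all nine outcomes sum to 1, which is a quick sanity check on the table.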
Question 3
Families wanting to get into public housing often face long waiting periods before receiving a
housing unit. Applicants to the Worcester Housing Authority face waits that are distributed
“normally” with a mean (μ) of 4 years and a standard deviation (σ) of 9 months.
[3]
(a)
What is the probability that a randomly chosen family who is seeking public housing will
have to wait at least five years for a public housing assignment?
We’re looking for p(x ≥ 5 years).
Remember that
$$z = \frac{x - \mu}{\sigma}.$$
Here, x = 5 years, μ = 4 years, and σ = 9 months or 0.75 years.
$$z = \frac{5 - 4}{0.75} = \frac{1}{0.75} \approx 1.33$$
Look up 1.33 in the “white card” and you’ll obtain a probability of 0.9082. Recall that the “white
card” only gives the probability for the left-hand tail, whereas here we’re looking for the
right-hand tail (“wait at least five years”), so we need to subtract the above value from 1. Therefore
p(x ≥ 5 years) = 1 – 0.9082 = 0.0918 (or 9.18%)
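As a cross-check on the table lookup, the same tail probability can be computed with Python's standard library (a sketch assuming Python 3.8 or later, where `statistics.NormalDist` is available; the variable names are illustrative). Because the software does not round z to 1.33, it gives about 0.091 rather than the table-based 0.0918.

```python
from statistics import NormalDist

# Waiting time in years: mean 4, standard deviation 9 months = 0.75 years.
wait = NormalDist(mu=4.0, sigma=0.75)

# cdf() gives the left-hand tail, just like the "white card",
# so the right-hand tail is 1 minus that value.
p_at_least_5 = 1.0 - wait.cdf(5.0)
print(f"p(x >= 5 years) = {p_at_least_5:.4f}")
```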
[3]
(b)
What is the amount of time past which the 20% of the families who wait the longest will
have to wait?
This time we want to find a value in years, not a probability.
The probability is given to us in the question.
[Figure: normal curve with mean 4; the right-hand tail of 20% is marked and the cutoff Z is unknown.]
Recall again that the “white card” only gives the probability for
the left-hand tail, so we need to rephrase the question as “the time up to which the 80% of the
families who wait the least will have to wait”. Then, we look up 0.8000 inside the table and find
the corresponding Z-score. The closest thing is 0.7995, which corresponds to a Z of 0.84.
Remember now that Z represents the number of standard deviations away from the mean.
Recall that the mean is 4 and the standard deviation is 0.75, so
x = μ + zσ = 4 + (0.84)(0.75) = 4 + 0.63 = 4.63 years.
[3]
(c)
What is the twenty-fifth percentile of waiting times?
This problem is similar to the one above (part b) in that we’re given a probability and we’re
looking for a value in years, so once again we’ll be looking inside the table to determine the Z to
be multiplied by the standard deviation and added to the mean (4 years).
This problem is more straightforward since we are looking
for a left tail, which is what the “white card” gives us.
[Figure: normal curve with mean 4; the left-hand tail of 25% is marked and the cutoff Z is unknown.]
So, we need to find the Z corresponding to p = 0.25, by
looking inside the table. The closest probability is 0.2514,
which happens to be located in the left half of the table.
This corresponds to a Z of –0.67.
Again, recall that Z represents the number of standard deviations away from the mean, so
x = μ + zσ = 4 + (–0.67)(0.75) = 4 – 0.5025 = 3.4975 ≈ 3.5 years.
[3]
(d)
What is the probability that an applicant who has already waited five years will get an
assignment within the next year?
This is a conditional probability. We’re looking for
p(5 ≤ x ≤ 6 | x ≥ 5 yrs).
Recall that
$$p(A \mid B) = \frac{p(A \text{ and } B)}{p(B)} = \frac{p(A) \cdot p(B \mid A)}{p(B)}.$$
In this case, p(A) = p(5 ≤ x ≤ 6) whereas p(B) = p(x ≥ 5 yrs). It should be clear that p(B|A), i.e.
p(x ≥ 5 yrs | 5 ≤ x ≤ 6), is equal to 1 (if one is waiting between 5 and 6 years, one is definitely
waiting at least 5 years). So, in this case,
$$p(A \mid B) = \frac{p(A) \cdot 1.0}{p(B)} = \frac{p(A)}{p(B)} = \frac{p(5 \le x \le 6)}{p(x \ge 5)}.$$
We already know the denominator, which was the answer to part (a) of this question (= 0.0918),
so now we need to find the numerator.
But p(5 ≤ x ≤ 6) = p(x ≤ 6 yrs) – p(x ≤ 5 yrs), so we need to find the Z for x = 5 and the Z for
x = 6, then look up their respective probabilities and calculate the difference.
Remember that
$$z = \frac{x - \mu}{\sigma}$$
and recall, from part (a), that the Z for 5 years is 1.33. Similarly to what was done in part (a),
$$Z_{6\,\text{yrs}} = \frac{x - \mu}{\sigma} = \frac{6 - 4}{0.75} = \frac{2}{0.75} \approx 2.67.$$
Now, we look up, from the “white card”, the
two probabilities for z = 2.67 (which turns out to be 0.9962), and for z = 1.33 (which is 0.9082).
So, p(5 ≤ x ≤ 6) = p(x ≤ 6 yrs) – p(x ≤ 5 yrs) = 0.9962 – 0.9082 = 0.088.
Finally, we calculate the conditional probability, as follows:
$$p(A \mid B) = \frac{p(A) \cdot 1.0}{p(B)} = \frac{p(A)}{p(B)} = \frac{p(5 \le x \le 6)}{p(x \ge 5)} = \frac{0.088}{0.0918} \approx 0.959 \approx 0.96.$$
So, the probability of getting a house within one year, once one has already waited for 5 years, is
0.96, or 96%.
(Note that – as one would intuitively expect – this is much higher than the probability of getting
a house between 5 and 6 years on day one, when one begins the public housing application
(which is 8.8%), since there is a 90.8% chance one will get the house before 5 years).
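The conditional probability in part (d) can be sketched the same way, again assuming Python 3.8+ and illustrative variable names. Computed without rounding z, the ratio comes out near 0.958, which agrees with the hand calculation to two decimals.

```python
from statistics import NormalDist

wait = NormalDist(mu=4.0, sigma=0.75)  # waiting time in years

p_between_5_and_6 = wait.cdf(6.0) - wait.cdf(5.0)  # p(A) = p(5 <= x <= 6)
p_at_least_5 = 1.0 - wait.cdf(5.0)                 # p(B) = p(x >= 5)

# p(A|B) = p(A) / p(B), because A can only happen when B has happened.
p_within_next_year = p_between_5_and_6 / p_at_least_5
print(f"p(5 <= x <= 6 | x >= 5) = {p_within_next_year:.3f}")
```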
Question 4
In the early 1970s Dr. Troy Zimmer conducted a comparative study of female participation in
higher education.¹ He developed his own measure to use in this study:
$$\text{Participation Ratio} = \frac{\text{Percentage of citizens enrolled in higher education who were female}}{\text{Percentage of citizens age 15–24 who were female}}$$
He calculated this Participation Ratio for a simple random sample of 105 countries, 58 of which
he characterized as “western” and 47 of which he characterized as “non-western.”
The values of this variable for the 105 countries ranged from .08 (in the Congo, Guinea, and
Saudi Arabia) to 1.10 (in the Philippines). The summary statistics for the data he collected are
provided in Table 1:
Table 1: Data for Female Participation in Higher Education

                                                 Western Countries    Non-Western Countries
Number of Countries                              58                   47
Mean of Participation Ratio (x̄)                  .66                  .34
Standard Deviation of Participation Ratio (s)    .19                  .17
[4]
(a)
Explain in no more than two sentences exactly what this variable is measuring. (You
might want to think about what a particularly high value and a particularly low value of
the Participation Ratio would indicate. Also, note that it is possible for this variable to be
greater than 1.00, as is the case for the Philippines.)
The Participation Ratio measures the relative participation of females in higher
education as compared to their relative proportion of the college age population. It
measures whether the percentage of people enrolled in higher education who are women
is higher (Participation Ratio > 1) or lower (Participation Ratio < 1) than the
percentage of college age individuals in the population who are women.
Another way to see this is to begin by understanding that the Participation Ratio is the
ratio of two conditional probabilities:
$$\frac{P(\text{female} \mid \text{enrolled in higher education})}{P(\text{female} \mid \text{15–24 years old})}$$
¹ Troy A. Zimmer, “Sexism in Higher Education: A Cross-National Analysis,” Pacific Sociological Review, Vol. 18, No. 1,
Jan. 1975.
[4]
(b)
Using the information contained in this table and your knowledge of how to quantify the
chance error inherent in estimating population parameters from sample statistics,
estimate the mean Participation Ratio for all western countries.
In estimating population parameters from sample statistics while accounting for the chance
error involved in that estimation one needs to calculate an interval estimate for the population
parameter.
In this case you are dealing with sample and population means, so you need to construct a
confidence interval around the sample mean, plus, since we don’t know σ, we should use the
t-distribution:
$$\mu \approx \bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$
However, since, in this case, n ≥ 30, we can use the z distribution instead of the t distribution, so
$$\mu \approx \bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$
You have to choose a confidence level. 90%, 95%, and 99% are the conventional choices.
There is nothing in particular in this problem to suggest the choice of one over another.
The calculations for each confidence level that you might have chosen are the following:
90% confidence interval:
$$.66 \pm 1.64 \times \frac{.19}{\sqrt{58}} = .66 \pm .04 = (.62, .70)$$
95% confidence interval:
$$.66 \pm 1.96 \times \frac{.19}{\sqrt{58}} = .66 \pm .05 = (.61, .71)$$
99% confidence interval:
$$.66 \pm 2.57 \times \frac{.19}{\sqrt{58}} = .66 \pm .06 = (.60, .72)$$
NOTE: any one of the 3 answers is sufficient. Also, if you used the t-distribution, your answer
will still be correct and actually more accurate than the ones above.
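The three intervals can be recomputed with a small helper, sketched here in Python. The function name is illustrative; the z values are the ones read from the table in the solution, and the inputs are the western-country figures from Table 1.

```python
from math import sqrt


def mean_confidence_interval(mean, sd, n, z):
    """z-based confidence interval for a population mean (large sample)."""
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin


# Western countries: sample mean .66, s = .19, n = 58.
for level, z in [("90%", 1.64), ("95%", 1.96), ("99%", 2.57)]:
    low, high = mean_confidence_interval(0.66, 0.19, 58, z)
    print(f"{level}: ({low:.2f}, {high:.2f})")
```

Calling the same function with 0.34, 0.17 and 47 gives the corresponding intervals for the non-western countries.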
[4]
(c)
Using the information contained in this table and your knowledge of how to quantify the
chance error inherent in estimating population parameters from sample statistics,
estimate the mean Participation Ratio for all non-western countries.
Once again, calculate the appropriate confidence interval(s):
90% confidence interval:
$$.34 \pm 1.64 \times \frac{.17}{\sqrt{47}} = .34 \pm .04 = (.30, .38)$$
95% confidence interval:
$$.34 \pm 1.96 \times \frac{.17}{\sqrt{47}} = .34 \pm .05 = (.29, .39)$$
99% confidence interval:
$$.34 \pm 2.57 \times \frac{.17}{\sqrt{47}} = .34 \pm .06 = (.28, .40)$$
NOTE: any one of the 3 answers is sufficient. Also, if you used the t-distribution, your answer
will still be correct and actually more accurate than the ones above.
[2]
(d)
By themselves, what do these two results suggest about differences between female
participation in higher education in western countries and in non-western countries?
[Note that in the third part of the course we will develop the statistical tools to address
this question more rigorously.]
For whatever confidence level you chose, the corresponding confidence intervals do not
overlap. This leads to the conclusion that the mean Participation Ratio for all western
countries is higher than the mean Participation Ratio for all non-western countries.
[Note: When we turn to hypothesis testing we will see a more explicit way of handling
this question.]
Question 5
As part of the Annual Housing Survey the Census Bureau determines how far the head of a
household has to commute to work. In 1974, this averaged 13 miles (i. e., the mean distance
traveled was 13 miles). The standard deviation of distance traveled happened to be 13 miles too!
(These are one-way distances to work.)
[4]
(a)
From these summary statistics, what do you know about the shape of the distribution of
the variable “distance to work?”
To explain a mean and standard deviation both equal to 13 miles, and knowing that nobody
can travel less than zero miles, the distribution of the variable “distance traveled to work”
must be skewed to the right quite a bit. Some commuters must be commuting distances much
longer than 13 miles.
A real estate office wanted to make a similar survey in a certain town, which has about 20,000
households. A simple random sample of 400 households was chosen, the occupants were
interviewed, and it was determined that, on average, the heads of the sample households
commuted 12.7 miles to work, with a standard deviation of 12.0 miles. (Note that if someone
was not working, the commuting distance was defined as 0. This is the same procedure as that
used by the Census Bureau.)
[6]
(b)
Using this information find the 90% confidence interval for the mean distance that all
heads of households in that town commute to work.
The general formula for estimating the population mean based on a sample mean, when we
don’t know σ, is:
$$\mu \approx \bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$
However, once again, since the size of the sample is greater than 30, we can use the
z-distribution instead of the t-distribution:
$$\mu \approx \bar{x} \pm Z_{\alpha/2}\,\frac{s}{\sqrt{n}}$$
where s is the sample standard deviation, n is the size of the sample and α is the fraction of the
normal distribution curve in the two tails combined, which is left out of the confidence interval.
For a confidence interval of 90%, α is 10%, or 0.1, therefore α/2 is 0.05. Since the “white card”
gives us the entire left tail of the curve, including one of the α/2 tails (i.e. the left one), we
need to look up the probability of 0.95 (90% plus the left 5%, α/2).
We look inside the table to find the closest value to 0.95 and we find both a 0.9495,
corresponding to a Z of 1.64, and a 0.9505, corresponding to a Z of 1.65. Our desired Z is
exactly halfway between the two, i.e. 1.645.
We know that x̄ = 12.7, s = 12.0 and n = 400. Plugging all of these values into the formula, we
obtain:
$$\mu \approx 12.7 \pm 1.645\,\frac{12}{\sqrt{400}} = 12.7 \pm 1.645\,\frac{12}{20} = 12.7 \pm (1.645)(0.6) = 12.7 \pm 0.987$$
NOTE: if you used the t-distribution, your answer will still be correct and actually more
accurate than the one above.
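The step from a confidence level to its Z value can also be done in software instead of by reading the table. The sketch below assumes Python 3.8+ (`statistics.NormalDist`); the function name `z_for_confidence` is illustrative and not part of the original solutions.

```python
from math import sqrt
from statistics import NormalDist


def z_for_confidence(level):
    """Z value that leaves alpha/2 in each tail of the standard normal curve."""
    alpha = 1.0 - level
    # Left-tail probability to look up: the confidence level plus the left alpha/2.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


z = z_for_confidence(0.90)          # close to the 1.645 found by interpolation
margin = z * 12.0 / sqrt(400)       # s = 12.0 miles, n = 400 households
print(f"Z = {z:.3f}")
print(f"mean = 12.7 +/- {margin:.3f} miles")
```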
[3]
(c)
Is the following statement true or false? Why?
“90% of the heads of households in the town have one-way commuting distances
that are between the bounds of the 90% confidence interval calculated in part (b)
above.”
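The statement is false. A confidence interval estimates the population mean; it says nothing
about where the commuting distances of individual households fall. The statement should read:
“There is a 90% probability that a 90% confidence interval constructed in this way (as in part (b)
above) contains the true mean commuting distance.” This is a statement about the estimation
process and not a claim about individual results (as the other statement was).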
Another piece of information that was gathered in the real estate office’s survey was that in 321 of
the 400 sample households the head of the household commuted by car.
[6]
(d)
Find the 95% confidence interval for the percentage of all households in the town in
which the head of the household commutes by car.
This problem asks us to estimate a population proportion based on a sample proportion. The
general equation is:
$$\text{population proportion} = p \approx \hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \quad \text{where } n = 400 \text{ and } \hat{p} = \frac{321}{400} = 0.8025$$
In this case, α is 0.05, therefore α/2 is 0.025. We need to look for a probability of
0.95+0.025=0.975 inside the Z table in the “white card” and find the corresponding Z value.
From the table, we find that Z=1.96. Plugging all our values back into the above equation, we
obtain:
$$\begin{aligned}
\text{population proportion} = p &\approx (0.8025) \pm (1.96)\sqrt{\frac{(0.8025)(1-0.8025)}{400}} \\
&= (0.8025) \pm (1.96)\sqrt{\frac{(0.8025)(0.1975)}{400}} = (0.8025) \pm (1.96)\sqrt{\frac{0.15849375}{400}} \\
&= 0.8025 \pm (1.96)\sqrt{0.000396234375} = 0.8025 \pm (1.96)(0.0199) \\
&= 0.8025 \pm 0.039
\end{aligned}$$
= 80.25% ± 3.9%
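The proportion estimate can be checked with a few lines of Python. This is a sketch only; the function name is illustrative, and Z = 1.96 is the table value used in the solution.

```python
from math import sqrt


def proportion_confidence_interval(successes, n, z):
    """Return (p_hat, margin) for a z-based interval around a sample proportion."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin


# 321 of the 400 sampled household heads commute by car; 95% confidence.
p_hat, margin = proportion_confidence_interval(321, 400, 1.96)
print(f"{p_hat:.2%} +/- {margin:.1%}")
```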
[3]
(e)
How would your answer to part (d) change if the town had had only 10,000 households
instead of 20,000?
Since the sample estimate of population proportion is not dependent on the size of the
population, the answer would not change.
(Population size, N (capital N) is not part of the estimation equations).
[3]
(f)
How would your answer to part (d) change if the survey had sampled 1,600 households
and found that in 1,284 of them the head of the household commuted by car?
In general, an estimate of a population parameter based on a sample statistic is expected to
improve with larger sample sizes. Therefore, we expect our estimate of population proportion to
become more precise. The random sampling error should get smaller. To test this, we use the
same equations used in part (d) above, using n = 1600 and
$$\hat{p} = \frac{1284}{1600} = 0.8025.$$
Note that the sample proportion is the same as before.
$$\begin{aligned}
\text{population proportion} = p &\approx (0.8025) \pm (1.96)\sqrt{\frac{(0.8025)(1-0.8025)}{1600}} \\
&= (0.8025) \pm (1.96)\sqrt{\frac{(0.8025)(0.1975)}{1600}} = (0.8025) \pm (1.96)\sqrt{\frac{0.15849375}{1600}} \\
&= 0.8025 \pm (1.96)\sqrt{0.00009905859375} = 0.8025 \pm (1.96)(0.00995) \\
&= 0.8025 \pm 0.0195
\end{aligned}$$
= 80.25% ± 1.95%
Q.E.D.
As you recall, Mark mentioned in class that quadrupling the sample size will cut the error in
half, which is what happened in this case. In general, the error shrinks in proportion to the
square root of the factor by which the sample size increases: multiplying n by k divides the
error by √k.
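The square-root relationship is easy to see numerically. The sketch below holds the sample proportion at 0.8025 and multiplies the sample size by 1, 4 and 16; the sample size of 6,400 is a hypothetical extension used only for illustration, and the function name is likewise illustrative.

```python
from math import sqrt


def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the z-based confidence interval for a proportion."""
    return z * sqrt(p_hat * (1 - p_hat) / n)


base = margin_of_error(0.8025, 400)
for factor in (1, 4, 16):
    margin = margin_of_error(0.8025, 400 * factor)
    # The ratio to the n = 400 margin is 1 / sqrt(factor).
    print(f"n = {400 * factor:5d}: +/- {margin:.4f} (ratio {margin / base:.2f})")
```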
Question 6
A labor union has an examining board whose job is to select apprentices for admission into the
apprenticeship program of the union. Not everyone who qualifies is admitted to the
apprenticeship program, but there is suspicion that the admissions that are made are made in a
manner that is discriminatory.
Records of the examining board show that it has a record for admitting 70% of all the applicants
who satisfy the basic set of requirements. Recently, five women who satisfied the basic
requirements came before the board, but 4 out of the 5 were rejected. Only one was admitted to
the program.
[6]
(a)
Calculate the probability that this would have happened using the assumption that the
admissions process was non-discriminatory.
This is a Binomial Probability problem. We’re being asked to determine the probability of
getting 1 success out of 5 tries. The probability of success is 70% or 0.7.
Using the Binomial Probability equation:
$$p(X = x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}$$
we substitute our values and get
$$\binom{5}{1}(0.7)^1(1-0.7)^{5-1} = \frac{5!}{1!\,(5-1)!}(0.7)(0.3)^4 = \frac{5 \cdot (4!)}{(1)(4!)}(0.7)(0.0081) = (5)(0.00567) = 0.02835$$
[2]
(b)
In order to answer part (a) you had to determine what non-discrimination means in a
probabilistic sense. How was non-discrimination incorporated into the calculations you
made for part (a)?
We assumed the probability of admission was independent of gender.
P(Admission|Female)=p(Admission|Male)=p(Admission)=70%=0.7
Question 7
The 4 November 1985 Boston Globe contained an article entitled, “Emission Tampering Found
in Many Boston Autos.” It reported on a study by the Environmental Protection Agency (EPA),
in which the EPA attempted to estimate the proportion of automobiles that had emission systems
that had been tampered with (i.e., illegally modified by their owners):
“Fifteen percent of the emission control systems in 1975 to 1984 model cars in
Boston have been tampered with, a government study has found, causing
dangerous fumes to be spewed into the air...The study was conducted last year by
pulling motorists over at roadsides or inspection stations. With the consent of the
owners, inspectors examined emission control devices such as the catalytic
converter system...”
In Boston one vehicle out of ten was stopped during a specified time period. 286 vehicles were
stopped and examined, and fifteen percent of these vehicles had illegal emission systems. (You
may assume that no one who was stopped refused to have their car examined.)
[2]
(a)
Assuming that the likelihood of being stopped was independent of whether or not the
emission system had been tampered with—which seems to be a reasonable assumption—
calculate the probability that a vehicle that had been tampered with would be stopped.
From the sentence “In Boston one vehicle out of ten was stopped…”, we determine that the
probability of being stopped is 0.10.
Assuming that p(stopped) = p(stopped|tampered), as stated in the question, then
p(stopped|tampered) = 0.10
[6]
(b)
Assume that the 286 examined automobiles were a simple random sample of all the
1975-1984 cars in Boston. Using these results and your knowledge of how to quantify
the chance error that comes with sampling, estimate the true proportion of all cars in
Boston with emission systems that have been illegally tampered with.
This problem asks us to estimate a population proportion based on a sample proportion. The
general equation is:
$$\text{population proportion} = p \approx \hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \quad \text{where } n = 286 \text{ and } \hat{p} = 15\% = 0.15$$
In this case, you could have chosen any of the three most common confidence levels (90%,
95% and 99%), which correspond to α’s of 0.1, 0.05 and 0.01, respectively, hence α/2’s of 0.05,
0.025 and 0.005 respectively. We need to extract the 3 Z-scores from the “white card” for
probabilities of 0.90+0.05=0.95, 0.95+0.025=0.975 and 0.99+0.005=0.995, inside the Z table.
From the table, we find that $Z_{0.05} = 1.645$, $Z_{0.025} = 1.96$ and $Z_{0.005} = 2.575$. Plugging all our
values back into the above equation, we obtain, for each confidence level:
$$\begin{aligned}
\text{population proportion} = p &\approx (0.15) \pm Z_{\alpha/2}\sqrt{\frac{(0.15)(1-0.15)}{286}} \\
&= (0.15) \pm Z_{\alpha/2}\sqrt{\frac{(0.15)(0.85)}{286}} = (0.15) \pm Z_{\alpha/2}\sqrt{\frac{0.1275}{286}} \\
&= (0.15) \pm Z_{\alpha/2}\,(0.0211)
\end{aligned}$$
For a confidence interval of 90%, $Z_{0.05} = 1.645$, the estimate of the proportion of Boston cars
with tampered emissions is (0.15) ± (1.645)(0.0211) = 0.15 ± 0.0347,
i.e. 15% ± 3.47%;
For a confidence interval of 95%, $Z_{0.025} = 1.96$, the estimate of the proportion of Boston cars with
tampered emissions is (0.15) ± (1.96)(0.0211) = 0.15 ± 0.041356,
i.e. 15% ± 4.14%;
For a confidence interval of 99%, $Z_{0.005} = 2.575$, the estimate of the proportion of Boston cars
with tampered emissions is (0.15) ± (2.575)(0.0211) = 0.15 ± 0.0543325, i.e. 15% ± 5.43%.
(NOTE: any one of the 3 answers is sufficient).
Question 8
A short article from the science section of the Boston Globe is reproduced below. It describes a
method of survey question design called “randomized response.”
[2]
(a)
How would randomized response protect a respondent’s privacy?
The key to “Randomized Response”, which may or may not be clear from the short article, is
that the respondent flips the coin and the questioner does not see the result. Only the respondent
knows whether he/she flipped heads or tails. To make the article clearer, one should also
understand that, basically, when one flips “tails” he/she will give “an honest answer”.
To summarize: HEADS: always Yes; TAILS: honest answer (yes or no).
Privacy is protected because a person who, for example, actually did have sex with a prostitute
should feel OK answering yes to such a sensitive question, since at least 50% of the respondents
will also answer yes, after flipping a coin and getting “heads”. The questioner will never know
who is responding yes because the coin came up “heads” and who is answering yes because,
after flipping and getting “tails”, he/she is actually admitting to such an act.
[5]
(b)
How would a survey analyst use this type of question to estimate the proportion of a
population that would answer “yes" to a particularly sensitive question? Be as specific as
possible.
Given the random nature of coin flipping, 50% of the time the result will be “heads” and 50% of
the time it will be “tails”. All “heads” will produce an answer of “yes” to the sensitive question.
Some percentage of the “tails” will also produce a “yes” response, which represents the “true” yes
answers (honest answers). Because the coin flip has nothing to do with the respondent’s true
answer, one would expect the same percentage of “true YES” respondents in the 50% of tails as in
the 50% of heads; in the “heads” half they are simply mixed in with the “forced YES answers”,
since everyone there answers YES, whether it is a “true YES” or not. So:
p(true YES|tails) = p(true YES|heads) = p(true YES),
and the observed fraction of YES answers is
p(YES) = p(heads) × 1 + p(tails) × p(true YES) = 0.5 + 0.5 × p(true YES).
(The honest YES answers from the “tails” group are the % of YES in excess of 50%, and they come
from only half of the respondents.)
Therefore, p(true YES) = [p(YES) – 50%] × 2.
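The estimator can be written as a one-line function, sketched below in Python. The function name and the observed figure of 60% are hypothetical and serve only to illustrate the arithmetic; the optional `p_heads` argument generalizes the formula to a biased coin, which is an assumption beyond what the article describes.

```python
def estimate_true_yes(observed_yes_fraction, p_heads=0.5):
    """Estimate the honest 'yes' proportion from randomized-response answers.

    With a fair coin (p_heads = 0.5) this is [p(YES) - 50%] x 2.
    """
    return (observed_yes_fraction - p_heads) / (1.0 - p_heads)


# Hypothetical survey result: 60% of all respondents answered "yes".
estimate = estimate_true_yes(0.60)
print(f"Estimated true 'yes' proportion: {estimate:.0%}")
```

In a real sample the observed fraction is subject to chance error, so the estimate can even fall slightly below zero when the true proportion is small.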
[2]
(c)
There is a major unspoken assumption embedded in the use of randomized response.
What is it?
The major unspoken assumption is that the respondents will understand clearly how their
privacy is assured and will actually answer honestly whenever “tails” comes up. If they don’t,
the whole process will not yield useful results.