Department of Urban Studies and Planning
Massachusetts Institute of Technology
11.220 Quantitative Reasoning and Statistical Methods for Planning I
Spring 1998
Homework Set #4 Solutions
[Total = 88 points]
Probability, Probability Distributions, and Statistical Estimation
Question 1
This question is in honor of my son the golfer (who has not yet had a hole-in-one).
[4]
Do parts (a) and (b) of Case Study 5, page 333 of Weiss. The easiest way to do the
calculations is probably to go into Excel and to use the appropriate formula there.
Here we’re dealing with a binomial problem, with
\[ p(\text{success}) = \frac{1}{3709}, \qquad n = 155 \]
The probability we’re trying to find is P(x>=4),
which equals 1-P(x<4) = 1 - P(0) - P(1) - P(2) - P(3), or
\[ 1 - \sum_{x=0}^{3} P(x) \]
Using the binomial formula and our values for p(success) and n, this becomes
\[ 1 - \sum_{x=0}^{3} \binom{155}{x} \left(\frac{1}{3709}\right)^{x} \left(\frac{3708}{3709}\right)^{155-x} \]
To calculate this, we plug in the appropriate values of x and evaluate the resulting expressions.
For example, for the first step (x=0) we get
\[ \binom{155}{0} \left(\frac{1}{3709}\right)^{0} \left(\frac{3708}{3709}\right)^{155-0} = \frac{155!}{0!\,(155-0)!} \cdot 1 \cdot 0.9591 = 0.9591 \]
The entire expression works out to be
1 - ( 0.9591 + 0.0401 + 0.0008 + 0.0000 ) = 0.0000 (to four decimal places)
At this level of precision, the probability is zero.
In Excel, we can calculate the answer with the BINOMDIST function. Its format is
=BINOMDIST(x,n,p,0)
where n, x, and p are the parameters we’re familiar with.[1] So the monster equation above becomes:
=1-(BINOMDIST(0,155,1/3709,0)+ BINOMDIST(1,155,1/3709,0)+
BINOMDIST(2,155,1/3709,0)+ BINOMDIST(3,155,1/3709,0))
for which Excel returns 1.1831E-07, which is its way of saying 1.1831 × 10^-7, or 0.0000001183.
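The same calculation can be checked outside Excel. The following is a minimal Python sketch, not part of the original solution; the helper name `binom_pmf` is an illustrative assumption.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 155, 1 / 3709

# P(x >= 4) = 1 - [P(0) + P(1) + P(2) + P(3)]
p_at_least_4 = 1 - sum(binom_pmf(x, n, p) for x in range(4))
print(f"P(x >= 4) = {p_at_least_4:.4e}")
```

Like the Excel formula, this subtracts the four lower terms from 1 rather than summing the 152 upper terms.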
b) The assumptions we made were
• There are two possible outcomes for each trial: A golfer either makes a hole-in-one or she
doesn’t. This is a very reasonable assumption.
• The trials are independent: One golfer’s performance does not affect another golfer’s
performance. You could argue this one, but it seems reasonable.
• The probability of a hole-in-one remains 1/3709 from trial to trial: Each golfer has the same
chance of making a hole-in-one. This is less reasonable (some golfers are better than
others), but it might be approximately true, at least on average.
[1] The 0 just before the right parenthesis tells Excel that we want to know the probability that we get exactly X
successes. If you replace the 0 with a 1, Excel will tell you the probability that we get X or fewer successes. So
there’s a shorter way to solve the problem than the way I do it above: You could simply type:
=1-BINOMDIST(3,155,1/3709,1)
and Excel would return 1.1831E-07. The only reason I did it the long way above was to be consistent with the
formula as you learned it in Weiss.
Question 2
This question has three parts.
[6]
(a)
Do Exercise 5.50 on page 310 of Weiss.
Here p = 0.25 and n = 10.
a) Looking at Table 1 for binomial probabilities for x = 2, we have a probability of 0.282
that exactly 2 children are not living with their parents.
b) Looking at Table 1 for binomial probabilities for x = 2, x = 1 and x = 0, we have a total
probability of 0.282 + 0.188 + 0.056 = 0.526 that at most 2 children are not living with their
parents.
c) P(between 3 and 6, inclusive, are not living with their parents) = 1 - P(at most 2 children are not
living with their parents) - P(7 or more children are not living with their parents) = 1 - 0.526
- 0.003 = 0.471
d) P(either fewer than 3 or more than 7) = P(at most 2 children are not living with their parents)
+ P(more than 7 are not living with their parents) = 0.526 + 0.000 = 0.526
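The table lookups in parts a) through d) can be reproduced directly from the binomial formula. This is a minimal Python sketch rather than the method used in Weiss; the variable names are illustrative assumptions.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.25
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

exactly_2 = pmf[2]                               # part a)
at_most_2 = sum(pmf[:3])                         # part b): x = 0, 1, 2
between_3_and_6 = sum(pmf[3:7])                  # part c): x = 3, 4, 5, 6
fewer_than_3_or_more_than_7 = sum(pmf[:3]) + sum(pmf[8:])  # part d)

for label, value in [("a", exactly_2), ("b", at_most_2),
                     ("c", between_3_and_6), ("d", fewer_than_3_or_more_than_7)]:
    print(f"{label}) {value:.3f}")
```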
[2]
(b)
Do Exercise 5.74 on page 317 of Weiss.
Mean = np = 10 * 0.25 = 2.5
Standard Deviation  =
[2]
(c)
np(1  p) =
10  0.25  0.75 = 1.37
As an extension of Exercise 5.74 determine the mean and standard deviation of
the percentage of children in a sample of 10 that are not living with both parents.
Mean = p = 0.25 = 0.25
Standard Deviation  =
p(1  p)
n
=
0.25  0.75
10
= 0.137
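As a quick check on both pairs of results, the count and the proportion differ only by a factor of n. A minimal Python sketch (variable names are illustrative assumptions):

```python
from math import sqrt

n, p = 10, 0.25

# Number of children (out of 10) not living with both parents
mean_count = n * p
sd_count = sqrt(n * p * (1 - p))

# Proportion of children in the sample not living with both parents
mean_prop = p
sd_prop = sqrt(p * (1 - p) / n)

print(mean_count, round(sd_count, 2))   # count scale
print(mean_prop, round(sd_prop, 3))     # proportion scale
```

Dividing the count results by n = 10 gives the proportion results, which is why 1.37 becomes 0.137.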
Question 3
The Mayor’s Planning Office is interested in getting an estimate of the mean annual
income of employed persons residing in the city. To estimate this number the MPO has
taken two samples, the first was a simple random sample of 400 local households and the
second was a simple random sample of 100 local employers.
Using data as reported by household members, the household survey yielded a mean
sample income of $15,000 per household with a standard deviation of $2,000 per
household.
[6]
(a)
Using your knowledge of how to quantify chance error, estimate the mean
household income for all households in the city.
Sample mean = $15,000 per household
s = $2,000 per household
n = 400
90% Confidence Interval: \( 15000 \pm 1.645 \times 2000 / \sqrt{400} \) = [14835.5, 15164.5]
95% Confidence Interval: \( 15000 \pm 1.96 \times 2000 / \sqrt{400} \) = [14804, 15196]
99% Confidence Interval: \( 15000 \pm 2.575 \times 2000 / \sqrt{400} \) = [14742.5, 15257.5]
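The three intervals follow the same pattern, so they can be generated in one loop. This is a minimal Python sketch; the function name `z_interval` is an illustrative assumption, and the z values are the table values used above.

```python
from math import sqrt

def z_interval(mean, sd, n, z):
    """Confidence interval for a mean: mean ± z * sd / sqrt(n)."""
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width

# Household survey: mean $15,000, s = $2,000, n = 400
for level, z in [("90%", 1.645), ("95%", 1.96), ("99%", 2.575)]:
    low, high = z_interval(15000, 2000, 400, z)
    print(f"{level}: [{low:.1f}, {high:.1f}]")
```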
[2]
(b)
The Mayor, in a hurry to present some statistics in a speech she is scheduled to
give, uses the mean of $15,000 as her estimate of the mean annual income of
employed persons residing in the city. What is the most important reason why
these household data (even if the sampling was conducted properly and the data
were accurately reported and properly collected) might yield a biased estimate of
the mean annual income of employed persons residing in the city? Explain
briefly, indicating the direction (high or low) of the bias.
Households might include more than one worker. Thus, mean household income would likely be
greater than the mean income of employed persons. (On the other hand households would also
include unemployed persons so this would offset over-estimation due to numbers.)
In the employer survey each employer was asked to calculate a mean income for his or
her employees. Using these figures, the analysts then calculated the mean of these 100
employer-provided numbers. This resulted in an estimated mean income of $12,500 per
employee.
[6]
(c)
Trying to correct the incorrect impressions left by the Mayor’s speech (mentioned
in part (b) above), the staff of the Mayor’s Planning Office decides to use the
mean that was calculated from the survey of employers as its estimate of the mean
annual income of employed persons residing in the city. List the three most
important reasons why this procedure might yield a biased estimate of the mean
annual income of employed persons residing in the city (even if the sampling was
conducted properly, the data were accurately reported by the employers, and the
information was properly recorded). What is the direction of the likely bias for
each of these reasons?
1. Taking the mean of each firm’s mean salary weights all firms equally, which is wrong if we
want to find the mean income of persons: Not all firms have the same number of employees.
The direction of bias this problem will cause depends on what you believe about the
difference between large and small firms. For example, if you think that larger firms tend to
have a lot of low-income employees, then the estimated mean will be too high.
2. If an employee has two jobs, she will be counted twice (in other words, her income will be
split in two). These cases will introduce a downward bias: The mean will be too low.
3. Employee data will include some people who live outside the city (while we’re trying to find
the mean income for people who reside in the city). Again, the direction of bias depends on
our judgement about the relative incomes of residents vs. non-residents. If we think
non-residents tend to have higher salaries than residents, then the employer survey will give an
estimate that is too high.
4. An employer survey will probably miss people who are self-employed. If self-employed
people tend to have higher incomes, then our estimate will be too low.
Question 4
The article below appeared in the Boston Globe just after the 1983 mayoral election in Boston in
which Ray Flynn defeated Mel King (former MIT faculty member and director of the Community
Fellows Program). It discusses the discrepancies between the polls that were conducted by each
of the three major television stations the day of the election.
[2]
(a)
In the third paragraph, Stan Hopkins of WBZ says there were indications that
voters lied in the exit polls. Why would they lie?
Voters could have lied because they didn’t want to be identified in public as having voted for
Flynn. Some say that race was a big issue in that campaign, and some voters might have thought
that people would think they were racist if they admitted that they voted for Flynn. (Mel King
was African American and Ray Flynn was Irish Catholic from South Boston.)
[4]
(b)
Decision Research conducted 3600 interviews for Channel 4. Assume that these
results were calculated from a simple random sample of people who had
voted. 56% said they had voted for Flynn. Calculate a 90% confidence interval
for the proportion of voters in the population who would have said they had voted
for Flynn.
For a 90% confidence interval, the proportion of voters in the population who would have said
they voted for Flynn was:
\[ \hat{p} \pm z \cdot \sigma_{\hat{p}} = \hat{p} \pm (1.64) \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.56 \pm (1.64) \sqrt{\frac{0.56 \times 0.44}{3600}} \]
\[ = 0.56 \pm (1.64)(0.008) = 0.56 \pm 0.014 = [54.6\%, 57.4\%] \]
[2]
(c)
What is the importance of the phrase in italics in part (b) above?
(HINT: Why doesn't it simply say who voted for Flynn?)
The polls report how people said they voted, not how they actually voted. As in all opinion
polling, there can be an important difference between someone saying they did (or would do)
something and actually doing it.
[4]
(d)
The first paragraph makes the point that Channel 7’s poll was more accurate
because it used larger samples than the other Channels’ polls. Is this explanation
a sufficient explanation for the discrepancies among the three polls? If so, explain
why. If not, explain why not.
One way to find out the extent to which sample size determined the accuracy of the estimation is
to calculate the size of the standard error in each poll.
Channel   Sample Size   Estimate
7         8,450         65% - 35%
5         2,000         56% - 44% (at least)
4         3,600         56% - 44%

So the standard errors are:
Channel 7: \( \sigma_{\hat{p}} = \sqrt{\frac{0.65 \times 0.35}{8450}} = 0.005 \)
Channel 5: \( \sigma_{\hat{p}} = \sqrt{\frac{0.56 \times 0.44}{2000}} = 0.011 \)
Channel 4: \( \sigma_{\hat{p}} = \sqrt{\frac{0.56 \times 0.44}{3600}} = 0.008 \)
None is large enough to explain the nine percentage point difference between Channel 7’s and
the other two stations’ estimates. Even for a 99% confidence interval, z = ±2.58, which would
give (for example) less than a ±3% interval around Channel 5’s estimate, hardly enough to
explain the 9 percentage point difference between Channel 7’s and Channel 5’s estimates. This
suggests that the more important differences may lie in the way the samples were taken.
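The comparison can be made concrete by computing each poll's standard error and 99% margin of error side by side. This is a minimal Python sketch using the sample sizes and estimates quoted above; the dictionary names are illustrative assumptions.

```python
from math import sqrt

# (share reported for Flynn, sample size) for each station's poll
polls = {
    "Channel 7": (0.65, 8450),
    "Channel 5": (0.56, 2000),
    "Channel 4": (0.56, 3600),
}
z_99 = 2.58  # z value for a 99% confidence interval

standard_errors = {}
margins = {}
for name, (p_hat, n) in polls.items():
    se = sqrt(p_hat * (1 - p_hat) / n)
    standard_errors[name] = se
    margins[name] = z_99 * se
    print(f"{name}: SE = {se:.3f}, 99% margin = ±{margins[name]:.1%}")

# Gap between Channel 7 and the other two estimates
gap = 0.65 - 0.56
print(f"gap = {gap:.0%}")
```

Every 99% margin stays below 3 percentage points, so chance error alone does not account for a 9-point gap.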
Question 5
MIT is in the process of revising its parking policies for faculty and staff. As input to that
process, the MIT Planning Office sent a survey to all faculty and staff asking them a
variety of questions concerning their commuting patterns, distances, and costs.
One question asked respondents to estimate their monthly commuting costs. For those
who commuted by subway or bus, the survey asked each respondent to calculate his or
her typical cost per month. For those who commuted by car, the survey asked for the
number of miles they commuted per month, multiplied that number of miles by an
assumed cost of 28¢ per mile (a figure meant to include the cost of depreciation,
insurance, gas, and oil) and added any expenditures for tolls and parking. The mean
monthly commuting cost, calculated from the respondents’ answers to this question, was
$25 with a standard deviation of $6.
Assume, for the moment, that all members of the faculty and staff responded to the
survey and to this question. Assume also that monthly commuting costs are distributed
normally.
[3]
(a)
What percentage of the MIT faculty and staff spends more than $35 per month in
commuting costs?
We know that µ = $25 and σ = $6. We want to find P(x>$35).
The z-score for x=$35 is ($35-$25)/$6 = 1.67
so P(x>$35) = 1 - P(x<$35) = 1 - P(z<1.67) = 1 - 0.9525 = 0.0475, or 4.75%
[3]
(b)
What is the 75th percentile of commuting costs for MIT faculty and staff?
We want to find a such that P(z<a) = 0.75. From Table II, we find that P(z<0.67) = 0.75.
So we need to find the x corresponding to z = 0.67 (in other words, we need to de-standardize
0.67).
x = z • σ + µ = 0.67 • $6 + $25 = $29
75% of MIT faculty and staff spend less than $29 per month on commuting.
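Parts (a) and (b) can also be checked without a z-table. This is a minimal Python sketch using the standard library's `statistics.NormalDist`; the variable names are illustrative assumptions. Because it does not round z to two decimals, the tail probability comes out slightly different from the table-based 4.75%.

```python
from statistics import NormalDist

# Monthly commuting cost, assumed normal with mean $25 and sd $6
costs = NormalDist(mu=25, sigma=6)

# Part (a): share spending more than $35 per month
share_above_35 = 1 - costs.cdf(35)

# Part (b): 75th percentile of monthly commuting cost
percentile_75 = costs.inv_cdf(0.75)

print(f"P(x > $35) = {share_above_35:.4f}")
print(f"75th percentile = ${percentile_75:.2f}")
```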
[3]
(c)
In parts (a) and (b) I have asked you to assume that monthly commuting costs are
distributed normally. Is that a reasonable assumption? Please give explicit
reasons as to why it is or is not a reasonable assumption in this case.
Probably not very reasonable. People’s commuting costs depend strongly on the mode of
transportation they use, so costs for all faculty and staff probably cluster around 4 or 5 values
corresponding to the different modes people use:
• People who live very close and walk. Their cost is zero, so there is no variation among
individuals in this group.
• People who bicycle. Their only costs are monthly depreciation on the bicycle and
maintenance (one could also add hospital bills for Boston bikers...). There is a small
amount of variation in this group.
• People who take a bus or the subway. Their daily cost is fixed, so there wouldn’t be
much variation among individuals.
• Drivers. Their costs are likely to be more spread out and probably skewed to the
right as well (a few people commute from very far away).
The distribution of commuting costs for all faculty and staff would be a combination of each of
these distributions. It might look something like this:
In any case, it wouldn’t be normal (Gaussian).
Actually, the mean and standard deviation reported above are sample statistics calculated
not from a survey to which all faculty and staff responded but from a simple random
sample of 160 members of the staff and faculty. All 160 responded to the survey, but the
question about monthly commuting cost was answered by only 144 members of the staff
and faculty, i.e. sixteen respondents left it blank.
[2]
(d)
How would you handle the sixteen non-responses in making an estimate of the
mean monthly commuting costs for all members of the staff and faculty of MIT?
Be explicit as to what you would do and why.
You might want to follow up on those 16 non-respondents, but it seems likely that that won’t get
you very far (the 16 non-respondents may simply not want to divulge their commuting costs; they
did fill out the rest of the survey).
So you want to see if you are justified basing your results on a sample of n=144 rather than
n=160. The loss of 16 data points isn’t a big problem in itself. The problem is that those 16
might have some systematic bias—for example, if they were all people who don’t spend any
money on commuting (perhaps they didn’t know you were supposed to write “$0”), the estimate
of commuting costs based on the other 144 surveys will be inflated. You might check answers to
the other questions in the survey (especially place of residence, since that is an important
predictor of commuting costs) to see if you could find any pattern in the 16 non-respondents
(e.g., they all live in Cambridge). If you don’t find any pattern, you might just use the sample of
size 144. If you aren’t comfortable with that, you might use some other variable(s) from the
survey to predict the commuting costs for the 16 non-respondents. For example, you might use
the 144 “good surveys” to regress commuting costs on distance from MIT, which would give you
an equation of the form
COST= b0 + b1 • MILES.
You could then estimate the missing values.
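The regression approach can be sketched as follows. This is a minimal Python illustration only: the mileage and cost figures are hypothetical (they are not survey data), and the helper name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# HYPOTHETICAL data standing in for the 144 complete surveys
miles = [1, 3, 5, 8, 12]
cost = [10, 16, 22, 31, 43]

b0, b1 = fit_line(miles, cost)

# HYPOTHETICAL distances for respondents who left the cost question blank
missing_miles = [2, 10]
imputed_costs = [b0 + b1 * m for m in missing_miles]
print(b0, b1, imputed_costs)
```

The imputed values would then be combined with the 144 reported costs before computing the mean.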
[4]
(e)
Using these results and your knowledge of how to quantify the chance error that
comes with sampling, estimate the mean monthly commuting costs for all
members of the staff and faculty of MIT.
Assume you can use the 144 surveys as-is (i.e., the non-responses seem random). We want to
generate a confidence interval around our point estimate for the mean commuting cost (which is
$25). Since n>30, we can use the z-table instead of the t-table, even though we only have an
estimate for σ (s = $6). Let’s use a 95% confidence level (z = 1.96).
The confidence interval is:
\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma_x}{\sqrt{n}} = \$25 \pm 1.96 \cdot \frac{\$6}{\sqrt{144}} = \$25 \pm \$0.98 = [\$24.02, \$25.98] \]
We estimate that the mean monthly commuting cost for MIT faculty and staff is between $24 and
$26 per month.
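To try other confidence levels without a z-table, the z value can be derived from the level itself. This is a minimal Python sketch; the function name `mean_interval` and its `confidence` parameter are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def mean_interval(mean, sd, n, confidence=0.95):
    """z-based confidence interval for a mean, with z derived from the level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width

# 144 usable responses, sample mean $25, sample sd $6
low, high = mean_interval(25, 6, 144)
print(f"[${low:.2f}, ${high:.2f}]")
```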
[3]
(f)
Write a clear sentence describing your result from part (e) above that can be
included in the final report of the Planning Office. Make sure you write this
sentence so that it is accurate but can also be understood by a layperson (i.e., a
non-statistician).
The “loose” way to say it is: “We’re 95% certain that the mean commuting cost for MIT faculty
and staff is between $24 and $26.”
More rigorously, “We estimate that the mean commuting cost for MIT faculty and staff is
between $24 and $26; and while the process we used to generate this estimate isn’t perfect, it
will give an interval that includes the actual mean 95% of the time.” Or you might say, "We are
95% certain that the sample mean is within ±$0.98 of the true population mean.”
Question 6
The October 8, 1995 Washington Post included an article by Malcolm Gladwell,
“Personal Experience, The Primary Gauge,” in which he discussed misperceptions among
various groups in the American population concerning the relative size of other groups,
particularly racial groups, in the population. The article included the following
paragraph:
Consider white flight from America’s cities. In their 1993 book, American
Apartheid, sociologists Nancy Denton and Douglas Massey argue that
white flight is the result of an extraordinary sensitivity on the part of
whites to the proximity, or rather the potential proximity, of blacks.
According to their analysis, a neighborhood that was 95 percent or more
white in 1970 and situated within 10 to 25 miles of a predominantly black
neighborhood had a 36 percent chance of losing white population over the
following decade. If the same neighborhood lay within 5 to 10 miles of a
black area, the probability rose to 61 percent. And once a black area came
within 5 miles of a white neighborhood, the chances that whites would
start to flee rose to 85 percent.
Thus, considering only neighborhoods that were (i) 95 percent or more white in 1970 and
(ii) within 25 miles of a predominantly black neighborhood, Denton and Massey looked
at whether or not the white population of these neighborhoods had declined after 1970 as
a function of their distance from neighborhoods that were predominantly black.
In answering the following questions, assume that 40 percent of the neighborhoods that
were 95 percent or more white in 1970 were “far” (10-25 miles) from neighborhoods that
were predominantly black; that another 40 percent of these neighborhoods were a
“medium” distance (5-10 miles) from a predominantly black neighborhood; and that 20
percent of these neighborhoods were “near” (less than 5 miles away from) a
predominantly black neighborhood. Also, assume that there were no such neighborhoods
more than 25 miles away from a predominantly black neighborhood. (This last
assumption is obviously not the case.)
[6]
(a)
Draw a probability tree for this problem. Clearly label all the nodes, branches,
and outcomes and indicate which probabilities belong at which locations on the
tree.
Far (f), P(f) = 0.4
    Decline (d):      P(d|f) = 0.36    P(d&f) = 0.144
    Not Decline (nd): P(nd|f) = 0.64   P(nd&f) = 0.256
Medium (m), P(m) = 0.4
    Decline (d):      P(d|m) = 0.61    P(d&m) = 0.244
    Not Decline (nd): P(nd|m) = 0.39   P(nd&m) = 0.156
Near (n), P(n) = 0.2
    Decline (d):      P(d|n) = 0.85    P(d&n) = 0.170
    Not Decline (nd): P(nd|n) = 0.15   P(nd&n) = 0.030
[4]
(b)
Calculate the probability that a neighborhood in which the white
population did not decline was actually near a predominantly black neighborhood.
\[ P(\text{near} \mid \text{no decline}) = \frac{P(\text{near and no decline})}{P(\text{far and no decline}) + P(\text{medium and no decline}) + P(\text{near and no decline})} \]
\[ = \frac{.030}{.256 + .156 + .030} = .068 = 6.8\% \]
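The tree and the Bayes calculation can be reproduced from the two sets of given probabilities. This is a minimal Python sketch; the dictionary names are illustrative assumptions.

```python
# Share of neighborhoods at each distance (assumed in the question)
prior = {"far": 0.4, "medium": 0.4, "near": 0.2}
# P(decline | distance), from the Gladwell excerpt
p_decline_given = {"far": 0.36, "medium": 0.61, "near": 0.85}

# Joint probabilities at the ends of the "not decline" branches
joint_no_decline = {d: prior[d] * (1 - p_decline_given[d]) for d in prior}

# Bayes' rule: P(near | no decline) = P(near & no decline) / P(no decline)
p_no_decline = sum(joint_no_decline.values())
p_near_given_no_decline = joint_no_decline["near"] / p_no_decline

print(f"P(no decline) = {p_no_decline:.3f}")
print(f"P(near | no decline) = {p_near_given_no_decline:.1%}")
```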
[6]
(c)
If whether or not the white population of a neighborhood declined were
independent of that neighborhood’s distance from a predominantly black
neighborhood, what would the probability of decline be for each type of white
neighborhood (e.g. for white neighborhoods that were far from, a medium
distance from, or near predominantly black neighborhoods)?
If they are independent:
\[ P(\text{decline} \mid \text{far}) = P(\text{decline} \mid \text{medium}) = P(\text{decline} \mid \text{near}) = P(\text{decline}) = .144 + .244 + .170 = .558 \]
Question 7
In 1990 the City of Cambridge commissioned a study on the level of rents paid by
households living and renting housing in Cambridge. The study was based on a census of
all Cambridge households that were renting their housing. The mean rent for households
living and renting in Cambridge was calculated as $542.
[4]
(a)
As part of the same study it was reported that 7.3% of households living and
renting in Cambridge were paying rents less than $300 per month. Assuming that
rent was normally distributed in Cambridge in 1990, calculate the standard
deviation of rent paid by Cambridge households who were renting their housing.
Under the assumption that this distribution is normal, we can use the table of normal
probabilities to calculate the value of z that corresponds to $300 per month. Then, using that
value of z plus the population mean of $542, we can calculate the standard deviation of
household rents implied by these data.
The value of z that leaves 7.3% of the total probability in the left hand tail of the normal
distribution is, roughly, z = -1.45. Therefore,
\[ z = \frac{x - \mu}{\sigma} \]
\[ -1.45 = \frac{\$300 - \$542}{\sigma} \]
\[ -1.45\,\sigma = -\$242 \]
\[ \sigma \approx \$167 \]
[4]
(b)
It was also reported that 12.3% of households living and renting in Cambridge
paid more than $900 per month in rent. Again, assume that rent was normally
distributed in Cambridge in 1990 and calculate the standard deviation of rent paid
by Cambridge households who were renting their housing.
Once again, under the assumption that this distribution is normal, we can use the table of
normal probabilities to calculate the value of z that corresponds to $900 per month. Then, using
that value of z plus the population mean of $542, we can calculate the standard deviation of
household rents implied by these data.
The value of z that leaves 12.3% of the total probability in the right hand tail of the normal
distribution is z = +1.16. Therefore,
\[ z = \frac{x - \mu}{\sigma} \]
\[ 1.16 = \frac{\$900 - \$542}{\sigma} \]
\[ 1.16\,\sigma = \$358 \]
\[ \sigma = \$309 \]
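Both implied standard deviations can be computed the same way by inverting the normal CDF instead of reading the z-table. This is a minimal Python sketch; the function name `implied_sigma` is an illustrative assumption, and because it uses exact quantiles rather than table-rounded z values, the results can differ slightly from the hand calculations.

```python
from statistics import NormalDist

mean_rent = 542
std_normal = NormalDist()  # standard normal: mean 0, sd 1

def implied_sigma(x, cumulative_prob):
    """Standard deviation implied by P(rent < x) = cumulative_prob."""
    z = std_normal.inv_cdf(cumulative_prob)
    return (x - mean_rent) / z

# Part (a): 7.3% of households pay less than $300
sigma_from_low_tail = implied_sigma(300, 0.073)
# Part (b): 12.3% pay more than $900, so 87.7% pay less
sigma_from_high_tail = implied_sigma(900, 1 - 0.123)

print(f"{sigma_from_low_tail:.1f} {sigma_from_high_tail:.1f}")
```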
[6]
(c)
Compare your answers to parts (a) and (b) and draw whatever conclusions are
justified from the results of your calculations and this comparison. Be as explicit
as possible.
If both of these probabilities were derived from the same normal distribution, then the standard
deviations implied by both of these probabilities would have been roughly equal. The second is
much larger than the first. Because the standard deviation is one of the parameters that defines
a particular normal distribution, this variable cannot be distributed normally. (It is, in fact,
positively skewed because there is more probability in the right hand tail.)
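[Figure: sketch of the rent distribution, marking 7.3% of households below $300 (z = -1.45), the mean at $542, and 12.3% of households above $900 (z = 1.16).]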