Department of Urban Studies and Planning
Massachusetts Institute of Technology
11.220 Quantitative Reasoning and Statistical Methods for Planning I
Spring 1998

Homework Set #4 Solutions [Total = 88 points]
Probability, Probability Distributions, and Statistical Estimation

Question 1

This question is in honor of my son the golfer (who has not yet had a hole-in-one).

[4] Do parts (a) and (b) of Case Study 5, page 333 of Weiss. The easiest way to do the calculations is probably to go into Excel and to use the appropriate formula there.

Here we’re dealing with a binomial problem, with

\[ p(\text{success}) = \frac{1}{3709}, \qquad n = 155 \]

The probability we’re trying to find is P(x ≥ 4), which equals 1 - P(x < 4) = 1 - P(0) - P(1) - P(2) - P(3), or

\[ 1 - \sum_{x=0}^{3} P(x) \]

Using the binomial formula and our values for p(success) and n, this becomes

\[ 1 - \sum_{x=0}^{3} \binom{155}{x} \left(\frac{1}{3709}\right)^{x} \left(\frac{3708}{3709}\right)^{155-x} \]

To calculate this, we plug in the appropriate values of x and evaluate the resulting expressions. For example, for the first step (x = 0) we get

\[ \binom{155}{0} \left(\frac{1}{3709}\right)^{0} \left(\frac{3708}{3709}\right)^{155-0} = \frac{155!}{0!\,(155-0)!} \cdot 1 \cdot 0.9591 = 1 \cdot 1 \cdot 0.9591 = 0.9591 \]

The entire expression works out to be

1 - (0.9591 + 0.0401 + 0.0008 + 0.0000) = 0.0000 (to four decimal places)

At this level of precision, the probability is zero.

In Excel, we can calculate the answer with the BINOMDIST function. Its format is

=BINOMDIST(x,n,p,0)

where n, x, and p are the parameters we’re familiar with (see footnote 1). So the monster equation above becomes:

=1-(BINOMDIST(0,155,1/3709,0)+ BINOMDIST(1,155,1/3709,0)+ BINOMDIST(2,155,1/3709,0)+ BINOMDIST(3,155,1/3709,0))

for which Excel returns 1.1831E-07, which is its way of saying 1.1831 × 10^-7, or 0.00000011831.

b) The assumptions we made were:

There are two possible outcomes for each trial: A golfer either makes a hole-in-one or she doesn’t. This is a very reasonable assumption.
The trials are independent: One golfer’s performance does not affect another golfer’s performance. You could argue this one, but it seems reasonable.

The probability of a hole-in-one remains 1/3709 from trial to trial: Each golfer has the same chance of making a hole-in-one. This is less reasonable (some golfers are better than others), but it might be approximately true, at least on average.

Footnote 1: The 0 just before the right parenthesis tells Excel that we want to know the probability that we get exactly X successes. If you replace the 0 with a 1, Excel will tell you the probability that we get X or fewer successes. So there’s a shorter way to solve the problem than the way I do it above. You could simply type:

=1-BINOMDIST(3,155,1/3709,1)

and Excel would return 1.1831E-07. The only reason I did it the long way above was to be consistent with the formula as you learned it in Weiss.

Question 2

This question has three parts.

[6] (a) Do Exercise 5.50 on page 310 of Weiss.

Here p = 0.25 and n = 10.

a) Looking at Table 1 for binomial probabilities for x = 2, we have a probability of 0.282 that exactly 2 children are not living with their parents.

b) Looking at Table 1 for binomial probabilities for x = 2, x = 1, and x = 0, we have a total probability of 0.282 + 0.188 + 0.056 = 0.526 that at most 2 children are not living with their parents.

c) P(between 3 and 6, inclusive, are not living with their parents) = 1 - P(at most 2 children are not living with their parents) - P(7 or more children are not living with their parents) = 1 - 0.526 - 0.003 = 0.471

d) P(either fewer than 3 or more than 7) = P(at most 2 children are not living with their parents) + P(more than 7 are not living with their parents) = 0.526 + 0.000 = 0.526

[2] (b) Do Exercise 5.74 on page 317 of Weiss.
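The table lookups for Exercise 5.50 in part (a) can be cross-checked numerically. The following is a minimal sketch, assuming Python 3.8 or later; the helper `binom_pmf` and the variable names are illustrative and are not part of the original solution. It evaluates the binomial formula directly for n = 10 and p = 0.25, so a result can differ in the last decimal place from a sum of rounded table entries.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.25  # values given in Exercise 5.50

# a) exactly 2 children not living with their parents
exactly_2 = binom_pmf(2, n, p)

# b) at most 2
at_most_2 = sum(binom_pmf(k, n, p) for k in range(0, 3))

# c) between 3 and 6, inclusive
between_3_and_6 = sum(binom_pmf(k, n, p) for k in range(3, 7))

# d) fewer than 3 or more than 7
more_than_7 = sum(binom_pmf(k, n, p) for k in range(8, n + 1))
fewer_than_3_or_more_than_7 = at_most_2 + more_than_7

for label, value in [
    ("P(X = 2)", exactly_2),
    ("P(X <= 2)", at_most_2),
    ("P(3 <= X <= 6)", between_3_and_6),
    ("P(X < 3 or X > 7)", fewer_than_3_or_more_than_7),
]:
    print(f"{label:<20} {value:.3f}")
```

Summing exact terms rather than rounded table entries is a design choice here: it avoids accumulating rounding error when several table values are added.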
Mean = np = 10 * 0.25 = 2.5

\[ \text{Standard Deviation} = \sqrt{np(1-p)} = \sqrt{10 \times 0.25 \times 0.75} = 1.37 \]

[2] (c) As an extension of Exercise 5.74, determine the mean and standard deviation of the percentage of children in a sample of 10 that are not living with both parents.

Mean = p = 0.25

\[ \text{Standard Deviation} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.25 \times 0.75}{10}} = 0.137 \]

Question 3

The Mayor’s Planning Office is interested in getting an estimate of the mean annual income of employed persons residing in the city. To estimate this number the MPO has taken two samples: the first was a simple random sample of 400 local households and the second was a simple random sample of 100 local employers. Using data as reported by household members, the household survey yielded a mean sample income of $15,000 per household with a standard deviation of $2,000 per household.

[6] (a) Using your knowledge of how to quantify chance error, estimate the mean household income for all households in the city.

\[ \bar{x} = \$15{,}000 \text{ per hh}, \qquad s = \$2{,}000 \text{ per hh}, \qquad n = 400 \]

90% Confidence Interval:
\[ 15000 \pm 1.645 \times \frac{2000}{\sqrt{400}} = [14835.5,\ 15164.5] \]

95% Confidence Interval:
\[ 15000 \pm 1.96 \times \frac{2000}{\sqrt{400}} = [14804,\ 15196] \]

99% Confidence Interval:
\[ 15000 \pm 2.575 \times \frac{2000}{\sqrt{400}} = [14742.5,\ 15257.5] \]

[2] (b) The Mayor, in a hurry to present some statistics in a speech she is scheduled to give, uses the mean of $15,000 as her estimate of the mean annual income of employed persons residing in the city. What is the most important reason why these household data (even if the sampling was conducted properly and the data were accurately reported and properly collected) might yield a biased estimate of the mean annual income of employed persons residing in the city? Explain briefly, indicating the direction (high or low) of the bias.

Households might include more than one worker. Thus, mean household income would likely be greater than the mean income of employed persons.
(On the other hand, households would also include unemployed persons, so this would partly offset the over-estimation due to the number of workers.)

In the employer survey each employer was asked to calculate a mean income for his or her employees. Using these figures, the analysts then calculated the mean of these 100 employer-provided numbers. This resulted in an estimated mean income of $12,500 per employee.

[6] (c) Trying to correct the incorrect impressions left by the Mayor’s speech (mentioned in part (b) above), the staff of the Mayor’s Planning Office decides to use the mean that was calculated from the survey of employers as its estimate of the mean annual income of employed persons residing in the city. List the three most important reasons why this procedure might yield a biased estimate of the mean annual income of employed persons residing in the city (even if the sampling was conducted properly, the data were accurately reported by the employers, and the information was properly recorded). What is the direction of the likely bias for each of these reasons?

1. Taking the mean of each firm’s mean salary weights all firms equally, which is wrong if we want to find the mean income of persons: Not all firms have the same number of employees. The direction of bias this problem will cause depends on what you believe about the difference between large and small firms. For example, if you think that larger firms tend to have a lot of low-income employees, then the estimated mean will be too high.

2. If an employee has two jobs, she will be counted twice (in other words, her income will be split in two). These cases will introduce a downward bias: The mean will be too low.

3. Employee data will include some people who live outside the city (while we’re trying to find the mean income for people who reside in the city).
Again, the direction of bias depends on our judgement about the relative incomes of residents vs. non-residents. If we think non-residents tend to have higher salaries than residents, then the employer survey will give an estimate that is too high.

4. An employer survey will probably miss people who are self-employed. If self-employed people tend to have higher incomes, then our estimate will be too low.

Question 4

The article below appeared in the Boston Globe just after the 1983 mayoral election in Boston in which Ray Flynn defeated Mel King (former MIT faculty member and director of the Community Fellows Program). It discusses the discrepancies between the polls that were conducted by each of the three major television stations the day of the election.

[2] (a) In the third paragraph, Stan Hopkins of WBZ says there were indications that voters lied in the exit polls. Why would they lie?

Voters could have lied because they didn’t want to be identified in public as having voted for Flynn. Some say that race was a big issue in that campaign, and some voters might have thought that people would think they were racist if they admitted that they voted for Flynn. (Mel King was African American and Ray Flynn was Irish Catholic from South Boston.)

[4] (b) Decision Research conducted 3600 interviews for Channel 4. Assume that these results were calculated from a simple random sample of people who had voted. 56% said they had voted for Flynn. Calculate a 90% confidence interval for the proportion of voters in the population who would have said they had voted for Flynn.

For a 90% confidence interval, the proportion of voters in the population who would have said they voted for Flynn was:

\[ \hat{p} \pm z\,\sigma_{\hat{p}} = \hat{p} \pm (1.64)\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.56 \pm (1.64)\sqrt{\frac{0.56 \times 0.44}{3600}} = 0.56 \pm (1.64)(0.008) = 0.56 \pm 0.014 \]

which gives the interval [54.6%, 57.4%].

[2] (c) What is the importance of the phrase in italics in part (b) above?
(HINT: Why doesn't it simply say who voted for Flynn?)

The polls report how people said they voted, not how they actually voted. As in all opinion polling, there can be an important difference between someone saying they did (or would do) something and actually doing it.

[4] (d) The first paragraph makes the point that Channel 7’s poll was more accurate because it used larger samples than the other Channels’ polls. Is this explanation a sufficient explanation for the discrepancies among the three polls? If so, explain why. If not, explain why not.

One way to find out the extent to which sample size determined the accuracy of the estimation is to calculate the size of the standard error in each poll.

Channel    Sample Size    Estimate
7          8,450          65% - 35%
5          2,000          56% - 44% (at least)
4          3,600          56% - 44%

So the standard errors are:

\[ \text{Channel 7:} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.65 \times 0.35}{8450}} = 0.005 \]
\[ \text{Channel 5:} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.56 \times 0.44}{2000}} = 0.011 \]
\[ \text{Channel 4:} \quad \sigma_{\hat{p}} = \sqrt{\frac{0.56 \times 0.44}{3600}} = 0.008 \]

None is large enough to explain the nine percentage point difference between Channel 7’s and the other two stations’ estimates. Even for a 99% confidence interval, z = ±2.58, which would give (for example) less than a ±3% interval around Channel 5’s estimate. That is hardly enough to explain the 9 percentage point difference between Channel 7’s and Channel 5’s estimates. This suggests that the more important differences may lie in the way the samples were taken.

Question 5

MIT is in the process of revising its parking policies for faculty and staff. As input to that process, the MIT Planning Office sent a survey to all faculty and staff asking them a variety of questions concerning their commuting patterns, distances, and costs. One question asked respondents to estimate their monthly commuting costs.
For those who commuted by subway or bus, the survey asked each respondent to calculate his or her typical cost per month. For those who commuted by car, the survey asked for the number of miles they commuted per month, multiplied that number of miles by an assumed cost of 28¢ per mile (a figure meant to include the cost of depreciation, insurance, gas, and oil) and added any expenditures for tolls and parking. The mean monthly commuting cost, calculated from the respondents’ answers to this question, was $25 with a standard deviation of $6.

Assume, for the moment, that all members of the faculty and staff responded to the survey and to this question. Assume also that monthly commuting costs are distributed normally.

[3] (a) What percentage of the MIT faculty and staff spends more than $35 per month in commuting costs?

We know that µ = $25 and σ = $6. We want to find P(x > $35). The z-score for x = $35 is ($35 - $25)/$6 = 1.67, so

P(x > $35) = 1 - P(x < $35) = 1 - P(z < 1.67) = 1 - 0.9525 = 0.0475, or 4.75%

[3] (b) What is the 75th percentile of commuting costs for MIT faculty and staff?

We want to find a such that P(z < a) = 0.75. From Table II, we find that P(z < 0.67) = 0.75. So we need to find the x corresponding to z = 0.67 (in other words, we need to de-standardize 0.67).

x = z • σ + µ = 0.67 • $6 + $25 = $29

75% of MIT faculty and staff spend less than $29 on commuting.

[3] (c) In parts (a) and (b) I have asked you to assume that monthly commuting costs are distributed normally. Is that a reasonable assumption? Please give explicit reasons as to why it is or is not a reasonable assumption in this case.

Probably not very reasonable. People’s commuting costs depend strongly on the mode of transportation they use, so costs for all faculty and staff probably cluster around 4 or 5 values corresponding to the different modes people use:

People who live very close and walk. Their cost is zero, so there is no variation among individuals in this group.
People who bicycle. Their only costs are monthly depreciation on the bicycle and maintenance (one could also add hospital bills for Boston bikers...). There is a small amount of variation in this group.

People who take a bus or the subway. Their daily cost is fixed, so there wouldn’t be much variation among individuals.

Drivers. Their costs are likely to be more spread out and probably skewed to the right as well (a few people commute from very far away).

The distribution of commuting costs for all faculty and staff would be a combination of each of these distributions. It might look something like this:

[Figure: sketch of the combined distribution of commuting costs]

In any case, it wouldn’t be normal (Gaussian).

Actually, the mean and standard deviation reported above are sample statistics calculated not from a survey to which all faculty and staff responded but from a simple random sample of 160 members of the staff and faculty. All 160 responded to the survey, but the question about monthly commuting cost was answered by only 144 members of the staff and faculty, i.e. sixteen respondents left it blank.

[2] (d) How would you handle the sixteen non-responses in making an estimate of the mean monthly commuting costs for all members of the staff and faculty of MIT? Be explicit as to what you would do and why.

You might want to follow up on those 16 non-respondents, but it seems likely that that won’t get you very far (the 16 non-respondents may simply not want to divulge their commuting costs; they did fill out the rest of the survey). So you want to see if you are justified basing your results on a sample of n = 144 rather than n = 160. The loss of 16 data points isn’t a big problem in itself.
The problem is that those 16 might have some systematic bias. For example, if they were all people who don’t spend any money on commuting (perhaps they didn’t know you were supposed to write “$0”), the estimate of commuting costs based on the other 144 surveys will be inflated. You might check answers to the other questions in the survey (especially place of residence, since that is an important predictor of commuting costs) to see if you could find any pattern in the 16 non-respondents (e.g., they all live in Cambridge). If you don’t find any pattern, you might just use the sample of size 144. If you aren’t comfortable with that, you might use some other variable(s) from the survey to predict the commuting costs for the 16 non-respondents. For example, you might use the 144 “good surveys” to regress commuting costs on distance from MIT, which would give you an equation of the form COST = b0 + b1 • MILES. You could then estimate the missing values.

[4] (e) Using these results and your knowledge of how to quantify the chance error that comes with sampling, estimate the mean monthly commuting costs for all members of the staff and faculty of MIT.

Assume you can use the 144 surveys as-is (i.e., the non-responses seem random). We want to generate a confidence interval around our point estimate for the mean commuting cost (which is $25). Since n > 30, we can use the z-table instead of the t-table, even though we only have an estimate for σ (s = $6). Let’s use a 95% confidence level (z = 1.96). The confidence interval is:

\[ \bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} = \$25 \pm 1.96 \times \frac{\$6}{\sqrt{144}} = \$25 \pm \$0.98 = [\$24.02,\ \$25.98] \]

We estimate that the mean monthly commuting cost for MIT faculty and staff is between $24 and $26 per month.

[3] (f) Write a clear sentence describing your result from part (e) above that can be included in the final report of the Planning Office.
Make sure you write this sentence so that it is accurate but can also be understood by a layperson (i.e., a non-statistician).

The “loose” way to say it is: “We’re 95% certain that the mean commuting cost for MIT faculty and staff is between $24 and $26.”

More rigorously, “We estimate that the mean commuting cost for MIT faculty and staff is between $24 and $26; and while the process we used to generate this estimate isn’t perfect, it will give an interval that includes the actual mean 95% of the time.”

Or you might say, “We are 95% certain that the sample mean is within ±$0.98 of the true population mean.”

Question 6

The October 8, 1995 Washington Post included an article by Malcolm Gladwell, “Personal Experience, The Primary Gauge,” in which he discussed misperceptions among various groups in the American population concerning the relative size of other groups, particularly racial groups, in the population. The article included the following paragraph:

Consider white flight from America’s cities. In their 1993 book, American Apartheid, sociologists Nancy Denton and Douglas Massey argue that white flight is the result of an extraordinary sensitivity on the part of whites to the proximity, or rather the potential proximity, of blacks. According to their analysis, a neighborhood that was 95 percent or more white in 1970 and situated within 10 to 25 miles of a predominantly black neighborhood had a 36 percent chance of losing white population over the following decade. If the same neighborhood lay within 5 to 10 miles of a black area, the probability rose to 61 percent. And once a black area came within 5 miles of a white neighborhood, the chances that whites would start to flee rose to 85 percent.
Thus, considering only neighborhoods that were (i) 95 percent or more white in 1970 and (ii) within 25 miles of a predominantly black neighborhood, Denton and Massey looked at whether or not the white population of these neighborhoods had declined after 1970 as a function of their distance from neighborhoods that were predominantly black.

In answering the following questions, assume that 40 percent of the neighborhoods that were 95 percent or more white in 1970 were “far” (10-25 miles) from neighborhoods that were predominantly black; that another 40 percent of these neighborhoods were a “medium” distance (5-10 miles) from a predominantly black neighborhood; and that 20 percent of these neighborhoods were “near” (less than 5 miles away from) a predominantly black neighborhood. Also, assume that there were no such neighborhoods more than 25 miles away from a predominantly black neighborhood. (This last assumption is obviously not the case.)

[6] (a) Draw a probability tree for this problem. Clearly label all the nodes, branches, and outcomes and indicate which probabilities belong at which locations on the tree.

Far (f), P(f) = 0.4
    Decline (d):      P(d|f) = 0.36     P(d&f) = 0.144
    Not Decline (nd): P(nd|f) = 0.64    P(nd&f) = 0.256
Medium (m), P(m) = 0.4
    Decline (d):      P(d|m) = 0.61     P(d&m) = 0.244
    Not Decline (nd): P(nd|m) = 0.39    P(nd&m) = 0.156
Near (n), P(n) = 0.2
    Decline (d):      P(d|n) = 0.85     P(d&n) = 0.170
    Not Decline (nd): P(nd|n) = 0.15    P(nd&n) = 0.030

[4] (b) Calculate the probability that a neighborhood in which the white population did not decline was actually near a predominantly black neighborhood.
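The branch probabilities on the tree in part (a) are products of a prior and a conditional probability, and part (b) asks for a reversal of the conditioning (Bayes' rule). The following is a minimal sketch of both steps, assuming Python 3; the dictionary and variable names are illustrative and not part of the original solution.

```python
# Share of 95%-white neighborhoods at each distance (assumed in the problem)
prior = {"far": 0.40, "medium": 0.40, "near": 0.20}

# P(decline | distance), taken from the quoted article
p_decline_given = {"far": 0.36, "medium": 0.61, "near": 0.85}

# Joint probabilities at the leaves of the tree
joint_decline = {d: prior[d] * p_decline_given[d] for d in prior}
joint_no_decline = {d: prior[d] * (1 - p_decline_given[d]) for d in prior}

# Marginal probabilities of each outcome
p_decline = sum(joint_decline.values())
p_no_decline = sum(joint_no_decline.values())

# Bayes' rule: P(near | no decline) = P(near and no decline) / P(no decline)
p_near_given_no_decline = joint_no_decline["near"] / p_no_decline

for d in prior:
    print(f"{d:<7} P(d&{d[0]}) = {joint_decline[d]:.3f}   "
          f"P(nd&{d[0]}) = {joint_no_decline[d]:.3f}")
print(f"P(decline) = {p_decline:.3f}")
print(f"P(near | no decline) = {p_near_given_no_decline:.3f}")
```

Keeping the priors and conditionals in dictionaries keyed by distance makes it easy to change the assumed 40/40/20 split and see how the answers move.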
\[ P(\text{near} \mid \text{white population did not decline}) = \frac{P(\text{near and white population did not decline})}{P(\text{far and did not decline}) + P(\text{medium and did not decline}) + P(\text{near and did not decline})} = \frac{0.030}{0.256 + 0.156 + 0.030} = 0.068 = 6.8\% \]

[6] (c) If whether or not the white population of a neighborhood declined were independent of that neighborhood’s distance from a predominantly black neighborhood, what would the probability of decline be for each type of white neighborhood (e.g. for white neighborhoods that were far from, a medium distance from, or near predominantly black neighborhoods)?

If they are independent:

\[ P(\text{decline} \mid \text{far}) = P(\text{decline} \mid \text{medium}) = P(\text{decline} \mid \text{near}) = P(\text{decline}) = 0.144 + 0.244 + 0.170 = 0.558 \]

Question 7

In 1990 the City of Cambridge commissioned a study on the level of rents paid by households living and renting housing in Cambridge. The study was based on a census of all Cambridge households that were renting their housing. The mean rent for households living and renting in Cambridge was calculated as $542.

[4] (a) As part of the same study it was reported that 7.3% of households living and renting in Cambridge were paying rents less than $300 per month. Assuming that rent was normally distributed in Cambridge in 1990, calculate the standard deviation of rent paid by Cambridge households who were renting their housing.

Under the assumption that this distribution is normal, we can use the table of normal probabilities to calculate the value of z that corresponds to $300 per month. Then, using that value of z plus the population mean of $542, we can calculate the standard deviation of household rents implied by these data. The value of z that leaves 7.3% of the total probability in the left hand tail of the normal distribution is, roughly, z = -1.45.
Therefore,

\[ z = \frac{x - \mu}{\sigma} \;\Rightarrow\; -1.45 = \frac{\$300 - \$542}{\sigma} \;\Rightarrow\; \sigma = \frac{-\$242}{-1.45} \approx \$167 \]

[4] (b) It was also reported that 12.3% of households living and renting in Cambridge paid more than $900 per month in rent. Again, assume that rent was normally distributed in Cambridge in 1990 and calculate the standard deviation of rent paid by Cambridge households who were renting their housing.

Once again, under the assumption that this distribution is normal, we can use the table of normal probabilities to calculate the value of z that corresponds to $900 per month. Then, using that value of z plus the population mean of $542, we can calculate the standard deviation of household rents implied by these data. The value of z that leaves 12.3% of the total probability in the right hand tail of the normal distribution is z = +1.16. Therefore,

\[ z = \frac{x - \mu}{\sigma} \;\Rightarrow\; 1.16 = \frac{\$900 - \$542}{\sigma} \;\Rightarrow\; \sigma = \frac{\$358}{1.16} = \$309 \]

[6] (c) Compare your answers to parts (a) and (b) and draw whatever conclusions are justified from the results of your calculations and this comparison. Be as explicit as possible.

If both of these probabilities were derived from the same normal distribution, then the standard deviations implied by both of these probabilities would have been roughly equal. They are not: the second is much larger than the first. Because the standard deviation is one of the parameters that defines a particular normal distribution, this variable cannot be distributed normally. (It is, in fact, positively skewed because there is more probability in the right hand tail.)

[Figure: sketch of the rent distribution with mean $542, a 7.3% left tail below $300 (z = -1.45), and a 12.3% right tail above $900 (z = +1.16)]
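The comparison in Question 7 can be reproduced without a printed normal table. The following is a minimal sketch, assuming Python 3.8 or later (for `statistics.NormalDist`); the variable names are illustrative and not part of the original solution. It uses exact normal quantiles rather than z-values rounded to two decimals, so the implied standard deviations may differ by a dollar or two from the hand calculation.

```python
from statistics import NormalDist

mean_rent = 542.0            # census mean rent, in dollars
std_normal = NormalDist()    # standard normal: mean 0, sd 1

# z-value with 7.3% of the probability below it (left tail, rent < $300)
z_low = std_normal.inv_cdf(0.073)

# z-value with 12.3% of the probability above it (right tail, rent > $900)
z_high = std_normal.inv_cdf(1 - 0.123)

# Solve z = (x - mu) / sigma for sigma in each case
sigma_from_low = (300.0 - mean_rent) / z_low
sigma_from_high = (900.0 - mean_rent) / z_high

print(f"z for left tail:  {z_low:.2f}, implied sigma: ${sigma_from_low:.0f}")
print(f"z for right tail: {z_high:.2f}, implied sigma: ${sigma_from_high:.0f}")
print(f"ratio of implied sigmas: {sigma_from_high / sigma_from_low:.2f}")
```

If rents really followed a single normal distribution, the ratio printed on the last line would be close to 1; a ratio well above 1 points to a longer right tail.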