Mathematics 243 Problems

1. A model often used to predict the height in inches of an adult son is

    predicted height = (1/2) father + (1/2) mother + 2.5

A different model suggests that the height of the father is more important in predicting the height of sons. That model is

    predicted height = (2/3) father + (1/3) mother + 2.5

Of course neither model is "right." Suppose that you have a lot of data (like Galton). How would you decide which of these two models is "better"? Be as precise as you can.
Solution. A start would be to ask which model makes predictions that are closer to the actual values. A better solution would be to describe some way to combine the prediction errors. (Of course later in the course we will compute the sums of squares of residuals. But they don't know that yet.)

2. We might write Einstein's law as E = mc^2, but we use the equation as if it is an exact equality (unlike the models in the first problem). Explain why you would not expect exact equality in any real application of this law.
Solution. We can't measure any of the three quantities exactly. So measurement error is a source of variation that will prevent exact equality. (Some students might say that this rule itself is not exactly right. Maybe that's true!)

3. Create a vector x in R containing the following numbers: 1, 4, 7, 9, 13, 19, 21, 25. For each of the following R commands, try to predict what the output will be and then write down (using R of course) what the output actually is. In each case where an explanation is asked for, say how R is computing the answer that you get.
(a) x
(b) x+1
(c) sum(x)
(d) x>10
(e) x[x>10]  Explain.
(f) sum(x>10)  Explain.
(g) sum(x[x>10])  Explain.
(h) x[-(1:3)]  Explain.
(i) x^2
Solution.
> x = c(1, 4, 7, 9, 13, 19, 21, 25)
> x
[1]  1  4  7  9 13 19 21 25
> x + 1
[1]  2  5  8 10 14 20 22 26
> sum(x)
[1] 99
> x > 10
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
> x[x > 10]   # the values of x that are greater than 10
[1] 13 19 21 25
> sum(x > 10)   # the sum of trues and falses, evidently true=1
[1] 4
> sum(x[x > 10])   # the sum of the values of x that are greater than 10
[1] 78
> x[-(1:3)]   # all values of x except the first three
[1]  9 13 19 21 25

4. Data on all the counties in the United States are in the counties dataframe in the Stob package. Each row in the dataframe is a county.
(a) How many counties are there?
(b) How many variables are in the dataframe?
(c) The variable Population has the population of each county as of the 2010 census. What was the total population of the United States in 2010?
(d) Why is a histogram of the populations of all the counties not very informative?
(e) Two of the variables are not really necessary – they can be computed from other variables in the dataframe. Name one of these and show how it can be computed easily from other variables.
Solution.
> dim(counties)
[1] 3141    9
> sum(counties$Population)
[1] 281421906
The problem with a histogram is that the data is too skewed (or spread over too many orders of magnitude). The density variables can be computed from the others.

5. The data frame Chile in the car package (be sure to load this package) has data on a survey of voters in Chile conducted in the spring of 1988 before the election that unseated Augusto Pinochet. Use what you know about data frames and the tally() and favstats() functions to answer the following questions.
(a) What is a case and how many cases are there? ANS: an individual voter.
(b) What are the variables and which are quantitative and which categorical?
ANS: region, sex, education and vote are categorical and population, age, income and statusquo are quantitative.
(c) How many of each gender participated in the survey?
> tally(~sex, data = Chile)
   F    M
1379 1321
(d) What was the average income of those participating in the survey? What caveat would you want to mention when reporting this statistic? ANS: 33,876 but there are 98 missing data values.
> favstats(~income, data = Chile)
 min   Q1 median    Q3    max  mean    sd    n missing
2500 7500  15000 35000 200000 33876 39503 2602      98
(e) What percentage of the voters surveyed expected to vote against Pinochet? (Read the help document to determine which variable and value you should be looking at.) ANS: 889/2532 (35.11%).
> tally(~vote, data = Chile)
  A   N   U   Y
187 889 588 868
(f) Voters with incomes greater than 40,000 pesos could be considered rich. What percentage of the rich voters expected to vote against Pinochet? (Hint: first make a dataframe with only rich voters.) ANS: 165/410 (40.2%).
> richguys = subset(Chile, income > 40000)
> tally(~vote, data = richguys)
  A   N   U   Y
 31 165  56 158

6. Suppose that we are studying baseball statistics. We might use two different units of analysis: the player or the team.
(a) Suppose that we are studying baseball with the unit of analysis being the player. Give an example of an appropriate categorical variable for this situation and an appropriate quantitative variable. ANS: a lot of answers are possible - categorical: team, quantitative: number of homeruns.
(b) Do the same in the case that we are studying baseball with the unit of analysis being the team. ANS: categorical: league, quantitative: number of wins.

7. In the parts below, we list some convenience samples of Calvin students. For each of these methods for sampling Calvin students, indicate one way in which the sample is likely not to be representative of the population of all Calvin students.
(a) The students in Mathematics 243A.
(b) The students in Nursing 329.
(c) The first 30 students who walk into the FAC west door after 12:30 PM today.
(d) The first 30 students you meet on the sidewalk outside Hiemenga after 12:30 PM today.
(e) The first 30 students named in the Calvin directory.
(f) The men's basketball team.
ANS: many answers are possible here and they should be quite obvious.

8. Suppose that we were attempting to estimate the average height of a Calvin student. For this purpose, which of the convenience samples in the previous problem would you suppose to be most representative of the Calvin population? Which would you suppose to be least representative? Why?
ANS: Most representative would probably be (c) or (d) (maybe (e)?). Least representative would probably be (b) (nursing students are shorter since they are all female) or (f) (basketball players of course are taller).

9. Consider the set of natural numbers P = {1, 2, . . . , 50} to be a population.
(a) How many prime numbers are there in the population? ANS: 15
(b) If a sample of size 10 is representative of the population, how many prime numbers would we expect to be in the sample? How many even numbers would we expect to be in the sample? ANS: We would expect 15/50 of the numbers to be prime, so 3. We would expect 5 to be even.
(c) Using R choose 5 different random samples of size 10 from the population P. Record how many prime numbers and how many even numbers are in each sample. Make any comments about the results that strike you as relevant.
ANS: Of course a wide range of answers is possible. They might comment on finding a sample that is quite unrepresentative.
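One way a student might carry out part (c) is sketched below. This is just one possible approach (any way of drawing the samples and counting would do), and the list of primes and the loop are illustrative, not part of the assignment.
primes <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)  # the 15 primes up to 50
for (i in 1:5) {
  s <- sample(1:50, 10)    # a simple random sample of size 10, without replacement
  cat(sum(s %in% primes), "primes,", sum(s %% 2 == 0), "even numbers\n")
}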
10. Sometimes we can improve on simple random sampling by incorporating randomness but in a more complicated way. As of Day 10 in the fall, the breakdown of the Calvin student body by level was given by the following table:

Class Level   First-year  Sophomore  Junior  Senior  Other  Total
Population         1,078        975     821   1,054    106  4,034

If we were to choose a simple random sample, we might get too many juniors or not enough first-year students. But since we know the true proportion of each class, we might separately sample from each class to get an overall sample that is representative of the whole. If a sample of size 200 is desired, explain how many of each class we should choose. (Of course within each class, we should choose a simple random sample. This sampling design is called stratified random sampling.)
ANS: A computation is below. Of course they will have to round to get whole numbers of students.
> students = c(1078, 975, 821, 1054, 106)
> total = sum(students)
> desired = 200 * students/total
> desired
[1] 53.446 48.339 40.704 52.256  5.255

11. The dataframe Allegan (in the Stob package) has data on 57 years of weather in Allegan, MI. Each case is a day. The variable TMAX is the maximum temperature recorded on the given day.
(a) What is a 50% coverage interval for the maximum daily temperature?
(b) What is a 95% coverage interval for the maximum daily temperature?
(c) What is an interesting feature of the distribution of this variable? (Hint: draw a densityplot.)
qdata(c(0.25, 0.75), ~TMAX, data = Allegan)   # 50% coverage interval
    quantile    p
25%       39 0.25
75%       77 0.75
qdata(c(0.025, 0.975), ~TMAX, data = Allegan)   # 95% coverage interval
      quantile     p
2.5%        20 0.025
97.5%       90 0.975
densityplot(~TMAX, data = Allegan)   # the bimodal distribution of the data is interesting
[density plot of TMAX]

12. The following histograms and boxplots are of five different datasets. Match the boxplot to the histogram of the same data. Justify your choices with a sentence or two.
[five histograms labeled U, V, W, X, Z and five boxplots labeled A, B, C, D, E]
ANS: X and Z are reasonably symmetric. Looks like X is A and Z is B because of the outlier in B. U and W are skewed right like C and E; W is E and C is U. D and V are skewed left.

13. Sometimes it is useful to change the units of a variable. For instance, we might change from inches to feet or from degrees centigrade to fahrenheit. Obviously, such statistics as the mean, median, variance and standard deviation will change if we do that.
(a) A new variable Y is created from a variable X by multiplying each case of X by a constant c (i.e., Y = cX). How are the mean, median, variance and standard deviation of Y related to those of X? ANS: the mean and median are both changed by the same factor c. The variance increases by a factor of c^2 and the standard deviation by c.
(b) A new variable Y is created from a variable X by adding a constant d to each case of X (i.e., Y = X + d). How are the mean, median, variance and standard deviation of Y related to those of X? ANS: The variance and standard deviation are unchanged (since they measure variation, not location). The mean and median are increased by d.
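A quick numerical check of these rules, not required by the problem, can be done with any made-up vector; the numbers below are arbitrary.
x <- c(1, 4, 7, 9, 13)     # any small set of numbers works for the check
k <- 12                    # e.g., converting feet to inches
mean(k * x) / mean(x)      # equals k
var(k * x) / var(x)        # equals k^2
sd(k * x) / sd(x)          # equals k
d <- 32
mean(x + d) - mean(x)      # equals d
var(x + d) - var(x)        # equals 0; adding a constant does not change the spread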
14. Boxplots are fairly popular. But boxplots cannot show some of the more interesting, and often important, features of a variable. The data frame faithful has data on over 270 eruptions of the Old Faithful geyser in Yellowstone National Park. The variable eruptions gives the length in minutes of each of the eruptions. What interesting feature of the distribution of this variable is shown by a density plot but which cannot be seen in a box plot?
ANS: The boxplot doesn't tell you anything about modes. This distribution is definitely bimodal.
> densityplot(~eruptions, data = faithful)
> bwplot(~eruptions, data = faithful)
[density plot and boxplot of eruptions]

15. The mean of a variable is one choice for a simple model for that variable. But it is not the only one. For example, we might choose the median instead.
(a) Use the mean as a model for the GPA of Calvin seniors (using the data in the sr dataframe). Compute the sum of squares of residuals of this model.
m <- mean(~GPA, data = sr)
resids <- sr$GPA - m
ssr <- sum(resids^2)
ssr
[1] 327
(b) Use instead the median as a model for the GPA of Calvin seniors using the very same data. Compute the sum of squares of residuals for this model.
me <- median(~GPA, data = sr)
residsm <- sr$GPA - me
ssrm <- sum(residsm^2)
ssrm
[1] 336.1
(c) Compare the sums of squares of residuals for these two models. Which is smaller? (It turns out the mean is the value that minimizes the sum of squares of residuals.) ANS: Obviously the sum of squares of residuals is smaller for the mean model than for the median model.

16. The Current Population Survey data (CPS85) has data on both the wage of the respondent (wage) and the employment sector in which the respondent worked (sector). Suppose that we try to explain the variation in wages by using sector as an explanatory variable.
(a) What are the model values (i.e., the ŷi) for each of the eight sectors?
(b) What is the variance of the wage variable?
(c) What is the variance of the residuals of the model?
(d) Which observation has the largest (in absolute value) residual?
(e) Construct a side-by-side boxplot of the wages by sector. Notice that the wage distributions for any two sectors overlap by quite a bit. For which sectors is it true that at least 75% of the respondents in that sector made more money than half of those respondents in the manufacturing sector?
M <- mm(wage ~ sector, data = CPS85)
M   # part a
Groupwise Model
Call: wage ~ sector
Coefficients:
clerical    const    manag    manuf    other     prof    sales  service
    7.42     9.50    12.70     8.04     8.50    11.95     7.59     6.54
var(~wage, data = CPS85)   # part b
[1] 26.41
var(residuals(M))   # part c
[1] 21.59
favstats(residuals(M))
  min     Q1 median    Q3  max     mean    sd   n missing
-11.7 -3.037 -1.037 2.279 31.8 2.46e-16 4.646 534       0
CPS85[residuals(M) > 31.7, ]   # this observation
    wage educ race sex hispanic south married exper union age sector
249 44.5   14    W   F       NH    NS  Single     1   Not  21  manag
bwplot(wage ~ sector, data = CPS85)   # part e (management, construction, professional)
[side-by-side boxplots of wage by sector]

17. This made-up data frame is so small that you can answer all the questions below without the use of a computer (or even a calculator). Do that.
Sex         F  F  F  M  M
Job Rating  5  7  9  4  8
(a) If we use a means model to predict job rating from sex, what are the five values of ŷ for each of the five cases?
(b) What are the five residual values for this model?
(c) How much do we gain by using gender to predict job rating? In other words, quantify the difference between this model and a model that doesn't use any explanatory variable at all.
ANS:
Sex          F   F   F   M   M
Job Rating   5   7   9   4   8
Model        7   7   7   6   6
Residual    -2   0   2  -2   2
For part (c), note that the model that uses no explanatory variable uses the mean 6.6 as the model value, so it has residuals −1.6, .4, 2.4, −2.6, 1.4. The variance of the residuals of that model is (1.6² + .4² + 2.4² + 2.6² + 1.4²)/4 = 4.3. The variance of the residuals of this model is 4. That's a fairly modest improvement.
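Although the problem is meant to be done by hand, the arithmetic can be checked in R with a small data frame typed in directly; the names d and fit below are just for illustration.
d <- data.frame(Sex = c("F", "F", "F", "M", "M"), Rating = c(5, 7, 9, 4, 8))
fit <- lm(Rating ~ Sex, data = d)
fitted(fit)                       # the five model values: 7 7 7 6 6
residuals(fit)                    # -2 0 2 -2 2
var(residuals(fit))               # 4, matching the hand computation
var(d$Rating - mean(d$Rating))    # 4.3, the model with no explanatory variable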
18. The data frame deathpenalty in the Stob package has data on whether the death penalty was given in 326 homicide cases in Florida in 1976 and 1977.
(a) Use an appropriate measure to "prove" that black defendants are less likely to get the death penalty than white defendants – perhaps a counterintuitive result.
> tally(Penalty ~ Defendant, data = deathpenalty)
         Defendant
Penalty    Black  White
  Death   0.1024 0.1187
  Not     0.8976 0.8812
So whites get the death penalty 12% of the time while blacks get the death penalty only 10% of the time.
(b) The R function cross makes a new variable from two categorical variables. Use the following syntax:
tally(Penalty ~ cross(Defendant, Victim), deathpenalty)
Explain how the result of this computation suggests that a more nuanced conclusion than that of part (a) might be warranted.
tally(Penalty ~ cross(Defendant, Victim), deathpenalty)
         cross(Defendant, Victim)
Penalty   Black:Black Black:White White:Black White:White
  Death       0.05825     0.17460     0.00000     0.12583
  Not         0.94175     0.82540     1.00000     0.87417
This shows that Black defendants are more likely to get the death penalty than white defendants if the victim is white and also more likely to get the death penalty if the victim is black.

19. For each of the following stories, identify the explanatory and response variables and indicate whether you think an observational study or experiment would be used. (Explain your choice in each case.)
(a) Calvin wants to compare the GPAs of women living in Heyns to those living in Noordewier.
(b) A doctor wants to compare the effectiveness of atenolol and quinapril for his patients in controlling high blood pressure.
(c) A child psychologist wants to find out whether the amount of video game playing by fifth graders is related to the frequency with which they bully children at school.

20. Suppose that we want to decide who is the better Mathematics 171 instructor of two instructors A and B who are teaching two different sections.
(a) Give several reasons why we cannot simply compare the instructors by comparing the grades of the students in class.
(b) There is a uniform final exam in Mathematics 171. Suppose we compare the instructors by comparing the scores of their students on the uniform final exam. How does that help? What problems still remain?
(c) Suppose that we also give a pretest to all the students in Mathematics 171 and compare the two instructors by comparing how much their students improved over the course of the semester. How does that help? What problems still remain?

21. Ultramarathoners often develop respiratory infections after running an ultra. Researchers were interested in whether vitamin C was useful in reducing these infections. The researchers studied a group of 100 ultra runners. They gave 50 a daily regimen of 600 mg vitamin C while the other 50 received a placebo. They also studied 100 nonrunners over the same period, also giving 50 of them the vitamin C treatment and 50 of them the placebo. All 200 subjects were watched for 14 days after the race to determine if infections developed.
(a) What are the explanatory and response variables?
(b) What is the name of this experimental design?
(c) Why did the researchers use nonrunners at all if the purpose was to determine whether vitamin C helped prevent infections in ultramarathoners?

22. (AP Statistics Free Response, 2007, Problem 2) As dogs age, diminished joint and hip health may lead to joint pain and thus reduce a dog's activity level. Such a reduction in activity can lead to other health concerns such as weight gain and lethargy due to lack of exercise. A study is to be conducted to see which of two dietary supplements, glucosamine or chondroitin, is more effective in promoting joint and hip health and reducing the onset of canine osteoarthritis. Researchers will randomly select a total of 300 dogs from ten different large veterinary practices around the country. All of the dogs are more than 6 years old, and their owners have given consent to participate in the study. Changes in joint and hip health will be evaluated after 6 months of treatment.
(a) What would be an advantage to adding a control group in the design of this study?
(b) Assuming a control group is added to the other two groups in the study, explain how you would assign the 300 dogs to these three groups for a completely randomized design.
(c) Rather than using a completely randomized design, one group of researchers proposes blocking on clinics, and another group of researchers proposes blocking on breed of dog. How would you decide which one of these two variables to use as a blocking variable?

23. The dataframe normaltemp has data on the temperatures of 130 Calvin students (taken in Psychology 151 in 1998).
(a) What is the sample mean temperature for this group of students?
mean(~Temp, data = normaltemp)
[1] 98.25
(b) Use the bootstrap to compute a standard error for your statistic.
r <- do(1000) * mean(~Temp, data = resample(normaltemp))
sd(~result, data = r)   # of course different answers are possible due to the simulation
[1] 0.06585
(c) It is folklore that the "normal" body temperature is 98.6 degrees. Explain how your analysis in (b) gives strong evidence that this folklore is wrong.
ANS: 98.6 is several standard errors away from our estimated mean, so it is unlikely that 98.6 is the true mean.

24. The normaltemp data also records the gender of each individual. (Unfortunately gender has been coded as a quantitative rather than categorical variable. To convert this variable to categorical, use the function factor. For example, bwplot(Temp ~ factor(Gender), data = normaltemp) draws a boxplot that you will want to look at.)
(a) What is the mean temperature for each gender group in this sample?
mean(Temp ~ factor(Gender), data = normaltemp)   # most students will have changed Gender to Male and Female
    1     2
98.10 98.39
(b) Use the bootstrap to compute confidence intervals for the mean temperature of each of men and women.
r <- do(1000) * mean(Temp ~ factor(Gender), data = resample(normaltemp))
confint(r, method = "quantile")   # either the quantile or standard error method is okay
  name lower upper level   method
1    1 97.94 98.27  0.95 quantile
2    2 98.22 98.57  0.95 quantile
(c) Does your analysis in (b) suggest that there is a difference in the population in the mean temperature of men and women?
ANS: Not obviously. The two confidence intervals overlap.
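The overlap of two separate intervals is only a rough guide; a more direct check, not required by the problem, is to bootstrap the difference in group means itself and see whether that interval contains 0. A sketch in the same mosaic idiom used above (diffmean is the mosaic helper for the difference of two group means):
rd <- do(1000) * diffmean(Temp ~ factor(Gender), data = resample(normaltemp))
confint(rd, method = "quantile")   # if this interval contains 0, the data do not
                                   # clearly establish a difference between the groups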
25. The bootstrap can help us estimate parameters other than the mean of the population. Using the normaltemp data, compute an estimate of the third quartile of the temperature distribution of the population in question and its standard error. (Note that you will have to look at the dataframe that results from your do function to see where the various sample statistics are.)
qdata(0.75, ~Temp, data = normaltemp)
   p quantile
0.75    98.70
r <- do(1000) * qdata(0.75, ~Temp, data = resample(normaltemp))
sd(~quantile, data = r)   # answers will vary somewhat
[1] 0.06466

26. The tips dataframe in the reshape2 package has data on the tips received by a waiter in a certain restaurant.
(a) Fit the model tip ∼ 1. What is the coefficient? What is the sum of squares of residuals?
l <- lm(tip ~ 1, data = tips)
l
Call:
lm(formula = tip ~ 1, data = tips)
Coefficients:
(Intercept)
          3
sum(residuals(l)^2)
[1] 465.2
(b) Fit the model tip ∼ total_bill. What are the coefficients? What is the sum of squares of residuals?
l2 <- lm(tip ~ total_bill, data = tips)
l2
Call:
lm(formula = tip ~ total_bill, data = tips)
Coefficients:
(Intercept)  total_bill
      0.920       0.105
sum(residuals(l2)^2)
[1] 252.8
(c) For each of the two models above, what is the predicted tip for a $20 bill? ANS: For the first model, 2.998 (i.e., $3.00). For the second model, 3.0203.
(d) From the results of the preceding two parts, do you think it important to know the total bill in predicting the tip? Give a statistical answer. ANS: The question should have said the first two parts. In that case, note the reduced sum of squares of residuals.

27. The SAT dataframe in the mosaic package has data on state by state educational inputs and outputs (for high schools). The variable sat has average SAT scores of high school students in the state and the variable expend has data on per pupil expenditures on education.
(a) Fit the model sat ∼ 1 + expend.
lsat <- lm(sat ~ expend, data = SAT)
lsat
Call:
lm(formula = sat ~ expend, data = SAT)
Coefficients:
(Intercept)       expend
     1089.3        -20.9
(b) Interpret the coefficient of expend in the model. What seems odd about this? ANS: For each increase of a thousand dollars in expenditures per student, the predicted decrease in average SAT score is 20.9 points.
(c) Examine the other variables in the dataframe. Do you have an explanation for this "odd" result? ANS: Different answers are possible (including no), but some students will see that the higher-SAT states are those with lower participation rates in the SAT.

28. Suppose that response is a quantitative response variable and color is a categorical variable with the levels Red, Green, and Blue. Suppose that the fitted model from R gave the following coefficients:
(Intercept)  ColorBlue  ColorGreen
         15          8          13
(a) What is the mean of response for all those cases that are Blue?
(b) What is the mean of response for all those cases that are Red?
(c) We cannot determine the mean of response for all cases. Why not?
(d) What are the possible values of the mean of response for all cases (using only the information given)?

29. Using the CPS85 data from the mosaic package, write a model and then use lm to answer the following questions.
(a) What is the average age of single people in the sample?
(b) What is the average age difference between single and married people in the sample?
(c) Write a good statistical statement that describes how wages vary with age.
(d) Write a good statistical statement that describes how wages vary by gender with age being held fixed.
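No worked solution is given above for this problem. One possible set of models, sketched under the assumption that the relevant CPS85 variables are married, age, wage, and sex (as used elsewhere in these problems), is:
# (a) and (b): a groupwise model of age by marital status
m1 <- lm(age ~ married, data = CPS85)
coef(m1)   # the intercept is the mean age of the reference group;
           # the married coefficient is the single-vs-married difference (part b);
           # adding the two pieces gives the mean age of the other group (part a)
# (c): how wages vary with age
m2 <- lm(wage ~ age, data = CPS85)
coef(m2)   # the age coefficient is the predicted change in wage per additional year of age
# (d): wages by gender, holding age fixed
m3 <- lm(wage ~ age + sex, data = CPS85)
coef(m3)   # the sex coefficient compares the genders at a fixed age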
30. In the SAT dataset of the mosaic package are variables on education in the states.
(a) Fit the model sat ∼ expend + frac.
(b) What are the units of the three coefficients in the model?
(c) Compare this model to problem 27. How does the model of this problem help understand the "strangeness" of problem 27?

31. A certain drug has a side effect of increasing hemoglobin levels in individuals. The effect depends on the dosage strength and is somewhat different for males and females. Consider the model Hemoglobin ∼ 1 + dose + sex + sex:dose.
[plot of Hemoglobin (roughly 15 to 25) versus dose (0 to 4) with separate fitted lines for F and M]
Suppose that Males are the reference group for sex (so that the model formula has a variable sexF). In each part below, choose the correct response and give a short explanation for your choice.
(a) The coefficient of the intercept in the model formula will be (choose one: positive, negative, 0): POSITIVE (it looks to be about 14; it's where the male line has its intercept)
(b) The coefficient of dose in the model formula will be (choose one: positive, negative, 0): POSITIVE – it's the slope of the line for males
(c) The coefficient of sexF in the model formula will be (choose one: positive, negative, 0): NEGATIVE – the female line has a lower y intercept than the male line
(d) The coefficient of dose:sexF in the model formula will be (choose one: positive, negative): NEGATIVE – the female line has a lower slope than the male line

32. The data frame mammals in the MASS package has data on the brain and body weight of various mammals.
(a) The model brain ∼ 1 + body doesn't seem like a very good model. Why not? (Look at a graph.)
xyplot(brain ~ body, data = mammals)
[scatterplot of brain versus body]
ANS: Looks concave up rather than linear. (Perhaps also because it spans too many orders of magnitude.)
(b) An appropriate transformation of both brain and body size results in a much better model. Find such a transformation. (Hint: you might notice that these data extend over several orders of magnitude.)
ANS: While there are a couple of possibilities, I think the best by far is log-log.
l <- lm(log(brain) ~ log(body), data = mammals)
coef(l)
(Intercept)   log(body)
     2.1348      0.7517
xyplot(log(brain) ~ log(body), data = mammals, type = c("p", "r"))
[scatterplot of log(brain) versus log(body) with fitted line]

33. A dataset available from Danny's website, Diamonds, has data on 308 diamonds. You can get the dataset by
> Diamonds = fetchData("Diamonds.csv")
Retrieving from http://www.mosaic-web.org/go/datasets/Diamonds.csv
> head(Diamonds)
  carat color clarity certification price
1  0.30     D     VS2           GIA  1302
2  0.30     E     VS1           GIA  1510
3  0.30     G    VVS1           GIA  1510
4  0.30     G     VS1           GIA  1260
5  0.31     D     VS1           GIA  1641
6  0.31     E     VS1           GIA  1555
The variables include
carat          weight (one carat is 200 mg)
color          on a scale from D (colorless) to Z (yellow)
clarity        a categorical variable, see http://www.diamondinfo.org
certification  third party certification organization
price          price in dollars
(a) Anyone who knows diamonds knows that diamonds of greater carat weight are more highly valued. Fit a model price ∼ 1 + carat.
ld <- lm(price ~ carat, data = Diamonds)
coef(ld)
(Intercept)       carat
      -2298       11599
predicted price = −2298 + 11599 · carat
(b) From the graph, you might suppose that a second degree polynomial fits the data somewhat better. That is, you might want to fit the model price ∼ 1 + carat + carat².
R provides a function that makes this easy to do:
> l <- lm(price ~ poly(carat, 2), data = Diamonds)
Fit this model and write the model equation in more conventional notation than that returned by R.
ANS: Unfortunately they are going to get the equation wrong if they use poly since it uses orthogonal polynomials instead of the usual terms. The usual terms can be gotten by adding an argument to poly that they may not have heard me say.
> lbad <- lm(price ~ poly(carat, 2), data = Diamonds)
> lgood <- lm(price ~ poly(carat, 2, raw = TRUE), data = Diamonds)   # this gives the right coefficients
> lgoodtoo <- lm(price ~ 1 + carat + I(carat^2), data = Diamonds)    # this does too
> coef(lbad)
    (Intercept) poly(carat, 2)1 poly(carat, 2)2
           5019           56332            8152
> coef(lgood)
                (Intercept) poly(carat, 2, raw = TRUE)1 poly(carat, 2, raw = TRUE)2
                     -42.51                     2786.10                     6961.71
> coef(lgoodtoo)
(Intercept)       carat  I(carat^2)
     -42.51     2786.10     6961.71
ANS: So the best model is
predicted price = −43 + 2786 · carat + 6962 · carat²
(c) Give an argument that the quadratic model fits the data better.
ANS: They might look at a graph. Or they might compare sums of squares of residuals.
sum(residuals(ld)^2)      # linear model
[1] 382178624
sum(residuals(lgood)^2)   # quadratic model
[1] 315717826

34. The dataframe KidsFeet in the mosaic package has data on a number of children including measurements of their feet.
(a) Fit the following model for the width of the feet: width ∼ 1 + length. List the coefficients and find the value of R² for this model.
l <- lm(width ~ length, data = KidsFeet)
coef(l)
(Intercept)      length
     2.8623      0.2479
rsquared(l)
[1] 0.411
(b) Now fit the model width ∼ 1 + length + sex and find the value of R². Would you say that knowing the gender of the student is of significant help in predicting the width of a foot given the length?
l2 <- lm(width ~ 1 + sex + length, data = KidsFeet)
coef(l2)
(Intercept)        sexG      length
     3.6412     -0.2325      0.2210
rsquared(l2)
[1] 0.4595
ANS: length alone explains 41% of the variation in foot width while including sex explains 46% of the variation. Reasonable people could disagree about whether this is a significant improvement, but it doesn't seem so to me!

35. Suppose that y is a response variable, x and z are quantitative explanatory variables, and g and h are categorical explanatory variables. Consider the following models:
A  y ∼ 1
B  y ∼ 1 + x
C  y ∼ 1 + x + g
D  y ∼ 1 + z + h
E  y ∼ 1 + x + g + h
F  y ∼ 1 + x + g + x:g
G  y ∼ 1 + x + z + g + h
H  y ∼ 1 + z + g + h
For some pairs of the above models, the R² for one is certainly less than or equal to the R² of the other. For other pairs, it depends on what the variables are as to which R² is greater. List all pairs of models for which it is possible to determine the relationship of the R² values and say what that relationship is. (For example, the R² for model A is less than or equal to that of model B, and so forth.)
ANS: In terms of R², A ≤ B ≤ C ≤ E ≤ G, A ≤ B ≤ F, A ≤ B ≤ G, A ≤ C ≤ F, A ≤ D ≤ H ≤ G. Those are the only pairs that we can compare.

36. The dataframe trees has data on the Volume, Girth, and Height of some felled cherry trees. Using R² as a guide, give a good reason for using the model Volume ∼ 1 + Girth + Girth² instead of either a first degree or third degree polynomial.
rsquared(lm(Volume ~ Girth, data = trees))
[1] 0.9353
rsquared(lm(Volume ~ Girth + I(Girth^2), data = trees))
[1] 0.9616
rsquared(lm(Volume ~ poly(Girth, 3), data = trees))
[1] 0.9627
For the linear model, R² is 93%, and for both the quadratic and cubic models R² is 96%. So the second degree polynomial explains about half of the variation that the first degree polynomial doesn't explain, but the cubic model explains hardly any of the rest of the variation.

37. The Duncan dataframe of the car package has data (from 1950) on 45 different professions. Each of the 45 professions was rated on prestige and the typical characteristics of persons in that profession.
prestige   percent of raters rating the occupation as excellent or good
income     percent of males earning $3,500 or more
education  percent of males who were high school graduates
(a) Fit a model prestige ∼ 1 + income.
l <- lm(prestige ~ income, data = Duncan)
coef(l)
(Intercept)      income
      2.457       1.080
(b) Fit a model prestige ∼ 1 + income + education.
l2 <- lm(prestige ~ income + education, data = Duncan)
coef(l2)
(Intercept)      income   education
    -6.0647      0.5987      0.5458
(c) Explain why the coefficient of income is different in the two models. Give an explanation that explains the size of the difference.
l3 <- lm(education ~ income, data = Duncan)
coef(l3)
(Intercept)      income
    15.6114      0.8824
ANS: Since education is related to income (it's a confounding variable), the effect of income on prestige will be different when education is controlled for. To see where the 1.080 in the first model comes from, notice that if income changes one unit, the change in prestige is .60 units if education is held fixed. However, a change in income of one unit also means a change in education of .88 units, so the change in prestige given a change of one unit in income is .88 · .55 + .60 ≈ 1.08.

38. The story in the following link
http://news.sciencemag.org/social-sciences/2014/02/scienceshot-why-you-should-talk-your-baby
describes the result of a study done on verbal abilities in babies.
(a) The headline claims that parents should talk to their babies. The headline is the summary of a conclusion of a study. What are the explanatory and response variables of that study? ANS: It's actually confusing. The response variable appears to be word processing speed. The explanatory variable of the study appears to be parent-child banter, but then it switches to SES.
(b) One of the variables in the study is socioeconomic status of the parents. What is the role of that variable in this study? ANS: It appears that SES is being used as a stand-in for how much parents talk to their kids.

39. The following study
http://well.blogs.nytimes.com/2010/06/15/eating-brown-rice-to-cut-diabetes-risk/?_php=true&_type=blogs&_r=0
makes claims about diets that include brown rice.
(a) What are the explanatory and response variables referred to in the headline? ANS: The explanatory variable is eating brown rice and the response variable is diabetes risk.
(b) What covariates did the article imply were considered? ANS: The researchers tried to control for the fact that Americans who eat brown rice tend to be more healthy overall - they eat more fruits and vegetables and less red meat and trans fats, and they also tend to be thinner, more active and less likely to smoke than those who don't eat brown rice. (That's a quote from the article!)

40. Do problem 10.04 from the collection of exercises from the book. (Link here or on the homework webpage.) ANS: See attached.

41. Suppose that a manufacturer of computer chips claims to produce no more than 5% defective chips. You inspect 100 chips produced.
(a) Would you have a strong reason to doubt the claim of the manufacturer if you find that 6 of the chips are defective? Why or why not?
(b) If in part (a) you said no, how many defective chips would you have to find before you think that you would have a strong reason to doubt the manufacturer? Defend your choice.

42. Breanna Verkaik made 72 out of 89 free throws this past basketball season.
(a) Use an appropriate binomial model to estimate the probability that Breanna would make all 10 free throws if she shot 10.
(b) What assumption of the binomial model might be questioned in this situation?
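Neither problem 41 nor 42 has a worked solution above; the binomial computations a student might use are sketched below (the 5% and 72/89 values come from the problem statements, and the interpretation is left to the reader).
# Problem 41: if 5% of chips are truly defective, how surprising are 6 or more defects in 100?
1 - pbinom(5, size = 100, prob = 0.05)   # P(6 or more defective), roughly 0.38, so not very surprising
# Problem 42: probability of making 10 straight free throws, using p = 72/89
dbinom(10, size = 10, prob = 72/89)      # equivalently (72/89)^10, about 0.12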
43. Do problem 11.23 from the textbook exercises. here
44. Do problem 11.33 from the textbook exercises. here
45. Do problem 11.34 from the textbook exercises. here

46. The whiteside data in the MASS package has data on gas usage in a certain house before and after the owner installed insulation.
Gas    weekly gas consumption in 1000s of cubic feet
Insul  before or after insulation was installed
Temp   the average outside temperature in degrees Celsius
(a) Fit the model Gas ∼ 1 + Insul + Temp.
l <- lm(Gas ~ Insul + Temp, data = whiteside)
l
Call:
lm(formula = Gas ~ Insul + Temp, data = whiteside)
Coefficients:
(Intercept)  InsulAfter         Temp
      6.551      -1.565       -0.337
(b) Compute a 95% confidence interval for Insul using the bootstrap method.
r <- do(1000) * lm(Gas ~ Insul + Temp, data = resample(whiteside))
confint(r)
        name   lower   upper level method estimate margin.of.error
1  Intercept  6.3035  6.7944  0.95 stderr   6.5489         0.24544
2 InsulAfter -1.7528 -1.3776  0.95 stderr  -1.5652         0.18758
3       Temp -0.3758 -0.2954  0.95 stderr  -0.3356         0.04019
4      sigma  0.2872  0.4059  0.95 stderr   0.3465         0.05932
5  r.squared  0.8607  0.9593  0.95 stderr   0.9100         0.04930
The confidence intervals may vary a bit because of the simulation.
(c) Compute a 95% confidence interval for Insul using the standard method. Comment on any difference you see.
confint(l)
              2.5 %  97.5 %
(Intercept)  6.3145  6.7882
InsulAfter  -1.7599 -1.3705
Temp        -0.3723 -0.3011
At least in my simulation, these intervals are very close.
(d) Find fitted values of the model for an average temperature of 0 both before and after insulation.
f <- makeFun(l)
f(Insul = "Before", Temp = 0)
    1
6.551
f(Insul = "After", Temp = 0)
    1
4.986
(e) Write a 95% prediction interval for each of your fitted values in part (d).
f(Insul = "Before", Temp = 0, interval = "prediction")
    fit   lwr   upr
1 6.551 5.796 7.306
f(Insul = "After", Temp = 0, interval = "prediction")
    fit  lwr   upr
1 4.986 4.24 5.732
(f) Compare the two intervals and write a nice sentence saying qualitatively what these two intervals tell us. Perhaps one thing to note is that the intervals do not overlap, so if we were predicting the gas usage for two days of 0 temperature, we would be fairly confident in predicting a lower gas usage in the after-insulation condition.

47. It might be expected that more successful baseball teams might tend to draw more fans to the ballpark. The dataframe Baseball21 in the Stob package has data on the first 11 years of major league baseball seasons in the 21st century. Here are a few of the variables:
attendance  the attendance at home games for the team on the season
W           the number of games won by the team in the season
R           runs scored by the team
LG          the league, American or National, that the team is in
(a) Fit the model attendance ∼ 1 + W.
l <- lm(attendance ~ W, data = Baseball21)
l
Call:
lm(formula = attendance ~ W, data = Baseball21)
Coefficients:
(Intercept)            W
    -135675        31942
(b) Compute a confidence interval for the coefficient of wins in the model of part (a). Write a sentence that interprets this coefficient.
confint(l)
              2.5 %  97.5 %
(Intercept) -627820  356470
W             25925   37959
We can be fairly confident that on average an additional win increases attendance by between 26,000 and 38,000.
(c) Note the confidence interval for the intercept term. Why is it so wide?
ANS: The values of the explanatory variable are a fair distance from W = 0. So the intercept value is quite uncertain – small changes in the fitted slope mean large changes in the fitted intercept.
(d) Now fit a model attendance ∼ 1 + W + R. One might consider this model by thinking that teams that score more runs, even if they lose, are more interesting to watch and so might attract more fans. Does the fitted model support this conjecture?
ll <- lm(attendance ~ W + R, data = Baseball21)
confint(ll)
                 2.5 %  97.5 %
(Intercept) -1081344.7  329941
W              22422.6   37306
R               -586.1    1646
Not really. The confidence interval suggests that we cannot even be confident of the sign of the coefficient of R, suggesting that it is not clear that increasing runs increases attendance while holding wins fixed.
(e) You might instead think that National League teams are, in general, more interesting to watch and so might attract more fans. Fit a model and decide whether the data support this conjecture.
lll <- lm(attendance ~ W + LG, data = Baseball21)
lll
Call:
lm(formula = attendance ~ W + LG, data = Baseball21)
Coefficients:
(Intercept)            W         LGNL
    -295141        32430       224955
confint(lll)
              2.5 %  97.5 %
(Intercept) -789765  199484
W             26495   38364
LGNL          86188  363722
There does seem to be evidence that National League teams can be predicted to attract more fans on average, even holding wins fixed.

48. At this website, http://www.yale.edu/infantlab/socialevaluation/Helper-Hinderer.html are some videos that show some of the experiments done on the question of whether very young children already have a sense of social relationships. The first two videos were shown to several young children. In the first video, a triangle helps the circle up the hill. In the second video, the square hinders the circle from reaching the top of the hill. After a child is shown both sequences, they are tested to see whether they have a preference for the helper triangle or the hinderer square. The first video in the second row shows a pre-verbal 6 month old child clearly showing a preference for the helper triangle. In one particular run of this experiment, 12 children were tested and 9 preferred the helper in this way.
(a) What is the natural null hypothesis in this experiment? ANS: That children would prefer the helper and hinderer equally.
(b) If the null hypothesis is true, how many children are expected to choose the helper rather than the hinderer? ANS: 6 of the 12.
(c) Is the result in this experiment, 9 out of 12 favoring the helper, strong evidence against the null hypothesis? Give a quantitative justification for your claim.
1 - pbinom(8, 12, 0.5)
[1] 0.073
About 7.3% of the time we would see 9 or more of the kids choose the helper even if there were no difference in the population. Probably not strong evidence.
(d) Of course it was important to conduct this experiment very carefully in order to ensure that it was the helper-hinderer distinction that was important.
Give two examples of possible confounding variables and explain how randomization of some aspect of the experiment would address each. ANS: Various answers are possible. The shape and the order in which the scenes were shown are two obvious ones. We should randomly assign the shapes to the two roles and also show the videos in random order.
(e) In this experiment, where specifically should blinding have been used? ANS: Various answers are possible. But certainly the researcher showing the baby the two shapes should not know which is the helper and which is the hinderer.

49. In this problem, you will think about the relationship between confidence intervals and hypothesis tests. Recall the dataset from the last test on the survival time of Alzheimer's patients.
mydata <- fetchData("CSHA.csv")
Retrieving from http://www.mosaic-web.org/go/datasets/CSHA.csv
The variables are
Gender     gender
Education  number of years of education
AAO        age at onset of Alzheimer's disease
Survival   number of days from onset of Alzheimer's until death
(a) Suppose that you are investigating the model Survival ∼ 1 + Gender. For this model, what is the most reasonable null hypothesis? ANS: Gender doesn't affect survival time.
(b) Construct a confidence interval for the coefficient of Gender in the model. Explain how that confidence interval might indicate that there is not enough evidence to say that the null hypothesis is false.
csha = fetchData("CSHA.csv")
Retrieving from http://www.mosaic-web.org/go/datasets/CSHA.csv
l <- lm(Survival ~ Gender, data = csha)
confint(l)
              2.5 % 97.5 %
(Intercept) 1909.44 2373.8
GenderM      -50.35  513.7
Notice that the confidence interval contains 0. This means that we don't have enough evidence to say that the effect of Gender is not 0 – we cannot rule out that gender doesn't matter.

50. The males of stalk-eyed flies have long eye stalks. Is the male's long eye-stalk affected by the quality of its diet? Two groups of male flies were reared on different diets. One group was fed corn and the other cotton wool. The eye spans (distance between the eyes) were measured in mm. The data are in the dataframe Stalkies2 in the abd package.
eye.span  eye span in mm
food      diet (Corn or Cotton)
(a) Identify the null and alternate hypotheses. ANS: Null: diet doesn't affect the distance between eyes. Alt: diet does.
(b) If we fit the model eye.span ∼ 1 + food, what is a reasonable test statistic? ANS: the coefficient of food and R² are both reasonable choices.
(c) Do a simulation to compute an approximate P-value for this test statistic. ANS: Their computation might vary somewhat and they might use either of the two statistics. But the P-value in either case should be 0 or almost 0.
require(abd)
l <- lm(eye.span ~ food, data = Stalkies2)
fd = coefficients(l)[2]
rs = rsquared(l)
fd
foodCotton
   -0.5042
rs
[1] 0.59
r <- do(1000) * lm(shuffle(eye.span) ~ food, data = Stalkies2)
pdata(rs, ~r.squared, r)
[1] 1
pdata(fd, ~foodCotton, r)
foodCotton
         0
(d) Is there strong evidence against the null hypothesis? ANS: Yes. A P-value that is essentially 0 means that the data are not consistent with the null hypothesis at all.

51. The dataframe pulp in the faraway package has data on an experiment to test the brightness of paper produced by four different shift operators. (The units have been lost and these are not typical units. In standard units, paper brightness is usually between 80 and 90.)
bright    brightness of the pulp
operator  operator (a-d)
Use a randomization test to determine if the shift operator makes a difference in the brightness of paper produced.
l <- lm(bright ~ operator, data = pulp)
r = rsquared(l)
d <- do(1000) * rsquared(lm(shuffle(bright) ~ operator, data = pulp))
pdata(r, ~result, data = d)
[1] 0.968
The p-value of the test statistic is about 1 − 0.968 = 0.032 (it will vary a bit with the simulation).

52. The dataframe corrosion in the faraway package has data on the corrosion of test bars with various percentages of added iron.
Fe    percentage of added iron
loss  material lost due to corrosion
Fit a model loss ∼ 1 + Fe.
l <- lm(loss ~ Fe, data = corrosion)
l
Call:
lm(formula = loss ~ Fe, data = corrosion)
Coefficients:
(Intercept)           Fe
        130          -24
(a) Compute R² for this model.
rsquared(l)
[1] 0.9697
(b) What is the obvious null hypothesis in this situation? ANS: That iron content (Fe) has no effect on loss due to corrosion.
(c) What is the expected value of R² if the null hypothesis is true? ANS: There are n = 13 cases and m = 2 predictors. So the typical value of R² is 1/12.
(d) Compute the F-statistic. You could have R do this, but computing it from the formula will make you feel really smart.
ANS: F = [R²/(m − 1)] / [(1 − R²)/(n − m)] = (.97/1)/(.03/11) ≈ 355. Depending on rounding you could get a variety of answers. R computes it as 352.3.

53. Do problem 14.22 of the problems from the text, found here. The entries in the R² column will vary by simulation.
 p   p/(n − 1)   Mean R²
 1      .03
 3      .08
10      .26
20      .53
37      .97
38      1
For p = 38, there are enough variables to fit the data exactly.

54. The nels88 dataframe in the faraway package has data on a mathematics test taken as part of a national study. Some variables are
math    the mathematics test score of the student
ses     the socioeconomic status of the family of the child
paredu  the parents' educational status
For the following models, complete the following table:
Model                           SSModel   df   SSResid   df   F
math ∼ 1 + sex
math ∼ 1 + sex + paredu
math ∼ 1 + sex + paredu + ses

l1 <- lm(math ~ sex, data = nels88)
l2 <- lm(math ~ sex + paredu, data = nels88)
l3 <- lm(math ~ sex + paredu + ses, data = nels88)
anova(l1)
Analysis of Variance Table
Response: math
           Df Sum Sq Mean Sq F value Pr(>F)
sex         1      9     9.2    0.07   0.79
Residuals 258  32107   124.4
anova(l2)
Analysis of Variance Table
Response: math
           Df Sum Sq Mean Sq F value Pr(>F)
sex         1      9       9    0.13   0.72
paredu      5  13708    2742   37.70 <2e-16
Residuals 253  18400      73
anova(l3)
Analysis of Variance Table
Response: math
           Df Sum Sq Mean Sq F value  Pr(>F)
sex         1      9       9    0.13   0.721
paredu      5  13708    2742   38.22  <2e-16
ses         1    324     324    4.52   0.034
Residuals 252  18076      72
Last column of the table may vary a bit due to rounding.
Model                           SSModel   df   SSResid   df      F
math ∼ 1 + sex                        9    1     32107  258    .07
math ∼ 1 + sex + paredu           13717    6     18400  253  31.43
math ∼ 1 + sex + paredu + ses     14041    7     18076  252  27.96

55. Do problem 15.02 of the textbook problems.
56. Do problem 15.10 of the textbook problems. (The data are in TenMileRace in the mosaic package.)

57. The dataframe cathedral in the faraway package has data on the cathedral nave heights and lengths of several cathedrals in England.
x      the length of the nave
y      the height of the nave
style  the style of the cathedral, romanesque or gothic
Fit a model to predict the height of the nave from the length of the nave and the style (y ∼ 1 + x + style).
(a) Write the equation of the fitted model.
(b) Use R to estimate the P-value for the null hypothesis that style does not matter in this model.
(c) Instead, use a simulation method (using shuffle) to compute the same P-value.
(d) Should style be included in this model?
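No worked solution is given above for this problem. A sketch of one approach, using the same tools as the other solutions (lm, summary for the standard P-value, shuffle for the randomization version), is below; the position of the style coefficient and the two-sided tail calculation are my choices, so check names(coef(lc)) and adjust as needed.
require(faraway)   # for the cathedral data
lc <- lm(y ~ x + style, data = cathedral)
coef(lc)           # part (a): the coefficients of the fitted equation
summary(lc)        # part (b): the style row of the coefficient table gives the standard P-value
# part (c): a randomization distribution for the style coefficient
obs <- coef(lc)[3]                 # the observed style coefficient (third coefficient)
sims <- replicate(1000, coef(lm(y ~ x + shuffle(style), data = cathedral))[3])
mean(abs(sims) >= abs(obs))        # approximate two-sided P-value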
58. The dataframe Ericksen in the car package has data on the undercount rate in the 1980 census. See the help document for a description of all the variables. The response variable is undercount, which is the percentage of people undercounted in a given region.
(a) Make a big model for undercount by including all the other available variables as explanatory variables. Don't include any interaction terms however.
l = lm(undercount ~ minority + crime + poverty + language + highschool + housing + city + conventional, data = Ericksen)
l
Call:
lm(formula = undercount ~ minority + crime + poverty + language +
    highschool + housing + city + conventional, data = Ericksen)
Coefficients:
 (Intercept)      minority         crime       poverty      language    highschool       housing     citystate  conventional
     -0.6114        0.0798        0.0301       -0.1784        0.2151        0.0613       -0.0350       -1.1600        0.0370
(b) Make an argument that several of the variables are clearly not necessary in the model.
summary(l)
Call:
lm(formula = undercount ~ minority + crime + poverty + language +
    highschool + housing + city + conventional, data = Ericksen)
Residuals:
   Min     1Q Median     3Q    Max
-2.836 -0.803 -0.055  0.705  4.247
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.61141    1.72084   -0.36  0.72368
minority       0.07983    0.02261    3.53  0.00083
crime          0.03012    0.01300    2.32  0.02412
poverty       -0.17837    0.08492   -2.10  0.04012
language       0.21512    0.09221    2.33  0.02320
highschool     0.06129    0.04477    1.37  0.17642
housing       -0.03496    0.02463   -1.42  0.16126
citystate     -1.15998    0.77064   -1.51  0.13779
conventional   0.03699    0.00925    4.00  0.00019
Residual standard error: 1.43 on 57 degrees of freedom
Multiple R-squared: 0.708, Adjusted R-squared: 0.667
F-statistic: 17.2 on 8 and 57 DF,  p-value: 1.04e-12
anova(l)
Analysis of Variance Table
Response: undercount
             Df Sum Sq Mean Sq F value  Pr(>F)
minority      1  195.8   195.8   96.23 7.6e-14
crime         1   29.2    29.2   14.34 0.00037
poverty       1    5.4     5.4    2.67 0.10792
language      1   11.6    11.6    5.71 0.02020
highschool    1    2.9     2.9    1.41 0.23957
housing       1    0.2     0.2    0.07 0.78692
city          1    3.2     3.2    1.58 0.21345
conventional  1   32.5    32.5   15.98 0.00019
Residuals    57  116.0     2.0
An argument should be made on the basis of the summary or the anova, but it appears that we should get rid of at least highschool, housing, and city.
(c) Make a smaller model for undercount and give a statistical reason for believing that this smaller model is to be preferred to the larger model.
lsmall = lm(undercount ~ minority + crime + poverty + language + conventional, data = Ericksen)
anova(lsmall, l)
Analysis of Variance Table
Model 1: undercount ~ minority + crime + poverty + language + conventional
Model 2: undercount ~ minority + crime + poverty + language + highschool + housing + city + conventional
  Res.Df RSS Df Sum of Sq    F Pr(>F)
1     60 125
2     57 116  3      8.79 1.44   0.24
Notice that the P-value here is .24, which means that there is no reason to think that the larger model explains more variation than the smaller.

59. In this problem we are going to develop a hypothesis test and determine its power for a typical testing situation. Suppose a certain medical test is marketed with the statement that it detects a certain medical condition 90% of the time that the patient actually has the condition. (90% is called the sensitivity of the test.) To test this claim, a market research firm finds people with the condition and gives them the test.
(a) Suppose that the firm tests 20 persons with the disease. How many patients should test positive for the condition if the claim of the company is true? ANS: Obviously, 18!
(b) What is the probability that 16 or fewer of 20 persons would test positive? (Hint: you might want to think of the binomial model.)
pbinom(16, 20, 0.9)
[1] 0.133
(c) If the testing firm sets a significance level of 5%, when will they reject the null hypothesis if they test 20 patients?
pbinom(15, 20, 0.9)
[1] 0.04317
They should reject the null if 15 or fewer patients test positive.
(d) Suppose that the test only detects the disease in 80% of the patients that have it. What is the power of this hypothesis test at the 5% level of significance?
pbinom(15, 20, 0.8)
[1] 0.3704
(e) Suppose that the testing firm desires 95% power and a 5% level of significance. How many people should the firm test if the true sensitivity is 80%?
ANS: The best answer is between 140 and 150 (which they have to get by trial and error). For 145 and 150, we have:
pbinom(120:125, 145, 0.9)   # shows that we should use 123 as our rejection criterion
[1] 0.004939 0.009597 0.017843 0.031720 0.053879 0.087383
pbinom(123, 145, 0.8)   # shows that the power is 94%, so we should go a little higher
[1] 0.9443
pbinom(125:130, 150, 0.9)   # shows that we should use 128 as our rejection criterion
[1] 0.007648 0.014310 0.025640 0.043964 0.072088 0.112977
pbinom(128, 150, 0.8)   # power of 96%
[1] 0.9628

60. Medical testing is even more complicated than the last problem suggests. Obviously the medical test should detect the disease in patients who have it, but it should also not detect the disease in patients who don't have it. The specificity of the test is the percentage of negative results in testees who don't have the disease. So a specificity of 90% means that the test will produce 10% false positives. If a claim is made that a medical test is 90% accurate, it usually means that both the sensitivity and specificity are at least 90%. In this problem, we investigate the problem of going backward: if a test tells you that you have a certain disease, how likely is it that you have the disease? Seems like 90%, right?
(a) Suppose that in a certain population, only 1% of the people have the disease. Now suppose that everyone is tested. What proportion of the population will test positive for the disease? (Remember that a person can test positive either by having the disease and the test is right, or by not having the disease and the test is wrong.) ANS: 90% of 1% will test positive correctly and 10% of 99% will test positive incorrectly. So that is .9 + 9.9 = 10.8%.
(b) Based on your result in part (a), what proportion of those who test positive actually have the disease? ANS: .9/10.8 = 8.3%.
(c) If you did (a) and (b) right, you should be more than a little surprised. What do you think this result says about universal screening programs for various diseases? ANS: Universal screening for a rare condition is likely to produce mostly false positives. In other words it might be unnecessarily expensive and give a lot of people needless concern.

61. In logistic regression, the values of the model ŷ are "link" values that have to be transformed to get probabilities.
(a) For each of the following link values, compute the corresponding probability: −3, −2, −1, 0, 1, 2, 3.
ilogit(c(-3, -2, -1, 0, 1, 2, 3))
[1] 0.04743 0.11920 0.26894 0.50000 0.73106 0.88080 0.95257
(b) For each of the following probabilities, compute the corresponding link value: .1, .25, .5, .75, .9.
logit(c(0.1, 0.25, 0.5, 0.75, 0.9))
[1] -2.197 -1.099  0.000  1.099  2.197

62. The dataframe CAFE in the Stat2Data package has data on a 2003 vote in the US Senate on an amendment to a bill sponsored by John Kerry and John McCain to mandate improved fuel economy on cars and light trucks. The amendment effectively killed the bill, so it was strongly supported by most car manufacturers. One might suppose that the political party of the Senator might be related to their vote (it usually is), and it also might be supposed that the contributions from car manufacturers had an effect. The variable LogContr is the logarithm of the contributions that a senator received in their lifetime, while the variable Dem is 1 if the senator was a Democrat or Independent and 0 otherwise. The variable Vote is the vote of the senator (1 or 0). (Remember that for datasets in the Stat2Data package you need to load the package and use the command data(CAFE) to make the dataset accessible.)
(a) Fit a model for the Senator's vote that uses LogContr and Dem as explanatory variables.
data(CAFE)
g <- glm(Vote ~ LogContr + Dem, data = CAFE, family = binomial)
(b) The median of LogContr is approximately 4. For this value, what are the predicted probabilities of a YES vote from a Democrat and from a Republican?
f <- makeFun(g)
f(LogContr = 4, Dem = 1)
     1
0.5226
f(LogContr = 4, Dem = 0)
    1
0.861
(c) Write a good sentence interpreting the coefficient of LogContr in the model.
exp(coefficients(g))
(Intercept)    LogContr         Dem
    0.00107     8.72233     0.17679
For each unit increase in the log of contributions received, we predict an increase in the odds of voting for the bill by a factor of 8.7, holding the party of the senator fixed.
(d) Write a sentence interpreting the coefficient of Dem in the model. ANS: Holding contributions fixed, the odds that a Democrat votes for the bill are estimated at .18 times the odds for a Republican.
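A small check, not part of the assignment, ties problem 61 to part (b) here: the predicted probabilities come from applying the inverse-logit transformation to the "link" value built from the coefficients of the fitted model g above.
b <- coef(g)                          # (Intercept), LogContr, Dem on the link (log-odds) scale
link <- b[1] + b[2] * 4 + b[3] * 1    # link value for a Democrat with LogContr = 4
exp(link) / (1 + exp(link))           # the inverse logit; should reproduce f(LogContr = 4, Dem = 1)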