Mathematics 243
Problems
1. A model often used to predict the height in inches of an adult son is

   height-hat = (1/2) father + (1/2) mother + 2.5
A different model suggests that the height of the father is more important in predicting the height of sons. That model is

   height-hat = (2/3) father + (1/3) mother + 2.5
Of course neither model is “right.” Suppose that you have a lot of data (like Galton). How would you decide
which of these two models is “better”? Be as precise as you can.
Solution. A start would be to ask which model makes predictions that are closer to the actual value. A better
solution would be to describe some way to combine the prediction errors. (Of course later in the course we will
compute the sums of squares of residuals. But they don’t know that yet.)
2. We might write Einstein's law as E = mc² but we use the equation as if it is an exact equality (unlike the models in the first problem). Explain why you would not expect exact equality in any real application of this law.
Solution. We can’t measure any of the three quantities exactly. So measurement error is a source of variation
that will prevent exact equality. (Some students might say that this rule itself is not exactly right. Maybe that’s
true!)
3. Create a vector x in R containing the following numbers:
1, 4, 7, 9, 13, 19, 21, 25
For each of the following R commands, try to predict what the output will be and then write down (using R of course) what the output actually is. In each case where an explanation is asked for, say how R is computing the answer that you get.
(a) x
(b) x+1
(c) sum(x)
(d) x>10
(e) x[x>10]  Explain.
(f) sum(x>10)  Explain.
(g) sum(x[x>10])  Explain.
(h) x[-(1:3)]  Explain.
(i) x^2
Solution.
> x = c(1, 4, 7, 9, 13, 19, 21, 25)
> x
[1]  1  4  7  9 13 19 21 25
> x + 1
[1]  2  5  8 10 14 20 22 26
> sum(x)
[1] 99
> x > 10
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
> x[x > 10]  # the values of x that are greater than 10
[1] 13 19 21 25
> sum(x > 10)  # the sum of TRUEs and FALSEs; evidently TRUE = 1
[1] 4
> sum(x[x > 10])  # the sum of the values of x that are greater than 10
[1] 78
> x[-(1:3)]  # all values of x except the first three
[1]  9 13 19 21 25
> x^2
[1]   1  16  49  81 169 361 441 625
4. Data on all the counties in the United States are in the counties dataframe in the Stob package. Each row in
the dataframe is a county.
(a) How many counties are there?
(b) How many variables are in the dataframe?
(c) The variable Population has the population of each county as of the 2010 census. What was the total
population of the United States in 2010?
(d) Why is a histogram of the populations of all the counties not very informative?
(e) Two of the variables are not really necessary – they can be computed from other variables in the dataframe.
Name one of these and show how it can be computed easily from other variables.
Solution.
> dim(counties)
[1] 3141    9
> sum(counties$Population)
[1] 281421906
The problem with a histogram is that the data is too skewed (or spread over too many orders of magnitude). The density variables can be computed from the others.
5. The data frame Chile in the car package (be sure to load this package) has data on a survey of voters in Chile
conducted in the spring of 1988 before the election that unseated Augusto Pinochet. Use what you know about
data frames and the tally() and favstats() functions to answer the following questions.
(a) What is a case and how many cases are there? ANS: an individual voter; there are 2700 cases.
(b) What are the variables and which are quantitative and which categorical? ANS: region, sex, education, and vote are categorical; population, age, income, and statusquo are quantitative.
(c) How many of each gender participated in the survey?
> tally(~sex, data = Chile)
sex
   F    M
1379 1321
(d) What was the average income of those participating in the survey? What caveat would you want to mention
when reporting this statistic? ANS: 33,876 but there are 98 missing data values.
> favstats(~income, data = Chile)
  min   Q1 median    Q3    max  mean    sd    n missing
 2500 7500  15000 35000 200000 33876 39503 2602      98
(e) What percentage of the voters surveyed expected to vote against Pinochet? (Read the help document to
determine which variable and value you should be looking at.) ANS: 889/2532. (35.11%)
> tally(~vote, data = Chile)
vote
  A   N   U   Y
187 889 588 868
(f) Voters with incomes greater than 40,000 pesos could be considered rich. What percentage of the rich voters expected to vote against Pinochet? (Hint: first make a dataframe with only rich voters.) ANS: 165/410 (40.2%)
> richguys = subset(Chile, income > 40000)
> tally(~vote, data = richguys)
vote
  A   N   U   Y
 31 165  56 158
6. Suppose that we are studying baseball statistics. We might use two different units of analysis: the player or the
team.
(a) Suppose that we are studying baseball with the unit of analysis being the player. Give an example of an appropriate categorical variable for this situation and an appropriate quantitative variable. ANS: a lot of answers are possible - categorical: team, quantitative: number of home runs.
(b) Do the same in the case that we are studying baseball with the unit of analysis being the team. ANS:
categorical: league, quantitative: number of wins.
7. In the parts below, we list some convenience samples of Calvin students. For each of these methods for sampling
Calvin students, indicate one way in which the sample is likely not to be representative of the population of all
Calvin students.
(a) The students in Mathematics 243A.
(b) The students in Nursing 329.
(c) The first 30 students who walk into the FAC west door after 12:30 PM today.
(d) The first 30 students you meet on the sidewalk outside Hiemenga after 12:30 PM today.
(e) The first 30 students named in the Calvin directory.
(f) The men’s basketball team.
ANS: many answers are possible here and they should be quite obvious.
8. Suppose that we were attempting to estimate the average height of a Calvin student. For this purpose, which
of the convenience samples in the previous problem would you suppose to be most representative of the Calvin
population? Which would you suppose to be least representative? Why?
ANS: Most representative would probably be (c) or (d) (maybe (e)?). Least representative would probably be (b) (nursing students are shorter since they are all female) or (f) (basketball players of course are taller).
9. Consider the set of natural numbers P = {1, 2, . . . , 50} to be a population.
(a) How many prime numbers are there in the population? ANS: 15
(b) If a sample of size 10 is representative of the population, how many prime numbers would we expect to be in the sample? How many even numbers would we expect to be in the sample? ANS: We would expect 15/50 of the numbers to be prime, so 3. We would expect 5 to be even.
(c) Using R choose 5 different random samples of size 10 from the population P . Record how many prime
numbers and how many even numbers are in each sample. Make any comments about the results that strike
you as relevant.
ANS: Of course a wide range of answers is possible. They might comment on finding a sample that is quite
unrepresentative.
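Part (c) can be sketched in base R (no course packages needed); the seed and the hand-listed primes are my additions, and any particular run will of course give different counts.

```r
# Primes up to 50, listed by hand (15 of them)
primes <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)

set.seed(1)  # arbitrary seed, so the runs below are reproducible
for (i in 1:5) {
  s <- sample(1:50, 10)   # a simple random sample of size 10, without replacement
  cat("sample", i, ": primes =", sum(s %in% primes),
      ", evens =", sum(s %% 2 == 0), "\n")
}
```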
10. Sometimes we can improve on simple random sampling by incorporating randomness but in a more complicated
way. As of Day 10 in the fall, the breakdown of the Calvin student body by level was given by the following table
Class Level   Population
First-year         1,078
Sophomore            975
Junior               821
Senior             1,054
Other                106
Total              4,034
If we were to choose a simple random sample, we might get too many juniors or not enough first-year students. But since we know the true proportion of each class, we might separately sample from each class to get an overall sample that is representative of the whole. If a sample of size 200 is desired, explain how many of each class we should choose. (Of course within each class, we should choose a simple random sample. This sampling design is called stratified random sampling.)
ANS: A computation is below. Of course they will have to round to get whole numbers of students.
> students = c(1078, 975, 821, 1054, 106)
> total = sum(students)
> desired = 200 * students/total
> desired
[1] 53.446 48.339 40.704 52.256  5.255
11. The dataframe Allegan (in the Stob package) has data on 57 years of weather in Allegan, MI. Each case is a day.
The variable TMAX is the maximum temperature recorded on the given day.
(a) What is a 50% coverage interval for the maximum daily temperature?
(b) What is a 95% coverage interval for the maximum daily temperature?
(c) What is an interesting feature of the distribution of this variable? (Hint: draw a densityplot.)
qdata(c(0.25, 0.75), ~TMAX, data = Allegan)  # 50% coverage interval
    quantile    p
25%       39 0.25
75%       77 0.75
qdata(c(0.025, 0.975), ~TMAX, data = Allegan)  # 95% coverage interval
      quantile     p
2.5%        20 0.025
97.5%       90 0.975
densityplot(~TMAX, data = Allegan)  # the bimodal distribution of the data is interesting
[Figure: densityplot of TMAX, showing a bimodal distribution]
12. The following histograms and boxplots are of five different datasets. Match the boxplot to the histogram of the
same data. Justify your choices with a sentence or two.
[Figure: five histograms labeled X, Z, U, V, W and five boxplots labeled A, B, C, D, E]
ANS: X and Z are reasonably symmetric. Looks like X is A and Z is B because of the outlier in B. U and W are skewed right like C and E: W is E and U is C. V is skewed left like D.
13. Sometimes it is useful to change the units of a variable. For instance, we might change from inches to feet or from degrees centigrade to Fahrenheit. Obviously, such statistics as the mean, median, variance and standard deviation will change if we do that.
(a) A new variable Y is created from a variable X by multiplying each case of X by a constant c (i.e., Y = cX).
How are the mean, median, variance and standard deviation of Y related to those of X?
ANS: the mean and median are both changed by the same factor c. The variance increases by a factor of c2
and the standard deviation by c.
(b) A new variable Y is created from a variable X by adding a constant d to each case of X. (i.e., Y = X + d).
How are the mean, median, variance and standard deviation of Y related to those of X?
ANS: The variance and standard deviation are unchanged (since they measure variation, not location). The mean and median are increased by d.
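These scaling rules are easy to check numerically; here is a quick base-R check with an arbitrary made-up vector (the values of x, c0, and d0 are mine).

```r
x  <- c(61, 64, 67, 70, 73)  # made-up heights in inches
c0 <- 2.54                   # e.g., converting inches to centimeters
d0 <- 10                     # an arbitrary shift

stopifnot(all.equal(mean(c0 * x), c0 * mean(x)))      # mean scales by c
stopifnot(all.equal(median(c0 * x), c0 * median(x)))  # so does the median (c > 0)
stopifnot(all.equal(var(c0 * x), c0^2 * var(x)))      # variance scales by c^2
stopifnot(all.equal(sd(c0 * x), c0 * sd(x)))          # sd scales by c
stopifnot(all.equal(mean(x + d0), mean(x) + d0))      # mean shifts by d
stopifnot(all.equal(var(x + d0), var(x)))             # variance unchanged by a shift
```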
14. Boxplots are fairly popular. But boxplots cannot show some of the more interesting, and often important, features
of a variable. The data frame faithful has data on over 270 eruptions of the Old Faithful geyser in Yellowstone
National Park. The variable eruptions gives the length in minutes of each of the eruptions. What interesting
feature of the distribution of this variable is shown by a density plot but which cannot be seen in a box plot?
ANS: The boxplot doesn't tell you anything about modes. This distribution is definitely bimodal.
> densityplot(~eruptions, data = faithful)
> bwplot(~eruptions, data = faithful)
[Figure: densityplot and boxplot (bwplot) of eruptions; the densityplot shows two modes]
15. The mean of a variable is one choice for a simple model for that variable. But it is not the only one. For example,
we might choose the median instead.
(a) Use the mean as a model for the GPA of Calvin seniors (using the data in the sr dataframe). Compute the
sum of squares of residuals of this model.
m <- mean(~GPA, data = sr)
resids <- sr$GPA - m
ssr <- sum(resids^2)
ssr
[1] 327
(b) Use instead the median as a model for the GPA of Calvin seniors using the very same data. Compute the
sums of squares of residuals for this model.
me <- median(~GPA, data = sr)
residsm <- sr$GPA - me
ssrm <- sum(residsm^2)
ssrm
[1] 336.1
(c) Compare the sums of squares of residuals for these two models. Which is smaller? (It turns out the mean is
the value that minimizes the sums of squares of residuals.)
ANS: The sum of squares of residuals is smaller for the means model than for the median model.
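The parenthetical claim in (c), that the mean minimizes the sum of squares of residuals, can be illustrated with a small grid search in base R (the data vector here is made up):

```r
y <- c(2, 3, 5, 10)                # arbitrary made-up data; mean(y) = 5
ssr <- function(m) sum((y - m)^2)  # sum of squared residuals using model value m
cand <- seq(0, 12, by = 0.01)      # a grid of candidate model values
best <- cand[which.min(sapply(cand, ssr))]
best                               # 5, which equals mean(y)
```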
16. The Current Population Survey data (CPS85) has data on both the wage of the respondent (wage) and the employment sector in which the respondent worked (sector). Suppose that we try to explain the variation in wages by using sector as an explanatory variable.
(a) What are the model values (i.e., ŷi ) for each of the eight sectors?
(b) What is the variance of the wage variable?
(c) What is the variance of the residuals of the model?
(d) Which observation has the largest (in absolute value) residual?
(e) Construct a side-by-side boxplot of the wages by sector. Notice that the wage distributions for any two sectors overlap by quite a bit. For which sectors is it true that at least 75% of the respondents in that sector made more money than half of those respondents in the manufacturing sector?
M <- mm(wage ~ sector, data = CPS85)
M  # part a

Groupwise Model Call: wage ~ sector

Coefficients:
clerical  const  manag  manuf  other   prof  sales  service
    7.42   9.50  12.70   8.04   8.50  11.95   7.59     6.54

var(~wage, data = CPS85)  # part b
[1] 26.41

var(residuals(M))  # part c
[1] 21.59

favstats(residuals(M))
  min     Q1 median    Q3  max     mean    sd   n missing
-11.7 -3.037 -1.037 2.279 31.8 2.46e-16 4.646 534       0

CPS85[residuals(M) > 31.7, ]  # this observation has the largest residual (part d)
    wage educ race sex hispanic south married exper union age sector
249 44.5   14    W   F       NH    NS  Single     1   Not  21  manag

bwplot(wage ~ sector, data = CPS85)  # part e
[Figure: side-by-side boxplots of wage by sector]
ANS (part e): management, construction, professional
17. This made-up data frame is so small that you can answer all the questions below without the use of a computer
(or even a calculator). Do that.
Sex  Job Rating
 F        5
 F        7
 F        9
 M        4
 M        8
(a) If we use a means model to predict job rating from sex, what are the five values of ŷ for each of the five cases?
(b) What are the five residual values for this model?
(c) How much do we gain by using gender to predict job rating? In other words, quantify the difference between
this model and a model that doesn’t use any explanatory variable at all.
ANS:
Sex  Job Rating  Model  Residual
 F        5        7       -2
 F        7        7        0
 F        9        7        2
 M        4        6       -2
 M        8        6        2
For part (c), note that the model that uses no explanatory variable uses the mean 6.6 as the model value and so has residuals −1.6, 0.4, 2.4, −2.6, 1.4. The variance of the residuals of that model is (1.6² + .4² + 2.4² + 2.6² + 1.4²)/4 = 4.3. The variance of the residuals of this model is 4. That's a fairly modest improvement.
18. The data frame deathpenalty in the Stob package has data on whether the death penalty was given in 326
homicide cases in Florida in 1976 and 1977.
(a) Use an appropriate measure to “prove” that black defendants are less likely to get the death penalty than
white defendants – perhaps a counterintuitive result.
> tally(Penalty ~ Defendant, data = deathpenalty)
        Defendant
Penalty  Black  White
  Death 0.1024 0.1187
  Not   0.8976 0.8812
So whites get the death penalty 12% of the time while blacks get the death penalty only 10% of the time.
(b) The R function cross makes a new variable from two categorical variables. Use the following syntax:
tally(Penalty ~ cross(Defendant, Victim), deathpenalty)
Explain how the result of this computation suggests that a more nuanced conclusion than that of part (a)
might be warranted.
tally(Penalty ~ cross(Defendant, Victim), deathpenalty)
        cross(Defendant, Victim)
Penalty  Black:Black Black:White White:Black White:White
  Death      0.05825     0.17460     0.00000     0.12583
  Not        0.94175     0.82540     1.00000     0.87417
This shows that black defendants are more likely to get the death penalty than white defendants once we control for the race of the victim: 17.5% versus 12.6% when the victim is white, and 5.8% versus 0% when the victim is black.
19. For each of the following stories, identify the explanatory and response variables and indicate whether you think
an observational study or experiment would be used. (Explain your choice in each case.)
(a) Calvin wants to compare the GPAs of women living in Heyns to those living in Noordewier.
(b) A doctor wants to compare the effectiveness of atenolol and quinapril for his patients in controlling high blood
pressure.
(c) A child psychologist wants to find out whether the amount of video game playing by fifth graders is related
to the frequency with which they bully children at school.
20. Suppose that we want to decide who is the better Mathematics 171 instructor of two instructors A and B who are
teaching two different sections.
(a) Give several reasons why we cannot simply compare the instructors by comparing the grades of the students
in class.
(b) There is a uniform final exam in Mathematics 171. Suppose we compare the instructors by comparing the
scores of their students on the uniform final exam. How does that help? What problems still remain?
(c) Suppose that we also give a pretest to all the students in Mathematics 171 and compare the two instructors
by comparing how much their students improved over the course of the semester. How does that help? What
problems still remain?
21. Ultramarathoners often develop respiratory infections after running an ultra. Researchers were interested in
whether vitamin C was useful in reducing these infections. The researchers studied a group of 100 ultra runners.
They gave 50 a daily regimen of 600 mg vitamin C while the other 50 received a placebo. They also studied 100
nonrunners over the same period, also giving 50 of them the vitamin C treatment and 50 of them the placebo. All
200 subjects were watched for 14 days after the race to determine if infections developed.
(a) What are the explanatory and response variables?
(b) What is the name of this experimental design?
(c) Why did the researchers use nonrunners at all if the purpose was to determine whether vitamin C helped
prevent infections in ultramarathoners?
22. (AP Statistics Free Response, 2007, Problem 2) As dogs age, diminished joint and hip health may lead to joint
pain and thus reduce a dog’s activity level. Such a reduction in activity can lead to other health concerns such as
weight gain and lethargy due to lack of exercise. A study is to be conducted to see which of dietary supplements,
glucosamine or chondroitin, is more effective in promoting joint and hip health and reducing the onset of canine
osteoarthritis. Researchers will randomly select a total of 300 dogs from ten different large veterinary practices
around the country. All of the dogs are more than 6 years old, and their owners have given consent to participate
in the study. Changes in joint and hip health will be evaluated after 6 months of treatment.
(a) What would be an advantage to adding a control group in the design of this study?
(b) Assuming a control group is added to the other two groups in the study, explain how you would assign the 300 dogs to these three groups for a completely randomized design.
(c) Rather than using a completely randomized design, one group of researchers proposes blocking on clinics, and
another group of researchers proposes blocking on breed of dog. How would you decide which one of these
two variables to use as a blocking variable?
23. The dataframe normaltemp has data on the temperatures of 130 Calvin students (taken in Psychology 151 in
1998).
(a) What is the sample mean temperature for this group of students?
mean(~Temp, data = normaltemp)
[1] 98.25
(b) Use the bootstrap to compute a standard error for your statistic.
r <- do(1000) * mean(~Temp, data = resample(normaltemp))
sd(~result, data = r)  # of course different answers are possible due to the simulation
[1] 0.06585
(c) It is folklore that the “normal” body temperature is 98.6 degrees. Explain how your analysis in (b) gives
strong evidence that this folklore is wrong.
ANS: 98.6 is several standard errors away from our estimated mean so it is unlikely that 98.6 is the true
mean.
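One way to make "several standard errors" concrete, using the numbers from parts (a) and (b) (the exact ratio will vary with the bootstrap run):

```r
xbar <- 98.25    # sample mean from part (a)
se   <- 0.06585  # bootstrap standard error from part (b)
(98.6 - xbar) / se  # about 5.3 standard errors above the sample mean
```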
24. The normaltemp data also records the gender of each individual. (Unfortunately gender has been coded as a
quantitative rather than categorical variable. To convert this variable to categorical, use the function factor. For
example bwplot(Temp∼factor(Gender),data=normaltemp) draws a boxplot that you will want to look at.)
(a) What is the mean temperature for each gender group in this sample?
mean(Temp ~ factor(Gender), data = normaltemp)
# most students will have changed Gender to Male and Female
    1     2
98.10 98.39
(b) Use the bootstrap to compute confidence intervals for the mean temperature of each of men and women.
r <- do(1000) * mean(Temp ~ factor(Gender), data = resample(normaltemp))
confint(r, method = "quantile")  # either the quantile or standard error method is okay
  name lower upper level   method
1    1 97.94 98.27  0.95 quantile
2    2 98.22 98.57  0.95 quantile
(c) Does your analysis in (b) suggest that there is a difference in the population in the mean temperature of men and women?
ANS: Not obviously. The two confidence intervals overlap.
25. The bootstrap can help us estimate parameters other than the mean of the population. Using the normaltemp
data, compute an estimate of the third quartile of the temperature distribution of the population in question and
its standard error. (Note that you will have to look at the dataframe that results from your do function to see
where the various sample statistics are.)
qdata(0.75, ~Temp, data = normaltemp)
       p quantile
    0.75    98.70
r <- do(1000) * qdata(0.75, ~Temp, data = resample(normaltemp))
sd(~quantile, data = r)  # answers will vary somewhat
[1] 0.06466
26. The tips dataframe in the reshape2 package has data on the tips received by a waiter in a certain restaurant.
(a) Fit the model tip ~ 1. What is the coefficient? What is the sum of squares of residuals?
l <- lm(tip ~ 1, data = tips)
l

Call:
lm(formula = tip ~ 1, data = tips)

Coefficients:
(Intercept)
          3

sum(residuals(l)^2)
[1] 465.2
(b) Fit the model tip ~ total_bill. What are the coefficients? What is the sum of squares of residuals?
l2 <- lm(tip ~ total_bill, data = tips)
l2

Call:
lm(formula = tip ~ total_bill, data = tips)

Coefficients:
(Intercept)   total_bill
      0.920        0.105

sum(residuals(l2)^2)
[1] 252.8
(c) For each of the two models above, what is the predicted tip for a $20 bill?
ANS: For the first model, 2.998 (i.e., $3.00). For the second model, 3.0203.
(d) From the results of the preceding two parts, do you think it important to know the total bill in predicting the tip? Give a statistical answer.
ANS: The question should have said the first two parts. In that case, note the reduced sum of squares of residuals.
27. The SAT dataframe in the mosaic package has data on state by state educational inputs and outputs (for high
schools). The variable sat has average SAT scores of high school students in the state and the variable expend
has data on per pupil expenditures on education.
(a) Fit the model sat∼1 + expend.
lsat <- lm(sat ~ expend, data = SAT)
lsat

Call:
lm(formula = sat ~ expend, data = SAT)

Coefficients:
(Intercept)       expend
     1089.3        -20.9
(b) Interpret the coefficient of expend in the model. What seems odd about this?
ANS: For each increase of a thousand dollars in expenditures per student, the predicted decrease in average SAT score is 20.9 points.
(c) Examine the other variables in the dataframe. Do you have an explanation for this “odd” result?
ANS: Different answers are possible (including "no") but some students will see that the higher-SAT states are those with lower participation rates in the SAT.
28. Suppose that response is a quantitative response variable and color is a categorical variable with the levels Red,
Green, and Blue. Suppose that the fitted model from R gave the following coefficients:
(Intercept)  ColorBlue  ColorGreen
         15          8          13
(a) What is the mean of response for all those cases that are Blue?
(b) What is the mean of response for all those cases that are Red?
(c) We cannot determine the mean of response for all cases. Why not?
(d) What are the possible values of the mean of response for all cases (using only the information given)?
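A sketch of how R's reference coding produces coefficients like these, using a made-up dataset whose group means are exactly 15 (Red), 23 (Blue), and 28 (Green); the data and the level ordering are my own, chosen so that the fitted coefficients match the numbers in the problem.

```r
d <- data.frame(
  color = factor(c("Red", "Red", "Blue", "Blue", "Green", "Green"),
                 levels = c("Red", "Blue", "Green")),  # make Red the reference level
  response = c(14, 16, 22, 24, 27, 29)                 # group means: 15, 23, 28
)
coef(lm(response ~ color, data = d))
# (Intercept) = 15 (the Red mean); colorBlue = 23 - 15 = 8; colorGreen = 28 - 15 = 13
```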
29. Using the CPS85 data from the mosaic package, write a model and then use lm to answer the following questions.
(a) What is the average age of single people in the sample?
(b) What is the average age difference between single and married people in the sample?
(c) Write a good statistical statement that describes how wages vary with age.
(d) Write a good statistical statement that describes how wages vary by gender with age being held fixed.
30. In the SAT dataset of the mosaic package are variables on education in the states.
(a) Fit the model sat ~ expend + frac.
(b) What are the units of the three coefficients in the model?
(c) Compare this model to problem 27. How does the model of this problem help understand the “strangeness”
of problem 27?
31. A certain drug has a side effect of increasing hemoglobin levels in individuals. The effect depends on the dosage
strength and is somewhat different for males and females. Consider the model
Hemoglobin ∼ 1 + dose + sex + sex:dose
[Figure: fitted lines of Hemoglobin versus dose, one for F and one for M]
Suppose that Males are the reference group for sex (so that the model formula has a variable sexF). In each part below, choose the correct response (positive, negative, or 0) and give a short explanation for your choice.
(a) The coefficient of the intercept in the model formula will be:
ANS: POSITIVE (it looks to be about 14; it's the intercept of the male line)
(b) The coefficient of dose in the model formula will be:
ANS: POSITIVE (it's the slope of the line for males)
(c) The coefficient of sexF in the model formula will be:
ANS: NEGATIVE (the female line has a lower y-intercept than the male line)
(d) The coefficient of dose:sexF in the model formula will be:
ANS: NEGATIVE (the female line has a lower slope than the male line)
32. The data frame mammals in the MASS package has data on the brain and body weight of various mammals.
(a) The model brain ~ 1 + body doesn't seem like a very good model. Why not? (Look at a graph.)
xyplot(brain ~ body, data = mammals)
[Figure: scatterplot of brain versus body]
ANS: Looks concave up rather than linear. (Perhaps also because it spans too many orders of magnitude.)
(b) An appropriate transformation of both brain and body size results in a much better model. Find such a
transformation. (Hint: you might notice that these data extend over several orders of magnitude.)
ANS: While there are a couple of possibilities, I think the best by far is log, log
l <- lm(log(brain) ~ log(body), data = mammals)
coef(l)
(Intercept)   log(body)
     2.1348      0.7517
xyplot(log(brain) ~ log(body), data = mammals, type = c("p", "r"))
[Figure: scatterplot of log(brain) versus log(body) with the fitted line]
33. A dataset available from Danny’s website Diamonds has data on 308 diamonds. You can get the dataset by
> Diamonds = fetchData("Diamonds.csv")
Retrieving from http://www.mosaic-web.org/go/datasets/Diamonds.csv
> head(Diamonds)
  carat color clarity certification price
1  0.30     D     VS2           GIA  1302
2  0.30     E     VS1           GIA  1510
3  0.30     G    VVS1           GIA  1510
4  0.30     G     VS1           GIA  1260
5  0.31     D     VS1           GIA  1641
6  0.31     E     VS1           GIA  1555
The variables include
carat          weight (one carat is 200 mg)
color          on a scale from D (colorless) to Z (yellow)
clarity        a categorical variable, see http://www.diamondinfo.org
certification  third party certification organization
price          price in dollars
(a) Anyone who knows diamonds knows that diamonds of greater carat weight are more highly valued. Fit a
model price ∼1 + carat.
ld <- lm(price ~ carat, data = Diamonds)
coef(ld)
(Intercept)       carat
      -2298       11599

price-hat = -2298 + 11599 carat
(b) From the graph, you might suppose that a second degree polynomial fits the data somewhat better. That is, you might want to fit the model price ~ 1 + carat + carat². R provides a function that makes this easy to do:
> l <- lm(price ~ poly(carat, 2), data = Diamonds)
Fit this model and write the model equation in more conventional notation than that returned by R.
ANS: Unfortunately they are going to get the equation wrong if they use poly since it uses orthogonal polynomials instead of the usual terms. The usual terms can be gotten by adding an argument to poly that they may not have heard me say.
> lbad <- lm(price ~ poly(carat, 2), data = Diamonds)
> lgood <- lm(price ~ poly(carat, 2, raw = TRUE), data = Diamonds)  # this gives the right coefficients
> lgoodtoo <- lm(price ~ 1 + carat + I(carat^2), data = Diamonds)   # this does too
> coef(lbad)
    (Intercept) poly(carat, 2)1 poly(carat, 2)2
           5019           56332            8152
> coef(lgood)
(Intercept)  poly(carat, 2, raw = TRUE)1  poly(carat, 2, raw = TRUE)2
     -42.51                      2786.10                      6961.71
> coef(lgoodtoo)
(Intercept)       carat  I(carat^2)
     -42.51     2786.10     6961.71
ANS: So the best model is

price-hat = -43 + 2786 carat + 6962 carat²
(c) Give an argument that the quadratic model fits the data better.
ANS: They might look at a graph. Or they might compare sums of squares of residuals.
sum(residuals(ld)^2)     # linear model
[1] 382178624
sum(residuals(lgood)^2)  # quadratic model
[1] 315717826
34. The dataframe KidsFeet in the mosaic package has data on a number of children including measurements of their
feet.
(a) Fit the following model for the width of the feet: width ~ 1 + length. List the coefficients and find the value of R2 for this model.
l <- lm(width ~ length, data = KidsFeet)
coef(l)
(Intercept)      length
     2.8623      0.2479
rsquared(l)
[1] 0.411
(b) Now fit the model width ∼ 1 + length + sex and find the value of R2 . Would you say that knowing the
gender of the student is of significant help in predicting the width of a foot given the length?
l2 <- lm(width ~ 1 + sex + length, data = KidsFeet)
coef(l2)
(Intercept)        sexG      length
     3.6412     -0.2325      0.2210
rsquared(l2)
[1] 0.4595
ANS: length alone explains 41% of the variation in foot width while including sex explains 46% of the variation.
Reasonable people could disagree about whether this is a significant improvement but it doesn’t seem so to
me!
35. Suppose that y is a response variable, x and z are quantitative explanatory variables, and g and h are categorical
explanatory variables. Consider the following models:
A: y ~ 1
B: y ~ 1 + x
C: y ~ 1 + x + g
D: y ~ 1 + z + h
E: y ~ 1 + x + g + h
F: y ~ 1 + x + g + x:g
G: y ~ 1 + x + z + g + h
H: y ~ 1 + z + g + h
For some pairs of the above models, the R2 for one is certainly less than or equal to the R2 of the other. For other pairs, it depends on what the variables are as to which R2 is greater. List all pairs of models for which it is possible to determine the relationship of R2 and say what that relationship is. (For example, the R2 for model A is less than or equal to that of model B, and so forth.)
ANS: In terms of R2 , A ≤ B ≤ C ≤ E ≤ G, A ≤ B ≤ F , A ≤ B ≤ G, A ≤ C ≤ F , A ≤ D ≤ H ≤ G. Those are
the only pairs that we can compare.
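The general fact behind these comparisons, that adding terms to a model can never decrease R², is easy to check numerically on simulated data (the data-generating choices here are arbitrary, and base-R lm is used in place of any course package):

```r
set.seed(2)
n <- 50
x <- rnorm(n); z <- rnorm(n)
g <- factor(sample(c("a", "b"), n, replace = TRUE))
h <- factor(sample(c("p", "q"), n, replace = TRUE))
y <- 1 + 2 * x + rnorm(n)  # an arbitrary true relationship

r2 <- function(m) summary(m)$r.squared
# R^2 never goes down as a model grows to a larger nested model:
stopifnot(r2(lm(y ~ 1 + x)) <= r2(lm(y ~ 1 + x + g)))
stopifnot(r2(lm(y ~ 1 + x + g)) <= r2(lm(y ~ 1 + x + g + h)))
stopifnot(r2(lm(y ~ 1 + x + g + h)) <= r2(lm(y ~ 1 + x + z + g + h)))
```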
36. The dataframe trees has data on the Volume, Girth, and Height of some felled cherry trees. Using R2 as a guide, give a good reason for using the model Volume ~ 1 + Girth + Girth² instead of either a first degree or third degree polynomial.
rsquared(lm(Volume ~ Girth, data = trees))
[1] 0.9353
rsquared(lm(Volume ~ Girth + I(Girth^2), data = trees))
[1] 0.9616
rsquared(lm(Volume ~ poly(Girth, 3), data = trees))
[1] 0.9627
For the linear model, R2 is 93% and for both the quadratic and cubic model R2 is 96%. So the second degree
polynomial explains about half of the variation that the first degree polynomial doesn’t explain but the cubic
model explains hardly any of the rest of the variation.
37. The Duncan dataframe of the car package has data (from 1950) on 45 different professions. Each of the 45
professions was rated on prestige and the typical characterstics of persons in that profession.
prestige   percent of raters rating occupation as excellent or good
income     percent of males earning $3,500 or more
education  percent of males who were high school graduates
(a) Fit a model prestige ∼ 1 + income.
l <- lm(prestige ~ income, data = Duncan)
coef(l)
(Intercept)      income
      2.457       1.080
(b) Fit a model prestige ∼ 1 + income + education.
l2 <- lm(prestige ~ income + education, data = Duncan)
coef(l2)
(Intercept)      income   education
    -6.0647      0.5987      0.5458
(c) Explain why the coefficient of income is different in the two models. Give an explanation that explains the
size of the difference.
l3 <- lm(education ~ income, data = Duncan)
coef(l3)
(Intercept)      income
    15.6114      0.8824
ANS: Since education is related to income (it's a confounding variable), the effect of income on prestige will be
different when education is controlled for. To see where the 1.080 in the first model comes from, notice that
if income changes one unit, the change in prestige is .60 units if education is held fixed. However, a change
in income of one unit also means a change in education of .88 units, so the change in prestige given a change of
one unit in income is .88 ∗ .55 + .60 ≈ 1.08.
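The size of the difference can be checked with a little arithmetic, using the rounded coefficients from the fits above:

```r
# Simple-regression slope of income = direct effect + indirect effect
# through education (coefficients rounded from coef(l2) and coef(l3)).
direct   <- 0.5987            # prestige per unit income, education held fixed
indirect <- 0.8824 * 0.5458   # (education per income) * (prestige per education)
direct + indirect             # about 1.080, the slope in the one-variable model
```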
38. The story in the following link
http://news.sciencemag.org/social-sciences/2014/02/scienceshot-why-you-should-talk-your-baby
describes the result of a study done on verbal abilities in babies.
(a) The headline claims that parents should talk to their babies. The headline is the summary of a conclusion of
a study. What are the explanatory and response variables of that study?
ANS: It’s actually confusing. The response variable appears to be word processing speed. The explanatory
variable of the study appears to be parent-child banter but then it switches to SES.
(b) One of the variables in the study is socioeconomic status of the parents. What is the role of that variable in
this study?
ANS: It appears that SES is being used as a stand-in for how much parents talk to their kids.
39. The following study
http://well.blogs.nytimes.com/2010/06/15/eating-brown-rice-to-cut-diabetes-risk/?_php=true&_type=
blogs&_r=0
makes claims about diets that include brown rice.
(a) What are the explanatory and response variables referred to in the headline?
ANS: Explanatory variable is eating brown rice and the response variable is diabetes risk.
(b) What covariates did the article imply were considered?
ANS: The researchers tried to control for the fact that Americans who eat brown rice tend to be more healthy
overall - they eat more fruits and vegetables and less red meat and trans fats, and they also tend to be thinner,
more active and less likely to smoke than those who don’t eat brown rice. (That’s a quote from the article!)
40. Do problem 10.04 from the collection of exercises from the book. (Link here or on the homework webpage.)
ANS: See attached.
41. Suppose that a manufacturer of computer chips claims to produce no more than 5% defective chips. You inspect
100 chips produced.
(a) Would you have a strong reason to doubt the claim of the manufacturer if you find that 6 of the chips are
defective? Why or why not?
(b) If in part (a) you said no, how many defective chips would you have to find before you think that you would
have a strong reason to doubt the manufacturer? Defend your choice.
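A hedged sketch for part (a), assuming the chips are independent and the claimed 5% defect rate holds exactly, so the count of defectives follows a Binomial(100, 0.05) model:

```r
# Under X ~ Binomial(100, 0.05), how unusual are 6 or more defective chips
# if the manufacturer's claim is true?
1 - pbinom(5, 100, 0.05)   # about 0.38
```

Since 6 or more defectives occur nearly 40% of the time even when the claim is true, 6 defectives alone is weak evidence against the manufacturer.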
42. Breanna Verkaik made 72 out of 89 free throws this past basketball season.
(a) Use an appropriate binomial model to estimate the probability that Breanna would make all 10 free throws
if she shot 10.
(b) What assumption of the binomial model might be questioned in this situation?
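For part (a), a sketch: take her season make rate 72/89 as the binomial success probability and assume the shots are independent with constant probability.

```r
# P(10 makes in 10 shots) under Binomial(10, 72/89)
p_hat <- 72 / 89
dbinom(10, 10, p_hat)   # same as p_hat^10, about 0.12
```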
43. Do problem 11.23 from the textbook exercises. here
44. Do problem 11.33 from the textbook exercises. here
45. Doproblem 11.34 from the textbook exercises. here
46. The whiteside data in the MASS package has data on gas usage in a certain house before and after the owner
installed insulation.
Gas    weekly gas consumption in 1000s of cubic feet
Insul  before or after insulation was installed
Temp   the average outside temperature in degrees Celsius
(a) Fit the model Gas ∼ 1 + Insul + Temp.
l <- lm(Gas ~ Insul + Temp, data = whiteside)
l
Call:
lm(formula = Gas ~ Insul + Temp, data = whiteside)
Coefficients:
(Intercept)   InsulAfter         Temp
      6.551       -1.565       -0.337
(b) Compute a 95% confidence interval for Insul using the bootstrap method.
r <- do(1000) * lm(Gas ~ Insul + Temp, data = resample(whiteside))
confint(r)
        name   lower   upper level method estimate margin.of.error
1  Intercept  6.3035  6.7944  0.95 stderr   6.5489         0.24544
2 InsulAfter -1.7528 -1.3776  0.95 stderr  -1.5652         0.18758
3       Temp -0.3758 -0.2954  0.95 stderr  -0.3356         0.04019
4      sigma  0.2872  0.4059  0.95 stderr   0.3465         0.05932
5  r.squared  0.8607  0.9593  0.95 stderr   0.9100         0.04930
The confidence intervals may vary a bit because of the simulation.
(c) Compute a 95% confidence interval for Insul using the standard method. Comment on any difference you
see.
confint(l)
              2.5 %  97.5 %
(Intercept)  6.3145  6.7882
InsulAfter  -1.7599 -1.3705
Temp        -0.3723 -0.3011
At least in my simulation, these intervals are very close.
(d) Find fitted values of the model for an average temperature of 0 both before and after insulation.
f <- makeFun(l)
f(Insul = "Before", Temp = 0)
1
6.551
f(Insul = "After", Temp = 0)
1
4.986
(e) Write a 95% prediction interval for each of your fitted values in part (d).
f(Insul = "Before", Temp = 0, interval = "prediction")
    fit   lwr   upr
1 6.551 5.796 7.306
f(Insul = "After", Temp = 0, interval = "prediction")
    fit  lwr   upr
1 4.986 4.24 5.732
(f) Compare the two intervals and write a nice sentence saying qualitatively what these two intervals tell us.
Perhaps one thing to note is that the intervals do not overlap so if we were predicting the gas usage for two
days of 0 temperature, we would be fairly confident in predicting a lower gas usage in the after insulation
condition.
47. It might be expected that more successful baseball teams might tend to draw more fans to the ballpark. The
dataframe Baseball21 in the Stob package has data on the first 11 years of major league baseball seasons in the
21st century. Here are a few of the variables:
attendance  the attendance at home games for the team on the season
W           the number of games won by the team in the season
R           runs scored by the team
LG          the league, American or National, that the team is in
(a) Fit the model attendance ∼ 1 + W.
l <- lm(attendance ~ W, data = Baseball21)
l
Call:
lm(formula = attendance ~ W, data = Baseball21)
Coefficients:
(Intercept)            W
    -135675        31942
(b) Compute a confidence interval for the coefficient of wins in the model of part (a). Write a sentence that
interprets this coefficient.
confint(l)
              2.5 %  97.5 %
(Intercept) -627820  356470
W             25925   37959
We can be fairly confident that on average an additional win increases attendance by between 26,000 and
38,000.
(c) Note the confidence interval for the intercept term. Why is it so wide?
The values of the explanatory variable are a fair distance from W = 0. So the intercept value is quite uncertain
– small changes in the fitted slope mean large changes in the fitted intercept.
(d) Now fit a model attendance ∼ 1 + W + R. One might consider this model by thinking that teams that score
more runs, even if they lose, are more interesting to watch and so might attract more fans. Does the fitted
model support this conjecture?
ll <- lm(attendance ~ W + R, data = Baseball21)
confint(ll)
                  2.5 %  97.5 %
(Intercept) -1081344.7  329941
W              22422.6   37306
R               -586.1    1646
Not really. The confidence interval suggests that we cannot even be confident of the sign of the coefficient of
R, so it is not clear that increasing runs increases attendance while holding wins fixed.
(e) You might instead think that National League teams are, in general, more interesting to watch and so might
attract more fans. Fit a model and decide whether the data support this conjecture.
lll <- lm(attendance ~ W + LG, data = Baseball21)
lll
Call:
lm(formula = attendance ~ W + LG, data = Baseball21)
Coefficients:
(Intercept)            W         LGNL
    -295141        32430       224955
confint(lll)
              2.5 %  97.5 %
(Intercept) -789765  199484
W             26495   38364
LGNL          86188  363722
There does seem to be evidence that National League teams can be predicted to attract more fans on average,
even holding wins fixed.
48. At this website, http://www.yale.edu/infantlab/socialevaluation/Helper-Hinderer.html are some videos
that show some of the experiments done on the question of whether very young children already have a sense of
social relationships. The first two videos were shown to several young children. In the first video, a triangle helps
the circle up the hill. In the second video, the square hinders the circle from reaching the top of the hill. After
a child is shown both sequences, they are tested to see whether they have a preference for the helper triangle or
the hinderer square. The first video in the second row shows a pre-verbal 6 month old child clearly showing a
preference for the helper triangle. In one particular run of this experiment, 12 children were tested and 9 preferred
the helper in this way.
(a) What is the natural null hypothesis in this experiment?
ANS: That children would prefer the helper and hinderer equally.
(b) If the null hypothesis is true, how many children are expected to choose the helper rather than the hinderer?
ANS: 6 of the 12.
(c) Is the result in this experiment, 9 out of 12 favoring the helper, strong evidence against the null hypothesis?
Give a quantitative justification for your claim.
1 - pbinom(8, 12, 0.5)
[1] 0.073
About 7.3% of the time we would see 9 or more of the kids choosing the helper even if there were no difference
in the population. Probably not strong evidence.
(d) Of course it was important to conduct this experiment very carefully in order to ensure that it was the
helper-hinderer distinction that was important. Give two examples of possible confounding variables and explain
how randomization of some aspect of the experiment would address each.
ANS: Various answers are possible. The shape and the order in which the scenes were shown are two obvious
ones. We should randomly assign the shapes to the two roles and also do the videos in random order.
(e) In this experiment, where specifically should blinding have been used?
ANS: Various answers are possible. But certainly the researcher showing the baby the two shapes should not
know which is the helper and which is the hinderer.
49. In this problem, you will think about the relationship between confidence intervals and hypothesis tests. Recall
the dataset from the last test on the survival time of Alzheimer’s patients.
mydata <- fetchData("CSHA.csv")
Retrieving from http://www.mosaic-web.org/go/datasets/CSHA.csv
The variables are
Gender     gender
Education  number of years of education
AAO        age at onset of Alzheimer's disease
Survival   number of days from onset of Alzheimer's until death
(a) Suppose that you are investigating the model Survival ∼ 1 + Gender. For this model, what is the most
reasonable null hypothesis?
ANS: Gender doesn't affect survival time.
(b) Construct a confidence interval for the coefficient of Gender in the model. Explain how that confidence
interval might indicate that there is not enough evidence to say that the null hypothesis is false.
csha = fetchData("CSHA.csv")
Retrieving from http://www.mosaic-web.org/go/datasets/CSHA.csv
l <- lm(Survival ~ Gender, data = csha)
confint(l)
              2.5 % 97.5 %
(Intercept) 1909.44 2373.8
GenderM      -50.35  513.7
Notice that the confidence interval contains 0. This means that we do not have enough evidence to rule out
a Gender effect of 0, i.e., we cannot reject the null hypothesis that gender doesn't matter.
50. The males of stalk-eyed flies have long eye stalks. Is the male’s long eye-stalk affected by the quality of its diet?
Two groups of male flies were reared on different diets. One group was fed corn and the other cotton wool. The
eye spans (distance between the eyes) were measured in mm. The data are in the package Stalkies2 in the abd
package.
eye.span  eye span in mm
food      diet (Corn or Cotton)
(a) Identify the null and alternate hypotheses.
ANS: Null - diet doesn't affect distance between eyes. ALT: diet does.
(b) If we fit the model eye.span ∼ 1 + food, what is a reasonable test statistic?
ANS: The coefficient of food and R2 are both reasonable choices.
(c) Do a simulation to compute an approximate P -value for this test statistic.
ANS: Their computation might vary somewhat and they might use either of the two statistics. But the
P -value in either case should be 0 or almost 0.
require(abd)
Loading required package: abd
Loading required package: nlme
l <- lm(eye.span ~ food, data = Stalkies2)
fd = coefficients(l)[2]
rs = rsquared(l)
fd
foodCotton
-0.5042
rs
[1] 0.59
r <- do(1000) * lm(shuffle(eye.span) ~ food, data = Stalkies2)
pdata(rs, ~r.squared, r)
[1] 1
pdata(fd, ~foodCotton, r)
foodCotton
0
(d) Is there strong evidence against the null hypothesis?
ANS: Yes. A P -value that is essentially 0 means that the data are not consistent with the null hypothesis at
all.
51. The dataframe pulp in the faraway package has data on an experiment to test the brightness of paper produced
by four different shift operators. (The units have been lost and these are not typical units. In standard units,
paper brightness is usually between 80 and 90.)
bright    brightness of the pulp
operator  operator a-d
Use a randomization test to determine if the shift operator makes a difference in the brightness of paper produced.
l <- lm(bright ~ operator, data = pulp)
r = rsquared(l)
d <- do(1000) * rsquared(lm(shuffle(bright) ~ operator, data = pulp))
pdata(r, ~result, data = d)
[1] 0.968
The P-value of the test statistic is 1 − 0.968 = 0.032 (pdata gives the proportion of shuffled R2 values below the
observed one), so there is reasonably strong evidence that the operator makes a difference.
52. The dataframe corrosion in the faraway package has data on the corrosion of test bars with various percentages
of added iron.
Fe    percentage of added iron
loss  material lost due to corrosion
Fit a model loss ∼ 1 + Fe.
l <- lm(loss ~ Fe, data = corrosion)
l
Call:
lm(formula = loss ~ Fe, data = corrosion)
Coefficients:
(Intercept)           Fe
        130          -24
(a) Compute R2 for this model.
rsquared(l)
[1] 0.9697
(b) What is the obvious null hypothesis in this situation?
ANS: That iron content (Fe) has no effect on loss due to corrosion.
(c) What is the expected value of R2 if the null hypothesis is true?
ANS: There are n = 13 cases and m = 2 predictors. So the typical value of R2 is 1/12.
(d) Compute the F -statistic. You could have R do this, but computing it from the formula will make you feel
really smart.
ANS:
F = [R2/(m − 1)] / [(1 − R2)/(n − m)] = (.97/1)/(.03/11) = 355
Depending on rounding, you could get a variety of answers. R computes it as 352.3.
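The arithmetic can be checked directly in R; this sketch uses the rounded R2 reported above, which is why it lands near, not exactly on, R's value.

```r
# Checking the F formula with the values from this problem:
# n = 13 cases, m = 2 coefficients, R^2 = 0.9697 from rsquared(l).
n <- 13; m <- 2; r2 <- 0.9697
Fstat <- (r2 / (m - 1)) / ((1 - r2) / (n - m))
Fstat   # about 352, close to the F statistic R reports
```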
53. Do problem 14.22 of the problems from the text, found here.
The entries in the R2 column will vary by simulation.
 p   p/(n − 1)   Mean R2
 1      .03
 3      .08
10      .26
20      .53
37      .97
38        1
For p = 38, there are enough variables to fit the data exactly.
54. The nels88 dataframe in the faraway package has data on a mathematics test taken as part of a national study.
Some variables are
math    the mathematics test score of the student
ses     the socioeconomic status of the family of the child
paredu  the parents' educational status
For the following models, complete the following table
Model                            SSModel   df   SSResid   df   F
math ∼ 1 + sex
math ∼ 1 + sex + paredu
math ∼ 1 + sex + paredu + ses
l1 <- lm(math ~ sex, data = nels88)
l2 <- lm(math ~ sex + paredu, data = nels88)
l3 <- lm(math ~ sex + paredu + ses, data = nels88)
anova(l1)
Analysis of Variance Table
Response: math
           Df Sum Sq Mean Sq F value Pr(>F)
sex         1      9     9.2    0.07   0.79
Residuals 258  32107   124.4
anova(l2)
Analysis of Variance Table
Response: math
           Df Sum Sq Mean Sq F value Pr(>F)
sex         1      9       9    0.13   0.72
paredu      5  13708    2742   37.70 <2e-16
Residuals 253  18400      73
anova(l3)
Analysis of Variance Table
Response: math
           Df Sum Sq Mean Sq F value Pr(>F)
sex         1      9       9    0.13  0.721
paredu      5  13708    2742   38.22 <2e-16
ses         1    324     324    4.52  0.034
Residuals 252  18076      72
Last column of the table may vary a bit due to rounding.
Model                            SSModel   df   SSResid   df       F
math ∼ 1 + sex                         9    1     32107  258     .07
math ∼ 1 + sex + paredu            13717    6     18400  253   31.43
math ∼ 1 + sex + paredu + ses      14041    7     18076  252   27.96
55. Do problem 15.02 of the textbook problems.
56. Do problem 15.10 of the textbook problems. (The data are in TenMileRace in the mosaic package.)
57. The dataframe cathedral in the faraway package has data on the cathedral nave heights and lengths of several
cathedrals in England.
x      the length of the nave
y      the height of the nave
style  the style of the cathedral, romanesque or gothic
Fit a model to predict the height of the nave from the length of the nave and the style. (y ∼ 1 + x + style)
(a) Write the equation of the fitted model.
(b) Use R to estimate the P -value for the null hypothesis that style does not matter in this model.
(c) Instead, use a simulation method (using shuffle) to compute the same P -value.
(d) Should style be included in this model?
58. The dataframe Ericksen in the car package has data on the undercount rate in the 1980 census. See the help
document for a description of all the variables. The response variable is undercount which is the percentage of
people undercounted in a given region.
(a) Make a big model for undercount by including all the other available variables as explanatory variables.
Don’t include any interaction terms however.
l = lm(undercount ~ minority + crime + poverty + language + highschool + housing +
city + conventional, data = Ericksen)
l
Call:
lm(formula = undercount ~ minority + crime + poverty + language +
highschool + housing + city + conventional, data = Ericksen)
Coefficients:
 (Intercept)      minority         crime       poverty      language
     -0.6114        0.0798        0.0301       -0.1784        0.2151
  highschool       housing     citystate  conventional
      0.0613       -0.0350       -1.1600        0.0370
(b) Make an argument that several of the variables are clearly not necessary in the model.
summary(l)
Call:
lm(formula = undercount ~ minority + crime + poverty + language +
highschool + housing + city + conventional, data = Ericksen)
Residuals:
   Min     1Q Median     3Q    Max
-2.836 -0.803 -0.055  0.705  4.247
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.61141    1.72084   -0.36  0.72368
minority      0.07983    0.02261    3.53  0.00083
crime         0.03012    0.01300    2.32  0.02412
poverty      -0.17837    0.08492   -2.10  0.04012
language      0.21512    0.09221    2.33  0.02320
highschool    0.06129    0.04477    1.37  0.17642
housing      -0.03496    0.02463   -1.42  0.16126
citystate    -1.15998    0.77064   -1.51  0.13779
conventional  0.03699    0.00925    4.00  0.00019
Residual standard error: 1.43 on 57 degrees of freedom
Multiple R-squared: 0.708, Adjusted R-squared: 0.667
F-statistic: 17.2 on 8 and 57 DF, p-value: 1.04e-12
anova(l)
Analysis of Variance Table
Response: undercount
             Df Sum Sq Mean Sq F value  Pr(>F)
minority      1  195.8   195.8   96.23 7.6e-14
crime         1   29.2    29.2   14.34 0.00037
poverty       1    5.4     5.4    2.67 0.10792
language      1   11.6    11.6    5.71 0.02020
highschool    1    2.9     2.9    1.41 0.23957
housing       1    0.2     0.2    0.07 0.78692
city          1    3.2     3.2    1.58 0.21345
conventional  1   32.5    32.5   15.98 0.00019
Residuals    57  116.0     2.0
An argument should be made on the basis of the summary or the anova but it appears that we should get
rid of at least highschool, housing, and city.
(c) Make a smaller model for undercount and give a statistical reason for believing that this smaller model is to
be preferred to the larger model.
lsmall = lm(undercount ~ minority + crime + poverty + language + conventional,
data = Ericksen)
anova(lsmall, l)
Analysis of Variance Table
Model 1: undercount ~ minority + crime + poverty + language + conventional
Model 2: undercount ~ minority + crime + poverty + language + highschool +
housing + city + conventional
  Res.Df RSS Df Sum of Sq    F Pr(>F)
1     60 125
2     57 116  3      8.79 1.44   0.24
Notice that the P -value here is .24, which means that there is no reason to think that the larger model explains
more variation than the smaller.
59. In this problem we are going to develop a hypothesis test and determine its power for a typical testing situation.
Suppose a certain medical test is marketed with the statement that it detects a certain medical condition 90% of
the time that the patient actually has the condition. (90% is called the sensitivity of the test.) To test this claim,
a market research firm finds people with the condition and gives them the test.
(a) Suppose that the firm tests 20 persons with the disease. How many patients should test positive for the
condition if the claim of the company is true?
Obviously, 18!
(b) What is the probability that 16 or fewer of 20 persons would test positive? (Hint: you might want to think
of the binomial model.)
pbinom(16, 20, 0.9)
[1] 0.133
(c) If the testing firm sets a significance level of 5%, when will they reject the null hypothesis if they test 20
patients?
pbinom(15, 20, 0.9)
[1] 0.04317
They should reject the null if 15 or fewer patients test positive.
(d) Suppose that the test only detects the disease in 80% of the patients that have it. What is the power of this
hypothesis test at the 5% level of significance?
pbinom(15, 20, 0.8)
[1] 0.3704
(e) Suppose that the testing firm desires 95% power and a 5% level of significance. How many people should the
firm test if the true sensitivity is 80%?
The best answer is between 140 and 150 (which they have to get by trial and error). For 145 and 150, we
have:
pbinom(120:125, 145, 0.9)  # shows that we should use 123 as our rejection criterion
[1] 0.004939 0.009597 0.017843 0.031720 0.053879 0.087383
pbinom(123, 145, 0.8)  # shows that the power is 94%, so we should go a little higher
[1] 0.9443
pbinom(125:130, 150, 0.9)  # shows that we should use 128 as our rejection criterion
[1] 0.007648 0.014310 0.025640 0.043964 0.072088 0.112977
pbinom(128, 150, 0.8)  # power of 96%
[1] 0.9628
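The trial-and-error search can also be automated; this is a sketch, with power_at as an illustrative helper name (not from the text).

```r
# For a given sample size n, find the largest rejection cutoff k whose
# level is at most alpha under the null sensitivity p0, then report the
# power P(X <= k) under the true sensitivity p1.
power_at <- function(n, p0 = 0.9, p1 = 0.8, alpha = 0.05) {
  k <- qbinom(alpha, n, p0)                 # smallest k with P(X <= k) >= alpha
  if (pbinom(k, n, p0) > alpha) k <- k - 1  # back off so the test level is <= alpha
  pbinom(k, n, p1)                          # power at the alternative p1
}
power_at(145)   # about 0.944, just short of 95% power
power_at(150)   # about 0.963, enough
```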
60. Medical testing is even more complicated than the last problem suggests. Obviously the medical test should detect
the disease in patients who have it but it should also not detect the disease in patients who don’t have it. The
specificity of the test is the percentage of negative results in testees who don't have the disease. So a specificity of
90% means that the test will produce 10% false positives. If a claim is made that a medical test is 90% accurate,
it usually means that both the sensitivity and specificity are at least 90%. In this problem, we investigate the
problem of going backward: if a test tells you that you have a certain disease, how likely is it that you have the
disease? Seems like 90% right?
(a) Suppose that in a certain population, only 1% of the people have the disease. Now suppose that everyone
is tested. What proportion of the population will test positive for the disease? (Remember that a person
can test positive either by having the disease and the test is right or not having the disease and the test is
wrong.)
ANS: 90% of 1% will test positive correctly and 10% of 99% will test positive incorrectly. So that is .9% + 9.9% =
10.8%.
(b) Based on your result in part (a), what proportion of those who test positive actually have the disease?
ANS: .9/10.8 = 8.3%.
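The computation in parts (a) and (b) is a Bayes' rule calculation; here is a sketch in R (the function and argument names are illustrative, not from the text):

```r
# Positive predictive value: of those who test positive, the fraction
# who actually have the disease.
ppv <- function(sens, spec, prev) {
  true_pos  <- sens * prev               # diseased and correctly positive
  false_pos <- (1 - spec) * (1 - prev)   # healthy but incorrectly positive
  true_pos / (true_pos + false_pos)
}
ppv(sens = 0.9, spec = 0.9, prev = 0.01)   # about 0.083, as in part (b)
```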
(c) If you did (a) and (b) right, you should be more than a little surprised. What do you think this result says
about universal screening programs for various diseases?
Universal screening for a rare condition is likely to produce mostly false positives. In other words it might be
unnecessarily expensive and give a lot of people needless concern.
61. In logistic regression, the values of the model y are “link” values that have to be transformed to get probabilities.
(a) For each of the following link values, compute the corresponding probability: −3, −2, −1, 0, 1, 2, 3.
ilogit(c(-3, -2, -1, 0, 1, 2, 3))
[1] 0.04743 0.11920 0.26894 0.50000 0.73106 0.88080 0.95257
(b) For each of the following probabilities, compute the corresponding link value: .1, .25, .5, .75, .9.
logit(c(0.1, 0.25, 0.5, 0.75, 0.9))
[1] -2.197 -1.099  0.000  1.099  2.197
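The mosaic functions used above are just the logistic transform and its inverse; a base-R sketch, assuming the standard logistic link:

```r
ilogit <- function(x) 1 / (1 + exp(-x))   # link value -> probability
logit  <- function(p) log(p / (1 - p))    # probability -> link value
ilogit(c(-3, 0, 3))       # 0.04743 0.50000 0.95257
logit(c(0.1, 0.5, 0.9))   # -2.197  0.000  2.197
```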
62. The dataframe CAFE in the Stat2Data has data on a 2003 vote in the US Senate on an amendment to a bill
sponsored by John Kerry and John McCain to mandate improved fuel economy on cars and light trucks. The
amendment effectively killed the bill so was strongly supported by most car manufacturers. One might suppose
that the political party of the Senator might be related to their vote (it usually is) and it also might be supposed
that the contributions from car manufacturers had an effect. The variable LogContr is the logarithm of the
contributions that a senator received in their lifetime while the variable Dem is 1 if the senator was a Democrat or
Independent and 0 otherwise. The variable Vote is the vote of the senator (1 or 0). (Remember that for datasets
in the Stat2Data package you need to load the package and use the command data(CAFE) to make the dataset
accessible.)
(a) Fit a model for the Senator’s vote that uses LogContr and Dem as explanatory variables.
data(CAFE)
g <- glm(Vote ~ LogContr + Dem, data = CAFE, family = binomial)
(b) The median of LogContr is approximately 4. For this value, what are the predicted probabilities of a YES
vote from a Democrat and from a Republican?
f <- makeFun(g)
f(LogContr = 4, Dem = 1)
1
0.5226
f(LogContr = 4, Dem = 0)
1
0.861
(c) Write a good sentence interpreting the coefficient of LogContr in the model.
exp(coefficients(g))
(Intercept)    LogContr         Dem
    0.00107     8.72233     0.17679
For each unit increase in the log of contributions received we predict an increase in odds for voting for the
bill of a factor of 8.7, holding the party of the senator fixed.
(d) Write a sentence interpreting the coefficient of Dem in the model.
Holding contributions fixed, the odds a Democrat votes for the bill are estimated at .18 times the odds for a
Republican.