Statistics 1100 Sec: 08-27
Test 2-A-2
October 26, 2011
Kathleen McLaughlin
Name___________________________
People Soft # _____________________
Section Number ___________________
Information for Questions 1 - 5: The rise in obesity rates in the U.S. has been blamed for the rise in
diabetes rates. I decided to explore the relationship between these 2 variables by looking at a sample of
state data (26 different states). I chose state obesity rates (% of state population who are obese) as my
predictor variable and state diabetes rates (% of state population with diabetes) as my response variable.
Please use the MINITAB output displayed here to answer the questions that follow. (Note: I do not show
the data here. Use the output and graphs to answer Questions 1-5.)
Regression Analysis: diabetes versus obesity
The regression equation is
diabetes = 1.09 + 0.350 obesity
Predictor Coef SE Coef T P
Constant 1.088 3.027 0.36 0.722
obesity 0.3498 0.1120 3.12 0.005
S = 1.71772 R-Sq = 28.9% R-Sq(adj) = 25.9%
Analysis of Variance
Source          DF      SS      MS     F      P
Regression       1  28.767  28.767  9.75  0.005
Residual Error  24  70.814   2.951
Total           25  99.580
Unusual Observations
Obs  obesity  diabetes     Fit
  4     29.8     6.300  11.513
 26     19.6     7.600   7.945
Predicted Values for New Observations
New Obs     Fit  SE Fit           95% CI           95% PI
      1  10.533   0.337  (9.837, 11.229)  (6.920, 14.146)
Values of Predictors for New Observations
New Obs  obesity
      1     27.0
[Figure: Fitted Line Plot, diabetes = 1.088 + 0.3498 obesity; x-axis: obesity, y-axis: diabetes; S = 1.71772, R-Sq = 28.9%, R-Sq(adj) = 25.9%]
[Figure: Residuals Versus obesity (response is diabetes); x-axis: obesity, y-axis: Residual]
1. How much of the variation in state diabetes rates can be explained by state obesity rates?
a.) 35%
b.) 1.09%
c.) 28.9%
d.) 1.71772%
e.) 72.2%
2. The predicted diabetes rate for a state with an obesity rate of 25% is:
a.) 8.09%
b.) 9.84%
c.) 11.59%
d.) 9.14%
e.) 10.26%
3. For Unusual Observation #4, calculate the residual.
a.) 23.5%
b.) 18.287%
c.) -5.213%
d.) -2.3%
e.) -7.513%
4. Consider all the states with obesity rates of 27%. We would expect the average diabetes rate in those
states to lie in the interval:
a.) (10.533% ± 0.05%)
b.) (27% ± 2.5%)
c.) (27% ± 5%)
d.) (6.920%, 14.146%)
e.) (9.837%, 11.229%)
5. Look at all the Minitab output and comment on how well obesity rates predict diabetes rates.
a.) Most of the variation in diabetes rates can be explained by obesity rates. State obesity rates
are very strong predictors of diabetes rates.
b.) Much of the variation in diabetes rates is NOT explained by obesity rates. State obesity rates
are not strong predictors of state diabetes rates.
c.) The two graphs clearly indicate that the linear model is incorrect and that a quadratic model
would be more appropriate.
d.) A causal relationship between obesity and diabetes is clearly demonstrated in this model.
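The quantities asked about in Questions 1-4 can be re-derived from the printed MINITAB output. The following is a minimal Python sketch, not part of the original test; the variable and function names are illustrative assumptions, and the numbers are copied from the output above. Because it uses the unrounded coefficients (1.088 and 0.3498), a fitted value may differ in the last digit from one computed with the rounded equation.

```python
# Values copied from the MINITAB output above.
intercept, slope = 1.088, 0.3498
ss_regression, ss_total = 28.767, 99.580

# R-squared is the share of total variation explained by the regression.
r_squared = ss_regression / ss_total

def predict(obesity_rate):
    """Fitted diabetes rate (%) for a given state obesity rate (%)."""
    return intercept + slope * obesity_rate

# Residual = observed - fitted, using the row for unusual observation 4.
residual_obs4 = 6.300 - 11.513

print(f"R-squared: {r_squared:.3f}")
print(f"Fit at obesity = 25: {predict(25):.2f}")
print(f"Fit at obesity = 27: {predict(27):.3f}")
print(f"Residual for observation 4: {residual_obs4:.3f}")
```

The 95% CI in the output describes the mean diabetes rate for all states at obesity = 27, while the wider 95% PI describes a single state.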
Information for Questions 6- 10: The Internet Movie Database monitors the gross revenues for all
major motion pictures. The accompanying table gives both the U.S. and Worldwide gross revenues for a
random sample selected from the films that were the highest grossing films in the U.S.
Movie Title                                          Domestic Gross         Worldwide Gross
                                                     (millions of dollars)  (millions of dollars)
Titanic                                              600.8                  1835.3
Shrek 2                                              436.5                  880.9
E.T.                                                 434.9                  756.7
Star Wars: Episode I                                 431.1                  922.3
Spider-Man                                           403.7                  806.7
Star Wars: Episode III                               380.2                  848.5
The Passion of the Christ                            370.3                  604.4
The Lord of the Rings: The Two Towers                340.5                  921.6
Finding Nemo                                         339.7                  865.0
Spider-Man 3                                         336.5                  885.4
Forrest Gump                                         329.6                  679.4
Iron Man                                             318.3                  571.8
Indiana Jones and the Kingdom of the Crystal Skull   317.0                  783.0
Pirates of the Caribbean: At the World’s End         309.4                  958.4
Independence Day                                     306.1                  811.2
I was interested in seeing whether I could use the U.S. gross revenues to predict the Worldwide gross
revenues. That would lead me to believe that movies that are successful here in the U.S. have a worldwide appeal.
6. To begin your analysis, do a scatterplot of the data and comment on the graph.
a.) The graph shows a weak negative linear association between the variables.
b.) There is a lot of scatter in the data, indicating a strong linear relationship between the 2
variables.
c.) The graph shows a positive linear trend but it looks like it is due mostly to one outlier.
d.) The graph shows a strong positive quadratic trend.
7. Fit a linear model to the data and calculate the R-square value. Store the model in Y1.
a.) ŷ = -470.8 + 3.572x, R-Square = 77.9%
b.) ŷ = 492.4 + 0.9347x, R-Sq = 9.9%
c.) ŷ = 0.768 - 205.4x, R-Sq = .5900%
d.) ŷ = -205.4 + 2.87x, R-Sq = 59%
e.) ŷ = 310.5 - 3.42x, R-Sq = 25.3%
8. Using the model you stored in Y1, calculate the predicted y-value and the residual
for Shrek 2.
a.) ŷ = 886.62, Residual = -36.12
b.) ŷ = 1030.5, Residual = -108.2
c.) ŷ = 652.4, Residual = -207.5
d.) ŷ = 1046, Residual = -165.1
e.) ŷ = 770.81, Residual = 150.79
9. Remove the data for Titanic and recalculate the linear model and R-square value.
a.) ŷ = 682.7 + 0.344x, R-Sq = 1.97%
b.) ŷ = 0.140 + 0.344x, R-Sq = .0197%
c.) ŷ = 0.140 + 682.7x, R-Sq = 1.97%
d.) ŷ = 0.344 + 682.7x, R-Sq = .0197%
e.) ŷ = 315.6 - 0.472x, R-Sq = 2.35%
10. Look at the two models (with and without Titanic) and make a statement about
the linear relationship between X and Y.
a.) Based on the 2 models, there appears to be a strong linear relationship
between U.S. revenues and Worldwide revenues.
b.) Even without the Titanic data, the linear relationship still shows the U.S.
revenues are good predictors of Worldwide revenues.
c.) After looking at the scatterplot without the Titanic, I would use a Quadratic
model.
d.) It appears that the strength of the linear relationship is highly dependent on
the Titanic data so I would conclude that U.S. revenues are not good
predictors of Worldwide revenues.
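The test expects these fits to be done on a graphing calculator (the model is stored in Y1). As a hedged alternative, the same least-squares calculations for Questions 7-10 can be sketched in plain Python. The helper `fit_line` and all variable names are assumptions made for illustration; the data are the 15 films from the table, with Titanic first and Shrek 2 second.

```python
def fit_line(xs, ys):
    """Return (intercept, slope, r_squared) for the least-squares line of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy ** 2 / (sxx * syy)

# Domestic and worldwide gross (millions of dollars), in table order.
domestic = [600.8, 436.5, 434.9, 431.1, 403.7, 380.2, 370.3, 340.5,
            339.7, 336.5, 329.6, 318.3, 317.0, 309.4, 306.1]
worldwide = [1835.3, 880.9, 756.7, 922.3, 806.7, 848.5, 604.4, 921.6,
             865.0, 885.4, 679.4, 571.8, 783.0, 958.4, 811.2]

# Model on all 15 films.
a, b, r2 = fit_line(domestic, worldwide)

# Fitted value and residual (observed - fitted) for Shrek 2, the second row.
shrek_fit = a + b * domestic[1]
shrek_residual = worldwide[1] - shrek_fit

# Model with Titanic (the first row) removed.
a2, b2, r2_no_titanic = fit_line(domestic[1:], worldwide[1:])

print(f"All films:       y-hat = {a:.1f} + {b:.3f}x, R-sq = {r2:.1%}")
print(f"Shrek 2:         fit = {shrek_fit:.1f}, residual = {shrek_residual:.1f}")
print(f"Without Titanic: y-hat = {a2:.1f} + {b2:.3f}x, R-sq = {r2_no_titanic:.2%}")
```

Comparing the two R-square values shows how much of the apparent linear relationship rests on a single influential point.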
Information for Questions 11 – 14: Use of the Internet has grown at an amazing rate! Here is the data
from the very early stages in 1995 to the present day:
Year   Number of Users (in millions)
1995     16
1996     36
1997     70
1998    147
1999    248
2000    361
2001    513
2002    587
2003    719
2004    817
2005   1018
2006   1093
2007   1319
2008   1574
2009   1802
2010   1971
I fit a linear model to the data and the results are displayed here:
Regression Analysis: Number of Users (in millions) versus Years
The regression equation is
Number of Users (in millions) = - 263147 + 132 Years
S = 132.280 R-Sq = 96.0%
[Figure: Scatterplot of Residuals vs Years; x-axis: Years (1995 to 2010), y-axis: Residuals (-200 to 200)]
11. What conclusion would you make about the usefulness of this linear model?
a.) Based on the large R²-value and the curved pattern in the residuals, the linear model is a good
model for this data.
b.) Based on the positive value of the slope, the negative value of the y-intercept and the large
R²-value, the linear model is a good model for this data.
c.) The distinct curved pattern of the residuals tells us that the linear model is not the appropriate
model for the data and that other regression models should be considered.
d.) The distinct curved pattern of the residuals indicates that the residuals are always positive and
squared.
12. Fit a quadratic model to the data and calculate the R-square value.
a.) ŷ = 5.3x² - 254.21x + 2059, R-square = .988
b.) ŷ = 13.4x² - 2567x + 11921, R-square = .901
c.) ŷ = 6.306548x² - 25125.9306x + 25026014.77, R-square = .997
d.) ŷ = .2173433x² - 35.2176554x + 4179098076, R-square = .923
e.) ŷ = -.00057x² + 24.21x - 7789, R-square = .925
13. Store the predictions in L3 and the Residuals in L4. Do a scatterplot of the residuals (Xlist:L1 and
Ylist:L4). Remember to Deselect Y1. Use the TRACE key to find the ‘largest’ residual. (Note:
This can be a positive or a negative value.) Return to the lists: L1 - L4. Which of the following
answers contains information on the data point with the largest residual?
a.) Year: 2009, y: 1802 million, ŷ: 1956.4 million, Residual: -154.4 million
b.) Year: 2009, y: 1802 million, ŷ: 1757.3 million, Residual: 44.7 million
c.) Year: 2008, y: 1093 million, ŷ: 1172.7 million, Residual: 24.2 million
d.) Year: 2006, y: 1093 million, ŷ: 1172.7 million, Residual: -79.7 million
e.) Year: 2006, y: 1802 million, ŷ: 1284.3 million, Residual: 111.6 million
14. Use your quadratic model that you stored in Y1 to predict the number of Internet users in 2011.
a.) 2145 million
b.) 2210 million
c.) 2725 million
d.) 2346 million
e.) 2509 million
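Questions 12-14 assume a calculator's quadratic regression. The sketch below is one possible pure-Python equivalent, offered as an illustration rather than the method the test intends. It fits the quadratic by solving the normal equations, and it centers the years at their mean (an assumption made here for numerical stability), so `c0`, `c1` and `c2` are coefficients in the centered variable and only the leading coefficient matches the raw-year form directly. The names `solve3`, `predict` and `worst` are hypothetical.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    a = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

years = list(range(1995, 2011))
users = [16, 36, 70, 147, 248, 361, 513, 587,
         719, 817, 1018, 1093, 1319, 1574, 1802, 1971]

center = sum(years) / len(years)
t = [year - center for year in years]

# Normal equations for users = c0 + c1*t + c2*t^2.
power_sums = [sum(ti ** k for ti in t) for k in range(5)]
matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
rhs = [sum(ti ** i * u for ti, u in zip(t, users)) for i in range(3)]
c0, c1, c2 = solve3(matrix, rhs)

def predict(year):
    """Fitted number of users (millions) from the centered quadratic."""
    s = year - center
    return c0 + c1 * s + c2 * s * s

residuals = [u - predict(year) for year, u in zip(years, users)]
mean_users = sum(users) / len(users)
sse = sum(e * e for e in residuals)
sst = sum((u - mean_users) ** 2 for u in users)
r_squared = 1 - sse / sst

# Index of the residual that is largest in absolute value.
worst = max(range(len(years)), key=lambda i: abs(residuals[i]))

print(f"Leading coefficient: {c2:.4f}, R-square: {r_squared:.3f}")
print(f"Largest residual: year {years[worst]}, residual {residuals[worst]:.1f}")
print(f"Prediction for 2011: {predict(2011):.0f} million")
```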
Information for Questions 15-16: Although U.S. citizens value the freedoms and rights of democracy,
they often do not vote. Data on x: the number of U.S. citizens eligible to vote (in millions) and y: the
number of U.S. citizens who actually did vote (also in millions) in the last eight federal elections was
entered into MINITAB and a scatterplot was created and is shown here:
[Figure: Scatterplot of Actual votes cast vs Voting age population; x-axis: Voting age population (120 to 200), y-axis: Actual votes cast (70 to 105)]
The following statistics are calculated:
$\bar{x} = 165.2$, $s_x = 26.0$, $\bar{y} = 88.0$, $s_y = 10.3$, $r = .942$
Create the regression equation for x: U.S. citizens eligible to vote (in millions) and y: U.S. citizens who
actually did vote (in millions). Here are the formulas you need to find the slope and y-intercept for the
linear regression equation. Round each value (b and a) to the nearest thousandth:
$b = r\left(\frac{s_y}{s_x}\right)$ and $a = \bar{y} - b\bar{x}$
15. Find the least squares regression line for predicting ‘actual votes cast’ from ‘voting age population.’
a.) ŷ = -488.18 + 507.87x
b.) ŷ = 26.0 - 10.3x
c.) ŷ = 26.380 + .373x
d.) ŷ = 88.0 - .942x
e.) ŷ = 165.2 - 88.0x
16. What percent of the variation in ‘Actual Votes Cast’ can be explained by the linear relationship
between ‘Voting Age Population’ and ‘Actual Votes Cast’?
a.) 64.2%
b.) 10.3%
c.) 26.0%
d.) 88.7%
e.) 37.3%
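The slope and intercept formulas given for Questions 15-16 translate directly into a few lines of arithmetic. This is a minimal Python sketch with illustrative variable names; it rounds b before computing a, following the test's instruction to round each value to the nearest thousandth.

```python
# Summary statistics given in the problem (millions of people).
x_bar, s_x = 165.2, 26.0   # voting age population
y_bar, s_y = 88.0, 10.3    # actual votes cast
r = 0.942

# Slope b = r * (s_y / s_x), rounded to the nearest thousandth as instructed.
b = round(r * s_y / s_x, 3)

# Intercept a = y_bar - b * x_bar, using the rounded slope.
a = round(y_bar - b * x_bar, 3)

# In simple linear regression, the proportion of variation explained is r squared.
r_squared = r ** 2

print(f"y-hat = {a:.3f} + {b:.3f}x")
print(f"R-squared = {r_squared:.1%}")
```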
17. With the ever increasing price of gasoline, researchers are constantly evaluating automobile data. In
one study, data was collected on “car weight (in pounds),” and “miles per gallon.” The data was
entered into MINITAB and the following regression equation was generated: ŷ = 45.6 – 0.0052x,
where x = ‘car weight in pounds’ and y = ‘miles per gallon.’ Interpret the slope in terms of a 1000
pound increase in the vehicle weight.
a.) Mileage would increase by .052 miles per gallon for every 1000 pound increase in vehicle
weight.
b.) Mileage would decrease by .052 miles per gallon for every 1000 pound increase in vehicle
weight.
c.) Mileage would increase by 5.2 miles per gallon for every 1000 pound increase in vehicle weight.
d.) Mileage would decrease by .0052 miles per gallon for every 1000 pound increase in vehicle
weight.
e.) Mileage would decrease by 5.2 miles per gallon for every 1000 pound increase in vehicle weight.
18. Blood types: (A, B, AB and O) and Rh factors (+ and -) combine to classify an individual’s blood
into exactly one of the following 8 categories: A+, A-, B+, B-, AB+, AB-, O+ and O-. Consider the
following statement: “If a person is randomly selected from the population, the probability that
he/she has blood type A+ is 1/8.” Must this statement be true?
a.) Yes, since there are eight different blood classifications.
b.) No, we cannot assume that the eight different blood classifications are equally likely.
c.) Yes, the eight different blood classifications are independent so are therefore equally
likely.
d.) No, we cannot assume that the eight different blood classifications are mutually
exclusive.
e.) Yes, since the eight different blood classifications are mutually exclusive.
19. Internet sites often vanish or move, so that references to them can’t be followed. In fact, 13% of
Internet sites referenced in papers in major scientific journals are lost within two years after
publication. Suppose you are researching obesity rates in the U.S. You find Medical scientific
journal articles with links to 4 Internet sites that you are interested in. All the articles were written
prior to 2008. What is the probability that all 4 Internet articles are still good and that you can link to
them? Assume that the four links are independent. (You can use a tree diagram to model this
problem.)
a.) 0.13
b.) 0.52
c.) 0.169
d.) 0.0003
e.) 0.573
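Under the stated independence assumption, the tree diagram for Question 19 reduces to multiplying the "still good" probability along one path. A minimal Python sketch, with illustrative names:

```python
p_lost = 0.13   # chance a referenced site is lost within two years
n_links = 4     # number of links being followed

# With independent links, P(all good) is the product of the individual probabilities.
p_all_good = (1 - p_lost) ** n_links

print(f"P(all {n_links} links still work) = {p_all_good:.3f}")
```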
20. Suppose you roll a fair die 9 times and you get a ‘5’ on each of the 9 rolls. What are the chances that
the next roll will be a ‘5’?
a.) Since each roll is independent, the probability of ‘5’ changes with each roll so
we cannot determine an actual value.
b.) The probability that the 10th roll is ‘5’ is 0.167.
c.) The probability that the 10th roll is ‘5’ is approximately 0.95 since ‘5’ is now
more likely based on the previous rolls.
d.) The probability that the 10th roll is ‘5’ is 0.05 since ‘5’ is now very unlikely
based on the previous rolls.
e.) Since each outcome is a random event, the probability that the 10th roll is a ‘5’
is 0.50.
Information for Questions 21 – 22: An individual has a torn tendon and is facing surgery to repair it.
The orthopedic surgeon explains the risks to the patient. Infection occurs in 4% of such operations, the
repair fails in 15%, and both infection and failure together in 1.5%. (Use a Venn Diagram to model this
problem.)
21. What is the probability that the operation succeeds and is free from infection?
a.) 0.78
b.) 0.80
c.) 0.84
d.) 0.81
e.) 0.825
22. What is the probability that the repair fails, given that an infection occurs?
a.) 0.60
b.) 0.455
c.) 0.375
d.) 0.833
e.) 0.015
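The Venn diagram for Questions 21-22 corresponds to the addition rule and the definition of conditional probability. The sketch below is illustrative Python with assumed variable names, using only the three percentages given by the surgeon.

```python
p_infection = 0.04
p_failure = 0.15
p_both = 0.015

# Addition rule: P(infection or failure) = P(I) + P(F) - P(I and F).
p_either = p_infection + p_failure - p_both

# The operation succeeds and is infection-free when neither event occurs.
p_neither = 1 - p_either

# Conditional probability: P(failure | infection) = P(I and F) / P(I).
p_failure_given_infection = p_both / p_infection

print(f"P(success and no infection) = {p_neither:.3f}")
print(f"P(failure | infection) = {p_failure_given_infection:.3f}")
```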
Information for Questions 23 and 24: The Triple Blood Test screens pregnant women for the genetic
disorder, Down syndrome (D). This syndrome occurs in about 1 in 800 live births, that is P(D)=1/800.
If the fetus actually has Down syndrome, the Triple Blood Test will result in a positive test with
probability 0.89. And so, the probability of a false negative is 0.11. If the fetus does not have Down
syndrome, the Triple Blood Test will result in a negative test with probability 0.75. And so, the
probability of a false positive is 0.25. Fill in these probabilities on the branches of the tree diagram.
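[Tree diagram: the first split is D versus D^C (no Down syndrome); each branch then splits into Pos and Neg]
D
    Pos
    Neg
D^C
    Pos
    Neg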
23. Calculate the probability of a Positive Test result.
a.) 0.0011125
b.) 0.2496875
c.) 0.2508
d.) 0.89
e.) 0.25
24. Calculate the probability of Down syndrome, given that the test is positive. That is,
calculate P(D|pos).
a.) 0.00618
b.) 0.89
c.) 1/800
d.) 0.0044
e.) 0.0011125
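The tree diagram for Questions 23-24 can be followed branch by branch: multiply along each path to a positive test, add the paths, then apply Bayes' rule. A minimal Python sketch with illustrative names, using only the probabilities stated in the problem:

```python
p_d = 1 / 800            # P(D): prevalence of Down syndrome
p_pos_given_d = 0.89     # P(Pos | D): true positive rate
p_pos_given_not_d = 0.25 # P(Pos | D^C): false positive rate

# Multiply along each branch that ends in a positive test.
p_d_and_pos = p_d * p_pos_given_d
p_not_d_and_pos = (1 - p_d) * p_pos_given_not_d

# Total probability of a positive test is the sum of the two paths.
p_pos = p_d_and_pos + p_not_d_and_pos

# Bayes' rule: P(D | Pos) = P(D and Pos) / P(Pos).
p_d_given_pos = p_d_and_pos / p_pos

print(f"P(Pos) = {p_pos:.4f}")
print(f"P(D | Pos) = {p_d_given_pos:.4f}")
```

The small result for P(D | Pos) reflects how rare the condition is relative to the false positive rate.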
25. A local church holds an annual raffle to raise money for its sister church in Haiti. The first prize is a
weekend in Newport and the second prize is two season passes to the Children’s series at Jorgenson.
The church sells 200 tickets at $10 apiece. One of the parishioners buys 20 tickets. Winning tickets
are drawn without replacement. What is the probability that she wins EXACTLY one (but not both)
of the two prizes? Hint: Use a tree diagram to model this problem.
a.) 0.10
b.) 0.09
c.) 0.181
d.) 0.20
e.) 0.236
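The tree diagram suggested in the hint for Question 25 has two paths that end in exactly one prize: win then lose, or lose then win. The sketch below uses Python's `fractions.Fraction` to keep the arithmetic exact; the variable names are illustrative assumptions.

```python
from fractions import Fraction

total_tickets = 200
her_tickets = 20
other_tickets = total_tickets - her_tickets

# Path 1: she wins the first draw, then loses the second (199 tickets remain).
win_then_lose = Fraction(her_tickets, total_tickets) * Fraction(other_tickets, total_tickets - 1)

# Path 2: she loses the first draw, then wins the second.
lose_then_win = Fraction(other_tickets, total_tickets) * Fraction(her_tickets, total_tickets - 1)

p_exactly_one = win_then_lose + lose_then_win

print(f"P(exactly one prize) = {p_exactly_one} = {float(p_exactly_one):.3f}")
```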
26. The Monty Hall Problem:
The Monty Hall Problem, named after the host of the long-running game show "Let’s Make a Deal," is a
statistical puzzle that seems counterintuitive. A recurring deal on the show featured contestants choosing
one of three closed doors, with a big prize (like a car) behind one of them and something else, like a goat,
behind each of the others. As a contestant, you are asked to choose a door. But before Monty Hall opens
the door you chose, he wants to make the game more interesting. He opens one of the other doors to
reveal a goat. (Note: Monty Hall knows what is behind each door.) Then he asks: "Are you sure you want
the door you chose? Or would you like to switch to the other door?" What should you do and why?
a.) It does not matter if I SWITCH or STAY. Since there are only 2 doors left unopened, the
probability is 0.5 that the car is behind either door.
b.) I should STAY because the probability of Winning by STAYING is 1/3.
c.) I should SWITCH because the probability of Winning by SWITCHING is 9/10 and the
probability of Winning by STAYING is 1/10.
d.) I should SWITCH because the probability of Winning by SWITCHING is 2/3.
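One way to check the Monty Hall reasoning is to enumerate every equally likely combination of car location and first pick. The Python sketch below is an illustration under the assumptions that the car's door and the contestant's pick are each uniformly random and that Monty always opens a goat door other than the pick. When the first pick is the car, Monty has two doors to choose from; the code takes the first one, which does not affect whether switching wins.

```python
from fractions import Fraction
from itertools import product

doors = range(3)
stay_wins = 0
switch_wins = 0
cases = 0

for car, pick in product(doors, repeat=2):
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = next(d for d in doors if d != pick and d != car)
    # Switching means taking the one door that is neither picked nor opened.
    switched = next(d for d in doors if d != pick and d != opened)
    stay_wins += (pick == car)
    switch_wins += (switched == car)
    cases += 1

p_stay = Fraction(stay_wins, cases)
p_switch = Fraction(switch_wins, cases)

print(f"P(win by staying) = {p_stay}")
print(f"P(win by switching) = {p_switch}")
```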
1. During one holiday season, the Texas lottery played a game called the Stocking Stuffer. The
price of a ticket for this lottery was $1.00. Shown here are the various prizes and the probability
of winning each prize.
Prize (x)    $1000   $100    $20     $10     $4      $2      $0
Probability  .00002  .00063  .00400  .00601  .03403  .14355  .81176
Calculate the expected value for this game and decide whether it is worthwhile, in the long run,
to play.
a.) The expected value is $1.62. This means that in the long run you will make a profit of
$0.62 for every dollar you spend so it is worthwhile to play the game.
b.) The expected value is $0.64. This means that in the long run you will make a profit of
$0.64 for every dollar you spend so it is worthwhile to play the game.
c.) The expected value is $0.64. This means that in the long run you lose $0.36 for every
dollar you spend so it is not worthwhile to play the game.
d.) The expected value is $0.64. This means that in the long run you will lose $0.64
for every dollar you spend so it is not worthwhile to play the game.
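The expected value of the Stocking Stuffer game is the probability-weighted sum of the prizes. A minimal Python sketch follows, with illustrative variable names and the table values copied as given; the net figure subtracts the $1.00 ticket price.

```python
prizes = [1000, 100, 20, 10, 4, 2, 0]
probabilities = [0.00002, 0.00063, 0.00400, 0.00601, 0.03403, 0.14355, 0.81176]
ticket_price = 1.00

# Sanity check: the probabilities of all outcomes should sum to 1.
total_probability = sum(probabilities)

# Expected value = sum of (prize * probability) over all outcomes.
expected_value = sum(x * p for x, p in zip(prizes, probabilities))

# Long-run average gain per ticket after paying for it.
expected_net = expected_value - ticket_price

print(f"Total probability: {total_probability:.5f}")
print(f"Expected prize per ticket: ${expected_value:.2f}")
print(f"Expected net per ticket: ${expected_net:.2f}")
```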
2. The probability that a male professional golfer makes a hole-in-one is 1/2780. Suppose 36
professional male golfers play the sixth hole during a round of golf. Let the random variable X
be the number of golfers in the group of 36 who make a hole-in-one. Calculate the probability
that exactly four of the 36 golfers make a hole-in-one on the sixth hole – as actually happened
during the 1989 U.S. Open.
a.) 0.0005
b.) 9.7 × 10^-10
c.) 6.3 × 10^-8
d.) 4.2 × 10^-2
e.) 3.6 × 10^-4
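The hole-in-one question is a single binomial probability, P(X = 4) with n = 36 and p = 1/2780. The sketch below writes out the binomial formula in Python; `binomial_pmf` is a hypothetical helper name, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n_golfers = 36
p_hole_in_one = 1 / 2780

p_exactly_four = binomial_pmf(4, n_golfers, p_hole_in_one)

print(f"P(X = 4) = {p_exactly_four:.2e}")
```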
3. According to Information Resources, which publishes data on market share for various
products, Oreos control about 10% of the market for cookie brands. Suppose 20 purchasers of
cookies are selected randomly from the population. What is the probability that fewer than four
purchasers choose Oreos?
a.) 0.957
b.) 0.867
c.) 0.677
d.) 0.989
e.) 0.190
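"Fewer than four" means X is 0, 1, 2 or 3, so the Oreo question is a cumulative binomial probability with n = 20 and p = 0.10. This Python sketch sums the individual terms; the names are illustrative assumptions.

```python
from math import comb

n_purchasers = 20
p_oreo = 0.10

# P(X < 4) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3).
p_fewer_than_four = sum(
    comb(n_purchasers, k) * p_oreo ** k * (1 - p_oreo) ** (n_purchasers - k)
    for k in range(4)
)

print(f"P(X < 4) = {p_fewer_than_four:.3f}")
```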
Information for Questions 4 and 5: I asked all 500 students in Statistics last semester to flip a
biased coin that I have 10 times and to record the number of Heads. I know that this biased
coin has P(Heads) = 0.75 and P(Tails) = 0.25. Define the random variable X as the number of
heads for each individual student’s experiment. So, X is a binomial random variable with n = 10
and p = 0.75.
Here is a table of the results of the 500 experiments:
X: No. of Heads   0   1   2   3   4   5   6    7    8    9   10
Freq:             0   0   0   0   4  32  57  124  148  111   24
4. Use the output in the above table to calculate a relative frequency estimate of
P(5 ≤ X ≤ 7).
a.) 0.93
b.) 0.186
c.) 0.213
d.) 0.426
e.) 0.75
5. Now use your calculator and the values for n and p given in the information above to
calculate the theoretical probability: P(5 ≤ X ≤ 7).
a.) 0.2206
b.) 0.2044
c.) 0.4547
d.) 0.3963
e.) 0.75
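Questions 4 and 5 compare an observed relative frequency with the binomial model's theoretical value for the same event, 5 to 7 heads inclusive. A minimal Python sketch, with illustrative names and the frequencies copied from the table:

```python
from math import comb

# Observed frequencies for X = 0, 1, ..., 10 heads among the 500 students.
frequencies = [0, 0, 0, 0, 4, 32, 57, 124, 148, 111, 24]
n_students = sum(frequencies)

# Relative frequency estimate: share of students with 5, 6 or 7 heads.
empirical = sum(frequencies[5:8]) / n_students

# Theoretical binomial probability with n = 10 flips and p = 0.75.
n_flips, p_heads = 10, 0.75
theoretical = sum(
    comb(n_flips, k) * p_heads ** k * (1 - p_heads) ** (n_flips - k)
    for k in range(5, 8)
)

print(f"Students: {n_students}")
print(f"Relative frequency of 5 to 7 heads: {empirical:.3f}")
print(f"Theoretical P(5 to 7 heads): {theoretical:.4f}")
```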
Information for Questions 6 and 7: The National Center for Health Statistics reports that 25%
of all Americans between the ages of 65 and 74 have a chronic heart condition. Suppose you
live in a state where the environment is conducive to good health and low stress and you believe
the conditions in your state promote healthy hearts. To investigate this theory, you conduct a
survey of 150 persons 65 to 74 years of age in your state.
6. On the basis of the figure from the National Center for Health Statistics, calculate the mean
and standard deviation of the binomial random variable, X, the number of persons with a chronic
heart condition in a sample of 150 persons.
a.) µ = 37.5, σ = 5.30
b.) µ = 37.5, σ = 28.125
c.) µ = 150, σ = 28.125
d.) µ = .25, σ = 5.30
e.) µ = .25, σ = .75
7. Based on the sample of 150 persons, would you be surprised to observe X = 21 persons in the
65-74 age group in your state with chronic heart disease? What would this tell you about
the environmental conditions in your state?
a.) Yes, X = 21 would be very unusual. I would say that the environmental
conditions in my state may be conducive to promoting healthy
hearts.
b.) No, X = 21 would not be all that unusual. The sample data does not support
the belief that the environmental conditions in my state promote healthy hearts
any more than the conditions in other states.
c.) No comment can be made since the criteria for a binomial experiment have
not been met.
d.) Yes, X = 21 would be slightly unusual since 21 is approximately one standard
deviation greater than the mean. Environmental conditions in my
state are certainly better than in other states.
e.) There is an error in the sampling plan. All sample results must fall within
1 standard deviation of the mean. These results cannot tell us anything about
the environmental conditions in my state.
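For Questions 6 and 7, the binomial mean and standard deviation are np and the square root of np(1 - p), and one common way to judge whether X = 21 is unusual is to see how many standard deviations it lies from the mean. The sketch below is illustrative Python; using a z-score as the yardstick is an assumption here, since the test does not say which criterion to apply.

```python
from math import sqrt

n = 150    # people surveyed
p = 0.25   # national rate of chronic heart conditions, ages 65 to 74

mean = n * p
std_dev = sqrt(n * p * (1 - p))

# Distance of the observed count from the mean, in standard deviations.
observed = 21
z_score = (observed - mean) / std_dev

print(f"mean = {mean:.1f}, standard deviation = {std_dev:.2f}")
print(f"z-score for X = {observed}: {z_score:.2f}")
```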
Information for Questions 8 - 11: The S & P 500 is a collection of 500 stocks of publicly traded
companies. Using data obtained from Yahoo!Finance, the monthly rates of return of the S & P
500 from 1950 through 2010 are normally distributed. The mean rate of return is 0.007443 and
the standard deviation is 0.04135.
8. What is the probability that a randomly selected month has a positive rate of return?
That is, what is P(X > 0)?
a.) 0.68
b.) 0.36
c.) 0.57
d.) 0.43
e.) 0.50
9. What is the probability that the mean monthly return for a one year period is positive?
That is, for n = 12 months, what is P(X̄ > 0)?
a.) 0.73
b.) 0.43
c.) 0.57
d.) 0.27
e.) 0.50
10. What is the probability that the mean monthly return for a five year period is positive?
That is, for n = 60 months, what is P(X̄ > 0)?
a.) 0.92
b.) 0.50
c.) 0.08
d.) 0.84
e.) 0.16
11. Consider the sampling distribution of X̄ in Example #9 and Example #10. What effect
does the sample size have on the standard deviation of the sampling distribution of X̄?
a.) As n increases, the standard deviation of the sampling distribution increases.
b.) As n increases, the standard deviation of the sampling distribution decreases.
c.) Since the underlying population is normal, the sample size has no effect on
the standard deviation of the sampling distribution.
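Questions 8-11 all use the same idea: the mean of n monthly returns is normally distributed with the same mean and with standard deviation σ/√n. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8 or later); the function names are illustrative, and treating the months as independent draws is an assumption.

```python
from math import sqrt
from statistics import NormalDist

mu = 0.007443     # mean monthly rate of return
sigma = 0.04135   # standard deviation of monthly returns

def std_error(n):
    """Standard deviation of the sampling distribution of the mean of n months."""
    return sigma / sqrt(n)

def prob_positive_mean(n):
    """P(mean return over n months > 0), assuming independent normal months."""
    return 1 - NormalDist(mu, std_error(n)).cdf(0)

for n in (1, 12, 60):
    print(f"n = {n:2d}: std error = {std_error(n):.5f}, "
          f"P(mean > 0) = {prob_positive_mean(n):.2f}")
```

The standard error column shows directly how the spread of the sampling distribution changes as n grows.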