Download Statistics Review Chapters 1-8

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
Transcript
Statistics Review
In the paper “Reproduction in Laboratory colonies of Bank Vole,” the authors presented the results of a study of litter
size. (A vole is a small rodent with a stout body, blunt nose, and short ears.) As each new litter was born, the number of
babies was recorded, and the accompanying results were obtained.
1
2
2
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
8
8
8
8
8
8
8
8
8
8
8
9
9
9
10
10
11
The authors also kept track of the color of the first born in each litter. (B = brown, G = gray, W = white, and T = tan)
B
W
B
G
B
B
G
B
G
B
B
B
T
G
T
B
B
B
W
G
T
W
B
G
B
B
B
B
T
T
W
B
B
B
B
G
B
B
G
W
T
T
B
B
T
T
B
G
T
B
G
T
G
T
W
B
G
G
B
G
G
G
W
B
W
B
W
T
B
T
1. Which variable, litter size or color, is categorical?
2. Which variable is quantitative? Litter Size
3. Make a bar chart of the colors.
45
41
40
Frequency
35
30
25
25
21
20
13
15
10
5
0
Brown
Gray
White
Color
Tan
G
B
B
W
B
T
G
T
G
W
B
T
B
T
G
T
T
W
B
G
B
B
G
T
B
G
G
G
B
W
4. Make a histogram of the litter sizes.
5. Make a dotplot of the litter sizes.
6. Are there any outliers in the histogram or dotplot? Yes, there are outliers. Use the outlier rule to prove or disprove
outliers, in both directions!!! Q1 1.5  IQR  5 1.5 2  2 and Q3 1.5  IQR  7 1.5 2 10 So
any numbers less than 2 or more than 10 are considered outliers. So 1 and 11 are outliers.
7. Describe the shape of the histogram (symmetric or skewed). Approximately Symmetric

8. Find the mean of the litter sizes.
x1  x 2  ... x n 
x
 5.87
n
You can use 1-Var Stats XList.
9. Is the mean resistant to outliers? The mean is NOT resistant to outliers. Outliers can and do effect the mean.
10. Find the median of the litter sizes. M = 6, put the numbers in order (if they are not) and then find the middle one. If

there are 2 middle numbers,
it is the average of the 2 numbers.
11. Is the median resistant to outliers? Yes, the median IS resistant to outliers. Outliers do not effect where the middle
number is.
12. Find the range of the litter sizes. The range is from 1 – 11, or 10.
13. Find the 5-number summary of the litter sizes. The 5 # Summary is {1, 5, 6, 7, 11} (Min, Q1, Median, Q3, Max)
14. What is the interquartile range? IQR = Q1 – Q3 = 2
15. Make a boxplot of the litter sizes.
16. Find the variance of the litter sizes.
(x1  x ) 2  (x 2  x ) 2  ... (x n  x ) 2
s 
 3.2860
n 1
2
x
To get credit, you
MUST plug in the first 2 sets of numbers and the last one. Just writing the formula with variables, doesn’t count!
17. Find the standard deviation of the litter sizes. #16  1.8127 Remember you can use 1-Var Stats to find these,
but you must plug the numbers
into the formula for full credit

18. Is standard deviation resistant to outliers? No, standard deviation is NOT resistant to outliers. Since we use x-bar in
the formula and it is not resistant, neither is anything that uses it.
19. Find the degrees of freedom of the litter sizes. n-1 = 99
20. What is the area under a density curve? The area under a density curve is always 1.
21. The (mean or median) of a density curve is the equal-areas point, the point that divides the area under the curve in
half.
22. The (mean or median) of a density curve is the balance point, at which the curve would balance if made of solid
material.
23. If a density curve is skewed to the right, the (mean or median) will be further to the right than the (mean or median).
The average gets “pulled” towards the extremes/outliers.
24. What is the difference between x and  ? X-bar is the average of a sample (the “real” or actual average) or the
average without weight.  is the population mean, ideal mean (“statistical guess”), or mean for a Random Variable.
25. What is the difference between s and  ? Same as #24. s is the standard deviation of a sample (“actual” standard
deviation).  is the standard deviation of a population, idealized standard deviation, or for a Random Variable.
26. Normal curves are density curves that are single peaked, bell-shaped, and symmetric. (See page 85)
27. How do you find the inflection points on a normal curve? The inflection points are where the curve turns from
concave down to concave up or exactly 1 standard deviation away from the mean.
28. Sketch the graph of N(266, 16), the distribution of pregnancy length from conception to birth for humans.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
29. What is the 68-95-99.7 rule? At approximately 68%, 95% and 99.7% will be where 1, 2 and 3 (respectively) standard
deviations will fall away from the mean of a normal curve.

30. Using the empirical rule (the 68-95-99.7 rule), find the length of the longest 16% of all pregnancies. Sketch and
shade a normal curve for this situation. The longest 16% of all pregnancies are 282 days and greater. 68% (1 standard
deviation) is in the middle, so 16% is on each side.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
31. Using the empirical rule, find the length of the middle 99.7% of all pregnancies. Sketch and shade. The middle
99.7% (3 standard deviations) of pregnancies are between 218 and 314 days.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
32. Using the empirical rule, find the length of the shortest 2.5% of all pregnancies. Sketch and shade. The shortest 2.5%
(2 standard deviations, so that 95% is in the middle) of all pregnancies is at 234 days and less.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
33. Using the empirical rule, what percentile rank is a pregnancy of 218 days? The percentile rank for 218 days is the
area of the curve less than 218 days, which is the .15th percentile. Remember percentile is everyone at or below that
value.
Use the Z formula and the Z table to answer questions 34 – 43. Use the following formula
Z
xx
s
34. What percentile rank is a pregnancy of 298 days? We want the percent of pregnancies at or below 298 days.
Z
298  266
 2, so we get .9772 in the table.
16
35. What percentile is a pregnancy of 250 days?
Z
298 days is at the 97.72nd percentile.
250  266
 1 .
16
Using the table, we get .1587. So 250 days is
at the 15.87th percentile.
36. What is the percentile of a pregnancy of 266 days?
Z
266  266
 0.
16
Z
279  266
 .8125 .
16
percentile.
37. What z-score does a pregnancy of 279 days have?
You don’t have to find the percentile/percentage. It just asks for the z-score.
So we get .5, so 266 days is at the 50th
So 279 days has a z-score of .8125.
38. What percent of humans have a pregnancy lasting less than 279 days? Sketch and shade a normal curve.
Using the answer from #37, look for the area of the shaded region for a z-score of
.82. We get .7910, so 79.1% of all pregnancies last less than 279. Use InvNorm to
check this answer
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
39. What z-score does a pregnancy of 257 days have?
z
257  266
 .5625
16
So 257 days has a z-score of -.5625.
40. What percent of humans have a pregnancy lasting less than 257 days? Sketch and shade.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
P(Z  .5625)  .2877 , you have to look up z=-.56 in the table since it only allows 2 decimal places.
So 28.77%
41. What percent of humans have a pregnancy lasting longer than 280 days? Sketch and shade.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
P(Z  280)  P(X 
280  266
)  P(Z  .875)  .8106 , so 81.06%
16
42. What percent of humans have a pregnancy lasting between 260 and 270 days? Sketch and shade.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
P(260  X  270)  P(
260  266
270  266
Z
)  P(.375  Z  .25)  .2467 , Look up z<.25 and
16
16
then subtract z<-.38. So 24.67%
43. Would you say pregnancy length is a continuous or discrete variable? Justify. Continuous because you could find
down to the days/mins/sec of pregnancy, if possible .
44. You have normal distributions on your TI-83. Use these functions to check your answers to problems 38, 40, 41, and
42. You can use Normalcdf (lower, upper, mean, standard deviation) to check your answers for problems where you
know the days. If you are given the percentage, then you would use InvNorm(area below, mean, standard deviation).
Remember these answers should be very close to what you get from the table. You do NOT have to use the tables on
the Final or the AP exam; they are they incase you need them. If you do use the calculator, so NOT use “calculator
speak” as replacement for work!!

45. How long would a pregnancy have to last to be in the longest 10% of all pregnancies? You can use the table, or
InvNorm in your calculator. In the table, look up .9000 (or the closest number possible) inside the table (not a znumber). You get z=1.28, now solve for x.
1.28 
x  266
.
16
X=286.48 days or longer (don’t forget “or longer”)
You can also use InvNorm(.9, 266, 16) = 286.5048 days or longer. Don’t forger you must show work using the
formula. “Calculator speak” means nothing and is NOT considered work!
46. How short would a pregnancy be to be in the shortest 25% of all pregnancies? Do the same thing as #45. Using the
 With the calculator 255.2082 days or less..
table, you get 255.28 days or less.
47. How long would a pregnancy be to be in the middle fifth of all pregnancies? This does NOT say 50%!! It says 20%,
and the middle 20%. So you want to know between 40% & 60%. Table = between 262 & 270 days. Calculator =
between 261.9464 & 270.0536 days.
48. Does the vole information from the beginning of this review seem to be normal? Justify by checking actual
percentages within 1, 2, and 3 standard deviations of the mean. Check to see if the percentages of the 1,2,3
standard deviations are relatively close to 68%, 95%, and 99.7%. 5.87 1s  4.0573 & 7.6827 , there are 62%
of the numbers (5, 6 & 7) that fall in that range. 5.87  2s  2.4456 & 9.4954 , there are 94% in that range.
5.87  3s  .4318 &11.3084 , and finally 100% in this range. This is REALLY close to the correct values, so,
these seem to be normal.


49. Make a back-to-back split stemplot of the following data:

Reading Scores
4th Graders
12
15
18
20
20
35
35
35
36
37
39
40
7th Graders
1
12
15
18
18
30
30
31
33
33
33
35
22
42
20
36
25
26
28
29
31
32
23
23
24
25
27
28
50. Make a comparison between 4th grade and 7th grade reading scores based on your stemplot. 7th graders have a smaller
min & max. 4th graders might have an outlier at 1.

51. What is the mode of each set of scores? 4th grade = 35. 7th grade = 33.
52. Is the score of “1” for one of the 7th graders an outlier? Test using the 1.5 IQR rule. To find IQRs, we use the rule
that any numbers beyond the values of Q1-1.5IQR & Q3+1.5IQR are outliers. Q1=19, Q3=32, IQR = 32-19=13. 191.5*13=-.5. No, 1 is not an outlier.
53. What is the difference between a modified boxplot and a regular boxplot? Why is a modified boxplot usually
considered better? Modified boxplots show outliers and regular boxplots do not. More detail is always better,
especially since outliers can mess up a lot of different calculations. So it’s good to know if there are any.
54. What are the four principles that guide the examination of data? (See page 121) *What individuals do the data
describe? *What exactly are the variables? How are they measured? *Are all the variables quantitative or is at least
one a categorical variable? *Do you want simply to explore the nature of the relationship, or do you think that some
of the variables explain or even causes changes in others? That is, are some of the variables response variables and
others explanatory variables?
55. Graph the following hot dog data:
Calories
Sodium (milligrams)
108
149
130
350
132
345
135
360
138
360
140
375
144
380
145
390
150
400
163
415
167
400
172
420
176
450
180
500
184
505
Calories(cont.) Sodium (milligrams) (cont.)
195
500
200
515
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
56. What is the response variable? Sodium is the response.
57. What is the explanatory variable? Calories are the explanatory variable.
58. What is the direction of this scatterplot? (positive, negative…)
59. What is the form of this scatterplot? (linear, exponential…)
60. What is the strength of this scatterplot? (strong, weak…) Pretty strong, if it wasn’t for that 1 outlier.
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
61. Are their clusters? If so, where?
There are 2 main clusters.
62. Are their outliers? (Outliers in a scatterplot have large residuals.) There are 2 ways to find residuals. The long way:
y  yˆ . First do LinReg (#8). Then go to an empty list, an make it = Ylist – Y1(Xlist). Then plot this new list with
your original Xlist. The fast way: First do LinReg (#8). Go to plot a scatter plot with your Xlist & “Resid” (which is


Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
found under List, but you MUST do LinReg first!).
This is the original X & Y lists with the
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
equation . This is the residual plot
63. If there are outliers, are they influential? Yes, and it is influential because it will dramatically change the LinReg if it
were deleted.
64. Calculate the correlation.
x n  x yx  y 
1 x1  x y1  y  x 2  x y2  y 
r

 
 ... 
 .9195 To get credit, you



n 1
s
s
s
s
s
s












y
x
y
x
y
 x

MUST plug in the first 2 sets of numbers and the last one. Just writing the formula with variables, doesn’t count!
65. Calculate the correlation without the point (108, 149). .9587 (All answers in this answer key will now no longer have
this point calculated)
66. Using problems 64 and 65, check your answer to problem 63. Yes, influential outliers not only change the Line of
Best Fit, but it will also change the correlation.
67. What two things does correlation tell us about a scatterplot? Direction & Strength.
68. If I change the units on sodium to grams instead of milligrams, what happens to the correlation? Nothing, correlation
is not affected by unit changes. It describes the relationship between the variables. No matter how they are measured,
that doesn’t change.
69. What is the highest correlation possible? -1 & 1. Remember that -1 is a “high” correlation, just in the negative
direction.
70. What is the lowest correlation possible? 0
71. Correlation only applies to what type(s) of relationship(s)? Quantitative, you cannot put categorical/words into the
formula.
72. Is correlation resistant to outliers? No, as we saw in # 65. Any formula with x-bar in it is not resistant.
73. Does a high correlation indicate a strong cause-effect relationship? No. Cause & effect can only be determined
through an experiment. Correlation shows something just might be there, but might not. Remember the ice cream
and drowning example we did in class.
74. Sketch a scatterplot with a correlation of about 0.8. May vary, see me if you are unsure.
75. Sketch a scatterplot with a correlation of about –0.5. May vary, see me if you are unsure.
76. Find the least-squares regression line (LSRL) for the calories-sodium data.
a  y  bx
&
br
sy
sx
yˆ  a  bx Don’t forget the “hat”!!!!
If the “a” or “b” are asked for individually, make sure you show the work!! If LSRL is


asked for, you do not need to show how you got “a” and/or “b”. So the answer is
that one point still deleted)
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
yˆ  24.4389  2.4594 x (with

77. Draw the LSRL on your scatterplot.
78. What is the slope of this line, and what does it tell you in this context? Slope is “b”, so it’s 2.4594. It says that for
every 1 Calorie, Sodium increases 2.4594 units.
79. What is the y-intercept of this line, and what does it tell you in this context? Y-intercept is “a”, so it’s 24.4389. It
means that if there are no Calories in a hotdog there will still be 24.4389 units of Sodium.
80. Predict the amount of sodium in a hot dog with 155 calories. If your LSRL is in Y1, you can use the notation Y1(155)
or just hand calculate x=155. y=405.6488 units (with that one point deleted)
81. Predict the amount of sodium in a hot dog with 345 calories. Y = 872.9384 units of sodium.
82. Why is the prediction in problem 80 acceptable but the prediction in problem 81 not? #81 is extrapolation (far from
our data) and cannot be trusted because trends may not continue the same way so far away from the given
information. However #80 is interpolation (close/within our data) and is usually acceptable.
83. Find the error in prediction (residual) for a hot dog with 180 calories. Error (or residual) is
y  yˆ  500  y1 (180)  500  467.1343  32.8657
84. Find the residual for 195 calories. -4.02559
85. The point ( x,y ) is always on the LSRL. Find this point, and label it on your scatterplot. (159.4375, 416.5625)
86. Find the standard deviation of the calories.
(x1  x ) 2  (x 2  x ) 2  ... (x n  x ) 2
sx 
 23.1343
n 1
87. Find the standard deviation of the sodium. Same formula as #86, except with y.
s y  59.3498
88. Using the equations on page 153,
 verify the slope and intercept of the LSRL. Check these by plugging the numbers
into the formula by hand, do not use 1-Var Stats.

89. Find the coefficient of determination for this data. Coefficient of determination
is r2, so just square #65. r2=.9191
90. What does r2 tell you about this data? It tells us how well the y-values (sodium of a hotdog) are explained (predicted)
by the x-values (calories of the hotdogs) though our LSRL.
91. How can you use a residual plot to tell if a line is a good model for data? A residual plot tells us if data is actually
linear (remember our eyes can deceive us!). If the data is linear, we will find equal distant, random scatter in the dots.
That is that there will be about as many points above & below the x-axis. And there will be no obvious pattern to the
data. If there are obvious clusters and/or pattern(s), then the data was not linear to begin with. Thus a LSRL (linear
equation) is not the best fit.
92. If you know a scatterplot has a curved shape, how can you decide whether to use a power model or an exponential
model to fit data? You can check the y-values. If they increase my approximately the same ratio, then it’s probably
exponential.
93. Graph the following data:
Time (days)
Mice
0
6
30
19
60
60
90
195
120
597
Ratio
//////
19/6=3.1667
60/16=3.1579
195/60=3.25
597/195=3.0615
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.

94. Perform the appropriate logarithmic transformation (power or exponential) by hand on the above data to get an
equation. Since the ratios (see #92) are all approximately the same, I will use exponential. This means I must make a
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
new list and make it the LogY. Now graph.
The LinReg(Xlist, LogYlist) is
log yˆ10 .7788  .0167 x
10
log yˆ  .7788  .0167 x Now get rid of the log. yˆ  10 .7788 10 .0167x
You can stop at the 2nd step,
y  6.0094 1.0392 x
but if you want, or are asked to, you can evaluate the powers of 10.
95. Check your equation in problem 94 using your calculator’s power and exponential regression functions.
Stat->Calc->ExpReg (Xlist, Ylist) make sure you use the original y-list, not the logy! You get

yˆ  6.0094 1.0392 x
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.

96. Make a residual plot to support your choice for problem 94.
97. Graph the following data:
Diameter (inches) Cost (dollars)
6
3.50
9
8.00
12
14.50
15
22.50
20
39.50
Follow directions from # 62
Ratio
/////
8/3.5=2.2856
14.5/8=1.8125
22.5/14.5=1.5517
39.5/22.5=1.7556
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
98. Perform the appropriate logarithmic transformation (power or exponential) by hand on the above data to get an
equation. Since the ratios are not really close, I’m going to try power. So we must make 2 new lists and log both X &
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
Y.
If you look closely, it does look a little straighter. The LSRL of LogX, LogY is
10
log yˆ10 1.0217  2.0167 log x
yˆ  10 1.0217 10 2.0167logx
log yˆ  1.0217  2.0167 log x , now get rid of the logs. yˆ  10 1.0217 10 logx
2.0167
yˆ  10 1.0217  x 2.0167
yˆ  .0951  x 2.0167

99. Check your equation in problem 98 using your calculator’s power and exponential regression functions. Using
ˆ  .0951 x 2.0167 PwrReg is found just below ExpReg.
PwrReg(Xlist, YList), we get y
100.
101.

Make a residual plot to support your choice for problem 98.

Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
101.
What is the correlation for the equation you found in problem 98? r=.9999
102.
What is extrapolation, and why shouldn’t we trust predictions using extrapolation? Extrapolation is being far
from our data and cannot be trusted because trends may not continue the same way so far away.
103.
What is interpolation? Interpolation is being close/within our data and is usually acceptable because we know
how the trend reacts there.
104.
What is a lurking variable? Something in the background that we don’t know about that may affect our variables.
105.
Why should we avoid using averaged data for regression and correlation? Average data hides extremes, which
doesn’t give you the entire picture of the data (usually more details is always better). Plus, many times, people want
to take that data and break it down to use it to predict, which is inappropriate. For example, if you have data on the
average snow fall in a month, some people might want to break that down to average snow fall per day.
106.
What is causation? Give an example. Causation is one variable actually causes a change in another variable
(regardless if the are other variables also making that change). Example: the harder your shake your soda bottle, the
more “fizz” is made. The variables are vigorousness of the shake & amount of fizz. Check with me if you’re not sure
if your example is causation.
107.
What is common response? Give an example. Common response is when both variables actually do not affect
each other at all. The change in both is due to the same 3rd (lurking) variable. Example: Sales of snow blowers goes
up & down indirectly with the sales of swimsuits. Just looking at that info and not “thinking” you could say that if a
store sells less snow blowers, that will make swimsuit sales go up. We know that there is a 3rd variable effecting both
sales: weather. Check with me if you’re not sure if your example is common response.
108.
What is confounding? Give an example. Confounding is when you’re not sure if one variable is affecting the
other or if it’s a 3rd lurking variable. So you’re confounded on if “x” is actually causing any change in “y” at all. If
you’re not “confused” on “x” causing “y” and you can come up with another variable that also affects “y,” that
example is Causation. Confounding is only for when you’re not sure about “x.” Example: Intensity of a headache
and amount of light in the room. For some people, light might not do anything to their headache, for others, it makes
their headache much worse. So light may or may not actually be causing someone’s severity of the headache.
109.
Why is a two-way table called a two-way table? A “two-way” table is a normal table, with all row & column
variables shown.
Use this table for questions 110–117:
Education
Did not complete high school
Completed high school
1 to 3 years of college
4 or more years of college
Total
110.
Smoking Status
Never smoked
82
97
92
86
357
Smoked, but quit
19
25
49
63
156
Smokes
113
103
59
37
312
Fill in the marginal distributions for this table. See above, check table total in both directions!
Total
214
225
200
186
825
111.
Display this table on a segmented bar chart (See p. 244).
Education & Smoking
120
Did not complete
high school
Completed high
school
1 to 3 years of
college
4 or more years of
college
Education
100
80
60
40
20
ok
es
Sm
it
qu
t
bu
ok
ed
,
Sm
N
ev
er
sm
ok
ed
0
Smoking Status
112.
What percent of these people smoke? 312/825=37.82%
113.
What percent of never-smokers completed high school? 97/357=27.17%
114.
What percent of those with 4 or more years of college have quit smoking? 63/186=33.87%
115.
What percent of those with some college smoke? (59+37)/(186+200)=25.87%
116.
What percent of smokers did not finish high school? 113/312=36.22%
117.
What conclusion can be drawn about smoking and education from this table? The trend is the less likely are you
to go further in schooling if you smoke or did smoke, but quit. However, if you did quit, you are more likely to go on
in school than someone who smokes & didn’t quit. (However, remember you cannot prove causation from this!)
118.
What is Simpson’s Paradox? Simpson’s Paradox is when 2 variables are shown in two two-way table, along with
rd
a 3 variable. Then for whatever reason, the 3rd variable is taken away to make one two-way table. The omission of
the 3rd variable now makes it a “lurking” variable & this can completely reverse any trends or relationships seen in the
different tables between the original two variables.
119.
What is the difference between an observational study and an experiment? Observational study is when you are
just “watching and recording” data or when you are just gathering data from different sites to combine it (like from
different hospitals or gathering data from the internet). Experiment is when you are actually imposing a treatment on
the subject to gather any responses. So you’re actually doing something to cause a response (remember, no response
is a response)
120.
What is a voluntary response sample? When all subject completely volunteered him or herself and were not put
in some kind of random selection. Yes, most experiments on humans are “voluntary” since you need their
permission, but this is not necessarily voluntary response.
121.
How are a population and a sample related but different? Population is the “big” group. A sample is the “small”
group from within that population.
122.
Why is convenience sampling biased? To not be biased, there needs to be as little human thought in picking the
subject for the sample. If the sample is picked based on convenience, they could all be similar in some aspect that can
taint the data.
123.
SRS stands for what kind of sample? Name and define. Simple Random Sample. Number every thing in the
population and used some form of random number sampling to pick the ones in your sample.
124.
Discuss how to choose a SRS of 4 towns from this list: Number these towns from 1 – 12 or 0 – 11. (any way you
want. Up/down or across). Then use the randInt in your calculator randInt(1,12,4) or you can use the random number
table in your book.
Allendale
Bangor
Chelsea
Detour
Edmonton
Fennville
Gratiot
Hillsdale
Ionia
Joliet
Kentwood
Ludington
125.
What is a stratified random sample? Similar to SRS, but before you number & pick, you break up the population
into some groups, usually based on some variable that you believe may affect your data. (Example: gender, species of
animal, etc) Then you number each group and select from each group separately.
126.
What is a multistage sample? Before numbering & selecting, you much break down a LARGE population into a
small manageable group so you can number it and pick from it. They do this in the US census.
127.
What is undercoverage? Undercoverage is from excluding a group from the population based on HOW you
decided to take the sample (not because you couldn’t get a hold of them) Example: door-to-door survey from 9:00am
– 2:00pm Monday - Friday. Because of this type of survey and how it’s being done, you will exclude all 1st shift
workers from your survey.
128.
What is nonresponse? Nonresponse is when you cannot get a hold of someone, but you could based on how you
did your sampling. Example: phone surveys. You “undercover” any one with a phone, but you get nonresponse from
people like me who just don’t answer their phones. You could get a hold of me, if I answered my phone.
129.
What is response bias? Response bias is when the person giving the answers lie (on purpose or accident). Or
question wording is another type of example. See next question.
130.
Why is the wording of questions important? Give an example. Some times the way a question is worded or
asked by the interviewer can lead the interviewee to one answer over another. “Who should be allowed to carry
concealed weapons? A) Only safety officials. B) Safety office & private citizens with a clear need C) Any private
citizen” (found at Gallop.com at http://www.gallup.com/poll/16822/Public-Wary-About-Broad-Concealed-FirearmPrivileges.aspx) This leaves no option for those who believe no one should and may feel that they should answer one
of the given choices even if they don’t believe it.
131.
How are experimental units and subjects similar but different? Experimental units stand for all things/people who
are experimented on. However, we usually reserve the word “subjects” for humans only.
132.
Explanatory variables in experiments are often called _factors____.
133.
If I test a drug at 100 mg, 200 mg, and 300 mg, I am testing one variable at three __treatments___.
134.
What is the placebo effect? When the subject reacts to a placebo like they were getting the real treatment.
135.
What is the purpose of a control group? To help control the affects of lurking variables on the experiment and to
have something to compare to after treatment is over.
136.
Give an example of when we may not want to use a placebo/control group. We do not need to use placebos on
none “conscious” things, but usually will still need a control. However, one reason we may want to not use a placebo
is if we are comparing two different treatments to each other. Then we don’t need a placebo.
137.
What are matched pairs experiments? Give an example. Matched pairs experiments are when both treatments are
places on the same subject (can also be a before & after). Example: I want to determine the effectiveness of 2
different hand lotions. Since everyone’s skin is different, I will use both lotions of the same person and then compare.
138. What are the three principles of experimental design? * Control the effects of lurking variables by comparing several
treatments. *Randomly assign subjects to treatment groups. * Replicate the experiment on many subjects to reduce
chance variation.
139. What does double-blind mean, and why would we want an experiment to be double-blind? In a double-blind experiment,
neither the subjects nor the people recording the observations know which subject received which treatment. This reduces bias.
140.
Give an example of how lack of realism in an experiment can give false results. If the subjects know they are
being watched, or if the situation doesn’t seem real to them (like scenarios you see on a candid type camera show),
they may act differently or how they think the person watching would want them to react.
141.
What is block design and how is it different than stratified random samples? Not any different really. We use the
words “block design” to describe what we do in an experiment and “stratified random sampling” for observational
studies. See number 125 for a description.
142.
I want to test the effects of aerobic exercise on resting heart rate. I want to test two different levels of exercise, 30
minutes 3 times per week and 30 minutes 5 times per week. I have a group of 20 people to test, 10 men and 10
women. I will take heart rates before and after the experiment. Draw a chart for this experimental design
5 men for Treatment A
compare
Men
10 men
5 men for Treatment B
compare
everyone
random
selection
20
people
5 women for Treament A
compare
women
10 women
5 women for Treament B
143. Why is simulation useful? Sometimes you cannot actually do an experiment for time, monetary reasons, etc.
144. What are the five steps of a simulation? * State the problem * State the assumptions * Assign digits to represent outcomes
* Simulate many repetitions * State your conclusions
145.
Design and perform a simulation of how many children a couple must have to get two sons. (A simulation
involves many trials. For this simulation, perform 10 trials.) There will be various answers to this. You can use your
calculator or random number table. P(male) = .5 See me for help if you need it.
146.
What is independence? Independence of two or more events is when they do not affect each other’s probabilities.
There is a test for independence. A & B are Independent if and only if P(A)  P(B)  P(A  B)
147.
You are going to flip a coin. What is the sample space for each flip? S={H, T}
148.
You are going to flip a coin three times and note how many heads and tails you get. What is the sample space?

S={0, 1, 2, 3}
149.
You are going to flip a coin three times and note what you get on each flip. What is the sample space? S=
{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
150.
3rd
coin
Make a tree diagram for the three flips.
2nd
coin
Heads
Heads
Tails
1st
coin
Heads
Tails
Tails
Heads
HHH
Tails
HHT
Heads
HTH
Tails
Heads
HTT
THH
Tails
THT
Heads
TTH
Tails
TTT
151.
There are three ways I can drive from Fremont to Grand Rapids and four ways I can drive from Grand Rapids to
my home. How many different ways can I drive from Fremont to my home through Grand Rapids? There are two
“tasks” for you to do: drive to Grand Rapids & then home. Use the multiplication rule. 3*4=12.
152.
How many different four-digit numbers can you make? There are 4 tasks: pick 4 numbers. 10*10*10*10=10000
153.
How many different four-digit numbers can you make without repeating digits? There are 4 tasks again, but this
time, there can’t be any repeats! So it’s 10*9*8*7=5040
154.
What is an event in probability? An event is one or more of the possible outcomes in a random phenomenon.
155.
Any probability is a number between (and including) __0___ and __1___.
156.
All possible outcomes together must have probability of __1___.
157.
If S is the sample space, P(S) = ___1__.
158.
What are complements? Give an example and draw a Venn diagram.
Complements are the “nots.” If an event is rolling less than a 3 on a die,
then it’s complement is “not rolling less than a three” or “rolling greater
than or equal to 3”
159.
What are disjoint events? Give two examples and draw a Venn
diagram. Disjoint events are 2 or more events that share no
outcomes. Like Aces & Kings in a deck of cards
1 and 2
Aces
3, 4, 5 and 6
Kings


Use the following chart for questions 160-163:
M & M Color Brown
Probability
0.3
Red
0.2
Yellow
0.2
Green
0.1
Orange
0.1
Blue
?
160.
What is the probability that an M & M is blue? All colors add to 1, so 1-.3-.2-.2-.1-.1 = .1
161.
What is the probability that an M & M is red or green? P(red or green) = .2+.1=.3
162.
What is the probability that an M & M is yellow and orange? Impossibility. You cannot have a yellow & an
orange M & M at the same time. So the answer is  (empty set).
163.
What is the probability that an M & M is not brown or blue?
P(Brownc  Blue)  P(red  yellow  green orangeblue)  .7
164.
Bre can beat Erica in tennis 9% of the time. Erica can swim faster than Bre 8% of the time. What is the
probability that Bre would beat Erica in a tennis match and in a swimming race? Assuming independence, then it’s
.09*.92=.0828
165.
What assumption are you making in problem 164? Do you think this assumption is valid? We assumed the two
races were independent. This probably is not valid. If you loose one race it can affect your next race, positively or
negatively.
166.
Using two dice, what is the probability that you would roll a sum of seven or eleven? Check p. 336 for dice
sample space.
167.
P(7 11) 
8 2

36 9
Using two dice, what is the probability that you would roll doubles?

6 1

36 6
168.
Using two dice, what is the probability that you would roll a sum of 7 or 11 on the first roll and doubles on the
second roll? “And” means to “multiply” in probability. And since dice rolling is independent, we can just multiply
the two probabilities:
2 1 1
P[(7 11)  (doubles)]   
9 6 27
169.
What assumption are you making in problem 168? Do you think this assumption is valid? The dice are
independent is the assumption. Yes, it’s valid.
170.
Using
two dice, what is the probability that you would roll a sum of 7 or 11 that is also doubles?
P[(711)(doubles)]   This is impossibility.
171.
What is the union of two events? The union of two or more events is when any one or more of the events can
happen. The union of Aces and Hearts in a deck of cards has the S={Ace of D, Ace of H, Ace of S, Ace of C, 2H, 3H,
4H, 5H, 6H, 7H, 8H, 9H, 10H, JH, QH, KH}
172.
What is an intersection of two events? The intersection of two or more events is when all the events happen at the
same time. The intersection of Aces and Hearts in a deck of cards has the S={Ace of Hearts}.
173.
How can we test independence in probability? We can use the independence rule: P(A)  P(B)  P(A  B)
If this equation is true, then A & B are independent. If this equation is not true, then A & B are not independent.


174.
Perform an independence test on the smoking/education chart from problem 110 to show that smoking status and
education are not independent. You can check any 2 options from the smoking & education categories. As soon as
you find one that does not work, then the two categories are not independent.
?
P(Did not complete HS  Never smoked ) P(Didn' t complete HS)  P(Never smoked )
82
825
.0994
214 357

825 825
 .1122
?

So Education & smoking are not independent, they are dependent according to this information.
175.
Make a Venn diagram for the following situation:
Do this problem in the following order to make it work.
7. 45% of kids like Barney This involves the ENTIRE circle of Barney. There are 4 Sections, we’ve already used
3, so subtract what’s left to find out the last part: 15%
6. 25% of kids like Blue This involves the ENTIRE circle of Blue. There are 4 Sections, we’ve already used 3,
so subtract what’s left to find out the last part: 5%
5. 55% of kids like Pooh This involves the ENTIRE circle of Pooh. There are 4 Sections, we’ve already used 3,
so subtract what’s left to find out the last part: 20%
4. 15% of kids like Blue and Pooh This is 2 sections again. Again, we already used 5% for #2, so there’s only
10% left for the Blue & Pooh only section.
3. 25% of kids like Barney and Pooh This involves 2 sections, the section we used in #2 and the Barney & Pooh
only section. Since we’ve used 5% already in #2, there’s 20% left for the other section.
2. 5% of kids Barney, Blue, and Pooh So a “5” goes in the section were all 3 circles intersect.
1. 5% of kids like Blue but not Barney or Pooh Start with the one with the most information, this one. So a “5”
goes in the crescent for Blue all by himself.
8. Now subtract all the parts from 100% to find the probability of being in the box, but outside of the circles. In
other words, kids that do not like any of them at all = 20%
Don’t forget the “box.” The box says, “Here are the subjects in this experiment or survey” Outside of the box is
everyone else.
Blue
Barney
5%
5%
15%
5%
10%
20%
20%
20%
Pooh

176.
A dartboard has a circle with a 20-inch diameter drawn inside a 2-foot square. What is the probability that a dart
lands inside the circle given that it at least lands inside the square? (Assume a random trial here.) Find the area of
2
2
each, the Circle A  r  10  314.1593 (store in your calculator so you
don’t have any rounding error. Square = 242=576. Now divide Circle/Square =
.5454
2 ft
20 in

177.
Give an example of a discrete random variable. Number of students in a class. See me if you’re not sure if your
example is correct.
178.
Give an example of a continuous random variable. Temperature on a given day (we can measure to as many
decimal places as technology will allow us). See me if you’re not sure if your example is correct.
179.
Make a probability histogram of the following grades on a four-point scale:
Grade
Probability
180.
0
0.05
1
0.28
2
0.19
3
0.32
4
0.16
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.
Using problem 179, what is P(X > 2)?
P(3 or 4)  P(3)  P(4)  .46
P(2 or 3 or 4)  .67
181.
Using problem 179, what is P(X > 2)?
182.
What is a uniform distribution? Draw a picture. Uniform distribution is when
all the probabilities for every x-value is the same.
Quic k Ti me™ and a
T IFF (Unc om pres s ed) dec om pres s or
are needed to s ee t his pic t ure.

183.
In a uniform distribution, what is P(0.2 < X < 0.6)? Find the area between .2 & .6, if
0  X 1.
.4*1 = .4
184.
In a uniform distribution, what is P(0.2 < X < 0.6)? Same as #183. The probability of being exactly on a point in
a continuous distribution is 0. So, you take the answer from #183 and add 2*0 and you still get .4
185.
How do your answers to problems 180, 181, 183, and 184 demonstrate a 
difference between continuous and
discrete random variables? A uniform distribution is continuous and there is no difference between < and . There is
a difference between them if we are using a discrete distribution, like in # 180 &181.
186.
Normal distributions are (continuous or discrete).
187.
Expected value is another name for __mean/average___.
188.
Find the expected value of the grades in problem 179.
Var Stats XList, ProbList
189.
 x  x1 p1  x2 p2  ... xk pk  2.26 You can use 1-
Find the variance of the grades in problem 179.
2
  (x1  x ) 2 p1  (x2  x ) 2 p2  ... (x
k  x ) pk  1.3724
2
x
190.
Find the standard deviation of the grades in problem 179.


You can use 1-Var Stats XList, ProbList.
#189  1.1715

191.
What is the law of large numbers? The law of large numbers says that probabilities can only be counted on in the
long-run. Never in the short-run.
192.
If I sell an average of 5 books per day and 7 CDs per day, what is the average number of items I sell per day? Use
the rule for means:  x y   x   y  12 items , where  x  5 and  y  7
193.
If I charge $2 per book and $1.50 per CD in problem 192, what is my average amount of income per day?
( a bX )( a bY )   a bX   a b Y  a  b X   a `  b `Y   $20.50 , where
a=a`= nothing,
b=$2, and b`=$1.50



Combine both means rules:
194.
`
`
Before you can use the rules for variances you must make sure the variables are _independent____.
 use the following situation: For Test 1, the class average was 80 with a standard deviation of 10.
For problems 195-203,
For Test 2, the class average was 70 with a standard deviation of 12.
195.
What is the average for the two tests added together? Formula from #192. Answer is 80+70=150.
196.
What is the standard deviation for the two tests added together?  X Y
sure you use the variances! Then square root the final answer for 15.6205
2
197.
What is the difference in the test averages?
  X2   Y2  10 2  12 2  244
Make
 xy   x   y  10

198.
What is the standard deviation for the difference in the test averages?  X Y   X   Y  10  12  244
Then square root for 15.6205. The variance/standard deviation formula doesn’t care if you are subtracting or adding.
It’s the same formula.

199.
2
2
2
2
2
If I cut the test scores on Test 2 in half and add 50, what is the new average?

 a bX  a  b X  50  .5  70  85
200.
What is the new standard deviation for Test 2 in problem 199?  a bX  b  X  .5 12
root for standard deviation. Remember that “a” does not affect the standard deviation/variance.
2
2
2
2
2
 36
Now square
201.
If I add 7 points to every Test 1, what is the new standard deviation? Since “a” does not affect standard deviation,
the answer is the same as before adding 7 points. It’s 10.

202.
If I multiply every Test 1 by 2 and subtract 80, what is the new mean? “a”=-80 and “b”=2, so the average is 80.
203.
If I multiply every Test 1 by 2 and subtract 80, what is the new standard deviation? “a”=-80 and “b”=2, so the
standard deviation is
2 2 10 2  20 .
204.
Where are the mean and median located on a normal distribution? Since a normal curve is symmetrical, they are
the same so right in the middle.
For problems
205-211 consider the process of a drawing a card from a standard deck and replacing it. Let A be drawing a
heart, B be drawing a king, and C be drawing a spade.
205.
Are the events A and B disjoint? Explain. No because you can draw a King of Hearts. That is an outcome in
both events.
206.
Are the events A and B independent? Explain. Yes, since you are replacing cards. Also the
?
P(Heart)  P(King) P(Heart  King)
multiplication/independence rule works.
207.
1 1

4 13

1
52
This is true.
Are the events A and C disjoint? Explain. Yes, because you cannot have a card that is both a heart & a spade.
208.
Are the events A and C independent? Explain. NO!! Disjoint events are dependent!! Both A & C have a
 However, if I pick a card and I get one event (let’s say Hearts) then there is no way to
probability of happening of ¼.
get the other (spades) and it’s probability goes down to zero. So it is affected by the first card.
209.
What does the symbol  mean? “Union” See # 171
210.
What does the symbol  mean? “Intersect” See # 172