TPS Chapter 1: Exploring Data
Q1. The science of data is known as ____.
Q2. Most raw data sets can be organized into rows and columns. Each
row represents some object or person that is studied, and each column
represents some characteristic about that thing that is measured. Our
textbook calls those objects and characteristics what two things
respectively?
Q3. What are the two main classes of variable types?
Q4. A description, depiction, or equation telling what values a variable
takes on and how often it takes on these values is called the ___ of the
variable.
Q5. Before studying the relationships among variables, it's usually good
to begin by examining what?
Q6. Before getting numerical summaries of the data, your textbook
advises exploring the data with what?
Q7. What two types of graphs are usually most appropriate for
categorical data?
Q8. If several percentages do not represent portions of the same whole,
then what type of graph is inappropriate?
Q9. When you are asked to describe a distribution after looking at a
graph, the general tactic is to look for an overall pattern and also for
striking deviations from that pattern. When describing the overall
pattern, what three features should you mention?
Q10. When you are asked to describe a distribution, the general tactic is
to look for an overall pattern and also for striking deviations from that
pattern. What are the striking deviations called?
Q11. Someone wants to display the center, shape, and spread of a data
set with a picture. But the person also wants to communicate, through
the same graph, the individual raw data values that were collected in the
study. There are too many different values that the variable takes on to
make a dot plot feasible. What type of graph should the person choose?
Q12. Instead of a dot plot or a stem plot, a ____ is the most common
graph of the distribution of one quantitative variable.
Q13. What does your textbook depict as a minimum number for either
the number of stems in a stem plot, or the number of classes in a
histogram?
Q14. If the right and left sides of a histogram are approximately mirror
images of each other, we call the distribution what?
Q15. If there's a big hump on the left side of a histogram and a long tail
extending far out to the right, do we say that the distribution is skewed
right or skewed left?
Q16. If you look at people's incomes, defining income so that zero is
the smallest possible value, and your sample includes mainly
middle-income people but also a few extremely high-income people, will
the distribution be skewed right or skewed left?
Q17. Mary gets a test report saying that 79% of the test takers fell at or
below the score that she made. The name of the type of score she got is
what?
Q18. A relative cumulative frequency graph is often called what?
Q19. In a relative cumulative frequency graph, or ogive, the horizontal
axis is for the values of the variable you are looking at. For any given
value on the horizontal axis, what does the value on the vertical axis
stand for?
A1. Statistics
A2. Individuals and variables
A3. Categorical and quantitative
A4. Distribution
A5. Each variable by itself
A6. Graphs
A7. Bar charts and pie charts
A8. A pie chart
A9. Center, shape, and spread
A10. Outliers
A11. A stem plot
A12. Histogram
A13. Five
A14. Symmetric
A15. Skewed right
A16. Skewed right
A17. Percentile
A18. Ogive.
A19. The fraction of observations less than
or equal to that value
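The idea behind A17 through A19 can be checked numerically. The following is a minimal Python sketch, assuming a made-up list of test scores (not from the textbook) and a hypothetical helper named fraction_at_or_below; it computes the vertical-axis value of an ogive at a given score.

```python
def fraction_at_or_below(data, value):
    """Relative cumulative frequency: share of observations <= value."""
    return sum(1 for x in data if x <= value) / len(data)

# Made-up scores, for illustration only.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

mary = 80
# Height of the ogive at Mary's score; multiply by 100 for her percentile.
print(fraction_at_or_below(scores, mary))
```

Reading the same function in reverse (finding the score whose fraction is 0.5) is what A20 describes for locating the center.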
Q20. If you are given a relative cumulative frequency graph, and
someone asks you to find the center of the distribution, how do you do
it?
Q21. On a time plot, what axis does time go on?
Q22. On a time plot, an overall upward or downward slope is called
what?
Q23. On a time plot, what do you call the shorter-term variations that
occur regularly, repeating themselves in a cyclic fashion?
Q24. 1/n times the summation of the x(i), where n is the number of
cases and x(i) is the value of the ith case, is known as what?
Q25. The number in a distribution such that half the observations are
smaller and the other half are larger is called what?
Q26. If there is no middle value in a data set because you have an even
number of cases, how do you find the median then?
Q27. Between the mean and median, which of these is pulled farther in
the direction of extreme values or outliers?
Q28. If a distribution is highly skewed to the right, which value will be
lower: the mean, or the median?
Q29. From which statistic, the mean or the median, can you recover the
total value of all the cases in your data set, if you know how many cases
there are?
Q30. What's the definition of the range of a distribution?
Q31. What's the chief problem with using the range as a measure of the
spread of a distribution?
Q32. What do you call the median of the subset of observations whose
position in the ordered list is to the left of the overall median?
Q33. What's the definition of the interquartile range?
Q34. What's the rule of thumb for defining outliers in terms of the
interquartile range?
Q35. What five numbers are in the so-called five number summary?
Q36. What type of graph gives a picture of the five number summary?
Q37. What's the difference between a regular box plot and a modified
box plot?
Q38. If you take the deviation of each observation from the mean of the
whole set, square those deviations, add those squares, and divide by one
less than the number of observations, what do you call the resulting
number?
Q39. What is the relationship between the variance and the standard
deviation?
Q40. How is the standard deviation like the interquartile range?
Q41. When you average the squared deviations from the mean to find
the variance of a sample, what should you divide by: the n of cases, or
the "degrees of freedom"?
Q42. Under what conditions will a standard deviation equal zero?
A20. Find the value on the x-axis that has a
50% or .5 value on the y-axis.
A21. The horizontal axis
A22. A trend
A23. Seasonal variation
A24. The mean
A25. The median
A26. You find the mean of the two center
observations.
A27. The mean.
A28. The median.
A29. The mean
A30. The difference between the largest
and smallest value
A31. It's too sensitive to outliers, and it
depends on only two values in the data set.
A32. The first quartile
A33. The third quartile minus the first
quartile.
A34. An outlier falls more than 1.5 times
the interquartile range above the third
quartile or below the first quartile.
A35. The minimum, the first quartile, the
median, the third quartile, and the
maximum.
A36. The box plot
A37. In a regular box plot, the whiskers go
out to the maximum and minimum. In a
modified box plot, the whiskers go out to
the largest and smallest data points that are
not outliers. The outliers are plotted as
isolated points on a modified box plot.
A38. The variance
A39. The standard deviation is the square
root of the variance.
A40. Both of them are measures of spread
of the distribution.
A41. The degrees of freedom
A42. When all the observations have the
same value.
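The definitions in A32 through A41 translate directly into a few lines of code. The sketch below is one possible Python version, assuming made-up data and hypothetical helper names (five_number_summary, outliers). Quartiles are computed as the medians of the lower and upper halves, as A32 describes; other software may use slightly different quartile rules.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # observations left of the overall median
    upper = xs[half + n % 2:]    # observations right of the overall median
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

def outliers(data):
    """Apply the 1.5 x IQR rule of thumb from A34."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

data = [1, 3, 4, 6, 7, 9, 30]    # made-up values
print(five_number_summary(data))  # (1, 3, 6, 9, 30)
print(outliers(data))             # [30]

# Variance divides by n - 1 (A38, A41); the sd is its square root (A39).
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
print(variance, variance ** 0.5)
```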
Q43. Between the interquartile range and the standard deviation, which
is more resistant to the effects of the outliers?
Q44. How do you choose between the five number summary on the one
hand, and the mean and standard deviation, on the other hand, as ways
of describing a distribution?
Q45. If you add the same number to each observation, how does that
affect the center and the spread of the distribution?
Q46. If you multiply each observation by the same number, how does
that affect measures of center and spread?
Q47. What are three graphical methods of comparing distributions?
A43. The interquartile range
A44. The mean and standard deviation are
good for reasonably symmetric distributions
that are free of outliers. Otherwise the five
number summary is usually better.
A45. The number that you add is added to
the measures of center, such as the mean
and median. But measures of spread, such
as the interquartile range and standard
deviation, are not affected.
A46. Both the measures of center (median
and mean) and the measures of spread
(standard deviation and interquartile range)
are multiplied by the same number. (The
variance, which is also a measure of spread,
is multiplied by the square of the number
each observation is multiplied by.)
A47. Side-by-side bar graphs, back-to-back
stem plots, and side-by-side box plots.
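A45 and A46 are easy to confirm by experiment. The following Python sketch assumes a small made-up data set and a hypothetical iqr helper (median-of-halves quartiles); it adds 10 to every observation, then multiplies every observation by 3, and prints the measures of center and spread each time.

```python
import statistics

def iqr(data):
    """Interquartile range using medians of the lower and upper halves."""
    xs = sorted(data)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[half + len(xs) % 2:]
    return statistics.median(upper) - statistics.median(lower)

data = [2, 4, 4, 6, 9]                 # made-up values
shifted = [x + 10 for x in data]       # add the same number to each
scaled = [3 * x for x in data]         # multiply each by the same number

for name, xs in [("original", data), ("plus 10", shifted), ("times 3", scaled)]:
    print(name, statistics.mean(xs), statistics.median(xs),
          statistics.stdev(xs), iqr(xs), statistics.variance(xs))
```

In the output, the "plus 10" row should show the mean and median raised by 10 with the spread unchanged, and the "times 3" row should show everything tripled except the variance, which is multiplied by 9.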
TPS Chapter 2: The Normal Distribution
Q1. The scales of density curves are adjusted so that the total area under
each curve is what?
Q2. The area under a density curve between two x-axis values
represents what?
Q3. Do measures of center and spread apply to a density curve as well
as to sets of observations?
Q4. How do you define the median of the density curve?
Q5. The quartiles of a density curve divide the area into what?
Q6. What is the relationship between the mean and the median of a
symmetric density curve?
Q7. Which is pulled the farther toward the tail of a skewed distribution:
the median, or the mean?
Q8. In conventional notation, what are the meanings of x-bar and s, as
contrasted to mu and sigma?
Q9. What three features describe the overall shape of a normal curve?
Q10. Is there only one normal curve, or is there an infinite number of
normal curves?
Q11. For any given mean and standard deviation, is there only one
normal curve, or an infinite number of normal curves?
Q12. How can you visually find the points one standard deviation from
the mean of a normal curve?
A1. One
A2. The proportion of all observations that
fall between those values.
A3. Yes
A4. The point with half the area under the
curve to its left and the remaining half of
the area to its right.
A5. Four equal parts.
A6. They are equal.
A7. The mean
A8. The first two refer to the mean and
standard deviation, respectively, of a set of
observations, a sample. The second two
refer to the mean and standard deviation,
respectively, of a density curve or idealized
distribution, or the population distribution.
A9. Normal curves are symmetric, single
peaked, and bell shaped.
A10. An infinite number.
A11. Only one.
A12. Those points are the inflection points
of the curve. That is, the curve changes
from falling more and more steeply to
falling less and less steeply, or vice versa.
(Optional answer for calculus lovers: they
are points where the second derivative of
the curve equals zero.)
Q13. The distributions of test scores, of measures of characteristics of
living things, and of summary statistics for chance outcomes repeated
many times, often (but not always!) follow what type of distribution?
Q14. What three percentages do you have to remember when you are
stating the “empirical rule”?
Q15. Are the three percentages for 1, 2, and 3 standard deviations exact,
or easier-to-remember rounded approximations?
Q16. What do the three percentages in the empirical rule apply to? In
other words, what is the meaning of this rule?
Q17. True or false: If Mary scores one standard deviation above the
mean on a normally distributed test, then approximately 68% of the test
takers scored as close to the mean of the test as, or closer to the mean
than, Mary did.
Q18. True or false: If Mary scores one standard deviation above the
mean on a normally distributed test, her score is in the 68th percentile.
Q19. True or false: if Mary scores one standard deviation above the
mean on a normally distributed test, half of 68% or 34% are above the
mean but at or below Mary’s score. An additional 50% are below the
mean. Thus Mary equals or surpasses 50% plus 34% of the test takers,
and is at the 84th percentile.
Q20. What does the notation N(100,15) mean?
Q21. True or false: the standard score for any observation tells how
many standard deviations that score is from the mean.
Q22. What two operations do we do, to standardize a score?
Q23. A standard score is often called by what other term?
Q24. What does the sign of a standard score correspond to?
Q25. Are there an infinite number of standard normal curves, each
with its own equation describing it, or just one standard normal curve,
with just one equation describing it?
Q26. In a table of areas under the standard normal curve, what does the
table entry for each z score represent?
Q27. What steps do you follow to use the z table to solve the following
“problem prototype”: given N(mu, sigma), please find the proportion of
observations less than x?
Q28. What steps do you follow with a z table if you want to know what
proportion of the scores are between two values?
Q29. What two pictures does TPS recommend drawing when solving
problems where you are given normal curves and asked for proportions
of the observations?
A13. The normal distribution
A14. 68%, 95%, and 99.7%.
A15. Approximations.
A16. The three numbers tell the percent of
observations falling within the region plus
or minus 1, 2, or 3 standard deviations from
the mean, respectively, in a normal curve.
(Note that the percents refer to the percent
of observations encompassed by the
interval from that number of standard
deviations below the mean to that number
above the mean.)
A17. True
A18. False
A19. True
A20. It denotes a normal distribution
with mean 100 and standard deviation 15.
A21. True
A22. Subtract the mean and divide by the
standard deviation.
A23. The z-score.
A24. If the z-score is positive, it’s above the
mean, and if negative, below the mean.
A25. There is just one standard normal
curve, with only one equation
describing it.
A26. The area under the curve to the left of
z, or in other words, the proportion of cases
with values less than z.
A27. First we standardize x (by subtracting
mu and dividing by sigma). Then we look
at the z table to find the proportion of the
distribution less than the z score we’ve
obtained.
A28. Look up the proportion less than the
first, and less than the second, and find the
difference between the two proportions.
A29. They recommend drawing
unstandardized and standardized normal
curves, and shading in the areas that are
asked for.
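The table lookups in A26 through A28 can be mimicked in code. This is a minimal Python sketch, assuming the N(100, 15) distribution from Q20 and using the standard library's NormalDist in place of a printed z table; the function names proportion_below and proportion_between are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100, 15          # the N(100, 15) of Q20
std_normal = NormalDist()    # standard normal: mean 0, standard deviation 1

def proportion_below(x):
    z = (x - mu) / sigma             # standardize (A22)
    return std_normal.cdf(z)         # area to the left of z (A26)

def proportion_between(lo, hi):
    # Difference of the two "less than" proportions (A28).
    return proportion_below(hi) - proportion_below(lo)

# One sd above the mean: about 0.84, matching the 84th percentile of Q19.
print(round(proportion_below(115), 4))

# The empirical rule: roughly 0.68, 0.95, and 0.997.
for k in (1, 2, 3):
    print(k, round(proportion_between(mu - k * sigma, mu + k * sigma), 4))
```

The printed values come out slightly different from 68%, 95%, and 99.7%, which is consistent with A15: the empirical-rule percentages are rounded approximations.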
Q30. What do the authors recommend (as a word to the wise for future
test-takers) as the last step of problems giving a normal distribution and
asking for proportions of observations?
Q31. What steps do you go through when you want to find a value given
a proportion of a normal distribution, using the z-table?
Q32. What operations do you do, in what order, to “unstandardize” a
z-score, or turn the z-score into a raw score?
Q33. Suppose you have a data set, and you want to see if it is
approximately normally distributed. What’s the first thing to do, before
doing calculations?
Q34. What’s a way of checking a data set for normality, using the
empirical rule?
Q35. True or false: The point of making a normal probability plot is to
see whether a set of numbers is normally distributed.
Q36. If you were to take any set of numbers, and plot the numbers on
the x-axis, and their z-scores on the y-axis, you would get what shape
for your graph? (Hint 1: y values are (1/sd)*x - mean/sd, or of the form
y=mx+b.) (Hint 2: Standardizing a score involves a linear
transformation.)
Q37. On a normal probability plot for a set of observations, what goes
on the x axis?
Q38. True or false: On a normal probability plot, what goes on the
y-axis for each x value is the z score that would be associated with the
percentile for that value, assuming a normal distribution (and using
midpoints of intervals in finding percentiles).
Q39. What conclusion do you come to if a normal probability plot is not
linear?
A30. They recommend stating the
conclusion in the context of the problem.
Thus rather than just saying, the answer is
49%, you would say, “About 49% of boys
have cholesterol levels between 170 and
240 mg/dl.”
A31. You look for the proportion in the
body of the table, and you find at the
margin the z-score that corresponds to it.
Then you “unstandardize” the z-score.
A32. You multiply the z score by the
standard deviation, and then you add it to
the mean.
A33. Make a frequency histogram or
stemplot, and see if the distribution looks
bell-shaped and symmetric.
A34. Find the mean and sd of the data set,
and count (or get a computer to count) the
proportions of observations that are within
1, 2, and 3 standard deviations of the mean.
See if these proportions correspond,
roughly, to .68, .95, and .997.
A35. True
A36. A line.
A37. The values of the observations
themselves.
A38. True.
A39. That the data are not normally
distributed.
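A31, A32, and A34 describe procedures that can be sketched in code. The Python below is an illustration only, assuming an N(100, 15) distribution and a tiny made-up sample; value_for_proportion and fraction_within are hypothetical helper names, and NormalDist().inv_cdf stands in for reading a z table backwards.

```python
import statistics
from statistics import NormalDist

mu, sigma = 100, 15                  # assumed distribution, N(100, 15)

def value_for_proportion(p):
    """Raw score with proportion p of the distribution below it."""
    z = NormalDist().inv_cdf(p)      # reverse z-table lookup (A31)
    return mu + z * sigma            # multiply by sd, then add the mean (A32)

def fraction_within(data, k):
    """Share of observations within k standard deviations of the mean (A34)."""
    m, s = statistics.mean(data), statistics.stdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

print(round(value_for_proportion(0.90), 1))   # 90th percentile of N(100, 15)

sample = [1, 2, 3, 4, 5]             # made-up, far too small to judge normality
print([fraction_within(sample, k) for k in (1, 2, 3)])
```

With real data, the three fractions would be compared against .68, .95, and .997.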
TPS Chapter 3: Examining Relationships
Q1. Suppose that a researcher wants to study the effect of ever having
taken the drug "ecstasy" on people’s memory test scores. Which of these
is the response variable and which is the explanatory variable?
Q2. How do the terms "dependent variable" and "independent variable"
correspond to the terms “response variable” and “explanatory variable”?
Q3. Is it proper to use the terms, response variable and explanatory
variable, if the explanatory variable does not actually cause the response
variable?
Q4. What's the order of tasks involved in examining relationships
between two variables?
A1. Ingestion of ecstasy is the explanatory
variable and memory test scores are the
response variable.
A2. Dependent corresponds to response,
and independent corresponds to
explanatory.
A3. Yes. No implication of causation is
contained in the terms explanatory and
response (or independent and dependent).
A4. First plot the data, then use numerical
summaries. Look for the overall pattern and
deviations from that pattern, and when the
overall pattern is regular, use a
mathematical model to describe it.
Q5. Suppose that someone has math scores for the children in one
classroom, and English scores for a second set of children in another
classroom. The person asks you about making a scatterplot for these
data. What would you reply?
Q6. True or false: in a scatterplot, each point represents one individual;
the x-coordinate of the point represents the value of one variable and the
y-coordinate represents the value of another variable measured on that
same individual.
Q7. If there is an explanatory variable, which axis should it be graphed
on?
Q8. When describing a scatterplot, what three aspects of the pattern
should you refer to?
Q9. True or false: in describing the form of a scatterplot, it's important
to say whether the graph appears to be linear or not.
Q10. In describing the form of a scatterplot, what term do you use if the
values tend to fall into two or more groups that are separated from one
another by gaps?
Q11. In describing the direction of a scatterplot, when there is a positive
or negative slope, we say that the variables are positively or negatively
what?
Q12. When any given x value on a scatterplot has very widely varying y
values associated with it: the more widely varying the y values, the less
_____ is the relationship between the two variables.
Q13. When you are drawing a scatterplot, what symbols should you use
in showing the axes if the origin of the graph is not at zero?
Q14. What are about 3 other guidelines on how to draw scatterplots
properly?
Q15. Suppose that you want your scatterplot to reflect the influence of a
particular categorical variable, in addition to the relationship of the two
quantitative variables that are plotted. For example, suppose you want
to graph the relation between entertainment violence and real-life
violence for males and females on the same graph, in such a way that
displays the relationship separately for males and females. What should
you do?
Q16. A common problem in constructing a scatterplot occurs when two
or more individuals have exactly the same values for each of the two
variables. What should you do in that case?
Q17. Which is a better method for judging the strength of a linear
relationship: simply to look at a graph, or to use a calculated numerical
statistic that summarizes the strength of the linear relationship? Why do
you think your chosen method is better?
Q18. What is the summary statistic that measures the strength of a linear
relationship?
Q19. We’ve used Greek letter mu to represent a population mean; x-bar
to represent a sample mean; Greek sigma to represent the standard
deviation, and s to represent the sample standard deviation. What letter
does our book use to designate what it calls the correlation?
Q20. Given that the letter r, for the correlation coefficient, is in our own
alphabet and not the Greek alphabet, do you think it refers to a sample
statistic or a population parameter?
Q21. Would you guess that there is some other Greek letter that refers to
the population value of the correlation coefficient?
A5. It doesn't make sense to use a
scatterplot in this situation, because a
scatterplot involves graphing two variables
measured upon the same individuals.
A6. True
A7. The x-axis
A8. The form, the direction, and the
strength of the relationship.
A9. True
A10. You say that there are clusters.
A11. Positively or negatively associated.
A12. strong
A13. You use a symbol that looks like two
slashes to indicate a break in the scale.
A14. Make the intervals uniform. Label
both axes. Choose a scale that makes your
graph big enough.
A15. Use a different symbol on the
scatterplot for the points designating males
than for those designating females.
A16. Use a different plotting symbol to call
attention to those points.
A17. A calculated statistic works better,
because our eyes can be deceived by the
different scaling methods used in graphs.
A18. The correlation coefficient.
A19. r
A20. sample statistic
A21. Yes. (It’s the letter rho, which looks
pretty much like a p!)
Q22. When you look at the formula for the sample correlation
coefficient that your text gives, you see (xi-xbar)/sx and (yi-ybar)/sy.
Can you give a simpler name to these expressions?
Q23. What is the meaning of a positive and negative sign associated
with the correlation coefficient?
Q24. Suppose one person calculates the correlation of IQ score of some
individuals with number of boxing matches fought, testing the
hypothesis that boxing (the explanatory variable) affects IQ (the
response variable). A second person, using the same data set, also
calculates the correlation of the number of fights with IQ score, only this
person thinks of IQ as the explanatory variable and number of fights as
the response variable. Do they get the same correlation, or different
ones?
Q25. Suppose someone codes race as follows: 0=Caucasian, 1=African
American, 2=Asian, 3=Hispanic, 4=American Indian, 5=Other. Then
someone calculates a correlation between race and a reading test score
for a sample of kids. Do you have a problem with this? If so, what’s
your problem?
Q26. Melinda computes a correlation between the height of mothers and
their daughters. Lunk is looking at the computations, and says, “You
blew it! You have the height of mothers measured in centimeters, and
the height of the daughters measured in inches!” Please tell whether
Melinda needs to do anything to fix her correlation coefficient, and if so,
what?
Q27. What range of values is possible for the correlation coefficient?
Q28. What sort of correlation coefficient do you find when two
variables have a very strong linear relationship, and when as the first
gets greater, the second gets smaller?
Q29. Suppose the data points are two variables collected for all the days
of 2005. For each of those days, imagine that we know (variable 1) the
number of words your instructor for this course spoke in that day, and
(variable 2) the peak barometric pressure for that day in Caracas,
Venezuela. About what would you guess the correlation between these
two variables to be? Why?
A22. These are the standard scores, or
z-scores, for the ith individual. The first
factor is the z-score for the x variable and
the second is the z-score for the y variable.
A23. A positive sign means there’s a
positive association between the variables;
in other words, higher values of one are
associated with higher values of the other.
A negative sign means there’s a negative
association; that is, higher values of one
variable are associated with lower values of
the other.
A24. The same correlation. The correlation
coefficient is not affected by which variable
is considered explanatory and which is
considered response.
A25. The problem is that the correlation
coefficient is to be used with quantitative
variables, not categorical variables like this.
The obtained correlation would be
meaningless, and an artifact of the arbitrary
coding system.
A26. Melinda did not blow it, because the
correlation coefficient comes out the same
no matter what units are used. (A change of
units multiplies each number in the data set
by the same positive number, which multiplies
both the mean and the standard deviation of
the data set by that number, as was learned
in chapter 1. In the z-score, (xi-xbar)/sx,
each of the three numbers is multiplied by
the same factor, and that factor cancels out.
Since the z-scores are not affected by
changes of units, the correlation coefficient
is also not affected.)
A27. -1 to +1.
A28. A correlation close to –1.
A29. You’d guess a correlation of about 0,
since there is no reason to expect that these
two variables would rise and fall in
association with each other.
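A22, A24, and A26 can be verified with a short computation. The sketch below is a Python illustration with made-up heights; it assumes the usual sample formula in which r is the sum of the products of paired z-scores divided by n - 1, and the function name correlation is hypothetical.

```python
import statistics

def correlation(xs, ys):
    """r = sum of products of paired z-scores, divided by n - 1."""
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

mothers_cm = [155, 160, 165, 170, 175]         # made-up heights in centimeters
daughters_in = [62, 63, 66, 66, 69]            # made-up heights in inches
mothers_in = [cm / 2.54 for cm in mothers_cm]  # same mothers, now in inches

print(correlation(mothers_cm, daughters_in))
print(correlation(mothers_in, daughters_in))   # same r after changing units
print(correlation(daughters_in, mothers_cm))   # same r with the roles swapped
```

All three printed values should agree up to floating-point rounding, which is the point of Melinda's defense.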
Q30. Suppose there are two variables which, when graphed in a
scatterplot, form an almost perfect u-shaped parabola. Would the strong
relationship between these variables imply a high correlation coefficient
(meaning one close to 1)? Why or why not?
Q31. Does the correlation coefficient resemble the median and
interquartile range in being fairly resistant to outliers, or resemble the
mean and standard deviation in being heavily influenced by outliers?
Q32. Someone practices guessing correlation coefficients from
scatterplots using an “applet” on the internet. Why should the person not
get too confident of his or her guessing power given scatterplots of
real-life data?
Q33. In attempting to give a more complete description of a set of data
involving two variables, someone wants to give a measure of center and
spread as well as the correlation coefficient. Assuming the
person has made a good decision to use the correlation coefficient, what
measures of center and spread would be most consistent with the
correlation coefficient: the mean and sd or the median and IQR?
Q34. The women in a corporation think that they are being
discriminated against in their salaries. A management spokesman says to
them, “Look at this plot. The first data point is the average salary for
men who have worked here one year, put into an ordered pair with the
average salary for women who have worked here one year. The second
ordered pair is the average salaries for men and women with two years’
experience, and so forth. The correlation between men’s salaries and
women’s salaries is .95! That’s almost a perfect correlation! You
women have nothing to complain about!” Is this argument valid? Why
or why not?
Q35. Suppose that you have a data set with a correlation fairly close to
0. All the numbers for both variables are between 0 and 10. There are
about 10 individuals in the data set. Then suppose that one more
individual gets added, an outlier with a value of 100, 100. What do you
think the correlation coefficient will become? (Try it out with your
calculator or minitab if you want, or mimic this situation on an
“applet.”)
Q36. True or false: In a regression line, like a correlation coefficient,
you get the same numbers (slopes and intercepts) no matter which
variable is considered the explanatory variable and which is considered
the response variable.
Q37. Please explain, for a least squares regression line: the sum of the
squares of what are being minimized?
Q38. Please explain why the distance from the data point to the
regression line corresponds to the idea of an “error.”
Q39. What’s the formula for the slope of a regression line, in terms of
the correlation between the two variables and their standard deviations?
(Call the slope b, the correlation r, and the two sd’s sx and sy.)
A30. No, because the correlation coefficient
measures the strength of linear relationships
only, not curvilinear relationships. A
u-shaped curve isn’t a straight line!
A31. Like the mean and sd, the correlation
coefficient can be greatly influenced by
outliers.
A32. Because the scales of the variables are
not necessarily the same as they were on the
applet, and scales can throw off “eyeball”
estimates.
A33. The mean and sd, because the formula
for the correlation uses the mean and sd.
A34. It’s not valid. The correlation
coefficient measures the predictability of
one score from another, not the equality of
the two scores. Adding the same value to all
values of either x or y does not change the
correlation; the same goes for subtracting a
constant, or multiplying or dividing by a
positive constant. So the salaries
of women could be half those of the
comparable salaries of men, or $10,000 less
than the comparable salaries of men, and
you could still get a high correlation.
A35. The correlation will become close to
1, because this one outlier has such a strong
effect.
A36. False. The change in y per unit change
in x, for example, is not the same as the
change in x per unit change in y.
A37. The squares of the errors for each data
point, where the errors are the vertical distances
from the data point to the regression line.
(The word residuals is also correct.)
A38. The regression line gives predicted
values of y (called y-hat) for each x. There
is also an actual observed value y for each
x, for each data point. The difference
between the actual and the predicted value
is the “error” in prediction that is made by
using the regression line to predict the
response variable.
A39. b=r (sy/sx).
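The experiment suggested in Q35 can be run in a few lines. This Python sketch uses ten made-up points with values between 1 and 10, chosen so that the correlation is close to 0, and then adds the single outlier (100, 100); the function name correlation is hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation from sums of squared and cross deviations."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Ten made-up individuals with almost no linear association.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [6, 3, 9, 2, 8, 5, 1, 10, 4, 7]

r_before = correlation(xs, ys)
r_after = correlation(xs + [100], ys + [100])   # add the outlier (100, 100)
print(round(r_before, 2), round(r_after, 2))    # prints 0.09 0.99
```

One added point moves r from near 0 to near 1, which illustrates A31 and A35: the correlation coefficient is not resistant to outliers.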
Q40. Every least-squares regression line passes through what point?
Q41. Once you know the slope of a regression line, how would you find
the y-intercept, knowing the means of the x values and the y values?
(Call the intercept a, and the means for x and y xbar and ybar.)
Q42. When you have a regression equation delivered by the computer
software output, and someone asks you for the predicted value of y
given a certain x value, what do you do?
Q43. Suppose that someone measures height as a function of weight for
a bunch of human adults, and gets a regression equation predicting
height as a function of weight. Why is the y-intercept of the equation not
as meaningful or important as the slope, or as the equation as a whole?
Q44. Suppose you have a regression equation output from a computer
and you are asked to plot the line by hand. How would you do it?
Q45. When, in the context of regression, people speak about the SST
(sum of squares total), what do they mean by that?
Q46. When in the context of regression people speak of the SSE or sum
of squares for error, what do they mean by that?
Q47. Your book doesn’t define very explicitly in this chapter what the
sum of squares for regression is. Do you think it would be reasonable to
think of that as the sum of all the squared deviations of the predicted y
values (the y-hats) from the mean of the y values? Particularly if a
trustworthy source hinted that it was? (P.S. you can calculate the SSReg
in your head, easily, for the 3-point data set of example 3.11 on page
160: it comes out to 32. The SST comes out to 38, and the SSE to 6.)
Q48. The book speaks of the sum of squares for the regression as the
SST-SSE, or the sum of squares total minus the sum of squares for error.
Can we infer from this that the total sum of squares, SST, can be
partitioned into the SSReg (sum of squares for regression) and the SSE,
(sum of squares for error), and that SST=SSReg+SSE? (P.S. I use the
notation SSReg so as not to confuse sum of squares for regression with
sum of squares for residuals.)
Q49. The square of the correlation coefficient, or r-squared, a.k.a. the
coefficient of determination, means what in terms of the fraction of the
total sum of squares? Please answer in symbols and words.
A40. It passes through xbar, ybar, the
ordered pair formed by the means of both
variables.
A41. a=ybar - b*xbar. You get this by
solving the equation ybar=b*xbar + a for a.
That equation comes from the
fact that ybar is always the predicted value
of y for xbar.
A42. Just substitute the value of x into the
equation and solve for the predicted y
value.
A43. Because the y-intercept corresponds to
the height of someone with weight 0. But
the weight of 0 is far outside the range of
weights measured in the study and thus the
height predicted will be an extrapolation.
Secondly, the weight of 0 is one seldom
found in human beings, (at least those who
have already been born and aren’t dead
yet)!
A44. Just pick two values of x, and
calculate the yhat values for each, and
connect those two dots. It helps if you pick
points that are close to the bottom and top
ends of the range. (One easy point is the y-intercept.)
A45. The sum of the squared deviations of
each y value from the mean of the y values.
A46. They mean the sum of the squares of
the deviations of the actual y values from
the predicted y values. (These deviations
are also called residuals.)
A47. Yes.
A48. Yes.
A49. r-squared = SSReg/SST or
(SST-SSE)/SST. The r-squared is the fraction of
the total sum of squares that is accounted
for by the regression of y on x.
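The relationships in A39 through A49 can be checked numerically. The following is a minimal sketch in Python; the five data points are hypothetical and chosen only for illustration (they are not from the textbook), and the function and variable names are assumptions of this sketch.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def least_squares(x, y):
    """Return intercept a, slope b, and correlation r for yhat = a + b*x."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    # Sample standard deviations (divide by n - 1).
    sx = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
    b = r * sy / sx          # A39: b = r(sy/sx)
    a = ybar - b * xbar      # A41: a = ybar - b*xbar
    return a, b, r

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, r = least_squares(x, y)
xbar, ybar = mean(x), mean(y)
yhat = [a + b * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                 # A45
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # A46
ssreg = sum((yh - ybar) ** 2 for yh in yhat)            # Q47

print("slope b:", round(b, 4), "intercept a:", round(a, 4))
print("SST:", round(sst, 4), "SSReg + SSE:", round(ssreg + sse, 4))
print("r squared:", round(r ** 2, 4), "SSReg/SST:", round(ssreg / sst, 4))
```

For these points the line passes through (xbar, ybar), SST equals SSReg + SSE, and r-squared equals SSReg/SST, up to floating-point rounding.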
Q50. One person studies IQ as a function of number of boxing matches
participated in, and another uses the same data set to study boxing
matches participated in as a function of IQ. (That is, matches is the
explanatory variable in the first study and IQ is the explanatory variable
in the second.) Do they both get the same value for the slope of the
regression line? Can you explain in words the reason for this answer?
(You may assume a negative relationship between the two variables in
constructing language for your answer.)
Q51. The slope of the regression line b is equal to r*(sy/sx). Along the
regression line, a change of 1 standard deviation in x results in a change
of how many standard deviations of y? (Hint: the slope is the change in
y over the change in x. So the change in y equals the slope times the
change in x. So if the change in x is sx, we get r*(sy/sx)*sx, which
equals...)
Q52. True or false: the slope of the regression line tells you how many
unstandardized units the predicted value of y changes for each
unstandardized unit change in x.
Q53. True or false: the correlation coefficient tells you how many
standard deviations the predicted y changes for each standard deviation
change in x.
Q54. True or false: If both of two variables y and x are standardized, (so
that the standard deviation of both is 1) then the slope of the regression
line and the correlation coefficient are equal.
Q55. What is another name for y – yhat, or the deviation of the observed
y value from the predicted value, or the error in prediction for a given
value, or the vertical distance between any data point and the regression
line?
Q56. True or false: when we speak of a “least squares” regression line,
we mean that we choose the line so as to minimize the sum of the squares
of the residuals.
Q57. Someone draws a graph of residuals (on the y axis) versus the
values of the explanatory variable. This graph is called what?
Q58. Someone draws a residual plot, and all the values are positive.
Someone says to that person, “There must be some mistake.” Why did
the person say that?
Q59. If the linear regression equation fits the data well, what do you see
on the residual plot?
Q60. What do you call a data point that has a big effect on the slope or
intercept of the regression line?
Q61. Does an influential point necessarily have a large residual? (Hint:
the influential point can pull the line close to it.)
A50. No. The first slope tells how many
points IQ changes per additional boxing
match, and the second slope tells how many
fewer boxing matches someone has for each
additional IQ point.
A51. r standard deviations of y, or r*sy.
A52. True.
A53. True.
A54. True.
A55. The residual.
A56. True.
A57. A residual plot.
A58. Because the mean of the least-squares
residuals is always zero; thus if you have
positive values you have to have at least
one negative value.
A59. A uniform scatter of points, without a
clear pattern, and with no unusual
individual observations.
A60. An influential point.
A61. No.
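Several of the answers above (A50, A54, A58) can be verified with a short computation. This is a sketch under stated assumptions: the data are hypothetical, and the helper names are invented for this illustration.

```python
from math import sqrt

def slope_and_intercept(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return b, ybar - b * xbar

def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    n = len(values)
    m = sum(values) / n
    s = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / s for v in values]

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# A50: swapping explanatory and response variables changes the slope.
b_yx, a_yx = slope_and_intercept(x, y)
b_xy, a_xy = slope_and_intercept(y, x)

# A58: least-squares residuals always sum (and so average) to zero.
residuals = [yi - (a_yx + b_yx * xi) for xi, yi in zip(x, y)]

# A54: with both variables standardized, the slope equals r.
b_std, a_std = slope_and_intercept(standardize(x), standardize(y))

print("slope of y on x:", round(b_yx, 4))
print("slope of x on y:", round(b_xy, 4))
print("sum of residuals:", round(sum(residuals), 10))
print("standardized slope (equals r):", round(b_std, 4))
```

The product of the two unstandardized slopes is r-squared, which is one way to see that they cannot both equal r unless the standard deviations of x and y are the same.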
TPS Chapter 4: More on Two-Variable Data
Q1. In the example at the beginning of this chapter, a plot of the log of
brain weight as a function of the log of body weight provides a “better
fit” for the observed data than a simple plot of brain weight as a function
of body weight. What is meant by better fit? (This is also the answer to
the question, what are we trying to do (at least in this course) when we
transform data?)
A1. A better fit means that the data are
more linear – a linear model is more
successful in describing the relationship, the
correlation coefficient is higher, the sum of
squares of residuals for a regression line is
lower.
Q2. Can you figure out why we would want to transform data so as to
get a more linear relationship?
Q3. True or false: if we have a curvilinear function, and we want to
straighten it out to make a linear function, we can’t do that by
multiplying or dividing by constants or adding or subtracting constants
(i.e. by using linear transformations).
Q4. What are the transformations that are most commonly used, other
than linear transformation?
Q5. What is the definition of a monotonic function?
Q6. Is it kosher to speak of a function as being, for example, monotonic
increasing over part of the domain of x, and monotonic decreasing over
another part? If so, can you give an example?
Q7. True or false: There are often two steps in transformation. The
second is to apply a power or logarithmic function that simplifies the
data. The first is to use a linear transformation, such as adding a
constant, that makes the values all positive, so that the function applied
in the second step will be defined and monotonic increasing.
Q8. How is the ladder of power functions useful?
Q9. Linear growth is to adding a fixed amount per unit time as
exponential growth is to ______ by a fixed amount per unit time.
Q10. If the number of a certain type of bacteria doubles every two
hours, is that linear growth or exponential growth?
Q11. Increasing everyone’s salary by a certain percentage is to ______
growth as increasing everyone’s salary by the same dollar amount is to
_______ growth.
Q12. Suppose we have a function y=ab^x, where a and b are constants
and x is the explanatory or independent variable, and y is the response or
dependent variable. Is this an example of an exponential function, or a
power function?
Q13. Suppose we have a function y=ax^b, where a and b are constants
and x is the explanatory variable and y is the response variable. Is this
an example of an exponential function, or a power function?
Q14. If y is an exponential function of x, plotting what function of y
versus x should result in a linear graph?
Q15. Suppose you do a regression of the log (base 10) of y versus x, and
you get a nice linear scatterplot and a high coefficient of determination
(r^2) when you do a regression. Now you can use this linear relationship
for prediction. Suppose someone (like a test-maker) asks you what the
predicted value is of y (not log y) for a given value of x. How would you
find it?
A2. Because then we can use the methods
of correlation and regression that were
studied in the previous chapter and will be
studied more in future chapters, which are
very powerful methods of statistical
analysis.
A3. True. Linear transformations don’t
straighten curves.
A4. Positive and negative powers, and
logarithms.
A5. It’s one where as x increases, y always
increases (a monotonic increasing function)
or as x increases, y always decreases (a
monotonic decreasing function). It produces
a graph that doesn’t go up and dip down,
but consistently has either a positive or
negative slope.
A6. Yes. An example is y=x^2, which is
monotonic decreasing for negative values
of x, and monotonic increasing for positive
values of x.
A7. True.
A8. When we are trying to straighten out
curved data sets, we can go in one direction
or the other along the ladder, seeing how
straight the line becomes, rather than just
randomly picking different functions.
A9. Multiplying.
A10. Exponential.
A11. Exponential, linear.
A12. An exponential function. (This is a
function like y=2^x.)
A13. A power function. (This is a function
like y=x^2.)
A14. The log of y versus x.
A15. You’d just use your equation to find
the predicted value of log y. Then you take
the antilog (or 10 to that number) to get the
predicted value of y. In other words, you
“untransform” the value back to the original
scale.
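The procedure in A14 and A15 (regress log y on x, predict log y, then take the antilog) can be sketched as follows. The data are hypothetical and follow y = 3*2^x exactly so that the result is easy to check by hand; real data would scatter around the line.

```python
from math import log10

def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope

# Hypothetical exponential data: y doubles for each unit increase in x.
x = [0, 1, 2, 3, 4]
y = [3, 6, 12, 24, 48]

# Transform only the response variable, then fit a straight line.
log_y = [log10(v) for v in y]
intercept, slope = fit_line(x, log_y)

def predict_y(x_new):
    """Predict log y from the line, then untransform back to y."""
    predicted_log_y = intercept + slope * x_new
    return 10 ** predicted_log_y

prediction = predict_y(5)
print("predicted y at x = 5:", round(prediction, 4))  # about 96, since 3 * 2**5 = 96
```

Note that 10 raised to the slope recovers the growth factor b in y = a*b^x, and 10 raised to the intercept recovers a.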
Q16. If a variable grows exponentially, its logarithm grows how?
Q17. To make an exponential function linear, we use the log
transformation just with the response variable y. To make a power
function linear, we use the log transformation with what?
Q18. If you start with the power function y=ax^p, and take the log of
both sides, what result do you end up with?
Q19. Suppose you have a data set, and its scatterplot is curved. Then
you take the log of both explanatory and response variables, and plot
them, and you get a line. What do you infer from this?
Q20. When you plot the log of y vs. the log of x, do you give any
meaningful interpretation to the slope of the line that you get? If so,
what is it?
Q21. Jane gets a regression coefficient (i.e. a slope) of 3.617 when
regressing log y vs. log x. She says, “Now I know that x and y are
related in a power function, and y= a constant times x to the 3.617
power.” Do you think this conclusion should be tempered or qualified?
If so, how?
Q22. Suppose you plot the log y vs. the log x and you get a good line,
with intercept 2 and slope 3. So log y=2+3log x. Now you are asked to
find the equation for y in terms of x, without logs in it. How do you do
this?
Q23. Suppose I find that in the range of 3 to 7 milligrams of Ritalin
given to a group of children, their math scores rise in linear fashion with
increasing dose. A parent looks at the regression equation and says, “By
my calculations, all it would take would be 400 milligrams of Ritalin for
my child to get a 5 on the AP Statistics exam, while he’s still in 3rd
grade.” What do we call this type of reasoning (which often leads to
incorrect conclusions)?
Q24. I notice a linear relationship between shoe size and basketball
prowess, and propose to help my daughter’s fifth grade basketball team
win by dressing them all in size 15 Converse All-Stars. Someone says,
“But in your study, there’s another variable, namely height, that was not
among your explanatory or response variables and yet may influence the
interpretation of the relationship between shoe size and basketball skill.”
What type of variable is height, in this situation?
Q25. In the example you just read, a lurking variable enhanced the
apparent association between two variables. Can lurking variables also
mask or attenuate the apparent association between two variables? If so,
can you think of an example?
A16. Linearly.
A17. Both the explanatory and the response
variable.
A18. log y=log a +p log x.
A19. That the original variables were
related according to a power function (or
power law).
A20. According to the equation log y =log a
+ p log x, the slope of the line is the power
to which x is raised in the original power
function.
A21. Yes. The slope you obtain in any
given experiment is an estimate of the
population value of the slope, and not an
exact rendering of it. Another sample is
very likely to give a different slope, and it
could possibly even lead to a different
conclusion about the form of the functional
relationship!
A22. You just take the antilog of both sides.
You get y= 10^(2+3 log x), or
y=10^2*(10^log x)^3, or y=100*x^3. That
is, y=100 times x cubed.
A23. Extrapolation, which is using the
regression equation to make predictions for
values of the explanatory variable that we
have no experience with.
A24. A lurking variable.
A25. Yes. The relation of overcrowding and
lack of indoor toilets on page 227 of TPS is
an example. Here’s another: suppose a
researcher finds that the correlation
between violent entertainment viewed and
aggressive behavior in real life is low. Then
someone reanalyzes the data and finds that
when you make separate scatterplots for
males and females, you get a strong
relationship. Males are much more
aggressive than females, and when you
throw the data together on one scatterplot
you get more of a horizontal line than a
sloped one. Thus the lurking variable of
gender obscured the real association that
was there.
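Returning to A17 through A22: the log-log approach for power functions can be sketched in Python. The data below are hypothetical and follow y = 100*x^3 exactly, matching the equation in A22; with real data the fitted slope is only an estimate, as A21 cautions.

```python
from math import log10

def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope

# Hypothetical power-law data: y = 100 * x**3.
x = [1, 2, 3, 4, 5]
y = [100 * xi ** 3 for xi in x]

# Take logs of BOTH variables (A17), then fit a straight line.
log_x = [log10(v) for v in x]
log_y = [log10(v) for v in y]
intercept, slope = fit_line(log_x, log_y)

# A20: the slope is the power p.  A22: the antilog of the intercept is a.
power = slope
constant = 10 ** intercept

print("estimated power p:", round(power, 4))        # about 3
print("estimated constant a:", round(constant, 4))  # about 100
```

So log y = 2 + 3 log x untransforms to y = 100*x^3, as worked out in A22.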
Q26. What’s one way of discovering lurking variables that someone
may not have thought about?
Q27. One researcher studies the murder rate of every state as a function
of the average literacy level of the teenagers in that state. The
correlation is reported to be very strong (and close to –1). Another
researcher studies aggression in individuals in a high school, as a
function of their reading scores. The correlation, though in the same
direction, shows much less strong a relationship. Can you explain why?
Q28. When two variables X and Y are found to correlate with each
other, of course two possible explanations for this association are 1) that
X causes Y, and 2) (one not diagrammed on page 232) that Y causes X.
Please name the other two possible explanations that are good to keep in
mind when interpreting findings of associations.
Q29. Suppose a researcher studies the effects of a way of teaching
children not to be violent. The researcher gives the instruction to all the
children in Mrs. Harmony’s classroom, and uses the kids in Mr.
Gutsly’s classroom as a comparison group. But then the researcher
realizes that Mrs. Harmony has a very different personality and
interpersonal style than Mr. Gutsly: she tries to promote kindness and
good will, whereas Mr. Gutsly is mainly interested in promoting
competitiveness and not being wimps. What would we say about the
variables of teacher personality and interpersonal style in this study?
Q30. Someone finds that the degree of physical fitness in youth (as
measured by heart rate recovery from exercise) is correlated with the
number of ankle injuries the person has had. But before concluding that
we should hurt the ankles of youth in order to make them more fit, a
COMMON RESPONSE explanation for the association comes to mind.
Can you posit this common response explanation?
Q31. Even when causation is present, is there usually one and only one
contributing cause for a given effect, at least in the types of phenomena
people study with statistics?
Q32. Someone says, “Lots of kids play “shooter” video games for
hundreds of hours, and never do anything violent. Therefore these
games can’t cause violence.” What does the principle, as stated in your
text, that “Even when direct causation is present, it is rarely a complete
explanation of an association between two variables” have to do with
this reasoning?
Q33. What is the strongest type of evidence for causal relations?
A26. Plot the data as a function of time.
You may be able to see relationships that can
be linked to events that happened at a
certain time, or conditions that changed at a
certain time.
A27. Correlations with averaged data are
generally considerably higher than
correlations obtained with individuals. This
is because some of the random variation in
each variable gets reduced when you sum
or average over many individuals. (Later on
in the course, you’ll find out that the
variance of the mean of a set of
observations for a population is equal to 1/n
times the variance of the individual
observations, where n is the number of
observations.) The less “noisy” both
variables are, the more highly they tend to
correlate with each other.
A28. Common response (z causes both x
and y) and confounding (z, which is
associated with x, may cause y).
A29. That they are CONFOUNDED with
the intervention. Thus the effects of these
teacher variables can’t be distinguished
from the effects of the intervention the
study is meant to test.
A30. That both fitness and ankle injuries
are associated with more running or more
athletic activity – both are responses to this
basic causal variable.
A31. No.
A32. Another way of stating this principle
is that one phenomenon does not have to be
a necessary and sufficient condition for a
second, in order to be causally related.
Therefore one or several instances of
non-association do not disprove a causal
relationship.
A33. Well-designed experiments that are
meant to control for all lurking variables.
(These usually entail randomly assigning
individuals to different conditions.)
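The point made in A27, that correlations based on averages tend to be stronger than correlations based on individuals, can be illustrated with a deliberately artificial data set. Everything here is an assumption of the sketch: five hypothetical groups of four individuals, with within-group "noise" constructed so that it cancels exactly when averaged. The relationship is positive rather than negative, but the sign does not affect the argument.

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient between two equal-length lists."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical, artificial data: each group has an underlying level,
# and each individual deviates from it by a fixed amount of "noise".
group_levels = [0, 1, 2, 3, 4]
x_noise = [-2, 2, -2, 2]
y_noise = [-2, -2, 2, 2]

individual_x, individual_y = [], []
group_mean_x, group_mean_y = [], []
for level in group_levels:
    xs = [level + e for e in x_noise]
    ys = [level + e for e in y_noise]
    individual_x += xs
    individual_y += ys
    group_mean_x.append(sum(xs) / len(xs))
    group_mean_y.append(sum(ys) / len(ys))

r_individuals = pearson_r(individual_x, individual_y)
r_averages = pearson_r(group_mean_x, group_mean_y)

print("r for individuals:", round(r_individuals, 4))
print("r for group averages:", round(r_averages, 4))
```

Averaging removes the individual-level variation, so the group means line up far more tightly than the individual points do.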
Q34. What’s the problem with doing a well-designed experiment, for
example, to see what the effects of child abuse are?
Q35. Is it possible to come to valid causal inferences without doing
experiments that randomly assign people to various conditions? Can you
give an example of such?
Q36. A two-way table describes the relation between two of what kind
of variables?
Q37. When you look at a two-way table that looks like this:

                     Approval of president’s performance
Party affiliation    Yes     No      Total
Democrat              25     100     125
Republican           125       5     130
Total                150     105     255

What is the row variable, and what is the column variable?
Q38. If we look, in the table above, at the totals for the rows, we get
how many Democrats and how many Republicans are in the sample.
Similarly, the column totals tell us how many approvers and
disapprovers are in our sample. These give us the distribution for each
variable separately, in our sample. These distributions are called what?
Q39. The above table gives the results in counts. Especially when the
marginal distributions are not equal (for example, if the sample should
contain twice as many Republicans as Democrats) we should convert the
count data to what kind of data?
Q40. True or false: When describing the relationship between two
quantitative variables, the scatterplot and the correlation coefficient are
usually the graph and numeric measure of choice; but in describing the
relation between two categorical variables, no single graph or numeric
measure summarizes the strength of the association. We usually pick
and choose among bar charts and pie charts and the reporting of various
per cents.
Q41. Someone looks at a sample of 500 men and 100 women. 250 men
oppose the war, whereas 80 women oppose the war. The researcher
says, “Lots more men than women oppose the war. Therefore the idea
that women in this area are more pacifist is incorrect.” What’s the
problem with this reasoning, and what should the researcher do?
Q42. Suppose you have three age groups, and you have data on how
many individuals got educated to each of 4 different levels. Suppose you
calculate, just for one of the age groups, the per cent of people in that
age group who attained each level. This distribution of per cents for one
age group is called what?
Q43. Do the per cents for a conditional distribution add to 100 for each
of the different groups for which you calculate them?
Q44. Do the per cents for conditional distributions equal the per cents
for marginal distributions?
A34. We will never find it ethical to
randomly assign children to conditions of
child abuse versus nonabuse.
A35. Although your text says that “the only
fully compelling method” of establishing
causality is an experiment, we can and do
come to valid causal inferences without
randomly assigning people to conditions.
The example of smoking and lung cancer is
one where the evidence for causation is
“overwhelming” despite no study in which
people were randomly assigned to smoke or
not smoke over many years.
A36. Categorical.
A37. The party affiliation is the row
variable and the approval of the president’s
performance is the column variable.
A38. Marginal distributions. (Because
they’re in the right and bottom margins of
the table.)
A39. To per cents (or fractions).
A40. True.
A41. He should not just use the counts, but
find the per cents. 50% of men, but 80% of
women, in this sample opposed the war. So
in this region it looks like the women are
more anti-war than the men.
A42. A conditional distribution.
A43. Yes.
A44. No, not necessarily.
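The marginal and conditional distributions described in A38 through A44 can be computed directly from the counts in the party affiliation table of Q37. This is a minimal sketch; the dictionary layout and variable names are assumptions made for illustration.

```python
# Counts from the two-way table in Q37: rows are party, columns are approval.
counts = {
    "Democrat": {"Yes": 25, "No": 100},
    "Republican": {"Yes": 125, "No": 5},
}

# Marginal totals (A38): row totals and column totals.
row_totals = {party: sum(cells.values()) for party, cells in counts.items()}
column_totals = {}
for cells in counts.values():
    for answer, count in cells.items():
        column_totals[answer] = column_totals.get(answer, 0) + count
grand_total = sum(row_totals.values())

# Marginal distribution of approval, in per cents (A39).
marginal_approval = {
    answer: 100 * total / grand_total for answer, total in column_totals.items()
}

# Conditional distribution of approval within each party (A42).
conditional_approval = {
    party: {answer: 100 * count / row_totals[party] for answer, count in cells.items()}
    for party, cells in counts.items()
}

print("row totals:", row_totals)
print("column totals:", column_totals)
for party, distribution in conditional_approval.items():
    rounded = {answer: round(pct, 1) for answer, pct in distribution.items()}
    print(party, "conditional per cents:", rounded)
print("marginal per cents:", {k: round(v, 1) for k, v in marginal_approval.items()})
```

Each conditional distribution adds to 100 per cent (A43), and neither one matches the marginal distribution of approval (A44). The same per cent calculation resolves Q41: 250/500 is 50% of men, and 80/100 is 80% of women.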
Q45. There were two AP Statistics teachers. 40% of the 40 students in
the first teacher’s classes got 5’s, and 25% of the 40 students in the
second teacher’s classes got 5’s. People assumed that the first teacher is
better. However, someone then studied the results based on whether or
not the students scored above or below a certain cutoff on the SAT,
before going into AP Statistics. The first teacher had 80% of students
above this cutoff and 20% below. The second teacher had 20% above
and 80% below. The first teacher had 50% of the “aboves” get 5’s, and
none of the “belows.” The second teacher had 75% of the “aboves” get
5’s, and 12.5% of the “belows.” Now which teacher appears to be better,
and why?
Q46. The situation above is whose paradox?
Q47. True or false: In Simpson’s paradox, there is a lurking variable,
which predisposes the results against one of the two groups; controlling
for the effects of that lurking variable by looking separately at the
subsets formed by the categories of it reveals results in the opposite
direction from those obtained when ignoring the lurking variable.
Q48. If a lurking variable can actually reverse the direction of results, do
you think it is also possible that a lurking variable could result in lack of
an observed association when in fact there is a causal influence?
Q49. Does the fact that lurking variables can obscure influences that are
actually present imply that: not only does correlation not imply
causation, but lack of correlation does not rule out causation?
A45. The second teacher, because a higher
fraction of that teacher’s students got 5’s
both among those above the cutoff and
among those below the cutoff.
A46. Simpson’s.
A47. True.
A48. Yes.
A49. Yes.
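The reversal in Q45 can be confirmed by converting the stated percentages into counts (each teacher has 40 students). The sketch below does that arithmetic; the data structure and names are assumptions made for illustration.

```python
# Counts derived from Q45: (number of students, number who got a 5).
# Teacher 1: 80% of 40 above the cutoff = 32, of whom 50% = 16 got 5's;
#            20% of 40 below = 8, of whom none got 5's.
# Teacher 2: 20% of 40 above = 8, of whom 75% = 6 got 5's;
#            80% of 40 below = 32, of whom 12.5% = 4 got 5's.
results = {
    "Teacher 1": {"above": (32, 16), "below": (8, 0)},
    "Teacher 2": {"above": (8, 6), "below": (32, 4)},
}

def rate(students, fives):
    return fives / students

overall = {}
by_group = {}
for teacher, groups in results.items():
    total_students = sum(students for students, _ in groups.values())
    total_fives = sum(fives for _, fives in groups.values())
    overall[teacher] = rate(total_students, total_fives)
    by_group[teacher] = {name: rate(*pair) for name, pair in groups.items()}

print("overall rates of 5's:", overall)
print("rates within SAT groups:", by_group)
```

Teacher 1 has the higher overall rate (40% versus 25%), yet Teacher 2 has the higher rate within both the "above" and the "below" groups. Prior SAT standing is the lurking variable that produces Simpson's paradox here.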
TPS Chapter 5: Producing Data
Q1. The difference between an observational study and an experiment is
that in the first, the explanatory variable is observed and measured,
whereas in an experiment, the explanatory variable is ____.
Q2. When there is a jobs program for welfare recipients, and you simply
observe that those who voluntarily take part in the program do better
than those who don’t, what’s the problem with inferring that the
program causes better results?
Q3. The entire group of individuals we want information about is called
the _____.
Q4. The subset of the population we actually examine in order to gather
information is called the ______.
Q5. Studying the whole population by attempting to contact every
individual is called conducting a ______.
Q6. Studying a population by taking a subset of it in order to generalize
to the whole population is called _____.
Q7. The method used for selecting the sample from the population is
called the ____ of sampling.
Q8. If a radio station invites anyone who wants to call and give an
opinion on a question, the set of people thus obtained is called a _____
response sample.
Q9. If the researcher enrolls a group of people in the study on the basis
of how easy it is to contact them and get them to enroll, that method of
sampling is called ______ sampling.
Q10. The systematic error introduced when the sample is very different
from the population is called ____.
Q11. If a conservative radio commentator polls his listeners, and a
liberal commentator polls her listeners, both polls are likely to be biased
as methods of ascertaining the sentiment of the country, because
_______.
A1. Deliberately imposed, manipulated.
A2. The effects of the program are
confounded with the characteristics that
lead people to seek the program, for
example motivation and values.
A3. Population
A4. Sample.
A5. Census.
A6. Sampling.
A7. Design.
A8. Voluntary.
A9. Convenience.
A10. Bias.
A11. It’s very likely that these samples
differ highly from the country as a whole.
Q12. An SRS, or simple random sample, is a subset of n individuals from
a population, chosen in such a way that ____.
Q13. True or false: if every individual in the population has an equal
chance of being included in the sample, the sample is a simple random
sample.
Q14. Suppose I take the numbers 1, 2, 3, and 4, and write them on
identical pieces of paper, put them into a hat and mix them thoroughly,
and draw out two numbers. Is this a simple random sample of the 4
numbers?
Q15. Suppose I take the numbers 1, 2, 3, and 4. First I take the numbers
1 and 2 and put them into a hat, and choose one of them. Then I take the
numbers 3 and 4 and put them into a hat and choose one of them. For
each of the numbers 1, 2, 3, and 4, what is the probability that this
number will end up in the sample?
Q16. Is it possible that the subset {1,2} would be chosen for our sample
using the sampling method just mentioned (that is, pick randomly from
1 and 2, then pick randomly from 3 and 4)?
Q17. So the sampling method just mentioned is one where each
individual has equal probability of being chosen, but each subset is not
equally likely to be chosen; thus the sample obtained is, or is not, a
simple random sample?
Q18. In a table of random digits, each triple of digits is equally likely to
be any of the ____ possibilities from _____ to ______.
Q19. The two rhyming words (with different ways of spelling the
second syllable) that summarize the process of using a table of random
digits to select a simple random sample are ___ and ____.
Q20. There are 7 members in a class. Please describe how you would
use a table of random digits to select a simple random sample of 3 of
them.
Q21. A sample chosen by chance is called a ____ sample.
Q22. Suppose there is a class, and someone wants to choose a random
sample of it. But the researcher wants to make sure that both males and
females are adequately sampled. So the researcher takes the names of
the girls, and draws a simple random sample of them, and then does the
same with the boys’ names. The total sample thus obtained is not a
simple random sample, but a _____ random sample.
Q23. Suppose a researcher wants to collect a random sample of high
school students in the U.S. The researcher first takes a simple random
sample of counties in the country, then takes a simple random sample of
high schools within each county, and then a simple random sample of
students within each high school. This sampling method is called ____.
Q24. The above method of sampling high school students leaves out
homeschoolers. The general term for such a problem in sampling is
____.
Q25. When you get a survey in the mail and immediately toss it in the
trash, the source of bias this introduces into the survey is called _____.
A12. Every subset of n individuals has an
equal chance of being chosen for the
sample.
A13. False. You need not only this
condition, but also that every subset of the
population of size n is equally likely to be
chosen.
A14. Yes.
A15. 0.5
A16. No.
A17. Is not.
A18. 1000, 000, 999
A19. Label and table.
A20. Assign each of them a single digit
label. Enter the random number table at any
point, and look at the numbers in order. If
the number isn’t one of the ones you
assigned, ignore it and go to the next. If it is
one that you assigned, put that individual in
the sample. Keep going until you have put 3
individuals in the sample.
A21. Probability
A22. Stratified
A23. Multistage sampling.
A24. Undercoverage.
A25. Nonresponse.
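The distinction drawn in A12 through A17 can be made concrete by listing every sample each method could produce for the numbers 1, 2, 3, and 4. This is a sketch that enumerates equally likely outcomes rather than drawing random numbers; the function name is an assumption made for illustration.

```python
from itertools import combinations, product

population = [1, 2, 3, 4]

# Q14: drawing two slips from one hat. Every subset of size 2 is possible
# and equally likely, so this is a simple random sample.
srs_samples = [frozenset(pair) for pair in combinations(population, 2)]

# Q15: one number from {1, 2}, then one from {3, 4}. Each of these
# outcomes is equally likely, but some subsets can never occur.
two_hat_samples = [frozenset(pair) for pair in product([1, 2], [3, 4])]

def inclusion_probability(samples, individual):
    """Fraction of equally likely samples that contain the individual."""
    return sum(individual in sample for sample in samples) / len(samples)

for number in population:
    print(number,
          "SRS:", inclusion_probability(srs_samples, number),
          "two hats:", inclusion_probability(two_hat_samples, number))

print("possible SRS samples:", len(srs_samples))
print("possible two-hat samples:", len(two_hat_samples))
print("{1, 2} possible under two hats?", frozenset({1, 2}) in two_hat_samples)
```

Under both methods each number has probability 0.5 of being chosen, but the two-hat method can never produce {1, 2} or {3, 4}. Equal chances for individuals are therefore not enough to make a simple random sample.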
Q26. If you were asked what is the “essential principle of statistical
sampling,” would you say that it’s to have a simple random sample, a
probability sample, a stratified sample, or a multistage sample?
Q27. During recent decades, society has become less and less tolerant of
any sexual activity between therapists and their clients. Surveys of the
incidence of such behavior are now almost impossible to obtain, because
therapists would avoid trusting a researcher with a confession of
behavior that would lead to severe penalties. This introduces bias into
any survey that is called ______ bias.
Q28. One survey question asks, “Do you believe that children should be
legally protected from exposure to violent models on TV that can lead
them to commit acts of violence?” And a separate question asks, “Do
you believe that government should limit the free expression of ideas by
censoring television?” The major difference in results these questions
would yield would be referred to as ______ effects.
Q29. Which would give more accurate results in a poll: a probability
sample of 1000 people, or a voluntary response sample of 100,000
people?
Q30. A study in which we actually do something to people, animals, or
objects in order to learn about the response is called an _____.
Q31. The individuals on which an experiment is done are called the
experimental whats?
Q32. When the experimental units are human beings, according to our
book they are called ____, (although the preferred term among
psychological researchers these days is “participants.”)
Q33. The thing that is done to the subjects (or participants) (for example
giving them a drug or teaching them to read) is called a ____.
Q34. Suppose that in an experiment, learning of math facts is your
response variable. You are studying two explanatory variables, and
varying them systematically in your study: amount of practice, and the
frequency of recurrence of any one math fact in a practice session.
These two explanatory variables are called the two _____ in the
experiment.
Q35. In an experiment on math facts, one is studying the frequency of
recurrence of any one math fact in practice sessions: does, for example,
7+8 occur every 3 problems, every 20 problems, or every 90 problems?
If the experiment is set up like this, within the factor called “frequency
of recurrence” there are three different degrees of that factor, three
different specific values of that factor, which in the jargon is called three
____ of that factor.
Q36. A pill that is made of inactive material, which is used so that
subjects can have information withheld about which treatment group
they are in, is an example of a _____.
Q37. If you want evidence for causation, and if you want to study the
interactions of factors, and you are able to do either an observational
study or an experiment, you should, all other things equal, choose the
______.
Q38. When people get better from an inactive treatment, that is called
the ____ effect.
Q39. A group of individuals who receive an inactive treatment, so that
the effects of a possibly active treatment can be contrasted with those of
inactive treatment, is called a ____ group.
A26. A probability sample, because the
most essential factor is that the sample be
chosen by chance.
A27. Response
A28. Wording.
A29. The probability sample of 1000.
A30. Experiment.
A31. Units.
A32. Subjects.
A33. Treatment.
A34. Factors.
A35. Levels.
A36. Placebo.
A37. Experiment.
A38. Placebo.
A39. Control.
Q40. A researcher tries to make two treatment groups equal on every
variable other than the treatment of interest. The researcher does this by
fashioning two groups that are very similar on several variables relevant
to outcome. What is the problem with this method?
Q41. What’s the “gold standard” method of assuring the equivalence of
two treatment groups?
Q42. Suppose you first assemble pairs of subjects that are very similar
on the preintervention measure of the response variable. Then, you
randomly assign one member of each pair to the experimental group and
the other to the control group. This method combines random assignment with
______.
Q43. A researcher is studying the effect of two methods of teaching
reading. Instead of matching the subjects on their reading level and then
randomly assigning one of each pair to the two groups, the researcher
ignores the initial reading level for purposes of assignment to groups,
and instead picks a simple random sample of the whole set of subjects to
be in each group. Is this an acceptable method of assignment to groups?
Q44. True or false. When subjects are assigned at random to two groups,
get two treatments, and they differ on the response variable, it must be
true that the treatment accounts for the difference between the groups.
Q45. Suppose that both treatments studied in an experiment in fact have
no causal influence upon the response variable. Under what conditions
are we more likely to see big differences between the two groups, due to
the vagaries of random assignment: with a very small sample size, or a
very big sample size?
Q46. A difference between groups that is so large (and with so many
subjects) that it would “rarely” (i.e. to whatever criterion of rarity we
specify) occur by chance is called a _____ ______ effect.
Q47. The three central principles of experimental design are _____
(which is making comparisons between groups), _______ (a method of
assigning individuals to groups), and _______ (which has to do with
how many individuals you have in your groups).
Q48. Suppose that we want to study the effect of a new curriculum and
an old one, on reading skills. We also want to study the effects of
whether the curriculum is delivered in person or over the phone. We
randomly assign subjects to the curriculum, but we can’t randomly
assign them to in person or over the phone, because certain people live
too far away to get the training in person. So subjects are allocated to the
curriculum at random, but not to the delivery method. We would say
that the experimental design here is not _____ ________.
Q49. In a “double-blind” experiment, what two sets of people are
“blind” to which group the subject is in?
Q50. What’s a problem in making inferences from experiments that is
often less of a problem in observational studies?
A40. There are too many lurking variables
– the experimenter may not measure all of
them, and some of them may not become
apparent until after the experiment. Some of
them may not be measurable at all.
A41. To randomly assign subjects to
treatment groups.
A42. Matching.
A43. Yes. Understanding why this is true is
central to the logic of experimental design.
A44. False. It is also possible that the “play
of chance in the random assignment”
accounts for the difference in the groups.
(However, the likelihood of this alternative
explanation can be quantified, and when it
is small enough, the other explanation is
favored.)
A45. A very small sample size.
A46. Statistically significant.
A47. Control, random assignment, and
replication. (The word replication here
refers to repeating the observation on more
subjects within a given experiment. The
word is also used, in a different sense, to
refer to repeating the experiment.)
A48. Completely randomized.
A49. The subjects themselves, and the
research staff who have contact with them.
A50. The problem of “lack of realism,” in
other words, the problem that the conditions
in the study do not match those to which we
wish to generalize.
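The point of A45 can be checked with a short simulation. The sketch below is only an illustration, not part of the textbook: the response distribution (mean 100, standard deviation 15), the group sizes, and the number of trials are all assumed values. Every subject's response is drawn from the same distribution, so neither "treatment" has any effect and any gap between the groups comes from the random assignment alone.

```python
import random
import statistics

def mean_abs_group_difference(n_per_group, n_trials=500, seed=0):
    """Average absolute difference between two group means
    when the treatments have no effect at all."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        # All responses come from one distribution (assumed: mean 100, SD 15).
        responses = [rng.gauss(100, 15) for _ in range(2 * n_per_group)]
        rng.shuffle(responses)  # random assignment to the two groups
        group_a = responses[:n_per_group]
        group_b = responses[n_per_group:]
        diffs.append(abs(statistics.mean(group_a) - statistics.mean(group_b)))
    return statistics.mean(diffs)

small_sample_gap = mean_abs_group_difference(5)
large_sample_gap = mean_abs_group_difference(200)
print("typical gap with 5 per group:  ", small_sample_gap)
print("typical gap with 200 per group:", large_sample_gap)
```

The typical chance gap should come out noticeably larger for the small groups, which is why replication (more subjects) is one of the three principles in A47.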
Q51. How do you do random assignment in a matched pairs design?
Q52. In a certain type of matched pair design where each subject serves
as his or her own control, and each “pair” consists of only one
individual, what is randomly assigned?
Q53. Suppose that we want to compare two methods of tutoring in
reading that children receive after school. We know that the teacher the
student has is also an important variable in the outcome variable, which
is reading skill. Please describe how we would use a block design to
control the effect of teacher when studying the method of teaching
reading.
Q54. True or false: If we want to make separate conclusions about males
and females in a study, it’s a good idea to block on gender when making
our assignment to groups.
Q55. Making a model that accurately reflects the experiment under
consideration and imitating chance behavior based on that model is
called doing a _____.
Q56. What are the 5 steps of doing simulations?
Q57. Someone wants to simulate a situation where there’s a 3/10
chance that a child will be involved in bullying. The person assigns the
digits 0 to 3 for involved in bullying, and the rest of the digits to
noninvolved in bullying. Do you have a problem with this? If so, what’s
your problem?
Q58. Please use your calculator to generate 4 random integers in the
range from 0 to 99. Please tell what you entered on your calculator to
get these, and what 4 integers you got.
A51. First choose pairs that are as similar as
possible, then randomly choose one subject
from each pair to receive the first treatment;
the other subject receives the second.
A52. Whether the subject gets treatment 1
first or treatment 2 first.
A53. We randomly assign the students of
the first teacher to the two groups, then do
the same for the students of the second
teacher, and so forth, rather than using a
simple random sample of all students.
A54. True.
A55. Simulation.
A56. State problem, state assumptions,
assign digits to represent outcomes,
simulate repetitions, state conclusions.
A57. Yes, there is a problem. There are 4 digits
from 0 to 3 inclusive, and 6 other digits, so
the person would be simulating a 40%
probability situation rather than a 30%
probability.
A58. On the TI-83 or TI-84, press MATH,
choose PRB, then 5:randInt(, and enter
randInt(0, 99, 4). On the TI-89, open the
catalog, press F3, scroll down to randInt,
and hit enter. Then insert 0, 99, 4 in the
parentheses. You’ll get
different sets of numbers each time, unless
something very unlikely happens!
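For readers without a TI calculator, the ideas in A57 and A58 can be sketched in Python. This is an assumed equivalent, not something from the textbook: the seed (42) and the number of simulated children (10,000) are arbitrary choices made only so the run is repeatable. The sketch uses digits 0, 1, and 2 for "involved in bullying," which gives the intended 3/10 chance, and shows why digits 0 to 3 would not.

```python
import random

rng = random.Random(42)  # arbitrary seed so the run is repeatable

# Counterpart of randInt(0, 99, 4): four integers from 0 to 99 inclusive.
four_integers = [rng.randint(0, 99) for _ in range(4)]
print("four random integers:", four_integers)

# Correct digit assignment for a 3/10 chance: 3 of the 10 digits.
involved_digits = {0, 1, 2}
n_children = 10_000
digits = [rng.randint(0, 9) for _ in range(n_children)]
proportion_involved = sum(d in involved_digits for d in digits) / n_children
print("simulated proportion involved:", proportion_involved)

# The flawed assignment from Q57 (digits 0 to 3) covers 4 of the 10 digits.
flawed_probability = len(range(0, 4)) / 10
print("probability actually simulated by digits 0 to 3:", flawed_probability)
```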
TPS Chapter 6: Probability
Q1. The branch of mathematics that deals with the pattern of chance
outcomes is ____.
Q2. The big idea of the study of probability is that chance behavior is
unpredictable in the _____ but has a regular and predictable pattern in
the _____.
Q3. An illustration of the “big idea” mentioned in Q2 is that while it is
unpredictable whether a single coin toss will come out heads, the
________ is almost always very close to .5.
Q4. What is the difference between a changing, or variable phenomenon
that is “random” and one that is not?
Q5. The ____ of any outcome of a random phenomenon is the
proportion of times the outcome would occur in a very long series of
repetitions, i.e. long-term relative frequency.
Q6. When there are independent trials, that means that the outcome of
one trial _______.
Q7. The set of all possible outcomes of a random phenomenon is called
the ______.
Q8. An event is defined as a subset of ____.
A1. Probability
A2. short run, long run
A3. Fraction of heads in a very large
number of tosses
A4. A random phenomenon is uncertain
with respect to individual outcomes, but
nonetheless there is a regular distribution of
outcomes in a large number of repetitions.
A5. Probability.
A6. Does not influence the outcome of
another.
A7. Sample space.
A8. The sample space.
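The "short run versus long run" idea in A2, A3, and A5 can be seen by simulating coin tosses. The following is a minimal sketch under assumptions: the function name, the seed, and the toss counts are illustrative choices, and a fair coin is modeled as a random number falling below 0.5.

```python
import random

def fraction_of_heads(n_tosses, seed=1):
    """Relative frequency of heads in n_tosses simulated fair coin tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# Few tosses can stray far from .5; many tosses settle close to it.
for n in (10, 100, 10_000, 100_000):
    print(n, fraction_of_heads(n))
```

Changing the seed changes the individual tosses, but the fraction for the largest run should stay close to .5, which is the long-term relative frequency that defines probability in A5.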
Q9. When we make a mathematical description of a random
phenomenon by describing a sample space and a way of assigning
probabilities to events, we are constructing a _____ _____.
Q10. Jane has 2 shirts and 3 pairs of pants. If we want to picture the 6
ways she can dress in these garments, we can draw a diagram with a
bifurcation point at the left of the page, with two lines going out to two
points called “red shirt” and “brown shirt.” From each of these, you then
draw 3 lines, saying “blue pants,” “green pants,” and “black pants.” This
sort of picture is called a _____.
Q11. Jane has 2 shirts and 3 pairs of pants. The “Cartesian Product” of
these two sets produces 6 possible combinations. This illustrates what
our book calls the _____ principle, which says that if you can do one
task in a ways, and another in b ways, you can do both together in
_____ ways.
Q12. Please give an example of sampling with and without replacement.
Q13. The probability of any event A has to satisfy the inequality x <=
P(A) <= y. What are x and y?
Q14. If you sum the probabilities for each member of the sample space,
you always come to a grand total of ____.
Q15. If the probability that A will occur is P(A), the probability that A
will not occur is ____.
Q16. What notation do we use to represent the probability that A will
not occur, if P(A) is the probability that A will occur?
Q17. What does it mean when we call two events disjoint or mutually
exclusive?
Q18. If two events are mutually exclusive, and P(A) is the probability of
A, and P(B) is the probability of B, what is the probability of (A and B)
(a.k.a. A intersect B)?
Q19. If two events are mutually exclusive, and P(A) is the probability of
A, and P(B) is the probability of B, what is the probability of (A or B)
(a.k.a. A union B)?
Q20. Suppose a random event has k equally likely outcomes. What’s the
probability of any one of these outcomes?
Q21. When there are k equally likely outcomes, and a certain number of
outcomes are counted as being in event A, the P(A) = (number of
outcomes in A)/k. Please give an example of this.
Q22. In a finite sample space, with outcomes that are not equally likely,
the probability of any event is the sum of the probabilities of the
outcomes making up the event. Someone’s statistics instructor asks,
“Please give an example of this.” The person says, “Suppose the
probability that a randomly selected person in a certain community will
own a dog is .3, a cat is .2, and another pet is .1. Then the probability
that the person will own a pet is (.3 + .2 + .1).” Do you have a problem
with this example? If so, what’s your problem?
A9. Probability model.
A10. Tree diagram.
A11. multiplication, ab
A12. As one of many possible examples: in
sampling without replacement, you draw
first one, then another card from a deck
without putting the first card back. In
sampling with replacement, you draw one
card from the deck, note its identity, replace
it, shuffle them, draw again, and note the
identity of the second draw.
A13. 0 and 1.
A14. 1.
A15. 1-P(A).
A16. P(A^c), where the superscript c stands for "complement."
A17. That if one happens, the other can’t
also happen.
A18. 0
A19. P(A) + P(B).
A20. 1/k.
A21. As one possible example: when a deck
of cards has 52 cards, each of which is
equally likely to be drawn, and 13 of these
are called hearts, then P(drawing a heart) =
number of hearts/number of cards, or 13/52,
or ¼.
A22. One big problem is: what about the
people who own two or more different
types of pets? If the only people who owned
cats or other pets were the dog owners, the
probability of pet ownership would be .3
instead of .6. The addition rule only applies
to disjoint events, and these categories are
not disjoint.
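Several of the answers above (A11, A12, A21) can be checked by listing outcomes directly. The sketch below is illustrative only: the variable names, the rank labels, and the seed are assumptions. It counts Jane's outfits with a Cartesian product, computes P(heart) as a count of outcomes divided by k, and draws two cards with and without replacement.

```python
import random
from fractions import Fraction
from itertools import product

# Multiplication principle (A11): a ways times b ways gives a*b combined ways.
shirts = ["red shirt", "brown shirt"]
pants = ["blue pants", "green pants", "black pants"]
outfits = list(product(shirts, pants))
print(len(outfits), "outfits")

# Equally likely outcomes (A21): P(A) = (number of outcomes in A) / k.
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))
hearts = [card for card in deck if card[1] == "hearts"]
p_heart = Fraction(len(hearts), len(deck))
print("P(heart) =", p_heart)

# Sampling (A12); the seed is arbitrary.
rng = random.Random(0)
without_replacement = rng.sample(deck, 2)   # the two cards are always different
with_replacement = rng.choices(deck, k=2)   # the same card can appear twice
print(without_replacement, with_replacement)
```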
Q23. When two events are independent, what is the probability that both
will occur?
Q24. How would you use the multiplication rule for independent events,
as well as the complement rule, to calculate the probability that in two
coin tosses, you would get at least 1 tail? Please explain your reasoning
as you compute an answer.
Q25. Suppose you have a lab test that gets a “false positive” 1 in 100
times. (That means that the P is 1/100 that the test will say you have the
disease, when you don’t.) How would you use the multiplication rule as
well as the complement rule to compute the probability that out of 100
tests where the correct answer is “negative,” you will get at least one
false positive? Please explain your reasoning as you compute an answer.
Q26. What is the general addition rule for unions of two events that
may or may not be disjoint?
Q27. The probability that someone plays the guitar, in a certain
community, is .4, and the probability that the person plays the
harmonica is .2. The probability that the person plays both guitar and
harmonica is .1. What’s the probability that the person plays either
guitar or harmonica?
Q28. In a certain community, .7 of people take physics and .6 of people
take statistics. The fraction of people taking one or the other is .85.
What’s the probability that a randomly drawn person will take both
courses?
Q29. What does the symbol P(B|A) mean?
Q30. What is the general multiplication rule for any two events?
Q31. Why does this rule simplify to the multiplication rule for
independent events, when the two events are independent?
Q32. The general multiplication rule for any two events really follows
from the definition of conditional probability, and is a rearrangement of
the defining formula. What is that defining formula for conditional
probability?
Q33. True or false: P(A and B and C) = P(A)P(B|A)P(C|A and B)?
Q34. True or false: the fact that P(A and B and C)=P(A)P(B|A)P(C|A
and B) is the basis of tree diagrams.
Q35. If there’s a .05 chance that a male high school athlete will play in
college, and a .017 chance that a college athlete will play professionally,
what’s the chance that a male high school athlete will both play in
college and play professionally?
A23. The product of their individual
probabilities.
A24. Your strategy is to realize that (at least
one is tails) is the complement of the event
(both are heads). So you use the
multiplication rule to compute the
probability of two heads (½ * ½ = ¼) and
then compute the complement of that (1 - ¼) to get ¾.
A25. The logic for this is exactly the same
as for Q24. You realize that the event (at
least one false positive in 100 tests) is the
complement of (no false positives in 100
tests). The probability of not getting a false
positive on one test is .99, and each test is
assumed independent of the others, so .99^100
(or about .37) is the probability that all the
tests will be free of false positive results.
The complement of this, which is the
probability that at least one is a false
positive, is about .63.
A26. P(A or B) = P(A) +P(B) –P(A and B).
A27. .4+.2-.1=.5
A28. P(A or B) = P(A) + P(B) - P(A and
B); therefore .85=.7+.6 –P(A and B), from
which P(A and B) =.45.
A29. The probability of B given A, or the
probability that B will happen given that A
has happened.
A30. P(A and B) = P(A)P(B|A)
A31. Because the definition of
independence is that P(B|A) =P(B), another
way of saying that A doesn’t influence the
occurrence of B. Thus when the two events
are independent, we can substitute P(B) for
P(B|A) in the general multiplication rule,
and get P(A and B)= P(A)P(B).
A32. P(B|A) = P(A and B) /P(A)
A33. True.
A34. True.
A35. .05 * .017, or .00085.
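The arithmetic in A24, A25, A27, A28, and A35 can be reproduced in a few lines. This is just a sketch with variable names chosen for illustration; the numbers are the ones given in the questions.

```python
# A24: P(at least one tail in two tosses) = 1 - P(both heads).
p_two_heads = 0.5 * 0.5
p_at_least_one_tail = 1 - p_two_heads

# A25: P(at least one false positive in 100 independent tests).
p_no_false_positive = 0.99 ** 100
p_at_least_one_false_positive = 1 - p_no_false_positive

# A27: general addition rule, P(A or B) = P(A) + P(B) - P(A and B).
p_guitar_or_harmonica = 0.4 + 0.2 - 0.1

# A28: the same rule rearranged to solve for P(A and B).
p_physics_and_statistics = 0.7 + 0.6 - 0.85

# A35: general multiplication rule, P(A and B) = P(A) * P(B|A).
p_college_and_pro = 0.05 * 0.017

for name, value in [
    ("at least one tail", p_at_least_one_tail),
    ("at least one false positive", p_at_least_one_false_positive),
    ("guitar or harmonica", p_guitar_or_harmonica),
    ("physics and statistics", p_physics_and_statistics),
    ("college and professional", p_college_and_pro),
]:
    print(name, round(value, 5))
```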
Q36. If there’s a .95 chance that a male high school athlete will not play
in college, and a .0001 chance that someone not playing in college will
play professionally, what’s the chance that someone will both not play
in college and play professionally?
Q37. If the chance is .00085 that a male high school athlete will play in
college and play professionally, and the chance is .000095 that an
athlete will not play in college and play professionally, what’s the
chance that a male high school athlete will play professionally?
Q38. For a tree diagram: the probability of reaching the end of any
complete branch is the product of what?
Q39. When you look at Bayes’s rule (page 375), does the numerator of
the fraction, P(B|A)P(A), equal the probability of A and B?
Q40. When you look at Bayes’s rule (page 375), does the denominator
of the fraction reduce to the probability of B and A plus the probability
of B and A^c, which reduces to the probability of B?
Q41. Is a simpler form of Bayes’s rule the following: P(A|B) =
P(B|A)P(A)/P(B) ?
Q42. Does this simpler form of Bayes’s rule say that if you want to
“reverse the direction of a conditional probability,” i.e. to find P(A|B)
when you know P(B|A), you multiply the P(B|A) by the ratio of the
P(A) to the P(B)?
Q43. P(A|B)=P(B|A)P(A)/P(B) implies that if you know the probability
of a positive test given having a disease, you can calculate the
probability of having the disease given a positive test, provided that you
also know what two other quantities?
Q44. What should someone do to organize the information for decision
analysis if one is given a set of complex information about branching
alternatives and outcomes given those alternatives?
A36. .95 * .0001, or .000095.
A37. .00085+.000095, or .000945, which is
a little less than one in a thousand.
A38. The probabilities written on its
segments (provided that those probabilities
are written so as to mean the conditional
probability of going down this branch given
that you have reached the previous
bifurcation!)
A39. Yes.
A40. Yes.
A41. Yes.
A42. Yes.
A43. The probability (or rate) of the
disease, and the probability (or rate) of
positive tests. So for example if the
probability of a positive test given the
disease was .99, but the rate of positive tests
was 1 in 100, but the rate of the disease was
1 in 1000, the probability of the disease
given a positive test would be only
.99*.001/.01, or .099. This illustrates that
the probability of having the disease given a
positive test does not necessarily equal the
probability of having a positive test given
the disease!
A44. Use a tree diagram.
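The tree-diagram sum in A36 and A37 and the Bayes's rule example in A43 can be worked through in code. The sketch below uses the simpler form of the rule from Q41; the function name and variable names are illustrative assumptions, and the numbers are those given in the questions and answers.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Simpler form of Bayes's rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# A43: P(positive | disease) = .99, P(disease) = .001, P(positive) = .01.
p_disease_given_positive = bayes(0.99, 0.001, 0.01)
print("P(disease | positive) =", round(p_disease_given_positive, 3))

# A36 and A37: multiply along each branch of the tree, then add the branches
# that end in "plays professionally."
p_college_and_pro = 0.05 * 0.017
p_no_college_and_pro = 0.95 * 0.0001
p_pro = p_college_and_pro + p_no_college_and_pro
print("P(plays professionally) =", round(p_pro, 6))
```

Note how far P(disease | positive) falls below P(positive | disease): the low rate of the disease pulls the reversed conditional probability down.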