Advanced Probability and Statistics Module 7
Topics: Random sampling, bias, standard error, review of the central limit theorem, margin of error, confidence intervals, t-tests, hypothesis
testing, statistical significance, comparing two groups of related data with two-sample z- and t-tests, comparing several groups of related
data with an ANOVA and F-test, making statistical inferences from categorical data using the chi-square test, and correlation, method of
least squares, & regression analysis.
Don’t be intimidated by the length of this module. Most of it is my attempt to explain the remaining statistics concepts. You’ll probably spend
more time reading and thinking than you will answering questions. Moreover, I have provided answers to most of the questions requiring
computations so that you won’t have to wonder if you did them correctly. Be sure to show your work. There is an error in the book on page
312. In the equation in example 9.8, “standard deviation” should be “standard error.”
1. Section 9.1-9.4. Pretend you’re going to conduct a study that involves collecting data of some sort. Concoct and describe a study in a
paragraph or two: Define the population, your sample, the parameter of interest, and your statistic. Discuss how you plan to minimize
selection bias and response bias and explain how each type of bias might cause you to make an incorrect inference. Do you need a
stratified random sample? Give details about the method you will use to select your sample. Assuming bias is not a problem, explain
what you can infer from your study. Explain how your sample size affects your confidence in your inference. Why can’t you just make
your sample size the whole population in order to boost your confidence? In terms of your study, what is the meaning of μ, σ, X-bar,
and S?
2. Section 9.4-9.5. Suppose you had the resources to repeat a study 100 times over, each with the same sample size of 64. With each study
you measure X-bar and S. What is meant (in words, not equations) by the variability of the average value? (It is not the same
variability as described by σ or an S value.) The standard error of the average is a way to measure this variability. How would the
averages of the studies be distributed (at least approximately) regardless of the distribution of your population? (Think central limit
theorem.) In terms of μ and σ, the mean of this distribution would be ___ and its standard deviation would be ___ . (Think central limit
theorem.) The standard deviation of this distribution of sample averages is called the standard error. It’s an estimate of about how far
the sample average is likely to be from the population mean. In reality, of course, it’s normally too costly to repeat a study 100 times.
So, we must hope that the mean we measure from a single study, X-bar, is close to what we really want to know but which is
impractical to measure directly, μ. The standard error can give us an idea of how representative X-bar is of μ. If these values are close,
our study has estimated accurately what it set out to determine. Since μ can't be known for certain, we compute X-bar and the standard
error, and if the standard error is low, we conclude that our sample mean is a good approximation to the population mean. Assuming
just a single study has been done, compute the standard error. (Be mindful of what you divide by—it’s not one.) If you wanted your
study to have three times the precision, you need to increase the size of it by a factor of ___ . If the sample size is the entire population,
then, naturally, the standard error should be zero. The formula does not give zero, though, because the formula is intended for random
samples much smaller than the population size. Use the correction factor for large samples to explain how standard error comes out to be
nearly zero for very large samples. (See page 305.)
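If you'd like to check your reasoning about the correction factor numerically, here is a short Python sketch. The function name and the exact form of the correction, sqrt[(N – n)/(N – 1)], are my own framing of the usual finite population correction; compare with the formula on page 305.

```python
import math

def standard_error(s, n, N=None):
    """Standard error of the sample mean, S / sqrt(n).
    If the population size N is given, apply the finite population
    correction sqrt((N - n) / (N - 1)), which matters when the sample
    is a large fraction of the population."""
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# A sample of 64 with S = 8 gives SE = 1 when the population is huge...
print(standard_error(8, 64))            # 1.0
# ...but if the sample is nearly the whole population of 100,
# the corrected SE shrinks toward zero, as it should:
print(standard_error(8, 99, N=100))
print(standard_error(8, 100, N=100))    # 0.0
```

Notice that the uncorrected formula alone would never reach zero, no matter how large the sample.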
3. Schmedrick’s hometown is having its annual Wiener Fest, where pure, delicious soy hotdogs are consumed in vast quantities in whole
wheat buns with lots of healthy veggie condiments. Just for the heck of it, Schmed measures the lengths in centimeters of all 10 wieners
he eats that day: 15.0, 15.8, 16.5, 14.2, 15.2, 15.6, 14.0, 14.9, 15.3, 15.9. Suppose the lengths of all hotdigities at the fair have some
distribution (not necessarily normal) with mean μ and variance σ². What is Schmed's best approximation for μ? About how far is his
estimate likely to be from the true mean? (Calculate the standard error.) Notice that the standard error is about three times smaller than the
sample standard deviation. The interpretation of this is that there is about three times more variability among individual hotdogs than
among averages of ten. In other words, if he looked at the averages of many samples of ten, those averages would be clustered closer to
the mean than individuals would. This is just an application of the central limit theorem. Think about it this way: If Schmed randomly
selected sets of n dogs with replacement, the central limit theorem states that the averages of the sets would have an approximately
normal distribution (especially if n is large) with the same mean as the population, μ, and a standard deviation of σ / n^0.5. The
standard deviation of this distribution (which is less than the standard deviation for the population by a factor of the square root of n) is
simply the standard error, but since we don't know σ we approximate it with S. If Schmed took 144 samples, each of size 25, would
the standard deviation of his sample averages be σ/12 or σ/5 ? What would the standard error be?
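Here's a quick Python check of the standard-error idea using Schmed's data (a sketch; you should still show the hand computation):

```python
import math
import statistics

# Schmed's 10 wiener lengths (cm)
lengths = [15.0, 15.8, 16.5, 14.2, 15.2, 15.6, 14.0, 14.9, 15.3, 15.9]

x_bar = statistics.mean(lengths)     # best estimate of mu
s = statistics.stdev(lengths)        # sample standard deviation
se = s / math.sqrt(len(lengths))     # standard error of the mean

print(round(x_bar, 2))   # 15.24
print(round(se, 3))      # 0.242
print(round(s / se, 2))  # 3.16 -- the ratio is sqrt(10)
```

The last line makes the "three times" observation explicit: averages of ten vary about sqrt(10) ≈ 3.16 times less than individual hotdogs.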
4. Section 9.6. I mentioned back in Module 1 that for a presidential election poll with a margin of error as low as 3%, it only takes a
sample of about 1000 people. It seems amazing that only 1000 people can predict so precisely how nearly 300 million people are
feeling about the candidates. Of course, it is essential for the polling to be as nonbiased as possible. Let’s see why this works. If there
are two candidates, we essentially are gathering binary data: for example, a preference for Bush corresponds to zero and a preference
for Kerry corresponds to one (or vice versa). In a previous module you proved the variance formula for binary data on page 127:
Var = XY / [n(n – 1)]. So, S is the square root of this. Suppose we sample 1000 registered voters at random and 579 say they prefer
Kerry. What are X, Y, and n? The proportion, p, who prefer Kerry is X / n, which makes it easy to derive the formula on page 128:
S = sqrt[np(1 – p) / (n – 1)]. Show that p = 0.579 and S = 0.494. Use the definition for standard error, error = S / n^0.5, and the formula for
S above to derive the formula on page 310. (Very simple.) 57.9% of the people we surveyed favor Kerry. To determine how far off this
is compared to the proportion of people who like Kerry among the whole population, let’s find the standard error. Show that the error is
0.0156, or about 1.56%. Since standard error comes directly from standard deviation, we can interpret the error as follows: the true
proportion of the population that prefers Kerry is 57.9% give or take about 1.56%. In other words, by the central limit theorem, our
sample average (the proportion) should have a normal distribution with the same mean as the population (which we estimate to be
57.9%) and a standard deviation equal to that of the population over the square root of n (and this is approximately 1.56%, the standard
error). Since 68% of normal data lie within one standard deviation of the mean, we can say with about 68% certainty that the true
percentage of people favoring Kerry is within 1.56 percentage points of 57.9%. That is, with 68% certainty, 57.9% of voters prefer
Kerry, with a margin of error 1.56%. Typically, though, we want more than 68% certainty. 95% certainty is sort of the accepted
standard, which corresponds to two standard deviations on either side of the mean. When margins of error for polls are reported in the
news, they typically are referring to 95% certainty. For our data we can say with 95% certainty that 57.9% of voters prefer Kerry give
or take 3.12 percentage points. Where did I get the margin of error of 3.12%? In other words, if we repeated our survey 100 times,
about 95 times the survey would yield results within 3.12% of the actual Kerry preference; the other 5% of the surveys would
amount to off the wall results (which is why we can’t put too much stock into a single survey). In an earlier module you proved with
calculus that, for binary data, the variance is at a max when the proportion is 0.5 (when opinion is split down the middle). Thus, when
p = 0.5, the standard error would be greatest. For people who are approximately split on an issue, show (with 95% certainty) that the
size of survey must be 1,112 people for a margin of error of 3%.
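The poll numbers above can be verified in a few lines of Python (a sketch; the final line uses the worst-case value p = 0.5 and two standard errors for 95% certainty, matching the approach in the text):

```python
import math

n, X = 1000, 579          # sample size; voters preferring Kerry
Y = n - X                 # voters preferring Bush
p = X / n
S = math.sqrt(n * p * (1 - p) / (n - 1))   # binary-data S (page 128)
se = S / math.sqrt(n)                      # standard error

print(p)                 # 0.579
print(round(S, 3))       # 0.494
print(round(se, 4))      # 0.0156
print(round(2 * se, 4))  # 95% margin of error, about 0.0312

# Worst case p = 0.5: smallest n with 2 * 0.5 / sqrt(n) <= 0.03
n_needed = math.ceil((2 * 0.5 / 0.03) ** 2)
print(n_needed)          # 1112
```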
5. Many times now we've used Z = (X – μ) / σ. Recall that this formula is used when a random variable X has a normal distribution with
mean μ and standard deviation σ. Thus, X – μ is how far from the mean X is, and (X – μ) / σ is how many standard deviations away X is
from μ. This means Z has a standard normal distribution (mean of zero, stan. dev. of one), and we can use the table in the covers of the
book. To use this formula we needed to know μ and σ. But these parameters are often unknown. In fact, we're often interested in
approximating μ by taking the average of a random sample. Let's say a wildlife conservationist wants to know the density of prairie
dogs (animals per acre) in a wilderness region of Montana. Since it would be impractical to count all the prairie dogs in dozens of
square miles, the region is divided up on a map into small parcels, and many parcels are chosen at random on which actual counts will
be conducted. So we can speak of a population of parcels, each of which has a certain density of prairie dogs. The density has some sort
of distribution (which may or may not be normal) with mean μ and standard deviation σ. μ is the unknown average density of interest
to the biologist. Suppose that somehow σ is known, let's say 30 prairie dogs/acre. (In real life this probably wouldn't be known and
would have to be approximated.) To estimate μ the biologist investigates n = 50 parcels. Let Xi be the random variable representing the
# of prairie dogs per acre in parcel i. Upon gathering the data X1 through X50, she then computes X-bar, the average of the 50 data
values. The central limit theorem says that X-bar has a distribution that is approximately normal with mean μ and standard deviation
σ / 50^1/2. X-bar is called a nonbiased estimator of μ provided that there was no bias in the selection process for the parcels. Say X-bar
comes out to be 93 dogs/acre. Since it is only an estimate, the biologist cannot report definitively that the mean density is 93; she can,
however, report a 90% confidence interval for the true mean as 86 to 100 per acre. I'll explain how she got this in a minute. The 90%
confidence interval is an estimated interval for . It means that there is about a 90% chance that the interval really does contain the
mean. 90% is arbitrary; a confidence interval can be computed for any percentage. Explain why a 75% confidence interval would be a
smaller interval than the 90% interval. To compute the interval we use the distribution Z = (X-bar – μ) / (σ / n^1/2). Notice how similar
this equation is to the one with which you're already accustomed. One difference is that we're using X-bar rather than X, since it is X-bar that is
serving as her estimator of μ rather than X, i.e., she used the average of the 50 pieces of data, rather than individual measurements, to
estimate the true mean. Since X-bar has a standard deviation of σ / n^1/2, this replaces σ in the original formula. Z is the number of
standard deviations 93 is from the true mean. To form a 90% confidence interval we want to know how far above and below 93 we
have to go in order to cover 90% of the area under the normal curve. This means 5% of the area will be under the right tail (very large
values), and 5% of the area will be under the left tail (very low values). So, the left endpoint of the interval corresponds to a probability
of 0.05 in the table, and the right endpoint corresponds to a probability of 0.95 (since the normal table is a list of probabilities from –∞
to some value of interest). What Z value corresponds with each endpoint? (They are opposites of one another.) Books refer to these Z
values as ±z0.05 since they mark the boundaries for the upper and lower 5% areas. Ok, now show that the endpoints for the 90%
confidence interval are given by X-bar ± z0.05 · σ / n^1/2. (Use basic algebra with the substitution Z = ±z0.05.) Use this formula to show that the
90% confidence interval is 86 to 100 per acre. Finally, if the biologist were inclined to be overly cautious, she might report her findings
with a 95% confidence interval. Show that this interval is from 84.7 to 101.3 prairie dogs per acre, and explain in words what this
interval means in terms of probability and the actual unknown mean.
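If you want to verify the biologist's intervals, here is a minimal Python sketch (the z values 1.645 and 1.96 are the usual normal-table entries for 90% and 95%):

```python
import math

sigma, n, x_bar = 30, 50, 93
se = sigma / math.sqrt(n)          # sigma / sqrt(n), the sd of X-bar

# z_0.05 = 1.645 and z_0.025 = 1.96 from the normal table
ci90 = (x_bar - 1.645 * se, x_bar + 1.645 * se)
ci95 = (x_bar - 1.96 * se, x_bar + 1.96 * se)

print([round(e, 1) for e in ci90])   # [86.0, 100.0]
print([round(e, 1) for e in ci95])   # [84.7, 101.3]
```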
6. Terminology and notation: In the prairie dog example we wanted the probability of the actual mean being between two endpoints to be
90%. That is, we found endpoints such that P(left endpt < μ < right endpt) = 0.90. When discussing confidence intervals, books always
talk about α, which is just 1 – (the amount of confidence). At 90% confidence, α = 0.10. Then we can rewrite
P(left endpt < μ < right endpt) = 0.90 as P(–zα/2 < Z < zα/2) = 1 – α. Write the corresponding equation for a confidence interval of 80%.
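Rather than reading zα/2 from the table, you can compute it directly; here's a sketch using Python's built-in statistics.NormalDist (available in Python 3.8 and later):

```python
from statistics import NormalDist

def z_crit(confidence):
    """z such that P(-z < Z < z) = confidence, i.e. z_{alpha/2}."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_crit(0.90), 3))   # 1.645
print(round(z_crit(0.95), 3))   # 1.96
print(round(z_crit(0.80), 3))   # 1.282
```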
7. Section 10.1. In the prairie dog example we pretended the population standard deviation was known. Normally, though, σ is not known,
so we approximate it with the sample standard deviation. In cases like these we could replace sigma with S, and, if n is large, we could
proceed as we did in the last problem, assuming an approximately normal distribution. To be more accurate, especially if n isn't very
large, instead of Z, we calculate a different test statistic called T, which is defined as T = (X-bar – μ) / (S / n^1/2). Notice the similarity
to the formula above for Z and that the denominator is the standard error (which is also the approximate standard deviation of the distribution of
X-bar). Like the normal distribution the t-distribution looks like a bell. However, it’s not a normal distribution. It’s a different
distribution and there are different tables for it (page 589 in your book). Unlike a normal distribution, a t-distribution has degrees of
freedom (df ) associated with it: df = n – 1. With the prairie dogs, df = 49. (There are only 49 degrees of freedom, since if the standard
deviation of 50 measurements is known or estimated, 49 observations have the freedom to be any value, but the 50th must have a
particular value to preserve that standard deviation.) There is a pic on page 323. Note that the t-distribution is symmetric but it’s got a
greater variance than the normal curve (lower peak, bigger tails). True or false: the smaller the number of degrees of freedom, the flatter
the t-distribution. The variance is greater because extra variability was introduced by approximating σ with S. The equation for T has
two random variables in it: X-bar and S. For larger df, the t-distribution begins to look more normal since for large n, S is more likely to
be close to σ. Notice that the table entries for infinity are exactly the same as in the normal distribution. If df is small, then n is small, so
the central limit theorem may not hold. This means that if we use a small sample size, the population must be approximately normal.
Let’s do a sample problem. Say we want to determine how much garbage the average American household produces in a week. We’ll
assume garbage production has a normal distribution and choose 51 households at random. We calculate the mean of our 51 pieces of
data to be 68 pounds and standard deviation to be 16 pounds. We don't know μ, but we have T = (68 – μ) / (16 / 51^1/2). For a 90% confidence
interval, we need P(–tα/2 < T < tα/2) = 0.90, where α = 0.10, and tα/2 will be looked up in a table.
Substituting for T: P(–tα/2 < T < tα/2) = P(–tα/2 < (68 – μ) / (S / n^1/2) < tα/2) = P(–tα/2 S / n^1/2 < 68 – μ < tα/2 S / n^1/2)
= P(–68 – tα/2 S / n^1/2 < –μ < –68 + tα/2 S / n^1/2) = P(68 + tα/2 S / n^1/2 > μ > 68 – tα/2 S / n^1/2)
= P(68 – tα/2 S / n^1/2 < μ < 68 + tα/2 S / n^1/2). So, just like in the prairie dog example, we get the same sort of endpoints:
X-bar ± tα/2 · S / n^1/2. We use the 90% confidence interval table on the bottom of page 589 and look up t0.05 with df = 50 and get 1.676, which
gives us endpoints of 64.245 and 71.755. This means that with 90% confidence we can say that the true mean amount of garbage
produced per week is somewhere between about 64.2 and 71.8 pounds.
All right, now it’s your turn. Schmedrick gets a job working for Acme Hand Grenade Company as a statistician. His task is to
determine with 99% certainty the mean blast radius of the grenade, as defined by the maximum distance shrapnel flies from the point of
explosion. So Schmed selects 6 grenades at random, heads out to the desert, explodes them remotely one at a time from a
predetermined height, finds the furthest piece of shrapnel in the sand after each explosion, and measures the distance from the
explosion point. His data in meters are: 140.8, 123.7, 166.7, 130.9, 170.1, and 142.2. Technical manuals lead him to believe it is safe to
assume that the blast radius has a normal distribution, but he has no idea what to expect for a standard deviation for that distribution.
Thus, he uses the t-distribution. Show that Schmed can report to his boss that with 99% confidence the mean blast radius is between
114.7 and 176.8 meters. (For each endpoint I rounded away from the mean to ensure at least 99% confidence.)
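A Python sketch for checking the grenade interval (the critical value t0.005 = 4.032 for df = 5 comes from the t-table):

```python
import math
import statistics

radii = [140.8, 123.7, 166.7, 130.9, 170.1, 142.2]   # meters
n = len(radii)
x_bar = statistics.mean(radii)
s = statistics.stdev(radii)       # sample standard deviation
se = s / math.sqrt(n)             # standard error

t_crit = 4.032                    # t_0.005 with df = 5, from the t-table
lo = x_bar - t_crit * se
hi = x_bar + t_crit * se
print(round(lo, 1), round(hi, 1))   # 114.7 176.8
```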
8. Look at the figure on page 333. Note that each confidence interval is centered on the same mean and that the interval gets larger as the
confidence level increases. In one or two sentences, tell me why this is so. Note also that these are two-sided confidence intervals. For
example, in the top interval, there is a 5% chance that the true mean is greater than the right endpt. value, a 5% chance that the true
mean is less than the left endpt. value, and a 90% chance that the true mean lies within the interval. Now imagine the same interval
except that the right endpt. extends to infinity. This is a one-sided, 95% confidence interval and it means that there’s a 5% chance that
the mean is not in the interval (too small) and a 95% chance that the mean is in the interval. Thus, to create a one-sided, 95% confidence
interval, we begin with a two-sided, 90% interval and extend either the left or the right side (but not both) indefinitely. Suppose a 92%
confidence interval for the mean number of times dogs in Urbana bark per day is 31 to 49 barks/dog/day. Explain why the one-sided,
96% confidence interval is from –∞ to 49, which in this case is equivalent to 0 to 49. Also, to compute the original interval, the sample
mean must have been 40. Why?
9. Section 11.1: Upon completion of his grenade analysis, Acme transfers Schmedrick to the yo-yo department. His new boss says,
“Schmedrick, our packaging asserts that each yo-yo has a string length of 48 inches, but the Federal Yo-yo Commission wrote us a
threatening letter saying they’re suspicious that we’re misleading the public by supplying yo-yos with an average string length that is
not 48 inches.” What are the null and alternative hypotheses? Section 11.2: The boss continues, “Schmedrick, I want you to conduct a
test whose results, hopefully, will show no statistically significant difference between your sample average and the hypothesized mean.
Then I can write the Commission back and assure them that there is insufficient evidence to reject the null hypothesis.” Explain what
the boss means by this. Section 11.3: Schmed gets to work by randomly selecting 31 yo-yos, measuring their string lengths, and
computing a mean of 46.8 inches and a standard deviation of 3.5 inches. Compute the two-sided, 95% confidence interval. (Answer:
45.5 to 48.1 inches) Should Schmed accept or reject the null hypothesis? (He should accept it; explain why.) His sample average of
46.8 inches is not exactly the same as the hypothesized mean of 48 inches, but is the difference statistically significant? (It’s not;
explain.) What then accounts for the fact that the sample average is not exactly 48 inches? Another common way to do hypothesis
testing is via a t-test which uses a t-statistic: t = (x-bar – μ0) / (s / n^1/2), where μ0 is the hypothesized mean (associated with the null hypothesis, H0).
Notice that it's the same formula we used above except that the actual population mean μ is replaced with what the null hypothesis
claims to be the mean. Suppose we're testing this null hypothesis: “The mean yo-yo mass is 95 grams.” A ten-yo-yo sample has a mean
of 92.2 g with a standard deviation of 4.6 g. Then t = (92.2 – 95) / (4.6 / 10^1/2) = -1.92487. The absolute value of t is about 1.925. We
then compare this value on the t-table for 9 degrees of freedom (typically using 95% confidence). The table lists a critical value of
2.262. Since our test statistic is less than this, we accept the null hypothesis, which means our data did not supply sufficient evidence to
claim that a mean of 95 g is wrong. In other words, a two-sided, 95% confidence interval centered at 92.2 would indeed be large
enough to contain the hypothesized mean of 95. Use a t-test to test this H0: “The mean yo-yo diameter is 4.3 cm.” Use the following
parameters: a 15 yo-yo sample; sample mean = 4.5 cm; sample standard deviation = 0.3 cm. Decide whether to accept or reject the null
hypothesis (with 95% confidence). (Answer: reject H0 since 2.58199 > 2.145.) Notice that the sample mean was only 2 mm larger than
the hypothesized mean, but with such a small standard deviation, we expect most of the yo-yos to have diameters right around 4.5 cm,
so we can say with 95% certainty that the real mean is not 4.3 cm.
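Both yo-yo t-tests can be checked with a small Python sketch (the critical values 2.262 and 2.145 come from the t-table at 95% confidence, df = 9 and df = 14):

```python
import math

def t_stat(x_bar, mu0, s, n):
    """t-statistic for H0: population mean = mu0."""
    return (x_bar - mu0) / (s / math.sqrt(n))

# Mass example: H0 says mu = 95 g; critical value 2.262 (df = 9)
t_mass = t_stat(92.2, 95, 4.6, 10)
print(round(abs(t_mass), 3), abs(t_mass) < 2.262)   # accept H0

# Diameter exercise: H0 says mu = 4.3 cm; critical value 2.145 (df = 14)
t_diam = t_stat(4.5, 4.3, 0.3, 15)
print(round(abs(t_diam), 3), abs(t_diam) > 2.145)   # reject H0
```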
10. Section 11.4. Hypothesis testing is sort of like accepting the status quo until sufficient evidence is provided to force us to believe
otherwise. (The null hypothesis is analogous to the status quo.) This is pretty much what is done in science. No well-accepted theory is
rejected unless there is very strong evidence against it, that is, unless there is statistically significant evidence against the current theory.
In other words, H0 is “innocent until proven guilty.” Accepting the null doesn’t necessarily mean we’re convinced that it is exactly
correct, but that we haven’t seen convincing evidence that it is incorrect. In court, the defendant does not have to prove innocence, but
if there is not sufficient evidence of his guilt, he is declared “not guilty.” If evidence “beyond a reasonable doubt” is presented against
the defendant, the jury would reject the null hypothesis of his innocence, which means he is declared guilty as charged, even though the
jury may not be 100% certain of his guilt. Rejecting a null hypothesis means that the evidence (be it sample measurements or court
testimony) is not consistent with a claim, and that it is inconsistent to the extent that most likely the claim isn’t true. There are two
types of errors that a jury can make: convicting an innocent man (type 1 error), or releasing someone who is guilty (type 2 error).
H0 claims that the man is innocent. Convicting an innocent man means rejecting H0 when it is true. If the man is guilty, H0 is false.
Releasing a guilty man means wrongly accepting his innocence (accepting H0 when it's false). Maybe one way to keep these errors
straight is that the word accept has two c's back-to-back, and accepting H0 when it's false is a type 2 error.

                 H0 is accepted     H0 is rejected
H0 is true
H0 is false

In each cell of the table enter one of the following choices: “Correct,” “Type 1 error,” or “Type 2 error.” The significance
level α is the probability of making a type 1 error. For example, suppose the defendant is innocent of the crime but the jury decides that it
is 95% certain that the man is guilty and convicts him. The jury, then, rejects H0 when it is true, committing a type 1 error. By their
figuring, though, there is only a 5% chance of this happening. In other words, the jury has the man somewhere outside of their 95%
confidence interval. In this situation the significance level is α = 0.05. Suppose Popeye claims he eats, on average, 5 cans of spinach a day.
Olive Oyl is skeptical. So, she randomly chooses 8 days from his food record and calculates the sample mean and standard deviation. State
H0 and H1. Explain how she could make each type of error. If she concludes with 90% certainty that Popeye is wrong, what is the
significance level?
11. Section 11.7. Give me a quick example of a scenario in which a one-sided t-test would be appropriate and explain why.
12. Section 12.2: Check out the histograms on page 389. They’re reminiscent of the multiple box-and-whisker plots on page 96. What
information do these histograms immediately provide in terms of comparing the two species of iris? Section 12.3: The iris histograms
suggest that one species has smaller petals, but that difference could be attributed to the randomness of the samples, especially since
the histograms overlap. There is a statistical method to determine whether the mean petal size of each species is likely to be different
in real life and not just different in the samples. As always, H0 is the status quo: there is no difference between the means of the
species, i.e., despite the difference in sample means, their population means are the same: μx = μy. The idea is to use statistics to
determine how large a difference in sample averages is needed in order to say with 95% confidence that the population means are
different. Check out example 12.5 on page 393. What is H0? What does a quick scan of the data suggest might be the case?
13. Instead of testing if a single sample average equals a population mean, we’re now interested in testing whether the difference between
the averages of two different samples is equal to the difference between the two population means. Example: We’re interested in
determining whether there is a difference in math ability between senior boys and girls at UHS. We select 100 seniors, 50 boys and 50
girls, at random (a stratified random sample) and invite them to participate in our study. 12 boys and 15 girls agree to participate.
(Since they don’t all participate, we may have some response bias, but we’ll assume it is negligible for the sake of the example.) We
give all participants the same math test and score them. The boys’ sample average, X-bar, is 73 points, and the boys’ sample standard
deviation, Sx, is 5 points. For the girls, Y-bar = 76 points, and Sy = 4 points. We want to test the null hypothesis that μx = μy. Let's first
assume that both populations (boys and girls) are approximately normally distributed and that the variability of each is known. So, we
don't know the μ's but we do know the σ's. Assume σx = 7 and σy = 3 points. Recall earlier that when the population standard
deviation was known and we had only one sample, we used the formula Z = (X-bar – μ) / (σ / n^1/2), or equivalently, Z = (X-bar – μ) / (σ²/n)^1/2.
The corresponding equation for two samples is Z = [(X-bar – Y-bar) – (μx – μy)] / (σx²/nx + σy²/ny)^1/2. Note
that the sample average is replaced with the difference of the sample averages, and the population mean is replaced with the difference
of the population means; also, both population standard deviations and sample sizes are included in the denominator. Under our
null hypothesis, x - y = 0, so our test statistic becomes
Z
X Y
 / nx   / n y
2
x
2
y

73  76
7 / 12  32 / 15
2
 1.386. (We do not
need the sample standard deviations since we’re assuming we know the population standard deviations.) Since our test statistic,
-1.386, is not less than -1.96 or greater than +1.96, we do not have sufficient evidence that H0 is wrong, so we accept that there is no
statistical difference in the population means and, thus, no statistical difference in math ability between boys and girls. The reason for
comparing our test stat to 1.96 is because in a standard normal distribution values between -1.96 and +1.96 incorporate 95% of the
random values. In other words, our test stat is in the 95% confidence interval. If our test stat had come out to be < -1.96 or > 1.96, we
could say that we're 95% certain that there is a difference between boys and girls. How many standard deviations must the test stat be
from the mean in order to proclaim with 95% confidence that a difference exists? People who believe that boys are better in math
might have done the same study with the same H0, but they might take the alternative hypothesis to be that μx > μy, rather than μx ≠ μy.
They'd use the same test stat, -1.386, but they'd do a one-sided test, and they'd only reject H0 at a 95% confidence level if the test stat
came out > 1.645. Explain where I got this value and why I say > rather than <.
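Here is the two-sample z computation in Python (a sketch of the arithmetic above, with the population standard deviations assumed known):

```python
import math

x_bar, sigma_x, n_x = 73, 7, 12   # boys
y_bar, sigma_y, n_y = 76, 3, 15   # girls

z = (x_bar - y_bar) / math.sqrt(sigma_x**2 / n_x + sigma_y**2 / n_y)
print(round(z, 3))                 # -1.386
print(-1.96 < z < 1.96)            # True: accept H0 at the 95% level
```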
14. Suppose now that we have the same math data for boys and girls but we don’t know the population standard deviations (which is
much more realistic). As before, we would replace the σ's with the S's. However, this approximation only works well for large
sample sizes. Ours aren’t very large so we’ll boost the confidence level to 99%. Compute the test stat and do a two-sided, 99%
confidence test of H0. State your conclusion in ordinary language. (Z = -1.69, which is not < -2.576 or > +2.576, so once again there is
not sufficient evidence to conclude a statistical difference in the populations means. That is, the difference in sample means is not big
enough to be considered statistically significant. Therefore, we cannot conclude that a real male/female difference exists and we
accept the null hypothesis that there is no difference. Make sure you show your work and explain where the numbers are coming from
on the table.)
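The same computation with the sample standard deviations in place of the σ's looks like this (a Python sketch; compare it with your hand computation):

```python
import math

x_bar, s_x, n_x = 73, 5, 12   # boys (sample values)
y_bar, s_y, n_y = 76, 4, 15   # girls

z = (x_bar - y_bar) / math.sqrt(s_x**2 / n_x + s_y**2 / n_y)
print(round(z, 2))             # -1.69
print(-2.576 < z < 2.576)      # True: accept H0 at the 99% level
```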
15. When testing the difference of means of two populations in which the population standard deviations are not known and the samples
aren’t very large (as in the last problem), we often use a t-test for two samples. The assumptions we’re working under are that the both
populations are approximately normal with about the same variability. The formula we used for one sample was
t = (x-bar – μ0) / (s / n^1/2), or equivalently, t = (x-bar – μ0) / (s·(1/n)^1/2). Here the sample standard deviation, s, is an approximation for that of the population.
For two samples with averages x-bar and y-bar we use t = (x-bar – y-bar) / (S·(1/nx + 1/ny)^1/2). Notice that S has no subscript; it's a combination of the
sample standard deviations that get calculated separately. Since our assumption is that both populations have about the same
variability, they should both have about the same standard deviation, σ, and we could reasonably approximate it with the sample
standard deviation of the x values, sx, or with that of the y values, sy. It wouldn't be right simply to let S be the average of sx and sy, since
the samples aren't necessarily the same size. So we use a weighted average for S:
S = [((n1 – 1)·sx² + (n2 – 1)·sy²) / (n1 + n2 – 2)]^1/2, or equivalently, S² = (df1·sx² + df2·sy²) / (df1 + df2).
Notice that S² is just a weighted average of the sample variances—the bigger the sample, the bigger its df, and the greater the
contribution that sample makes to the weighted
average. Example: Pinocchio and his twin sister Pinocchita have been wondering if their noses have the same sensitivity to telling lies.
average. Example: Pinocchio and his twin sister Pinocchita have been wondering if their noses have the same sensitivity to telling lies.
Pinocchio says he’s a better liar, so his nose doesn’t grow as much. Their noses do grow when they lie, but the amount of growth
varies. They assume that the variation they experience is the same and that nose growth is normally distributed for each. They want to
know whether or not the exact same distribution describes both noses. That is, they want to know if they have the same means. To find
out, they each tell the same lie over and over and each measures how much his/her nose grows in inches. The table contains the data in
inches. State the null hypothesis, mathematically and informally. Do the same for the alternative hypothesis (one-sided). Show that
t = -2.81. Explain why the critical value for a significance level of 0.05 is 1.717 (to which we compare the absolute value of t).
Therefore, after making this comparison, you can make the statement, “With ___ % certainty I can say that the null hypothesis
should be [rejected / accepted]. This means that there is [sufficient / insufficient] evidence to conclude that the means of the two
populations are most likely [the same / not the same], and that most likely Pinocchio is [right / wrong] about being a better liar.”
Pinocchio, x:   0.8 1.8 1.0 0.1 0.9 1.7 1.0 1.4 0.9 1.2 0.5
Pinocchita, y:  1.0 0.8 1.6 2.6 1.3 1.1 2.4 1.8 2.5 1.4 1.9 2.0 1.2
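If you'd like to check the arithmetic, here is the pooled two-sample t computed directly from the formulas above. This is just a sketch in plain standard-library Python; the data values come from the table, and nothing here is required for the assignment.

```python
# Pooled two-sample t-test for the Pinocchio/Pinocchita data, done "by hand".
from statistics import mean, variance  # variance() is the sample (n - 1) variance
from math import sqrt

x = [0.8, 1.8, 1.0, 0.1, 0.9, 1.7, 1.0, 1.4, 0.9, 1.2, 0.5]                 # Pinocchio
y = [1.0, 0.8, 1.6, 2.6, 1.3, 1.1, 2.4, 1.8, 2.5, 1.4, 1.9, 2.0, 1.2]       # Pinocchita

nx, ny = len(x), len(y)
# Pooled variance S^2: weighted average of the sample variances, weights = df's
S2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
t = (mean(x) - mean(y)) / (sqrt(S2) * sqrt(1 / nx + 1 / ny))
print(round(t, 2))   # -2.81, with df = nx + ny - 2 = 22
```

Note that the pooled S² is computed first, exactly as in the weighted-average formula above, and only then plugged into the t formula.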
16. We’ve tested hypotheses about the difference in the means of two groups. This can be done for any number of groups with a technique
called ANOVA (analysis of variance) with the assumption that all groups are normally distributed with about the same variability. Say
we have three groups of people: group A is comprised of 5 men from Atlanta whose average shoe size is a-bar =10.7 with variance
sa² = 0.55; group B is comprised of 4 men from Baton Rouge whose average shoe size is b-bar = 10.3 with variance sb² = 0.28; group
C is comprised of 6 men from Columbia whose average shoe size is c-bar = 10.5 with variance sc² = 0.39. We'd like to know if the
differences in the sample averages are statistically significant. H0 is that μa = μb = μc and H1 is that they're not all the same. First we
find a weighted average of the means:

X-bar = (na·a-bar + nb·b-bar + nc·c-bar) / (na + nb + nc) = [5(10.7) + 4(10.3) + 6(10.5)] / (5 + 4 + 6) = 10.5133
Notice that the weighted average is a little higher than 10.5, which it would have been if the sample sizes were the same. The next step
is to do sort of a weighted variance of the averages of the three cities:
S²-between = [na·(a-bar − X-bar)² + nb·(b-bar − X-bar)² + nc·(c-bar − X-bar)²] / (# groups − 1)
           = [5(10.7 − 10.5133)² + 4(10.3 − 10.5133)² + 6(10.5 − 10.5133)²] / (3 − 1) = 0.178667
Then we calculate a weighted average of the variances of the three groups, just like we did for two groups earlier:
S²-within = (dfa·sa² + dfb·sb² + dfc·sc²) / (dfa + dfb + dfc) = [4(0.55) + 3(0.28) + 5(0.39)] / (4 + 3 + 5) = 0.415833
Review of calculations done so far: X-bar is the weighted average of the sample averages. S²-between is, informally speaking, a
“weighted variance of the averages”; it reflects how much variability exists between the groups. If there were no variability between
the groups then the sample averages would all be [very different / about the same] and S²-between would be [close to zero / very large]
since each term in parentheses in the numerator is [very negative / nearly zero / very positive]. S²-within is, informally speaking, a
“weighted average of the variances”; it reflects the average variability within the groups. To understand these S²'s better consider an
extreme case. If everyone in Atlanta wore size 8, everyone in Baton Rouge wore size 10, and everyone in Columbia wore size 12,
then S²-between would be [large / zero] and S²-within would be [large / zero]. In any case, extreme or not, the bigger S²-between, the
more variability there is between the groups and the more likely we are to [accept / reject] H0. Randomness makes it hard to decide
how big S²-between should be before we reject H0, so we compare it to S²-within. We do this by defining a test statistic called F:

F = S²-between / S²-within. For our shoe size example F = 0.178667 / 0.415833 = 0.429659.

[Figure: rough sketch of an F distribution, skewed right, horizontal axis marked 1, 2, 3; vertical axis up to about 0.6.]
If F had come out to be about one, then the variability between the groups was [much < / about the same as / much > ] the variability
within the groups, which means we should [accept / reject] H0. If F had come out to be very large, then the variability between the
groups was [much < / about the same as / much > ] the variability within the groups, which means we should [accept / reject] H0. As
with other test stats, we use a table to find a critical value (which depends upon degrees of freedom and a significance level) and we
compare the test stat to the critical value in order to make a decision regarding H0. To do an F-test we need two degrees of freedom
values, one for S²-between, and one for S²-within. S²-between has 2 degrees of freedom (3 groups minus one). S²-within has 12
degrees of freedom (the sum of the df ’s for the individual groups). Unlike the normal or T distributions, the F distribution is not
symmetric (I did my best to draw one; the shape would vary depending on what the degrees of freedom are). The F-stat is always
positive, and we just want to know whether the F-stat is beyond the critical value. So, F-tests are always one-sided. As with all
probability distributions, the total area under the curve is one. Note that if H0 is true it is very unlikely for the F-stat to be much greater
than 1. F tables begin on page 593. At a 5% significance level we use Table D.2 and look for the row with 2 deg. of freedom. There is
no column with 12 deg. of freedom, so we’ll use the column with 10 deg. of freedom, which gives us a critical value of 4.10. This
means we should not reject H0 unless the F-stat is > than 4.10. However, our F-stat is only about 0.43, well under the critical value.
(The F-stat is also < the critical value in the column with 20 deg. of freedom, so it would be < the critical value for 12 deg. of freedom, if
it had been shown.) Thus, with 95% certainty we can conclude that we should accept H0 and believe that there is no difference in the
mean shoe size of men in the different cities, i.e., the different means we got from our samples are not statistically significant and are
best explained by random variation inherent in the sampling process rather than any real difference among cities. Note that the critical
values are large in the upper left part of the table because with only a few small groups, there is more uncertainty, hence the need for
the test stat to be bigger before rejecting H0. Explain why the critical value at row infinity and column infinity is one.
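The whole shoe-size ANOVA above can be double-checked with a few lines of Python. This is just the example's summary statistics (n, mean, sample variance for each city) run through the S²-between, S²-within, and F formulas; it is a sketch for checking arithmetic, not anything you are required to use.

```python
# One-way ANOVA "by hand" from group summary statistics (n, mean, variance),
# matching the shoe-size example: Atlanta, Baton Rouge, Columbia.
groups = [
    (5, 10.7, 0.55),
    (4, 10.3, 0.28),
    (6, 10.5, 0.39),
]

N = sum(n for n, m, v in groups)                    # total sample size, 15
k = len(groups)                                     # number of groups, 3
grand_mean = sum(n * m for n, m, v in groups) / N   # weighted average of means

# "Weighted variance of the averages" and "weighted average of the variances"
s2_between = sum(n * (m - grand_mean) ** 2 for n, m, v in groups) / (k - 1)
s2_within = sum((n - 1) * v for n, m, v in groups) / (N - k)
F = s2_between / s2_within
print(round(grand_mean, 4), round(F, 4))   # 10.5133 0.4297
```

Since F ≈ 0.43 is well under the critical value of 4.10, the code agrees with the conclusion above: do not reject H0.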
17. Section 14.1. Let’s look now at a distribution for categorical data. Suppose a guitar manufacturer is planning a new advertising
campaign targeting young, hard rock fans. The company hires a marketing analyst to determine which rock guitarist kids like best
these days. The analyst claims that this information is well known: 35% percent of the kids like Angus Young from AC/DC; 20% like
Eddie Van Halen; 18% prefer Jimmy Page from Led Zeppelin; and the rest like various other guitarists best. Not trusting these data,
the guitar manufacturer decides to test these hypothesized percentages (H0). This is not quite like anything we’ve done thus far. To test
H0, the guitar guy must conduct a random survey of teen rockers, create a table (below), and calculate yet another test statistic. This
test stat is called chi-square (“chi” is a Greek letter pronounced like the first syllable of “kite”). Its symbol is χ². The process is very
similar to computing the variance of a set of numbers except, instead of subtracting the mean from each number, we subtract its
expected value based on H0. Also, each squared difference is divided by its own expected value. In this example χ² = 14.611.
Mathematically, we can say χ² = Σ (Oi − Ei)² / Ei, where the summation on i runs from 1 to n, the number of categories. About what
value would χ² have if the marketing analyst's information were completely dependable? The higher χ², the more likely the marketing
guy was wrong, but because of the inherent randomness of the procedure, we must allow for the possibility that the marketing guy was
exactly right but that randomness is responsible for χ² being so high. Thus, we must find a cut-off point beyond which we can say with
95% confidence that we can reject H0 and conclude that his percentages were wrong. (The χ² test stat is always positive and its
distribution looks very much like that for F above.) We do this by looking up the 0.05 significance level critical value in a χ² table
with df = (# categories) − 1 = 3. The table lists a critical value of 7.815. Since our test stat is way beyond this, it is very unlikely
that the differences between the expected and observed values can be attributed to randomness. It is much, much more likely that the
differences are due to the fact that H0 is wrong. The guitar guy concludes that the marketing guy gave him bogus information and
demands his money back. (Note: the conditions required for a χ² test to work well are that the random samples come from a large
population and that none of the expected values is extremely small.)

Category   Observed, O   Theoretical Probability   Expected, E   (O − E)² / E
Angus          10             0.35                    17.5           3.214
Eddie           6             0.20                    10             1.600
Jimmy           9             0.18                     9             0
Other          25             0.27                    13.5           9.796
Totals         50             1.00                    50            14.611
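The guitarist example can be verified directly from the definition of χ². This short Python sketch just recomputes the last column of the table and sums it; the observed counts and probabilities are from the table above.

```python
# Chi-square goodness-of-fit statistic for the guitarist survey,
# computed directly from chi^2 = sum (O - E)^2 / E with E = n * p.
observed = {"Angus": 10, "Eddie": 6, "Jimmy": 9, "Other": 25}
probs    = {"Angus": 0.35, "Eddie": 0.20, "Jimmy": 0.18, "Other": 0.27}

n = sum(observed.values())   # 50 teen rockers surveyed
chi2 = sum((observed[c] - n * probs[c]) ** 2 / (n * probs[c]) for c in observed)
print(round(chi2, 3))   # 14.611, well beyond the critical value 7.815 (df = 3)
```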
Your turn: An interplanetary commission has been set up to resolve disputes among various planets in our region of the galaxy. The
commission is comprised of 15 Earthlings, 21 Klingons, 13 Vulcans, 17 Romulans, and 15 Martians. The commission is supposed to
represent each planet equally in terms of its population (sort of like the number of representatives from each state in the U.S. House of
Representatives is proportional to each state’s population). Here are the planetary populations in billions of beings: Earth 6.1; Klingon
13.8; Vulcan 3.9; Romulus 9.5; Mars 4.3. You’re charged with determining whether or not the commission is truly representative.
Show that there is not quite (but almost) enough evidence to say with 95% confidence that the commission is unbalanced. That is, you
can't say with enough certainty that some planets were intentionally underrepresented. (H0 is that no planet is under- or
overrepresented on the commission.) My advice is to create a table in Excel and make Excel do the computations. To find the
theoretical probabilities, use the relative populations, e.g., the probability for Earth is 0.162, since Earthlings represent 16.2% of the
interplanetary population. Earth’s expected value will come out to about 13.1 members on the commission. This means that if H0 is
true and all is fair then Earth should get about 13 delegates. Don't round, because your errors will compound. χ² should come out to
be just under the critical value, meaning any higher and you would have had sufficient evidence to claim unfair representation. Instead
you’re forced to assume that the differences between the actual and expected numbers of representatives could reasonably be due to
the fact that whenever a group is chosen or elected some random variation is to be expected.
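If you'd rather check your Excel table against something, here is the same computation sketched in Python. The populations and commission counts are from the problem statement; the expected count for each planet is its share of the total population times the 81 seats.

```python
# Chi-square test of representativeness for the interplanetary commission.
# Populations are in billions; expected seats use each planet's population share.
pop   = {"Earth": 6.1, "Klingon": 13.8, "Vulcan": 3.9, "Romulus": 9.5, "Mars": 4.3}
seats = {"Earth": 15, "Klingon": 21, "Vulcan": 13, "Romulus": 17, "Mars": 15}

total_pop = sum(pop.values())
total_seats = sum(seats.values())   # 81 commission members
chi2 = 0.0
for planet in pop:
    expected = total_seats * pop[planet] / total_pop   # don't round: errors compound
    chi2 += (seats[planet] - expected) ** 2 / expected
print(round(chi2, 3))   # just under the df = 4 critical value of 9.488
```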
18. I’ve saved the best concepts for last: regression and correlation. Some of this you already know; some you most definitely do not. This
question will deal with the stuff you already likely know. We’ve been comparing things like math scores (boys vs. girls), nose growth
(Pinocchio vs. his sister), shoe sizes (comparing men in different cities), Popeye’s spinach consumption (actual vs. claimed), planetary
representation (actual vs. ideal), etc. In all of those cases the comparisons were made between the exact same types of quantities, e.g.,
boys’ average math score vs. girls’ average math score. There are times, though, when we’d like to see if a relationship exists between
two completely different quantities, such as the amount of rainfall and the yield of a soybean field, or the amount of oxygen dissolved
in a body of water and the water’s temperature. In math class you’ve entered lists of data in your graphing calculator and made a scatter
plot in order to ascertain whether or not there is a relationship between the two lists of numbers (a correlation). If the points in the
scatter plot seemed somewhat to lie on a line, then a linear relationship existed between the two sets of numbers, and you used the
calculator to do a linear regression to find the equation of the line that “fits the data” best. The regression line typically didn’t go
through any of the plotted points, but it did come as close to as many of them as possible. If the regression line had a positive slope,
then there is a positive correlation between the two sets of numbers (as one quantity or variable increases, so does the other); a negative
slope indicates a negative correlation (as one increases, the other decreases, and vice versa). Furthermore, the equation of this
regression line allowed you to make predictions about the relationship beyond what your data showed (called extrapolation) as well as
between your data points (interpolation). You may also be aware that there is a number, r, called the “correlation coefficient.” If r = 1,
you have a perfect positive correlation (all data points lie on, rather than near, the regression line, which has a positive slope).
Similarly, if r = -1, you have a perfect negative correlation (all points lie on the regression line, which has a negative slope). | r | ≈ 1
implies a very strong linear relationship (the data points at least come very close to lining up). r ≈ 0 implies no linear relationship
between the two quantities exists (perhaps the scatter plot has points all over the place, or perhaps some other sort of relationship
exists, like a quadratic or exponential relationship). As was discussed in a previous module, the exponential relationship y = ab^x can be
made linear by taking logs of both sides of the equation: log y = log a + x log b. Thus, if y ≈ ab^x, then log y vs. x should have a
correlation coefficient close to one, and the regression line should have a slope of about log b and a y-intercept of about log a. Taking
logs of both sides of the power relationship y ≈ ax^n yields: log y ≈ log a + n log x. Thus, there should be a strong correlation between
log y and log x. Check out the graphs on pages 538 and 539.
What is the correlation coefficient going to be about, if a strong nonlinear relationship exists? What about when a strong linear relation
exists for all the data except for one outlier?
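The log trick described above is easy to see numerically. In this sketch a = 3 and b = 2 are made-up values; taking log of y = a·b^x produces points that lie exactly on a line with slope log b and intercept log a.

```python
# Linearizing y = a * b**x by taking logs: log y = log a + x * log b.
from math import log10

a, b = 3.0, 2.0                            # made-up constants for illustration
xs = [0, 1, 2, 3, 4, 5]
log_ys = [log10(a * b ** x) for x in xs]   # log y values, one per x

# Consecutive differences of log y are all log b, so log y vs. x is a line
slopes = [log_ys[i + 1] - log_ys[i] for i in range(len(xs) - 1)]
print(round(slopes[0], 5), round(log_ys[0], 5))   # slope = log10(2), intercept = log10(3)
```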
19. Now for some of the stuff you may not have learned before. When the calculator cranks out a correlation coefficient and the equation for a
linear regression, it's doing a lot of statistics behind the scenes. The formula for the correlation coefficient is given by:

r = [ n / (n − 1) ] · (xy-bar − x-bar · y-bar) / (sx · sy)

There are many equivalent versions of this formula, but I think this is the simplest. Note that xy-bar and x-bar · y-bar are not the same
quantities. xy-bar is the average of the products of corresponding x and y values, while x-bar · y-bar is the product of the averages.
r doesn't require a derivation, since the above equation is a definition. However, it can be shown that, when the slope of the regression
line is positive so is r, when the slope is negative so is r, and that −1 ≤ r ≤ 1 always. I won't subject you to the proofs, but
let’s do a couple of demonstrations. Let’s deal with three points that have a perfect correlation: (1, 10), (2, 20), and (3, 30). Show that r
comes out to be exactly one in this case. (You should get 140 / 3 for xy-bar.) If I change the point (3, 30) to (3, 10), now we've got an
“inverted V” for a scatter plot. Clearly, there is no correlation now between y and x. Show that r is zero in this case. (It is sufficient to
show that the numerator is zero.)
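Both demonstrations can be checked by coding the definition of r directly. This is a plain standard-library sketch; the two data sets are exactly the ones given above.

```python
# Correlation coefficient from the definition:
# r = [n/(n-1)] * (xy_bar - x_bar * y_bar) / (sx * sy)
from statistics import mean, stdev

def corr(xs, ys):
    n = len(xs)
    xy_bar = mean(x * y for x, y in zip(xs, ys))   # average of the products
    return (n / (n - 1)) * (xy_bar - mean(xs) * mean(ys)) / (stdev(xs) * stdev(ys))

print(round(corr([1, 2, 3], [10, 20, 30]), 6))   # 1.0  (perfect positive correlation)
print(round(corr([1, 2, 3], [10, 20, 10]), 6))   # 0.0  (inverted V: no correlation)
```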
20. Last question of the whole course! As a special treat, I’ve prepared a cool spreadsheet to help you understand the method of least
squares, which is the method typically used to find regression lines. As you know, a regression line is the line that best fits the data. (It
is sometimes called a line of best fit or a trend line.) Let’s first make the reasonable assumption that the regression line goes through
the point (x-bar, y-bar), which is the large red dot in my spreadsheet. We start with any old line that goes through this point. The idea
behind finding the slope of the regression equation is to look at the residues. Residues are a measure of how far above or below the
actual data points are from the line drawn through (x-bar, y-bar). Residues can be positive or negative (see page 548). I set up the
spreadsheet so that negative residues are blue and positive residues are red. Change the slope of the line using the up/down arrows or
the scroll bar. Notice that the graph changes automatically, as do the residues. It’s not hard to change the slope to get a decent fit, but
we want the best fit, not just a decent fit. To do this we monitor the sum of the squares of the residues in the large, bold font. (As in
computing a variance, squaring makes everything positive. We could use absolute values and get a similar result. Squaring, though,
lends itself to mathematical manipulation more so than does taking absolute values.) The object is to find the slope that minimizes the
sum of the squares of the residues, thereby finding the slope of the line that comes as close to as many points as possible (close in the
sense that sum of the squares of the distances from the points couldn’t get any smaller). For the x and y values given, find this slope. (I
set it up so that you can change the x and y values if you like.) Of course now it’s easy to find the regression equation since you’ve got
a point and the slope. It’s displayed on the graph. By the way, if you’re interested in how I created the spreadsheet, unprotect the sheet
(Tools menu). Then you can click on different cells to see what formulae they contain.
As much as you like spreadsheets, I think you would agree that there must be a more efficient way to find regression equations.
Naturally, there is a formula for it:

y − y-bar = r · (sy / sx) · (x − x-bar)

The derivation is quite long and complicated, involving lots of summations
and some multivariate calculus (which none of you have had yet). Nevertheless, I would be remiss in my duties were I not at least to
outline the main idea, and I would be more than happy to go through the details with anyone with the time and interest to do so. First
off, note that, if you believe the above equation is true, it shows that (x-bar, y-bar) is on the regression line. Explain why it also shows
that if y values are more spread out than the x values, then the slope will be steeper. Here’s the main idea behind the derivation. Let the
equation of the regression line be y = mx + b. So, if (xi, yi) is a data point, this line might be above or below this point. The y value of
the line at this x value is mxi + b, and the residue is yi – y = yi – (mxi + b). Explain how yi and y differ in meaning. Let the sum of the
squares of the residues be given by H = Σ(yi − y)² = Σ(yi − mxi − b)². H is a function of two variables, m and b. Recall that when we
have a function of one variable, say y = f(x), to minimize or maximize y, we take its derivative and set it equal to zero (to find where
the tangent line is horizontal). That is, we solve f ’(x) = 0 for x, and these are the x values where y can have its peaks and valleys. Well,
with a function of two variables like H, we would have to take a derivative with respect to m and another derivative with respect to b.
The derivative with respect to m is a measure of how quickly H changes as m changes while b is held constant. How would you
describe the derivative of H with respect to b? Once we have two derivatives equal to zero, those equations are solved simultaneously,
yielding the formula above.
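The formula above can be sketched in code. The data values below are made up for illustration; the slope comes from m = r · sy / sx, the intercept from forcing the line through (x-bar, y-bar), and the final assertion checks numerically that nudging the slope away from m in either direction only increases the sum of squared residues, which is exactly what the least-squares derivation guarantees.

```python
# The regression line y - y_bar = r * (sy/sx) * (x - x_bar) coded directly.
from statistics import mean, stdev

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data for illustration
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xy_bar = mean(x * y for x, y in zip(xs, ys))
r = (n / (n - 1)) * (xy_bar - mean(xs) * mean(ys)) / (stdev(xs) * stdev(ys))

m = r * stdev(ys) / stdev(xs)    # slope of the regression line
b = mean(ys) - m * mean(xs)      # the line passes through (x_bar, y_bar)

# Sum of squared residues for a line of given slope through (x_bar, y_bar)
def ss(slope):
    return sum((y - (slope * (x - mean(xs)) + mean(ys))) ** 2
               for x, y in zip(xs, ys))

# The least-squares slope m minimizes the sum of squared residues
assert ss(m) <= ss(m + 0.01) and ss(m) <= ss(m - 0.01)
print(round(m, 3), round(b, 3))   # slope ~ 1.96, intercept ~ 0.14
```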
Keep in mind that when two quantities, x and y, are correlated, this may be because x causes y, because y causes x, because something
else causes both of them, or there may be no causation at all. For example, let’s assume there is a positive correlation between
intelligence as measured by IQ and chess ability. This may be because it is necessary to be smart to excel at chess. It won’t be a perfect
correlation since some smart people don’t know how to play. Some may say that being smart causes one to be good at chess. Someone
else might claim that playing chess enough to get good at it causes one’s IQ to rise. Yet another person might argue that scoring high
on an IQ test and being a good chess player are both caused by having a good memory. An example in which there is no causation at
all is the following. There is a strong correlation between my going for a walk and my dog going for a walk, since I often take her with
me when I walk. It would be wrong, though, for me to claim that my walking causes her walking. These ideas are pertinent to research,
especially medical research. Let’s say we’re trying to determine whether high blood pressure causes heart disease. One way to do this
is by gathering data and plotting incidence of heart disease vs. blood pressure. Say we find a fairly strong correlation with a coefficient
of 0.93. This tells us that many more people with high blood pressure have heart attacks than people with low blood pressure, but it
doesn’t necessarily mean that high b.p. causes heart attacks. It might, but it could be that a third factor, like smoking or a high fat diet,
causes both high b.p. and heart attacks. A study like this definitely establishes a link between the two, and it does provide a legitimate
reason for people with high b.p. to do what they can to lower it. The action they take to lower their b.p., or the lower b.p. itself, may
prevent a heart attack.
Causality aside, how large does the correlation coefficient have to be in order to be confident that a correlation between two quantities
really does exist? (rhetorical question) For example, it might not be clear whether or not researchers should claim there’s a link
between heart attacks and high b.p. if r were only 0.75. So, we do as we’ve done before: we test the null hypothesis, which states that
there is no correlation between the two, meaning they’re independent parameters (unrelated to one another). Stated concisely, H0 is that
r = 0. There is a symmetric distribution for r that looks fairly normal. The distribution depends on the sample size, just as the
distributions for T, F, and χ 2 do. A table is on page 543. Suppose our study was done on 80 people. The critical value at the 5%
significance level is 0.2199. So, even if r is only 0.75, we have sufficient evidence to reject H0, and we can still say with 95% confidence
that high b.p. is indeed associated with heart attacks. Notice in the table that as the size of the study increases, the critical values go
down. This means that with a larger study we don't demand as high a correlation coefficient. This is because with a small study,
random variation could more easily be responsible for a high r value even when there is no real correlation. Say you gather data on
Olympic swimming records. You plot winning times for a particular race vs. the year for 8 recent Olympics, you do a regression
analysis, and you find the correlation coefficient to be -0.84. Why would r be negative? You want to determine whether or not there
really is an association between winning times and the year. What is the null hypothesis? Can you claim with 99% certainty that an
association really does exist? What about with 99.9% certainty?