Advanced Probability and Statistics, Module 7

Topics: Random sampling, bias, standard error, review of the central limit theorem, margin of error, confidence intervals, t-tests, hypothesis testing, statistical significance, comparing two groups of related data with two-sample z- and t-tests, comparing several groups of related data with an ANOVA and F-test, making statistical inferences from categorical data using the chi-square test, and correlation, the method of least squares, and regression analysis.

Don't be intimidated by the length of this module. Most of it is my attempt to explain the remaining statistics concepts. You'll probably spend more time reading and thinking than you will answering questions. Moreover, I have provided answers to most of the questions requiring computations so that you won't have to wonder if you did them correctly. Be sure to show your work.

There is an error in the book on page 312: in the equation in example 9.8, "standard deviation" should be "standard error."

1. Section 9.1-9.4. Pretend you're going to conduct a study that involves collecting data of some sort. Concoct and describe a study in a paragraph or two: Define the population, your sample, the parameter of interest, and your statistic. Discuss how you plan to minimize selection bias and response bias and explain how each type of bias might cause you to make an incorrect inference. Do you need a stratified random sample? Give details about the method you will use to select your sample. Assuming bias is not a problem, explain what you can infer from your study. Explain how your sample size affects your confidence in your inference. Why can't you just make your sample size the whole population in order to boost your confidence? In terms of your study, what is the meaning of μ, σ, X-bar, and S?

2. Section 9.4-9.5. Suppose you had the resources to repeat a study 100 times over, each with the same sample size of 64. With each study you measure X-bar and S. What is meant (in words, not equations) by the variability of the average value? (It is not the same variability as described by σ or an S value.) The standard error of the average is a way to measure this variability. How would the averages of the studies be distributed (at least approximately) regardless of the distribution of your population? (Think central limit theorem.) In terms of μ and σ, the mean of this distribution would be ___ and its standard deviation would be ___. (Think central limit theorem.) The standard deviation of this distribution of sample averages is called the standard error. It's an estimate of about how far the sample average is likely to be from the population mean. In reality, of course, it's normally too costly to repeat a study 100 times. So, we must hope that the mean we measure from a single study, X-bar, is close to what we really want to know but which is impractical to measure directly, μ. The standard error can give us an idea of how representative X-bar is of μ. If these values are close, our study has estimated accurately what it set out to determine. Since μ can't be known for certain, we compute X-bar and the standard error, and if the standard error is low, we conclude that our sample mean is a good approximation to the population mean. Assuming just a single study has been done, compute the standard error. (Be mindful of what you divide by; it's not one.) If you wanted your study to have three times the precision, you would need to increase its sample size by a factor of ___.
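If you want to see the "variability of the average value" concretely, here is a minimal Python simulation sketch (not part of the assignment; the exponential population with mean 10 is made up purely for illustration):

    # Simulate the scenario in problem 2: repeat a "study" 100 times, each with
    # sample size 64, and look at how the 100 sample averages spread out.
    # The population (exponential, mean 10, so sigma is also 10) is made up.
    import numpy as np

    rng = np.random.default_rng(0)
    n, studies = 64, 100
    pop_mean = pop_sd = 10.0

    sample_means = [rng.exponential(pop_mean, n).mean() for _ in range(studies)]

    print("spread of individual values (sigma):            ", pop_sd)
    print("spread of the 100 sample averages:              ", np.std(sample_means, ddof=1))
    print("central limit theorem prediction, sigma/sqrt(n):", pop_sd / np.sqrt(n))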
If the sample size is the entire population, then, naturally, the standard error should be zero. The formula does not give zero, though, because the formula is intended for random samples much smaller than the population size. Use the correction factor for large samples to explain how the standard error comes out to be nearly zero for very large samples. (See page 305.)

3. Schmedrick's hometown is having its annual Wiener Fest, where pure, delicious soy hotdogs are consumed in vast quantities in whole wheat buns with lots of healthy veggie condiments. Just for the heck of it, Schmed measures the lengths in centimeters of all 10 wieners he eats that day: 15.0, 15.8, 16.5, 14.2, 15.2, 15.6, 14.0, 14.9, 15.3, 15.9. Suppose the lengths of all hotdigities at the fair have some distribution (not necessarily normal) with mean μ and variance σ². What is Schmed's best approximation for μ? About how far is his estimate likely to be from the true mean? (Calculate the standard error.) Notice that the standard error is only about a third of the sample standard deviation. The interpretation of this is that there is about three times more variability among individual hotdogs than among averages of ten. In other words, if he looked at the averages of many samples of ten, those averages would be clustered closer to the mean than individuals would. This is just an application of the central limit theorem. Think about it this way: If Schmed randomly selected sets of n dogs with replacement, the central limit theorem states that the averages of the sets would have an approximately normal distribution (especially if n is large) with the same mean as the population, μ, and a standard deviation of σ/√n. The standard deviation of this distribution (which is less than the standard deviation for the population by a factor of the square root of n) is simply the standard error, but since we don't know σ we approximate it with S. If Schmed took 144 samples, each of size 25, would the standard deviation of the distribution of his sample averages be σ/12 or σ/5? What would the standard error be?

4. Section 9.6. I mentioned back in Module 1 that for a presidential election poll with a margin of error as low as 3%, it only takes a sample of about 1000 people. It seems amazing that only 1000 people can predict so precisely how nearly 300 million people are feeling about the candidates. Of course, it is essential for the polling to be as nonbiased as possible. Let's see why this works. If there are two candidates, we essentially are gathering binary data: for example, a preference for Bush corresponds to zero and a preference for Kerry corresponds to one (or vice versa). In a previous module you proved the variance formula for binary data on page 127: Var = XY / [n(n − 1)]. So, S is the square root of this. Suppose we sample 1000 registered voters at random and 579 say they prefer Kerry. What are X, Y, and n? The proportion, p, who prefer Kerry is X/n, which makes it easy to derive the formula on page 128: S = sqrt[np(1 − p) / (n − 1)]. Show that p = 0.579 and S = 0.494. Use the definition for standard error, standard error = S/√n, and the formula for S above to derive the formula on page 310. (Very simple.) 57.9% of the people we surveyed favor Kerry. To determine how far off this is compared to the proportion of people who like Kerry among the whole population, let's find the standard error. Show that the error is 0.0156, or about 1.56%.
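For a quick check of the two standard errors above (Schmed's hotdogs from problem 3 and the 579-out-of-1000 poll from problem 4), here is a minimal Python sketch; it only reuses numbers already given in the text and is not part of the assignment:

    import math

    # Problem 3: Schmed's ten hotdog lengths (cm)
    lengths = [15.0, 15.8, 16.5, 14.2, 15.2, 15.6, 14.0, 14.9, 15.3, 15.9]
    n = len(lengths)
    mean = sum(lengths) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in lengths) / (n - 1))   # sample SD
    print("hotdogs: mean =", mean, " S =", round(s, 3), " standard error =", round(s / math.sqrt(n), 3))

    # Problem 4: 579 of 1000 voters prefer Kerry (binary data)
    n, p = 1000, 579 / 1000
    S = math.sqrt(n * p * (1 - p) / (n - 1))    # formula from page 128
    print("poll:    p =", p, " S =", round(S, 3), " standard error =", round(S / math.sqrt(n), 4))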
Since standard error comes directly from standard deviation, we can interpret the error as follows: the true proportion of the population that prefers Kerry is 57.9% give or take about 1.56%. In other words, by the central limit theorem, our sample average (the proportion) should have a normal distribution with the same mean as the population (which we estimate to be 57.9%) and a standard deviation equal to that of the population over the square root of n (and this is approximately 1.56%, the standard error). Since 68% of normal data lie within one standard deviation of the mean, we can say with about 68% certainty that the true percentage of people favoring Kerry is within 1.56 percentage points of 57.9%. That is, with 68% certainty, 57.9% of voters prefer Kerry, with a margin of error of 1.56%. Typically, though, we want more than 68% certainty. 95% certainty is sort of the accepted standard, which corresponds to two standard deviations on either side of the mean. When margins of error for polls are reported in the news, they typically are referring to 95% certainty. For our data we can say with 95% certainty that 57.9% of voters prefer Kerry give or take 3.12 percentage points. Where did I get the margin of error of 3.12%? In other words, if we repeated our survey 100 times, about 95 times the survey would yield results within 3.12% of the actual Kerry preference; the other 5% of the surveys would amount to off the wall results (which is why we can't put too much stock into a single survey). In an earlier module you proved with calculus that, for binary data, the variance is at a max when the proportion is 0.5 (when opinion is split down the middle). Thus, when p = 0.5, the standard error would be greatest. For people who are approximately split on an issue, show (with 95% certainty) that the size of the survey must be 1,112 people for a margin of error of 3%.

5. Many times now we've used Z = (X − μ)/σ. Recall that this formula is used when a random variable X has a normal distribution with mean μ and standard deviation σ. Thus, X − μ is how far from the mean X is, and (X − μ)/σ is how many standard deviations away X is from μ. This means Z has a standard normal distribution (mean of zero, stan. dev. of one), and we can use the table in the covers of the book. To use this formula we needed to know μ and σ. But these parameters are often unknown. In fact, we're often interested in approximating μ by taking the average of a random sample. Let's say a wildlife conservationist wants to know the density of prairie dogs (animals per acre) in a wilderness region of Montana. Since it would be impractical to count all the prairie dogs in dozens of square miles, the region is divided up on a map into small parcels, and many parcels are chosen at random on which actual counts will be conducted. So we can speak of a population of parcels, each of which has a certain density of prairie dogs. The density has some sort of distribution (which may or may not be normal) with mean μ and standard deviation σ. μ is the unknown average density of interest to the biologist. Suppose that somehow σ is known, let's say σ = 30 prairie dogs/acre. (In real life this probably wouldn't be known and would have to be approximated.) To estimate μ, the biologist investigates n = 50 parcels. Let Xi be the random variable representing the # of prairie dogs per acre in parcel i. Upon gathering the data X1 through X50, she then computes X-bar, the average of the 50 data values.
The central limit theorem says that X-bar has a distribution that is approximately normal with mean μ and standard deviation σ/√50. X-bar is called an unbiased estimator of μ provided that there was no bias in the selection process for the parcels. Say X-bar comes out to be 93 dogs/acre. Since it is only an estimate, the biologist cannot report definitively that the mean density is 93; she can, however, report a 90% confidence interval for the true mean as 86 to 100 per acre. I'll explain how she got this in a minute. The 90% confidence interval is an estimated interval for μ. It means that there is about a 90% chance that the interval really does contain the mean. 90% is arbitrary; a confidence interval can be computed for any percentage. Explain why a 75% confidence interval would be a smaller interval than the 90% interval. To compute the interval we use the distribution Z = (X-bar − μ)/(σ/√n). Notice how similar this equation is to the one to which you're already accustomed. One difference is that we're using X-bar rather than X, since it is X-bar that is serving as her estimator of μ rather than X, i.e., she used the average of the 50 pieces of data, rather than an individual measurement, to estimate the true mean. Since X-bar has a standard deviation of σ/√n, this replaces σ in the original formula. Z is the number of standard deviations 93 is from the true mean. To form a 90% confidence interval we want to know how far above and below 93 we have to go in order to cover 90% of the area under the normal curve. This means 5% of the area will be under the right tail (very large values), and 5% of the area will be under the left tail (very low values). So, the left endpoint of the interval corresponds to a probability of 0.05 in the table, and the right endpoint corresponds to a probability of 0.95 (since the normal table is a list of probabilities from −∞ to some value of interest). What Z value corresponds with each endpoint? (They are opposites of one another.) Books refer to these Z values as z0.05 since they mark the boundaries for the upper and lower 5% areas. Ok, now show that the endpoints for the 90% confidence interval are given by X-bar ± z0.05·σ/√n. (Use basic algebra with the substitution Z = ±z0.05.) Use this formula to show that the 90% confidence interval is 86 to 100 per acre. Finally, if the biologist were inclined to be overly cautious, she might report her findings with a 95% confidence interval. Show that this interval is from about 84.7 to 101.3 prairie dogs per acre, and explain in words what this interval means in terms of probability and the actual unknown mean.

6. Terminology and notation: In the prairie dog example we wanted the probability of the actual mean being between two endpoints to be 90%. That is, we found endpoints such that P(left endpt < μ < right endpt) = 0.90. When discussing confidence intervals, books always talk about α, which is just 1 − (the amount of confidence). At 90% confidence, α = 0.10. Then we can rewrite P(left endpt < μ < right endpt) = 0.90 as P(−zα/2 < Z < zα/2) = 1 − α. Write the corresponding equation for a confidence interval of 80%.

7. Section 10.1. In the prairie dog example we pretended the population standard deviation was known. Normally, though, σ is not known, so we approximate it with the sample standard deviation. In cases like these we could replace sigma with S, and, if n is large, we could proceed as we did in the last problem, assuming an approximately normal distribution.
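Here is a minimal Python sketch of the prairie-dog intervals above (not part of the assignment; it assumes scipy is available for the normal quantiles):

    # Problem 5: sigma is taken as known (30 dogs/acre), n = 50 parcels,
    # and the sample average came out to 93 dogs/acre.
    import math
    from scipy.stats import norm

    xbar, sigma, n = 93, 30, 50
    se = sigma / math.sqrt(n)

    for conf in (0.90, 0.95):
        z = norm.ppf(1 - (1 - conf) / 2)      # z_(alpha/2), e.g. about 1.645 for 90%
        print(f"{conf:.0%} CI: {xbar - z*se:.1f} to {xbar + z*se:.1f} dogs/acre")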
To be more accurate, especially if n isn't very large, instead of Z we calculate a different test statistic called T, which is defined as T = (X-bar − μ)/(S/√n). Notice the similarity to the formula above for Z, and that the denominator is the standard error (which is also the approximate standard deviation of the distribution of X-bar). Like the normal distribution, the t-distribution looks like a bell. However, it's not a normal distribution. It's a different distribution and there are different tables for it (page 589 in your book). Unlike a normal distribution, a t-distribution has degrees of freedom (df) associated with it: df = n − 1. With the prairie dogs, df = 49. (There are only 49 degrees of freedom, since if the standard deviation of 50 measurements is known or estimated, 49 observations have the freedom to be any value, but the 50th must have a particular value to preserve that standard deviation.) There is a pic on page 323. Note that the t-distribution is symmetric but it's got a greater variance than the normal curve (lower peak, bigger tails). True or false: the smaller the number of degrees of freedom, the flatter the t-distribution. The variance is greater because extra variability was introduced by approximating σ with S. The equation for T has two random variables in it: X-bar and S. For larger df, the t-distribution begins to look more normal since for large n, S is more likely to be close to σ. Notice that the table entries for infinity are exactly the same as in the normal distribution. If df is small, then n is small, so the central limit theorem may not hold. This means that if we use a small sample size, the population must be approximately normal.

Let's do a sample problem. Say we want to determine how much garbage the average American household produces in a week. We'll assume garbage production has a normal distribution and choose 51 households at random. We calculate the mean of our 51 pieces of data to be 68 pounds and the standard deviation to be 16 pounds. We don't know μ, but we have T = (68 − μ)/(16/√51). For a 90% confidence interval, we need P(−tα/2 < T < tα/2) = 0.90, where α = 0.10, and tα/2 will be looked up in a table. Substituting for T: P(−tα/2 < T < tα/2) = P(−tα/2 < (68 − μ)/(S/√n) < tα/2) = P(−tα/2·S/√n < 68 − μ < tα/2·S/√n) = P(−68 − tα/2·S/√n < −μ < −68 + tα/2·S/√n) = P(68 + tα/2·S/√n > μ > 68 − tα/2·S/√n) = P(68 − tα/2·S/√n < μ < 68 + tα/2·S/√n). So, just like in the prairie dog example, we get the same sort of endpoints: X-bar ± tα/2·S/√n. We use the 90% confidence interval table on the bottom of page 589 and look up t0.05 with df = 50 and get 1.676, which gives us endpoints of 64.245 and 71.755. This means that with 90% confidence we can say that the true mean amount of garbage produced per week is somewhere between about 64.2 and 71.8 pounds.

All right, now it's your turn. Schmedrick gets a job working for Acme Hand Grenade Company as a statistician. His task is to determine with 99% certainty the mean blast radius of the grenade, as defined by the maximum distance shrapnel flies from the point of explosion. So Schmed selects 6 grenades at random, heads out to the desert, explodes them remotely one at a time from a predetermined height, finds the furthest piece of shrapnel in the sand after each explosion, and measures the distance from the explosion point. His data in meters are: 140.8, 123.7, 166.7, 130.9, 170.1, and 142.2.
Technical manuals lead him to believe it is safe to assume that the blast radius has a normal distribution, but he has no idea what to expect for a standard deviation for that distribution. Thus, he uses the t-distribution. Show that Schmed can report to his boss that with 99% confidence the mean blast radius is between 114.7 and 176.7 meters. (For each endpoint I rounded away from the mean to ensure at least 99% confidence.)

8. Look at the figure on page 333. Note that each confidence interval is centered on the same mean and that the interval gets larger as the confidence level increases. In one or two sentences, tell me why this is so. Note also that these are two-sided confidence intervals. For example, in the top interval, there is a 5% chance that the true mean is greater than the right endpt. value, a 5% chance that the true mean is less than the left endpt. value, and a 90% chance that the true mean lies within the interval. Now imagine the same interval except that the right endpt. extends to infinity. This is a one-sided, 95% confidence interval and it means that there's a 5% chance that the mean is not in the interval (too small) and a 95% chance that the mean is in the interval. Thus, to create a one-sided, 95% confidence interval, we begin with a two-sided, 90% interval and extend either the left or the right side (but not both) indefinitely. Suppose a 92% confidence interval for the mean number of times dogs in Urbana bark per day is 31 to 49 barks/dog/day. Explain why the one-sided, 96% confidence interval is from −∞ to 49, which in this case is equivalent to 0 to 49. Also, to compute the original interval, the sample mean must have been 40. Why?

9. Section 11.1: Upon completion of his grenade analysis, Acme transfers Schmedrick to the yo-yo department. His new boss says, "Schmedrick, our packaging asserts that each yo-yo has a string length of 48 inches, but the Federal Yo-yo Commission wrote us a threatening letter saying they're suspicious that we're misleading the public by supplying yo-yos with an average string length that is not 48 inches." What are the null and alternative hypotheses?

Section 11.2: The boss continues, "Schmedrick, I want you to conduct a test whose results, hopefully, will show no statistically significant difference between your sample average and the hypothesized mean. Then I can write the Commission back and assure them that there is insufficient evidence to reject the null hypothesis." Explain what the boss means by this.

Section 11.3: Schmed gets to work by randomly selecting 31 yo-yos, measuring their string lengths, and computing a mean of 46.8 inches and a standard deviation of 3.5 inches. Compute the two-sided, 95% confidence interval. (Answer: 45.5 to 48.1 inches) Should Schmed accept or reject the null hypothesis? (He should accept it; explain why.) His sample average of 46.8 inches is not exactly the same as the hypothesized mean of 48 inches, but is the difference statistically significant? (It's not; explain.) What then accounts for the fact that the sample average is not exactly 48 inches? Another common way to do hypothesis testing is via a t-test, which uses a t-statistic: t = (x-bar − μ0)/(s/√n), where μ0 is the hypothesized mean (associated with the null hypothesis, H0). Notice that it's the same formula we used above except that the actual population mean is replaced with what the null hypothesis claims to be the mean.
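Here is a minimal Python sketch (not part of the assignment, scipy assumed available) that applies the t-statistic just defined to the yo-yo string-length numbers above (n = 31, sample mean 46.8 in, sample SD 3.5 in, hypothesized mean 48 in):

    import math
    from scipy.stats import t

    n, xbar, s, mu0 = 31, 46.8, 3.5, 48.0
    se = s / math.sqrt(n)

    t_stat = (xbar - mu0) / se
    t_crit = t.ppf(0.975, df=n - 1)          # two-sided test at 95% confidence
    low, high = xbar - t_crit * se, xbar + t_crit * se

    print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}")
    print(f"95% CI: {low:.1f} to {high:.1f} inches")   # roughly the 45.5 to 48.1 given above
    # |t| is less than the critical value and 48 lies inside the interval,
    # so the data do not give sufficient evidence to reject H0.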
Suppose we’re testing this null hypothesis: “The mean yo-yo mass is 95 grams.” A ten-yo-yo sample has a mean of 92.2 g with a standard deviation of 4.6 g. Then t = (92.2 – 95) / (4.6 / 101/2) = -1.71863. The absolute value of t is about 1.719. We then compare this value on the t-table for 9 degrees of freedom (typically using 95% confidence). The table lists a critical value of 2.262. Since our test statistic is less than this, we accept the null hypothesis, which means our data did not supply sufficient evidence to claim that a mean of 95 g is wrong. In other words, a two-sided, 95% confidence interval centered at 92.2 would indeed be large enough to contain the hypothesized mean of 95. Use a t-test to test this H0: “The mean yo-yo diameter is 4.3 cm.” Use the following parameters: a 15 yo-yo sample; sample mean = 4.5 cm; sample standard deviation = 0.3 cm. Decide whether to accept or reject the null hypothesis (with 95% confidence). (Answer: reject H0 since 2.58199 > 2.145.) Notice that the sample mean was only 2 mm larger than the hypothesized mean, but with such a small standard deviation, we expect most of the yo-yos to have diameters right around 4.5 cm, and we cannot say with 95% certainty that real mean is only 4.3 cm. 10. Section 11.4. Hypothesis testing is sort of like accepting the status quo until sufficient evidence is provided to force us to believe otherwise. (The null hypothesis is analogous the status quo.) This is pretty much what is done in science. No well-accepted theory is rejected unless there is very strong evidence against it, that is, unless there is statistically significant evidence against the current theory. In other words, H0 is “innocent until proven guilty.” Accepting the null doesn’t necessarily mean we’re convinced that it is exactly correct, but that we haven’t seen convincing evidence that it is incorrect. In court, the defendant does not have to prove innocence, but if there is not sufficient evidence of his guilt, he is declared “not guilty.” If evidence “beyond a reasonable doubt” is presented against the defendant, the jury would reject the null hypothesis of his innocence, which means he is declared guilty as charged, even though the jury may not be 100% certain of his guilt. Rejecting a null hypothesis means that the evidence (be it sample measurements or court testimony) is not consistent with a claim, and that it is inconsistent to the extent that most likely the claim isn’t true. There are two types of errors that a jury can make: convicting an innocent man (type 1 error), or releasing someone who is guilty (type 2 error). H0 claims that the man is innocent. Convicting an innocent man means rejecting H0 when H0 is accepted H0 is rejected it is true. If the man is guilty, H0 is false. Releasing a guilty man means wrongly accepting H0 is true his innocence (accepting H0 when it’s false). Maybe one way to keep these errors straight H0 is false is that the word accept has two c’s back-to-back, and accepting H when it’s false is a 0 type 2 error. In each cell of the table enter one of the following choices: “Correct,” “Type 1 error,” or “Type 2 error.” The significance level is the probability of making a type 1 error. For example, suppose the defendant is innocent of the crime but the jury decides that it is 95% certain that the man is guilty and convicts him. The jury, then, rejects H0 when it is true, committing a type 1 error. By their figuring, though, there is only a 5% chance of this happening. 
In other words, the jury has the man somewhere outside of their 95% confidence interval. In this situation the significance level is α = 0.05. Suppose Popeye claims he eats, on average, 5 cans of spinach a day. Olive Oyl is skeptical. So, she randomly chooses 8 days from his food record and calculates the sample mean and standard deviation. State H0 and H1. Explain how she could make each type of error. If she concludes with 90% certainty that Popeye is wrong, what is the significance level?

11. Section 11.7. Give me a quick example of a scenario in which a one-sided t-test would be appropriate and explain why.

12. Section 12.2: Check out the histograms on page 389. They're reminiscent of the multiple box-and-whisker plots on page 96. What information do these histograms immediately provide in terms of comparing the two species of iris?

Section 12.3: The iris histograms suggest that one species has smaller petals, but that difference could be attributed to the randomness of the samples, especially since the histograms overlap. There is a statistical method to determine whether the mean petal size of each species is likely to be different in real life and not just different in the samples. As always, H0 is the status quo: there is no difference between the means of the species, i.e., despite the difference in sample means, their population means are the same: μx = μy. The idea is to use statistics to determine how large a difference in sample averages is needed in order to say with 95% confidence that the population means are different. Check out example 12.5 on page 393. What is H0? What does a quick scan of the data suggest might be the case?

13. Instead of testing if a single sample average equals a population mean, we're now interested in testing whether the difference between the averages of two different samples is equal to the difference between the two population means. Example: We're interested in determining whether there is a difference in math ability between senior boys and girls at UHS. We select 100 seniors, 50 boys and 50 girls, at random (a stratified random sample) and invite them to participate in our study. 12 boys and 15 girls agree to participate. (Since they don't all participate, we may have some response bias, but we'll assume it is negligible for the sake of the example.) We give all participants the same math test and score them. The boys' sample average, X-bar, is 73 points, and the boys' sample standard deviation, Sx, is 5 points. For the girls, Y-bar = 76 points, and Sy = 4 points. We want to test the null hypothesis that μx = μy. Let's first assume that both populations (boys and girls) are approximately normally distributed and that the variability of each is known. So, we don't know the μ's but we do know the σ's. Assume σx = 7 and σy = 3 points. Recall earlier that when the population standard deviation was known and we had only one sample, we used the formula Z = (X-bar − μ)/(σ/√n), or equivalently, Z = (X-bar − μ)/sqrt(σ²/n). The corresponding equation for two samples is Z = [(X-bar − Y-bar) − (μx − μy)] / sqrt(σx²/nx + σy²/ny). Note that the sample average is replaced with the difference of the sample averages, and the population mean is replaced with the difference of the population means; also, both population standard deviations and both sample sizes are included in the denominator. Under our null hypothesis, μx − μy = 0, so our test statistic becomes Z = (X-bar − Y-bar)/sqrt(σx²/nx + σy²/ny) = (73 − 76)/sqrt(7²/12 + 3²/15) ≈ −1.386. (We do not need the sample standard deviations since we're assuming we know the population standard deviations.)
Since our test statistic, −1.386, is not less than −1.96 or greater than +1.96, we do not have sufficient evidence that H0 is wrong, so we accept that there is no statistical difference in the population means and, thus, no statistical difference in math ability between boys and girls. The reason for comparing our test stat to 1.96 is that in a standard normal distribution values between −1.96 and +1.96 incorporate 95% of the random values. In other words, our test stat is in the 95% confidence interval. If our test stat had come out to be < −1.96 or > 1.96, we could say that we're 95% certain that there is a difference between boys and girls. How many standard deviations must the test stat be from the mean in order to proclaim with 95% confidence that a difference exists? People who believe that boys are better in math might have done the same study with the same H0, but they might take the alternative hypothesis to be that μx > μy, rather than μx ≠ μy. They'd use the same test stat, −1.386, but they'd do a one-sided test, and they'd only reject H0 at a 95% confidence level if the test stat came out > 1.645. Explain where I got this value and why I say > rather than <.

14. Suppose now that we have the same math data for boys and girls but we don't know the population standard deviations (which is much more realistic). As before, we would replace the σ's with the S's. However, this approximation only works well for large sample sizes. Ours aren't very large, so we'll boost the confidence level to 99%. Compute the test stat and do a two-sided, 99% confidence test of H0. State your conclusion in ordinary language. (Z = −1.69, which is not < −2.576 or > +2.576, so once again there is not sufficient evidence to conclude a statistical difference in the population means. That is, the difference in sample means is not big enough to be considered statistically significant. Therefore, we cannot conclude that a real male/female difference exists and we accept the null hypothesis that there is no difference. Make sure you show your work and explain where the numbers are coming from on the table.)

15. When testing the difference of means of two populations in which the population standard deviations are not known and the samples aren't very large (as in the last problem), we often use a t-test for two samples. The assumptions we're working under are that both populations are approximately normal with about the same variability. The formula we used for one sample was t = (x-bar − μ0)/(s/√n), or equivalently, t = (x-bar − μ0)/(s·sqrt(1/n)). Here the sample standard deviation, s, is an approximation for that of the population. For two samples with averages x-bar and y-bar we use t = (x-bar − y-bar) / [S·sqrt(1/nx + 1/ny)]. Notice that S has no subscript; it's a combination of the sample standard deviations, which are calculated separately. Since our assumption is that both populations have about the same variability, they should both have about the same standard deviation, σ, and we could reasonably approximate it with the sample standard deviation of the x values, sx, or with that of the y values, sy. It wouldn't be right simply to let S be the average of sx and sy, since the samples aren't necessarily the same size. So we use a weighted average for S: S² = [(n1 − 1)sx² + (n2 − 1)sy²] / (n1 + n2 − 2), or equivalently, S² = (df1·sx² + df2·sy²)/(df1 + df2). Notice that S² is just a weighted average of the sample variances: the bigger the sample (and hence its df), the greater the contribution that sample makes to the weighted average.
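Here is a minimal Python sketch (not part of the assignment, scipy assumed available) that reuses the boys/girls math-score summaries from problems 13 and 14 and then evaluates the pooled two-sample t just described; the pooled computation is only an illustration of the formula, not something any problem asks for:

    import math
    from scipy.stats import t

    xbar, sx, nx = 73, 5, 12        # boys
    ybar, sy, ny = 76, 4, 15        # girls

    # Problem 14: replace the sigmas with the sample standard deviations
    z_like = (xbar - ybar) / math.sqrt(sx**2 / nx + sy**2 / ny)
    print("problem 14 statistic:", round(z_like, 2))           # about -1.69

    # Pooled two-sample t (the problem 15 formulas)
    S2 = ((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2)
    t_stat = (xbar - ybar) / math.sqrt(S2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    print("pooled t:", round(t_stat, 2), " 95% critical value:", round(t.ppf(0.975, df), 3))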
Example: Pinocchio and his twin sister Pinocchita have been wondering if their noses have the same sensitivity to telling lies. Pinocchio says he's a better liar, so his nose doesn't grow as much. Their noses do grow when they lie, but the amount of growth varies. They assume that the variation they experience is the same and that nose growth is normally distributed for each. They want to know whether or not the exact same distribution describes both noses. That is, they want to know if they have the same means. To find out, they each tell the same lie over and over and each measures how much his/her nose grows in inches. The table below contains the data in inches.

    Pinocchio, x:  0.8  1.8  1.0  0.1  0.9  1.7  1.0  1.4  0.9  1.2  0.5
    Pinocchita, y: 1.0  0.8  1.6  2.6  1.3  1.1  2.4  1.8  2.5  1.4  1.9  2.0  1.2

State the null hypothesis, mathematically and informally. Do the same for the alternative hypothesis (one-sided). Show that t = −2.81. Explain why the critical value for a significance level of 0.05 is 1.717 (to which we compare the absolute value of t). Therefore, after making this comparison, you can make the statement, "With ___ % certainty I can say that the null hypothesis should be [rejected / accepted]. This means that there is [sufficient / insufficient] evidence to conclude that the means of the two populations are most likely [the same / not the same], and that most likely Pinocchio is [right / wrong] about being a better liar."

16. We've tested hypotheses about the difference in the means of two groups. This can be done for any number of groups with a technique called ANOVA (analysis of variance) with the assumption that all groups are normally distributed with about the same variability. Say we have three groups of people: group A is comprised of 5 men from Atlanta whose average shoe size is a-bar = 10.7 with variance sa² = 0.55; group B is comprised of 4 men from Baton Rouge whose average shoe size is b-bar = 10.3 with variance sb² = 0.28; group C is comprised of 6 men from Columbia whose average shoe size is c-bar = 10.5 with variance sc² = 0.39. We'd like to know if the differences in the sample averages are statistically significant. H0 is that μa = μb = μc and H1 is that they're not all the same. First we find a weighted average of the means:

X-bar = (na·a-bar + nb·b-bar + nc·c-bar) / (na + nb + nc) = [5(10.7) + 4(10.3) + 6(10.5)] / (5 + 4 + 6) = 10.5133

Notice that the weighted average is a little higher than 10.5, which it would have been if the sample sizes were the same. The next step is to do sort of a weighted variance of the averages of the three cities:

S²-between = [na(a-bar − X-bar)² + nb(b-bar − X-bar)² + nc(c-bar − X-bar)²] / (# groups − 1) = [5(10.7 − 10.5133)² + 4(10.3 − 10.5133)² + 6(10.5 − 10.5133)²] / (3 − 1) = 0.178667

Then we calculate a weighted average of the variances of the three groups, just like we did for two groups earlier:

S²-within = (dfa·sa² + dfb·sb² + dfc·sc²) / (dfa + dfb + dfc) = [4(0.55) + 3(0.28) + 5(0.39)] / (4 + 3 + 5) = 0.415833

Review of calculations done so far: X-bar is the weighted average of the sample averages. S²-between is, informally speaking, a "weighted variance of the averages"; it reflects how much variability exists between the groups. If there were no variability between the groups then the sample averages would all be [very different / about the same] and S²-between would be [close to zero / very large] since each term in parentheses in the numerator is [very negative / nearly zero / very positive]. S²-within is, informally speaking, a "weighted average of the variances"; it reflects the average variability within the groups.
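If you'd like to check these numbers by computer, here is a minimal Python sketch (not part of the assignment, scipy assumed available). Note that scipy's exact 5% critical value for 2 and 12 degrees of freedom comes out a bit smaller than the 4.10 read from the table for 10 degrees of freedom in the discussion that follows:

    from scipy.stats import f

    groups = {                      # (n, mean, variance) for each city
        "Atlanta":     (5, 10.7, 0.55),
        "Baton Rouge": (4, 10.3, 0.28),
        "Columbia":    (6, 10.5, 0.39),
    }

    N = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * m for n, m, _ in groups.values()) / N

    s2_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values()) / (len(groups) - 1)
    s2_within = sum((n - 1) * v for n, _, v in groups.values()) / sum(n - 1 for n, _, _ in groups.values())

    F = s2_between / s2_within
    df_between, df_within = len(groups) - 1, N - len(groups)

    print("weighted grand mean:", round(grand_mean, 4))                 # about 10.5133
    print("S^2-between:", round(s2_between, 6), " S^2-within:", round(s2_within, 6))
    print("F:", round(F, 4), " 5% critical value:", round(f.ppf(0.95, df_between, df_within), 2))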
To understand these S's better consider an extreme case. If everyone in Atlanta wore size 8, everyone in Baton Rouge wore size 10, and everyone in Columbia wore size 12, then S²-between would be [large / zero] and S²-within would be [large / zero]. In any case, extreme or not, the bigger S²-between, the more variability there is between the groups and the more likely we are to [accept / reject] H0. Randomness makes it hard to decide how big S²-between should be before we reject H0, so we compare it to S²-within. We do this by defining a test statistic called F: F = S²-between / S²-within. For our shoe size example F = 0.178667 / 0.415833 = 0.429659. If F had come out to be about one, then the variability between the groups was [much < / about the same as / much >] the variability within the groups, which means we should [accept / reject] H0. If F had come out to be very large, then the variability between the groups was [much < / about the same as / much >] the variability within the groups, which means we should [accept / reject] H0. As with other test stats, we use a table to find a critical value (which depends upon degrees of freedom and a significance level) and we compare the test stat to the critical value in order to make a decision regarding H0. To do an F-test we need two degrees of freedom values, one for S²-between, and one for S²-within. S²-between has 2 degrees of freedom (3 groups minus one). S²-within has 12 degrees of freedom (the sum of the df's for the individual groups). Unlike the normal or T distributions, the F distribution is not symmetric (I did my best to draw one; the shape would vary depending on what the degrees of freedom are). The F-stat is always positive, and we just want to know whether the F-stat is beyond the critical value. So, F-tests are always one-sided. As with all probability distributions, the total area under the curve is one. Note that if H0 is true it is very unlikely for the F-stat to be much greater than 1. F tables begin on page 593. At a 5% significance level we use Table D.2 and look for the row with 2 deg. of freedom. There is no column with 12 deg. of freedom, so we'll use the column with 10 deg. of freedom, which gives us a critical value of 4.10. This means we should not reject H0 unless the F-stat is > 4.10. However, our F-stat is only about 0.43, well under the critical value. (The F-stat is also < the critical value in the column with 20 deg. of freedom, so it would be < the critical value for 12 deg. of freedom, if it had been shown.) Thus, with 95% certainty we can conclude that we should accept H0 and believe that there is no difference in the mean shoe size of men in the different cities, i.e., the different means we got from our samples are not statistically significant and are best explained by random variation inherent in the sampling process rather than any real difference among cities. Note that the critical values are large in the upper left part of the table because with only a few small groups, there is more uncertainty, hence the need for the test stat to be bigger before rejecting H0. Explain why the critical value at row infinity and column infinity is one.

17. Section 14.1. Let's look now at a distribution for categorical data. Suppose a guitar manufacturer is planning a new advertising campaign targeting young, hard rock fans. The company hires a marketing analyst to determine which rock guitarist kids like best these days.
The analyst claims that this information is well known: 35% of the kids like Angus Young from AC/DC; 20% like Eddie Van Halen; 18% prefer Jimmy Page from Led Zeppelin; and the rest like various other guitarists best. Not trusting these data, the guitar manufacturer decides to test these hypothesized percentages (H0). This is not quite like anything we've done thus far. To test H0, the guitar guy must conduct a random survey of teen rockers, create a table (below), and calculate yet another test statistic. This test stat is called chi-square ("chi" is a Greek letter pronounced like the first syllable of "kite"). Its symbol is χ². The process is very similar to computing the variance of a set of numbers except, instead of subtracting the mean from each number, we subtract its expected value based on H0. Also, each squared difference is divided by its own expected value. Mathematically, we can say χ² = Σ (Oi − Ei)² / Ei, where the summation on i runs from 1 to n, the number of categories. In this example χ² = 14.611:

    Category   Observed, O   Theoretical Probability   Expected, E   (O − E)²/E
    Angus      10            0.35                      17.5          3.214
    Eddie      6             0.20                      10            1.600
    Jimmy      9             0.18                      9             0
    Other      25            0.27                      13.5          9.796
    Totals     50            1                         50            14.611

About what value would χ² have if the marketing analyst's information were completely dependable? The higher χ², the more likely the marketing guy was wrong, but because of the inherent randomness of the procedure, we must allow for the possibility that the marketing guy was exactly right but that randomness is responsible for χ² being so high. Thus, we must find a cut-off point beyond which we can say with 95% confidence that we can reject H0 and conclude that his percentages were wrong. (The χ² statistic is always positive and its distribution looks very much like that for F above.) We do this by looking up the 0.05 significance level critical value in a χ² table with df = (# categories) − 1 = 3. The table lists a critical value of 7.815. Since our test stat is way beyond this, it is very unlikely that the differences between the expected and observed values can be attributed to randomness. It is much, much more likely that the differences are due to the fact that H0 is wrong. The guitar guy concludes that the marketing guy gave him bogus information and demands his money back. (Note: the conditions required for a χ² test to work well are that the random samples come from a large population and that none of the expected values is extremely small.)

Your turn: An interplanetary commission has been set up to resolve disputes among various planets in our region of the galaxy. The commission is comprised of 15 Earthlings, 21 Klingons, 13 Vulcans, 17 Romulans, and 15 Martians. The commission is supposed to represent each planet equally in terms of its population (sort of like the number of representatives from each state in the U.S. House of Representatives is proportional to each state's population). Here are the planetary populations in billions of beings: Earth 6.1; Klingon 13.8; Vulcan 3.9; Romulus 9.5; Mars 4.3. You're charged with determining whether or not the commission is truly representative. Show that there is not quite (but almost) enough evidence to say with 95% confidence that the commission is unbalanced. That is, you can't say with enough certainty that some planets were intentionally underrepresented. (H0 is that no planet is under- or overrepresented on the commission.) My advice is to create a table in Excel and make Excel do the computations.
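(If you'd rather check the arithmetic in Python than in Excel, here is a minimal sketch of the guitarist example above, scipy assumed available and not part of the assignment; the commission problem follows exactly the same pattern.)

    from scipy.stats import chi2

    observed = {"Angus": 10, "Eddie": 6, "Jimmy": 9, "Other": 25}
    probs    = {"Angus": 0.35, "Eddie": 0.20, "Jimmy": 0.18, "Other": 0.27}
    n = sum(observed.values())

    chi_sq = sum((observed[k] - n * probs[k]) ** 2 / (n * probs[k]) for k in observed)
    df = len(observed) - 1

    print("chi-square:", round(chi_sq, 3))                         # about 14.611
    print("5% critical value:", round(chi2.ppf(0.95, df), 3))      # about 7.815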
To find the theoretical probabilities, use the relative populations, e.g., the probability for Earth is 0.162, since Earthlings represent 16.2% of the interplanetary population. Earth's expected value will come out to about 13.1 members on the commission. This means that if H0 is true and all is fair then Earth should get about 13 delegates. Don't round, because your errors will compound. χ² should come out to be just under the critical value, meaning any higher and you would have had sufficient evidence to claim unfair representation. Instead you're forced to assume that the differences between the actual and expected numbers of representatives could reasonably be due to the fact that whenever a group is chosen or elected some random variation is to be expected.

18. I've saved the best concepts for last: regression and correlation. Some of this you already know; some you most definitely do not. This question will deal with the stuff you already likely know. We've been comparing things like math scores (boys vs. girls), nose growth (Pinocchio vs. his sister), shoe sizes (comparing men in different cities), Popeye's spinach consumption (actual vs. claimed), planetary representation (actual vs. ideal), etc. In all of those cases the comparisons were made between the exact same types of quantities, e.g., boys' average math score vs. girls' average math score. There are times, though, when we'd like to see if a relationship exists between two completely different quantities, such as the amount of rainfall and the yield of a soybean field, or the amount of oxygen dissolved in a body of water and the water's temperature. In math class you've entered lists of data in your graphing calculator and made a scatter plot in order to ascertain whether or not there is a relationship between the two lists of numbers (a correlation). If the points in the scatter plot seemed somewhat to lie on a line, then a linear relationship existed between the two sets of numbers, and you used the calculator to do a linear regression to find the equation of the line that "fits the data" best. The regression line typically didn't go through any of the plotted points, but it did come as close to as many of them as possible. If the regression line had a positive slope, then there was a positive correlation between the two sets of numbers (as one quantity or variable increases, so does the other); a negative slope indicates a negative correlation (as one increases, the other decreases, and vice versa). Furthermore, the equation of this regression line allowed you to make predictions about the relationship beyond what your data showed (called extrapolation) as well as between your data points (interpolation). You may also be aware that there is a number, r, called the "correlation coefficient." If r = 1, you have a perfect positive correlation (all data points lie on, rather than near, the regression line, which has a positive slope). Similarly, if r = −1, you have a perfect negative correlation (all points lie on the regression line, which has a negative slope). |r| ≈ 1 implies a very strong linear relationship (the data points at least come very close to lining up). r ≈ 0 implies no linear relationship between the two quantities exists (perhaps the scatter plot has points all over the place, or perhaps some other sort of relationship exists, like a quadratic or exponential relationship).
As was discussed in a previous module, the exponential relationship y = ab^x can be made linear by taking logs of both sides of the equation: log y = log a + x log b. Thus, if y ≈ ab^x, then log y vs. x should have a correlation coefficient close to one, and the regression line should have a slope of about log b and a y-intercept of about log a. Taking logs of both sides of the power relationship y ≈ ax^n yields: log y ≈ log a + n log x. Thus, there should be a strong correlation between log y and log x. Check out the graphs on pages 538 and 539. About what will the correlation coefficient be if a strong nonlinear relationship exists? What about when a strong linear relation exists for all the data except for one outlier?

19. Now for some of the stuff you may not have learned before. When the calculator cranks out a correlation coefficient and the equation for a linear regression, it's doing a lot of statistics behind the scenes. The formula for the correlation coefficient is given by

r = n(xy-bar − x-bar·y-bar) / [(n − 1)·sx·sy].

There are many equivalent versions of this formula, but I think this is the simplest. Note that xy-bar and x-bar·y-bar are not the same quantities: xy-bar is the average of the products of corresponding x and y values, while x-bar·y-bar is the product of the averages. r doesn't require a derivation, since the above equation is a definition. However, it can be shown that when the slope of the regression line is positive so is r, when the slope is negative so is r, and that −1 ≤ r ≤ 1 always. I won't subject you to the proofs, but let's do a couple of demonstrations. Let's deal with three points that have a perfect correlation: (1, 10), (2, 20), and (3, 30). Show that r comes out to be exactly one in this case. (You should get 140/3 for xy-bar.) If I change the point (3, 30) to (3, 10), now we've got an "inverted V" for a scatter plot. Clearly, there is no correlation now between y and x. Show that r is zero in this case. (It is sufficient to show that the numerator is zero.)

20. Last question of the whole course! As a special treat, I've prepared a cool spreadsheet to help you understand the method of least squares, which is the method typically used to find regression lines. As you know, a regression line is the line that best fits the data. (It is sometimes called a line of best fit or a trend line.) Let's first make the reasonable assumption that the regression line goes through the point (x-bar, y-bar), which is the large red dot in my spreadsheet. We start with any old line that goes through this point. The idea behind finding the slope of the regression equation is to look at the residues. Residues are a measure of how far above or below the line drawn through (x-bar, y-bar) the actual data points are. Residues can be positive or negative (see page 548). I set up the spreadsheet so that negative residues are blue and positive residues are red. Change the slope of the line using the up/down arrows or the scroll bar. Notice that the graph changes automatically, as do the residues. It's not hard to change the slope to get a decent fit, but we want the best fit, not just a decent fit. To do this we monitor the sum of the squares of the residues in the large, bold font. (As in computing a variance, squaring makes everything positive. We could use absolute values and get a similar result. Squaring, though, lends itself to mathematical manipulation more so than does taking absolute values.)
The object is to find the slope that minimizes the sum of the squares of the residues, thereby finding the slope of the line that comes as close to as many points as possible (close in the sense that the sum of the squares of the distances from the points couldn't get any smaller). For the x and y values given, find this slope. (I set it up so that you can change the x and y values if you like.) Of course now it's easy to find the regression equation since you've got a point and the slope. It's displayed on the graph. By the way, if you're interested in how I created the spreadsheet, unprotect the sheet (Tools menu). Then you can click on different cells to see what formulae they contain. As much as you like spreadsheets, I think you would agree that there must be a more efficient way to find regression equations. Naturally, there is a formula for it:

y − y-bar = r·(sy/sx)·(x − x-bar).

The derivation is quite long and complicated, involving lots of summations and some multivariate calculus (which none of you have had yet). Nevertheless, I would be remiss in my duties were I not at least to outline the main idea, and I would be more than happy to go through the details with anyone with the time and interest to do so. First off, note that, if you believe the above equation is true, it shows that (x-bar, y-bar) is on the regression line. Explain why it also shows that if the y values are more spread out than the x values, then the slope will be steeper. Here's the main idea behind the derivation. Let the equation of the regression line be y = mx + b. So, if (xi, yi) is a data point, this line might be above or below this point. The y value of the line at this x value is mxi + b, and the residue is yi − y = yi − (mxi + b). Explain how yi and y differ in meaning. Let the sum of the squares of the residues be given by H = Σ(yi − y)² = Σ(yi − mxi − b)². H is a function of two variables, m and b. Recall that when we have a function of one variable, say y = f(x), to minimize or maximize y, we take its derivative and set it equal to zero (to find where the tangent line is horizontal). That is, we solve f'(x) = 0 for x, and these are the x values where y can have its peaks and valleys. Well, with a function of two variables like H, we would have to take a derivative with respect to m and another derivative with respect to b. The derivative with respect to m is a measure of how quickly H changes as m changes while b is held constant. How would you describe the derivative of H with respect to b? Once we have two derivatives equal to zero, those equations are solved simultaneously, yielding the formula above.

Keep in mind that when two quantities, x and y, are correlated, this may be because x causes y, because y causes x, because something else causes both of them, or there may be no causation at all. For example, let's assume there is a positive correlation between intelligence as measured by IQ and chess ability. This may be because it is necessary to be smart to excel at chess. It won't be a perfect correlation since some smart people don't know how to play. Some may say that being smart causes one to be good at chess. Someone else might claim that playing chess enough to get good at it causes one's IQ to rise. Yet another person might argue that scoring high on an IQ test and being a good chess player are both caused by having a good memory. An example in which there is no causation at all is the following.
There is a strong correlation between my going for a walk and my dog going for a walk, since I often take her with me when I walk. It would be wrong, though, for me to claim that my walking causes her walking. These ideas are pertinent to research, especially medical research. Let's say we're trying to determine whether high blood pressure causes heart disease. One way to do this is by gathering data and plotting incidence of heart disease vs. blood pressure. Say we find a fairly strong correlation with a coefficient of 0.93. This tells us that many more people with high blood pressure have heart attacks than people with low blood pressure, but it doesn't necessarily mean that high b.p. causes heart attacks. It might, but it could be that a third factor, like smoking or a high fat diet, causes both high b.p. and heart attacks. A study like this definitely establishes a link between the two, and it does provide a legitimate reason for people with high b.p. to do what they can to lower it. The action they take to lower their b.p., or the lower b.p. itself, may prevent a heart attack. Causality aside, how large does the correlation coefficient have to be in order to be confident that a correlation between two quantities really does exist? (rhetorical question) For example, it might not be clear whether or not researchers should claim there's a link between heart attacks and high b.p. if r were only 0.75. So, we do as we've done before: we test the null hypothesis, which states that there is no correlation between the two, meaning they're independent parameters (unrelated to one another). Stated concisely, H0 is that r = 0. There is a symmetric distribution for r that looks fairly normal. The distribution depends on the sample size, just as the distributions for T, F, and χ² do. A table is on page 543. Suppose our study was done on 80 people. The critical value at the 5% significance level is 0.2199. So, even if r is only 0.75, we have sufficient evidence to reject H0, and we can still say with 95% confidence that high b.p. is indeed associated with heart attacks. Notice in the table that as the size of the study increases, the critical values go down. This means that with a larger study we don't demand as high a correlation coefficient. This is because with a small study, random variation could more easily be responsible for a high r value even when there is no real correlation.

Say you gather data on Olympic swimming records. You plot winning times for a particular race vs. the year for 8 recent Olympics, you do a regression analysis, and you find the correlation coefficient to be −0.84. Why would r be negative? You want to determine whether or not there really is an association between winning times and the year. What is the null hypothesis? Can you claim with 99% certainty that an association really does exist? What about with 99.9% certainty?
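As a closing illustration (not part of any problem), here is a minimal Python sketch tying problems 19 and 20 together: it evaluates the correlation formula from problem 19, finds the least-squares slope by brute force the way the spreadsheet does, and checks it against the r·(sy/sx) slope formula. The five data points are made up purely for illustration.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # made-up data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    n = len(x)
    sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)
    r = n * (np.mean(x * y) - x.mean() * y.mean()) / ((n - 1) * sx * sy)

    def sum_sq_residues(m):
        # line through (x-bar, y-bar) with slope m, as in the spreadsheet
        line = y.mean() + m * (x - x.mean())
        return np.sum((y - line) ** 2)

    slopes = np.linspace(-5, 5, 20001)
    best = slopes[np.argmin([sum_sq_residues(m) for m in slopes])]

    print("r:", round(r, 4))
    print("slope from brute-force search:", round(best, 4), "  r*sy/sx:", round(r * sy / sx, 4))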