Chapter 10
Hypothesis Testing Using a Single Sample

In Chapter 9, we considered situations in which the primary goal was to estimate the unknown value of some population characteristic. Sample data can also be used to decide whether some claim or hypothesis about a population characteristic is plausible. For example, cross-border purchasing of prescription drugs is a controversial topic of current interest, and there has been a great deal of media coverage on the practice of importing prescription medications from Canada or Mexico. But is this really a common practice? The article “Much Ado About Cross-Border Prescription Purchasing” (Ipsos, February 19, 2004) summarized the results of a poll of 1000 randomly selected adult Americans and reported that only 15 of the 750 adults in the sample who had purchased prescription drugs in the past year had made a purchase from a pharmacy in Canada or Mexico. Let p denote the proportion of all American adults who have made a prescription drug purchase in the last year from a Canadian or Mexican pharmacy. The hypothesis testing methods presented in this chapter can be used to decide whether the sample data from this survey provide strong support for the hypothesis that this proportion is small, for example p < .05.

As another example, a report released by the National Association of Colleges and Employers stated that the average starting salary for students graduating in 2006 with a degree in accounting was $45,656 (“Starting Salary Offers to New College Grads Continue to Climb,” July 12, 2006, available at www.naceweb.org/press). Suppose that you are interested in investigating whether the mean starting salary for students graduating with an accounting degree from your university this year is greater than the 2006 average of $45,656. You select a random sample of n = 40 accounting graduates from the current graduating class of your university and determine the starting salary of each one. If this sample produced a mean starting salary of $45,958 and a standard deviation of $1214, is it reasonable to conclude that μ, the current mean starting salary for all accounting graduates in the current graduating class at your university, is greater than $45,656 (i.e., μ > 45,656)? We will see in this chapter how these sample data can be analyzed to decide whether μ > 45,656 is a reasonable conclusion.

10.1 Hypotheses and Test Procedures

A hypothesis is a claim or statement about the value of a single population characteristic or the values of several population characteristics. The following are examples of legitimate hypotheses:

μ = 1000, where μ is the mean number of characters in an email message
p < .01, where p is the proportion of email messages that are undeliverable

In contrast, the statements x̄ = 1000 and p̂ = .01 are not hypotheses, because x̄ and p̂ are sample characteristics.

A test of hypotheses or test procedure is a method that uses sample data to decide between two competing claims (hypotheses) about a population characteristic. One hypothesis might be μ = 1000 and the other μ ≠ 1000, or one hypothesis might be p = .01 and the other p < .01.
If it were possible to carry out a census of the entire population, we would know which of the two hypotheses is correct, but usually we must decide between them using information from a sample.

A criminal trial is a familiar situation in which a choice between two contradictory claims must be made. The person accused of the crime must be judged either guilty or not guilty. Under the U.S. system of justice, the individual on trial is initially presumed not guilty. Only strong evidence to the contrary causes the not guilty claim to be rejected in favor of a guilty verdict. The burden is thus put on the prosecution to prove the guilty claim. The French perspective in criminal proceedings is the opposite of ours. There, once enough evidence has been presented to justify bringing an individual to trial, the initial assumption is that the accused is guilty. The burden of proof then falls on the accused to establish otherwise.

As in a judicial proceeding, we initially assume that a particular hypothesis, called the null hypothesis, is the correct one. We then consider the evidence (the sample data) and reject the null hypothesis in favor of the competing hypothesis, called the alternative hypothesis, only if there is convincing evidence against the null hypothesis.

DEFINITION
The null hypothesis, denoted by H0, is a claim about a population characteristic that is initially assumed to be true.
The alternative hypothesis, denoted by Ha, is the competing claim.
In carrying out a test of H0 versus Ha, the hypothesis H0 will be rejected in favor of Ha only if sample evidence strongly suggests that H0 is false. If the sample does not provide such evidence, H0 will not be rejected. The two possible conclusions are then reject H0 or fail to reject H0.
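To make "sample evidence strongly suggests that H0 is false" more concrete, the following is a minimal Python sketch using the survey figures from the chapter opening (15 of 750). It is not the formal test procedure of this chapter. It assumes the 750 purchasers can be treated as a random sample with a binomial count, takes the boundary value p = .05 as the null value, and uses a hypothetical helper name, `binomial_lower_tail`. It asks how likely a count this small would be if p really were .05.

```python
from math import comb

def binomial_lower_tail(k_max, n, p):
    """Exact P(X <= k_max) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_max + 1))

# Figures from the chapter opening: 15 of the 750 sampled purchasers
# had bought from a Canadian or Mexican pharmacy.
n, observed = 750, 15
p_null = 0.05  # assumption: boundary of the claim "p < .05", used as H0: p = .05

tail = binomial_lower_tail(observed, n, p_null)
print(f"expected count if p = .05: {n * p_null:.1f}")
print(f"P(count <= {observed} | p = .05) = {tail:.2e}")
```

A very small tail probability means that a count this low would be surprising if H0 were true, which is the kind of evidence that leads to rejecting H0. A tail probability that is not small would only mean failing to reject H0.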
Example 10.1 Tennis Ball Diameters

Because of variation in the manufacturing process, tennis balls produced by a particular machine do not have identical diameters. Let μ denote the true average diameter for tennis balls currently being produced. Suppose that the machine was initially calibrated to achieve the design specification μ = 3 in. However, the manufacturer is now concerned that the diameters no longer conform to this specification. That is, μ ≠ 3 in. must now be considered a possibility. If sample evidence suggests that μ ≠ 3 in., the production process will have to be halted while the machine is recalibrated. Because stopping production is costly, the manufacturer wants to be quite sure that μ ≠ 3 in. before undertaking recalibration. Under these circumstances, a sensible choice of hypotheses is

H0: μ = 3 (the specification is being met, so recalibration is unnecessary)
Ha: μ ≠ 3 (the specification is not being met, so recalibration is necessary)

Only compelling sample evidence would then result in H0 being rejected in favor of Ha. ■

Example 10.2 Lightbulb Lifetimes

Kmart brand 60-W lightbulbs state on the package “Avg. Life 1000 Hr.” Let μ denote the true mean life of Kmart 60-W lightbulbs. Then the advertised claim is μ = 1000 hr. People who purchase this brand would be unhappy if μ is actually less than the advertised value. Suppose that a sample of Kmart lightbulbs is selected and the lifetime for each bulb in the sample is recorded. The sample results can then be used to test the hypothesis μ = 1000 hr against the hypothesis μ < 1000 hr. The accusation that the company is overstating the mean lifetime is a serious one, and it is reasonable to require compelling evidence from the sample before concluding that μ < 1000.
This suggests that the claim μ = 1000 should be selected as the null hypothesis and that μ < 1000 should be selected as the alternative hypothesis. Then H0: μ = 1000 would be rejected in favor of Ha: μ < 1000 only if sample evidence strongly suggests that the initial assumption, μ = 1000 hr, is not plausible. ■

Because the alternative hypothesis in Example 10.2 asserted that μ < 1000 (true average lifetime is less than the advertised value), it might have seemed sensible to state H0 as the inequality μ ≥ 1000. The assertion μ ≥ 1000 is in fact the implicit null hypothesis, but we will state H0 explicitly as a claim of equality. There are several reasons for this. First of all, the development of a decision rule is most easily understood if there is only a single hypothesized value of μ (or p or whatever other population characteristic is under consideration). Second, suppose that the sample data provided compelling evidence that H0: μ = 1000 should be rejected in favor of Ha: μ < 1000. This means that we were convinced by the sample data that the true mean was smaller than 1000. It follows that we would have also been convinced that the true mean could not have been 1001 or 1010 or any other value that was larger than 1000. As a consequence, the conclusion when testing H0: μ = 1000 versus Ha: μ < 1000 is always the same as the conclusion for a test where the null hypothesis is H0: μ ≥ 1000. For these reasons it is customary to state the null hypothesis H0 as a claim of equality.

The form of a null hypothesis is

H0: population characteristic = hypothesized value

where the hypothesized value is a specific number determined by the problem context.
The alternative hypothesis will have one of the following three forms:

Ha: population characteristic > hypothesized value
Ha: population characteristic < hypothesized value
Ha: population characteristic ≠ hypothesized value

Thus, we might test H0: p = .1 versus Ha: p < .1; but we won’t test H0: μ = 50 versus Ha: μ > 100. The number appearing in the alternative hypothesis must be identical to the hypothesized value in H0.

Example 10.3 illustrates how the selection of H0 (the claim initially believed true) and Ha depends on the objectives of a study.

Example 10.3 Evaluating a New Medical Treatment

A medical research team has been given the task of evaluating a new laser treatment for certain types of tumors. Consider the following two scenarios:

Scenario 1: The current standard treatment is considered reasonable and safe by the medical community, has no major side effects, and has a known success rate of 0.85 (85%).
Scenario 2: The current standard treatment sometimes has serious side effects, is costly, and has a known success rate of 0.30 (30%).

In the first scenario, research efforts would probably be directed toward determining whether the new treatment has a higher success rate than the standard treatment. Unless convincing evidence of this is presented, it is unlikely that current medical practice would be changed. With p representing the true proportion of successes for the laser treatment, the following hypotheses would be tested:

H0: p = .85 versus Ha: p > .85

In this case, rejection of the null hypothesis is indicative of compelling evidence that the success rate is higher for the new treatment.

In the second scenario, the current standard treatment does not have much to recommend it.
The new laser treatment may be considered preferable because of cost or because it has fewer or less serious side effects, as long as the success rate for the new procedure is no worse than that of the standard treatment. Here, researchers might decide to test the hypothesis

H0: p = .30 versus Ha: p < .30

If the null hypothesis is rejected, the new treatment will not be put forward as an alternative to the standard treatment, because there is strong evidence that the laser method has a lower success rate. If the null hypothesis is not rejected, we are able to conclude only that there is not convincing evidence that the success rate for the laser treatment is lower than that for the standard. This is not the same as saying that we have evidence that the laser treatment is as good as the standard treatment. If medical practice were to embrace the new procedure, it would not be because it has a higher success rate but rather because it costs less or has fewer side effects, and there is not strong evidence that it has a lower success rate than the standard treatment. ■

You should be careful in setting up the hypotheses for a test. Remember that a statistical hypothesis test is only capable of demonstrating strong support for the alternative hypothesis (by rejection of the null hypothesis). When the null hypothesis is not rejected, it does not mean strong support for H0, only lack of strong evidence against it. In the lightbulb scenario of Example 10.2, if H0: μ = 1000 is rejected in favor of Ha: μ < 1000, it is because we have strong evidence for believing that true average lifetime is less than the advertised value. However, nonrejection of H0 does not necessarily provide strong support for the advertised claim. If the objective is to demonstrate that the average lifetime is greater than 1000 hr, the hypotheses to be tested are H0: μ = 1000 versus Ha: μ > 1000.
Now rejection of H0 indicates strong evidence that μ > 1000. When deciding which alternative hypothesis to use, keep the research objectives in mind.

Exercises 10.1–10.11

10.1 Explain why the statement x̄ = 50 is not a legitimate hypothesis.

10.2 For the following pairs, indicate which do not comply with the rules for setting up hypotheses, and explain why:
a. H0: μ 15, Ha: μ 15
b. H0: p .4, Ha: p .6
c. H0: μ 123, Ha: μ 123
d. H0: μ 123, Ha: μ 125
e. H0: p .1, Ha: p .1

10.3 To determine whether the pipe welds in a nuclear power plant meet specifications, a random sample of welds is selected and tests are conducted on each weld in the sample. Weld strength is measured as the force required to break the weld. Suppose that the specifications state that the mean strength of welds should exceed 100 lb/in.². The inspection team decides to test H0: μ = 100 versus Ha: μ > 100. Explain why this alternative hypothesis was chosen rather than μ < 100.

10.4 Do state laws that allow private citizens to carry concealed weapons result in a reduced crime rate? The author of a study carried out by the Brookings Institution is reported as saying, “The strongest thing I could say is that I don’t see any strong evidence that they are reducing crime” (San Luis Obispo Tribune, January 23, 2003).
a. Is this conclusion consistent with testing
H0: concealed weapons laws reduce crime versus Ha: concealed weapons laws do not reduce crime
or with testing
H0: concealed weapons laws do not reduce crime versus Ha: concealed weapons laws reduce crime
Explain.
b. Does the stated conclusion indicate that the null hypothesis was rejected or not rejected? Explain.
10.5 Consider the following quote from the article “Review Finds No Link Between Vaccine and Autism” (San Luis Obispo Tribune, October 19, 2005): “ ‘We found no evidence that giving MMR causes Crohn’s disease and/or autism in the children that get the MMR,’ said Tom Jefferson, one of the authors of The Cochrane Review. ‘That does not mean it doesn’t cause it. It means we could find no evidence of it.’” (MMR is a measles-mumps-rubella vaccine.) In the context of a hypothesis test with the null hypothesis being that MMR does not cause autism, explain why the author could not just conclude that the MMR vaccine does not cause autism.

10.6 A certain university has decided to introduce the use of plus and minus with letter grades, as long as there is evidence that more than 60% of the faculty favor the change. A random sample of faculty will be selected, and the resulting data will be used to test the relevant hypotheses. If p represents the true proportion of all faculty that favor a change to plus–minus grading, which of the following pairs of hypotheses should the administration test:
H0: p = .6 versus Ha: p < .6
or
H0: p = .6 versus Ha: p > .6
Explain your choice.

10.7 A certain television station has been providing live coverage of a particularly sensational criminal trial. The station’s program director wishes to know whether more than half the potential viewers prefer a return to regular daytime programming. A survey of randomly selected viewers is conducted. Let p represent the true proportion of viewers who prefer regular daytime programming. What hypotheses should the program director test to answer the question of interest?

10.8 Researchers have postulated that because of differences in diet, Japanese children have a lower mean blood cholesterol level than U.S. children do. Suppose that the mean level for U.S. children is known to be 170. Let μ represent the true mean blood cholesterol level for Japanese children.
What hypotheses should the researchers test?

10.9 A county commissioner must vote on a resolution that would commit substantial resources to the construction of a sewer in an outlying residential area. Her fiscal decisions have been criticized in the past, so she decides to take a survey of constituents to find out whether they favor spending money for a sewer system. She will vote to appropriate funds only if she can be fairly certain that a majority of the people in her district favor the measure. What hypotheses should she test?

10.10 The mean length of long-distance telephone calls placed with a particular phone company was known to be 7.3 min under an old rate structure. In an attempt to be more competitive with other long-distance carriers, the phone company lowered long-distance rates, thinking that its customers would be encouraged to make longer calls and thus that there would not be a big loss in revenue. Let μ denote the true mean length of long-distance calls after the rate reduction. What hypotheses should the phone company test to determine whether the mean length of long-distance calls increased with the lower rates?

10.11 Many older homes have electrical systems that use fuses rather than circuit breakers. A manufacturer of 40-amp fuses wants to make sure that the mean amperage at which its fuses burn out is in fact 40. If the mean amperage is lower than 40, customers will complain because the fuses require replacement too often. If the mean amperage is higher than 40, the manufacturer might be liable for damage to an electrical system as a result of fuse malfunction. To verify the mean amperage of the fuses, a sample of fuses is selected and tested. If a hypothesis test is performed using the resulting data, what null and alternative hypotheses would be of interest to the manufacturer?
10.2 Errors in Hypothesis Testing

Once hypotheses have been formulated, we employ a method called a test procedure to use sample data to determine whether H0 should be rejected. Just as a jury may reach the wrong verdict in a trial, there is some chance that using a test procedure with sample data may lead us to the wrong conclusion about a population characteristic. In this section, we discuss the kinds of errors that can occur and consider how the choice of a test procedure influences the chances of these errors.

One erroneous conclusion in a criminal trial is for a jury to convict an innocent person, and another is for a guilty person to be set free. Similarly, there are two different types of errors that might be made when making a decision in a hypothesis testing problem. One type of error involves rejecting H0 even though the null hypothesis is true. The second type of error results from failing to reject H0 when it is false. These errors are known as Type I and Type II errors, respectively.

DEFINITION
Type I error: the error of rejecting H0 when H0 is true
Type II error: the error of failing to reject H0 when H0 is false

The only way to guarantee that neither type of error occurs is to base the decision on a census of the entire population. The risk of error is the price researchers pay for basing an inference on a sample. With any reasonable sample-based procedure, there is some chance that a Type I error will be made and some chance that a Type II error will result.

Example 10.4 On-Time Arrivals

The U.S.
Bureau of Transportation Statistics reports that for 2005, 78.6% of all domestic passenger flights arrived on time (meaning within 15 min of the scheduled arrival). Suppose that an airline with a poor on-time record decides to offer its employees a bonus if, in an upcoming month, the airline’s proportion of on-time flights exceeds the overall industry rate of 0.786. Let p be the true proportion of the airline’s flights that are on time during the month of interest. A random sample of flights might be selected and used as a basis for choosing between

H0: p = .786 and Ha: p > .786

In this context, a Type I error (rejecting a true H0) results in the airline rewarding its employees when in fact their true proportion of on-time flights did not exceed .786. A Type II error (not rejecting a false H0) results in the airline employees not receiving a reward that in fact they deserved. ■

Example 10.5 Slowing the Growth of Tumors

In 2004, Vertex Pharmaceuticals, a biotechnology company, issued a press release announcing that it had filed an application with the Food and Drug Administration to begin clinical trials of an experimental drug VX-680 that had been found to reduce the growth rate of pancreatic and colon cancer tumors in animal studies (New York Times, February 24, 2004). Let μ denote the true mean growth rate of tumors for patients receiving the experimental drug.
Data resulting from the planned clinical trials can be used to test

H0: μ = mean growth rate of tumors for patients not taking the experimental drug
versus
Ha: μ < mean growth rate of tumors for patients not taking the experimental drug

The null hypothesis states that the experimental drug is not effective, that is, that the mean growth rate of tumors for patients receiving the experimental drug is the same as for patients who do not take the experimental drug. The alternative hypothesis states that the experimental drug is effective in reducing the mean growth rate of tumors. In this context, a Type I error consists of incorrectly concluding that the experimental drug is effective in slowing the growth rate of tumors. A Type II error consists of concluding that the experimental drug is ineffective when in fact the mean growth rate of tumors is reduced. ■

Examples 10.4 and 10.5 illustrate the two different types of error that might occur when testing hypotheses. Type I and Type II errors, and the associated consequences of making such errors, are quite different. The accompanying box introduces the terminology and notation used to describe error probabilities.

DEFINITION
The probability of a Type I error is denoted by α and is called the level of significance of the test. Thus, a test with α = .01 is said to have a level of significance of .01 or to be a level .01 test.
The probability of a Type II error is denoted by β.

Example 10.6 Blood Test for Ovarian Cancer

Women with ovarian cancer usually are not diagnosed until the disease is in an advanced stage, when it is most difficult to treat. A blood test has been developed that appears to be able to identify ovarian cancer at its earliest stages.
In a report issued by the National Cancer Institute and the Food and Drug Administration (February 8, 2002), the following information from a preliminary evaluation of the blood test was given:

■ The test was given to 50 women known to have ovarian cancer, and it correctly identified all of them as having cancer.
■ The test was given to 66 women known not to have ovarian cancer, and it correctly identified 63 of these 66 as being cancer free.

We can think of using this blood test to choose between two hypotheses:

H0: woman has ovarian cancer
Ha: woman does not have ovarian cancer

Note that although these are not “statistical hypotheses” (statements about a population characteristic), the possible decision errors are analogous to Type I and Type II errors. In this situation, believing that a woman with ovarian cancer is cancer free would be a Type I error, that is, rejecting the hypothesis of ovarian cancer when it is in fact true. Believing that a woman who is actually cancer free does have ovarian cancer is a Type II error, that is, not rejecting the null hypothesis when it is in fact false. Based on the preliminary study results, we can estimate the error probabilities. The probability of a Type I error, α, is approximately 0/50 = 0. The probability of a Type II error, β, is approximately 3/66 ≈ .045. ■

The ideal test procedure would result in both α = 0 and β = 0. However, if we must base our decision on incomplete information (a sample rather than a census), it is impossible to achieve this ideal. The standard test procedures allow us to control α, but they provide no direct control over β. Because α represents the probability of rejecting a true null hypothesis, selecting a significance level α = .05 results in a test procedure that, used over and over with different samples, rejects a true H0 about 5 times in 100. Selecting α = .01 results in a test procedure with a Type I error rate of 1% in long-term repeated use.
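The long-run reading of α described above can be checked by simulation. The Python sketch below is illustrative only. It assumes a normal population with known standard deviation and uses a simple upper-tailed z test as a stand-in test procedure (this passage does not specify one). All numeric values (`mu0`, `sigma`, `n`, the true mean used for the Type II estimate) and the function names are hypothetical.

```python
import random
from statistics import NormalDist

def rejects_h0(sample, mu0, sigma, alpha):
    """Upper-tailed z test of H0: mu = mu0 vs Ha: mu > mu0 (sigma assumed known)."""
    xbar = sum(sample) / len(sample)
    z = (xbar - mu0) / (sigma / len(sample) ** 0.5)
    return z > NormalDist().inv_cdf(1 - alpha)

def rejection_rate(true_mu, mu0, sigma, n, alpha, trials, rng):
    """Fraction of simulated samples for which H0 is rejected."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        hits += rejects_h0(sample, mu0, sigma, alpha)
    return hits / trials

rng = random.Random(1)
mu0, sigma, n, trials = 100.0, 10.0, 25, 5000  # hypothetical settings

results = {}
for alpha in (0.05, 0.01):
    # H0 true: every rejection is a Type I error.
    type1 = rejection_rate(mu0, mu0, sigma, n, alpha, trials, rng)
    # H0 false (true mean 104): every non-rejection is a Type II error.
    type2 = 1 - rejection_rate(mu0 + 4, mu0, sigma, n, alpha, trials, rng)
    results[alpha] = (type1, type2)
    print(f"alpha={alpha}: Type I rate ~ {type1:.3f}, Type II rate ~ {type2:.3f}")
```

Under these assumptions the simulated Type I rate should sit close to the chosen α, while the Type II rate depends on the true mean and is not set directly by the user.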
Choosing a small value for α implies that the user wants to use a procedure for which the risk of a Type I error is quite small.

One question arises naturally at this point: If we can select α, the probability of making a Type I error, why would we ever select α = .05 rather than α = .01? Why not always select a very small value for α? To achieve a small probability of making a Type I error, we would need the corresponding test procedure to require the evidence against H0 to be very strong before the null hypothesis can be rejected. Although this makes a Type I error unlikely, it increases the risk of a Type II error (not rejecting H0 when it should have been rejected). Frequently the investigator must balance the consequences of Type I and Type II errors. If a Type II error has serious consequences, it may be a good idea to select a somewhat larger value for α.

In general, there is a compromise between small α and small β, leading to the following widely accepted principle for specifying a test procedure.

After assessing the consequences of Type I and Type II errors, identify the largest α that is tolerable for the problem. Then employ a test procedure that uses this maximum acceptable value (rather than anything smaller) as the level of significance, because using a smaller α increases β. In other words, don’t make α smaller than it needs to be.

Example 10.7 Lead in Tap Water

In 1991, the Environmental Protection Agency (EPA) adopted what is known as the Lead and Copper Rule, which defines drinking water as unsafe if the concentration of lead is 15 parts per billion (ppb) or greater or if the concentration of copper is 1.3 parts per million (ppm) or greater.
The “2003 National Public Water Systems Compliance Report” (EPA, September 2005) indicates that 6% of public water systems reported violation of a health-based drinking water standard in 2003 and that 5% of these were violations of the Lead and Copper Rule. With μ denoting the mean concentration of lead, a water system monitoring lead levels might use lead level measurements from a sample of water specimens to test

H0: μ = 15 versus Ha: μ < 15

The null hypothesis (which is equivalent to the assertion μ ≥ 15) states that the mean lead concentration is excessive by EPA standards. The alternative hypothesis states that the mean lead concentration is at an acceptable level and that the water system meets EPA standards for lead.

In this context, a Type I error leads to the conclusion that a water source meets EPA standards for lead when in fact it does not. Possible consequences of this type of error include health risks associated with excessive lead consumption (e.g., increased blood pressure, hearing loss, and, in severe cases, anemia and kidney damage). A Type II error is to conclude that the water does not meet EPA standards for lead when in fact it actually does. Possible consequences of a Type II error include elimination of a community water source. Because a Type I error might result in potentially serious public health risks, a small value of α (Type I error probability), such as α = .01, could be selected. Of course, selecting a small value for α increases the risk of a Type II error. If the community has only one water source, a Type II error could also have very serious consequences for the community, and we might want to rethink our choice of α. ■

Exercises 10.12–10.22
10.12 Researchers at the University of Washington and Harvard University analyzed records of breast cancer screening and diagnostic evaluations (“Mammogram Cancer Scares More Frequent than Thought,” USA Today, April 16, 1998). Discussing the benefits and downsides of the screening process, the article states that, although the rate of false-positives is higher than previously thought, if radiologists were less aggressive in following up on suspicious tests, the rate of false-positives would fall but the rate of missed cancers would rise. Suppose that such a screening test is used to decide between a null hypothesis of H0: no cancer is present and an alternative hypothesis of Ha: cancer is present. (Although these are not hypotheses about a population characteristic, this exercise illustrates the definitions of Type I and Type II errors.)
a. Would a false-positive (thinking that cancer is present when in fact it is not) be a Type I error or a Type II error?
b. Describe a Type I error in the context of this problem, and discuss the consequences of making a Type I error.
c. Describe a Type II error in the context of this problem, and discuss the consequences of making a Type II error.
d. What aspect of the relationship between the probability of Type I and Type II errors is being described by the statement in the article that if radiologists were less aggressive in following up on suspicious tests, the rate of false-positives would fall but the rate of missed cancers would rise?

10.13 Medical personnel are required to report suspected cases of child abuse.
Because some diseases have symptoms that mimic those of child abuse, doctors who see a child with these symptoms must decide between two competing hypotheses:

$H_0$: symptoms are due to child abuse
$H_a$: symptoms are due to disease

(Although these are not hypotheses about a population characteristic, this exercise illustrates the definitions of Type I and Type II errors.) The article “Blurred Line Between Illness, Abuse Creates Problem for Authorities” (Macon Telegraph, February 28, 2000) included the following quote from a doctor in Atlanta regarding the consequences of making an incorrect decision: “If it’s disease, the worst you have is an angry family. If it is abuse, the other kids (in the family) are in deadly danger.”
a. For the given hypotheses, describe Type I and Type II errors.
b. Based on the quote regarding consequences of the two kinds of error, which type of error does the doctor quoted consider more serious? Explain.

10.14 Ann Landers, in her advice column of October 24, 1994 (San Luis Obispo Telegram-Tribune), described the reliability of DNA paternity testing as follows: “To get a completely accurate result, you would have to be tested, and so would (the man) and your mother. The test is 100 percent accurate if the man is not the father and 99.9 percent accurate if he is.”
a. Consider using the results of DNA paternity testing to decide between the following two hypotheses:

$H_0$: a particular man is the father
$H_a$: a particular man is not the father

In the context of this problem, describe Type I and Type II errors. (Although these are not hypotheses about a population characteristic, this exercise illustrates the definitions of Type I and Type II errors.)
b. Based on the information given, what are the values of $\alpha$, the probability of Type I error, and $\beta$, the probability of Type II error?
c.
Ann Landers also stated, “If the mother is not tested, there is a 0.8 percent chance of a false positive.” For the hypotheses given in Part (a), what are the values of $\alpha$ and $\beta$ if the decision is based on DNA testing in which the mother is not tested?

10.15 Pizza Hut, after test-marketing a new product called the Bigfoot Pizza, concluded that introduction of the Bigfoot nationwide would increase its sales by more than 14% (USA Today, April 2, 1993). This conclusion was based on recording sales information for a random sample of Pizza Hut restaurants selected for the marketing trial. With $\mu$ denoting the mean percentage increase in sales for all Pizza Hut restaurants, consider using the sample data to decide between $H_0$: $\mu = 14$ and $H_a$: $\mu > 14$.
a. Is Pizza Hut’s conclusion consistent with a decision to reject $H_0$ or to fail to reject $H_0$?
b. If Pizza Hut is incorrect in its conclusion, is the company making a Type I or a Type II error?

10.16 A television manufacturer claims that (at least) 90% of its TV sets will need no service during the first 3 years of operation. A consumer agency wishes to check this claim, so it obtains a random sample of $n = 100$ purchasers and asks each whether the set purchased needed repair during the first 3 years after purchase. Let $\hat{p}$ be the sample proportion of responses indicating no repair (so that no repair is identified with a success). Let $p$ denote the true proportion of successes for all sets made by this manufacturer. The agency does not want to claim false advertising unless sample evidence strongly suggests that $p < .9$. The appropriate hypotheses are then $H_0$: $p = .9$ versus $H_a$: $p < .9$.
a. In the context of this problem, describe Type I and Type II errors, and discuss the possible consequences of each.
b. Would you recommend a test procedure that uses $\alpha = .10$ or one that uses $\alpha = .01$? Explain.
10.17 A manufacturer of hand-held calculators receives large shipments of printed circuits from a supplier. It is too costly and time-consuming to inspect all incoming circuits, so when each shipment arrives, a sample is selected for inspection. Information from the sample is then used to test $H_0$: $p = .05$ versus $H_a$: $p > .05$, where $p$ is the true proportion of defective circuits in the shipment. If the null hypothesis is not rejected, the shipment is accepted, and the circuits are used in the production of calculators. If the null hypothesis is rejected, the entire shipment is returned to the supplier because of inferior quality. (A shipment is defined to be of inferior quality if it contains more than 5% defective circuits.)
a. In this context, define Type I and Type II errors.
b. From the calculator manufacturer’s point of view, which type of error is considered more serious?
c. From the printed circuit supplier’s point of view, which type of error is considered more serious?

10.18 Water samples are taken from water used for cooling as it is being discharged from a power plant into a river. It has been determined that as long as the mean temperature of the discharged water is at most 150°F, there will be no negative effects on the river’s ecosystem. To investigate whether the plant is in compliance with regulations that prohibit a mean discharge water temperature above 150°F, researchers will take 50 water samples at randomly selected times and record the temperature of each sample. The resulting data will be used to test the hypotheses $H_0$: $\mu = 150$°F versus $H_a$: $\mu > 150$°F. In the context of this example, describe Type I and Type II errors. Which type of error would you consider more serious? Explain.

10.19 Occasionally, warning flares of the type contained in most automobile emergency kits fail to ignite.
A consumer advocacy group wants to investigate a claim against a manufacturer of flares brought by a person who claims that the proportion of defective flares is much higher than the value of .1 claimed by the manufacturer. A large number of flares will be tested, and the results will be used to decide between $H_0$: $p = .1$ and $H_a$: $p > .1$, where $p$ represents the true proportion of defective flares made by this manufacturer. If $H_0$ is rejected, charges of false advertising will be filed against the manufacturer.
a. Explain why the alternative hypothesis was chosen to be $H_a$: $p > .1$.
b. In this context, describe Type I and Type II errors, and discuss the consequences of each.

10.20 Suppose that you are an inspector for the Fish and Game Department and that you are given the task of determining whether to prohibit fishing along part of the Oregon coast. You will close an area to fishing if it is determined that fish in that region have an unacceptably high mercury content.
a. Assuming that a mercury concentration of 5 ppm is considered the maximum safe concentration, which of the following pairs of hypotheses would you test:

$H_0$: $\mu = 5$ versus $H_a$: $\mu > 5$
or
$H_0$: $\mu = 5$ versus $H_a$: $\mu < 5$

Give the reasons for your choice.
b. Would you prefer a significance level of .1 or .01 for your test? Explain.

10.21 The National Cancer Institute conducted a 2-year study to determine whether cancer death rates for areas near nuclear power plants are higher than for areas without nuclear facilities (San Luis Obispo Telegram-Tribune, September 17, 1990). A spokesperson for the Cancer Institute said, “From the data at hand, there was no convincing evidence of any increased risk of death from any of the cancers surveyed due to living near nuclear facilities. However, no study can prove the absence of an effect.”
a. Let $p$ denote the true proportion of the population in areas near nuclear power plants who die of cancer during a given year.
The researchers at the Cancer Institute might have considered the two rival hypotheses of the form

$H_0$: $p$ = value for areas without nuclear facilities
$H_a$: $p$ > value for areas without nuclear facilities

Did the researchers reject $H_0$ or fail to reject $H_0$?
b. If the Cancer Institute researchers were incorrect in their conclusion that there is no increased cancer risk associated with living near a nuclear power plant, are they making a Type I or a Type II error? Explain.
c. Comment on the spokesperson’s last statement that no study can prove the absence of an effect. Do you agree with this statement?

10.22 An automobile manufacturer is considering using robots for part of its assembly process. Converting to robots is an expensive process, so it will be undertaken only if there is strong evidence that the proportion of defective installations is lower for the robots than for human assemblers. Let $p$ denote the true proportion of defective installations for the robots. It is known that human assemblers have a defect proportion of .02.
a. Which of the following pairs of hypotheses should the manufacturer test:

$H_0$: $p = .02$ versus $H_a$: $p < .02$
or
$H_0$: $p = .02$ versus $H_a$: $p > .02$

Explain your answer.
b. In the context of this exercise, describe Type I and Type II errors.
c. Would you prefer a test with $\alpha = .01$ or $\alpha = .1$? Explain your reasoning.
10.3 Large-Sample Hypothesis Tests for a Population Proportion

Now that some general concepts of hypothesis testing have been introduced, we are ready to turn our attention to the development of procedures for using sample information to decide between a null and an alternative hypothesis. There are two possible conclusions: We either reject $H_0$ or else fail to reject $H_0$. The fundamental idea behind hypothesis-testing procedures is this: We reject the null hypothesis if the observed sample is very unlikely to have occurred when $H_0$ is true. In this section, we consider testing hypotheses about a population proportion when the sample size $n$ is large.

Let $p$ denote the proportion of individuals or objects in a specified population that possess a certain property. A random sample of $n$ individuals or objects is selected from the population. The sample proportion

$$\hat{p} = \frac{\text{number in the sample that possess the property}}{n}$$

is the natural statistic for making inferences about $p$.

The large-sample test procedure is based on the same properties of the sampling distribution of $\hat{p}$ that were used previously to obtain a confidence interval for $p$, namely:

1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$
3. When $n$ is large, the sampling distribution of $\hat{p}$ is approximately normal.

These three results imply that the standardized variable

$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{p(1-p)}{n}}}$$

has approximately a standard normal distribution when $n$ is large. Example 10.8 shows how this information allows us to make a decision.

Example 10.8 Impact of Food Labels

In June of 2006, an Associated Press survey was conducted to investigate how people use the nutritional information provided on food package labels.
Interviews were conducted with 1003 randomly selected adult Americans, and each participant was asked a series of questions, including the following two:

Question 1: When purchasing packaged food, how often do you check the nutrition labeling on the package?
Question 2: How often do you purchase foods that are bad for you, even after you’ve checked the nutrition labels?

It was reported that 582 responded “frequently” to the question about checking labels and 441 responded very often or somewhat often to the question about purchasing “bad” foods even after checking the label.

Let’s start by looking at the responses to the first question. Based on these data, is it reasonable to conclude that a majority of adult Americans frequently check the nutritional labels when purchasing packaged foods? We can answer this question by testing hypotheses, where

$p$ = true proportion of adult Americans who frequently check nutritional labels
$H_0$: $p = .5$
$H_a$: $p > .5$ (The proportion of adult Americans who frequently check nutritional labels is greater than .5. That is, more than half (a majority) frequently check nutritional labels.)

Recall that in a hypothesis test, the null hypothesis is rejected only if there is convincing evidence against it; in this case, convincing evidence that $p > .5$. If $H_0$ is rejected, there is strong support for the claim that a majority of adult Americans frequently check nutritional labels when purchasing packaged foods.

For this sample,

$$\hat{p} = \frac{582}{1003} = .58$$

The observed sample proportion is certainly greater than .5, but this could just be due to sampling variability. That is, when $p = .5$ (meaning $H_0$ is true), the sample proportion $\hat{p}$ usually differs somewhat from .5 simply because of chance variation from one sample to another.
Is it plausible that a sample proportion of $\hat{p} = .58$ occurred as a result of this chance variation, or is it unusual to observe a sample proportion this large when $p = .5$? To answer this question, we form a test statistic, the quantity used as a basis for making a decision between $H_0$ and $H_a$. Creating a test statistic involves replacing $p$ with the hypothesized value in the $z$ variable

$$z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$$

to obtain

$$z = \frac{\hat{p} - .5}{\sqrt{\dfrac{(.5)(.5)}{n}}}$$

If the null hypothesis is true, this statistic should have approximately a standard normal distribution, because when the sample size is large and $H_0$ is true,

1. $\mu_{\hat{p}} = .5$,
2. $\sigma_{\hat{p}} = \sqrt{\dfrac{(.5)(.5)}{n}}$, and
3. $\hat{p}$ has approximately a normal distribution.

The calculated value of $z$ expresses the distance between $\hat{p}$ and the hypothesized value as a number of standard deviations. If, for example, $z = 3$, then the value of $\hat{p}$ that came from the sample is 3 standard deviations (of $\hat{p}$) greater than what we would have expected if the null hypothesis were true. How likely is it that a $z$ value at least this contradictory to $H_0$ would be observed if in fact $H_0$ is true? The test statistic $z$ is constructed using the hypothesized value from the null hypothesis; if $H_0$ is true, the test statistic has (approximately) a standard normal distribution. Therefore

$$P(z \ge 3 \text{ when } H_0 \text{ is true}) = \text{area under the } z \text{ curve to the right of } 3.00 = .0013$$

That is, if $H_0$ is true, very few samples (much less than 1% of all samples) produce a value of $z$ at least as contradictory to $H_0$ as $z = 3$. Because this $z$ value is in the most extreme 1%, it is sensible to reject $H_0$.

For our data,

$$z = \frac{\hat{p} - .5}{\sqrt{\dfrac{(.5)(.5)}{n}}} = \frac{.58 - .5}{\sqrt{\dfrac{(.5)(.5)}{1003}}} = \frac{.08}{.016} = 5.00$$

That is, $\hat{p} = .58$ is 5 standard deviations greater than what we would expect it to be if the null hypothesis $H_0$: $p = .5$ were true. The sample data appear to be much more consistent with the alternative hypothesis, $H_a$: $p > .5$.
In particular,

$$P(\text{value of } z \text{ is at least as contradictory to } H_0 \text{ as } 5.00 \text{ when } H_0 \text{ is true}) = P(z \ge 5.00 \text{ when } H_0 \text{ is true}) = \text{area under the } z \text{ curve to the right of } 5.00 \approx 0$$

There is virtually no chance of seeing a sample proportion and corresponding $z$ value this extreme as a result of chance variation alone when $H_0$ is true. If $\hat{p}$ is 5 standard deviations or more away from .5, how can we believe that $p = .5$? The evidence for rejecting $H_0$ in favor of $H_a$ is very compelling.

Interestingly, in spite of the fact that there is strong evidence that a majority of adult Americans frequently check nutritional labels, the data on responses to the second question suggest that the percentage of people who then ignore the information on the label and purchase “bad” foods anyway is not small: the sample proportion who responded very often or somewhat often was .44.

The preceding example illustrates the rationale behind large-sample procedures for testing hypotheses about $p$ (and other test procedures as well). We begin by assuming that the null hypothesis is correct. The sample is then examined in light of this assumption. If the observed sample proportion would not be unusual when $H_0$ is true, then chance variability from one sample to another is a plausible explanation for what has been observed, and $H_0$ should not be rejected. On the other hand, if the observed sample proportion would have been quite unlikely when $H_0$ is true, then we would take the sample as convincing evidence against the null hypothesis and we should reject $H_0$. We base a decision to reject or to fail to reject the null hypothesis on an assessment of how extreme or unlikely the observed sample is if $H_0$ is true.
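The arithmetic of Example 10.8 can be checked with a few lines of code. What follows is a minimal sketch in Python and is not part of the original text: the function names `z_statistic` and `upper_tail_area` are illustrative choices, and the tail area is computed from the standard library's complementary error function instead of being read from a table. Because the sketch uses the unrounded sample proportion 582/1003, it gives a $z$ value of about 5.08 rather than the hand-rounded 5.00; the conclusion is unchanged.

```python
import math

def z_statistic(successes, n, p0):
    """Large-sample z statistic for H0: p = p0 (illustrative helper name)."""
    p_hat = successes / n
    # The standard deviation uses the hypothesized value p0, not p_hat.
    standard_deviation = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_deviation

def upper_tail_area(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Example 10.8: 582 of 1003 respondents answered "frequently".
# H0: p = .5 versus Ha: p > .5
z = z_statistic(582, 1003, 0.5)
p_value = upper_tail_area(z)
print(f"z = {z:.2f}, P-value = {p_value:.1e}")
```

The same `upper_tail_area` helper applied to $z = 3$ returns roughly .0013, the tail area quoted in the example.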
The assessment of how contradictory the observed data are to $H_0$ is based on first computing the value of the test statistic

$$z = \frac{\hat{p} - \text{hypothesized value}}{\sqrt{\dfrac{(\text{hypothesized value})(1 - \text{hypothesized value})}{n}}}$$

We then calculate the P-value, the probability, assuming that $H_0$ is true, of obtaining a $z$ value at least as contradictory to $H_0$ as what was actually observed.

DEFINITION

A test statistic is the function of sample data on which a conclusion to reject or fail to reject $H_0$ is based.

The P-value (also sometimes called the observed significance level) is a measure of inconsistency between the hypothesized value for a population characteristic and the observed sample. It is the probability, assuming that $H_0$ is true, of obtaining a test statistic value at least as inconsistent with $H_0$ as what actually resulted.

Example 10.9 Detecting Plagiarism

Plagiarism is a growing concern among college and university faculty members, and many universities are now using software tools to detect student work that is not original. Researchers at the University of Luton conducted a survey of 321 faculty members at a variety of academic institutions (“Technical Review of Plagiarism Detection Software,” University of Luton, 2001). Included in the survey were questions about strategies used to uncover instances of plagiarism. It was reported that 36% of those surveyed said they occasionally used online searches with key words from student work to check for plagiarism. Assuming it is reasonable to regard this sample as representative of university faculty members, does the sample provide convincing evidence that more than one-third of faculty members occasionally use key word searches to check student work?
With

$p$ = true proportion of faculty members who use key word searches to check student work

the relevant hypotheses are

$H_0$: $p = \frac{1}{3} \approx .33$
$H_a$: $p > .33$

The sample proportion was reported to be $\hat{p} = .36$. Does the value of $\hat{p}$ exceed one-third by enough to cast substantial doubt on $H_0$? Because the sample size is large, the statistic

$$z = \frac{\hat{p} - .33}{\sqrt{\dfrac{(.33)(1 - .33)}{n}}}$$

has approximately a standard normal distribution when $H_0$ is true. The calculated value of the test statistic is

$$z = \frac{.36 - .33}{\sqrt{\dfrac{(.33)(1 - .33)}{321}}} = \frac{.03}{.026} = 1.15$$

The probability that a $z$ value at least this inconsistent with $H_0$ would be observed if in fact $H_0$ is true is

$$\text{P-value} = P(z \ge 1.15 \text{ when } H_0 \text{ is true}) = \text{area under the } z \text{ curve to the right of } 1.15 = 1 - .8749 = .1251$$

This probability indicates that when $p = .33$, it would not be all that unusual to observe a sample proportion as large as .36. When $H_0$ is true, roughly 12.5% of all samples would have a sample proportion larger than .36, so a sample proportion of .36 is reasonably consistent with the null hypothesis. Although .36 is larger than the hypothesized value of $p = .33$, chance variation from sample to sample is a plausible explanation for what was observed. There is not strong evidence that the proportion of faculty members who use key word searches to check student work for plagiarism is greater than one-third.

As illustrated by Examples 10.8 and 10.9, small P-values indicate that sample results are inconsistent with $H_0$, whereas larger P-values are interpreted as meaning that the data are consistent with $H_0$ and that sampling variability alone is a plausible explanation for what was observed in the sample. As you probably noticed, the two cases examined (P-value $\approx 0$ and P-value $= .1251$) were such that a decision between rejecting or not rejecting $H_0$ was clear-cut. A decision in other cases might not be so obvious. For example, what if the sample had resulted in a P-value of .04?
Is this unusual enough to warrant rejection of $H_0$? How small must the P-value be before $H_0$ should be rejected?

The answer depends on the significance level $\alpha$ (the probability of a Type I error) selected for the test. For example, suppose that we set $\alpha = .05$. This implies that the probability of rejecting a true null hypothesis is .05. To obtain a test procedure with this probability of Type I error, we would reject the null hypothesis if the sample result is among the most unusual 5% of all samples when $H_0$ is true. That is, $H_0$ is rejected if the computed P-value $\le .05$. If we had selected $\alpha = .01$, $H_0$ would be rejected only if we observed a sample result so extreme that it would be among the most unusual 1% if $H_0$ is true (i.e., if P-value $\le .01$).

A decision as to whether $H_0$ should be rejected results from comparing the P-value to the chosen $\alpha$:

$H_0$ should be rejected if P-value $\le \alpha$.
$H_0$ should not be rejected if P-value $> \alpha$.

Suppose, for example, that the P-value $= .0352$ and that a significance level of .05 is chosen. Then, because

P-value $= .0352 \le .05 = \alpha$

$H_0$ would be rejected. This would not be the case, though, for $\alpha = .01$, because then P-value $> \alpha$.

Computing a P-Value for a Large-Sample Test Concerning p

The computation of the P-value depends on the form of the inequality in the alternative hypothesis, $H_a$. Suppose, for example, that we wish to test

$H_0$: $p = .6$ versus $H_a$: $p > .6$

based on a large sample. The appropriate test statistic is

$$z = \frac{\hat{p} - .6}{\sqrt{\dfrac{(.6)(1 - .6)}{n}}}$$

Values of $\hat{p}$ contradictory to $H_0$ and much more consistent with $H_a$ are those much larger than .6 (because $p = .6$ when $H_0$ is true and $p > .6$ when $H_0$ is false and $H_a$ is true). Such values of $\hat{p}$ correspond to $z$ values considerably greater than 0. If $n = 400$ and $\hat{p} = .679$, then

$$z = \frac{.679 - .6}{\sqrt{\dfrac{(.6)(1 - .6)}{400}}} = \frac{.079}{.025} = 3.16$$

The value $\hat{p} = .679$ is more than 3 standard deviations larger than what we would have expected if $H_0$ were true.
Thus

$$\text{P-value} = P(z \text{ at least as contradictory to } H_0 \text{ as } 3.16 \text{ when } H_0 \text{ is true}) = P(z \ge 3.16 \text{ when } H_0 \text{ is true}) = \text{area under the } z \text{ curve to the right of } 3.16 = 1 - .9992 = .0008$$

[Figure 10.1: Calculating a P-value. The z curve, with the area to the right of the calculated z = 3.16 marked as P-value = .0008.]

This P-value is illustrated in Figure 10.1. If $H_0$ is true, in the long run only 8 out of 10,000 samples would result in a $z$ value as or more extreme than what actually resulted; most of us would consider such a $z$ quite unusual. Using a significance level of .01, we reject the null hypothesis because P-value $= .0008 \le .01 = \alpha$.

Now consider testing $H_0$: $p = .3$ versus $H_a$: $p \ne .3$. A value of $\hat{p}$ either much greater than .3 or much less than .3 is inconsistent with $H_0$ and provides support for $H_a$. Such a $\hat{p}$ corresponds to a $z$ value far out in either tail of the $z$ curve. If

$$z = \frac{\hat{p} - .3}{\sqrt{\dfrac{(.3)(1 - .3)}{n}}} = 1.75$$

then (as shown in Figure 10.2)

$$\text{P-value} = P(z \text{ value at least as inconsistent with } H_0 \text{ as } 1.75 \text{ when } H_0 \text{ is true}) = P(z \ge 1.75 \text{ or } z \le -1.75 \text{ when } H_0 \text{ is true})$$
$$= (z \text{ curve area to the right of } 1.75) + (z \text{ curve area to the left of } -1.75) = (1 - .9599) + .0401 = .0802$$

The P-value in this situation is also .0802 if $z = -1.75$, because 1.75 and $-1.75$ are equally inconsistent with $H_0$.

[Figure 10.2: P-value as the sum of two tail areas. The z curve, with the areas to the left of -1.75 and to the right of the calculated z = 1.75 marked; total area = .0802 = P-value.]

Determination of the P-Value When the Test Statistic Is z

1. Upper-tailed test: $H_a$: $p >$ hypothesized value. P-value = area in the upper tail of the $z$ curve (to the right of the calculated $z$).
2. Lower-tailed test: $H_a$: $p <$ hypothesized value. P-value = area in the lower tail of the $z$ curve (to the left of the calculated $z$).
3.
Two-tailed test: $H_a$: $p \ne$ hypothesized value. P-value = sum of the areas in the two tails of the $z$ curve (beyond the calculated $z$ and $-z$).

The symmetry of the $z$ curve implies that when the test is two-tailed (the “not equal” alternative), it is not necessary to add two curve areas. Instead,

If $z$ is positive, P-value = 2(area to the right of $z$).
If $z$ is negative, P-value = 2(area to the left of $z$).

Example 10.10 Water Conservation

In December 2005 a countywide water conservation campaign was conducted in a particular county. In January 2006 a random sample of 500 homes was selected, and water usage was recorded for each home in the sample. The county supervisors wanted to know whether their data supported the claim that fewer than half the households in the county reduced water consumption. The relevant hypotheses are

$H_0$: $p = .5$ versus $H_a$: $p < .5$

where $p$ is the true proportion of households in the county with reduced water usage.

Suppose that the sample results were $n = 500$ and $\hat{p} = .440$. Because the sample size is large and this is a lower-tailed test, we can compute the P-value by first calculating the value of the $z$ test statistic

$$z = \frac{\hat{p} - .5}{\sqrt{\dfrac{(.5)(1 - .5)}{n}}}$$

and then finding the area under the $z$ curve to the left of this $z$. Based on the observed sample data,

$$z = \frac{.440 - .5}{\sqrt{\dfrac{(.5)(1 - .5)}{500}}} = \frac{-.060}{.0224} = -2.68$$

The P-value is then equal to the area under the $z$ curve and to the left of $-2.68$. From the entry in the $-2.6$ row and .08 column of Appendix Table 2, we find that

P-value $= .0037$

Using a .01 significance level, we reject $H_0$ (because $.0037 \le .01$), suggesting that the proportion with reduced water usage was less than .5. Notice that rejection of $H_0$ would not be justified if a very small significance level, such as .001, had been selected.
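The three tail rules, and the comparison of the P-value with $\alpha$ in Example 10.10, can be sketched in code. The following Python fragment is an illustration only, not part of the original text: the function name `p_value_from_z` and the labels "greater", "less", and "two-sided" are assumptions made for this sketch, and the areas come from the standard library's `statistics.NormalDist` rather than from Appendix Table 2. Because $z$ is not rounded to $-2.68$ before the area is found, the computed P-value is about .0036 rather than the table value .0037.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def p_value_from_z(z, alternative):
    """P-value for a z test statistic.

    `alternative` is a hypothetical label chosen for this sketch:
    "greater" (upper-tailed), "less" (lower-tailed), or "two-sided".
    """
    if alternative == "greater":
        return 1 - STANDARD_NORMAL.cdf(z)             # area to the right of z
    if alternative == "less":
        return STANDARD_NORMAL.cdf(z)                 # area to the left of z
    if alternative == "two-sided":
        return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))  # twice the tail area beyond |z|
    raise ValueError(f"unknown alternative: {alternative!r}")

# Example 10.10: n = 500, p_hat = .440, H0: p = .5 versus Ha: p < .5
n, p_hat, p0 = 500, 0.440, 0.5
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = p_value_from_z(z, "less")

for alpha in (0.01, 0.001):
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(f"alpha = {alpha}: P-value = {p_value:.4f}, {decision}")
```

Using `abs(z)` in the two-sided branch is one way to express the symmetry rule: it doubles the area to the right of a positive $z$ or, equivalently, the area to the left of a negative $z$.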
Example 10.10 illustrates the calculation of a P-value for a lower-tailed test. The use of P-values in upper-tailed and two-tailed tests is illustrated in Examples 10.11 and 10.12. But first we summarize large-sample tests of hypotheses about a population proportion and introduce a step-by-step procedure for carrying out a hypothesis test.

Summary of Large-Sample z Test for p

Null hypothesis: $H_0$: $p$ = hypothesized value

Test statistic:

$$z = \frac{\hat{p} - \text{hypothesized value}}{\sqrt{\dfrac{(\text{hypothesized value})(1 - \text{hypothesized value})}{n}}}$$

Alternative hypothesis and corresponding P-value:

$H_a$: $p >$ hypothesized value. P-value: area under the $z$ curve to the right of the calculated $z$.
$H_a$: $p <$ hypothesized value. P-value: area under the $z$ curve to the left of the calculated $z$.
$H_a$: $p \ne$ hypothesized value. P-value: (1) 2(area to the right of $z$) if $z$ is positive, or (2) 2(area to the left of $z$) if $z$ is negative.

Assumptions:

1. $\hat{p}$ is the sample proportion from a random sample.
2. The sample size is large. This test can be used if $n$ satisfies both $n(\text{hypothesized value}) \ge 10$ and $n(1 - \text{hypothesized value}) \ge 10$.
3. If sampling is without replacement, the sample size is no more than 10% of the population size.

We recommend that the following sequence of steps be used when carrying out a hypothesis test.

Steps in a Hypothesis-Testing Analysis

1. Describe the population characteristic about which hypotheses are to be tested.
2. State the null hypothesis $H_0$.
3. State the alternative hypothesis $H_a$.
4. Select the significance level $\alpha$ for the test.
5. Display the test statistic to be used, with substitution of the hypothesized value identified in Step 2 but without any computation at this point.
6. Check to make sure that any assumptions required for the test are reasonable.
7. Compute all quantities appearing in the test statistic and then the value of the test statistic itself.
8. Determine the P-value associated with the observed value of the test statistic.
9. State the conclusion (which is to reject $H_0$ if P-value $\le \alpha$ and not to reject $H_0$ otherwise). The conclusion should then be stated in the context of the problem, and the level of significance should be included.

Steps 1–4 constitute a statement of the problem, Steps 5–8 give the analysis that leads to a decision, and Step 9 provides the conclusion.

Example 10.11 Unfit Teens

The article “7 Million U.S. Teens Would Flunk Treadmill Tests” (Associated Press, December 11, 2005) summarized the results of a study in which 2205 adolescents aged 12 to 19 took a cardiovascular treadmill test. The researchers conducting the study indicated that the sample was selected in such a way that it could be regarded as representative of adolescents nationwide. Of the 2205 adolescents tested, 750 showed a poor level of cardiovascular fitness. Does this sample provide support for the claim that more than 30% of adolescents have a low level of cardiovascular fitness? We answer this question by carrying out a hypothesis test using a .05 significance level.

1. Population characteristic of interest: $p$ = true proportion of adolescents who have a low level of cardiovascular fitness
2. Null hypothesis: $H_0$: $p = .3$
3. Alternative hypothesis: $H_a$: $p > .3$ (the percentage of adolescents with a low fitness level is greater than 30%)
4. Significance level: $\alpha = .05$
5. Test statistic:

$$z = \frac{\hat{p} - \text{hypothesized value}}{\sqrt{\dfrac{(\text{hypothesized value})(1 - \text{hypothesized value})}{n}}} = \frac{\hat{p} - .3}{\sqrt{\dfrac{(.3)(1 - .3)}{n}}}$$

6. Assumptions: This test requires a random sample and a large sample size.
The given sample was considered to be representative of adolescents nationwide, and if this is the case it is reasonable to regard the sample as if it were a random sample. The sample size was n 2205. Since 2205(.3) 10 and 2205(1 .3) 10, the large-sample test is appropriate. The sample size is small compared to the population (adolescents) size. 7. Computations: n 2205 and p 750/2205 .34, so z .34 .3 .04 4.00 .010 1.3 2 11 .3 2 B 2205 8. P-value: This is an upper-tailed test (the inequality in Ha is ), so the P-value is the area to the right of the computed z value. Since z 4.00 is so far out in the upper tail of the standard normal distribution, the area to its right is negligible. Thus, P-value 0 9. Conclusion: Since P-value a (0 .05), H0 is rejected at the .05 level of significance. We conclude that the proportion of adolescents who have a low level of cardiovascular fitness is greater than .3. That is, the sample provides convincing evidence to support the claim that more than 30% of adolescents have a low fitness level. ■ .......................................................................................................................................... E x a m p l e 1 0 . 1 2 Single-Family Homes The Public Policy Institute of California reported that 71% of people nationwide prefer to live in a single-family home. To determine whether the preferences of Californians are consistent with this nationwide figure, a random sample of 2002 Californians were interviewed. Of those interviewed, 1682 said that they consider a single-family home the ideal (Associated Press, November 13, 2001). Can we reasonably conclude that the proportion of Californians who prefer a single-family home is different from the national figure? We answer this question by carrying out a hypothesis test with a .01. 10-W4959 10/7/08 3:19 PM Page 547 10.3 © Royalty-Free/Getty Images 1. 2. 3. 4. 5. 
1. $p$ = proportion of all Californians who prefer a single-family home.
2. $H_0: p = .71$
3. $H_a: p \ne .71$ (differs from the national proportion).
4. Significance level: $\alpha = .01$
5. Test statistic: $z = \dfrac{\hat{p} - \text{hypothesized value}}{\sqrt{\dfrac{(\text{hypothesized value})(1 - \text{hypothesized value})}{n}}} = \dfrac{\hat{p} - .71}{\sqrt{\dfrac{(.71)(.29)}{n}}}$
6. Assumptions: This test requires a random sample and a large sample size. The given sample was a random sample, the population size is much larger than the sample size, and the sample size was $n = 2002$. Because $2002(.71) \ge 10$ and $2002(.29) \ge 10$, the large-sample test is appropriate.
7. Computations: $\hat{p} = 1682/2002 = .84$, from which $z = \dfrac{.84 - .71}{\sqrt{\dfrac{(.71)(.29)}{2002}}} = \dfrac{.13}{.0101} = 12.87$
8. P-value: The area under the $z$ curve to the right of 12.87 is approximately 0, so P-value $\approx 2(0) = 0$.
9. Conclusion: At significance level .01, we reject $H_0$ because P-value $\approx 0 < .01 = \alpha$. The data provide convincing evidence that the proportion in California who prefer a single-family home differs from the nationwide proportion.

Most statistical computer packages and graphing calculators can calculate and report P-values for a variety of hypothesis-testing situations, including the large-sample test for a proportion. MINITAB was used to carry out the test of Example 10.10, and the resulting computer output follows (MINITAB uses p to denote the population proportion):

    Test and Confidence Interval for One Proportion
    Test of p = 0.5 vs p < 0.5
    Sample    X    N   Sample p   95.0% CI                Z-Value   P-Value
    1       220  500   0.440000   (0.396491, 0.483509)    -2.68     0.004

From the MINITAB output, $z = -2.68$, and the associated P-value is .004. The small difference in the P-value is the result of rounding.
It is also possible to compute the value of the z test statistic and then use a statistical computer package or graphing calculator to determine the corresponding P-value as an area under the standard normal curve.
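As one illustration of doing this calculation in software, the following Python sketch computes the large-sample z statistic and its P-value from the raw counts. It is only a sketch under stated assumptions: the function name `one_proportion_z_test` and the labels "greater", "less", and "two-sided" are hypothetical choices made here, and the standard normal areas come from Python's standard-library `statistics.NormalDist` rather than from MINITAB or Appendix Table 2.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, alternative):
    """Large-sample z test of H0: p = p0 (hypothetical helper for this sketch).

    alternative is "greater", "less", or "two-sided"; these labels are
    assumptions of the sketch, not notation from the text.
    """
    # The large-sample test requires n*p0 >= 10 and n*(1 - p0) >= 10.
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("sample too small for the large-sample z test")
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":      # upper-tailed: area to the right of z
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":       # lower-tailed: area to the left of z
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":  # two-tailed: double the tail area
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("unknown alternative")
    return z, p_value


# Example 10.11: 750 of 2205 adolescents, H0: p = .3 versus Ha: p > .3
z_teens, p_teens = one_proportion_z_test(750, 2205, 0.3, "greater")
# Example 10.10 (counts from the MINITAB output): 220 of 500, Ha: p < .5
z_minitab, p_minitab = one_proportion_z_test(220, 500, 0.5, "less")
print(round(z_teens, 2), p_teens)
print(round(z_minitab, 2), round(p_minitab, 4))
```

Because the sketch does not round the sample proportion or the standard error along the way, its z value for Example 10.11 comes out slightly above the hand-computed 4.00; the conclusion is unchanged.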
With MINITAB, for example, the user can specify a value and the program will determine the area to the left of this value for any particular normal distribution. Because of this, the computer can be used in place of Appendix Table 2. In Example 10.10 the computed $z$ was $-2.68$. Using MINITAB gives the following output:

    Normal with mean = 0 and standard deviation = 1.00000
          x    P(X ≤ x)
    -2.6800      0.0037

Thus we learn that the area to the left of $-2.68$ is .0037, which agrees with the value obtained by using the tables.

Exercises 10.23–10.44

10.23 Use the definition of the P-value to explain the following:
a. Why $H_0$ would certainly be rejected if P-value = .0003
b. Why $H_0$ would definitely not be rejected if P-value = .350

10.24 For which of the following P-values will the null hypothesis be rejected when performing a level .05 test:
a. .001 b. .021 c. .078 d. .047 e. .148

10.25 Pairs of P-values and significance levels, $\alpha$, are given. For each pair, state whether the observed P-value leads to rejection of $H_0$ at the given significance level.
a. P-value = .084, $\alpha = .05$
b. P-value = .003, $\alpha = .001$
c. P-value = .498, $\alpha = .05$
d. P-value = .084, $\alpha = .10$
e. P-value = .039, $\alpha = .01$
f. P-value = .218, $\alpha = .10$

10.26 Let $p$ denote the proportion of grocery store customers that use the store’s club card. For a large-sample z test of $H_0: p = .5$ versus $H_a: p > .5$, find the P-value associated with each of the given values of the test statistic:
a. 1.40 b. 0.93 c. 1.96 d. 2.45 e. 0.17

10.27 Assuming a random sample from a large population, for which of the following null hypotheses and sample sizes $n$ is the large-sample z test appropriate:
a. $H_0: p = .2$, $n = 25$
b. $H_0: p = .6$, $n = 210$
c. $H_0: p = .9$, $n = 100$
d.
$H_0: p = .05$, $n = 75$

10.28 The article “Poll Finds Most Oppose Return to Draft, Wouldn’t Encourage Children to Enlist” (Associated Press, December 18, 2005) reports that in a random sample of 1000 American adults, 700 indicated that they oppose the reinstatement of a military draft. Is there convincing evidence that the proportion of American adults who oppose reinstatement of the draft is greater than two-thirds? Use a significance level of .05.

10.29 The poll referenced in the previous exercise (“Military Draft Study,” AP-Ipsos, June 2005) also included the following question: “If the military draft were reinstated, would you favor or oppose drafting women as well as men?” Forty-three percent of the 1000 people responding said that they would favor drafting women if the draft were reinstated. Using a .05 significance level, carry out a test to determine if there is convincing evidence that fewer than half of adult Americans would favor the drafting of women.

10.30 The article “Irritated by Spam? Get Ready for Spit” (USA Today, November 10, 2004) predicts that “spit,” spam that is delivered via Internet phone lines and cell phones, will be a growing problem as more people turn to web-based phone services. In a 2004 poll of 5500 cell phone users conducted by the Yankee Group, 20% indicated that they had received commercial messages and ads on their cell phones. Is there sufficient evidence that the proportion of cell phone users who have received commercial messages or ads in 2004 was greater than the proportion of .13 reported for the previous year?

10.31 In a survey conducted by Yahoo Small Business, 1432 of 1813 adults surveyed said that they would alter their shopping habits if gas prices remain high (Associated Press, November 30, 2005). The article did not say how the sample was selected, but for purposes of this exercise, assume that it is reasonable to regard this sample as representative of adult Americans.
Based on these survey data, is it reasonable to conclude that more than three-quarters of adult Americans plan to alter their shopping habits if gas prices remain high?

10.32 According to a Washington Post-ABC News poll, 331 of 502 randomly selected U.S. adults interviewed said they would not be bothered if the National Security Agency collected records of personal telephone calls they had made. Is there sufficient evidence to conclude that a majority of U.S. adults feel this way? Test the appropriate hypotheses using a .01 significance level.

10.33 According to a survey of 1000 adult Americans conducted by Opinion Research Corporation, 210 of those surveyed said playing the lottery would be the most practical way for them to accumulate $200,000 in net wealth in their lifetime (“One in Five Believe Path to Riches Is the Lottery,” San Luis Obispo Tribune, January 11, 2006). Although the article does not describe how the sample was selected, for purposes of this exercise, assume that the sample can be regarded as a random sample of adult Americans. Is there convincing evidence that more than 20% of adult Americans believe that playing the lottery is the best strategy for accumulating $200,000 in net wealth?

10.34 The article “Theaters Losing Out to Living Rooms” (San Luis Obispo Tribune, June 17, 2005) states that movie attendance declined in 2005. The Associated Press found that 730 of 1000 randomly selected adult Americans preferred to watch movies at home rather than at a movie theater. Is there convincing evidence that the majority of adult Americans prefer to watch movies at home? Test the relevant hypotheses using a .05 significance level.
10.35 The article referenced in Exercise 10.34 also reported that 470 of 1000 randomly selected adult Americans thought that the quality of movies being produced was getting worse.
a. Is there convincing evidence that fewer than half of adult Americans believe that movie quality is getting worse? Use a significance level of .05.
b. Suppose that the sample size had been 100 instead of 1000, and that 47 thought that the movie quality was getting worse (so that the sample proportion is still .47). Based on this sample of 100, is there convincing evidence that fewer than half of adult Americans believe that movie quality is getting worse? Use a significance level of .05.
c. Write a few sentences explaining why different conclusions were reached in the hypothesis tests of Parts (a) and (b).

10.36 The report “2005 Electronic Monitoring & Surveillance Survey: Many Companies Monitoring, Recording, Videotaping—and Firing—Employees” (American Management Association, 2005) summarized the results of a survey of 526 U.S. businesses. Four hundred of these companies indicated that they monitor employees’ web site visits. For purposes of this exercise, assume that it is reasonable to regard this sample as representative of businesses in the United States.
a. Is there sufficient evidence to conclude that more than 75% of U.S. businesses monitor employees’ web site visits? Test the appropriate hypotheses using a significance level of .01.
b. Is there sufficient evidence to conclude that a majority of U.S. businesses monitor employees’ web site visits? Test the appropriate hypotheses using a significance level of .01.

10.37 In an AP-AOL sports poll (Associated Press, December 18, 2005), 272 of 394 randomly selected baseball fans stated that they thought the designated hitter rule should either be expanded to both baseball leagues or eliminated.
Based on the given information, is there sufficient evidence to conclude that a majority of baseball fans feel this way?

10.38 In a representative sample of 1000 adult Americans, only 430 could name at least one justice who is currently serving on the U.S. Supreme Court (Ipsos, January 10, 2006). Using a significance level of .01, carry out a hypothesis test to determine if there is convincing evidence to support the claim that fewer than half of adult Americans can name at least one justice currently serving on the Supreme Court.

10.39 ▼ In a national survey of 2013 adults, 1590 responded that lack of respect and courtesy in American society is a serious problem, and 1283 indicated that they believe that rudeness is a more serious problem than in past years (Associated Press, April 3, 2002). Is there convincing evidence that more than three-quarters of U.S. adults believe that rudeness is a worsening problem? Test the relevant hypotheses using a significance level of .05.

10.40 The success of the U.S. census depends on people filling out and returning census forms. Despite extensive advertising, many Americans are skeptical about claims that the Census Bureau will guard the information it collects from other government agencies. In a USA Today poll (March 13, 2000), only 432 of 1004 adults surveyed said that they believe the Census Bureau when it says the information you give about yourself is kept confidential. Is there convincing evidence that, despite the advertising campaign, fewer than half of U.S. adults believe the Census Bureau will keep information confidential? Use a significance level of .01.

10.41 ▼ The article “Americans Seek Spiritual Guidance on Web” (San Luis Obispo Tribune, October 12, 2002) reported that 68% of the general population belong to a religious community.
In a survey on Internet use, 84% of “religion surfers” (defined as those who seek spiritual help online or who have used the web to search for prayer and devotional resources) belong to a religious community. Suppose that this result was based on a sample of 512 religion surfers. Is there convincing evidence that the proportion of religion surfers who belong to a religious community is different from .68, the proportion for the general population? Use $\alpha = .05$.

10.42 Teenagers (age 15 to 20) make up 7% of the driving population. The article “More States Demand Teens Pass Rigorous Driving Tests” (San Luis Obispo Tribune, January 27, 2000) described a study of auto accidents conducted by the Insurance Institute for Highway Safety. The Institute found that 14% of the accidents studied involved teenage drivers. Suppose that this percentage was based on examining records from 500 randomly selected accidents. Does the study provide convincing evidence that the proportion of accidents involving teenage drivers differs from .07, the proportion of teens in the driving population?

10.43 Students at the Akademia Podlaka conducted an experiment to determine whether the Belgium-minted Euro coin was equally likely to land heads up or tails up. Coins were spun on a smooth surface, and in 250 spins, 140 landed with the heads side up (New Scientist, January 4, 2002). Should the students interpret this result as convincing evidence that the proportion of the time the coin would land heads up is not .5? Test the relevant hypotheses using $\alpha = .01$. Would your conclusion be different if a significance level of .05 had been used? Explain.
10.44 The article “Fewer Parolees Land Back Behind Bars” (Associated Press, April 11, 2006) includes the following statement: “Just over 38 percent of all felons who were released from prison in 2003 landed back behind bars by the end of the following year, the lowest rate since 1979.” Explain why it would not be necessary to carry out a hypothesis test to determine if the proportion of felons released in 2003 was less than .40.

● Data set available online but not required ▼ Video solution available

10.4 Hypothesis Tests for a Population Mean

We now turn our attention to developing a method for testing hypotheses about a population mean. The test procedures in this case are based on the same two results that led to the z and t confidence intervals in Chapter 9. These results follow:
1. When either $n$ is large or the population distribution is approximately normal, then $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ has approximately a standard normal distribution.
2. When either $n$ is large or the population distribution is approximately normal, then $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$ has approximately a t distribution with df $= n - 1$.
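The second result can be checked informally by simulation. The sketch below is an illustration only, with every numerical setting an assumption made here rather than something from the text: it draws repeated random samples of size n = 24 from a normal population with mean 100 and standard deviation 8, computes the t statistic for each, and reports how often |t| exceeds 2.069, the value that a standard t table gives for a central area of .95 with 23 df. If t really follows a t distribution with df = n − 1, that fraction should be close to .05.

```python
import random
from math import fsum, sqrt


def t_statistic(sample, mu):
    """t = (x_bar - mu) / (s / sqrt(n)) for one sample."""
    n = len(sample)
    x_bar = fsum(sample) / n
    # Sample standard deviation with divisor n - 1.
    s = sqrt(fsum((x - x_bar) ** 2 for x in sample) / (n - 1))
    return (x_bar - mu) / (s / sqrt(n))


def simulate_tail_fraction(n=24, mu=100.0, sigma=8.0, cutoff=2.069,
                           reps=20000, seed=1):
    """Fraction of simulated samples with |t| > cutoff.

    All default values are hypothetical choices for this illustration.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    exceed = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if abs(t_statistic(sample, mu)) > cutoff:
            exceed += 1
    return exceed / reps


fraction = simulate_tail_fraction()
print(fraction)  # expected to be near .05, subject to simulation error
```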
A consequence of these two results is that if we are interested in testing a null hypothesis of the form
$H_0: \mu = \text{hypothesized value}$
then, depending on whether $\sigma$ is known or unknown, we can use (as long as $n$ is large or the population distribution is approximately normal) either the accompanying z or t test statistic:

Case 1: $\sigma$ known
Test statistic: $z = \dfrac{\bar{x} - \text{hypothesized value}}{\sigma/\sqrt{n}}$
P-value: Computed as an area under the z curve

Case 2: $\sigma$ unknown
Test statistic: $t = \dfrac{\bar{x} - \text{hypothesized value}}{s/\sqrt{n}}$
P-value: Computed as an area under the t curve with df $= n - 1$

Because it is rarely the case that $\sigma$, the population standard deviation, is known, we focus our attention on the test procedure for the case in which $\sigma$ is unknown.
When testing a hypothesis about a population mean, the null hypothesis specifies a particular hypothesized value for $\mu$, specifically, $H_0: \mu = \text{hypothesized value}$. The alternative hypothesis has one of the following three forms, depending on the research question being addressed:
$H_a: \mu > \text{hypothesized value}$
$H_a: \mu < \text{hypothesized value}$
$H_a: \mu \ne \text{hypothesized value}$
If $n$ is large or if the population distribution is approximately normal, the test statistic
$t = \dfrac{\bar{x} - \text{hypothesized value}}{s/\sqrt{n}}$
can be used. For example, if the null hypothesis to be tested is $H_0: \mu = 100$, the test statistic becomes
$t = \dfrac{\bar{x} - 100}{s/\sqrt{n}}$
Consider the alternative hypothesis $H_a: \mu > 100$, and suppose that a sample of size $n = 24$ gives $\bar{x} = 104.20$ and $s = 8.23$. The resulting test statistic value is
$t = \dfrac{104.20 - 100}{8.23/\sqrt{24}} = \dfrac{4.20}{1.6799} = 2.50$
Because this is an upper-tailed test, if the test statistic had been z rather than t, the P-value would be the area under the z curve to the right of 2.50. With a t statistic, the P-value is the area under an appropriate t curve (here with df $= 24 - 1 = 23$) to the right of 2.50. Appendix Table 4 is a tabulation of t curve tail areas. Each column of the table is for a different number of degrees of freedom: 1, 2, 3, . . .
, 30, 35, 40, 60, 120, and a last column for df $= \infty$, which is the same as for the z curve. The table gives the area under each t curve to the right of values ranging from 0.0 to 4.0 in increments of 0.1. Part of this table appears in Figure 10.3.

Figure 10.3 Part of Appendix Table 4: t curve tail areas.
      t     df: ...   22     23     24   ...
     2.5        ...  .010   .010   .010  ...
     2.6        ...  .008   .008   .008  ...
     2.7        ...  .007   .006   .006  ...
     2.8        ...  .005   .005   .005  ...
(The entry for t = 2.7 in the df = 23 column is the area under the 23-df t curve to the right of 2.7.)

For example,
area under the 23-df t curve to the right of 2.5 = .010 = P-value for an upper-tailed t test
Suppose that $t = -2.7$ for a lower-tailed test based on 23 df. Then, because any t curve is symmetric about 0,
P-value = area to the left of $-2.7$ = area to the right of 2.7 = .006
As is the case for z tests, we double the captured tail area to obtain the P-value for two-tailed t tests. Thus, if $t = 2.6$ or if $t = -2.6$ for a two-tailed test with 23 df, then
P-value = 2(.008) = .016
Once past 30 df, the tail areas change very little, so the last column ($\infty$) in Appendix Table 4 provides a good approximation.
The following two boxes show how the P-value is obtained as a t curve area and give a general description of the test procedure.

Finding P-Values for a t Test
1. Upper-tailed test: $H_a: \mu > \text{hypothesized value}$
P-value = area in the upper tail of the t curve for $n - 1$ df (the area to the right of the calculated t)
2. Lower-tailed test: $H_a: \mu < \text{hypothesized value}$
P-value = area in the lower tail of the t curve for $n - 1$ df (the area to the left of the calculated t)
3. Two-tailed test: $H_a: \mu \ne \text{hypothesized value}$
P-value = sum of the areas in the two tails of the t curve for $n - 1$ df (beyond the calculated $t$ and $-t$)
Appendix Table 4 gives upper-tail t curve areas to the right of values 0.0, 0.1, . . . , 4.0.
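For readers without the table at hand, the same tail areas can be approximated numerically. The following Python sketch is one possible approach and not the method used to build Appendix Table 4: it integrates the t density with Simpson's rule, using only the standard library. The function names `t_density`, `t_upper_tail`, and `t_test_p_value`, and the labels "greater", "less", and "two-sided", are hypothetical names chosen for this illustration.

```python
from math import gamma, pi, sqrt


def t_density(x, df):
    """Density of the t distribution with df degrees of freedom.

    math.gamma overflows for very large arguments, so this sketch is only
    meant for moderate df (a few hundred at most).
    """
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Area under the t curve to the right of t, by Simpson's rule.

    steps must be even for Simpson's rule.
    """
    if t < 0:
        # Symmetry about 0: area right of t = 1 - area right of -t.
        return 1 - t_upper_tail(-t, df, steps)
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # half the total area lies to the right of 0


def t_test_p_value(t, df, alternative):
    """P-value for an upper-, lower-, or two-tailed t test."""
    if alternative == "greater":
        return t_upper_tail(t, df)
    if alternative == "less":
        return t_upper_tail(-t, df)  # area left of t = area right of -t
    if alternative == "two-sided":
        return 2 * t_upper_tail(abs(t), df)
    raise ValueError("unknown alternative")


# The three 23-df illustrations discussed above.
print(round(t_test_p_value(2.5, 23, "greater"), 3))
print(round(t_test_p_value(-2.7, 23, "less"), 3))
print(round(t_test_p_value(2.6, 23, "two-sided"), 3))
```

The printed values should agree with the table-based answers above up to rounding, since the table itself reports areas to three decimal places.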
These areas are P-values for upper-tailed tests and, by symmetry, also for lower-tailed tests. Doubling an area gives the P-value for a two-tailed test.

The One-Sample t Test for a Population Mean
Null hypothesis: $H_0: \mu = \text{hypothesized value}$
Test statistic: $t = \dfrac{\bar{x} - \text{hypothesized value}}{s/\sqrt{n}}$
Alternative hypothesis and corresponding P-value:
$H_a: \mu > \text{hypothesized value}$: Area to the right of calculated t under t curve with df $= n - 1$
$H_a: \mu < \text{hypothesized value}$: Area to the left of calculated t under t curve with df $= n - 1$
$H_a: \mu \ne \text{hypothesized value}$: (1) 2(area to right of t) if t is positive, or (2) 2(area to left of t) if t is negative
Assumptions:
1. $\bar{x}$ and $s$ are the sample mean and sample standard deviation from a random sample.
2. The sample size is large (generally $n \ge 30$) or the population distribution is at least approximately normal.

Example 10.13 Time Stands Still (or So It Seems) ●
A study conducted by researchers at Pennsylvania State University investigated whether time perception, a simple indication of a person’s ability to concentrate, is impaired during nicotine withdrawal. The study results were presented in the paper “Smoking Abstinence Impairs Time Estimation Accuracy in Cigarette Smokers” (Psychopharmacology Bulletin [2003]: 90–95). After a 24-hr smoking abstinence, 20 smokers were asked to estimate how much time had passed during a 45-sec period.
Suppose the resulting data on perceived elapsed time (in seconds) were as shown (these data are artificial but are consistent with summary quantities given in the paper):
69 56 65 50 72 70 73 47 59 56
55 45 39 70 52 64 67 67 57 53
From these data, we obtain
$n = 20$, $\bar{x} = 59.30$, $s = 9.84$
The researchers wanted to determine whether smoking abstinence had a negative impact on time perception, causing elapsed time to be overestimated. We can answer this question by testing
$H_0: \mu = 45$ (no consistent tendency to overestimate the time elapsed)
versus
$H_a: \mu > 45$ (tendency for elapsed time to be overestimated)
The null hypothesis is rejected only if there is convincing evidence that $\mu > 45$. The observed value, 59.30, is certainly larger than 45, but can a sample mean as large as this be plausibly explained by chance variation from one sample to another when $\mu = 45$? To answer this question, we carry out a hypothesis test with a significance level of .05 using the step-by-step procedure described in Section 10.3.
1. Population characteristic of interest: $\mu$ = mean perceived elapsed time for smokers who have abstained from smoking for 24 hours
2. Null hypothesis: $H_0: \mu = 45$
3. Alternative hypothesis: $H_a: \mu > 45$
4. Significance level: $\alpha = .05$
5. Test statistic: $t = \dfrac{\bar{x} - \text{hypothesized value}}{s/\sqrt{n}} = \dfrac{\bar{x} - 45}{s/\sqrt{n}}$
6. Assumptions: This test requires a random sample and either a large sample size or a normal population distribution. The authors of the paper believed that it was reasonable to consider this sample as representative of smokers in general, so we regard it as if it were a random sample. Because the sample size is only 20, for the t test to be appropriate, we must be willing to presume that the population distribution of perceived elapsed times is at least approximately normal. Is this reasonable?
The following graph gives a boxplot of the data:
[Boxplot of perceived elapsed time; horizontal axis from 40 to 70 seconds]
Although the boxplot is not perfectly symmetric, it is not too skewed and there are no outliers, so we judge the use of the t test to be reasonable.
7. Computations: $n = 20$, $\bar{x} = 59.30$, and $s = 9.84$, so $t = \dfrac{59.30 - 45}{9.84/\sqrt{20}} = \dfrac{14.30}{2.20} = 6.50$
8. P-value: This is an upper-tailed test (the inequality in $H_a$ is “greater than”), so the P-value is the area to the right of the computed t value. Because df $= 20 - 1 = 19$, we can use the df = 19 column of Appendix Table 4 to find the P-value. With $t = 6.50$, we obtain P-value = area to the right of 6.50 $\approx 0$ (because 6.50 is greater than 4.0, the largest tabulated value).
9. Conclusion: Because P-value $\le \alpha$, we reject $H_0$ at the .05 level of significance. There is virtually no chance of seeing a sample mean (and hence a t value) this extreme as a result of just chance variation when $H_0$ is true. There is convincing evidence that the mean perceived time elapsed is greater than the actual time elapsed of 45 sec.
This paper also looked at perception of elapsed time for a sample of nonsmokers and for a sample of smokers who had not abstained from smoking. The investigators found that the null hypothesis of $\mu = 45$ could not be rejected for either of these groups.

Example 10.14 Goofing Off at Work ●
A growing concern of employers is time spent in activities like surfing the Internet and emailing friends during work hours. The San Luis Obispo Tribune summarized the findings from a survey of a large sample of workers in an article that ran under the headline “Who Goofs Off 2 Hours a Day? Most Workers, Survey Says” (August 3, 2006).
Suppose that the CEO of a large company wants to determine whether the average amount of wasted time during an eight-hour work day for employees of her company is less than the reported 120 minutes. Each person in a random sample of 10 employees was contacted and asked about daily wasted time at work. (Participants would probably have to be guaranteed anonymity to obtain truthful responses!) The resulting data are the following:
108 112 117 130 111 131 113 113 105 128
Summary quantities are $n = 10$, $\bar{x} = 116.80$, and $s = 9.45$. Do these data provide evidence that the mean wasted time for this company is less than 120 min? To answer this question, let’s carry out a hypothesis test with $\alpha = .05$.
1. $\mu$ = mean daily wasted time for employees of this company
2. $H_0: \mu = 120$
3. $H_a: \mu < 120$
4. $\alpha = .05$
5. $t = \dfrac{\bar{x} - \text{hypothesized value}}{s/\sqrt{n}} = \dfrac{\bar{x} - 120}{s/\sqrt{n}}$
6. This test requires a random sample and either a large sample or a normal population distribution. The given sample was a random sample of employees. Because the sample size is small, we must be willing to assume that the population distribution of times is at least approximately normal. The following normal probability plot appears to be reasonably straight, and although the normal probability plot and the boxplot reveal some skewness in the sample, there are no outliers:
[Boxplot of wasted time, and normal probability plot of normal score versus wasted time; the wasted time axes run from 105 to 130]
Correlations (Pearson): Correlation of Time and Normal Score = 0.943
Also, the correlation between the expected normal scores and the observed data for this sample is .943, which is well above the critical r value for $n = 10$ of .880 (see Chapter 5 for critical r values). Based on these observations, it is plausible that the population distribution is approximately normal, so we proceed with the t test.
7.
Test statistic: $t = \dfrac{116.80 - 120}{9.45/\sqrt{10}} = -1.07$
8. From the df = 9 column of Appendix Table 4 and by rounding the test statistic value to $-1.1$, we get P-value = area to the left of $-1.1$ = area to the right of 1.1 = .150, as shown:
[t curve with df = 9; the shaded area of .150 lies to the left of −1.1]
9. Because the P-value $> \alpha$, we fail to reject $H_0$. There is not sufficient evidence to conclude that the mean wasted time per eight-hour work day for employees at this company is less than 120 minutes.
MINITAB could also have been used to carry out the test, as shown in the accompanying output:

    One-Sample T: Wasted Time
    Test of mu = 120 vs < 120
    Variable       N     Mean    StDev   SE Mean   95% Upper Bound       T       P
    Wasted Time   10  116.800    9.449     2.988           122.278   -1.07   0.156

Although we had to round the computed t value to $-1.1$ to use Appendix Table 4, MINITAB was able to compute the P-value corresponding to the actual value of the test statistic.

Example 10.15 Cricket Love
The article “Well-Fed Crickets Bowl Maidens Over” (Nature Science Update, February 11, 1999) reported that female field crickets are attracted to males that have high chirp rates and hypothesized that chirp rate is related to nutritional status. The usual chirp rate for male field crickets was reported to vary around a mean of 60 chirps per second. To investigate whether chirp rate was related to nutritional status, investigators fed male crickets a high protein diet for 8 days, after which chirp rate was measured. The mean chirp rate for the crickets on the high protein diet was reported to be 109 chirps per second. Is this convincing evidence that the mean chirp rate for crickets on a high protein diet is greater than 60 (which would then imply an advantage in attracting the ladies)?
Suppose that the sample size and sample standard deviation are $n = 32$ and $s = 40$. We test the relevant hypotheses with $\alpha = .01$.
1. $\mu$ = mean chirp rate for crickets on a high protein diet
2. $H_0: \mu = 60$
3. $H_a: \mu > 60$
4. $\alpha = .01$
5. $t = \dfrac{\bar{x} - \text{hypothesized value}}{s/\sqrt{n}} = \dfrac{\bar{x} - 60}{s/\sqrt{n}}$
6. This test requires a random sample and either a large sample or a normal population distribution. Because the sample size is large ($n = 32$), it is reasonable to proceed with the t test as long as we are willing to consider the 32 male field crickets in this study as if they were a random sample from the population of male field crickets.
7. Test statistic: $t = \dfrac{109 - 60}{40/\sqrt{32}} = \dfrac{49}{7.07} = 6.93$
8. This is an upper-tailed test, so the P-value is the area under the t curve with df = 31 and to the right of 6.93. From Appendix Table 4, P-value $\approx 0$.
9. Because P-value $\approx 0$, which is less than the significance level, $\alpha$, we reject $H_0$. There is convincing evidence that the mean chirp rate is higher for male field crickets that eat a high protein diet.

Statistical Versus Practical Significance

Carrying out a hypothesis test amounts to deciding whether the value obtained for the test statistic could plausibly have resulted when $H_0$ is true. When the value of the test statistic leads to rejection of $H_0$, it is customary to say that the result is statistically significant at the chosen level $\alpha$. The finding of statistical significance means that, in the investigator’s opinion, the observed deviation from what was expected under $H_0$ cannot plausibly be attributed only to chance variation. However, statistical significance cannot be equated with the conclusion that the true situation differs from what $H_0$ states in any practical sense.
That is, even after $H_0$ has been rejected, the data may suggest that there is no practical difference between the true value of the population characteristic and what the null hypothesis states that value to be. This is illustrated in Example 10.16.

Example 10.16 “Significant” but Unimpressive Test Score Improvement
Let $\mu$ denote the true average score on a standardized test for children in a certain region of the United States. The average score for all children in the United States is 100. Regional education authorities are interested in testing $H_0: \mu = 100$ versus $H_a: \mu > 100$ using a significance level of .001. A sample of 2500 children resulted in the values $n = 2500$, $\bar{x} = 101.0$, and $s = 15.0$. Then
$t = \dfrac{101.0 - 100}{15/\sqrt{2500}} = 3.33$
This is an upper-tailed test, so (using the z column of Appendix Table 4 because df = 2499) P-value = area to the right of 3.33 $\approx .000$. Because P-value $< .001$, we reject $H_0$. The true mean score for this region does appear to exceed 100. However, with $n = 2500$, the point estimate $\bar{x} = 101.0$ is almost surely very close to the true value of $\mu$. Therefore it looks as though $H_0$ was rejected because $\mu \approx 101$ rather than 100. And, from a practical point of view, a 1-point difference is most likely of no practical importance. A statistically significant result does not necessarily mean that there are any practical consequences.

Exercises 10.45–10.64

10.45 Newly purchased automobile tires of a certain type are supposed to be filled to a pressure of 30 psi. Let $\mu$ denote the true average pressure. Find the P-value associated with each of the following given z statistic values for testing $H_0: \mu = 30$ versus $H_a: \mu \ne 30$ when $\sigma$ is known:
a.
2.10 b. 1.75 c. 0.58 d. 1.44 e. 5.00

10.46 The desired percentage of silicon dioxide in a certain type of cement is 5.0%. A random sample of $n = 36$ specimens gave a sample average percentage of $\bar{x} = 5.21$. Let $\mu$ be the true average percentage of silicon dioxide in this type of cement, and suppose that $\sigma$ is known to be 0.38. Test $H_0: \mu = 5$ versus $H_a: \mu \ne 5$ using a significance level of .01.

10.47 Give as much information as you can about the P-value of a t test in each of the following situations:
a. Upper-tailed test, df = 8, t = 2.0
b. Upper-tailed test, n = 14, t = 3.2
c. Lower-tailed test, df = 10, t = 2.4
d. Lower-tailed test, n = 22, t = 4.2
e. Two-tailed test, df = 15, t = 1.6
f. Two-tailed test, n = 16, t = 1.6
g. Two-tailed test, n = 16, t = 6.3

10.48 Give as much information as you can about the P-value of a t test in each of the following situations:
a. Two-tailed test, df = 9, t = 0.73
b. Upper-tailed test, df = 10, t = 0.5
c. Lower-tailed test, n = 20, t = 2.1
d. Lower-tailed test, n = 20, t = 5.1
e. Two-tailed test, n = 40, t = 1.7

10.49 Paint used to paint lines on roads must reflect enough light to be clearly visible at night. Let $\mu$ denote the true average reflectometer reading for a new type of paint under consideration. A test of $H_0: \mu = 20$ versus $H_a: \mu > 20$ based on a sample of 15 observations gave $t = 3.2$. What conclusion is appropriate at each of the following significance levels?
a. $\alpha = .05$ b. $\alpha = .01$ c. $\alpha = .001$

10.50 A certain pen has been designed so that true average writing lifetime under controlled conditions (involving the use of a writing machine) is at least 10 hr. A random sample of 18 pens is selected, the writing lifetime of each is determined, and a normal probability plot of the resulting data supports the use of a one-sample t test. The relevant hypotheses are $H_0: \mu = 10$ versus $H_a: \mu < 10$.
a. If t = 2.3 and $\alpha = .05$ is selected, what conclusion is appropriate?
b. If t = 1.83 and $\alpha = .01$ is selected, what conclusion is appropriate?
c. If t = 0.47, what conclusion is appropriate?
10.51 The true average diameter of ball bearings of a certain type is supposed to be 0.5 in. What conclusion is appropriate when testing H0: m 0.5 versus Ha: m 0.5 in each of the following situations: a. n 13, t 1.6, a .05 b. n 13, t 1.6, a .05 c. n 25, t 2.6, a .01 d. n 25, t 3.6 10.52 A credit bureau analysis of undergraduate students credit records found that the average number of credit cards in an undergraduate’s wallet was 4.09 (“Undergraduate Students and Credit Cards in 2004,” Nellie Mae, May 2005). It was also reported that in a random sample of 132 undergraduates, the sample mean number of credit cards carried was 2.6. The sample standard deviation was not reported, but for purposes of this exercise, suppose that it was 1.2. Is there convincing evidence that the mean number of credit cards that undergraduates report carrying is less than the credit bureau’s figure of 4.09? 10.53 ● Medical research has shown that repeated wrist extension beyond 20 degrees increases the risk of wrist and hand injuries. Each of 24 students at Cornell University used a proposed new mouse design, and while using the mouse their wrist extension was recorded for each one. Data consistent with summary values given in the paper “Comparative Study of Two Computer Mouse Designs” (Cornell Human Factors Laboratory Technical Report RP7992) are given. Use these data to test the hypothesis that the mean wrist extension for people using this new mouse design is greater than 20 degrees. Are any assumptions required in order for it to be appropriate to generalize the results of your test to the population of Cornell students? To the population of all university students? (data on next page) ● Data set available online but not required ▼ Video solution available 10-W4959 10/7/08 3:19 PM Page 560 560 C h a p t e r 10 ■ Hypothesis Testing Using a Single Sample 27 28 24 26 27 25 25 24 24 24 25 28 22 25 24 28 27 26 31 25 28 27 27 25 c. 
Explain why the null hypothesis was rejected in the test of Part (b) but not in the test of Part (a). 10.54 The international polling organization Ipsos reported data from a survey of 2000 randomly selected Canadians who carry debit cards (Canadian Account Habits Survey, July 24, 2006). Participants in this survey were asked what they considered the minimum purchase amount for which it would be acceptable to use a debit card. Suppose that the sample mean and standard deviation were $9.15 and $7.60 respectively. (These values are consistent with a histogram of the sample data that appears in the report.) Do these data provide convincing evidence that the mean minimum purchase amount for which Canadians consider the use of a debit card to be appropriate is less than $10? Carry out a hypothesis test with a significance level of .01. 10.57 A survey of teenagers and parents in Canada conducted by the polling organization Ipsos (“Untangling the Web: The Facts About Kids and the Internet,” January 25, 2006) included questions about Internet use. It was reported that for a sample of 534 randomly selected teens, the mean number of hours per week spent online was 14.6 and the standard deviation was 11.6. a. What does the large standard deviation, 11.6 hours, tell you about the distribution of online times for this sample of teens? b. Do the sample data provide convincing evidence that the mean number of hours that teens spend online is greater than 10 hours per week? 10.55 A comprehensive study conducted by the National Institute of Child Health and Human Development tracked more than 1000 children from an early age through elementary school (New York Times, November 1, 2005). The study concluded that children who spent more than 30 hours a week in child care before entering school tended to score higher in math and reading when they were in the third grade. 
The researchers cautioned that the findings should not be a cause for alarm because the effects of child care were found to be small. Explain how the difference between the mean math score for third-graders who spent long hours in child care and the overall mean for third-graders could be small but the researchers could still reach the conclusion that the mean for the child care group is significantly higher than the overall mean for third-graders.

10.56 In a study of computer use, 1000 randomly selected Canadian Internet users were asked how much time they spend using the Internet in a typical week (Ipsos Reid, August 9, 2005). The mean of the 1000 resulting observations was 12.7 hours. a. The sample standard deviation was not reported, but suppose that it was 5 hours. Carry out a hypothesis test with a significance level of .05 to decide if there is convincing evidence that the mean time spent using the Internet by Canadians is greater than 12.5 hours. b. Now suppose that the sample standard deviation was 2 hours. Carry out a hypothesis test with a significance level of .05 to decide if there is convincing evidence that the mean time spent using the Internet by Canadians is greater than 12.5 hours.

10.58 The same survey referenced in the previous exercise reported that for a random sample of 676 parents of Canadian teens, the mean number of hours parents thought their teens spent online was 6.5 and the sample standard deviation was 8.6. a. Do the sample data provide convincing evidence that the mean number of hours that parents think their teens spend online is less than 10 hours per week? b. Write a few sentences commenting on the results of the test in Part (a) and of the test in Part (b) of the previous exercise.
10.59 The paper titled “Music for Pain Relief” (The Cochrane Database of Systematic Reviews, April 19, 2006) concluded, based on a review of 51 studies of the effect of music on pain intensity, that “Listening to music reduces pain intensity levels . . . However, the magnitude of these positive effects is small, the clinical relevance of music for pain relief in clinical practice is unclear.” Are the authors of this paper claiming that the pain reduction attributable to listening to music is not statistically significant, not practically significant, or neither statistically nor practically significant? Explain. 10.60 Typically, only very brave students are willing to speak out in a college classroom. Student participation may be especially difficult if the individual is from a different culture or country. The article “An Assessment of Class Participation by International Graduate Students” (Journal of College Student Development [1995]: 132– 140) considered a numerical “speaking-up” scale, with possible values from 3 to 15 (a low value means that a ● Data set available online but not required ▼ Video solution available 10-W4959 10/7/08 3:19 PM Page 561 10.4 255 225 10.63 ● ▼ Many consumers pay careful attention to stated nutritional contents on packaged foods when making purchases. It is therefore important that the information on packages be accurate. A random sample of n 12 frozen dinners of a certain type was selected from production during a particular period, and the calorie content of each one was determined. (This determination entails destroying the product, so a census would certainly not be desirable!) Here are the resulting observations, along with a boxplot and normal probability plot: Bold exercises answered in back 244 226 239 251 242 233 265 265 245 245 259 259 248 248 265 255 10.61 ● ▼ A well-designed and safe workplace can contribute greatly to increasing productivity. 
It is especially important that workers not be asked to perform tasks, such as lifting, that exceed their capabilities. The following data on maximum weight of lift (MWOL, in kilograms) for a frequency of 4 lifts per minute were reported in the article “The Effects of Speed, Frequency, and Load on Measured Hand Forces for a Floor-to-Knuckle Lifting Task” (Ergonomics [1992]: 833–843): 25.8 36.6 26.3 21.8 27.2 Suppose that it is reasonable to regard the sample as a random sample from the population of healthy males, age 18–30. Do the data suggest that the population mean MWOL exceeds 25? Carry out a test of the relevant hypotheses using a .05 significance level. 10.62 An article titled “Teen Boys Forget Whatever It Was” appeared in the Australian newspaper The Mercury (April 21, 1997). It described a study of academic performance and attention span and reported that the mean time to distraction for teenage boys working on an independent task was 4 min. Although the sample size was not given in the article, suppose that this mean was based on a random sample of 50 teenage Australian boys and that the sample standard deviation was 1.4 min. Is there convincing evidence that the average attention span for teenage boys is less than 5 min? Test the relevant hypotheses using a .01. Hypothesis Tests for a Population Mean 561 Calories student rarely speaks). For a random sample of 64 males from Asian countries where English is not the official language, the sample mean and sample standard deviation were 8.75 and 2.57, respectively. Suppose that the mean for the population of all males having English as their native language is 10.0 (suggested by data in the article). Does it appear that the population mean for males from non-English-speaking Asian countries is smaller than 10.0? ■ 245 235 225 265 Calories 255 245 235 225 1.5 0.5 0.5 Normal score 1.5 a. Is it reasonable to test hypotheses about true average calorie content m by using a t test? b. 
The stated calorie content is 240. Does the boxplot suggest that true average content differs from the stated value? Explain your reasoning. c. Carry out a formal test of the hypotheses suggested in Part (b).

10.64 ● Much concern has been expressed in recent years regarding the practice of using nitrates as meat preservatives. In one study involving possible effects of these chemicals, bacteria cultures were grown in a medium containing nitrates. The rate of uptake of radio-labeled amino acid was then determined for each culture, yielding the following observations:

7251 7064 6871 7494 9632 7883 6866 8178 9094 7523 5849 8724 8957 7468 7978 0000

Suppose that it is known that the true average uptake for cultures without nitrates is 8000. Do the data suggest that the addition of nitrates results in a decrease in the true average uptake? Test the appropriate hypotheses using a significance level of .10.

10.5 Power and Probability of Type II Error

In this chapter, we have introduced test procedures for testing hypotheses about population characteristics, such as \(\mu\) and \(p\). What characterizes a “good” test procedure? It makes sense to think that a good test procedure is one that has both a small probability of rejecting \(H_0\) when it is true (a Type I error) and a high probability of rejecting \(H_0\) when it is false. The test procedures presented in this chapter allow us to directly control the probability of rejecting a true \(H_0\) by our choice of the significance level \(\alpha\). But what about the probability of rejecting \(H_0\) when it is false? As we will see, several factors influence this probability. Let’s begin by considering an example.
Suppose that the student body president at a university is interested in studying the amount of money that students spend on textbooks each semester. The director of the financial aid office believes that the average amount spent on books is $300 per semester and uses this figure to determine the amount of financial aid for which a student is eligible. The student body president plans to ask each individual in a random sample of students how much he or she spent on books this semester and has decided to use the resulting data to test

\[ H_0: \mu = 300 \quad \text{versus} \quad H_a: \mu > 300 \]

using a significance level of .05. If the true mean is 300 (or less than 300), the correct decision is to fail to reject the null hypothesis (incorrectly rejecting the null hypothesis is a Type I error). On the other hand, if the true mean is 325 or 310 or even 301, the correct decision is to reject the null hypothesis (not rejecting the null hypothesis is a Type II error). How likely is it that the null hypothesis will in fact be rejected?

If the true mean is 301, the probability that we reject \(H_0: \mu = 300\) is not very great. This is because when we carry out the test, we are essentially looking at the sample mean and asking, Does this look like what we would expect to see if the population mean were 300? As illustrated in Figure 10.4, if the true mean is greater than but very close to 300, chances are that the sample mean will look pretty much like what we would expect to see if the population mean were 300, and we will be unconvinced that the null hypothesis should be rejected. If the true mean is 325, it is less likely that the sample will be mistaken for a sample from a population with mean 300; sample means will tend to cluster around 325, and so it is more likely that we will correctly reject \(H_0\). If the true mean is 350, rejection of \(H_0\) is even more likely.

Figure 10.4 Sampling distribution of \(\bar{x}\) when \(\mu\) = 300, 305, 325.
When we consider the probability of rejecting the null hypothesis, we are looking at what statisticians refer to as the power of the test.

The power of a test is the probability of rejecting the null hypothesis.

From the previous discussion, it should be apparent that the power of the test when a hypothesis about a population mean is being tested depends on the true value of the population mean, \(\mu\). Because the true value of \(\mu\) is unknown (if we knew the value of \(\mu\) we wouldn’t be doing the hypothesis test!), we cannot know what the power is for the actual true value of \(\mu\). It is possible, however, to gain some insight into the power of a test by looking at a number of “what if” scenarios. For example, we might ask, What is the power if the true mean is 325? or What is the power if the true mean is 310? and so on. That is, we can determine the power at \(\mu = 325\), the power at \(\mu = 310\), and the power at any other value of interest. Although it is technically possible to consider power when the null hypothesis is true, an investigator is usually concerned about the power only at values for which the null hypothesis is false.

In general, when testing a hypothesis about a population characteristic, there are three factors that influence the power of the test:

1. The size of the difference between the true value of the population characteristic and the hypothesized value (the value that appears in the null hypothesis)
2. The choice of significance level, \(\alpha\), for the test
3. The sample size

Effect of Various Factors on the Power of a Test

1. The larger the size of the discrepancy between the hypothesized value and the true value of the population characteristic, the higher the power.
2. The larger the significance level, \(\alpha\), the higher the power of the test.
3. The larger the sample size, the higher the power of the test.
Let’s consider each of these three statements. The first statement has already been discussed in the context of the textbook example. Because power is the probability of rejecting the null hypothesis, it makes sense that the power will be higher when the true value of a population characteristic is quite different from the hypothesized value than when it is close to that value.

The effect of significance level on power is not quite as obvious. To understand the relationship between power and significance level, it helps to see the relationship between power and \(\beta\), the probability of a Type II error.

When \(H_0\) is false, power = \(1 - \beta\).

This relationship follows from the definitions of power and Type II error. A Type II error results from not rejecting a false \(H_0\). Because power is the probability of rejecting \(H_0\), it follows that when \(H_0\) is false

\[ \text{power} = P(\text{rejecting a false } H_0) = 1 - P(\text{not rejecting a false } H_0) = 1 - \beta \]

Recall from Section 10.2 that the choice of \(\alpha\), the Type I error probability, affects the value of \(\beta\), the Type II error probability. Choosing a larger value for \(\alpha\) results in a smaller value for \(\beta\) (and thus a larger value for \(1 - \beta\)). In terms of power, this means that choosing a larger value for \(\alpha\) results in a larger value for the power of the test. That is, the larger the Type I error probability we are willing to tolerate, the more likely it is that the test will be able to detect any particular departure from \(H_0\).

The third factor that affects the power of a test is the sample size. When \(H_0\) is false, the power of a test is the probability that we will in fact “detect” that \(H_0\) is false and, based on the observed sample, reject \(H_0\). Intuition suggests that we will be more likely to detect a departure from \(H_0\) with a large sample than with a small sample. This is in fact the case: the larger the sample size, the higher the power.
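The three effects just described can be checked numerically. The following is a minimal Python sketch, assuming an upper-tailed z test of a null value of 300 with a known population standard deviation. The value 100 used for that standard deviation and the function name `z_test_power` are illustrative assumptions; the textbook example does not give a standard deviation.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu_true, sigma, n, alpha):
    """Power of an upper-tailed z test of H0: mu = mu0 when the true mean is mu_true."""
    se = sigma / sqrt(n)                         # standard deviation of the sample mean
    z_crit = NormalDist().inv_cdf(1 - alpha)     # reject H0 when z >= z_crit
    cutoff = mu0 + z_crit * se                   # same rule stated on the x-bar scale
    beta = NormalDist(mu_true, se).cdf(cutoff)   # P(x-bar < cutoff) = Type II error probability
    return 1 - beta


SIGMA = 100  # hypothetical population standard deviation (not from the text)

baseline = z_test_power(300, 325, SIGMA, n=75, alpha=0.01)
larger_difference = z_test_power(300, 350, SIGMA, n=75, alpha=0.01)
larger_alpha = z_test_power(300, 325, SIGMA, n=75, alpha=0.05)
larger_sample = z_test_power(300, 325, SIGMA, n=100, alpha=0.01)

print(f"baseline (mu = 325, n = 75, alpha = .01): {baseline:.3f}")
print(f"true mean 350 instead of 325:             {larger_difference:.3f}")
print(f"alpha = .05 instead of .01:               {larger_alpha:.3f}")
print(f"n = 100 instead of 75:                    {larger_sample:.3f}")
```

Each of the last three printed values should exceed the baseline, whatever positive value is chosen for the assumed standard deviation; only the sizes of the gaps change.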
Consider testing the hypotheses presented previously:

\[ H_0: \mu = 300 \quad \text{versus} \quad H_a: \mu > 300 \]

The observations about power imply the following, for example:

1. For any value of \(\mu\) exceeding 300, the power of a test based on a sample of size 100 is higher than the power of a test based on a sample of size 75 (assuming the same significance level).
2. For any value of \(\mu\) exceeding 300, the power of a test using a significance level of .05 is higher than the power of a test using a significance level of .01 (assuming the same sample size).
3. The power of the test is greater if the true mean is 350 than if the true mean is 325 (assuming the same sample size and significance level).

As was mentioned previously in this section, it is impossible to calculate the actual power of a test because in practice we do not know the true value of population characteristics. However, we can evaluate the power at a selected alternative value if we want to know whether the power would be high or low if this alternative value is the true value. The following optional subsection shows how Type II error probabilities and power can be evaluated for selected tests.

■ Calculating Power and Type II Error Probabilities for Selected Tests (Optional)

The test procedures presented in this chapter are designed to control the probability of a Type I error (rejecting \(H_0\) when \(H_0\) is true) at the desired level \(\alpha\). However, little has been said so far about calculating the value of \(\beta\), the probability of a Type II error (not rejecting \(H_0\) when \(H_0\) is false). Here we consider the determination of \(\beta\) and power for the hypothesis tests previously introduced.

When we carry out a hypothesis test, we specify the desired value of \(\alpha\), the probability of a Type I error. The probability of a Type II error, \(\beta\), is the probability of not rejecting \(H_0\) even though it is false.
Suppose that we are testing

\[ H_0: \mu = 1.5 \quad \text{versus} \quad H_a: \mu > 1.5 \]

Because we do not know the true value of \(\mu\), we cannot calculate the actual value of \(\beta\). However, the vulnerability of the test to Type II error can be investigated by calculating \(\beta\) for several different potential values of \(\mu\), such as \(\mu = 1.55\), \(\mu = 1.6\), and \(\mu = 1.7\). Once the value of \(\beta\) has been determined, the power of the test at the corresponding alternative value is just \(1 - \beta\).

Example 10.17 Calculating Power

A cigarette manufacturer claims that the mean nicotine content of its cigarettes is 1.5 mg. We might investigate this claim by testing

\[ H_0: \mu = 1.5 \quad \text{versus} \quad H_a: \mu > 1.5 \]

where \(\mu\) is the true mean nicotine content. A random sample of \(n = 36\) cigarettes is to be selected, and the resulting data will be used to reach a conclusion. Suppose that the standard deviation of nicotine content (\(\sigma\)) is known to be 0.20 mg and that a significance level of .01 is to be used. Our test statistic (because \(\sigma = 0.20\)) is

\[ z = \frac{\bar{x} - 1.5}{.20/\sqrt{n}} = \frac{\bar{x} - 1.5}{.20/\sqrt{36}} = \frac{\bar{x} - 1.5}{.0333} \]

The inequality in \(H_a\) implies that

P-value = area under z curve to the right of calculated z

From Appendix Table 2, it is easily verified that the z critical value 2.33 captures an upper-tail z curve area of .01. Thus, P-value \(\le .01\) if and only if \(z \ge 2.33\). This is equivalent to the decision rule

reject \(H_0\) if calculated \(z \ge 2.33\)

which becomes

\[ \text{reject } H_0 \text{ if } \frac{\bar{x} - 1.5}{.0333} \ge 2.33 \]

Solving this inequality for \(\bar{x}\) we get

\[ \bar{x} \ge 1.5 + 2.33(.0333) \quad \text{or} \quad \bar{x} \ge 1.578 \]

So if \(\bar{x} \ge 1.578\), we will reject \(H_0\), and if \(\bar{x} < 1.578\), we will fail to reject \(H_0\). This decision rule corresponds to \(\alpha = .01\).

Suppose now that \(\mu = 1.6\) (so that \(H_0\) is false). A Type II error will then occur if \(\bar{x} < 1.578\). What is the probability that this occurs?
If \(\mu = 1.6\), the sampling distribution of \(\bar{x}\) is approximately normal, centered at 1.6, and has a standard deviation of .0333. The probability of observing an \(\bar{x}\) value less than 1.578 can then be determined by finding an area under a normal curve with mean 1.6 and standard deviation .0333, as illustrated in Figure 10.5.

Figure 10.5 \(\beta\) when \(\mu = 1.6\) in Example 10.17: the \(\bar{x}\) distribution is normal with mean 1.6 and standard deviation 0.0333, and \(\beta = P(\bar{x} < 1.578)\) is the area to the left of 1.578.

Because the curve in Figure 10.5 is not the standard normal (z) curve, we must first convert to a z score before using Appendix Table 2 to find the area. Here,

\[ z \text{ score for } 1.578 = \frac{1.578 - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{1.578 - 1.6}{.0333} = -.66 \]

and

area under z curve to left of −0.66 = .2546

So, if \(\mu = 1.6\), \(\beta = .2546\). This means that if \(\mu\) is 1.6, about 25% of all samples would still result in \(\bar{x}\) values less than 1.578 and failure to reject \(H_0\). The power of the test at \(\mu = 1.6\) is then

(power at \(\mu = 1.6\)) = 1 − (\(\beta\) when \(\mu\) is 1.6) = 1 − .2546 = .7454

Thus, if the true mean is 1.6, the probability of rejecting \(H_0: \mu = 1.5\) in favor of \(H_a: \mu > 1.5\) is .7454. That is, if \(\mu\) is 1.6 and the test is used repeatedly with random samples selected from the population, in the long run about 75% of the samples will result in the correct conclusion to reject \(H_0\).

Now consider \(\beta\) and power when \(\mu = 1.65\). The normal curve in Figure 10.5 would then be centered at 1.65. Because \(\beta\) is the area to the left of 1.578 and the curve has shifted to the right, \(\beta\) decreases. Converting 1.578 to a z score and using Appendix Table 2 gives \(\beta = .0154\). Also,

(power at \(\mu = 1.65\)) = 1 − .0154 = .9846

As expected, the power at \(\mu = 1.65\) is higher than the power at \(\mu = 1.6\) because 1.65 is farther from the hypothesized value of 1.5. ■

MINITAB can calculate the power for specified values of \(\sigma\), \(\alpha\), \(n\), and the difference between the true and hypothesized values of \(\mu\).
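For readers without access to MINITAB, the hand calculation of Example 10.17 can also be reproduced with a short script. This is a minimal Python sketch, assuming only the standard library; the function name `upper_tailed_z_power` is a hypothetical helper, not part of any statistics package.

```python
from math import sqrt
from statistics import NormalDist


def upper_tailed_z_power(mu0, mu_alt, sigma, n, alpha):
    """Return (x-bar cutoff, beta, power) for an upper-tailed z test with known sigma."""
    se = sigma / sqrt(n)                                   # sigma of x-bar: 0.20 / sqrt(36)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se    # reject H0 when x-bar >= cutoff
    beta = NormalDist(mu_alt, se).cdf(cutoff)              # area to the left of the cutoff
    return cutoff, beta, 1 - beta


# Values from Example 10.17: H0: mu = 1.5, sigma = 0.20, n = 36, alpha = .01
for mu_alt in (1.60, 1.65):
    cutoff, beta, power = upper_tailed_z_power(1.5, mu_alt, 0.20, 36, 0.01)
    print(f"mu = {mu_alt:.2f}: cutoff = {cutoff:.4f}, beta = {beta:.4f}, power = {power:.4f}")
```

Because the sketch carries the unrounded critical value (about 2.326 rather than 2.33) and the unrounded standard deviation of the sample mean, its power values come out near .750 and .985, slightly above the hand-calculated .7454 and .9846.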
The following MINITAB output shows power calculations corresponding to those in Example 10.17:

1-Sample Z Test
Testing mean = null (versus > null)
Alpha = 0.01  Sigma = 0.2  Sample Size = 36

Difference   Power
0.10         0.7497
0.15         0.9851

The slight differences between the power values computed by MINITAB and those previously obtained are due to rounding in Example 10.17.

The probability of a Type II error and the power for z tests concerning a population proportion are calculated in an analogous manner.

Example 10.18 Power for Testing Hypotheses About Proportions

A package delivery service advertises that at least 90% of all packages brought to its office by 9 A.M. for delivery in the same city are delivered by noon that day. Let \(p\) denote the proportion of all such packages actually delivered by noon. The hypotheses of interest are

\[ H_0: p = .9 \quad \text{versus} \quad H_a: p < .9 \]

where the alternative hypothesis states that the company’s claim is untrue. The value \(p = .8\) represents a substantial departure from the company’s claim. If the hypotheses are tested at level .01 using a sample of \(n = 225\) packages, what is the probability that the departure from \(H_0\) represented by this alternative value will go undetected?

At significance level .01, \(H_0\) is rejected if P-value \(\le .01\). For the case of a lower-tailed test, this is the same as rejecting \(H_0\) if

\[ z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{\hat{p} - .9}{\sqrt{\dfrac{(.9)(.1)}{225}}} = \frac{\hat{p} - .9}{.02} \le -2.33 \]

(Because −2.33 captures a lower-tail z curve area of .01, the smallest 1% of all z values satisfy \(z \le -2.33\).) This inequality is equivalent to \(\hat{p} \le .853\), so \(H_0\) is not rejected if \(\hat{p} > .853\). When \(p = .8\), \(\hat{p}\) has approximately a normal distribution with

\[ \mu_{\hat{p}} = .8 \qquad \sigma_{\hat{p}} = \sqrt{\frac{(.8)(.2)}{225}} = .0267 \]

Then \(\beta\) is the probability of obtaining a sample proportion greater than .853, as illustrated in Figure 10.6.
Figure 10.6 \(\beta\) when \(p = .8\) in Example 10.18: the sampling distribution of \(\hat{p}\) is normal with mean 0.8 and standard deviation 0.0267, and \(\beta\) is the area to the right of 0.853.

Converting to a z score results in

\[ z = \frac{.853 - .8}{.0267} = 1.99 \]

and Appendix Table 2 gives

\[ \beta = 1 - .9767 = .0233 \]

When \(p = .8\) and a level .01 test is used, less than 3% of all samples of size \(n = 225\) will result in a Type II error. The power of the test at \(p = .8\) is 1 − .0233 = .9767. This means that the probability of rejecting \(H_0: p = .9\) in favor of \(H_a: p < .9\) when \(p\) is really .8 is .9767, which is quite high. ■

■ \(\beta\) and Power for the t Test (Optional)

The power and \(\beta\) values for t tests can be determined by using a set of curves specially constructed for this purpose or by utilizing appropriate software. As with the z test, the value of \(\beta\) depends not only on the true value of \(\mu\) but also on the selected significance level \(\alpha\); \(\beta\) increases as \(\alpha\) is made smaller. In addition, \(\beta\) depends on the number of degrees of freedom, \(n - 1\). For any fixed level \(\alpha\), it should be easier for the test to detect a specific departure from \(H_0\) when \(n\) is large than when \(n\) is small. This is indeed the case; for a fixed alternative value, \(\beta\) decreases as \(n - 1\) increases.

Unfortunately, there is one other quantity on which \(\beta\) depends: the population standard deviation \(\sigma\). As \(\sigma\) increases, so does \(\sigma_{\bar{x}}\). This in turn makes it more likely that an \(\bar{x}\) value far from \(\mu\) will be observed, resulting in an erroneous conclusion. Once \(\alpha\) is specified and \(n\) is fixed, the determination of \(\beta\) at a particular alternative value of \(\mu\) requires that a value of \(\sigma\) be chosen, because each different value of \(\sigma\) yields a different value of \(\beta\). (This did not present a problem with the z test because when using a z test, the value of \(\sigma\) is known.)
If the investigator can specify a range of plausible values for \(\sigma\), then using the largest such value will give a pessimistic \(\beta\) (one on the high side) and a pessimistic value of power (one on the low side).

Figure 10.7 shows three different \(\beta\) curves for a one-tailed t test (appropriate for \(H_a: \mu >\) hypothesized value or for \(H_a: \mu <\) hypothesized value). A more complete set of curves for both one- and two-tailed tests when \(\alpha = .05\) and when \(\alpha = .01\) appears in Appendix Table 5. To determine \(\beta\), first compute the quantity

\[ d = \frac{|\text{alternative value} - \text{hypothesized value}|}{\sigma} \]

Then locate \(d\) on the horizontal axis, move directly up to the curve for \(n - 1\) df, and move over to the vertical axis to read \(\beta\).

Figure 10.7 \(\beta\) curves for the one-tailed t test (curves shown for \(\alpha = .01\), df = 6; \(\alpha = .05\), df = 6; and \(\alpha = .01\), df = 19; horizontal axis: value of \(d\); vertical axis: associated value of \(\beta\)).

Example 10.19 \(\beta\) and Power for t Tests

Consider testing

\[ H_0: \mu = 100 \quad \text{versus} \quad H_a: \mu > 100 \]

and focus on the alternative value \(\mu = 110\). Suppose that \(\sigma = 10\), the sample size is \(n = 7\), and a significance level of .01 has been selected. For \(\sigma = 10\),

\[ d = \frac{|110 - 100|}{10} = \frac{10}{10} = 1 \]

Figure 10.7 (using df = 7 − 1 = 6) gives \(\beta \approx .6\). The interpretation is that if \(\sigma = 10\) and a level .01 test based on \(n = 7\) is used when \(\mu = 110\) (and thus \(H_0\) is false), roughly 60% of all samples result in erroneously not rejecting \(H_0\)! Equivalently, the power of the test at \(\mu = 110\) is only 1 − .6 = .4. The probability of rejecting \(H_0\) when \(\mu = 110\) is not very large. If a level .05 test is used instead, then \(\beta \approx .3\), which is still rather large. Using a level .01 test with \(n = 20\) (df = 19) yields, from Figure 10.7, \(\beta \approx .05\). At the alternative value \(\mu = 110\), for \(\sigma = 10\) the level .01 test based on \(n = 20\) has smaller \(\beta\) than the level .05 test with \(n = 7\). Substantially increasing \(n\) counterbalances using the smaller \(\alpha\).
Now consider the alternative \(\mu = 105\), again with \(\sigma = 10\), so that

\[ d = \frac{|105 - 100|}{10} = \frac{5}{10} = .5 \]

Then, from Figure 10.7, \(\beta \approx .95\) when \(\alpha = .01\), \(n = 7\); \(\beta \approx .7\) when \(\alpha = .05\), \(n = 7\); and \(\beta \approx .65\) when \(\alpha = .01\), \(n = 20\). These values of \(\beta\) are all quite large; with \(\sigma = 10\), \(\mu = 105\) is too close to the hypothesized value of 100 for any of these three tests to have a good chance of detecting such a departure from \(H_0\). A substantial decrease in \(\beta\) necessitates using a much larger sample size. For example, from Appendix Table 5, \(\beta \approx .08\) when \(\alpha = .05\) and \(n = 40\).

The curves in Figure 10.7 also give \(\beta\) when testing \(H_0: \mu = 100\) versus \(H_a: \mu < 100\). If the alternative value \(\mu = 90\) is of interest and \(\sigma = 10\),

\[ d = \frac{|90 - 100|}{10} = \frac{10}{10} = 1 \]

and values of \(\beta\) are the same as those given in the first paragraph of this example. ■

Because curves for only selected degrees of freedom appear in Appendix Table 5, other degrees of freedom require a visual approximation. For example, the 27-df curve (for \(n = 28\)) lies between the 19-df and 29-df curves, which do appear, and it is closer to the 29-df curve. This type of approximation is adequate because it is the general magnitude of \(\beta\) (large, small, or moderate) that is of primary concern.

MINITAB can also evaluate power for the t test. For example, the following output shows MINITAB calculations for power at \(\mu = 110\) for samples of size 7 and 20 when \(\alpha = .01\). The corresponding approximate values from Appendix Table 5 found in Example 10.19 are fairly close to the MINITAB values.

1-Sample t Test
Testing mean = null (versus > null)
Calculating power for mean = null + 10
Alpha = 0.01  Sigma = 10

Sample Size   Power
7             0.3968
20            0.9653

The \(\beta\) curves in Appendix Table 5 are those for t tests. When the alternative value in \(H_a\) corresponds to a value of \(d\) relatively close to 0, \(\beta\) for a t test may be rather large.
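When neither the curves nor MINITAB is at hand, the power of the t test at a chosen alternative can be approximated by simulation. The following is a rough Python sketch, assuming a normal population and using only the standard library. The function name `simulated_t_power`, the seed, and the number of repetitions are illustrative choices, and the two critical values (3.143 for 6 df and 2.539 for 19 df, upper-tail area .01) are taken from a standard t table rather than from this section.

```python
import random
from math import sqrt


def simulated_t_power(mu0, mu_true, sigma, n, t_crit, reps=20000, seed=1):
    """Estimate the power of an upper-tailed one-sample t test by simulation."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Draw one sample of size n from a normal population with the alternative mean
        sample = [rng.gauss(mu_true, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = (xbar - mu0) / (s / sqrt(n))
        if t >= t_crit:          # upper-tailed rejection rule
            rejections += 1
    return rejections / reps     # proportion of samples that reject H0


# Setting of Example 10.19: H0: mu = 100, alternative mu = 110, sigma = 10, alpha = .01
T_CRIT_DF6 = 3.143    # assumed t critical value, 6 df, upper-tail area .01
T_CRIT_DF19 = 2.539   # assumed t critical value, 19 df, upper-tail area .01

power_n7 = simulated_t_power(100, 110, 10, 7, T_CRIT_DF6)
power_n20 = simulated_t_power(100, 110, 10, 20, T_CRIT_DF19)
print(f"estimated power, n = 7:  {power_n7:.3f}")
print(f"estimated power, n = 20: {power_n20:.3f}")
```

Up to simulation error, the two estimates should land close to the MINITAB values of 0.3968 and 0.9653 shown above. Simulation is used here only because the standard library has no noncentral t distribution; it is not the method MINITAB uses.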
One might ask whether there is another type of test that has the same level of significance \(\alpha\) as does the t test and smaller values of \(\beta\). The following result provides the answer to this question.

When the population distribution is normal, the t test for testing hypotheses about \(\mu\) has smaller \(\beta\) than does any other test procedure that has the same level of significance \(\alpha\).

Stated another way, among all tests with level of significance \(\alpha\), the t test makes \(\beta\) as small as it can possibly be when the population distribution is normal. In this sense, the t test is a best test. Statisticians have also shown that when the population distribution is not too far from a normal distribution, no test procedure can improve on the t test by very much (i.e., no test procedure can have the same \(\alpha\) and substantially smaller \(\beta\)). However, when the population distribution is believed to be strongly nonnormal (heavy-tailed, highly skewed, or multimodal), the t test should not be used. Then it’s time to consult your friendly neighborhood statistician, who can provide you with alternative methods of analysis.

■ Exercises 10.65–10.71

10.65 The power of a test is influenced by the sample size and the choice of significance level. a. Explain how increasing the sample size affects the power (when significance level is held fixed). b. Explain how increasing the significance level affects the power (when sample size is held fixed).

10.66 Water samples are taken from water used for cooling as it is being discharged from a power plant into a river. It has been determined that as long as the mean temperature of the discharged water is at most 150°F, there will be no negative effects on the river ecosystem.
To investigate whether the plant is in compliance with regulations that prohibit a mean discharge water temperature above 150°F, a scientist will take 50 water samples at randomly selected times and will record the water temperature of each sample. She will then use a z statistic

z = (x̄ − 150) / (σ/√n)

to decide between the hypotheses H0: μ = 150 and Ha: μ > 150, where μ is the true mean temperature of discharged water. Assume that σ is known to be 10.
a. Explain why use of the z statistic is appropriate in this setting.
b. Describe Type I and Type II errors in this context.
c. The rejection of H0 when z ≥ 1.8 corresponds to what value of α? (That is, what is the area under the z curve to the right of 1.8?)
d. Suppose that the true value for μ is 153 and that H0 is to be rejected if z ≥ 1.8. Draw a sketch (similar to that of Figure 10.5) of the sampling distribution of x̄, and shade the region that would represent β, the probability of making a Type II error.
e. For the hypotheses and test procedure described, compute the value of β when μ = 153.
f. For the hypotheses and test procedure described, what is the value of β if μ = 160?
g. If H0 is rejected when z ≥ 1.8 and x̄ = 152.4, what is the appropriate conclusion? What type of error might have been made in reaching this conclusion?

10.67 Let μ denote the true average lifetime for a certain type of pen under controlled laboratory conditions. A test of H0: μ = 10 versus Ha: μ < 10 will be based on a sample of size 36. Suppose that σ is known to be 0.6, from which σ_x̄ = 0.1. The appropriate test statistic is then

z = (x̄ − 10) / 0.1

a. What is α for the test procedure that rejects H0 if z ≤ −1.28?
b. If the test procedure of Part (a) is used, calculate β when μ = 9.8, and interpret this error probability.
c.
Without doing any calculation, explain how β when μ = 9.5 compares to β when μ = 9.8. Then check your assertion by computing β when μ = 9.5.
d. What is the power of the test when μ = 9.8? when μ = 9.5?

10.68 The city council in a large city has become concerned about the trend toward exclusion of renters with children in apartments within the city. The housing coordinator has decided to select a random sample of 125 apartments and determine for each whether children are permitted. Let p be the true proportion of apartments that prohibit children. If p exceeds .75, the city council will consider appropriate legislation.
a. If 102 of the 125 sampled apartments exclude renters with children, would a level .05 test lead you to the conclusion that more than 75% of all apartments exclude children?
b. What is the power of the test when p = .8 and α = .05?

10.69 The amount of shaft wear after a fixed mileage was determined for each of 7 randomly selected internal combustion engines, resulting in a mean of 0.0372 in. and a standard deviation of 0.0125 in.
a. Assuming that the distribution of shaft wear is normal, test at level .05 the hypotheses H0: μ = .035 versus Ha: μ > .035.
b. Using σ = 0.0125, α = .05, and Appendix Table 5, what is the approximate value of β, the probability of a Type II error, when μ = .04?
c. What is the approximate power of the test when μ = .04 and α = .05?

10.70 Optical fibers are used in telecommunications to transmit light. Current technology allows production of fibers that transmit light about 50 km (Research at Rensselaer, 1984). Researchers are trying to develop a new type of glass fiber that will increase this distance. In evaluating a new fiber, it is of interest to test H0: μ = 50 versus Ha: μ > 50, with μ denoting the true average transmission distance for the new optical fiber.
a.
Assuming σ = 10 and n = 10, use Appendix Table 5 to find β, the probability of a Type II error, for each of the given alternative values of μ when a level .05 test is employed:
i. 52
ii. 55
iii. 60
iv. 70
b. What happens to β in each of the cases in Part (a) if σ is actually larger than 10? Explain your reasoning.

10.71 Let μ denote the true average diameter for bearings of a certain type. A test of H0: μ = 0.5 versus Ha: μ ≠ 0.5 will be based on a sample of n bearings. The diameter distribution is believed to be normal. Determine the value of β in each of the following cases:
a. n = 15, α = .05, σ = 0.02, μ = 0.52
b. n = 15, α = .05, σ = 0.02, μ = 0.48
c. n = 15, α = .01, σ = 0.02, μ = 0.52
d. n = 15, α = .05, σ = 0.02, μ = 0.54
e. n = 15, α = .05, σ = 0.04, μ = 0.54
f. n = 20, α = .05, σ = 0.04, μ = 0.54
g. Is the way in which β changes as n, α, σ, and μ vary consistent with your intuition? Explain.

10.6 Interpreting and Communicating the Results of Statistical Analyses

The step-by-step procedure that we have proposed for testing hypotheses provides a systematic approach for carrying out a complete test. However, you rarely see the results of a hypothesis test reported in publications in such a complete way.

■ Communicating the Results of Statistical Analyses

When summarizing the results of a hypothesis test, it is important that you include several things in the summary to have all the relevant information. These are:
1. Hypotheses. Whether specified in symbols or described in words, it is important that both the null and the alternative hypotheses be clearly stated.
If you are using symbols to define the hypotheses, be sure to describe them in the context of the problem at hand (e.g., μ = population mean calorie intake).
2. Test procedure. You should be clear about what test procedure was used (e.g., large-sample z test for proportions) and why you think it was reasonable to use this procedure. The plausibility of any required assumptions should be satisfactorily addressed.
3. Test statistic. Be sure to include the value of the test statistic and the P-value. Including the P-value allows a reader who may have chosen a different significance level to see whether she would have reached the same or a different conclusion.
4. Conclusion in context. Never end the report of a hypothesis test with the statement “I rejected (or did not reject) H0.” Always provide a conclusion that is in the context of the problem and that answers the original research question which the hypothesis test was designed to answer. Be sure also to indicate the level of significance used as a basis for the decision.

■ Interpreting the Results of Statistical Analyses

When the results of a hypothesis test are reported in a journal article or other published source, it is common to find only the value of the test statistic and the associated P-value accompanying the discussion of conclusions drawn from the data. Sometimes, even the exact P-value doesn’t appear, but instead “coded” information is given. For example, * = significant (P-value < .05), ** = very significant (P-value < .01), and *** = highly significant (P-value < .001). Often, especially in newspaper articles, only sample summary statistics are given, with the conclusion immediately following. You may have to fill in some of the intermediate steps for yourself to see whether or not the conclusion is justified.
For example, the article “Physicians’ Knowledge of Herbal Toxicities and Adverse Herb-Drug Interactions” (European Journal of Emergency Medicine, August 2004) summarizes the results of a study to assess doctors’ familiarity with adverse effects of herbal remedies as follows: “A total of 142 surveys and quizzes were completed by 59 attending physicians, 57 resident physicians, and 26 medical students. The mean subject score on the quiz was only slightly higher than would have occurred from random guessing.” The quiz consisted of 16 multiple-choice questions. If each question had four possible choices, the statement that the mean quiz score was only slightly higher than would have occurred from random guessing suggests that the researchers considered the hypotheses H0: μ = 4 and Ha: μ > 4, where μ represents the true mean score for the population of physicians and medical students and the null hypothesis corresponds to the expected number of correct choices for someone who is guessing (16 questions × 1/4 = 4). Assuming that it is reasonable to regard this sample as representative of the population of interest, the data from the sample could be used to carry out a test of these hypotheses.

■ What to Look For in Published Data

Here are some questions to consider when you are reading a report that contains the results of a hypothesis test:
■ What hypotheses are being tested? Are the hypotheses about a population mean, a population proportion, or some other population characteristic?
■ Was the appropriate test used? Does the validity of the test depend on any assumptions about the population from which the sample was selected? If so, are the assumptions reasonable?
■ What is the P-value associated with the test?
■ Was a significance level selected for the test (as opposed to simply reporting the P-value)?
Is the chosen significance level reasonable?
■ Are the conclusions drawn consistent with the results of the hypothesis test?

For example, consider the following statement from the paper “Didgeridoo Playing as Alternative Treatment for Obstructive Sleep Apnoea Syndrome” (British Medical Journal [2006]: 266–270): “We found that four months of training of the upper airways by didgeridoo playing reduces daytime sleepiness in people with snoring and obstructive apnoea syndrome.” This statement was supported by data on a measure of daytime sleepiness called the Epworth scale. For the 14 participants in the study, the mean improvement in Epworth scale was 4.4 and the standard deviation was 3.7. The paper does not indicate what test was performed or what the value of the test statistic was. It appears that the hypotheses of interest are H0: μ = 0 (no improvement) versus Ha: μ > 0, where μ represents the true mean improvement in Epworth score after four months of didgeridoo playing for people with snoring and obstructive sleep apnoea. Because the sample size is not large, the one-sample t test would be appropriate if the sample can be considered a random sample and the distribution of Epworth scale improvement scores is approximately normal. If these assumptions are reasonable (something that was not addressed in the paper), the t test results in t = 4.4/(3.7/√14) = 4.45 and an associated P-value of .000 (to three decimal places). Because this P-value is so small, H0 would be rejected, supporting the conclusion in the paper that didgeridoo playing is an effective treatment. (In case you are wondering, a didgeridoo is an Australian Aboriginal woodwind instrument.)

■ A Word to the Wise: Cautions and Limitations

There are several things you should watch for when conducting a hypothesis test or when evaluating a written summary of such a test.
1. The result of a hypothesis test can never show strong support for the null hypothesis.
Make sure that you don’t confuse “There is no reason to believe the null hypothesis is not true” with the statement “There is convincing evidence that the null hypothesis is true.” These are very different statements!
2. If you have complete information for the population, don’t carry out a hypothesis test! It should be obvious that no test is needed to answer questions about a population if you have complete information and don’t need to generalize from a sample, but people sometimes forget this fact. For example, in an article on growth in the number of prisoners by state, the San Luis Obispo Tribune (August 13, 2001) reported “California’s numbers showed a statistically insignificant change, with 66 fewer prisoners at the end of 2000.” The use of the term “statistically insignificant” implies some sort of statistical inference, which is not appropriate when a complete accounting of the entire prison population is known. Perhaps the author confused statistical and practical significance. Which brings us to . . .
3. Don’t confuse statistical significance with practical significance. When statistical significance has been declared, be sure to step back and evaluate the result in light of its practical importance. For example, we may be convinced that the proportion who respond favorably to a proposed medical treatment is greater than .4, the known proportion that responds favorably for the currently recommended treatments. But if our estimate of this proportion for the proposed treatment is .405, is this of any practical interest? It might be if the proposed treatment is less costly or has fewer side effects, but in other cases it may not be of any real interest. Results must always be interpreted in context.

Activity 10.1 Comparing the t and z Distributions

Technology Activity: Requires use of a computer or a graphing calculator.
The instructions that follow assume the use of MINITAB. If you are using a different software package or a graphing calculator, your instructor will provide alternative instructions.

Background: Suppose a random sample will be selected from a population that is known to have a normal distribution. Then the statistic

z = (x̄ − μ) / (σ/√n)

has a standard normal (z) distribution. Since it is rarely the case that σ is known, inferences for population means are usually based on the statistic

t = (x̄ − μ) / (s/√n)

which has a t distribution rather than a z distribution. The informal justification for this was that the use of s to estimate σ introduces additional variability, resulting in a statistic whose distribution is more spread out than is the z distribution. In this activity, you will use simulation to sample from a known normal population, and then investigate how the behavior of t = (x̄ − μ)/(s/√n) compares to z = (x̄ − μ)/(σ/√n).

1. Generate 200 random samples of size 5 from a normal population with mean 100 and standard deviation 10. Using MINITAB, go to the Calc menu. Then
   Calc → Random Data → Normal
   In the “Generate” box, enter 200
   In the “Store in columns” box, enter c1-c5
   In the mean box, enter 100
   In the standard deviation box, enter 10
   Click on OK
You should now see 200 rows of data in each of the first 5 columns of the MINITAB worksheet.

2. Each row contains five values that have been randomly selected from a normal population with mean 100 and standard deviation 10. Viewing each row as a sample of size 5 from this population, calculate the mean and standard deviation for each of the 200 samples (the 200 rows) by using MINITAB’s row statistics functions, which can also be found under the Calc menu:
   Calc → Row statistics
   Choose the “Mean” button
   In the “Input Variables” box, enter c1-c5
   In the “Store result in” box, enter c7
   Click on OK
You should now see the 200 sample means in column 7 of the MINITAB worksheet.
Name this column “x-bar” by typing the name in the gray box at the top of c7. Now follow a similar process to compute the 200 sample standard deviations, and store them in c8. Name c8 “s.”

3. Next, calculate the value of the z statistic for each of the 200 samples. We can calculate z in this example because we know that the samples were selected from a population for which σ = 10. Use the calculator function of MINITAB to compute z = (x̄ − μ)/(σ/√n) = (x̄ − 100)/(10/√5) as follows:
   Calc → Calculator
   In the “Store results in” box, enter c10
   In the “Expression” box, type in the following: (c7-100)/(10/sqrt(5))
   Click on OK
You should now see the z values for the 200 samples in c10. Name c10 “z.”

4. Now calculate the value of the t statistic for each of the 200 samples. Use the calculator function of MINITAB to compute t = (x̄ − μ)/(s/√n) = (x̄ − 100)/(s/√5) as follows:
   Calc → Calculator
   In the “Store results in” box, enter c11
   In the “Expression” box, type in the following: (c7-100)/(c8/sqrt(5))
   Click on OK
You should now see the t values for the 200 samples in c11. Name c11 “t.”

5. Graphs, at last! Now construct histograms of the 200 z values and the 200 t values. These two graphical displays will provide insight about how each of these two statistics behaves in repeated sampling. Use the same scale for the two histograms so that it will be easier to compare the two distributions.
   Graph → Histogram
   In the “Graph variables” box, enter c10 for graph 1 and c11 for graph 2
   Click the Frame dropdown menu and select multiple graphs. Then under the scale choices, select “Same X and same Y.”

6. Now use the histograms from Step 5 to answer the following questions:
a. Write a brief description of the shape, center and spread for the histogram of the z values. Is what you see in the histogram consistent with what you would have expected to see? Explain. (Hint: In theory, what is the distribution of the z statistic?)
b. How does the histogram of the t values compare to the z histogram? Be sure to comment on center, shape, and spread.
c. Is your answer to Part (b) consistent with what would be expected for a statistic that has a t distribution? Explain.
d. The z and t histograms are based on only 200 samples, and they only approximate the corresponding sampling distributions. The 5th percentile for the standard normal distribution is −1.645 and the 95th percentile is 1.645. For a t distribution with df = 5 − 1 = 4, the 5th and 95th percentiles are −2.13 and 2.13, respectively. How do these percentiles compare to those of the distributions displayed in the histograms? (Hint: Sort the 200 z values; in MINITAB, choose “Sort” from the Manip menu. Once the values are sorted, percentiles from the histogram can be found by counting in 10 [which is 5% of 200] values from either end of the sorted list. Then repeat this with the t values.)
e. Are the results of your simulation and analysis consistent with the statement that the statistic z = (x̄ − μ)/(σ/√n) has a standard normal (z) distribution and the statistic t = (x̄ − μ)/(s/√n) has a t distribution? Explain.

Activity 10.2 A Meaningful Paragraph

Write a meaningful paragraph that includes the following six terms: hypotheses, P-value, reject H0, Type I error, statistical significance, practical significance.

A “meaningful paragraph” is a coherent piece of writing in an appropriate context that uses all of the listed words. The paragraph should show that you understand the meaning of the terms and their relationship to one another. A sequence of sentences that just define the terms is not a meaningful paragraph. When choosing a context, think carefully about the terms you need to use. Choosing a good context will make writing a meaningful paragraph easier.

Summary of Key Concepts and Formulas

Term or Formula | Comment
Hypothesis | A claim about the value of a population characteristic.
Null hypothesis, H0 | The hypothesis initially assumed to be true. It has the form H0: population characteristic = hypothesized value.
Alternative hypothesis, Ha | A hypothesis that specifies a claim that is contradictory to H0 and is judged the more plausible claim when H0 is rejected.
Type I error | Rejection of H0 when H0 is true; the probability of a Type I error is denoted by α and is referred to as the significance level for the test.
Type II error | Nonrejection of H0 when H0 is false; the probability of a Type II error is denoted by β.
Test statistic | The quantity computed from sample data and used to make a decision between H0 and Ha.
P-value | The probability, computed assuming H0 to be true, of obtaining a value of the test statistic at least as contradictory to H0 as what actually resulted. H0 is rejected if P-value ≤ α and not rejected if P-value > α, where α is the chosen significance level.
z = (p̂ − hypothesized value) / √[(hyp. val)(1 − hyp. val)/n] | A test statistic for testing H0: p = hypothesized value when the sample size is large. The P-value is determined from the z curve.
z = (x̄ − hypothesized value) / (σ/√n) | A test statistic for testing H0: μ = hypothesized value when σ is known and either the population distribution is normal or the sample size is large. The P-value is determined from the z curve.
t = (x̄ − hypothesized value) / (s/√n) | A test statistic for testing H0: μ = hypothesized value when σ is unknown and either the population distribution is normal or the sample size is large. The P-value is determined from the t curve with df = n − 1.
Power | The power of a test is the probability of rejecting the null hypothesis. Power is affected by the size of the difference between the hypothesized value and the true value, the sample size, and the significance level.

Chapter Review Exercises 10.72–10.95
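For readers who want to check hand calculations in the review exercises, the three test statistics from the summary table can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the text's MINITAB workflow: the function names (`z_for_proportion`, `z_for_mean`, `t_for_mean`, `z_p_value`) and the `alternative` labels are hypothetical, and only z-curve P-values are computed because the Python standard library has no t distribution.

```python
import math
from statistics import NormalDist


def z_for_proportion(p_hat, p0, n):
    """Large-sample z statistic for testing H0: p = p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def z_for_mean(xbar, mu0, sigma, n):
    """z statistic for testing H0: mu = mu0 when sigma is known."""
    return (xbar - mu0) / (sigma / math.sqrt(n))


def t_for_mean(xbar, mu0, s, n):
    """t statistic (df = n - 1) for testing H0: mu = mu0 when sigma is unknown."""
    return (xbar - mu0) / (s / math.sqrt(n))


def z_p_value(z, alternative):
    """P-value from the z curve; `alternative` is 'greater', 'less', or 'two-sided'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


if __name__ == "__main__":
    # Didgeridoo example from Section 10.6: n = 14, mean improvement 4.4, s = 3.7
    t = t_for_mean(4.4, 0, 3.7, 14)
    print(f"t = {t:.2f} with df = {14 - 1}")
```

Applied to the didgeridoo data from Section 10.6, `t_for_mean(4.4, 0, 3.7, 14)` gives the t of about 4.45 quoted there; the corresponding P-value would still have to come from a t table or statistical software.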
10.72 The authors of the article “Perceived Risks of Heart Disease and Cancer Among Cigarette Smokers” (Journal of the American Medical Association [1999]: 1019–1021) expressed the concern that a majority of smokers do not view themselves as being at increased risk of heart disease or cancer. In a study of 737 current smokers selected at random from U.S. households with telephones, 295 indicated that they believed they have a higher than average risk of cancer. Do these data suggest that p, the true proportion of smokers who view themselves as being at increased risk of cancer, is in fact less than .5, as claimed by the authors of the paper? Test the relevant hypotheses using α = .05.

10.73 A number of initiatives on the topic of legalized gambling have appeared on state ballots. Suppose that a political candidate has decided to support legalization of casino gambling if he is convinced that more than two-thirds of U.S. adults approve of casino gambling. USA Today (June 17, 1999) reported the results of a Gallup poll in which 1523 adults (selected at random from households with telephones) were asked whether they approved of casino gambling. The number in the sample who approved was 1035. Does the sample provide convincing evidence that more than two-thirds approve?

10.74 The article “Credit Cards and College Students: Who Pays, Who Benefits?” (Journal of College Student Development [1998]: 50–56) described a study of credit card payment practices of college students. According to the authors of the article, the credit card industry asserts that at most 50% of college students carry a credit card balance from month to month.
However, the authors of the article report that, in a random sample of 310 college students, 217 carried a balance each month. Does this sample provide sufficient evidence to reject the industry claim?

10.75 Although arsenic is known to be a poison, it also has some beneficial medicinal uses. In one study of the use of arsenic to treat acute promyelocytic leukemia (APL), a rare type of blood cell cancer, APL patients were given an arsenic compound as part of their treatment. Of those receiving arsenic, 42% were in remission and showed no signs of leukemia in a subsequent examination (Washington Post, November 5, 1998). It is known that 15% of APL patients go into remission after the conventional treatment. Suppose that the study had included 100 randomly selected patients (the actual number in the study was much smaller). Is there sufficient evidence to conclude that the proportion in remission for the arsenic treatment is greater than .15, the remission proportion for the conventional treatment? Test the relevant hypotheses using a .01 significance level.

10.76 According to the article “Which Adults Do Underage Youth Ask for Cigarettes?” (American Journal of Public Health [1999]: 1561–1564), 43.6% of the 149 18- to 19-year-olds in a random sample have been asked to buy cigarettes for an underage smoker.
a. Is there convincing evidence that fewer than half of 18- to 19-year-olds have been approached to buy cigarettes by an underage smoker?
b. The article went on to state that of the 110 nonsmoking 18- to 19-year-olds, only 38.2% had been approached to buy cigarettes for an underage smoker. Is there evidence that less than half of nonsmoking 18- to 19-year-olds have been approached to buy cigarettes?

10.77 Many people have misconceptions about how profitable small, consistent investments can be. In a survey of 1010 randomly selected U.S.
adults (Associated Press, October 29, 1999), only 374 responded that they thought that an investment of $25 per week over 40 years with a 7% annual return would result in a sum of over $100,000 (the correct amount is $286,640). Is there sufficient evidence to conclude that less than 40% of U.S. adults are aware that such an investment would result in a sum of over $100,000? Test the relevant hypotheses using α = .05.

10.78 The same survey described in Exercise 10.77 also asked the individuals in the sample what they thought was their best chance to obtain more than $500,000 in their lifetime. Twenty-eight percent responded “win a lottery or sweepstakes.” Does this provide convincing evidence that more than one-fourth of U.S. adults see a lottery or sweepstakes win as their best chance of accumulating $500,000? Carry out a test using a significance level of .01.

10.79 The state of Georgia’s HOPE scholarship program guarantees fully paid tuition to Georgia public universities for Georgia high school seniors who have a B average in academic requirements as long as they maintain a B average in college. Of 137 randomly selected students enrolling in the Ivan Allen College at the Georgia Institute of Technology (social science and humanities majors) in 1996 who had a B average going into college, 53.2% had a GPA below 3.0 at the end of their first year (“Who Loses HOPE? Attrition from Georgia’s College Scholarship Program,” Southern Economic Journal [1999]: 379–390). Do these data provide convincing evidence that a majority of students at Ivan Allen College who enroll with a HOPE scholarship lose their scholarship?

10.80 Speed, size, and strength are thought to be important factors in football performance. The article “Physical and Performance Characteristics of NCAA Division I Football Players” (Research Quarterly for Exercise and Sport [1990]: 395–401) reported on physical characteristics of Division I starting football players in the 1988 football season.
Information for teams ranked in the top 20 was easily obtained, and it was reported that the mean weight of starters on top-20 teams was 105 kg. A random sample of 33 starting players (various positions were represented) from Division I teams that were not ranked in the top 20 resulted in a sample mean weight of 103.3 kg and a sample standard deviation of 16.3 kg. Is there sufficient evidence to conclude that the mean weight for non-top-20 starters is less than 105, the known value for top-20 teams?

10.81 Are young women delaying marriage and marrying at a later age? This question was addressed in a report issued by the Census Bureau (Associated Press, June 8, 1991). The report stated that in 1970 (based on census results) the mean age of brides marrying for the first time was 20.8 years. In 1990 (based on a sample, because census results were not yet available), the mean was 23.9. Suppose that the 1990 sample mean had been based on a random sample of size 100 and that the sample standard deviation was 6.4. Is there sufficient evidence to support the claim that in 1990 women were marrying later in life than in 1970? Test the relevant hypotheses using α = .01. (Note: It is probably not reasonable to think that the distribution of age at first marriage is normal in shape.)

10.82 According to the article “Workaholism in Organizations: Gender Differences” (Sex Roles [1999]: 333–346), the following data were reported on 1996 income for random samples of male and female MBA graduates from a certain Canadian business school:

          N     x̄          s
Males     258   $133,442   $131,090
Females   233   $105,156   $98,525

Note: These salary figures are in Canadian dollars.
a. Test the hypothesis that the mean salary of male MBA graduates from this school was in excess of $100,000 in 1996.
b. Is there convincing evidence that the mean salary for all female MBA graduates is above $100,000? Test using α = .10.
c. If a significance level of .05 or .01 were used instead of .10 in the test of Part (b), would you still reach the same conclusion? Explain.

10.83 Duck hunting in populated areas faces opposition on the basis of safety and environmental issues. The San Luis Obispo Telegram-Tribune (June 18, 1991) reported the results of a survey to assess public opinion regarding duck hunting on Morro Bay (located along the central coast of California). A random sample of 750 local residents included 560 who strongly opposed hunting on the bay. Does this sample provide sufficient evidence to conclude that the majority of local residents oppose hunting on Morro Bay? Test the relevant hypotheses using α = .01.

10.84 Seat belts help prevent injuries in automobile accidents, but they certainly don’t offer complete protection in extreme situations. A random sample of 319 front-seat occupants involved in head-on collisions in a certain region resulted in 95 people who sustained no injuries (“Influencing Factors on the Injury Severity of Restrained Front Seat Occupants in Car-to-Car Head-on Collisions,” Accident Analysis and Prevention [1995]: 143–150). Does this suggest that the true (population) proportion of uninjured occupants exceeds .25? State and test the relevant hypotheses using a significance level of .05.

10.85 White remains the most popular car color in the United States, but its popularity appears to be slipping. According to an annual survey by DuPont (Los Angeles Times, February 22, 1994), white was the color of 20% of the vehicles purchased during 1993, a decline of 4% from the previous year.
(According to a DuPont spokesperson, white represents “innocence, purity, honesty, and cleanliness.”) A random sample of 400 cars purchased during this period in a certain metropolitan area resulted in 100 cars that were white. Does the proportion of all cars purchased in this area that are white appear to differ from the national percentage? Test the relevant hypotheses using α = .05. Does your conclusion change if α = .01 is used?

10.86 When a published article reports the results of many hypothesis tests, the P-values are not usually given. Instead, the following type of coding scheme is frequently used: *p < .05, **p < .01, ***p < .001, ****p < .0001. Which of the symbols would be used to code for each of the following P-values?
a. .037
b. .0026
c. .072
d. .0003

10.87 A random sample of n = 44 individuals with a B.S. degree in accounting who started with a Big Eight accounting firm and subsequently changed jobs resulted in a sample mean time to change of 35.02 months and a sample standard deviation of 18.94 months (“The Debate over Post-Baccalaureate Education: One University’s Experience,” Issues in Accounting Education [1992]: 18–36). Can it be concluded that the true average time to change exceeds 2 years? Test the appropriate hypotheses using a significance level of .01.

10.88 What motivates companies to offer stock ownership plans to their employees? In a random sample of 87 companies having such plans, 54 said that the primary rationale was tax related (“The Advantages and Disadvantages of ESOPs: A Long-Range Analysis,” Journal of Small Business Management [1991]: 15–21). Does this information provide strong support for concluding that more than half of all such firms feel this way?
10.89 The article “Caffeine Knowledge, Attitudes, and Consumption in Adult Women” (Journal of Nutrition Education [1992]: 179–184) reported the following summary statistics on daily caffeine consumption for a random sample of adult women: n = 47, x̄ = 215 mg, s = 235 mg, and the data values ranged from 5 to 1176.
a. Does it appear plausible that the population distribution of daily caffeine consumption is normal? Is it necessary to assume a normal population distribution to test hypotheses about the value of the population mean consumption? Explain your reasoning.
b. Suppose that it had previously been believed that mean consumption was at most 200 mg. Does the given information contradict this prior belief? Test the appropriate hypotheses at significance level .10.

10.90 Past experience has indicated that the true response rate is 40% when individuals are approached with a request to fill out and return a particular questionnaire in a stamped and addressed envelope. An investigator believes that if the person distributing the questionnaire is stigmatized in some obvious way, potential respondents would feel sorry for the distributor and thus tend to respond at a rate higher than 40%. To investigate this theory, a distributor is fitted with an eye patch. Of the 200 questionnaires distributed by this individual, 109 were returned. Does this strongly suggest that the response rate in this situation exceeds the rate in the past? State and test the appropriate hypotheses at significance level .05.

10.91 ● An automobile manufacturer who wishes to advertise that one of its models achieves 30 mpg (miles per gallon) decides to carry out a fuel efficiency test. Six nonprofessional drivers are selected, and each one drives a car from Phoenix to Los Angeles.
The resulting fuel efficiencies (in miles per gallon) are:

27.2  29.3  31.2  28.4  30.3  29.6

Assuming that fuel efficiency is normally distributed under these circumstances, do the data contradict the claim that true average fuel efficiency is (at least) 30 mpg?

10.92 A student organization uses the proceeds from a particular soft-drink dispensing machine to finance its activities. The price per can had been $0.75 for a long time, and the average daily revenue during that period had been $75.00. The price was recently increased to $1.00 per can. A random sample of n = 20 days after the price increase yielded a sample average daily revenue and sample standard deviation of $70.00 and $4.20, respectively. Does this information suggest that the true average daily revenue has decreased from its value before the price increase? Test the appropriate hypotheses using α = .05.

10.93 A hot tub manufacturer advertises that with its heating equipment, a temperature of 100°F can be achieved in at most 15 min. A random sample of 25 tubs is selected, and the time necessary to achieve a 100°F temperature is determined for each tub. The sample average time and sample standard deviation are 17.5 min and 2.2 min, respectively. Does this information cast doubt on the company’s claim? Carry out a test of hypotheses using significance level .05.

10.94 Let p denote the proportion of voters in a certain state who favor a particular proposed constitutional amendment. Consider testing H0: p = .5 versus Ha: p .5 at significance level .05 based on a sample of size n = 50.
a. Suppose that H0 is in fact true. Use Appendix Table 1 (our table of random numbers) to simulate selecting a sample, and use the resulting data to carry out the test.
b. If you repeated Part (a) a total of 100 times (a simulation consisting of 100 replications), how many times would you expect H0 to be rejected?
c. Now suppose that p = .6, which implies that H0 is false.
Again, use Appendix Table 1 to simulate selecting a sample, and carry out the test. If you repeated this a total of 100 times, would you expect H0 to be rejected more frequently than when H0 is true?

10.95 A type of lie detector that measures brain waves was developed by a professor of neurobiology at Northwestern University (Associated Press, July 7, 1988). He said, “It would probably not falsely accuse any innocent people and it would probably pick up 70% to 90% of guilty people.” Suppose that the result of this lie detector test is allowed as evidence in a criminal trial as the sole basis of a decision between two rival hypotheses: accused is innocent versus accused is guilty. Although these are not “statistical hypotheses” (statements about a population characteristic), the possible decision errors are analogous to Type I and Type II errors. In this situation, a Type I error is finding an innocent person guilty, that is, rejecting the null hypothesis of innocence when it is in fact true. A Type II error is finding a guilty person innocent, that is, not rejecting the null hypothesis of innocence when it is in fact false. If the developer of the lie detector is correct in his statements, what is the probability of a Type I error, α? What can you say about the probability of a Type II error, β?
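The simulation described in Exercise 10.94 can also be carried out by computer. The following is a minimal Python sketch, not part of the original exercise. It makes several assumptions: a pseudo-random number generator stands in for Appendix Table 1, the alternative hypothesis is taken to be upper-tailed (p > .5), and the large-sample z test is used with the critical value 1.645 for significance level .05. The function name `simulate_rejections` and the seed are illustrative choices.

```python
import random
from math import sqrt

Z_CRITICAL = 1.645  # upper-tail z critical value for significance level .05


def simulate_rejections(true_p, n=50, p0=0.5, reps=100, rng=None):
    """Count how often H0: p = p0 is rejected in `reps` simulated samples.

    Assumes an upper-tailed alternative (p > p0); this is an assumption
    of the sketch, not something stated in the exercise.
    """
    rng = rng or random.Random()
    rejections = 0
    for _ in range(reps):
        # Each voter favors the amendment with probability true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        # Large-sample z statistic; the standard error uses p0.
        z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
        if z >= Z_CRITICAL:
            rejections += 1
    return rejections


rng = random.Random(2024)  # fixed seed so the run can be repeated
under_h0 = simulate_rejections(0.5, rng=rng)  # H0 true
under_ha = simulate_rejections(0.6, rng=rng)  # H0 false
print("Rejections in 100 replications when p = .5:", under_h0)
print("Rejections in 100 replications when p = .6:", under_ha)
```

Comparing the two counts gives a rough picture of the Type I error rate and of the power of the test when p = .6. The counts change from run to run unless the seed is fixed.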
Graphing Calculator Explorations

Exploration 10.1 Hypothesis Test for a Population Proportion

Using your calculator’s hypothesis-testing capability begins, as usual, by navigating your calculator’s menu system, this time looking for key words such as “hypothesis” and “tests.” As before with confidence intervals, look for words such as “1” and “prop.” and “z.” Once you select the correct choice, you will be presented with a screen for providing information. The information you must provide is exactly what you would need to test the hypotheses on paper: the sample number of successes, the sample size, the level of significance, and so on. Figure 10.8 shows two representative screens, with information filled in from Example 10.10 in the text (the screen in Figure 10.8(b) has the “less than” alternative hypothesis selected by shading, although the shading doesn’t show up in the figure). Move your cursor down to Execute or Calculate, and press the Enter, Execute, or Calculate button, depending on your calculator. The results should appear immediately. Again, we show two representative screens in Figure 10.9. Notice the slight difference between the text and calculator answers that results from rounding in the hand calculations.

Figure 10.8 Representative calculator screens for a one-proportion hypothesis test. Both screens, (a) and (b), show the entries p0 = .5, x = 220, and n = 500, a choice of alternative hypothesis, and an Execute or Calculate command.

Figure 10.9 Representative output for a one-proportion hypothesis test. Both screens, (a) and (b), report prop < 0.5, z = -2.683281573, p = .003645226, p̂ = .44, and n = 500.

The calculator does the work for only a few of the steps in a hypothesis test. Remember, there is still work for you to do in completing the other steps!
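For readers working with software rather than a calculator, the following minimal Python sketch carries out the same large-sample z test using the inputs shown in Figure 10.8 (x = 220, n = 500, hypothesized value .5, “less than” alternative). The function name `one_prop_z_test` and its `alternative` labels are illustrative choices made for this sketch; they do not correspond to any calculator command.

```python
from math import sqrt
from statistics import NormalDist


def one_prop_z_test(x, n, p0, alternative="less"):
    """Large-sample z test for a population proportion.

    alternative is "less", "greater", or "two-sided" (labels chosen
    for this sketch). Returns (p_hat, z, p_value).
    """
    p_hat = x / n
    # The standard error is computed from the hypothesized value p0.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return p_hat, z, p_value


p_hat, z, p_value = one_prop_z_test(x=220, n=500, p0=0.5, alternative="less")
print(f"p-hat = {p_hat:.2f}, z = {z:.6f}, P-value = {p_value:.6f}")
```

The printed values should agree closely with the calculator output in Figure 10.9. As with the calculator, the code handles only the computation; stating the hypotheses, checking that the sample is large enough, and writing the conclusion remain your responsibility.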
Exploration 10.2 Hypothesis Test for a Population Mean

Testing a hypothesis for a single mean on your calculator requires you to navigate the menu system once again. As was true with the confidence interval for the mean,
1. you must decide whether to base the test on the z or the t distribution, and
2. you can use previous calculations of the sample mean and standard deviation, or the calculator will evaluate these statistics from data contained in a list.

We follow Example 10.14, and use the t test after entering the data in List1. The normality of the population must be assessed as before. A calculator boxplot and normal probability plot are shown in Figure 10.10.

Figure 10.10 Plots for data of Example 10.14: (a) boxplot; (b) normal probability plot.

After assessing the plausibility of the normality of the population, we test the hypotheses. Again we have the choice of entering the sample calculations or providing data in a list and letting the calculator evaluate the sample mean and standard deviation. Based on the choice, we see one of the screens shown in Figure 10.11.

Figure 10.11 Representative calculator screens for a single-sample t test: (a) and (b).

Figure 10.12 shows calculator output for the hypothesis test. Remember, just writing the calculator output is not a complete response to a hypothesis testing task; there are necessary steps in the hypothesis-testing procedure that you must write yourself.

Figure 10.12 Calculator output for the single-sample t test.
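The same computation can be sketched in Python. Because the data of Example 10.14 are not reproduced in this section, the sketch below borrows the six fuel efficiency values from Exercise 10.91 purely as an illustration, testing a hypothesized mean of 30 against a lower-tailed alternative. The function names are hypothetical, and the P-value is approximated by numerically integrating the t density with Simpson's rule, which is a choice made for this sketch and not the method a calculator necessarily uses.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate area to the right of t (Simpson's rule, steps even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    upper = 0.5 - total * h / 3  # area to the right of |t|
    return upper if t >= 0 else 1 - upper


def one_sample_t_test(data, mu0, alternative="less"):
    """One-sample t test; returns (t statistic, df, P-value)."""
    n = len(data)
    t_stat = (mean(data) - mu0) / (stdev(data) / sqrt(n))
    df = n - 1
    if alternative == "less":
        p_value = 1 - t_upper_tail(t_stat, df)
    elif alternative == "greater":
        p_value = t_upper_tail(t_stat, df)
    else:  # two-sided
        p_value = 2 * t_upper_tail(abs(t_stat), df)
    return t_stat, df, p_value


# Illustrative data: fuel efficiencies (mpg) from Exercise 10.91.
mpg = [27.2, 29.3, 31.2, 28.4, 30.3, 29.6]
t_stat, df, p_value = one_sample_t_test(mpg, mu0=30, alternative="less")
print(f"t = {t_stat:.3f}, df = {df}, P-value = {p_value:.3f}")
```

Passing the raw data corresponds to letting the calculator evaluate the sample mean and standard deviation from a list. The code does not assess normality; a boxplot or normal probability plot is still needed for that step.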