Part VIII - Tests of Significance
Chapters 26, 28, and 29
Dr. Joseph Brennan
Math 148, BU

Tests of Significance

Confidence Intervals: Intervals on the number line that are used to estimate the population parameter (µ) from the sample statistic (x̄).

Tests of Significance: Tests that assess the evidence the data provide in favor of some claim about a population parameter (µ).

A significance test is a formal procedure that uses the data to choose between two competing hypotheses, the null hypothesis and the alternative hypothesis. Hypotheses are statements about the population parameter (µ). A decision rule based on a probability computation is used to choose between the two hypotheses.

Stating Hypotheses

A hypothesis is a statement about the population parameter (µ) whose truth is in question.

Example (Air Force recruits) (from McGhee, Introductory Statistics)
Suppose the mean weight of male Air Force recruits is thought to be around 154 pounds. The following hypotheses can be drawn:

Verbal Statement                                      Math Statement
The mean weight is 154 pounds                         H: µ = 154
The mean weight is less than 154 pounds               H: µ < 154
The mean weight is greater than 154 pounds            H: µ > 154
The mean weight is not equal to 154 pounds            H: µ ≠ 154

The tests we will develop require two hypotheses, the null hypothesis and the alternative hypothesis.

The Null Hypothesis

Null Hypothesis: The basic or primary statement about the parameter (µ). We abbreviate the null hypothesis as H0. We will consider the special case of a simple null hypothesis, which has the form (H0 : µ = µ0), where µ0 is the hypothesized value of µ.

The null hypothesis is usually a skeptical statement of no difference or no effect.
Generally, the null hypothesis represents some well-established position that should not be rejected unless there is considerable evidence to the contrary.

Note 1: The null hypothesis states that µ = µ0. For any sample from the population we do not expect that x̄ = µ0 exactly. Recall from Part V: the probability of an exact value arising from a continuous random variable is zero. Even though x̄ will generally differ from µ0, this does not mean that we will always reject H0. A test of significance attempts to determine whether the difference is real or attributable to chance error.

Note 2: We do not test to accept H0; it is assumed to be true. Rather, we test to see if it should be rejected. If we reject H0, there must be some other hypothesis that we are willing to accept. This is called the ALTERNATIVE HYPOTHESIS and is abbreviated as Ha.

We establish the alternative hypothesis by (partially) negating the null hypothesis. Since H0 involves equality, the alternative hypothesis involves an inequality. Depending on the direction of the inequality, there are one-sided and two-sided alternative hypotheses:

A one-sided alternative hypothesis has the form
i) (Ha : µ > µ0) (right-sided alternative), or
ii) (Ha : µ < µ0) (left-sided alternative).
A two-sided alternative hypothesis has the form (Ha : µ ≠ µ0).

The direction of the inequality in the alternative hypothesis (Ha) usually follows from the question in the problem. If the direction is not specified in the problem, we should use the two-sided alternative as a default.

It would be an act of CHEATING to first look at the data and then frame Ha to fit what the data show. The textbook calls this type of cheating data snooping.
If you do not have a specific direction firmly in mind in advance, you should use the two-sided alternative hypothesis. Some statisticians would argue that we should always use a two-sided alternative.

Note 3: The alternative hypothesis is also known as the research hypothesis. It is the statement that we want to prove.

Hypothesis Examples

The flat Earth model was common in ancient times, such as in the civilizations of the Bronze Age or Iron Age. This may be thought of as the null hypothesis, H0, at the time.
H0 : The world is flat.

Hellenistic astronomy established the spherical shape of the Earth around 300 BC. Many of the Ancient Greek philosophers assumed that the sun, moon and other objects in the universe circled around the Earth.
H0 : The Geocentric Model: Earth is the center of the universe.

Copernicus had an alternative hypothesis, H1, that the Earth actually circled around the sun, which made the sun the center of the universe. Eventually, people were convinced and accepted it as the null, H0.
H0 : The Heliocentric Model: the sun is the center of the universe.

Later someone proposed an alternative hypothesis that the sun itself circles around something within the galaxy, thus creating a new null hypothesis. This is how research works: the null hypothesis is refined through testing; even if it isn't correct, H0 is an improvement over its predecessors.

Example (Air Force recruits) The mean weight of male Air Force recruits is thought to be around 154 pounds.
H0 : µ = 154, Ha : µ ≠ 154.

(a) A 1998 study reported that the average weight of newborns is 7 pounds. You plan to take a simple random sample of newborns to see if the average weight has increased.
H0 : µ = 7, Ha : µ > 7.
(b) Last year the company's technicians spent on average 3 hours a day responding to customers. Does this year's data show a different average response time?
H0 : µ = 3, Ha : µ ≠ 3.

(c) The average square footage of one-bedroom apartments in a new development is advertised to be 460 square feet. A student group thinks that the apartments are smaller than advertised. They hire an engineer to measure a sample of apartments to test their suspicion.
H0 : µ = 460, Ha : µ < 460.

(d) Suppose you are playing a game which involves rolling a die and you have a feeling that 6 is appearing more often than it should! Let X be the variable that records the number that shows up on rolling the die.
H0 : P(X = 6) = 1/6, Ha : P(X = 6) > 1/6.

(e) Suppose you are flipping a coin, which otherwise seems fair, and you believe that heads is appearing less often than it should!
H0 : P(H) = 1/2, Ha : P(H) < 1/2.

Why Bother?

For each experiment or data set, the null hypothesis is a general default position which needs to be substantiated or ruled out. Moreover, this needs to be done on an experiment-by-experiment basis.

Decision Rule and Test Statistics

In testing hypotheses we speak of testing the null against the alternative hypothesis. We either reject or do not reject the null hypothesis based on the evidence from the data. If the null hypothesis is rejected, we accept the alternative hypothesis.

The decision to reject H0 or not should be based on an appropriate decision rule. A decision rule for a test is based upon a test statistic.
Test Statistic: A test statistic is a quantity computed from the data that measures the compatibility between the null hypothesis and the data.

Test Statistic

Very often the test statistic has the following form:

$$\text{Test statistic} = \frac{\text{estimate} - \text{hypothesized value}}{\text{standard deviation of the estimate}}$$

If the hypotheses are statements concerning the population mean µ, then the test statistic has the form

$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

This statistic is called the z-statistic for µ because it is the z-score of x̄ (under the null hypothesis). If the population distribution is normal, or the sample size is large enough (n ≥ 30), the distribution of z is (at least approximately) standard normal (z ∼ N(0, 1)).

From the interpretation of the z-score it follows that the z-statistic shows by how many standard deviations of x̄ the sample mean x̄ lies below or above µ0 (the value specified by H0).

P-Value

The test statistic is used to compute the P-value of the test.

P-value: In a test of hypotheses, the P-value is the probability that the test statistic would take a value as extreme as, or more extreme than, the value actually observed (in the direction specified by Ha). This probability is computed under the assumption that H0 is true.

Small P-values correspond to extreme values of the test statistic and should lead to rejection of H0. The smaller the P-value, the stronger the evidence against H0. We decide to reject or not reject H0 by comparing the P-value with the level of significance α.

Note: Statistical studies usually report both the value of the test statistic and the P-value.

Significance Level

Significance Level: The significance level, α, is a fixed constant that serves as the cutoff for the P-value: a P-value below α is regarded as decisive. This amounts to announcing in advance how much evidence against H0 we will require to reject H0.
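As a rough illustration of the z-statistic defined above, the following sketch computes z from a sample mean, a hypothesized mean, a population standard deviation and a sample size. The function name `z_statistic` and the numbers in the usage lines are hypothetical, chosen only so that the arithmetic is easy to follow; they do not come from the slides.

```python
import math


def z_statistic(x_bar, mu_0, sigma, n):
    """z-score of the sample mean under H0: (x_bar - mu_0) / (sigma / sqrt(n))."""
    standard_error = sigma / math.sqrt(n)  # standard deviation of x_bar
    return (x_bar - mu_0) / standard_error


# Hypothetical numbers: x_bar = 52, mu_0 = 50, sigma = 8, n = 64.
# The standard error is 8 / 8 = 1, so x_bar lies 2 standard errors above mu_0.
z = z_statistic(52, 50, 8, 64)
print(z)  # 2.0
```

A positive z means the sample mean is above the hypothesized value, a negative z means it is below; the size of z is what the P-value computation turns into a probability.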
The most frequently used values of α are 0.1, 0.05 and 0.01.

Rules of decision based on the P-value: We reject H0 at level α if P-value < α. Otherwise, we fail to reject H0.

If we reject H0, we say that the result is statistically significant at level α, which means that the observed difference between the data and H0 is too large to be attributed to chance error. When we reject H0, we accept Ha.

Example

Consider a test with two hypotheses H0 : µ = 140, Ha : µ > 140, where a sample of size 64 has a mean of 143 and a standard deviation of 10. The value of x̄ lies in the right tail of the distribution, since 140 < 143. The P-value is

$$P\text{-value} = P(\bar{x} \ge 143) = P\left(Z > \frac{143 - 140}{10/\sqrt{64}}\right) = P(Z > 2.4) = 0.0082.$$

Whichever traditional value of α we choose (0.01, 0.05 or 0.1), we have P-value < α. So we reject H0 and accept the alternative hypothesis as more plausible.

The P-Value Computation

Let z be the computed value of the test statistic. The way in which we compute the P-value depends on the direction of the alternative hypothesis. There are three possible cases:

Case 1: Left-sided alternative (Ha : µ < µ0). P-value = P(Z ≤ z).
Case 2: Right-sided alternative (Ha : µ > µ0). P-value = P(Z ≥ z).
Case 3: Two-sided alternative (Ha : µ ≠ µ0). P-value = P(|Z| ≥ |z|) = 1 − P(−|z| < Z < |z|).

The direction of the inequality in the P-value is the same as the direction of the inequality in the alternative hypothesis.
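The three cases above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard normal distribution from Python's `statistics.NormalDist`; the function name `p_value` and the labels "less", "greater" and "two-sided" are naming choices made here, not notation from the slides.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def p_value(z, alternative):
    """P-value of an observed z-statistic for the three forms of Ha.

    alternative: "less" (Ha: mu < mu0), "greater" (Ha: mu > mu0),
    or "two-sided" (Ha: mu != mu0). The labels are illustrative.
    """
    if alternative == "less":
        return STANDARD_NORMAL.cdf(z)                      # P(Z <= z)
    if alternative == "greater":
        return 1.0 - STANDARD_NORMAL.cdf(z)                # P(Z >= z)
    if alternative == "two-sided":
        return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))   # P(|Z| >= |z|)
    raise ValueError(f"unknown alternative: {alternative!r}")


# Example from the slides: H0: mu = 140, Ha: mu > 140, n = 64, x_bar = 143, SD = 10.
z = (143 - 140) / (10 / 64 ** 0.5)
print(round(z, 2), round(p_value(z, "greater"), 4))  # 2.4 0.0082
```

The two-sided branch takes the absolute value of z first, so it gives the same answer whether the observed statistic is positive or negative.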
The P-value in the two-sided case is twice as large as the P-value in the one-sided case (when the one-sided alternative points in the direction of the observed z). The explanation is the following: we have an alternative hypothesis (µ ≠ µ0). To reject the null hypothesis we should observe either extreme positive or extreme negative values of the test statistic z.

Suppose that for a given sample we found z = −2.2. Since Ha considers extreme values in both directions, the z-values at least as extreme as the observed −2.2 are:

Z ≥ 2.2 OR Z ≤ −2.2.

These two inequalities can be combined into one:

Z ≤ −2.2 or Z ≥ 2.2 ⇔ |Z| ≥ 2.2.

Hence, the P-value is computed as

P-value = P(|Z| ≥ 2.2) = 1 − P(−2.2 < Z < 2.2) = 2.78%.

As a consequence, it is easier to reject H0 in favor of a one-sided alternative, because the P-value for a two-sided alternative is twice the P-value for a one-sided alternative.

The P-value measures the strength of the evidence against H0. The smaller the P-value, the stronger the evidence against H0.

If the P-value is less than 0.05, the result is often called statistically significant. This is because α = 0.05 is the most frequently used level of significance.

If the P-value is less than 0.01, the result is called highly significant. The significance level α = 0.01 is used when we want to reject H0 only for VERY convincing evidence against it.

Common Misinterpretation of a P-Value

Many people misunderstand what question a P-value answers. If the P-value is 0.03, that means that, assuming H0 is true, there is a 3% chance of observing a difference from H0 at least as extreme as the one you observed.
It is tempting to conclude that there is a 97% chance that Ha is correct and a 3% chance that H0 is correct. This is an incorrect interpretation!

What you can say is that, if H0 is true, random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments and at least as large as you observed in 3% of experiments.

You have to choose. Would you rather believe in a 3% coincidence? Or that H0 is incorrect?

One Sample z-Test for µ

One Sample z-Test for µ: A test to determine the validity of a statement concerning the mean µ based upon a single sample.

STEP 1: State the hypotheses. The null hypothesis is H0 : µ = µ0. As a default, the alternative Ha is two-sided. A problem may specify whether Ha is left-sided or right-sided.

STEP 2: Choose the significance level α. Assume α = 0.05 unless otherwise stated.

STEP 3: Calculate the test statistic

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}.$$

STEP 4: Compute the P-value. The formula for the P-value depends on the alternative hypothesis. Recall that the P-value is the probability that the test statistic would take a value as extreme as, or more extreme than, the value actually observed.

STEP 5: Make a decision: Reject H0 if P-value < α. Do not reject H0 if P-value ≥ α.

STEP 6: State the conclusion in terms of the alternative hypothesis.
If you rejected H0, say "there is enough evidence at level α that [state your alternative hypothesis in words here]".
If you did not reject H0, say "there is not enough evidence at level α to say that [state your alternative hypothesis in words here]".

Assumptions Associated with the z-Test

The assumptions for the one-sample z-test are the same as the assumptions for calculating the confidence interval for µ in Chapter 21:

Assumption 1.
The data result from a simple random sample from a very large population, or the observations are obtained by sampling with replacement from a box (population).

Assumption 2. The population is either normal or the sample size is large enough (n ≥ 30) for the Central Limit Theorem to apply.

Example (from Moore and McCabe)

Do middle-aged male executives have a different average blood pressure than the general population? The National Center for Health Statistics reports that the mean systolic blood pressure for males 35 to 44 years old is 128 and the standard deviation in this population is 15. The medical director of a company looks at the medical records of 72 company executives in this age group and finds that the mean systolic blood pressure in this group is 126.07. Is this enough evidence that executive blood pressures differ from the national average?

Solution: We will go through the steps outlined in the algorithm for hypothesis testing.

Step 1. (H0 : µ = 128) (Ha : µ ≠ 128)
In words, H0 says that executives are not different from other men, whereas Ha says that they are different.

Example (Executive Blood Pressure)

Step 2. Choose α = 0.05.

Step 3.
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{126.07 - 128}{15 / \sqrt{72}} \approx -1.09.$$

Step 4.
P-value = P(Z ≤ −1.09) + P(Z ≥ 1.09) = 1 − P(−1.09 < Z < 1.09) = 100% − (86.21% − 13.79%) = 27.58%.

Step 5. Since P-value = 0.2758 > α = 0.05, the null hypothesis is not rejected. In fact, more than 27% of the time (about 1 time out of 4), an SRS of size 72 from the general male population would produce a mean blood pressure at least as far from 128 as that of the executive sample.

Step 6.
There is not enough evidence at the α = 0.05 level that the blood pressure of middle-aged executives differs from that of other men.

Example (Sleeping Habits)

On average, how many hours do people sleep at night? One hundred Wal-Mart shoppers were asked this question. The sample mean was found to be x̄ = 7.5 hours. Assume σ = 1.5 hours. Is the result significantly different from 8 hours?

Solution: We can perform the one-sample z-test, but the conclusion will not be valid. The survey was not a random sample but a convenience sample, which consists of people who are readily available and convenient (Wal-Mart shoppers). Wal-Mart shoppers may differ systematically from the general population and do not constitute a representative sample of all people. So generalizations cannot be made to all people by studying just a convenience sample of 100 Wal-Mart shoppers.

Example (from Biostatistics by Triola & Triola)

A researcher is convinced that on average humans are colder than reported. A simple random sample of 106 body temperatures was taken, with a mean of 98.20°F. Assume that the population standard deviation σ is known to be 0.62°F. Use a 0.05 significance level to test the common belief that the mean body temperature of healthy adults is equal to 98.6°F.

Example (Healthy Body Temperature)

Solution: We write out the following steps:

Step 1. (H0 : µ = 98.6°F) (Ha : µ < 98.6°F).

Step 2. α = 0.05.

Step 3. The z-score is
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{98.2 - 98.6}{0.62 / \sqrt{106}} = -6.642.$$

Step 4. The P-value is P(Z ≤ −6.642), which is far smaller than 0.05.

Step 5. We reject H0.

Step 6. There is enough evidence at level α = 0.05 that the mean body temperature is less than 98.6°F.
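Steps 3 to 5 of the one-sample z-test can be sketched as a single function and applied to the two worked examples above. This is a minimal sketch under the test's assumptions (known σ, normal population or large n); the function name `one_sample_z_test`, its return tuple and the `alternative` labels are naming choices made for illustration.

```python
import math
from statistics import NormalDist


def one_sample_z_test(x_bar, mu_0, sigma, n, alternative="two-sided", alpha=0.05):
    """Run Steps 3-5 of the one-sample z-test; returns (z, p_value, reject_h0)."""
    z = (x_bar - mu_0) / (sigma / math.sqrt(n))   # Step 3: test statistic
    cdf = NormalDist().cdf                        # standard normal CDF
    if alternative == "less":                     # Step 4: P-value
        p = cdf(z)
    elif alternative == "greater":
        p = 1.0 - cdf(z)
    else:
        p = 2.0 * (1.0 - cdf(abs(z)))
    return z, p, p < alpha                        # Step 5: decision


# Executive blood pressure: H0: mu = 128, Ha: mu != 128.
z_bp, p_bp, reject_bp = one_sample_z_test(126.07, 128, 15, 72)
print(round(z_bp, 2), round(p_bp, 3), reject_bp)  # -1.09 0.275 False

# Healthy body temperature: H0: mu = 98.6, Ha: mu < 98.6.
z_temp, p_temp, reject_temp = one_sample_z_test(98.20, 98.6, 0.62, 106, alternative="less")
print(round(z_temp, 2), reject_temp)  # -6.64 True
```

Because the code keeps z unrounded, the blood pressure P-value comes out near 0.275 rather than the table-based 0.2758; the decision is the same. Steps 1, 2 and 6 (stating hypotheses, choosing α, wording the conclusion) remain the analyst's job.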
Example (from Biostatistics by Triola & Triola)

The health of the bear population in Yellowstone National Park is monitored by periodic measurements taken from anesthetized bears. A sample of 5 bears has a mean weight of 182.9 lb. Assuming that the standard deviation σ is known to be 121.8 lb, use a 0.1 significance level to test the claim that the population mean of all such bear weights is 200 lb.

Solution:

Step 1. H0 : µ = 200 lb, Ha : µ ≠ 200 lb.

Steps 2 and 3. α = 0.1 and the z-score is
$$z = \frac{182.9 - 200}{121.8 / \sqrt{5}} = -0.31.$$

Step 4. P-value: 1 − P(−0.31 ≤ Z ≤ 0.31) = 1 − 0.2434 = 0.7566 > 0.1.

Steps 5 and 6. We do not reject H0. There is not enough evidence at level α = 0.1 that µ ≠ 200 lb.

WARNING

In the previous example we assumed, very conveniently, that the distribution of bear weights is normal. Even under this assumption, the sample size n = 5 is far too small to use the normal table when the standard deviation has to be estimated from the sample, so our calculations do not apply!

What do we do in such a scenario? The answer lies in Student's t-distribution.

Student's t-Distribution

Student's t-Distribution (or simply the t-distribution): a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.

Like the normal distribution, the t-distribution is symmetric and bell-shaped. The t-distribution has heavier tails, meaning that it is more prone to producing values that fall far from its mean.

Why "Student"?

History: A derivation of the t-distribution was first published in 1908 by William Sealy Gosset while he worked at the Guinness Brewery in Dublin.
One version of the origin of the pseudonym Student is that Gosset's employer forbade members of its staff from publishing scientific papers, so he had to hide his identity. Another version is that Guinness did not want their competition to know that they were using the t-test to test the quality of raw material.

The t-test and the associated theory became well known through the work of the famous statistician R.A. Fisher, who called the distribution Student's distribution.

t-Distributions

There is not a single t-distribution. The t-distributions are indexed by Degrees of Freedom, a term related to the sample size that the t-distribution represents. For a sample of size n, use a t-distribution with n − 1 degrees of freedom. We only need t-distributions for sample sizes less than 30.

The Family of t-Distributions

There is a whole family of t-distributions, indexed by the number of degrees of freedom. The probability densities of all members of the family are symmetric about 0 and bell-shaped, but have more probability in the tails than the standard normal distribution does. For this reason a t-distribution is called heavy-tailed.

t-Distributions

We denote by $t^{n-1}$ the t-distribution with n − 1 degrees of freedom.

Consider a random variable X that is normally distributed (or at least symmetric), from which a sample of size n < 30 is taken. The distribution of the standardized sample mean then approximates $t^{n-1}$. That is, for the mean µ of X and the sample standard deviation s,

$$\frac{\bar{X} - \mu}{s / \sqrt{n-1}} \approx t^{n-1}.$$

Here s is the standard deviation of the sample computed with divisor n, the convention that the numerical examples in these slides follow; this is why $\sqrt{n-1}$ rather than $\sqrt{n}$ appears in the denominator.

Similar to confidence intervals with z-scores, a confidence level of C% has a t-score $t_C^{n-1}$ at the (C/2 + 50)th percentile.
Confidence Intervals Chart

To find a confidence interval for the population mean µ with confidence level C from a sample of size n with mean x̄ and sample standard deviation s:

(1) If the population standard deviation σ is known, and either the population distribution is normal or the sample size is large (n ≥ 30):

$$\left[\bar{x} - \frac{z_C \times \sigma}{\sqrt{n}},\; \bar{x} + \frac{z_C \times \sigma}{\sqrt{n}}\right]$$

(2) If the population standard deviation σ is unknown, then

Case 1: (n < 30 and the population distribution is normal)

$$\left[\bar{x} - \frac{t_C^{n-1} \times s}{\sqrt{n-1}},\; \bar{x} + \frac{t_C^{n-1} \times s}{\sqrt{n-1}}\right]$$

Case 2: (n ≥ 30)

$$\left[\bar{x} - \frac{z_C \times s}{\sqrt{n}},\; \bar{x} + \frac{z_C \times s}{\sqrt{n}}\right]$$

Example (Vitamin C)

The amount of vitamin C, in mg/100g, in a certain produce is measured in a random sample of size 10:

26 31 23 21 10 25 33 12 16 30

Compute the 95% confidence interval for µ, the mean vitamin C content.

Solution: The sample mean is x̄ = 22.7 and the sample standard deviation is s = 7.53. Assuming that the distribution of vitamin C content in the produce is normal, we will use the t-confidence interval, since σ is unknown and n = 10 < 30. The t-distribution in this case has 9 degrees of freedom.

Since we are looking for a confidence level of 95%, we must find the t-score at the 95/2 + 50 = 97.5th percentile. On our t-table, that is the entry under df = 9 and $t_{0.025}$:

$$t_{0.025}^{9} = 2.262$$

The 95% CI for µ is

$$\left[\bar{x} - \frac{t_C^{n-1} \times s}{\sqrt{n-1}},\; \bar{x} + \frac{t_C^{n-1} \times s}{\sqrt{n-1}}\right] = \left[22.7 - \frac{2.262 \times 7.53}{\sqrt{9}},\; 22.7 + \frac{2.262 \times 7.53}{\sqrt{9}}\right] = [17.02, 28.38]$$

We are 95% confident that the mean vitamin C content in the produce is between 17.02 and 28.38 mg/100g.
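The Vitamin C interval can be reproduced from the raw data with a short sketch. It assumes, as the slide's s = 7.53 suggests, that s is the standard deviation computed with divisor n (Python's `statistics.pstdev`), and it takes the critical value 2.262 from the t-table as an input rather than computing it, since the standard library has no t-distribution. The function name `t_confidence_interval` is hypothetical.

```python
import math
from statistics import mean, pstdev


def t_confidence_interval(data, t_critical):
    """CI for mu as x_bar +/- t* x s / sqrt(n - 1), with s the divisor-n SD."""
    n = len(data)
    x_bar = mean(data)
    s = pstdev(data)                              # SD computed with divisor n
    margin = t_critical * s / math.sqrt(n - 1)    # t* times the standard error
    return x_bar - margin, x_bar + margin


vitamin_c = [26, 31, 23, 21, 10, 25, 33, 12, 16, 30]
T_CRITICAL_DF9_95 = 2.262   # read from a t-table: df = 9, 97.5th percentile
low, high = t_confidence_interval(vitamin_c, T_CRITICAL_DF9_95)
print(round(low, 2), round(high, 2))  # 17.02 28.38
```

Computed from the raw data without intermediate rounding, the bounds come out at roughly 17.02 and 28.38. Note that s/√(n − 1) with the divisor-n SD is the same quantity as the divisor-(n − 1) sample SD divided by √n, so the interval agrees with the form found in many other textbooks.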
Example (Textbook Expenditures)

A random sample of semester textbook expenditures by 81 full-time university students had a mean of $100 and a standard deviation of $30. Find a 99% confidence interval for the mean expenditure on textbooks by students at this university.

Solution: We have x̄ = 100 and s = 30. The population standard deviation σ is unknown, and n = 81 > 30. The approximate 99% confidence interval for µ is found using a z-table (as n > 30).

We must find the z-score, $z_C$, for the 99/2 + 50 = 99.5th percentile: $z_C$ = 2.576.

$$\left[100 - 2.576 \times \frac{30}{9},\; 100 + 2.576 \times \frac{30}{9}\right] = [91.41, 108.59]$$

Interpretation: We are 99% confident that the true mean expenditure on textbooks by students at the university is between $91.41 and $108.59.

One Sample t-Test for the Population Mean µ

A test of significance for the population mean µ with sample size n < 30 will use t-scores rather than z-scores. Assume the null hypothesis states (H0 : µ = µ0). The test statistic is calculated as

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n-1}}$$

The t-statistic is the basis for the t-test for µ, which has steps analogous to the z-test with only two differences:

1. The t-test uses the t-statistic in Step 3.
2. In the t-test, the P-value in Step 4 is computed as the corresponding area under the t-curve with n − 1 degrees of freedom.

The t-test is usually used when σ is unknown, the distribution of X is roughly normal, and the sample is small (n < 30).

Example (Piano Lessons for Preschoolers)

Do piano lessons improve the spatial-temporal reasoning of preschool children? A study designed to test this hypothesis measured the spatial-temporal reasoning of 20 preschool children before and after 6 months of piano lessons.
The changes in reasoning scores are shown below:

2 -2 5 9 7 6 -2 0 2 3 7 6 4 -1 1 3 0 -4 7 -6

Solution: Summary statistics: x̄ = 2.35, s = 3.98.

[Figure: histogram of the changes in reasoning scores]

The distribution is not normal-like, but it is not extremely skewed.

The sample size is n = 20 < 30 and σ is unknown, so we will use the t-test. The hypotheses are

(H0 : µ = 0) (Ha : µ > 0)

The t-statistic is

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n-1}} = \frac{2.35 - 0}{3.98 / \sqrt{19}} = 2.57,$$

with n − 1 = 19 degrees of freedom. From the t-table: P-value < 0.01.

The result is highly significant. We reject the null hypothesis and conclude that piano lessons improve the spatial-temporal reasoning of preschoolers.

Example (Bear Weights)

A sample of 5 bears has a mean weight of 182.9 lb. Assuming that the standard deviation σ is known to be 121.8 lb, use a 0.1 significance level to test the claim that the population mean of all such bear weights is 200 lb.

Solution: We shall use the t-test since n = 5 < 30, treating 121.8 lb as the sample standard deviation s and assuming that the weight distribution is normal.

Step 1. (H0 : µ = 200 lb) (Ha : µ ≠ 200 lb)

Steps 2 and 3. α = 0.1 and the t-score is
$$t = \frac{182.9 - 200}{121.8 / \sqrt{4}} = -0.28$$

Step 4. The P-value is $2P(t^{4} \ge |-0.28|) > 2 \cdot 0.25 = 0.5 > 0.1$.

Steps 5 and 6. We do not reject H0. There is not enough evidence at level α = 0.1 that µ ≠ 200 lb.

Chart for Tests of Significance for µ

We have two types of test statistic for the null hypothesis (H0 : µ = µ0).

(1) If the sample size is large (n > 30), then
$$\text{test statistic} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
and use the normal table to calculate a P-value.
(2) If the sample size is small (n ≤ 30) and the population distribution is roughly normal, then
$$\text{test statistic} = \frac{\bar{x} - \mu_0}{s / \sqrt{n-1}}$$
and use the t-table with n − 1 degrees of freedom to calculate a P-value.

Example (from McGhee, Introductory Statistics)

Rats that are raised in a laboratory environment have a mean life span of around 24 months. A sample of 31 rats reared to adulthood in a germ-free environment had life spans with a mean of 27.3 months and a standard deviation of 5.9 months. Does this type of rearing have an effect on the life span of the laboratory rat?

Solution: We are given x̄ = 27.3, s = 5.9, n = 31.

Step 1. (H0 : µ = 24) (Ha : µ > 24).
Note: We set the alternative as one-sided since we would expect the lifetime to increase in the germ-free environment.

Step 2. Choose α = 0.05.

Example (Rat Lifespan)

Step 3. Calculate the test statistic:
$$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{27.3 - 24}{5.9 / \sqrt{31}} \approx 3.1$$

Step 4. Calculate the P-value:
P-value = P(Z ≥ 3.10) ≈ 0.001.

Step 5. The result is highly significant, so we reject H0.

Step 6. There is enough statistical evidence at the α = 0.05 level that the average lifetime of rats living in a germ-free environment is greater than 24 months.

Two-Sided Significance Tests and Confidence Intervals

Confidence intervals for µ and two-sided significance tests for µ are related. We can decide whether or not to reject H0 from the computed two-sided CI for µ.

Relationship between the two-sided test of significance and the confidence interval for µ: A level α two-sided significance test rejects a hypothesis (H0 : µ = µ0) exactly when the value µ0 falls outside a 100(1 − α)% confidence interval for µ.
The significance level α of the two-sided test is related to the confidence level C of the confidence interval through the rule C = 1 − α.

Two-Sided Significance Tests and Confidence Intervals

H0 : µ = µ0, Ha : µ ≠ µ0, α = 1 − C

Case 1: µ0 is inside the level-C CI around x̄. Decision: fail to reject H0 at level α.
Case 2: µ0 is outside the level-C CI around x̄. Decision: reject H0 at level α.

Example (Chicken)

A company that manufactures chicken feed has developed a new product. The company claims that 12 weeks after hatching, the average weight of chickens using this product will be 3.0 pounds. The owner of a large chicken farm decided to examine this new product, so he fed the new ration to all 12,000 of his newly hatched chickens. At the end of 12 weeks he selected a simple random sample of 20 chickens and weighed them. The sample mean for the 20 chickens was 3.06 pounds and the sample standard deviation was s = 0.63 pounds.

(a) Find a 95% confidence interval for the mean µ of the 12,000 chickens.
(b) Perform a test of significance to check the company's claim (α = 0.05).

Solution: We have 19 degrees of freedom and

s = 0.63, n = 20, x̄ = 3.06, µ0 = 3.

95% Confidence Interval: $\bar{x} \pm t_{0.025}^{19} \times \frac{s}{\sqrt{n-1}}$

$$\left[3.06 - 2.093 \times \frac{0.63}{\sqrt{19}},\; 3.06 + 2.093 \times \frac{0.63}{\sqrt{19}}\right] = [2.76, 3.36]$$

The hypotheses: H0 : µ = 3, Ha : µ ≠ 3.

The test statistic:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n-1}} = \frac{3.06 - 3}{0.63 / \sqrt{19}} = 0.415.$$

$$P\text{-value} = 2P(t^{19} \ge 0.415) > 2(0.10) = 0.2$$

We fail to reject H0 since P-value > α.

Conclusion: We do not have enough statistical evidence at the α = 0.05 level to claim that the mean chicken weight is different from 3 pounds. So the data are consistent with the company's claim.
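The link between the two-sided test and the confidence interval in the chicken example can be checked numerically. The sketch below is illustrative only: it uses the slides' convention t = (x̄ − µ0)/(s/√(n − 1)), takes the table value 2.093 as an input, and decides the test by comparing |t| with that critical value, which is equivalent to comparing the two-sided P-value with α. The function names are hypothetical.

```python
import math


def t_statistic(x_bar, mu_0, s, n):
    """t = (x_bar - mu_0) / (s / sqrt(n - 1)), the convention used in the slides."""
    return (x_bar - mu_0) / (s / math.sqrt(n - 1))


def two_sided_decisions(x_bar, mu_0, s, n, t_critical):
    """Return (ci, t, reject_by_ci, reject_by_test) for a two-sided test."""
    margin = t_critical * s / math.sqrt(n - 1)
    ci = (x_bar - margin, x_bar + margin)
    t = t_statistic(x_bar, mu_0, s, n)
    reject_by_ci = not (ci[0] <= mu_0 <= ci[1])   # mu_0 falls outside the interval
    reject_by_test = abs(t) > t_critical          # same as two-sided P-value < alpha
    return ci, t, reject_by_ci, reject_by_test


# Chicken example: n = 20, x_bar = 3.06, s = 0.63, mu_0 = 3, t* = 2.093 (df = 19, 95%).
ci, t, by_ci, by_test = two_sided_decisions(3.06, 3.0, 0.63, 20, 2.093)
print([round(b, 2) for b in ci], round(t, 3), by_ci, by_test)
# [2.76, 3.36] 0.415 False False
```

The two decisions always agree, because µ0 lies outside x̄ ± t*·SE exactly when |x̄ − µ0|/SE exceeds t*.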
NOTE: The computed 95% confidence interval contains µ0 = 3 (pounds), and the test of significance at the α = 0.05 = 1 − 0.95 level does not reject H0.
TRUE: If µ0 belongs to the 1 − α confidence interval, then the α-level two-sided test fails to reject H0.
TRUE: If the 1 − α confidence interval does not contain µ0, then the α-level two-sided test rejects H0.

Two Types of Error in Tests of Significance
We either reject H0 or fail to reject H0 based on the data. We hope that our decision is correct, but sometimes it will be wrong! There are two types of incorrect decisions: TYPE I and TYPE II ERRORS.
If we reject H0 when in fact H0 is true, this is a TYPE I error.
If we do not reject H0 when in fact Ha is true, this is a TYPE II error.

Probabilities of Type I and Type II Errors
A Type I error can be thought of as convicting an innocent person. A Type II error can be thought of as letting a guilty person go free.
Significance and Type I error. From the definition of the significance level α it follows that the probability of a Type I error is equal to α. This explains why we want to choose α to be small.
Power of the test and Type II error. The power of the test is the probability that the test rejects H0 when Ha is true. High power is desirable. The probability of a Type II error is 1 minus the power of the test.

Practical Significance vs. Statistical Significance
Refer to Section 3 of Chapter 29 of the textbook. Sometimes a difference is statistically significant but practically unimportant. The following example illustrates the point.

Example (Bulb)
An engineer has designed an improved light bulb. The previous design had an average lifetime of 1200 hours.
Based on a sample of n = 2500 of the new bulbs, the average lifetime was found to be x̄ = 1201 hours. Take σ = 10 hours. Does the new bulb have a greater lifetime?

Solution: The hypotheses: H0 : µ = 1200, Ha : µ > 1200.
Test statistic: z = (x̄ − µ0) / (σ/√n) = (1201 − 1200) / (10/√2500) = 5.
The P-value: P-value = P(Z ≥ 5) ≈ 0.
We reject H0 and conclude that we have enough evidence that the new bulb is better. But how much better is it? Is a lifetime increase of 1 hour for a light bulb really important?
REMARK: Statistical significance is easier to show with larger sample sizes n. Even a tiny difference between the true mean µ and the hypothesized mean µ0 will be evident if we choose a large enough sample.

Example (Types of Error)
A medical researcher is working on a new treatment for a certain type of cancer. The average survival time after diagnosis on the standard treatment is 2 years. In an early trial, she tries the new treatment on three subjects, who have an average survival time after diagnosis of 4 years. Although the survival time has doubled, the results are not statistically significant even at the 0.10 significance level. Suppose, in fact, that the new treatment does increase the mean survival time in the population of all patients with this particular type of cancer. What type of error, if any, has been committed?

Solution: The hypotheses are H0 : µ = 2, Ha : µ > 2. That the results are not statistically significant means that we fail to reject H0. But we know that the new treatment does increase the mean survival time, which means that, in fact, Ha is true. So we failed to reject H0 when in fact Ha is true. This is a Type II error.
Comment: Having just 3 patients was not enough to establish the significance of the result.

Concluding Remarks about Significance Tests
(i) We only discussed the one-sample significance tests for µ. Many other significance tests exist. In fact, a test of significance can be constructed for any population parameter, or for the difference between two parameters (see Chapter 27).
(ii) Different tests have different technical details (such as different hypotheses, test statistics and rules for P-value computation), but all the significance tests use the same steps and definitions, and have a similar interpretation.
(iii) A chance model is required for a test of significance; a box model is a type of chance model.
(iv) A test of significance only determines if a difference is real or due to chance variation. It does not rank how important the difference is, explain what causes it, or check the validity of the study used to accumulate data.
(v) The z-test and t-test are tests which compare the mean of a sample to the mean established by an external standard.
(vi) The χ²-test, our next topic, compares observed and expected frequencies.

The Chi-Square Test
Often, we must ask the basic and necessary question: How well does the model fit the facts? In many cases, the answer is given by the χ²-test.
χ is a Greek letter. It is often written as chi and is pronounced "ki" as in kite.
The χ²-test compares observed and expected frequencies to determine whether a model is appropriate.

The Chi-Square Test: An Example
A gambler is accused of using a loaded die, but he pleads innocent. A record has been kept of the last 60 throws.
4 2 3 6 6 5 3 4 3 4
3 4 3 1 4 4 6 6 1 3
5 2 2 3 2 3 4 3 4 3
3 5 5 3 6 3 4 3 6 2
4 5 6 4 4 4 6 3 5 3
5 4 3 1 6 4 1 5 2 4

There is some disagreement about how to interpret the data and a statistician is called in. What is the verdict?

Solution: If the gambler is innocent, the numbers in the given table should be like the result of drawing randomly from a box containing the numbers 1 through 6. Therefore, each of the six numbers should appear in the table approximately 10 times: the expected frequency is 10. To compare this expectation with what we observed, we calculate the frequency distribution:

Value    Observed frequency    Expected frequency
1        4                     10
2        6                     10
3        17                    10
4        16                    10
5        8                     10
6        9                     10
Sum      60                    60

As we can observe, the table has too many 3's and 4's.

The standard error for the number of 3's is √(n · p(1 − p)) = √(60 · (1/6) · (5/6)) ≈ 2.9. Therefore, the observed number of 3's is (17 − 10)/2.9 ≈ 2.4 SE's above the expected number.
However, we shouldn't take the table one line at a time! For example, there are also too many 4's. But with many lines in the table, there is a high probability that at least one of them will look suspicious, even if the die is fair! We need a single overall measure of how far the observed frequencies are from the expected ones.
The value χ² is defined as

χ² = sum of (observed frequency − expected frequency)² / expected frequency.

The formula is not arbitrarily derived, as we shall see later. For now,

χ² = (4 − 10)²/10 + (6 − 10)²/10 + · · · + (8 − 10)²/10 + (9 − 10)²/10 = 14.2.

When the observed frequency is far from the expected frequency, the corresponding term in the sum is large; when the two are close, this term is small.
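As a check on the arithmetic, the frequency table and the χ²-statistic can be recomputed from the 60 recorded throws. This is a minimal sketch; the helper name chi_square_statistic is hypothetical, and the fair-die expectation of 10 per face is the null model from the example.

```python
from collections import Counter

# The 60 recorded throws, in the order given in the example.
throws = [
    4, 2, 3, 6, 6, 5, 3, 4, 3, 4,
    3, 4, 3, 1, 4, 4, 6, 6, 1, 3,
    5, 2, 2, 3, 2, 3, 4, 3, 4, 3,
    3, 5, 5, 3, 6, 3, 4, 3, 6, 2,
    4, 5, 6, 4, 4, 4, 6, 3, 5, 3,
    5, 4, 3, 1, 6, 4, 1, 5, 2, 4,
]

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

counts = Counter(throws)
observed = [counts[face] for face in range(1, 7)]  # frequencies of faces 1..6
expected = [len(throws) / 6] * 6                   # fair die: 10 per face
chi2 = chi_square_statistic(observed, expected)
print(observed, round(chi2, 1))
```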
Question: What is the chance that when a fair die is rolled 60 times and χ² is computed from the observed frequencies, its value turns out to be 14.2 or more? Note that larger values of χ² would be even stronger evidence against the model.

Karl Pearson
Calculating this chance is a tremendous undertaking! Back in the days of Karl Pearson (1900s) there were no computers. He came up with a distribution that allows this probability to be computed by hand! It involved a new curve, called the χ²-curve. There is one curve for each number of degrees of freedom, analogous to the t-distribution. Moreover, if the model is fully specified, then

degrees of freedom = number of terms in χ² − one.

The χ²-Curve
How does a χ²-curve look? (Figure not reproduced.)

Properties of the χ²-Distribution
1. The χ²-distribution is not symmetric, unlike the Student t or the normal distribution.
2. The values of χ² can never be negative.
3. The χ²-distribution is different for each number of degrees of freedom, which is given by df = n − 1, where n is the number of categories.

The χ²-Table
This is only part of the full table; we have only highlighted the relevant part that we will be making use of. (Table not reproduced.)

The Chi-Square Test: An Example
In our case, we need χ² with 5 degrees of freedom. It follows from the table that the probability of 14.2 or more is slightly more than 1%. With modern computing power, we can get the answer in an instant: 1.4%.
In any event, the statistician's work here is done! There is strong evidence that the gambler pleading innocence is in fact using a loaded die!
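The 1.4% figure can be reproduced without a table. The sketch below is one possible way to do it using only the Python standard library: it computes the upper tail of the χ²-distribution with the standard recurrence for the regularized upper incomplete gamma function, Q(a + 1, y) = Q(a, y) + y^a e^(−y) / Γ(a + 1), starting from Q(1/2, y) = erfc(√y) or Q(1, y) = e^(−y). The function name chi2_tail is hypothetical.

```python
import math

def chi2_tail(x, df):
    """P(chi-square with df degrees of freedom >= x), for x >= 0 and integer df >= 1.

    Illustrative helper: builds Q(df/2, x/2) by stepping the shape
    parameter up by 1 from a closed-form starting value.
    """
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # Q(1, x/2), i.e. df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # Q(1/2, x/2), i.e. df = 1
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

p_value = chi2_tail(14.2, 5)
print(round(100 * p_value, 1), "%")
```

For the loaded-die example this gives a tail area of about 1.4%, in line with the table lookup of "slightly more than 1%".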
The χ²-Test
When testing a hypothesis on a trial with multiple categories (tickets), use the χ²-test. The steps of the test are outlined below:
(i) Create the chance model (box model).
(ii) Create a frequency table consisting of the observed frequency and expected frequency for each category (ticket).
(iii) Compute the χ²-statistic.
(iv) Compute the degrees of freedom: the number of categories minus 1.
(v) Obtain a P-value from the χ²-table and consider rejecting H0.

Example (Grand Juries)
A study of grand juries formed in Alameda County, California investigated whether the age of jurors chosen is representative of the age of the population. The size of a grand jury varies, but a total of 66 jurors were sampled (representing 6 juries).

Age            County percentage    Number of jurors
21 to 40       42                   15
41 to 50       23                   14
51 to 60       16                   19
61 and over    19                   18
Total          100                  66

Does the age composition of juries represent the county?

Solution: We use the χ²-test with a null hypothesis assuming juries represent the age composition of the county. Since α was not specified, let α = 0.05.
(i) The box model: a box of tickets marked with the four age groups in the county proportions (42%, 23%, 16%, 19%); under H0 the 66 jurors are like 66 draws at random from this box.
(ii) The frequency table for 66 jurors:

Age            Expected    Observed
21 to 40       27.7        15
41 to 50       15.2        14
51 to 60       10.6        19
61 and over    12.5        18
Total          66          66

(iii) The χ²-statistic:

χ² = (15 − 27.7)²/27.7 + (14 − 15.2)²/15.2 + (19 − 10.6)²/10.6 + (18 − 12.5)²/12.5 ≈ 15

(iv) There are 4 categories, so 3 degrees of freedom.
(v) From the χ²-table, the P-value is less than 0.5%.
We reject the null hypothesis, as our significance level was assumed to be 5%. The sample gives statistically significant evidence of a bias toward choosing older jurors.
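When the categories are not equally likely, the expected frequencies come from the model percentages. The following sketch, with hypothetical variable names, recomputes step (ii) and step (iii) for the grand jury data and also lists each category's contribution to χ², which shows where the discrepancy comes from.

```python
# County age distribution (percent) and observed juror counts from the example.
county_percent = {"21 to 40": 42, "41 to 50": 23, "51 to 60": 16, "61 and over": 19}
jurors = {"21 to 40": 15, "41 to 50": 14, "51 to 60": 19, "61 and over": 18}

total = sum(jurors.values())  # 66 jurors

# Expected count = total draws * chance of that ticket under the null model.
expected = {age: total * pct / 100 for age, pct in county_percent.items()}

# Contribution of each category: (observed - expected)^2 / expected.
contributions = {
    age: (jurors[age] - expected[age]) ** 2 / expected[age] for age in jurors
}
chi2 = sum(contributions.values())
df = len(jurors) - 1

for age in jurors:
    print(age, round(expected[age], 1), round(contributions[age], 2))
print("chi-square:", round(chi2, 1), "df:", df)
```

Most of the statistic comes from the youngest group (fewer jurors than expected) and the 51 to 60 group (more than expected), which matches the conclusion about older jurors.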
Uses of the χ²-Curve
There are several other uses of the χ²-curve:
The χ²-statistic can be used to test independence.
The χ²-statistic can be used with any number of categories!
The χ²-statistic can be used to test a claim about a standard deviation.

Independent Experiments
If experiments are performed independently, the results can be pooled by adding the separate χ²-statistics and adding the degrees of freedom.
Example: Assume experiment A is performed independently of experiment B. Assume A has χ² = 5.8 with 5 degrees of freedom and B has χ² = 3.1 with 2 degrees of freedom. The combined experiment A + B has χ² = 5.8 + 3.1 = 8.9 and 5 + 2 = 7 degrees of freedom.

Independence Testing
The χ²-test is able to test for independence. This will be highlighted through examples.
The HANES study of 2,237 Americans between the ages of 25 and 34 recorded the gender and dominant hand of subjects.

                Men      Women    Total
Right-Handed    934      1,070    2,004
Left-Handed     113      92       205
Ambidextrous    20       8        28
Total           1,067    1,170    2,237

Assume that subjects were chosen in a simple random sample. From this sample, is dominant hand independent of gender?

We have a null hypothesis:
H0 : Dominant Hand and Gender are Independent
We have an alternative hypothesis:
HA : Dominant Hand and Gender are Dependent
We do not know the population parameters with respect to dominant hand and gender, only the information given by the sample. We have a large sample, so we will assume the population matches the sample.

                Men      Women
Right-Handed    87.5%    91.5%
Left-Handed     10.6%    7.9%
Ambidextrous    1.9%     0.7%

Using H0, the hypothesis that hand dominance and gender are independent, we are able to construct a table of observed and expected frequencies. Under H0, each expected frequency is (row total × column total) / grand total; for example, 2,004 × 1,067 / 2,237 ≈ 956 right-handed men.

                Observed Men    Observed Women    Expected Men    Expected Women
Right-Handed    934             1,070             956             1,048
Left-Handed     113             92                98              107
Ambidextrous    20              8                 13              15

How many degrees of freedom are there? When testing for independence in an m × n table, there are (m − 1) × (n − 1) degrees of freedom.
In this example we start with a 3 × 2 table: 3 rows and 2 columns. Therefore, we have (3 − 1) · (2 − 1) = 2 degrees of freedom.

We now find the χ²-statistic:

χ² = Σ (observed − expected)² / expected
   = (934 − 956)²/956 + (1,070 − 1,048)²/1,048 + (113 − 98)²/98 + (92 − 107)²/107 + (20 − 13)²/13 + (8 − 15)²/15
χ² ≈ 12

We have yet to set a significance level, though the P-value for a χ²-statistic of 12 with 2 degrees of freedom is less than 0.5%. The sample provides strong statistical evidence against the null hypothesis. Gender and dominant hand appear to be dependent.

Z-Test or χ²-Test?
When should the χ²-test be used, as opposed to the z-test?
The z-test says whether the data are like the result of drawing at random from a box whose average is given.
The χ²-test says whether the data are like the result of drawing at random from a box whose contents are given.
The z-test deals with averages. The χ²-test deals with frequencies from all categories; this test is more comprehensive and deals with the balance expected from the model.

Example (Squash) (from Statistics by Samuels et al.)
A cross between white and yellow summer squash gave progeny of the following colors:

Color     No. of progeny
White     155
Yellow    40
Green     10

Question: Are these data consistent with the 12 : 3 : 1 ratio predicted by a certain genetic model? (Use a χ²-test with α = 0.10.)

There are three categories involved here. According to the given data:

Color     Observed frequency    Expected frequency
White     155                   (12/16) · 205 = 153.75
Yellow    40                    (3/16) · 205 = 38.44
Green     10                    (1/16) · 205 = 12.81

The χ²-statistic is

χ² = sum of (observed frequency − expected frequency)² / expected frequency
   = (155 − 153.75)²/153.75 + (40 − 38.44)²/38.44 + (10 − 12.81)²/12.81
   ≈ 0.69.

Recall that the null hypothesis is (H0 : the colors follow the 12 : 3 : 1 ratio) and the alternative is (Ha : the colors do not follow this ratio). We compare the area under the χ²-curve with 3 − 1 = 2 degrees of freedom to α = 0.10. The probability P(χ² > 0.69) with 2 degrees of freedom is about 0.71, which is bigger than 0.10. Therefore, we cannot reject the null hypothesis based on the data at α = 0.10; the data are consistent with the 12 : 3 : 1 ratio.
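As a closing numerical check on the independence test for the HANES handedness table, the expected frequencies and the χ²-statistic can be computed directly from the observed counts. This is a sketch under the same null model (expected count = row total × column total / grand total); the function names are hypothetical, and the P-value line uses the fact that for exactly 2 degrees of freedom the upper tail of the χ²-distribution is e^(−χ²/2).

```python
import math

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (chi2, df, expected) for an m x n table of observed counts."""
    expected = expected_counts(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

# Rows: right-handed, left-handed, ambidextrous. Columns: men, women.
hanes = [[934, 1070], [113, 92], [20, 8]]
chi2, df, expected = chi_square_independence(hanes)

# Valid only because df == 2 here: P(chi-square >= x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print([[round(e) for e in row] for row in expected])
print(round(chi2, 1), df, p_value < 0.005)
```

With unrounded expected counts the statistic comes out slightly below the hand computation of about 12, since the slide rounds the expected frequencies to whole numbers first; the conclusion is unchanged.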