START WORKSHOP - STATISTICS
INSC 60010 STATISTICAL MODELS FOR MANAGERIAL DECISIONS
FALL 2013
Ranga Ramasesh

DISCUSSION TOPICS

1. Workshop Objectives
   1. Data Analysis - Motivations and Goals
   2. Preparatory Tools
2. Graphs for Exploratory Data Analysis
   1. Histograms / Frequency Distributions
   2. Scatter Plots
   3. Time Series Plots
3. Models of Uncertainty
   1. Sampling and Sampling Distributions
   2. Z-Distribution
   3. T-Distribution
4. Inference Based on Sample Data
   1. Confidence Interval Estimation
   2. Hypothesis Testing
5. Module Overview
   1. Introduction to Regression Analysis
   2. Syllabus and Administrative Details

NOTE: USE OF GRAPHS FOR EXPLORATORY DATA ANALYSIS

Example: Graduation Rates

Data on 142 colleges are available in a spreadsheet with filename prefix college06. This information was obtained from the 2006 issue of U.S. News and World Report, and it includes:

1. name of the college
2. graduation rate (GRATE)
3. freshman retention rate (FRESH)
4. percent of classes with fewer than 20 students (CLASS20)
5. percent of classes with more than 50 students (CLASS50)
6. percent of full-time faculty (FTFAC)
7. 75th percentile of SAT scores (SAT75)
8. percent of incoming students in top 10% of high school class (TOP10)
9. acceptance rate (ARATE)
10. alumni giving rate (ALUM)
11. indicator for private school (1 = private; 0 = public) (PRIV)

A small subset of the data is shown in the following table.

Managerial Concerns

1. How do we make sense of the graduation rates across the different colleges?
2. How do we understand the relationship, if any, between graduation rate and the SAT scores?
Sample Data Set

Values are listed per school in the order: GRATE, FRESH, CLASS20, CLASS50, FTFAC, SAT75, TOP10, ARATE, ALUM, PRIV.

Harvard University: 0.98, 0.97, 0.70, 0.13, 0.92, 1580, 0.96, 0.11, 0.47, 1
Princeton University: 0.97, 0.98, 0.74, 0.11, 0.91, 1560, 0.94, 0.13, 0.61, 1
Yale University: 0.96, 0.98, 0.74, 0.08, 0.89, 1560, 0.95, 0.10, 0.46, 1
University of Pennsylvania: 0.94, 0.98, 0.75, 0.07, 0.88, 1500, 0.94, 0.21, 0.40, 1
Duke University: 0.94, 0.97, 0.72, 0.05, 0.97, 1530, 0.87, 0.24, 0.45, 1
Stanford University: 0.93, 0.98, 0.69, 0.12, 0.99, 1550, 0.87, 0.13, 0.38, 1
California Institute of Technology: 0.88, 0.96, 0.63, 0.09, 0.98, 1570, 0.93, 0.21, 0.32, 1
Massachusetts Institute of Technology: 0.92, 0.98, 0.61, 0.16, 0.91, 1560, 0.97, 0.16, 0.37, 1
Columbia University: 0.93, 0.98, 0.69, 0.10, 0.91, 1540, 0.86, 0.13, 0.34, 1
Dartmouth College: 0.95, 0.97, 0.61, 0.10, 0.93, 1550, 0.88, 0.19, 0.49, 1
Washington University in St. Louis: 0.92, 0.97, 0.74, 0.08, 0.92, 1520, 0.93, 0.22, 0.39, 1
Northwestern University: 0.92, 0.97, 0.73, 0.08, 0.93, 1500, 0.82, 0.30, 0.29, 1
Cornell University: 0.92, 0.96, 0.44, 0.22, 0.99, 1490, 0.85, 0.29, 0.35, 1
Johns Hopkins University: 0.91, 0.95, 0.55, 0.17, 1.00, 1490, 0.80, 0.30, 0.33, 1
Brown University: 0.96, 0.97, 0.65, 0.12, 0.94, 1520, 0.90, 0.17, 0.38, 1
University of Chicago: 0.87, 0.95, 0.55, 0.06, 0.95, 1530, 0.82, 0.40, 0.29, 1
Rice University: 0.91, 0.96, 0.60, 0.10, 0.93, 1540, 0.86, 0.22, 0.36, 1
University of Notre Dame: 0.96, 0.98, 0.56, 0.10, 0.85, 1470, 0.85, 0.30, 0.49, 1
Vanderbilt University: 0.86, 0.94, 0.67, 0.06, 0.97, 1440, 0.77, 0.38, 0.28, 1
Emory University: 0.86, 0.94, 0.67, 0.07, 0.95, 1460, 0.90, 0.39, 0.19, 1
University of California - Berkeley: 0.87, 0.96, 0.58, 0.15, 0.91, 1450, 0.99, 0.25, 0.15, 0

Frequency Distributions and Histograms

Definition: A frequency distribution is a table that summarizes the numerical values of a variable by recording the number of times (frequency) values fall within certain ranges called classes or bins.

Definition: A histogram is a graph of a frequency distribution.
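The frequency-distribution idea just defined can be sketched in a few lines of code. Below is a minimal Python sketch, assuming upper-inclusive bin limits (the convention Excel's Histogram tool uses); the graduation rates are made-up illustrative values, not the full college06 data.

```python
# Minimal sketch: build a frequency distribution with upper-inclusive
# bin limits (each value is counted in the first bin whose limit it
# does not exceed). The rates below are made-up illustrative values.
from bisect import bisect_left

def frequency_distribution(values, bin_limits):
    """Count how many values fall at or below each successive bin limit."""
    counts = [0] * len(bin_limits)
    for v in values:
        i = bisect_left(bin_limits, v)  # first limit >= v (upper-inclusive)
        if i < len(bin_limits):
            counts[i] += 1
    return dict(zip(bin_limits, counts))

rates = [0.98, 0.97, 0.96, 0.94, 0.87, 0.62, 0.55, 0.71]
limits = [0.60, 0.70, 0.80, 0.90, 1.00]
fd = frequency_distribution(rates, limits)
print(fd)  # {0.6: 1, 0.7: 1, 0.8: 1, 0.9: 1, 1.0: 4}
```

Plotting these counts as bars over the bins gives the histogram defined above.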
Using Excel

A frequency distribution can be constructed in Excel by choosing the "Data" tab, then choosing "Data Analysis" from the "Analysis" category, and then choosing "Histogram" (or by choosing "Histogram" from the "Data Analysis" option on the "Tools" menu in earlier versions of Excel). The variable examined in this example is the graduation rate for the colleges in our sample.

Bin limits are the upper inclusive values for each bin. The bin limits chosen by Excel may be somewhat awkward values to work with, so Excel also allows you to specify the bin limits. Note that the numbers shown in the Bin column of the Excel frequency distribution represent the upper limits of the bins.

Note: If you don't like the bin limits that Excel chooses (and I don't), you can choose your own. Below is another frequency distribution of graduation rate using bin limits that I chose: the first (lowest) bin limit is 0.20, with increments of 0.05 up to a maximum of 1.00. Excel will always add a last row to the frequency distribution and label it "More". You can delete it, as I did here.

Frequency Distribution

Bin:       25%  30%  35%  40%  45%  50%  55%  60%  65%  70%  75%  80%  85%  90%  95%  100%
Frequency:  1    1    0    5    4    9   14   15    6   14   21   13   11    9   14    5

Histogram

[Histogram (formatted) of the above frequency distribution: frequency on the vertical axis (0 to 25), bins from 25% to 100% on the horizontal axis.]

Scatterplots

Definition: A scatterplot is a plot showing the relationship between two variables X and Y.

Suppose we want to examine the relationship between graduation rates and SAT score. The following plot is the scatterplot of GRATE versus SAT75.
[Scatterplot of Graduation Rate (vertical axis, 0.00 to 1.00) versus SAT 75th Percentile (horizontal axis, 800 to 1700), with fitted trend line y = 0.0011x - 0.7635, R² = 0.7596.]

Time Series Plots

Time-series plot (line plot) of furniture sales in millions of dollars, January 1992 through December 2007. Data file: Furnsales.

[Time-series line plot of FURNSALES: monthly sales on the vertical axis (0 to 14,000) over the 192 months in the series.]

Art and Science of Graphical Presentations

Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency. Graphical displays should:

A. Show the data
B. Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
C. Avoid distorting what the data have to say
D. Present many numbers in a small space
E. Make large data sets coherent
F. Encourage the eye to compare different pieces of data
G. Reveal the data at several levels of detail, from a broad overview to the fine structure
H. Serve a reasonably clear purpose: description, exploration, tabulation, or decoration
I. Be closely integrated with the statistical and verbal descriptions of a data set

Graphics reveal data. Indeed, graphics can be more precise and revealing than conventional statistical computations.

From: The Visual Display of Quantitative Information by Edward R. Tufte. Cheshire, Connecticut: Graphics Press, 1983.

NOTE: STATISTICAL INFERENCE

Introduction

A common challenge faced by business managers is making judgments about the characteristics of large populations.
For example, managers in a large electronics retail organization with several thousand stores all across the country may want to know the average time per day spent by its salespersons with pseudo customers. (Pseudo customers are those who come to retail stores to get help from salespeople in understanding and comparing different products but stop short of buying anything in the store.) The managers in this organization are interested in a characteristic of the entire group, or the "population", of salespersons across all of its retail stores. In statistical terms, the managers are interested in the numerical value of a "population parameter." In this case, the parameter they are interested in is the "population mean".

Statistical inference is the technique of making judgments about unknown "population parameters" of interest, such as a population mean, based on appropriate "sample statistics." For example, the appropriate sample statistic to estimate an unknown population mean is the sample mean. There are two distinct but related approaches to statistical inference, called "Estimation" and "Hypothesis Testing".

Estimation

In some situations a manager or an analyst may not know the numerical value of the population parameter of interest (and may not even have a tentative or claimed value for it). For example: What is the average number of miles driven by families during a summer vacation? What proportion of eligible voters will vote in favor of a particular candidate in an upcoming election? In such situations, the technique called "Estimation" is used. Estimation deals with the determination of the numerical value of an unknown population parameter, such as the population mean (i.e., the average value of a specific measurement variable), using data from a random sample.
The error that arises from the fact that the estimates are based on data from a relatively small sample is quantified by what is called the "standard error of the sample estimate". In general, we estimate the population parameter by specifying an estimate (sometimes called the "point estimate") and a margin of error around it, i.e., by specifying a range of values called the "confidence interval". We multiply the standard error of the sample estimate by an appropriate factor to obtain the margin of error that results in a confidence interval in which we have a specified level of confidence. The appropriate factor is based on the distribution of the sample estimate.

Note: The confidence level is the proportion of times that our estimation procedure is correct.

Hypothesis Testing

In some situations a manager or a business analyst may already have some idea about the population parameter, based on intuition or on a claim made by others. For example, a firm's marketing department might claim that the average sales for a new model of computer will be 500 units per week. The managerial concern here is whether to reject the claim made and take an alternative position, or not to reject the claim and go with it. In such situations, the "Hypothesis Testing" technique is used.

Recall that in the confidence interval approach, a manager has no idea about the population parameter of interest (e.g., the population mean), and the manager constructs a confidence interval using the sample estimate. The sample estimate is the value of an appropriate statistic (i.e., the sample mean) based on a random sample of observations drawn from the population. In the hypothesis testing approach, a manager does have some specific numerical value for the population parameter of interest. This value may come from a variety of considerations: it may be the manager's target value for the parameter, or it may be a claim made by someone.
To test this claim, data from a random sample drawn from the population is collected and used as evidence to test whether or not it supports the claim.

An Informal Understanding of Hypothesis Testing

A good way to understand the hypothesis testing approach is to think of a criminal trial. The jury's concern is the (unknown) truth about the defendant. There are two positions: not guilty (the defense's position) or guilty (the prosecution's position). These are mutually exclusive (i.e., non-overlapping) and collectively exhaustive (i.e., there are no other positions). The jury must decide whether it should (a) reject the defense's position of "not guilty" and go with the alternate position, or (b) not reject it. How does the jury decide? It looks at all the evidence presented and seeks an answer to the question: Does the evidence enable us to reject the "not-guilty" position beyond reasonable doubt?

In an analogous way, in hypothesis testing we start with two mutually exclusive and collectively exhaustive positions (or claims) about a population parameter. One of these is called the null hypothesis and the other is called the alternate hypothesis. (We will discuss the technicalities of how to designate the null and the alternate hypotheses later.) Similar to the jury's concern of whether or not to reject the defense's "not-guilty" claim, our concern is to make one of the following two decisions: (a) Reject the Null Hypothesis, or (b) Do Not Reject the Null Hypothesis.

The jury considers a variety of evidence presented to it. In our case, the evidence is simply the appropriate "test statistic" calculated from the data from a random sample. How does the jury decide if there is evidence beyond reasonable doubt? For example, if the prosecution establishes that the murderer was wearing blue jeans and points out that the defendant owns a pair of blue jeans, would that constitute evidence beyond reasonable doubt? We would say: most likely not.
Since many people own blue jeans, the likelihood or probability that the evidence presented arose by pure chance is quite high. There is nothing extraordinary about this observation. It is not compelling enough. On the other hand, if the prosecution presents evidence that the DNA of the defendant matches the DNA of the body fluids found on the murder victim, we would say: most likely yes. The fundamental consideration here is the likelihood or chance associated with the evidence presented, i.e., the sample outcome. The jury asks the question: "Assuming that the defendant is not guilty, what is the probability that there is a DNA match?" Since the probability of a DNA match is extremely small, the evidence that there is a DNA match is compelling enough to reject our initial position that the defendant is not guilty.

In a similar vein, we first tentatively assume that the null hypothesis is true. We then determine the probability of getting a test statistic as large as (or as small as) the one we got. This probability is called the p-value of the test statistic. (We will discuss how to calculate this probability, or p-value, later.)
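To make the p-value idea concrete, here is a minimal Python sketch, assuming a z-type test statistic (so the standard normal model applies) and using only the standard library; the test-statistic value 2.0 is an arbitrary example, not data from this workshop.

```python
# Illustrative sketch only: a two-tailed p-value for a z-type test
# statistic, i.e. P(|Z| >= |ts|) under the standard normal model.
# The value 2.0 below is an arbitrary example.
from statistics import NormalDist

def two_tailed_p_value(ts):
    """Probability of a test statistic at least as extreme as ts."""
    return 2 * (1 - NormalDist().cdf(abs(ts)))

print(round(two_tailed_p_value(2.0), 4))  # 0.0455
```

A test statistic of 2.0 thus has about a 4.6% chance of arising by pure chance under the null hypothesis, which is the kind of "how extraordinary is this evidence?" number the jury analogy describes.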
In medical research, for example, a significance level of 1% to 2% may be required to consider the results publication worthy. In some business applications 5% or even 10% may be acceptable. In the hypothesis testing jargon the threshold or cutoff percentage or probability is called “the significance level for the hypothesis test”. It is denoted by the Greek letter α (alpha). If the p-value is less than the significance level, we conclude that the sample outcome is so extreme that it could not have from a population having the parameter value specified in the null hypothesis. Hence we reject the null hypothesis and adopt the position stipulated in the alternate hypothesis. This is analogous to the jury’s conclusion that the probability of a DNA match is so small that it does not seem reasonable to continue to harbor the position that the defendant is not guilty and dismiss this evidence as fluke. Hence the jury will reject the null position and return the “guilty” verdict. The choice of an appropriate value for α is a managerial policy decision. Generally speaking, if the consequences of mistaking chance variation for a real discovery are very bad, managers use a very strict cutoff (a low number like 1% for the significance level). If the consequences are not serious managers use a more lenient cutoff (like 5% or 10%). In medical research the consequences of this type of mistake may be serious because an ineffective or dangerous vaccine may be approved by the FDA and given to millions of people. The p-value is often quoted in research journals. A researcher may write something like “we found that patients got better after taking our new drug (p-value < 0.01)” to mean Page 21 START WORKSHOP - STATISTICS that benefits of the drug were found to be statistically significant with a p-value of less than 1%. The advice to the manager is not to worry so much about exactly which level of significance to use. 
If the conclusions of your study change drastically when you adopt a slightly different level of significance, then your results are probably close to the borderline of statistical significance, and you should consider this when basing decisions on the study.

Finally, note that jury decisions are not perfect. A jury sometimes reaches a wrong verdict. Likewise, a hypothesis test may sometimes lead to an incorrect decision. There are two types of errors:

(1) A jury may reject the "not-guilty" position and wrongly convict a defendant who is truly not guilty. Likewise, a hypothesis test might reject a null hypothesis which is indeed true. This type of error is called a Type I error. It occurs if a hypothesis test finds the sample evidence to be statistically significant when in reality it is due to pure chance variation.

(2) A jury may fail to reject the "not-guilty" position and wrongly set free a defendant who is truly guilty. Likewise, a hypothesis test might fail to reject a null hypothesis which is indeed false. This type of error is called a Type II error. It occurs if the sample evidence is due to something beyond chance variation but the test does not recognize it as statistically significant.

The significance level (α) represents the maximum acceptable probability of committing a Type I error. If Type I errors are very costly, you might use a smaller value for α such as 0.01 or 0.001. For example, in testing the effectiveness of a new drug for a common ailment, committing a Type I error means that an ineffective drug may be prescribed to millions of people. This is very bad, and therefore it is appropriate to use a small value for α such as 0.001. On the other hand, in the case of an experimental treatment for a fatal but otherwise incurable disease, a Type II error is probably worse than a Type I error. Now a Type II error means that someone could miss out on the opportunity to be cured, and a higher value for α such as 0.01 might be better.
In fact, you might consider giving this treatment to someone even if it has only been tested on a few people in the past; waiting until you can establish the adequacy of the drug at the usual level of statistical significance may be too conservative, and patients may die in the meantime.

The maximum probability of committing a Type II error is often denoted by the Greek letter β, and (1 − β) is called the power of a hypothesis test. In this course we will not discuss the power of a test. But it is important to note that once you have chosen a certain level of significance to control the probability of a Type I error (i.e., chosen a value for α), the probability of making a Type II error can be reduced only by increasing the sample size.

Let us recap and formally state the steps in hypothesis testing.

Step 1: State the Null and Alternate Hypotheses

Note that in any decision context, there will be a certain specific numerical value claimed for the unknown population parameter µ. Let us denote it by µ0. There are three possible positions that you might take with respect to µ0:

1. µ ≠ µ0
2. µ > µ0
3. µ < µ0

Corresponding to each of the above positions, the complementary positions are:

1. µ = µ0
2. µ ≤ µ0
3. µ ≥ µ0

Always choose the position that has a strict inequality as the "Alternate Hypothesis" (usually denoted by Ha). Conversely, the position which has an equality component will be chosen as the "Null Hypothesis" (denoted by H0). Thus, the three possible scenarios are:

1. H0: µ = µ0, Ha: µ ≠ µ0
2. H0: µ ≤ µ0, Ha: µ > µ0
3. H0: µ ≥ µ0, Ha: µ < µ0

Scenario 1 is called a "two-tailed" test. Scenarios 2 and 3 are "one-tailed" tests.

Step 2: Determine the Test Statistic (based on the sample data), or the "Observed TS"

TS = (Sample Estimate − Value claimed in the Null Hypothesis) / (Standard Error of the Estimate)

Step 3: See if the Test Statistic is significant (two approaches):
Approach 1: Find the critical value of the test statistic corresponding to the significance level of the hypothesis test, and establish the rejection region. The critical value of the test statistic for the specified significance level is found using the distribution of the test statistic. The type of test determines the rejection region. In a two-tailed test, for example, we want to know how large or how small the observed test statistic should be for us to consider it extreme enough at the specified significance level, and hence reject the null hypothesis.

Approach 2: Find the p-value of the observed test statistic. The p-value is the probability of observing a value for the test statistic as extreme as (i.e., as large as or as small as), or more extreme than, the one we observed, under the assumption that the null hypothesis is true. The p-value associated with the TS is found using the distribution of the test statistic.

[Rejection-region diagrams for the two-tailed test and the one-tailed tests.]

Step 4: Make the Statistical Decision

1. Reject the Null if the observed test statistic falls in the rejection region; or
2. Reject the Null if the p-value is less than the α-value.

Step 5: State the Managerial Conclusion in plain English

Standard Deviation of the Sample Estimate

In the above discussion we used the standard error of the sample estimate (a) to establish a desired confidence interval for an unknown population parameter, or (b) to test a claim about the population parameter. How do we get this standard error? To answer this question, we must understand the behavior of samples randomly drawn from the population of interest in our study. First, we will study the behavior of the sample means (X̄) when we take a sample of size n from a population with a known population mean (µ). This behavior is described by what is called the "sampling distribution of sample means". It is the foundation for statistical inference.
Also, it has significant applications in statistical process control. Then we will learn to apply the concepts and techniques of "Confidence Interval Estimation" and "Hypothesis Testing" in the context of a case. We will limit our focus to the estimation of an unknown population mean and to testing claims about it.

Important Note: Statistical Significance versus Practical Significance

Statistical significance doesn't tell you whether the results are of practical significance. Something is statistically significant if it is clearly more than just a chance occurrence, whereas something is practically significant if it would have an important impact on the business situation. If you gather enough data, everything looks statistically significant, because then there is very little room for chance variation.

NOTE: SAMPLING DISTRIBUTION OF SAMPLE MEANS

A Thought Experiment

Before we get into the theoretical concepts, definitions, and analytical details, please visualize playing a simple game and answer the questions that follow. In this game, let us go back to the case of a consumer products company like Procter & Gamble. This firm manufactures liquid detergent, which is sold in 100-ounce plastic bottles. In the final stages of the manufacturing process, the 100-ounce plastic bottles are filled with liquid detergent on an automated filling and packaging line. The automatic bottle-filling machine is set to fill an average of 100 ounces of detergent in each bottle. However, no machine is guaranteed to fill exactly 100 ounces in each bottle. Rather, the fill amount varies from bottle to bottle.
Thus, some bottles will have slightly more than 100 ounces and some slightly less, although all of these bottles will be labeled as "100 ounce" bottles. It has been established through extensive data analysis that the variability in the "fill volume" is adequately represented by a normal distribution with a mean of 100 ounces and a standard deviation of 0.2 ounce. The company distributes shrink-wrapped bundles of one dozen (12) bottles to its retailers.

Now let us think of the average fill volume in a bundle of 12 bottles randomly selected from the output of the machine.

1. Before we draw a 12-bottle bundle (i.e., a sample of 12 bottles), what do you expect the "Average Fill Volume" of this sample of 12 bottles, call it the "Sample Mean", to be?
_______________________________________________________________________________

2. We draw a bundle at random, i.e., a sample of 12 bottles, measure the volume of the 12 individual bottles, and find the sample average. Suppose this average value, i.e., the sample mean, is 98 ounces. Would you consider this to be significantly lower than what you expect? How could you tell?
________________________________________________________________

Now let us formalize our thoughts and pick up some theoretical fundamentals.

Distribution of Sample Means

Consider a population of observations such as heights, weights, test scores, weekly demands, salaries, and so on.

µ = Population mean
σ = Population standard deviation

We draw random samples of n observations. In statistical jargon, we say: sample size = n. The sample mean is a random variable; it varies from one sample to another. To work with sample means, i.e., to make judgments about sample means or to use sample means to make judgments about populations, we must know the distribution of the sample means. This distribution is called the sampling distribution of sample means.
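The sampling distribution can also be explored empirically. The sketch below simulates the detergent-bottle thought experiment in Python (simulated draws, not real fill data): repeatedly take 12-bottle bundles from a Normal(100, 0.2) fill process and summarize how the resulting sample means behave.

```python
# Simulation sketch of the thought experiment above: repeatedly draw
# 12-bottle bundles from a Normal(100, 0.2) fill process and look at
# how the sample means are distributed. All draws are simulated.
import random
from statistics import mean, stdev

random.seed(42)                 # fixed seed for reproducibility
MU, SIGMA, N = 100.0, 0.2, 12   # process mean, process SD, bundle size

sample_means = [
    mean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(10_000)
]

print(round(mean(sample_means), 3))   # close to MU = 100
print(round(stdev(sample_means), 4))  # close to SIGMA/sqrt(12), about 0.058
```

The simulated means center on 100 ounces and spread far less than individual bottles do, which is exactly what the theory in the next section predicts; a sample mean of 98 ounces would sit dozens of standard deviations below center.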
Recall that a distribution must specify three things: the central measure (the mean), the variability around the mean (the standard deviation), and the shape of the distribution. From the Central Limit Theorem in statistics, we have the following result:

Mean (expected value) of the distribution of sample means X̄: E(X̄) = µ

Standard deviation of the distribution of sample means X̄: SD_X̄ = σ / √n

The distribution of sample means follows the Normal model.

The following conditions must be satisfied for the above result to hold:

1. Randomization Condition: The sampling method must be unbiased and the sample representative of the population.

2. 10% Condition: The sample size, n, must be no more than 10% of the population size, N. If it is more, then a "Finite Population Correction Factor" must be applied to the standard error.

3. Nearly Normal Condition and Sample Size Requirement: The data must come from a distribution that is unimodal and approximately symmetric. If the data distribution is known to be normal, then any sample size is acceptable. If the data distribution is not known to be normal, then the sample size must be ≥ 30. The approximation to the normal distribution becomes closer as the sample size increases. If the parent distribution is symmetric, smaller samples are adequate than if the parent population is skewed or long-tailed. Symmetry of the parent distribution is particularly important.

Example: In 2008, the average salary for federal workers whose occupations also exist in the private sector was $67,691. By contrast, the average salary for employees working in similar jobs in the private sector in 2008 was $60,046. Assume that the population standard deviation of the salaries of federal workers is $15,300. A random sample of 34 employees is selected.

a) What is the probability that the sample mean will be less than $64,000?
b) What is the probability that the sample mean will be more than $70,000?

c) What is the probability that the sample mean will be less than $60,046?

Analysis and Solution

First, we make sure that the conditions are satisfied. In this case, we have (1) a random sample, (2) a sample size ≥ 30, and (3) reason to believe that the 10% condition is satisfied.

We are interested in finding probabilities about the sample mean, i.e., the average salary for federal workers whose occupations also exist in the private sector. Let us call it X̄. We must determine the distribution of X̄. The distribution of X̄ is Normal with

Mean = Population mean = 67691
Standard Deviation = SD_X̄ = σ / √n = 15300 / √34 = 2623.98

Now we can find the desired probabilities using the NORM.DIST function in EXCEL. In fact, we can do all the computation in an EXCEL worksheet.

NOTE: ESTIMATION OF AN UNKNOWN POPULATION MEAN

Guided Case Analysis: Retail Store Operation

ELCO is a chain of stores that sells consumer electronics in the US. ELCO suspects that the main reasons for declining profits are the falling quality of service and growing competition. Managers at ELCO want to know the average time that a salesperson spends with customers. They are worried about pseudo customers who take up a salesperson's time to get details about a product but then make their purchases elsewhere. ELCO managers are concerned about the average time spent by salespersons with pseudo customers across the entire population of salespersons in the company. Specifically, ELCO managers are interested in estimating the population average time spent with pseudo customers by the company's salespersons. They want 95% confidence in their estimate. ELCO has collected data on the service time spent with pseudo customers in a day from a random (i.e., representative) sample of 100 salespersons. The data set is given in the following table. See EXCEL file service.
Data Set

Time spent by salespersons with pseudo customers (service time in seconds):

3897 6743 6692 5301 2466 5702 4973 3482 5456 6981
3589 4320 1245  562 6824 9010 8910 1003 8821 5797
6712 1349 4239 2134 4687 1688 8904 3099  921 5817
6984 2485 8901 4111 8903 8933 6986 7133 2349 9042
7120 4713 4344 5921 1471 7432 7059 8425 7027 5479
6934 7234 1358 2302 8324 2309 2329 7912 2399 4456
7632 11921 1357 5691 3216 4865 9249 8349 3369 4771
9214 5578 2316 1279 3130 5892 3870 2390 3190 7243
2390 2891 8238 4349 1208 3999 4389 2348 5681 3123
4992 3356 1217 1109 5002 4006 1730 2100 2305 7349

Analysis and Solution

The variable of interest here is "the time spent by salespersons with pseudo customers." Let us denote it by the symbol X. X is a random variable because its value changes across salespersons and days of operation. We are interested in "the average time spent with pseudo customers by the entire population of the company's salespersons". This is the population parameter, i.e., the population mean (denoted by µ). The numerical value of µ is unknown.

We estimate the value of an unknown population parameter using data from a representative sample and determining the appropriate sample statistic. In this case the appropriate sample statistic is the sample mean. We call it the sample estimate. Obviously the sample estimate is not perfect, in the sense that we cannot be 100% confident that the population mean is exactly equal to the sample estimate. There is the inevitability of sampling error. We must recognize a margin of error surrounding our sample estimate that is based on (a) the variability in our sampling process and (b) the level of confidence we desire. Therefore, we estimate the unknown population mean not by a single number (i.e., the sample estimate) but by an interval called the "Confidence Interval".
A confidence interval for the population mean is given by

Sample Estimate ± Margin of Error

The margin of error is the product of two components:

Margin of Error = Confidence Factor × Standard Deviation of the Sample Estimate

In the present case, the sample estimate = sample mean, "=AVERAGE(Data)" in EXCEL.

The standard deviation of the estimate (i.e., the sample mean) is given by the formula

SD(X̄) = σ/√n

In this formula we know that n = 100, but we do not know σ, the population standard deviation. (In fact, except in some rare situations, σ is usually unknown.) So how do we proceed?

Statisticians have offered a solution to this problem. Recognize that σ is essentially a measure of the variability in the population. If we don't know its numerical value, the best we can do is substitute the measure of variability in the sample that is representative of the population. This measure is the sample standard deviation (denoted by s). We can easily calculate (or let EXCEL calculate) the numerical value of the sample standard deviation from the sample data.

But this substitution comes with some adjustments. First, we use slightly different terminology. Instead of the term "Standard Deviation of the sample estimate" we use the term "Standard Error of the sample estimate," with the notation SE(X̄). The standard error of the estimate for the mean is given by the following formula:

SE(X̄) = s/√n

We can compute the standard error easily.

Second, and more important: what about the confidence factor? If we use the above formula, the appropriate sampling distribution will no longer be the normal distribution that we used in the previous discussion. This means that we cannot find the confidence factors using the Z-distribution functions. We must use a slightly different distribution called the T-distribution. But it is not a big deal!
What it means is just this: to find the value of the confidence factor we should use the T-distribution rather than the Z-distribution. Using the T-distribution functions is very much like using the normal and standard normal distribution functions, although the way EXCEL's T-distribution functions work is somewhat different. We must learn how to use the T-distribution functions in EXCEL.

A key difference between the two distributions is this: whereas there is a unique standard normal distribution, no matter what the sample size is, the T-distribution depends on the sample size. More precisely, it depends on (n − 1), which is usually called the degrees of freedom. A T-distribution resembles the Z-distribution but has thicker "tails." A T-distribution with large degrees of freedom closely resembles the standard normal distribution.

We will now learn how to use the EXCEL functions related to the T-distribution:
T.INV.2T function
T.DIST.2T function
T.INV function
T.DIST function
T.DIST.RT function
(Screenshots of each function's dialog box appear in the original slides.)

Back to the Case: How do we find the 95% Confidence Factor?

For a confidence level of 95%, the T-value must be such that 5% of the area (equally divided between the two tails) under the T-distribution curve falls outside this value.

Finding the desired T-value: use the T.INV.2T function, entering 0.05 for probability and (100 − 1) = 99 for degrees of freedom, i.e., enter "=T.INV.2T(0.05, 99)" in any cell. We get the desired confidence factor = 1.98421.

Now we have all three pieces required to build the 95% Confidence Interval:

Sample Estimate ± Confidence Factor × Standard Error of the Sample Estimate

Answer: A 95% confidence interval for the population mean time spent by salespersons with pseudo customers is [4362 seconds, 5398 seconds].
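The interval just quoted can be rebuilt from the three pieces. The following is a hypothetical Python sketch; the summary numbers and the T.INV.2T confidence factor are taken from this note's worksheet values, not recomputed.

```python
from math import sqrt

xbar = 4880.03   # sample estimate (=AVERAGE of the service data)
s = 2610.622     # sample standard deviation (=STDEV.S of the service data)
n = 100

se = s / sqrt(n)          # standard error of the sample mean, s/sqrt(n)
t_factor = 1.98421        # 95% confidence factor, from =T.INV.2T(0.05, 99)
margin = t_factor * se    # margin of error

lo, hi = xbar - margin, xbar + margin
print(round(lo), round(hi))   # 95% CI, about [4362, 5398] seconds
```

The design point is simply that the interval is sample estimate ± (confidence factor × standard error); only the confidence factor changes when the confidence level changes.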
We are 95% confident that this interval contains the true population mean.

NOTE: HYPOTHESIS TESTING - UNKNOWN POPULATION MEAN

Guided Case Analysis: Retail Store Operation

ELCO is a chain of stores that sells consumer electronics in the US. ELCO suspects that the main reasons for declining profits are the falling quality of service and growing competition. Managers at ELCO want to know the average time that a salesperson spends with customers. They are worried about pseudo customers who take up a salesperson's time to get details about a product but then make their purchases elsewhere. ELCO managers are concerned about the average time spent by salespersons with pseudo customers across the entire population of salespersons in the company.

The manager of the Human Resources Department has claimed that the average time spent by salespersons with pseudo customers is equal to 15% of an 8-hour work day, or 4320 seconds. Senior managers at ELCO are wondering how they should react to the claim made by the HR Department manager: should they reject the claim or not reject it? They deem that the decision to reject the claim must be made at a high level of significance, 1 percent or 0.01.

ELCO has collected data on a simple random sample of 100 observations (service time spent with pseudo customers in a day by 100 salespersons). The data set is in the EXCEL file service.

Analysis

We first recognize that this is a hypothesis test of a claim made about the unknown population mean.

Step 1: Statement of the Hypotheses
H0: µ = 4320
Ha: µ ≠ 4320
It is a two-tailed test. Significance level α = 0.01.

Step 2: Test Statistic
TS = (Sample Estimate − Value claimed in the Null Hypothesis) / (Standard Error of the Estimate)
This test statistic follows a T-distribution with (n − 1) = (100 − 1) = 99 degrees of freedom.

Step 3: Decision Criteria
1. Critical values of the test statistic and the rejection region
2. P-value

Step 4: Statistical Decision

Step 5: Managerial Conclusion

Let us do the computations in EXCEL and complete the above steps.

EXCEL Worksheet: Hypothesis Testing of Population Mean, Two-tailed Test

Claimed value of the population mean, µ0: 4320
Type of test (direction of alternate hypothesis: ≠): 2-Tail
Significance level α: 0.01
Confidence level (1 − α): 0.99
Sample size, n: =COUNT(A2:A101) → 100
Sample estimate = sample mean: =AVERAGE(A2:A101) → 4880.03
Sample standard deviation, s: =STDEV.S(A2:A101) → 2610.622
Standard error of the sample estimate, SE: s/SQRT(n) → 261.0622
Observed sample test statistic, T = (observed value − claimed value)/SE → 2.145197
Degrees of freedom: n − 1 → 99
Critical value of the test statistic for this significance level, T* = T.INV.2T(α, df) → 2.626405
P-value of the observed test statistic (the area in the two tails beyond |T|) = T.DIST.2T(|T|, df) → 0.034383

Completing Steps 4 and 5: since |T| = 2.145 < T* = 2.626 (equivalently, the P-value 0.0344 > α = 0.01), we do not reject H0. At the 1% significance level, the data do not provide sufficient evidence to reject the HR manager's claim.

Appendix 1: Normal Distribution

The normal distribution is a continuous probability distribution useful in describing many real-world situations. It is also a very important distribution in statistical applications. There is actually a family of normal distributions, with each distribution completely specified by the values of two parameters: the mean, µ, and the standard deviation, σ. Every normal distribution is symmetric and centered at its mean. The standard deviation determines how spread out the values in the distribution are. Although the limits of any normal distribution are, in theory, ±∞, 99.7% of the values lie within ±3σ of µ. Probability in a normal distribution is determined as the area under the normal curve. The total area (total probability) under the normal curve is 1.
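The "99.7% within ±3σ" statement above is easy to verify numerically. A small sketch, using the Python standard library's NormalDist as a stand-in for Excel's NORM.DIST (hypothetical, not from the workshop):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
print({k: round(p, 4) for k, p in within.items()})
```

This reproduces the familiar empirical rule: roughly 68%, 95%, and 99.7% of the values fall within 1, 2, and 3 standard deviations of the mean.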
The Standard Normal Distribution

If X is normal with mean µ and standard deviation σ, then

Z = (X − µ)/σ

is called standard normal. A probability statement about any normal random variable X can be transformed into an equivalent probability statement about the standard normal random variable Z. The Z-distribution has:

Mean: µ = 0
Standard Deviation: σ = 1.0

Z will always be used to represent a standard normal random variable. Probabilities under the standard normal curve have been tabulated and are shown in a table of standard normal probabilities. Note that P(Z ≤ z) = P(Z < z) (including or excluding a single number does not change the probability).

Cumulative Standard Normal (CSN) Table: P(Z < z)

z     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1  0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2  0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3  0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4  0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5  0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6  0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7  0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8  0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9  0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0  0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1  0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2  0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3  0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4  0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5  0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6  0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7  0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8  0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9  0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0  0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1  0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2  0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3  0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4  0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5  0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6  0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7  0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8  0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9  0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0  0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1  0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2  0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3  0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4  0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5  0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6  0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.7  0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.8  0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
3.9  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
4.0  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

Examples Using the Standard Normal Table
a. Find P(Z < 1.00)
b. Find P(0 < Z < 1)
c. Find P(-1.3 < Z < 2.0)
d.
Find P(-1.57 < Z < -0.82)
e. Find P(Z < -2.53)
f. Find P(Z < -6)
g. Find z so that a probability of 5% falls above (to the right of) that value: P(Z > z) = 0.05.

Using EXCEL to Find the Desired Probabilities

Let us use the following example as the context to illustrate the use of EXCEL to find probabilities of interest.

Example: Each year thousands of high school students take the Scholastic Aptitude Test (SAT). The distribution of the scores on each SAT is approximately unimodal and symmetric, and it is well described by a Normal model with a mean of 500 and a standard deviation of 100.
1. Suppose a student scored 600 on an SAT test. Where does this student stand among all the students who took this SAT?
2. What proportion of students scored between 450 and 600 on this SAT?
3. Suppose a college says it accepts only students with SAT scores in the top 10%. How high should a student's SAT score be in order to be accepted at this college?

Solution Procedure

First, visualize the situation by drawing a picture of the Normal distribution and marking out the desired probability (i.e., the area under the normal curve). To find areas (or probabilities) we use the NORM.DIST function in EXCEL. The function NORM.DIST(X, mean, std-dev, TRUE) gives the area under the normal curve to the LEFT of the value that you input for X. You may follow one of two alternative approaches:
1. In any EXCEL cell, enter the formula "=NORM.DIST(X, mean, std-dev, TRUE)" with the appropriate numerical values and then hit "enter."
2. With the cursor on any EXCEL cell, click on the built-in function button and choose NORM.DIST from the menu of "Statistical" functions. This opens a dialog box in which you can fill in the appropriate numerical values.
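The three SAT questions can also be answered outside Excel. In this hypothetical Python sketch, the standard library's NormalDist.cdf and NormalDist.inv_cdf play the roles of NORM.DIST(..., TRUE) and NORM.INV respectively:

```python
from statistics import NormalDist

sat = NormalDist(500, 100)        # SAT model: mean 500, standard deviation 100

q1 = sat.cdf(600)                 # proportion of students scoring below 600
q2 = sat.cdf(600) - sat.cdf(450)  # proportion scoring between 450 and 600
q3 = sat.inv_cdf(0.90)            # score with 10% above it (i.e., 90% below)

print(round(q1, 4), round(q2, 4), round(q3))  # 0.8413 0.5328 628
```

Question 2 illustrates the general pattern for "between" probabilities: the area between two values is the difference of two left-tail areas.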
Answer to Question 1

The student's score of 600 is such that about 84% of the scores were below it.

Answer to Question 2

The desired area is the difference between two areas. We can find these separately and then do the subtraction, or we can directly enter the formula that represents the subtraction into an EXCEL cell and get the answer. Answer: 53.28% of the students scored in the range between 450 and 600.

Answer to Question 3

In this case, we know the probability (or the area) and we must find the corresponding score (or X-value). Specifically, we must find X such that the area under the curve to the right of X is 10%, or equivalently, the area to the left of X is 90%. NORM.INV(probability, mean, standard-dev) gives the X value for which the area under the normal curve to the LEFT is equal to the value you input for probability. Answer: The cutoff score at this college is 628.

Appendix 2: Practice Problems

1. Filling Tide Detergent Bottles

Procter and Gamble manufactures liquid Tide detergent (among many other products). Liquid Tide is sold in plastic bottles. One of the final steps in the manufacturing process is to fill the bottles of Tide. One machine used to fill the bottles is set to put an average of 100 ounces of Tide in each bottle. However, this machine cannot be guaranteed to put exactly 100 ounces of Tide in each bottle. Rather, the fill amount is known to follow a normal distribution with a mean of 100 ounces and a standard deviation of 0.2 ounces. Thus, some bottles will contain slightly more than 100 ounces and some slightly less, even though these bottles will be labeled as "100 ounce" bottles.

a. What is the probability that less than 99.6 ounces will be put into a "100 ounce" bottle of liquid Tide?
b. Calculate the probability that a single bottle of Tide will contain between 99.9 and 100.1 ounces.
c. What is the 90th percentile of the fill amounts?
d. Suppose P&G can adjust the mean fill amount on the machine that fills the Tide bottles. At what value should the mean fill be set in order to ensure that only 5% of the Tide bottles will contain less than 99.8 ounces?
e. We plan to examine a random sample of 100 Tide bottles to assess the operating efficiency of the machine. Calculate the probability that the average fill for a random sample of 100 Tide bottles is between 99.9 and 100.1 ounces.

2. Stereo Component Warranty

A company that produces an expensive stereo component is considering offering a warranty on the component. Suppose the population of lifetimes of the components is a normal distribution with a mean of 84 months and a standard deviation of 7 months. If the company wants no more than 2% of the components to wear out before they reach the warranty date, what number of months should be used for the warranty? (Answer: 69.68, or about 70 months)

3. Textbook

A large required chemistry course at a state university has been using the same textbook for a number of years. Over the years, the students have been asked to rate this text on a 10-point scale, and the average rating has been stable at about 5.2. This year the faculty decided to try a new text. After the course, 35 randomly selected students were asked to rate this new text. The results are shown below:

6 3 6 7 6 10 6 8 7 10 3 6 5 7 8 10 6 7 6 4 6 6 4 6 8 7 7 9 10 9 5 8 6 8 7

The sample mean of the 35 sample values is 6.77. The sample standard deviation is 1.85. Do the data provide evidence that the average rating for the new book is different from that of the old book (5.2)?

Hypotheses:
H0:
Ha:
Decision Rule:
Test Statistic:
Decision:
Conclusion:

4.
Battery Lifetimes

DC Company makes batteries for cell phones. Recently, the R&D department at the company came up with a new battery design that they believed would last longer than batteries currently on the market. However, senior managers were concerned about making the claim that the new battery lifetime was greater, on average, than the current industry standard of 30 hours. They believed there would be serious bad publicity, and sales would decline, if trade publication tests showed otherwise. Before the new battery is put into production, the company planned to test a random sample of 100 batteries. The battery lifetime data are shown below:

42.2 30.9 27.4 35 26.8 33.2 38.5 31.2 29.4 25.2
34 26.1 28.4 31.3 39.1 32.8 24.6 30.2 19.6 37.9
29.9 42.8 30 30 32.8 34.6 37.8 26.4 32.2 32.9
26 39.8 28.7 30.5 26.6 31.8 31.1 34.3 22.3 29.6
29.6 30.3 35.3 34.3 32 29.5 27.5 21.7 21.6 35
19.7 38.3 32.1 26.3 30.7 30.7 29.6 26.8 25.1 33.3
26.5 29.5 31.6 31.3 38.8 31.4 28.9 26.6 33.9 25.1
35.7 37.2 28.9 32.2 30.1 32.3 32.7 28.2 26.2 30.4
29.7 29.1 44.1 28.2 30.5 30.8 29.4 33.7 29.2 25.6
29.4 28.1 26.1 26.4 30.3 30.9 21.3 37.3 24.8 38.2

Is there sufficient evidence to conclude that the population of new batteries will have an average lifetime greater than 30 hours? Set up the hypotheses and conduct the test using α = .05. The sample standard deviation is 4.816 hours. The sample mean of the 100 lifetimes is 30.639 hours.

Hypotheses:
H0:
Ha:
Decision Rule:
Test Statistic:
Decision:
Conclusion:
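For problem 4, one way to check your test-statistic arithmetic is the sketch below. It is hypothetical Python using the summary statistics given above; the one-tailed critical value of roughly 1.660 for 99 degrees of freedom is an assumed value of the kind Excel's =T.INV(0.95, 99) would return, not a number stated in the workshop.

```python
from math import sqrt

mu0 = 30.0     # industry-standard lifetime claimed under H0
xbar = 30.639  # sample mean of the 100 lifetimes
s = 4.816      # sample standard deviation
n = 100

se = s / sqrt(n)          # standard error of the sample mean
ts = (xbar - mu0) / se    # observed test statistic, T-distributed with n-1 = 99 df

t_crit = 1.660            # assumed one-tailed critical value for alpha = .05, df = 99
print(round(ts, 3), ts > t_crit)
```

Since the observed statistic (about 1.33) falls below the assumed critical value, this sample would not support the "greater than 30 hours" claim at the 5% level; compare your own Step 4 decision against this.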