Samples & Summary Measures

Sample

A set of observations from a population: \(x_1, x_2, \ldots, x_n\)

Example: Measure the diameters of 20 pistons produced on a production line, where \(x_i\) = diameter of piston # i.

Summary Measures: Sample Mean, Sample Variance

Sample Mean

Just the average of the sample:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Example: \(x_1 = 10,\ x_2 = 12,\ x_3 = 16,\ x_4 = 18\)

\[ \bar{x} = \frac{\sum_{i=1}^{4} x_i}{4} = \frac{x_1 + x_2 + x_3 + x_4}{4} = \frac{10 + 12 + 16 + 18}{4} = \frac{56}{4} = 14 \]

NOTE: The sample mean is an unbiased estimator of the population mean, that is:

\[ E(\bar{x}) = E\left(\frac{\sum_{i=1}^{4} x_i}{4}\right) = \frac{\sum_{i=1}^{4} E(x_i)}{4} = \frac{\sum_{i=1}^{4} \mu}{4} = \frac{4\mu}{4} = \mu \]

where \(\mu\) is the population mean.

Sample Variance

The sample variance, \(S^2\), is a measure of how widely dispersed the sample is. The sample variance is an estimator of the population variance, \(\sigma^2\).

\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Example: \(x_1 = 10,\ x_2 = 12,\ x_3 = 16,\ x_4 = 18,\ \bar{x} = 14\)

\[ S^2 = \frac{(10-14)^2 + (12-14)^2 + (16-14)^2 + (18-14)^2}{4 - 1} = \frac{(-4)^2 + (-2)^2 + 2^2 + 4^2}{3} = \frac{16 + 4 + 4 + 16}{3} = \frac{40}{3} \]

Question: Why n-1 instead of n?
1. Dividing by n-1 provides an unbiased estimate: \(E(S^2) = \sigma^2\).
2. There are only n-1 degrees of freedom.

n-1 degrees of freedom

For a set of n points, if we know the sample mean and also know the values of n-1 of the points, we also know the value of the remaining point.

Example: \(\bar{x} = 14,\ x_1 = 10,\ x_2 = 12,\ x_3 = 16,\ x_4 = ?\) Then,

\[ \frac{10 + 12 + 16 + x_4}{4} = 14 \quad\Rightarrow\quad x_4 = 18 \]

Central Limit Theorem Revisited

When sampling from a population with mean \(\mu\) and standard deviation \(\sigma\), the sampling distribution of the sample mean, \(\bar{x}\), will tend to a normal distribution with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\) as the sample size n becomes large.

For "large enough" n,

\[ \bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \]

Note: As n gets larger, the variance (and standard deviation) of the sample mean gets smaller.

Confidence Intervals

Heart Valve Manufacturer

Dimension              Mean     Std. Deviation
Piston Diameter        0.060    0.0002
Sleeve Diameter        0.065    0.0002
Clearance (unsorted)   0.005    0.000283

Decision: Implement sorting with batches of 5

A random sample (after sorting has been implemented) of 100 piston/valve assemblies yields 79 valid (meet tolerances) assemblies out of the 100 trials.
How do we know whether or not the process change has really improved the resulting yield? The yield (# good assemblies out of 100) is a binomial random variable. Our estimate of the mean (based on this sample) is 79% (or 79 out of 100).

One way of determining whether the process has been improved is to construct a confidence interval about our estimate. To see how to do this, let X denote the number of good assemblies in 100 trials. Note that we can use the BINOMDIST function in Excel to compute, for example, the probability that X is within + or - 10 of our estimate, 79:

\[ P\{69 < X \le 89\} = P\{X \le 89\} - P\{X \le 69\} \]

= BINOMDIST(89,100,0.79,true) - BINOMDIST(69,100,0.79,true) ≈ 0.9971 - 0.0123 ≈ 0.985

Another way of stating this in words is that 79 ± 10 is a 98.5% confidence interval for the number of valid assemblies out of 100.

The following table was constructed using the BINOMDIST function as described above. It gives confidence intervals for various confidence levels.

Confidence Interval    % Confidence
79 ± 2                 37.6%
79 ± 4                 67.4%
79 ± 8                 94.9%
79 ± 10                98.5%
79 ± 14                99.9%

Note: the larger the interval, the more certain we become that it covers the true mean.

Note that the yield of the original process was 52%. Since the lower limit of a 99.9% confidence interval about our sample mean is 65% (substantially larger than 52%), we can be pretty certain the process has improved.

Confidence Intervals - Using Central Limit Theorem

Recall that: X = # successful assemblies in 100 trials

An estimate, \(\hat{p}\), of the probability of obtaining a successful assembly, p, is given by:

\[ \hat{p} = \frac{X}{100} \]

If we define:

\[ X_i = \begin{cases} 1 & \text{if trial } i \text{ is successful} \\ 0 & \text{if trial } i \text{ is a failure} \end{cases} \]

then \(X = X_1 + X_2 + X_3 + \ldots + X_{100}\). Furthermore,

\[ \hat{p} = \frac{X_1 + X_2 + \ldots + X_{100}}{100} = \frac{1}{100}X_1 + \frac{1}{100}X_2 + \ldots + \frac{1}{100}X_{100} \]

Note that \(X_i\) is the number of successes in 1 trial: a binomial random variable (with a single trial) where p is the probability of success.
It follows that the mean and standard deviation of each \(X_i\) are given by:

\[ \mu_i = p, \qquad \sigma_i = \sqrt{p(1-p)} \]

Applying the Central Limit Theorem tells us that \(\hat{p}\) (since it is the weighted sum of a large number of independent random variables) must be approximately normally distributed with mean and standard deviation given by:

\[ \mu_{\hat{p}} = \frac{1}{100}\mu_1 + \frac{1}{100}\mu_2 + \ldots + \frac{1}{100}\mu_{100} = \frac{1}{100}p + \frac{1}{100}p + \ldots + \frac{1}{100}p = p \]

\[ \sigma_{\hat{p}}^2 = \left(\frac{1}{100}\right)^2\sigma_1^2 + \left(\frac{1}{100}\right)^2\sigma_2^2 + \ldots + \left(\frac{1}{100}\right)^2\sigma_{100}^2 = \left(\frac{1}{100}\right)^2 p(1-p) + \ldots + \left(\frac{1}{100}\right)^2 p(1-p) = 100\left(\frac{1}{100}\right)^2 p(1-p) = \frac{p(1-p)}{100} \]

so

\[ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{100}} \]

Central Limit Theorem For Population Proportions

As the sample size, n, increases, the sampling distribution of \(\hat{p}\) approaches a normal distribution with mean p and standard deviation \(\sqrt{p(1-p)/n}\).

When the parameter p approaches 1/2, the binomial distribution is symmetric and shaped much like the normal distribution. When p moves either above or below 1/2, the binomial distribution becomes more heavily skewed away from the normal, and hence the sample size, n, necessary for the CLT to apply becomes larger. A commonly used rule of thumb is that np and n(1-p) must both be larger than 5.

From our sample of 100 assemblies, 79 were good. This implies that our estimates for the mean and standard deviation of \(\hat{p}\) are given by:

\[ \hat{\mu} = 0.79, \qquad \hat{\sigma} = \sqrt{\frac{0.79(1 - 0.79)}{100}} = \frac{\sqrt{0.79 \times 0.21}}{10} = 0.040731 \]

To reiterate, we are using the Central Limit Theorem to approximate the distribution of \(\hat{p}\) as a Normal distribution with mean 0.79 and standard deviation 0.040731.

What we mean by, for example, a 95% confidence interval is to find a number, r, satisfying:

\[ P\{0.79 - r \le p \le 0.79 + r\} = 0.95 \]

Since the Normal distribution is symmetric about its mean (in this case 0.79), exactly half of the "leftover" probability (5% for a 95% confidence interval) must lie in each tail. That is, a probability of 2.5% must lie in each tail for a 95% confidence interval. In other words,

\[ P\{p \le 0.79 - r\} = 0.025, \quad\text{and}\quad P\{p \le 0.79 + r\} = 0.975 \]

To perform this calculation, use the NORMINV function from Excel:

0.79 + r = NORMINV(0.975,0.79,0.040731) ≈ 0.8698.
Solving for r, we get r ≈ 0.0798, so a 95% confidence interval for p is given by

p = 0.79 ± 0.0798

Note that the resulting confidence interval is almost identical to that obtained by using the BINOMDIST function directly (79 ± 8 out of 100 at 94.9% confidence), an indication that our use of the Central Limit Theorem to approximate the distribution of \(\hat{p}\) with a Normal distribution is valid.

The following table summarizes calculations for various confidence levels using the Normal approximation.

Confidence Level    Confidence Interval
99.9%               0.79 ± 0.135
98.5%               0.79 ± 0.099
95%                 0.79 ± 0.0798
67.4%               0.79 ± 0.04

In general, when sampling for the population proportion in this manner, our estimates for the mean and standard deviation will be given by:

\[ \hat{\mu} = \hat{p}, \qquad \hat{\sigma} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Election Polling

Example: 1500 prospective voters are surveyed. 825 say they will vote for candidate A and 675 say they will vote for B. What is your estimate of the percentage of voters who will vote for A? Construct a 95% confidence interval.

\[ \hat{p} = \frac{825}{1500} = 0.55 \]

To construct the confidence interval, use the normal approximation.

\[ \hat{\sigma} = \sqrt{\frac{0.55(1 - 0.55)}{1500}} = 0.0128 \]

NORMINV(0.975,0.55,0.0128) = 0.575

The associated confidence interval is, therefore, 55% ± 2.5%. The news media would report this result by stating: "The poll has a margin of error of plus or minus 2.5 percentage points".

What would happen if the 55% estimate were based on a sample of size 750?

\[ \hat{\sigma} = \sqrt{\frac{0.55(1 - 0.55)}{750}} = 0.0182 \]

NORMINV(0.975,0.55,0.0182) = 0.586

The associated confidence interval is, therefore, 55% ± 3.6%.

Notation

The Standard Normal distribution has \(\mu = 0,\ \sigma = 1\). Let z be a random variable with a standard normal distribution. We define \(Z_{1-\alpha/2}\) to be the number satisfying:

\[ P\{z \le Z_{1-\alpha/2}\} = 1 - \alpha/2 \]

Example: If \(\alpha\) = 0.05, then \(1-\alpha/2\) = 0.975, and the value of \(Z_{1-\alpha/2}\) can be found using the Excel function NORMSINV: NORMSINV(1-α/2) = \(Z_{1-\alpha/2}\), or NORMSINV(0.975) ≈ 1.96.

Confidence Intervals for a Population Proportion Using the Standard Normal

From a sample of 500, the number who say they prefer Coke to Pepsi is 275.
Your estimate of the population proportion who prefer Coke is 275/500 = 55%. Since n is large, we can apply the CLT and construct a confidence interval for the population proportion, p:

\[ \hat{p} \pm Z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

provides a (1-α)100% confidence interval for the population proportion, p.

For a 95% interval in this case, we would first determine that NORMSINV(0.975) = 1.96. We would then compute:

\[ 0.55 \pm 1.96\sqrt{\frac{0.55(1 - 0.55)}{500}} = 0.55 \pm 0.0436 \]

Confidence Intervals on the Sample Mean

Approximately 95% of the observations from a normal distribution fall within +/- 2 standard deviations from the mean. The average of a sample, \(\bar{x}\), is (from the CLT) normally distributed with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\). It follows that the true mean will lie within +/- 2 standard deviations of the sample average about 95% of the time. The associated confidence interval is:

\[ \left(\bar{x} - 2\frac{\sigma}{\sqrt{n}},\ \bar{x} + 2\frac{\sigma}{\sqrt{n}}\right) \]

Note that the standard deviation used here is \(\sigma/\sqrt{n}\) rather than \(\sigma\).

Confidence Intervals (Again)

In general, if you have estimated the true mean \(\mu\) by using the sample mean, \(\bar{x}\), as an estimate, the range

\[ \left(\bar{x} - Z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + Z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \]

is expected to contain, or cover, the true mean 100(1-α)% of the time.

Example: The diameter of pistons is normally distributed with unknown mean and standard deviation of 0.01. You take a small sample of 5, measure their diameters, and compute a sample mean of 1.55. A 90% confidence interval would be given by:

\[ \left(1.55 - Z_{0.95}\frac{0.01}{\sqrt{5}},\ 1.55 + Z_{0.95}\frac{0.01}{\sqrt{5}}\right) \]

From Excel, we can compute \(Z_{0.95}\) = NORMSINV(0.95) = 1.645, and hence the interval is (1.543, 1.557).

Example: 1-Tailed Test on a Single Mean

The yield of a new process is known to be normally distributed. The current process has an average yield of 0.85 (85%) with a standard deviation of 0.05. The new process is believed to have the same deviation as the old one. To determine whether the yield of the new process is higher than that of the old, you collect a random sample of size 10 and compute a sample mean of 90%. Let \(\alpha\) = 0.01.

Solution:

1. Formulate Hypothesis: \(H_0: \mu \le 0.85\), \(H_1: \mu > 0.85\) (Important Note: the alternative hypothesis is associated with action.)

2. Compute Test Statistic:

\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{0.9 - 0.85}{0.05/\sqrt{10}} = 3.162 \]

3. Determine Acceptance Region: Reject if NORMSINV(1-α) < z
NORMSINV(0.99) < 3.162
2.326 < 3.162
Reject Null Hypothesis!

Our test statistic lies 3.162 standard deviations to the right of the mean of the sampling distribution, a clear indicator that the observed outcome would be very unlikely if the null hypothesis held.

Example: 2-Tailed Test on a Single Mean

An auto manufacturer has an old engine that produces an average of 31.5 mpg. A new engine is believed to have the same standard deviation in mpg, 6.6, as the old engine, but it is unknown whether or not the new engine has the same average mpg. The sample mean of a random sample of 100 turns out to be 29.8. Let \(\alpha\) = 0.05.

Solution:

1. Formulate Hypothesis: \(H_0: \mu = 31.5\), \(H_1: \mu \ne 31.5\)

2. Compute Test Statistic:

\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{29.8 - 31.5}{6.6/\sqrt{100}} = -2.576 \]

3. Determine Acceptance Region: Fail to reject if NORMSINV(α/2) ≤ z ≤ NORMSINV(1-α/2)
NORMSINV(0.025) ≤ -2.576 ≤ NORMSINV(0.975)
-1.960 ≤ -2.576 ≤ 1.960
Since the inequality fails to hold, we Reject Null Hypothesis!

Our test statistic lies 2.576 standard deviations to the left of the mean of the sampling distribution. If the null hypothesis were true, we would expect to observe a test statistic this low (or lower) only about 0.4998% of the time, less than 1/2 of 1 percent.

T-test

Used when \(\sigma\), the standard deviation of the underlying population, is unknown. Instead of forming the test statistic with \(\sigma\), we substitute an estimate for \(\sigma\), namely the sample deviation:

\[ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

The resulting test statistic is:

\[ t = \frac{\bar{x} - \mu_0}{S/\sqrt{n}} \]

The test statistic, t, has a t distribution with n-1 degrees of freedom. In comparison, the test statistic, z, formed using the population standard deviation, \(\sigma\), has a normal distribution. The t distribution is shaped like the standard normal distribution (i.e., bell-shaped, but more spread out).
Its mean is 0, and its variance (whenever degrees of freedom > 2) is df/(df-2). As n (and hence df) increases, the variance approaches 1 and the t distribution approaches the standard normal distribution. When n is large, the z test is sometimes substituted (as a close approximation) for the t test.

Assumptions: Either n is large enough for the central limit theorem to hold or the underlying distribution is normal.

T-test & EXCEL

Suppose t has a t distribution with n-1 degrees of freedom. We define \(t_{1-\alpha/2,\,n-1}\) to be the number satisfying:

\[ P\{t \le t_{1-\alpha/2,\,n-1}\} = 1 - \alpha/2 \]

That is, the area under the t distribution (probability) to the left of \(t_{1-\alpha/2,\,n-1}\) is 1-α/2. Note: This is the exact analogue of \(Z_{1-\alpha/2}\) for the standard normal distribution.

To calculate the value for \(t_{1-\alpha/2,\,n-1}\) in EXCEL, you use the TINV function:

\(t_{1-\alpha/2,\,n-1}\) = TINV(α, n-1).

EXAMPLE: 1-Tailed t-Test

Average weekly earnings of full time employees is reported to be $344. You believe this value is too low. A random sample of 1200 employees yields a sample mean of $361 and a sample deviation of $110. Formulate the appropriate null hypothesis and analyze the data:

1. Formulate Null Hypothesis: \(H_0: \mu \le 344\), \(H_1: \mu > 344\)

2. Compute Test Statistic:

\[ t = \frac{\bar{x} - \mu_0}{S/\sqrt{n}} = \frac{361 - 344}{110/\sqrt{1200}} = 5.353612 \]

3. Analysis: This is an extreme value for the test statistic, more than 5 standard deviations away from the mean (if the null hypothesis were true). For \(\alpha\) = 0.001, we can calculate:

\(t_{1-\alpha,\,1199}\) = TINV(2α, 1199) = TINV(0.002, 1199) = 3.097084.

Since our test statistic is even larger, we would reject the null hypothesis at the 0.1% significance level (99.9% confidence). The p-value associated with our test statistic is given by:

p-value = TDIST(5.353612, 1199, 1) ≈ 5 * 10^-8.

Thus, if the null hypothesis were true, the probability of obtaining a test statistic as large as (or larger than) 5.353612 is only about 0.000005%. Average weekly earnings are almost certainly larger than $344.
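The arithmetic of the weekly-earnings example can be reproduced outside Excel. The sketch below is only an illustration under stated assumptions: it uses the Python standard library, the helper names (`t_pdf`, `t_upper_tail`, `one_sample_t`) are hypothetical, and the tail probability is obtained by integrating the t density numerically with Simpson's rule rather than by whatever method TDIST uses internally.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t), computed as 0.5 minus the Simpson's-rule integral of the
    density over [0, t]. steps must be even (assumption of this sketch)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def one_sample_t(sample_mean, mu0, sample_sd, n):
    """Return (t statistic, one-tailed p-value) for H1: mu > mu0."""
    t = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
    return t, t_upper_tail(t, n - 1)

# Inputs as stated in the example: mean 361, hypothesized mean 344,
# sample deviation 110, sample size 1200.
t_stat, p_value = one_sample_t(361, 344, 110, 1200)
print(f"t = {t_stat:.4f}, one-tailed p-value = {p_value:.2e}")
```

With the inputs stated in the example, the statistic evaluates to roughly 5.35, well beyond the 3.097 critical value, so the decision to reject the null hypothesis is the same whichever tool is used.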
p-Values

Definition: The p-value is the smallest level of significance, \(\alpha\), for which a null hypothesis may be rejected using the obtained value of the test statistic. Equivalently, the p-value is the probability of obtaining a value of the test statistic as extreme as, or more extreme than, the actual test statistic, when the null hypothesis is true.

Example: Your z-statistic in a z-test is 3.162. To calculate the p-value, use the EXCEL function NORMSDIST. NORMSDIST(z) is the probability of obtaining a test statistic value less than or equal to z. To be more extreme, the test statistic would have to be larger than 3.162. Thus,

p-value = 1-NORMSDIST(3.162) = 1-0.999216 = 0.000784

Example: Your z-statistic in a z-test is -0.4. To be more extreme, the test statistic would have to be less than -0.4. Thus,

p-value = NORMSDIST(-0.4) = 0.3446

Example: Your t-statistic in a t test with 15 degrees of freedom is 4.56. To calculate the p-value, use the EXCEL function TDIST.

TDIST(t,n-1,1) = P{test statistic value > t | null hypothesis true}. (Note the different direction of the inequality!)

Hence p-value = TDIST(4.56,15,1) = 0.000188

Rules of Thumb for p-values

p-value                 interpretation
< 0.01                  very significant
between 0.01 & 0.05     significant
between 0.05 & 0.1      marginally significant
> 0.1                   not significant

Chi-Square Tests

Like the t distribution, the chi-square distribution is defined by its number of degrees of freedom. A chi-square random variable with k degrees of freedom is normally denoted by the symbol \(\chi^2_k\), and is defined by the equation:

\[ \chi^2_k = \sum_{i=1}^{k} z_i^2 \]

where the \(z_i\) are independent standard normal random variables. That is, it is the sum of the squares of k standard normal random variables. Since squares are always non-negative, so is their sum, and hence a chi-square random variable can only take on nonnegative values. Illustrations of the PDF for 1 and 5 degrees of freedom are shown below.
[Figure: Probability Density Function for the Chi-Sq Distribution. Panel 1: Degrees of Freedom 1, P-Value 0.050, Chi-Sq Critical Value 3.841. Panel 2: Degrees of Freedom 5, P-Value 0.050, Chi-Sq Critical Value 11.070.]

Values for the chi-square distribution can be referenced using the EXCEL functions CHIDIST and CHIINV.

CHIDIST(x, k) gives the probability that a chi-square random variable with k degrees of freedom attains a value greater than or equal to x. In other words, it is the area under the PDF to the right of x. In the pictures above, this is reported as the p-value.

CHIINV(p, k) gives the inverse, or critical value. That is, if p = CHIDIST(x, k), then CHIINV(p, k) = x. In the pictures above, this is reported as the chi-sq critical value.

Examples:
CHIDIST(5,5) = 0.41588
CHIDIST(25,5) = 0.00139
CHIINV(0.41588, 5) = 5
CHIINV(0.00139, 5) = 25

Values can also be referenced using a table of chi-square values. For example, to find the critical value for a chi-square with 10 degrees of freedom at the 5% significance level (95% confidence), use row 10 and the \(\alpha\) = 0.05 column of the attached table (giving a value of 18.31). Alternatively, using EXCEL, one could compute CHIINV(0.05, 10) = 18.307.

Test for Population Variance

Sometimes it is of interest to draw inferences about the population variance. The distribution used is the chi-square distribution with n-1 degrees of freedom (where n = sample size), and the test statistic is given by:

\[ \chi^2 = \frac{(n-1)s^2}{\sigma_0^2} \]

where \(s^2\) is the sample variance and the denominator, \(\sigma_0^2\), is the value of the variance stated in the null hypothesis.

Example: Heart Valves. Without any sorting, the clearance was normally distributed with mean of 0.005 and standard deviation of 0.000283 (which implies a variance of 8 * 10^-8). One key indicator of process improvement is whether or not process variability has been reduced.
In this case, we would look to see if the variance of the clearance dimension has been reduced by sorting. The null hypothesis in this case is that the variance has not been reduced:

\(H_0: \sigma^2 \ge 8 \times 10^{-8}\)

A random sample of size 50 (after sorting by batches of 50) yields a sample variance of 2.308 * 10^-9. Computing the test statistic yields:

\[ \chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{49 \times 2.308 \times 10^{-9}}{8 \times 10^{-8}} = 1.414 \]

The critical value is found (for an alpha of 0.001) by: CHIINV(0.999, 49) = 23.98. We would reject the null hypothesis for any value of the test statistic less than the critical value, 23.98. Since 1.414 is far below 23.98, we reject the null hypothesis and conclude that the variance has been reduced.

Example: A machine makes small metal plates used in batteries. The plate diameter is a random variable with a mean of 5 mm. As long as the variance is at most 1.0, the production process is under control and the plates are acceptable. Otherwise, the machine must be repaired. The QC engineer wants, therefore, to test the following hypothesis:

\(H_0: \sigma^2 \le 1.0\)

With a random sample of 31 plates, the sample variance is 1.62.

Solution: Computing the test statistic, we see that:

\[ \chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{30 \times 1.62}{1.00} = 48.6 \]

For a significance level of \(\alpha\) = 0.05, the critical value is found by: CHIINV(0.05, 30) = 43.77. Since our test statistic lies to the right of the critical value, we would reject the null hypothesis.

The p-value is given by: CHIDIST(48.6, 30) = 0.017257. Thus, we would reject the null hypothesis for any value of \(\alpha\) > 0.017257, and fail to reject the null hypothesis for smaller values.

Important Note: The use of the chi-square test on variance requires that the underlying population be normally distributed.

Chi-Square Test for Independence

It is often useful to have a statistical test that helps us to determine whether or not two classification criteria, such as age and job performance, are independent of each other. The technique uses contingency tables, which are tables with cells corresponding to cross-classifications of attributes.
In marketing research, one place where the chi-square test for independence is frequently used, such tables are called cross-tabs. You will recall that we have previously used the pivot-table facility within EXCEL to produce contingency or cross-tabs tables from more unwieldy tabulations of raw data.

Example: A random sample of 100 firms is taken. For each firm, we record whether the company made or lost money in its most recent fiscal year, and whether the firm is a service or non-service company. A 2 X 2 contingency table summarizes the data.

              Industry Type
              Service    Non-service    Total
Profit        42         18             60
Loss          6          34             40
Total         48         52             100

Using the information in the table, we want to investigate whether the two events, "the company made a profit in its most recent fiscal year" and "the company is in the service sector", are independent of each other.

Before stating the test, we need to develop a little bit of notation:

r = number of rows in the table
c = number of columns in the table
\(O_{ij}\) = observed count of elements in cell (i, j)
\(E_{ij}\) = expected count of elements in cell (i, j), assuming that the two variables are independent
\(R_i\) = total count for row i
\(C_j\) = total count for column j

The expected number of items in a cell is equal to the sample size, n, times the probability of the event signified by the particular cell. In the context of a contingency table, the probability associated with cell (i, j) is the joint probability of occurrence of both events. That is,

\[ E_{ij} = n\,P(i \cap j) \]

From the definition of independence, it follows that

\[ E_{ij} = n\,P(i)\,P(j) \]

From the row and column totals, we can estimate:

\[ P(i) = \frac{R_i}{n}, \qquad P(j) = \frac{C_j}{n} \]

Substituting in these estimates, we get the expected count in cell (i, j):

\[ E_{ij} = \frac{R_i C_j}{n} \]

Example: Using the data from the contingency table of the previous example, we can calculate:

\[ E_{11} = \frac{R_1 C_1}{n} = \frac{60 \times 48}{100} = 28.8, \qquad E_{12} = \frac{R_1 C_2}{n} = \frac{60 \times 52}{100} = 31.2, \]

\[ E_{21} = \frac{R_2 C_1}{n} = \frac{40 \times 48}{100} = 19.2, \qquad E_{22} = \frac{R_2 C_2}{n} = \frac{40 \times 52}{100} = 20.8. \]

The resulting table of expected counts is shown below:

              Industry Type
              Service    Non-service
Profit        28.8       31.2
Loss          19.2       20.8

The chi-square test statistic for independence is given by:

\[ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

with degrees of freedom (r - 1)(c - 1).

Note that the chi-square test statistic is always non-negative. If the observed counts exactly equal the expected (under the hypothesis of independence) counts, then the value of the test statistic would be zero. The greater the difference between observed and expected counts, the larger the test statistic becomes. We would reject the null hypothesis (of independence) only when we obtain a sufficiently large value of the test statistic.

We can compute the chi-square test statistic for our example as follows:

\[ \chi^2 = \frac{(42 - 28.8)^2}{28.8} + \frac{(18 - 31.2)^2}{31.2} + \frac{(6 - 19.2)^2}{19.2} + \frac{(34 - 20.8)^2}{20.8} = 29.09 \]

The number of degrees of freedom is (2 - 1)(2 - 1) = 1. Using the CHIINV function to compute a critical value for \(\alpha\) = 0.01, we see that CHIINV(0.01, 1) = 6.63. The rejection region (for a confidence level of 99%) is any test statistic larger than 6.63. Since our computed value of the statistic is much larger, we would reject the null hypothesis. Alternatively, we could compute the p-value by using the CHIDIST function: CHIDIST(29.09, 1) = 6.91 * 10^-8, from which we see that we would reject the null hypothesis for any reasonable value of \(\alpha\).

Using the CHITEST function:

An easier way to do this is to use the EXCEL CHITEST function. To do this, you need to have two tables in your spreadsheet: one containing the original contingency table data of actual observed counts, and one containing the expected counts.

The following spreadsheet information contains: the original counts in range A1:B2, the expected counts in range D1:E2, the CHITEST function formula in cell F1, and the value returned by the CHITEST function formula in cell F2.
     A     B     C     D       E       F
1    42    18          28.8    31.2    =CHITEST(A1:B2,D1:E2)
2    6     34          19.2    20.8    6.92162612220623E-08

Note that the value returned by the CHITEST function formula is the p-value.
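The whole independence test for the 2 X 2 profit/industry table can also be sketched in a few lines of Python, as a cross-check on the spreadsheet. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, only the standard library is used, and the p-value helper is valid for 1 degree of freedom only (a chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper tail equals erfc(sqrt(x/2))).

```python
import math

def expected_counts(observed):
    """Expected cell counts under independence: E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_statistic(observed, expected):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

def chi2_upper_tail_1df(x):
    """P(chi-square with 1 degree of freedom >= x); 1 df only."""
    return math.erfc(math.sqrt(x / 2))

# Observed counts: rows are Profit / Loss, columns are Service / Non-service.
observed = [[42, 18], [6, 34]]
expected = expected_counts(observed)
statistic = chi_square_statistic(observed, expected)
p_value = chi2_upper_tail_1df(statistic)

print("expected counts:", expected)
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.3e}")
```

The expected counts, the statistic of about 29.09 and the p-value of about 6.92 * 10^-8 should agree with the hand calculation and with the CHITEST result. For tables with more than one degree of freedom, a general chi-square tail function (such as Excel's CHIDIST) would be needed in place of the 1-df shortcut.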