YALE School of Management
EMBA-MGT511: HYPOTHESIS TESTING AND REGRESSION
K. Sudhir
Review Materials from Data I

1. Review of Concepts

We begin with a brief review of the concepts of random sampling, sampling distributions, and point and interval estimation that you learned in Data-I, which will be useful for hypothesis testing. If you feel comfortable with these concepts, you may skip this review and go directly to Section 2 on practice exercises.

Random Sampling and Sampling Distributions

Since a census of the population of interest can be too costly or impossible to obtain, random sampling is a practical approach to learning about a population. Using "sample statistics" such as the sample mean, sample standard deviation and sample proportion, we can learn about the true parameters of the population.

Sample statistics vary from sample to sample. Suppose we knew the population mean is 200. Even then, the sample mean is very unlikely to be exactly 200 in any particular sample. The sample statistic is itself a random variable, because of the random variation in which elements are chosen for the sample.

If we use random sampling (where every element of the population has an equal chance of being part of the sample), then we can use probability theory to determine the probability distribution of the sample statistic. The probability distribution of a sample statistic is called its sampling distribution. Using the sampling distribution, we can make point and interval estimates of the population parameters.

Sampling Distribution for Means

Consider the random variable $Y$. The sampling distribution for the sample mean $\bar{Y}$ has:

Mean: $\mu_{\bar{Y}} = \mu_Y$ (the population mean)
Standard deviation: $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{n}$ (the population standard deviation divided by the square root of the sample size)

If the sample size is greater than 30, $\bar{Y}$ will be approximately normally distributed, thanks to the central limit theorem.

Sampling Distribution for Proportions

Consider a binary random variable $Y$, which can take one of two values, 0 or 1.
Let $\pi$ be the proportion of 1's in a population. The sampling distribution for the sample proportion $\hat{p}$ has the following mean and standard deviation:

Mean: $\mu_{\hat{p}} = \pi$ (the population proportion)
Standard deviation: $\sigma_{\hat{p}} = \sqrt{\pi(1-\pi)/n}$ (the population standard deviation divided by the square root of the sample size)

If the sample size is such that $n\pi > 5$ and $n(1-\pi) > 5$, $\hat{p}$ will be approximately normally distributed.

Point Estimation

The object of estimation is to learn about a population parameter (such as the proportion of all voters who favor Bush at a given time, or the mean account balance of all American Express Gold Card accounts) from sample data. Parameters are numbers, but they are unknown. To learn about them, we employ special random variables called estimators, which have good properties (such as being correct on average). The properties of estimators derive from random sampling. The actual numbers that we obtain from the data are referred to as point estimates. The following table clarifies the notation for population parameters, estimators and estimates.

Parameter            Estimator                  Estimate
$\mu$                $\bar{Y}$                  $\bar{y}$
$\mu_1 - \mu_2$      $\bar{Y}_1 - \bar{Y}_2$    $\bar{y}_1 - \bar{y}_2$
$\sigma$             $S$                        $s$

where

$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ and $S = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}$

$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}$

Interval Estimate (Confidence Interval)

Because estimators are random variables, point estimates are subject to sampling error. We therefore wish to quantify the range of values that we can reasonably expect the true population parameter to take, given our sample. This is the interval estimate, also called the confidence interval. The standard formula for a two-sided confidence interval is:

Confidence Interval = Point Estimate $\pm$ Critical Value $\times$ Estimated Standard Deviation of Estimator

We know what the point estimates are from the previous section. We also know how to estimate the standard deviation of the estimator (for both means and proportions) from above. So the one thing we need in addition is the critical value.
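As a hedged illustration of the two ingredients already in hand, the point estimate and the estimated standard deviation of the estimator, here is a minimal Python sketch using only the standard library. The function names and the five data values are hypothetical, invented for illustration, and do not come from the course materials.

```python
import math
import statistics


def mean_standard_error(sample):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return s / math.sqrt(len(sample))


def proportion_standard_error(p_hat, n):
    """Estimated standard deviation of a sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical data, invented for illustration only.
sample = [98, 102, 110, 90, 100]
y_bar = statistics.mean(sample)        # point estimate of the population mean
se_mean = mean_standard_error(sample)  # estimated std. dev. of the sample mean
se_prop = proportion_standard_error(0.5, 100)
print(y_bar, round(se_mean, 3), se_prop)
```

Note that `statistics.stdev` uses the $n-1$ divisor, matching the definition of $s$ above, whereas `statistics.pstdev` would use $n$.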
Getting the critical value

The critical value depends on how confident we want to be that the interval we report contains the true population parameter. Often, people use a 95% confidence interval, but we could also use a 99% or a 90% confidence interval. The more confident we want to be, the wider the interval. At the extreme, if you want 100% confidence, the confidence interval would be $-\infty$ to $+\infty$. Of course, this confidence interval would be practically useless, so we have to trade off confidence against managerial usefulness.

Confidence intervals are usually expressed using the $100(1-\alpha)\%$ notation, where $\alpha$ is called the significance level. A 95% confidence interval corresponds to $\alpha = 0.05$, a 90% confidence interval to $\alpha = 0.1$, and a 99% confidence interval to $\alpha = 0.01$.

Since both sample proportions and sample means approximately follow the normal distribution for large samples, the critical value is obtained from the standard normal distribution (usually represented by $Z$). For significance level $\alpha$, the two-sided critical value is $z_{\alpha/2}$, the value of $Z$ that leaves probability $\alpha/2$ in the upper tail. The critical values for the 90%, 95% and 99% confidence intervals are (use the Hypothesis Testing Worksheet):

Confidence Level    Significance Level ($\alpha$)    $z_{\alpha/2}$
90%                 0.1                              $z_{0.05} = 1.645$
95%                 0.05                             $z_{0.025} = 1.96$
99%                 0.01                             $z_{0.005} = 2.58$

However, if we have a small sample and the population standard deviation is not known (which is almost always the case), then we cannot use the normal distribution, and we need to use the t-distribution with $n-1$ degrees of freedom instead.

Interval estimation works under the following assumptions:
(1) The sample is random.
(2) The sample means follow a normal distribution (by appeal to the Central Limit Theorem). Even if we use the t-distribution for the critical value (because we don't know the population standard deviation), the sample means are assumed to follow a normal distribution.
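The z critical values quoted for the 90%, 95% and 99% levels can be reproduced with Python's standard library, as sketched below. The helper name `z_critical` is hypothetical. The standard library has no t-distribution, so t critical values would need a table or an external package such as SciPy; that is outside this sketch.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided z critical value z_{alpha/2} for a given confidence level."""
    alpha = 1 - confidence_level  # significance level
    # z_{alpha/2} leaves probability alpha/2 in the upper tail of Z.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

The 99% value computes to about 2.576, which the notes round to 2.58.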
We can check the assumptions by doing the following:
(1) Look at a histogram of the sample data. If the histogram is roughly symmetric, the sample means will be approximately normally distributed even with a small sample. If the data do not appear to be symmetric, the central limit theorem works more slowly, which means we need larger samples before the distribution of sample means can be assumed to be normal.
(2) Correct for any outliers, or identify the causes of any unusual values.

These assumptions and checks apply to hypothesis testing as well.

2. Review - Practice Exercises

(i) Computing the Probability of an Individual Outcome Drawn from a Normal Distribution

Suppose IQ in a population is normally distributed across individuals. Based on historical data, we know the average IQ in the population is 100 and the standard deviation is 20. What is the probability that a randomly chosen individual will have an IQ between 80 and 120?

$Y$: random variable measuring IQ, with $\mu_Y = 100$ and $\sigma_Y = 20$.

If $Y$ is normally distributed, then

$Z = \frac{Y - \mu_Y}{\sigma_Y}$ has $\mu_Z = 0$ and $\sigma_Z = 1$.

Note that $Z$ is the standard normal variable. Therefore,

$\text{Prob}[80 \le Y \le 120] = \text{Prob}\left[\frac{80 - 100}{20} \le Z \le \frac{120 - 100}{20}\right] = \text{Prob}[-1 \le Z \le 1] = 0.6826$

Use p. 800 of the text to look up the probabilities for the standard normal distribution. The probability that $Z$ takes a value between 0 and 1 is 0.3413 (from the table). By symmetry of the normal distribution, the probability that $Z$ lies between -1 and 0 is also 0.3413. Therefore the total probability is 0.6826.

[Figure: standard normal density for $Z$, with the areas between -1 and 0 and between 0 and 1 each labeled Prob = 0.3413.]

(ii) Computing the Probability of Sample Means

Suppose IQ in a population is normally distributed across individuals. Based on historical data, we know the average IQ in the population is 100 and the standard deviation is 20. What is the probability that a randomly chosen sample of size 16 will have an average IQ between 80 and 120?
$\bar{Y}$: random variable measuring the average IQ of a sample of size $n$, with

$\mu_{\bar{Y}} = 100$ and $\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{n}} = \frac{20}{\sqrt{16}} = 5$

$\bar{Y}$ is normally distributed since $Y$ is normally distributed in this case. (Even if $Y$ is not normal, you can appeal to the Central Limit Theorem to claim that $\bar{Y}$ is approximately normally distributed when the sample is large enough.)

Probability of sample mean outcomes:

$\text{Prob}[80 \le \bar{Y} \le 120] = \text{Prob}\left[\frac{80 - 100}{5} \le Z \le \frac{120 - 100}{5}\right] = \text{Prob}[-4 \le Z \le 4] = 0.9999$

where $Z = \frac{\bar{Y} - \mu_{\bar{Y}}}{\sigma_{\bar{Y}}}$.

(iii) Point Estimation and Interval Estimation

When Population Standard Deviation is known:

Suppose we draw one sample of size 16 and find the sample average IQ to be 105. The population standard deviation is known to be 20. Estimate the average IQ in the population and the 95% confidence interval for the average IQ in the population.

Point estimator = $\bar{Y}$
Interval estimator = $\bar{Y} \pm z_{\alpha/2} \frac{\sigma_Y}{\sqrt{n}}$

Here $n = 16$, $\bar{y} = 105$ and $\sigma_Y = 20$, so the point estimate is 105.

95% confidence interval: $105 \pm 1.96 \times \frac{20}{\sqrt{16}} = 105 \pm 1.96 \times 5 = 105 \pm 9.8 = (95.2, 114.8)$

"We are 95% confident that the unknown population mean is between 95 and 115" (in repeated sampling, an interval so constructed contains the population mean 95% of the time).

When Population Standard Deviation is unknown:

Suppose the above question is changed as follows: Suppose we draw one sample of size 16 and find the sample average IQ to be 105. The sample standard deviation is calculated to be 20. Estimate the average IQ in the population and the 95% confidence interval for the average IQ in the population.

Point estimator = $\bar{Y}$

Here we calculate the sample standard deviation from the sample data (rather than knowing the population standard deviation), so the critical value comes from a t-distribution with $n-1$ degrees of freedom. The t-distribution is approximately equal to the normal distribution for large values of $n$ ($n > 30$), so when $n$ is large, we can work with the normal distribution for convenience.
Interval estimator = $\bar{Y} \pm t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}$

Here $n = 16$, $\bar{y} = 105$, $s = 20$ and $t_{n-1,\alpha/2} = t_{15,0.025} = 2.131$.

95% confidence interval: $105 \pm 2.131 \times \frac{20}{\sqrt{16}} = 105 \pm 2.131 \times 5 = 105 \pm 10.655$, or approximately $(94.35, 115.65)$

Note that when you use the t-distribution, the confidence interval is wider than when you use the normal distribution: the critical value in this problem is 2.131, compared to 1.96 in the previous problem. This reflects the greater uncertainty in the estimate, because we estimate one more parameter (the standard deviation) from the data.

(iv) Sample Size Determination

Suppose we want the 95% confidence interval to be no wider than $\bar{Y} \pm 2$. If the population standard deviation is 20, what is the required sample size?

Let $E$ be the maximum difference between the sample mean and the population mean:

$E = z_{\alpha/2} \frac{\sigma_Y}{\sqrt{n}} \;\Rightarrow\; \sqrt{n} = \frac{z_{\alpha/2}\,\sigma_Y}{E} \;\Rightarrow\; n = \left(\frac{z_{\alpha/2}\,\sigma_Y}{E}\right)^2 = \left(\frac{1.96 \times 20}{2}\right)^2 = 384.16$, which rounds up to 385.

If the population standard deviation is 20, then to obtain a 95 percent confidence interval in which the sample mean differs from the population mean by at most 2 units ($E$), we need a random sample of size 385.
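The arithmetic in exercises (iii) and (iv) can be checked with a short Python sketch. The function names are hypothetical, and the critical values 1.96 and 2.131 are simply copied from the text rather than computed.

```python
import math


def confidence_interval(point_estimate, std_dev, n, critical_value):
    """Two-sided interval: estimate +/- critical value * std_dev / sqrt(n)."""
    margin = critical_value * std_dev / math.sqrt(n)
    return point_estimate - margin, point_estimate + margin


def required_sample_size(std_dev, max_error, critical_value):
    """Smallest n with critical_value * std_dev / sqrt(n) <= max_error."""
    return math.ceil((critical_value * std_dev / max_error) ** 2)


# Critical values taken from the text: z_{0.025} = 1.96, t_{15, 0.025} = 2.131.
z_interval = confidence_interval(105, 20, 16, 1.96)   # sigma known
t_interval = confidence_interval(105, 20, 16, 2.131)  # sigma estimated by s
n_needed = required_sample_size(20, 2, 1.96)
print(z_interval, t_interval, n_needed)
```

Rounding up with `math.ceil` matters here: a sample of 384 would leave the margin of error slightly above 2.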