Inferential Statistics Part 1 – Sampling Distributions, Point Estimates & Confidence Intervals

Inferential statistics are used to draw inferences (make conclusions/judgements) about a population from a sample. Consider an experiment in which 10 students who sat an exam after 24 hours of sleep deprivation scored 12% lower than 10 students who sat the same exam after a normal night's sleep. Is the difference real or could it be due to chance? How much larger could the real difference be than the 12% found in the sample? These are the types of questions answered by inferential statistics.

There are two main methods used in inferential statistics: estimation and hypothesis testing. In estimation, the sample is used to estimate a parameter and a confidence interval about the estimate is constructed. In the most common use of hypothesis testing, a null hypothesis is put forward and it is determined whether the data is strong enough to reject it. For the sleep deprivation study, the null hypothesis would be that sleep deprivation has no effect on performance.

Sampling Error

When we looked at primary data collection and sampling methods, we stressed the importance of selecting a random sample so that every item or individual in the population had a known chance of being selected. To accomplish this, we could choose a simple random sample, a systematic sample, a stratified sample, a cluster sample, or a combination of these methods.

However, it is unlikely that the mean of a sample would be identical to the population mean. Likewise, the sample standard deviation or other measure computed from a sample would probably not be exactly equal to the corresponding population value. We can therefore expect some difference between a sample statistic, such as the sample mean or sample standard deviation, and the corresponding population parameter. The difference between a sample statistic and a population parameter is called sampling error.
Example: Suppose that a population of five students had exam results of 68, 72, 67, 69 and 74. Suppose that a sample of two results, 68 and 74, is selected to estimate the population mean result. The mean of that sample is 71. Another sample of two is selected, 72 and 67, with a sample mean of 69.5. The mean of all the results (the population mean) is 70. The sampling error for the first sample is 1, determined by X̄ - µ = 71 - 70. The second sample has a sampling error of -0.5, determined by 69.5 - 70.

Each of these differences, 1 and -0.5, is the error made in estimating the population mean based on a sample mean, and these sampling errors are due to chance. The size of these errors will vary from one sample to the next.

So given the possibility of a sampling error when sample results are used to estimate a population parameter, how can we make accurate inferences/conclusions about the population based only on sample results? To begin with, we develop a sampling distribution of the sample means.

Sampling Distribution of the Sample Means

The exam results example showed that the means for samples of a specified size vary from sample to sample. The mean exam result of the first sample of two students was 71, and the second sample mean was 69.5. A third sample would probably result in a different mean. The population mean was 70. If we organised the means of all possible samples of two results into a probability distribution, we would obtain the sampling distribution of the sample means.

Example: A firm has seven production workers (considered the population). The hourly earnings of each worker are given below.

Employee No.    Hourly Earnings (€)
10001           7
10002           7
10003           8
10004           8
10005           7
10006           8
10007           9

1. What is the population mean?
2. What is the sampling distribution of the sample means for samples with a size of 2?
3. What is the mean of the sampling distribution?
4. What observations can be made about the population and the sampling distribution?
The population mean is found by:

\[ \mu = \frac{\sum X}{N} = \frac{7 + 7 + 8 + 8 + 7 + 8 + 9}{7} = \frac{54}{7} \approx 7.71 \]

that is, €7.71 per hour.

To get the sampling distribution of the sample means, all possible samples of 2 are selected without replacement from the population, and their means are computed. There are 21 possible samples, found by:

\[ {}_{N}C_{n} = \frac{N!}{n!\,(N-n)!} = \frac{7!}{2!\,5!} = 21 \]

where N is the number of observations in the population and n is the number of observations in the sample. The sample means from all 21 possible samples of 2 that can be drawn from the population are shown below.

Sample   Employees        Hourly Earnings (€)   Sum (€)   Mean (€)
1        10001 + 10002    7 + 7                 14        7.00
2        10001 + 10003    7 + 8                 15        7.50
3        10001 + 10004    7 + 8                 15        7.50
4        10001 + 10005    7 + 7                 14        7.00
5        10001 + 10006    7 + 8                 15        7.50
6        10001 + 10007    7 + 9                 16        8.00
7        10002 + 10003    7 + 8                 15        7.50
8        10002 + 10004    7 + 8                 15        7.50
9        10002 + 10005    7 + 7                 14        7.00
10       10002 + 10006    7 + 8                 15        7.50
11       10002 + 10007    7 + 9                 16        8.00
12       10003 + 10004    8 + 8                 16        8.00
13       10003 + 10005    8 + 7                 15        7.50
14       10003 + 10006    8 + 8                 16        8.00
15       10003 + 10007    8 + 9                 17        8.50
16       10004 + 10005    8 + 7                 15        7.50
17       10004 + 10006    8 + 8                 16        8.00
18       10004 + 10007    8 + 9                 17        8.50
19       10005 + 10006    7 + 8                 15        7.50
20       10005 + 10007    7 + 9                 16        8.00
21       10006 + 10007    8 + 9                 17        8.50

This probability distribution is the sampling distribution of the sample means and can be summarised as follows.

Sampling Distribution of the Sample Mean for n = 2

Sample Mean (€)   No. of Means   Probability
7.00              3              0.1429
7.50              9              0.4286
8.00              6              0.2857
8.50              3              0.1429
Total             21             1.0000

The mean of the sampling distribution of the sample mean is obtained by summing the various sample means and dividing the sum by the number of samples:

\[ \mu_{\bar{X}} = \frac{\text{sum of all sample means}}{\text{number of samples}} = \frac{7.00 + 7.50 + \cdots + 8.50}{21} = \frac{162}{21} \approx 7.71 \]

The mean of all the sample means is usually written μ with the subscript X̄, as above. The µ reminds us that it is a population value because we have considered all possible samples. The subscript X̄ indicates that it is the sampling distribution of the mean.

These observations can be made:

a. The mean of the sample means (€7.71) is equal to the mean of the population: μ (subscript X̄) = μ.

b. The spread in the distribution of the sample means is less than the spread in the population values.
The sample means range from €7 to €8.50, while the population values vary from €7 to €9. In fact, the standard deviation of the distribution of the sample means is equal to the population standard deviation divided by the square root of the sample size. So the formula for the standard deviation of the distribution of sample means is:

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]

Therefore, as we increase the size of the sample, the spread of the distribution of the sample means becomes smaller. (The formula is exact when sampling with replacement or from a population much larger than the sample; for a very small population sampled without replacement, as in this example, the actual spread is slightly smaller still.)

c. The shape of the sampling distribution of the sample means and the shape of the frequency distribution of the population values are different. The distribution of sample means tends to be more bell-shaped and to approximate the normal probability distribution.

In summary, we took all possible random samples from a population and for each sample calculated a sample statistic (the mean). Because each possible sample has a chance of being selected, the probability that the mean amount earned will take a value such as €7.00, €7.50 or €8.50 can be determined. The distribution of the mean amounts earned is called the sampling distribution of the sample means.

Even though in practice we see only one particular random sample, in theory any of the samples could arise. Consequently, we view the sampling process as repeated sampling of the statistic from its sampling distribution. This sampling distribution is then used to measure how likely a particular outcome might be.

The Central Limit Theorem

Applying the central limit theorem to the sampling distribution of the sample means allows us to use the normal probability distribution to create confidence intervals for the population mean. The central limit theorem states that, for large random samples, the shape of the sampling distribution of the sample means is close to a normal probability distribution.
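Before moving on, the seven-worker example can be checked by brute force. The following is a minimal Python sketch, not part of the original notes: the function name `sampling_distribution` is illustrative, and samples of size 3 are added as an assumption purely to show the spread shrinking as n grows. It lists every possible sample, then compares the mean and standard deviation of the sample means with μ and σ/√n.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean, pstdev

# Hourly earnings (EUR) of the seven production workers in the example.
earnings = [7, 7, 8, 8, 7, 8, 9]

def sampling_distribution(population, n):
    """Means of every possible sample of size n drawn without replacement."""
    return [mean(sample) for sample in combinations(population, n)]

mu = mean(earnings)        # population mean, 54/7
sigma = pstdev(earnings)   # population standard deviation

for n in (2, 3):
    means = sampling_distribution(earnings, n)
    expected_count = comb(len(earnings), n)   # N! / (n! (N - n)!)
    print(f"n={n}: {len(means)} samples (formula gives {expected_count}), "
          f"mean of sample means={mean(means):.2f}, "
          f"SD of sample means={pstdev(means):.3f}, "
          f"sigma/sqrt(n)={sigma / sqrt(n):.3f}")
```

The mean of the sample means should match the population mean for both sample sizes. The standard deviation of the sample means comes out somewhat below σ/√n here, because the population has only seven members and the samples are drawn without replacement.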
The approximation is more accurate for large samples than for small samples (most statisticians consider a sample of 30 or more large enough for the central limit theorem to be employed).

General Procedure

Sampling requires that we draw successive samples from a defined population. The samples must be randomly selected and of the same size. Calculate the mean for each sample and plot the sample means. This produces a distribution of sample means. A plot of an "infinite" number of sample means is called the sampling distribution of the mean.

Successive Sampling

Frequency distributions of sample means quickly approach the shape of a normal distribution, even if we are taking relatively few, small samples from a population that is not normally distributed. As we randomly select more and more samples from the population, the distribution of sample means becomes more normally distributed and looks smoother. With an "infinite" number of successive random samples, the sampling distributions all have a normal distribution with a mean that is equal to the population mean (μ).

Increasing Sample Size

As sample sizes increase, the sampling distributions approach a normal distribution. With an "infinite" number of successive random samples, the mean of the sampling distribution is equal to the population mean (μ). As the sample size increases, the variability of each sampling distribution decreases. The range of the sampling distribution is smaller than the range of the original population.

Taken together, these distributions suggest that the sample mean provides a good estimate of μ and that errors in our estimates (indicated by the variability of scores in the distribution) decrease as the size of the samples we draw from the population increases.

Population Distributions

The principles of successive sampling and increasing sample size work for all distributions.
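The successive-sampling procedure described above can be tried out with a short simulation. This is a rough sketch under stated assumptions rather than anything from the original notes: the exponential population, the seed, the sample sizes and the name `simulate_sample_means` are all illustrative choices. It draws many random samples from a strongly skewed population and reports how the sample means behave as n grows.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(1)  # fixed seed so the run is repeatable

# Assumed population: deliberately skewed (exponential, mean about 10).
population = [random.expovariate(1 / 10) for _ in range(20_000)]
mu = fmean(population)
sigma = pstdev(population)

def simulate_sample_means(n, repeats=2_000):
    """Draw `repeats` random samples of size n and return their means."""
    return [fmean(random.choices(population, k=n)) for _ in range(repeats)]

for n in (2, 10, 30, 100):
    means = simulate_sample_means(n)
    se = sigma / sqrt(n)  # standard deviation predicted for the sample means
    within = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)
    print(f"n={n:>3}: mean of means={fmean(means):6.2f}  "
          f"SD of means={pstdev(means):5.2f}  sigma/sqrt(n)={se:5.2f}  "
          f"within 1.96 SE={within:.1%}")
```

In a run like this, the mean of the sample means should stay close to μ at every sample size, the spread should track σ/√n, and the share of sample means within 1.96 standard errors of μ should settle near 95% as n increases.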
We can count on the sampling distribution of the mean being approximately normally distributed, no matter what the original population distribution looks like, as long as the sample size is relatively large.

The central limit theorem states that when an infinite number of successive random samples are taken from a population, the distribution of sample means calculated for each sample will become approximately normally distributed with mean μ and standard deviation σ/√n as the sample size (n) becomes larger, irrespective of the shape of the population distribution.

This is one of the most useful conclusions in statistics. We can reason about the distribution of the sample means with absolutely no information about the shape of the original distribution from which the sample is taken. In other words, the central limit theorem holds whatever the shape of the population distribution, provided its standard deviation is finite. The central limit theorem, as stated here, is a result about sample means; it should not be assumed to hold for arbitrary other statistics.

The central limit theorem tells us what to expect of the distribution of sample means when we take an infinite number of relatively large samples of a given size from a population. The central limit theorem works no matter how the population distribution is shaped. The central limit theorem helps us test hypotheses about means because it tells us what to expect when we draw samples from a population.

Point Estimates & Confidence Intervals

In statistics, point estimation involves the use of sample data to calculate a single value (known as a statistic) which serves as a "best guess" for an unknown population parameter. For example, the sample mean, X̄, is a statistic, and is a point estimate of the population mean, a parameter, μ.

While we expect the point estimate (statistic) to be close to the population parameter, we would like to measure how close (accurate) it is. A confidence interval serves this purpose.
In contrast to point estimation, which gives a single number, with confidence intervals we use sample data to construct an interval (range) of possible (or probable) values of an unknown population parameter, so that the parameter occurs within that range at a specified probability. The specified probability is called the level of confidence.

The information developed about the shape of a sampling distribution of the sample means, that is, the sampling distribution of X̄, allows us to locate an interval that has a specified probability of containing the population mean μ. For reasonably large samples, we can use the central limit theorem and state the following:

1. 95% of the sample means selected from a population will be within 1.96 standard deviations of the population mean μ.
2. 99% of the sample means will lie within 2.58 standard deviations of the population mean.

Here "standard deviations" means standard deviations of the sampling distribution of the sample means, σ/√n.

How are the values of 1.96 and 2.58 obtained? The 95% and 99% refer to the percent of the time that similarly constructed intervals would include the parameter being estimated. For example, 95% refers to the middle 95% of the observations. Therefore, the remaining 5% is equally divided between the two tails, 2.5% in each. The central limit theorem states that, for large random samples, the shape of the sampling distribution of the sample means is approximately normal. Therefore, we use z-tables to look at areas under the normal curve (see z-table). The z value that leaves 2.5% of the area in the upper tail is 1.96; in the same way, leaving 0.5% in each tail gives 2.58 for the 99% level.

When the sample size, n, is at least 30, it is generally accepted that the central limit theorem will ensure an approximately normal distribution of the sample means. This is an important consideration. If the sample means are normally distributed, we can use the standard normal distribution, that is, z, in our calculations.
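As an alternative to reading a printed z-table, the same values can be looked up in software. The sketch below is an illustration only; the helper name `z_for_confidence` is made up for this example. It uses `NormalDist` from Python's standard `statistics` module to find the z value that leaves half of the excluded area in each tail.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """z value that leaves (1 - level) / 2 of the area in each tail."""
    tail = (1 - level) / 2
    # inv_cdf returns the z value with the given cumulative area to its left.
    return NormalDist().inv_cdf(1 - tail)

for level in (0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.2f}")
```

Rounded to two decimal places, this reproduces the 1.96 and 2.58 quoted above.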
When n ≥ 30, the formula for the confidence interval for a mean is:

\[ \bar{X} \pm z \frac{s}{\sqrt{n}} \]

where:
X̄ = the sample mean
z = appropriate z value for the level of confidence
s = sample standard deviation
n = sample size

Example: An experiment involves selecting a random sample of 256 managers. One item of interest is annual income. The sample mean is €45,420, and the sample standard deviation is €2,050.

1. What is the estimated mean income of all managers (the population), i.e. what is the point estimate?
2. What is the 95% confidence interval for the population mean (rounded to the nearest €10)?
3. Interpret the findings.

Solution:

1. The point estimate of the population mean is €45,420.

2. With X̄ = €45,420, z = 1.96, s = €2,050 and n = 256, the confidence interval for the mean is:

\[ \bar{X} \pm z \frac{s}{\sqrt{n}} = 45{,}420 \pm 1.96 \times \frac{2{,}050}{\sqrt{256}} = 45{,}420 \pm 251.125 \]

This gives €45,168.875 and €45,671.125, which round to €45,170 and €45,670. That is, €45,170 ≤ μ ≤ €45,670.

3. We can say that we are 95% confident that the unknown population mean income (μ) is between €45,170 and €45,670. If we had time to select many samples of size 256 from the population of managers and compute sample means and confidence intervals, the population mean annual income would be in about 95 of every 100 confidence intervals. Either an interval contains the population mean or it does not. About 5 out of every 100 confidence intervals would not contain the population mean annual income, μ.
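The managers' income calculation can be reproduced in a few lines. This is a minimal sketch assuming the large-sample z interval used above; the function name `mean_confidence_interval` and its parameter names are illustrative and do not come from the original notes.

```python
from math import sqrt

def mean_confidence_interval(sample_mean, sample_sd, n, z=1.96):
    """Large-sample (n >= 30) z interval for a population mean."""
    margin = z * sample_sd / sqrt(n)   # z times the estimated standard error
    return sample_mean - margin, sample_mean + margin

# Figures from the managers' income example: mean 45,420, s = 2,050, n = 256.
low, high = mean_confidence_interval(45_420, 2_050, 256)
print(f"95% interval: EUR {low:,.3f} to EUR {high:,.3f}")
print(f"Rounded to the nearest 10: EUR {round(low, -1):,.0f} "
      f"to EUR {round(high, -1):,.0f}")
```

Passing `z=2.58` instead would give the wider 99% interval for the same sample.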