Survey
1 Examples of Sampling

A company has 50000 employees at 1000 equally-sized locations around the world. All employees have identification numbers; for simplicity, we assume that these are 1 through 50000. We want to obtain some information about the employees. To do this, we will collect data from a sample of 2000 of the employees. Collecting this data requires meeting each sampled employee in person.

To obtain a simple random sample, we have a computer generate 2000 random numbers between 1 and 50000. We can set this up to allow replacement or forbid replacement. Our sample consists of the employees with these identification numbers.

To obtain a systematic random sample, we start by randomly choosing the first employee. Say the first chosen employee has ID number 37. Since there are 50000 employees in total and our sample is to have 2000 employees, 1 out of every 50000/2000 = 25 employees is chosen for inclusion in the sample. Therefore, we systematically select employees whose identification numbers are spaced 25 apart, counting both down and up from 37. So the employees in our sample have identification numbers 12, 37, 62, 87, 112, and so on, up to 49987.

To obtain a clustered sample, we visit only a few of the locations and include in our sample all the employees at each of the chosen locations. We just need to know how many locations we must visit. Since there are 50000 employees in total at 1000 locations, there are 50000/1000 = 50 employees at each location. Since we need a sample of 2000 employees and there are 50 employees at each location, we need to visit 2000/50 = 40 different locations.

To obtain a stratified sample, we visit every location and include the same number of employees from each location in our sample. We just need to know how many employees we must pick at each location. Since we need a sample size of 2000 and there are 1000 locations, we need to choose 2000/1000 = 2 employees from each location.
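The four procedures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names are made up for the example, and it assumes that identification numbers are assigned to locations in consecutive blocks of 50 (IDs 1 to 50 at the first location, 51 to 100 at the second, and so on), which the notes do not specify.

```python
import random

N_EMPLOYEES = 50_000
N_LOCATIONS = 1_000
SAMPLE_SIZE = 2_000
PER_LOCATION = N_EMPLOYEES // N_LOCATIONS  # 50 employees per location


def location_of(emp_id):
    """Location index (0..999) of an employee, ASSUMING consecutive ID blocks."""
    return (emp_id - 1) // PER_LOCATION


def simple_random_sample(rng, replacement=False):
    ids = range(1, N_EMPLOYEES + 1)
    if replacement:
        return rng.choices(ids, k=SAMPLE_SIZE)  # repeats allowed
    return rng.sample(ids, SAMPLE_SIZE)         # all IDs distinct


def systematic_sample(first_chosen):
    step = N_EMPLOYEES // SAMPLE_SIZE            # 25
    start = (first_chosen - 1) % step + 1        # smallest ID in the sample
    return list(range(start, N_EMPLOYEES + 1, step))


def cluster_sample(rng):
    n_clusters = SAMPLE_SIZE // PER_LOCATION     # 40 locations
    chosen = rng.sample(range(N_LOCATIONS), n_clusters)
    return [loc * PER_LOCATION + k
            for loc in chosen
            for k in range(1, PER_LOCATION + 1)]


def stratified_sample(rng):
    per_stratum = SAMPLE_SIZE // N_LOCATIONS     # 2 employees per location
    sample = []
    for loc in range(N_LOCATIONS):
        first = loc * PER_LOCATION + 1
        sample.extend(rng.sample(range(first, first + PER_LOCATION), per_stratum))
    return sample


def locations_visited(sample):
    """Number of distinct locations that a sample requires us to visit."""
    return len({location_of(emp_id) for emp_id in sample})


rng = random.Random(0)  # fixed seed so the run is repeatable
print(systematic_sample(37)[:5])                  # [12, 37, 62, 87, 112]
print(locations_visited(cluster_sample(rng)))     # 40
print(locations_visited(stratified_sample(rng)))  # 1000
```

Calling `locations_visited` on each kind of sample gives a rough sense of how much travel each method needs under the block-assignment assumption.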
Clustered sampling is best in the sense that it is the easiest to perform in practice, because we only need to visit 40 of the 1000 locations. In this sense, stratified sampling is the worst, because we must visit all 1000 locations. Simple random and systematic random sampling will likely require that we visit most, if not all, of the 1000 locations.

2 Sample Mean

Consider quantitative data for a population of size N. As we discussed in Lecture 3, this population has a mean µ and standard deviation σ. The mean is obtained by adding up all the observations and dividing by N. The variance is obtained by adding up all the squares of the deviations from the mean and dividing by N. The standard deviation is the square root of the variance.

Now suppose that a sample of size n is randomly selected, perhaps through one of our four methods of sampling. Let x̄ denote the mean of the sample. The sample mean is obtained by adding up all the observations in the sample and dividing by n. The sample mean x̄ may or may not equal the population mean µ. We define the sampling error as |x̄ − µ|. For example, if the population mean is 7 and the sample mean is 5, then the sampling error is |5 − 7| = |−2| = 2.

Since x̄ was obtained through a random process, x̄ is a random variable. Therefore, it has a set of possible values, a probability distribution, an expected value or mean, a variance, and a standard deviation. We let $\mu_{\bar{x}}$ denote the mean of x̄ and we let $\sigma_{\bar{x}}$ denote the standard deviation of x̄.

It is important to emphasise the distinction between the population data and the sample mean. The population data has a distribution, a mean, and a standard deviation in the sense of a data set. The sample mean has a distribution, a mean, and a standard deviation in the sense of a random variable.
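The definitions above can be made concrete with a short Python sketch. The population below is hypothetical, chosen only so that its mean is 7 and one possible sample has mean 5, matching the sampling error example; the function name `sampling_error` is likewise just illustrative.

```python
from statistics import mean, pstdev, pvariance

# HYPOTHETICAL population data, chosen so that the population mean is 7.
population = [3, 5, 7, 9, 11]

mu = mean(population)             # sum of observations divided by N
variance = pvariance(population)  # sum of squared deviations divided by N
sigma = pstdev(population)        # square root of the variance


def sampling_error(sample, population):
    """Absolute difference between the sample mean and the population mean."""
    return abs(mean(sample) - mean(population))


sample = [3, 7]                   # one possible sample of size n = 2
x_bar = mean(sample)

print(mu, variance, round(sigma, 3))       # 7 8 2.828
print(x_bar)                               # 5
print(sampling_error(sample, population))  # 2
```

Note that `pvariance` and `pstdev` divide by N, as in the population definitions above, whereas `statistics.variance` and `statistics.stdev` divide by n − 1.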
It turns out that the mean and standard deviation of the sample mean are related to the mean and standard deviation of the population data in the following way:

$$\mu_{\bar{x}} = \mu$$

That is, the mean or expected value of the sample mean is the same as the population mean. Notice that this does not depend on the sample size or the population size.

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \underbrace{\sqrt{\frac{N-n}{N-1}}}_{\text{FPCF}}$$

The second factor here, labelled FPCF, is known as the finite population correction factor. There are three cases when we can ignore it:

• when the population size is not given but assumed to be very large
• when the sample size is less than 5% of the population size
• when replacement is allowed in the sampling process

In these three cases, the formula for the standard deviation of the sample mean reduces to

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}.$$

Observe that, as the sample size n increases, the standard deviation of the sample mean gets smaller. That is, as the sample size increases, the sample mean becomes more likely to be closer to the population mean.

Notice that we have not said anything about the distribution of x̄ so far other than its mean and standard deviation. For all we know at this point, it could follow a normal distribution, or a uniform distribution, or any distribution really. We will give a more precise description of the distribution of x̄ later.

As an example, suppose that a family has five people, A, B, C, D, and E, who have heights 64, 65, 68, 72, and 75 inches, respectively. This is our population data. Using techniques from Lecture 3, we can find that the population mean is µ = 68.8 inches and the population standard deviation is σ = 4.17 inches.

Now suppose that we obtain a simple random sample of 2 people from the family, without replacement. That is, the sample must consist of 2 different people. From Lecture 7, we know that there are

$$_5C_2 = \frac{5 \cdot 4}{2 \cdot 1} = 10$$

possible ways of doing this. Each pair of people is equally likely to occur, with probability 1/10.
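Because the family example is so small, the two formulas can be checked by brute force. The following is a minimal sketch using only Python's standard library; the variable names are illustrative. It lists all 10 equally likely samples, computes the mean and standard deviation of x̄ directly from them, and compares the results with the formulas, including the FPCF.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

heights = {"A": 64, "B": 65, "C": 68, "D": 72, "E": 75}  # inches
N, n = len(heights), 2

mu = mean(list(heights.values()))       # population mean, 68.8
sigma = pstdev(list(heights.values()))  # population standard deviation, about 4.17

# All samples of 2 different people; each has probability 1/10.
samples = list(combinations(heights, n))
sample_means = [mean([heights[person] for person in pair]) for pair in samples]

# Mean and standard deviation of x-bar computed directly from its 10
# equally likely values.
mu_xbar_direct = mean(sample_means)
sigma_xbar_direct = pstdev(sample_means)

# The same quantities from the formulas, with the FPCF included because
# the sample is 40% of the population and replacement is not allowed.
fpcf = sqrt((N - n) / (N - 1))
sigma_xbar_formula = sigma / sqrt(n) * fpcf

print(len(samples))                                            # 10
print(round(mu, 1), round(mu_xbar_direct, 1))                  # 68.8 68.8
print(round(sigma_xbar_direct, 2), round(sigma_xbar_formula, 2))  # 2.55 2.55
```

Dropping the FPCF here would give about 2.95 instead of 2.55, so the correction matters for a population this small.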
For each different sample, we will get a (perhaps) different value for x̄. For example, if the sample consists of people A and B, then x̄ is the average of 64 and 65, which is 64.5. We can then fill in the rest of the table below.

sample    x̄
A,B       64.5
A,C       66
A,D       68
A,E       69.5
B,C       66.5
B,D       68.5
B,E       70
C,D       70
C,E       71.5
D,E       73.5

In the second column, we see all the possible values of x̄. Notice that the values of x̄ are less extreme than the original population data. The most extreme heights in the population were 64 and 75. The most extreme average heights of two individuals are 64.5 and 73.5.

The value 70 occurs in two of the 10 groups. The other values only occur in one of the 10 groups. Therefore, we can write down the probability distribution of x̄.

k         P(x̄ = k)
64.5      1/10
66        1/10
66.5      1/10
68        1/10
68.5      1/10
69.5      1/10
70        2/10
71.5      1/10
73.5      1/10

From this, we can use techniques of Lectures 8 and 9 to compute the mean and standard deviation of x̄ using a table. For example, the next step in computing the mean would be to compute the values of k P(x̄ = k) for all the possible values k. The mean would then be the sum of those values. We could compute the standard deviation from the variance. Computing the variance requires several more columns.

Since x̄ is a sample mean, we don't actually need to use these old techniques here. We can use formulas to compute the mean and standard deviation of the sample mean. The mean of x̄ is simply the population mean, so $\mu_{\bar{x}} = \mu = 68.8$. Since our population is small and the sample is 40% of the population and the sample did not allow replacement, we must include the FPCF in our computation of the standard deviation of the sample mean, so

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} = \frac{4.17}{\sqrt{2}} \sqrt{\frac{5-2}{5-1}} = 2.55.$$

3 Distribution of Sample Mean for Normal Data

If the population data is a normally distributed data set then the sample mean is always a normally distributed random variable. As always, the mean of the sample mean is equal to the population mean.
However, the standard deviation of the sample mean decreases as the sample size increases.

Below is a histogram for men's heights. This is normally distributed and has population mean µ = 177 cm and population standard deviation σ = 7 cm. The huge numbers for the frequencies give you an idea of how large the population is.

[Figure: histogram of men's heights. Horizontal axis: height (cm), 155 to 200. Vertical axis: frequency, in units of 10^5.]

If the sample size is 1, the sample mean is simply the value of the single chosen observation. The shape of the graph of the pdf for this randomly chosen observation is exactly the same as the shape of the histogram for the population. The mean of the sample mean is $\mu_{\bar{x}} = \mu = 177$ and the standard deviation of the sample mean is $\sigma_{\bar{x}} = 7/\sqrt{1} = 7$, just like for the population data set. Unlike for the histogram, the area under the graph of a pdf must be 1.

If the sample size is 16, the sample mean is once again a normally distributed random variable with mean $\mu_{\bar{x}} = \mu = 177$ and standard deviation $\sigma_{\bar{x}} = 7/\sqrt{16} = 1.75$. As the sample size increases, the pdf for the sample mean remains normally distributed with mean equal to the population mean, but, since its standard deviation decreases, the graph of its pdf becomes thinner and taller. The graph must become taller as it gets thinner so that the area under it remains 1.

Below are graphs of the pdf for the sample mean for samples of different sizes.

[Figure: distribution of sample mean. Pdf curves for n = 1, 2, 4, 8, 16, 32. Horizontal axis: 155 to 200.]

4 Central Limit Theorem

Now suppose that the population data is not necessarily normally distributed. Below is a histogram for a large data set which has mean µ = 9.11 and standard deviation σ = 4.31 but which is not normally distributed.

[Figure: histogram. Horizontal axis: values, 0 to 16. Vertical axis: frequency, 0 to 2500.]

Below are graphs of the pdf for the sample mean for samples of different sizes.
[Figure: distribution of sample mean. Pdf curves for n = 1, 2, 4, 8, 16, 32. Horizontal axis: 0 to 15.]

Evidently, the effect of taking larger and larger sample sizes is to smooth the graph of the pdf for the sample mean. Eventually, the graph of the pdf approaches a normal curve and becomes thinner and taller. The central limit theorem says that, for sufficiently large samples (as a rule of thumb, a sample size of at least 30), the sample mean is approximately normally distributed even if the original data set was not normally distributed.
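The behaviour described by the central limit theorem can be explored with a small simulation. The sketch below is only an illustration under stated assumptions: the population is a hypothetical right-skewed data set generated from an exponential distribution (it is not the data set in the histogram above), samples are drawn with replacement so the FPCF can be ignored, and the function names are made up for the example. For each sample size it reports the observed mean and standard deviation of x̄, the predicted standard deviation σ/√n, and the fraction of sample means that fall within one predicted standard deviation of µ.

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def sample_means(population, n, trials, rng):
    """Draw `trials` samples of size n (with replacement); return their means."""
    return [fmean(rng.choices(population, k=n)) for _ in range(trials)]


def summarize(population, sizes=(1, 2, 4, 8, 16, 32), trials=2000, seed=0):
    rng = random.Random(seed)
    mu, sigma = fmean(population), pstdev(population)
    rows = {}
    for n in sizes:
        xbars = sample_means(population, n, trials, rng)
        predicted_sd = sigma / sqrt(n)  # no FPCF: sampling with replacement
        within = sum(abs(x - mu) < predicted_sd for x in xbars) / trials
        rows[n] = (fmean(xbars), pstdev(xbars), predicted_sd, within)
    return rows


# HYPOTHETICAL skewed population: 20000 exponential values with mean near 9.
pop_rng = random.Random(1)
population = [pop_rng.expovariate(1 / 9) for _ in range(20_000)]

rows = summarize(population)
for n, (m, s, pred, within) in rows.items():
    print(f"n={n:2d}  mean={m:6.2f}  sd={s:5.2f}  "
          f"sigma/sqrt(n)={pred:5.2f}  within 1 sd={within:.2f}")
```

For a normal distribution, about 68% of values lie within one standard deviation of the mean, so the last column should drift from the skewed population's value towards roughly 0.68 as n grows, while the observed standard deviation tracks σ/√n.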