Topic 4

Sampling and Confidence Intervals

Contents
4.1 A Worked Example
    4.1.1 Samples of size 25
    4.1.2 Other Sample Sizes
4.2 Sampling Techniques
    4.2.1 Simple Random Sampling
    4.2.2 Systematic Random Sampling
    4.2.3 Stratified Random Sampling
    4.2.4 Cluster or Area Sampling
4.3 Central Limit Theorem
4.4 Finite Populations
4.5 Confidence Intervals
    4.5.1 Light Bulb Example
    4.5.2 Formula for Confidence Intervals
    4.5.3 Confidence Intervals when Sigma is Unknown
4.6 Estimating Sample Size
4.7 Proportions
    4.7.1 Sampling Distribution of p
    4.7.2 Confidence Intervals for a Population Proportion
    4.7.3 Summary and Assessment
Learning Objectives

• state that when many samples are taken from a population, the values of the sample means are not all the same
• give examples of various types of sampling techniques, quoting appropriate situations where each would be used
• state the properties of the Central Limit Theorem for sample means
• use the finite population correction factor
• calculate confidence intervals for population means based on a sample mean result
• estimate the approximate sample size required for a specified level of accuracy
• describe the distribution of sample proportions
• calculate confidence intervals for population proportions based on a sample proportion result

© Heriot-Watt University 2003

4.1 A Worked Example

The last topic showed many Normal distribution situations where the focus was on whole populations. There are some very interesting results that arise when a number of samples are taken from a Normal distribution. You will probably have an intuitive idea of the very useful statistical results that come out of this type of analysis.

To get a feel for the sort of thing that will crop up time and time again in this course, imagine an office block with a large number of rooms each accommodating ten workers. If some statistic was then measured, say, for example, the number of days an employee was absent from work last year, an average score could be taken for each room. It would probably be expected, then, that the average absence rate for the office block as a whole would be the average of that of all the individual offices. This is indeed the case. In other words the population mean (the office block average absence rate) is equal to the mean of the sample means (the average of the mean values of all the individual offices taken together).

But it is a different situation when the spread of the results is considered. Any extreme value of the population will be "averaged out" when the sample mean is taken.
So the spread of the population will be greater than the spread of the sample means. The following illustration shows what happens when many samples are taken from a population. 4.1.1 Samples of size 25 These 500 numbers were obtained by measuring some property of every element in a population that follows a Normal distribution. 53 52 27 48 55 46 51 29 65 43 29 46 54 60 66 56 34 59 43 52 46 67 59 51 57 56 64 49 62 30 51 46 50 32 51 48 48 53 45 56 55 65 57 72 47 60 49 47 51 54 54 52 16 52 62 62 35 44 43 60 49 54 32 58 42 41 49 35 57 64 58 58 37 63 38 42 45 45 50 60 51 38 51 64 52 58 55 60 62 50 60 52 45 43 57 48 47 46 32 58 38 47 49 53 40 47 62 34 48 48 60 48 52 58 69 49 44 61 35 53 29 58 62 40 43 63 50 59 46 32 52 48 45 50 60 65 33 44 50 35 30 41 36 46 65 60 33 67 57 56 42 35 37 56 56 56 46 37 37 35 47 40 53 49 33 32 59 40 51 44 43 49 56 64 44 48 53 42 43 43 43 50 51 49 58 58 58 59 46 52 33 63 54 64 48 52 51 61 28 46 53 41 56 44 61 35 50 43 56 66 47 62 37 63 47 63 55 49 52 50 29 44 41 47 48 47 40 46 43 51 44 61 57 51 43 68 41 53 39 61 62 43 38 65 31 67 61 66 49 49 64 43 43 53 64 41 48 53 45 71 44 57 55 44 51 38 49 46 43 36 43 55 70 47 45 40 38 48 43 61 49 43 59 54 41 55 40 42 42 48 52 50 45 36 33 46 66 52 42 61 47 57 55 43 32 42 46 50 44 38 61 46 54 52 40 50 51 51 51 59 53 47 46 58 64 48 55 55 33 57 39 38 35 56 70 74 65 53 67 39 64 39 48 46 44 44 39 44 47 50 46 48 55 60 43 65 45 30 42 67 56 43 40 49 55 45 57 42 48 33 39 45 54 50 43 58 41 36 46 40 49 53 36 51 52 39 67 53 41 47 53 64 23 54 63 64 62 54 47 48 66 43 42 47 55 51 45 54 36 61 78 67 42 43 54 57 61 60 53 36 55 52 46 36 63 25 56 45 56 52 42 54 55 41 42 28 54 41 43 44 54 54 52 56 53 67 34 40 43 57 43 34 62 56 45 56 54 49 50 48 56 54 48 67 43 59 46 46 39 48 45 42 37 48 42 50 54 47 53 56 50 59 64 44 52 52 62 50 49 42 36 56 52 62 48 56 48 68 39 50 It can be shown that the population mean is 48.5 and the population standard deviation is 9.6. 
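The experiment that follows can also be mimicked on a computer. The sketch below is only an illustration under stated assumptions: it does not use the 500 values listed above but a simulated Normal population with the same mean (48.5) and standard deviation (9.6), and the seed, the function name `sample_means` and the variable names are arbitrary choices rather than part of the course material. It draws 100 samples of size 25 and compares the centre and spread of the sample means with those of the population.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples simple random samples (without replacement within
    each sample) and return the mean of each one."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(2003)  # arbitrary fixed seed so the run is repeatable

# Assumption: a simulated stand-in for the 500 measurements in the text.
population = [rng.gauss(48.5, 9.6) for _ in range(500)]
means = sample_means(population, sample_size=25, n_samples=100, rng=rng)

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)   # whole population, so divide by N
mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)    # spread of the 100 sample means

print(f"population:   mean {pop_mean:.1f}, sd {pop_sd:.1f}")
print(f"sample means: mean {mean_of_means:.1f}, sd {sd_of_means:.1f}")
```

With any seed, the two means should be close to each other while the standard deviation of the sample means should be much smaller than that of the population, which is the "averaging out" described above.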
Now several different samples, each of size 25, are taken from this population.

Sample 1
60 53 54 56 65 59 43 35 54 53 45 49 48 64 44 40 43 28 44 51 35 55 46 48 48

Sample 2
35 52 44 43 57 53 56 37 52 49 71 47 52 59 52 43 48 25 56 50 52 42 42 56 45

Sample 3
55 43 56 57 57 61 35 37 54 53 45 56 30 46 56 42 65 71 39 56 34 45 52 52 29

Sample 4
37 37 47 48 36 28 61 45 55 54 59 42 61 64 46 65 33 56 54 43 49 48 43 41 62

Sample 5
45 36 43 46 35 48 39 59 70 41 56 58 52 57 62 60 48 63 51 41 56 63 40 34 43

Sample 6
58 46 55 48 39 50 48 43 51 64 42 56 48 54 34 43 49 62 52 29 59 56 52 47 53

Sample 7
57 39 43 52 61 67 42 55 35 43 45 34 55 49 61 64 36 45 48 42 43 40 49 62 61

Sample 8
46 54 49 49 42 55 58 41 52 52 56 41 43 60 52 57 52 45 43 43 60 48 43 32 50

Sample 9
67 54 40 42 54 58 50 56 48 43 37 51 47 43 64 43 45 52 49 43 52 37 64 60 59

Sample 10
46 65 45 52 61 50 55 46 56 56 42 47 56 56 54 52 65 46 56 42 45 53 40 56 42

The mean of each of the samples is shown below.

Sample    Mean
1         48.8
2         48.7
3         49.0
4         48.6
5         49.8
6         49.5
7         49.1
8         48.9
9         50.3
10        51.4

Notice that no two means are exactly the same. This sampling process is repeated 100 more times and the sample means are given below.

51.8 46.6 48.7 49.8 48.4 50.8 46.3 49.5 51.2 48.2
47.6 47.7 49.8 47.8 46.8 46.0 49.3 47.8 48.3 46.7
45.5 47.0 46.6 53.4 47.1 49.1 45.3 50.4 47.1 50.8
47.0 49.5 48.2 45.3 44.9 49.0 50.8 45.1 52.1 49.4
51.5 50.1 49.1 50.9 46.8 47.1 45.1 48.2 49.5 50.9
48.3 50.8 47.8 47.0 46.2 49.3 48.0 50.2 48.2 49.5
48.5 47.8 49.0 45.1 45.7 52.8 51.3 50.5 43.8 47.0
50.5 49.0 49.9 47.8 48.0 47.8 47.6 49.0 47.8 50.7
48.3 47.4 48.8 48.3 48.9 45.9 50.7 47.9 49.4 47.6
48.1 48.2 49.0 50.2 47.0 50.4 49.1 46.2 47.4 49.7

Now consider a histogram of these data. It has the following appearance.
This seems to indicate that the sample means follow a Normal distribution. The mean of the sample means works out as 48.5 and the standard deviation of the sample means as 1.8. These results can now be summarised.

                Number of results    Mean    Standard Deviation
Population      500                  48.5    9.6
Sample means    100                  48.5    1.8

This shows that the mean of the sample means is the same as the population mean, but the standard deviation of the distribution of sample means is around 5 times smaller than the population standard deviation. This supports what was discussed earlier in the example of absence rates in offices, where extreme values were "averaged out".

The above results can also be shown with reference to the Normal distribution curve (the jagged edges on the histogram above can be smoothed out as usual by taking lots more samples).

4.1.2 Other Sample Sizes

Notice that all of the samples taken so far were of size 25. The whole sampling procedure is now repeated with samples of size 36 and 64 and the results are given below.

                  Sample Size    Mean    Standard Deviation
Population                       48.5    9.6
First samples     25             48.5    1.8
Second samples    36             48.5    1.6
Third samples     64             48.5    1.1

As the sample size increases, the standard deviation of the sample means reduces. The second set of samples produces a value that is about six times smaller than the population standard deviation, whilst the value from the third set is about eight times smaller. This can all be summarised in an important statistical result called the Central Limit Theorem.

4.2 Sampling Techniques

Before considering the Central Limit Theorem in detail, this section looks at some different ways of gathering sample data. There are many good reasons why a sample is used instead of a population.
Some of them are now listed:

• The sample can save time and money
• Accessing the whole population is sometimes impossible so there is no choice
• Because the research process is sometimes destructive, the sample can save the product

Every research study has a target population that consists of the individuals or entities that are the object of the investigation. The sample is taken from a population list, map, directory or other source that is being used to represent the population. This list is called the frame.

There are two main types of sampling: random and nonrandom. In random sampling every unit of the population has the same probability of being selected into the sample (e.g. in the UK the National Lottery is an example of random sampling). This is not the case in nonrandom sampling. Here it might be that the sample is selected simply because members were in the right place at the right time. Samples like this are usually of no use for statistical analysis, so the focus will remain on random samples. They can be categorised into different types.

4.2.1 Simple Random Sampling

A sampling procedure that assures that each element in the population has an equal chance of being selected is referred to as simple random sampling. For example, the names of all the winners of a competition could be written on pieces of paper and placed in a drum, and then the name of the person who has won the star prize could be pulled out. Tables of random numbers and statistical computer packages provide alternative, and possibly easier, methods of identifying the required winner.

4.2.2 Systematic Random Sampling

If a systematic pattern is introduced into random sampling, it is referred to as "systematic (random) sampling". For instance, suppose the passengers on an aeroplane had numbers attached to their names ranging from 001 to 500. A random starting point is chosen, e.g. 037, and then every 10th name is picked thereafter to give a sample of 50 (starting over with 007 after reaching 497). In this sense, the technique is similar to cluster sampling, since the choice of the first unit determines the remainder.

There are a number of potential problems with simple and systematic random sampling. If the population is widely dispersed, it may be extremely costly to reach its members. A current list of the whole population (the sampling frame) may not be readily available. Or perhaps the population itself is not homogeneous and the sub-groups are very different in size. In such a case, precision can be increased through stratified sampling.

4.2.3 Stratified Random Sampling

In this random sampling technique, the whole population is first divided into mutually exclusive subgroups or strata and then units are selected randomly from each stratum. The strata are based on some predetermined criterion such as geographic location, size or demographic characteristic. It is important that the strata differ from one another as much as possible, while the units within each stratum are as alike as possible.

For example, suppose you require an analysis of the spending patterns of a hotel's guests. It is expected that business customers behave differently from leisure guests, so this defines two groups or strata. In addition, for this hotel it is known that 70% of the guests tend to be on business, whilst 30% of the guests are there for leisure purposes. In simple random sampling, there is no assurance that a sufficient number of leisure travellers would actually be included in the sample. So if it was decided that 60 respondents from the leisure segment were required, then 140 business travellers should be questioned, for a total of 200 respondents. This is referred to as "proportionate stratified sampling".
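The two selection schemes just described can be sketched in a few lines of Python. This is a minimal illustration, not a survey tool: the function names, the guest identifiers and the size of the hotel's guest list (700 business and 300 leisure guests) are hypothetical, chosen only so that the 70%/30% split and the aeroplane example from the text can be reproduced.

```python
import random


def systematic_sample(n_units, step, start):
    """Unit numbers picked by taking every step-th unit from start,
    wrapping round to the low numbers after passing n_units."""
    return [(start - 1 + i * step) % n_units + 1 for i in range(n_units // step)]


def proportionate_allocation(stratum_shares, total):
    """Split a total sample size across strata in proportion to their shares.
    Rounding may leave the parts a unit off the total for awkward shares."""
    return {name: round(share * total) for name, share in stratum_shares.items()}


def stratified_sample(frame, allocation, rng):
    """Simple random sample of the allocated size from each stratum."""
    return {name: rng.sample(frame[name], size) for name, size in allocation.items()}


# Aeroplane example: passengers 001 to 500, start at 037, every 10th name.
passengers = systematic_sample(n_units=500, step=10, start=37)

# Hotel example: 200 respondents split 70% business, 30% leisure.
allocation = proportionate_allocation({"business": 0.70, "leisure": 0.30}, total=200)

# Hypothetical sampling frame of guest identifiers.
frame = {
    "business": [f"B{i:03d}" for i in range(700)],
    "leisure": [f"L{i:03d}" for i in range(300)],
}
sample = stratified_sample(frame, allocation, random.Random(1))

print(passengers[:3], "...", passengers[-3:])
print(allocation)
```

The systematic sample runs 37, 47, ... up to 497 and then wraps to 7, 17 and 27, giving 50 passengers, and the allocation gives 140 business and 60 leisure respondents as in the text.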
Disproportionate sampling is only undertaken if a particular stratum is very important to the research project but occurs in too small a percentage to allow for meaningful analysis unless its representation is artificially boosted. In this technique you oversample and then weight your data to re-establish the proportions.

4.2.4 Cluster or Area Sampling

Suppose that a survey is to be done in a large town and that the unit of enquiry is the individual household. Suppose further that the town contains 20,000 households, all listed on convenient records, and a sample of 200 is needed. A simple random sample of 200 could well spread over the whole town, incurring high costs and much inconvenience. It might be easier to concentrate the sample in a few parts of the town. Now assume the town can be divided into 400 areas with 50 households in each. In this case it is possible to select 4 areas at random and include all households in these areas. Note that, unlike the strata in stratified sampling, the clusters are thought of as being typical of the population, rather than subsections of it.

4.3 Central Limit Theorem

If samples of size n are randomly drawn from a population that has a mean of $\mu$ and a standard deviation of $\sigma$, the sample means, $\bar{x}$, are approximately normally distributed for sufficiently large sample sizes ($n \ge 30$) regardless of the shape of the population distribution. If the population is normally distributed, the sample means are normally distributed for any size of sample.

It can be proved mathematically (and verified by experiment) that

1. The mean of the sample means is the population mean
2. The standard deviation of the sample means is the standard deviation of the population divided by the square root of the sample size.

These properties were shown earlier by experimentation on a Normal population. But it is very interesting to note that the theorem applies to any type of population as long as the sample size is sufficiently large.
This is another reason why the Normal distribution
is so important in statistics.

It was also shown earlier that Normal distribution problems can be analysed with reference to statistical tables and using the formula $z = \frac{x - \mu}{\sigma}$. This formula can now be adapted to deal with sample means and is given as

$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$$

However, since it is usually not possible to take a large number of different samples, the value $\mu_{\bar{x}}$ is virtually impossible to calculate over a realistic time period. Fortunately, though, it has been shown that $\mu_{\bar{x}}$ equals the population mean $\mu$. Similarly, calculation of $\sigma_{\bar{x}}$ would be very time consuming, but again it has been shown that this equals $\sigma/\sqrt{n}$ (n is the sample size). The equation now becomes

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

This is known as the z formula for sample means.

Example

The mean amount of cash spent per customer at a tyre and exhaust centre is 78.20 with a standard deviation of 7.10. If a random sample of 40 customers is taken, what is the probability that the total amount spent by these 40 people is more than 3 200?

The average amount spent in the sample is 3200/40 = 80. So $\bar{x} = 80$, $\mu = 78.20$, $\sigma = 7.10$ and $n = 40$. Use the formula

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

This gives

$$z = \frac{80 - 78.20}{7.10/\sqrt{40}} = 1.60$$

Looking up tables gives a value of 0.0548. This is the probability that these 40 customers spend a total amount of more than 3 200.

Example

In a production process, bags of sugar are produced that supposedly contain 1.00kg of the commodity. The standard deviation is obtained by examining the equipment over a long period of time and is found to be 0.09kg. What is the probability that a sample of 36 bags of sugar has a mean weight of between 0.98kg and 1.01kg?

The probability of a sample mean weight being greater than 1.01 is calculated by

$$z = \frac{1.01 - 1.00}{0.09/\sqrt{36}} = 0.67$$

So, from tables, the required probability is 0.2514.

Similarly, the probability of a sample mean weight being less than 0.98 is calculated by

$$z = \frac{0.98 - 1.00}{0.09/\sqrt{36}} = -1.33$$

So the required probability is 0.0918.

Therefore, the probability of a sample mean weight being between 0.98 and 1.01 is given by 1 - (0.2514 + 0.0918) = 0.6568.

Using the Central Limit Theorem

This is also an online activity if you prefer to take it.

The average number of times a child is taken to visit their GP between ages five and ten in a certain town is discovered to be 8.1 with a standard deviation of 4.2. A random sample of 49 appropriately aged children is taken from the town. The probability that the sample mean number of visits to the GP (in the appropriate time period) is less than 7 is required.

Q1: Use the appropriate formula to find the required probability.

4.4 Finite Populations

The sampling procedures so far have been used on populations that are assumed to be infinite or at least extremely large. In the case of a finite population an adjustment can be made to the z formula for sample means. The adjustment is called the finite correction factor, $\sqrt{\frac{N-n}{N-1}}$, and it operates on the standard deviation of sample means. N is the population size, n is the sample size. The formula now becomes

$$z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}}$$

Note that in cases where the population is large the correction factor will make little difference to the calculation of z. For example, if N = 20 000 and n is 30, the value of $\sqrt{\frac{N-n}{N-1}}$ is 0.9993, which is almost 1. Most of the examples considered in this course can be assumed to come from infinite populations, so unless mentioned otherwise, it will not be necessary to use this correction factor.

4.5 Confidence Intervals

The z formula for sample means can be manipulated and then used for the very useful purpose of inferring parameters of a population.
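As a reference point before the formula is rearranged, the sketch below evaluates the z formula for sample means in Python, with the correction factor from Section 4.4 as an option. The function name `z_for_sample_mean` and its `population_size` parameter are illustrative choices, not part of the course material. It reworks the tyre and exhaust centre example and uses the standard library's Normal distribution in place of printed tables, so the probability may differ from the tabulated value in the last decimal places because z is not rounded to two places first.

```python
import math
from statistics import NormalDist


def z_for_sample_mean(x_bar, mu, sigma, n, population_size=None):
    """z = (x_bar - mu) / (sigma / sqrt(n)).

    If population_size (N) is given, the standard deviation of the sample
    means is multiplied by the correction factor sqrt((N - n) / (N - 1)).
    """
    standard_error = sigma / math.sqrt(n)
    if population_size is not None:
        standard_error *= math.sqrt((population_size - n) / (population_size - 1))
    return (x_bar - mu) / standard_error


# Tyre and exhaust centre: a total above 3 200 for 40 customers
# is the same event as a sample mean above 80.
z = z_for_sample_mean(x_bar=3200 / 40, mu=78.20, sigma=7.10, n=40)
upper_tail = 1 - NormalDist().cdf(z)
print(f"z = {z:.2f}, probability of exceeding it = {upper_tail:.4f}")

# Size of the correction factor for N = 20 000 and n = 30.
correction = math.sqrt((20_000 - 30) / (20_000 - 1))
print(f"correction factor = {correction:.4f}")
```

Leaving `population_size` unset corresponds to treating the population as infinite, which is the default assumption made in the rest of this topic.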
As has been mentioned earlier in this course, it is often very difficult or impossible to calculate population means or standard deviations, but the process of working them out can be reasonably straightforward for a sample. The theory then allows the values obtained from a sample to be used to give upper and lower bounds between which most sample means would lie. This gives a confidence interval within which the population parameter is expected to lie.

4.5.1 Light Bulb Example

A company produces light bulbs and wishes to estimate the average lifetime of a bulb. It takes a random sample of 60 bulbs and tests them by using each one continuously until it burns out (this is clearly an example where measuring the appropriate property of the whole population would be no use, as all the bulbs would then be ruined!). It is known that the standard deviation of the population is 140 hours. After experimentation it was found that the sample produced a mean lifetime of 1456 hours. This is the only figure available so it is used to give a point estimate of the population mean, $\mu$.

However, as has been previously discussed, if another sample was taken there is every possibility that it would give some result other than 1456. But, by using the Central Limit Theorem, an interval estimate can be made which gives a range of values within which the population mean is expected to lie, with a stated level of confidence.

This can be shown with reference to a Normal distribution curve. The population mean has been estimated as 1456 so the diagram has the following appearance.

The curve shows the distribution of sample means. A standard rule in statistics is to find the lower and upper bounds between which 95% of these means would lie. Statistical tables reveal that a value of z = 1.96 has a probability of 0.025 (2.5%) of being exceeded, and by symmetry there is a probability of 0.025 of z being less than -1.96.
This gives 95% of values of z between -1.96 and 1.96. The value of 1.96 is obtained either by using Normal distribution tables in reverse and looking through the body of them for 0.025, or by using tables specially designed to find critical values of z.

Now, it is known that $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$, with $\sigma = 140$, $n = 60$ and the curve centred on 1456.

So for the lower boundary, z = -1.96. This gives

$$1456 - 1.96 \times \frac{140}{\sqrt{60}} = 1420.58$$

And for the upper boundary, z = 1.96. This gives

$$1456 + 1.96 \times \frac{140}{\sqrt{60}} = 1491.42$$

The 95% confidence interval for the population mean, $\mu$, is therefore [1420.58, 1491.42]. What this says in colloquial terms is that we can be 95% confident that the mean lifetime of all bulbs of this type lies between 1420.58 and 1491.42 hours: if samples of 60 bulbs were taken repeatedly, about 95% of the intervals constructed in this way would contain the population mean.

4.5.2 Formula for Confidence Intervals

To have 100% confidence that the population mean falls between two limits is virtually impossible. The researcher must select a desired level of confidence; in the last example it was 95% but other common values are 90%, 98%, 99% and 99.9%. You may well be wondering why the highest possible level is not always selected. The answer is that there is always a "trade off". As the level of confidence increases, so does the width of the interval, and so the actual value of the population mean is pinned down less precisely than with a narrower interval. In general the confidence interval for the population mean is given by the formula:

$$\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$$

Values of $z_{\alpha}$ are obtained from tables; for example, for the 98% confidence interval, alpha is equal to 0.01 (1% on each side), giving a z value of 2.33.

4.5.3 Confidence Intervals when Sigma is Unknown

In the examples considered so far, the population standard deviation has always been known. This may seem strange, especially if the population mean is unknown.
However, it is possible in some circumstances to obtain the population standard deviation by looking at past records and so it is not impossible to know it and not the mean. In many cases, though, the population standard deviation will have to be estimated. In fact, when sample sizes are large ($n \ge 30$) the sample standard deviation, s, (which can easily be calculated) provides a very good estimate for the population standard deviation. It can therefore be used in the formula to calculate confidence intervals for the population mean. The formula can be modified as follows:

$$\bar{x} - z_{\alpha}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha}\frac{s}{\sqrt{n}}$$

Do not use this formula for small samples when the population standard deviation is unknown, even if the population is Normally distributed. There are other methods for dealing with such samples (of size < 30) and these will be described in Topic 4.

Confidence Interval Activity

Q2: A health association is interested in estimating the average number of days women stay in a local hospital after having a baby. A random sample of 36 women who had babies at the hospital recently was taken and the number of days (rounded to the nearest day) each of them spent in hospital after childbirth is given in the table below.

3 3 3 2 3 3 3 1 5 4 3 5
4 4 3 1 5 4 3 3 2 6 2 3
2 4 4 3 3 5 5 2 3 4 2 4

Use these data to construct a 99% confidence interval to estimate the average maternity stay for all women who have babies in the hospital.

4.6 Estimating Sample Size

The examples considered so far in this Topic have always started by specifying a sample size. But in many cases the researcher is going to have to choose the number of elements that make up his or her sample. The bigger the sample, the more representative of the population the result will be, but there is a cost. Researchers have to work to a budget and do not want to take an unnecessarily large sample.

The z formula for sample means that has been well used in this Topic can help to decide on an appropriate sample size. Let $E = \bar{x} - \mu$ and refer to it as the error of estimation. Then the formula becomes

$$z = \frac{E}{\sigma/\sqrt{n}}$$

Solving for n produces the sample size, i.e.

$$n = \frac{z^2\sigma^2}{E^2}$$

Example

It is desired to find the average age of the residents in a village. It is known that the oldest resident is 85 and the youngest 1. How many people should be questioned to obtain a result with an error of estimation of 3 years? The researcher wants to be 90% confident of his results.

The problem is the lack of any knowledge of the standard deviation. It may be possible to estimate this from similar villages, but in the absence of any other information, an estimate can be made using the formula:

$$\sigma \approx \frac{\text{range}}{4}$$

Here the range is 85 - 1 = 84 and so $\sigma$ can be estimated as 21. The value of z for 90% is 1.64 and E = 3, so using the formula

$$n = \frac{1.64^2 \times 21^2}{3^2} = 131.8$$

So at least 132 people should be questioned.

4.7 Proportions

The population parameter that was estimated in the last few examples has been the mean, but this is not the only thing that can be calculated from a sample. Another important concept in statistics is the proportion of elements in a sample that satisfy some criterion, for example, the proportion of people over 60 in a social club, or the proportion of left-handed children in a class at school. Using the values obtained in the sample it is possible to infer something about the proportion of the population that has the characteristic being examined in the sample. As in the situation involving the mean, confidence intervals will be obtained.

4.7.1 Sampling Distribution of p

The symbol $p$ will be used to represent a sample proportion and $\pi$ a population proportion (of course this is nothing to do with the one used in the geometry of a circle).
Just like in the situation for the mean, if a certain property is required for the population, for example, the proportion of people who own a car in a particular town, then this will usually have to be estimated by taking a sample. And, again, in a similar way to the mean, it is very likely that if a number of samples were taken, they would all give slightly different values of p. The importance of the Central Limit Theorem is now highlighted by the fact that this distribution of p also follows a Normal curve in most cases. In fact the theorem applies if $n\pi > 5$ and $n(1 - \pi) > 5$ (n is the sample size).

The Central Limit Theorem also reveals that in a distribution of proportions satisfying the above, the mean of all the sample proportions is equal to the population proportion, $\pi$, whilst the standard deviation of the proportions is $\sqrt{\frac{\pi(1-\pi)}{n}}$.

This leads to the z formula for sample proportions given below.

$$z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}}$$

Example

It is thought that 25% of the population uses fabric conditioner when they do their laundry. 60 people are questioned about their laundry habits. What is the probability that more than 18 say that they use fabric conditioner as well as a washing powder?

Solution

Note that $n\pi = 15$ and $n(1 - \pi) = 45$ so the sample proportions follow a Normal distribution that is symmetrical about $\pi = 0.25$ (this is the population proportion equivalent to 25%). The standard deviation of sample proportions is given by

$$\sqrt{\frac{\pi(1-\pi)}{n}} = \sqrt{\frac{0.25 \times 0.75}{60}} = 0.0559$$

Thus s.d. = 0.0559. 18 people out of a sample of 60 gives a value of p = 18/60 = 0.3. If more than 18 people use fabric conditioner, then this corresponds to p > 0.3. This is shown on the curve below.

The z formula for sample proportions can be used to calculate a value for z, which can then be looked up on the tables.
$$z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} = \frac{0.3 - 0.25}{0.0559} = 0.89$$

The required probability is therefore 0.1867 (from tables).

4.7.2 Confidence Intervals for a Population Proportion

The z formula for sample proportions can be used in the same way as the z formula for sample means to allow for the calculation of upper and lower bounds for a population proportion; in other words to create a confidence interval for the population proportion. The point estimate of the population proportion is chosen to be equal to the sample proportion.

Notice that in the formula for z, the value $\pi$ occurs in both the numerator and the denominator, leading to difficulties in solving for $\pi$. Because of this, it is convenient to replace $\pi$ by $p$ in the denominator. Note that this is only done in the case of estimating confidence intervals. The confidence interval to estimate $\pi$ is given by

$$p - z_{\alpha}\sqrt{\frac{p(1-p)}{n}} \le \pi \le p + z_{\alpha}\sqrt{\frac{p(1-p)}{n}}$$

Example

A clothing company produces men's jeans. The jeans are made and then sold with either a regular cut or a boot cut. In an effort to estimate the proportion of their men's jeans market that is for boot cut jeans, an analyst takes a random sample of 212 sales and finds that 34 were for boot cut. Construct a 90% confidence interval to estimate the proportion of the population who prefer boot cut jeans.

Solution

The sample proportion who prefer boot cut jeans is 34/212 = 0.16; this is the point estimate of the population proportion. The z value for 90% is 1.64. The lower bound is given by

$$0.16 - 1.64\sqrt{\frac{0.16 \times 0.84}{212}} = 0.12$$

The upper bound is given by

$$0.16 + 1.64\sqrt{\frac{0.16 \times 0.84}{212}} = 0.20$$

The 90% confidence interval is therefore [0.12, 0.20]. In other words, with this level of confidence, between 12% and 20% of the population prefer boot cut jeans.

Note that this calculation is valid because $np = 212 \times 0.16 = 33.92$ and $n(1 - p) = 212 \times 0.84 = 178.08$, both of which satisfy the requirement of being greater than 5 (the point estimate $p$ of $\pi$ is used here).
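The boot cut calculation can be checked with a short Python sketch. The function names are illustrative and not taken from the course; the z value of 1.64 is the tabulated figure quoted in the text, and the sample proportion is left unrounded (34/212 rather than 0.16), which does not change the bounds to two decimal places.

```python
import math


def proportion_confidence_interval(successes, n, z):
    """Interval p -/+ z * sqrt(p(1 - p) / n), with p the sample proportion."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


def normal_approximation_ok(successes, n):
    """Check the rule of thumb: np and n(1 - p) must both exceed 5."""
    p = successes / n
    return n * p > 5 and n * (1 - p) > 5


lower, upper = proportion_confidence_interval(successes=34, n=212, z=1.64)
print(f"rule of thumb satisfied: {normal_approximation_ok(34, 212)}")
print(f"90% confidence interval: [{lower:.2f}, {upper:.2f}]")
```

Passing a different z value (for example 1.96 for 95%) widens or narrows the interval in the way described in Section 4.5.2.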
4.7.3 Summary and Assessment

At this stage you should be able to:

• state that when many samples are taken from a population, the values of the sample means are not all the same
• give examples of various types of sampling techniques
• quote appropriate situations where simple random sampling would be used
• quote appropriate situations where systematic random sampling would be used
• quote appropriate situations where stratified random sampling would be used
• quote appropriate situations where cluster sampling would be used
• state the properties of the Central Limit Theorem for sample means
• use the finite population correction factor
• calculate confidence intervals for population means based on a sample mean result
• estimate the approximate sample size required for a specified level of accuracy
• describe the distribution of sample proportions
• calculate confidence intervals for population proportions based on a sample proportion result

Answers to questions and activities

4 Sampling and Confidence Intervals

Using the Central Limit Theorem

Q1: The formula is

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

Using $\mu$ = 8.1, n = 49, $\bar{x}$ = 7, $\sigma$ = 4.2 in the expression gives z = -1.83.

Finally, look up the tables and select the required probability from the list to give 0.0336.

Confidence Interval Activity

Q2: $\bar{x} = 3.31$, $s = 1.17$

The equation for the 99% confidence interval is

$$\bar{x} - z_{\alpha}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha}\frac{s}{\sqrt{n}}$$

where $z_{\alpha} = 2.58$.

Thus the confidence interval is

$$3.31 - 2.58 \times \frac{1.17}{\sqrt{36}} \le \mu \le 3.31 + 2.58 \times \frac{1.17}{\sqrt{36}}$$

or [2.81, 3.81]. In other words, between 3 and 4 days.
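Both answers can be reproduced numerically. The following Python sketch is offered as a check rather than as the intended method of solution: the function name is an illustrative choice, the Normal probabilities come from the standard library instead of printed tables, and no intermediate values are rounded, so the results may differ from the figures above in the second or third decimal place.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(data, z):
    """Large-sample interval x_bar -/+ z * s / sqrt(n), where s is the
    sample standard deviation (divisor n - 1)."""
    n = len(data)
    x_bar = statistics.mean(data)
    margin = z * statistics.stdev(data) / math.sqrt(n)
    return x_bar - margin, x_bar + margin


# Q1: mu = 8.1, sigma = 4.2, n = 49, probability that the sample mean is below 7.
z_q1 = (7 - 8.1) / (4.2 / math.sqrt(49))
p_q1 = NormalDist().cdf(z_q1)
print(f"Q1: z = {z_q1:.2f}, probability = {p_q1:.4f}")

# Q2: days in hospital for the 36 women, 99% interval with z = 2.58.
days = [3, 3, 3, 2, 3, 3, 3, 1, 5, 4, 3, 5,
        4, 4, 3, 1, 5, 4, 3, 3, 2, 6, 2, 3,
        2, 4, 4, 3, 3, 5, 5, 2, 3, 4, 2, 4]
lower, upper = mean_confidence_interval(days, z=2.58)
print(f"Q2: interval = [{lower:.2f}, {upper:.2f}]")
```

Small differences from the worked answers arise because those use the rounded values z = -1.83, a sample mean of 3.31 and s = 1.17.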