Download Here - School of Mathematical and Computer Sciences - Heriot

Topic 4
Sampling and Confidence Intervals
Contents
4.1 A Worked Example
4.1.1 Samples of size 25
4.1.2 Other Sample Sizes
4.2 Sampling Techniques
4.2.1 Simple Random Sampling
4.2.2 Systematic Random Sampling
4.2.3 Stratified Random Sampling
4.2.4 Cluster or Area Sampling
4.3 Central Limit Theorem
4.4 Finite Populations
4.5 Confidence Intervals
4.5.1 Light Bulb Example
4.5.2 Formula for Confidence Intervals
4.5.3 Confidence Intervals when Sigma is Unknown
4.6 Estimating Sample Size
4.7 Proportions
4.7.1 Sampling Distribution of p
4.7.2 Confidence Intervals for a Population Proportion
4.7.3 Summary and Assessment
Learning Objectives
• state that when many samples are taken from a population, the values of the
  sample means are not all the same
• give examples of various types of sampling techniques, quoting appropriate
  situations where each would be used
• state the properties of the Central Limit Theorem for sample means
• use the finite population correction factor
• calculate confidence intervals for population means based on a sample mean
  result
• estimate the approximate sample size required for a specified level of accuracy
• describe the distribution of sample proportions
• calculate confidence intervals for population proportions based on a sample
  proportion result
© HERIOT-WATT UNIVERSITY 2003
4.1 A Worked Example
The last topic showed many Normal distribution situations where the focus was on
whole populations. There are some very interesting results that arise when a number of
SAMPLES are taken from a Normal distribution. You will probably have an intuitive idea
of the very useful statistical results that come out of this type of analysis.
To get a feel for the sort of thing that will crop up time and time again in this course,
imagine an office block with a large number of rooms, each accommodating ten workers.
If some statistic were measured for every employee, for example the number of days
absent from work last year, an average could be taken for each room. Because every
room holds the same number of workers, the average absence rate for the office block as
a whole is the average of the averages for the individual rooms.
In other words the population mean (the office block average absence rate) is equal to
the mean of the sample means (the average of the mean values of all the individual
offices taken together). The spread of the results behaves differently. Any extreme
value in the population is "averaged out" when a sample mean is taken, so the spread
of the population is greater than the spread of the sample means.
The following illustration shows what happens when many samples are taken from a
population.
4.1.1 Samples of size 25
These 500 numbers were obtained by measuring some property of every element in a
population that follows a Normal distribution.
53 52 27 48 55 46 51 29 65 43 29 46 54 60 66 56 34 59 43 52 46 67 59 51 57 56 64 49
62 30 51 46 50 32 51 48 48 53 45 56 55 65 57 72 47 60 49 47 51 54 54 52 16 52 62 62
35 44 43 60 49 54 32 58 42 41 49 35 57 64 58 58 37 63 38 42 45 45 50 60 51 38 51 64
52 58 55 60 62 50 60 52 45 43 57 48 47 46 32 58 38 47 49 53 40 47 62 34 48 48 60 48
52 58 69 49 44 61 35 53 29 58 62 40 43 63 50 59 46 32 52 48 45 50 60 65 33 44 50 35
30 41 36 46 65 60 33 67 57 56 42 35 37 56 56 56 46 37 37 35 47 40 53 49 33 32 59 40
51 44 43 49 56 64 44 48 53 42 43 43 43 50 51 49 58 58 58 59 46 52 33 63 54 64 48 52
51 61 28 46 53 41 56 44 61 35 50 43 56 66 47 62 37 63 47 63 55 49 52 50 29 44 41 47
48 47 40 46 43 51 44 61 57 51 43 68 41 53 39 61 62 43 38 65 31 67 61 66 49 49 64 43
43 53 64 41 48 53 45 71 44 57 55 44 51 38 49 46 43 36 43 55 70 47 45 40 38 48 43 61
49 43 59 54 41 55 40 42 42 48 52 50 45 36 33 46 66 52 42 61 47 57 55 43 32 42 46 50
44 38 61 46 54 52 40 50 51 51 51 59 53 47 46 58 64 48 55 55 33 57 39 38 35 56 70 74
65 53 67 39 64 39 48 46 44 44 39 44 47 50 46 48 55 60 43 65 45 30 42 67 56 43 40 49
55 45 57 42 48 33 39 45 54 50 43 58 41 36 46 40 49 53 36 51 52 39 67 53 41 47 53 64
23 54 63 64 62 54 47 48 66 43 42 47 55 51 45 54 36 61 78 67 42 43 54 57 61 60 53 36
55 52 46 36 63 25 56 45 56 52 42 54 55 41 42 28 54 41 43 44 54 54 52 56 53 67 34 40
43 57 43 34 62 56 45 56 54 49 50 48 56 54 48 67 43 59 46 46 39 48 45 42 37 48 42 50
54 47 53 56 50 59 64 44 52 52 62 50 49 42 36 56 52 62 48 56 48 68 39 50
It can be shown that the population mean is 48.5 and the population standard deviation
is 9.6.
Now several different samples, each of size 25, are taken from this population.
Sample 1
60 53 54 56 65 59 43 35 54 53 45 49 48 64 44 40 43 28 44 51 35 55 46 48 48
Sample 2
35 52 44 43 57 53 56 37 52 49 71 47 52 59 52 43 48 25 56 50 52 42 42 56 45
Sample 3
55 43 56 57 57 61 35 37 54 53 45 56 30 46 56 42 65 71 39 56 34 45 52 52 29
Sample 4
37 37 47 48 36 28 61 45 55 54 59 42 61 64 46 65 33 56 54 43 49 48 43 41 62
Sample 5
45 36 43 46 35 48 39 59 70 41 56 58 52 57 62 60 48 63 51 41 56 63 40 34 43
Sample 6
58 46 55 48 39 50 48 43 51 64 42 56 48 54 34 43 49 62 52 29 59 56 52 47 53
Sample 7
57 39 43 52 61 67 42 55 35 43 45 34 55 49 61 64 36 45 48 42 43 40 49 62 61
Sample 8
46 54 49 49 42 55 58 41 52 52 56 41 43 60 52 57 52 45 43 43 60 48 43 32 50
Sample 9
67 54 40 42 54 58 50 56 48 43 37 51 47 43 64 43 45 52 49 43 52 37 64 60 59
Sample 10
46 65 45 52 61 50 55 46 56 56 42 47 56 56 54 52 65 46 56 42 45 53 40 56 42
The mean of each of the samples is shown below.
Sample   Mean
1        48.8
2        48.7
3        49.0
4        48.6
5        49.8
6        49.5
7        49.1
8        48.9
9        50.3
10       51.4
Notice that no two means are exactly the same.
This sampling process is repeated 100 more times and the sample means are given
below.
51.8 46.6 48.7 49.8 48.4 50.8 46.3 49.5 51.2 48.2
47.6 47.7 49.8 47.8 46.8 46.0 49.3 47.8 48.3 46.7
45.5 47.0 46.6 53.4 47.1 49.1 45.3 50.4 47.1 50.8
47.0 49.5 48.2 45.3 44.9 49.0 50.8 45.1 52.1 49.4
51.5 50.1 49.1 50.9 46.8 47.1 45.1 48.2 49.5 50.9
48.3 50.8 47.8 47.0 46.2 49.3 48.0 50.2 48.2 49.5
48.5 47.8 49.0 45.1 45.7 52.8 51.3 50.5 43.8 47.0
50.5 49.0 49.9 47.8 48.0 47.8 47.6 49.0 47.8 50.7
48.3 47.4 48.8 48.3 48.9 45.9 50.7 47.9 49.4 47.6
48.1 48.2 49.0 50.2 47.0 50.4 49.1 46.2 47.4 49.7
Now consider a histogram of these data. It has the following appearance.
[Figure: histogram of the 100 sample means]
This seems to indicate that the sample means follow a Normal distribution.
It can be seen that the mean of the sample means calculates as 48.5 and the standard
deviation of the sample means as 1.8.
The results from the above can now be summarised.
               Number of results   Mean   Standard Deviation
Population     500                 48.5   9.6
Sample means   100                 48.5   1.8
This shows that the mean of the sample means is the same as the population mean,
but the standard deviation of the distribution of sample means is about one fifth of
the population standard deviation. This supports the earlier discussion of absence
rates in offices, where extreme values were "averaged out". The above results can also
be shown with reference to the Normal distribution curve (the jagged edges on the
histogram above can be smoothed out as usual by taking many more samples).
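The experiment above can be repeated on a computer. The following is a minimal sketch only: the 500 listed values are not retyped, so a synthetic Normal population with mean 48.5 and standard deviation 9.6 stands in for them, and the function name `simulate_sample_means` and the seeds are illustrative assumptions.

```python
import random
import statistics


def simulate_sample_means(population, sample_size, num_samples, seed=0):
    """Draw repeated simple random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]


# Assumption: a synthetic stand-in for the 500 population values in the text.
generator = random.Random(1)
population = [generator.gauss(48.5, 9.6) for _ in range(500)]

means = simulate_sample_means(population, sample_size=25, num_samples=100)

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)
print("population:   mean", round(pop_mean, 1), "sd", round(pop_sd, 1))
print("sample means: mean", round(statistics.mean(means), 1),
      "sd", round(statistics.stdev(means), 1))
```

With samples of size 25 the spread of the sample means should come out at roughly one fifth of the population spread, although the exact figures depend on the random draws.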
4.1.2 Other Sample Sizes
Notice that all of the samples taken so far were of size 25. The whole sampling
procedure is now repeated with samples of size 36 and 64 and the results are given
below.
                 Sample Size   Mean   Standard Deviation
Population                     48.5   9.6
First samples    25            48.5   1.8
Second samples   36            48.5   1.6
Third samples    64            48.5   1.1
As the sample size increases, the standard deviation of the sample means reduces. The
second set of samples produces a value that is about one sixth of the population
standard deviation, whilst the value for the third set is about one eighth.
This can all be summarised in an important statistical result called the Central Limit
Theorem.
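As a rough check on the table above, and anticipating the rule stated in Section 4.3 (the standard deviation of the sample means is the population standard deviation divided by the square root of the sample size), the expected values work out as:

$$\frac{9.6}{\sqrt{25}} = 1.92, \qquad \frac{9.6}{\sqrt{36}} = 1.60, \qquad \frac{9.6}{\sqrt{64}} = 1.20$$

These are close to the observed 1.8, 1.6 and 1.1. Small differences are to be expected because only a limited number of samples were drawn.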
4.2 Sampling Techniques
Before considering the Central Limit Theorem in detail, this section looks at some
different ways of gathering sample data. There are many good reasons why a sample
is used instead of a population. Some of them are now listed:
• The sample can save time and money
• Accessing the whole population is sometimes impossible so there is no choice
• Because the research process is sometimes destructive, the sample can save the
  product
Every research study has a target population that consists of the individuals or entities
that are the object of the investigation. The sample is taken from a population list, map,
directory or other source that is being used to represent the population. This list is called
the frame. There are two main types of sampling: random and nonrandom. In random
sampling every unit of the population has the same probability of being selected into
the sample (e.g. in the UK the National Lottery is an example of random sampling).
However this is not the case in nonrandom sampling. Here it might be that the sample
is selected simply because members were in the right place at the right time. Samples
like this are rarely suitable for statistical analysis, so the focus will remain
on random samples. They can be categorised into different types.
4.2.1 Simple Random Sampling
A sampling procedure that assures that each element in the population has an equal
chance of being selected is referred to as simple random sampling. For example, the
names of all the winners of a competition could each be written on a piece of paper and
placed in a drum and then the person who has won the star prize could be pulled out.
Tables of random numbers and statistical computer packages provide alternative, and
possibly easier, methods of identifying the required winner.
4.2.2 Systematic Random Sampling
If a systematic pattern is introduced into random sampling, it is referred to as "systematic
(random) sampling". For instance, suppose the passengers on an aeroplane have numbers
attached to their names ranging from 001 to 500. A random starting point is
chosen, e.g. 037, and then every 10th name is picked thereafter to give a sample of
50 (starting over with 007 after reaching 497). In this sense, this technique is similar to
cluster sampling, since the choice of the first unit determines the remainder.
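The aeroplane example can be sketched in a few lines. This is an illustration only: the function name `systematic_sample` is hypothetical, and the starting point 37 is taken from the text rather than drawn at random (in practice it could be chosen with `random.randint(1, 500)`).

```python
def systematic_sample(population_size, step, start):
    """Return 1-based unit numbers: start, start + step, ..., wrapping past the end."""
    count = population_size // step
    return [(start - 1 + i * step) % population_size + 1 for i in range(count)]


# 500 passengers, every 10th name, starting from passenger 037 as in the text.
passengers = systematic_sample(population_size=500, step=10, start=37)
print(len(passengers), passengers[:3], passengers[46:])
# -> 50 [37, 47, 57] [497, 7, 17, 27]
```

The wrap-around from 497 to 007 is handled by the modulo operation, so every starting point yields exactly 50 distinct passengers.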
There are a number of potential problems with simple and systematic random sampling.
If the population is widely dispersed, it may be extremely costly to reach them.
Alternatively, a current list of the whole population (sampling frame) may not be readily
available. Or perhaps the population itself is not homogeneous and the sub-groups
are very different in size. In such a case, precision can be increased through stratified
sampling.
4.2.3 Stratified Random Sampling
In this random sampling technique, the whole population is first divided into mutually
exclusive subgroups or strata and then units are selected randomly from each
stratum. The strata are based on some predetermined criterion such as geographic
location, size or demographic characteristic. It is important that the strata differ from
one another as much as possible, while the units within each stratum are as alike as
possible. For example, suppose you require an analysis of the
spending patterns of a hotel's guests. It is expected that business customers behave
differently from leisure guests so this defines two groups or strata. In addition, for
this hotel it is known that 70% of the guests tend to be on business, whilst 30% of
the guests are there for leisure purposes. In simple random sampling, there is no
assurance that a sufficient number of leisure travellers would actually be included in
the sample. So if it was decided that 60 respondents from the leisure segment were
required, then 140 business travellers should be questioned for a total of 200 respondents.
This is referred to as "proportionate stratified sampling". Disproportionate sampling is only
undertaken if a particular stratum is very important to the research project but occurs in too
small a percentage to allow for meaningful analysis unless its representation is artificially
boosted. In this technique you oversample and then weight your data to re-establish the
proportions.
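The arithmetic of proportionate allocation in the hotel example can be sketched as follows. The function name `proportionate_allocation` and the stratum labels are assumptions made for illustration.

```python
def proportionate_allocation(total_sample, stratum_shares):
    """Split a total sample across strata in proportion to their population shares."""
    return {
        name: round(total_sample * share)
        for name, share in stratum_shares.items()
    }


# 200 respondents split 70% business, 30% leisure, as in the hotel example.
allocation = proportionate_allocation(200, {"business": 0.70, "leisure": 0.30})
print(allocation)
# -> {'business': 140, 'leisure': 60}
```

With less convenient shares the rounded counts may not add up exactly to the total, in which case a small manual adjustment would be needed.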
4.2.4 Cluster or Area Sampling
Suppose that a survey is to be done in a large town and that the unit of enquiry is
the individual household. Suppose further that the town contains 20,000 households,
all listed on convenient records, and a sample of 200 is needed. A simple random
sample of 200 could well spread over the whole town incurring high costs and much
inconvenience. However it might be easier to concentrate the sample in a few parts
of the town. Now assume the town can be divided into 400 areas with 50 households
in each. In this case, then, it is possible to select at random 4 areas and include all
households in these areas. Note that, unlike stratified sampling, the clusters are thought
of as being typical of the population, rather than subsections.
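The town survey can be sketched in code as well. This is a hedged illustration: areas and households are simply numbered from 1, and the function name `cluster_sample` and the seed are hypothetical.

```python
import random


def cluster_sample(num_areas, households_per_area, areas_to_pick, seed=None):
    """Pick whole areas at random and include every household in each chosen area."""
    rng = random.Random(seed)
    chosen_areas = sorted(rng.sample(range(1, num_areas + 1), areas_to_pick))
    households = [
        (area, household)
        for area in chosen_areas
        for household in range(1, households_per_area + 1)
    ]
    return chosen_areas, households


# 400 areas of 50 households each; 4 areas give the required sample of 200.
areas, households = cluster_sample(400, 50, 4, seed=42)
print(areas, len(households))
```

Only the choice of areas is random; once an area is chosen, all of its households are surveyed, which is what keeps the fieldwork concentrated.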
4.3 Central Limit Theorem
If samples of size n are randomly drawn from a population that has a mean of $\mu$ and
a standard deviation of $\sigma$, the sample means, $\bar{x}$, are approximately normally
distributed for sufficiently large sample sizes ($n \geq 30$) regardless of the shape of the
population distribution. If the population is normally distributed, the sample means are
normally distributed for any size sample.
It can be proved mathematically (and verified by experiment) that
1. The mean of the sample means is the population mean
2. The standard deviation of the sample means is the standard deviation of the
population divided by the square root of the sample size.
These properties were shown earlier by experimentation on a Normal population. But it
is very interesting to note that the theorem applies to any type of population as long as
the sample size is sufficiently large. This is another reason why the Normal distribution
is so important in statistics.
It was also shown earlier that Normal distribution problems can be analysed with
reference to statistical tables and using the formula
$$z = \frac{x - \mu}{\sigma}.$$
This formula can now be adapted to deal with sample means and is given as
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$$
However, since it is usually not possible to take a large number of different samples,
the value $\mu_{\bar{x}}$ is virtually impossible to calculate over a realistic time period.
Fortunately, though, it has been shown that $\mu_{\bar{x}}$ equals the population mean $\mu$.
Similarly, calculation of $\sigma_{\bar{x}}$ would be very time consuming, but again it has
been shown that this equals $\sigma / \sqrt{n}$ (n is the sample size).
The equation now becomes
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
This is known as the z formula for sample means.
Example The mean amount of cash spent per customer at a tyre and exhaust centre
is 78.20 with a standard deviation of 7.10. If a random sample of 40 customers is taken,
what is the probability that the total amount spent by these 40 people is more than
3200?
The average amount spent in the sample is 3200/40 = 80.
Use the formula
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
This gives
$$z = \frac{80 - 78.20}{7.10 / \sqrt{40}} = 1.60$$
Looking up tables gives a value of 0.0548.
This is the value of the probability that these 40 customers spend a total amount of more
than 3200.
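The tyre and exhaust centre calculation can be checked with Python's standard library, which offers `statistics.NormalDist` in place of printed tables. This is a sketch: the function name `z_for_sample_mean` is an assumption, and the figures are those of the example.

```python
from math import sqrt
from statistics import NormalDist


def z_for_sample_mean(sample_mean, pop_mean, pop_sd, n):
    """z formula for sample means: (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))


z = z_for_sample_mean(3200 / 40, 78.20, 7.10, 40)

# Tables are read at z rounded to two decimal places; the library can use z as is.
p_table = 1 - NormalDist().cdf(round(z, 2))
p_exact = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p_table, 4), round(p_exact, 4))
```

Rounding z to 1.60 before the look-up reproduces the table value of 0.0548; using the unrounded z gives a slightly smaller probability.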
Example In a production process, bags of sugar are produced that supposedly
contain 1.00kg of the commodity. The standard deviation is obtained by examining the
equipment over a long period of time and is found to be 0.09kg. What is the probability
that a sample of 36 bags of sugar has a mean weight of between 0.98kg and 1.01kg?
The probability of a sample mean weight being greater than 1.01 is calculated by
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{1.01 - 1.00}{0.09 / \sqrt{36}} = 0.67$$
So, from tables, the required probability is 0.2514.
Similarly, the probability of a sample mean weight being less than 0.98 is calculated by
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{0.98 - 1.00}{0.09 / \sqrt{36}} = -1.33$$
So the required probability is 0.0918.
Therefore, the probability of a sample mean weight being between 0.98 and 1.01 is
given by 1 - (0.2514 + 0.0918) = 0.6568
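The same answer can be reached in one step by describing the distribution of sample means directly. A minimal sketch, assuming the figures of the sugar example and the variable names chosen here:

```python
from math import sqrt
from statistics import NormalDist

# Distribution of sample means: mean 1.00 kg, standard deviation 0.09 / sqrt(36) kg.
sampling_dist = NormalDist(mu=1.00, sigma=0.09 / sqrt(36))

# Probability that the sample mean lies between 0.98 kg and 1.01 kg.
p_between = sampling_dist.cdf(1.01) - sampling_dist.cdf(0.98)
print(round(p_between, 3))
```

The result differs from 0.6568 only in the later decimal places, because the tables are read at z values rounded to 0.67 and -1.33.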
Using the Central Limit Theorem
This is also an online activity if you prefer to take it.
The average number of times a child is taken to visit their GP between ages five and
ten in a certain town is discovered to be 8.1 with a standard deviation of 4.2. A random
sample of 49 appropriately aged children is taken from the town. The probability that the
sample mean number of visits to the GP (in the appropriate time period)
is less than 7 is required.
Q1: Use the appropriate formula to find the required probability.
4.4 Finite Populations
The sampling procedures so far have been used on populations that are assumed to
be infinite or at least extremely large. In the case of a finite population an adjustment
can be made to the z formula for sample means. The adjustment is called the finite
correction factor
$$\sqrt{\frac{N - n}{N - 1}}$$
and it operates on the standard deviation of sample means.
N is the population size, n is the sample size.
The formula now becomes
$$z = \frac{\bar{x} - \mu}{\dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}}$$
Note that in cases where the population is large the correction factor will make little
difference to the calculation of z. For example, if N = 20 000 and n is 30, the value of
$\sqrt{(N - n)/(N - 1)}$ is 0.9993, which is almost 1. Most of the examples considered in
this course can be assumed to come from infinite populations, so unless mentioned
otherwise, it will not be necessary to use this correction factor.
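To see how quickly the correction factor approaches 1, it can be evaluated for a few population sizes. This is a sketch assuming the standard form of the factor, the square root of (N - n)/(N - 1); the function names are hypothetical.

```python
from math import sqrt


def finite_population_correction(N, n):
    """Correction factor sqrt((N - n) / (N - 1)) for a sample of n from N units."""
    return sqrt((N - n) / (N - 1))


def z_finite(sample_mean, pop_mean, pop_sd, n, N):
    """z formula for sample means with the finite population correction applied."""
    standard_error = pop_sd / sqrt(n) * finite_population_correction(N, n)
    return (sample_mean - pop_mean) / standard_error


for N in (100, 1000, 20000):
    print(N, round(finite_population_correction(N, 30), 4))
```

For a sample of 30 the factor is noticeably below 1 when N = 100 but is just under 1 when N = 20 000, which is why the correction is usually ignored for large populations.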
4.5 Confidence Intervals
The z formula for sample means can be manipulated and then used for the very useful
purpose of inferring parameters of a population. As has been mentioned earlier in this
course, it is often very difficult or impossible to calculate population means or standard
deviations, but the process of working them out can be reasonably straightforward for
a sample. This theory then allows the values obtained from the sample to be used to
give upper and lower bounds of where most of the sample values would lie. This gives
a confidence interval of where the population parameters are expected to lie.
4.5.1 Light Bulb Example
A company produces light bulbs and wishes to estimate the average lifetime of a bulb.
It takes a random sample of 60 bulbs and tests them by using each one continuously
until it burns out (this is clearly an example where measuring the appropriate property
of the whole population would be no use as all the bulbs would then be ruined!). It is
known that the standard deviation of the population is 140 hours.
After experimentation it was found that the sample produced a mean lifetime of 1456
hours. This is the only figure available so it is used to give a point estimate of the
population mean, $\mu$. However, as has been previously discussed, if another sample were
taken there is every possibility that it would give some result other than 1456. But, by
using the Central Limit Theorem an interval estimate can be made which gives a range of
values within which the population mean will lie with a certain confidence.
This can be shown with reference to a Normal distribution curve. The population mean
has been estimated as 1456 so the diagram has the following appearance.
[Figure: Normal distribution curve of sample means centred on 1456]
The curve shows the distribution of sample means. A standard rule in statistics is to find
lower and upper bounds between which 95% of these means would lie.
Statistical tables reveal that a value of z = 1.96 has a probability of 0.025 (2.5%) of
being exceeded, and by symmetry there is a probability of 0.025 of z being less than
-1.96. This gives 95% of values of z between -1.96 and 1.96. The value of 1.96 is
obtained either by using Normal distribution tables in reverse and looking through the
body of them for 0.025, or by using tables specially designed to find critical values of z.
Now, it is known that
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}.$$
Substituting the sample values with z = 1.96 gives
$$1.96 = \frac{1456 - \mu}{140 / \sqrt{60}}$$
So for the lower boundary,
$$\mu = 1456 - 1.96 \times \frac{140}{\sqrt{60}} = 1420.58$$
And for the upper boundary (taking z = -1.96),
$$\mu = 1456 + 1.96 \times \frac{140}{\sqrt{60}} = 1491.42$$
The 95% confidence interval for the population mean, $\mu$, is therefore [1420.58,
1491.42].
What this says in colloquial terms is that we can be 95% confident that the mean lifetime
of all bulbs of this type lies between 1420.58 and 1491.42 hours. More precisely, if many
samples of 60 bulbs were taken and an interval calculated from each in this way, about
95% of those intervals would contain the population mean.
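The light bulb interval can be reproduced with a short function. This is a sketch under the example's assumptions (known population standard deviation, Normal sampling distribution); the name `mean_confidence_interval` is hypothetical, and the critical value comes from `NormalDist.inv_cdf` rather than printed tables.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Return (lower, upper) bounds: sample_mean -/+ z * sd / sqrt(n)."""
    # Two-sided interval: half of (1 - confidence) sits in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = mean_confidence_interval(1456, 140, 60)
print(round(low, 2), round(high, 2))
# -> 1420.58 1491.42
```

Changing `confidence` to 0.99 widens the interval, which illustrates the trade-off between confidence and precision.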
4.5.2 Formula for Confidence Intervals
To have 100% confidence that the population mean falls between two limits is virtually
impossible. The researcher must select a desired level of confidence; in the last example
it was 95% but other common values are 90%, 98%, 99% and 99.9%. You may well be
wondering why the highest possible level is not always selected. The answer is that
there is always a "trade off". As the level of confidence increases, so does the range of
values for the population mean and so the actual value of the population mean is not so
apparent as with a smaller confidence interval. In general the confidence interval for the
population mean is given by the formula:
$$\bar{x} - z \frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{x} + z \frac{\sigma}{\sqrt{n}}$$
Values of z are obtained in tables, for example for the 98% confidence interval, alpha
is equal to 0.01 (1% on each side), giving a z value of 2.33.
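The critical values for the common confidence levels can be generated rather than looked up. A brief sketch, with the helper name `z_critical` chosen here for illustration:

```python
from statistics import NormalDist


def z_critical(confidence):
    """z value leaving (1 - confidence) / 2 in each tail of the standard Normal."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


for level in (0.90, 0.95, 0.98, 0.99, 0.999):
    print(f"{level:.1%}: z = {z_critical(level):.3f}")
```

Printed tables usually quote these values to two decimal places, so small differences from the figures used in the text (for example 1.64 for 90%) are to be expected.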
4.5.3 Confidence Intervals when Sigma is Unknown
In the examples considered so far, the population standard deviation has always been
known. This may seem strange, especially if the population mean is unknown.
However, it is possible in some circumstances to obtain the population standard
deviation by looking at past records and so it is not impossible to know it and not
the mean. In many cases, though, the population standard deviation will have to be
estimated. In fact, when sample sizes are large ($n \geq 30$) the sample standard deviation,
s, (which can easily be calculated) provides a very good estimate for the population
standard deviation. It can therefore be used in the formula to calculate confidence
intervals for the population mean. The formula can be modified as follows:
$$\bar{x} - z \frac{s}{\sqrt{n}} \leq \mu \leq \bar{x} + z \frac{s}{\sqrt{n}}$$
Beware not to use this formula for small samples when the population standard deviation
is unknown, even if the population is Normally distributed. There are other methods for
dealing with such samples (of size $n < 30$) and these will be described in Topic 4.
Confidence Interval Activity
Q2: A health association is interested in estimating the average number of days
women stay in a local hospital after having a baby. A random sample of 36 women
who had babies at the hospital recently was taken and the number of days (rounded
to the nearest day) each of them spent in hospital after childbirth is given in the table
below.
3 3 3 2 3 3
3 1 5 4 3 5
4 4 3 1 5 4
3 3 2 6 2 3
2 4 4 3 3 5
5 2 3 4 2 4
Use these data to construct a 99% confidence interval to estimate the average maternity
stay for all women who have babies in the hospital.
4.6 Estimating Sample Size
The examples considered so far in this Topic have always started by specifying a sample
size. But in many cases the researcher is going to have to choose the number of
elements that make up his or her sample. The bigger the sample, the more representative of the population the result will be, but there is a cost. Researchers have to
work to a budget and do not want to take an unnecessarily large sample.
The z formula for sample means that has been well used in this Topic can help to decide
on an appropriate sample size. Let
$$E = \bar{x} - \mu$$
and refer to it as the error of estimation.
Then the formula becomes
$$z = \frac{E}{\sigma / \sqrt{n}}.$$
Solving for n produces the sample size, i.e.
$$n = \frac{z^2 \sigma^2}{E^2}$$
Example
It is desired to find the average age of the residents in a village. It is known that the oldest
resident is 85 and the youngest 1. How many people should be questioned to obtain a
result with an error of estimation of 3 years? The researcher wants to be 90% confident
of his results. The problem is the lack of any knowledge of the standard deviation. This
may well be able to be estimated from similar villages, but in the absence of any other
information, an estimate can be made using the formula:
$$\sigma \approx \frac{\text{range}}{4}$$
Here the range is 85 - 1 = 84 and so $\sigma$ can be estimated as 21.
The value of z for 90% is 1.64 and E = 3 so using the formula, n = 131.8.
So at least 132 people should be questioned.
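The village example can be followed step by step in code. This is a sketch using the figures from the text (z = 1.64, range/4 as the estimate of the standard deviation); the function names are hypothetical.

```python
from math import ceil


def estimate_sd_from_range(smallest, largest):
    """Rough estimate of the standard deviation as one quarter of the range."""
    return (largest - smallest) / 4


def required_sample_size(z, sd, error):
    """Sample size from n = z^2 * sd^2 / E^2, before rounding up."""
    return (z * sd / error) ** 2


sd = estimate_sd_from_range(1, 85)
n_raw = required_sample_size(z=1.64, sd=sd, error=3)
n = ceil(n_raw)  # always round up so the error target is still met
print(sd, round(n_raw, 1), n)
# -> 21.0 131.8 132
```

Halving the error of estimation to 1.5 years would roughly quadruple the required sample size, since E is squared in the denominator.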
4.7 Proportions
The population parameter that was estimated in the last few examples has been the
mean; but this is not the only thing that can be calculated from a sample. Another
important concept in statistics is the proportion of elements in a sample that satisfy
some criterion, for example, the proportion of people over 60 in a social club, or the
proportion of left handed children in a class at school. Using the values obtained in the
sample it is possible to infer something about the proportion of the population that has
the same characteristic as is being examined in the sample. As in the situation involving
the mean, confidence intervals will be obtained.
4.7.1 Sampling Distribution of p
The symbol p will be used to represent a sample proportion and $\pi$ a population
proportion (of course this is nothing to do with the one used in the geometry of a
circle).
Just like in the situation for the mean, if a certain property is required for the population,
for example, the proportion of people who own a car in a particular town, then this will
usually have to be estimated by taking a sample. And, again, in a similar way to the
mean, it is very likely that if a number of samples were taken, they would all give slightly
different values of p.
The importance of the Central Limit Theorem is now highlighted by the fact that this
distribution of p also follows a Normal curve in most cases. In fact the theorem applies
if $n\pi > 5$ and $n(1 - \pi) > 5$ (n is the sample size).
The Central Limit Theorem also reveals that in a distribution of proportions satisfying the
above, the mean of all the sample proportions is equal to the population proportion, $\pi$,
whilst the standard deviation of the proportions is $\sqrt{\pi(1 - \pi)/n}$.
This leads to the z formula for sample proportions given below.
$$z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}$$
Example It is thought that 25% of the population uses fabric conditioner when they
do their laundry. 60 people are questioned about their laundry habits. What is the
probability that more than 18 say that they use fabric conditioner as well as a washing
powder?
Solution Note that $n\pi = 15$ and $n(1 - \pi) = 45$ so the sample proportions follow a
Normal distribution that is symmetrical about $\pi = 0.25$ (this is the population proportion
equivalent to 25%). The standard deviation of sample proportions is given by
$$\sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.25 \times 0.75}{60}} = 0.0559$$
Thus s.d. = 0.0559.
18 people out of a sample of 60 gives a value of p = 18/60 = 0.3. If more than 18 people
use fabric conditioner, then this corresponds to the probability that p > 0.3. This is shown on
the curve below.
[Figure: Normal curve of sample proportions centred on 0.25, with the area to the right of p = 0.3 shaded]
The z formula for sample proportions can be used to calculate a value for z, which can
then be looked up on the tables.
$$z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} = \frac{0.3 - 0.25}{0.0559} = 0.89$$
The required probability is therefore 0.1867 (from tables).
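The fabric conditioner calculation can be checked in the same way as the examples for sample means. A minimal sketch, with illustrative function and variable names (`pop_prop` stands for the population proportion):

```python
from math import sqrt
from statistics import NormalDist


def z_for_sample_proportion(p, pop_prop, n):
    """z formula for sample proportions; returns (z, standard deviation of p)."""
    sd = sqrt(pop_prop * (1 - pop_prop) / n)
    return (p - pop_prop) / sd, sd


n, pop_prop = 60, 0.25
# Check that the Normal approximation is reasonable before using it.
assert n * pop_prop > 5 and n * (1 - pop_prop) > 5

z, sd = z_for_sample_proportion(18 / 60, pop_prop, n)
p_table = 1 - NormalDist().cdf(round(z, 2))  # tables are read at z = 0.89
print(round(sd, 4), round(z, 2), round(p_table, 4))
# -> 0.0559 0.89 0.1867
```

Using the unrounded z instead of 0.89 changes the probability only in the third decimal place.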
4.7.2 Confidence Intervals for a Population Proportion
The z formula for sample proportions can be used in the same way as the z formula for
sample means to allow for the calculation of upper and lower bounds for a population
proportion; in other words to create a confidence interval for the population proportion.
The point estimate of the population proportion is chosen to equal the sample
proportion.
Notice that in the formula for z, the value $\pi$ occurs in both the numerator and
denominator leading to difficulties in solving for $\pi$. Because of this, it is convenient to
replace $\pi$ by p in the denominator. Note that this is only done in the case of estimating
confidence intervals.
The confidence interval to estimate $\pi$ is given by
$$p - z \sqrt{\frac{p(1 - p)}{n}} \leq \pi \leq p + z \sqrt{\frac{p(1 - p)}{n}}$$
Example
A clothing company produces men’s jeans. The jeans are made and then sold with
either a regular cut or a boot cut. In an effort to estimate the proportion of their men’s
jeans market that is for boot cut jeans, an analyst takes a random sample of 212 sales
and finds that 34 were for boot cut. Construct a 90% confidence interval to estimate the
proportion of the population who prefer boot cut jeans.
Solution
The sample proportion who prefer boot cut jeans is 34/212 = 0.16 - this is the point
estimate of the population proportion. The z value for 90% is 1.64. The lower bound is
given by
$$0.16 - 1.64 \sqrt{\frac{0.16 \times 0.84}{212}} = 0.12$$
The upper bound is given by
$$0.16 + 1.64 \sqrt{\frac{0.16 \times 0.84}{212}} = 0.20$$
The 90% confidence interval is therefore [0.12, 0.20].
In other words, with this level of confidence, between 12% and 20% of the population
prefer boot cut jeans. Note that this calculation is valid because $np = 212 \times 0.16 = 33.92$
and $n(1 - p) = 212 \times 0.84 = 178.08$, both of which satisfy the requirement of being
greater than 5 (the point estimate of $\pi$ is used here).
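The boot cut interval can be computed directly from the counts. This is a sketch that uses the text's z value of 1.64 and the sample proportion in place of the population proportion in the standard deviation; the function name is an assumption.

```python
from math import sqrt


def proportion_confidence_interval(successes, n, z):
    """Return (lower, upper) bounds: p -/+ z * sqrt(p * (1 - p) / n)."""
    p = successes / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


low, high = proportion_confidence_interval(successes=34, n=212, z=1.64)
print(round(low, 2), round(high, 2))
# -> 0.12 0.2
```

The function keeps the unrounded sample proportion 34/212 rather than 0.16, which does not change the bounds at two decimal places.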
4.7.3 Summary and Assessment
At this stage you should be able to:
• state that when many samples are taken from a population, the values of the
  sample means are not all the same
• give examples of various types of sampling techniques
• quote appropriate situations where simple random sampling would be used
• quote appropriate situations where systematic random sampling would be used
• quote appropriate situations where stratified random sampling would be used
• quote appropriate situations where cluster sampling would be used
• state the properties of the Central Limit Theorem for sample means
• use the finite population correction factor
• calculate confidence intervals for population means based on a sample mean
  result
• estimate the approximate sample size required for a specified level of accuracy
• describe the distribution of sample proportions
• calculate confidence intervals for population proportions based on a sample
  proportion result
Answers to questions and activities
4 Sampling and Confidence Intervals
Using the Central Limit Theorem
Q1:
The formula is
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Using $\mu$ = 8.1, n = 49, $\bar{x}$ = 7, $\sigma$ = 4.2 in the expression gives z = -1.83
Finally, look up the tables and select the required probability from the list to give 0.0336
Confidence Interval Activity
Q2:
$\bar{x} = 3.31$, $s = 1.17$
The equation for 99% confidence interval is
$$\bar{x} - z \frac{s}{\sqrt{n}} \leq \mu \leq \bar{x} + z \frac{s}{\sqrt{n}}$$
where z = 2.58
Thus the confidence interval is
$$3.31 - 2.58 \times \frac{1.17}{\sqrt{36}} \leq \mu \leq 3.31 + 2.58 \times \frac{1.17}{\sqrt{36}}$$
or [2.81, 3.81]. In other words, between 3 and 4 days.
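The answer to Q2 can be checked from the raw data. This is a sketch only: it uses the 36 hospital stays listed in the activity and the z value of 2.58, and the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# The 36 maternity stays (days) listed in the Confidence Interval Activity.
stays = [3, 3, 3, 2, 3, 3, 3, 1, 5, 4, 3, 5,
         4, 4, 3, 1, 5, 4, 3, 3, 2, 6, 2, 3,
         2, 4, 4, 3, 3, 5, 5, 2, 3, 4, 2, 4]

x_bar = mean(stays)
s = stdev(stays)  # sample standard deviation (divisor n - 1)
margin = 2.58 * s / sqrt(len(stays))
low, high = x_bar - margin, x_bar + margin
print(round(x_bar, 2), round(s, 2), round(low, 2), round(high, 2))
```

Working from unrounded values gives bounds that agree with [2.81, 3.81] to within about 0.01; the small difference arises because the worked answer rounds the mean and standard deviation before substituting them.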