BSc/HND IETM Week 8 - Some Probability Distributions
The aim
When we looked at the histogram a few weeks ago, we were looking at frequency
distributions. These showed how occurrences of a particular value (or a particular range of
values) of some variable (say x) are distributed across the total range of values which x can
adopt. It is equally possible (and very easy!) to convert such frequency distributions into
probability distributions, such that the probability of encountering some particular value (or
range of values) of x is plotted on the vertical axis, rather than the number of occurrences of
that value of x. There are a few standard forms of such distributions, which make analysis
rather easy - so long as the data really do fit the chosen form. We shall look at two of these
standard forms, the normal and the negative exponential distributions.
Probability distributions from frequency distributions
Say that our previously-mentioned (and, sadly, hypothetical) optional unit for your course,
‘Flower Arranging for Engineers’, becomes extremely popular. In fact, it becomes so popular
that it is studied by 208 students, from all the various BSc courses in the School. In an effort
to analyse the performance of the students, so as to determine if any improvements to the unit
are required, we might decide to plot a histogram of the final marks obtained. As we know,
this is a frequency distribution, and might be obtained from the following summary of the
students’ scores, as shown:
Mark Scored (%)   Frequency (No. of students)
0-9.9              1
10-19.9            4
20-29.9            8
30-39.9           17
40-49.9           47
50-59.9           53
60-69.9           39
70-79.9           25
80-89.9           11
90-100             3
[Histogram: Mark (per cent) on the horizontal axis (0 to 100), Frequency (No. of students) on the vertical; bar heights 1, 4, 8, 17, 47, 53, 39, 25, 11, 3.]
Frequency polygons
The first step in the conversion is to change from the histogram to what is called a frequency
polygon. This is simply a line graph, joining the centres of each of the chosen data intervals.
At the ends, our frequency polygon reaches the zero axis as shown, since no student can
obtain less than zero or more than 100 per cent. In situations where this doesn’t apply, it is
conventional to terminate the polygon on the zero axis, halfway through the next interval.
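The polygon construction is easy to sketch in code (a minimal illustration using the first few bands of the marks data above; the variable names are illustrative, not from the notes):

```python
# Frequency polygon points: plot each frequency at the centre of its band.
# (Zero-height end points would be added half an interval beyond each end.)
bins = [(0, 10), (10, 20), (20, 30)]     # first few 10%-wide mark bands
freqs = [1, 4, 8]                        # frequencies for those bands

points = [((lo + hi) / 2, f) for (lo, hi), f in zip(bins, freqs)]
print(points)   # [(5.0, 1), (15.0, 4), (25.0, 8)]
```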
c:\ken\lects\ietm8.doc
1
5/6/2017
[Frequency polygon: Mark (per cent) vs Frequency (No. of students); a line graph joining points at the band centres with heights 1, 4, 8, 17, 47, 53, 39, 25, 11, 3, returning to zero at each end of the range.]
Over to you:
Sketch such a frequency polygon for the histogram of ages of the population, which is in the
week 5 notes. Where the histogram has unequal intervals, the procedure is as shown below
(the histogram bars and the polygon lines are not normally shown on the same plot). For bars
of normal width (e.g. the first three bars in the example below), simply join up the centre
values as above. For other bars, the frequency polygon must be drawn so that the line passes
through the mid-point of the exposed side of the histogram bar (points ‘A’, ‘B’ and ‘C’
below), so that the shaded areas ‘gained’ and ‘lost’ automatically cancel out and the total area
under the plot stays the same (it is the area of the histogram bars which matters).
[Sketch: a histogram with unequal intervals; the first three bars are normal width, and the polygon passes through points ‘A’, ‘B’ and ‘C’ on the mid-points of the exposed sides of the wider bars.]
Probability distributions:
It is very easy to obtain probability distributions from diagrams such as those above. All that
is necessary is to divide each frequency by the total number of (in this case) students, to
obtain the probability of any individual student, selected at random, obtaining a mark in a
particular range. For example, to convert the histogram on page 1, or the frequency polygon
on page 2, into probability distributions, simply divide every number on the vertical axis (and
therefore also the numbers written on the plots) by 208. Thus, the vertical axes would now be
calibrated in probabilities from zero to 53/208 = 0.255. The probability of any given student
obtaining a mark in the range 40 to 49.9 per cent will be 47/208 = 0.226. The probability
of a student scoring 90 per cent or more will be 3/208 = 0.0144, etc.
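This conversion is a one-liner in practice. A minimal sketch using the page 1 frequencies:

```python
# Convert the grouped frequencies into a probability distribution by
# dividing each class frequency by the total number of students.
freqs = [1, 4, 8, 17, 47, 53, 39, 25, 11, 3]   # students per 10%-wide band
total = sum(freqs)                              # 208 students

probs = [f / total for f in freqs]

print(round(max(probs), 3))    # 0.255 (the 50-59.9 band, 53/208)
print(round(probs[4], 3))      # 0.226 (the 40-49.9 band, 47/208)
print(round(sum(probs), 6))    # 1.0 - the probabilities sum to one
```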
Over to you:
Calculate the remaining probabilities for the various marks ranges, and add them up - what do
you discover?
The normal distribution
It is not very surprising that the marks distribution (frequency or probability) looks like the
diagrams above. In a fair examination, taken by a large number of students, we would expect
that only a few students would obtain either abysmally low marks or astronomically high
marks. We would expect the majority of marks to be ‘somewhere in the middle’, with a ‘tail’
at both the low and the high ends of the range. This is what we see above.
Several real-life situations fit this general form of distribution, where it is most likely that
results will be clustered around the centre of some range, with outlying values tailing off
towards the ends of the range. Wisniewski, in his ‘Foundation’ text, uses an example based
on the distributions of the weights of breakfast cereal packed by machines into boxes. There
should always ideally be the stated amount in a box but, inevitably, some boxes will be
lighter, and some heavier. There will be the odd ‘rogue’ boxes a long way from the mean.
To make it easier to cope with such situations, they are often assumed to fit a standardised
probability distribution, called the normal distribution. By doing this, it is possible to use
standard printed tables to make predictions such as (for example), how many students would
be expected to score less than 40 per cent? To allow standard tables to be used, we need to
assume a certain fixed shape of probability distribution, and we also need to define it in terms
of mean and standard deviation. We cannot define it in terms of actual data values (e.g.
examination marks, or weight of cereal in a box), otherwise we would need a different set of
tables for every new problem.
The normal distribution curve is actually defined by a rather unpleasant formula (but we don’t
need to use it, as we are going to use tables which have been derived from it by someone
else). If the variable in which we are interested is x (e.g. a mark in per cent, or the weight of
cereal in a box in kg), the mean value of x is x̄, and the standard deviation of the data set is
σx, then:

P(x) = [1 / √(2π σx²)] · e^(−(1/2) · ((x − x̄)/σx)²)
The resulting plot of P(x) as x varies is a ‘bell-shaped’ curve, as shown below.
Notes:
1) Firstly, notice that the horizontal axis has been plotted in terms of a normalised variable z,
and not in units of x. This follows from our desire to get a distribution whose results are
independent of the data units. The normalisation is very easy to do. Firstly we subtract the
mean value x̄ from all the x values we might have plotted along the horizontal axis.
This has the effect of replacing the mean value with zero, and therefore of shifting the
vertical axis to that value, as shown. Next, we divide all the resulting values by the
standard deviation of the data set, so that the horizontal axis actually becomes calibrated in
‘standard deviations’ either side of the mean. So, we see that z = (x − x̄)/σx.
[Plot: the ‘bell-shaped’ normal curve; vertical axis P(x) from 0 to 0.4, horizontal axis z = number of standard deviations of x from its mean value, from −4 to +4, with z = 0 at the mean of x.]
2) The curve above is therefore effectively plotted for a data set with a mean value of x̄ = 0
and a standard deviation of σx = 1. If you put these values into the nasty equation of the
normal distribution, together with the particular value x = 0 (the mean), you will find that
P(0) = 0.399, as shown at the centre of the curve. If you make x very large (either
positively or negatively) with respect to the mean value, then the exponential term tends
to zero, giving P(x very much greater or smaller than the mean) ≈ 0, also agreeing with the
plot and therefore tending to confirm its correctness.
3) If you estimate the area under the curve (by crudely ‘counting squares’, for example) you
will find that it is equal to 1. Therefore, the probability that a value of x will fall
somewhere under the curve is 1. Actually, the area shown in the diagram (and hence the
probability) is very slightly less than 1, because the plot continues in both directions,
getting closer and closer to the horizontal axis all the time.
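The normalisation in note (1) can be sketched as a tiny helper function. Assuming a mean of 55 and a standard deviation of 15 (the figures used in the worked example further below):

```python
# z-value: number of standard deviations a data value lies from the mean.
def z_score(x, mean, sd):
    return (x - mean) / sd

print(z_score(40, 55, 15))   # -1.0 -> one S.D. below the mean
print(z_score(70, 55, 15))   # 1.0 -> one S.D. above the mean
```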
Estimating probabilities from the normal distribution curve
Point (3) above, tells us how we can use this curve. If we know the mean and standard
deviation of our x values (our data), we can ask questions about various areas of interest
under the normal distribution curve. These will be the probabilities we require, and can be
looked up from tables. Any text book statistics section dealing with the normal distribution
will contain such tables. An approximate version calculated by EXCEL is given below.
Example
Say that a large set of examination results has a mean of 55 per cent, and a standard deviation
of 15 per cent. How many students would we expect to fail the examination (if we define a
failure as obtaining less than 40 per cent), and how many students would we expect to get a
first-class result (defined as obtaining 70 per cent or more)?
Firstly, we must remember that we are assuming that our data are normally-distributed. If
they are not, then the results will be approximate only. One indication that a data set might at
least stand a chance of being normally-distributed is if the mean and median are the same
value (i.e. there should be zero skew, as the normal distribution is symmetrical).
SD from mean    Area        SD from mean    Area
2.00            0.0227      0.95            0.1710
1.95            0.0256      0.90            0.1840
1.90            0.0287      0.85            0.1976
1.85            0.0321      0.80            0.2118
1.80            0.0359      0.75            0.2266
1.75            0.0400      0.70            0.2419
1.70            0.0445      0.65            0.2578
1.65            0.0495      0.60            0.2742
1.60            0.0548      0.55            0.2911
1.55            0.0606      0.50            0.3085
1.50            0.0668      0.45            0.3263
1.45            0.0735      0.40            0.3446
1.40            0.0807      0.35            0.3632
1.35            0.0885      0.30            0.3821
1.30            0.0968      0.25            0.4013
1.25            0.1056      0.20            0.4207
1.20            0.1150      0.15            0.4404
1.15            0.1250      0.10            0.4602
1.10            0.1356      0.05            0.4801
1.05            0.1468      0.00            0.5000
1.00            0.1586
Assuming a normally-distributed result, the bell-shaped curve, with our values of interest,
appears below. Since the horizontal axis is calibrated in terms of z = ‘standard deviations’
from the mean, these are worked out using the mean and standard deviation in the formula in
Note (1) above, as follows:
z40 = (40 - 55)/15 = -1 (in other words, 40 per cent is 1 standard deviation below the mean)
z70 = (70 - 55)/15 = +1 (in other words, 70 per cent is 1 standard deviation above the mean)
The probability tables actually give us the shaded areas of the plot directly, so all we need do
is to look them up. For the shaded area on the right, the z value is 1.00. From the tables, the
corresponding probability is 0.1587. This means that we might expect 15.87 per cent of
students to get ‘firsts’. The area at the other end of the curve looks to be a problem, as z = -1,
but there are no negative values for z in the table. However, because the curve is
symmetrical, we only need positive values. The probability of a student falling into the area
to the left of z = -1 is identical to that of him or her falling into the area to the right of
z = +1. The probability of ‘failure’ is therefore also 0.1587.
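These table values can also be checked without tables: the upper-tail area of the standard normal curve follows from the error function in Python's standard maths library. A sketch (not part of the original notes):

```python
import math

# P(Z > z) for the standard normal distribution, via the error function.
def upper_tail(z):
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

p_first = upper_tail((70 - 55) / 15)   # area to the right of z = +1
p_fail = upper_tail((55 - 40) / 15)    # by symmetry, area to the left of z = -1

print(round(p_first, 4))   # 0.1587
print(round(p_fail, 4))    # 0.1587
```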
[Plot: the bell curve P(x) with both tails shaded; the left shaded area (z = −1, i.e. x = 40 per cent) is the probability that a student fails, and the right shaded area (z = +1, i.e. x = 70 per cent) is the probability that a student gets a ‘first’; z = 0 corresponds to x = 55 per cent.]
Note that we could also find the probability that a student will score less than 70 per cent
(say). The tables give us the area to the right of 70 per cent (z = 1, remember) as 0.1587. The
total area under the curve is 1, so the area to the left of z = 1 is 1 - 0.1587 = 0.8413 (so we
would expect 84.13 per cent of students not to get a first-class result.)
Over to you:
Assuming the examination results data given on page 1 to be normally-distributed, what is the
probability of a student obtaining a first-class result? What is the probability of a student
obtaining more than 40 per cent? Discuss the differences between the page 1 data set and a
normal distribution. Note that this data set has a different mean and standard deviation
from that just used in our example, so you’ll need to calculate them. We have not yet
discovered how to do this for data grouped into classes, rather than a simple list of data
points. Here’s how. From the frequency distribution plots on pages 1 or 2, call the frequency
values f1, f2, ..., fn where n is the number of classes (or ‘bins’) chosen for the plot. So, in
our case, n = 10, f1 = 1, f2 = 4, f3 = 8, etc. Next, work out the mid-points of the data classes
(‘bins’) used, and call them m1, m2, ..., mn. So, in our case, these m values are 5 per cent, 15
per cent, 25 per cent, etc. The mean and standard deviation of the distributed data are then
taken to be given by:
x̄ = (Σ fi mi) / (Σ fi)        and        σx = √[ (Σ fi mi²) / (Σ fi) − x̄² ]

(each sum running over i = 1 to n).
A tabular approach will be easiest, as usual.
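The same tabular calculation is easy to reproduce in code. A sketch using the page 1 frequencies and mid-points:

```python
import math

# Grouped mean and standard deviation from class frequencies f and mid-points m.
f = [1, 4, 8, 17, 47, 53, 39, 25, 11, 3]
m = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]

n = sum(f)                                           # 208 students
mean = sum(fi * mi for fi, mi in zip(f, m)) / n
sd = math.sqrt(sum(fi * mi * mi for fi, mi in zip(f, m)) / n - mean * mean)

print(round(mean, 2))   # 55.38 per cent
print(round(sd, 2))     # 16.43 (the solutions get 16.45 by rounding the mean first)
```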
The negative exponential distribution
To cover a wider range of real-world situations, more ‘standardised’ probability distributions
are required. The other one we shall briefly look at is the negative-exponential distribution.
This is also sometimes called a ‘failure-rate’ curve, because it tends to describe how
components fail with time (but it also has other uses, as we shall see).
If a certain quantity of components is manufactured and put into service, it is reasonable to
assume that they will all eventually fail (maybe after many, many years). The probability of
any one of the components failing during a given time period might well depend on how
many components are left in service. In other words, with a large number of components, we
might expect several to fail during a given time period; but with a much smaller number of
components, we should expect fewer failures over the same time period. For example, if
1000 components are put into service, knowing something about their reliability, we might
expect 10 to fail in the first three years. However, if only 5 components were put into service,
we certainly would not expect 10 to fail in the first three years!
The example above is rather silly, but it suggests a better way of viewing such problems.
Maybe we should expect a certain proportion of the components to fail over a given time
period. For example, if 10 out of 1000 components fail in three years, that is 1 per cent.
Perhaps we might therefore expect 1 per cent of 5 such components (i.e. most likely, none) to
fail over three years too?
So, to formalise this kind of idea, we follow this reasoning:
• Choose to measure time t in the best units for the problem (seconds, months, years, etc.).
Technically, the unit chosen should be short compared with the expected lifetime of a
component, so that any given component is expected to last for many time units.
• Let λ be the failure rate, that is, the proportion of components expected to fail in one
time unit. This means that λ must have ‘dimensions’ of (1/time). In the example above,
we said that 1 per cent of components might fail in three years so, in that case, the failure
rate λ ≈ 0.01/3 (proportion per year). This can also be viewed as a probability - there is a
probability of 0.01/3 that any given component will fail in a given period of one year.
• Therefore, to find the proportion of components expected to fail over a time t (measured
in our chosen units), we need the quantity λt. This is now dimensionless - it is actually
the probability that any given component will fail over the stated time period. Again, this
period should technically be short compared with the expected lifetime of the component.
• It follows that, if we start off with N components, then after the time t has passed, we
will expect the actual number of failures to be Nλt.
• We can now state the rate of change of the number of components as follows (it is
negative, because the number decreases as time passes):

(change in the number of components) / (time period) = −(number of failures) / (time period) = −NλΔt / Δt = −λN

• This is called a differential equation and would normally be written dN/dt = −λN, in
which the quantity dN/dt is to be interpreted as, ‘a very small change in N divided by
the very small change in t over which it occurs’.
• The solution of this equation, to discover how the remaining number of components n
might look over long periods of time, belongs to the branch of maths called calculus. It
turns out that: n = Ne^(−λt)
• We can plot this negative exponential function as the following curve relating the
remaining number of components n to time:
[Plot: n = Ne^(−λt); vertical axis is the multiple of the initial number of components (N), from 0 to 1; horizontal axis is time units, from 0 to 6; the curve decays from 1 towards zero.]
This plot effectively shows the number of components we expect to remain in service as time
passes - just multiply the vertical scale by N (the initial number of components), and read off
the plot the number expected to remain at any time of interest. To find the number of
components which we expect to have failed at any time, subtract the plot value from N.
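A minimal sketch of this survival calculation, together with a crude numerical check that n = Ne^(−λt) really does satisfy dN/dt = −λN (the step size and the second failure rate are illustrative choices, not from the notes):

```python
import math

# Expected number of components still in service after time t.
def remaining(N, lam, t):
    return N * math.exp(-lam * t)

# 1000 components with the earlier failure rate of 0.01/3 per year:
print(round(remaining(1000, 0.01 / 3, 3)))   # 990 -> about 10 failures in 3 years

# Crude check of dN/dt = -lam*N: apply the small change -lam*n*dt repeatedly.
lam, n, dt = 0.5, 1000.0, 1e-4
for _ in range(40_000):                      # 40,000 steps of 1e-4 -> t = 4
    n -= lam * n * dt
print(round(n, 1))                           # ~135.3, stepped numerically
print(round(remaining(1000, lam, 4.0), 1))   # 135.3, from the exact solution
```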
Other things fit such a curve too. One example is radioactive decay. As a radioactive
substance decays, it emits matter as radiation, thus reducing the amount of matter remaining.
The remaining matter fits a curve similar to that above. Another distribution similar to the
above is used in predicting the duration of conversations on a telephone network.
Over to you:
3000 light bulbs are put into service in a large office complex. After 200 hours of use, 500
have failed. How many bulbs might we expect to have failed after (i) 800 hours and (ii) 3000
hours?
Summary
In this session we have seen how probability distributions can be plotted, and used to make
predictions from data sets. We looked at two of the standard probability distributions, the
normal and negative exponential distributions, and saw how these can be used to forecast
results in a standardised manner, so long as our data fit the chosen distribution.
Ken Dutton
November 1998
ND table by Bill Barraclough and EXCEL, November 2001
BSc IETM Week 8 - Probability Distributions - ‘solutions’
Page 2 ‘Over to you’:
Histogram data for 1991 were plotted as shown below. Freq. polygon is added to plot. “Non-standard width compensation” is used as shown at ‘A’, ‘B’, ‘C’ and ‘D’ (and I did them in
that order). Termination at ‘high’ end questionable - what’s the maximum allowable age?!!
It would have been unfortunate if the line through ‘C’ missed the ‘6997’ bar altogether - let’s
hope nobody asks! (I guess the answer would be that some approximation of merging the
first two bars would have to be made.)
[Plot: population histogram, Frequency (millions of people) on the vertical axis (0 to 14) vs Age in years on the horizontal (0 to 100), with the frequency polygon overlaid; bar values 3766, 6997, 12383, 11974, 9448, 7955, 3748; compensation points ‘A’, ‘B’, ‘C’ and ‘D’ marked on the non-standard-width bars.]
Page 3 ‘Over to you’:
Dividing 1, 4, 8, 17, 47, 53, 39, 25, 11, 3 by 208
gives: 0.00481, 0.0192, 0.0384, 0.0817, 0.2260, 0.2548, 0.1875, 0.1202, 0.0529, 0.0144
These sum to 1, give or take rounding error. This should always happen, so the area under
such a probability distribution curve is always 1.
Page 4, Note (3):
Area under curve seems to be roughly 10 ‘squares’. Units are 1 by 0.1, so area = 1.
Page 6 ‘Over to you’:
The tabular approach, as recommended in the notes, is probably easiest for calculating the
mean and S.D. Note the trick of getting fi mi² by multiplying the already-obtained fi mi
values by mi, rather than doing the squaring and multiplying again. I did the values below the
hard way (not using Excel), so there may be errors (space before and after in case you want to
do an OHP just of the table - I shan’t bother):
        f     m      fm       fm²
        1     5       5        25
        4    15      60       900
        8    25     200      5000
       17    35     595     20825
       47    45    2115     95175
       53    55    2915    160325
       39    65    2535    164775
       25    75    1875    140625
       11    85     935     79475
        3    95     285     27075
SUMS  208          11520   694200

So mean = 11520/208 = 55.38 per cent, and
S.D. = √(694200/208 − 55.38²) = 16.45 per cent.

To find P(more than 40%) need to normalise 40 by subtracting mean and dividing by S.D. to
get -0.935. Can’t look this up in table as it’s negative, but the same area under the curve
would be obtained as 1 - value for +0.935. Required value is approx. 0.175 from table, so
required answer is 1 - 0.175 = 0.825, equivalent to 82.5 per cent getting more than 40 %.
Main difference from normal distribution is likely to be the ‘step’ at 40 per cent, and the
resulting positive skew, due to trying to pass people!!
Page 8 ‘Over to you’:
500 failures implies 500 = Nλt, so λ = 500/(3000 × 200) = 8.333 × 10⁻⁴ per hour.
At 800 hr., number working = 3000e^(−0.0008333×800) = 1540, so number of failures = 1460.
At 3000 hr., number working = 3000e^(−0.0008333×3000) = 246, so number of failures = 2754.
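The arithmetic above can be checked with a short script (same exponential model as the notes):

```python
import math

# Light-bulb solution: 500 of 3000 bulbs fail in the first 200 hours.
N = 3000
lam = 500 / (N * 200)            # failure rate per hour (8.333e-4)

for t in (800, 3000):
    working = N * math.exp(-lam * t)
    print(t, round(working), round(N - working))
# 800 hr: 1540 working, 1460 failed; 3000 hr: 246 working, 2754 failed
```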
Ken