Chapter 2
Statistics of repeated measurements
Mean and standard deviation
• Mean: $\bar{x} = \sum_i x_i / n$
• Standard deviation: $s = \sqrt{\sum_i (x_i - \bar{x})^2 / (n - 1)}$
The distribution of repeated measurements
• Although the standard deviation gives a measure of
the spread of a set of results about the mean value, it
does not indicate the shape of the distribution.
• To illustrate this we need a large number of
measurements
$\bar{x}$ = 0.500
s = 0.0165
This shows that the distribution of the measurements is roughly
symmetrical about the mean, with the measurements clustered
towards the centre.
The set of all possible measurements is called the population. If there are no
systematic errors, then the mean of this population, denoted by µ, is the true
value of the nitrate ion concentration, and the standard deviation of the
population is denoted by σ.
• The sample standard deviation, s, usually gives an estimate of σ.
Normal or Gaussian distribution
The mathematical model that describes a continuous curve:
$y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$
The curve is symmetrical about µ, and the greater the value of σ, the greater
the spread of the curve.
• More detailed analysis shows that, whatever
the values of µ and σ, the normal
distribution has the following properties.
– For a normal distribution with mean µ and standard deviation σ,
approximately 68% of the population values lie
within ±1σ of the mean,
– approximately 95% of the population values lie
within ±2σ of the mean,
– and approximately 99.7% of the population values
lie within ±3σ of the mean.
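The three proportions quoted above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and do not come from the slides.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function F(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def proportion_within(k: float) -> float:
    """Proportion of a normal population lying within ±k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} sd: {100 * proportion_within(k):.1f}%")
```

The result does not depend on the particular values of µ and σ, because the interval is expressed in units of σ about µ.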
Standardized normal cumulative distribution function, F(z)
• For a normal distribution with known mean, µ, and standard
deviation, σ, the exact proportion of values which lie within any
interval can be found from tables, provided that the values are
first standardized so as to give z-values.
• This is done by expressing a value of x in terms of its deviation
from the mean in units of standard deviation, σ. That is,
$z = (x - \mu)/\sigma$
• Table A.1 (Appendix 2) gives the proportion of
values, F(z), that lie below a given value of z.
• F(z) is called the standard normal cumulative
distribution function.
– For example the proportion of values below z = 2 is F(2) =
0.9772
– and the proportion of values below z = -2 is F(-2) = 0.0228.
– Thus the exact value for the proportion of measurements
lying within two standard deviations of the mean is 0.9772 - 0.0228 = 0.9544.
• If repeated values of a titration are normally
distributed with mean 10.15 ml and standard
deviation 0.02 ml, find the proportion of
measurements which lie between 10.12 ml and 10.20
ml.
• Standardizing the first value gives z = (10.12 - 10.15)/0.02 = -1.5.
From Table A.1, F(-1.5) = 0.0668.
• Standardizing the second value gives z = (10.20 - 10.15)/0.02 = 2.5.
From Table A.1, F(2.5) = 0.9938.
• Thus the proportion of values between x = 10.12 and 10.20
(which corresponds to z = -1.5 to 2.5) is 0.9938 - 0.0668 = 0.927.
• Values of F(z) can also be found using Excel or Minitab
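As an alternative to tables, Excel or Minitab, the titration example can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only a sketch of one possible route; the variable names are illustrative.

```python
from statistics import NormalDist

# Titration volumes: mean 10.15 ml, standard deviation 0.02 ml (from the example)
titration = NormalDist(mu=10.15, sigma=0.02)

# z-values for the two limits, z = (x - mu) / sigma
z_low = (10.12 - titration.mean) / titration.stdev
z_high = (10.20 - titration.mean) / titration.stdev

# Proportion of measurements between 10.12 ml and 10.20 ml
proportion = titration.cdf(10.20) - titration.cdf(10.12)

print(f"z = {z_low:.1f} to {z_high:.1f}, proportion = {proportion:.3f}")
```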
Log-normal distribution
• Another example of a variable which may follow a log-normal
distribution is the particle size of the droplets formed by the
nebulizers used in flame spectroscopy.
• Particle size distributions in atmospheric aerosols may also
take the log-normal form, and the distribution is used to
describe equipment failure rates
• Minitab allows this distribution to be simulated and studied.
However, by no means all asymmetrical population
distributions can be converted to normal ones by the
logarithmic transformation.
• The distribution of the logarithms of the blood serum
concentration shown in Figure 2.5(b) has mean 0.15 and
standard deviation 0.20.
• This means that approximately 68% of the logged values lie in
the interval 0.15 - 0.20 to 0.15 + 0.20, that is -0.05 to 0.35.
• Taking antilogarithms we find that 68% of the original
measurements lie in the interval $10^{-0.05}$ to $10^{0.35}$, that is 0.89 to
2.24.
• The antilogarithm of the mean of the logged values, $10^{0.15} = 1.41$,
gives the geometric mean of the original distribution.
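The back-transformation from logged values to the original scale can be sketched in a few lines of Python. The mean and standard deviation of the logged values are those quoted above; the variable names are illustrative.

```python
log_mean = 0.15  # mean of the log10 values (from the text)
log_sd = 0.20    # standard deviation of the log10 values (from the text)

# About 68% of the logged values lie within one standard deviation of their mean;
# taking antilogarithms converts that interval back to the original scale.
lower = 10 ** (log_mean - log_sd)
upper = 10 ** (log_mean + log_sd)

# The antilogarithm of the mean of the logs is the geometric mean.
geometric_mean = 10 ** log_mean

print(f"68% interval: {lower:.2f} to {upper:.2f}")
print(f"geometric mean: {geometric_mean:.2f}")
```

Note that the interval is not symmetrical about the geometric mean on the original scale, which reflects the skew of the log-normal distribution.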
Definition of a sample
• The Commission on Analytical Nomenclature of the Analytical
Chemistry Division of the International Union of Pure and
Applied Chemistry has pointed out that confusion and
ambiguity can arise if the term 'sample' is also used in its
colloquial sense of the 'actual material being studied'
(Commission on Analytical Nomenclature, 1990).
• It recommends that the term sample is confined to its
statistical concept.
• Other words should be used to describe the material on which
measurements are being made, in each case preceded by 'test',
for example test solution or test extract.
• We can then talk unambiguously of a sample of measurements
on a test extract, or a sample of tablets from a batch.
• A test portion from a population which varies with time, such as
a river or circulating blood, should be described as a specimen.
• Unfortunately this practice is by no means usual, so the term
'sample' remains in use for two related but distinct uses.
The sampling distribution of the mean
• In the absence of systematic errors, the mean of a sample of
measurements provides us with an estimate of the true value, µ,
of the quantity we are trying to measure.
• Even in the absence of systematic errors, the individual
measurements vary due to random errors and so it is most
unlikely that the mean of the sample will be exactly equal to the
true value.
• For this reason it is more useful to give a range of values which
is likely to include the true value.
• The width of this range depends on two factors.
– The first is the precision of the individual measurements, which in
turn depends on the standard deviation of the population.
– The second is the number of measurements in the sample. The
more measurements we make, the more reliable our estimate of µ,
the true value, will be.
Sampling distribution of the mean
• Assuming each column is a sample:
The mean of each sample: 0.506, 0.504, 0.502, 0.492, 0.506, 0.504,
0.500, 0.486
• The distribution of all possible sample means (in this case an infinite
number) is called the sampling distribution of the mean.
• Its mean is the same as the mean of the original population.
• Its standard deviation is called the standard error of the mean (s.e.m.).
• There is an exact mathematical relationship between the latter and the
standard deviation, σ, of the distribution of the individual
measurements:
• For a sample of n measurements, standard error of the mean
(s.e.m.) = $\sigma/\sqrt{n}$
• As expected, the larger n is, the smaller the value of the s.e.m. and
consequently the smaller the spread of the sample means about
the true value, µ.
• Another property of the sampling distribution of the mean is that,
even if the original population is not normal, the sampling
distribution of the mean tends to the normal distribution as n
increases.
• This result is known as the central limit theorem.
• This theorem is of great importance because many statistical tests
are performed on the mean and assume that it is normally
distributed.
• Since in practice we can assume that distributions of repeated
measurements are at least approximately normally distributed, it is
reasonable to assume that the means of quite small samples (say
>5) are normally distributed.
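Both the s.e.m. relationship and the central limit theorem can be illustrated by simulation. The sketch below is not taken from the slides: it assumes a deliberately skewed (exponential) population, a sample size of 25 and 4000 repeated samples, all chosen purely for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so that the run is repeatable

POPULATION_MEAN = 1.0  # an exponential population with rate 1 has mean 1
POPULATION_SD = 1.0    # and standard deviation 1

def sample_mean(n: int) -> float:
    """Mean of n values drawn from a skewed (exponential) population."""
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])

n = 25  # ASSUMED sample size, for illustration only
means = [sample_mean(n) for _ in range(4000)]

# Spread of the sample means compared with sigma / sqrt(n)
observed_sem = statistics.stdev(means)
predicted_sem = POPULATION_SD / math.sqrt(n)

# If the sample means are close to normal, about 95% of them should lie
# within ±1.96 s.e.m. of the population mean.
inside = sum(
    abs(m - POPULATION_MEAN) <= 1.96 * predicted_sem for m in means
) / len(means)

print(f"observed s.e.m. {observed_sem:.3f}, predicted {predicted_sem:.3f}")
print(f"fraction of sample means within ±1.96 s.e.m.: {inside:.3f}")
```

Even though the individual values are far from normally distributed, the sample means behave approximately as the normal model predicts.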
Confidence limits of the mean for large samples
• The range which may be assumed to include the
true value is known as a confidence interval and
the extreme values of the range are called the
confidence limits.
• The term 'confidence' implies that we can assert
with a given degree of confidence, i.e. a certain
probability, that the confidence interval does
include the true value.
• The size of the confidence interval will obviously
depend on how certain we want to be that it
includes the true value: the greater the certainty,
the greater the interval required.
Sampling distribution of the mean for samples of size n.
• If we assume that this distribution is normal then 95% of the sample
means will lie in the range given by:
$\mu - 1.96\,\sigma/\sqrt{n} < \bar{x} < \mu + 1.96\,\sigma/\sqrt{n}$
• In practice, we usually have one sample, of known
mean, and we require a range for µ, the true value:
$\bar{x} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{x} + 1.96\,\sigma/\sqrt{n}$
• The equation gives the 95% confidence interval of the mean.
• The 95% confidence limits are $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$.
• In practice we are unlikely to know σ exactly.
• However, provided that the sample is large, σ can be replaced
by its estimate, s.
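A large-sample confidence interval can be sketched as follows. The mean (0.500) and standard deviation (0.0165) are the values quoted earlier for the repeated measurements; the sample size n = 50 is an assumption made here for illustration, since the transcript does not state it.

```python
import math

def confidence_limits_large_sample(mean, s, n, z=1.96):
    """Large-sample confidence limits: mean ± z * s / sqrt(n) (z = 1.96 for 95%)."""
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width

# mean and s from the text; n = 50 is an ASSUMED sample size
low, high = confidence_limits_large_sample(mean=0.500, s=0.0165, n=50)

print(f"95% confidence interval: {low:.4f} to {high:.4f}")
```

Replacing z = 1.96 with a larger value gives a wider interval, in line with the statement that greater certainty requires a greater interval.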
Confidence limits of the mean for small samples
• For small samples the confidence limits of the mean are given by
$\bar{x} \pm t_{n-1}\, s/\sqrt{n}$.
• The subscript (n - 1) indicates that t depends on this
quantity, which is known as the number of degrees of
freedom, d.f. (usually given the symbol ν).
• The term 'degrees of freedom' refers to the number of
independent deviations $(x_i - \bar{x})$ which are used in calculating s.
• In this case the number is (n - 1), because when (n - 1)
deviations are known the last can be deduced since
$\sum_i (x_i - \bar{x}) = 0$.
• The value of t also depends on the degree of confidence
required.
Presentation of results
• The mean is quoted as an estimate of the quantity
measured and the standard deviation as the estimate
of the precision.
• Less commonly, the standard error of the mean
(s.e.m.) is quoted instead of the standard
deviation,
• or the result is given in the form of the 95%
confidence limits of the mean.
• (Uncertainty estimates, see Chapter 4, are also
sometimes used.)
• Since there is no universal convention, it is
essential to state the form used. Provided that
the value of n is given, the three forms can be easily
inter-converted by using the previous equations.
Rounding off
• The important principle is that the number of
significant figures given indicates the precision of
the experiment.
• It would clearly be meaningless, for example, to give
the result of a titrimetric analysis as 0.107846 M: no
analyst could achieve the implied precision of
0.000001 in ca. 0.1, i.e. 0.001%.
• In practice it is usual to quote as significant figures
all the digits which are certain, plus the first
uncertain one.
• For example, the mean of the values 10.09, 10.11,
10.09, 10.10 and 10.12 is 10.102, and their standard
deviation is 0.01304.
• Clearly there is uncertainty in the second decimal
place; the results are all 10.1 to one decimal place,
but disagree in the second decimal place. Using the
suggested method the result would be quoted as:
$\bar{x} \pm s = 10.10 \pm 0.01$ (n = 5)
• If it was felt that this resulted in an unacceptable
rounding-off of the standard deviation, then the
result could be given as:
$\bar{x} \pm s = 10.102 \pm 0.013$ (n = 5)
• When confidence limits are calculated there is no point in
giving the value of $t_{n-1}\, s/\sqrt{n}$ to more than two
significant figures. The value of $\bar{x}$ should be given to the
corresponding number of decimal places.
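The mean and standard deviation of the five values quoted above can be reproduced with the Python standard library. This sketch only illustrates the arithmetic and the effect of rounding; it is not a prescribed reporting procedure.

```python
import statistics

values = [10.09, 10.11, 10.09, 10.10, 10.12]

mean = statistics.mean(values)
s = statistics.stdev(values)  # sample standard deviation, divisor (n - 1)

# Unrounded values, then the result rounded to the first uncertain digit
print(f"mean = {mean:.3f}, s = {s:.5f}")
print(f"quoted result: {mean:.2f} ± {s:.2f} (n = {len(values)})")
```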
Rounding off the 5
• If a final 5 is always rounded up, the results are biased upwards.
The bias can be avoided by rounding the 5 to the
nearest even number. For example: 4.75 is rounded
to 4.8, whereas 4.65 is rounded to 4.6.
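Rounding the 5 to the nearest even digit is available in Python's `decimal` module as `ROUND_HALF_EVEN`. The sketch below passes the numbers as strings so that they are represented exactly; the helper name is illustrative.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_half_even(value: str, places: str = "0.1") -> Decimal:
    """Round a decimal string so that a final 5 goes to the nearest even digit."""
    return Decimal(value).quantize(Decimal(places), rounding=ROUND_HALF_EVEN)

def round_half_up(value: str, places: str = "0.1") -> Decimal:
    """Always round a final 5 upwards (the biased convention)."""
    return Decimal(value).quantize(Decimal(places), rounding=ROUND_HALF_UP)

for text in ("4.75", "4.65"):
    print(text, "half-even:", round_half_even(text), "half-up:", round_half_up(text))
```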
Example
• The absorbance scale of a spectrometer is tested at a particular
wavelength with a standard solution which has an absorbance
given as 0.470. Ten measurements of the absorbance with the
spectrometer give $\bar{x}$ = 0.461, and s = 0.003.
• Find the 95% confidence interval for the mean absorbance as
measured by the spectrometer, and hence decide whether a
systematic error is present.
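One way to work through this example is sketched below. It assumes the small-sample formula mean ± t·s/√n and uses a short lookup of standard two-sided 95% t-values; the dictionary and function names are illustrative and are not part of the original slides.

```python
import math

# Standard tabulated two-sided 95% values of t, indexed by degrees of freedom
T_95 = {1: 12.71, 2: 4.30, 3: 3.18, 4: 2.78, 5: 2.57,
        6: 2.45, 7: 2.36, 8: 2.31, 9: 2.26, 10: 2.23}

def confidence_limits_small_sample(mean, s, n):
    """95% confidence limits: mean ± t(n-1) * s / sqrt(n)."""
    t = T_95[n - 1]
    half_width = t * s / math.sqrt(n)
    return mean - half_width, mean + half_width

low, high = confidence_limits_small_sample(mean=0.461, s=0.003, n=10)

# A systematic error is suspected if the stated absorbance lies outside the interval.
standard_value = 0.470
systematic_error_suspected = not (low <= standard_value <= high)

print(f"95% confidence interval: {low:.3f} to {high:.3f}")
print("systematic error suspected:", systematic_error_suspected)
```

Since the interval does not include the stated absorbance of 0.470, the calculation suggests that a systematic error is present.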
Confidence limits of the geometric mean for a log-normal distribution
The following values (expressed as percentages) give the antibody
concentration in human blood serum for a sample of eight healthy
adults: 2.15, 1.13, 2.04, 1.45, 1.35, 1.09, 0.99, 2.07.
Calculate the 95% confidence interval for the geometric mean,
assuming that the antibody concentration is log-normally distributed.
The logarithms (to the base 10) of these values are:
0.332, 0.053, 0.310, 0.161, 0.130, 0.037, -0.004, 0.316
The mean of these logged values is 0.1669, giving $10^{0.1669} = 1.47$ as the
geometric mean of the original values.
The standard deviation of the logged values is 0.1365.
The 95% confidence limits for the logged values are:
$0.1669 \pm 2.36 \times 0.1365/\sqrt{8} = 0.1669 \pm 0.1139$, i.e. 0.0530 to 0.2808.
Taking antilogarithms of these limits gives the 95% confidence interval of the
geometric mean as 1.13 to 1.91.
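The calculation can be repeated from the raw data as a check. The sketch below assumes the small-sample formula with the tabulated t-value of 2.36 for 7 degrees of freedom; variable names are illustrative.

```python
import math
import statistics

concentrations = [2.15, 1.13, 2.04, 1.45, 1.35, 1.09, 0.99, 2.07]
t_7 = 2.36  # tabulated two-sided 95% value of t for n - 1 = 7 degrees of freedom

# Work on the log10 scale, where the data are assumed to be normally distributed
logs = [math.log10(c) for c in concentrations]
log_mean = statistics.mean(logs)
log_sd = statistics.stdev(logs)
half_width = t_7 * log_sd / math.sqrt(len(logs))

# Transform the mean and the limits back to the original scale
geometric_mean = 10 ** log_mean
lower = 10 ** (log_mean - half_width)
upper = 10 ** (log_mean + half_width)

print(f"geometric mean = {geometric_mean:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```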
Propagation of random errors
• It is most important to note that the procedures used
for combining random and systematic errors are
completely different.
• This is because random errors to some extent
cancel each other out, whereas every systematic
error occurs in a definite and known sense.
• Suppose, for example, that the final result of an
experiment, x, is given by x = a + b. If a and b each
have a systematic error of +1, it is clear that the
systematic error in x is +2.
• If, however, a and b each have a random error of ±1,
the random error in x is not ±2: this is because there
will be occasions when the random error in a is
positive while that in b is negative (or vice versa).
Linear combinations
• In this case the final value, y, is calculated from a linear
combination of measured quantities a, b, c, etc., by:
$y = k + k_a a + k_b b + k_c c + \dots$, where k, $k_a$, $k_b$, $k_c$, etc., are
constants.
• The variance of a sum or difference of independent
quantities is equal to the sum of their variances.
• It can be shown that if $\sigma_a$, $\sigma_b$, $\sigma_c$, etc., are the
standard deviations of a, b, c, etc., then the standard
deviation of y, $\sigma_y$, is given by:
$\sigma_y = \sqrt{(k_a\sigma_a)^2 + (k_b\sigma_b)^2 + (k_c\sigma_c)^2 + \dots}$
Example
• In a titration the initial reading on the burette is 3.51
ml and the final reading is 15.67 ml, both with a
standard deviation of 0.02 ml. What is the volume of
titrant used and what is its standard deviation?
Volume used = 15.67 - 3.51 = 12.16 ml, with standard deviation
$\sqrt{(0.02)^2 + (0.02)^2} = 0.028$ ml, i.e. 12.16 ± 0.028 ml
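The burette example can be checked with a small helper that combines standard deviations in quadrature. This is a sketch for independent random errors only; the function name is illustrative.

```python
import math

def sd_linear_combination(coefficients, sds):
    """Standard deviation of y = k + k_a*a + k_b*b + ... for independent a, b, ..."""
    return math.sqrt(sum((k * s) ** 2 for k, s in zip(coefficients, sds)))

# Volume = final reading - initial reading, so the coefficients are +1 and -1
volume = 15.67 - 3.51
sd_volume = sd_linear_combination([1, -1], [0.02, 0.02])

print(f"volume used = {volume:.2f} ± {sd_volume:.3f} ml")
```

The combined standard deviation (about 0.028 ml) is larger than either individual value but smaller than their sum (0.04 ml), because random errors partly cancel.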
Multiplicative expressions
• Consider the error in the result of a multiplication and/or
division such as: y = kab/cd
• where a, b, c and d are independent measured
quantities and k is a constant.
• In such a case there is a relationship between the
squares of the relative standard deviations:
$\frac{\sigma_y}{y} = \sqrt{\left(\frac{\sigma_a}{a}\right)^2 + \left(\frac{\sigma_b}{b}\right)^2 + \left(\frac{\sigma_c}{c}\right)^2 + \left(\frac{\sigma_d}{d}\right)^2}$
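For a multiplicative expression the relative standard deviations are combined in quadrature. The sketch below uses purely hypothetical values for k, a, b, c, d and their standard deviations, chosen so that every relative standard deviation is 1%; none of these numbers come from the slides.

```python
import math

def relative_sd_product(values, sds):
    """Relative s.d. of y = k*a*b/(c*d): square root of the sum of squared relative s.d.s."""
    return math.sqrt(sum((s / v) ** 2 for v, s in zip(values, sds)))

# HYPOTHETICAL measured quantities and standard deviations (each r.s.d. is 1%)
k = 2.0
a, b, c, d = 10.0, 5.0, 2.0, 4.0
sd_a, sd_b, sd_c, sd_d = 0.1, 0.05, 0.02, 0.04

y = k * a * b / (c * d)
relative_sd_y = relative_sd_product([a, b, c, d], [sd_a, sd_b, sd_c, sd_d])
sd_y = y * relative_sd_y

print(f"y = {y:.2f}, relative s.d. = {100 * relative_sd_y:.1f}%, s.d. = {sd_y:.2f}")
```

The constant k does not contribute to the relative standard deviation, since it carries no error.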
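• If the relationship is $y = b^n$,
then the standard deviations of y and b are
related by:
$\left|\frac{\sigma_y}{y}\right| = \left|\frac{n\,\sigma_b}{b}\right|$
• If y is a general function of x, y = f(x), then the
standard deviations of x and y are related by:
$\sigma_y = \left|\sigma_x \frac{dy}{dx}\right|$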
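• For example, if the absorbance A is obtained from the transmittance T
as A = -log T, then $\sigma_A = \sigma_T \log e / T$ and the relative
standard deviation (r.s.d.) of A is given by
$\text{r.s.d. of } A = \frac{100\,\sigma_A}{A} = \frac{100\,\sigma_T \log e}{-T \log T}$
• Differentiation of this expression with respect to T shows that
the r.s.d. of A is a minimum when T = 1/e = 0.368.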
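The position of the minimum can be confirmed numerically rather than by differentiation. The sketch below assumes that A = -log10(T) and that the standard deviation of T is constant; the value 0.001 for that standard deviation is hypothetical and does not affect where the minimum lies.

```python
import math

def rsd_of_absorbance(T: float, sd_T: float) -> float:
    """Percentage r.s.d. of A = -log10(T), using sd_A = sd_T * log10(e) / T."""
    sd_A = sd_T * math.log10(math.e) / T
    A = -math.log10(T)
    return 100.0 * sd_A / A

sd_T = 0.001  # HYPOTHETICAL standard deviation of the transmittance

# Scan transmittance values from 0.010 to 0.990 in steps of 0.001
grid = [i / 1000 for i in range(10, 991)]
best_T = min(grid, key=lambda T: rsd_of_absorbance(T, sd_T))

print(f"r.s.d. of A is smallest near T = {best_T:.3f} (1/e = {1 / math.e:.3f})")
```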
Propagation of systematic errors
• If y is calculated from measured quantities by use of a linear
combination, $y = k + k_a a + k_b b + k_c c + \dots$, and
the systematic errors in a, b, c, etc., are Δa, Δb, Δc, etc., then the
systematic error in y, Δy, is calculated from:
$\Delta y = k_a \Delta a + k_b \Delta b + k_c \Delta c + \dots$
• Remember that the systematic errors are either positive or negative
and that these signs must be included in the calculation of Δy.
• The total systematic error can sometimes be zero.
• Consider the analytical balance: if the weight of the solute used is
found from the difference between two weighings, the systematic
errors cancel out.
• It should be pointed out that this applies only to an electronic
balance with a single internal reference weight.
• Carefully considered procedures, such as this, can often minimize
the systematic errors, as described in Chapter 1.
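The contrast with random errors can be shown with a signed sum. In the sketch below the balance bias of +0.01 g is a hypothetical figure; the second case repeats the x = a + b example in which a and b each carry a systematic error of +1.

```python
def systematic_error_linear(coefficients, errors):
    """Systematic error in y = k + k_a*a + k_b*b + ...; the signs of the errors are kept."""
    return sum(k * e for k, e in zip(coefficients, errors))

# Weighing by difference: both readings carry the same HYPOTHETICAL bias of +0.01 g
balance_bias = 0.01
error_in_weight = systematic_error_linear([1, -1], [balance_bias, balance_bias])

# x = a + b, where a and b each have a systematic error of +1
error_in_sum = systematic_error_linear([1, 1], [1, 1])

print("systematic error in weight by difference:", error_in_weight)
print("systematic error in x = a + b:", error_in_sum)
```

Unlike random errors, systematic errors add directly with their signs, so they can either accumulate (as in the sum) or cancel completely (as in weighing by difference).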
• If y is calculated from the measured quantities by
use of a multiplicative expression such as y = kab/cd, then relative
systematic errors (Δa/a, Δb/b, etc.) are used in place of absolute ones.