Measurement as Inference: Fundamental Ideas
W. Tyler Estler (2)
Precision Engineering Division
National Institute of Standards and Technology
Gaithersburg, MD 20899 USA
Abstract:
We review the logical basis of inference as distinct from deduction, and show that measurements in
general, and dimensional metrology in particular, are best viewed as exercises in probable inference:
reasoning from incomplete information. The result of a measurement is a probability distribution that
provides an unambiguous encoding of one's state of knowledge about the measured quantity. Such states
of knowledge provide the basis for rational decisions in the face of uncertainty. We show how simple
requirements for rationality, consistency, and accord with common sense lead to a set of unique rules for
combining probabilities and thus to an algebra of inference. Methods of assigning probabilities and
application to measurement, calibration, and industrial inspection are discussed.
Keywords: dimensional metrology, measurement uncertainty, information
1. Introduction
The growing acceptance and use of the ISO Guide to the
Expression of Uncertainty in Measurement (GUM) [10] has
stimulated renewed thinking about errors, tolerances,
statistics, and the concepts of randomness and
determinism as they relate to manufacturing engineering
and metrology. While we fully subscribe to the notion of
determinism as articulated by J. B. Bryan [3] and R. R.
Donaldson [6], the knowledge that a machine moves in
perfect accord with natural law provides only small comfort
when we must assign an uncertainty to measurements of
its positioning errors. We emphasize here the conceptual
distinction between a state of nature (for example, the
geometry of a highly repeatable machine tool) and the
uncertainty of a process designed to measure that state
(linear positioning error, for example, measured with a
displacement interferometer).
Traditionally, there has been little in the education of a
typical engineer or physicist that provides a fundamental
viewpoint or logical basis for dealing with measurement
uncertainty, in the way that the laws of Newton and Hooke
provide a foundation for major portions of engineering
science. While computing the mean and variance of a set
of repeated measurements seems like a reasonable thing
to do, many statistical tests seem ad hoc and poorly
motivated and they provide no guidance in situations where
repeatability is not an issue or where no population of parts
exists.
It is a pleasure to discover that there exists a unique
mathematical system for plausible reasoning in the
presence of uncertainty that satisfies very elementary and
non-controversial requirements for consistency and rational
agreement with common sense. In this paper we present a
brief outline of the fundamental ideas of this system, called
simply probability theory, with emphasis on its applications
to engineering metrology. The development of probability
theory as logic had its origins in the work of P. S. Laplace
who remarked that 'probability theory is nothing but
common sense reduced to calculation.' The modern
development owes much to the work of H. Jeffreys [16],
G. Polya [26], R. T. Cox [4-5], and E. T. Jaynes [12-15].
Detailed applications to problems of data analysis and
measurement uncertainty from a modern point of view are
given by D. S. Sivia [30] and K. Weise and W. Wöger [33].
The latter paper is an excellent introduction to the approach
to uncertainty advocated by the GUM.
2. Deduction and Plausible Inference
2.1 Deductive logic
Classical deductive logic deals with propositions (written
simply A, B, C, ...) that are either true or false. Typical
propositions are declarative statements such as:
A ≡ 'There is life on Mars.'
B ≡ 'The error in the length of the workpiece is
less than 5 µm.'
C ≡ 'The cost of the workpiece is less than $10.'
Propositions are combined and manipulated using a set of
three basic operations defined as follows:
Negation: ~A ≡ 'A is false'
Logical product: AB ≡ 'A and B are both true'
Logical sum: A + B ≡ 'at least one of the propositions
(A,B) is true'
Relations among propositions form the subject of Boolean
algebra, which relates logical combinations of propositions
that have the same truth value.
A typical Boolean expression is:

~(A + B) = (~A)(~B).    (1)
Here, the left-hand side says 'It is not true that at least one
of the propositions (A,B) is true', while the right-hand side
says 'A and B are both false.' Clearly these verbal
expressions have the same logical status and semantic
meaning, a feature of any valid Boolean expression.
Because of logical relations such as (1), only two of the
three basic operations are independent, a fact that will
simplify the development of the rules of probability theory.
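Relation (1), De Morgan's theorem, can be verified mechanically by enumerating truth assignments. The short sketch below is ours, for illustration only; it also checks the claim that the logical sum is expressible through negation and the logical product:

```python
from itertools import product

# Verify relation (1): not(A or B) == (not A) and (not B),
# for every possible truth assignment of the propositions A and B.
for A, B in product([True, False], repeat=2):
    lhs = not (A or B)          # negation of the logical sum
    rhs = (not A) and (not B)   # logical product of the negations
    assert lhs == rhs

# Because of (1), the logical sum is expressible via negation and the
# logical product, so only two of the three operations are independent:
# A + B = ~((~A)(~B)).
for A, B in product([True, False], repeat=2):
    assert (A or B) == (not ((not A) and (not B)))

print("Relation (1) holds for all truth assignments.")
```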
Deductive logic is a two-valued logic (true/false, up/down,
zero/one, etc) and together with the Boolean formalism
provides the binary mathematical basis of computer
science. Those familiar with the operation of logic gates will
recognize the logical sum, for example, as defining the
action of an 'inclusive OR' binary gate.
A basic construction in classical logic is the implication,
written 'A implies B', which means that if A is true, then B is
also, necessarily, true. The connection is logical rather than
(necessarily) causal; for example, the proposition A ≡ 'there
is life on Mars' would logically imply B1 ≡ 'there is liquid
water on Mars', B2 ≡ 'there is oxygen on Mars', and so on.
[In anticipation of objections on semantic grounds we point
out that we are using the term 'life' in the sense of life forms
similar to those that exist on the Earth.]
Deductive logic then proceeds from the implication in two
complementary ways, according to the following syllogisms:
'If A implies B and A is true, then B is true.'
and
'If A implies B and B is false, then A is false.'
These are very simple logical structures with common
sense meanings. If it could be proven beyond doubt, for
example, that Mars was devoid of water, then we could
conclude that no (Earth-like) Martian life could exist.
2.2 Plausible inference and probability
Now suppose that A implies B for some relevant pair of
propositions, and in the course of contemplating A we
happen to learn that B is true. What does this tell us about
A? This question is quite different from those in deductive
logic and belongs to the field of plausible inference that was
richly explored by Polya [26]. Here, knowledge that B is true
supplies evidence for the truth of A, but certainly not
deductive proof. We may feel intuitively that A is more likely
to be true upon learning that one of its consequences is
true, but how much more likely?
It is easy to see that the change in our strength of belief in
proposition A will depend on the nature of the information
supplied by consequence B. Consider the proposition
A ≡ 'the length error of the workpiece is less than 5 µm',
and suppose that we learn, based on a preliminary
measurement, that B1 ≡ 'the length error of the workpiece is
less than 100 µm' is true. Such information would certainly
make A seem more likely to be true, but it would be much
more significant to learn from a more recent measurement
that B2 ≡ 'the length error of the workpiece is less than
7 µm' is true. In this way we can qualitatively order degrees
of plausibility in the sense of: 'A is more likely to be true,
given B1' and 'A is much more likely to be true, given B2'. In
neither case does A become certain, but this qualitative
ordering is something we do naturally as a matter of
common sense reasoning.
What we need now is a way to extend deductive logic into
this region of inference between certainty and impossibility.
Such an extended logic should provide a general
quantitative system for reasoning in the face of uncertainty
or when supplied with incomplete information. In the
development of such a quantitative system of inductive
logic or plausible reasoning, we need a numerical measure
of credibility or degree of reasonable and consistent belief
that will serve to describe our state of knowledge about
propositions that are neither certain nor impossible.
Following the modern interpretation as expressed, for
example, in the GUM, we call this measure the probability,
and write:
p(A | I0) ≡ the probability that A is true, given
that I0 is true.
Here, I 0 stands for the reasoning environment: the set of
all relevant background information that conditions our
knowledge of A. We will carry I 0 along explicitly in order to
emphasize that all probabilities are conditional on some set
of propositions known (or assumed) to be true. There is a
natural intuitive basis for defining probability in this manner.
The degree of partial belief in an uncertain proposition will
always depend not only on the proposition itself, but also
on whatever information we possess that is relevant to the
matter. For this reason, there is no such thing as an
unconditional probability. The probability we assign to the
chance of rain tomorrow depends, for example, upon
whether we have heard a weather forecast, or whether it is
presently raining, or whether storm clouds are gathering,
and so on.
In Polya's studies of plausible inference he reasoned, and
common sense would agree, that if A implies B, then
necessarily p(A | BI0) ≥ p(A | I0), since the probability that A
is true, if it changes at all, can only be increased by
learning that one of its consequences is true. In our
example above concerning the length error of a workpiece,
the probabilities would be ordered according to
p(A | B2I0) > p(A | B1I0) > p(A | I0). Here we are introducing
the customary and colloquial association of stronger belief
with greater probability. While such a transitive ordering
indicates the direction in which a probability might change
in light of new evidence, it provides no way to calculate the
amount of such a change and Polya's work stopped short
of providing a quantitative formulation. For this we turn to
the work of R. T. Cox [4-5].
3. The Rules of Probability Theory
The following is a brief sketch of the logic leading to the
unique rules for manipulating probabilities. For a more
complete tutorial introduction we suggest the excellent
synopsis of Smith and Erickson [31]. Following Jaynes [12],
we list three desired properties (desiderata) that ought to
be satisfied by a quantitative system of inference. These
are not strict mathematical requirements or constraints, but
any system lacking these properties would be of little or no
value for reasoning from incomplete information.
Desideratum I. Probabilities should be represented by real
numbers. This is a simple desire for mathematical
simplicity.
Desideratum II. Probabilities should display qualitative
agreement with rationality and common sense. This
means, for example, that as evidence for the truth of a
proposition accumulates, the number representing its
probability should increase continuously and monotonically
and the probability of its negation should decrease
continuously and monotonically. It also means that the
system of reasoning should contain the deductive limits of
certainty or impossibility as special cases when
appropriate.
Desideratum III. Rules for manipulating probabilities
should be consistent. For example, if we can reason our
way to a conclusion in more than one way, then all ways
should lead to the same result. It should not matter in what
order we incorporate relevant information into our
reasoning.
3.1 The two axioms of probability theory
Equipped with these quite reasonable requirements, we
can proceed to derive the rules of probability theory. We
first seek a way to relate the probability that a proposition is
true to the probability that it is false. That is, given p(A | I0),
what is p(~A | I0)? Cox reasoned that if we know enough, on
information I0, to decide if A is true, then the same
information should be sufficient to decide if A is false. This
makes intuitive sense from the point of view of symmetry,
since what we call 'A' and what we call '~A' is a matter of
convention. Cox stated this as the first axiom of
probability theory:
Axiom 1.
'The probability of an inference (a proposition) on
given evidence (the conditioning information)
determines the probability of its contradictory (its
negation) on the same evidence.'
In symbolic form, this says:

p(~A | I0) = F1[p(A | I0)],    (2)

where F1 is some function of a single variable.

We next seek a way to relate the probability of the logical
product AB of two propositions to the probabilities of A and
B separately. That is, suppose we know p(A | I0), p(B | I0),
p(B | AI0), and so on, and we want to know p(AB | I0). For
example, suppose that an engineer is considering the
feasibility of manufacturing a metal spacer for a particular
application. In order to meet its functional requirements, the
spacer must have a length error of no more than 5 µm,
while for economic reasons the cost of production must be
held to less than $10. Now consider the two propositions:

A ≡ 'the spacer can be produced with an error of
less than 5 µm.'

B ≡ 'the spacer can be produced for less than
$10.'

and their logical product:

AB ≡ 'the spacer can be produced with an error
of less than 5 µm, for less than $10.'

In considering whether or not to proceed, the engineer
might first decide whether he has the process capability to
machine a spacer with an error of less than 5 µm [p(A | I0)],
and then, assuming that this is possible, decide whether
the cost of production can be held to less than $10
[p(B | AI0)]. Alternatively, the engineer might first address
the cost issue and assign p(B | I0), and then, on the
assumption that the cost target can be met, decide whether
the length error can be held to less than 5 µm [p(A | BI0)].
Either of these approaches seems reasonable, and either
should provide enough information to determine p(AB | I0).

Common sense reasoning along these lines led Cox to the
second axiom of probability theory:

Axiom 2.
'The probability on given evidence that both of two
inferences (propositions) are true is determined by
their separate probabilities, one on the given
evidence, the other on this evidence with the
additional assumption that the first inference
(proposition) is true.'

As a mathematical assertion, this becomes:

p(AB | I0) = F2[p(A | I0), p(B | AI0)],    (3)

where F2 is some function of the two variables. Of course,
AB and BA are logically equivalent, so by Desideratum III
we could interchange A and B in (3). Any assumed
functional relation that differs from (3) can be shown to run
afoul of our common sense requirements; Tribus [32] gives
an exhaustive demonstration.

At this point the reader is encouraged to ponder the logical
content of Cox's two axioms and to see how they agree
with the intuitive process of everyday plausible reasoning.
The writer knows of no case where these axioms have
been shown to disagree with common sense, while the
demonstrations of Tribus have shown that they are unique
in this property. This is very important because once these
two assertions are accepted as the axiomatic basis for
probability theory, the formal rules of calculation follow by
deductive logic in the form of mathematical theorems.

Equations (2) and (3) are not very informative as they
stand. Some obvious constraints on the unknown functions
F1 and F2 follow from Boolean algebra. Since AB = BA, for
example, we must have

F2[p(A | I0), p(B | AI0)] = F2[p(B | I0), p(A | BI0)].    (4)

Also, since ~(~A) ≡ A, the function F1 must be such that

F1[F1(x)] = x,    (5)

where x is an arbitrary probability. Neither of these
constraints provides a sufficient restriction to determine the
forms of the functions.

3.2 The sum and product rules

Using a different set of Boolean relations and the
requirement of consistency, R. T. Cox demonstrated that
the axiomatic relations (2) and (3) can be reduced to a pair
of functional equations whose solutions he proceeded to
find. Details of the proofs may be found in references
[4,5,12,31].

In the case of Axiom 2, the result is called the product rule:

p(AB | I0) = p(A | I0) p(B | AI0).    (6)

This is one of the two fundamental rules of probability
theory. One of its immediate consequences is that certainty
is represented by a probability equal to one. To see this,
suppose that A implies B, so that B is certain given A. Then
logically AB = A, and from (6):

p(AB | I0) = p(A | I0) = p(A | I0) p(B | AI0),

so that if p(A | I0) ≠ 0, then p(B | AI0) = 1 for B certainly
true.

In the case of Axiom 1, solution of a second functional
equation yields the sum rule:

p(A | I0) + p(~A | I0) = 1.    (7)

This is the second fundamental rule of probability theory.
An immediate consequence of the sum rule is that
impossibility is represented by a probability equal to zero.
For if A is certainly true then ~A is false, so that p(A | I0) = 1
and from (7) we must have p(~A | I0) = 0. The sum rule
expresses a primitive form of normalization for probabilities.
We noted previously that only two of the three basic
Boolean operations (logical product, logical sum, and
negation) are independent. It follows that the sum and
product rules, together with Boolean operations among
propositions, are sufficient to derive the probability of any
proposition, such as the generalized sum rule:
p(A + B | I0) = p(A | I0) + p(B | I0) − p(AB | I0).    (8)
Note here that the plus sign (+) takes on different meanings
depending on context, being a logical operator when it
relates propositions and representing ordinary addition
when applied to numbers such as probabilities. The context
will make clear the meaning; the alternative is to introduce
new mathematical notation which may have a strange look
while adding little clarity.
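The generalized sum rule (8) can be checked numerically by modeling propositions as subsets of a finite set of equally likely outcomes. The die example below is an illustrative sketch of ours, not from the paper:

```python
# Model propositions about one roll of a fair die as subsets of a
# finite set of equally likely outcomes; the probability of a
# proposition is then the fraction of outcomes that make it true.
outcomes = {1, 2, 3, 4, 5, 6}

def p(event):
    """Probability of a proposition, represented as a set of outcomes."""
    return len(event) / len(outcomes)

A = {2, 4, 6}   # 'the roll is even'
B = {4, 5, 6}   # 'the roll is greater than 3'

# Generalized sum rule (8): the logical sum is the set union and the
# logical product is the set intersection.
lhs = p(A.union(B))
rhs = p(A) + p(B) - p(A.intersection(B))
assert abs(lhs - rhs) < 1e-12
print(lhs)  # prints 0.6666666666666666 (the four outcomes 2, 4, 5, 6)
```

The subtraction of p(AB | I0) removes the double counting of outcomes that make both propositions true, here the rolls 4 and 6.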
At this point we collect the results of the last few
paragraphs and present a summary of the unique rules for
manipulating probabilities. These two simple operations
form the basis for the system of reasoning called by Cox
the algebra of probable inference:

Product Rule:

p(AB | I0) = p(A | I0) p(B | AI0)    (9a)
           = p(B | I0) p(A | BI0)    (9b)

Sum Rule:

p(A | I0) + p(~A | I0) = 1    (10)

Deductive Limits:

A is true ⇒ p(A | I0) = 1,  A is false ⇒ p(A | I0) = 0    (11)

These results may look quite familiar, since they are the
common rules that are derived in conventional treatments
of probability and statistics, where probability is defined as
the frequency of successful outcomes in a series of
repeated trials. In fact, there are several distinct axiom
systems for probability theory, beginning with the work of A.
N. Kolmogorov [19], that lead to the same formal rules for
calculation (for a discussion, see D. V. Lindley [21]). We
have chosen to follow the approach of Cox because of its
intuitive appeal and close connection with the process of
human reasoning. The logical flow from first principles has
proceeded according to:

Desiderata ⇒ Cox's two axioms ⇒ sum and product rules

The result is a general and unique system of extended
logic, an algebra of inference, that is applicable to any
situation where limited information precludes deductive
reasoning. The uniqueness should be emphasized,
because any system of reasoning in which probabilities are
represented by real numbers and which disagrees with the
sum and product rules will necessarily violate the very
elementary, common sense requirements for rationality and
consistency.

3.3 Common sense reduced to calculation

A nice demonstration of the way in which the sum and
product rules accord with common sense and reproduce
the way we reason intuitively follows from the work of A. J.
M. Garrett and D. J. Fisher [9]. Suppose that we have an
hypothesis H, with an initial probability p(H | I0) conditioned
on I0, and we then obtain new information in the form of
data D. Equating the two equivalent forms of the product
rule, (9a-b), using propositions H and D gives

p(H | DI0) = K p(H | I0) p(D | HI0),    (12)

where K⁻¹ = p(D | I0). Repeating this operation with H
replaced by ~H and dividing (12) by the resulting
expression yields:

p(H | DI0) / p(~H | DI0)
    = [p(H | I0) / p(~H | I0)] × [p(D | HI0) / p(D | ~HI0)].    (13)

Now, p(~H | I0) = 1 − p(H | I0) and p(~H | DI0) = 1 − p(H | DI0)
from the sum rule, so that replacing p(~H | I0) and p(~H | DI0)
in (13) and rearranging gives:

p(H | DI0) = {1 + [1/p(H | I0) − 1] × [p(D | ~HI0) / p(D | HI0)]}⁻¹    (14)

This is a very general result that shows how the prior (pre-data)
probability p(H | I0) changes, as a result of obtaining
data D, to yield the posterior (post-data) probability
p(H | DI0). This is just the process of learning, whereby a
state of knowledge gets updated in light of new information.
Let us explore the special cases of (14) with a particular
example. Suppose that a doctor must decide a course of
treatment for a patient whose symptoms and medical
history suggest a working hypothesis: H º 'my patient has
disease X.' A blood test for disease X is then performed,
with result D º 'the patient has tested positive for disease
X.' Before performing the test, the doctor's examination of
the patient leads him to assign an initial probability p(H | I0)
to his working hypothesis. Here, the conditioning
information I 0 includes everything relevant to the doctor's
diagnosis, including his training and experience as well as
the symptoms and medical history of the patient. What is
the effect of obtaining the positive result of the blood test?
Consider the following special cases:
1. If p(H | I0) = 1, then p(H | DI0) = 1. If the doctor is certain
that the patient has disease X before the blood test, then
the positive outcome could be anticipated a priori and
would add no useful information. In such a case, the test
itself would be unnecessary.
2. If p(H | I0) = 0, then p(H | DI0) = 0. If the doctor is certain
that the patient does not have disease X before the test,
then the data will have no effect on his state of belief. A
positive result would most likely be dismissed as a 'false
positive.' Two remarks seem relevant here. First, given that
X is deemed impossible to begin with, one wonders why a
blood test to detect it would be performed. We can also see
the danger posed by a dogmatic refusal to allow one's
beliefs to be changed by what might be highly relevant new
information.
3. If p(D | HI0) = 0, then p(H | DI0) = 0. If it were impossible
for a person with disease X to have a positive response to
the blood test, then since the patient did test positive, he
could not possibly have disease X.
4. If p(D | HI0) = p(D | ~HI0), then p(H | DI0) = p(H | I0). If
data D (here a positive blood test) is equally likely whether
H is true or not, then D is irrelevant for reasoning about H.
The doctor would learn nothing, for example, by flipping a
coin.
5. If H implies D, so that p(D | HI0) = 1, then

p(H | DI0) = p(H | I0) / [p(H | I0) + (1 − p(H | I0)) × p(D | ~HI0)].    (15)
If a positive response always results when disease X is
present, then the post-test probability p(H | DI0), given the
positive response, lies in the range p(H | I0) ≤ p(H | DI0) ≤ 1
and depends strongly on p(D | ~HI0), the probability of a
'false positive.' For a perfect test, a false positive would be
impossible [p(D | ~HI0) = 0] and a positive result would make
H certain to be true. On the other hand, if p(D | ~HI0) ≈ 1, so
that any test would be likely to yield a positive response,
then p(H | DI0) ≈ p(H | I0), and one learns almost nothing.
Expression (15) provides the quantitative generalization of
the work of Polya to which we referred at the end of Section
2.2. In the case where H implies D, we see that the effect of
learning that D is true depends, for a given state of prior
knowledge, on the probability that D is true if H is assumed
to be false.
Also note the very important role played by the prior
probability p(H | I0). If the doctor assigns p(H | I0) > 0.9
following the initial examination, then immediate treatment
for X would be indicated, with no need for a blood test. On
the other hand, if p(H | I0) ≈ 0.2, the doctor might feel
hesitant about beginning a treatment. In this case, a
positive blood test with p(D | ~HI0) = 0.05 (a 5% chance of a
false positive) would yield a post-test probability of
p(H | DI0) ≈ 0.83, and the doctor would feel comfortable in
treating the patient for disease X.
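The doctor's numbers can be reproduced directly from update rule (14). The sketch below is illustrative (the function name and the explicit handling of the zero-prior limit are ours); with p(D | HI0) = 1, as in special case 5, a prior of 0.2 and a 5% false-positive probability give a posterior near 0.83:

```python
def posterior(p_H, p_D_given_H, p_D_given_notH):
    """Update rule (14): probability of hypothesis H after observing data D."""
    # Deductive limit (special case 2): an impossible hypothesis stays
    # impossible; handled explicitly to avoid division by zero.
    if p_H == 0.0:
        return 0.0
    odds_factor = (1.0 / p_H - 1.0) * (p_D_given_notH / p_D_given_H)
    return 1.0 / (1.0 + odds_factor)

# Doctor example: prior p(H|I0) = 0.2, a test that always responds when
# disease X is present (p(D|H I0) = 1) and gives 5% false positives.
p_post = posterior(p_H=0.2, p_D_given_H=1.0, p_D_given_notH=0.05)
print(round(p_post, 3))  # prints 0.833, in agreement with the text
```

The same function reproduces the other special cases, for example a certain prior stays certain, and an uninformative test with p(D | HI0) = p(D | ~HI0) returns the prior unchanged.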
3.4 Mutually exclusive and exhaustive propositions

A very common situation arises when we have a set of N
propositions (B1, B2, ..., BN), one and only one of which can
possibly be true, conditioned on information I0. Such
propositions are said to be mutually exclusive given I0, a
condition that is written using the product rule:

p(BiBj | I0) = p(Bi | I0) p(Bj | BiI0) = 0, for i ≠ j.    (16)

It follows from (16) and repeated use of the generalized
sum rule (8) that the probability that one of the propositions
is true is given by

p(B1 + B2 + ⋯ + BN | I0) = Σ_{k=1}^{N} p(Bk | I0).    (17)

If it is further known from prior information I0 that one and
only one of the propositions is certainly true, then the
propositions are also exhaustive, so that the sum in (17)
must be equal to one:

Σ_{k=1}^{N} p(Bk | I0) = 1.    (18)

This is the general statement of normalization for a finite set
of N mutually exclusive and exhaustive propositions, a
property that occurs frequently in probability theory.

3.5 Marginal probabilities

Another very common and useful operation involving
mutually exclusive and exhaustive sets of propositions is
called marginalization, which we will illustrate by the
following example.

Suppose that a manufacturer produces a large batch of
metal spacers, dividing the task among N diamond turning
machines. The machines have been individually adjusted,
error-mapped, and characterized for machining accuracy,
so that the probability that machine k produces good
spacers may be assumed to be p(G | MkI0), where G ≡ 'the
spacer is good (within tolerance)', and Mk ≡ 'the spacer was
produced by machine k.' Because of machine and operator
variations, the spacer production rate varies from machine
to machine. By the end of a shift, machine Mk has produced
nk spacers, so that the N machines together produce a total
of n1 + n2 + ⋯ + nN spacers, which are then mixed together
and sent to inspection. If an inspector now arbitrarily
selects one of these spacers, what can he say about the
probability that it is in tolerance, before actually performing
a measurement?

We can answer this question as follows. The joint
probability that the spacer is in tolerance and that it was
produced by machine k is p(GMk | I0). From the product
rule we then have

p(GMk | I0) = p(G | I0) p(Mk | GI0) = p(Mk | I0) p(G | MkI0).    (19)

Equating these expressions and summing over the N
machines gives

p(G | I0) Σ_{k=1}^{N} p(Mk | GI0) = Σ_{k=1}^{N} p(G | MkI0) p(Mk | I0).    (20)

Now observe that the propositions Mk form a mutually
exclusive and exhaustive set, so that

Σ_{k=1}^{N} p(Mk | GI0) = 1.    (21)

The inclusion of the proposition G as a part of the
conditioning information does not alter the normalization
constraint, since the condition of the spacer does not
change the fact that it was produced by only one of the N
machines. The probability that the spacer is good is thus:

p(G | I0) = Σ_{k=1}^{N} p(G | MkI0) p(Mk | I0).    (22)

The left-hand side of (22) is called the marginal probability
of G, and we can see that it is a weighted sum over the
probabilities p(G | MkI0) for the individual machines to
produce good spacers, with each term weighted by the
probability p(Mk | I0) that the particular spacer chosen was
produced by machine k. The latter may be easily shown
(and is probably intuitively obvious to the reader) to be
equal to nk /(n1 + n2 + ⋯ + nN), the fraction of the total
number of spacers produced by machine k.
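The marginalization (22) is straightforward to carry out numerically. The following sketch uses hypothetical numbers for three machines; all values are assumptions for illustration, not data from the paper:

```python
# Hypothetical data for N = 3 diamond turning machines:
# spacers produced during the shift, and the probability p(G | Mk I0)
# that each machine produces a good (in-tolerance) spacer.
n = [400, 250, 350]          # n_k: spacers produced by machine k
p_good = [0.98, 0.95, 0.90]  # p(G | Mk I0), assumed from characterization

total = sum(n)
# p(Mk | I0): fraction of the mixed batch produced by machine k.
p_machine = [nk / total for nk in n]

# Marginal probability (22): weighted sum over the machines.
p_G = sum(pg * pm for pg, pm in zip(p_good, p_machine))
print(round(p_G, 4))  # prints 0.9445
```

The nuisance proposition Mk has been summed out: the inspector's probability that an arbitrarily chosen spacer is in tolerance no longer refers to any particular machine.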
In a problem like this the proposition Mk is called a
nuisance parameter, which means a quantity that affects
the inference and occurs in the analysis but is of no
particular interest in itself. Another example is the error of a
measuring instrument that affects the estimate of a
measured quantity but is itself unknown. Marginalization is
the way to account for the effects of nuisance parameters
by effectively averaging over all possible values.
Here F(y) is evidently a monotonic non-decreasing function
of y called a cumulative distribution function (CDF). Since
the length of any real spacer will certainly be greater than
some very small value of y and less than a very large
value, the qualitative behavior of F(y) will look similar to the
curve shown in Fig. 1
1.0
F ( y ) = p (Y £ y | I 0
4. Uncertainty and random variables
0.8
4.1 The meaning of a random variable
Since no measurement is perfect, no statement of an exact
value for a measured quantity is logically certain to be true.
Therefore our belief in a proposition such as: y º 'the length
of the spacer lies between y and y + Dy' is necessarily
uncertain no matter how well we perform a length
measurement. Consistency then requires that we
communicate the result of a measurement in the language
of probability theory, using the unique rules of the algebra
of probable inference. In order to do this, we need a
mathematical representation for a state of knowledge about
a measurand (such as the length of a spacer)
corresponding to all available information after performing a
measurement.
In the view of measurement as inference, all physical
quantities (except, of course, for defined constants such as
the speed of light in vacuum) are treated as random
variables. This may seem counter to the spirit of
deterministic metrology, because the words 'random' and
'variable' suggest an uncontrolled environment and noisy
instruments, where meaningful data can only be obtained
by repeated sampling and statistical analysis. The word
'variable', in particular, seems singularly inappropriate to
describe the result of a dimensional measurement. At the
time of its measurement, for example, the length of a metal
spacer is not a variable at all but rather an unknown
constant whose value we are trying to estimate on the
basis of given (but incomplete) information.
The issue here turns out to be purely one of semantics. In
probability theory, a random variable is defined as 'a
variable that may take any of the values of a specified set
of values and with which is associated a probability
distribution.' (GUM C.2.2). In discussing a quantity such as
length, it is important to distinguish between (a) length as a
concept (specified by a description, or definition), (b) the
length Y of a particular spacer (a random variable), and (c)
the set of values that could reasonably be attributed to Y,
consistent with whatever information is available. The result
of a measurement is only one of an infinite number of such
values that could, with varying degrees of credibility, be so
attributed. Similarly, a handbook value for a parameter
such as a thermal expansion coefficient is only one of its
possible values, given a state of incomplete information.
Probability theory, as applied to the measurement process,
is concerned with these possible values, or outcomes, and
their associated probability distributions.
4.2 Continuous probability distributions
A state of knowledge about (or degree of belief in) the
value of a quantity, such as the length of a metal spacer,
can be represented by a smooth continuous function
whose qualitative features can be derived using the sum
and product rules as follows. Denote the length of a spacer
by Y, let y be some particular value, and consider the
probability

    p(Y ≤ y | I0) ≡ F(y),   0 ≤ F(y) ≤ 1.   (23)

Figure 1. The probability p(Y ≤ y | I0) that the length Y of a spacer is less than or equal to a given length y, where y denotes position along a length axis.
Now suppose we are interested in the probability that Y lies in the interval a < Y ≤ b. Define the propositions:

    A ≡ 'Y ≤ a'
    B ≡ 'Y ≤ b'
    C ≡ 'a < Y ≤ b'.

These propositions satisfy the Boolean relation (logical sum) B = A + C, and since A and C are mutually exclusive:

    p(B | I0) = p(A + C | I0) = p(A | I0) + p(C | I0),

we have:

    p(C | I0) = p(B | I0) − p(A | I0) = F(b) − F(a) = ∫_a^b f(y) dy,

where f(y) ≡ dF(y)/dy is called the probability density function (pdf) for the possible values of Y. The qualitative behavior of the pdf for the CDF of Fig. 1 is displayed in Fig. 2.
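The relation p(C | I0) = F(b) − F(a) = ∫_a^b f(y) dy can be checked numerically. The sketch below is not from the paper: the normal CDF, the interval, and the grid size are illustrative choices, used only to show that integrating a pdf over an interval reproduces the difference of CDF values.

```python
import math

def normal_cdf(y, y0=10.0, sigma=0.5):
    """Cumulative distribution F(y) for a normal pdf centered at y0."""
    return 0.5 * (1.0 + math.erf((y - y0) / (sigma * math.sqrt(2.0))))

def normal_pdf(y, y0=10.0, sigma=0.5):
    """Probability density f(y) = dF/dy."""
    return math.exp(-(y - y0) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

a, b = 9.5, 10.5          # the interval a < Y <= b
# p(C | I0) from the CDF difference F(b) - F(a):
p_from_cdf = normal_cdf(b) - normal_cdf(a)
# The same probability from the integral of the pdf (trapezoidal rule):
n = 10_000
h = (b - a) / n
p_from_pdf = h * (0.5 * normal_pdf(a) + 0.5 * normal_pdf(b)
                  + sum(normal_pdf(a + k * h) for k in range(1, n)))
assert abs(p_from_cdf - p_from_pdf) < 1e-6
```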
The pdf f(y) = dF/dy is typically a continuous, single-peaked (called unimodal) symmetric function of location y. In order to avoid the proliferation of mathematical symbols, we will use the notation p(y | I0) = f(y), so that the probability of the proposition y ≡ 'the length of the spacer lies in the interval (y, y + dy)' will be written simply p(y | I0) dy. The identification of p(y | I0) with a probability density rather than a simple probability should be clear from the context. Also, a density function may sometimes be called a 'distribution' in accord with common parlance, and for brevity, the same symbol may be used for a quantity and its possible values.
The best estimate of the length of the spacer is, by definition, the expectation (also called the expected value or mean) of the distribution, given by:

    E(Y) = y0 ≡ ∫_{-∞}^{+∞} y p(y | I0) dy.   (24)

Figure 2. The probability density function (pdf) f(y) = dF/dy corresponding to the cumulative distribution function of Fig. 1. For this function, the best estimate (or expectation) of Y, denoted y0, corresponds to the peak in the pdf.
For a symmetric single-peaked pdf such as the one shown in Fig. 2, y0 is also the value for which p(y | I0) is a maximum, called the mode of the pdf. A useful parameter that characterizes the dispersion of plausible or reasonable values of Y about the best estimate y0 is given by the positive square root of the variance σy², where

    σy² ≡ E(Y − y0)² = ∫_{-∞}^{+∞} (y − y0)² p(y | I0) dy.   (25)

If y0 is the best estimate of Y, then it is straightforward to show that

    σy² = E(Y²) − y0².   (26)

The quantity σy is called the standard deviation of the pdf p(y | I0). The GUM defines an estimated standard deviation to be the standard uncertainty associated with an estimate y0, using the notation u(y0) ≡ σy. The uncertainty characterizes a state of knowledge and is not a physical attribute of the spacer or something that could be measured in a metrology laboratory. For this reason it makes no sense to argue about the 'true' value of the uncertainty. An expression of uncertainty is always correct when properly based on all relevant information. If two people express different uncertainties then they must be reasoning on different states of prior information or sets of prior assumptions.

In a similar way, a probability density function models a state of knowledge, and is not something that could be measured in an experiment. The function shown in Fig. 2 is the familiar normal (or Gaussian) density defined by

    p(y | I0) = (1 / (σ√(2π))) exp[−(y − y0)² / (2σ²)] ≡ N(y; y0, σ²),   (27)

where for simplicity we write σ in place of σy. As we shall see in Sec. 6.3, the normal density is a consequence of a general principle for assigning probabilities, called the principle of maximum entropy, when one's knowledge consists only of an estimate y0, together with an associated standard uncertainty σ. The normal pdf plays a central role in probability theory and measurement science.

4.3 Levels of confidence and coverage factors
In the language of the GUM, we associate a level of confidence in our knowledge of a quantity with a number k called a coverage factor. For the spacer example, with estimated length y0 and associated uncertainty σ, this is interpreted to mean that the length Y may be expected to lie in the interval y0 ± kσ with an integrated, or cumulative, probability P(k). The standard deviation (or standard uncertainty) thus sets the scale of uncertainty and is often called a scale parameter. The relation between k and P depends on the assumed functional form of the pdf, and for the normal distribution we have the well-known and often-employed values of P = [68%, 95.5%, 99.7%] for k = 1, 2, and 3, respectively. Since we are reasoning about a single, particular spacer, we point out that these probabilities have no frequency interpretation. Their magnitudes become significant: (a) in the propagation of uncertainty, where the result of some other measurement depends on the spacer length, and (b) in the context of a subsequent decision where the length of the spacer is an element of risk.

A great deal of time can be wasted in heated arguments concerning the exact form of the density p(y | I0), which describes not reality in itself but only one's knowledge about reality. It can be helpful to realize that there exists a very general and useful quantitative bounding relation on the level of confidence associated with the best estimate y0 which is independent of the detailed nature of the pdf, so long as it has finite expectation and variance and is properly normalized. The latter condition means that

    ∫_{-∞}^{+∞} p(y | I0) dy = 1.

For any such pdf we then have

    p(|Y − y0| ≥ kσ | I0) ≤ 1/k²,   (28)
a result known as the Bienaymé–Chebyshev inequality [7, 28]. From this we see, for example, that not less than 8/9 ≈ 89% of the reasonably probable values of the length of the spacer are contained in the interval y0 ± 3σ, whatever the distribution p(y | I0). Thus we suggest that there is little to be gained in debate over the exact form of the pdf. If the uncertainty σ is too large to permit a confident decision, then the proper course of action is usually to reduce uncertainty and sharpen the distribution p(y | I0) by performing an appropriate measurement.
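As a numerical illustration of the bound (28) (our own sketch, not from the paper; the normal pdf is used only as a convenient test case), the exact tail probability of a normal distribution can be compared with 1/k²:

```python
import math

def normal_tail(k):
    """p(|Y - y0| >= k*sigma | I0) for a normal pdf: 2*(1 - Phi(k))."""
    return math.erfc(k / math.sqrt(2.0))

for k in (1, 2, 3):
    bound = 1.0 / k**2          # Bienaymé-Chebyshev bound (28)
    exact = normal_tail(k)      # exact tail probability for the normal pdf
    assert exact <= bound       # the bound holds whatever the pdf

# For k = 3 the bound guarantees at least 8/9 of the probability in y0 ± 3*sigma:
assert abs((1 - 1/3**2) - 8/9) < 1e-12
```

For the normal pdf the true k = 3 tail is about 0.3%, far below the distribution-free bound of 1/9, which is exactly the paper's point: the bound is loose but requires no debate over the form of the pdf.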
[NOTE: In writing expressions such as (24) and (27), we use the formal limits of (−∞, +∞) and recognize that since physical lengths are positive, we must strictly require that p(y | I0) = 0 for y ≤ 0. In practice it is common to represent states of knowledge by pdfs such as the normal distribution that are non-zero over an infinite range. The mathematical convenience afforded by these analytic functions more than compensates for the infinitesimally small, non-zero probabilities for impossible values of physical quantities.]
5. Measurement as inference: Bayes' Theorem
Now suppose that we have a proposition H in the form of an hypothesis, and that we subsequently obtain some relevant data D. As usual we denote our prior information by I0. Writing the two equivalent forms of the product rule (9a-b):

    p(HD | I0) = p(H | I0) p(D | HI0) = p(D | I0) p(H | DI0),

and rearranging yields Bayes' Theorem:

    p(H | DI0) = p(H | I0) p(D | HI0) / p(D | I0),   (29)
which is the starting point for the system of reasoning
known as Bayesian inference. From its very trivial
derivation we see that Bayes' theorem is not a profound
piece of mathematics, being no more than a restatement of
the consistency requirement of probability theory.
Nevertheless, Bayes' theorem gives the general procedure
for updating a probability in light of new, relevant
information, and is a modified form of (14) in which only the
hypothesis H appears, and not its negation.
Before we obtain data D, the degree of belief in hypothesis H, conditioned on information I0, is represented by the prior probability p(H | I0). When we learn of the data D, the prior probability is multiplied by the ratio on the right side of (29) to yield the posterior probability p(H | DI0). The quantity p(D | HI0) is called the likelihood of H given the data D, and is viewed as the probability of obtaining the data if the hypothesis is assumed to be true. The denominator p(D | I0) has no special name, although it is sometimes called the global likelihood. It is equal to the probability of obtaining the data whether H is true or not, and can be written as a marginal probability using the sum rule:

    p(D | I0) = p(D | HI0) p(H | I0) + p(D | ~H I0) p(~H | I0).   (30)
Since p(D | I0) is a constant, independent of H, Bayes' theorem is commonly written in the form

    p(H | DI0) = K p(H | I0) p(D | HI0),   (31)
with K equal to a normalization constant. In a typical
measurement problem, H stands for a proposition
concerning a dimension of interest and D represents the
measurement data. The likelihood is then equal to the
probability of obtaining the data D as a function of an
assumed dimension specified in H. The way in which the
result of the measurement affects our degree of belief in H
is completely contained in the likelihood function.
To illustrate how Bayes' theorem is used in dimensional
metrology, let us consider a very simple one-dimensional
example in which a linear indicator is used to measure the
length of a metal spacer. Assume that we have just
manufactured such a spacer and that we need to measure
its length in order to make a decision as to whether or not it
is acceptable. Before performing the measurement, our
knowledge of the length of the spacer is described by a prior pdf p(y | I0), where as before p(y | I0) dy is the probability that the length of the spacer lies in the interval (y, y + dy). The width of the prior pdf, as characterized by its variance σp², is a measure of our uncertainty in the length of the spacer, with the best estimate of the length, yp, corresponding to the expectation of the distribution.
Usually we would have only limited information about the spacer, conditioned primarily by our understanding and experience with the production process, with such vague knowledge reflected in a broad prior distribution. This is not a weakness of the approach but rather its motivation: the whole purpose of performing the measurement is to sharpen this broad distribution, refine our knowledge, and reduce our uncertainty with respect to the length of the spacer.

We now measure the length of the spacer as illustrated in Fig. 3. Using a linear indicator we take a pair of readings before and after insertion of the spacer as shown.

Figure 3. The length of a metal spacer is measured using a linear indicator. The result of the measurement is the estimate ym.
The difference in the two indicator readings is the result of the measurement ym. The probability that a spacer of actual length y would yield measurement data ym is just the likelihood function p(ym | yI0), whose width, as characterized by its variance σm², is a measure of the quality of the measurement process (here, the linear indicator). This is where experimental design enters the picture, because we want the likelihood to be sharply peaked about the actual length of the spacer. We then use Bayes' theorem to find the updated (posterior) probability distribution that describes our knowledge of the length of the spacer after performing the measurement:

    p(y | ym I0) = K p(y | I0) p(ym | yI0),   (32)

where K⁻¹ = p(ym | I0) = ∫_{-∞}^{+∞} p(ym | yI0) p(y | I0) dy.
This process is illustrated in Fig. 4, where we sketch the
qualitative forms of the relevant distributions. When the
likelihood is sharply peaked relative to the prior (pre-data)
distribution, the posterior (post-data) distribution will be
dominated by the peak in the likelihood, so that the exact
form of the prior distribution becomes irrelevant. This is
almost always the case for common engineering
measurements, where the measurement process is arranged so that σm² << σp² (sharply peaked likelihood).
Under these conditions, the prior distribution will be nearly
constant in the region where the likelihood is appreciable,
and essentially all knowledge of the measurand (here, the
length of the spacer) derives from the measurement data.
For such a locally uniform prior probability, Bayes' theorem
thus reduces to the approach known as maximum
likelihood, so-called because the best post-data estimate of
the value of the measurand coincides with the peak in the
likelihood function.
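A minimal numerical sketch of the update (32), assuming for concreteness that both the prior and the likelihood are normal pdfs (the paper treats their shapes only qualitatively, and the lengths and variances below are hypothetical): the product of two Gaussians in y is again Gaussian with a precision-weighted mean, and a sharply peaked likelihood (σm² << σp²) dominates the posterior, as described above.

```python
def gaussian_posterior(y_p, var_p, y_m, var_m):
    """Combine a Gaussian prior N(y_p, var_p) with a Gaussian likelihood
    centered on the indicator reading y_m with variance var_m, via Bayes'
    theorem (32). Precisions (inverse variances) add, and the posterior
    mean is the precision-weighted average of prior and data."""
    w_p, w_m = 1.0 / var_p, 1.0 / var_m
    var_post = 1.0 / (w_p + w_m)
    y_post = var_post * (w_p * y_p + w_m * y_m)
    return y_post, var_post

# Broad prior from process knowledge, sharp likelihood from the indicator:
y_post, var_post = gaussian_posterior(y_p=10.0, var_p=0.5**2,
                                      y_m=10.20, var_m=0.005**2)
# With var_m << var_p the posterior is dominated by the measurement,
# reproducing the maximum-likelihood behavior described in the text:
assert abs(y_post - 10.20) < 1e-3
assert var_post < 0.005**2
```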
Figure 4. In a typical engineering measurement
such as measuring the length of a metal spacer,
the (post-data) posterior distribution is dominated
by a sharply peaked likelihood function. The best
estimate of the spacer length, y m , then very
nearly coincides with the peak in the likelihood,
and the prior (pre-data) distribution becomes
irrelevant. The curves are not to scale.
A common source of systematic error in such a length
measurement is a possible scale error in the linear
indicator. In order to correct for this error, we can perform a
calibration using a gauge block (length standard) whose estimated length yg is known to within a small uncertainty σg. In the case of a calibration, the measurand is the error, and Bayes' theorem is written:

    p(e | em I0) = K′ p(e | I0) p(em | eI0),   (33)

where K′ is a constant, e ≡ 'the indicator systematic error lies in the range (e, e + de),' and em is the result of the measurement, given by the difference between the indicator data and the estimated length of the standard: em = ym − yg. The prior distribution p(e | I0) is typically symmetric about zero in the absence of any a priori knowledge about the sign of the systematic error. The likelihood p(em | eI0) will be sharply peaked because of the small uncertainty in the length of the standard. Again, the posterior distribution for the indicator systematic error is dominated by the peak in the likelihood and whatever is known a priori becomes irrelevant. This situation is illustrated in Fig. 5.
Measurement and calibration are thus seen to be
complementary operations in Bayesian inference. The
mechanics of taking the data are exactly the same in both
cases but we are asking different questions. In a
measurement we focus on the length of a workpiece, in a
calibration on the systematic error of an indicator. The
mathematics is the same, the only differences being in the
identification of the measurand and the nature of the prior
information. The calibration/measurement process relies on the ordering σg² << σm² << σp².
Figure 5. Calibration of a linear indicator using a
gauge block. The measurand is now the systematic
error of the indicator, and the sharply-peaked
likelihood reflects the low uncertainty in the length
realized by a gauge block.
The GUM makes no reference to a prior probability
distribution for a measurand (while encouraging the use of
assumed a priori distributions to describe knowledge of the
input quantities upon which the measurand depends). From
a theoretical point of view this has to be regarded as
inconsistent. Operationally, it amounts to an implicit
assumption of a uniform (constant) distribution to describe
prior knowledge of the measurand, with the best estimate
to be supplied by the measurement data via the likelihood
function.
6. The assignment of probabilities
The sum and product rules, together with Bayes' theorem,
are the unique algebraic tools for working with and
manipulating probabilities, but the question remains of how
to assign prior probabilities in the first place in order for a
calculation to get started. Since probabilities represent (or
encode) states of knowledge or degrees of reasonable
belief, what is needed are principles by which whatever
information is available can be uniquely incorporated into a
probability distribution. This problem is addressed in the
GUM, for variables other than the measurand, where such
distributions are called a priori probability distributions, with
associated variances whose positive square roots are
called Type B standard uncertainties.
There is no easy way to assign a real number to the probability of an uncertain proposition such as A ≡ 'there is life on Mars', but for the quantities of interest in engineering
metrology the International System of Units (SI) provides a
set of location parameters that makes such assignment
possible. These parameters are the continuous variables
such as position or mass, with respect to which we can
order degrees of belief and over which we can sum discrete
probabilities or integrate probability densities in order to
effect normalization.
There are three principal theoretical approaches to the
consistent assignment of prior probabilities in problems of
engineering metrology. By 'consistent' we mean that two
persons with the same state of knowledge should assign
the same probabilities. There is really no conceptual
difference between assigning a prior probability distribution
for a measurand before performing a measurement, and
evaluating the likelihood function for the measurement
process after the data is in hand. Both operations yield
probability distributions that describe degrees of belief and
both require the exercise of judgment, insight, knowledge,
experience, and skill. In the final analysis it should be
recognized that the limiting uncertainty of a measurement
cannot be gleaned from anything in the measurement data
itself, nor can the error be known in the sense of a logical
deduction.
6.1 The representation of ignorance
Since a probability distribution for a quantity of interest
encodes what is known about the quantity, it is interesting
to ask for the distribution that describes a state of complete
ignorance. For example, suppose that a long metal bar is
engraved with a single ruled line whose position along the
bar is unknown. Here our state of knowledge consists
simply of the line's existence, with no information that would
lead us to favor any location over any other. How can we
represent this state of ignorance? We reason as follows:
denote position along the bar by x, and let f(x) dx be the probability that the line lies in the interval (x, x + dx). Ignorance of location then suggests that the probability should be invariant with respect to the translation x → x′ = x + a, where a is an arbitrary constant. Thus the density f(x) should satisfy

    f(x) dx = f(x′) dx′,   (34)

and since dx′ = dx, we have f(x) = f(x + a), which implies that

    f(x) = constant.   (35)
Thus the probability density that describes ignorance of a
location parameter, such as the position of the ruled line or
the magnitude of an error, is the uniform density.
Now suppose that there are two lines ruled on the metal bar, thus forming a line scale, and that we are interested in the length L between them. The probability that the length lies in the interval (L, L + dL) is written g(L) dL. Suppose that we are completely ignorant of the line spacing, in the sense that we have no definite scale for the unit of length. We can imagine drawing a graph of g(L) versus L, using some local, arbitrary unit of length. Another metrologist, perhaps using a photograph of the line scale, might draw a graph in different units, g(L′), where L′ = bL, with b equal to an unknown scale factor. If the two states of knowledge (or more correctly, ignorance) are to be the same, then we should assign the same probability to equivalent intervals on the two graphs. That is, we should require that g(L′) dL′ = g(L) dL, with L′ = bL, so that:

    g(bL) d(bL) = g(L) dL.   (36)

Thus we require that g(bL) = (1/b) g(L), so that the probability density g(L) is given by

    g(L) = 1/L.   (37)
A parameter such as the line spacing that is known a priori to be positive is called a scale parameter. Another scale parameter is the standard deviation of a probability distribution for the error of a length measurement. We have shown that the invariant density that represents ignorance of a scale parameter is the reciprocal density g(L) = 1/L. This is a strange looking probability density that appears more reasonable if we write the equivalent forms

    g(L) dL = dL/L = d(ln L),   (38)

so that requiring g(bL) d(bL) = g(L) dL is equivalent to the statement that

    d(ln L) = constant.   (39)

Thus ignorance of a scale parameter is represented by a uniform distribution of the logarithm of the parameter.
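The scale invariance of the reciprocal density (37) can be checked directly. In this sketch (our own illustration; the intervals and scale factors are arbitrary), the probability mass ∫ dL/L = ln(b/a) assigned to an interval is unchanged by a change of units L′ = bL, while a uniform density fails this test:

```python
import math

def mass_reciprocal(a, b):
    """Unnormalized probability mass of the reciprocal density g(L) = 1/L
    on the interval [a, b]: the integral of dL/L is ln(b/a)."""
    return math.log(b / a)

# A change of units L' = scale*L maps the interval [2, 3] to [2*scale, 3*scale];
# ignorance of scale demands the same mass on both intervals.
for scale in (0.1, 7.0, 1000.0):
    m1 = mass_reciprocal(2.0, 3.0)
    m2 = mass_reciprocal(2.0 * scale, 3.0 * scale)
    assert abs(m1 - m2) < 1e-12

# A uniform density, by contrast, assigns a mass proportional to interval
# width, which is not invariant under a change of units:
assert (3.0 - 2.0) != (3.0 * 7.0 - 2.0 * 7.0)
```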
The results given by (35) and (39) for the prior densities
representing ignorance for location and scale parameters
were originally proposed by Jeffreys [16], using heuristic
plausibility arguments. They were subsequently placed on
a firm theoretical foundation by Jaynes [14], who invoked a
'desideratum of consistency' to express the reasonable
requirement that in two problems where we have the same
information, we should assign the same probabilities. In the
case of complete ignorance, where the parameters have infinite range (−∞ < x < +∞ and 0 ≤ L < +∞), the prior probability densities (35) and (39) cannot be normalized, since the corresponding integrals are undefined. Such prior distributions are called improper priors and have been the subject of much controversy and criticism, since a non-normalizable function can obviously not represent a probability density. In response, we make several
observations. First, in almost any real application using
Bayes' theorem, the prior distribution occurs in both the
numerator and denominator, and so cancels out of the
calculation. In such a case, the fact that we might be using
an improper prior becomes moot. Next, in the real world of
engineering metrology we are never completely ignorant in
the mathematical sense. As previously argued, the length
of a real workpiece, such as a metal spacer, will certainly
be greater than some definite small value and less than a
definite large value, so that the relevant probability density
will vanish outside of such finite limits, and the
normalization integral will always converge to unity. In the unusual case where the posterior distribution itself turns out to be improper, this fact should serve as a warning that there is not enough information in the measurement data to be able to make a confident inference with respect to the measurand.
In spite of the mainly theoretical problems with improper
priors, they are useful in real problems as labor-saving
devices when the exact finite limits of the relevant prior
densities make no resolvable difference in the calculations.
6.2 Symmetry and the principle of indifference
Consider a discrete collection of n propositions (A1, …, An) that form an exhaustive and mutually exclusive set on prior information I0. Furthermore, suppose that there is nothing in information I0 that would lead us to believe that any one of the propositions was more or less probable than any other. In such a case we must then have p(Aj | I0) = p(Ak | I0) for any pair of propositions Aj, Ak. If this were not the case, then by simply permuting the numbering scheme of the propositions we could demonstrate two problems, each with the same prior information but with different probability assignments. The assignment of equal probabilities in this case is perhaps intuitively obvious given the symmetry of the situation, and employs what is often called the principle of indifference, a term introduced by J. M. Keynes [18].

Now since Σ_{k=1}^{n} p(Ak | I0) = 1 (exhaustive constraint), and since all of the probabilities p(Ak | I0) are equal, we have necessarily:

    p(Ak | I0) = 1/n,   k = 1, …, n.   (40)
The result (40) is perhaps the oldest and most familiar of all
probability assignments. It will appear as a special case of
the principle of maximum entropy to be described in the
next section, but we chose to introduce it separately
because of its importance in probability theory. The
principle of indifference leads, of course, to the equal a
priori probabilities that characterize games of chance such
as drawing cards or rolling dice. Note, however, that the 1/n
probability assignment is a logical consequence of the sum
and product rules of probability theory applied to a set of
exhaustive and mutually exclusive propositions, given a
particular state of prior knowledge. There is no need to
imagine an infinite set of repeated experiments and an
imagined distribution of limiting frequencies. Of course
given the probabilities, it is a straightforward procedure to
calculate the expected frequency of any particular outcome
in a set of repeated trials, and thus to compute, for
example, the familiar odds of the gambler. Such
calculations are developed in great detail in most books on
probability and statistics.
The uniform 1/n discrete probability distribution can be usefully employed to characterize ignorance of a physical dimension, such as the length Y of a metal spacer. We choose an interval [ymin, ymax] that is certain, based on engineering judgment, to contain the length Y, and we divide this interval into a large number n of discrete lengths (y1, …, yn). Here n is chosen so that the discrete lengths yk are separated by less than the measurement resolution. A state of knowledge about the length of the spacer can now be represented by the discrete probability distribution (p1, …, pn) where pk ≡ p(Y = yk | I0). If now our prior information I0 consists only of knowledge of the interval [ymin, ymax] together with an enumeration of the possible lengths (y1, …, yn), then the only consistent and unbiased probability assignment is the uniform distribution (p1, …, pn) = (1/n, …, 1/n).
6.3 The principle of maximum entropy
Since probabilities represent states of knowledge, it is
useful and productive to think about the information content
of a probability distribution for a physical quantity. In this
view, an accurate measurement supplies missing
information that sharpens a vague, poorly informative prior
distribution. Said a different way, the information provided
by a measurement serves to reduce uncertainty with
respect to the value of an unknown quantity, such as the
length of a metal spacer. In the interpretation of the GUM,
what we call 'uncertainty' is just the standard deviation of
the probability distribution that describes the distribution of
values of a quantity that are reasonable or plausible in the
sense of being consistent with whatever is known (or
assumed) to be true. This kind of uncertainty we might call
'location uncertainty' because the standard deviation is a
characteristic measure of the region about the expectation
of the distribution in which there is an appreciable
probability that the value of the quantity is located.
If we think more carefully about this, however, we can see
that the GUM-type of location uncertainty is useful and
realistic only for particular states of knowledge. To illustrate,
suppose that an inspector has two highly repeatable length
gauges of identical quality, except for the fact that one of them has a significant zero offset z0, while the other has a negligible offset.
Figure 6. A bi-modal probability distribution for the length of a spacer measured using a gauge with one of two possible zero offsets, zero or z0. The actual offset is unknown. If the peaks are very narrow relative to their separation, the combined standard uncertainty of the measurement (standard deviation of the distribution) is approximately equal to z0/2.
The inspector proceeds to measure the length of a metal
spacer, but fails to record which of the two gauges was
used for the measurement. In this case the measurement
process would yield a doubly-peaked (or bi-modal) probability distribution, with the two peaks separated by the unknown gauge offset z0, as shown in Fig. 6. If the other uncertainty components were negligible, the two peaks would be very narrow and the combined standard uncertainty (standard deviation of the distribution) would be well-approximated by z0/2.
Several features of this situation should be noted. First we see that the standard deviation z0/2 is a measure of the width of the region between the two peaks of the
distribution, over most of which there is a negligible
probability of containing the true length of the spacer. The
expectation of the distribution, in particular, lies in the
center of this low probability region. From this we see that
the GUM identification of a best estimate with an
expectation is useful only for certain types of probability
distributions, and that an estimated standard deviation may
not be the best uncertainty parameter in all cases. In
particular, we see that should the unknown zero offset
increase, so would the combined standard uncertainty,
together with the inclusion of more and more highly
improbable values for the spacer length.
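The z0/2 approximation can be verified with a small calculation. In this sketch (our own illustration, with hypothetical lengths in mm), the variance of an equal-weight mixture of two narrow peaks at separation z0 is σpeak² + (z0/2)², so the standard deviation approaches z0/2 as the peaks narrow:

```python
import math

def mixture_std(y1, y2, sigma_peak):
    """Standard deviation of an equal-weight mixture of two narrow Gaussian
    peaks centered at y1 and y2, each of width sigma_peak, computed from
    the variance identity sigma^2 = E(Y^2) - [E(Y)]^2 of Eqs. (25)-(26)."""
    mean = 0.5 * (y1 + y2)
    second_moment = 0.5 * (y1**2 + sigma_peak**2) + 0.5 * (y2**2 + sigma_peak**2)
    return math.sqrt(second_moment - mean**2)

z0 = 0.050   # hypothetical zero offset, mm
s = mixture_std(10.000, 10.000 + z0, sigma_peak=0.0005)
# With peaks narrow relative to their separation, the combined standard
# uncertainty is approximately z0/2, as stated in the Fig. 6 caption:
assert abs(s - z0 / 2) / (z0 / 2) < 0.001
```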
Now notice that there is a sense in which increasing the gauge offset error z0 adds no additional uncertainty at all. If we asked 'Which of the two gauges was used to perform the measurement?', and somehow managed to obtain this information, then the spacer length probability distribution would collapse via Bayes' theorem to a single narrow peak, and the length of the spacer would be known with high accuracy. This operation is clearly independent of z0, depending only on our knowing that the probability distribution has two narrow peaks, independent of their separation. The
information supplied by the answer to our question
decreases our uncertainty about the length of the spacer,
just as might be accomplished by repeating the
measurement with a gauge of known offset. This suggests
that there is another way to think about the uncertainty of a
probability distribution that depends only on the form of the
distribution itself and not on the actual values of the
quantity described by the distribution. Such an approach
leads to the concept of entropy.
Consider again a set (y1, …, yn) of possible lengths of a spacer, with a corresponding discrete probability distribution (p1, …, pn). We have argued that a state of complete ignorance as to the length of the spacer is represented by the uniform distribution (p1, …, pn) = (1/n, …, 1/n), and it seems intuitively reasonable that the uniform distribution describes a state of maximum uncertainty. Now imagine a contrasting situation in which we know for certain that the length of the spacer is Y = yk, so that pk = 1 and pj = 0, j ≠ k. A plot of the distribution (p1, …, pn) versus index number j would display a single spike at j = k with unit probability and zeros everywhere else. Since the length of the spacer in this case is known, we have zero uncertainty in the sense of needing no more information in order to decide the length state of the spacer, and our certainty is reflected in the sharply spiked probability distribution.
We see here how the shape of the probability distribution
encodes general properties that we identify with information
and uncertainty. This raises the interesting question as to
whether there exists some unique function of the distribution (p1, …, pn) that might serve as a numerical measure of the amount of information (in a sense to be
described) needed to reduce a state of incomplete
knowledge to a state of certainty. Such a function, called
the entropy of the distribution, was found by C. E. Shannon
[29] in the context of communication theory. We proceed to
sketch the arguments that lead to the mathematical form of
the entropy function.
Given a discrete probability distribution (p_1, …, p_n), we
seek a function H(p_1, …, p_n) that will serve to measure
information uncertainty (in contrast to the location
uncertainty as measured by a standard deviation).
Following Shannon, we require the function H, if it exists, to
satisfy the following reasonable conditions:
Condition 1. H(p_1, …, p_n) should be a continuous function
of the probabilities (p_1, …, p_n).
Condition 2. If all of the probabilities are equal, so that
p_k = 1/n for all k, then H(1/n, …, 1/n) should be a
monotonically increasing function of the positive integer n.
More choices should mean more uncertainty.
Figure 7. Illustrating the grouping of inferences.
The information uncertainty should be the same in
both cases. In (b), the uncertainty associated with
the choice of p_2 or p_3 occurs with probability
q = p_2 + p_3.
Condition 3. If a problem is reformulated by grouping
subsets of the probabilities and calculating the uncertainty
in stages, the final result must be the same for all possible
groupings. This is a consistency requirement.
We illustrate Condition 3 by example (see Fig. 7). Consider
a problem in which there are three possible inferences with
probabilities (p_1, p_2, p_3) as shown in Figure 7(a). The
information uncertainty is H(p_1, p_2, p_3). Now suppose that
we proceed in two steps by grouping the inferences as
shown in Fig. 7(b). The first step involves the choice of
either p_1 or q = p_2 + p_3, with an uncertainty of H(p_1, q).
Then, with probability q, there will be an additional
uncertainty associated with the choice of either p_2 or p_3 in
the amount of H(p_2/q, p_3/q). Shannon's Condition 3 then
requires that the information uncertainty be the same in
both cases:

H(p_1, p_2, p_3) = H(p_1, q) + qH(p_2/q, p_3/q).   (41)
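The grouping property (41) can be checked numerically in a few lines; a minimal sketch in Python, with arbitrary illustrative probabilities:

```python
import math

def H(*p):
    """Shannon entropy of a discrete distribution (natural logs, K = 1)."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Arbitrary illustrative probabilities for the three inferences of Fig. 7.
p1, p2, p3 = 0.5, 0.3, 0.2
q = p2 + p3

direct = H(p1, p2, p3)                      # uncertainty computed in one step
staged = H(p1, q) + q * H(p2 / q, p3 / q)   # grouped as in Fig. 7(b)
print(math.isclose(direct, staged))         # True: both stagings agree
```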
Shannon generalized the result (41) to derive a functional
equation for H(p_1, …, p_n) and then showed that the unique
solution for the measure of information uncertainty, called
the entropy of the distribution (p_1, …, p_n), is given by

H(p_1, …, p_n) = −K ∑_{i=1}^{n} p_i log p_i.   (42)

In this expression K is a positive constant that depends on
the base of the logarithms. Such a choice is arbitrary, so
we simplify by setting K = 1 and writing for the entropy

H(p_1, …, p_n) = −∑_{i=1}^{n} p_i log p_i.   (43)
The entropy H of (43) behaves quantitatively as we might
expect from a measure of uncertainty. If one of the
probabilities is equal to one and the rest equal to zero (a
state of certainty), then
H(p_1, …, p_n) = H(0, 0, …, 1, …, 0) = 0,   (44)

while the uniform distribution, p_k = 1/n for all k, has
entropy

H(1/n, …, 1/n) = log n,   (45)
which is the maximum value of H. The logarithmic
dependence of the entropy on the number of equally-likely
choices can be understood most easily in base-2 binary
logic. The answers to N 'yes/no' questions (i.e. N 'bits' of
information) would be sufficient to uniquely specify one of
n = 2^N possibilities, so that the entropy is H = log_2 n = N.
As the number of possibilities increases exponentially, the
entropy increases only linearly, so that, for example,
deciding among twice as many possibilities requires only
one more bit of information.
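These limiting behaviors, eqs. (44) and (45), are easy to confirm numerically; a short sketch in Python using base-2 logarithms, so that H is measured in bits:

```python
import math

def entropy_bits(p):
    """Entropy of a discrete distribution with base-2 logs (bits)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

n = 8
spike = [0.0] * n
spike[3] = 1.0             # certainty: a single spike, as in eq. (44)
uniform = [1.0 / n] * n    # complete ignorance, as in eq. (45)

print(entropy_bits(spike), entropy_bits(uniform))
# certainty gives 0 bits; ignorance gives log2(8) = 3 bits (three yes/no questions)
```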
In the case of a continuous probability distribution for a
parameter such as the length of a spacer, where prior
ignorance is described by a uniform distribution, the
entropy becomes

H = −∫ p(y|I_0) log p(y|I_0) dy,   (46)
where the integral is over all possible values of the length.
There is a close connection between entropy in the sense
of information and uncertainty and the entropy of statistical
mechanics. In fact, all of equilibrium statistical mechanics
can be viewed as an exercise in probable inference with
respect to the unknown microscopic state of a
thermodynamic system, when our information consists only
of estimates of a few macroscopic variables such as
temperature and pressure. The interested reader should
see, for example, the pioneering work of Jaynes [15] and
the excellent introductory text by Baierlein [1].
The entropy is a unique measure of uncertainty, in the
sense of missing information, with respect to a state of
nature. Our natural desire for objectivity and freedom from
bias would therefore suggest that among all possible prior
distributions that might describe knowledge of a
measurement variable, we should choose the one that
maximizes the entropy in a way that is consistent with
whatever is known (or assumed) to be true. This is the
principle of maximum entropy (PME). The resulting
probability distribution then reproduces what we assume to
be true while distributing the remaining uncertainty in the
most honest and unbiased manner. At the same time, PME
is a procedure that satisfies our desire for consistency in
the sense that two persons with the same information (state
of knowledge) should assign the same probabilities. Jaynes
[14] has described the maximum entropy distribution as
being 'maximally noncommittal with regard to missing
information' and has also observed that this distribution '...
is the one which is, in a certain sense, spread out as much
as possible without contradicting the given information, i.e.,
it gives free rein to all possible variability of [the unknown
quantity] allowed by the constraints. Thus it accomplishes,
in at least one sense, the intuitive purpose of assigning a
prior distribution; it agrees with what is known, but
expresses a 'maximum uncertainty' with respect to all other
matters, and thus leaves a maximum possible freedom for
our final decisions to be influenced by the subsequent
sample data.'
The mathematical procedure that underlies the PME is one
of constrained maximization, which seeks to maximize the
entropy (either the discrete or continuous form, as
appropriate) subject to constraints on the probability
distribution imposed by prior information, using the method
of Lagrange multipliers [1, 13, 30, 35]. The example of the
metal spacer will serve to illustrate the procedure for
particular states of available information.
Suppose that we are certain, based on engineering
judgment and the known properties of a production
process, that the length of a spacer is contained in the
interval [y_min, y_max]. Such knowledge constrains the
distribution p(y|I_0) via the normalization requirement:

∫_{y_min}^{y_max} p(y|I_0) dy = 1.   (47)

Maximizing the entropy (46) subject to the constraint (47)
then yields the rectangular, or uniform, density given by

p(y|I_0) = 1/(y_max − y_min)   (48)
in the allowed range of y, and zero otherwise. We could
have guessed this simple distribution based on the
symmetry of the situation, but it is instructive to see how the
PME works with such meager information.
In many cases we may have a prior estimate y_0 of the
length, together with an estimated variance σ_y², related to
p(y|I_0) by (24) and (25). We might know, for example,
that the spacer was produced by a reliable machine or
process with a well-characterized production history.
Maximizing the entropy subject to these constraints,
together with the normalization requirement of (27), yields
the normal (or Gaussian) density:

p(y|I_0) = (1/(σ_y√(2π))) exp[−(y − y_0)²/(2σ_y²)] = N(y; y_0, σ_y²).   (49)
This is a very important and useful result. Prior information
about the length of the spacer might be based not on the
known characteristics of a machine or production process
but rather on the result of a previous measurement,
perhaps performed by a supplier. If the supplier follows the
recommendations of the GUM, the result of the
measurement will be reported in the form
Y = y_0 ± k·u_c(y_0), where k is a coverage factor and the
combined standard uncertainty u_c(y_0) is an estimated
standard deviation of the probability distribution that
characterizes the supplier's measurement process. Given
only this information, the best prior probability assignment
(being least informative in the sense of the PME) for
encoding knowledge of the length of the spacer is just
p(y|I_0) = N(y; y_0, u_c²). Thus the normal distribution, rather
than being an unwarranted assumption, is the least biased
and 'maximally noncommittal' of all distributions for given
mean and variance. Consistency would then require
anyone using the supplier's measurement result to assign
the same normal distribution.
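The 'maximally noncommittal' character of the normal density can also be seen numerically. Among densities with a given variance σ², the Gaussian has the largest differential entropy, (1/2)ln(2πeσ²), while, for example, a rectangular density with the same standard deviation (half-width √3·σ) has the smaller entropy ln(2√3·σ). A sketch, where the value of sigma is an arbitrary illustration:

```python
import math

sigma = 0.5  # an illustrative standard uncertainty (arbitrary units)

# Differential entropies of two densities sharing the same variance sigma^2:
h_normal = 0.5 * math.log(2 * math.pi * math.e * sigma**2)  # N(y; y_0, sigma^2)
h_uniform = math.log(2 * math.sqrt(3) * sigma)  # rectangular with std dev sigma

print(h_normal > h_uniform)  # True: the Gaussian is the maximum-entropy choice
```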
7. The ubiquitous normal distribution
The normal, or Gaussian, distribution has a very special
status in probability theory and measurement science. In
this section we describe some of the reasons for the
ubiquitous occurrence of this particular distribution.
7.1 The central limit theorem
When many small independent effects combine additively
to affect either a production process or a set of repeated
measurements, the resultant frequency distributions
(histograms) of either the workpiece errors or the
measurement results will usually be well approximated by
normal distributions. The central limit theorem (CLT)
provides a theoretical basis for modeling this behavior,
under very general and non-restrictive assumptions about
the various probability distributions that characterize the
individual effects. The CLT is a general result in the theory
of random variables. Without attempting a formal proof, the
CLT says that if Z is the sum Z = X_1 + … + X_n of n
independent random variables X i , each of which has finite
mean and variance, with none of the variances significantly
larger than the others, then the distribution of Z will be
approximately normal, converging towards a normal
distribution for large n. In practical applications, 'large n'
may mean n no greater than three or four.
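This rapid convergence is easy to see in a short Monte Carlo sketch, summing only n = 4 uniform variables (the seed and sample sizes are arbitrary choices):

```python
import math
import random
import statistics

random.seed(1)
n = 4          # number of summed effects; 'large n' can be this small
N = 100_000    # Monte Carlo trials

# Each Z is the sum of n independent uniform(0, 1) effects.
z = [sum(random.random() for _ in range(n)) for _ in range(N)]

mean = statistics.fmean(z)    # expect about n/2 = 2.0
var = statistics.variance(z)  # expect about n/12 = 0.333...

# Fraction of results within one standard deviation of the mean;
# a normal distribution would give about 0.683.
frac = sum(abs(x - mean) < math.sqrt(var) for x in z) / N
print(mean, var, frac)
```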
7.1.1 Gaussian sampling and Type A uncertainties
There is perhaps no source of measurement uncertainty
more basic and fundamental than that caused by getting a
different answer every time a measurement is repeated.
The CLT suggests a useful and realistic model of a noisy,
non-repeatable measurement procedure. Consider, for
example, a well-calibrated electronic indicator used to
measure the length of a metal spacer. A set of n repeated
measurements yields a data set of indicator readings
D ≡ {y_1, y_2, …, y_n}, where each reading is equal to the
length plus an error that fluctuates from reading to reading.
Guided by the CLT we assume that each error is the sum
of a large number of small random (meaning unpredictable)
errors, and model the procedure as repeated sampling
from a normal frequency distribution with an expectation (or
mean) μ_y ≡ μ and a standard deviation σ that
characterizes the measurement process repeatability.
In many situations, the standard deviation σ may be
known from prior experience with the process. The post-data
(posterior) distribution for the spacer length then
follows from Bayes' theorem:
p(μ|DI_0) = K p(μ|I_0) · p(D|μI_0),   (50)

where the prior information I_0 includes the known value of
σ. In (50), K is a normalization constant and p(μ|I_0) is the
prior distribution for μ, which we assume to be constant (a
uniform density), corresponding to knowing little about the
value of μ a priori. The last factor on the right side of (50) is
the likelihood function:

p(D|μI_0) = p(y_1 … y_n|μI_0).   (51)
We assume that the sequential measurements are
independent, which means that the probability of obtaining
datum y i does not depend upon the results of previous
measurements. For the first two samples, using the product
rule, we then have:
p(y_1 y_2|μI_0) = p(y_1|μI_0) p(y_2|y_1 μI_0)
             = p(y_1|μI_0) p(y_2|μI_0).   (52)

Independence means, according to Cox, that knowledge of
y_1 is irrelevant for reasoning about y_2.
Now by definition of the model, the probability of obtaining a
particular indicator reading y_i is given by the normal
distribution:

p(y_i|μI_0) = (1/(σ√(2π))) exp[−(y_i − μ)²/(2σ²)].   (53)
Repeated use of the product rule then yields for the
likelihood:

p(D|μI_0) = p(y_1|μI_0) … p(y_n|μI_0) ∝ exp[−(1/(2σ²)) ∑_{i=1}^{n} (y_i − μ)²].   (54)

Now since ∑(y_i − μ)² = ∑(y_i − ȳ)² + n(ȳ − μ)², with
ȳ = ∑y_i/n (the sample mean), and since the first term is
fixed, given the data, the likelihood becomes

p(D|μI_0) ∝ exp[−(1/2)((ȳ − μ)/(σ/√n))²].   (55)

Using this result for the likelihood in Bayes' theorem (50),
with a constant (uniform) prior distribution, we have finally:

p(μ|DI_0) ∝ exp[−(1/2)((ȳ − μ)/(σ/√n))²].   (56)

The post-data distribution for the expectation μ is seen to
be a normal distribution centered at the best estimate
μ_est = μ_0 = ȳ, with standard deviation (GUM Type A
standard uncertainty) u(μ_0) = σ/√n. This familiar result is
called a maximum likelihood estimate, which is seen to be
no more than Bayesian inference in the case of a uniform
prior distribution and Gaussian sampling distribution
(likelihood function). This is an example of the way in which
probability theory as extended logic reproduces the results
of traditional statistical sampling theory when warranted by
the available information.

The case where σ is unknown is straightforward but more
complicated, so that we simply state the results. For details,
see references [2,16]. Using a constant prior density for μ
and Jeffreys' log-uniform prior density for σ, Bayes' theorem
leads to a posterior distribution for μ given by Student's
t-distribution for n − 1 degrees of freedom. The best estimate
μ_0 is again given by the sample mean ȳ, with variance (in
the notation of the GUM) given by

u²(μ_0) = ((n − 1)/(n − 3)) (s²/n).   (57)

In this expression, s² is the sample variance, computed
from the data according to

s² ≡ (1/(n − 1)) ∑_{i=1}^{n} (y_i − ȳ)².   (58)

The uncertainty u(μ_0) is seen from (57) to be larger than
the value s/√n recommended in the GUM, and in fact is
defined only for n > 3. In Bayesian inference, this is a
signal that for small n, one needs more prior information
about σ than is provided by the log-uniform density. As n
increases, the result (57) approaches the GUM
recommendation.

To sum up the results when sampling from an assumed
normal frequency distribution N(y; μ, σ²), when very little is
known a priori about μ: the best estimate μ_0 is always
given by the sample mean ȳ = ∑y_i/n; if σ is known a
priori, then u(μ_0) is σ/√n; if σ is unknown, then u(μ_0), for
n > 10 or so, is s/√n, with s computed according to (58).
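These summary results can be sketched in a few lines of Python with hypothetical indicator readings; the n > 3 restriction of (57) applies to the unknown-σ case:

```python
import math
import statistics

# Illustrative repeated indicator readings of a spacer, in mm (hypothetical data).
D = [25.0012, 25.0008, 25.0011, 25.0006, 25.0010, 25.0009]
n = len(D)

mu_0 = statistics.fmean(D)        # best estimate: the sample mean, in every case

# Case 1: sigma known a priori from experience with the process.
sigma = 0.0003
u_known = sigma / math.sqrt(n)    # Type A standard uncertainty, sigma/sqrt(n)

# Case 2: sigma unknown; the Student-t posterior gives eq. (57), defined for n > 3.
s2 = statistics.variance(D)       # sample variance s^2, eq. (58)
u_unknown = math.sqrt((n - 1) / (n - 3) * s2 / n)

print(mu_0, u_known, u_unknown)   # u_unknown exceeds sqrt(s2/n), as the text notes
```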
7.2 Inspection measurements I: process control
Suppose that a manufacturer produces a large number of
metal spacers, using a well-designed production process.
Because of unavoidable process variations there will be
some distribution of lengths in any particular batch of
spacers. We can think of each spacer as sampling the
manufacturing process, with the process adding a random
error to the intended dimension [24]. Assume that the
process is such that the distribution of spacer lengths, as
displayed in a histogram, is well approximated by a normal
distribution:
f(y) = (1/(σ_p√(2π))) exp[−(y − y_0)²/(2σ_p²)] = N(y; y_0, σ_p²).   (59)
Here y_0 is the average length of a spacer and the variance
σ_p² characterizes the variability of the production process.
The quantity f(y) is a frequency distribution function, where
f(y)Δy is approximately equal to the fraction of spacers
with lengths in the range (y, y + Δy). Figure 8 shows such
a histogram, together with its normal approximation, for a
typical large run of spacers.
During production, the spacers are measured using a
calibrated length gauge that has been corrected for all
significant systematic errors. The inspection measurement
process has a combined standard uncertainty σ_m that
includes the effects of temperature, gauge calibration
uncertainty, measurement process reproducibility, and so
forth. From experience with this gauge and the
measurement process, it is known that length
measurement errors are well described by the normal
distribution:
p(e|I_0) = (1/(σ_m√(2π))) exp(−e²/(2σ_m²)) = N(e; 0, σ_m²),   (60)

so that p(e|I_0)de = N(e; 0, σ_m²)de is the probability that the
error of a length measurement lies in the range (e, e + de).
Because all known significant systematic effects have been
accounted for, the measurement error has an expectation
of zero.

As part of a statistical quality control program, an inspector
uses this gauge to measure the lengths of a large sample
of spacers, and plots the result in a histogram. What can
we say about this frequency distribution? In general, the
result y_m of a particular length measurement will be the
sum of an unknown length y and an unknown
measurement error e:

y_m = y + e.   (61)

Such a result could be realized in an infinite number of
ways, corresponding to the infinite number of pairs (y, e)
that satisfy (61). The error e here is a nuisance parameter,
present in the data but of no interest in itself, that can be
eliminated via marginalization. Let p(y_m|I_0)dy_m be the
probability of the proposition 'the result of a measurement
lies in the interval (y_m, y_m + dy_m).' The distribution
p(y_m|I_0) can be found by averaging over all possible
measurement errors to yield a marginal distribution,
analogous to the result of (22), where we had discrete
probability distributions. The marginal distribution for y_m is
found according to:

p(y_m|I_0) = ∫ p(y_m e|I_0) de
           = ∫ p(y_m|eI_0) · p(e|I_0) de
           = ∫_{−∞}^{∞} N(y_m; y_0 + e, σ_p²) · N(e; 0, σ_m²) de.   (62)

Here the second step follows directly from the product rule,
and the result p(y_m|eI_0) = N(y_m; y_0 + e, σ_p²) says that for
a given error e, the distribution of y_m would equal the
production distribution, shifted and centered at y_0 + e. The
last integral in (62), called a convolution integral, is
straightforward [2] and leads to the basic result:

p(y_m|I_0) = (1/(σ_T√(2π))) exp[−(y_m − y_0)²/(2σ_T²)] = N(y_m; y_0, σ_T²),   (63)

where

σ_T² = σ_p² + σ_m².   (64)

Figure 8. A histogram showing the frequency
distribution of the lengths of a large run of spacers
produced by a machine. The solid curve is a normal
distribution fit to the histogram.
We see that the distribution of values that might reasonably
be expected to result from the measurement of a spacer
chosen at random is normally distributed, centered at the
average production value y_0, with a standard deviation
given by σ_T = √(σ_p² + σ_m²). The inspector's measurements
can be thought of as a sequence of samples from this
distribution, so that the resulting histogram can be expected
to be approximately Gaussian with a sample variance
s² ≈ σ_T².
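The variance addition of (64) can be confirmed by simulating the inspector's histogram; a sketch with illustrative values for σ_p and σ_m:

```python
import random
import statistics

random.seed(2)
y_0, sigma_p = 25.0, 0.002   # production mean length (mm) and process variation
sigma_m = 0.001              # combined standard uncertainty of the gauge

# Each inspection result y_m = y + e: a production length plus an
# independent zero-mean measurement error, as in eq. (61).
ym = [random.gauss(y_0, sigma_p) + random.gauss(0.0, sigma_m)
      for _ in range(200_000)]

sample_var = statistics.variance(ym)
sigma_T2 = sigma_p**2 + sigma_m**2   # eq. (64)
print(sample_var, sigma_T2)          # the two agree to within sampling noise
```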
The two sources of variation, production and measurement,
are seen to be mixed or confounded in the measurement
process, and the behavior of the variance σ_T² accords with
our common sense. If the measurement process were
nearly perfect and noise free, so that σ_m² → 0, then
σ_T² ≈ σ_p² and the dispersion of measurement results would
reflect only production variation. On the other hand, if all of
the spacers were nearly identical, so that σ_p² → 0, then
σ_T² ≈ σ_m² and such dispersion would be dominated by
measurement variability. In a similar way, any observed
drift in the measurement results away from the nominal
length y_0 would be the sum of production and
measurement drift, requiring more information (i.e. gauge
re-calibration) before being unambiguously attributed to
changes in the production process.
From this example we see that it is very important in
industrial inspection to understand the difference between
actual workpiece variation and measurement uncertainty.
The reliable characterization of a production process
requires a measurement system whose expected error is
held close to zero and whose combined standard
uncertainty has been independently and carefully
evaluated.
7.3 Inspection measurements II: a particular workpiece
While the result (63) is interesting and useful for process
control, it is not the information that an inspector would
need in order to accept or reject a given workpiece. In order
to decide if a spacer is acceptable or not, what an inspector
needs to know, given a particular measurement result y_m,
is the best estimate of the length of the spacer actually
measured, together with an evaluation of the measurement
uncertainty.
Before performing the measurement, the inspector's
knowledge of the length of the spacer is guided by his
experience with the production process and data such as
that shown in the histogram of Fig. 8. Using this prior
information he assigns a normal prior distribution:
p(y|I_0) = (1/(σ_p√(2π))) exp[−(y − y_0)²/(2σ_p²)] = N(y; y_0, σ_p²).   (65)
While this has the same mathematical form as f(y) in (59),
it should be stressed that f(y) is a measured frequency
distribution of lengths, while p(y|I_0) is an assigned
probability distribution for a single spacer drawn from (59).
Probability and frequency are not the same thing. We also
observe that in the real world of manufacturing, many (if not
most) workpieces are never measured at all, but rather
accepted for use based upon pure inference in which
knowledge of the workpiece is implicitly encoded by a
distribution such as p(y|I_0).
After obtaining the measurement data y_m, we update the
prior distribution to obtain the post-data (posterior) pdf
using Bayes' theorem

p(y|y_m I_0) = p(y|I_0) p(y_m|yI_0)/p(y_m|I_0) = K p(y|I_0) p(y_m|yI_0),   (66)

where the constant K = p(y_m|I_0)^−1 is independent of y
and will be absorbed into the normalization of p(y|y_m I_0).
The likelihood p(y_m|yI_0) is the probability of obtaining data
y_m as a function of an assumed value y. Given a
measurement process with error probability distributed as
in (60), this is p(y_m|yI_0) = N(y_m; y, σ_m²), i.e., a Gaussian
centered at the assumed value of y. Thus:

p(y|y_m I_0) = K N(y; y_0, σ_p²) · N(y_m; y, σ_m²).   (67)
We see from (67) that the posterior pdf p(y|y_m I_0) is
proportional to the product of two normal distributions. It is
a straightforward exercise to show that the result is another
normal distribution:

p(y|y_m I_0) = (1/(σ̂√(2π))) exp[−(y − ŷ)²/(2σ̂²)] = N(y; ŷ, σ̂²),   (68)

where:

ŷ = (σ_p^−2 y_0 + σ_m^−2 y_m)/(σ_p^−2 + σ_m^−2),
σ̂² = (σ_p^−2 + σ_m^−2)^−1.   (69)

These results can be written somewhat more elegantly, and
in a form easier to remember, by defining a weight
parameter (or simply a weight) w for a probability
distribution as the reciprocal of the variance: w ≡ 1/σ², so
that w_p = 1/σ_p², w_m = 1/σ_m², and ŵ = 1/σ̂². With these
definitions (69) becomes:

ŷ = (w_p y_0 + w_m y_m)/(w_p + w_m),
ŵ = w_p + w_m.   (70)
From the results (68-70) we see that the best estimate ŷ of
the length of the spacer, given the measurement data y_m,
is a weighted average of the prior estimate y_0 and the
measured value y_m. The weights characterize the
sharpness of the respective probability distributions for y_0
and y_m, and the posterior estimate ŷ will be biased toward
the value in which we have the most confidence, as
measured by its weight. If we study this result, we find that
it accords very well with what our intuition might suggest. In
a typical industrial inspection, the measurement procedure
is arranged such that w_m >> w_p, whence ŷ ≈ y_m and the
best estimate of the spacer length derives almost
completely from the measurement data. On the other hand,
imagine using a hand micrometer to measure the diameter
of a cylindrical workpiece produced by a modern diamond
turning machine. In this case we could well have
w_p >> w_m, and in effect we would be using the workpiece
to calibrate the micrometer.
From (69-70) we see that the posterior weight ŵ is always
greater than either of the weights w_p or w_m, so that a
measurement always supplies information that reduces
uncertainty. Usually w_m >> w_p, so that ŵ ≈ w_m and the
information supplied by the measurement overwhelms
whatever we may know a priori. In many cases of modern
manufacture, however, such as the production of complex
aspheric optics by single-point diamond machining, the
workpieces are very difficult to measure independently, and
we could well have w_m ≈ w_p and ŵ ≈ 2w_m, so that prior
knowledge of a machine's positioning error characteristics
could lead to a meaningful reduction in measurement
uncertainty [25].
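The update rules (69)-(70) amount to only a few lines of code; a sketch with hypothetical prior and measurement values:

```python
def combine(y_0, sigma_p, y_m, sigma_m):
    """Posterior best estimate and standard uncertainty from eqs. (69)-(70):
    weights are reciprocal variances, and the posterior weight is their sum."""
    w_p, w_m = 1.0 / sigma_p**2, 1.0 / sigma_m**2
    y_hat = (w_p * y_0 + w_m * y_m) / (w_p + w_m)
    sigma_hat = (w_p + w_m) ** -0.5
    return y_hat, sigma_hat

# Typical inspection: the gauge is much sharper than the prior (w_m >> w_p),
# so the posterior estimate comes almost entirely from the measurement.
y_hat, sigma_hat = combine(y_0=25.000, sigma_p=0.010, y_m=25.004, sigma_m=0.001)
print(y_hat, sigma_hat)  # y_hat is close to y_m, and sigma_hat < sigma_m
```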
7.3.1 Comparison measurements
It is a common practice for a dimensional measurement
laboratory to evaluate its procedures in relation to similar
laboratories by participating in comparison measurements,
such as round-robins. Here each participating laboratory
measures, in turn, the same artifact and the results are
then used to evaluate the equivalency of the various
participants. The question naturally arises as to how best to
consolidate and compare the results of the individual
measurements. The optimum way to do so follows by
extending the results of the previous section, using the
rules of probability theory. We give a simple example and
then touch upon practical difficulties.
Suppose that n laboratories independently measure a
quantity μ and report the results in the form
μ = x_i ± σ_i, i = 1, …, n, where σ_i is the combined
standard uncertainty of the i-th measurement, evaluated
according to the procedures of the GUM. Here μ might be
the length of a gauge block or the diameter of a cylindrical
standard. Guided by the principle of maximum entropy, we
assign a Gaussian probability density to each of the
measurement results, so that

p(x_i|μσ_i I_0) = (1/(σ_i√(2π))) exp[−(x_i − μ)²/(2σ_i²)]   (71)

is the probability density for the i-th measurement result.
Assuming that the measurements are completely
independent, the likelihood of the data set x = {x_1 … x_n} is

p(x|μσI_0) = ∏_{i=1}^{n} p(x_i|μσ_i I_0) ∝ exp[−∑_{i=1}^{n} (x_i − μ)²/(2σ_i²)],   (72)

where σ ≡ {σ_1 … σ_n}. We also assume a Gaussian prior
density for μ, centered on the estimate μ_−:

p(μ|I_0) = (1/(σ_−√(2π))) exp[−(μ − μ_−)²/(2σ_−²)] = N(μ; μ_−, σ_−²) = p_−(μ),   (73)

where the subscript denotes the pre-data estimate. [Prior
ignorance can be well-approximated by allowing σ_−² → ∞.]
Bayes' theorem then gives for the post-data (posterior)
density p_+(μ) ≡ p(μ|xσI_0):

p_+(μ) = K p_−(μ) p(x|μσI_0),   (74)

where as usual, K is a normalization constant. Substituting
expressions (72) and (73) into (74) gives a product of n + 1
normal distributions, yielding, after simplification, the
normal posterior density

p_+(μ) = (1/(σ_+√(2π))) exp[−(μ − μ_+)²/(2σ_+²)].   (75)

Here μ_+ and σ_+ are, respectively, the best estimate of μ
and its combined standard uncertainty after incorporation of
all of the measurement data, and are given by

μ_+ = (w_− μ_− + ∑_{i=1}^{n} w_i x_i)/(w_− + ∑_{i=1}^{n} w_i),
w_+ = w_− + ∑_{i=1}^{n} w_i,   (76)

where w_+ ≡ 1/σ_+² and so on for the rest of the weights.

The results (75-76) have all of the intuitive properties that
we might expect in pondering the situation. If w_− >> w_i for
all i, then μ_+ ≈ μ_− and rather than learning about the
artifact, the round-robin would reveal estimates of the
systematic errors of the various measurements. This might
describe a round-robin in which a National Metrology
Institute (NMI) circulated a well-calibrated 'golden artifact'
among a group of lower echelon laboratories, perhaps as
part of a laboratory accreditation program.

More common is a comparison round-robin (sometimes
called a key comparison when a number of NMIs are
involved), in which only a nominal value of μ is known a
priori and the goal is laboratory intercomparison. In such
comparisons it will almost always be the case that for any
laboratory w_− << w_i (the measurements overwhelm prior
information), so that w_− ≈ 0 and Eqs. (76) simplify to:

μ_+ = (∑_{i=1}^{n} w_i x_i)/(∑_{i=1}^{n} w_i),
w_+ = ∑_{i=1}^{n} w_i.   (77)

If all laboratories report the same uncertainty σ_0, then
w_i = w_0 = 1/σ_0² and μ_+ = x̄ = ∑x_i/n, with uncertainty
σ_+ = σ_0/√n, a familiar result. If laboratory k has a much
smaller uncertainty than the others due to a superior
measurement process (w_k >> w_i, i ≠ k), then μ_+ ≈ x_k
with uncertainty σ_+ ≈ σ_k, which is just as it should be. A
single high-accuracy measurement is more valuable than
a number of poor ones.

A central question in the analysis of round-robin data is
how to choose a reference value μ_ref in order to effect the
comparisons. We see from Eqs. (77) that the logical and
consistent way to do so is to use the weighted mean
μ_ref = μ_+, which is the best estimate of the measurand
using all available information. In spite of this, it seems to
be common practice to use the simple un-weighted mean
value, μ_ref = x̄, which discounts the variation in the
measurement uncertainty. The motivation for this choice
is to prevent a participant from claiming a very small
uncertainty and forcing a weighted mean value toward his
own result. We see here how in the real world it is easy to
move beyond purely technical considerations and into
areas that have psychological, political, and economic
aspects.

Choice of a reference value is only one of several
problems that arise in the design and data analysis of
comparison measurements and that are subjects of
active discussion and debate. Among the others are:
· Correlations. It is difficult to perform a set of
comparison measurements that are all logically
completely independent of each other. Independence
means that knowledge of Laboratory A's systematic
errors would convey no information that would affect
Laboratory B's result. Use of common reference
standards, using instruments from the same
manufacturer, using the same empirical equation
(such as Edlén's equation for the refractive index of
air) or phenomenological model --- all of these will
correlate the experimental results.
Depending upon the particular nature of the
measurement, failing to account for significant
correlations among the input quantities will lead to
either an underestimation or an overestimation of the
uncertainty in the final result.
· Method of measurement. Performing a measurement
in two different ways will often give two different
answers, even when the individual procedures are
highly repeatable. From the point of view of
probability theory, we would say that the
measurements occur in different reasoning
environments I_0 and I_0′, so that p(μ|xI_0) ≠ p(μ|xI_0′)
and oranges' comparison. This introduces a
component of uncertainty due to the method of
measurement that can be studied by an appropriate
experimental design.
· Definition of the measurand. In many comparison
measurements, the uncertainties may be dominated
by an incomplete definition of the measured quantity.
Diameter, for example, is not well defined for a
cylinder that is not perfectly round. Similarly, the
width of a chromium line deposited on glass is not
well described by a single number at the level of a
few nanometers. This lack of complete definition can
also interact strongly with the measurement
technique, further complicating both the evaluation of
the uncertainty and comparison with other results.
· Unrecognized errors. It is not uncommon for the
results of two independent measurements of the
same quantity to be inconsistent, which means that
the difference between the measured values
exceeds the sum of the individual uncertainties by
more than a 'reasonable' amount. An effort to
achieve a very small measurement uncertainty
requires the correction for smaller and smaller
effects, and it is easy for some tiny effect to go
unrecognized in the data analysis. In such a case, at
least one of the results must be wrong, but it can be
difficult, if not impossible, to find the source of the
inconsistency. Of course, one of the principal
reasons for performing a comparison such as a
round-robin is to discover such unrecognized errors,
and it is important to have a consistent procedure for
handling them. An interesting approach has been
demonstrated by F. Fröhner, who calls inconsistent
results 'one of the thorniest problems in data
evaluation [8].' He models the unrecognized errors
themselves as being normally distributed with
maximum entropy prior distributions for the unknown
means and variances. The resultant Bayesian
inference yields a best estimate for the unknown
measurand, together with best estimates for the
unrecognized errors and their uncertainties, in a
straightforward way.
Differences between the results of measurements of the
'same' measurand will, in general, be due to some
admixture of these last three sources of variation. These
effects cannot be separated without a more complete
understanding and analysis of the various measurement
processes. As suggested above, much creative thinking
is needed about how best to treat the measurement data
created in the course of measurement intercomparisons.
7.4 Industrial inspections III: accept/reject decisions
Let us return to the inspector who has measured the
length of a metal spacer and must now decide whether or
not it is acceptable for use. The nominal length of the
spacer is y₀, and the design specification calls for y₀ to
be centered in a specification zone of width T, where T is
called the tolerance. This means that the length of an
acceptable spacer must lie in the range LSL ≤ Y ≤ USL,
where the lower specification limit LSL ≡ y₀ − T/2 and the
upper specification limit USL ≡ y₀ + T/2. The tolerance is
related to the specification limits by T = USL − LSL, as
shown in Fig. 9.
Figure 9. The specification zone for a metal
spacer of design length y₀ and tolerance T.
The goal of the inspector's measurement is to answer the
question 'Is the length of the measured spacer contained
in the specification zone with an acceptable probability?'
Clearly what is meant by 'acceptable probability' is a
question of professional or business judgment that
involves matters such as costs and risks. For the
purposes of our discussion we assume that there is a
critical probability P₀ such that a spacer will be
acceptable if there is a probability P ≥ P₀ that its length
lies within the specification zone. Typically P₀ will be a
number such as 0.95 or 0.99, corresponding to a level of
confidence of 95% or 99%.
The inspector's knowledge of the length of the spacer
following his measurement is summed up in the posterior
density p(y | y_m I₀) = N(y; ȳ, s²) of (68-69), which
describes the distribution of reasonably probable values.
The probability PG that the spacer is good (within
tolerance) is just the fraction of this distribution contained
between the specification limits (see Figure 10):
PG = ∫_{LSL}^{USL} p(y | y_m I₀) dy .                        (78)

From (68) we have explicitly:

PG = (1/(s√(2π))) ∫_{LSL}^{USL} exp[−(y − ȳ)²/(2s²)] dy .    (79)
This integral cannot be evaluated in closed form, but the
result can be expressed in terms of the standard normal
cumulative distribution function (CDF) defined by
Φ(x) ≡ (1/√(2π)) ∫_{−∞}^{x} exp(−t²/2) dt .                  (80)
The CDF Φ(x) is tabulated in many statistics texts and is
commonly included in commercial mathematics and
spreadsheet software. Letting z ≡ (y − ȳ)/s, PG is given
by
PG = Φ((USL − ȳ)/s) − Φ((LSL − ȳ)/s) .                       (81)
Figure 10. The probability density p(y | y_m I₀) of a
measured spacer, superimposed on the specification
zone. The best estimate of the length is ȳ. The
probability PG that the spacer is good is the fraction
of the area under the curve (shown cross-hatched)
contained between the specification limits. The spacer
is in tolerance if PG ≥ P₀, where P₀ is a threshold
value determined by economic considerations.
Now defining the dimensionless variables

y* ≡ (ȳ − LSL)/T ,    s* ≡ s/T ,                             (82)

(81) becomes:

PG = Φ((1 − y*)/s*) − Φ(−y*/s*) ≡ PG(y*, s*) .               (83)
The tolerance T thus provides a natural length scale for
the inspector's decision problem. For a spacer to have a
chance to be accepted, the best estimate ȳ of its length
must lie within the specification zone, for otherwise the
probability of being in tolerance would be less than 50%.
The specification zone corresponds to the region
0 ≤ y* ≤ 1, with y* = 0 being the lower specification limit,
y* = 1 being the upper specification limit, and y* = 0.5
being the center of the specification zone.
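The reduced form (83), and the 50% observation above, are easy to check numerically. A minimal sketch (the function names are ours):

```python
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF, Eq. (80)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_good(y_star: float, s_star: float) -> float:
    """PG(y*, s*) of Eq. (83): probability of being in tolerance."""
    return phi((1.0 - y_star) / s_star) - phi(-y_star / s_star)

# With the best estimate exactly at a specification limit (y* = 0 or 1),
# at most half the posterior density lies inside the zone:
print(prob_good(0.0, 0.1))  # just under 0.5
# At the zone center with a 10-to-1 gauging ratio, PG is essentially 1:
print(prob_good(0.5, 0.1))
```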
The dimensionless uncertainty parameter s* = s/T is
sometimes called a gauging ratio, and typically has a
value such as 0.25 (a 4-to-1 ratio) or 0.10 (a 10-to-1 ratio,
occasionally called the gauge maker's rule). The reduced
quantities y * and s * are closely related to various
process capability indices such as C p and C pk that are
used in statistical quality control [23].
This result can be appreciated by examining Fig. 11,
which shows loci of constant probability for two
levels of confidence (PG = 0.95 and 0.99) in the y*-s*
plane.
Figure 11. The y*-s* plane, showing the locus
of constant probability PG, from Eq. (83), for
PG = 0.95 (upper curve) and PG = 0.99 (lower
curve).
For a given level of confidence, acceptable spacers lie in
the region below the corresponding curve. The horizontal
dotted line in Fig. 11 locates a particular 'gauge maker's
rule' of a 10-to-1 ratio of tolerance to measurement
uncertainty. The intersection of such a line with a
particular probability locus defines a conformance zone
whose width determines the range of measured values ȳ
allowed for acceptable spacers. The 99% conformance
zone is shown, so that a spacer whose measured length
is such that y* lies in this region has at least a 99%
probability of being in tolerance, so long as s ≤ T/10.
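The boundary of such a conformance zone can be located by solving PG(y*, s*) = P₀ for y*. A bisection sketch under our own naming, assuming s* = 0.1 and P₀ = 0.99 (PG rises monotonically in y* from either specification limit toward the zone center, which justifies the bisection):

```python
from math import erf, sqrt

def phi(x: float) -> float:
    """Standard normal CDF, Eq. (80)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_good(y_star: float, s_star: float) -> float:
    """PG(y*, s*) of Eq. (83)."""
    return phi((1.0 - y_star) / s_star) - phi(-y_star / s_star)

def lower_boundary(s_star: float, p0: float, tol: float = 1e-10) -> float:
    """Smallest y* with PG(y*, s*) >= p0, by bisection on [0, 0.5].

    PG increases monotonically in y* up to the zone center y* = 0.5,
    where it attains its maximum.
    """
    if prob_good(0.5, s_star) < p0:
        raise ValueError("no conformance zone: uncertainty too large for p0")
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prob_good(mid, s_star) >= p0:
            hi = mid
        else:
            lo = mid
    return hi

y_lo = lower_boundary(0.1, 0.99)
y_hi = 1.0 - y_lo  # the zone is symmetric about y* = 0.5
print(f"99% conformance zone: {y_lo:.3f} <= y* <= {y_hi:.3f}")
```

For these assumed values the 99% conformance zone occupies roughly the central half of the specification zone, which is consistent with guard-banding each limit by about 2.33 s.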
A simplified way of showing the same decision problem
follows from the recently adopted International Standard
ISO 14253-1 [11] which defines default decision rules for
proving conformance or non-conformance to specification.
The basic idea is shown in Fig. 12.
Figure 12. Illustrating the specification and
conformance zones according to ISO 14253-1.
The quantity U is the expanded uncertainty, with k
equal to a coverage factor according to the GUM.
According to this standard, the specification zone is
reduced by twice the expanded uncertainty U = ks of the
measurement in order for a supplier to prove conformance
with specification. On the other hand, for a customer to
prove non-conformance requires that he add the expanded
uncertainty to the result of his measurement, thus
increasing the size of the conformance zone. The
measurement uncertainty always works against whoever is
making a conformance or non-conformance decision, and
there is always a tradeoff involving costs and risks.
In ISO 14253-1, the default coverage factor is k = 2 . It
should be emphasized that this is a default procedure that
fails to consider important economic issues such as the
costs and risks associated with making erroneous
decisions [34]. These considerations can greatly affect the
boundaries of the conformance and non-conformance
zones so that default rules such as ISO 14253-1 will likely
be of marginal value for real decisions in the marketplace
[22,27].
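The default rule itself is simple arithmetic: the supplier's acceptance interval is the specification zone shrunk by U = ks at each limit. An illustrative sketch (the function name and example values are ours, not from the standard):

```python
def iso14253_acceptance_zone(lsl: float, usl: float, s: float, k: float = 2.0):
    """Supplier's default acceptance interval per ISO 14253-1: the
    specification zone reduced by the expanded uncertainty U = k*s
    at each limit."""
    U = k * s
    lo, hi = lsl + U, usl - U
    if lo >= hi:
        raise ValueError("uncertainty too large: no conformance zone remains")
    return lo, hi

# Hypothetical spacer: LSL = 24.995 mm, USL = 25.005 mm, s = 0.0005 mm.
lo, hi = iso14253_acceptance_zone(24.995, 25.005, 0.0005)
print(f"accept if {lo:.4f} mm <= measured length <= {hi:.4f} mm")
```

Note that, unlike the probabilistic treatment above, this guard-band rule is independent of any chosen P₀; it simply transfers the whole expanded uncertainty to whichever party must prove its case.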
8. Summary
We have attempted to give a broad overview of the
fundamental ideas of inference, where probability is viewed
as a degree of rational belief. In this view, engineering
metrology is seen to be a particular application of a very
general system of extended logic that applies to any
situation where incomplete information precludes the use of
deductive reasoning. The two major questions of probability
theory are (a) how to assign initial probabilities and (b) how
to revise probabilities in order to incorporate new
information. We have shown how the answers to these
questions are provided by (1) the principle of maximum
entropy and (2) the sum and product rules that follow from
the axioms of Cox. These are the fundamental ideas. All of
the standard results of statistical sampling theory follow as
special cases when necessary. Lack of repeatability is only
one component of uncertainty. Ultimately, any physical
measurement will be limited by uncertainty in the realization
of the unit and will reduce to a set of Type B assumed
distributions best estimated by the method of maximum
entropy.
9. Acknowledgments
It is a sincere pleasure to thank the many individuals who
made valuable comments and suggestions based on
earlier drafts of this paper, and for their contributions and
constructive criticisms that helped to guide its revision. In
particular, I am deeply indebted to W. Wöger (PTB Braunschweig) and S. Sartori (IMGC - CNR, Torino) for
their very thorough critiques of the revised manuscript
and their detailed suggestions for clarifying and correcting
substantial portions of the paper. My gratitude is also
extended to [* denotes CIRP member]:
· J. Bryan* - Pleasanton, CA USA
· K. Bowen* - Bede Scientific, Englewood, CO USA
· D. Banks - NIST, Gaithersburg, MD USA
· D. DeBra* - Stanford University, Stanford, CA USA
· T. Doiron - NIST, Gaithersburg, MD USA
· T. Charlton* - Brown and Sharpe, North Kingston, RI USA
· C. Evans* - NIST, Gaithersburg, MD USA
· R. Hocken* - University of North Carolina - Charlotte, NC USA
· H. Kunzmann* - PTB, Braunschweig, Germany
· R. Levi* - Politecnico di Torino, Italy
· D. Lucca* - Oklahoma State University, Stillwater, OK USA
· P. McKeown* - Cranfield University, Cranfield, United Kingdom
· J. Meijer* - University of Twente, Enschede, Netherlands
· E. Pardo - NPL, Teddington, United Kingdom
· G. Peggs* - NPL, Teddington, United Kingdom
· J. Peters* - Instituut voor Werktuigkunde, Heverlee, Belgium
· S. Phillips - NIST, Gaithersburg, MD USA
· J. Potzick - NIST, Gaithersburg, MD USA
· J. Raja - University of North Carolina - Charlotte, NC USA
I would also like to acknowledge the late Professor L. R.
Wilcox of the State University of New York at Stony Brook
and the late Dr. C. E. Kuyatt of NIST/Gaithersburg for their
essential contributions to my understanding of the nature
of probability and uncertainty.
10. References
[1] Baierlein, R., 1971, Atoms and Information Theory,
W. H. Freeman, San Francisco.
[2] Box, G. E. P. and Tiao, G. C., 1973, Bayesian
Inference in Statistical Analysis, Wiley Classics
Library Ed. 1992, J. Wiley and Sons, New York.
[3] Bryan, J. B., 1993, The Deterministic Approach in
Metrology and Manufacturing, Int. Forum on
Dimensional Tolerancing and Metrology, ASME,
Dearborn, Michigan.
[4] Cox, R. T., 1946, Probability, Frequency, and
Reasonable Expectation, Am. J. Phys., 14: 1-13.
[5] Cox, R. T., 1961, The Algebra of Probable Inference,
Johns Hopkins Press, Baltimore.
[6] Donaldson, R. R., 1972, The Deterministic Approach
to Machining Accuracy, Soc. Mech. Eng. Fabrication
Technology Symposium, Golden, Colorado.
[7] Estler, W. T., 1997, A Distribution-Independent
Bound on the Level of Confidence in the Result of a
Measurement, J. Res. Natl. Inst. Stand. Technol.
102, 587-88.
[8] Fröhner, F. H., 1989, Bayesian Evaluation of
Discrepant Experimental Data, in Maximum Entropy
and Bayesian Methods, J. Skilling, ed., Kluwer
Academic Publishers, Dordrecht, Netherlands.
[9] Garrett, A. J. M. and Fisher, D. J., 1992, Combining
Data from Different Experiments: Bayesian Analysis
and Meta-analysis, in Maximum Entropy and
Bayesian Methods, Seattle 1991, C. R. Smith et al,
eds., Kluwer Academic Publishers, Dordrecht,
Netherlands 273-86.
[10] International Organization for Standardization (ISO),
1995, Guide to the Expression of Uncertainty in
Measurement, ISO, Geneva.
[11] International Organization for Standardization (ISO),
1998, International Standard 14253-1, Geometrical
Product Specifications (GPS) - Part 1: Decision rules
for proving conformance or non-conformance with
specification.
[12] Jaynes, E. T., 1994, Probability Theory: The Logic of
Science, preliminary version at ftp://bayes.wustl.edu/
pub/Jaynes/book.probability.theory.
[13] Jaynes, E. T., 1989, Papers on Probability, Statistics,
and Statistical Physics, R. D. Rosenkrantz, Ed.,
Kluwer Academic Publishers, Dordrecht, Netherlands.
[14] Jaynes, E. T., 1968, Prior Probabilities, IEEE Trans.
Syst. Sci. and Cybernetics, Vol. SSC-4, 227-41.
[reprinted in Ref. 13.]
[15] Jaynes, E. T., 1957, Information Theory and
Statistical Mechanics, I, II, Phys. Rev. 106, 620-30,
108, 171-90. [reprinted in Ref. 13.]
[16] Jeffreys, H., 1967, Theory of Probability, Clarendon
Press, Oxford.
[17] Jessop, A., 1995, Informed Assessments - An
Introduction to Information, Entropy, and Statistics,
Ellis Horwood, London.
[18] Keynes, J. M., 1921, A Treatise on Probability,
Macmillan, London.
[19] Kolmogorov, A. N., 1950, Foundations of the Theory
of Probability, Chelsea Publishing Co., New York.
[20] Kyburg Jr., H. E. and Smokler, H. E., Eds., 1964,
Studies in Subjective Probability, John Wiley and
Sons, New York.
[21] Lindley, D. V., 1990, The 1988 Wald Memorial
Lectures: The Present Position in Bayesian
Statistics, Stat. Sci. 5, No.1, 44-89.
[22] Lindley, D. V., 1985, Making Decisions, 2nd Ed.,
John Wiley and Sons, London.
[23] Messina, W. S., 1987, Statistical Quality Control for
Manufacturing Managers, John Wiley and Sons, New
York.
[24] Patterson, S. R., 1996, Treatment of Errors and
Uncertainty, Tutorial Notes, American Society for
Precision Engineering, Raleigh, North Carolina.
[25] Phillips, S. D., Estler, W. T., Levenson, M. S., and
Eberhardt, K. R., 1998, Calculation of Measurement
Uncertainty Using Prior Information, J. Res. Natl.
Inst. Stand. Technol. 103, 625-32.
[26] Polya, G., 1954, Mathematics and Plausible
Reasoning, 2 Vols., Princeton University Press.
[27] Schlaifer, R., 1959, Probability and Statistics for
Business Decisions, McGraw-Hill, New York.
[28] Savage, I. R., 1961, Probability Inequalities of the
Tchebyscheff Type, J. Res. Natl. Bur. Stand. 65B,
211-22.
[29] Shannon, C. E. and Weaver, W., 1963, The
Mathematical Theory of Communication, Univ. of
Illinois Press, Urbana, Illinois.
[30] Sivia, D. S., 1996, Data Analysis - A Bayesian
Tutorial, Clarendon Press, Oxford.
[31] Smith, C. R. and Erickson, G., 1989, From
Rationality and Consistency to Bayesian Probability
in Maximum Entropy and Bayesian Methods, Kluwer
Academic Publishers, Dordrecht, Netherlands.
[32] Tribus, M., 1969, Rational Descriptions, Decisions,
and Designs, Pergamon Press, New York.
[33] Weise, K. and Wöger, W., 1992, A Bayesian Theory
of Measurement Uncertainty, Meas. Sci. Technol. 3,
1-11.
[34] Williams, R. H. and Hawkins, C. F., 1993, The
Economics of Guardband Placement, Proc. 24th
IEEE International Test Conference, Baltimore USA.
[35] Wöger, W., 1987, Probability Assignment to
Systematic Deviations by the Principle of Maximum
Entropy, IEEE Trans. Inst. Meas., Vol. IM-36, 655-58.