Measurement as Inference: Fundamental Ideas

W. Tyler Estler (2)
Precision Engineering Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

Abstract: We review the logical basis of inference as distinct from deduction, and show that measurements in general, and dimensional metrology in particular, are best viewed as exercises in probable inference: reasoning from incomplete information. The result of a measurement is a probability distribution that provides an unambiguous encoding of one's state of knowledge about the measured quantity. Such states of knowledge provide the basis for rational decisions in the face of uncertainty. We show how simple requirements for rationality, consistency, and accord with common sense lead to a set of unique rules for combining probabilities and thus to an algebra of inference. Methods of assigning probabilities and applications to measurement, calibration, and industrial inspection are discussed.

Keywords: dimensional metrology, measurement uncertainty, information

1. Introduction

The growing acceptance and use of the ISO Guide to the Expression of Uncertainty in Measurement (GUM) [10] has stimulated renewed thinking about errors, tolerances, statistics, and the concepts of randomness and determinism as they relate to manufacturing engineering and metrology. While we fully subscribe to the notion of determinism as articulated by J. B. Bryan [3] and R. R. Donaldson [6], the knowledge that a machine moves in perfect accord with natural law provides only small comfort when we must assign an uncertainty to measurements of its positioning errors. We emphasize here the conceptual distinction between a state of nature (for example, the geometry of a highly repeatable machine tool) and the uncertainty of a process designed to measure that state (linear positioning error, for example, measured with a displacement interferometer).
Traditionally, there has been little in the education of a typical engineer or physicist that provides a fundamental viewpoint or logical basis for dealing with measurement uncertainty, in the way that the laws of Newton and Hooke provide a foundation for major portions of engineering science. While computing the mean and variance of a set of repeated measurements seems like a reasonable thing to do, many statistical tests seem ad hoc and poorly motivated, and they provide no guidance in situations where repeatability is not an issue or where no population of parts exists. It is a pleasure to discover that there exists a unique mathematical system for plausible reasoning in the presence of uncertainty that satisfies very elementary and non-controversial requirements for consistency and rational agreement with common sense. In this paper we present a brief outline of the fundamental ideas of this system, called simply probability theory, with emphasis on its applications to engineering metrology.

The development of probability theory as logic had its origins in the work of P. S. Laplace, who remarked that 'probability theory is nothing but common sense reduced to calculation.' The modern development owes much to the work of H. Jeffreys [16], G. Polya [26], R. T. Cox [4-5], and E. T. Jaynes [12-15]. Detailed applications to problems of data analysis and measurement uncertainty from a modern point of view are given by D. S. Sivia [30] and K. Weise and W. Wöger [33]. The latter paper is an excellent introduction to the approach to uncertainty advocated by the GUM.

2. Deduction and Plausible Inference

2.1 Deductive logic

Classical deductive logic deals with propositions (written simply A, B, C, ...) that are either true or false. Typical propositions are declarative statements such as:

A ≡ 'There is life on Mars.'
B ≡ 'The error in the length of the workpiece is less than 5 µm.'
C ≡ 'The cost of the workpiece is less than $10.'
Propositions are combined and manipulated using a set of three basic operations defined as follows:

Negation: ~A ≡ 'A is false'
Logical product: AB ≡ 'A and B are both true'
Logical sum: A + B ≡ 'at least one of the propositions (A, B) is true'

Relations among propositions form the subject of Boolean algebra, which relates logical combinations of propositions that have the same truth value. A typical Boolean expression is:

~(A + B) = (~A)(~B).   (1)

Here, the left-hand side says 'It is not true that at least one of the propositions (A, B) is true', while the right-hand side says 'A and B are both false.' Clearly these verbal expressions have the same logical status and semantic meaning, a feature of any valid Boolean expression. Because of logical relations such as (1), only two of the three basic operations are independent, a fact that will simplify the development of the rules of probability theory. Deductive logic is a two-valued logic (true/false, up/down, zero/one, etc.) and together with the Boolean formalism provides the binary mathematical basis of computer science. Those familiar with the operation of logic gates will recognize the logical sum, for example, as defining the action of an 'inclusive OR' binary gate.

A basic construction in classical logic is the implication, written 'A implies B', which means that if A is true, then B is also, necessarily, true. The connection is logical rather than (necessarily) causal; for example, the proposition A ≡ 'there is life on Mars' would logically imply B1 ≡ 'there is liquid water on Mars', B2 ≡ 'there is oxygen on Mars', and so on. [In anticipation of objections on semantic grounds we point out that we are using the term 'life' in the sense of life forms similar to those that exist on the Earth.] Deductive logic then proceeds from the implication in two complementary ways, according to the following syllogisms:

'If A implies B and A is true, then B is true.'

and

'If A implies B and B is false, then A is false.'
These are very simple logical structures with common sense meanings. If it could be proven beyond doubt, for example, that Mars was devoid of water, then we could conclude that no (Earth-like) Martian life could exist.

2.2 Plausible inference and probability

Now suppose that A implies B for some relevant pair of propositions, and in the course of contemplating A we happen to learn that B is true. What does this tell us about A? This question is quite different from those in deductive logic and belongs to the field of plausible inference that was richly explored by Polya [26]. Here, knowledge that B is true supplies evidence for the truth of A, but certainly not deductive proof. We may feel intuitively that A is more likely to be true upon learning that one of its consequences is true, but how much more likely? It is easy to see that the change in our strength of belief in proposition A will depend on the nature of the information supplied by consequence B.

Consider the proposition A ≡ 'the length error of the workpiece is less than 5 µm', and suppose that we learn, based on a preliminary measurement, that B1 ≡ 'the length error of the workpiece is less than 100 µm' is true. Such information would certainly make A seem more likely to be true, but it would be much more significant to learn from a more recent measurement that B2 ≡ 'the length error of the workpiece is less than 7 µm' is true. In this way we can qualitatively order degrees of plausibility in the sense of: 'A is more likely to be true, given B1' and 'A is much more likely to be true, given B2'. In neither case does A become certain, but this qualitative ordering is something we do naturally as a matter of common sense reasoning.

What we need now is a way to extend deductive logic into this region of inference between certainty and impossibility. Such an extended logic should provide a general quantitative system for reasoning in the face of uncertainty or when supplied with incomplete information.
In the development of such a quantitative system of inductive logic or plausible reasoning, we need a numerical measure of credibility or degree of reasonable and consistent belief that will serve to describe our state of knowledge about propositions that are neither certain nor impossible. Following the modern interpretation as expressed, for example, in the GUM, we call this measure the probability, and write:

p(A | I0) ≡ the probability that A is true, given that I0 is true.

Here, I0 stands for the reasoning environment: the set of all relevant background information that conditions our knowledge of A. We will carry I0 along explicitly in order to emphasize that all probabilities are conditional on some set of propositions known (or assumed) to be true. There is a natural intuitive basis for defining probability in this manner. The degree of partial belief in an uncertain proposition will always depend not only on the proposition itself, but also on whatever information we possess that is relevant to the matter. For this reason, there is no such thing as an unconditional probability. The probability we assign to the chance of rain tomorrow depends, for example, upon whether we have heard a weather forecast, or whether it is presently raining, or whether storm clouds are gathering, and so on.

In Polya's studies of plausible inference he reasoned, and common sense would agree, that if A implies B, then necessarily p(A | B I0) ≥ p(A | I0), since the probability that A is true, if it changes at all, can only be increased by learning that one of its consequences is true. In our example above concerning the length error of a workpiece, the probabilities would be ordered according to p(A | B2 I0) > p(A | B1 I0) > p(A | I0). Here we are introducing the customary and colloquial association of stronger belief with greater probability.
While such a transitive ordering indicates the direction in which a probability might change in light of new evidence, it provides no way to calculate the amount of such a change, and Polya's work stopped short of providing a quantitative formulation. For this we turn to the work of R. T. Cox [4-5].

3. The Rules of Probability Theory

The following is a brief sketch of the logic leading to the unique rules for manipulating probabilities. For a more complete tutorial introduction we suggest the excellent synopsis of Smith and Erickson [31]. Following Jaynes [12], we list three desired properties (desiderata) that ought to be satisfied by a quantitative system of inference. These are not strict mathematical requirements or constraints, but any system lacking all of these properties would be of little or no value for reasoning from incomplete information.

Desideratum I. Probabilities should be represented by real numbers. This is a simple desire for mathematical simplicity.

Desideratum II. Probabilities should display qualitative agreement with rationality and common sense. This means, for example, that as evidence for the truth of a proposition accumulates, the number representing its probability should increase continuously and monotonically, and the probability of its negation should decrease continuously and monotonically. It also means that the system of reasoning should contain the deductive limits of certainty or impossibility as special cases when appropriate.

Desideratum III. Rules for manipulating probabilities should be consistent. For example, if we can reason our way to a conclusion in more than one way, then all ways should lead to the same result. It should not matter in what order we incorporate relevant information into our reasoning.

3.1 The two axioms of probability theory

Equipped with these quite reasonable requirements, we can proceed to derive the rules of probability theory.
We first seek a way to relate the probability that a proposition is true to the probability that it is false. That is, given p(A | I0), what is p(~A | I0)? Cox reasoned that if we know enough, on information I0, to decide if A is true, then the same information should be sufficient to decide if A is false. This makes intuitive sense from the point of view of symmetry, since what we call 'A' and what we call '~A' is a matter of convention. Cox stated this as the first axiom of probability theory:

Axiom 1. 'The probability of an inference (a proposition) on given evidence (the conditioning information) determines the probability of its contradictory (its negation) on the same evidence.'

In symbolic form, this says:

p(~A | I0) = F1[p(A | I0)],   (2)

where F1 is some function of a single variable.

We next seek a way to relate the probability of the logical product AB of two propositions to the probabilities of A and B separately. That is, suppose we know p(A | I0), p(B | I0), p(B | A I0), and so on, and we want to know p(AB | I0). For example, suppose that an engineer is considering the feasibility of manufacturing a metal spacer for a particular application. In order to meet its functional requirements, the spacer must have a length error of no more than 5 µm, while for economic reasons the cost of production must be held to less than $10. Now consider the two propositions:

A ≡ 'the spacer can be produced with an error of less than 5 µm.'
B ≡ 'the spacer can be produced for less than $10.'

and their logical product:

AB ≡ 'the spacer can be produced with an error of less than 5 µm, for less than $10.'

In considering whether or not to proceed, the engineer might first decide whether he has the process capability to machine a spacer with an error of less than 5 µm [p(A | I0)], and then, assuming that this is possible, decide whether the cost of production can be held to less than $10 [p(B | A I0)]. Alternatively, the engineer might first address the cost issue and assign p(B | I0), and then, on the assumption that the cost target can be met, decide whether the length error can be held to less than 5 µm [p(A | B I0)]. Either of these approaches seems reasonable, and either should provide enough information to determine p(AB | I0). Common sense reasoning along these lines led Cox to the second axiom of probability theory:

Axiom 2. 'The probability on given evidence that both of two inferences (propositions) are true is determined by their separate probabilities, one on the given evidence, the other on this evidence with the additional assumption that the first inference (proposition) is true.'

As a mathematical assertion, this becomes:

p(AB | I0) = F2[p(A | I0), p(B | A I0)],   (3)

where F2 is some function of the two variables. Of course, AB and BA are logically equivalent, so by Desideratum III we could interchange A and B in (3). Any assumed functional relation that differs from (3) can be shown to run afoul of our common sense requirements; Tribus [32] gives an exhaustive demonstration.

At this point the reader is encouraged to ponder the logical content of Cox's two axioms and to see how they agree with the intuitive process of everyday plausible reasoning. The writer knows of no case where these axioms have been shown to disagree with common sense, while the demonstrations of Tribus have shown that they are unique in this property. This is very important because once these two assertions are accepted as the axiomatic basis for probability theory, the formal rules of calculation follow by deductive logic in the form of mathematical theorems.

Equations (2) and (3) are not very informative as they stand. Some obvious constraints on the unknown functions F1 and F2 follow from Boolean algebra. Since AB = BA, for example, we must have

F2[p(A | I0), p(B | A I0)] = F2[p(B | I0), p(A | B I0)].   (4)

Also, since ~~A = A, the function F1 must be such that

F1[F1(x)] = x,   (5)

where x is an arbitrary probability. Neither of these constraints provides a sufficient restriction to determine the forms of the functions. Using a different set of Boolean relations and the requirement of consistency, R. T. Cox demonstrated that the axiomatic relations (2) and (3) can be reduced to a pair of functional equations whose solutions he proceeded to find. Details of the proofs may be found in references [4,5,12,31].

3.2 The sum and product rules

In the case of Axiom 2, the result is called the product rule:

p(AB | I0) = p(A | I0) p(B | A I0).   (6)

This is one of the two fundamental rules of probability theory. One of its immediate consequences is that certainty is represented by a probability equal to one. To see this, suppose that A implies B, so that B is certain given A. Then logically AB = A, and from (6): p(AB | I0) = p(A | I0) = p(A | I0) p(B | A I0), so that if p(A | I0) ≠ 0, then p(B | A I0) = 1 for B certainly true.

In the case of Axiom 1, solution of a second functional equation yields the sum rule:

p(A | I0) + p(~A | I0) = 1.   (7)

This is the second fundamental rule of probability theory. An immediate consequence of the sum rule is that impossibility is represented by a probability equal to zero. For if A is certainly true then ~A is false, so that p(A | I0) = 1 and from (7) we must have p(~A | I0) = 0. The sum rule expresses a primitive form of normalization for probabilities.

We noted previously that only two of the three basic Boolean operations (logical product, logical sum, and negation) are independent. It follows that the sum and product rules, together with Boolean operations among propositions, are sufficient to derive the probability of any proposition, such as the generalized sum rule:

p(A + B | I0) = p(A | I0) + p(B | I0) − p(AB | I0).   (8)

Note here that the plus sign (+) takes on different meanings depending on context, being a logical operator when it relates propositions and representing ordinary addition when applied to numbers such as probabilities. The context will make clear the meaning; the alternative is to introduce new mathematical notation which may have a strange look while adding little clarity.

At this point we collect the results of the last few paragraphs and present a summary of the unique rules for manipulating probabilities. These two simple operations form the basis for the system of reasoning called by Cox the algebra of probable inference:

Product Rule: p(AB | I0) = p(A | I0) p(B | A I0) = p(B | I0) p(A | B I0)   (9a, 9b)

Sum Rule: p(A | I0) + p(~A | I0) = 1   (10)

Deductive Limits: A is true ⇒ p(A | I0) = 1; A is false ⇒ p(A | I0) = 0   (11)

These results may look quite familiar, since they are the common rules that are derived in conventional treatments of probability and statistics, where probability is defined as the frequency of successful outcomes in a series of repeated trials. In fact, there are several distinct axiom systems for probability theory, beginning with the work of A. N. Kolmogorov [19], that lead to the same formal rules for calculation (for a discussion, see D. V. Lindley [21]). We have chosen to follow the approach of Cox because of its intuitive appeal and close connection with the process of human reasoning. The logical flow from first principles has proceeded according to:

Desiderata ⇒ Cox's two axioms ⇒ sum and product rules

The result is a general and unique system of extended logic, an algebra of inference, that is applicable to any situation where limited information precludes deductive reasoning. The uniqueness should be emphasized, because any system of reasoning in which probabilities are represented by real numbers and which disagrees with the sum and product rules will necessarily violate the very elementary, common sense requirements for rationality and consistency.

3.3 Common sense reduced to calculation

A nice demonstration of the way in which the sum and product rules accord with common sense and reproduce the way we reason intuitively follows from the work of A. J. M. Garrett and D. J. Fisher [9]. Suppose that we have an hypothesis H, with an initial probability p(H | I0) conditioned on I0, and we then obtain new information in the form of data D. Equating the two equivalent forms of the product rule, (9a-b), using propositions H and D gives

p(H | D I0) = K p(H | I0) p(D | H I0),   (12)

where K⁻¹ = p(D | I0). Repeating this operation with H replaced by ~H and dividing (12) by the resulting expression yields:

p(H | D I0) / p(~H | D I0) = [p(H | I0) / p(~H | I0)] × [p(D | H I0) / p(D | ~H I0)].   (13)

Now, p(~H | I0) = 1 − p(H | I0) and p(~H | D I0) = 1 − p(H | D I0) from the sum rule, so that replacing p(~H | I0) and p(~H | D I0) in (13) and rearranging gives:

p(H | D I0) = {1 + [1/p(H | I0) − 1] × [p(D | ~H I0) / p(D | H I0)]}⁻¹.   (14)

This is a very general result that shows how the prior (pre-data) probability p(H | I0) changes, as a result of obtaining data D, to yield the posterior (post-data) probability p(H | D I0). This is just the process of learning, whereby a state of knowledge gets updated in light of new information.

Let us explore the special cases of (14) with a particular example. Suppose that a doctor must decide a course of treatment for a patient whose symptoms and medical history suggest a working hypothesis:

H ≡ 'my patient has disease X.'
A blood test for disease X is then performed, with result

D ≡ 'the patient has tested positive for disease X.'

Before performing the test, the doctor's examination of the patient leads him to assign an initial probability p(H | I0) to his working hypothesis. Here, the conditioning information I0 includes everything relevant to the doctor's diagnosis, including his training and experience as well as the symptoms and medical history of the patient. What is the effect of obtaining the positive result of the blood test? Consider the following special cases:

1. If p(H | I0) = 1, then p(H | D I0) = 1. If the doctor is certain that the patient has disease X before the blood test, then the positive outcome could be anticipated a priori and would add no useful information. In such a case, the test itself would be unnecessary.

2. If p(H | I0) = 0, then p(H | D I0) = 0. If the doctor is certain that the patient does not have disease X before the test, then the data will have no effect on his state of belief. A positive result would most likely be dismissed as a 'false positive.' Two remarks seem relevant here. First, given that X is deemed impossible to begin with, one wonders why a blood test to detect it would be performed. We can also see the danger posed by a dogmatic refusal to allow one's beliefs to be changed by what might be highly relevant new information.

3. If p(D | H I0) = 0, then p(H | D I0) = 0. If it were impossible for a person with disease X to have a positive response to the blood test, then since the patient did test positive, he could not possibly have disease X.

4. If p(D | H I0) = p(D | ~H I0), then p(H | D I0) = p(H | I0). If data D (here a positive blood test) is equally likely whether H is true or not, then D is irrelevant for reasoning about H. The doctor would learn nothing, for example, by flipping a coin.

5. If H implies D, so that p(D | H I0) = 1, then

p(H | D I0) = p(H | I0) / {p(H | I0) + [1 − p(H | I0)] × p(D | ~H I0)}.   (15)

If a positive response always results when disease X is present, then the post-test probability p(H | D I0), given the positive response, lies in the range p(H | I0) ≤ p(H | D I0) ≤ 1 and depends strongly on p(D | ~H I0), the probability of a 'false positive.' For a perfect test, a false positive would be impossible [p(D | ~H I0) = 0] and a positive result would make H certain to be true. On the other hand, if p(D | ~H I0) ≈ 1, so that any test would be likely to yield a positive response, then p(H | D I0) ≈ p(H | I0), and one learns almost nothing.

Expression (15) provides the quantitative generalization of the work of Polya to which we referred at the end of Section 2.2. In the case where H implies D, we see that the effect of learning that D is true depends, for a given state of prior knowledge, on the probability that D is true if H is assumed to be false. Also note the very important role played by the prior probability p(H | I0). If the doctor assigns p(H | I0) > 0.9 following the initial examination, then immediate treatment for X would be indicated, with no need for a blood test. On the other hand, if p(H | I0) ≈ 0.2, the doctor might feel hesitant about beginning a treatment. In this case, a positive blood test with p(D | ~H I0) = 0.05 (a 5% chance of a false positive) would yield a post-test probability of p(H | D I0) ≈ 0.83, and the doctor would feel comfortable in treating the patient for disease X.

3.4 Mutually exclusive and exhaustive propositions

A very common situation arises when we have a set of N propositions (B1, B2, ..., BN), one and only one of which can possibly be true, conditioned on information I0. Such propositions are said to be mutually exclusive given I0, a condition that is written using the product rule:

p(Bi Bj | I0) = p(Bi | I0) p(Bj | Bi I0) = 0, for i ≠ j.   (16)

It follows from (16) and repeated use of the generalized sum rule (8) that the probability that one of the propositions is true is given by

p(B1 + B2 + ... + BN | I0) = Σ_{k=1}^{N} p(Bk | I0).   (17)

If it is further known from prior information I0 that one and only one of the propositions is certainly true, then the propositions are also exhaustive, so that the sum in (17) must be equal to one:

Σ_{k=1}^{N} p(Bk | I0) = 1.   (18)

This is the general statement of normalization for a finite set of N mutually exclusive and exhaustive propositions, a property that occurs frequently in probability theory.

3.5 Marginal probabilities

Another very common and useful operation involving mutually exclusive and exhaustive sets of propositions is called marginalization, which we will illustrate by the following example. Suppose that a manufacturer produces a large batch of metal spacers, dividing the task among N diamond turning machines. The machines have been individually adjusted, error-mapped, and characterized for machining accuracy, so that the probability that machine k produces good spacers may be assumed to be p(G | Mk I0), where

G ≡ 'the spacer is good (within tolerance)', and
Mk ≡ 'the spacer was produced by machine k.'

Because of machine and operator variations, the spacer production rate varies from machine to machine. By the end of a shift, machine Mk has produced nk spacers, so that the N machines together produce a total of n1 + n2 + ... + nN spacers, which are then mixed together and sent to inspection. If an inspector now arbitrarily selects one of these spacers, what can he say about the probability that it is in tolerance, before actually performing a measurement?

We can answer this question as follows. The joint probability that the spacer is in tolerance and that it was produced by machine k is p(G Mk | I0). From the product rule we then have

p(G Mk | I0) = p(G | I0) p(Mk | G I0) = p(Mk | I0) p(G | Mk I0).   (19)

Equating these expressions and summing over the N machines gives

p(G | I0) Σ_{k=1}^{N} p(Mk | G I0) = Σ_{k=1}^{N} p(G | Mk I0) p(Mk | I0).   (20)

Now observe that the propositions Mk form a mutually exclusive and exhaustive set, so that

Σ_{k=1}^{N} p(Mk | G I0) = 1.   (21)

The inclusion of the proposition G as a part of the conditioning information does not alter the normalization constraint, since the condition of the spacer does not change the fact that it was produced by only one of the N machines. The probability that the spacer is good is thus:

p(G | I0) = Σ_{k=1}^{N} p(G | Mk I0) p(Mk | I0).   (22)

The left-hand side of (22) is called the marginal probability of G, and we can see that it is a weighted sum over the probabilities p(G | Mk I0) for the individual machines to produce good spacers, with each term weighted by the probability p(Mk | I0) that the particular spacer chosen was produced by machine k. The latter may be easily shown (and is probably intuitively obvious to the reader) to be equal to nk / (n1 + n2 + ... + nN), the fraction of the total number of spacers produced by machine k.

In a problem like this the proposition Mk is called a nuisance parameter, which means a quantity that affects the inference and occurs in the analysis but is of no particular interest in itself. Another example is the error of a measuring instrument that affects the estimate of a measured quantity but is itself unknown. Marginalization is the way to account for the effects of nuisance parameters by effectively averaging over all possible values.
4. Uncertainty and Random Variables

4.1 The meaning of a random variable

Since no measurement is perfect, no statement of an exact value for a measured quantity is logically certain to be true. Therefore our belief in a proposition such as:

y ≡ 'the length of the spacer lies between y and y + Δy'

is necessarily uncertain no matter how well we perform a length measurement. Consistency then requires that we communicate the result of a measurement in the language of probability theory, using the unique rules of the algebra of probable inference. In order to do this, we need a mathematical representation for a state of knowledge about a measurand (such as the length of a spacer) corresponding to all available information after performing a measurement.

In the view of measurement as inference, all physical quantities (except, of course, for defined constants such as the speed of light in vacuum) are treated as random variables. This may seem counter to the spirit of deterministic metrology, because the words 'random' and 'variable' suggest an uncontrolled environment and noisy instruments, where meaningful data can only be obtained by repeated sampling and statistical analysis. The word 'variable', in particular, seems singularly inappropriate to describe the result of a dimensional measurement. At the time of its measurement, for example, the length of a metal spacer is not a variable at all but rather an unknown constant whose value we are trying to estimate on the basis of given (but incomplete) information. The issue here turns out to be purely one of semantics. In probability theory, a random variable is defined as 'a variable that may take any of the values of a specified set of values and with which is associated a probability distribution' (GUM C.2.2).

In discussing a quantity such as length, it is important to distinguish between (a) length as a concept (specified by a description, or definition), (b) the length Y of a particular spacer (a random variable), and (c) the set of values that could reasonably be attributed to Y, consistent with whatever information is available. The result of a measurement is only one of an infinite number of such values that could, with varying degrees of credibility, be so attributed. Similarly, a handbook value for a parameter such as a thermal expansion coefficient is only one of its possible values, given a state of incomplete information. Probability theory, as applied to the measurement process, is concerned with these possible values, or outcomes, and their associated probability distributions.

4.2 Continuous probability distributions

A state of knowledge about (or degree of belief in) the value of a quantity, such as the length of a metal spacer, can be represented by a smooth continuous function whose qualitative features can be derived using the sum and product rules as follows. Denote the length of a spacer by Y, let y be some particular value, and consider the probability

p(Y ≤ y | I0) ≡ F(y), 0 ≤ F(y) ≤ 1.

Here F(y) is evidently a monotonic non-decreasing function of y called a cumulative distribution function (CDF). Since the length of any real spacer will certainly be greater than some very small value of y and less than a very large value, the qualitative behavior of F(y) will look similar to the curve shown in Fig. 1.

[Figure 1. The probability p(Y ≤ y | I0) that the length Y of a spacer is less than or equal to a given length y, where y denotes position along a length axis.]

Now suppose we are interested in the probability that Y lies in the interval a < Y ≤ b. Define the propositions:

A ≡ 'Y ≤ a'
B ≡ 'Y ≤ b'
C ≡ 'a < Y ≤ b'.
These propositions satisfy the Boolean relation (logical sum) B = A + C, and since A and C are mutually exclusive,

p(B | I_0) = p(A + C | I_0) = p(A | I_0) + p(C | I_0),

we have

p(C | I_0) = p(B | I_0) − p(A | I_0) = F(b) − F(a) = ∫ₐᵇ f(y) dy, (23)

where f(y) ≡ dF(y)/dy is called the probability density function (pdf) for the possible values of Y. The qualitative behavior of the pdf corresponding to the CDF of Fig. 1 is displayed in Fig. 2: f(y) = dF/dy is typically a continuous, single-peaked (called unimodal), symmetric function of location y.

In order to avoid the proliferation of mathematical symbols, we will use the notation p(y | I_0) = f(y), so that the probability of the proposition y ≡ 'the length of the spacer lies in the interval y, y + dy' will be written simply p(y | I_0)dy. The identification of p(y | I_0) with a probability density rather than a simple probability should be clear from the context. Also, a density function may sometimes be called a 'distribution' in accord with common parlance, and for brevity the same symbol may be used for a quantity and its possible values.

The best estimate of the length of the spacer is, by definition, the expectation (also called the expected value or mean) of the distribution, given by

E(Y) = y_0 ≡ ∫₋∞⁺∞ y p(y | I_0) dy. (24)

Figure 2. The probability density function (pdf) f(y) corresponding to the cumulative distribution function of Fig. 1. For this function, the best estimate (or expectation) of Y, denoted y_0, corresponds to the peak in the pdf.

For a symmetric, single-peaked pdf such as the one shown in Fig. 2, y_0 is also the value for which p(y | I_0) is a maximum, called the mode of the pdf.
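As a numerical check on the sum-rule result p(a < Y ≤ b | I_0) = F(b) − F(a), the sketch below (Python, with hypothetical numbers: a normal state of knowledge with best estimate 25.000 mm and standard uncertainty 0.002 mm) compares the CDF difference with a direct midpoint integration of the pdf f(y) = dF/dy.

```python
import math

def normal_cdf(y, y0, sigma):
    """Cumulative distribution F(y) = p(Y <= y | I0) for a normal pdf."""
    return 0.5 * (1.0 + math.erf((y - y0) / (sigma * math.sqrt(2.0))))

def interval_probability(a, b, y0, sigma):
    """p(a < Y <= b | I0) = F(b) - F(a), per the sum-rule derivation."""
    return normal_cdf(b, y0, sigma) - normal_cdf(a, y0, sigma)

def normal_pdf(y, y0, sigma):
    """The density f(y) = dF/dy for the normal case."""
    return math.exp(-(y - y0) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical spacer: best estimate 25.000 mm, standard uncertainty 0.002 mm.
p = interval_probability(24.998, 25.002, 25.000, 0.002)  # one-sigma interval

# Check against direct midpoint integration of the pdf over the same interval.
n = 10000
h = (25.002 - 24.998) / n
riemann = sum(normal_pdf(24.998 + (i + 0.5) * h, 25.000, 0.002) for i in range(n)) * h
```

The two routes agree, which is just Eq. (23) in numerical form.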
A useful parameter that characterizes the dispersion of plausible or reasonable values of Y about the best estimate y_0 is given by the positive square root of the variance σ_y², defined in (25) below. As we shall see in Sec. 6.3, the normal pdf follows from the principle of maximum entropy when one's knowledge consists only of an estimate y_0 together with an associated standard uncertainty σ. The normal pdf plays a central role in probability theory and measurement science.

4.3 Levels of confidence and coverage factors

In the language of the GUM, we associate a level of confidence in our knowledge of a quantity with a number k called a coverage factor. For the spacer example, with estimated length y_0 and associated uncertainty σ, this is interpreted to mean that the length Y may be expected to lie in the interval y_0 ± kσ with an integrated, or cumulative, probability P(k). The standard deviation (or standard uncertainty) thus sets the scale of uncertainty and is often called a scale parameter. The relation between k and P depends on the assumed functional form of the pdf; for the normal distribution we have the well-known and often-employed values P = [68%, 95.5%, 99.7%] for k = 1, 2, and 3, respectively. Since we are reasoning about a single, particular spacer, we point out that these probabilities have no frequency interpretation. Their magnitudes become significant (a) in the propagation of uncertainty, where the result of some other measurement depends on the spacer length, and (b) in the context of a subsequent decision where the length of the spacer is an element of risk. A great deal of time can be wasted in heated arguments concerning the exact form of the density p(y | I_0), which describes not reality in itself but only one's knowledge about reality.
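For the normal pdf, the relation between the coverage factor k and the cumulative probability P(k) quoted above has the closed form P(k) = erf(k/√2), which a few lines of Python reproduce:

```python
import math

def coverage_probability(k):
    """P(k) = p(|Y - y0| <= k*sigma | I0) when the pdf is normal."""
    return math.erf(k / math.sqrt(2.0))

# Reproduces the often-employed normal-distribution values for k = 1, 2, 3.
levels = [coverage_probability(k) for k in (1, 2, 3)]
```

These evaluate to approximately 68.3%, 95.4%, and 99.7%, matching the values quoted in the text to the precision given there.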
It can be helpful to realize that there exists a very general and useful quantitative bounding relation on the level of confidence associated with the best estimate y_0, one that is independent of the detailed nature of the pdf so long as the pdf has finite expectation and variance and is properly normalized. The variance σ_y² is defined by

σ_y² ≡ E[(Y − y_0)²] = ∫₋∞⁺∞ (y − y_0)² p(y | I_0) dy, (25)

and if y_0 is the best estimate of Y, then it is straightforward to show that σ_y² = E(Y²) − y_0². Proper normalization means that

∫₋∞⁺∞ p(y | I_0) dy = 1. (26)

The quantity σ_y is called the standard deviation of the pdf p(y | I_0). The GUM defines an estimated standard deviation to be the standard uncertainty associated with an estimate y_0, using the notation u(y_0) ≡ σ_y. The uncertainty characterizes a state of knowledge and is not a physical attribute of the spacer or something that could be measured in a metrology laboratory. For this reason it makes no sense to argue about the 'true' value of the uncertainty. An expression of uncertainty is always correct when properly based on all relevant information. If two people express different uncertainties, then they must be reasoning from different states of prior information or sets of prior assumptions.

In a similar way, a probability density function models a state of knowledge, and is not something that could be measured in an experiment. The function shown in Fig. 2 is the familiar normal (or Gaussian) density, defined by

p(y | I_0) = (1/(σ√(2π))) exp[−(y − y_0)²/(2σ²)] ≡ N(y; y_0, σ²), (27)

where for simplicity we write σ in place of σ_y. As we shall see in Sec. 6.3, the normal density is a consequence of a general principle for assigning probabilities, called the principle of maximum entropy.

Whatever the form of the pdf, we have the bound

p(|Y − y_0| ≥ kσ | I_0) ≤ 1/k², (28)

a result known as the Bienaymé-Chebyshev inequality [7, 28].
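The bound (28) holds for any properly normalized pdf with finite expectation and variance. The fragment below (Python, using the normal distribution merely as a convenient, analytically known test case) checks that the exact tail probability never exceeds 1/k².

```python
import math

def normal_exceedance(k):
    """Exact p(|Y - y0| >= k*sigma | I0) when the pdf is normal."""
    return 1.0 - math.erf(k / math.sqrt(2.0))

# The Bienayme-Chebyshev bound 1/k^2 is loose but universal: it must
# dominate the exact tail probability for every k.
for k in (1.5, 2.0, 3.0, 4.0):
    assert normal_exceedance(k) <= 1.0 / k**2
```

For the normal case the bound is very conservative (at k = 3 the exact tail is about 0.27% versus the bound's 11%), which is the price of its complete generality.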
From this we see, for example, that not less than 8/9 ≈ 89% of the probability for the length of the spacer is contained in the interval y_0 ± 3σ, whatever the distribution p(y | I_0). Thus we suggest that there is little to be gained in debate over the exact form of the pdf. If the uncertainty σ is too large to permit a confident decision, then the proper course of action is usually to reduce the uncertainty and sharpen the distribution p(y | I_0) by performing an appropriate measurement.

[NOTE: In writing expressions such as (24) and (27), we use the formal limits (−∞, +∞) and recognize that, since physical lengths are positive, we must strictly require p(y | I_0) = 0 for y ≤ 0. In practice it is common to represent states of knowledge by pdfs, such as the normal distribution, that are non-zero over an infinite range. The mathematical convenience afforded by these analytic functions more than compensates for the infinitesimally small, non-zero probabilities for impossible values of physical quantities.]

5. Measurement as inference: Bayes' Theorem

Now suppose that we have a proposition H in the form of an hypothesis, and that we subsequently obtain some relevant data D. As usual we denote our prior information by I_0. Writing the two equivalent forms of the product rule (9a-b),

p(HD | I_0) = p(H | I_0) p(D | HI_0) = p(D | I_0) p(H | DI_0),

and rearranging, yields Bayes' theorem, Eq. (29) below.

We now measure the length of the spacer as illustrated in Fig. 3, using a linear indicator to take a pair of readings before and after insertion of the spacer. Before performing such a measurement we typically have only vague knowledge of the length of the spacer, conditioned primarily by our understanding of and experience with the production process, with such vague knowledge reflected in a broad prior distribution.
This is not a weakness of the approach but rather its motivation: the whole purpose of performing the measurement is to sharpen this broad distribution, refine our knowledge, and reduce our uncertainty with respect to the length of the spacer.

Bayes' theorem reads

p(H | DI_0) = p(H | I_0) p(D | HI_0) / p(D | I_0), (29)

which is the starting point for the system of reasoning known as Bayesian inference. From its very trivial derivation we see that Bayes' theorem is not a profound piece of mathematics, being no more than a restatement of the consistency requirement of probability theory. Nevertheless, Bayes' theorem gives the general procedure for updating a probability in light of new, relevant information; it is a modified form of (14) in which only the hypothesis H appears, and not its negation.

Before we obtain the data D, the degree of belief in hypothesis H, conditioned on information I_0, is represented by the prior probability p(H | I_0). When we learn of the data D, the prior probability is multiplied by the ratio on the right side of (29) to yield the posterior probability p(H | DI_0). The quantity p(D | HI_0) is called the likelihood of H given the data D, and is viewed as the probability of obtaining the data if the hypothesis is assumed to be true. The denominator p(D | I_0) has no special name, although it is sometimes called the global likelihood. It is equal to the probability of obtaining the data whether H is true or not, and can be written as a marginal probability using the sum rule:

p(D | I_0) = p(D | HI_0) p(H | I_0) + p(D | ¬H I_0) p(¬H | I_0). (30)

Since p(D | I_0) is a constant, independent of H, Bayes' theorem is commonly written in the form

p(H | DI_0) = K p(H | I_0) p(D | HI_0), (31)

with K equal to a normalization constant. In a typical measurement problem, H stands for a proposition concerning a dimension of interest and D represents the measurement data.
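Equations (29) and (30) can be exercised with a small numeric example. The sketch below (Python) uses hypothetical numbers for an inspection decision: H is the proposition that a part is in tolerance, and D is a passing reading from a go/no-go gauge; all three input probabilities are assumed for illustration only.

```python
def posterior(prior_h, likelihood_h, likelihood_not_h):
    """Bayes' theorem (29), with the global likelihood (30) as denominator."""
    global_likelihood = likelihood_h * prior_h + likelihood_not_h * (1.0 - prior_h)
    return likelihood_h * prior_h / global_likelihood

# Hypothetical inspection: H = 'the spacer is in tolerance'.
# Prior from process history: p(H | I0) = 0.90 (assumed number).
# Gauge behavior (assumed): p(pass | H I0) = 0.99, p(pass | not-H I0) = 0.10.
p_h_given_pass = posterior(0.90, 0.99, 0.10)
```

The passing reading raises the degree of belief in H from 0.90 to roughly 0.99, the update being driven entirely by the ratio of the two likelihoods.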
The likelihood is then equal to the probability of obtaining the data D as a function of an assumed dimension specified in H. The way in which the result of the measurement affects our degree of belief in H is completely contained in the likelihood function.

To illustrate how Bayes' theorem is used in dimensional metrology, let us consider a very simple one-dimensional example in which a linear indicator is used to measure the length of a metal spacer. Assume that we have just manufactured such a spacer and that we need to measure its length in order to decide whether or not it is acceptable. Before performing the measurement, our knowledge of the length of the spacer is described by a prior pdf p(y | I_0), where as before p(y | I_0)dy is the probability that the length of the spacer lies in the interval y, y + dy. The width of the prior pdf, as characterized by its variance σ_p², is a measure of our uncertainty in the length of the spacer, with the best estimate of the length, y_p, corresponding to the expectation of the distribution. Usually we would have only limited information about the spacer before performing the measurement.

Figure 3. The length of a metal spacer is measured using a linear indicator. The result of the measurement is the estimate y_m.

The difference in the two indicator readings is the result of the measurement y_m. The probability that a spacer of actual length y would yield measurement data y_m is just the likelihood function p(y_m | yI_0), whose width, as characterized by its variance σ_m², is a measure of the quality of the measurement process (here, the linear indicator). This is where experimental design enters the picture, because we want the likelihood to be sharply peaked about the actual length of the spacer.
We then use Bayes' theorem to find the updated (posterior) probability distribution that describes our knowledge of the length of the spacer after performing the measurement:

p(y | y_m I_0) = K p(y | I_0) p(y_m | yI_0), (32)

where K⁻¹ = p(y_m | I_0) = ∫₋∞⁺∞ p(y_m | yI_0) p(y | I_0) dy.

This process is illustrated in Fig. 4, where we sketch the qualitative forms of the relevant distributions. When the likelihood is sharply peaked relative to the prior (pre-data) distribution, the posterior (post-data) distribution will be dominated by the peak in the likelihood, so that the exact form of the prior distribution becomes irrelevant. This is almost always the case for common engineering measurements, where the measurement process is arranged so that σ_m² << σ_p² (sharply peaked likelihood). Under these conditions, the prior distribution will be nearly constant in the region where the likelihood is appreciable, and essentially all knowledge of the measurand (here, the length of the spacer) derives from the measurement data. For such a locally uniform prior probability, Bayes' theorem thus reduces to the approach known as maximum likelihood, so called because the best post-data estimate of the value of the measurand coincides with the peak in the likelihood function.

Figure 4. In a typical engineering measurement, such as measuring the length of a metal spacer, the (post-data) posterior distribution is dominated by a sharply peaked likelihood function. The best estimate of the spacer length, y_m, then very nearly coincides with the peak in the likelihood, and the prior (pre-data) distribution becomes irrelevant. The curves are not to scale.

A common source of systematic error in such a length measurement is a possible scale error in the linear indicator.
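For the common special case in which both the prior p(y | I_0) and the likelihood p(y_m | yI_0) are normal, the posterior (32) is again normal and can be written in closed form. The sketch below (Python, with hypothetical numbers) shows the dominance of a sharp likelihood (σ_m = 1 µm) over a broad prior (σ_p = 50 µm), as described above; the conjugate-normal formulas are standard results, not taken verbatim from the paper.

```python
def normal_posterior(y_p, var_p, y_m, var_m):
    """Posterior mean and variance for a normal prior N(y_p, var_p) and a
    normal likelihood centered on the data y_m with variance var_m.
    Closed-form special case of Bayes' theorem, Eq. (32)."""
    w = 1.0 / var_p + 1.0 / var_m             # posterior precision
    y_post = (y_p / var_p + y_m / var_m) / w  # precision-weighted mean
    return y_post, 1.0 / w

# Broad prior from process experience; sharp likelihood from the indicator.
# All lengths in mm; the specific values are assumed for illustration.
y_post, var_post = normal_posterior(25.000, 0.050**2, 25.012, 0.001**2)
# y_post lies essentially at the peak of the likelihood, y_m = 25.012.
```

With σ_m² << σ_p² the prior contributes almost nothing: the posterior mean reproduces y_m to well within σ_m, which is the maximum-likelihood limit discussed above.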
In order to correct for this error, we can perform a calibration using a gauge block (length standard) whose estimated length y_g is known to within a small uncertainty σ_g. In the case of a calibration, the measurand is the error, and Bayes' theorem is written

p(e | e_m I_0) = K′ p(e | I_0) p(e_m | eI_0), (33)

where K′ is a constant, e ≡ 'the indicator systematic error lies in the range e, e + de', and e_m is the result of the measurement, given by the difference between the indicator data and the estimated length of the standard: e_m = y_m − y_g. The prior distribution p(e | I_0) is typically symmetric about zero in the absence of any a priori knowledge about the sign of the systematic error. The likelihood p(e_m | eI_0) will be sharply peaked because of the small uncertainty in the length of the standard. Again, the posterior distribution for the indicator systematic error is dominated by the peak in the likelihood, and whatever is known a priori becomes irrelevant. This situation is illustrated in Fig. 5.

Figure 5. Calibration of a linear indicator using a gauge block. The measurand is now the systematic error of the indicator, and the sharply peaked likelihood reflects the low uncertainty in the length realized by a gauge block.

Measurement and calibration are thus seen to be complementary operations in Bayesian inference. The mechanics of taking the data are exactly the same in both cases, but we are asking different questions: in a measurement we focus on the length of a workpiece, in a calibration on the systematic error of an indicator. The mathematics is the same, the only differences being in the identification of the measurand and the nature of the prior information. The calibration/measurement process relies on the ordering σ_g² << σ_m² << σ_p².
The GUM makes no reference to a prior probability distribution for a measurand (while encouraging the use of assumed a priori distributions to describe knowledge of the input quantities upon which the measurand depends). From a theoretical point of view this has to be regarded as inconsistent. Operationally, it amounts to an implicit assumption of a uniform (constant) distribution to describe prior knowledge of the measurand, with the best estimate to be supplied by the measurement data via the likelihood function.

6. The assignment of probabilities

The sum and product rules, together with Bayes' theorem, are the unique algebraic tools for working with and manipulating probabilities, but the question remains of how to assign prior probabilities in the first place in order for a calculation to get started. Since probabilities represent (or encode) states of knowledge or degrees of reasonable belief, what is needed are principles by which whatever information is available can be uniquely incorporated into a probability distribution. This problem is addressed in the GUM, for variables other than the measurand, where such distributions are called a priori probability distributions, with associated variances whose positive square roots are called Type B standard uncertainties.

There is no easy way to assign a real number to the probability of an uncertain proposition such as A ≡ 'there is life on Mars', but for the quantities of interest in engineering metrology the International System of Units (SI) provides a set of location parameters that makes such assignment possible. These parameters are the continuous variables, such as position or mass, with respect to which we can order degrees of belief and over which we can sum discrete probabilities or integrate probability densities in order to effect normalization. There are three principal theoretical approaches to the consistent assignment of prior probabilities in problems of engineering metrology.
By 'consistent' we mean that two persons with the same state of knowledge should assign the same probabilities. There is really no conceptual difference between assigning a prior probability distribution for a measurand before performing a measurement and evaluating the likelihood function for the measurement process after the data are in hand. Both operations yield probability distributions that describe degrees of belief, and both require the exercise of judgment, insight, knowledge, experience, and skill. In the final analysis it should be recognized that the limiting uncertainty of a measurement cannot be gleaned from anything in the measurement data itself, nor can the error be known in the sense of a logical deduction.

6.1 The representation of ignorance

Since a probability distribution for a quantity of interest encodes what is known about the quantity, it is interesting to ask for the distribution that describes a state of complete ignorance. For example, suppose that a long metal bar is engraved with a single ruled line whose position along the bar is unknown. Here our state of knowledge consists simply of the line's existence, with no information that would lead us to favor any location over any other. How can we represent this state of ignorance?

We reason as follows: denote position along the bar by x, and let f(x)dx be the probability that the line lies in the interval x, x + dx. Ignorance of location then suggests that the probability should be invariant with respect to the translation x → x′ = x + a, where a is an arbitrary constant. Thus the density f(x) should satisfy

f(x)dx = f(x′)dx′, (34)

and since dx′ = dx, we have f(x) = f(x + a), which implies that

f(x) = constant. (35)

Thus the probability density that describes ignorance of a location parameter, such as the position of the ruled line or the magnitude of an error, is the uniform density.
Now suppose that there are two lines ruled on the metal bar, thus forming a line scale, and that we are interested in the length L between them. The probability that the length lies in the interval L, L + dL is written g(L)dL. Suppose that we are completely ignorant of the line spacing, in the sense that we have no definite scale for the unit of length. We can imagine drawing a graph of g(L) versus L, using some local, arbitrary unit of length. Another metrologist, perhaps using a photograph of the line scale, might draw a graph in different units, g(L′), where L′ = bL, with b equal to an unknown scale factor. If the two states of knowledge (or, more correctly, ignorance) are to be the same, then we should assign the same probability to equivalent intervals on the two graphs. That is, we should require that g(L′)dL′ = g(L)dL, with L′ = bL, so that

g(bL)d(bL) = g(L)dL. (36)

Thus we require that g(bL) = (1/b)g(L), so that the probability density g(L) is given by

g(L) = 1/L. (37)

A parameter such as the line spacing that is known a priori to be positive is called a scale parameter. Another scale parameter is the standard deviation of a probability distribution for the error of a length measurement. We have shown that the invariant density that represents ignorance of a scale parameter is the reciprocal density g(L) = 1/L. This is a strange-looking probability density that appears more reasonable if we write the equivalent forms

g(L)dL = dL/L = d(ln L), (38)

so that requiring g(bL)d(bL) = g(L)dL is equivalent to the statement that the probability density of ln L is constant. (39)

Thus ignorance of a scale parameter is represented by a uniform distribution of the logarithm of the parameter. The results given by (35) and (39) for the prior densities representing ignorance of location and scale parameters were originally proposed by Jeffreys [16], using heuristic plausibility arguments.
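The scale invariance expressed by (36)-(37) can be checked numerically. The sketch below (Python) normalizes the reciprocal density over a finite support, as is appropriate for a proper prior in practice, and shows that interval probabilities are unchanged by an arbitrary change of units (the factor 25.4 is just an assumed example).

```python
import math

def reciprocal_prob(l1, l2, lmin, lmax):
    """p(l1 < L <= l2) under the reciprocal density g(L) = 1/L of Eq. (37),
    normalized over a finite support [lmin, lmax] so that it is proper."""
    return math.log(l2 / l1) / math.log(lmax / lmin)

# Probability in some interval, expressed in a local (arbitrary) length unit.
p_local = reciprocal_prob(2.0, 4.0, 1.0, 100.0)

# Rescale every length by an unknown factor b; the probability is unchanged.
b = 25.4  # assumed change-of-units factor, for illustration
p_scaled = reciprocal_prob(2.0 * b, 4.0 * b, 1.0 * b, 100.0 * b)
```

Both evaluations give ln(2)/ln(100): the reciprocal density assigns equal probability to equal ratios, which is exactly the uniform-in-ln L statement of (39).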
They were subsequently placed on a firm theoretical foundation by Jaynes [14], who invoked a 'desideratum of consistency' to express the reasonable requirement that in two problems where we have the same information, we should assign the same probabilities.

In the case of complete ignorance, where the parameters have infinite range (−∞ < x < +∞ and 0 < L < +∞), the prior probability densities (35) and (39) cannot be normalized, since the corresponding integrals are undefined. Such prior distributions are called improper priors and have been the subject of much controversy and criticism, since a non-normalizable function can obviously not represent a probability density. In response, we make several observations. First, in almost any real application using Bayes' theorem, the prior distribution occurs in both the numerator and denominator, and so cancels out of the calculation. In such a case, the fact that we might be using an improper prior becomes moot. Next, in the real world of engineering metrology we are never completely ignorant in the mathematical sense. As previously argued, the length of a real workpiece, such as a metal spacer, will certainly be greater than some definite small value and less than a definite large value, so that the relevant probability density will vanish outside of such finite limits, and the normalization integral will always converge to unity. In the unusual case where the posterior distribution itself turns out to be improper, this fact should serve as a warning that there is not enough information in the measurement data to make a confident inference with respect to the measurand. In spite of the mainly theoretical problems with improper priors, they are useful in real problems as labor-saving devices when the exact finite limits of the relevant prior densities make no resolvable difference in the calculations.
6.2 Symmetry and the principle of indifference

Consider a discrete collection of n propositions (A_1, …, A_n) that form an exhaustive and mutually exclusive set on prior information I_0. Furthermore, suppose that there is nothing in information I_0 that would lead us to believe that any one of the propositions was more or less probable than any other. In such a case we must have p(A_j | I_0) = p(A_k | I_0) for any pair of propositions A_j, A_k. If this were not the case, then by simply permuting the numbering scheme of the propositions we could demonstrate two problems, each with the same prior information but with different probability assignments. The assignment of equal probabilities in this case is perhaps intuitively obvious given the symmetry of the situation, and employs what is often called the principle of indifference, a term introduced by J. M. Keynes [18].

Now since Σₖ₌₁ⁿ p(A_k | I_0) = 1 (exhaustive constraint), and since all of the probabilities p(A_k | I_0) are equal, we have necessarily

p(A_k | I_0) = 1/n,  k = 1, …, n. (40)

The result (40) is perhaps the oldest and most familiar of all probability assignments. It will appear as a special case of the principle of maximum entropy, to be described in the next section, but we choose to introduce it separately because of its importance in probability theory. The principle of indifference leads, of course, to the equal a priori probabilities that characterize games of chance such as drawing cards or rolling dice. Note, however, that the 1/n probability assignment is a logical consequence of the sum and product rules of probability theory applied to a set of exhaustive and mutually exclusive propositions, given a particular state of prior knowledge. There is no need to imagine an infinite set of repeated experiments and an imagined distribution of limiting frequencies.
Of course, given the probabilities, it is a straightforward procedure to calculate the expected frequency of any particular outcome in a set of repeated trials, and thus to compute, for example, the familiar odds of the gambler. Such calculations are developed in great detail in most books on probability and statistics.

The uniform 1/n discrete probability distribution can be usefully employed to characterize ignorance of a physical dimension, such as the length Y of a metal spacer. We choose an interval [ymin, ymax] that is certain, based on engineering judgment, to contain the length Y, and we divide this interval into a large number n of discrete lengths (y_1, …, y_n). Here n is chosen so that the discrete lengths y_k are separated by less than the measurement resolution. A state of knowledge about the length of the spacer can now be represented by the discrete probability distribution (p_1, …, p_n), where p_k ≡ p(Y = y_k | I_0). If our prior information I_0 consists only of knowledge of the interval [ymin, ymax] together with an enumeration of the possible lengths (y_1, …, y_n), then the only consistent and unbiased probability assignment is the uniform distribution (p_1, …, p_n) = (1/n, …, 1/n).

6.3 The principle of maximum entropy

Since probabilities represent states of knowledge, it is useful and productive to think about the information content of a probability distribution for a physical quantity. In this view, an accurate measurement supplies missing information that sharpens a vague, poorly informative prior distribution. Said a different way, the information provided by a measurement serves to reduce uncertainty with respect to the value of an unknown quantity, such as the length of a metal spacer.
In the interpretation of the GUM, what we call 'uncertainty' is just the standard deviation of the probability distribution that describes the values of a quantity that are reasonable or plausible in the sense of being consistent with whatever is known (or assumed) to be true. This kind of uncertainty we might call 'location uncertainty', because the standard deviation is a characteristic measure of the region about the expectation of the distribution in which there is an appreciable probability that the value of the quantity is located. If we think more carefully about this, however, we can see that the GUM type of location uncertainty is useful and realistic only for particular states of knowledge.

To illustrate, suppose that an inspector has two highly repeatable length gauges of identical quality, except for the fact that one of them has a significant zero offset z_0, while the other has a negligible offset. The inspector proceeds to measure the length of a metal spacer, but fails to record which of the two gauges was used for the measurement. In this case the measurement process would yield a doubly-peaked (or bi-modal) probability distribution, with the two peaks separated by the unknown gauge offset z_0, as shown in Fig. 6. If the other uncertainty components were negligible, the two peaks would be very narrow and the combined standard uncertainty (the standard deviation of the distribution) would be well approximated by z_0/2.

Figure 6. A bi-modal probability distribution for the length of a spacer measured using a gauge with one of two possible zero offsets, zero or z_0. The actual offset is unknown. If the peaks are very narrow relative to their separation, the combined standard uncertainty of the measurement (standard deviation of the distribution) is approximately equal to z_0/2.

Several features of this situation should be noted.
First, we see that the standard deviation z_0/2 is a measure of the width of the region between the two peaks of the distribution, over most of which there is a negligible probability of containing the true length of the spacer. The expectation of the distribution, in particular, lies in the center of this low-probability region. From this we see that the GUM identification of a best estimate with an expectation is useful only for certain types of probability distributions, and that an estimated standard deviation may not be the best uncertainty parameter in all cases. In particular, should the unknown zero offset increase, so would the combined standard uncertainty, together with the inclusion of more and more highly improbable values for the spacer length.

Now notice that there is a sense in which increasing the gauge offset error z_0 adds no additional uncertainty at all. If we asked 'Which of the two gauges was used to perform the measurement?', and somehow managed to obtain this information, then the spacer length probability distribution would collapse via Bayes' theorem to a single narrow peak, and the length of the spacer would be known with high accuracy. This operation is clearly independent of z_0, depending only on our knowing that the probability distribution has two narrow peaks, independent of their separation. The information supplied by the answer to our question decreases our uncertainty about the length of the spacer, just as might be accomplished by repeating the measurement with a gauge of known offset.

This suggests that there is another way to think about the uncertainty of a probability distribution, one that depends only on the form of the distribution itself and not on the actual values of the quantity described by the distribution. Such an approach leads to the concept of entropy. Consider again a set (y_1, …, y_n) of possible lengths of a spacer, with a corresponding discrete probability distribution (p_1, …, p_n).
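The claim that the combined standard uncertainty of the bi-modal distribution of Fig. 6 approaches z_0/2 for narrow peaks is easy to verify. The sketch below (Python, with assumed numbers) uses the standard variance formula for an equal-weight mixture of two normal peaks centered at ±z_0/2.

```python
import math

def bimodal_std(z0, sigma_peak):
    """Standard deviation of an equal-weight mixture of two normal peaks,
    each of width sigma_peak, separated by z0 (the unknown gauge offset
    of Fig. 6). For such a mixture the variance is the within-peak
    variance plus the variance of the two peak centers at +/- z0/2."""
    return math.sqrt(sigma_peak**2 + (z0 / 2.0)**2)

# Narrow peaks (assumed: z0 = 10 um offset, 0.1 um peak width, in mm):
s = bimodal_std(z0=0.010, sigma_peak=0.0001)
# s is within 0.1% of z0/2 = 0.005 mm, as stated in the text.
```

As the text observes, this number grows with z_0 even though the extra 'spread' lies almost entirely in a region of negligible probability.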
We have argued that a state of complete ignorance as to the length of the spacer is represented by the uniform distribution (p_1, …, p_n) = (1/n, …, 1/n), and it seems intuitively reasonable that the uniform distribution describes a state of maximum uncertainty. Now imagine a contrasting situation in which we know for certain that the length of the spacer is Y = y_k, so that p_k = 1 and p_j = 0 for j ≠ k. A plot of the distribution (p_1, …, p_n) versus index number j would display a single spike at j = k with unit probability and zeros everywhere else. Since the length of the spacer in this case is known, we have zero uncertainty, in the sense of needing no more information in order to decide the length state of the spacer, and our certainty is reflected in the sharply spiked probability distribution. We see here how the shape of the probability distribution encodes general properties that we identify with information and uncertainty.

This raises the interesting question as to whether there exists some unique function of the distribution (p_1, …, p_n) that might serve as a numerical measure of the amount of information (in a sense to be described) needed to reduce a state of incomplete knowledge to a state of certainty. Such a function, called the entropy of the distribution, was found by C. E. Shannon [29] in the context of communication theory. We proceed to sketch the arguments that lead to the mathematical form of the entropy function.

Given a discrete probability distribution (p_1, …, p_n), we seek a function H(p_1, …, p_n) that will serve to measure information uncertainty (in contrast to the location uncertainty measured by a standard deviation). Following Shannon, we require the function H, if it exists, to satisfy the following reasonable conditions:

Condition 1. H(p_1, …, p_n) should be a continuous function of the probabilities (p_1, …, p_n).

Condition 2.
If all of the probabilities are equal, so that p_k = 1/n for all k, then H(1/n, …, 1/n) should be a monotonically increasing function of the positive integer n. More choices should mean more uncertainty.

Condition 3. If a problem is reformulated by grouping subsets of the probabilities and calculating the uncertainty in stages, the final result must be the same for all possible groupings. This is a consistency requirement.

Figure 7. Illustrating the grouping of inferences. The information uncertainty should be the same in both cases. In (b), the uncertainty associated with the choice of p_2 or p_3 occurs with probability q = p_2 + p_3.

We illustrate Condition 3 by example (see Fig. 7). Consider a problem in which there are three possible inferences with probabilities (p_1, p_2, p_3), as shown in Fig. 7(a). The information uncertainty is H(p_1, p_2, p_3). Now suppose that we proceed in two steps by grouping the inferences as shown in Fig. 7(b). The first step involves the choice of either p_1 or q = p_2 + p_3, with an uncertainty of H(p_1, q). Then, with probability q, there will be an additional uncertainty associated with the choice of either p_2 or p_3, in the amount of H(p_2/q, p_3/q). Shannon's Condition 3 then requires that the information uncertainty be the same in both cases:

H(p_1, p_2, p_3) = H(p_1, q) + qH(p_2/q, p_3/q). (41)

Shannon generalized the result (41) to derive a functional equation for H(p_1, …, p_n) and then showed that the unique solution for the measure of information uncertainty, called the entropy of the distribution (p_1, …, p_n), is given by

H(p_1, …, p_n) = −K Σᵢ₌₁ⁿ p_i log p_i. (42)

In this expression K is a positive constant that depends on the base of the logarithms. Such a choice is arbitrary, so we simplify by setting K = 1 and writing for the entropy

H(p_1, …, p_n) = −Σᵢ₌₁ⁿ p_i log p_i. (43)
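Shannon's grouping condition (41) and the entropy formula (43) can be checked directly; the distribution (0.5, 0.3, 0.2) below is an arbitrary example chosen for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (43), H = -sum p_i log p_i, with 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Verify the grouping condition (41) for an arbitrary three-way distribution.
p1, p2, p3 = 0.5, 0.3, 0.2
q = p2 + p3
direct = entropy([p1, p2, p3])
grouped = entropy([p1, q]) + q * entropy([p2 / q, p3 / q])
# direct and grouped agree, as Condition 3 requires.
```

The entropy function (43) is indeed the unique continuous measure satisfying the three conditions, up to the arbitrary constant K.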
(43) i =1 The entropy H of (43) behaves quantitatively as we might expect from a measure of uncertainty. If one of the probabilities is equal to one and the rest equal to zero (a state of certainty), then H ( p 1,L, p n ) = H (0, 0,L,1,L, 0) = 0 , (44) while the uniform distribution, p k = 1 n for all k, has entropy H (1 n ,L,1 n ) = log n , (45) which is the maximum value of H. The logarithmic dependence of the entropy on the number of equally-likely choices can be understood most easily in base-2 binary logic. The answers to N 'yes/no' questions (i.e. N 'bits' of information) would be sufficient to uniquely specify one of n = 2 N possibilities, so that the entropy is H = log 2 n = N . As the number of possibilities increases exponentially, the entropy increases only linearly, so that, for example, deciding among twice as many possibilities requires only one more bit of information. In the case of a continuous probability distribution for a parameter such as the length of a spacer, where prior ignorance is described by a uniform distribution, the entropy becomes z f f H = - p ( y | I 0 log p ( y | I 0 dy , (46) where the integral is over all possible values of the length. 12 There is a close connection between entropy in the sense of information and uncertainty and the entropy of statistical mechanics. In fact, all of equilibrium statistical mechanics can be viewed as an exercise in probable inference with respect to the unknown microscopic state of a thermodynamic system, when our information consists only of estimates of a few macroscopic variables such as temperature and pressure. The interested reader should see, for example, the pioneering work of Jaynes [15] and the excellent introductory text by Baierlein [1]. The entropy is a unique measure of uncertainty, in the sense of missing information, with respect to a state of nature. 
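The defining properties of the entropy (43) — zero for a state of certainty, $\log n$ for the uniform distribution, and Shannon's grouping condition (41) — are easy to verify numerically. The following short sketch (illustrative probabilities chosen arbitrarily) checks all three:

```python
import math

def entropy(p):
    """Shannon entropy H = -sum p_i log p_i, Eq. (43), natural log (K = 1)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# A state of certainty (Eq. 44): a single spike has zero entropy.
print(entropy([0.0, 0.0, 1.0, 0.0]) == 0.0)          # True

# The uniform distribution (Eq. 45): entropy equals log n, the maximum.
n = 8
print(entropy([1.0 / n] * n), math.log(n))           # both ~2.0794

# Shannon's grouping condition (Eq. 41) for (p1, p2, p3):
p1, p2, p3 = 0.5, 0.3, 0.2
q = p2 + p3
lhs = entropy([p1, p2, p3])
rhs = entropy([p1, q]) + q * entropy([p2 / q, p3 / q])
print(abs(lhs - rhs) < 1e-12)                        # True
```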
Our natural desire for objectivity and freedom from bias would therefore suggest that among all possible prior distributions that might describe knowledge of a measurement variable, we should choose the one that maximizes the entropy in a way that is consistent with whatever is known (or assumed) to be true. This is the principle of maximum entropy (PME). The resulting probability distribution then reproduces what we assume to be true while distributing the remaining uncertainty in the most honest and unbiased manner. At the same time, PME is a procedure that satisfies our desire for consistency in the sense that two persons with the same information (state of knowledge) should assign the same probabilities. Jaynes [14] has described the maximum entropy distribution as being 'maximally noncommittal with regard to missing information' and has also observed that this distribution '... is the one which is, in a certain sense, spread out as much as possible without contradicting the given information, i.e., it gives free rein to all possible variability of [the unknown quantity] allowed by the constraints. Thus it accomplishes, in at least one sense, the intuitive purpose of assigning a prior distribution; it agrees with what is known, but expresses a 'maximum uncertainty' with respect to all other matters, and thus leaves a maximum possible freedom for our final decisions to be influenced by the subsequent sample data.' The mathematical procedure that underlies the PME is one of constrained maximization, which seeks to maximize the entropy (either the discrete or continuous form, as appropriate) subject to constraints on the probability distribution imposed by prior information, using the method of Lagrange multipliers [1, 13, 30, 35]. The example of the metal spacer will serve to illustrate the procedure for particular states of available information. 
Suppose that we are certain, based on engineering judgment and the known properties of a production process, that the length of a spacer is contained in the interval $[y_{\min}, y_{\max}]$. Such knowledge constrains the distribution $p(y|I_0)$ via the normalization requirement:

$$\int_{y_{\min}}^{y_{\max}} p(y|I_0)\,dy = 1. \qquad (47)$$

Maximizing the entropy (46) subject to the constraint (47) then yields the rectangular, or uniform, density given by

$$p(y|I_0) = 1/(y_{\max} - y_{\min}) \qquad (48)$$

in the allowed range of $y$, and zero otherwise. We could have guessed this simple distribution based on the symmetry of the situation, but it is instructive to see how the PME works with such meager information. In many cases we may have a prior estimate $y_0$ of the length, together with an estimated variance $\sigma_y^2$, related to $p(y|I_0)$ by (24) and (25). We might know, for example, that the spacer was produced by a reliable machine or process with a well-characterized production history. Maximizing the entropy subject to these constraints, together with the normalization requirement of (27), yields the normal (or Gaussian) density:

$$p(y|I_0) = \frac{1}{\sigma_y\sqrt{2\pi}}\exp\left[-(y-y_0)^2/2\sigma_y^2\right] = N(y;\, y_0, \sigma_y^2). \qquad (49)$$

This is a very important and useful result. Prior information about the length of the spacer might be based not on the known characteristics of a machine or production process but rather on the result of a previous measurement, perhaps performed by a supplier. If the supplier follows the recommendations of the GUM, the result of the measurement will be reported in the form $Y = y_0 \pm k u_c(y_0)$, where $k$ is a coverage factor and the combined standard uncertainty $u_c(y_0)$ is an estimated standard deviation of the probability distribution that characterizes the supplier's measurement process.
Given only this information, the best prior probability assignment (being least informative in the sense of the PME) for encoding knowledge of the length of the spacer is just $p(y|I_0) = N(y;\, y_0, u_c^2)$. Thus the normal distribution, rather than being an unwarranted assumption, is the least biased and 'maximally noncommittal' of all distributions for given mean and variance. Consistency would then require anyone using the supplier's measurement result to assign the same normal distribution.

7. The ubiquitous normal distribution

The normal, or Gaussian, distribution has a very special status in probability theory and measurement science. In this section we describe some of the reasons for the ubiquitous occurrence of this particular distribution.

7.1 The central limit theorem

When many small independent effects combine additively to affect either a production process or a set of repeated measurements, the resultant frequency distributions (histograms) of either the workpiece errors or the measurement results will usually be well approximated by normal distributions. The central limit theorem (CLT) provides a theoretical basis for modeling this behavior, under very general and non-restrictive assumptions about the various probability distributions that characterize the individual effects. The CLT is a general result in the theory of random variables. Without attempting a formal proof, the CLT says that if $Z$ is the sum $Z = X_1 + \cdots + X_n$ of $n$ independent random variables $X_i$, each of which has finite mean and variance, with none of the variances significantly larger than the others, then the distribution of $Z$ will be approximately normal, converging towards a normal distribution for large $n$. In practical applications, 'large $n$' may mean $n$ no greater than three or four.
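The claim that 'large $n$' can mean $n$ as small as three or four is easy to check by simulation. The sketch below (an illustration, not part of the original text) sums four uniform variables, each with mean 0 and variance 1/12, and checks that the sum behaves like a normal variable with variance 4/12:

```python
import random
import statistics

random.seed(1)

# Z = X1 + ... + X4, each Xi uniform on [-0.5, 0.5] (mean 0, variance 1/12).
# The CLT suggests Z is approximately N(0, 4/12) even for n_terms = 4.
n_terms, n_samples = 4, 200_000
z = [sum(random.uniform(-0.5, 0.5) for _ in range(n_terms))
     for _ in range(n_samples)]

print(statistics.mean(z))        # close to 0
print(statistics.pvariance(z))   # close to 4/12 = 0.333...

# Crude normality check: for a normal distribution, about 68.3% of the
# samples should lie within one standard deviation of the mean.
s = statistics.pstdev(z)
frac = sum(abs(v) < s for v in z) / n_samples
print(frac)                      # close to 0.683
```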
7.1.1 Gaussian sampling and Type A uncertainties

There is perhaps no source of measurement uncertainty more basic and fundamental than that caused by getting a different answer every time a measurement is repeated. The CLT suggests a useful and realistic model of a noisy, non-repeatable measurement procedure. Consider, for example, a well-calibrated electronic indicator used to measure the length of a metal spacer. A set of $n$ repeated measurements yields a data set of indicator readings $D \equiv \{y_1, y_2, \ldots, y_n\}$, where each reading is equal to the length plus an error that fluctuates from reading to reading. Guided by the CLT, we assume that each error is the sum of a large number of small random (meaning unpredictable) errors, and model the procedure as repeated sampling from a normal frequency distribution with an expectation (or mean) $\mu_y \equiv \mu$ and a standard deviation $\sigma$ that characterizes the measurement process repeatability. In many situations, the standard deviation $\sigma$ may be known from prior experience with the process. The post-data (posterior) distribution for the spacer length then follows from Bayes' theorem:

$$p(\mu|D I_0) = K\,p(\mu|I_0)\cdot p(D|\mu I_0), \qquad (50)$$

where the prior information $I_0$ includes the known value of $\sigma$. In (50), $K$ is a normalization constant and $p(\mu|I_0)$ is the prior distribution for $\mu$, which we assume to be constant (a uniform density), corresponding to knowing little about the value of $\mu$ a priori. The last factor on the right side of (50) is the likelihood function:

$$p(D|\mu I_0) = p(y_1 \cdots y_n|\mu I_0). \qquad (51)$$

We assume that the sequential measurements are independent, which means that the probability of obtaining datum $y_i$ does not depend upon the results of previous measurements. For the first two samples, using the product rule, we then have:

$$p(y_1 y_2|\mu I_0) = p(y_1|\mu I_0)\,p(y_2|y_1\mu I_0) = p(y_1|\mu I_0)\,p(y_2|\mu I_0). \qquad (52)$$
Independence means, according to Cox, that knowledge of $y_1$ is irrelevant for reasoning about $y_2$. Now by definition of the model, the probability of obtaining a particular indicator reading $y_i$ is given by the normal distribution:

$$p(y_i|\mu I_0) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2\sigma^2}(y_i-\mu)^2\right]. \qquad (53)$$

Repeated use of the product rule then yields for the likelihood:

$$p(D|\mu I_0) = p(y_1|\mu I_0)\cdots p(y_n|\mu I_0) \propto \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\right]. \qquad (54)$$

Now since $\sum (y_i-\mu)^2 = \sum (y_i-\bar y)^2 + n(\bar y-\mu)^2$, with $\bar y = \sum y_i/n$ (the sample mean), and since the first term is fixed, given the data, the likelihood becomes

$$p(D|\mu I_0) \propto \exp\left[-\frac{1}{2}\left(\frac{\bar y-\mu}{\sigma/\sqrt{n}}\right)^2\right]. \qquad (55)$$

Using this result for the likelihood in Bayes' theorem (50), with a constant (uniform) prior distribution, we have finally:

$$p(\mu|D I_0) \propto \exp\left[-\frac{1}{2}\left(\frac{\bar y-\mu}{\sigma/\sqrt{n}}\right)^2\right]. \qquad (56)$$

The post-data distribution for the expectation $\mu$ is seen to be a normal distribution centered at the best estimate $\mu_{\mathrm{est}} = \mu_0 = \bar y$, with standard deviation (GUM Type A standard uncertainty) $u(\mu_0) = \sigma/\sqrt{n}$. This familiar result is called a maximum likelihood estimate, which is seen to be no more than Bayesian inference in the case of a uniform prior distribution and Gaussian sampling distribution (likelihood function). This is an example of the way in which probability theory as extended logic reproduces the results of traditional statistical sampling theory when warranted by the available information.

The case where $\sigma$ is unknown is straightforward but more complicated, so we simply state the results. For details, see references [2,16]. Using a constant prior density for $\mu$ and Jeffreys' log-uniform prior density for $\sigma$, Bayes' theorem leads to a posterior distribution for $\mu$ given by Student's t-distribution for $n-1$ degrees of freedom. The best estimate $\mu_0$ is again given by the sample mean $\bar y$, with variance (in the notation of the GUM) given by

$$u^2(\mu_0) = \frac{n-1}{n-3}\,\frac{s^2}{n}. \qquad (57)$$

In this expression, $s^2$ is the sample variance, computed from the data according to

$$s^2 \equiv \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar y)^2. \qquad (58)$$

The uncertainty $u(\mu_0)$ is seen from (57) to be larger than the value $s/\sqrt{n}$ recommended in the GUM, and in fact is defined only for $n > 3$. In Bayesian inference, this is a signal that for small $n$, one needs more prior information about $\sigma$ than is provided by the log-uniform density. As $n$ increases, the result (57) approaches the GUM recommendation. To sum up the results when sampling from an assumed normal frequency distribution $N(y;\, \mu, \sigma^2)$, when very little is known a priori about $\mu$: the best estimate $\mu_0$ is always given by the sample mean $\bar y = \sum y_i/n$; if $\sigma$ is known a priori, then $u(\mu_0)$ is $\sigma/\sqrt{n}$; if $\sigma$ is unknown, then $u(\mu_0)$, for $n > 10$ or so, is approximately $s/\sqrt{n}$, with $s$ computed according to (58).

7.2 Inspection measurements I: process control

Suppose that a manufacturer produces a large number of metal spacers, using a well-designed production process. Because of unavoidable process variations there will be some distribution of lengths in any particular batch of spacers. We can think of each spacer as sampling the manufacturing process, with the process adding a random error to the intended dimension [24]. Assume that the process is such that the distribution of spacer lengths, as displayed in a histogram, is well approximated by a normal distribution:

$$f(y) = \frac{1}{\sigma_p\sqrt{2\pi}}\exp\left[-(y-y_0)^2/2\sigma_p^2\right] = N(y;\, y_0, \sigma_p^2). \qquad (59)$$

Here $y_0$ is the average length of a spacer and the variance $\sigma_p^2$ characterizes the variability of the production process. The quantity $f(y)$ is a frequency distribution function, where $f(y)\Delta y$ is approximately equal to the fraction of spacers with lengths in the range $[y, y+\Delta y]$. Figure 8 shows such a histogram, together with its normal approximation, for a typical large run of spacers.
During production, the spacers are measured using a calibrated length gauge that has been corrected for all significant systematic errors. The inspection measurement process has a combined standard uncertainty $\sigma_m$ that includes the effects of temperature, gauge calibration uncertainty, measurement process reproducibility, and so forth. From experience with this gauge and the measurement process, it is known that length measurement errors are well described by the normal distribution:

$$p(e|I_0) = \frac{1}{\sigma_m\sqrt{2\pi}}\exp(-e^2/2\sigma_m^2) = N(e;\, 0, \sigma_m^2), \qquad (60)$$

so that $p(e|I_0)\,de = N(e;\,0,\sigma_m^2)\,de$ is the probability that the error of a length measurement lies in the range $[e, e+de]$. Because all known significant systematic effects have been accounted for, the measurement error has an expectation of zero.

Figure 8. A histogram showing the frequency distribution of the lengths of a large run of spacers produced by a machine. The solid curve is a normal distribution fit to the histogram.

As part of a statistical quality control program, an inspector uses this gauge to measure the lengths of a large sample of spacers, and plots the results in a histogram. What can we say about this frequency distribution?

In general, the result $y_m$ of a particular length measurement will be the sum of an unknown length $y$ and an unknown measurement error $e$:

$$y_m = y + e. \qquad (61)$$

Such a result could be realized in an infinite number of ways, corresponding to the infinite number of pairs $(y, e)$ that satisfy (61). The error $e$ here is a nuisance parameter, present in the data but of no interest in itself, that can be eliminated via marginalization. Let $p(y_m|I_0)\,dy_m$ be the probability of the proposition 'the result of a measurement lies in the interval $[y_m, y_m + dy_m]$.' The distribution $p(y_m|I_0)$ can be found by averaging over all possible measurement errors to yield a marginal distribution, analogous to the result of (22), where we had discrete probability distributions. The marginal distribution for $y_m$ is found according to:

$$p(y_m|I_0) = \int_{-\infty}^{\infty} p(y_m e|I_0)\,de = \int_{-\infty}^{\infty} p(y_m|e I_0)\,p(e|I_0)\,de = \int_{-\infty}^{\infty} N(y_m;\, y_0+e, \sigma_p^2)\,N(e;\, 0, \sigma_m^2)\,de. \qquad (62)$$

Here the second step follows directly from the product rule, and the result $p(y_m|e I_0) = N(y_m;\, y_0+e, \sigma_p^2)$ says that for a given error $e$, the distribution of $y_m$ would equal the production distribution, shifted and centered at $y_0 + e$. The last integral in (62), called a convolution integral, is straightforward [2] and leads to the basic result:

$$p(y_m|I_0) = \frac{1}{\sigma_T\sqrt{2\pi}}\exp\left[-(y_m-y_0)^2/2\sigma_T^2\right] = N(y_m;\, y_0, \sigma_T^2), \qquad (63)$$

where

$$\sigma_T^2 = \sigma_p^2 + \sigma_m^2. \qquad (64)$$

We see that the distribution of values that might reasonably be expected to result from the measurement of a spacer chosen at random is normally distributed, centered at the average production value $y_0$, with a standard deviation given by $\sigma_T = \sqrt{\sigma_p^2 + \sigma_m^2}$. The inspector's measurements can be thought of as a sequence of samples from this distribution, so that the resulting histogram can be expected to be approximately Gaussian with a sample variance $s^2 \approx \sigma_T^2$. The two sources of variation, production and measurement, are seen to be mixed or confounded in the measurement process, and the behavior of the variance $\sigma_T^2$ accords with our common sense. If the measurement process were nearly perfect and noise-free, so that $\sigma_m^2 \to 0$, then $\sigma_T^2 \approx \sigma_p^2$ and the dispersion of measurement results would reflect only production variation. On the other hand, if all of the spacers were nearly identical, so that $\sigma_p^2 \to 0$, then $\sigma_T^2 \approx \sigma_m^2$ and such dispersion would be dominated by measurement variability.
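The variance addition of Eq. (64) can be confirmed directly by simulating the measurement model $y_m = y + e$ of Eq. (61). The numbers below are hypothetical, chosen only to illustrate the confounding of production and measurement variation:

```python
import random
import statistics

random.seed(2)
y0, sigma_p, sigma_m = 25.000, 0.010, 0.004   # mm; illustrative values only

# Each observed value is a true length drawn from the production
# distribution plus an independent gauge error, Eq. (61): y_m = y + e.
ym = [random.gauss(y0, sigma_p) + random.gauss(0.0, sigma_m)
      for _ in range(200_000)]

s_T = statistics.pstdev(ym)
sigma_T = (sigma_p**2 + sigma_m**2) ** 0.5
print(s_T, sigma_T)   # sample value close to the predicted sqrt(sum), Eq. (64)
```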
In a similar way, any observed drift in the measurement results away from the nominal length $y_0$ would be the sum of production and measurement drift, requiring more information (i.e. gauge re-calibration) before being unambiguously attributed to changes in the production process. From this example we see that it is very important in industrial inspection to understand the difference between actual workpiece variation and measurement uncertainty. The reliable characterization of a production process requires a measurement system whose expected error is held close to zero and whose combined standard uncertainty has been independently and carefully evaluated.

7.3 Inspection measurements II: a particular workpiece

While the result (63) is interesting and useful for process control, it is not the information that an inspector would need in order to accept or reject a given workpiece. In order to decide if a spacer is acceptable or not, what an inspector needs to know, given a particular measurement result $y_m$, is the best estimate of the length of the spacer actually measured, together with an evaluation of the measurement uncertainty. Before performing the measurement, the inspector's knowledge of the length of the spacer is guided by his experience with the production process and data such as that shown in the histogram of Fig. 8. Using this prior information he assigns a normal prior distribution:

$$p(y|I_0) = \frac{1}{\sigma_p\sqrt{2\pi}}\exp\left[-(y-y_0)^2/2\sigma_p^2\right] = N(y;\, y_0, \sigma_p^2). \qquad (65)$$

While this has the same mathematical form as $f(y)$ in (59), it should be stressed that $f(y)$ is a measured frequency distribution of lengths, while $p(y|I_0)$ is an assigned probability distribution for a single spacer drawn from (59). Probability and frequency are not the same thing.
We also observe that in the real world of manufacturing, many (if not most) workpieces are never measured at all, but rather accepted for use based upon pure inference, in which knowledge of the workpiece is implicitly encoded by a distribution such as $p(y|I_0)$.

After obtaining the measurement data $y_m$, we update the prior distribution to obtain the post-data (posterior) pdf using Bayes' theorem:

$$p(y|y_m I_0) = K\,p(y|I_0)\,p(y_m|y I_0), \qquad (66)$$

where the constant $K = p(y_m|I_0)^{-1}$ is independent of $y$ and will be absorbed into the normalization of $p(y|y_m I_0)$. The likelihood $p(y_m|y I_0)$ is the probability of obtaining data $y_m$ as a function of an assumed value $y$. Given a measurement process with error probability distributed as in (60), this is $p(y_m|y I_0) = N(y_m;\, y, \sigma_m^2)$, i.e., a Gaussian centered at the assumed value of $y$. Thus:

$$p(y|y_m I_0) = K\,N(y;\, y_0, \sigma_p^2)\cdot N(y_m;\, y, \sigma_m^2). \qquad (67)$$

We see from (67) that the posterior pdf $p(y|y_m I_0)$ is proportional to the product of two normal distributions. It is a straightforward exercise to show that the result is another normal distribution:

$$p(y|y_m I_0) = \frac{1}{\hat\sigma\sqrt{2\pi}}\exp\left[-(y-\hat y)^2/2\hat\sigma^2\right] = N(y;\, \hat y, \hat\sigma^2), \qquad (68)$$

where:

$$\hat y = \frac{\sigma_p^{-2}y_0 + \sigma_m^{-2}y_m}{\sigma_p^{-2} + \sigma_m^{-2}}, \qquad \hat\sigma^2 = \left(\sigma_p^{-2} + \sigma_m^{-2}\right)^{-1}. \qquad (69)$$

These results can be written somewhat more elegantly, and in a form easier to remember, by defining a weight parameter (or simply a weight) $w$ for a probability distribution as the reciprocal of the variance: $w \equiv 1/\sigma^2$, so that $w_p = 1/\sigma_p^2$, $w_m = 1/\sigma_m^2$, and $\hat w = 1/\hat\sigma^2$. With these definitions (69) becomes:

$$\hat y = \frac{w_p y_0 + w_m y_m}{w_p + w_m}, \qquad \hat w = w_p + w_m. \qquad (70)$$

From the results (68-70) we see that the best estimate $\hat y$ of the length of the spacer, given the measurement data $y_m$, is a weighted average of the prior estimate $y_0$ and the measured value $y_m$.
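The update of Eqs. (69)-(70) is a one-line computation. The sketch below uses hypothetical numbers to show the two limiting regimes discussed next: a sharp gauge dominating a broad process prior, and a sharp process prior dominating a crude gauge:

```python
def combine(y0, sigma_p, ym, sigma_m):
    """Posterior best estimate and standard deviation, Eqs. (69)-(70):
    weights are reciprocal variances, and the posterior weight is their sum."""
    wp, wm = 1.0 / sigma_p**2, 1.0 / sigma_m**2
    w = wp + wm
    y_hat = (wp * y0 + wm * ym) / w
    return y_hat, (1.0 / w) ** 0.5

# Typical inspection: the gauge (0.001 mm) is much sharper than the process
# prior (0.010 mm), so the estimate is dominated by the measurement.
print(combine(25.000, 0.010, 25.007, 0.001))   # close to (25.00693, 0.000995)

# Diamond-turned part measured with a hand micrometer: the prior dominates,
# and in effect the workpiece calibrates the micrometer.
print(combine(25.0000, 0.0002, 25.003, 0.005))  # estimate stays near 25.0000
```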
The weights characterize the sharpness of the respective probability distributions for $y_0$ and $y_m$, and the posterior estimate $\hat y$ will be biased toward the value in which we have the most confidence, as measured by its weight. If we study this result, we find that it accords very well with what our intuition might suggest. In a typical industrial inspection, the measurement procedure is arranged such that $w_m \gg w_p$, whence $\hat y \approx y_m$ and the best estimate of the spacer length derives almost completely from the measurement data. On the other hand, imagine using a hand micrometer to measure the diameter of a cylindrical workpiece produced by a modern diamond turning machine. In this case we could well have $w_p \gg w_m$, and in effect we would be using the workpiece to calibrate the micrometer.

From (69-70) we see that the posterior weight $\hat w$ is always greater than either of the weights $w_p$ or $w_m$, so that a measurement always supplies information that reduces uncertainty. Usually $w_m \gg w_p$, so that $\hat w \approx w_m$ and the information supplied by the measurement overwhelms whatever we may know a priori. In many cases of modern manufacture, however, such as the production of complex aspheric optics by single-point diamond machining, the workpieces are very difficult to measure independently, and we could well have $w_m \approx w_p$ and $\hat w \approx 2w_m$, so that prior knowledge of a machine's positioning error characteristics could lead to a meaningful reduction in measurement uncertainty [25].

7.3.1 Comparison measurements

It is a common practice for a dimensional measurement laboratory to evaluate its procedures in relation to similar laboratories by participating in comparison measurements, such as round-robins. Here each participating laboratory measures, in turn, the same artifact, and the results are then used to evaluate the equivalency of the various participants. The question naturally arises as to how best to consolidate and compare the results of the individual measurements.
The optimum way to do so follows by extending the results of the previous section, using the rules of probability theory. We give a simple example and then touch upon practical difficulties.

Suppose that $n$ laboratories independently measure a quantity $\mu$ and report the results in the form $\mu = x_i \pm \sigma_i$, $i = 1,\ldots,n$, where $\sigma_i$ is the combined standard uncertainty of the $i$-th measurement, evaluated according to the procedures of the GUM. Here $\mu$ might be the length of a gauge block or the diameter of a cylindrical standard. Guided by the principle of maximum entropy, we assign a Gaussian probability density to each of the measurement results, so that

$$p(x_i|\mu\sigma_i I_0) = \frac{1}{\sigma_i\sqrt{2\pi}}\exp\left[-(x_i-\mu)^2/2\sigma_i^2\right] \qquad (71)$$

is the probability density for the $i$-th measurement result. Assuming that the measurements are completely independent, the likelihood of the data set $x = \{x_1 \cdots x_n\}$ is

$$p(x|\mu\sigma I_0) = \prod_{i=1}^{n} p(x_i|\mu\sigma_i I_0) \propto \exp\left[-\sum_{i=1}^{n}(x_i-\mu)^2/2\sigma_i^2\right], \qquad (72)$$

where $\sigma \equiv \{\sigma_1 \cdots \sigma_n\}$. We also assume a Gaussian prior density for $\mu$, centered on the estimate $\mu_-$:

$$p_-(\mu) \equiv p(\mu|I_0) = \frac{1}{\sigma_-\sqrt{2\pi}}\exp\left[-(\mu-\mu_-)^2/2\sigma_-^2\right] = N(\mu;\, \mu_-, \sigma_-^2), \qquad (73)$$

where the subscript denotes the pre-data estimate. [Prior ignorance can be well approximated by allowing $\sigma_-^2 \to \infty$.] Bayes' theorem then gives for the post-data (posterior) density $p_+(\mu) \equiv p(\mu|x\sigma I_0)$:

$$p_+(\mu) = K\,p_-(\mu)\,p(x|\mu\sigma I_0), \qquad (74)$$

where, as usual, $K$ is a normalization constant. Substituting expressions (72) and (73) into (74) gives a product of $n+1$ normal distributions, yielding, after simplification, the normal posterior density

$$p_+(\mu) = \frac{1}{\sigma_+\sqrt{2\pi}}\exp\left[-(\mu-\mu_+)^2/2\sigma_+^2\right]. \qquad (75)$$

Here $\mu_+$ and $\sigma_+$ are, respectively, the best estimate of $\mu$ and its combined standard uncertainty after incorporation of all of the measurement data, and are given by

$$\mu_+ = \frac{w_-\mu_- + \sum_{i=1}^{n} w_i x_i}{w_- + \sum_{i=1}^{n} w_i}, \qquad w_+ = w_- + \sum_{i=1}^{n} w_i, \qquad (76)$$

where $w_+ \equiv 1/\sigma_+^2$ and so on for the rest of the weights. The results (75-76) have all of the intuitive properties that we might expect in pondering the situation. If $w_- \gg w_i$ for all $i$, then $\mu_+ \approx \mu_-$ and, rather than learning about the artifact, the round-robin would reveal estimates of the systematic errors of the various measurements. This might describe a round-robin in which a National Metrology Institute (NMI) circulated a well-calibrated 'golden artifact' among a group of lower echelon laboratories, perhaps as part of a laboratory accreditation program.

More common is a comparison round-robin (sometimes called a key comparison when a number of NMIs are involved), in which only a nominal value of $\mu$ is known a priori and the goal is laboratory intercomparison. In such comparisons it will almost always be the case that for any laboratory $w_- \ll w_i$ (the measurements overwhelm prior information), so that $w_- \approx 0$ and Eqs. (76) simplify to:

$$\mu_+ = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad w_+ = \sum_{i=1}^{n} w_i. \qquad (77)$$

If all laboratories report the same uncertainty $\sigma_0$, then $w_i = w_0 = 1/\sigma_0^2$ and $\mu_+ = \bar x = \sum x_i/n$ with uncertainty $\sigma_+ = \sigma_0/\sqrt{n}$, a familiar result. If laboratory $k$ has a much smaller uncertainty than the others due to a superior measurement process ($w_k \gg w_i$, $i \neq k$), then $\mu_+ \approx x_k$ with uncertainty $\sigma_+ \approx \sigma_k$, which is just as it should be. A single high-accuracy measurement is more valuable than a number of poor ones.

A central question in the analysis of round-robin data is how to choose a reference value $\mu_{\mathrm{ref}}$ in order to effect the comparisons. We see from Eqs. (77) that the logical and consistent way to do so is to use the weighted mean $\mu_{\mathrm{ref}} = \mu_+$, which is the best estimate of the measurand using all available information. In spite of this, it seems to be common practice to use the simple unweighted mean value, $\mu_{\mathrm{ref}} = \bar x$, which discounts the variation in the measurement uncertainty. The motivation for this choice is to prevent a participant from claiming a very small uncertainty and forcing the weighted mean value toward his own result. We see here how in the real world it is easy to move beyond purely technical considerations and into areas that have psychological, political, and economic aspects.

Choice of a reference value is only one of several problems that arise in the design and data analysis of comparison measurements and that are subjects of active discussion and debate. Among the others are:

• Correlations. It is difficult to perform a set of comparison measurements that are all logically completely independent of each other. Independence means that knowledge of Laboratory A's systematic errors would convey no information that would affect Laboratory B's result. Use of common reference standards, use of instruments from the same manufacturer, use of the same empirical equation (such as Edlén's equation for the refractive index of air) or phenomenological model --- all of these will correlate the experimental results. Depending upon the particular nature of the measurement, failing to account for significant correlations among the input quantities will lead to either an underestimation or an overestimation of the uncertainty in the final result.

• Method of measurement. Performing a measurement in two different ways will often give two different answers, even when the individual procedures are highly repeatable. From the point of view of probability theory, we would say that the measurements occur in different reasoning environments $I_0$ and $I_0'$, so that $p(\mu|x I_0) \neq p(\mu|x I_0')$ for the same data $x$, a result often called an 'apples and oranges' comparison. This introduces a component of uncertainty due to the method of measurement that can be studied by an appropriate experimental design.

• Definition of the measurand. In many comparison measurements, the uncertainties may be dominated by an incomplete definition of the measured quantity.
Diameter, for example, is not well defined for a cylinder that is not perfectly round. Similarly, the width of a chromium line deposited on glass is not well described by a single number at the level of a few nanometers. This lack of complete definition can also interact strongly with the measurement technique, further complicating both the evaluation of the uncertainty and comparison with other results.

• Unrecognized errors. It is not uncommon for the results of two independent measurements of the same quantity to be inconsistent, which means that the difference between the measured values exceeds the sum of the individual uncertainties by more than a 'reasonable' amount. An effort to achieve a very small measurement uncertainty requires correction for smaller and smaller effects, and it is easy for some tiny effect to go unrecognized in the data analysis. In such a case, at least one of the results must be wrong, but it can be difficult, if not impossible, to find the source of the inconsistency. Of course, one of the principal reasons for performing a comparison such as a round-robin is to discover such unrecognized errors, and it is important to have a consistent procedure for handling them. An interesting approach has been demonstrated by F. Fröhner, who calls inconsistent results 'one of the thorniest problems in data evaluation' [8]. He models the unrecognized errors themselves as being normally distributed, with maximum entropy prior distributions for the unknown means and variances. The resultant Bayesian inference yields a best estimate for the unknown measurand, together with best estimates for the unrecognized errors and their uncertainties, in a straightforward way.

Differences between the results of measurements of the 'same' measurand will, in general, be due to some admixture of these last three sources of variation. These effects cannot be separated without a more complete understanding and analysis of the various measurement processes.
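Setting these practical difficulties aside, the weighted combination of Eqs. (76)-(77) is itself a short computation. A minimal sketch with hypothetical laboratory results (negligible prior information, $w_- \approx 0$):

```python
import math

# Hypothetical round-robin results: each lab reports x_i +/- sigma_i (mm).
results = [(10.0021, 0.0010),
           (10.0018, 0.0020),
           (10.0030, 0.0050)]

# Eqs. (77): weights are reciprocal variances; the reference value is the
# weighted mean, with standard uncertainty 1/sqrt(sum of weights).
weights = [1.0 / s**2 for _, s in results]
mu_plus = sum(w * x for (x, _), w in zip(results, weights)) / sum(weights)
sigma_plus = 1.0 / math.sqrt(sum(weights))

print(mu_plus, sigma_plus)   # dominated by the lowest-uncertainty lab
```

Note that the most accurate laboratory pulls the weighted mean toward its own result, exactly the behavior (and the political difficulty) discussed above.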
As suggested above, much creative thinking is needed about how best to treat the measurement data created in the course of measurement intercomparisons. 7.4 Industrial inspections III: accept/reject decisions Let us return to the inspector who has measured the length of a metal spacer and must now decide whether or not it is acceptable for use. The nominal length of the spacer is y 0 , and the design specification calls for y 0 to be centered in a specification zone of width T, where T is called the tolerance. This means that the length of an acceptable spacer must lie in the range LSL £ Y £ USL , where the lower specification limit LSL º y 0 - T 2 and the upper specification limit USL º y 0 + T 2. The tolerance is related to the specification limits by T = USL - LSL , as shown in Fig. 9. Figure 9. The specification zone for a metal spacer of design length y 0 and tolerance T. The goal of the inspector's measurement is to answer the question 'Is the length of the measured spacer contained in the specification zone with an acceptable probability?' Clearly what is meant by 'acceptable probability' is a question of professional or business judgment that involves matters such as costs and risks. For the purposes of our discussion we assume that there is a critical probability P0 such that a spacer will be acceptable if there is a probability P ³ P0 that its length lies within the specification zone. Typically P0 will be a number such as 0.95 or 0.99, corresponding to a level of confidence of 95% or 99%. The inspector's knowledge of the length of the spacer following his measurement is summed up in the posterior density p ( y | y m I 0 = N ( y ; y , s 2 ) of (68-69), which describes the distribution of reasonably probable values. The probability PG that the spacer is good (within tolerance) is just the fraction of this distribution contained between the specification limits (see Figure 10): f z USL f (78) exp - (y - y ) 2 2s 2 dy . (79) PG = p ( y | y m I 0 dy . 
From (68) we have explicitly:

$$P_G = \frac{1}{\sigma\sqrt{2\pi}} \int_{LSL}^{USL} \exp\!\left[-\frac{(y-\bar{y})^2}{2\sigma^2}\right] dy. \qquad (79)$$

This integral cannot be evaluated in closed form, but the result can be expressed in terms of the standard normal cumulative distribution function (CDF), defined by

$$\Phi(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-t^2/2)\, dt. \qquad (80)$$

The CDF Φ(x) is tabulated in many statistics texts and is commonly included in commercial mathematics and spreadsheet software. Letting z ≡ (y - \bar{y})/σ, P_G is given by

$$P_G = \Phi\!\left(\frac{USL - \bar{y}}{\sigma}\right) - \Phi\!\left(\frac{LSL - \bar{y}}{\sigma}\right). \qquad (81)$$

Figure 10. The probability density p(y | y_m I_0) of a measured spacer, superimposed on the specification zone. The best estimate of the length is \bar{y}. The probability P_G that the spacer is good is the fraction of the area under the curve (shown cross-hatched) contained between the specification limits. The spacer is in tolerance if P_G ≥ P_0, where P_0 is a threshold value determined by economic considerations.

Now defining the dimensionless variables

$$y^* \equiv \frac{\bar{y} - LSL}{T}, \qquad \sigma^* \equiv \frac{\sigma}{T}, \qquad (82)$$

(81) becomes:

$$P_G = \Phi\!\left(\frac{1 - y^*}{\sigma^*}\right) - \Phi\!\left(\frac{-y^*}{\sigma^*}\right) = P_G(y^*, \sigma^*). \qquad (83)$$

The tolerance T thus provides a natural length scale for the inspector's decision problem. For a spacer to have a chance of being accepted, the best estimate \bar{y} of its length must lie within the specification zone, for otherwise the probability of being in tolerance would be less than 50%. The specification zone corresponds to the region 0 ≤ y* ≤ 1, with y* = 0 being the lower specification limit, y* = 1 the upper specification limit, and y* = 0.5 the center of the specification zone. The dimensionless uncertainty parameter σ* = σ/T is sometimes called a gauging ratio, and typically has a value such as 0.25 (a 4-to-1 ratio) or 0.10 (a 10-to-1 ratio, occasionally called the gauge maker's rule).
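Eq. (83) is straightforward to evaluate with the standard normal CDF. A minimal sketch in terms of the reduced variables y* and σ*, using Python's error function for Φ; the spacer dimensions and measurement result below are illustrative assumptions, not values from the text:

```python
import math

def Phi(x):
    """Standard normal CDF of Eq. (80), expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_good(y_star, s_star):
    """P_G(y*, sigma*) of Eq. (83): probability the part is in tolerance."""
    return Phi((1.0 - y_star) / s_star) - Phi(-y_star / s_star)

# Illustrative spacer: nominal 20 mm, tolerance T = 0.1 mm,
# so the specification zone runs from 19.95 mm to 20.05 mm.
y0, T = 20.0, 0.1
lsl = y0 - T / 2
ybar, sigma = 20.01, 0.01      # measurement result and standard uncertainty
y_star = (ybar - lsl) / T      # reduced variables of Eq. (82)
s_star = sigma / T             # a 10-to-1 gauging ratio

print(p_good(y_star, s_star))  # ≈ 0.99997: accept if P0 = 0.99
print(p_good(0.05, 0.10))      # result near the lower limit: ≈ 0.69, reject
```

The second case shows the effect of the conformance zone: even a result inside the specification limits is rejected when it lies too close to a limit for the given gauging ratio.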
The reduced quantities y* and σ* are closely related to various process capability indices, such as C_p and C_pk, that are used in statistical quality control [23]. This result can be appreciated by examining Fig. 11, which shows the loci of constant probability for two levels of confidence (P_G = 0.95 and P_G = 0.99) in the y*-σ* plane.

Figure 11. The y*-σ* plane, showing the loci of constant probability P_G, from Eq. (83), for P_G = 0.95 (upper curve) and P_G = 0.99 (lower curve). For a given level of confidence, acceptable spacers lie in the region below the corresponding curve.

The horizontal dotted line in Fig. 11 locates a particular 'gauge maker's rule' of a 10-to-1 ratio of tolerance to measurement uncertainty. The intersection of such a line with a particular probability locus defines a conformance zone whose width determines the range of measured values \bar{y} allowed for acceptable spacers. The 99% conformance zone is shown, so that a spacer whose measured length is such that y* lies in this region has at least a 99% probability of being in tolerance, so long as σ ≤ T/10.

A simplified way of framing the same decision problem follows from the recently adopted International Standard ISO 14253-1 [11], which defines default decision rules for proving conformance or non-conformance to specification. The basic idea is shown in Fig. 12.

Figure 12. Illustrating the specification and conformance zones according to ISO 14253-1. The quantity U = kσ is the expanded uncertainty, with k a coverage factor according to the GUM.

According to this standard, the specification zone is reduced by twice the expanded uncertainty U = kσ of the measurement in order for a supplier to prove conformance with specification.
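The default supplier rule just described amounts to guardbanding the specification zone inward by U = kσ at each limit. A sketch of that computation, using the GUM default coverage factor k = 2; the specification limits and uncertainty are illustrative assumptions:

```python
def conformance_zone(lsl, usl, sigma, k=2.0):
    """Specification zone reduced by the expanded uncertainty U = k*sigma
    at each limit, per the ISO 14253-1 default rule for a supplier
    proving conformance."""
    U = k * sigma
    lo, hi = lsl + U, usl - U
    if lo >= hi:
        raise ValueError("expanded uncertainty too large: conformance zone is empty")
    return lo, hi

# Spacer with specification zone 19.95..20.05 mm, sigma = 0.005 mm (assumed)
lo, hi = conformance_zone(19.95, 20.05, 0.005)
print(round(lo, 3), round(hi, 3))  # 19.96 20.04
```

A measured value inside (lo, hi) proves conformance under the default rule; note that a large enough uncertainty leaves no conformance zone at all, which is the tradeoff the text goes on to discuss.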
On the other hand, for a customer to prove non-conformance, he must add the expanded uncertainty to the result of his measurement, thus increasing the size of the conformance zone. The measurement uncertainty always works against whoever is making a conformance or non-conformance decision, and there is always a tradeoff involving costs and risks. In ISO 14253-1, the default coverage factor is k = 2. It should be emphasized that this is a default procedure that fails to consider important economic issues such as the costs and risks associated with making erroneous decisions [34]. These considerations can greatly affect the boundaries of the conformance and non-conformance zones, so that default rules such as those of ISO 14253-1 will likely be of marginal value for real decisions in the marketplace [22,27].

8. Summary

We have attempted to give a broad overview of the fundamental ideas of inference, where probability is viewed as a degree of rational belief. In this view, engineering metrology is seen to be a particular application of a very general system of extended logic that applies to any situation where incomplete information precludes the use of deductive reasoning. The two major questions of probability theory are (a) how to assign initial probabilities and (b) how to revise probabilities in order to incorporate new information. We have shown how the answers to these questions are provided by (1) the principle of maximum entropy and (2) the sum and product rules that follow from the axioms of Cox. These are the fundamental ideas. All of the standard results of statistical sampling theory follow as special cases when necessary. Lack of repeatability is only one component of uncertainty. Ultimately, any physical measurement will be limited by uncertainty in the realization of the unit and will reduce to a set of assumed Type B distributions best estimated by the method of maximum entropy.

9.
Acknowledgments

It is a sincere pleasure to thank the many individuals who made valuable comments and suggestions based on earlier drafts of this paper, and whose contributions and constructive criticisms helped to guide its revision. In particular, I am deeply indebted to W. Wöger (PTB Braunschweig) and S. Sartori (IMGC-CNR, Torino) for their very thorough critiques of the revised manuscript and their detailed suggestions for clarifying and correcting substantial portions of the paper. My gratitude is also extended to [* denotes CIRP member]:

- J. Bryan* - Pleasanton, CA USA
- K. Bowen* - Bede Scientific, Englewood, CO USA
- D. Banks - NIST, Gaithersburg, MD USA
- D. DeBra* - Stanford University, Stanford, CA USA
- T. Doiron - NIST, Gaithersburg, MD USA
- T. Charlton* - Brown and Sharpe, North Kingston, RI USA
- C. Evans* - NIST, Gaithersburg, MD USA
- R. Hocken* - University of North Carolina - Charlotte, NC USA
- H. Kunzmann* - PTB, Braunschweig, Germany
- R. Levi* - Politecnico di Torino, Italy
- D. Lucca* - Oklahoma State University, Stillwater, OK USA
- P. McKeown* - Cranfield University, Cranfield, United Kingdom
- J. Meijer* - University of Twente, Enschede, Netherlands
- E. Pardo - NPL, Teddington, United Kingdom
- G. Peggs* - NPL, Teddington, United Kingdom
- J. Peters* - Instituut voor Werktuigkunde, Heverlee, Belgium
- S. Phillips - NIST, Gaithersburg, MD USA
- J. Potzick - NIST, Gaithersburg, MD USA
- J. Raja - University of North Carolina - Charlotte, NC USA

I would also like to acknowledge the late Professor L. R. Wilcox of the State University of New York at Stony Brook and the late Dr. C. E. Kuyatt of NIST/Gaithersburg for their essential contributions to my understanding of the nature of probability and uncertainty.

10. References

[1] Baierlein, R., 1971, Atoms and Information Theory, W. H. Freeman, San Francisco.
[2] Box, G. E. P. and Tiao, G. C., 1973, Bayesian Inference in Statistical Analysis, Wiley Classics Library Ed. 1992, J.
Wiley and Sons, New York.
[3] Bryan, J. B., 1993, The Deterministic Approach in Metrology and Manufacturing, Int. Forum on Dimensional Tolerancing and Metrology, ASME, Dearborn, Michigan.
[4] Cox, R. T., 1946, Probability, Frequency, and Reasonable Expectation, Am. J. Phys., 14: 1-13.
[5] Cox, R. T., 1961, The Algebra of Probable Inference, Johns Hopkins Press, Baltimore.
[6] Donaldson, R. R., 1972, The Deterministic Approach to Machining Accuracy, Soc. Mech. Eng. Fabrication Technology Symposium, Golden, Colorado.
[7] Estler, W. T., 1997, A Distribution-Independent Bound on the Level of Confidence in the Result of a Measurement, J. Res. Natl. Inst. Stand. Technol., 102: 587-88.
[8] Fröhner, F. H., 1989, Bayesian Evaluation of Discrepant Experimental Data, in Maximum Entropy and Bayesian Methods, J. Skilling, ed., Kluwer Academic Publishers, Dordrecht, Netherlands.
[9] Garrett, A. J. M. and Fisher, D. J., 1992, Combining Data from Different Experiments: Bayesian Analysis and Meta-analysis, in Maximum Entropy and Bayesian Methods, Seattle 1991, C. R. Smith et al., eds., Kluwer Academic Publishers, Dordrecht, Netherlands, 273-86.
[10] International Organization for Standardization (ISO), 1995, Guide to the Expression of Uncertainty in Measurement, ISO, Geneva.
[11] International Organization for Standardization (ISO), 1998, International Standard 14253-1, Geometrical Product Specifications (GPS) - Part 1: Decision rules for proving conformance or non-conformance with specification.
[12] Jaynes, E. T., 1994, Probability Theory: The Logic of Science, preliminary version at ftp://bayes.wustl.edu/pub/Jaynes/book.probability.theory.
[13] Jaynes, E. T., 1989, Papers on Probability, Statistics, and Statistical Physics, R. D. Rosenkrantz, ed., Kluwer Academic Publishers, Dordrecht, Netherlands.
[14] Jaynes, E. T., 1968, Prior Probabilities, IEEE Trans. Syst. Sci. and Cybernetics, Vol. SSC-4, 227-41. [Reprinted in Ref. 13.]
[15] Jaynes, E.
T., 1957, Information Theory and Statistical Mechanics, I, II, Phys. Rev., 106: 620-30; 108: 171-90. [Reprinted in Ref. 13.]
[16] Jeffreys, H., 1967, Theory of Probability, Clarendon Press, Oxford.
[17] Jessop, A., 1995, Informed Assessments - An Introduction to Information, Entropy, and Statistics, Ellis Horwood, London.
[18] Keynes, J. M., 1921, A Treatise on Probability, Macmillan, London.
[19] Kolmogorov, A. N., 1950, Foundations of the Theory of Probability, Chelsea Publishing Co., New York.
[20] Kyburg Jr., H. E. and Smokler, H. E., eds., 1964, Studies in Subjective Probability, John Wiley and Sons, New York.
[21] Lindley, D. V., 1990, The 1988 Wald Memorial Lectures: The Present Position in Bayesian Statistics, Stat. Sci., 5/1: 44-89.
[22] Lindley, D. V., 1985, Making Decisions, 2nd Ed., John Wiley and Sons, London.
[23] Messina, W. S., 1987, Statistical Quality Control for Manufacturing Managers, John Wiley and Sons, New York.
[24] Patterson, S. R., 1996, Treatment of Errors and Uncertainty, Tutorial Notes, American Society for Precision Engineering, Raleigh, North Carolina.
[25] Phillips, S. D., Estler, W. T., Levenson, M. S., and Eberhardt, K. R., 1998, Calculation of Measurement Uncertainty Using Prior Information, J. Res. Natl. Inst. Stand. Technol., 103: 625-32.
[26] Polya, G., 1954, Mathematics and Plausible Reasoning, 2 Vols., Princeton University Press.
[27] Schlaifer, R., 1959, Probability and Statistics for Business Decisions, McGraw-Hill, New York.
[28] Savage, I. R., 1961, Probability Inequalities of the Tchebyscheff Type, J. Res. Natl. Bur. Stand., 65B: 211-22.
[29] Shannon, C. E. and Weaver, W., 1963, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, Illinois.
[30] Sivia, D. S., 1996, Data Analysis - A Bayesian Tutorial, Clarendon Press, Oxford.
[31] Smith, C. R.
and Erickson, G., 1989, From Rationality and Consistency to Bayesian Probability, in Maximum Entropy and Bayesian Methods, Kluwer Academic Publishers, Dordrecht, Netherlands.
[32] Tribus, M., 1969, Rational Descriptions, Decisions, and Designs, Pergamon Press, New York.
[33] Weise, K. and Wöger, W., 1992, A Bayesian Theory of Measurement Uncertainty, Meas. Sci. Technol., 3: 1-11.
[34] Williams, R. H. and Hawkins, C. F., 1993, The Economics of Guardband Placement, Proc. 24th IEEE International Test Conference, Baltimore, USA.
[35] Wöger, W., 1987, Probability Assignment to Systematic Deviations by the Principle of Maximum Entropy, IEEE Trans. Instrum. Meas., Vol. IM-36, 655-58.