Understanding Hypothesis Testing Using Probability Distributions
David LeBlond

"Statistical Viewpoint" addresses principles of statistics useful to practitioners in compliance and validation. We intend to present these concepts in a meaningful way so as to enable their application in daily work situations. Reader comments, questions, and suggestions are needed to help us fulfill our objective for this column. Suggestions for future discussion topics or questions to be addressed are invited. Readers are also invited to participate and contribute manuscripts for this column. Case studies sharing regulatory strategies are most welcome. Please contact coordinating editor Susan Haigney at [email protected] with comments, suggestions, or manuscripts for publication.

KEY POINTS
The following key points are discussed:
• Scientific inference requires both deductive and inductive inference
• Three main approaches to inductive inference are used
• Fisherian induction uses the P-value as a measure of evidence against the null hypothesis
• Neyman-Pearson induction controls the long-run decision risk over repeated experiments
• Bayesian induction obtains direct probabilistic measures of evidence from the posterior distribution
• A P-value is the probability of observing a result as extreme or more extreme than that observed, assuming the null hypothesis is true
• The Type I error rate is the probability of incorrectly rejecting the null hypothesis
• The Type II error rate is the probability of incorrectly failing to reject the null hypothesis
• The Bayes factor is a measure of evidence in favor of the null hypothesis contained in the data
• P-value estimation is based on unobserved results more extreme than those observed, so it may overstate the evidence against the null hypothesis
• The P-value from a point-null hypothesis, or two-sided test of equality, is very difficult to interpret. A confidence or credible interval should be provided in addition to the P-value.
• Analysis of variance is a Fisherian hypothesis test for the equality of means of two or more groups
• The Bayesian approach offers an insightful re-interpretation of the analysis of variance in terms of the joint posterior distribution of the group means.

ABOUT THE AUTHOR
David LeBlond, Ph.D., has 29 years of experience in the pharmaceutical and medical diagnostics fields. He is currently a principal research statistician supporting analytical and pharmaceutical development at Abbott. David can be reached at [email protected].

INTRODUCTION
The first issue of "Statistical Viewpoint" (1) presented basic probability distribution concepts and Microsoft Excel tools useful in statistical calculations. The second (2) reinforced these concepts and tools through eight scientific decision-making examples. The third and fourth (3, 4) illustrated how probability distributions aid in process knowledge building. This issue shows how probability distributions are central to the understanding of hypothesis testing. Some of the concepts of probability distributions introduced in the first four issues of this column will be helpful in understanding what follows here.

Product, process, or method development can be viewed as a series of decisions based on knowledge building: What type of packaging should be employed? What is the optimum granulation time? Can a proportional model be used for calibration? The tools we use to make such decisions are many.
We rely on prior theory and expert knowledge to conceptualize the underlying mechanisms, select prototypes for testing, design experiments, and prepare our minds to interpret experimental results. We use dependable measuring systems to acquire new data. Often, when the results of our experimental trials are unclear (and even sometimes when they are obvious), we employ statistical and probabilistic methods to guide us in our decision making. We humans are reasonably good at exploring our world, finding explanations for things, and making predictions from our theories. Unfortunately, while we all have a sense of rational intuition that (for the most part) serves us well, the process we use (or should use) to build understanding and make optimal decisions from data has been the subject of heated debate by philosophers, scientists, and mathematicians for centuries (reference 5 gives a nice, readable overview; see references 6 and 7 for more details). While the debate shows no sign of concluding in our own time, three noteworthy approaches have emerged. Here we discuss some history and key concepts of each approach and illustrate the central role probability distributions play with two simple examples.

THE PROCESS OF SCIENTIFIC INFERENCE
Let us start by defining several important terms as follows (note that these and additional terms are defined in the Glossary section at the end of this article):
• Parameter: In statistics, a parameter is a quantity of interest whose "true" value is to be estimated. Generally a parameter is some underlying variable associated with a physical, chemical, or statistical model. When a quantity is described here as "true" or "underlying," the quantity discussed is a parameter.
• Hypothesis: A provisional statement about the value of a model parameter or parameters whose truth can be tested by experiment.
• Null hypothesis (H0): A plausible hypothesis that is presumed sufficient, given prior knowledge, until experimental evidence in the form of a hypothesis test indicates otherwise.
• Alternative hypothesis (Ha): A hypothesis considered as an alternative to the null hypothesis, though possibly more complicated or less likely given prior knowledge.
• Inference: The act of drawing a conclusion regarding some hypothesis based on facts or data.
• Inductive inference: The act of drawing a conclusion about some hypothesis based primarily on data.
• Deductive inference: The act of drawing a conclusion about some hypothesis based entirely on careful definitions, axioms, and logical reasoning.

We can identify the following four types of activities in the decision-making process, which are illustrated in Figure 1.

EXAMPLE 1: TABLET POTENCY
Consider the case of a development team concerned with the true mean potency, averaged over batches, produced by a tablet manufacturing process. In this case, the tablet label claim (LC) and target for the manufacturing process is 100%LC. Individual batches may have a mean potency that deviates slightly from 100%, but batch means <90%LC are unacceptable. While the team believes the process is adequate, their objective is to provide evidence that the process produces acceptable batches. How can the team validate their belief that the process mean potency is acceptable? We will use this example to illustrate the three different systems of scientific inference.

State Hypotheses About True Mean
State one or more hypotheses about the underlying true mean potency parameter.
In Figure 1, three possible hypotheses (true process mean potency = 92, 96, and 102) are illustrated. Such hypotheses are called "point" hypotheses because they specify a single fixed value of an underlying parameter. Useful hypotheses often specify a range of values and are referred to as composite hypotheses. Note the following definitions:
• Composite hypothesis: A statement that gives a range of possible values to a model parameter. For example, "Ha: true mean > 0" is a composite hypothesis.
• Point (simple) hypothesis: A statement that a model parameter is equal to a single specific value. For example, "H0: true mean = 0" is a simple hypothesis.

Figure 1: Scientific inference applied to the mean potency of a tablet manufacturing process. (The figure illustrates the four activities: 1. State hypotheses about true mean; 2. Make deductive inferences; 3. Obtain a sample estimate of mean; 4. Make inductive inferences.)

For Example 1, if low potency is a concern, the team may be interested in testing a composite null hypothesis such as H0: true mean ≥ 96%LC against a composite alternative hypothesis such as Ha: true mean < 96%LC. The value of 96%LC might be considered the lowest process mean potency consistent with an acceptable process. That is, if the team guesses that the true process standard deviation is about 2%LC, then 96% would be safely "3-sigma" above the unacceptable lower limit of 90%LC. There is an asymmetry to the hypotheses such that the null hypothesis, H0, is considered a priori most likely, requiring the fewest assumptions (i.e., the process is performing acceptably). Following Ockham's Razor (8), the simplest hypothesis is often the default H0. The hypothesis chosen as H0 is usually the one that requires the lower burden of proof in the minds of decision makers. If the team believes the process is adequate, the alternative hypothesis, Ha, is considered less likely than H0 and would require postulation of some special cause, defined as follows:
• Special cause: When the cause for variation in data or statistics derived from data can be identified and controlled, it is referred to as a "special" cause. When the cause cannot be identified, it is regarded as random noise and referred to as a "common" cause.

Still the team needs supportive evidence because, if the true process mean is < 96%LC, the process may result in subpotent or failing batches.

Make Deductive Inferences
The team believes that batch mean potency measurements will be normally distributed about the true process mean potency. Thus they have a mechanistic/probabilistic model to predict the likely range of measured batch mean potency values to expect. This kind of model is known as a likelihood model, defined as follows:
• Likelihood model: A description of a data generating process that includes parameters (and possibly other variables) whose values determine the distribution of data produced.

Specifically, the likelihood is

likelihood = Probability of the observed data if the hypothesis is true. [Equation 1]

Note that the predicted range of the observed data depends on the hypothesized true mean potency. The range will be different for H0 and Ha, with a mean process potency of 96%LC being considered borderline acceptable. The act of predicting (or simulating) future data from such a likelihood model is purely deductive.
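As a concrete illustration, this deductive step can be carried out by simulation. The following minimal sketch (in Python rather than the Excel used elsewhere in this column) draws one plausible set of 10 batch mean potencies under the hypothesis that the true mean is 96%LC; the 2%LC standard deviation is the team's guess from the example, and the random seed is an arbitrary illustrative choice:

# A minimal sketch of deductive (predictive) inference: simulating the batch
# mean potencies that the likelihood model predicts under a hypothesized
# true mean of 96%LC with an assumed process standard deviation of 2%LC.
import numpy as np

rng = np.random.default_rng(seed=1)  # seed chosen arbitrarily for repeatability

true_mean = 96.0   # hypothesized true process mean potency (%LC)
true_sd = 2.0      # team's guess at the process standard deviation (%LC)
n_batches = 10

simulated_potencies = rng.normal(loc=true_mean, scale=true_sd, size=n_batches)
print(np.round(simulated_potencies, 1))  # one plausible future data set under this hypothesis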
Such predictions are always true as long as the underlying model and hypothesized value of the underlying parameter are true.

Obtain Sample Estimate Of Mean
The team obtains potency measurements for 10 batches made using the process by testing composite samples. This is the experimental part of the decision-making process. The measured batch potencies constitute raw data. For inferences, a summary of the data will be sufficient. In the present case, the observed mean of 93%LC and standard deviation of 5%LC were obtained. The team noted that 93%LC is below 96%LC. But is it far enough below 96%LC to reject H0?

Make Inductive Inferences
From Figure 1 we see that induction is the opposite of deduction: On the basis of the data, we evaluate which hypothesis (H0 or Ha) is most likely. When we reason inductively, we reflect back from the observations to some underlying truth about nature, in this case the true process mean potency. Unlike deduction, even if our data are valid, there is no guarantee that our conclusions about nature are correct. What we hope to do is make optimal scientific decisions and/or acquire evidence in favor of one or more of the hypotheses we have considered, with some appreciation for the decision risks. Given the larger than expected standard deviation estimate (5 instead of the expected 2%LC), could the value of 93%LC have been due to random variation? What inductive inference can the team make about the true process mean potency? To help the team with this decision we must review some background on methods of inductive inference.

THREE SYSTEMS OF INDUCTIVE INFERENCE
In the following sections we describe the three systems of inductive inference most commonly employed today. Each description opens with a brief historical perspective followed by an application of the methodology to our Example 1.

Fisherian Induction
The term "Fisherian" seems appropriate because it was R. A. Fisher who described the approach with the greatest clarity and laid its statistical foundations. Before 1900, inductive inference from data was informal. The discipline of statistics, as we know it today, was in its infancy. While many probability distributions and models were known, workers typically summarized their data using tables and graphs and made visual comparisons with theoretical predictions. In 1900 the British mathematician Karl Pearson described what is now called the "Chi-square test" (9). This innovation was followed in 1908 with a "t-test" for means by an Irish brewer, William Gosset, better known to us as "Student" (10), and in 1925 with "ANOVA" and an associated "F-test" for comparing groups of means by the English geneticist and mathematician Ronald Fisher (11). The F-distribution was so named in his honor (12). The following are definitions of some terms that are central to the Fisherian approach:
• Statistic: A summary value (such as the mean or standard deviation) that is calculated from data.
A statistic is often used because it provides a good estimate of a parameter of interest.
• Sampling distribution: The distribution of data or some decision statistic calculated from data.
• t-statistic: The decision statistic used in Student's t-test, consisting of the ratio of a difference between an observed and hypothesized mean divided by its estimated standard error.
• F-statistic: The decision statistic used in Fisher's analysis of variance hypothesis test, consisting of the ratio of two independent observed variances calculated from normally distributed data.
• Analysis of variance (ANOVA): A hypothesis test that uses the F-statistic to detect differences among the true means of data from two or more groups.

The Chi-square, t-test, and ANOVA hypothesis tests (and many others developed subsequently) rely on the following ideas:
• Reducing raw data to some decision statistic whose sampling distribution is known from the likelihood, Equation 1
• Using a "P-value" for deciding whether an observed set of data deviates from a hypothesized probability distribution more than would be expected from random error. The P-value is defined as the probability of obtaining a result at least as extreme as the one that was actually observed, given that the null hypothesis is true. The fact that P-values are based on this assumption is crucial to their correct interpretation.

Fisher offered his opinion about the proper P-value criterion for decision making (11, page 80): "We shall not often be astray if we draw a conventional line at 0.05." Over 80 years later his "conventional line" is widely used as a criterion for approval of new pharmaceutical products. In our tablet-manufacturing example, Fisherian induction would have us summarize our data as a t-statistic, which can be done by using Excel syntax as follows:

t-statistic = SQRT(Sample Size)*(Observed Mean - H0 Mean)/(Standard Deviation)
= SQRT(10)*(93 - 96)/5 = -1.8974 [Equation 2]

If the team were to repeat their experiment using 10 different (independent) batches, they would of course not get the same t-statistic because of sampling and measurement variation. Conceptually, if they repeated the experiment many times, and if the true value of the process mean was equal to 96%LC, the sampling distribution of the t-statistics would be the probability distribution given in Figure 2. This distribution is known as the Student's t-distribution.

Figure 2: Fisherian inductive inference model. (The figure shows the sampling distribution of the t-statistic under H0, the observed value of -1.8974, and the lower-tail P-value of 0.045.)

Notice that the observed t-statistic (-1.8974) is relatively far to the left side of the t-distribution. This is because the observed mean of 93%LC is somewhat below the hypothetical limit of 96%LC. Does this mean that the team should reject H0? Fisherian induction suggests that they should reject H0 (in favor of Ha) if it is unlikely to obtain such a t-statistic, or one even more extreme, by random chance alone. In this case "even more extreme" would include all those values equal to or less than -1.8974. The probability of observing such extreme values by chance alone is equal to the area under the distribution curve to the left of -1.8974. This probability is known as the P-value, and we can obtain it easily using the Excel cumulative distribution function as follows:

P-value = TDIST(-observed t-statistic, sample size - 1, 1) = TDIST(1.8974, 9, 1) = 0.045.

Thus, such extreme (or even more extreme) values of the t-statistic would occur by chance alone on average less than 1 time in 20 repeats of this experiment, or less than 5% of the time. If we use Fisher's "conventional line," the team should reject H0 and conclude that the true process mean is below 96%LC and thus unacceptable.
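For readers working outside Excel, the same Fisherian calculation can be sketched in Python with the scipy library. The numbers are the Example 1 values, and the result should match the TDIST output above:

# A minimal sketch of the one-sided Fisherian t-test from Example 1,
# using scipy in place of Excel's TDIST.
from math import sqrt
from scipy import stats

n, observed_mean, sd, h0_mean = 10, 93.0, 5.0, 96.0

t_statistic = sqrt(n) * (observed_mean - h0_mean) / sd   # Equation 2: -1.8974
p_value = stats.t.cdf(t_statistic, df=n - 1)              # lower-tail area: about 0.045

print(f"t = {t_statistic:.4f}, one-sided P-value = {p_value:.3f}")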
The P-value is also called the "significance level" of the hypothesis test, and when it is low (say < 0.05) it is considered a measure of evidence against the null hypothesis, defined as follows:
• Measure of evidence: In scientific studies, hypothesis testing is used to build evidence for or against various hypotheses. The P-value and Bayes factor (see below) are examples of measures of evidence in Fisherian and Bayesian induction, respectively.

While it offers an objective criterion, the P-value may not be an optimal measure of evidence of the validity (or not) of the null hypothesis (13), and must be interpreted with care (14). The following cautions apply to the Fisherian perspective:
• As with all statistical procedures, the methodology is only applicable as long as all assumptions of the likelihood model (the data generation model) apply. For instance, in our example we assume normality and independence of the measurements.
• Notice in Equation 2 that the magnitude of the t-statistic depends directly on the square root of the sample size. If our team had tested 1,000 instead of only 10 batches, it is likely that the t-statistic would have been "significant" even if the observed mean had been only slightly below 96%LC. Thus it is always wise to supplement a hypothesis test with a confidence interval for the mean potency (see references 3 and 4 for a discussion of confidence intervals).
• By its very nature, the P-value index includes not only the observed t-statistic value, -1.8974, but also all those t-statistic values of lesser value that were not observed. In a sense, we are rejecting H0 because it has failed to predict low t-statistic values that were not in fact observed. The P-value is similar to the index often used to classify the relative performance of students: the observed t-statistic is "in the lower 5% of its class." Within that category there are many poorer performers with whom we are not concerned, but the P-value disregards this information.
• The "conventional line" decision point for a P-value of 0.05 may not be appropriate in all cases. The P-value per se does not take into account the consequences of making an incorrect judgment concerning H0. Further, it is difficult to integrate the P-value index with measures of decision error consequences.
• Most importantly, the P-value is neither of the following:
a. The probability that the initial experimental result will repeat, or
b. The probability that H0 is true.
To obtain these probabilities we need to use one of the other two systems of induction.

Neyman-Pearson Induction
Between 1927 and 1933, two of Fisher's contemporaries extended his groundbreaking ideas about hypothesis testing. Egon Pearson was the son of Karl Pearson (the developer of the Chi-square test) and a colleague of Fisher in London. Jerzy Neyman was a Polish mathematician in Warsaw who, among other things, developed the idea of confidence intervals. Their collaboration (15) led to a more general concept of inductive inference. They developed an extremely useful theory of optimal testing. Some key aspects of their ideas are shown in Figure 3. As with Fisherian induction, Neyman-Pearson induction recognizes a decision statistic having a known sampling distribution. The range of values of the decision statistic regarded as unlikely (if H0 is true) is called the rejection region. This region is in the tail or tails of the sampling distribution and corresponds to some fixed tail probability (e.g., 0.05) called the Type I error.
Type I error is defined as follows:
• Type I error: A decision error that results in falsely rejecting the null hypothesis when in fact it is true.

They envisioned decision makers using this decision statistic for all future hypothesis tests. In this way, all future hypothesis tests would have a fixed probability (e.g., 0.05) of incorrectly rejecting H0. Different Type I errors could of course be chosen for different situations, but the important point was to ensure that the rate of incorrectly rejecting H0 would be understood for each decision. For instance, in the current example, our team may agree that a Type I error rate of 0.05 (i.e., one error in 20 hypothesis tests) is appropriate in their situation. Using the TINV Excel function, this error rate corresponds to the following fixed t-statistic:

Fixed t-statistic = -TINV(2*(Type I error rate), Sample size - 1) = -TINV(2*0.05, 10-1) = -1.8331.

According to the Neyman-Pearson scheme, any observed t-statistic less than -1.8331 would result in a rejection of H0. For our tablet manufacturing example we would reject H0 because -1.8974 < -1.8331. We should note here that the calculation of the t-statistic in the equation would be modified if larger (rather than smaller) potencies, or if both larger and smaller potencies, were considered unacceptable.

Figure 3: Neyman-Pearson inductive inference model. (The figure shows the sampling distributions of the t-statistic under H0 and Ha, the decision point between them, and the Type I and Type II error regions.)

In comparing the Neyman-Pearson paradigm (Figure 3) with that of Fisher (Figure 2), notice that an additional distribution (Ha) is added. Neyman and Pearson recognized the need to consider the sampling distribution of the statistic (the t-statistic in the present example) for both a specific H0 (such as when the true mean potency equals 96%LC) and at a specific Ha (representing some arbitrary true mean potency that might be considered unacceptable). The probability of incorrectly accepting H0 when in fact Ha is true is called a Type II error, defined as follows:
• Type II error: A decision error that results in failing to reject the null hypothesis when in fact it is false.

Of course this Type II error will depend on the specific Ha being considered. One can make a plot of Type II error as a function of the value of the parameter (e.g., the true mean potency) associated with Ha. Such a plot is called a "Power Curve" or an "Operating Characteristic" curve for the hypothesis test, defined as follows:
• Power (or operating characteristic) curve: Power is equal to 1 minus the Type II error rate. The power curve of a hypothesis test is a plot of the Power versus the true value of the underlying parameter of interest.

The calculation of such power curves is an important part of experimental planning; however, it involves the use of probability distribution functions (such as the non-central t distribution) that are not available in Excel. A sketch of this calculation in other software is given below.

One can picture these decision risks as a 2x2 table such as in Figure 4. In adopting a Neyman-Pearson paradigm, the team has moved from the objective of developing evidence with respect to their specific experiment, to the objective of using a methodology with assured decision risks. The Neyman-Pearson approach does not concern itself with whether or not the true mean process potency is below or above 96%LC. It only assures the team that the many decisions they will make over their careers to reject H0s will be incorrect only 1 time in 20 (i.e., a probability of 0.05).
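The fixed critical value above, and the power curve that the article notes cannot be computed with Excel's built-in functions, can both be sketched in Python using scipy's central and non-central t distributions. The assumed 5%LC standard deviation and the grid of true means below are illustrative choices, not values prescribed by the article:

# A minimal sketch of the Neyman-Pearson calculations: the fixed critical
# value (Excel's -TINV above) and a few points on the power curve, which
# requires the non-central t distribution.
from math import sqrt
from scipy import stats

n, alpha, h0_mean, sd = 10, 0.05, 96.0, 5.0

t_crit = stats.t.ppf(alpha, df=n - 1)        # lower-tail critical value: -1.8331
print(f"Reject H0 when t < {t_crit:.4f}")

# Power = probability that the t-statistic falls below t_crit when the true mean is mu.
for mu in (92.0, 94.0, 96.0):
    nc = sqrt(n) * (mu - h0_mean) / sd       # non-centrality parameter
    power = stats.nct.cdf(t_crit, df=n - 1, nc=nc)
    print(f"true mean {mu}%LC: power = {power:.2f}")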
Examples of practical situations where this point of view is appropriate are listed as follows:
• Control charting. For monitoring critical quality measures of a process or method, it may be useful to know the probability of incorrectly identifying an out-of-control situation (Type I error) or of failing to detect a condition that is unacceptable (Type II error).
• Diagnostic screening. A diagnostic test is analogous to a hypothesis test. The sensitivity (1 - Type II error rate) and specificity (1 - Type I error rate) of a diagnostic test are key measures that determine the medical value of the reported test result when the prevalence of disease in the tested population is known.
• Validation and product acceptance testing. For judging production costs and allocating resources, it may be desirable to fix the manufacturer's risk (Type I error, risk of incorrectly failing an acceptable batch) or consumer's risk (Type II error, risk of incorrectly passing a batch of some defined level of unacceptability).
• New drug application regulatory acceptance. For maintaining standards of risk, a regulatory agency may find it desirable to require all studies of a given type (e.g., clinical trials, shelf-life estimation, bio-equivalence tests) to maintain Type I error benchmarks. Requirements with respect to Type II error can assure that sample sizes are adequate (e.g., for safety studies).

Figure 4: Neyman-Pearson hypothesis testing error types.

                               True state of nature*
Conclusion from experiment     H0              Ha
H0                             no error        Type II error
Ha                             Type I error    no error
*Choosing the wrong H0 or Ha to study is sometimes called a Type III error.

The Neyman-Pearson approach permeates much of today's scientific decision making. It represents a high watermark in terms of objectivity and consistency in inductive inference. When we use hypothesis testing methodology whose Type I and II error risks are known, we say that we are using a "calibrated" method, and this can have many advantages. A calibrated hypothesis test is defined as follows:
• Calibrated hypothesis test: A hypothesis test method whose Type I error, on repeated use, is known from theory or computer simulation.

However, the Neyman-Pearson approach may not be the appropriate paradigm for all situations. Some important considerations are listed as follows:
• While the approach does fix decision error rates over a series of experiments, it does not by itself provide a measure of evidence concerning H0 or Ha in any specific experiment.
• In most actual studies there will be multiple hypothesis tests, which may or may not be independent. We refer to groups of hypothesis tests that are associated with the same decision as a "family." Thus we must consider both the family-wise error rates as well as the individual test rates obtained from Neyman-Pearson methodology. These family-wise rates suffer from a condition called "multiplicity" in that they can be difficult to predict.
• The Neyman-Pearson Type I error rate must not be confused with the Fisherian P-value. The Type I error rate is the rate of falsely rejecting H0 over many experiments. The P-value is a measure of evidence (albeit imperfect) against H0.
• Rigid adherence to a Type I error rate leads to conceptual problems. Type I error rates of 0.0499 and 0.0501 are very close in any practical situation, yet they could lead to very different decisions.
• If the decision statistic does not fall in the rejection zone, the Neyman-Pearson formulation recommends that H0 be "accepted." However, from a scientific point of view it is more appropriate to "fail to reject" H0. A larger experiment would have a larger rejection zone that might include the observed result.
• As with the Fisherian system, the Neyman-Pearson system relies solely on the likelihood (probabilistic model of data generation) for both deductive and inductive inferences. However, developing and building evidence for mechanistic, predictive models often requires strong theory and experience. Finding and using such models as a part of risk management control strategies is the key to regulatory initiatives such as quality by design (QbD) (16). To incorporate such prior knowledge in a quantitative way, we must use Bayesian induction, which grew out of an earlier age.

Bayesian Induction
In 1739, the Scottish empiricist philosopher David Hume posed the following problem in inductive inference (17): " 'tis only probable that the sun will rise tomorrow … we have no further assurance of this fact than what experience affords us." Knowing the underlying probabilities of events was critical to the active 18th-century insurance and finance industries, whose profits depended on accurate inductive inferences from available experience and theory. It was also a problem of some interest to the liberal theologians of that time. In 1763, the problem was addressed quantitatively for the first time by two non-conformist ministers, Thomas Bayes and Richard Price (18). Their solution, Bayes' rule, is to probability theory what the Pythagorean theorem is to geometry. The following definitions apply:
• Bayes' rule: A process for combining the information in a data set with relevant prior information (e.g., theory, past data, expert opinion, and knowledge) to obtain posterior information. Prior and posterior information are expressed in the form of prior and posterior probability distributions, respectively, of the underlying physical parameters, or of predictive posterior distributions of future data.
• Prior distribution: A subjective distributional estimate of a random variable, obtained prior to any data collection, which consists of a probability distribution.
• Posterior distribution: A distributional estimate of a random variable that updates information from a prior distribution with new information from data using Bayes' rule.
• Bayesian induction: A process for inductive inference in which the P-value is replaced with the posterior probability that the null hypothesis is true. In Bayesian induction, the respective prior distributions and data models (likelihoods) constitute the null and alternative hypotheses. In addition, one must specify the prior probability (or odds) that the null hypothesis is true.

Two centuries later, Harold Jeffreys, a British astronomer, greatly extended the utility of Bayes' rule (19). In terms of hypothesis testing, it may be summarized as follows:

Probability that the hypothesis is true, given observed data = K x (Probability of the observed data if the hypothesis is true) x (Prior probability that the hypothesis is true), [Equation 3]

and by reference to Equation 1 we see that

Probability that the hypothesis is true, given observed data = K x (Likelihood) x (Prior probability that the hypothesis is true). [Equation 4]
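For readers who prefer symbols, Equations 3 and 4 are the standard statement of Bayes' rule. Writing H for a hypothesis, D for the observed data, and P(D | H) for the likelihood:

P(H | D) = K x P(D | H) x P(H), where K = 1/P(D) = 1/[P(D | H0) x P(H0) + P(D | Ha) x P(Ha)]

when H0 and Ha are the only hypotheses entertained, so that the posterior probabilities of H0 and Ha sum to one. This restatement is a standard identity, not an addition from the article itself.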
Thus we see that evidence for (or against) a given hypothesis may be obtained directly from probability theory, as long as we can supply the following:
• The likelihood, which is available for most practical problems. It is the same likelihood required for Fisherian and Neyman-Pearson approaches.

However, in the Bayesian approach we must also have the following:
• The value K in Equation 4, which sometimes requires computing technology and numerical methods that have only recently become available. In many common cases, however, such as those we consider here, K can be easily evaluated in Excel.
• The prior probability of the hypothesis. This prior probability should be based on existing theory and expert knowledge. It can be problematic because experts will differ in their prior beliefs. On the other hand, this Bayesian approach provides a quantitative tool to gauge the effects of different prior opinions on a final conclusion. If prior knowledge is lacking, it is logical to assign equal probability to each of the hypotheses under consideration (e.g., 0.5 to H0 and 0.5 to Ha).

For our tablet-manufacturing example, it is straightforward to apply a Bayesian approach. We have shown previously how to obtain the prior and posterior distributions of a normal mean (see reference 3, Table IV and reference 20, pp 78-80). As illustrated in Figure 5, the prior (or posterior) probabilities of H0 and Ha are simply the areas under the prior (or posterior) distribution of the mean over the respective ranges of the mean (in the present example, below and above 96%LC for Ha and H0, respectively).

Figure 5: Bayesian induction inference model. (The figure shows a prior or posterior distribution of the parameter, split at the hypothesis boundary into the probability that Ha is true and the probability that H0 is true.)

Let's calculate the probability of truth of H0 and Ha before and after the data are examined by the team. This particular example is illustrated in Figure 6.

Before examining data: All knowledge about the true process mean comes from the prior distribution. The team used a noninformative prior for both the mean and standard deviation. In Figure 6, the prior distribution for the mean is essentially flat and indistinguishable from the horizontal axis. This prior distribution was so broad that about half the probability density (i.e., area under the prior distribution of the mean) lies below 96%LC and half above. Thus the prior probabilities of H0 and Ha were each very close to 0.5. We can use shorthand to state this: PriorProbH0 = 0.5 and PriorProbHa = 0.5. While the team actually felt that H0 was more likely, they used this noninformative prior to provide a more objective test.

After examining data: All knowledge about the true process mean comes from the posterior distribution. Notice in Figure 6 that the area under the distribution to the right of 96%LC (i.e., the H0 range) is 0.045 while that to the left of 96%LC (i.e., the Ha range) is 0.955, so that PostProbH0 = 0.045 and PostProbHa = 0.955. Thus the team can be 95.5% confident (in a true probabilistic sense) that the null hypothesis, H0, is false.

Figure 6: Visualizing a one-sided hypothesis test for a normal mean using its posterior distribution. (The figure shows the essentially flat prior and the posterior distribution of the mean, with area 0.955 below 96%LC, the probability that Ha is true, and area 0.045 above, the probability that H0 is true.)

It is also useful to consider such probabilities in terms of odds, as illustrated in Figure 7. Odds is defined as follows:
• Odds: The ratio of success to failure in probability calculations. In the case of hypothesis testing where only H0 or Ha are possible (but not both), if the probability of truth of H0 is ProbH0, then the odds of H0 equals ProbH0/(1-ProbH0).
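The posterior areas and odds just described can be sketched in Python. Under the noninformative prior used by the team, the posterior distribution of the true mean is a scaled and shifted Student t centered at the observed mean (references 3 and 20); treating that as given, the 0.045/0.955 split and the posterior odds follow directly:

# A minimal sketch of the Bayesian one-sided test, assuming the posterior of
# the true mean is Student t with n-1 degrees of freedom, centered at the
# observed mean with scale sd/sqrt(n) (the noninformative-prior result).
from math import sqrt
from scipy import stats

n, observed_mean, sd, boundary = 10, 93.0, 5.0, 96.0
scale = sd / sqrt(n)   # posterior scale for the mean

post_prob_h0 = stats.t.sf(boundary, df=n - 1, loc=observed_mean, scale=scale)  # area above 96%LC
post_prob_ha = 1 - post_prob_h0

posterior_odds_h0 = post_prob_h0 / post_prob_ha
print(f"PostProbH0 = {post_prob_h0:.3f}, PostProbHa = {post_prob_ha:.3f}")
print(f"Posterior odds of H0 = {posterior_odds_h0:.4f}")   # about 45 to 955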
Before data are examined, the prior odds of H0 are defined as

Prior Odds of H0 = PriorProbH0/PriorProbHa = 0.5/0.5 = 1/1.

So the prior odds that H0 is true are "1 to 1." After information from data has been incorporated, the posterior odds of H0 are

Posterior Odds of H0 = PostProbH0/PostProbHa = 0.045/0.955 = 45/955.

So the posterior odds that H0 is true are "45 to 955." This reduction in the odds of H0 from 1/1 to 45/955 is due to the evidence about H0 contributed by the data. As shown in Figure 7, we can form a measure of evidence by taking the odds ratio as follows:

B = (Posterior Odds of H0)/(Prior Odds of H0) = (45/955)/(1/1) = 45/955 = 0.0452

B is referred to as the Bayes factor (21, 22). Bayes factor is defined as follows:
• Bayes factor (B): In Bayesian induction, the Bayes factor is the ratio of the posterior odds that H0 is true to its prior odds. A B value of 1/10 means that Ha is supported by the data 10 times as much as H0. Because the Bayes factor is normalized by the prior odds, it is a measure of evidence that primarily reflects the observed data.

Figure 7: The Bayes factor (B) as a measure of evidence for H0. (The figure shows the prior probabilities of H0 and Ha updated by data to posterior probabilities; B is the ratio of the posterior odds of H0 to the prior odds.)

The Bayes factor is a measure of evidence supplied by data in a hypothesis-testing situation. Like the P-value, various decision levels have been proposed (see reference 19, page 432). Notice something important here: The Bayesian PostProbH0 and the Fisherian P-value for our manufacturing example are both equal to 0.045. This seems remarkable considering that these indices are not measuring the same thing. The P-value is the probability of observing data at least as extreme as was observed, while the PostProbH0 is the probability that H0 is true. However, it can be shown that in many common situations (e.g., one-sided hypothesis tests involving normally distributed data) the P-value will be equal to the PostProbH0 when an appropriate noninformative prior is used to calculate PostProbH0 (see examples in references 20 and 22).

PROBLEMS WITH THE TWO-SIDED HYPOTHESIS TEST FOR EQUALITY
In Example 1, the hypothesis tested by our team is one-sided because the ranges for the mean for Ha and H0 were each completely on one side or the other of the hypothesized value, 96%LC. A one-sided test is defined as follows:
• One-sided test: A null hypothesis stated in such a way that observed values of the decision statistic on one side (either large or small but not both) constitute evidence against it.

Our example uses composite hypotheses because Ha and H0 both consist of ranges rather than single points. It is also common to consider a two-sided situation in which the null hypothesis, H0, consists of a single point. Say for instance our team obtained the following data from their testing of n=10 batches:

Mean of 10 measured batch potencies = 95%LC, and
Sample standard deviation = 5%LC.

They might have considered testing a point-null hypothesis such as

H0: true mean = 100%LC

against an alternative hypothesis such as

Ha: true mean ≠ 100%LC.

A Fisherian test of this point-null H0 is easily executed in Excel as follows:

t-value = SQRT(10)*ABS(95-100)/5 = 3.16
P-value = TDIST(3.16, 10-1, 2) = 0.012,

which would lead to rejection of H0 if the conventional line of 0.05 is employed.
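The same two-sided calculation can be sketched in Python; the values are those given above, and the result should match the TDIST output:

# A minimal sketch of the two-sided (point-null) Fisherian test,
# using scipy in place of Excel's TDIST with the two-tail option.
from math import sqrt
from scipy import stats

n, observed_mean, sd, h0_mean = 10, 95.0, 5.0, 100.0

t_value = sqrt(n) * abs(observed_mean - h0_mean) / sd   # 3.16
p_value = 2 * stats.t.sf(t_value, df=n - 1)             # two-sided: about 0.012

print(f"t = {t_value:.2f}, two-sided P-value = {p_value:.3f}")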
Testing of point-null hypotheses such as this is very common. Unfortunately, as shown below, this is almost never a realistic or meaningful test. Most would agree that there is little practical difference between a process mean potency of 99.999%LC and 100.001%LC. Yet it can be shown that if the sample size is large enough, H0 will be rejected with high probability, even if the true deviation from 100.000%LC is only 0.001%LC. This sensitivity to sample size is well known to statisticians because decision statistics, such as the mean, become very precise (small standard errors), but not necessarily more accurate, when sample size is increased. An essentially correct hypothesis can be rejected when the summary statistics are too precise. This type of counter-intuitive behavior in hypothesis testing is often due to an incorrect statement of the problem known as a Type III error (see Figure 4 footnote). Type III error is defined as follows:
• Type III error: A decision error that results in choosing the incorrect null or alternative hypothesis for use in a hypothesis test.

Rather than consider a point H0, it may be more appropriate to specify a small interval for H0. Type III errors are common in hypothesis tests for normality. In very large samples, normality is almost always rejected despite the fact that a histogram agrees visually with the fitted normal curve. Sometimes the rejection is caused by minor imperfections in the data, such as rounding, that are not material to the objectives of the hypothesis test.

One advantage of the Bayesian approach is that it forces one to think carefully about the correct formulation of a hypothesis-testing problem. From the Bayesian hypothesis-testing perspective, the prior distributions for H0 and Ha are the hypotheses being tested. An appropriate Bayesian approach to test this point-null H0 is given by Schervish (23, example 4.22, pp 224-5). Under the Bayesian paradigm, we require prior distributions for the mean and standard deviation for both H0 and Ha. This is because, in general, neither the true mean nor the true standard deviation parameter is known. Instead of a single parameter we have two parameters to consider, and these will have joint prior and posterior probability distributions. Joint probability distribution is defined as follows:
• Joint probability distribution: A probability distribution in which the probability density or mass depends on the values of two (bivariate) or more (multivariate) parameters simultaneously. A bivariate probability distribution can be visualized as a surface mesh or contour plot.

The H0 prior for mean and standard deviation is shown in Figure 8. The prior for the mean (top panel) is a single spike at 100%LC, consistent with the point H0. The prior distribution for the standard deviation is a mildly informative IRG distribution with C=1 and G=50 (3, Table IV) and is shown on the lower panel of Figure 8. The Ha prior for mean and standard deviation is shown in Figure 9. Because under Ha the mean is permitted to vary, the joint distribution of mean and variance is displayed as a surface mesh plot. This is a joint LSSt-IRG distribution with R=1, T=100, U=7.07, C=0.5, and G=50 (3, Table IV). This same Ha prior is displayed in Figure 10 as a contour plot.
Figure 8: Point-null hypothesis visualized as probability distributions for the mean and standard deviation. (The top panel shows the prior for the mean, a single spike at 100%LC; the lower panel shows the mildly informative prior for the standard deviation.)

Figure 9: Alternative hypothesis visualized as a surface mesh plot of the joint probability for the mean and standard deviation.

Figure 10: Alternative hypothesis visualized as a contour plot of the prior joint probability distribution for the mean and standard deviation.

From the contour plot it is easier to see the two-dimensional shape and range that is identified as the alternative hypothesis. Unlike the one-sided case, the point-null situation requires that we also specify our prior beliefs about the truth of H0 and Ha. If the team had no prior preference for either H0 or Ha, they would assign equal prior probabilities to each. We can express that using shorthand notation as PriorProbH0 = 0.5 and PriorProbHa = 0.5. The calculation of the Bayes factor for this test can easily be done in Excel (23, equation 4.23), and a value of B = 0.3495 is obtained. Given this information, we can invert the equation for B given in Figure 7 to solve for the posterior probability of H0. Noting that PostProbHa = 1 - PostProbH0 and that the prior odds are 1/1, we find that PostProbH0 = B/(1 + B) = 0.3495/1.3495 = 0.259. So the team would conclude that there is a 25.9% probability that H0 is true and likely would not reject it.

This is a troubling result because the Fisherian point-null test above yielded a P-value of 0.012, clearly suggesting rejection of H0. However, as noted above, the P-value groups the actual observation obtained with much more extreme observations that were not actually obtained. In the case of point-null hypothesis testing, Bayesian and Fisherian conclusions rarely agree (24, see p 151, table 4.2). When two sound methodologies lead to different conclusions we must wonder whether we have misidentified the problem (i.e., our friend, the Type III error again?). It is best to consider carefully what we mean by "not equal." We do this in the following section.

When applicable, the Bayesian approach to hypothesis testing gives answers in terms of probability statements about the parameters and the hypotheses themselves, which are impossible with the Fisherian and Neyman-Pearson approaches. This is very advantageous for risk analysis. Bayesian inductive inference is also useful in data-mining applications. As an example, the US Food and Drug Administration's Center for Drug Evaluation and Research now employs a Bayesian screening algorithm as part of their internal drug safety surveillance program (25). However, the Bayesian approach can be more demanding for the following reasons:
• While many Bayesian problems can be solved in Excel, more complicated situations may require advanced computing packages such as WinBugs (26). Calculation of the K in Equation 4 or of the Bayes factor can sometimes be challenging (27).
• Bayesian approaches require specification of prior distributions and prior probabilities of the hypotheses under study. While noninformative priors may be used for objectivity, this may result in a loss of information as well as a lost opportunity to debate and thereby take advantage of the prior knowledge of different members of a project team. It is always critical to understand any effect that the prior may have on conclusions.
• When used for confirmatory or demonstration studies such as clinical trials, validations, quality control, or data mining, Bayesian hypothesis testing methods must be calibrated (usually by computer simulation) so that the Type I and II error risks are known.

A good, readable, basic introductory textbook for readers interested in the theory and methodology of Bayesian hypothesis testing is Bolstad (28).

THE TOST HYPOTHESIS TEST FOR EQUIVALENCY
Sometimes, as with method transfers or process changes in existing products or validation of new products, the objective may be to establish evidence for equivalency, rather than equality, with the usual burden of proof reversed. It is best to define a range for the mean difference, say L to H, indicative of equivalence. Then the null hypothesis is

H0: true mean difference < L or true mean difference > H

against the alternative hypothesis

Ha: L < true mean difference < H.

In this way we treat nonequivalence as the default state of nature and require the data to provide evidence that allows us to reject H0. A "two one-sided testing" (TOST) procedure (29) is based on requiring a confidence or credible interval to be completely contained within the range of Ha in order to reject H0. TOST is defined as follows:
• Two one-sided hypothesis test (TOST): A hypothesis test for equivalency that consists of two one-sided tests conducted at the high and low range of equivalence, each of which must be rejected at the Type I error rate in order to reject the null hypothesis of non-equivalence.
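A minimal Python sketch of the TOST mechanics follows. The equivalence bounds (L = -5%LC, H = +5%LC), sample size, and summary statistics here are hypothetical choices for illustration only, not values from the article's examples:

# A minimal sketch of a TOST equivalency test on a mean difference:
# two one-sided t-tests, each of which must reject at the Type I error rate.
from math import sqrt
from scipy import stats

low, high = -5.0, 5.0             # hypothetical equivalence range for the true mean difference
n, mean_diff, sd_diff = 12, 1.2, 3.5   # hypothetical summary data
alpha = 0.05

se = sd_diff / sqrt(n)
t_low = (mean_diff - low) / se    # tests H0: true difference <= L (reject for large t)
t_high = (mean_diff - high) / se  # tests H0: true difference >= H (reject for small t)

p_low = stats.t.sf(t_low, df=n - 1)
p_high = stats.t.cdf(t_high, df=n - 1)

equivalent = (p_low < alpha) and (p_high < alpha)  # both one-sided tests must reject
print(f"p_low = {p_low:.4f}, p_high = {p_high:.4f}, equivalence shown: {equivalent}")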
EXAMPLE 2: A TEST THAT THE MEANS OF THREE GROUPS ARE EQUAL
The ANOVA procedure developed by Fisher is a procedure for testing the null hypothesis of equality for group means. Fisher's ANOVA calculation procedure is illustrated elsewhere in this issue with an example (30) in which there are three groups (A, B, and C). We will illustrate the Bayesian approach to ANOVA here. Zelen (31, pp 306-9) shows how to perform multiple regressions from a Bayesian perspective. His procedure can easily be adapted to the simple ANOVA case, and a Bayes factor for the ANOVA can be calculated in Excel. However, as with the point-null hypothesis, we will find that the conventional P-value tends to overstate the case for rejection of H0. Again, it is wise to ask if the ANOVA hypothesis test correctly frames the real questions we want to ask. We will assume here that the null hypothesis of equality

H0: meanA = meanB = meanC

is appropriate. The corresponding alternative hypothesis is

Ha: one or more of the three means are not equal.

An ANOVA F-test is often used when the real objective is to make comparisons among the group means. When an appropriate noninformative prior (3, 4) is used, the Bayesian approach to group means comparison agrees exactly with the standard ANOVA F-test (32, pp 123-43).
But the Bayesian approach offers a useful advantage. The usual Fisherian paradigm regards the true group means as fixed entities. So one can only make comparisons with respect to the measured statistics, using such things as confidence intervals. However, the Bayesian paradigm regards the true group means as having a joint posterior distribution. It is possible to calculate and plot the joint posterior distribution using Excel and to visualize the single point within this distribution that corresponds to all the means being equal (i.e., the H0). Such a graph for the example (30) is shown in Figure 11, but it requires some explanation.

Figure 11: Visualizing the ANOVA null hypothesis (H0) relative to the posterior distribution of the group deltas. (The contour plot shows ellipses of constant F about the posterior mode, with the H0 point at delta_A = delta_B = 0.)

First, the axes for the plot are not the true group means, but the deviations (deltas) of these true means from their true grand mean, M, where

M = (mean_A + mean_B + mean_C)/3, or
mean_C - M = -(mean_A - M) - (mean_B - M),

and denoting the differences of the true group means from their true grand mean with a "delta," we have

delta_C = -delta_A - delta_B.

Because delta_C can always be obtained knowing delta_A and delta_B, it is not an independent random variable. Consequently, we only need to concern ourselves with two of the three deltas, say delta_A and delta_B. With three groups, we can actually visualize the joint distribution as a two-dimensional contour plot.

Second, the point in Figure 11 corresponding to H0 (true means all equal) is the point delta_A = delta_B = 0, because only under this condition can all three group means be equal.

Third, the contour lines form ellipses about the joint distribution mode (the point delta_A = -0.22 and delta_B = -0.26, indicated as a red dot in Figure 11). (Recall the mode is the point of a probability distribution having the maximum probability density.) They do not correspond to a probability density as would be true for an actual distribution such as that in Figure 10. Instead the contour lines are critical F values. In Excel, the probability that the true value of delta_A, delta_B is beyond a given contour ellipse corresponding to F is FDIST(F, 3-1, 15-3). By analogy with a univariate distribution, the further any point is from the distribution mode, the larger the F value and the less likely that point is as a candidate for the true delta_A, delta_B.

Finally, the contour line in Figure 11 that equals the critical F value of 3.885, obtained in Excel as FINV(0.05, 3-1, 15-3) (30), forms a 95% credible ellipse. This is a bivariate analogue of a univariate 95% credible interval (3). Notice in Figure 11 that H0 corresponds to an F value of 6.142 and is, therefore, beyond the 95% credible ellipse. As with the traditional ANOVA calculation, we reject H0. By restating the hypotheses in terms of joint posterior distributions that can be displayed visually, the Bayesian perspective offers deeper insight into the mechanics and interpretation of ANOVA.
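The Excel F-distribution calls above translate directly to scipy. In the sketch below, the F value of 6.142 at the H0 point (delta_A = delta_B = 0) is taken from the worked example in the article, while everything else is computed:

# A minimal sketch of the F-distribution calculations for Example 2,
# using scipy in place of Excel's FINV and FDIST.
from scipy import stats

k, n_total = 3, 15                        # three groups, 15 observations in all
df1, df2 = k - 1, n_total - k             # 2 and 12 degrees of freedom

f_crit = stats.f.isf(0.05, df1, df2)      # Excel FINV(0.05, 2, 12): 3.885
f_at_h0 = 6.142                           # F value at the point of equal means (from the article)

prob_beyond = stats.f.sf(f_at_h0, df1, df2)   # Excel FDIST(F, 2, 12)
print(f"critical F = {f_crit:.3f}, F at H0 = {f_at_h0}, tail probability = {prob_beyond:.3f}")
print("Reject H0" if f_at_h0 > f_crit else "Fail to reject H0")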
The P-value is based on unobserved results more extreme than those observed, so it may overstate the evidence against the null hypothesis. The P-value from a pointnull hypothesis, or two-sided test of equality, is very difficult to interpret. A confidence or credible interval should be provided in addition to the P-value. The Bayes factor, like the P-value, is a measure of evidence in favor of the null hypothesis contained in the data. It is based on an odds ratio and permits direct probability statements about the hypothesis tests being considered. The Type II error rate is the probability of incorrectly failing to reject the null hypothesis when in fact the alternative hypothesis is true. The Type II error rate is related to Power and allows the construction of operating characteristic curves. The analysis of variance is a Fisherian point-null hypothesis test for the equality of means of two or more groups. The Bayesian approach offers an insightful reinterpretation of the analysis of variance in terms of the joint posterior distribution of the group means. The three approaches to hypothesis testing represent major advances in the way we use data to make inductive inferences about the products, processes, and test methods we develop in regulated industries. These approaches, and others not discussed here, have done much to improve our decision-making. They help us build evidence about underlying mechanisms and provide us a common language for objective communication of the evidence supporting our conclusions. Still, humility and caution are in order. Despite these impressive methodologies, no single, coherent, generally-accepted approach to inductive inference has yet emerged in our time. All experimenters know that Nature does not divulge her secrets easily. For the present, it is perhaps unwise to apply any induction methodology by rote without first carefully considering whether it is consistent with our objective and problem. ACKNOWLEDGMENT SUMMARY We have seen that scientific inference requires both deductive and inductive inferences. The three main 98 PROCESS VALIDATION – Process Design The text before you would be poorer if not for the help of others. I am most sincerely grateful to Paul Pluta for ideas, encouragement, and expert feedback; to Diane David LeBlond Wolden who tirelessly kept the reader in mind; and to Susan Haigney for expertly laying the product to print. GLOSSARY Alternative hypothesis: A hypothesis considered as an alternative to the null hypothesis, though possibly more complicated or less likely. ANOVA (analysis of variance): A hypothesis test that uses the F-statistic described by Fisher used to detect differences among the true means of data from two or more groups. Bayes factor (B): In Bayesian induction, the Bayes factor is the ratio of the posterior odds that H0 is true to its prior odds. A B value of 1/10 means that the Ha is supported 10 times as much as H0. Since the Bayes factor is normalized by the prior odds, it is measure of evidence that primarily reflects the observed data. Bayes’ rule: A process for combining the information in a data set with relevant prior information (theory, past data, expert opinion, and knowledge) to obtain posterior information. Prior and posterior information are expressed in the form of prior and posterior probability distributions, respectively, of the underlying physical parameters, or of predictive posterior distributions of future data. 
Bayesian induction: A process for inductive inference in which the P-value is replaced with the posterior probability that the null hypothesis is true. In Bayesian induction, the respective prior distributions and data models (likelihoods) constitute the null and alternative hypotheses. In addition, one must specify the prior probability (or odds) that the null hypothesis is true.
Calibrated hypothesis test: A hypothesis test method whose Type I error, on repeated use, is known from theory or computer simulation.
Composite hypothesis: A statement that gives a range of possible values to a model parameter. For example, "Ha: true mean > 0" is a composite hypothesis.
Confidence interval: A random interval estimate of a (conceptually) fixed quantity, which is obtained by an estimation method calibrated such that the interval contains the fixed quantity with a certain probability (the confidence level).
Control chart: A time-ordered plot of observed data values or statistics that is used as part of a process control program. Various hypothesis tests are employed with control charts to detect the presence of trends or unusual values.
Credible interval: An interval estimate of a random variable, based on its probability distribution, which contains its value with a certain probability (the credible probability level).
Data: Measured random variable values, assumed to be generated by some hypothetical likelihood model, which contain information about the parameters of that model.
Deduction: The act of drawing a conclusion about some hypothesis based entirely on careful definitions, axioms, and logical reasoning.
F-statistic: The decision statistic used in Fisher's analysis of variance hypothesis test, consisting of the ratio of two independent observed variances calculated from normally distributed data.
Fisherian induction: A process for inductive inference, described most clearly by Ronald Fisher, that uses the P-value as a criterion for rejecting a hypothesis.
Hypothesis: A provisional statement about the value of a model parameter or parameters whose truth can be tested by experiment.
Induction: The act of drawing a conclusion about some hypothesis based primarily on limited data.
Inference: The act of drawing a conclusion regarding some hypothesis based on facts or data.
Joint probability distribution: A probability distribution in which the probability density or mass depends on the values of two (bivariate) or more (multivariate) parameters simultaneously. A bivariate probability distribution can be visualized as a surface mesh or contour plot.
Likelihood model: A description of a data generating process that includes parameters (and possibly other variables) whose values determine the distribution of data produced.
Measures of evidence: In scientific studies, hypothesis testing is used to build evidence for or against various hypotheses. The P-value and Bayes factor are examples of measures of evidence in Fisherian and Bayesian induction, respectively.
Mode: A point estimate of a random variable that is the value at which its probability density is maximized.
Multiplicity: When multiple hypothesis tests (e.g., control chart rules) are applied to different aspects (e.g., trending patterns) of a data set, the overall false alarm rate of any one test failing may be greater than that of any single test when applied alone. This statistical phenomenon is referred to as multiplicity.
Neyman-Pearson induction: A methodology for inductive inference, developed by Jerzy Neyman and Egon Pearson, that considers both a null and an alternative hypothesis. The null hypothesis is rejected in favor of the alternative hypothesis if the observed value of some statistic lies in its 'rejection region'. The statistic and the associated rejection region are identified from statistical theory and are chosen to provide desired Type I and Type II decision error rates over repeated applications of the methodology.

Null hypothesis (H0): A plausible hypothesis that is presumed sufficient to explain a set of data unless statistical evidence in the form of a hypothesis test indicates otherwise.

Ockham's Razor: The doctrine of parsimony that advocates provisionally adopting the simplest possible explanation for observed data.

Odds: The ratio of the probability of an event to the probability of its complement. In hypothesis testing where only H0 or Ha can be true (but not both), if the probability that H0 is true is ProbH0, then the odds of H0 equal ProbH0/(1 - ProbH0).

One-sided test: A null hypothesis stated in such a way that observed values of the decision statistic on one side (either large or small, but not both) constitute evidence against it.

P-value: The probability of obtaining a result at least as extreme as the one actually observed, given that the null hypothesis is true. The fact that P-values are based on this assumption is crucial to their correct interpretation.

Parameter: In statistics, a quantity of interest whose "true" value is to be estimated. Generally a parameter is some underlying variable associated with a physical, chemical, or statistical model.

Point (simple) hypothesis: A statement that a model parameter is equal to a single specific value. For example, 'H0: true mean = 0' is a simple hypothesis.

Posterior distribution: A distributional estimate of a random variable that updates information from a prior distribution with new information from data using Bayes' rule.

Power (or operating characteristic) curve: Power is equal to 1 minus the Type II error rate. The power curve of a hypothesis test is a plot of power versus the true value of the underlying parameter of interest; an operating characteristic curve instead plots the probability of failing to reject H0 (1 minus power).

Prior distribution: A subjective distributional estimate of a random variable, expressed as a probability distribution, that summarizes what is known prior to data collection.

Probability density contour plot: A rendering of a bivariate distribution in which the distribution appears in the bivariate parameter space as contours of equal probability density.

Probability density surface mesh plot: A three-dimensional analogue of a two-dimensional probability density plot in which there are two, rather than one, model parameters. The bivariate distribution therefore appears as a surface rather than as a curve.

Sampling distribution: The distribution of data or of some summary statistic calculated from data.

Special cause: When the cause for variation in data, or in statistics derived from data, can be identified and controlled, it is referred to as a "special" cause. When the cause cannot be identified, it is regarded as random noise and referred to as a "common" cause.

Statistic: A summary value (such as the mean or standard deviation) that is calculated from data. A statistic is often used because it provides a good estimate of a parameter of interest.
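As a concrete illustration of the P-value and power entries above, the following Python sketch (hypothetical data; scipy assumed available) computes a two-sided one-sample t-test P-value for H0: true mean = 0, and then estimates power at an assumed true mean by repeating the experiment in simulation. All sample sizes and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# One-sample t-test of H0: true mean = 0 on hypothetical data.
x = rng.normal(loc=0.5, scale=1.0, size=12)
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)
print(f"t = {t_stat:.3f}, two-sided P-value = {p_value:.4f}")

# Power at an assumed true mean: the long-run fraction of repeated
# experiments in which H0 is (correctly) rejected at level alpha.
alpha, true_mean, n_obs = 0.05, 0.5, 12
rejections = [
    stats.ttest_1samp(rng.normal(true_mean, 1.0, n_obs), 0.0).pvalue < alpha
    for _ in range(5_000)
]
print("estimated power:", np.mean(rejections))
```

Repeating the power estimate over a grid of assumed true means would trace out the power curve defined in the glossary above.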
t-statistic: The decision statistic used in Student's t-test, consisting of the difference between an observed and a hypothesized mean divided by its estimated standard error.

Two-sided test: A null hypothesis stated in such a way that either large or small observed values of the decision statistic constitute evidence against it.

Two one-sided hypothesis test (TOST): A hypothesis test for equivalency that consists of two one-sided tests conducted at the lower and upper limits of the equivalence range, each of which must be rejected at the stated Type I error rate in order to reject the null hypothesis of non-equivalence.

Type I error: A decision error that results in falsely rejecting the null hypothesis when in fact it is true. It is sometimes referred to as the alpha-risk or manufacturer's risk.

Type II error: A decision error that results in failing to reject the null hypothesis when in fact it is false. It is sometimes referred to as the beta-risk or consumer's risk.

Type III error: A decision error that results in choosing the incorrect null or alternative hypothesis for use in a hypothesis test.
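The two one-sided hypothesis test (TOST) defined in the glossary above can also be illustrated with a short sketch. The Python below is illustrative only; the equivalence limits, data values, and alpha are assumptions. It carries out one-sided t-tests against the lower and upper equivalence limits and declares equivalence only if both one-sided nulls are rejected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=0.1, scale=0.5, size=20)   # hypothetical data
low, high, alpha = -0.5, 0.5, 0.05            # assumed equivalence range and level

n, mean = x.size, x.mean()
se = x.std(ddof=1) / np.sqrt(n)               # estimated standard error of the mean

t_low = (mean - low) / se    # tests H0: true mean <= low (rejected for large t_low)
t_high = (mean - high) / se  # tests H0: true mean >= high (rejected for small t_high)
p_low = stats.t.sf(t_low, df=n - 1)
p_high = stats.t.cdf(t_high, df=n - 1)

equivalent = (p_low < alpha) and (p_high < alpha)
print(f"p_low = {p_low:.4f}, p_high = {p_high:.4f}, equivalent = {equivalent}")
```

Note the reversal of roles relative to an ordinary test of equality: here non-equivalence is the null hypothesis, so equivalence must be demonstrated by the data rather than assumed by default.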
REFERENCES
1. LeBlond, D., "Data, Variation, Uncertainty, and Probability Distributions," Journal of GXP Compliance, Vol. 12, No. 3, pp. 30-41, 2008.
2. LeBlond, D., "Using Probability Distributions to Make Decisions," Journal of Validation Technology, Spring 2008, pp. 2-14, 2008.
3. LeBlond, D., "Estimation: Knowledge Building with Probability Distributions," Journal of GXP Compliance, Vol. 12, No. 4, pp. 42-59, 2008. See Journal of Validation Technology, Vol. 14, No. 5, 2008 for a correction to Table IV.
4. LeBlond, D., "Estimation: Knowledge Building with Probability Distributions - Reader Q&A," Journal of Validation Technology, Vol. 14, No. 5, pp. 50-64, 2008.
5. Marden, J., "Hypothesis Testing: From p Values to Bayes Factors," Journal of the American Statistical Association, Vol. 95, No. 452, pp. 1316-1320, 2000.
6. Stigler, S., The History of Statistics: The Measurement of Uncertainty Before 1900, Belknap Press, Cambridge, 1986.
7. Daston, L., Classical Probability in the Enlightenment, Princeton University Press, Princeton, 1988.
8. William of Ockham, in the 14th century, advocated adopting the simplest possible explanation for physical phenomena; see Jeffreys, H., Theory of Probability, 3rd edition, Oxford University Press, New York, page 342, 1961.
9. Pearson, K., "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling," Philosophical Magazine, Vol. 50, pp. 157-175, 1900.
10. Gosset, W. (aka "Student"), "The probable error of a mean," Biometrika, Vol. VI, No. 1, pp. 1-25, 1908.
11. Fisher, R., Statistical Methods for Research Workers, Oliver & Boyd, Edinburgh, 1925.
12. Snedecor, G. and Cochran, W., Statistical Methods, 6th edition, Iowa State University Press, Ames, page 98, 1967.
13. Goodman, S., "Toward Evidence-based Medical Statistics. 1. The P Value Fallacy," Annals of Internal Medicine, Vol. 130, pp. 995-1004, 1999.
14. Gibbons, J. and Pratt, J., "P-values: Interpretation and Methodology," The American Statistician, Vol. 29, No. 1, pp. 20-25, 1975.
15. Neyman, J. and Pearson, E., "On the Problem of the Most Efficient Tests of Statistical Hypotheses," Philosophical Transactions of the Royal Society, Series A, Vol. 231, pp. 289-337, 1933.
16. International Conference on Harmonisation, ICH Harmonised Tripartite Guideline on Pharmaceutical Development Q8, Current Step 4 version, dated 10 November 2005.
17. Dale, A., A History of Inverse Probability, Springer, New York, page 545, note 38, 1999.
18. Price, R. (1763), "An essay towards solving a problem in the doctrine of chances," reprinted in Dale, A., Most Honourable Remembrance: The Life and Work of Thomas Bayes, Springer, New York, 2003.
19. Jeffreys, H., Theory of Probability, 3rd edition, Oxford University Press, Cambridge, 1961.
20. Gelman, A., Carlin, J., Stern, H., and Rubin, D., Bayesian Data Analysis, 2nd edition, Chapman and Hall/CRC, New York, 2004.
21. Goodman, S., "Toward Evidence-based Medical Statistics. 2. The Bayes Factor," Annals of Internal Medicine, Vol. 130, pp. 1005-1013, 1999.
22. Casella, G. and Berger, R., "Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem," Journal of the American Statistical Association, Vol. 82, No. 397, pp. 106-111, 1987.
23. Schervish, M., Theory of Statistics, Springer-Verlag, New York, 1995.
24. Berger, J., Statistical Decision Theory and Bayesian Analysis, 2nd edition, Springer-Verlag, New York, 1985.
25. Lincoln Technologies, "WebVDME in Production at the FDA," WebVDME News, Vol. 2, No. 2, page 1, 2005. Available at http://www.lincolntechnologies.com.
26. Cowles, M., "Review of WinBUGS 1.4," The American Statistician, Vol. 58, No. 4, pp. 330-336, 2004.
27. Kass, R. and Raftery, A., "Bayes Factors," Journal of the American Statistical Association, Vol. 90, No. 430, pp. 773-795, 1995.
28. Bolstad, W., Introduction to Bayesian Statistics, 2nd edition, John Wiley & Sons, Hoboken, New Jersey, 2007.
29. Schuirmann, D., "On Hypothesis Testing to Determine if the Mean of a Normal Distribution is Contained in a Known Interval," Biometrics, Vol. 37, page 617, 1981.
30. Vijayvargiya, A., "One Way Analysis of Variance," Journal of Validation Technology, Vol. 15, No. 1, 2009.
31. Zellner, A., An Introduction to Bayesian Inference in Econometrics, John Wiley, New York, 1971.
32. Box, G. and Tiao, G., Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading, MA, 1973.

JVT
Originally published in the Winter 2009 issue of The Journal of Validation Technology