Understanding Hypothesis Testing Using Probability Distributions
David LeBlond
“Statistical Viewpoint” addresses principles of
statistics useful to practitioners in compliance and
validation. We intend to present these concepts in a
meaningful way so as to enable their application in
daily work situations.
Reader comments, questions, and suggestions are
needed to help us fulfill our objective for this column.
Suggestions for future discussion topics or questions
to be addressed are invited. Readers are also invited
to participate and contribute manuscripts for this
column. Case studies sharing regulatory strategies
are most welcome. Please contact coordinating
editor Susan Haigney at [email protected]
with comments, suggestions, or manuscripts for
publication.
KEY POINTS
The following key points are discussed:
• Scientific inference requires both deductive and inductive inference
• Three main approaches to inductive inference are used
• Fisherian induction uses the P-value as a measure of evidence against the null hypothesis
• Neyman-Pearson induction controls the long-run decision risk over repeated experiments
• Bayesian induction obtains direct probabilistic measures of evidence from the posterior distribution
t"1WBMVFJTUIFQSPCBCJMJUZPGPCTFSWJOHBSFTVMU
as extreme or more extreme than that observed,
assuming the null hypothesis is true
t5IF5ZQF*FSSPSSBUFJTUIFQSPCBCJMJUZPG
incorrectly rejecting the null hypothesis
t5IF5ZQF**FSSPSSBUFJTUIFQSPCBCJMJUZPG
incorrectly failing to reject the null hypothesis
t5IF#BZFTGBDUPSJTBNFBTVSFPGFWJEFODFJOGBWPS
of the null hypothesis contained in the data
t1WBMVFFTUJNBUJPOJTCBTFEPOVOPCTFSWFESFTVMUT
more extreme than those observed, so it may overstate the evidence against the null hypothesis
t5IF1WBMVFGSPNBQPJOUOVMMIZQPUIFTJTPSUXP
sided test of equality, is very difficult to interpret. A
confidence or credible interval should be provided
in addition to the P-value.
t"OBMZTJTPGWBSJBODFJTB'JTIFSJBOIZQPUIFTJTUFTU
for the equality of means of two or more groups
t5IF#BZFTJBOBQQSPBDIPGGFSTBOJOTJHIUGVM
re-interpretation of the analysis of variance in
terms of the joint posterior distribution of the
group means.
INTRODUCTION
The first issue of “Statistical Viewpoint” (1) presented
basic probability distribution concepts and Microsoft
Excel tools useful in statistical calculations. The
second (2) reinforced these concepts and tools
through eight scientific decision-making examples.
ABOUT THE AUTHOR
David LeBlond, Ph.D., has 29 years of experience in the pharmaceutical and medical diagnostics
fields. He is currently a principal research statistician supporting analytical and pharmaceutical
development at Abbott. David can be reached at [email protected].
The third and fourth (3, 4) illustrated how probability
distributions aid in process knowledge building. This
issue shows how probability distributions are central
to the understanding of hypothesis testing. Some of
the concepts of probability distributions introduced in
the first four issues of this column will be helpful in
understanding what follows here.
Product, process, or method development can be
viewed as a series of decisions based on knowledge
building: What type of packaging should be employed?
What is the optimum granulation time? Can a
proportional model be used for calibration? The tools
we use to make such decisions are many. We rely on
prior theory and expert knowledge to conceptualize
the underlying mechanisms, select prototypes for
testing, design experiments, and prepare our minds
to interpret experimental results. We use dependable
measuring systems to acquire new data. Often, when
the results of our experimental trials are unclear (and
even sometimes when they are obvious), we employ
statistical and probabilistic methods to guide us in our
decision making.
We humans are reasonably good at exploring our
world, finding explanations for things, and making
predictions from our theories. Unfortunately, while we
all have a sense of rational intuition that (for the most
part) serves us well, the process we use (or should use)
to build understanding and make optimal decisions
from data has been the subject of heated debate by
philosophers, scientists, and mathematicians for
centuries (reference 5 gives a nice, readable overview;
see references 6 and 7 for more details). While the
debate shows no sign of concluding in our own time,
three noteworthy approaches have emerged. Here
we discuss some history and key concepts of each
approach and illustrate the central role probability
distributions play with two simple examples.
THE PROCESS OF SCIENTIFIC INFERENCE
Let us start by defining several important terms as
follows (Note that these and additional terms are
defined in the Glossary section at the end of this
article):
• Parameter: In statistics, a parameter is a quantity
of interest whose “true” value is to be estimated.
Generally a parameter is some underlying variable
associated with a physical, chemical, or statistical
model. When a quantity is described here as
“true” or “underlying” the quantity discussed is a
parameter.
• Hypothesis: A provisional statement about the
value of a model parameter or parameters whose
truth can be tested by experiment.
• Null hypothesis (H0): A plausible hypothesis
that is presumed sufficient, given prior knowledge,
until experimental evidence in the form of a
hypothesis test indicates otherwise.
• Alternative hypothesis (Ha): A hypothesis
considered as an alternative to the null hypothesis,
though possibly more complicated or less likely
given prior knowledge.
• Inference: The act of drawing a conclusion
regarding some hypothesis based on facts or data.
• Inductive inference: The act of drawing a
conclusion about some hypothesis based primarily
on data.
• Deductive inference: The act of drawing a
conclusion about some hypothesis based entirely
on careful definitions, axioms, and logical
reasoning.
We can identify the following four types of activities
in the decision-making process which are illustrated in
Figure 1.
EXAMPLE 1: TABLET POTENCY
Consider the case of a development team concerned
with the true mean potency, averaged over batches,
produced by a tablet manufacturing process. In this
case, the tablet label claim (LC) and target for the
manufacturing process is 100%LC. Individual batches
may have a mean potency that deviates slightly from
100%, but batch means <90%LC are unacceptable.
While the team believes the process is adequate,
their objective is to provide evidence that the process
produces acceptable batches. How can the team
validate their belief that the process mean potency
is acceptable? We will use this example to illustrate
the three different systems of scientific inference for
doing this.
State Hypotheses About True Mean
State one or more hypotheses about the underlying
true mean potency parameter. In Figure 1, three
possible hypotheses (true process mean potency = 92,
96, and 102) are illustrated. Such hypotheses are called
“point” hypotheses because they specify a single fixed
value of an underlying parameter. Useful hypotheses
often specify a range of values and are referred to as
composite hypotheses. Note the following definitions:
• Composite hypothesis: A statement that gives
a range of possible values to a model parameter.
For example, ‘Ha: true mean > 0’ is a composite
hypothesis.
• Point (simple) hypothesis: A statement that
a model parameter is equal to a single specific
value. For example, ‘H0: true mean = 0’ is a simple
hypothesis.
Figure 1: Scientific inference applied to the mean potency of a tablet manufacturing process. (The figure illustrates the four activities: 1. State hypotheses about the true mean, such as H: 92, H: 96, and H: 102; 2. Make deductive inferences from each hypothesis to the range of results it predicts; 3. Obtain a sample estimate of the mean; 4. Make inductive inferences from the observed estimate back to the hypotheses.)
For example 1, if low potency is a concern, the
team may be interested in testing a composite null
hypothesis such as
H0: true mean = or > 96%LC
against a composite alternative hypothesis such as
Ha: true mean < 96%LC.
The value of 96%LC might be considered the lowest
process mean potency consistent with an acceptable
process. That is, if the team guesses that the true
process standard deviation is about 2%LC, then 96%
would be safely “3-sigma” above the unacceptable
lower limit of 90%LC.
There is an asymmetry to the hypotheses such that
the null hypothesis, H0, is considered a priori most
likely, requiring the fewest assumptions (i.e., the
process is performing acceptably). Following Ockham’s
Razor (8), the simplest hypothesis is often the default
H0. The hypothesis chosen as H0 is usually the one
that requires the lower burden of proof in the minds of
decision makers.
If the team believes the process is adequate, the
alternative hypothesis, Ha, is considered less likely
than H0 and would require postulation of some
special cause, defined as follows:
• Special cause: When the cause for variation
in data or statistics derived from data can be
identified and controlled, it is referred to as
a “special” cause. When the cause cannot be
identified, it is regarded as random noise and
referred to as a “common” cause.
Still the team needs supportive evidence because,
if the true process mean is < 96%LC, the process may
result in subpotent or failing batches.
Make Deductive Inferences
The team believes that batch mean potency
measurements will be normally distributed about
the true process mean potency. Thus they have a
mechanistic/probabilistic model to predict the likely
range of measured batch mean potency values to
expect. This kind of model is known as a likelihood
model, defined as follows:
• Likelihood model: A description of a data
generating process that includes parameters (and
possibly other variables) whose values determine
the distribution of data produced. Specifically, the
likelihood is
likelihood = Probability of the observed data
if the hypothesis is true.
[Equation 1]
Note that the predicted range of the observed data
depends on the hypothesized true mean potency. The
range will be different for H0 and Ha, with a mean
process potency of 96%LC being considered borderline
acceptable. The act of predicting (or simulating)
future data from such a likelihood model is purely
deductive. Such predictions are always true as long as
the underlying model and hypothesized value of the
underlying parameter are true.
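To make Equation 1 and this deductive step concrete, the short sketch below (an illustration added here, not part of the article's calculations; the data values, the planning value of 2%LC for the standard deviation, and the use of Python's scipy library are assumptions) first evaluates the likelihood of a few hypothetical batch means under two hypothesized true means, and then simulates one predicted set of future batch means from the same model:

```python
# Minimal sketch of Equation 1 (likelihood) and of deductive prediction
# (simulation) for a normal likelihood model. Data values are hypothetical.
import numpy as np
from scipy import stats

observed = np.array([95.1, 92.4, 97.0])   # hypothetical batch mean potencies, %LC
assumed_sd = 2.0                          # the team's planning value for sigma, %LC

def likelihood(true_mean, data, sd=assumed_sd):
    """Probability density of the observed data if the hypothesized mean is true."""
    return np.prod(stats.norm.pdf(data, loc=true_mean, scale=sd))

print(likelihood(96.0, observed))   # likelihood if the true mean is 96%LC
print(likelihood(92.0, observed))   # likelihood if the true mean is 92%LC

# Deduction: simulate one predicted set of 10 future batch means under H0.
rng = np.random.default_rng(1)
simulated = rng.normal(loc=96.0, scale=assumed_sd, size=10)
print(simulated.round(1))
```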
Obtain Sample Estimate Of Mean
The team obtains potency measurements for 10
batches made using the process by testing composite
samples. This is the experimental part of the decision-making process. The measured batch potencies
constitute raw data. For inferences a summary of
the data will be sufficient. In the present case, the
observed mean of 93%LC and standard deviation of
5%LC were obtained. The team noted that 93%LC is
below 96%LC. But is it far enough below 96%LC to
reject H0?
Make Inductive Inferences
From Figure 1 we see that induction is the opposite
of deduction: On the basis of the data, we evaluate
which hypothesis (H0 or Ha) is most likely. When
we reason inductively, we reflect back from the
observations to some underlying truth about
nature, in this case the true process mean potency.
Unlike deduction, even if our data are valid, there
is no guarantee that our conclusions about nature
are correct. What we hope to do is make optimal
scientific decisions and/or acquire evidence in
favor of one or more of the hypotheses we have
considered, with some appreciation for the decision
risks.
Given the larger than expected standard
deviation estimate (5 instead of the expected
2%LC), could the value of 93%LC have been due to
random variation? What inductive inference can the
team make about the true process mean potency?
To help the team with this decision we must
review some background on methods of inductive
inference.
THREE SYSTEMS OF INDUCTIVE
INFERENCE
In the following sections we describe the three
systems of inductive inference most commonly
employed today. Each description opens with
a brief historical perspective followed by an
application of the methodology to our Example 1.
Fisherian Induction
The term “Fisherian” seems appropriate because it
was R. A. Fisher who described the approach with
the greatest clarity and laid its statistical foundations.
Before 1900, inductive inference from data was
informal. The discipline of statistics, as we know it
today, was in its infancy. While many probability
distributions and models were known, workers
typically summarized their data using tables and
graphs and made visual comparisons with theoretical
predictions. In 1900 the British mathematician Karl
Pearson described what is now called the “Chi-square
test” (9). This innovation was followed in 1908 with a
“t-test” for means by an Irish brewer, William Gosset,
better known to us as “Student” (10), and in 1925 with
“ANOVA” and an associated “F-test” for comparing
groups of means by the English geneticist and
mathematician Ronald Fisher (11). The F-distribution
was so named in his honor (12).
The following are definitions of some terms that are
central to the Fisherian approach:
• Statistic: A summary value (such as the mean or
standard deviation) that is calculated from data.
A statistic is often used because it provides a good
estimate of a parameter of interest
• Sampling distribution: The distribution of data
or some decision statistic calculated from data
• t-statistic: The decision statistic used in Student's
t-test consisting of the ratio of a difference between
an observed and hypothesized mean divided by
its estimated standard error
• F-statistic: The decision statistic used in Fisher's
analysis of variance hypothesis test consisting of
the ratio of two independent observed variances
calculated from normally distributed data
• Analysis of variance (ANOVA): A hypothesis
test that uses the F-statistic to detect differences
among the true means of data from two or more
groups.
The Chi-square, t-test, and ANOVA hypothesis tests (and many others developed subsequently) rely on the following ideas:
• Reducing raw data to some decision statistic whose sampling distribution is known from the likelihood, Equation 1
• Using a "P-value" for deciding whether an observed set of data deviates from a hypothesized probability distribution more than would be expected from random error. The P-value is defined as the probability of obtaining a result at least as extreme as the one that was actually observed, given that the null hypothesis is true. The fact that P-values are based on this assumption is crucial to their correct interpretation.
Fisher offered his opinion about the proper P-value
criterion for decision making (11, page 80):
“We shall not often be astray if we draw a
conventional line at 0.05.”
Over 80 years later his “conventional line” is widely
used as a criterion for approval of new pharmaceutical
products.
In our tablet-manufacturing example, Fisherian
induction would have us summarize our data as a
t-statistic, which can be done by using Excel syntax as
follows:
t-statistic = sqrt(Sample Size)*(Observed Mean –
H0)/(Standard Deviation)
= sqrt(10)*(93 - 96)/5
= -1.8974
[Equation 2]
If the team were to repeat their experiment using 10
different (independent) batches, they would of course
not get the same t-statistic because of sampling and
measurement variation. Conceptually, if they repeated
the experiment many times, and if the true value of
the process mean was equal to 96%LC, the sampling
distribution of the t-statistics would be the probability
distribution given in Figure 2. This distribution is known
as the Student’s t-distribution.
Notice that the observed t-statistic (-1.8974) is
relatively far to the left side of the t-distribution. This
is because the observed mean of 93%LC is somewhat
below the hypothetical limit of 96%LC. Does this
mean that the team should reject H0? Fisherian
induction suggests that they should reject H0 (in favor
of Ha) if it is unlikely to obtain such a t-statistic or
one even more extreme by random chance alone. In
this case even more extreme would include all those
values equal to or less than -1.8974. The probability
of observing such extreme values by chance alone is
equal to the area under the distribution curve to the left of -1.8974. This probability is known as the P-value and we can obtain it easily using the Excel cumulative distribution function as follows:
P-value = TDIST(-observed t-statistic, sample size - 1, 1)
= TDIST(1.8974, 9, 1)
= 0.045.
Figure 2: Fisherian inductive inference model. (The figure shows the sampling distribution of the t-statistic under H0, the observed value of -1.8974, and the P-value of 0.045 as the area in the tail beyond it.)
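The same t-statistic and one-sided P-value can be reproduced outside Excel; the following sketch (an addition for illustration, using Python's scipy rather than the TDIST function) works from the summary statistics of the example:

```python
# Minimal sketch of Equation 2 and the one-sided (lower-tail) P-value.
import numpy as np
from scipy import stats

n, observed_mean, sd, h0_mean = 10, 93.0, 5.0, 96.0
t_stat = np.sqrt(n) * (observed_mean - h0_mean) / sd   # = -1.8974
p_value = stats.t.cdf(t_stat, df=n - 1)                # lower-tail area, about 0.045
print(t_stat, p_value)
```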
Thus, such extreme (or even more extreme) values of
the t-statistic would occur by chance alone on average
less than 1 time in 20 repeats of this experiment,
or less than 5% of the time. If we use Fisher’s
“conventional line,” the team should reject H0 and
conclude that the true process mean is below 96%LC
and thus unacceptable. The P-value is also called
the “significance level” of the hypothesis test and
when it is low (say < 0.05) it is considered a measure
of evidence against the null hypothesis, defined as
follows:
• Measure of evidence: In scientific studies,
hypothesis testing is used to build evidence for
or against various hypotheses. The P-value and
Bayes factor (see below) are examples of measures
of evidence in Fisherian and Bayesian induction,
respectively.
While it offers an objective criterion, the P-value
may not be an optimal measure of evidence of the
validity (or not) of the null hypothesis (13), and must
be interpreted with care (14). The following cautions
apply to the Fisherian perspective:
t"TXJUIBMMTUBUJTUJDBMQSPDFEVSFTUIFNFUIPEPMPHZ
is only applicable as long as all assumptions
of the likelihood model (the data generation
model) apply. For instance, in our example we
assume normality and independence of the
measurements.
• Notice in Equation 2 that the magnitude of the t-statistic depends directly on the square root
of the sample size. If our team had tested 1,000
instead of only 10 batches, it is likely that the
t-statistic would have been “significant” even if
the observed mean had been only slightly below
96%LC. Thus it is always wise to supplement a
hypothesis test with a confidence interval for
the mean potency (see references 3 and 4 for a
discussion of confidence intervals).
• By its very nature, the P-value index includes not
only the observed t-statistic value, -1.8974, but
also all those t-statistic values of lesser value that
were not observed. In a sense, we are rejecting
H0, because it has failed to predict low t-statistic
values that were not in fact observed. The P-value
is similar to the index often used to classify the
relative performance of students: the observed
t-statistic is “in the lower 5% of its class.” Within
that category there are many poorer performers
with whom we are not concerned, but the P-value
disregards this information.
• The "conventional line" decision point for a
P-value of 0.05 may not be appropriate in all cases.
The P-value per se does not take into account the
consequences of making an incorrect judgment
concerning H0. Further, it is difficult to integrate
the P-value index with measures of decision error
consequences.
• Most importantly, the P-value is neither of the
following:
a. The probability that the initial experimental
result will repeat, or
b. The probability that H0 is true.
To obtain these probabilities we need to use one of
the other two systems of induction.
Neyman-Pearson Induction
Between 1927 and 1933, two of Fisher’s
contemporaries extended his groundbreaking ideas
about hypothesis testing. Egon Pearson was the son of
Karl Pearson (the developer of the Chi-square test) and
a colleague of Fisher in London. Jerzy Neyman was a
Polish mathematician in Warsaw who, among other
things, developed the idea of confidence intervals.
Their collaboration (15) led to a more general concept
of inductive inference. They developed an extremely
useful theory of optimal testing. Some key aspects of
their ideas are shown in Figure 3.
As with Fisherian induction, Neyman-Pearson
induction recognizes a decision statistic having a
known sampling distribution. The range of values of
the decision statistic regarded as unlikely (if H0 is true)
is called the rejection region. This region is in the tail
or tails of the sampling distribution and corresponds to
some fixed tail probability (e.g., 0.05) called the Type I
error. Type I error is defined as follows:
• Type I error: A decision error that results in falsely rejecting the null hypothesis when in fact it is true.
They envisioned decision makers using this decision statistic for all future hypothesis tests. In this way, all future hypothesis tests would have a fixed probability (e.g., 0.05) of incorrectly rejecting H0. Different Type I errors could of course be chosen for different situations, but the important point was to ensure that the rate of incorrectly rejecting H0 would be understood for each decision.
For instance, in the current example, our team may agree that a Type I error rate of 0.05 (i.e., one error in 20 hypothesis tests) is appropriate in their situation. Using the TINV Excel function, this error rate corresponds to the following fixed t-statistic:
Fixed t-statistic = -TINV(2*(Type I error rate), Sample size - 1)
= -TINV(2*0.05, 10-1)
= -1.8331.
According to the Neyman-Pearson scheme, any observed t-statistic less than -1.8331 would result in a rejection of H0. For our tablet manufacturing example we would reject H0 because -1.8974 < -1.8331. We should note here that the calculation of the t-statistic in the equation would be modified if larger (rather than smaller) potencies, or if both larger and smaller potencies, were considered unacceptable.
In comparing the Neyman-Pearson paradigm (Figure 3) with that of Fisher (Figure 2), notice that an additional distribution (Ha) is added. Neyman and Pearson recognized the need to consider the sampling distribution of the statistic (the t-statistic in the present example) both for a specific H0 (such as when the true mean potency equals 96%LC) and for a specific Ha (representing some arbitrary true mean potency that might be considered unacceptable). The probability of incorrectly accepting H0 when in fact Ha is true is called a Type II error, defined as follows:
• Type II error: A decision error that results in failing to reject the null hypothesis when in fact it is false.
Of course this Type II error will depend on the specific Ha being considered. One can make a plot of Type II error as a function of the value of the parameter (e.g., the true mean potency) associated with Ha. Such a plot is called a "Power Curve" or an "Operating Characteristic" curve for the hypothesis test, defined as follows:
• Power (or operating characteristic) curve: Power is equal to 1 minus the Type II error rate. The power curve of a hypothesis test is a plot of the Power versus the true value of the underlying parameter of interest.
Figure 3: Neyman-Pearson inductive inference model. (The figure shows the sampling distributions of the t-statistic under H0 and Ha, the decision point, and the Type I and Type II error areas.)
The calculation of such power curves is an
important part of experimental planning; however, it
involves the use of probability distribution functions
(such as the non-central t distribution) that are not
available in Excel.
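For readers who want to see such a calculation, the sketch below (an addition, assuming a one-sample, lower-tailed test and taking the team's planning value of 2%LC as the true standard deviation) obtains the fixed decision point and the power at one specific alternative using the non-central t distribution in Python's scipy, which provides the functions noted above as unavailable in Excel:

```python
# Minimal sketch: Neyman-Pearson decision point and power at a specific Ha.
import numpy as np
from scipy import stats

n, alpha, sigma = 10, 0.05, 2.0
h0_mean, ha_mean = 96.0, 92.0                      # null value and one alternative

t_crit = stats.t.ppf(alpha, df=n - 1)              # about -1.8331, the decision point
ncp = (ha_mean - h0_mean) / (sigma / np.sqrt(n))   # non-centrality parameter under Ha
power = stats.nct.cdf(t_crit, df=n - 1, nc=ncp)    # P(reject H0 | this Ha is true)
print(t_crit, power)                               # Type II error rate = 1 - power
```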
One can picture these decision risks as a 2x2 table
such as in Figure 4. In adopting a Neyman-Pearson
paradigm, the team has moved from the objective
of developing evidence with respect to their specific
experiment, to the objective of using a methodology
with assured decision risks. The Neyman-Pearson
approach does not concern itself with whether
or not the true mean process potency is below or
above 96%LC. It only assures the team that the
many decisions they will make over their careers to
reject H0s will be incorrect only 1 time in 20 (i.e., a
probability of 0.05). Examples of practical situations
where this point of view is appropriate are listed as
follows:
• Control charting. For monitoring critical
quality measures of a process or method, it may
be useful to know the probability of incorrectly
identifying an out of control situation (Type I
error) or of failing to detect a condition that is
unacceptable (Type II error).
• Diagnostics screening. A diagnostic test is
analogous to a hypothesis test. The sensitivity
(1 - Type II error rate) and specificity (1 - Type I
error rate) of a diagnostic test are key measures that
determine the medical value of the reported test
result when the prevalence of disease in the tested
population is known.
• Validation and product acceptance
testing. For judging production costs and
allocating resources, it may be desirable to fix the
manufacturer’s risk (Type I error, risk of incorrectly
failing an acceptable batch) or consumer's risk (Type II error, risk of incorrectly passing a batch of some defined level of unacceptability).
Figure 4: Neyman-Pearson hypothesis testing error types.
                              True state of nature*
Conclusion from experiment    H0                Ha
H0                            no error          Type II error
Ha                            Type I error      no error
*Choosing the wrong H0 or Ha to study is sometimes called a Type III error.
• New drug application regulatory
acceptance. For maintaining standards of risk, a
regulatory agency may find it desirable to require
all studies of a given type (e.g., clinical trials, shelf-life estimation, bio-equivalence tests) to maintain
Type I error benchmarks. Requirements with
respect to Type II error can assure that sample sizes
are adequate (e.g., for safety studies).
The Neyman-Pearson approach permeates much of
today’s scientific decision making. It represents a high
watermark in terms of objectivity and consistency in
inductive inference. When we use hypothesis testing
methodology whose Type I and II error risks are
known, we say that we are using a “calibrated” method,
and this can have many advantages. A calibrated
hypothesis test is defined as follows:
• Calibrated hypothesis test: A hypothesis test
method whose Type I error, on repeated use, is
known from theory or computer simulation.
However, the Neyman-Pearson approach may not
be the appropriate paradigm for all situations. Some
important considerations are listed as follows:
• While the approach does fix decision error rates
over a series of experiments, it does not by itself
provide a measure of evidence concerning H0 or
Ha in any specific experiment.
• In most actual studies there will be multiple
hypotheses tests, which may or may not be
independent. We refer to groups of hypothesis
tests that are associated with the same decision as
a “family.” Thus we must consider both the familywise error rates as well as the individual test rates
obtained from Neyman-Pearson methodology.
These family-wise rates suffer from a condition
called “multiplicity” in that they can be difficult to
predict.
• The Neyman-Pearson Type I error rate must not be confused with the Fisherian P-value. The Type I error rate is the rate of falsely rejecting H0 over
many experiments. The P-value is a measure of
evidence (albeit imperfect) against H0.
• Rigid adherence to a Type I error rate leads
to conceptual problems. Type I error rates of
0.0499 and 0.0501 are very close in any practical
situation, yet they could lead to very different
decisions.
• If the decision statistic does not fall in the
rejection zone, the Neyman-Pearson formulation
recommends that H0 be “accepted.” However,
from a scientific point of view it is more
appropriate to “fail to reject” H0. A larger
experiment would have a larger rejection zone that
might include the observed result.
t"TXJUIUIF'JTIFSJBOTZTUFNUIF/FZNBO
Pearson system relies solely on the likelihood
(probabilistic model of data generation) for
both deductive and inductive inferences.
However, developing and building evidence for
mechanistic, predictive models often requires
strong theory and experience. Finding and using
such models as a part of risk management control
strategies is the key to regulatory initiatives such
as quality by design (QbD) (16). To incorporate
such prior knowledge in a quantitative way, we
must use Bayesian induction that grew out of an
earlier age.
Bayesian Induction
In 1739, the Scottish empiricist philosopher David
Hume posed the following problem in inductive
inference (17):
“ ‘tis only probable that the sun will rise tomorrow
… we have no further assurance of this fact than what
experience affords us.”
Knowing the underlying probabilities of events
was critical to the active 18th-century insurance and
finance industries whose profits depended on accurate
inductive inferences from available experience and
theory. It was also a problem of some interest to the
liberal theologians of that time. In 1763, the problem
was addressed quantitatively for the first time by two
non-conformist ministers, Thomas Bayes and Richard
Price (18). Their solution, Bayes’ rule, is to probability
theory what the Pythagorean theorem is to geometry.
The following definitions apply:
• Bayes' rule: A process for combining the
information in a data set with relevant prior
information (e.g., theory, past data, expert
opinion, and knowledge) to obtain posterior
information. Prior and posterior information
are expressed in the form of prior and posterior
probability distributions, respectively, of the
underlying physical parameters, or of predictive
posterior distributions of future data.
• Prior distribution: A subjective distributional
estimate of a random variable, obtained prior to
any data collection, which consists of a probability
distribution.
• Posterior distribution: A distributional
estimate of a random variable that updates
information from a prior distribution with new
information from data using Bayes’ rule.
• Bayesian induction: A process for inductive
inference in which the P-value is replaced with
the posterior probability that the null hypothesis
is true. In Bayesian induction, the respective
prior distributions and data models (likelihoods)
constitute the null and alternative hypotheses. In
addition, one must specify the prior probability (or
odds) that the null hypothesis is true.
Two centuries later, Harold Jeffreys, a British
astronomer, greatly extended the utility of Bayes’
rule (19). In terms of hypothesis testing, it may be
summarized as follows:
Probability that the hypothesis is true, given observed data
= K*(Probability of the observed data if the hypothesis is true)
x (Prior probability that the hypothesis is true),
[Equation 3]
and by reference to Equation 1 we see that
Probability that the hypothesis is true, given observed data
= K*(Likelihood) x (Prior probability that the hypothesis is true).
[Equation 4]
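As an illustration of how K works (a sketch added here with two point hypotheses and hypothetical data, not the composite hypotheses of Example 1), the posterior probabilities of Equation 4 can be computed by normalizing likelihood times prior over the hypotheses considered:

```python
# Minimal sketch of Equation 4: posterior = K x likelihood x prior.
import numpy as np
from scipy import stats

data = np.array([95.1, 92.4, 97.0])     # hypothetical batch means, %LC
sd = 2.0                                # standard deviation assumed known

means = {"H0: mean = 96": 96.0, "Ha: mean = 92": 92.0}
prior = {"H0: mean = 96": 0.5, "Ha: mean = 92": 0.5}

unnormalized = {h: np.prod(stats.norm.pdf(data, m, sd)) * prior[h]
                for h, m in means.items()}
K = 1.0 / sum(unnormalized.values())    # normalizing constant in Equation 4
posterior = {h: K * u for h, u in unnormalized.items()}
print(posterior)
```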
Thus we see that evidence for (or against) a given
hypothesis may be obtained directly from probability
theory, as long as we can supply the following:
• The likelihood, which is available for most
practical problems. It is the same likelihood
required for Fisherian and Neyman-Pearson
approaches. However, in the Bayesian approach we
must also have the following:
• The value K in Equation 4, which sometimes
requires computing technology and numerical
methods that have only recently become available.
In many common cases, however, such as those
we consider here, K can be easily evaluated in
Excel.
• The prior probability of the hypothesis. This prior
probability should be based on existing theory and
expert knowledge. It can be problematic because
experts will differ in their prior beliefs. On the other
hand, this Bayesian approach provides a quantitative
tool to gauge the effects of different prior opinions
on a final conclusion. If prior knowledge is lacking,
it is logical to assign equal probability to each of the
hypotheses under consideration (e.g., 0.5 to H0 and
0.5 to Ha).
For our tablet-manufacturing example, it is
straightforward to apply a Bayesian approach. We have
shown previously how to obtain the prior and posterior
distributions of a normal mean (see reference 3, Table
IV and reference 20, pp. 78-80). As illustrated in Figure
5, the prior (or posterior) probabilities of H0 and Ha
are simply the areas under the prior (or posterior)
distribution of the mean over the respective ranges of the
mean (in the present example, below and above 96%LC
for Ha and H0, respectively).
Let’s calculate the probability of truth of H0 and Ha
before and after the data are examined by the team. This
particular example is illustrated in Figure 6.
Before examining data: All knowledge about the
true process mean comes from the prior distribution. The
team used a noninformative prior for both the mean and
standard deviation. In Figure 6, the prior distribution for
the mean is essentially flat and indistinguishable from
the horizontal axis. This prior distribution was so broad
that about half the probability density (i.e., area under
the prior distribution of the mean) lies below 96%LC
and half above. Thus the prior probabilities of H0 and
Ha were each very close to 0.5. We can use short-hand to
state this:
PriorProbH0 = 0.5 and PriorProbHa = 0.5.
While the team actually felt that H0 was more likely,
they used this noninformative prior to provide a more
objective test.
After examining data: All knowledge about
the true process mean comes from the posterior
distribution. Notice in Figure 6 that the area under the
distribution to the right of 96%LC (i.e., H0 range) is
0.045 while that to the left of 96%LC (i.e., Ha range) is
0.955, so that
PostProbH0 = 0.045 and PostProbHa = 0.955.
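These posterior probabilities can be reproduced with a short calculation (a sketch added here, assuming the standard noninformative prior, for which the posterior of the true mean is a Student-t distribution with location 93%LC, scale s/sqrt(n), and n-1 degrees of freedom, consistent with the 0.045 and 0.955 quoted above):

```python
# Minimal sketch: posterior probabilities of H0 and Ha for the one-sided test.
import numpy as np
from scipy import stats

n, xbar, s = 10, 93.0, 5.0
posterior = stats.t(df=n - 1, loc=xbar, scale=s / np.sqrt(n))  # posterior of the true mean
post_prob_h0 = posterior.sf(96.0)    # P(true mean >= 96%LC), about 0.045
post_prob_ha = posterior.cdf(96.0)   # P(true mean <  96%LC), about 0.955
print(post_prob_h0, post_prob_ha)
```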
Thus the team can be 95.5% confident (in a true
probabilistic sense) that the null hypothesis, H0, is
false. It is also useful to consider such probabilities in
terms of odds as illustrated in Figure 7.
Odds is defined as follows:
• Odds: The ratio of success to failure in probability
calculations. In the case of hypothesis testing
where only H0 or Ha are possible (but not both), if
the probability of truth of H0 is ProbH0, then the
odds of H0 equals ProbH0/(1-ProbH0).
Figure 5: Bayesian induction inference model. (The prior or posterior distribution of the parameter is split at the hypothesized value; the area under the curve over the Ha range is the probability that Ha is true, and the area over the H0 range is the probability that H0 is true.)
Figure 6: Visualizing a one-sided hypothesis test for a normal mean using its posterior distribution. (The essentially flat prior and the posterior distribution of the mean are plotted over 86 to 100%LC; the area under the posterior to the left of 96%LC, the probability that Ha is true, is 0.955, and the area to the right, the probability that H0 is true, is 0.045.)
Before data are examined, the prior odds of H0 are
defined as
Prior Odds of H0 = PriorProbH0/PriorProbHa =
0.5/0.5 = 1/1.
So the prior odds that H0 is true are “1 to 1”. After
information from data has been incorporated, the
posterior odds of H0 are
Posterior Odds of H0 = PostProbH0/PostProbHa =
0.045/0.955 = 45/955.
So the posterior odds that H0 is true are “45 to 955.”
This reduction in the odds of H0 from 1/1 to 45/955
is due to the evidence about H0 contributed by the
data. As shown in Figure 7, we can form a measure of
evidence by taking the odds ratio as follows:
B = (Posterior Odds of H0)/(Prior Odds of H0) = (45/955)/(1/1) = 45/955 ≈ 0.047
B is referred to as the Bayes factor (21, 22). Bayes factor is defined as follows:
• Bayes factor (B): In Bayesian induction, the Bayes factor is the ratio of the posterior odds that H0 is true to its prior odds. A B value of 1/10 means that Ha is supported by the data 10 times as much as H0. Because the Bayes factor is normalized by the prior odds, it is a measure of evidence that primarily reflects the observed data.
The Bayes factor is a measure of evidence supplied by data in a hypothesis testing situation. Like the P-value, various decision levels have been proposed (see reference 19, page 432).
Notice something important here: The Bayesian
PostProbH0 and the Fisherian P-value for our
manufacturing example are both equal to 0.045.
This seems remarkable considering that these indices
are not measuring the same thing. The P-value is the
probability of observing data at least as extreme as
was observed, while the PostProbH0 is the probability
that H0 is true. However, it can be shown that in many
common situations (e.g., one-sided hypothesis tests
involving normally distributed data) the P-value will
be equal to the PostProbH0 when an appropriate noninformative prior is used to calculate PostProbH0 (see
examples in references 20 and 22).
PROBLEMS WITH THE TWO-SIDED
HYPOTHESIS TEST FOR EQUALITY
In Example 1, the hypothesis tested by our team is
one sided because the ranges for the mean for Ha and
H0 were each completely on one side or the other of
the hypothesized value, 96%LC. A one-sided test is
defined as follows:
• One-sided test: A null hypothesis stated in such
a way that observed values of the decision statistic
on one side (either large or small but not both)
constitutes evidence against it.
Our example uses composite hypotheses because
Ha and H0 both consist of ranges rather than single
points. It is also common to consider a two-sided
situation in which the null hypothesis, H0, consists of
a single point. Say for instance our team obtained the
following data from their testing of n=10 batches:
Mean of 10 measured batch potencies = 95%LC, and
Sample standard deviation = 5%LC.
They might have considered testing a point-null
hypothesis such as
H0: true mean = 100%LC
against an alternative hypothesis such as
Ha: true mean is not = 100%LC.
A Fisherian test of this point-null H0 is easily
executed in Excel as follows:
t-value = SQRT(10)*ABS(95-100)/5 = 3.16
P-value = TDIST(3.16,10-1,2) = 0.012,
which would lead to rejection of H0 if the conventional line of 0.05 is employed.
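The same two-sided calculation can be sketched outside Excel (an addition for illustration, using scipy in place of TDIST):

```python
# Minimal sketch of the two-sided point-null test.
import numpy as np
from scipy import stats

n, xbar, s, h0_mean = 10, 95.0, 5.0, 100.0
t_stat = np.sqrt(n) * abs(xbar - h0_mean) / s     # = 3.16
p_two_sided = 2 * stats.t.sf(t_stat, df=n - 1)    # about 0.012
print(t_stat, p_two_sided)
```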
Testing of point-null hypotheses such as this is
very common. Unfortunately, as shown below, this is
almost never a realistic or meaningful test.
Most would agree that there is little practical
difference between a process mean potency of
99.999%LC and 100.001%LC. Yet it can be shown that
if the sample size is large enough, H0 will be rejected
with high probability, even if the true deviation from
100.000%LC is only 0.001%LC. This sensitivity to
sample size is well known to statisticians because
decision statistics, such as the mean, become very
precise (small standard errors)—but not necessarily
more accurate—when sample size is increased. An
essentially correct hypothesis can be rejected when
the summary statistics are too precise. This type of
counter-intuitive behavior in hypothesis testing is
often due to an incorrect statement of the problem
known as a Type III error (see Figure 4 footnote). Type
III error is defined as follows:
• Type III error: A decision error that results
in choosing the incorrect null or alternative
hypothesis for use in a hypothesis test.
Rather than consider a point H0, it may be more
appropriate to specify a small interval for H0. Type III
errors are common in hypothesis tests for normality. In
very large samples, normality is almost always rejected
despite the fact that a histogram agrees visually with
the fitted normal curve. Sometimes the rejection is
caused by minor imperfections in the data, such as
rounding, that are not material to the objectives of the
hypothesis test.
One advantage of the Bayesian approach is that
it forces one to think carefully about the correct
formulation of a hypothesis testing problem. From
the Bayesian hypothesis testing perspective, the prior
distributions for H0 and Ha are the hypotheses being
tested. An appropriate Bayesian approach to test this
point-null H0 is given by Schervish (23, example
4.22 pp 224-5). Under the Bayesian paradigm, we
require prior distributions for the mean and standard
deviation for both H0 and Ha. This is because, in
general, neither the true mean nor the standard deviation
parameters are known. Instead of a single parameter
we have two parameters to consider and these will
have joint prior and posterior probability distributions.
Joint probability distribution is defined as follows:
• Joint probability distribution: A probability distribution in which the probability density or mass depends on the values of two (bivariate) or more (multivariate) parameters simultaneously. A bivariate probability distribution can be visualized as a surface mesh or contour plot.
Figure 7: The Bayes factor (B) as a measure of evidence for H0. (The figure shows the prior and posterior distributions of the mean; ProbH0 and ProbHa are the areas under each curve over the H0 and Ha ranges, and B is the posterior odds ProbH0/ProbHa divided by the prior odds ProbH0/ProbHa.)
The H0 prior for mean and standard deviation is
shown in Figure 8. The prior for the mean (top panel)
is a single spike at 100%LC, consistent with the point
H0. The prior distribution for the standard deviation
is a mildly informative IRG distribution with C=1 and
G=50 (3, Table IV) and is shown on the lower panel of
Figure 8.
The Ha prior for mean and standard deviation is
shown in Figure 9. Because under Ha, the mean is
permitted to vary, the joint distribution of mean and
variance is displayed as a surface mesh plot. This is a
joint LSSt-IRG distribution with R=1, T=100, U=7.07,
C=0.5, and G=50 (3, Table IV).
This same Ha prior is displayed in Figure 10 as
a contour plot. From this it is easier to see the two-dimensional shape and range that is identified as the alternative hypothesis.
Unlike the one-sided case, the point-null situation
requires that we also specify our prior beliefs about
the truth of H0 and Ha. If the team had no prior
preference for either H0 or Ha, they would assign equal
prior probabilities to each. We can express that using
shorthand notation as
PriorProbH0 = 0.5 and PriorProbHa = 0.5.
The calculation of the Bayes factor for this test can
easily be done in Excel (23, equation 4.23) and a value
of
B = 0.3495
is obtained.
Figure 8: Point-null hypothesis visualized as probability distributions for the mean and standard deviation. (The prior for the mean, top panel, is a single spike of probability mass at 100%LC; the prior for the standard deviation, lower panel, is a mildly informative density over roughly 2 to 22%LC.)
Figure 9: Alternative hypothesis visualized as a surface mesh plot of the joint probability for the mean and standard deviation. (The joint prior density is plotted over means of 80 to 120%LC and standard deviations of roughly 2 to 18%LC.)
Given this information, we can invert
the equation for B given in Figure 7 to solve for the
posterior probability of H0. Noting that PostProbHa =
1 – PostProbH0, we find that
PostProbH0 =0.259.
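The inversion itself is a one-line calculation; the sketch below (an addition, written in Python for illustration) shows the arithmetic:

```python
# Minimal sketch: recover PostProbH0 from the Bayes factor and the prior odds.
B = 0.3495                              # Bayes factor from the point-null example
prior_odds_h0 = 0.5 / 0.5               # equal prior probabilities for H0 and Ha
posterior_odds_h0 = B * prior_odds_h0
post_prob_h0 = posterior_odds_h0 / (1 + posterior_odds_h0)   # about 0.259
print(post_prob_h0)
```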
So the team would conclude that there is a 25.9%
probability that H0 is true and likely would not reject
it. This is a troubling result because the Fisherian
point-null test above yielded a P-value of 0.012, clearly
suggesting rejection of H0. However, as noted above,
the P-value groups the actual observation obtained
with much more extreme observations that were not
actually obtained. In the case of point-null hypothesis
testing, Bayesian and Fisherian conclusions rarely
agree (24, see pp 151, table 4.2). When two sound
methodologies lead to different conclusions we must
wonder whether we have misidentified the problem
(i.e., our friend, the Type III error again?). It is best to
consider carefully what we mean by “not equal.” We do
this in the following section.
When applicable, the Bayesian approach to
hypothesis testing gives answers in terms of
probability statements about the parameters and the
hypotheses themselves, which are impossible with the
Fisherian and Neyman-Pearson approaches. This is
very advantageous for risk analysis. Bayesian inductive
inference is also useful in data-mining applications. As
an example, the US Food and Drug Administration’s
Center for Drug Evaluation and Research now employs
a Bayesian screening algorithm as part of their internal
drug safety surveillance program (25). However, the
Bayesian approach can be more demanding for the
following reasons:
• While many Bayesian problems can be solved in
Excel, more complicated situations may require
advanced computing packages such as WinBugs
(26). Calculation of the K in Equation 4 or of the
Bayes factor can sometimes be challenging (27).
• Bayesian approaches require specification of
prior distributions and prior probabilities of the
hypotheses under study. While noninformative
priors may be used for objectivity, this may result
in a loss of information as well as lost opportunity
to debate and thereby take advantage of the prior
knowledge of different members of a project team.
It is always critical to understand any effect that
the prior may have on conclusions.
• When used for confirmatory or demonstration
studies such as clinical trials, validations, quality
control, or data mining, Bayesian hypothesis
testing methods must be calibrated (usually by
computer simulation) so that the Type I and II
error risks are known.
A good, readable, basic introductory textbook for
readers interested in the theory and methodology of
Bayesian hypothesis testing is Bolstad (28).
THE TOST HYPOTHESIS TEST FOR
EQUIVALENCY
Sometimes, as with method transfers or process
changes in existing products or validation of new
products, the objective may be to establish evidence
for equivalency, rather than equality, with the usual
burden of proof reversed. It is best to define a range
for the mean difference, say L to H, indicative of
equivalence. Then the null hypothesis is
H0: true mean difference < L or true mean
difference > H
against the alternative hypothesis
Ha: L < true mean difference < H.
Figure 10: Alternative hypothesis visualized as a contour plot of the prior joint probability distribution for the mean and standard deviation. (Contours of the joint prior density are plotted over means of 80 to 120%LC and standard deviations of roughly 2 to 22%LC.)
In this way we treat nonequivalence as the default
state of nature and require the data to provide
evidence that allows us to reject H0. A “two one-sided
testing” (TOST) procedure (29) is based on requiring
a confidence or credible interval to be completely
contained within the range of Ha in order to reject H0.
TOST is defined as follows:
• Two one-sided hypothesis test (TOST): A
hypothesis test for equivalency that consists of
two one-sided tests conducted at the high and
low range of equivalence, each of which must be
rejected at the Type I error rate in order to reject the
null hypothesis of non-equivalence.
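A minimal sketch of such a TOST for a mean difference (an addition using hypothetical summary statistics and an assumed equivalence range; Python's scipy supplies the t distribution) is shown below; H0 is rejected, and equivalence concluded, only if both one-sided tests reject at the chosen Type I error rate:

```python
# Minimal sketch of a two one-sided test (TOST) for equivalence of a mean difference.
import numpy as np
from scipy import stats

n, diff, s = 12, 0.8, 2.0              # hypothetical mean difference and SD of differences
L, H, alpha = -3.0, 3.0, 0.05          # assumed equivalence range and Type I error rate
se = s / np.sqrt(n)

p_lower = stats.t.sf((diff - L) / se, df=n - 1)   # tests H0: true difference <= L
p_upper = stats.t.cdf((diff - H) / se, df=n - 1)  # tests H0: true difference >= H
equivalent = (p_lower < alpha) and (p_upper < alpha)
print(p_lower, p_upper, equivalent)
```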
EXAMPLE 2: A TEST THAT THE MEANS OF
THREE GROUPS ARE EQUAL
The ANOVA procedure developed by Fisher is a
procedure for testing the null hypothesis of equality for
group means. Fisher’s ANOVA calculation procedure is
illustrated elsewhere in this issue with an example (30)
in which there are three groups (A, B, and C). We will
illustrate the Bayesian approach to ANOVA here.
Zelen (31, pp 306-9) shows how to perform multiple
regressions from a Bayesian perspective. His procedure
can easily be adapted to the simple ANOVA case and a
Bayes factor for the ANOVA can be calculated in Excel.
However, as with the point-null hypothesis, we will
find that the conventional P-value tends to overstate
the case for rejection of H0. Again, it is wise to ask if
the ANOVA hypothesis test correctly frames the real
questions we want to ask. We will assume here that the
null hypothesis of equality
H0: meanA = meanB = meanC,
is appropriate. The corresponding alternative hypothesis is
Ha: one or more of the three means are not equal.
Figure 11: Visualizing the ANOVA null hypothesis (H0) relative to the posterior distribution of the group deltas. (Contours of constant F value for the joint posterior distribution of delta_A and delta_B are plotted over -0.5 to 0.5 on each axis; the posterior mode lies at delta_A = -0.22, delta_B = -0.26, and the H0 point, delta_A = delta_B = 0, falls outside the 95% credible ellipse.)
An ANOVA F-test is often used when the real
objective is to make comparisons among the group
means. When an appropriate noninformative prior
(3, 4) is used, the Bayesian approach to group means
comparison agrees exactly with the standard ANOVA
F-test (32, pp 123-43). But the Bayesian approach offers
a useful advantage.
The usual Fisherian paradigm regards the true
group means as fixed entities. So one can only make
comparisons with respect to the measured statistics,
using such things as confidence intervals. However,
the Bayesian paradigm regards the true group means
as having a joint posterior distribution. It is possible
to calculate and plot the joint posterior distribution
using Excel and to visualize the single point within this
distribution that corresponds to all the means being
equal (i.e., the H0). Such a graph for the example (30)
is shown in Figure 11, but it requires some explanation.
First, the axes for the plot are not the true group
means, but the deviations (deltas) of these true means
from their true grand mean, M, where
M = (mean_A + mean_B + mean_C)/3,
or
mean_C - M = - (mean_A - M) - (mean_B - M),
and denoting the differences of the true group
means from their true grand mean with a “delta,” we
have
delta_C = -delta_A - delta_B.
Because delta_C can always be obtained knowing
delta_A and delta_B, it is not an independent random
variable. Consequently, we only need to concern our
selves with two of the three deltas, say delta_A and
delta_B. With three groups, we can actually visualize
the joint distribution as a two-dimensional contour
plot.
Second, the point in Figure 11 corresponding to H0
(true means all equal) is the point delta_A = delta_B
= 0, because only under this condition can all three
group means be equal.
Third, the contour lines form ellipses about the
joint distribution mode (the point delta_A = -0.22
and delta_B = -0.26 indicated as a red dot in Figure
11). (Recall the mode is the point of a probability
distribution having the maximum probability density).
They do not correspond to a probability density as
would be true for an actual distribution such as that
in Figure 10. Instead the contour lines are critical F
values. In Excel, the probability that the true value
of delta_A, delta_B is beyond a given contour ellipse
corresponding to F is FDIST(F,3-1,15-3). By analogy
with a univariate distribution, the further any point is
from the distribution mode, the larger the F value and
the less likely that point is as a candidate for the true
delta_A, delta_B.
Finally the contour line in Figure 11 that equals
the critical F value of 3.885, obtained in Excel as
FINV(0.05,3-1,15-3) (30), forms a 95% credible ellipse.
This is a bivariate analogue of a univariate 95% credible
interval (3). Notice in Figure 11 that H0 corresponds
to an F value of 6.142 and is, therefore, beyond the
95% credible ellipse. As with the traditional ANOVA
calculation, we reject H0.
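The critical contour and the check of the H0 point can be reproduced outside Excel; the sketch below (an addition, using scipy's F distribution in place of FINV and FDIST) uses the degrees of freedom and F values quoted above:

```python
# Minimal sketch: 95% credible ellipse contour and the F value at the H0 point.
from scipy import stats

df1, df2 = 3 - 1, 15 - 3                     # three groups, 15 observations in total
f_crit = stats.f.isf(0.05, df1, df2)         # about 3.885, the 95% credible ellipse contour
f_at_h0 = 6.142                              # F value at delta_A = delta_B = 0 (from the article)
p_beyond = stats.f.sf(f_at_h0, df1, df2)     # probability beyond the contour through H0
print(f_crit, f_at_h0 > f_crit, p_beyond)
```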
By restating the hypotheses in terms of joint
posterior distributions that can be displayed visually,
the Bayesian perspective offers deeper insight into the
mechanics and interpretation of ANOVA.
SUMMARY
We have seen that scientific inference requires both deductive and inductive inferences. The three main approaches to inductive inference are: Fisherian, which
uses the P-value as a measure of evidence against the
null hypothesis; Neyman-Pearson, which controls the
long-run decision risk over repeated experiments; and,
Bayesian, which obtains direct probabilistic measures
of evidence from the posterior distribution.
Understanding of the P-value as the probability of
observing a result as extreme or more extreme than
that observed, assuming the null hypothesis is true,
is critical to its proper interpretation. The P-value
is based on unobserved results more extreme than
those observed, so it may overstate the evidence
against the null hypothesis. The P-value from a point-null hypothesis, or two-sided test of equality, is very
difficult to interpret. A confidence or credible interval
should be provided in addition to the P-value.
The Bayes factor, like the P-value, is a measure of
evidence in favor of the null hypothesis contained
in the data. It is based on an odds ratio and permits
direct probability statements about the hypothesis
tests being considered.
The Type II error rate is the probability of incorrectly
failing to reject the null hypothesis when in fact the
alternative hypothesis is true. The Type II error rate
is related to Power and allows the construction of
operating characteristic curves.
The analysis of variance is a Fisherian point-null
hypothesis test for the equality of means of two
or more groups. The Bayesian approach offers an
insightful reinterpretation of the analysis of variance
in terms of the joint posterior distribution of the group
means.
The three approaches to hypothesis testing represent
major advances in the way we use data to make
inductive inferences about the products, processes,
and test methods we develop in regulated industries.
These approaches, and others not discussed here, have
done much to improve our decision-making. They
help us build evidence about underlying mechanisms
and provide us a common language for objective
communication of the evidence supporting our
conclusions.
Still, humility and caution are in order. Despite
these impressive methodologies, no single, coherent,
generally-accepted approach to inductive inference
has yet emerged in our time. All experimenters
know that Nature does not divulge her secrets easily.
For the present, it is perhaps unwise to apply any
induction methodology by rote without first carefully
considering whether it is consistent with our objective
and problem.
ACKNOWLEDGMENT
The text before you would be poorer if not for the help
of others. I am most sincerely grateful to Paul Pluta for
ideas, encouragement, and expert feedback; to Diane
Wolden who tirelessly kept the reader in mind; and to
Susan Haigney for expertly laying the product to print.
GLOSSARY
Alternative hypothesis: A hypothesis considered as
an alternative to the null hypothesis, though
possibly more complicated or less likely.
ANOVA (analysis of variance): A hypothesis test that
uses the F-statistic described by Fisher to
detect differences among the true means of
data from two or more groups.
Bayes factor (B): In Bayesian induction, the Bayes factor is
the ratio of the posterior odds that H0 is true to
its prior odds. A B value of 1/10 means that Ha is supported 10 times as much as H0. Since the Bayes factor is normalized by the prior odds, it is a measure of evidence that primarily reflects
the observed data.
Bayes’ rule: A process for combining the information
in a data set with relevant prior information
(theory, past data, expert opinion, and
knowledge) to obtain posterior information.
Prior and posterior information are expressed
in the form of prior and posterior probability
distributions, respectively, of the underlying
physical parameters, or of predictive posterior
distributions of future data.
Bayesian induction: A process for inductive inference in
which the P-value is replaced with the posterior
probability that the null hypothesis is true.
In Bayesian induction, the respective prior
distributions and data models (likelihoods)
constitute the null and alternative hypotheses.
In addition, one must specify the prior
probability (or odds) that the null hypothesis is
true.
Calibrated hypothesis test: A hypothesis test method
whose Type I error, on repeated use, is known
from theory or computer simulation.
Composite hypothesis: A statement that gives a range
of possible values to a model parameter. For
example, ‘Ha: true mean > 0’ is a composite
hypothesis.
Confidence interval: A random interval estimate of a
(conceptually) fixed quantity, which is obtained
by an estimation method calibrated such that
the interval contains the fixed quantity with a
certain probability (the confidence level).
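As a hedged illustration (not part of the original text), the following Python lines compute a t-based 95% confidence interval for a true mean from a small, hypothetical data set.

import numpy as np
from scipy import stats

data = np.array([99.2, 100.1, 98.7, 101.3, 100.4, 99.8])   # hypothetical results
n = len(data)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)                          # estimated standard error

t_crit = stats.t.ppf(0.975, df=n - 1)                       # two-sided 95% critical value
print(f"95% CI: ({mean - t_crit * se:.2f}, {mean + t_crit * se:.2f})")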
Control chart: A time ordered plot of observed data
values or statistics that is used as part of a
process control program. Various hypothesis
tests are employed with control charts to detect
the presence of trends or unusual values.
Credible interval: An interval estimate of a random
variable, based on its probability distribution,
which contains its value with a certain
probability (the credible probability level).
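For comparison with the confidence interval entry above, this small Python sketch (an added illustration; the Beta(19, 5) posterior is assumed purely for the example) reads an equal-tailed 95% credible interval directly from a posterior distribution.

from scipy import stats

posterior = stats.beta(19, 5)                               # assumed posterior, for illustration
lower, upper = posterior.ppf(0.025), posterior.ppf(0.975)   # equal-tailed 95% interval
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")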
Data: Measured random variable values, assumed to
be generated by some hypothetical likelihood
model, which contain information about the
parameters of that model.
Deduction: The act of drawing a conclusion about some
hypothesis based entirely on careful definitions,
axioms, and logical reasoning.
F-statistic: The decision statistic used in Fisher’s analysis
of variance hypothesis test consisting of the
ratio of two independent observed variances
calculated from normally distributed data.
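The following Python sketch, added here only as an illustration, computes the one-way ANOVA F-statistic as a ratio of between-group to within-group variance estimates and checks it against scipy.stats.f_oneway; the three groups of data are hypothetical.

import numpy as np
from scipy import stats

groups = [np.array([9.8, 10.2, 10.0, 9.9]),      # hypothetical data, three groups
          np.array([10.5, 10.7, 10.4, 10.6]),
          np.array([9.7, 9.9, 10.1, 9.8])]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n_total - k)
F_manual = ms_between / ms_within

F_scipy, p_value = stats.f_oneway(*groups)       # the two F values agree
print(F_manual, F_scipy, p_value)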
Fisherian induction: A process for inductive inference, described most clearly by Ronald Fisher, that uses the P-value as a criterion for rejecting a hypothesis.
Hypothesis: A provisional statement about the value of
a model parameter or parameters whose truth
can be tested by experiment.
Induction: The act of drawing a conclusion about some
hypothesis based primarily on limited data.
Inference: The act of drawing a conclusion regarding
some hypothesis based on facts or data.
Joint probability distribution: A probability distribution
in which the probability density or mass
depends on the values of two (bivariate) or
more (multivariate) parameters simultaneously.
A bivariate probability distribution can be
visualized as a surface mesh or contour plot.
Likelihood model: A description of a data generating
process that includes parameters (and possibly
other variables) whose values determine the
distribution of data produced.
Measures of evidence: In scientific studies, hypothesis testing is used to build evidence for or against various hypotheses. The P-value and Bayes factor are examples of measures of evidence in Fisherian and Bayesian induction, respectively.
Mode: A point estimate of a random variable that is
the value at which its probability density is
maximized.
Multiplicity: When multiple hypothesis tests (e.g., control chart rules) are applied to different aspects (e.g., trending patterns) of a data set, the overall rate at which at least one test gives a false alarm may be greater than the false-alarm rate of any single test applied alone. This statistical phenomenon is referred to as multiplicity.
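A brief numerical illustration (added here; it assumes the individual tests are independent, which real control-chart rules need not be) shows how quickly the overall false-alarm rate grows.

alpha = 0.05
for k in (1, 2, 5, 10):
    familywise = 1 - (1 - alpha) ** k     # chance that at least one test falsely alarms
    print(f"{k:2d} independent tests: overall false-alarm rate = {familywise:.3f}")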
Neyman-Pearson induction: A methodology for
inductive inference, developed by Jerzy
Neyman and Egon Pearson, that considers
both a null and an alternative hypothesis.
The null hypothesis is rejected in favor of the
alternative hypothesis if the observed value of
some statistic lies in its ‘rejection region’. The
statistic and the associated rejection region are
identified from statistical theory and are chosen
to provide desired Type I or II decision error rates
over repeated applications of the methodology.
Null hypothesis (H0): A plausible hypothesis that is
presumed sufficient to explain a set of data
unless statistical evidence in the form of a
hypothesis test indicates otherwise.
Ockham’s Razor: The doctrine of parsimony that
advocates provisionally adopting the simplest
possible explanation for observed data.
Odds: The ratio of success to failure in probability
calculations. In the case of hypothesis testing
where only H0 or Ha are possible (but not both),
if the probability of truth of H0 is ProbH0, then
the odds of H0 equals ProbH0/(1-ProbH0).
One-sided test: A null hypothesis stated in such a way that observed values of the decision statistic on one side (either large or small, but not both) constitute evidence against it.
P-value: The probability of obtaining a result at least as
extreme as the one that was actually observed,
given that the null hypothesis is true. The fact
that p-values are based on this assumption is
crucial to their correct interpretation.
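As an added illustration, the Python lines below obtain a two-sided P-value for the point null hypothesis 'H0: true mean = 100' from hypothetical data using a one-sample t-test; the t-statistic defined later in this glossary is returned alongside the P-value.

import numpy as np
from scipy import stats

data = np.array([100.8, 101.5, 99.9, 101.2, 100.6, 101.0])   # hypothetical results
t_stat, p_value = stats.ttest_1samp(data, popmean=100.0)     # tests H0: true mean = 100
print(f"t = {t_stat:.2f}, two-sided P-value = {p_value:.4f}")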
Parameter: In statistics, a parameter is a quantity of
interest whose “true” value is to be estimated.
Generally a parameter is some underlying
variable associated with a physical, chemical, or
statistical model.
Point (simple) hypothesis: A statement that a model
parameter is equal to a single specific value.
For example, ’H0: true mean = 0’ is a simple
hypothesis.
Posterior distribution: A distributional estimate of a
random variable that updates information from
a prior distribution with new information from
data using Bayes’ rule.
Power (or operating characteristic) curve: Power is
equal to 1 minus the Type II error rate. The
power curve of a hypothesis test is a plot of the
Power versus the true value of the underlying
parameter of interest.
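The short Python sketch below (an illustration, not from the original article) traces a few points on the power curve of a one-sided, one-sample z-test; the sample size, alpha level, and known standard deviation are all assumed for the example.

import numpy as np
from scipy import stats

alpha, sigma, n = 0.05, 1.0, 9                    # assumed settings
z_crit = stats.norm.ppf(1 - alpha)                # reject H0: true mean <= 0 when z > z_crit

for true_mean in np.arange(0.0, 1.25, 0.25):
    # Under a true mean shift, the z statistic is normal with mean shift*sqrt(n)/sigma.
    power = 1 - stats.norm.cdf(z_crit - true_mean * np.sqrt(n) / sigma)
    print(f"true mean = {true_mean:.2f}: power = {power:.3f}")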
Prior distribution: A subjective distributional estimate of a random variable, expressed as a probability distribution, that is specified prior to any data collection.
Probability density contour plot: A rendering of a
bivariate distribution in which the distribution
appears in the bivariate parameter space as
contours of equal probability density.
Probability density surface mesh plot: A three-dimensional analogue of a two-dimensional
probability density plot in which there are two,
rather than only one, model parameters. The
bivariate distribution therefore appears as a
surface rather than as a curve.
Sampling distribution: The distribution of data or some
summary statistic calculated from data.
Special cause: When the cause for variation in data, or
statistics derived from data, can be identified
and controlled, it is referred to as a “special”
cause. When the cause cannot be identified, it is
regarded as random noise and referred to as a
“common” cause.
Statistic: A summary value (such as the mean or standard
deviation) that is calculated from data. A
statistic is often used because it provides a good
estimate of a parameter of interest.
t-statistic: The decision statistic used in Student’s t-test
consisting of the ratio of a difference between
an observed and hypothesized mean divided by
its estimated standard error.
Two-sided test: A null hypothesis stated in such a way
that either large or small observed values of the
decision statistic constitute evidence against it.
Two one-sided hypothesis test (TOST): A hypothesis test for equivalence that consists of two one-sided tests conducted at the upper and lower equivalence limits, each of whose null hypotheses must be rejected at the Type I error rate in order to reject the overall null hypothesis of non-equivalence.
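As a hedged sketch (added for illustration; the data, target, and equivalence limits are hypothetical), the Python lines below carry out the two one-sided t-tests and conclude equivalence only if both null hypotheses are rejected.

import numpy as np
from scipy import stats

data = np.array([100.4, 99.8, 100.9, 100.1, 99.6, 100.7])   # hypothetical results
low, high = 98.0, 102.0                                      # assumed equivalence limits
alpha = 0.05

n = len(data)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)

p_low = 1 - stats.t.cdf((mean - low) / se, df=n - 1)    # H0: true mean <= low
p_high = stats.t.cdf((mean - high) / se, df=n - 1)      # H0: true mean >= high

print("equivalence concluded:", p_low < alpha and p_high < alpha)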
Type I error: A decision error that results in falsely
rejecting the null hypothesis when in fact it is
true. It is sometimes referred to as the alpha-risk
or manufacturer’s risk.
Type II error: A decision error that results in failing to
reject the null hypothesis when in fact it is false.
It is sometimes referred to as the beta-risk or
consumer’s risk.
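The simulation sketch below (an added illustration with assumed settings: n = 10, alpha = 0.05, and a true shift of one standard deviation under Ha) estimates both error rates for a two-sided, one-sample t-test by repeated sampling; for a calibrated test, the estimated Type I error rate should be close to alpha.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_sim = 0.05, 10, 10000

def rejects_H0(true_mean):
    # One simulated experiment: n normal observations, two-sided test of H0: mean = 0.
    data = rng.normal(loc=true_mean, scale=1.0, size=n)
    return stats.ttest_1samp(data, popmean=0.0).pvalue < alpha

type_I = np.mean([rejects_H0(0.0) for _ in range(n_sim)])        # H0 true, falsely rejected
type_II = np.mean([not rejects_H0(1.0) for _ in range(n_sim)])   # H0 false, not rejected
print(f"estimated Type I error rate: {type_I:.3f}")              # should be near alpha
print(f"estimated Type II error rate: {type_II:.3f}")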
Type III error: A decision error that results in choosing
the incorrect null or alternative hypothesis for
use in a hypothesis test.
REFERENCES
1. LeBlond, D, “Data, Variation, Uncertainty, and Probability
Distributions,” Journal of GXP Compliance, Vol. 12, No. 3, pp
30–41, 2008.
2. LeBlond, D, “Using Probability Distributions to Make
Decisions,” Journal of Validation Technology, Spring 2008,
pp 2–14, 2008.
3. LeBlond, D, “Estimation: Knowledge Building with
Probability Distributions,” Journal of GXP Compliance, Vol.
12 (4), 42-59, 2008. See Journal of Validation Technology, Vol.
14, No. 5, 2008 for correction to Table IV.
4. LeBlond, D, “Estimation: Knowledge Building with
Probability Distributions–Reader Q&A,” Journal of Validation
Technology, Vol. 14(5), 50-64, 2008.
5. Marden, J, “Hypothesis Testing: From p Values to Bayes
Factors,” Journal of the American Statistical Association, 95(452)
1316-1320, 2000.
6. Stigler, S, The History of Statistics. The measurement of
uncertainty before 1900, Belknap Press, Cambridge, 1986.
7. Daston, L, Classical Probability in the Enlightenment, Princeton
University Press, Princeton, 1988.
8. William of Ockham, of the 14th century, advocated
adopting the simplest possible explanation for physical
phenomena, see Jeffreys, H (1961) Theory of Probability, 3rd
edition, Oxford University Press, NY, page 342.
9. Pearson, K, “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling,” Philosophical Magazine 50, 157–175, 1900.
10. Gosset, W, aka “Student,” The probable error of a mean,
Biometrika VI (1), 1-25, 1908.
11. Fisher, R, Statistical Methods for Research Workers, Oliver &
Boyd, Edinburgh, 1925.
12. Snedecor, G and Cochran, W, Statistical Methods, 6th
edition, Iowa State University Press, Ames, page 98, 1967.
13. Goodman, S, “Toward Evidence-based Medical Statistics.
1. The P value fallacy,” Annals of Internal Medicine, 130, 995-1004, 1999.
14. Gibbons, J and Pratt, J, “P-values: Interpretation and
Methodology,” The American Statistician, 29(1) 20-25, 1975.
15. Neyman, J and Pearson, E, “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Philosophical Transactions of the Royal Society, series A, volume 231, 289-337, 1933.
16. International Conference on Harmonisation, ICH
Harmonised Tripartite Guideline on Pharmaceutical Development
Q8, Current Step 4 version, dated 10 November 2005.
17. Dale, A, A History of Inverse Probability, Springer, New York,
page 545, note 38, 1999.
18. Price, R, (1763) “An essay towards solving a problem
in the doctrine of chances,” in Dale, A Most Honourable
Remembrance: The Life and Work of Thomas Bayes, Springer,
New York, 2003.
19. Jeffreys, H, Theory of Probability, 3rd edition, Oxford
University Press, Cambridge, 1961.
20. Gelman, A, Carlin, J, Stern, H, and Rubin, D, Bayesian Data
Analysis, 2nd Edition, Chapman and Hall/ CRC, New York,
2004.
21. Goodman, S, “Toward Evidence-based Medical Statistics. 2. The Bayes Factor,” Annals of Internal Medicine, 130, 1005-1013, 1999.
22. Casella, G and Berger, R, “Reconciling Bayesian and
Frequentist Evidence in the One-Sided Testing Problem,” Journal of the American Statistical Association, 82(397), 106-111, 1987.
23. Schervish, M, Theory of Statistics, Springer-Verlag, New York,
1995.
24. Berger, J, Statistical Decision Theory and Bayesian Analysis, 2nd
edition, Springer-Verlag, New York, 1985.
25. Lincoln Technologies, “WebVDME in Production at
the FDA,” WebVDME News, volume 2(2), page 1, 2005.
Available at HTTP://WWW.LINCOLNTECHNOLOGIES.
COM
26. Cowles, M, “Review of WinBUGS 1.4,” American Statistician
58(4), 330-336, 2004.
27. Kass, R and Raftery, A, “Bayes Factors,” Journal of the
American Statistical Association, 90(430) 773-795, 1995.
28. Bolstad, W, Introduction to Bayesian Statistics 2nd Edition, John
Wiley & Sons, Hoboken, New Jersey, 2007.
29. Schuirmann, D, “On Hypothesis Testing to Determine
if the Mean of a Normal Distribution is Contained in a
Known Interval,” Biometrics 37 617, 1981.
30. Vijayvargiya, A., “One Way Analysis of Variance,” Journal of
Validation Technology, Vol. 15, No. 1, 2009.
31. Zellner, A, An Introduction to Bayesian Inference in
Econometrics, John Wiley, New York, 1971.
32. Box, G and Tiao, G, Bayesian Inference in Statistical Analysis,
Addison-Wesley Pub. Co., Reading, MA, 1973. JVT
Originally published in the Winter 2009 issue of The Journal of Validation Technology