Controlled Clinical Trials 25 (2004) 143 – 156
www.elsevier.com/locate/conclintrial
Assessment of blinding in clinical trials
Heejung Bang a,*, Liyun Ni b, Clarence E. Davis a
a Department of Biostatistics, University of North Carolina at Chapel Hill, 137 E. Franklin Street, Chapel Hill, NC 27599, USA
b Amgen, Inc., Thousand Oaks, CA, USA
Received 18 February 2003; accepted 17 October 2003
Abstract
Success of blinding is a fundamental issue in many clinical trials. The validity of a trial may be questioned if
this important assumption is violated. Although thousands of ostensibly double-blind trials are conducted annually
and investigators acknowledge the importance of blinding, attempts to measure the effectiveness of blinding are
rarely discussed. Several published papers proposed ways to evaluate the success of blinding, but none of the
methods are commonly used or regarded as standard. This paper investigates a new approach to assess the success
of blinding in clinical trials. The blinding index proposed is scaled to an interval of −1 to 1, 1 being complete lack
of blinding, 0 being consistent with perfect blinding and −1 indicating opposite guessing which may be related to
unblinding. It has the ability to detect a relatively low degree of blinding, response bias and different behaviors in
two arms. The proposed method is applied to a clinical trial of cholesterol-lowering medication in a group of
elderly people.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Blinding; Blinding index; CRISP; Clinical trial; Masking; Multinomial test
1. Introduction
Blinding embodies a rich history spanning a couple of centuries and represents an important, distinct
aspect of randomized controlled trials. The double-blind procedure has been regarded as an important
design feature in clinical trials. Although most researchers appreciate its meaning, there is some
confusion in the definition of blinding (e.g., single-, double- and triple-blind, masking and allocation
concealment) [1]. The terminology double-blind usually refers to keeping study participants, investigators
and data assessors unaware of the allocated treatment or therapy, so that they are not influenced
psychologically or physically by that knowledge. Single-blind normally means blinding of the
participants only [2,3]. Blinding not only can prevent selection or ascertainment (i.e., information)
biases, but also can improve compliance and retention of trial participants [1]. In well-blinded trials one
can have certainty that any differential effects between groups stem from the treatment rather than the
subjects’ or researchers’ biases [4].
* Corresponding author. Tel.: +1-919-962-3231; fax: +1-919-962-3265.
E-mail address: [email protected] (H. Bang).
0197-2456/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.cct.2003.10.016
Blinding is difficult to achieve in some situations such as comparison of surgical vs. medical
interventions, or of psychological vs. no treatments. In drug trials, comparing active treatment with
placebo or competing active treatment, however, researchers have established many methods to assure
the achievement of blinding, such as making the appearance, smell and taste of active drug and placebo
be the same to disguise their dissimilarity. When two active drugs are compared, the double-dummy
method using two placebos is often used [5]. The degree of blinding is ascertained by directly asking
participants, health-care providers or outcome assessors which treatment they think was administered at
several stages over or at the end of a trial. Even with these technical efforts, beneficial therapeutic
efficacy, side effects or even internal conversation are frequently cited as clues to treatment identity, and
in consequence allow the patients and caregivers to become unblinded through the trial. For example, in
clinical trials of psychoactive drugs, it was often found that both patients and physicians were able to
figure out treatment allocation beyond chance levels [5].
Even though many trials are designed as double-blind and most researchers acknowledge that
successful blinding is essential, very few studies actually evaluate or report the magnitude to which the
blinding was maintained during the course of the study. A meta-analysis reported that among trials
claimed as double-blind, only 45% described similarity of the treatment and control regimens, and only
26% provided information regarding the protection of the allocation schedule [6]. Most publications
provide no information on attempts to maintain and/or evaluate blinding. Although several methods to
assess blinding in clinical trials have been published, none of these are widely used or considered
standard. Moreover, the methods have not been thoroughly studied from a statistical point of view. Most
previous methods were based on an exploratory analysis often with no or even incorrect statistical
properties and excluded nonrespondents (i.e., responders with indefinite answers) [5,7–11]. Recently,
James et al. [4] proposed a method to assess blinding by constructing an index along with its asymptotic
theory, which incorporates the nonrespondents.
In this paper, we propose a set of hypothesis tests to evaluate the effectiveness of blinding in clinical
trials. The remainder of this paper is organized as follows. In the next section, we introduce typical data
structures commonly available in clinical trials. Then, we compare the proposed approach with the best
(statistically) established method and explore their advantages and limitations. Next we present the
results of a simulation study. These methods are illustrated with the data from the Cholesterol Reduction
in Seniors Program (CRISP). We end with some discussion and additional remarks.
Table 1
Number of subjects by treatment assignment and guess in 2 × 3 format
Assignment   Response
             Drug                   Placebo                DK                     Total
Drug         $n_{11}$ ($P_{1|1}$)   $n_{12}$ ($P_{2|1}$)   $n_{13}$ ($P_{3|1}$)   $n_{1\cdot}$
Placebo      $n_{21}$ ($P_{1|2}$)   $n_{22}$ ($P_{2|2}$)   $n_{23}$ ($P_{3|2}$)   $n_{2\cdot}$
Total        $n_{\cdot 1}$          $n_{\cdot 2}$          $n_{\cdot 3}$          $N$
$P_{j|i} = P(\text{guess } j \mid \text{assigned treatment } i)$ for i = 1 (drug), 2 (placebo) and j = 1 (drug), 2 (placebo), 3 (DK) where DK denotes ‘‘Don’t
know’’. N is total number of participants.
Table 2
Number of subjects by treatment assignment and guess in 2 × 5 format
Assignment   Response
             1                      2                      3                      4                      5 (DK)                 Total
Drug         $n_{11}$ ($P_{1|1}$)   $n_{12}$ ($P_{2|1}$)   $n_{13}$ ($P_{3|1}$)   $n_{14}$ ($P_{4|1}$)   $n_{15}$ ($P_{5|1}$)   $n_{1\cdot}$
Placebo      $n_{21}$ ($P_{1|2}$)   $n_{22}$ ($P_{2|2}$)   $n_{23}$ ($P_{3|2}$)   $n_{24}$ ($P_{4|2}$)   $n_{25}$ ($P_{5|2}$)   $n_{2\cdot}$
Total        $n_{\cdot 1}$          $n_{\cdot 2}$          $n_{\cdot 3}$          $n_{\cdot 4}$          $n_{\cdot 5}$          $N$
1: Strongly believe the treatment is drug, 2: somewhat believe the treatment is drug, 3: somewhat believe the treatment is
placebo, 4: strongly believe the treatment is placebo and 5: DK.
2. Data structure
The typical procedure to elicit blinding information from participants is to ask, during or at the end of
study, about the treatment allocation they think they were assigned, and the answer can take various
forms. In this section, we present the two most common structures for the response data. One is
presented in Table 1 with three responses of ‘‘drug’’, ‘‘placebo’’ or ‘‘DK’’, where we will use ‘‘DK’’ as
an abbreviation of ‘‘Don’t know.’’ The other is shown in Table 2 which presents the level of self-identification
in five categories of ‘‘Strongly believe the treatment is drug’’ (coded as 1), ‘‘Somewhat
believe the treatment is drug’’ (2), ‘‘Somewhat believe the treatment is placebo’’ (3), ‘‘Strongly believe
the treatment is placebo’’ (4) and ‘‘DK’’ (5). The probability in each cell denotes the conditional
probability $P_{j|i} = P(\text{guess } j \mid \text{assigned treatment } i)$ for i = 1 (drug), 2 (placebo) and j = 1 (drug), 2 (placebo),
3 (DK) for Table 1 and j = 1,...,5 for Table 2.
In addition, those individuals who declined to venture an opinion (i.e., answered DK) originally may
be asked to choose a treatment allocation anyway as done in the CRISP study. Such ancillary data can be
displayed in Table 3. If there are no missing data in Table 3, ñ (from Table 3) is expected to be equal to
n.3 from Table 1 or n.5 from Table 2. Each probability in Table 3 can be defined similarly to the
conditional probability above. We will focus on Table 1 alone and then explain how to expand the idea to
Table 2 (and Table 3). The study design for blinding assessment and the data collection process in
clinical trials are not technically complicated; however, no standard method is widely used for this
purpose. In Section 3, we summarize an existing method and then introduce a new approach based on a
multinomial test.
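Table 3
Ancillary data from the subjects who answered DK initially
Assignment   Response
             Drug                                   Placebo                                Total
Drug         $\tilde n_{11}$ ($\tilde P_{1|1}$)     $\tilde n_{12}$ ($\tilde P_{2|1}$)     $\tilde n_{1\cdot}$
Placebo      $\tilde n_{21}$ ($\tilde P_{1|2}$)     $\tilde n_{22}$ ($\tilde P_{2|2}$)     $\tilde n_{2\cdot}$
Total        $\tilde n_{\cdot 1}$                   $\tilde n_{\cdot 2}$                   $\tilde n$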
3. Statistical methods
3.1. James’ blinding index
The standard κ coefficient ignoring DK responses measures the degree of agreement. However, in
clinical trials, disagreement is a more favorable result since it indicates a high degree of blinding and DK
may be the most indicative response for that. Therefore, James et al. proposed a blinding index (BI), a
variation of the kappa coefficient, that is sensitive not to the degree of agreement but to the degree of
disagreement by placing the highest weight on DK responses [4]. The BI score for the data in Table 1 is
defined as
\[ \mathrm{BI} = \{1 + P_{DK} + (1 - P_{DK}) K_D\}/2 \]
where $P_{DK}$ is the proportion of DK responses, $K_D = (P_{Do} - P_{De})/P_{De}$ with
\[ P_{Do} = \sum_{i=1}^{2}\sum_{j=1}^{2} w_{ij}P_{ij}/(1 - P_{DK}) \quad \text{for } P_{DK} \neq 1 \]
and
\[ P_{De} = \sum_{i=1}^{2}\sum_{j=1}^{2} w_{ij}P_{\cdot j}(P_{i\cdot} - P_{i3})/(1 - P_{DK})^{2}. \]
Note that this statistic is
applicable to a symmetric data structure plus DK answers, and $P_{ij}$ and $w_{ij}$ are the expected ‘‘relative’’
(not conditional) probability and the weight, respectively, for the (i, j)th cell where i indexes
treatment assignment and j indexes treatment guess (i, j = 1,2). Thus, BI can be estimated by
$\widehat{\mathrm{BI}} = \{1 + \hat P_{DK} + (1 - \hat P_{DK})\hat K_D\}/2$, where $\hat P_{ij} = n_{ij}/N$, $\hat P_{i\cdot} = n_{i\cdot}/N$, $\hat P_{\cdot j} = n_{\cdot j}/N$, $\hat P_{i3} = n_{i3}/N$ and $\hat P_{DK} = n_{\cdot 3}/N$
with the total sample size of N. Its asymptotic variance and an extension to the study with
multiple arms can be found in the same article.
According to James et al., correct guesses are least supportive of blinding and assigned a weight of 0,
incorrect guesses are moderately supportive with an intermediate weight of 0.5 or 0.75, while DK
responses are implicitly assigned a weight of 1. When all responses are correct, BI = 0. When all
responses are DK (i.e., PDK = 1), BI = 1. Therefore, this index increases as the success of blinding
increases, ranging from 0 to 1, 0 being total lack of blinding, 1 being complete blinding and 0.5 being
completely random. If the upper bound of the confidence interval (CI) of B̂I is below 0.5 (i.e., CI does
not cover the null value), the study is regarded as lacking blinding. Otherwise, we conclude that there is
insufficient evidence to show unblinding or blinding is achieved. The former statement is a correct one
but we will use these statements interchangeably to simplify communication.
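As a rough illustration of the definition above, the following Python sketch computes James' BI from a 2 × 3 table of counts laid out as in Table 1. The function name `james_bi` and the default incorrect-guess weight of 0.5 are illustrative assumptions, and the asymptotic variance given by James et al. is not reproduced here.

```python
def james_bi(table, w_incorrect=0.5):
    """James' blinding index for a 2 x 3 table of counts.

    table[i][j]: i = 0 (drug), 1 (placebo) is the assignment;
                 j = 0 (drug), 1 (placebo), 2 (DK) is the guess.
    Correct guesses get weight 0, incorrect guesses get w_incorrect
    (an assumed default of 0.5), and DK is implicitly weighted 1.
    """
    total = sum(sum(row) for row in table)
    n_dk = table[0][2] + table[1][2]
    if n_dk == total:
        return 1.0  # every response is DK, so BI = 1 by definition

    p = [[n / total for n in row] for row in table]  # relative P_ij
    p_dk = n_dk / total
    w = [[0.0, w_incorrect], [w_incorrect, 0.0]]
    p_row = [sum(row) for row in p]                   # P_i.
    p_col = [p[0][j] + p[1][j] for j in range(2)]     # P_.j

    # Observed and expected weighted disagreement among non-DK responses
    p_do = sum(w[i][j] * p[i][j]
               for i in range(2) for j in range(2)) / (1 - p_dk)
    p_de = sum(w[i][j] * p_col[j] * (p_row[i] - p[i][2])
               for i in range(2) for j in range(2)) / (1 - p_dk) ** 2
    if p_de == 0:
        raise ValueError("expected disagreement is zero; K_D is undefined")

    k_d = (p_do - p_de) / p_de
    return (1 + p_dk + (1 - p_dk) * k_d) / 2
```

For example, `james_bi([[125, 125, 0], [125, 125, 0]])` corresponds to purely random guessing with no DK responses, and `james_bi([[250, 0, 0], [0, 250, 0]])` to all guesses being correct.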
A major advantage of James’ BI is the adaptability to various data structures. Yet, it should be
noted that this method is critically dominated by DK responses, which will be demonstrated later by a
numerical experiment. A fundamental assumption for this method to be valid is that participants who
answer DK are truly uncertain about their treatment assignment, not just giving a socially acceptable
answer. This assumption cannot be verified in the absence of any supporting information from the
people who answer DK. Some evidence may be provided from the ancillary data in Table 3 although
it is not necessarily the case. In practice, the drug and placebo arms of a clinical trial can exhibit
distinct blinding behaviors not only in magnitude but also in direction. Since James’ BI is a single
index value obtained from all arms together, it cannot distinguish this difference, if any, and can even
lead us to a totally opposite conclusion. Moreover, this and other previous methods do not provide the
estimate of the proportion of unblinded participants beyond random chance level. Investigators may
want to know how many participants are unmasked regardless of the validity and effectiveness of
masking. Motivated by these limitations, we propose a new BI based on the multinomial
distribution.
3.2. New blinding index
Taking into account the effect of uncertain responses and the fact that two arms often exhibit different
blinding characteristics, a new BI based on a trinomial test is developed in this section. We introduce a
new notation $r_{i|i}$ ($= P_{i|i}/(P_{1|i} + P_{2|i})$) for proportion of correct guesses among individuals with certain
identification on the ith arm, where i = 1 (drug) and 2 (placebo) for the data in Table 1. When there is no
DK response, if the treatment is successfully concealed, it is likely that $r_{1|1} = r_{2|2} = 0.5$ (i.e., half of the
treatment group will guess treatment and half of the placebo group will guess placebo). It is easy to show
that $2\hat r_{i|i} - 1$ is the sample proportion of individuals who guess their treatment correctly on the ith arm
beyond random balance, where $\hat r_{i|i} = n_{ii}/(n_{i1} + n_{i2})$ estimates the population quantity $r_{i|i}$ consistently.
In the presence of DK responses, we define a new treatment-specific BI, $\mathrm{newBI}_i = (2r_{i|i} - 1)(P_{1|i} + P_{2|i})$
and estimate it by
\[ \widehat{\mathrm{newBI}}_i = (2\hat r_{i|i} - 1)(n_{i1} + n_{i2})/(n_{i1} + n_{i2} + n_{i3}) \tag{1} \]
for i = 1,2. Intuitively, the numerator $(2\hat r_{i|i} - 1)(n_{i1} + n_{i2})$ estimates the number of people who guess the
treatment correctly beyond chance level. This test is generally carried out separately to answer the
questions: (1) Was the treatment arm blinded? and (2) Was the placebo arm blinded? A routine
calculation leads us to the following relationships: $\mathrm{newBI}_1 = P_{1|1} - P_{2|1}$ for drug arm and
$\mathrm{newBI}_2 = P_{2|2} - P_{1|2}$ for placebo arm, where $P_{j|i}$ (i, j = 1,2) is the ‘‘conditional’’ probability and can be
estimated by $\hat P_{j|i} = n_{ij}/n_{i\cdot}$ (see Table 1 for details). The exact distribution of $\widehat{\mathrm{newBI}}_i$ can be derived from a
trinomial distribution and gives the variance
\[ \mathrm{Var}(\widehat{\mathrm{newBI}}_i) = \{P_{1|i}(1 - P_{1|i}) + P_{2|i}(1 - P_{2|i}) + 2P_{1|i}P_{2|i}\}/n_{i\cdot} \]
for i = 1,2.
In other words, inference based on the new index score is equivalent to statistical testing of H0:
$P_{1|1} = P_{2|1}$ vs. Ha: $P_{1|1} > P_{2|1}$ for treatment arm and H0: $P_{2|2} = P_{1|2}$ vs. Ha: $P_{2|2} > P_{1|2}$ for placebo arm.
NewBI calculates the difference between the proportions of correct and incorrect guesses by excluding
DK responses in the numerator but including them in the denominator, and can be straightforwardly
interpreted as a treatment-specific proportion of the unblinded participants, assuming implicitly that DK
indicates blindness. It takes a value between −1 and 1, with 0 as a null value, which indicates the most
desirable situation under successful blinding. A positive value implies failure in masking above random
accounting (i.e., a majority of participants guess their treatment allocation correctly), and a negative
value suggests success of masking or failure of masking in the other direction (i.e., more individuals
mistakenly name the alternative treatment), which will be discussed below. If we do not have DK
answers, this test is equivalent to a one-sample binomial test (i.e., H0: $r_{i|i} = 0.5$ vs. Ha: $r_{i|i} > 0.5$). The test
can be conducted easily using the normal approximation to a binomial distribution. The normal
approximation is expected to perform well because a large sample is commonly available in clinical
trials. If a one-sided 95% confidence limit of the BI does not include the null value (equivalently, the
lower limit is above 0), this treatment arm is deemed to fail to achieve blinding. Otherwise, blinding is
maintained or, to speak more rigorously, there is insufficient evidence to show unblinding.
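The arm-specific index, its trinomial variance and the one-sided confidence limit described above can be sketched in a few lines of Python. This is a minimal illustration for one arm of Table 1, not the authors' implementation: the function name `new_bi` and its return layout are assumptions, and the normal quantile is taken from the standard library.

```python
import math
from statistics import NormalDist


def new_bi(n_correct, n_incorrect, n_dk, alpha=0.05):
    """Arm-specific new BI for one row of Table 1.

    n_correct / n_incorrect / n_dk are the counts of correct guesses,
    incorrect guesses and DK answers in that arm.
    Returns (estimate, standard error, one-sided lower confidence limit).
    """
    n = n_correct + n_incorrect + n_dk
    p_c = n_correct / n     # estimated P(correct guess | arm)
    p_w = n_incorrect / n   # estimated P(incorrect guess | arm)

    # Equivalent to Eq. (1): (2*r_hat - 1) * (n_c + n_w) / n = (n_c - n_w) / n
    bi = p_c - p_w

    # Variance of the difference of two cells of a trinomial distribution
    var = (p_c * (1 - p_c) + p_w * (1 - p_w) + 2 * p_c * p_w) / n
    se = math.sqrt(var)

    # Normal approximation; unblinding is flagged when the lower limit > 0
    z = NormalDist().inv_cdf(1 - alpha)
    return bi, se, bi - z * se
```

With 150 correct guesses, 25 incorrect guesses and 75 DK answers among 250 participants, `new_bi(150, 25, 75)` gives an estimate of 0.5 with a standard error of about 0.042, and a lower limit above 0, so that arm would be judged unblinded under the rule above.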
When the data in Tables 2 and 3 are available, we can use a linear combination derived from a
multinomial distribution with weights specified a priori. Now we are testing
H0: $P_{1|i} + w_{2|i}P_{2|i} + \tilde w_{1|i}\tilde P_{1|i} = P_{4|i} + w_{3|i}P_{3|i} + \tilde w_{2|i}\tilde P_{2|i}$ vs.
Ha: $P_{1|i} + w_{2|i}P_{2|i} + \tilde w_{1|i}\tilde P_{1|i} > P_{4|i} + w_{3|i}P_{3|i} + \tilde w_{2|i}\tilde P_{2|i}$ for i = 1, or
Ha: $P_{1|i} + w_{2|i}P_{2|i} + \tilde w_{1|i}\tilde P_{1|i} < P_{4|i} + w_{3|i}P_{3|i} + \tilde w_{2|i}\tilde P_{2|i}$ for i = 2, where $w_{j|i}$ and $\tilde w_{j|i}$ are the weights for
$P_{j|i}$ and $\tilde P_{j|i}$, respectively. The new BI in this situation is defined as
\[ \mathrm{newBI}_i = l_i^{T} P_i \]
where
\[ l_i = [1,\ w_{2|i},\ \tilde w_{1|i},\ -\tilde w_{2|i},\ -w_{3|i},\ -1]^{T} \]
and
\[ P_i = [P_{1|i},\ P_{2|i},\ \tilde P_{1|i},\ \tilde P_{2|i},\ P_{3|i},\ P_{4|i}]^{T} \quad \text{for } i = 1, \]
\[ P_i = [P_{4|i},\ P_{2|i},\ \tilde P_{1|i},\ \tilde P_{2|i},\ P_{3|i},\ P_{1|i}]^{T} \quad \text{for } i = 2 \]
subject to conditions: (1) $\sum_{j=1}^{4} P_{j|i} + \sum_{j=1}^{2} \tilde P_{j|i} = 1$, and (2) $0 \le \tilde w_{1|i} = \tilde w_{2|i} \le w_{2|i} = w_{3|i} \le 1$ for i = 1,2.
The variance of the new BI estimator is $\mathrm{var}(\widehat{\mathrm{newBI}}_i) = l_i^{T}\,\mathrm{cov}(\tilde P_i)\,l_i$, where $\mathrm{cov}(\tilde P_i)$ is a 6 × 6
covariance matrix. By using the data in both Tables 2 and 3, not only the ordinal scores for
the response (rather than binary response), but also ancillary data can be incorporated along with
proper weight assignment in these hypothesis tests. We suggest $w_{2|i} = w_{3|i} = 0.5$ and $\tilde w_{1|i} = \tilde w_{2|i} = 0.25$
for i = 1,2 but other weight specifications can be employed, especially for sensitivity analysis. It is
easy to see that the test statistic in Eq. (1) is a special case with $w_{2|i} = w_{3|i} = 1$ and $\tilde w_{1|i} = \tilde w_{2|i} = 0$
and we just set $\tilde w_{1|i} = \tilde w_{2|i} = 0$ in the absence of the auxiliary data.
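The weighted index for the five-category format can be sketched as follows for the case without ancillary data (both ancillary weights set to 0). The function name `weighted_new_bi` and the `arm` argument are hypothetical, the sign of the contrast for the placebo arm is chosen to match the one-sided alternative stated above for i = 2, and the variance uses the usual multinomial covariance of sample proportions, which is an assumption about how the covariance matrix is evaluated.

```python
import math


def weighted_new_bi(counts, arm, w=0.5):
    """Weighted new BI for one arm of a 2 x 5 table (no ancillary data).

    counts: [n1, n2, n3, n4, n5] with 1 = strongly drug, 2 = somewhat drug,
            3 = somewhat placebo, 4 = strongly placebo, 5 = DK.
    arm:    "drug" or "placebo".
    w:      common weight for the two 'somewhat' categories (w2 = w3).
    Returns (estimate, standard error).
    """
    if arm not in ("drug", "placebo"):
        raise ValueError("arm must be 'drug' or 'placebo'")
    n = sum(counts)
    p = [c / n for c in counts]

    # Contrast: correct-direction guesses minus incorrect-direction guesses.
    contrast = [1.0, w, -w, -1.0, 0.0]
    if arm == "placebo":
        contrast = [-c for c in contrast]

    bi = sum(c * q for c, q in zip(contrast, p))

    # For multinomial proportions, l' cov(p_hat) l = (sum l^2 p - (sum l p)^2) / n
    var = (sum(c * c * q for c, q in zip(contrast, p)) - bi ** 2) / n
    return bi, math.sqrt(var)


# Counts taken from Table 6 (CRISP), two Lovastatin doses combined.
lovastatin_bi, lovastatin_se = weighted_new_bi([38, 44, 21, 4, 170], "drug")
placebo_bi, placebo_se = weighted_new_bi([11, 16, 21, 8, 83], "placebo")
```

Setting `w=1.0` collapses categories 1 and 2 into ‘‘drug’’ and 3 and 4 into ‘‘placebo’’, which reproduces the estimate in Eq. (1).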
If we intend to conduct the two hypothesis tests simultaneously, we recommend a Bonferroni type
multiple testing procedure. Since the two test statistics are independent and a small number of tests (i.e.,
at most two or three) are generally involved, the conservatism of the Bonferroni approach will not be a
problem [12]. A joint test can be an alternative, but some statistical tests proposed by Silvapulle should
be adopted (instead of a standard chi-square test used for two-sided tests) in order to account for the one-sided
nature of the alternative [13]. If statistical significance is near the nominal level, statistical
inference may entail extensive computing efforts.
Although the testing procedure in the new index method looks straightforward, more caution should
be exercised before we reach the final conclusion, especially regarding response bias and the potential
interpretation of a negative value of the BI [7,14]. A typical example of response bias is that all subjects
tend to believe that they are on active treatment. This may occur when both taking an active treatment
and being untreated cause adverse reactions or favorable effects, or accompany wishful thinking.
Alternatively, participants are inclined to believe they receive a placebo in the absence of any therapeutic
effects, because participants who presume they are on active drug anticipate definitive effects. No
previous literature has proposed a statistically justifiable way to evaluate this type of potential bias. For
our test, we suggest conducting a formal test and reporting the finding instead of concluding successful
blinding on one arm and unsuccessful blinding on the other arm. If one arm suggests successful blinding
and the other arm does not based on the method outlined above (i.e., accept newBI1 < 0 and newBI2 > 0,
or vice versa.), then test H0: newBI1 + newBI2 = 0 in order to confirm the existence of response bias. If
H0 cannot be rejected, there is some evidence that this trial may experience response bias.
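One way to carry out the test of H0: newBI1 + newBI2 = 0 is sketched below. The text does not specify the form of the test, so the two-sided Wald z-test used here, the independence of the two arms, and the function names `arm_bi` and `response_bias_test` are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def arm_bi(n_correct, n_incorrect, n_dk):
    """Return the new BI estimate and its trinomial variance for one arm."""
    n = n_correct + n_incorrect + n_dk
    p_c, p_w = n_correct / n, n_incorrect / n
    bi = p_c - p_w
    var = (p_c * (1 - p_c) + p_w * (1 - p_w) + 2 * p_c * p_w) / n
    return bi, var


def response_bias_test(drug_arm, placebo_arm):
    """Two-sided z-test of H0: newBI_1 + newBI_2 = 0.

    Each argument is a tuple (n_correct, n_incorrect, n_dk).
    The arms are treated as independent, so the variances add.
    Assumes at least one arm has a non-zero variance.
    Returns (z statistic, two-sided p-value).
    """
    bi1, v1 = arm_bi(*drug_arm)
    bi2, v2 = arm_bi(*placebo_arm)
    z = (bi1 + bi2) / math.sqrt(v1 + v2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Everyone tends to guess "drug": 90% correct on drug, 90% incorrect on placebo.
z, p_value = response_bias_test((225, 25, 0), (25, 225, 0))
```

In this example the two estimates are 0.8 and −0.8, the sum is zero and H0 is not rejected, which is the pattern the text associates with response bias.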
Finally, let us consider a very rare but still plausible example in which newBI = −1 (i.e., everyone
picked the treatment that they did not receive). In this case, James’ BI assigns the best score of 1 and
concludes that the trial is perfectly blinded. Unlike James’ BI, our method still can detect this departure
by giving a negative estimate. If the investigator believes that masking is suspect if unblinding is
significant in either direction, adopting a two-sided hypothesis can resolve this issue. We note that this
situation would seldom occur and cause little concern relative to the case of newBI ≥ 0; consequently,
we propose to use a one-sided test in general.
Our suggestion in practice is that the result should be reported or discussed if at least one of the
following conditions is met: (1) $\widehat{\mathrm{newBI}}$ is greater than 20%, (2) the null hypothesis (i.e., successful or
random blinding) is rejected based on our principle or (3) the James method and our method differ in
conclusion.
4. Simulation study
We conducted a numerical experiment under various scenarios and compared the performance of our
method with the James method. We chose the James method as the only one for which a large sample
theory has been established.
For illustrative purposes, we simulated the data according to the data structure in Table 1. We assumed
equal allocations of 250 patients to intervention and placebo arms (i.e., a balanced clinical trial). We
considered three scenarios: ‘‘random’’, ‘‘opposite’’ and ‘‘unblinded’’, where Random indicates that the
probabilities of correct and wrong guesses are equal, therefore, each participant randomly guesses the
treatment assignment. Opposite represents a higher probability of guessing the opposite intervention, i.e.,
more participants in the intervention arm guess placebo, and more participants in the placebo arm believe
they receive an active drug. Lastly, unblinded indicates that most participants identify the assignment
correctly. There are 3! = 6 distinct combinations and they are listed in Table 4.
We briefly explain the motivation of each case. In case 1, all participants randomly guess their
assignment, therefore theoretically, James’ BI = 0.5 and newBI = 0 for each arm. This is the most ideal
scenario in reality. Case 2 is the scenario that gives the best score for each BI (James’ BI is close to 1 and
our newBI is nearly −1 for each arm), but it seldom occurs and needs some caution in interpretation as
outlined in the previous section. Case 3 is the opposite to case 2, where most participants figure out their
treatment assignment. This case certainly indicates that treatments are revealed, thus the validity of the
accompanying statistical analyses should be closely examined. In case 4, most participants guess the
active drug, suggesting that response bias may exist. Case 5 indicates random guessing in the placebo
arm and mostly incorrect guesses in the drug arm, which is also a moderately rare scenario. In case 6,
participants in the placebo arm randomly guess their treatment, while most participants in the drug arm
correctly identify it.
Within each case, we varied the proportion of DK response: PDK = 0, 0.3 and 0.7, corresponding to
no, moderate and high percentage of DK, respectively. The simulation study was performed with SAS
8.2. For case 1 with moderate DK probability, a random number Q was generated from a uniform
distribution on [0, 1] using ranuni function. If Q ≤ 0.35, the response was assigned to drug, if Q ≥ 0.65,
the response was assigned to placebo, and the response was assigned to DK otherwise. The data were
generated similarly for the other conditions.
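The data-generating step for case 1 with moderate DK probability can be mimicked in Python as below. The original study used SAS 8.2 and the ranuni function; this sketch substitutes Python's `random` module, and the function names, the seed and the use of the one-sided lower limit from the new BI as the rejection rule are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def simulate_arm(n, p_drug, p_placebo, rng):
    """Simulate guesses for one arm; returns counts [drug, placebo, DK]."""
    counts = [0, 0, 0]
    for _ in range(n):
        q = rng.random()            # uniform draw on [0, 1)
        if q <= p_drug:
            counts[0] += 1          # guess "drug"
        elif q >= 1 - p_placebo:
            counts[1] += 1          # guess "placebo"
        else:
            counts[2] += 1          # "Don't know"
    return counts


def rejects_blinding(n_correct, n_incorrect, n_dk, alpha=0.05):
    """True if the one-sided lower limit of the new BI exceeds 0."""
    n = n_correct + n_incorrect + n_dk
    p_c, p_w = n_correct / n, n_incorrect / n
    bi = p_c - p_w
    se = math.sqrt((p_c * (1 - p_c) + p_w * (1 - p_w) + 2 * p_c * p_w) / n)
    return bi - NormalDist().inv_cdf(1 - alpha) * se > 0


def rejection_rate(n_sims=500, n=250, p_drug=0.35, p_placebo=0.35, seed=2004):
    """Share of simulated drug arms in which blinding is rejected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        drug, placebo, dk = simulate_arm(n, p_drug, p_placebo, rng)
        # In the drug arm a "drug" guess is the correct one.
        hits += rejects_blinding(drug, placebo, dk)
    return hits / n_sims
```

Calling `rejection_rate()` with the defaults corresponds to the drug arm of case 1 with 30% DK responses; because guessing is random in that scenario, the returned proportion should sit near the nominal 5% level, up to simulation error.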
Sample means of BI estimates and of standard error estimates, and the proportion of unblinding
judged by the one-sided 95% CI from 500 data sets for each scenario were computed, and the results
were summarized in Table 5. The simulation study showed precise coverage probabilities for most cases
within the 2% range, where the accuracy for each empirical coverage probability is given by
$2(0.95 \times 0.05/500)^{1/2} = 0.02$ with 500 runs (results not shown). The empirical type I error in case 1
without uncertain responses, where all participants in both arms randomly guessed their treatments, was
close to 0.05, the prespecified significance level.
Table 4
Simulation setting
Case                              Assignment   Response (%)
                                               DK    T     P
1 (T: random, P: random)          T            0     50    50
                                  P            0     50    50
                                  T            30    35    35
                                  P            30    35    35
                                  T            70    15    15
                                  P            70    15    15
2 (T: opposite, P: opposite)      T            0     10    90
                                  P            0     90    10
                                  T            30    10    60
                                  P            30    60    10
                                  T            70    10    20
                                  P            70    20    10
3 (T: unblinded, P: unblinded)    T            0     90    10
                                  P            0     10    90
                                  T            30    60    10
                                  P            30    10    60
                                  T            70    20    10
                                  P            70    10    20
4 (T: unblinded, P: opposite)     T            0     90    10
                                  P            0     90    10
                                  T            30    60    10
                                  P            30    60    10
                                  T            70    20    10
                                  P            70    20    10
5 (T: opposite, P: random)        T            0     10    90
                                  P            0     50    50
                                  T            30    10    60
                                  P            30    35    35
                                  T            70    10    20
                                  P            70    15    15
6 (T: unblinded, P: random)       T            0     90    10
                                  P            0     50    50
                                  T            30    60    10
                                  P            30    35    35
                                  T            70    20    10
                                  P            70    15    15
T and P denote an active treatment and placebo, respectively. ‘‘Random’’, ‘‘opposite’’ and ‘‘unblinded’’ indicate random,
incorrect and correct guesses for the given treatment assignment, respectively.
Table 5
Simulation results with 5% type I error
Case   DK (%)   James’ BI                        New BI
                BI (SEE)        Rejection (%)    Assignment   BI (SEE)          Rejection (%)
1      0        0.50 (0.022)    5                T            0.0017 (0.063)    5.2
                                                 P            0.0002 (0.063)    7.2
       30       0.65 (0.021)    0                T            0.0023 (0.053)    3.2
                                                 P            0.0001 (0.053)    5.8
       70       0.85 (0.016)    0                T            0.0008 (0.034)    4.4
                                                 P            0.0007 (0.034)    4.4
2      0        0.90 (0.013)    0                T            −0.80 (0.038)     0
                                                 P            −0.80 (0.037)     0
       30       0.90 (0.013)    0                T            −0.50 (0.042)     0
                                                 P            −0.50 (0.042)     0
       70       0.90 (0.013)    0                T            −0.10 (0.034)     0
                                                 P            −0.10 (0.034)     0
3      0        0.10 (0.013)    100              T            0.80 (0.038)      100
                                                 P            0.80 (0.037)      100
       30       0.40 (0.022)    99.4             T            0.50 (0.042)      100
                                                 P            0.50 (0.042)      100
       70       0.80 (0.018)    0                T            0.10 (0.034)      92
                                                 P            0.10 (0.034)      92.4
4      0        0.50 (0.013)    6.6              T            0.80 (0.038)      100
                                                 P            −0.80 (0.037)     0
       30       0.65 (0.017)    0                T            0.50 (0.042)      100
                                                 P            −0.50 (0.042)     0
       70       0.85 (0.015)    0                T            0.10 (0.034)      92
                                                 P            −0.10 (0.034)     0
5      0        0.70 (0.018)    0                T            −0.80 (0.038)     0
                                                 P            0.0002 (0.063)    7.2
       30       0.77 (0.015)    0                T            −0.50 (0.042)     0
                                                 P            0.0001 (0.053)    5.8
       70       0.88 (0.019)    0                T            −0.10 (0.034)     0
                                                 P            0.0007 (0.034)    4.4
6      0        0.30 (0.019)    100              T            0.80 (0.038)      100
                                                 P            0.0002 (0.063)    7.2
       30       0.53 (0.021)    0                T            0.50 (0.042)      100
                                                 P            0.0001 (0.053)    5.8
       70       0.83 (0.016)    0                T            0.10 (0.034)      92
                                                 P            0.0007 (0.034)    4.4
T and P denote an active treatment and placebo, respectively. BI denotes the sample mean of BI estimates and SEE is the
average of standard error estimates. Rejection (%) represents the proportion of {H0: blinding is maintained} being rejected. Two
hundred and fifty subjects are allocated to each treatment arm. The results are based on 500 simulations.
Throughout all the scenarios, when the proportion of individuals who decline to express their belief is
increased, the value of James’ BI is also increased. Since the James method emphasizes the contribution
of uncertain response when PDK is high, this method almost always concludes that blinding is
successfully fulfilled. As seen in the simulation study with PDK = 0.7, the James method never rejects the
H0 (i.e., rejection probability is 0). Even with PDK = 0.3, H0 is never rejected except in case 3 where the
entire study is severely unmasked and the conclusion could be controversial. The most important issue
here is how much weight should be assigned to those uncertain responses, which depends on how
uncertain those DK responses really are. In the CRISP study, which will be addressed later, those
participants who originally answered DK were asked to choose treatment assignment a second time. This
kind of ancillary data with a small number of non-responses may provide some assessment of the
validity of DK answers in the original data. In addition, when a response bias exists such as in case 4, the
James method could not detect this and would conclude that the study was successfully blinded. The
final decision can be highly subjective but it is important to identify the cause of any bias. For the cases
where participants in one arm randomly guess their assignment, and participants in another arm correctly
guess their assignment, such as case 6, the James method always concludes (un)blinding without
differentiating that one arm is actually blinded. Note that, for all opposite cases in either arm, James’ BI
always gives the estimate greater than 0.5, whereas our method diagnoses the situation properly by
providing negative values. In particular, it is interesting to note through case 2 that increasing PDK and
more opposite selection have similar effects on the final estimate of the James method.
In general, the James method works well, given that DK respondents are truly uncertain about the
treatment assignment, not just providing a pleasing response to the study physician or program
coordinator. However, the response bias, potential unblinding in other direction or differential behaviors
for different arms cannot be detected by the James method since it combines all the information from all
intervention arms into a univariate index value and some cancelling effect may occur erroneously.
In contrast, the new BI method assesses blinding in each arm separately, weighs identified responses
and detects unblinding in the opposite direction. Uncertain response comes into play indirectly in the
denominator. When PDK increases, the new BI is generally attenuated toward randomness, i.e., uncertain
response has a similar effect as random guessing. In the example of case 1, when PDK increases from 0 to
0.7, the new BI is attenuated toward 0, but the final result (i.e., rejection rate) is much the same in all
circumstances. Our new BI shows higher power to detect a small absolute difference and relatively low
degree of unblinding. In the example of case 2 with 70% uncertain response, the new BI method gives a
value −0.1, indicating it is close to being random instead of blind (James’ BI = 0.9). Of course, 10%
difference for determining (un)blindness may be subjective or even controversial, but it would be of
scientific interest to detect if any significant discrepancy exists beyond balance.
When blinding behaviors in the two arms are not similar, the new BI method can detect it and utilizes
more information conveyed by the data for the arm-specific assessment of blinding. Taking case 4
without uncertain response as an example, adjusted by Bonferroni correction at an overall 5%
significance level, the BI estimate for drug arm is significantly greater than 0 and that for placebo
arm is significantly less than 0 in each of 500 simulations. Further, the sum of BI estimates from both
arms in 99% of the 500 runs is not significantly away from zero, thus we report that response bias may
exist. Lastly, in those scenarios where the participants in one arm randomly guess treatment and those in
the other arm incorrectly or correctly guess treatment (cases 5 and 6), the new BI distinguishes the
arm-specific behaviors by yielding distinct index values.
In general, due to the separate evaluation of blindness for each arm, the new BI is more realistic and
accurate, especially in those circumstances where there exists response bias or significant difference in
terms of blinding nature in two arms, but it is still very simple to use.
5. An example: the CRISP study
Total and lipoprotein cholesterol levels continue to be one of the greatest risk factors of coronary heart
disease in people over 65 years old, but clinical trials with regard to this issue are very sparse. The
CRISP study was a five-center pilot study to assess feasibility of recruitment and efficacy of cholesterol
lowering in this age group. It was conducted from July 1990 to June 1994 [15]. The five centers included
Wake Forest School of Medicine, NC; George Washington University Medical Center, Washington, DC;
University of Minnesota, MN; University of Tennessee, TN; and University of Washington, WA. Four
hundred thirty-one subjects with low-density lipoprotein cholesterol levels greater than 4.1 and less than
5.7 mmol/l were randomized into the study, of whom 71% were women, 24% were minorities and the
mean age was 71 years. Participants were followed for 1 year while on a cholesterol-lowering diet plus
either placebo or one of two doses of the study drug, Lovastatin (20 or 40 mg/day). The primary
endpoint was change in blood lipid level. Although change in blood lipid level was an objective
measurement, whether or not the patients found out their treatment assignment may have affected their
compliance or attitude toward participation in the study, which makes the assessment of blinding an
important issue.
Table 6
CRISP study data in 2 × 5 format
Assignment    Response
              1      2      3      4      5 (DK)   Total
Lovastatin    38     44     21     4      170      277
Placebo       11     16     21     8      83       139
Total         49     60     42     12     253      416
1: Strongly believe the treatment is Lovastatin, 2: somewhat believe the treatment is Lovastatin, 3: somewhat believe the
treatment is placebo, 4: strongly believe the treatment is placebo and 5: DK.
At the end of the trial, all participants were asked to rate the extent to which they knew their
medication on a five-point scale: (1) strongly believe the treatment is Lovastatin, (2) somewhat believe
the treatment is Lovastatin, (3) somewhat believe the treatment is placebo, (4) strongly believe the
treatment is placebo and (5) DK. Of the 420 people who participated in the post-trial data collection
process, 4 participants did not answer this question and were deleted from the blinding assessment. The
distribution of the participants’ response in each treatment group (with two doses combined) is displayed
in Table 6. To simplify further, we also created a simpler table according to the 2×3 data structure
(presented in Table 7). The additional data from those who answered DK originally and were asked a
second time are also available in Table 8.
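The collapse from the 2×5 layout to the 2×3 layout amounts to merging the "strongly" and "somewhat" beliefs for each treatment. The following is a minimal sketch of that step, not the authors' code; the function name `collapse_to_2x3` and the dictionary layout are illustrative assumptions, and the counts are those reported in Table 6.

```python
# Counts per assigned arm for responses 1-4 and 5 (DK), as reported in Table 6.
table_2x5 = {
    "Lovastatin": [38, 44, 21, 4, 170],
    "Placebo": [11, 16, 21, 8, 83],
}


def collapse_to_2x3(row):
    """Merge 'strongly' and 'somewhat' beliefs into a single guess per treatment."""
    strong_lov, some_lov, some_pla, strong_pla, dk = row
    return [strong_lov + some_lov, some_pla + strong_pla, dk]


# Each row becomes [guessed Lovastatin, guessed placebo, DK].
table_2x3 = {arm: collapse_to_2x3(row) for arm, row in table_2x5.items()}
print(table_2x3)
```

The resulting rows agree with the counts shown in Table 7.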
James’ BI using the weights of 0, 0.5 and 1 for correct, wrong and uncertain response, respectively,
gave the estimate of 0.75 (95% CI: 0.71, 0.78), implying that the CRISP study was well-blinded. The
point estimates from different data configurations and weights varied between 0.74 and 0.78, always
reaching the conclusion of effective masking.
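As an illustration, the point estimate of James' BI can be recomputed from the 2×3 counts. The sketch below assumes one common statement of the index, BI = {1 + P_DK + (1 - P_DK) K_D}/2 with K_D = (P_Do - P_De)/P_De, in which the weight of 1 for DK is built into the closed form; this form is an assumption on our part rather than a restatement of the original derivation, and the function name `james_bi` is hypothetical. Variance and confidence limits are not computed.

```python
def james_bi(table, w_correct=0.0, w_incorrect=0.5):
    """Point estimate of James' BI under the assumed closed form.

    table[i] = [guessed arm 0, guessed arm 1, DK] for participants assigned to arm i.
    """
    n = sum(sum(row) for row in table)
    p = [[count / n for count in row] for row in table]
    p_dk = p[0][2] + p[1][2]
    weights = [[w_correct, w_incorrect], [w_incorrect, w_correct]]
    # Proportion in each arm giving a definite guess, and marginal proportion of each guess.
    known_by_arm = [p[i][0] + p[i][1] for i in range(2)]
    guess_margin = [p[0][j] + p[1][j] for j in range(2)]
    cells = [(i, j) for i in range(2) for j in range(2)]
    # Observed and expected weighted disagreement among definite guesses.
    p_do = sum(weights[i][j] * p[i][j] for i, j in cells) / (1 - p_dk)
    p_de = sum(weights[i][j] * known_by_arm[i] * guess_margin[j] for i, j in cells) / (1 - p_dk) ** 2
    k_d = (p_do - p_de) / p_de
    return (1 + p_dk + (1 - p_dk) * k_d) / 2


# Rows: assigned Lovastatin, assigned placebo; columns: guessed Lovastatin, guessed placebo, DK.
crisp_2x3 = [[82, 25, 170], [27, 29, 83]]
james_estimate = james_bi(crisp_2x3)
print(round(james_estimate, 2))
```

With the CRISP counts, the large share of DK responses (253 of 416) dominates the index, which is why the estimate sits well above 0.5.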
Table 7
CRISP study data in 2×3 format
Assignment    Response
              Lovastatin   Placebo   DK     Total
Lovastatin    82           25        170    277
Placebo       27           29        83     139
Total         109          54        253    416
Table 8
Ancillary data from DK in CRISP study
Assignment    Response
              Lovastatin   Placebo   No answer   Total
Lovastatin    79           86        5           170
Placebo       36           45        2           83
Total         115          131       7           253
The new BI gave an estimate of 0.21 (95% CI: 0.15, 0.26) for the Lovastatin arm, meaning a
significant excess of correct guesses, while it gave a value of 0.01 (95% CI: -0.07, 0.10) for the placebo
arm, showing a pattern consistent with a random distribution of responses. These results can be directly
interpreted as 21% of participants correctly guessed their treatment beyond chance in the Lovastatin arm,
whereas only 1% of participants did in the placebo arm. When we use Tables 6 and 8 with the symmetric
weights of 1, 0.5 and 0.25 for strongly believe, somewhat believe and believe but originally uncertain,
respectively, the new BI estimate is 0.16 (95% CI: 0.11, 0.21) for the Lovastatin arm and 0.01 (95% CI:
-0.06, 0.08) for the placebo arm (excluding seven participants who refused to answer again). Therefore,
we attain the same conclusion for either data structure. Note that these estimates are slightly attenuated
toward the null since the ancillary data strongly indicate randomness. We also find that the result is
highly robust to different weight specifications. In particular, more patients who took the high dose
tended to guess correctly compared to those who took the low dose (data not shown).
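The arm-specific estimates can be recomputed from the tabulated counts. The sketch below is a minimal illustration under stated assumptions: the index is taken as the weighted difference between the proportions of correct and incorrect guesses within an arm, and its variance follows from treating the arm's responses as multinomial. The function name `new_bi` is hypothetical, and the multiplier z = 1.645 is an assumption chosen because it appears to match the interval limits quoted in the text; a conventional two-sided 95% interval would use 1.96.

```python
from math import sqrt


def new_bi(correct, incorrect, weights, n, z=1.645):
    """Arm-specific blinding index with approximate confidence limits.

    correct[k], incorrect[k]: counts of correct/incorrect guesses at certainty level k.
    weights[k]: weight attached to level k.  n: arm size, including DK responses.
    """
    bi = sum(w * (c - i) for w, c, i in zip(weights, correct, incorrect)) / n
    # Second moment of the per-participant score (+w, -w, or 0) under a multinomial model.
    second_moment = sum(w * w * (c + i) for w, c, i in zip(weights, correct, incorrect)) / n
    se = sqrt((second_moment - bi ** 2) / n)
    return bi, bi - z * se, bi + z * se


# 2x3 format (Table 7): a single certainty level with weight 1.
lov_2x3 = new_bi([82], [25], [1.0], 277)
pla_2x3 = new_bi([29], [27], [1.0], 139)

# 2x5 format plus ancillary guesses (Tables 6 and 8), weights 1, 0.5 and 0.25;
# participants who gave no answer on the second question are excluded from n.
lov_2x5 = new_bi([38, 44, 79], [4, 21, 86], [1.0, 0.5, 0.25], 277 - 5)
pla_2x5 = new_bi([8, 21, 45], [11, 16, 36], [1.0, 0.5, 0.25], 139 - 2)

for label, result in [("Lovastatin 2x3", lov_2x3), ("Placebo 2x3", pla_2x3),
                      ("Lovastatin 2x5", lov_2x5), ("Placebo 2x5", pla_2x5)]:
    print(label, [round(value, 2) for value in result])
```

Under these assumptions the lower limits for the placebo arm are negative in both data structures, so the placebo intervals contain zero while the Lovastatin intervals do not.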
6. Discussion
Many double-blind trials are conducted worldwide every year and there is substantial literature on the
technology for and potential effects of successful blinding. Nonetheless, only a very small number of
studies have systematically assessed the views of trial subjects regarding the identity of the assigned
treatment. It is partly due to the absence of a standard method. In this paper, we investigate a new BI and
compare it with the method proposed by James et al. [4]. Most methods previous to James’ simply ignore
unknown responses and/or were not developed for statistical inference, but for descriptive purpose. If
there is a high percentage of nonrespondents, these methods might result in misleading conclusions. The
James’ method gives a single index value to assess the blindness of the entire clinical trial, emphasizing
uncertain responses. If there is a moderate to high percentage of those responses in the data (P_DK ≥ 0.3),
this method rarely discovers the moderate to low level of unblinding as demonstrated in the simulation
study. Moreover, this method combines all treatment arms, so some cancelling or mixing effects may
result in wrong conclusions when there is a significant difference in blinding between the two arms or
when there exists response bias. A fundamental assumption for the James’ method is that the participants
with uncertain responses truly do not know their assignment, but this is sometimes hard to ascertain.
In this article, we propose a new BI based on a multinomial test. Because this is evaluated in each arm
separately, it utilizes detailed information from all available data including DK, with more weights to
known answers. The new BI can not only be used for hypothesis testing but also be directly interpreted
as the proportion of unblinded participants beyond balance, where more emphasis is placed on the latter property
due to the nature of the study. It has the ability to detect a relatively low degree of unblinding, response
bias and the different behaviors including unblinding in the opposite direction. Since the index is
computed separately for each treatment arm, generalization to more than two arms is straightforward,
although one may need to adjust for multiple tests using the Bonferroni correction.
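A Bonferroni-adjusted, arm-by-arm test can be sketched as follows. This is an illustration only: it assumes a one-sided Wald z-test of BI > 0 in each arm with a multinomial variance, and the function name `arm_z` is hypothetical. The counts are the CRISP 2×3 data of Table 7; with more than two arms, further entries would simply be added to `arms`.

```python
from math import sqrt
from statistics import NormalDist


def arm_z(correct, incorrect, n):
    """Wald z-statistic for H0: BI = 0 in one arm (2x3 format, unit weights)."""
    bi = (correct - incorrect) / n
    se = sqrt(((correct + incorrect) / n - bi ** 2) / n)
    return bi / se


# (correct guesses, incorrect guesses, arm size including DK) from Table 7.
arms = {"Lovastatin": (82, 25, 277), "Placebo": (29, 27, 139)}

overall_alpha = 0.05
# Bonferroni: each arm is tested at overall_alpha divided by the number of arms.
critical_z = NormalDist().inv_cdf(1 - overall_alpha / len(arms))

z_scores = {arm: arm_z(*counts) for arm, counts in arms.items()}
excess_correct = {arm: z > critical_z for arm, z in z_scores.items()}
print(excess_correct)
```

Testing for unblinding in the opposite direction (BI < 0) would use the lower tail in the same way.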
It is possible that unsuccessful blindness may not influence statistical analyses in the end. For
example, in a nicotine gum study, even if the trial was regarded as partially unblinded, a drug effect was
preserved among subjects who correctly identified their drug, among subjects who incorrectly identified
their drug and among subjects who could not tell which drug they received, implying failure to maintain
blindness did not alter the results of the study for different subgroups [7]. On the other hand, in a well-known trial of vitamin C, the perceptions affected the endpoint concerning cold symptoms [16]. It has
also been suggested that adequacy of the blind may strongly influence the adherence of subjects to their
regimens. In this context, subjects’ efforts to unblind their assignment should be minimized. It may be an
issue even in trials where more objective outcomes such as mortality are used. Knowledge about the
treatment condition could lead to better or worse ancillary care and more or less attention. These factors
could have an impact even on the survival time.
While studying this topic, we realized that assessing blindness is not a simple process. Statistically it
is straightforward, but the final conclusion is hard to summarize in one decision criterion. It relies
heavily on the investigators’ subjectivity and the nature of the study. For example, Howard et al. claimed
that patient blinding worked effectively even though about 67% of participants correctly identified
treatment assignment [17]. Clearly, 67% might be high enough for a majority of researchers to reject the claim that
blinding was implemented properly. The patients’ conception of their response options can also vary.
An incorrect guess may actually imply success of blinding, failure of blinding or even failure by the patients
to understand the pros and cons of each treatment. This may be particularly pertinent in studies
comparing two active treatments. However, under a random selection process, we can hypothesize
statistically that it would occur in a balanced manner.
We should make every effort to maximize the credibility and quality of the data. Of course, there is no
way of knowing whether the respondents are completely honest in discussing their guesses and how they
were determined. Some people may underplay the extent of certainty or hide relevant clues. The most
important contribution to the blind is the respondents’ commitment to the purpose of the study. Toward
this end, we should encourage all participants to make their best guess of the assigned treatment.
Although responses other than DK are preferred from a statistical point of view, inclusion of an extra
question for nonrespondents should be carefully discussed in the design stage of the trial. If unblinding is
detected, we recommend that additional subgroup analyses be conducted to adjust for unblinding and
extraordinary efforts be made to identify the unblinding factors, which would offer constructive
suggestions for the design and conduct of similar studies in the future.
The new BI appears to have several desirable properties for assessing blinding success. In practice, a
combination of our BI, statistical testing, classification of case diagnosis (according to Table 4), careful
interpretation and potential cause identification can provide a comprehensive evaluation of the blindness
of clinical trials.
Acknowledgements
We thank participants for agreeing to be interviewed for the CRISP project. We also want to thank
both referees and the editor-in-chief for very helpful comments, which remarkably improved this paper.
References
[1] Schulz KF, Grimes DA. Epidemiology series: blinding in randomized trials: hiding who got what. Lancet 2002;359(9307):696–700.
[2] Day SJ. Dictionary for clinical trials. New York: Wiley; 1999.
[3] Day SJ, Altman DG. Blinding in clinical trials and other studies. BMJ 2000;321:504.
[4] James KE, Lee KK, Kraemer HC, Fuller RK. An index for assessing blindness in a multicenter clinical trial: disulfiram for alcohol cessation--a VA cooperative study. Stat Med 1990;15:1421–1434.
[5] Morin CM, Colecchi C, Brink D, et al. How "blind" are double-blind placebo-controlled trials of benzodiazepine hypnotics? Sleep 1995;18(4):240–245.
[6] Schulz KF, Grimes DA, Altman DG, Hayes RJ. Blinding and exclusions after allocation in randomized controlled trials: survey of published parallel group trials in obstetrics and gynecology. BMJ 1996;312:742–744.
[7] Hughes JR, Krahn D. Blindness and the validity of the double-blind procedure. J Clin Psychopharmacol 1985;5(3):138–142.
[8] Deyo RA, Walsh NE, Schoenfeld LS, Ramamurthy S. Can trials of physical treatments be blinded? Am J Phys Med Rehabil 1990;69(1):6–10.
[9] Jespersen CM, the Danish Study Group on Verapamil in Myocardial Infarction. Assessment of blindness in the Danish Verapamil Infarction trial II (DAVIT II). Eur J Clin Pharmacol 1990;39:75–76.
[10] Marini JL, Sheard MH, Bridges CI, Wagner Jr E. An evaluation of the double-blind design in a study comparing lithium carbonate with placebo. Acta Psychiatr Scand 1976;53:343–354.
[11] Rabkin JG, Markowitz JS, Stewart J, et al. How blind is blind? Assessment of patient and doctor medication guesses in a placebo-controlled trial of imipramine and phenelzine. Psychiatry Res 1986;19:75–86.
[12] Reitmeir P, Wassmer G. Resampling-based methods for the analysis of multiple endpoints in clinical trials. Stat Med 1999;18:3453–3462.
[13] Silvapulle MJ. On tests against one-sided hypotheses in some generalized linear models. Biometrics 1994;50:853–858.
[14] Anderson TW, Reid DBW, Beaton GH. Vitamin C and the common cold: a double-blinded trial. Can Med Assoc J 1972;107:503–508.
[15] LaRosa JC, Applegate W, Crouse JR, et al. Cholesterol lowering in the elderly. Arch Intern Med 1994;154:529–539.
[16] Karlowski TR, Chalmers TC, Frenkel LD, et al. Ascorbic acid for the common cold: a prophylactic and therapeutic trial. JAMA 1975;231:1038–1042.
[17] Howard J, Whittemore AS, Hoover JJ, Panos M, the Aspirin Myocardial Infarction Study Research Group. How blind was the patient blind in AMIS. Clin Pharmacol Ther 1982;32:543–553.