Biometrical Journal 42 (2000) 1, 59–69
Confidence Intervals of the Simple Difference
between the Proportions of a Primary Infection
and a Secondary Infection, Given the Primary Infection
Kung-Jong Lui
Department of Mathematical Sciences
College of Sciences
San Diego State University
USA
Summary
This paper discusses interval estimation of the simple difference (SD) between the proportions of the
primary infection and the secondary infection, given the primary infection, by developing three
asymptotic interval estimators using Wald's test statistic, the likelihood ratio test, and the basic
principle of Fieller's theorem. This paper further evaluates and compares the performance of these
interval estimators with respect to the coverage probability and the expected length of the resulting
confidence intervals. This paper finds that the asymptotic confidence interval using the likelihood
ratio test consistently performs well in all situations considered here. When the underlying SD is
within 0.10 and the total number of subjects is not large (say, 50), this paper further finds that the
interval estimator using Fieller's theorem is preferable to the estimator using Wald's test statistic
if the primary infection probability is moderate (say, 0.30), but the latter is preferable to the
former if this probability is large (say, 0.80). When the total number of subjects is large (say, 200),
all three interval estimators perform well in almost all situations considered in this paper. In these
cases, for simplicity, we may apply either of the two interval estimators using Wald's test statistic
or Fieller's theorem without losing much accuracy or efficiency as compared with the interval
estimator using the asymptotic likelihood ratio test.
Key words: Interval Estimation; Coverage probability; Likelihood ratio test; Fieller's Theorem.
1. Introduction
To establish the characteristics of a given disease, one of the interesting problems is
to assess the effect due to the primary infection on the likelihood of developing the
secondary infection. For example, consider the data (Agresti, 1990, Pages 45–46)
about a sample of calves. Calves are first classified by whether they get a primary
pneumonia infection. After recovering from the primary infection, calves are then
reclassified by whether they develop a secondary infection within a defined time
period. In this situation, observations are taken from the same group of calves and
hence are likely to be dependent. Therefore, when estimating the simple difference
(SD) between the probability of the primary infection and the conditional probability
of the secondary infection, given the primary infection, we cannot apply any of the interval
estimators of SD developed for two independent samples (Thomas and Gart, 1977; Anbar, 1983, 1984;
Beal, 1987; Mee, 1984; Hauck and Anderson, 1986; Miettinen and Nurminen, 1985; Santner and
Snell, 1980; Wallenstein, 1997). Note that a completely randomized trial, in which calves are
randomly allocated to the control and experimental groups, is certainly neither ethical nor adequate
for use here.
In this paper, we concentrate on interval estimation of the SD between the probability of the
primary infection and the conditional probability of the secondary infection, given the primary
infection. We develop three asymptotic interval estimators using Wald's test statistic, the
likelihood ratio test, and the basic principle of Fieller's theorem. To evaluate and compare the
performance of these interval estimators, we calculate the coverage probability and the expected
length of the resulting confidence intervals on the basis of the exact distribution in a variety
of situations. We find that the interval estimator using the asymptotic likelihood ratio test,
which involves a sophisticated numerical procedure, consistently performs well in all the
situations considered here. When the underlying SD is within 0.10 and the total number of subjects
is not large (say, 50), the interval estimator using Fieller's theorem is preferable to the
estimator using Wald's test statistic if the underlying primary infection probability is moderate
(say, 0.30). On the other hand, the latter is preferable to the former if the underlying primary
infection probability is high (say, 0.80). When the total number of subjects is large (say, 200),
all three estimators perform reasonably well in almost all situations considered in this paper.
Therefore, for simplicity, we may apply either of the two asymptotic interval estimators using
Wald's test statistic or Fieller's theorem in these situations without losing much accuracy or
efficiency as compared with the asymptotic confidence interval using the likelihood ratio test.
Note that Agresti (1990) discusses a hypothesis testing procedure for testing whether there is an
effect due to the primary infection on the probability of developing the secondary infection, and
Lui (1998) discusses interval estimation of the risk ratio between the two successive infections.
However, neither of these two works considers interval estimation of the SD as focused here.
2. Interval Estimators
Consider a study in which the data can be summarized by use of the following $2 \times 2$ table:

                              Secondary infection
                              Yes         No
Primary infection    Yes      $p_{11}$    $p_{12}$    $p_{1\cdot}$
                     No       -           $p_{22}$    $p_{22}$

where $0 < p_{ij} < 1$ (for $i = 1, 2$ and $j = 1, 2$) denotes the probability of the corresponding
cell, $p_{1\cdot} = p_{11} + p_{12}$, and $p_{1\cdot} + p_{22} = 1$. As also noted elsewhere
(Agresti, 1990), by definition, no subject can have the secondary infection without first having
the primary infection (i.e., $p_{21} = 0$).
In this paper, we focus discussion on interval estimation of the SD between the probability of the
primary infection and the conditional probability of the secondary infection, given the primary
infection. In terms of the $p_{ij}$, the SD, denoted by $d$, is defined as
$p_{1\cdot} - (p_{11}/p_{1\cdot})$. Hence, for given $p_{1\cdot}$ and $d$, we have
$p_{11} = p_{1\cdot}(p_{1\cdot} - d)$, $p_{12} = p_{1\cdot}(1 - p_{1\cdot} + d)$, and
$p_{22} = (1 - p_{1\cdot})$. Note that the range for $d$, by definition, is $-1 < d < 1$.
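As a minimal sketch of the reparameterization above, the following Python function maps a pair $(p_{1\cdot}, d)$ to the three cell probabilities. The function name and the decision to raise an error for inadmissible pairs are illustrative assumptions, not part of the paper.

```python
def cell_probabilities(p1: float, d: float) -> tuple[float, float, float]:
    """Return (p11, p12, p22) implied by the primary-infection probability p1 and the SD d."""
    p11 = p1 * (p1 - d)          # P(primary and secondary infection)
    p12 = p1 * (1.0 - p1 + d)    # P(primary infection only)
    p22 = 1.0 - p1               # P(no primary infection)
    if min(p11, p12, p22) <= 0.0:
        # Hypothetical guard: the paper only considers strictly positive cells.
        raise ValueError("(p1, d) does not give strictly positive cell probabilities")
    return p11, p12, p22


p11, p12, p22 = cell_probabilities(0.30, 0.20)
print(p11, p12, p22)
```

With `p1 = 0.30` and `d = 0.20` the smallest cell probability is `p11 = 0.03`, which is why sparse cells become an issue for small samples in this configuration.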
Suppose that we take a random sample of $n$ subjects. Let $n_{ij}$ denote the corresponding number
of subjects who fall in the cell with probability $p_{ij}$. The log-likelihood for a given
$(n_{11}, n_{12}, n_{22})$ is then
$$
\log(L) = C + n_{11}\{\log(p_{1\cdot}) + \log(p_{1\cdot} - d)\}
+ n_{12}\{\log(p_{1\cdot}) + \log(1 - p_{1\cdot} + d)\} + n_{22}\log(1 - p_{1\cdot}),
\tag{1}
$$
where $C$ is a constant that does not depend on the parameters $d$ and $p_{1\cdot}$. On the basis of
(1), we can easily show that the maximum likelihood estimates (MLEs) of $p_{1\cdot}$ and $d$
are $\hat{p}_{1\cdot} = (n_{11} + n_{12})/n$ and $\hat{d} = \hat{p}_{1\cdot} - (\hat{p}_{11}/\hat{p}_{1\cdot})$,
respectively, where $\hat{p}_{11} = n_{11}/n$. Furthermore, using the inverse of the observed
information matrix, we obtain the estimate $\widehat{\mathrm{Var}}(\hat{d})$ of the asymptotic
variance for $\hat{d}$ to be
$\{\hat{p}_{11}\hat{p}_{12}/\hat{p}_{1\cdot}^3 + \hat{p}_{1\cdot}(1 - \hat{p}_{1\cdot})\}/n$
(Appendix). Therefore, the asymptotic $100(1 - \alpha)\%$ confidence interval for $d$ is
$$
[m_l, m_u],
\tag{2}
$$
where $m_l = \max\{-1, \hat{d} - Z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{d})}\}$ and
$m_u = \min\{1, \hat{d} + Z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{d})}\}$,
and $Z_{\alpha}$ is the upper $(100\alpha)$th percentile of the standard normal distribution.
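The Wald-type interval (2) can be sketched in a few lines of Python. This is an illustrative implementation of the formulas above, not the author's SAS program; the function name `wald_interval` is an assumption, and the counts are the calf data analysed in Section 5.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(n11, n12, n22, alpha=0.05):
    """Return (d_hat, lower, upper) for interval estimator (2)."""
    n = n11 + n12 + n22
    if n11 + n12 == 0:
        raise ValueError("p1_hat = 0, so d_hat is not defined")
    p1_hat = (n11 + n12) / n
    p11_hat = n11 / n
    p12_hat = n12 / n
    d_hat = p1_hat - p11_hat / p1_hat
    # Estimated asymptotic variance of d_hat (see the Appendix).
    var_hat = (p11_hat * p12_hat / p1_hat**3 + p1_hat * (1.0 - p1_hat)) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point of N(0, 1)
    half_width = z * sqrt(var_hat)
    # Truncate to the admissible range -1 < d < 1.
    return d_hat, max(-1.0, d_hat - half_width), min(1.0, d_hat + half_width)


d_hat, lower, upper = wald_interval(30, 63, 63)
print(round(d_hat, 3), round(lower, 3), round(upper, 3))  # 0.274 0.151 0.396
```

The printed values agree with the estimate and the first interval reported for the calf data in Section 5.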
For testing $H_0: d = d_0$ versus $H_a: d \ne d_0$, it is easy to see that the acceptance
region using the asymptotic likelihood ratio test consists of all sample vectors
$(n_{11}, n_{12}, n_{22})$ such that
$$
\begin{aligned}
2\bigl(&n_{11}\log(\hat{p}_{11}) + n_{12}\log(\hat{p}_{12}) + n_{22}\log(\hat{p}_{22}) \\
&- n_{11}[\log\{\hat{p}_{1\cdot}(d_0)\} + \log\{\hat{p}_{1\cdot}(d_0) - d_0\}] \\
&- n_{12}[\log\{\hat{p}_{1\cdot}(d_0)\} + \log\{1 - \hat{p}_{1\cdot}(d_0) + d_0\}] \\
&- n_{22}\log\{1 - \hat{p}_{1\cdot}(d_0)\}\bigr) \le \chi^2_{\alpha},
\end{aligned}
\tag{3}
$$
where $\hat{p}_{ij} = n_{ij}/n$ is the MLE of $p_{ij}$; $\hat{p}_{1\cdot}(d_0)$ denotes the
conditional MLE of $p_{1\cdot}$ for a given fixed $d_0$ (Appendix); and $\chi^2_{\alpha}$ is the
upper $(100\alpha)$th percentile of the central $\chi^2$-distribution with one degree of freedom.
Therefore, we can obtain the asymptotic likelihood ratio test based confidence interval by
inverting the acceptance region (Casella and Berger, 1990):
$$
[r_l, r_u],
\tag{4}
$$
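where $-1 < r_l < r_u < 1$ are the smaller and the larger roots of $d_0$ such that
$$
\begin{aligned}
2\bigl(&n_{11}\log(\hat{p}_{11}) + n_{12}\log(\hat{p}_{12}) + n_{22}\log(\hat{p}_{22}) \\
&- n_{11}[\log\{\hat{p}_{1\cdot}(d_0)\} + \log\{\hat{p}_{1\cdot}(d_0) - d_0\}] \\
&- n_{12}[\log\{\hat{p}_{1\cdot}(d_0)\} + \log\{1 - \hat{p}_{1\cdot}(d_0) + d_0\}] \\
&- n_{22}\log\{1 - \hat{p}_{1\cdot}(d_0)\}\bigr) = \chi^2_{\alpha}.
\end{aligned}
$$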
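The paper does not spell out the numerical procedure behind (4), so the following Python sketch is only one plausible way to carry it out. It assumes plain bisection both for the conditional MLE $\hat{p}_{1\cdot}(d_0)$ (the root of the score equation (A.1) in the Appendix) and for the two roots $r_l$ and $r_u$; all function names and the small offsets used to stay inside the open parameter ranges are illustrative choices. The critical value uses the identity that the upper $\alpha$ point of $\chi^2$ with one degree of freedom equals $Z_{\alpha/2}^2$.

```python
from math import log
from statistics import NormalDist


def bisect(f, lo, hi, iters=100):
    """Bisection root finder; f(lo) and f(hi) must have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def conditional_mle_p1(d0, n11, n12, n22):
    """Root in p1 of the score equation (A.1) with d fixed at d0."""
    def score(p1):
        return (n11 * (1 / p1 + 1 / (p1 - d0))
                + n12 * (1 / p1 - 1 / (1 - p1 + d0))
                - n22 / (1 - p1))
    tiny = 1e-12  # keeps p1 strictly inside (max{0, d0}, min{1, 1 + d0})
    return bisect(score, max(0.0, d0) + tiny, min(1.0, 1.0 + d0) - tiny)


def lr_statistic(d0, n11, n12, n22):
    """Left-hand side of (3): twice the log-likelihood ratio for H0: d = d0."""
    n = n11 + n12 + n22
    p1 = conditional_mle_p1(d0, n11, n12, n22)
    unrestricted = n11 * log(n11 / n) + n12 * log(n12 / n) + n22 * log(n22 / n)
    restricted = (n11 * (log(p1) + log(p1 - d0))
                  + n12 * (log(p1) + log(1 - p1 + d0))
                  + n22 * log(1 - p1))
    return 2.0 * (unrestricted - restricted)


def lr_interval(n11, n12, n22, alpha=0.05):
    """Return (r_l, r_u) of (4); requires all three cell counts to be positive."""
    if min(n11, n12, n22) == 0:
        raise ValueError("estimator (4) is not applicable when a cell count is 0")
    d_hat = (n11 + n12) / (n11 + n12 + n22) - n11 / (n11 + n12)
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2  # chi-square(1) upper alpha point

    def g(d0):
        return lr_statistic(d0, n11, n12, n22) - crit

    edge = 1e-6  # stay strictly inside -1 < d0 < 1
    return bisect(g, -1.0 + edge, d_hat), bisect(g, d_hat, 1.0 - edge)


lower, upper = lr_interval(30, 63, 63)
print(round(lower, 3), round(upper, 3))
```

For the calf data of Section 5 the paper reports [0.148, 0.392] for this estimator, which gives a reference point for checking the sketch.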
Recall that, by definition, the $d$ defined here can be rewritten as a ratio
$(p_{1\cdot}^2 - p_{11})/p_{1\cdot}$. Following Fieller's theorem (Casella and Berger, 1990), we
define $Z = ((n\hat{p}_{1\cdot}^2 - \hat{p}_{1\cdot})/(n - 1) - \hat{p}_{11}) - d\,\hat{p}_{1\cdot}$.
Note that the expectation $E((n\hat{p}_{1\cdot}^2 - \hat{p}_{1\cdot})/(n - 1)) = p_{1\cdot}^2$ and
$E(\hat{p}_{11}) = p_{11}$. Thus, $E(Z) = 0$. By use of the delta method and the multivariate
Central Limit Theorem (Anderson, 1958), we can easily show that $\sqrt{n}\,Z$ asymptotically
follows the normal distribution with mean 0 and asymptotic variance
$$
\mathrm{Var}_3 = p_{11}(1 - p_{11}) + [(2np_{1\cdot} - 1)/(n - 1) - d]^2\, p_{1\cdot}(1 - p_{1\cdot})
- 2[(2np_{1\cdot} - 1)/(n - 1) - d]\, p_{11}p_{22}.
$$
Thus, the probability $P\{Z^2/(\mathrm{Var}_3/n) \le Z_{\alpha/2}^2\} \doteq 1 - \alpha$ if $n$ is
large. This leads us to consider the following working quadratic equation in $d$:
$$
\hat{A}d^2 + \hat{B}d + \hat{C} \le 0,
\tag{5}
$$
where
$$
\begin{aligned}
\hat{A} &= \hat{p}_{1\cdot}^2 - Z_{\alpha/2}^2\, \hat{p}_{1\cdot}(1 - \hat{p}_{1\cdot})/n, \\
\hat{B} &= -2\bigl([(n\hat{p}_{1\cdot}^2 - \hat{p}_{1\cdot})/(n - 1) - \hat{p}_{11}]\,\hat{p}_{1\cdot}
 - Z_{\alpha/2}^2 [(2n\hat{p}_{1\cdot} - 1)\,\hat{p}_{1\cdot}(1 - \hat{p}_{1\cdot})/[(n - 1)n]
 - \hat{p}_{11}\hat{p}_{22}/n]\bigr), \\
\hat{C} &= [(n\hat{p}_{1\cdot}^2 - \hat{p}_{1\cdot})/(n - 1) - \hat{p}_{11}]^2
 - Z_{\alpha/2}^2 \bigl(\hat{p}_{11}(1 - \hat{p}_{11})/n
 + (2n\hat{p}_{1\cdot} - 1)^2\, \hat{p}_{1\cdot}(1 - \hat{p}_{1\cdot})/[(n - 1)^2 n]
 - 2(2n\hat{p}_{1\cdot} - 1)\,\hat{p}_{11}\hat{p}_{22}/[(n - 1)n]\bigr).
\end{aligned}
$$
If both $\hat{A} > 0$ and $\hat{B}^2 - 4\hat{A}\hat{C} > 0$, then the asymptotic
$100(1 - \alpha)\%$ confidence interval of SD for large $n$ is given by
$$
[q_l, q_u],
\tag{6}
$$
where
$q_l = \max\{-1, (-\hat{B} - \sqrt{\hat{B}^2 - 4\hat{A}\hat{C}})/(2\hat{A})\}$
and
$q_u = \min\{1, (-\hat{B} + \sqrt{\hat{B}^2 - 4\hat{A}\hat{C}})/(2\hat{A})\}$.
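The coefficients of (5) and the limits in (6) translate directly into code. The Python sketch below is an illustrative implementation, not the author's program; the name `fieller_interval`, the helper variables `t` and `h`, and the choice to return `None` when the conditions on $\hat{A}$ and the discriminant fail are assumptions made here for readability.

```python
from math import sqrt
from statistics import NormalDist


def fieller_interval(n11, n12, n22, alpha=0.05):
    """Return (q_l, q_u) of interval estimator (6), or None when (6) does not apply."""
    n = n11 + n12 + n22
    p1 = (n11 + n12) / n
    p11 = n11 / n
    p22 = n22 / n
    z2 = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2

    t = (n * p1**2 - p1) / (n - 1) - p11   # unbiased estimate of p1^2 - p11
    h = (2 * n * p1 - 1) / (n - 1)         # recurring factor (2 n p1 - 1)/(n - 1)
    v1 = p1 * (1.0 - p1)

    a = p1**2 - z2 * v1 / n
    b = -2.0 * (t * p1 - z2 * (h * v1 / n - p11 * p22 / n))
    c = t**2 - z2 * (p11 * (1.0 - p11) / n + h**2 * v1 / n - 2.0 * h * p11 * p22 / n)

    disc = b * b - 4.0 * a * c
    if a <= 0.0 or disc <= 0.0:
        return None  # the paper requires A > 0 and B^2 - 4AC > 0
    root = sqrt(disc)
    return max(-1.0, (-b - root) / (2.0 * a)), min(1.0, (-b + root) / (2.0 * a))


limits = fieller_interval(30, 63, 63)
print(tuple(round(x, 3) for x in limits))  # (0.137, 0.385)
```

The output matches the third interval reported for the calf data in Section 5.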
3. Coverage Probability and Expected Length
To evaluate the finite-sample performance of interval estimators (2, 4, and 6) for
the SD, we calculate the coverage probability and the expected length of the resulting 95%
confidence interval on the basis of the exact trinomial distribution. By definition, the
coverage probability is simply equal to
$\sum 1(d \in [c_l, c_u])\, f(n_{11}, n_{12}, n_{22})$, where $[c_l, c_u]$ is the confidence
interval obtained by use of (2, 4, and 6) and is a function of $(n_{11}, n_{12}, n_{22})$,
$1(d \in [c_l, c_u])$ is the indicator function, equal to 1 if $d \in [c_l, c_u]$ is true and
0 otherwise, and $f(n_{11}, n_{12}, n_{22})$ is
the trinomial distribution with the underlying cell probabilities $p_{11}$, $p_{12}$, and $p_{22}$.
Similarly, the expected length of the resulting confidence interval is given by
$\sum (c_u - c_l)\, f(n_{11}, n_{12}, n_{22})$.
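To make the two sums concrete, the following Python sketch enumerates every trinomial outcome for a given $n$ and accumulates the coverage probability and expected length of the Wald-type interval (2). It is a simplified stand-in for the SAS programs mentioned in the paper: the function names are hypothetical, only estimator (2) is wired in, and results are normalized by the probability of the samples for which the interval exists.

```python
from math import comb, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # 95% two-sided critical value


def wald_limits(n11, n12, n22):
    """Interval estimator (2); returns None when p1_hat = 0."""
    n = n11 + n12 + n22
    if n11 + n12 == 0:
        return None
    p1 = (n11 + n12) / n
    d_hat = p1 - n11 / (n11 + n12)
    var = ((n11 / n) * (n12 / n) / p1**3 + p1 * (1.0 - p1)) / n
    half = Z * sqrt(var)
    return max(-1.0, d_hat - half), min(1.0, d_hat + half)


def trinomial_pmf(n11, n12, n22, p11, p12, p22):
    """Exact probability f(n11, n12, n22) of one outcome."""
    n = n11 + n12 + n22
    return comb(n, n11) * comb(n - n11, n12) * p11**n11 * p12**n12 * p22**n22


def coverage_and_length(n, p1, d, interval=wald_limits):
    """Return (coverage, expected length, P(no interval)) by exact enumeration."""
    p11, p12, p22 = p1 * (p1 - d), p1 * (1.0 - p1 + d), 1.0 - p1
    covered = length = defined = 0.0
    for n11 in range(n + 1):
        for n12 in range(n - n11 + 1):
            n22 = n - n11 - n12
            f = trinomial_pmf(n11, n12, n22, p11, p12, p22)
            limits = interval(n11, n12, n22)
            if limits is None:
                continue  # sample for which the estimator fails
            lo, hi = limits
            defined += f
            covered += f * (lo <= d <= hi)
            length += f * (hi - lo)
    # Condition on the samples in which the limits exist.
    return covered / defined, length / defined, 1.0 - defined


cov, exp_len, p_fail = coverage_and_length(50, 0.30, 0.10)
print(round(cov, 3), round(exp_len, 3), p_fail)
```

Passing a different `interval` function with the same signature would evaluate another estimator on the same grid of outcomes.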
Note that when $\hat{p}_{1\cdot} = 0$, $\hat{d}$ is not well-defined and interval estimator (2) is
inapplicable. Similarly, in this case, the coefficient of the quadratic term $d^2$ in equation
(5) is 0 and hence we cannot apply (6) to obtain the confidence interval of $d$ either.
Furthermore, if either $\hat{A} < 0$ or $\hat{B}^2 - 4\hat{A}\hat{C} < 0$, then (6) cannot be
applied either. Note also that the logarithmic function $\log(X)$ is defined only for
$0 < X < \infty$. Therefore, if any cell frequency $n_{ij}$ in a random vector
$(n_{11}, n_{12}, n_{22})$ were 0, we would not be able to apply interval estimator (4). When
evaluating the performance of (2, 4, and 6), we calculate the coverage probability and the
expected length, conditional upon those samples in which the confidence limits of the respective
interval estimator exist. For completeness, we also calculate the probability that we
fail to produce confidence limits for each of interval estimators (2, 4, and 6).
For given values of $p_{1\cdot}$ and $d$, as noted before, all parameter values
$p_{11} = p_{1\cdot}(p_{1\cdot} - d)$, $p_{12} = p_{1\cdot}(1 - p_{1\cdot} + d)$, and
$p_{22} = (1 - p_{1\cdot})$ are uniquely determined. We consider the situations in which
$p_{1\cdot} = 0.30$, 0.50, and 0.80; $d = -0.30, -0.20, -0.10, \ldots, 0.30$, subject to the
restriction that the corresponding cell probabilities $p_{11}$, $p_{12}$, and $p_{22}$ are all
$> 0$; and $n = 50$, 100, and 200. We write programs in SAS (1990) to enumerate the exact
probability $f(n_{11}, n_{12}, n_{22})$ of the desired trinomial distribution.
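Because estimator (4) fails exactly when at least one cell count is 0, its failure probability has a closed form under the trinomial model and does not require enumeration. The Python sketch below uses inclusion-exclusion over the three events "cell count equals 0"; the function name is an illustrative assumption.

```python
def prob_any_zero_cell(n, p1, d):
    """P(n11 = 0 or n12 = 0 or n22 = 0) for a trinomial sample of size n >= 1."""
    p11, p12, p22 = p1 * (p1 - d), p1 * (1.0 - p1 + d), 1.0 - p1
    # P(one given cell is empty) = (1 - p_cell)^n.
    singles = (1.0 - p11) ** n + (1.0 - p12) ** n + (1.0 - p22) ** n
    # Two given cells are empty only if all n subjects fall in the third cell.
    pairs = p22**n + p12**n + p11**n
    # All three cells cannot be empty simultaneously when n >= 1.
    return singles - pairs


for n in (50, 100, 200):
    print(n, round(prob_any_zero_cell(n, 0.30, 0.20), 3))
```

For $p_{1\cdot} = 0.30$ and $d = 0.20$ the smallest cell probability is $p_{11} = 0.03$, so the failure probability is dominated by $0.97^n$.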
4. Results
Table 1 summarizes the results about the coverage probability and the expected
length of the resulting 95% confidence intervals conditional upon those samples in
which the confidence limits of the respective interval estimator exist in a variety of
situations. As seen from Table 1, when $n \ge 200$, all estimators perform reasonably
well in almost all situations considered here. When both $n$ and $p_{1\cdot}$ are not large (i.e.,
$n = 50$ and $p_{1\cdot} = 0.30$) and $d$ is within 0.10, estimators (4 and 6) outperform
estimator (2), of which the coverage probability is likely to be less than the desired
confidence level. On the other hand, in these cases but in which $p_{1\cdot}$ is large ($= 0.80$),
estimators (2 and 4) are preferable to estimator (6). We also find that the probability of
failing to produce a 95% confidence interval by use of either estimator (2 or 6) is
negligible ($< 0.001$) in all situations considered in Table 1, but this probability for
use of (4) can be of practical significance when $n$ is not large ($= 50$).
Table 1
The coverage probability and the expected length (presented in parentheses) of the resulting
95% confidence interval for the underlying risk difference between the primary infection and
the secondary infection given the primary infection, $d = -0.30, -0.20, \ldots, 0.30$, subject to
the restriction that $p_{11}$, $p_{12}$, and $p_{22}$ are all $> 0$, for use of estimators (2, 4, 6)
in the situations in which the probability of primary infection $p_{1\cdot} = 0.30$, 0.50, and 0.80,
and the total number of subjects $n = 50$, 100, and 200

                    n = 50                                       n = 100                                      n = 200
$p_{1\cdot}$  $d$   (2)            (4)            (6)            (2)            (4)            (6)            (2)            (4)            (6)
0.30          -0.3  0.926 (0.548)  0.941 (0.528)  0.919 (0.628)  0.942 (0.391)  0.946 (0.384)  0.934 (0.416)  0.943 (0.278)  0.950 (0.275)  0.943 (0.286)
              -0.2  0.930 (0.557)  0.944 (0.537)  0.937 (0.640)  0.941 (0.398)  0.948 (0.390)  0.942 (0.423)  0.945 (0.282)  0.949 (0.279)  0.946 (0.291)
              -0.1  0.922 (0.548)  0.942 (0.530)  0.943 (0.631)  0.940 (0.391)  0.948 (0.384)  0.948 (0.416)  0.945 (0.278)  0.949 (0.275)  0.949 (0.286)
               0.0  0.924 (0.518)  0.949 (0.510)  0.955 (0.599)  0.939 (0.371)  0.949 (0.366)  0.950 (0.395)  0.947 (0.263)  0.947 (0.262)  0.952 (0.271)
               0.1  0.910 (0.465)  0.955 (0.475)  0.962 (0.542)  0.936 (0.334)  0.948 (0.335)  0.957 (0.357)  0.942 (0.238)  0.948 (0.238)  0.954 (0.245)
               0.2  0.935 (0.380)  0.958 (0.431)  0.971 (0.453)  0.938 (0.275)  0.959 (0.288)  0.965 (0.296)  0.943 (0.196)  0.948 (0.200)  0.954 (0.203)
0.50          -0.3  0.934 (0.412)  0.954 (0.412)  0.914 (0.443)  0.944 (0.294)  0.948 (0.293)  0.931 (0.304)  0.946 (0.209)  0.949 (0.208)  0.939 (0.212)
              -0.2  0.938 (0.448)  0.945 (0.443)  0.918 (0.479)  0.943 (0.319)  0.948 (0.317)  0.932 (0.329)  0.947 (0.226)  0.949 (0.225)  0.942 (0.230)
              -0.1  0.930 (0.468)  0.952 (0.460)  0.940 (0.499)  0.947 (0.333)  0.951 (0.330)  0.942 (0.343)  0.948 (0.236)  0.949 (0.235)  0.946 (0.240)
               0.0  0.937 (0.474)  0.951 (0.466)  0.944 (0.506)  0.947 (0.338)  0.951 (0.334)  0.949 (0.348)  0.948 (0.239)  0.949 (0.238)  0.945 (0.243)
               0.1  0.941 (0.468)  0.946 (0.460)  0.942 (0.499)  0.945 (0.333)  0.949 (0.330)  0.944 (0.343)  0.949 (0.236)  0.950 (0.235)  0.948 (0.240)
               0.2  0.941 (0.448)  0.945 (0.443)  0.943 (0.479)  0.945 (0.319)  0.949 (0.317)  0.946 (0.329)  0.948 (0.226)  0.950 (0.225)  0.948 (0.230)
               0.3  0.935 (0.412)  0.946 (0.412)  0.944 (0.443)  0.946 (0.294)  0.947 (0.293)  0.945 (0.304)  0.948 (0.209)  0.950 (0.208)  0.948 (0.212)
0.80          -0.1  0.941 (0.284)  0.955 (0.293)  0.919 (0.293)  0.948 (0.203)  0.950 (0.206)  0.939 (0.206)  0.950 (0.144)  0.950 (0.145)  0.946 (0.145)
               0.0  0.944 (0.328)  0.952 (0.332)  0.928 (0.336)  0.947 (0.233)  0.948 (0.235)  0.941 (0.236)  0.949 (0.166)  0.949 (0.166)  0.944 (0.167)
               0.1  0.943 (0.356)  0.949 (0.356)  0.932 (0.364)  0.947 (0.253)  0.950 (0.253)  0.942 (0.256)  0.948 (0.180)  0.949 (0.180)  0.946 (0.181)
               0.2  0.943 (0.371)  0.946 (0.369)  0.934 (0.379)  0.948 (0.264)  0.949 (0.264)  0.944 (0.267)  0.949 (0.187)  0.950 (0.187)  0.945 (0.188)
               0.3  0.945 (0.377)  0.948 (0.373)  0.941 (0.384)  0.945 (0.268)  0.948 (0.267)  0.944 (0.271)  0.948 (0.190)  0.950 (0.190)  0.947 (0.191)

5. An example
To illustrate the practical usefulness of (2, 4, and 6), we consider the example
(Agresti, 1990, Pages 45–46) about 156 calves born in Florida. Calves are first
classified according to whether they are infected with pneumonia within 60 days
after birth. They are then classified again by whether they develop a secondary
infection within two weeks after clearing up the first infection. As shown in Table 3.2
on Page 46 by Agresti (1990), we have $n_{11} = 30$, $n_{12} = 63$, and $n_{22} = 63$.
Given these data, the estimate $\hat{d}$ is 0.274. Applying interval estimators (2, 4,
and 6), we obtain the 95% confidence intervals of $d$ to be [0.151, 0.396], [0.148,
0.392], and [0.137, 0.385], respectively. Because the lower limits of these resulting
confidence intervals are all larger than 0, applying any of these interval estimators
may suggest that the primary infection of pneumonia stimulates a natural
immunity that reduces the likelihood of a secondary infection. Although this inference
is the same as that claimed elsewhere using a hypothesis test procedure
(Agresti, 1990, Page 47), we do need to implicitly assume that the immunity level
of calves to pneumonia does not vary much within the first 3 months of birth and that
the follow-up period of 14 days is long enough to calculate the proportion of the
secondary infection to draw the above conclusion. When applying the
study design discussed here to study the natural immunity, it is certainly important
to decide how to choose an appropriate length of the follow-up period. However,
this decision essentially depends on subjective knowledge of the characteristics
of the underlying disease and is beyond the scope of this paper.
6. Discussion
The coverage probability of interval estimator (4) using the asymptotic likelihood ratio test
consistently agrees reasonably well with the desired confidence level of
95% in all situations considered in Table 1, while those of estimators (2 and 6)
can be less than 95% when $n$ is not large. Furthermore, the expected length
for use of (4) may often be the shortest among these three estimators when the
coverage probability is in the near neighborhood of 95% (Table 1). Therefore, in
the situation in which the probability of failing to produce an interval estimate by
use of (4) is negligible, estimator (4) might be generally recommended if $n$ is
not large ($= 50$). On the other hand, use of (4) requires a sophisticated numerical
procedure to calculate the confidence limits, while the other two
estimators (2 and 6) are simple to implement. Thus, when $n$ is large ($\ge 200$) and
all three estimators are essentially equivalent, we may wish to apply estimators
(2 and 6) for simplicity.
In the above example, the MLEs of $p_{1\cdot}$ and $d$ are $\hat{p}_{1\cdot} \doteq 0.60$ and
$\hat{d} \doteq 0.274$, respectively. The total number of subjects $n$ is 156. According to the
results presented in Table 1, all three interval estimators (2, 4 and 6) are appropriate for use
in this case. This is consistent with the finding that all the resulting 95% confidence
intervals are similar to one another.
Table 2
The probability of failing to produce a 95% confidence interval in application of interval
estimators (2, 4, and 6) for the underlying risk difference $d = -0.30, -0.20, -0.10, \ldots, 0.30$,
subject to the restriction that $p_{11}$, $p_{12}$, and $p_{22}$ are all $> 0$, in the situations
in which the probability of primary infection $p_{1\cdot} = 0.30$, 0.50, and 0.80, and the total
number of subjects $n = 50$, 100, and 200

                    n = 50               n = 100              n = 200
$p_{1\cdot}$  $d$   (2)    (4)    (6)    (2)    (4)    (6)    (2)    (4)    (6)
0.30          -0.3  0.000  0.002  0.000  0.000  0.000  0.000  0.000  0.000  0.000
              -0.2  0.000  0.001  0.000  0.000  0.000  0.000  0.000  0.000  0.000
              -0.1  0.000  0.002  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.0  0.000  0.009  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.1  0.000  0.045  0.000  0.000  0.002  0.000  0.000  0.000  0.000
               0.2  0.000  0.218  0.000  0.000  0.048  0.000  0.000  0.002  0.000
0.50          -0.3  0.000  0.005  0.000  0.000  0.000  0.000  0.000  0.000  0.000
              -0.2  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
              -0.1  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.1  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.2  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.3  0.000  0.005  0.000  0.000  0.000  0.000  0.000  0.000  0.000
0.80          -0.1  0.000  0.015  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.1  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.2  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
               0.3  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000

Note that the probability of failing to produce confidence limits for use of (2
and 6), as shown in Table 2, is negligible for all situations considered here. Therefore,
the resulting coverage probability and the expected length for these two estimators
calculated conditional upon the samples in which the confidence limits
exist are essentially equivalent to those normally calculated over all samples. However,
the probability of failing to apply (4) when any cell frequency, $n_{11}$, $n_{12}$, or
$n_{22}$, equals 0 can be non-negligible. For example, when $n = 50$, $p_{1\cdot} = 0.30$, and
$d = 0.20$, this probability is approximately 0.218 (Table 2). To avoid this limitation
in application of (4), we can apply the commonly used adjustment for sparse
data by adding 0.50 to each cell frequency whenever this occurs. With use of this
ad hoc adjustment in the above case considered in Table 2, we find that the
coverage probability and the expected length change from 0.958 and 0.431 to
0.950 and 0.412, respectively. The magnitudes of these changes are certainly of no
practical importance. In fact, we have recalculated all the coverage probabilities and
the expected lengths with use of this ad hoc adjustment to eliminate the probability
of failing to produce confidence limits for using (4) in all situations considered in
Table 1. Because the differences between the results of using (4) presented in
Table 1 and those with this adjustment are generally quite small, we decide not to
present them for brevity.
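The sparse-data adjustment described above can be sketched as a small preprocessing step in Python. The function name is hypothetical, and the sketch assumes that the sample size used afterwards is the sum of the adjusted cells, a detail the paper does not specify.

```python
def adjust_for_sparse_data(n11, n12, n22):
    """Add 0.5 to every cell, but only when at least one observed cell is 0."""
    if min(n11, n12, n22) == 0:
        return n11 + 0.5, n12 + 0.5, n22 + 0.5
    return float(n11), float(n12), float(n22)


print(adjust_for_sparse_data(0, 15, 35))   # one empty cell: all cells shifted
print(adjust_for_sparse_data(30, 63, 63))  # no empty cell: counts unchanged
```

After this step every cell is strictly positive, so the logarithms in (3) are finite and the likelihood ratio interval can be computed for any sample.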
Finally, note that though the logarithmic transformation has been successfully
applied to derive the confidence interval for other epidemiologic indices such
as the risk ratio or odds ratio (Katz et al., 1978; Lui, 1995, 1996, and 1998), we do
not recommend use of this transformation to derive the confidence interval of the
SD as focused here. This is not only because the sampling distribution of $\log(\hat{d})$
can be even more skewed than that of $\hat{d}$ when the underlying $d$ is small, but also
because $\log(\hat{d})$ is undefined when $\hat{d}$ is $< 0$.
In summary, this paper proposes three asymptotic confidence intervals for the
SD between successive infections. This paper demonstrates that the interval estimator
using the asymptotic likelihood ratio test can consistently perform well in a
variety of situations. However, application of this procedure involves iterative
numerical calculation. When the probability of the underlying primary infection is
moderate ($= 0.30$) and the SD is within 0.10, we may use the interval estimator
using Fieller's theorem. On the other hand, when the probability of the underlying
primary infection is high ($= 0.80$), we may apply the interval estimator
using Wald's test statistic.
Acknowledgements
The author wishes to thank the referee for many helpful and valuable comments that
improved the clarity of this paper. This work was supported in part by a grant
from the Agency for Health Care Policy and Research #R01-HS07161.
Appendix
For a given sample vector $(n_{11}, n_{12}, n_{22})$, the log-likelihood is
$$
\log(L) = C + n_{11}\{\log(p_{1\cdot}) + \log(p_{1\cdot} - d)\}
+ n_{12}\{\log(p_{1\cdot}) + \log(1 - p_{1\cdot} + d)\} + n_{22}\log(1 - p_{1\cdot}).
$$
Then the MLEs of $p_{1\cdot}$ and $d$ are simply the roots for $p_{1\cdot}$ and $d$ of the
following two equations:
$$
\frac{\partial \log(L)}{\partial p_{1\cdot}} = n_{11}\{1/p_{1\cdot} + 1/(p_{1\cdot} - d)\}
+ n_{12}\{1/p_{1\cdot} - 1/(1 - p_{1\cdot} + d)\} - n_{22}/(1 - p_{1\cdot}) = 0
\tag{A.1}
$$
and
$$
\frac{\partial \log(L)}{\partial d} = -n_{11}/(p_{1\cdot} - d) + n_{12}/(1 - p_{1\cdot} + d) = 0.
\tag{A.2}
$$
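We can easily show that the MLEs are $\hat{p}_{1\cdot} = (n_{11} + n_{12})/n$ and
$\hat{d} = \hat{p}_{1\cdot} - \hat{p}_{11}/\hat{p}_{1\cdot}$.
Furthermore,
$$
\frac{\partial^2 \log(L)}{\partial p_{1\cdot}^2} = -n_{11}\{1/p_{1\cdot}^2 + 1/(p_{1\cdot} - d)^2\}
- n_{12}\{1/p_{1\cdot}^2 + 1/(1 - p_{1\cdot} + d)^2\} - n_{22}/(1 - p_{1\cdot})^2,
\tag{A.3}
$$
$$
\frac{\partial^2 \log(L)}{\partial d^2} = -n_{11}/(p_{1\cdot} - d)^2 - n_{12}/(1 - p_{1\cdot} + d)^2,
\tag{A.4}
$$
$$
\frac{\partial^2 \log(L)}{\partial p_{1\cdot}\,\partial d} = n_{11}/(p_{1\cdot} - d)^2 + n_{12}/(1 - p_{1\cdot} + d)^2.
\tag{A.5}
$$
When substituting the MLEs $\hat{p}_{1\cdot}$ and $\hat{d}$ for the corresponding parameters in
(A.3–A.5), we can obtain the estimate of the asymptotic variance for the MLE $\hat{d}$
to be $\{\hat{p}_{11}\hat{p}_{12}/\hat{p}_{1\cdot}^3 + \hat{p}_{1\cdot}(1 - \hat{p}_{1\cdot})\}/n$
through use of the inverse of the observed information matrix.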
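As a numerical check of this step, the Python sketch below builds the observed information matrix from the negatives of (A.3) to (A.5) evaluated at the MLEs, inverts the $2 \times 2$ matrix by hand, and compares the element belonging to $d$ with the closed-form variance estimate. The function names are illustrative, and the counts are the calf data from Section 5.

```python
def var_d_from_information(n11, n12, n22):
    """(d, d) element of the inverse observed information matrix at the MLEs."""
    n = n11 + n12 + n22
    p1 = (n11 + n12) / n
    d = p1 - n11 / (n11 + n12)
    s = n11 / (p1 - d) ** 2 + n12 / (1.0 - p1 + d) ** 2
    i_pp = (n11 + n12) / p1**2 + s + n22 / (1.0 - p1) ** 2  # -(A.3)
    i_dd = s                                                # -(A.4)
    i_pd = -s                                               # -(A.5)
    det = i_pp * i_dd - i_pd**2
    # For a symmetric 2 x 2 matrix [[i_pp, i_pd], [i_pd, i_dd]],
    # the lower-right element of the inverse is i_pp / det.
    return i_pp / det


def var_d_closed_form(n11, n12, n22):
    """Closed-form estimate {p11 p12 / p1^3 + p1 (1 - p1)} / n."""
    n = n11 + n12 + n22
    p1 = (n11 + n12) / n
    return ((n11 / n) * (n12 / n) / p1**3 + p1 * (1.0 - p1)) / n


v_info = var_d_from_information(30, 63, 63)
v_closed = var_d_closed_form(30, 63, 63)
print(v_info, v_closed)
```

The two values agree up to floating-point rounding, which supports the second term of the closed form being $\hat{p}_{1\cdot}(1 - \hat{p}_{1\cdot})$, as in Section 2.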
Note that for a given fixed $d_0$ such that $-1 < d_0 < 1$, as $p_{1\cdot}$ increases from
$\max\{0, d_0\}$ to $\min\{1, 1 + d_0\}$, the value of $\partial \log(L)/\partial p_{1\cdot}$ on
the left-hand side of equation (A.1) decreases from $\infty$ to $-\infty$. Furthermore, (A.1) is a
continuous function over $\max\{0, d_0\} < p_{1\cdot} < \min\{1, 1 + d_0\}$. These suggest that,
for a given fixed $d_0$, where $-1 < d_0 < 1$, the conditional MLE $\hat{p}_{1\cdot}(d_0)$ of
$p_{1\cdot}$ is simply the unique root for $p_{1\cdot}$ (falling in the range
$\max\{0, d_0\} < p_{1\cdot} < \min\{1, 1 + d_0\}$) of equation (A.1) with $d$ replaced by $d_0$.
References
Agresti, A., 1990: Categorical Data Analysis. Wiley, New York.
Anbar, D., 1983: On estimating the difference between two probabilities, with special reference to
clinical trials. Biometrics 39, 257–262.
Anbar, D., 1984: Confidence bounds for the difference between two probabilities. Biometrics (reply to
letter) 40, 1176.
Anderson, T. W., 1958: An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Beal, S. L., 1987: Asymptotic confidence intervals for the difference between two binomial parameters
for use with small samples. Biometrics 43, 941–950.
Casella, G. and Berger, R. L., 1990: Statistical Inference. Duxbury, Belmont, California.
Hauck, W. W. and Anderson, S., 1986: A comparison of large sample confidence interval methods
for the difference of two binomial probabilities. The American Statistician 40, 318–322.
Katz, D., Baptista, J., Azen, S. P., and Pike, M. C., 1978: Obtaining confidence intervals for the risk
ratio in cohort studies. Biometrics 34, 469–474.
Lui, K.-J., 1995: Confidence intervals for the risk ratio in cohort studies under inverse sampling. Biometrical Journal 37, 965–971.
Lui, K.-J., 1996: Notes on Confidence limits for the odds ratio in case-control studies under inverse
sampling. Biometrical Journal 38, 221–229.
Lui, K.-J., 1998: Interval estimation of risk ratio between the secondary infection given the primary
infection and the primary infection. Biometrics 54, 706–711.
Mee, R. W., 1984: Confidence bounds for the difference between two probabilities. Biometrics 40,
1175–1176.
Miettinen, O. and Nurminen, M., 1985: Comparison analysis of two rates. Statistics in Medicine 4,
213–226.
Santner, T. J. and Snell, M. K., 1980: Small-sample confidence intervals for $p_1 - p_2$ and $p_1/p_2$ in
$2 \times 2$ contingency tables. Journal of the American Statistical Association 73, 386–394.
Thomas, D. G. and Gart, J. J., 1977: A table of exact confidence limits for differences and ratios of
two proportions and their odds ratios. Journal of the American Statistical Association 72, 73–76.
SAS Institute, Inc., 1990: SAS Language, Version 6, 1st edition. Cary, North Carolina.
Wallenstein, S., 1997: A non-iterative accurate asymptotic confidence interval for the difference between two proportions. Statistics in Medicine 16, 1329–1336.
Kung-Jong Lui
Department of Mathematical Sciences
College of Sciences
San Diego State University
5500 Campanile Drive
San Diego, CA 92182-7720
USA
E-mail: [email protected]
Received, November 1997
Revised, August 1999
Accepted, August 1999