Imputation Techniques Using SAS Software For Incomplete Data
In Diabetes Clinical Trials
Naum M. Khutoryansky and Won-Chin Huang
Novo Nordisk Pharmaceuticals, Inc., Princeton, NJ
ABSTRACT
Missing data are common in clinical trials. In longitudinal
studies missing data are mostly related to drop-outs. Some
drop-outs appear completely at random. The source for other
drop-outs is withdrawal from trials due to lack of efficacy. For
the latter case the standard analysis of the actual observed
data produces bias. An attractive approach to avoid this
problem is to impute (i.e. fill in) the missing data. This paper is
devoted to a comparison of different imputation techniques for
longitudinal clinical trials with a large percentage of drop-outs.
The techniques considered include (among others) the last
observation carried forward technique, multiple imputation
methods (implemented in PROC MI of SAS Version 8.1) and new
incremental methods programmed in the SAS software
framework. The imputation methods were compared on
simulated data (to assess precision). The most effective
techniques were then applied to diabetes clinical trial data.
INTRODUCTION
Missing data are a common problem in clinical trials. Comparison
of drug treatments involves longitudinal measurements of efficacy
variables. When individual elements of longitudinal data are
missing, the data are “incomplete”. The incompleteness can have
different patterns and different causes. Missing values occur due
to attrition (drop-out case) if the missing-data process induces a
monotone pattern of missing values, i.e. a patient has
observations up to a time point and none after that; otherwise the
missing values are intermittent [1]. In many cases the drop-out
processes possess the property that the missing data are missing
at random (MAR) [2,3,4] (i.e. the drop-out process depends only on
the observed measurements). A special case of MAR drop-outs
is the MCAR class, when the missing data are missing completely
at random (i.e. the drop-out and measurement processes are
independent).
A very important drop-out process falling into the MAR class is
the process of withdrawal from a trial due to lack of efficacy.
A statistical analysis can be biased if incomplete cases are
excluded from the analysis [5]. The intent-to-treat (ITT) principle
requires that all cases, complete or incomplete, should be
included in the analysis.
An attractive approach to dealing with incomplete cases is the
imputation-based approach: the missing values are filled in and
the resultant completed data are analyzed by standard methods
[3].
This paper is concerned with comparison of several imputation
techniques applied to incomplete longitudinal data sets with MAR
drop-outs. The data sets are simulated to resemble the time behavior
of HbA1c in a diabetes clinical trial. The distance between adjacent
time points of measurements varies from 4 to 6 weeks. The
missingness mechanisms employed resemble the process of
withdrawal from trials due to lack of efficacy. The start of missing
values for each patient depends on two preceding
measurements. If a patient follows a specified pattern for change
of HbA1c between two adjacent time points with a tolerance
margin, she/he remains in the study; otherwise the patient is
withdrawn from the study. The percentage of missing values in a
data set depends on the tolerance margin.
The imputation techniques being compared include the last
observation carried forward (LOCF) method, three multiple
imputation methods (the regression method, propensity score
method and Markov chain Monte Carlo method) [6] and two
incremental methods (the incremental mean and incremental
regression methods). PROC MI [7] and a SAS macro are used to
implement these techniques.
METHODS OF IMPUTATION
The following methods are used in this paper. With the exception
of the incremental methods that are introduced here, the
description of the methods is very concise and reduced mostly to
references.
LAST OBSERVATION CARRIED FORWARD
Whenever a value is missing, the last observed value is
substituted. The technique is typically applied to drop-out patterns
of incompleteness.
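As a minimal sketch, LOCF for a monotone drop-out pattern can be done in a short data step, assuming a longitudinal data set such as the PartMiss data set used later in this paper (variables a1-a6, missing values coded as .); the output data set name Locf is illustrative:
data Locf;
   set PartMiss;
   array a{6} a1-a6;
   do j = 2 to 6;
      if a{j} = . then a{j} = a{j-1};   /* carry the last observed value forward */
   end;
   drop j;
run;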
MULTIPLE IMPUTATION TECHNIQUES
We consider only the multiple imputation techniques [6] that are
implemented in PROC MI [7]. The following three methods are
available in the MI procedure.
In the regression method, a regression model is fitted for each
variable with missing values, with the previous variables as
covariates. Based on the fitted regression coefficients, a new
regression model is simulated from a Bayesian predictive
distribution of the parameters and is used to impute the missing
values for each variable.
In the propensity score method, for each variable with missing
values, a propensity score is generated for each observation to
estimate the probability that its value is missing. The
observations are then grouped using these scores, and a
Bayesian imputation procedure is applied for each group.
The first two methods are used for monotone missing data.
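For monotone missing data these two methods can be requested directly in PROC MI. The statements below are a sketch using the MONOTONE statement syntax of later production releases of PROC MI (the experimental Version 8.1 syntax used elsewhere in this paper differs slightly); the output data set name ImputedMono is illustrative:
proc mi data=PartMiss out=ImputedMono
        nimpute=10 seed=37851;
   monotone regression;      /* or: monotone propensity; */
   var a1 a2 a3 a4 a5 a6;
run;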
The last method to be considered is the Markov chain Monte
Carlo method (MCMC). The goal of the MCMC method is to
create pseudorandom draws from a target distribution. Rather
than attempting to draw from this distribution directly, a sequence
of variates is generated sufficient to approximate a draw from the
target distribution. The MCMC method is used in the MI
procedure to create proper imputations of arbitrary missing data.
The results of the MI procedure are multiple imputations for
missing data. For example, if a SAS data set PartMiss has
missing values for variables a1,…,a6, the following statements
create 10 multiple imputations for this data set in a data set
Imputed using the MCMC method:
proc mi data=PartMiss out=Imputed
        nimpute=10 seed=37851;
   multinormal method=mcmc;
   var a1 a2 a3 a4 a5 a6;
run;
The analysis of a multiply imputed data set, including estimation
of the mean and variance for each variable, is based on the
formulas given in [6].
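As a sketch of how these combining formulas can be applied to the data set Imputed created above, the per-imputation estimates of, say, the mean of a6 can be combined with PROC MIANALYZE (available as a production procedure in later SAS releases); the intermediate data set and variable names below are illustrative:
/* per-imputation mean and standard error of a6 */
proc means data=Imputed noprint;
   by _Imputation_;
   var a6;
   output out=ByImp mean=est stderr=se;
run;

/* combine the 10 estimates using the multiple-imputation formulas of [6] */
proc mianalyze data=ByImp;
   modeleffects est;
   stderr se;
run;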
INCREMENTAL METHODS
We assume that an outcome variable Y is measured for subject i
= 1,…,n in the study at occasions j = 1,…,k. The measurements
are denoted by Yij. Let Dij be the increment of Y for subject i from
time point j to time point j+1. Next, let Y*ij and Y’ij be the observed
and missing values of Y, respectively. Similarly, let D*ij and D’ij
be the observed and missing increments, respectively. Denote by
D*.j the sample mean of the observed increments D*ij at time point j.
Suppose that the values Yi1 are all observed.
The incremental mean method creates two matrices: a basic
imputed matrix and an uncertainty matrix. To build the basic
imputed matrix B the unknown increments D’ij are prescribed to
be equal to D*.j and, next, the missing values Y’i,j+1 are replaced
by
Y^i,j+1 = Yij + D*.j   for j = 1,…,k - 1          (1)
where the values Yij are observed or calculated in the previous step.
The elements Bij that coincide either with Y*ij or with Y^ij
(calculated step by step as specified above) create the basic
imputed matrix B. The approximate mean values of Y at each
time point j will be calculated as the sample mean B.j using the
jth column of matrix B. The columns of matrix B can be used to
calculate the variances of Y at each time point. However, these
calculated values V^j underestimate the real variances due to
the lack of uncertainty in (1) and will be named the partial
variances. To induce an additional uncertainty it is proposed to
build an uncertainty matrix V’. Let U*j be the sample variance of
the set {D*ij} for fixed j. Let Rij be equal to 1 if Yij is observed
and 0 otherwise. Then the elements V’ij of matrix V’ are
calculated as follows:
V’i1 = 0;   V’ij = (V’i,j-1 + U*j-1)(1 - Rij)   for j = 2,…,k          (2)
Therefore, if Yij is observed then V’ij = 0 (there is no uncertainty).
Otherwise, an additional variance (as an uncertainty measure) is
accumulated over time until time point j for each subject i. The
pooled additional variance V’j at time point j is defined as the
average of V’ij over all the subjects. The total variance Vj at time
point j is defined as
Vj = V^j + V’j          (3)
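The following PROC IML program is a minimal sketch of the incremental mean method under the assumptions stated above (monotone drop-out, Yi1 fully observed), applied to a data set such as PartMiss with variables a1-a6; it is an illustration rather than the macro actually used in the paper:
proc iml;
   /* read the partially missing data (missing = .) into an n x k matrix */
   use PartMiss;
   read all var {a1 a2 a3 a4 a5 a6} into Y;
   close PartMiss;
   n = nrow(Y);  k = ncol(Y);
   B  = Y;              /* basic imputed matrix, filled in column by column */
   Vp = j(n, k, 0);     /* uncertainty matrix V' of formula (2)             */
   do j = 1 to k-1;
      obs  = loc((Y[,j] ^= .) & (Y[,j+1] ^= .));    /* rows with observed increment D*ij */
      Dobs = Y[obs, j+1] - Y[obs, j];
      Dbar = Dobs[:];                               /* D*.j, mean observed increment     */
      Ustar = ssq(Dobs - Dbar) / (nrow(Dobs) - 1);  /* U*j, variance of the D*ij         */
      miss = loc(B[,j+1] = .);
      if ncol(miss) > 0 then do;
         B[miss, j+1]  = B[miss, j] + Dbar;         /* formula (1) */
         Vp[miss, j+1] = Vp[miss, j] + Ustar;       /* formula (2) */
      end;
   end;
   Bmean   = B[:,];                                 /* time-point means B.j          */
   partial = (B - j(n,1,1)*Bmean)[##,] / (n-1);     /* partial variances V^j         */
   total   = partial + Vp[:,];                      /* total variances, formula (3)  */
   print Bmean, total;
quit;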
The incremental regression method is based on the following
regression model:
Dij = D.j + Aj (Yij - Y.j) + εij          (4)
where D.j and Y.j are the means of the samples {Dij} and {Yij},
respectively, over i, and εij are errors.
For the observed data a counterpart of model (4) takes the
following form:
D*ij = D*.j + A*j (Yij - Y**.j) + ε*ij          (5)
where Y**.j is defined as the mean of the sample {Yij | Ri,j+1 = 1} for
fixed j and ε*ij are errors.
Coefficient A*j is estimated from (5) using the least-squares
method. Next, the missing values Y’i,j+1 are replaced by the values
Y^i,j+1 introduced by the following recurrent scheme:
D^ij = D*.j + A*j (Yij - Y.j),   Y^i,j+1 = Yij + D^ij,   j = 1,…,k - 1          (6)
After obtaining the basic imputed matrix B, comprised of both Y*ij
and Y^ij, and evaluating the time-point means, the calculation of
the partial variances V^j is completely similar to that in the
incremental mean method. Consider the calculation of the
corresponding uncertainty matrix V’. The elements of V’ are
evaluated using the same recurrent formula (2) as in the previous
method. However, the evaluation of U*j is now based on the errors of
model (5):
ε*ij = D*ij - [D*.j + A*j (Yij - Y**.j)]          (7)
Now U*j is defined as the sample variance of the set {ε*ij} for fixed
j.
The last step of the method, the calculation of the total variance, is
carried out using the same formula (3) as in the incremental
mean method.
Note that two different means of Yij over i are used in this
algorithm: Y**.j for the calculation of A*j and U*j, and Y.j in the
recurrent scheme (6).
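A corresponding PROC IML sketch of the imputation step of the incremental regression method is given below, under the same assumptions as the previous sketch; the uncertainty matrix and total variances would then be computed exactly as in that sketch, with U*j taken as the sample variance of the residuals (7). Again, this is an illustration, not the macro used in the paper:
proc iml;
   use PartMiss;
   read all var {a1 a2 a3 a4 a5 a6} into Y;
   close PartMiss;
   n = nrow(Y);  k = ncol(Y);
   B = Y;                                  /* basic imputed matrix */
   do j = 1 to k-1;
      obs   = loc((Y[,j] ^= .) & (Y[,j+1] ^= .));
      Yobs  = Y[obs, j];
      Dobs  = Y[obs, j+1] - Y[obs, j];
      Dbar  = Dobs[:];                     /* D*.j   */
      Ystar = Yobs[:];                     /* Y**.j  */
      Astar = sum((Yobs - Ystar) # (Dobs - Dbar)) / ssq(Yobs - Ystar);   /* least-squares A*j in (5) */
      Ydot  = B[,j][:];                    /* Y.j    */
      miss  = loc(B[,j+1] = .);
      if ncol(miss) > 0 then
         B[miss, j+1] = B[miss, j] + Dbar + Astar # (B[miss, j] - Ydot); /* recurrent scheme (6) */
   end;
   Bmean = B[:,];                          /* time-point means */
   print Bmean;
quit;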
SIMULATION
DATA SET SIMULATION
The first step in the simulation is to create longitudinal data sets
that resemble time behavior of HbA1c in a diabetes clinical trial.
As an example we consider such a trial with six time points of
measurements starting from baseline. The first three time
increments are equal to four weeks, the next two are six weeks.
We assume that the baseline distribution of HbA1c can be
approximated by the normal distribution with mean of 9 and
variance of 1. The mean of HbA1c used in simulations is
specified to increase by 0.5 for the first time increment, remain
unchanged for the second one and decrease by 0.5 for each next
time increment (through the fifth increment). This scenario is very
approximate but, nevertheless, reflects a tendency for some
diabetes drugs. Each increment of HbA1c is prescribed to be the
sum of a normal random variable with a standard deviation of 0.5
and a term reflecting the process of titration (i.e. dose escalation with a
set of predetermined dose levels and prespecified criteria of
responders and nonresponders). The following SAS program
implements this algorithm:
%let totno=60;
%let titrate=0.05;
%let seedi=54417;
data complete;
   drop norm norm0 seed j m1;
   retain seed &seedi;
   array a{6} a1 a2 a3 a4 a5 a6;
   array dm{5} dm1 dm2 dm3 dm4 dm5;
   array da{5};
   do id=1 to &totno;
      m1=9; dm1=0.5; dm2=0;
      dm3=-0.5; dm4=-0.5; dm5=-0.5;
      call rannor(seed,norm0);
      a1=m1+norm0; dms=0;
      do j = 1 to 5;
         dms=dms+dm{j};
         call rannor(seed,norm);
         da{j}=dm{j} - &titrate*(a{j}-m1-dms)
               + norm/2;
         a{j+1}=a{j}+da{j};
      end;
      output;
   end;
run;
The macro variables of the program are: &totno (the number of
patients), &titrate (a titration parameter) and &seedi (the seed
used for random simulation). Note that the random increment of
the primary variable (HbA1c) increases its variance at the
next time point. However, the titration term (which includes &titrate
as a factor) acts in the opposite direction. The titration parameter
usually has a value between 0 and 0.1 in real studies and,
therefore, is specified at the level of 0.05 in the program above.
However, below we also consider the case &titrate=0, i.e. the case
where the increments are independent of the previous values of the
primary variable.
MISSINGNESS MECHANISM
It is reasonable to assume that the probability for patient i in a
clinical trial to become a drop-out at time point j depends only on
the values Yi1,…, Yi,j – 1. Therefore, we assume that the MAR
condition holds. Moreover, we accept a more restrictive
assumption that, we believe, can be applicable in the diabetes
clinical trials whose design was described above. Namely, we
suppose that the probability depends only on Di,j-2:
Pr(Rij = 0) = Gj(Di,j-2)          (8)
where Gj is a function such that 0 ≤ Gj(x) ≤ 1 for any real x.
Two possible choices for Gj are
Gj(x) = ζ(x - δj + c) - ζ(x - δj + c - 1),  where ζ(x) = x if x > 0 and ζ(x) = 0 otherwise,          (9)
and
Gj(x) = 1 if 0 ≤ x - δj + c ≤ 1, and Gj(x) = 0 otherwise.          (10)
According to the description of the complete data set (see above),
the following values of δj are specified:
δ1 = 0.5, δ2 = 0, δ3 = δ4 = δ5 = -0.5.
The program below creates a data set with missing data using
the missingness mechanism (8), (9) with c = - 0.1. The amount of
missing values at the sixth time point is approximately 50% for
this value of c.
data PartMiss;
   set complete;
   drop dropout dm1-dm5 dr da1-da5 j;
   array a{6} a1 a2 a3 a4 a5 a6;
   array dm{5} dm1 dm2 dm3 dm4 dm5;
   array da{5} da1 da2 da3 da4 da5;
   dropout=0;
   do j = 1 to 6;
      if j>2 then dr = da{j-2}-dm{j-2};
      else dr=0;
      if dropout=0 and j>1 and
         dr-0.1 > ranuni(&seedi)
         then dropout=1;
      if dropout=1 then a{j}=.;
   end;
run;
Approximately the same percentage of missing values at the sixth
time point will be generated using the second missingness
mechanism (8), (10) with c = - 0.5. To implement this mechanism
in the program above, the condition
dr-0.1 > ranuni(&seedi)
should be replaced by the condition
dr > 0.5
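With this change the drop-out test in the data step above reads:
if dropout=0 and j>1 and
   dr > 0.5
   then dropout=1;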
RESULTS OF SIMULATION
In this subsection we consider the results of simulation for two
types of the initial (complete) longitudinal data sets (with and
without titration) presented in the subsection “Data set
simulation”. The type is determined by the value of macro
variable &titrate either equal to 0 (without titration) or equal to
0.05 (with titration). The number of observations is chosen to be
in two versions: n = 30 and n = 60. Two missingness
mechanisms described above are implemented to create
approximately 50% of missing data at the sixth time point. Six
imputation techniques presented in the section “Methods of
imputation” are compared on the basis of the precision of the
mean and standard deviation estimates at the sixth time point.
To make the comparisons, the following errors of the means and
standard deviations of imputed data sets (at the sixth time point)
over 100 random samples of each kind are calculated:
εm = 0.01 Σi=1,…,100 | M^i - Mi | / Si ,   εs = 0.01 Σi=1,…,100 | S^i - Si | / Si ,
where M^i and Mi are the sample means of the imputed and
original data sets, respectively; S^i and Si are the corresponding
standard deviations. Actually, εs is the average relative error of a
standard deviation. Since the standard deviations of the original
data sets are close to one, these errors are close to the absolute
errors.
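As an illustration of these formulas, if the per-sample summary statistics were collected in a data set (here called SimSummary, with hypothetical variables mhat, m, shat and s holding M^i, Mi, S^i and Si for each of the 100 samples), the two error measures could be computed as follows:
data Errors;
   set SimSummary;
   em = abs(mhat - m) / s;    /* per-sample term of εm */
   es = abs(shat - s) / s;    /* per-sample term of εs */
run;

proc means data=Errors mean;
   var em es;                 /* the means over the 100 samples are εm and εs */
run;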
All the simulations were carried out in the framework of the SAS
software (version 8.1) using the MI procedure to implement the
multiple imputation methods and a macro to implement the
incremental methods.
The following four tables present the simulation results. In
addition, the corresponding errors of the means and standard
deviations of the observed (non-imputed) data sets are also
presented. The first two tables are concerned with original data
sets without titration, which means that all the increments are
completely random. However, the missingness mechanisms are
not MCAR. Tables 3 and 4 give the results for original data sets
with titration. The missingness mechanism is indicated in each
table.
Table 1
Properties of data sets:
- Original: without titration
- Missingness mech.: (9)
- Missing %: 50%

Method                      N=30            N=60
                            εm     εs       εm     εs
LOCF                       .390   .161     .406   .162
MI regression              .132   .511     .083   .557
MI propensity scores       .263   .133     .191   .104
MI MCMC                    .283   1.21     .083   .116
Incremental mean           .065   .055     .046   .041
Incremental regression     .069   .068     .048   .050
Observed                   .290   .119     .316   .086

Table 2
Properties of data sets:
- Original: without titration
- Missingness mech.: (10)
- Missing %: 50%

Method                      N=30            N=60
                            εm     εs       εm     εs
LOCF                       .415   .183     .411   .186
MI regression              .178   .639     .111   .568
MI propensity scores       .371   .139     .318   .099
MI MCMC                    .422   1.56     .100   .118
Incremental mean           .078   .052     .050   .039
Incremental regression     .088   .074     .056   .052
Observed                   .385   .125     .388   .104

Table 3
Properties of data sets:
- Original: with titration
- Missingness mech.: (9)
- Missing %: 50%

Method                      N=30            N=60
                            εm     εs       εm     εs
LOCF                       .515   .248     .530   .249
MI regression              .145   .649     .096   .656
MI propensity scores       .264   .133     .186   .098
MI MCMC                    .270   .971     .093   .119
Incremental mean           .090   .091     .070   .085
Incremental regression     .088   .076     .062   .058
Observed                   .259   .120     .270   .084

Table 4
Properties of data sets:
- Original: with titration
- Missingness mech.: (10)
- Missing %: 50%

Method                      N=30            N=60
                            εm     εs       εm     εs
LOCF                       .554   .258     .543   .268
MI regression              .173   .869     .106   .720
MI propensity scores       .356   .151     .310   .116
MI MCMC                    .354   1.46     .112   .135
Incremental mean           .093   .087     .071   .080
Incremental regression     .102   .077     .066   .060
Observed                   .322   .118     .323   .094
The results presented in the tables show that, for some
longitudinal data sets modeling diabetes clinical trial data with a
relatively small number of patients (30 to 60) and a large
percentage of missing values due to nonresponders, the
incremental methods give, on average, more precise estimates
of the means and standard deviations than the other imputation
methods considered in the paper. Among the multiple imputation
methods included in the MI procedure, the MCMC method is the
most reliable. With regard to the estimation of standard
deviations, the most unreliable method is the MI regression
method (or its implementation in the MI procedure). However, the
precision of all multiple imputation methods increases when n
increases. The LOCF method (widely used in clinical trials) is less
precise than the observed-data approach.
CONCLUSION
The paper is concerned with comparison of several imputation
techniques applied to incomplete longitudinal data sets with MAR
drop-outs. The data sets are simulated to resemble the time behavior
of HbA1c in a diabetes clinical trial. The missingness mechanisms
employed resemble the process of withdrawal from trials due to
lack of efficacy.
The imputation techniques being compared include the last
observation carried forward (LOCF) method, three multiple
imputation methods used in the SAS PROC MI (the regression
method, propensity score method and Markov chain Monte Carlo
method) and two incremental methods (the incremental mean
and incremental regression methods). The results of the simulations
presented in the paper show that, for a relatively small number of
patients (30 to 60) and a large percentage of missing values, the
incremental methods give, on average, more precise estimates
of the means and standard deviations than the other imputation
methods considered in the paper.
REFERENCES
1. Diggle, P.J., Liang, K.-Y., and Zeger, S.L. Analysis of
Longitudinal Data. Oxford University Press, New York, 1994.
2. Rubin, D.B. Inference and Missing Data. Biometrika
1976; 63:581-592.
3. Little, R.J.A., and Rubin, D.B. Statistical Analysis with
Missing Data. Wiley, New York, 1987.
4. Little, R.J.A. Modeling the Drop-Out Mechanism in
Repeated-Measure Studies. Journal of the American Statistical
Association 1995; 90:1112-1120.
5. Lachin, J.M. Statistical Considerations in the Intent-to-Treat
Principle. Controlled Clinical Trials 2000; 21:167-189.
6. Rubin, D.B. Multiple Imputation for Nonresponse in
Surveys. Wiley, New York, 1987.
7. Yuan, Y.C. Multiple Imputation for Missing Data:
Concepts and New Development. SUGI Proceedings 2000;
P267-25.
CONTACT INFORMATION
Your comments and questions are valued and encouraged.
Contact the author at:
Naum Khutoryansky
Novo Nordisk Pharmaceuticals, Inc.
100 College Road West
Princeton NJ 08540
Work Phone: 609-987-5812
Email: [email protected]
Web: www.novonordisk-us.com