Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Imputation Techniques Using SAS Software For Incomplete Data In Diabetes Clinical Trials Naum M. Khutoryansky and Won-Chin Huang Novo Nordisk Pharmaceuticals, Inc., Princeton, NJ ABSTRACT Missing data are common in clinical trials. In longitudinal studies missing data are mostly related to drop-outs. Some drop-outs appear completely at random. The source for other drop-outs is withdrawal from trials due to lack of efficacy. For the latter case the standard analysis of the actual observed data produces bias. An attractive approach to avoid this problem is to impute (i.e. fill in) the missing data. This paper is devoted to comparison of different techniques of imputation for some longitudinal clinical trials with large percentage of dropouts. The techniques being considered include (among others) the last observation carried forward technique, the multiple imputation methods (presented in PROC MI of the SAS version 8.1) and new incremental methods programmed in the SAS software framework. The imputation methods were compared on simulated data (to assess preciseness). The most effective techniques were applied to diabetes clinical trial data. INTRODUCTION Missing data is a common problem in clinical trials. Comparison of drug treatments involves longitudinal measurements of efficacy variables. When individual elements of longitudinal data are missing, the data are “incomplete. The incompleteness can have different patterns and different causes. Missing values occur due to attrition (drop-out case) if the missing-data process induces a monotone pattern of missing values, i.e. a patient has observations up to a time point and none after that; otherwise the missing values are intermittent [1]. In many cases the drop-out processes possess the property that the missing data are missing at random (MAR) [2,3,4] (i.e. the drop-out process depends on the observed measurements ). A special case of MAR drop-outs is the MCAR class when the missing data are missing completely at random (i.e. the drop-out and measurement processes are independent ). A very important drop-out process falling into the MAR class is the process of withdrawal from a trial due to lack of efficacy. A statistical analysis can be biased if incomplete cases are excluded from the analysis [5]. The intent-to-treat (ITT) principle requires that all cases, complete or incomplete, should be included in the analysis. An attractive approach dealing with incomplete cases is the imputation-based approach: the missing values are filled in and the resultant completed data are analyzed by standard methods [3]. This paper is concerned with comparison of several imputation techniques applied to incomplete longitudinal data sets with MAR drop-outs. The data sets are simulated to resemble time behavior of HbA1c in diabetes clinical trial. The distance between adjacent time points of measurements varies from 4 to 6 weeks. The missingness mechanisms employed resemble the process of withdrawal from trials due to lack of efficacy. The start of missing values for each patient depends on two preceding measurements. If a patient follows a specified pattern for change of HbA1c between two adjacent time points with a tolerance margin, she/he remains in the study; otherwise the patient is withdrawn from the study. The percentage of missing values in a data set depends on the tolerance margin. The imputation techniques being compared include the last observation carried forward (LOCF) method, three multiple imputation methods (the regression method, propensity score method and Markov chain Monte Carlo method) [6] and two incremental methods (the incremental mean and incremental regression methods). PROC MI [7] and a SAS macro are used to implement these techniques. METHODS OF IMPUTATIONS The following methods are used in this paper. With the exception of the incremental methods that are introduced here, the description of the methods is very concise and reduced mostly to references. LAST OBSERVATION CARRIED FORWARD Whenever a value is missing, the last observed value is substituted. The technique is typically applied to drop-out patterns of incompleteness. MULTIPLE IMPUTATION TECHNIQUES We consider only the multiple imputation techniques [6] that are implemented in PROC MI [7]. The following three method are available in the MI procedure. In the regression method, a regression model is fitted for each variable with missing values, with the previous variables as covariates. Based on the fitted regression coefficients, a new regression model is simulated from a Bayesian predictive distribution of the parameters and is used to impute the missing values for each variable. In the propensity score method, for each variable with missing values, propensity scores are generated for all observations to estimate the probabilities that each observation is missing. The observations are then grouped using these scores, and a Bayesian imputation procedure is applied for each group. The first two methods are used for monotone missing data. The last method to be considered is the Markov chain Monte Carlo method (MCMC). The goal of the MCMC method is to create pseudorandom draws from a target distribution. Rather than attempting to draw from this distribution directly, a sequence of variates is generated sufficient to approximate a draw from the target distribution. The MCMC method is used in the MI procedure to create proper imputations of arbitrary missing data. The results of the MI procedure are multiple imputations for missing data. For example, if a SAS data set PartMiss has missing values for variables a1,…,a6, the following statements create 10 multiple imputations for this data set in a data set Imputed using the MCMC method: proc mi data=PartMiss out=Imputed nimpu=10 seed=37851; multinormal method=mcmc; var a1 a2 a3 a4 a5 a6; run; The analysis of a multiple imputed data set including estimation of the mean and variance for each variable is based on the formulas given in [6]. INCREMENTAL METHODS We assume that an outcome variable Y is measured for subject i = 1,…,n in the study at occasions j = 1,…,k. The measurements are denoted by Yij. Let Dij be the increment of Y for subject i from time point j to time point j+1. Next, let Y*ij and Y’ij are observed and missing values of Y, respectively. Similarly, let D*ij and D’ij be observed and missing increments, respectively. Denote by D*. j the sample mean of observed increments D*ij at time point j. Suppose that values Yi1 are all observed. The incremental mean method creates two matrices: a basic imputed matrix and an uncertainty matrix. To build the basic imputed matrix B the unknown increments D’ij are prescribed to be equal to D*. j and, next, the missing values Y’i, j+1 are replaced by Y^ i, j+1 = Y ij + D*. j for j = 1,…,k - 1 (1) where values Yij are observed or calculated on the previous step. The elements Bij that coincide either with Y*ij or with Y^ij (calculated step by step as specified above) create the basic imputed matrix B. The approximate mean values of Y at each time point j will be calculated as the sample mean B. j using the jth column of matrix B. The columns of matrix B can be used to calculate the variances of Y at each time point. However, these calculated values V^ j underestimate the real variances due to the lack of uncertainty in (1) and will be named the partial variances. To induce an additional uncertainty it is proposed to build an uncertainty matrix V’. Let U* j be the sample variance of the set {D*ij} for fixed j. Let R ij be equal to 1 if Yij is observed and 0 otherwise. Then the elements V’ij of matrix V’ are calculated as follows: V’i1 = 0; V’ij = (V’i, j-1 + U* j-1) (1 - R ij) for j = 2,…,k . (2) Therefore, if Yij is observed then V’ij = 0 (there is no uncertainty). Otherwise, an additional variance (as an uncertainty measure) is accumulated over time until time point j for each subject i. The pooled additional variance V’j at time point j is defined as the average of V’ij over all the subjects. The total variance Vj at time point j is defined as Vj = V^ j + V’j . (3) The incremental regression method is based on the following regression model: Dij = D. j + Aj (Yij – Y. j) + εij where D. j and Y. j (4) are the means of samples {Dij} and {Yi j}, respectively, over j, and εI j are errors. For the observed data a counterpart of model (4) takes the following form D*ij = D*. j + A*j (Yij – Y**. j) + ε*I j (5) where Y**. j is defined as the mean of sample { Yij | Ri, j+1=1} for fixed j and ε*I j are errors. Coefficient A* j is estimated from (5) using the least-squares method. Next, the missing values Y’i, j+1 are replaced by values Y^i, j+1 introduced by the following recurrent scheme: D^ ij = D*. j + A* j (Yij – Y. j), (6) Y^i, j+1 =Yij + D^ ij, j = 1,…,k– 1. After obtaining the basic imputed matrix B comprised of both Y*ij and Y^ij and evaluating the time point means, the calculation of the partial variances V^ j is completely similar to that in the incremental mean method. Consider the calculation of the corresponding uncertainty matrix V’. The elements of V’ are evaluated using the same recurrent formula (2) as in the previous method. However, evaluation of U* j is now based on the errors of model (5) : ε*ij = D* ij - [D*. j + A* j (Yi j – Y**. j)] . (7) Now U* j is defined as the sample variance of set {ε*I j } for fixed j. The last step of the method – calculation of the total variance is carried out using the same formula (3) as in the incremental mean method. Note that two different means of Yi j over i are used in this algorithm: Y**. j for calculation of A* j and U* j , and Y. j in the recurrent scheme (6). SIMULATION DATA SET SIMULATION The first step in the simulation is to create longitudinal data sets that resemble time behavior of HbA1c in a diabetes clinical trial. As an example we consider such a trial with six time points of measurements starting from baseline. The first three time increments are equal to four weeks, the next two are six weeks. We assume that the baseline distribution of HbA1c can be approximated by the normal distribution with mean of 9 and variance of 1. The mean of HbA1c used in simulations is specified to increase by 0.5 for the first time increment, remain unchanged for the second one and decrease by 0.5 for each next time increment (through the fifth increment). This scenario is very approximate but, nevertheless, reflects a tendency for some diabetes drugs. Each increment of HbA1c is prescribed to be the sum of a normal random variable with the standard of 0.5 and a term reflecting the process of titration (i.e. dose escalation with a set of predetermined dose levels and prespecified criteria of responders and nonresponders). The following SAS program implements this algorithm: %let totno=60; %let titrate=0.05; %let seedi=54417; data complete; drop norm norm0 seed j m1; retain seed &seedi; array a{6} a1 a2 a3 a4 a5 a6; array dm{5} dm1 dm2 dm3 dm4 dm5; array da{5}; do id=1 to &totno; m1=9; dm1=0.5; dm2=0; dm3=-0.5; dm4=-0.5; dm5=-0.5; call rannor(seed,norm0); a1=m1+norm0; dms=0; do j = 1 to 5; dms=dms+dm{j}; call rannor(seed,norm); da{j}=dm{j} – &titrate*(a{j}-m1-dms) + norm/2; a{j+1}=a{j}+da{j}; end; output; end; run; The macro variables of the program are: &totno (the number of patients), &titrate (a titration parameter) and &seedi (the seed used for random simulation). Note that the random increment of the primary variable (HbA1c) increases the variance of it at the next time point. However, the titration term (that includes &titrate as a factor) acts in opposite direction. The titration parameter usually has a value between 0 and 0.1 in real studies, and, therefore, is specified at the level of 0.05 in the program above. However, we consider below also the case when &titrate=0, i.e. the increment is independent of the previous values of the primary variable. MISSINGNESS MECHANISM It is reasonable to assume that the probability for patient i in a clinical trial to become a drop-out at time point j depends only on the values Yi1,…, Yi,j – 1. Therefore, we assume that the MAR condition holds. Moreover, we accept a more restrictive assumption that, we believe, can be applicable in the diabetes clinical trials which design was described above. Namely, we suppose that the probability depends only on Di,j – 2: Pr(R ij = 0 ) = Gj (Di,j – 2), (8) where Gj is such a function that 0 ≤ Gj (x) ≤ 1 for any real x. Two possible choices for g are Gj (x) = ζ (x - δ j + c) - ζ (x - δ j + c - 1) (9) where ζ (x) = x if x > 0, else 0, Gj (x) = 1 if 0 ≤ x - δ j + c ≤ 1, else 0. (10) According to the description of the complete data set (see above) the following values of δ j are specified as δ 1 = 0.5, δ 2 = 0, δ 3 = δ 4 = δ 5 = - 0.5 . The program below creates a data set with missing data using the missingness mechanism (8), (9) with c = - 0.1. The amount of missing values at the sixth time point is approximately 50% for this value of c. data PartMiss; set complete; drop dropout dm1-dm5 dr da1-da5 j; array a{6} a1 a2 a3 a4 a5 a6; array dm{5} dm1 dm2 dm3 dm4 dm5; array da{5} da1 da2 da3 da4 da5; dropout=0; do j = 1 to 6; if j>2 then dr = da{j-2}-dm{j-2}; else dr=0; if dropout=0 and j>1 and dr-0.1 > ranuni(&seedi) then dropout=1; if dropout=1 then a{j}=.; end; run; Approximately the same percentage of missing values at the sixth time point will be generated using the second missingness mechanism (8), (10) with c = - 0.5. To implement this mechanism in the program above, ithe condition dr-0.1 > ranuni(&seedi) should be replaced by the condition dr > 0.5 RESULTS OF SIMULATION In this subsection we consider the results of simulation for two types of the initial (complete) longitudinal data sets (with and without titration) presented in the subsection “Data set simulation”. The type is determined by the value of macro variable &titrate either equal to 0 (without titration) or equal to 0.05 (with titration). The number of observations is chosen to be in two versions: n = 30 and n = 60. Two missingness mechanisms described above are implemented to create approximately 50% of missing data at the sixth time point. Six imputation techniques presented in section “Methods of imputation” are compared on the base of preciseness of the mean and standard deviation estimations at the sixth time point. To make the comparisons, the following errors of the means and standard deviations of imputed data sets (at the sixth time point) over 100 random samples of each kind are calculated: εm 100 = 0.01 Σ | M^i - Mi | / Si , εs 100 = 0.01 Σ | S ^i - S i | / S i , i =1 i=1 where M^i and Mi are the sample means of the imputed and original data sets, respectively; S^i and Si are the corresponding standard deviations. Actually, εs is the average relative error of a standard deviation. Since the standard deviations of the original data sets are close to one, these errors are close to the absolute errors. All the simulations were carried out in the framework of the SAS software (version 8.1) using the MI procedure to implement the multiple imputation methods and a macro to implement the incremental methods. The following four tables present the simulation results. Besides, the similar errors of the means and standard deviations of observed data sets are also presented. First two tables are concerned with original data sets without titration which means that all the increments are completely random. However, the missingness mechanisms are not MCAR. The tables 3 and 4 give the results for original data sets with titration. The missingness mechanism is indicated in each table. Table 1 Properties of data sets: - Origional: without titration - Missingness mech.: (9) - Missing % : 50% Method LOCF MI regression MI propensity scores MI MCMC Incremental mean Incremental regression Observed Table 2 Properties of data sets: - Origional: without titration - Missingness mech.: (10) - Missing % : 50% Method LOCF MI regression MI propensity scores MI MCMC Incremental mean Incremental regression Observed Table 3 Properties of data sets: - Origional: with titration - Missingness mech.: (9) - Missing % : 50% Method LOCF MI regression MI propensity scores MI MCMC Incremental mean Incremental regression Observed N=30 N=60 εm εs εm εs .390 .132 .263 .283 .065 .069 .290 .161 .511 .133 1.21 .055 .068 .119 .406 .083 .191 .083 .046 .048 .316 .162 .557 .104 .116 .041 .050 .086 N=30 N=60 εm εs εm εs .415 .178 .371 .422 .078 .088 .385 .183 .639 .139 1.56 .052 .074 .125 .411 .111 .318 .100 .050 .056 .388 .186 .568 .099 .118 .039 .052 .104 N=30 N=60 εm εs εm εs .515 .145 .264 .270 .090 .088 .259 .248 .649 .133 .971 .091 .076 .120 .530 .096 .186 .093 .070 .062 .270 .249 .656 .098 .119 .085 .058 .084 Table 4 Properties of data sets: - Origional: with titration - Missingness mech.: (10) - Missing % : 50% Method LOCF MI regression MI propensity scores MI MCMC Incremental mean Incremental regression Observed CONTACT INFORMATION N=30 N=60 εm εs εm εs .554 .173 .356 .354 .093 .102 .322 .258 .869 .151 1.46 .087 .077 .118 .543 .106 .310 .112 .071 .066 .323 .268 .720 .116 .135 .080 .060 .094 The results presented in the tables show that, for some longitudinal data sets modeling diabetes clinical trial data with relatively small number of patients (30 to 60) and a large percentage of missing values due to nonresponders, the incremental methods give, in average, more precise estimations of the means and standard deviations than the other imputation methods considered in the paper. Among the multiple imputation methods included in the MI procedure the MCMC method is the most reliable. In regard to estimation of standard deviations the most unreliable method is the MI regression method (or its implementation in the MI procedure). However, the precision of all multiple imputation methods increases when n increases. The LOCF method (widely used in clinical trials) is less precise than the observed data approach. CONCLUSION The paper is concerned with comparison of several imputation techniques applied to incomplete longitudinal data sets with MAR drop-outs. The data sets are simulated to resemble time behavior of HbA1c in diabetes clinical trial. The missingness mechanisms employed resemble the process of withdrawal from trials due to lack of efficacy. The imputation techniques being compared include the last observation carried forward (LOCF) method, three multiple imputation methods used in the SAS PROC MI (the regression method, propensity score method and Markov chain Monte Carlo method) and two incremental methods (the incremental mean and incremental regression methods). The results of simulation presented in the paper show that for relatively small number of patients (30 to 60) and a large percentage of missing values the incremental methods give, in average, more precise estimations of the means and standard deviations than the other imputation methods considered in the paper. REFERENCES 1. Diggle, P.J., Liang, K.-Y., and Zeger, S.L. Analysis of Longitudinal Data. Oxford University Press, New York, 1994. 2. Rubin, D.B. Inference and Missing Data. Biometrika 1976; 63:581-592. 3. Little, R.J.A., and Rubin, D.B. Statistical Analysis with Missing Data. Wiley, New York, 1987. 4. Little, R.J.A. Modeling the Drop-Out Mechanism in Repeated-Measure Studies. Journal of the American Statistical Association 1995; 90:1112-1120. 5. Lachin, J.M. Statistical Consideration in the Intent-toTreat Principle. Controlled Clinical Trials 2000; 21:167-189. 6. Rubin, D.B. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, 1987. 7. Yuan, Y.C. Multiple Imputation for Missing Data: Concepts and New Development. SUGI Proceedings 2000; P267-25. Your comments and questions are valued and encouraged. Contact the author at: Naum Khutoryansky Novo Nordisk Pharmaceuticals, Inc. 100 College Road West Princeton NJ 08540 Work Phone: 609-987-5812 Email: [email protected] Web: www.novonordisk-us.com