Paper SP07
SAS® Markov Chain Monte Carlo (MCMC) Simulation in Practice
Scott D Patterson, GlaxoSmithKline, King of Prussia, PA
Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA
ABSTRACT
Markov Chain Monte Carlo (MCMC) is a random sampling method that combines Monte Carlo integration with Markov chains. MCMC has gained popularity in many applications due to advances in computational algorithms and power. The SAS® MI Procedure provides an MCMC method for imputing arbitrary missing data and for simulating random samples based on complete data information. Extensions of this procedure are currently available in experimental form to perform Bayesian statistical analysis.
The purpose of this paper is to use a simulated hypothetical clinical trial efficacy data set and Challenger's O-ring failure data as input to the MCMC method for missing data imputation, model parameter simulation, and model diagnostics, and to use SAS to perform a Bayesian analysis of data commonly encountered in clinical trials.
The SAS V9 products used in this paper are SAS BASE®, SAS/STAT®, and SAS/GRAPH® on a PC Windows®
platform.
INTRODUCTION
Monte Carlo methods are sampling techniques that draw pseudo-random samples from specified probability distributions. In other words, Monte Carlo methods are numerical methods that use sequences of random numbers to perform statistical simulations. A Monte Carlo algorithm involves the following components (a minimal sketch illustrating several of them follows the list):
1) probability distribution functions (pdf's) – the target distribution must be specified by a set of pdf's,
2) random number generator – a source of random numbers uniformly distributed on the unit interval,
3) sampling rule – a prescription for sampling from the specified pdf's,
4) scoring – the outcomes must be summarized into overall scores,
5) error estimation – an estimate of the statistical error (variance) as a function of the number of trials,
6) variance reduction techniques – methods for reducing the variance in the estimated solution to reduce the computational time,
7) parallelization and vectorization – algorithms that allow Monte Carlo methods to be implemented efficiently in parallel computing environments.
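As a small illustration (ours, not from the paper) of components 1 through 5, the following data step estimates the integral of the standard normal density over (0, 1) by plain Monte Carlo sampling; the seed and the number of trials are arbitrary choices. The true value is about 0.3413.

data mc(keep=n estimate stderr);
   call streaminit(54321);                 /* seed the random number generator */
   n = 100000;                             /* number of trials */
   sum = 0; sumsq = 0;
   do i = 1 to n;
      x = rand('UNIFORM');                 /* sample from U(0,1) */
      f = exp(-x*x/2) / sqrt(2*constant('PI'));   /* score the draw */
      sum = sum + f;
      sumsq = sumsq + f*f;
   end;
   estimate = sum / n;                               /* Monte Carlo estimate */
   stderr = sqrt((sumsq/n - estimate**2) / n);       /* error estimate */
   output;
run;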
For independent samples, the Law of Large Numbers applies to the simulation outcomes. However, drawing independent samples directly from the target distribution may be difficult. This issue can be addressed by using a Markov chain.
A Markov chain is a sequence of random values whose distribution at each time point depends upon the value at the previous time point. It converts the sampling scheme into a time-series sequence. The controlling factor in a Markov chain is the transition probability, which is the conditional probability for the system to move to a particular new state given the current state of the system. Because the Markov chain is in a time-series format, we can check sample independence by examining the sample auto-correlation. As the number of steps increases toward infinity, the Markov chain converges to its stationary distribution. Assuming a stationary distribution exists, it is unique if the chain is irreducible. Irreducible means any state can be reached from any other state in a finite number of moves.
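As a small concrete illustration (ours, not from the paper), consider a two-state chain with transition matrix

P = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix}

where row i gives the probabilities of moving from state i to each state. Its stationary distribution \pi solves \pi P = \pi with \pi_1 + \pi_2 = 1, giving \pi = (5/6, 1/6). This chain is irreducible and aperiodic, so long-run samples from it behave as draws from \pi.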
A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor of the return times to any particular state is 1.
An ergodic theorem applies if the Markov chain has the following properties:
1) it has a stationary distribution, and
2) it is aperiodic and irreducible.
The ergodic theorem then guarantees that:
1) the central limit theorem holds, and
2) convergence to the stationary distribution occurs geometrically.
The Markov chain Monte Carlo (MCMC) method consists of a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution. It combines Monte Carlo random sampling with a Markov chain whose stationary distribution supplies, after convergence, samples from the target.
MCMC methods have gained popularity in a wide range of fields and are useful in both Bayesian and frequentist
statistical inference. MCMC has been applied as a method for exploring posterior distributions in Bayesian
inference. In other words, you can simulate the entire joint posterior distribution of the unknown quantities and
obtain simulation-based estimates of posterior parameters that are of interest.
The SAS/STAT® system provides the MI procedure for performing multiple imputation of missing data.
“Missing values are an issue in a substantial number of statistical analyses. Most SAS statistical
procedures exclude observations with any missing variable values from the analysis. While
analyzing only complete cases has its simplicity, the information contained in the incomplete
cases is lost. This approach also ignores possible systematic differences between the complete
cases and the incomplete cases, and the resulting inference may not be applicable to the
population of all cases, especially with a smaller number of complete cases.” [14]
MCMC imputation is one of the features provided in the MI procedure. You can use the SAS MCMC method for
arbitrary missing data imputation or random sample data set simulation based on the complete input data set as
prior information.
Graphical display is an important component of the MCMC process. It provides visual displays of MCMC output for checking the behavior of the random sampling process, including convergence of the Markov chains and independence of the samples.
The data used in this paper are for illustration purposes only. They are:
1) A simulated hypothetical longitudinal clinical trial data set. This data set contains a continuous response
variable and variables of age, sex, treatment, race, baseline value, subject ID, and visit, as covariate
variables.
2) A NASA Challenger O-ring failure data set. It contains only two variables: failure as a binary response variable and temperature as a continuous covariate.
3) Event data from a recent clinical trial.
GETTING STARTED
The MI procedure makes MCMC imputation a simple, easy, and powerful process. The following sample code runs MCMC sampling.

proc mi data=eff seed=54321 nimpute=1000 out=outmono;
   /* impute enough values to achieve a monotone missing pattern,
      using a separate chain for each imputation */
   mcmc impute=monotone chain=multiple;
   var subjid y1 y2 y3 y4 y5;
run;
The sample code uses only three SAS statements – PROC MI statement, MCMC statement, and VAR
statement. The options available in the PROC MI and MCMC statements are listed in the appendices.
AN EXTENSION
MCMC is also useful for Bayesian inference. SAS recently released an experimental version of Bayesian
software in SAS/STAT at http://support.sas.com/rnd/app/da/bayesproc.html for users to assess and suggest
improvements to the procedures.
To apply Bayesian statistics, one applies inductive reasoning. One begins with a rough idea of Θ (where Θ here represents the parameters of interest). Statistically, this rough idea is termed the prior distribution. One then collects data (designated as X for the purpose of this paper). Traditionally, one then integrates this information to derive a refined understanding of Θ|X (the posterior distribution of Θ given X) using Bayes rule, stated in symbols after this paragraph (see, for example, Patterson et al., 1999). Often, however, mathematical derivation of this posterior distribution is not possible, and in such circumstances MCMC may be used to draw samples from the posterior density of Θ|X at any time as data are collected. This type of analysis is very useful for independent data monitoring in clinical trials and for use in sequential and adaptive designs. We will discuss such an analysis in Example III.
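In symbols, Bayes rule combines the prior p(Θ) and the likelihood p(X|Θ) as

p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{\int p(X \mid \Theta)\, p(\Theta)\, d\Theta} \propto p(X \mid \Theta)\, p(\Theta)

The integral in the denominator is what is often intractable, which is why MCMC samples from the posterior rather than computing it in closed form.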
EXAMPLE I – LONGITUDINAL CLINICAL EFFICACY DATA
Input Data
A simulated hypothetical clinical efficacy data set is used for this example. The data set contains 252 subjects
with 5 treatments. The variables are listed as follows:
Variable Name   Description                                    Valid Value
response        a derived 'change from baseline' variable      numeric value
visit           5 levels of clinical visit                     4, 5, 7, 9, 11
trt             5 levels of treatment including a placebo      1=Dose A, 2=Dose B, 3=Dose C, 4=Dose D, 5=Dose E
sex             subject's gender                               1=male, 2=female
race            5 levels of racial group                       1=white, 2=black, 3=American hispanic, 4=Asian, 5=other
baseval         standardized baseline value                    numeric value
age             standardized age                               numeric value
subjid          subject ID                                     numeric value

Table 1. Description of Sample Clinical Data
Efficacy endpoints are measured at selected on-therapy visits. The MIXED procedure is selected for repeated
measures analysis using restricted maximum likelihood.
proc mixed data=eff method=reml;
   class trt visit subjid sex race;
   model response = trt*visit baseval*visit sex race age /
         solution ddfm=kr influence residual;
   repeated visit / type=un subject=subjid;
   estimate "Dose B - Dose A at V 11" trt*visit
            0 0 0 0 -1 0 0 0 0 1;
   estimate "Dose C - Dose A at V 11" trt*visit
            0 0 0 0 -1 0 0 0 0 0 0 0 0 0 1;
   estimate "Dose D - Dose A at V 11" trt*visit
            0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1;
   estimate "Dose E - Dose A at V 11" trt*visit
            0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1;
   lsmeans trt*visit;
run;
Table 2. Sample Efficacy Analysis Code
The data set contains some missing values due to subjects withdrawing early, as well as some unexpected intermittent missing values.
Incomplete Data Output
The SAS MIXED procedure excludes observations with any missing values from the analysis. The output of
estimates is shown in Table 3 below.
Table 3. Output of Estimates from Incomplete Data
LOCF Output
Last Observation Carried Forward (LOCF) is a method specific to longitudinal missing data problems. It replaces missing data at later visits with the last available observed value, as illustrated in Tables 4 and 5. Table 4 shows the missing data pattern.
SUBJID  BASELINE  VISIT 4 (y1)  VISIT 5 (y2)  VISIT 7 (y3)  VISIT 9 (y4)  VISIT 11 (y5)
001     0.5        0.3          -1.2          -2.8          .             .
002     0.4       -0.6          -0.6          -0.9          .             .
003     0.7       -1.2          -1.0          -1.1          -1.4          -1.6
004     0.6        1.1          .             .             .             .
005     -0.2      -0.9          .             -1.0          .             .

Table 4. Missing Data Patterns
SUBJID  BASELINE  VISIT 4 (y1)  VISIT 5 (y2)  VISIT 7 (y3)  VISIT 9 (y4)  VISIT 11 (y5)
001     0.5        0.3          -1.2          -2.8          ==> -2.8      ==> -2.8
002     0.4       -0.6          -0.6          -0.9          ==> -0.9      ==> -0.9
003     0.7       -1.2          -1.0          -1.1          -1.4          -1.6
004     0.6        1.1          ==> 1.1       ==> 1.1       ==> 1.1       ==> 1.1
005     -0.2      -0.9          ==> -0.9      -1.0          ==> -1.0      ==> -1.0

Table 5. LOCF to Fill Missing Data (==> marks a carried-forward value)
Using LOCF, once the data set has been completed in this way, it is analyzed as if it were fully observed. LOCF is still widely used in New Drug Applications (NDAs), but many statisticians consider the method to produce biased point estimates, biased variances, and incorrect inferences; see, for example, Mallinckrodt et al., 2004.
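As a minimal sketch (ours, not from the paper) of how the carry-forward in Table 5 can be coded, assuming one record per subject with visit responses y1-y5:

data locf;
   set eff;
   array y{5} y1-y5;
   /* fill each missing visit with the most recent non-missing value */
   do i = 2 to 5;
      if missing(y{i}) then y{i} = y{i-1};
   end;
   drop i;
run;

Because the loop runs left to right, a value carried into y2 is itself carried forward into y3 if needed, matching the propagation shown in Table 5.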
Table 6. Output of Estimates from LOCF Data
MCMC Output – Monotone Imputation Method
There are several possible patterns of missing data in a clinical study. The sources of missing data can be
categorized as:
1) some subjects dropping out from the study, resulting in a monotone pattern of missing data.
2) some data missing intermittently, due, for example, to an illness or death, an invalid measurement, or
forgetfulness. This type of missing data is missing at random (MAR) with a non-monotone pattern.
A longitudinal clinical study generally suffer from both types of missingness, and the collected data are often
incomplete with a mixed monotone and non-monotone structure.
A data set with variables Y1, Y2, ..., Yp, in this specific order, is defined as having a monotone missing pattern when the event that a variable Yj is observed for a particular subject implies that all previous variables Yk, k < j, are also observed for that subject. For example, a subject who drops out after the third visit has Y1-Y3 observed and Y4-Y5 missing, which is consistent with a monotone pattern.
Table 7 shows the missing data patterns: an “X” means that the variable is observed in the corresponding
group, a “.” means that the variable is missing and will be imputed to achieve the monotone missingness for the
imputed data set, and an “O” means that the variable is missing and will not be imputed.
Table 7. Missing Data Patterns
The missing data patterns in Table 8 are rearranged to show a triangular monotone missing data pattern. They
provide a clear overview of the quantity, positioning, and type of missing values in the dataset. The variable
order specified in the VAR statement determines the monotone missing pattern in the imputed data set. With a
different order in the VAR list, the results will be different because the monotone missing pattern to be
constructed will be different.
Table 8. Rearranged Missing Data Patterns
Model Parameter Simulation
The number of imputations is set to 1000 in the MI procedure to create 1000 imputed monotone data sets. Each
imputed data set is used to run the PROC MIXED model to generate a set of model parameter estimates. Table
9 shows the 1000 sets of simulated model parameters.
Table 9. 1000 Sets of Simulated Model Parameters
Rubin (1987) and others proposed simple combining rules: the overall point estimate is the average of the estimates across imputations, and the overall standard error combines the within-imputation and between-imputation variability. Table 10 shows the overall results; a sketch of how this combining step can be coded follows the table.
Label                      Estimate   Std Err   DF    t Value   Pr > |t|
Dose B – Dose A at V 11    -0.7597    0.4910    229   -1.55     0.1232
Dose C – Dose A at V 11     0.0077    0.4910    229    0.02     0.9876
Dose D – Dose A at V 11    -1.2151    0.4979    232   -2.44     0.0154
Dose E – Dose A at V 11     0.7028    0.5060    228    1.39     0.1665

Table 10. Average of 1000 Sets of Simulated Model Parameters
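A minimal sketch (ours, not from the paper) of this combining step, assuming the imputed data sets from PROC MI are stacked in outmono with an _imputation_ variable: fit the mixed model by imputation, then let PROC MIANALYZE apply Rubin's combining rules to one of the estimates.

ods output Estimates=est;
proc mixed data=outmono method=reml;
   by _imputation_;          /* one model fit per imputed data set */
   class trt visit subjid sex race;
   model response = trt*visit baseval*visit sex race age / solution ddfm=kr;
   repeated visit / type=un subject=subjid;
   estimate "Dose B - Dose A at V 11" trt*visit 0 0 0 0 -1 0 0 0 0 1;
run;

proc mianalyze data=est;     /* one observation per imputation */
   modeleffects Estimate;    /* per-imputation point estimates */
   stderr StdErr;            /* per-imputation standard errors */
run;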
If we combine the 1000 imputed data sets and average each data cell, we can construct a single averaged
imputed data set. Table 11 shows the model estimation results.
Table 11. Model Estimation Results from Averaged Imputed Data Set
When the missing pattern is monotone and the fraction of missing information is low, methods that average the imputed values of the missing data can be more efficient than methods that average simulated parameter values (Schafer, 1997). A comparison of Tables 10 and 11 shows that the parameter estimates are very similar.
The relative efficiency (RE) of MI is a measure of imputation efficiency, proposed by Rubin (1987) as follows:

   RE = (1 + λ/m)^(-1)

where λ is the fraction of missing information for the quantity being estimated and m is the number of imputations. For example, with λ = 0.3 and m = 5, RE = (1 + 0.3/5)^(-1) ≈ 0.94, so 5 imputations already achieve about 94% efficiency. Table 12 shows relative efficiencies for different values of m and λ. For cases with little missing information, only a small number of imputations is necessary.
Table 12. RE with different values of m and λ
The estimation results from the 10 imputations are shown in Table 13.
Table 13. Output of Estimates from 10 Imputations
A comparison of Tables 11 and 13 shows that the differences in estimates are very small because the fraction of missing information is low.
MCMC Output – Full Imputation Method
The full imputation method replaces all missing values by imputed values. Table 14 shows the missing data
patterns: an “X” means that the variable is observed in the corresponding group, and a “.” means that the
variable is missing and will be imputed to replace the missing values in the imputed data set.
Table 14. Missing Data Patterns – Full Imputation Method
The parameter simulation results from 1000 imputations are shown in Table 15.
Table 15. Model Parameter Simulation of 1000 Imputations
The average overall estimates and standard errors are shown in Table 16.
Label                      Estimate   Std Err   DF    t Value   Pr > |t|
Dose B – Dose A at V 11    -0.6733    0.4784    245   -1.41     0.1736
Dose C – Dose A at V 11     0.0247    0.4795    246    0.05     0.8500
Dose D – Dose A at V 11    -1.0531    0.4752    246   -2.22     0.0362
Dose E – Dose A at V 11     0.5927    0.4869    244    1.22     0.2454

Table 16. Average of 1000 Sets of Simulated Model Parameters
The parameter simulation process with estimates is depicted in Figure 1 below. Figure 1 consists of three panels: a descriptive statistics panel, a simulation time-series panel, and a simulation density plot panel.
Figure 1. MCMC Parameter Simulation Plot with Average Parameter Values of 1000 Imputations
Table 17 is the output from using the methods that average the imputed values of the missing data.
Table 17. Model Estimates from Average of 1000 Imputed Data Sets
The joint density plots can be constructed from the pairs of y1 vs y2, y2 vs y3, y3 vs y4, and y4 vs y5. Figure 2
shows the joint density estimates of MCMC simulated responses.
Figure 2. Joint Density Estimates of MCMC Simulated Responses
Imputation Diagnostics
Convergence diagnostics are critical when MCMC-based simulations are used. MCMC diagnostics focus on convergence of the iteration series and independence of the samples. These diagnostics can be carried out by graphical presentation of the iteration process. Two plots are suggested.
1) Plot the time-series for each variable of interest.
2) Plot the auto-correlation functions.
The MCMC statement provides two options for plotting the time series for each variable and the autocorrelation functions. The sample code below produces an auto-correlation function plot for each parameter. Figure 3 shows the combined autocorrelation functions for variables y1, y2, y3, y4, and y5.

goptions lfactor=5 ftext=swissb htext=2;
proc mi data=test seed=54321 nimpute=100 out=outmono;
   mcmc impute=monotone chain=multiple
        acfplot(mean(y1 y2 y3 y4 y5) / symbol=dot csymbol=red hsymbol=0.01
                cneedles=red wneedles=3 cref=blue cconf=blue);
   var y1 y2 y3 y4 y5;
run;
Figure 3. Autocorrelations with 95% Confidence Limits
You can use the outiter option in the MCMC statement to capture the iteration history data. Figure 4 shows the
iteration time-series plot by response variables at each visit.
Figure 4. MCMC Iteration Time Series Plot by Visit
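A minimal sketch (ours, not from the paper) of capturing that iteration history with OUTITER, assuming the same data set and variables as above; the options in parentheses request the per-iteration means and covariances:

proc mi data=test seed=54321 nimpute=10 out=outfull;
   /* keep the per-iteration means and covariances for time-series diagnostics */
   mcmc impute=full chain=single outiter(mean cov)=iterhist;
   var y1 y2 y3 y4 y5;
run;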
Model Diagnostics – Residual Analysis
Residual analysis is one of regression diagnostics tools for graphical and numerical examinations of the
adequacy of model specification. A model misspecification can affect the validity and efficacy of regression
analysis. One of the residual analyses is based on the plots of raw residuals. In this illustration, variable age is
the only numerical covariate. A boxplot is selected to display the raw residuals. Figure 5 shows that the means
of residuals are close to zero, confirming variable age is not misspecified.
Figure 5. Residual Boxplot for Checking Functional Form for Variable Age
The other residual analysis is based on aggregates of residuals, such as moving sums, moving averages, or cumulative residuals, checked against the distributions of certain zero-mean Gaussian stochastic processes [5]. Figure 6 shows the moving-average residuals for checking the functional form specification graphically.
Figure 6. Checking Functional Form for Variable Age
EXAMPLE II – CHALLENGER O-RING FAILURE DATA
Input Data
On January 28, 1986, the space shuttle Challenger exploded 73 seconds after launch. The Challenger disaster investigation determined that cold air temperature caused the rubber O-rings to stiffen and fail to seal the joint adequately. The data relating O-ring failure to temperature are used in this section for illustration purposes. The data are listed in Table 18 below.
Flight No.  14  9   23  10  1   5   13  15  4   3   8   17  2   11  6   7   16  21  19  22  12  20  18
Failure     1   1   0   0   0   0   0   1   0   0   1   0   0   0   0   1   0   1   0   0   0   0   0
Temp (°F)   53  57  58  63  66  67  67  67  68  69  70  70  70  70  72  73  75  75  76  76  78  78  79

Table 18. O-ring Failure Data
MCMC is used to fill in some missing data, and the completed data are then fitted with a logistic regression model. The sample code is shown below.

data inprior(type=cov);
   input _type_ $1-4 _name_ $7-13 @16 failure 6.2 @25 temp 6.2;
   datalines;
COV   failure  0.15     -1.5
COV   temp     -1.5     35.1
N              25.      25.
MEAN           0.3      67.7
;
run;

proc mi data=test seed=54321 nimpute=300 out=outmono;
   mcmc prior=input=inprior;
   var failure temp;
run;

data outmono;
   set outmono;
   /* dichotomize the imputed continuous values back to 0/1 */
   if failure > 0.5 then failure=1;
   else failure=0;
run;

ods output CLParmWald=wparm Association=asso;
proc logistic data=outmono;
   by _imputation_;
   model failure(event='1') = temp / outroc=roc1 clparm=wald;
run;
The results are depicted in Figure 7. Figure 7 consists of 5 panels:
1) The left panel shows the average logistic function and variation.
2) The middle upper panel shows the simulated predicted failure probability at 65 degrees.
3) The middle lower panel shows the simulated predicted failure probability at 49 degrees.
4) The right upper panel shows the density of the predicted failure probability at 65 degrees.
5) The right lower panel shows the density of the predicted failure probability at 49 degrees.
Figure 7. MCMC Simulated Predicted Probability of O-ring Failure
The O-ring data have a binary response variable with failure and non-failure outcomes. The risk factor is a continuous variable, temperature. For a dichotomous risk factor, the odds ratio (OR) Ψ is defined as the ratio of the odds for those with the risk factor to the odds for those without it. For a continuous explanatory variable, the odds ratio corresponds to a unit increase in the risk factor.
Three types of prior information are used to construct the imputed data sets: a non-informative prior (the default), a ridge prior, and a prior information data set. These imputed data sets are fed into the LOGISTIC procedure to generate the odds ratios. In the displayed output of PROC LOGISTIC, the "Odds Ratio Estimates" table contains the odds ratio estimates. Figure 8 shows the OR density estimates by prior information.
Figure 8. OR Density Estimates from MCMC Simulation by Prior Info.
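For a logistic model, the odds ratio for a one-unit increase in a continuous covariate is exp(β). A minimal sketch (ours, not from the paper) of recovering ORs per imputation from the Wald estimates captured above, assuming the CLParmWald output data set wparm has variables Parameter, Estimate, LowerCL, and UpperCL:

data or_est;
   set wparm;
   where Parameter = 'temp';
   oddsratio = exp(Estimate);   /* odds ratio per 1 degree F increase */
   or_lower  = exp(LowerCL);    /* Wald confidence limits on the OR scale */
   or_upper  = exp(UpperCL);
run;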
The Receiver Operating Characteristic (ROC) curve is presented on a probability-scale graph and is used to judge the discrimination ability of statistical methods for predictive purposes. The area under the ROC curve can be measured and converted to a single quantitative index of diagnostic accuracy.
ROC curves are popular as tools for detection of events or of conditions such as asymptomatic dysfunction or disease. For binary response data, the response is either an event or a nonevent. The accuracy of the classification is measured by its sensitivity and specificity. Sensitivity is the ability to predict an event correctly. Specificity is the ability to predict a nonevent correctly.
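In standard notation (ours; the paper does not spell these out), with TP, FP, TN, FN denoting true/false positives and negatives:

\mathrm{sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{specificity} = \frac{TN}{TN + FP}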
“PROC LOGISTIC also computes three other conditional probabilities: false positive rate, false negative
rate, and rate of correct classification. The false positive rate is the proportion of predicted event
responses that were observed as nonevents. The false negative rate is the proportion of predicted
nonevent responses that were observed as events. Given prior probabilities specified with
the PEVENT=option, these conditional probabilities can be computed as posterior probabilities using
Bayes’ theorem.” [14]
The ROC curve, shown in Figure 9, is obtained by plotting sensitivity versus 1 – specificity and is used to judge the discrimination ability of predictive methods.
Figure 9. Simulated ROC Curves
The area under a ROC curve is a summary quantitative index. This index varies between 0.5 (no discrimination power) and 1.0 (perfect accuracy) as the ROC curve travels toward the left and top boundaries of the graph. The meaning of the area under a ROC curve is a "probability of correctly ranking a (normal, abnormal) pair" [14]. In other words, the index is the probability of correct pairwise rankings.
The area under the ROC curve, as determined by the trapezoidal rule, is estimated by the statistic 'c' in the "Association of Predicted Probabilities and Observed Responses" table [14]. Table 19 shows the AUC estimates from the first 10 imputations.
Table 19. AUC Estimates from the First 10 Imputations
The functional form specification technique [] is used to smooth the simulated ROC curves. Figure 10 shows the
smoothed ROC curves.
Figure 10. Smoothed Simulated ROC Curves
EXAMPLE III – CLINICAL EVENT DATA
For the purposes of this example, we consider a clinical event observed following a clinical procedure. Our interest is in understanding the frequency of this event when patients are receiving one of four treatments (labeled Groups 1-4). For this example, we assume that little is known about the frequency of this type of event prior to the conduct of the clinical trial; this is known as the assumption of a uniform or non-informative prior. Data from the clinical trial will be incorporated in our model with the intent of updating this knowledge.
The clinical events themselves are assumed to follow a binomial distribution with unknown parameter 0≤p≤1.
Our intent is to use clinical data as it is collected to refine our knowledge of p.
Data were observed as follows where x is the number of observed clinical events out of the n number of patients
randomized to receive treatments 1 through 4:
data test;
   input group x n;
   datalines;
1 74 164
2 75 169
3 66 160
4 49 157
;
run;
Such data are easily modeled using logistic regression in PROC GENMOD (see the MODEL statement below); however, the new experimental procedure PROC BGENMOD allows one to conduct a Bayesian analysis of the data at any point in the trial. The additional code involved is found below in the BAYES statement, where it is specified that the prior distribution should be uniform and that one million MCMC simulations of the posterior density of p should be generated after a burn-in of 500 iterations. Parameter estimates are output via the ODS statement and may be summarized, for example with PROC UNIVARIATE, to compare the posterior distributions of p; a sketch follows the code.
proc bgenmod data=test;
   class group;
   model x/n = group / dist=bin link=logit p cl;
   bayes seed=2010 nbi=500 nmc=1000000 coeffprior=uniform;
   ods output ParameterEstimates=params PosteriorSample=post;
run;
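A minimal sketch (ours, not from the paper) of that summary step. We assume the PosteriorSample data set post contains one row per MCMC draw with the model coefficients on the logit scale; the coefficient name below is illustrative and would need to match the actual output. An inverse-logit transform converts a group's linear predictor to its event probability p.

data postp;
   set post;
   /* hypothetical coefficient name: with the default parameterization, the
      reference group's event probability is the inverse logit of the intercept */
   p_group4 = logistic(Intercept);
run;

proc univariate data=postp;
   var p_group4;
   /* posterior median and a 95% equal-tail credible interval */
   output out=summ median=med_p pctlpts=2.5 97.5 pctlpre=ci_;
run;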
Here it is found that the posterior median value for p is lowest (0.31) in Group 4, and that the probability of an event in Groups 1-3 is higher, with posterior median values ranging from 0.41 to 0.45. The confidence to be placed in the findings is reflected in the width of the posterior distributions (see Figure 11). Here we see that the findings are compelling: treatment Groups 1-3 have a higher probability of an event than Group 4.
Figure 11. Summary of Estimated Posterior Densities for the Probability of an Event in Treatment Groups 1-4
It is easy to confirm that Bayesian credible regions for each group, based on the estimates from the posterior distribution described above, are equivalent to frequentist confidence intervals when a uniform prior is utilized; we defer further discussion of the utility of Bayesian methods to future work.
CONCLUSION
The SAS MI procedure is a powerful tool for missing data imputation and parameter simulation. The key features are summarized as follows:
•  It is very easy to use; only three SAS statements are needed.
•  The procedure provides options for users to modify the initial value settings and to select the imputation method.
•  It provides two optional output data sets, out and outiter, for further analysis.
•  The procedure can be used in the data preparation steps before calling the analysis model, simplifying the clinical efficacy data analysis process.
MI is attractive because it can be highly efficient even for small numbers of imputations. In many applications, just 3-5 imputations are sufficient to obtain excellent results.
Extensions of this approach to Bayesian analysis are easily accommodated using the experimental procedures recently provided by SAS; in our example, only one additional line of code is needed. The inclusion of a general MCMC Gibbs sampler in future versions of SAS will allow for even greater utility in clinical statistics and programming.
APPENDICES
Appendix 1. Summary of PROC MI Options
Option       Description                                                        Default
alpha=       specifies that confidence limits be constructed for the mean       α=0.05
             estimates with confidence level 100(1 – α)%, where 0 < α < 1.
data=        input data set name.                                               most recently created data set
maximum=     specifies maximum values for imputed variables.                    a missing value
minimum=     specifies minimum values for imputed variables.                    a missing value
minmaxiter=  specifies the maximum number of iterations for imputed values      100
             to be in the specified range when the option MINIMUM or
             MAXIMUM is also specified.
mu0=         specifies the parameter values µ0 under the null hypothesis        0
(theta0=)    µ=µ0 for the population means corresponding to the analysis
             variables.
nimpute=     specifies the number of imputations.                               5
noprint      suppresses the display of all output.
out=         creates an output SAS data set containing imputation results.      SAS data set name
round=       specifies the units to round variables in the imputation.          a missing value
seed=        specifies a positive integer to start the pseudo-random            the time of day from the computer's clock
             number generator.
simple       displays simple descriptive univariate statistics and pairwise
             correlations from available cases.
singular=    specifies the criterion for determining the singularity of a       1E-8
             covariance matrix based on standardized variables, where
             0 < p < 1.
Appendix 2. Summary of MCMC Statement Options
Type            Option              Description                                                    Default
input           inest=              the INEST= data set is a TYPE=EST data set and contains a
                                    variable _imputation_ to identify the imputation number.
input           initial=input=      the INITIAL=INPUT= data set is a TYPE=COV or CORR data set
                                    and provides initial parameter estimates for the MCMC process.
input           prior=input=        the PRIOR=INPUT= data set is a TYPE=COV data set that
                                    provides the prior information.
output          outest=             the OUTEST= data set is a TYPE=EST data set and contains
                                    parameter estimates used in each imputation in the MCMC
                                    process.
output          outiter<(options)>= the OUTITER= data set is a TYPE=COV data set and contains
                                    parameters used in the imputation step for each iteration.
imputation      impute=             specifies whether a full-data imputation is used for all      full
                                    missing values or a monotone-data imputation is used for a
                                    subset of missing values so that the imputed data sets have
                                    a monotone missing pattern.
imputation      chain=              specifies whether a single chain or a separate chain is       single
                                    used for all imputations.
imputation      nbiter=             specifies the number of burn-in iterations before the first   200
                                    imputation in each chain.
imputation      niter=              specifies the number of iterations between imputations in     100
                                    a single chain.
imputation      initial=            specifies the initial mean and covariance estimates for the
                                    MCMC process.
imputation      prior=              specifies the prior information for the means and             JEFFREYS
                                    covariances. Valid values are JEFFREYS, RIDGE, and
                                    INPUT=dataset.
imputation      start=              specifies whether the initial parameter estimates are used    value
                                    as the starting value or as the starting distribution in
                                    the first imputation step of each chain.
graphics        timeplot=           displays the time-series plots of parameters from iterations.
graphics        acfplot=            displays the autocorrelation function plots of parameters
                                    from iterations.
graphics        gout=               specifies the graphics catalog for saving graphics output
                                    from PROC MI.
printed output  wlf                 displays the worst linear function.
printed output  displayinit         displays initial parameter values in the MCMC process for
                                    each imputation.
Appendix 3. Summary of ODS Table from MCMC Statement
ODS Table Name      Description                                 Option
EMPostEstimates     EM (posterior mode) estimates               INITIAL=EM
EMPostIterHistory   EM (posterior mode) iteration history       INITIAL=EM
EMWLF               coefficients of the worst linear function   WLF
MCMCInitEstimates   initial parameter estimates for MCMC        DISPLAYINIT
REFERENCES
[1] Harrell, F.E. (2000): Practical Bayesian Data Analysis from a Former Frequentist, Mastering Statistical Issues in Drug Development, Henry Stewart Conference Studies, 15-16 May 2000.
[2] Harrell, F.E. (2005): A Good P-value is Hard to Find: Why I Became a Bayesian, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, March 2, 2005.
[3] Gilks, W.R., S. Richardson, and D.J. Spiegelhalter, eds. (1996): Markov Chain Monte Carlo in Practice, Chapman & Hall, London, UK.
[4] Fan, X., A. Felsovalyi, S.A. Sivo, and S.C. Keenan (2001): SAS for Monte Carlo Studies – A Guide for Quantitative Researchers, SAS Institute Inc., Cary, NC, USA.
[5] Lin, D.Y., L.J. Wei, and Z. Ying (2002): Model-Checking Techniques Based on Cumulative Residuals, Biometrics, 58, 1-12, March 2002.
[6] Little, R.J.A. and D.B. Rubin (1987): Statistical Analysis with Missing Data, New York: John Wiley & Sons, Inc.
[7] Mallinckrodt, C., C. Kaiser, J. Watkin, G. Molenberghs, and R. Carroll (2004): The effect of correlation structure on treatment contrasts estimated from incomplete clinical trial data with likelihood-based repeated measures compared with last observation carried forward ANOVA, Clinical Trials, 1, 477-489.
[8] Patterson, S., S. Francis, M. Ireson, D. Webber, and J. Whitehead (1999): A Novel Bayesian Decision Procedure for Early-Phase Dose Finding Studies, Journal of Biopharmaceutical Statistics, 9(4), 583-597.
[9] Rubin, D.B. (1987): Multiple Imputation for Nonresponse in Surveys, New York: John Wiley & Sons, Inc.
[10] SAS Institute Inc. (2004): SAS/STAT 9.1 User's Guide, SAS Institute Inc., Cary, NC, USA.
[11] SAS Institute Inc. (2003): SAS OnlineDoc 9, SAS Institute Inc., Cary, NC. http://support.sas.com/91doc/docMainpage.jsp
[12] SAS Institute Inc. (2004): Base SAS 9.1 Procedures Guide, SAS Institute Inc., Cary, NC, USA.
[13] SAS Institute Inc. (2004): SAS/GRAPH 9.1 Reference, Volumes 1 and 2, SAS Institute Inc., Cary, NC.
[14] SAS Institute Inc. (2004): SAS/STAT 9.1 User's Guide, SAS Institute Inc., Cary, NC.
[15] SAS Institute Inc. (2007): Preliminary Capabilities for Bayesian Analysis in SAS/STAT Software, SAS Institute Inc., Cary, NC.
[16] Schafer, J.L. (1997): Analysis of Incomplete Multivariate Data, New York: Chapman and Hall.
[17] Yeh, S.T. (2004): Graphical Display of Clinical Data – A Nonparametric Approach, NESUG 2004 Proceedings, Paper po10, November 2004.
TRADEMARKS
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are
registered trademarks or trademarks of their respective companies.
AUTHOR CONTACT INFORMATION
Scott D Patterson, Ph. D.
(610) 787-3296 (W)
E-mail: [email protected]
Shi-Tao Yeh, Ph. D.
(610) 787-3856 (W)
E-mail: [email protected]