ANALYSIS OF LONGITUDINAL DATA WITH MISSING
RESPONSES: A STUDY OF PAIN CONTROL COST
A Thesis
Submitted to the Faculty of Graduate Studies and Research
in Partial Fulfillment of the Requirements
for the Degree of
Master of Science
in
Statistics
University of Regina
By
Zhenyu Yang
Regina, Saskatchewan
June, 2015
© Copyright 2015: Zhenyu Yang
UNIVERSITY OF REGINA
FACULTY OF GRADUATE STUDIES AND RESEARCH
SUPERVISORY AND EXAMINING COMMITTEE
Zhenyu Yang, candidate for the degree of Master of Science in Statistics, has presented a
thesis titled, Analysis of Longitudinal Data With Missing Responses: A Study of Pain
Control Cost, in an oral examination held on April 15, 2015. The following committee
members found the thesis acceptable in form and content and found that the candidate
demonstrated satisfactory knowledge of the subject material.
External Examiner:
Dr. Yiyu Yao, Department of Computer Science
Supervisor:
Dr. Yang Y. Zhao, Department of Mathematics & Statistics
Committee Member:
Dr. Dianliang Deng, Department of Mathematics & Statistics
Chair of Defense:
Dr. Liming Dai, Faculty of Engineering & Applied Science
Abstract
Recent years have seen a major increase in interest in pain control cost studies. Due to
the rising cost of medical treatment, researchers study the factors contributing to cost and appropriate ways to control or reduce the cost of pain control.
The first data studied in this research is longitudinal data with the cost values completely missing. We fill in the daily cost for a minority of the observations based on the information
provided by the price data, and then impute the daily cost for the remaining observations
by applying multiple imputation. Multiple sets of complete imputed daily cost data are
produced. A generalized estimating equations (GEE) model is then applied to each set of data, producing multiple analysis results. The correlation between
observations within patients is represented by an AR-1 working correlation structure; the
correlation weakens as the time lag increases.
Combining six sets of the GEE model results is sufficient to produce an overall result.
Two factors, treatment year and type of treatment, have a significant effect on the average daily cost.
Overall, costs are highest for patients in treatment 3, who may take mono-drug, dual-drug or
triple-drug therapy.
Acknowledgement
I would like to express my sincere gratitude to my supervisor, Dr. Yang Zhao, for her
continuous encouragement and support of my graduate study and research. Her guidance,
motivation, enthusiasm and immense knowledge helped me while conducting research and
writing this thesis.
Sincere thanks also go to the Faculty of Graduate Studies and Research and the Department of Mathematics and Statistics for giving me absolute access to required resources and
technical assistance in the pursuit of my graduate studies.
Post Defense Acknowledgement
I would like to thank my external examiner, Dr. Yiyu Yao, supervisory committee
member, Dr. Dianliang Deng, and the chair, Dr. Liming Dai, for their insight, comments
and suggestions.
Dedication
I would like to express my deepest gratitude to my beloved family, my parents, Yiquan Yang and Fanxian Zeng, my parents-in-law, Huiming Li and Yan Jiang, and my friends,
Andrea Osborne and Terry Osborne, for their love and encouragement through my entire
life. I am most grateful to my beloved wife, Yue Li, for her endless love, support, encouragement and patience in all my pursuits.
Contents
Abstract i
Acknowledgement ii
Post Defense Acknowledgement iii
Dedication iv
Table of Contents v
List of Tables viii
List of Figures x
1 Introduction 1
1.1 Background 1
1.2 Data Set 2
1.3 Objectives 3
2 A Review of Longitudinal Study 5
2.1 Introduction of Longitudinal Study 5
2.1.1 Definition 5
2.1.2 Notation 6
2.2 Exploring of Longitudinal Data 7
2.2.1 Introduction 8
2.2.2 Graphical Presentation 9
2.2.3 Approaches to Longitudinal Data Analysis 10
2.3 Generalized Estimating Equations (GEE) Model 12
2.3.1 Inference 12
2.3.2 Specific Working Correlation Structure 15
2.3.3 Model Selection Criterion 17
3 A Review of Statistical Analysis with Missing Data 19
3.1 Missing Data 19
3.1.1 Introduction 19
3.1.2 Missing Data Mechanisms 20
3.2 Multiple Imputation (MI) 22
3.2.1 Introduction 23
3.2.2 Combining Rules and Inference of MI 23
3.2.3 Method for Creating MI 26
4 Analysis of Pain Control Cost Data 27
4.1 Preliminary Analysis 27
4.2 Creating Imputed Data Sets 31
4.3 Analyzing Imputed Daily Cost Data Sets by GEE Model 38
4.4 Combining Analysis Results by MI 49
5 Conclusion and Future Study 53
5.1 Conclusion 53
5.2 Future Study 54
Bibliography 56
List of Tables
4.1 Gender of patients 28
4.2 Diagnosis of patients 29
4.3 Types of treatment of patients 29
4.4 Drugs prescribed in medical treatment 30
4.5 Part of the price data 33
4.6 Part of the first data with filled in daily cost data 34
4.7 Results of price model estimated parameters (coefficients) 35
4.8 Part of the first data with complete imputed daily cost data 36
4.9 Part of other eight complete imputed daily cost data sets 37
4.10 Part of the first set of complete imputed daily cost data 38
4.11 Part of the first set of average daily cost data 39
4.12 Results of QIC for the GEE model 47
4.13 Results of the GEE model 48
4.14 Results of the MI procedure for the variable treatment year 50
4.15 Results of the MI procedure for all variables 50
4.16 Results of combination for all variables 51
4.17 Results of significant test for all variables 51
List of Figures
4.1 Age of patients 28
4.2 Treatment year of patients 30
4.3 The average daily cost of a year for 76 patients vs treatment years 40
4.4 The average daily cost of a year of gender vs treatment years 41
4.5 The average daily cost of a year of 4 main diagnoses vs treatment years 42
4.6 The average daily cost of a year of 3 treatment types vs treatment years 43
4.7 The average daily cost of a year of treatment 1 vs treatment years 44
4.8 The average daily cost of a year of treatment 2 vs treatment years 45
4.9 The average daily cost of a year of treatment 3 vs treatment years 45
4.10 Scatterplot matrix of 76 patients 46
Chapter 1
Introduction
1.1
Background
Pain control cost has been a concern in medical studies over a number of years in
Canada. Patients might take various pain control medications when necessary during their
medical treatment for different time periods, from several days to 10 years or more. Pain
control cost therefore varies with a number of factors. In Canada, most medical costs,
including pain control costs incurred in medical treatment, are covered through the Canadian
health care system and/or private insurance companies. It is important that researchers
study factors that impact pain control cost, with a view to controlling or reducing this
significant expenditure.
The efficacy of prescribed pain control medication on patients is of prime importance
to the medical profession. Doctors may be obliged to decide, for example, if a single
pain control medication or a combination of several pain control medications is the most
effective treatment for their patients. The doctor, in addition to considering drug efficacy, will often be concerned with the cost of medication. Therefore, the relationship
between efficacy and cost of pain control medications becomes a crucial area of study for
researchers.
The role of the pharmacist is to provide patients with the best quality pain medication.
A primary issue facing pharmacists is the quantity of certain pain control drugs in stock. If
a drug expires, waste occurs. It is crucial then that researchers find ways to avoid the waste
of drug inventory.
1.2
Data Set
The data, referred to as the first data in this research, comes from the Regina General
Hospital. It provides information on 83 patients on different pain control medications from
1994 to 2009. The first data records patients' personal information: identification number (ID),
date of birth (age), gender, diagnosis, type of treatment (mono-drug, dual-drug or triple-drug) and the length of time on pain control medication (start and end date). Following this
information are the drug name (8 drugs in total), volume of a single drug or combination
of drugs, concentration and daily dose of each drug for each patient. With any change in
the prescription for a patient, such as a change in the concentration of a certain drug, a new
observation is recorded. Thus, the first data lists a total of 3964 observations.
A sample of price data, referred to as the price data in this research, is obtained
from Saskatchewan pharmacies. It lists 64 different prices for prescriptions randomly
selected from the first data, based on a certain usage of volume and concentration of each
individual drug involved in the three different types of treatment. Seven of the drugs prescribed
in the first data are recorded in the price data, with the information for the 8th drug completely
missing. We are able to fill in the daily cost for 1552 observations listed in the first data
based on the information provided by the price data, and then impute the daily cost for
the remaining 2412 observations by using an appropriate imputation method. This will be
reviewed and discussed in detail in later chapters.
1.3
Objectives
A recent study was conducted by Kumar, Rizvi, Bishop and Tang (2013),
with a focus on the relationship between pain control cost and time in
medical treatment. It studied 110 observed patients in three treatment categories
(mono-, dual- and triple-drug) and in four different diagnosis categories. In
their article Cost Impact of Intrathecal Polyanalgesia, two models, a price model and a cost
model, are established. By using a linear regression method, the price model estimates
each drug price from the observed price data for analysis. The cost model is divided into
7 autoregressive models over a 5-year period, 3 for the three types of treatment and 4 for the four
diagnoses. It predicts daily pain control cost by using time series analysis. The relationship
between the costs incurred by the same individual in different time periods, unfortunately,
remains unknown.
In order to minimize bias or errors in the longitudinal data analysis, this
research takes the missing values in the daily cost of pain control medication into consideration. By using the price information in the price data, we will fill in the daily cost for
1552 observations in the first data, and then impute the daily cost for the remaining 2412
observations by using multiple imputation (MI). The present research aims to conduct a
longitudinal data analysis by establishing a generalized estimating equations (GEE) model. Our goal is to discover the possible factors that impact pain control cost during the
patient's entire medical treatment. The factors may include age, gender, diagnosis, type
of treatment and treatment year. The results of this research are intended to provide information to
aid government and/or insurance companies with possible ways to control or reduce pain
control costs.
Chapter 2
A Review of Longitudinal Study
2.1
Introduction of Longitudinal Study
First, we will briefly discuss the nature of longitudinal studies, focusing on definition,
notation and review of the literature. More detail can be found in the work Analysis of
Longitudinal Data by Diggle, Heagerty, Liang and Zeger (2002).
2.1.1
Definition
The longitudinal study is a research approach that observes and measures each individual
repeatedly over a period of time. It is usually a study of a group of individuals rather than a
large population because of the lengthy time period involved. Compared to
the cross-sectional study, an approach that observes and measures each individual only once,
the longitudinal study reveals relationships, as well as changes, between individuals and
within individuals by analyzing the observations and measurements recorded for each individual over a certain period of time. Therefore, it is considered more useful and powerful than
the cross-sectional study.
Longitudinal data can be collected either prospectively or retrospectively. When
collecting prospectively, the researcher observes and measures individuals forward in time.
When collecting retrospectively, the researcher obtains observations and measurements from the historical records of each individual; such data, however, may be incomplete or
of inferior quality. Longitudinal data are commonly collected prospectively rather than retrospectively. The first data studied in this research is extracted from observations and measurements
in 83 patients' historical records. It is considered incomplete as it does not record daily
cost information for each observation.
2.1.2
Notation
We let Yij be a response variable, where subject i = 1, ..., m and observation j = 1, ..., ni. We denote E(Yij) = µij, Var(Yij) = υij, and Cov(Yij, Yij′) = υijj′ for j ≠ j′.
We let xij be a vector of explanatory variables observed at time tij, written componentwise as xijk, where i = 1, ..., m, j = 1, ..., ni, and k = 1, ..., p. Also we let β = (β1, ..., βp)′ be a p-vector of unknown regression coefficients, and let εij be a zero-mean
random variable. We are then able to write a linear regression model for the longitudinal data as
$$Y_{ij} = \beta_1 x_{ij1} + \beta_2 x_{ij2} + \cdots + \beta_p x_{ijp} + \epsilon_{ij} = x_{ij}'\beta + \epsilon_{ij}.$$
We can also write the model in matrix notation. Let
$$
Y = \begin{pmatrix} Y_{11} \\ \vdots \\ Y_{1n_1} \\ Y_{21} \\ \vdots \\ Y_{2n_2} \\ \vdots \\ Y_{m1} \\ \vdots \\ Y_{mn_m} \end{pmatrix},\quad
X = \begin{pmatrix} x_{111} & x_{112} & \cdots & x_{11p} \\ x_{121} & x_{122} & \cdots & x_{12p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{mn_m 1} & x_{mn_m 2} & \cdots & x_{mn_m p} \end{pmatrix},\quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix},\quad
\epsilon = \begin{pmatrix} \epsilon_{11} \\ \vdots \\ \epsilon_{1n_1} \\ \epsilon_{21} \\ \vdots \\ \epsilon_{2n_2} \\ \vdots \\ \epsilon_{m1} \\ \vdots \\ \epsilon_{mn_m} \end{pmatrix}.
$$
Thus, we can rewrite the model in matrix notation as Y = Xβ + ε.
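To make the stacked representation concrete, the following Python sketch builds Y and X for two subjects and computes the ordinary least squares estimate, which coincides with a GEE fit under an independence working correlation and identity link. All values and variable names are illustrative, not the thesis data.

```python
import numpy as np

# Minimal sketch of the stacked (long-format) model Y = X beta + epsilon.
# Two subjects with 3 and 2 repeated observations; the numbers are illustrative.
Y = np.array([0.70, 1.58, 2.32, 0.94, 0.84])   # responses Y_ij stacked subject by subject
X = np.array([[1.0, 1.0],                      # each row is x_ij' = (intercept, treatment year)
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Ordinary least squares; this ignores the within-subject correlation, which is
# exactly what the GEE working correlation matrix is later introduced to handle.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("beta_hat:", beta_hat)
```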
2.2
Exploring of Longitudinal Data
Following the introduction of longitudinal study, we now move to explore and discuss
the longitudinal data with a focus on its graphical presentation and three common approaches used for analysis in this section. More detail can be found in Diggle, Heagerty, Liang
and Zeger (2002).
2.2.1
Introduction
A complete longitudinal data analysis contains exploratory data analysis (EDA) and
confirmatory analysis. Serving as a foundation stone, EDA is the detective work that
visualizes the data in graphical presentations. Confirmatory analysis, following as the judicial
work, is a procedure to weigh the evidence in the data for or against hypotheses made in the
research.
There are no universal criteria for making an effective graphical presentation of longitudinal
data. Since different graphical presentations expose different relevant evidence in the data
in different scenarios, researchers must be open to all new, useful evidence obtained through
EDA. The following four guidelines are commonly taken into consideration when making
graphical presentations in EDA:
1. identifying both cross-sectional and longitudinal graphical presentations;
2. presenting the relevant evidence of data as much as possible;
3. finding out unusual individuals or unusual observations;
4. highlighting aggregate graphical presentation related to the scientific questions in the
research.
Time, as an explanatory variable xij, is considered the most significant factor and is shown
on the x-axis when we make graphical presentations in EDA.
2.2.2
Graphical Presentation
The first choice to present longitudinal data in a graph for analysis is the scatterplot of
the response variable against time. It directly displays all observations through time. It,
however, cannot show the change of each observed individual. This graphical presentation
usually appears too dense when the number of observations is large.
An alternative graphical presentation is to connect the repeated observations in lines
for each individual. This connected-line graph is considered adequate to show the general
change for all observations, as well as for each individual, throughout time. Specifically,
it highlights the difference in all individuals at the beginning and end of time period. An
individual line in this graph, unfortunately, is difficult to isolate for analysis when the graph
contains many observations. In this research, we will use this connected-line graph to
display the average daily cost per year data in a later chapter. It can help us identify the
change of cost for all patients and also for each individual patient through their medical
treatment.
In order to pick an individual line out for analysis, we can connect the repeated observations in lines for randomly selected individuals. Two problems, however, may arise in this
graphical presentation. One is that a group of non-representative individuals will possibly
be selected by chance. The other is that outlying individuals may be selected and included.
Both will cause bias affecting the analysis result.
An improved graphical presentation follows. We place all observed individuals in order
according to a certain common characteristic, such as the mean or median for each individual,
and then connect the observations only for individuals at selected percentiles (e.g., 10th,
20th, ..., 90th). By doing this, we are able to exclude non-representative and outlying
individuals, and thus minimize bias in the analysis result. This percentile graph requires
more work when the data set is large, and it cannot adequately show the change in all observations.
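As an aside, the connected-line graph described above is straightforward to produce in Python; the sketch below uses simulated trajectories (the patient values are invented for illustration), not the cost data analyzed later.

```python
import numpy as np
import matplotlib.pyplot as plt

# Connected-line ("spaghetti") plot: one line per individual over time.
rng = np.random.default_rng(1)
years = np.arange(1, 11)

fig, ax = plt.subplots()
for patient in range(10):                                   # 10 simulated patients
    cost = 1.5 + np.cumsum(rng.normal(0.3, 0.5, years.size))
    ax.plot(years, cost, marker="o", linewidth=1, alpha=0.7)

ax.set_xlabel("Treatment year")
ax.set_ylabel("Average daily cost ($)")
ax.set_title("Connected-line plot of simulated cost trajectories")
plt.show()
```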
2.2.3
Approaches to Longitudinal Data Analysis
The three most common approaches used in the longitudinal data analysis are the
marginal model, the random effects model and the transition model. We will review and
discuss each of them.
1. Marginal model
The marginal model, or the generalized estimating equations (GEE) model, models the
marginal mean and the covariance of the longitudinal data separately. The marginal expectation, E(Yij) = µij, depends on the explanatory variables xij. The marginal variance,
Var(Yij) = υij, depends on the marginal mean and a scale parameter φ. The correlation between Yij and Yij′ is a known function of the marginal mean with an additional
parameter α. In the marginal model, the correlation can be represented by a working correlation
matrix. It is required to be modeled since repeated observations within each individual are
often not independent. The marginal model is considered the most appropriate approach in
the longitudinal data analysis when the objective is to make inferences about the population
average of longitudinal data.
2. Random effects model
The random effects model assumes that the correlation arises within repeated observations
as each individual shares mutual and independent random effects, Ui , where i = 1, ..., m.
It incorporates the correlation in the longitudinal data and represents the variation between
individuals. Given Ui , the repeated observations Yi1 , ..., Yini are mutually independent and
the conditional distribution of Yij follows a distribution from the exponential family with
density f(yij | Ui, β). The conditional expectation E(Yij | Ui) = µij and the conditional variance
Var(Yij | Ui) = υij satisfy h(µij) = x′ijβ + d′ijUi and υij = a(µij)φ, where h(·) and a(·)
are known link and variance functions, dij is a subset of the explanatory variables xij, and φ
is a scale parameter. The random effects model is more useful when we are considering
individuals rather than the population average in the longitudinal data analysis.
3. Transition model
The transition model assumes that correlation exists among the repeated observations Yi1 ,...,
Yini since the past observations, Yi1 , ..., Yij−1 , are treated as additional predictor variables
to influence the current observation, Yij. We let Hij = {Yi1, ..., Yi,j−1} be the past observations of the ith individual. Given Hij and the explanatory variables xij, the conditional mean E(Yij | Hij) = µCij and the conditional variance Var(Yij | Hij) = υCij satisfy
$$h(\mu_{ij}^C) = x_{ij}'\beta + \sum_{r=1}^{s} f_r(H_{ij}; \alpha) \quad \text{and} \quad \upsilon_{ij}^C = a(\mu_{ij}^C)\phi,$$
where h(·) and a(·) are known link and variance functions, fr(·) is a known and suitable function of the past observations, α is a
vector of additional parameters and φ is a scale parameter. The transition model is mostly
used in the longitudinal data analysis when a Markov chain structure is assumed.
We should be aware that each model considers not only the response variables Yij and the explanatory variables
xij, but also the correlation within individuals. This helps us estimate
the regression coefficients β correctly and efficiently in order to avoid bias. Compared with the random effects model and the transition model, the GEE model can represent the correlation structure using a working correlation matrix. Additionally,
the GEE estimator is consistent and has good asymptotic properties even when the working correlation matrix is misspecified. Therefore, we choose the GEE model in this research to
analyze our longitudinal data.
2.3
Generalized Estimating Equations (GEE) Model
Before we establish the GEE model and apply it to the longitudinal data analysis in this
research, we take a closer look at it in terms of its inference,
specific working correlation structures and model selection criterion. More detail can be
found in Liang and Zeger (1986) and in Imori (2013).
2.3.1
Inference
The GEE model was developed in 1986 based on the generalized linear model (GLM) in
order to analyze the correlation among observations within each individual in the longitudinal data analysis. It specifies the mean, variance and covariance without further distributional assumptions. The GEE model gives consistent estimates of the regression parameters
and of their variances under weak assumptions about the time dependence.
In the GLM, we assume that the marginal density of Yij is
$$f(Y_{ij}) = \exp\!\left[\frac{Y_{ij}\theta_{ij} - a(\theta_{ij}) + b(Y_{ij})}{\phi}\right].$$
The first two moments of Yij are
$$E(Y_{ij}) = a'(\theta_{ij}) = \mu_{ij}, \qquad \mathrm{Var}(Y_{ij}) = a''(\theta_{ij})\phi = \upsilon_{ij},$$
where θij = h(ηij), ηij = x′ijβ, h(·), a(·) and b(·) are known functions, and φ is a scale
parameter.
In the GEE, we let R(α) be an ni × ni working correlation matrix and α be a vector
of unknown correlation parameters common to all individuals. We let Vi be the working covariance
matrix for Yi, defined as
$$V_i = \phi A_i^{1/2} R(\alpha) A_i^{1/2},$$
where Ai = diag{a″(θij)} is an ni × ni diagonal matrix for each i. If R(α) is the true
correlation matrix for Yi, then Cov(Yi) = Vi.
Now we are able to write the GEE model for the longitudinal data as
$$U(\beta, \alpha) = \sum_{i=1}^{m} D_i' V_i^{-1} S_i = 0, \qquad (2.1)$$
where Di = ∂µi/∂β = AiΔiXi, Δi = diag(∂θij/∂ηij) is an ni × ni matrix, Xi = (xi1, ..., xini)′ is the ni × p matrix of covariate values of the ith subject, and Si = Yi − µi
with µi = (µi1, ..., µini)′. We can see that D′iV−1i does not depend on the response variable
Y, and that if E(Si) = 0 then E(U(β, α)) = 0. In this sense, the GEE model
is an unbiased estimating equation, which ensures the consistency of the regression coefficient
estimator given a correctly specified mean function.
From equation (2.1), given β and φ we can obtain a consistent estimate of α, and given β we can obtain a consistent estimate of φ. Substituting these two estimates into equation (2.1) gives
$$\sum_{i=1}^{m} U_i\!\left[\beta, \hat{\alpha}\big(\beta, \hat{\phi}(\beta)\big)\right] = 0, \qquad (2.2)$$
which is an estimating equation in β alone, and β̂ is defined as the solution of equation (2.2).
To estimate the variance of β̂ we can use the following two approaches: naive variance
estimation and robust variance estimation.
1. The naive variance estimate of β̂ is
$$\left(\sum_{i=1}^{m} D_i' V_i^{-1} D_i\right)^{-1}.$$
The naive variance sometimes underestimates or overestimates the variance. It may be
unreliable when the assumed model for the correlation structure is not adequate.
2. The robust variance estimate of β̂ is
$$\left(\sum_{i=1}^{m} D_i' V_i^{-1} D_i\right)^{-1} \left\{\sum_{i=1}^{m} D_i' V_i^{-1} S_i S_i' V_i^{-1} D_i\right\} \left(\sum_{i=1}^{m} D_i' V_i^{-1} D_i\right)^{-1},$$
which uses SiS′i to estimate the variance of Yi. The robust variance is usually called a
sandwich-type covariance estimator. It automatically accounts for the correlation in the
response variable, and the results adapt to the data structure. We therefore normally
prefer the robust variance estimate to the naive variance estimate in the
GEE model.
Now we consider the correlation parameter α and the scale parameter φ. The standardized residual is defined as
$$\hat{r}_{ij} = \frac{Y_{ij} - a'(\hat{\theta}_{ij})}{\sqrt{a''(\hat{\theta}_{ij})}}.$$
In the GEE model, φ can be estimated as
$$\hat{\phi} = \sum_{i=1}^{m}\sum_{j=1}^{n_i} \frac{\hat{r}_{ij}^{\,2}}{N - p},$$
where N = Σmi=1 ni and p is the length of the explanatory variable vector xij. We can then
estimate α using
$$\hat{R}(\alpha_{jk}) = \sum_{i=1}^{m} \frac{\hat{r}_{ij}\hat{r}_{ik}}{N - p}.$$
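To illustrate how such a model is fitted in practice, the sketch below uses the Python package statsmodels on simulated data shaped like the averaged cost data analyzed later; the column names and numeric values are assumptions made for illustration only. For simplicity it uses an exchangeable working correlation; an AR-1 structure is also available in statsmodels but typically requires the observation times to be supplied.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated long-format data imitating the structure id / year / treatment / cost.
rng = np.random.default_rng(0)
n_patients, n_years = 20, 6
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_patients), n_years),
    "year": np.tile(np.arange(1, n_years + 1), n_patients),
    "treatment": np.repeat(rng.integers(1, 4, n_patients), n_years),
})
df["cost"] = 1.0 + 0.4 * df["year"] + 0.5 * df["treatment"] + rng.normal(0, 1, len(df))

# GEE with a Gaussian (identity-link) mean model and an exchangeable working
# correlation; the reported standard errors are the robust (sandwich) ones.
model = smf.gee("cost ~ year + C(treatment)", groups="id", data=df,
                family=sm.families.Gaussian(),
                cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.summary())
```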
2.3.2
Specific Working Correlation Structure
A specific choice of R(α) in different structures will result in different analyses. When
discussing the working correlation in the GEE model, the following four specific structures
are primarily considered.
1. Independence
Independence correlation indicates that, for each individual, any two observations are uncorrelated. This structure is defined by the identity matrix,
$$R(\alpha) = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.$$
2. Exchangeable
Exchangeable correlation shows that any two observations have the same correlation ρ for
each individual. The correlation matrix is
$$R(\alpha) = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}.$$
3. Unstructured
Unstructured correlation is completely unspecified. We are unable to find any patterns in
the correlation matrix for each individual. Each element of the correlation matrix needs to
be estimated separately, so ni(ni − 1)/2 parameters need to be taken into consideration.
The correlation matrix is
$$R(\alpha) = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1 n_i} \\ \rho_{21} & 1 & \cdots & \rho_{2 n_i} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n_i 1} & \cdots & \rho_{n_i, n_i - 1} & 1 \end{pmatrix},$$
where ρjj′ = corr(Yij, Yij′) for j ≠ j′.
4. First-order autoregressive (AR-1)
The correlation structure of the continuous time analogue of the first-order autoregressive
(AR-1) process was proposed by Feller (1971). It indicates that the correlation between two
observations weakens as the time lag increases. Let Corr(Yij, Yij′) = ρ^|j−j′|, where j ≠ j′. The correlation matrix is
$$R(\alpha) = \begin{pmatrix} 1 & \rho & \cdots & \rho^{\,n_i-1} \\ \rho & 1 & \cdots & \rho^{\,n_i-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{\,n_i-1} & \cdots & \rho & 1 \end{pmatrix}.$$
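As a concrete illustration, the following Python sketch builds these working correlation matrices for a single subject with ni = 4 repeated observations; the value of ρ is arbitrary.

```python
import numpy as np

n_i, rho = 4, 0.6                      # illustrative dimension and correlation

R_independence = np.eye(n_i)           # identity matrix: observations uncorrelated

R_exchangeable = np.full((n_i, n_i), rho)
np.fill_diagonal(R_exchangeable, 1.0)  # common correlation rho off the diagonal

# AR-1: correlation rho**|j - j'| decays as the time lag between observations grows.
lags = np.abs(np.subtract.outer(np.arange(n_i), np.arange(n_i)))
R_ar1 = rho ** lags

# Unstructured: every off-diagonal entry is its own parameter, so
# n_i * (n_i - 1) / 2 correlations would have to be estimated from the data.
n_unstructured_params = n_i * (n_i - 1) // 2

print(np.round(R_ar1, 3))
print("unstructured parameters:", n_unstructured_params)
```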
Model Selection Criterion
Among the four working correlation structures discussed above, the independence correlation is the simplest. It is usually the first choice in the GEE model, especially
when the correlation structure is not obvious. Liang and Zeger (1986) have shown that the
independence correlation is sufficient when the correlation between
observations is relatively small. It has low efficiency, however, when the correlation becomes relatively
large. No matter which working correlation structure is chosen, the GEE model yields a
consistent estimator of β.
The Kullback-Leibler (KL) criterion, which depends on unknown parameters, was proposed
by Kullback and Leibler (1951). It estimates a risk function in order to measure the
appropriateness of the fit of a model. Two decades later, Akaike (1973) developed the
Akaike information criterion (AIC) based on the KL criterion. It has since become popular for selecting
the best model among candidate models. The AIC is defined as
$$\mathrm{AIC} = -2\,(\text{maximized log-likelihood}) + 2\,(\text{number of parameters}),$$
and the best-fitting model is the one with the lowest AIC value.
The AIC, however, cannot be used directly in the GEE procedure, since no multivariate distribution is assumed for the observations. Therefore, Pan (2001)
modified the AIC based on the quasi-likelihood constructed from the estimating equations, naming it the
quasi-likelihood under the independence model criterion (QIC). The QIC thus becomes an
alternative method for model selection in the longitudinal study. It is defined as
$$\mathrm{QIC} = -2Q(Y; \hat{\beta}) + 2\,\mathrm{tr}(\hat{V}_s \hat{\Omega}_I),$$
where Q(·) is the quasi-likelihood term and tr(·) is the trace term for the penalty. V̂s and Ω̂I
are defined as
$$\hat{\Omega}_I = \frac{1}{m}\sum_{i=1}^{m} D_i' A_i^{-1} D_i, \qquad \hat{\Omega}_R = \frac{1}{m}\sum_{i=1}^{m} D_i' V_i^{-1} D_i,$$
$$\hat{V}_s = \hat{\Omega}_R^{-1}\left[\frac{1}{m}\sum_{i=1}^{m} D_i' V_i^{-1} \mathrm{Cov}(Y_i) V_i^{-1} D_i\right]\hat{\Omega}_R^{-1},$$
where Ω̂R = Ω̂I when R(α) is the independence structure, and Ω̂R⁻¹ = V̂s when R(α) includes
the true correlation structure, since Vi = Cov(Yi) in this situation. The working
correlation structure with the smallest QIC value is considered to give the best GEE model.
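A minimal sketch of QIC-based selection with statsmodels follows. It assumes a statsmodels version whose fitted GEE results expose a qic() method returning QIC and QICu (an assumption about the installed version), and it re-uses the same simulated data layout as in the earlier sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated data with the same layout as before (id, year, cost).
rng = np.random.default_rng(0)
df = pd.DataFrame({"id": np.repeat(np.arange(20), 6),
                   "year": np.tile(np.arange(1, 7), 20)})
df["cost"] = 1.0 + 0.4 * df["year"] + rng.normal(0, 1, len(df))

structures = {"independence": sm.cov_struct.Independence(),
              "exchangeable": sm.cov_struct.Exchangeable()}
for name, cov in structures.items():
    res = smf.gee("cost ~ year", groups="id", data=df,
                  family=sm.families.Gaussian(), cov_struct=cov).fit()
    qic, qic_u = res.qic()             # assumed API: returns (QIC, QICu)
    print(f"{name:13s} QIC = {qic:.1f}")   # the smallest QIC indicates the preferred structure
```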
Chapter 3
A Review of Statistical Analysis with Missing
Data
3.1
Missing Data
Missing values have been studied extensively in the recent literature, since they commonly appear in
longitudinal data, especially when the data are collected retrospectively. In this section, we will briefly
review the missing data problems discussed in Diggle, Heagerty, Liang, and Zeger (2002)
and in Little and Rubin (2002).
3.1.1
Introduction
Missing values appear when one or more intended measurements for an individual are not taken
during data collection, or when expected values are not obtained, are lost, or are unavailable.
Computer problems, outages of data storage systems, privacy and religious issues,
and/or mistakes made in collecting and recording information may all be
causes of missing data.
Generally, missing values are categorized into intermittent missing values and dropouts.
If a missing value occurs as a dropout for an individual subject Yi at time j, all its subsequent
values Yik, k ≥ j, are missing; otherwise, it is an intermittent missing value.
Researchers dealing with missing values in longitudinal data analysis generally prefer intermittent missing values, since it is commonly believed
that estimating and imputing intermittent missing values is easier than handling dropouts, for which all
subsequent information is missing. Fortunately, the missing values in the daily cost data
studied in a later chapter of this research are intermittent missing values.
Missing data should be appropriately taken into consideration in longitudinal data analysis. It may hide true values that are significant in making an effective prediction. Ignoring
missing data may result in bias. As briefly introduced in Chapter 1, pain control cost information is not recorded in the first data set. Based on the information provided by the price
data, we are therefore able to fill in the daily cost for 1552 observations in the first data.
The daily cost for the remaining 2412 observations, however, is considered missing. So
our primary job in this research is to discover an appropriate way to estimate and impute
the daily cost for those 2412 observations when conducting a longitudinal data analysis.
3.1.2
Missing Data Mechanisms
The missing data mechanism concerns the relationship between the missing values and the
observed values of each variable in the data. It usually determines the choice of methods for
estimating and imputing missing values. In 1976, Rubin proposed three missing data mechanisms:
missing completely at random, missing at random and not missing at random. We let
Y = (Yij ) be the complete data set, and M = (mij ) the missing data indicator matrix,
where mij = 1 if Yij is missing and mij = 0 if Yij is present. Let f (M |Y, ϕ) be the
conditional distribution of M given Y , where ϕ is a vector of unknown parameters.
1. Missing completely at random (MCAR)
MCAR is a missing data mechanism when missing values depend on neither the missing
values themselves nor the observed values of Y . The formula is
f (M |Y, ϕ) = f (M |ϕ)
for all Y , ϕ.
When data are MCAR, we can simply conduct a longitudinal data analysis based on complete observations without considering the missing values. However, the produced analysis
result may lose power as incomplete observations are not used in the analysis. When taking
missing values into consideration, we can apply multiple imputation (MI) to estimate and
impute the missing values in the data set. The missing values in the cost data set in this
research are MCAR because the price data set is based on a random sample of prescriptions.
We will deal with them further in detail in Chapter 4.
2. Missing at random (MAR)
Let Yobs be the observed components and Ymis be the missing components of Y . If missing
values depend only on the components Yobs of Y , rather than on the components Ymis of Y ,
this missing data mechanism is called MAR. The formula is
f (M |Y, ϕ) = f (M |Yobs , ϕ)
for all Ymis , ϕ.
Similarly, missing values in MAR can be also estimated and imputed by MI. The analysis
result is considered unbiased when the missing values are taken into consideration.
3. Not missing at random (NMAR)
NMAR is a missing data mechanism when the distribution of M depends on the missing
values in the data matrix Y . It is a damaging situation since the missing values depend on
unmeasured values. Unfortunately, we cannot apply MI to deal with the missing values in
NMAR.
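The difference between MCAR and MAR can be seen in a small simulation. The Python sketch below (all values simulated, none from the thesis data) generates a missingness indicator M that is unrelated to the data under MCAR and that depends only on a fully observed covariate under MAR.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)                    # fully observed covariate
y = 2.0 + 1.5 * x + rng.normal(size=n)    # complete response before masking

# MCAR: P(m_ij = 1) is a constant, unrelated to x or y.
m_mcar = rng.random(n) < 0.3

# MAR: P(m_ij = 1) depends only on the observed x, not on the (possibly missing) y.
p_mar = 1.0 / (1.0 + np.exp(-(x - 0.5)))
m_mar = rng.random(n) < p_mar

y_mcar = np.where(m_mcar, np.nan, y)      # observed response under each mechanism
y_mar = np.where(m_mar, np.nan, y)
print("MCAR missing rate:", m_mcar.mean(), "| MAR missing rate:", round(m_mar.mean(), 3))
```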
3.2
Multiple Imputation (MI)
Multiple imputation (MI) is a common method to deal with missing data in the statistical analysis. It estimates and imputes more than one value for each individual missing
value in the data. Single imputation, another popular method, estimates and imputes only
one value for an individual missing value by mean imputation, regression imputation, hot
deck imputation, cold deck imputation etc. Without any special adjustments, the imputed single value is treated as a true value in the analysis. Therefore, it usually underestimates the
variance and is also unable to reflect uncertainty about the correct model for nonresponse.
The multiple imputation overcomes these deficiencies. In this section we will discuss the
multiple imputation in detail in terms of its concepts, method of creating the multiple imputed data sets and combining rules and inferences. More detail can be found in Little and
Rubin (2002).
3.2.1
Introduction
MI was first proposed by Rubin and then further developed in 1987. It is a procedure
to replace each missing value with plausible values representing the uncertainty about the
true value to make a complete data set, referred to as imputed data sets, for analysis. This
replacement process is performed multiple times, producing multiple imputed data sets. An
analysis is then conducted on each imputed data set to produce multiple analysis results.
By integrating the multiple analysis results, one overall analysis result is finally obtained.
Compared with the single imputation, MI can restore the variability in missing data.
It creates multiple imputed values based on variables correlated with the missing data to
maintain the variability. It also creates an inference by combining multiple analysis results to reflect sampling variability. When creating different versions of missing data and
observing the variability between imputed data sets, uncertainty exists. MI can produce unbiased parameter estimates to reflect the uncertainty and then incorporate the uncertainty.
Additionally, MI is able to provide adequate results in the presence of low sample size or
high rates of missing data. MI, however, requires extra work and time in the performance
of multiple imputation and analysis.
3.2.2
Combining Rules and Inference of MI
When analyzing multiple imputed data sets, we let D denote the number of imputed complete data sets, with D ≥ 2. Now we introduce Rubin's combining rules. Let θ̂d be
the complete-data estimate of a scalar quantity of interest obtained from data set d, for
d = 1, 2, ..., D, and let Wd be the associated variances. The combined estimate is the average of the individual estimates,
$$\bar{\theta}_D = \frac{1}{D}\sum_{d=1}^{D} \hat{\theta}_d. \qquad (3.1)$$
Now we calculate the average within-imputation variance, which measures the natural
variability in the data,
$$\bar{W}_D = \frac{1}{D}\sum_{d=1}^{D} W_d, \qquad (3.2)$$
which is simply the average of the estimated variances. And we calculate the between
imputation variance, which measures the uncertainty introduced by the missing data,
$$B_D = \frac{1}{D-1}\sum_{d=1}^{D} (\hat{\theta}_d - \bar{\theta}_D)^2, \qquad (3.3)$$
which is the sample variance of the estimates themselves. With the above two components,
we are able to obtain the total variance of θ̄D, which is
$$T_D = \bar{W}_D + \frac{D+1}{D} B_D, \qquad (3.4)$$
where (D + 1)/D is an adjustment for finite, small D. The quantity √TD is the total standard error
associated with θ̄D.
In general, the confidence interval can be calculated by using the approximation
$$\bar{\theta}_D \pm t_{\nu}\sqrt{T_D}.$$
For a large sample size and scalar θ, we replace the normal distribution by the t distribution
when D is small. By the Satterthwaite approximation (Rubin and Schenker, 1986; Rubin,
1987a), the t distribution has degrees of freedom
$$\nu = (D-1)\left(1 + \frac{D}{D+1}\,\frac{\bar{W}_D}{B_D}\right)^{2}. \qquad (3.5)$$
For a small sample size, the t distribution has degrees of freedom
$$\nu^* = \left(\nu^{-1} + \hat{\nu}_{\mathrm{obs}}^{-1}\right)^{-1},$$
where
$$\hat{\nu}_{\mathrm{obs}} = \left(1 - \frac{D+1}{D}\,\frac{B_D}{T_D}\right)\left(\frac{\nu_{\mathrm{com}} + 1}{\nu_{\mathrm{com}} + 3}\right)\nu_{\mathrm{com}},$$
and νcom is the degrees of freedom for t inferences about θ with no missing values (Barnard
and Rubin, 1999).
With respect to the significance test of the null hypothesis θ = 0, we compare the ratio
$t = \bar{\theta}_D/\sqrt{T_D}$ with the t distribution.
In 1987, Rubin showed that the relative efficiency of an estimator based on a finite D is
$$\frac{V(\bar{\theta}_{\infty})}{V(\bar{\theta}_D)} = \left(1 + \frac{\hat{r}}{D}\right)^{-1}, \qquad (3.6)$$
where r̂ is an estimate of the fraction of missing information,
$$\hat{r} = \frac{\hat{\gamma}_D + 2/(\nu + 3)}{\hat{\gamma}_D + 1}, \qquad (3.7)$$
and γ̂D is the relative increase in variance due to nonresponse,
$$\hat{\gamma}_D = \frac{D+1}{D}\,\frac{B_D}{\bar{W}_D}. \qquad (3.8)$$
According to Rubin, imputing the missing data between 2 and 9 times is essentially all that is required in
the MI process in order to obtain high efficiency in dealing with missing data.
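A small Python function implementing these combining rules is sketched below; the equation numbers in the comments refer to (3.1)–(3.5), and the example estimates are invented purely to show the call.

```python
import numpy as np
from scipy import stats

def combine_rubin(estimates, variances):
    """Pool a scalar estimate over D imputed data sets using Rubin's rules."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    D = est.size
    theta_bar = est.mean()                               # (3.1) pooled estimate
    W_bar = var.mean()                                    # (3.2) within-imputation variance
    B = est.var(ddof=1)                                   # (3.3) between-imputation variance
    T = W_bar + (D + 1) / D * B                           # (3.4) total variance
    nu = (D - 1) * (1 + D * W_bar / ((D + 1) * B)) ** 2   # (3.5) degrees of freedom
    return theta_bar, T, nu

# Toy usage with invented estimates from D = 6 imputed data sets.
theta_hats = [0.42, 0.45, 0.39, 0.44, 0.41, 0.43]
within_vars = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011]
theta_bar, T, nu = combine_rubin(theta_hats, within_vars)
ci = theta_bar + np.array([-1.0, 1.0]) * stats.t.ppf(0.975, nu) * np.sqrt(T)
print(f"pooled estimate = {theta_bar:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```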
3.2.3
Method for Creating MI
Now we return to discuss the method for creating MI. Rubin proposed the idea in 1978
of drawing missing values from their joint posterior predictive distribution. The underlying
condition of this idea is to integrate over the parameters θ, which causes difficulties in
drawing from the predictive distribution especially in complicated cases. This method
is considered time consuming, as well, when multivariate data are involved in nonlinear
relationships to build a coherent model for the joint distribution of the variables, program
the draws or assess convergence. To overcome these problems, an alternative approach,
approximate draw, appears to yield approximately valid inferences with combining rules.
This alternative method seems to be more effective than a rigorous MI inference under a
full model without a perfect reflection of the data.
By using this approximate draw method in MI, we are able to approximate the mean
and variance, for large D, as follows:
$$E(\theta \mid Y_{\mathrm{obs}}) \approx \frac{1}{D}\sum_{d=1}^{D}\hat{\theta}_d = \bar{\theta}_D,$$
$$\mathrm{Var}(\theta \mid Y_{\mathrm{obs}}) \approx \frac{1}{D}\sum_{d=1}^{D} W_d + \frac{1}{D-1}\sum_{d=1}^{D}(\hat{\theta}_d - \bar{\theta}_D)^2 = \bar{W}_D + B_D.$$
When D is small, the mean is the same as for large D above. The variance, however,
is different:
$$\mathrm{Var}(\theta \mid Y_{\mathrm{obs}}) \approx \bar{W}_D + \frac{D+1}{D} B_D.$$
Chapter 4
Analysis of Pain Control Cost Data
4.1
Preliminary Analysis
A preliminary analysis of the first data from the Regina General Hospital is required
before conducting a longitudinal analysis. As briefly introduced in Chapter 1, the first data
records 83 patients in terms of their personal medical information, including age, gender,
diagnosis, types of treatment and treatment year. We will discuss each of them in detail.
The observed patient ages vary from 17 to the 90s. From Figure 4.1, we can see that 28
patients are in their 60s, 25 in their 50s, 12 in their 40s and 10 in their 70s. The youngest
patient is only 17 years old, while the oldest is 92. Based on patient age, we note that
patients in their 50s and 60s require the greatest amount of pain control medication.
Figure 4.1: Age of patients
With respect to gender presented in Table 4.1, we can see that the number of male and
female patients is very close. Specifically, 44 out of 83 patients are male with an average
age of 60. The remaining 39 patients are female with an average age of 57. Therefore, we
can assume that the need for pain control medication differs modestly with gender.
Table 4.1: Gender of patients
Gender | Number of Patients | Percentage | Mean Age
Male | 44 | 53.01% | 60.64
Female | 39 | 46.99% | 57.33
Total | 83 | 100.00% | 59.08
Table 4.2 shows all 6 diagnoses recorded in the first data. Spasticity Chronic (SC),
Complex Regional Pain Syndrome (CRPS), Failed Back Surgery Syndrome (FBSS) and
Spasticity Multiple Sclerosis (SMS) are categorized as the chronic non-cancer pain diagnosis. The number of patients falling in this category comes to 79, accounting for 95%
of the total. The other two diagnoses, Neuropathy and Pancreatitis, form the miscellaneous
diagnosis category, with only 4 patients included. The average age of patients in the chronic non-cancer pain diagnosis category is about 60.
Table 4.2: Diagnosis of patients
Diagnosis | Diagnosis Type | Number of Patients | Percentage | Mean Age
SC | 1 | 19 | 22.89% | 58.05
CRPS | 2 | 26 | 31.33% | 61.15
FBSS | 3 | 28 | 33.74% | 58.32
SMS | 4 | 6 | 7.23% | 61.67
Neuropathy | 5 | 3 | 3.61% | 53.33
Pancreatitis | 6 | 1 | 1.20% | 48.00
Table 4.3: Types of treatment of patients
Treatment | Number of Patients | Percentage | Mean Age
Treatment 1 | 46 | 55.42% | 59.39
Treatment 2 | 29 | 34.94% | 57.83
Treatment 3 | 8 | 9.64% | 61.88
In terms of types of treatment in Table 4.3, patients are involved in three types based
on their needs, treatment 1 with mono-drug prescription, treatment 2 with either mono- or
dual-drug prescription and treatment 3 with mono-, dual- or triple-drug prescription. All
83 patients start their medical treatment in either treatment 1 or treatment 2. Due to the
drug efficacy, 8 of them move into treatment 3 later for a certain period of time, accounting
for less than 10% of the total. A majority of the patients, 46, remain in treatment 1 through
their entire medical treatment, accounting for more than half. Twenty-nine patients stay
in treatment 2, accounting for about 35%. It is of interest to note that the average age of
patients in each type of treatment is again about 60.
Figure 4.2 shows the time in years the observed patients spent on pain control medications in their medical treatment. It varies from 1 year to 15 years with 5.6 years as the
average. The first, second and third quartiles are 2, 4 and 9 years, respectively.
Figure 4.2: Treatment year of patients
Table 4.4: Drugs prescribed in medical treatment
Drug Names | Drug Number | Number of Patients | Percentage
Baclofen | D1 | 40 | 48.19%
Bupivacaine | D2 | 15 | 18.07%
Clonidine | D3 | 5 | 6.02%
Dilaudid | D4 | 35 | 42.17%
Fentanyl | D5 | 7 | 8.43%
Morphine | D6 | 45 | 54.22%
Ropivicaine | D7 | 20 | 24.10%
Cadd | D8 | 1 | 1.20%
Besides the above personal medication information, eight drugs prescribed in either
mono-, dual-, or triple-drug are also recorded in the first data. From Table 4.4, we can see
that Baclofen, Dilaudid, Morphine and Ropivicaine are the four most frequently used drugs
to relieve or control pain. Specifically, 45 patients are on Morphine, 40 on Baclofen, 35 on
Dilaudid and 20 on Ropivicaine, accounting for over 80% of the total. There is only one
patient taking Cadd. We will discuss this later.
Reviewing the first data in detail from above, we notice that no pain control cost information is recorded at all. So now our primary job is to calculate the cost in an appropriate
way based on the information provided in the price data.
4.2
Creating Imputed Data Sets
As briefly introduced in Chapter 1, the price data obtained from Saskatchewan pharmacies lists 64 different prices based on a certain usage of volume and concentration of each
individual drug in the three types of treatment. Since it is difficult to track the market price of
drugs from two decades ago, all prices recorded in the data are current. Price changes over
time will not be taken into consideration. For example, we assume that the daily cost of
pain control drugs from January 30, 1998 to February 3, 2008 remains unchanged.
With the information provided by the price data, the daily cost for 1552 observations
recorded in the first data can be filled in since exactly the same usage of volume and concentration of each drug prescribed for these observations can be found in the price data.
The daily cost formula is
$$\text{Daily Cost} = \frac{\text{Price}}{\text{Volume} \times \text{Concentration}} \times \text{Daily Dose}.$$
Take a patient on mono-drug D1 in volume 20ml and concentration 0.5mg/ml in treatment 1 for example. The price in the price data for D1 with the same volume and concentration is $241.185. The daily dose of the patient taking D1 recorded in the first data is
0.0291 mg. So the daily cost for this patient is
$$\text{Daily Cost} = \frac{241.185}{20 \times 0.5} \times 0.0291 \approx 0.70.$$
The above formula also applies in calculating daily cost for patients taking dual- or
triple-drug. Since each individual drug keeps a fixed ratio when combined with other
drugs in one prescription, either dual- or triple-drug, we can calculate the daily cost by
using any one of the prescribed drugs. Take one patient on dual-drug D4 and D7 with
a total volume of 40 ml in treatment 2 for example. The concentration of D4 is 1.4 mg/ml
and that of D7 is 7.1 mg/ml. The price listed in the price data for this dual-drug prescription with the same
volume and concentrations is $89.67, while the daily doses of D4 and D7 are 0.47 mg and
2.4 mg respectively. So the following daily costs 1 and 2, calculated by using
D4 and D7, are $0.753 and $0.758 respectively.
$$\text{Daily Cost 1} = \frac{89.67}{40 \times 1.4} \times 0.47 \approx 0.753.$$
$$\text{Daily Cost 2} = \frac{89.67}{40 \times 7.1} \times 2.4 \approx 0.758.$$
Theoretically, the above two daily costs should be the same. However, due to the rounding error in the data collection method in both the first data and the price data, they are
slightly different. Now we can average the daily cost of drugs in dual- or triple-drug to
obtain the daily cost. In the above example, the daily cost for the patient is
$$\text{Daily Cost} = \frac{1}{2} \times (0.753 + 0.758) \approx 0.76.$$
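The daily-cost calculation above is easy to express in code. The following Python sketch simply re-implements the worked mono-drug and dual-drug examples; the helper names are ours, introduced only for illustration.

```python
def daily_cost_single(price, volume, concentration, daily_dose):
    """Price / (volume * concentration) gives cost per mg; multiply by the daily dose."""
    return price / (volume * concentration) * daily_dose

def daily_cost_combination(price, volume, concentrations, daily_doses):
    """For dual-/triple-drug prescriptions, average the per-drug calculations,
    which differ only by rounding error in the recorded values."""
    costs = [daily_cost_single(price, volume, c, d)
             for c, d in zip(concentrations, daily_doses)]
    return sum(costs) / len(costs)

# Mono-drug D1 example: price $241.185, 20 ml at 0.5 mg/ml, daily dose 0.0291 mg.
print(round(daily_cost_single(241.185, 20, 0.5, 0.0291), 2))                   # 0.7
# Dual-drug D4 + D7 example: price $89.67, 40 ml total volume.
print(round(daily_cost_combination(89.67, 40, [1.4, 7.1], [0.47, 2.4]), 2))    # 0.76
```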
Part of the price data is shown in Table 4.5. Cn(D1),..., Cn(D7) represent the concentration of drug D1,..., D7 in each prescription, and Vo is the total volume of drug(s).
Table 4.5: Part of the price data
Price ($) | Cn(D1) | Cn(D2) | Cn(D3) | Cn(D4) | Cn(D5) | Cn(D6) | Cn(D7) | Vo
241.185 | 0.5 | . | . | . | . | . | . | 20
849.62 | 2 | . | . | . | . | . | . | 20
66 | . | . | 0.1 | . | . | . | . | 20
131.4 | . | 15 | . | . | . | 32 | . | 40
1388 | 0.547 | . | . | . | 12 | . | . | 40
89.67 | . | . | . | 1.4 | . | . | 7.1 | 40
570.73 | . | . | . | 10 | . | . | . | 40
345.29 | 0.175 | . | . | 7.5 | . | . | 6.5 | 40
By using the above daily cost formula, we now fill in the daily cost for 1552 observations in the first data. Part of the first data with filled in daily cost data is shown in Table
4.6. Vo again is the total volume of drug(s). Dg(1), Cn(1) and Ds(1) represent an individual drug, the concentration of the individual drug and the daily dose of the individual drug
respectively in a certain prescription. Similarly, Dg(2), Cn(2) and Ds(2) represent a second
individual drug, the concentration and daily dose of the second individual drug in one prescription. And Dg(3), Cn(3) and Ds(3) represent a third individual drug, the concentration
and daily dose of the third individual drug. Cost in the table is the daily cost.
Although we filled in the daily cost for 1552 observations, we are unable to calculate
the daily cost directly for the remaining 2412 observations in the first data, since exactly
the same usage of volume and concentration of each drug prescribed for these observations
cannot be found in the price data. So we say these daily costs are missing (as the question
marks show in Table 4.6). Due to the arbitrary nature of the missing pattern, these missing
values are intermittent missing values. We assume that they are MCAR because the price
data set is based on a random sample of prescriptions. We will use MI to produce multiple
imputed data sets.
Table 4.6: Part of the first data with filled in daily cost data
ID | Vo | Dg(1) | Cn(1) | Ds(1) | Dg(2) | Cn(2) | Ds(2) | Dg(3) | Cn(3) | Ds(3) | Cost
1 | 20 | D1 | 0.5 | 0.0291 | . | . | . | . | . | . | 0.70
1 | 20 | D1 | 2 | 0.1505 | . | . | . | . | . | . | 3.20
3 | 20 | D3 | 0.09 | 0.03 | . | . | . | . | . | . | ?
5 | 20 | D7 | 1.1 | 0.29 | . | . | . | . | . | . | ?
9 | 40 | D3 | 0.13 | 0.1101 | D4 | 8 | 7.047 | . | . | . | ?
16 | 40 | D4 | 1.4 | 0.47 | D7 | 7.1 | 2.4 | . | . | . | 0.76
17 | 40 | D4 | 5 | 0.6 | . | . | . | . | . | . | ?
17 | 40 | D4 | 10 | 5 | . | . | . | . | . | . | 7.13
34 | 20 | D6 | 10.2 | 2.42 | D1 | 0.167 | 0.04 | D7 | 4.6 | 1.1 | ?
83 | 20 | D5 | 0.5 | 0.15 | . | . | . | . | . | . | ?
Based on the price data, we fit a general linear model for creating multiple imputed data
sets as follows:
$$\text{Price} = \beta_0 + \beta_1 \cdot D1 + \beta_2 \cdot D2 + \cdots + \beta_7 \cdot D7 + \varepsilon,$$
where ε ∼ N(0, σ²). The residual standard error is 4.8 on 56 degrees of freedom, and
R² = 0.978. The results of the fitted model are shown in Table 4.7.
The formula for the imputed price is
$$\text{Imputed Price} = \hat{\beta}_0 + \hat{\beta}_1\,[\text{Cn(D1)} \times \text{Vo}] + \cdots + \hat{\beta}_7\,[\text{Cn(D7)} \times \text{Vo}] + \hat{\varepsilon},$$
where ε̂ ∼ N(0, 4.8²). And the formula for the imputed daily cost is
$$\text{Imputed Daily Cost} = \frac{\text{Imputed Price}}{\text{Volume} \times \text{Concentration}} \times \text{Daily Dose}.$$
Take one patient on mono-drug D3 in treatment 1, for example. As we cannot find the
price for D3 in volume 20ml and concentration 0.09mg/ml with daily dose of 0.03mg in the
Table 4.7: Results of price model estimated parameters (coefficients)
Coefficients | Estimate | Std. Error | P-value
intercept | 36.102 | 10.315 | 0.001
D1 | 23.106 | 0.775 | 0.000
D2 | 0.059 | 0.031 | 0.063
D3 | 0.207 | 5.216 | 0.969
D4 | 0.377 | 0.015 | 0.000
D5 | 1.774 | 0.096 | 0.000
D6 | 0.085 | 0.015 | 0.000
D7 | 0.156 | 0.047 | 0.002
price data, we apply the above formulas to estimate the price and then the daily cost. We
randomly draw an error value of 0.201 from ε̂ ∼ N(0, 4.8²); the error value differs from draw to draw. Now we have
$$\text{Imputed Price} = 36.102 + (0.207 \times 20 \times 0.09) + 0.201 \approx 36.68,$$
$$\text{Imputed Daily Cost} = \frac{36.68}{20 \times 0.09} \times 0.03 \approx 0.61.$$
When patients are on dual- or triple-drug without any price information provided in
the price data, we are also able to estimate and impute the price as well as the daily cost
by using the above formulas. For example, a patient takes D3 with a concentration of
0.13mg/ml and D4 with a concentration of 8mg/ml in the total volume of 40ml. The daily
dose of each drug is 0.1101mg and 7.047mg. We have
$$\text{Imputed Price} = 36.102 + (0.207 \times 40 \times 0.13) + (0.377 \times 40 \times 8) - 0.712 \approx 157.11,$$
where we randomly draw an error value of −0.712. And the imputed daily cost, 1 and 2,
by using D3 and D4 are $3.327 and $3.460 respectively.
$$\text{Imputed Daily Cost 1} = \frac{157.11}{40 \times 0.13} \times 0.1101 \approx 3.327,$$
$$\text{Imputed Daily Cost 2} = \frac{157.11}{40 \times 8} \times 7.047 \approx 3.460.$$
So the imputed daily cost is
$$\text{Imputed Daily Cost} = \frac{1}{2} \times (3.327 + 3.460) \approx 3.39.$$
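The imputation step can be sketched in Python as follows. The data frame here is simulated to resemble the price data (the coefficients and noise level only roughly echo Table 4.7), so all numbers are illustrative; the point is fitting the linear price model and adding a fresh random error for each of the D imputation draws.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Simulated "price data": drug amounts (concentration * volume) and a price that
# roughly follows a linear price model with residual standard deviation 4.8.
n = 60
amounts = pd.DataFrame(rng.uniform(0.0, 50.0, size=(n, 3)), columns=["D1", "D3", "D4"])
price = (36.1 + 23.1 * amounts["D1"] + 0.21 * amounts["D3"] + 0.38 * amounts["D4"]
         + rng.normal(0.0, 4.8, n))

X = sm.add_constant(amounts)
fit = sm.OLS(price, X).fit()
sigma = np.sqrt(fit.scale)                     # residual standard error of the price model

# A prescription with no observed price: drug D3, 20 ml at 0.09 mg/ml, daily dose 0.03 mg.
new_x = pd.DataFrame([[1.0, 0.0, 20 * 0.09, 0.0]], columns=X.columns)
D = 9                                          # number of imputed data sets
imputed_price = np.asarray(fit.predict(new_x))[0] + rng.normal(0.0, sigma, size=D)
imputed_daily_cost = imputed_price / (20 * 0.09) * 0.03
print(np.round(imputed_daily_cost, 2))         # nine imputed daily costs for this observation
```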
Now we apply the above formulas to estimate and impute the daily cost for the remaining 2412 observations in the first data in order to obtain a complete data set, referred to as
the first data with complete imputed daily cost data. Part of the data is shown in Table 4.8.
Table 4.8: Part of the first data with complete imputed daily cost data
ID | Vo | Dg(1) | Cn(1) | Ds(1) | Dg(2) | Cn(2) | Ds(2) | Dg(3) | Cn(3) | Ds(3) | Cost
1 | 20 | D1 | 0.5 | 0.0291 | . | . | . | . | . | . | 0.70
1 | 20 | D1 | 2 | 0.1505 | . | . | . | . | . | . | 3.20
3 | 20 | D3 | 0.09 | 0.03 | . | . | . | . | . | . | 0.61
5 | 20 | D7 | 1.1 | 0.29 | . | . | . | . | . | . | 0.50
9 | 40 | D3 | 0.13 | 0.1101 | D4 | 8 | 7.047 | . | . | . | 3.39
16 | 40 | D4 | 1.4 | 0.47 | D7 | 7.1 | 2.4 | . | . | . | 0.76
17 | 40 | D4 | 5 | 0.6 | . | . | . | . | . | . | 0.34
17 | 40 | D4 | 10 | 5 | . | . | . | . | . | . | 7.13
34 | 20 | D6 | 10.2 | 2.42 | D1 | 0.167 | 0.04 | D7 | 4.6 | 1.1 | 1.72
83 | 20 | D5 | 0.5 | 0.15 | . | . | . | . | . | . | 0.85
As previously introduced, there is only one patient taking Cadd 4 times through his/her
medical treatment. Four observations are therefore recorded in the first data. Unfortunately, the price data provides no information about this drug. We have no way to estimate or
impute the daily cost for these 4 observations. In this circumstance, we will ignore these
4 observations and remove them from the first data. Compared with the total of 3964 observations and the 2412 observations with daily cost missing in the first data, 4 observations is a very small number, and no bias would be produced by removing them. Therefore, we
imputed the daily cost for 2408 observations in the first data.
Table 4.9: Part of other eight complete imputed daily cost data sets
ID | Cost(2) | Cost(3) | Cost(4) | Cost(5) | Cost(6) | Cost(7) | Cost(8) | Cost(9)
1 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70 | 0.70
1 | 3.20 | 3.20 | 3.20 | 3.20 | 3.20 | 3.20 | 3.20 | 3.20
3 | 0.75 | 0.86 | 0.58 | 0.71 | 0.49 | 0.72 | 0.66 | 0.67
5 | 0.68 | 0.54 | 0.47 | 0.73 | 0.56 | 0.58 | 0.59 | 0.52
9 | 3.65 | 3.41 | 3.49 | 3.63 | 3.04 | 3.11 | 3.72 | 3.59
16 | 0.76 | 0.76 | 0.76 | 0.76 | 0.76 | 0.76 | 0.76 | 0.76
17 | 0.55 | 0.79 | 0.55 | 0.58 | 0.80 | 0.42 | 0.45 | 0.50
17 | 7.13 | 7.13 | 7.13 | 7.13 | 7.13 | 7.13 | 7.13 | 7.13
34 | 2.05 | 1.81 | 1.71 | 1.76 | 1.60 | 1.95 | 1.59 | 1.61
83 | 0.84 | 1.09 | 0.94 | 1.01 | 0.78 | 0.97 | 0.81 | 0.84
In order to obtain high efficiency, we perform the MI process 9 times to replace each
missing daily cost in the first data, as previously discussed in Chapter 3. So we are able
to obtain 9 sets of complete imputed daily cost data. Part of the other 8 complete imputed
daily cost data sets is shown in Table 4.9, where Cost(d) is the dth set of complete imputed
daily cost data, d = 1, ..., 9. In the next section, we will analyze each obtained data set by
using a standard statistical analysis.
4.3
Analyzing Imputed Daily Cost Data Sets by GEE Model
In order to identify how the 5 factors of age, gender, diagnosis, types of treatment and
treatment year possibly impact the daily cost for the observed patients, we decide to apply
the GEE model to analyze each set of complete imputed daily cost data. In this section, we
take the first set of complete imputed daily cost data as an example for analysis. Part of the
data is shown in Table 4.10.
Table 4.10: Part of the first set of complete imputed daily cost data
ID | Age | Gender | Diagnosis | Treatment | Start Date | End Date | Cost(1)
1 | 60 | F | 1 | 1 | 2002/06/21 | 2002/06/21 | 0.70
... | ... | ... | ... | ... | ... | ... | ...
1 | 60 | F | 1 | 1 | 2004/07/19 | 2004/07/19 | 3.20
3 | 63 | M | 2 | 1 | 1997/03/21 | 1997/03/23 | 0.61
5 | 55 | F | 2 | 1 | 2004/05/17 | 2004/05/30 | 0.50
9 | 48 | F | 3 | 2 | 2007/09/19 | 2007/10/19 | 3.39
16 | 62 | M | 3 | 2 | 2006/06/11 | 2006/06/16 | 0.76
17 | 54 | M | 3 | 1 | 2004/01/09 | 2004/06/12 | 0.34
17 | 54 | M | 3 | 1 | 2004/06/12 | 2004/06/12 | 7.13
34 | 67 | M | 1 | 3 | 2003/01/28 | 2003/01/28 | 1.72
83 | 56 | M | 3 | 3 | 2014/12/16 | 2014/12/17 | 0.85
One issue becomes apparent when looking at Table 4.10. The time a patient spends on a
single pain control prescription is recorded in days, with one imputed
daily cost. When the patient changes to a different prescription, the period changes and has a different
imputed daily cost. For example, the patient (ID 17) is on the same drug prescription every
day from January 9, 2004 to June 12, 2004 with a daily cost of $0.34. However, it climbs
to $7.13 for that patient with the change of prescription on June 12, 2004. We choose to
Table 4.11: Part of the first set of average daily cost data

ID  Age  Gender  Diagnosis  Treatment  Treatment Year  Average Cost (1)
1   60   F       1          1          1               0.74
1   60   F       1          1          2               1.58
...
1   60   F       1          1          8               2.32
...
16  62   M       3          2          1               0.94
...
16  62   M       3          2          10              0.84
...
83  56   M       3          3          1               1.33
...
83  56   M       3          3          7               7.77
For convenience in this research, we choose to consider the average daily cost in years rather than in days, which gives a rougher but more tractable analysis. We define the calendar year in which a patient begins pain control drugs as the first treatment year, the following calendar year as the second, and so forth. We then calculate the average daily cost of a year for each individual patient: the sum of the daily costs in one year divided by the total number of days observed in that year. Applying this calculation to the 9 sets of complete imputed daily cost data yields 9 sets of average daily cost data. Again, in this section we take the first set of average daily cost data as an example for analysis. Part of the data is shown in Table 4.11, where Average Cost (d), d = 1, ..., 9, denotes the dth set of average daily cost data.
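To make the averaging step concrete, the following pandas sketch (with hypothetical column names; not the thesis's own code) expands each prescription interval into daily records and averages the daily cost within each patient's treatment year.

import pandas as pd

def average_daily_cost_per_year(records: pd.DataFrame) -> pd.DataFrame:
    """Average the imputed daily cost over each patient's treatment year.

    records is assumed to have one row per prescription interval with columns
    ['id', 'start_date', 'end_date', 'daily_cost'] (hypothetical names).
    """
    rows = []
    for _, r in records.iterrows():
        days = pd.date_range(r["start_date"], r["end_date"], freq="D")
        rows.append(pd.DataFrame({"id": r["id"], "date": days, "cost": r["daily_cost"]}))
    daily = pd.concat(rows, ignore_index=True)

    # Treatment year 1 is the calendar year in which the patient starts treatment.
    first_year = daily.groupby("id")["date"].transform(lambda s: s.dt.year.min())
    daily["treatment_year"] = daily["date"].dt.year - first_year + 1

    # Average daily cost = sum of daily costs in the year / days observed in that year.
    return (daily.groupby(["id", "treatment_year"])["cost"]
                 .mean()
                 .reset_index(name="average_cost"))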
Figure 4.3: The average daily cost of a year for 76 patients vs treatment years

Figure 4.3 shows the average daily cost per year for 76 patients on pain control drugs throughout their medical treatments. The total number of patients is down from 83 to 76 because some patients are removed from the data. When considering treatment time in years, we notice that three patients are observed for less than a year, two with a single day of observation and one with 19 days of observation. We therefore do not calculate an average daily cost per year for these patients and remove them from the data. Additionally, four patients with a miscellaneous diagnosis are removed, since this research studies patients with chronic non-cancer pain diagnoses.
The average daily costs for the patients are quite close at the beginning of their medical treatment, all under $5.00 as shown in Figure 4.3, and then vary, either increasing or decreasing, over the time spent in treatment. Because of the large number of observations we cannot trace each individual line, but some obvious changes in the average daily cost are visible for a few patients. We therefore group these 76 patients by gender, diagnosis, type of treatment and treatment year in order to examine the relationship between the average daily cost and each factor.
Figure 4.4: The average daily cost of a year of gender vs treatment years
Figure 4.4 shows how the average daily cost changes with treatment time in years for each gender. Overall, the cost trend for male patients is similar to that for female patients: both climb from about $1.50 to about $4.00 in the first five years, the cost for female patients increasing steadily while the cost for male patients rises more abruptly. Around year 9 of treatment, the costs for both genders decline. Thirteen patients are observed for more than 9 years; nine of them are in treatment 1 and three in treatment 2. From this figure, gender appears to have only a weak impact on the average daily cost.
Figure 4.5: The average daily cost of a year of 4 main diagnoses vs treatment years
Figure 4.5 shows the average daily cost for each of the four main diagnoses identified in the medical treatment. When patients start pain control medication, the cost is about $2.00 for patients with SC, CRPS and SMS, and about $1.00 for those with FBSS. The costs for all four diagnoses then increase at a similar pace in subsequent years until they reach a maximum of about $6.00, after which they all decline. In SC, five patients are in treatment 1 and two in treatment 2; in CRPS, three patients are in treatment 1 and one in treatment 2; in FBSS, three patients are in treatment 1, five in treatment 2 and two in treatment 3; in SMS, all four patients are in treatment 1. From this figure, diagnosis appears to have only a modest impact on the average daily cost.
Figure 4.6: The average daily cost of a year of 3 treatment types vs treatment years

From Figure 4.6 we note clear differences among the average daily costs of the three types of treatment as the treatment years increase. The cost for patients on mono-drug therapy in treatment 1 remains around $3.00, rising slightly from $2.00 to $4.00 in the first 9 years and then falling back to around $2.00. The cost for patients in treatment 2, taking either mono- or dual-drug therapy, increases steadily from about $2.00 to $7.00. The cost for patients in treatment 3 varies the most: it increases from about $2.00 to $5.00 in the first five years, climbs sharply to about $11.00 in the 6th year, and then falls slightly and stays around $9.00. We can therefore assume that the type of treatment possibly affects the average daily cost.
The trend curves in the above three figures all end at a certain treatment year. Take Figure 4.4 for example: the average daily cost of a year stops at the 11th year for female patients and at the 12th year for male patients. This is simply because fewer than three female or male patients are observed beyond 11 or 12 years respectively in each gender group, so we cannot obtain sufficient information to fit the model. Similarly, the trend curves of SC, CRPS, FBSS and SMS in Figure 4.5 stop at the 9th, 11th, 9th and 7th year respectively, and in Figure 4.6 the trend curves of treatments 1, 2 and 3 end at the 14th, 10th and 9th year respectively.
The relationship between the average daily cost of a year and each type of treatment, with 95% confidence intervals, is shown in Figures 4.7, 4.8 and 4.9 respectively. Overall, the average daily cost for all three types of treatment rises as time in treatment increases. The cost in treatment 1 shows a modest increase with narrow confidence intervals, and the cost in treatment 2 a continuous steady increase with wide confidence intervals. The cost in treatment 3 varies the most, climbing rapidly to its maximum and then sliding down, with confidence intervals of inconsistent width.
Figure 4.7: The average daily cost of a year of treatment 1 vs treatment years
Figure 4.8: The average daily cost of a year of treatment 2 vs treatment years
Figure 4.9: The average daily cost of a year of treatment 3 vs treatment years
The scatterplot matrix in Figure 4.10 explores the correlation between observations. It shows each of the 15-choose-2 pairwise scatterplots of the responses, the average daily cost of a year, from a patient at different years; there are at most 15 observations per patient. The correlation weakens as observations move farther apart in time, and the correlation between observations at times t_ij and t_ik depends strongly on |t_ij − t_ik| for all patients. We can therefore say that the correlations decrease as the time lag increases.
Figure 4.10: Scatterplot matrix of 76 patients
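The same lag pattern can also be checked numerically. The sketch below (hypothetical column names; not the thesis's own code) pivots the average yearly costs to one column per treatment year and averages the pairwise correlations at each lag; under an AR-1-like structure the result should decrease as the lag grows.

import pandas as pd

def lag_correlations(avg_cost: pd.DataFrame) -> pd.Series:
    """Mean correlation of average yearly costs at each time lag.

    avg_cost is assumed to have columns ['id', 'treatment_year', 'average_cost']
    (hypothetical names). Under an AR-1-like structure the returned values
    should decrease as the lag increases.
    """
    # One row per patient, one column per treatment year.
    wide = avg_cost.pivot(index="id", columns="treatment_year", values="average_cost")
    corr = wide.corr()  # pairwise-complete correlations between years

    by_lag = {}
    years = list(corr.columns)
    for i, yi in enumerate(years):
        for yj in years[i + 1:]:
            by_lag.setdefault(abs(yj - yi), []).append(corr.loc[yi, yj])
    return pd.Series({lag: pd.Series(vals).mean() for lag, vals in sorted(by_lag.items())})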
Now we establish a GEE model to analyze the impact of these variables on the average daily cost. We take the five factors, age, gender, diagnosis, type of treatment and treatment year, as the covariates X, and the average daily cost of a year as the response variable Y. Age and treatment year are continuous variables; gender, diagnosis and type of treatment are coded as dummy variables. For gender, we let female = 0 and male = 1 as a binary covariate. Let diagnosis 1 = 1 if a patient is in SC and 0 otherwise, diagnosis 2 = 1 if a patient is in CRPS and 0 otherwise, and diagnosis 3 = 1 if a patient is in FBSS and 0 otherwise. Likewise, let treatment 1 = 1 if a patient is in treatment 1 and 0 otherwise, and treatment 2 = 1 if a patient is in treatment 2 and 0 otherwise. The mean model of the GEE can then be written as follows:
Average Cost = β0 + β1 · Treatment Year + β2 · Age + β3 · Gender + β4 · diagnosis 1 + β5 · diagnosis 2 + β6 · diagnosis 3 + β7 · treatment 1 + β8 · treatment 2.
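The thesis does not state which software performed the GEE fit; purely as a hedged illustration, the following Python sketch fits the same mean model with statsmodels under an AR-1 working correlation and reports robust (sandwich) standard errors. The data frame and column names are assumptions, and the automatic dummy coding may choose different reference levels than the thesis.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_gee_ar1(avg_cost: pd.DataFrame):
    """Fit the mean model above by GEE with an AR-1 working correlation.

    avg_cost is a hypothetical data frame with one row per patient-year and
    columns id, average_cost, treatment_year, age, gender, diagnosis, treatment.
    Note that C() picks reference levels automatically, which may differ from
    the baselines (SMS and treatment 3) used in the thesis.
    """
    model = smf.gee(
        "average_cost ~ treatment_year + age + C(gender) + C(diagnosis) + C(treatment)",
        groups="id",                                # repeated observations within a patient
        data=avg_cost,
        family=sm.families.Gaussian(),
        cov_struct=sm.cov_struct.Autoregressive(),  # AR-1 working correlation
    )
    return model.fit()  # reports robust (sandwich) standard errors by default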
Table 4.12: Results of QIC for the GEE model

R(α)           QIC
Independence   1144.7
Exchangeable   1127.0
Unstructured   2216.1
AR-1           1117.1
With the GEE model established, we calculate the QIC value in order to select the best working correlation structure among the four candidates: independence, exchangeable, unstructured and AR-1. The results are presented in Table 4.12. As previously discussed in Chapter 2, the best working correlation structure for the GEE model is the one with the smallest QIC value. We therefore apply the GEE model under the AR-1 working correlation structure to analyze the first set of average daily cost data. The results of the GEE model are presented in Table 4.13; the estimated correlation is ρ̂ = 0.806 and the estimated scale parameter is φ̂ = 11.177.
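Continuing the hedged statsmodels sketch above, the working correlation structure can be selected by refitting the same mean model under each candidate structure and keeping the one with the smallest QIC.

import statsmodels.api as sm
import statsmodels.formula.api as smf

def compare_working_correlations(avg_cost):
    """Refit the same GEE mean model under each candidate working correlation
    structure and return the QIC values; the smallest QIC indicates the
    preferred structure (here, AR-1)."""
    formula = ("average_cost ~ treatment_year + age + C(gender) "
               "+ C(diagnosis) + C(treatment)")
    structures = {
        "Independence": sm.cov_struct.Independence(),
        "Exchangeable": sm.cov_struct.Exchangeable(),
        "Unstructured": sm.cov_struct.Unstructured(),
        "AR-1": sm.cov_struct.Autoregressive(),
    }
    # For unbalanced panels the AR-1 and unstructured structures may also need a time index.
    qic_values = {}
    for name, cov in structures.items():
        res = smf.gee(formula, groups="id", data=avg_cost,
                      family=sm.families.Gaussian(), cov_struct=cov).fit()
        # Recent statsmodels releases expose QIC on the results; qic() returns (QIC, QICu).
        qic_values[name] = res.qic()[0]
    return qic_values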
Table 4.13: Results of the GEE model

Coefficients     Estimate  Naive S.E.  Naive z  Robust S.E.  Robust z
Intercept         5.686    2.127        2.673   1.582         3.594
Treatment Year    0.446    0.072        6.217   0.100         4.462
Age              -0.031    0.025       -1.240   0.022        -1.429
Gender            0.259    0.650        0.398   0.559         0.464
Diagnosis 1       0.103    1.173        0.087   0.987         0.104
Diagnosis 2      -0.514    1.143       -0.450   1.057        -0.486
Diagnosis 3      -0.887    1.086       -0.817   0.890        -0.996
Treatment 1      -3.027    1.048       -2.888   0.845        -3.582
Treatment 2      -1.965    1.098       -1.790   0.991        -1.982
We adopt a 5% significance level, corresponding to two-sided critical values z = ±1.96, and base the tests on the robust z statistics, since they are more reliable than the naive z statistics as previously discussed. From Table 4.13, two variables, treatment year and type of treatment, are significantly associated with the average daily cost of a year at this level, with robust z = 4.462 > 1.96 for treatment year, robust z = −3.582 < −1.96 for treatment 1 and robust z = −1.982 < −1.96 for treatment 2. The other three variables, age, gender and diagnosis, are not significant at this level.
Similarly, we apply the GEE model under the AR-1 working correlation structure, with the same procedure, to the other 8 sets of average daily cost data and obtain similar results: treatment year and type of treatment are significant at the 5% level, while age, gender and diagnosis are not. The cost for all types of treatment rises with time in treatment, and treatment 3 is the most costly.
4.4 Combining Analysis Results by MI
With the 9 sets of GEE analysis results obtained in the previous section, we now move to the last stage of the MI analysis procedure, which integrates all results and produces an overall result. Take the most significant variable in the GEE model, treatment year, for example. We calculate the combined estimate of the coefficient of treatment year, θ̄D, by formula (3.1), where θ̂d is the estimate of the coefficient from the dth average daily cost data set. We calculate the within-imputation variance W̄D by formula (3.2), where the associated variance Wd is the robust variance of the estimate from each data set, and the between-imputation variance BD by formula (3.3). The total variance TD is then obtained by applying W̄D and BD to formula (3.4), and the total standard error (S.E.) is √TD. The t-ratio is θ̄D / √TD. Additionally, we calculate the relative increase in variance γ̂D by formula (3.8) and the degrees of freedom ν of the t distribution by formula (3.5), which is appropriate given the large sample size and small D ≤ 9. The fraction of missing information r̂ is calculated by formula (3.7), and the relative efficiency (R.E.) of a finite number of imputations 2 ≤ D ≤ 9 by formula (3.6). Table 4.14 lists all these quantities from the combining step of the MI procedure for the variable treatment year, for D = 2, 3, ..., 9.
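For concreteness, the combining step can be written out in a few lines. The sketch below implements the textbook Rubin's-rules quantities, which the thesis's formulas (3.1)-(3.8) are assumed to match; it is a hedged reconstruction, not the thesis's own code.

import numpy as np

def rubin_combine(estimates, variances):
    """Combine D point estimates and their robust variances by Rubin's rules.

    A hedged sketch using the standard formulas, assumed to coincide with
    formulas (3.1)-(3.8) of Chapter 3.
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    D = len(est)

    theta_bar = est.mean()                        # combined estimate, cf. (3.1)
    W = var.mean()                                # within-imputation variance, cf. (3.2)
    B = est.var(ddof=1)                           # between-imputation variance, cf. (3.3)
    T = W + (1 + 1 / D) * B                       # total variance, cf. (3.4)
    se = np.sqrt(T)
    t_ratio = theta_bar / se

    r = (1 + 1 / D) * B / W                       # relative increase in variance
    nu = (D - 1) * (1 + 1 / r) ** 2               # degrees of freedom (large-sample form)
    fmi = (r + 2 / (nu + 3)) / (r + 1)            # fraction of missing information
    rel_eff = 1 / (1 + fmi / D)                   # relative efficiency of D imputations

    return dict(estimate=theta_bar, total_variance=T, se=se, t_ratio=t_ratio,
                rel_increase=r, df=nu, frac_missing=fmi, rel_eff=rel_eff)

# Hypothetical usage with the D treatment-year coefficients and robust variances:
# rubin_combine(theta_hats, robust_variances)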
When combining 2 and up to 5 sets of the GEE analysis results, the relative efficiency listed in Table 4.14 rises quickly from 0.9079 to 0.9866. When 6 or more sets are combined, the relative efficiency values all exceed 0.9907, with further increases of less than 0.005 per additional set.
Table 4.14: Results of the MI procedure for the variable treatment year

D  θ̄D      W̄D      BD      TD      S.E.    γ̂D      ν       r̂       R.E.
2  0.4725  0.0108  0.0014  0.0129  0.1135  0.1934  38.1    0.2028  0.9079
3  0.4838  0.0114  0.0011  0.0128  0.1130  0.1258  160.1   0.1226  0.9607
4  0.4841  0.0111  0.0007  0.0120  0.1095  0.0804  541.6   0.0778  0.9809
5  0.4891  0.0112  0.0007  0.0119  0.1094  0.0707  915.8   0.0681  0.9866
6  0.4869  0.0111  0.0006  0.0118  0.1085  0.0585  1637.5  0.0564  0.9907
7  0.4879  0.0112  0.0005  0.0118  0.1084  0.0481  2854.8  0.0466  0.9934
8  0.4877  0.0112  0.0004  0.0117  0.1081  0.0406  4610.2  0.0394  0.9951
9  0.4851  0.0112  0.0004  0.0116  0.1079  0.0411  5217.1  0.0399  0.9956
Table 4.15: Results of the MI procedure for all variables

Variables       Sufficient D  R.E.
Treatment Year  5             0.9866
Age             3             0.9944
Gender          2             0.9964
Diagnosis 1     2             0.9926
Diagnosis 2     6             0.9804
Diagnosis 3     6             0.9900
Treatment 1     6             0.9762
Treatment 2     3             0.9967
We regard a finite number of sets D as sufficient in the combination when the difference in R.E. between D and D + 1 sets is less than 0.005, so D = 5 is sufficient for the variable treatment year. Following the same integration procedure, we obtain the sufficient D and associated R.E. for every variable, shown in Table 4.15. Considering all variables, we conclude that combining 6 sets of average daily cost data is sufficient: the overall values obtained from the 6-set combination are close to the true values for all variables. The overall result for each variable obtained by combining 6 sets of average daily cost data is presented in Table 4.16.
Table 4.16: Results of combination for all variables

Coefficients    Combined Estimate  Total Variance  Total S.E.  T-Ratio
Intercept        6.0743            2.8491          1.6879       3.5986
Treatment Year   0.4869            0.0118          0.1085       4.4855
Age             -0.0326            0.0005          0.0230      -1.4178
Gender           0.2997            0.3415          0.5844       0.5129
Diagnosis 1      0.1952            0.9847          0.9923       0.1967
Diagnosis 2     -0.0181            1.2516          1.1188      -0.0162
Diagnosis 3     -0.5931            0.8183          0.9046      -0.6557
Treatment 1     -3.4582            0.8212          0.9062      -3.8163
Treatment 2     -2.0567            1.0617          1.0304      -1.9961
To test whether each coefficient equals 0, we use two-sided tests at the 5% significance level. For each variable we calculate the degrees of freedom ν, the critical value of the t distribution with α/2 = 0.025, the corresponding 95% lower and upper confidence bounds, and the p-value. The results are shown in Table 4.17.
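As a hedged sketch of this step (the inputs are the combined estimate, total standard error and degrees of freedom from Tables 4.16 and 4.17), the critical value, confidence bounds and two-sided p-value follow directly from the t distribution:

from scipy import stats

def t_test_summary(estimate, se, df, alpha=0.05):
    """Two-sided test and 95% confidence bounds for one combined MI estimate."""
    t_crit = stats.t.ppf(1 - alpha / 2, df)           # critical value t(nu, alpha/2)
    lower, upper = estimate - t_crit * se, estimate + t_crit * se
    p_value = 2 * stats.t.sf(abs(estimate / se), df)  # two-sided p-value
    return t_crit, lower, upper, p_value

# Combined treatment-year coefficient from Table 4.16 (degrees of freedom from Table 4.17):
# t_test_summary(0.4869, 0.1085, 1637.5)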
Table 4.17: Results of significance tests for all variables

Coefficients    Degrees of Freedom  t(ν, α/2)  95% Lower  95% Upper  P-value
Intercept         1033.3             1.9623     2.7622     9.3864    0.0003
Treatment Year    1637.5             1.9614     0.2740     0.6997    0.0001
Age              21031.4            -1.9600    -0.0776     0.0125    0.1563
Gender          436773.6             1.9600    -0.8457     1.4452    0.6080
Diagnosis 1     221937.3             1.9600    -1.7497     2.1401    0.8441
Diagnosis 2        377.4            -1.9663    -0.03626    2.1816    0.9871
Diagnosis 3       1432.9            -1.9616    -2.3676     1.1813    0.5121
Treatment 1        255.9            -1.9705    -5.2439    -1.6726    0.0002
Treatment 2      30720.0            -1.9600    -4.0763    -0.0372    0.0459
From the overall results presented in Tables 4.16 and 4.17, the T-ratio for treatment year is 4.4855 > 1.9614 on the upper tail, for treatment 1 is −3.8163 < −1.9705 and for treatment 2 is −1.9961 < −1.9600 on the lower tail. The corresponding p-values, 0.0001 for treatment year, 0.0002 for treatment 1 and 0.0459 for treatment 2, are all less than 0.05. Thus treatment year and type of treatment are two significant factors affecting the average daily cost at the 5% level, while age, gender and diagnosis do not show a significant effect at the same level. In conclusion, the cost increases with time in treatment, and it increases when patients move from treatment 1 to 2, from treatment 2 to 3, or from treatment 1 to 3. Overall, the cost is highest for treatment 3.
Chapter 5
Conclusion and Future Study

5.1 Conclusion
This research studied 83 patients in terms of the cost of pain control medication in their medical treatments. It targets patients diagnosed with chronic non-cancer pain and aims to identify factors affecting pain control costs. Because values are missing for a majority of observations in the data, we apply multiple imputation to estimate and impute the daily cost, thereby obtaining multiple sets of complete average daily cost data. The GEE model, a standard statistical analysis method, is then established and applied to each set of data, producing multiple analysis results. Combining these results, we find two factors, treatment year and type of treatment, that are significant to the average daily cost. Overall, pain control costs increase continuously with time in treatment. Additionally, costs are highest for patients in treatment 3, taking mono-, dual- or triple-drug therapy, and lowest for those in treatment 1, taking mono-drug therapy only. We might suggest that patients remain in treatment 1 or treatment 2, depending on efficacy, in order to control or reduce costs.
5.2 Future Study
This research obtained a rough longitudinal analysis result by considering the average daily cost of a year in the GEE model. A future study may instead consider the average daily cost of a month, which might yield more accurate analysis results.
When reviewing the literature on approaches to longitudinal data analysis in Chapter 2, we noted that the random effects model and the transition model are also commonly used. Both approaches account for the correlation within individuals in their own specific ways. The random effects model assumes that the correlation among repeated observations arises because the observations on an individual share that individual's random effects, which are independent across individuals. The transition model assumes that the correlation exists among repeated observations because past observations enter as additional predictor variables influencing the current observation. Unlike the GEE approach, neither requires specifying a working correlation structure. We can therefore apply either or both models in future research, as standard statistical methods, to analyze the longitudinal data; a brief sketch of a random-intercept fit is given below. The analysis results could then be compared with the GEE results obtained in this research to see whether the same conclusions hold in the pain control cost study.
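Purely as a hedged illustration of the random effects alternative, and not the thesis's analysis, a random-intercept linear mixed model with the same fixed effects could be fitted in statsmodels as follows, reusing the hypothetical column names from the earlier sketches.

import pandas as pd
import statsmodels.formula.api as smf

def fit_random_intercept(avg_cost: pd.DataFrame):
    """Random-intercept linear mixed model with the same fixed effects as the
    Chapter 4 GEE mean model. The per-patient random intercept induces the
    within-patient correlation, with no working correlation structure needed."""
    model = smf.mixedlm(
        "average_cost ~ treatment_year + age + C(gender) + C(diagnosis) + C(treatment)",
        data=avg_cost,
        groups=avg_cost["id"],   # one random intercept per patient
    )
    return model.fit()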
Another area of possible future study concerns approaches for dealing with missing values. Multiple imputation, a useful and popular method, is performed in this research to estimate and impute the daily cost of pain control medication for all observed patients; however, it requires a large amount of time and work in the imputation and analysis process. Other approaches, such as the unified approach, maximum likelihood and fully Bayesian methods, might well be applied in future studies.