Ronald Heck
EDEP 768E: Seminar in Categorical Data Modeling (F2012)
Week 12: Class Notes
Nov. 2, 2012
Examining Repeated Measures Data on Individuals
Generalized linear mixed models (GLMM) also provide a means of incorporating longitudinal
designs with categorical outcomes into situations where there are clustered data structures. One
of the attractive properties of the GLMM is that it allows for linear as well as non-linear models
under a single framework that addresses issues of clustering. It is possible to fit models with
outcomes resulting from various probability distributions, including normal (or Gaussian),
inverse Gaussian, gamma, Poisson, multinomial, binomial, and negative binomial, through an
appropriate link function $g(\cdot)$. At level 1, repeated observations (e.g., students' proficiency
status in math, students’ enrollment over successive semesters in college, changes in clinical or
health status) are nested within individuals, perhaps with additional time-varying covariates. At
level 2, we can define variables describing differences between individuals (e.g., treatment
groups, participation status, subject background variables and attitudes).
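As a concrete illustration (not from the original notes), these distribution-link pairings map onto the family objects in Python's statsmodels package; the sketch below simply enumerates them (statsmodels has no multinomial GLM family, so that case is omitted).

```python
import statsmodels.api as sm

# Distribution-link pairings available as GLM/GEE families in statsmodels.
# Each family carries a default (canonical) link; an alternative link can be
# passed explicitly, e.g., sm.families.Binomial(link=sm.families.links.Probit()).
families = {
    "normal (Gaussian)": sm.families.Gaussian(),        # identity link
    "inverse Gaussian":  sm.families.InverseGaussian(),
    "gamma":             sm.families.Gamma(),
    "Poisson":           sm.families.Poisson(),          # log link
    "binomial":          sm.families.Binomial(),         # logit link
    "negative binomial": sm.families.NegativeBinomial(),
}
for name, family in families.items():
    print(f"{name:18s} -> default link: {type(family.link).__name__}")
```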
Generalized Estimating Equations
Alternatively, by using the Generalized Estimating Equations (GEE) approach, we can examine a
number of categorical measurements nested within individuals (i.e., individuals represent the
clusters), but where individuals themselves are considered to be independent and randomly
sampled from a population of interest. More specifically, in this latter type of model, the pairs of
dependent and independent variables ( Yi ; X i ) for individuals are assumed to be independent and
identically distributed (Ziegler, Kastner, & Blettner, 1998) rather than clustered within
organizations. GEE is used to characterize the marginal expectation of a set of repeated measures
(i.e., average response for observations sharing the same covariates) as a function of a set of
study variables. As a result, the important point is that the growth parameters are not assumed to
vary randomly across individuals (or higher groups) as in a typical random-coefficients (or
mixed) model. This is an important distinction between the two types of models to keep in
mind—that is, while random-coefficient models explicitly address variation across individuals as
well as clustering among subjects in higher-order groups, GEE models assume simple random
sampling of subjects representing a population as opposed to a set of higher-order groups.
Hence, GEE models provide what are called "population average" results; that is, they model the
marginal expectation as a function of the explanatory variables. In contrast, typical multilevel
models provide "unit specific" results.
Regression coefficients based on population averages (GEE) will generally be similar to unit-specific (random-effect model) coefficients but smaller in size (Raudenbush & Bryk, 2002).
This distinction does not arise in models with continuous outcomes and identity link functions.
For example, for a GEE model, the odds ratio is the average estimate in the population; that is,
the expected increase for a unit change in X in the population. In contrast, in random-effect (unit-specific) models, the odds ratio will be the subject-specific effect for a particular level of
clustering (i.e., the person or unit of clustering) given a unit change in X.
We first begin with a within- and between-subjects model estimated using the GEE (or fixed-effect) approach. GEE was developed to extend GLM further by accommodating repeated
categorical measures, logistic regression, and various other models for time series or other
correlated data where relationships between successive measurements on the same individual are
assumed to influence the estimation of model parameters (Horton & Lipsitz, 1999; Liang &
Zeger, 1986; Zeger, Liang, & Albert, 1988). The GEE analytic approach handles a number of
different types of categorical outcomes, their associated sampling distributions, and
corresponding link functions. It is suitable to use where the repeated observations are nested
within individuals over time, but the individuals are considered to be a random sample of a
population.
One scenario is where individuals are randomly assigned to treatment conditions that unfold over
time. If the outcome is a count, we can make use of an additional "exposure" parameter (i.e.,
referred to as an offset term), which, as you will recall, is a "structural" predictor that can be added
to the model. Its coefficient is not estimated by the model but is assumed to have the value 1.0;
thus, the values of the offset are simply added to the linear predictor of the dependent variable.
This extra parameter can be especially useful in Poisson regression models, where each case may
have different levels of exposure to the event of interest. At present in IBM SPSS, the GEE
approach only accommodates a two-level data hierarchy (measurements nested in individuals).
If we intend to add a group-level variable, we would need to use GENLIN MIXED to specify the
group structure.
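To make the offset idea concrete, here is a minimal sketch using Python's statsmodels rather than SPSS, with hypothetical (toy) variable names; the exposure argument supplies a log offset whose coefficient is fixed at 1.0.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per measurement occasion per person,
# with a count outcome ('events') and the time at risk for that row
# ('exposure'). Toy data, far too small for real inference.
df = pd.DataFrame({
    "id":       [1, 1, 2, 2, 3, 3],
    "treat":    [0, 0, 1, 1, 1, 1],
    "events":   [2, 3, 1, 0, 4, 2],
    "exposure": [1.0, 2.0, 1.5, 1.0, 2.0, 1.0],
})

# Poisson GEE with an offset: statsmodels adds log(exposure) to the linear
# predictor with its coefficient fixed at 1.0 (the `exposure` argument takes
# the log internally; it requires the log link, which Poisson uses by default).
model = smf.gee("events ~ treat", groups="id", data=df,
                family=sm.families.Poisson(),
                cov_struct=sm.cov_struct.Exchangeable(),
                exposure=df["exposure"])
print(model.fit().summary())
```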
Students’ Proficiency in Reading Over Time
Consider a study to examine students’ likelihood to be proficient in reading over time and to
assess whether their background might affect their varying patterns of meeting proficiency or
not. We may first be interested in answering whether a change takes place over time in students’
likelihood to be proficient. This concern addresses whether the probability of a student being
proficient is the same or different over the occasions of measurement. The assumption is that if
we can reject the hypothesis that the likelihood of being proficient is the same over time, it
implies that a change in individuals has taken place. In this situation, occasions of measurement
are assumed to be nested within subjects but independent between subjects.
We may have a number of research questions we are interested in examining such as the
following: “What is the probability of students being proficient in reading over time? Do
probabilities of being proficient change over time? What do students’ trends look like over
time? Are there between-individual variables that explain students’ likelihood to be proficient
over time?”
Vertical Alignment of Data Within Individuals
The data in this study consist of 2,228 individuals who were measured on four occasions
regarding their proficiency in reading. To examine growth within and between individuals using
GEE (or GENLIN MIXED), the data must first be organized differently (see Chapter 2 in the
text). The time-related observations must be organized vertically, which will require four lines
for each subject, since there are four repeated observations regarding proficiency. You will recall
that an intercept is defined as the level of Y when X (Time) is 0. For categorical outcomes, the
time variable serves to define contrasts between occasions, for example, between a baseline
measurement and the end of a treatment intervention, or to examine change over a particular time
period. This coding pattern for Time (0, 1, 2, 3) identifies the intercept in the model as students'
initial (Time 1) proficiency status (i.e., since it is coded 0, and the intercept represents the
individual's status when the other predictors are 0). This is the most common type of coding for
models involving individual change.
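As an illustration of this restructuring, here is a sketch in Python's pandas, assuming a hypothetical wide file with columns readprof0 through readprof3 (one proficiency indicator per occasion).

```python
import pandas as pd

# Hypothetical wide file: one row per student, with one proficiency indicator
# per occasion (readprof0 ... readprof3) and a time-invariant covariate.
wide = pd.DataFrame({
    "id":        [1, 2],
    "female":    [1, 0],
    "readprof0": [1, 0],
    "readprof1": [1, 0],
    "readprof2": [0, 1],
    "readprof3": [1, 1],
})

# Pivot to vertical (long) format: four rows per student, with time coded
# 0, 1, 2, 3 so that the intercept represents initial proficiency status.
long = (pd.wide_to_long(wide, stubnames="readprof", i="id", j="time")
          .reset_index()
          .sort_values(["id", "time"]))
print(long)
```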
There are several important steps that must be specified in conducting the analysis. Users
identify the type of outcome and appropriate link function, define the regression model, select
the correlation structure between repeated measures, and select either model-based or robust
standard errors. There are a number of different ways to notate the models. We will let $Y_{ti}$ be the
dichotomous response at time t (t = 1, 2, …, T) for individual i (i = 1, 2, …, N), where we assume
the observations of different individuals are independent, but we allow for an association
between the repeated measures for the same subject. This will allow us later in the chapter to add
the subscript j to define random effects of individuals nested within groups such as classrooms or
schools. We assume the following marginal regression model for the expected value of $Y_{ti}$:

$g(E[Y_{ti}]) = \mathbf{x}'_{ti}\boldsymbol{\beta},$
where xti is a (p +1) x 1 vector (prime designates a vector) of covariates for the ith subject on the
tth measurement occasion (t = 1,2,…, T),  represents the corresponding regression parameters,
and g () refers to one of several corresponding link functions, depending on the measurement
of Yti . This suggests that the data can be summarized to the vector Yi and the matrix. The
slope  can be interpreted as the rate of change in the population-averaged Yi with X i (Zeger et
al., 1988). Typically, the  parameters are constant for all t (Ziegler et al., 1998). Where the data
are dichotomous, the marginal mean—a probability—is most commonly modeled via the logit
link (i.e., whether a child is proficient or not at time t). The coefficients are then interpreted as
log odds.
For the Bernoulli case (i.e., where the number of trials is 1), $Y_{ti}$ has a binomial distribution with
probability of success $\pi_{ti}$ and variance $\pi_{ti}(1 - \pi_{ti})$. For binary data with the logit link function, we
have the familiar

$\eta_{ti} = \log(\pi_{ti}/(1 - \pi_{ti})) = \mathbf{x}'_{ti}\boldsymbol{\beta},$

where $\eta_{ti}$ is the underlying transformed predictor of $Y_{ti}$, in this case the log of the odds
$\pi_{ti}/(1 - \pi_{ti})$. It should again be noted that the model represents a ratio of the probability of the
event coded 1 occurring versus the probability of the event coded 0 occurring at a particular time
point. There is no separate residual variance parameter, as the variance is related to the expected
value $\pi_{ti}$ and therefore cannot be uniquely defined.
In the first model, we specify the repeated measures outcome with two parameters, an intercept
and a time-related slope, as follows:

$\eta_{ti} = \log(\pi_{ti}/(1 - \pi_{ti})) = \beta_0 + \beta_1(\text{time}),$

where time is coded to indicate the interval between successive measurements, $\beta_0$ is an intercept,
and $\beta_1$ describes the rate of change, on a logit scale, in the fraction of positive responses in the
population of subjects per unit time, rather than the typical change for an individual subject. As
the above equation suggests, $\beta_0$ is the log odds of response when time is 0 (i.e., initial status). In
this case, $\beta_1$ is the change in log odds associated with a one-year interval. The model assumes there are no
between-subject random effects; therefore, there are two parameters to estimate. Since this is a
single-level model, for convenience we will drop the subscripts referring to the predictors.
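A minimal sketch of this two-parameter model using the GEE implementation in Python's statsmodels (assuming the hypothetical long-format data frame from the earlier sketch; SPSS GENLIN with repeated measures would produce an equivalent fit):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Marginal (population-average) logit model: eta_ti = b0 + b1*time, with
# repeated measures clustered within students ('id') and an exchangeable
# working correlation, estimated by GEE.
model = smf.gee("readprof ~ time", groups="id", data=long,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()

print(result.summary())
print(np.exp(result.params))  # exponentiate the log odds to get odds ratios
```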
Correlation Structures Between Repeated Measures
It is possible to specify several different types of correlation structures to describe the within-subject dependencies over time. Because one often does not know the correct structure ahead of
time, and different choices can make some difference in the model's parameter estimates, the
structure is typically chosen to improve efficiency. It often takes a bit of preliminary work to
determine the optimal working correlation matrix for a particular data structure. Examples of
GEE correlation/covariance structure specifications include independence, exchangeable,
autoregressive, stationary m-dependent, and unstructured.
The independent matrix assumes that the repeated measurements are uncorrelated; however, this
will not be the case in most instances. Generally, in longitudinal models the successive
measurements are correlated at least to some extent. An exchangeable (or compound symmetry)
covariance (or correlation) matrix assumes homogeneous correlations between elements; that is,
the correlations are assumed to be the same over time. This can sometimes be difficult to support
in a longitudinal study, however.
The autoregressive, or AR(1), matrix assumes the repeated measures have a first-order
autoregressive structure. This implies that the correlation between any two adjacent elements is
equal to $\rho$ (rho), to $\rho^2$ for elements separated by a third, and so on, with $\rho$ constrained such that $-1 < \rho < 1$.
An m-dependent matrix assumes consecutive measurements have a common correlation
coefficient, pairs of measurements separated by a third have a common correlation coefficient,
and so on, through pairs of measurements separated by m − 1 other measurements; measurements
with greater separation are assumed to be uncorrelated. Where measurements are not evenly
spaced, it may be reasonable to consider a model where the correlation is a function of the time
between observations (i.e., m-dependent or autoregressive). When choosing this structure,
specify a value of m less than the order of the working correlation matrix.
Finally, an unstructured correlation (or covariance) matrix provides a separate coefficient for
each covariance. As with cross-sectional models, we have found that model estimates can vary
slightly according to the matrix structure specified.
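These working structures correspond to covariance classes in statsmodels; below is a sketch (again assuming the hypothetical long data frame) of refitting the same model under several structures to see how much the estimates move. The exact Autoregressive API varies somewhat across statsmodels versions.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Candidate working correlation structures for the repeated measures.
structures = {
    "independence": sm.cov_struct.Independence(),
    "exchangeable": sm.cov_struct.Exchangeable(),
    "AR(1)":        sm.cov_struct.Autoregressive(),  # uses the time values
}

# Refit the same marginal model under each structure; with robust standard
# errors the coefficient estimates typically shift only slightly.
for name, cov in structures.items():
    fit = smf.gee("readprof ~ time", groups="id", data=long,
                  time=long["time"].values,  # spacing used by AR(1)
                  family=sm.families.Binomial(), cov_struct=cov).fit()
    print(f"{name:14s}", fit.params.round(3).to_dict())
```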
Standard Errors and Estimation
Model-based standard errors are based on the correlational structure chosen. Hence, they may be
inconsistent if the correlation structure is incorrectly specified. They are usually a little smaller
than the robust standard errors (SEs). For smaller numbers of clusters, model-based SEs are
generally preferred over robust SEs. In contrast, robust standard errors vary only slightly
depending on the choice of hypothesized correlational structure among the repeated measures;
that is, the estimates are consistent even if the correlational structure is specified incorrectly. The
robust SE approach uses a “sandwich” estimator based on an approximation to maximum
likelihood. Because of this, there can be occasions when one approach will converge and the
other will not. Robust standard errors are often preferred when the number of clustered
observations is large. We will estimate our models in this example using robust standard errors,
since we have a considerable amount of data. Once again, we note that users should keep in mind
that GEE uses a type of quasi-likelihood estimation (as opposed to full information ML), which
can make direct model comparisons based on fit statistics that depend on the true likelihood (e.g.,
deviance, AIC, BIC) not very accurate (Hox, 2010).
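To see the two kinds of standard errors side by side, here is a sketch using statsmodels, which reports robust (sandwich) standard errors by default and can also return the model-based ("naive") ones for comparison:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit once, then compare robust (sandwich) standard errors with the
# model-based ("naive") ones that trust the working correlation structure.
result = smf.gee("readprof ~ time", groups="id", data=long,
                 family=sm.families.Binomial(),
                 cov_struct=sm.cov_struct.Exchangeable()).fit()

print("robust:", result.standard_errors(cov_type="robust").round(4))
print("naive: ", result.standard_errors(cov_type="naive").round(4))
```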
Table 1. Model Information

Dependent Variable                      readprof (a)
Probability Distribution                Binomial
Link Function                           Logit
Subject Effect 1                        id
Within-Subject Effect 1                 time
Working Correlation Matrix Structure    Exchangeable

a. The procedure models 1 as the response, treating 0 as the reference category.
Table 1 provides information about how the model is defined (e.g., probability distribution and
link function, number of effects in the model, type of correlation matrix used to describe the within-subject structure). As the output shows, the distribution is binomial and a logit link function is
used to transform $Y_{ti}$. The working correlation structure is exchangeable, which is the same as
compound symmetry. This implies that the correlations are the same over each time interval. We
can subsequently investigate whether this is a viable assumption for these data.
Next, we can observe how many of the total cases for the dependent variable (reading
proficiency) are coded 1 (proficient) versus 0 (not proficient). As the table suggests, across the
four time periods, an average of 68% of the individuals were proficient and 32% were not.
Table 2. Reading Proficiency Information

readprof    N       Percent
0           2850    32.0%
1           6062    68.0%
Total       8912    100.0%
If we did not include the time variable, the log odds intercept would be 0.755 (not tabled), which
would be the grand mean log odds coefficient across the four time periods. We can translate the
corresponding odds ratio back to the predicted population probability that $Y = 1$ [odds/(1 + odds)], which would
be 2.128/3.128, or 0.680, which fits with the Table 2 estimate.
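This odds-to-probability arithmetic is easy to verify directly; a quick check of the numbers above:

```python
import math

b0 = 0.755                  # grand mean log odds from the intercept-only model
odds = math.exp(b0)         # 2.128
prob = odds / (1 + odds)    # odds/(1 + odds) = 2.128/3.128
print(round(odds, 3), round(prob, 3))  # 2.128 0.68
```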
Next in Table 3 are the fixed-effect results for the intercept and the time-related predictor. The
estimated intercept log odds coefficient is 0.838, which, because of the coding of the time
variable (i.e., 0, 1, 2, 3), can be interpreted as the log odds of being proficient at
the start of the study. The intercept represents the predicted log odds when all variables in the
model are 0. If we exponentiate the log odds, we obtain the corresponding odds ratio of 2.311.
This suggests the odds of being proficient are almost 2.3 times the odds of being non-proficient at
the beginning of the study (.70/.30 ≈ 2.3).
Table 3. Parameter Estimates

                                 95% Wald CI (B)    Hypothesis Test                      95% Wald CI, Exp(B)
Parameter     B      Std. Error  Lower    Upper     Wald Chi-Square  df  Sig.    Exp(B)  Lower    Upper
(Intercept)   .838   .0478       .744     .931      306.910          1   .000    2.311   2.104    2.538
Time          -.055  .0165       -.087    -.022     11.021           1   .000    .947    .916     .978
(Scale)       1

Dependent Variable: readprof
Model: (Intercept), time
Regarding the time variable, the coefficient suggests that over each interval students' likelihood
of being proficient decreases significantly (log odds = -0.055, p < .001). We can translate this
into a predicted probability by adding it to the intercept. Initially (i.e., at time = 0), the log odds
of being proficient is 0.838. For the second interval (time = 1), the estimated log odds will then be
0.783 [0.838 + (-0.055) = 0.783]. We could then estimate the new probability as 0.69, calculated
as $1/[1 + e^{-0.783}]$, which reduces to 1/1.457. Note this estimate is
slightly different from the actual observed probability in the table below, since essentially no
change took place between time 0 and time 1. The odds ratio suggests the odds of
being proficient are multiplied by .947 (or reduced by 5.3%) over the first interval. We can see in
this situation that the assumed negative linear time trend of decreasing probability of being proficient
does not quite fit the data optimally.
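The same arithmetic for the linear time trend, checking the predicted log odds and probabilities reported above:

```python
import math

b0, b1 = 0.838, -0.055      # intercept and per-interval log odds change
for t in range(4):
    logit = b0 + b1 * t                 # predicted log odds at time t
    prob = 1 / (1 + math.exp(-logit))   # inverse logit
    print(t, round(logit, 3), round(prob, 3))
# time 0: 0.838 -> 0.698; time 1: 0.783 -> 0.686 (about 0.69, as in the text)
```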
Table 4. Proportion of Proficient Students

Time     Mean     N       Std. Deviation
0        .6984    2228    .459
1        .6988    2228    .459
2        .6481    2228    .478
3        .6755    2228    .468
Total    .6802    8912    .466
In this case, we might decide to code the data somewhat differently to obtain results that model
the trend a bit better. We might wish to treat the time-related variable as ordinal (1, 2, …, C) rather
than scale. If we make this change, we will have C − 1 estimates, since one category will serve as
the reference group. In this case, we will specify "descending" for the factor category order so
that the first category (Time = 0) will serve as the reference group. This is the same as creating a
series of C − 1 dummy variables for a categorical factor and specifying them in the model, as
shown in the sketch below.
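In statsmodels, the equivalent recoding is a one-line formula change; here patsy's C() expands time into C − 1 dummy variables with Time = 0 as the reference group (again assuming the hypothetical long data frame):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Treat time as categorical: patsy's C() expands it into C - 1 dummy
# variables, with time = 0 serving as the reference category.
model = smf.gee("readprof ~ C(time, Treatment(reference=0))",
                groups="id", data=long,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(np.exp(result.params))  # odds ratios for each time point vs. time 0
```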
Table 5. Model 1.2 Parameter Estimates

                                   Hypothesis Test                      95% Wald CI, Exp(B)
Parameter     B       Std. Error   Wald Chi-Square  df    Sig.   Exp(B)   Lower    Upper
(Intercept)   .840    .0462        330.845          1     .000   2.315    2.115    2.535
[time=3]      -.106   .0457        5.438            1     .020   .899     .822     .983
[time=2]      -.229   .0445        26.440           1     .000   .795     .729     .868
[time=1]      .002    .0177        .014             1     .904   1.002    .968     1.038
[time=0]      0 (a)   .            .                .     .      1        .        .
(Scale)       1

Dependent Variable: readprof
Model: (Intercept), time (ordinal)
a. Set to zero because this parameter is redundant.
The intercept log odds is now 0.840, which is only slightly different from the last table. If we
calculate the predicted probability of being proficient initially (Time = 0), we see it will be
$1/(1 + e^{-0.840})$, or 1/1.432 = 0.698. Note we can also use the odds ratio to estimate the
probability (2.315/3.315). This probability is consistent with the observed probability of 0.6984
in the previous table. We can see further that at Time = 1, there was little change in log odds
units regarding students' probability of being proficient (log odds = 0.002, p = .904). At Time 2
(log odds = -0.229, p < .001) and Time 3 (log odds = -0.106, p = .020), however, students were
significantly lower in probability of being proficient relative to their proficiency status at Time 0.
Regarding the odds ratios (OR), we can interpret the nonsignificant relationship at Time = 1 as
indicating there was no significant change in the odds of being proficient at Time 1 (OR = 1.002, p = .904).
In contrast, the odds of being proficient at Time = 2 versus Time 0 are multiplied by 0.795
(or reduced by 20.5%) compared to the initial level. At Time = 3, the odds of
being proficient versus Time 0 (i.e., the initial status intercept) are multiplied by 0.899 (or
reduced by about 10%).
We can estimate the probability of being proficient at Time 3 versus Time 0 in several ways. We
can add the two log odds coefficients (0.840 − 0.106 = 0.734), which provides the log odds of
being proficient at Time 3. The exponentiated slope can be interpreted as the change in the odds
that Y = 1 relative to the reference category (i.e., Time 0). If we exponentiate the log odds
($e^{0.734}$), we obtain an odds ratio of 2.08. We can then calculate the probability of being proficient
at Time 3 as 2.08/3.08 = 0.675, which is consistent with the 0.6755 in the previous table.
Alternatively, we can represent the new odds ratio as the product of the two odds ratios
(2.315 × 0.899 = 2.08); that is, we multiply the odds ratio for Time = 0 by the ratio of the odds
at Time 3 to the odds at Time 0 (0.899), which provides the new odds ratio (2.08) and will lead to
the same probability. Applying this approach for Time = 2, we have 2.315 × 0.795 = 1.840, which
yields a probability of 0.648 (1.84/2.84 = 0.648). This estimate of the probability that Y = 1 is consistent with the
observed proportion of 0.6481 in the previous table.
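Both routes to the Time 3 probability can be verified directly:

```python
import math

b0, b_t3, b_t2 = 0.840, -0.106, -0.229

# Route 1: add the log odds, exponentiate, convert to a probability.
odds_t3 = math.exp(b0 + b_t3)             # e^0.734 = 2.08
print(round(odds_t3 / (1 + odds_t3), 3))  # 0.676 (~ 0.675)

# Route 2: multiply the baseline odds by the Time 3 odds ratio.
print(round(math.exp(b0) * math.exp(b_t3), 2))  # 2.315 * 0.899 = 2.08

# Same approach for Time 2.
odds_t2 = math.exp(b0) * math.exp(b_t2)   # 2.315 * 0.795 = 1.84
print(round(odds_t2 / (1 + odds_t2), 3))  # 0.648
```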
We can see that defining the time trend as categorical in this instance provides some benefit in
representing the changes in the probability of being proficient between measurements more
accurately.
Adding a Predictor
We can next add one or more between-subjects predictors, but the outcome parameters are
treated as fixed; that is, the slopes cannot vary across individuals in the sample. We provide an
example where we add gender (female coded 1; male coded 0) to the model. We can define this
model as follows:

$\eta_{ti} = \log[\pi_{ti}/(1 - \pi_{ti})] = \beta_0 + \beta_1(\text{time}) + \beta_2(\text{female}).$
We will do this one in class and compare time defined as interval and ordinal.
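A sketch of this extended model in statsmodels, assuming the hypothetical long file also carries the female indicator; swapping time for C(time, Treatment(reference=0)) in the formula gives the ordinal version for comparison.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Marginal model with a between-subjects predictor:
# eta_ti = b0 + b1*time + b2*female (slopes fixed across individuals).
model = smf.gee("readprof ~ time + female", groups="id", data=long,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```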
References
Horton, N. J., & Lipsitz, S. R. (1999). Review of software to fit Generalized Estimating Equation (GEE) regression models. The American Statistician, 53, 160–169.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York: Routledge.
Liang, K.-Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13–22.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Zeger, S. L., & Liang, K.-Y. (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42(1), 121–130.
Zeger, S. L., Liang, K.-Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44(4), 1049–1060.
Ziegler, A., Kastner, C., & Blettner, M. (1998). The generalised estimating equations: An annotated bibliography. Biometrical Journal, 40(2), 115–139.