An Introduction to
Group-Based Trajectory Modeling
and PROC TRAJ
Richard Charnigo
Professor of Statistics and Biostatistics
Director of Statistics and Psychometrics Core, CDART
Objectives
First ~80 minutes:
1. Be able to describe a group-based trajectory model
and, in particular, distinguish it from a conventional
regression model.
2. Be able to interpret results obtained from fitting a
group-based trajectory model via PROC TRAJ.
Last ~40 minutes:
3. Be able to fit a group-based trajectory model via
PROC TRAJ.
Motivating example
The Excel file at {www.richardcharnigo.net/traj}
contains a simulated data set:
Five hundred college freshmen (“ID”) were asked to
estimate how many times per month they
consumed marijuana during their freshman (“Y1”),
sophomore (“Y2”), junior (“Y3”), and senior (“Y4”)
years of high school.
Later they were asked to estimate their marijuana use
during freshman year of college (“Y5”).
They were also assessed on reward seeking; for ease
of interpretation, we standardize this variable (“X”).
Motivating example
Two possible “research questions” are:
i. What are prototypical trajectories of marijuana use
within the population of college students from
which this sample was drawn ?
ii. Is the trajectory that best describes the experience
of a particular student associated with that
student’s level of reward seeking ?
We can develop more complicated and realistic
scenarios ( e.g., with additional personality
variables and/or interventions ), but this simple
scenario will help us begin to understand
group-based trajectory modeling and PROC TRAJ.
Exploratory data analysis
Before pursuing group-based trajectory ( or any other
statistical ) modeling, we are well-advised to
perform exploratory data analysis.
This can alert us to gross mistakes in the data set,
heretofore undetected, which may otherwise
threaten the validity of our results.
This can also suggest an appropriate probability
distribution to use with the group-based trajectory
model and help us to anticipate what the results
may be.
Exploratory data analysis
Descriptive statistics for Y1

Quantiles (Definition 5)
Quantile      Estimate
100% Max      4
99%           3
95%           2
90%           1
75% Q3        1
50% Median    0
25% Q1        0
10%           0
5%            0
1%            0
0% Min        0

Basic Statistical Measures
Location              Variability
Mean     0.362000     Std Deviation          0.71553
Median   0.000000     Variance               0.51198
Mode     0.000000     Range                  4.00000
                      Interquartile Range    1.00000
Exploratory data analysis
Descriptive statistics for Y5

Quantiles (Definition 5)
Quantile      Estimate
100% Max      14
99%           12
95%           9
90%           7
75% Q3        1
50% Median    0
25% Q1        0
10%           0
5%            0
1%            0
0% Min        0

Basic Statistical Measures
Location              Variability
Mean     1.454000     Std Deviation          2.91563
Median   0.000000     Variance               8.50089
Mode     0.000000     Range                 14.00000
                      Interquartile Range    1.00000
Exploratory data analysis
The preceding slides show descriptive statistics for Y1
and Y5. ( We can similarly examine descriptive
statistics for Y2, Y3, and Y4. ) Here are a few
observations:
• As anticipated, the possible values of Y1 and Y5
are nonnegative, and they appear to have been
recorded ( or rounded ) to the nearest integer.
• The distributions of Y1 and Y5 are right-skewed,
and there are lots of 0’s.
• Both the mean and the variance for Y5 are greater
than the corresponding quantities for Y1.
Exploratory data analysis
Our observations suggest the following:
• Because there are lots of 0’s, there is no
transformation that will bring Y1 or Y5 to
approximate normality.
• However, because Y1 and Y5 are integer-valued,
a Poisson ( or similar ) probability distribution may
be applicable.
• Since Y5 has greater mean and variance than Y1,
we anticipate some divergence between
trajectories over time and at least one trajectory
showing increasing marijuana use over time.
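As a supplementary check ( not part of the original analysis ), we can compare the variance to the mean for Y1 and Y5, using the values reported on the descriptive statistics slides. A single Poisson distribution has variance equal to its mean, so a ratio well above 1 is consistent with the data arising from a mixture of subpopulations rather than from one Poisson distribution. The function name below is illustrative only.

```python
# Means and variances copied from the descriptive statistics for Y1 and Y5.
summary = {
    "Y1": {"mean": 0.362, "variance": 0.51198},
    "Y5": {"mean": 1.454, "variance": 8.50089},
}


def dispersion_ratio(mean, variance):
    """Variance-to-mean ratio; equals 1 for a single Poisson distribution."""
    return variance / mean


ratios = {
    name: dispersion_ratio(stats["mean"], stats["variance"])
    for name, stats in summary.items()
}

for name, ratio in ratios.items():
    print(f"{name}: variance / mean = {ratio:.2f}")
```

The ratio is modestly above 1 for Y1 and much larger for Y5, which fits the anticipation that trajectories diverge over time.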
A first trajectory model
Let t denote time in years. If we set time 0 to be
high school graduation, then we have t = -3, -2,
-1, 0, and 1 corresponding to Y1 through Y5.
Suppose for now --- the viability of this supposition
can be assessed later --- that there are three
subpopulations whose mean levels of marijuana
use over time ( called “trajectories” ) are defined
by exponentials of linear functions
f1(t) = exp(a1 + b1 t),
f2(t) = exp(a2 + b2 t), and
f3(t) = exp(a3 + b3 t).
The exponentials are needed because f1(t), f2(t),
and f3(t) must be nonnegative.
A first trajectory model
Suppose that the distribution of Yk ( 1 ≤ k ≤ 5 ) in the
first subpopulation is Poisson with mean f1( k-4 ),
in the second is Poisson with mean f2( k-4 ), and
in the third is Poisson with mean f3( k-4 ).
Finally, suppose that the probability of belonging to
subpopulation j ( 2 ≤ j ≤ 3 ) divided by the
probability of belonging to subpopulation 1 is of
the form exp(cj + dj X). If dj > 0, then higher
levels of reward seeking increase the above ratio;
if dj < 0, then they decrease the above ratio.
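To make the ratio form concrete, here is a minimal sketch of how the three membership probabilities follow from the ratios exp(cj + dj X), with subpopulation 1 as the reference. The function name and the coefficient values are hypothetical, chosen only to show the arithmetic.

```python
import math


def membership_probabilities(x, c, d):
    """Return [p1, p2, p3, ...] given the ratios p_j / p_1 = exp(c_j + d_j * x).

    c and d hold the coefficients for subpopulations 2, 3, ...;
    subpopulation 1 is the reference, so its ratio is exp(0) = 1.
    """
    ratios = [1.0] + [math.exp(cj + dj * x) for cj, dj in zip(c, d)]
    total = sum(ratios)
    return [r / total for r in ratios]


# Hypothetical coefficients (c2, c3) and (d2, d3), for illustration only.
c = [-1.2, -2.4]
d = [1.2, 2.4]

for x in (0.0, 1.0, 2.0):
    probs = membership_probabilities(x, c, d)
    print(x, [round(p, 3) for p in probs])
```

Because both d2 and d3 are positive here, the probabilities of belonging to subpopulations 2 and 3 rise relative to subpopulation 1 as X increases.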
A first trajectory model
A group-based trajectory model is thus distinguished
from a conventional regression model in that a
latent variable --- namely, the subpopulation to
which one belongs --- is intermediate between
what might be thought of as the independent
variable (here, reward seeking) and the dependent
variable (here, marijuana use).
Consequently, and importantly, the difference between
two trajectories is typically much greater than the
difference between mean levels among persons
“high” on the independent variable versus persons
“low” on the independent variable.
A first trajectory model
[Figure: trajectories of mean marijuana use plotted against time t for subpopulations 1, 2, and 3.]
A first trajectory model
The preceding figure shows results from fitting the
group-based trajectory model via PROC TRAJ.
Approximately 65.3% of persons belong to a
subpopulation that is essentially abstinent from
marijuana, about 19.4% to a subpopulation whose
marijuana use increases and then decreases, and
about 15.3% to a subpopulation whose marijuana
use continually increases.
Dashed lines represent estimates of f1(t), f2(t), and
f3(t) when they are assumed to be exponentials of
linear functions; solid lines represent estimates
without such a constraint.
A first trajectory model
Obs  ID  Y1  Y2  Y3  Y4  Y5  T1  T2  T3  T4  T5      X   GRP1PRB   GRP2PRB   GRP3PRB  GROUP
  5   5   0   0   1   0   0  -3  -2  -1   0   1   0.08  0.995814  0.004186  0.000000      1
  6   6   2   3   6   4   0  -3  -2  -1   0   1   2.75  0.000000  0.243606  0.756394      3
  7   7   0   0   0   0   1  -3  -2  -1   0   1  -0.97  0.998364  0.001636  0.000000      1
  8   8   2   4   3   8   8  -3  -2  -1   0   1    0.7  0.000000  0.000002  0.999998      3
  9   9   1   0   1   4   5  -3  -2  -1   0   1   2.78  0.000000  0.071390  0.928610      3
 10  10   1   4   0   0   1  -3  -2  -1   0   1   0.53  0.000634  0.999287  0.000078      2

Obs  _MODEL_  _MODEL2_  _TYPE_  _NAME_       INTERC1       LINEAR1       INTERC2
  1  ZIP                PARMS           -2.240945095  -0.061055892  0.4881041958

Obs       LINEAR2       INTERC3      LINEAR3        CONST2            X2        CONST3            X3
  1  0.0881527887  1.6413614616  0.404393847  -1.196677753  1.1816491462  -2.400466075  2.4141657075

Obs      _LOGLIK_        _BIC1_        _BIC2_         _AIC_  _CONVERGE_
  1  -2580.343083  -2611.416123  -2619.463313  -2590.343083           4

Obs         T     AVG1     AVG2     AVG3    PRED1    PRED2    PRED3
  1  -3.00000  0.13401  0.48801  1.17827  0.12774  1.25063  1.53446
  2  -2.00000  0.10507  1.68610  2.66516  0.12017  1.36588  2.29923
  3  -1.00000  0.13408  2.58710  3.57223  0.11305  1.49175  3.44515
  4   0.00000  0.08391  1.57138  5.24892  0.10636  1.62922  5.16219
  5   1.00000  0.11002  1.20619  7.52729  0.10006  1.77937  7.73500
A first trajectory model
The preceding tables display additional results.
The first table shows variable values for six subjects,
along with the estimated probabilities that the
subjects belong to the three subpopulations.
The second and third tables present estimates of a1,
b1, a2, b2, a3, b3, c2, d2, c3, and d3. Companion
output, which is displayed by PROC TRAJ on
screen only, provides accompanying p-values.
The fourth table provides indices of model fit, and the
fifth table specifies the numbers used to construct
the figure displayed earlier.
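To see where the estimated membership probabilities come from, the sketch below applies Bayes' rule to the parameter estimates for the first model ( linear trajectories ). It assumes an ordinary Poisson likelihood and independence across the five time points within a subpopulation. It is a hand calculation for illustration, not PROC TRAJ's own code, and the function names are hypothetical.

```python
import math

# Estimates from the first model's parameter tables.
intercepts = [-2.240945095, 0.4881041958, 1.6413614616]  # a1, a2, a3
slopes = [-0.061055892, 0.0881527887, 0.404393847]       # b1, b2, b3
c = [0.0, -1.196677753, -2.400466075]  # c1 is 0 for the reference subpopulation
d = [0.0, 1.1816491462, 2.4141657075]  # d1 is 0 for the reference subpopulation
times = [-3, -2, -1, 0, 1]             # t for Y1 through Y5


def poisson_log_pmf(y, mu):
    """Log of the Poisson probability of count y when the mean is mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def posterior_probabilities(ys, x):
    """Posterior probabilities of the three subpopulations given Y1..Y5 and X."""
    log_weights = []
    for a, b, cj, dj in zip(intercepts, slopes, c, d):
        log_prior = cj + dj * x  # log membership probability, up to a constant
        log_lik = sum(
            poisson_log_pmf(y, math.exp(a + b * t)) for y, t in zip(ys, times)
        )
        log_weights.append(log_prior + log_lik)
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


subject5 = posterior_probabilities([0, 0, 1, 0, 0], 0.08)
subject8 = posterior_probabilities([2, 4, 3, 8, 8], 0.7)
subject10 = posterior_probabilities([1, 4, 0, 0, 1], 0.53)
print([round(p, 6) for p in subject5])
```

For the subject with ID 5, this calculation gives a probability of roughly 0.0042 for subpopulation 2, in line with the GRP2PRB value in the first table.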
A first trajectory model
Visually, the estimate of f2(t) appears somewhat
unsatisfactory. There are corresponding
discrepancies between the “AVG2” and “PRED2”
columns in the fifth table.
Therefore, let us consider a second group-based
trajectory model in which the trajectories are
defined by exponentials of quadratic functions
f1(t) = exp(a1 + b1 t + g1 t^2),
f2(t) = exp(a2 + b2 t + g2 t^2), and
f3(t) = exp(a3 + b3 t + g3 t^2).
A second trajectory model
[Figure: trajectories of mean marijuana use plotted against time t for subpopulations 1, 2, and 3, with quadratic functions.]
A second trajectory model
Obs  ID  Y1  Y2  Y3  Y4  Y5  T1  T2  T3  T4  T5      X   GRP1PRB   GRP2PRB   GRP3PRB  GROUP
  5   5   0   0   1   0   0  -3  -2  -1   0   1   0.08  0.992863  0.007137  0.000000      1
  6   6   2   3   6   4   0  -3  -2  -1   0   1   2.75  0.000000  0.868232  0.131768      2
  7   7   0   0   0   0   1  -3  -2  -1   0   1  -0.97  0.999285  0.000715  0.000000      1
  8   8   2   4   3   8   8  -3  -2  -1   0   1    0.7  0.000000  0.000000  1.000000      3
  9   9   1   0   1   4   5  -3  -2  -1   0   1   2.78  0.000000  0.008870  0.991130      3
 10  10   1   4   0   0   1  -3  -2  -1   0   1   0.53  0.001133  0.998748  0.000119      2

Obs  _MODEL_  _MODEL2_  _TYPE_  _NAME_       INTERC1       LINEAR1       QUADRA1
  1  ZIP                PARMS           -2.311574846  0.0558397704  0.0514642791

Obs    INTERC2       LINEAR2       QUADRA2       INTERC3       LINEAR3       QUADRA3
  1  [missing]  -0.469526884  -0.296767096  1.6939847836  0.3771947256  -0.029055771

Obs        CONST2           X2        CONST3           X3
  1  -1.157274099  1.197230375  -2.356619304  2.313769971

Obs      _LOGLIK_        _BIC1_        _BIC2_         _AIC_  _CONVERGE_
  1  -2504.788285  -2545.183238  -2555.644584  -2517.788285           4

Obs         T     AVG1     AVG2     AVG3    PRED1    PRED2    PRED3
  1  -3.00000  0.13401  0.48801  1.17827  0.12774  1.25063  1.53446
  2  -2.00000  0.10507  1.68610  2.66516  0.12017  1.36588  2.29923
  3  -1.00000  0.13408  2.58710  3.57223  0.11305  1.49175  3.44515
  4   0.00000  0.08391  1.57138  5.24892  0.10636  1.62922  5.16219
  5   1.00000  0.11002  1.20619  7.52729  0.10006  1.77937  7.73500
A second trajectory model
Some comments are in order:
• The estimate of f2(t) looks much better now.
• The guess about which subpopulation subject 6
belongs to has changed ( and appears more
reasonable now ).
• The BIC1, BIC2, and AIC have increased by
approximately 66, 64, and 73 points respectively.
These are overwhelming changes, suggesting that
the second group-based trajectory model provides
a much better fit to the data than the first
group-based trajectory model.
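The changes in the fit indices can be checked directly from the reported values. The sketch below also tests a penalty formula, log-likelihood minus half the number of parameters times log(n). That formula is inferred from the numbers rather than stated in these slides, and it appears to reproduce BIC1 with n = 500 subjects and BIC2 with n = 2500 observations. On this scale, larger ( less negative ) values indicate better fit.

```python
import math

# Fit indices reported for the first (linear) and second (quadratic) models.
first = {"LOGLIK": -2580.343083, "BIC1": -2611.416123,
         "BIC2": -2619.463313, "AIC": -2590.343083}
second = {"LOGLIK": -2504.788285, "BIC1": -2545.183238,
          "BIC2": -2555.644584, "AIC": -2517.788285}

# Improvement of the second model over the first on each index.
improvement = {k: second[k] - first[k] for k in ("BIC1", "BIC2", "AIC")}

# The AIC penalty implies the number of estimated parameters in each model.
n_params_first = round(first["LOGLIK"] - first["AIC"])     # 6 trajectory + 4 membership
n_params_second = round(second["LOGLIK"] - second["AIC"])  # 9 trajectory + 4 membership


def penalized_loglik(loglik, n_params, n):
    """Assumed BIC form: log-likelihood minus 0.5 * parameters * log(n)."""
    return loglik - 0.5 * n_params * math.log(n)


bic1_check = penalized_loglik(second["LOGLIK"], n_params_second, 500)
bic2_check = penalized_loglik(second["LOGLIK"], n_params_second, 2500)

for name, value in improvement.items():
    print(f"{name}: +{value:.1f}")
```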
Is that the best we can do ?
Besides moving from linear functions to quadratic
functions, other modifications are possible.
One, for which I provide SAS code at
{www.richardcharnigo.net/traj}, entails replacing
the ordinary Poisson probability distribution by the
zero-inflated Poisson probability distribution. The
idea is that, especially in the first subpopulation,
there may be too many 0’s to be compatible with
the ordinary Poisson probability distribution.
Accounting for this zero inflation may provide a
better fit to the data.
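A minimal sketch of the zero-inflated Poisson probability distribution follows. Here pi denotes the probability of a "structural" 0 and mu the mean of the Poisson component; the names and the example values are hypothetical and are not PROC TRAJ's parameterization.

```python
import math


def poisson_pmf(y, mu):
    """Ordinary Poisson probability of count y when the mean is mu."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))


def zip_pmf(y, mu, pi):
    """Zero-inflated Poisson: with probability pi the count is a structural 0,
    and otherwise it is drawn from a Poisson distribution with mean mu."""
    base = (1.0 - pi) * poisson_pmf(y, mu)
    return pi + base if y == 0 else base


# Hypothetical values, for illustration only.
mu, pi = 1.0, 0.3
print("P(Y = 0), ordinary Poisson:", round(poisson_pmf(0, mu), 4))
print("P(Y = 0), zero-inflated:   ", round(zip_pmf(0, mu, pi), 4))
```

With pi > 0, the probability of a 0 exceeds what the ordinary Poisson distribution allows for the same mu, which is how the extra 0's are accommodated.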
Is that the best we can do ?
Another possible modification is to change the
quadratic functions to cubic or even quartic
functions. ( With only five time points, we cannot
go beyond polynomials of degree four. )
In fact, the polynomial degree need not be the same
for each subpopulation. For instance, a linear
function may suffice for the first and third
subpopulations, while ( at least ) a quadratic
function appears necessary for the second
subpopulation.
Is that the best we can do ?
We face the practical problem, though, of deciding
which modifications to make.
Rather than consider dozens ( or hundreds ) of
possible competing models, a more feasible
approach may be to start with the most
complicated model that one is willing to entertain
( for example, with quartic polynomials for each
subpopulation ) and then perform “backward
elimination”.
Is that the best we can do ?
To do this, remove whichever model feature has the
largest p-value, while respecting the hierarchical
principle that simpler features cannot be removed
before more complicated features.
Thus, for example, the linear term cannot be removed
from a quadratic polynomial.
Once all remaining model features have p-values less
than 0.05 ( or are ineligible for removal ), stop and
create a table of model fit indices corresponding to
the various steps of the backward elimination.
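The selection rule at each step of the backward elimination can be sketched as follows. This simplified version considers only the polynomial terms of the trajectories and assumes that intercepts are never removed. The function name and the p-values are hypothetical; in practice the model is refit after each removal and the p-values change.

```python
def next_term_to_remove(p_values, alpha=0.05):
    """Pick the polynomial term to remove at one step of backward elimination.

    p_values maps each subpopulation to a list of p-values ordered by degree:
    [intercept, linear, quadratic, ...]. Under the hierarchical principle only
    the highest-degree term in each subpopulation is eligible for removal.
    Returns (subpopulation, degree), or None when nothing should be removed.
    """
    candidates = []
    for group, pvals in p_values.items():
        degree = len(pvals) - 1
        if degree >= 1 and pvals[degree] >= alpha:
            candidates.append((pvals[degree], group, degree))
    if not candidates:
        return None
    _, group, degree = max(candidates)  # largest p-value among eligible terms
    return group, degree


# Hypothetical p-values for three subpopulations with quadratic trajectories.
p_values = {
    1: [0.001, 0.40, 0.62],
    2: [0.001, 0.002, 0.001],
    3: [0.001, 0.001, 0.30],
}
print(next_term_to_remove(p_values))
```

In this example the quadratic term for subpopulation 1 is removed first. Its linear term has a large p-value too, but it is ineligible until the quadratic term is gone.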
Is that the best we can do ?
The step in the backward elimination at which the
model fit indices are optimized can be used to
select a final model. ( Matters become a bit more
complicated, though, if the model fit indices are
not in agreement about this. )
Also, if we are unsure whether three is the best
number of groups, then the above process can be
repeated with, say, two groups and four groups.
Model fit indices can then be used to choose
among the final two-group model, the final
three-group model, and the final four-group model.
Other capabilities of PROC TRAJ
Worth mentioning here, though not illustrated in this
presentation or in the SAS code at
{www.richardcharnigo.net/traj}, are three additional
capabilities of PROC TRAJ:
• The dependent variable need not have the
(zero-inflated) Poisson probability distribution; the
normal and Bernoulli probability distributions can
be accommodated as well.
• Multiple independent variables can be
accommodated.
Other capabilities of PROC TRAJ
• Multiple, related dependent variables can be
accommodated. If there are two ( for instance,
marijuana use and alcohol use ), then PROC
TRAJ provides one latent variable defining
subpopulations on the first dependent variable and
a separate latent variable defining subpopulations
on the second. Part of the output from PROC
TRAJ then estimates the probabilities of
membership in the subpopulations defined by the
second latent variable given membership in a
subpopulation defined by the first. If there are
more than two, then PROC TRAJ provides a
single latent variable defining subpopulations on
all dependent variables simultaneously.
Trying out PROC TRAJ
With this background, let us open SAS and work our
way through at least some of the SAS code at
{www.richardcharnigo.net/traj}.
This is also an opportunity to experiment and make
some changes to the SAS code. For instance,
you can see what PROC TRAJ does when a
quadratic function is replaced by a cubic function
or when a quadratic function is retained for only
one of the three subpopulations.