An Introduction to Group-Based Trajectory Modeling and PROC TRAJ

Richard Charnigo
Professor of Statistics and Biostatistics
Director of Statistics and Psychometrics Core, CDART

Objectives

First ~80 minutes:
1. Be able to describe a group-based trajectory model and, in particular, distinguish it from a conventional regression model.
2. Be able to interpret results obtained from fitting a group-based trajectory model via PROC TRAJ.

Last ~40 minutes:
3. Be able to fit a group-based trajectory model via PROC TRAJ.

Motivating example

The Excel file at {www.richardcharnigo.net/traj} contains a simulated data set: Five hundred college freshmen (“ID”) were asked to estimate how many times per month they consumed marijuana during their freshman (“Y1”), sophomore (“Y2”), junior (“Y3”), and senior (“Y4”) years of high school. Later they were asked to estimate their marijuana use during freshman year of college (“Y5”). They were also assessed on reward seeking; for ease of interpretation, we standardize this variable (“X”).

Two possible “research questions” are:
i. What are prototypical trajectories of marijuana use within the population of college students from which this sample was drawn?
ii. Is the trajectory that best describes the experience of a particular student associated with that student’s level of reward seeking?

We can develop more complicated and realistic scenarios (e.g., with additional personality variables and/or interventions), but this simple scenario will help us begin to understand group-based trajectory modeling and PROC TRAJ.

Exploratory data analysis

Before pursuing group-based trajectory (or any other statistical) modeling, we are well-advised to perform exploratory data analysis. This can alert us to gross mistakes in the data set, heretofore undetected, which may otherwise threaten the validity of our results.
This can also suggest an appropriate probability distribution to use with the group-based trajectory model and help us to anticipate what the results may be.

Descriptive statistics for Y1

Basic Statistical Measures
  Location               Variability
  Mean     0.362000      Std Deviation          0.71553
  Median   0.000000      Variance               0.51198
  Mode     0.000000      Range                  4.00000
                         Interquartile Range    1.00000

Quantiles (Definition 5)
  Quantile      Estimate
  100% Max      4
  99%           3
  95%           2
  90%           1
  75% Q3        1
  50% Median    0
  25% Q1        0
  10%           0
  5%            0
  1%            0
  0% Min        0

Descriptive statistics for Y5

Basic Statistical Measures
  Location               Variability
  Mean     1.454000      Std Deviation          2.91563
  Median   0.000000      Variance               8.50089
  Mode     0.000000      Range                 14.00000
                         Interquartile Range    1.00000

Quantiles (Definition 5)
  Quantile      Estimate
  100% Max      14
  99%           12
  95%           9
  90%           7
  75% Q3        1
  50% Median    0
  25% Q1        0
  10%           0
  5%            0
  1%            0
  0% Min        0

The preceding slides show descriptive statistics for Y1 and Y5. (We can similarly examine descriptive statistics for Y2, Y3, and Y4.) Here are a few observations:
• As anticipated, the possible values of Y1 and Y5 are nonnegative, and they appear to have been recorded (or rounded) to the nearest integer.
• The distributions of Y1 and Y5 are right-skewed, and there are lots of 0’s.
• Both the mean and the variance for Y5 are greater than the corresponding quantities for Y1.

Our observations suggest the following:
• Because there are lots of 0’s, there is no transformation that will bring Y1 or Y5 to approximate normality.
• However, because Y1 and Y5 are integer-valued, a Poisson (or similar) probability distribution may be applicable.
• Since Y5 has greater mean and variance than Y1, we anticipate some divergence between trajectories over time and at least one trajectory showing increasing marijuana use over time.

A first trajectory model

Let t denote time in years. If we set time 0 to be high school graduation, then we have t = -3, -2, -1, 0, and 1 corresponding to Y1 through Y5.
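As a quick check on the exploratory findings for Y1 and Y5, recall that a Poisson distribution has variance equal to its mean. The minimal Python sketch below (Python is used only for illustration; the presentation itself works in SAS) computes the variance-to-mean ratio from the reported descriptive statistics. The function name `dispersion_ratio` is hypothetical. Ratios well above 1 would suggest that a single Poisson distribution is unlikely to describe the whole sample, which is in keeping with the idea of several subpopulations.

```python
def dispersion_ratio(mean, variance):
    """Variance-to-mean ratio; equals 1 for a Poisson distribution."""
    return variance / mean


# Means and variances as reported in the descriptive statistics for Y1 and Y5.
reported = {"Y1": (0.362, 0.51198), "Y5": (1.454, 8.50089)}
ratios = {name: dispersion_ratio(m, v) for name, (m, v) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: variance/mean = {ratio:.2f}")
```

The ratio is a rough diagnostic only; it does not by itself indicate how many subpopulations are present.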
Suppose for now (the viability of this supposition can be assessed later) that there are three subpopulations whose mean levels of marijuana use over time (called “trajectories”) are defined by exponentials of linear functions: f1(t) = exp(a1 + b1 t), f2(t) = exp(a2 + b2 t), and f3(t) = exp(a3 + b3 t). The exponentials are needed because f1(t), f2(t), and f3(t) must be nonnegative.

Suppose that the distribution of Yk (1 ≤ k ≤ 5) in the first subpopulation is Poisson with mean f1(k-4), in the second is Poisson with mean f2(k-4), and in the third is Poisson with mean f3(k-4). Finally, suppose that the probability of belonging to subpopulation j (2 ≤ j ≤ 3) divided by the probability of belonging to subpopulation 1 is of the form exp(cj + dj X). If dj > 0, then higher levels of reward seeking increase the above ratio; if dj < 0, then they decrease the above ratio.

A group-based trajectory model is thus distinguished from a conventional regression model in that a latent variable (namely, the subpopulation to which one belongs) is intermediate between what might be thought of as the independent variable (here, reward seeking) and the dependent variable (here, marijuana use). Consequently, and importantly, the difference between two trajectories is typically much greater than the difference between mean levels among persons “high” on the independent variable versus persons “low” on the independent variable.

[Figure: estimated trajectories of marijuana use for subpopulations 1, 2, and 3 under the first model; plotted points are labeled by subpopulation number.]

The preceding figure shows results from fitting the group-based trajectory model via PROC TRAJ. Approximately 65.3% of persons belong to a subpopulation that is essentially abstinent from marijuana, about 19.4% to a subpopulation whose marijuana use increases and then decreases, and about 15.3% to a subpopulation whose marijuana use continually increases.
In the figure, dashed lines represent estimates of f1(t), f2(t), and f3(t) when they are assumed to be exponentials of linear functions; solid lines represent estimates without such a constraint.

First table (subjects 5 through 10)
  Obs  ID  Y1  Y2  Y3  Y4  Y5  T1  T2  T3  T4  T5      X   GRP1PRB   GRP2PRB   GRP3PRB  GROUP
    5   5   0   0   1   0   0  -3  -2  -1   0   1   0.08  0.995814  0.004186  0.000000      1
    6   6   2   3   6   4   0  -3  -2  -1   0   1   2.75  0.000000  0.243606  0.756394      3
    7   7   0   0   0   0   1  -3  -2  -1   0   1  -0.97  0.998364  0.001636  0.000000      1
    8   8   2   4   3   8   8  -3  -2  -1   0   1    0.7  0.000000  0.000002  0.999998      3
    9   9   1   0   1   4   5  -3  -2  -1   0   1   2.78  0.000000  0.071390  0.928610      3
   10  10   1   4   0   0   1  -3  -2  -1   0   1   0.53  0.000634  0.999287  0.000078      2

Second and third tables (parameter estimates; _MODEL_ = ZIP, _TYPE_ = PARMS)
  INTERC1 (a1)   -2.240945095
  LINEAR1 (b1)   -0.061055892
  INTERC2 (a2)    0.4881041958
  LINEAR2 (b2)    0.0881527887
  INTERC3 (a3)    1.6413614616
  LINEAR3 (b3)    0.404393847
  CONST2  (c2)   -1.196677753
  X2      (d2)    1.1816491462
  CONST3  (c3)   -2.400466075
  X3      (d3)    2.4141657075

Fourth table (model fit)
  _LOGLIK_     -2580.343083
  _BIC1_       -2611.416123
  _BIC2_       -2619.463313
  _AIC_        -2590.343083
  _CONVERGE_   4

Fifth table (numbers used for the figure)
  Obs         T     AVG1     AVG2     AVG3    PRED1    PRED2    PRED3
    1  -3.00000  0.13401  0.48801  1.17827  0.12774  1.25063  1.53446
    2  -2.00000  0.10507  1.68610  2.66516  0.12017  1.36588  2.29923
    3  -1.00000  0.13408  2.58710  3.57223  0.11305  1.49175  3.44515
    4   0.00000  0.08391  1.57138  5.24892  0.10636  1.62922  5.16219
    5   1.00000  0.11002  1.20619  7.52729  0.10006  1.77937  7.73500

The preceding tables display additional results. The first table shows variable values for six subjects, along with the estimated probabilities that the subjects belong to the three subpopulations. The second and third tables present estimates of a1, b1, a2, b2, a3, b3, c2, d2, c3, and d3. Companion output, which is displayed by PROC TRAJ on screen only, provides accompanying p-values. The fourth table provides indices of model fit, and the fifth table specifies the numbers used to construct the figure displayed earlier.

Visually, the estimate of f2(t) appears somewhat unsatisfactory.
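The membership probabilities in the first table and the predicted values in the fifth table can be related back to the parameter estimates. The following Python sketch is an illustration only, not PROC TRAJ's own computation: it assumes the plain Poisson model with linear trajectories described earlier, treats a subject's five observations as independent given the subpopulation, and applies Bayes' rule. The function names (`trajectory`, `poisson_logpmf`, `posterior`) are hypothetical; the numbers are the estimates reported for the first model.

```python
import math

# Estimates reported for the first (linear) model.
a = [-2.240945095, 0.4881041958, 1.6413614616]   # INTERC1..INTERC3
b = [-0.061055892, 0.0881527887, 0.404393847]    # LINEAR1..LINEAR3
c = [0.0, -1.196677753, -2.400466075]            # CONST2, CONST3 (group 1 is the reference)
d = [0.0, 1.1816491462, 2.4141657075]            # X2, X3
times = [-3, -2, -1, 0, 1]                       # t for Y1..Y5


def trajectory(j, t):
    """Mean use for group j (0-based index) at time t: exp(a_j + b_j * t)."""
    return math.exp(a[j] + b[j] * t)


def poisson_logpmf(y, mu):
    """Log of the Poisson probability of count y when the mean is mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def posterior(ys, x):
    """Posterior group membership probabilities for one subject."""
    log_w = []
    for j in range(3):
        log_prior = c[j] + d[j] * x  # unnormalized; the normalizing constant cancels
        log_lik = sum(poisson_logpmf(y, trajectory(j, t)) for y, t in zip(ys, times))
        log_w.append(log_prior + log_lik)
    top = max(log_w)                 # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


pred3 = [trajectory(2, t) for t in times]   # compare with the PRED3 column
post6 = posterior([2, 3, 6, 4, 0], 2.75)    # subject 6: Y1..Y5 and X
print([round(v, 5) for v in pred3])
print([round(p, 6) for p in post6])
```

Under these assumptions the values for subject 6 can be compared with the GRP1PRB, GRP2PRB, and GRP3PRB entries in the first table, and `pred3` with the PRED3 column of the fifth table.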
There are corresponding discrepancies between the “AVG2” and “PRED2” columns in the fifth table. Therefore, let us consider a second group-based trajectory model in which the trajectories are defined by exponentials of quadratic functions: f1(t) = exp(a1 + b1 t + g1 t^2), f2(t) = exp(a2 + b2 t + g2 t^2), and f3(t) = exp(a3 + b3 t + g3 t^2).

A second trajectory model

[Figure: estimated trajectories of marijuana use for subpopulations 1, 2, and 3 under the second model; plotted points are labeled by subpopulation number.]

Subjects 5 through 10
  Obs  ID  Y1  Y2  Y3  Y4  Y5  T1  T2  T3  T4  T5      X   GRP1PRB   GRP2PRB   GRP3PRB  GROUP
    5   5   0   0   1   0   0  -3  -2  -1   0   1   0.08  0.992863  0.007137  0.000000      1
    6   6   2   3   6   4   0  -3  -2  -1   0   1   2.75  0.000000  0.868232  0.131768      2
    7   7   0   0   0   0   1  -3  -2  -1   0   1  -0.97  0.999285  0.000715  0.000000      1
    8   8   2   4   3   8   8  -3  -2  -1   0   1    0.7  0.000000  0.000000  1.000000      3
    9   9   1   0   1   4   5  -3  -2  -1   0   1   2.78  0.000000  0.008870  0.991130      3
   10  10   1   4   0   0   1  -3  -2  -1   0   1   0.53  0.001133  0.998748  0.000119      2

Parameter estimates (_MODEL_ = ZIP, _TYPE_ = PARMS)
  INTERC1 (a1)   -2.311574846
  LINEAR1 (b1)    0.0558397704
  QUADRA1 (g1)    0.0514642791
  INTERC2 (a2)    (not shown)
  LINEAR2 (b2)   -0.469526884
  QUADRA2 (g2)   -0.296767096
  INTERC3 (a3)    1.6939847836
  LINEAR3 (b3)    0.3771947256
  QUADRA3 (g3)   -0.029055771
  CONST2  (c2)   -1.157274099
  X2      (d2)    1.197230375
  CONST3  (c3)   -2.356619304
  X3      (d3)    2.313769971

Model fit
  _LOGLIK_     -2504.788285
  _BIC1_       -2545.183238
  _BIC2_       -2555.644584
  _AIC_        -2517.788285
  _CONVERGE_   4

Numbers used for the figure (as displayed, these values are identical to those shown for the first model)
  Obs         T     AVG1     AVG2     AVG3    PRED1    PRED2    PRED3
    1  -3.00000  0.13401  0.48801  1.17827  0.12774  1.25063  1.53446
    2  -2.00000  0.10507  1.68610  2.66516  0.12017  1.36588  2.29923
    3  -1.00000  0.13408  2.58710  3.57223  0.11305  1.49175  3.44515
    4   0.00000  0.08391  1.57138  5.24892  0.10636  1.62922  5.16219
    5   1.00000  0.11002  1.20619  7.52729  0.10006  1.77937  7.73500

Some comments are in order:
• The estimate of f2(t) looks much better now.
• The guess about which subpopulation subject 6 belongs to has changed (and appears more reasonable now).
• The BIC1, BIC2, and AIC have increased by approximately 66, 64, and 73 points respectively.
These are overwhelming changes, suggesting that the second group-based trajectory model provides a much better fit to the data than the first group-based trajectory model. (The indices shown here are penalized log-likelihoods; for example, the AIC equals the log-likelihood minus the number of parameters. Larger values therefore indicate better fit.)

Is that the best we can do?

Besides moving from linear functions to quadratic functions, other modifications are possible. One, for which I provide SAS code at {www.richardcharnigo.net/traj}, entails replacing the ordinary Poisson probability distribution by the zero-inflated Poisson probability distribution. The idea is that, especially in the first subpopulation, there may be too many 0’s to be compatible with the ordinary Poisson probability distribution. Accounting for this zero inflation may provide a better fit to the data.

Another possible modification is to change the quadratic functions to cubic or even quartic functions. (With only five time points, we cannot go beyond polynomials of degree four.) In fact, the polynomial degree need not be the same for each subpopulation. For instance, a linear function may suffice for the first and third subpopulations, while (at least) a quadratic function appears necessary for the second subpopulation.

We face the practical problem, though, of deciding which modifications to make. Rather than consider dozens (or hundreds) of possible competing models, a more feasible approach may be to start with the most complicated model that one is willing to entertain (for example, with quartic polynomials for each subpopulation) and then perform “backward elimination”.

To do this, remove whichever model feature has the largest p-value, while respecting the hierarchical principle that simpler features cannot be removed before more complicated features. Thus, for example, the linear term cannot be removed from a quadratic polynomial.
Once all remaining model features have p-values less than 0.05 (or are ineligible for removal), stop and create a table of model fit indices corresponding to the various steps of the backward elimination.

The step in the backward elimination at which the model fit indices are optimized can be used to select a final model. (Matters become a bit more complicated, though, if the model fit indices are not in agreement about this.) Also, if we are unsure whether three is the best number of groups, then the above process can be repeated with, say, two groups and four groups. Model fit indices can then be used to choose among the final two-group model, the final three-group model, and the final four-group model.

Other capabilities of PROC TRAJ

Worth mentioning here, though not illustrated in this presentation or in the SAS code at {www.richardcharnigo.net/traj}, are three additional capabilities of PROC TRAJ:
• The dependent variable need not have the (zero-inflated) Poisson probability distribution; the normal and Bernoulli probability distributions can be accommodated as well.
• Multiple independent variables can be accommodated.
• Multiple, related dependent variables can be accommodated. If there are two (for instance, marijuana use and alcohol use), then PROC TRAJ provides one latent variable defining subpopulations on the first dependent variable and a separate latent variable defining subpopulations on the second. Part of the output from PROC TRAJ then estimates the probabilities of membership in the subpopulations defined by the second latent variable given membership in a subpopulation defined by the first. If there are more than two, then PROC TRAJ provides a single latent variable defining subpopulations on all dependent variables simultaneously.

Trying out PROC TRAJ

With this background, let us open SAS and work our way through at least some of the SAS code at {www.richardcharnigo.net/traj}.
This is also an opportunity to experiment and make some changes to the SAS code. For instance, you can see what PROC TRAJ does when a quadratic function is replaced by a cubic function or when a quadratic function is retained for only one of the three subpopulations.
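When experimenting with alternative specifications, it can help to line up the fit indices from each run. The Python sketch below is an illustration rather than part of PROC TRAJ: the function name `fit_indices` is hypothetical, and the formulas are an assumption inferred from the fact that they reproduce the values reported for the two models in this presentation (AIC as the log-likelihood minus the number of parameters; BIC1 and BIC2 as the log-likelihood minus half the number of parameters times the logarithm of 500 subjects and of 2500 observations, respectively).

```python
import math


def fit_indices(loglik, n_params, n_subjects, n_obs):
    """AIC and BIC on the penalized log-likelihood scale (larger is better)."""
    return {
        "AIC": loglik - n_params,
        "BIC1": loglik - 0.5 * n_params * math.log(n_subjects),
        "BIC2": loglik - 0.5 * n_params * math.log(n_obs),
    }


# Log-likelihoods as reported for the two models. Parameter counts follow from
# the model definitions: 3 groups x 2 (linear) or 3 (quadratic) trajectory
# coefficients, plus c2, d2, c3, and d3.
linear = fit_indices(-2580.343083, 10, 500, 2500)
quadratic = fit_indices(-2504.788285, 13, 500, 2500)

for name in ("BIC1", "BIC2", "AIC"):
    change = quadratic[name] - linear[name]
    print(f"{name}: {linear[name]:.3f} -> {quadratic[name]:.3f} (change {change:+.1f})")
```

The same function can be applied to the log-likelihood and parameter count of any further model (for example, one with a cubic term for a single subpopulation) to extend the comparison.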