Introduction to Multiple Imputation
Greg Stoddard
September 29, 2011
University of Utah School of Medicine

Outline
  The missing value problem
  Types of missing
  Simple approaches
  Multiple imputation - state of the art
  When to use each approach

Consider the following dataset (N = 6).

  ID    Y     X1    X2
   1   11     1     2
   2   10    miss   5
   3   miss   3     2
   4    9    miss  miss
   5   12     5     7
   6    7     6     3

When a linear regression, Y = a + b1(X1) + b2(X2), is fitted, the sample size drops to N = 3, due to listwise deletion of missing values. To get our sample size back to 6, we plug up the missing holes with "imputed" values.

  ID    Y     X1    X2
   1   11     1     2
   2   10     3     5
   3   11     3     2
   4    9     6     3
   5   12     5     7
   6    7     6     3

To offer a precise definition: missing data imputation is substituting values for missing values. It is not fabricating data, because we do it statistically.
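To make the listwise-deletion problem concrete, here is a minimal sketch using the toy dataset above. The use of pandas and statsmodels, and the median fill used to restore N = 6, are illustrative assumptions of the sketch; the imputed values shown in the table above were not produced this way.

# Minimal sketch: listwise deletion vs. a simple single imputation,
# using the toy 6-row dataset shown above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "Y":  [11, 10, np.nan, 9, 12, 7],
    "X1": [1, np.nan, 3, np.nan, 5, 6],
    "X2": [2, 5, 2, np.nan, 7, 3],
})

# Listwise deletion: any row with a missing value is dropped, so N falls from 6 to 3.
fit_cc = smf.ols("Y ~ X1 + X2", data=df).fit()    # incomplete rows are dropped by default
print("Complete-case N:", int(fit_cc.nobs))        # 3

# Plugging the holes with single imputed values restores N = 6.
df_imp = df.fillna(df.median())                    # median fill, for illustration only
fit_imp = smf.ols("Y ~ X1 + X2", data=df_imp).fit()
print("Imputed-data N:", int(fit_imp.nobs))        # 6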
Every study is going to have some missing values, yet you rarely see any mention of them in a published article. Should you mention it? The guidance article

-----Vandenbroucke JP, von Elm E, Altman DG, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. Ann Intern Med 2007;147(8):W-163 to W-194.

advises, on page W-176: "12(c) Explain how missing data were addressed."

This guidance article then gives the following example, taken from a paper by Chandola et al. (2006), of how you might say this: "Our missing data analysis procedures used missing at random (MAR) assumptions. We used the MICE (multivariate imputation by chained equations) method of multiple multivariate imputation in STATA. We independently analyzed 10 copies of the data, each with missing values suitably imputed, in the multivariate logistic regression analyses. We averaged estimates of the variables to give a single mean estimate and adjusted standard errors according to Rubin's rules."

-----Chandola T, Brunner E, Marmot M. Chronic stress at work and the metabolic syndrome: prospective study. BMJ 2006;332:521-5. PMID: 16428252

Although missing value imputation is too complex a subject to include in introductory statistics textbooks, entire chapters are devoted to it in specialized applied texts. Some examples are:

-----Harrell Jr FE. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, pp.41-52.
-----Twisk JWR. Applied Longitudinal Data Analysis for Epidemiology: A Practical Guide. Cambridge, Cambridge University Press, 2003, pp.202-224.
-----Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions, 3rd ed. Hoboken NJ, John Wiley & Sons, 2003, pp.491-560.

A popular classification scheme for missing data is (Harrell, 2001, pp.41-52):
1. Missing completely at random (MCAR)
2. Missing at random (MAR)
3. Informative missing (IM)

Missing completely at random (MCAR)
Data are missing for reasons unrelated to any characteristics or responses of the subject, including the value of the missing observation itself, were it to be known. An example is the accidental dropping of a test tube, resulting in a missing laboratory measurement. (Here, the best guess of the missing value is simply the sample median.)
-----Harrell Jr FE, 2001, pp.41-52.

Missing at random (MAR)
Data elements are not missing completely at random; rather, the probability that a value is missing depends on values of variables that were actually measured. For example, suppose males are less likely to respond to the income question in general, but the likelihood of responding is independent of their actual income. In this case, unbiased sex-specific income estimates can be made if we have data on the sex variable (by replacing the missing value with the sex-specific median income, for example).
-----Harrell Jr FE, 2001, pp.41-52.

Informative missing (IM)
Data elements are more likely to be missing if the true values of the variable in question are systematically higher or lower. For example, this occurs if lower income subjects, or higher income subjects, or both, are less likely to answer the income question in a survey. This is the most difficult type of missing data to handle, and in many cases there is no good value to substitute for the missing value. Furthermore, if you analyze your data by simply dropping these subjects, your results will be biased, so that does not work either.
-----Harrell Jr FE, 2001, pp.41-52.

Missing Comorbidities in the Patient Medical Record
A special case of missing data is a comorbidity not listed in the patient's medical record. For example, if no mention of diabetes was ever made and a diagnostic code for diabetes was never entered for any clinic visit, the fact that it is missing suggests that the patient does not have diabetes. Defining a coding rule to replace this missing value with 0, or absent, will most likely produce the least amount of misclassification, or mis-imputation, error. Steyerberg (2009) mentions this coding rule approach: "An alternative in such a situation might be to change the definition of the predictor, i.e., by assuming that if no value is available from a patient chart, the characteristic is absent rather than missing."
-----Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, Springer, 2009, pp.130-131.

Replacing Missing Values with the Mean, Median, or Mode
Before the more sophisticated imputation schemes were developed, it was common practice to replace the missing value with a likely value, such as the mean, median, or mode. One criticism of this approach is that it artificially shrinks the variance, since so many observations will then have the average value. Royston (2004) makes this criticism: "Old-fashioned imputation typically replaced missing values with the mean or mode of the nonmissing values for that variable. That approach is now regarded as inadequate. For subsequent statistical inference to be valid, it is essential to inject the correct degree of randomness into the imputations and to incorporate that uncertainty when computing standard errors and confidence intervals for parameters of interest."
-----Royston P. Multiple imputation of missing values. The Stata Journal 2004;4(3):227-241.

One possible fix is to impute the missing value with a likely value, such as the median, and then add a random residual back to the imputed value to maintain the correct standard error (Harrell, 2001); a sketch of this idea follows below.
-----Harrell Jr FE. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, pp.45-46.

Such a direct approach is not usually done, however, since the more widely accepted approaches accomplish the same thing. Furthermore, if there are a lot of missing data, imputing with a likely value might adversely affect the regression coefficients. The imputation methods of multiple imputation and maximum likelihood not only provide the best standard errors, but also the best regression coefficients.
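A minimal sketch of the median-plus-random-residual idea described above. Drawing the residual from the observed deviations around the median is one of several ways to inject randomness; the seed and the variable name are illustrative assumptions of the sketch.

# Sketch: median imputation with a random residual added back, so the variance
# is not artificially shrunk (the residual draw is one illustrative choice).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2011)             # fixed seed so the imputation is reproducible

x = pd.Series([1, np.nan, 3, np.nan, 5, 6], name="X1")
median = x.median()
residuals = (x.dropna() - median).to_numpy()  # observed deviations from the median

imputed = x.copy()
missing = imputed.isna()
# The median alone would shrink the variance; adding a randomly drawn residual preserves spread.
imputed[missing] = median + rng.choice(residuals, size=missing.sum(), replace=True)
print(imputed)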
What About Imputing the Outcome Variable?

  ID    Y     X1    X2
   1   11     1     2
   2   10    miss   5
   3   miss   3     2
   4    9    miss  miss
   5   12     5     7
   6    7     6     3

At first, it seems that you must at least have the outcome variable measured, or else the subject should be excluded. It is common to discard subjects with a missing outcome variable, but imputing missing values of the outcome variable frequently leads to more efficient estimates of the regression coefficients when the imputation is based on the nonmissing predictor variables (Harrell, 2001).
-----Harrell Jr FE. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, p.43.

Missing Value Indicator Approach
A historically popular approach in epidemiologic research was to use a missing value indicator, which has a value of 1 if the variable is missing and 0 otherwise. For example, given the following gender variable,
  1. male     n = 50
  2. female   n = 40
  .  missing  n = 10
we would recode this to two indicators, male and malemissing:

  Original gender variable    Male indicator    Malemissing indicator
  1 = male     (n = 50)             1                     0
  2 = female   (n = 40)             0                     0
  .  = missing (n = 10)             0                     1

and then include both indicator variables in the regression model. With this approach, the missing value indicator is not interpreted, or reported in an article, but simply acts as a placeholder so the subjects with missing values are not dropped from the analysis.

Greenland and Finkle (1995) suggest not using the missing value indicator approach: "...The method based on missing-data indicators can exhibit severe bias even when the data are missing completely at random... In general, the authors recommend that epidemiologists avoid using the missing-indicator method and use more sophisticated methods whenever a large proportion of data are missing."
-----Greenland S, Finkle WD. (1995). A critical look at methods for handling missing covariates in epidemiologic regression analysis. Am J Epidemiol 142(12):1255-64.

Steyerberg (2009, pp.130-131) likewise advises against the missing value indicator approach: "...such a procedure ignores correlation of the values of predictors among each other. Simulations have shown that the procedure may lead to severe bias in estimated regression coefficients. The missing indicator should hence generally not be used."
-----Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, Springer, 2009, pp.130-131.

Hotdeck Imputation
In this method, the missing values are replaced by randomly selected values from the nonmissing values of the same variable.

  ID    Y     X1    X2
   1   11     1     2
   2   10     3     5
   3   11     3     2
   4    9     6     3
   5   12     5     7
   6    7     6     3

Hotdeck imputation has the advantage of being simple to use; it preserves the distributional characteristics of the variable, and it performs nearly as well as the more sophisticated imputation approaches (Roth, 1994).
-----Roth P. Missing data: a conceptual review for applied psychologists. Personnel Psychology 1994;47:537-560.

In hotdeck imputation, you set a random number seed before the imputation, so you can replicate the imputation and the subsequent analysis. You then create imputed variables under different variable names, such as male → male_imp, and analyze the imputed variables with ordinary statistical methods, such as linear regression. The original variables, with missing values, are preserved for a final analysis using multiple imputation, if you choose to do one. (A sketch of hotdeck imputation follows below.)
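A minimal sketch of hotdeck imputation on the toy dataset, assuming pandas and numpy; the seed, the _imp naming convention, and the loop over columns are illustrative choices, not part of the original lecture.

# Sketch of hotdeck imputation: each missing value is replaced by a randomly
# selected nonmissing value of the same variable.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)               # set the seed first so the imputation is replicable

df = pd.DataFrame({
    "Y":  [11, 10, np.nan, 9, 12, 7],
    "X1": [1, np.nan, 3, np.nan, 5, 6],
    "X2": [2, 5, 2, np.nan, 7, 3],
})

for col in ["Y", "X1", "X2"]:
    donors = df[col].dropna().to_numpy()      # pool of observed (nonmissing) values
    miss = df[col].isna()
    # Keep the original column and store the hotdeck result under a new name, e.g. X1 -> X1_imp.
    df[col + "_imp"] = df[col]
    df.loc[miss, col + "_imp"] = rng.choice(donors, size=miss.sum(), replace=True)

print(df)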
What you are going to discover, however, is that you do not get the answer you want in your regression model. So you become curious and choose a different random number seed, and this time you get the desired answer using the second round of imputed variables. If you pre-specified your random number seed, this will be very unsatisfying, because you are actually stuck with your first model. But which model is the "right" one? It seems the only correct thing to do, then, is to run the hotdeck several times and average the results to arrive at a stable "right" answer. This is the basic idea of multiple imputation, which we will get to later.

  ID    Y     X1    X2
   1   11     1     2
   2   10     3     5
   3   11     3     2
   4    9     6     3
   5   12     5     7
   6    7     6     3

Recall that with hotdeck imputation we simply use a random value from the nonmissing values. This has the advantage that an actual possible value is used, but it does not take into account that a "more likely" value could be obtained by using information from the other variables with which the imputed variable is correlated. For example, body weight will be different between teenagers and adults.

Regression Approach
In this method, you impute the missing value with the predicted value from an appropriate regression model. For example, you could use a linear regression, with age and gender as the predictors, to impute the missing body weight values. This is similar to imputing with means, but more like using subgroup-specific means. A disadvantage is that it does not preserve the variability of the data. (A sketch of this approach follows below.)
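A minimal sketch of the regression approach just described: predict missing body weight from age and gender, then substitute the predicted values. The small age/gender/weight dataset here is made up purely for illustration and is not from the lecture.

# Sketch of regression imputation: missing weights are replaced by predictions
# from a linear regression of weight on age and gender (illustrative data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "age":    [15, 16, 34, 40, 52, 17, 45],
    "male":   [1, 0, 1, 0, 1, 1, 0],
    "weight": [58, np.nan, 82, 65, np.nan, 63, 70],   # kg; two values missing
})

# Fit the imputation model on subjects with observed weight...
fit = smf.ols("weight ~ age + male", data=df.dropna(subset=["weight"])).fit()

# ...then replace each missing weight with its predicted value.
df["weight_imp"] = df["weight"]                        # start from the observed values
miss = df["weight_imp"].isna()
df.loc[miss, "weight_imp"] = fit.predict(df.loc[miss, ["age", "male"]])
print(df)
# Note: the imputed values sit exactly on the regression line, which is why this
# approach understates the variability of the data.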
Multiple Imputation
"The idea of multiple imputation is that instead of filling in missing values to create a single imputed dataset, several (or more) imputed data sets are created, each of which contains different imputed values. The analysis of a statistical model is then done on each of the imputed data sets. The multiple analyses are then combined to yield a single set of results. The major advantage of multiple imputation over single imputation is that it produces standard errors that reflect the degree of uncertainty due to the imputation of missing values. In general, multiple imputation techniques require that missing observations are missing at random (MAR)."
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm

There are two major approaches to creating multiply imputed datasets:
1) multivariate normal
2) imputation by chained equations
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm

1) Multivariate normal. This approach is "based on the joint distribution of all the variables in the imputation model, including variables to be imputed and variables to be used only for the purpose of imputing other variables. In this approach, the joint distribution of all variables in the imputation model is assumed to be multivariate normal."
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm

2) Imputation by chained equations. This method is "based on each conditional density of a variable given other variables."
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm

Imputation by chained equations
The user selects how many imputed datasets to combine. Usually 3 is enough, with little if any advantage to using more than 10 sets. To create one imputed dataset:
1) Drop any subject that is missing every variable that will be considered in the regression model, as these subjects are impossible to impute.
2) Fill in all missing values of each variable with randomly selected nonmissing values from the same variable (the hotdeck approach).
3) For each variable in turn, ignore the filled-in values and instead impute the missing values by predicting them from the remaining variables, where the remaining variables were themselves filled in, if needed, in step 2 (the regression approach).
-----http://www.ats.ucla.edu/stat/stata/library/ice.htm

Next, each imputed dataset is analyzed independently with the desired regression model. Then, "Estimates of parameters of interest are averaged across the copies to give a single estimate. Standard errors are computed according to the 'Rubin rules', devised to allow for the between- and within-imputation components of variation in the parameter estimates." (A sketch of the full procedure appears below.)
-----Royston P. Multiple imputation of missing values. The Stata Journal 2004;4(3):227-241.

Do we really need to go to all this trouble? It depends on how much missing data you have. Harrell (2001) provides some crude guidelines for three situations:
  Proportion of missings ≤ 0.05
  Proportion of missings 0.05 to 0.15
  Proportion of missings > 0.15
-----Harrell Jr FE. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, Springer-Verlag, 2001, p.49.

"Proportion of missings ≤ 0.05: It doesn't matter very much how you impute missings or whether you adjust variance of regression coefficient estimates for having imputed data in this case. For continuous variables imputing missings with the median nonmissing value is adequate; for categorical predictors the most frequent category can be used. Complete case analysis is an option here."

"Proportion of missings 0.05 to 0.15: If a predictor is unrelated to all of the other predictors, imputations can be done the same as the above (i.e., impute a reasonable constant value). If the predictor is correlated with other predictors, develop a customized model (or have the transcan function [available for S-Plus from Harrell's website] do it for you) to predict the predictor from all of the other predictors. Then impute missings with predicted values. For categorical variables, classification trees are good methods for developing customized imputation models. For continuous variables, ordinary regression can be used if the variable in question does not require a nonmonotonic transformation to be predicted from the other variables. For either the related or unrelated predictor case, variances may need to be adjusted for imputation. Single imputation is probably OK here, but multiple imputation doesn't hurt."

"Proportion of missings > 0.15: This situation requires the same considerations as in the previous case, and adjusting variances for imputation is even more important. To estimate the strength of the effect of a predictor that is frequently missing, it may be necessary to refit the model on the subset of observations for which that predictor is not missing, if Y is not used for imputation. Multiple imputation is preferred for most models."
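Here is a hand-rolled sketch of the whole procedure on the toy dataset: imputation by chained equations following the three steps above, the desired regression fitted to each imputed copy, and the results pooled with Rubin's rules (pooled variance = within-imputation variance + (1 + 1/m) times between-imputation variance). In practice you would use a packaged implementation such as the Stata commands mentioned earlier; the number of imputations (m = 5), the 10 cycles through the variables, and the normal noise added to the predictions are assumptions of this sketch, the last of them reflecting Royston's point about injecting randomness.

# Sketch of multiple imputation by chained equations plus Rubin's rules pooling.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2011)

data = pd.DataFrame({
    "Y":  [11, 10, np.nan, 9, 12, 7],
    "X1": [1, np.nan, 3, np.nan, 5, 6],
    "X2": [2, 5, 2, np.nan, 7, 3],
})
cols = ["Y", "X1", "X2"]
miss = data.isna()

def one_imputation(df):
    """Create one imputed copy: hotdeck start, then cycled regression imputations."""
    imp = df.copy()
    # Step 2: fill every hole with a randomly chosen nonmissing value of the same variable.
    for c in cols:
        donors = df[c].dropna().to_numpy()
        imp.loc[miss[c], c] = rng.choice(donors, size=miss[c].sum(), replace=True)
    # Step 3: take each variable in turn and re-impute it from the remaining variables;
    # the cycle is repeated a few times, as chained-equations implementations typically do.
    for _ in range(10):
        for c in cols:
            others = [o for o in cols if o != c]
            fit = smf.ols(f"{c} ~ {' + '.join(others)}", data=imp).fit()
            pred = fit.predict(imp.loc[miss[c], others])
            noise = rng.normal(0, np.sqrt(fit.scale), size=miss[c].sum())  # inject randomness
            imp.loc[miss[c], c] = pred + noise
    return imp

# Analyze each imputed copy with the desired model, then pool with Rubin's rules.
m = 5
coefs, variances = [], []
for _ in range(m):
    fit = smf.ols("Y ~ X1 + X2", data=one_imputation(data)).fit()
    coefs.append(fit.params)
    variances.append(fit.bse ** 2)

coefs, variances = pd.DataFrame(coefs), pd.DataFrame(variances)
pooled_coef = coefs.mean()                              # average of the m estimates
within = variances.mean()                               # within-imputation variance
between = coefs.var(ddof=1)                             # between-imputation variance
pooled_se = np.sqrt(within + (1 + 1 / m) * between)     # Rubin's rules
print(pd.DataFrame({"coef": pooled_coef, "se": pooled_se}))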
That ends the lecture. As a discussion topic, however, let's return to the informative missing situation.

Informative missing (IM)
Data elements are more likely to be missing if the true values of the variable in question are systematically higher or lower. For example, this occurs if lower income subjects, or higher income subjects, or both, are less likely to answer the income question in a survey. This is the most difficult type of missing data to handle, and in many cases there is no good value to substitute for the missing value. Furthermore, if you analyze your data by simply dropping these subjects, your results will be biased, so that does not work either.
-----Harrell Jr FE, 2001, pp.41-52.

Discussion Question
Suppose you want to predict opioid abuse in patients receiving prescription opioid pain medications. Your primary predictor variable is an opioid abuse potential scale that is scored as 1) at risk for abuse, 0) not at risk. Your primary outcome variable is a question on your survey: "Do you currently, or have you in the past year, used opioid prescription drugs simply to get high? (Yes or No)" You told your subjects that they had the option to skip any question they wanted to. About 20% of your subjects chose not to answer that question (many of them, perhaps, because they would be admitting to doing something illegal). How are you going to analyze these data?