Missing Data and Multiple Imputation
By Jon Atwood, LISA Collaborator

In this course, we will…
• Examine missing data in a general sense: what it is, where it comes from, what types exist, etc.
• Explain the problems with certain common methods for dealing with missing data, such as complete case analysis and single imputation
• Study multiple imputation (MI), learning generally how it works
• Apply MI to real data sets using SAS and R

So what is missing data?
• Missing data is information that we want to know, but don't
• It can come in many forms: people not answering questions on surveys, inaccurate recordings of plant heights that must be discarded, runs in a driving experiment canceled due to rain
• We could also consider something we never even thought to measure to be missing data

The key question is: why is the data missing?
• What mechanism contributes to, or is associated with, the probability of a data point being absent?
• Can it be explained by our observed data or not?
• The answers drastically affect what we can ultimately do to compensate for the missingness

Perhaps the most common method of handling missing data is "Complete Case Analysis"
• Simply delete all cases that have any missing values, so you are left only with observations for which all variables are observed
• Computer software often does this by default when performing an analysis (regression, for example)
• This is the simplest way to handle missing data, and in some cases it will work fine
• However, the loss of cases means your variances will be larger than the nominal size of your data suggests
• It may also bias your sample

And now a closer look…
• As an example, we use a data set of body fat percentage in men and the circumference of various body parts (Penrose et al., 1985)
• Does the circumference of certain body parts predict body fat percentage?
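Complete case analysis as described above is usually a single call in software. A minimal sketch in Python with pandas (the data frame and values here are hypothetical, standing in for a data set like the body-fat one) shows how listwise deletion shrinks the sample:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'bodyfat' is the response, 'age' and 'neck' predictors.
df = pd.DataFrame({
    "bodyfat": [12.3, 20.1, np.nan, 25.0, 18.4, 30.2],
    "age":     [23.0, 41.0, 35.0, np.nan, 52.0, 47.0],
    "neck":    [36.2, 38.1, 37.4, 39.0, np.nan, 40.3],
})

# Complete case analysis: drop every row that has ANY missing value.
complete = df.dropna()

print(len(df), len(complete))  # 6 rows before, 3 after
```

Note that half the rows vanish even though each dropped row is missing only one of its three values; that is exactly the sample-size cost (and potential bias) described above.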
• Here are some significant coefficients from a regression model with body fat percentage as the response:

  Predictor   Estimate   S.E.     P-Value
  Age          0.0626    0.0313   0.0463
  Neck        -0.4728    0.2294   0.0403
  Forearm      0.45315   0.1979   0.0229
  Wrist       -1.6181    0.5323   0.0026

In this case, the data is complete, with sample size 252
• But suppose about 5 percent of the participants had missing values? 10 percent? 20 percent?
• What if we performed complete case analysis and removed those who had missing values?
• First let's examine the effect of doing this when the data is MCAR
• I randomly removed cases from the data set, reran the analysis, and stored the p-values. I did this 1,000 times, and plotted the 1,000 p-values in boxplots

[Boxplots of the 1,000 p-values for Age, Neck, Forearm, and Wrist: one panel with about 5 percent (n=13) of cases deleted, one with about 20 percent (n=50) deleted]

We seem to change our conclusions somewhat
• With age and neck, it seems we fail to reject more often than not
• With the other two, we still reject most of the time
• This assumes the missing subjects do not differ from the non-missing; if they did, that would cause bias

Types of Missingness
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR), also called Not Missing at Random (NMAR)

What Distinguishes Each Type?
• Suppose you're loitering outside an elementary school one day…
• You then find out that students just received their report cards for the first quarter
• For some reason, you start asking passing students their English grades. Of course, you don't force them to tell you. You also write down their gender and hair color

A data set from this activity might look like this…

  Hair Color   Gender   Grade
  Red          M        A
  Brown        F        A
  Black        F        B
  Black        M        A
  Brown        M        ?
  Brown        M        ?
  Brown        F        ?
  Black        M        B
  Black        M        B
  Brown        F        A
  Black        F        ?
  Brown        F        C
  Red          M        ?
  Red          F        A
  Brown        M        A
  Black        M        A

• 7 students received As, 3 received Bs, and 1 received a C
• No failing!!
• But 5 students did not reveal their grade

To determine the type of missingness, look at what influences the probability of a missing point
• Here is the same data set, but each value is replaced with "0" if the data point is observed and "1" if it is not
• We'll call this the "Missing Matrix." Obviously there are many other possible missing matrices
• The relevant question is: for any one of these data points, what is the probability that the point equals "1"?

  Hair Color   Gender   Grade
  0            0        0
  0            0        0
  0            0        0
  0            0        0
  0            0        1
  0            0        1
  0            0        1
  0            0        0
  0            0        0
  0            0        0
  0            0        1
  0            0        0
  0            0        1
  0            0        0
  0            0        0
  0            0        0

Upcoming Quiz!
• What type of missingness do the grades exhibit?

Missing Completely at Random (MCAR)
• If this probability does not depend on any of the data, observed or unobserved, then the data is Missing Completely at Random (MCAR)
• To be more precise, suppose that X is the observed data and Y is the unobserved data, and label our "Missing Matrix" R
• Then, if the data are MCAR, P(R | X, Y) = P(R)

Example…
• Suppose you are running an experiment on plants grown in pots, when suddenly you have a nervous breakdown and smash some of the pots
• In your insanity, you will probably not choose the plants to smash according to a well-defined pattern, such as height, age, etc.
• Hence, the missing values generated by your act of madness will likely fall into the MCAR category

Another way to think of MCAR
• Suppose we had to quickly go to the bathroom and do number 2
• In our desperation, we use the data as our toilet paper
• Presumably, some of our data would be smeared with… you know what
• The data smeared can be said to be a random subset of our data

In practice, MCAR is usually not realistic
• A completely random mechanism for generating missingness in your data set just isn't very realistic
• Usually, missing data is missing for a reason.
Maybe older people are less likely to answer web-delivered survey questions, in longitudinal studies people may die before completing the study, companies may be reluctant to reveal financial information, and so on

Missing at Random (MAR)
• If the probability of your data being missing depends on the observed data but not the unobserved data, your missing observations are said to be Missing at Random (MAR)
• Symbolically, P(R | X, Y) = P(R | X), so that the unobserved data does not contribute to the probability of observing our "Missing Matrix"
• "Random" is somewhat of a misnomer: MAR means there is a mechanism associated with whether the data is missing, and it has to do with our observed data

Example…
• Usually, missing data is missing for a reason. Maybe older people are less likely to answer web-delivered survey questions, in longitudinal studies people may die before completing the study, companies may be reluctant to reveal financial information, and so on

The key point to MAR is…
• We can still model the missing mechanism and compensate for it
• The multiple imputation methods we will be talking about today assume MAR
• For example, if age is known, you can model missingness as a function of age
• Whether missing data is MAR or the next type, Missing Not at Random (MNAR), is not testable. It requires you to understand your data

Missing Not at Random (MNAR)
• The missingness has something to do with the missing value itself
• It has been said that smokers are not as likely to answer the question, "Do you smoke?"
• MNAR missingness is said to be nonignorable
• Although there are some proposed ways to handle MNAR data, these are more complicated and are beyond the scope of this class

So, returning to our school example…
• Do you think this missing data is likely MCAR, MAR, or MNAR?

  Hair Color   Gender   Grade
  Red          M        A
  Brown        F        A
  Black        F        B
  Black        M        A
  Brown        M        ?
  Brown        M        ?
  Brown        F        ?
  Black        M        B
  Black        M        B
  Brown        F        A
  Black        F        ?
  Brown        F        C
  Red          M        ?
  Red          F        A
  Brown        M        A
  Black        M        A

Add overall GPA
• Now the data looks like this
• Does this change anything?

  Hair Color   GPA    Gender   Grade
  Red          3.4    M        A
  Brown        3.6    F        A
  Black        3.7    F        B
  Black        3.9    M        A
  Brown        2.5    M        ?
  Brown        3.2    M        ?
  Brown        3.0    F        ?
  Black        2.9    M        B
  Black        3.3    M        B
  Brown        4.0    F        A
  Black        3.65   F        ?
  Brown        3.4    F        C
  Red          2.2    M        ?
  Red          3.8    F        A
  Brown        3.8    M        A
  Black        3.67   M        A

So what do we do about missing data?

Single Imputation Methods: Impute Once
• Mean imputation: imputing the average of the observed cases for all missing values of a variable
• Hot deck imputation: imputing a value from another subject, or "donor," that is most like the subject in terms of observed variables
• There are others
• All fundamentally impose too much precision. We have uncertainty about what the unobserved values actually are

Multiple Imputation
• A single imputation approach does not account for an obvious source of uncertainty
• By imputing only once, we treat the imputed value as if we had observed it, when we did not
• We therefore have uncertainty about what the observed value would have been
• Multiple Imputation (MI) takes this into account by generating several random values for each missing data point

The General Process
1. A value is randomly drawn for each unobserved data point based on a model built from the observed data
2. Repeat step 1 some number of times, say m, resulting in m imputed data sets
3. Each imputed data set is analyzed separately
4. The separate analyses are pooled into a unifying analysis that takes all the imputed data sets into account

To illustrate… Here's some data

  X    Y
  32   2
  43   ?
  56   6
  25   ?
  84   5

Oh no, we have two missing values! Whatever shall we do?!

Let's Impute Some Data!
First, we'll use a predictive distribution of the missing values, given the observed values, to make random draws of the missing values and fill them in.
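One way to picture that draw in code is a simplified stochastic regression imputation in Python: predict the missing Y from a regression on X, then add random residual noise. This is a sketch only; a fully proper MI draw would also sample the regression parameters from their posterior, and the numbers produced here will not match the imputed values shown next.

```python
import numpy as np

rng = np.random.default_rng(0)

# The toy data from the slides; np.nan marks the missing Y values.
x = np.array([32.0, 43.0, 56.0, 25.0, 84.0])
y = np.array([2.0, np.nan, 6.0, np.nan, 5.0])

obs = ~np.isnan(y)

# Fit a simple line Y ~ X on the observed cases only.
b1, b0 = np.polyfit(x[obs], y[obs], 1)
resid = y[obs] - (b0 + b1 * x[obs])
sigma = resid.std(ddof=1)  # crude residual scale

def impute_once():
    """One imputed copy of y: regression prediction plus random noise.

    NOTE (simplification): proper multiple imputation would also draw
    (b0, b1, sigma) from their posterior, not hold them fixed.
    """
    y_imp = y.copy()
    miss = np.isnan(y)
    y_imp[miss] = b0 + b1 * x[miss] + rng.normal(0.0, sigma, miss.sum())
    return y_imp

imputed = [impute_once() for _ in range(2)]  # m = 2 imputed data sets
```

Because each call adds fresh noise, repeated calls give different filled-in values, which is exactly what produces the multiple data sets below.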
  X    Y
  32   2
  43   5.5
  56   6
  25   8
  84   5

Now we have one imputed data set! Let's set that aside… and do it again!!!!

Set that aside too… now we have 2 imputed data sets!!!

  Imputation 1       Imputation 2
  X    Y             X    Y
  32   2             32   2
  43   5.5           43   7.2
  56   6             56   6
  25   8             25   1.1
  84   5             84   5

• Do this m times for m imputed data sets

Inference with Multiple Imputation
• Now that we have our imputed data sets, how do we make use of them? (Suppose here that m = 2)
• We analyze each imputed data set separately, regressing Y on X:

  Imputation 1: Slope = -0.8245, Variance = 6.1845
  Imputation 2: Slope =  4.932,  Variance = 4.287

Finally, we pool the analyses together
• The pooled slope estimate is the average of the m imputed estimates
• In our example, β1p = (β11 + β12) / 2 = (4.932 - 0.8245) * 0.5 = 2.0538
• The pooled slope variance is given by

  s² = (1/m) Σ Zi + (1 + 1/m) · [1/(m-1)] · Σ (β1i - β1p)²

  where Zi is the within-imputation variance of the slope estimate from the i-th imputed data set
• The pooled variance in this case is (4.287 + 6.1845)/2 + (3/2)·(16.569) = 30.08925
• To find the pooled standard error, take the square root: √30.08925 ≈ 5.485

Predicting the missing data given the observed data
• Bayes' Theorem:

  P(B | A) = P(A | B) · P(B) / P(A)

• Imagine, then, that we establish some distribution for the parameters of interest before considering the data, P(θ), where θ is the set of parameters we are trying to estimate. This is called the prior distribution of θ
• Then we establish a likelihood, P(Xobs | θ)
• We can finally use Bayes' Theorem to obtain P(θ | Xobs), make random draws of θ, and use those draws to make predictions of Ymiss

How many imputations do we need?
• It depends on the size of the data set and the amount of missingness
• Some previous research indicated that about 5 imputations are sufficient for efficiency of the estimates, based on the relative efficiency

  (1 + λ/m)^(-1)

  where m is the number of imputations and λ is the fraction of missing information for the term being estimated (Schafer, 1999)
• However, more recent research argues that a good imputation number is actually higher (maybe 40 or more) in order to achieve adequate power (Graham et al., 2007)

General Methods for Multiple Imputation
• Regression based
• Chained Equations (MICE), also called Fully Conditional Specification (FCS)
• Markov Chain Monte Carlo (MCMC)

Regression Approach in SAS
• Uses predictive mean matching: the actual imputed value is chosen randomly from a set of observed values whose predicted values are close to the predicted value of the missing observation
• This is meant to keep imputed values plausible
• Based on the imputation model we build, posterior random draws are made for the regression parameters
• These draws are used to construct the predicted value for the missing observation

What parameters?
• Suppose our imputation model is y = β0 + β1x1 + ⋯ + βkxk
• A random draw is made from the posterior distribution of the parameters, giving randomly drawn parameters β* = (β*0, β*1, …, β*k)
• The missing value yi is predicted as β*0 + β*1x1 + ⋯ + β*kxk
• Predictive mean matching is performed based on this prediction

SAS example
• We will look at part of a data set of CEO bonuses, with other predictor variables (sales, advanced degrees, age, etc.)
• Since we plan to do regression on bonuses, and bonuses may have larger variability as they get higher, we will take the log of bonuses before we do the imputation

Here's the code for the entire process:

Imputation code:

proc mi data=bob1 out=mob seed=123 nimpute=10;
   monotone regpmm(logb=stock sales years mba mastphd age);
   var stock sales years mba mastphd age logb;
run;

Regression code:

proc reg data=mob outest=mp covout noprint;
   model logb=stock sales years mba mastphd age;
   by _imputation_;
run;

Pooled analysis code:

proc mianalyze data=mp;
   modeleffects stock sales years mba mastphd age;
run;

Here is the output:

Variance Information

  Parameter   Between      Within     Total
  stock       1.2212E-05   3.5E-05    4.9E-05
  sales       1.15E-12     1.02E-11   1.15E-11
  years       6.388E-06    2.5E-05    3.2E-05
  mba         0.001423     0.00794    0.0095
  mastphd     0.001299     0.00633    0.00776
  age         1.0735E-05   2.5E-05    3.7E-05

Parameter Estimates

  Parameter   Estimate   Std Error   95% Confidence Limits   Theta0   t for H0: Parameter=Theta0   Pr > |t|
  stock       -0.00556   0.00697     -0.0194    0.00824      0        -0.80                        0.4267
  sales       2.42E-05   3.4E-06      0.00002   3.1E-05      0         7.16                        <.0001
  years       0.017694   0.00562      0.0066    0.02879      0         3.15                        0.0019
  mba         0.014343   0.09749     -0.1774    0.20612      0         0.15                        0.8831
  mastphd     -0.00182   0.08809     -0.1753    0.17162      0        -0.02                        0.9835
  age         0.014896   0.00608      0.00281   0.02698      0         2.45                        0.0163

Classification Variables
• Suppose we want to impute a variable that takes one of two values: "male" or "female", "smoker" or "non-smoker", "dead" or "alive"
• Or what if there are even more categories, such as dislike, like, and love?
• What if they are nominal, like chocolate, vanilla, and strawberry?
• We can hardly use continuous methods in these cases

We can use the "Logistic Regression Method"
• Remember that if p is the probability that y = 1, the logistic regression model can be expressed as

  log(p / (1 - p)) = β0 + β1x1 + ⋯ + βkxk

• We can make random draws β*, the estimators of β, from their posterior distribution
• We use those to calculate the estimated probability p = e^(xβ*) / (1 + e^(xβ*)), and use this to predict y for the missing case
• This method also works for ordinal data
• It can be performed sequentially in SAS on multiple variables, one at a time, if the data is monotone missing, which means that if an observation is missing for one variable, it is missing for all the remaining variables for that subject
• The Discriminant Function Method can be used for nominal variables

SAS Example
• I took the CEO data set and removed 57 values (no particular reason I chose 57)
• The following code runs the imputation:

proc mi data=bob nimpute=5 seed=231 out=lid;
   class mastphd;
   var age stock sales years mba mastphd;
   monotone logistic (mastphd=years sales stock age mba);
run;

And we get this…

WARNING: The maximum likelihood estimates for the logistic regression with observed observations may not exist for variable MastPHD. The posterior predictive distribution of the parameters used in the imputation process is based on the maximum likelihood estimates in the last maximum likelihood iteration.
The answer lies in the follies of logistic regression, as well as the redundancy of our model

Table of MastPHD by MBA (cell counts)

  MastPHD \ MBA    0      1      Total
  0                372    178    550
  1                0      193    193
  Total            372    371    743

• We have "perfect classification": no one without an MBA has a masters/PhD, so the MastPHD=1, MBA=0 cell is empty
• With perfect classification like this, the algorithm that fits the logistic regression will not converge
• This is something you need to be careful about in general

Now that we've removed MBA, here's the code:

Imputation code:

proc mi data=bob nimpute=5 seed=231 out=lid;
   class mastphd;
   var age stock sales years mastphd;
   monotone logistic (mastphd=years sales stock age);
run;

Logistic regression code:

proc logistic data=lid outest=rain covout noprint descending;
   class mastphd;
   model mastphd=age stock sales years;
   by _imputation_;
run;

Pooled analysis code:

proc mianalyze data=rain;
   modeleffects age stock sales years;
run;

So here are the results:

Variance Information

  Parameter   Between      Within     Total      DF       Relative Increase   Fraction Missing   Relative
                                                          in Variance         Information        Efficiency
  age         7.2861E-05   0.00015    0.00023    28.185   0.604415            0.416692           0.923073
  stock       1.3164E-05   0.00042    0.00044    3084.8   0.037355            0.036634           0.992727
  sales       8.39E-13     5.41E-11   5.51E-11   11990    0.018605            0.018429           0.996328
  years       1.8885E-05   0.00014    0.00016    204.21   0.162733            0.148259           0.971202

Parameter Estimates

  Parameter   Estimate    Std Error   95% Confidence Limits   DF       Minimum       Maximum       Theta0   t       Pr > |t|
  age         -0.030592   0.01524     -0.0618    0.00061      28.185   -0.040456     -0.02281      0        -2.01   0.0543
  stock       -0.087136   0.02095     -0.1282   -0.0461       3084.8   -0.090364     -0.082114     0        -4.16   <.0001
  sales       3.554E-06   7.4E-06     -1E-05     0.00002      11990    0.000002689   0.000004581   0         0.48   0.6321
  years       0.005274    0.01273     -0.0198    0.03036      204.21   0.001705      0.010075      0         0.41   0.679

What about when more than one variable has missing values?

Multiple Imputation by Chained Equations (MICE)
1.
Provides initial imputations for all missing values
2. For one particular variable, removes the imputed values again
3. Builds a model based on the other variables, and uses the posterior predictive distribution to impute random values
4. Does the same thing for another variable; only the imputed values for the first variable remain
5. Continues through all variables, and repeats the process many times
6. This makes one imputed data set; the whole procedure is done m times for m data sets
• Works well in simulations, and handles many types of variables at once
• Can take a lot of time, and the theoretical justification is not particularly strong

R Example
• This data set, "nhanes", has age group, body mass index, hypertensive status, and serum cholesterol
• Body mass index and serum cholesterol are continuous, while hypertensive status (yes or no) is binary and age group is ordinal
• We will use the package "mice" and the function "mice" to complete the imputation and analysis

Code
• You need to install the 'mice' package

# treat hypertensive status as a categorical variable
nhanes$hyp <- as.factor(nhanes$hyp)
# one imputation method per column: age, bmi, hyp, chl
bord <- mice(nhanes, m = 40, seed = 132,
             method = c("polr", "pmm", "logreg", "norm"))
complete(bord, 12)                            # inspect the 12th imputed data set
bit <- with(bord, lm(chl ~ age + bmi + hyp))  # analyze each imputed data set
summary(pool(bit))                            # pool the analyses

Output

               est          se          t           df         Pr(>|t|)
  (Intercept)  -39.104424   88.462185   -0.4420468   9.341691  0.66851235
  age           40.287101   18.378020    2.1921350   6.268912  0.06894168
  bmi            6.091045    2.610044    2.3336941  11.449700  0.03876241
  hyp            5.410891   29.405394    0.1840102   8.038752  0.85856252

• Body mass index is a significant predictor of cholesterol, and age nearly is, but hypertensive status is not

Markov Chain Monte Carlo Approach
• Here, the process gives us data estimates via Markov chains
• A Markov chain has the property that the probability of the next link in the chain depends only on the current link
• Basically, we perform a sequence of steps, and the probability of each step depends only on the previous step
• Eventually, under certain conditions, theory holds that the steps will converge to the distribution we are trying to estimate, called the stationary distribution

But, there's a catch…
• This
approach assumes multivariate normality

Summary
• Though handling missing data is ultimately just a necessary nuisance and not the point of the analysis, it pays to give it the consideration it is due
• Whether you use multiple imputation, single imputation, or complete case analysis depends on how much missing data you have and how big your sample is
• Having the actual data is still always better

Thank you!