Missing Values

Raymond Kim, Pink Preechavanichwong, Andrew Wendel
October 27, 2015

Intro
I. Missing Values and Bias
II. Simulations and Imputation
III. Deletion Methodology
IV. Not Missing at Random

Initial Steps
• Why is our data missing?
• What is the characteristic of our missing data?
• How will that affect the bias? Mean? Std?

$\mathrm{Bias}_\theta = E_\theta[\hat\theta] - \theta = E_\theta[\hat\theta - \theta]$, where $\theta$ can be $\beta_0$, $\beta_1$, $\mu$, $\sigma$, etc.

https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf

OLS Unbiased Estimator
• $\mathrm{Bias}_\theta$ applies to regression, distribution, etc.
• Model: $y = X\beta + \epsilon$
• True DGP ($\beta$) → Sample data → $\hat\beta$
• Unbiased estimator: $E[\hat\beta - \beta] = E[(X'X)^{-1}X'\epsilon]$, which is zero when the errors are mean-independent of $X$.

Initial Steps
1. Identify the reason for missing data
   – Marriage, graduation, death, etc.
2. Understand the distribution of missing data
   – Certain groups are more likely to have missing values
3. Decide on the best method of analysis
   – Deletion methods: listwise and pairwise deletion
   – Single imputation methods: mean substitution, dummy variable, single regression
   – Model-based methods: maximum likelihood and multiple imputation
4. Power and bias
   – Too many missing values reduce power
   – Missing values can introduce bias in your estimator

https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf

Missing Values and Bias

$\mathrm{Bias}_\theta = E_\theta[\hat\theta] - \theta = E_\theta[\hat\theta - \theta]$

Are missing values moving us away from or closer to the true DGP?

Conditional Distribution
• MCAR (missing completely at random): $P(Y \text{ missing} \mid X, Y) = P(Y \text{ missing})$. The probability that Y is missing does not depend on X or Y.
• MAR (missing at random): $P(Y \text{ missing} \mid X, Y) = P(Y \text{ missing} \mid X)$. The probability that Y is missing depends on X but not on Y.
• NMAR (not missing at random): $P(Y \text{ missing} \mid X, Y)$ does not simplify. The probability that Y is missing depends on Y and possibly on X.

Source: Statistical Models, A.C. Davison, Cambridge University Press

Example: $\text{Sea Level} = \beta_0 + \beta_1 \text{Year} + \epsilon$
[Figure: four panels showing the sea level data as Normal Data, MCAR, NMAR, and MAR]

Source: Statistical Models, A.C. Davison, Cambridge University Press

Bias Matrix: Does Bias Exist?
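One way to probe whether bias exists under each mechanism is a small simulation. The sketch below is written in Python with only the standard library, not in R; the regression coefficients, the missingness probabilities, and the function name `simulate_bias` are all illustrative assumptions and do not come from the slides. It uses one large sample as a stand-in for the expectation, so the reported values only approximate the true bias.

```python
import math
import random
import statistics


def simulate_bias(n=20000, seed=0):
    """Approximate the bias of the mean and SD of Y under MCAR, MAR, and NMAR.

    Assumed DGP (hypothetical): Y = 2 + 1.5 * X + e, with X, e ~ N(0, 1).
    """
    rng = random.Random(seed)
    b0, b1 = 2.0, 1.5
    true_mean = b0                       # E[Y], because E[X] = 0
    true_sd = math.sqrt(b1 ** 2 + 1.0)   # SD of Y under the assumed DGP

    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [b0 + b1 * x + rng.gauss(0, 1) for x in xs]

    # True means "Y is missing". The probabilities are arbitrary choices.
    masks = {
        "MCAR": [rng.random() < 0.3 for _ in range(n)],
        "MAR": [rng.random() < (0.6 if x > 0 else 0.1) for x in xs],
        "NMAR": [rng.random() < (0.6 if y > true_mean else 0.1) for y in ys],
    }

    results = {}
    for name, mask in masks.items():
        # Deletion: keep only the observed values of Y.
        observed = [y for y, miss in zip(ys, mask) if not miss]
        obs_mean = statistics.fmean(observed)
        # Mean imputation: replace each missing Y with the observed mean.
        imputed = [obs_mean if miss else y for y, miss in zip(ys, mask)]
        results[name] = {
            "deletion_mean_bias": obs_mean - true_mean,
            "deletion_sd_bias": statistics.stdev(observed) - true_sd,
            "imputed_mean_bias": statistics.fmean(imputed) - true_mean,
            "imputed_sd_bias": statistics.stdev(imputed) - true_sd,
        }
    return results


if __name__ == "__main__":
    for mechanism, row in simulate_bias().items():
        print(mechanism, {k: round(v, 3) for k, v in row.items()})
```

Under these assumptions the MAR and NMAR rows show the unconditional bias in the mean, because observations with large X or large Y are dropped more often. Mean imputation leaves the mean where deletion put it and shrinks the standard deviation further.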
$\mathrm{Bias}(\hat\theta) = E[\hat\theta] - \theta$

Mechanism   Deletion, $E[\hat\mu]-\mu$                Deletion, $E[\hat\sigma]-\sigma$          Mean imputation, $E[\hat\mu]-\mu$         Mean imputation, $E[\hat\sigma]-\sigma$
MCAR        None (but reduced power)                  None (but reduced power)                  None                                      < 0
MAR         Conditional: none; unconditional: yes     Conditional: none; unconditional: yes     Conditional: none; unconditional: yes     Conditional: yes (< 0); unconditional: yes
NMAR        Yes                                       Yes                                       Yes                                       Yes

Source: Statistical Models, A.C. Davison, Cambridge University Press

Working with Missing Data
• MCAR: deletion, maximum likelihood, multiple imputation, single imputation
• MAR: maximum likelihood, multiple imputation, single imputation
• NMAR: sensitivity analysis, pattern mixture models, selection model, maximum entropy

https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf

Listwise and Pairwise Deletion
• Unbiased when missing values are MCAR, or MAR conditionally
• Biased when missing values are NMAR
• $\mathrm{Bias}_\theta = E_\theta[\hat\theta] - \theta = E_\theta[\hat\theta - \theta]$, where the estimator can be $\hat\mu$ or $\hat\sigma$

Single Imputation
• Mean/mode substitution
   – Replace missing data with the mean or mode
   – Introduces bias in the estimated variance
• Dummy variable control
   – Create an indicator (1 = missing, 0 = not missing)
   – Impute missing values to a constant
• Conditional mean substitution
   – Replace missing values with the predicted score from a regression
   – Overestimates model fit

https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf

Simulations and Imputation

Imputing Values
• Deal with missing data by generating values for those that are missing.
• A variety of methods can impute these values, varying in accuracy and complexity.
• We will focus on single imputation methods and a few multiple imputation methods.

Mean Imputation
• We can use the mean in place of the missing values.
• This retains the mean of the observed data.
• It also causes a negative bias in the variance.

Regression Mean Imputation
• Instead of using the mean, we can use a regression to give us predicted values for those that are missing.
• This may allow us to achieve better estimates.

http://missingdata.lshtm.ac.uk/

Multiple Imputations
• A more complex way to impute missing values.
• Imputes the missing values several times and analyzes each completed data set, rather than filling in a single value.

http://www.stefvanbuuren.nl/mi/MI.html

A Few R Methods
How can we do this in R?
• Amelia
• mi
• There are many others, and some can be used to treat specific conditions for certain data sets.

Amelia
• Amelia is an R package that bootstraps the data and uses the bootstrapped samples in a multiple imputation process.

http://gking.harvard.edu/files/gking/files/amelia_jss.pdf?m=1360040717

mi
• "mi" imputes missing values using Bayesian regression methods, which are run a number of times and analyzed for convergence.
• This method is very customizable, but is also very costly to run.

https://cran.r-project.org/web/packages/mi/mi.pdf

Additional Resources
Additional packages that can be used in R can be found here: http://www.stefvanbuuren.nl/mi/Software.html

Imputation Summary
• To use imputation-based methods, we first need to understand the data and the reason for the "missingness" of the data.
• Knowing this, we can choose the method that we feel is most appropriate for our data set.
• Single imputation methods give quick and easy answers for our missing values, but they also bias statistics like the variance.
• Multiple imputation methods handle the bias better but are complex and require more specialized R packages or software.

Deletion Methodology

Bias
$\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta$
• $\mathrm{Bias}(\hat\theta) = 0$ means no bias.
• $\mathrm{Bias}(\hat\theta) > 0$: there is a systematic tendency for the estimate to be larger than the parameter it is estimating.
• $\mathrm{Bias}(\hat\theta) < 0$: there is a systematic tendency for the estimate to be smaller than the parameter it is estimating.
* $\theta$ is not from the data. It is part of the data-generating process.

Credit: email from Dr. Westfall

Listwise Vs Pairwise Deletion
What are they?
• They are methods that discard data.
How do they work?
• Listwise (complete-case analysis): excludes all units for which the outcome or any of the inputs are missing.
• Pairwise (available-case analysis): for each analysis, excludes only the units with a missing value in one or both of the variables involved.
What is the difference?
• Pairwise attempts to minimize the loss that occurs in listwise deletion.

Credit: http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Listwise Vs Pairwise Deletion (Cont')
[Figure: listwise deletion compared with pairwise deletion]

Listwise Vs Pairwise Deletion (Cont')
Pros and cons of listwise and pairwise deletion:
• Listwise:
   – The sample after deletion may not be representative of the full sample.
   – Power is reduced and the type II error rate increases.
   – Tendency to get biased results.
• Pairwise:
   – Preserves or increases statistical power in the analyses.
   – The result is the same as listwise deletion if the data has only two variables (columns).
   – Bias (estimates may be over- or underestimated).

Credit: https://www.statisticssolutions.com/missing-data-listwise-vs-pairwise/
Credit: http://files.eric.ed.gov/fulltext/ED281854.pdf

Not Missing at Random

Case of NMAR
$\mathrm{Bias}_\theta \overset{?}{=} E_\theta[\hat\theta] - \theta$

Our sample: $\text{Income}_i = B_0 + B_{male} D_{male,i} + B_{female} D_{female,i}$, with $E_\theta[\hat\theta] < \theta$
• Why are our values missing? High-income individuals don't report income.
• What is the characteristic of the missing data? Missing values are NMAR.

Meboot Package
Our sample: $E_\theta[\hat\theta] < \theta$
• NMAR missing values introduce the estimator bias that is hardest to resolve.
• We don't know the true distribution, but we can infer a similar distribution for imputation.
• Maximum entropy is for time series statistical inference when traditional $N(\mu, \sigma^2)$ assumptions are unreliable.
• For the worst-case scenario:
   – Missing values are NMAR
   – Missing values follow a different distribution
   – Extraction of this distribution is not available from historical data
• Example:
   – A company stock enters bankruptcy
   – Trading in the company stock is halted
   – Your client is calling and wants to know whether they should sell or hold
   – This is a methodology for a "best guess" in the worst possible case

https://cran.r-project.org/web/packages/meboot/index.html

Evaluation of a Fund Manager

Yearly Returns   Bond Fund   Equity Fund
2007             8.54%       7.58%
2008             9.58%       NA
2009             -1.87%      23.14%
2010             5.46%       13.44%

Yearly Returns   10 Year Treasury   S&P 500
2007             10.21%             5.48%
2008             20.10%             -36.55%
2009             -11.12%            25.94%
2010             8.46%              14.82%

• While evaluating a fund manager for investment, you notice that the fund did not include 2008 returns for its equity fund.
• You highly suspect the value is NMAR: it was left out because returns were bad.

Evaluation of a Fund Manager

Sector Breakdown   Equity Fund '07, '09, '10   US Markets
IT                 20.7%                       20.4%
Financials         17.5%                       16.5%
Health Care        15.8%                       14.7%
Cons. D            15.3%                       13.1%
Industrials        10.8%                       10.1%
Cons. S            10.1%                       9.9%
Energy             8.2%                        6.9%
Utilities          3.5%                        3.1%
Materials          3.0%                        2.8%
Telecom            2.8%                        2.4%

• You find out that the equity fund normally held stocks representative of the entire stock market.
• The distribution of the missing data may follow the overall US equity market.

Meboot Maximum Entropy
• Data-dependent, nonstandard bootstrap
• Creates a population of time series that may be non-stationary (i.e. the mean changes over time)
• Creates a large number of replicates based on your provided ensemble $\Omega$
1. Sort the provided data in increasing order
2. Compute intermediate points of the sorted data
3. Compute min/max
4. Compute mean-preserving constraints
5. Generate random U[0,1] draws for the intervals
6. Repeat

https://cran.r-project.org/web/packages/meboot/index.html

Meboot Maximum Entropy

Yearly Returns    Bond Fund   Equity Fund
2007              8.54%       7.58%
2008              9.58%       -35.47%
2009              -1.87%      23.14%
2010              5.46%       13.44%
$\mu$             5.43%       14.72%
$\sigma$          5.17%       7.86%
$\mu_{ME}$                    2.17%
$\sigma_{ME}$                 25.90%

• NMAR missing values require the most assumptions.
• Minimizing bias for NMAR depends heavily on your model setup.
• There is no "right" answer; we do not know the true DGP.
• All we can do is minimize bias with well-grounded assumptions.

Questions?
THANK YOU!
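The summary rows of the final Meboot Maximum Entropy table can be checked with a few lines of arithmetic. The sketch below is in Python with only the standard library, not in R, and it does not run meboot: it takes the imputed 2008 equity return of -35.47% from the table as given. The variable names and the helper `summarize` are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption that happens to reproduce the table's figures.

```python
import statistics

# Yearly returns in percent, copied from the table (2007 to 2010).
bond = [8.54, 9.58, -1.87, 5.46]
equity_observed = [7.58, 23.14, 13.44]        # 2008 is missing
imputed_2008 = -35.47                         # value reported in the table
equity_imputed = [7.58, imputed_2008, 23.14, 13.44]


def summarize(returns):
    """Return (mean, sample standard deviation) of a list of returns."""
    return statistics.fmean(returns), statistics.stdev(returns)


if __name__ == "__main__":
    for label, series in [
        ("Bond fund", bond),
        ("Equity fund, 2008 dropped", equity_observed),
        ("Equity fund, 2008 imputed", equity_imputed),
    ]:
        mean, sd = summarize(series)
        print(f"{label}: mean = {mean:.2f}%, sd = {sd:.2f}%")
```

Dropping 2008 corresponds to the equity fund's $\mu$ and $\sigma$ rows, and including the imputed value corresponds to the $\mu_{ME}$ and $\sigma_{ME}$ rows. The comparison shows how much one suspected NMAR value moves both statistics.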
