Missing Values
Raymond Kim
Pink Preechavanichwong
Andrew Wendel
October 27, 2015
I. Intro: Missing Values and Bias
II. Simulations and Imputation
III. Deletion Methodology
IV. Not Missing at Random
Initial Steps
Why is our data missing?
What is the characteristic of our missing data?
How will that affect the bias? Mean? Std?
$\mathrm{Bias}(\hat{\theta}) = E_\theta[\hat{\theta}] - \theta = E_\theta[\hat{\theta} - \theta]$
where $\theta$ can be $\beta_0, \beta_1, \mu, \sigma$, etc.
https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf
OLS Unbiased Estimator
$\mathrm{Bias}(\hat{\theta})$ applies to regression coefficients, distribution parameters, etc.
$y = X\beta + \epsilon$
Unbiased estimator: $E[\hat{\beta} - \beta] = E[(X'X)^{-1} X'\epsilon]$
True DGP ($\beta$) → Sample data → $\hat{\beta}$
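The idea that an estimator is unbiased when its average over repeated samples equals the true parameter can be checked by simulation. The following is a minimal sketch, assuming a simple one-regressor model with made-up values (beta0 = 1, beta1 = 2, standard normal errors) that are not from the slides. It draws many samples from the same DGP and compares the average OLS slope with the true slope.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of the least-squares line y = b0 + b1 * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def simulate_slope_bias(beta0=1.0, beta1=2.0, n=50, reps=2000, seed=0):
    """Monte Carlo estimate of E[beta1_hat] - beta1 with complete data."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]  # regressor values held fixed
    estimates = []
    for _ in range(reps):
        # Errors are drawn independently of x, so OLS should be unbiased.
        y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
        estimates.append(ols_slope(x, y))
    return statistics.fmean(estimates) - beta1


bias = simulate_slope_bias()
print(f"estimated bias of the OLS slope: {bias:+.4f}")
```

With complete data the estimated bias should be close to zero. The question for the rest of the talk is whether that still holds once some values are missing.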
Initial Steps
1. Identify the reason for missing data
   • Marriage, graduation, death, etc.
2. Understand the distribution of missing data
   • Certain groups are more likely to have missing values
3. Decide on the best method of analysis
   • Deletion methods: listwise, pairwise deletion
   • Single imputation methods: mean substitution, dummy variable, single regression
   • Model-based methods: maximum likelihood and multiple imputation
4. Power and Bias
   • Too many missing values reduce power
   • Missing values can introduce bias in your estimator
https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf
Missing Values and Bias
$\mathrm{Bias}(\hat{\theta}) = E_\theta[\hat{\theta}] - \theta = E_\theta[\hat{\theta} - \theta]$
Are missing values moving us away from or closer to the true DGP?
Conditional Distribution
MCAR (missing completely at random)
Probability(Y = Missing | X, Y) = Probability(Y = Missing)
The probability that Y is missing does not depend on X or Y.
MAR (missing at random)
Probability(Y = Missing | X, Y) = Probability(Y = Missing | X)
The probability that Y is missing depends on X but not on Y.
NMAR (not missing at random)
Probability(Y = Missing | X, Y) = Probability(Y = Missing | X, Y)
The probability that Y is missing depends on Y and possibly on X.
Statistical Models, A.C. Davison, Cambridge University Press
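The three mechanisms can be made concrete with a small simulation. This is a sketch under invented assumptions, not taken from the slides: Y = 2 + X + noise, a flat 30% missingness rate for MCAR, and logistic missingness probabilities for MAR and NMAR. It reports the bias of the complete-case mean of Y under each mechanism.

```python
import math
import random
import statistics

TRUE_MEAN_Y = 2.0


def complete_case_bias(mechanism, n=20000, seed=1):
    """Bias of the mean of the observed Y values under one mechanism."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = TRUE_MEAN_Y + x + rng.gauss(0, 1)
        if mechanism == "MCAR":
            p_missing = 0.3  # depends on neither X nor Y
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-2 * x))  # depends on X only
        else:  # NMAR
            p_missing = 1 / (1 + math.exp(-2 * (y - TRUE_MEAN_Y)))  # depends on Y
        if rng.random() >= p_missing:
            observed.append(y)
    return statistics.fmean(observed) - TRUE_MEAN_Y


biases = {m: complete_case_bias(m) for m in ("MCAR", "MAR", "NMAR")}
for name, value in biases.items():
    print(f"{name}: bias of complete-case mean = {value:+.3f}")
```

In this setup the MCAR bias should be near zero, while the MAR and NMAR cases lose the large values of Y and so underestimate the unconditional mean.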
Example: $\text{Sea Level} = \beta_0 + \beta_1 \text{Year} + \varepsilon$
[Figure panels: Normal Data, MCAR, NMAR, MAR]
Statistical Models, A.C. Davison, Cambridge University Press
Bias Matrix: Does Bias Exist?
$\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$

Mechanism | Deletion: $E[\hat{\mu}] - \mu$ | Deletion: $E[\hat{\sigma}] - \sigma$ | Mean Imputation: $E[\hat{\mu}] - \mu$ | Mean Imputation: $E[\hat{\sigma}] - \sigma$
$\theta_{MCAR}$ | None (but reduced power) | None (but reduced power) | None | < 0
$\theta_{MAR}$ | Conditional: None; Unconditional: Yes | Conditional: None; Unconditional: Yes | Conditional: None; Unconditional: Yes | Conditional: Yes, < 0; Unconditional: Yes
$\theta_{NMAR}$ | Yes | Yes | Yes | Yes

Statistical Models, A.C. Davison, Cambridge University Press
Working with Missing Data
MCAR
• Deletion
• Maximum Likelihood
• Multiple Imputation
• Single Imputation
MAR
• Maximum Likelihood
• Multiple Imputation
• Single Imputation
NMAR
• Sensitivity Analysis
• Pattern Mixture Models
• Selection Model
• Maximum Entropy
https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf
Listwise and Pairwise Deletion
NMAR: BIASED
Missing values are MCAR, or MAR (conditional): UNBIASED
$\mathrm{Bias}(\hat{\theta}) = E_\theta[\hat{\theta}] - \theta = E_\theta[\hat{\theta} - \theta]$
The estimator can be $\hat{\mu}$ or $\hat{\sigma}$.
Single Imputation
Mean/Mode Substitution
• Replace missing data with the mean or mode
• Introduces bias in the estimated variance
Dummy Variable Control
• Create an indicator (1 = missing, 0 = not missing)
• Impute missing values to a constant
Conditional Mean Substitution
• Replace missing values with the predicted score from a regression
• Overestimates model fit
https://www.utexas.edu/cola/prc/_files/cs/Missing-Data.pdf
Simulations and Imputation
Imputing Values
• Deal with missing data by generating values for those that are missing.
• A variety of methods can be used to impute these values, varying in accuracy and complexity.
• We will focus on single imputation methods and a few multiple imputation methods.
Mean Imputation
• We can use the mean of the observed values in place of the missing values.
• This retains the mean of the observed data.
• It also causes a negative bias in the variance, because the imputed values add no spread.
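The two effects of mean imputation, an unchanged mean and a shrunken variance, can be seen in a few lines. This is a minimal sketch on simulated data; the N(50, 10) values and the 40% MCAR missingness rate are illustrative assumptions, not taken from the slides.

```python
import random
import statistics

rng = random.Random(42)
full = [rng.gauss(50, 10) for _ in range(5000)]

# MCAR: each value is dropped with probability 0.4, independent of everything.
observed = [v if rng.random() >= 0.4 else None for v in full]
present = [v for v in observed if v is not None]

# Mean imputation: every missing value is replaced by the observed mean.
fill = statistics.fmean(present)
imputed = [fill if v is None else v for v in observed]

mean_observed = fill
mean_imputed = statistics.fmean(imputed)
sd_observed = statistics.stdev(present)
sd_imputed = statistics.stdev(imputed)

print(f"mean: observed {mean_observed:.2f}, after imputation {mean_imputed:.2f}")
print(f"sd:   observed {sd_observed:.2f}, after imputation {sd_imputed:.2f}")
```

The mean is unchanged, but the standard deviation after imputation is noticeably smaller than the standard deviation of the observed values, which is the negative bias in the variance.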
Regression Mean Imputation
• Instead of using the mean, we can use regression to give us predicted values for those missing.
• This may allow us to achieve better estimates.
http://missingdata.lshtm.ac.uk/
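Regression imputation can be sketched as follows, assuming a simulated data set in which Y is MAR given X. The model Y = 1 + 0.5 X + noise and the logistic missingness rule are invented for illustration. The sketch fits a line on the complete cases, fills in the missing Y values with predictions, and compares the mean of Y and the X-Y correlation before and after.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1


def correlation(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)
n = 4000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]  # true mean of Y is 1.0
# MAR: Y is more likely to be missing when X is large.
missing = [rng.random() < 1 / (1 + math.exp(-xi)) for xi in x]

x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]
b0, b1 = fit_line(x_obs, y_obs)
y_filled = [b0 + b1 * xi if m else yi for xi, yi, m in zip(x, y, missing)]

mean_complete = statistics.fmean(y_obs)
mean_filled = statistics.fmean(y_filled)
corr_complete = correlation(x_obs, y_obs)
corr_filled = correlation(x, y_filled)

print(f"mean of Y: complete cases {mean_complete:.3f}, imputed {mean_filled:.3f}")
print(f"corr(X, Y): complete cases {corr_complete:.3f}, imputed {corr_filled:.3f}")
```

In this setup the imputed mean should land closer to the true mean than the complete-case mean. The correlation rises after imputation because the imputed points lie exactly on the fitted line, which overstates model fit.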
Multiple Imputations
• A more complex way to impute missing values.
• Creates several imputed versions of the data set, analyzes each one, and combines the results.
http://www.stefvanbuuren.nl/mi/MI.html
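The impute, analyze, and combine cycle can be sketched in plain Python. This is a simplified illustration and not how any particular package works internally. It assumes simulated MAR data; each imputation refits a regression on a bootstrap resample of the complete cases and adds random noise to the predictions; and the per-data-set estimates of the mean of Y are pooled with Rubin's rules. All names and parameter values are hypothetical.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept, slope, and residual standard deviation."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    rss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    return b0, b1, math.sqrt(rss / (len(x) - 2))


rng = random.Random(11)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]  # true mean of Y is 1.0
missing = [rng.random() < 1 / (1 + math.exp(-xi)) for xi in x]  # MAR given X
complete = [(xi, yi) for xi, yi, m in zip(x, y, missing) if not m]

m_imputations = 20
estimates, variances = [], []
for _ in range(m_imputations):
    # Resample the complete cases so each imputation uses a different fit.
    boot = [rng.choice(complete) for _ in complete]
    b0, b1, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    # Impute with prediction plus noise, so imputed values keep some spread.
    filled = [b0 + b1 * xi + rng.gauss(0, sd) if m else yi
              for xi, yi, m in zip(x, y, missing)]
    estimates.append(statistics.fmean(filled))        # analysis step
    variances.append(statistics.variance(filled) / n)  # variance of that mean

# Rubin's rules: pool the estimates and combine within/between variance.
pooled_mean = statistics.fmean(estimates)
within = statistics.fmean(variances)
between = statistics.variance(estimates)
total_variance = within + (1 + 1 / m_imputations) * between

print(f"pooled mean of Y: {pooled_mean:.3f}")
print(f"pooled standard error: {math.sqrt(total_variance):.4f}")
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty that comes from not knowing the missing values.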
A Few R Methods
How can we do this in R?
• Amelia
• mi
• There are many others, and some can be used to treat specific conditions for certain data sets.
Amelia
Amelia is an algorithm that bootstraps the data and uses the bootstrapped samples in a multiple imputation process.
http://gking.harvard.edu/files/gking/files/amelia_jss.pdf?m=1360040717
mi
"mi" imputes missing values using Bayesian regression methods, which are run a number of times and checked for convergence.
This method is very customizable, but it is also computationally costly.
https://cran.r-project.org/web/packages/mi/mi.pdf
Additional Resources
Additional packages that can be used in R can be found here:
http://www.stefvanbuuren.nl/mi/Software.html
Imputation Summary
• In order to use imputation-based methods, we first need to understand the data and the reason for the "missingness" of the data.
• Knowing this, we can choose the method that is most appropriate for our data set.
• Single imputation methods give quick and easy answers for our missing values, but they also bias statistics such as the variance.
• Multiple imputation methods handle the bias better but are complex and require more specialized R packages or software.
Deletion Methodology
Bias
$\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$
• Bias = 0 means no bias.
• Bias > 0: there is a systematic tendency for the estimate to be larger than the parameter it is estimating.
• Bias < 0: there is a systematic tendency for the estimate to be smaller than the parameter it is estimating.
*$\theta$ is not from the data. It's part of the data-generating process.
Credit: email from Dr.Westfall
Listwise Vs Pairwise Deletion
What are they?
• They are methods that discard data.
How do they work?
• Listwise (complete-case analysis): exclude all units for which the outcome or any of the inputs are missing.
• Pairwise (available-case analysis): for each pair of variables, exclude only the units in which one or both values are missing.
What is the difference?
• Pairwise attempts to minimize the loss that occurs in listwise deletion.
Credit: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
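The difference between the two deletion rules is easiest to see by counting how many rows each one keeps. The sketch below uses a small made-up table with three variables, with `None` marking a missing value. The data and the helper names `listwise` and `pairwise` are hypothetical.

```python
# Six units, three variables; None marks a missing value.
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (3.0, 4.0, None),
    (4.0, 5.0, 6.0),
    (5.0, None, 7.0),
    (6.0, 7.0, 8.0),
]


def listwise(data):
    """Keep only units with no missing value in any variable."""
    return [r for r in data if all(v is not None for v in r)]


def pairwise(data, i, j):
    """Keep units where both variable i and variable j are observed."""
    return [(r[i], r[j]) for r in data
            if r[i] is not None and r[j] is not None]


complete_rows = listwise(rows)
pair_counts = {(i, j): len(pairwise(rows, i, j))
               for i in range(3) for j in range(i + 1, 3)}

print(f"listwise keeps {len(complete_rows)} of {len(rows)} rows")
for (i, j), count in pair_counts.items():
    print(f"pairwise for variables {i} and {j} keeps {count} rows")
```

Listwise deletion keeps 3 rows for every analysis, while pairwise deletion keeps 4, 5, and 3 rows depending on the pair, so each pairwise statistic is computed from a different subset of the data.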
Listwise Vs Pairwise Deletion (Cont'd)
[Figure panels: Listwise deletion, Pairwise deletion]
Listwise Vs Pairwise Deletion (Cont'd)
Pros and Cons of Listwise and Pairwise Deletion:
• Listwise:
   • The sample after deletion may not be representative of the full sample.
   • Statistical power is reduced, so the type II error rate increases.
   • Results tend to be biased.
• Pairwise:
   • Preserves or increases statistical power in the analyses.
   • Gives the same result as listwise deletion if the data has only two variables (columns).
   • Estimates can still be biased (over- or underestimated).
Credit: https://www.statisticssolutions.com/missing-data-listwise-vs-pairwise/
Credit: http://files.eric.ed.gov/fulltext/ED281854.pdf
Not Missing at Random
Case of NMAR
$\mathrm{Bias}(\hat{\theta}) = E_\theta[\hat{\theta}] - \theta$ ?
Our sample: $\text{Income}_i = B_0 + B_{male} D_{male,i} + B_{female} D_{female,i}$
$E_\theta[\hat{\theta}] < \theta$
• Why are our values missing? High-income individuals don't report income.
• What is the characteristic of the missing data? Missing values are NMAR.
Meboot Package
Our sample: $E_\theta[\hat{\theta}] < \theta$
• Our NMAR missing values introduce the estimator bias that is hardest to correct.
• We don't know the true distribution, but we can infer a similar distribution for imputation.
• Maximum Entropy is for time series statistical inference when traditional $N(\mu, \sigma^2)$ assumptions are unreliable.
• For the worst-case scenario:
   • Missing values are NMAR
   • Missing values follow a different distribution
   • This distribution cannot be extracted from historical data
      - e.g., a company's stock enters bankruptcy
      - Company stock trading is halted
      - Your client is calling and wants to know whether they should sell or hold
      - This is a methodology for a "best guess" in the worst possible case
https://cran.r-project.org/web/packages/meboot/index.html
Evaluation of a Fund Manager
Yearly Returns | Bond Fund | Equity Fund
2007 | 8.54% | 7.58%
2008 | 9.58% | NA
2009 | -1.87% | 23.14%
2010 | 5.46% | 13.44%

Yearly Returns | 10 Year Treasury | S&P 500
2007 | 10.21% | 5.48%
2008 | 20.10% | -36.55%
2009 | -11.12% | 25.94%
2010 | 8.46% | 14.82%

• While evaluating a fund manager for investment, you notice that the fund did not include 2008 returns for its equity fund.
• You strongly suspect the value is NMAR: it was left out because returns were bad.
Evaluation of a Fund Manager
Sector Breakdown | Equity Fund '07, '09, '10 | US Markets
IT | 20.7% | 20.4%
Financials | 17.5% | 16.5%
Health Care | 15.8% | 14.7%
Cons. D | 15.3% | 13.1%
Industrials | 10.8% | 10.1%
Cons. S | 10.1% | 9.9%
Energy | 8.2% | 6.9%
Utilities | 3.5% | 3.1%
Materials | 3.0% | 2.8%
Telecom | 2.8% | 2.4%

• You find out that the equity fund normally held stocks representative of the entire stock market.
• The distribution of the missing data may follow the overall US equity market.
Meboot Maximum Entropy
• Data-dependent, nonstandard bootstrap
• Creates a population of time series that can be non-stationary (i.e., the mean changes over time)
• Creates a large number of replicates based on your provided ensemble Ω
1. Sort the provided data in increasing order
2. Compute intermediate points of the sorted data
3. Compute min/max limits
4. Compute the mean-preserving constraints
5. Generate random draws from U[0,1]
6. Repeat
https://cran.r-project.org/web/packages/meboot/index.html
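The numbered steps can be sketched in a simplified form. The code below is not the meboot implementation. It is a rough illustration that sorts the data, builds intervals from the midpoints of neighbouring sorted values, maps uniform draws through a piecewise-uniform inverse CDF, and restores the original rank order. It deliberately omits the mean-preserving constraints (step 4). The tail rule, which uses the mean absolute change between consecutive observations, is an assumption made here. The input is the S&P 500 column of yearly returns from the benchmark table in the slides.

```python
import random
import statistics


def me_replicate(series, rng):
    """One simplified maximum-entropy-style replicate of `series`."""
    n = len(series)
    order = sorted(range(n), key=lambda i: series[i])  # time index of each rank
    xs = [series[i] for i in order]                    # step 1: sorted data
    mids = [(a + b) / 2 for a, b in zip(xs, xs[1:])]   # step 2: intermediate points
    # Step 3 (assumed rule): extend both tails by the mean absolute change.
    step = sum(abs(b - a) for a, b in zip(series, series[1:])) / (n - 1)
    edges = [xs[0] - step] + mids + [xs[-1] + step]
    # Step 5: uniform draws mapped onto n equally likely intervals.
    draws = []
    for _ in range(n):
        u = rng.random() * n
        k = min(int(u), n - 1)
        draws.append(edges[k] + (u - k) * (edges[k + 1] - edges[k]))
    draws.sort()
    # Give the smallest draw to the time point that held the smallest value, etc.
    replicate = [0.0] * n
    for rank, idx in enumerate(order):
        replicate[idx] = draws[rank]
    return replicate


sp500 = [5.48, -36.55, 25.94, 14.82]  # 2007 to 2010, in percent
rng = random.Random(0)
replicates = [me_replicate(sp500, rng) for _ in range(999)]  # step 6: repeat
mean_2008 = statistics.fmean(r[1] for r in replicates)
print(f"average replicated 2008 return: {mean_2008:.2f}%")
```

Because the sorted draws are assigned back by rank, every replicate keeps the shape of the original series: the worst year is always 2008 and the best is always 2009.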
Meboot Maximum Entropy
Yearly Returns | Bond Fund | Equity Fund
2007 | 8.54% | 7.58%
2008 | 9.58% | -35.47%
2009 | -1.87% | 23.14%
2010 | 5.46% | 13.44%
μ | 5.43% | 14.72%
σ | 5.17% | 7.86%
μ ME | | 2.17%
σ ME | | 25.90%

• NMAR missing values require the most assumptions.
• Minimizing bias for NMAR depends heavily on your model setup.
• There is no "right" answer; we do not know the true DGP.
• All we can do is minimize bias with well-grounded assumptions.
Questions?
THANK YOU!