Missing Data and Imputation Strategies

STAT 3130
Statistical Methods II
Missing Data and Imputation
STAT3130 – Missing Data
Rarely does “real” data come to you without any missing values. And
“missing” can take several forms:
1. Truly “missing” – meaning no value is present;
2. Coded – meaning that a “value” is present, but it is a placeholder
code rather than a measurement on the scale of the data;
3. Miscoded – meaning that there is a “value”, but it is wrong.
Let's take each one individually…
1) Truly Missing Data.
Consider the GSS08 dataset. You will see missing values in both character
and numeric variables. Note that SAS displays a missing value for a character
variable as a blank and a missing value for a numeric variable as a “.”.
This is the most obvious form of missing data.
You can check for truly missing data by using the following SAS code:
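/* Replace "data" and "var" with your own dataset and variable names. */
/* N and NMISS count the non-missing and missing values. */
/* Proc Means accepts numeric variables only. */
Proc Means data = data n nmiss;
Var var;
Run;
/* Proc Freq accepts character or numeric variables. */
/* The MISSING option lists missing values as a row of the table. */
Proc Freq data = data;
Tables var / missing;
Run;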
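The same check can be applied to every variable at once. The following is only a sketch: `toy` is a small made-up dataset standing in for GSS08, and the format names `missfmt` and `$missfmt` are chosen here for illustration. The formats collapse each value to "Missing" or "Not missing", so PROC FREQ prints one short table per variable.

```sas
/* Toy data standing in for a real dataset such as GSS08 (values are made up). */
/* In list input, a lone period is read as missing for both numeric and        */
/* character variables.                                                        */
data toy;
  input id age sex $;
  datalines;
1 34 F
2 . M
3 51 .
4 . F
;
run;

/* Collapse every value into "Missing" or "Not missing". */
proc format;
  value  missfmt  .   = 'Missing' other = 'Not missing';
  value $missfmt  ' ' = 'Missing' other = 'Not missing';
run;

/* One-way table per variable; MISSING keeps the missing level in the table. */
proc freq data=toy;
  format _numeric_ missfmt. _character_ $missfmt.;
  tables _all_ / missing nocum nopercent;
run;
```

Note that the numeric format above only catches the ordinary missing value "."; special missing values such as .A would need their own entries.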
2) Coded Data
Frequently, when data is input into a database, any values which were
missing, incorrect, illegible, etc. will be coded at the time of entry. These
codes are typically (but not always) provided to you in a data dictionary.
Coded values are sometimes easy to spot (the codes are character when the
rest of the data is numeric) or not easy to spot (the coded values are
numeric, but not part of the “true” range of the data).
Consider the GSS08 dataset again. Take a look at the age variable – there
are coded values there. What are they and how would you know?
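One way to handle numeric codes, once the data dictionary has identified them, is to recode them to a true SAS missing value. The sketch below uses a made-up dataset, and the codes 98 and 99 are hypothetical placeholders, not a statement about what GSS08 actually uses. Check the data dictionary for the real codes.

```sas
/* Toy data: 98 and 99 are HYPOTHETICAL missing-value codes for age. */
data toy;
  input id age;
  datalines;
1 34
2 99
3 51
4 98
5 27
;
run;

/* A frequency table often exposes codes as a cluster at the extreme */
/* end of the range.                                                 */
proc freq data=toy;
  tables age;
run;

/* Recode the assumed codes to a true numeric missing value. */
data toy_clean;
  set toy;
  if age in (98, 99) then age = .;
run;

/* After recoding, the codes are counted under NMISS and no longer */
/* inflate the mean.                                               */
proc means data=toy_clean n nmiss mean min max;
  var age;
run;
```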
3) Miscoded Data
Humans make mistakes, and sometimes, in weird ways, computers make
mistakes too. When data is entered incorrectly, it can seriously distort
the results when you are trying to run a model or a test.
Consider the age variable again in the GSS08 dataset…
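Miscoded values cannot be found by counting missing values, but a range check will catch the impossible ones. The sketch below uses made-up data and assumes, purely for illustration, that plausible ages fall between 18 and 89. The right limits depend on how the survey was designed.

```sas
/* Toy data with two implausible ages (values are made up). */
data toy;
  input id age;
  datalines;
1 34
2 340
3 51
4 -5
5 27
;
run;

/* Flag non-missing ages outside the ASSUMED plausible range 18 to 89. */
/* A SAS comparison evaluates to 1 (true) or 0 (false).                */
data flagged;
  set toy;
  age_suspect = (not missing(age)) and (age < 18 or age > 89);
run;

/* List the suspect rows so they can be reviewed by hand. */
proc print data=flagged;
  where age_suspect = 1;
run;
```

A check like this only finds values that are impossible. A wrong value that still lies inside the plausible range will pass unnoticed.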
With all of these issues, you also need to determine how the data came to be missing:
1) Missing completely at random – also called MCAR. This means that the missing
values have no pattern. In other words, the missingness cannot be predicted in any
way.
2) Missing at random – also called MAR. This means that the missingness can
be predicted using the other data available for an observation. In these instances,
when the variable is categorical, you may want to assign a separate category
labeled “MISSING” so that these observations are identified differently.
3) Missing in a way that depends upon latent variables – often called MNAR
(missing not at random). For example, there could be a latent (unobserved)
variable which is highly correlated with the missingness. A
familiar example from medical studies is that if a particular treatment causes
discomfort, a patient is more likely to drop out of the study. This “missingness”
is not at random (unless “discomfort” is measured and observed for all
patients).
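The “MISSING” category described under MAR, together with a first look at whether missingness in one variable is related to another, might be sketched as follows. The dataset and the names `sex_cat` and `income_missing` are made up for illustration.

```sas
/* Toy data (values are made up). */
data toy;
  input id sex $ income;
  datalines;
1 F 42000
2 M .
3 . 38000
4 M .
5 F 51000
;
run;

data flagged;
  set toy;
  length sex_cat $8;
  sex_cat = sex;
  /* Give missing values of the categorical variable their own level. */
  if missing(sex) then sex_cat = 'MISSING';
  /* Indicator: 1 if income is missing, 0 otherwise. */
  income_missing = missing(income);
run;

/* Cross-tabulate the missingness indicator against the categorical variable. */
/* A clear pattern suggests the data is not missing completely at random.     */
proc freq data=flagged;
  tables sex_cat * income_missing / norow nocol nopercent;
run;
```

A table like this can only reveal dependence on variables that were observed. It cannot rule out dependence on a latent variable.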
In all instances, the data values need to be replaced or “imputed” with a
logical, meaningful value.
Before we discuss the strategies for imputation…let's make a quick point
regarding why this needs to be done…
By default, most modeling procedures in analytical software packages –
including SAS – require a “complete case” for an observation to be included
in the analysis. This means that if there are 100 variables and an observation
is missing just ONE value, the entire case is removed from the analysis.
And you lose the other 99 perfectly good values.
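To see how many observations would survive a complete-case analysis, you can count the missing values in each row. The sketch below uses made-up data and the CMISS function, which counts missing arguments of either type. The variable names are hypothetical.

```sas
/* Toy data (values are made up). */
data toy;
  input id age sex $ income;
  datalines;
1 34 F 42000
2 . M 38000
3 51 . 45000
4 29 F .
5 47 M 51000
;
run;

data flagged;
  set toy;
  /* CMISS counts missing values across numeric and character arguments. */
  n_missing = cmiss(age, sex, income);
  /* 1 if the row has no missing values, 0 otherwise. */
  complete = (n_missing = 0);
run;

/* How many rows would a complete-case analysis keep? */
proc freq data=flagged;
  tables complete;
run;
```

In this toy data, three of the five rows are each missing a single value, so only two rows are complete cases.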
Think about this…if each of your 50 variables is missing only 1% of its
values, independently of the other variables, and you have 1,000,000
observations, you can expect to lose about 395,000 observations when you
go to model:
observations lost = total observations – ((1 – percent missing)^variables × total observations)
or
1,000,000 – ((1 – .01)^50 × 1,000,000) = 394,994
That is A LOT of valid data that you would lose!
And, it could bias your results.
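The arithmetic above can be reproduced in a short DATA step, which makes it easy to try other missing rates or variable counts. This is a sketch. The variable names are chosen here for illustration, and the calculation assumes values are missing independently across variables.

```sas
/* Expected complete-case loss, assuming independent missingness. */
data _null_;
  total_obs   = 1000000;  /* number of observations         */
  pct_missing = 0.01;     /* share missing in each variable */
  n_vars      = 50;       /* number of variables            */

  /* Expected number of rows with no missing value in any variable. */
  complete = total_obs * (1 - pct_missing) ** n_vars;
  lost     = total_obs - complete;

  /* Write the result to the SAS log. */
  put lost= comma12.0;
run;
```

With the values shown, the log should report a loss of about 394,994 observations, matching the figure above.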
STAT3130 – Imputation
We need a way to replace those values – logically.
Many options for imputation exist. Here are four of the primary methods:
1. Mean based imputation
2. Median based imputation
3. Stratified imputation
4. Regressed imputation – difficult with MCAR
Each of these will be discussed briefly in turn.
Imputation Strategies:
1) Mean Based Imputation – this process is the simplest. It
involves replacing the missing values with the mean of the variable.
But…before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected
by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
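One way to carry out mean imputation in SAS is PROC STDIZE with the REPONLY option, which replaces missing values and leaves the observed values alone. The sketch below assumes SAS/STAT is available and uses a made-up dataset. Comparing PROC MEANS output before and after helps answer questions b and c.

```sas
/* Toy data (values are made up). */
data toy;
  input id income;
  datalines;
1 30000
2 .
3 50000
4 .
5 70000
;
run;

/* REPONLY: replace missing values only, do not standardize.  */
/* METHOD=MEAN: use the variable's mean as the replacement.   */
proc stdize data=toy out=toy_mean method=mean reponly;
  var income;
run;

/* Before imputation. */
proc means data=toy n nmiss mean std;
  var income;
run;

/* After imputation. */
proc means data=toy_mean n nmiss mean std;
  var income;
run;
```

In this toy data the mean stays at 50,000 while the standard deviation falls from 20,000 to about 14,142, because every imputed value sits exactly at the center of the distribution.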
2) Median Based Imputation – this process is also very simple. This
involves replacing the missing values with the median of the variable.
But…before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected
by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
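Median imputation can be sketched the same way by switching PROC STDIZE to METHOD=MEDIAN (again assuming SAS/STAT is available). The made-up data below includes one very large income so that the mean and median replacements can be compared side by side.

```sas
/* Toy data with one extreme value (values are made up). */
data toy;
  input id income;
  datalines;
1 20000
2 25000
3 .
4 30000
5 500000
;
run;

/* Replace the missing value with the median of the observed values. */
proc stdize data=toy out=toy_median method=median reponly;
  var income;
run;

/* For comparison: replace the missing value with the mean. */
proc stdize data=toy out=toy_mean method=mean reponly;
  var income;
run;

proc print data=toy_median;
run;

proc print data=toy_mean;
run;
```

Here the median of the observed incomes is 27,500 while the mean is 143,750, so on skewed data the two strategies can impute very different values.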
3) Stratified Imputation – this process is slightly more involved. It
involves replacing the missing values with the mean or median of the
variable, computed within a stratum of similar observations.
But…before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected
by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
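A minimal sketch of stratified mean imputation, using PROC SQL to compute the mean within each stratum and fill in only the missing values. The dataset, the stratifying variable `region`, and the new column `income_imp` are all made up for illustration.

```sas
/* Toy data with two strata (values are made up). */
data toy;
  input id region $ income;
  datalines;
1 East 30000
2 East .
3 East 50000
4 West 80000
5 West .
6 West 100000
;
run;

/* MEAN(income) with GROUP BY gives the mean within each region.       */
/* COALESCE keeps the observed income and falls back to the stratum    */
/* mean only when income is missing.                                   */
proc sql;
  create table toy_strat as
  select id, region, income,
         coalesce(income, mean(income)) as income_imp
  from toy
  group by region
  order by id;
quit;

proc print data=toy_strat;
run;
```

SAS writes a note to the log about remerging summary statistics with the original data. That is expected for this kind of query. In the toy data the missing East income is filled with 40,000 and the missing West income with 90,000.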
4) Regressed Imputation – this process involves predicting the
missing values using regression. It works well if the
variables are related to each other and if only one or two
variables have missing data. But…before you do this, think through
these questions:
a. How would the distribution of the variable affect, and be affected
by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
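Regressed imputation might be sketched with PROC REG as follows. The data is made up, `age` is assumed to be fully observed, and the names `income_hat` and `income_imp` are chosen here for illustration. PROC REG fits the model on the complete rows and still writes a predicted value for rows where only the response is missing.

```sas
/* Toy data: income is missing for one row, age is fully observed. */
data toy;
  input id age income;
  datalines;
1 25 31000
2 35 39000
3 45 .
4 55 61000
5 65 69000
;
run;

/* Fit income on age using the complete rows. The OUTPUT dataset holds */
/* a predicted value for every row with a non-missing age.             */
proc reg data=toy noprint;
  model income = age;
  output out=pred p=income_hat;
run;
quit;

/* Keep observed incomes; use the prediction only where income is missing. */
data toy_reg;
  set pred;
  income_imp = coalesce(income, income_hat);
run;

proc print data=toy_reg;
run;
```

Because every imputed value lies exactly on the fitted line, this approach tends to understate the variability of the variable, which is worth keeping in mind when answering question c.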