Improving the quality of data
through imputing missing values
(Part One: Introduction to types of missing data)
Saeid Shahraz MD, PhD Student
Heller School of Social Policy and Management
5/8/2017
Saeid Shahraz
Basic questions
1. What does ‘missing data’ mean?
2. What does ‘imputation’ mean?
3. What does ‘data improvement’ mean?
4. How much missingness is acceptable?
5. Is missing data a usual problem?
6. Is ‘imputation’ always the right solution?
What does “missing data” mean?
Please look at Table 1 on the next slide. This
ultra-small data set has 5 observations, and
observations number 3 and number 5 have missing
values on the variable “number of follow-up
rehabilitation visits”.
Table 1: Two values are missing

Id   Gender   Age   Rehab visits
1    1        12    7
2    2        13    6
3    2        16    (missing)
4    1        67    13
5    1        72    (missing)
What does “imputation” mean?
If we figure out what the missing values are
and put them in the empty cells, we have done
imputation. Please look at Table 2, in which the
missing values have been imputed. Do not worry
for now about how the imputation was done; I
simply put in some arbitrary numbers.
Table 2: Two values imputed

Id   Gender   Age   Rehab visits
1    1        12    7
2    2        13    6
3    2        16    6
4    1        67    13
5    1        72    15
What does “data improvement” mean?
Please look at Table 3. It has three columns for
the number of visits. The left column is the
actual (complete) variable, the middle column has
missing values, and the rightmost column holds the
imputed values. The last row shows the average
number of visits for the actual data, the data
with missing values, and the imputed data. The
average of the imputed column (9.4) is closer to
that of the actual data (10.2) than the average of
the column with missing values (8.7). In this
example, then, “imputation” improved the quality
of the data.
Table 3: Data improvement

Id   Gender   Age   Rehab visits   Rehab visits   Rehab visits
                    (actual)       (missing)      (imputed)
1    1        12    7              7              7
2    2        13    6              6              6
3    2        16    8              (missing)      6
4    1        67    13             13             13
5    1        72    17             (missing)      15

Average of Rehab variable:
                    10.2           8.7            9.4
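The averages in the last row of Table 3 can be checked with a few lines of Python. This is only a minimal sketch: the list and function names are made up for illustration, and `None` stands in for a missing value.

```python
from statistics import mean

# Rehab visits from Table 3; None marks a missing value.
visits_actual = [7, 6, 8, 13, 17]
visits_missing = [7, 6, None, 13, None]
visits_imputed = [7, 6, 6, 13, 15]


def mean_of_observed(values):
    """Average the values that are present, skipping missing ones."""
    observed = [v for v in values if v is not None]
    return mean(observed)


avg_actual = mean_of_observed(visits_actual)
avg_missing = mean_of_observed(visits_missing)
avg_imputed = mean_of_observed(visits_imputed)

# Prints: 10.2 8.7 9.4
print(round(avg_actual, 1), round(avg_missing, 1), round(avg_imputed, 1))
```

The imputed average is 0.8 visits away from the actual average, while the average of the column with missing values is about 1.5 visits away.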
How much missingness is acceptable?
As with the threshold for the significance level of
p-values, there is no empirical answer to this
question. Leong and Austin (2006), for instance,
suggested 5%. I have personally seen social science
and health services researchers accept 10%
missingness in actual research work. So, for now,
let us agree on a tolerance level of 5%.
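As a rough illustration of checking a variable against the 5% tolerance level, the sketch below computes the share of missing values in the rehab visits column of Table 1. The names `fraction_missing` and `TOLERANCE` are invented for this example, and `None` stands in for a missing value.

```python
# Rehab visits from Table 1; None marks a missing value.
rehab_visits = [7, 6, None, 13, None]

TOLERANCE = 0.05  # the 5% level agreed on for now


def fraction_missing(values):
    """Share of entries that are missing."""
    return sum(v is None for v in values) / len(values)


share = fraction_missing(rehab_visits)
acceptable = share <= TOLERANCE

# Prints: 40% missing, acceptable: False
print(f"{share:.0%} missing, acceptable: {acceptable}")
```

With 2 of 5 values missing, this tiny data set is far above the tolerance level.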
Is missing data a usual problem?
Yes. In most administrative data sets that I have
worked with, a considerable number of values on the
variables I needed were missing. We need to expect
a significant amount of missing data even when a
data set has a reputation for being clean and
complete. One example of the latter is the
Demographic and Health Surveys, better known as
DHS. These data sets carry a lot of invaluable
information, but missing data is sometimes a
prohibiting factor for researchers using them.
Is imputation always the right solution?
With some exceptions, yes. But I would like you
to answer this question yourself once we are done
with the whole series of presentations.
TYPES OF MISSING
(RUBIN’S TYPOLOGY)
• MISSING COMPLETELY AT RANDOM (MCAR)
• MISSING AT RANDOM (MAR)
• MISSING NOT AT RANDOM (MNAR)
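For readers who prefer notation, the three types are commonly written as follows. The symbols are not from these slides and are introduced here only as an aid: R is the indicator of which values are missing, Y_obs is the observed part of the data, and Y_mis is the missing (unobserved) part.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ still depends on } Y_{\text{mis}}
\end{align*}
```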
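Missing Completely At Random (MCAR)
• The cause of missingness cannot be found by
looking at other observed variables.
• The cause of missingness is independent of the
values of the variable that has missing data.
NO-NO condition: missingness depends neither on the
missing (unobserved) values nor on other observed
variables.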
MCAR: EXAMPLE ONE: Lab samples thrown out
Imagine that blood samples from a randomly selected population,
taken to test fasting blood sugar, have been sent to 3 labs.
One of the labs reports that all of its samples were
accidentally thrown out, so a portion of the data on the blood
sugar variable will be missing in the final data set. Here, the
event that causes the missingness is exogenous to the data
gathering process and to the characteristics of the population
(the likelihood of a missing value is independent of the observed
information). Also, the missingness was independent of whether
blood sugar was high or low.
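The lab example can be mimicked with a small simulation. This is a sketch under invented assumptions: the population size, the blood sugar distribution, and the random assignment of samples to labs are all made up, and only the Python standard library is used.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: fasting blood sugar in mg/dL (values are made up).
n_people = 30_000
blood_sugar = [random.gauss(100, 15) for _ in range(n_people)]

# Each sample goes to one of three labs, chosen independently of the person.
lab = [random.randint(1, 3) for _ in range(n_people)]

# Lab 3 throws its samples out, so those values become missing.
observed = [y for y, assigned in zip(blood_sugar, lab) if assigned != 3]

mean_all = mean(blood_sugar)
mean_observed = mean(observed)
share_missing = 1 - len(observed) / n_people

print(f"missing: {share_missing:.0%}")
print(f"mean of all samples:      {mean_all:.1f}")
print(f"mean of observed samples: {mean_observed:.1f}")
```

Although roughly a third of the values are lost, the two means stay close, because the loss has nothing to do with the blood sugar values or with the people sampled.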
MCAR-1: Missing Completely At Random
Example 1: Lab samples thrown out
1. Variable with considerable missing values: blood sugar
2. Other observed variables: age, sex, and weight, for example
3. Missingness depends on missing (unobserved) values? Did higher
   or lower blood sugar have an effect on the probability of
   missing blood sugar? No
4. Missingness depends on other variables? Did age, sex, or weight
   increase or decrease the probability of missing blood sugar? No
MCAR: EXAMPLE TWO: Coin tossing
This example is the familiar coin toss used in sports to decide
which team gets the ball first. There are two possible results:
heads and tails. Imagine that our data set records the age of the
referee and the type of sport, and that some of the coin toss
results are missing. Obviously, whether a result is missing
depends neither on the observed variables (age of the referee and
type of sport) nor on the missing (unobserved) values. To
elaborate on the latter: if 70% of the recorded results are heads,
that does not imply that 70%, or even the majority, of the
missing results are heads.
MCAR-2: Missing Completely At Random
Example 2: Coin tossing in sport
1. Variable with considerable missing values: result of the coin
   toss
2. Other observed variables: type of sport and age of the referee
3. Missingness depends on missing (unobserved) values? Did the
   probability of a missing result depend on whether the toss came
   up heads? No
4. Missingness depends on other variables? Did the type of sport
   or the age of the referee affect the probability of a missing
   result? No
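Missing At Random (MAR)
• The cause of the missing values is independent of
the missing (unobserved) values
• But it can be predicted from other observed
values
NO-YES condition: missingness does not depend on the
missing (unobserved) values, but it does depend on
other observed variables.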
MAR: EXAMPLE ONE: Females and kidney donation
This example is a study of the effect of kidney donation on the
donor’s household income. If the study finds that female donors
are more likely than male donors to refuse to answer the income
question, the missing pattern on the income variable is called
Missing At Random, or MAR. In this case, women with low income
and women with high income answer the income question with the
same probability. In other words, the missingness is independent
of the missing (unobserved) values.
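A small simulation can show what this pattern does to the data. It is only a sketch: the income levels, the refusal rates, and the assumption that average household income differs between female and male donors are all invented for illustration.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical donors. Income levels and refusal rates are invented.
donors = []
for _ in range(40_000):
    sex = random.choice(["F", "M"])
    income = random.gauss(40_000 if sex == "F" else 50_000, 8_000)
    # Refusal depends on sex only, not on the income itself (MAR).
    refuses = random.random() < (0.5 if sex == "F" else 0.1)
    donors.append((sex, income, refuses))


def avg_income(rows):
    """Mean income over a list of (sex, income, refuses) rows."""
    return mean(income for _, income, _ in rows)


answered = [d for d in donors if not d[2]]

true_mean = avg_income(donors)
observed_mean = avg_income(answered)

# Within each sex, respondents resemble everyone of that sex.
gap_by_sex = {}
for sex in ("F", "M"):
    everyone = [d for d in donors if d[0] == sex]
    responders = [d for d in answered if d[0] == sex]
    gap_by_sex[sex] = avg_income(responders) - avg_income(everyone)

print(f"all donors: {true_mean:.0f}, respondents only: {observed_mean:.0f}")
print({s: round(g) for s, g in gap_by_sex.items()})
```

In this made-up setting the overall average among respondents is too high, because women answer less often, yet within each sex the respondents' average stays close to the truth. That is why the observed variable (sex) can be used to account for the missingness.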
MAR-1: Missing At Random
Example 1: Females and kidney donation
1. Variable with considerable missing values: household income of
   the kidney donor
2. Other observed variables: sex, age, and ethnicity of the donor
3. Missingness depends on missing (unobserved) values? Did women
   with high income, as opposed to women with low income, have a
   greater chance of refusing to answer the income question? No
4. Missingness depends on other variables? Did the sex of the
   donor affect the probability of responding to the income
   question? Yes
MAR: EXAMPLE TWO: Attitudes toward having social insurance
This is a study of attitudes toward implementing a universal
social welfare insurance program. It was found that people
affiliated with one political party tended not to respond to the
insurance question. In this example, the missing pattern on the
social insurance response is MAR because at least one observed
variable (political party) partly determined the likelihood that
the response would be missing. Whether the response toward social
insurance was positive or negative was assumed to be independent
of the missing pattern. This means that the probability of a
missing answer to the insurance question was the same for people
who would have answered negatively and for those who would have
answered positively.
MAR-2: Missing At Random
Example 2: Attitudes toward having social insurance
1. Variable with considerable missing values: yes/no answer to
   having universal social insurance
2. Other observed variables: political party affiliation
3. Missingness depends on missing (unobserved) values? Did a
   positive or negative response to the necessity of having the
   insurance affect the likelihood of missing? No
4. Missingness depends on other variables? Did the political
   affiliation of the person predict the likelihood of
   missingness? Yes
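Missing Not At Random (MNAR)
• The cause of the missing values depends on the
missing (unobserved) values
• And it can usually also be predicted from other
observed values
YES-YES condition: missingness depends on the missing
(unobserved) values and, usually, on other observed
variables as well.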
MNAR: EXAMPLE ONE: Synthetic insulin and blood sugar reduction time
The first scenario is a research study of the effect of a new type
of synthetic insulin on blood sugar reduction time in humans. The
protocol mandates that if the reduction time is greater than one
third of the standard reduction time (defined in the protocol),
the researchers should stop the treatment and refer the patient to
the emergency department. These patients leave the study, and
their final result on reduction time is missing. In this example,
the likelihood of a missing value depends directly on the
unobserved (missing) values. That is, the reduction time itself
(the variable with a considerable number of missing cases)
determines whether or not its value is missing.
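The effect of such a stopping rule can be sketched with a short simulation. Everything numeric here is invented: the distribution of reduction times and the `CUTOFF` value are hypothetical stand-ins, not the rule from the protocol.

```python
import random
from statistics import mean

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical reduction times in minutes; the distribution is invented.
reduction_time = [random.gauss(60, 15) for _ in range(20_000)]

CUTOFF = 75  # hypothetical stand-in for the protocol's stopping rule

# A value is recorded only when it is at or below the cutoff,
# so missingness depends on the value itself (MNAR).
recorded = [t for t in reduction_time if t <= CUTOFF]

true_mean = mean(reduction_time)
recorded_mean = mean(recorded)
share_missing = 1 - len(recorded) / len(reduction_time)

print(f"missing: {share_missing:.0%}")
print(f"mean of all reduction times:      {true_mean:.1f}")
print(f"mean of recorded reduction times: {recorded_mean:.1f}")
```

Because only the long reduction times are lost, the recorded average is too low, and no other variable in the data can fully explain which values went missing.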
MNAR-1: Missing Not At Random
Example 1: Synthetic insulin and blood sugar reduction time
1. Variable with considerable missing values: blood sugar
   reduction time
2. Other observed variables: sex, age, and ethnicity of the
   patient
3. Missingness depends on missing (unobserved) values? Did the
   likelihood of missing depend on the value of the reduction
   time? Yes
4. Missingness depends on other variables? Did the demographics of
   the patient affect the likelihood of missing? Likely
MNAR: EXAMPLE TWO: A new pain killer and experience with pain
The second scenario is a study in which a new pain killer is given
to patients with migraine headache, and they are asked the next
day how much their pain was reduced. It was found that missing
values on the variable ‘how much pain was reduced’ were much more
common among patients who experienced severe pain. The likelihood
of a missing answer therefore depended on the amount of pain
reduction itself.
MNAR-2: Missing Not At Random
Example 2: A new pain killer and experience with pain
1. Variable with considerable missing values: amount of pain
   reduction
2. Other observed variables: sex, age, and having mood disorders
3. Missingness depends on missing (unobserved) values? Did the
   likelihood of missing depend on the amount of pain reduction?
   Yes
4. Missingness depends on other variables? Did the demographics of
   the participant and his or her history of mood disorder affect
   the likelihood of missing? Likely
Thank you. I look forward to seeing you at the
next session.
Please email me your questions at
[email protected]