Using Data Mining Techniques to Discover
Bias Patterns in Missing Data
MONICA CHIARINI TREMBLAY, KAUSHIK DUTTA, and DEBRA VANDERMEER
Florida International University
In today’s data-rich environment, decision makers draw conclusions from data repositories that
may contain data quality problems. In this context, missing data is an important and known problem, since it can seriously affect the accuracy of conclusions drawn. Researchers have described
several approaches for dealing with missing data, primarily attempting to infer values or estimate
the impact of missing data on conclusions. However, few have considered approaches to characterize patterns of bias in missing data, that is, to determine the specific attributes that predict the
missingness of data values. Knowledge of the specific systematic bias patterns in the incidence of
missing data can help analysts more accurately assess the quality of conclusions drawn from data
sets with missing data. This research proposes a methodology to combine a number of Knowledge
Discovery and Data Mining techniques, including association rule mining, to discover patterns in
related attribute values that help characterize these bias patterns. We demonstrate the efficacy of
our proposed approach by applying it to a demo census dataset seeded with biased missing data.
The experimental results show that our approach was able to find seeded biases and filter out most
seeded noise.
Categories and Subject Descriptors: H.4.2 [Information Systems Applications]: Types of Systems—decision support; H.2.8 [Database Management]: Database Applications—data mining
General Terms: Design, Algorithms, Human Factors
Additional Key Words and Phrases: Data quality, missing data, pattern discovery
ACM Reference Format:
Tremblay, M. C., Dutta, K., and VanderMeer, D. 2010. Using data mining techniques to discover
bias patterns in missing data. ACM J. Data Inform. Quality 2, 1, Article 2 (July 2010), 19 pages.
DOI = 10.1145/1805286.1805288. http://doi.acm.org/10.1145/1805286.1805288.
1. INTRODUCTION
With access to vast volumes of data, decision makers frequently draw conclusions from data repositories that, for a variety of reasons, contain data quality
problems. In decision making, data quality is a serious concern [Strong et al.
1997; Wang et al. 1995] that can negatively impact decision performance [Even
and Shankaranarayanan 2007; Fisher et al. 2003; Heinrich et al. 2007; Jung
et al. 2005] and lead to misreporting of information [Lorence 2003].
The incidence of data quality issues arises from the nature of the information supply chain [Sun and Yen 2005], where the consumer of a data product may be several supply-chain steps removed from the people or groups who
gathered the original datasets on which the data product is based [Wang et al.
1995]. These consumers use data products to make decisions, often with financial or time budgeting implications. The separation of the data consumer from
the data producer creates a situation where the consumer has little or no idea
of the level of quality of the data [Shankaranarayan et al. 2003], leading to
the potential for poor decision-making and poorly allocated time and financial
resources.
Missing data, that is, fields for which data is unavailable, is a particularly
important problem, since it can lead analysts to draw inaccurate conclusions
[Mahnic 2001]. When data is extracted from a data warehouse or database (a
common occurrence when aggregating data from multiple sources), it typically
passes through a cleansing process to reduce the incidence of missing values as
far as possible. This cleansing process can only be taken so far. Some missing
data cannot be fixed because the database manager may not have any way of
knowing the value that is missing, as would happen, for example, if a person
left a field in a form blank. At this point, the database manager may choose either to remove records with missing data, losing the value of the other data contained in those records, or to provide the dataset with the missing-data records included. Our interest in this article is in the second case.
Missing data is a widespread occurrence; a recent review of more than 300
articles published in a psychology research journal [McKnight et al. 2007]
found that more than 90% were based on datasets with missing data, and that
the average amount of data missing was more than 30%, yet few of the articles
surveyed in the study mentioned the potential impact of missing data on any
conclusions drawn.
Data values may be missing for a variety of reasons, falling into two general
categories: Missing At Random (MAR) and Missing Not At Random (MNAR)
[Rubin 1976]. In the MAR case, the incidence of a missing data value cannot
be predicted based on other data, while in the MNAR case, there is a pattern
in the missing data. For MAR scenarios, there exist methods for estimating, or
imputing, missing values [Horton and Kleinman 2007].
The incidence of missing data, however, often falls into the MNAR category;
that is, there is bias in the occurrence of null values. This missingness may
occur for a variety of reasons. For example, survey respondents may refuse to
answer questions that they feel reveal information that is too personal, often
based on religious, cultural, or gender norms. Bias may also occur due to a
natural human tendency to suppress unfavorable information. For example, in
health care informatics, patients may choose not to report unhealthy behaviors
that increase risks for certain illnesses.
In MNAR scenarios, if an analyst assumes there is no bias in the missingness of data, the conclusions drawn may be inaccurate. For example, a health
care policy analyst charged with recommending funding allocations for preventive care projects may note an apparent trend showing reduced teenage smoking, and decide to reduce funding for projects aimed at convincing teens not to
smoke. If, however, a large number of males in the 15–20 age range simply did
not report whether or not they smoke (leading to nulls in the data set), then
the apparent trend noted by the analyst may in fact lead to a faulty conclusion.
Any funding decisions based on it could be nonoptimal allocations of financial
resources.
Clearly, for MNAR scenarios, it would be useful to uncover the patterns in
the incidence of missing data in a dataset. Once patterns are uncovered, several steps can be taken, depending on the context in which the bias is found.
If the decision maker is the creator or owner of the dataset, data collection methods can be improved to prevent the occurrence of missing data, given the information on possible biases. In our teenage smoking example, for instance, care could be taken to stress that the data is deidentified and that privacy will be ensured.
If the decision maker does not have control of the creation of the dataset, a
number of approaches can be considered. One option is to infer data for missing
fields, based on similar tuples that have the same values for other nonmissing
attributes. In our teenage smoking case, if 20% of males in the 15–20 age range
did not report whether or not they smoke, but a majority of the remaining 80%
of males in this age range had reported they smoked, then the assumption may
be made that the majority of these respondents with missing smoking data
are indeed smokers (intuitively, this is the basis of imputation techniques).
Alternatively, decisions or actions based on conclusions can be altered. If, for
example, our health care policy analyst knew that a large number of male teens
did not provide data as to whether or not they smoke, he might realize that the
apparent trend in the remaining data might be inaccurate, and retain funding
for teen smoking prevention projects or look for alternative ways to verify the
apparent trend.
Much of the work in addressing missing data focuses on imputing, or estimating, missing values. Shen and Chen [2003] propose a scheme for using association rule mining (ARM) techniques for filling in missing values. Horton and
Kleinman [2007] provide an overview of statistical methods aimed at the same
purpose. Another area of work [Horton and Kleinman 2007; Parssian 2006;
Parssian et al. 2004] provides insight into methods of measuring the impact
of missing data on the quality of conclusions drawn. Our purpose is different;
rather than trying to fill in missing values or assess their impact on conclusions, we attempt to bring to light patterns in the incidence of missing data.
Wang and Strong [1996] defined a set of 15 dimensions of data quality based
on a survey of information systems professionals and researchers. Issues surrounding missing data fall into two of these categories, completeness, and objectivity, which Wang and Strong [1996, p. 42] define as follows.
Completeness: The extent to which data is not missing and is of sufficient
breadth and depth for the task at hand.
Objectivity: The extent to which data is unbiased, unprejudiced, and
impartial.
We are specifically interested in cases where the objectivity problem is
nested within the completeness problem in a dataset, that is, where there are
bias patterns in missing data. This nesting makes it difficult for a decision
maker to notice the full scope of the problem unaided. In this article, we attempt to address these issues to help decision makers uncover bias patterns
in missing data. We focus our research on detecting and characterizing these
patterns. Our contributions in this article are as follows.
—We identify data mining techniques as a potential means of identifying patterns of bias in missing data.
—We propose a methodology for applying well-known data mining techniques
to reveal patterns in missing data for an attribute. Our approach considers
data preparation, the application of rule mining techniques, and aggregating
potentially large result sets.
—We describe important differences between the traditional application of data
mining techniques and their application in the domain of finding patterns in
missing data, and suggest modifications to overcome these differences.
At the end of Rubin’s original work on missing data [Rubin 1976], he makes
a strong call for the need for missing-data pattern research. More recently,
Ordonez et al. [2000] cite patterns in missing data as an interesting area for research. While recent research in this area identifies some common ad hoc patterns, for example, monotonicity (caused in longitudinal studies by dropouts)
[Horton and Kleinman 2007], in an extensive search of the literature, we were
not able to find a general approach to identifying patterns in missing data that
would satisfy Rubin’s 1976 request. Our work here represents a first attempt
to provide a general approach to finding patterns in missing data.
The remainder of this article is organized as follows. In Section 2, we describe our methodological approach and include a small illustrative example. In Section 3, we present a set of initial experimental results and show the efficacy of our approach. In Section 4, we discuss the results of our experiments, explore potential avenues for improving our method, and consider other applicable algorithms. In Section 5, we conclude and discuss future work to further refine our approach.
2. METHODOLOGICAL APPROACH
In order to find systematic bias patterns in the data missing for a specific attribute in a database, we explore the use of KDD (Knowledge Discovery and
Data Mining) techniques to discover patterns in related attributes that may
indicate bias. Several approaches are possible, including both supervised and
unsupervised machine learning techniques. Our goal is to identify which attribute values frequently cooccur with the incidence of missing data for a selected attribute (for example, nulls in yearly income may cooccur in records for
single, home-owner males).
There are many pattern discovery methods that can be considered for missing data pattern analysis, including both supervised techniques such as decision trees, neural networks and regression, and unsupervised techniques such
as cluster analysis and association rule mining. For this context, we wish to
consider the combinations of attribute values that may indicate a bias in the
missing data. Association rule mining, as we describe below, seems to be a
natural fit for this problem context.
Association Rule Mining (ARM), also known as market basket analysis in
the world of marketing, is a widely used data mining technique. The goal of
ARM algorithms is to search for frequent item sets in large transaction databases, such as the items one might purchase at a grocery store. The frequently
occurring items found in a market basket can be used to design promotions,
refine store layouts, and manage other aspects of a consumer’s experience.
Similarly, we seek to find “baskets” of values that indicate a pattern for missing data.
The association task has two goals: to find frequent itemsets and to find
association rules. For association rule mining algorithms, each attribute/value
pair is considered an item. Each itemset has a size, which corresponds to the
number of items it contains (for example, for some demographic data, we could
have a three-item itemset with gender= “female,” age= “30-35,” homeowner=
“true”). The coverage of this itemset is the number of times this combination
occurs in the dataset. Support is coverage expressed as a percentage, and is
often specified by the user ahead of time. For example, support = 4% means
that the itemset appears in 4% of the records.
The second goal of ARM algorithms is to find rules. An association rule has
the form A, B => C with a confidence measure, where confidence represents
the percentage occurrence of the right-hand side (RHS) of the rule, given the
itemset in the left hand side (LHS). Thus, if we had a confidence value of 60%
for the rule if gender=“female” and age=“30-35” => homeowner= true, then
60% of 30–35-year-old females in this dataset are homeowners. In the association task, the user specifies minimum thresholds for support and confidence
prior to running the ARM algorithm. Then, only those rules that have values
at or above the minimum specified thresholds for support and confidence are
generated by the algorithm.
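To make these definitions concrete, the following minimal sketch computes support and confidence for one candidate rule over a toy dataset. We use Python with pandas purely as an illustration vehicle; the attribute names and values are hypothetical, not drawn from any dataset discussed in this article.

import pandas as pd

# Toy records; each attribute/value pair is an item.
df = pd.DataFrame({
    "gender":    ["F", "F", "M", "F", "M"],
    "age":       ["30-35", "30-35", "25-30", "30-35", "30-35"],
    "homeowner": [True, False, True, True, False],
})

# Rule: gender="F" and age="30-35" => homeowner=true
lhs = (df["gender"] == "F") & (df["age"] == "30-35")
both = lhs & df["homeowner"]

support = both.sum() / len(df)       # itemset frequency as a fraction of all records
confidence = both.sum() / lhs.sum()  # fraction of LHS matches where the RHS also holds
print(f"support = {support:.2f}, confidence = {confidence:.2f}")

Here the rule holds for two of the three 30-35-year-old females, giving a confidence of 0.67 and a support of 0.40.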
We generate rules using the association rule algorithm to identify instances
of systematic bias in missing data. In this section, we outline our general approach, which we will illustrate with a small example, and later demonstrate
with an exploratory experimental study.
Table I represents a set of fictional results from a survey where high school
and middle school children were asked if they had tried smoking. It contains
five attributes: school name (as a proxy for socio-economic status), whether a
parent or guardian smoked, gender, age, and whether the person reports having tried smoking. There are two hidden seeded biases: females between the
ages of 15–18 (records 1 and 3) and males whose parents did not smoke (records 2, 4, 6, and 7) tend not to report whether they have tried smoking, regardless of which school they attended. As in a real dataset, the bias is a tendency rather than a certainty; in our small example, respondent 2 did provide an answer.
Figure 1 shows a high-level overview of the steps in our approach. There are
four main steps in finding these patterns: data preparation (which includes sampling, feature selection, and feature transformation), association rule creation, association rule aggregation, and pattern interpretation.
Table I. Small Example Dataset

Record  School     Parent Smokes  Gender  Age  Tried Smoking?
1       Northwest  Y              F       16   NULL
2       Northwest  N              M       16   Y
3       Northwest  N              F       15   NULL
4       Southeast  Y              M       14   NULL
5       Southeast  Y              M       16   Y
6       Southeast  N              M       17   NULL
7       Southeast  N              M       12   NULL
8       Wesley     Y              M       16   Y
9       Wesley     N              F       14   N
10      Wesley     Y              M       17   Y
Fig. 1. Pattern discovery approach.
The fourth step, interpreting the patterns, involves a decision maker using these patterns in the context of drawing conclusions from the dataset.
2.1 Data Preparation
The first step in the KDD process is data preparation [Fayyad et al. 1996]. In
this application, three data preparation techniques are described: sampling,
feature selection, and feature transformation.
2.1.1 Sampling. In the KDD process it is neither necessary nor efficient
to use all available data. Instead, we draw a representative sample of
the data to build our KDD models. However, in our case, we cannot simply
randomly sample to obtain a set number of records, since missing data for an
attribute is a rare occurrence. If we sample randomly, we risk having most, if
not all, of our records being complete, that is, we might lose all records with missing
data. Further, if we underrepresent records with missing data, the association
rule algorithm will not generate any rules for cases where data is missing.
Association rule mining discovers patterns that occur with a frequency above
the minimum support threshold. Therefore, in order to find associations involving rare events, we could try to set the threshold value for support to a very low
number. However, this would result in the generation of large numbers of rules
to analyze, and still would not guarantee that we would find rules for these
rare biases. It could also dramatically increase the time needed for execution.
This is a well-known problem, also referred to as the unbalanced dataset
problem. It occurs when the class of interest is highly underrepresented in the
data. For example, fraudulent medical claims may only be present in 1% or
less of a dataset. Similar situations occur in rare disease diagnosis, network
intrusion detection, and bioterrorism detection. In order to ensure a good representation of missing data records, we use stratified sampling [Han and Kamber
2001; Trochim 1999]. Stratified sampling, also called proportional or quota random sampling, involves dividing the available population into homogeneous subgroups and then taking a simple random sample from each subgroup [Trochim 1999]. In our case, we have two subgroups: one containing records with missing data, and one without. We sample so that occurrences of missing data for the
attribute for the resulting dataset constitute at least 30% of our cases. In other
words, if we want to use 1000 records, we ensure that 300 of those records are
random-sampled from the missing-data subgroup, and 700 are from the other
subgroup.
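As a sketch of this step, the following fragment draws such a stratified sample with pandas; the function and column names (stratified_sample, missing_flag) are our own, not part of the original study.

import pandas as pd

def stratified_sample(df, flag_col="missing_flag", n_total=1000,
                      missing_frac=0.3, seed=42):
    # Ensure records with missing data make up missing_frac of the sample.
    n_missing = int(n_total * missing_frac)
    missing = df[df[flag_col] == 1].sample(n=n_missing, random_state=seed)
    complete = df[df[flag_col] == 0].sample(n=n_total - n_missing, random_state=seed)
    # Concatenate the two strata and shuffle the result.
    return pd.concat([missing, complete]).sample(frac=1, random_state=seed)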
Stratified sampling increases rule support (the numerator of the support
computation remains the same, but the denominator decreases), but does not
change confidence (both the numerator and denominator of the confidence computation change proportionally). It does not hurt to increase the value of support for missing data records, since the purpose of the support metric is to
identify occurrences of sufficient frequency in the dataset. Since we want to extract rules for cases where the target attribute value is null, which we already
know is a rare event, the actual value of support is not important.
Because stratified sampling requires a significant sample size to show its effect on a dataset, we do not illustrate it on our small example: with only 10 rows, it is substantially smaller than would be acceptable for most data mining algorithms.
2.1.2 Feature Selection. Databases can contain thousands of attributes,
making it challenging to determine which features (attributes) are contributing to a systematic bias. The goal is to feed forward only those attributes that
are most relevant in identifying a bias. This step is important for two reasons.
First, it reduces the amount of data that must be analyzed, which improves the
scalability of our approach. Second, it minimizes the number of rules generated, which helps to simplify the interpretation of these rules.
Feature selection is an important and challenging step, particularly for our
approach. For the same reason that logistic regression was ruled out as an approach for finding missing-data biases, traditional dimension reduction techniques outlined in the literature, such as stepwise forward selection, stepwise backward elimination, and combinations of forward selection and backward elimination [Barbara et al. 1997; Dash et al. 1997; Han and Kamber 2001; Neter and Wasserman 1974], will not be accurate. We are searching for combinations of attribute values, so we cannot consider each attribute individually and its correlation with the value of our selected target attribute (which is either null or not null).
missing data. In fact, it might be evenly distributed for both the missing and
not-missing cases. However, the bias for missing data might be for a certain
value of gender combined with certain values for age and ethnicity. Feature
selection techniques (e.g., forward selection) do not consider combinations of
attributes, and could discard important attributes.
In our approach, we utilize decision trees, a well-known machine learning
technique that was originally intended for classification [Han and Kamber
Table II. Resulting Sample Dataset

Parent Smokes  Gender  Age    Missing Flag
Y              F       15-16  1
N              M       15-16  0
N              F       15-16  1
Y              M       14-15  0
Y              M       16-17  0
N              M       17-18  1
N              M       12-13  1
Y              M       16-17  0
N              F       14-15  0
Y              M       16-17  0
2001]. The decision tree is a classifier in the form of a tree structure, where
each node is either a “decision” node or a “leaf” node. A decision node indicates
a test based on the values of a single attribute, and a leaf node indicates the
value of the target attribute. When using decision tree induction for feature
selection, the variables that do not appear in the tree are considered irrelevant
[Han and Kamber 2001].
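As a sketch of this selection step, the fragment below substitutes scikit-learn's decision tree for the C4.5 implementation referenced above; sample_df is a hypothetical stratified sample with a missing_flag target column. Attributes with zero importance never appear in the tree and are considered irrelevant.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.get_dummies(sample_df.drop(columns=["missing_flag"]))  # one-hot encode attributes
y = sample_df["missing_flag"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Keep only the features the tree actually used for a split.
selected = X.columns[tree.feature_importances_ > 0]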
In our small example, there are not enough cases to illustrate the use of
decision trees. In a larger (but similar) dataset, we would expect the decision tree not to select school as a splitting attribute, since there is no
discernible correlation between missing data in the “tried smoking” field and
the school field.
2.1.3 Data Transformation. In our approach, we recommend two transformations. First, we replace the attribute for which we are investigating missing
data with a flag (missing-flag) that indicates whether the data value is missing
or not. In the case of our small example, we would replace the “tried smoking” attribute with a flag which is set to 1 if the attribute was missing, or set
to 0 if it was not. Secondly, since association rule algorithms generally do not
work with continuous variables, we divide those variables into groups and create bins, either based on their distribution or in a way that makes sense for
that particular variable. In our small example, age is a continuous variable we could bin, and the “Tried Smoking?” attribute becomes the binary “Missing Flag” variable. Table II illustrates our resulting dataset after feature selection and transformation.
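Both transformations are mechanical; a minimal sketch in pandas follows, where survey_df and the bin edges are our own assumptions (the bins here are coarser than those in Table II).

import pandas as pd

df = survey_df.copy()

# Transformation 1: replace the investigated attribute with a missing-data flag.
df["missing_flag"] = df["tried_smoking"].isna().astype(int)
df = df.drop(columns=["tried_smoking"])

# Transformation 2: bin continuous variables such as age into categorical ranges.
df["age"] = pd.cut(df["age"], bins=[12, 14, 16, 18],
                   labels=["12-14", "15-16", "17-18"], include_lowest=True)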
After we have created a stratified dataset, selected the attributes, and transformed the variables as needed, we are ready to feed the data to an association
rule mining algorithm and aggregate the results.
2.2 Association Rule Creation
Typically, association rule mining searches for all possible patterns from the
given data, that is, the RHS of a rule may contain any attribute of the database.
However, we are interested only in rules that have the attribute missing-flag in
the RHS. For this purpose, we select an Apriori algorithm [Li et al. 2001], using
the class association rules (CAR) approach [Scheffer 2001], which generates
rules with a set of target attributes. In this way, we can mine the prepared
Table III. Sample Dataset Generated Rules
1. Parent Smokes = Y, Gender = F, Age = 15-16 → Missing Flag = 1
2. Parent Smokes = N, Gender = F, Age = 15-16 → Missing Flag = 1
3. Parent Smokes = N, Gender = M, Age = 17-18 → Missing Flag = 1
4. Parent Smokes = N, Gender = M → Missing Flag = 1
data for association rules involving the missing-flag as the class attribute, that
is, the attribute on the RHS of the association rule.
Association rule mining discovers rules based on support and confidence.
Traditionally, for a rule X→Y, the support is defined by S(X) = #X / N, where #X denotes the number of occurrences of X in the total population of N records. The confidence is defined by S(X ∩ Y) / S(X), where X ∩ Y denotes the instances where both X and Y occur.
In the case of missing data analysis, the occurrence of missing data is rare
compared to the total population, so support (using the traditional definition of
the term) for rules for the event will be very low compared to most other rules
the miner will generate. To address this, we modify the definition of support as follows: for a rule X→Y (where Y is missing-flag=1), S′(X) = #X / M, where M is the total count in the population where the target attribute is null, that is, where missing-flag = 1. Here, we are interested in rules of type X→missing-flag=1 with modified support S′(X). So, we apply the standard CAR approach to generate rules with minimum support of S(X) = S′(X) × M / N. This will generate rules indicating the correlations between the incidence of missing data and the values of attribute sets.
Finally, we note that the CAR approach generates rules for all cases for the
missing-flag attribute, that is, with both missing-flag=1 and missing-flag=0 in
the RHS. We are interested only in rules for missing-flag=1, so we filter out any
rules with missing-flag=0 in the RHS, and retain those with missing-flag=1 in
the RHS.
For our small example, M = 4 and N = 10. Suppose we target a modified support of S′(X) = 0.1. To achieve this, we select a minimum support of S(X) = 0.1 × 4/10, that is, 0.04, for the ARM technique. We further filter the rules to those where missing-flag=1 in the RHS. Table III illustrates rules generated by this approach on our small dataset. Actual support for each of these rules would be 1/4, or 25%, given that each rule occurs once and there are four occurrences of missing data in the dataset. Due to the small size of the dataset, the confidence of all these rules turns out to be 1.0, since there are no records without missing data for any of the itemsets in the LHS of the rules in Table III. This will not be the case in the larger dataset used in Section 3.
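To sketch how this step could be automated outside WEKA, the fragment below uses the open-source mlxtend library (our choice for illustration only). The scaled minimum support follows the formula above, and df is the prepared sample of Table II under hypothetical column names.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

M, N = (df["missing_flag"] == 1).sum(), len(df)  # M = 4, N = 10 in the small example
target = 0.1                                      # desired modified support S'(X)
min_support = target * M / N                      # scaled support for the miner: 0.04

# One-hot encode attribute=value items, e.g. "gender=F", "age=15-16".
items = pd.get_dummies(df.astype(str), prefix_sep="=").astype(bool)

frequent = apriori(items, min_support=min_support, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)

# Keep only rules whose RHS is exactly missing_flag=1.
rules = rules[rules["consequents"] == frozenset({"missing_flag=1"})]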
2.3 Rule Aggregation
The association rule mining method may generate a large number of rules,
leading to the potential for data overload. To avoid this, we introduce the possibility of aggregating rules to reduce the size of the rule set delivered to the
analyst. Our goal here is to do this without significantly damaging the results.
We first consider the possibility of combining rules where one attribute
clearly does not contribute to the correlation with the missingness of data. Consider the following rules from our small dataset.
(1) Parent Smokes = Y, Gender = F, Age = 15–16 → Missing Flag = 1
(2) Parent Smokes = N, Gender = F, Age = 15–16 → Missing Flag = 1
In this example, “Y” and “N” are the only values that Parent Smokes may take in the database. Since these two rules differ only in their value of Parent Smokes, Parent Smokes clearly does not contribute to the correlation with the missingness of data. This may occur in the case of any attribute whose
values are roughly equally distributed in the dataset. Clearly, then, we can
merge rules (1) and (2) without loss of meaning to produce a new rule: Gender = F, Age = 15–16 → Missing Flag = 1.
Second, we propose presenting the more general rule in cases where one
generated rule is a more detailed version of another (or a set of other rules).
Consider, for example, the following pair of rules.
(3) Parent Smokes = N, Gender = M, Age = 17–18 → Missing Flag = 1
(4) Parent Smokes = N, Gender = M → Missing Flag = 1
Each of these generated rules is supported by a specific set of records in
the database. Here, rules (3) and (4) share two attribute-value pairs: Parent
Smokes = N and Gender = M. Because rule (4) contains only these two items, it
is supported by all the records that support rule (3), and more. Thus, rule (4) is
the more general rule, and the one we will choose to present in the aggregated
rule set. Clearly, there is some loss of detail in this type of aggregation, but
there is no loss of coverage in terms of the records that support the rules—all
records that were counted as supporting rules in the original rule set are still
counted as supporting rules in the aggregated rule set. Thus, the rules in our
example would be reduced to the following set.
(1) Gender = F, Age = 15–16
(2) Parent Smokes = N, Gender = M
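The second heuristic, keeping only the more general rule, is straightforward to automate. A minimal sketch, representing each rule's LHS as a set of attribute-value items (a representation of our own choosing):

def drop_subsumed(rules):
    # Discard any rule whose LHS strictly contains another rule's LHS;
    # the smaller LHS is the more general rule and covers all the same records.
    return [r for r in rules if not any(other < r for other in rules)]

rules = [
    frozenset({"ParentSmokes=N", "Gender=M", "Age=17-18"}),  # rule (3)
    frozenset({"ParentSmokes=N", "Gender=M"}),               # rule (4)
]
print(drop_subsumed(rules))  # only the more general rule (4) survives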
In our small example, and in our experiments, the rule set generated was
small, so we simply aggregated rules manually using these heuristics. It should
be possible, however, to automate this step using techniques from predictive association rule mining, for example, as described in Scheffer [2001] or applying
Operations Research (OR) techniques. Below we present an outline of one possible OR model to identify the minimum set of rules from the complete rule set.
Let our original data set consist of N records with K attributes. Let us also
assume that our approach has generated a total of J rules.
We create a decision variable,
x_j = 1, if rule j ∈ J is included in the final rule set; 0 otherwise.
We identify the following exogenous parameter of the model:
p_ij = 1, if a record i ∈ N supports a rule j ∈ J; 0 otherwise.
Table IV. Data Description

Attribute                   Number of Distinct Values  Values
Marital Status              2                          S, M
Gender                      2                          F, M
Number of Children          6                          0, 1, 2, 3, 4, 5, 6
Number of Children at Home  6                          0, 1, 2, 3, 4, 5, 6
Home Owner Flag             2                          FALSE, TRUE
Number of Cars Owned        5                          0, 1, 2, 3, 4, 5
Commute Distance            5                          0-1, 1-2, 2-5, 5-10, 10+
Occupation                  5                          Professional, Manual, Clerical, Management, Skilled Manual
Yearly-Income               Continuous                 10000 to 170000
The one helper variable of the model is,
u_i = 1, if a record i ∈ N supports any rule j ∈ J; 0 otherwise.
The integer programming model for the selection of the minimum rule set is as follows:

Minimize   Σ_{i∈N} Σ_{j∈J} p_ij x_j

Subject to:

Σ_{j∈J} p_ij x_j ≥ u_i,   ∀i ∈ N    (1)
The objective function minimizes the total number of rules needed to cover
all the rows in the original rule set. The constraint (1) ensures that all records
in the original dataset that support rules in the original rule set are still
counted as supporting rules in the aggregated dataset.
The above integer programming (IP) model maps to the 0-1 min-knapsack problem [Csirik et al. 1990]. Though the knapsack problem is NP-complete, and may be computationally challenging to solve, there exists a well-developed greedy heuristic [Martello and Toth 1991] with log-linear time complexity, O(|J| log |J|), that provides results within 2% of the optimal result. We can apply such a heuristic algorithm to solve the given IP problem and find the aggregated rule set quite efficiently.
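As an illustration of the greedy idea, the sketch below implements a generic greedy covering procedure (not the specific heuristic of Martello and Toth [1991]): repeatedly pick the rule covering the most still-uncovered records.

def greedy_rule_cover(coverage):
    # coverage: dict mapping rule id -> set of record ids supporting that rule.
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        best = max(coverage, key=lambda j: len(coverage[j] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

# Example: rule "r2" covers everything "r1" does, so "r2" alone is chosen.
print(greedy_rule_cover({"r1": {1, 2}, "r2": {1, 2, 3}}))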
Finally, we note that this model is not complete—further fine tuning to address selecting rules with higher confidence measures, ensuring that the total
support of the final rule set is above a certain threshold, etc., is left for future
research.
3. EXPERIMENTAL RESULTS
In order to assess the efficacy of our approach, we extracted a dataset of 18,000
transactions from a sample data warehouse [Microsoft 2005]. A description of
the dataset can be found in Table IV.
We first introduce some systematic biases for missing data, and then apply
our proposed approach to determine whether it is able to uncover them. We
chose Yearly-Income as our target column. We modified the experimental data
Table V. Seeded “Systematic Bias” Missing Rules

Rule            Marital Status  Gender  Total Children  House Owner Flag  Null Count  Not Null Count  Total
1               M               M       0               -                 920         21              941
2               M               -       5               FALSE             176         2               178
3               S               M       5               -                 213         3               216
4               S               F       0               -                 1640        54              1694
5               -               M       2               FALSE             535         18              553
Noise                                                                     255         -               255
Remaining Data                                                            -           14,163          14,163
Total                                                                     3739        14,261          18,000
Note that the systematic biases in Table V were selected for experimental purposes only, and are not intended to have any intrinsic meaning. To do otherwise would surely cause offense.
to set the Yearly-Income attribute to null (to represent missing data) for records
that satisfy one of the attribute value-sets described in Table V.
To create a realistic environment, we also introduced some noise to the data.
—In about 2% of the cases that match one of the attribute value-sets described
in Table V, the Yearly-Income attribute value is left unchanged, that is,
not null.
—Yearly-Income is set to null for 1.4% of the cases in the dataset that do not
match the patterns in Table V.
—A total of 20% of the data has a null value in the Yearly-Income field (inclusive of seeded noise nulls).
Next, we created a new Boolean column Income-flag in the database, which
acts as an indicator (missing-flag) specifying whether the Yearly-Income attribute of a record is null or not. We set this value to 1 (i.e., true) for records
where Yearly-Income is null, and 0 for records where the Yearly-Income value
is not null.
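For concreteness, the sketch below shows how one seeded bias and the noise could be injected with pandas; the DataFrame name data, the column names, and the random seed are our own assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One seeded bias from Table V (rule 1): married males with no children.
bias = ((data["MaritalStatus"] == "M") & (data["Gender"] == "M")
        & (data["TotalChildren"] == 0))

# Null out Yearly-Income for ~98% of matching records (2% left intact as noise) ...
data.loc[bias & (rng.random(len(data)) < 0.98), "YearlyIncome"] = np.nan

# ... and for 1.4% of non-matching records (seeded noise nulls).
data.loc[~bias & (rng.random(len(data)) < 0.014), "YearlyIncome"] = np.nan

# Indicator column used as the mining target.
data["IncomeFlag"] = data["YearlyIncome"].isna().astype(int)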
3.1 Experimental Process
In this section, we describe how we applied each of the steps of our proposed
approach in this experiment.
3.1.1 Stratified Sampling. In the Stratified Sampling step, we kept all the
missing data rows of our dataset (Income-flag=1) and seeded noise, and then
randomly sampled the remaining data so that the remaining 70% of the dataset
contained nonmissing records (Income-flag=0). As illustrated in Figure 2, this
new stratified database contained a total of 11,940 rows.
3.1.2 Feature Selection. Using the modified stratified sample dataset, we
used the Weka data mining tool [Witten and Frank 2005] to run a C4.5 decision tree. Figure 3 shows the results of the feature selection step. We select
only those attributes that appear in the resulting tree. Based on this approach,
we conclude that only TotalChildren, Gender, HouseOwnerFlag, and MaritalStatus are important in classifying whether a record has high potential to have
Income-flag=1 (i.e., where Yearly-Income is null). In the next step, we run association rule mining by only selecting these relevant attributes.
Fig. 2. Stratified sampling.
3.1.3 Association Rule Mining. Using the Apriori algorithm [Agrawal and
Srikant 1994] in the WEKA tool, we generated rules from the stratified
sample dataset by selecting only the four relevant attributes from the decision
tree: TotalChildren, Gender, HouseOwnerFlag, and MaritalStatus. We used
the class association rule mining technique in order to list only those rules that
contain Income-flag on the RHS. Prior to running the mining algorithm, we
converted all continuous-data columns of the input dataset to categorical data,
since association rule mining runs on categorical data.
The minimum confidence was set high (0.95) for the Apriori algorithm. Also,
the minimum support S(X) was set low, to 0.01, based on the modified support heuristic in Section 2.2. The rules were ranked by confidence, and limited to
200 rules. Finally, once we had the list of generated rules, we filtered the results
for those rules that referred to Income-flag=1 indicating Yearly-Income=null.
Figure 4 shows all such rules found from the WEKA environment.
3.1.4 Rule Aggregation. In this step, we aggregated multiple rules together
to reduce the number of total rules. Though elaborate formal algorithm development is possible for such aggregation, for the purpose of this case study we
merged the result rules manually by following the simple heuristics described
in Section 2.3. In future research, we intend to develop an algorithm that will
aggregate rules from a large number of generated rules.
3.2 Experimental Results
The aggregated rule set is presented in Table VI. The rules labeled A, B, C,
D, and E exactly describe the seeded patterns, which demonstrates that our
approach was able to find all seeded biases in the dataset. The approach also
found an additional rule (labeled F in Table VI), Gender=M, TotalChildren=5
and HouseOwnerFlag=False, which was not part of the seeded bias. We could
tune the support threshold to prevent this rule from being generated, but doing
so runs the risk of losing rules for actual seeded bias. For example, support for
Rule F is 142/11,940, while support for Rule B (in Table VI) is 176/11,940. Here,
increasing the support threshold will prevent the additional rule (F) from being
generated, but will also likely prevent rule B, a seeded bias, from being generated. In future research we intend to focus on this tradeoff and the parameters affecting it.
Fig. 3. Decision tree output.
3.3 Validation of the Approach
In this study, we were able to validate our technique by checking if the seeded
biases were identified by the rules. In a nonexperimental context using real
data in which biases are not known in advance, a possible validation approach
is to divide the dataset into two halves and verify that similar patterns of missingness exist when subjecting each half to our method. To demonstrate the applicability of this method, we divide our sample dataset into two equally sized datasets and compute the confidence values of the derived rule sets on each half separately.
Fig. 4. Generated rules.
Table VI. Supported Rules

Supported   WEKA Identified Rule  Induced Rule      Marital  Gender  Total     House       Rule
Rule Label  Numbers (Figure 4)    Number (Table V)  Status           Children  Owner Flag  Confidence
A           11, 23, 73            1                 M        M       0         -           0.98 (920/941)
B           5                     2                 M        -       5         FALSE       0.99 (176/178)
C           1, 7                  3                 S        M       5         -           0.99 (213/216)
D           34, 70, 77            4                 S        F       0         -           0.97 (1640/1694)
E           60, 71, 80            5                 -        M       2         FALSE       0.97 (535/553)
F           19                    NONE              -        M       5         FALSE       0.98 (142/145)
Table VII contains these confidence values, showing that both datasets generate confidence values similar to those in Table VI.
Thus, we can safely conclude that this rule validation procedure can easily be
applied in nonexperimental datasets where the hidden bias patterns are not
known in advance.
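A sketch of this split-half validation, where rule_confidence is a hypothetical helper of our own and the split uses scikit-learn for convenience:

import pandas as pd
from sklearn.model_selection import train_test_split

def rule_confidence(df, lhs):
    # Confidence of lhs -> IncomeFlag=1; lhs maps attribute -> required value.
    mask = pd.Series(True, index=df.index)
    for col, val in lhs.items():
        mask &= df[col] == val
    return (df.loc[mask, "IncomeFlag"] == 1).mean()

half1, half2 = train_test_split(data, test_size=0.5, random_state=0)
rule_a = {"MaritalStatus": "M", "Gender": "M", "TotalChildren": 0}
print(rule_confidence(half1, rule_a), rule_confidence(half2, rule_a))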
4. DISCUSSION
In this exploratory study, we have demonstrated that association rule mining
can be used successfully to describe systematic biases in missing data. Understanding the set of values that are normally associated with missing data
in a certain attribute can be of significant benefit to researchers. Consider
the example presented in the introduction to this article, where the authors of
McKnight et al. [2007] described a review of 300+ articles from a psychology
journal and found that more than 90% were based on data sets with missing data, and that the average amount of data missing was more than 30%.
A missing data analysis for the underlying datasets might reveal patterns of
Table VII. Method Validation—Approach 1

Supported   Marital  Gender  Total     House       Rule Confidence  Rule Confidence
Rule Label  Status           Children  Owner Flag  (Sample Set 1)   (Sample Set 2)
A           M        M       0         -           0.97             0.98
B           M        -       5         FALSE       0.99             0.99
C           S        M       5         -           0.99             0.98
D           S        F       0         -           0.97             0.97
E           -        M       2         FALSE       0.97             0.97
F           -        M       5         FALSE       0.99             0.97
missingness, potentially resulting in the modification of conclusions, or even
the invalidation of the results. Our proposed technique is broadly applicable
across domains where data analysis is a significant research method.
Certainly other statistical and knowledge discovery techniques should be
explored, but they will not be as straightforward in generating a set of rules
that consider combinations of attribute values. Cluster analysis is an interesting possibility. Clustering algorithms find natural groupings within datasets
based on different types of similarity metrics [Jain and Dubes 1988]. In our
case, we could examine which attribute values cluster together with the occurrence of missing data, though the resulting clusters could prove difficult to interpret.
We could also explore the use of decision trees. The approach would differ
slightly, in that we would train models that would classify whether data is
missing for an attribute. Decision tree algorithms evaluate splits or decision
nodes based on the available attribute values, which can be converted into rule
sets [Breiman et al. 1984; Quinlan 1993]. Decision trees could be useful, but
would be time-consuming to apply, because we would need to investigate each
branch whose terminal node indicates missing data in order to discover the
rules. We did some preliminary testing of these two alternative approaches:
cluster analysis and decision trees, and compared the results based on the number of rules found and the ease of finding and interpreting rules.
Using cluster analysis (entropy minimization) we were able to induce four
of the rules by looking at cluster membership. We discovered that this was an
extremely time-consuming exercise. Additionally, it did not provide measures
of confidence or support, since we are inducing the rules. We also noted that
the execution time was very slow (30 minutes for our small dataset).
Decision trees found all the same rules as ARM. However, unlike ARM, we had to follow the branches of the tree to induce the rules. Decision trees are a promising approach, but they may
have scalability issues. In our small dataset it was not difficult to investigate
the tree structure to extract the rules. In a larger dataset, with more attributes,
this may be more taxing. Additionally, there are extra steps involved because
we will need to split the data in order to avoid overfitting. Yet, decision trees
warrant more investigation, since there may be ways to automate rule induction. We do plan to test both decision trees and ARM on a much larger dataset
in future research.
An additional step we did not pursue is working with domain experts to explore and explain our results. In our contrived example, the biases are seeded.
In a real dataset, it is likely that only a domain expert will be able to help in understanding the potential biases identified.
5. CONCLUSION
This research explores a promising method for utilizing machine learning techniques to identify potential patterns of bias in the incidence of missing data. We
describe how to reduce the feature space, how to conduct necessary transformations, and how to uncover patterns in the data that describe the incidence of
missing data.
There are several ways this methodology can be integrated into existing systems. The most obvious is to identify the bias and fix the errors. This might
prove difficult if the data comes from surveys, because it will be hard to
trace the source. Additionally, when information is purposefully withheld, it is
likely that respondents still will not want to divulge it. In the case where the
data are accumulated from several sources, the systematic bias might point to
an error in the data collection process. For example, a large federal agency
that collects data from several agencies might be able to identify that it has interoperability issues that cause data to be systematically truncated in certain
circumstances. If fixing at the source is not possible, certainly this knowledge
must be disseminated to decision makers. This process, however, can be complicated, and can cause decision makers to lose confidence in the dataset. The
information could be added to the metadata and reported to decision makers
via reports or at query time. This procedure is certainly an interesting area for
future study.
For each of the steps in our methodology, we can identify other potential
techniques that may produce better (or faster) results, as well as interesting
avenues for future research. For example, feature selection is challenging, especially for larger datasets, and is an area that warrants further research. We
consider association rule mining the best approach for finding correlations indicating missingness, but certainly other data mining techniques, for example
decision trees, need to be considered and tested. Furthermore, in our case, several rules were generated for each bias. Finding a way to collapse rules that
are very similar would facilitate the understanding of these rules and make
the procedure more scalable as we consider larger numbers of attributes. An
interesting extension to this study is to conduct a sensitivity analysis on the
ability of such techniques to find these systematic patterns by experimenting
with settings of confidence and support and the number of available events
(cases of missing data). Finally, applying our method to a real dataset and generating rules that domain experts could interpret as biases in missing data is
an important next step.
REFERENCES

Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, J. B. Bocca, M. Jarke, and C. Zaniolo, Eds., Morgan Kaufmann Publishers Inc., San Francisco, CA, 487–499.
Barbara, D., DuMouchel, W., Faloutsos, C., Haas, P. J., Hellerstein, J. M., Ioannidis, Y. E., Jagadish, H. V., Johnson, T., Ng, R., Poosala, V., Ross, K. A., and Sevcik, K. C. 1997. The New Jersey data reduction report. IEEE Data Engin. Bull. 20, 3–45.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. Wadsworth International.
Csirik, J., Frenk, J. B. G., Labbe, M., and Zhang, S. 1990. Heuristics for the 0-1 min-knapsack problem. Erasmus University Rotterdam, Econometric Institute, Rotterdam.
Dash, M., Liu, H., and Yao, J. 1997. Dimensionality reduction of unsupervised data. In Proceedings of the 9th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE Computer Society, 532.
Even, A. and Shankaranarayanan, G. 2007. Utility-driven assessment of data quality. Data Base Advances Inform. Syst. 38, 76–93.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. The KDD process for extracting useful knowledge from volumes of data. Comm. ACM 39, 27–34.
Fisher, C. W., Chengalur-Smith, I., and Ballou, D. P. 2003. The impact of experience and time on the use of data quality information in decision making. Inform. Syst. Resear. 14, 170–188.
Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA.
Heinrich, B., Kaiser, M., and Klier, M. 2007. How to measure data quality? A metric-based approach. In Proceedings of the International Conference on Information Systems (ICIS).
Horton, N. J. and Kleinman, K. P. 2007. Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Amer. Statist. 61, 79–90.
Jain, A. K. and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ.
Jung, W., Olfman, L., Ryan, T., and Park, Y. T. 2005. An experimental study of the effects of contextual data quality and task complexity on decision performance. 149–154.
Li, J., Shen, H., and Topor, R. W. 2001. Mining optimal class association rule set. In Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, D. W.-L. Cheung, G. J. Williams, and Q. Li, Eds., Springer-Verlag, 364–375.
Lorence, D. P. 2003. The perils of data misreporting. Comm. ACM 46, 85–88.
Mahnic, V. 2001. Data quality: A prerequisite for successful data warehouse implementation. Informatica Slovene Soc. Informatika 25, 183–188.
Martello, S. and Toth, P. 1991. Heuristic algorithms for the multiple knapsack problem. Comput. 27, 93–112.
McKnight, P. E., McKnight, K. M., Sidani, S., and Figueredo, A. J. 2007. Missing Data: A Gentle Introduction. Guilford Press, New York.
Microsoft. 2005. Microsoft SQL Server 2005 sample data warehouse.
Neter, J. and Wasserman, W. 1974. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs. R. D. Irwin, Homewood, IL.
Ordonez, C., Santana, C. A., and Braal, L. 2000. Discovering interesting association rules in medical data. In Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. 78–85.
Parssian, A. 2006. Managerial decision support with knowledge of accuracy and completeness of the relational aggregate functions. Decis. Supp. Syst. 42, 1494–1502.
Parssian, A., Sarkar, S., and Jacob, V. S. 2004. Assessing data quality for information products: Impact for selection, projection, and Cartesian product. Manage. Sci. 50, 967–982.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Rubin, D. B. 1976. Inference and missing data. Biometrika 63, 581–592.
Scheffer, T. 2001. Finding association rules that trade support optimally against confidence. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery. 424–435.
Shankaranarayan, G., Ziad, M., and Wang, R. Y. 2003. Managing data quality in dynamic decision environments: An information product approach. J. Datab. Manage. 14, 14–32.
Shen, J.-J. and Chen, M.-T. 2003. A recycle technique of association rule for missing value completion. In Proceedings of the 7th International Conference on Advanced Information Networking and Applications. IEEE Computer Society, 526.
Strong, D. M., Lee, Y. W., and Wang, R. Y. 1997. Data quality in context. Comm. ACM 40, 104–110.
Sun, S. and Yen, J. 2005. Information supply chain: A unified framework for information-sharing. In Intelligence and Security Informatics. Springer, 422–428.
Trochim, W. M. K. 1999. The Research Methods Knowledge Base. Cornell University, Ithaca.
Wang, R. Y., Reddy, M. P., and Henry, B. K. 1995a. Toward quality data: An attribute-based approach. Decis. Supp. Syst. 13, 349–372.
Wang, R. Y., Storey, V. C., and Firth, C. P. 1995b. A framework for analysis of data quality research. IEEE Trans. Knowl. Data Engin. 7, 623–640.
Wang, R. Y. and Strong, D. M. 1996. Beyond accuracy: What data quality means to data consumers. J. Manag. Inform. Syst. 12, 5–34.
Witten, I. H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco.
Received October 2008; revised April 2009, August 2009, December 2009; accepted March 2010