Using Data Mining Techniques to Discover
Bias Patterns in Missing Data
MONICA CHIARINI TREMBLAY, KAUSHIK DUTTA, and DEBRA VANDERMEER
Florida International University
In today’s data-rich environment, decision makers draw conclusions from data repositories that
may contain data quality problems. In this context, missing data is an important and known problem, since it can seriously affect the accuracy of conclusions drawn. Researchers have described
several approaches for dealing with missing data, primarily attempting to infer values or estimate
the impact of missing data on conclusions. However, few have considered approaches to characterize patterns of bias in missing data, that is, to determine the specific attributes that predict the
missingness of data values. Knowledge of the specific systematic bias patterns in the incidence of
missing data can help analysts more accurately assess the quality of conclusions drawn from data
sets with missing data. This research proposes a methodology to combine a number of Knowledge
Discovery and Data Mining techniques, including association rule mining, to discover patterns in
related attribute values that help characterize these bias patterns. We demonstrate the efficacy of
our proposed approach by applying it to a demo census dataset seeded with biased missing data.
The experimental results show that our approach was able to find seeded biases and filter out most
seeded noise.
Categories and Subject Descriptors: H.4.2 [Information Systems Applications]: Types of Systems—decision support; H.2.8 [Database Management]: Database Applications—data mining
General Terms: Design, Algorithms, Human Factors
Additional Key Words and Phrases: Data quality, missing data, pattern discovery
ACM Reference Format:
Tremblay, M. C., Dutta, K., and VanderMeer, D. 2010. Using data mining techniques to discover
bias patterns in missing data. ACM J. Data Inform. Quality 2, 1, Article 2 (July 2010), 19 pages.
DOI = 10.1145/1805286.1805288. http://doi.acm.org/10.1145/1805286.1805288.
1. INTRODUCTION
With access to vast volumes of data, decision makers frequently draw conclusions from data repositories that, for a variety of reasons, contain data quality
problems. In decision making, data quality is a serious concern [Strong et al.
1997; Wang et al. 1995] that can negatively impact decision performance [Even
and Shankaranarayanan 2007; Fisher et al. 2003; Heinrich et al. 2007; Jung
et al. 2005] and lead to misreporting of information [Lorence 2003].
The incidence of data quality issues arises from the nature of the information supply chain [Sun and Yen 2005], where the consumer of a data product may be several supply-chain steps removed from the people or groups who
gathered the original datasets on which the data product is based [Wang et al.
1995]. These consumers use data products to make decisions, often with financial or time budgeting implications. The separation of the data consumer from
the data producer creates a situation where the consumer has little or no idea
of the level of quality of the data [Shankaranarayan et al. 2003], leading to
the potential for poor decision-making and poorly allocated time and financial
resources.
Missing data, that is, fields for which data is unavailable, is a particularly
important problem, since it can lead analysts to draw inaccurate conclusions
[Mahnic 2001]. When data is extracted from a data warehouse or database (a
common occurrence when aggregating data from multiple sources), it typically
passes through a cleansing process to reduce the incidence of missing values as
far as possible. This cleansing process can only be taken so far. Some missing
data cannot be fixed because the database manager may not have any way of
knowing the value that is missing, as would happen, for example, if a person
left a field in a form blank. At this point, the database manager may choose either to remove records with missing data, losing the value of the other data contained in those records, or to provide the dataset with the missing-data records included. Our interest in this article is in the second case.
Missing data is a widespread occurrence; a recent review of more than 300
articles published in a psychology research journal [McKnight et al. 2007]
found that more than 90% were based on datasets with missing data, and that
the average amount of data missing was more than 30%, yet few of the articles
surveyed in the study mentioned the potential impact of missing data on any
conclusions drawn.
Data values may be missing for a variety of reasons, falling into two general
categories: Missing At Random (MAR) and Missing Not At Random (MNAR)
[Rubin 1976]. In the MAR case, the incidence of a missing data value cannot
be predicted based on other data, while in the MNAR case, there is a pattern
in the missing data. For MAR scenarios, there exist methods for estimating, or
imputing, missing values [Horton and Kleinman 2007].
The incidence of missing data, however, often falls into the MNAR category;
that is, there is bias in the occurrence of null values. This missingness may
occur for a variety of reasons. For example, survey respondents may refuse to
answer questions that they feel reveal information that is too personal, often
based on religious, cultural, or gender norms. Bias may also occur due to a
natural human tendency to suppress unfavorable information. For example, in
health care informatics, patients may choose not to report unhealthy behaviors
that increase risks for certain illnesses.
In MNAR scenarios, if an analyst assumes there is no bias in the missingness of data, the conclusions drawn may be inaccurate. For example, a health
care policy analyst charged with recommending funding allocations for preventive care projects may note an apparent trend showing reduced teenage smoking, and decide to reduce funding for projects aimed at convincing teens not to
smoke. If, however, a large number of males in the 15–20 age range simply did
not report whether or not they smoke (leading to nulls in the data set), then
the apparent trend noted by the analyst may in fact lead to a faulty conclusion.
Any funding decisions based on it could be nonoptimal allocations of financial
resources.
Clearly, for MNAR scenarios, it would be useful to uncover the patterns in
the incidence of missing data in a dataset. Once patterns are uncovered, several steps can be taken, depending on the context in which the bias is found.
If the decision maker is the creator or owner of the dataset, data collection methods can be improved to prevent the occurrence of missing data, given the information on possible biases. In our teenage smoking example, for instance, care could be taken to stress that the data is deidentified and that privacy will be ensured.
If the decision maker does not have control of the creation of the dataset, a
number of approaches can be considered. One option is to infer data for missing
fields, based on similar tuples that have the same values for other nonmissing
attributes. In our teenage smoking case, if 20% of males in the 15–20 age range
did not report whether or not they smoke, but a majority of the remaining 80%
of males in this age range had reported they smoked, then the assumption may
be made that the majority of these respondents with missing smoking data
are indeed smokers (intuitively, this is the basis of imputation techniques).
Alternatively, decisions or actions based on conclusions can be altered. If, for
example, our health care policy analyst knew that a large number of male teens
did not provide data as to whether or not they smoke, he might realize that the
apparent trend in the remaining data might be inaccurate, and retain funding
for teen smoking prevention projects or look for alternative ways to verify the
apparent trend.
Much of the work in addressing missing data focuses on imputing, or estimating, missing values. Shen and Chen [2003] propose a scheme for using association rule mining (ARM) techniques for filling in missing values. Horton and
Kleinman [2007] provide an overview of statistical methods aimed at the same
purpose. Another area of work [Horton and Kleinman 2007; Parssian 2006;
Parssian et al. 2004] provides insight into methods of measuring the impact
of missing data on the quality of conclusions drawn. Our purpose is different;
rather than trying to fill in missing values or assess their impact on conclusions, we attempt to bring to light patterns in the incidence of missing data.
Wang and Strong [1996] defined a set of 15 dimensions of data quality based
on a survey of information systems professionals and researchers. Issues surrounding missing data fall into two of these categories, completeness, and objectivity, which Wang and Strong [1996, p. 42] define as follows.
Completeness: The extent to which data is not missing and is of sufficient
breadth and depth for the task at hand.
Objectivity: The extent to which data is unbiased, unprejudiced, and
impartial.
We are specifically interested in cases where the objectivity problem is
nested within the completeness problem in a dataset, that is, where there are
bias patterns in missing data. This nesting makes it difficult for a decision
maker to notice the full scope of the problem unaided. In this article, we attempt to address these issues to help decision makers uncover bias patterns
in missing data. We focus our research on detecting and characterizing these
patterns. Our contributions in this article are as follows.
—We identify data mining techniques as a potential means of identifying patterns of bias in missing data.
—We propose a methodology for applying well-known data mining techniques
to reveal patterns in missing data for an attribute. Our approach considers
data preparation, the application of rule mining techniques, and aggregating
potentially large result sets.
—We describe important differences between the traditional application of data
mining techniques and their application in the domain of finding patterns in
missing data, and suggest modifications to overcome these differences.
At the end of Rubin’s original work on missing data [Rubin 1976], he makes
a strong call for the need for missing-data pattern research. More recently,
Ordonez et al. [2000] cite patterns in missing data as an interesting area for research. While recent research in this area identifies some common ad hoc patterns, for example, monotonicity (caused in longitudinal studies by dropouts)
[Horton and Kleinman 2007], in an extensive search of the literature, we were
not able to find a general approach to identifying patterns in missing data that
would satisfy Rubin’s 1976 request. Our work here represents a first attempt
to provide a general approach to finding patterns in missing data.
The remainder of this article is organized as follows. In Section 2, we describe our methodological approach and include a small illustrative example. In Section 3, we present a set of initial experimental results and show the efficacy of our approach. In Section 4, we discuss the results of our experiments, explore potential avenues for improving our method, and consider other applicable algorithms. In Section 5, we conclude and discuss future work to further refine our approach.
2. METHODOLOGICAL APPROACH
In order to find systematic bias patterns in the data missing for a specific attribute in a database, we explore the use of KDD (Knowledge Discovery and
Data Mining) techniques to discover patterns in related attributes that may
indicate bias. Several approaches are possible, including both supervised and
unsupervised machine learning techniques. Our goal is to identify which attribute values frequently cooccur with the incidence of missing data for a selected attribute (for example, nulls in yearly income may cooccur in records for
single, home-owner males).
There are many pattern discovery methods that can be considered for missing data pattern analysis, including both supervised techniques such as decision trees, neural networks and regression, and unsupervised techniques such
as cluster analysis and association rule mining. For this context, we wish to
consider the combinations of attribute values that may indicate a bias in the
missing data. Association rule mining, as we describe below, seems to be a
natural fit for this problem context.
Association Rule Mining (ARM), also known as market basket analysis in
the world of marketing, is a widely used data mining technique. The goal of
ARM algorithms is to search for frequent item sets in large transaction databases, such as the items one might purchase at a grocery store. The frequently
occurring items found in a market basket can be used to design promotions,
refine store layouts, and manage other aspects of a consumer’s experience.
Similarly, we seek to find “baskets” of values that indicate a pattern for missing data.
The association task has two goals: to find frequent itemsets and to find
association rules. For association rule mining algorithms, each attribute/value
pair is considered an item. Each itemset has a size, which corresponds to the
number of items it contains (for example, for some demographic data, we could
have a three-item itemset with gender= “female,” age= “30-35,” homeowner=
“true”). The coverage of this itemset is the number of times this combination
occurs in the dataset. Support is coverage expressed as a percentage, and is
often specified by the user ahead of time. For example, support = 4% means
that the itemset appears in 4% of the records.
The second goal of ARM algorithms is to find rules. An association rule has
the form A, B => C with a confidence measure, where confidence represents
the percentage occurrence of the right-hand side (RHS) of the rule, given the
itemset in the left hand side (LHS). Thus, if we had a confidence value of 60%
for the rule if gender=“female” and age=“30-35” => homeowner= true, then
60% of 30–35-year-old females in this dataset are homeowners. In the association task, the user specifies minimum thresholds for support and confidence
prior to running the ARM algorithm. Then, only those rules that have values
at or above the minimum specified thresholds for support and confidence are
generated by the algorithm.
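To make these definitions concrete, the following minimal sketch computes support and confidence for one candidate rule over a toy dataset. We use Python with pandas purely as an illustration vehicle; the attribute names and values are hypothetical, not drawn from any dataset discussed in this article.

import pandas as pd

# Toy records; each attribute/value pair is an item.
df = pd.DataFrame({
    "gender":    ["F", "F", "M", "F", "M"],
    "age":       ["30-35", "30-35", "25-30", "30-35", "30-35"],
    "homeowner": [True, False, True, True, False],
})

# Rule: gender="F" and age="30-35" => homeowner=true
lhs = (df["gender"] == "F") & (df["age"] == "30-35")
both = lhs & df["homeowner"]

support = both.sum() / len(df)       # itemset frequency as a fraction of all records
confidence = both.sum() / lhs.sum()  # fraction of LHS matches where the RHS also holds
print(f"support = {support:.2f}, confidence = {confidence:.2f}")

Here the rule holds for two of the three 30-35-year-old females, giving a confidence of 0.67 and a support of 0.40.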
We generate rules using the association rule algorithm to identify instances
of systematic bias in missing data. In this section, we outline our general approach, which we will illustrate with a small example, and later demonstrate
with an exploratory experimental study.
Table I represents a set of fictional results from a survey where high school
and middle school children were asked if they had tried smoking. It contains
five attributes: school name (as a proxy for socio-economic status), whether a
parent or guardian smoked, gender, age, and whether the person reports having tried smoking. There are two hidden seeded biases: females between the
ages of 15–18 (records 1 and 3) and males whose parents did not smoke (records 2, 4, 6, and 7) tend not to report whether they have tried smoking, regardless of which school they attended. As in a real dataset, the bias is a tendency rather than a certainty; in our small example, respondent 2 did provide an answer.
Figure 1 shows a high-level overview of the steps in our approach. There are
four main steps in finding these patterns: data preparation (which includes sampling, feature selection, and feature transformation), association rule creation, association rule aggregation, and pattern interpretation.
Table I. Small Example Dataset

Record  School     Parent Smokes  Gender  Age  Tried Smoking?
1       Northwest  Y              F       16   NULL
2       Northwest  N              M       16   Y
3       Northwest  N              F       15   NULL
4       Southeast  Y              M       14   NULL
5       Southeast  Y              M       16   Y
6       Southeast  N              M       17   NULL
7       Southeast  N              M       12   NULL
8       Wesley     Y              M       16   Y
9       Wesley     N              F       14   N
10      Wesley     Y              M       17   Y
Fig. 1. Pattern discovery approach.
The fourth step, interpreting the patterns, involves a decision maker using these patterns in the context of drawing conclusions from the dataset.
2.1 Data Preparation
The first step in the KDD process is data preparation [Fayyad et al. 1996]. In
this application, three data preparation techniques are described: sampling,
feature selection, and feature transformation.
2.1.1 Sampling. In the KDD process it is neither necessary nor efficient
to use all available data. Instead, we draw a representative sample of
the data to build our KDD models. However, in our case, we cannot simply
randomly sample to obtain a set number of records, since missing data for an
attribute is a rare occurrence. If we sample randomly, we risk having most, if
not all, of our records being complete, that is, we might lose all records with missing
data. Further, if we underrepresent records with missing data, the association
rule algorithm will not generate any rules for cases where data is missing.
Association rule mining discovers patterns that occur with a frequency above
the minimum support threshold. Therefore, in order to find associations involving rare events, we could try to set the threshold value for support to a very low
number. However, this would result in the generation of large numbers of rules
to analyze, and still would not guarantee that we would find rules for these
rare biases. It could also dramatically increase the time needed for execution.
This is a well-known problem, also referred to as the unbalanced dataset
problem. It occurs when the class of interest is highly underrepresented in the
data. For example, fraudulent medical claims may only be present in 1% or
less of a dataset. Similar situations occur in rare disease diagnosis, network
intrusion detection, and bioterrorism detection. In order to ensure a good representation of missing data records, we use stratified sampling [Han and Kamber
2001; Trochim 1999]. Stratified sampling, also called proportional or quota random sampling, involves dividing the available population into homogeneous subgroups and then taking a simple random sample from each subgroup [Trochim 1999]. In our case, we have two subgroups: one containing records with missing data, and one without. We sample so that occurrences of missing data for the
attribute for the resulting dataset constitute at least 30% of our cases. In other
words, if we want to use 1000 records, we ensure that 300 of those records are
random-sampled from the missing-data subgroup, and 700 are from the other
subgroup.
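As a sketch of this step, the following fragment draws such a stratified sample with pandas; the function and column names (stratified_sample, missing_flag) are our own, not part of the original study.

import pandas as pd

def stratified_sample(df, flag_col="missing_flag", n_total=1000,
                      missing_frac=0.3, seed=42):
    # Ensure records with missing data make up missing_frac of the sample.
    n_missing = int(n_total * missing_frac)
    missing = df[df[flag_col] == 1].sample(n=n_missing, random_state=seed)
    complete = df[df[flag_col] == 0].sample(n=n_total - n_missing, random_state=seed)
    # Concatenate the two strata and shuffle the result.
    return pd.concat([missing, complete]).sample(frac=1, random_state=seed)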
Stratified sampling increases rule support (the numerator of the support
computation remains the same, but the denominator decreases), but does not
change confidence (both the numerator and denominator of the confidence computation change proportionally). It does not hurt to increase the value of support for missing data records, since the purpose of the support metric is to
identify occurrences of sufficient frequency in the dataset. Since we want to extract rules for cases where the target attribute value is null, which we already
know is a rare event, the actual value of support is not important.
Because stratified sampling requires a significant sample size to show its effect on a dataset, we do not illustrate it on our small example: with only 10 rows, it is substantially smaller than would be acceptable for most data mining algorithms.
2.1.2 Feature Selection. Databases can contain thousands of attributes,
making it challenging to determine which features (attributes) are contributing to a systematic bias. The goal is to feed forward only those attributes that
are most relevant in identifying a bias. This step is important for two reasons.
First, it reduces the amount of data that must be analyzed, which improves the
scalability of our approach. Second, it minimizes the number of rules generated, which helps to simplify the interpretation of these rules.
Feature selection is an important and challenging step, particularly for our
approach. For the same reason that logistic regression was ruled out as an approach for finding missing-data biases, traditional dimension reduction techniques outlined in the literature, such as stepwise forward selection, stepwise backward elimination, and combinations of forward selection and backward elimination [Barbara et al. 1997; Dash et al. 1997; Han and Kamber 2001; Neter and Wasserman 1974], will not be accurate. We are searching for combinations of attribute values, so we cannot consider each attribute individually and its correlation with the value of our selected target attribute (which is either null or not null).
missing data. In fact, it might be evenly distributed for both the missing and
not-missing cases. However, the bias for missing data might be for a certain
value of gender combined with certain values for age and ethnicity. Feature
selection techniques (e.g., forward selection) do not consider combinations of
attributes, and could discard important attributes.
In our approach, we utilize decision trees, a well-known machine learning
technique that was originally intended for classification [Han and Kamber
Table II. Resulting Sample Dataset

Parent Smokes  Gender  Age    Missing Flag
Y              F       15-16  1
N              M       15-16  0
N              F       15-16  1
Y              M       14-15  0
Y              M       16-17  0
N              M       17-18  1
N              M       12-13  1
Y              M       16-17  0
N              F       14-15  0
Y              M       16-17  0
2001]. The decision tree is a classifier in the form of a tree structure, where
each node is either a “decision” node or a “leaf” node. A decision node indicates
a test based on the values of a single attribute, and a leaf node indicates the
value of the target attribute. When using decision tree induction for feature
selection, the variables that do not appear in the tree are considered irrelevant
[Han and Kamber 2001].
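As a sketch of this selection step, the fragment below substitutes scikit-learn's decision tree for the C4.5 implementation referenced above; sample_df is a hypothetical stratified sample with a missing_flag target column. Attributes with zero importance never appear in the tree and are considered irrelevant.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.get_dummies(sample_df.drop(columns=["missing_flag"]))  # one-hot encode attributes
y = sample_df["missing_flag"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Keep only the features the tree actually used for a split.
selected = X.columns[tree.feature_importances_ > 0]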
In our small example, there are not enough cases to illustrate the use of
decision trees. In a larger (but similar) dataset, we would expect the decision tree not to select school as a splitting attribute, since there is no
discernible correlation between missing data in the “tried smoking” field and
the school field.
2.1.3 Data Transformation. In our approach, we recommend two transformations. First, we replace the attribute for which we are investigating missing
data with a flag (missing-flag) that indicates whether the data value is missing
or not. In the case of our small example, we would replace the “tried smoking” attribute with a flag which is set to 1 if the attribute was missing, or set
to 0 if it was not. Secondly, since association rule algorithms generally do not
work with continuous variables, we divide those variables into groups and create bins, either based on their distribution or in a way that makes sense for
that particular variable. In our small example, age is a continuous variable we could bin, and the “Tried Smoking?” attribute becomes the binary “Missing Flag” variable. Table II illustrates our resulting dataset after feature selection and transformation.
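Both transformations are mechanical; a minimal sketch in pandas follows, where survey_df and the bin edges are our own assumptions (the bins here are coarser than those in Table II).

import pandas as pd

df = survey_df.copy()

# Transformation 1: replace the investigated attribute with a missing-data flag.
df["missing_flag"] = df["tried_smoking"].isna().astype(int)
df = df.drop(columns=["tried_smoking"])

# Transformation 2: bin continuous variables such as age into categorical ranges.
df["age"] = pd.cut(df["age"], bins=[12, 14, 16, 18],
                   labels=["12-14", "15-16", "17-18"], include_lowest=True)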
After we have created a stratified dataset, selected the attributes, and transformed the variables as needed, we are ready to feed the data to an association
rule mining algorithm and aggregate the results.
2.2 Association Rule Creation
Typically, association rule mining searches for all possible patterns from the
given data, that is, the RHS of a rule may contain any attribute of the database.
However, we are interested only in rules that have the attribute missing-flag in
the RHS. For this purpose, we select an Apriori algorithm [Li et al. 2001], using
the class association rules (CAR) approach [Scheffer 2001], which generates
rules with a set of target attributes. In this way, we can mine the prepared
Table III. Sample Dataset Generated Rules
1. Parent Smokes = Y, Gender = F, Age = 15-16 → Missing Flag = 1
2. Parent Smokes = N, Gender = F, Age = 15-16 → Missing Flag = 1
3. Parent Smokes = N, Gender = M, Age = 17-18 → Missing Flag = 1
4. Parent Smokes = N, Gender = M → Missing Flag = 1
data for association rules involving the missing-flag as the class attribute, that
is, the attribute on the RHS of the association rule.
Association rule mining discovers rules based on support and confidence.
Traditionally, for a rule X→Y, the support is defined by S(X) = #X / N, where #X denotes the number of occurrences of X in the total population of N records. The confidence is defined by S(X ∩ Y) / S(X), where X ∩ Y denotes the instances where both X and Y occur.
In the case of missing data analysis, the occurrence of missing data is rare
compared to the total population, so support (using the traditional definition of
the term) for rules for the event will be very low compared to most other rules
the miner will generate. To address this, we modify the definition of support as follows: for a rule X→Y (where Y is missing-flag=1), S′(X) = #X / M, where M is the total count in the population where the target attribute is null, that is, where missing-flag = 1. Here, we are interested in rules of type X→missing-flag=1 with modified support S′(X). So, we apply the standard CAR approach to generate rules with minimum support of S(X) = S′(X) × M / N. This will generate rules indicating the correlations between the incidence of missing data and the values of attribute sets.
Finally, we note that the CAR approach generates rules for all cases for the
missing-flag attribute, that is, with both missing-flag=1 and missing-flag=0 in
the RHS. We are interested only in rules for missing-flag=1, so we filter out any
rules with missing-flag=0 in the RHS, and retain those with missing-flag=1 in
the RHS.
For our small example, M = 4 and N = 10. Suppose we target a modified support of S′(X) = 0.1. To achieve this, we select a minimum support of S(X) = 0.1 × 4/10, that is, 0.04, for the ARM technique. We further filter the rules to those where missing-flag=1 in the RHS. Table III illustrates rules generated by this approach on our small dataset. Actual support for each of these rules would be 1/4, or 25%, given that each rule occurs once and there are four occurrences of missing data in the dataset. Due to the small size of the dataset, the confidence of all these rules turns out to be 1.0, since there are no records without missing data for any of the itemsets in the LHS of the rules in Table III. This will not be the case in the larger dataset used in Section 3.
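To sketch how this step could be automated outside WEKA, the fragment below uses the open-source mlxtend library (our choice for illustration only). The scaled minimum support follows the formula above, and df is the prepared sample of Table II under hypothetical column names.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

M, N = (df["missing_flag"] == 1).sum(), len(df)  # M = 4, N = 10 in the small example
target = 0.1                                      # desired modified support S'(X)
min_support = target * M / N                      # scaled support for the miner: 0.04

# One-hot encode attribute=value items, e.g. "gender=F", "age=15-16".
items = pd.get_dummies(df.astype(str), prefix_sep="=").astype(bool)

frequent = apriori(items, min_support=min_support, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)

# Keep only rules whose RHS is exactly missing_flag=1.
rules = rules[rules["consequents"] == frozenset({"missing_flag=1"})]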
2.3 Rule Aggregation
The association rule mining method may generate a large number of rules,
leading to the potential for data overload. To avoid this, we introduce the possibility of aggregating rules to reduce the size of the rule set delivered to the
analyst. Our goal here is to do this without significantly damaging the results.
We first consider the possibility of combining rules where one attribute
clearly does not contribute to the correlation with the missingness of data. Consider the following rules from our small dataset.
(1) Parent Smokes = Y, Gender = F, Age = 15–16 → Missing Flag = 1
(2) Parent Smokes = N, Gender = F, Age = 15–16 → Missing Flag = 1
In this example, “Y” and “N” are the only values that Parent Smokes may take in the database. Since these two rules differ only in their value of Parent Smokes, Parent Smokes clearly does not contribute to the correlation with the missingness of data. This may occur in the case of any attribute whose
values are roughly equally distributed in the dataset. Clearly, then, we can
merge rules (1) and (2) without loss of meaning to produce a new rule: Gender = F, Age = 15–16 → Missing Flag = 1.
Second, we propose presenting the more general rule in cases where one
generated rule is a more detailed version of another (or a set of other rules).
Consider, for example, the following pair of rules.
(3) Parent Smokes = N, Gender = M, Age = 17–18 → Missing Flag = 1
(4) Parent Smokes = N, Gender = M → Missing Flag = 1
Each of these generated rules is supported by a specific set of records in
the database. Here, rules (3) and (4) share two attribute-value pairs: Parent
Smokes = N and Gender = M. Because rule (4) contains only these two items, it
is supported by all the records that support rule (3), and more. Thus, rule (4) is
the more general rule, and the one we will choose to present in the aggregated
rule set. Clearly, there is some loss of detail in this type of aggregation, but
there is no loss of coverage in terms of the records that support the rules—all
records that were counted as supporting rules in the original rule set are still
counted as supporting rules in the aggregated rule set. Thus, the rules in our
example would be reduced to the following set.
(1) Gender = F, Age = 15–16
(2) Parent Smokes = N, Gender = M
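The second heuristic, keeping only the more general rule, is straightforward to automate. A minimal sketch, representing each rule's LHS as a set of attribute-value items (a representation of our own choosing):

def drop_subsumed(rules):
    # Discard any rule whose LHS strictly contains another rule's LHS;
    # the smaller LHS is the more general rule and covers all the same records.
    return [r for r in rules if not any(other < r for other in rules)]

rules = [
    frozenset({"ParentSmokes=N", "Gender=M", "Age=17-18"}),  # rule (3)
    frozenset({"ParentSmokes=N", "Gender=M"}),               # rule (4)
]
print(drop_subsumed(rules))  # only the more general rule (4) survives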
In our small example, and in our experiments, the rule set generated was
small, so we simply aggregated rules manually using these heuristics. It should
be possible, however, to automate this step using techniques from predictive association rule mining, for example, as described in Scheffer [2001] or applying
Operations Research (OR) techniques. Below we present an outline of one possible OR model to identify the minimum set of rules from the complete rule set.
Let our original data set consist of N records with K attributes. Let us also
assume that our approach has generated a total of J rules.
We create a decision variable,
x_j = 1, if rule j ∈ J is included in the final rule set; 0 otherwise.
We identify the following exogenous parameter of the model:
p_ij = 1, if a record i ∈ N supports a rule j ∈ J; 0 otherwise.
Table IV. Data Description

Attribute                   Number of Distinct Values  Values
Marital Status              2                          S, M
Gender                      2                          F, M
Number of Children          6                          0, 1, 2, 3, 4, 5, 6
Number of Children at Home  6                          0, 1, 2, 3, 4, 5, 6
Home Owner Flag             2                          FALSE, TRUE
Number of Cars Owned        5                          0, 1, 2, 3, 4, 5
Commute Distance            5                          0-1, 1-2, 2-5, 5-10, 10+
Occupation                  5                          Professional, Manual, Clerical, Management, Skilled Manual
Yearly-Income               Continuous                 10000 to 170000
The one helper variable of the model is,
u_i = 1, if a record i ∈ N supports any rule j ∈ J; 0 otherwise.
The integer programming model for the selection of the minimum rule set is as follows:

Minimize   Σ_{i∈N} Σ_{j∈J} p_ij x_j

Subject to:

Σ_{j∈J} p_ij x_j ≥ u_i,   ∀i ∈ N    (1)
The objective function minimizes the total number of rules needed to cover
all the rows in the original rule set. The constraint (1) ensures that all records
in the original dataset that support rules in the original rule set are still
counted as supporting rules in the aggregated dataset.
The above integer programming (IP) model maps to the 0-1 min-knapsack problem [Csirik et al. 1990]. Though the knapsack problem is NP-complete, and may be computationally challenging to solve, there exists a well-developed greedy heuristic [Martello and Toth 1991] with log-linear time complexity, O(|J| log |J|), that provides results within 2% of the optimal result. We can apply such a heuristic algorithm to solve the given IP problem and find the aggregated rule set quite efficiently.
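As an illustration of the greedy idea, the sketch below implements a generic greedy covering procedure (not the specific heuristic of Martello and Toth [1991]): repeatedly pick the rule covering the most still-uncovered records.

def greedy_rule_cover(coverage):
    # coverage: dict mapping rule id -> set of record ids supporting that rule.
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        best = max(coverage, key=lambda j: len(coverage[j] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

# Example: rule "r2" covers everything "r1" does, so "r2" alone is chosen.
print(greedy_rule_cover({"r1": {1, 2}, "r2": {1, 2, 3}}))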
Finally, we note that this model is not complete—further fine tuning to address selecting rules with higher confidence measures, ensuring that the total
support of the final rule set is above a certain threshold, etc., is left for future
research.
3. EXPERIMENTAL RESULTS
In order to assess the efficacy of our approach, we extracted a dataset of 18,000
transactions from a sample data warehouse [Microsoft 2005]. A description of
the dataset can be found in Table IV.
We first introduce some systematic biases for missing data, and then apply
our proposed approach to determine whether it is able to uncover them. We
chose Yearly-Income as our target column. We modified the experimental data
Table V. Seeded “Systematic Bias” Missing Rules

Rule            Marital Status  Gender  Total Children  House Owner Flag  Null Count  Not Null Count  Total
1               M               M       0               -                 920         21              941
2               M               -       5               FALSE             176         2               178
3               S               M       5               -                 213         3               216
4               S               F       0               -                 1640        54              1694
5               -               M       2               FALSE             535         18              553
Noise                                                                     255         -               255
Remaining Data                                                            -           14,163          14,163
Total                                                                     3739        14,261          18,000
Note that the systematic biases in Table V were selected for experimental purposes only, and are not intended to have any intrinsic meaning. To do otherwise would surely cause offense.
to set the Yearly-Income attribute to null (to represent missing data) for records
that satisfy one of the attribute value-sets described in Table V.
To create a realistic environment, we also introduced some noise to the data.
—In about 2% of the cases that match one of the attribute value-sets described
in Table V, the Yearly-Income attribute value is left unchanged, that is,
not null.
—Yearly-Income is set to null for 1.4% of the cases in the dataset that do not
match the patterns in Table V.
—A total of 20% of the data has a null value in the Yearly-Income field (inclusive of seeded noise nulls).
Next, we created a new Boolean column Income-flag in the database, which
acts as an indicator (missing-flag) specifying whether the Yearly-Income attribute of a record is null or not. We set this value to 1 (i.e., true) for records
where Yearly-Income is null, and 0 for records where the Yearly-Income value
is not null.
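For concreteness, the sketch below shows how one seeded bias and the noise could be injected with pandas; the DataFrame name data, the column names, and the random seed are our own assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One seeded bias from Table V (rule 1): married males with no children.
bias = ((data["MaritalStatus"] == "M") & (data["Gender"] == "M")
        & (data["TotalChildren"] == 0))

# Null out Yearly-Income for ~98% of matching records (2% left intact as noise) ...
data.loc[bias & (rng.random(len(data)) < 0.98), "YearlyIncome"] = np.nan

# ... and for 1.4% of non-matching records (seeded noise nulls).
data.loc[~bias & (rng.random(len(data)) < 0.014), "YearlyIncome"] = np.nan

# Indicator column used as the mining target.
data["IncomeFlag"] = data["YearlyIncome"].isna().astype(int)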
3.1 Experimental Process
In this section, we describe how we applied each of the steps of our proposed
approach in this experiment.
3.1.1 Stratified Sampling. In the Stratified Sampling step, we kept all the
missing data rows of our dataset (Income-flag=1) and seeded noise, and then
randomly sampled the remaining data so that the remaining 70% of the dataset
contained nonmissing records (Income-flag=0). As illustrated in Figure 2, this
new stratified database contained a total of 11,940 rows.
3.1.2 Feature Selection. Using the modified stratified sample dataset, we
used the Weka data mining tool [Witten and Frank 2005] to run a C4.5 decision tree. Figure 3 shows the results of the feature selection step. We select
only those attributes that appear in the resulting tree. Based on this approach,
we conclude that only TotalChildren, Gender, HouseOwnerFlag, and MaritalStatus are important in classifying whether a record has high potential to have
Income-flag=1 (i.e., where Yearly-Income is null). In the next step, we run association rule mining by only selecting these relevant attributes.
Fig. 2. Stratified sampling.
3.1.3 Association Rule Mining. Using the Apriori algorithm [Agrawal and
Srikant 1994] in the WEKA tool, we generated rules from the stratified
sample dataset by selecting only the four relevant attributes from the decision
tree: TotalChildren, Gender, HouseOwnerFlag, and MaritalStatus. We used
the class association rule mining technique in order to list only those rules that
contain Income-flag on the RHS. Prior to running the mining algorithm, we
converted all continuous-data columns of the input dataset to categorical data,
since association rule mining runs on categorical data.
The minimum confidence was set high (0.95) for the Apriori algorithm. Also,
the minimum support S(X) was set low, to 0.01, based on the modified support heuristic in Section 2.2. The rules were ranked by confidence, and limited to
200 rules. Finally, once we had the list of generated rules, we filtered the results
for those rules that referred to Income-flag=1 indicating Yearly-Income=null.
Figure 4 shows all such rules found from the WEKA environment.
3.1.4 Rule Aggregation. In this step, we aggregated multiple rules together
to reduce the number of total rules. Though elaborate formal algorithm development is possible for such aggregation, for the purpose of this case study we
merged the result rules manually by following the simple heuristics described
in Section 2.3. In future research, we intend to develop an algorithm that will
aggregate rules from a large number of generated rules.
3.2 Experimental Results
The aggregated rule set is presented in Table VI. The rules labeled A, B, C,
D, and E exactly describe the seeded patterns, which demonstrates that our
approach was able to find all seeded biases in the dataset. The approach also
found an additional rule (labeled F in Table VI), Gender=M, TotalChildren=5
and HouseOwnerFlag=False, which was not part of the seeded bias. We could
tune the support threshold to prevent this rule from being generated, but doing
so runs the risk of losing rules for actual seeded bias. For example, support for
Rule F is 142/11,940, while support for Rule B (in Table VI) is 176/11,940. Here,
increasing the support threshold will prevent the additional rule (F) from being
generated, but will also likely prevent rule B, a seeded bias, from being generated. In future research we intend to focus on this tradeoff and the parameters affecting it.
Fig. 3. Decision tree output.
3.3 Validation of the Approach
In this study, we were able to validate our technique by checking if the seeded
biases were identified by the rules. In a nonexperimental context using real
data in which biases are not known in advance, a possible validation approach
is to divide the dataset into two halves and verify that similar patterns of missingness exist when subjecting each half to our method. To demonstrate the applicability of this method, we divide our sample dataset into two equally sized datasets and compute the confidence values of the derived rule sets on each half separately.
Fig. 4. Generated rules.
Table VI. Supported Rules

Supported   WEKA Identified Rule  Induced Rule      Marital  Gender  Total     House       Rule
Rule Label  Numbers (Figure 4)    Number (Table V)  Status           Children  Owner Flag  Confidence
A           11, 23, 73            1                 M        M       0         -           0.98 (920/941)
B           5                     2                 M        -       5         FALSE       0.99 (176/178)
C           1, 7                  3                 S        M       5         -           0.99 (213/216)
D           34, 70, 77            4                 S        F       0         -           0.97 (1640/1694)
E           60, 71, 80            5                 -        M       2         FALSE       0.97 (535/553)
F           19                    NONE              -        M       5         FALSE       0.98 (142/145)
Table VII contains these confidence values, showing that both datasets generate confidence values similar to those in Table VI.
Thus, we can safely conclude that this rule validation procedure can easily be
applied in nonexperimental datasets where the hidden bias patterns are not
known in advance.
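A sketch of this split-half validation, where rule_confidence is a hypothetical helper of our own and the split uses scikit-learn for convenience:

import pandas as pd
from sklearn.model_selection import train_test_split

def rule_confidence(df, lhs):
    # Confidence of lhs -> IncomeFlag=1; lhs maps attribute -> required value.
    mask = pd.Series(True, index=df.index)
    for col, val in lhs.items():
        mask &= df[col] == val
    return (df.loc[mask, "IncomeFlag"] == 1).mean()

half1, half2 = train_test_split(data, test_size=0.5, random_state=0)
rule_a = {"MaritalStatus": "M", "Gender": "M", "TotalChildren": 0}
print(rule_confidence(half1, rule_a), rule_confidence(half2, rule_a))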
4. DISCUSSION
In this exploratory study, we have demonstrated that association rule mining
can be used successfully to describe systematic biases in missing data. Understanding the set of values that are normally associated with missing data
in a certain attribute can be of significant benefit to researchers. Consider
the example presented in the introduction to this article, where the authors of
McKnight et al. [2007] described a review of 300+ articles from a psychology
journal and found that more than 90% were based on data sets with missing data, and that the average amount of data missing was more than 30%.
A missing data analysis for the underlying datasets might reveal patterns of
Table VII. Method Validation—Approach 1

Supported   Marital  Gender  Total     House       Rule Confidence  Rule Confidence
Rule Label  Status           Children  Owner Flag  (Sample Set 1)   (Sample Set 2)
A           M        M       0         -           0.97             0.98
B           M        -       5         FALSE       0.99             0.99
C           S        M       5         -           0.99             0.98
D           S        F       0         -           0.97             0.97
E           -        M       2         FALSE       0.97             0.97
F           -        M       5         FALSE       0.99             0.97
missingness, potentially resulting in the modification of conclusions, or even
the invalidation of the results. Our proposed technique is broadly applicable
across domains where data analysis is a significant research method.
Certainly other statistical and knowledge discovery techniques should be
explored, but they will not be as straightforward in generating a set of rules
that consider combinations of attribute values. Cluster analysis is an interesting possibility. Clustering algorithms find natural groupings within datasets
based on different types of similarity metrics [Jain and Dubes 1988]. In our
case, we could examine which attribute values cluster together with the occurrence of missing data, though the resulting clusters could prove difficult to interpret.
We could also explore the use of decision trees. The approach would differ
slightly, in that we would train models that would classify whether data is
missing for an attribute. Decision tree algorithms evaluate splits or decision
nodes based on the available attribute values, which can be converted into rule
sets [Breiman et al. 1984; Quinlan 1993]. Decision trees could be useful, but
would be time-consuming to apply, because we would need to investigate each
branch whose terminal node indicates missing data in order to discover the
rules. We did some preliminary testing of these two alternative approaches:
cluster analysis and decision trees, and compared the results based on the number of rules found and the ease of finding and interpreting rules.
Using cluster analysis (entropy minimization) we were able to induce four
of the rules by looking at cluster membership. We discovered that this was an
extremely time-consuming exercise. Additionally, it did not provide measures
of confidence or support, since we are inducing the rules. We also noted that
the execution time was very slow (30 minutes for our small dataset).
Decision trees found all the same rules as ARM. However, unlike ARM, we had to follow the branches of the tree to induce the rules. Decision trees are a promising approach, but they may
have scalability issues. In our small dataset it was not difficult to investigate
the tree structure to extract the rules. In a larger dataset, with more attributes,
this may be more taxing. Additionally, there are extra steps involved because
we will need to split the data in order to avoid overfitting. Yet, decision trees
warrant more investigation, since there may be ways to automate rule induction. We do plan to test both decision trees and ARM on a much larger dataset
in future research.
An additional step we did not pursue is working with domain experts to explore and explain our results. In our contrived example, the biases are seeded.
In a real dataset, it is likely that only a domain expert will be able to help in understanding the potential biases identified.
5. CONCLUSION
This research explores a promising method for utilizing machine learning techniques to identify potential patterns of bias in the incidence of missing data. We
describe how to reduce the feature space, how to conduct necessary transformations, and how to uncover patterns in the data that describe the incidence of
missing data.
There are several ways this methodology can be integrated into existing systems. The most obvious is to identify the bias and fix the errors. This might
prove difficult if the data comes from surveys, because it will be hard to
trace the source. Additionally, when information is purposefully withheld, it is
likely that respondents still will not want to divulge it. In the case where the
data are accumulated from several sources, the systematic bias might point to
an error in the data collection process. For example, a large federal agency
that collects data from several agencies might be able to identify that it has interoperability issues that cause data to be systematically truncated in certain
circumstances. If fixing at the source is not possible, certainly this knowledge
must be disseminated to decision makers. This process, however, can be complicated, and can cause decision makers to lose confidence in the dataset. The
information could be added to the metadata and reported to decision makers
via reports or at query time. This procedure is certainly an interesting area for
future study.
For each of the steps in our methodology, we can identify other potential
techniques that may produce better (or faster) results, as well as interesting
avenues for future research. For example, feature selection is challenging, especially for larger datasets, and is an area that warrants further research. We
consider association rule mining the best approach for finding correlations indicating missingness, but certainly other data mining techniques, for example
decision trees, need to be considered and tested. Furthermore, in our case, several rules were generated for each bias. Finding a way to collapse rules that
are very similar would facilitate the understanding of these rules and make
the procedure more scalable as we consider larger numbers of attributes. An
interesting extension to this study is to conduct a sensitivity analysis on the
ability of such techniques to find these systematic patterns by experimenting
with settings of confidence and support and the number of available events
(cases of missing data). Finally, applying our method to a real dataset and generating rules that domain experts could interpret as biases in missing data is
an important next step.
REFERENCES

Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, J. B. Bocca, M. Jarke, and C. Zaniolo, Eds., Morgan Kaufmann Publishers Inc., San Francisco, CA, 487–499.
Barbara, D., DuMouchel, W., Faloutsos, C., Haas, P. J., Hellerstein, J. M., Ioannidis, Y. E., Jagadish, H. V., Johnson, T., Ng, R., Poosala, V., Ross, K. A., and Sevcik, K. C. 1997. The New Jersey data reduction report. IEEE Data Engin. Bull. 20, 3–45.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. Wadsworth International.
Csirik, J., Frenk, J. B. G., Labbe, M., and Zhang, S. 1990. Heuristics for the 0-1 min-knapsack problem. Erasmus University Rotterdam, Econometric Institute, Rotterdam.
Dash, M., Liu, H., and Yao, J. 1997. Dimensionality reduction of unsupervised data. In Proceedings of the 9th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE Computer Society, 532.
Even, A. and Shankaranarayanan, G. 2007. Utility-driven assessment of data quality. Data Base Advances Inform. Syst. 38, 76–93.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. The KDD process for extracting useful knowledge from volumes of data. Comm. ACM 39, 27–34.
Fisher, C. W., Chengalur-Smith, I., and Ballou, D. P. 2003. The impact of experience and time on the use of data quality information in decision making. Inform. Syst. Resear. 14, 170–188.
Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA.
Heinrich, B., Kaiser, M., and Klier, M. 2007. How to measure data quality? A metric-based approach. In Proceedings of the International Conference on Information Systems (ICIS).
Horton, N. J. and Kleinman, K. P. 2007. Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Amer. Statist. 61, 79–90.
Jain, A. K. and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ.
Jung, W., Olfman, L., Ryan, T., and Park, Y. T. 2005. An experimental study of the effects of contextual data quality and task complexity on decision performance. 149–154.
Li, J., Shen, H., and Topor, R. W. 2001. Mining optimal class association rule set. In Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, D. W.-L. Cheung, G. J. Williams, and Q. Li, Eds., Springer-Verlag, 364–375.
Lorence, D. P. 2003. The perils of data misreporting. Comm. ACM 46, 85–88.
Mahnic, V. 2001. Data quality: A prerequisite for successful data warehouse implementation. Informatica Slovene Soc. Informatika 25, 183–188.
Martello, S. and Toth, P. 1991. Heuristic algorithms for the multiple knapsack problem. Comput. 27, 93–112.
McKnight, P. E., McKnight, K. M., Sidani, S., and Figueredo, A. J. 2007. Missing Data: A Gentle Introduction. Guilford Press, New York.
Microsoft. 2005. Microsoft SQL Server 2005 sample data warehouse.
Neter, J. and Wasserman, W. 1974. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs. R. D. Irwin, Homewood, IL.
Ordonez, C., Santana, C. A., and Braal, L. 2000. Discovering interesting association rules in medical data. In Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. 78–85.
Parssian, A. 2006. Managerial decision support with knowledge of accuracy and completeness of the relational aggregate functions. Decis. Supp. Syst. 42, 1494–1502.
Parssian, A., Sarkar, S., and Jacob, V. S. 2004. Assessing data quality for information products: Impact for selection, projection, and Cartesian product. Manage. Sci. 50, 967–982.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Rubin, D. B. 1976. Inference and missing data. Biometrika 63, 581–592.
Scheffer, T. 2001. Finding association rules that trade support optimally against confidence. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery. 424–435.
Shankaranarayan, G., Ziad, M., and Wang, R. Y. 2003. Managing data quality in dynamic decision environments: An information product approach. J. Datab. Manage. 14, 14–32.
Shen, J.-J. and Chen, M.-T. 2003. A recycle technique of association rule for missing value completion. In Proceedings of the 7th International Conference on Advanced Information Networking and Applications. IEEE Computer Society, 526.
Strong, D. M., Lee, Y. W., and Wang, R. Y. 1997. Data quality in context. Comm. ACM 40, 104–110.
Sun, S. and Yen, J. 2005. Information supply chain: A unified framework for information-sharing. In Intelligence and Security Informatics. Springer, 422–428.
Trochim, W. M. K. 1999. The Research Methods Knowledge Base. Cornell University, Ithaca.
Wang, R. Y., Reddy, M. P., and Henry, B. K. 1995a. Toward quality data: An attribute-based approach. Decis. Supp. Syst. 13, 349–372.
Wang, R. Y., Storey, V. C., and Firth, C. P. 1995b. A framework for analysis of data quality research. IEEE Trans. Knowl. Data Engin. 7, 623–640.
Wang, R. Y. and Strong, D. M. 1996. Beyond accuracy: What data quality means to data consumers. J. Manag. Inform. Syst. 12, 5–34.
Witten, I. H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco.
Received October 2008; revised April 2009, August 2009, December 2009; accepted March 2010