Mining Incomplete Data with Attribute-Concept Values and "Do Not Care" Conditions

Patrick G. Clark1 and Jerzy W. Grzymala-Busse1,2
1 Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA
2 Institute of Computer Science, Polish Academy of Sciences, 01-237 Warsaw, Poland
[email protected], [email protected]

Abstract. In this paper we present novel experimental results comparing two interpretations of missing attribute values: attribute-concept values and "do not care" conditions. Experiments were conducted on 176 data sets, preprocessed using three kinds of probabilistic approximations (lower, middle and upper), with the MLEM2 rule induction system. Performance was evaluated using the error rate computed by ten-fold cross validation. At the 5% statistical significance level, attribute-concept values performed better in four cases and "do not care" conditions performed better in two cases. At the 10% statistical significance level, attribute-concept values performed better in five cases and "do not care" conditions in three. In the remaining cases the differences were not statistically significant.

1 Introduction

Lower and upper approximations are the basic ideas of rough set theory. A probabilistic approximation, defined using a probability α, is an extension of the standard lower and upper approximations: if α is equal to 1, the probabilistic approximation reduces to the lower approximation; if α is slightly larger than 0, the probabilistic approximation becomes the upper approximation. Probabilistic approximations have been investigated in Bayesian rough sets, decision-theoretic rough sets, variable precision rough sets, etc.; see, e.g., [12, 15-17, 20, 22-25].

Hybrid intelligent systems are an area of research that seeks to combine many single-classifier approaches to pattern recognition so that the resulting collective performance improves on the performance of any single part [21]. The idea that no single computational view solves all problems was presented in [19]. In our work we combine two areas of intelligent systems: rule learner classification systems and uncertainty management in the form of rough sets [21].

Until recently, research on probabilistic approximations focused on theoretical properties of such approximations and was restricted to complete data sets (with no missing attribute values). For incomplete data sets, standard approximations were extended to probabilistic approximations in [11]. The first papers reporting experimental results on probabilistic approximations were [1, 4].

In this paper we study two interpretations of missing attribute values: attribute-concept values and "do not care" conditions. Our research is a continuation of [5], in which three interpretations of missing attribute values (lost values, attribute-concept values and "do not care" conditions) were discussed; however, the data sets used for experiments in [5] were very restricted: only eight data sets were considered, all with 35% missing attribute values. In this paper we consider a spectrum of data sets with various percentages of missing attribute values, starting from 0 (complete data sets) and ending with saturated incomplete data sets, in increments of 5%; for details see Section 4. Our main objective was to check which interpretation of missing attribute values, attribute-concept values or "do not care" conditions, is better in terms of the error rate.
Our secondary objective was to compare three types of approximations: lower, middle and upper, where a middle approximation is a probabilistic approximation associated with the parameter α = 0.5 [2, 3]. In this paper we study the usefulness of all three types of probabilistic approximations applied to rule induction from incomplete data. There exist many definitions of approximations for data sets with missing attribute values [9]; we use one of the most successful options (from the viewpoint of rule induction), called concept approximations [9]. Concept approximations were generalized to concept probabilistic approximations in [11].

Our experiments on rule induction on 176 data sets (with two types of missing attribute values) and with three probabilistic approximations (lower, middle and upper) show that the error rate, evaluated by ten-fold cross validation, depends on the choice of the data set. Our main conclusion is that for a specific data set both choices, the interpretation of missing attribute values and the approximation type, should be taken into account in order to find the best combination for data mining.

2 Incomplete Data

We assume that the input data sets are presented in the form of a decision table. An example of a decision table is shown in Table 1. Rows of the decision table represent cases, while columns are labeled by variables. The set of all cases will be denoted by U. In Table 1, U = {1, 2, 3, 4, 5, 6, 7, 8}. Independent variables are called attributes and the dependent variable is called the decision, denoted by d. The set of all attributes will be denoted by A. In Table 1, A = {Education, Skills, Experience}. The value of an attribute a for a case x will be denoted by a(x).

In this paper we distinguish between two interpretations of missing attribute values: attribute-concept values and "do not care" conditions. An attribute-concept value, denoted by "−", means that the original attribute value is unknown; however, because we know the concept to which the case belongs, we know all possible attribute values. For example, if we know that a patient is sick with flu and typical temperature values for such patients are high or very high, then we will use these values for rule induction; for details see [10]. A "do not care" condition, denoted by "*", means that the original attribute value is irrelevant, so we may replace it by any attribute value; for details see [6, 13, 18]. Table 1 presents an incomplete data set affected by both attribute-concept values and "do not care" conditions.

Table 1. A decision table

Case  Education   Skills  Experience  Productivity
 1    higher      high    −           high
 2    *           high    low         high
 3    secondary   −       high        high
 4    higher      *       high        high
 5    elementary  high    low         low
 6    secondary   −       high        low
 7    −           low     high        low
 8    elementary  *       −           low

One of the most important ideas of rough set theory [14] is the indiscernibility relation, defined for complete data sets. Let B be a nonempty subset of A. The indiscernibility relation R(B) is a relation on U defined for x, y ∈ U as follows: (x, y) ∈ R(B) if and only if ∀a ∈ B (a(x) = a(y)). The indiscernibility relation R(B) is an equivalence relation. Equivalence classes of R(B) are called elementary sets of B and are denoted by [x]_B. A subset of U is called B-definable if it is a union of elementary sets of B.

The set X of all cases defined by the same value of the decision d is called a concept. For example, the concept associated with the value high of the decision Productivity is the set {1, 2, 3, 4}.
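To make these notions concrete, the following is a minimal sketch (ours, in Python; the table encoding, the small complete table and the function names are illustrative assumptions, not part of the paper) that computes elementary sets and concepts for a complete decision table:

```python
# A minimal sketch (not from the paper): elementary sets and concepts
# for a COMPLETE decision table, stored as {case: {variable: value}}.

def elementary_sets(table, B):
    """Partition the cases into equivalence classes [x]_B of R(B)."""
    classes = {}
    for x, row in table.items():
        key = tuple(row[a] for a in B)          # profile of x on B
        classes.setdefault(key, set()).add(x)
    return list(classes.values())

def concepts(table, d):
    """Group cases by the value of the decision d."""
    result = {}
    for x, row in table.items():
        result.setdefault(row[d], set()).add(x)
    return result

# A small complete table (no missing values) in the style of Table 1.
table = {
    1: {"Education": "higher",     "Skills": "high", "Productivity": "high"},
    2: {"Education": "higher",     "Skills": "high", "Productivity": "high"},
    3: {"Education": "elementary", "Skills": "low",  "Productivity": "low"},
    4: {"Education": "elementary", "Skills": "low",  "Productivity": "low"},
}
print(elementary_sets(table, ["Education", "Skills"]))  # [{1, 2}, {3, 4}]
print(concepts(table, "Productivity"))   # {'high': {1, 2}, 'low': {3, 4}}
```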
The largest B-definable set contained in X is called the B-lower approximation of X, denoted by \underline{appr}_B(X) and defined as ∪{[x]_B | [x]_B ⊆ X}, while the smallest B-definable set containing X is called the B-upper approximation of X, denoted by \overline{appr}_B(X) and defined as ∪{[x]_B | [x]_B ∩ X ≠ ∅}.

For a variable a and its value v, (a, v) is called a variable-value pair. A block of (a, v), denoted by [(a, v)], is the set {x ∈ U | a(x) = v} [7]. For incomplete decision tables the definition of a block of an attribute-value pair is modified in the following way.

– If for an attribute a there exists a case x such that a(x) = −, then the case x should be included in blocks [(a, v)] for all specified values v ∈ V(x, a) of attribute a, where V(x, a) = {a(y) | a(y) is specified, y ∈ U, d(y) = d(x)},
– If for an attribute a there exists a case x such that a(x) = *, then the case x should be included in blocks [(a, v)] for all specified values v of the attribute a.

For the data set from Table 1, V(1, Experience) = {low, high}, V(3, Skills) = {high}, V(6, Skills) = {low, high}, V(7, Education) = {elementary, secondary}, and V(8, Experience) = {low, high}. For the data set from Table 1 the blocks of attribute-value pairs are:

[(Education, elementary)] = {2, 5, 7, 8},
[(Education, secondary)] = {2, 3, 6, 7},
[(Education, higher)] = {1, 2, 4},
[(Skills, low)] = {4, 6, 7, 8},
[(Skills, high)] = {1, 2, 3, 4, 5, 6, 8},
[(Experience, low)] = {1, 2, 5, 8},
[(Experience, high)] = {1, 3, 4, 6, 7, 8}.

For a case x ∈ U and B ⊆ A, the characteristic set K_B(x) is defined as the intersection of the sets K(x, a), for all a ∈ B, where the set K(x, a) is defined in the following way:

– If a(x) is specified, then K(x, a) is the block [(a, a(x))] of attribute a and its value a(x),
– If a(x) = −, then K(x, a) is equal to the union of all blocks of attribute-value pairs (a, v), where v ∈ V(x, a), if V(x, a) is nonempty. If V(x, a) is empty, K(x, a) = U,
– If a(x) = *, then K(x, a) = U, where U is the set of all cases.

For Table 1 and B = A,

K_A(1) = {1, 2, 4},
K_A(2) = {1, 2, 5, 8},
K_A(3) = {3, 6},
K_A(4) = {1, 4},
K_A(5) = {2, 5, 8},
K_A(6) = {3, 6, 7},
K_A(7) = {6, 7, 8},
K_A(8) = {2, 5, 7, 8}.

Note that for incomplete data there are a few possible ways to define approximations [9]; we used concept approximations [11], since our previous experiments indicated that such approximations are the most efficient [11]. A B-concept lower approximation of the concept X is defined as follows:

\underline{B}X = ∪{K_B(x) | x ∈ X, K_B(x) ⊆ X},

while a B-concept upper approximation of the concept X is defined by:

\overline{B}X = ∪{K_B(x) | x ∈ X, K_B(x) ∩ X ≠ ∅} = ∪{K_B(x) | x ∈ X}.

For Table 1, the A-concept lower and upper approximations of the concept {5, 6, 7, 8} are:

\underline{A}{5, 6, 7, 8} = {6, 7, 8},
\overline{A}{5, 6, 7, 8} = {2, 3, 5, 6, 7, 8}.
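These definitions translate almost directly into code. The following sketch is ours, not the authors' implementation; ASCII "-" and "*" stand for the paper's "−" and "*". It reproduces the blocks, characteristic sets and A-concept approximations computed above for Table 1:

```python
# A sketch of blocks [(a, v)], characteristic sets K_B(x) and concept
# approximations for a table with "-" and "*" missing values.

TABLE = {
    1: {"Education": "higher",     "Skills": "high", "Experience": "-",    "Productivity": "high"},
    2: {"Education": "*",          "Skills": "high", "Experience": "low",  "Productivity": "high"},
    3: {"Education": "secondary",  "Skills": "-",    "Experience": "high", "Productivity": "high"},
    4: {"Education": "higher",     "Skills": "*",    "Experience": "high", "Productivity": "high"},
    5: {"Education": "elementary", "Skills": "high", "Experience": "low",  "Productivity": "low"},
    6: {"Education": "secondary",  "Skills": "-",    "Experience": "high", "Productivity": "low"},
    7: {"Education": "-",          "Skills": "low",  "Experience": "high", "Productivity": "low"},
    8: {"Education": "elementary", "Skills": "*",    "Experience": "-",    "Productivity": "low"},
}
A, D = ["Education", "Skills", "Experience"], "Productivity"

def V(t, x, a, d=D):
    """V(x, a): specified values of a within the concept of case x."""
    return {t[y][a] for y in t
            if t[y][d] == t[x][d] and t[y][a] not in ("-", "*")}

def block(t, a, v, d=D):
    """Block [(a, v)] under the modified definition for "-" and "*"."""
    return {x for x, row in t.items()
            if row[a] == v or row[a] == "*"
            or (row[a] == "-" and v in V(t, x, a, d))}

def K(t, x, B, d=D):
    """Characteristic set K_B(x): intersection of the sets K(x, a)."""
    U, result = set(t), set(t)
    for a in B:
        v = t[x][a]
        if v == "*":
            Kxa = U
        elif v == "-":
            vals = V(t, x, a, d)
            Kxa = set().union(*(block(t, a, w, d) for w in vals)) if vals else U
        else:
            Kxa = block(t, a, v, d)
        result &= Kxa
    return result

def concept_approximations(t, X, B, d=D):
    """B-concept lower and upper approximations of the concept X."""
    Ks = [K(t, x, B, d) for x in X]
    lower = set().union(*(k for k in Ks if k <= X))
    upper = set().union(*Ks)
    return lower, upper

print(sorted(K(TABLE, 7, A)))               # [6, 7, 8]
lo, up = concept_approximations(TABLE, {5, 6, 7, 8}, A)
print(sorted(lo), sorted(up))               # [6, 7, 8] [2, 3, 5, 6, 7, 8]
```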
3 Probabilistic approximations

For completely specified data sets a probabilistic approximation is defined as follows:

appr_α(X) = ∪{[x] | x ∈ U, Pr(X | [x]) ≥ α},

where α is a parameter, 0 < α ≤ 1; see [11, 12, 16, 20, 22, 24]. Additionally, for simplicity, the elementary sets [x]_A are denoted by [x]. For a discussion of how this definition is related to variable precision asymmetric rough sets see [1, 11].

Note that if α = 1, the probabilistic approximation becomes the standard lower approximation, and if α is small and close to 0 (in our experiments it was 0.001), the same definition describes the standard upper approximation.

For incomplete data sets, a B-concept probabilistic approximation is defined by the following formula [11]:

∪{K_B(x) | x ∈ X, Pr(X | K_B(x)) ≥ α}.

For simplicity, we will denote K_A(x) by K(x), and the A-concept probabilistic approximation will be called simply a probabilistic approximation. For Table 1 and the concept X = [(Productivity, low)] = {5, 6, 7, 8}, there exist three distinct probabilistic approximations:

appr_1.0({5, 6, 7, 8}) = {6, 7, 8},
appr_0.75({5, 6, 7, 8}) = {2, 5, 6, 7, 8}, and
appr_0.001({5, 6, 7, 8}) = {2, 3, 5, 6, 7, 8}.

The special probabilistic approximation with the parameter α = 0.5 will be called a middle approximation.
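Continuing the earlier sketch (it reuses TABLE, A, D and K defined above; again an illustration of ours, not the authors' code), the concept probabilistic approximation needs only a few lines and reproduces the three distinct approximations listed above:

```python
def prob_approx(t, X, B, alpha, d=D):
    """Union of K_B(x), x in X, with Pr(X | K_B(x)) >= alpha."""
    result = set()
    for x in X:
        k = K(t, x, B, d)
        if len(k & X) / len(k) >= alpha:   # conditional probability Pr(X | K(x))
            result |= k
    return result

X = {5, 6, 7, 8}                           # [(Productivity, low)]
for alpha in (1.0, 0.75, 0.5, 0.001):
    print(alpha, sorted(prob_approx(TABLE, X, A, alpha)))
# 1.0   -> [6, 7, 8]            (lower approximation)
# 0.75  -> [2, 5, 6, 7, 8]
# 0.5   -> [2, 3, 5, 6, 7, 8]   (middle approximation; here equal to the upper)
# 0.001 -> [2, 3, 5, 6, 7, 8]   (upper approximation)
```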
[Figures 1–8 plot the error rate (ten-fold cross validation) against the percentage of missing attribute values, with six series per figure: lower, middle and upper approximations, each for "−" and for "*".]

Fig. 1. Error rate for the bankruptcy data set
Fig. 2. Error rate for the breast cancer data set
Fig. 3. Error rate for the echocardiogram data set
Fig. 4. Error rate for the hepatitis data set
Fig. 5. Error rate for the image segmentation data set
Fig. 6. Error rate for the iris data set
Fig. 7. Error rate for the lymphography data set
Fig. 8. Error rate for the wine recognition data set

4 Experiments

For our experiments we used eight types of data sets from the University of California at Irvine Machine Learning Repository. Our main objective was to check which interpretation of missing attribute values, attribute-concept values or "do not care" conditions, is better in terms of the error rate. Our secondary objective was to test which of the three probabilistic approximations (lower, middle or upper) provides the best results, again in terms of the error rate.

In our experiments two parameters were used: the percentage of missing attribute values and the probability α used in the definition of probabilistic approximations. Both parameters have many possible values, so for practical reasons we restricted them; for the percentage of missing attribute values we considered 0, 5%, 10% and so on.

We replaced existing attribute values by symbols of missing attribute values randomly and incrementally, first using "−" and then replacing all symbols of attribute-concept values by "*", with an increment of 5%, until replacing another 5% of existing attribute values would leave some row of the incomplete data set entirely filled with missing attribute values (i.e., there would be a case x such that a(x), for all a ∈ A, was a symbol of a missing attribute value). If so, we retracted the last replacement and tried another random replacement. If this attempt was unsuccessful, we tried yet another random replacement. If this third attempt was unsuccessful, we ended the process of creating incomplete data sets; a simplified sketch of this scheme is given below. For example, for the bankruptcy data set the maximum percentage of missing attribute values is 35%, since in three random tries of replacing yet another 5% of attribute values, all three data sets with 40% of missing attribute values had a row labeled by some x ∈ U with a(x) being a symbol of a missing attribute value for all a ∈ A.

For the bankruptcy data set, replacing existing attribute values by symbols of attribute-concept values resulted in seven new data sets (with 5%, 10%, ..., 35% of attribute-concept values). Then we created another seven data sets by replacing all symbols of attribute-concept values by symbols of "do not care" conditions. Thus, for bankruptcy, 15 data sets (including the complete one) were used in experiments. Since we used eight types of data sets, the total number of data sets used in experiments was 176. We restricted our attention to three probabilistic approximations: lower (α = 1), upper (α = 0.001) and the most typical probabilistic approximation, with α = 0.5, called the middle approximation. Therefore the total number of ten-fold cross validation experiments was 176 × 3 = 528.
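The following is the promised simplified sketch (ours, not the authors' exact procedure) of the incremental replacement scheme; it reuses TABLE and A from the earlier sketches:

```python
# Repeatedly overwrite a random 5% of all attribute cells with "-",
# retrying up to three times whenever a replacement would leave some
# case with every attribute value missing, and stopping after three
# failed tries.
import copy
import random

def inject_missing(table, attributes, step=0.05, tries=3, seed=0):
    rng = random.Random(seed)
    cells = [(x, a) for x in table for a in attributes]
    increment = max(1, int(step * len(cells)))
    versions = []                  # tables with 5%, 10%, ... missing values
    current = copy.deepcopy(table)
    while True:
        for _ in range(tries):
            candidate = copy.deepcopy(current)
            specified = [(x, a) for x, a in cells if candidate[x][a] != "-"]
            if len(specified) < increment:
                return versions    # nothing left to replace
            for x, a in rng.sample(specified, increment):
                candidate[x][a] = "-"
            # retract the replacement if some case lost all attribute values
            if all(any(candidate[x][a] != "-" for a in attributes)
                   for x in table):
                current = candidate
                versions.append(candidate)
                break
        else:
            return versions        # three unsuccessful tries: stop

# The "do not care" variants are obtained by replacing every "-" with
# "*" in each returned table, e.g. versions = inject_missing(TABLE, A).
```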
Results of our experiments are presented in Figures 1–8. First, for all eight types of data sets we computed the error rate associated with the two interpretations of missing attribute values: attribute-concept values and "do not care" conditions. Then we evaluated the statistical significance of the results using the Wilcoxon matched-pairs signed rank test, with a 5% significance level and a two-tailed test, separately for lower, middle and upper approximations. For the echocardiogram data set, for middle and upper approximations, the "do not care" condition interpretation of missing attribute values was better than the attribute-concept value interpretation. On the other hand, for the image segmentation data set, for all three types of approximations, and for the wine recognition data set with lower approximations, the attribute-concept value interpretation was better than the "do not care" condition interpretation. Thus, for two combinations of data set and type of approximation, the "do not care" condition interpretation was better, while for four other combinations the attribute-concept value interpretation was better. For the remaining 18 combinations of data set and type of approximation, the difference in performance between the two interpretations was not statistically significant (5% significance level).

When we changed the significance level of the Wilcoxon test to 10%, we additionally observed that for the echocardiogram data set with lower approximations the "do not care" condition interpretation was better than the attribute-concept value interpretation, while for the iris data set with lower approximations the attribute-concept value interpretation was better than the "do not care" condition interpretation. Thus, for three combinations of data set and type of approximation the "do not care" condition interpretation was better, and for five other combinations the attribute-concept value interpretation was better. For the remaining 16 combinations, the difference in performance between the two interpretations was not statistically significant (10% significance level).

Then we compared the three kinds of approximations, separately for attribute-concept values and for "do not care" conditions, using the Friedman rank sums test, again with a 5% significance level. The total number of tests was 16 (eight types of data sets with two interpretations of missing attribute values). For the image segmentation data set with attribute-concept values and with "do not care" conditions, for the iris data set with "do not care" conditions, and for the wine recognition data set with "do not care" conditions, there was strong evidence to reject the null hypothesis that all three approximations are equivalent. Using the test for ordered alternatives based on the Friedman rank sums we conclude that, at the 5% significance level, the middle and upper approximations are better than the lower approximations for both image segmentation data sets, while for the remaining two data sets, iris and wine recognition, both with "do not care" conditions, upper approximations are better than lower approximations. The difference in performance between the middle and upper approximations is not statistically significant. For the remaining 10 combinations of data set and type of missing attribute value, the difference in performance between approximations was not significant.

Again, when we changed the significance level of the Friedman test to 10%, we additionally observed that for the bankruptcy and lymphography data sets, both with "do not care" conditions, upper approximations are better than lower approximations, while for the iris and wine recognition data sets, both with "do not care" conditions, middle approximations are better than lower approximations. Thus, upper approximations were better than lower approximations for six combinations of data set and type of missing attribute value, and middle approximations were better than lower approximations for two other such combinations. For the remaining six combinations of data set and type of missing attribute value, the difference in performance between approximations was not significant.
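For readers who want to reproduce this style of analysis, here is a hypothetical sketch with scipy; the arrays are made-up error rates for illustration, not our experimental results:

```python
# Paired Wilcoxon test: error rates for one data set and one
# approximation type, paired by percentage of missing values.
from scipy.stats import wilcoxon, friedmanchisquare

error_ac  = [21.4, 23.0, 25.1, 27.9, 30.2, 33.5, 36.0]   # "-", made-up numbers
error_dnc = [21.9, 24.2, 26.8, 29.5, 33.0, 36.8, 41.1]   # "*", made-up numbers
stat, p = wilcoxon(error_ac, error_dnc)   # two-tailed by default
print(p < 0.05)                           # significant at the 5% level?

# Friedman test comparing lower, middle and upper error rates
# (again made-up) across the same incomplete versions of a data set.
lower  = [25.0, 28.1, 33.4, 39.2, 47.5, 58.0, 70.3]
middle = [22.1, 24.0, 26.3, 29.0, 31.8, 35.2, 39.0]
upper  = [21.8, 23.9, 26.0, 28.8, 31.5, 34.9, 38.5]
print(friedmanchisquare(lower, middle, upper))
```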
In summary, for the majority of the experiments performed, the results did not show a statistically significant difference in performance between the interpretations of missing attribute values or between the approximation types. However, there is strong evidence that there are situations where varying these choices yields better results for certain data sets.

For rule induction we used the MLEM2 (Modified Learning from Examples Module version 2) rule induction algorithm, a component of the LERS (Learning from Examples based on Rough Sets) data mining system [7, 8].

5 Conclusions

Our primary objective was to compare the quality of rule sets induced from incomplete data sets with attribute-concept values and "do not care" conditions, using three types of probabilistic approximations: lower, middle and upper. There is some evidence that attribute-concept values are better than "do not care" conditions in terms of the error rate measured by ten-fold cross validation.

This work is a continuation of the experiments in [5]. The primary focus of [5] was to identify the best interpretation of missing attribute values, with a secondary objective of testing the usefulness of concept probabilistic approximations in mining incomplete data. In this work we expanded our investigation to 176 data sets, while in [5] only 24 data sets were considered. However, because of the experiment size, we restricted the number of interpretations from three to two. In addition, while our primary objective was an investigation of two interpretations of missing attribute values, this work also compared three approximations in an effort to identify the most effective among lower, middle and upper.

Our experiments on rule induction on 176 data sets (with two types of missing attribute values) and with three probabilistic approximations (lower, middle and upper) show that the error rate, evaluated by ten-fold cross validation, depends on the choice of the data set. The majority of the experiments resulted in insignificant differences between the methods. However, as discussed above, our main conclusion is that for a specific data set both choices, the interpretation of missing attribute values and the approximation type, should be taken into account in order to find the best combination for data mining. Data sets with a large percentage of "do not care" conditions may cause the error rate for lower approximations to increase up to 100% due to large characteristic sets and, consequently, empty lower approximations and empty rule sets.

Our secondary objective was to compare the three approximations: lower, middle and upper. In six combinations out of 16, lower approximations were worse than middle or upper approximations (5% significance level). In addition, in ten combinations out of 16, lower approximations were worse than middle or upper approximations (10% significance level). Hence lower approximations should be avoided when mining incomplete data with attribute-concept values or "do not care" conditions.

References

1. Clark, P.G., Grzymala-Busse, J.W.: Experiments on probabilistic approximations. In: Proceedings of the 2011 IEEE International Conference on Granular Computing. (2011) 144–149
2. Clark, P.G., Grzymala-Busse, J.W.: Experiments on rule induction from incomplete data using three probabilistic approximations. In: Proceedings of the 2012 IEEE International Conference on Granular Computing. (2012) 90–95
3. Clark, P.G., Grzymala-Busse, J.W.: Experiments using three probabilistic approximations for rule induction from incomplete data sets. In: Proceedings of the MCCSIS 2012, IADIS European Conference on Data Mining ECDM 2012. (2012) 72–78
4. Clark, P.G., Grzymala-Busse, J.W.: Rule induction using probabilistic approximations and data with missing attribute values. In: Proceedings of the 15th IASTED International Conference on Artificial Intelligence and Soft Computing ASC 2012. (2012) 235–242
5. Clark, P.G., Grzymala-Busse, J.W.: An experimental comparison of three interpretations of missing attribute values using probabilistic approximations. In: Proceedings of the 14th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. (2013) 77–86
6. Grzymala-Busse, J.W.: On the unknown attribute values in learning from examples. In: Proceedings of the ISMIS-91, 6th International Symposium on Methodologies for Intelligent Systems. (1991) 368–377
7. Grzymala-Busse, J.W.: LERS—a system for learning from examples based on rough sets. In: Slowinski, R. (ed.): Intelligent Decision Support. Handbook of Applications and Advances of the Rough Set Theory. Kluwer Academic Publishers, Dordrecht, Boston, London (1992) 3–18
8. Grzymala-Busse, J.W.: MLEM2: A new algorithm for rule induction from imperfect data. In: Proceedings of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. (2002) 243–250
9. Grzymala-Busse, J.W.: Rough set strategies to data with missing attribute values. In: Workshop Notes, Foundations and New Directions of Data Mining, in conjunction with the 3rd International Conference on Data Mining. (2003) 56–63
10. Grzymala-Busse, J.W.: Three approaches to missing attribute values—a rough set perspective. In: Proceedings of the Workshop on Foundations of Data Mining, in conjunction with the Fourth IEEE International Conference on Data Mining. (2004) 55–62
11. Grzymala-Busse, J.W.: Generalized parameterized approximations. In: Proceedings of the RSKT 2011, the 6th International Conference on Rough Sets and Knowledge Technology. (2011) 136–145
12. Grzymala-Busse, J.W., Ziarko, W.: Data mining based on rough sets. In: Wang, J. (ed.): Data Mining: Opportunities and Challenges. Idea Group Publishing, Hershey, PA (2003) 142–173
13. Kryszkiewicz, M.: Rough set approach to incomplete information systems. In: Proceedings of the Second Annual Joint Conference on Information Sciences. (1995) 194–197
14. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11 (1982) 341–356
15. Pawlak, Z., Skowron, A.: Rough sets: Some extensions. Information Sciences 177 (2007) 28–40
16. Pawlak, Z., Wong, S.K.M., Ziarko, W.: Rough sets: probabilistic versus deterministic approach. International Journal of Man-Machine Studies 29 (1988) 81–95
17. Ślęzak, D., Ziarko, W.: The investigation of the Bayesian rough set model. International Journal of Approximate Reasoning 40 (2005) 81–91
18. Stefanowski, J., Tsoukias, A.: On the extension of rough sets under incomplete information. In: Proceedings of the RSFDGrC'1999, 7th International Workshop on New Directions in Rough Sets, Data Mining, and Granular-Soft Computing. (1999) 73–81
19. Wolpert, D.H.: The supervised learning no-free-lunch theorems. In: Soft Computing and Industry. Springer (2002) 25–42
20. Wong, S.K.M., Ziarko, W.: INFER—an adaptive decision support system based on the probabilistic approximate classification. In: Proceedings of the 6th International Workshop on Expert Systems and their Applications. (1986) 713–726
21. Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Information Fusion 16 (2014) 3–17
22. Yao, Y.Y.: Probabilistic rough set approximations. International Journal of Approximate Reasoning 49 (2008) 255–271
23. Yao, Y.Y., Wong, S.K.M.: A decision theoretic framework for approximating concepts. International Journal of Man-Machine Studies 37 (1992) 793–809
24. Ziarko, W.: Variable precision rough set model. Journal of Computer and System Sciences 46(1) (1993) 39–59
25. Ziarko, W.: Probabilistic approach to rough sets. International Journal of Approximate Reasoning 49 (2008) 272–284