Improving Contrast Set Mining

Mondelle Simeon and Robert Hilderman
Department of Computer Science, University of Regina
Regina, Saskatchewan, Canada S4S 0A2
Email: {simeon2m, hilder}@cs.uregina.ca

Abstract

A fundamental task in exploratory data analysis is discerning the differences between contrasting groups. Contrast set mining has been developed as a data mining task which aims to identify these differences. This paper examines the algorithms, heuristics, and open issues of contrast set mining, and seeks to improve contrast set mining by addressing several of the open issues. It proposes four interestingness measures for ranking contrast sets: coverage, overall support, growth rate, and unusualness. It introduces a new method for discretizing quantitative attributes. A new type of contrast set, called the jumping contrast set, is defined, and the contrast set mining process is modified to include mining both types of contrast sets on datasets containing both quantitative and categorical attributes. Finally, a simple visualization method is introduced to describe contrast sets to the end-user.

1 Introduction

Discovering the differences between groups is a fundamental problem in many disciplines. For example, a medical researcher might be interested in uncovering "What is the difference between males and females affected by a disease, such as diabetes?". He may discover that women who have a waist size greater than 40 inches are twice as likely to have diabetes as men who have the same waist size.

Groups are defined by a selected property that distinguishes one group from the other. The search for group differences can be applied to patients, organizations, molecules, and even timelines. Contrast set mining was developed by Bay and Pazzani (1999) to automatically detect all the differences between contrasting groups from observational multivariate data. Contrast set mining seeks conjunctions of attributes and values, called contrast sets, that have different levels of support in different groups.

This paper examines the algorithms, heuristics, and open issues of contrast set mining, and seeks to improve contrast set mining by addressing four open issues. It proposes four interestingness measures for ranking contrast sets. It also examines three ways of comparing contrasting groups to determine which contrast sets are significant. Several methods for discretizing quantitative attributes are discussed, and a new discretization method based on statistical measures is proposed. This paper also defines a new type of contrast set, called the jumping contrast set, and expands the contrast set mining process to include mining both the original contrast sets and jumping contrast sets on datasets with both categorical and quantitative attributes. Finally, this paper introduces a simple visualization method to describe contrast sets to the end-user.

2 Contrast Set Mining

Contrast set mining is closely related to association rule mining, and utilizes some of the terminology and notation of association rule mining (Agrawal et al., 1993). Contrast sets were first defined by Bay and Pazzani (1999) as conjunctions of attributes and values that differ meaningfully in their distributions across groups. They further explore contrast set mining in Bay and Pazzani (2001).

2.1 Contrast Set Mining Algorithm

The STUCCO (Search and Testing for Understandable Consistent Contrasts) algorithm, which is based on the Max-Miner rule discovery algorithm (Bayardo, 1998), was introduced by Bay and Pazzani (1999). The algorithm discovers a set of contrast sets along with their supports on groups. Let A1, A2, ..., Ak be a set of k variables called attributes. Each Ai can take on values from the set {Vi1, Vi2, ..., Vim}.
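As a minimal sketch (the records, attribute names, and values here are invented for illustration), grouped attribute-value data of this form, together with the per-group support notion defined next, can be represented as:

```python
# Hypothetical records: three categorical attributes plus a group label.
records = [
    {"A1": "a", "A2": "x", "A3": "p", "group": "G1"},
    {"A1": "a", "A2": "x", "A3": "p", "group": "G1"},
    {"A1": "a", "A2": "y", "A3": "q", "group": "G2"},
    {"A1": "b", "A2": "y", "A3": "q", "group": "G2"},
]

# support(CSet, Gi): the percentage of examples in group Gi for which the
# conjunction of attribute-value pairs holds.
def support(cset, group):
    members = [r for r in records if r["group"] == group]
    hits = [r for r in members if all(r[a] == v for a, v in cset.items())]
    return 100.0 * len(hits) / len(members)

print(support({"A1": "a", "A2": "x"}, "G1"))  # 100.0
print(support({"A1": "a", "A2": "x"}, "G2"))  # 0.0
```

In this toy data the conjunction {A1 = a, A2 = x} holds for every example in G1 and for none in G2, which is exactly the kind of support difference contrast set mining looks for.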
Then a contrast set, CSet, is a conjunction of attribute-value pairs defined on mutually exclusive groups G1, G2, ..., Gn, with no Ai occurring more than once and Gi ∩ Gj = ∅ for all i ≠ j. The support of a contrast set with respect to a group Gi, represented as support(CSet, Gi), is the percentage of examples in Gi for which the contrast set is true. STUCCO finds significant contrast sets by testing the null hypothesis that contrast-set support is equal across all groups, or alternatively, that contrast-set support is independent of group membership. Formally, it seeks contrast sets for which:

  ∃ i,j : P(CSet = True | Gi) ≠ P(CSet = True | Gj)    (1)

  max_{i,j} |support(CSet, Gi) − support(CSet, Gj)| ≥ δ    (2)

where δ is a user-defined threshold called the minimum support difference. Contrast sets for which Equation 1 is statistically valid are called significant, and contrast sets for which Equation 2 is met are called large. A potential contrast set is discarded if it fails the test for independence, measured with the chi-squared (χ²) test, with respect to the group variable, Gi. In order to limit the possibly high probability of false rejection, STUCCO employs a variant of the Bonferroni correction for multiple hypothesis tests, which uses more stringent critical values for the statistical tests as the number of conditions in the contrast set increases.

In the contrast set mining process, the search space for potential contrast sets can be represented as a tree structure containing every possible combination of the attributes. For a dataset that has three categorical attributes, '1', '2', and '3', in addition to a class attribute, Figure 1 shows the search tree. The '{}' at the root of the tree represents the entire dataset.

Figure 1: Example search tree for a dataset with three attributes

The search process begins with the most general terms first, shown as 'Level 1' in Figure 1. In moving to 'Level 2', conjunctions of the attributes are then examined. At this level, there are only three contrast sets to examine because, for instance, '{1,2}' represents the same instances in the dataset as '{2,1}'. More complex combinations are examined in subsequent levels. The number of levels in the search tree is equal to the number of attributes. The number of contrast sets on each level is equal to the number of combinations of n items taken i at a time, represented as C(n,i), where n is the number of attributes and i is the level in the tree. The total number of contrast sets is therefore Σ_{i=1}^{n} C(n,i).

STUCCO prunes portions of the search space that are determined to produce contrast sets which fail to meet Equations 1 and 2. It utilizes three types of pruning methods: effect size pruning, statistical significance pruning, and interest-based pruning. Effect size pruning requires that the maximum difference between the supports of the two groups in the contrast set be greater than some predetermined threshold. Statistical significance pruning requires that there are enough data points for a valid chi-squared test, or that the maximum value the chi-squared statistic can take is large enough to be significant. Interest-based pruning requires that the contrast sets selected are interesting enough, where interest is measured as representing new information relative to the already selected contrast sets.

Figure 2 summarizes the algorithm. A search tree is constructed for all possible itemsets as in Figure 1, the minimum support is applied to generate candidate itemsets, filtering conditions are applied to produce candidate contrast sets, and independence tests are performed to discover the significant contrast sets.

Figure 2: Procedure for mining contrast sets

2.2 Related Research on Contrast Set Mining

An algorithm to discover negative contrast sets, which can include the negation of terms in the contrast set, was proposed by Wong and Tseng (2005). Their algorithm utilizes the same steps as in Figure 2 to discover negative contrast sets. However, they utilize Holm's sequential rejective method (Holm, 1979) for the independence test.

A different approach to contrast set mining, called CIGAR (Contrasting Grouped Association Rules), was proposed by Hilderman and Peckham (2005). CIGAR not only considers whether the difference in support between groups is significant, it also specifically identifies which pairs of groups are significantly different and whether the attributes in a contrast set are correlated. CIGAR utilizes the same general approach as STUCCO; however, it uses a series of 2 × 2 contingency tables to determine whether a difference exists between contrast sets in two or more groups. In contrast to STUCCO, CIGAR focuses on controlling Type II error by increasing the significance level for the significance tests, and by not correcting for multiple comparisons. CIGAR utilizes a correlation pruning technique that compares the correlation difference, r, between parents and children in the search space and rejects candidate contrast sets where r is less than a predetermined threshold.

A general formulation of interesting contrast rules was proposed by Minaei-Bidgoli et al. (2004). They developed an algorithm to discover a set of contrast rules by investigating three statistical measures: difference of confidence, difference of proportion, and correlation and chi-square. Their algorithm allows for very low minimum support values, which allows for the discovery of rules that would normally have been overlooked.

The previous contrast set mining approaches focused on discrete data, which is not frequently the case in real-world data. However, few attempts have been made at discovering contrast sets on data which contain continuous attributes. A formal notion of a time series contrast set was introduced by Lin and Keogh (2006). They proposed an efficient algorithm to discover time-series contrast sets on time-series and multimedia data. The algorithm utilizes a SAX alphabet (Lin et al., 2003) to convert continuous data to discrete data (discretization).

A modified equal-width binning interval approach to discretizing continuous-valued attributes, where the approximate width of the intervals is provided as a parameter to the model, was proposed by Simeon and Hilderman (2007). These intervals are then used in generating contrast sets where the consequent in the rules contains up to two continuous-valued attributes. The approach used is similar to Figure 2, with the discretization process taking place before determining all possible itemsets. The Student's t-test and z-test are used to determine independence. Both tests require at least two items for each group, and this also serves as the minimum support requirement. An objective measure for identifying and ranking potentially interesting contrast sets, referred to as the distribution difference, was also proposed.

2.3 Open issues in Contrast Set Mining

A study to compare contrast set mining with existing rule-discovery techniques was undertaken by Webb et al. (2003). They compared a commercial rule-discovery system, called Magnum Opus (Webb, 2001), with STUCCO and concluded that contrast set mining is equivalent to a special case of a more general rule-discovery task. They identified several open issues in contrast set mining which require further research: the development of filters to guard against spurious contrasts while preventing the removal of potentially valuable contrasts; the development of appropriate methods to describe the contrasts that are discovered; and the investigation of the impact of various independence tests on the contrast sets which are discovered. Since their publication, the work by Simeon and Hilderman (2007) is the only attempt to directly address one of these issues, through the development of an objective measure for identifying and ranking contrast sets.

3 Addressing the open issues in Contrast Set Mining

Several open issues in contrast set mining were identified in Section 2.3. These issues, among others, need to be addressed to further develop contrast set mining into a more useful data mining task. The most urgent issues are the development of additional heuristics for ranking contrast sets, the comparison of contrasting groups for independence testing, the discretization and classification of quantitative data, and the visualization of contrast sets.

In order to distinguish amongst the contrast sets which are returned, four new interestingness measures are proposed to rank the contrast sets: overall support, an extension of the support measure by Bay and Pazzani (1999), growth rate, coverage, and unusualness. The latter three measures were adapted from the areas of emerging patterns (Dong and Li, 1999) and subgroup discovery (Klösgen, 1996; Kavsek and Lavrac, 2006; Lavrac et al., 2004), because of their provable compatibility with contrast set mining. The overall support for a contrast set, X, is defined as the sum of the supports of X over all the groups. The growth rate of a contrast set, X, is defined as the maximum ratio of the supports in Gi over Gj for all pairs of the groups.
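With hypothetical per-group support values (invented for illustration; the paper does not give numeric examples), the two measures just defined can be computed as:

```python
# Hypothetical per-group supports (percentages) for a contrast set X.
supports = {"G1": 30.0, "G2": 10.0, "G3": 5.0}

# Overall support: the sum of X's supports over all the groups.
overall_support = sum(supports.values())

# Growth rate: the maximum ratio support(X, Gi) / support(X, Gj) over all
# ordered pairs of distinct groups (assumes all supports are non-zero).
growth_rate = max(
    supports[gi] / supports[gj]
    for gi in supports for gj in supports if gi != gj
)

print(overall_support)  # 45.0
print(growth_rate)      # 6.0
```

Note that the sketch assumes a non-zero denominator; the zero-support case is exactly what motivates the definition that follows.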
When the support of X in Gi is zero and the support of X in Gj is not, the ratio is infinite; this creates a new type of contrast set, called the jumping contrast set. The unusualness of a contrast set, X, is defined as the maximum conditional probability of a group Gi given that X is satisfied. The coverage of a contrast set, X, is defined as the percentage of instances in the dataset covered by X.

In the contrast set mining process, there are three ways of inducing rules in multi-class learning problems: learners induce the rules that characterize one class compared to the rest of the data, referred to as one-versus-all; they search for rules that discriminate between all pairs of classes, referred to as round robin (Kralj et al., 2007); or learners induce rules such that each class is represented separately and the classes are compared simultaneously (Bay and Pazzani, 1999), referred to as separate grouping. Initial analysis shows that, for the same independence test and significance level, the same conclusion would not be drawn in all three approaches.

A new method is proposed to partition the values of quantitative attributes into intervals, referred to as discretization, in contrast set mining, based on the work of Aumann and Lindell (1999). Cut-points for the intervals are created based on the standard deviation of the values. Starting from the mean, left and right cut-points are found by iteratively subtracting and adding, respectively, a user-defined factor of the standard deviation, until the potential cut-points are outside the range of the values. Each interval can be considered as a discrete value, whereby a statistical test such as chi-squared can be used to determine independence. However, the data points within each interval are continuous, and a statistical test such as the Student's t-test can also be used to determine independence. Can the test selection affect the contrast sets discovered?
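The cut-point generation just described can be sketched as follows. This is a minimal interpretation under stated assumptions: the `factor` parameter, the use of the sample standard deviation, and the exclusion of the mean itself as a boundary are choices made for illustration, not details fixed by the paper.

```python
import statistics

def cut_points(values, factor=0.5):
    """Sketch of the proposed discretization: generate interval cut-points
    by stepping factor * stdev away from the mean until the candidate
    cut-points fall outside the range of the observed values."""
    mean = statistics.mean(values)
    step = factor * statistics.stdev(values)  # sample standard deviation
    lo, hi = min(values), max(values)
    points = []
    # Left cut-points: iteratively subtract the step from the mean.
    p = mean - step
    while p > lo:
        points.insert(0, p)
        p -= step
    # Right cut-points: iteratively add the step to the mean.
    p = mean + step
    while p < hi:
        points.append(p)
        p += step
    return points

# Example: values spread around a mean of 30 (stdev is sqrt(250) ~ 15.81),
# so one cut-point falls on each side of the mean.
print([round(p, 2) for p in cut_points([10, 20, 30, 40, 50], factor=1.0)])
```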
Initial analysis shows that the treatment of discretized values has significant effects on the contrast sets discovered.

Once the contrast sets have been discovered, a suitable method for describing them to end-users is needed. One simple method is to use a bar chart to visualize the contrast sets. The first row would show areas or blocks which are in proportion to the distribution of the entire dataset. Each following row represents a single contrast set and its support, represented numerically and as a shaded area, in each group. The contrast sets would be ordered based on a ranking heuristic such as the growth rate. Figure 3 shows a model of this for four contrast sets and three groups with sample support values.

Figure 3: Bar graph visualization of contrast sets

4 Re-defining the Contrast Set Mining Process

With the introduction of a new type of contrast set, and the expansion to quantitative attributes, the contrast set mining process in Figure 2 needs to be redefined to reflect these changes. Figure 4 shows the modified procedure for mining contrast sets.

Figure 4: Modified Contrast Set Mining Procedure

The total itemsets are produced from quantitative attributes, categorical attributes, or both, depending on the dataset, through the construction of a search tree. However, this tree is potentially larger than in Figure 1, because each quantitative attribute is discretized at every level in the tree, in order to create intervals that are most representative of the instances. Thus, each possible ordered combination of the attributes has to be examined. This results in a maximum number of contrast sets to be examined of Σ_{i=1}^{n} P(n,i), where P(n,i) is the number of permutations of n attributes taken i at a time, n is the number of attributes, and i is the level in the tree. This maximum is achieved when all the attributes, except the class attribute, are quantitative. Figure 5, which uses the same notation as Figure 1, shows the modified search tree for this maximum.

Figure 5: Example search tree for a dataset with three non-class attributes where all three are quantitative
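To make the growth in search-space size concrete, the two bounds can be checked for the three-attribute example of Figures 1 and 5 (the function names are illustrative, not from the paper):

```python
from math import comb, perm

def categorical_tree_size(n):
    # Total contrast sets when all n attributes are categorical:
    # the sum over levels i of C(n, i) combinations (Figure 1).
    return sum(comb(n, i) for i in range(1, n + 1))

def quantitative_tree_size(n):
    # Maximum number of contrast sets when all n non-class attributes are
    # quantitative: each attribute is re-discretized at every level, so
    # order matters and the bound becomes the sum of P(n, i) (Figure 5).
    return sum(perm(n, i) for i in range(1, n + 1))

print(categorical_tree_size(3))   # 7  (3 + 3 + 1)
print(quantitative_tree_size(3))  # 15 (3 + 6 + 6)
```

So moving from combinations to permutations roughly doubles the three-attribute search space, and the gap widens quickly as n grows.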
From the total itemsets, those which do not meet the minimum support are removed. The remaining itemsets become the candidate itemsets. Itemsets which have, for example, the same support in each group would be removed as part of filtering. The remaining itemsets make up the candidate contrast sets. The minimum support requirement was changed to allow for the discovery of jumping contrast sets, where the growth rate is infinite. For the other contrast sets, an independence test is used to determine the significant contrast sets, where the support of the itemset in each group is above some threshold. The interestingness measures proposed, that is, overall support, growth rate, unusualness, and coverage, along with the distribution difference, would then be used to rank the contrast sets. It is certainly possible to use any of these measures as filtering methods as well. However, in making the contrast set mining process as applicable as possible to a wide variety of datasets, it may be prudent to limit the filtering and allow the ranking methods to guide the end-user to the most interesting contrast sets.

References

Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, New York, NY, USA, 1993. ACM.

Yonatan Aumann and Yehuda Lindell. A statistical theory for quantitative association rules. In Knowledge Discovery and Data Mining, pages 261–270, 1999.

Stephen D. Bay and Michael J. Pazzani. Detecting change in categorical data: mining contrast sets. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 302–306, New York, NY, USA, 1999. ACM.

Stephen D. Bay and Michael J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3):213–246, 2001.

Roberto J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 85–93, New York, NY, USA, 1998. ACM.

Guozhu Dong and Jinyan Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43–52, New York, NY, USA, 1999. ACM.

R.J. Hilderman and T. Peckham. A statistically sound alternative approach to mining contrast sets. In Proceedings of the 4th Australasian Data Mining Conference (AusDM'05), pages 157–172, December 2005.

S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.

Branko Kavsek and Nada Lavrac. Apriori-SD: Adapting association rule learning to subgroup discovery. Applied Artificial Intelligence, 20(7):543–583, 2006.

Willi Klösgen. Explora: a multipattern and multistrategy discovery assistant. Pages 249–271, 1996.

Petra Kralj, Nada Lavrac, Dragan Gamberger, and Antonija Krstacic. Contrast set mining for distinguishing between similar diseases. In Riccardo Bellazzi, Ameen Abu-Hanna, and Jim Hunter, editors, AIME, volume 4594 of Lecture Notes in Computer Science, pages 109–118. Springer, 2007.

Nada Lavrac, Branko Kavsek, Peter A. Flach, and Ljupco Todorovski. Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5:153–188, 2004.

Jessica Lin and Eamonn J. Keogh. Group SAX: Extending the notion of contrast sets to time series and multimedia data. In PKDD, pages 284–296, 2006.

Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic representation of time series, with implications for streaming algorithms. In DMKD '03: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 2–11, New York, NY, USA, 2003. ACM.

B. Minaei-Bidgoli, Pang-Ning Tan, and W.F. Punch. Mining interesting contrast rules for a web-based educational system. In Proceedings of the 2004 International Conference on Machine Learning and Applications, pages 320–327, December 2004.

Mondelle Simeon and Robert J. Hilderman. Exploratory quantitative contrast set mining: A discretization approach. In ICTAI '07: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pages 124–131, Washington, DC, USA, 2007. IEEE Computer Society.

G.I. Webb. Magnum Opus: Setting new standards in association rule discovery, 2001. URL www.rulequest.com.

Geoffrey I. Webb, Shane Butler, and Douglas Newlands. On detecting differences between groups. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 256–265, New York, NY, USA, 2003. ACM.

Tzu-Tsung Wong and Kuo-Lung Tseng. Mining negative contrast sets from data with discrete attributes. Expert Systems with Applications, 29(2):401–407, 2005.