Discovering Rules with Concept Hierarchies

Marco Eugênio Madeira Di Beneditto
Centro de Análises de Sistemas Navais, Pr. Barão de Ladário s/n - Ilha das Cobras - Ed. 8 do AMRJ, 3 andar, Centro – 20091-000, Rio de Janeiro, RJ, Brasil

Leliane Nunes de Barros
Instituto de Matemática e Estatística da Universidade de São Paulo (IME-USP), Rua do Matão, 1010, Cidade Universitária – 05508-090, São Paulo, SP, Brasil

[email protected], [email protected]

Abstract

In Data Mining, one of the steps of the Knowledge Discovery in Databases (KDD) process, the use of concept hierarchies for the attribute values of a database makes it possible to express the discovered knowledge at a higher abstraction level, more concisely, and usually in a more interesting format. However, data mining for high-level concepts is more complex because the search space is generally too big. Some data mining systems require the database to be pre-generalized to reduce the space, which makes it difficult to discover knowledge at arbitrary levels of abstraction. To efficiently induce high-level rules at different levels of generality, without pre-generalizing databases, fast access to concept hierarchies and fast query evaluation methods are needed. This work presents the NETUNO-HC system, which induces classification rules using concept hierarchies for the attribute values of a database without pre-generalizing them. We show how the abstraction level of the discovered rules can be affected by the adopted search strategy and by the relevance measures considered during the data mining step. Moreover, we demonstrate through a series of experiments that the NETUNO-HC system improves the efficiency of the data mining process, due to the implementation of the following techniques: (i) an SQL primitive to execute the database queries; (ii) the numerical encoding of concept hierarchies; (iii) the use of the Beam Search strategy; and (iv) the data representation of the concept hierarchy.
Keywords: Knowledge Discovery, Data Mining, Machine Learning.

1. Introduction

This paper describes a KDD (Knowledge Discovery in Databases) system named NETUNO-HC [1] that uses concept hierarchies to discover knowledge at a higher abstraction level than that existing in a relational database (DB), without pre-generalizing the data. The search for this kind of knowledge requires the construction of SQL queries to a Database Management System (DBMS), considering that the attribute values belong to a concept hierarchy not directly represented in the DB. We argue that this task can be achieved by providing fast access to concept hierarchies and fast query evaluation through: (i) an efficient search strategy, and (ii) the use of an SQL primitive to allow fast evaluation of high-level hypotheses. Unlike [5], the system proposed in this paper does not require the DB to be pre-generalized. Finally, the proposed representation of hierarchies, together with the use of SQL primitives, makes NETUNO-HC independent of other systems, unlike ParDRI [7], which uses Parka, a knowledge representation language, to manage the hierarchies.

2. Concept Hierarchies

A concept hierarchy can be defined as a partially ordered set. Given two concepts a and b in a partial order relation ≼, i.e., a ≼ b (read as a precedes b), we say that concept a is more specific than concept b, or that b is more general than a. Usually, the partial order relation in a concept hierarchy represents the specific-general relationship between concepts, also called the subset-superset relation. So, a concept hierarchy is defined as follows: a Concept Hierarchy is a partially ordered set (H, ≼), where H is a finite set of concepts and ≼ is a partial order relation on H. A tree is a special type of concept hierarchy, in which each concept precedes only one concept and a greatest concept exists, i.e., a concept that does not precede any other.
The tree root will be the most general concept, called ANY, and the leaves will be the attribute values in the DB, that is, the lowest abstraction level of a hierarchy. In this work, we use concept hierarchies that can be represented as trees.

2.1. Representing Hierarchies

The use of concept hierarchies during data mining to generate and evaluate hypotheses is computationally more demanding than the creation of generalized tables. Representing a hierarchy in memory as a tree data structure gives some speed and efficiency when traversing it. Nevertheless, the number of queries necessary to verify the relationship between concepts in a hierarchy can be too high. One approach to decrease this complexity is to encode each hierarchy concept in such a way that the code itself indicates the partial order relation between the concepts. Thus the relation verification is made by only checking the codes.

[Figure 1: Two concept codes. The code 18731 (bit segments 1|0010|01001|01011, of widths 1, 4, 5, and 5) represents a concept that is a descendant of the concept with code 18 (bit segments 1|0010, of widths 1 and 4); shifting 18731 right by 10 bits yields 18.]

The concept encoding algorithm we propose is based on a post-order traversal of the hierarchy with complexity O(n), where n is the number of concepts in the hierarchy. The verification of the relationship between two concepts is performed by shifting one of the codes, namely the bigger one. Figure 1 shows two concept codes where the code 18731 represents a concept that is a descendant of the concept with code 18. Since the difference between the codes corresponds to ten bits, the bigger code has to be shifted to the right by this number of bits; if the resulting value is equal to the smaller code, then the concepts belong to the relation, i.e., the concept with the smaller code is an ascendant of the concept with the bigger code.
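The shift-based verification can be sketched in a few lines of Python (the codes and bit widths come from Figure 1; the function itself is an illustrative reconstruction, not the NETUNO-HC source):

```python
def is_ancestor(anc_code, anc_bits, desc_code, desc_bits):
    """Verify the partial order relation between two encoded concepts.

    Each concept's code extends its ancestor's code with extra bits,
    so shifting the longer code right by the difference in bit lengths
    must reproduce the shorter code.
    """
    if anc_bits > desc_bits:
        return False  # an ancestor never has the longer code
    return (desc_code >> (desc_bits - anc_bits)) == anc_code

# The example of Figure 1: 18731 uses 15 bits (1|0010|01001|01011),
# 18 uses 5 bits (1|0010); 18731 >> 10 == 18, so 18 is an ancestor.
print(is_ancestor(18, 5, 18731, 15))   # True
print(is_ancestor(19, 5, 18731, 15))   # False
```

Because the check is a single shift and comparison, its cost is essentially constant, which is what makes the numeric verification in Table 1 insensitive to hierarchy height.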
Table 1 shows the mean percentage of the total time spent during the generation and evaluation of hypotheses, i.e., excluding the time spent issuing SQL queries. The use of numeric verification gives a relevant decrease in the time spent. The hierarchy query method uses pointer traversal for relationship verification, so a taller hierarchy causes more time to be spent, as can be seen in the first row of Table 1, because the Adult DB has taller concept hierarchies than the Mushroom DB. For the numeric verification, hierarchy height has a negligible influence on the time spent, due to the nearly constant time of the code shift.

Method                 Mushroom          Adult
Hierarchy query        10.11% (σ=0.18)   14.93% (σ=0.21)
Numeric verification    1.87% (σ=0.15)    1.91% (σ=0.13)

Table 1: Mean and standard deviation (σ) of the time spent in the generation and evaluation of hypotheses for NETUNO-HC

In NETUNO-HC the hierarchies are stored in relational tables in the DB and loaded before the data mining step. More than one hierarchy can be stored for each attribute, leaving to the user the possibility of choosing one. The use of tables also makes it easy to access the hierarchy data concurrently.

2.2. Generation of Numerical Concept Hierarchies

For numerical or continuous attributes, the concept hierarchies can be previously generated and stored in relational tables, or generated by an algorithm before the data mining step. In NETUNO-HC we propose an algorithm to generate a numerical hierarchy considering the class distribution. This algorithm is based on the InfoMerge algorithm [3], used for the discretization of continuous attributes. The idea underlying InfoMerge is to group values into the intervals that cause the smallest information loss (a dual operation of the information gain used in C4.5 [6]).
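This interval-merging idea can be sketched as follows (a simplified illustration: the plain entropy-based loss and all function names are our assumptions, not the published InfoMerge algorithm):

```python
import math

def entropy(counts):
    """Class-distribution entropy of an interval's class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_loss(a, b):
    """Information lost by merging two adjacent intervals: weighted
    entropy of the merged interval minus the weighted entropies of
    the two parts."""
    merged = [x + y for x, y in zip(a, b)]
    return (sum(merged) * entropy(merged)
            - sum(a) * entropy(a) - sum(b) * entropy(b))

def merge_level(intervals):
    """Merge the adjacent pair with the smallest information loss,
    producing the next (more general) level of the hierarchy.
    `intervals` is a list of (bounds, class_counts) pairs, with
    left-closed bounds."""
    best = min(range(len(intervals) - 1),
               key=lambda i: info_loss(intervals[i][1], intervals[i + 1][1]))
    (lo1, _), c1 = intervals[best]
    (_, hi2), c2 = intervals[best + 1]
    merged = ((lo1, hi2), [x + y for x, y in zip(c1, c2)])
    return intervals[:best] + [merged] + intervals[best + 2:]

# Two pure intervals of the same class merge at zero loss first.
levels = merge_level([((0, 10), [5, 0]), ((10, 20), [5, 0]),
                      ((20, 30), [0, 5])])
```

Repeated application of `merge_level` until one interval remains yields the bottom-up hierarchy described next, with the root covering all values in the DB.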
In NETUNO-HC, the same idea is applied to generate a numerical concept hierarchy in a bottom-up fashion, where each concept is a numerical interval, closed on the left. After the leaf-level intervals are generated, they are merged into bigger intervals until the root is reached, which will be an interval that includes all the values existing in the DB.

3. The NETUNO-HC Algorithm

The search space is organized in a general-to-specific ordering of hypotheses, beginning with the empty hypothesis. A hypothesis is transformed (the node expansion search operation) by specialization operations, i.e., by the addition of an attribute or by hierarchy specialization, to generate more specific hypotheses. A hypothesis becomes a discovered rule if it satisfies the relevance measures. The node expansion operation is made in two steps. First, an attribute is added to a hypothesis. Second, using the SQL query, the algorithm checks, in a top-down fashion, which values in the hierarchy of the attribute satisfy the relevance measures. The search strategy employed by NETUNO-HC is Beam Search. For each level of the search space, which corresponds to hypotheses with the same number of attribute-value pairs, the algorithm selects only a fixed number of them. This number corresponds to the beam width, i.e., the number of hypotheses that will be specialized.

3.1. NETUNO-HC Knowledge Description Language

The power of a symbolic algorithm for data mining resides in the expressiveness of the knowledge description language used. The language specifies what the algorithm is capable of learning. NETUNO-HC uses a propositional-like language, extending the attribute values with concept hierarchies in order to achieve higher expressiveness. Rules induced by NETUNO-HC take the form IF <antecedent> THEN <class>, where the antecedent is a conjunction of one or more attribute-value pairs. An attribute-value pair is a condition between an attribute and a value from the concept hierarchy.
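The beam search loop described above can be sketched generically (the callback-based structure is our illustration; in NETUNO-HC the expansion and relevance checks are interleaved with SQL evaluation):

```python
def beam_search(initial, specialize, score, is_rule,
                beam_width=256, max_levels=10):
    """Generic beam search over a general-to-specific hypothesis space.

    `specialize` maps a hypothesis to its refinements (adding an
    attribute-value pair or descending one level in a hierarchy);
    `score` is the selection criterion (support x confidence in the
    paper); `is_rule` tests the relevance thresholds. All four
    callbacks are placeholders, not the NETUNO-HC internals.
    """
    rules = []
    frontier = [initial]
    for _ in range(max_levels):
        candidates = []
        for hyp in frontier:
            for child in specialize(hyp):
                if is_rule(child):
                    rules.append(child)       # satisfies the measures
                else:
                    candidates.append(child)  # keep refining
        # keep only the `beam_width` best-scoring hypotheses
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
        if not frontier:
            break
    return rules

# Toy usage: hypotheses are tuples of attribute indices; a hypothesis
# becomes a "rule" once it reaches length 2 (placeholder thresholds).
spec = lambda h: [h + (i,) for i in range(3) if i not in h]
rules = beam_search((), spec, score=len,
                    is_rule=lambda h: len(h) == 2, beam_width=2)
```

The fixed `beam_width` bound is what keeps the open-list from growing as in the Breadth-First Search experiment of Section 5.1.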
For categorical attributes this condition is an equality, and for continuous attributes it is an interval inclusion (closed on the left) or an equality.

3.2. Specializing Hypotheses

In progressive specialization, or the top-down approach, the data mining algorithm generates hypotheses that have to be specialized. The specialization operation applied to a hypothesis h generates a new hypothesis h' that covers a number of tuples less than or equal to the number covered by h. Specialization can be realized either by adding an attribute or by replacing the value of an attribute with any of its descendants as defined by the concept hierarchies. In NETUNO-HC, both forms of hypothesis specialization are considered. If a hypothesis does not satisfy the relevance measures then it has to be specialized. After the addition of an attribute, the algorithm has to check which of its values form valid hypotheses, i.e., hypotheses that satisfy the relevance measures. With the use of hierarchies, the values have to be checked in a top-down way, i.e., from the most general concept to the most specific.

3.3. Rules Subsumption

NETUNO-HC avoids the generation of two rules r1 and r2 such that r2 is subsumed by r1, i.e., r2 ⊑ r1. This occurs when:

1. the rules have the same size and for each attribute-value pair (a, v1) in r1 there exists a pair (a, v2) in r2 where v2 ≼ v1;
2. the rules have different sizes, r1 being the smaller one, and for each attribute-value pair (a, v1) in r1 there exists a pair (a, v2) in r2 where v2 ≼ v1.

This kind of verification is done in two different phases. The first phase happens when the data mining algorithm checks an attribute value in the hierarchy: if the value generates a rule, the descendant values that could also generate rules for the same class are not stored as valid rules, even though they satisfy the relevance measures. Second, if a discovered rule subsumes other previously discovered rules, these are deleted from the list of discovered rules.
Conversely, if a discovered rule is subsumed by one or more previously discovered rules, it is not added to the list. This second phase is performed using a rule indexing schema, generated based on the rule consequent (i.e., the rule class) and the rule antecedent. Each attribute-value pair has a code, and the composition of these codes forms the rule's code, which is used to create a hash-based index of the discovered rule set.

3.4. Relevance Measures and Selection Criteria

In the NETUNO-HC system, the rule hypotheses are evaluated by two conditions: completeness and consistency. Let P denote the total number of positive examples of a given class in the training data. Let h be a rule hypothesis intended to cover tuples of that class, and let p and n be the number of positive and negative tuples covered by h, respectively. Completeness is defined by the ratio p/P, called in this work support (also known in the literature as positive coverage). Consistency is defined by the ratio p/(p+n), called in this work confidence (also known as training accuracy). These values are calculated using the SQL primitive described in Section 4. The criterion for selecting the best hypotheses to be expanded is based on the product support × confidence. The hypotheses in the open-list are stored in decreasing order of that product, and only the best hypotheses (up to the beam width) are selected.

3.5. Interpretation of the Induced Rules

The induced rules can be interpreted as classification rules. Thus, to use them to classify new examples, NETUNO-HC employs an interpretation in which all rules are tried and only those that cover the example are collected. If a collision occurs (i.e., the example belongs to more than one class), the decision is to classify the example in the class given by the rule with the greatest value of the product support × confidence.
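The subsumption test of Section 3.3 can be sketched as follows (a sketch only: rules are represented as attribute-to-concept maps with equal consequents, and the hierarchy callback is illustrative):

```python
def subsumes(general, specific, is_anc_or_equal):
    """Check whether rule `general` subsumes rule `specific`.

    Both rules map attribute -> hierarchy concept and are assumed to
    have the same consequent. `general` subsumes `specific` if it is
    no larger and every one of its pairs is matched in `specific` by
    the same attribute with an equal or more specific value.
    """
    if len(general) > len(specific):
        return False
    for attr, gval in general.items():
        sval = specific.get(attr)
        if sval is None or not is_anc_or_equal(gval, sval):
            return False
    return True

# Toy hierarchy callback: dark is an ancestor of brown and black.
anc = lambda a, b: a == b or (a, b) in {('dark', 'brown'),
                                        ('dark', 'black')}
print(subsumes({'spore_print_color': 'dark'},
               {'spore_print_color': 'brown', 'odor': 'foul'}, anc))
```

In NETUNO-HC this pairwise check is only reached after the hash-based index (built from the rule class and the attribute-value pair codes) has narrowed the candidate rules.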
If an example is not covered by any rule, the number of non-classified examples is incremented. Section 5.3 shows the result of applying a default rule in this case.

4. SQL Primitive for Evaluation of High Level Hypotheses

In [4], a generic KDD primitive in SQL was proposed, which underlies the candidate rule evaluation procedure. This primitive consists of counting the number of tuples in each partition formed by an SQL GROUP BY statement. The primitive has three input parameters: a tuple-set descriptor, a candidate attribute, and the class attribute. The output is an m × k matrix, where m is the number of distinct values of the candidate attribute and k is the number of distinct values of the class attribute. In order to use this primitive and the output matrix for the evaluation of high-level hypotheses (i.e., building an SQL primitive considering a concept hierarchy), some extensions were made to the original proposal [4]. In the primitive, the tuple-set descriptor has to be expressed by values in the DB, i.e., the leaf concepts in the hierarchy. So, for each high-level value, the descriptor has to be expressed by the leaf values that precede it. This is done by NETUNO-HC during the data mining, using the hierarchy to build the SQL primitive. For example, let black, brown ≺ dark, where black and brown are leaf concepts in a color domain hierarchy. If the antecedent of a hypothesis has the attribute-value pair spore print color = dark, this has to be expressed in the tuple-set descriptor by leaf values, i.e., spore print color = brown OR spore print color = black. In the output matrix, the lines are the leaf concepts of the hierarchy. Summing the lines of the leaf concepts that precede a high-level concept is equivalent to having a line for that high-level concept, which can be used to evaluate the high-level hypotheses (see Figure 2). The condition between an attribute and its value may also be an inequality. In this case, e.g.,
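Building the extended primitive and summing leaf lines can be sketched as follows (the SQL string layout and the `leaves_of` callback are our illustration of the extension described above, following the counting primitive of [4]):

```python
def primitive_sql(table, tuple_set, candidate_attr, class_attr, leaves_of):
    """Build the counting primitive for a high-level tuple-set
    descriptor: every hierarchy concept in the descriptor is rewritten
    as a disjunction of the leaf values that precede it, since only
    leaves occur in the DB. Table and column names are illustrative.
    """
    conds = []
    for attr, concept in tuple_set:
        leaves = leaves_of(concept)
        conds.append("(" + " OR ".join(f"{attr} = '{v}'" for v in leaves)
                     + ")")
    where = (" WHERE " + " AND ".join(conds)) if conds else ""
    return (f"SELECT {candidate_attr}, {class_attr}, COUNT(*)"
            f" FROM {table}{where}"
            f" GROUP BY {candidate_attr}, {class_attr}")

def high_level_counts(matrix, leaf_rows):
    """Sum the leaf lines of the output matrix that precede a
    high-level concept; the resulting line evaluates the high-level
    hypothesis with no further query."""
    return [sum(matrix[r][c] for r in leaf_rows)
            for c in range(len(matrix[0]))]

# Toy hierarchy: dark has leaves brown and black.
hierarchy = {'dark': ['brown', 'black']}
leaves_of = lambda c: hierarchy.get(c, [c])
q = primitive_sql('mushroom', [('spore_print_color', 'dark')],
                  'odor', 'class', leaves_of)
```

From the (summed) line, p is the count in the target class column and n is the rest, giving support p/P and confidence p/(p+n) as defined in Section 3.4.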
spore print color ≠ dark, the tuple-set descriptor will be translated to spore print color ≠ brown AND spore print color ≠ black. To calculate the relevance measures for this condition, the same matrix can be used: the line for this condition is the difference between the Total line and the line that corresponds to the attribute value.

[Figure 2: The lines of the matrix represent the leaf concepts a1, a2, a3, ..., am of the candidate attribute's concept hierarchy; the columns represent the classes C1, C2, C3, ..., Cn, with Total entries.]

5. Experiments

In order to evaluate the NETUNO-HC algorithm we used two DBs from the UCI repository: Mushroom and Adult. First, we tested how the size of the search space changes when performing data mining with and without the use of concept hierarchies. This was done using a simplified implementation of the NETUNO-HC algorithm that uses a complete search method. In the remaining experiments we analyzed the data mining process, with and without the use of concept hierarchies, with respect to the following aspects: efficiency of DB access, concept hierarchy access, and rule subsumption verification; accuracy of the discovered rule set; the discovery of high-level rules; and the semantic evaluation of high-level rules.

5.1. The Size of the Search Space

We first analyzed how the use of concept hierarchies in data mining affects the size of the search space, considering a complete search method such as Breadth-First Search.

[Figure 3: Breadth-First Search algorithm execution in the Mushroom DB with and without hierarchies, with sup = 20% and conf = 90%; the plot shows the open-list size (up to 16000) versus the number of open-list removes (up to 60000).]
Figure 3 plots the open-list size (the list of candidate rules, or rule hypotheses) against the number of open-list removes (the number of hypothesis specializations). As expected, it shows that the search space for high-level rules increases with the size of the concept hierarchies considered in the data mining process. We can also see in Figure 3 that pruning techniques, based on relevance measures and rule subsumption, can eventually empty the list of open nodes (open-list). This occurs for the Mushroom DB after 15000 hypothesis specializations in data mining WITHOUT concept hierarchies, and after 59000 hypothesis specializations in data mining WITH concept hierarchies. Another observation from Figure 3 is that the open-list is approximately four times bigger when using concept hierarchies for the Mushroom DB. Therefore, it is important to improve the performance of hypothesis evaluation, which involves efficient DB access, concept hierarchy access, and rule subsumption verification.

5.2. Efficiency in High Level SQL Primitive and Hypotheses Generation

In order to evaluate the use of the high-level SQL primitive, a version of ParDRI [7] was implemented. In ParDRI, the high-level queries are made in a different way: it uses the direct descendants of the hierarchy root. So, if the root concept has three descendants, three queries will be issued, while with the SQL primitive only one query is necessary. For the Mushroom DB, without the SQL primitive, the algorithm generated 117 queries and discovered 26 rules. Using the primitive, only 70 queries were generated for exactly the same 26 rules, a reduction of 40% in the number of queries. To evaluate the time spent on hypotheses generation, the following times were measured during the executions: (1) the time spent with DB queries; and (2) the time spent by the data mining algorithm.
The ratio between the difference of these two times and the time spent by the data mining algorithm is the percentage spent in the generation and evaluation of the hypotheses. This value is 1.87% (with σ=0.15), showing that the execution time is dominated by the queries issued to the DBMS. Therefore, the use of the high-level SQL primitive, combined with efficient techniques for encoding and evaluating hypotheses, makes NETUNO-HC a more efficient algorithm for high-level data mining than ParDRI [7].

5.3. Accuracy

In Table 2, the accuracy results of NETUNO-HC with and without hierarchies are compared with two other algorithms, C4.5 [6] and CN2 [2], which did not use concept hierarchies. In order to compare similar classification schemes, the NETUNO-HC results were obtained using a default class (the majority class in this case) to label examples not covered, as the two other algorithms do. For the other experiments, the default class was not used.

Algorithm              Mushroom   Adult
C4.5                   100%       84.46%
CN2                    100%       84%
NETUNO-HC without CH   99.04%     82.14%
NETUNO-HC with CH      98.45%     81.62%

Table 2: Accuracies for the algorithms - the default class was used in NETUNO-HC

The next experiments show the results obtained through ten-fold stratified cross-validation. Table 3 shows the accuracy of the discovered rule set. For both DBs we can observe that decreasing the minimum support value tends to increase the accuracy (with or without hierarchies). This happens because some tuples are covered only by rules with small coverage, and these rules can only be discovered by defining a small support.
Support / Confidence   Mushroom without CH   Mushroom with CH   Adult without CH   Adult with CH
20% / 90%              0.9061 (σ=0.002)      0.8942 (σ=0.002)   0.6717 (σ=0.003)   0.6762 (σ=0.004)
20% / 94%              0.9572 (σ=0.005)      0.9311 (σ=0.005)   0.5672 (σ=0.004)   0.5851 (σ=0.005)
20% / 98%              0.9596 (σ=0.004)      0.9845 (σ=0.002)   0.3701 (σ=0.002)   0.5146 (σ=0.004)
12% / 90%              0.8991 (σ=0.004)      0.8931 (σ=0.002)   0.7048 (σ=0.002)   0.7031 (σ=0.006)
12% / 94%              0.9572 (σ=0.002)      0.9299 (σ=0.003)   0.6559 (σ=0.003)   0.6598 (σ=0.005)
12% / 98%              0.9738 (σ=0.002)      0.9845 (σ=0.003)   0.4112 (σ=0.002)   0.5566 (σ=0.005)
4% / 90%               0.8954 (σ=0.003)      0.8931 (σ=0.004)   0.7229 (σ=0.004)   0.7235 (σ=0.003)
4% / 94%               0.9524 (σ=0.003)      0.9275 (σ=0.004)   0.6797 (σ=0.005)   0.6646 (σ=0.002)
4% / 98%               0.9881 (σ=0.003)      0.9845 (σ=0.002)   0.5513 (σ=0.005)   0.6035 (σ=0.002)

Table 3: Mean accuracies and standard deviations (σ) for each support and confidence value, with beam width = 256

As expected, the use of hierarchies does not directly affect the accuracy of the discovered rules. That can be explained as follows. On one hand, a more general concept has greater inconsistency, which decreases the accuracy. On the other hand, with high support values, an increase in the minimum confidence value tends to increase the accuracy; in this case, a high-level concept can cover more examples, decreasing the number of non-covered examples (as can be seen in Table 4, where the number of non-classified examples is very small even with a small beam width). Intuitively, one might expect that a larger beam width would discover a rule set with better accuracy, since the search becomes closer to a complete search. However, in the Mushroom DB with hierarchies, an increase in the beam width did not result in better accuracy, as can be seen in Table 4.
Beam width   Accuracy without CH   Accuracy with CH   Non-Classified without CH   Non-Classified with CH
1            0.9501                0.9857             37                          2
2            0.9501                0.9857             37                          2
4            0.9501                0.9857             37                          2
8            0.9548                0.9857             33                          2
16           0.9845                0.9845             7                           2
32           0.9845                0.9869             7                           0
64           0.9881                0.9869             4                           0
128          0.9881                0.9845             4                           0
256          0.9881                0.9845             4                           0

Table 4: Beam width vs. accuracy and non-classified examples in the Mushroom DB

5.4. High Level Rules

The relevance measures affect the discovered rule set. With a minimum confidence value of 90%, in both DBs it can be seen that high minimum support values tend to discover more high-level rules. The table below shows the percentage of rules in the discovered rule set that contain high-level values for the attributes, for different support values.

Minimum Support Value   Mushroom   Adult
4%                      51.8%      81.4%
20%                     63.8%      85.6%

5.5. Semantic Evaluation

The use of hierarchies introduces more general concepts, which can cause low-level rules to be subsumed by high-level ones. For example, in the Mushroom DB, given the high-level concept BAD, the rule r1 below is discovered. This rule is more general than the following two rules, r2 and r3, discovered without the use of hierarchies.

r1: odor = BAD → POISONOUS - Supp: 0.822 Conf: 1.0
r2: odor = CREOSOTE → POISONOUS - Supp: 0.048 Conf: 1.0
r3: odor = FOUL → POISONOUS - Supp: 0.549 Conf: 1.0

This example shows that, by using concept hierarchies in data mining, one can generate more concise knowledge to be interpreted by an expert of the DB domain.

6. Conclusions

The use of concept hierarchies in data mining results in a trade-off between the discovery of more interesting rules, expressed at a high abstraction level, and a higher computational cost.
In this work, we presented the NETUNO-HC algorithm and its implementation, proposing ways to solve the efficiency problems of data mining with concept hierarchies, namely: the use of the Beam Search strategy, the encoding and evaluation techniques for concept hierarchies, and the high-level SQL primitive. The main contribution of this work is to specify a high-level SQL primitive as an efficient way to evaluate rules considering concept hierarchies. We also performed experiments to show how the mining parameters affect the discovered rule set:

Variation of the minimum support value. On one hand, a decrease in the minimum support value tends to increase the accuracy, with or without hierarchies, also increasing the rule set size. On the other hand, a high minimum support value tends to discover a more interesting rule set, i.e., a set with more high-level rules.

Variation of the minimum confidence value. The effect of this variation depends on the DB domain. For the databases analyzed, a higher confidence value did not always result in a higher accuracy.

Alterations of the beam width. A higher beam width tends to increase the accuracy. However, depending on the DB domain, a better accuracy can be obtained with a lower beam width, with or without hierarchies. The hierarchy also affects the discovered rule set: a higher accuracy can be obtained with a lower beam width.

References

[1] Marco Eugênio Madeira Di Beneditto. Descoberta de regras de classificação com hierarquias conceituais. Master's thesis, Instituto de Matemática e Estatística, Universidade de São Paulo, Brasil, February 2004.

[2] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3:261–283, 1989.

[3] A. Freitas and S. Lavington. Speeding up knowledge discovery in large relational databases by means of a new discretization algorithm. In Proc. 14th British Nat. Conf. on Databases (BNCOD-14), pages 124–133, Edinburgh, Scotland, 1996.

[4] A. Freitas and S. Lavington.
Using SQL primitives and parallel DB servers to speed up knowledge discovery in large relational databases. In R. Trappl, editor, Cybernetics and Systems'96: Proc. 13th European Meeting on Cybernetics and Systems Research, pages 955–960, Vienna, Austria, 1996.

[5] Jiawei Han, Yongjian Fu, Wei Wang, Jenny Chiang, Wan Gong, Krzysztof Koperski, Deyi Li, Yijun Lu, Amynmohamed Rajan, Nebojsa Stefanovic, Betty Xia, and Osmar R. Zaiane. DBMiner: A system for mining knowledge in large relational databases. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 250–263. AAAI Press, 1996.

[6] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1st edition, 1993.

[7] Merwyn G. Taylor. Finding High Level Discriminant Rules in Parallel. PhD thesis, University of Maryland, College Park, USA, 1999.