Discovering Rules with Concept Hierarchies
Marco Eugênio Madeira Di Beneditto
Leliane Nunes de Barros
Centro de Análises de Sistemas Navais
Pr. Barão de Ladário s/n - Ilha das Cobras - Ed 8 do AMRJ, 3 andar
Centro – 20091-000, Rio de Janeiro, RJ, Brasil
Instituto de Matemática e Estatística da Universidade de São Paulo (IME-USP),
Rua do Matão, 1010, Cidade Universitária – 05508-090, São Paulo, SP, Brasil
[email protected], [email protected]
Abstract
In Data Mining, one of the steps of the Knowledge Discovery in Databases (KDD) process, the use of concept hierarchies for the attribute values of the database allows the discovered knowledge to be expressed at a higher abstraction level, more concisely, and usually in a more interesting format. However, mining for high level concepts is more complex because the search space is generally too big. Some data mining systems require the database to be pre-generalized to reduce this space, which makes it difficult to discover knowledge at arbitrary levels of abstraction. To efficiently induce high level rules at different levels of generality, without pre-generalizing databases, fast access to concept hierarchies and fast query evaluation methods are needed.
This work presents the NETUNO-HC system, which performs induction of classification rules using concept hierarchies for the attribute values of a database without pre-generalizing them. We show how the abstraction level of the discovered rules can be affected by the adopted search strategy and by the relevance measures considered during the data mining step. Moreover, we demonstrate through a series of experiments that the NETUNO-HC system improves the efficiency of the data mining process, due to the implementation of the following techniques: (i) a SQL primitive to execute the database queries; (ii) the numerical encoding of concept hierarchies; (iii) the use of the Beam Search strategy; and (iv) the data representation of the concept hierarchy.
Keywords: Knowledge Discovery, Data Mining, Machine Learning.
1. Introduction
This paper describes a KDD (Knowledge Discovery in Databases) system named NETUNO-HC [1], which uses concept hierarchies to discover knowledge at a higher abstraction level than the one existing in a relational database (DB), without pre-generalizing the data. The search for this kind of knowledge requires the construction of SQL queries to a Database Management System (DBMS), considering that the attribute values belong to a concept hierarchy not directly represented in the DB.
We argue that this kind of task can be accomplished by providing fast access to concept hierarchies and fast query evaluation through: (i) an efficient search strategy, and (ii) the use of a SQL primitive that allows fast evaluation of high level hypotheses. Unlike [5], the system proposed in this paper does not require the DB to be pre-generalized. Finally, the proposed representation of hierarchies, combined with the use of SQL primitives, makes NETUNO-HC independent from other systems, unlike ParDRI [7], which uses Parka, a knowledge representation language, to manage the hierarchies.
2. Concept Hierarchies
A concept hierarchy can be defined as a partially ordered set. Given two concepts $a$ and $b$ belonging to a partial order relation $\preceq$, i.e., $a \preceq b$ (read as "$a$ precedes $b$"), we say that concept $a$ is more specific than concept $b$, or that $b$ is more general than $a$. Usually, the partial order relation in a concept hierarchy represents the specific-general relationship between concepts, also called the subset-superset relation. So, a concept hierarchy is defined as:
A Concept Hierarchy is a partially ordered set $(H, \preceq)$, where $H$ is a finite set of concepts and $\preceq$ is a partial order relation on $H$.
A tree is a special type of concept hierarchy, in which each concept directly precedes at most one concept and a greatest concept exists, i.e., a concept that does not precede any other. The tree root is the most general concept, called ANY, and the leaves are the attribute values present in the DB, that is, the lowest abstraction level of the hierarchy. In this work, we use concept hierarchies that can be represented as trees.
2.1. Representing Hierarchies
The use of concept hierarchies during data mining to generate and evaluate hypotheses is computationally more demanding than the creation of generalized tables. Representing a hierarchy in memory using a tree data structure gives some speed and efficiency when traversing it. Nevertheless, the number of queries necessary to verify the relationship between concepts in a hierarchy can be too high. One approach to decrease this complexity is to encode each hierarchy concept in such a way that the code itself indicates the partial order relation between the concepts. Thus, the relation verification is made by only checking the codes.
nr of bits:   1   4      5       5
code 18731:   1   0010   01001   01011
code 18:      1   0010                    (= 18731 >> 10)

Figure 1: Two concept codes, where the code 18731 represents a concept that is a descendant of the concept with code 18
The concept encoding algorithm we propose is based on a post-order traversal of the hierarchy, with complexity $O(n)$, where $n$ is the number of concepts in the hierarchy. The verification of the relationship between two concepts is performed by shifting one of the codes, namely the bigger one. Figure 1 shows two concept codes, where the code 18731 represents a concept that is a descendant of the concept with code 18. Since the difference between the code lengths corresponds to ten bits, the bigger code has to be shifted to the right by this number of bits; if the resulting value is equal to the smaller code, then the concepts belong to the relation, i.e., the concept with the smaller code is an ascendant of the concept with the bigger code.
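To make the scheme concrete, the sketch below shows one way to implement it in Python. The traversal order and the per-level field widths are simplifications of the paper's post-order scheme, and all class and function names are illustrative, not part of NETUNO-HC.

```python
class Concept:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.code = 0   # numeric code encoding the path from the root
        self.nbits = 0  # number of significant bits in the code

def encode(root):
    """Assign each concept a code that extends its parent's code with a
    fixed-width field indexing the child (cf. the bit fields in Figure 1)."""
    root.code, root.nbits = 1, 1          # the root field is a single '1' bit
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            width = max(1, len(node.children).bit_length())
            for i, child in enumerate(node.children):
                child.code = (node.code << width) | i
                child.nbits = node.nbits + width
                stack.append(child)

def is_ancestor(a, d):
    """a is an ancestor of d iff shifting d's code right by the difference
    in code lengths yields a's code (e.g., 18731 >> 10 == 18)."""
    diff = d.nbits - a.nbits
    return diff > 0 and (d.code >> diff) == a.code

# Illustrative hierarchy: ANY -> dark -> {black, brown}
black, brown = Concept("black"), Concept("brown")
dark = Concept("dark", [black, brown])
encode(Concept("ANY", [dark]))
assert is_ancestor(dark, black) and not is_ancestor(black, dark)
```

The point of the encoding is that the ancestor test reduces to one shift and one comparison, regardless of the depth separating the two concepts.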
Table 1 shows the mean percentage of the total time spent in the algorithm execution during the generation and evaluation of hypotheses, i.e., disregarding the time spent issuing SQL queries. The use of numeric verification yields a relevant decrease in the time spent. The hierarchy query method follows pointer paths for relationship verification, so a taller hierarchy causes a larger time, as can be seen in the first line: the Adult DB has taller concept hierarchies than the Mushroom DB. For the numeric verification, a taller hierarchy has a negligible influence on the time spent, due to the nearly constant time of the code shift.
Method                  Mushroom          Adult
Hierarchy query         10.11% (σ=0.18)   14.93% (σ=0.21)
Numeric verification     1.87% (σ=0.15)    1.91% (σ=0.13)

Table 1: Mean and standard deviation (σ) of the time spent in the generation and evaluation of hypotheses for NETUNO-HC

In NETUNO-HC, the hierarchies are stored in relational tables in the DB and loaded before the data mining step. More than one hierarchy can be stored for each attribute, leaving to the user the choice of which one to use. The use of tables also allows concurrent access to the hierarchy data.
2.2. Generation of Numerical Concept Hierarchies
For numerical or continuous attributes, the concept hierarchies can be previously generated and stored in relational tables, or generated by an algorithm before the data mining step. In NETUNO-HC we propose an algorithm that generates a numerical hierarchy considering the class distribution. This algorithm is based on the InfoMerge algorithm [3], used for the discretization of continuous attributes. The idea underlying InfoMerge is to group values into the intervals that cause the smallest information loss (a dual operation of the information gain used in C4.5 [6]).
In NETUNO-HC, the same idea is applied to the bottom-up generation of a numerical concept hierarchy, where each concept is a numerical interval, closed on the left. After the leaf level intervals are generated, they are merged into bigger intervals until the root is reached, which is an interval that includes all the existing values in the DB.
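As a rough illustration of this bottom-up construction, the sketch below merges adjacent intervals, always choosing the neighbouring pair with the smallest entropy-based information loss and halving the number of intervals per level. The exact loss measure, merge arity, and stopping criterion of InfoMerge [3] may differ from these assumptions.

```python
import math

def entropy(counts):
    """Class entropy of an interval, from its class-frequency dict."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

def info_loss(a, b):
    """Increase in weighted entropy caused by merging intervals a and b."""
    (_, ca), (_, cb) = a, b
    merged = {k: ca.get(k, 0) + cb.get(k, 0) for k in set(ca) | set(cb)}
    na, nb = sum(ca.values()), sum(cb.values())
    return (na + nb) * entropy(merged) - na * entropy(ca) - nb * entropy(cb)

def merge(a, b):
    """Merge two adjacent intervals [lo, m) and [m, hi) into [lo, hi)."""
    ((lo, _), ca), ((_, hi), cb) = a, b
    return ((lo, hi), {k: ca.get(k, 0) + cb.get(k, 0) for k in set(ca) | set(cb)})

def build_hierarchy(leaves):
    """Bottom-up levels of intervals, from the leaves up to a single root."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        level = list(levels[-1])
        target = max(1, len(level) // 2)   # assumed: halve each level
        while len(level) > target:
            i = min(range(len(level) - 1),
                    key=lambda j: info_loss(level[j], level[j + 1]))
            level[i:i + 2] = [merge(level[i], level[i + 1])]
        levels.append(level)
    return levels  # levels[-1][0] covers all values in the DB

# Leaf intervals, closed on the left, with class counts (illustrative data)
leaves = [((0, 10), {"+": 8, "-": 2}), ((10, 20), {"+": 7, "-": 3}),
          ((20, 30), {"+": 1, "-": 9}), ((30, 40), {"+": 2, "-": 8})]
print(build_hierarchy(leaves)[-1])  # root interval: ((0, 40), {...})
```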
3. The NETUNO-HC Algorithm
The search space is organized in a general-to-specific ordering of hypotheses, beginning with the empty hypothesis. A hypothesis is transformed (the node expansion search operation) by specialization operations, i.e., by the addition of an attribute or by hierarchy specialization, to generate more specific hypotheses. A hypothesis can become a discovered rule if it satisfies the relevance measures. The node expansion operation is performed in two steps. First, an attribute is added to the hypothesis. Second, using the SQL query, the algorithm checks, in a top-down fashion, which values in the attribute's hierarchy satisfy the relevance measures.
The search strategy employed by the NETUNO-HC is Beam Search. For each level of the search
space, which corresponds to hypotheses with the same number of attribute-value pairs, the algorithm
selects only a fixed number of them. This number corresponds to the beam width, i.e., the number of
hypotheses that will be specialized.
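A minimal sketch of this loop follows. Here, specialize, is_rule, and score stand in for the node expansion operation, the relevance-measure test, and the selection criterion of Sections 3.2 and 3.4; they are passed as functions since their details are defined elsewhere in the system.

```python
def beam_search(initial_hypotheses, specialize, is_rule, score, beam_width):
    """Level-wise beam search over the hypothesis space.

    Each level holds hypotheses with the same number of attribute-value
    pairs; only the `beam_width` best (by `score`) are specialized.
    """
    rules = []
    level = list(initial_hypotheses)
    while level:
        beam = sorted(level, key=score, reverse=True)[:beam_width]
        next_level = []
        for h in beam:
            for h2 in specialize(h):       # add attribute or descend hierarchy
                if is_rule(h2):            # satisfies the relevance measures
                    rules.append(h2)
                else:
                    next_level.append(h2)  # keep specializing it
        level = next_level
    return rules
```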
3.1. NETUNO-HC Knowledge Description Language
The power of a symbolic algorithm for data mining resides in the expressiveness of the knowledge description language used. The language specifies what the algorithm is capable of learning. NETUNO-HC uses a propositional-like language, extending the attribute-value representation with concept hierarchies in order to achieve higher expressiveness.
Rules induced by NETUNO-HC take the form IF <antecedent> THEN <class>, where the antecedent is a conjunction of one or more attribute-value pairs. An attribute-value pair is a condition between an attribute and a value from the concept hierarchy. For categorical attributes this condition is an equality (e.g., odor = FOUL), and for continuous attributes it is either an interval inclusion, closed on the left (e.g., attribute ∈ [v1, v2)), or an equality.
3.2. Specializing Hypotheses
In the progressive specialization, or top-down approach, the data mining algorithm generates hypotheses that have to be specialized. The specialization operation of a hypothesis $h$ generates a new hypothesis $h'$ that covers a number of tuples less than or equal to the number covered by $h$. Specialization can be realized either by adding an attribute or by replacing the value of an attribute with any of its descendants as defined by the concept hierarchies. In NETUNO-HC, both forms of hypothesis specialization are considered.
If a hypothesis does not satisfy the relevance measures, then it has to be specialized. After the addition of an attribute, the algorithm has to check which of the values form valid hypotheses, i.e., hypotheses that satisfy the relevance measures. With the use of hierarchies, the values have to be checked in a top-down way, i.e., from the most general concept to the most specific.
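The two operators can be sketched as follows, assuming a hypothesis is represented as a mapping from attributes to concepts and children(attr, concept) returns a concept's direct descendants in that attribute's hierarchy. Both representation choices are ours, for illustration only.

```python
def specializations(hypothesis, attributes, children):
    """All one-step specializations of a hypothesis (a dict attr -> concept)."""
    out = []
    # (1) add a new attribute, starting from the children of its root ANY
    for attr in attributes:
        if attr not in hypothesis:
            for c in children(attr, "ANY"):
                out.append({**hypothesis, attr: c})
    # (2) replace a value with one of its direct hierarchy descendants
    for attr, concept in hypothesis.items():
        for c in children(attr, concept):
            out.append({**hypothesis, attr: c})
    return out
```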
3.3. Rules Subsumption
NETUNO-HC avoids the generation of two rules $r$ and $r'$ such that $r$ is subsumed by $r'$, i.e., every tuple covered by $r$ is also covered by $r'$. This occurs when:
1. the rules have the same size and, for each attribute-value pair $(A = v)$ in $r$, there exists a pair $(A = v')$ in $r'$ with $v \preceq v'$;
2. the rules have different sizes and, for each attribute-value pair $(A = v')$ in $r'$, there exists a pair $(A = v)$ in $r$ with $v \preceq v'$, where $r'$ is the smaller rule.
This kind of verification is done in two different phases. The first phase is done when the data mining algorithm checks for an attribute value in the hierarchy: if the value generates a rule, the descendant values that could also generate rules in the same class are not stored as valid rules, even though they satisfy the relevance measures. Second, if a discovered rule subsumes other rules previously discovered, these last ones are deleted from the list of discovered rules. Conversely, if a discovered rule is subsumed by one or more previously discovered rules, this rule is not added to the list. This second phase is performed using a rule indexing schema, which is generated based on the rule consequent, i.e., the rule class, and the rule antecedent. Each attribute-value pair has a code, and the composition of these codes forms the rule's code, which is used to create a hash-based index of the discovered rule set.
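A sketch of the second phase is given below. The index is keyed by the rule consequent (the class), and a stored rule is taken to subsume a candidate when every pair in the stored rule is matched, on the same attribute, by an equal or more specific value in the candidate. The precedes function is assumed to be the numeric code check of Section 2.1; the indexing details are our simplification of the paper's code-composition scheme.

```python
from collections import defaultdict

class RuleIndex:
    """Hash-based index of discovered rules, keyed by the rule class."""

    def __init__(self, precedes):
        self.by_class = defaultdict(list)
        self.precedes = precedes  # precedes(v, v2): v equal to or below v2

    def _subsumes(self, general, specific):
        """Every pair of `general` is matched by an equal/more specific pair."""
        vals = dict(specific)
        return all(a in vals and self.precedes(vals[a], v2)
                   for a, v2 in general)

    def add(self, cls, pairs):
        """Insert a rule unless subsumed; drop stored rules it subsumes."""
        bucket = self.by_class[cls]
        if any(self._subsumes(old, pairs) for old in bucket):
            return False                       # subsumed by an existing rule
        bucket[:] = [old for old in bucket if not self._subsumes(pairs, old)]
        bucket.append(list(pairs))
        return True
```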
3.4. Relevance Measures and Selection Criteria
In the NETUNO-HC system, the rule hypotheses are evaluated by two conditions: completeness and consistency. Let $P$ denote the total number of positive examples of a given class in the training data. Let $h$ be a rule hypothesis covering tuples of that class, and let $p$ and $n$ be the number of positive and negative tuples covered by $h$, respectively. The completeness is defined by the ratio $p/P$, which is called in this work support (also known in the literature as positive coverage). The consistency is defined as the ratio $p/(p+n)$, which is called in this work confidence (also known as training accuracy). These values are calculated using the SQL primitive described in Section 4.
The criterion for selecting the best hypotheses to be expanded is based on the product $support(h) \times confidence(h)$. The hypotheses in the open-list are stored in decreasing order of that product, and only the best hypotheses (up to the beam width) are selected.
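In code, the two measures and the selection criterion reduce to a few lines; the variable names $p$, $n$, and $P$ follow the definitions above.

```python
def support(p, P):
    """Completeness (positive coverage): covered positives over all positives."""
    return p / P

def confidence(p, n):
    """Consistency (training accuracy): positives over all covered tuples."""
    return p / (p + n)

def score(p, n, P):
    """Selection criterion used to order the open-list."""
    return support(p, P) * confidence(p, n)
```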
3.5. Interpretation of the Induced Rules
The induced rules can be interpreted as classification rules. Thus, to use the induced rules to classify new examples, NETUNO-HC employs an interpretation in which all rules are tried and only those that cover the example are collected. If a collision occurs (i.e., the example belongs to more than one class), the decision is to classify the example in the class given by the rule with the greatest value for the product $support \times confidence$. If an example is not covered by any rule, the number of non-classified examples is incremented. Section 5.3 shows the result of applying a default rule in this case.
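A sketch of this interpretation, assuming a covers(antecedent, example) predicate and per-rule support and confidence values are available:

```python
def classify(example, rules, covers):
    """Return the class of the best covering rule, or None if uncovered.

    `rules` are (antecedent, cls, supp, conf) tuples; on a collision the
    rule with the greatest supp * conf wins.
    """
    covering = [r for r in rules if covers(r[0], example)]
    if not covering:
        return None                     # counted as a non-classified example
    best = max(covering, key=lambda r: r[2] * r[3])
    return best[1]
```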
4. SQL Primitive for Evaluation of High Level Hypotheses
In [4], a generic KDD primitive in SQL was proposed, which underlies the candidate rule evaluation procedure. This primitive consists of counting the number of tuples in each partition formed by a SQL GROUP BY statement. The primitive has three input parameters: a tuple-set descriptor, a candidate attribute, and the class attribute. The output is an $m \times k$ matrix, where $m$ is the number of different values of the candidate attribute and $k$ is the number of different values of the class attribute.
In order to use this primitive and its output matrix for the evaluation of high level hypotheses (i.e., building a SQL primitive considering a concept hierarchy), some extensions were made to the original proposal [4]. In the primitive, the tuple-set descriptor has to be expressed by values in the DB, i.e., the leaf concepts in the hierarchy. So, each high level value in the descriptor has to be expressed by the leaf values that precede it. This is done by NETUNO-HC during the data mining, using the hierarchy to build the SQL primitive. For example, let black $\preceq$ dark and brown $\preceq$ dark, where black and brown are leaf concepts in a color domain hierarchy. If the antecedent of a hypothesis has the attribute-value pair spore print color = dark, this has to be expressed in the tuple-set descriptor by leaf values, i.e., spore print color = brown OR spore print color = black.
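The sketch below shows how such a query could be assembled: each high level value is expanded into its leaf values before the GROUP BY is issued. Table, column, and function names are illustrative; only the expansion idea comes from the paper.

```python
def leaf_values(children, concept):
    """Leaf concepts that precede `concept`; `children` maps concept -> list."""
    kids = children.get(concept, [])
    if not kids:
        return [concept]
    return [leaf for k in kids for leaf in leaf_values(children, k)]

def primitive_sql(table, descriptor, candidate, class_attr, hierarchies):
    """One GROUP BY query counting tuples per (candidate value, class)."""
    conds = []
    for attr, value in descriptor:
        leaves = leaf_values(hierarchies[attr], value)
        conds.append("(" + " OR ".join(f"{attr} = '{v}'" for v in leaves) + ")")
    where = " AND ".join(conds) if conds else "1 = 1"
    return (f"SELECT {candidate}, {class_attr}, COUNT(*) FROM {table} "
            f"WHERE {where} GROUP BY {candidate}, {class_attr}")

hierarchies = {"spore_print_color": {"dark": ["black", "brown"]}}
print(primitive_sql("mushroom", [("spore_print_color", "dark")],
                    "odor", "class", hierarchies))
# ... WHERE (spore_print_color = 'black' OR spore_print_color = 'brown') ...
```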
For the output matrix, the lines are the leaf concepts of the hierarchy. Adding the lines whose concepts are leaves and precede a high level concept is equivalent to having a line for that high level concept, which can be used to evaluate the high level hypotheses (see Figure 2).
A condition between an attribute and its value may also be an inequality. In this case, e.g., spore print color ≠ dark, the tuple-set descriptor is translated to spore print color ≠ brown AND spore print color ≠ black. To calculate the relevance measures for this condition, the same matrix can be used: the line for this condition is the difference between the Total line and the line that corresponds to the attribute value, i.e., $line(\neq v) = Total - line(v)$.
Figure 2: The lines of the matrix represent the leaf concepts a1, ..., am of the candidate attribute's hierarchy (rooted at ANY); the columns are the classes C1, ..., Cn, plus a Total line and a Total column
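Both the high level line and the inequality line can thus be derived from the leaf-level matrix without issuing further queries, as the sketch below illustrates. The matrix representation (a dict from leaf value to a list of per-class counts) is assumed for illustration.

```python
def row_sum(matrix, leaves):
    """Line for a high level concept: the sum of its leaf concepts' lines."""
    width = len(next(iter(matrix.values())))
    return [sum(matrix[leaf][j] for leaf in leaves) for j in range(width)]

def inequality_row(matrix, value_leaves):
    """Line for 'attribute != value': the Total line minus the value's line."""
    total = row_sum(matrix, list(matrix))
    value = row_sum(matrix, value_leaves)
    return [t - v for t, v in zip(total, value)]

matrix = {"black": [3, 1], "brown": [2, 4], "white": [5, 0]}  # lines: leaves
print(row_sum(matrix, ["black", "brown"]))         # line for the concept 'dark'
print(inequality_row(matrix, ["black", "brown"]))  # line for '!= dark'
```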
5. Experiments
In order to evaluate the NETUNO-HC algorithm, we used two DBs from the UCI repository: Mushroom and Adult. First, we tested how the size of the search space changes when performing data mining with and without the use of concept hierarchies. This was done using a simplified version of the NETUNO-HC algorithm that uses a complete search method.
In the rest of the experiments we analyzed the data mining process, with and without the use of concept hierarchies, with respect to the following aspects: efficiency of DB access, concept hierarchy access, and rule subsumption verification; the accuracy of the discovered rule set; the discovery of high level rules; and the semantic evaluation of high level rules.
5.1. The Size of the Search Space
We have first analyzed how the use of concept hierarchies in data mining can affect the size of the
search space considering a complete search method, such as Breadth-First Search.
Figure 3: Breadth-First Search algorithm execution in the Mushroom DB with and without hierarchies, with sup = 20% and conf = 90%. The plot shows the open-list size (the list of candidate rules, or rule hypotheses) versus the number of open-list removes (the number of hypothesis specializations)
Figure 3 shows, as expected, that the search space for high level rules increases with the size of the concept hierarchies considered in the data mining process. We can also see in Figure 3 that pruning techniques, based on relevance measures and rule subsumption, can eventually empty the list of open nodes (the open-list). For the Mushroom DB, this occurs after 15000 hypothesis specializations when mining without concept hierarchies, and after 59000 hypothesis specializations when mining with concept hierarchies.
Another observation we can make from Figure 3 is that the size of the open-list is approximately four times bigger when using concept hierarchies for the Mushroom DB. Therefore, it is important to improve the performance of hypothesis evaluation, which involves efficient DB access, concept hierarchy access, and rule subsumption verification.
5.2. Efficiency in High Level SQL Primitive and Hypotheses Generation
In order to evaluate the use of the high level SQL primitive, we implemented a version of ParDRI [7]. In ParDRI, the high level queries are made in a different way: it uses the direct descendants of the hierarchy root. So, if the root concept has three descendants, three queries will be issued, while with the SQL primitive only one query is necessary.
For the Mushroom DB, without the SQL primitive, the algorithm generated 117 queries and discovered 26 rules. Using the primitive, only 70 queries were generated for exactly the same 26 rules, a reduction of 40% in the number of queries.
To evaluate the time spent on hypotheses generation, the following times were measured during the
executions:
1. the time spent with DB queries;
2. the time spent by the data mining algorithm.
The ratio between the difference of these two times and the time spent by the data mining algorithm is the percentage spent in the generation and evaluation of the hypotheses. This value is 1.87% (with σ=0.15), showing that the execution time is dominated by the queries issued to the DBMS. Therefore, the use of the high level SQL primitive, combined with efficient techniques for encoding and evaluating hypotheses, makes NETUNO-HC a more efficient algorithm for high level data mining than ParDRI [7].
5.3. Accuracy
In Table 2, the accuracy results of NETUNO-HC with and without hierarchies are compared with two other algorithms, C4.5 [6] and CN2 [2], which do not use concept hierarchies. In order to compare similar classification schemes, the NETUNO-HC results were obtained using a default class (the majority class, in this case) to label examples not covered, similarly to the two other algorithms. For the other experiments, the default class was not used.
Algorithm               Mushroom   Adult
C4.5                    100%       84.46%
CN2                     100%       84%
NETUNO-HC without CH    99.04%     82.14%
NETUNO-HC with CH       98.45%     81.62%

Table 2: Accuracies for the algorithms; the default class was used in NETUNO-HC
The next experiments show results obtained through ten-fold stratified cross validation. Table 3 shows the accuracy of the discovered rule set. For both DBs we can observe that, by decreasing the minimum support value, the accuracy tends to increase (in both situations: with or without hierarchies). This happens because some tuples are covered only by rules with small coverage, and these rules can only be discovered by defining a small support.
Support / Confidence    Mushroom                           Adult
                        without CH       with CH           without CH       with CH
20% / 90%               0.9061 σ=0.002   0.8942 σ=0.002    0.6717 σ=0.003   0.6762 σ=0.004
20% / 94%               0.9572 σ=0.005   0.9311 σ=0.005    0.5672 σ=0.004   0.5851 σ=0.005
20% / 98%               0.9596 σ=0.004   0.9845 σ=0.002    0.3701 σ=0.002   0.5146 σ=0.004
12% / 90%               0.8991 σ=0.004   0.8931 σ=0.002    0.7048 σ=0.002   0.7031 σ=0.006
12% / 94%               0.9572 σ=0.002   0.9299 σ=0.003    0.6559 σ=0.003   0.6598 σ=0.005
12% / 98%               0.9738 σ=0.002   0.9845 σ=0.003    0.4112 σ=0.002   0.5566 σ=0.005
4% / 90%                0.8954 σ=0.003   0.8931 σ=0.004    0.7229 σ=0.004   0.7235 σ=0.003
4% / 94%                0.9524 σ=0.003   0.9275 σ=0.004    0.6797 σ=0.005   0.6646 σ=0.002
4% / 98%                0.9881 σ=0.003   0.9845 σ=0.002    0.5513 σ=0.005   0.6035 σ=0.002

Table 3: Mean accuracies and standard deviations (σ) for each support and confidence value, with beam width = 256
As expected, the use of hierarchies does not directly affect the accuracy of the discovered rules. That can be explained as follows. On one hand, a more general concept has greater inconsistency, which decreases the accuracy. On the other hand, with high support values, an increase in the minimum confidence value tends to increase the accuracy; in this case, a high level concept can cover more examples (i.e., decreasing the number of non-covered examples, as can be seen in Table 4), so the number of non-classified examples is very small, even considering a small beam width.
Intuitively, one might expect that a larger beam width would discover a rule set with better accuracy, since the search would become closer to a complete search. However, in the Mushroom DB with hierarchies, an increase in the beam width did not result in better accuracy, as can be seen in Table 4.
Beam width   Accuracy                  Non-Classified Examples
             without CH   with CH      without CH   with CH
1            0.9501       0.9857       37           2
2            0.9501       0.9857       37           2
4            0.9501       0.9857       37           2
8            0.9548       0.9857       33           2
16           0.9845       0.9845        7           2
32           0.9845       0.9869        7           0
64           0.9881       0.9869        4           0
128          0.9881       0.9845        4           0
256          0.9881       0.9845        4           0

Table 4: Beam width vs. accuracy and non-classified examples in the Mushroom DB
5.4. High Level Rules
The relevance measures affect the discovered rule set. With a minimum confidence value of 90%, it can be seen in the two DBs that high minimum support values tend to produce more high level rules in the rule set. The table below shows the percentage of rules in the discovered rule set that contain high level values for the attributes, for different support values.
Support Minimum Value   Mushroom   Adult
4%                      51.8%      81.4%
20%                     63.8%      85.6%
5.5. Semantic Evaluation
The use of hierarchies introduces more general concepts, which can cause low level rules to be subsumed by high level ones. For example, in the Mushroom DB, given the high level concept BAD (which groups bad-smelling leaf values such as CREOSOTE and FOUL), the rule r1 below is discovered. This rule is more general than the two rules r2 and r3, discovered without the use of hierarchies.

r1: odor = BAD → POISONOUS (Supp: 0.822, Conf: 1.0)
r2: odor = CREOSOTE → POISONOUS (Supp: 0.048, Conf: 1.0)
r3: odor = FOUL → POISONOUS (Supp: 0.549, Conf: 1.0)

This example shows that, by using concept hierarchies in data mining, one can generate more concise knowledge to be interpreted by an expert in the DB domain.
6. Conclusions
The use of concept hierarchies in data mining results in a trade-off between the discovery of more interesting rules, expressed at a higher abstraction level, and a higher computational cost. In this work, we presented the NETUNO-HC algorithm and its implementation, proposing ways to solve the efficiency problems of data mining with concept hierarchies, namely: the use of the Beam Search strategy, the encoding and evaluation techniques for concept hierarchies, and the high level SQL primitive.
The main contribution of this work is the specification of a high level SQL primitive as an efficient way to analyze rules considering concept hierarchies.
We also performed experiments to show how the mining parameters affect the discovered rule set:
Variation of the minimum support value. On one hand, a decrease in the minimum support value tends to increase the accuracy, with or without hierarchies, while also increasing the rule set size. On the other hand, a high minimum support value tends to discover a more interesting rule set, i.e., a set with more high level rules.
Variation of the minimum confidence value. The effect of this kind of variation depends on the DB domain. For the databases analyzed, a higher confidence value did not always result in a higher accuracy.
Alteration of the beam width. A higher beam width tends to increase the accuracy. However, depending on the DB domain and on whether hierarchies are used, a better accuracy can be obtained with a lower beam width.
References
[1] Marco Eugênio Madeira Di Beneditto. Descoberta de regras de classificação com hierarquias conceituais. Master's thesis, Instituto de Matemática e Estatística, Universidade de São Paulo, Brasil, February 2004.
[2] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3:261–283, 1989.
[3] A. Freitas and S. Lavington. Speeding up knowledge discovery in large relational databases by means
of a new discretization algorithm. In Proc. 14th British Nat. Conf. on Databases (BNCOD-14), pages
124–133, Edinburgh, Scotland, 1996.
[4] A. Freitas and S. Lavington. Using SQL primitives and parallel DB servers to speed up knowledge discovery in large relational databases. In R. Trappl, editor, Cybernetics and Systems'96: Proc. 13th European Meeting on Cybernetics and Systems Research, pages 955–960, Vienna, Austria, 1996.
[5] Jiawei Han, Yongjian Fu, Wei Wang, Jenny Chiang, Wan Gong, Krzysztof Koperski, Deyi Li, Yijun Lu, Amynmohamed Rajan, Nebojsa Stefanovic, Betty Xia, and Osmar R. Zaiane. DBMiner: A system for mining knowledge in large relational databases. In Evangelos Simoudis, Jia Wei Han, and Usama Fayyad, editors, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 250–263. AAAI Press, 1996.
[6] John Ross Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1 edition, 1993.
[7] Merwyn G. Taylor. Finding High Level Discriminant Rules in Parallel. PhD thesis, Faculty of the
Graduate School of the University of Maryland, College Park, USA, 1999.