Ontology-based Induction of High Level Classification
Merwyn G. Taylor
University of Maryland
Computer Science Department
College Park, MD, 20742
[email protected]
Kilian Stoffel
University of Maryland
Computer Science Department
College Park, MD, 20742
[email protected]
A tool that could be beneficial to the data mining community
is one that facilitates the seamless integration of knowledge
bases and databases. This kind of tool could form the foundation of a data mining system capable of finding interesting
information using ontologies. In this paper, we describe a
new algorithm based on the query facilities provided by such
a tool, ParkaDB, which is a knowledge representation system.
Ontologies and relational databases are merged, thus extending the range of queries that can be issued against databases.
This extended query capability makes it possible to generalize from low-level concepts to high-level concepts on demand.
Given this extended querying capability, we are developing
a classification algorithm that will find classification rules at
multiple levels of abstraction.
1 Introduction
The number of databases maintained in the world is
increasing rapidly. Many of these databases are huge
and are therefore very difficult to analyze manually.
To make it feasible to analyze databases, researchers
have been developing tools for knowledge discovery in
databases (KDD). These tools can be used to find
various types of interesting information in databases.
KDD tools have been developed to find association
rules [1], perform time sequence analysis [2], and find
classification rules [3]. To supplement the discovery
process, some systems use background knowledge [4,
9]. The background knowledge can be in the form
of domain rules and concept-hierarchies (ontologies).
The background knowledge is often used to guide the
discovery process such that uninteresting information is
The classification problem is one of the most researched problems in KDD, primarily because of its
utility to various communities. Classification systems
create mappings from data to predefined classes, based
on features of the data. Many of them are based on decision tree technology introduced by Quinlan [3], while
others employ various techniques such as implication
rule search. Classification problems exist in a variety
of domains. A classic classification problem is that of a
lending institution in search of rules that can be used to
predict which types of borrowers are likely to default
on a loan and which are not.
James A. Hendler
University of Maryland
Computer Science Department
College Park, MD, 20742
[email protected]
This research was supported in part by grants from ONR
(N00014-J-91-1451), ARPA (N00014-94-1090, DAST-95-C003,
F30602-93-C-0039), and the ARL (DAAH049610297). Dr.
Hendler is also affiliated with the UM Institute for Systems
Research (NSF Grant NSF EEC 94-02384).
Ontologies such as UMLS [5] and WordNet [6]
have been created for use in knowledge-based systems.
In KDD, ontologies provide a mechanism by which
domain-specific knowledge may be included to aid
the discovery process. However, ontologies are only
useful if they can be queried repeatedly, efficiently, and
quickly. Currently, most ontology management systems
cannot support extremely large ontologies, making it
difficult to create efficient knowledge-based systems.
For this reason, KDD tools that use ontologies usually
pre-generalize databases before applying the core data
mining algorithm [4, 10, 12]. We have developed a tool,
ParkaDB, that is capable of managing large ontologies.
ParkaDB [7, 8] has a very efficient structural design and
is based on high-performance computing technologies.
Because of its ability to query large ontologies, both
serially and in parallel, ParkaDB makes it feasible to
merge large ontologies with large databases. The data
mining algorithm described in this paper is based on
ParkaDB's ability to efficiently evaluate queries over
ontologies integrated with databases.
We are currently developing a high-level classification
system that will be applied to a medical database
maintained at the Johns Hopkins Hospital. It is a
knowledge-based system designed to find classification
rules at multiple levels of generality. To realize this
goal, the system incorporates ontological information
and dependency data between attribute-value pairs. It
uses the ParkaDB KR system mentioned above. In this
paper we discuss the high-level classification algorithm
and introduce a data mining technique for discovery
with ontologies.
2 Ontologies & Concept Hierarchies
There is some dispute in the KR community as to
what exactly an "ontology" is. In particular, there is
a question as to whether "exemplars," the individual
items filling an ontological definition, count as part of
the ontology. Thus, for example, does a knowledge
base containing information about thousands of cities
and the countries they are found in contain one
assertion, that countries contain cities, or does it contain
thousands of assertions when one includes all of the
individual city-to-country mappings? While ParkaDB
can support both kinds quite well, it is sometimes
important to differentiate between the two. We will
use the term traditional ontology for the former, that
is, those ontologies consisting of only the definitions.
We use the term hybrid ontologies for the latter, those
combining both ontological relations and the instances
defined thereon. This second group may consist of
a relatively small ontological part and a much larger
example base, or of a dense combination of relations
and instances. In this paper, we will assume that our
data stores are in the form of hybrid ontologies. See [8]
for a discussion of ontologies.
Concept taxonomies are traditional ontologies in
which every assertion represents a sub-class to superclass, or vice-versa, relationship between concepts. Traditionally, concept taxonomies have been created using
"isa" assertions. The "isa" assertion represents a mapping from a specific concept to a more general concept.
Figure 2 depicts a concept taxonomy defined over several types of beverages. All links in Figure 2 represent
"isa" assertions. For example, "Diet Coke" "isa" type
of "Coke". Through transitivity, it can be concluded
that "Diet Coke" is also a type of "Beverage". With
concept taxonomies, and a knowledge representation system to
reason over them, concepts can be referenced directly, i.e. "Diet Coke", or indirectly through
general concepts, i.e. "Soda".
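
The transitive reading of "isa" can be sketched in a few lines of Python. This is only an illustration: the dictionary representation and the function names `ancestors` and `is_a` are assumptions made here, not part of ParkaDB, and the links are limited to the soda concepts named in the text.

```python
# Toy "isa" links taken from the beverage example in the text.
# Storing them as a child -> parent dict is an assumption for illustration.
ISA = {
    "Diet Coke": "Coke",
    "Reg Coke": "Coke",
    "Diet Pepsi": "Pepsi",
    "Reg Pepsi": "Pepsi",
    "Coke": "Soda",
    "Pepsi": "Soda",
    "Soda": "Beverage",
}


def ancestors(concept, isa=ISA):
    """Return the concepts reached by following isa links upward, nearest first."""
    result = []
    while concept in isa:
        concept = isa[concept]
        result.append(concept)
    return result


def is_a(concept, general, isa=ISA):
    """True if `concept` equals `general` or generalizes to it by transitivity."""
    return concept == general or general in ancestors(concept, isa)


print(ancestors("Diet Coke"))
print(is_a("Diet Coke", "Beverage"))
```

A direct reference tests equality with "Diet Coke"; an indirect reference tests `is_a(value, "Soda")`, which holds for every value below "Soda" in the tree.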
3 ParkaDB's Query Language
ParkaDB is a knowledge representation system developed
by the PLUS group at the University of Maryland. It
supports the storage, loading, querying and updating
of very large ontologies, both traditional and hybrid.
In this section we describe ParkaDB's query language.
See [7, 8] for a discussion of ParkaDB in general.
ParkaDB supports a conjunctive query language in
which every conjunct is a triple (P, D, R). P can be
one of several predefined structural relationships, "isa",
"instanceOf", "subCat", etc. Alternatively, P can be
an attribute defined on a database. D and R are the
domain and range of P, respectively. D and R are
allowed to be variables, while P has to be fully specified.
Given the table in Figure 1 and the concept taxonomy in
Figure 2, consider the query "(Item1 ?D ?R)(instanceOf
?R Soda)". This query requests all tuples in Figure 1
in which the value of attribute Item1 is a "Soda". The
result of this query will be a list of pairs (?D ?R) in
which ?D is bound to a tuple id and ?R to the
value of Item1 for those tuples that satisfy the query.
ParkaDB also supports several inheritance methods that
are beyond the scope of this article.
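
To make the semantics of such a conjunctive query concrete, the following Python sketch evaluates the example query over a tiny in-memory table. It is not ParkaDB's implementation or API: the evaluator, the names `ATTRIBUTES`, `match` and `evaluate`, and the tuple ids are hypothetical, and "instanceOf" is assumed here to follow "isa" links transitively.

```python
# Child -> parent "isa" links for the soda part of the beverage taxonomy.
ISA = {
    "Diet Coke": "Coke", "Reg Coke": "Coke",
    "Diet Pepsi": "Pepsi", "Reg Pepsi": "Pepsi",
    "Coke": "Soda", "Pepsi": "Soda", "Soda": "Beverage",
}

# Hypothetical attribute storage: attribute name -> {tuple id: value}.
ATTRIBUTES = {"Item1": {1: "Diet Coke", 2: "Reg Pepsi", 3: "Skim Milk"}}


def generalizes_to(value, concept):
    """Walk isa links upward from `value` until `concept` or the root is reached."""
    while value != concept and value in ISA:
        value = ISA[value]
    return value == concept


def match(conjunct, binding):
    """Yield every extension of `binding` that satisfies one (P, D, R) conjunct."""
    pred, d, r = conjunct
    if pred == "instanceOf":
        # D is expected to be bound by an earlier conjunct; R is a constant concept.
        if generalizes_to(binding.get(d, d), r):
            yield binding
        return
    for tid, value in ATTRIBUTES[pred].items():
        new = dict(binding)
        consistent = True
        for term, actual in ((d, tid), (r, value)):
            if isinstance(term, str) and term.startswith("?"):
                # Variable: bind it, or check it against its existing binding.
                if new.setdefault(term, actual) != actual:
                    consistent = False
            elif term != actual:
                consistent = False
        if consistent:
            yield new


def evaluate(query):
    """Evaluate the conjuncts left to right, threading variable bindings through."""
    bindings = [{}]
    for conjunct in query:
        bindings = [b2 for b in bindings for b2 in match(conjunct, b)]
    return bindings


query = [("Item1", "?D", "?R"), ("instanceOf", "?R", "Soda")]
for b in evaluate(query):
    print(b["?D"], b["?R"])
```

With this toy table the query keeps tuples 1 and 2 and drops tuple 3, since "Skim Milk" does not generalize to "Soda".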
4 High Level Classification Rules
Many data mining systems search for interesting information in databases at low levels. That is, the language
of the interesting information returned is composed of
only the terms occurring in the data itself. In this paper,
we will consider high-level classification rules, in which
the language of a rule is composed of terms that are abstractions of the low-level terms as well as the low-level
terms occurring in a database. Consider the database
in Figure 1 and the ontology in Figure 2. The high-level
classification rules in Figure 3 can be induced from Figure 1 and the ontology in Figure 2.
These rules are composed of concepts at high levels
of abstraction. Soda is a generalization of Pepsi and
Coke, while Dairy Beverage is a generalization of AB
Milk and AB Egg Nog. The rules are the result of
data mining at a higher level of abstraction. The
designers of DBLearn [4] and KBRL [9] have explored
data mining at higher levels of abstraction. In this
paper we present a technique similar to those used in
the aforementioned systems. Our system is based on
high-level abstraction using domain ontologies and an
efficient, scalable ontology management tool, ParkaDB,
which allows it to induce multi-level classification rules
without pre-generalizing tables (see [10, 4, 12]).
Data mining at high levels of abstraction has several
advantages over data mining at low levels. First, high-level rules can provide a "clearer" synopsis of databases.
In general, data mining systems generate summaries of
databases in the form of low-level information. High-level rules can thus be considered summarizations of
rules existing at lower levels. They are especially
beneficial if it is possible to induce many low-level rules
that are similar in general content and form. A
second advantage of data mining at high levels of
abstraction is that fewer rules are generated
than would be generated by low-level data mining,
assuming similar search strategies are used. Fewer
rules can typically be generated since many low-level
concepts can be generalized to fewer high-level concepts.
Consequently, low-level rules that are similar in general
content and form will be replaced by a single high-level
rule. Furthermore, the user will be presented with less
information to digest and this information will have the
same coverage as the low-level alternatives.
There are several ways to realize high-level data mining. The approach taken by DBLearn's classification
module is high-level data mining at a single level of generalization. DBLearn's classification module generalizes
values in the domain of an attribute to a single level per
attribute. Alternatively, high-level data mining can be
performed at multiple levels of generalization. No restrictions are placed on the levels at which rules must
contain attribute-value pairs.
Figure 1: Sample customer purchase database
(columns Item 1 and Item 2; surviving entries in row order: Diet Coke, Ranch Dorito; CD Raisin Cereal; Reg Coke, Reg Dorito; Reg Coke, SB Chips; Diet Coke, Nacho Dorito; Diet Pepsi, BBQ Frito; Reg Pepsi, Reg Frito; Skim Milk, CD Raisin Cereal; Reg Pepsi, BBQ Frito; Reg Pepsi; AB Egg Nog, CD Farm Steak)
Figure 2: Concept-hierarchy over beverages
(surviving node labels: Fruit Juice, Dairy Beverage, AB EggNog, Apple Juice)
Figure 3: High-Level Rules
(class labels: Young, Old; rules: Customers purchased sodas. Customers purchased orange juice. Customers purchased dairy beverages.)
To clarify the meaning of multi-level data mining,
consider Figure 2 and the rules in Figure 3. These
findings are at varying levels of generality and they are
based on abstractions over a single attribute. Soda
occurs on level 1, orange juice occurs on level 2,
and dairy beverages occur on level 1. Data mining
at multiple levels of generalization can lead to far
more interesting information, uncovering relationships
between very general concepts and those at lower levels
of generality.
5 Multi-Level Classification
The major components of the multi-level classification
algorithm will be described in this section. The
complete algorithm is presented in Figure 4.
1. Gather frequency counts.
2. Repeat until all frequency counts are zero.
(a) Find the attribute-value pair with maximum D(Av).
(b) Create a ParkaDB query, Q.
(c) Extend Q until it is satisfied by tuples in one class.
(d) Reduce frequency counts of all attribute-value
pairs occurring in all tuples satisfying Q.
(e) Transform Q into a classification rule.
Figure 4: Multi-level classification algorithm
The algorithm is a form of the "generate and
test" class of search techniques. Concept hierarchies
augmented with frequency counts, and dependency
data, are used to guide the creation of queries that will
become classification rules. The algorithm tries to find
a set of queries that individually are satisfied by tuples
in just one class and collectively describe the differences
between the classes.
Ontologies play an important role in the multi-level
classification process described in this paper. They are
important because they contain the concepts that the
multi-level classification algorithm will use to induce
its rules. Currently, we are only considering ontologies
similar to the concept hierarchy illustrated in Figure 2.
These ontologies are domain-specific tree structures representing hierarchical groupings of values occurring in a
database. An interior node represents a generalization
of concepts at lower levels, thus providing a mechanism
for data mining at high levels of abstraction, provided
there exists a query engine powerful enough to generalize data from databases to concepts in the taxonomies
on demand.
The classification algorithm uses frequency counts
from a database when determining which concepts from
the concept hierarchies to include in a rule. In step
1 (Figure 4: Gathering frequency counts), nodes in
the concept hierarchies are augmented with frequency
counts gathered from the database being analyzed. The
frequency counts for low-level concepts (i.e. leaves in
the concept hierarchies) are gathered directly from the
databases, whereas the frequency counts for high-level
concepts are accumulated from concepts at lower levels.
Figure 5 is an extension of Figure 2 containing frequency
counts from Figure 1. The frequency counts are grouped
by class to represent the distribution of concepts among
the classes. This is denoted by [X,Y,Z] where X
represents the frequency with which an attribute-value
pair occurs in a tuple that is in the first class, Y the
second class, and Z the third class. For example, in
Figure 1, there is one tuple, belonging to the "Mid"
class, in which "Diet Pepsi" is a value for attribute
"Item 1". This is illustrated in Figure 5.
Figure 5: Concept-hierarchy over beverages with frequency counts: [young,mid,old]
(surviving labels: Dairy Beverage, Pepsi [1,3,0], Coke [3,1,0], AB EggNog, Fruit Juice, Apple Juice)
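
Step 1 can be sketched as a single pass over the tuples in which each value increments its own per-class counter and the counter of every ancestor. The Python below is a minimal sketch under stated assumptions: the function name `gather_counts` and the row format are hypothetical, and the four rows are chosen only so that the result reproduces the [1,3,0] count shown for Pepsi in Figure 5.

```python
from collections import defaultdict

CLASSES = ["young", "mid", "old"]

# Child -> parent links for the Pepsi branch of the beverage taxonomy.
ISA = {"Diet Pepsi": "Pepsi", "Reg Pepsi": "Pepsi",
       "Pepsi": "Soda", "Soda": "Beverage"}

# Illustrative (Item 1 value, class) rows; assumed, not copied from Figure 1.
ROWS = [("Diet Pepsi", "mid"), ("Reg Pepsi", "mid"),
        ("Reg Pepsi", "mid"), ("Reg Pepsi", "young")]


def gather_counts(rows, isa=ISA, classes=CLASSES):
    """Count each value per class and accumulate the counts at every ancestor."""
    counts = defaultdict(lambda: [0] * len(classes))
    for value, cls in rows:
        idx = classes.index(cls)
        node = value
        while node is not None:
            counts[node][idx] += 1
            node = isa.get(node)  # None once the root has been counted
    return dict(counts)


counts = gather_counts(ROWS)
print(counts["Diet Pepsi"], counts["Pepsi"])
```

Leaves receive counts directly from the rows, while interior nodes such as "Pepsi" and "Soda" receive the sum over their descendants, matching the description of step 1.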
In step 2a, the frequency counts are used to efficiently
find those attribute-value pairs (attribute + concept
from the attribute's concept hierarchy) that are strong
indicators of class membership. Attribute-value pairs
are selected based on an evaluation function D(Av).
f(C, Av) is the frequency of attribute-value pair Av in
class C.
\[ f_r(C, Av) = \frac{f(C, Av)}{\|C\|} \]
f_r(C, Av) is the relative frequency of attribute-value
pair Av in class C. (\|C\| is continuously reduced to reflect the number of tuples
in class C that have not been covered by a rule.)
\[ m(Av) = \max_{1 \le i \le n} f_r(C_i, Av) \]
m(Av) is the maximum relative frequency of Av over
all classes.
\[ d(Av) = \sum_{i=1}^{n} \frac{1}{n-1} \left( 1 - \frac{f_r(C_i, Av)}{m(Av)} \right) \]
d(Av) is the normalized standard deviation of Av's
relative frequency counts.
\[ D(Av) = 0.9^{\mathrm{abs}(I_d - Av_d)} \, d(Av) \]
Ideal Depth: The Ideal Depth, I_d, is the preferred
level of generality. If we are interested in rules at
the abstraction level of soda, fruit juice, and dairy
beverages in Figure 2, we would set I_d = 1.
Attribute-Value Depth: The Attribute-Value Depth,
Av_d, is the depth of Av in its respective concept
hierarchy (e.g. Soda_d = 1).
The evaluation function D(Av) measures the quality
of Av with respect to generating ParkaDB queries
that are satisfiable by tuples in just one class. The
d(Av) term is used to select those attribute-value pairs
that will require the fewest additional constraints to
create a ParkaDB query that is satisfied by tuples in
a single class. d(Av) ranges from 0, indicating an
even distribution, to 1, indicating that Av is unique
to tuples in just one class. Attribute-value pairs with
d(Av) values close to 1 are strong indicators of class
membership. Selecting such an attribute-value pair to
seed a query should reduce the complexity of the queries
generated that describe tuples in a single class. When
two or more attribute-value pairs have
similar d(Av) values, those that are closer to the
ideal depth are preferable. The 0.9^{abs(I_d - Av_d)} term is
included in the evaluation function to filter out those
attribute-value pairs that have high d(Av) values but
are not close to the ideal depth in their respective
concept taxonomies. We experimented with several
constants before selecting 0.9 for this term. Values less
than 0.9 tended to filter out too many attribute-value
pairs and values greater than 0.9 did not filter out enough
attribute-value pairs.
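
The behaviour of the evaluation function can be checked numerically with a short Python sketch. Two caveats apply. First, the exact form of d(Av) is hard to read in the source, so the sketch assumes d(Av) = (1/(n-1)) * sum_i (1 - f_r(C_i, Av)/m(Av)), which is one form that yields 0 for an even distribution and 1 for a value unique to a single class, as the text requires. Second, the function names and the class sizes in the usage lines are hypothetical.

```python
def relative_frequencies(counts, class_sizes):
    """f_r(C_i, Av): per-class frequency divided by the (remaining) class size."""
    return [f / size if size else 0.0 for f, size in zip(counts, class_sizes)]


def d(counts, class_sizes):
    """Assumed form of d(Av): 0 for an even distribution, 1 if unique to one class."""
    fr = relative_frequencies(counts, class_sizes)
    m = max(fr)  # m(Av), the maximum relative frequency over all classes
    n = len(fr)
    if m == 0 or n < 2:
        return 0.0
    return sum(1 - x / m for x in fr) / (n - 1)


def D(counts, class_sizes, depth, ideal_depth, base=0.9):
    """D(Av) = base ** abs(I_d - Av_d) * d(Av), with base 0.9 as in the text."""
    return base ** abs(ideal_depth - depth) * d(counts, class_sizes)


sizes = [4, 4, 4]  # hypothetical class sizes for [young, mid, old]
print(d([2, 2, 2], sizes))                          # evenly distributed value
print(d([0, 3, 0], sizes))                          # value unique to one class
print(D([0, 3, 0], sizes, depth=2, ideal_depth=1))  # penalized for depth
```

The depth factor leaves d(Av) unchanged at the ideal depth and shrinks it by a factor of 0.9 for every level of distance from that depth.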
The classification algorithm generates a series of
conjunctive queries that are subsequently evaluated by
ParkaDB in steps 2b & 2c. Queries are initialized
with the attribute-value pair having the highest D(Av)
value selected in step 2a. They are extended by adding
attribute-value pairs as constraints until the queries are
satisfied by tuples in a single class.
Given the tuples in Figure 1, if the attribute-value pair
(Item 1,Pepsi) is selected, the algorithm will generate the
following ParkaDB query.
(Item1 ?tid ?item1) (everyInstanceOf ?item1
Since Pepsi is not exclusive to a single class, the query
above has to be extended. Since there is a strong
relationship^1 between Pepsi and Frito, the above query
can be extended to
(Item1 ?tid ?item1) (everyInstanceOf ?item1
(Item2 ?tid ?item2) (everyInstanceOF ?item2
(A concept taxonomy for Frito is represented in Figure 6.) This query requests tuples from
the table in Figure 1 in which the values of
attribute "Item1" are generalizable to "Pepsi" and the values
of attribute "Item2" are generalizable to "Frito".
Table 1 contains the results of this query.
Figure 6: Concept hierarchy for Frito
(surviving node labels: BBQ Frito, Reg Frito)
Customer  Item 1      Item 2     Class
          Diet Pepsi  BBQ Frito  Mid
          Reg Pepsi   Reg Frito  Mid
          Reg Pepsi   BBQ Frito  Mid
Table 1: Query Results
Queries are issued to ParkaDB to evaluate their
strengths. The strength of a rule is determined by the
distribution of classes that tuples satisfying the rule
belong to. In the above example, one rule for Mid Aged
customers would be
(Item1 ↑^2 Pepsi) ∧ (Item2 ↑ Frito) → Mid Age Patron
Step 2d is a critical step in the algorithm. The
purpose of this step is to gradually prune the search
space as the algorithm iterates. The frequency counts
for an attribute-value pair (A,v) represent the number
of tuples in which attribute "A" has value "v" or a value
that is generalizable to "v". The frequency counts for
an attribute-value pair (A,v) are reduced so that they
indicate the number of tuples remaining in the database
that contain the value "v", or any of its descendants, for
attribute "A" and that are not covered by a rule previously
generated. In Figure 7 the frequency counts for the
"Beverage" taxonomy have been reduced based on the
results of the extended query above. The frequency
counts for Diet Pepsi and Reg Pepsi were reduced as a
result of the finding. Furthermore, the frequency counts
for Pepsi, Soda, and Beverage have also been reduced
since they are generalizations of Diet Pepsi and Reg
Pepsi. If the frequency counts for an attribute-value
pair (A,v) are set to 0, then there does not exist a tuple
in the database that has "v", or a value generalizable
to "v", for attribute "A" that is not covered by a rule
previously generated. Consequently, neither (A,v) nor
any (A,v'), where v' is any value generalizable to v,
should be considered for Q in step 2c. This heuristic
effectively prunes the search space as classification rules
are induced.
Figure 7: Concept hierarchy over Beverage with reduced
frequency counts
(surviving labels: Dairy Beverage, Pepsi [1,0,0], Coke [3,1,0], AB EggNog, Fruit Juice, Apple Juice)
^1 We developed an algorithm that will find these relationships.
^2 ↑ : is generalizable to.
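
The count reduction of step 2d can be sketched as the inverse of the counting pass: every tuple covered by the new rule decrements its value and all of that value's ancestors. In the Python sketch below, the names `reduce_counts` and `is_pruned` are hypothetical. The starting counts for Pepsi come from Figure 5; the leaf counts and the Soda count are derived from the figures and the text (Pepsi plus Coke for Soda) rather than read off directly, and the Beverage root is omitted.

```python
CLASSES = ["young", "mid", "old"]

# Child -> parent links for the Pepsi branch (the Beverage root is omitted).
ISA = {"Diet Pepsi": "Pepsi", "Reg Pepsi": "Pepsi", "Pepsi": "Soda"}

# Counts before the rule is found, as [young, mid, old].
counts = {
    "Diet Pepsi": [0, 1, 0],
    "Reg Pepsi": [1, 2, 0],
    "Pepsi": [1, 3, 0],
    "Soda": [4, 4, 0],
}

# (Item 1 value, class) of the three tuples returned in Table 1.
covered = [("Diet Pepsi", "mid"), ("Reg Pepsi", "mid"), ("Reg Pepsi", "mid")]


def reduce_counts(counts, covered, isa=ISA, classes=CLASSES):
    """Decrement the counts of each covered value and of all its generalizations."""
    for value, cls in covered:
        idx = classes.index(cls)
        node = value
        while node is not None:
            counts[node][idx] -= 1
            node = isa.get(node)
    return counts


def is_pruned(counts, concept):
    """A concept whose counts are all zero need not be considered in step 2c."""
    return not any(counts[concept])


reduce_counts(counts, covered)
print(counts["Pepsi"], is_pruned(counts, "Diet Pepsi"))
```

After the reduction, Pepsi carries [1,0,0] as in Figure 7, and "Diet Pepsi" drops out of the search because no uncovered tuple contains it.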
6 Example: Mushroom Database
We ran our algorithm on the "Mushroom Toxicology"
database located at the UC Irvine Repository. This
database contains 8124 tuples and 22 attributes used
to describe the mushrooms. The mushrooms were
classified as either edible or poisonous. Simple concept
taxonomies were defined for all attributes, excluding the
class attribute. Figure 8 depicts the taxonomy defined
on the Odor attribute. The algorithm was restricted to use only
those attributes that had an average D(Av) of 0.7 or
greater for their respective values.
The following is a list of some of the rules that the
algorithm generated:
1. (Odor ↑ None) ∧ (Gill Size ↑ Broad) → edible 0.98
2. (Odor ↑ Bad) → poisonous 1.00
3. (Spore Print Color ↑ Dark) ∧ (Odor ↑ Pleasant) → edible 1.00
4. (Odor ↑ Spicy) → poisonous 1.00
Rule 2 illustrates the potential of high-level data
mining. It subsumes several low-level rules, thus
reducing the amount of data that is produced. It also
provides a more intuitive description of a subset of the
data set. Without concept taxonomies defined on the
Odor attribute, rule 2 would be replaced by a single rule
containing a disjunction of all descendants of "Bad"
in the antecedent or by four rules each containing a
single descendant of "Bad" in the antecedent. Whereas
these low-level alternatives to rule 2 do in fact describe
a subset of the data, they fail to describe the more
general phenomenon occurring in the data, namely that
mushrooms with a "Bad" odor are poisonous. It is
commonly known that "creosote", "fishy", "foul", and
"pungent" are all unpleasant odors. Therefore, it may
be quite simple to conclude rule 2 from its low-level
alternatives. However, in more complex domains it
may be much more difficult to draw such conclusions.
The concept taxonomies defined on the Odor attribute
provide a mechanism for encoding the hierarchical
relationship between "Bad" and "creosote", "fishy",
"foul", and "pungent" such that high-level rules can be
automatically induced, if they exist.
Figure 8: Concept taxonomy for the Odor attribute.
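
The subsumption described above can be illustrated by expanding a high-level rule into its low-level alternatives. The Python sketch below is only an illustration: the parent-to-children dictionary, the helper names `leaves` and `low_level_rules`, and the rule string format are assumptions, and only the four descendants of "Bad" named in the text are encoded.

```python
# Parent -> children links for the part of the Odor taxonomy named in the text.
ODOR_CHILDREN = {"Bad": ["creosote", "fishy", "foul", "pungent"]}


def leaves(concept, children):
    """Return the leaf values below `concept` (the concept itself if it is a leaf)."""
    kids = children.get(concept)
    if not kids:
        return [concept]
    found = []
    for kid in kids:
        found.extend(leaves(kid, children))
    return found


def low_level_rules(attribute, concept, label, children):
    """List the single-value rules that one high-level rule stands for."""
    return [f"({attribute} = {value}) -> {label}"
            for value in leaves(concept, children)]


for rule in low_level_rules("Odor", "Bad", "poisonous", ODOR_CHILDREN):
    print(rule)
```

One high-level rule over "Bad" stands in for four single-value rules; going in the other direction, from the four rules back to "Bad", is what requires the taxonomy.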
7 Related Work
Data mining with background knowledge has been
extensively studied in the past. Background knowledge
has been represented as rules, domain constraints,
taxonomies, and more recently as full ontologies. This
information has been used in many different ways.
Here we will only consider systems that use taxonomic
background knowledge.
Walker [10] was the first to use concept taxonomies.
The taxonomies were used to replace values by more
general values. Han et al. [4] proposed a similar approach to find characterization and classification rules at
high levels of generality. Later, Dhar and Tuzhilin [12]
proposed a generalization of the techniques of Walker
and Han. Their approach used database views and concept
hierarchies to represent background knowledge and
could induce a broader range of interesting patterns.
All of the above-mentioned approaches were based on
traditional relational database technology. To use background knowledge, the databases had to be transformed
to a generalized table before the respective discovery
algorithms could be invoked. This approach may lead
to over-generalization. The databases had to be pre-generalized because the RDBMSs used did not support
arbitrary generalization at "query time". Our approach
avoids pre-generalization by issuing queries with high-level concepts to which ParkaDB can efficiently generalize low-level concepts. This enables us to dynamically
generalize data as necessary and minimize over-generalization.
8 Conclusion and Future Work
In this paper, we have presented the core algorithm of
a high-level data mining system which induces classification rules at multiple levels of generality. The algorithm is based on concept hierarchies augmented with
frequency counts, to guide the classification process and
to prune the rule space, dependency data to create interesting queries, and a high-performance parallelizable
query engine to evaluate queries. We are using ParkaDB
to integrate ontologies and databases. Such an integration is crucial to effectively use ontologies (concept hierarchies) within data mining systems. This tool simplifies data mining with background knowledge by allowing databases to be queried at high levels of abstraction.
The algorithm presented in this paper is the major
component of a data mining system that will be used at
Johns Hopkins Hospital. The system will be used to find
conditions under which patients had positive responses
to platelet transfusions as well as negative responses.
