Ontology-based Induction of High Level Classification Rules

Merwyn G. Taylor, Kilian Stoffel, and James A. Hendler
University of Maryland, Computer Science Department, College Park, MD 20742

This research was supported in part by grants from ONR (N00014-J-91-1451), ARPA (N00014-94-1090, DAST-95-C003, F30602-93-C-0039), and the ARL (DAAH049610297). Dr. Hendler is also affiliated with the UM Institute for Systems Research (NSF Grant EEC 94-02384).

Abstract

A tool that could be beneficial to the data mining community is one that facilitates the seamless integration of knowledge bases and databases. Such a tool could form the foundation of a data mining system capable of finding interesting information using ontologies. In this paper, we describe a new algorithm based on the query facilities provided by such a tool, ParkaDB, a knowledge representation system. Ontologies and relational databases are merged, thus extending the range of queries that can be issued against databases. This extended query capability makes it possible to generalize from low-level concepts to high-level concepts on demand. Given this extended querying capability, we are developing a classification algorithm that finds classification rules at multiple levels of abstraction.

1 Introduction

The number of databases maintained in the world is increasing rapidly. Many of these databases are huge and are therefore very difficult to analyze manually. To make such analysis feasible, researchers have been developing tools for knowledge discovery in databases (KDD). These tools can be used to find various types of interesting information in databases: KDD tools have been developed to find association rules [1], perform time sequence analysis [2], and find classification rules [3]. To supplement the discovery process, some systems use background knowledge [4, 9]. The background knowledge can take the form of domain rules and concept hierarchies (ontologies), and it is often used to guide the discovery process so that uninteresting information is avoided.

The classification problem is one of the most researched problems in KDD, primarily because of its utility to various communities. Classification systems create mappings from data to predefined classes, based on features of the data. Many of them are based on the decision tree technology introduced by Quinlan [3], while others employ various techniques such as implication rule search. Classification problems exist in a variety of domains. A classic example is that of a lending institution in search of rules that can be used to predict which borrowers are likely to default on a loan and which are not.

Ontologies such as UMLS [5] and WordNet [6] have been created for use in knowledge-based systems. In KDD, ontologies provide a mechanism by which domain-specific knowledge may be included to aid the discovery process. However, ontologies are only useful if they can be repeatedly queried efficiently and quickly. Currently, most ontology management systems cannot support extremely large ontologies, making it difficult to create efficient knowledge-based systems. For this reason, KDD tools that use ontologies usually pre-generalize databases before applying the core data mining algorithm [4, 10, 12]. We have developed a tool, ParkaDB, that is capable of managing large ontologies. ParkaDB [7, 8] has a very efficient structural design and is based on high-performance computing technologies. Because of its ability to query large ontologies, both serially and in parallel, ParkaDB makes it feasible to merge large ontologies with large databases.
The data mining algorithm described in this paper is based on ParkaDB's ability to efficiently evaluate queries over ontologies integrated with databases. We are currently developing a high-level classification system that will be applied to a medical database maintained at the Johns Hopkins Hospital. It is a knowledge-based system designed to find classification rules at multiple levels of generality. To realize this goal, the system incorporates ontological information and dependency data between attribute-value pairs, and it uses the ParkaDB KR system mentioned above. In this paper we discuss the high-level classification algorithm and introduce a data mining technique for discovery with ontologies.

2 Ontologies & Concept Hierarchies

There is some dispute in the KR community as to what exactly an "ontology" is. In particular, there is a question as to whether "exemplars", the individual items filling an ontological definition, count as part of the ontology. For example, does a knowledge base containing information about thousands of cities and the countries they are found in contain one assertion, that countries contain cities, or does it contain thousands of assertions when one includes all of the individual city-to-country mappings? While ParkaDB can support both kinds quite well, it is sometimes important to differentiate between the two. We will use the term traditional ontology for the former, that is, those ontologies consisting of only the definitions. We use the term hybrid ontology for the latter, those combining both ontological relations and the instances defined thereon. This second group may consist of a relatively small ontological part and a much larger example base, or of a dense combination of relations and instances. In this paper, we will assume that our data stores are in the form of hybrid ontologies. See [8] for a discussion of ontologies.
Concept taxonomies are traditional ontologies in which every assertion represents a sub-class to super-class relationship (or vice versa) between concepts. Traditionally, concept taxonomies have been created using "isa" assertions. An "isa" assertion represents a mapping from a specific concept to a more general concept. Figure 2 depicts a concept taxonomy defined over several types of beverages; all links in Figure 2 represent "isa" assertions. For example, "Diet Coke" "isa" type of "Coke". Through transitivity, it can be concluded that "Diet Coke" is also a type of "Beverage". With concept taxonomies, and a knowledge representation system to reason over them, concepts can be referenced directly, i.e. "Diet Coke", or indirectly through general concepts, i.e. "Soda".

3 ParkaDB's Query Language

ParkaDB is a knowledge representation system developed by the PLUS group at the University of Maryland. It supports the storage, loading, querying, and updating of very large ontologies, both traditional and hybrid. In this section we describe ParkaDB's query language; see [7, 8] for a discussion of ParkaDB in general.

ParkaDB supports a conjunctive query language in which every conjunct is a triple (P, D, R). P can be one of several predefined structural relationships ("isa", "instanceOf", "subCat", etc.); alternatively, P can be an attribute defined on a database. D and R are the domain and range of P, respectively. D and R are allowed to be variables, while P has to be fully specified. Given the table in Figure 1 and the concept taxonomy in Figure 2, consider the query "(Item1 ?D ?R)(instanceOf ?R Soda)". This query requests all tuples in Figure 1 in which the value of attribute Item1 is a "Soda". The result of this query will be a list of pairs (?D ?R) in which ?D is bound to a tuple id and ?R to the value of Item1 for those tuples that satisfy the query. ParkaDB also supports several inheritance methods that are beyond the scope of this article.
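ParkaDB itself is a high-performance KR system; purely as an illustration of the query semantics, the following sketch evaluates the example query "(Item1 ?D ?R)(instanceOf ?R Soda)" against a hand-coded slice of Figures 1 and 2. The dictionary-based representation and the function names are our own assumptions, not ParkaDB's API.

```python
# Toy evaluator (not ParkaDB) for the conjunctive query
# "(Item1 ?D ?R)(instanceOf ?R Soda)" over the data of Figures 1-2.

ISA = {  # child -> parent "isa" links, beverage taxonomy excerpt (Figure 2)
    "Diet Coke": "Coke", "Reg Coke": "Coke",
    "Diet Pepsi": "Pepsi", "Reg Pepsi": "Pepsi",
    "Coke": "Soda", "Pepsi": "Soda", "Soda": "Beverage",
}

# Attribute Item1, keyed by tuple id (the soda-valued rows of Figure 1).
ITEM1 = {1: "Diet Coke", 3: "Reg Coke", 4: "Reg Coke", 5: "Diet Coke",
         6: "Diet Pepsi", 7: "Reg Pepsi", 9: "Reg Pepsi", 11: "Reg Pepsi"}

def instance_of(value, concept):
    """True if `value` reaches `concept` by following "isa" links transitively."""
    while value is not None:
        if value == concept:
            return True
        value = ISA.get(value)
    return False

# Bindings for (?D ?R): tuple id and Item1 value, where ?R generalizes to Soda.
result = [(d, r) for d, r in ITEM1.items() if instance_of(r, "Soda")]
```

Transitivity falls out of the upward walk: "Diet Coke" isa "Coke" isa "Soda", so tuple 1 satisfies the query.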
4 High Level Classification Rules

Many data mining systems search for interesting information in databases at low levels; that is, the language of the interesting information returned is composed only of terms occurring in the data itself. In this paper, we consider high-level classification rules in which the language of a rule is composed of terms that are abstractions of the low-level terms, as well as the low-level terms occurring in a database. Consider the database in Figure 1 and the ontology in Figure 2. The high-level classification rules in Figure 3 can be induced from the data in Figure 1 and the ontology in Figure 2. These rules are composed of concepts at high levels of abstraction: Soda is a generalization of Pepsi and Coke, while Dairy Beverage is a generalization of Milk and AB Egg Nog. The rules are the result of data mining at a higher level of abstraction. The designers of DBLearn [4] and KBRL [9] have explored data mining at higher levels of abstraction, and in this paper we present a technique similar to those used in the aforementioned systems. Our system is based on high-level abstraction using domain ontologies and an efficient, scalable ontology management tool, ParkaDB, which allows it to induce multi-level classification rules without pre-generalizing tables (see [10, 4, 12]).

Data mining at high levels of abstraction has several advantages over data mining at low levels. First, high-level rules can provide a "clearer" synopsis of databases. In general, data mining systems generate summaries of databases in the form of low-level information. High-level rules can thus be considered summarizations of rules existing at lower levels. They are especially beneficial if it is possible to induce many low-level rules that are similar in general content and form. A second advantage of data mining at high levels of abstraction is that the number of generated rules is smaller than what would be generated by low-level data mining, assuming similar search strategies are used.
Fewer rules can typically be generated since many low-level concepts can be generalized to fewer high-level concepts. Consequently, low-level rules that are similar in general content and form will be replaced by a single high-level rule. Furthermore, the user will be presented with less information to digest, and this information will have the same coverage as the low-level alternatives.

There are several ways to realize high-level data mining. The approach taken by DBLearn's classification module is high-level data mining at a single level of generalization: DBLearn's classification module generalizes values in the domain of an attribute to a single level per attribute. Alternatively, high-level data mining can be performed at multiple levels of generalization, with no restrictions placed on the levels at which rules must contain attribute-value pairs. To clarify the meaning of multi-level data mining, consider Figure 2 and the rules in Figure 3. These findings are at varying levels of generality, and they are based on abstractions over a single attribute: Soda occurs on level 1, orange juice occurs on level 2, and dairy beverages occurs on level 1.

    Customer  Item 1      Item 2            Class
    1         Diet Coke   Ranch Dorito      Young
    2         AB OJ       CD Raisin Cereal  Old
    3         Reg Coke    Reg Dorito        Young
    4         Reg Coke    SB Chips          Mid
    5         Diet Coke   Nacho Dorito      Young
    6         Diet Pepsi  BBQ Frito         Mid
    7         Reg Pepsi   Reg Frito         Mid
    8         Skim Milk   CD Raisin Cereal  Old
    9         Reg Pepsi   BBQ Frito         Mid
    10        CD OJ       Bread             Old
    11        Reg Pepsi   Popcorn           Young
    12        AB Egg Nog  CD Farm Steak     Old

    Figure 1: Sample customer purchase database

    Beverage
      Soda
        Pepsi: Diet Pepsi, Reg Pepsi
        Coke: Diet Coke, Reg Coke
      Fruit Juice
        Apple Juice
        OJ: AB OJ, CD OJ
      Dairy Beverage
        AB EggNog
        Milk: Skim Milk, Low Milk

    Figure 2: Concept-hierarchy over beverages

    Class Young: Customers purchased sodas.
    Class Old:   Customers purchased orange juice.
                 Customers purchased dairy beverages.

    Figure 3: High-Level Rules
Data mining at multiple levels of generalization can lead to far more interesting information, uncovering relationships between very general concepts and those at lower levels of generality.

5 Multi-Level Classification

The major components of the multi-level classification algorithm are described in this section. The complete algorithm is presented in Figure 4.

    1. Gather frequency counts.
    2. Repeat until all frequency counts are zero:
       (a) Find the attribute-value pair with maximum D(Av).
       (b) Create a ParkaDB query, Q.
       (c) Extend Q until it is satisfied by tuples in one class.
       (d) Reduce the frequency counts of all attribute-value pairs
           occurring in the tuples satisfying Q.
       (e) Transform Q into a classification rule.

    Figure 4: Multi-level classification algorithm

The algorithm is a form of the "generate and test" class of search techniques. Concept hierarchies augmented with frequency counts, together with dependency data, are used to guide the creation of queries that will become classification rules. The algorithm tries to find a set of queries that individually are satisfied by tuples in just one class and collectively describe the differences between the classes.

Ontologies play an important role in the multi-level classification process described in this paper: they contain the concepts that the multi-level classification algorithm uses to induce its rules. Currently, we are only considering ontologies similar to the concept hierarchy illustrated in Figure 2. These ontologies are domain-specific tree structures representing hierarchical groupings of values occurring in a database. An interior node represents a generalization of concepts at lower levels, thus providing a mechanism for data mining at high levels of abstraction, provided there exists a query engine powerful enough to generalize data from databases to concepts in the taxonomies on demand.
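As a toy illustration of the control flow in Figure 4, the following runnable Python sketch applies the loop to a flat slice of Figure 1. It is a deliberate simplification: step 2a's D(Av) scoring is replaced by a raw frequency count, and step 2c's query extension is stubbed out (a seed value that spans several classes is simply discarded rather than extended with further constraints). All helper logic is ours, not ParkaDB's.

```python
# Simplified, runnable rendering of the Figure 4 loop (helper logic is ours).
from collections import Counter

DATA = [  # (Item1 value, Class): a flat slice of Figure 1
    ("Diet Coke", "Young"), ("Reg Coke", "Young"), ("Reg Coke", "Mid"),
    ("Diet Pepsi", "Mid"), ("Skim Milk", "Old"), ("AB OJ", "Old"),
]

def classify(data):
    rules, remaining = [], list(data)
    counts = Counter(v for v, _ in remaining)      # step 1
    while any(counts.values()):                    # step 2
        seed = max(counts, key=counts.get)         # step 2a (raw count, not D(Av))
        classes = {c for v, c in remaining if v == seed}
        if len(classes) == 1:                      # step 2c satisfied immediately
            rules.append((seed, classes.pop()))    # step 2e
        # step 2d: covered (or discarded) tuples no longer contribute counts
        remaining = [t for t in remaining if t[0] != seed]
        counts = Counter(v for v, _ in remaining)
    return rules

rules = classify(DATA)
```

On this slice, "Reg Coke" spans two classes and yields no rule, while each remaining value maps cleanly to a single class.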
The classification algorithm uses frequency counts from a database when determining which concepts from the concept hierarchies to include in a rule. In step 1 (Figure 4: gather frequency counts), nodes in the concept hierarchies are augmented with frequency counts gathered from the database being analyzed. The frequency counts for low-level concepts (i.e., leaves in the concept hierarchies) are gathered directly from the databases, whereas the frequency counts for high-level concepts are accumulated from concepts at lower levels. Figure 5 is an extension of Figure 2 containing frequency counts from Figure 1. The frequency counts are grouped by class to represent the distribution of concepts among the classes. This is denoted by [X,Y,Z], where X represents the frequency with which an attribute-value pair occurs in a tuple that is in the first class, Y the second class, and Z the third class. For example, in Figure 1 there is one tuple, belonging to the "Mid" class, in which "Diet Pepsi" is a value for attribute "Item 1"; this is reflected in Figure 5.

    Beverage [4,4,4]
      Soda [4,4,0]
        Pepsi [1,3,0]: Diet Pepsi [0,1,0], Reg Pepsi [1,2,0]
        Coke [3,1,0]: Diet Coke [2,0,0], Reg Coke [1,1,0]
      Fruit Juice [0,0,2]
        Apple Juice [0,0,0]
        OJ [0,0,2]: AB OJ [0,0,1], CD OJ [0,0,1]
      Dairy Beverage [0,0,2]
        AB EggNog [0,0,1]
        Milk [0,0,1]: Skim Milk [0,0,1], Low Milk [0,0,0]

    Figure 5: Concept-hierarchy over beverages with frequency counts: [young, mid, old]

In step 2a, the frequency counts are used to efficiently find those attribute-value pairs (attribute + concept from the attribute's concept hierarchy) that are strong indicators of class membership. Attribute-value pairs are selected based on an evaluation function D(Av).
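Step 1 can be sketched as a bottom-up accumulation: each leaf occurrence increments its own count vector and the vector of every ancestor reachable through "isa" links. The sketch below, using the soda branch of Figures 1-2, reproduces the bracketed [Young, Mid, Old] vectors shown in Figure 5 (the data structures are our own illustration).

```python
# Bottom-up count gathering (step 1), soda branch of Figures 1-2.
ISA = {"Diet Pepsi": "Pepsi", "Reg Pepsi": "Pepsi",
       "Diet Coke": "Coke", "Reg Coke": "Coke",
       "Pepsi": "Soda", "Coke": "Soda", "Soda": "Beverage"}
CLASSES = ["Young", "Mid", "Old"]

def gather_counts(rows):
    """rows: (value, class) pairs drawn from one attribute's column."""
    counts = {}
    for value, cls in rows:
        i = CLASSES.index(cls)
        node = value
        while node is not None:                  # the leaf plus every ancestor
            counts.setdefault(node, [0, 0, 0])[i] += 1
            node = ISA.get(node)
    return counts

# Item 1 values of the soda-buying tuples in Figure 1.
rows = [("Diet Coke", "Young"), ("Reg Coke", "Young"), ("Reg Coke", "Mid"),
        ("Diet Coke", "Young"), ("Diet Pepsi", "Mid"), ("Reg Pepsi", "Mid"),
        ("Reg Pepsi", "Mid"), ("Reg Pepsi", "Young")]
counts = gather_counts(rows)
```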
The evaluation function is defined as follows:

    f(C, Av)     is the frequency of attribute-value pair Av in class C.

    fr(C, Av) = f(C, Av) / ||C||
                 is the relative frequency of Av in class C. ||C|| is
                 continuously reduced to reflect the number of tuples in
                 class C that have not been covered by a rule.

    m(Av) = max over 1 <= i <= n of fr(Ci, Av)
                 is the maximum relative frequency of Av over all classes.

    d(Av) = sum over i = 1..n of ( 1/(n-1) - fr(Ci, Av) / (m(Av)(n-1)) )
                 is the normalized standard deviation of Av's relative
                 frequency counts.

    D(Av) = 0.9^|Id - Avd| * d(Av)

Ideal Depth: The ideal depth, Id, is the preferred level of generality. If we are interested in rules at the abstraction level of soda, fruit juice, and dairy beverages in Figure 2, we would set Id = 1.

Attribute-Value Depth: The attribute-value depth, Avd, is the depth of Av in its respective concept hierarchy (e.g., the depth of Soda is 1).

The evaluation function D(Av) measures the quality of Av with respect to generating ParkaDB queries that are satisfiable by tuples in just one class. The d(Av) term is used to select those attribute-value pairs that will require the fewest additional constraints to create a ParkaDB query that is satisfied by tuples in a single class. d(Av) ranges from 0, indicating an even distribution, to 1, indicating that Av is unique to tuples in just one class. Attribute-value pairs with d(Av) values close to 1 are strong indicators of class membership; selecting such an attribute-value pair to seed a query should reduce the complexity of the generated queries that describe tuples in a single class. In the case that there exist two or more attribute-value pairs with similar d(Av) values, those that are closer to the ideal depth are preferable. The 0.9^|Id - Avd| term is included in the evaluation function to filter out those attribute-value pairs that have high d(Av) values but are not close to the ideal depth in their respective concept taxonomies. We experimented with several constants before selecting 0.9 for this term.
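The definitions above translate directly into code. The following sketch computes d(Av) and D(Av) for per-class frequency vectors such as those in Figure 5 (the function names are ours).

```python
# Direct implementation of the Section 5 scoring functions (names are ours).

def d_score(freqs, class_sizes):
    """d(Av): 0 for an even distribution across classes,
    1 when Av occurs in only one class."""
    n = len(freqs)
    fr = [f / k for f, k in zip(freqs, class_sizes)]     # fr(Ci, Av)
    m = max(fr)                                          # m(Av)
    if m == 0:
        return 0.0
    return sum(1 / (n - 1) - r / (m * (n - 1)) for r in fr)

def D_score(freqs, class_sizes, depth, ideal_depth):
    """D(Av) = 0.9^|Id - Avd| * d(Av)."""
    return 0.9 ** abs(ideal_depth - depth) * d_score(freqs, class_sizes)

# "Soda" in Figure 5: counts [4,4,0] over three classes of four tuples, depth 1.
soda = D_score([4, 4, 0], [4, 4, 4], depth=1, ideal_depth=1)
```

With perfectly even counts d(Av) vanishes; a pair unique to one class, e.g. counts [2,0,0], scores d(Av) = 1, and the 0.9^|Id - Avd| factor then discounts pairs far from the preferred depth.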
Values less than 0.9 tended to filter out too many attribute-value pairs, and values greater than 0.9 did not filter out enough.

The classification algorithm generates a series of conjunctive queries that are subsequently evaluated by ParkaDB in steps 2b & 2c. Queries are initialized with the attribute-value pair having the highest D(Av) value, selected in step 2a. They are extended by adding attribute-value pairs as constraints until the queries are satisfied by tuples in a single class. Given the tuples in Figure 1, if the attribute-value pair (Item 1, Pepsi) is selected, the algorithm will generate the following ParkaDB query:

    (Item1 ?tid ?item1) (everyInstanceOf ?item1 Pepsi)

Since Pepsi is not exclusive to a single class, the query above would have to be extended. Since there is a strong relationship between Pepsi and Frito (we developed an algorithm that finds these relationships), the above query can be extended to:

    (Item1 ?tid ?item1) (everyInstanceOf ?item1 Pepsi)
    (Item2 ?tid ?item2) (everyInstanceOf ?item2 Frito)

(A concept taxonomy for Frito is represented in Figure 6.) This query requests tuples from the table in Figure 1 in which the values of attribute Item1 are generalizable to "Pepsi" and the values of attribute "Item2" are generalizable to "Frito". Table 1 contains the results of this query.

    Frito: BBQ Frito, Reg Frito

    Figure 6: Concept hierarchy for Frito

    Customer  Item 1      Item 2     Class
    6         Diet Pepsi  BBQ Frito  Mid
    7         Reg Pepsi   Reg Frito  Mid
    9         Reg Pepsi   BBQ Frito  Mid

    Table 1: Query Results

Queries are issued to ParkaDB to evaluate their strengths. The strength of a rule is determined by the distribution of classes to which the tuples satisfying the rule belong. In the above example, one rule for Mid Aged customers would be

    (Item1 ↑ Pepsi) ∧ (Item2 ↑ Frito) → Mid Age Patron

where ↑ denotes "is generalizable to" [11].

Step 2d is a critical step in the algorithm. Its purpose is to gradually prune the search space as the algorithm iterates. The frequency counts for an attribute-value pair (A,v) represent the number of tuples in which attribute "A" has value "v" or a value that is generalizable to "v". The frequency counts for (A,v) are reduced so that they indicate the number of tuples remaining in the database that contain the value "v", or any of its descendants, for attribute "A" and that are not covered by a previously generated rule. In Figure 7 the frequency counts for the "Beverage" taxonomy have been reduced based on the results of the extended query above. The frequency counts for Diet Pepsi and Reg Pepsi were reduced as a result of the finding. Furthermore, the frequency counts for Pepsi, Soda, and Beverage have also been reduced, since they are generalizations of Diet Pepsi and Reg Pepsi. If the frequency counts for an attribute-value pair (A,v) are set to 0, then there does not exist a tuple in the database that has "v", or a value generalizable to "v", for attribute "A" and that is not covered by a previously generated rule. Consequently, neither (A,v) nor any (A,v'), where v' is any value generalizable to v, should be considered for Q in step 2c. This heuristic effectively prunes the search space as classification rules are induced.

    Beverage [4,1,4]
      Soda [4,1,0]
        Pepsi [1,0,0]: Diet Pepsi [0,0,0], Reg Pepsi [1,0,0]
        Coke [3,1,0]: Diet Coke [2,0,0], Reg Coke [1,1,0]
      Fruit Juice [0,0,2]
        Apple Juice [0,0,0]
        OJ [0,0,2]: AB OJ [0,0,1], CD OJ [0,0,1]
      Dairy Beverage [0,0,2]
        AB EggNog [0,0,1]
        Milk [0,0,1]: Skim Milk [0,0,1], Low Milk [0,0,0]

    Figure 7: Concept hierarchy over Beverage with reduced frequency counts

6 Example: Mushroom Database

We ran our algorithm on the "Mushroom Toxicology" database located at the UC Irvine Repository. This database contains 8124 tuples and 22 attributes used to describe the mushrooms. The mushrooms were classified as either edible or poisonous.
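Step 2d can be sketched in the same bottom-up style: for every tuple covered by the new rule, the count vector of the covered value and of all its generalizations is decremented. Starting from the Figure 5 counts for the Pepsi branch and covering tuples 6, 7, and 9 (the Mid-class customers of Table 1) reproduces the reduced counts of Figure 7. The names and structures below are our own illustration.

```python
# Step 2d: decrement counts for covered values and all their generalizations.
ISA = {"Diet Pepsi": "Pepsi", "Reg Pepsi": "Pepsi",
       "Pepsi": "Soda", "Soda": "Beverage"}
CLASSES = ["Young", "Mid", "Old"]

def reduce_counts(counts, covered):
    """covered: (value, class) pairs from the tuples satisfying the new rule."""
    for value, cls in covered:
        i = CLASSES.index(cls)
        node = value
        while node is not None:       # the value itself plus every ancestor
            counts[node][i] -= 1
            node = ISA.get(node)

# Figure 5 counts for the Pepsi branch, before the Pepsi/Frito rule fires.
counts = {"Diet Pepsi": [0, 1, 0], "Reg Pepsi": [1, 2, 0],
          "Pepsi": [1, 3, 0], "Soda": [4, 4, 0], "Beverage": [4, 4, 4]}
# Tuples 6, 7, and 9 (Table 1), all in class Mid, are now covered.
reduce_counts(counts, [("Diet Pepsi", "Mid"), ("Reg Pepsi", "Mid"),
                       ("Reg Pepsi", "Mid")])
```

Once a vector reaches all zeros, the pair and everything generalizable to it drop out of consideration, which is exactly the pruning heuristic described above.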
Simple concept taxonomies were defined for all attributes, excluding the class attribute. Figure 8 depicts the taxonomy defined on the Odor attribute. The algorithm was restricted to use only those attributes that had an average D(Av) of 0.7 or greater for their respective values. The following is a list of some of the rules that the algorithm generated (↑ denotes "is generalizable to"):

    1. (Odor ↑ None) ∧ (Gill Size ↑ Broad) → edible              0.98
    2. (Odor ↑ Bad) → poisonous                                  1.00
    3. (Spore Print Color ↑ Dark) ∧ (Odor ↑ Pleasant) → edible   1.00
    4. (Odor ↑ Spicy) → poisonous                                1.00

Rule 2 illustrates the potential of high-level data mining. It subsumes several low-level rules, thus reducing the amount of data that is produced, and it provides a more intuitive description of a subset of the data set.

    Bad: Creosote, Fishy, Foul, Pungent
    Pleasant: Almond, Anise
    Spicy

    Figure 8: Concept taxonomy for the Odor attribute

Without concept taxonomies defined on the Odor attribute, rule 2 would be replaced either by a single rule containing a disjunction of all descendants of "Bad" in the antecedent, or by four rules, each containing a single descendant of "Bad" in its antecedent. Whereas these low-level alternatives to rule 2 do in fact describe a subset of the data, they fail to describe the more general phenomenon occurring in the data: mushrooms with a "Bad" odor are poisonous. It is commonly known that "creosote", "fishy", "foul", and "pungent" are all unpleasant odors; therefore, it may be quite simple to conclude rule 2 from its low-level alternatives. However, in more complex domains it may be much more difficult to make such conclusions. The concept taxonomy defined on the Odor attribute provides a mechanism for encoding the hierarchical relationship between "Bad" and "creosote", "fishy", "foul", and "pungent" such that high-level rules can be automatically induced, if they exist.

7 Related Work

Data mining with background knowledge has been extensively studied in the past.
Background knowledge has been represented as rules, domain constraints, taxonomies, and, more recently, as full ontologies. This information has been used in many different ways; here we consider only systems that use taxonomic background knowledge. Walker [10] was the first to use concept taxonomies; the taxonomies were used to replace values by more general values. Han et al. [4] proposed a similar approach to find characterization and classification rules at high levels of generality. Later, Dhar and Tuzhilin [12] proposed a generalization of the techniques of Walker and Han. Their approach used database views and concept hierarchies to represent background knowledge and could induce a broader range of interesting patterns.

All of the above-mentioned approaches were based on traditional relational database technology. To use background knowledge, the databases had to be transformed into a generalized table before the respective discovery algorithms could be invoked, an approach that may lead to over-generalization. The databases had to be pre-generalized because the RDBMSs used did not support arbitrary generalization at "query time". Our approach avoids pre-generalization by issuing queries with high-level concepts to which ParkaDB can efficiently generalize low-level concepts. This enables us to dynamically generalize data as necessary and to minimize over-generalization.

8 Conclusion and Future Work

In this paper, we have presented the core algorithm of a high-level data mining system which induces classification rules at multiple levels of generality. The algorithm is based on concept hierarchies augmented with frequency counts, to guide the classification process and to prune the rule space; dependency data, to create interesting queries; and a high-performance parallelizable query engine, to evaluate queries. We are using ParkaDB to integrate ontologies and databases. Such an integration is crucial to effectively use ontologies (concept hierarchies) within data mining systems.
This tool simplifies data mining with background knowledge by allowing databases to be queried at high levels of abstraction [13]. The algorithm presented in this paper is the major component of a data mining system that will be used at Johns Hopkins Hospital. The system will be used to find the conditions under which patients had positive responses to platelet transfusions, as well as negative responses.

References

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In 1993 International Conference on Management of Data (SIGMOD 93), pages 207-216, May 1993.

[2] Ira J. Haimowitz. Knowledge-based Trend Detection and Diagnosis. PhD thesis, Massachusetts Institute of Technology, June 1994. MIT/LCS/TR-620.

[3] J. Ross Quinlan. Induction of decision trees. In Readings in Machine Learning, pages 81-106. Morgan Kaufmann, 1990.

[4] Yandong Cai, Nick Cercone, and Jiawei Han. Attribute-oriented induction in relational databases. In Knowledge Discovery in Databases. AAAI/MIT Press, 1991.

[5] UMLS. Unified Medical Language System. National Library of Medicine, 1994.

[6] George A. Miller. Human language technology. Technical report, Psychology Department, Green Hall, Princeton University, 1996.

[7] James Hendler, Kilian Stoffel, and Merwyn Taylor. Advances in high performance knowledge representation. Technical Report CS-TR-3672, University of Maryland at College Park, August 1996.

[8] Kilian Stoffel, Merwyn Taylor, and James Hendler. Efficient management of very large ontologies. In AAAI-97 Proceedings, 1997.

[9] John M. Aronis and Foster J. Provost. Efficiently constructing relational features from background knowledge for inductive machine learning. In Proceedings: AAAI-94 Workshop on Knowledge Discovery in Databases, March 1996.

[10] A. Walker. On retrieval from a small version of a large database. In VLDB Conference, 1980.

[11] Kilian Stoffel, James Hendler, and Merwyn Taylor. Induction of hierarchical dependencies from relational databases. Technical Report CS-TR-3757, University of Maryland at College Park, January 1997.

[12] Vasant Dhar and Alexander Tuzhilin. Abstract-driven pattern discovery in databases. IEEE Transactions on Knowledge and Data Engineering, 5(6), December 1993.

[13] Merwyn Taylor. Hybrid knowledge- and databases. In Thirteenth National Conference on Artificial Intelligence, volume 2, page 1411. AAAI/MIT Press, 1996.