Download ARMiner - Journal of Computer Science and Technology

Vol.17 No.5    J. Comput. Sci. & Technol.    Sept. 2002

ARMiner: A Data Mining Tool Based on Association Rules

ZHOU Haofeng, ZHU Jianqiu, ZHU Yangyong and SHI Baile

Department of Computing and Information Technology, Fudan University, Shanghai 200433, P.R. China
E-mail: [email protected]
Received May 14, 2001; revised September 26, 2001.

Abstract  This paper introduces ARMiner, a data mining tool based on association rules. Beginning with the system architecture, it discusses the characteristics and functions of the tool in detail, including data transfer, concept hierarchy generalization, mining rules with negative items and the re-development of the system. An example of the tool's application is also shown. Finally, some issues for future research are presented.

Keywords  association rule, negative item, interestingness, concept hierarchy

1 Introduction

Data mining technology has attracted many researchers and organizations because of its promising application prospects[1]. As a result of this research, a large number of applications have emerged and many prototypes have been produced, such as KEFIR from GTE and IMACS from AT&T. Some systems, such as Intelligent Miner from IBM[2], DBMiner from Simon Fraser University[3] and Knight from Nanjing University[4], have been used successfully in domains like finance and commerce. Representing the achievements of current data mining technology, these systems draw on research in databases, expert systems, machine learning and statistics. A few of them have been put into practice in business fields. After developing AMINER[5], a data mining tool which adopts various kinds of data mining technologies, we have successfully constructed ARMiner, a data mining system based on association rules and a component of AMINER, by combining commercial requirements with research on association rules.
The goal of ARMiner is to provide data mining tools for intelligent POS systems and to support decision-making in data warehousing. ARMiner is not designed for one particular kind of application. By permitting outside modification of its domain knowledge during the process of data mining, ARMiner gains a degree of flexibility. Another advantage of ARMiner is that an interestingness measure is introduced in the system as a new evaluation criterion to filter out useless and uninteresting association rules. As a result, an improved algorithm is obtained that preserves the semantics implied in the association rules entirely. Moreover, ARMiner provides mining algorithms and preprocessing API functions for its re-development. These API functions can be seamlessly integrated with many development environments, thus facilitating the deployment of ARMiner.

This paper is supported by the Key Program of the National Natural Science Foundation of China (Grant No.69933010) and the National "863" High-Tech Programme of China (Grant No.863-306-ZT02-05-1).

2 Overview of the System

Currently, there are two major kinds of architectures for data mining. One is process-oriented. Examples are the two multi-phase processing models proposed by Usama M. Fayyad[6] and George H. John[7], respectively. The other focuses on the user and applications, such as the user-oriented processing model proposed by Brachman[8]. Other models can be classified into these two categories, such as the three-tiered architecture from IBM[9] and the Knight architecture from Nanjing University[4]. The architecture of ARMiner has the characteristics of both kinds. Based on the real applications, functional requirements and implementations of data mining tools, we adjust it to make ARMiner suitable for both the Client/Server architecture and the Browser/Web Server architecture.
ARMiner consists of five components: a basic technology module, a presentation module, an algorithm module, a data source module and an instruction center of data mining. The whole structure is shown in Fig.1.

Fig.1. The architecture of ARMiner.

Basic technology module: It refers to the software and hardware environment in which a system is developed, such as a server, a network, a platform and a tool. As the physical foundation of the implementation, it determines the efficiency of the final system.

Presentation module: It refers to the operating interface for users. It can be direct, like the operating interface of the Client/Server architecture, or indirect (for example, the electronic mail which users use to deliver data mining requests and receive the results of data mining).

Algorithm module: It is the core of the system. Guided by the instruction center of data mining, it selects suitable algorithms to mine the clean data prepared from the data sources, applies techniques like indexing, parallel computing and branch pruning to improve efficiency, and sends the mining result to the presentation module.

Data source module: It prepares data for mining by transforming the raw data extracted in various ways from different sources, such as relational databases, multidimensional databases, data warehouses and even flat files. The raw data can be extracted through gateways such as ODBC, by connecting to databases in special ways, or by analyzing the data from the data warehouse and other data sources. A large data set can be sampled to reduce the amount of data to be processed. Then, the dirty data are removed by cleaning (for example, the raw data are generalized according to the concept hierarchies) and the remainder is integrated for mining. In this way, every mining algorithm uses the same interface to access data without being aware of the underlying data sources.
Instruction center of data mining: As the headquarters of the system, the instruction center directs three modules, namely presentation, algorithm and data source, to run properly. The presentation form base stores the definitions of the forms in which the output is presented to end users, such as natural language, graphics and grids. The knowledge/algorithm base is used to control the management and execution of algorithms, for example, to adjust the evaluation system and to choose a suitable mining algorithm to accelerate computation. The data preparation method base provides the methods used by the data source module during data transfer, such as data transformation and concept hierarchy analysis. The instruction center can be run in two modes: an automatic mode and a manual mode. The latter is reserved for manually controlling the mining process. By allowing users to adjust the settings in the instruction center to achieve good mining performance, the system is flexible and general-purpose.

The horizontal view of Fig.1 reflects the processing phases: data preparation, data mining and result presentation. The vertical view shows the application together with its physical foundation and the reserved interface for manual control. Therefore, the characteristics of the two kinds of architecture described above are well combined in ARMiner.

3 The Role of ARMiner as a System

3.1 Functions of ARMiner as a Mining Tool

As a mining tool, ARMiner provides functions such as data preparation and association rule mining.

3.1.1 Data Preparation of ARMiner

Data preparation includes data transfer and concept hierarchy generalization. Its task is to transform raw transaction data into the required structure and transfer the transformed data to the mining database of ARMiner. The ultimate goal of data transfer is to move the data from the sources to the mining database.
In general, the data used by rule mining are transaction data, i.e., data with the structure (TID, ItemID). Furthermore, to make it easier to display rules, the descriptive information of items, such as item names, should be transferred too. Therefore, the data to be transformed by ARMiner include the transaction set and the item set. When the transaction set is transformed, if there is no unique field serving as the primary key of the records in the original set, a new primary key is generated to substitute for the old one. Then, the transaction data are transferred accordingly. This is called transformation-transfer, and its counterpart is simple transfer. The proper transfer method is chosen automatically according to the transfer rules provided. During the process of transferring data, field types are matched and, if necessary, converted according to the system requirements. Data transfer provides ARMiner with the proper data by importing the source data into the mining database.

The original transaction data often contain a large amount of detailed data, from which much useless knowledge may be discovered, and which cannot reflect the abstract hierarchies of the real world. Therefore, it is important to generalize the original data after data transfer is finished, and the concept of domain knowledge is introduced for this purpose. Many researchers have studied it and some algorithms have been proposed[10-12]. However, these algorithms are usually bound to mining algorithms, that is, the concept hierarchies are taken into account at the stage of generating large item sets. This causes a problem: when the domain knowledge is not used, an algorithm of this kind incurs unnecessary cost. Hence, in ARMiner, we make concept hierarchy generalization independent by separating it from the process of mining, so that it can be regarded as a part of data preprocessing. According to users' requirements, we process the raw data and convert them to more abstract levels.
From the generalized data set, we then remove the redundancy and put the clean data into the mining database. The whole process is illustrated in Fig.2. The display of concept hierarchies uses a tree to show the definitions of the generalization levels. As an interactive process, generalization hierarchy selection allows users to choose the proper generalization levels. After the selection is finished, it delivers the generalization requirements to the data generalization module. The module of concept hierarchy input and data generalization converts the hierarchies stored in tables into program data structures, transforms the raw data to the chosen levels, and then stores the processed data in the mining database.

Fig.2. Data generalization process.

To define the generalization hierarchy table, we use a recursive table whose structure is (ItemID, ItemName, SuperiorItem). An array is used to implement the concept hierarchy. Each element of the array has the form shown below.

    typedef struct {
        CHAR   strItem[56];     // the ID of the concept hierarchy
        String strSuperItem;    // strItem's name
        long   iSuperItem;      // strItem's parent;
                                // if it is less than zero, it is the root
    } SBTreeItem;
    SBTreeItem *m_sbtItems = new SBTreeItem[iCount];    // level number

The tree nodes are scanned frequently during concept hierarchy generalization, and this search affects the efficiency of the whole program. Therefore, we apply a binary search algorithm to improve the performance. The array elements must be sorted by strItem before this algorithm is used. Given a node, it is necessary to search for its level and its superior node. After the user has chosen a desired generalization level in the hierarchy for each item, these requirements must be passed to the algorithm in a simple way.
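The sorted-array lookup and the level-based generalization described above can be sketched roughly as follows. This is a minimal illustration, not the ARMiner implementation: the names `HierNode`, `findNode`, `levelOf`, `generalize` and `generalizeTransaction` are hypothetical, the parent is stored by name rather than by array index, an empty parent (instead of a negative `iSuperItem`) marks a root, and the hierarchy is assumed to be acyclic.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One node of the concept hierarchy (hypothetical layout).
// An empty `parent` marks a root.
struct HierNode {
    std::string item;
    std::string parent;
};

// Sort by item name so that lookups can use binary search.
void sortHierarchy(std::vector<HierNode>& nodes) {
    std::sort(nodes.begin(), nodes.end(),
              [](const HierNode& a, const HierNode& b) { return a.item < b.item; });
}

// Binary search over the sorted array; returns nullptr for an unknown item.
const HierNode* findNode(const std::vector<HierNode>& sorted, const std::string& item) {
    auto it = std::lower_bound(
        sorted.begin(), sorted.end(), item,
        [](const HierNode& n, const std::string& key) { return n.item < key; });
    return (it != sorted.end() && it->item == item) ? &*it : nullptr;
}

// Level of an item: roots are level 0, their children level 1, and so on.
// Returns -1 for an unknown item. Assumes the hierarchy has no cycles.
int levelOf(const std::vector<HierNode>& sorted, const std::string& item) {
    const HierNode* n = findNode(sorted, item);
    if (n == nullptr) return -1;
    int level = 0;
    while (!n->parent.empty()) {
        n = findNode(sorted, n->parent);
        assert(n != nullptr && "every parent must exist in the hierarchy");
        ++level;
    }
    return level;
}

// Replace an item by its ancestor at `targetLevel`. Unknown items and items
// already at or above the target level are returned unchanged.
std::string generalize(const std::vector<HierNode>& sorted,
                       const std::string& item, int targetLevel) {
    int level = levelOf(sorted, item);
    const HierNode* n = findNode(sorted, item);
    while (n != nullptr && level > targetLevel) {
        n = findNode(sorted, n->parent);
        --level;
    }
    return n != nullptr ? n->item : item;
}

// Generalize one transaction; the set removes the duplicates that appear
// when several items collapse into the same ancestor.
std::set<std::string> generalizeTransaction(const std::vector<HierNode>& sorted,
                                            const std::vector<std::string>& items,
                                            int targetLevel) {
    std::set<std::string> out;
    for (const std::string& i : items) out.insert(generalize(sorted, i, targetLevel));
    return out;
}
```

With a hierarchy food > fruit > {apple, pear} and food > milk, generalizing the transaction {apple, pear, milk} to level 1 would give {fruit, milk}: both fruits collapse into one generalized item, which is the redundancy removal mentioned in the text.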
For convenience, the generalization algorithm marks the chosen levels directly in the data structure of the original concept hierarchy, by setting the iSuperItem of each selected generalization level to a negative value. After the data are transformed, the modified data structure is restored for the next use.

3.1.2 Mining Association Rules with Negative Items

Association rule mining is the core technique of ARMiner. We not only introduce an interestingness measure as a new kind of evaluation, but also provide an algorithm for association rules with negative items.

There are various definitions of interestingness measures[13-15]. Based on statistics, we define our interestingness measure for association rules of the form X ⇒ Y as follows:

$$ i = \frac{S(XY)}{S(X)\,S(Y)} \qquad (1) $$

where S(X) is the support measure of X in the transaction set. If i, the interestingness measure of a rule, satisfies 0 < i < 1, we consider the rule valueless. However, if i ≥ i_m (i_m is the threshold of the interestingness measure and i_m ≥ 1), it is a valuable rule.

Based on the above definition, the evaluation of association rules involves three measures: support, confidence and interestingness. Their statistical interpretation is as follows: given a rule X ⇒ Y, the interestingness measure reflects how tightly X and Y are connected, and confidence represents the direction of the connection, i.e., from X to Y or from Y to X. Support shows whether this connection is common among the transactions.

Furthermore, by admitting negative items, we modified the definition of association rules and proposed the concept of the negative itemset[16]. To obtain the support measure of a negative item set, we devise a new algorithm, which computes it from the support measures of the positive literal sets without rescanning the database. Then, a mining algorithm is devised for association rules with negative items.
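To make the three measures concrete, the sketch below evaluates a rule from supports that are already known. It is an illustration under stated assumptions, not the algorithm of [16]: the names `RuleMeasures`, `measure`, `measureNegated`, `Thresholds` and `isValuable` are hypothetical, Y is taken to be a single item, and the support of "X without Y" is derived from the elementary identities S(X, not Y) = S(X) - S(XY) and S(not Y) = 1 - S(Y), which need no further database scan.

```cpp
#include <cassert>

// All supports are fractions of the transaction set, in [0, 1].
struct RuleMeasures {
    double support;          // S(XY)
    double confidence;       // S(XY) / S(X)
    double interestingness;  // S(XY) / (S(X) * S(Y)), as in Eq. (1)
};

// Measures of the rule X => Y from the supports of X, Y and XY.
RuleMeasures measure(double sX, double sY, double sXY) {
    assert(sX > 0.0 && sY > 0.0);
    return {sXY, sXY / sX, sXY / (sX * sY)};
}

// Measures of the rule X => (not Y) for a single item Y, computed only from
// the positive supports (assumed identities, see the lead-in):
//   S(X, not Y) = S(X) - S(XY)   and   S(not Y) = 1 - S(Y).
RuleMeasures measureNegated(double sX, double sY, double sXY) {
    assert(sY < 1.0);
    return measure(sX, 1.0 - sY, sX - sXY);
}

struct Thresholds {
    double minSupport;
    double minConfidence;
    double minInterest;  // i_m in the text, expected to be >= 1
};

// A rule is kept only if it passes all three thresholds.
bool isValuable(const RuleMeasures& m, const Thresholds& t) {
    return m.support >= t.minSupport &&
           m.confidence >= t.minConfidence &&
           m.interestingness >= t.minInterest;
}
```

For example, with S(X) = 0.5, S(Y) = 0.5 and S(XY) = 0.125, the positive rule has interestingness 0.5 and is discarded, while the negated rule has support 0.375, confidence 0.75 and interestingness 1.5, so it passes thresholds such as (0.005, 0.06, 1.15).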
If the generated rules are not interesting to users, the mining algorithm for negative items is able to discover other (perhaps more interesting) rules by introducing negative item sets automatically[16]. With these definitions and algorithms, we can discover association rules whose semantics are more complete, such as `coffee ⇒ milk'.

3.2 Re-Development Ability

During the development of ARMiner, we encapsulated the system functions into several API functions for re-development. These functions are classified into four categories: the mining algorithm functions, the rule generating functions, the data transformation functions and the concept hierarchy constructing functions.

The process of association rule mining can be divided into two phases: large-set generation and rule generation. The former is carried out by the mining algorithm functions, whose interface is open, so any algorithm that conforms to the interface definition can be used. ARMiner currently provides the Apriori, AprioriTID[17] and DHP[18] algorithms, and other algorithms with better performance can also be plugged in. The second phase is implemented by the rule generating functions, which include the original rule generation function described by Agrawal and the new one described above. The user can choose either of them through an option. During the transfer, the data transformation functions use the transformation rules to choose automatically between transformation-transfer and simple transfer, and adjust the attributes of each field.

The concept hierarchy constructing functions are used to analyze the original data and construct hierarchies. To permit manual intervention in the process of mining, we set up an independent operating interface for constructing hierarchies. Through the interface shown in Fig.3, users can freely select the wanted items in the displayed tree.

Fig.3. The concept hierarchy.
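The idea of an open interface for the large-set generation phase can be illustrated with the sketch below. Everything in it is an assumption made for illustration: the paper does not give the actual API signatures, so `LargeItemsetMiner`, `LevelWiseMiner`, `supportOf` and the type aliases are hypothetical names. The concrete miner is a simplified level-wise search in the spirit of Apriori, without Apriori's subset-pruning step, and it rescans the transactions for every candidate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Itemset = std::set<std::string>;
using Transactions = std::vector<Itemset>;
using SupportMap = std::map<Itemset, double>;  // large itemset -> support

// Hypothetical open interface for phase 1 (large-set generation).
// Any algorithm that implements it can be plugged into phase 2.
struct LargeItemsetMiner {
    virtual ~LargeItemsetMiner() = default;
    virtual SupportMap mine(const Transactions& db, double minSupport) const = 0;
};

// Fraction of transactions that contain every item of `s`.
double supportOf(const Itemset& s, const Transactions& db) {
    if (db.empty()) return 0.0;
    std::size_t hits = 0;
    for (const Itemset& t : db)
        if (std::includes(t.begin(), t.end(), s.begin(), s.end())) ++hits;
    return static_cast<double>(hits) / static_cast<double>(db.size());
}

// Simplified level-wise miner: candidates of size k+1 are built by joining
// two large itemsets of size k that differ in exactly one item.
struct LevelWiseMiner : LargeItemsetMiner {
    SupportMap mine(const Transactions& db, double minSupport) const override {
        assert(minSupport > 0.0);
        SupportMap large;
        std::set<Itemset> level;  // candidates of the current size
        for (const Itemset& t : db)
            for (const std::string& item : t) level.insert(Itemset{item});
        while (!level.empty()) {
            std::set<Itemset> kept;
            for (const Itemset& c : level) {
                double s = supportOf(c, db);
                if (s >= minSupport) {
                    large[c] = s;
                    kept.insert(c);
                }
            }
            std::set<Itemset> candidates;
            for (auto a = kept.begin(); a != kept.end(); ++a)
                for (auto b = std::next(a); b != kept.end(); ++b) {
                    Itemset u = *a;
                    u.insert(b->begin(), b->end());
                    if (u.size() == a->size() + 1) candidates.insert(u);
                }
            level = std::move(candidates);
        }
        return large;
    }
};
```

A rule generating function for phase 2 would then only need the returned `SupportMap`, which is what makes the mining algorithm replaceable behind the interface.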
Implemented as a dynamic link library (DLL), the API functions can be seamlessly integrated with a few development environments such as VC and VB. Therefore, the deployment of the system is greatly eased. We can use the API functions to enhance various kinds of existing information systems by adding decision support to them. Furthermore, these functions can be used to support the development of relevant software for e-commerce sites. Meanwhile, the interface definitions of the API functions are the foundation of system extension. Following these considerations, more functions, such as mining algorithms with better performance and more powerful guidance based on more domain knowledge, will be added to ARMiner.

4 Applications of ARMiner

The target database is a supermarket database where every item belongs to a definite category; for example, apple belongs to fruit, fruit belongs to food, and so on. Due to hardware limitations (PIII450, 128M RAM, Windows NT), we select only about 7,200 transactions (about 30,000 records in a month) from this database. First, we process these transaction data using the domain knowledge implied in the concept hierarchy shown in Fig.3. The hierarchy is constructed from the original data, the generalization items are selected, and then the data are generalized. We mine the database using the interestingness measure and association rules with negative items. With a support threshold of 0.005 and confidence and interestingness thresholds of 0.06 and 1.15 respectively, we get the results presented in Fig.4.

Fig.4. The ARMiner main interface.

As shown in this figure, negative items are marked with `( )'. If we had used only the three threshold values to filter rules without considering negative items, some rules, such as "fast noodle ⇒ ( ) groceries", would not have been discovered.
Rules of this kind appear because the corresponding normal ones, such as fast noodle ⇒ ( ) groceries, have interestingness less than 1. If interestingness alone were used to filter rules, we would lose normal rules of this kind, not to mention the ones with negative items. From the results, we detect some peculiar rules, such as `sanitary napkin ⇒ lacto-drink', which is as odd as the classic example of `diaper ⇒ beer'. Besides, the rule `fast noodle ⇒ fruit' is understandable, since the former food suits a busy life and the latter a leisurely one.

Fig.5. Rules with negative items generating in ARMiner.

In the experiment, we consider two cases: using the concept hierarchy or not. Fig.5 shows the rules obtained in these two cases with various support thresholds. The thresholds of confidence and interestingness are 0 and 1 respectively. As the figure shows, more rules are found when the concept hierarchy is considered than when it is not. Both Figs.4 and 5 show one advantage of ARMiner over other systems: it can generate rules with negative items, especially when the concept hierarchy is considered. As to the efficiency and stability of ARMiner, on the hardware platform mentioned above, the system appears to become unresponsive when the support threshold is below 0.002.

This application also shows how the API functions of ARMiner are used. The interface is built with DELPHI, and it calls the API functions, which are provided in DLL form, to do its work. Using the same method, we also added a decision support module to enhance the decision-making function of an existing business information system. Fig.6 shows part of the code written during the development of this application.

Fig.6. A module using the API.

Both applications thus demonstrate the re-development ability of ARMiner.
5 Comparison with Other Systems

Because ARMiner is a mining system based on association rules, while other systems are usually integrated systems adopting many techniques such as classification and aggregation, we compare only their modules for association rules. Limited by the length of the paper, we consider only the following representative systems: IBM Intelligent Miner, DBMiner from Simon Fraser University and Knight from Nanjing University. The first is a commercial system and the other two come from the research area. It should be pointed out that the main features of ARMiner lie in its functions, not its performance, so we do not compare performance with the other systems.

Intelligent Miner is an integrated tool set based on DB2 and provides full-scale decision support. Its mining module for association rules strictly conforms to the conventional definition of association rules and evaluation systems. Although the problem of association rule generalization is considered and a set of API functions is provided, this module is tightly bound to its foundation, DB2. Owing to IBM's great influence, its applications are widely deployed. ARMiner, however, is independent of any database platform and can run on many database platforms through ODBC. Association rule generalization is also taken into account. Moreover, an interestingness measure is introduced as a new evaluation criterion, and ARMiner is able to discover rules with negative items. Similarly, its API functions do not rely on any particular database.

DBMiner is a system for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, and mining association rules is one of its main functions. This function is based on the multidimensional data cube computed in the preparation phase.
DBMiner can perform interactive rule mining at multiple concept levels using an SQL-like Data Mining Query Language (DMQL) and a graphical user interface, and can generate different forms of output. However, the generation of rules with negative items and the API functions are features that distinguish ARMiner from DBMiner. Both systems can be connected to various databases using ODBC. ARMiner also handles multiple-level knowledge through the concept hierarchy. The only disadvantages of ARMiner compared with DBMiner are the lack of a data mining query language and the plain presentation of the results. But as a prototype, ARMiner is a successful one.

Knight is a general-purpose mining tool, which uses ODBC and special database interfaces to achieve platform transparency. It imports domain knowledge by guiding the knowledge discovery with a syntax tree. Capable of four types of mining, it has been put into use in the insurance domain. Similarly, ARMiner accesses databases through ODBC and introduces domain knowledge into the system through concept hierarchies. The mining algorithms and data preprocessing functions are encapsulated in dynamic link libraries. Therefore, the API functions are available for re-development as soon as the system development is finished. ARMiner and its API functions have been used in several application areas.

Besides these mining systems, there are other data mining systems. On the whole, however, they do not use an interestingness measure for association rules and cannot generate rules with negative items. Most of them are monolithic and provide no API functions for re-development. Even where some systems provide such functions, re-development is still hindered because the offered API functions rely heavily on the given platforms. ARMiner does not have these shortcomings. Compared with those systems, ARMiner is more flexible and adaptable.
6 Conclusion

The technology of data mining emerged to meet the requirements of actual practice, and its implementation is helpful to decision-making. Some new efforts have been made in ARMiner. It is not limited to a particular field, and it can use domain knowledge through the concept hierarchy, which shows its flexibility. It also introduces the interestingness measure and an algorithm that mines rules with negative items, making the semantics of the association rules more complete. In addition, it provides API functions for re-development, which leads to more applications.

Some research problems remain for future work. First, the algorithm implementations need to be extended beyond rule mining so that the system offers complete functionality. Second, incremental computation needs to be considered. Finally, more presentation forms should be added.

References

[1] Chen M-S, Han J, Yu P S. Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 866–883.
[2] Agrawal R, Mehta M, Shafer J C et al. The quest data mining system. In Proc. Knowledge Discovery and Data Mining, Portland, Oregon, 1996, pp.244–249.
[3] Han J, Fu Y, Wang W et al. DBMiner: A system for mining knowledge in large relational databases. In Proc. Knowledge Discovery and Data Mining, Portland, Oregon, 1996, pp.250–255.
[4] Chen D, Xu J. Knight: A general purpose data mining system. J. Computer Research & Development, 1998, 35(4): 338–343.
[5] Zhu Y, Zhou X, Shi B. Rule-based data mining tool kit: AMiner. Communication of High Technology, 2000, 10(3): 19–22.
[6] Fayyad U M, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, pp.1–34.
[7] John G H. Enhancements to the data mining process [Dissertation].
Department of Computer Science, School of Engineering, Stanford University, 1997.
[8] Brachman R J. The process of knowledge discovery in database. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, pp.37–57.
[9] Data mining: Extending the information warehouse framework. http://www.almaden.ibm.com/cs/quest/paper/whitepaper.html.
[10] Cheng J, Shi P. Fast mining multiple-level association rules. Chinese J. Computers, 1998, 21(11): 1037–1041.
[11] Han J, Fu Y. Discovery of multiple-level association rules from large databases. In Proc. Very Large Data Bases, Zurich, Switzerland, 1995, pp.420–431.
[12] Srikant R, Agrawal R. Mining generalized association rules. In Proc. Very Large Data Bases, Zurich, Switzerland, 1995, pp.407–419.
[13] Zhou X, Sha C, Zhu Y, Shi B. Interest measure – Another threshold in association rules. J. Computer Research & Development, 2000, 37(5): 627–633.
[14] Brin S, Motwani R, Silverstein C. Beyond market baskets: Generalizing association rules to correlations. In Proc. of ACM SIGMOD, Tucson, Arizona, USA, 1997, pp.265–276.
[15] Savasere A, Omiecinski E, Navathe S B. Mining for strong negative associations in a large database of customer transactions. In Proc. the 14th Int. Conf. Data Engineering, Orlando, Florida, USA, 1998, pp.494–502.
[16] Zhou H, Gao P, Zhu Y. Mining association rules with negative items using interest measure. In Web-Age Information Management, Lecture Notes in Computer Science 1846, Springer-Verlag Publisher, 2000, pp.121–132.
[17] Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. In Proc. Very Large Data Bases, Santiago de Chile, Chile, 1994, pp.487–499.
[18] Park J S, Chen M-S, Yu P S. An effective hash based algorithm for mining association rules. In Proc. ACM SIGMOD, San Jose, California, USA, 1995, pp.175–186.

ZHOU Haofeng was born in 1975. He received his B.E. degree in computer science from Shanghai University in 1997, and his M.S.
degree in computer science from Fudan University in 2000. He is currently a Ph.D. candidate in computer science at Fudan University. His research interests include data mining, database and knowledgebase.

ZHU Jianqiu was born in 1974. He received his B.S. degree in computer science from Harbin University of Science and Technology in 1996, and his M.S. degree in computer science from Fudan University in 1999. He is currently a Ph.D. candidate in computer science at Fudan University. His research interests include data mining, CRM and e-commerce.

ZHU Yangyong was born in 1963. He received his B.S. in mathematics from Xinjiang University in 1984, and his Ph.D. in computer science from Fudan University in 1994. He is now a professor in the Department of Computing and Information Technology in Fudan University. His research interests include data mining and e-commerce.

SHI Baile was born in 1935. He majored in computing mathematics and graduated from Beijing University in 1957. He is now a professor in the Department of Computing and Information Technology in Fudan University. His research interests include digital library, database and knowledgebase.
