Vol.17 No.5
J. Comput. Sci. & Technol.
Sept. 2002
ARMiner: A Data Mining Tool Based on Association Rules
ZHOU Haofeng, ZHU Jianqiu, ZHU Yangyong and SHI Baile
Department of Computing and Information Technology, Fudan University, Shanghai 200433, P.R. China
E-mail: [email protected]
Received May 14, 2001; revised September 26, 2001.
Abstract
In this paper, ARMiner, a data mining tool based on association rules, is
introduced. Beginning with the system architecture, the characteristics and functions are discussed in detail, including data transfer, concept hierarchy generalization, mining rules with
negative items and the re-development of the system. An example of the tool's application is
also shown. Finally, some issues for future research are presented.
Keywords
association rule, negative item, interestingness, concept hierarchy
1 Introduction
Data mining technology has attracted many researchers and organizations because of its promising
application prospects[1]. This research has produced a large number of applications
and prototypes, such as KEFIR from GTE and IMACS from AT&T.
Some systems, such as Intelligent Miner from IBM[2], DBMiner from Simon Fraser University[3]
and Knight from Nanjing University[4], have been used successfully in domains such as finance
and commerce. Representing the state of the art in data mining technology, these systems
draw on research in databases, expert systems, machine learning and statistics. A few of them have
been put into practice in business fields.
After developing AMINER[5], a data mining tool that adopts various kinds of data mining technologies, we constructed ARMiner, a data mining system based on association rules
and a component of AMINER, by combining commercial requirements with research on association rules.
The goal of ARMiner is to develop data mining tools for intelligent POS systems and to support
decision-making in data warehousing.
ARMiner is not designed for any particular kind of application. By permitting outside
modification of its domain knowledge during the data mining process, ARMiner gains a degree of
flexibility.
Another advantage of ARMiner is that an interestingness measure is introduced in the system as a
new evaluation criterion to filter out useless and uninteresting association rules. As a result, an improved algorithm
is obtained that fully preserves the semantics implied in the association rules.
Moreover, ARMiner provides mining algorithms and preprocessing API functions for re-development. These API functions can be seamlessly integrated with many development environments, thus
facilitating the deployment of ARMiner.
2 Overview of the System
Currently, there are two major kinds of architectures for data mining. One is process-oriented.
Examples are the two multi-phase processing models proposed by Usama M. Fayyad[6] and George H.
John[7], respectively. The other focuses on the user and applications, such as the user-oriented processing model proposed by Brachman[8]. Other models can be classified into these two categories, such
as the three-tiered architecture from IBM[9] and the Knight architecture from Nanjing University[4].
This paper is supported by the Key Program of the National Natural Science Foundation of China (Grant
No.69933010) and the National "863" High-Tech Programme of China (Grant No.863-306-ZT02-05-1).
The architecture of ARMiner has the characteristics of both kinds. According to the real applications, functional requirements and implementations of data mining tools, we adjust it properly to make
ARMiner suitable for both the Client/Server architecture and the Browser/Web Server architecture.
ARMiner consists of five components: a basic technology module, a presentation module, an
algorithm module, a data source module and an instruction center of data mining. The whole structure
is shown in Fig.1.
Fig.1. The architecture of ARMiner.
Basic technology module: It refers to the software and hardware environment in which a system
is developed, such as a server, a network, a platform and a tool. As the physical foundation of
the implementation, it determines the efficiency of the final system.
Presentation module: It refers to the operating interface for users. It can be direct, like the
operating interface of the Client/Server architecture, or indirect (for example, the electronic mail
which users use to deliver data mining requests and accept the result of data mining).
Algorithm module: It is the core of the system. Guided by the instruction center of data mining, it
selects suitable algorithms to mine the clean data prepared from the data sources, applies techniques such as
indexing, parallel computing and branch pruning to improve efficiency, and sends the mining
result to the presentation module.
Data source module: It prepares data for mining by transforming the raw data extracted in various ways from different sources, such as relational databases, multidimensional databases, data
warehouses and even flat files. The raw data can be extracted through gateways such as ODBC, or
by connecting to databases in special ways, or by analyzing the data from the data warehouse and other
data sources. Large data sets can be sampled to reduce the amount of data to be processed. Then,
the dirty data are removed by cleaning (for example, the raw data are generalized according to the
concept hierarchies) and the remainder is integrated for mining. In that way, every mining algorithm
uses the same interface to access data without being aware of the underlying data sources.
Instruction center of data mining: As the headquarters of the system, the instruction center directs
three modules, namely, presentation, algorithm and data source, to run properly. The presentation
form base stores the definitions of the forms in which the output is presented to end users, such as natural
language, graphics, and grids. The knowledge/algorithm base is used to control the management and
execution of algorithms, for example, to adjust the evaluation system and to choose a suitable mining
algorithm to accelerate computation. The data preparation method base provides the methods for
the data source module during data transfer, such as data transformation and concept hierarchy
analysis. The instruction center can run in two modes: an automatic mode and
a manual mode. The latter is reserved for manually controlling the mining process.
By allowing users to adjust the settings in the instruction center to achieve good mining performance, the system is flexible and general-purpose.
The horizontal view of Fig.1 reflects the processing phases: data preparation, data mining and
result presentation. The vertical view shows the application together with its physical foundation and
the reserved interface for manual control. Therefore, the characteristics of the two kinds of
architectures described above are combined in ARMiner.
3 The Role of ARMiner as a System
3.1 Functions of ARMiner as a Mining Tool
As a mining tool, ARMiner provides functions such as data preparation and association rule mining.
3.1.1 Data Preparation of ARMiner
Data preparation includes data transfer and concept hierarchy generalization. Its task is to
transform raw transaction data into the required structure and transfer the transformed data
to the mining database of ARMiner.
The ultimate goal of data transfer is to move the data from the sources to the mining database.
In general, the data used in rule mining are transaction data, i.e., data with the structure (TID,
ItemID). Furthermore, to make the rules easier to display, descriptive information about the items,
such as item names, should be transferred too. Therefore, the data to be transformed by ARMiner
include the transaction set and the item set.
When the transaction set is transformed, if the records in the original set have no unique field that can serve as the primary key,
a new primary key is generated to take its place. Then, the transaction
data are transferred accordingly. This is called transformation-transfer, and its counterpart is simple
transfer. The proper transfer method is chosen automatically once the transfer rules are provided. During
the transfer, field types are matched and, if necessary, converted according to the
system requirements.
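The transformation-transfer step can be illustrated with a minimal sketch. It assumes, purely for illustration, that a source transaction is identified only by a composite of two fields (a store code and a receipt number); the names SourceRow, MiningRow and transformTransfer are hypothetical and are not part of the ARMiner API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical source row: no single field identifies the transaction,
// only the (store, receipt) combination does.
struct SourceRow {
    std::string store;
    std::string receipt;
    std::string itemId;
};

// Target row in the mining database: (TID, ItemID).
struct MiningRow {
    long tid;
    std::string itemId;
};

// Assign a fresh sequential TID to every distinct (store, receipt) pair,
// in order of first appearance, and emit (TID, ItemID) rows.
std::vector<MiningRow> transformTransfer(const std::vector<SourceRow>& rows) {
    std::map<std::pair<std::string, std::string>, long> tidOf;
    std::vector<MiningRow> out;
    out.reserve(rows.size());
    for (const SourceRow& r : rows) {
        auto key = std::make_pair(r.store, r.receipt);
        auto it = tidOf.find(key);
        if (it == tidOf.end()) {
            long next = static_cast<long>(tidOf.size()) + 1;
            it = tidOf.emplace(key, next).first;
        }
        out.push_back(MiningRow{it->second, r.itemId});
    }
    assert(out.size() == rows.size());  // every source row is transferred
    return out;
}
```

A simple transfer, by contrast, would copy an existing unique key into the TID column unchanged.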
Data transfer provides ARMiner with the proper data by importing the source data into the mining
database.
The original transaction data often contain a large amount of detail, from which much useless
knowledge may be discovered, and which cannot reflect the abstraction hierarchies of the real world.
Therefore, it is important to generalize the original data after data transfer is finished. For this purpose, the
concept of domain knowledge is introduced here.
Many researchers have studied this problem and several algorithms have been proposed[10-12]. However, these algorithms are usually bound to mining algorithms, that is, the concept hierarchies are taken into account
at the stage of generating large item sets. This causes a problem: when no domain knowledge
is used, an algorithm of this kind incurs unnecessary cost. Hence, in ARMiner, we
make concept hierarchy generalization independent by separating it from the mining process, so that
it can be regarded as a part of data preprocessing.
According to users' requirements, we process the raw data and convert them to more abstract levels.
Then, we remove the redundancy from the resulting data set and put the clean
data into the mining database. The whole process is illustrated in Fig.2.
The concept hierarchy display uses a tree to show the definitions of the generalization levels. As
an interactive process, generalization hierarchy selection allows users to choose the proper generalization levels. After the selection is complete, it delivers the generalization requirements to the data
generalization module. The module of concept hierarchy input and data generalization converts the
hierarchies stored in tables into program data structures, transforms the raw data to the chosen
levels, and then stores the processed data in the mining database.
Fig.2. Data generalization process.
To define the generalization hierarchy table, we use a recursive table whose structure is in the
form of (ItemID, ItemName, SuperiorItem). An array is used to implement the structure of the concept
hierarchy. Each element of the array is in the form shown below.
typedef struct {
    CHAR   strItem[56];     // the ID of the concept hierarchy
    String strSuperItem;    // strItem's name
    long   iSuperItem;      // strItem's parent;
                            // if it is less than zero, it is the root
} SBTreeItem;
SBTreeItem *m_sbtItems = new SBTreeItem[iCount];  // level number
The tree nodes are scanned frequently during concept hierarchy generalization,
and this search affects the efficiency of the whole program. Therefore, we apply a binary search
algorithm to improve the performance. The array elements must be sorted by strItem before this
algorithm is used. Given a node, it is necessary to search for its level and superior node.
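The lookup described above can be sketched as follows. This is an illustrative simplification, not the ARMiner source: TreeItem is a hypothetical stand-in for SBTreeItem that uses std::string instead of fixed buffers, and it assumes that the parent field holds the array index of the superior node.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for SBTreeItem.
struct TreeItem {
    std::string id;  // item ID; the array is sorted on this field
    long parent;     // index of the superior node, negative for the root
};

// Binary search for an item ID; returns its index, or -1 if it is absent.
long findItem(const std::vector<TreeItem>& items, const std::string& id) {
    assert(std::is_sorted(items.begin(), items.end(),
                          [](const TreeItem& a, const TreeItem& b) { return a.id < b.id; }));
    auto it = std::lower_bound(items.begin(), items.end(), id,
                               [](const TreeItem& a, const std::string& key) { return a.id < key; });
    if (it == items.end() || it->id != id) return -1;
    return static_cast<long>(it - items.begin());
}

// Level of a node: 0 for the root, 1 for its children, and so on.
// Returns -1 if the item is not in the hierarchy.
long levelOf(const std::vector<TreeItem>& items, const std::string& id) {
    long idx = findItem(items, id);
    if (idx < 0) return -1;
    long level = 0;
    while (items[static_cast<std::size_t>(idx)].parent >= 0) {
        idx = items[static_cast<std::size_t>(idx)].parent;
        ++level;
    }
    return level;
}
```

With the array sorted once up front, each lookup costs O(log n) comparisons, and finding a level only follows parent indices without further searching.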
After the user defines a desired generalization level in the hierarchy for each item, the algorithm must
record the user requirements in a simple way. For convenience, our algorithm
directly marks the data structure of the original concept hierarchy by setting the
iSuperItem of each selected generalization level to a negative value. After the data are transformed, the modified data
structure is restored for the next use.
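One possible way to realize this mark, transform and restore cycle is sketched below. The paper does not state how the negative value is chosen or how the original parent is recovered, so the reversible encoding used here (parent p becomes -(p + 2)) is an assumption, as are the names Node, markParent, unmarkParent and generalize.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string id;
    long parent;  // index of the superior node; any negative value stops the climb
};

// Assumed reversible encoding: a parent p >= 0 becomes -(p + 2);
// the root's -1 maps to itself in both directions.
long markParent(long parent) { return -(parent + 2); }
long unmarkParent(long marked) { return -marked - 2; }

using Transaction = std::pair<long, std::size_t>;  // (TID, index of the item's node)

// Mark the chosen levels, replace every item by the first marked ancestor
// (or the root), drop duplicates per transaction, then restore the hierarchy.
std::set<std::pair<long, std::string>> generalize(
        std::vector<Node>& nodes,
        const std::set<std::size_t>& chosenLevels,
        const std::vector<Transaction>& data) {
    for (std::size_t i : chosenLevels) {
        assert(i < nodes.size());
        nodes[i].parent = markParent(nodes[i].parent);
    }
    std::set<std::pair<long, std::string>> out;  // the set removes redundancy
    for (const Transaction& t : data) {
        std::size_t i = t.second;
        assert(i < nodes.size());
        while (nodes[i].parent >= 0) i = static_cast<std::size_t>(nodes[i].parent);
        out.insert(std::make_pair(t.first, nodes[i].id));
    }
    for (std::size_t i : chosenLevels) nodes[i].parent = unmarkParent(nodes[i].parent);
    return out;
}
```

In this sketch an item with no marked ancestor climbs all the way to the root; a real tool might instead leave such items unchanged.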
3.1.2 Mining Association Rules with Negative Items
Association rule mining is the core technique of ARMiner. We not only introduce an interestingness
measure as a new kind of evaluation, but also provide an algorithm for association rules with negative
items.
There are various definitions of interestingness measures[13-15]. Based on statistics, we define our
interestingness measure of association rules of the form $X \Rightarrow Y$ as follows:
$$ i = \frac{S(XY)}{S(X)\,S(Y)} \qquad (1) $$
where $S(X)$ is the support measure of $X$ in the transaction set. If $i$, the interestingness measure of a
rule, satisfies the condition $0 < i < 1$, we consider this rule valueless. However, if $i \ge i_m$ ($i_m$ is
the threshold of the interestingness measure and $i_m \ge 1$), it is a valuable rule.
Based on the above definition, the evaluation of association rules involves three measures:
support, confidence and interestingness. Their statistical interpretation is as follows: given a rule $X \Rightarrow Y$, the
interestingness measure reflects the tightness of the connection between $X$ and $Y$, and confidence
represents the direction of the connection, i.e., from $X$ to $Y$ or from $Y$ to $X$. Support shows
whether this connection is common among the transactions.
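As a minimal sketch of how the three measures might be computed and applied, the code below counts supports by scanning a small in-memory transaction list. Confidence is taken as S(XY)/S(X), the usual definition, which the text does not spell out; the names support, evaluate, passes and RuleMeasures are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using Itemset = std::set<std::string>;

// Fraction of transactions that contain every item of x.
double support(const std::vector<Itemset>& db, const Itemset& x) {
    assert(!db.empty());
    long hits = 0;
    for (const Itemset& t : db)
        if (std::includes(t.begin(), t.end(), x.begin(), x.end())) ++hits;
    return static_cast<double>(hits) / static_cast<double>(db.size());
}

struct RuleMeasures {
    double support;          // S(XY)
    double confidence;       // S(XY) / S(X), assumed standard definition
    double interestingness;  // S(XY) / (S(X) S(Y)), Equation (1)
};

// Evaluate the rule X => Y against the transaction list.
RuleMeasures evaluate(const std::vector<Itemset>& db, const Itemset& x, const Itemset& y) {
    Itemset xy = x;
    xy.insert(y.begin(), y.end());
    double sx = support(db, x);
    double sy = support(db, y);
    double sxy = support(db, xy);
    assert(sx > 0.0 && sy > 0.0);  // both sides must occur at least once
    return {sxy, sxy / sx, sxy / (sx * sy)};
}

// A rule is kept only if it reaches all three thresholds.
bool passes(const RuleMeasures& m, double minSup, double minConf, double minInterest) {
    return m.support >= minSup && m.confidence >= minConf &&
           m.interestingness >= minInterest;
}
```

For example, if X and Y each appear in half of the transactions and always together, then S(XY) = 0.5 and i = 0.5 / (0.5 × 0.5) = 2, well above a threshold of 1; if they were independent, i would be 1.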
Furthermore, by admitting negative items, we modified the definition of association rules and
proposed the concept of the negative item set[16]. To obtain the support measure of a negative item set,
we devise a new algorithm, which computes it from the support measures of the
positive literal sets without rescanning the database.
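The algorithm of [16] is not reproduced in this paper. As an illustration of why no rescan is needed, the sketch below uses the standard inclusion-exclusion identity S(P ∪ {¬n}) = S(P) − S(P ∪ {n}), applied recursively over the negative items. The SupportTable of positive supports and the function supportWithNegatives are assumptions made for this example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Itemset = std::set<std::string>;
using SupportTable = std::map<Itemset, double>;  // supports of positive item sets

// Support of "all items in positive present, all items in negative absent",
// derived only from stored positive supports:
//   S(P, N) = S(P, N \ {n}) - S(P + {n}, N \ {n})
double supportWithNegatives(const SupportTable& table,
                            const Itemset& positive,
                            Itemset negative) {
    if (negative.empty()) {
        if (positive.empty()) return 1.0;  // every transaction matches
        auto it = table.find(positive);
        assert(it != table.end());  // sketch assumes the support was recorded
        return it->second;
    }
    std::string n = *negative.begin();
    negative.erase(negative.begin());
    Itemset withN = positive;
    withN.insert(n);
    return supportWithNegatives(table, positive, negative) -
           supportWithNegatives(table, withN, negative);
}
```

The recursion needs the supports of all positive item sets it touches, so in practice it depends on what the large-set generation phase has kept.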
Then, a mining algorithm is devised for association rules with negative items. If the generated rules are not interesting to users, this algorithm is able to discover other rules (perhaps more
interesting) by introducing negative item sets automatically[16].
With these definitions and algorithms, we can discover association rules whose semantics are
more complete, such as 'coffee ⇒ milk'.
3.2 Re-Development Ability
During the development of ARMiner, we encapsulated the system functions into several API functions for re-development. These functions are classified into four categories: the mining algorithm
functions, the rule generating functions, the data transformation functions and the concept hierarchy
constructing functions.
The process of association rule mining can be divided into two phases: large-set generation
and rule generation. The former is carried out by the mining algorithm functions, whose interface is
open, so any algorithm that conforms to the interface definition can be used. In ARMiner, we
currently provide the Apriori, AprioriTID[17] and DHP[18] algorithms, and other algorithms
with better performance can also be plugged in. The second phase is implemented by the rule generating functions,
which include the original rule generation function proposed by Agrawal and the new one described
above. The user can choose either of them through an option.
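The idea of an open interface for the first phase can be sketched as a function-pointer type that any large-set generator may satisfy. The signature LargeSetMiner and the names below are hypothetical and do not reproduce the actual ARMiner DLL interface; aprioriLargeSets is a compact, unoptimized rendering of the level-wise Apriori idea, given only as one conforming implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Itemset = std::set<std::string>;
using Database = std::vector<Itemset>;
using SupportTable = std::map<Itemset, double>;

// Hypothetical open interface: any function with this signature can be plugged in.
using LargeSetMiner = SupportTable (*)(const Database&, double);

double supportOf(const Database& db, const Itemset& x) {
    std::size_t hits = 0;
    for (const Itemset& t : db)
        if (std::includes(t.begin(), t.end(), x.begin(), x.end())) ++hits;
    return static_cast<double>(hits) / static_cast<double>(db.size());
}

// Level-wise search: keep candidates that reach minSupport, join them into
// candidates one item larger, and prune any candidate with a non-large subset.
SupportTable aprioriLargeSets(const Database& db, double minSupport) {
    assert(!db.empty());
    SupportTable large;
    std::set<Itemset> candidates;
    for (const Itemset& t : db)
        for (const std::string& item : t) candidates.insert(Itemset{item});
    while (!candidates.empty()) {
        std::vector<Itemset> level;
        for (const Itemset& c : candidates) {
            double s = supportOf(db, c);
            if (s >= minSupport) { large[c] = s; level.push_back(c); }
        }
        std::set<Itemset> next;
        for (std::size_t i = 0; i < level.size(); ++i) {
            for (std::size_t j = i + 1; j < level.size(); ++j) {
                Itemset joined = level[i];
                joined.insert(level[j].begin(), level[j].end());
                if (joined.size() != level[i].size() + 1) continue;
                bool allSubsetsLarge = true;
                for (const std::string& item : joined) {
                    Itemset subset = joined;
                    subset.erase(item);
                    if (large.count(subset) == 0) { allSubsetsLarge = false; break; }
                }
                if (allSubsetsLarge) next.insert(joined);
            }
        }
        candidates.swap(next);
    }
    return large;
}

// Phase one of mining, parameterized by the chosen algorithm.
SupportTable mineLargeSets(const Database& db, double minSupport, LargeSetMiner miner) {
    return miner(db, minSupport);
}
```

A caller would pass aprioriLargeSets, or any other function with the same signature, to mineLargeSets, and then hand the resulting support table to the rule generation phase.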
During the transfer, guided by the transformation rules, the data transformation functions automatically choose either transformation-transfer or simple transfer, and adjust the attributes of each field.
The concept hierarchy constructing functions are used to analyze the original data and construct hierarchies. To permit manual intervention in the mining process, we set up an independent operating
interface for constructing hierarchies. Through the interface shown in Fig.3, users can freely select
the desired items in the displayed tree.
Fig.3. The concept hierarchy.
Implemented in the form of a dynamic link library (DLL), these functions can be seamlessly integrated with development environments such as VC and VB. Therefore, the deployment of the
system is greatly eased. We can use the API functions to enhance various kinds of existing information systems by adding decision support to them. Furthermore, these functions
can be used to support the development of relevant software for e-commerce sites. Meanwhile, the interface
definitions of the API functions are the foundation of system extension. Following these considerations,
more functions, such as mining algorithms with better performance and more powerful guidance with more
domain knowledge, will be added to ARMiner.
4 Applications of ARMiner
The target database is a supermarket database where every item belongs to a definite category;
for example, apple belongs to fruit, fruit belongs to food, and so on. Due to hardware limitations
(PIII450, 128M RAM, Windows NT), we select only about 7,200 transactions (about 30,000 records
in a month) from this database. First, using the domain knowledge implied in the concept
hierarchy shown in Fig.3, we process these transaction data. The hierarchy is constructed from the
original data and the generalization items are selected; then the data are generalized.
We mine the database using the interestingness measure and association rules with negative items.
With the support threshold set to 0.005 and the confidence and interestingness thresholds set to 0.06
and 1.15 respectively, we get the results presented in Fig.4.
Fig.4. The ARMiner main interface.
As shown in this figure, negative items are marked with '( )'. If we had adopted only the three
threshold values to filter rules without considering negative items, some rules, such as "fast noodle
⇒ ( ) groceries", would not have been discovered. Rules of this kind appear precisely because the
corresponding normal ones, such as fast noodle ⇒ ( ) groceries, have interestingness less than 1.
If only interestingness is used to filter rules, we lose normal rules of this kind, not to mention
the ones with negative items. From the results, we detect some peculiar rules, such as 'sanitary
napkin ⇒ lacto-drink', which is as odd as the classic example of 'diaper ⇒ beer'. Besides, the rule
'fast noodle ⇒ fruit' is understandable, since the former food suits a busy life and the latter
a leisurely one.
Fig.5. Rules with negative items generated in ARMiner.
In the experiment, we consider two cases: with and without the concept hierarchy. Fig.5 shows the
rules we get in these two cases with various support thresholds. The thresholds of confidence and
interestingness are 0 and 1 respectively.
As shown in the figure, more rules are found when the concept hierarchy is considered than
when it is not. Both Figs.4 and 5 show
one of the advantages of ARMiner over other systems: it can generate rules with negative
items, especially when the concept hierarchy is considered. As to the efficiency and stability of
ARMiner, on the hardware platform mentioned above, when the threshold of the support measure
is below 0.002, the system appears to stop working.
This application also illustrates the use of the API functions in ARMiner. The interface is built with
Delphi, and it calls the API functions, which are provided in DLL form, to do its work.
Using the same method, we also added a decision support module to enhance the decision-making
function of an existing business information system. Fig.6 shows part of the code written during
the development of this application.
Fig.6. A module using the API.
Both applications thus demonstrate the re-development ability of ARMiner.
5 Comparison with Other Systems
Because ARMiner is a mining system based on association rules, whereas other systems are usually integrated
systems adopting many techniques such as classification and aggregation, we consider only their modules
for association rules. Limited by the length of the paper, we compare only the following representative
systems: IBM Intelligent Miner, DBMiner from Simon Fraser University and Knight from Nanjing
University. The first is a commercial system and the other two are research systems.
It should be pointed out that the main features of ARMiner lie in its functions, not its
performance, so we do not compare performance with other systems.
Intelligent Miner is an integrated tool set based on DB2 that provides full-scale decision support.
Its mining module for association rules strictly conforms to the earlier definition of association rules
and evaluation systems. Although the problem of association rule generalization is considered and a
set of API functions is provided, this module is tightly bound to its foundation, DB2.
Owing to IBM's great influence, its applications have been widely adopted. ARMiner, in contrast,
is independent of any database platform and can run on many database platforms through ODBC.
Association rule generalization is also taken into account. Moreover, an interestingness measure is
introduced as a new evaluation criterion, and ARMiner is able to discover rules with
negative items. Similarly, its API functions do not rely on any database.
DBMiner is a system for interactive mining of multiple-level knowledge in large relational databases.
The system implements a wide spectrum of data mining functions, and mining association rules is
one of its main functions. This function is based on the multidimensional data cube computed
in the preparation phase. It can perform interactive rule mining at multiple concept levels using an
SQL-like Data Mining Query Language (DMQL) and a graphical user interface, and generate different
forms of output. However, the generation of rules with negative items and the API functions
distinguish ARMiner from DBMiner. Both systems can be connected to various databases using
ODBC. ARMiner also handles multiple-level knowledge through the concept hierarchy. The
only disadvantages of ARMiner compared with DBMiner are the lack of a data mining query language and the
plain presentation of the results. As a prototype, however, ARMiner is a successful one.
Knight is a general-purpose mining tool, which uses ODBC and special database interfaces to achieve platform transparency. It imports
domain knowledge by guiding the knowledge discovery with a syntax tree. Capable of four types of mining, it has been put into use in the insurance domain. Similarly, ARMiner accesses databases through ODBC and introduces domain
knowledge into the system through concept hierarchies. The mining algorithms and data preprocessing
functions are encapsulated in dynamic link libraries. Therefore, the API functions are available
for re-development as soon as the system development is finished. ARMiner and its API functions
have been used in several application areas.
Besides these mining systems, there are other data mining systems. On the whole, however, unlike
ARMiner, they do not use an interestingness measure for association rules and cannot generate rules
with negative items. Most of them are monolithic and do not provide API functions for re-development.
Even where some systems provide such functions, they still hinder re-development because the
offered API functions rely heavily on the given platforms. ARMiner does not have these
shortcomings. Compared with those systems, ARMiner is more flexible and adaptable.
6 Conclusion
The technology of data mining emerged to meet the requirements of actual practice, and its implementation is helpful to decision-making. Some new efforts have been made in ARMiner. It is
not limited to a particular field, and it can use domain knowledge through the concept hierarchy, which demonstrates its flexibility. It also introduces the interestingness measure and an algorithm that mines
rules with negative items, making the semantics of the association rules more complete. In
addition, it provides API functions for re-development, which leads to more applications.
Some research problems remain for future work. First, we need to extend
the algorithm implementation beyond rule mining and provide complete functionality in the system.
Besides, incremental computation needs to be considered. Finally, more presentation forms should
be added.
References
[1] Chen M-S, Han J, Yu P S. Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 866–883.
[2] Agrawal R, Mehta M, Shafer J C et al. The quest data mining system. In Proc. Knowledge Discovery and Data Mining, Portland, Oregon, 1996, pp.244–249.
[3] Han J, Fu Y, Wang W et al. DBMiner: A system for mining knowledge in large relational databases. In Proc. Knowledge Discovery and Data Mining, Portland, Oregon, 1996, pp.250–255.
[4] Chen D, Xu J. Knight: A general purpose data mining system. J. Computer Research & Development, 1998, 35(4): 338–343.
[5] Zhu Y, Zhou X, Shi B. Rule-based data mining tool kit: AMiner. Communication of High Technology, 2000, 10(3): 19–22.
[6] Fayyad U M, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, pp.1–34.
[7] John G H. Enhancements to the data mining process [Dissertation]. Department of Computer Science, School of Engineering, Stanford University, 1997.
[8] Brachman R J. The process of knowledge discovery in database. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996, pp.37–57.
[9] Data mining: Extending the information warehouse framework. http://www.almaden.ibm.com/cs/quest/paper/whitepaper.html.
[10] Cheng J, Shi P. Fast mining multiple-level association rules. Chinese J. Computers, 1998, 21(11): 1037–1041.
[11] Han J, Fu Y. Discovery of multiple-level association rules from large databases. In Proc. Very Large Data Bases, Zurich, Switzerland, 1995, pp.420–431.
[12] Srikant R, Agrawal R. Mining generalized association rules. In Proc. Very Large Data Bases, Zurich, Switzerland, 1995, pp.407–419.
[13] Zhou X, Sha C, Zhu Y, Shi B. Interest measure – Another threshold in association rules. J. Computer Research & Development, 2000, 37(5): 627–633.
[14] Brin S, Motwani R, Silverstein C. Beyond market baskets: Generalizing association rules to correlations. In Proc. of ACM SIGMOD, Tucson, Arizona, USA, 1997, pp.265–276.
[15] Savasere A, Omiecinski E, Navathe S B. Mining for strong negative associations in a large database of customer transactions. In Proc. the 14th Int. Conf. Data Engineering, Orlando, Florida, USA, 1998, pp.494–502.
[16] Zhou H, Gao P, Zhu Y. Mining association rules with negative items using interest measure. In Web-Age Information Management, Lecture Notes in Computer Science 1846, Springer-Verlag Publisher, 2000, pp.121–132.
[17] Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. In Proc. Very Large Data Bases, Santiago de Chile, Chile, 1994, pp.487–499.
[18] Park J S, Chen M-S, Yu P S. An effective hash based algorithm for mining association rules. In Proc. ACM SIGMOD, San Jose, California, USA, 1995, pp.175–186.
ZHOU Haofeng
was born in 1975.
He received his B.E. degree in computer science from Shanghai
University in 1997, and his M.S. degree in computer science from Fudan University in 2000. He is currently a
Ph.D. candidate in computer science at Fudan University. His research interests include data mining, database
and knowledgebase.
ZHU Jianqiu
was born in 1974. He received his B.S. degree in computer science from Harbin University
of Science and Technology in 1996, and his M.S. degree in computer science from Fudan University in 1999.
He is currently a Ph.D. candidate in computer science at Fudan University. His research interests include data
mining, CRM and e-commerce.
ZHU Yangyong
was born in 1963. He received his B.S. in mathematics from Xinjiang University in 1984,
and his Ph.D. in computer science from Fudan University in 1994. He is now a professor in the Department of
Computing and Information Technology in Fudan University. His research interests include data mining and
e-commerce.
SHI Baile
was born in 1935. He majored in computing mathematics and graduated from Beijing University
in 1957. He is now a professor in the Department of Computing and Information Technology in Fudan
University. His research interests include digital library, database and knowledgebase.