2016 3rd International Conference on Engineering Technology and Application (ICETA 2016)
ISBN: 978-1-60595-383-0
Research on Improved Algorithm Based on AprioriTid Transaction
and Candidate Item Compression
Yonggui Zou & Bowen Tan
School of Computer Science and Technology, Chongqing University of Posts and Telecommunications,
Chongqing, China
ABSTRACT: Association rule mining is one of the most popular data mining technologies. This paper studies the AprioriTid algorithm for association rule mining and identifies its two main problems: the huge candidate TID lists and the storage of large numbers of useless items. To address these problems, this paper proposes an improved AprioriTid algorithm based on transaction set compression and candidate item set compression. The algorithm optimizes the self-join of the frequent item sets to reduce the candidate item sets by deleting transactions and shrinking the data set. Tests and comparisons on UCI standard test sets show that the improved algorithm gains 10%-20% in time efficiency.
Keywords:
data mining; association rule; AprioriTid Algorithm; improved AprioriTid Algorithm
1 INTRODUCTION
Data mining[1] is the process of extracting unknown but potentially useful information from enormous, incomplete, noisy, fuzzy, and random practical application data. The information obtained from data mining is effective and practical. The core techniques of data mining include statistics, artificial intelligence, and machine learning. Data mining is the most active area in database research, and association rules are an important research topic within it.
Association rule learning is a method for discovering interesting relations between variables in large
databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.[1] Based on the concept of strong rules,
Rakesh Agrawal et al.[2] introduced association rules
for discovering regularities between products in
large-scale transaction data recorded by point-of-sale
(POS) systems in supermarkets. For example, the rule
{onions, potatoes} ⇒ {burger} found in the sales data
of a supermarket would indicate that if a customer
buys onions and potatoes together, they are likely to
also buy hamburger meat. Such information can be
used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product
placements. In addition to the above example from
market basket analysis, association rules are employed
today in many application areas including web usage
mining, intrusion detection, continuous production,
and bioinformatics. In contrast with sequence mining,
association rule learning typically does not consider
the order of items either within a transaction or across
transactions.
Association rule mining is one of the important topics of KDD (Knowledge Discovery in Databases) research, first initiated by R. Agrawal. Association rules are discovered in two steps: first, iteratively identify all frequent item sets whose support is not lower than the minimum support set by users; second, from these frequent item sets, build rules whose confidence is not lower than the user-specified minimum confidence. This is the core of association rule discovery algorithms.
2 ASSOCIATION ALGORITHM ANALYSIS:
BASIC CONCEPTS AND ALGORITHMS
Many business enterprises accumulate large quantities
of data from their day-to-day operations. For example,
huge amounts of customer purchase data are collected
daily at the checkout counters of grocery stores. Table
1 illustrates an example of such data, commonly
known as market basket transactions.
Each row in this table corresponds to a transaction,
which contains a unique identifier labeled TID and a
set of items bought by a given customer. Retailers are
interested in analyzing the data to learn about the purchasing behavior of their customers. Such valuable
information can be used to support a variety of business-related applications such as marketing promotions, inventory management, and customer relationship management. Association analysis is useful for
discovering interesting relationships hidden in large
datasets. The uncovered relationships can be represented in the form of association rules or sets of frequent items. For example, rules such as {Butter, Bread} ⇒ {Milk} can be extracted from the data set shown in Table 1.
Table 1. An example of market basket transactions.

TID   Items
1     {Bread, Milk}
2     {Butter}
3     {Beer, Diaper}
4     {Bread, Milk, Butter}
5     {Bread}
A rule is defined as an implication of the form:

X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅

Every rule is composed of two different sets of items, also known as item sets, X and Y, where X is called the antecedent or left-hand side (LHS) and Y the consequent or right-hand side (RHS).
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {Bread, Milk, Butter, Beer, Diaper}, and Table 2 shows a small database containing the items in a binary 0/1 representation: in each entry, the value 1 means the presence of the item in the corresponding transaction, and the value 0 represents its absence.
The rule suggests that a strong relationship exists between the sale of butter, bread, and milk: if butter and bread are bought, customers are likely to also buy milk. Retailers can use this type of rule to identify new opportunities for cross-selling products to their customers.
Besides market basket data, association analysis is
also applicable to other application domains such as
bioinformatics, medical diagnosis, web mining, and
scientific data analysis. In the analysis of Earth science data, for example, the association patterns may
reveal interesting connections among the ocean, land,
and atmospheric processes. Such information may
help Earth scientists develop a better understanding of
how the different elements of the Earth system interact
with each other. Even though the techniques presented
here are generally applicable to a wider variety of data
sets, for illustrative purposes, our discussion will focus
mainly on market basket data. There are two key issues that need to be addressed when applying association analysis to market basket data. First, discovering
patterns from a large transaction data set can be computationally expensive. Second, some of the discovered patterns are potentially spurious because they
may happen simply by chance.
2.1 Basic concepts
Following the original definition proposed by Agrawal
et al.,[2] the problem of association rule mining is defined as:
Let I = {i1, i2,…, in} be a set of n binary attributes
called items.
Let D = {t1, t2,…, tm} be a set of transactions called
the database.
Each transaction in D has a unique transaction ID (TID) and contains a subset of the items in I.

Table 2. A binary 0/1 representation of market basket data.

TID   Milk   Bread   Butter   Beer   Diaper
1     1      1       0        0      0
2     0      0       1        0      0
3     0      0       0        1      1
4     1      1       1        0      0
5     0      1       0        0      0
An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.
In order to select interesting rules from the set of all
possible rules, constraints on various measures of
significance and interest are used. The best-known
constraints are minimum thresholds on support and
confidence.
Let X be an item set, X ⇒ Y an association rule, and T a set of transactions of a given database.
Support: The support value of X with respect to T, written sup(X), is defined as the proportion of transactions in the database which contain the item set X.
In the example database, the item-set {milk, bread,
butter} has a support of 1/5=0.2 since it occurs in 20% of
all transactions (1 out of 5 transactions). The argument of
sup(X) is a set of preconditions, and thus becomes more
restrictive as it grows (instead of more inclusive).
Confidence: The confidence value of a rule, X  Y ,
with respect to a set of transactions T, is the proportion of the transactions that contains X which also
contains Y.
Confidence is defined as:

conf(X ⇒ Y) = sup(X ∪ Y) / sup(X)
For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2/0.2 = 1.0 in the database, which means that the rule is correct for 100% of the transactions containing butter and bread (100% of the times a customer buys butter and bread, milk is bought as well).
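As a concrete check of these definitions, the support and confidence values above can be reproduced with a short script. This is a minimal sketch (not code from the paper); the transaction data come from Table 1:

```python
# Transactions from Table 1 (TIDs 1-5).
transactions = [
    {"Bread", "Milk"},
    {"Butter"},
    {"Beer", "Diaper"},
    {"Bread", "Milk", "Butter"},
    {"Bread"},
]

def support(itemset, db):
    """sup(X): proportion of transactions containing every item of X."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """conf(X => Y) = sup(X u Y) / sup(X)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"Milk", "Bread", "Butter"}, transactions))       # 0.2
print(confidence({"Butter", "Bread"}, {"Milk"}, transactions))  # 1.0
```

Only transaction 4 contains all of milk, bread, and butter, so the support is 1/5 = 0.2, and since {Butter, Bread} also occurs only in transaction 4, the confidence of the rule is 0.2/0.2 = 1.0, matching the text.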
2.2 The algorithms
Association rule research can be divided into two sub-tasks:[2]
(1) Find all the frequent item sets in data set D according to the minimum support.
(2) Generate association rules according to the frequent item sets and the minimum confidence.
The first task, finding all the frequent item sets in D, is the main problem and the standard benchmark of association rule research; most studies focus on it.
The Apriori algorithm[3,4] is a level-wise algorithm for association rule mining with good performance in practice. It needs multiple passes to search for frequent item sets: the first pass finds the frequent 1-itemsets, and the process then repeats until no new frequent item sets are generated. In the k-th pass, the apriori-gen function generates the k-dimensional candidate set Ck; the support of each candidate in Ck is then counted in the database and compared with the minimum support to obtain the k-dimensional frequent item set Lk. The apriori-gen function consists of a join step and a prune step; the prune step reduces the number of entries in the candidate item set Ck.
Here is the main idea:

(1)  L1 = {large 1-itemsets};
(2)  for (k = 2; Lk-1 ≠ ∅; k++) do begin
(3)    Ck = apriori-gen(Lk-1);  // new candidates
(4)    for all transactions t ∈ D do begin
(5)      Ct = subset(Ck, t);    // candidates contained in transaction t
(6)      for all candidates c ∈ Ct do
(7)        c.count++;
(8)    end
(9)    Lk = {c ∈ Ck | c.count ≥ minsup}
(10) end
(11) Answer = ∪k Lk;
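The loop above can be rendered directly in Python. This is a minimal sketch, not the authors' implementation: itemsets are represented as frozensets, minsup is an absolute support count, and apriori_gen is inlined as a join step followed by the subset-based prune step:

```python
from collections import defaultdict
from itertools import combinations

def apriori(db, minsup):
    """db: list of sets of items; minsup: minimum support count."""
    # Step (1): find the large 1-itemsets.
    counts = defaultdict(int)
    for t in db:
        for item in t:
            counts[frozenset([item])] += 1
    L = {c for c, n in counts.items() if n >= minsup}
    answer = set(L)
    k = 2
    while L:
        # Step (3), apriori-gen: join L(k-1) with itself, then prune any
        # candidate that has a non-frequent (k-1)-subset.
        Ck = set()
        for a in L:
            for b in L:
                u = a | b
                if len(u) == k and all(frozenset(s) in L
                                       for s in combinations(u, k - 1)):
                    Ck.add(u)
        # Steps (4)-(7): count candidate occurrences in the database.
        counts = defaultdict(int)
        for t in db:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        # Step (9): keep candidates meeting the minimum support.
        L = {c for c in Ck if counts[c] >= minsup}
        answer |= L
        k += 1
    return answer
```

On the Table 1 data with minsup = 2, this yields the frequent itemsets {Bread}, {Milk}, {Butter}, and {Bread, Milk}.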
AprioriTid keeps a set C̄k whose members have the form <TID, {Xk}>, where each Xk is a potentially large k-itemset present in the transaction identified by TID. The first step of the AprioriTid algorithm is to scan the data set D, count the supports of the candidates in C1, and generate the frequent 1-itemset L1; C̄1 corresponds to the data set D itself. The algorithm then iterates until no new frequent item sets are generated, recording for every transaction t ∈ D the entry <t.TID, {c ∈ Ck | c ⊆ t}> in C̄k. In the k-th step, the frequent (k-1)-itemsets Lk-1 from step k-1 are used to generate the candidate k-itemsets Ck; C̄k-1 is then traversed to count the supports of the candidates in Ck and generate the frequent item set Lk. As k increases, the size of C̄k becomes far smaller than the transaction data set D. This reduces the I/O time and the size of the scanned data set, and improves the efficiency of the algorithm.
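The <TID, {Xk}> representation can be illustrated with a short sketch (a simplified reading, not the authors' code): given hypothetical candidate 2-itemsets over the Table 1 data, each transaction is replaced by the candidates it actually contains, and transactions containing none are dropped.

```python
def build_ck_bar(db, Ck):
    """One AprioriTid pass: replace each raw transaction with the candidate
    k-itemsets it contains, dropping entries that support no candidate."""
    ck_bar = []
    for tid, t in db:
        cands = {c for c in Ck if c <= t}
        if cands:
            ck_bar.append((tid, cands))
    return ck_bar

# Hypothetical candidate 2-itemsets over the Table 1 transactions.
C2 = {frozenset({"Bread", "Milk"}), frozenset({"Bread", "Butter"}),
      frozenset({"Milk", "Butter"})}
db = [(1, {"Bread", "Milk"}), (2, {"Butter"}), (3, {"Beer", "Diaper"}),
      (4, {"Bread", "Milk", "Butter"}), (5, {"Bread"})]
print(build_ck_bar(db, C2))  # TIDs 2, 3, and 5 are dropped
```

Only TIDs 1 and 4 survive, which is exactly why scanning C̄k in later passes is cheaper than rescanning D.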
3 OPTIMIZED APRIORITID ALGORITHM
3.1 Properties of the optimized AprioriTid algorithm
Property 1: Every nonempty subset of a frequent item set is a frequent item set, and every superset of a non-frequent item set is a non-frequent item set[6].
Property 2: If the frequent k-itemsets can generate a (k+1)-itemset, then the number of frequent k-itemsets must be greater than k.
It follows from Property 1 that all the k-subsets of any (k+1)-itemset in Lk+1 must belong to the frequent k-itemsets.
Property 3: Any transaction supporting a frequent item set in Lk supports at least k of the (k-1)-itemsets in Lk-1.
Property 4: The non-frequent items in the AprioriTid transaction data C̄k can be dismissed when calculating Lk+1.
Proof by contradiction: if a non-frequent item could not be dismissed, it would support an item set in Lk+1, which contradicts Property 1.
3.2 The optimization idea of AprioriTid
The AprioriTid algorithm[5] is a classic association rule algorithm improved from the Apriori algorithm. Apriori scans the data set D in every pass to count candidate supports, whereas AprioriTid uses D only once, when counting the supports of the 1-itemsets; afterwards it works on C̄k instead. C̄k stores, for each transaction, the k-dimensional candidate item sets it contains rather than the transaction item set itself, which is sufficient to decide whether a k-dimensional candidate item set is a k-dimensional frequent item set.
(1) The improved algorithm based on transaction item set compression[7]
In the k-th step, while traversing C̄k-1 with the candidate item set Ck, if the number of potential large item sets in a transaction record is less than or equal to 1, the transaction record is deleted directly.
Justification: in the k-th step, for each candidate c ∈ Ck, the process of traversing C̄k-1 and generating C̄k must judge whether each transaction record of C̄k-1 contains the two generating (k-1)-itemsets (c without its (k-1)-th item and c without its k-th item); therefore every transaction record that can support a candidate contains at least two item sets.
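Improvement (1) can be sketched as a filter over the <TID, set-of-candidates> entries. This is our reading of the deletion rule, not the authors' code, and the C̄2 entries below are hypothetical:

```python
def compress_transactions(ck_bar_entries):
    """Transaction set compression (improvement (1)): an entry whose
    candidate list holds at most one potential large item set cannot
    supply the two generators of any larger candidate, so it is deleted."""
    return [(tid, cands) for tid, cands in ck_bar_entries if len(cands) > 1]

# Hypothetical C̄2 entries: TID -> candidate 2-itemsets found in it.
ck_bar = [
    (1, {frozenset({"Bread", "Milk"})}),  # only one candidate: deleted
    (4, {frozenset({"Bread", "Milk"}), frozenset({"Bread", "Butter"}),
         frozenset({"Milk", "Butter"})}),
]
print(compress_transactions(ck_bar))  # only the TID 4 entry survives
```

Each deleted entry is one fewer record to scan in every later pass, which is where the time saving comes from.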
(2) The improved algorithm based on candidate item set compression[8]
Before association rule mining, the algorithm preprocesses the original data, builds a data dictionary to classify the attributes, and maps them to simple numeric equivalents. The algorithm then sorts the data set in dictionary ascending order to reduce the candidate item sets. When generating the candidate k-itemsets, only item sets whose support is greater than the minimum support are kept; for k ≥ 2, the TID table stores each transaction as the set of candidate item sets it contains rather than the item set itself (in the form <t.TID, {{c ∈ Lk-1} | c ⊆ t}>), which reduces data storage. After obtaining the frequent item set Lk, the occurrences of each item are counted; if an item's count is less than minsup, it is removed, and a new frequent item set L'k replaces the original Lk. This reduces the number of candidate item sets generated in the next pass, as well as the storage space and time.
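Improvement (2)'s per-item pruning of Lk can be sketched as follows. This is our interpretation with a generic threshold parameter (the paper uses minsup as the threshold), not the authors' implementation:

```python
from collections import Counter

def compress_Lk(Lk, threshold):
    """Candidate set compression (improvement (2)): count how often each
    item occurs across the frequent k-itemsets in Lk; items whose count
    falls below the threshold are removed, and itemsets containing them
    are dropped, giving the reduced set L'k used for the next join."""
    item_freq = Counter(item for itemset in Lk for item in itemset)
    keep = {item for item, n in item_freq.items() if n >= threshold}
    return {s for s in Lk if s <= keep}

# Hypothetical L2: with threshold 2, item "D" occurs only once,
# so the itemset {B, D} is pruned from the next candidate generation.
L2 = {frozenset({"A", "B"}), frozenset({"A", "C"}),
      frozenset({"B", "C"}), frozenset({"B", "D"})}
print(compress_Lk(L2, 2))
```

Since L'k is only used to generate the next round of candidates, dropping itemsets that contain rare items shrinks the join without affecting the frequent item sets already reported.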
3.3 Improved AprioriTid Algorithm pseudocode
(1)  C1 = D
(2)  L1 = getFreq1Itemset(C1)
(3)  L'1 = L1
(4)  for (k = 2; (L'k-1 ≠ ∅) && (C̄k-1 ≠ ∅); k++) {
(5)    Ck = apriori_gen(L'k-1)
(6)    C̄k = ∅
(7)    for all transactions t ∈ C̄k-1 {
(8)      if (|t| < k) continue
(9)      else for all candidates c ∈ Ck {
(10)       if (c ⊆ t) {
(11)         Ct = Ct ∪ {c}
(12)         c.sup++
(13)       }
(14)     }
(15)     if ((Ct ≠ ∅) && (the number of item sets in Ct ≥ k))
(16)       C̄k = C̄k ∪ {<t.TID, Ct>}
       }
(17)   Lk = {c ∈ Ck | c.sup ≥ minsup}
(18)   L'k = getFreqKItemSet(Lk)
(19) }

4 EXPERIMENT AND THE RESULT

To verify the efficiency of the improved algorithm, this experiment compares the frequent item set generation time of the improved algorithm (named NewAprioriTid) with that of the AprioriTid algorithm under different supports. To better verify the efficiency of NewAprioriTid, we use a UCI standard test data set, the Mushroom data set. The UCI standard test data sets are popular in data mining; they can be obtained freely from the Internet and are well established. The experimental environment is the Windows 7 operating system, a 2.50 GHz CPU, and 4.0 GB of memory, with Java as the development language.

4.1 Comparison between the original algorithm and the improved algorithm, and analysis

(1) Fix the number of transactions in the data set and compare the original algorithm with the improved one under different support threshold values. 4,000 records were chosen randomly from the transactional database; the resulting running times are given in Table 3.

Table 3. The run time of 4000 transactions.

Support                               0.5    0.55   0.6    0.65   0.7
The run time of AprioriTid (ms)       340    330    304    283    261
The run time of NewAprioriTid (ms)    313    290    270    242    214
Improvement efficiency                8%     12%    11%    14%    18%

To compare the running efficiency clearly, this paper presents the data in Figure 1.

Figure 1. The run time of 4000 transactions (run time/ms versus support, for AprioriTid and NewAprioriTid).
It can be seen from the table above that the behavior of the two algorithms is relatively stable: running time decreases as the support increases. The improvement figures show that the improved algorithm takes less time than the original, reducing the run time by 10%-20%. As the threshold value increases, the time-saving advantage of the improved algorithm becomes more apparent. At the beginning of a run with a low support threshold, many frequent item sets and association rules satisfy the condition, so the running time is relatively long; as the support increases, the variation in running time becomes smaller. A high support restrains the generation of the large numbers of candidate item sets and frequent item sets that would satisfy a lower support, which saves considerable time and accounts for the larger gains. This shows the superiority of the improved algorithm.
(2) To compare efficiency under different transaction counts, take the minimum support 0.5 and randomly take 5000, 6000, 7000, and 8000 records for comparison and analysis. The results are shown in Table 4.

Table 4. The efficiency of different transaction counts.

The number of transactions            5000   6000   7000   8000
The run time of AprioriTid (ms)       263    291    326    354
The run time of NewAprioriTid (ms)    227    258    283    314

For better comparison, the data are plotted in Figure 2.

Figure 2. The efficiency of different transaction counts (run time/ms versus the number of transactions, for AprioriTid and NewAprioriTid).

It can be seen that the running times of both the original algorithm and the improved one increase with larger transaction sets under the fixed support of 0.5. The general tendencies of the two algorithms are the same; meanwhile, the improved algorithm has an advantage in its rate of time growth.

5 CONCLUSION

The association rule mining problem can be divided into two sub-tasks: finding the frequent item sets and generating the association rules. Finding the frequent item sets is the core of association rule mining. This paper proposed an improved AprioriTid algorithm based on transaction set compression and candidate item set compression. On the one hand, it compresses the size of the data set efficiently and reduces the number of data set scans; on the other hand, it reduces the candidate item sets and improves the algorithm. The experimental results show that the improved AprioriTid algorithm outperforms the original one.

REFERENCES
[1] Han J., Kamber M. & Pei J. 2011. Data Mining: Concepts and Techniques. Elsevier.
[2] Agrawal R., Mannila H., Srikant R., et al. 1996. Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining, 12(1): 307-328.
[3] Park J.S., Chen M.S. & Yu P.S. 1995. An effective hash-based algorithm for mining association rules. ACM SIGMOD Record, 24(2): 175-186.
[4] Pasquier N., Bastide Y., Taouil R., et al. 1999. Discovering frequent closed itemsets for association rules. Database Theory--ICDT'99, pp. 398-416.
[5] Li Z.C., He P.L. & Lei M. 2005. A high efficient AprioriTid algorithm for mining association rule. Proceedings of the 2005 International Conference on Machine Learning and Cybernetics. IEEE, 3: 1812-1815.
[6] Lan C., Liu Y. & Tang Z. 2010. Improvement of AprioriTid algorithm for mining frequent items. Computer Applications and Software, 27: 234-236.
[7] Peng Y. & Xiong Y. 2006. Study on optimization of AprioriTid algorithm for mining association rules. Computer Engineering, 5: 019.
[8] Athiyaman B., Sahu R. & Srivastava A. 2013. Hybrid data mining algorithm: An application to weather data. Journal of Indian Research, 1(4): 71-83.