International Journal of Computer Trends and Technology- July to Aug Issue 2011
Distributed Count Association Rule Mining Algorithm
Tirumala Prasad B #1, Dr. MHM Krishna Prasad *2
#1 Dept. of Information Technology, UCEV-JNTUK, Vizianagaram, A.P., India
*2 Associate Professor & Head, Dept. of IT, UCEV-JNTUK, Vizianagaram, A.P., India
Abstract- With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. Association rule mining (ARM) is an active data mining research area; however, most ARM algorithms cater to a centralized environment. In contrast to previous ARM algorithms, Optimized Distributed Association Rule Mining (ODARM) is a distributed algorithm for geographically spread data sets that aims to reduce operational and communication costs. Recently, as the need to mine patterns across distributed databases has grown, Distributed Association Rule Mining (DARM) algorithms have been developed. These algorithms assume that the databases are either horizontally or vertically distributed. In the special case of databases populated from information extracted from textual data, existing DARM algorithms cannot discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two. Hence, this paper proposes a Distributed Count Association Rule Mining algorithm (DCARM), which is evaluated on real datasets obtained from the UCI Machine Learning Repository.
Keywords - ARM algorithm, ODARM, DARM, PARM.
I. INTRODUCTION
Data mining [1], the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions.
The automated, prospective analyses offered by data mining
move beyond the analyses of past events provided by
retrospective tools typical of decision support systems. Data
mining tools can answer business questions that traditionally
were too time consuming to resolve. They scour databases for
hidden patterns, finding predictive information that experts
may miss because it lies outside their expectations.
Most companies already collect and refine massive
quantities of data. Data mining techniques can be
implemented rapidly on existing software and hardware
platforms to enhance the value of existing information
resources, and can be integrated with new products and
systems as they are brought on-line. When implemented on
high performance client/server or parallel processing
computers, data mining tools can analyse massive databases to
deliver answers to questions such as, "Which clients are most
likely to respond to my next promotional mailing, and why?"
Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, and clustering. Data mining is a complex topic with links to multiple core fields, such as computer science, and it adds value to seminal computational techniques from statistics, information retrieval, machine learning, and pattern recognition.
Data mining techniques are the result of a long process of
research and product development. This evolution began when
business data was first stored on computers, continued with
improvements in data access, and more recently, generated
technologies that allow users to navigate through their data in
real time. Data mining takes this evolutionary process beyond
retrospective data access and navigation to prospective and
proactive information delivery. Data mining is ready for
application in the business community because it is supported
by three technologies that are now sufficiently mature:
- Massive data collection
- Powerful multiprocessor computers
- Data mining algorithms
Commercial databases are growing at unprecedented rates.
A META Group survey of data warehouse projects found that 19% of respondents were beyond the 50-gigabyte level, while 59% expected to be there by the second quarter of 1996. In some industries, such as retail, these numbers can
be much larger. The accompanying need for improved
computational engines can now be met in a cost-effective
manner with parallel multiprocessor computer technology.
Data mining algorithms embody techniques that have existed
for at least 10 years, but have only recently been implemented
as mature, reliable, understandable tools that consistently
outperform older statistical methods.
Web Mining [2] is the extraction of interesting and
potentially useful patterns and implicit information from
artifacts or activity related to the World Wide Web. There are
roughly three knowledge discovery domains that pertain to
web mining: Web Content Mining, Web Structure Mining,
and Web Usage Mining. Web content mining is the process of
extracting knowledge from the content of documents or their
descriptions. Web document text mining, resource discovery
based on concept indexing or agent-based technology may also fall into this category. Web structure mining is the process
of inferring knowledge from the World Wide Web
organization and links between references and referents in the
Web. Finally, web usage mining, also known as Web Log
Mining, is the process of extracting interesting patterns from web access logs.
Web content mining is an automatic process that goes
beyond keyword extraction. Since the content of a text
document presents no machine-readable semantics, some approaches have suggested restructuring the document content into a representation that could be exploited by machines. The
usual approach to exploit known structure in documents is to
use wrappers to map documents to some data model.
A) Classification:
The process of dividing a dataset into mutually
exclusive groups such that the members of each group are as
"close" as possible to one another, and different groups are as
"far" as possible from one another, where distance is
measured with respect to specific variable(s) you are trying to
predict. For example, a typical classification problem is to
divide a database of companies into groups that are as
homogeneous as possible with respect to a creditworthiness
variable with values "Good" and "Bad."
B) Clustering:
The process of dividing a dataset into mutually
exclusive groups, such that the members of each group are as
"close" as possible to other, and different groups are as "far"
as possible from the other, where distance is measured with
respect to all available/considered variables.
Given databases of sufficient size and quality, data
mining technology can generate new business opportunities
by providing these capabilities:
- Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, quickly. A typical example of a predictive problem is targeted marketing: data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
- Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step.
An example of pattern discovery is the analysis of retail sales
data to identify seemingly unrelated products that are often
purchased together. Other pattern discovery problems include
detecting fraudulent credit card transactions and identifying
anomalous data that could represent data entry keying errors.
DARM [3], [4] discovers rules from various geographically
distributed data sets. However, the network connection
between those data sets isn't as fast as in a parallel
environment, so distributed mining usually aims to minimize
communication costs.
II. ASSOCIATION RULE MINING
Association rule mining [5] finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute-value conditions that occur frequently together in a given dataset. A typical and widely used example of association rule mining is Market Basket Analysis.
For example, data are collected using bar-code scanners in
supermarkets. Such ‘market basket’ databases consist of a
large number of transaction records. Each record lists all items
bought by a customer on a single purchase transaction.
Association rules provide information of this type in the
form of "if-then" statements. These rules are computed from
the data and, unlike the if-then rules of logic, association rules
are probabilistic in nature. In addition to the antecedent (the
"if" part) and the consequent (the "then" part), an association
rule has two numbers that express the degree of uncertainty
about the rule.
- Support
- Confidence
Support: In association analysis the antecedent and
consequent are sets of items (called itemsets) that are disjoint
(do not have any items in common). The first number is called
the support for the rule. The support is simply the number of
transactions that include all items in the antecedent and
consequent parts of the rule. (The support is sometimes
expressed as a percentage of the total number of records in the
database.)
Confidence: The other number is known as the confidence
of the rule. Confidence is the ratio of the number of
transactions that include all items in the consequent as well as
the antecedent (namely, the support) to the number of
transactions that include all items in the antecedent.
Let us see an example based on these two association rule
numbers:
If a supermarket database has 100,000 point-of-sale
transactions, out of which 2,000 include both items A and B
and 800 of these include item C, the association rule "If A and
B are purchased then C is purchased on the same trip" has a
support of 800 transactions (alternatively 0.8% = 800/100,000)
and a confidence of 40% (=800/2,000). One way to think of
support is that it is the probability that a randomly selected
transaction from the database will contain all items in the
antecedent and the consequent, whereas the confidence is the
conditional probability that a randomly selected transaction
will include all the items in the consequent given that the
transaction includes all the items in the antecedent.
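To make these two measures concrete, the following minimal Java sketch (our own illustration, not code from the paper) computes the support and confidence of the rule A, B => C over a small in-memory transaction list:

import java.util.*;

public class RuleMetrics {
    // Counts the transactions that contain every item of the given itemset.
    static int supportCount(List<Set<String>> db, Set<String> itemset) {
        int count = 0;
        for (Set<String> t : db)
            if (t.containsAll(itemset)) count++;
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> db = new ArrayList<Set<String>>();
        db.add(new HashSet<String>(Arrays.asList("A", "B", "C")));
        db.add(new HashSet<String>(Arrays.asList("A", "B")));
        db.add(new HashSet<String>(Arrays.asList("A", "C")));
        db.add(new HashSet<String>(Arrays.asList("B", "C")));

        Set<String> antecedent = new HashSet<String>(Arrays.asList("A", "B"));
        Set<String> all = new HashSet<String>(antecedent);
        all.add("C");                            // antecedent plus consequent

        int n1 = supportCount(db, antecedent);   // transactions with A and B
        int n2 = supportCount(db, all);          // transactions with A, B and C

        double support = (double) n2 / db.size(); // 1/4 = 0.25 here
        double confidence = (double) n2 / n1;     // 1/2 = 0.50 here
        System.out.println("support=" + support + ", confidence=" + confidence);
    }
}

On the supermarket example above, n1 would be 2,000 and n2 would be 800, and the same two divisions give 0.8% support and 40% confidence.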
An association rule tells us about the association between
two or more items. For example: In 80% of the cases when
people buy bread, they also buy milk. This tells us of the
association between bread and milk.
We represent it as bread => milk | 80%
This should be read as - "Bread means or implies milk, 80%
of the time." Here 80% is the "confidence factor" of the rule.
Association rules can involve more than two items. For example:
bread, milk => jam | 60%
bread => milk, jam | 40%
Given any rule, we can easily find its confidence. For example,
for the rule
bread, milk => jam
We count the number say n1, of records that contain
bread and milk. Of these, how many contain jam as well? Let
this be n2. Then the required confidence is n2/n1. This means that
the user has to guess which rule is interesting and ask for its
confidence. But our goal was to "automatically" find all
interesting rules. This is going to be difficult because the
database is bound to be very large. We might have to go
through the entire database many times to find all interesting
rules.
III. DISTRIBUTED COUNT ASSOCIATION RULE MINING ALGORITHM
We assume each site [6] has the same tasks as sequential
association mining, except it broadcasts support counts of
candidate itemsets after every pass. DCARM first computes
support counts of 1-itemsets from each site in the same
manner as it does for the sequential Apriori [7]. It then
broadcasts those itemsets to other sites and discovers the
global frequent 1-itemsets. Subsequently, each site generates
candidate 2- itemsets and computes their support counts.
At the same time, DCARM also eliminates all globally
infrequent 1-itemsets from every transaction and inserts the
new transaction (that is, a transaction without infrequent 1-itemsets) into memory. While inserting the new transaction, it
checks whether that transaction is already in the memory. If it
is, DCARM increases that transaction's counter by one.
Otherwise, it inserts the transaction with a count equal to one
into the main memory. After generating support counts of
candidate 2-itemsets at each site, DCARM generates the
globally frequent 2-itemsets. It then iterates through the main
memory (transactions without infrequent 1-itemsets) and
generates the support counts of candidate itemsets of
respective length. Next, it generates the globally frequent
itemsets of that respective length by broadcasting the support
counts of candidate itemsets after every pass.
Algorithm: Distributed Count Association Rule Mining Algorithm

NF = {non-frequent global 1-itemsets}
for all transactions t є D {
    for all 2-subsets s of t
        if (s є C2) s.sup++;
    t΄ = delete_nonfrequent_items(t);
    Table.add(t΄);
}
send_to_receiver(C2);
/* global frequent support counts from receiver */
F2 = receive_from_receiver(Fg);
C3 = {candidate itemsets};
T = Table.getTransactions(); k = 3;
while (Ck ≠ {}) {
    for all transactions t є T
        for all k-subsets s of t
            if (s є Ck) s.sup++;
    k++;
    send_to_receiver(Ck);
    /* generating candidate itemsets for pass k+1 */
    Ck+1 = {candidate itemsets};
}
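The pseudocode's first loop is the local pass described above: count 2-subsets, strip globally infrequent items, and store the reduced transaction. A minimal Java sketch of that transaction-compression step (our own illustration; the method names are hypothetical) follows:

import java.util.*;

public class LocalCompression {
    // Removes globally infrequent 1-itemsets from every transaction and
    // merges identical reduced transactions under a single counter, so
    // later passes scan each distinct reduced transaction only once.
    static Map<Set<String>, Integer> compress(List<Set<String>> db,
                                              Set<String> globallyFrequent1) {
        Map<Set<String>, Integer> table = new HashMap<Set<String>, Integer>();
        for (Set<String> t : db) {
            Set<String> reduced = new HashSet<String>(t);
            reduced.retainAll(globallyFrequent1);  // drop infrequent items
            if (reduced.isEmpty()) continue;
            Integer c = table.get(reduced);        // already in memory?
            table.put(reduced, c == null ? 1 : c + 1);
        }
        return table;
    }
}

In the subsequent passes, each distinct reduced transaction's k-subsets are counted once and weighted by its counter, which is what makes the duplicate detection worthwhile.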
Because DCARM eliminates all globally infrequent
itemsets from every transaction and inserts the reduced transactions into the main
memory, it reduces the transaction size (the number of items)
and finds more identical transactions. This is because the data
set initially contains both frequent and infrequent items.
However, total transactions could exceed the main memory
limit. To deal with this problem, we propose a technique that
fragments the data set into different horizontal partitions.
Then, from each partition, DCARM removes infrequent items
and inserts each transaction into the main memory. While
inserting the transactions, it checks whether they are already
in memory. If yes, it increases that transaction's counter by
one. Otherwise, it inserts that transaction into the main
memory with a count equal to one. Finally, it writes all main-memory entries for this partition into a temp file. Each local site generates support counts and broadcasts them to all other sites to let each site calculate globally frequent itemsets for that pass [8]. So, the total number of messages broadcast from each site equals (n − 1) × |C|. We can calculate the total
message size as

    Total message size = (n − 1) × |C|
where n is the total number of sites and |C| is the number of candidate itemsets. In contrast with CD (count distribution), DCARM sends support
counts of candidate itemsets to a single site, which calculates
the globally frequent itemsets for that pass. We refer to the
sites that send local support counts as senders and the site that generates the globally frequent itemsets as the receiver.
For example, with three sites, two broadcast their local
support counts of candidate itemsets to the third site. The third
site is responsible for generating that iteration's globally
frequent itemsets. The total number of messages broadcast
from a sender site to a receiver site equals (1 × |C|).
Once the receiver site generates globally frequent itemsets,
it broadcasts them to all sender sites. The total number of
messages broadcast from the receiver is (n − 1) × |Fg|. We can
calculate the total message broadcasting size (the aggregate of
sender and receiver sites' messages) as

    Total message size = (n − 1) × |C| + (n − 1) × |Fg|
where n is the number of sites, |C| is the number of candidate itemsets, and |Fg| is the number of globally frequent itemsets.
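As an illustrative calculation using these formulas (the numbers are our own): with n = 3 sites, |C| = 100 candidate itemsets, and |Fg| = 40 globally frequent itemsets, having every site broadcast its counts to every other site costs 3 × (3 − 1) × 100 = 600 messages in aggregate, whereas the sender/receiver scheme costs (3 − 1) × 100 + (3 − 1) × 40 = 280 messages, a reduction of roughly 53%.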
Minimum Support:
To make the problem tractable, we introduce the
concept of minimum support. The user has to specify this
parameter - let us call it minsupport. Then any rule
i1, i2, ... , in => j1, j2, ... , jn
needs to be considered only if the set of all items in this rule, namely { i1, i2, ..., in, j1, j2, ..., jn }, has support greater than minsupport. Consider again the rule
bread, milk => jam
If the number of people buying bread, milk and jam
together is very small, then this rule is hardly worth
consideration (even if it has high confidence). Our problem
now becomes - Find all rules that have a given minimum
confidence and involve itemsets whose support is more than
minsupport. Clearly, once we know the supports of all these
itemsets, we can easily determine the rules and their
confidences. Hence we need to concentrate on the problem of
finding all itemsets which have minimum support. We call such itemsets frequent itemsets.
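As a small illustration (our own sketch, continuing the Java examples above), filtering counted itemsets down to the frequent ones is a single scan of the support table:

import java.util.*;

public class FrequentFilter {
    // Keeps only the itemsets whose support count reaches minSupport.
    static Map<Set<String>, Integer> frequent(Map<Set<String>, Integer> counts,
                                              int minSupport) {
        Map<Set<String>, Integer> result = new HashMap<Set<String>, Integer>();
        for (Map.Entry<Set<String>, Integer> e : counts.entrySet())
            if (e.getValue() >= minSupport)
                result.put(e.getKey(), e.getValue());
        return result;
    }
}

Rule generation then only ever consults this filtered table, which is why finding the frequent itemsets dominates the cost of association rule mining.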
IV. EXPERIMENTAL EVALUATION
In this section, we present our experimental
observations obtained while evaluating the behaviour of
optimized association rule mining on different real and synthetic datasets.
Datasets: Experiments are performed on the Connect-4 [9] and Covertype [9] real datasets from the UCI Machine Learning Repository, as well as on synthetic datasets.
Implementation details: We implemented the optimized distributed association rule mining algorithm on different machines, which communicate using Java socket programming under JDK 1.6. Experiments are performed on machines with Intel® Core™ 2 Duo 3.0 GHz CPUs and 4 GB of RAM, running Windows.
Observations: To confirm DCARM's effectiveness, we studied its performance extensively. We implemented the algorithm in Java and established a socket-based, client-server distributed environment to evaluate DCARM's message-reduction techniques. Each site has a receiving and a sending unit and assigns a specific port to send and receive candidate support counts. Because the candidate itemsets that each site generates are based on the globally frequent itemsets of the previous pass, the candidate itemsets are identical across sites. We organize our performance evaluation by executing DCARM on multiple sites and comparing how much time the algorithm takes to generate frequent itemsets of length n for different numbers of instances.
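The paper does not reproduce its networking code, so the following minimal sketch (our own; the port number and class names are hypothetical) shows one way a receiver site could aggregate the serialized local support counts arriving from sender sites over plain java.net sockets:

import java.io.*;
import java.net.*;
import java.util.*;

public class ReceiverSite {
    public static void main(String[] args) throws Exception {
        int senders = 2;   // two remote sites in a three-site setup
        int port = 9090;   // hypothetical port reserved for support counts
        Map<String, Integer> global = new HashMap<String, Integer>();

        ServerSocket server = new ServerSocket(port);
        for (int i = 0; i < senders; i++) {
            Socket s = server.accept();
            ObjectInputStream in = new ObjectInputStream(s.getInputStream());
            // Each sender writes its local candidate support counts, keyed
            // by a canonical string form of the itemset.
            @SuppressWarnings("unchecked")
            Map<String, Integer> local = (Map<String, Integer>) in.readObject();
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                Integer c = global.get(e.getKey());
                global.put(e.getKey(), c == null ? e.getValue()
                                                 : c + e.getValue());
            }
            s.close();
        }
        server.close();
        // The receiver would now keep the itemsets meeting the global
        // minimum support and broadcast them back to the sender sites.
        System.out.println("aggregated counts: " + global);
    }
}

A sender site would symmetrically open a Socket to the receiver's host and port and write its count map with ObjectOutputStream.writeObject.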
At each site we generate candidate frequent itemsets, removing non-frequent itemsets to improve the memory utilization of the system. Using this technique on the three sites, we generate candidate frequent itemsets for different numbers of instances; finally, a coordinator collects the generated candidate frequent itemsets, applies the DCARM algorithm to obtain the globally frequent itemsets, checks the minimum support count, and generates the association rules.
DCARM exchanges fewer messages among different sites to
generate globally frequent itemsets. DCARM reduces the
communication cost by 60 to 80 percent. The process of
communication among the sites is described below.
A polling-based scheme, by contrast, sends each support count to polling sites. After receiving a request, each polling site sends a polling request to all remote sites other than the originator site. Upon
receiving the polling request from all other sites, the polling
site computes whether that candidate itemset is globally
frequent and broadcasts only globally frequent itemsets to all
other sites. Hence, it exchanges more messages because each
polling site sends and receives support counts from remote
sites. It also needs to send global support counts to all
participating sites when a candidate itemset is heavy and
subsequently increases the communication cost. Furthermore,
each polling site receives polling requests only from one site.
Table 1 below shows the evaluated times for the three different sites with instances of size 1k to 5k, where k = 1000.
We conducted the experiment on multiple sites with different sizes of datasets and compared the execution times. From this comparison across the multiple sites, DCARM is an efficient algorithm for generating the support counts, and the coordinator's support-count results further indicate the algorithm's efficiency.
Figure 1: Comparison of multiple sites
Figure 1 shows the comparison of execution times for the multiple sites and the coordinator.
Table 1: Execution times for multiple sites and coordinator

No. of Instances   Site1 local time   Site2 local time   Site3 local time   Algorithm computational time
1000               125                156                109                31
2000               281                261                250                32
3000               406                390                407                62
4000               547                531                562                78
5000               672                641                672                109
V. CONCLUSION
With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. Association rule mining is an active data mining research area. This paper presented a distributed association rule mining algorithm that utilizes a global count over frequent itemsets. From the experimental results, we observed that Distributed Count Association Rule Mining provides an efficient method for generating association rules from different datasets distributed among various sites.
REFERENCES
[1] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach," Data Mining and Knowledge Discovery, vol. 8, pp. 53-87, 2004.
[2] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web," in Proc. IEEE Int'l Conf. on Tools with Artificial Intelligence, pp. 558-567, 1997.
[3] R. Agrawal and J.C. Shafer, "Parallel Mining of Association Rules," IEEE Trans. Knowledge and Data Eng., vol. 8, pp. 962-969, 1996.
[4] D.W. Cheung et al., "A Fast Distributed Algorithm for Mining Association Rules," in Proc. Parallel and Distributed Information Systems, IEEE CS Press, pp. 31-42, 1996.
[5] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), Morgan Kaufmann, pp. 407-419, 1994.
[6] M.Z. Ashrafi, D. Taniar, and K. Smith, "ODAM: An Optimized Distributed Association Rule Mining Algorithm," IEEE Distributed Systems Online, 2004.
[7] A. Inokuchi, T. Washio, and H. Motoda, "An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data," Lecture Notes in Computer Science, pp. 13-23, 2000.
[8] M.J. Zaki, "Parallel and Distributed Association Mining: A Survey," IEEE Concurrency, pp. 14-25, 1999.
[9] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, accessed 12 Feb. 2011.