Association Rule Mining Using the Apriori Algorithm
Project Report
In this report, we briefly introduce the application scenario of association rule mining, give
details of the Apriori algorithm implementation, and comment on the mined rules. Some
instructions for using this program are also given.
1. Application Scenario
Association rule mining finds interesting association relationships among a large set of
data items. With massive amounts of data continuously being collected and stored in
databases, many industries are becoming interested in mining association rules from their
databases. [1]
1.1 Market Basket Analysis
A typical example of association rule mining is market basket analysis. This process
analyzes customers' buying habits by finding associations among the different items that
customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently
purchased together by customers. For instance, market basket analysis may help managers
optimize store layouts. If customers who purchase milk also tend to buy bread at the same
time, then placing milk close to, or opposite, bread may help to increase the sales of both
items. [1]
1.2 Database
The program apriori is designed to generate strong association rules from a Boolean-valued
database that looks like this:
Item1  Item2  Item3  Item4  Item5  …
y      y      y      n      n      …
y      n      y      y      y      …
……
That is, the first line consists of all the different item names, and each of the remaining lines
is a Boolean-valued vector in which y indicates that the corresponding item appears in this
line and n indicates that it does not. However, if the database looks like this:
Item1 Item2 Item3
Item1 Item3 Item4 Item5
……
We should first use the pre-processing program convert to turn it into a Boolean-valued file.
The program convert works as follows:
• First, find all the different items by scanning the whole database and save them as an
item-name line in the first line of a new text file newdata.txt.
• Next, for each line of the source file, build a Boolean-valued vector consisting of y or n
entries, depending on whether each item of the item-name line appears in this line or not.
Thus every vector has the same length, which is exactly the total number of different items.
Then save this vector into newdata.txt. A minimal sketch of this conversion is given below.
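The following is a minimal sketch of this conversion in C. The file names transaction.txt and
newdata.txt are taken from this report; the delimiters, the size limits and all other identifiers
are assumptions made for illustration only, and the actual convert.c may differ.

/* Sketch of the convert pre-processing step described above.
 * Assumptions: items are separated by commas and/or whitespace,
 * and there are at most MAX_ITEMS distinct items. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ITEMS 1024
#define MAX_LINE  4096
#define DELIMS    ", \t\r\n"

static char *names[MAX_ITEMS];
static int n_names = 0;

/* Return the index of an item name, adding it on first sight. */
static int item_index(const char *s)
{
    for (int i = 0; i < n_names; i++)
        if (strcmp(names[i], s) == 0)
            return i;
    names[n_names] = strdup(s);
    return n_names++;
}

int main(void)
{
    FILE *in  = fopen("transaction.txt", "r");
    FILE *out = fopen("newdata.txt", "w");
    char line[MAX_LINE];
    if (!in || !out)
        return 1;

    /* First scan: collect all the different item names. */
    while (fgets(line, sizeof line, in))
        for (char *tok = strtok(line, DELIMS); tok; tok = strtok(NULL, DELIMS))
            item_index(tok);

    /* Write the item-name line as the first line of newdata.txt. */
    for (int i = 0; i < n_names; i++)
        fprintf(out, "%s%c", names[i], i + 1 < n_names ? ' ' : '\n');

    /* Second scan: write one y/n vector per transaction record. */
    rewind(in);
    while (fgets(line, sizeof line, in)) {
        char present[MAX_ITEMS] = {0};
        for (char *tok = strtok(line, DELIMS); tok; tok = strtok(NULL, DELIMS))
            present[item_index(tok)] = 1;
        for (int i = 0; i < n_names; i++)
            fprintf(out, "%c%c", present[i] ? 'y' : 'n', i + 1 < n_names ? ' ' : '\n');
    }

    fclose(in);
    fclose(out);
    return 0;
}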
In this project, we use a supermarket transaction database transaction.txt to mine
association rules. This database comes from the software package CBA2.0 of National
University of Singapore. [2]
It looks like this:
newspaper, cd, battery, sweets, soya_sauce, rice
rice, sugar, tomato_sauce, apple, pamper, pacifier
……
First we use the pre-processing program convert to turn transaction.txt into the Boolean-valued
file supmart.txt. Then we run the program apriori on supmart.txt to get all the
association rules we might be interested in.
To test the robustness of our program, we also use a much larger database votes.txt
(Congressional Voting Records of the United States in 1984, from the UCI Machine Learning
Repository). In this database all attributes are already Boolean-valued. We delete the first
column because this file was originally intended for classification of Republicans and
Democrats. Then we run the program apriori on votes.txt to get the association rules about
the voting records. [3]
2. Implementation of Algorithms
In this project, we use many C functions to implement the apriori algorithm and generate
association rules.
2.1 Data Structure
To implement this project, the key point is setting up good data structures to represent each
itemset and store all the frequent itemsets:
• First, we use struct MATRIX to store the number of different items, the number of
transaction records, all the different item names and all the Boolean values of the database.
The size of the data matrix is determined dynamically.
• Second, in order to represent a certain itemset, we use struct VECTOR, which includes the
itemset frequency and an itemset vector whose length equals the number of different items.
• Third, in order to link all the frequent k-itemsets into a list, we use struct ITEMSETS,
which includes the struct VECTOR and a pointer to the next frequent k-itemset in the list.
So by referring to the head pointer Lk of the list of frequent k-itemsets, we can perform the
proper operations.
For all the other supplementary data structures, please see the details in the source code. A
rough sketch of the three main structures is given below.
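For illustration, the three structures described above might be declared roughly as follows,
assuming the y/n values are parsed into 0/1 flags when the data matrix is read; the field
names are illustrative and are not taken from the actual source code.

/* Sketch only: field names are illustrative. */
typedef struct MATRIX {
    int    item_num;       /* number of different items                          */
    int    trans_num;      /* number of transaction records                      */
    char **item_names;     /* item_num item-name strings                         */
    char **data;           /* trans_num rows; data[t][i] = 1 if item i occurs in
                              transaction t, 0 otherwise                          */
} MATRIX;

typedef struct VECTOR {
    int   freq;            /* support count of this itemset                      */
    char *items;           /* length item_num; items[i] = 1 if item i is in set   */
} VECTOR;

typedef struct ITEMSETS {
    VECTOR v;
    struct ITEMSETS *next; /* next frequent k-itemset in the list                */
} ITEMSETS;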
2.2 Algorithm
Market basket analysis can be divided into two sub-problems:
1. Find all frequent itemsets that have support above minimum support threshold.
2. Generate strong association rules that satisfy minimum confidence threshold
from the frequent itemsets. [1]
2.2.1 Data Processing
First, we use the function file_size to scan the database to determine the number of
different items and the number of transaction records. Second, we use the function
init_struct to initialize the data matrix, all the head pointers L1 to Lk and some other
supplementary data structures. Third, we use the function read_data to store all the different
item names and Boolean values into the data matrix.
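As an illustration, a function in the spirit of file_size could scan the Boolean-valued file like
this; the real signature and implementation may differ.

/* Rough sketch of a file_size-style scan, assuming the Boolean-valued
 * layout of Section 1.2. */
#include <stdio.h>
#include <string.h>

static void file_size(const char *path, int *item_num, int *trans_num)
{
    FILE *fp = fopen(path, "r");
    char line[4096];
    *item_num = 0;
    *trans_num = 0;
    if (!fp)
        return;

    /* The first line holds all the item names, separated by whitespace. */
    if (fgets(line, sizeof line, fp))
        for (char *tok = strtok(line, " \t\r\n"); tok; tok = strtok(NULL, " \t\r\n"))
            (*item_num)++;

    /* Every remaining non-blank line is one transaction record. */
    while (fgets(line, sizeof line, fp))
        if (strspn(line, " \t\r\n") < strlen(line))
            (*trans_num)++;

    fclose(fp);
}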
2.2.2 Apriori Algorithm
Apriori is an influential algorithm for finding frequent itemsets. The first pass of the algorithm
simply uses the function getL1 to count item occurrences to determine the frequent 1-itemsets.
A subsequent pass, for example pass k, consists of two phases:
• First, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the
candidate itemsets Ck, using the function getCk described below.
• Next, the data matrix is scanned and the support of the candidates in Ck is counted. For
fast counting, we use the function be_subset to efficiently determine whether the candidates
in Ck are contained in a given transaction or not (see the sketch below). [4]
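As a sketch of the counting phase, a containment test in the spirit of be_subset, together with
the support-counting scan, could look as follows, using the structures sketched in Section 2.1;
the real function signatures may differ.

/* Sketch of the containment test used for fast support counting.
 * Both the candidate itemset and the transaction row are 0/1 vectors
 * of length item_num, as in the MATRIX/VECTOR sketch of Section 2.1. */
static int be_subset(const char *candidate, const char *trans, int item_num)
{
    for (int i = 0; i < item_num; i++)
        if (candidate[i] && !trans[i])
            return 0;   /* the candidate needs an item the transaction lacks */
    return 1;           /* every item of the candidate occurs in trans       */
}

/* Counting phase of pass k: scan the data matrix once and increase the
 * frequency of every candidate contained in the current transaction. */
static void count_support(ITEMSETS *Ck, const MATRIX *m)
{
    for (int t = 0; t < m->trans_num; t++)
        for (ITEMSETS *c = Ck; c != NULL; c = c->next)
            if (be_subset(c->v.items, m->data[t], m->item_num))
                c->v.freq++;
}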
There are two steps in the function getCk.
• First, in the join step, we join Lk-1 with Lk-1 to generate potential candidates.
• Next, in the prune step, we use the function infqn_subset to remove all candidates
that have a subset that is not frequent. The pruning is based on the apriori property
that "all non-empty subsets of a frequent itemset must be frequent as well". [1]
The function getCk returns a superset of the set of all frequent k-itemsets.
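A sketch of the join and prune steps over the same vector representation might look as
follows; the helper names vec_size, in_list and all_subsets_frequent are illustrative and do
not appear in this report (the report's prune helper is called infqn_subset).

#include <stdlib.h>
#include <string.h>

/* Number of items present in an itemset vector of length n. */
static int vec_size(const char *v, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += v[i];
    return count;
}

/* Is the itemset vector v already present in list L? */
static int in_list(const char *v, const ITEMSETS *L, int n)
{
    for (; L != NULL; L = L->next)
        if (memcmp(v, L->v.items, n) == 0)
            return 1;
    return 0;
}

/* Prune step: every (k-1)-subset of cand must itself be frequent,
 * i.e. appear in Lk_1 (the apriori property). */
static int all_subsets_frequent(const char *cand, const ITEMSETS *Lk_1, int n)
{
    char sub[n];                      /* C99 variable-length array */
    for (int i = 0; i < n; i++) {
        if (!cand[i])
            continue;
        memcpy(sub, cand, n);
        sub[i] = 0;                   /* drop one item at a time */
        if (!in_list(sub, Lk_1, n))
            return 0;
    }
    return 1;
}

/* Join step plus prune step: a candidate is the union of two frequent
 * (k-1)-itemsets that share exactly k-2 items. */
static ITEMSETS *getCk(const ITEMSETS *Lk_1, int k, int n)
{
    ITEMSETS *Ck = NULL;
    char u[n];
    for (const ITEMSETS *a = Lk_1; a != NULL; a = a->next)
        for (const ITEMSETS *b = a->next; b != NULL; b = b->next) {
            for (int i = 0; i < n; i++)
                u[i] = a->v.items[i] | b->v.items[i];
            if (vec_size(u, n) != k)   /* the union must have exactly k items */
                continue;
            if (in_list(u, Ck, n))     /* skip duplicate candidates           */
                continue;
            if (!all_subsets_frequent(u, Lk_1, n))
                continue;              /* prune by the apriori property       */
            ITEMSETS *c = malloc(sizeof *c);
            c->v.freq  = 0;
            c->v.items = malloc(n);
            memcpy(c->v.items, u, n);
            c->next = Ck;
            Ck = c;
        }
    return Ck;
}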
We also use the function display_itemsets to save all the frequent itemsets into a new text
file itemsets.txt.
2.2.3 Generate Strong Association Rules
Once all the frequent itemsets have been found, it is straightforward to generate strong
association rules from them as follows:
• For each frequent itemset l in Lk (k≥2), generate all non-empty subsets of l.
• For every non-empty subset s of l, output the rule "s ==> (l-s)" if
support(l)/support(s) ≥ min_conf. [1]
In the function get_rules, we modify the algorithm to further prune the search space, based
on the apriori knowledge, as follows:
Since all the subsets of l must be frequent 1- to (k-1)-itemsets, we only need to visit each of
the frequent 1- to (k-1)-itemset lists, and for each itemset of any list just check whether it is
a subset of l (which is easy with the vector representation). If so, and support(l)/support(s)
≥ min_conf, then output the rule "s ==> (l-s)". It is also very easy to generate l-s when both
l and s are represented by vectors consisting of 1s and 0s.
All the generated rules are saved into a new text file rules.txt by the function display_rule.
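A sketch of this pruned rule generation, reusing the be_subset and MATRIX/ITEMSETS
sketches from Sections 2.1 and 2.2.2, might look as follows; the array Lists[j], which is
assumed to hold the head of the frequent j-itemset list, and the exact signature are
illustrative rather than copied from the source.

#include <stdio.h>

/* Visit every frequent 1- to (k-1)-itemset s; if s is a subset of the
 * frequent k-itemset l and support(l)/support(s) >= min_conf, print the
 * rule "s ==> (l-s)". Lists[j] holds the frequent j-itemsets. */
static void get_rules(ITEMSETS *Lists[], int max_k, const MATRIX *m, double min_conf)
{
    for (int k = 2; k <= max_k; k++)
        for (ITEMSETS *l = Lists[k]; l != NULL; l = l->next)
            for (int j = 1; j < k; j++)
                for (ITEMSETS *s = Lists[j]; s != NULL; s = s->next) {
                    if (!be_subset(s->v.items, l->v.items, m->item_num))
                        continue;                 /* s must be a subset of l */
                    double conf = (double)l->v.freq / s->v.freq;
                    if (conf < min_conf)
                        continue;
                    for (int i = 0; i < m->item_num; i++)
                        if (s->v.items[i])
                            printf("%s ", m->item_names[i]);
                    printf("==> ");
                    for (int i = 0; i < m->item_num; i++)
                        if (l->v.items[i] && !s->v.items[i])
                            printf("%s ", m->item_names[i]);   /* l - s */
                    printf("(Support:%.2f%%, Confidence:%.2f%%)\n",
                           100.0 * l->v.freq / m->trans_num, 100.0 * conf);
                }
}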
2.2.4 Free Memory
After all the strong association rules are generated, we use the function free_struct to free
all the memory dynamically allocated for the frequent itemsets lists and data matrix.
3. Comments and Discussion
In our project, if we set the minimum support to 0.3 and the minimum confidence to 0.5, then
18 frequent itemsets and 17 strong association rules are generated from supmart.txt. One
strong association rule looks like this:
cd ==> soya_sauce (Support:39.06%, Confidence:66.67%)
This rule means that 39.06% of all the transaction records contain both cd and soya_sauce,
and 66.67% of the customers who purchased cd also bought soya_sauce. So it is great fun to
find many interesting patterns.
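As a quick cross-check against the rule-generation condition of Section 2.2.3 (confidence =
support(l)/support(s)), the two percentages above imply that support(cd) ≈ 39.06% / 66.67%
≈ 58.6%, i.e. a little under three fifths of the transactions contain cd at all.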
If we apply the same minimum support and confidence thresholds to the second database
votes.txt (Congressional Voting Records of the United States in 1984), we get 91 frequent
itemsets and 354 strong association rules. One strong association rule looks
like this:
education-spending ==> crime (Support:36.32%, Confidence:92.40%)
This rule means that 36.32% of all the voters supported both the education-spending policy
and the crime policy, and 92.40% of the voters who supported the education-spending policy
also supported the crime policy.
If we set a lower minimum support and confidence, many more frequent itemsets and strong
association rules may be generated, and the run time will be somewhat longer. In other
words, raising the minimum support and confidence has a secondary effect of reducing
computation time, which may be desirable for large data sets. [4]
4. Instructions to Use the Tool
4.1 Installation
Type: gcc convert.c -o convert.exe to get the pre-processing program convert.exe.
Type: gcc apriori.c -o apriori to get the executable program apriori.
Please see the details in the README file in the proj directory.
4.2 How to use it?
If the database is not Boolean-valued, we should first use the pre-processing program
convert.exe to turn it into a Boolean-valued one.
Then just type: apriori. At the prompt, enter the name of the data file (supmart.txt or votes.txt).
Then input the minimum support (say 0.3, not 30%) and the minimum confidence (say 0.5, not
50%). The program will analyze the file and display all the frequent itemsets and association
rules on the screen. You can then check the two result files itemsets.txt and rules.txt for
details.
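A hypothetical session could look like the following; the exact prompt wording is illustrative
and may differ from what the program actually prints.

apriori
Data file name: supmart.txt
Minimum support: 0.3
Minimum confidence: 0.5
(the frequent itemsets and strong association rules are displayed on the screen
and saved into itemsets.txt and rules.txt)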
References:
[1]: Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan
Kaufmann Publishers, 2000, ISBN 1-55860-489-8.
[2]: "Data Mining Interestingness and Interaction", http://www.comp.nus.edu.sg/~dm2/.
Available: Dec. 5, 2000.
[3]: "UCI Machine Learning Repository",
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/voting-records. Available: Dec. 5, 2000.
[4]: "Introduction to Data Mining", http://siva.bpa.arizona.edu/data_mining/Data_Mining.htm.
Available: Dec. 5, 2000.