IT462 Lab 4: Association Rules
For this lab, you will implement a data mining algorithm and try it on some sample data sets.
Part 1 (Find the frequent itemsets):
Implement the Apriori algorithm to find the frequent itemsets. This is the algorithm we
discussed in class. The algorithm is described in the recommended textbook (Chapter 26.2.1),
and in "Fast Algorithms for Mining Association Rules" by Rakesh Agrawal and Ramakrishnan
Srikant.
A pseudo-code for the algorithm is given below:
Find the frequent itemsets of size 1 (L1)
k = 1
While (Lk not empty) {
    a. Generate candidate itemsets of size k+1 (Ck+1) based on Lk
    b. Prune the list based on the apriori principle (eliminate candidates which have an
       infrequent subset)
    c. Scan the transactions to count support for the remaining candidates and eliminate
       the infrequent ones
    d. Lk+1 = Ck+1
    e. Increase k by 1
}
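The steps above could be sketched in Python roughly as follows. This is only a minimal illustration, not a complete solution: it assumes the transactions have already been loaded as a list of sets and that min_support is given as an absolute count (a percentage would be converted first).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (a frozenset)
    to its support count, following the pseudo-code above."""
    # Find the frequent itemsets of size 1 (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    Lk = set(frequent)
    k = 1
    while Lk:
        # a. Generate candidate itemsets of size k+1 (Ck+1) based on Lk
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # b. Prune candidates that have an infrequent k-subset (apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # c. Scan the transactions to count support for the remaining candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # d./e. Lk+1 = the surviving (frequent) candidates; increase k
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent
```

The candidate-generation join here is deliberately naive (any union of two frequent k-itemsets of size k+1); a real implementation would use the sorted-prefix join from the Agrawal–Srikant paper.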
You can use any programming language to implement the algorithm. Make sure you write
easy-to-use, well-documented code.
In your implementation, you can assume the following:
• Transactions data is stored in a flat file (not a database). The format of the file is as
  follows: one transaction per line; each line contains the transaction id and a list of items
  bought in that transaction; fields are separated by commas.
• The input for your algorithm should be:
  o a file containing the transactions data, in the specified format
  o the minimum support considered by the user
• The output for your algorithm should be a file containing the list of frequent itemsets
  (itemsets with support higher than or equal to the user-specified minimum support) and the
  computed support for each itemset.
• All itemsets fit in memory, so you can use a hash table or array to count the itemsets.
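Reading the flat file described above could look like this in Python (a sketch only; it assumes lines such as "1,bread,milk" and silently skips blank or malformed lines, which is one possible policy):

```python
def read_transactions(path):
    """Parse the flat file: one transaction per line,
    'transaction_id,item1,item2,...' with comma-separated fields.
    Returns a list of item sets (the transaction ids are dropped)."""
    transactions = []
    with open(path) as f:
        for line in f:
            fields = line.strip().split(',')
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # fields[0] is the transaction id; the rest are the items bought
            transactions.append(set(fields[1:]))
    return transactions
```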
I've created a sample file for the wardroom data we used in Lab 2, and one for the small grocery
example we discussed in class. Download the files from the course calendar and test your
program on both. Also test your program by finding the frequent itemsets for different min
support values, but turn in the results for min support of 5 percent for the wardroom data.
Part 2 (Generate the association rules):
Implement an algorithm (any algorithm that you choose) to generate the association rules based
on an input list of frequent itemsets. Run your algorithm for the frequent itemsets you obtained
in Part 1 (with min support 5%) and minimum confidence 25%.
The input for the algorithm should be:
o the minimum support considered by the user
o the minimum confidence considered by the user
o a list of frequent itemsets and their support
The output for the algorithm should be:
o a file containing all the association rules, with the computed support and confidence
(higher than min confidence)
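One common way to generate the rules, sketched below under the assumption that the frequent itemsets arrive as a dict mapping frozensets to their support (as a fraction of transactions): for each frequent itemset of size 2 or more, try every non-empty proper subset as an antecedent and keep the rule if confidence = support(itemset) / support(antecedent) meets the threshold.

```python
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """frequent: dict mapping frozenset -> support.
    Returns a list of (antecedent, consequent, support, confidence)."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue  # rules need at least one item on each side
        # every non-empty proper subset is a candidate antecedent;
        # its support is in `frequent` by the apriori property
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent,
                                  support, confidence))
    return rules
```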
Part 3 (Efficiency):
Create a copy of your program from Part 1 and remove step b (pruning the candidate itemsets)
from the copy. Run the two programs (the original and the "reduced" copy) on the datasets
provided and measure the running time. Which program is faster? Explain the results.
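For the running-time measurement, a simple wall-clock wrapper is enough (one possible approach; averaging several runs would give more stable numbers):

```python
import time

def time_run(fn, *args):
    """Call fn(*args) once and return (result, elapsed_seconds),
    measured with a monotonic wall-clock timer."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```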
Extra credit: Implement other improvements for the Apriori algorithm. Run it on the test
datasets and compare the running time with the ones obtained using the algorithm implemented
in Part 1. Justify the results.
Turn In
Electronic
Upload to Lab 4 assignment on Blackboard:
o your code for parts 1, 2 and 3,
o the results file containing the frequent itemsets for min support 5 percent on the sample
wardroom file,
o the results file with the generated association rules for min support 5% and min
confidence 25% for the sample wardroom file,
o the results of experiments in Part 3 and the justification
Hard-copy
1. Completed assignment coversheet. Your comments will help us improve this course.
2. A hard-copy of the same files you uploaded: your code, the frequent itemsets, the
association rules, and the experimental results with their justification.