IT462 Lab 4: Association Rules

For this lab, you will implement a data mining algorithm and try it on some sample data sets.

Part 1 (Find the frequent itemsets):

Implement the Apriori algorithm to find the frequent itemsets. This is the algorithm we discussed in class. The algorithm is described in the recommended textbook (Chapter 26.2.1) and in "Fast Algorithms for Mining Association Rules" by Rakesh Agrawal and Ramakrishnan Srikant. Pseudo-code for the algorithm is given below:

    Find the frequent itemsets of size 1 (L1)
    k = 1
    While (Lk not empty) {
        a. Generate candidate itemsets of size k+1 (Ck+1) based on Lk
        b. Prune the list based on the Apriori principle (eliminate candidates that have an infrequent subset)
        c. Scan the transactions to count support for the remaining candidates and eliminate the infrequent ones
        d. Lk+1 = Ck+1
        e. Increase k by 1
    }

You can use any programming language to implement the algorithm. Make sure you write easy-to-use and well-documented code. In your implementation, you can assume the following:

Transaction data is stored in a flat file (not a database). The format of the file is as follows: one transaction per line; each line contains the transaction id and a list of the items bought in that transaction; fields are separated by commas.

The input for your algorithm should be:
o a file containing the transaction data, in the specified format
o the minimum support considered by the user

The output of your algorithm should be a file containing the list of frequent itemsets (itemsets with support higher than or equal to the user-specified minimum support) and the computed support for each itemset.

All itemsets fit in memory, so you can use a hash table or an array to count the itemsets.

I've created a sample file for the wardroom data we used in Lab 2, and one for the small grocery example we discussed in class. Download the files from the course calendar and test your program on both.
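As a rough illustration (not a required design), the pseudo-code above could be implemented in Python along these lines. The names `load_transactions` and `apriori`, and the choice to express minimum support as a fraction of the number of transactions, are assumptions of this sketch, not part of the assignment:

```python
from itertools import combinations

def load_transactions(path):
    """Read the flat file: one transaction per line, fields separated
    by commas, the first field being the transaction id."""
    transactions = []
    with open(path) as f:
        for line in f:
            fields = [x.strip() for x in line.strip().split(",")]
            if len(fields) > 1:
                transactions.append(frozenset(fields[1:]))
    return transactions

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support is
    >= min_support (given as a fraction of the number of transactions)."""
    n = len(transactions)
    # L1: count the 1-itemsets and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        # a. generate candidates of size k+1 by joining frequent k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # b. prune: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k))}
        # c. scan the transactions and count support for the survivors
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # d. Lk+1 = candidates that meet the minimum support
        current = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(current)
        k += 1  # e.
    return frequent
```

For example, on four transactions {a,b,c}, {a,b}, {a,c}, {b,c} with min support 0.5, the sketch reports all three single items (support 0.75) and all three pairs (support 0.5), but not {a,b,c} (support 0.25).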
Also test your program by finding the frequent itemsets for different min support values, but turn in the results for a min support of 5 percent on the wardroom data.

Part 2 (Generate the association rules):

Implement an algorithm (any algorithm of your choice) to generate the association rules from an input list of frequent itemsets. Run your algorithm on the frequent itemsets you obtained in Part 1 (with min support 5%) and a minimum confidence of 25%.

The input for the algorithm should be:
o the minimum support considered by the user
o the minimum confidence considered by the user
o a list of frequent itemsets and their support

The output of the algorithm should be:
o a file containing all the association rules whose confidence is higher than the minimum confidence, with the computed support and confidence for each rule

Part 3 (Efficiency):

Create a copy of your program from Part 1 and remove step b (pruning the candidate itemsets) from the copy. Run the two programs (the original and the "reduced" copy) on the datasets provided and measure the running times. Which program is faster? Explain the results.

Extra credit: Implement other improvements to the Apriori algorithm. Run it on the test datasets and compare the running times with those obtained using the algorithm implemented in Part 1. Justify the results.

Turn In

Electronic: Upload to the Lab 4 assignment on Blackboard:
o your code for Parts 1, 2, and 3
o the results file containing the frequent itemsets for min support 5 percent on the sample wardroom file
o the results file with the generated association rules for min support 5% and min confidence 25% on the sample wardroom file
o the results of the experiments in Part 3 and their justification

Hard-copy:
1. Completed assignment coversheet. Your comments will help us improve this course.
2. A hard copy of the same files you uploaded: your code, the frequent itemsets, the association rules, and the experimental results and their justification.
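For Part 2, one straightforward (if unoptimized) approach is to try every non-empty proper subset of each frequent itemset as a rule antecedent, computing confidence(X -> Y) = support(X ∪ Y) / support(X). This Python sketch assumes the {itemset: support} dictionary produced in Part 1; the name `generate_rules` and the use of a >= comparison against the confidence threshold are assumptions of the sketch:

```python
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """Generate rules X -> Y from a {frozenset: support} dict.
    Returns tuples (antecedent, consequent, support, confidence)."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        # try every non-empty proper subset as the antecedent X
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                # confidence(X -> Y) = support(X u Y) / support(X);
                # support(X) is already in the dict by the Apriori principle
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, supp, confidence))
    return rules
```

For instance, given supports {a}: 0.75, {b}: 0.5, {a,b}: 0.5 and min confidence 0.6, the sketch emits a -> b (confidence 2/3) and b -> a (confidence 1.0).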