Data Mining Assignment III
Association Rule Mining
November 26, 2008

1 Background

The purpose of this programming assignment is to implement a technique for finding frequent itemsets and, using this information, to generate the association rules whose support and confidence are above given minimum thresholds. The frequent itemset mining implementation can be either Apriori based [1, 2] or frequent pattern growth based [3, 4]. To be correct and complete, in both cases it has to find all the frequent itemsets and correctly determine the support of each.

2 Basic assignment

Assume that an input file named transactions.txt consists of text that looks as follows:

    1 3 4
    1 2 3 5
    2 3 5
    2 5
    1 2 3 6

In the file, blanks separate items (identified by integers) and new lines separate transactions. For example, the above file contains information about a total of 5 transactions, and its second transaction consists of 4 items.

Your task is to write a program, in your favorite programming language (see footnote 1), that takes as parameters the minimum support and the minimum confidence (given as floating point numbers in the range [0..1]) and the name of the transaction file (whose format is that of the file transactions.txt above). The program must produce all association rules that can be mined from the transaction file and that satisfy the minimum support and confidence requirements. The rules should be output sorted first by the number of items that they contain (in decreasing order), then by confidence, and finally by support (both also in decreasing order). An example of a possible session using your program on the data of the file transactions.txt above is given in Figure 1.

Note: If it makes your life any easier, you can assume that item numbers are integers in the range [0..2^16 - 1] and that items appear once per transaction and sorted (as above). However, you cannot make any assumptions about the number of transactions that the file may contain.
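
The task described above can be sketched roughly as follows. This is a minimal, unoptimized illustration of the Apriori-style variant, not a reference solution: the function names (parse_transactions, frequent_itemsets, association_rules) and the SAMPLE constant are hypothetical, the sample data are five transactions chosen to be consistent with the description of transactions.txt and with the supports shown in Figure 1, and the sketch assumes that all transactions fit in memory, which a real submission may not be able to rely on.

```python
from collections import Counter
from itertools import combinations


def parse_transactions(text):
    """One transaction per non-empty line; items are blank-separated integers."""
    return [frozenset(map(int, line.split()))
            for line in text.splitlines() if line.strip()]


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: absolute count}."""
    n = len(transactions)
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    level = {s: c for s, c in counts.items() if c / n >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        # Count step: one pass over the transactions per level.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        level = {s: c for s, c in counts.items() if c / n >= min_support}
        frequent.update(level)
        k += 1
    return frequent


def association_rules(frequent, n, min_confidence):
    """Return sorted (antecedent, consequent, confidence, support) tuples."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                # lhs is a subset of a frequent itemset, hence frequent itself.
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((sorted(lhs), sorted(itemset - lhs),
                                  confidence, count / n))
    # Decreasing rule size, then decreasing confidence, then decreasing support.
    rules.sort(key=lambda r: (-(len(r[0]) + len(r[1])), -r[2], -r[3]))
    return rules


# Assumed sample data, consistent with the supports reported in Figure 1.
SAMPLE = """1 3 4
1 2 3 5
2 3 5
2 5
1 2 3 6
"""

if __name__ == "__main__":
    transactions = parse_transactions(SAMPLE)
    frequent = frequent_itemsets(transactions, 0.25)
    found = association_rules(frequent, len(transactions), 0.58)
    print(f"found a total of {len(found)} association rules")
    for lhs, rhs, conf, supp in found:
        print(" ".join(map(str, lhs)), "==>", " ".join(map(str, rhs)),
              f"{conf:.3f}", supp)
```

Keeping absolute counts rather than fractions in the itemset table means that confidence is computed as a ratio of two integers, which avoids some floating point noise when rules are compared during sorting. The order of rules that tie on all three sort keys is left unspecified here, as it is in the assignment text.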

Footnote 1: Any programming language is accepted, but please avoid using MATLAB: it is very slow for this task.

    > myApriori -s 0.25 -c 0.58 transactions.txt
    Mined file transactions.txt and found a total of 16 association rules:
    ==================================
    Rule          Confidence  Support
    ==================================
    1 2 ==> 3     1           0.4
    3 5 ==> 2     1           0.4
    1 ==> 2 3     0.666       0.4
    1 3 ==> 2     0.666       0.4
    2 3 ==> 1     0.666       0.4
    5 ==> 2 3     0.666       0.4
    2 3 ==> 5     0.666       0.4
    2 5 ==> 3     0.666       0.4
    1 ==> 3       1           0.6
    5 ==> 2       1           0.6
    3 ==> 1       0.75        0.6
    2 ==> 3       0.75        0.6
    3 ==> 2       0.75        0.6
    2 ==> 5       0.75        0.6
    5 ==> 3       0.666       0.4
    1 ==> 2       0.666       0.4

Figure 1: Possible session of using the program myApriori.

3 Examination

As in all assignments of this course, the work is to be done in groups of 2-3 people. Single-person groups are not allowed.

There will be an oral exam for this assignment. You will be expected to present your solution to an instructor and answer the instructor's questions about the assignment. You are also expected to run your program during the examination, so all solutions should be runnable on the UNIX machines in the computer labs. For further information about the examination proceedings, refer to the general instructions for lab examination on this course.

Choose a time for the oral exam by signing up your group on the list posted outside of our office, P1316.

3.1 Preparation for the oral exam

Before the oral exam you shall have written a program that performs the Apriori algorithm on a file of transactions. It shall be runnable on the UNIX workstations, and it must produce the correct answer for the example data that is published on the website.

In addition to having written the program, you should be prepared to describe the design of your program and be ready to discuss your choice of data structures and other implementation issues.

Good luck!

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases.
In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp. 207-216, May 1993.
[2] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, September 1994.
[3] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1-12, May 2000.
[4] X. Shang, K.-U. Sattler, and I. Geist. Efficient Frequent Pattern Mining in Relational Databases. 5. Workshop des GI-Arbeitskreis Knowledge Discovery (AK KD) im Rahmen der LWA, 2004.