Data Mining Assignment III
Association Rule Mining
November 26, 2008
1 Background
The purpose of this programming assignment is to implement a technique for finding frequent itemsets and, using this information, to generate the association rules whose support and confidence are above given minimum thresholds. The frequent itemset mining implementation can be either Apriori-based [1, 2] or frequent-pattern-growth-based [3, 4]. In both cases, to be correct and complete, it must find all frequent itemsets and correctly determine the support of each.
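
As an illustration of the Apriori-based option, here is a minimal level-wise sketch in Python. It is not a reference solution: the function name `apriori`, the in-memory list of transactions, and the naive counting loop are all assumptions made for brevity. Support is taken to be the fraction of transactions that contain the itemset.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]

    def frequent_among(candidates):
        # Count every candidate in one pass over the transactions,
        # then keep only those that reach the support threshold.
        counts = dict.fromkeys(candidates, 0)
        for t in tsets:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: single items.
    level = frequent_among({frozenset([i]) for t in tsets for i in t})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = frequent_among(candidates)
        frequent.update(level)
        k += 1
    return frequent
```

The prune step relies on the fact that every subset of a frequent itemset is itself frequent, so a candidate with an infrequent subset never needs to be counted. A real submission would likely replace the nested counting loop with a faster structure such as a hash tree or prefix tree.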
2 Basic assignment
Assume that an input file named transactions.txt consists of text that looks as follows:
1 3 4
1 2 3 5
2 3 5
2 5
1 2 3 6
In the file, blanks separate items (identified by integers) and new lines separate transactions. For
example, the above file contains information about a total of 5 transactions and its second transaction consists of 4 items.
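
A small Python sketch of reading this format follows. The function name `read_transactions` is hypothetical, and the sample text is an assumption: five transactions chosen to be consistent with the supports shown in Figure 1.

```python
def read_transactions(lines):
    """Parse an iterable of text lines into a list of item lists."""
    transactions = []
    for line in lines:
        # Blanks separate items; each item is an integer.
        items = [int(token) for token in line.split()]
        if items:  # ignore empty lines
            transactions.append(items)
    return transactions

# Assumed sample in the described format.
sample = """1 3 4
1 2 3 5
2 3 5
2 5
1 2 3 6
"""
transactions = read_transactions(sample.splitlines())
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `with open("transactions.txt") as f: read_transactions(f)`.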
Your task is to write a program, in your favorite programming language (see footnote 1), that takes as parameters the minimum support and minimum confidence (given as floating point numbers in the range [0..1]) and the name of a transaction file (in the same format as the file transactions.txt above). The program must produce all association rules that can be mined from the transaction file and that satisfy the minimum support and confidence requirements. The rules should be output sorted first by the number of items they contain, then by confidence, and finally by support, all in decreasing order. An example of a possible session using your program on the data of file transactions.txt above is given in Figure 1.
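
One way to accept these parameters is sketched below with Python's standard `argparse` module. The short flags `-s` and `-c` follow the session in Figure 1. The long option names and the helper names `unit_interval` and `build_parser` are assumptions introduced for this example.

```python
import argparse

def unit_interval(text):
    """Convert a command-line string to a float and require 0 <= value <= 1."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"{text} is not in the range [0..1]")
    return value

def build_parser():
    parser = argparse.ArgumentParser(prog="myApriori")
    parser.add_argument("-s", "--min-support", type=unit_interval,
                        required=True, help="minimum support in [0..1]")
    parser.add_argument("-c", "--min-confidence", type=unit_interval,
                        required=True, help="minimum confidence in [0..1]")
    parser.add_argument("filename", help="transaction file to mine")
    return parser

# Parse an explicit argument list here; a real program would call
# parse_args() with no arguments so that sys.argv is used instead.
args = build_parser().parse_args(["-s", "0.25", "-c", "0.58", "transactions.txt"])
```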
Note: If it makes your life any easier, you can assume that item numbers are integers in the range [0..2^16 − 1] and that each item appears at most once per transaction, with items sorted in increasing order (as above). However, you cannot make any assumptions about the number of transactions that the file may contain.
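
To illustrate how these two assumptions can interact, the sketch below counts single-item occurrences in one streaming pass. The bounded item range allows a fixed-size counter array, while reading line by line avoids holding an unbounded number of transactions in memory. The names `MAX_ITEMS` and `count_items` are hypothetical, and this is only one possible design.

```python
from array import array

MAX_ITEMS = 1 << 16  # items assumed to lie in [0, 2^16 - 1]

def count_items(lines):
    """Return (number of transactions, per-item occurrence counts)."""
    counts = array("L", [0]) * MAX_ITEMS  # one unsigned counter per item
    n = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # ignore empty lines
        n += 1
        for token in tokens:
            counts[int(token)] += 1
    return n, counts
```

Dividing `counts[i]` by the returned transaction count gives the support of the single item `i`, which is the first level of an Apriori-style search.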
Footnote 1: Any programming language is accepted, but please avoid using MATLAB: it is very slow for this task.
> myApriori -s 0.25 -c 0.58 transactions.txt
Mined file transactions.txt
and found a total of 16 association rules:
==================================
Rule          Confidence   Support
==================================
1 2 ==> 3     1            0.4
3 5 ==> 2     1            0.4
1 ==> 2 3     0.666        0.4
1 3 ==> 2     0.666        0.4
2 3 ==> 1     0.666        0.4
5 ==> 2 3     0.666        0.4
2 3 ==> 5     0.666        0.4
2 5 ==> 3     0.666        0.4
1 ==> 3       1            0.6
5 ==> 2       1            0.6
3 ==> 1       0.75         0.6
2 ==> 3       0.75         0.6
3 ==> 2       0.75         0.6
2 ==> 5       0.75         0.6
5 ==> 3       0.666        0.4
1 ==> 2       0.666        0.4
Figure 1: Possible session of using the program myApriori.
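
The rule counts and ordering in Figure 1 can be cross-checked with a brute-force sketch such as the one below. It deliberately does not use Apriori or pattern growth: it enumerates every item subset, which is exponential in the number of distinct items and is only usable on tiny inputs. The function name `mine_rules` and the tuple layout `(lhs, rhs, confidence, support)` are assumptions.

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence):
    """Brute-force rule mining; returns sorted (lhs, rhs, conf, sup) tuples."""
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*tsets))

    def count(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in tsets if itemset <= t)

    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            whole = count(frozenset(combo))
            if whole / n < min_support:
                continue
            # Try every non-empty proper subset as the left-hand side.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    conf = whole / count(frozenset(lhs))
                    if conf >= min_confidence:
                        rhs = tuple(i for i in combo if i not in lhs)
                        rules.append((lhs, rhs, conf, whole / n))
    # Sort by item count, then confidence, then support, all decreasing.
    rules.sort(key=lambda r: (-(len(r[0]) + len(r[1])), -r[2], -r[3]))
    return rules
```

Confidence is computed as the count of the whole itemset divided by the count of the left-hand side, which equals the ratio of their supports. The order of rules that tie on all three sort keys is not specified by the assignment, so it may differ from Figure 1.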
3 Examination
As in all assignments of this course, the work is to be done in groups of 2-3 people. Single-person groups are not allowed. There will be an oral exam for this assignment. You will be expected to present your solution to an instructor and answer the instructor's questions about the assignment. You are also expected to run your program during the examination, so all solutions should be runnable on the UNIX machines in the computer labs.
For further information about the examination proceedings refer to the general instructions for
lab examination on this course.
Choose a time for the oral exam by signing up your group on the list posted outside our office, P1316.
3.1 Preparation for the oral exam
Before the oral exam you shall have written a program that performs the Apriori algorithm on a file of transactions. It shall be runnable on the UNIX workstations and it must produce the correct answer for the example data that is published on the website.
In addition to having written the program you should be prepared to describe the design of your
program and be ready to discuss your choice of data structures and other implementation issues.
Good luck!
References
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pp. 207-216, May 1993.
[2] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of
the 20th International Conference on Very Large Databases, pp. 487-499, September 1994.
[3] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1-12, May 2000.
[4] X. Shang, K.-U. Sattler, and I. Geist. Efficient Frequent Pattern Mining in Relational
Databases. 5. Workshop des GI-Arbeitskreis Knowledge Discovery (AK KD) im Rahmen der
LWA, 2004.