Association Rule Mining
Muhammad Ali Yousuf
http://dsc.itmorelia.edu.mx/~maliyusuf/
[email protected]

Contents
Part (A): Maximal frequent itemsets, generation of association rules
Part (B): ARMiner software
Part (C): Horizontal vs. vertical database layout
Part (D): Frequent closed itemsets

Rule:
1. If a set of items (itemset) is not frequent, no superset of this itemset can be frequent.
2. If a set of items (itemset) is frequent, every subset of this itemset will be frequent.

Now we define the concept of a "maximal frequent itemset": a maximal frequent itemset is a frequent itemset which is not a subset of any other frequent itemset, hence the term maximal.

Another property of a maximal frequent itemset is that all of its subsets will be frequent. Hence, if we can modify our Apriori algorithm to identify only the maximal frequent itemsets, that will be a great optimisation. Here we work through an example of extracting the maximal frequent itemsets.

We have the following transactions:

Trans Id   Items
1          A C T W
2          C D W
3          A C T W
4          A C D W
5          A C D T W
6          C D T

min_support = 50% (i.e. a pattern must occur in at least 3 of the 6 transactions to be considered frequent).

Now we generate the frequent itemset lattice in a bottom-up approach. (We have omitted the lines showing the formation of each k-itemset, i.e. itemset with k members, for simplicity.)

Explanation
Here we start from the empty set and go up each level, retaining the itemsets which are frequent (at least 3 out of 6 occurrences) and throwing away the rest. As per the Apriori algorithm we will throw away the following itemsets, since their support is < 3: AD and DT at level 2, and CDT at level 3.

Rule
At each level, whenever we see that a frequent itemset is a superset of another frequent itemset at a lower level, we rule out that subset (itemset), since the superset by definition includes it. Hence, while generating the association rules, we will consider all the proper subsets of the surviving frequent itemsets (which will be the maximal ones). E.g. at level 3, the itemset "ACT" is formed by joining AC and AT, hence we can rule out the itemsets "AC" and "AT" from the lower level. If we continue this process, we will be left with only 2 maximal frequent itemsets: ACTW and CDW.

To summarise what we have done so far:
Step 1: Find the frequent itemsets. We have found the maximal frequent itemsets; all possible subsets of these will be frequent.
Step 2: Generate all possible association rules:
(A) Look at all frequent itemsets.
(B) For each frequent itemset X, look at all its proper subsets Y such that Y is neither empty nor equal to X.

Observation: For each frequent k-itemset there are 2^k - 2 such proper subsets, hence that many candidate rules to test. To test a rule Y ==> X - Y we check whether the confidence of the rule >= min_confidence.

Let us take an example: let X = CDW, so k = 3 and the number of proper subsets is 2^3 - 2 = 6. The candidate rules and their confidences (computed from the transactions above, with supp(CDW) = 3) are:

C  ==> DW   3/6 = 50%
D  ==> CW   3/4 = 75%
W  ==> CD   3/5 = 60%
CD ==> W    3/4 = 75%
CW ==> D    3/5 = 60%
DW ==> C    3/3 = 100%

Thus we have seen why it is useful to identify the maximal frequent itemsets and how to generate all possible association rules from them.
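The two steps above can be expressed compactly in code. The following is a minimal Python sketch, not ARMiner's implementation: it enumerates the frequent itemsets level by level (checking support directly rather than using the full Apriori candidate pruning), keeps the maximal ones, and then tests the 2^k - 2 candidate rules of one maximal itemset. The names transactions, min_support, min_confidence and support() are just illustrative.

from itertools import combinations

# The six example transactions from above.
transactions = [
    {"A", "C", "T", "W"},
    {"C", "D", "W"},
    {"A", "C", "T", "W"},
    {"A", "C", "D", "W"},
    {"A", "C", "D", "T", "W"},
    {"C", "D", "T"},
]
min_support = 3       # 50% of 6 transactions
min_confidence = 0.7

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Bottom-up enumeration of all frequent itemsets, level by level.
items = sorted(set().union(*transactions))
frequent = []
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
while level:
    frequent.extend(level)
    # Join step: combine size-k itemsets into size-(k+1) candidates.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]

# A frequent itemset is maximal if it has no frequent proper superset.
maximal = [x for x in frequent if not any(x < y for y in frequent)]
print("Maximal frequent itemsets:", [sorted(m) for m in maximal])

# Rule generation from one maximal itemset X: every non-empty proper
# subset Y gives a candidate rule Y ==> X - Y (2^k - 2 of them).
X = frozenset({"C", "D", "W"})
for k in range(1, len(X)):
    for Y in map(frozenset, combinations(sorted(X), k)):
        conf = support(X) / support(Y)
        if conf >= min_confidence:
            print(f"{sorted(Y)} ==> {sorted(X - Y)}  confidence {conf:.2f}")

On this data the sketch reports ACTW and CDW as the maximal frequent itemsets, and with min_confidence = 0.7 only D ==> CW, CD ==> W and DW ==> C survive among the rules generated from CDW.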
Introduction to ARMiner
Introduction to Data Storage
We will take a small example of supermarket data. Suppose the items sold by a (very, very small) shop are green apples, red apples, oranges, bananas, and grapes. Also suppose that this morning you had three customers: one bought green apples and grapes, one bought only oranges, and the last one bought oranges and grapes.

This activity can be represented in the .asc format as follows:

1 green apples
2 red apples
3 oranges
4 bananas
5 grapes
BEGIN_DATA
1 5
3
3 5
END_DATA

There are two distinct parts to this file. The first contains a listing of all the items you can sell, or otherwise said, of all the items that could participate in a transaction. This part is:

1 green apples
2 red apples
3 oranges
4 bananas
5 grapes

The format is pretty simple. Each line must consist of a positive number followed by a string (which can contain blank spaces). It is important that the numbers be assigned in increasing order starting from 1. Empty lines are allowed to appear in this section. This section enumerates all the entities described by the data, between which ARMiner will later look for association rules.

The second part consists of the actual data:

BEGIN_DATA
1 5
3
3 5
END_DATA

In our case we had 3 transactions, and each is represented on a separate line. The first transaction involved green apples and grapes; they are represented by the numbers assigned to them in the first section, that is 1 for green apples and 5 for grapes. Note that this section must be enclosed between a BEGIN_DATA and an END_DATA line. Anything appearing after the END_DATA line will be ignored. Blank lines are allowed to appear in this section.

Note that although the numbers appearing in each line are sorted, this is not required by the format. You can list the numbers in any order and the file will still be processed correctly; however, we suggest always listing the numbers in a transaction in increasing order, as this lets asc2db process the file more efficiently.
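Because the .asc layout is so simple, it is easy to read programmatically. Below is a minimal Python sketch of such a reader; it is not ARMiner's asc2db tool, and the file name shop.asc is just a hypothetical name standing in for the file listed above.

def read_asc(path):
    """Return (item_names, transactions) from an .asc-style file."""
    item_names = {}      # item number -> item name
    transactions = []    # each transaction is a set of item numbers
    in_data = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue              # blank lines are allowed in both sections
            if line == "BEGIN_DATA":
                in_data = True
                continue
            if line == "END_DATA":
                break                 # anything after END_DATA is ignored
            if in_data:
                transactions.append({int(tok) for tok in line.split()})
            else:
                number, name = line.split(maxsplit=1)
                item_names[int(number)] = name
    return item_names, transactions

# Example: the tiny supermarket file from above, saved as shop.asc.
names, txns = read_asc("shop.asc")
for t in txns:
    print(sorted(names[i] for i in t))

For the file above this prints the three baskets, e.g. ['grapes', 'green apples'] for the first customer.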
Census data

SSN#   Age   Sex   Married   Num_kids   Income
006    26    M     No        0          $25,000
345    54    F     Yes       2          $55,000
743    37    M     Yes       1          $80,000

What Can You Do With It? Let's Look at Each Column:

SSN#: this is unique for each entry, so there is no sense in looking for association rules involving SSN#, at least not in this data, since each SSN# appears only once in the whole dataset. We can simply ignore this field for mining purposes.

Age: this attribute can take a variety of values. ARMiner cannot handle such attributes easily; in fact, it only considers binary attributes. We need to discretize this attribute, replacing for example ages 0-21 with "very young age", 22-35 with "young age", 36-55 with "middle age", etc.

Sex: this has two values, "male" and "female", so we could create two attributes out of it.

Married: again we can create two attributes, "married" and "not married".

Num_kids: this also has to be discretized, maybe into "no kids", "one kid", and "several kids".

Income: we could also discretize this into "small", "average", and "high".

The discretization should be made such that it clearly identifies the ranges that are of interest to the person who will be mining this data.

With these changes we could represent the above data in .asc format as:

1 very young age
2 young age
3 middle age
4 old age
5 male
6 female
7 married
8 not married
9 no kids
10 one kid
11 several kids
12 small income
13 average income
14 high income
BEGIN_DATA
2 5 8 9 12
3 6 7 11 13
3 5 7 10 14
END_DATA

How To Start ARMiner
For a quick start, first launch the server:

java -jar Server.jar

and then launch the client application:

java -jar Client.jar

You can log in as admin with the password renimra.

Support and Confidence
Support: the fraction of transactions T supporting an itemset X with respect to database D is called the support of X,

supp(X) = |{T ∈ D | X ⊆ T}| / |D|

The support of a rule X => Y is defined as

supp(X => Y) = supp(X ∪ Y)

Confidence: the confidence of this rule is defined as

conf(X => Y) = supp(X ∪ Y) / supp(X)
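These definitions translate directly into code. The sketch below is a short, hypothetical Python illustration (not ARMiner code), evaluated on the small supermarket database from the .asc example, with items identified by their numbers (1 = green apples, 3 = oranges, 5 = grapes); the rule tested, oranges => grapes, is chosen only for illustration.

# The database D of the three supermarket transactions.
D = [{1, 5}, {3}, {3, 5}]

def supp(X):
    """supp(X) = |{T in D : X is a subset of T}| / |D|"""
    return sum(1 for T in D if X <= T) / len(D)

def conf(X, Y):
    """conf(X => Y) = supp(X union Y) / supp(X)"""
    return supp(X | Y) / supp(X)

X, Y = {3}, {5}                        # rule: oranges => grapes
print("supp(X => Y) =", supp(X | Y))   # 1/3: one of the three baskets contains both
print("conf(X => Y) =", conf(X, Y))    # 0.5: half of the orange buyers also bought grapes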