Discriminative Pattern Mining
By
Mohammad Hossain
Based on the paper
Mining Low-Support Discriminative Patterns
from Dense and High-Dimensional Data
by
1. Gang Fang
2. Gaurav Pandey
3. Wen Wang
4. Manish Gupta
5. Michael Steinbach
6. Vipin Kumar
What is a Discriminative Pattern?
• A pattern is said to be discriminative when its occurrence in two data sets (or in two different classes of a single data set) is significantly different.
• One way to measure the discriminative power of a pattern is to find the difference between the supports of the pattern in the two data sets.
• When this support difference (DiffSup) is greater than a threshold, the pattern is called discriminative.
An example
D+
Transaction-id  Items
10              A, C
20              B, C
30              A, B, C
40              A, B, C, D

D-
Transaction-id  Items
10              A, B
20              A, C
30              A, B, E
40              A, C, D
Pattern  Support in D+  Support in D-  DiffSup
A        3              4              1
B        3              2              1
C        4              2              2
AB       2              2              0
AC       3              2              1
ABC      2              0              2
If we use a DiffSup threshold of 2, then the patterns C and ABC become interesting patterns.
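The DiffSup computation for this example can be sketched in a few lines of Python (the data-set and function names are my own):

```python
# The two toy data sets from the example above (D+ and D-).
d_plus  = [{"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]
d_minus = [{"A", "B"}, {"A", "C"}, {"A", "B", "E"}, {"A", "C", "D"}]

def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def diff_sup(pattern):
    """Absolute difference of the pattern's supports in D+ and D-."""
    return abs(support(pattern, d_plus) - support(pattern, d_minus))

# With a threshold of 2, only C and ABC qualify as discriminative.
for p in [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"A", "B", "C"}]:
    print("".join(sorted(p)), support(p, d_plus), support(p, d_minus), diff_sup(p))
```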
Importance
• Discriminative patterns have been shown to be useful for improving classification performance on data sets where combinations of features have better discriminative power than the individual features.
• For example, for biomarker discovery from case-control data (e.g., disease vs. normal samples), it is important to identify groups of biological entities, such as genes and single-nucleotide polymorphisms (SNPs), that are collectively associated with a certain disease or other phenotypes.
P1 = {i1, i2, i3}
P2 = {i5, i6, i7}
P3 = {i9, i10}
P4 = {i12, i13, i14}

[Figure: item-by-instance view of the patterns P1–P4 over items i1, i2, i3, i5, i6, i7, i9, i10, i12, i13, i14 in classes C1 and C2]

Pattern  C1  C2  DiffSup
P1       6   0   6
P2       6   6   0
P3       3   3   0
P4       9   2   7
DiffSup is NOT anti-monotonic. As a result, it will not work in an Apriori-like framework.
Apriori: A Candidate Generation-and-Test Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
• Method:
– Initially, scan DB once to get the frequent 1-itemsets
– Generate length (k+1) candidate itemsets from length k
frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated
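A minimal sketch of this candidate generation-and-test loop, assuming transactions are represented as sets of items (function and variable names are my own):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets as {frozenset: support}."""
    items = {i for t in transactions for i in t}
    freq = {}
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    k = 1
    while current:
        # Test the candidates against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: s for c, s in counts.items() if s >= min_sup}
        freq.update(level)
        # Generate (k+1)-candidates from frequent k-itemsets, pruning
        # any candidate that has an infrequent k-subset.
        keys = list(level)
        cand = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        current = [c for c in cand
                   if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return freq

# The slide example: Supmin = 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, 2)
```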
The Apriori Algorithm—An Example
Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

Supmin = 2

C1 (after 1st scan):
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 (frequent 1-itemsets):
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (candidates): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (after 2nd scan):
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2 (frequent 2-itemsets):
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (candidates): {B, C, E}

L3 (after 3rd scan):
Itemset    sup
{B, C, E}  2
Pattern  Support in D+  Support in D-  DiffSup
A        3              4              1
B        3              2              1
C        4              2              2
AB       2              2              0
AC       3              2              1
ABC      2              0              2
• But here we see that, although the patterns AB and AC both have DiffSup < threshold (2), their superset ABC has DiffSup = 2, which equals the threshold and thus becomes interesting. So AB and AC cannot be pruned.
BASIC TERMINOLOGY AND PROBLEM DEFINITION
• Let D be a dataset with a set of m items, I = {i1, i2, ..., im}, and two class labels S1 and S2. The instances of classes S1 and S2 are denoted by D1 and D2, so |D| = |D1| + |D2|.
• For a pattern (itemset) α = {α1, α2, ..., αl}, the sets of instances in D1 and D2 that contain α are denoted by Dα1 and Dα2.
• The relative supports of α in classes S1 and S2 are
RelSup1(α) = |Dα1|/|D1| and RelSup2(α) = |Dα2|/|D2|
• The absolute difference of the relative supports of α in D1 and D2 is denoted as
DiffSup(α) = |RelSup1(α) − RelSup2(α)|
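These definitions translate directly into code; a short sketch under the same notation (the function and variable names are my own):

```python
def rel_sup(pattern, instances):
    """RelSup: fraction of a class's instances that contain the pattern."""
    return sum(1 for t in instances if pattern <= t) / len(instances)

def diff_sup(pattern, d1, d2):
    """DiffSup(alpha) = |RelSup1(alpha) - RelSup2(alpha)|."""
    return abs(rel_sup(pattern, d1) - rel_sup(pattern, d2))

# The earlier example, now in relative terms:
# DiffSup({C}) = |4/4 - 2/4| = 0.5
d1 = [{"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]
d2 = [{"A", "B"}, {"A", "C"}, {"A", "B", "E"}, {"A", "C", "D"}]
```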
New functions
• Some new functions are proposed that have the anti-monotonic property and can be used for pruning in an Apriori-like framework.
• One of them is BiggerSup, defined as:
BiggerSup(α) = max(RelSup1(α), RelSup2(α))
• BiggerSup is anti-monotonic and is an upper bound of DiffSup, so it can be used for pruning in an Apriori-like framework.
• BiggerSup is a weak upper bound of DiffSup.
• For instance, in the previous example, if we use it to find discriminative patterns with threshold 4:
– P3 can be pruned, because it has a BiggerSup of 3.
– P2 cannot be pruned (BiggerSup(P2) = 6), even though it is not discriminative (DiffSup(P2) = 0).
• More generally, BiggerSup-based pruning can only prune infrequent non-discriminative patterns with relatively low support, but not frequent non-discriminative patterns.
A new measure: SupMaxK
• The SupMaxK of an itemset α in D1 and D2 is defined as
SupMaxK(α) = RelSup1(α) − max β⊆α (RelSup2(β)), where |β| = K
• If K = 1, it is called SupMax1 and is defined as
SupMax1(α) = RelSup1(α) − max a∈α (RelSup2({a}))
• Similarly, with K = 2 we can define SupMax2, which is also called SupMaxPair.
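The definition above can be sketched directly, with the maximum taken over all size-K subsets of the pattern (a sketch; the function names are my own):

```python
from itertools import combinations

def rel_sup(pattern, instances):
    """Fraction of a class's instances that contain the pattern."""
    return sum(1 for t in instances if pattern <= t) / len(instances)

def sup_max_k(alpha, d1, d2, k):
    """SupMaxK(alpha) = RelSup1(alpha)
    - max over all size-k subsets beta of alpha of RelSup2(beta)."""
    alpha = set(alpha)
    return rel_sup(alpha, d1) - max(
        rel_sup(set(beta), d2) for beta in combinations(alpha, k))

def sup_max_pair(alpha, d1, d2):
    """SupMax2, the K = 2 special case."""
    return sup_max_k(alpha, d1, d2, 2)

# The earlier toy data sets, for trying the measure out:
d1 = [{"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}]
d2 = [{"A", "B"}, {"A", "C"}, {"A", "B", "E"}, {"A", "C", "D"}]
```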
Properties of the SupMaxK Family
Relationship between DiffSup, BiggerSup and the
SupMaxK Family
SupMaxPair: A Special Member Suitable for
High-Dimensional Data
• In SupMaxK, as K increases, we obtain a more complete set of discriminative patterns.
• But as K increases, the complexity of computing SupMaxK also increases.
• In fact, the complexity of computing SupMaxK is O(m^K).
• So for high-dimensional data (where m is large), a high value of K (K > 2) makes it infeasible.
• In that case, SupMaxPair (K = 2) can be used.