Data Mining: Association Rules (Dr. S. C. Shirwaikar)

Association Rules
- Association rules are used to show relationships between data items.
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

Mining Frequent Patterns
Motivation: finding inherent regularities in data.
- What products were often purchased together? Milk and bread.
- What products are purchased one after the other? A PC followed by a digital camera; a TV set followed by a VCD player.
- Is there a structure defining relationships among the items purchased?
  - a tree structure defining dependencies
  - lattices defining some order on the items bought

Applications
- Market basket analysis, cross-market analysis, catalog design, sale campaign analysis, web log (click stream) analysis, DNA sequence analysis.
- Frequent pattern mining forms the foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Classification: associative classification
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Cluster analysis: frequent pattern-based clustering

Market Basket Analysis
- It analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets.
- It helps retailers in:
  - developing marketing strategies
  - advertising strategies
  - planning their shelf space
  - preparing store layouts (proximity of items)
  - planning sales of non-moving items
  - planning discounts, offers, etc.

- Each basket can be represented by a Boolean vector.
- These vectors can be analyzed to obtain buying patterns.
- Buying patterns can be represented by association rules.
- Support and confidence are two measures of a rule's interestingness; they reflect the usefulness and certainty of discovered rules.
- The support of an item (or set of items) is the percentage of transactions in which that item occurs.
- Computing support for every set of items is problematic, because the number of subsets increases exponentially with the number of items.

Association Rule
- An association rule is an implication of the form X => Y, where X and Y are sets of items (itemsets) and X ∩ Y is empty.
- The support (s) of an association rule X => Y is the percentage of transactions in the database that contain X ∪ Y.
- The confidence or strength of an association rule X => Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X.
- The support count is the absolute support, while the corresponding probability is the relative support.

Example
Tid | Items bought
10  | A, B, D
20  | A, C, D
30  | A, D, E
40  | B, E, F
50  | B, C, D, E, F

[Figure: Venn diagram of customers buying X, customers buying Y, and customers buying both]

For an itemset X = {x1, ..., xk} and an association rule X => Y:
- support s is the probability that a transaction contains X ∪ Y: support(X => Y) = P(X ∪ Y)
- confidence c is the conditional probability that a transaction containing X also contains Y: confidence(X => Y) = P(Y|X) = P(X ∪ Y)/P(X)

Association rules: A => D (60%, 100%), D => A (60%, 75%)
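As a quick illustration of these definitions (not part of the original slides), the support and confidence of the A => D and D => A rules can be computed directly from the transaction table. This is a minimal sketch in plain Python, with transactions represented as sets; the helper names (`support`, `confidence`) are chosen for illustration.

```python
# Transaction table from the example above
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "D"}, transactions))        # 0.6  -> 60%
print(confidence({"A"}, {"D"}, transactions))   # 1.0  -> 100%, so A => D (60%, 100%)
print(confidence({"D"}, {"A"}, transactions))   # 0.75 -> 75%,  so D => A (60%, 75%)
```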
- An itemset that contains k items is a k-itemset.
- The occurrence frequency of an itemset, or its support count, is the number of transactions that contain the itemset.
- A minimum support threshold is specified; a large (frequent) itemset is one whose support count is above the threshold.
- Every subset of a large itemset is also large.
- Generating association rules is a two-step process:
  1. Find all large (frequent) itemsets.
  2. Generate strong association rules from the frequent itemsets, i.e. rules satisfying minimum support and minimum confidence.
- confidence(A => B) = P(B|A) = support(A ∪ B)/support(A)
- The support counts of A, B and A ∪ B are required to determine the association rules A => B and B => A.

Generating rules from frequent itemsets
- For each frequent itemset l, generate all nonempty subsets of l.
- For every nonempty proper subset s of l, output the association rule s => (l − s) if support_count(l)/support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

Example (min_sup = 2):
Tid | Items
10  | A, B, E
20  | B, E
30  | B, C
40  | A, B, D
50  | A, C
60  | B, C
70  | A, C
80  | A, B, C, E
90  | A, B, C

Consider l = {A, B, E}. Its nonempty proper subsets are {A,B}, {A,E}, {B,E}, {A}, {B}, {E}.
{A,B} => E    confidence = 2/4 = 50%
{A,E} => B    confidence = 2/2 = 100%
{B,E} => A    confidence = 2/3 = 66%
{A} => {B,E}  confidence = 2/6 = 33%
{B} => {A,E}  confidence = 2/7 = 28%
{E} => {A,B}  confidence = 2/3 = 66%
If the minimum confidence threshold is 70%, then {A,E} => B is the only association rule.

Basic Algorithms: the Apriori Algorithm
- It is based on the large-itemset, or Apriori, property.
- Apriori property: all nonempty subsets of a frequent itemset must also be frequent; large itemsets are downward closed.
- If we know that an itemset is small (infrequent), we need not consider its supersets as candidates, because they will also be small.
- Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets:
  • Initially, scan the DB once to get the frequent 1-itemsets.
  • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
  • Test the candidates against the DB.
  • Terminate when no frequent or candidate set can be generated.

Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested.
Method: Lk denotes the set of frequent k-itemsets (large itemsets); Ck is a superset of Lk (the candidate large itemsets).

Example (min_sup = 2), 1st scan over the transaction table above:
C1: {A}: 6, {B}: 7, {C}: 6, {D}: 1, {E}: 3
L1: {A}: 6, {B}: 7, {C}: 6, {E}: 3

A two-step process consisting of join and prune actions is followed to generate Lk from Lk-1:
- Join step: Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The candidate set Ck is generated by taking the join Lk-1 x Lk-1, where members of Lk-1 are joinable if their first k−2 items are in common. This ensures that no duplicates are generated.
- Prune step: to reduce the size of Ck, the Apriori property is used as follows. Any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k−1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent and can be removed from Ck.
- The count of each candidate in Ck, compared against the minimum support count, is then used to determine Lk.
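A minimal sketch of the join and prune steps just described, assuming itemsets are stored as lexicographically sorted tuples; the function name `apriori_gen` and the representation are illustrative, not taken from the slides.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev."""
    L_prev = sorted(L_prev)               # itemsets as sorted tuples, lexicographic order
    candidates = set()
    # Join step: join two (k-1)-itemsets that agree on their first k-2 items
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            a, b = L_prev[i], L_prev[j]
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                candidates.add(a[:k - 2] + (a[k - 2], b[k - 2]))
    # Prune step: drop candidates that have an infrequent (k-1)-subset
    frequent = set(L_prev)
    return {c for c in candidates
            if all(sub in frequent for sub in combinations(c, k - 1))}

# Example: L2 from the worked example below gives C3 after join and prune
L2 = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "C"), ("B", "E")]
print(apriori_gen(L2, 3))   # {('A', 'B', 'C'), ('A', 'B', 'E')}
```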
2nd scan (min_sup = 2), C2 = L1 x L1:
C2: {A,B}: 4, {A,C}: 4, {A,E}: 2, {B,C}: 4, {B,E}: 3, {C,E}: 1
L2: {A,B}: 4, {A,C}: 4, {A,E}: 2, {B,C}: 4, {B,E}: 3

3rd scan (min_sup = 2), C3 = L2 x L2 = {{A,B,C}, {A,B,E}, {A,C,E}, {B,C,E}}:
- The 2-item subsets of {A,B,C} are {A,B}, {B,C} and {A,C}, which are all in L2.
- The 2-item subsets of {A,B,E} are {A,B}, {B,E} and {A,E}, which are all in L2.
- The 2-item subsets of {A,C,E} are {A,C}, {C,E} and {A,E}; {C,E} is not in L2, so remove {A,C,E}.
- The 2-item subsets of {B,C,E} are {B,C}, {C,E} and {B,E}; {C,E} is not in L2, so remove {B,C,E}.
L3: {A,B,C}: 2, {A,B,E}: 2

C4 = L3 x L3 = {{A,B,C,E}}. The 3-item subsets of {A,B,C,E} are {A,B,C}, {B,C,E}, {A,C,E} and {A,B,E}; {B,C,E} and {A,C,E} are not in L3, so {A,B,C,E} is removed. Thus C4 is empty and the algorithm terminates, having found all the frequent itemsets.

The Apriori Algorithm
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k.

Algorithm Apriori
  L1 = {frequent 1-itemsets}
  for (k = 1; Lk is not empty; k++) do begin
    Ck+1 = apriori_generate(Lk)   // candidates generated from Lk
    for each transaction t in the database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  end
  return the union of all Lk

Algorithm apriori_generate(Lk)
  for each itemset l1 in Lk
    for each itemset l2 in Lk
      if the first k−1 elements of l1 and l2 are equal and l1[k] < l2[k]
        // i.e. l1[1]=l2[1], l1[2]=l2[2], ..., l1[k−1]=l2[k−1] and l1[k]<l2[k]
        c = l1 joined with l2; add c to Ck+1
        for each k-subset s of c
          if s does not belong to Lk then delete c; break

- The Apriori algorithm assumes that the dataset is memory resident.
- The maximum number of DB scans is one more than the cardinality of the largest itemset.
- The large number of data scans is a weakness of Apriori.

Sampling Algorithm
- An improvement that reduces the number of database scans to one in the best case and two in the worst case.
- A database sample is drawn such that it is memory resident. An algorithm such as Apriori is used to find the large itemsets for the sample. These are viewed as the potentially large (PL) itemsets.
- Additional candidates are determined by applying the negative border function BD~ to the large itemsets from the sample.
- The negative border is defined as the minimal set of itemsets that are not in PL but whose subsets are all in PL.

Example:
Database (Tid | Items): 10: A,B,E; 20: B,E; 30: B,C; 40: A,B,D; 50: A,C; 60: B,C; 70: A,C; 80: A,B,C,E; 90: A,B,C
Sample (Tid | Items): 10: A,B,E; 50: A,C; 70: A,C; 90: A,B,C

Over the sample:
C1: {A}: 4, {B}: 2, {C}: 3, {E}: 1, so L1 = {{A}, {B}, {C}}
C2: {A,B}: 2, {A,C}: 3, {B,C}: 1, so L2 = {{A,B}, {A,C}}
C3: {A,B,C}: 1, so L3 is empty

PL = {{A}, {B}, {C}, {A,B}, {A,C}}
BD~(PL) = {{B,C}, {E}, {D}}: {B,C} is added because both its subsets {B} and {C} are in PL; {E} and {D} are added because all their subsets (the empty set) are vacuously in PL.
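The negative border function can be sketched as follows, assuming itemsets are represented as frozensets and the universe of items is passed in explicitly; the function name and the brute-force enumeration are illustrative only.

```python
from itertools import combinations

def negative_border(items, PL):
    """BD~(PL): minimal itemsets not in PL whose immediate subsets are all in PL."""
    PL = {frozenset(x) for x in PL}
    max_k = max((len(x) for x in PL), default=0) + 1
    border = set()
    for k in range(1, max_k + 1):
        for cand in map(frozenset, combinations(sorted(items), k)):
            if cand in PL:
                continue
            # 1-itemsets qualify vacuously; larger itemsets need every (k-1)-subset in PL
            if k == 1 or all(cand - {i} in PL for i in cand):
                border.add(cand)
    return border

# PL from the sample above; the item universe also contains D and E
PL = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}]
print(negative_border("ABCDE", PL))
# -> {frozenset({'B', 'C'}), frozenset({'D'}), frozenset({'E'})}
```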
The total set of candidates considered is C = PL ∪ BD~(PL) = {{A}, {B}, {C}, {A,B}, {A,C}, {B,C}, {D}, {E}}. During the first scan of the database, support counts are computed for all of them.
- If all of the candidates that turn out to be large are in PL, then all large itemsets have been found.
- A second scan is required if any of the large candidates lie in the negative border area.
- The negative border is the buffer area between the itemsets known to be large and the others; it is the smallest possible set of itemsets that could potentially be large.
- During the second scan, additional candidates are generated and counted to ensure that all large itemsets are found.
- ML, the missing large itemsets, are those in L that are not in PL.

To find all the remaining large itemsets in the second scan, the sampling algorithm repeatedly applies the negative border function until the set of possible candidates does not grow any further.
In the example, ML = {{B,C}, {E}} and PL = {{A}, {B}, {C}, {A,B}, {A,C}}. Starting from the large itemsets found in the first scan, repeated application of BD~ first adds {A,E}, {B,E}, {C,E} and {A,B,C}, then {A,B,E}, {A,C,E} and {B,C,E}, and finally {A,B,C,E}, after which no new itemsets are added.
This creates a potentially large set of candidates, many of which are not actually large, but it guarantees that only one more database scan is required.
Over the sample, Apriori is run with a support threshold small(s), a minimum support value lower than s; it is reduced because the sample size is smaller.

Sampling algorithm
  I = the set of items, s = the support threshold, Ds = a sample drawn from D
  PL = Apriori(I, Ds, small(s))
  C = PL ∪ BD~(PL)
  Scan the database, compute the support count of each itemset in C and test whether it is large
  L = the itemsets found to be large
  ML = {X | X ∈ BD~(PL) ∧ X ∈ L}
  if ML ≠ ∅ then
    C = L
    repeat
      C = C ∪ BD~(C)
    until no new itemsets are added to C
    Scan the database a second time, compute the support count of each itemset in C and test whether it is large

Partitioning
- The dataset D is divided into p partitions D1, D2, ..., Dp.
- All frequent itemsets within each partition, called local frequent itemsets, are computed (with min_sup scaled appropriately).
- These form the global candidate itemsets, which are used to obtain the frequent itemsets for the entire database.
Partitioning may improve performance in several ways (see the sketch below):
• By the large-itemset property, a large itemset must be large in at least one of the partitions. Each partition can be created so that it fits in main memory, and the number of itemsets to be counted per partition is smaller.
• Parallel or distributed algorithms can be used.
• Incremental generation of association rules is possible, by treating newly added data as a new partition.
• The number of database scans is reduced to two. In the first scan, the partitions are brought into memory and the large itemsets of each partition are found. During the second scan, only those itemsets that are large in at least one partition are used as candidates and counted to determine whether they are large across the whole database.
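A rough, hedged sketch of the two-scan partitioning idea: local frequent itemsets are mined per partition (here with a deliberately simple brute-force miner) and then counted once more over the whole database. The helper names and the brute-force miner are illustrative assumptions, not the algorithm from the slides.

```python
import math
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Brute-force miner: every itemset occurring in at least `min_count` transactions."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        level = {frozenset(c): sum(set(c) <= t for t in transactions)
                 for c in combinations(items, k)}
        level = {s: n for s, n in level.items() if n >= min_count}
        if not level:          # no frequent k-itemset => no frequent (k+1)-itemset
            break
        frequent.update(level)
    return frequent

def partition_mine(transactions, min_sup, p):
    """Two-scan partitioning: local frequent itemsets become the global candidates."""
    n = len(transactions)
    size = math.ceil(n / p)
    partitions = [transactions[i:i + size] for i in range(0, n, size)]
    # Scan 1: local frequent itemsets, min_sup scaled to each partition's size
    candidates = set()
    for part in partitions:
        candidates |= set(frequent_itemsets(part, math.ceil(min_sup * len(part))))
    # Scan 2: count the global candidates over the whole database
    counts = {c: sum(c <= t for t in transactions) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt >= min_sup * n}

db = [{"A","B","E"}, {"B","E"}, {"B","C"}, {"A","B","D"}, {"A","C"},
      {"B","C"}, {"A","C"}, {"A","B","C","E"}, {"A","B","C"}]
print(partition_mine(db, min_sup=2/9, p=3))   # frequent itemsets for a support count of 2
```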
Parallel and distributed algorithms
- Data parallelism: the data can be partitioned, but this requires that the memory at each processor is large enough to store all candidates at each scan.
- Task parallelism: the candidate sets can be partitioned and counted separately at each processor; the candidate set allotted to a processor should fit in its memory.

CDA (Count Distribution Algorithm)
- It uses data parallelism. The database is divided into p partitions, one for each processor.
- Each processor counts the candidates for its data and then broadcasts its counts to all other processors.
- Each processor then determines the global counts. These are used to generate the large itemsets and the candidate sets for the next scan.

DDA (Data Distribution Algorithm)
- It uses task parallelism. The candidates as well as the data are partitioned among the processors.
- Each processor counts the candidates allotted to it using its local database partition. Then each processor broadcasts its database partition to all other processors.
- This data is used by each processor to compute the global counts for its candidates, and it broadcasts these counts to all.
- Each processor determines the globally large itemsets and the candidate sets; these candidate sets are divided among the processors for the next scan.
- This algorithm suffers from high message traffic.

Comparing the Algorithms
Algorithms can be classified along the following dimensions:
- Target: the algorithms generate rules that satisfy a given support and confidence.
- Type: they can generate regular or advanced association rules.
- Data type: the data can be categorical or textual.
- Data source: market basket data (which items are present in a transaction).
- Technique: large (frequent) itemsets.
- Itemset strategy: usually a bottom-up approach is used, pruning with the Apriori property; a top-down technique can also be used.
- Transaction strategy: the whole database is scanned, or a sample or partition is used.
- Itemset data structure: a hash tree is an effective data structure for storing, accessing and counting itemsets.
- Transaction data structure: usually a table of transactions with the itemset present in each transaction (horizontal data format); alternatively, the data can be represented as a table with an item name and the set of transactions containing that item (vertical data format).
- Optimization: techniques used to improve the performance of the algorithm for a given data distribution.
- Architecture: sequential, parallel and distributed algorithms are used.
- Parallelism strategy: data parallelism and/or task parallelism is used.

Incremental Rules
- In dynamic databases, the database state keeps changing. Generating association rules for a new database state requires a complete rerun of the algorithm.
- Incremental approaches address the issue of modifying association rules as new transactions are inserted into the database.
- If D is the database state with large itemsets L and db is the set of updates, an incremental approach finds the itemsets for D ∪ db using L.
- FUP (Fast Update) is based on the Apriori algorithm. In each iteration, when db is scanned, the candidate sets generated are pruned using L. This works because an itemset needs to be large in at least one partition, D or db, in order to be large in D ∪ db (see the sketch below).
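A hedged sketch of the pruning idea behind incremental updating, not the full FUP algorithm: an itemset can be large in D ∪ db only if it is large in D (already in L) or large in db, so only itemsets large in the increment and not already in L need to be counted against the old database. It reuses the illustrative `frequent_itemsets` helper from the partitioning sketch above; all names here are assumptions.

```python
import math

def incremental_update(L_counts, count_in_D, db, min_sup, size_D):
    """Update the large itemsets when increment `db` is added to database D.

    L_counts   : dict {frozenset: support count in D} for the old large itemsets L
    count_in_D : callable(itemset) -> support count of the itemset in D (a scan of D)
    db         : list of transactions (sets) in the increment
    min_sup    : minimum support as a fraction
    size_D     : number of transactions in D
    """
    threshold = min_sup * (size_D + len(db))
    # Itemsets that are large in the increment alone (illustrative helper from above)
    db_large = frequent_itemsets(db, math.ceil(min_sup * len(db)))
    updated = {}
    # Case 1: previously large itemsets only need their counts over db added
    for itemset, old_count in L_counts.items():
        new_count = old_count + sum(itemset <= t for t in db)
        if new_count >= threshold:
            updated[itemset] = new_count
    # Case 2: itemsets large in db but not previously large must be counted over D
    for itemset, db_count in db_large.items():
        if itemset not in L_counts:
            total = count_in_D(itemset) + db_count
            if total >= threshold:
                updated[itemset] = total
    return updated
```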
Kinds of Association Rules
There are various kinds of association rules:
• Multilevel association rules
• Multidimensional association rules
• Quantitative association rules
• Correlation rules

Multilevel association rules
- When a concept hierarchy exists over the items, association rules can be generated at the various levels of the hierarchy.
- Items at lower levels are expected to have lower support.
- Strong associations discovered at higher levels may represent common-sense knowledge.
- Strong association rules at lower levels are difficult to find because of the scarcity of data at that level.

[Figure: a concept hierarchy with Food at the root and children such as Vegetables, Grain, Fruit, Meat and Dairy; Grain branches into Wheat and Rice; Dairy branches into Yoghurt, Milk and Cheese; Milk branches into Whole and Skim]

Multilevel association rules can be mined using concept hierarchies within the support-confidence framework, typically with a top-down strategy. There are several variations:
- Uniform minimum support at all levels: the search procedure is simplified, because it avoids examining itemsets whose ancestors do not have minimum support, and users are required to specify only one min_sup value.
- Reduced minimum support at lower levels: each level has its own min_sup; the deeper the level, the smaller the threshold value.
- Item- or group-based minimum support: the minimum support threshold can be set by grouping items based on price or other attributes, and a low support threshold can be set for items of interest.
Redundant rules may be generated because of the ancestor relationships among items.

Multidimensional association rules: association rules that involve two or more predicates.
- Single-dimensional rule: buys(X, "milk") => buys(X, "bread")
- Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
- Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")

Quantitative association rules
- Categorical attributes: a finite number of possible values, with no ordering among the values.
- Quantitative attributes: numeric, with an implicit ordering among the values.
- Discretization of quantitative attributes can be predefined, converting them to categorical attributes.
- Discretization can also be dynamic, retaining the quantitative nature, so as to maximize confidence.

Correlation rules
- Strong association rules are not necessarily interesting.
- Correlation analysis can be used in addition to the support and confidence measures.
- There are several correlation measures: lift, chi-square, etc.

Measuring the quality of rules. Several measures can be used:
- Support: s(A => B) = P(A ∪ B)
- Confidence: confidence(A => B) = P(B|A)
- Lift or interest: measures the relationship between the items and is symmetric: lift(A => B) = P(A ∪ B) / (P(A) P(B))
- Conviction: a measure of independence that uses negation and inverts the ratio: conviction(A => B) = P(A) P(¬B) / P(A ∪ ¬B)
- Chi-square (χ²)
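As a small illustration of the lift and conviction formulas above (a sketch in plain Python, with transactions as sets; the function name is illustrative and P(A ∪ B) denotes the probability that a transaction contains all items of A and B):

```python
def measures(transactions, A, B):
    """Support, confidence, lift and conviction for the rule A => B."""
    n = len(transactions)
    A, B = set(A), set(B)
    p_a  = sum(A <= t for t in transactions) / n
    p_b  = sum(B <= t for t in transactions) / n
    p_ab = sum((A | B) <= t for t in transactions) / n
    support = p_ab
    confidence = p_ab / p_a
    lift = p_ab / (p_a * p_b)
    # P(A and not B) = P(A) - P(A and B); conviction is infinite when confidence = 1
    p_a_not_b = p_a - p_ab
    conviction = (p_a * (1 - p_b)) / p_a_not_b if p_a_not_b > 0 else float("inf")
    return support, confidence, lift, conviction

db = [{"A","B","D"}, {"A","C","D"}, {"A","D","E"}, {"B","E","F"}, {"B","C","D","E","F"}]
print(measures(db, {"A"}, {"D"}))   # rule A => D from the first example
```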