Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS 590M Fall 2001: Security Issues in Data Mining Lecture 5: Association Rules, Sequential Associations Why Association Rules? • Understand attributes, not entities • Discover relationships that – Show some dependency between attributes – Are “interesting” • Give an understanding of the data space Formal Definition • Data: – Items={i1,…,in} – Transactions T={t1,…,tm} where ti = {ij1, …, ijk} • Support: Given AI, supp(A) = |{t T | t A}| / |T| • Goal: Find rules AB with support ≥ s and confidence ≥ c where: – A, B I, A B = – s = supp(A B), c = supp(A B) / supp(A) Sample: Market Basket I Hardware Auto Clothing Furnishings Paper goods Grocery t0 - - + - + - t1 + + - - - - t2 + - - - + - t3 - - + + + - t4 + + - - + - t5 - - - - + + t6 - - + + - - t7 + + - - + - t8 - - + - + + t9 - - - - + - T Types of associations • Machine-learning base: classification / decision rules – Entities independent, unordered – Find rules leading to target class – To get rule sets, re-run for all classes as targets • Market-basket – Collection of related entities with same key – Can be modeled as independent entities, sparse data • Sequential – Like market basket, but group by distance rather than same key Historical Association Rule Learning • Decision tree converted to rules – ID3, as discussed in previous lecture • Direct production of decision rules – CN2, others • Problem: Algorithms don’t scale well to many practical problems Database community contribution: Market Basket Association Rules • Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207--216, Washington, D.C., May 1993. • Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499 Database community contribution: Market Basket Association Rules • Practical problems often have sparse data – Many attributes, few items per transaction • Goal is typically search for high support – High support = broad impact – High confidence not crucial (as opposed to classification) • Very Large data sets (main-memory algorithms impractical) A-Priori Algorithm • Observation: if A has support s, then – i A, supp(i) ≥ s • Gives bottom-up algorithm – Find single items with support ≥ s – Just look at transaction subsets with those items for pairs – Recurse A-Priori Algorithm • First, generate all large itemsets – Sets X I such that supp(X) ≥ s (threshold) – Captures “supp(A B) ≥ s” part of problem • Second, find high-confidence rules that are subsets of X – B = Xi , A = X-B – To find confidence, need supp(A) But A will be in all large itemsets – don’t need to go back to the database! A-Priori Algorithm L1 = {large 1-itemsets}; for ( k = 2; Lk-1 ; k++ ) Ck = select p.i1, p.iY, …, p.ik-1, q.ik-1 from Lk-1 p, Lk-1 q where p.i1 = q.i1, …, p.ik-2 = q.ik-2 transactions t T Ct = subset(Ck, t); // Candidates contained in t candidates c Ct: c.count++; Lk = {c Ck | c.count minsup} Answer = k Lk; Frequent episodes for sequential associations • Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo: Discovering Frequent Episodes in Sequences. In First International Conference on Knowledge Discovery and Data Mining (KDD'95), 210 - 215, Montreal, Canada, August 1995. AAAI Press. • Instead of transaction, items grouped by sliding window in time • Same basic idea as A-Priori Frequent Episodes: Definition • • • • Event types E Event (A,t) where A in E Sequence S=((A1,t1),…,(An,tn)) Frequent episode F = (Ai, …, Aj) where – tl, tm such that t1tl<…<tm tn tm-tl window: – count( ((Ai,tl), …, (Aj, tm)) ) support Applications/Issues in Security • Frequent episodes in intrusion detection data – What does this tell us? • Preventing the discovery of associations – Known items to protect – What if we don’t know what we want to protect?