Course on Data Mining (581550-4)
Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept of CS, Autumn 2001

[Course outline diagram. Topics: Intro/Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Appl./Summary, Home Exam. Dates shown: 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.]

Today 26.10.2001
• Summary:
  – Course organization
• Summary:
  – What is data mining?
• Today's subject:
  – Association rules
• Next week's program:
  – Lecture: Episodes
  – Exercise: Associations
  – Seminar: Associations

Course Organization: Lectures, Exercises, Exam
• 12 lectures: 24.10.-30.11.2001
  – Wed 14-16, Fri 12-14 (A217)
    • Wed: normal lecture
    • Fri: seminar-like lecture (except for 26.10.)
• 5 exercise sessions: 1.11.-29.11.2001
  – Thu 12-14 (A318)
• Home exam:
  – Given: 28.11., to be returned by: 21.12.2001
• Language:
  – Lecturing language is Finnish
  – Slides and material are in English

Course Organization: Group Work
• Group for seminar (and exercise) work:
  – 10 groups of 3 persons, 2 groups/lecture
  – Dates are agreed at the beginning of the course
  – Articles are given on the previous week's Wed
• Seminar presentations:
  – Presentation as an HTML page (around 3-5 printed pages), due by the start of the seminar:
    • Can be either an HTML page or a printable document in PostScript/PDF format
  – 30 minutes of presentation
  – 5-15 minutes of discussion
  – Active participation

Course Organization: Groups
• Group presentation time allocation:
  – Fri 2.11.: Group 1, Group 2 (associations)
  – Fri 9.11.: Group 3, Group 4 (episodes)
  – Fri 16.11.: Group 5, Group 6 (text mining)
  – Fri 23.11.: Group 7, Group 8 (clustering)
  – Fri 30.11.: Group 9,
Group 10 (KDD process)

Course Organization: Course Evaluation
• Passing the course: min 30 points
  – home exam: min 13 points (max 30 points)
  – exercises/experiments: min 8 points (max 20 points)
    • at least 3 returned and reported experiments
  – group presentation: min 4 points (max 10 points)
• Remember also the other requirements:
  – Attending the lectures (5/7)
  – Attending the seminars (4/5)
  – Attending the exercises (4/5)

Course Organization: Course Material
• Lecture slides
• Original articles
• Seminar presentations
• Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8
• Remember to check the course website and folder for the material!

Summary: What is Data Mining?
• Ultimately:
  – "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases"
• Often just:
  – "Tell something interesting about this data", "Describe this data"
• Exploratory, semi-automatic data analysis on large data sets

Summary: What is Data Mining?
• Data mining: semi-automatic discovery of interesting patterns from large data sets
• Knowledge discovery is a process:
  – Preprocessing
  – Data mining
  – Postprocessing
• Different kinds of the following are mined, used or utilized:
  – Databases (relational, object-oriented, spatial, WWW, …)
  – Knowledge (characterization, clustering, association, …)
  – Techniques (machine learning, statistics, visualization, …)
  – Applications (retail, telecom, Web mining, log analysis, …)

Summary: Typical KDD Process
[Figure: typical KDD process with the stages 1 Preprocessing, 2 Data mining, 3 Postprocessing, followed by Utilization. Labels in the figure: operational database, time-based selection, raw data, input data (cleaned, verified, focused), results, evaluation of interestingness, selected usable patterns.]

Association Rules
• Basics
• Examples
• Generation
• Multi-level Rules
• Constraints

Market Basket Analysis
• Analysis of customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket"
  – Customer1: milk, eggs, sugar, bread
  – Customer2: milk, eggs, cereal, bread
  – Customer3: eggs, sugar

Market Basket Analysis
• Given:
  – A database of customer transactions (e.g., shopping baskets), where each transaction is a set of items (e.g., products)
• Find:
  – Groups of items which are frequently purchased together

Market Basket Analysis
• Extract information on purchasing behavior
  – "IF buys beer and sausage, THEN also buys mustard with high probability"
• Actionable information: can suggest...
  – New store layouts and product assortments
  – Which products to put on promotion
• The MBA approach is applicable whenever a customer purchases multiple things in proximity
  – Credit cards
  – Services of telecommunication companies
  – Banking services
  – Medical treatments

Market Basket Analysis
• Useful: "On Thursdays, grocery store consumers often purchase diapers and beer together."
• Trivial: "Customers who purchase maintenance agreements are very likely to purchase large appliances."
• Inexplicable/unexpected: "When a new hardware store opens, one of the most sold items is toilet rings."

Association Rules: Basics
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Comprehensibility: Simple to understand
• Utilizability: Provide actionable information
• Efficiency: Efficient discovery algorithms exist
• Applications:
  – Market basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Association Rules: Basics
• Typical representation formats for association rules:
  – diapers ⇒ beer [0.5%, 60%]
  – buys:diapers ⇒ buys:beer [0.5%, 60%]
  – "IF buys diapers, THEN buys beer in 60% of the cases. Diapers and beer are bought together in 0.5% of the rows in the database."
• Other representations (used in Han's book):
  – buys(x, "diapers") ⇒ buys(x, "beer") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]

Association Rules: Basics
  diapers ⇒ beer [0.5%, 60%]
    (1)      (2)   (3)   (4)
"IF buys diapers, THEN buys beer in 60% of the cases in 0.5% of the rows"
  1 Antecedent, left-hand side (LHS), body
  2 Consequent, right-hand side (RHS), head
  3 Support, frequency ("in how big a part of the data the things in the left- and right-hand sides occur together")
  4 Confidence, strength ("if the left-hand side occurs, how likely the right-hand side occurs")

Association Rules: Basics
• Support: denotes the frequency of the rule within transactions.
    support(A ⇒ B [s, c]) = p(A ∪ B) = support({A,B})
• Confidence: denotes the percentage of transactions containing A which also contain B.
    confidence(A ⇒ B [s, c]) = p(B|A) = p(A ∪ B) / p(A) = support({A,B}) / support({A})
  Here p(A ∪ B) is the probability that a transaction contains all items of both A and B.

Association Rules: Basics
• Minimum support (min_support):
  – High ⇒ few frequent itemsets ⇒ few valid rules which occur very often
  – Low ⇒ many valid rules which occur rarely
• Minimum confidence (min_conf):
  – High ⇒ few rules, but all "almost logically true"
  – Low ⇒ many rules, many of them very "uncertain"
• Typical values: min_support = 2-10 %, min_conf = 70-90 %

Association Rules: Basics
• Transaction:
  – Relational format <Tid, item>: <1, item1>, <1, item2>, <2, item3>
  – Compact format <Tid, itemset>: <1, {item1,item2}>, <2, {item3}>
• Item vs. itemset: single element vs. set of items
• Support of an itemset I: number of transactions containing I
• Minimum support (min_support): threshold for support
• Frequent itemset: itemset with support ≥ min_support
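The support and confidence definitions above can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the variable `baskets` are illustrative assumptions, not part of the course material; the three baskets are the ones from the Market Basket Analysis slide, and support is computed as a fraction of transactions.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(lhs and rhs together) / support(lhs).

    Assumption: lhs occurs in at least one transaction (no zero division guard).
    """
    lhs_items = frozenset(lhs)
    both = lhs_items | frozenset(rhs)
    return support(both, transactions) / support(lhs_items, transactions)


# The three shopping baskets from the market basket slide.
baskets: List[Transaction] = [
    frozenset({"milk", "eggs", "sugar", "bread"}),    # Customer1
    frozenset({"milk", "eggs", "cereal", "bread"}),   # Customer2
    frozenset({"eggs", "sugar"}),                     # Customer3
]

# milk and bread occur together in 2 of the 3 baskets
print(support({"milk", "bread"}, baskets))
# every basket with milk also has bread, so the confidence is 1.0
print(confidence({"milk"}, {"bread"}, baskets))
```

Item sets are passed as Python sets rather than bare strings, since a bare string would be split into characters.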
Association Rules: Basics
• Given: (1) database of transactions, (2) each transaction is a list of items bought (purchased by a customer in a visit)

  Transaction ID   Items Bought
  100              A,B,C
  200              A,C
  400              A,D
  500              B,E,F

  Itemset              Support
  {A}                  3 or 75%
  {B} and {C}          2 or 50%
  {D}, {E} and {F}     1 or 25%
  {A,C}                2 or 50%
  Other item pairs     max 25%

• Find: all rules with minimum support and confidence
• If min. support is 50% and min. confidence is 50%, then
  A ⇒ C [50%, 66.6%], C ⇒ A [50%, 100%]

Association Rule Generation
• Association rule mining is a two-step process:
  STEP 1: Find the frequent itemsets: the sets of items that have minimum support.
  – The so-called Apriori trick: a subset of a frequent itemset must also be a frequent itemset:
    • i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with size from 1 to k (k-itemset)
  STEP 2: Use the frequent itemsets to generate association rules.
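The two-step process above can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than the course's own implementation: the function names `apriori` and `generate_rules` and the variable `database` are assumptions, candidates are formed by a simple pairwise join followed by the subset-based prune, and the data is the four-transaction example from the slide above.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """STEP 1: return {itemset: count} for all itemsets with support >= min_support."""
    n = len(transactions)
    counts = {}
    for t in transactions:                      # count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k + 1:             # join: keep only (k+1)-itemsets
                continue
            # prune: every k-subset of the candidate must itself be frequent
            if all(frozenset(sub) in current for sub in combinations(union, k)):
                candidates.add(union)
        counts = {c: 0 for c in candidates}
        for t in transactions:                  # one database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def generate_rules(frequent, n_transactions, min_conf):
    """STEP 2: return (lhs, rhs, support, confidence) for rules with confidence >= min_conf."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):        # every nonempty proper subset as LHS
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]    # subsets of frequent sets are frequent
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, count / n_transactions, conf))
    return rules


# Example database from the slide (transaction IDs 100, 200, 400, 500).
database = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BEF")]

frequent = apriori(database, min_support=0.5)
for lhs, rhs, sup, conf in generate_rules(frequent, len(database), min_conf=0.5):
    print(sorted(lhs), "=>", sorted(rhs), f"[{sup:.0%}, {conf:.1%}]")
# prints the two rules A => C [50%, 66.7%] and C => A [50%, 100.0%]
```

With a 50% threshold for both support and confidence, the sketch reproduces the frequent itemsets {A}, {B}, {C}, {A,C} and the two rules given on the slide.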
Frequent Sets with Apriori
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemsets of size k; Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = {candidates generated from Lk};
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = {candidates in Ck+1 with min_support}
    end
    return ∪k Lk;

Apriori Candidate Generation
• The Apriori principle: Any subset of a frequent itemset must be frequent
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}

Apriori Example (1/6)
  Database D
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

  Scan D ⇒ C1              L1
  itemset   sup.           itemset   sup.
  {1}       2              {1}       2
  {2}       3              {2}       3
  {3}       3              {3}       3
  {4}       1              {5}       3
  {5}       3

Apriori Example (2/6)
  C2          Scan D ⇒ C2              L2
  itemset     itemset   sup            itemset   sup
  {1 2}       {1 2}     1              {1 3}     2
  {1 3}       {1 3}     2              {2 3}     2
  {1 5}       {1 5}     1              {2 5}     3
  {2 3}       {2 3}     2              {3 5}     2
  {2 5}       {2 5}     3
  {3 5}       {3 5}     2

Apriori Example (3/6)
  C3          Scan D ⇒ L3
  itemset     itemset   sup
  {2 3 5}     {2 3 5}   2

Apriori Example (4/6)
[Figure: search space of database D, the lattice of all itemsets over the items 1-5, from the single items up to 12345.]

Apriori Example (5/6)
[Figure: the same itemset lattice, showing the Apriori trick on level 1.]

Apriori Example (6/6)
[Figure: the same itemset lattice, showing the Apriori trick on level 2.]

Is Apriori Fast Enough?
• The core of the Apriori algorithm:
  – Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  – Use database scan and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  – Multiple scans of database:
    • Needs (n + 1) scans, where n is the length of the longest pattern

Is Apriori Fast Enough?
• In practice:
  – For the basic Apriori approach the number of attributes in a row is usually much more critical than the number of transaction rows
  – For example:
    • 50 attributes each having 1-3 values, 100 000 rows (not very bad)
    • 50 attributes each having 10-100 values, 100 000 rows (quite bad)
    • 10 000 attributes each having 5-10 values, 100 rows (very bad...)
  – Notice:
    • One attribute might have several different values
    • Association rule algorithms typically treat every attribute-value pair as one attribute (2 attributes with 5 values each => "10 attributes")
• There are some ways to overcome the problem...

Improving Apriori Performance
• Hash-based itemset counting:
  – A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction:
  – A transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning:
  – Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling:
  – Mining on a subset of the given data, with a lower support threshold + a method to determine the completeness

Association Rules from Itemsets
• Pseudo-code:
    for every frequent itemset l
        generate all nonempty subsets s of l
        for every nonempty subset s of l
            output the rule "s ⇒ (l-s)" if support(l)/support(s) ≥ min_conf,
            where min_conf is the minimum confidence threshold
• E.g.: frequent set l = {abc}, subsets s = {a, b, c, ab, ac, bc}
  – a ⇒ b, a ⇒ c, b ⇒ c
  – a ⇒ bc, b ⇒ ac, c ⇒ ab
  – ab ⇒ c, ac ⇒ b, bc ⇒ a

Association Rule Generation
• Rule 1 to remember:
  – Generating frequent sets is slow (especially itemsets of size 2)
  – Generating association rules from frequent itemsets is fast
• Rule 2 to remember:
  – For itemset generation, the support threshold is used
  – For association rules, the confidence threshold is used
• What happens in reality, how long does it take to create frequent sets and association rules?
  – Let's take small real-life examples…
  – Experiments are made with a Citum 4/275 Alpha server with 512 MB of main memory & Red Hat Linux release 5.0 (kernel 2.0.30)

Performance Example (1/4)
[Figure: network management system receiving alarms from a telecommunication network. The switched network contains MSCs; the access network contains BSCs and BTSs. Legend: MSC = Mobile station controller, BSC = Base station controller, BTS = Base station transceiver.]

Performance Example (2/4)
• Telecom data containing alarms:
    1234 EL1 PCM 940926082623 A1 ALARMTEXT..
    (fields, in order: alarm number, alarming network element, alarm type, date and time, alarm severity class, alarm text)
• Example data 1:
  – 43 478 alarms (26.9.94 - 5.10.94; ~ 10 days)
  – 2 234 different types of alarms, 23 attributes, 5503 different values
• Example data 2:
  – 73 679 alarms (1.2.95 - 22.3.95; ~ 7 weeks)
  – 287 different types of alarms, 19 attributes, 3411 different values

Performance Example (3/4)
[Figure: two plots, one for data set 1 (~10 days) and one for data set 2 (~7 weeks).]
Example rule:
  alarm_number=1234, alarm_type=PCM ⇒ alarm_severity=A1 [2%,45%]

Performance Example (4/4)
• Example results for data 1:
  – Frequency threshold: 0.1 (lowest possible with this data)
  – Candidate sets: 109 719     Time: 12.02 s
  – Frequent sets:  79 311      Time: 64 855.73 s
  – Rules:          3 750 000   Time: 860.60 s
• Example results for data 2:
  – Frequency threshold: 0.1 (lowest possible with this data)
  – Candidate sets: 43 600      Time: 1.70 s
  – Frequent sets:  13 321      Time: 10 478.93 s
  – Rules:          509 075     Time: 143.35 s

Selecting the Interesting Rules?
• Usually the result set is very big; one must select the interesting rules based on:
  – Objective measures: two popular measures are support and confidence
  – Subjective measures (Silberschatz & Tuzhilin, KDD95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it)
• These issues will be discussed with KDD processes

Boolean vs. Quantitative Rules
• Boolean vs. quantitative association rules (based on the types of values handled)
  – Boolean: Rule concerns associations between the presence or absence of items (e.g. "buys A" or "does not buy A")
      buys=SQLServer, buys=DMBook ⇒ buys=DBMiner [2%,60%]
      buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  – Quantitative: Rule concerns associations between quantitative items or attributes
      age=30..39, income=42..48K ⇒ buys=PC [1%, 75%]
      age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]

Quantitative Rules
• Quantitative attributes: e.g., age, income, height, weight
• Categorical attributes: e.g., color of car

  CID   height   weight   income
  1     168      75.4     30.5
  2     175      80.0     20.3
  3     174      70.3     25.8
  4     170      65.2     27.0

• Problem: too many distinct values for quantitative attributes
• Solution: transform quantitative attributes into categorical ones via discretization (more about this in the seminar!)

Single- vs. Multi-dimensional Rules
• Single-dimensional vs. multi-dimensional associations
  – Single-dimensional: Items or attributes in the rule refer to only one dimension (e.g., to "buys")
      Beer, Chips ⇒ Bread [0.4%, 52%]
      buys(x, "Beer") ^ buys(x, "Chips") ⇒ buys(x, "Bread") [0.4%, 52%]
  – Multi-dimensional: Items or attributes in the rule refer to two or more dimensions (e.g., "buys", "time_of_transaction", "customer_category")
      In the following example: nationality, age, income

Multi-dimensional Rules

  CID   nationality   age   income
  1     Italian       50    low
  2     French        40    high
  3     French        30    high
  4     Italian       50    medium
  5     Italian       45    high
  6     French        35    high

RULES:
  nationality = French ⇒ income = high [50%, 100%]
  income = high ⇒ nationality = French [50%, 75%]
  age = 50 ⇒ nationality = Italian [33%, 100%]

Single- vs. Multi-level Rules
• Single-level vs.
multi-level associations
  – Single-level: Associations between items or attributes from the same level of abstraction (i.e., from the same level of hierarchy)
      Beer, Chips ⇒ Bread [0.4%, 52%]
  – Multi-level: Associations between items or attributes from different levels of abstraction (i.e., from different levels of hierarchy)
      Beer:Karjala, Chips:Estrella:Barbeque ⇒ Bread [0.1%, 74%]
• More about multi-level association rules on the next slides…

Multi-level Association Rules
• It is difficult to find interesting patterns at too primitive a level
  – high support = too few rules
  – low support = too many rules, most uninteresting
• Approach: reason at a suitable level of abstraction
• A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts
• Multi-level association rules: rules which combine associations with a hierarchy of concepts

Multi-level Association Rules
• Items often form hierarchies
• Items at the lower level are expected to have lower support
• Rules regarding itemsets at appropriate levels could be quite useful
• Transaction database can be encoded based on dimensions and levels
[Figure: concept hierarchy. Food branches into milk and bread; milk into skim and 2%; bread into wheat and white; the brands Fraser and Sunset form the lowest level.]

Multi-level Association Rules
[Figure: the same hierarchy with numeric codes. Level 1: 1 = milk, 2 = bread. Level 2: 1 = skim, 2 = 2% (under milk); 1 = wheat, 2 = white (under bread). Level 3: 1 = Fraser, 2 = Sunset.]
  Example: 121 = milk - 2% - Fraser

  TID   Items
  T1    {111, 121, 211, 221}
  T2    {111, 211, 222, 323}
  T3    {112, 122, 221, 411}
  T4    {111, 121}
  T5    {111, 122, 211, 221, 413}

Multi-level Association Rules
• A top-down, progressive deepening approach:
  – First find high-level strong rules:
      milk ⇒ bread [20%, 60%]
  – Then
find their lower-level "weaker" rules:
      2% milk ⇒ wheat bread [6%, 50%]
• Variations at mining multi-level association rules:
  – Level-crossed association rules:
      milk ⇒ wheat bread
  – Association rules with multiple, alternative hierarchies:
      milk ⇒ Wonder bread

Multi-level Association Rules
• Generalizing/specializing values of attributes…
  – ...from specialized to general: support of rules increases (new rules may become valid)
  – ...from general to specialized: support of rules decreases (rules may become invalid when their support falls under the threshold)
• Too low level => too many rules and too primitive
    Pepsi light 0.5l bottle ⇒ Taffel Barbeque Chips 200gr
• Too high level => uninteresting rules
    Food ⇒ Clothes

Redundancy Filtering
• Some rules may be redundant due to "ancestor" relationships between items
• Example (milk has 4 subclasses):
  – milk ⇒ wheat bread [support = 8%, confidence = 70%]
  – 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule
• A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
  – Above, the second rule could be redundant: with 4 subclasses of milk, its expected support is 8% / 4 = 2%, which is what was observed

Uniform vs. Reduced Support
• Uniform Support: the same minimum support for all levels
  + One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
  – Lower level items do not occur as frequently.
If the support threshold is
    • too high => miss low level associations
    • too low => generate too many high level associations
• Reduced Support: reduced minimum support at lower levels

Uniform Support
Multi-level mining with uniform support
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced Support
Multi-level mining with reduced support
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Progressive Deepening
• A top-down, progressive deepening approach:
  – First mine high-level frequent items:
      milk (15%), bread (10%)
  – Then mine their lower-level "weaker" frequent itemsets:
      2% milk (5%), wheat bread (4%)
• Different min_support thresholds across multi-levels lead to different algorithms:
  – If adopting the same min_support across multi-levels, then do not examine itemset t if any of t's ancestors is infrequent
  – If adopting reduced min_support at lower levels, then examine only those descendants whose ancestor's support is frequent/non-negligible

Constraint-Based Mining
• Interactive, exploratory mining of gigabytes of data?
  – Could it be real? Yes, by making good use of constraints!
• What kinds of constraints can be used in mining?
  – Knowledge type constraint: classification, association, etc.
  – Data constraint: SQL-like queries
    • Find product pairs sold together in Vancouver in Dec. '98
  – Dimension/level constraints:
    • In relevance to region, price, brand, customer category
  – Interestingness constraints:
    • Strong rules (min_support ≥ 3%, min_confidence ≥ 60%)
  – Rule constraints (see the next slides)

Rule Constraints
• Two kinds of rule constraints:
  – Rule form constraints: meta-rule guided mining
    • Metarule: P(X, Y) ^ Q(X, W) ⇒ takes(X, "database systems")
    • Matching rule: age(X, "30..39") ^ income(X, "41K..60K") ⇒ takes(X, "database systems")
  – Rule content constraint: constraint-based query optimization (Ng, et al., SIGMOD'98)
    • sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000

Rule Constraints
• 1-variable vs. 2-variable constraints (Lakshmanan, et al. SIGMOD'99):
  – 1-var: A constraint confining only one side (L/R) of the rule, e.g.,
    • sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
  – 2-var: A constraint confining both sides (L and R), e.g.,
    • sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)

Summary
• Association rule mining:
  – Probably the most significant contribution from the database community in KDD
  – Rather simple concept, but the "thinking" gives a basis for extensions and other methods
  – A large number of papers have been published
• Many interesting issues have been explored
• Interesting research directions:
  – Association analysis in other types of data: spatial data, multimedia data, time series data, etc.

References (1/5)
• R. Agarwal, C. Aggarwal, and V. V. V. Prasad.
A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
• R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
• S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
• D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.

References (2/5)
• G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
• Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
• T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
• E.-H. Han, G.
Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.
• J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.
• J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
• M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.

References (3/5)
• F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.
• B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
• H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
• R.J. Miller and Y. Yang.
Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
• R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.
• N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.

References (4/5)
• J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.
• J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00, 11-20, Dallas, TX, May 2000.
• J. Pei and J. Han. Can We Push More Constraints into Frequent Pattern Mining? KDD'00, Boston, MA, Aug. 2000.
• G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
• B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
• S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
• A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.

References (5/5)
• C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.
• R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
• R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
• R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
• H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.
• D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
• K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
• M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
• M. Zaki. Generating Non-Redundant Association Rules. KDD'00, Boston, MA, Aug. 2000.
• O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.

Course Organization / Next Week
• Lecture 31.10.: Episodes and recurrent patterns
– Mika gives the lecture
• Exercise 1.11.: Associations
– Pirjo takes care of you! :-)
• Seminar 2.11.: Associations
– Pirjo gives the lecture
– 2 group presentations

Seminar Presentations
• Seminar presentations:
– Articles are given on the previous week's Wed
– Presentation as an HTML page (around 3-5 printed pages), due by the start of the seminar:
• Can be either an HTML page or a printable document in PostScript/PDF format
– 30 minutes of presentation
– 5-15 minutes of discussion
– Active participation

Seminar Presentations
• Seminar presentations:
– Try to understand the "message" in the article
– Try to present the basic ideas as clearly as possible; use examples
– Do not present detailed mathematics or algorithms
– Test: do you understand your own presentation?
– In the presentation, use PowerPoint or conventional slides

Seminar Presentations / Groups 1-2
• Quantitative Rules: R. Srikant, R. Agrawal: "Mining Quantitative Association Rules in Large Relational Tables", Proc. of the ACM-SIGMOD 1996
• MINERULE: Rosa Meo, Giuseppe Psaila, Stefano Ceri: "A New SQL-like Operator for Mining Association Rules". VLDB 1996: 122-133

Introduction to Data Mining (DM)
Thank you for your attention and have a nice weekend!
Thanks to Jiawei Han from Simon Fraser University for his slides, which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.