Intelligent Systems
Exploratory pattern discovery
Geoff Webb
http://www.csse.monash.edu.au/~webb
Outline
• Tutorial covers
  • Data Mining
  • Exploratory Pattern Discovery
  • Association rules
  • Interestingness (objective functions)
  • False discoveries
  • Limitations of minimum support
  • K-most interesting pattern discovery
  • Itemset discovery
  • Contrast rule discovery
  • Impact rules
Part 1:
Data Mining
Data mining
• Data mining seeks to discover unanticipated knowledge from data
• Exponential growth in the quantity of stored data gives urgency to the pursuit of practical analytic approaches that address
  • large volumes of data
  • low-quality data
  • post-hoc analysis
  • loosely defined analytical objectives
So what’s the big deal?
• Don't statistics identify patterns in data?
• Conventional statistics do not address
  • searching quintillions of potential correlations, eg:
    • market basket data: 2^100,000
    • US phone calls: 2^100,000,000
    • human genome: 2^3,000,000,000
  • selecting the most interesting from millions of correlations
Example: Should we stock vitamins?
• Major national retailer with detailed records of customer purchasing behaviour
• Considering deleting a low-volume product line
• Does the data provide evidence of an indirect contribution to the bottom line?
Example: Steel rolling mill
• Complex control problem for an expensive production process influenced by input materials, desired output and state of equipment
• Currently uses an imperfect model
• Objective: use data to identify circumstances in which the model is deficient
[Photo courtesy G.C. Goodwin, S. Graebe and M. Salgado, Control System Design, Prentice Hall, 2000.]
Example: Synchrotron x-ray data analysis
• Synchrotron x-ray scatter patterns reflect the micro-structure of the material analysed.
• Can x-ray scatter plots be used for cancer diagnosis?
[Figure: x-ray scatter patterns of normal and malignant tissue]
A growth area
• The sum of human data stored doubles every 7 years
• Data mining is critical to commerce
  • fraud detection
  • information retrieval
• and to science
  • bioinformatics
  • mass data analysis
Large unmet demand for good PhDs!
Beyond statistics
• Data mining goes beyond the traditional realm of statistics by encompassing
  • problem formulation
  • interactions between the business process and the analytic process
  • knowledge management
  • data manipulation
[Figure: analytics at the intersection of data, business processes and other knowledge sources]
Generating models
• The core of the data mining process is generating models from data
  • eg neural networks, support vector machines, decision trees
• Most research concentrates on this aspect
• Surrounding activities are also very important
  • defining the analytic task
  • sourcing data
  • preprocessing data
  • identifying appropriate forms of model
  • identifying appropriate techniques for generating models
  • interpreting models
  • applying models
Part 2:
Exploratory Pattern Discovery
The perils of model selection
• Many data mining techniques seek to identify a single model that best fits the observed data.
• In many applications, many models will fit the data (almost) equally well:
  bruises=f & gill-attachment=f & gill-spacing=c & ring-number=o → poisonous [Coverage=0.406 (3296); Support=0.388 (3152); Confidence=0.956]
  bruises=f & gill-spacing=c & veil-color=w & ring-number=o → poisonous [Coverage=0.406 (3296); Support=0.388 (3152); Confidence=0.956]
Perils of model selection (cont.)
• Data mining systems often make arbitrary choices
  • without warning
• A system may have no basis on which to select models, but an expert often will
  • ease / cost of operationalisation
  • comprehensibility / compatibility with existing knowledge and beliefs
  • social / legal / ethical / political acceptability
Exploratory pattern discovery
• Exploratory pattern discovery seeks all patterns that satisfy user-defined constraints
• The user can select from these patterns
  • can use criteria that might be infeasible to quantify
Patterns
• Rules:
  • <antecedent> → <consequent>
• Itemsets
  • <condition1> & <condition2> & …
• Sequences
  • <event1>, <event2>, …
• Structures
Rules
• <antecedent> → <consequent>
• IF <antecedent> THEN <consequent>
  • IF temp > 36.8 AND pulse > 120 THEN call doctor
• Antecedent
  = condition
  = left hand side, LHS
  = the conditions under which the rule holds / applies
• Consequent
  = conclusion
  = right hand side, RHS
  = action to perform or conclusion to reach
Theoretical foundations
• Substantial bodies of theory in Formal Logic, Computational Logic, and Artificial Intelligence can be brought to bear to utilise rules once they are inferred.
• If the antecedent entails the consequent and the antecedent is known (believed), then the consequent can be concluded.
• Can be extended to a probabilistic basis.
• Supports complex reasoning.
• Modular knowledge representation.
  • can capture knowledge nuggets
Rule discovery as search
• Rule discovery can be viewed as search through a space of expressible rules.
• The rule space (search space / description space) can be partially ordered on generality.
• A → C is a generalisation of B → C iff B entails A (A must be true if B is true)
  • a proper generalisation iff A does not also entail B
• If A → C is a generalisation of B → C, then B → C is a specialisation of A → C.
• Eg. IF age > 30 THEN X is a generalisation of
  • IF age > 31 THEN X
  • IF age > 30 AND gender = male THEN X
Generalization lattice for antecedents
[Figure: the generalisation lattice over antecedents drawn from {A, B, C, D}: the empty antecedent {} at the top, the singletons {A} … {D} below it, and so on down through every subset to {A,B,C,D} at the bottom]
Search tree for antecedents
[Figure: the same subsets of {A, B, C, D} arranged as a search tree, in which each antecedent is generated by exactly one path from {}]
Search tree with consequent propagation
[Figure: the search tree for antecedents with consequent propagation: each antecedent node carries the set of candidate consequents passed down from its parent, eg antecedent {A} with candidates {B,C,D}, antecedent {A,B} with candidates {C,D}, down to antecedent {A,B,C,D} with candidates {}]
Propositional rule discovery
• Antecedent and consequent are propositions
• Often restricted to antecedent and consequent both being conjunctions of Boolean terms
  • IF temp > 36.8 AND pulse > 120 THEN blood pressure > 140 AND condition = critical
Rule discovery is inherently intractable
• If
  • there are n propositions,
  • antecedents can be any set of propositions, and
  • consequents are a single proposition
  then
  • the size of the search space ≈ n × 2^n
• It is essential to use powerful pruning techniques to limit the search space
Part 3:
Association rules
Association rule discovery
• Developed for market basket analysis
  • a basket is a collection of products purchased in a single transaction
  • an itemset is a set of products
  • all baskets are itemsets
  • market basket analysis seeks to identify products that are associated with each other
    • diapers and beer
• Can generalize to itemset = any conjunction of Boolean terms
Transaction and tabular data
• Transaction data
  • each record is a set of items involved in a single transaction
  • eg. market basket, web site traversal, amino acids in a protein
• Tabular data
  • each record consists of a vector of values for the predefined attributes or fields
  • eg. a patient's signs and symptoms, employee details, the amino acids at each site in a protein
• While association rules were developed for transaction data, they generalise directly to attribute-value data
Support and confidence
• F(X) = proportion of records that satisfy condition X
• Coverage(A → C) = F(A)
• Support(A → C) = F(A & C)
• Confidence(A → C) = Support(A → C) / Coverage(A → C)
  • the maximum likelihood estimate of P(C | A)
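
To make the definitions concrete, here is a minimal Python sketch (mine, not the tutorial's; the function names are assumptions) computing coverage, support and confidence over transactions represented as sets of items:

```python
def coverage(transactions, antecedent):
    """F(A): proportion of records that satisfy the antecedent."""
    return sum(antecedent <= t for t in transactions) / len(transactions)

def support(transactions, antecedent, consequent):
    """F(A & C): proportion of records satisfying antecedent and consequent."""
    return sum(antecedent <= t and consequent <= t
               for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support / coverage: the maximum likelihood estimate of P(C | A)."""
    return (support(transactions, antecedent, consequent)
            / coverage(transactions, antecedent))

transactions = [{'bread', 'honey'}, {'bread', 'butter'}, {'milk'}]
print(confidence(transactions, {'bread'}, {'honey'}))  # 0.5
```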
Frequent itemsets
• An itemset is frequent if its cover equals or exceeds a user-defined minimum
• Downward closure
  • frequency is anti-monotone
  • if an itemset I is not frequent then no specialisation of I is frequent
Association rules
• Antecedent and consequent are frequent itemsets
• An association rule indicates that the presence of the antecedent increases the probability that the consequent will be present
  • bread & butter → honey
Association rule discovery
• Requires a minimum support constraint
• Finds all rules that satisfy minimum support together with other user-specified constraints such as minimum confidence
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • support(bread → honey) = 50/1000 = 0.05
  • confidence(bread → honey) = 50/100 = 0.50
The frequent itemset approach
• Find all frequent itemsets
• Generate all association rules therefrom
• Assumes
• a minimum support constraint
• sparse data
Finding frequent itemsets
• Once frequent itemsets are found, rule generation is straightforward
• Research has concentrated on efficient frequent itemset generation
The Apriori algorithm
Apriori(T, ε)
  L1 ← frequent 1-itemsets relative to T
  k ← 2
  while Lk−1 ≠ ∅
    Ck ← Generate(Lk−1)
    for t ∈ T
      for c ∈ Subsets(Ck, t)
        count[c]++
    Lk ← { c ∈ Ck | count[c] ≥ ε }
    k++
  return ∪k Lk

TRANSACTIONS: a,b,c; a,b,d; a,d

PROCESS, ε = 2
  L1 = {{a},{b},{d}}
  C2 = {{a,b},{a,d},{b,d}}
  L2 = {{a,b},{a,d}}
  C3 = {{a,b,d}}
  L3 = {}
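
The pseudocode translates directly into Python. The sketch below is an illustrative implementation, not the tutorial's own code; it applies the Apriori prune at candidate-generation time (so {a,b,d} is discarded without being counted) and reproduces L1 and L2 from the trace above:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Sketch of Apriori: returns {frozenset(itemset): support count}."""
    frequent = {}
    level = list({frozenset([i]) for t in transactions for i in t})
    k = 1
    while level:
        counts = {c: 0 for c in level}
        for t in transactions:              # one pass over the data per level
            for c in level:
                if c <= t:
                    counts[c] += 1
        survivors = [c for c in level if counts[c] >= min_count]
        frequent.update((c, counts[c]) for c in survivors)
        k += 1
        survivor_set = set(survivors)
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Apriori prune: keep a candidate only if all (k-1)-subsets are frequent
        level = [c for c in candidates
                 if all(frozenset(s) in survivor_set
                        for s in combinations(c, k - 1))]
    return frequent

ts = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'a', 'd'}]
print(apriori(ts, 2))
# {a}:3, {b}:2, {d}:2, {a,b}:2, {a,d}:2 -- matching L1 and L2 in the trace
```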
Closed itemsets
• In practice many itemsets cover exactly the same records
  • eg pregnant; pregnant & woman
• A closed itemset is the most specific itemset that covers a particular set of records
• It is more efficient to find all closed frequent itemsets than all frequent itemsets
• Can generate all association rules from closed itemsets
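
The closure of an itemset can be computed directly as the intersection of all transactions that contain it. A minimal sketch (the helper name is mine; it assumes the itemset covers at least one transaction):

```python
def closure(transactions, itemset):
    """Most specific itemset covering the same records: intersect its cover."""
    covering = [t for t in transactions if itemset <= t]
    result = set(covering[0])
    for t in covering[1:]:
        result &= t
    return frozenset(result)

ts = [{'pregnant', 'woman'}, {'pregnant', 'woman', 'oedema'}, {'woman'}]
print(closure(ts, {'pregnant'}))   # frozenset({'pregnant', 'woman'})
# an itemset I is closed iff closure(ts, I) == I
```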
Closed Itemsets Example
Full set of itemsets for gill-size=n, gill-color=b & spore-print-color=w
gill-size=n [Coverage=2512]
spore-print-color=w [Coverage=2388]
gill-size=n & spore-print-color=w [Coverage=1824]
gill-color=b [Coverage=1728]
gill-color=b & spore-print-color=w [Coverage=1728]
gill-size=n & gill-color=b [Coverage=1728]
gill-size=n & gill-color=b & spore-print-color=w [Coverage=1728]
Closed itemsets
gill-size=n [Coverage=2512]
spore-print-color=w [Coverage=2388]
gill-size=n & spore-print-color=w [Coverage=1824]
gill-size=n & gill-color=b & spore-print-color=w [Coverage=1728]
Part 4:
Interestingness (objective functions)
Interestingness (Objective Functions)
• Need some means of selecting the most (potentially) interesting patterns
• Many different measures of interestingness may be relevant
• Most measures relate to the degree to which the antecedent and consequent are interdependent
  • eg P(A & C) − P(A) P(C)
Interestingness measures: lift
• lift = confidence / (cover(consequent) / n)
  • the proportional increase in confidence in the context of the antecedent
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • confidence(bread → honey) = 0.50
  • lift(bread → honey) = 5.00
M-estimates
• Problem: many rules with low support will have unrealistically high confidence and lift
• Example: 1000 records, 500 females, 1 age>=90, 1 female & age>=90
  • confidence(age>=90 → female) = 1.00
  • lift(age>=90 → female) = 2.00
• The m-estimate is a Bayesian estimate of the true confidence and lift
  • biases confidence toward the prior
  • confidence estimate = (support + m × prior) / (coverage + m)
  • lift estimate = confidence estimate / prior
  • eg confidence estimate = (1 + 2 × 0.5) / (1 + 2) = 0.667
    lift estimate = 0.667 / 0.500 = 1.333
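
The calculation is a one-liner; this small sketch (with names of my choosing) reproduces the worked example, using m = 2 and a prior of 0.5:

```python
def m_confidence(support_count, coverage_count, prior, m=2):
    """(support + m * prior) / (coverage + m): confidence shrunk toward the prior."""
    return (support_count + m * prior) / (coverage_count + m)

conf = m_confidence(support_count=1, coverage_count=1, prior=0.5)
print(round(conf, 3))        # 0.667, as in the example above
print(round(conf / 0.5, 3))  # lift estimate: 1.333
```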
Interestingness measures: leverage
• leverage = support − (cover(antecedent) × cover(consequent) / n)
  • the absolute increase over the number of cases expected if antecedent and consequent were independent; in proportions, P(A & C) − P(A) P(C)
• Also known as interest
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • confidence(bread → honey) = 0.50
  • lift(bread → honey) = 5.00
  • leverage(bread → honey) = 0.04
• Example 2: 1000 transactions, 10 batteries, 5 vodka, 1 batteries & vodka
  • lift(batteries → vodka) = 20.00
  • leverage(batteries → vodka) = 0.0009
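
A brief sketch (function names are mine) that reproduces both examples and shows why leverage demotes the batteries-and-vodka rule despite its large lift:

```python
def lift(n, a, c, s):
    """(s/a) / (c/n): confidence divided by the consequent's base rate."""
    return (s / a) / (c / n)

def leverage(n, a, c, s):
    """Support minus the support expected under independence."""
    return s / n - (a / n) * (c / n)

# bread -> honey: n=1000, 100 bread, 100 honey, 50 joint
print(lift(1000, 100, 100, 50), leverage(1000, 100, 100, 50))  # 5.0, ~0.04
# batteries -> vodka: n=1000, 10 batteries, 5 vodka, 1 joint
print(lift(1000, 10, 5, 1), leverage(1000, 10, 5, 1))          # 20.0, ~0.00095
```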
Spurious rules
• If condition X is unrelated to conditions A and B,
  • confidence(A & X → B) ≈ confidence(A → B)
  • lift(A & X → B) ≈ lift(A → B)
  • eg pregnant & AI researcher → oedema
• One core rule can result in many spurious rules
• If the problem is ignored, the majority of rules can be spurious!
Need to test up the generalization lattice
[Figure: the generalisation lattice with consequent propagation as before, illustrating that each rule must be tested against the rules formed from the generalisations of its antecedent higher up the lattice]
Minimum Improvement
• The improvement of rule X → Y [conf = c] = min({c − k | Z ⊂ X, Z → Y [conf = k]})
• A minimum improvement constraint can eliminate many spurious rules
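
A minimal sketch of the computation (an assumed helper): given a rule's confidence and the confidences of all its proper generalisations, improvement is the smallest margin over any of them:

```python
def improvement(rule_conf, generalisation_confs):
    """Smallest confidence gain of the rule over any proper generalisation."""
    return min(rule_conf - k for k in generalisation_confs)

# pregnant & female -> oedema gains nothing over pregnant -> oedema,
# so any minimum improvement constraint > 0 prunes it:
print(improvement(0.2, [0.2, 0.05]))  # 0.0
```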
Non redundant rules
• ∀x, y, z, s, c: x → y [conf = 1.0] ∧ x → z [supp = s, conf = c] ⊨ x, y → z [supp = s, conf = c]
  Eg pregnant → oedema [supp = 0.1, conf = 0.2] ⊨ pregnant, female → oedema [supp = 0.1, conf = 0.2] (since pregnant → female [conf = 1.0])
• A rule X → Y [supp = s, conf = c] is redundant iff ∃x ∈ X: X\{x} → Y [supp = s, conf = c] or ∃y ∈ Y: X → Y\{y} [supp = s, conf = c]
  Eg, pregnant, female → oedema
• Closed itemset approaches lead to efficient generation of non-redundant rules, because a rule is non-redundant iff all immediate specialisations are closed itemsets.
• Note, redundant rules have improvement of 0.0.
Effect of each filter on the number of rules

dataset              none   non-redundant    %    improvement > 0    %
bms webview           170        170        100         155         91
covtype               998        815         82         143         14
ipums.la.99           973        959         99         481         49
kddcup98              995        992        100         939         94
letter-recognition    541        524         97         421         78
mush                  891        469         53         128         14
retail                590        590        100         519         88
shuttle               666        595         89         312         47
splice-junction       748        727         97         699         93
ticdata-2000          996        996        100         988         99
Part 5:
False discoveries
False discoveries
• Massive search leads to a high risk of false discoveries
  • eg 100 observations, two independent events each occurring with 0.5 probability:
    • the probability of perfect correlation is 7.8 × 10^-31
  • if there are 1000 events then there are 2^1000 ≈ 1.07 × 10^301 antecedent–consequent pairs
• What constitutes a false discovery depends upon the analytic objective
• Usually should include rules where
  • antecedent and consequent are independent
  • antecedent and consequent are independent given a generalisation of the antecedent
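
The first figure is easy to verify: two independent events with probability 0.5 each agree on a single observation with probability 0.25 + 0.25 = 0.5, so perfect agreement across 100 observations has probability 0.5^100:

```python
print(0.5 ** 100)  # 7.888609052210118e-31, the 7.8e-31 quoted above (truncated)
```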
Testing independence
• Cannot perform a simple test of independence because of the multiple comparisons problem
  • such tests were used previously (eg Webb, Butler & Newlands, 2003) as a statistically unsound filter
Standard statistical correction
• Bonferroni
  • to maintain experimentwise risk ≤ α for n tests
  • use critical value = α / n
• Holm procedure
  • to maintain experimentwise risk ≤ α for n tests with p values ordered from lowest to highest, p1 … pn
  • accept the tests corresponding to p1 … pk, where k is the highest value such that ∀ 1 ≤ i ≤ k: pi ≤ α / (n − i + 1)
  • eg with α = 0.05 and n = 4:
    p values:        0.0100, 0.0200, 0.0400, 0.0400
    critical values: 0.0125, 0.0167, 0.0250, 0.0500
    → accept, reject, reject, reject (0.0200 > 0.0167, and the procedure stops at the first failure)
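
A minimal sketch of the step-down rule (an assumed helper, not the tutorial's code), consistent with the worked example:

```python
def holm(p_values, alpha=0.05):
    """Holm step-down: accept the i-th smallest p while p_i <= alpha/(n-i+1)."""
    n = len(p_values)
    accepted = 0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= alpha / (n - i + 1):
            accepted = i
        else:
            break                  # stop at the first failure
    return accepted                # number of ordered tests accepted

print(holm([0.0100, 0.0200, 0.0400, 0.0400]))  # 1: only the smallest p passes
```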
Direct adjustment
• I used to think "cannot perform a simple adjustment such as Bonferroni or Holm because rule spaces are so large, eg 2^1000 (> 1.0 × 10^301)
  • would result in unacceptable type-2 error
  • eg critical value = 5.0 × 10^-303"
• However, search is often restricted to small antecedents (eg ≤ 4), resulting in Bonferroni-adjusted critical values of magnitude 10^-10 … 10^-20
• With such adjustments, many rules can often be found
• Cannot order p values to apply the Holm procedure
Discovery as hypothesis generation
• Important to trade off the risks of both type-1 and type-2 errors
• Perhaps best viewed as hypothesis generation, recognising that 'discovered' patterns require independent assessment
Hypothesis testing: proposal
• Why not automate such assessment?
[Figure: the data are split into an exploratory set and a holdout set. Exploratory pattern discovery on the exploratory data yields candidate patterns (a small set is preferable, to limit type-2 error). Statistical evaluation on the holdout data, using any hypothesis test with a Holm adjustment, yields sound patterns.]
Direct adjustment vs Holdout
Direct adjustment
• All data used for exploration and evaluation
• Bonferroni adjustment
• Larger adjustment
• Adjustment alters with the size of the search space

Holdout
• Half the data used for each of exploration and evaluation
• Holm procedure
• Smaller adjustment
• Adjustment alters with the number of rules found
Case study: Ten widely used data sets
Name             Description                               Records   Attribute-values
BMS webview      products viewed at a commercial website    59,601        497
covtype          forest cover data                         581,012        125
ipums.la.99      Los Angeles census data                    88,443      1,874
kddcup98         charity donors                             52,256     19,662
letter-recog'n   digital image recognition                  20,000         74
mush             identification of poisonous mushrooms       8,124        127
retail           retail market basket data                  88,162     16,470
shuttle          records of space shuttle flight data       58,000         34
splice-junction  DNA sequence records                        3,177        243
ticdata-2000     insurance risk assessment                   5,822        689
Detecting spurious rules
• Assuming interest only in positive associations
  • P(C | A) > P(C)
• For any rule A → C, we want to assess whether it has higher confidence than all of its generalisations
  • eg, is confidence(pregnant & female → B) >
    • confidence(pregnant → B)?
    • confidence(female → B)?
    • confidence(true → B)?
Detecting spurious rules (cont)
• Perform one-tailed Fisher exact tests with respect to each generalisation
• Reject the rule if any test does not exceed the critical value
  • no need to adjust for multiple comparisons with respect to the multiple tests for a single rule
• Use the Holm adjustment for strict control of type-1 error
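
A sketch of a single such test, assuming SciPy is available (the helper and the counts are hypothetical). It asks whether the consequent is significantly more frequent among the records the rule covers than among the remaining records covered by one of its generalisations:

```python
from scipy.stats import fisher_exact

def rule_vs_generalisation(sup_rule, cov_rule, sup_gen, cov_gen):
    """Support/coverage counts for A & X -> C (rule) and A -> C (generalisation)."""
    table = [[sup_rule, cov_rule - sup_rule],           # records with A & X
             [sup_gen - sup_rule,                       # records with A but not X
              (cov_gen - cov_rule) - (sup_gen - sup_rule)]]
    _, p = fisher_exact(table, alternative='greater')
    return p

# hypothetical counts: the rule covers 50 records (45 with C);
# its generalisation covers 200 records (120 with C)
print(rule_vs_generalisation(45, 50, 120, 200))         # very small p
```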
Spurious rules case study: high support & confidence non-redundant rules

Name                Records   Attribute-values   # Rules   # Accepted    %
bms webview          59,601        497            22,135      1,747      8
covtype             581,012        125            10,018          0      0
ipums.la.99          88,443      1,874             9,857        288      3
kddcup98             52,256     19,662             9,863         40     <1
letter-recognition   20,000         74             7,978        952     12
mush                  8,124        127             8,957      1,266     14
retail               88,162     16,470            11,656         97      1
shuttle              58,000         34             9,760        876      9
splice-junction       3,177        243             8,937        132      1
ticdata-2000          5,822        689            10,438         30     <1
KDDCUP98: 99.5% of rules rejected
The following 40 rules passed holdout evaluation
…
ETH12<=0 → HC15<=0 [Coverage=0.987 (25786); Support=0.946 (24722); Confidence=0.959; Lift=1.00]
…
The following 9843 rules failed holdout evaluation, adjusted critical value = 5.09E-06
…
NOEXCH=0 & ETH12<=0 → HC15<=0 [Coverage=0.984 (25703); Support=0.943 (24644); Confidence=0.959; Lift=1.00]
…
NOEXCH=0 & ETH12<=0 & MDMAUD_F=X → HC15<=0 [Coverage=0.981 (25629); Support=0.940 (24573); Confidence=0.959; Lift=1.00]
…
NOEXCH=0 & ETH12<=0 & ADATE_2>=9706 & MDMAUD_R=X → HC15<=0 [Coverage=0.981 (25623); Support=0.940 (24567); Confidence=0.959; Lift=1.00]
…
Comparison of direct adjustment and holdout tests on artificial data
[Figure: false discoveries, experimentwise error and true discoveries for direct adjustment versus holdout evaluation; averages over 100 runs, 84 true rules at antecedent size 4]
Comparison on real data
[Figure: two charts, for Retail and Letter Recognition, plotting the number of rules found by direct adjustment and by holdout evaluation against search space size (1.36 × 10^8 up to 4.56 × 10^26 for Retail; 2.33 × 10^3 up to 1.47 × 10^9 for Letter Recognition)]
Part 6:
Limitations of minimum support
Limitations of minimum support
• Discontinuity in the 'interestingness' function
• The vodka and caviar problem
  • some high-value associations are infrequent
• Feast or famine
  • minimum support is a crude control mechanism
  • often results in too few or too many associations
• Cannot handle dense data
• Cannot prune the search space using constraints on the relationship between antecedent and consequent
  • eg confidence
• Minimum support may not be relevant
  • cannot be sufficiently low to capture all valid rules
  • cannot be sufficiently high to exclude all spurious rules
Very low support rules can be significant
Data file: Brijs retail.itl [50% sample]
44081 cases / 44081 holdout cases / 16470 items
The following 5 rules passed holdout evaluation
168 & 4685 → 1 [Coverage=0.000 (3); Support=0.000 (3); Confidence estimate=0.601; Lift estimate=192.06]
168 & 3021 → 1 [Coverage=0.000 (3); Support=0.000 (3); Confidence estimate=0.601; Lift estimate=192.06]
1476 & 4685 → 1 [Coverage=0.000 (2); Support=0.000 (2); Confidence estimate=0.502; Lift estimate=160.21]
168 & 783 → 1 [Coverage=0.000 (4); Support=0.000 (3); Confidence estimate=0.501; Lift estimate=160.05]
3021 & 4685 → 1 [Coverage=0.000 (4); Support=0.000 (3); Confidence estimate=0.501; Lift estimate=160.05]
Very high support rules can be spurious
Data file: covtype.data
581012 cases / 125 values
ST15=0 → ST07=0 [Coverage=1.000 (581009); Support=1.000 (580904); Confidence=1.000; Lift=1.00]
ST07=0 → ST15=0 [Coverage=1.000 (580907); Support=1.000 (580904); Confidence=1.000; Lift=1.00]
ST15=0 → ST36=0 [Coverage=1.000 (581009); Support=1.000 (580890); Confidence=1.000; Lift=1.00]
ST36=0 → ST15=0 [Coverage=1.000 (580893); Support=1.000 (580890); Confidence=1.000; Lift=1.00]
ST15=0 → ST08=0 [Coverage=1.000 (581009); Support=1.000 (580830); Confidence=1.000; Lift=1.00]
ST08=0 → ST15=0 [Coverage=1.000 (580833); Support=1.000 (580830); Confidence=1.000; Lift=1.00]
… 197,183,686 such rules have the highest support
Roles of constraints
1. Select the most relevant patterns
   • patterns that are likely to be interesting
2. Control the number of patterns that the user must consider
3. Make computation feasible
Minimum support can get overloaded!
Part 7:
K-most interesting pattern discovery
K-most interesting pattern discovery
• Find the k patterns that maximise a measure of interest within other constraints that the user may specify
  • removes the need for a minimum support constraint
  • efficient with dense data
  • empowers the user to use a relevant measure of interest
  • user specifies the number of patterns to be returned
  • does not require either monotone or anti-monotone constraints
• Relies on efficient search
  • must be able to retain all data in memory
  • constraints must sufficiently constrain the search space
Part 8:
Itemset discovery
Itemset discovery
• In some contexts it is the collection of correlated variables that is of interest, and the rule structure is superfluous.
• If A is associated with B then B must be associated with A (in the sense of the presence of the antecedent increasing the probability of the presence of the consequent).
• Discovering interesting itemsets is an area that has been little explored.
Part 9:
Contrast discovery
Contrast sets (emerging patterns)
• Sometimes it is interesting to identify differences between contrasting groups
  • eg: how do purchasing patterns differ on weekends from weekdays?
• Contrast sets find sets of conditions that differ significantly between groups:
  ∃ij: P(cset | Gi) ≠ P(cset | Gj)
  with the size of a contrast measured by max over i, j of |support(cset, Gi) − support(cset, Gj)|
Contrast sets (cont.)
• Different analytic objective to association rules
  • more directed
  • focus on differences between groups instead of associations between variables
• Different to classification rules
  • not discriminative
  • no attempt to distinguish all individuals of each group
  • find all contrasts rather than sufficient discriminators
Can be discovered by existing techniques!
• Contrast / emerging pattern discovery is strictly equivalent to standard exploratory rule discovery with the consequent restricted to the group variable:
  ∃ij: P(cset | Gi) ≠ P(cset | Gj)  ⟺  ∃ij: P(Gi | cset) ≠ P(Gj | cset)
Part 10:
Impact rules
Impact rules (quantitative association rules)
• Most rule discovery techniques require that numeric variables be discretised.
• This often loses important information.
• Impact rules associate an antecedent with a distribution on a numeric variable.
• The user specifies what makes a distribution interesting
  • eg largest mean, smallest standard deviation, …
• The system finds rules that maximise the measure of interest within other user-specified constraints
Impact rule discovery example
LengthOfStay: mean = 10.6; min = -6; max = 1687; sum = 367781

COUNTRYOFBIRTH=1100 -> LengthOfStay: Coverage=0.054 (1861); Mean=22.2; Min=-4; Max=1687; Sum=41314; Impact=21612.4

ADMITDay=Wednesday -> LengthOfStay: Coverage=0.159 (5518); Mean=13.3; Min=0; Max=1548; Sum=73389; Impact=15307.6
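
A rough reconstruction of the impact figure (my reading, not a definition given in the slides): impact is the total of the target values over the covered records, minus the total those records would contribute if they were merely average; the small discrepancy from the published figure likely comes from rounding in the reported overall mean:

```python
def impact(covered_sum, covered_count, overall_mean):
    """Excess of the covered records' total over the total expected at the mean."""
    return covered_sum - covered_count * overall_mean

# COUNTRYOFBIRTH=1100: sum 41314 over 1861 records, overall mean 10.6
print(impact(41314, 1861, 10.6))  # 21587.4, close to the reported 21612.4
```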
Summary
• Exploratory pattern discovery empowers the user to select the patterns that are most useful
• Rules provide a modular and powerful knowledge representation formalism
• Association rules discover associations between qualitative variables that are frequent
• K-optimal rules discover associations between qualitative variables that optimise a measure of interest
• Impact rules discover associations between qualitative and quantitative variables
• Contrasts discover differences in distributions over variables between different groups
• If you mine for patterns without appropriate statistical evaluation, expect to find fool's gold!