Survey
Interestingness Measures

Quality in KDD - Levels of Quality
Quality of discovered knowledge: f(D, M, U)
- Data quality (D): noise, accuracy, missing values, bad values, ... [Berti-Equille 2004]
- Model quality (M): accuracy, generalization, relevance, ...
- User-based quality (U): relevance for decision making
Two categories of quality:
- Objective (D, M): computed from the data only
- Subjective (U): hypotheses on the user's goal and domain knowledge; hard to formalize (novelty)

Examples of Quality Criteria
7 criteria of interest (interestingness) [Hussein 2000]:
- Objective: generality (ex: support); validity (ex: confidence); reliability (high generality and validity)
- Subjective: common sense (reliable but already known); actionability (utility for decision making); novelty (previously unknown); surprise/unexpectedness (contradicts prior knowledge)

Quality and Association Rules
Association rules [Agrawal et al. 1993]:
- market-basket analysis; unsupervised learning
- algorithms + 2 measures (support and confidence)
Problems:
- enormous number of (rough) rules
- little semantics behind the support and confidence measures
- need to help the user select the best rules
Solutions:
- redundancy reduction
- structuring (classes, closed rules)
- improved quality measures
- interactive decision aid (rule mining)

Input and Output
Input: data with p Boolean attributes V0, V1, ..., Vp (columns) and n transactions (rows).
Output: association rules, i.e. implicative tendencies X -> Y between itemsets X and Y, ex: V0 ^ V4 ^ V8 -> V1. Transactions satisfying X but not Y are the negative examples.
2 measures:
- support: supp(X -> Y) = freq(X U Y)
- confidence: conf(X -> Y) = P(Y|X) = freq(X U Y)/freq(X)
Mining algorithms exploit the monotonicity of support. Ex: Diapers -> Beer (supp = 20%, conf = 90%). (NB: the number of possible rules is bounded by about 3^p.)

Limits of Support
- support measures the generality of the rule
- a minimum support threshold (ex: 10%) reduces the complexity but loses nuggets (support pruning)
- nugget: a specific rule (low support) that is valid (high confidence), with high potential of novelty/surprise

Limits of Confidence [Guillaume et al. 1998], [Lallich et al. 2004]
- confidence measures the validity/logical aspect of the rule (inclusion)
- a minimal confidence threshold (ex: 90%) reduces the amount of extracted rules
- but interestingness is not validity, and confidence does not detect independence
- X and Y are independent when P(Y|X) = P(Y); if P(Y) is high, a nonsense rule can still have high support and confidence. Ex: Diapers -> Beer (supp = 20%, conf = 90%) is worthless if supp(Beer) = 90%.

Limits of the Pair Support-Confidence
In practice, a high support threshold (10%) combined with a high confidence threshold (90%) yields valid and general rules: common sense, but no novelty. The pair is efficient but insufficient to capture quality.

Subjective Measures

Criteria
User-oriented measures (U) of interestingness:
- unexpectedness [Silberschatz 1996]: rule unknown to the user, or contradicting the user's knowledge
- actionability (usefulness) [Piatetsky-Shapiro 1994]: usefulness for decision making, gain
- anticipation [Roddick 2001]: prediction on the temporal dimension
Crossing unexpectedness and actionability:
- unexpected + useful = high interestingness
- expected + useful = reinforcement
- expected + non-useful = ?
- unexpected + non-useful = ?

Principle
Algorithm principle:
1. Extract the decision maker's knowledge
2. Formalize this knowledge as K (expected and actionable patterns)
3. Run KDD, producing K'
4. Compare K and K'
5. Select, with subjective measures, the rules Delta(K, K') of K' that differ the most from K (unexpectedness) or are the most similar to it (actionability)

Rule Templates [Klemettinen et al. 1994]
- user knowledge (K): syntactic constraints, i.e. patterns of rules A1, A2, ..., Ak -> Ak+1, where each Ai is a constraint on attribute Vi (interval of values)
- K = K1 + K2: K1 = interesting patterns (select), K2 = uninteresting patterns (reject)
- goal: select the interesting rules inside K'
- Boolean criterion: keep the rules X -> Y of K' satisfying a K1 pattern but no K2 pattern, plus thresholds on support, confidence and rule size |X U Y|

Interestingness [Silberschatz & Tuzhilin 1995]
- user knowledge (K): a set of beliefs (Bayesian rules), each belief a weighted by a degree p(a)
- K = K1 + K2: K1 = hard beliefs (p(a) constant), K2 = soft beliefs (p(a) can vary)
- goal: let the soft beliefs K2 vary as a function of the part of K' that satisfies K1; the interest of a rule R = X -> Y of K' is measured by how much it changes the weights p(a)

Logical Contradiction [Padmanabhan & Tuzhilin 1998]
- user knowledge (K): a set of rules
- goal: select the unexpected rules in K'
- unexpectedness criterion: given A -> B in K' and X -> Y in K, A -> B is unexpected if B and Y are contradictory (P(B ^ Y) = 0), (A ^ X) is frequent, and (A ^ X) -> B holds, hence (A ^ X) -> not-Y holds as well (an exception!)

Attribute Costs [Freitas 1999]
- user knowledge (K): a cost Cost(Ai) for each attribute/item Ai
- goal: select the cheapest rules in K'; a rule A1, A2, ..., Ak -> B is preferred when its mean attribute cost is low

Other Subjective Measures
- Projected Savings (the KEFIR system's interestingness) [Matheus & Piatetsky-Shapiro 1994]
- Fuzzy Matching Interestingness Measure [Liu et al. 1996]
- General Impressions [Liu et al. 1997]
- Logical Contradiction [Padmanabhan & Tuzhilin 1997]
- Misclassification Costs [Freitas 1999]
- Vague Feelings (Fuzzy General Impressions) [Liu et al. 2000]
- Anticipation [Roddick & Rice 2001]
- Interestingness [Shekar & Natarajan 2001]

Classification
Measure | Year | Application | Foundation | Scope | Subjective aspect | User's knowledge representation
1. Matheus & Piatetsky-Shapiro's Projected Savings | 1994 | summaries | utilitarian | single rule | unexpectedness | pattern deviation
2. Klemettinen et al.'s Rule Templates | 1994 | association rules | syntactic | single rule | unexpectedness & actionability | rule templates
3. Silberschatz & Tuzhilin's Interestingness | 1995 | format independent | probabilistic | rule set | unexpectedness | hard & soft beliefs
4. Liu et al.'s Fuzzy Matching Interestingness Measure | 1996 | classification rules | syntactic distance | single rule | unexpectedness | fuzzy rules
5. Liu et al.'s General Impressions | 1997 | classification rules | syntactic | single rule | unexpectedness | GI, RPK
6. Padmanabhan & Tuzhilin's Logical Contradiction | 1997 | association rules | logical, statistical | single rule | unexpectedness | beliefs X -> Y
7. Freitas' Attribute Costs | 1999 | association rules | utilitarian | single rule | actionability | cost values
8. Freitas' Misclassification Costs | 1999 | association rules | utilitarian | single rule | actionability | cost values
9. Liu et al.'s Vague Feelings (Fuzzy General Impressions) | 2000 | generalized association rules | syntactic | single rule | unexpectedness | GI, RPK, PK
10. Roddick & Rice's Anticipation | 2001 | format independent | probabilistic | single rule | temporal dimension | probability graph
11. Shekar & Natarajan's Interestingness | 2002 | association rules | distance | single rule | unexpectedness | fuzzy-graph-based taxonomy

Conclusion on subjective measures
- algorithm + measures to compare K and K'
- focuses the search on interesting rules
- knowledge is domain-specific
- how to acquire K?
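The two base measures introduced earlier (support and confidence) can be sketched directly from their definitions; the toy transaction set below is hypothetical, not from the survey:

```python
# Hypothetical mini market-basket dataset. Following the survey's definitions:
# supp(X -> Y) = freq(X u Y), conf(X -> Y) = freq(X u Y) / freq(X).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "bread"},
    {"milk"},
]

def freq(itemset, db):
    """Relative frequency of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def support(x, y, db):
    return freq(x | y, db)

def confidence(x, y, db):
    return freq(x | y, db) / freq(x, db)

s = support({"diapers"}, {"beer"}, transactions)     # 2/5 = 0.4
c = confidence({"diapers"}, {"beer"}, transactions)  # (2/5)/(3/5) = 2/3
```

Note that confidence alone says nothing about independence: if `freq({"beer"})` were already 0.9, a rule with conf = 0.9 would carry no information, which is exactly the limitation discussed above.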
- representing the knowledge and goals of the decision maker is a hard task
- many improvements remain to be made

Objective Measures: Principles and Classification

Principle
- statistics on the data D (transactions) for each rule R = X -> Y
- interestingness measure i(R, D, H): degree of satisfaction of the hypothesis H in D, independently of the user U

Contingency
Rule X -> Y, with X and Y disjoint itemsets; the rule expresses the inclusion of E(X) in E(Y). 5 observable parameters on the transaction set E:
- n = |E|: number of transactions
- nx = |E(X)|: cardinality of the premise (left-hand side)
- ny = |E(Y)|: cardinality of the conclusion (right-hand side)
- nxy = |E(X ^ Y)|: number of positive examples
- nx-not-y = |E(X ^ not-Y)|: number of negative examples (counterexamples)

Independence
- P(X) is estimated by the frequency nx/n
- hypothesis of independence of X and Y: P(X ^ Y) = P(X)P(Y)
- inclusion is not dependence

Equiprobability (Equilibrium)
A rule X -> Y is at equilibrium when it has the same number of negative (e-) and positive (e+) examples, i.e. when nxy = nx/2 (P(Y|X) = 0.5). Two situations:
- P(Y|X) > 0.5 (e+ dominates): rule X -> Y
- P(Y|X) < 0.5 (e- dominates): rule X -> not-Y
plus the contrapositive not-Y -> not-X.

Interestingness Measure Definition
i(X -> Y) = f(n, nx, ny, nxy). General principles:
- semantics and readability for the user
- value increasing with quality
- sensitivity to equiprobability (inclusion)
- statistical likelihood (confidence in the measure itself)
- noise resistance, stability over time
- surprisingness, nuggets?

Properties in the Literature
Properties of i(X -> Y) = f(n, nx, ny, nxy), from [Piatetsky-Shapiro 1991] (strong rules) and [Major & Mangano 1993]:
- (P1) i = 0 if X and Y are independent
- (P2) i increases with the number of examples nxy
- (P3) i decreases with the size of the premise nx (or of the conclusion ny)
- (P4) i increases with nxy when the confidence nxy/nx is constant
From [Freitas 1999]:
- (P5) asymmetry, i(X -> Y) different from i(Y -> X); small disjuncts (nuggets)
See also [Tan et al. 2002], [Hilderman & Hamilton 2001] and [Gras et al. 2004].

Selected Properties
- inclusion and equiprobability; independence
- noise resistance: security intervals around independence and equilibrium
- sensitivity; comparability (global threshold); non-linearity
- bounded maximum value
- behavior with n (nuggets, dilation, likelihood) and with the premise frequency p(X) (cardinality nx)
- reinforcement by similar rules (contrapositive, negative rule, ...) [Smyth & Goodman 1991], [Kodratoff 2001], [Gras et al. 2001], [Gras et al. 2004]

What Could Be a Good Measure?
A function of the negative examples nx-not-y, sensitive to independence and to equiprobability, with constraints upon the other dimensions (Imax).

Consequences on the Other Dimensions
- conclusion ny: the measure should decrease with ny (as ny approaches n, the dependence weakens)
- data size n: the measure should increase with n and with dilation (stronger evidence of dependence)

Classification
Classification along three criteria:
- the object of the index: the concept measured by the index
- the range of the index: the entity concerned by the measurement
- the nature of the index: statistical or descriptive character

The Object
- certain indices take a fixed value at independence: P(a ∩ b) = P(a) x P(b)
- certain indices take a fixed value at equilibrium:
P(a ∩ b) = P(a)/2
- other indices evaluate a variation with respect to independence, or with respect to equilibrium
- others take no fixed value at independence or at equilibrium (statistical indices)

The Range
Certain indices evaluate more than a single rule:
- a rule and its contrapositive, I(a -> b) = I(not-b -> not-a): indices of quasi-implication
- a rule and its converse, I(a -> b) = I(b -> a): indices of quasi-conjunction
- all three at once, I(a -> b) = I(b -> a) = I(not-b -> not-a): indices of quasi-equivalence

The Nature
- if the index varies with the size of the data: statistical index
- if not: descriptive index

List of Quality Measures
Monodimensional (e+, e-):
- Support [Agrawal et al. 1996]
- Ralambondrainy measure [Ralambondrainy 1991]
Bidimensional - inclusion:
- Descriptive-Confirm [Kodratoff 1999]
- Sebag & Schoenauer [Sebag & Schoenauer 1991]
- Example and counterexample rate (*)
Bidimensional - inclusion - conditional probability:
- Confidence [Agrawal et al. 1996]
- Wang index [Wang et al. 1988]
- Laplace (*)
Bidimensional - analogous rules:
- Descriptive Confirmed-Confidence [Kodratoff 1999] (*)
Tridimensional - analogous rules:
- Causal Support, Causal Confidence, Causal Confirmed-Confidence [Kodratoff 1999] (*)
- Least Contradiction [Azé & Kodratoff 2004] (*)
Tridimensional - linear - independence:
- Pavillon index [Pavillon 1991]
- Rule Interest [Piatetsky-Shapiro 1991] (*)
- Pearl index [Pearl 1988], [Acid et al. 1991], [Gammerman & Luo 1991]
- Correlation [Pearson 1996] (*)
- Loevinger index [Loevinger 1947] (*)
- Certainty factor [Tan & Kumar 2000]
- Rate of connection [Bernard & Charron 1996]
- Interest factor [Brin et al. 1997]
- Top spin (*)
- Cosine [Tan & Kumar 2000] (*)
- Kappa [Tan & Kumar 2000]
Tridimensional - nonlinear - independence:
- Chi-squared distance
- Logarithmic lift [Church & Hanks 1990] (*)
- Predictive association (Goodman & Kruskal) [Tan & Kumar 2000]
- Conviction [Brin et al. 1997b]
- Odds ratio, Yule's Q, Yule's Y [Tan & Kumar 2000]
- Jaccard [Tan & Kumar 2000]
- Klosgen [Tan & Kumar 2000]
- Interestingness [Gray & Orlowska 1998]
- Mutual information ratio (uncertainty) [Tan et al. 2002]
- J-measure [Smyth & Goodman 1991], [Goodman & Kruskal 1959] (*)
- Gini [Tan et al. 2002]
- General measure of rule interestingness [Jaroszewicz & Simovici 2001] (*)
Quadridimensional - linear - independence:
Quadridimensional - likelihood of dependence:
- Probability of error of the chi-squared test (*)
- Intensity of implication [Gras 1996] (*)
Quadridimensional - inclusion - dependence - analogous rules:
- Lerman similarity index [Lerman 1981]
- Implication index [Gras 1996]
- Entropic intensity of implication [Gras 1996] (*)
- TIC [Blanchard et al. 2004] (*)
Others:
- Surprisingness (*) [Freitas 1998]
- rules of exception [Duval et al. 2004]
- rule distance and similarity [Dong & Li 1998]

Objective Measures: Simulations and Properties

Monodimensional measures (e+, e-):
Support [Agrawal et al. 1996]. Definition: supp = nxy/n. Semantics: degree of general information. Sensitivity: 1 parameter; frequency-based; linear; insensitive to independence; value at equilibrium?; symmetric.
Ralambondrainy measure [Ralambondrainy 1991]. Semantics: scarcity of the negative examples. Sensitivity: 1 parameter; frequency-based; linear; insensitive to independence; value at equilibrium?; increasing.

Bidimensional measures - inclusion:
Descriptive-Confirm [Kodratoff 1999]. Definition: (nxy - nx-not-y)/n. Semantics: difference e+ minus e- (improved support). Sensitivity: 2 parameters; frequency-based; linear; insensitive to independence; 0 at equilibrium.
Sebag & Schoenauer [Sebag & Schoenauer 1991]. Definition: nxy/nx-not-y. Semantics: ratio e+/e-. Sensitivity: 2 parameters; frequency-based; non-linear (very selective); insensitive to independence; 1 at equilibrium; unbounded maximum.
Example and counterexample rate (*). Definition: 1 - nx-not-y/nxy. Semantics: ratio built on e+/e-. Sensitivity: 2 parameters; frequency-based; non-linear (tolerant); insensitive to independence; 0 at equilibrium; bounded maximum.
Confidence [Agrawal et al.
1996]
Definition: conf = P(Y|X) = nxy/nx. Semantics: inclusion, validity. Sensitivity: 2 parameters; frequency-based; linear; insensitive to independence; 0.5 at equilibrium; bounded maximum. Variations: Charade [Ganascia 1991]; Descriptive Confirmed-Confidence [Kodratoff 1999].
Wang index [Wang et al. 1988]. Semantics: improved support (integrates a confidence threshold). Sensitivity: 2 parameters; frequency-based; linear; insensitive to independence; value at equilibrium?
Laplace [Clark & Boswell 1991], [Tan & Kumar 2000]. Definition: (nxy + 1)/(nx + 2). Semantics: estimate of the confidence that decreases when the support gets small. Sensitivity: 2 parameters; does not measure frequency when the counts are small; linear; insensitive to independence; bounded maximum.

Bidimensional measures - similar rules:
Descriptive Confirmed-Confidence [Kodratoff 1999]. Definition: P(Y|X) - P(not-Y|X). Semantics: confidence confirmed by the negative rule X -> not-Y. Sensitivity: 2 parameters; frequency-based; linear; insensitive to independence; 0 at equilibrium; bounded maximum; reinforced by the negative rule.
Causal Support [Kodratoff 1999]. Semantics: support improved by the use of the contrapositive. Sensitivity: 3 parameters; frequency-based; linear; insensitive to independence; value at equilibrium?; reinforced by the contrapositive rule.
Causal Confidence [Kodratoff 1999]. Definition: (P(Y|X) + P(not-X|not-Y))/2. Semantics: confidence reinforced by the contrapositive. Sensitivity: 3 parameters; frequency-based; linear; insensitive to independence; value at equilibrium?; bounded maximum; reinforced by the contrapositive rule. Evolution: Causal Confirmed-Confidence (contrapositive + negative rule).
Least Contradiction [Azé & Kodratoff 2004]. Definition: (nxy - nx-not-y)/ny. Semantics: least contradiction. Sensitivity: 3 parameters; frequency-based; linear; 0 at equilibrium; supports an inclusive reading; reinforced by the negative rule; coupled with a mining algorithm.

Tridimensional measures - independence:
Centered Confidence (Pavillon index) [Pavillon 1991]. Definition: P(Y|X) - P(Y). Semantics: variation from independence, correcting for the size of the conclusion. Sensitivity: 3 parameters; frequency-based; linear; 0 at independence; value at equilibrium? Called Added Value in [Tan et al. 2002].
Rule Interest [Piatetsky-Shapiro 1991]. Definition: nxy - nx.ny/n. Semantics: gap to independence (strong rules). Sensitivity: 3 parameters; frequency-based; linear; 0 at independence; value at equilibrium? Symmetric alternative: the Pearl index [Pearl 1988], [Acid et al. 1991], [Gammerman & Luo 1991].
Coefficient of Correlation [Pearson 1996]. Definition: (P(X ^ Y) - P(X)P(Y)) / sqrt(P(X)P(Y)(1 - P(X))(1 - P(Y))). Semantics: correlation. Sensitivity: 3 parameters; frequency-based; linear; 0 at independence; value at equilibrium?
Loevinger (*) [Loevinger 1947]. Definition: (P(Y|X) - P(Y))/(1 - P(Y)). Semantics: implicative dependence. Sensitivity: 3 parameters; frequency-based; linear; 0 at independence; bounded maximum (inclusion); value at equilibrium?
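Several of the measures discussed here can be written directly over the contingency counts (n, nx, ny, nxy). The sketch below is ours, not the survey's code, and the toy counts are hypothetical:

```python
# Measures over the counts of a rule X -> Y:
# n transactions, nx with X, ny with Y, nxy with both.
def confidence(n, nx, ny, nxy):
    return nxy / nx

def causal_confidence(n, nx, ny, nxy):
    # Average of the rule X -> Y and its contrapositive not-Y -> not-X.
    n_noty = n - ny
    n_notx_noty = n_noty - (nx - nxy)
    return 0.5 * (nxy / nx + n_notx_noty / n_noty)

def rule_interest(n, nx, ny, nxy):
    # Piatetsky-Shapiro: gap to independence, 0 when X and Y are independent.
    return nxy - nx * ny / n

def loevinger(n, nx, ny, nxy):
    # 1 - P(X and not-Y) / (P(X) * P(not-Y)); reaches 1 at strict inclusion.
    return 1 - n * (nx - nxy) / (nx * (n - ny))

# Toy counts: n=100, nx=20, ny=50, nxy=18 (so 2 counterexamples).
vals = [f(100, 20, 50, 18) for f in
        (confidence, causal_confidence, rule_interest, loevinger)]
```

With these counts, confidence is 0.9, causal confidence 0.93 (the contrapositive is also strong), Rule Interest 8.0 and Loevinger 0.8; both independence-based measures drop to 0 when nxy = nx*ny/n = 10.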
Equivalent measure: Certainty Factor [Tan & Kumar 2000].
Rate of Connection [Bernard & Charron 1996]. Semantics: dependence. Sensitivity: 3 parameters; frequency-based; linear; 0 at independence; inclusion?; value at equilibrium? Variation: the interest factor [Brin et al. 1997], equivalent to the lift; alternative: the logarithmic lift [Church & Hanks 1990].
Interest Factor (Lift) [Brin et al. 1997], Logarithmic Lift (*) [Church & Hanks 1990], Cosine (*) [Tan & Kumar 2000]. Definitions: lift = P(X ^ Y)/(P(X)P(Y)); logarithmic lift = log2 of the lift; cosine = P(X ^ Y)/sqrt(P(X)P(Y)). Semantics: dependence. Sensitivity: 3 parameters; frequency-based; linear; inclusion?; value at equilibrium?
Kappa [Tan & Kumar 2000]. Sensitivity: 3 parameters; frequency-based; linear; 0 at independence; bounded maximum; value at equilibrium?; strengthened by the contrapositive.
Predictive Association (*) (Goodman & Kruskal) [Tan & Kumar 2000]. Semantics: X as a good predictor of Y. Sensitivity: 3 parameters; frequency-based; piecewise linear; 0 at independence?; bounded maximum?; value at equilibrium?
Conviction [Brin et al. 1997b]. Definition: P(X)P(not-Y)/P(X ^ not-Y). Semantics: conviction. Sensitivity: 3 parameters; frequency-based; non-linear (very selective); 1 at independence; unbounded maximum; value at equilibrium? (shape similar to Sebag & Schoenauer [Sebag & Schoenauer 1991], except at independence).
Odds Ratio, Yule's Q, Yule's Y [Tan & Kumar 2000]. Definitions (close to conviction): OR = P(X ^ Y)P(not-X ^ not-Y) / (P(X ^ not-Y)P(not-X ^ Y)); Q = (OR - 1)/(OR + 1); Y = (sqrt(OR) - 1)/(sqrt(OR) + 1). Semantics: correlation. Sensitivity: 3 parameters; frequency-based; non-linear (noise resistance?); 1 or 0 at independence; maximum bounded (to 1) or not; value at equilibrium?; strengthened by the similar rules.
Jaccard, Klosgen [Tan & Kumar 2000]. Definitions: Jaccard = P(X ^ Y)/(P(X) + P(Y) - P(X ^ Y)); Klosgen = sqrt(P(X ^ Y)) * (P(Y|X) - P(Y)). Semantics: correlation. Sensitivity: 3 parameters; frequency-based; non-linear; 0 at independence; maximum bounded (0 or 1); value at equilibrium?; strengthened by similar rules.
Interestingness Weighting Dependency [Gray & Orlowska 1998]. Semantics: interest? Sensitivity: 3 parameters; frequency-based; non-linear; 0 at independence; inclusion?; value at equilibrium?
Mutual Information (Uncertainty) [Tan et al. 2002]. Semantics: information gain provided by X about Y. Sensitivity: 3 parameters; frequency-based; non-linear, entropic; 0 at independence; inclusion?; value at equilibrium?; strongly symmetric; low values.
J-Measure (*) [Smyth & Goodman 1991], [Goodman & Kruskal 1959]. Definition: P(X) * [P(Y|X) log2(P(Y|X)/P(Y)) + P(not-Y|X) log2(P(not-Y|X)/P(not-Y))]. Semantics: cross entropy (via mutual information). Sensitivity: 3 parameters; frequency-based; non-linear, entropic; 0 at independence, concave; inclusion?; value at equilibrium?
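Three of the independence-sensitive measures above, sketched over the counts (function names and toy counts are ours; the formulas follow the standard literature definitions):

```python
import math

def lift(n, nx, ny, nxy):
    # Interest factor: 1 exactly when X and Y are independent.
    return n * nxy / (nx * ny)

def conviction(n, nx, ny, nxy):
    # P(X)P(not-Y) / P(X and not-Y); unbounded as counterexamples vanish.
    return (nx * (n - ny)) / (n * (nx - nxy))

def j_measure(n, nx, ny, nxy):
    # P(X) times the cross entropy of Y given X against the prior of Y.
    px, py, py_x = nx / n, ny / n, nxy / nx
    total = 0.0
    for p_cond, p_prior in ((py_x, py), (1 - py_x, 1 - py)):
        if p_cond > 0:          # treat 0 * log(0) as 0
            total += p_cond * math.log2(p_cond / p_prior)
    return px * total

print(lift(100, 20, 50, 18))        # 1.8
print(conviction(100, 20, 50, 18))  # 5.0
print(round(j_measure(100, 20, 50, 18), 4))
```

At independence (nxy = 10 for these margins) the lift returns 1 and the J-measure 0, while the conviction returns 1, matching the fixed values noted in the text.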
Symmetric; low values; strengthened by the negative rule X -> not-Y.
Gini Index. Semantics: quadratic entropy. Sensitivity: 3 parameters; frequency-based; non-linear, entropic; 0 at independence, concave; inclusion?; value at equilibrium?; very symmetric; low values.
General Measure of Rule Interestingness (*) [Jaroszewicz & Simovici 2001]. Definition: a continuum of measures between the Gini index and the chi-squared distance. Semantics: ? Sensitivity: 3 parameters; frequency-based; non-linear (from Gini to the chi-squared distance); 0 at independence; inclusion?; value at equilibrium?; from non-symmetric to symmetric. Notation: Delta-alpha is the family of measures conditioned by a real factor alpha (Gini -> chi-squared distance); Delta-X (resp. Delta-Y) is the distribution of X (resp. Y); Delta-XY is the joint distribution of X and Y; Delta-X x Delta-Y is the joint distribution under the hypothesis of independence; theta is the a priori distribution of Y.

Quadridimensional measures - independence:
Lerman Similarity [Lerman 1981]. Semantics: centered and normalized number of examples. Sensitivity: 4 parameters; statistical measure (counts); linear; 0 at independence; inclusion?; value at equilibrium?
Variation: Implication Index [Gras 1996]. Definition: q = (nx-not-y - nx.n-not-y/n) / sqrt(nx.n-not-y/n). Semantics: normalized number of counterexamples. Sensitivity: 4 parameters; statistical measure (counts); linear; 0 at independence; inclusion?; value at equilibrium?
Lerman Similarity [Lerman 1981] (probabilistic version). Definition: probabilistic modeling via the chi-squared law. Semantics: probability of a dependence between X and Y. Sensitivity: 4 parameters; measures a probability, not a frequency; non-linear, with tolerance to negative examples; fixed value at independence; bounded maximum; inclusion?; value at equilibrium?; strongly symmetric; can be coupled with the interest measure of [Brin et al. 1997]. Alternative: the likelihood ratio [Ritschard et al. 1998].
Intensity of Implication (*) [Gras 1996] (statistical implicative analysis). Definition: probabilistic modeling of the number of counterexamples. Semantics: likelihood of the scarcity of counterexamples (statistical astonishment). Sensitivity: 4 parameters; measures a probability, not a frequency; non-linear, with tolerance to negative examples; 0.5 at independence (likelihood); bounded maximum; inclusion?; value at equilibrium?; logical rules: the value can be 0. Inspired by the likelihood of the link [Lerman et al. 1981].

Intensity of Implication and Statistical Implicative Analysis: Extensions
Modeling:
- binary variables extended to numerical, ordinal, interval and fuzzy variables [Bernadet 2000], [Guillaume 2002], ...
- bulky data: entropic intensity of implication [Gras et al. 2001]
- sequences: prediction rules [Blanchard et al. 2002]
Structuring:
- implicative hierarchy (cohesion) [Gras et al. 2001]
- typicality and variable reduction (implicative inertia) [Gras et al. 2002]
Applications: CHIC (http://www.ardm.asso.fr/CHIC.html), SIPINA (University of Lyon 2), FELIX (PerformanSE SA).

Entropic Intensity of Implication (*) [Gras et al. 2001] (statistical implicative analysis). Definition: combines the intensity of implication with an inclusion rate based on an asymmetric entropy H'(Y|X) that decreases as P(Y|X) increases. Semantics: statistical surprise + inclusion (removal of the equilibrium weakness). Sensitivity: 4 parameters; frequency-based, non-probabilistic; non-linear, with a tolerance to negative examples whose selectivity is adjusted by a parameter alpha (ex: alpha = 2); at most 0.5 at independence; 0 at equilibrium; strengthened by the contrapositive; maximum value bounded (1).
TIC (*) [Blanchard et al. 2004] (statistical implicative analysis). Definition: an information rate based on an asymmetric entropy. Semantics: statistical surprise + inclusion (removal of the equilibrium weakness). Sensitivity: 4 parameters; frequency-based; non-linear, entropic; 0 at independence; 0 at equilibrium; strengthened by the contrapositive; maximum value bounded (1).
Surprisingness (*) [Freitas 1998]. Definition: based on the information gain (conditional entropy) provided by each attribute Xi of the premise of a rule X1 ^ X2 ^ ... ^ Xp -> Y. Semantics: surprise as the informational gain brought by the premise. Frequency-based; non-linear, entropic; can be used to assess the individual contribution of each attribute ...
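A hedged sketch of the intensity of implication under the common Gaussian approximation of its Poisson model (the function name and toy counts are ours, not from the survey): counterexamples expected under independence are modeled with mean lam = nx(n - ny)/n, and the intensity is the probability of observing more counterexamples than were actually seen.

```python
import math

def implication_intensity(n, nx, ny, nx_noty):
    # lam: expected number of counterexamples under independence.
    lam = nx * (n - ny) / n
    # Implication index [Gras 1996]: standardized counterexample count.
    q = (nx_noty - lam) / math.sqrt(lam)
    # Intensity = 1 - Phi(q), via the complementary error function.
    return 0.5 * math.erfc(q / math.sqrt(2))

print(implication_intensity(100, 20, 50, 10))  # at independence: 0.5
print(implication_intensity(100, 20, 50, 2))   # few counterexamples: near 1
```

This makes the announced behavior concrete: exactly 0.5 at independence, close to 1 when counterexamples are surprisingly rare, and close to 0 when they are surprisingly frequent.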
Comparative Theory

Comparison by simulation:
- intensity of implication vs. confidence, J-measure, coverage rate
- intensity of implication vs. confidence, PS (Rule Interest), intensity of implication variants
- TIC vs. confidence, TIM, J-measure, Gini index

Synthesis and comparative studies:
- [Bayardo & Agrawal 1999]: influence of support on the most interesting rules
- [Hilderman & Hamilton 2001]: interestingness of summaries, 10 criteria
- [Lenca et al. 2004]: interest of association rules; 21 measures, 8 principles; study of correlations and of the influence of support; 20 measures and 8 criteria for multi-criteria decision support
- [Gras et al. 2004]: interest of association rules; 9 symmetric measures; relationships observed between pairs of measures; influence of support
- [Tan et al. 2002]: interest of association rules; 16 measures, 5 independence principles; correlation study
- [Azé & Kodratoff 2001]: resistance to noise in the data
- [Tan & Kumar 2000]: interest of association rules; 9 measures; functions monotone/antitone in the support; optimization
- [Lallich & Teytaud 2004]: interest of association rules; 15 measures, 10 principles; learning, using the VC-dimension

Study of comparative experiments: project AR-QAT (Quality Measures Analysis Tool)
- experimental setting: input data sets, 30 objective measures
- experimental results: positive correlations; stable strong positive correlations; average correlation

ARVAL: a workshop for calculating quality measures, for the scientific community: http://www.univ-nantes.fr/arval

Conclusion and Outlook
Quality is a multidimensional concept:
- subjective (decision maker): interest changes with the knowledge of the decision maker; PB1: extracting the knowledge and objectives of the decision maker
- objective (data and rules): interest as hypotheses on the data: inclusion, independence, equilibrium, nuggets, robustness, ...; antagonism between independence and equilibrium
- many indices (~50!):
- PB2: practice restricted to support/confidence => a workshop for calculating indices
- PB3: comparative studies of properties, simulations, and experiments on the behavior over data: a platform?
- PB4: combining the indices, choosing the right index => decision support
- PB5: new indices?
- PB6: what is a good index? (the ingredients of quality)

Perspectives (PB1): combining the subjective and objective aspects of quality; knowledge-oriented search; anthropocentric approach; adaptive extraction; FELIX [Lehn et al. 1999], AR-VIS [Blanchard et al.
2003] Ax: Quality Assessment of Knowledge Perspective (PB 2 3 4 5) Platform for experimentation, support and a decision Calculation: ARVAL? (www.polytech.univ-nantes.fr/arval) Analysis: AR-QAT? [Popovici 2003] Decision Support: HERBS? [Lenca et al. 2003] (wwwiasc.enst-bretagne.fr/ecd-ind/HERBS) Bibliography [Agrawal et al., 1993] R. Agrawal, T. Imielinsky et A. Swami. Mining associations rules between sets of items in large databases. Proc. of ACM SIGMOD'93, 1993, p. 207-216 [Azé & Kodratoff, 2001] J. Azé et Y. Kodratoff. Evaluation de la résistance au bruit de quelques mesures d'extraction de règles d'association. Extraction des connaissances et apprentissage 1(4), 2001, p. 143-154 [Azé & Kodratoff, 2001] J. Azé et Y. Kodratoff. Extraction de « pépites » de connaissances dans les données : une nouvelle approche et une étude de sensibilité au bruit. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Bayardo & Agrawal, 1999] R.J. Bayardo et R. Agrawal. Mining the most interesting rules. Proc. of the 5th Int. Conf. on Knowledge Discovery and Data Mining, 1999, p.145-154. [Bernadet 2000] M. Bernardet. Basis of a fuzzy knowledge discovery system. Proc. of Principles of Data Mining and Knowledge Discovery, LNAI 1510, pages 24-33. Springer, 2000. [Bernard et Charron 1996] J.-M. Bernard et C. Charron. L’analyse implicative bayésienne, une méthode pour l’étude des dépendances orientées. I. Données binaires, Revue Mathématique Informatique et Sciences Humaines (MISH), vol. 134, 1996, p. 5-38. [Berti-Equille 2004] L. Berti-équille. Etat de l'art sur la qualité des données : un premier pas vers la qualité des connaissances. Rapport d’activité du groupe gafoQualité de l’AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Blanchard et al. 2001] J. Blanchard, F. Guillet, et H. Briand. L'intensité d'implication entropique pour la recherche de règles de prédiction intéressantes dans les séquences de pannes d'ascenseurs. 
Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(4):77-88, 2002.
[Blanchard et al. 2003] J. Blanchard, F. Guillet, F. Rantière and H. Briand. Vers une Représentation Graphique en Réalité Virtuelle pour la Fouille Interactive de Règles d'Association. Extraction des Connaissances et Apprentissage (ECA), vol. 17, n°1-2-3, p. 105-118, 2003. Hermès Science Publication. ISSN 0992-499X, ISBN 2-7462-0631-5.
[Blanchard et al. 2003a] J. Blanchard, F. Guillet and H. Briand. Une visualisation orientée qualité pour la fouille anthropocentrée de règles d'association. In Cognito - Cahiers Romans de Sciences Cognitives. To appear. ISSN 1267-8015.
[Blanchard et al. 2003b] J. Blanchard, F. Guillet and H. Briand. A User-driven and Quality-oriented Visualization for Mining Association Rules. Proc. of the Third IEEE International Conference on Data Mining, ICDM'2003, Melbourne, Florida, USA, November 19-22, 2003.
[Blanchard et al., 2004] J. Blanchard, F. Guillet, R. Gras and H. Briand. Mesurer la qualité des règles et de leurs contraposées avec le taux informationnel TIC. EGC2004, RNTI, Cépaduès, 2004. To appear.
[Blanchard et al., 2004a] J. Blanchard, F. Guillet, R. Gras and H. Briand. Mesure de la qualité des règles d'association par l'intensité d'implication entropique. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. To appear in [Briand et al. 2004].
[Breiman et al. 1984] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Chapman & Hall, 1984.
[Briand et al. 2004] H. Briand, M. Sebag, R. Gras and F. Guillet (eds). Mesures de Qualité pour la fouille de données. Revue des Nouvelles Technologies de l'Information, RNTI, Cépaduès, 2004. To appear.
[Brin et al., 1997] S. Brin, R. Motwani and C. Silverstein. Beyond Market Baskets: Generalizing Association Rules to Correlations. Proc. of SIGMOD'97, p. 265-276, AZ, USA, 1997.
[Brin et al., 1997b] S. Brin, R. Motwani, J. Ullman and S. Tsur.
Dynamic itemset counting and implication rules for market basket data. Proc. of the Int. Conf. on Management of Data, ACM Press, 1997, p. 255-264.
[Church & Hanks, 1990] K. W. Church and P. Hanks. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), p. 22-29, 1990.
[Clark & Boswell 1991] P. Clark and R. Boswell. Rule Induction with CN2: Some Recent Improvements. Proc. of the European Working Session on Learning, EWSL-91, 1991.
[Dong & Li, 1998] G. Dong and J. Li. Interestingness of Discovered Association Rules in terms of Neighborhood-Based Unexpectedness. In X. Wu, R. Kotagiri and K. Korb (eds), Proc. of the 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'98), Melbourne, Australia, April 1998.
[Duval et al. 2004] B. Duval, A. Salleb and C. Vrain. Méthodes et mesures d'intérêt pour l'extraction de règles d'exception. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. To appear in [Briand et al. 2004].
[Fleury 1996] L. Fleury. Découverte de connaissances pour la gestion des ressources humaines. Thèse de doctorat, Université de Nantes, 1996.
[Frawley & Piatetsky-Shapiro 1992] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge discovery in databases: an overview. AI Magazine, 14(3), 1992, p. 57-70.
[Freitas, 1998] A. A. Freitas. On Objective Measures of Rule Surprisingness. In J. Zytkow and M. Quafafou (eds), Proc. of the Second European Conf. on the Principles of Data Mining and Knowledge Discovery (PKDD'98), p. 1-9, Nantes, France, September 1998.
[Freitas, 1999] A. Freitas. On rule interestingness measures. Knowledge-Based Systems Journal 12(5-6), 1999, p. 309-315.
[Gago & Bento, 1998] P. Gago and C. Bento. A Metric for Selection of the Most Promising Rules. PKDD'98, 1998.
[Gray & Orlowska, 1998] B. Gray and M. E. Orlowska. CCAIIA: Clustering Categorical Attributes into Interesting Association Rules. In X. Wu, R. Kotagiri and K.
Korb (eds), Proc. of the 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'98), p. 132-143, Melbourne, Australia, April 1998.
[Goodman & Kruskal 1959] L. A. Goodman and W. H. Kruskal. Measures of Association for Cross Classification, II: Further discussion and references. Journal of the American Statistical Association, ??? 1959.
[Gras et al. 1995] R. Gras, H. Briand and P. Peter. Structuration sets with implication intensity. Proc. of the Int. Conf. on Ordinal and Symbolic Data Analysis (OSDA 95). Springer, 1995.
[Gras, 1996] R. Gras et al. L'implication statistique - Nouvelle méthode exploratoire de données. La Pensée Sauvage éditions, 1996.
[Gras et al. 2001] R. Gras, P. Kuntz and H. Briand. Les fondements de l'analyse statistique implicative et quelques prolongements pour la fouille de données. Mathématiques et Sciences Humaines : Numéro spécial Analyse statistique implicative, 1(154-155):9-29, 2001.
[Gras et al. 2001b] R. Gras, P. Kuntz, R. Couturier and F. Guillet. Une version entropique de l'intensité d'implication pour les corpus volumineux. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(1-2):69-80, 2001.
[Gras et al. 2002] R. Gras, F. Guillet and J. Philippe. Réduction des colonnes d'un tableau de données par quasi-équivalence entre variables. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(4):197-202, 2002.
[Gras et al. 2004] R. Gras, R. Couturier, J. Blanchard, H. Briand, P. Kuntz and P. Peter. Quelques critères pour une mesure de la qualité des règles d'association. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. To appear in [Briand et al. 2004].
[Guillaume et al. 1998] S. Guillaume, F. Guillet and J. Philippé. Improving the discovery of association rules with intensity of implication. Proc. of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, LNAI 1510, p. 318-327. Springer, 1998.
[Guillaume 2002] S. Guillaume.
Discovery of Ordinal Association Rules. In M.-S. Cheng, P. S. Yu and B. Liu (eds), Proc. of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2002), LNCS 2336, p. 322-327. Springer, 2002.
[Guillet et al. 1999] F. Guillet, P. Kuntz and R. Lehn. A genetic algorithm for visualizing networks of association rules. Proc. of the 12th Int. Conf. on Industrial and Engineering Appl. of AI and Expert Systems, LNCS 1611, p. 145-154. Springer, 1999.
[Guillet 2000] F. Guillet. Mesures de qualité de règles d'association. Cours DEA-ECD, Ecole polytechnique de l'université de Nantes, 2000.
[Hilderman & Hamilton, 1998] R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Interestingness Measures: A Survey. (KDD'98), ??? New York, 1998.
[Hilderman & Hamilton, 2001] R. Hilderman and H. Hamilton. Knowledge discovery and measures of interest. Kluwer Academic Publishers, 2001.
[Hussain et al. 2001] F. Hussain, H. Liu, E. Suzuki and H. Lu. Exception Rule Mining with a Relative Interestingness Measure. ???
[Jaroszewicz & Simovici, 2001] S. Jaroszewicz and D. A. Simovici. A general measure of rule interestingness. Proc. of the 7th Int. Conf. on Knowledge Discovery and Data Mining, LNCS 2168, Springer, 2001, p. 253-265.
[Klemettinen et al. 1994] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen and A. I. Verkamo. Finding Interesting Rules from Large Sets of Discovered Association Rules. In N. R. Adam, B. K. Bhargava and Y. Yesha (eds), Proc. of the Third International Conf. on Information and Knowledge Management, p. 401-407, Gaithersburg, Maryland, 1994.
[Kodratoff, 1999] Y. Kodratoff. Comparing Machine Learning and Knowledge Discovery in Databases: An Application to Knowledge Discovery in Texts. Lecture Notes on AI (LNAI), Tutorial series, 2000.
[Kuntz et al. 2000] P. Kuntz, F. Guillet, R. Lehn and H. Briand. A User-Driven Process for Mining Association Rules. In D. Zighed, J. Komorowski and J. M.
Zytkow (eds), Principles of Data Mining and Knowledge Discovery (PKDD 2000), Lecture Notes in Computer Science, vol. 1910, p. 483-489. Springer, 2000.
[Kodratoff, 2001] Y. Kodratoff. Comparing machine learning and knowledge discovery in databases: an application to knowledge discovery in texts. In G. Paliouras, V. Karkaletsis and C. D. Spyropoulos (eds), Machine Learning and Its Applications, LNCS 2049, Springer, 2001, p. 1-21.
[Kuntz et al. 2001] P. Kuntz, F. Guillet, R. Lehn and H. Briand. A user-driven process for mining association rules. Proc. of Principles of Data Mining and Knowledge Discovery, LNAI 1510, p. 483-489. Springer, 2000.
[Kuntz et al. 2001b] P. Kuntz, F. Guillet, R. Lehn and H. Briand. Vers un processus d'extraction de règles d'association centré sur l'utilisateur. In Cognito, Revue francophone internationale en sciences cognitives, 1(20):13-26, 2001.
[Lallich & Teytaud 2004] S. Lallich and O. Teytaud. Évaluation et validation de l'intérêt des règles d'association. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. To appear in [Briand et al. 2004].
[Lehn et al. 1999] R. Lehn, F. Guillet, P. Kuntz, H. Briand and J. Philippé. FELIX: an interactive rule mining interface in a KDD process. In P. Lenca (ed), Proc. of the 10th Mini-Euro Conference, Human Centered Processes (HCP'99), p. 169-174, Brest, France, September 22-24, 1999.
[Lenca et al. 2004] P. Lenca, P. Meyer, B. Vaillant, P. Picouet and S. Lallich. Evaluation et analyse multi-critères des mesures de qualité des règles d'association. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. To appear in [Briand et al. 2004].
[Lerman et al. 1981] I. C. Lerman, R. Gras and H. Rostam. Elaboration et évaluation d'un indice d'implication pour les données binaires. Revue Mathématiques et Sciences Humaines, 75, p. 5-35, 1981.
[Lerman, 1981] I. C. Lerman. Classification et analyse ordinale des données. Paris, Dunod, 1981.
[Lerman, 1993] I. C. Lerman.
Likelihood linkage analysis classification method. Biochimie 75, p. 379-397, 1993.
[Lerman & Azé 2004] I. C. Lerman and J. Azé. Indice probabiliste discriminant de vraisemblance du lien pour des données volumineuses. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. To appear in [Briand et al. 2004].
[Liu et al., 1999] B. Liu, W. Hsu, L. Mun and H. Lee. Finding interesting patterns using user expectations. IEEE Transactions on Knowledge and Data Engineering 11, 1999, p. 817-832.
[Loevinger, 1947] J. Loevinger. A systemic approach to the construction and evaluation of tests of ability. Psychological Monographs, 61(4), 1947.
[Mannila & Pavlov, 1999] H. Mannila and D. Pavlov. Prediction with Local Patterns using Cross-Entropy. Technical Report, Information and Computer Science, University of California, Irvine, 1999.
[Matheus & Piatetsky-Shapiro, 1996] C. J. Matheus and G. Piatetsky-Shapiro. Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds), Advances in Knowledge Discovery and Data Mining, p. 401-419, 1996. AAAI Press/MIT Press.
[Meo 2000] R. Meo. Theory of dependence values. ACM Transactions on Database Systems 5(3), p. 380-406, 2000.
[Padmanabhan & Tuzhilin, 1998] B. Padmanabhan and A. Tuzhilin. A belief-driven method for discovering unexpected patterns. Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998, p. 94-100.
[Pearson, 1896] K. Pearson. Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Philosophical Transactions of the Royal Society, vol. A, 1896.
[Piatetsky-Shapiro, 1991] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley (eds), Knowledge Discovery in Databases, AAAI/MIT Press, 1991, p. 229-248.
[Popovici, 2003] E. Popovici. Un atelier pour l'évaluation des indices de qualité. Mémoire de D.E.A.
E.C.D., IRIN/Université Lyon 2/RACAI Bucarest, June 2003.
[Ritschard et al., 1998] G. Ritschard, D. A. Zighed and N. Nicoloyannis. Maximiser l'association par agrégation dans un tableau croisé. In J. Zytkow and M. Quafafou (eds), Proc. of the Second European Conf. on the Principles of Data Mining and Knowledge Discovery (PKDD'98), Nantes, France, September 1998.
[Sebag & Schoenauer, 1988] M. Sebag and M. Schoenauer. Generation of rules with certainty and confidence factors from incomplete and incoherent learning bases. In J. Boose, B. Gaines and M. Linster (eds), Proc. of the European Knowledge Acquisition Workshop (EKAW'88), Gesellschaft für Mathematik und Datenverarbeitung mbH, 1988, p. 28.1-28.20.
[Shannon & Weaver, 1949] C. E. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, 1949.
[Silberschatz & Tuzhilin, 1995] A. Silberschatz and A. Tuzhilin. On Subjective Measures of Interestingness in Knowledge Discovery. (KD. & DM. '95) ???, 1995.
[Smyth & Goodman, 1991] P. Smyth and R. M. Goodman. Rule induction using information theory. In G. Piatetsky-Shapiro and W. J. Frawley (eds), Knowledge Discovery in Databases, AAAI/MIT Press, 1991, p. 159-176.
[Tan & Kumar 2000] P. Tan and V. Kumar. Interestingness Measures for Association Patterns: A Perspective. Workshop tutorial (KDD 2000).
[Tan et al., 2002] P. Tan, V. Kumar and J. Srivastava. Selecting the right interestingness measure for association patterns. Proc. of the 8th Int. Conf. on Knowledge Discovery and Data Mining, 2002, p. 32-41.
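Appendix: the conclusion argues that with some fifty measures available, their behavior must be compared empirically (PB3), as in the correlation studies of [Tan et al., 2002] and the AR-QAT experiments. The sketch below is a minimal illustration of that idea, not the ARVAL or AR-QAT implementation: it computes a handful of classic measures from a rule's contingency counts and estimates the correlation between two of them over simulated rules. The function names and the simulation protocol are ad hoc assumptions for the example.

```python
import math
import random

def measures(n, n_x, n_y, n_xy):
    """A few classic interestingness measures for a rule X -> Y, from the
    counts n (transactions), n_x (X), n_y (Y), n_xy (X and Y together).
    Illustrative subset of the ~50 measures surveyed in the talk."""
    p_x, p_y, p_xy = n_x / n, n_y / n, n_xy / n
    return {
        "support": p_xy,                            # supp(X -> Y)
        "confidence": p_xy / p_x,                   # conf(X -> Y) = P(Y|X)
        "lift": p_xy / (p_x * p_y),                 # interest [Brin et al., 1997]
        "PS": p_xy - p_x * p_y,                     # rule-interest [Piatetsky-Shapiro, 1991]
        # Loevinger's index [Loevinger, 1947]: 1 - P(X and not-Y) / (P(X) P(not-Y))
        "loevinger": 1 - (p_x - p_xy) / (p_x * (1 - p_y)) if p_y < 1 else 1.0,
    }

def pearson(xs, ys):
    """Plain Pearson correlation between two lists of measure values."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

# Simulate random rules and compare two measures empirically.
random.seed(0)
rows = []
for _ in range(500):
    n = 1000
    n_x = random.randint(50, 900)
    n_y = random.randint(50, 900)
    # n_xy must respect the contingency-table bounds.
    n_xy = random.randint(max(1, n_x + n_y - n), min(n_x, n_y))
    rows.append(measures(n, n_x, n_y, n_xy))

r = pearson([m["confidence"] for m in rows], [m["loevinger"] for m in rows])
print(f"confidence vs. Loevinger correlation on simulated rules: {r:.2f}")
```

Swapping in other measures (or real data sets instead of simulated counts) is exactly the kind of comparison a platform like AR-QAT automates.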