Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Subgroup Discovery Finding Local Patterns in Data Exploratory Data Analysis Scan the data without much prior focus Find unusual parts of the data Analyse attribute dependencies interpret this as ‘rule’: if X=x and Y=y then Z is unusual Complex data: nominal, numeric, relational? the Subgroup Exploratory Data Analysis Classification: model the dependency of the target on the remaining attributes. problem: sometimes classifier is a black-box, or uses only some of the available dependencies. for example: in decision trees, some attributes may not appear because of overshadowing. Exploratory Data Analysis: understanding the effects of all attributes on the target. Interactions between Attributes Single-attribute effects are not enough XOR problem is extreme example: 2 attributes with no info gain form a good subgroup Apart from A=a, B=b, C=c, … consider also A=aB=b, A=aC=c, …, B=bC=c, … A=aB=bC=c, … … Subgroup Discovery Task “Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attributes” Inductive constraints: Minimum support (Maximum support) Minimum quality (Information gain, X2, WRAcc) Maximum complexity … Subgroup Discovery: the Binary Target Case Confusion Matrix A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target: within subgroup, positive within subgroup, negative outside subgroup, positive outside subgroup, negative target subgroup T F T .42 .13 F .12 .33 .54 .55 1.0 Confusion Matrix High numbers along the TT-FF diagonal means a positive correlation between subgroup and target High numbers along the TF-FT diagonal means a negative correlation between subgroup and target Target distribution on DB is fixed Only two degrees of freedom target subgroup T F T .42 .13 .55 F .12 .33 .45 .54 .46 1.0 Quality Measures A quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number WRAcc, weighted relative accuracy Balance between coverage and unexpectedness WRAcc(S,T) = p(ST) – p(S)p(T) between −.25 and .25, 0 means uninteresting target subgroup T F T .42 .13 F .12 .33 .54 .55 1.0 WRAcc(S,T) = p(ST)−p(S)p(T) = .42 − .297 = .123 Quality Measures WRAcc: Weighted Relative Accuracy Information gain X2 Correlation Coefficient Laplace Jaccard Specificity … Subgroup Discovery as Search true A=a1 A=a1B=b1 A=a1B=b1C=c1 A=a2 A=a1B=b2 B=b1 … B=b2 C=c1 A=a2B=b1 … … T F T .42 .13 F .12 .33 .54 minimum support level reached … .55 1.0 Refinements are (anti-)monotonic entire database Refinements are (anti-) monotonic in their support… target concept S3 refinement of S2 S2 refinement of S1 subgroup S1 …but not in interestingness. This may go up or down. Subgroup Discovery and ROC space ROC Space ROC = Receiver Operating Characteristics Each subgroup forms a point in ROC space, in terms of its False Positive Rate, and True Positive Rate. TPR = TP/Pos = TP/TP+FN (fraction of positive cases in the subgroup) FPR = FP/Neg = FP/FP+TN (fraction of negative cases in the subgroup) ROC Space Properties entire database ‘ROC heaven’ perfect subgroup ‘ROC hell’ random subgroup perfect negative subgroup empty subgroup minimum support threshold source: Flach & Fürnkranz Measures in ROC Space 0 positive negative WRAcc Information Gain isometric Other Measures Precision Gini index Correlation coefficient Foil gain Refinements in ROC Space Refinements of S will reduce the FPR and TPR, so will appear to the left and below S. . . . .. Blue polygon represents possible refinements of S. With a convex measure, f is bounded by measure of corners. If corners are not above minimum quality or current best (top k?), prune search space below S. Multi-class problems Generalising to problems with more than 2 classes is fairly staightforward: target T F X2 C2 C3 .27 .06 .22 .55 .03 .19 .23 .45 .3 .25 .45 1.0 combine values to quality measure Information gain source: Nijssen & Kok subgroup C1 Subgroup Discovery for Numeric targets Numeric Subgroup Discovery Target is numeric: find subgroups with significantly higher or lower average value Trade-off between size of subgroup and average target value h = 3600 h = 3100 h = 2200 Types of SD for Numeric Targets Regression subgroup discovery numeric target has order and scale Ordinal subgroup discovery numeric target has order Ranked subgroup discovery numeric target has order or scale Vancouver 2010 Winter Olympics ordinal target Partial ranking objects share a rank regression target Offical IOC ranking of countries (med > 0) Rank 1 2 3 4 5 6 7 9 9 9 11 12 13.5 13.5 16 16 16 20 20 20 20 20 23 25 25 25 Country Medals USA 37 Germany 30 Canada 26 Norway 23 Austria 16 Russ. Fed. 15 Korea 14 China 11 Sweden 11 shared France 11 Switzerland 9 Netherlands 8 Czech Rep. 6 Poland 6 Italy 5 Japan 5 Finland 5 Australia 3 Belarus 3 Slovakia 3 Croatia 3 Slovenia 3 Latvia 2 Great Britain 1 Estonia 1 Kazakhstan 1 Athletes 214 152 205 100 79 179 46 90 107 ranks 107 are 144 34 92 50 110 94 95 40 49 73 18 49 58 52 30 38 Continent N. America Europe N. America Europe Europe Asia Asia Asia Europe averaged Europe Europe Europe Europe Europe Europe Asia Europe Australia Europe Europe Europe Europe Europe Europe Europe Asia Fractional ranks Popul. 309 82 34 4.8 8.3 142 73 1338 9.3 65 7.8 16.5 10.5 38 60 127 5.3 22 9.6 5.4 4.5 2 2.2 61 1.3 16 Language Family Germanic Germanic Germanic Germanic Germanic Slavic Altaic Sino-Tibetan Germanic Italic Germanic Germanic Slavic Slavic Italic Japonic Finno-Ugric Germanic Slavic Slavic Slavic Slavic Slavic Germanic Finno-Ugric Turkic Repub. y y n n y y y y n y y n y y y n y y y y y y y n y y Polar y n y y n y n n y n n n n n n n y n n n n n n n n n Interesting Subgroups ‘polar = yes’ 1. United States 3. Canada 4. Norway 6. Russian Federation 9. Sweden 16 Finland ‘language_family = Germanic & athletes 60’ 1. United States 2. Germany 3. Canada 4. Norway 5. Austria 9. Sweden 11. Switzerland Intuitions Size: larger subgroups are more reliable * Rank: majority of objects appear at the top language_family = Slavic Position: ‘middle’ of subgroup should differ from middle of ranking Deviation: objects should have similar rank ** ** ** * Intuitions Size: larger subgroups are more reliable Rank: majority of objects appear at the top * ** * * polar = yes Position: ‘middle’ of subgroup should differ from middle of ranking Deviation: objects should have similar rank * Intuitions Size: larger subgroups are more reliable population 10M Rank: majority of objects appear at the top Position: ‘middle’ of subgroup should differ from middle of ranking Deviation: objects should have similar rank ** * * ** ** * * Intuitions Size: larger subgroups are more reliable Rank: majority of objects appear at the top language_family = Slavic & population 10M Position: ‘middle’ of subgroup should differ from middle of ranking Deviation: objects should have similar rank ** ** * Quality Measures Average Mean test z-Score t-Statistic Median X2 statistic AUC of ROC Wilcoxon-Mann-Whitney Ranks statistic Median MAD Metric Meet Cortana the open source Subgroup Discovery tool Cortana Features Generic Subgroup Discovery algorithm quality measure search strategy inductive constraints Flat file, .txt, .arff, (DB connection to come) Support for complex targets 41 quality measures ROC plots Statistical validation Target Concepts ‘Classical’ Subgroup Discovery nominal targets (classification) numeric targets (regression) Exceptional Model Mining multiple targets regression, correlation multi-label classification (to be discussed in a few slides) Mixed Data Data types binary nominal numeric Numeric data is treated dynamically (no discretisation as preprocessing) all: consider all available thresholds bins: discretise the current candidate subgroup best: find best threshold, and search from there Statistical Validation Determine distribution of random results random subsets random conditions swap-randomization Determine minimum quality Significance of individual results Validate quality measures how exceptional? Open Source You can Use Cortana binary datamining.liacs.nl/cortana.html Use and modify Cortana sources (Java) Exceptional Model Mining Subgroup Discovery with multiple target attributes Mixture of Distributions 100 90 80 70 60 50 40 30 20 10 0 0 20 40 60 80 100 Mixture of Distributions 100 90 100 80 70 90 60 50 80 40 70 30 20 60 10 0 0 50 20 40 60 80 100 20 40 60 80 100 100 40 90 80 30 70 20 60 50 10 40 30 0 0 20 40 60 80 100 20 10 0 0 Mixture of Distributions 100 90 100 80 70 90 60 50 80 40 70 30 20 60 10 0 0 50 20 40 60 80 100 20 40 60 80 100 100 40 90 80 30 70 20 60 50 10 40 30 0 0 20 40 60 80 100 20 10 0 0 For each datapoint it is unclear whether it belongs to G or G Intensional description of exceptional subgroup G? Model class unknown Model parameters unknown Solution: extend Subgroup Discovery Use other information than X and Y: object desciptions D Use Subgroup Discovery to scan sub-populations in terms of D Subgroup Discovery: find subgroups of the database where the target attribute shows an unusual distribution. Solution: extend Subgroup Discovery Use other information than X and Y: object desciptions D Use Subgroup Discovery to scan sub-populations in terms of D Model over subgroup becomes target of SD Exceptional Model Mining Subgroup Discovery: find subgroups of the database where the target attributes show an unusual distribution, by means of modeling over the target attributes. Exceptional Model Mining object description target concept X Define a target concept (X and y) y Exceptional Model Mining object description target concept X y modeling Define a target concept (X and y) Choose a model class C Define a quality measure φ over C Exceptional Model Mining target concept object description X y modeling Subgroup Discovery Define a target concept (X and y) Choose a model class C Define a quality measure φ over C Use Subgroup Discovery to find exceptional subgroups G and associated model M Quality Measure Specify what defines an exceptional subgroup G based on properties of model M Absolute measure (absolute quality of M) Correlation coefficient Predictive accuracy Difference measure (difference between M and M) Difference in slope qualitative properties of classifier Reliable results Minimum support level Statistical significance of G Correlation Model Correlation coefficient φρ = ρ(G) Absolute difference in correlation φabs = |ρ(G) - ρ(G)| Entropy weighted absolute difference φent = H(p)·|ρ(G) - ρ(G)| Statistical significance of correlation difference φscd compute z-score from z* z ' z ' ρ through Fisher transform 1 1 n3 n 3 compute p-value from z-score Regression Model Compare slope b of yi = a + b·xi + e, and yi = a + b·xi + e Compute significance of slope difference φssd drive = 1 basement = 0 #baths ≤ 1 y = 41 568 + 3.31·x y = 30 723 + 8.45·x Gene Expression Data 11_band = ‘no deletion’ survival time ≤ 1919 XP_498569.1 ≤ 57 y = 3313 - 1.77·x y = 360 + 0.40·x Classification Model Decision Table Majority classifier BDeu measure (predictiveness) whole database RIF1 160.45 Hellinger (unusual distribution) prognosis = ‘unknown’ General Framework object description target concept X y modeling Subgroup Discovery General Framework object description target concept X Subgroup Discovery y Regression ● Classification ● Clustering Association Graphical modeling ● … General Framework object description target concept X Subgroup Discovery ● Decision Trees SVM … y Regression Classification Clustering Association Graphical modeling … General Framework propositional ● multi-relational ● … target concept X Subgroup Discovery Decision Trees SVM … y Regression Classification Clustering Association Graphical modeling …