Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin Outline Motivation Objective Research Review Search for Contrast Sets Filtering for Summarizing Contrast Set Evaluation Conclusion Motivation Learning group differences a central problem in many domains Contrasting groups especially important in social science research Objective Automatically detect differences between contrasting groups from observational multivariate data Research Review time series research traditional statistical methods rule learner and decision tree multiple observations miss group differences association rule mining multiple group and different search criteria Problem Definition itemset concept extends to contrast set Definition 1: Let A1,A2,...,Ak be a set of k variables called attributes. Each Ai can take on values from the set {Vi1,Vi2,...Vim}. Contrast set a conjunction of attribute – value pairs defined on groups G1,G2,...,Gn with no Ai occurring more than once. Define support of contrast set Definition 2: The support of a contrast set with respect to a group G is the percentage of examples in G where the contrast set is true. minimum support difference δ user defined threshold Search for Contrast Sets find contrast sets meet our criteria though search explore all possible contrast sets return only sets meet our criteria STUCCO (Search and Testing for Understandable Consistent Contrasts): breadth-first search incorporates several efficiently mining techniques Framework use set-enumeration trees use breadth-first search counting phase organize nodes into candidate groups Finding Significant Contrast Sets testing the null hypothesis across all groups support counts from contingency tables Controlling Search Error data mining test many hypotheses family of tests control Type I error Bonferroni inequality:given any set of events e1,e2,...,en, the probability of their union is less than or equal to the sum of the individual probabilities Pruning prune when contrast sets fail to meet effect size or statistical significance criteria prune when lead to uninteresting contrast sets Effect Size Pruning prune nodes when bound maximum support difference groups below δ Statistical Significance Pruning pruned when too few data or maximum value X2 too small Interest Based Pruning contrast sets are not interesting when have identical support or relation between groups is fixed Specializations with Identical Support marital-status=husband marital-status=husband ^ Sex = male Fixed Relations Fixed Relations prune node as contrast set specializations do not add new information Relation to Itemset Mining minimum support difference criterion implies constraints support levels in individual groups eliminate large portions of the search space based on: subset infrequency pruning effect size pruning superset frequency pruning interest based pruning ab abc Filtering for Summarizing Contrast Set past approaches limit the rules shown by constraint the variables or items compare discovered rules, show only unexpected results new methods expectation based statistical approach identify and select linear trend contrast sets Statistical Surprise show most general contrast sets first, more complicated conjunctions if surprising based on previously shown sets IPF(Iterative Proportional Fitting) find maximum likelihood estimates Detecting Linear Trends identical to finding change over time detect significant contrast set by using the chisquare test use regression techniques to find the portion of the x2 Evaluation three research points: low support difference few high support attribute-value pairs, lower bounds can’t take advantage pruning rules δ -> 0 statistical significance pruning is more important filtering rules Conclusion STUCCO algorithm combined statistical hypothesis testing with search for mining contrast sets STUCOO has pruning rules efficient mining at low support differences guaranteed control over false positives linear trend detection compact summarization of result