Download Detecting Group Differences: Mining Contrast Sets

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Detecting Group Differences:
Mining Contrast Sets
Author: Stephen D. Bay
Advisor: Dr. Hsu
Graduate: Yan-Cheng Lin
Outline







Motivation
Objective
Research Review
Search for Contrast Sets
Filtering for Summarizing Contrast
Set
Evaluation
Conclusion
Motivation


Learning group differences a central
problem in many domains
Contrasting groups especially
important in social science research
Objective

Automatically detect differences
between contrasting groups from
observational multivariate data
Research Review

time series research



traditional statistical methods
rule learner and decision tree


multiple observations
miss group differences
association rule mining

multiple group and different search
criteria
Problem Definition

itemset concept extends to contrast
set
Definition 1:
Let A1,A2,...,Ak be a set of k variables called
attributes.
Each Ai can take on values from the set
{Vi1,Vi2,...Vim}.
Contrast set a conjunction of attribute –
value pairs defined on groups G1,G2,...,Gn
with no Ai occurring more than once.
Define support of contrast set

Definition 2:


The support of a contrast set with respect to a group
G is the percentage of examples in G where the
contrast set is true.
minimum support difference δ user defined threshold
Search for Contrast Sets



find contrast sets meet our criteria
though search
explore all possible contrast sets
return only sets meet our criteria
STUCCO (Search and Testing for
Understandable Consistent
Contrasts): breadth-first search
incorporates several efficiently
mining techniques
Framework



use set-enumeration trees
use breadth-first search
counting phase organize nodes into candidate
groups
Finding Significant Contrast Sets


testing the null hypothesis across all groups
support counts from contingency tables
Controlling Search Error



data mining test many hypotheses
family of tests control Type I error
Bonferroni inequality:given any set of
events e1,e2,...,en, the probability of their
union is less than or equal to the sum of
the individual probabilities
Pruning



prune when contrast sets fail to meet
effect size or statistical significance
criteria
prune when lead to uninteresting contrast
sets
Effect Size Pruning


prune nodes when bound maximum support
difference groups below δ
Statistical Significance Pruning

pruned when too few data or maximum value
X2 too small
Interest Based Pruning


contrast sets are not interesting when have
identical support or relation between groups is
fixed
Specializations with Identical Support


marital-status=husband
marital-status=husband ^ Sex = male
Fixed Relations

Fixed Relations

prune node as contrast set specializations do
not add new information
Relation to Itemset Mining



minimum support difference criterion implies
constraints support levels in individual groups
eliminate large portions of the search space based
on:
subset infrequency pruning


effect size pruning
superset frequency pruning

interest based pruning
ab
abc
Filtering for Summarizing Contrast Set

past approaches



limit the rules shown by constraint the
variables or items
compare discovered rules, show only
unexpected results
new methods


expectation based statistical approach
identify and select linear trend contrast
sets
Statistical Surprise


show most general contrast sets
first, more complicated conjunctions
if surprising based on previously
shown sets
IPF(Iterative Proportional Fitting)
find maximum likelihood estimates
Detecting Linear Trends



identical to finding change over time
detect significant contrast set by using the chisquare test
use regression techniques to find the portion of
the x2
Evaluation

three research points:



low support difference
 few high support attribute-value pairs, lower bounds
can’t take advantage
pruning rules
 δ -> 0 statistical significance pruning is more
important
filtering rules
Conclusion


STUCCO algorithm combined
statistical hypothesis testing with
search for mining contrast sets
STUCOO has




pruning rules efficient mining at low
support differences
guaranteed control over false positives
linear trend detection
compact summarization of result