Associative Classification of
Imbalanced Datasets
Sanjay Chawla
School of IT
University of Sydney
Overview
• Data Mining Tasks
• Associative Classifiers
• Downside of Support and Confidence
• Mining Rules from Imbalanced Data Sets
  – Fisher’s Exact Test
  – Class Correlation Ratio (CCR)
  – Searching and Pruning Strategies
  – Experiments
Data Mining
• Data Mining research has settled into an equilibrium involving four tasks:
  – Pattern Mining (Association Rules)
  – Classification
  – Clustering
  – Anomaly or Outlier Detection
• [Diagram: the four tasks arranged on a spectrum from databases (DB) to machine learning (ML), with the Associative Classifier bridging Pattern Mining and Classification.]
Association Rules (Agrawal, Imielinski and Swami, SIGMOD ’93)
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

  TID | Items
  ----|--------------------------
  1   | Bread, Milk
  2   | Bread, Diaper, Beer, Eggs
  3   | Milk, Diaper, Beer, Coke
  4   | Bread, Milk, Diaper, Beer
  5   | Bread, Milk, Diaper, Coke

• Example: {Milk, Diaper} → {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
From “Introduction to Data Mining”, Tan, Steinbach and Kumar
Mining Association Rules
• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is computationally expensive
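The two steps can be sketched in Python. This is a minimal, illustrative Apriori-style implementation (function and variable names are my own, not from SPARCCC); support is measured here as an absolute transaction count.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Step 1 -- frequent itemset generation: level-wise search that
    keeps only itemsets whose support (count) is >= minsup."""
    transactions = [set(t) for t in transactions]
    frequent = {}                                   # frozenset -> support count
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: s for c, s in counts.items() if s >= minsup}
        frequent.update(survivors)
        # Candidate generation: join k-itemsets, then prune any candidate
        # with an infrequent k-subset (support is anti-monotonic)
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in survivors for s in combinations(c, k))]
        k += 1
    return frequent

def generate_rules(frequent, minconf):
    """Step 2 -- rule generation: every binary partition of each
    frequent itemset whose confidence is >= minconf."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[ante]   # every subset is also frequent
                if conf >= minconf:
                    rules.append((ante, itemset - ante, conf))
    return rules
```

On the five-transaction table from the earlier slide with minsup = 3, this yields {Diaper} → {Beer} with confidence 3/4 = 0.75, among others.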
Associative Classifiers
• Most associative classifiers are based on rules discovered using the support–confidence criterion.
• The classifier itself is a collection of rules ranked by their support or confidence.
Associative Classifiers (2)

  TID | Items                     | Gender
  ----|---------------------------|-------
  1   | Bread, Milk               | F
  2   | Bread, Diaper, Beer, Eggs | M
  3   | Milk, Diaper, Beer, Coke  | M
  4   | Bread, Milk, Diaper, Beer | M
  5   | Bread, Milk, Diaper, Coke | F

In a classification task we want to predict the class label (Gender) using the attributes.
A good (albeit stereotypical) rule is {Beer, Diaper} → Male, whose support is 60% and confidence is 100%.
Imbalanced Data Set
• In some application domains, data sets are imbalanced:
  – The proportion of samples from one class is much smaller than that of the other class(es).
  – And the smaller class is the class of interest.
• Support and confidence are biased toward the majority class, and do not perform well in such cases.
Downsides of Support
• Support is biased towards the majority class
  – E.g.: classes = {yes, no}, sup({yes}) = 90%
  – minSup > 10% wipes out any rule predicting “no”
  – Suppose X → no has confidence 1 and support 3%. The rule is discarded if minSup > 3%, even though it perfectly predicts 30% of the instances in the minority class!
Downside of Confidence (1)

        C    ¬C    Σ
  A     20    5    25
  ¬A    70    5    75
  Σ     90   10   100

Conf(A → C) = 20/25 = 0.8
Support(A ∪ C) = 20/100 = 0.2
Correlation between A and C:

  corr(A → C) = sup(A ∪ C) / (sup(A) · sup(C)) = 0.20 / (0.25 × 0.90) ≈ 0.89 < 1

Thus, when the data set is imbalanced, a high-support and high-confidence rule does not necessarily imply that the antecedent and the consequent are positively correlated.
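The arithmetic behind this slide can be checked directly; a short Python sanity check using the figures from the table above:

```python
# 2x2 table from the slide: rows A / not-A, columns C / not-C
n = 100
sup_A_and_C = 20      # transactions containing both A and C
sup_A = 25            # row sum for A
sup_C = 90            # column sum for C

confidence = sup_A_and_C / sup_A              # 0.8  -- looks strong
support = sup_A_and_C / n                     # 0.2  -- comfortably frequent
corr = support / ((sup_A / n) * (sup_C / n))  # ~0.89 -- below 1!

# Despite high support and confidence, corr < 1: A and C are in fact
# (slightly) negatively correlated.
```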
Downside of Confidence (2)
• It is reasonable to expect that for “good rules” the antecedent and consequent are not independent!
• Suppose
  – P(Class = Yes) = 0.9
  – P(Class = Yes | X) = 0.9
• Then the rule X → Yes has confidence 0.9, which looks strong, yet X and the class are completely independent.
Downsides of Confidence (3)
Another useful observation:
• Higher confidence (support) for a rule in the minority class implies higher correlation, and lower correlation in the minority class implies lower confidence, but neither of these holds for the majority class.
• Confidence (support) therefore tends to be biased toward the majority class.
Contingency Table
• A 2 × 2 contingency table for X → y.
• We will use the notation [a, b; c, d] to represent this table.

          X        ¬X       Σ rows
  y       a        b        a + b
  ¬y      c        d        c + d
  Σ cols  a + c    b + d    n = a + b + c + d
Fisher Exact Test
• Given a table, [a, b; c, d], Fisher Exact
Test will find the probability (p-value) of
obtaining the given table under the
hypothesis that {X, ¬X} and {y, ¬y} are
independent.
• The margin sums (∑rows, ∑cols) are fixed.
Fisher Exact Test (2)
• The p-value is given by:

  p([a, b; c, d]) = Σ_{i=0}^{min(b, c)} [(a+b)! (c+d)! (a+c)! (b+d)!] / [n! (a+i)! (b−i)! (c−i)! (d+i)!]

• We will only use rules whose p-values are below the desired significance level (e.g. 0.01).
• Rules that pass this test are statistically significant in the positively associated direction (e.g. X → y).
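The formula transcribes directly to code. A sketch (for production use, a library routine such as `scipy.stats.fisher_exact` with `alternative='greater'` would be the usual choice); binomial coefficients replace the raw factorials:

```python
from math import comb

def fisher_p_value(a, b, c, d):
    """One-sided p-value of Fisher's exact test for positive association
    of X with y, on the table [a, b; c, d] with fixed margins
    (a = X,y; b = not-X,y; c = X,not-y; d = not-X,not-y).
    Sums the hypergeometric probabilities of the observed table and of
    every table at least as positively associated: [a+i, b-i; c-i, d+i]."""
    n = a + b + c + d
    p = 0.0
    for i in range(min(b, c) + 1):
        # P(table with fixed margins) = C(a+b, a+i) * C(c+d, c-i) / C(n, a+c),
        # which equals the factorial expression on the slide
        p += comb(a + b, a + i) * comb(c + d, c - i) / comb(n, a + c)
    return p
```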
Class Correlation Ratio
• In Class Correlation, we are interested in rules X → y where X is more positively correlated with y than it is with ¬y.
• The correlation is defined by:

  corr(X → y) = (sup(X ∪ y) · |T|) / (sup(X) · sup(y)) = a·n / ((a + c)(a + b))

  where |T| is the number of transactions n.
Class Correlation Ratio (2)
• We then use corr() to measure how correlated X is with y compared to ¬y.
• X and y are positively correlated if corr(X → y) > 1, and negatively correlated if corr(X → y) < 1.
Class Correlation Ratio (3)
• Based on the correlation corr(), we define the Class Correlation Ratio (CCR):

  CCR(X → y) = corr(X → y) / corr(X → ¬y) = a(c + d) / (c(a + b))

• The CCR measures how much more positively the antecedent is correlated with the class it predicts (e.g. y), relative to the alternative class (e.g. ¬y).
Class Correlation Ratio (4)

  CCR(X → y) = corr(X → y) / corr(X → ¬y)

• We only use rules with a CCR above a desired threshold, so that no rule is used that is more positively associated with the class it does not predict.
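The closed form follows directly from the table cells. An illustrative helper (my own naming; note the ratio is undefined when c = 0 or a + b = 0):

```python
def ccr(a, b, c, d):
    """Class Correlation Ratio for X -> y over the table [a, b; c, d]
    (a = X,y; b = not-X,y; c = X,not-y; d = not-X,not-y):
    CCR = corr(X -> y) / corr(X -> not-y) = a(c + d) / (c(a + b)).
    Raises ZeroDivisionError when c == 0 or a + b == 0."""
    return a * (c + d) / (c * (a + b))
```

For the A → C table from the “Downside of Confidence (1)” slide (a = 20, b = 70, c = 5, d = 5), CCR = 20·10 / (5·90) ≈ 0.44 < 1, so that rule would be rejected, consistent with its correlation of ≈ 0.89 < 1.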
The Two Measurements
• We perform the following tests to determine whether a potentially interesting rule is indeed interesting:
  – Check the significance of the rule X → y by performing Fisher’s Exact Test.
  – Check whether CCR(X → y) > 1.
• Rules that pass both tests are candidates for the classification task.
Search and Pruning Strategies
• To avoid examining the whole set of possible rules, we use search strategies that ensure the notion of being potentially interesting is anti-monotonic:
  X → y may be considered potentially interesting only if all {X′ → y | X′ ⊂ X} have been found to be potentially interesting.
Search and Pruning Strategies (2)
• The contingency table [a, b; c, d] used to test the significance of the rule X → y in comparison to one of its generalizations X − {z} → y, for the Aggressive search strategy:

             t: X ⊆ t          t: X − {z} ⊆ t ∧ z ∉ t                  t: X − {z} ⊆ t
  t: y ∈ t   a = sup(X ∪ y)    b = sup((X − {z}) ∪ y) − sup(X ∪ y)     a + b = sup((X − {z}) ∪ y)
  t: y ∉ t   c = sup(X ∪ ¬y)   d = sup((X − {z}) ∪ ¬y) − sup(X ∪ ¬y)   c + d = sup((X − {z}) ∪ ¬y)
  Σ          a + c = sup(X)    b + d = sup(X − {z}) − sup(X)           a + b + c + d = sup(X − {z})
Example
• Suppose we have already determined that the rules (A = a1) → 1 and (A = a2) → 1 are significant.
• Now we want to test whether X = (A = a1) ∧ (A = a2) → 1 is significant.
• We carry out a FET and calculate the CCR on X versus X − {A = a1} (i.e. z = (A = a1)) and on X versus X − {A = a2} (i.e. z = (A = a2)).
• If the minimum of their p-values is less than the significance level, and their CCR is greater than 1, we keep the rule X → 1; otherwise we discard it.
Ranking Rules
• Strength Score (SS):
  – To determine how interesting a rule is, we need a ranking (ordering) of the rules; the ordering is defined by the Strength Score.
Experiments (Balanced Data)
• The preceding approach is denoted “SPARCCC”.
• Experiments on balanced data sets show that the average accuracy of SPARCCC compares favourably to CBA and C4.5.
  – [Table: prediction accuracy on the balanced data sets.]
Experiments (Imbalanced Data)
• True Positive Rate (recall/sensitivity) is a better performance measure for imbalanced data sets.
• SPARCCC outperforms other rule-based techniques such as CBA and CCCS.
  – [Table: True Positive Rate on the minority class for the imbalanced versions of the data sets.]
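To see why accuracy misleads here, consider a toy classifier that always predicts the majority class on a 90/10 imbalanced set (illustrative numbers, not from the experiments):

```python
# Confusion-matrix counts for "always predict the majority class":
# the minority (positive) class has 10 instances, the majority 90.
tp, fn = 0, 10     # every minority instance is missed
tn, fp = 90, 0     # every majority instance is trivially "correct"

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.9 -- looks excellent
tpr = tp / (tp + fn)                         # 0.0 -- class of interest never found
```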
References
• Florian Verhein, Sanjay Chawla. Using Significant, Positively Associated and Relatively Class Correlated Rules for Associative Classification of Imbalanced Datasets. In Proceedings of the 2007 IEEE International Conference on Data Mining (ICDM), Omaha, NE, USA, October 28–31, 2007.