EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Overview
Rule-based classification method overview
CBA: classification based on association
Applying rule-based methods in microarray analysis
Rule-Based Methods vs. SVM
Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule
Groups for Gene Expression Data". SIGMOD'05, 2005.
Rule-Based Classifier
Classify records by using a collection of "if…then…" rules
Rule: (Condition) → y
where
Condition is a conjunction of attribute tests
y is the class label
Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the
instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
Coverage of a rule: the fraction of records that satisfy the antecedent of the rule
Accuracy of a rule: the fraction of records satisfying the antecedent that also satisfy the consequent
Tid  Home  Marital Status  Taxable Income  Cheat
1    Yes   Single          125K            No
2    No    Married         100K            No
3    No    Single          70K             No
4    Yes   Married         120K            No
5    No    Divorced        95K             Yes
6    No    Married         60K             No
7    Yes   Divorced        220K            No
8    No    Single          85K             Yes
9    No    Married         75K             No
10   No    Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
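As a sanity check, a minimal Python sketch (field names are illustrative, not from the slides) that reproduces the 40% / 50% numbers on the table above:

# Coverage and accuracy of the rule (Status=Single) -> No on the table above.
records = [
    {"home": "Yes", "status": "Single",   "income": 125, "cheat": "No"},
    {"home": "No",  "status": "Married",  "income": 100, "cheat": "No"},
    {"home": "No",  "status": "Single",   "income": 70,  "cheat": "No"},
    {"home": "Yes", "status": "Married",  "income": 120, "cheat": "No"},
    {"home": "No",  "status": "Divorced", "income": 95,  "cheat": "Yes"},
    {"home": "No",  "status": "Married",  "income": 60,  "cheat": "No"},
    {"home": "Yes", "status": "Divorced", "income": 220, "cheat": "No"},
    {"home": "No",  "status": "Single",   "income": 85,  "cheat": "Yes"},
    {"home": "No",  "status": "Married",  "income": 75,  "cheat": "No"},
    {"home": "No",  "status": "Single",   "income": 90,  "cheat": "Yes"},
]

antecedent = lambda r: r["status"] == "Single"   # rule condition
consequent = lambda r: r["cheat"] == "No"        # rule class

covered = [r for r in records if antecedent(r)]
coverage = len(covered) / len(records)                         # 4/10 = 40%
accuracy = sum(consequent(r) for r in covered) / len(covered)  # 2/4 = 50%
print(f"coverage={coverage:.0%}, accuracy={accuracy:.0%}")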
How Does a Rule-Based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Characteristics of Rule-Based
Classifier
Mutually exclusive rules
Every record is covered by at most one rule
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every
possible combination of attribute values
Each record is covered by at least one rule
From Decision Trees To Rules
Classification Rules (extracted from the tree):
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

[Decision tree:
Refund?
  Yes -> NO
  No  -> Marital Status?
           {Single,Divorced} -> Taxable Income? (<80K -> NO, >80K -> YES)
           {Married}         -> NO]
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the
tree
Rules Can Be Simplified
Tid  Home  Marital Status  Taxable Income  Cheat
1    Yes   Single          125K            No
2    No    Married         100K            No
3    No    Single          70K             No
4    Yes   Married         120K            No
5    No    Divorced        95K             Yes
6    No    Married         60K             No
7    Yes   Divorced        220K            No
8    No    Single          85K             Yes
9    No    Married         75K             No
10   No    Single          90K             Yes

[Decision tree:
Refund?
  Yes -> NO
  No  -> Marital Status?
           {Single,Divorced} -> Taxable Income? (<80K -> NO, >80K -> YES)
           {Married}         -> NO]

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification
Rules are no longer mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive
A record may not trigger any rules
Solution?
Use a default class
Ordered Rule Set
Rules are rank ordered according to their priority
An ordered rule set is known as a decision list
When a test record is presented to the classifier
It is assigned to the class label of the highest-ranked rule it triggers
If no rule fires, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
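A minimal Python sketch of this decision-list behavior; the dict keys and the classify/default names are illustrative, not from the slides, while the rules and their order come from the slide:

rules = [
    (lambda x: x["give_birth"] == "no" and x["can_fly"] == "yes", "Birds"),        # R1
    (lambda x: x["give_birth"] == "no" and x["live_in_water"] == "yes", "Fishes"), # R2
    (lambda x: x["give_birth"] == "yes" and x["blood_type"] == "warm", "Mammals"), # R3
    (lambda x: x["give_birth"] == "no" and x["can_fly"] == "no", "Reptiles"),      # R4
    (lambda x: x["live_in_water"] == "sometimes", "Amphibians"),                   # R5
]

def classify(record, default="Unknown"):
    """Return the class of the highest-ranked rule the record triggers."""
    for condition, label in rules:
        if condition(record):
            return label
    return default  # no rule fires -> default class

turtle = {"blood_type": "cold", "give_birth": "no",
          "can_fly": "no", "live_in_water": "sometimes"}
print(classify(turtle))  # R4 ranks above R5 -> "Reptiles"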
Rule Ordering Schemes
Rule-based ordering
Individual rules are ranked based on their quality
Class-based ordering
Rules that belong to the same class appear together
Rule-based Ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
Building Classification Rules
Direct Method:
Extract rules directly from data
e.g., CBA
Indirect Method:
Extract rules from other classification models (e.g., decision trees, neural networks)
e.g., C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion is met
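A minimal Python sketch of this covering loop; learn_one_rule is left abstract, and both it and the rule's covers() method are assumed interfaces, not from the slides:

def sequential_covering(records, target_class, learn_one_rule, max_rules=10):
    """Greedy covering: learn a rule, drop covered records, repeat."""
    remaining = list(records)
    rule_set = []
    while remaining and len(rule_set) < max_rules:      # stopping criterion
        rule = learn_one_rule(remaining, target_class)  # step 2: grow one rule
        if rule is None:                                # no acceptable rule left
            break
        rule_set.append(rule)
        # step 3: remove training records covered by the new rule
        remaining = [r for r in remaining if not rule.covers(r)]
    return rule_set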
Example of Sequential Covering
[Figure: (i) Original Data; (ii) Step 1]
Example of Sequential Covering…
[Figure: (iii) Step 2 with rule R1; (iv) Step 3 with rules R1 and R2]
Aspects of Sequential Covering
Rule Growing
Rule Evaluation
Stopping Criterion
Rule Pruning
Rule Growing
Two common strategies
[Figure (a) General-to-specific: grow from the empty rule {} (Yes: 3, No: 4) by adding the best conjunct; candidate conjuncts include Refund=No (Yes: 3, No: 4), Status=Single (Yes: 2, No: 1), Status=Divorced (Yes: 1, No: 0), Status=Married (Yes: 0, No: 3), ..., Income>80K (Yes: 3, No: 1).]
[Figure (b) Specific-to-general: start from specific rules such as Refund=No, Status=Single, Income=85K (Class=Yes) and Refund=No, Status=Single, Income=90K (Class=Yes), then generalize, e.g. to Refund=No, Status=Single (Class=Yes).]
Rule Growing (Examples)
RIPPER Algorithm:
Start from an empty rule: {} => class
Add the conjunct that maximizes FOIL's information gain measure:
R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct)
Gain(R0, R1) = t × [ log2(p1/(p1+n1)) - log2(p0/(p0+n0)) ]
where t: number of positive instances covered by both R0 and R1
p0, n0: numbers of positive/negative instances covered by R0
p1, n1: numbers of positive/negative instances covered by R1
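The gain computation transcribed directly into Python; the example counts are taken from the general-to-specific figure above (defaulting t to p1 holds when R1 specializes R0):

import math

def foil_gain(p0, n0, p1, n1, t=None):
    """FOIL's information gain for extending rule R0 to R1."""
    if t is None:
        t = p1  # positives covered by both, when R1 specializes R0
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Example: R0 = {} covers 3 positives, 4 negatives; adding Income>80K
# leaves 3 positives, 1 negative (counts from the figure above).
print(foil_gain(p0=3, n0=4, p1=3, n1=1))  # about 2.42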
Rule Evaluation
Metrics, where n is the number of instances covered by the rule, nc the number of covered instances that belong to the rule's class, k the number of classes, and p the prior probability of the class:
Accuracy   = nc / n
Laplace    = (nc + 1) / (n + k)
m-estimate = (nc + k·p) / (n + k)
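The three metrics as a short Python sketch; the example counts are made up for illustration:

def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

# Example: a rule covers 5 records, 4 of its own class, in a 2-class
# problem with prior 0.5 for the class.
print(accuracy(4, 5), laplace(4, 5, 2), m_estimate(4, 5, 2, 0.5))
# 0.8  0.714...  0.714...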
Stopping Criterion and Rule Pruning
Stopping criterion
Compute the gain
If gain is not significant, discard the new rule
Rule Pruning
Similar to post-pruning of decision trees
Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and after pruning
If error improves, prune the conjunct
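A hedged sketch of this greedy conjunct-dropping loop; the list-of-conjuncts representation and the error() helper (error rate on a validation set) are assumptions, not part of the slides:

def prune_rule(conjuncts, error):
    """Greedily drop conjuncts while the validation error improves.
    conjuncts: list of conditions; error: fn(list) -> validation error rate."""
    best = list(conjuncts)
    improved = True
    while improved and len(best) > 1:
        improved = False
        for i in range(len(best)):
            candidate = best[:i] + best[i + 1:]   # remove one conjunct
            if error(candidate) < error(best):    # error improves -> prune it
                best, improved = candidate, True
                break
    return best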
Summary of Direct Method
Grow a single rule
Remove instances covered by the rule
Prune the rule (if necessary)
Add the rule to the current rule set
Repeat
Direct Method: RIPPER
For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
Learn rules for the positive class
The negative class becomes the default class
For a multi-class problem
Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for the smallest class first, treating the rest as the negative class
Repeat with the next smallest class as the positive class
Indirect Methods
[Decision tree:
P?
  No  -> Q? (No -> -, Yes -> +)
  Yes -> R? (No -> +, Yes -> Q? (No -> -, Yes -> +))]

Rule Set:
r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +
Indirect Method: C4.5rules
Extract rules from an unpruned decision tree
For each rule r: A → y, consider an alternative rule r′: A′ → y where A′ is obtained by removing one of the conjuncts in A
Compare the pessimistic error rate of r against that of each r′
Prune if some r′ has a lower pessimistic error rate
Repeat until we can no longer improve the generalization error
Advantages of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
Overview of CBA
Classification rule mining versus association rule mining:
Aim: classification mining produces a small set of rules to use as a classifier; association mining produces all rules satisfying minsup and minconf
Syntax: classification rules X → y (y is a class label); association rules X → Y (Y is any itemset)
Association Rules for Classification
Classification: mine a small set of rules existing in the data to form a classifier or predictor; there is a fixed target attribute, the class attribute.
Association rules: have no fixed target, but we can fix one.
Class association rules (CARs): have a target class attribute. E.g.,
Own_house = true → Class = Yes [sup=6/15, conf=6/6]
CARs can obviously be used for classification.
B. Liu, W. Hsu, and Y. Ma. Integrating classification and
association rule mining. KDD, 1998
http://www.comp.nus.edu.sg/~dm2/
Decision tree vs. CARs
A decision tree for this data generates the following 3 rules:
Own_house = true → Class = Yes [sup=6/15, conf=6/6]
Own_house = false, Has_job = true → Class = Yes [sup=5/15, conf=5/5]
Own_house = false, Has_job = false → Class = No [sup=4/15, conf=4/4]
But there are many other rules that are not found by the decision tree.
There are many more rules
CAR mining finds all of them.
In many cases, rules not in the
decision tree (or a rule list)
may perform classification
better.
Such rules may also be
actionable in practice
Decision tree vs. CARs (cont …)
Association mining requires discrete attributes, while decision tree learning uses both discrete and continuous attributes.
CAR mining therefore requires continuous attributes to be discretized first; several such discretization algorithms exist.
Decision trees are not constrained by minsup or minconf and are thus able to find rules with very low support; of course, such rules may be pruned to avoid overfitting.
CBA: Three Steps
Discretize continuous attributes, if any
Generate all class association rules (CARs)
Build a classifier based on the generated CARs
RG: The Algorithm
Find the complete set of all possible rules
This usually takes a long time to finish
RG: Basic Concepts
Frequent ruleitems
A ruleitem is frequent if its support is above minsup
Accurate rule
A rule is accurate if its confidence is above minconf
Possible rule
For all ruleitems that have the same condset, the ruleitem with
the highest confidence is the possible rule of this set of
ruleitems.
The set of class association rules (CARs) consists of all
the possible rules (PRs) that are both frequent and
accurate.
Further Considerations in CAR mining
Multiple minimum class supports
Deal with imbalanced class distribution, e.g., some class is
rare, 98% negative and 2% positive.
We can set the minsup(positive) = 0.2% and minsup(negative)
= 2%.
If we are not interested in classification of negative class, we
may not want to generate rules for negative class. We can set
minsup(negative)=100% or more.
Rule pruning may be performed.
Building Classifiers
There are many ways to build classifiers using CARs.
Several existing systems available.
Simplest: After CARs are mined, do nothing.
For each test case, we simply choose the most confident rule that
covers the test case to classify it. Microsoft SQL Server has a
similar method.
Or, using a combination of rules.
Another method (used in the CBA system) is similar to
sequential covering.
Choose a set of rules to cover the training data.
Class Builder: Three Steps
The basic idea is to choose a set of high-precedence rules in R to cover D.
Sort the set of generated rules R
Select rules for the classifier from R following the sorted sequence and put them in C
Each selected rule has to correctly classify at least one additional case
Also select a default class and compute the errors
Discard those rules in C that don't improve the accuracy of the classifier:
locate the rule with the lowest error rate and discard the remaining rules in the sequence
Rules are sorted first
Definition: Given two rules ri and rj, ri ≻ rj (also called ri precedes rj, or ri has a higher precedence than rj) if
the confidence of ri is greater than that of rj, or
their confidences are the same, but the support of ri is greater than that of rj, or
both the confidences and supports of ri and rj are the same, but ri is generated earlier than rj.
A CBA classifier L is of the form:
L = <r1, r2, …, rk, default-class>
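The precedence order transcribed as a Python sort key; the Rule record type here is an assumption for illustration:

from dataclasses import dataclass, field
from itertools import count

_gen = count()  # global generation counter

@dataclass
class Rule:
    condset: frozenset
    klass: str
    sup: float
    conf: float
    gen_id: int = field(default_factory=lambda: next(_gen))  # generation order

def sort_rules(rules):
    # descending confidence, then descending support, then earlier generation
    return sorted(rules, key=lambda r: (-r.conf, -r.sup, r.gen_id))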
Classifier building using CARs
Selection:
Each rule does at least one correct prediction
Each case is covered by the rule with highest precedence
This algorithm is correct but is not efficient
Classifier building using CARs (cont …)
For each case d in D:
coverRules_d = all rules covering d
Sort D according to the precedence of the first correctly predicting rule of each case d
RuleSet = empty
Scan D again to find the optimal rule set
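A simplified Python sketch of this selection loop; it keeps the "at least one new correct case" test and the default-class choice but omits the error-based truncation described on the previous slide. Rule objects exposing covers() and klass are assumptions:

from collections import Counter

def build_classifier(sorted_rules, cases, labels):
    classifier, covered = [], [False] * len(cases)
    for rule in sorted_rules:
        hits = [i for i, c in enumerate(cases)
                if not covered[i] and rule.covers(c)]
        # keep the rule only if it correctly classifies at least one new case
        if any(labels[i] == rule.klass for i in hits):
            classifier.append(rule)
            for i in hits:
                covered[i] = True
    # default class: majority label among still-uncovered cases
    rest = [labels[i] for i, done in enumerate(covered) if not done]
    default = Counter(rest or labels).most_common(1)[0][0]
    return classifier, default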
Refined Classification Based on TopkRGS (RCBT)
General idea:
Construct the RCBT classifier from the top-k covering rule groups, so the number of rule groups generated is bounded
Efficiency and accuracy are validated by experimental results
Based on Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05
Dataset
In the microarray dataset:
Each row in the dataset corresponds to a sample
Each item value in the dataset corresponds to a discretized gene expression value
Class labels correspond to the category of the sample (cancer / not cancer)
Useful for diagnostic purposes
Introduction of gene expression data
(Microarray)
Format of gene expression data:
Columns (genes): thousands
Rows (samples, with class labels): tens to hundreds of patients

      gene1  gene2  gene3  gene4  Class
row1  10.6   1.9    33     22     C1
row2  18.6   5.6    56     10     C1
row3  5.7    4.3    133    22     C2

discretize ↓

      items       Class
row1  a, b, e, h  C1
row2  c, d, e, f  C1
row3  a, b, g, h  C2
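For illustration, a small Python sketch that bins each gene's values and emits (gene, bin) items; equal-width binning is an assumption here, and the actual discretization used by the paper may differ:

def discretize_column(values, bins=2):
    """Equal-width binning of one gene's expression values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # guard against constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]

rows = [[10.6, 1.9, 33, 22], [18.6, 5.6, 56, 10], [5.7, 4.3, 133, 22]]
columns = list(zip(*rows))
binned = [discretize_column(list(col)) for col in columns]
# one item per (gene, bin) pair, per sample
items = [{f"gene{g + 1}_bin{binned[g][r]}" for g in range(len(columns))}
         for r in range(len(rows))]
print(items)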
Rule Example
Rule r: {a,e,h} → C
Support(r) = 3
Confidence(r) = 66%
RG: General solution
Step 1: Find all frequent itemsets in dataset D.
Step 2: Generate rules of the form itemset → C; prune rules that do not have enough support and confidence.
RG: Previous Algorithms
Item enumeration: Search all the frequent itemsets by
checking all possible combinations of items.
We can simulate the search process in an item enumeration tree:

{}
  {a}
    {ab}
      {abc}
    {ac}
  {b}
    {bc}
  {c}
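A minimal Python sketch of a depth-first walk over this enumeration tree with anti-monotone support pruning; the 4-row dataset is illustrative:

def frequent_itemsets(rows, items, minsup):
    """Enumerate itemsets depth-first, extending only with larger items."""
    results = []
    def support(itemset):
        return sum(1 for r in rows if itemset <= r)
    def expand(prefix, candidates):
        for i, item in enumerate(candidates):
            itemset = prefix | {item}
            if support(itemset) >= minsup:           # anti-monotone pruning
                results.append(itemset)
                expand(itemset, candidates[i + 1:])  # only lexically larger items
    expand(set(), items)
    return results

rows = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"c", "d", "e"}, {"c", "d", "e"}]
print(frequent_itemsets(rows, ["a", "b", "c", "d", "e"], minsup=2))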
Microarray data
Features of Microarray data
A few rows: 100-1000
A large number of items: ~10,000
The space of all combinations of items is huge: 2^10000
Motivations
Existing rule mining algorithms are very slow: the item search space is exponential in the number of items
→ use the idea of row enumeration to design a new algorithm
The number of association rules is huge even for a given consequent
→ mine the top-k interesting rule groups for each row
Definitions
Row support set: given a set of items I′, we denote by R(I′) the largest set of rows that all contain I′.
Item support set: given a set of rows R′, we denote by I(R′) the largest set of items common to all rows in R′.
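The two definitions transcribed directly into Python; the dataset layout (a dict of row id → item set) is an assumption, and the usage lines reuse the discretized table from the earlier slide:

def row_support(dataset, itemset):
    """R(I'): the rows whose item sets contain every item of I'."""
    return {rid for rid, items in dataset.items() if itemset <= items}

def item_support(dataset, rowset):
    """I(R'): the items common to every row of R'."""
    rows = [dataset[rid] for rid in rowset]
    return set.intersection(*rows) if rows else set()

data = {"row1": {"a", "b", "e", "h"},
        "row2": {"c", "d", "e", "f"},
        "row3": {"a", "b", "g", "h"}}         # discretized table from earlier
print(row_support(data, {"a", "b"}))          # {'row1', 'row3'}
print(item_support(data, {"row1", "row3"}))   # {'a', 'b', 'h'}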
Example
I’={a,e,h}, then
R(I’)={r2,r3,r4}
R’={r2,r3}, then
I(R’)={a,e,h}
Rule Groups
What is a rule group?
Given a one-row dataset {a, b, c, d, e, Cancer}, there are 2^5 - 1 = 31 rules of the form LHS → Cancer, all covering the same row with the same confidence (100%): 1 upper bound (abcde → Cancer) and 5 lower bounds (a → Cancer, …, e → Cancer).
Rule group: a set of association rules whose LHS itemsets occur in the same set of rows.
A rule group has a unique upper bound.
Rule groups: example
Rule groups: a set of rules covered by the same rows.

class  Items
C1     a,b,c
C1     a,b,c,d
C1     c,d,e
C2     c,d,e

Upper bound rule: abc → C1 (100%)
Other rules in the group: ab → C1 (100%), ac → C1 (100%), bc → C1 (100%), a → C1 (100%), b → C1 (100%)
c → C1 is not in the group.
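A small Python sketch of this idea: the upper bound of the group containing I′ is I(R(I′)). The dict layout is an assumption; the data is the table above:

def upper_bound(dataset, itemset):
    """Upper bound of the rule group whose LHS itemsets occur in R(I')."""
    rows = {rid for rid, its in dataset.items() if itemset <= its}          # R(I')
    return frozenset.intersection(*(frozenset(dataset[r]) for r in rows))   # I(R')

data = {1: {"a", "b", "c"}, 2: {"a", "b", "c", "d"},
        3: {"c", "d", "e"}, 4: {"c", "d", "e"}}  # items of the table above
print(upper_bound(data, {"a"}))  # frozenset({'a','b','c'}): abc is the upper bound
print(upper_bound(data, {"c"}))  # frozenset({'c'}): c -> C1 is in a different group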
Significance of Rule Groups
Rule group r1 is more significant than r2 if
r1.conf > r2.conf, or
r1.conf = r2.conf and r1.sup > r2.sup
Finding top-k rule groups
Given dataset D, for each row of the dataset, find the k
most significant covering rule groups (represented by the
upper bounds) subject to the minimum support constraint
No min-confidence required
Top-k covering rule groups
For each row, we find the k most significant rule groups: based on confidence first, then support.

class  Items
C1     a,b,c
C1     a,b,c,d
C1     c,d,e
C2     c,d,e

Given minsup=1, the top-1 covering rule groups are:
row 1: abc → C1 (sup=2, conf=100%)
row 2: abc → C1 (sup=2, conf=100%), abcd → C1 (sup=1, conf=100%)
row 3: cd → C1 (sup=2, conf=66.7%)
row 4: cde → C2 (sup=1, conf=50%)
Relationship with CBA
The rules selected by CBA for classification are a subset of the rules
of TopkRGS with k=1
So we can use top-1 covering rule groups to build the CBA
classifier
The authors also proposed a refined classification method based on
TopkRGS
Reduces the chance that test data is classified by a default class;
Uses a subset of rules to make a collective decision
Main Advantages of Top-k Covering Rule Groups
The number of rule groups is bounded by the product of k and the number of samples, which is small
Treats each sample equally → provides a complete description for each row
Replaces the minimum-confidence parameter with k
Sufficient to build classifiers while avoiding excessive computation
Naïve method of finding TopkRGS
Find the complete set of upper-bound rules in the dataset using the row-wise algorithm
Pick the top-k covering rule groups for each row in the dataset
This is inefficient; to improve it, keep track of the top-k rule groups at each enumeration node dynamically and use effective pruning strategies
References
W. Cohen. Fast Effective Rule Induction. ICML'95
X. Yin and J. Han. CPAR: Classification Based on Predictive Association Rules. SDM'03
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98
J. Li. On Optimal Rule Discovery. IEEE Transactions on Knowledge and Data Engineering, 18(4), 2006
Some slides are offered by Xiang Zhang at http://www.cs.unc.edu/~xiang/