Lecture 5 - Hui Xiong

Transcript
Classification: Alternative Techniques
Dr. Hui Xiong
Rutgers University
Introduction to Data Mining
Classification: Alternative Techniques
Rule-based Classifier
Rule-Based Classifier
• Classify records by using a collection of "if…then…" rules
• Rule: (Condition) → y
  – where Condition is a conjunction of attribute tests and y is the class label
  – LHS: rule antecedent or condition
  – RHS: rule consequent
  – Examples of classification rules:
    (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
    (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
  – The rule r1 covers a hawk => Bird
  – The rule r3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
• Coverage of a rule:
  – Fraction of records that satisfy the antecedent of a rule
• Accuracy of a rule:
  – Fraction of records that satisfy both the antecedent and consequent of a rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Example rule: (Status=Single) → No
Coverage = 40%, Accuracy = 50%
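As a quick illustration, here is a small sketch (not part of the original slides; the dictionary layout is a hypothetical encoding of the table above) that computes coverage and accuracy of the rule (Status=Single) → No:

```python
# Hypothetical sketch: coverage and accuracy of the rule (Status=Single) -> No
records = [
    {"Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
    {"Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
    {"Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
    {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
    {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
]

antecedent = lambda r: r["Status"] == "Single"
consequent = "No"

covered = [r for r in records if antecedent(r)]
coverage = len(covered) / len(records)                                     # 4/10 = 40%
accuracy = sum(r["Class"] == consequent for r in covered) / len(covered)   # 2/4 = 50%
print(f"Coverage = {coverage:.0%}, Accuracy = {accuracy:.0%}")
```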
How Does a Rule-Based Classifier Work?
• A lemur triggers rule r3, so it is classified as a mammal
• A turtle triggers both r4 and r5
• A dogfish shark triggers none of the rules
Characteristics of Rule-Based Classifier
• Mutually exclusive rules
  – Classifier contains mutually exclusive rules if the rules are independent of each other
  – Every record is covered by at most one rule
• Exhaustive rules
  – Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  – Each record is covered by at least one rule
From Decision Trees To Rules
[Decision tree: Refund → Marital Status → Taxable Income, with leaves NO/YES]

Classification Rules:
  (Refund=Yes) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
  (Refund=No, Marital Status={Married}) ==> No

• Rules are mutually exclusive and exhaustive
• Rule set contains as much information as the tree
Rules Can Be Simplified
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

[Decision tree: Refund → Marital Status → Taxable Income, with leaves NO/YES]

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification
• Rules are no longer mutually exclusive
  – A record may trigger more than one rule
  – Solution?
    – Ordered rule set
    – Unordered rule set – use voting schemes
• Rules are no longer exhaustive
  – A record may not trigger any rules
  – Solution?
    – Use a default class
Ordered Rule Set
• Rules are rank ordered according to their priority
  – An ordered rule set is known as a decision list
• When a test record is presented to the classifier
  – It is assigned to the class label of the highest ranked rule it has triggered
  – If none of the rules fire, it is assigned to the default class
Rule Ordering Schemes
• Rule-based ordering
  – Individual rules are ranked based on their quality/priority
• Class-based ordering
  – Rules that belong to the same class appear together
Building Classification Rules
• Direct Method:
  – Extract rules directly from data
  – e.g.: RIPPER, CN2, 1R, and AQ
• Indirect Method:
  – Extract rules from other classification models (e.g. decision trees, neural networks, SVM, etc.)
  – e.g.: C4.5rules
Direct Method: Sequential Covering
Example of Sequential Covering
[Figure: (ii) Step 1 of sequential covering]
Example of Sequential Covering…
[Figure: (iii) Step 2 shows rule R1; (iv) Step 3 shows rules R1 and R2]
Aspects of Sequential Covering
• Rule Growing
  – Rule evaluation
• Instance Elimination
• Stopping Criterion
• Rule Pruning
Rule Growing
• Two common strategies: grow rules from general to specific or from specific to general

[Figure (a) General-to-specific: start from the empty rule {} (Yes: 3, No: 4) and add one conjunct at a time, e.g. Refund=No (Yes: 3, No: 4), Status=Single (Yes: 2, No: 1), Status=Divorced (Yes: 1, No: 0), Status=Married (Yes: 0, No: 3), …, Income>80K (Yes: 3, No: 1)]
Rule Evaluation
• Evaluation metric determines which conjunct should be added during rule growing (a short code sketch follows below)
  – Accuracy $= \dfrac{n_c}{n}$
  – Laplace $= \dfrac{n_c + 1}{n + k}$
  – M-estimate $= \dfrac{n_c + kp}{n + k}$

  where n: number of instances covered by the rule, $n_c$: number of instances of class c covered by the rule, k: number of classes, p: prior probability of class c
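A minimal sketch of the three evaluation metrics above (helper names are mine, not from the slides):

```python
# Hypothetical sketch of the rule-evaluation metrics from the slide.
def accuracy(n_c, n):
    """Fraction of covered instances that belong to class c."""
    return n_c / n

def laplace(n_c, n, k):
    """Accuracy smoothed by adding one pseudo-count per class (k classes)."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, k, p):
    """Accuracy smoothed toward the prior probability p of class c."""
    return (n_c + k * p) / (n + k)

# Example: a rule covering n = 10 records, 8 of them of class c, with k = 2 classes
print(accuracy(8, 10), laplace(8, 10, 2), m_estimate(8, 10, 2, 0.5))
# 0.8, 0.75, 0.75
```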
Rule Growing (Examples)
• CN2 Algorithm:
  – Start from an empty conjunct: {}
  – Add conjuncts that minimize the entropy measure: {A}, {A,B}, …
  – Determine the rule consequent by taking the majority class of the instances covered by the rule
• RIPPER Algorithm:
  – Start from an empty rule: {} => class
  – Add conjuncts that maximize FOIL's information gain measure:
    R0: {} => class (initial rule)
    R1: {A} => class (rule after adding conjunct)

    $\mathrm{Gain}(R_0, R_1) = t \left[ \log \frac{p_1}{p_1 + n_1} - \log \frac{p_0}{p_0 + n_0} \right]$

    where t: number of positive instances covered by both R0 and R1,
    p0: number of positive instances covered by R0,
    n0: number of negative instances covered by R0,
    p1: number of positive instances covered by R1,
    n1: number of negative instances covered by R1
Instance Elimination
• Why do we need to eliminate instances?
  – Otherwise, the next rule is identical to the previous rule
• Why do we remove positive instances?
  – Ensure that the next rule is different
• Why do we remove negative instances?
  – Prevent underestimating the accuracy of the rule
  – Compare rules R2 and R3 in the diagram
Stopping Criterion and Rule Pruning
• Examples of stopping criteria:
  – If the rule does not improve significantly after adding a conjunct
  – If the rule starts covering examples from another class
• Rule Pruning
  – Similar to post-pruning of decision trees
  – Example: using a validation set (reduced error pruning)
    – Remove one of the conjuncts in the rule
    – Compare the error rate on the validation set before and after pruning
    – If the error improves, prune the conjunct
Summary of Direct Method
• Initial rule set is empty
• Repeat
  – Grow a single rule
  – Remove instances covered by the rule
  – Prune the rule (if necessary)
  – Add the rule to the current rule set
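A compact sketch of this sequential covering loop (hypothetical helper names: grow_rule, prune_rule, and a rule object with a covers() method stand in for the steps above):

```python
# Hypothetical sketch of sequential covering (direct rule extraction).
def sequential_covering(records, target_class, grow_rule, prune_rule, min_coverage=1):
    rule_set = []                                    # initial rule set is empty
    remaining = list(records)
    while any(r["Class"] == target_class for r in remaining):
        rule = grow_rule(remaining, target_class)    # grow a single rule
        rule = prune_rule(rule, remaining)           # prune it (if necessary)
        covered = [r for r in remaining if rule.covers(r)]
        if len(covered) < min_coverage:              # stopping criterion
            break
        remaining = [r for r in remaining if not rule.covers(r)]  # remove covered instances
        rule_set.append(rule)                        # add rule to the current rule set
    return rule_set
```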
Direct Method: RIPPER
• For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
  – Learn the rules for the positive class
  – Use the negative class as the default
• For a multi-class problem
  – Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
  – Learn the rule set for the smallest class first, treating the rest as the negative class
  – Repeat with the next smallest class as the positive class
Direct Method: RIPPER
• Rule growing:
  – Start from an empty rule: {} → +
  – Add conjuncts as long as they improve FOIL's information gain
  – Stop when the rule no longer covers negative examples
  – Prune the rule immediately using incremental reduced error pruning
  – Measure for pruning: v = (p − n) / (p + n)
    – p: number of positive examples covered by the rule in the validation set
    – n: number of negative examples covered by the rule in the validation set
  – Pruning method: delete any final sequence of conditions that maximizes v
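A hedged sketch of that pruning step (hypothetical helpers; a rule is treated as a list of conjuncts and covers(conjuncts, record) decides coverage):

```python
# Sketch of RIPPER-style incremental reduced error pruning: drop the final
# sequence of conjuncts that maximizes v = (p - n) / (p + n) on the validation set.
def prune_rule(conjuncts, validation, target_class, covers):
    def v(cs):
        covered = [r for r in validation if covers(cs, r)]
        if not covered:
            return float("-inf")
        p = sum(r["Class"] == target_class for r in covered)
        n = len(covered) - p
        return (p - n) / (p + n)

    best = list(conjuncts)
    # Candidate rules obtained by deleting a final sequence of conditions.
    for i in range(len(conjuncts) - 1, 0, -1):
        candidate = conjuncts[:i]
        if v(candidate) > v(best):
            best = candidate
    return best
```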
Direct Method: RIPPER
• Building a Rule Set:
  – Use the sequential covering algorithm
    – Grow a rule to cover the current set of positive examples
    – Eliminate both positive and negative examples covered by the rule
  – Each time a rule is added to the rule set, compute the new description length
    – Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
Indirect Methods
Indirect Method: C4.5rules
• Extract rules for every path from the root to the leaf nodes
• For each rule, r: A → y,
  – consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjuncts in A
  – Compare the pessimistic error rate for r against all the r' rules
  – Prune r if one of the r' rules has a lower pessimistic error rate
  – Repeat until the pessimistic error rate can no longer be improved
Indirect Method: C4.5rules
• Use class-based ordering
  – Rules that predict the same class are grouped together into the same subset
  – Compute the total description length for each class
  – Classes are ordered in increasing order of their total description length
Example
C4.5rules:
  (Give Birth=No, Can Fly=Yes) → Birds
  (Give Birth=No, Live in Water=Yes) → Fishes
  (Give Birth=Yes) → Mammals
  (Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
  ( ) → Amphibians

[Decision tree: Give Birth? (Yes → Mammals; No → Live In Water?); Live In Water? (Yes → Fishes; Sometimes → Amphibians; No → Can Fly?); Can Fly? (Yes → Birds; No → Reptiles)]
Characteristics of Rule-Based Classifiers
• As highly expressive as decision trees
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees
Classification: Alternative Techniques
Instance-Based Classifiers
Instance-Based Classifiers
[Figure: a set of stored cases with attributes Atr1 … AtrN and class labels (A, B, B, C, A, C, B), and an unseen case with attributes Atr1 … AtrN]

• Store the training records
• Use the training records to predict the class label of unseen cases
Instance Based Classifiers
• Examples:
  – Rote-learner
    – Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
  – Nearest neighbor
    – Uses the k "closest" points (nearest neighbors) for performing classification
Nearest Neighbor Classifiers
• Basic idea:
  – If it walks like a duck and quacks like a duck, then it's probably a duck

[Figure: compute the distance between the test record and the training records, then choose the k "nearest" records]
Nearest-Neighbor Classifiers
[Figure: an unknown record among labeled training records]

• Requires three things
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute the distance to the other training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
1-Nearest Neighbor
[Figure: Voronoi diagram formed by the training points]
Nearest Neighbor Classification
• Compute the distance between two points:
  – Example: Euclidean distance

    $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$

• Determine the class from the nearest-neighbor list (a small code sketch follows below)
  – Take the majority vote of the class labels among the k nearest neighbors
  – Weigh the vote according to distance
    – weight factor: $w = 1/d^2$
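A minimal sketch of distance-weighted k-NN voting along these lines (plain Python; the (feature_vector, label) layout of the training data is a hypothetical choice, not from the slides):

```python
import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, x, k=3, weighted=True):
    """Classify x from its k nearest training records.

    train: list of (feature_vector, label) pairs.
    weighted=True uses the w = 1/d^2 vote from the slide; otherwise a plain majority vote.
    """
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, x)
        votes[label] += 1.0 / (d ** 2 + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Toy usage
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B"), ((5.1, 4.8), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))  # "A"
```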
Nearest Neighbor Classification…
• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes
Nearest Neighbor Classification…
• Scaling issues
  – Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
  – Example:
    – height of a person may vary from 1.5 m to 1.8 m
    – weight of a person may vary from 90 lb to 300 lb
    – income of a person may vary from $10K to $1M
Nearest Neighbor Classification…
• Problem with the Euclidean measure:
  – High-dimensional data
    – curse of dimensionality
  – Can produce counter-intuitive results, e.g.:
      111111111110 vs 011111111111   d = 1.4142
      100000000000 vs 000000000001   d = 1.4142
  – Solution: normalize the vectors to unit length
Nearest Neighbor Classification…
• k-NN classifiers are lazy learners
  – They do not build models explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is relatively expensive
Example: PEBLS
• PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
  – Works with both continuous and nominal features
    – For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)
  – Each record is assigned a weight factor
  – Number of nearest neighbors, k = 1
Example: PEBLS
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Class counts per attribute value:

Marital Status
Class  Single  Married  Divorced
Yes    2       0        1
No     2       4        1

Refund
Class  Yes  No
Yes    0    3
No     3    4

Distance between nominal attribute values:

$d(V_1, V_2) = \sum_i \left| \frac{n_{1i}}{n_1} - \frac{n_{2i}}{n_2} \right|$

d(Single, Married)       = |2/4 − 0/4| + |2/4 − 4/4| = 1
d(Single, Divorced)      = |2/4 − 1/2| + |2/4 − 1/2| = 0
d(Married, Divorced)     = |0/4 − 1/2| + |4/4 − 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 − 3/7| + |3/3 − 4/7| = 6/7
Example: PEBLS
Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No

Distance between record X and record Y:

$\Delta(X, Y) = w_X w_Y \sum_{i=1}^{d} d(X_i, Y_i)^2$

where

$w_X = \dfrac{\text{Number of times X is used for prediction}}{\text{Number of times X predicts correctly}}$

• $w_X \approx 1$ if X makes accurate predictions most of the time
• $w_X > 1$ if X is not reliable for making predictions
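A small sketch (mine, not from the slides) of the MVDM distance, using the class counts tabulated above; the counts dictionary layout is an assumption:

```python
# Hypothetical sketch of the modified value difference metric (MVDM) used by PEBLS.
# counts[v][c] = number of training records with attribute value v and class c.
def mvdm(counts, v1, v2, classes=("Yes", "No")):
    n1 = sum(counts[v1][c] for c in classes)
    n2 = sum(counts[v2][c] for c in classes)
    return sum(abs(counts[v1][c] / n1 - counts[v2][c] / n2) for c in classes)

marital = {"Single":   {"Yes": 2, "No": 2},
           "Married":  {"Yes": 0, "No": 4},
           "Divorced": {"Yes": 1, "No": 1}}
refund  = {"Yes": {"Yes": 0, "No": 3},
           "No":  {"Yes": 3, "No": 4}}

print(mvdm(marital, "Single", "Married"))   # 1.0
print(mvdm(marital, "Single", "Divorced"))  # 0.0
print(mvdm(refund, "Yes", "No"))            # 6/7 ≈ 0.857
```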
Classification: Alternative Techniques
Bayesian Classifiers
Classification: Alternative Techniques
Ensemble Methods
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict the class label of test records by combining the predictions made by multiple classifiers
Why Do Ensemble Methods Work?
• Suppose there are 25 base classifiers
  – Each classifier has error rate ε = 0.35
  – Assume the errors made by the classifiers are uncorrelated
  – Probability that the (majority-vote) ensemble classifier makes a wrong prediction:

$P(X \geq 13) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^i (1 - \varepsilon)^{25-i} = 0.06$
General Approach
Types of Ensemble Methods
• Bayesian ensemble
  – Example: Mixture of Gaussians
• Manipulate the data distribution
  – Example: resampling methods
• Manipulate the input features
  – Example: feature subset selection
• Manipulate the class labels
  – Example: error-correcting output coding
• Introduce randomness into the learning algorithm
  – Example: random forests
Bagging
• Sampling with replacement

Original Data       1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)   7  8  10 8  2  5  10 10 5  9
Bagging (Round 2)   1  4  9  1  2  3  2  7  3  2
Bagging (Round 3)   1  8  5  10 5  5  9  6  3  7

• Build a classifier on each bootstrap sample
• Each training record has probability 1 − (1 − 1/n)^n of being selected in a given bootstrap sample (≈ 0.632 for large n), and probability (1 − 1/n)^n of being left out
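A quick empirical check of that inclusion probability (a hypothetical simulation, not part of the slides):

```python
# Estimate how often a given record appears in a bootstrap sample of size n.
import random

n, trials = 1000, 2000
hits = sum(0 in [random.randrange(n) for _ in range(n)] for _ in range(trials))
print(hits / trials, 1 - (1 - 1 / n) ** n)  # both ≈ 0.632
```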
Bagging Algorithm
Bagging Example
• Consider the 1-dimensional data set:

x   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y   1    1    1    -1   -1   -1   -1   1    1    1

• Classifier is a decision stump
  – Decision rule: x ≤ k versus x > k
  – Split point k is chosen based on entropy
  – [Stump: if x ≤ k predict y_left, else predict y_right]
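A self-contained sketch of bagging decision stumps on this data set (assumptions: the stump minimizes training error rather than entropy, and the bootstrap rounds are random rather than the fixed ones listed on the next slides):

```python
# Sketch of bagging with decision stumps on the 1-D data set above.
import random
from statistics import mode

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def fit_stump(xs, ys):
    best = None
    for k in sorted(set(xs)):
        for left, right in [(1, -1), (-1, 1)]:
            preds = [left if x <= k else right for x in xs]
            err = sum(p != y for p, y in zip(preds, ys))
            if best is None or err < best[0]:
                best = (err, k, left, right)
    _, k, left, right = best
    return lambda x: left if x <= k else right

def bagging(xs, ys, rounds=10, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(rounds):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]   # bootstrap sample
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mode(s(x) for s in stumps)                  # majority vote

ensemble = bagging(X, Y)
print([ensemble(x) for x in X])
```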
Bagging Example
Bagging Round 1:
x   0.1  0.2  0.2  0.3  0.4  0.4  0.5  0.6  0.9  0.9
y   1    1    1    1    -1   -1   -1   -1   1    1

Bagging Round 2:
x   0.1  0.2  0.3  0.4  0.5  0.5  0.9  1    1    1
y   1    1    1    -1   -1   -1   1    1    1    1

Bagging Round 3:
x   0.1  0.2  0.3  0.4  0.4  0.5  0.7  0.7  0.8  0.9
y   1    1    1    -1   -1   -1   -1   -1   1    1

Bagging Round 4:
x   0.1  0.1  0.2  0.4  0.4  0.5  0.5  0.7  0.8  0.9
y   1    1    1    -1   -1   -1   -1   -1   1    1

Bagging Round 5:
x   0.1  0.1  0.2  0.5  0.6  0.6  0.6  1    1    1
y   1    1    1    -1   -1   -1   -1   1    1    1
Bagging Example
Bagging Round 6:
x   0.2  0.4  0.5  0.6  0.7  0.7  0.7  0.8  0.9  1
y   1    -1   -1   -1   -1   -1   -1   1    1    1

Bagging Round 7:
x   0.1  0.4  0.4  0.6  0.7  0.8  0.9  0.9  0.9  1
y   1    -1   -1   -1   -1   1    1    1    1    1

Bagging Round 8:
x   0.1  0.2  0.5  0.5  0.5  0.7  0.7  0.8  0.9  1
y   1    1    -1   -1   -1   -1   -1   1    1    1

Bagging Round 9:
x   0.1  0.3  0.4  0.4  0.6  0.7  0.7  0.8  1    1
y   1    1    -1   -1   -1   -1   -1   1    1    1

Bagging Round 10:
x   0.1  0.1  0.1  0.1  0.3  0.3  0.8  0.8  0.9  0.9
y   1    1    1    1    1    1    1    1    1    1
Bagging Example
• Summary of training sets:

Round  Split Point  Left Class  Right Class
1      0.35         1           -1
2      0.7          1           1
3      0.35         1           -1
4      0.3          1           -1
5      0.35         1           -1
6      0.75         -1          1
7      0.75         -1          1
8      0.75         -1          1
9      0.75         -1          1
10     0.05         1           1
Bagging Example
• Assume the test set is the same as the original data
• Use majority vote to determine the class of the ensemble classifier

Predicted class:

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      1      1      1      -1     -1     -1     -1     -1     -1     -1
2      1      1      1      1      1      1      1      1      1      1
3      1      1      1      -1     -1     -1     -1     -1     -1     -1
4      1      1      1      -1     -1     -1     -1     -1     -1     -1
5      1      1      1      -1     -1     -1     -1     -1     -1     -1
6      -1     -1     -1     -1     -1     -1     -1     1      1      1
7      -1     -1     -1     -1     -1     -1     -1     1      1      1
8      -1     -1     -1     -1     -1     -1     -1     1      1      1
9      -1     -1     -1     -1     -1     -1     -1     1      1      1
10     1      1      1      1      1      1      1      1      1      1
Sum    2      2      2      -6     -6     -6     -6     2      2      2
Sign   1      1      1      -1     -1     -1     -1     1      1      1
Boosting
• An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
  – Initially, all N records are assigned equal weights
  – Unlike bagging, weights may change at the end of each boosting round
Boosting
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

Original Data        1  2  3  4  5  6  7  8  9  10
Boosting (Round 1)   7  3  2  8  7  9  4  10 6  3
Boosting (Round 2)   5  4  9  4  2  5  1  7  4  2
Boosting (Round 3)   4  4  8  10 4  5  4  6  3  4

• Example 4 is hard to classify
• Its weight is increased, so it is more likely to be chosen again in subsequent rounds
AdaBoost
• Base classifiers: C1, C2, …, CT
• Error rate:

$\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)$

• Importance of a classifier:

$\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)$
AdaBoost Algorithm
• Weight update:

$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\ e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i \end{cases}$

  where $Z_j$ is the normalization factor

• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/N and the resampling procedure is repeated
• Classification:

$C^*(x) = \arg\max_y \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)$
AdaBoost Algorithm
AdaBoost Example
• Consider the 1-dimensional data set:

x   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y   1    1    1    -1   -1   -1   -1   1    1    1

• Classifier is a decision stump
  – Decision rule: x ≤ k versus x > k
  – Split point k is chosen based on entropy
  – [Stump: if x ≤ k predict y_left, else predict y_right]
AdaBoost Example
• Training sets for the first 3 boosting rounds:

Boosting Round 1:
x   0.1  0.4  0.5  0.6  0.6  0.7  0.7  0.7  0.8  1
y   1    -1   -1   -1   -1   -1   -1   -1   1    1

Boosting Round 2:
x   0.1  0.1  0.2  0.2  0.2  0.2  0.3  0.3  0.3  0.3
y   1    1    1    1    1    1    1    1    1    1

Boosting Round 3:
x   0.2  0.2  0.4  0.4  0.4  0.4  0.5  0.6  0.6  0.7
y   1    1    -1   -1   -1   -1   -1   -1   -1   -1

• Summary:

Round  Split Point  Left Class  Right Class  alpha
1      0.75         -1          1            1.738
2      0.05         1           1            2.7784
3      0.3          1           -1           4.1195
AdaBoost Example
• Weights:

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
2      0.311  0.311  0.311  0.01   0.01   0.01   0.01   0.01   0.01   0.01
3      0.029  0.029  0.029  0.228  0.228  0.228  0.228  0.009  0.009  0.009

• Classification (predicted class):

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      -1     -1     -1     -1     -1     -1     -1     1      1      1
2      1      1      1      1      1      1      1      1      1      1
3      1      1      1      -1     -1     -1     -1     -1     -1     -1
Sum    5.16   5.16   5.16   -3.08  -3.08  -3.08  -3.08  0.397  0.397  0.397
Sign   1      1      1      -1     -1     -1     -1     1      1      1
Classification: Alternative Techniques
Imbalanced Class Problem
Class Imbalance Problem
• Many classification problems have skewed classes (more records from one class than another), e.g.:
  – Credit card fraud
  – Intrusion detection
  – Defective products in a manufacturing assembly line
Challenges
• Evaluation measures such as accuracy are not well suited for imbalanced classes
• Detecting the rare class is like finding a needle in a haystack
Confusion Matrix
• Confusion Matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a           b
CLASS     Class=No      c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Accuracy
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

• Most widely used metric:

$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
Problem with Accuracy
• Consider a 2-class problem
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  – This is misleading because the model does not detect any class 1 example
  – Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)
Alternative Measures
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

$\text{Precision } (p) = \frac{a}{a + c}$

$\text{Recall } (r) = \frac{a}{a + b}$

$\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$
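A small helper (mine, not from the slides) that computes these measures from confusion-matrix counts; the example numbers are hypothetical:

```python
# Compute accuracy, precision, recall and F-measure from a = TP, b = FN, c = FP, d = TN.
def summarize(a, b, c, d):
    accuracy  = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall    = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure

# Example: a model on the 9990/10 imbalanced data set from the earlier slide that
# catches 5 of the 10 rare-class examples with 15 false alarms.
print(summarize(a=5, b=5, c=15, d=9975))
# accuracy ≈ 0.998, precision = 0.25, recall = 0.5, F ≈ 0.333
```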
ROC (Receiver Operating Characteristic)
• A graphical approach for displaying the trade-off between detection rate and false alarm rate
• Developed in the 1950s for signal detection theory, to analyze noisy signals
• An ROC curve plots TPR against FPR
  – TPR = TP/(TP+FN), FPR = FP/(FP+TN)
  – The performance of a model is represented as a point on an ROC curve
  – Changing the threshold parameter of the classifier changes the location of the point
ROC Curve
(TPR, FPR):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line:
  – Random guessing
  – Below the diagonal line: prediction is opposite of the true class
ROC (Receiver Operating Characteristic)
• To draw an ROC curve, the classifier must produce continuous-valued output
  – Outputs are used to rank test records, from the record most likely to be positive to the record least likely to be positive
• Many classifiers produce only discrete outputs (i.e., the predicted class)
  – How to get continuous-valued outputs?
    – Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM
Example: Decision Trees
• A decision tree with continuous-valued outputs
ROC Curve Example
Using ROC for Model Comparison
• No model consistently outperforms the other
  – M1 is better for small FPR
  – M2 is better for large FPR
• Area Under the ROC Curve (AUC)
  – Ideal: area = 1
  – Random guess: area = 0.5
How to Construct an ROC curve
Instance  score(+|A)  True Class
1         0.95        +
2         0.93        +
3         0.87        -
4         0.85        -
5         0.85        -
6         0.85        +
7         0.76        -
8         0.53        +
9         0.43        -
10        0.25        +

• Use a classifier that produces a continuous-valued output score(+|A) for each test instance
• Sort the instances according to score(+|A) in decreasing order
• Apply a threshold at each unique value of score(+|A)
• Count the number of TP, FP, TN, FN at each threshold
  – TPR = TP/(TP+FN)
  – FPR = FP/(FP+TN)
How to construct an ROC curve
Class          +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

[ROC curve plotted from the (FPR, TPR) pairs above]
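A hedged sketch of this construction (plain Python, using the scores and labels from the table above; ties in the score collapse to a single threshold):

```python
# Build ROC points by sweeping a threshold over the unique scores.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

P = labels.count("+")
N = labels.count("-")
points = []
for t in sorted(set(scores)) + [1.00]:          # apply threshold at each unique score
    tp = sum(s >= t and y == "+" for s, y in zip(scores, labels))
    fp = sum(s >= t and y == "-" for s, y in zip(scores, labels))
    points.append((fp / N, tp / P))             # (FPR, TPR)
print(points)
```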
Handling the Class Imbalance Problem
• Class-based ordering (e.g., RIPPER)
  – Rules for the rare class have higher priority
• Cost-sensitive classification
  – Misclassifying the rare class as the majority class is more expensive than misclassifying the majority class as the rare class
• Sampling-based approaches
Cost Matrix
C(i,j): cost of misclassifying a class i example as class j

Count matrix f(i,j):
                        PREDICTED CLASS
                        Class=Yes     Class=No
ACTUAL    Class=Yes     f(Yes,Yes)    f(Yes,No)
CLASS     Class=No      f(No,Yes)     f(No,No)

Cost matrix C(i,j):
                        PREDICTED CLASS
                        Class=Yes     Class=No
ACTUAL    Class=Yes     C(Yes,Yes)    C(Yes,No)
CLASS     Class=No      C(No,Yes)     C(No,No)

$\text{Cost} = \sum_{i,j} C(i, j) \times f(i, j)$
Computing Cost of Classification
Cost matrix C(i,j):
                        PREDICTED CLASS
                        +       -
ACTUAL    +             -1      100
CLASS     -             1       0

Model M1:
                        PREDICTED CLASS
                        +       -
ACTUAL    +             150     40
CLASS     -             60      250

Accuracy = 80%, Cost = 3910

Model M2:
                        PREDICTED CLASS
                        +       -
ACTUAL    +             250     45
CLASS     -             5       200

Accuracy = 90%, Cost = 4255
Cost Sensitive Classification
• Example: Bayesian classifier
  – Given a test record x:
    – Compute p(i|x) for each class i
    – Decision rule: classify x as class k if $k = \arg\max_i p(i \mid x)$
  – For the 2-class case, classify x as + if p(+|x) > p(-|x)
    – This decision rule implicitly assumes that C(+|+) = C(-|-) = 0 and C(+|-) = C(-|+)
Cost Sensitive Classification
• General decision rule:
  – Classify test record x as class k if

    $k = \arg\min_j \sum_i p(i \mid x) \times C(i, j)$

• 2-class case:
  – Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
  – Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
  – Decision rule: classify x as + if Cost(+) < Cost(-)
    – If C(+,+) = C(-,-) = 0, this reduces to: classify x as + if

      $p(+ \mid x) > \frac{C(-,+)}{C(-,+) + C(+,-)}$
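A hedged sketch of this 2-class decision rule (the cost values are a hypothetical example, not from the slides):

```python
# Cost-sensitive decision rule for two classes.
def cost_sensitive_label(p_pos, C):
    """p_pos = p(+|x); C[(i, j)] = cost of predicting j when the true class is i."""
    p_neg = 1.0 - p_pos
    cost_pos = p_pos * C[("+", "+")] + p_neg * C[("-", "+")]   # cost of predicting +
    cost_neg = p_pos * C[("+", "-")] + p_neg * C[("-", "-")]   # cost of predicting -
    return "+" if cost_pos < cost_neg else "-"

# With C(+,+) = C(-,-) = 0, C(-,+) = 1 and C(+,-) = 100, the threshold on p(+|x)
# drops from 0.5 to 1/101 ≈ 0.0099, so even weak evidence of the rare class yields +.
C = {("+", "+"): 0, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
print(cost_sensitive_label(0.02, C))  # "+"
```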
Sampling-based Approaches
• Modify the distribution of the training data so that the rare class is well represented in the training set
  – Undersample the majority class
  – Oversample the rare class
• Each approach has advantages and disadvantages