Classification fundamentals
Elena Baralis, Tania Cerquitelli
Politecnico di Torino

Objectives
- prediction of a class label
- definition of an interpretable model of a given phenomenon

[Figure: training data are used to build a model; the model assigns class labels to unclassified data, yielding classified data]
Classification: definition

Given
- a collection of class labels
- a collection of data objects labelled with a class label
find a descriptive profile of each class, which will allow the assignment of unlabeled objects to the appropriate class.

Definitions
- Training set: collection of labeled data objects used to learn the classification model
- Test set: collection of labeled data objects used to validate the classification model

Classification techniques
- Decision trees
- Classification rules
- Association rules
- Neural networks
- Naïve Bayes and Bayesian networks
- k-Nearest Neighbours (k-NN)
- Support Vector Machines (SVM)
- ...
Evaluation of classification techniques
- Accuracy: quality of the prediction
- Efficiency: model building time, classification time
- Scalability: training set size, attribute number
- Robustness: noise, missing data
- Interpretability: model interpretability, model compactness
Decision trees

Example of decision tree

Training Data (From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
- Yes -> NO
- No  -> MarSt?
         - Married          -> NO
         - Single, Divorced -> TaxInc?
                               - < 80K -> NO
                               - > 80K -> YES

Another example of decision tree

MarSt?
- Married          -> NO
- Single, Divorced -> Refund?
                      - Yes -> NO
                      - No  -> TaxInc?
                               - < 80K -> NO
                               - > 80K -> YES

There could be more than one tree that fits the same data!

Apply Model to Test Data

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch matching the record at each node:
- Refund = No -> go to MarSt
- MarSt = Married -> reach leaf NO

Assign Cheat to "No".

From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
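The example can be reproduced with an off-the-shelf learner. Below is a minimal sketch, assuming scikit-learn and pandas are installed; the one-hot encoding of the nominal attributes and the Gini criterion are illustrative choices, not prescribed by the slides.

```python
# Fit a decision tree on the example training data and classify the
# test record Refund=No, MarSt=Married, TaxInc=80K.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

train = pd.DataFrame({
    "Refund": ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
    "MarSt":  ["Single","Married","Single","Married","Divorced",
               "Married","Divorced","Single","Married","Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in K
    "Cheat":  ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
})

# One-hot encode the nominal attributes; keep the continuous one as-is
X = pd.get_dummies(train[["Refund", "MarSt", "TaxInc"]])
y = train["Cheat"]

model = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(model, feature_names=list(X.columns)))  # learned tree

test = pd.get_dummies(pd.DataFrame(
    [{"Refund": "No", "MarSt": "Married", "TaxInc": 80}]
)).reindex(columns=X.columns, fill_value=0)
print(model.predict(test))  # expected: ['No'], matching the walkthrough
```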
Decision tree induction

Many algorithms to build a decision tree
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5, C5.0
- SLIQ, SPRINT

General structure of Hunt's algorithm

Let Dt be the set of training records that reach a node t (e.g., the training data above).

Basic steps
- If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
- If Dt is an empty set, then t is a leaf node labeled as the default (majority) class, yd
- If Dt contains records that belong to more than one class, select the "best" attribute A on which to split Dt and label node t as A; split Dt into smaller subsets and recursively apply the procedure to each subset

Hunt's algorithm adopts a greedy strategy: the "best" attribute for the split is selected locally at each step, not a global optimum. (A sketch of the procedure follows below.)

Issues
- Structure of test condition: binary split versus multiway split
- Selection of the best attribute for the split
- Stopping condition for the algorithm

From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
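A minimal sketch of this recursive scheme in Python; it is illustrative only (categorical attributes, multi-way splits, weighted Gini as the local "best attribute" criterion), and real systems such as C4.5 or CART differ in many details.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attributes, default):
    """records: list of dicts; labels: parallel list of class labels."""
    # Case 1: all records in Dt belong to the same class yt -> leaf yt
    if labels and len(set(labels)) == 1:
        return labels[0]
    # Case 2: Dt is empty (or no attributes left) -> leaf with default yd
    if not records or not attributes:
        return default
    # Case 3: several classes -> greedy, locally best split
    def weighted_gini(a):
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[a], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    best = min(attributes, key=weighted_gini)
    majority = Counter(labels).most_common(1)[0][0]
    partitions = {}
    for r, y in zip(records, labels):
        partitions.setdefault(r[best], ([], []))
        partitions[r[best]][0].append(r)
        partitions[r[best]][1].append(y)
    rest = [a for a in attributes if a != best]
    return {"split_on": best,
            "children": {v: hunt(rs, ys, rest, majority)
                         for v, (rs, ys) in partitions.items()}}

# Tiny usage example with hypothetical records
records = [{"Refund": "Yes", "MarSt": "Single"},
           {"Refund": "No",  "MarSt": "Single"}]
print(hunt(records, ["No", "Yes"], ["Refund", "MarSt"], default="No"))
# -> {'split_on': 'Refund', 'children': {'Yes': 'No', 'No': 'Yes'}}
```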
Structure of test condition

Depends on attribute type
- nominal
- ordinal
- continuous

Depends on number of outgoing edges
- binary split
- multi-way split

Splitting on nominal attributes

Multi-way split: use as many partitions as distinct values
    CarType? -> Family | Sports | Luxury

Binary split: divides values into two subsets; need to find optimal partitioning
    CarType? -> {Sports, Luxury} | {Family}   OR   CarType? -> {Family, Luxury} | {Sports}

Splitting on ordinal attributes

Multi-way split: use as many partitions as distinct values
    Size? -> Small | Medium | Large

Binary split: divides values into two subsets; need to find optimal partitioning
    Size? -> {Small, Medium} | {Large}   OR   Size? -> {Small} | {Medium, Large}

What about this split?  Size? -> {Small, Large} | {Medium}
(It is usually not allowed, since it does not preserve the order of the attribute values.)

Splitting on continuous attributes

Different techniques
- Discretization to form an ordinal categorical attribute
  - static: discretize once at the beginning
  - dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
- Binary decision: (A < v) or (A >= v)
  - consider all possible splits and find the best cut (see the sketch below)
  - more computationally intensive

Examples
(i) Binary split: Taxable Income > 80K? -> Yes | No
(ii) Multi-way split: Taxable Income? -> < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K

Selection of the best attribute

Before splitting: 10 records of class 0, 10 records of class 1

Own Car?     Yes: C0: 6, C1: 4    No: C0: 4, C1: 6
Car Type?    Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7
Student ID?  c1: C0: 1, C1: 0  ...  c10: C0: 1, C1: 0  c11: C0: 0, C1: 1  ...  c20: C0: 0, C1: 1

Which attribute (test condition) is the best?

From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
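A minimal sketch of the exhaustive search for the best binary cut (A < v) vs (A >= v), scoring candidate midpoints with the weighted Gini index (an assumed criterion; the slides do not fix one here), applied to the Taxable Income column of the earlier example:

```python
def gini(labels):
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_cut(values, labels):
    """Consider all possible splits between consecutive sorted values;
    return (threshold, weighted_gini) of the best cut."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between identical values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint threshold
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (v, score)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No","No","No","No","Yes","No","No","Yes","No","Yes"]
print(best_cut(income, cheat))  # expected: (97.5, 0.3)
```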
Selection of the best attribute

Attributes with homogeneous class distribution are preferred

    C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
    C0: 9, C1: 1  -> homogeneous, low degree of impurity

Need a measure of node impurity.

Measures of node impurity

Many different measures available
- Gini index
- Entropy
- Misclassification error

Different algorithms rely on different measures.

From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
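The three measures can be compared directly on the two distributions above. A minimal sketch using the standard definitions (Gini = 1 − Σ pj², Entropy = −Σ pj log₂ pj, Error = 1 − max pj), which the slides name but do not spell out here:

```python
import math

def impurities(counts):
    """Gini, entropy and misclassification error of a class-count vector."""
    n = sum(counts)
    ps = [c / n for c in counts if c > 0]
    gini = 1.0 - sum(p * p for p in ps)
    entropy = -sum(p * math.log2(p) for p in ps)
    error = 1.0 - max(ps)
    return gini, entropy, error

print(impurities([5, 5]))  # high impurity: (0.5, 1.0, 0.5)
print(impurities([9, 1]))  # low impurity: approx. (0.18, 0.47, 0.10)
```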
Decision tree based classification

Advantages
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets

Disadvantages
- Accuracy may be affected by missing data

From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
Associative classification

The classification model is defined by means of association rules

    (Condition) -> y

where the rule body is an itemset.

Model generation
- Rule selection & sorting, based on support, confidence and correlation thresholds
- Database coverage: the training set is covered by selecting topmost rules according to the previous sort (see the sketch below)
- Rule pruning

Strong points
- interpretable model
- higher accuracy than decision trees, since correlation among attributes is considered
- efficient classification
- unaffected by missing data
- good scalability in the training set size

Weak points
- rule generation may be slow: it depends on the support threshold
- reduced scalability in the number of attributes: rule generation may become unfeasible
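A minimal sketch of the database coverage step under CBA-style assumptions (rules ranked by confidence, then support); the slides do not prescribe this exact ranking, so treat the ordering and the data layout as hypothetical:

```python
def select_rules(rules, training):
    """rules: list of (itemset, label, support, confidence), itemset a set;
    training: list of (record_items, label), record_items a set."""
    # Sort by decreasing confidence, then decreasing support
    ranked = sorted(rules, key=lambda r: (-r[3], -r[2]))
    uncovered = list(training)
    selected = []
    for itemset, label, sup, conf in ranked:
        matched = [t for t in uncovered if itemset <= t[0]]
        if matched:  # keep a rule only if it covers uncovered records
            selected.append((itemset, label, sup, conf))
            uncovered = [t for t in uncovered if not itemset <= t[0]]
        if not uncovered:
            break
    return selected

# Hypothetical rules and training records
rules = [({"refund=No", "marst=Single"}, "Yes", 0.2, 0.9),
         ({"refund=Yes"}, "No", 0.3, 0.8)]
training = [({"refund=No", "marst=Single", "taxinc=85K"}, "Yes"),
            ({"refund=Yes", "marst=Married", "taxinc=120K"}, "No")]
print(select_rules(rules, training))  # both rules kept: each covers a record
```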
Neural networks

Inspired by the structure of the human brain
- neurons as elaboration units
- synapses as connection network

Structure of a neural network

[Figure: input nodes receive the input vector x; hidden nodes and output nodes are connected by weighted edges wij; the output nodes produce the output vector]

Structure of a neuron

[Figure: inputs x0, ..., xn with weight vector w = (w0, ..., wn) feed a weighted sum, shifted by the offset mk and passed through an activation function f to produce the output y]

    y = f( sum_i wi * xi - mk )

From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006
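A minimal sketch of the neuron just described; the sigmoid activation is an assumption (the slides leave f generic):

```python
import math

def neuron(x, w, offset):
    """Single neuron: weighted sum minus offset, through activation f."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - offset  # weighted sum - mk
    return 1.0 / (1.0 + math.exp(-s))                  # activation f (sigmoid)

print(neuron([1.0, 0.5], [0.8, -0.4], 0.2))  # single output y in (0, 1)
```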
Construction of the neural network

For each node, definition of
- set of weights
- offset value
providing the highest accuracy on the training data.

Iterative approach on training data instances.

Neural networks

Strong points
- High accuracy
- Robust to noise and outliers
- Supports both discrete and continuous output
- Efficient during classification

Weak points
- Long training time: weakly scalable in training data size, complex configuration
- Not interpretable model: application domain knowledge cannot be exploited in the model

From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006
Bayesian classification

Bayes theorem

Let C and X be random variables:

    P(C,X) = P(C|X) P(X)
    P(C,X) = P(X|C) P(C)

Hence

    P(C|X) P(X) = P(X|C) P(C)

and also

    P(C|X) = P(X|C) P(C) / P(X)

Bayesian classification

Let the class attribute and all data attributes be random variables
- C = any class label
- X = <x1, ..., xk> record to be classified

Classifying record X
- compute P(C|X) for all classes: the probability that record X belongs to C
- assign X to the class with maximal P(C|X)

Applying Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)
- P(X) is constant for all C, disregarded for maximum computation
- P(C) is the a priori probability of C: P(C) = Nc/N

How to estimate P(X|C), i.e. P(x1, ..., xk|C)?

Naïve hypothesis: statistical independence of attributes x1, ..., xk

    P(x1, ..., xk|C) = P(x1|C) P(x2|C) ... P(xk|C)

- not always true: model quality may be affected

Computing P(xk|C)
- for discrete attributes: P(xk|C) = |xkC| / Nc, where |xkC| is the number of instances having value xk for attribute k and belonging to class C
- for continuous attributes, use a probability distribution

Bayesian networks
- allow specifying a subset of dependencies among attributes

Bayesian classification: Example

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Probabilities estimated from the training set: P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9   P(overcast|p) = 4/9   P(rain|p) = 3/9
              P(sunny|n) = 3/5   P(overcast|n) = 0     P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9     P(mild|p) = 4/9       P(cool|p) = 3/9
              P(hot|n) = 2/5     P(mild|n) = 2/5       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9    P(normal|p) = 6/9
              P(high|n) = 4/5    P(normal|n) = 2/5
windy:        P(true|p) = 3/9    P(false|p) = 6/9
              P(true|n) = 3/5    P(false|n) = 2/5

From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006
Bayesian classification: Example

Data to be labeled: X = <rain, hot, high, false>

For class p:
    P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
                = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

For class n:
    P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
                = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

Since P(X|n)·P(n) > P(X|p)·P(p), X is assigned to class n.

From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006
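The whole example can be checked with a few lines of code. A minimal sketch, using maximum-likelihood estimates exactly as on the slides (no smoothing):

```python
data = [  # (outlook, temperature, humidity, windy, class)
    ("sunny","hot","high","false","N"), ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"), ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"), ("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"), ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"), ("rain","mild","high","true","N"),
]

def score(x, c):
    """Naive Bayes score P(x1|C)···P(xk|C)·P(C) for class c."""
    rows = [r for r in data if r[-1] == c]
    p = len(rows) / len(data)  # prior P(C) = Nc/N
    for k, xk in enumerate(x):
        p *= sum(1 for r in rows if r[k] == xk) / len(rows)  # P(xk|C)
    return p

x = ("rain", "hot", "high", "false")
print(score(x, "P"))  # 0.010582...
print(score(x, "N"))  # 0.018286... -> assign class n
```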
Model evaluation

- Metrics for performance evaluation: accuracy, other measures, ROC curve
- Methods for performance evaluation: partitioning techniques for training and test sets
- Techniques for model comparison

Partitioning techniques for training and test sets

Partitioning labeled data in
- training set, for model building
- test set, for model evaluation

Several partitioning techniques
- holdout
- cross validation

Holdout
- fixed partitioning: reserve 2/3 for training and 1/3 for testing
- appropriate for large datasets
- may be repeated several times: repeated holdout

Cross validation
- partition data into k disjoint subsets (i.e., folds)
- k-fold: train on k-1 partitions, test on the remaining one; repeat for all folds (see the sketch below)
- reliable accuracy estimation, not appropriate for very large datasets

Leave-one-out
- cross validation with k = n
- only appropriate for very small datasets

Stratified sampling to generate partitions
- without replacement

Bootstrap
- sampling with replacement
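A minimal sketch of the k-fold scheme (k = 3 on 9 records here, purely illustrative): the folds are disjoint and every record is tested exactly once.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]  # disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds if f is not folds[i] for j in f]
        yield train, test

for train_idx, test_idx in k_fold_indices(9, 3):
    print(train_idx, test_idx)  # train on k-1 folds, test on the remaining one
```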
Metrics for model evaluation

Evaluate the predictive accuracy of a model.

Accuracy is the most widely-used metric for model evaluation

    Accuracy = Number of correctly classified objects / Number of classified objects

but it is not always a reliable metric.

Confusion matrix (binary classifier)

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL     Class=Yes    a (TP)      b (FN)
CLASS      Class=No     c (FP)      d (TN)

a: TP (true positives), b: FN (false negatives), c: FP (false positives), d: TN (true negatives)

Accuracy

For a binary classifier:

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Limitations of accuracy

Consider a binary problem where
- cardinality of Class 0 = 9900
- cardinality of Class 1 = 100

A model that predicts everything to be class 0 has accuracy 9900/10000 = 99.0%. Accuracy is misleading because the model does not detect any class 1 object.

Accuracy is not appropriate for
- unbalanced class label distribution
- different class relevance

Class specific measures

Classes may have different importance: misclassification of objects of a given class may be more important, e.g., ill patients erroneously assigned to the healthy patients class.

Evaluate each class C separately:

    Recall (r) = Number of objects correctly assigned to C / Number of objects belonging to C
    Precision (p) = Number of objects correctly assigned to C / Number of objects assigned to C

Maximize the F-measure, which combines both:

    F-measure (F) = 2rp / (r + p)

From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
Class specific measures

For a binary classification problem, computed on the confusion matrix for the positive class:

    Precision (p) = a / (a + c)
    Recall (r) = a / (a + b)
    F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006
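A minimal sketch tying these formulas to the imbalanced example above (class 1 taken as the positive class; the guard clauses for empty denominators are an implementation choice):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return accuracy, precision, recall, f

# Model predicts everything as class 0: a=TP=0, b=FN=100, c=FP=0, d=TN=9900
print(metrics(tp=0, fn=100, fp=0, tn=9900))
# -> (0.99, 0.0, 0.0, 0.0): accuracy is 99% but precision, recall and F are 0
```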