Classification fundamentals
Elena Baralis, Tania Cerquitelli
Politecnico di Torino

Classification: objectives
- prediction of a class label
- definition of an interpretable model of a given phenomenon
(A model is built from training data; it is then applied to unclassified data to obtain classified data.)
Classification: definition
Given
- a collection of class labels
- a collection of data objects labelled with a class label
find a descriptive profile of each class, which will allow the assignment of unlabeled objects to the appropriate class.

Definitions
- Training set: collection of labeled data objects used to learn the classification model
- Test set: collection of labeled data objects used to validate the classification model

Classification techniques
- Decision trees
- Classification rules
- Association rules
- Neural Networks
- Naïve Bayes and Bayesian Networks
- k-Nearest Neighbours (k-NN)
- Support Vector Machines (SVM)
- ...

Evaluation of classification techniques
- Accuracy: quality of the prediction
- Efficiency: model building time, classification time
- Scalability: training set size, attribute number
- Robustness: noise, missing data
- Interpretability: model interpretability, model compactness
Decision trees

Example of decision tree
Training data (From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006):

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Model: decision tree with splitting attributes Refund, MarSt, TaxInc
- Refund = Yes: leaf NO
- Refund = No: test MarSt
  - MarSt = Married: leaf NO
  - MarSt = Single, Divorced: test TaxInc
    - TaxInc < 80K: leaf NO
    - TaxInc > 80K: leaf YES

Another example of decision tree
There could be more than one tree that fits the same data!
- MarSt = Married: leaf NO
- MarSt = Single, Divorced: test Refund
  - Refund = Yes: leaf NO
  - Refund = No: test TaxInc
    - TaxInc < 80K: leaf NO
    - TaxInc > 80K: leaf YES

Apply model to test data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree:
- Refund = No: follow the No branch to MarSt
- MarSt = Married: reach the leaf labeled NO
- Assign Cheat to "No"
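A minimal Python sketch (not part of the original slides) that encodes the first example tree and applies it to the test record above; the dictionary keys mirror the attribute names used in the slides.

    def classify(record):
        """Walk the decision tree from the root and return the predicted Cheat label."""
        if record["Refund"] == "Yes":
            return "No"                      # Refund = Yes: leaf NO
        if record["MarSt"] == "Married":
            return "No"                      # Married: leaf NO
        # Single or Divorced: test Taxable Income (expressed in thousands, K)
        if record["TaxInc"] < 80:
            return "No"
        return "Yes"

    test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
    print(classify(test_record))             # -> "No": Cheat is assigned to "No"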
Decision tree induction
Many algorithms to build a decision tree:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5, C5.0
- SLIQ, SPRINT

General structure of Hunt's algorithm
Let Dt be the set of training records that reach a node t.
Basic steps:
- If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt is an empty set, then t is a leaf node labeled as the default (majority) class yd.
- If Dt contains records that belong to more than one class, select the "best" attribute A on which to split Dt and label node t as A; split Dt into smaller subsets and recursively apply the procedure to each subset.
(From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)

Decision tree induction
- Adopts a greedy strategy
  - the "best" attribute for the split is selected locally at each step
  - not a global optimum
- Issues
  - structure of the test condition (binary split versus multiway split)
  - selection of the best attribute for the split
  - stopping condition for the algorithm
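The following Python sketch mirrors Hunt's scheme above under simplifying assumptions: attributes are categorical, the locally "best" attribute is supplied by a caller-provided choose_best_attribute function (hypothetical), and a stopping case for exhausted attributes is added for completeness.

    from collections import Counter

    def hunt(records, attributes, default_class, choose_best_attribute):
        """records: list of (attribute-value dict, class label) pairs reaching the current node (Dt)."""
        if not records:                                    # Dt is empty: leaf labeled with the default class yd
            return {"leaf": default_class}
        labels = [y for _, y in records]
        if len(set(labels)) == 1:                          # all records in the same class yt: leaf labeled yt
            return {"leaf": labels[0]}
        if not attributes:                                 # no attribute left (added stopping case): majority class
            return {"leaf": Counter(labels).most_common(1)[0][0]}
        A = choose_best_attribute(records, attributes)     # locally "best" attribute (greedy choice)
        majority = Counter(labels).most_common(1)[0][0]
        children = {}
        remaining = [a for a in attributes if a != A]
        for value in {x[A] for x, _ in records}:           # split Dt into smaller subsets, one per value of A
            subset = [(x, y) for x, y in records if x[A] == value]
            children[value] = hunt(subset, remaining, majority, choose_best_attribute)
        return {"attribute": A, "children": children}

    # Tiny usage example with a trivial chooser that just picks the first attribute (hypothetical):
    records = [({"Refund": "Yes"}, "No"), ({"Refund": "No"}, "Yes")]
    print(hunt(records, ["Refund"], "No", lambda recs, attrs: attrs[0]))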
Structure of test condition
- Depends on attribute type
  - nominal
  - ordinal
  - continuous
- Depends on number of outgoing edges
  - 2-way split
  - multi-way split

Splitting on nominal attributes
- Multi-way split: use as many partitions as distinct values
  (e.g., CarType: Family, Sports, Luxury)
- Binary split: divides values into two subsets; need to find the optimal partitioning
  (e.g., CarType: {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports})

Splitting on ordinal attributes
- Multi-way split: use as many partitions as distinct values
  (e.g., Size: Small, Medium, Large)
- Binary split: divides values into two subsets; need to find the optimal partitioning
  (e.g., Size: {Small, Medium} vs {Large}, or {Medium, Large} vs {Small})
- What about a split such as Size: {Small, Large} vs {Medium}?

Splitting on continuous attributes
Different techniques:
- Discretization to form an ordinal categorical attribute
  - static: discretize once at the beginning
  - dynamic: discretize during tree induction; ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
- Binary decision: (A < v) or (A ≥ v)
  - consider all possible splits and find the best cut
  - more computationally intensive
Examples: (i) binary split: Taxable Income > 80K?; (ii) multi-way split: Taxable Income in < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K

Selection of the best attribute
Before splitting: 10 records of class 0, 10 records of class 1.
Candidate test conditions:
- Own Car?   Yes: C0: 6, C1: 4;  No: C0: 4, C1: 6
- Car Type?  Family: C0: 1, C1: 3;  Sports: C0: 8, C1: 0;  Luxury: C0: 1, C1: 7
- Student ID?  c1: C0: 1, C1: 0; ...; c10: C0: 1, C1: 0; c11: C0: 0, C1: 1; ...; c20: C0: 0, C1: 1
Which attribute (test condition) is the best?
(From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)
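As an illustration of the binary decision (A < v) search on a continuous attribute, the Python sketch below evaluates every candidate cut on the Taxable Income values of the training table above (with their Cheat labels) and keeps the one with the lowest weighted Gini index, one of the impurity measures introduced in the next section; the function names are illustrative.

    def gini(labels):
        """Gini index of a set of class labels: 1 - sum of squared class frequencies."""
        total = len(labels)
        return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

    def best_binary_cut(values, labels):
        """Try every candidate cut v and return the one minimizing the weighted Gini of (A < v) vs (A >= v)."""
        pairs = sorted(zip(values, labels))
        best_v, best_score = None, float("inf")
        for i in range(1, len(pairs)):
            v = (pairs[i - 1][0] + pairs[i][0]) / 2          # midpoint between consecutive sorted values
            left = [y for x, y in pairs if x < v]
            right = [y for x, y in pairs if x >= v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
            if score < best_score:
                best_v, best_score = v, score
        return best_v, best_score

    # Taxable Income (in K) and Cheat labels from the training table above
    incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    print(best_binary_cut(incomes, cheat))   # prints the best cut and its weighted Gini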
Selection of the best attribute
- Attributes with a homogeneous class distribution are preferred
- Need a measure of node impurity
  - e.g., C0: 5, C1: 5: non-homogeneous, high degree of impurity
  - e.g., C0: 9, C1: 1: homogeneous, low degree of impurity

Measures of node impurity
- Many different measures available
  - Gini index
  - Entropy
  - Misclassification error
- Different algorithms rely on different measures
(From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)
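A short sketch, assuming the standard definitions of these measures (Gini = 1 - Σ pi², Entropy = -Σ pi log2 pi, Misclassification error = 1 - max pi), applied to the two node distributions above:

    import math

    def impurity(counts):
        """Compute Gini index, entropy and misclassification error for the class counts at a node."""
        total = sum(counts)
        probs = [c / total for c in counts]
        gini = 1.0 - sum(p * p for p in probs)
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        error = 1.0 - max(probs)
        return gini, entropy, error

    print(impurity([5, 5]))   # high impurity:  (0.5, 1.0, 0.5)
    print(impurity([9, 1]))   # low impurity:   (0.18, 0.469, 0.1)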
Decision tree based classification
- Advantages
  - inexpensive to construct
  - extremely fast at classifying unknown records
  - easy to interpret for small-sized trees
  - accuracy is comparable to other classification techniques for many simple data sets
- Disadvantages
  - accuracy may be affected by missing data
Associative classification
- The classification model is defined by means of association rules
  (Condition) → y
  - the rule body is an itemset
- Model generation
  - rule selection & sorting: based on support, confidence and correlation thresholds
  - rule pruning: database coverage, i.e. the training set is covered by selecting the topmost rules according to the previous sort
- Strong points
  - interpretable model
  - higher accuracy than decision trees, since correlation among attributes is considered
  - efficient classification
  - unaffected by missing data
  - good scalability in the training set size
- Weak points
  - rule generation may be slow (it depends on the support threshold)
  - reduced scalability in the number of attributes: rule generation may become unfeasible
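A minimal sketch of how a sorted rule list of the form (Condition) → y could be applied at classification time; the rules, attribute names and default class below are hypothetical examples, not taken from the slides.

    # Hypothetical rule list, already sorted by support/confidence as described above.
    rules = [
        ({"Refund": "No", "MarSt": "Single"}, "Yes"),
        ({"Refund": "Yes"}, "No"),
    ]
    default_class = "No"   # fallback label when no rule matches

    def classify_with_rules(record, rules, default_class):
        """Return the label of the first (topmost) rule whose body matches the record."""
        for condition, label in rules:
            if all(record.get(attr) == value for attr, value in condition.items()):
                return label
        return default_class

    print(classify_with_rules({"Refund": "No", "MarSt": "Single"}, rules, default_class))   # -> "Yes"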
Neural networks
- Inspired by the structure of the human brain
  - neurons as elaboration units
  - synapses as connection network

Structure of a neural network
(Figure: input nodes receive the input vector xi; hidden nodes and output nodes are connected through weights wij; the output nodes produce the output vector.)
(From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006)

Structure of a neuron
(Figure: the input vector x = (x0, ..., xn) is combined with the weight vector w = (w0, ..., wn) in a weighted sum, an offset mk is subtracted, and an activation function f produces the output y.)

Construction of the neural network
- For each node, definition of
  - set of weights
  - offset value
  providing the highest accuracy on the training data
- Iterative approach on training data instances

Neural networks
- Strong points
  - high accuracy
  - robust to noise and outliers
  - supports both discrete and continuous output
  - efficient during classification
- Weak points
  - long training time
    - weakly scalable in training data size
    - complex configuration
  - model is not interpretable
    - application domain knowledge cannot be exploited in the model
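A minimal sketch of the single-neuron computation described above (weighted sum of the inputs minus the offset mk, passed through the activation function f); the sigmoid used for f is an illustrative choice, not prescribed by the slides.

    import math

    def neuron_output(x, w, mk):
        """Weighted sum of inputs minus the offset, passed through the activation function f."""
        weighted_sum = sum(xi * wi for xi, wi in zip(x, w)) - mk
        return 1.0 / (1.0 + math.exp(-weighted_sum))   # f: sigmoid (illustrative choice)

    # Example: 3 inputs with their weights and an offset value
    print(neuron_output([1.0, 0.5, -0.2], [0.4, 0.3, 0.9], mk=0.1))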
Bayesian classification

Bayes theorem
- Let C and X be random variables:
  P(C,X) = P(C|X) P(X)
  P(C,X) = P(X|C) P(C)
- Hence
  P(C|X) P(X) = P(X|C) P(C)
  and also
  P(C|X) = P(X|C) P(C) / P(X)

Bayesian classification
- Let the class attribute and all data attributes be random variables
  - C = any class label
  - X = <x1, …, xk> record to be classified
- Applying the Bayes theorem
  P(C|X) = P(X|C)·P(C) / P(X)
- Compute P(C|X) for all classes
  - probability that record X belongs to C
- Assign X to the class with maximal P(C|X)
- Note
  - P(X) is constant for all C, so it is disregarded for the maximum computation
  - P(C) is the a priori probability of C: P(C) = Nc/N

Bayesian classification
- How to estimate P(X|C), i.e. P(x1,…,xk|C)?
- Naïve hypothesis
  P(x1,…,xk|C) = P(x1|C) P(x2|C) … P(xk|C)
  - statistical independence of attributes x1,…,xk
  - not always true: model quality may be affected
- Computing P(xk|C)
  - for discrete attributes: P(xk|C) = |xkC| / Nc, where |xkC| is the number of instances having value xk for attribute k and belonging to class C
  - for continuous attributes, use a probability distribution
- Bayesian networks
  - allow specifying a subset of dependencies among attributes

Bayesian classification: example
Weather dataset (From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006):

Outlook  | Temperature | Humidity | Windy | Class
sunny    | hot         | high     | false | N
sunny    | hot         | high     | true  | N
overcast | hot         | high     | false | P
rain     | mild        | high     | false | P
rain     | cool        | normal   | false | P
rain     | cool        | normal   | true  | N
overcast | cool        | normal   | true  | P
sunny    | mild        | high     | false | N
sunny    | cool        | normal   | false | P
rain     | mild        | normal   | false | P
sunny    | mild        | normal   | true  | P
overcast | mild        | high     | true  | P
overcast | hot         | normal   | false | P
rain     | mild        | high     | true  | N

Estimated probabilities:
- P(p) = 9/14, P(n) = 5/14
- outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
- temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
- humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 1/5
- windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
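A small sketch, assuming the naïve independence hypothesis above, that estimates P(C) and P(xk|C) by counting over the 14 records (the data literal below re-encodes the table); it reproduces, for instance, P(sunny|p) = 2/9.

    from collections import Counter, defaultdict

    # The 14 training records of the weather example: (outlook, temperature, humidity, windy, class)
    data = [
        ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
        ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
        ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
        ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
        ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
        ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
        ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
    ]
    attributes = ["outlook", "temperature", "humidity", "windy"]

    # P(C) = Nc / N
    class_counts = Counter(row[-1] for row in data)
    priors = {c: n / len(data) for c, n in class_counts.items()}

    # P(xk|C) = |xkC| / Nc for each discrete attribute value
    cond_counts = defaultdict(Counter)
    for *values, c in data:
        for attr, value in zip(attributes, values):
            cond_counts[(attr, c)][value] += 1

    def p_cond(attr, value, c):
        return cond_counts[(attr, c)][value] / class_counts[c]

    print(priors)                           # P(n) = 5/14 ≈ 0.357, P(p) = 9/14 ≈ 0.643
    print(p_cond("outlook", "sunny", "P"))  # 2/9 ≈ 0.222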
Bayesian classification: example
- Data to be labeled: X = <rain, hot, high, false>
- For class p:
  P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
              = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- For class n:
  P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
              = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Since P(X|n)·P(n) > P(X|p)·P(p), X is assigned to class n
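The same comparison as a few lines of Python, using the fractions listed above:

    # Score each class as P(x1|C)·...·P(xk|C)·P(C) and pick the maximum.
    score_p = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)    # P(X|p)·P(p)
    score_n = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)    # P(X|n)·P(n)
    print(round(score_p, 6), round(score_n, 6))          # 0.010582 0.018286
    print("n" if score_n > score_p else "p")             # X is assigned to class n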
Model evaluation
- Methods for performance evaluation
  - partitioning techniques for training and test sets
- Metrics for performance evaluation
  - accuracy, other measures
- Techniques for model comparison
  - ROC curve

Methods of estimation
- Partitioning labeled data in
  - training set for model building
  - test set for model evaluation
- Several partitioning techniques
  - holdout
  - cross validation
- Stratified sampling to generate partitions
  - without replacement
- Bootstrap
  - sampling with replacement

Holdout
- Fixed partitioning
  - reserve 2/3 for training and 1/3 for testing
- Appropriate for large datasets
  - may be repeated several times (repeated holdout)

Cross validation
- Partition data into k disjoint subsets (i.e., folds)
- k-fold: train on k-1 partitions, test on the remaining one
  - repeat for all folds
- Reliable accuracy estimation, not appropriate for very large datasets
- Leave-one-out
  - cross validation for k = n
  - only appropriate for very small datasets
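A bare-bones sketch of the k-fold partitioning described above; it only splits record indices into folds (shuffling and stratification are omitted), and the function names are illustrative.

    def k_fold_indices(n, k):
        """Partition record indices 0..n-1 into k disjoint folds of (almost) equal size."""
        return [list(range(i, n, k)) for i in range(k)]

    def cross_validation_splits(n, k):
        """For each fold, yield (training indices, test indices): train on k-1 folds, test on the remaining one."""
        folds = k_fold_indices(n, k)
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test

    # Example with n = 10 records and k = 5 folds; leave-one-out corresponds to k = n.
    for train, test in cross_validation_splits(n=10, k=5):
        print(len(train), len(test))   # 8 2, repeated for each of the 5 folds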
Metrics for model evaluation
- Evaluate the predictive accuracy of a model
- Confusion matrix for a binary classifier:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL    Class=Yes   a (TP)      b (FN)
  CLASS     Class=No    c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Accuracy
- Most widely-used metric for model evaluation
  Accuracy = Number of correctly classified objects / Number of classified objects
- For a binary classifier
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
- Not always a reliable metric

Limitations of accuracy
- Consider a binary problem
  - cardinality of class 0 = 9900
  - cardinality of class 1 = 100
- A model that predicts everything to be class 0 has accuracy 9900/10000 = 99.0%
- Accuracy is misleading because the model does not detect any class 1 object

Limitations of accuracy
- Classes may have different importance
  - misclassification of objects of a given class is more important
  - e.g., ill patients erroneously assigned to the healthy patients class
- Accuracy is not appropriate for
  - unbalanced class label distribution
  - different class relevance

Class specific measures
- Evaluate separately for each class C
  Recall (r) = Number of objects correctly assigned to C / Number of objects belonging to C
  Precision (p) = Number of objects correctly assigned to C / Number of objects assigned to C
- Maximize the F-measure
  F = 2rp / (r + p)

Class specific measures
- For a binary classification problem, on the confusion matrix, for the positive class:
  Precision (p) = a / (a + c)
  Recall (r) = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
(From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)
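A small sketch computing the measures above from the four confusion matrix entries a, b, c, d; the example counts are made up for illustration.

    def binary_metrics(a, b, c, d):
        """a = TP, b = FN, c = FP, d = TN (confusion matrix of a binary classifier)."""
        accuracy = (a + d) / (a + b + c + d)
        precision = a / (a + c)
        recall = a / (a + b)
        f_measure = 2 * recall * precision / (recall + precision)   # equals 2a / (2a + b + c)
        return accuracy, precision, recall, f_measure

    # Hypothetical confusion matrix: 70 TP, 30 FN, 10 FP, 890 TN
    print(binary_metrics(a=70, b=30, c=10, d=890))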