CENG 464
Introduction to Data Mining

Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
Y=f(X)
• Goal: previously unseen records should be assigned a
class as accurately as possible.
– A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.
Prediction Problems: Classification vs.
Numeric Prediction
• Classification :
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts unknown or missing
values
• Typical applications
– Credit/loan approval
– Medical diagnosis: whether a tumor is cancerous or benign
– Fraud detection: whether a transaction is fraudulent
– Web page categorization: which category a page belongs to
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
– If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Training Data → Classification Algorithms → Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Process (2): Using the Model in Prediction

Classifier (Model) → applied to Testing Data and Unseen Data

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured? yes

Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Training Set → Learning algorithm (supervised Induction) → Learn Model → Model

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Test Set → Apply Model (Deduction) → predicted class labels

Training and test sets are randomly sampled. The goal is to find a mapping or function that can predict the class label of a given tuple X; accuracy is estimated on the test set.
Classification Techniques
• Decision Tree based Methods
• Bayes Classification Methods
• Rule-based Methods
• Nearest-Neighbor Classifiers
• Artificial Neural Networks
• Support Vector Machines
• Memory-based reasoning
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree
– Root and internal nodes hold attribute tests (splitting attributes), arrows are attribute test conditions, and leaf nodes hold class labels.

Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO

Another Example of Decision Tree
(same training data)

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

• There could be more than one tree that fits the same data!
• Performance/speed of classification depends on the tree structure (number of levels)
• Accuracy of the tree matters as well

Decision Tree Classification Task

Training Set (Tid, Attrib1, Attrib2, Attrib3, Class — the 10 records shown earlier) → Tree Induction algorithm → Induction → Learn Model → Decision Tree

Test Set (Tid 11–15 shown earlier, class = ?) → Apply Model → Deduction → predicted class labels
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:
1. Refund = No → take the No branch to the MarSt node.
2. MarSt = Married → take the Married branch, which leads to the leaf NO.
(The TaxInc node on the Single/Divorced branch is never reached for this record.)

Assign Cheat to "No".
Decision Tree Classification Task

Training Set → Tree Induction (DT) algorithm → Induction → Learn Model → Decision Tree
Test Set → Apply Model → Deduction
(Same training and test sets as shown earlier.)
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm
– ID3, C4.5
– CART
– SLIQ, SPRINT
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain), called an attribute selection measure
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split
– Multi-way split
– Determine when to stop splitting
Splitting Based on Nominal Attributes
• Multi-way split: Use as many partitions as distinct values.
  CarType → {Family}, {Sports}, {Luxury}
• Binary split: Divides values into two subsets; need to find the optimal partitioning.
  CarType → {Sports, Luxury} vs. {Family}   OR   {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes
• Multi-way split: Use as many partitions as distinct values.
  Size → {Small}, {Medium}, {Large}
• Binary split: Divides values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} vs. {Large}   OR   {Medium, Large} vs. {Small}

Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
– Binary decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more compute intensive
• Examples: (i) binary split: Taxable Income > 80K? (Yes/No); (ii) multi-way split: Taxable Income? < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
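To make the "consider all possible splits and find the best cut" idea concrete, here is a minimal sketch of my own (not from the slides): it scans the midpoints between consecutive sorted values of a continuous attribute and keeps the cut with the lowest weighted Gini impurity (the Gini index is defined later in these notes). The attribute and label lists come from the Cheat example above.

from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    # Try every midpoint between consecutive distinct sorted values and
    # return the cut with the minimal weighted Gini impurity.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no cut between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v < cut]
        right = [c for v, c in pairs if v >= cut]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_cut, best_impurity = cut, weighted
    return best_cut, best_impurity

# Taxable Income values and Cheat labels from the example training data
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_binary_split(income, cheat))   # -> (97.5, 0.3) for this data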
How to determine the Best Split
Which split? Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate splits:
• Own Car? (Yes / No): {C0: 6, C1: 4} and {C0: 4, C1: 6}
• Car Type? (Family / Sports / Luxury): {C0: 1, C1: 3}, {C0: 8, C1: 0}, {C0: 1, C1: 7}
• Student ID? (c1 … c10, c11 … c20): each partition holds a single record, either {C0: 1, C1: 0} or {C0: 0, C1: 1}

Which test condition is the best?
How to determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
– C0: 5, C1: 5 → non-homogeneous, high degree of impurity
– C0: 9, C1: 1 → homogeneous, low degree of impurity

Why do we need an attribute selection measure?
• Why not choose the splitting attribute randomly? Random attribute selection generally yields larger, less pure trees.

Attribute Selection – Splitting Rules Measures (Measures of Node Impurity)
Provide a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
• Information Gain (Entropy)
• Gini Index
• Misclassification error
Information Gain
• We want to determine which attribute in a
given set of training feature vectors is most
useful for discriminating between the classes
to be learned.
• Information gain tells us how important a
given attribute of the feature vectors is.
• We will use it to decide the ordering of
attributes in the nodes of a decision tree.
Brief Review of Entropy (binary case, m = 2: entropy as a function of the class probability, maximal at p = 0.5 and zero at p = 0 or p = 1)
Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or impurity in these partitions
• Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σ_{i=1..m} p_i log2(p_i)

  Info(D) is the average amount of information needed to identify the class label of a tuple in D.
• Ideally, we want the partitioning on attribute A to produce an exact classification of the tuples
• Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

  |Dj| / |D| acts as the weight of the j-th partition; the smaller Info_A(D) (the expected information), the greater the purity of the partitions.
• Information gained by branching on attribute A:

  Gain(A) = Info(D) − Info_A(D)

  Gain(A) tells us how much would be gained by branching on A. We want to partition on the attribute A that does the best classification, so that the amount of information still required to finish classifying the tuples is minimal.
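These two formulas translate directly into code. The following is a small sketch of my own (not part of the lecture, and assuming categorical attributes stored as dictionaries); it reproduces the Info(D) value used in the buys_computer example further below.

from math import log2
from collections import Counter

def info(labels):
    # Expected information (entropy): Info(D) = -sum p_i log2 p_i
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, attr, labels):
    # Info_A(D): weighted entropy of the partitions induced by attribute attr
    n = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(rows, attr, labels):
    # Gain(A) = Info(D) - Info_A(D)
    return info(labels) - info_after_split(rows, attr, labels)

# 9 "yes" and 5 "no" tuples give Info(D) = 0.940, as on the slides
labels = ["yes"] * 9 + ["no"] * 5
print(round(info(labels), 3))   # 0.94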
DT Algorithm – example
Which attribute should we have as the topmost node of our decision tree?
• Determine the information gain for each candidate attribute:
– Gain(S, Outlook) = 0.246
– Gain(S, Humidity) = 0.151
– Gain(S, Wind) = 0.048
– Gain(S, Temperature) = 0.029
• So we would have Outlook as the top node
Example: Which Attribute to Select?

Training data D (buys_computer):

age          income  student  credit_rating  buys_computer
youth        high    no       fair           no
youth        high    no       excellent      no
middle-aged  high    no       fair           yes
senior       medium  no       fair           yes
senior       low     yes      fair           yes
senior       low     yes      excellent      no
middle-aged  low     yes      excellent      yes
youth        medium  no       fair           no
youth        low     yes      fair           yes
senior       medium  yes      fair           yes
youth        medium  yes      excellent      yes
middle-aged  medium  no       excellent      yes
middle-aged  high    yes      fair           yes
senior       medium  no       excellent      no

Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

  Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

• Partitioning on age:

  age          p_i  n_i  I(p_i, n_i)
  youth        2    3    0.971
  middle-aged  4    0    0
  senior       3    2    0.971

  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  Here (5/14) I(2,3) means "youth" has 5 out of the 14 samples, with 2 yes's and 3 no's.

• Hence Gain(age) = Info(D) − Info_age(D) = 0.246
• Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 − Σ_{j=1..n} p_j²

  where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

• Reduction in impurity:

  Δgini(A) = gini(D) − gini_A(D)

• The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
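As a quick illustration (my own sketch, not part of the slides), the two Gini formulas can be written directly; the printed value matches the buys_computer computation shown a little further below.

from collections import Counter

def gini(labels):
    # gini(D) = 1 - sum_j p_j^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1_labels, d2_labels):
    # gini_A(D) for a binary split of D into D1 and D2
    n = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / n) * gini(d1_labels) + (len(d2_labels) / n) * gini(d2_labels)

labels = ["yes"] * 9 + ["no"] * 5
print(round(gini(labels), 3))   # 0.459, as in the example below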
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

  SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

• Ex.: gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute
Computation of Gini Index
• Ex.: D has 9 tuples with buys_computer = "yes" and 5 with "no":

  gini(D) = 1 − (9/14)² − (5/14)² = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

  gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2)

  Gini{low,high} is 0.458 and Gini{medium,high} is 0.450; thus we split on {low, medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
Comparing Attribute Selection Measures
• The three measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is
much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions
and purity in both partitions
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data – a classification model built from a limited set of training examples is too representative of the training examples themselves rather than of the general characteristics of the data
– Too many branches, some of which may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples
– Remove the least reliable branches from the tree
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early – do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a "fully grown" tree – get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which is the "best pruned tree"

Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm, measure based on the χ² test for independence
• C-SEP: performs better than information gain and gini index in certain cases
• G-statistic: has a close approximation to the χ² distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
– The best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
– CART: finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best?
– Most give good results; none is significantly superior to the others

Performance of DT
• Accuracy / error rate: the ratio of the number of misclassifications to the total number of classifications
• Natural performance measure for classification problems: error rate on a test set
– Success: the instance's class is predicted correctly
– Error: the instance's class is predicted incorrectly
– Error rate: proportion of errors made over the whole set of instances
– Accuracy: proportion of correctly classified instances over the whole set of instances
accuracy = 1 – error rate
• Results are summarized in a confusion matrix
Classification Cost
• Associate a cost with misclassification
• Class +: with brain tumor
• Class –: without brain tumor
– A FN error can be highly costly: a patient with a brain tumor is missed and may die
– A FP error is not so serious: a healthy patient is sent to extra screening unnecessarily
– A classifier biased toward the + class is preferred
Accuracy in DT
• Explore methods for the evaluation of
performance of a classification model:
– Bootstrap
– Cross-validation
Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets
• Disadvantages:
– Irrelevant attributes may badly affect the construction of a decision tree, e.g., ID numbers
– Small variations in the data can lead to very different looking trees being generated
– A sub-tree can be replicated several times
– Error-prone with too many classes
– Not good for predicting the value of a continuous class attribute
Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
• Foundation: based on Bayes' Theorem.
Bayes' Theorem: Basics
• Total probability theorem:

  P(B) = Σ_{i=1..M} P(B | A_i) P(A_i)

• Bayes' theorem:

  P(H | X) = P(X | H) P(H) / P(X)

– Let X be a data sample ("evidence"): class label is unknown
– Let H be a hypothesis that X belongs to class C
– Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability
• E.g., X will buy a computer, regardless of age, income, …
– P(X): the probability that the sample data is observed
– P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., given that X will buy a computer, the probability that X is 31..40 with medium income
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Prediction Based on Bayes' Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H | X) = P(X | H) P(H) / P(X)

• Informally, this can be viewed as: posterior = likelihood × prior / evidence
• Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
• Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes' theorem:

  P(C_i | X) = P(X | C_i) P(C_i) / P(X)

• Since P(X) is constant for all classes, only

  P(C_i | X) ∝ P(X | C_i) P(C_i)

  needs to be maximized

Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

  P(X | C_i) = Π_{k=1..n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × … × P(x_n | C_i)

• This greatly reduces the computation cost: only count the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (the # of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

  g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²)),   P(x_k | C_i) = g(x_k, μ_Ci, σ_Ci)
Naïve Bayes Classifier: Training Dataset

Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data to be classified: X = (age = youth, income = medium, student = yes, credit_rating = fair)

age          income  student  credit_rating  buys_computer
youth        high    no       fair           no
youth        high    no       excellent      no
middle-aged  high    no       fair           yes
senior       medium  no       fair           yes
senior       low     yes      fair           yes
senior       low     yes      excellent      no
middle-aged  low     yes      excellent      yes
youth        medium  no       fair           no
youth        low     yes      fair           yes
senior       medium  yes      fair           yes
youth        medium  yes      excellent      yes
middle-aged  medium  no       excellent      yes
middle-aged  high    yes      fair           yes
senior       medium  no       excellent      no

Naïve Bayes Classifier: An Example
• P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
         P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = "youth" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "youth" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• For X = (age = youth, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci) × P(Ci): P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
                   P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
• Therefore, X belongs to the class buys_computer = "yes"
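The computation above can be packaged in a few lines. The following is an assumed sketch of my own (categorical attributes only, no smoothing, dictionary-based records rather than any particular library); the commented usage shows how the buys_computer example would be fed in.

from collections import Counter, defaultdict

def train_nb(rows, labels):
    # Estimate P(Ci) and P(xk|Ci) by simple counting (no smoothing)
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)            # (class, attribute) -> value counts
    for row, c in zip(rows, labels):
        for attr, value in row.items():
            cond_counts[(c, attr)][value] += 1
    return class_counts, cond_counts

def classify_nb(x, class_counts, cond_counts):
    # Return the class maximizing P(X|Ci) * P(Ci), plus all the scores
    n = sum(class_counts.values())
    scores = {}
    for c, cc in class_counts.items():
        score = cc / n                             # prior P(Ci)
        for attr, value in x.items():
            score *= cond_counts[(c, attr)][value] / cc   # P(xk|Ci)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical usage with the buys_computer table:
# rows = [{"age": "youth", "income": "high", "student": "no", "credit": "fair"}, ...]
# labels = ["no", "no", "yes", ...]
# model = train_nb(rows, labels)
# classify_nb({"age": "youth", "income": "medium", "student": "yes", "credit": "fair"}, *model)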
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:

  P(X | C_i) = Π_{k=1..n} P(x_k | C_i)

• Ex.: Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
• Use the Laplacian correction (or Laplacian estimator)
– Adding 1 to each case:
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
– The "corrected" probability estimates are close to their "uncorrected" counterparts

Naïve Bayes Classifier: Comments
• Advantages
– Easy to implement
– Robust to noise
– Can handle null values
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
• E.g., hospitals: patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by a Naïve Bayes Classifier
• How to deal with these dependencies? Bayesian Belief Networks
Chapter 8. Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Nearest Neighbour Approach
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary

Rule-Based Classification
• Construct a sequence of rules from the training set
• Search the rules to classify an unseen record
• The rule that covers the record is fired to determine its class
• Rules can be discovered directly or indirectly
Rule-Based Classification
A rule-based classifier uses a set of IF-THEN rules for classification
• An IF-THEN rule is an expression of the form:
IF condition THEN conclusion
where
• Condition (or LHS ) is rule antecedent/precondition
• Conclusion (or RHS ) is rule consequent
• Ex: IF age = youth AND student = yes THEN buys_computer = yes
• The condition consists of one or more attribute tests
that are logically ANDed
• The rule’s consequent contains a class prediction: we are
predicting whether a customer will buy a computer
Using IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
– If a rule is satisfied by X, it covers the tuple; the rule is said to be triggered/fired, returning its class prediction
– Ideally, rules should be mutually exclusive and exhaustive
  Mutually exclusive: no two rules will be triggered for the same tuple
  Exhaustive: there is one rule for each possible attribute-value combination → no need for a default rule
Rule-Based Classification
Assessment of a Rule:
• Coverage of a rule: The percentage of instances that satisfy the antecedent of
a rule (i.e., whose attribute values hold true for the rule’s antecedent).
• Accuracy of a rule: The percentage of instances that satisfy both the
antecedent and consequent of a rule
Assessment of a rule: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D|
where D: training data set
accuracy(R) = ncorrect / ncovers
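A tiny sketch of my own (not from the slides) of these two measures, for a rule represented as a dictionary of attribute tests; the record layout and names are assumptions for illustration.

def rule_matches(antecedent, record):
    # True if the record satisfies every attribute test in the antecedent
    return all(record.get(attr) == value for attr, value in antecedent.items())

def coverage_and_accuracy(antecedent, rule_class, records, labels):
    # coverage(R) = n_covers / |D|,  accuracy(R) = n_correct / n_covers
    covered = [lab for rec, lab in zip(records, labels) if rule_matches(antecedent, rec)]
    n_covers = len(covered)
    n_correct = sum(1 for lab in covered if lab == rule_class)
    coverage = n_covers / len(records)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# e.g. R: IF age = youth AND student = yes THEN buys_computer = yes
# coverage_and_accuracy({"age": "youth", "student": "yes"}, "yes", records, labels)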
Using IF-THEN Rules for Classification
Two ways of executing a rule set:
• Ordered set of rules ("decision list"): order is important for the interpretation of the rules
• Unordered set of rules: rules may overlap and lead to different conclusions for the same instance – all rules may be fired at the same time
Using IF-THEN Rules for Classification
If more than one rule is triggered, we need a conflict resolution strategy:
• Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., the highest quality, with the most attribute tests)
• Rule ordering: prioritize rules beforehand
– Class-based ordering: classes are sorted in order of decreasing importance, e.g. order of prevalence or misclassification cost per class; within each class, rules are not ordered
– Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality such as accuracy, coverage or size. The first rule satisfying X fires its class prediction; any other rule satisfying X is ignored. Each rule in the list implies the negation of the rules that come before it → difficult to interpret
• What if no rule is fired for X? Use a default rule!
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf; the attribute-value pairs along the path are logically ANDed to form the rule antecedent
• Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
Rules are:
• Mutually exclusive: no two rules will be satisfied for the
same instance
• Exhaustive: there is one rule for each possible attribute-value combination
Using IF-THEN Rules for Classification
If no rule is satisfied by X :
– A default rule can be set up to specify a default class, based
on a training set.
– This may be the class in majority or the majority class of the
instances that were not covered by any rule.
– The default rule is evaluated at the end, if and only if no
other rule covers X.
– The condition in the default rule is empty.
– In this way, the rule fires when no other rule is satisfied.
Rule Extraction from a Decision Tree
• Example: rule extraction from our buys_computer decision tree (root age?, with student? and credit rating? as internal nodes on the young and senior branches). One rule is created per root-to-leaf path:

IF age = young AND student = no                THEN buys_computer = no
IF age = young AND student = yes               THEN buys_computer = yes
IF age = middle_aged                           THEN buys_computer = yes
IF age = senior AND credit_rating = excellent  THEN buys_computer = no
IF age = senior AND credit_rating = fair       THEN buys_computer = yes

Rule Induction: Sequential Covering Method
• Sequential covering algorithm: extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
– Rules are learned one at a time
– Each time a rule is learned, the tuples covered by the rule are removed
– Repeat the process on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
Sequential Covering Algorithm
When learning a rule for a class C, we would like the rule to cover all or most of the training tuples of class C and none or few of the tuples from other classes.

while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

(Figure: each successive rule – Rule 1, Rule 2, Rule 3 – covers a further subset of the positive examples.)
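The covering loop above can be sketched in code. This is my own rough version under simplifying assumptions (categorical attributes stored as dictionaries, rules grown greedily by specialization, and plain rule accuracy as the quality measure rather than FOIL-gain); it is an illustration, not the algorithm used by FOIL/RIPPER.

def matches(antecedent, record):
    # True if the record satisfies every attribute test in the antecedent
    return all(record.get(a) == v for a, v in antecedent.items())

def rule_accuracy(antecedent, target, records, labels):
    covered = [lab for rec, lab in zip(records, labels) if matches(antecedent, rec)]
    return (sum(1 for lab in covered if lab == target) / len(covered)) if covered else 0.0

def grow_rule(records, labels, target):
    # Learn-One-Rule by specialization: greedily append the attribute-value
    # test that most improves the rule's accuracy on the current data
    antecedent = {}
    while True:
        current = rule_accuracy(antecedent, target, records, labels)
        candidates = [dict(antecedent, **{a: v})
                      for a in records[0] if a not in antecedent
                      for v in {r[a] for r in records}]
        if not candidates:
            return antecedent
        best = max(candidates, key=lambda c: rule_accuracy(c, target, records, labels))
        if rule_accuracy(best, target, records, labels) <= current:
            return antecedent
        antecedent = best

def sequential_covering(records, labels, target):
    # While positive tuples of the target class remain, grow one rule and
    # remove the positive tuples that it covers
    rules, remaining = [], list(zip(records, labels))
    while any(lab == target for _, lab in remaining):
        recs = [r for r, _ in remaining]
        labs = [l for _, l in remaining]
        rule = grow_rule(recs, labs, target)
        if not rule:
            break                                  # no useful rule could be grown
        rules.append((rule, target))
        remaining = [(r, l) for r, l in remaining
                     if not (matches(rule, r) and l == target)]
    return rules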
How to Learn-One-Rule?
An initial prototype rule is generated and refined until a certain evaluation criterion is met.
Two approaches:
Specialization:
• Start with the most general rule possible: the empty rule → class y
• The best attribute-value pair from the candidate list A is added to the antecedent
• Continue until the rule performance measure cannot improve further
– IF income = high THEN loan_decision = accept
– IF income = high AND credit_rating = excellent THEN loan_decision = accept
– Greedy algorithm: always add the attribute-value pair that is best at the moment
How to Learn-One-Rule?
Rule-quality measures: used to decide whether appending a test to the rule's condition will result in an improved rule: accuracy, coverage
• Consider R1, which covers 40 tuples and correctly classifies 38 of them, versus R2, which covers 2 tuples and correctly classifies all of them: which rule is better? Accuracy alone is not enough.
• Different measures: FOIL-gain, likelihood ratio statistic, chi-square statistic
How to Learn-One-Rule?
Two approaches:
Generalization:
• Start with a randomly selected positive tuple and convert it to a rule that covers it
– Tuple (overcast, high, false, P) can be converted to the rule:
  IF outlook = overcast AND humidity = high AND windy = false THEN class = P
• Choose one attribute-value pair and remove it so that the rule covers more positive examples
• Repeat the process until the rule starts to cover negative examples
How to Learn-One-Rule?
Rule-quality measures: FOIL-gain checks whether ANDing a new condition results in a better rule
• considers both coverage and accuracy
– FOIL-gain (in FOIL & RIPPER): assesses the information gained by extending the condition of rule R into rule R':

  FOIL_Gain = pos' × ( log2( pos' / (pos' + neg') ) − log2( pos / (pos + neg) ) )

  where pos and neg are the numbers of positively and negatively covered tuples of R, and pos' and neg' are the numbers of positively and negatively covered tuples of R'
• favors rules that have high accuracy and cover many positive tuples
• There is no test set for evaluating rules, but rule pruning is performed by removing a condition from the rule:

  FOIL_Prune(R) = (pos − neg) / (pos + neg)

  where pos and neg are the numbers of positive and negative tuples covered by R.
  If FOIL_Prune is higher for the pruned version of R, prune R.
Advantages of Rule-Based Classifiers
• As highly expressive as decision trees
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees
Nearest Neighbour Approach
• General Idea
– The Model: a set of training examples stored in memory
– Lazy Learning: delaying the decision to the time of
classification. In other words, there is no training!
– To classify an unseen record: compute its proximity to all training examples and locate the 1 or k nearest neighbour examples. The nearest neighbours determine the class of the record (e.g. by majority vote)
– Rationale: “If it walks like a duck, quacks like a duck, and looks
like a duck, it probably is a duck”.
Nearest Neighbour Approach
• kNN Classification Algorithm

algorithm kNN (Tr: training set; k: integer; r: data record): Class
begin
    for each training example t in Tr do
        calculate the proximity d(t, r) on the descriptive attributes
    end for;
    select the top k nearest neighbours into set D;
    Class := majority class in D;
    return Class
end;
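The pseudocode above translates almost directly into Python. This sketch is my own (assuming numeric attributes and Euclidean distance, which are not specified on the slide) and is meant only as an illustration.

from collections import Counter
from math import dist        # Euclidean distance (Python 3.8+)

def knn_classify(training_set, k, record):
    # training_set: list of (attribute_vector, class_label) pairs.
    # Returns the majority class among the k nearest neighbours of record.
    by_distance = sorted(training_set, key=lambda t: dist(t[0], record))
    top_k = [label for _, label in by_distance[:k]]
    return Counter(top_k).most_common(1)[0][0]

# hypothetical usage:
# train = [((3.0, 7.2), "yes"), ((1.5, 2.0), "no"), ((2.9, 6.8), "yes")]
# knn_classify(train, k=1, record=(3.1, 7.0))   # -> "yes"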
Nearest Neighbour Approach
• PEBLS Algorithm (Parallel Exemplar-Based Learning System)
– A class-based similarity measure is used
– A nearest neighbour algorithm (k = 1)
– Examples in memory have weights (exemplars)
– Simple training: assigning and refining weights
– A different proximity measure
• Algorithm outline:
1. Build value difference tables for the descriptive attributes (in preparation for measuring distances between examples)
2. For each training example, refine the weight of its nearest neighbour
3. Refine the weights of some training examples when classifying validation examples
Nearest Neighbour Approach
• PEBLS: Value Difference Table
– For each descriptive attribute, a table holds the distance d(Ai, Aj) between every pair of its values:

  d(V1, V2) = Σ_{i=1..k} | C_i,V1 / C_V1 − C_i,V2 / C_V2 |^r

  where r is set to 1, C_V1 and C_V2 are the total numbers of examples with values V1 and V2, and C_i,V1 and C_i,V2 are the numbers of examples with values V1 and V2 that belong to class i.

• PEBLS: Distance Function

Training data (the weather data set; class P = play, N = don't play):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      FALSE  N
sunny     hot          high      TRUE   N
overcast  hot          high      FALSE  P
rain      mild         high      FALSE  P
rain      cool         normal    FALSE  P
rain      cool         normal    TRUE   N
overcast  cool         normal    TRUE   P
sunny     mild         high      FALSE  N
sunny     cool         normal    FALSE  P
rain      mild         normal    FALSE  P
sunny     mild         normal    TRUE   P
overcast  mild         high      TRUE   P
overcast  hot          normal    FALSE  P
rain      mild         high      TRUE   N

Value Difference Table for Outlook:

          sunny  overcast  rain
sunny     0      1.2       0.4
overcast  1.2    0         0.8
rain      0.4    0.8       0

  d(sunny, overcast) = |2/5 − 4/4| + |3/5 − 0/4| = 1.2

• Distance between two examples X and Y:

  Δ(X, Y) = wX · wY · Σ_{i=1..m} d(x_i, y_i)²

  where wX and wY are the weights of X and Y, m is the number of attributes, and xi, yi are the values of the i-th attribute of X and Y.

• Exemplar weight:

  wX = T / C

  where T is the total number of times X has been selected as the nearest neighbour and C is the total number of times X classified those examples correctly.
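A short sketch of my own (not from the slides) of the value-difference computation with r = 1; running it on the Outlook column of the weather data reproduces d(sunny, overcast) = 1.2.

from collections import Counter

def value_difference(column, labels, v1, v2, r=1):
    # d(V1, V2) = sum_i | C_i,V1 / C_V1  -  C_i,V2 / C_V2 | ** r
    classes = set(labels)
    count1 = Counter(lab for val, lab in zip(column, labels) if val == v1)
    count2 = Counter(lab for val, lab in zip(column, labels) if val == v2)
    n1, n2 = sum(count1.values()), sum(count2.values())
    return sum(abs(count1[c] / n1 - count2[c] / n2) ** r for c in classes)

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
play    = ["N", "N", "P", "P", "P", "N", "P", "N", "P", "P", "P", "P", "P", "N"]
print(value_difference(outlook, play, "sunny", "overcast"))   # 1.2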
Nearest Neighbour Approach
• PEBLS: Distance Function (Example)

(Training data: the 14-row weather data set shown above, ROW# 1–14.)

After training, the value difference tables are:

Outlook:   sunny  overcast  rain        Temperature:  hot   mild  cool
sunny      0      1.2       0.4         hot           0     0.33  0.5
overcast   1.2    0         0.8         mild          0.33  0     0.33
rain       0.4    0.8       0           cool          0.5   0.33  0

Humidity:  high   normal                Windy:        TRUE   FALSE
high       0      0.857                 TRUE          0      0.418
normal     0.857  0                     FALSE         0.418  0

Assuming row1.weight = row2.weight = 1:
  Δ(row1, row2) = d(row1.outlook, row2.outlook)² + d(row1.temperature, row2.temperature)²
                  + d(row1.humidity, row2.humidity)² + d(row1.windy, row2.windy)²
                = d(sunny, sunny)² + d(hot, hot)² + d(high, high)² + d(false, true)²
                = 0 + 0 + 0 + (1/2)² = 1/4

• PEBLS: Example (after training, each exemplar carries a weight)

ROW#  Outlook   Temperature  Humidity  Windy  Class  Weight
1     sunny     hot          high      FALSE  N      2
2     sunny     hot          high      TRUE   N      1
3     overcast  hot          high      FALSE  P      1
4     rain      mild         high      FALSE  P      1.5
5     rain      cool         normal    FALSE  P      1.5
6     rain      cool         normal    TRUE   N      2
7     overcast  cool         normal    TRUE   P      1
8     sunny     mild         high      FALSE  N      2
9     sunny     cool         normal    FALSE  P      1
10    rain      mild         normal    FALSE  P      1
11    sunny     mild         normal    TRUE   P      1
12    overcast  mild         high      TRUE   P      2
13    overcast  hot          normal    FALSE  P      1
14    rain      mild         high      TRUE   N      1

Training (weight refinement):
• Each row is added to the memory space with weight 1
• Every time a new example is added, its distance to each existing example is calculated
• The nearest neighbour is located and its class value is compared with that of the new example
• If they are the same, the weight is tuned down by increasing both the total number of uses and the total number of correct uses by 1
• If they are different, only the total number of uses is increased by 1; validation examples are then used to further modify the weights
Nearest Neighbour Approach
• PEBLS: Classifying an Unseen Record
After training:
– The distance between the unseen record and every exemplar in the memory set is calculated and the nearest neighbour is located
– The class of the nearest neighbour is taken as the class of the record
Example: the unseen record (overcast, hot, high, false) is classified as P, the class of its nearest neighbour (row 3, which matches it exactly).

Artificial Neural Network Approach
– Our brains are made up of about 100 billion
tiny units called neurons.
– Each neuron is connected to thousands of
other neurons and communicates with them
via electrochemical signals.
– Signals coming into the neuron are received
via junctions called synapses, these in turn
are located at the end of branches of the
neuron cell called dendrites.
– The neuron continuously receives signals
from these inputs
– What the neuron does is sum up the inputs to
itself in some way and then, if the end result
is greater than some threshold value, the
neuron fires.
– It generates a voltage and outputs a signal
along something called an axon.
Artificial Neural Network Approach
• General Idea
– The Model: a network of connected artificial neurons
– Training: select a specific network topology and use the training examples to tune the weights attached to the links connecting the neurons
– To classify an unseen record X, feed the descriptive attribute values of the record into the network as inputs; the network computes an output value that can be converted to a class label

Artificial Neural Network Approach
• Artificial Neuron (Perceptron)
– Inputs i1, i2, i3 arrive over links with weights w1, w2, w3 and are combined into a single output y:
  Sum function: x = w1*i1 + w2*i2 + w3*i3
  Transformation (activation) function: Sigmoid(x) = 1 / (1 + e^(−x))

• Network structure
– A neural network can have many hidden layers, but one layer is normally considered sufficient
– The more units a hidden layer has, the more capacity for pattern recognition
– The output layer can have more than one unit, predicting the likelihood of a number of classes
– Constant inputs can be fed into the units in the hidden and output layers as inputs
– A network with links from lower layers to upper layers is a feed-forward network
– A network with links between nodes of the same layer is a recurrent network

Artificial Neural Network Approach
• General Principle for Training an ANN
algorithm trainNetwork (Tr: training set): Network
begin
    R := initial network with a particular topology;
    initialise the weight vector with random values w(0);
    repeat
        for each training example t = <xi, yi> in Tr do
            compute the predicted class output ŷ(k);
            for each weight wj in the weight vector do
                update the weight wj: wj(k+1) := wj(k) + λ (yi − ŷ(k)) xij
            end for;
        end for;
    until stopping criterion is met;
    return R
end;

λ: the learning factor. The larger its value, the bigger the weight changes.
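The weight-update rule above can be sketched for a single sigmoid perceptron. This is my own minimal version (fixed learning factor, bias weight, AND-function toy data as a placeholder), not the course's reference implementation.

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_perceptron(examples, learning_factor=0.1, epochs=100):
    # examples: list of (inputs, target) with target 0 or 1.
    # Repeatedly applies w_j := w_j + lambda * (y - y_hat) * x_j.
    n = len(examples[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]
    bias = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for inputs, target in examples:
            y_hat = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = target - y_hat
            weights = [w + learning_factor * error * x for w, x in zip(weights, inputs)]
            bias += learning_factor * error
    return weights, bias

# hypothetical usage: learn the AND function
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data, learning_factor=0.5, epochs=2000)
print([round(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)) for x, _ in data])
# typically [0, 0, 0, 1] once trained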
Artificial Neural Network Approach
• Network Topology
– # of nodes in the input layer: determined by the number and data types of the attributes
  • Continuous and binary attributes: 1 node for each attribute
  • Categorical attribute: convert to numeric or binary; an attribute with k labels needs at least log k nodes
– # of nodes in the output layer: determined by the number of classes
  • For a 2-class solution, 1 node
  • For a k-class solution, at least log k nodes
– # of hidden layers and nodes in the hidden layers: difficult to decide

Artificial Neural Network Approach
• Using ANN for Classification
– Multiple hidden layers: we do not know the actual class value at a hidden node, hence it is difficult to adjust its weights
  • Solution: back-propagation (updating weights layer by layer, from the output layer backwards)
– Model overfitting: use validation examples to further tune the weights in the network
– Convert input values into numbers before classification; convert the numeric output into a class label after classification
– Descriptive attributes should be normalized or converted to binary
– Training examples are used repeatedly, so the training cost is very high
– Difficulty in explaining classification decisions
Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to consider?
• Use a validation/test set of class-labeled tuples instead of the training set when assessing accuracy
• Methods for estimating a classifier's accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
• Comparing classifiers:
– Confidence intervals
– Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Confusion Matrix
Confusion matrix: shows how well the classifier can recognize tuples of different classes

Actual class \ Predicted class   yes                    no
yes                              True Positives (TP)    False Negatives (FN)
no                               False Positives (FP)   True Negatives (TN)

• TP: positive tuples that were correctly labeled by the classifier
• TN: negative tuples that were correctly labeled by the classifier
• FP: negative tuples that were incorrectly labeled as positive
• FN: positive tuples that were mislabeled as negative
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P   yes   no
yes     TP    FN    P
no      FP    TN    N
        P'    N'    All

Example of a Confusion Matrix:

Actual class \ Predicted class  buy_computer = yes  buy_computer = no  Total
buy_computer = yes              6954                46                 7000
buy_computer = no               412                 2588               3000
Total                           7366                2634               10000

• TP and TN are the correctly predicted tuples
• May have extra rows/columns to provide totals
• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate: misclassification rate = 1 – accuracy, or
  Error rate = (FP + FN) / All
• Class imbalance problem:
– One class may be rare, e.g. fraud, or HIV-positive
– Significant majority of the negative class and minority of the positive class
– Sensitivity: true positive recognition rate; Sensitivity = TP / P
– Specificity: true negative recognition rate; Specificity = TN / N
Classifier Evaluation Metrics: Precision and Recall, and F-measures
• Precision: exactness – what % of the tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)
• Recall: completeness – what % of the positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)
• Perfect score is 1.0
• Inverse relationship between precision and recall
• F measure (F1 or F-score): harmonic mean of precision and recall
• Fβ: weighted measure of precision and recall
– assigns β times as much weight to recall as to precision

Classifier Evaluation Metrics: Example

Actual class \ Predicted class  cancer = yes  cancer = no  Total  Recognition (%)
cancer = yes                    90            210          300    30.00 (sensitivity)
cancer = no                     140           9560         9700   98.56 (specificity)
Total                           230           9770         10000  96.50 (accuracy)

Precision = ??   Recall = ??
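To make the formulas concrete, here is a small sketch of my own (not part of the lecture) that computes the metrics for the cancer confusion matrix above (TP = 90, FN = 210, FP = 140, TN = 9560); it also answers the "Precision = ?? Recall = ??" question.

def metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    p, n = tp + fn, fp + tn
    return {
        "accuracy":    (tp + tn) / total,
        "error_rate":  (fp + fn) / total,
        "sensitivity": tp / p,               # recall on the positive class
        "specificity": tn / n,
        "precision":   tp / (tp + fp),
        "recall":      tp / p,
        "f1":          2 * tp / (2 * tp + fp + fn),
    }

for name, value in metrics(tp=90, fn=210, fp=140, tn=9560).items():
    print(f"{name}: {value:.3f}")
# accuracy 0.965, sensitivity 0.300, specificity 0.986, precision 0.391, recall 0.300, f1 0.340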
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
– The given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random subsampling: a variation of holdout
• Repeat holdout k times; accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
– At the i-th iteration, use Di as the test set and the others as the training set
– Leave-one-out: k folds where k = # of tuples; 1 sample is left out for testing
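As an illustration (not part of the lecture), both the holdout method and 10-fold cross-validation can be run with scikit-learn, assuming it is installed; the iris data set and the decision-tree classifier are placeholders for your own data and model.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout method: 2/3 training, 1/3 test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# 10-fold cross-validation: average accuracy over the 10 folds
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print("10-fold CV accuracy:", scores.mean())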
HW
• Research and learn about:
– Ensemble methods principles
– Bagging
– Boosting
– ROC curves
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
– Examples used for training set can be used for test set too
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
– Bagging, boosting, Ensemble
Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: Rare positive example but numerous
negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of classes
and equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalance data in 2-class classification:
– Oversampling: re-sampling of data from positive class
– Under-sampling: randomly eliminate tuples from negative
class
Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
• The plot also shows a diagonal line: for every TP, we are equally likely to encounter a FP
• A model with perfect accuracy will have an area of 1.0
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class labels
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Comparison of Techniques
• Comparison of Approaches: the classification approaches (Decision Tree, Nearest Neighbours, Rule-based, Artificial Neural Network, Bayesian Classifier) can be compared on the following criteria; the filled-in table is shown below.
– Model interpretability: ease of understanding classification decisions
– Model maintainability: ease of modifying the model in the presence of new training examples
– Training cost: computational cost for building a model
– Classification cost: computational cost for classifying an unseen record
Comparison of Techniques
• Comparison of Approaches

Classification Approach    Model Interpretability  Model Maintainability  Training Cost  Classification Cost
Decision Tree              High                    Low                    High           Low
Nearest Neighbours         Medium                  High                   Low            High
Rule-based                 High                    Low                    High           Medium
Artificial Neural Network  Low                     Low                    High           Low
Bayesian Classifier        Medium                  Medium                 Medium         Low

– Model interpretability: ease of understanding classification decisions
– Model maintainability: ease of modifying the model in the presence of new training examples
– Training cost: computational cost for building a model
– Classification cost: computational cost for classifying an unseen record

Decision Tree Induction in Weka
• Overview
– ID3 (only works for categorical attributes)
– J48 (Java implementation of C4.5)
– RandomTree (with K attributes)
– RandomForest (a forest of random trees)
– REPTree (regression tree with reduced error pruning)
– BFTree (best-first tree, using Gain or Gini)
– FT (functional tree, logistic regression as split nodes)
– SimpleCart (CART with cost-complexity pruning)
Decision Tree Induction in Weka
• Preparation
– Pre-processing attributes if necessary
– Specifying the class attribute
– Selecting attributes

Decision Tree Induction in Weka
• Constructing Classification Models (ID3)
1. Choosing a method and setting parameters
2. Setting a test option
3. Starting the process
4. Viewing the model and evaluation results
5. Selecting the option to view the tree
Decision Tree Induction in Weka
• J48 (unpruned tree)
Decision Tree Induction in Weka
• RandomTree
Decision Tree Induction in Weka
• Classifying Unseen Records
1. Prepare the unseen records in an ARFF file; class values are left as "unknown" ("?")
2. Classify the unseen records in the file:
– Select the supplied-test-set option and click the Set… button
– Press the button and load the file
– Press Start to run the classification
11/18/2015
Decision Tree Induction in Weka
Decision Tree Induction in Weka
• Classifying Unseen Records
3. Saving the classification results into a file:
– Select the option to pop up the visualisation
– Set both X and Y to instance_number
– Save the results into a file
4. Classification results in an ARFF file: class labels are now assigned to the previously unknown values
Comparison of Techniques
• Comparison of Performance in Weka
– A system module known as the Experimenter
– Designed for comparing the performance of classification techniques over a single data set or a collection of data sets
– Data miners set up an experiment with:
• Selected data set(s)
• Selected algorithm(s) and the number of repeated runs
• Selected test option (e.g. cross-validation)
• Selected p value (indicating confidence)

Comparison of Techniques
• Setting up an Experiment in Weka (screenshot annotations): choose a new or existing experiment, name the file to store experiment results, choose the test options, add data sets (the list of data sets selected), add algorithms (the list of selected algorithms), and set the number of times each algorithm is repeated.

Comparison of Techniques
• Experiment Results in Weka
– Load the experiment data, choose the analysis method and the value of significance, and perform the analysis
– Output: accuracy rates of the algorithms
– Pairwise comparison of algorithms, with significantly better and worse accuracy marked out (results of pairwise comparisons)

Classification in Practice
• Process of a Classification Project
1.
2.
3.
4.
5.
Locate data
Prepare data
Choose a classification method
Construct the model and tune the model
Measure its accuracy and go back to step 3 or 4 until the
accuracy is satisfactory
6. Further evaluate the model from other aspects such as
complexity, comprehensibility, etc.
7. Deliver the model and test it in real environment. Further
modify the model if necessary
Classification in Practice
• Data Preparation
– Identify descriptive features (input attributes)
– Identify or define the class
– Determine the sizes of the training, validation and test sets
– Select examples
• Spread and coverage of classes
• Spread and coverage of attribute values
• Null values
• Noisy data
– Prepare the input values (categorical to continuous, continuous to categorical)

References (1)
• C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
• C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
• L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
• P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.
• H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative Frequent Pattern Analysis for Effective Classification. ICDE'07.
• H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct Discriminative Pattern Mining for Effective Classification. ICDE'08.
• W. Cohen. Fast effective rule induction. ICML'95.
• G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05.
References (3)
• T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
• J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
• M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
• S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
• J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
• J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press,
1995
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data
Mining and Knowledge Discovery, 2(2): 121-168, 1998
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. KDD'95
H. Cheng, X. Yan, J. Han, and C.-W. Hsu, Discriminative Frequent Pattern Analysis for
Effective Classification, ICDE'07
H. Cheng, X. Yan, J. Han, and P. S. Yu, Direct Discriminative Pattern Mining for
Effective Classification, ICDE'08
W. Cohen. Fast effective rule induction. ICML'95
G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for
gene expression data. SIGMOD'05
References (4)
• R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98.
• J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
• J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
• P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
• S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, 2005.
• X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
• H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.