Data Mining – Classification

Keith E Emmert
Tarleton State University
[email protected]

October 23, 2013

Topics: Sub-setting Data; Classification; Decision Trees; Bayes Classification; Support Vector Machines; Receiver Operating Characteristics (ROC) and Precision-Recall Curves; k-Nearest Neighbors
Sub-Setting Data
One must create training and test sets from a data set or data frame.
The following code will perform that task, leaving two-thirds of the
data for the training set and one third for the test set. Let’s look at
how to split the iris data set.
> set.seed(100)
> index <- 1:nrow(iris)
> trainindex <- sample(index, trunc(2*length(index)/3))
> trainSet <- iris[trainindex, ]
> testSet <- iris[-trainindex, ]
For a much fancier way of performing splits (and much more),
consider the caret package in R.
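For example, a minimal sketch with caret (assuming the package is installed) could use createDataPartition, which performs a stratified split that preserves the class proportions of Species:

> library(caret)
> set.seed(100)
> inTrain <- createDataPartition(iris$Species, p = 2/3, list = FALSE)
> trainSet <- iris[inTrain, ]
> testSet <- iris[-inTrain, ]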
Classification
Definition

Classification is the task of learning (training) a target function f that maps each attribute set x to one of the predefined class labels y. The target function is also known as a classification model.
A descriptive model is a classification model that distinguishes between objects of different classes. Note that the class labels are known.
Classification
Examples
Decision Trees
Rule-Based Classifiers
Neural Networks
Support Vector Machines
Bayes Classifiers
k-Nearest Neighbors
Classification
General Process
Learn (train) the classifier with a training set of data (with known labels).
Evaluate the classifier with a test set (with known labels).
Predictive models: a classification model can be used to predict the class label of unknown records.
Learning (training) the model is called induction.
Deduction of class labels occurs when the model is applied to a data set.
Classification
Evaluation – The Confusion Matrix or 2 × 2 Contingency Table
                     Predicted Class = 1   Predicted Class = 0
Actual Class = 1            f11                   f10
Actual Class = 0            f01                   f00

fij is the number of records from class i predicted to be of class j. Thus when i = j we have a correct prediction.
Accuracy is a performance metric defined as

Accuracy = Correct Predictions / Total Predictions = (f11 + f00) / (f00 + f01 + f10 + f11).

Error rate is a performance metric defined as

Error rate = Wrong Predictions / Total Predictions = (f01 + f10) / (f00 + f01 + f10 + f11).
Often training a model consists of minimizing the error rate or
maximizing the accuracy using a test set.
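As a small sketch (the counts below are made up for illustration), both metrics can be read off a confusion matrix in R:

cm <- matrix(c(40, 10,    # f11, f10
                5, 45),   # f01, f00
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("1", "0"), predicted = c("1", "0")))
accuracy  <- sum(diag(cm)) / sum(cm)   # (f11 + f00) / total
errorRate <- 1 - accuracy              # (f01 + f10) / total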
Decision Trees
The Basics
Definition
A decision tree is a (directed) graph consisting of
a root node with no incoming edges and zero or more outgoing edges
internal nodes with exactly one incoming edge and two or more outgoing edges
leaf nodes with exactly one incoming edge and no outgoing edges.

Each leaf node is assigned exactly one class label.
The root and internal nodes are assigned attribute test conditions to separate records.
A decision tree consisting of only the root node (which is then a leaf) is an indication that the training process has failed spectacularly.
To classify an unknown record, follow nodes to a leaf.
Decision Trees
Example
Example
Fisher's Iris data is used, along with a training set and R's party package.
This data set consists of the measurements petal width, petal length, sepal width, and sepal length.
Three species are recorded: setosa, versicolor, and virginica.
A decision tree to predict species based upon these measurements is constructed.
A split variable (petal width, petal length, sepal width, sepal length) and its p-value are used to reject the null hypothesis of independence of the variable with Species.
Note that sepal values are not needed!
Decision Trees
Example – Conditional Inference Trees
Root and internal nodes list the split variable. H0: independence between the input variables and the response variable. Smallest p-value determines the split.

[Tree plot]
Node 1: split on Petal.Length, p < 0.001.
  Petal.Length ≤ 1.9 → Node 2: n = 35, y = (1, 0, 0).
  Petal.Length > 1.9 → Node 3: split on Petal.Width, p < 0.001.
    Petal.Width ≤ 1.7 → Node 4: n = 35, y = (0, 0.971, 0.029).
    Petal.Width > 1.7 → Node 5: n = 39, y = (0, 0.026, 0.974).
n is the number classified
(correctly or incorrectly)
into this node.
Probabilities are listed as
y = (a, b, c), where a for
setosa, b for versicolor, c
for virginica.
Node 2 is perfect at
classifying setosa. Nodes
4 & 5 have issues.
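A minimal sketch of how such a tree can be grown with the party package (assuming trainSet is the iris training set built earlier; the exact node statistics depend on the training split):

library(party)
irisCtree <- ctree(Species ~ ., data = trainSet)
plot(irisCtree)                                 # draws the conditional inference tree
table(predict(irisCtree), trainSet$Species)     # training confusion matrix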
Decision Trees
Building Decision Trees – Hunt’s Algorithm
The Set-Up
Grown in a recursive fashion by partitioning the training records into successively "purer" subsets.
Dt is the set of training records that are associated with node t.
y = {y1, y2, . . . , yc} are the class labels.

Hunt's Algorithm
1. If all records in Dt belong to the same class yt, then t is a leaf node with label yt.
2. If Dt contains records that belong to more than one class, then
   1. an attribute test condition is selected to partition Dt into smaller subsets.
   2. a child node is created for each outcome of the test condition.
   3. the records in Dt are distributed to the children based upon the outcomes.
   4. the algorithm (steps 1 & 2) is (recursively) applied to each child node.
Decision Trees
Building Decision Trees – Problems with Hunt’s Algorithm
It is possible to generate an empty leaf with a training sample if none of the training records have the combination of attribute values associated with this node. For this case, a leaf is created with the same class label as the majority class of training records in its parent node.

Home Owner   Married   Income    Predict: Loan Default
Yes          Yes       $50,000   No
Yes          Yes       $25,000   Yes
No           Yes       $35,000   Yes

Split: Married?
Problem: An empty node for Married = No.
Fix: Current node is a leaf; Default is "Yes".
Decision Trees
Building Decision Trees – Problems with Hunt’s Algorithm
If all records associated with Dt have the same attributes (except the class label), then no split is possible. Then, declare the node a leaf with the same class label as the majority class of training records associated with this node.

Home Owner   Married   Income    Predict: Loan Default
Yes          Yes       $50,000   No
Yes          Yes       $50,000   Yes
Yes          Yes       $50,000   Yes

Problem: No attribute allows a split!
Fix: Current node is a leaf; Default is "Yes".
Decision Trees
Splitting Notes
Binary data is obvious.
Ordinal data - the first problem child.
Small, Medium, and Large could generate three categories (obviously) or two, say Small & Medium vs Large.
Don't group beginning and ending (i.e. Small & Large vs. Medium).
Numeric - the second problem child.
It can be split by a single number (< 10 vs ≥ 10).
By bins/ranges [10, 20), [20, 30), etc., and then treated as ordinal data.
Picking the numeric cut-off/best bin or deciding how to group ordinal data is "interesting."
Decision Trees
Decision Tree Example
Example
Apply Hunt's Algorithm to the following data. Split order: Home Owner, Marital Status (married vs not married), Annual Income ≥ 100K. (Table is changed slightly from book.)

Tid   Home Owner (Binary)   Married (Categorical)   Income (Continuous)   Defaulted (Class)
1     Yes                   Single                  125K                  Yes
2     No                    Married                 100K                  No
3     No                    Single                  70K                   No
4     Yes                   Married                 120K                  No
5     No                    Divorced                95K                   Yes
6     No                    Married                 60K                   No
7     Yes                   Divorced                220K                  No
8     No                    Single                  85K                   Yes
9     No                    Married                 75K                   No
10    No                    Single                  90K                   Yes
Decision Trees
Impurity Measures
Definition
A partition is impure if it contains a collection of tuples from different classes rather than from a single class. Otherwise it is pure.

p(i | t) is the fraction of records belonging to class i at a given node t. When t is fixed or clear, define pi = p(i | t).
c is the number of classes in node t, labeled 0, 1, . . . , c − 1.
Using the above, the following impurity measures are defined:

Entropy(t) = − Σ_{i=0}^{c−1} p(i | t) log2[p(i | t)]

Gini(t) = 1 − Σ_{i=0}^{c−1} [p(i | t)]²

Classification Error(t) = 1 − max_i [p(i | t)].
For empty nodes, assume that entropy, Gini, and classification error
are all zero.
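A minimal sketch of the three measures in R, for a vector of class proportions p = p(i | t) at a node (entries sum to 1; terms with p = 0 contribute 0 to the entropy):

entropy  <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
gini     <- function(p) 1 - sum(p^2)
classErr <- function(p) 1 - max(p)

p <- c(4/6, 2/6)              # e.g. a node with 4 "Yes" and 2 "No" records
entropy(p); gini(p); classErr(p)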
Decision Trees
Gain
Definition
The gain, ∆, is defined to be

∆ = I(parent) − Σ_{j=1}^{k} (N(vj)/N) · I(vj),

where the first term is the parent impurity, the sum is the weighted average of an impurity index over the proposed child nodes, and
I(•) is any (fixed) impurity measure (i.e. Entropy, Gini, etc.).
N is the total number of records at the parent node.
N(vj) is the number of records at a proposed child node vj.
k is the number of attribute values or proposed child nodes.
Decision Trees
Gain
Remark
Gain compares the degree of impurity of the parent node before splitting to the degree of impurity of each of the child nodes after splitting.
Note that maximizing Gain, ∆, is equivalent to minimizing the weighted average of an impurity index of the proposed child nodes,

Σ_{j=1}^{k} (N(vj)/N) · I(vj),

since I(parent) is constant.
When I (•) = Entropy(•), then ∆info ≡ ∆ is called the
information gain.
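A short sketch of the gain of a proposed split, using Gini as the impurity measure; parent is the vector of class counts at the parent node and children is a list of class-count vectors, one per proposed child node:

gini <- function(p) 1 - sum(p^2)
gain <- function(parent, children, impurity = gini) {
  N  <- sum(parent)
  wt <- sum(sapply(children, function(v) sum(v) / N * impurity(v / sum(v))))
  impurity(parent / N) - wt
}
# Single vs Divorced vs Married from the default data (counts are Yes, No):
gain(c(4, 6), list(c(3, 1), c(1, 1), c(0, 4)))   # 0.48 - 0.25 = 0.23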
Decision Trees
Splitting Nodes
Example
Use the following table on the next few slides to determine the best split. (Table is changed slightly from book.)

Tid   Home Owner (Binary)   Married (Categorical)   Income (Continuous)   Defaulted (Class)
1     Yes                   Single                  125K                  Yes
2     No                    Married                 100K                  No
3     No                    Single                  70K                   No
4     Yes                   Married                 120K                  No
5     No                    Divorced                95K                   Yes
6     No                    Married                 60K                   No
7     Yes                   Divorced                220K                  No
8     No                    Single                  85K                   Yes
9     No                    Married                 75K                   No
10    No                    Single                  90K                   Yes
Decision Trees
Splitting Categorical Data
Here we investigate the root split on Married by determining whether Single/Divorced vs Married or Single vs Divorced vs Married is better.

                  Single/Divorced   Married
Defaulted = Yes          4             0
Defaulted = No           2             4

Gini(S&D) = 1 − Σ_{i=Yes,No} [p(i | t)]² = 1 − (4/6)² − (2/6)² = 4/9
Gini(M) = 1 − (0/4)² − (4/4)² = 0
Wt Gini = Σ_{j=S&D,M} (N(vj)/N) Gini(vj) = (6/10)(4/9) + (4/10)(0) = 4/15
Decision Trees
Splitting Categorical Data - Continued
                  Single   Divorced   Married
Defaulted = Yes      3         1         0
Defaulted = No       1         1         4

Gini(S) = 1 − Σ_{i=Yes,No} [p(i | t)]² = 1 − (3/4)² − (1/4)² = 3/8
Gini(D) = 1 − (1/2)² − (1/2)² = 1/2
Gini(M) = 1 − (0/4)² − (4/4)² = 0
Wt Gini = Σ_{j=S,D,M} (N(vj)/N) Gini(vj) = (4/10)(3/8) + (2/10)(1/2) + (4/10)(0) = 1/4
Decision Trees
Splitting Categorical Data - Continued
Wt Gini(S&D, M) = 4/15 ≈ 0.2667
Wt Gini(S, D, M) = 1/4 = 0.25
So, a better split when considering only Married would be 3 categories in a new decision tree.
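A minimal sketch reproducing the two weighted Gini values above (each child is a vector of Yes, No counts):

gini   <- function(p) 1 - sum(p^2)
wtGini <- function(children) {
  N <- sum(unlist(children))
  sum(sapply(children, function(v) sum(v) / N * gini(v / sum(v))))
}
wtGini(list(c(4, 2), c(0, 4)))            # Single/Divorced vs Married: 4/15 ~ 0.2667
wtGini(list(c(3, 1), c(1, 1), c(0, 4)))   # Single vs Divorced vs Married: 1/4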
Decision Trees
Splitting Continuous Data
The following table is based upon making a new decision tree using Income from the above table. Splits are the midpoints of intervals; e.g., [60K, 70K) has midpoint 65K.

Class              No      No      No      Yes     Yes
Income             60K     70K     75K     85K     90K
Split              55K       65K       72K       80K       87K
                   ≤   >    ≤   >    ≤   >    ≤   >    ≤   >
Yes                0   4    0   4    0   4    0   4    1   3
No                 0   6    1   5    2   4    3   3    3   3
Wt. Gini           0.480    0.444    0.400    0.343    0.450

Class              Yes     No      No      No      No
Income             95K     100K    120K    125K    220K
Split              92K       97K       110K      122K      172K      230K
                   ≤   >    ≤   >    ≤   >    ≤   >    ≤   >    ≤   >
Yes                2   2    3   1    3   1    3   1    4   0    4   0
No                 3   3    3   3    4   2    5   1    5   1    6   0
Wt. Gini           0.480    0.450    0.476    0.475    0.444    0.480
Decision Trees
Splitting Continuous Data – Continued
Continuous data can also be split into bins and treated as nominal (categorical) data. For example,

Income            I: < 85K   II: [85K, 100K)   III: ≥ 100K
Defaulted = Yes      0              2                1
Defaulted = No       3              1                3
Gini                 0             4/9              3/8

Wt Gini (I vs II vs III) = (3/10)(0) + (3/10)(4/9) + (4/10)(3/8) = 0.2833
Wt Gini (I & II vs III) = (6/10)(4/9) + (4/10)(3/8) = 0.4167
Wt Gini (I vs II & III) = (3/10)(0) + (7/10)(24/49) = 0.3429

So, the three categories appear to be better. Of course, there are many ways to build such categories....
Decision Trees
Final Thoughts About the First Node
Recall:
Using Married
  Wt Gini for S&D vs M: 0.2667
  Wt Gini for S vs D vs M: 0.25
Using Income
  Wt Gini for Single Split ≤ 80K vs > 80K: 0.343
  Wt Gini for Categories I vs II vs III: 0.2833
  Wt Gini for Categories I & II vs III: 0.4167
  Wt Gini for Categories I vs II & III: 0.3429
For completeness, Wt Gini for Home Owner: 0.4190

Conclusion: The first node should be split using the Married variable and the three categories: S vs D vs M.
Next Steps: For each of the three nodes, S, D, and M, determine the optimal split using the remaining attributes: Income and Home Owner.
Decision Trees
Characteristics
This is a non-parametric approach – no underlying assumptions for the distribution are made!
Building a decision tree is easy – finding the optimal one is an NP-complete problem.
Classification time is linear – based on the height of the tree.
Sub-trees can be replicated throughout the decision tree, making it more complicated than necessary.
The decision boundaries are rectilinear – non-rectilinear boundaries in the data need other methods.
The choice of impurity measure (e.g. Gini) has little impact on the tree.
Decision Trees
Types of Errors
Training Error is the number of misclassification errors committed on training records.
Generalization Error is the expected error of the model on previously unseen records.
Training error – from training data – is a (possibly poor) estimate of the generalization error.
Test error – from test data – is a (possibly better) estimate of the generalization error since we're using the trained model on previously unseen data.
There are bounds on the generalization error which are implementation specific.
Cross-validation, random sub-sampling, and bootstrap will improve your estimates for the generalization error.

Goal: Low Training and Low Generalization Errors.
Decision Trees
Types of Errors – Estimating Generalization Error: Random Sub-Sampling
Split data set randomly into 70% Training and 30% Testing.
Train the model.
Test the Model and compute the Test Error.
Repeat k times.
Average all of the Test Errors.
Decision Trees
Types of Errors – Estimating Generalization Error: Cross-Validation
Split data set into k disjoint groups, C1 , . . . , Ck , of equal size.
Train the model using C1 , . . . , Ci−1 , Ci+1 , . . . , Ck .
Test the Model using Ci and compute the Test Error.
Repeat i = 1, 2, . . . , k times.
Average all of the Test Errors.
If k is the size of the data set, this is called Leave-one-Out
Cross-Validation.
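A small sketch of k-fold cross-validation with an rpart tree, assuming a data frame dat with predictors x, y and a factor column class (as in the 400-pair example later):

library(rpart)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))
cvErr <- sapply(1:k, function(i) {
  fit  <- rpart(class ~ x + y, data = dat[folds != i, ])
  pred <- predict(fit, newdata = dat[folds == i, ], type = "class")
  mean(pred != dat$class[folds == i])        # test error on fold i
})
mean(cvErr)   # cross-validation estimate of the generalization error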
Decision Trees
Types of Errors – Estimating Generalization Error: Bootstrap
Generate a training set of size N by sampling the data with replacement. The rest is the test set.
Train the model.
Use the test set to calculate the test error.
Repeat k times.
Average all of the Test Errors.
Approximately 63.2% of the records are sampled: the probability a record is chosen is

1 − (1 − 1/N)^N → 1 − e^(−1) ≈ 0.632 as N → ∞.
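A quick empirical check of that limit in R, counting the fraction of distinct records appearing in a bootstrap sample of size N:

N <- 10000
mean(replicate(50, length(unique(sample(N, N, replace = TRUE))) / N))
# close to 1 - exp(-1) = 0.632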
Decision Trees
Types of Errors – Example, 400 Ordered Pairs
[Scatter plot of the 400 ordered pairs (x, y), roughly in the range −2 to 8 on both axes, with the Positive and Negative classes marked by different symbols.]
Decision Trees
Training Errors – Example – A (Default) Decision Tree using a Training Set of 300
Pairs, Leaving 100 Pairs for Testing
[Plot of the default rpart tree: the root splits on x >= 0.7277, with further splits on y >= 1.496, x < 4.496, y >= 1.049, y < 2.63, y < 5.506, and x >= 1.814; the leaves are labeled −1 or 1.]
Decision Trees
Example – Training and Test Error for the Default Tree
The Training (or Test) Error is given by Incorrect / Total.

Training confusion matrix (myPosNegTrainError):
labelTrain    -1     1
        -1   123    28
         1    16   133

Test confusion matrix (myPosNegTestError):
labelTest     -1     1
        -1    35    14
         1     6    45

Training Error is 0.146667. Test Error is 0.2.
Decision Trees
Example – Training Error for the Default Tree - Visual
Positive class is denoted with black and circles.
Negative class is denoted with red and triangles.
Misclassified items are black/triangles and red/circles.

[Scatter plot of the 300 training pairs, with predicted class shown by color and actual class by symbol.]
Decision Trees
Example – Test Error for the Default Tree - Visual
Positive class is denoted with black and circles.
Negative class is denoted with red and triangles.
Misclassified items are black/triangles and red/circles.

[Scatter plot of the 100 test pairs, with predicted class shown by color and actual class by symbol.]
Decision Trees
Example – Complexity Parameter, CP for RPART
The Root Node Training Error is 149/300 ≈ 0.496667.
A split that does not decrease the overall lack of fit by a factor of CP is not attempted.
nsplit is the number of splits (0 for none, etc.).
rel error is the resubstitution (training) error, scaled so that the root node has error 1.
xerror is the test error using cross-validation (10-fold by default).
xstd is the standard error using (10-fold) cross-validation.

   CP          nsplit  rel error  xerror     xstd
1  0.28859060    0     1.0000000  1.1744966  0.05730957
2  0.16778523    1     0.7114094  0.9463087  0.05801780
3  0.08053691    2     0.5436242  0.6711409  0.05479843
4  0.06711409    3     0.4630872  0.5771812  0.05256652
5  0.04026846    4     0.3959732  0.5234899  0.05098905
6  0.02013423    6     0.3154362  0.4429530  0.04815407
7  0.01000000    7     0.2953020  0.4228188  0.04734755
Decision Trees
Example – Picking the “Best”
   CP          nsplit  rel error  xerror     xstd
1  0.28859060    0     1.0000000  1.1744966  0.05730957
2  0.16778523    1     0.7114094  0.9463087  0.05801780
3  0.08053691    2     0.5436242  0.6711409  0.05479843
4  0.06711409    3     0.4630872  0.5771812  0.05256652
5  0.04026846    4     0.3959732  0.5234899  0.05098905
6  0.02013423    6     0.3154362  0.4429530  0.04815407
7  0.01000000    7     0.2953020  0.4228188  0.04734755
Use the “best tree” – lowest cross-validation error: That would
be the tree in row 7.
Use the “smallest tree” – within one standard error of the best
tree.
0.4228188 + 0.04734755 = 0.47016635, so choose the one in
row 6 because it is simpler and within 1 SE of the “best” tree.
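A sketch of the same selection in R, assuming fit is the rpart model trained on the 300 training pairs:

printcp(fit)                                 # the CP table above
tab    <- fit$cptable
best   <- which.min(tab[, "xerror"])
thresh <- tab[best, "xerror"] + tab[best, "xstd"]      # 1-SE cutoff
cpPick <- tab[min(which(tab[, "xerror"] <= thresh)), "CP"]
pruned <- prune(fit, cp = cpPick)            # the simpler tree within 1 SE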
Decision Trees
Example – Picking the “Best” Visually
The horizontal line is 1 SE above the minimum of the curve.
The "ideal" tree lies below this line...so choose cp so that it generates an appropriate tree.

[plotcp output: X-val Relative Error versus cp, with the corresponding tree sizes (1 through 8) on the top axis and the 1-SE reference line drawn horizontally.]
Decision Trees
Remark
plotcp and printcp generate cp values using different algorithms!
Decision Trees
Types of Errors – Under-fitting and Over-fitting
Under-fitting occurs at the beginning when both training and generalization errors are high. (Model has not learned much of the structure of the data.)
Over-fitting occurs when a model fits the training data too well and has a poorer generalization error than a model with a higher training error rate. (The model understands the training data too well – which could have "noise" in it leading to higher generalization errors.)
Decision Trees
Example – Under-fitting: cp = 0.25 and minimum split is 2
[Plot of the under-fit tree: a single split on x >= 0.7277, with the two leaves labeled −1 and 1.]
Decision Trees
Example – Training and Test Error for the Under-fit Tree
The Training (or Test) Error is given by Incorrect / Total.

Training confusion matrix (myPosNegTrainError3):
labelTrain    -1     1
        -1   137    14
         1    92    57

Test confusion matrix (myPosNegTestError3):
labelTest     -1     1
        -1    43     6
         1    34    17

Training Error is 0.353333. Test Error is 0.4.
Decision Trees
Over-fitting Causes
Noise – Training data is mislabeled.
Lack of representative samples – training data has too few records of one or more categories.
Multiple Comparison Procedure
Decision Trees
Example – Over-fitting: cp = 0 and minimum split is 2
[Plot of the fully grown tree (cp = 0, minimum split 2): dozens of splits on x and y, far too complex to read.]
Decision Trees
Example – Training and Test Error for the Over-fit Tree
The Training (or Test) Error is given by Incorrect / Total.

Training confusion matrix (myPosNegTrainError2):
labelTrain    -1     1
        -1   151     0
         1     0   149

Test confusion matrix (myPosNegTestError2):
labelTest     -1     1
        -1    33    16
         1    11    40

Training Error is 0. Test Error is 0.27.
Decision Trees
Correcting Over-fitting
Prepruning
Stop growing the tree when the gain in impurity (or lessening of training error) is minimal. CP is an example of this.
Hard to find the best threshold.

Post-Pruning
Grow the tree to maximum size.
Trimming can occur by replacing a subtree with
  a new leaf node with class label based upon the majority of the records, or
  the most frequently used branch of the subtree.
Stops when no further improvement is observed.
Decision Trees
Correcting Over-fitting: Example - Over-fit, cp = 0, minimum split is 2
Example
Recall, the poorly generated tree:

[Plot of the fully grown tree from the previous slides, far too complex to read.]
Decision Trees
Correcting Over-fitting: Example - Over-fit, cp = 0, minimum split is 2
Example
Now, we determine cp:

[plotcp output for the over-fit tree: X-val Relative Error versus cp, with tree sizes on the top axis and the 1-SE reference line.]
Decision Trees
Correcting Over-fitting: Example - Over-fit, cp = 0, minimum split is 2
Example
Now, we can prune the tree: much more readable!

[Plot of the pruned tree: splits on x >= 0.7277, y >= 1.496, x < 4.496, y >= 1.049, y < 2.63, and x >= 1.814, with leaves labeled −1 or 1.]
Decision Trees
Example – Training and Test Error for the Pruned Tree
The Pruned Tree
Training confusion matrix (prunedTrainError2):
labelTrain    -1     1
        -1   131    20
         1    27   122

Test confusion matrix (prunedTestError2):
labelTest     -1     1
        -1    37    12
         1     9    42

Training Error is 0.156667. Test Error is 0.21.

The Over-fit Tree
Training confusion matrix (myPosNegTrainError2):
labelTrain    -1     1
        -1   151     0
         1     0   149

Test confusion matrix (myPosNegTestError2):
labelTest     -1     1
        -1    33    16
         1    11    40

Training Error is 0. Test Error is 0.27.
Bayes Classification
Bayesian vs Non-Bayesian
Remark
Non-Bayesian Modeling: Pr(X | θ).
Bayesian Modeling: Pr(X, θ) = Pr(θ) Pr(X | θ).
Bayes Classification
Conditional Probability

Remark
For random variables X and Y, with joint probability mass function Pr(X, Y) = Pr(X = x, Y = y), we have

Pr(X, Y) = Pr(X | Y) Pr(Y).

In particular, for Pr(Y) ≠ 0,

Pr(X | Y) = Pr(X, Y) / Pr(Y).
Bayes Classification
Law of Total Probability
Theorem (Law of Total Probability)
If {Y1, . . . , Yk} is the set of mutually exclusive and exhaustive outcomes of Y, then

Pr(X) = Σ_{i=1}^{k} Pr(X | Yi) Pr(Yi).

Remark
Suppose X is an attribute set and Y is the class variable.
Pr(X | Y) is the class conditional probability.
Pr(Y) is the prior or initial degree of belief in Y.
Pr(Y | X) is the posterior or the degree of belief having accounted for X.
Bayes Classification
Bayes Theorem
Theorem (Bayes Theorem)
If {Y1, . . . , Yk} is the set of mutually exclusive and exhaustive outcomes of Y, then

Pr(Y | X) = Pr(X | Y) Pr(Y) / Pr(X) = Pr(X | Y) Pr(Y) / Σ_{i=1}^{k} Pr(X | Yi) Pr(Yi).
Bayes Classification
Goal
Given an unknown record X = (x1, . . . , xn), classify it as one of y1, . . . , yk by computing

Pr(Y = yi | X = (x1, . . . , xn)),   i = 1, . . . , k.

The class label for X is the yi that maximizes the conditional probability.
Hence, we need
Pr(Y): estimated from the training set as the fraction of training records in that class.
Pr(X | Y): harder to find. Use a Naive Bayes Classifier or a Bayesian Belief Network.
Bayes Classification
Naive Bayes Classification
Remark
We will assume that the attributes, X = (X1, . . . , Xn), are conditionally independent given class label Y = y, that is,

Pr(X | Y = y) = Π_{j=1}^{n} Pr(Xj | Y = y).

Hence, by Bayes Theorem, for 1 ≤ i ≤ k and fixed X = (X1 = x1, . . . , Xn = xn),

Pr(Y = yi | X) = Pr(X | Y = yi) Pr(Y = yi) / Pr(X)
               = Pr(Y = yi) Π_{j=1}^{n} Pr(Xj = xj | Y = yi) / Pr(X).
Bayes Classification
Naive Bayes Classification – Goal
Find yi which maximizes Pr(Y = yi | X) given a fixed X. Since

Pr(Y = yi | X) = Pr(Y = yi) Π_{j=1}^{n} Pr(Xj = xj | Y = yi) / Pr(X),

we only need to maximize the numerator,

Pr(Y = yi) Π_{j=1}^{n} Pr(Xj = xj | Y = yi).
Naive Bayes Classification – Handling Categorical
& Binary Attributes
Note that Pr(Xj = xj | Y = yi) represents the fraction of training instances of class yi that take on the particular class attribute xj.

Example
Suppose we wish to classify X = (Red, Domestic, SUV) given the following data set as stolen or not.

No.   Color    Type     Origin     Stolen?
1     Red      Sports   Domestic   Yes
2     Red      Sports   Domestic   No
3     Red      Sports   Domestic   Yes
4     Yellow   Sports   Domestic   No
5     Yellow   Sports   Imported   Yes
6     Yellow   SUV      Imported   No
7     Yellow   SUV      Imported   Yes
8     Yellow   SUV      Domestic   No
9     Red      SUV      Imported   No
10    Red      Sports   Imported   Yes
Bayes Classification
Naive Bayes Classification - Categorical Example Continued
Example
From the table, with 5 stolen and 5 not-stolen records, the class-conditional fractions are

Pr(Red | Yes) = 3/5          Pr(Red | No) = 2/5
Pr(SUV | Yes) = 1/5          Pr(SUV | No) = 3/5
Pr(Domestic | Yes) = 2/5     Pr(Domestic | No) = 3/5
Pr(Yes) = 5/10               Pr(No) = 5/10

So, we maximize Pr(Y = yi) Π_{j=1}^{n} Pr(Xj = xj | Y = yi):

Pr(Yes) Pr(Red | Yes) Pr(SUV | Yes) Pr(Domestic | Yes) = (5/10)(3/5)(1/5)(2/5) = 0.024
Pr(No) Pr(Red | No) Pr(SUV | No) Pr(Domestic | No) = (5/10)(2/5)(3/5)(3/5) = 0.072

X is assigned the label "No" since the score for No is larger than the score for Yes.
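A small sketch of the same computation in R, assuming a (hypothetical) data frame theft holding the ten records above with columns Color, Type, Origin and class column Stolen:

score <- function(cls) {
  sub <- theft[theft$Stolen == cls, ]
  mean(theft$Stolen == cls) *           # prior Pr(Y = cls)
    mean(sub$Color  == "Red") *         # Pr(Red | cls)
    mean(sub$Type   == "SUV") *         # Pr(SUV | cls)
    mean(sub$Origin == "Domestic")      # Pr(Domestic | cls)
}
score("Yes"); score("No")   # classify X with the label giving the larger score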
Naive Bayes Classification - Continuous Attributes
For continuous attributes, you have a choice:
Discretize Xj by using "appropriate" intervals or bins. Pr(Xj = xj | Y = yi) is then the fraction of training records belonging to class yi that falls within the interval containing xj. Too many or too few bins can cause issues.
Assume a distribution and estimate the parameters of the distribution using training data.
Naive Bayes Classification - Continuous Attributes
Continued
Remark
For a density f(X; θ) with parameter vector θ, if ε > 0, then

Pr(xj ≤ Xj ≤ xj + ε) = ∫_{xj}^{xj+ε} f(Xj; θ) dXj ≈ ε f(xj; θ).

Hence

Pr(Y = yi | X) = Pr(Y = yi) Π_{j=1}^{n} Pr(Xj = xj | Y = yi) / Pr(X)
              ≈ ε^n Pr(Y = yi) Π_{j=1}^{n} f(xj; θ) / Pr(X).

Since ε^n and Pr(X) are the same for every class, we seek to maximize Pr(Y = yi) Π_{j=1}^{n} f(xj; θ).
Bayes Classification
Naive Bayes Classification - Example
Example
Use the following table on the next few slides to classify X = (Home Owner = No, Marital Status = Married, Income = $120K). (Table is changed slightly from book.)

Tid   Home Owner (Binary)   Married (Categorical)   Income (Continuous)   Defaulted (Class)
1     Yes                   Single                  125K                  Yes
2     No                    Married                 100K                  No
3     No                    Single                  70K                   No
4     Yes                   Married                 120K                  No
5     No                    Divorced                95K                   Yes
6     No                    Married                 60K                   No
7     Yes                   Divorced                220K                  No
8     No                    Single                  85K                   Yes
9     No                    Married                 75K                   No
10    No                    Single                  90K                   Yes
Bayes Classification
Naive Bayes Classification - Example Continued
Example
Assume a Gaussian f(x; x̄, s) = (1/(√(2π) s)) exp(−(x − x̄)²/(2s²)) for Annual Income.

x̄_No = (100 + 70 + 120 + 60 + 220 + 75)/6 = 107.5
s_No = sqrt(Σ(xi − x̄_No)²/(n − 1)) = 59.31
Pr(Income = 120K | No) ≈ f(120; 107.5, 59.31) = 0.0507

x̄_Yes = 98.75
s_Yes = 17.97
Pr(Income = 120K | Yes) ≈ f(120; 98.75, 17.97) = 0.0468

Priors: Pr(No) = 6/10 = 3/5 and Pr(Yes) = 4/10 = 2/5.
Bayes Classification
Naive Bayes Classification - Example Continued
The class conditional probabilities can now be easily computed:

Pr(X | No) = Π_{j=1}^{3} Pr(Xj = xj | No)
           = Pr(Home Owner = No | No) × Pr(Marital Status = Married | No) × Pr(Income = 120K | No)
           = (4/6)(4/6)(0.0507) = 0.0225

Pr(X | Yes) = Π_{j=1}^{3} Pr(Xj = xj | Yes)
            = (3/4)(0/4)(0.0468) = 0
Bayes Classification
Naive Bayes Classification - Example Continued
Recall, the goal is to find yi which maximizes Pr(Y = yi | X) given a fixed X. Since

Pr(Y = yi | X) = Pr(Y = yi) Π_{j=1}^{n} Pr(Xj = xj | Y = yi) / Pr(X),

we only need to maximize the numerator,

Pr(Y = yi) Π_{j=1}^{n} Pr(Xj = xj | Y = yi).

Hence, we have

Pr(Y = No) · Π_{j=1}^{3} Pr(Xj = xj | No) = (6/10)(0.0225) = 0.0135
Pr(Y = Yes) · Π_{j=1}^{3} Pr(Xj = xj | Yes) = (4/10)(0) = 0

Therefore, X is classified as "No."
Bayes Classification
Naive Bayes Classification - m-Estimate of Conditional Probability
Remark
Recall that from the previous example,

Pr(Y = Yes) · Π_{j=1}^{3} Pr(Xj = xj | Yes) = (4/10) · 0 = 0.

This is bad since classification may not work very well. However, if

Pr(X | Y = yi) = 0 for all i,

then this is BAD and classification can't occur. The fix is m-estimation.
Bayes Classification
Naive Bayes Classification - m-Estimate of Conditional Probability
Remark
The idea is that we assume we have an extra m training records with class yi, of which p·m have the attribute value xj.

Definition
The m-estimate approach approximates the conditional probabilities:

Pr(Xj = xj | Y = yi) = (nc + m·p) / (n + m),

where
p is the prior estimate of the probability. You might assume (unless you have other information) a uniform prior, p = 1/k, where k is the number of values the attribute Xj can take.
m is the equivalent sample size (a constant).
nc is the number of training samples from class yi that take value xj.
n is the number of instances from class yi.
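A one-line sketch of the m-estimate in R, checked against one of the counts used on the later slides:

mEstimate <- function(nc, n, m, p) (nc + m * p) / (n + m)
# Marital Status = Married given Defaulted = Yes, with m = 6 and p = 1/3:
mEstimate(nc = 0, n = 4, m = 6, p = 1/3)   # 0.2 instead of the raw estimate 0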
Bayes Classification
Naive Bayes Classification - m-Estimate Example
Example
Use the following table on the next few slides to classify X = (Home Owner = No, Marital Status = Married, Income = $120K). (Table is changed slightly from book.) Use m-estimates.

Tid   Home Owner (Binary)   Married (Categorical)   Income (Continuous)   Defaulted (Class)
1     Yes                   Single                  125K                  Yes
2     No                    Married                 100K                  No
3     No                    Single                  70K                   No
4     Yes                   Married                 120K                  No
5     No                    Divorced                95K                   Yes
6     No                    Married                 60K                   No
7     Yes                   Divorced                220K                  No
8     No                    Single                  85K                   Yes
9     No                    Married                 75K                   No
10    No                    Single                  90K                   Yes
Bayes Classification
Naive Bayes Classification - m-Estimate Example Continued
Example
Recall that
Pr(Y = No) = 6/10
Pr(Y = Yes) = 4/10
Pr(Income = 120K | No) ≈ f(120; 107.5, 59.31) = 0.0507
Pr(Income = 120K | Yes) ≈ f(120; 98.75, 17.97) = 0.0468
Bayes Classification
Naive Bayes Classification - m-Estimate Example Continued
Example
Assume m = 6, for convenience. For Y = No:

Home Owner = No: nc = 4, n = 6, p = 1/2, so
Pr(Home Owner = No | No) = (nc + mp)/(n + m) = (4 + 6(1/2))/(6 + 6) = 7/12.

Marital Status = Married: nc = 4, n = 6, p = 1/3, so
Pr(Marital Status = Married | No) = (4 + 6(1/3))/(6 + 6) = 6/12.

Hence, we have

Pr(Y = No) · Π_{j=1}^{3} Pr(Xj = xj | No) = (6/10)(7/12)(6/12)(0.0507) = 0.0089.
Bayes Classification
Naive Bayes Classification - m-Estimate Example Continued
Example
Assume m = 6, for convenience. For Y = Yes:

Home Owner = No: nc = 3, n = 4, p = 1/2, so
Pr(Home Owner = No | Yes) = (nc + mp)/(n + m) = (3 + 6(1/2))/(4 + 6) = 6/10.

Marital Status = Married: nc = 0, n = 4, p = 1/3, so
Pr(Marital Status = Married | Yes) = (0 + 6(1/3))/(4 + 6) = 2/10.

Hence, we have

Pr(Y = Yes) · Π_{j=1}^{3} Pr(Xj = xj | Yes) = (4/10)(6/10)(2/10)(0.0468) ≈ 0.0022.
Bayes Classification
Naive Bayes Classification - m-Estimate Example Continued
Example
Again, since

Pr(No | X) · Pr(X) = Pr(Y = No) · Π_{j=1}^{3} Pr(Xj = xj | No) = 0.0089
                   > Pr(Yes | X) · Pr(X) = Pr(Y = Yes) · Π_{j=1}^{3} Pr(Xj = xj | Yes) ≈ 0.0022,

we can safely classify X as No.
Bayes Classification
Naive Bayes Classification - Bayes Error Rate
Suppose that the true probability distribution for Pr(X | Y) is known, where Y is the class label, and X is a vector of attributes. We seek to classify alligators and crocodiles based upon their length by finding the ideal decision boundary.
Bayes Classification
Naive Bayes Classification - Bayes Error Rate
Assume the length of an adult crocodile is N(15 ft, 2² ft²). Hence we can approximate the class-conditional probabilities as

Pr(X | Crocodile) = (1/(√(2π) · 2)) exp(−(1/2)((X − 15)/2)²).

Assume the length of an adult alligator is N(12 ft, 2² ft²). Hence we can approximate the class-conditional probabilities as

Pr(X | Alligator) = (1/(√(2π) · 2)) exp(−(1/2)((X − 12)/2)²).
Bayes Classification
Naive Bayes Classification - Bayes Error Rate
Assuming that the prior probabilities are the same, that is, Pr(Alligator) = Pr(Crocodile), the figure illustrates the ideal decision boundary x̂ where

Pr(X = x̂ | Crocodile) = Pr(X = x̂ | Alligator).

Hence, we have

((x̂ − 15)/2)² = ((x̂ − 12)/2)²  ⟹  x̂ = 27/2 = 13.5.

Hence, to the left of x̂ = 13.5 should be an alligator and to the right of x̂ = 13.5 should be a crocodile.

[Figure: the two Gaussian length densities (alligators centered at 12 ft, crocodiles at 15 ft, both with standard deviation 2), plotted over lengths 5-20 and crossing at the decision boundary.]
Bayes Classification
Naive Bayes Classification - Bayes Error Rate
Assuming that the prior probabilities are different, that is, Pr(Alligator) ≠ Pr(Crocodile), the decision boundary shifts towards the class with the lower prior probability.
Bayes Classification
Naive Bayes Classification - Bayes Error Rate
For example, if Pr(Alligator) = 2 Pr(Crocodile), then at the boundary x̂,

Pr(Alligator | X) = Pr(Crocodile | X)
⇔ Pr(Alligator) Pr(X | Alligator) / Pr(X) = Pr(Crocodile) Pr(X | Crocodile) / Pr(X)
⇔ 2 Pr(Crocodile) Pr(X | Alligator) = Pr(Crocodile) Pr(X | Crocodile)
⇔ 2 Pr(X | Alligator) = Pr(X | Crocodile)
⇔ 2 · (1/(√(2π) · 2)) exp(−(1/2)((x̂ − 12)/2)²) = (1/(√(2π) · 2)) exp(−(1/2)((x̂ − 15)/2)²)
⇔ 8 ln(2) − (x̂ − 12)² = −(x̂ − 15)²
⇒ x̂ = (81 + 8 ln(2))/6 ≈ 14.4242

Hence, the decision boundary shifted towards the crocodile.
Bayes Classification
Naive Bayes Classification - Bayes Error Rate
The Bayes error rate is the probability of incorrectly classifying X and is nonzero if the distributions of the classes overlap. For this example,

Bayes Error = ∫_0^x̂ Pr(Crocodile | X) dX + ∫_x̂^∞ Pr(Alligator | X) dX
            = 0.453255 for x̂ = 13.5, and 0.499449 for x̂ = 14.4242,

where the lower bound is 0 because it would be weird to consider a crocodile of negative length.
Bayes Classification
Naive Bayes Classification - Iris Example
Here, we should use the package "klaR".
A confusion matrix (should not test with training set...but this is just a tiny example) is generated using the function NaiveBayes().

             setosa   versicolor   virginica
setosa          50        0            0
versicolor       0       47            3
virginica        0        3           47
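A minimal sketch with klaR (assumed installed) that fits the model and tabulates predictions against the true species; evaluating on the training data mirrors the toy example above:

library(klaR)
nbFit <- NaiveBayes(Species ~ ., data = iris)
table(predict(nbFit, iris[, -5])$class, iris$Species)   # confusion matrix
plot(nbFit)   # marginal densities of each predictor by class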
Bayes Classification
Naive Bayes Classification - Iris Continued - Visualizes the marginal probabilities of
predictor variables given the class.
[Naive Bayes plot: estimated density of Petal.Length for each class (setosa, versicolor, virginica).]
Bayes Classification
Naive Bayes Classification - The 400 Ordered Pairs
[Scatter plot of the 400 ordered pairs (x, y), with the Positive and Negative classes marked.]
Bayes Classification
Naive Bayes Classification - The 400 Ordered Pairs Continued
Training set is 300 pairs, test set is 100 pairs.
How well did we do with the training data?

labelTrain    -1     1
        -1   111    67
         1    40    82

Predictions based upon the test data:

labelTest     -1     1
        -1    34    26
         1    15    25
Bayes Classification
Naive Bayes Classification - The 400 Ordered Pairs - Visualizes the marginal
probabilities of predictor variables given the class.
[Naive Bayes plot: estimated density of x for each class (−1 and 1).]
Bayes Classification
Naive Bayes Classification - The 400 Ordered Pairs Scatter Plot for Training
[Scatter plot of the training pairs, with the Positive and Negative classes marked.]
Naive Bayes Classification - The 400 Ordered Pairs
Scatter Plot for Testing
[Scatter plot of the test pairs, with the Positive and Negative classes marked.]
Bayes Classification
Bayesian Belief Networks
See http://www.bnlearn.com/ for very detailed information.
This requires the package “bnlearn.” Other packages that might prove
useful include “deal,”“catnet/mugnet” and “pcalg.”
Note that when loading the package “bnlearn” with
library("bnlearn"), I received an error that package “graph” was
not installed. An attempt to install it yielded the message that it is
no longer supported with this version of R. So, we need to run the
following three commands to install “graph.” For the first, you need
to select a mirror, I chose “1. Seattle (USA).” For the second
command, you must choose a repository, I chose “2: Bio software.”
Finally, install the package as usual.
> chooseBioCmirror(graphics = getOption("menu.graphics"))
> setRepositories(addURLs =
c(CRANxtras = "http://www.bioconductor.org"))
> install.packages("graph")
Everything should now be OK.
Keith E Emmert (TSU)
Data Mining
October 23, 2013
85 / 222
Bayes Classification
Bayesian Belief Networks

Remark

For naive Bayes, recall that the attributes, X = (X_1, . . . , X_n), are
conditionally independent given the class label Y = y, that is,

    Pr(X | Y = y) = ∏_{j=1}^{n} Pr(X_j | Y = y).

Hence, by Bayes' Theorem, for 1 ≤ i ≤ k and fixed
X = (X_1 = x_1, . . . , X_n = x_n),

    Pr(Y = y_i | X) = Pr(X | Y = y_i) Pr(Y = y_i) / Pr(X)
                    = Pr(Y = y_i) ∏_{j=1}^{n} Pr(X_j = x_j | Y = y_i) / Pr(X).

What if this assumption is relaxed?
Bayes Classification
Bayesian Belief Networks

Definition

A Bayesian belief network is a graphical representation of the probabilistic
relationships among a set of random variables and has the following elements:

    A directed acyclic graph encoding the dependence relationships among a
    set of variables.

    A probability table associating each node to its immediate parent nodes.

Remark

For nodes A, B,

    if there is a direct arc from A to B, then A is the parent of B and B is
    the child of A.

    if there is a path from A to B, then A is an ancestor of B and B is a
    descendant of A.

None of these relationships (parent, etc.) are unique.
Bayes Classification
Bayesian Belief Networks - Properties

Remark

Causal Sufficiency Assumption: There exist no common unobserved (or hidden or
latent) variables in the domain that are parents of one or more observed
variables in the domain.

Definition

A and B are independent if Pr(A, B) = Pr(A) Pr(B).

A and B are conditionally independent given C if

    Pr(A, B | C) = Pr(A | C) Pr(B | C),

that is, Pr(A | B, C) = Pr(A | C).

Remark

If A and B are conditionally independent given C, then B and A are
conditionally independent given C.
Bayes Classification
Bayesian Belief Networks - Properties

Remark

Markov Assumption: A node in a Bayesian network is conditionally independent
of its non-descendants if its parents are known.

Remark

If a node X does not have any parents, then the table contains only the prior
probability, Pr(X).

If a node X has only one parent, Y, then the table contains the conditional
probability Pr(X | Y).

If a node X has multiple parents, {Y_1, . . . , Y_k}, then the table contains
the conditional probability Pr(X | Y_1, Y_2, . . . , Y_k).
Bayes Classification
Bayesian Belief Networks - The Full Joint Distribution

Definition

The full joint distribution is defined in terms of local conditional
distributions:

    Pr(X_1, X_2, . . . , X_d) = ∏_{i=1}^{d} Pr(X_i | π(X_i)),

where π(X_i) represents the parents of X_i. If X_i has no parents, then this
is simply the prior distribution of X_i.
Bayes Classification
Bayesian Belief Networks - Building a Model

It's a two-step process:

    1. Create the structure of the network.

    2. Estimate the probability values in the tables associated with each node.

Both steps can be carried out in R with the "bnlearn" package, as sketched below.
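As a rough illustration (not taken from the slides), the two steps might look
like this; the data frame dat and the algorithm choices are placeholders:

    library(bnlearn)

    # dat is assumed to be a data frame of discrete (factor) variables.
    # Step 1: learn the structure of the network.
    dagHC <- hc(dat)    # score-based search (hill climbing)
    dagGS <- gs(dat)    # constraint-based search (grow-shrink)

    # Step 2: estimate the probability tables for a given structure.
    fitMLE   <- bn.fit(dagHC, data = dat, method = "mle")    # maximum likelihood
    fitBayes <- bn.fit(dagHC, data = dat, method = "bayes")  # posterior expectation

    # plot(dagHC) and plot(dagGS) draw the learned graphs for comparison.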
Bayes Classification
Bayesian Belief Networks - Learning of Bayesian Belief Networks

Data: D = {D_1, . . . , D_n}, where D_i = x_i is a vector of variable values.

Discrete variables: X = {X_1, . . . , X_d}.

True probability distribution: p(X).

Goal: estimate the true distribution p(X) over variables X using the examples
in D. That is, find the "best" parameters Θ such that p̃(X | Θ) ≈ p(X).

Parameter estimation criteria:

    Maximum likelihood estimation (MLE): maximize

        Pr(D | Θ) = ∏_{i=1}^{n} f(x_i | Θ),

    where f is a probability function.

    Maximum a posteriori probability (MAP): maximize

        Pr(Θ | D) = Pr(D | Θ) Pr(Θ) / Pr(D)
                  = Pr(D | Θ) Pr(Θ) / ∫ Pr(D | Θ) Pr(Θ) dΘ.

    That is, find the mode.
Bayes Classification
Bayesian Belief Networks - Example - Maximum Likelihood Estimation

Example

Suppose we have a biased coin marked with Heads or Tails, with Pr(Heads) = θ.

Let D be the sequence of outcomes, where x_i = 1 for Heads and x_i = 0 for
Tails.

Clearly, Pr(x | θ) = f(x | θ) = θ^x (1 − θ)^{1−x} (i.e. our friend Bernoulli!).

Likelihood function: L(θ | D) = ∏_{i=1}^{n} θ^{x_i} (1 − θ)^{1−x_i}.

Using the log-likelihood log(L(θ | D)) and a gentle partial derivative,

    ∂ log(L(θ)) / ∂θ = 0  ⟹  θ = (1/n) Σ_{i=1}^{n} x_i = x̄
                          ⟹  θ̂ = (1/n) Σ_{i=1}^{n} X_i = X̄.
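A tiny illustrative check of this result in R (the true θ of 0.3 below is an
arbitrary choice, not from the slides):

    # Simulate coin flips and verify that the MLE is the sample mean.
    set.seed(1)
    x <- rbinom(100, size = 1, prob = 0.3)  # 100 Bernoulli outcomes
    thetaHat <- mean(x)                     # MLE of theta
    thetaHat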
Bayes Classification
Bayesian Belief Networks - Conjugate Distributions

Definition

If the posterior distributions p(θ | x) are in the same family as the prior
probability distribution p(θ), then the prior and the posterior are called
conjugate distributions, and the prior is called a conjugate prior for the
likelihood. The parameters of the conjugate distribution are called
hyper-parameters.

Remark

All members of the exponential family have conjugate priors. See: Andrew
Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data
Analysis, 2nd edition. CRC Press, 2003. ISBN 1-58488-388-X.
Bayes Classification
Bayesian Belief Networks - Example - Maximum a Posteriori Probability (MAP)

Example

Suppose we have a biased coin marked with Heads or Tails, with Pr(Heads) = θ.

Let D be the sequence of outcomes, where x_i = 1 for Heads and x_i = 0 for
Tails.

Clearly, Pr(x | θ) = f(x | θ) = θ^x (1 − θ)^{1−x} (i.e. our friend Bernoulli!).

MAP seeks to maximize the posterior probability of θ:

    p̃(θ | D) = Pr(D | θ) p̃(θ) / Pr(D)
             = Pr(D | θ) p̃(θ) / ∫ Pr(D | θ) p̃(θ) dθ.
Bayes Classification
Bayesian Belief Networks - Example - Maximum a Posteriori Probability (MAP)

Example

We wish to maximize the posterior probability of θ:

    p̃(θ | D) = Pr(D | θ) p̃(θ) / Pr(D),

where

    Pr(D | θ) is the likelihood of the data. That is,

        Pr(D | θ) = ∏_{i=1}^{N} θ^{x_i} (1 − θ)^{1−x_i} = θ^{N_1} (1 − θ)^{N_2},

    where N_1 is the number of successes and N_2 the number of failures.

    p̃(θ) is the prior probability on θ.

Problem: How do we choose the prior probability, p̃(θ)?
Bayes Classification
Bayesian Belief Networks - Example - Maximum a Posteriori Probability (MAP)

Example

Suppose we have a biased coin marked with Heads or Tails.

Problem: How do we choose the prior probability, p̃(θ)? Use the conjugate
distribution for the binomial:

    Pr(D | θ) = θ^{N_1} (1 − θ)^{N_2}
    ⟹  Pr(θ) = Γ(α_1 + α_2) / (Γ(α_1) Γ(α_2)) · θ^{α_1 − 1} (1 − θ)^{α_2 − 1}.

Hence, we know

    p̃(θ | D) = Pr(D | θ) Beta(θ | α_1, α_2) / ∫ Pr(D | θ) Beta(θ | α_1, α_2) dθ
             = Beta(θ | α_1 + N_1, α_2 + N_2).

The mode is: θ_MAP = (α_1 + N_1 − 1) / (α_1 + α_2 + N_1 + N_2 − 2).
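A minimal numerical sketch of the MAP estimate (the hyper-parameters and
counts below are made-up numbers, used only to show the arithmetic):

    alpha1 <- 2; alpha2 <- 2   # hyper-parameters of the Beta prior
    N1 <- 7; N2 <- 3           # observed heads and tails

    thetaMAP <- (alpha1 + N1 - 1) / (alpha1 + alpha2 + N1 + N2 - 2)
    thetaMAP                   # mode of Beta(alpha1 + N1, alpha2 + N2)

    # For comparison, the posterior mean used for prediction a few slides later:
    (alpha1 + N1) / (alpha1 + alpha2 + N1 + N2)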
Bayes Classification
Bayesian Belief Networks - Bayesian Learning

Both MLE and MAP pick one parameter value. Is it always the best solution?

If we have two different parameter settings that are close in terms of
probability (i.e. MLE or MAP), then using only one may introduce a strong
bias.

Bayesian approach

    Remedies the limitation of one choice.

    Considers all parameter settings and averages the result:

        Pr(∆ | D) = ∫ Pr(∆ | θ) p̃(θ | D) dθ.
Bayes Classification
Bayesian Belief Networks - Example Continued

Example

Predict the next coin flip. Recall we have

    p̃(θ | D) = Beta(θ | α_1 + N_1, α_2 + N_2).

So, compare Pr(X = 1 | D) and Pr(X = 0 | D).

    Pr(X = 1 | D) = ∫_0^1 Pr(X = 1 | θ) p̃(θ | D) dθ
                  = ∫_0^1 θ^1 (1 − θ)^{1−1} Beta(θ | α_1 + N_1, α_2 + N_2) dθ
                  = ∫_0^1 θ Beta(θ | α_1 + N_1, α_2 + N_2) dθ
                  = E(θ) = (α_1 + N_1) / (α_1 + α_2 + N_1 + N_2).
Bayes Classification
Bayesian Belief Networks - Example Continued

Example

    Pr(X = 0 | D) = ∫_0^1 Pr(X = 0 | θ) p̃(θ | D) dθ
                  = ∫_0^1 θ^0 (1 − θ)^{1−0} Beta(θ | α_1 + N_1, α_2 + N_2) dθ
                  = ∫_0^1 (1 − θ) Beta(θ | α_1 + N_1, α_2 + N_2) dθ
                  = E(1 − θ) = 1 − (α_1 + N_1) / (α_1 + α_2 + N_1 + N_2)
                  = (α_2 + N_2) / (α_1 + α_2 + N_1 + N_2).
Bayes Classification
Bayesian Belief Networks - Example Continued

Example

So, we predict heads if

    Pr(X = 1 | D) > Pr(X = 0 | D)
    ⟺  (α_1 + N_1) / (α_1 + α_2 + N_1 + N_2) > (α_2 + N_2) / (α_1 + α_2 + N_1 + N_2)
    ⟺  α_1 + N_1 > α_2 + N_2.
Bayes Classification
Bayesian Belief Networks - Multinomial Example

Roll k dice.

Data: D = {D_1, . . . , D_N}, where D_i = x_i is a vector of k variable
values; N_i represents the number of times i occurs.

Model parameters: θ⃗ = (θ_1, . . . , θ_k), with Σ_i θ_i = 1 and θ_i the
probability of an outcome i.

Probability of the data (likelihood function):

    Pr(N_1, . . . , N_k | θ⃗) = N! / (N_1! · · · N_k!) · θ_1^{N_1} · · · θ_k^{N_k}.

This is the multinomial distribution.

The MLE estimate is

    θ̂_i = N_i / N,   i = 1, 2, . . . , k.
Bayes Classification
Bayesian Belief Networks - Multinomial Example

Roll k dice.

The conjugate: the Dirichlet distribution,

    Dir(θ⃗ | α_1, . . . , α_k) = Γ(Σ_{i=1}^{k} α_i) / ∏_{i=1}^{k} Γ(α_i) · θ_1^{α_1 − 1} · · · θ_k^{α_k − 1},

where

    Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,   α > 0.

The posterior:

    p̃(θ⃗ | D) = Pr(D | θ⃗) Dir(θ⃗ | α_1, . . . , α_k) / Pr(D)
             = Dir(θ⃗ | α_1 + N_1, . . . , α_k + N_k).

The MAP estimate (the mode) is

    θ_{i,MAP} = (α_i + N_i − 1) / (Σ_{i=1}^{k} (α_i + N_i) − k).
Bayes Classification
Bayesian Belief Networks - Example Using Package "bnlearn"

Using the Hill Climbing Algorithm - A Score-Based Algorithm

[Figure: directed acyclic graph over the nodes A, B, C, D, E, F learned by the
hill-climbing algorithm.]
Bayes Classification
Bayesian Belief Networks - Example Using Package "bnlearn"

Using the Grow-Shrink Algorithm - A Constraint-Based Algorithm

Note that the arc A − B has no arrow. This indicates that A → B and B → A
generate networks with the same score.

[Figure: graph over the nodes A, B, C, D, E, F learned by the grow-shrink
algorithm, with the undirected arc A − B.]
Bayes Classification
Bayesian Belief Networks - Comparing the Models

We have different results...

[Figure: the hill-climbing and grow-shrink networks over the nodes A, B, C, D,
E, F, side by side.]
Bayes Classification
Bayesian Belief Networks - Rgraphviz package

Note that the package Rgraphviz helps with plots. It can be installed with

source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")
Bayes Classification
Bayesian Belief Networks - A Prettier Graph via Rgraphviz

So, here is a better comparison plot...

[Figure: the hill-climbing and grow-shrink networks over the nodes A, B, C, D,
E, F, redrawn with Rgraphviz.]
Bayes Classification
Bayesian Belief Networks

Small synthetic data set from Lauritzen and Spiegelhalter (1988) about lung
diseases (tuberculosis, lung cancer or bronchitis) and visits to Asia.

The data set has the following two-level variables, with levels yes and no.

    D (dyspnoea) - shortness of breath
    T (tuberculosis)
    L (lung cancer)
    B (bronchitis)
    A (visit to Asia)
    S (smoking)
    X (chest X-ray)
    E (tuberculosis versus cancer/bronchitis)
Bayes Classification
Bayesian Belief Networks - Asia Example

[Figure: the Asia network over the nodes A, S, T, L, B, E, X, D.]
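The Asia data ships with bnlearn, so a hedged sketch of the fitting used on
the next few slides might look as follows (the object names are my own):

    library(bnlearn)

    data(asia)                 # 8 two-level factors: A, S, T, L, B, E, X, D
    dagAsia <- hc(asia)        # learn a structure by hill climbing

    fitMLE   <- bn.fit(dagAsia, data = asia, method = "mle")    # relative frequencies
    fitBayes <- bn.fit(dagAsia, data = asia, method = "bayes")  # posterior expectations

    fitMLE$A                   # conditional probability table of node A
    fitMLE$D                   # conditional probability table of node D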
Bayes Classification
Bayesian Belief Networks - Asia Example - Fitting with MLE

We can find the conditional probability tables of the variables. For example,
node A:

Parameters of node A (multinomial distribution)
Conditional probability table:
        no     yes
    0.9916  0.0084

and node T:

Parameters of node T (multinomial distribution)
Conditional probability table:
       A
T               no         yes
  no   0.991528842 0.952380952
  yes  0.008471158 0.047619048
Bayes Classification
Bayesian Belief Networks - Asia Example - Fitting with MLE

Node D

Parameters of node D (multinomial distribution)
Conditional probability table:

, , E = no
       B
D              no        yes
  no   0.90017286 0.21373057
  yes  0.09982714 0.78626943

, , E = yes
       B
D              no        yes
  no   0.27737226 0.14592275
  yes  0.72262774 0.85407725
Bayes Classification
Bayesian Belief Networks - Asia Example - Fitting with MLE

[Figure: "Asia Node D Conditional Probabilities using MLE" - bar chart of the
conditional probability table of node D, one panel per combination of the
yes/no levels of its parents.]
Bayes Classification
Bayesian Belief Networks - Asia Example - Fitting with Bayes

We can find the conditional probability tables of the variables using the
expected value of the posterior distribution. For example, node A:

Parameters of node A (multinomial distribution)
Conditional probability table:
          no         yes
 0.990618762 0.009381238

and node T:

Parameters of node T (multinomial distribution)
Conditional probability table:
       A
T               no         yes
  no   0.991033649 0.904255319
  yes  0.008966351 0.095744681
Bayes Classification
Bayesian Belief Networks - Asia Example - Fitting with Bayes

Node D

Parameters of node D (multinomial distribution)
Conditional probability table:

, , E = no
       B
D             no       yes
  no   0.8997410 0.2140392
  yes  0.1002590 0.7859608

, , E = yes
       B
D             no       yes
  no   0.2813620 0.1496815
  yes  0.7186380 0.8503185
Bayes Classification
Bayesian Belief Networks - Asia Example - Fitting with Bayes

[Figure: "Asia Node D Conditional Probabilities using Bayes" - bar chart of
the conditional probability table of node D, one panel per combination of the
yes/no levels of its parents.]
Bayes Classification
Bayesian Belief Networks - Asia Example - Computations: Do you have dyspnoea? No Prior Information

Remark

Recall: A node in a Bayesian network is conditionally independent of its
non-descendants if its parents are known. For example, X is conditionally
independent of D, and also of L, given E.

[Figure: the Asia network over the nodes A, S, T, L, B, E, X, D.]

The answer is "Yes" if Pr(D = Y) > Pr(D = N).

Compute Pr(D = Yes). (We get Pr(D = No) for free.)

    Pr(D = Yes) = Σ_α Σ_β Pr(D = Yes | E = α, B = β) Pr(E = α, B = β)
                = Σ_α Σ_β Pr(D = Yes | E = α, B = β) Pr(E = α) Pr(B = β).
Bayes Classification
Bayesian Belief Networks - Asia Example - Computations: Do you have dyspnoea? Using Prior Information

Suppose we have some background on the patient: they have bronchitis. This, of
course, changes the chance of dyspnoea. The answer is "Yes" if
Pr(D = Y | B = Y) > Pr(D = N | B = Y).

Compute Pr(D = Yes | B = Yes). (Pr(D = No | B = Yes) is free.)

    Pr(D = Y | B = Y) = Pr(B = Y | D = Y) Pr(D = Y) / Pr(B = Y)
                      = Pr(B = Y | D = Y) Pr(D = Y) / Σ_α Pr(B = Y | D = α) Pr(D = α).
Bayes Classification
Bayesian Belief Networks - Asia Example - Computations: Do you have dyspnoea? Using Prior Information

Suppose we have some background on the patient: they have bronchitis, a yes
for E (tb vs lung cancer/bronchitis), and no for the chest X-ray. This, of
course, changes the chance of dyspnoea. Compare Pr(D = Yes | B = Yes, E = Yes,
X = No) and Pr(D = No | B = Yes, E = Yes, X = No).

D is conditionally independent from X given E, B! So,

    Pr(D = Y | B = Y, E = Y, X = N) = Pr(D = Y | B = Y, E = Y),

which is the conditional probability table for D. (Thus, the X-ray is not
necessary.) Compare to Pr(D = N | B = Y, E = Y) to make the call for dyspnoea.
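In bnlearn such conditional probabilities can also be approximated by sampling
from a fitted network; a rough sketch, assuming fitBayes is the fitted Asia
network from the sketch above (results vary slightly from run to run):

    library(bnlearn)

    # Approximate Pr(D = yes), Pr(D = yes | B = yes), and the query with
    # additional evidence on E, by logic sampling.
    cpquery(fitBayes, event = (D == "yes"), evidence = TRUE)
    cpquery(fitBayes, event = (D == "yes"), evidence = (B == "yes"))
    cpquery(fitBayes, event = (D == "yes"),
            evidence = (B == "yes") & (E == "yes"))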
Bayes Classification
Bayesian Belief Networks - ALARM

The ALARM ("A Logical Alarm Reduction Mechanism") network is a Bayesian
network designed to provide an alarm message system for patient monitoring.

[Figure: the ALARM network, a directed acyclic graph over clinical variables
such as MVS, DISC, PAP, FIO2, SAO2, TPR, HR, CO and BP.]
Bayes Classification
Bayesian Belief Networks - Continuous Data

Univariate case: use a Gaussian,

    p(x) = 1 / √(2πσ²) · e^{−(1/2)((x−µ)/σ)²}.

Multivariate case: multivariate Gaussian over X_1, . . . , X_n with

    Mean vector µ, with µ_i = E(X_i) for 1 ≤ i ≤ n.

    n × n covariance matrix Σ, with Σ_ii = Var(X_i) and
    Σ_ij = Cov(X_i, X_j) = E(X_i X_j) − E(X_i) E(X_j) for i ≠ j.

    Joint density function

        p(x) = 1 / ((2π)^{n/2} |Σ|^{1/2}) · e^{−(x−µ)^T Σ^{−1} (x−µ)/2}.
Bayes Classification
Bayesian Belief Networks - Continuous Data

If Pr(X, Y) ∼ N( (µ_X, µ_Y), [Σ_XX  Σ_XY; Σ_YX  Σ_YY] ), then the marginal
for X is Pr(X) ∼ N(µ_X, Σ_XX).

If X = (X_1, . . . , X_n), then X_i and X_j are independent if and only if
Σ_ij = 0.
Bayes Classification
Bayesian Belief Networks - Continuous Data

Definition

Suppose Y is a continuous variable with parents X_1, . . . , X_n. Then Y has a
linear Gaussian model if it can be described using parameters
β_0, β_1, . . . , β_n and σ² such that

    Pr(Y | x_1, . . . , x_n) ∼ N(β_0 + β_1 x_1 + · · · + β_n x_n, σ²),

or, in vector notation,

    Pr(Y | x⃗) ∼ N(β_0 + β⃗^T x⃗, σ²).
Bayes Classification
Bayesian Belief Networks - Iris Data Set - A Continuous Example

[Figure: network structures learned from the iris data by hill-climbing and
grow-shrink, over the nodes Sepal.Length, Sepal.Width, Petal.Length,
Petal.Width and Species.]
Bayes Classification
Bayesian Belief Networks - QQ Plots on Iris

[Figure: normal QQ plots of the residuals for each node, for the hill-climbing
and grow-shrink fits.]
Bayes Classification
Bayesian Belief Networks - Residuals Vs Fitted on Iris

[Figure: residuals versus fitted values for each node, for the hill-climbing
and grow-shrink fits.]
Bayes Classification
Bayesian Belief Networks - Histogram of the Residuals on Iris

[Figure: histograms of the residuals for each node, for the hill-climbing and
grow-shrink fits.]
Bayes Classification
Bayesian Belief Networks - Predictions for Iris Data - GS

This gives surprisingly horrible predictions - everything classifies as
Species #2.

    irisPredictGS   1  2  3
                2  50 50 50
Bayes Classification
Bayesian Belief Networks - Predictions for Iris Data - HC

This one gives what we might expect - after all, I've used the entire data set
to train the network and then predicted on the set I used to train! (Bad idea,
by the way...)

    irisPredictHC   1  2  3
                1  50  0  0
                2   0 48  2
                3   0  2 48
Support Vector Machines
Linearly Separable

[Figure: scatter plot of two linearly separable classes, Class A and Class B,
in the (x, y) plane.]
Support Vector Machines
Linearly Separable - The Margins

[Figure: the same two classes with a separating hyper-plane and its margins
drawn in.]
Support Vector Machines
Linearly Separable

The decision boundary, B_i, is a hyper-plane which separates two classes.

Suppose b_ij are hyper-planes parallel to B_i which touch one data point, but
still separate the two classes.

The margin is the distance between b_i1 and b_i2.

A linear support vector machine is a maximal margin classifier. It searches
for the hyper-plane with the largest margin.

[Figure: Class A and Class B with the decision boundary B_i and the parallel
hyper-planes b_i1 and b_i2.]
Support Vector Machines
Linearly Separable - Hyper-planes

Let N be the number of training records of the form (x⃗_i, y_i), for
i = 1, 2, . . . , N.

    Let x⃗_i = (x_i1, . . . , x_id) be the attributes.

    Let y_i ∈ {−1, 1} be the class labels.

The decision boundary is a hyper-plane and can be written as

    w⃗ · x⃗ + b = 0,

where w⃗ and b are to be determined.

Fact: Hyper-planes of this type separate a space into two connected
components, with w⃗ · x⃗ + b > 0 on one side and w⃗ · x⃗ + b < 0 on the other.

For R^d, a hyper-plane contains d − 1 independent variables (and is a
(d − 1)-dimensional subspace if the origin is included; otherwise it is just
a set).
Support Vector Machines
Linearly Separable

Example

Note that the line y = 2 − x is a hyper-plane in R². Clearly,

    y = 2 − x
    x + y − 2 = 0
    [1 1] · [x y] − 2 = 0
    w⃗ · x⃗ + d = 0.

[Figure: the line y = 2 − x, which splits the plane into the regions
w⃗ · x⃗ + d > 0 and w⃗ · x⃗ + d < 0.]
Support Vector Machines
Linearly Separable - Finding w⃗ and b

For a Class A point, x⃗_◦, we have w⃗ · x⃗_◦ + b = k_{x⃗_◦} < 0.

For a Class B point, x⃗_∆, we have w⃗ · x⃗_∆ + b = k_{x⃗_∆} > 0.

For a ◦ point which is closest to B_i, we have a parallel hyper-plane b_i1.

For a ∆ point which is closest to B_i, we have a parallel hyper-plane b_i2.

Rescale the parameters w⃗ and b so that

    b_i1 : w⃗ · x⃗ + b = −1    and    b_i2 : w⃗ · x⃗ + b = 1.

[Figure: Class A and Class B with the decision boundary B_i and the parallel
hyper-planes b_i1 and b_i2.]
Support Vector Machines
Linearly Separable - Classifying New Vectors

Suppose we have w⃗ and b so that the hyper-planes b_i1, B_i, b_i2 are
determined, that is,

    b_i1 : w⃗ · x⃗ + b = −1,    B_i : w⃗ · x⃗ + b = 0,    b_i2 : w⃗ · x⃗ + b = 1.

Then, to classify the attribute vector z⃗,

    y = 1 if w⃗ · z⃗ + b > 0,    y = −1 if w⃗ · z⃗ + b < 0.

[Figure: the two classes with b_i1, B_i and b_i2 drawn in.]
Support Vector Machines
Linearly Separable - Finding w⃗ and b

Let d be the margin, the distance between b_i2 and b_i1. Choose a point u⃗ on
b_i1 and a point v⃗ on b_i2. Then

    w⃗ · u⃗ + b = −1    and    w⃗ · v⃗ + b = 1.

Subtracting, we have

    w⃗ · (v⃗ − u⃗) = 2,

which, using a famous property of norms, dot products, and cosine
(||v⃗ − u⃗|| cos(θ) = d exactly when v⃗ − u⃗ is parallel to w⃗, i.e. θ = 0),
yields

    ||w⃗|| · ||v⃗ − u⃗|| cos(θ) = w⃗ · (v⃗ − u⃗) = 2  ⟹  d = 2 / ||w⃗||.

So, to maximize the margin between the two classes, we minimize ||w⃗||.
Support Vector Machines
Linearly Separable - Learning a Linear SVM Model

w⃗ and b are to be estimated so that

    1. w⃗ · x⃗_i + b ≥ 0 when y_i = 1, and w⃗ · x⃗_i + b ≤ 0 when y_i = −1;
       that is, y_i (w⃗ · x⃗_i + b) ≥ 0 for all i = 1, . . . , N.

    2. The margin of the decision boundary is maximized, which holds if and
       only if d = 2 / ||w⃗|| is maximized, if and only if the objective
       function

           f(w⃗) = ||w⃗||² / 2

       is minimized.

So, we have a constrained optimization problem,

    min_{w⃗ ∈ R^d}  ||w⃗||² / 2    subject to    y_i (w⃗ · x⃗_i + b) ≥ 1
                                               for all i = 1, . . . , N.

Thus we have a quadratic objective function, and the constraints are linear in
w⃗ and b.
Support Vector Machines
Linearly Separable - Lagrange Multipliers - Equality Constraints

For x⃗ ∈ R^d, minimize f(x⃗) subject to g_i(x⃗) = 0 for i = 1, 2, . . . , p.

    1. The Lagrangian: L(x⃗, λ⃗) = f(x⃗) + Σ_{i=1}^{p} λ_i g_i(x⃗), where the
       λ_i are dummy variables called Lagrange multipliers.

    2. Compute ∂L/∂x_i, i = 1, . . . , d, and ∂L/∂λ_i, i = 1, . . . , p.

    3. Solve the d + p equations for the stationary point x⃗* and λ⃗.

Example

Minimize f(x, y) = x + 2y subject to x² + y² − 4 = 0.
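A worked sketch of the equality-constraint example above (worked here, not in
the slides), following the three steps:

    L(x, y, λ) = x + 2y + λ(x² + y² − 4)

    ∂L/∂x = 1 + 2λx = 0,   ∂L/∂y = 2 + 2λy = 0,   ∂L/∂λ = x² + y² − 4 = 0

    ⟹  x = −1/(2λ), y = −1/λ,  so  1/(4λ²) + 1/λ² = 4  ⟹  λ = ±√5/4.

    The stationary points are (x, y) = (∓2/√5, ∓4/√5); the minimum of
    f(x, y) = x + 2y is −2√5, attained at (−2/√5, −4/√5).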
Support Vector Machines
Linearly Separable - Inequality Constraints

For x⃗ ∈ R^d, minimize f(x⃗) subject to h_i(x⃗) ≤ 0 for i = 1, . . . , q. Then
the Lagrangian

    L(x⃗, λ⃗) = f(x⃗) + Σ_{i=1}^{q} λ_i h_i(x⃗)

is generalized and solved using the Karush-Kuhn-Tucker conditions

    ∂L/∂x_i = 0,     i = 1, . . . , d,
    h_i(x⃗) ≤ 0,      i = 1, . . . , q,
    λ_i ≥ 0,         i = 1, . . . , q,
    λ_i h_i(x⃗) = 0,  i = 1, . . . , q.

Example

Minimize f(x, y) = (x − 1)² + (y − 3)² subject to

    x + y ≤ 2    and    y ≥ x.

That is, subject to h_1(x⃗) = x + y − 2 ≤ 0 and h_2(x⃗) = x − y ≤ 0.
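A worked sketch of this example (again, not from the slides): the
unconstrained minimum (1, 3) violates x + y ≤ 2, so that constraint must be
active.

    With h_1 active and h_2 inactive (λ_2 = 0), the KKT conditions give

        2(x − 1) + λ_1 = 0,   2(y − 3) + λ_1 = 0,   x + y − 2 = 0

    ⟹  x = 0, y = 2, λ_1 = 2 ≥ 0, and y ≥ x holds, so the minimum is
    f(0, 2) = (0 − 1)² + (2 − 3)² = 2.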
Support Vector Machines
A Little Math

Example

Consider f(x⃗) = ||x⃗||² and g(x⃗) = x⃗ · a⃗, for x⃗, a⃗ ∈ R^n. Recall
||x⃗|| = √⟨x⃗, x⃗⟩, so ||x⃗||² = ⟨x⃗, x⃗⟩ = Σ_{i=1}^{n} x_i², and write

    ∂f/∂x⃗ = (∂f/∂x_1, . . . , ∂f/∂x_n)^T.

Hence,

    ∂f/∂x_i = ∂/∂x_i Σ_{i=1}^{n} x_i² = 2x_i   ⟹   ∂/∂x⃗ ||x⃗||² = 2x⃗,

and

    ∂g/∂x_i = ∂/∂x_i Σ_{i=1}^{n} x_i a_i = a_i   ⟹   ∂/∂x⃗ (x⃗ · a⃗) = a⃗.
Support Vector Machines
Linearly Separable - The Primal Problem

Recall, we have a constrained optimization problem,

    min_{w⃗ ∈ R^d}  ||w⃗||² / 2    subject to    y_i (w⃗ · x⃗_i + b) ≥ 1
                                               for all i = 1, . . . , N.

Lagrange multipliers yield the primal

    L_p(λ⃗, w⃗, b) = ||w⃗||² / 2 − Σ_{i=1}^{N} λ_i [ y_i (w⃗ · x⃗_i + b) − 1 ].

To minimize the Lagrangian,

    ∂L_p/∂w⃗ = 0  ⟹  w⃗ = Σ_{i=1}^{N} λ_i y_i x⃗_i,
    ∂L_p/∂b = 0  ⟹  Σ_{i=1}^{N} λ_i y_i = 0.

The Karush-Kuhn-Tucker conditions:

    λ_i ≥ 0,    λ_i [1 − y_i (w⃗ · x⃗_i + b)] = 0,    for i = 1, 2, . . . , N.
Support Vector Machines
Linearly Separable - Support Vectors

Inferences from the Karush-Kuhn-Tucker conditions:

    λ_i [1 − y_i (w⃗ · x⃗_i + b)] = 0  ⟹  λ_i = 0 whenever
    1 − y_i (w⃗ · x⃗_i + b) ≠ 0, and 1 − y_i (w⃗ · x⃗_i + b) = 0 whenever
    λ_i > 0.

Note that y_i (w⃗ · x⃗_i + b) = 1 if and only if the vector x⃗_i lies along one
of the hyper-planes b_i1 or b_i2. Such a vector is called a support vector
(hence the name, Support Vector Machines).

Definition

In the above context, x⃗_i is a support vector if and only if the
corresponding λ_i > 0.
Support Vector Machines
Linearly Separable - The Dual Lagrangian Problem

Recall: the primal is

    L_p(λ⃗, w⃗, b) = ||w⃗||² / 2 − Σ_{i=1}^{N} λ_i [ y_i (w⃗ · x⃗_i + b) − 1 ],

where we know

    w⃗ = Σ_{i=1}^{N} λ_i y_i x⃗_i    and    Σ_{i=1}^{N} λ_i y_i = 0.

Substituting the above into L_p(λ⃗, w⃗, b), we obtain the dual problem:

    L_D = Σ_{i=1}^{N} λ_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} λ_i λ_j y_i y_j ⟨x⃗_j, x⃗_i⟩,

subject to Σ_{i=1}^{N} λ_i y_i = 0 and λ_i ≥ 0. Note that this removes the
dependence upon w⃗ and b!
Support Vector Machines
Linearly Separable - The Dual Lagrangian Problem

The dual problem:

    L_D = Σ_{i=1}^{N} λ_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} λ_i λ_j y_i y_j ⟨x⃗_j, x⃗_i⟩,

subject to Σ_{i=1}^{N} λ_i y_i = 0 and λ_i ≥ 0.

    Since the "quadratic" term is negative, this is now a maximization problem
    rather than a minimization problem.

    Of course, w⃗ = Σ_{i=1}^{N} λ_i y_i x⃗_i.

    b is found via λ_i [1 − y_i (w⃗ · x⃗_i + b)] = 0, i = 1, . . . , N, using
    the support vectors x⃗_i (when λ_i > 0). (Just average all the b's
    together...)

    The decision boundary is (Σ_{i=1}^{N} λ_i y_i x⃗_i) · x⃗ + b = 0.

    Classification: f(z⃗) = sign(w⃗ · z⃗ + b) = sign(Σ_{i=1}^{N} λ_i y_i x⃗_i · z⃗ + b).
Support Vector Machines
Linearly Separable - Finding w⃗ and b: Back to the Old Example...

The vector w⃗ is

    w⃗ = Σ_{i=1}^{N} λ_i y_i x⃗_i = (−0.7071, −0.6797),

and b = 3.9408 (R reports the negative of our b, which I've corrected for).

The equation of the decision boundary is w⃗ · z⃗ + b = 0, or

    −0.7071 z_1 − 0.6797 z_2 + 3.9408 = 0.

The decision function is given by

    f(z⃗) = sign(w⃗ · z⃗ + b) = sign(−0.7071 z_1 − 0.6797 z_2 + 3.9408).

[Figure: the two classes with the fitted decision boundary and margins.]
Support Vector Machines
Using R

There are several packages for using Support Vector Machines. A few of them
include

    kernlab: the one we'll use
    e1071: perhaps the first implementation of SVM
    klaR
    svmpath
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel

Generate 200 ordered pairs; there are 2 positive classes with different means,
and one negative class.

[Figure: scatter plot of the positive and negative points in the (x, y)
plane.]
Support Vector Machines
Linear Non-Separable: The Idea

Slack variables ξ_i > 0 for all i:

    w⃗ · x⃗_i + b ≥ 1 − ξ_i,   y_i = 1
    w⃗ · x⃗_i + b ≤ −1 + ξ_i,  y_i = −1
    ⟹  y_i [w⃗ · x⃗_i + b] ≥ 1 − ξ_i.

The data are clearly not linearly separable, and the point p_i will not be
correctly classified. ξ_i > 0 is a penalty for the misclassification of p_i.

ξ_i / ||w⃗|| is the distance from w⃗ · x⃗ + b = −1 to the "noisy" point p_i.
This is an estimate of the error of the decision boundary on p_i.

[Figure: the hyper-planes ⟨w, x⟩ + b = 1, 0, −1, and −1 + ξ_i, with the
misclassified point p_i at distance ξ_i / ||w⃗|| from the margin.]
Support Vector Machines
Linear Non-Separable: The Objective Function

The new objective function:

    f(w⃗) = ||w⃗||² / 2 + C (Σ_{i=1}^{N} ξ_i)^k.

Seek to minimize the objective function.

The term C (Σ_{i=1}^{N} ξ_i)^k represents a penalty for a decision boundary
with large values of the slack variables, which misclassifies many training
examples.

Simplification: k = 1.
Support Vector Machines
Linear Non-Separable: The Primal

    L_p : ||w⃗||² / 2 + C Σ_{i=1}^{N} ξ_i                        (objective function)
          − Σ_{i=1}^{N} λ_i [ y_i (w⃗ · x⃗_i + b) − 1 + ξ_i ]     (inequality constraints + slack variables)
          − Σ_{i=1}^{N} µ_i ξ_i                                  (non-negativity requirements on the ξ_i's)

The Karush-Kuhn-Tucker constraints (to transform into equality constraints for
optimization):

    ξ_i ≥ 0,  λ_i ≥ 0,  µ_i ≥ 0,  µ_i ξ_i = 0,  λ_i [ y_i (w⃗ · x⃗_i + b) − 1 + ξ_i ] = 0.

Setting the partial derivatives to zero:

    ∂L_p/∂w⃗ = 0   ⟹  w⃗ = Σ_{i=1}^{N} λ_i y_i x⃗_i,
    ∂L_p/∂b = 0   ⟹  Σ_{i=1}^{N} λ_i y_i = 0,
    ∂L_p/∂ξ_i = 0  ⟹  λ_i + µ_i = C.
Support Vector Machines
Linear Non-Separable: The Dual

Gentle mathematics yields

    L_D : Σ_{i=1}^{N} λ_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} λ_i λ_j y_i y_j x⃗_i · x⃗_j.

Note that the following restrictions hold:

    0 ≤ λ_i ≤ C,    µ_i ≥ 0,    Σ_{i=1}^{N} λ_i y_i = 0.

The solution is

    w⃗ = Σ_{i=1}^{N} λ_i y_i x⃗_i,

with b an average of the b's obtained by solving

    y_i (w⃗ · x⃗_i + b) − 1 + ξ_i = 0

when λ_i > 0 (i.e. using support vector x⃗_i).
Support Vector Machines
Linear Non-Separable: Classification

The decision boundary is

    (Σ_{i=1}^{N} λ_i y_i x⃗_i) · x⃗ + b = 0.

Classification: f(z⃗) = sign(w⃗ · z⃗ + b) = sign(Σ_{i=1}^{N} λ_i y_i x⃗_i · z⃗ + b).
Support Vector Machines
Linear Non-Separable: Hard Margin vs Soft Margin

A hard-margin SVM does not allow slack variables, ξ_i's. Hence, one constraint
is simply λ_i ≥ 0.

A soft-margin SVM does allow slack variables, ξ_i's. Hence, one constraint is
simply 0 ≤ λ_i ≤ C.

The role of C: if C → ∞, then the soft-margin SVM becomes a hard-margin SVM,
because

    1. In L_p, we try to minimize ||w⃗||² / 2 + C Σ_{i=1}^{N} ξ_i. Hence, to
       minimize L_p as C → ∞, simply take ξ_i = 0 for all i.

    2. In L_D, the dual, the constraint for the soft margin, 0 ≤ λ_i ≤ C,
       becomes λ_i ≥ 0 (the constraint for the hard margin) when C → ∞.

Finding C: use a grid search.
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Training Data is 75%

> library(kernlab)
> myC = 1 # The parameter C for most of this section
> # Do some training
> mySVM = ksvm(trainData, labelTrain, type="C-svc",
+              kernel='vanilladot', C=myC, scaled=c(),
+              kpar=list())
> mySVM # Look at the summary

Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
 parameter : cost C = 1

Linear (vanilla) kernel function.

Number of Support Vectors : 271

Objective Function Value : -269.9585
Training error : 0.35
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Visual Results

[Figure: "SVM classification plot" - the training data shaded by the value of
the decision function, with the decision boundary through the middle.]

In the figure, we have the training results when C is 1 for a (linear, since
we've used "vanilladot" for a kernel) SVM. The filled triangles and circles
are the support vectors. Values near zero are close to the decision boundary.

What a mess. It looks like most of the training vectors are support vectors.
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Finding λ_i's, b, and Support Vectors

Note that alpha(mySVM) returns the positive λ_i's, and the corresponding
indices from alphaindex(mySVM) can be used to find the corresponding support
vectors. coef(mySVM) returns the y_i λ_i's. The negative intercept is given by
b(mySVM). Let's just look at a few of them.

> coef(mySVM)[[1]][1:20]
 [1] -1  1 -1 -1  1 -1 -1  1  1 -1 -1  1  1 -1  1  1  1
[18]  1  1 -1
> alpha(mySVM)[[1]][1:20]
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> alphaindex(mySVM)[[1]][1:20]
 [1]  1  2  3  4  5  6  7  8  9 10 12 14 15 17 18 19 20
[18] 22 23 24
> b(mySVM)
[1] -0.7650215
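Putting these accessors together, one might recover w⃗ and b for the linear
kernel roughly as follows (a sketch, not code from the slides):

    library(kernlab)

    # coef() returns the lambda_i * y_i of the support vectors, and xmatrix()
    # the support vectors themselves, so for a linear kernel
    w <- colSums(coef(mySVM)[[1]] * xmatrix(mySVM)[[1]])

    # As noted above, kernlab stores the negative of our b, so flip the sign.
    bValue <- -b(mySVM)

    # Classify a new attribute vector z with sign(w . z + b).
    classify <- function(z) sign(sum(w * z) + bValue)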
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Do-It-Yourself Decision Boundary

To generate the same picture as the previous contour plot, one must swap the
columns around as follows. We'll also fill in the training vectors as solid
circles and triangles.

[Figure: scatter plot of the training data with the hand-drawn decision
boundary.]
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Do-It-Yourself Decision Boundary and Contour Plot

Now we can compare the contour plot with the plot showing the boundaries
side-by-side. It does not look like we did a very good job with this
classifier - we've got a lot of circles and triangles on both sides of the
decision boundary - which shouldn't be a big surprise!

[Figure: the kernlab "SVM classification plot" and the do-it-yourself boundary
plot side by side.]
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Have We Done a Good Job?

Recall that we trained our toy data with C = 1 and used the "vanilladot"
kernel, indicating a linear SVM. The training error was reported and is 0.35.
Have we done a good job?

Let's look at the predictions on the labels first using the test set.

             myPred
    labelTest -1  1
           -1 35 14
            1 29 22
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Have We Done a Good Job?

Next, we can compute the accuracy of the model. This is computed in a fairly
straightforward manner:

    Accuracy = Σ δ(myPred, labelTest) / |labelTest|,

where δ(x, y) = 1 if x = y and 0 if x ≠ y is the Kronecker delta function.

[1] 0.57

The test error (or training error when using training data) is found in a
similar manner, where we count the differences.

[1] 0.43
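A minimal sketch of those two computations in R (assuming myPred holds the
predicted labels and labelTest the true labels, as above):

    # Accuracy: proportion of predictions that match the true labels.
    accuracy <- mean(myPred == labelTest)
    accuracy

    # Test error: proportion of mismatches, i.e. 1 - accuracy.
    testError <- mean(myPred != labelTest)
    testError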
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Finding a Better Model

However, we might be able to do better by modifying the parameter C. Recall
that we set C = 1. Most people recommend using a grid search to find the
"best" value for C. So, let's assume that 10^{-k} ≤ C ≤ 10^{k} for some
k ∈ N. Remember that smaller C's indicate that we should not penalize
misclassifications too much, while larger C's indicate greater penalties.
Let's use C = 10^5.

> myNewSVM # Look at the summary

Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
 parameter : cost C = 1e+05

Linear (vanilla) kernel function.

Number of Support Vectors : 191

Objective Function Value : -22054852
Training error : 0.303333
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Finding a Better Model

[Figure: "SVM classification plot" for the refitted model.]

In the figure, we have the training results when C is 1e+05 for a (linear,
since we've used "vanilladot" for a kernel) SVM. The filled triangles and
circles are the support vectors. Values near zero are close to the decision
boundary.
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Compare the Two Models

It is interesting to compare the two models together to see how radically
things have changed. (The old model is on the left, the new on the right.)
Solid circles/triangles represent support vectors.

[Figure: the "SVM classification plot" for C = 1 and C = 1e+05 side by side.]
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Compare the Two Models

Let's look at the predictions on the labels of the old model:

             myPred
    labelTest -1  1
           -1 35 14
            1 29 22

Accuracy of the old model: 0.57. Test error of the old model: 0.43.

Here are the predictions for the new model:

             myNewPred
    labelTest -1  1
           -1 41  8
            1 29 22

Accuracy of the new model: 0.63. Test error of the new model: 0.37.
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Finding C

For the linear SVM example, we can tune C as follows. Let's try k = 10. Then
we have 2k + 1 = 21 SVMs to consider. Note that the option cross=10 indicates
that 10-fold cross-validation should be used on the full data set.

> k = 10
> mySVMVec = NULL
> for (i in -k:k)
+ {
+   mySVMVec = c(mySVMVec,
+                ksvm(rbind(posClass, negClass), labels,
+                     type="C-svc", kernel='vanilladot',
+                     C=10^i, scaled=c(), kpar=list(),
+                     cross=10)
+                )
+ }
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel - Finding C

Now, let's take a look at the cross-validation errors generated by the various
C's.

> xCross = -k:k
> yCross = NULL
> for (i in 1:(2*k+1))
+ {
+   yCross = c(yCross, cross(mySVMVec[[i]]))
+ }
> yCross
 [1] 0.5600 0.5775 0.5750 0.5750 0.5825 0.5725 0.5850
 [8] 0.5625 0.4225 0.4275 0.4325 0.4200 0.4350 0.4375
[15] 0.4525 0.3775 0.3675 0.4150 0.4725 0.4275 0.5100
Support Vector Machines
Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Finding C

We might also generate a simple plot as C varies. Choose the smallest C that minimizes the CV error.

[Figure: log10(C) vs CV Error]
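A minimal sketch of that plot and of picking the corresponding C, assuming the xCross and yCross vectors from the previous slide; which.min() returns the first minimizer, which here corresponds to the smallest such C.

# Plot CV error against log10(C) and select the best C (sketch).
plot(xCross, yCross, type = "b",
     xlab = "log10(C)", ylab = "CV-Error", main = "log10(C) vs CV Error")
bestC <- 10^xCross[which.min(yCross)]   # smallest C attaining the minimum CV error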
Support Vector Machines
Non-Linearly Separable

[Figure: scatter plot of a two-class data set in the (x, y) plane that is not linearly separable]
Support Vector Machines
Non-Linearly Separable Using (x1, x2) ↦ (x1², x2²)

[Figure: the same data after the squaring transformation, plotted in the transformed coordinates]
Support Vector Machines
Non-Linearly Separable – The Idea

Transform the input space (non-linearly separable) using a function, φ, into what we will call the feature space (linearly separable). In the feature space, we have:

The linear decision boundary is given by ~w · φ(~xi) + b = 0.

Learning: minimize ||~w||²/2 over ~w subject to yi[~w · φ(~xi) + b] ≥ 1, for i = 1, 2, . . . , N where yi = ±1.

The Dual Lagrangian is
LD: Σ_{i=1}^N λi − (1/2) Σ_{i=1}^N Σ_{j=1}^N λi λj yi yj φ(~xi) · φ(~xj).

Then ~w = Σ_{i=1}^N λi yi φ(~xi) and λi [yi(~w · φ(~xi) + b) − 1] = 0.

Classification:
f(~z) = sign[~w · φ(~z) + b] = sign[ Σ_{j=1}^N λj yj φ(~xj) · φ(~z) + b ].
Support Vector Machines
Non-Linearly Separable – Problems

What mapping function should be used to generate a linear decision boundary?
The transformed space is likely to be a very high (infinite?) dimensional space.
Finding the λi's in the transformed space is likely to be computationally expensive.
Lots of dot products are involved in classification.
Support Vector Machines
Non-Linearly Separable – The Kernel Trick

Theorem (Mercer's Theorem)
A kernel function, K, can be expressed as
K(~u, ~v) = φ(~u) · φ(~v)
if and only if for all functions g such that ∫ [g(x)]² dx < ∞ we have
∫∫ K(~x, ~y) g(~x) g(~y) d~x d~y ≥ 0.

Such kernel functions are called positive definite kernel functions.
Support Vector Machines
Non-Linearly Separable – The Application of Mercer's Theorem

With such a positive definite kernel function, K, we compute b via
λi [ yi ( Σ_{j=1}^N λj yj K(~xi, ~xj) + b ) − 1 ] = 0
(for λi > 0) and classify via
f(~z) = sign[ Σ_{i=1}^N λi yi φ(~xi) · φ(~z) + b ] = sign[ Σ_{i=1}^N λi yi K(~xi, ~z) + b ].

Notes:
We no longer need the exact form of the mapping function, φ.
We don't need to compute dot products in a transformed space!
We avoid high-dimensional spaces since we remain in the original space!
Support Vector Machines
Non-Linearly Separable – Kernel Facts

If K1 and K2 are kernels, then the following functions are kernels:
K(~x, ~y) = K1(~x, ~y) + K2(~x, ~y).
K(~x, ~y) = αK1(~x, ~y), for α ∈ R+ = (0, ∞).
K(~x, ~y) = K1(~x, ~y) K2(~x, ~y).
Support Vector Machines
Kernels in R

Let u, v ∈ Rⁿ and ⟨u, v⟩ be any inner product on Rⁿ.

Polynomial: k(u, v) = (α⟨u, v⟩ + β)ⁿ, where α is a scale parameter, β is an offset, and n is the degree.
  Linear: α = n = 1, β = 0. Useful when dealing with large sparse data vectors as can be found in text mining.
  Polynomial: used in image processing.
Gaussian Radial Basis Function (RBF): k(u, v) = e^(−||u−v||²/(2σ²)). General purpose kernel used when there is no pressing reason to pick another kernel.
Hyperbolic Tangent: k(u, v) = tanh(α⟨u, v⟩ + β). α is a scale parameter, and β is an offset parameter. Note that not every parameter choice generates a valid kernel!
Support Vector Machines
Kernels in R

Let u, v ∈ Rⁿ and ⟨u, v⟩ be any inner product on Rⁿ.

Bessel function of the first kind: k(u, v) = Bessel^n_{ν+1}(σ||u − v||) / (||u − v||)^(−n(ν+1)). Another general purpose kernel. Note that n ∈ N is the degree of the Bessel function, σ is the inverse kernel width, and ν is from the second order differential equation which gives rise to the Bessel function.
Laplace Radial Basis Function (RBF): k(u, v) = e^(−σ||u−v||). Another general purpose kernel.
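The kernlab package (which provides ksvm) exposes these kernels as constructor functions whose return values can be evaluated directly on a pair of vectors. A minimal sketch follows; the parameter values are illustrative only. Note that kernlab's rbfdot parameterizes the Gaussian as exp(−σ||u − v||²) rather than with the 1/(2σ²) form above.

library(kernlab)

u <- c(1, 2); v <- c(2, 0)

kLin  <- vanilladot()                                   # linear kernel <u, v>
kPoly <- polydot(degree = 2, scale = 1, offset = 1)     # polynomial kernel
kRBF  <- rbfdot(sigma = 0.5)                            # Gaussian RBF kernel
kTanh <- tanhdot(scale = 1, offset = 1)                 # hyperbolic tangent kernel
kLap  <- laplacedot(sigma = 0.5)                        # Laplace RBF kernel
kBes  <- besseldot(sigma = 0.5, order = 1, degree = 1)  # Bessel kernel

kRBF(u, v)   # evaluate one kernel on a pair of points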
Support Vector Machines
Non-Linearly Separable – Soft Margins

In a similar manner, include slack variables ξi to deal with the case when the transformed space is not linearly separable.
Support Vector Machines
Non-Linearly Separable

Everything that we've talked about (with the obvious changes since we have non-linear rather than linear) applies. However, there might be more than the one parameter, C, to consider. For example, the Gaussian allows C and σ to vary.

> mySVMGauss = ksvm(trainData, labelTrain, type="C-svc",
+                   kernel='rbf', kpar=list(sigma=1),
+                   C=myC, scaled=c())
Support Vector Machines
Non-Linearly Separable

Everything that we've talked about (with the obvious changes since we have non-linear rather than linear) applies. However, there might be more than one parameter to consider. The Gaussian allows C and σ to vary. For example, we can compare the linear (left) and the Gaussian (right) below.

[Figure: SVM classification plots for the linear kernel (left) and the Gaussian kernel (right)]

It is pretty clear that the decision boundary curves quite a bit in the Gaussian (and can't curve in the linear).
Support Vector Machines
Non-Linearly Separable

Have we done a good job? Let's look at the predictions on the labels from both methods.

         myNewPred
labelTest -1  1
       -1 41  8
        1 29 22

         myPredGauss
labelTest -1  1
       -1 39 10
        1  5 46

                Linear    Gaussian
Training Error  0.303333  0.123333
Accuracy        0.63      0.85
Test Error      0.37      0.15

It looks like the nonlinear model is a better classifier. However, we need to remember that there are two parameters to tune, C and σ!
Support Vector Machines
Non-Linearly Separable

This time, let's let R choose σ for us. For now, let's keep C = 1.

Using automatic sigma estimation (sigest) for RBF or laplace kernel
Support Vector Machine object of class "ksvm"

SV type: C-svc  (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.35671985783464

Number of Support Vectors : 139
Objective Function Value : -109.8726
Training error : 0.133333
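A minimal sketch of the call that would produce a summary like the one above, assuming the trainData and labelTrain objects used on the earlier slides: leaving kpar unspecified makes ksvm() estimate sigma automatically via sigest().

mySVMGaussAuto <- ksvm(trainData, labelTrain, type = "C-svc",
                       kernel = 'rbf', C = 1, scaled = c())
mySVMGaussAuto   # printing the fitted object reports the estimated sigma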
Support Vector Machines
Non-Linearly Separable – R Chooses σ

[Figure: SVM classification plot for the Gaussian kernel with the automatically estimated σ]

Notice that some of the support vectors have changed.
Support Vector Machines
Non-Linearly Separable

         myNewPred
labelTest -1  1
       -1 41  8
        1 29 22

         myPredGauss
labelTest -1  1
       -1 39 10
        1  5 46

         myPredGauss2
labelTest -1  1
       -1 38 11
        1  6 45

                Linear    Gaussian  Gaussian w/Auto σ
Training Error  0.303333  0.123333  0.133333
Accuracy        0.63      0.85      0.83
Test Error      0.37      0.15      0.17
Support Vector Machines
Non-Linearly Separable

However, we might be able to do better by modifying the parameter C. Recall that we set C = 1. Most people recommend using a grid search to find the "best" value for C. So, let's assume that 10^(−k) ≤ C ≤ 10^k. We now try C = 5.

Using automatic sigma estimation (sigest) for RBF or laplace kernel
Support Vector Machine object of class "ksvm"

SV type: C-svc  (classification)
 parameter : cost C = 5

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.320905142072364

Number of Support Vectors : 123
Objective Function Value : -470.6818
Training error : 0.133333
Support Vector Machines
Non-Linearly Separable

[Figure: SVM classification plot for the Gaussian kernel with auto σ and C = 5]

Notice that some of the support vectors have changed.
Support Vector Machines
Non-Linearly Separable

         myNewPred
labelTest -1  1
       -1 41  8
        1 29 22

         myPredGauss
labelTest -1  1
       -1 39 10
        1  5 46

         myPredGauss2
labelTest -1  1
       -1 38 11
        1  6 45

         myPredGauss3
labelTest -1  1
       -1 38 11
        1  5 46

                Linear    Gaussian  Gaussian w/Auto σ  Gaussian w/Auto σ, C = 5
Training Error  0.303333  0.123333  0.133333           0.133333
Accuracy        0.63      0.85      0.83               0.84
Test Error      0.37      0.15      0.17               0.16
Support Vector Machines
Non-Linearly Separable

For ease of comparison, here are the four contour plots:

[Figure: SVM classification plots for the linear kernel, the basic Gaussian, the basic Gaussian with auto σ, and the Gaussian with C = 5 and auto σ]
Support Vector Machines
Non Linear: Simple Example Using a Gaussian Kernel – Finding C

For the Gaussian SVM example, we can tune C as follows. We'll let R guess the best σ. Let's try k = 10 and search over 10^(−k) ≤ C ≤ 10^k. Then we have 2k + 1 = 21 SVMs to consider.

> k = 10
> myNonLinearSVMVec = NULL
> for (i in -k:k)
+ {
+   myNonLinearSVMVec = c(myNonLinearSVMVec,
+                         ksvm(trainData, labelTrain, type="C-svc",
+                              kernel='rbf', C=10^i, scaled=c()))
+ }
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Support Vector Machines
Non Linear: Simple Example Using a Gaussian Kernel – Finding C

Now, let's take a look at the test errors generated by the various values of C.

> xNonLinear = -k:k
> yNonLinear = NULL
> for (i in 1:(2*k+1))
+ {
+   myPredNonLinGauss = predict(myNonLinearSVMVec[[i]], trainTest)
+   contTable = table(labelTest, myPredNonLinGauss)
+   testError = (sum(contTable) - sum(diag(contTable)))/sum(contTable)
+   yNonLinear = c(yNonLinear, testError)
+ }
> yNonLinear
 [1] 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.18
[11] 0.17 0.16 0.16 0.16 0.19 0.18 0.25 0.39 0.46 0.42
[21] 0.36
Support Vector Machines
Non Linear: Simple Example Using a Gaussian Kernel – Finding C

We might also generate a simple plot as C varies. Choose the smallest C that minimizes the test error.

[Figure: log10(C) vs Test Error]
Support Vector Machines
Non Linear: Simple Example Using a Gaussian Kernel – The Best C

The "best" C = 10^1. The confusion matrix for this C is

         myPredNonLinGauss
labelTest -1  1
       -1 38 11
        1  5 46
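A minimal sketch of how this best C and its confusion matrix might be extracted, assuming xNonLinear, yNonLinear, myNonLinearSVMVec, trainTest, and labelTest from the previous slides; which.min() again returns the first minimizer, which corresponds to the smallest such C.

bestIndex <- which.min(yNonLinear)               # index of the minimum test error
bestC     <- 10^xNonLinear[bestIndex]            # here, 10^1
bestPred  <- predict(myNonLinearSVMVec[[bestIndex]], trainTest)
table(labelTest, bestPred)                       # confusion matrix for the best C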
Support Vector Machines
Non-Linearly Separable – The Best C

[Figure: SVM classification plot for the Gaussian kernel with auto σ and the best C = 10^1]

Notice that some of the support vectors have changed.
Support Vector Machines
Non-Linearly Separable – Summaries

         myNewPred
labelTest -1  1
       -1 41  8
        1 29 22

         myPredGauss
labelTest -1  1
       -1 39 10
        1  5 46

         myPredGauss2
labelTest -1  1
       -1 38 11
        1  6 45

         myPredGauss3
labelTest -1  1
       -1 38 11
        1  5 46

         myPredNonLinGauss
labelTest -1  1
       -1 38 11
        1  5 46
Support Vector Machines
Non-Linearly Separable – Summaries

                Linear    Gaussian  Gaussian w/Auto σ
Training Error  0.303333  0.123333  0.133333
Accuracy        0.63      0.85      0.83
Test Error      0.37      0.15      0.17

                Gaussian w/Auto σ, C = 5  Gauss. w/Auto σ, C = 10^1
Training Error  0.133333                  0.123333
Accuracy        0.84                      0.84
Test Error      0.16                      0.16
ROC and Precision-Recall Curves
Class Imbalance Problem

For inspections (fraud detection), you hope the # of problems is small relative to the number of successes.
This gives rise to unbalanced data sets.
The rare instances are susceptible to noisy data.
Accuracy is a bad measure for rare cases.
ROC and Precision-Recall Curves
Class Imbalance Problem

                    Predicted Class
                    +     −
Actual Class   +    TP    FN    P = TP + FN
               −    FP    TN    N = FP + TN

True Positive TP: # of positive examples correctly predicted by a classification model.
False Negative FN: # of positive examples wrongly predicted as negative.
False Positive FP: # of negative examples wrongly predicted as positive.
True Negative TN: # of negative examples correctly predicted.

Accuracy = Correct / Total = (TP + TN) / (TP + FP + FN + TN).
ROC and Precision-Recall Curves
Class Imbalance Problem

                    Predicted Class
                    +     −
Actual Class   +    TP    FN    P = TP + FN
               −    FP    TN    N = FP + TN

Sensitivity or True Positive Rate = TPR = TP / (TP + FN) is the fraction of positive examples predicted correctly.
Specificity or True Negative Rate = TNR = TN / (TN + FP) is the fraction of negative examples predicted correctly.
False Positive Rate = FPR = FP / (TN + FP) is the fraction of negative examples predicted as a positive class.
False Negative Rate = FNR = FN / (TP + FN) is the fraction of positive examples predicted as a negative class.
ROC and Precision-Recall Curves
Class Imbalance Problem

                    Predicted Class
                    +     −
Actual Class   +    TP    FN    P = TP + FN
               −    FP    TN    N = FP + TN

Recall or Sensitivity = r = TPR = TP / (TP + FN) is the fraction of positive examples predicted correctly.
Precision = p = TP / (TP + FP) is the fraction of records that actually turn out to be positive in the group the classifier has declared as a positive class.
ROC and Precision-Recall Curves
Class Imbalance Problem

↑ Recall = r = TP / (TP + FN) = 1 / (1 + FN/TP) ⇐⇒ ↓ FN or ↑ TP.
↑ Precision = p = TP / (TP + FP) = 1 / (1 + FP/TP) ⇐⇒ ↓ FP or ↑ TP.
A model that declares every record to be the positive class has perfect recall r = 1, but has poor precision p < 1 due to high FP.
A model that assigns a positive class to every test record that matches one of the positive training set records has high precision p ≈ 1 but low recall r < 1, since many positive examples in the test set which fail to appear in the training set can be classified as negative, i.e. FN > 0.
Goal: Maximize Precision and Recall.
ROC and Precision-Recall Curves
Class Imbalance Problem

                    Predicted Class
                    +     −
Actual Class   +    TP    FN    P = TP + FN
               −    FP    TN    N = FP + TN

Recall = r = TP / (TP + FN) and Precision = p = TP / (TP + FP).

F1 Measure: F1 = 2 / (1/r + 1/p) = 2rp / (r + p) = 2TP / (2TP + FP + FN). (The first form is the harmonic mean of r and p.)
Note that the harmonic mean tends to be closer to the smaller number.
↑ F1 ⇐⇒ ↑ r and ↑ p.
Fβ Measure: Fβ = (β² + 1)rp / (r + β²p) = (β² + 1)TP / ((β² + 1)TP + FP + β²FN).
F0 = p and F∞ = r.
Fβ measures a trade-off between precision and recall.
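A minimal sketch of computing these quantities in R from a 2 × 2 confusion matrix; the counts used here are illustrative (they happen to match the worked example a few slides below).

confusion <- matrix(c(50, 20, 25, 55), nrow = 2,
                    dimnames = list(actual = c("+", "-"), predicted = c("+", "-")))
TP <- confusion["+", "+"]; FN <- confusion["+", "-"]
FP <- confusion["-", "+"]; TN <- confusion["-", "-"]

recall    <- TP / (TP + FN)
precision <- TP / (TP + FP)
F1        <- 2 * recall * precision / (recall + precision)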
ROC and Precision-Recall Curves
Class Imbalance Problem

                    Predicted Class
                    +     −
Actual Class   +    TP    FN    P = TP + FN
               −    FP    TN    N = FP + TN

Weighted accuracy = (w1 TP + w4 TN) / (w1 TP + w2 FP + w3 FN + w4 TN).

Measure     w1      w2   w3   w4
Recall      1       0    1    0
Precision   1       1    0    0
Fβ          β² + 1  1    β²   0
Accuracy    1       1    1    1

Thus, weighted accuracy captures recall, precision, Fβ, and accuracy as special cases and measures the trade-offs between them.
ROC and Precision-Recall Curves
ROC

Definition
A receiver operating characteristic curve, or ROC curve, is a plot of the sensitivity or true positive rate on the vertical axis vs the false positive rate on the horizontal axis. Each point on the curve represents a model induced by the classifier.

[Figure: an example ROC curve, True Positive Rate vs False Positive Rate]
ROC and Precision-Recall Curves
ROC Curve

[Figure: ROC axes with the diagonal y = x shown dashed]

(0, 0) represents a model which predicts every instance to be a negative class.
(1, 1) represents a model which predicts every instance to be a positive class.
(0, 1) represents a perfect model: TPR = 1 and FPR = 0.
y = x (dashed line) represents a model which classifies a record as positive with some fixed probability, i.e. random guessing!
ROC and Precision-Recall Curves
ROC Curve

[Figure: ROC axes annotated with the conservative and liberal regions]

Towards the bottom left corner ("conservative classifiers"): positive classifications are made only with strong evidence. Hence there are few FP, but a low TP as well.
Towards the top right corner ("liberal classifiers"): positive classifications are made with weak evidence. Therefore most positives are classified correctly, but the FP count is high.
If your classifier falls below the curve, this is bad. However, you can swap your prediction labels, i.e. swap TP and FN, and swap FP and TN, to reflect the classifier above the curve.
The area under the ROC curve, AUC, can be used to tell which classifier, on average, is better. An AUC ≤ 0.5 is bad!
ROC and Precision-Recall Curves
ROC Curve

The prediction scores are computed for the two "best" models, one linear and one based upon a Gaussian. Note the use of the type="decision" option in the predict function.

> myNewPredLinear = predict(myNewSVM, trainTest,
+                           type="decision")
> myNewPredNonLinGauss = predict(
+   myNonLinearSVMVec[[which.min(yNonLinear)]],
+   trainTest, type="decision")

We can also compute some ROC curves.

> library(ROCR)
> predLinearROC = prediction(myNewPredLinear, labelTest)
> predGaussROC = prediction(myNewPredNonLinGauss,
+                           labelTest)
> perLinearROC = performance(predLinearROC,
+                            measure = "tpr", x.measure = "fpr")
> perGaussROC = performance(predGaussROC,
+                           measure = "tpr", x.measure = "fpr")
ROC and Precision-Recall Curves
ROC Curve

Now, plot the ROC curves.

[Figure: ROC curves for the linear model (left) and the Gaussian model (right)]
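A minimal sketch of producing the two panels above, assuming the perLinearROC and perGaussROC performance objects from the previous slide; ROCR performance objects have their own plot method.

par(mfrow = c(1, 2))                      # two plots side by side
plot(perLinearROC, main = "Linear Model")
plot(perGaussROC,  main = "Gaussian Model")
par(mfrow = c(1, 1))                      # restore the plotting layout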
ROC and Precision-Recall Curves
ROC Curve

Of course, we can plot multiple ROC curves on the same axes.

[Figure: both ROC curves on one set of axes; linear is blue, Gaussian is red]
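A minimal sketch of the overlay, again assuming perLinearROC and perGaussROC; ROCR's plot method accepts add = TRUE for drawing on existing axes.

plot(perLinearROC, col = "blue", main = "Linear is Blue, Gaussian is Red")
plot(perGaussROC,  col = "red", add = TRUE)
abline(0, 1, lty = 2)   # the random-guessing diagonal, for reference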
ROC and Precision-Recall Curves
ROC Curve – Problems

                    Predicted Class
                    +     −
Actual Class   +    TP    FN    P = TP + FN
               −    FP    TN    N = FP + TN

ROC curves are not very sensitive to changes in the class distributions. (Few − and many + vs equal numbers of − and +.)
So, if the proportion of positive to negative instances changes, the ROC curve will not detect this.
This is true because ROC curves use ratios taken within rows and so do not depend upon the class distribution:
TPR = TP / (TP + FN) and FPR = FP / (TN + FP).
Other measures, such as precision p = TP / (TP + FP), use values from both rows and are sensitive to class skews. (Accuracy and Fβ are also sensitive to class skews.)
ROC and Precision-Recall Curves
ROC Curve – Problems

                    Predicted Class
                    +          −
Actual Class   +    TP = 50    FN = 25    P = TP + FN
               −    FP = 20    TN = 55    N = FP + TN

Ratio of + to − is 75/75 = 1.
TPR = TP / (TP + FN) = 50 / (50 + 25) = 2/3 and FPR = FP / (FP + TN) = 20/75 = 4/15.
Precision = p = TP / (TP + FP) = 50/70 = 5/7.

                    Predicted Class
                    +          −
Actual Class   +    TP = 50    FN = 25    P = TP + FN
               −    FP = 200   TN = 550   N = FP + TN

Ratio of + to − is 75/750 = 1/10.
TPR = TP / (TP + FN) = 50 / (50 + 25) = 2/3 and FPR = FP / (FP + TN) = 200/750 = 4/15.
Precision = p = TP / (TP + FP) = 50/250 = 1/5.

Note that TPR and FPR are unchanged by the change in class distribution, while the precision drops from 5/7 to 1/5.
ROC and Precision-Recall Curves
Precision-Recall Graphs

Definition
A precision-recall graph uses precision for the vertical axis and recall for the horizontal axis.

(0, 0) is bad.
(1, 1) is the goal.
When recall is zero, it is possible for precision to be undefined. So, when creating your own precision-recall graphs, never let recall become zero.
ROC and Precision-Recall Curves
Connections between ROC Curves and Precision-Recall Graphs

Theorem
For a fixed data set, Model 1 dominates Model 2 in ROC space if and only if Model 1 dominates Model 2 in precision-recall space.
ROC and Precision-Recall Curves
Precision-Recall Graphs

Next, plot the precision/recall curves.

[Figure: precision-recall curves for the linear model (left) and the Gaussian model (right)]
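A minimal sketch of how these panels might be produced with ROCR, assuming the predLinearROC and predGaussROC prediction objects from the ROC slides; "prec" and "rec" are the ROCR measure names for precision and recall.

perLinearPR <- performance(predLinearROC, measure = "prec", x.measure = "rec")
perGaussPR  <- performance(predGaussROC,  measure = "prec", x.measure = "rec")
par(mfrow = c(1, 2))
plot(perLinearPR)
plot(perGaussPR)
par(mfrow = c(1, 1))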
ROC and Precision-Recall Curves
Precision-Recall Graphs

Again, let's stack the plots.

[Figure: both precision-recall curves on one set of axes; linear is blue, Gaussian is red]
k-Nearest Neighbors
Eager vs Lazy Learners

An eager learner maps the input attributes to the class label as soon as training data are available. Decision trees are examples of this type of learning style.
A lazy learner delays the process of modeling the training data until it is needed to classify the test data. k-nearest neighbors is a lazy learner.
k-Nearest Neighbors
Definition

Definition
The k-nearest neighbors strategy determines the k points closest to an unknown point z. The test example is classified based upon the majority of the class labels of the k nearest neighbors.

[Figure: a small two-class example in the (x, y) plane]

Goal: classify the green circle as a blue triangle or red square.
1-NN: The point is a triangle.
2-NN: Can't classify the point.
3-NN: The point is a square.

Note that "close" means we're using some measure of similarity or dissimilarity based upon the attribute(s) being measured.
k-Nearest Neighbors
Problems

[Figure: a two-class example in the (x, y) plane]

Goal: classify the green circle as a blue triangle or red square.
If k is too small, then the nearest neighbor algorithm is susceptible to over-fitting due to noise.
If k is too large, then it may misclassify because it includes data too far away from its neighborhood.

One way to minimize the contributions of distant points is to use distance-weighted voting. Given the k nearest neighbors and their associated labels as a set, Dk, the function δ(yi, v) returns 1 if the label yi matches the candidate label v and 0 otherwise, and ARGMAX returns the label, y, with the largest weighted vote:

y = ARGMAX over v ∈ Labels of Σ over (~xi, yi) ∈ Dk of (1 / d(~xi, ~z)²) δ(yi, v).
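A minimal sketch of distance-weighted voting for a single query point, assuming a numeric training matrix trainX, a factor of labels trainY, and a numeric query vector z; the function and argument names here are made up for illustration.

weightedKnn <- function(trainX, trainY, z, k = 3) {
  d  <- sqrt(rowSums(sweep(trainX, 2, z)^2))   # Euclidean distances to z
  nn <- order(d)[1:k]                          # indices of the k nearest points
  w  <- 1 / d[nn]^2                            # inverse squared-distance weights
  votes <- tapply(w, trainY[nn], sum)          # weighted vote for each label
  names(which.max(votes))                      # label with the largest vote
}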
k-Nearest Neighbors
Example

Here, we will use the package "class" and the R function "knn" for classification. Function "knn" takes several arguments.

train: matrix or data frame of training set cases.
test: matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case.
cl: factor of true classifications of the training set.
k: number of neighbors considered.
l: minimum vote for a definite decision, otherwise doubt. (More precisely, less than k − l dissenting votes are allowed, even if k is increased by ties.)
prob: if this is true, the proportion of the votes for the winning class are returned as attribute prob.
use.all: controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbors.
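A minimal sketch of calling knn, assuming an iris training/test split stored in trainSet and testSet with the four measurements in columns 1–4 and the species label in column 5.

library(class)
irisKnn <- knn(train = trainSet[, 1:4], test = testSet[, 1:4],
               cl = trainSet[, 5], k = 3, prob = TRUE)
table(testSet[, 5], irisKnn)   # confusion matrix on the held-out test set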
k-Nearest Neighbors
Example: 3-Nearest Neighbors

The iris data set is labeled Setosa, Versicolor, and Virginica. In this example, we will classify a rectangular grid rather than holding back some data. Only two columns are chosen so that a contour plot can be constructed. Here is the decision boundary.

[Figure: Iris Data, showing the 3-NN decision boundary over Petal.Length (horizontal axis) and Petal.Width (vertical axis)]
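A minimal sketch of how a figure like this might be produced: classify a rectangular grid over Petal.Length and Petal.Width with 3-NN and color the grid by predicted class. The grid resolution and plotting choices are assumptions.

library(class)
grid <- expand.grid(Petal.Length = seq(0, 10, length.out = 200),
                    Petal.Width  = seq(0, 3,  length.out = 200))
gridClass <- knn(train = iris[, c("Petal.Length", "Petal.Width")],
                 test = grid, cl = iris$Species, k = 3)
plot(grid$Petal.Length, grid$Petal.Width, col = as.integer(gridClass), pch = ".",
     xlab = "Petal.Length", ylab = "Petal.Width", main = "Iris Data")
points(iris$Petal.Length, iris$Petal.Width,
       pch = 19, col = as.integer(iris$Species))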
k-Nearest Neighbors
Example: 3-Nearest Neighbors

Now, we utilize the full iris data set. How did we do?

            testLabels
irisClass2   setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         17         0
  virginica       0          3        20

Next, we can compute the accuracy of the model.
[1] 0.95
Of course, the test-error is
[1] 0.05
k-Nearest Neighbors
Example - Finding the best k

We compute the test error for a variety of k. Now, let's plot our values.

[Figure: Test Error vs k, for k from 1 to 100]
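A minimal sketch of the loop behind such a plot, again assuming the trainSet/testSet split (attributes in columns 1–4, species in column 5); the exact range of k is an assumption.

library(class)
ks <- 1:100
testErr <- sapply(ks, function(k) {
  pred <- knn(train = trainSet[, 1:4], test = testSet[, 1:4],
              cl = trainSet[, 5], k = k)
  mean(pred != testSet[, 5])   # fraction of test cases misclassified
})
plot(ks, testErr, type = "l", xlab = "k", ylab = "Test Error")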
k-Nearest Neighbors
Example - Find the Best K

We can find the best k by using the "which.min(vector)" command. So, in this region, our best model corresponds to k = 10, which corresponds to a test error of 0.033333. If we train this model, we obtain

            testLabels
bestKNN      setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         18         0
  virginica       0          2        20

Thus, our accuracy is 0.966667. Of course, we already knew this from the test error 0.033333.