Classification and Supervised Learning
Credits
Hand, Mannila and Smyth
Cook and Swayne
Padhraic Smyth’s notes
Shawndra Hill notes
Outline
• Supervised Learning Overview
• Linear Discriminant Analysis
• Tree models
• Probability-based and Bayes models
Classification
• Classification, or supervised learning
– prediction for a categorical response
• for a binary (T/F) response, it can be used as an alternative to logistic regression
• the response is often a quantized real value or an unscaled numeric code
– can be used with categorical predictors
– great for missing data - missingness can be a response in itself!
– methods for fitting can be
• parametric
• algorithmic
• Because labels are known, you can build parametric models for the classes
• can also define decision regions and decision boundaries
Examples of classifiers
• Generative / class-conditional / probabilistic, based on p( x | ck )
– Naïve Bayes (simple, but often effective in high dimensions)
– Parametric generative models, e.g., Gaussian - linear discriminant analysis
• Regression-based, based on p( ck | x )
– Logistic regression: simple, linear in “odds” space
– Neural network: non-linear extension of logistic
• Discriminative models, focus on locating optimal decision boundaries
– Decision trees: “swiss army knife”, often effective in high dimensions
– Linear discriminants
– Support vector machines (SVM): generalization of linear discriminants; can be quite effective, though computational complexity is an issue
– Nearest neighbor: simple, but can scale poorly in high dimensions
Evaluation of Classifiers
• Already seen some of this…
• Assume the output is a probability vector over the classes
• Classification error
– the misclassification rate, P(predicted class ≠ true class)
• ROC area
– area under the ROC plot
• top-k analysis
– sometimes all you care about is how well you can do at the top of the list
• plan A: top 50 candidates have 44 sales, top 500 have 300 sales
• plan B: top 50 have 48 sales, top 500 have 270 sales
• which do you choose?
– often used with imbalanced class distributions, where good classification error is easy!
• fraud, etc.
• calibration is sometimes important
– if you say something has a 90% chance, does it happen 90% of the time?
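A minimal sketch of these measures in R; the vectors 'truth' (0/1 labels) and 'score' (predicted class-1 probabilities) are hypothetical names for illustration:

truth <- c(0, 0, 1, 1, 1, 0, 1, 0)          # hypothetical true labels
score <- c(0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2)  # predicted P(class 1)

pred <- as.numeric(score > 0.5)   # classify at the 0.5 threshold
mean(pred != truth)               # classification (misclassification) error

# ROC area via the rank-sum identity:
# AUC = P(a random positive scores higher than a random negative)
r  <- rank(score)
n1 <- sum(truth == 1); n0 <- sum(truth == 0)
(sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)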
Linear Discriminant Analysis
• LDA - parametric classification
– Fisher 1936, Rao 1948
– finds a linear combination of variables separating two classes by comparing the difference between class means with the variance within each class
– assumes a multivariate normal distribution for each class (cluster)
– pros:
• easy to define a likelihood
• easy to define the boundary
• easy to measure goodness of fit
• easy interpretation
– cons:
• very rare for data to come close to multivariate normality!
• works only on numeric predictors
• painters data: 54 painters rated on a 0-20 scale for composition, drawing, colour, and expression, and classified into 8 schools (A-H):
                Composition Drawing Colour Expression School
Da Udine                 10       8     16          3      A
Da Vinci                 15      16      4         14      A
Del Piombo                8      13     16          7      A
Del Sarto                12      16      9          8      A
Fr. Penni                 0      15      8          0      A
Guilio Romano            15      16      4         14      A
Michelangelo              8      17      4          8      A
Perino del Vaga          15      16      7          6      A
Perugino                  4      12     10          4      A
Raphael                  17      18     12         18      A
library(MASS)                            # painters data and lda() are in MASS
lda1 <- lda(School ~ ., data = painters) # linear discriminant fit
LDA - predictions
• to check how good the model is, you can see how well it predicts what actually happened:
> predict(lda1)
$class
 [1] D H D A A H A C A A A A A C A B B E C C B E D D D D G D D D D D E D G H E E E F G A F D G A G G E
[50] G C H H H
Levels: A B C D E F G H

$posterior
                      A            B            C            D           E            F
Da Udine   0.0153311094 0.0059952857 0.0105980288 6.717937e-01 0.124938731 2.913817e-03
Da Vinci   0.1023448947 0.1963312180 0.1155149000 4.444461e-05 0.016182391 1.942920e-02
Del Piombo 0.1763906259 0.0142589568 0.0064792116 6.351212e-01 0.102924883 9.080713e-03
Del Sarto  0.4549047647 0.2079127774 0.1459033415 2.166203e-02 0.146171796 3.716302e-03
…

> table(predict(lda1)$class, painters$Sch)
    A B C D E F G H
  A 5 4 0 0 0 1 1 0
  B 0 1 2 0 0 0 0 0
  C 1 1 2 0 0 0 0 1
  D 2 0 0 9 1 0 1 0
  E 0 0 2 0 4 0 1 0
  F 0 0 0 0 0 2 0 0
  G 0 0 0 1 1 1 4 0
  H 2 0 0 0 1 0 0 3
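The resubstitution error rate can be computed directly from the same objects (a quick check, not an honest estimate of future performance):

mean(predict(lda1)$class != painters$School)  # training (resubstitution) error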
Classification (Decision) Trees
• Trees are one of the most popular and useful of all data mining models
• Algorithmic version of classification
– no distributional assumptions
• Competing algorithms: CART, C4.5, DBMiner
• Pros:
– no distributional assumptions
– can handle real and nominal inputs
– speed and scalability
– robustness to outliers and missing values
– interpretability
– compactness of classification rules
• Cons:
– interpretability?
– several tuning parameters to set with little guidance
– decision boundary is non-continuous
Decision Tree Example

[figure sequence: data plotted on two axes, Income (horizontal) and Debt (vertical). The tree is grown one split at a time: Income > t1 first cuts the plane with a vertical boundary at t1; one branch is then split by Debt > t2 (a horizontal boundary at t2); one of the resulting regions is split again by Income > t3.]

Note: tree boundaries are piecewise linear and axis-parallel
Example: Titanic Data
• On the Titanic
– 1313 passengers
– 34% survived
– was it a random sample?
– or did survival depend on features of the individual?
• sex
• age
• class
  pclass survived                                            name     age    embarked    sex
1    1st        1                    Allen, Miss Elisabeth Walton 29.0000 Southampton female
2    1st        0                     Allison, Miss Helen Loraine  2.0000 Southampton female
3    1st        0             Allison, Mr Hudson Joshua Creighton 30.0000 Southampton   male
4    1st        0 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton female
5    1st        1                   Allison, Master Hudson Trevor  0.9167 Southampton   male
6    2nd        1                              Anderson, Mr Harry 47.0000 Southampton   male
Decision trees
• At the first ‘split’, decide which variable best separates the survivors from the non-survivors:
Root: N = 1313, p(survived) = 0.34
Sex?
  Male: N = 850, p = 0.16
    Age?
      Greater than 12: N = 821, p = 0.15
        Class?
          2nd or 3rd: N = 646, p = 0.10
          1st: N = 175, p = 0.31
      Less than 12: N = 29, p = 0.73
  Female: N = 463, p = 0.66
    Class?
      3rd class: N = 213, p = 0.37
      1st or 2nd class: N = 250, p = 0.91

Goodness of split is determined by the ‘purity’ of the leaves
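A sketch of how such a tree could be grown in R with rpart; the data frame 'titanic' and its column names are assumptions for illustration, not a dataset shipped with R:

library(rpart)
# grow a classification tree: class probabilities estimated in each leaf
fit <- rpart(factor(survived) ~ sex + age + pclass,
             data = titanic, method = "class")
plot(fit); text(fit)   # draw the fitted tree with its split labels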
Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– the tree is constructed in a top-down, recursive, divide-and-conquer manner
– at the start, all the training examples are at the root
– examples are partitioned recursively to create pure subgroups
• purity measured by: information gain, Gini index, entropy, etc.
• Conditions for stopping partitioning
– all samples for a given node belong to the same class
– all leaf nodes are smaller than a specified threshold
– BUT: building too big a tree will overfit the data, and will predict poorly
• Predictions:
– each leaf will have class probability estimates (CPE), based on the training data that ended up in that leaf
– majority voting is employed for classifying all members of the leaf
Purity in tree building
• Why do we care about pure subgroups?
– purity of the subgroup gives us confidence that new cases that fall into this “leaf” have a given label
Purity measures
• If a data set T contains examples from n classes, the gini index gini(T) is defined as

      gini(T) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in T.
• If T is split into two subsets T1 and T2, with sizes N1 and N2 respectively, the gini index of the split data is defined as

      gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)

  For the Titanic split on sex: 850/1313 × (1 − 0.16² − 0.84²) + 463/1313 × (1 − 0.66² − 0.34²) ≈ 0.33
• The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
• Another often-used measure: entropy

      − Σ_i p_i log₂ p_i
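Both impurity measures are one-liners in R; a minimal sketch, where p is a vector of class proportions:

gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

# the Titanic sex split from the tree above:
(850/1313) * gini(c(0.16, 0.84)) + (463/1313) * gini(c(0.66, 0.34))
# about 0.33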
Calculating Information Gain
Information Gain = Impurity(parent) − [weighted average Impurity(children)]

Entire population (30 instances):
      impurity = −(14/30) log₂(14/30) − (16/30) log₂(16/30) = 0.996

Balance >= 50K (17 instances):
      impurity = −(13/17) log₂(13/17) − (4/17) log₂(4/17) = 0.787

Balance < 50K (13 instances):
      impurity = −(1/13) log₂(1/13) − (12/13) log₂(12/13) = 0.391

(Weighted) average impurity of children = (17/30)(0.787) + (13/30)(0.391) = 0.615

Information Gain = Entropy(parent) − Entropy(children) = 0.996 − 0.615 = 0.38
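The same arithmetic in R, reusing the entropy() helper sketched above:

parent   <- entropy(c(14/30, 16/30))              # 0.996
children <- (17/30) * entropy(c(13/17, 4/17)) +   # Balance >= 50K branch
            (13/30) * entropy(c(1/13, 12/13))     # Balance <  50K branch
parent - children                                 # information gain, about 0.38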
Information Gain
Information Gain = Impurity(parent) − [Impurity(children)]

[figure: the credit-risk tree. The entire population A (Impurity(A) = 0.996) is split on Balance >= 50K into B (Impurity(B) = 0.787) and C (Impurity(C) = 0.39); the weighted impurity Impurity(B,C) = 0.61 gives Gain = 0.38. Node B is split on Age >= 45 into D and E: D is pure, so Impurity(D) = −1·log₂1 = 0, while Impurity(E) = −(3/7) log₂(3/7) − (4/7) log₂(4/7) = 0.985; the weighted impurity Impurity(D,E) = 0.405 gives Gain = 0.205. Leaves are labelled Bad risk (Default) or Good risk (Not default).]
Information Gain
At each node, choose first the attribute that obtains the maximum information gain: the one providing the maximum information.

[figure: the same credit-risk tree, with Gain = 0.38 for the Balance split and Gain = 0.205 for the Age split]
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– too many branches, some of which may reflect anomalies due to noise or outliers
– the result is poor accuracy on unseen samples
• Two approaches to avoid overfitting
– Prepruning: halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold
• difficult to choose an appropriate threshold
– Postpruning: remove branches from a “fully grown” tree - get a sequence of progressively pruned trees
• use a set of data different from the training data to decide which is the “best pruned tree”
Which attribute to split over?
Brute-force search:
– at each node, examine splits over each of the attributes
– select the attribute for which the maximum information gain is obtained

[figure: a candidate split of Balance into < 50K and >= 50K]
Finding the right size
• Use a hold-out sample (n-fold cross-validation)
• Overfit a tree - with many leaves
• Snip the tree back and use the hold-out sample for prediction; calculate the predictive error
• Record the error rate for each tree size
• Repeat for n folds
• Plot the average error rate as a function of tree size
• Fit the optimal tree size to the entire data set
R note: use cv.tree() from the tree package, as sketched below
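A sketch of the procedure, reusing the painters data from the LDA example:

library(tree)
library(MASS)                              # painters data
big <- tree(School ~ ., data = painters)   # deliberately rich tree
cv  <- cv.tree(big, FUN = prune.misclass)  # CV misclassification by size
plot(cv$size, cv$dev, type = "b")          # error vs number of leaves
pruned <- prune.misclass(big, best = cv$size[which.min(cv$dev)])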
Olive oil data
               X region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic
1 1.North-Apulia      1    1     1075          75     226  7823      672        36        60
2 2.North-Apulia      1    1     1088          73     224  7709      781        31        61
3 3.North-Apulia      1    1      911          54     246  8113      549        31        63
4 4.North-Apulia      1    1      966          57     240  7952      619        50        78
5 5.North-Apulia      1    1     1051          67     259  7771      672        50        80
6 6.North-Apulia      1    1      911          49     268  7924      678        51        70
• classification of Italian olive oils by their components
• 9 areas, from 3 regions
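A sketch of fitting a classification tree to these data; the data frame name 'olives' and its loading step are assumptions for illustration:

library(rpart)
oil.tree <- rpart(factor(area) ~ palmitic + palmitoleic + stearic +
                    oleic + linoleic + linolenic + arachidic,
                  data = olives, method = "class")
table(predict(oil.tree, type = "class"), olives$area)  # confusion matrix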
Regression Trees
• Trees can also be used for regression: when the response is real-valued
– the leaf prediction is a mean value instead of class probability estimates (CPE)
– helpful with categorical predictors
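A sketch of a regression tree on the tips data shown below; the data frame 'tips' and its column names are hypothetical for this illustration:

library(rpart)
tip.tree <- rpart(tip ~ ., data = tips, method = "anova")  # leaf = mean(tip)
plot(tip.tree); text(tip.tree)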
Tips data

[figures: regression trees fitted to the tips data]
Treating Missing Data in Trees
• Missing values are common in practice
• Approaches to handling missing values
– during training
• ignore rows with missing values (inefficient)
– during testing
• send the example being classified down both branches and average the predictions
– replace missing values with an “imputed value”
• Other approaches
– treat “missing” as a unique value (useful if missing values are correlated with the class)
– surrogate splits method
• search for and store “surrogate” variables/splits during training
Other Issues with Classification Trees
• Can use non-binary splits
– multi-way
– linear combinations
– these tend to increase complexity substantially without improving performance
– binary splits are interpretable, even by non-experts
– easy to compute and visualize
• Model instability
– a small change in the data can lead to a completely different tree
– model averaging techniques (like bagging) can be useful
• Restricted to splits along coordinate axes
• Discontinuities in prediction space
Why Trees are widely used in Practice
• Can handle high-dimensional data
– builds a model using one dimension at a time
• Can handle any type of input variables
– categorical, real-valued, etc.
• Invariant to monotonic transformations of input variables
– e.g., using x, 10x + 2, log(x), or 2^x will not change the tree
– so scaling is not a factor - the user can be sloppy!
• Trees are (somewhat) interpretable
– a domain expert can “read off” the tree’s logic
• Tree algorithms are relatively easy to code and test
Limitations of Trees
• Representational bias
– classification: piecewise linear boundaries, parallel to the axes
– regression: piecewise constant surfaces
• High variance
– trees can be “unstable” as a function of the sample
• e.g., a small change in the data -> a completely different tree
– this causes two problems
• 1. high variance contributes to prediction error
• 2. high variance reduces interpretability
– trees are good candidates for model combining
• often used with boosting and bagging
Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and space partitions! Lack of stability against small perturbations of the data.

[Figure from Duda, Hart & Stork, Chap. 8]
Random Forests
• Another con for trees:
– trees are sensitive to the primary split, which can lead the tree in inappropriate directions
– one way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data
• Solution: random forests - an ensemble of unpruned decision trees (see the sketch below)
– each tree is built on a random subset of the training data
– at each split point, only a random subset of predictors is considered
– many parameters to fiddle with!
– the prediction is simply a majority vote of the trees (or the mean prediction of the trees)
• Has the advantages of trees, with more robustness and a smoother decision rule.
• Also, they are trendy!
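A minimal sketch with the randomForest package, again reusing the painters data for illustration (the parameter values are arbitrary):

library(randomForest)
library(MASS)                                   # painters data
rf <- randomForest(School ~ ., data = painters,
                   ntree = 500,  # number of bootstrapped trees
                   mtry = 2)     # random predictors tried at each split
rf$confusion                     # out-of-bag confusion matrix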
Other Models: k-NN
• k-Nearest Neighbors (kNN)
• to classify a new point
– find its k nearest neighbors in the training set, i.e. the circle of radius r around the point that includes these k neighbors
– what is the class distribution within this circle?
• Advantages
– simple to understand
– simple to implement
• Disadvantages
– what is k?
• k = 1: high variance, sensitive to the data
• k large: robust, reduces variance, but blends everything together - includes ‘far away’ points
– what is near?
• Euclidean distance assumes all inputs are equally important
• how do you deal with categorical data?
– no interpretable model
• Best to use cross-validation and visualization techniques to pick k (see the sketch below).
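A sketch with the class package; knn.cv() gives leave-one-out cross-validated predictions, so error rates can be compared across candidate values of k:

library(class)
library(MASS)                  # painters data, as before
X <- scale(painters[, 1:4])    # Euclidean distance needs comparable scales
y <- painters$School
errs <- sapply(1:15, function(k) mean(knn.cv(X, y, k = k) != y))
which.min(errs)                # the k with the lowest LOO-CV error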
Probabilistic (Bayesian) Models for Classification
If you belong to class k, you have a distribution over input vectors: p( x | ck ).

Then, given priors p( ck ) on the classes, we can get the posterior distribution over classes:

      p( ck | x ) = p( x | ck ) p( ck ) / Σ_j p( x | cj ) p( cj )

At each point in the x space, we have a predicted class vector, allowing for decision regions and decision boundaries.
Example of Probabilistic Classification
[figure: class-conditional densities p( x | c1 ) and p( x | c2 ), with the posterior p( c1 | x ) plotted on a 0-1 scale]
Decision Regions and Bayes Error Rate
[figure: class-conditional densities p( x | c1 ) and p( x | c2 ), with the x axis partitioned into alternating regions labelled Class c1 and Class c2]

Optimal decision regions = regions where one class is more likely
Optimal decision regions ⇒ optimal decision boundaries

Bayes error rate = the fraction of examples misclassified by the optimal classifier (the shaded area in the figure). If max_k p( ck | x ) = 1 everywhere, there is no error. Hence:

      p(error) = ∫ [ 1 − max_k p( ck | x ) ] p( x ) dx
Procedure for optimal Bayes classifier
• For each class, learn a model p( x | ck )
– e.g., each class is multivariate Gaussian with its own mean and covariance
• Use Bayes rule to obtain p( ck | x )
=> this yields the optimal decision regions/boundaries
=> use these decision regions/boundaries for classification
• Correct in theory… but practical problems include:
– how do we model p( x | ck )?
– even if we know the model for p( x | ck ), modeling a distribution or density will be very difficult in high dimensions (e.g., p = 100)
• Alternative approach: model the decision boundaries directly
Bayesian Classification: Why?
• Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Naïve Bayes Classifiers
• Generative probabilistic model with a conditional independence assumption on p( x | ck ), i.e.

      p( x | ck ) = ∏_j p( xj | ck )

• Typically used with nominal variables
– real-valued variables are discretized to create nominal versions
• Comments:
– simple to train (just estimate the conditional probabilities for each feature-class pair)
– often works surprisingly well in practice
• e.g., state of the art for text classification; the basis of many widely used spam filters
Naïve Bayes
• When all variables are categorical, classification should in principle be easy, since all possible x vectors can be enumerated. But remember the curse of dimensionality: with p binary predictors there are 2^p distinct cells, far too many to estimate reliably!
Naïve Bayes Classification
Recall:

      p( ck | x ) ∝ p( x | ck ) p( ck )

Now assume the variables are conditionally independent given the classes:

[graphical model: class node C with children x1, x2, …, xp]

Is this a valid assumption? Probably not, but it may still be useful.
• example - symptoms and diseases
Naïve Bayes
Estimate of the probability that a point x belongs to class ck:

      p( ck | x ) ∝ p( ck ) ∏_{j=1}^{p} p( xj | ck )

If there are two classes, taking log posterior odds turns the product into a sum:

      log [ p( c1 | x ) / p( c2 | x ) ] = log [ p( c1 ) / p( c2 ) ] + Σ_j log [ p( xj | c1 ) / p( xj | c2 ) ]

where the summed terms are the “weights of evidence”.
Play-tennis example: estimating P(xi|C)
Outlook   Temperature  Humidity  Windy  Win?
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  Y
rain      mild         high      false  Y
rain      cool         normal    false  Y
rain      cool         normal    true   N
overcast  cool         normal    true   Y
sunny     mild         high      false  N
sunny     cool         normal    false  Y
rain      mild         normal    false  Y
sunny     mild         normal    true   Y
overcast  mild         high      true   Y
overcast  hot          normal    false  Y
rain      mild         high      true   N

P(y) = 9/14
P(n) = 5/14

outlook:
P(sunny|y) = 2/9      P(sunny|n) = 3/5
P(overcast|y) = 4/9   P(overcast|n) = 0
P(rain|y) = 3/9       P(rain|n) = 2/5

temperature:
P(hot|y) = 2/9        P(hot|n) = 2/5
P(mild|y) = 4/9       P(mild|n) = 2/5
P(cool|y) = 3/9       P(cool|n) = 1/5

humidity:
P(high|y) = 3/9       P(high|n) = 4/5
P(normal|y) = 6/9     P(normal|n) = 1/5

windy:
P(true|y) = 3/9       P(true|n) = 3/5
P(false|y) = 6/9      P(false|n) = 2/5
Play-tennis example: classifying X
• An unseen sample X = <rain, hot, high, false>
• P(X|y)·P(y) = P(rain|y)·P(hot|y)·P(high|y)·P(false|y)·P(y) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
• P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
• Sample X is classified in class n (you’ll lose!)
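The same calculation can be run through the e1071 package; the data frame 'tennis' (holding the 14 rows above as factor columns) and its column names are assumptions for illustration:

library(e1071)
nb <- naiveBayes(win ~ outlook + temperature + humidity + windy,
                 data = tennis)
predict(nb, data.frame(outlook = "rain", temperature = "hot",
                       humidity = "high", windy = "false"),
        type = "raw")    # posterior probabilities for y and n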
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables) are often correlated
• Yet, empirically, naïve Bayes performs really well in practice.
Lab #5
• Olive Oil Data
– from the Cook and Swayne book
– consists of the % composition of fatty acids found in the lipid fraction of Italian olive oils. The study was done to determine the authenticity of olive oils.
– region (North, South, and Sardinia)
– area (nine areas)
– 8 fatty acids, as percentages
Lab #5
• Spam Data
• Collected at Iowa State University in 2003 (Cook and Swayne)
– 2171 cases
– 21 variables
• be careful - 3 variables (spampct, category, and spam) were determined by spam models - do not use these for fitting!
• Goal: distinguish spam from valid mail