Topic 9: Advanced Classification
Neural Networks
Support Vector Machines
Credits:
Shawndra Hill
Andrew Moore lecture notes
Data Mining - 2011 - Volinsky - Columbia University
Outline
• Special Topics
– Neural Networks
– Support Vector Machines
Neural Networks Agenda
• The biological inspiration
• Structure of neural net models
• Using neural net models
• Training neural net models
• Strengths and weaknesses
• An example
What the heck are neural nets?
• A data mining algorithm, inspired by biological
processes
• A type of non-linear regression/classification
• An ensemble method
– Although not usually thought of as such
• A black box!
Inspiration from Biology
• Information processing inspired by biological nervous systems
• Structure of the nervous system:
  – A large number of neurons (information processing units) connected together
  – A neuron's response depends on the states of the other neurons it is connected to and on the 'strength' of those connections
  – The 'strengths' are learned based on experience
From Real to Artificial
Nodes: A Closer Look
[Diagram of a single node: input values x1, x2, …, xm enter with weights w1, w2, …, wm; a summing function combines them; a bias b is added; an activation function φ(·) produces the output y]
Nodes: A Closer Look

A node (neuron) is the basic information processing unit of a neural net. It has:
  – A set of inputs with weights w1, w2, …, wm, along with a default input called the bias b
  – An adder function (linear combiner) that computes the weighted sum of the inputs:
        v = Σ_{j=1..m} w_j x_j
  – An activation function (squashing function) φ that transforms v, usually non-linearly:
        y = φ(v + b)
A Simple Node: A Perceptron

• A simple activation function: a signing threshold
        φ(v) = +1 if v >= 0
               -1 if v <  0
[Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn and a bias b feed the sum v, which passes through φ(v) to give the output y]
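For concreteness, here is a minimal Python/NumPy sketch of the perceptron node above; the inputs, weights, and bias are made-up illustrative values, not anything from the slides.

```python
import numpy as np

def perceptron_output(x, w, b):
    """Single perceptron node: weighted sum of inputs plus bias, then a signing threshold."""
    v = np.dot(w, x)                    # adder: v = sum_j w_j * x_j
    return 1 if v + b >= 0 else -1      # activation: phi(v + b)

# Illustrative values (not from the slides)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.2, -0.1])
b = 0.1
print(perceptron_output(x, w, b))       # prints 1 or -1
```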
Common Activation Functions
• Step function
• Sigmoid (logistic) function:
        φ(v) = e^v / (1 + e^v) = 1 / (1 + e^-v)
• Hyperbolic tangent (tanh) function:
        tanh(v) = (e^v - e^-v) / (e^v + e^-v)
• The s-shape adds non-linearity
• [Hornik (1989)]: combining many of these simple functions is sufficient to approximate any continuous non-linear function arbitrarily well over a compact interval.
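As a small illustrative sketch (not course code), the three activation functions as NumPy functions:

```python
import numpy as np

def step(v):
    """Step function: +1 if v >= 0, else -1."""
    return np.where(v >= 0, 1.0, -1.0)

def sigmoid(v):
    """Logistic function: 1 / (1 + e^-v)."""
    return 1.0 / (1.0 + np.exp(-v))

def tanh(v):
    """Hyperbolic tangent: (e^v - e^-v) / (e^v + e^-v)."""
    return np.tanh(v)

v = np.linspace(-5, 5, 11)
print(step(v), sigmoid(v), tanh(v), sep="\n")
```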
Neural Network: Architecture
[Diagram: input layer → hidden layer(s) → output layer]
• Big idea: a combination of simple non-linear models working together to model a complex function
• How many layers? How many nodes? What activation function?
  – Magic
  – Luckily, defaults do well
Neural Networks: The Model
• Model has two components
– A particular architecture
• Number of hidden layers
• Number of nodes in the input, output and hidden layers
• Specification of the activation function(s)
– The associated set of weights
• Weights and complexity are “learned” from the data
– Supervised learning, applied iteratively
– Out-of-sample methods; Cross-validation
Fitting a Neural Net: Feed Forward
• Supply attribute values at input nodes
• Obtain predictions from the output node(s)
– Predicting classes
• Two classes – single output node with threshold
• Multiple classes – use multiple outputs, one for each class
Predicted class = output node with highest value
Multiple class problems are one of the main uses of NN!
A Simple NN: Regression
• A one-node neural network:
  – Called a 'perceptron'
  – Uses the identity function as the activation function
  – What's the output? The weighted sum of the inputs
[Diagram: inputs x1, …, xn with weights w1, …, wn and a bias b feed the sum v, through φ(v), to the output y]
Logistic regression just changes the activation function to the logistic function.
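A small sketch of the point above, with made-up values: the same one-node network gives a linear-regression-style output with the identity activation, and a logistic-regression-style output when the logistic function is swapped in.

```python
import numpy as np

def one_node(x, w, b, activation=lambda v: v):
    """One-node 'network': weighted sum of inputs plus bias, passed through an activation."""
    return activation(np.dot(w, x) + b)

# Illustrative values only
x, w, b = np.array([1.0, 2.0]), np.array([0.3, -0.5]), 0.1

linear_output   = one_node(x, w, b)                                   # identity activation
logistic_output = one_node(x, w, b, lambda v: 1 / (1 + np.exp(-v)))   # logistic activation
print(linear_output, logistic_output)
```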
Training a NN: What does it learn?
• It fits/learns the weights that best translate inputs into outputs, given its architecture
• Hidden units can be thought of as learning some higher-order regularities or features of the inputs that can be used to predict outputs
[Diagram: a "multi-layer perceptron"]
Perceptron Training Rule

Perceptron = Adder + Threshold
1. Start with a random set of small weights.
2. Compute the output for a training example.
3. Change each weight by an amount proportional to the difference between the desired output and the actual output:
        Δw_i = η (D - Y) · I_i
   where η is the learning rate (step size), D the desired output, Y the actual output, and I_i the input.
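A minimal sketch of the training rule Δw_i = η (D - Y) I_i in Python; the learning rate, number of epochs, and toy data are illustrative assumptions, not values from the slides.

```python
import numpy as np

def train_perceptron(X, D, eta=0.1, epochs=20):
    """Perceptron training rule: start with small random weights, then for each
    example nudge each weight by eta * (desired - actual) * input."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])     # 1. small random weights
    b = 0.0
    for _ in range(epochs):
        for x, d in zip(X, D):
            y = 1 if np.dot(w, x) + b >= 0 else -1  # 2. compute output for an example
            w += eta * (d - y) * x                  # 3. Delta w_i = eta * (D - Y) * I_i
            b += eta * (d - y)
    return w, b

# Toy, linearly separable data (illustrative only)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
D = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, D)
```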
Training NNs: Back Propagation
• How to train a neural net (find the optimal weights):
  – Present a training sample to the neural network.
  – Calculate the error in each output neuron.
  – For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error.
  – Adjust the weights of each neuron to lower its local error.
  – Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
  – Repeat on the neurons at the previous level, using each one's "blame" as its error.
• This 'propagates' the error backward; the sequence of forward and backward passes is called 'back propagation'.
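A minimal sketch (not the slides' notation) of one backpropagation step for a single-hidden-layer network with sigmoid units and squared error; the array shapes, learning rate, and data below are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, target, W1, b1, W2, b2, eta=0.5):
    """One forward pass and one backward (error-propagating) weight update."""
    # Forward pass
    h = sigmoid(W1 @ x + b1)                     # hidden-layer outputs
    y = sigmoid(W2 @ h + b2)                     # output-layer outputs
    # Local error at the output layer (squared error, sigmoid derivative)
    delta_out = (y - target) * y * (1 - y)
    # "Blame" assigned to hidden units, weighted by the strength of their connections
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Gradient-descent weight updates
    W2 -= eta * np.outer(delta_out, h);  b2 -= eta * delta_out
    W1 -= eta * np.outer(delta_hid, x);  b1 -= eta * delta_hid
    return W1, b1, W2, b2

# Illustrative usage with made-up sizes (3 inputs, 4 hidden units, 1 output)
rng = np.random.default_rng(0)
x, t = rng.normal(size=3), np.array([1.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
W1, b1, W2, b2 = backprop_step(x, t, W1, b1, W2, b2)
```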
Training NNs: How to do it
• A "gradient descent" algorithm is typically used to fit the weights during back propagation
• You can imagine a surface in an n-dimensional space such that
  – Each dimension is a weight
  – Each point in this space is a particular combination of weights
  – The height of the "surface" at each point is the output error that corresponds to that combination of weights
  – You want to minimize error, i.e. find the "valleys" on this surface
  – Note the potential for 'local minima'
Training NNs: Gradient Descent
• Find the gradient in each direction: ∂Error / ∂w_i
• Moving according to these gradients results in the move of 'steepest descent'
• Note the potential problem with 'local minima'.
Gradient Descent
• Direction of steepest descent can be found mathematically or via computational estimation
Via A. Moore
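A hedged sketch of the "computational estimation" route: estimate each ∂Error/∂w_i by finite differences and step downhill. The error surface here is a made-up quadratic bowl, purely for illustration.

```python
import numpy as np

def numerical_gradient(error_fn, w, eps=1e-6):
    """Estimate dError/dw_i in each direction by central finite differences."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (error_fn(w_plus) - error_fn(w_minus)) / (2 * eps)
    return grad

# Illustrative error surface with a single minimum at (1, -2)
error = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2
w = np.array([5.0, 5.0])
for _ in range(100):
    w -= 0.1 * numerical_gradient(error, w)   # move in the direction of steepest descent
print(w)                                      # close to [1, -2]
```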
Neural Nets: Strengths
• Can model very complex functions, very accurately
  – non-linearity is built into the model
• Handles noisy data quite well
• Provides fast predictions
• Good for multiple-category problems
  – Many-class classification
  – Image detection
  – Speech recognition
  – Financial models
• Good for multiple-stage problems
Neural Nets: Weaknesses
• A black box. Hard to explain or gain intuition from.
• For complex problems, training time can be quite high
• Many, many training parameters
  – Layers, neurons per layer, output layers, bias, training algorithms, learning rate
• Highly prone to overfitting
  – The balance between complexity and parsimony can be found through cross-validation
Example: Face Detection
Architecture of the complete system: they use another neural
net to estimate orientation of the face, then rectify it. They
search over scales to find bigger/smaller faces.
Figure from “Rotation invariant neural-network based face detection,” H.A. Rowley, S.
Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998, copyright
1998, IEEE
Rowley, Baluja and Kanade's (1998) network architecture
Image Size: 20 x 20
Input Layer: 400 units
Hidden Layer: 15 units
Neural Nets: Face Detection
Goal: detect “face or no face”
Face Detection: Results
Face Detection Results: A Few Misses
Neural Nets
• Face detection in action
• For more:
– See Hastie, et al Chapter 11
• R packages
– Basic: nnet
– Better: AMORE
Support Vector Machines
SVM
• Classification technique
• Start with a BIG assumption
– The classes can be separated linearly
Linear Classifiers
f(x,w,b) = sign(w. x - b)
[Scatterplot: one marker denotes +1, the other denotes -1; the classifier maps an input x to an estimated label y_est]
How would you classify this data?
Linear Classifiers
f(x,w,b) = sign(w. x - b)
Any of these would be fine.. ..but which is best?
Classifier Margin
f(x,w,b) = sign(w. x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x,w,b) = sign(w. x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM: Linear SVM).
Support Vectors are those datapoints that the margin pushes up against.
[Figure: the maximum-margin boundary between the +1 and -1 points, with the support vectors lying on the margin]
Why Maximum Margin?
[Figure: the maximum margin linear classifier f(x,w,b) = sign(w. x - b), with the support vectors (the datapoints the margin pushes up against) highlighted]
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
Specifying a line and margin
[Figure: the classifier boundary with a plus-plane and a minus-plane on either side of it]
• How do we represent this mathematically?
• …in m input dimensions?
Specifying a line and margin
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as..   +1                 if  w . x + b >= 1
                -1                 if  w . x + b <= -1
                Universe explodes  if  -1 < w . x + b < 1
Computing the margin width
M = margin width. How do we compute M in terms of w and b?
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• Claim: the vector w is perpendicular to the plus-plane.
• Let x- be any point on the minus-plane, and let x+ be the closest plus-plane point to x-.
  (x+ and x- are any locations in R^m, not necessarily datapoints.)
• Claim: x+ = x- + λ w for some value of λ. Why? The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ we travel some distance in the direction of w.
What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M
It's now easy to get M in terms of w and b:
      w . (x- + λ w) + b = 1
  =>  w . x- + b + λ w.w = 1
  =>  -1 + λ w.w = 1
  =>  λ = 2 / (w.w)
so
      M = |x+ - x-| = |λ w| = λ |w| = λ sqrt(w.w) = 2 sqrt(w.w) / (w.w) = 2 / sqrt(w.w)
Learning the Maximum Margin Classifier
M = margin width = 2 / sqrt(w.w)
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
Search the space of w's and b's to find the widest margin that matches all the datapoints.
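As a concrete (if naive) illustration of evaluating one such guess, here is a minimal Python sketch with made-up toy data: it checks the half-plane constraints and, if they all hold, returns the margin width 2/sqrt(w.w). In practice the search over w and b is solved with a quadratic-programming solver rather than by guessing.

```python
import numpy as np

def margin_if_feasible(w, b, X, y):
    """For a guess of (w, b), return the margin width 2/sqrt(w.w) if every point
    satisfies y_i * (w . x_i + b) >= 1 (i.e. lies in its correct half-plane);
    otherwise return None."""
    if np.all(y * (X @ w + b) >= 1):
        return 2.0 / np.sqrt(w @ w)
    return None

# Toy, separable 2-D data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(margin_if_feasible(np.array([0.5, 0.5]), 0.0, X, y))   # prints a margin width
```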
Uh-oh! This is going to be a problem! What should we do?
[Scatterplot: the +1 and -1 classes are not linearly separable]
Idea 1:
Find minimum w.w, while minimizing the number of training set errors.
Problemette: two things to minimize makes for an ill-defined optimization.
Idea 1.1:
Minimize  w.w + C (#train errors),  where C is a tradeoff parameter.
And: use a trick
Suppose we’re in 1-dimension
What would
SVMs do with
this data?
x=0
Suppose we’re in 1-dimension
Not a big surprise
x=0
Positive “plane”
Negative “plane”
Harder 1-dimensional dataset
What can be
done about
this?
x=0
Harder 1-dimensional dataset
Embed the data in a higher dimensional space:
        z_k = (x_k, x_k^2)
x=0
SVM Kernel Functions
• Embedding the data in a higher dimensional space where it is separable is called the "kernel trick"
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function
  – Radial-basis-style kernel function:
        K(a,b) = exp( -(a - b)^2 / (2 σ^2) )
  – Neural-net-style kernel function:
        K(a,b) = tanh( κ a.b - δ )
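A minimal sketch of the two kernels above as Python functions; the values of σ, κ, and δ are illustrative choices, not values from the slides.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Radial-basis-style kernel: K(a, b) = exp(-|a - b|^2 / (2 * sigma^2))."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def tanh_kernel(a, b, kappa=1.0, delta=0.0):
    """Neural-net-style kernel: K(a, b) = tanh(kappa * a.b - delta)."""
    return np.tanh(kappa * np.dot(a, b) - delta)

print(rbf_kernel([1.0, 2.0], [1.5, 1.0]))
print(tanh_kernel([1.0, 2.0], [1.5, 1.0]))
```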
SVM Performance
• Trick: find linear boundaries in an enlarged space
– Translate to nonlinear boundaries in the original space
• Magic: for more details, see Hastie et al 12.3
• Anecdotally they work very very well indeed.
• Example: They are currently the best-known
classifier on a well-studied hand-written-character
recognition benchmark
• There is a lot of excitement and religious fervor
about SVMs.
• Despite this, some practitioners are a little skeptical.
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs
  – SVM 1 learns "Output==1" vs "Output != 1"
  – SVM 2 learns "Output==2" vs "Output != 2"
  – :
  – SVM N learns "Output==N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (see the sketch below).
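A minimal sketch of this one-SVM-per-class prediction rule; the (w, b) pairs below are hand-picked stand-ins for trained linear SVMs, not real models.

```python
import numpy as np

def predict_one_vs_rest(x, svms):
    """svms: list of (w, b) pairs, one per class, where SVM k was trained on
    'Output == k' vs 'Output != k'.  Predict the class whose SVM pushes x
    furthest into the positive region (largest w . x + b)."""
    scores = [np.dot(w, x) + b for (w, b) in svms]
    return int(np.argmax(scores))

# Illustrative, hand-picked linear SVMs for 3 classes
svms = [(np.array([ 1.0, 0.0]), -1.0),
        (np.array([-1.0, 0.0]), -1.0),
        (np.array([ 0.0, 1.0]), -1.0)]
print(predict_one_vs_rest(np.array([0.2, 3.0]), svms))   # -> 2
```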
References
• Hastie, et al Chapter 11 (NN); Chapter 12 (SVM)
• Andrew Moore notes on Neural nets
• Andrew Moore notes on SVM
• Wikipedia has very good pages on both topics
• An excellent tutorial on VC-dimension and Support Vector Machines by C. Burges:
  – C. J. C. Burges, "A tutorial on support vector machines for pattern recognition." Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
• The SVM Bible:
Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience; 1998