CISC 4631
Data Mining
Lecture 11:
Neural Networks
Biological Motivation
• Can we simulate the human learning process? Two schools of thought:
• modeling biological learning process
• obtain highly effective algorithms, independent of
whether these algorithms mirror biological
processes (this course)
• Biological learning system (brain)
– complex network of neurons
– ANNs are loosely motivated by biological neural
systems. However, many features of ANNs are
inconsistent with biological systems
2
Neural Speed Constraints
• Neurons have a “switching time” on the order of a few milliseconds,
compared to nanoseconds for current computing hardware.
• However, neural systems can perform complex cognitive tasks
(vision, speech understanding) in tenths of a second.
• Only time for performing 100 serial steps in this time frame,
compared to orders of magnitude more for current computers.
• Must be exploiting “massive parallelism.”
• Human brain has about 10^11 neurons with an average of 10^4
connections each.
3
Artificial Neural Networks (ANN)
• ANN
– network of simple units
– real-valued inputs & outputs
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
4
Neural Network Learning
• Learning approach based on modeling adaptation in biological
neural systems.
• Perceptron: Initial algorithm for learning simple neural networks
(single layer) developed in the 1950’s.
• Backpropagation: More complex algorithm for learning multi-layer
neural networks developed in the 1980’s.
5
Real Neurons
6
How Does our Brain Work?
• A neuron is connected to other neurons via its input and
output links
• Each incoming neuron has an activation value and each
connection has a weight associated with it
• The neuron sums the incoming weighted values and this
value is input to an activation function
• The output of the activation function is the output from
the neuron
7
Neural Communication
• Electrical potential across
cell membrane exhibits
spikes called action
potentials.
• Spike originates in cell
body, travels down axon,
and causes synaptic
terminals to release
neurotransmitters.
• Chemical diffuses across
synapse to dendrites of
other neurons.
• Neurotransmitters can be
excitatory or inhibitory.
• If the net input of
neurotransmitters to a
neuron from other neurons
is excitatory and exceeds
some threshold, it fires an
action potential.
8
Real Neural Learning
• To model the brain we need to model a neuron
• Each neuron performs a simple computation
– It receives signals from its input links and it uses these values to
compute the activation level (or output) for the neuron.
– This value is passed to other neurons via its output links.
9
Prototypical ANN
• Units interconnected in layers
– directed, acyclic graph (DAG)
• Network structure is fixed
– learning = weight adjustment
– backpropagation algorithm
10
Appropriate Problems
• Instances: vectors of attributes
– discrete or real values
• Target function
– discrete, real, vector
– ANNs can handle classification & regression
• Noisy data
• Long training times acceptable
• Fast evaluation
• No need to be readable
– It is almost impossible to interpret neural networks except for the
simplest target functions
11
Perceptrons
• The perceptron is a type of artificial neural network which can be seen
as the simplest kind of feedforward neural network: a linear classifier
• Introduced in the late 1950s
• Perceptron convergence theorem (Rosenblatt 1962):
– Perceptron will learn to classify any linearly separable set of inputs.
• Perceptron is a network:
– single-layer
– feed-forward: data only travels in one direction
12
XOR function (no linear separation)
ALVINN drives 70 mph on highways
(See the ALVINN video)
13
Artificial Neuron Model
• Model network as a graph with cells as nodes and synaptic
connections as weighted edges from node i to node j, wji
• Model the net input to cell j as: net_j = \sum_i w_{ji} o_i
• Cell output is a threshold function of the net input:
  o_j = 1 if net_j >= T_j, and 0 otherwise
  (T_j is the threshold for unit j)
[Figure: an example cell 1 receiving inputs from cells 2-6 over weighted edges w_12 ... w_16; plot of the step output o_j versus net_j, jumping from 0 to 1 at T_j]
14
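As an illustration (not from the slides), a minimal Python sketch of such a threshold unit, with made-up weights and threshold:

def threshold_unit(inputs, weights, threshold):
    # net_j = sum_i w_ji * o_i; output 1 if the net input reaches the threshold
    net = sum(w * o for w, o in zip(weights, inputs))
    return 1 if net >= threshold else 0

# Example with three inputs and illustrative weights: net = 0.9 >= 0.8, so the unit fires.
print(threshold_unit([1, 0, 1], [0.5, 0.2, 0.4], threshold=0.8))  # -> 1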
Perceptron: Artificial Neuron Model
Model network as a graph with cells as nodes and synaptic connections as
weighted edges from node i to node j, wji
The net input to a neuron is calculated by summing the weighted input
values from its input links: \sum_{i=0}^{n} w_i x_i
This net input is then passed through a threshold function to produce the neuron's output.
Vector notation: \vec{w} \cdot \vec{x} = \sum_{i=0}^{n} w_i x_i
15
Different Threshold Functions
• Sign function: o(\vec{x}) = +1 if \vec{w} \cdot \vec{x} > 0, and -1 otherwise
• Step function with explicit threshold t: o(\vec{x}) = 1 if \vec{w} \cdot \vec{x} > t, and 0 otherwise
We must learn the weights w_1, ..., w_n
16
Examples
(step activation function)
AND:              OR:               NOT:
In1  In2  Out     In1  In2  Out     In   Out
 0    0    0       0    0    0       0    1
 0    1    0       0    1    1       1    0
 1    0    0       1    0    1
 1    1    1       1    1    1

Each unit computes \sum_{i=0}^{n} w_i x_i, with the threshold folded in as a bias weight w_0 = -t.
17
Neural Computation
• McCulloch and Pitts (1943) showed how such model
neurons could compute logical functions and be used to
construct finite-state machines
• Can be used to simulate logic gates:
– AND: Let all wji be Tj/n, where n is the number of inputs.
– OR: Let all wji be Tj
– NOT: Let threshold be 0, single input with a negative weight.
• Can build arbitrary logic circuits, sequential machines, and
computers with such gates
18
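A minimal Python sketch of the gate recipe above, assuming a threshold of T = 1 and the convention that a unit fires when its net input is at least the threshold (both are assumptions made for illustration):

T = 1.0  # assumed threshold; any positive value works with the recipe above

def fires(net, threshold):
    return 1 if net >= threshold else 0

def and_gate(x1, x2):   # AND: all weights T/n (here n = 2)
    return fires((T / 2) * x1 + (T / 2) * x2, T)

def or_gate(x1, x2):    # OR: all weights T
    return fires(T * x1 + T * x2, T)

def not_gate(x):        # NOT: threshold 0, single negative weight
    return fires(-1.0 * x, 0.0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", and_gate(a, b), "OR:", or_gate(a, b))
print("NOT 0 ->", not_gate(0), " NOT 1 ->", not_gate(1))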
Perceptron Training
• Assume supervised training examples giving the desired
output for a unit given a set of known input activations.
• Goal: learn the weight vector (synaptic weights) that
causes the perceptron to produce the correct +/- 1 values
• Perceptron uses iterative update algorithm to learn a
correct set of weights
– Perceptron training rule
– Delta rule
• Both algorithms are guaranteed to converge to somewhat
different acceptable hypotheses, under somewhat different
conditions
19
Perceptron Training Rule
• Update weights by:
w_i ← w_i + Δw_i
Δw_i = η (t − o) x_i
where
η is the learning rate
• a small value (e.g., 0.1)
• sometimes is made to decay as the number of
weight-tuning operations increases
t – target output for the current training example
o – perceptron output for the current training example
x_i – value of input i for the current training example
20
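As a quick worked example with made-up numbers: with η = 0.1, target t = +1, perceptron output o = −1, and input x_i = 0.5, the update is Δw_i = 0.1 · (1 − (−1)) · 0.5 = 0.1, so w_i is nudged upward; when the output is already correct, t − o = 0 and the weight is left unchanged.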
Perceptron Training Rule
• Equivalent to rules:
– If output is correct do nothing.
– If output is high, lower weights on active inputs
– If output is low, increase weights on active inputs
• Can prove it will converge
– if training data is linearly separable
– and η is sufficiently small
21
Perceptron Learning Algorithm
• Iteratively update weights until convergence.
Initialize weights to random values
Until outputs of all training examples are correct
For each training pair, E, do:
Compute current output oj for E given its inputs
Compare current output to target value, tj , for E
Update synaptic weights and threshold using learning rule
• Each execution of the outer loop is typically called an
epoch.
22
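A minimal Python sketch of this loop, assuming ±1 targets, a bias weight w_0 in place of an explicit threshold, and a made-up training set (the AND function); this is an illustration, not code from the course:

import random

data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]  # AND, linearly separable
eta = 0.1
w = [random.uniform(-0.5, 0.5) for _ in range(3)]  # w[0] is the bias weight

def output(x, w):
    net = w[0] + w[1] * x[0] + w[2] * x[1]
    return 1 if net > 0 else -1

epochs = 0
while any(output(x, w) != t for x, t in data):   # until all training examples are correct
    epochs += 1
    for x, t in data:                            # one pass over the data = one epoch
        o = output(x, w)
        w[0] += eta * (t - o) * 1.0              # bias input is always 1
        w[1] += eta * (t - o) * x[0]
        w[2] += eta * (t - o) * x[1]

print(epochs, w)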
Delta Rule
• Works reasonably with data that is not linearly separable
• Minimizes error
• Gradient descent method
– basis of Backpropagation method
– basis for methods working in multidimensional continuous
spaces
– Discussion of this rule is beyond the scope of this course
23
Perceptron as a Linear Separator
• Since perceptron uses linear threshold function it searches
for a linear separator that discriminates the classes
w_{12} o_2 + w_{13} o_3 > T_1
which is equivalent to o_3 > -(w_{12}/w_{13}) o_2 + T_1/w_{13}
[Figure: the line defined by this boundary separating the two classes in the (o_2, o_3) plane]
Or a hyperplane in n-dimensional space
24
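For example, with hypothetical weights w_12 = 1, w_13 = 2 and threshold T_1 = 1, the boundary is o_3 = −0.5 o_2 + 0.5, and the unit outputs 1 exactly for points above that line.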
Concept Perceptron Cannot Learn
• Cannot learn exclusive-or (XOR), or parity function in
general
[Figure: the four XOR points plotted in the (o_2, o_3) unit square; the two positive and two negative points cannot be separated by a single line]
25
General Structure of an ANN
[Figure: general structure of a feed-forward network. Inputs x1 ... x5 feed an input layer, which feeds a hidden layer, which feeds an output layer producing y. A close-up of a single neuron i shows inputs I1, I2, I3 weighted by wi1, wi2, wi3, summed into Si, and passed through an activation function g(Si) with threshold t to produce the output Oi.]
Perceptrons have no hidden layers; multilayer perceptrons may have many.
26
Learning Power of an ANN
• Perceptron is guaranteed to converge if data is linearly
separable
– It will learn a hyperplane that separates the classes
– The XOR function on page 250 TSK is not linearly
separable
• A multilayer ANN has no such guarantee of convergence
but can learn functions that are not linearly separable
• An ANN with a hidden layer can learn the XOR function
by constructing two hyperplanes (see page 253)
27
Multilayer Network Example
The decision surface is highly nonlinear
28
Sigmoid Threshold Unit
• A sigmoid unit's output is a nonlinear function of its inputs that is
also differentiable with respect to its inputs
• We can derive gradient descent rules to train
– Sigmoid unit
– Multilayer networks of sigmoid units  backpropagation
29
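A minimal Python sketch of a sigmoid unit (the weights are placeholders), showing the property gradient descent relies on: the sigmoid's derivative can be written as σ(net)·(1 − σ(net)):

import math

def sigmoid(net):
    # smooth, differentiable replacement for the hard threshold
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_derivative(net):
    s = sigmoid(net)
    return s * (1.0 - s)   # d(sigmoid)/d(net)

w = [-1.0, 0.8, 0.3]       # illustrative weights; w[0] is the bias
x = [1.0, 2.0]
net = w[0] + w[1] * x[0] + w[2] * x[1]
print(sigmoid(net), sigmoid_derivative(net))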
Multi-Layer Networks
• Multi-layer networks can represent arbitrary functions,
but an effective learning algorithm for such networks was
thought to be difficult
• A typical multi-layer network consists of an input, hidden
and output layer, each fully connected to the next, with
activation feeding forward.
[Figure: input, hidden, and output layers with activation feeding forward]
• The weights determine the function computed. Given an
arbitrary number of hidden units, any boolean function
can be computed with a single hidden layer
30
Logistic function
[Figure: a single logistic unit. The independent variables Age = 34, Gender = 1, and Stage = 4 are multiplied by coefficients (.5, .4, .8), summed, and squashed to produce the prediction 0.6, the "probability of being alive".]

Output of the unit: 1 / (1 + e^{-\sum_{i=0}^{n} w_i x_i})
31
Neural Network Model
[Figure: a small neural network. The independent variables Age = 34, Gender = 2, and Stage = 4 feed, through a first layer of weights, two hidden-layer summation units; their outputs feed, through a second layer of weights, an output unit whose prediction 0.6 is the dependent variable, the "probability of being alive".]
32
Getting an answer from a NN
[Figure: the same network; the weighted inputs flowing into the first hidden unit, and its weighted contribution to the output, are highlighted.]
33
Getting an answer from a NN
[Figure: the same network; the weighted inputs flowing into the second hidden unit, and its weighted contribution to the output, are highlighted.]
34
Getting an answer from a NN
[Figure: the complete forward pass; both hidden-unit outputs are weighted and combined at the output unit to produce the prediction 0.6, the "probability of being alive".]
35
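The forward computation walked through on these slides can be sketched in a few lines of Python; the network shape matches the figures (three inputs, two hidden units, one output), but the weights below are made up:

import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(inputs, hidden_weights, output_weights):
    # weight and sum the inputs at each hidden unit, squash,
    # then weight and sum the hidden outputs at the output unit
    hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))
              for w in hidden_weights]
    out_net = output_weights[0] + sum(wi * hi
                                      for wi, hi in zip(output_weights[1:], hidden))
    return sigmoid(out_net)

# Illustrative weights; the first entry of each list is a bias weight.
hidden_weights = [[-1.0, 0.06, 0.1, 0.7],
                  [-0.5, 0.02, 0.3, 0.2]]
output_weights = [-0.3, 0.8, 0.2]
print(forward([34, 2, 4], hidden_weights, output_weights))  # a made-up "probability of being alive"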
Comments on Training Algorithm
• Not guaranteed to converge to zero training error, may
converge to local optima or oscillate indefinitely.
• However, in practice, does converge to low error for many
large networks on real data.
• Many epochs (thousands) may be required, hours or days
of training for large networks.
• To avoid local-minima problems, run several trials starting
with different random weights (random restarts).
– Take results of trial with lowest training set error.
– Build a committee of results from multiple trials (possibly
weighting votes by training set accuracy).
36
Hidden Unit Representations
• Trained hidden units can be seen as newly constructed features that
make the target concept linearly separable in the transformed space;
this is a key feature of ANNs
• On many real domains, hidden units can be interpreted as
representing meaningful features such as vowel detectors or
edge detectors, etc..
• However, the hidden layer can also become a distributed
representation of the input in which each individual unit is not
easily interpretable as a meaningful feature.
37
Expressive Power of ANNs
• Boolean functions: Any boolean function can be
represented by a two-layer network with sufficient hidden
units.
• Continuous functions: Any bounded continuous
function can be approximated with arbitrarily small error
by a two-layer network.
• Arbitrary function: Any function can be approximated to
arbitrary accuracy by a three-layer network.
38
Sample Learned XOR Network
IN 1  IN 2  OUT
  0     0     0
  0     1     1
  1     0     1
  1     1     0

[Figure: the learned network, with inputs X and Y, hidden units A and B, output O, and the learned weights and unit thresholds (the values 7.38, 2.03, 5.57, 3.11, 6.96, 5.24, 3.6, 3.58, and 5.74 appear in the figure).]
Hidden Unit A represents: (X ∨ Y)
Hidden Unit B represents: (X ∧ Y)
Output O represents: A ∧ ¬B = (X ∨ Y) ∧ ¬(X ∧ Y)
= X ⊕ Y
39
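A small Python sketch checking this kind of construction with hand-picked step-unit weights (illustrative values, not the learned weights in the figure): hidden unit A acts as an OR, hidden unit B as an AND, and the output fires when A is on and B is off:

def step(net, threshold):
    return 1 if net >= threshold else 0

def xor_net(x, y):
    a = step(1.0 * x + 1.0 * y, 1.0)     # A = X OR Y
    b = step(0.5 * x + 0.5 * y, 1.0)     # B = X AND Y
    return step(1.0 * a - 1.0 * b, 1.0)  # O = A AND (NOT B)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_net(x, y))       # reproduces the XOR truth table above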
Expressive Power of ANNs
• Universal Function Approximator:
– Given enough hidden units, can approximate any continuous
function f
• Why not use millions of hidden units?
– Efficiency (neural network training is slow)
– Overfitting
40
Combating Overfitting in Neural Nets
• Many techniques
• Two popular ones:
– Early Stopping
• Use “a lot” of hidden units
• Just don’t over-train
– Cross-validation
• Choose the “right” number of hidden units
41
Determining the Best
Number of Hidden Units
• Too few hidden units prevents the network from
adequately fitting the data
• Too many hidden units can result in over-fitting
[Figure: error versus number of hidden units; error on training data keeps decreasing while error on test data eventually rises]
• Use internal cross-validation to empirically determine an
optimal number of hidden units
42
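A minimal Python sketch of that selection loop; cross_val_error is a hypothetical stand-in for whatever training-and-evaluation routine computes the internal cross-validation error for a given hidden-layer size:

def choose_hidden_units(train_set, cross_val_error, candidates=(2, 4, 8, 16, 32)):
    # cross_val_error(train_set, hidden_units=h) is assumed to train networks with
    # h hidden units on internal CV folds and return the average validation error
    return min(candidates, key=lambda h: cross_val_error(train_set, hidden_units=h))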
Over-Training Prevention
• Running too many epochs can result in over-fitting.
[Figure: error versus number of training epochs; error on training data keeps decreasing while error on test data eventually rises]
• Keep a hold-out validation set and test accuracy on it after every
epoch. Stop training when additional epochs actually increase
validation error.
• To avoid losing training data for validation:
– Use internal 10-fold CV on the training set to compute the average
number of epochs that maximizes generalization accuracy.
– Train final network on complete training set for this many epochs.
43
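A minimal Python sketch of the hold-out procedure described above; train_one_epoch and error_on are passed in as hypothetical stand-ins for the actual training and evaluation code:

def train_with_early_stopping(net, train_set, val_set,
                              train_one_epoch, error_on, max_epochs=1000):
    # keep training while the hold-out (validation) error keeps improving
    best_net, best_err = net, error_on(net, val_set)
    for _ in range(max_epochs):
        net = train_one_epoch(net, train_set)
        err = error_on(net, val_set)
        if err > best_err:        # additional epochs now increase validation error
            break
        best_net, best_err = net, err
    return best_net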
Summary of Neural Networks
When are Neural Networks useful?
– Instances represented by attribute-value pairs
• Particularly when attributes are real valued
– The target function is
• Discrete-valued
• Real-valued
• Vector-valued
– Training examples may contain errors
– Fast evaluation times are necessary
When not?
– Fast training times are necessary
– Understandability of the function is required
44
Successful Applications
• Text to Speech (NetTalk)
• Fraud detection
• Financial Applications
– HNC (eventually bought by Fair Isaac)
• Chemical Plant Control
– Pavillion Technologies
• Automated Vehicles
• Game Playing
– Neurogammon
• Handwriting recognition
45
Issues in Neural Nets
• Learning the proper network architecture:
– Grow network until able to fit data
• Cascade Correlation
• Upstart
– Shrink large network until unable to fit data
• Optimal Brain Damage
• Recurrent networks that use feedback and can learn
finite state machines with “backpropagation through
time.”
46