ARTIFICIAL NEURAL NETWORKS
Class 24
CSC 600: Data Mining
Today…
- Artificial Neural Networks (ANN)
- Inspiration
- Perceptron
- Hidden Nodes
- Learning
Inspiration
- Attempts to simulate biological neural systems
- Animal brains have complex learning systems consisting of closely interconnected sets of neurons

Human Brain
- Neurons: nerve cells
- Neurons are linked (connected) to other neurons via axons
- A neuron is connected to the axons of other neurons via dendrites
  - Dendrites gather inputs from other neurons
Learning
- Neurons use dendrites to gather inputs from other neurons
- A neuron combines the input information and outputs a response
  - It "fires" when some threshold is reached
- The human brain learns by changing the strength of the connections between neurons, upon repeated stimulation by the same impulse
Inspiration
- The human brain contains approximately 10^11 neurons
- Each neuron is connected on average to 10,000 other neurons
- Total of 1,000,000,000,000,000 = 10^15 connections
[Figure: a truth table with inputs X1, X2, X3 and output Y, alongside a "black box" that takes X1, X2, X3 as input and produces output Y.]

X1  X2  X3 | Y
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 1
 1   1   1 | 1
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 1
 0   0   0 | 0

Output Y is 1 if at least two of the three inputs are equal to 1.

Going to begin with simplest model…
Perceptron

[Figure: the truth table above, together with a perceptron: input nodes X1, X2, X3 connected by links of weight 0.3 each to an output node with threshold t = 0.4 that produces Y.]

Perceptron has two types of nodes:
1. Input nodes (for the input attributes)
2. Output node (for the model's output)

Nodes in a neural network are commonly known as neurons.
Perceptron

[Figure: same truth table and perceptron diagram as above.]

- Each input node is connected to the output node via a weighted link
  - The weighted link represents the strength of the connection between neurons
- Idea: learning the optimal weights
Perceptron – Output Value

[Figure: same truth table and perceptron diagram as above, with weights 0.3, 0.3, 0.3 and threshold t = 0.4.]

Weighted sum of inputs, subtracting a bias factor t, and examining the sign of the result:

Y = I( 0.3 X1 + 0.3 X2 + 0.3 X3 - 0.4 > 0 )

where I(z) = 1 if z is true, and 0 otherwise.
Perceptron – General Model

[Figure: input nodes X1, X2, X3 connected to an output node via weighted links w1, w2, w3; the output node applies threshold t and produces Y.]

- Model is an assembly of interconnected nodes and weighted links
- Output node sums up each of its input values according to the weights of its links
- Compare the output node's sum against some threshold t

Perceptron Model:

Y = I( Σi wi Xi - t > 0 )

Y = sign( Σi wi Xi - t )
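
As a concrete illustration of the perceptron model, here is a minimal sketch of the decision rule in Python. The weights and threshold are taken from the three-input example above; the function name and code structure are illustrative, not from the slides.

```python
def perceptron_predict(x, weights, t):
    """Return 1 if the weighted sum of inputs exceeds the threshold t, else 0."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, x))
    return 1 if weighted_sum - t > 0 else 0

# The three-input example from the slides: weights 0.3 each, threshold 0.4.
# Y should be 1 exactly when at least two of the three inputs are 1.
print(perceptron_predict([1, 0, 1], [0.3, 0.3, 0.3], 0.4))  # 1
print(perceptron_predict([1, 0, 0], [0.3, 0.3, 0.3], 0.4))  # 0
```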
Learning Perceptron Model

Weight Update Formula

[Figure: same perceptron diagram, with weights w1, w2, w3 and threshold t.]

wj^(k+1) = wj^(k) + λ (yi - ŷi) xij

- wj^(k): weight for attribute j, after k iterations
- wj^(k+1): new weight for attribute j
- xij: input (observation i, attribute j)
- (yi - ŷi): prediction error
- λ: learning rate parameter, between 0 and 1
  - Closer to 0: SLOW - new weight mostly influenced by the value of the old weight
  - Closer to 1: FAST - more sensitive to the error in the current iteration
Weight Update Formula

- If y = +1 and ŷ = 0:
  - prediction error = (y - ŷ) = 1
  - To compensate for the error: increase the value of the predicted output by increasing the weights of all links with positive inputs and decreasing the weights of all links with negative inputs.
- If y = 0 and ŷ = 1:
  - prediction error = (y - ŷ) = -1
  - To compensate for the error: decrease the value of the predicted output by decreasing the weights of all links with positive inputs and increasing the weights of all links with negative inputs.
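
The update rule above can be turned into a complete training loop. The following is a minimal sketch, assuming 0/1 class labels, a fixed learning rate, and a threshold updated like the weights; the function and variable names are illustrative and not prescribed by the slides.

```python
def train_perceptron(X, y, learning_rate=0.1, epochs=100):
    """Learn perceptron weights and threshold with the update rule
    w_j <- w_j + lambda * (y_i - yhat_i) * x_ij."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    t = 0.0  # bias / threshold
    for _ in range(epochs):
        changed = False
        for xi, yi in zip(X, y):
            y_hat = 1 if sum(w * v for w, v in zip(weights, xi)) - t > 0 else 0
            error = yi - y_hat
            if error != 0:
                changed = True
                for j in range(n_features):
                    weights[j] += learning_rate * error * xi[j]
                t -= learning_rate * error  # the threshold moves opposite to the weights
        if not changed:  # weights stopped changing: converged
            break
    return weights, t

# Learn the "at least two of the three inputs are 1" function from the truth table.
X = [[1,0,0],[1,0,1],[1,1,0],[1,1,1],[0,0,1],[0,1,0],[0,1,1],[0,0,0]]
y = [0, 1, 1, 1, 0, 0, 1, 0]
weights, t = train_perceptron(X, y)
```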
Perceptron Weight Convergence

- The perceptron learning algorithm is guaranteed to converge to an optimal solution (weights stop changing)
  - ... for linearly separable classification problems
- If the problem is not linearly separable, the algorithm fails to converge
  - Example: the XOR problem
- The decision boundary of a perceptron is a linear hyperplane.
Multilayer Artificial Neural Network

[Figure: inputs x1-x5 in an Input Layer, connected to a Hidden Layer, connected to an Output Layer producing y.]

More complex than the perceptron model:
1. Also contains one or more intermediary layers between the input and output layers
   - Called: hidden nodes
   - Allows for modeling more complex relationships

Think of each hidden node as a perceptron.
- A perceptron "learns" / "creates" one hyperplane.
- The XOR problem can be classified with two hyperplanes.
Multilayer Artificial Neural Network

[Figure: same multilayer network diagram as above.]

More complex than the perceptron model:
1. Also contains one or more intermediary layers between the input and output layers
   - Called: hidden nodes
   - Allows for modeling more complex relationships
2. Use of activation functions other than the sign function
   - Alternative: sigmoid (logistic) function

y = 1 / (1 + e^-x)
Why Sigmoid Function?

- Combines nearly linear behavior, curvilinear behavior, and nearly constant behavior, depending on the value of the input.
- Input: any real value
- Output: between 0 and 1

y = 1 / (1 + e^-x)
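
A quick numerical sketch of the sigmoid in Python (illustrative only), showing the three regimes described above:

```python
import math

def sigmoid(x):
    """Logistic function: maps any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-10))  # ~0.000045: nearly constant near 0 for large negative inputs
print(sigmoid(0))    # 0.5: nearly linear around the origin
print(sigmoid(10))   # ~0.99995: nearly constant near 1 for large positive inputs
```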
Feed-Forward Neural Network

[Figure: inputs x1-x5 in an Input Layer, connected to a Hidden Layer, connected to an Output Layer producing y.]

- Nodes in one layer are connected only to the nodes in the next layer.
  - Completely connected (every node in layer i is connected to every node in layer i+1)
- (A perceptron is a single-layer, feed-forward neural network.)
- Other types: a recurrent neural network may connect nodes within the same layer, or to nodes in a previous layer.
Input Encoding

- Possible drawback: all attribute values must be numeric and normalized between 0 and 1
  - even categorical variables
- Numeric variables: apply min-max normalization

X* = (X - min(X)) / range(X) = (X - min(X)) / (max(X) - min(X))

  - Works as long as min and max are known
  - What if a new value (in the testing set) is outside of the range?
    - Potential solution: assign the value to either the min or the max
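
A minimal sketch of min-max normalization with the suggested clamping fix for out-of-range test values (the function and variable names are illustrative):

```python
def min_max_normalize(value, min_x, max_x):
    """Map value into [0, 1] using the training min/max; clamp out-of-range test values."""
    value = max(min_x, min(value, max_x))  # assign out-of-range values to the min or max
    return (value - min_x) / (max_x - min_x)

# Training range for an attribute: [10, 50]
print(min_max_normalize(30, 10, 50))  # 0.5
print(min_max_normalize(70, 10, 50))  # 1.0 (clamped to the max)
```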
Input Encoding

- Possible drawback: all attribute values must be numeric and normalized between 0 and 1
- Categorical variables:
  - 2 categories can be represented by a single 0/1 numeric variable
  - If more than 2 categories: use flag (binary 0/1) variables to represent each category
    - (if the # of possible categories is not too large)
  - In general: k-1 indicator variables are needed for a categorical variable with k classes (see the sketch below)
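
A small sketch of k-1 indicator encoding in plain Python; the category values and helper name are made up for illustration:

```python
def indicator_encode(value, categories):
    """Encode a categorical value as k-1 binary flags.
    The first category acts as the reference level (all zeros)."""
    return [1 if value == c else 0 for c in categories[1:]]

categories = ["single", "married", "divorced"]   # k = 3 classes
print(indicator_encode("single", categories))    # [0, 0]
print(indicator_encode("married", categories))   # [1, 0]
print(indicator_encode("divorced", categories))  # [0, 1]
```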

Output Encoding

- Neural networks output a continuous value between 0 and 1
- Binary problems: use some threshold, such as 0.5
- Ordinal example (see the sketch below):
  - If 0 <= output < 0.25, classify as first-grade reading level
  - If 0.25 <= output < 0.50, classify as second-grade reading level
  - If 0.50 <= output < 0.75, classify as third-grade reading level
  - If output >= 0.75, classify as fourth-grade reading level
- Classification: Ideas?
  - Use 1-of-n output encoding with multiple output nodes
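
A small sketch of the ordinal decoding described above; the thresholds come from the slide, while the function name is illustrative:

```python
def decode_reading_level(output):
    """Map a network output in [0, 1] to an ordinal reading-level class."""
    if output < 0.25:
        return "first-grade"
    elif output < 0.50:
        return "second-grade"
    elif output < 0.75:
        return "third-grade"
    else:
        return "fourth-grade"

print(decode_reading_level(0.62))  # third-grade
```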
1-of-n Output Encoding

- Example: assume a marital status target variable with outputs:
  - {divorced, married, separated, single, widowed, unknown}
- Each output node gets a value between 0 and 1
  - Choose the node with the highest value
- Additional benefit: a measure of confidence
  - Difference between the highest-value output node and the second-highest-value output node
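
A minimal sketch of 1-of-n decoding with the confidence measure described above; the class names and output values are invented for illustration:

```python
def decode_one_of_n(outputs, classes):
    """Pick the class whose output node has the highest value and report a
    confidence score: the gap between the top two output values."""
    ranked = sorted(zip(outputs, classes), reverse=True)
    (best_val, best_class), (second_val, _) = ranked[0], ranked[1]
    return best_class, best_val - second_val

classes = ["divorced", "married", "separated", "single", "widowed", "unknown"]
outputs = [0.05, 0.81, 0.10, 0.62, 0.02, 0.01]
print(decode_one_of_n(outputs, classes))  # ('married', ~0.19)
```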
Output

- For numerical output problems:
  - Neural net output is between 0 and 1
  - May need to transform the output to a different scale:

prediction = output × (data range) + minimum

  - This is the inverse of min-max normalization
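
A one-function sketch of this inverse transformation (illustrative names):

```python
def denormalize(output, min_x, max_x):
    """Invert min-max normalization: map a network output in [0, 1]
    back to the original data scale."""
    return output * (max_x - min_x) + min_x

# If the target ranged from 10 to 50 in the training data:
print(denormalize(0.5, 10, 50))  # 30.0
```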
Neural Network Structure

- # of input nodes: depends on the number and type of attributes in the dataset
- # of output nodes: depends on the classification task
- # of hidden nodes:
  - Configurable by the data analyst
  - More nodes increase the power and flexibility of the network
  - Too many nodes will lead to overfitting
  - Too few nodes will lead to poor learning
- # of hidden layers:
  - Configurable by the data analyst
  - Usually 1, for computational reasons
Neural Network Example: Predicted Value

[Figure: data inputs and weights for a small network with input attributes x1, x2, x3; the table of values is not reproduced here.]

f(net_A) = 1 / (1 + e^-1.32) = 0.7892

Predicted Value: 0.8750
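
To make the forward computation concrete, here is a hedged sketch of a forward pass through one hidden layer with sigmoid activations. The weights, biases, and input values are invented for illustration and do not reproduce the slide's example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_pass(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Compute one prediction for a network with a single hidden layer.
    Each node takes a weighted sum of its inputs plus a bias, then applies the sigmoid."""
    hidden_outputs = [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(hidden_weights, hidden_biases)
    ]
    net_out = sum(w * h for w, h in zip(output_weights, hidden_outputs)) + output_bias
    return sigmoid(net_out)

# Illustrative network: 3 inputs, 2 hidden nodes, 1 output node.
inputs = [0.4, 0.2, 0.7]
hidden_weights = [[0.6, 0.9, 0.4], [0.1, 0.8, 0.5]]
hidden_biases = [0.5, 0.2]
output_weights = [0.9, 0.9]
output_bias = 0.5
print(forward_pass(inputs, hidden_weights, hidden_biases, output_weights, output_bias))
```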
Learning the ANN Model

- Goal is to determine a set of weights w that minimize the total sum of squared errors:

SSE = Σ_records Σ_output nodes (yi - ŷi)^2

- May converge without finding the optimal weights.
Gradient Descent Method

- No closed-form solution exists for minimizing the SSE
- Gradient Descent:
  - Gives the direction in which the weights should be adjusted
- Back Propagation:
  - Takes the prediction error and propagates the error back through the network
  - Weights of hidden nodes can also be adjusted
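
As a simplified illustration of gradient descent on the SSE, here is a sketch of one update step for a single sigmoid unit (without the hidden-layer back-propagation step); the function names and learning rate are illustrative assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_descent_step(weights, bias, X, y, learning_rate=0.5):
    """One gradient descent step minimizing SSE = sum_i (y_i - yhat_i)^2
    for a single sigmoid unit. Full back propagation applies the same chain
    rule through the hidden layers as well."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        net = sum(w * v for w, v in zip(weights, xi)) + bias
        y_hat = sigmoid(net)
        # d(SSE)/d(net) = -2 * (y - y_hat) * y_hat * (1 - y_hat)
        delta = -2.0 * (yi - y_hat) * y_hat * (1.0 - y_hat)
        for j, v in enumerate(xi):
            grad_w[j] += delta * v
        grad_b += delta
    # Move the weights against the gradient (the direction that reduces the SSE).
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    new_bias = bias - learning_rate * grad_b
    return new_weights, new_bias
```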

Learning the ANN Model

- Keep adjusting the weights until some stopping criterion is met:
  1. SSE reduced below some threshold
  2. Weights are not changing anymore
  3. Elapsed training time exceeds a limit
  4. Number of iterations exceeds a limit
Non-Optimal Local Minimum

- Problem: the algorithm discovers weights that result in a local minimum rather than the global minimum
- Potential solutions:
  1. Adjust the learning parameter
  2. Add a momentum term
Characteristics of Artificial Neural Networks

- Important to choose an appropriate network topology
- Very expressive hypothesis space
- Fast classification time
- Can handle redundant features
  - Weights for redundant features tend to be very small
- Gradient descent for learning the weights may converge to a local minimum
  - Use a momentum term
  - Learn multiple models (remember that the initial weights are random)
- Relatively lengthy training time
- Interpretability: what do the weights of hidden nodes mean?
Sensitivity Analysis

Measures the relative influence each attribute has on the output result:
1. Generate a new observation x_mean, with each attribute in x_mean equal to the mean of that attribute
2. Find the network output for input x_mean
3. Attribute by attribute, vary x_mean to the min and max of that attribute. Find the network output for each variation and compare it to (2).

This will discover which attributes the network is more sensitive to (see the sketch below).
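
A minimal sketch of this procedure in Python, assuming a trained model exposed as a `predict(inputs)` function; the function name and data layout are illustrative assumptions:

```python
def sensitivity_analysis(predict, data):
    """Estimate how sensitive the network output is to each attribute.
    `data` is a list of observations (lists of normalized attribute values);
    `predict` maps one observation to the network output."""
    n_attrs = len(data[0])
    x_mean = [sum(row[j] for row in data) / len(data) for j in range(n_attrs)]
    baseline = predict(x_mean)                # step 2: output at the mean observation
    sensitivities = []
    for j in range(n_attrs):                  # step 3: vary one attribute at a time
        lo, hi = min(row[j] for row in data), max(row[j] for row in data)
        outputs = []
        for v in (lo, hi):
            x_varied = list(x_mean)
            x_varied[j] = v
            outputs.append(predict(x_varied))
        spread = max(abs(o - baseline) for o in outputs)
        sensitivities.append(spread)
    return sensitivities  # larger value = attribute the network is more sensitive to
```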
References

- Data Science from Scratch, 1st edition, Grus
- Introduction to Data Mining, 1st edition, Tan et al.
- Discovering Knowledge in Data, 2nd edition, Larose et al.