ABSTRACT:
Multilayer perceptrons (MLPs) are one of the most popular neural network
models for solving pattern classification and image classification problems. Once the
connection weights in an MLP have been learnt, the network can be used repeatedly for
classification of input test patterns. Several special-purpose architectures have been
described in the literature for neural networks, since they are slow on a conventional
uniprocessor. This presentation report describes a high performance architecture for MLPs
built using a novel class of hardware called "Custom Computing Machines" (CCMs).
1. Introduction:
Artificial Neural Networks (ANNs) attempt to mimic Biological Neural Networks
and one of the main features of biological neural networks is the massively parallel
interconnections among the neurons.
An artificial neuron is a computational model inspired by natural neurons.
Natural neurons receive signals through synapses located on the dendrites or membrane
of the neuron. When the signals received are strong enough (i.e. they surpass a certain
threshold), the neuron is activated and emits a signal through the axon. This signal might
be sent to another synapse, and might activate other neurons (refer to Figure 1).
Figure 1: Natural neurons. [1].
When modeling artificial neurons, we abstract away much of the complexity of real
neurons. Artificial neurons (also called perceptrons) basically consist of a d-dimensional
input vector X that is connected to the input node through suitable weights. The weight
values corresponding to each connection to this input node are the components of a
d-dimensional weight vector W. Weights represent the strength of the respective signals
(i.e. components of the input vector).
As shown in Figure 2, inside the node the inner product of the input vector and the
weight vector is evaluated. A nonlinear function (which sometimes depends on a
threshold) computes the output, taking the result of the inner product as input. In the
paper by Jain and Ratha [2], which forms the core subject of this presentation report, the
non-linear function is the sigmoid function tanh(βx) with β = 0.25.
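As a minimal sketch of this computation (Python, with illustrative input and weight values), a perceptron is the inner product of W and X followed by the tanh(βx) non-linearity with β = 0.25:

```python
import numpy as np

def perceptron(x, w, beta=0.25):
    """Single perceptron: inner product of input and weight vectors,
    followed by the tanh(beta * s) non-linearity used in [2]."""
    s = np.dot(w, x)          # inner product <W, X>
    return np.tanh(beta * s)  # sigmoid-like squashing function

# Example: a 4-dimensional input and weight vector (values are illustrative)
x = np.array([0.5, -1.0, 0.25, 2.0])
w = np.array([0.1, 0.4, -0.3, 0.2])
print(perceptron(x, w))
```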
Figure 2: Schematic of a perceptron [2]
ANNs combine artificial neurons in order to process information. Design
parameters of ANNs can be listed as follows:
• network topology
o number of layers
o number of nodes in a layer
• connection weights
• property at a node (i.e. type of non-linearity to be used)
1.1. Multilayer Perceptrons (MLPs)
Multilayer Perceptrons (MLPs) are one of the most popular neural network
models for solving pattern classification and image classification problems.
A Multilayer Perceptron consists of several layers of perceptrons. Nodes in the ith
layer are connected to nodes in the (i+1)th layer through suitable weights and there are no
interconnections among the nodes in a layer. An MLP with 2 hidden layers is shown in
Figure 3.
Figure 3: (a) a Multilayer Perceptron with two hidden layers (b) Biological neuron [3]
1.2. Training of an MLP:
Training of an MLP consists of two stages:
1. Feedforward Stage
2. Weight Update Stage
In the Feedforward Stage, training patterns with known class labels are presented at the
input layer and the output is computed at the output layer. At the start of the training
phase the weight matrix is randomly initialized.
In the Weight Update Stage, weights are updated in a backward fashion starting with the
output layer. Weights are changed in proportion to the error between the desired output
and the actual output.
The Feedforward Stage and Weight Update Stage are repeated until the network
converges (i.e. the difference between the desired output and the actual output at the
output layer is within a predetermined limit).
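A compact sketch of these two stages for a single-hidden-layer MLP follows (Python). The layer sizes, learning rate and exact gradient-descent update rule are illustrative assumptions rather than details taken from [2]; only the overall feedforward/weight-update structure mirrors the description above:

```python
import numpy as np

def tanh_beta(s, beta=0.25):
    return np.tanh(beta * s)

def train_step(x, target, W1, W2, lr=0.1, beta=0.25):
    """One feedforward + weight-update step for a single-hidden-layer MLP."""
    # --- Feedforward stage ---
    h = tanh_beta(W1 @ x)   # hidden layer outputs
    y = tanh_beta(W2 @ h)   # output layer outputs

    # --- Weight update stage: errors propagated backward from the output ---
    # derivative of tanh(beta*s) is beta * (1 - tanh(beta*s)^2)
    err_out = (target - y) * beta * (1.0 - y**2)
    err_hid = (W2.T @ err_out) * beta * (1.0 - h**2)
    W2 += lr * np.outer(err_out, h)   # change proportional to the error
    W1 += lr * np.outer(err_hid, x)
    return W1, W2, y

# Weight matrices randomly initialized, as at the start of training
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 4))   # 4 inputs -> 3 hidden nodes
W2 = rng.normal(scale=0.5, size=(2, 3))   # 3 hidden -> 2 output nodes
W1, W2, y = train_step(np.array([1.0, 0.0, -1.0, 0.5]),
                       np.array([1.0, -1.0]), W1, W2)
```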
In Figure 4, a detailed schema of an MLP with a single hidden layer is given.
Figure 4: A Multilayer Perceptron. [2]
For an MLP with n nodes, the number of interconnections needed is O(n^2). Thus
mapping an MLP onto a parallel processor is a real challenge. On the other hand, on a
uniprocessor the whole operation proceeds sequentially, one node at a time, with no
complex communications involved.
However, for a high performance implementation, efficient communication
capability must be supported. For instance, typical pattern recognition and computer
vision applications have more than 100 input nodes. In addition, a classification process
involving complex decision boundaries demands a large number of hidden nodes.
In real-time computer vision applications, the network training can be carried out
offline. On the other hand, during the recall phase (which is supposed to be real-time),
high input/output bandwidth is required along with fast classification (recall) speeds.
For a three layer network (excluding the input layer), let:
m: number of input nodes
n1: number of nodes in the first hidden layer
n2: number of nodes in the second hidden layer
k: number of output nodes (classes)
N_m: the number of multiplications
N_a: the number of additions
Then the total number of multiplications is
N_m = (m × n1) + (n1 × n2) + (n2 × k)
and the total number of additions is
N_a = N_m − (n1 + n2 + k)
The nonlinearity is not included in the calculations of N_m and N_a.
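A small helper makes these counts concrete; the layer sizes in the example call are purely illustrative:

```python
def op_counts(m, n1, n2, k):
    """Multiplication and addition counts for a three layer MLP (input layer
    excluded), per the formulas above; the nonlinearity is not counted."""
    N_m = m * n1 + n1 * n2 + n2 * k   # one multiplication per connection
    N_a = N_m - (n1 + n2 + k)         # a node with d inputs needs d-1 additions
    return N_m, N_a

# Illustrative sizes: 100 inputs, two hidden layers of 50 nodes, 10 classes
print(op_counts(100, 50, 50, 10))     # -> (8000, 7890)
```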
As a numeric example, consider a practical vision system where a 1024 × 1024 image is
processed in real time. If the total number of frames to be processed per second is 30,
then the number of input patterns to be processed per second is
30 × 1024 × 1024 ≈ 30 × 10^6 input patterns/sec
Thus, a real-time neural network classifier is expected to perform billions of operations
per second. Since connection weights are floating point numbers, the multiplications and
additions will be floating point multiplications and floating point additions respectively.
As a result, throughputs of this kind are difficult to achieve with today's most powerful
uniprocessors.
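The arithmetic behind this claim can be checked directly; the layer sizes below are hypothetical, chosen only to put a scale on the required throughput:

```python
# Required classification throughput for the hypothetical vision system above.
frames_per_sec = 30
pixels_per_frame = 1024 * 1024
patterns_per_sec = frames_per_sec * pixels_per_frame    # ~30e6 patterns/sec

# Hypothetical layer sizes (not from [2]): 100 inputs, two hidden layers of
# 50 nodes each, 10 output classes.
m, n1, n2, k = 100, 50, 50, 10
N_m = m * n1 + n1 * n2 + n2 * k    # multiplications per pattern
N_a = N_m - (n1 + n2 + k)          # additions per pattern
ops_per_sec = patterns_per_sec * (N_m + N_a)
print(f"{ops_per_sec / 1e9:.0f} billion floating point operations per second")
```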
1.3. Parallel Architectures for ANNs:
Types of parallelism available in an MLP can be listed as follows:
• Training Session Parallelism
• Training Example Parallelism
• Layer and Backward/Forward Parallelism
• Node Parallelism
• Weight Parallelism
• Bit Parallelism
Layer and Backward/Forward Parallelism, Node Parallelism and Weight
Parallelism can easily be mapped to a parallel architecture.
Complexities involved in MLPs are:
1. Computational Complexity
2. Communication Complexity (which arises since the inner product computation
involves a large number of communication steps)
As a result of these computational and communication complexities, special purpose
neurocomputers have been built using either commercially available VLSI components
or special purpose VLSIs (VLSIs designed for a distinct type of Artificial Neural
Network). VLSIs designed for a distinct type of ANN provide the best performance.
For each MLP application, the number of input nodes, the number of layers and the
number of nodes in each layer change. However, it is expensive to design a VLSI
architecture for each individual application. Hence, typically architectures with a fixed
number of nodes and layers are fabricated.
1.4. Special Purpose ANN Implementations in the Literature:
Ghosh & Hwang:
Ghosh and Hwang investigated the architectural requirements for simulating ANNs using
massively parallel multiprocessors. In their work, they proposed a model for mapping
neural networks onto message-passing multicomputers.
Liu & Wilcox:
Liu and Wilcox present an efficient implementation of the backpropagation algorithm on
the CM-5 that avoids explicit message passing, and compare the results of CM-5
backpropagation with those of the Cray-2, Cray X-MP and Cray Y-MP.
Chinn:
Chinn describes a systolic algorithm for ANN on MasPar-1 using a 2-D Systolic Array
Based Design.
Onuki:
Onuki presents a parallel implementation using a set of sixteen standard 24-bit DSPs
connected in a hypercube.
Kirsanov:
Kirsanov discusses a new architecture for ANNs using transputers.
Muller:
Muller presents a special purpose parallel computer using a large number of Motorola
floating point processors for ANN implementation.
1.5. Special Purpose VLSI Chips Designed and Fabricated for ANN Implementations:
Hammerstrom:
Hammerstrom implemented a high performance, low cost ANN with 64 processing nodes
per chip and hardware-based multiply-and-accumulate operators.
Barber:
Barber used a binary tree adder following parallel multipliers in SPIN-L architecture.
Shinokawa:
Shinokawa describes a fast ANN achieving a billion connections per second using ASIC chips.
Viredez:
Viredez describes the MANTRA-1 neurocomputer using 2 × 2 systolic PE (Processing
Element) blocks.
Kotolainen:
Kotolainen proposed a tree of connection units with processing units at the leaf nodes for
mapping many common ANNs.
Asanovic:
Asanovic proposed a VLIW processor with a 128-bit instruction width and a 7-stage
pipeline, with 8 processors per chip.
Ramacher:
Ramacher describes the architecture of SYNAPSE, which is a systolic neural signal
processor using a 2-D array of systolic elements.
Mueller & Hammerstrom:
Mueller and Hammerstrom describe the design and implementation of CNAPS. CNAPS is
a gate array implementation of ANNs. A single CNAPS chip consists of 64 processing
nodes, each connected in a SIMD fashion using a broadcast interconnect.
Cox:
Cox describes the implementation of GANGLION, which is a single neural network with a
fixed architecture of 12 input nodes, 14 hidden nodes and 4 output nodes. Its 8 × 8
multipliers were built using CLBs (Configurable Logic Blocks). In addition, a look-up
table is used for the activation function.
1.6. Stochastic Neural Architectures:
In stochastic neural architectures, there is no need for a time-consuming and
area-costly floating point multiplier. This makes stochastic neural architectures suitable
for VLSI implementations.
Examples of stochastic neural architectures include the following:
Armstrong & Thomas:
Armstrong and Thomas proposed a variation of the ANN called the Adaptive Logic
Network (ALN). ALNs are similar to ANNs, but the costly multiplications are replaced by
logical AND operations while additions are replaced by logical OR operations (see the
sketch after these examples).
Masa et al.:
Masa et al. describe an ANN with a single output, six hidden layers and seventy inputs,
which can operate at a 50 MHz input rate.
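The ALN substitution can be illustrated with a toy example (Python); this shows only the replacement of multiply by AND and add by OR on binary signals, not the adaptive tree structure of a real ALN:

```python
# Toy illustration of the ALN idea: with binary inputs and binary "weights",
# multiplication becomes logical AND and accumulation becomes logical OR.
x = [1, 0, 1, 1]   # binary input vector (illustrative)
w = [1, 1, 0, 1]   # binary weights (illustrative)

products = [xi & wi for xi, wi in zip(x, w)]   # AND in place of multiply
node_out = 0
for p in products:                             # OR in place of addition
    node_out |= p
print(node_out)   # 1 if any selected input is active
```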
2. Custom Computing Machines:
In a uniprocessor, the instruction set available to a programmer is fixed and an
algorithm is coded using a sequence of instructions. A processor can serve many
applications by simply reordering the sequence of instructions.
On the other hand, Application Specific Integrated Circuits (ASICs) are built for a
specific application. Thus, for the application they are designed to serve, they provide
higher performance than a general purpose uniprocessor.
For a Custom Computing Machine (CCM), a user can customize the architecture
and instructions for a given application by programming at a gate level, so that high
performance can be achieved. Using a CCM, a customer can tune and match the
architectural requirements of the problem.
A CCM can overcome the limitations of an ASIC, which are:
• ASICs are fast but costly.
• ASICs are non-reconfigurable.
• ASICs are time-consuming to design and fabricate.
The advantages of a CCM are:
• CCMs are cheap: CCMs use Field Programmable Gate Arrays (FPGAs) as compute
elements. Since FPGAs are off-the-shelf components, they are relatively cheap.
• CCMs are reconfigurable: Since FPGAs are reconfigurable, CCMs are easily
reprogrammed.
• CCMs are time saving: CCMs do not need to be fabricated for every new application,
since they are often employed for fast prototyping. Hence, they save a considerable
amount of time in the design and implementation of algorithms.
Splash-2 is one of the leading FPGA-based custom computing machines (CCMs),
designed and developed by the Supercomputing Research Center.
2.1. System Level View of the Splash-2 Architecture:
The interface board connects Splash-2 to the host and extends the address and data buses.
The Sun host can read/write the memories and memory-mapped control registers of
Splash-2 via the SIMD Bus and RBus.
Splash-2 consists of Splash processing boards (see Figure 5). Each Processing Element
(PE) has 512 KB of memory which the host can read/write. The individual memory
available with each PE makes it convenient to store temporary results and tables (refer to
Figure 6). PEs x1 to x16 are used for inner product and sigmoid function evaluations,
whereas PE x0 controls the data flow into the processor board.
Figure 5: System Level View of Splash-2 Architecture [2].
Figure 6: Processing Element in Splash-2. [2]
3. SPLASH-2: Architecture and Programming Flow:
The design automation process consists of two steps: simulation and synthesis.
The programming flow is shown in Figure 7. In simulation, the logic designed using
VHDL is verified. In synthesis, the main concern is to achieve the best placement of the
logic in an FPGA in order to minimize timing delay. If the logic circuit cannot be mapped
to the CLBs and flip-flops available inside an FPGA, the designer needs to revise the
logic in the VHDL code and the process is repeated. Once the logic is mapped to CLBs,
the timing for the entire digital logic is obtained. If the timing obtained is not acceptable,
the design process is repeated. To program Splash-2, we need to program each of the PEs,
the crossbar and the host interface.
4. Mapping an MLP onto SPLASH-2:
In implementing a neural network classifier on Splash-2, the building block is the
"Perceptron Implementation". For mapping an MLP onto Splash-2, 2 physical PEs serve
as a neuron: the ith PE handles the inner product phase Σ w_ij x_i, whereas the (i+1)th PE
computes the nonlinear function tanh(βx) with β = 0.25, where i is odd and (i+1) is even.
Figure 7: Programming Flow for SPLASH-2. [2]
Figure 8: Steps in Software Development on Splash-2. [2]
Assuming the perceptrons have been trained so that the connection weights are fixed, an
efficient way of handling multiplication is to employ a look-up table. Since the external
memory of each PE (512 KB) is large, the look-up table can be stored there.
A pattern vector component x_i is presented at every clock cycle and the following
stages are performed:
1. Inner Product Calculation: The ith PE looks up the multiplication table to obtain
the weighted product. The sum Σ w_ij x_i is computed using an accumulator. After all the
components of a pattern vector have been examined, the inner product has been computed.
2. Applying the Nonlinear Function to the Inner Product: On receiving the inner
product result from the first PE, the second PE uses the result as the address into the
nonlinearity look-up table and produces the output.
3. Thus the output of a neuron is obtained: The output is written back to the
external memory of the second PE starting from a pre-specified location.
4. After sending all the pattern vectors, the host can read back the memory
contents.
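These stages can be simulated in software. The Python sketch below is a simplified model under stated assumptions: inputs are 12-bit 2's complement codes, one look-up block is built per weight, and the tanh is computed directly where the hardware would consult a second look-up table:

```python
import numpy as np

BETA = 0.25
N_BITS = 12                       # 12-bit 2's-complement data path, as in [2]

def build_mult_lut(weights):
    """One look-up block per weight: entry j holds w * value(j), where j is
    the quantized input code. A sketch of the table scheme described above."""
    codes = np.arange(2**N_BITS)
    # map codes 0..4095 to signed values in [-1, 1)
    values = (codes - (codes >= 2**(N_BITS - 1)) * 2**N_BITS) / 2**(N_BITS - 1)
    return [w * values for w in weights]       # m blocks of 4096 entries

def neuron_output(input_codes, mult_lut):
    """Simulates the 2-PE neuron: the first PE looks up and accumulates one
    weighted product per clock; the second applies the non-linearity (a
    look-up table in hardware, computed directly here)."""
    acc = 0.0
    for i, code in enumerate(input_codes):     # one component per clock cycle
        acc += mult_lut[i][code]               # stage 1: LUT multiply + add
    return np.tanh(BETA * acc)                 # stage 2: non-linearity

# Illustrative 3-input neuron with hypothetical weights and quantized inputs
lut = build_mult_lut([0.5, -0.25, 1.0])
print(neuron_output([100, 4000, 2048], lut))
```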
A layer in the neural network is simply a collection of neurons working
synchronously on the input. On Splash-2, this can be achieved by broadcasting the input
to as many physical PEs as desired. The output of a neuron is written into a specified
segment of external memory and read back by the host.
For every layer in the MLP, stages 1 to 4 are repeated until the output layer is reached. It
should be noted that there is a different look-up table for every layer.
4.1. Look-up Table Organisation:
There are m multiplications to be performed per node, corresponding to the
m-dimensional weight vector.
The look-up table is divided into m segments (see Figure 9). A counter is incremented
at every clock, which forms the higher order (block) address for the look-up table. The
pattern vector component forms the lower order address bits. Splash-2 has an 18-bit
address bus for the external memory:
• the higher order 6 bits form the block address;
• the lower order 12 bits form the offset address within the block.
It should be noted that the offset can also be negative, corresponding to a negative input
to the look-up table.
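A sketch of this address computation, assuming the 6-bit block number is simply concatenated above the 12-bit input code (a hypothetical helper, consistent with the split described above):

```python
def lut_address(block, input_code):
    """18-bit external-memory address: high 6 bits select the weight block,
    low 12 bits are the quantized input component (the offset)."""
    assert 0 <= block < 2**6 and 0 <= input_code < 2**12
    return (block << 12) | input_code

# A negative input in 2's complement lands in the upper half of its block
print(hex(lut_address(3, 0xFFF)))   # block 3, offset -1 (0xFFF in 2's comp.)
```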
Figure 9: Look-up Table address computation. [2]
Numbers are represented using a 12-bit 2's complement representation; hence the
resolution of this representation is eleven bits. The accumulator is internal to the PE and
is 16 bits wide. After accumulation, the accumulator result is scaled down to 12 bits.
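One plausible reading of this data path is sketched below; the wrap-around behaviour and the choice to scale by dropping the four least significant bits are assumptions, since the exact scaling used is not spelled out here:

```python
def to_signed16(v):
    """Interpret an integer as a 16-bit 2's-complement value."""
    v &= 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

def accumulate_and_scale(products_q12):
    """Accumulate 12-bit 2's-complement products in a 16-bit accumulator,
    then scale the result back down to 12 bits."""
    acc = 0
    for p in products_q12:
        acc = to_signed16(acc + p)   # 16-bit wrap-around accumulator
    return acc >> 4                  # drop 4 LSBs: 16-bit -> 12-bit result

print(accumulate_and_scale([1000, -300, 2047, 512]))
```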
5. Performance Evaluation:
For mapping an MLP which is required to complete a classification process, the
number of PEs needed is equal to twice the number of nodes in each layer. Hence, the
number of clock cycles necessary to complete the whole classification process is
m × K × l, where:
m: number of input layer nodes
K: number of patterns
l: number of clock cycles
In the implementation performed by the authors of the main paper of this
presentation report ("High Performance Multilayer Perceptron on a Custom Computing
Machine" by Nalini K. Ratha and Anil K. Jain):
m = 20
K = 1024 × 1024 ≈ 10^6 (the total number of pixels in the input image)
l = 2
Thus, the number of clock cycles is 20 × 2 × 10^6 = 40 × 10^6 with a clock rate of
22 MHz, so that the time taken for 40 million clock ticks is 1.81 secs.
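The arithmetic can be verified directly (using K ≈ 10^6, as in the text):

```python
m = 20               # input layer nodes
K = 1_000_000        # 1024 x 1024 ~ 1e6 input patterns (pixels)
l = 2                # clock-cycle factor, as defined above
clock_hz = 22e6      # 22 MHz Splash-2 clock
cycles = m * K * l   # 40 million clock cycles
print(f"{cycles / 1e6:.0f}M cycles -> {cycles / clock_hz:.2f} s")  # ~1.8 s
```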
Figure 10: Look-up Table Organisation. [2]
When the number of PEs required is larger than the number of available PEs,
either more processor boards need to be added or PEs need to be time-shared. It should
be emphasized that neuron outputs are produced independently of other neurons, and the
algorithm waits until the computations in each layer are completed.
An MLP has a communication complexity of O(n^2), where n is the number of nodes.
As n grows, it becomes difficult to get good timing performance from a single processor
system. With a large number of processor boards, the single 36-bit input data bus can
cater to multiple input patterns. Note that, in a multiboard system, all boards receive the
same input, and this parallelism allows more data to stream into the system, thus
reducing the number of clock cycles.
5.1. Scalability:
The performance of the mapping on Splash-2 can be compared with a host
implementation for different sizes of the neural network. For this comparison, the authors
of the main paper of this presentation report look at only a single layer and represent the
network size by the number of nodes in that layer. Multilayered networks are considered
to be linearly scalable on the Splash-2 architecture. The performance measure is
processing time, as measured by the number of clock cycles for Splash-2 with a 22 MHz
clock. The sequential time and the Splash time are plotted in Figure 11; time has been
plotted on a log scale to accommodate the large difference in scale.
Figure 11: Speed comparison of the neural network on a SPARC-20 and Splash-2. [2]
5.2. Speed Evaluation:
The authors of [2] implemented their network of 20 input nodes on a 2-board system.
They achieve 176 million connections per second (MCPS) per layer by running the Splash
clock at 22 MHz.
A 6-board system can deliver more than a billion connections per second. This is
comparable to the performance of many high-end VLSI-based systems such as SYNAPSE
and CNAPS, which perform in the range of 5 GCPS.
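One way to arrive at the 176 MCPS figure is sketched below; the assumption that a board holds 8 neurons (16 PEs at 2 PEs per neuron), each consuming one connection per clock cycle, is ours and is not stated explicitly above:

```python
clock_hz = 22e6                       # 22 MHz Splash clock
neurons_per_board = 16 // 2           # 16 compute PEs per board, 2 per neuron
# one weighted connection processed per neuron per clock cycle (assumption)
mcps_per_layer = neurons_per_board * clock_hz / 1e6
print(mcps_per_layer)                 # 176.0, matching the reported MCPS
```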
5.3. Network-based Image Segmentation:
Image segmentation is the process of partitioning an image into mutually exclusive
connected image regions. In an automated image document understanding system, page
layout segmentation plays an important role in segmenting text, graphics and background
areas. Jain and Karu proposed an algorithm to learn the texture discrimination masks
needed for image segmentation. Their page segmentation algorithm has three stages of
computation:
1. feature extraction: Feature extraction is based on 20 masks.
2. classification: The neural network used is a multistage feedforward
neural network with 20 input nodes, 20 hidden nodes and 3 output nodes.
3. postprocessing: Postprocessing involves removing small noisy regions
and placing rectangular blocks around homogeneous regions.
Figure 12: Schematic of the Page Segmentation Algorithm. [2]
Figure 13: Page Segmentation (a) input gray level image (b) result of segmentation
algorithm (c) result after postprocessing. [2]
6. Conclusion
A novel scheme for mapping MLPs onto a Custom Computing Machine has been
presented. The scheme is scalable in terms of the number of nodes and the number of
layers in the MLP, and provides near-ASIC level speed. The reconfigurability of CCMs
has been exploited to map several layers of an MLP onto the same hardware. In addition,
the performance gains achieved using this mapping have been demonstrated on a
network-based image segmentation application.
REFERENCES:
[1] Carlos Gershenson, "Artificial Neural Networks for Beginners", The University of
Sussex, http://www.cogs.susx.ac.uk/users/carlos/doc/FCS-ANN-tutorial.pdf
[2] Nalini K. Ratha and Anil K. Jain, "High Performance Multilayer Perceptron on a
Custom Computing Machine", Department of Computer Science, Michigan State University.
[3] "An Introduction to Artificial Neural Networks", University of Mining and
Metallurgy in Cracow, Institute of Automatics,
http://student.uci.agh.edu.pl/~best/summer/m/introduction_to_nets.pdf
[4] Paolo Ienne (EPFL Microcomputing Laboratory, Lausanne) and Gary Kuhn (SCR
Learning Systems, Princeton), "Digital Systems for Neural Networks".
[5] Jan N. H. Heemskerk, "Overview of Neural Hardware", Unit of Experimental and
Theoretical Psychology, Leiden University, Netherlands.