ABSTRACT: Multilayer perceptrons (MLPs) are among the most popular neural network models for solving pattern classification and image classification problems. Once the connection weights in an MLP have been learnt, the network can be used repeatedly for classification of input test patterns. Because neural networks are slow on a conventional uniprocessor, several special-purpose architectures for them have been described in the literature. This presentation report describes a high-performance architecture for MLPs built on a novel class of hardware called "Custom Computing Machines" (CCMs).

1. Introduction:

Artificial Neural Networks (ANNs) attempt to mimic biological neural networks, one of whose main features is the massively parallel interconnection among neurons. An artificial neuron is a computational model inspired by natural neurons. Natural neurons receive signals through synapses located on the dendrites or membrane of the neuron. When the signals received are strong enough (i.e. they surpass a certain threshold), the neuron is activated and emits a signal through the axon. This signal may be sent to another synapse and may activate other neurons (refer to Figure 1).

Figure 1: Natural neurons. [1]

When modelling artificial neurons, we abstract away much of the complexity of real neurons. An artificial neuron (also called a perceptron) basically consists of a d-dimensional input vector X that is connected to an input node through suitable weights. The weight values corresponding to the connections to this input node are the components of a d-dimensional weight vector W. Weights represent the strength of the respective signals (i.e. the components of the input vector). As shown in Figure 2, inside the input node the inner product of the input vector and the weight vector is evaluated.
A nonlinear function (sometimes also depending on a threshold) computes the output, taking the result of the inner product as input. In the paper by Jain and Ratha [2], which forms the core subject of this presentation report, the nonlinear function is the sigmoid function tanh(beta * x) with beta = 0.25.

Figure 2: Schematic of a perceptron. [2]

ANNs combine artificial neurons in order to process information. The design parameters of an ANN can be listed as follows:

- network topology
  - number of layers
  - number of nodes in a layer
- connection weights
- property at a node (i.e. the type of non-linearity to be used)

1.1. Multilayer Perceptrons (MLPs):

Multilayer Perceptrons (MLPs) are among the most popular neural network models for solving pattern classification and image classification problems. A Multilayer Perceptron consists of several layers of perceptrons. Nodes in the ith layer are connected to nodes in the (i+1)th layer through suitable weights, and there are no interconnections among the nodes within a layer. An MLP with two hidden layers is shown in Figure 3.

Figure 3: (a) A Multilayer Perceptron with two hidden layers; (b) a biological neuron. [3]

1.2. Training of an MLP:

Training of an MLP consists of two stages:

1. Feedforward stage
2. Weight update stage

In the feedforward stage, training patterns with known class labels are presented at the input layer and the output is computed at the output layer. At the start of the training phase the weight matrix is randomly initialized. In the weight update stage, weights are updated in a backward fashion starting with the output layer; each weight is changed in proportion to the error between the desired output and the actual output. The feedforward and weight update stages are repeated until the network converges (i.e. the difference between the desired output and the actual output at the output layer is within a predetermined limit). In Figure 4, a detailed schema of an MLP with a single hidden layer is given.
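As an illustrative sketch (not code from the paper), the perceptron computation and the error-proportional weight update described above can be written in Python; the function names and the learning rate are my own choices:

```python
import math

BETA = 0.25  # slope of the sigmoid tanh(beta * x) used by Jain and Ratha [2]

def perceptron_output(x, w):
    """Inner product of input vector x and weight vector w,
    followed by the nonlinearity tanh(beta * s)."""
    s = sum(wi * xi for wi, xi in zip(w, x))  # inner product
    return math.tanh(BETA * s)

def delta_rule_step(x, w, target, lr=0.1):
    """One weight-update step: change each weight in proportion to the
    error between the desired output and the actual output."""
    y = perceptron_output(x, w)
    error = target - y
    grad = BETA * (1.0 - y * y)  # derivative of tanh(beta * s) w.r.t. s
    return [wi + lr * error * grad * xi for wi, xi in zip(w, x)]
```

Repeating such update steps over the training patterns until the output error falls within a predetermined limit corresponds to the convergence criterion above; a full MLP applies the same idea layer by layer, backwards from the output layer.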
Figure 4: A Multilayer Perceptron. [2]

For an MLP with n nodes, the number of interconnections needed is O(n^2). Thus mapping an MLP onto a parallel processor is a real challenge. On a uniprocessor, by contrast, the whole operation proceeds sequentially one node at a time with no complex communication involved. However, for a high-performance implementation, efficient communication capability must be supported. For instance, typical pattern recognition and computer vision applications have more than 100 input nodes. In addition, a classification process involving complex decision boundaries demands a large number of hidden nodes. In real-time computer vision applications, the network training can be carried out offline. During the recall phase (which is supposed to be real-time), however, high input/output bandwidth is required along with fast classification (recall) speeds.

For a three-layer network (excluding the input layer), let:

m: number of input nodes
n1: number of nodes in the first hidden layer
n2: number of nodes in the second hidden layer
k: number of output nodes (classes)
Nm: number of multiplications
Na: number of additions

Then the total number of multiplications is

Nm = (m * n1) + (n1 * n2) + (n2 * k)

and the total number of additions is

Na = Nm - (n1 + n2 + k)

The nonlinearity is not included in the calculations of Na and Nm.

As a numeric example, consider a practical vision system in which a 1024x1024 image is processed in real time. If the total number of frames to be processed per second is 30, then the number of input patterns to be processed per second is

30 * 1024 * 1024 ≈ 30 * 10^6 input patterns/sec.

Thus, a real-time neural network classifier is expected to perform billions of operations per second. Since connection weights are floating point numbers, the multiplications and additions are floating point operations. As a result, throughputs of this kind are difficult to achieve with today's most powerful uniprocessors.
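The operation counts above can be checked with a short Python calculation; the layer sizes here are hypothetical, chosen only for illustration:

```python
def op_counts(m, n1, n2, k):
    """Multiplications and additions for one feedforward pass of a
    three-layer MLP (input layer excluded), ignoring the nonlinearity."""
    n_mult = (m * n1) + (n1 * n2) + (n2 * k)  # one multiply per connection
    n_add = n_mult - (n1 + n2 + k)            # summing p products needs p-1 adds
    return n_mult, n_add

# Hypothetical network: 100 inputs, two hidden layers of 50 nodes, 10 classes
nm, na = op_counts(100, 50, 50, 10)         # (8000, 7890)

# Real-time vision scenario from the text: 30 frames of 1024 x 1024 pixels
patterns_per_sec = 30 * 1024 * 1024         # about 30 million patterns/sec
ops_per_sec = patterns_per_sec * (nm + na)  # hundreds of billions of ops/sec
```

Even for this modest hypothetical network, the required throughput confirms the point made above about uniprocessors.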
1.3. Parallel Architectures for ANNs:

The types of parallelism available in an MLP can be listed as follows:

- Training session parallelism
- Training example parallelism
- Layer and backward/forward parallelism
- Node parallelism
- Weight parallelism
- Bit parallelism

Layer and backward/forward parallelism, node parallelism and weight parallelism can easily be mapped to a parallel architecture. The complexities involved in MLPs are:

1. Computational complexity
2. Communication complexity (which arises because the inner product computation involves a large number of communication steps)

As a result of these computational and communication complexities, special-purpose neurocomputers have been built using either commercially available special-purpose VLSIs or special-purpose VLSIs designed for a distinct type of artificial neural network. VLSIs for a distinct type of ANN provide the best performance. The number of input nodes, the number of layers and the number of nodes in each layer change from one MLP application to another. However, it is expensive to design a VLSI architecture for each individual application. Hence, architectures with a fixed number of nodes and layers are typically fabricated.

1.4. Special-Purpose ANN Implementations in the Literature:

Ghosh & Hwang: Ghosh and Hwang investigated the architectural requirements for simulating ANNs using massively parallel multiprocessors. In their work, they propose a model for mapping neural networks onto message-passing multicomputers.

Liu & Wilcox: Liu and Wilcox present an efficient implementation of the backpropagation algorithm on the CM-5 that avoids explicit message passing, and compare the results of CM-5 backpropagation with those of the Cray-2, Cray X-MP and Cray Y-MP.

Chinn: Chinn describes a systolic algorithm for ANNs on the MasPar MP-1 using a 2-D systolic-array-based design.

Onuki: Onuki presents a parallel implementation using a set of sixteen standard 24-bit DSPs connected in a hypercube.
Kirsanov: Kirsanov discusses a new architecture for ANNs using transputers.

Muller: Muller presents a special-purpose parallel computer using a large number of Motorola floating point processors for ANN implementation.

1.5. Special-Purpose VLSI Chips Designed and Fabricated for ANN Implementations:

Hammerstrom: Hammerstrom implemented a high-performance and low-cost ANN with 64 processing nodes per chip and hardware-based multiply & accumulate operators.

Barber: Barber used a binary tree adder following parallel multipliers in the SPIN-L architecture.

Shinokawa: Shinokawa describes a fast ANN with a billion connections per second using ASIC chips.

Viredez: Viredez describes the MANTRA-1 neurocomputer, which uses 2 x 2 systolic PE (Processing Element) blocks.

Kotolainen: Kotolainen proposed a tree of connection units with processing units at the leaf nodes for mapping many common ANNs.

Asanovic: Asanovic proposed a VLIW processor with a 128-bit instruction width and a 7-stage pipeline, with 8 processors per chip.

Ramacher: Ramacher describes the architecture of SYNAPSE, a systolic neural signal processor using a 2-D array of systolic elements.

Mueller & Hammerstrom: Mueller and Hammerstrom describe the design and implementation of CNAPS, a gate-array implementation of ANNs. A single CNAPS chip consists of 64 processing nodes, each connected in a SIMD fashion using a broadcast interconnect.

Cox: Cox describes the implementation of GANGLION, a single neuron having a fixed neural architecture of 12 input nodes, 14 hidden nodes and 4 output nodes. Its 8 x 8 multipliers were built using CLBs (Configurable Logic Blocks); in addition, a look-up table is used for the activation function.

1.6. Stochastic Neural Architectures:

In stochastic neural architectures, there is no need for a time-consuming and area-costly floating point multiplier, which makes them suitable for VLSI implementation.
Examples of stochastic neural architectures are as follows:

Armstrong & Thomas: Armstrong and Thomas proposed a variation of the ANN called the Adaptive Logic Network (ALN). ALNs are similar to ANNs, but the costly multiplications are replaced by logical AND operations, while additions are replaced by logical OR operations.

Masa et al.: Masa et al. describe an ANN with a single output, six hidden layers and seventy inputs, which can operate at a 50 MHz input rate.

2. Custom Computing Machines:

In a uniprocessor, the instruction set available to a programmer is fixed, and an algorithm is coded using a sequence of instructions. A processor can serve many applications by simply reordering the sequence of instructions. Application Specific Integrated Circuits (ASICs), on the other hand, are built for a specific application. Thus, for the specific application they are designed to serve, they provide higher performance than a general-purpose uniprocessor. With a Custom Computing Machine (CCM), a user can customize the architecture and instructions for a given application by programming at the gate level, so that high performance can be achieved. Using a CCM, a user can tune the architecture to match the requirements of the problem.

A CCM can overcome the limitations of ASICs, which are:

- ASICs are fast but costly.
- ASICs are non-reconfigurable.
- ASICs are time-consuming to design and fabricate.

The advantages of CCMs are:

- CCMs are cheap: CCMs use Field Programmable Gate Arrays (FPGAs) as compute elements. Since FPGAs are off-the-shelf components, they are relatively cheap.
- CCMs are reconfigurable: since FPGAs are reconfigurable, CCMs are easily reprogrammed.
- CCMs are time-saving: CCMs do not need to be fabricated for every new application, since they are often employed for fast prototyping. Hence, they save a considerable amount of time in the design and implementation of algorithms.
Splash-2 is one of the leading FPGA-based custom computing machines, designed and developed by the Supercomputing Research Center.

2.1. System-Level View of the Splash-2 Architecture:

An interface board connects Splash-2 to the host and extends the address and data buses. The Sun host can read/write the memories and memory-mapped control registers of Splash-2 via the SIMD bus and RBus. Splash-2 consists of Splash processing boards (see Figure 5). Each Processing Element (PE) has 512 KB of memory which the host can read/write. The individual memory available with each PE makes it convenient to store temporary results and tables (refer to Figure 6). PEs X1 to X16 are used for the inner product and sigmoid function evaluations, whereas PE X0 controls the data flow into the processor board.

Figure 5: System-level view of the Splash-2 architecture. [2]

Figure 6: Processing element in Splash-2. [2]

3. Splash-2: Architecture and Programming Flow:

The design automation process consists of two steps: simulation and synthesis. The programming flow is shown in Figure 7. In simulation, the logic designed using VHDL is verified. In synthesis, the main concern is to achieve the best placement of logic in an FPGA in order to minimize timing delay. If the logic circuit cannot be mapped to the CLBs and flip-flops available inside an FPGA, the designer needs to revise the logic in the VHDL code, and the process is repeated. Once the logic is mapped to CLBs, the timing for the entire digital logic is obtained. If the timing obtained is not acceptable, the design process is repeated. To program Splash-2, we need to program each of the PEs, the crossbar and the host interface.

4. Mapping an MLP onto Splash-2:

In implementing a neural network classifier on Splash-2, the building block is the perceptron implementation.
To map an MLP onto Splash-2, two physical PEs serve as one neuron: the ith PE handles the inner product phase, computing the sum of the products w_ij * x_i, while the (i+1)th PE computes the nonlinear function tanh(beta * x) with beta = 0.25, where i is odd and (i+1) is even.

Figure 7: Programming flow for Splash-2. [2]

Figure 8: Steps in software development on Splash-2. [2]

Assuming the perceptrons have been trained so that the connection weights are fixed, an efficient way of handling multiplication is to employ a look-up table. Since the external memory of each PE (512 KB) is large, the look-up table can be stored there. A pattern vector component xi is presented at every clock cycle, and the following stages are performed:

1. Inner product calculation: the ith PE looks up the multiplication table to obtain the weighted product. The sum of the products w_ij * x_i is computed using an accumulator. After all the components of a pattern vector have been examined, the inner product has been computed.
2. Application of the nonlinear function to the inner product: on receiving the inner product result from the first PE, the second PE uses the result as the address into the nonlinearity look-up table and produces the output.
3. The output of the neuron is thus obtained: the output is written back to the external memory of the second PE, starting from a pre-specified location.
4. After sending all the pattern vectors, the host can read back the memory contents.

A layer in the neural network is simply a collection of neurons working synchronously on the input. On Splash-2, this can be achieved by broadcasting the input to as many physical PEs as desired. The output of a neuron is written into a specified segment of external memory and read back by the host. For every layer in the MLP, stages 1 to 4 are repeated until the output layer is reached. It should be noted that every layer has a different look-up table.

4.1. Look-up Table Organisation:

There are m multiplications to be performed per node, corresponding to the m-dimensional weight vector.
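A behavioural sketch of this look-up-table multiplication, with one table segment per weight, can be written in plain Python (the real hardware stores fixed-point products in external memory; the function names and sizes here are illustrative assumptions):

```python
def build_mult_table(weights, n_bits=12):
    """One table segment per weight w, holding w * x precomputed for
    every signed n_bits-wide input value x (sizes illustrative)."""
    lo, hi = -(1 << (n_bits - 1)), 1 << (n_bits - 1)
    return [{x: w * x for x in range(lo, hi)} for w in weights]

def inner_product_pe(table, pattern):
    """Role of the odd-numbered PE: one table look-up plus one
    accumulator add per clock cycle, over the whole pattern vector."""
    acc = 0
    for segment, x in zip(table, pattern):
        acc += segment[x]  # look up w * x, then accumulate
    return acc
```

The even-numbered PE would then use the accumulated sum as the address into a second table holding precomputed values of tanh(0.25 * x).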
The look-up table is divided into m segments (see Figure 9). A counter incremented at every clock forms the higher-order (block) address for the look-up table, and the pattern vector component forms the lower-order address bits. Splash-2 has an 18-bit address bus for the external memory:

- the higher-order 6 bits form the block address;
- the lower-order 12 bits form the offset address within the block.

It should be noted that the offset can also be negative, corresponding to a negative input to the look-up table.

Figure 9: Look-up table address computation. [2]

The numbers are represented in 12-bit 2's complement form; hence the resolution of this representation is eleven bits. The accumulator is inside the PE and is 16 bits wide. After accumulation, the accumulator result is scaled down to 12 bits.

5. Performance Evaluation:

For mapping an MLP that is required to complete a classification process, the number of PEs needed is equal to twice the number of nodes in each layer. The number of clock cycles necessary to complete the whole classification process is m * K * l, where:

m: number of input layer nodes
K: number of patterns
l: number of clock cycles

In the implementation performed by the authors of the main paper of this presentation report ("High Performance Multilayer Perceptron on a Custom Computing Machine" by Nalini K. Ratha and Anil K. Jain):

m = 20
K = 1024 x 1024 ≈ 10^6 (the total number of pixels in the input image)
l = 2

Thus, the number of clock cycles is 20 * 2 * 10^6 = 40 * 10^6; with a clock rate of 22 MHz, the time taken for 40 million clock ticks is 1.81 seconds.

Figure 10: Look-up table organisation. [2]

When the number of PEs required is larger than the number available, either more processor boards need to be added or PEs need to be time-shared. It should be emphasized that neuron outputs are produced independently of other neurons, and the algorithm waits until the computations in each layer are completed. An MLP has a communication complexity of O(n^2), where n is the number of nodes.
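Stepping back to the addressing scheme of Section 4.1, the 18-bit address formation (6-bit block number plus 12-bit two's-complement offset) can be sketched as follows; the function name is mine, and the two's-complement wrapping is my reading of the statement that the offset can be negative:

```python
BLOCK_BITS, OFFSET_BITS = 6, 12  # Splash-2's 18-bit external memory address

def lut_address(block, x):
    """Form the external-memory address for a table look-up: the block
    counter supplies the high-order 6 bits, and the 12-bit
    two's-complement pattern component x supplies the offset."""
    assert 0 <= block < (1 << BLOCK_BITS)
    offset = x & ((1 << OFFSET_BITS) - 1)  # two's-complement wrap for negative x
    return (block << OFFSET_BITS) | offset
```

For example, a negative component such as x = -1 maps to offset 4095, the top of its 4096-entry block.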
As n grows, it becomes difficult to get good timing performance from a single-processor system. With a large number of processor boards, the single 36-bit input data bus can cater to multiple input patterns. Note that in a multiboard system all boards receive the same input; this parallelism allows more data to stream into the system, thus reducing the number of clock cycles.

5.1. Scalability:

The performance of the mapping on Splash-2 can be compared with a host implementation for different sizes of the neural network. For this comparison, the authors of the main paper of this presentation report ("High Performance Multilayer Perceptron on a Custom Computing Machine" by Nalini K. Ratha and Anil K. Jain) look at only a single layer and represent the network size by the number of nodes in that layer. Multilayered networks are considered to be linearly scalable on the Splash-2 architecture. The performance measure is the processing time, measured in clock cycles for Splash-2 with a 22 MHz clock. The sequential time and the Splash time are plotted in Figure 11; time is plotted on a log scale to accommodate the large difference in time scales.

Figure 11: Speed comparison of the neural network on a Sparc-20 and Splash-2. [2]

5.2. Speed Evaluation:

The authors of [2] implemented their network of 20 input nodes on a 2-board system. They achieve 176 million connections per second (MCPS) per layer by running the Splash clock at 22 MHz. A 6-board system can deliver more than a billion connections per second. This is comparable to the performance of many high-end VLSI-based systems such as SYNAPSE and CNAPS, which perform in the range of 5 GCPS.

5.3. Network-Based Image Segmentation:

Image segmentation is the process of partitioning an image into mutually exclusive connected image regions. In an automated image document understanding system, page layout segmentation plays an important role in segmenting text, graphics and background areas.
Jain and Karu [22] proposed an algorithm to learn the texture discrimination masks needed for image segmentation. The page segmentation algorithm proposed by Jain and Karu has three stages of computation:

1. Feature extraction: feature extraction is based on 20 masks.
2. Classification: the neural network used is a multistage feedforward neural network with 20 input nodes, 20 hidden nodes and 3 output nodes.
3. Postprocessing: postprocessing involves removing small noisy regions and placing rectangular blocks around homogeneous regions.

Figure 12: Schematic of the page segmentation algorithm. [2]

Figure 13: Page segmentation: (a) input gray-level image; (b) result of the segmentation algorithm; (c) result after postprocessing. [2]

6. Conclusion

A novel scheme for mapping MLPs onto a Custom Computing Machine has been presented. The scheme is scalable in terms of the number of nodes and the number of layers in the MLP, and it provides near-ASIC-level speed. The reconfigurability of CCMs has been exploited to map several layers of an MLP onto the same hardware. In addition, the performance gains achieved using this mapping have been demonstrated on network-based image segmentation.

REFERENCES:

[1] "Artificial Neural Networks for Beginners", Carlos Gershenson, The University of Sussex, http://www.cogs.susx.ac.uk/users/carlos/doc/FCS-ANN-tutorial.pdf
[2] "High Performance Multilayer Perceptron on a Custom Computing Machine", Nalini K. Ratha and Anil K. Jain, Department of Computer Science, Michigan State University.
[3] "An Introduction to Artificial Neural Networks", University of Mining and Metallurgy in Cracow, Institute of Automatics, http://student.uci.agh.edu.pl/~best/summer/m/introduction_to_nets.pdf
[4] "Digital Systems for Neural Networks", Paolo Ienne, EPFL Microcomputing Laboratory, Lausanne, and Gary Kuhn, SCR Learning Systems, Princeton.
[5] "Overview of Neural Hardware", Jan N. H.
Heemskerk, Unit of Experimental and Theoretical Psychology, Leiden University, Netherlands.