Fast IP Routing Using FPGA

Submitted by: Raushan Raj (200730020), Rupen Humad (200730023), Rahul Sachdev (200730019), Addhyan Pandey (200730002)

Introduction

Routing is a network layer process. Two protocols are commonly used at the network layer, IPv4 and IPv6, and IPv4 is the more common of the two. Our aim was to target the IPv4 routing process with a design that can be implemented on an FPGA, so that routing as a whole is performed completely in hardware. A further goal of the design was to increase the speed of operation of routers. An FPGA normally runs at clock frequencies of the order of MHz (125 MHz in our case), while an ASIC design can have clocks of the order of GHz, which makes an ASIC design faster than an FPGA design. The downside is that an ASIC design consumes more power and takes longer to validate. If we can increase the operating speed of an FPGA design, we get both the high speed characteristic of ASICs and the short time to market characteristic of FPGAs. This is what we have attempted to do in this project.

Work done previously

We discussed the block diagrams needed for the design, along with the basic functionality and architecture of each block. The five basic blocks discussed were the following:

1) 8-32 converter
2) Distribution block
3) Storing buffer
4) Forwarding block
5) 32-8 bit converter

We also started the design of the blocks, and by the first evaluation we had completed the first block, i.e. the 8-32 bit converter.

8-32 bit converter

This block, as the name suggests, takes 8-bit words as input and outputs 32-bit words. It is needed to resolve the word-width mismatch between the 32-bit network processor and the 8-bit MAC layer processor, and it also serves as the interface between the MAC layer and the network layer. The MAC layer feeds packets into this block, marking the beginning of each packet with a signal called Start of Frame (SOF).
The converter block stores the inputs, which arrive in 8-bit bursts, in a buffer. When a complete packet has been stored (this is indicated by another signal from the MAC layer called End of Frame, EOF), the module outputs only the IPv4 header, in 32-bit bursts, to the next module. The module thus enables us to operate 4 MAC layer ports simultaneously, because the buffer in the converter block fills at ¼ of the speed at which it is emptied.

This module also provides an added advantage. Routing does not require the complete packet; only the header is needed for processing. The module therefore forwards only the header to the next module and keeps the data part in its buffer. It then simply appends the data part to the new outbound header. The header can be at most 60 bytes, while the data payload is around 1500 bytes on average, so this saves a significant amount of transfer time. A simplified block diagram of the block is given below:

[Block diagram: 8-32 bit converter]

Work done after the first evaluation

After the first evaluation we went on to design the rest of the modules. Their descriptions are as follows.

Distribution Block

This block takes its input from the 8-32 converter block, distributes it to the blocks that handle the relevant fields, and stores the rest of the header in another buffer. The first 32-bit word passed on by the previous block contains the version (4 or 6) and the header length. It is therefore routed to the version check block, which drops the packet if the version is not 4 (IPv4 routing). The block then stores the word in the buffer that follows it. The next 32-bit word contains the Identification and Fragment Offset fields. The third 32-bit word contains the Time to Live (TTL) and Header Checksum fields, so it is forwarded to the TTL block and the Header Checksum block.
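The handling of the first header words described above can be modelled in a few lines of Python. This is a sketch, not the FPGA implementation: the names `HeaderFields`, `distribute` and `checksum_ok` are hypothetical, and the bit positions follow the standard IPv4 header layout rather than anything stated in this report. The sample words are illustrative values chosen so that the checksum verifies.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class HeaderFields:
    """Fields picked out of the first three 32-bit header words (hypothetical model)."""
    version: int
    header_words: int      # header length, counted in 32-bit words
    identification: int
    fragment_offset: int
    ttl: int
    checksum: int


def distribute(words: Sequence[int]) -> Optional[HeaderFields]:
    """Split the first three words into fields; return None to drop a non-IPv4 packet."""
    w0, w1, w2 = words[0], words[1], words[2]
    version = (w0 >> 28) & 0xF
    if version != 4:                      # version check block: IPv4 only
        return None
    return HeaderFields(
        version=version,
        header_words=(w0 >> 24) & 0xF,
        identification=(w1 >> 16) & 0xFFFF,
        fragment_offset=w1 & 0x1FFF,      # low 13 bits; the 3 flag bits are skipped
        ttl=(w2 >> 24) & 0xFF,
        checksum=w2 & 0xFFFF,
    )


def checksum_ok(header_words: Sequence[int]) -> bool:
    """Ones' complement sum over all 16-bit halves of the header must equal 0xFFFF."""
    total = 0
    for word in header_words:
        total += (word >> 16) & 0xFFFF
        total += word & 0xFFFF
    while total > 0xFFFF:                 # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


# Illustrative 20-byte header: version 4, 5 words long, TTL 64.
SAMPLE = [0x45000073, 0x00004000, 0x4011B861, 0xC0A80001, 0xC0A800C7]
fields = distribute(SAMPLE)
```

In this model a packet would be passed on only if `distribute` returns a value and `checksum_ok` is true for the first `header_words` words; otherwise it would be dropped.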
The TTL block decrements the TTL field by 1 and checks that the result is greater than or equal to zero. The Header Checksum block verifies the old checksum and calculates the new checksum. If the TTL field is less than zero or the header checksum does not match, the packet is dropped. The next 32-bit word contains the source IP address. The 32-bit word after it contains the destination IP address, which is sent to the Look Up table to find the next hop. The block diagram of the distribution block is shown below:

[Block diagram: Distribution Block]

Storing Block

This block stores the 32-bit words coming from the Distribution Block, but only those portions of the header that do not change. The header fields that do change are picked up directly from the corresponding blocks. This reduces the amount of memory required, and it also reduces memory overhead because fewer memory references are made. The block diagram of the storing block is given below:

[Block diagram: Storing Block]

Forwarding Block

This block begins preparing the header as required by the MAC layer, so it adds the destination MAC address and the source MAC address to the header fields. The block waits for the Look Up process to complete and then stores the destination MAC address in another buffer (in the next block). It then puts its own MAC address in the next burst. This process takes three clock cycles. After that, the header from the buffer of the previous block is stored. Header fields that were changed during processing are read directly from the corresponding blocks, which reduces memory accesses. The block's output is stored in the 32-8 converter block. The block diagram of the forwarding block is given below:

[Block diagram: Forwarding Block]

32-8 converter block

The 32-8 converter block stores the data received from the previous block and converts it into 8-bit words to be sent through the Local Link FIFO to the MAC sublayer.
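The width conversion performed by this block, and by its counterpart the 8-32 bit converter, can be sketched in software as follows. This is only an illustrative model, not the hardware design: the function names `pack_8_to_32` and `unpack_32_to_8` are hypothetical, and both big-endian (network) byte order and zero padding of the last word are assumptions that the report does not state.

```python
from typing import List, Sequence


def pack_8_to_32(data: bytes) -> List[int]:
    """Group a byte stream into 32-bit words (assumed big-endian, zero-padded)."""
    padded = data + b"\x00" * (-len(data) % 4)   # pad up to a multiple of 4 bytes
    return [int.from_bytes(padded[i:i + 4], "big")
            for i in range(0, len(padded), 4)]


def unpack_32_to_8(words: Sequence[int], length: int) -> bytes:
    """Split 32-bit words back into bytes and discard any padding."""
    stream = b"".join(word.to_bytes(4, "big") for word in words)
    return stream[:length]


# Four bytes from the MAC side become one word on the network-layer side.
header_start = bytes([0x45, 0x00, 0x00, 0x54])
words = pack_8_to_32(header_start)
restored = unpack_32_to_8(words, len(header_start))
```

Because one 32-bit word carries four bytes, the 32-bit side needs only one transfer for every four on the 8-bit side, which is the ratio the report relies on when it serves 4 MAC ports from one network-layer bus.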
Since this block is also the interface between the MAC sublayer and the network layer, it also formats the data for use by the MAC layer. After this block completes its transmission, it signals the 8-32 bit converter block, which then transfers the remaining data out to the MAC sublayer. At the end, an EOF (End of Frame) is asserted to signal to the MAC sublayer that data transmission is over.

Calculations achieved

1) Total overall potential throughput: The clock speed is 125 MHz. The MAC layer transfers 8 bits per clock cycle from the physical layer to the network layer. The network layer can handle 4 incoming packets from the MAC layer at a time, because the bus width in the network layer is 32 bits while the bus width in the MAC layer is 8 bits. Thus the total overall potential throughput = 125 MHz × 8 bits × 4 = 4 Gbps.

2) Latency of the network layer (assuming an average payload of 1500 bytes): The five blocks each have different latencies. The 8-32 bit converter block waits for the whole packet to be stored before it begins its transmission. Thus, assuming an average packet size of 1500 bytes, it takes 1500 clock cycles before it can begin transmitting. The next block, i.e. the distribution block, distributes the output to different blocks. Depending on the circumstances, either the header checksum or the Look Up can take more time. A header checksum takes at most 15 cycles. Look Up is implemented using linear search over 16 entries, so it can take at most 16 clock cycles. The next block (Storing Buffer) works in parallel with the Distribution Block. Thus, assuming a buffer access time of 1 clock cycle, the buffer takes no additional time for storing. However, when it transmits the header, it takes at most 15 clock cycles for the complete header to be transmitted. The forwarding block takes 3 additional clock cycles because it also begins transmitting the MAC layer addresses.
Finally, the 32-8 bit converter takes 1500 clock cycles to transfer the whole packet from the network layer to the MAC sublayer. The latencies of the blocks are therefore as follows:

8-32 converter: 1500 clock cycles (average)
Distribution Block: 16 clock cycles (maximum)
Storing Buffer: 15 clock cycles (maximum)
Forwarding Block: 3 clock cycles (deterministic)
32-8 converter: 1500 clock cycles (average)

The total latency of the network layer is thus (1500 + 16 + 15 + 3 + 1500) cycles / 125 MHz = 3034 / 125 ≈ 24.3 μs, so the average latency is about 24 microseconds.

3) Worst case performance: In the worst case, all arriving packets are bound for the same output port, so the second packet has to wait until the first packet has been sent out. The overall performance is then equivalent to having only one interface between the input and the output port, i.e. 125 MHz × 8 bits = 1 Gbps.

Limitations

The first limitation of the project is that the router performs lookups using a linear search. This can be severely costly for a highly connected router with many table entries, so a more efficient lookup algorithm could be used instead.

Second, the first block, i.e. the 8-32 converter block, stores the whole packet before any processing begins, even though the complete packet is not needed for processing. This costs a latency penalty. It could be improved if the 8-32 bit converter stored only the header rather than the whole packet, although that could make the design considerably more complex.

Third, the design is intended for fast routing. If there is a slow link between any two nodes, the router does nothing for most of the time, which wastes clock cycles and therefore power. If the speed of operation could be reduced under these circumstances, the power wastage could be reduced. Our router has no such provision.
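The throughput and latency figures from the Calculations section can be reproduced with a short script, which also makes the cost of the linear lookup visible. This is a sketch built on the numbers stated in the report; the function and parameter names are hypothetical, and treating the Distribution Block latency as the larger of the checksum time and the lookup time is an interpretation of the figures above.

```python
CLOCK_MHZ = 125        # clock frequency stated in the report
MAC_BUS_BITS = 8       # MAC-side bus width
NET_BUS_BITS = 32      # network-layer bus width


def throughput_gbps(clock_mhz: float, mac_bus_bits: int, ports: int) -> float:
    """Aggregate throughput: each port delivers one MAC word per clock cycle."""
    return clock_mhz * mac_bus_bits * ports / 1000.0


def latency_cycles(packet_bytes: int, lookup_entries: int = 16,
                   checksum_cycles: int = 15, header_words: int = 15,
                   mac_address_cycles: int = 3) -> int:
    """Sum the per-block latencies listed in the report, in clock cycles."""
    converter_in = packet_bytes                           # one byte per cycle into the 8-32 converter
    distribution = max(checksum_cycles, lookup_entries)   # linear search: worst case one entry per cycle
    storing = header_words                                # at most 60 bytes = 15 words to send
    forwarding = mac_address_cycles
    converter_out = packet_bytes                          # one byte per cycle out of the 32-8 converter
    return converter_in + distribution + storing + forwarding + converter_out


def cycles_to_us(cycles: int, clock_mhz: float) -> float:
    """Convert clock cycles to microseconds (cycles / MHz)."""
    return cycles / clock_mhz


ports = NET_BUS_BITS // MAC_BUS_BITS                      # 4 MAC ports per network-layer bus
peak = throughput_gbps(CLOCK_MHZ, MAC_BUS_BITS, ports)    # 4 Gbps case
worst = throughput_gbps(CLOCK_MHZ, MAC_BUS_BITS, 1)       # all packets to one output port
average_latency = cycles_to_us(latency_cycles(1500), CLOCK_MHZ)
```

Under this model, raising `lookup_entries` from 16 to 1024 raises the total from 3034 to 4042 cycles, which illustrates why the linear search becomes the dominant processing cost as the table grows.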
Future prospects

The objective of our project is to implement fast IP routing on an FPGA. Fast designs are possible with ASICs, but ASICs have serious handicaps. An ASIC design requires a lot of effort in prototype design, which makes the process slow and incurs a heavy cost that becomes reasonable only for very high volume designs. FPGAs, on the other hand, have a less complex design cycle and a faster time to market than ASIC devices. They can also easily be reprogrammed at any time to incorporate changes. Thus, if we can achieve a faster design using FPGAs, then together with the other advantages of FPGAs we have a strong position in the market.

Another advantage of the design is its scalability. Changing the network layer bus width from 32 to 64 bits provides 8 MAC interfaces, each running at 1 Gbps (assuming a MAC layer bus width of 8 bits and a clock speed of 125 MHz), for a total speed of 8 Gbps. Thus, doubling the bus width doubles the speed. Replicating the processing units can also provide a speedup, although there is a limit to how many processing units can be included. For example, doubling the processing units lets us serve 4 additional ports, which doubles the speed in the ideal case. Finally, combining the doubled bus width with the doubled number of processing units gives a four-fold increase in speed. If the clock speed could simultaneously be increased from 125 MHz to 250 MHz (most FPGAs today use clocks in the 200-250 MHz range), the overall gain would become 8 times. Thus, scalability in speed is provided by scalability in bus width, clock speed and processing unit replication.

Conclusion

The IP router using the above design can achieve a maximum throughput of 4 Gbps.
The design does, however, suffer from a few drawbacks: the linear search algorithm, the overhead at the MAC-Network layer interface, and power wastage in some cases. Addressing these problems could make the design even faster.

References

http://en.wikipedia.org/wiki/IPv4
http://www.xilinx.com/support/documentation/user_guides/ug194.pdf
http://www.xilinx.com/support/documentation/ip_documentation/tri_mode_eth_mac_ug138.pdf