Fast IP Routing Using FPGA
Submitted By
Raushan Raj (200730020)
Rupen Humad (200730023)
Rahul Sachdev (200730019)
Addhyan Pandey (200730002)
 Introduction
Routing is a network layer process. There are two commonly used protocols at the network layer, IPv4 and IPv6, of which IPv4 is the more common. Our goal was to implement the IPv4 routing process in a design that can be targeted to an FPGA, so that routing as a whole is carried out entirely in hardware.
A further goal of the design was to increase the speed of operation of routers. An FPGA normally runs at clock frequencies of the order of MHz (125 MHz in our case), while an ASIC can be clocked in the GHz range, making an ASIC design faster than an FPGA. The downside is that an ASIC demands far more design effort and takes much longer to validate. If we can increase the speed of operation of the FPGA design, we combine the high-speed characteristic of ASICs with the fast time to market characteristic of FPGAs. This is what we have attempted to do in this project.
 Work done previously
The relevant block diagrams needed for the design were discussed, along with the basic
functionality and architecture of each block.
The five basic blocks discussed were the following:
1) 8-32 bit converter
2) Distribution block
3) Storing buffer
4) Forwarding block
5) 32-8 bit converter
We also started the design of the blocks, and by the first evaluation we had completed the first
block, i.e. the 8-32 bit converter.
8-32 bit converter
This block, as the name suggests, takes 8-bit words as input and outputs 32-bit words. The module is needed to bridge the word-size mismatch between the 32-bit network-layer processing and the 8-bit MAC-layer interface. It also serves as the interface between the MAC layer and the network layer.
The MAC layer begins a packet transfer to this block with a signal called Start of Frame (SOF). The converter block stores the inputs, which arrive in 8-bit bursts, in a buffer. When a complete packet has been stored (indicated by another MAC-layer signal, End of Frame), the module outputs only the IPv4 header in 32-bit bursts to the next module. The module thus enables us to operate four MAC-layer ports simultaneously, because the buffer in the converter block fills at one quarter of the rate at which it is emptied.
This module also provides an added advantage. For routing, the complete packet payload is not required; only the header is needed for processing. The module therefore forwards only the header to the next module, keeping the data portion in its buffer, and later appends the data to the new outbound header. The header can be at most 60 bytes, while the data payload averages around 1500 bytes, so this saves a significant amount of internal transfer time.
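As an illustration of the behaviour described above, the following Python sketch models the converter at the packet level. It is a behavioural sketch only, under the assumptions already stated in this section (SOF/EOF framing from the MAC layer, network byte order); names such as convert_8_to_32 and ihl_words are our own illustrative choices, not identifiers from the actual hardware description.

def convert_8_to_32(byte_bursts):
    """Behavioural sketch of the 8-32 bit converter.

    byte_bursts: the bytes received from the MAC layer between SOF and
    EOF, i.e. one complete packet.
    Returns (header_words, payload): the IPv4 header packed into 32-bit
    words for the next module, and the payload kept back in the buffer.
    """
    packet = bytes(byte_bursts)          # whole packet buffered first (EOF seen)

    # IHL (lower nibble of the first byte) gives the header length in
    # 32-bit words; at most 15 words = 60 bytes.
    ihl_words = packet[0] & 0x0F
    header_len = ihl_words * 4

    header = packet[:header_len]
    payload = packet[header_len:]        # stays in the converter's buffer

    # Repack the header into 32-bit words (big-endian / network order).
    header_words = [int.from_bytes(header[i:i + 4], "big")
                    for i in range(0, header_len, 4)]
    return header_words, payload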
A simplified block diagram of the block is given below:
 Work done after the first evaluation
After the first evaluation we went on with the design of the rest of the modules.
Their descriptions are as follows:
Distribution Block:
This block takes its input from the 8-32 bit converter block, distributes the relevant fields to the appropriate processing blocks, and stores the rest of the header in another buffer. The first 32-bit word received from the previous block contains the version (4 or 6) and the header length. This word is therefore routed to the version-check block, which checks the version and drops the packet if it does not turn out to be 4 (IPv4 routing). The block then stores the data in the buffer ahead of it.
The second 32-bit word contains the Identification and fragment-offset fields.
The third 32-bit word contains the Time to Live (TTL) and header checksum fields, and is forwarded to the TTL block and the Header Checksum block. The TTL block decrements the TTL field by 1 and checks whether the result is still greater than zero. The Header Checksum block verifies the old checksum and calculates the new one. If the TTL has expired or the header checksum does not match, the packet is dropped.
The fourth 32-bit word contains the source IP address.
The fifth 32-bit word contains the destination IP address, which is sent to the look-up table to find the next hop.
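The per-word processing described above can be modelled roughly as follows, assuming the standard IPv4 one's-complement header checksum and, purely for illustration, a small exact-match routing table searched linearly (as in the 16-entry look-up of the design). The function and variable names are our own, not those used in the HDL.

def ones_complement_sum(halfwords):
    """16-bit one's-complement sum used by the IPv4 header checksum."""
    s = 0
    for h in halfwords:
        s += h
        s = (s & 0xFFFF) + (s >> 16)     # fold the carry back in
    return s

def process_header(words32, routing_table):
    """Sketch of the distribution block acting on the 32-bit header words.

    words32: IPv4 header as a list of 32-bit integers.
    routing_table: list of (destination, next_hop) pairs searched linearly.
    Returns (updated_words, next_hop), or None if the packet is dropped.
    """
    if words32[0] >> 28 != 4:                     # version-check block
        return None

    # Verify the existing checksum: the folded sum over all 16-bit
    # halves of a valid header equals 0xFFFF.
    halves = [h for w in words32 for h in ((w >> 16) & 0xFFFF, w & 0xFFFF)]
    if ones_complement_sum(halves) != 0xFFFF:
        return None                               # bad checksum, drop

    ttl = (words32[2] >> 24) & 0xFF               # TTL block
    if ttl <= 1:                                  # would expire after decrement
        return None
    ttl -= 1

    # Recompute the checksum with the new TTL and a zeroed checksum field.
    words32[2] = (ttl << 24) | (words32[2] & 0x00FF0000)
    halves = [h for w in words32 for h in ((w >> 16) & 0xFFFF, w & 0xFFFF)]
    words32[2] |= 0xFFFF - ones_complement_sum(halves)

    destination = words32[4]                      # fifth word
    for entry, next_hop in routing_table:         # linear search
        if entry == destination:
            return words32, next_hop
    return None                                   # no route found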
The block diagram of the distribution block is shown below:
Storing Block
This block stores the 32-bit words coming from the Distribution Block, but only those portions of the header that do not change. The header fields that do change are picked up directly from the corresponding processing blocks. This optimizes memory use, since less memory is required, and also reduces memory overhead, since the number of memory references is reduced.
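A minimal sketch of this idea follows; the choice of which word is treated as mutable (here the TTL/checksum word, index 2) is our own illustrative assumption.

# Only header words that do not change are written to memory; the word
# holding TTL and checksum is supplied live by the TTL and checksum
# blocks when the header is reassembled.
MUTABLE_WORD_INDICES = {2}

def store_header(words32):
    """Return a dict of the unchanging header words, keyed by word index."""
    return {i: w for i, w in enumerate(words32)
            if i not in MUTABLE_WORD_INDICES}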
The block diagram of the storing block is given below:
Forwarding Block
This block begins preparing the frame as required by the MAC layer by adding the destination MAC address and source MAC address ahead of the header fields. It waits for the look-up process to complete and then stores the destination MAC address in the buffer of the next block, followed by its own MAC address in the next bursts; this step takes three clock cycles. After that, the header from the previous block's buffer is stored. Header fields that have changed during processing are accessed directly from the corresponding blocks, thus reducing memory accesses.
The block outputs its data to be stored in the 32-8 bit converter block.
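The output ordering can be sketched as follows, where stored_words is the buffer built by the storing block and live_fields holds the words recomputed by other blocks (e.g. the TTL/checksum word). All names are illustrative assumptions rather than the signals used in the design.

def forward(dest_mac, own_mac, stored_words, live_fields, header_len_words):
    """Sketch of the forwarding block's output sequence.

    dest_mac, own_mac: 48-bit MAC addresses as integers, taken from the
    look-up result and the router's own configuration respectively.
    stored_words: dict of unchanged header words (index -> 32-bit word).
    live_fields: dict of header words recomputed by the other blocks.
    """
    out = []

    # The two 6-byte MAC addresses fit into three 32-bit bursts, which
    # is why this step takes three clock cycles.
    macs = dest_mac.to_bytes(6, "big") + own_mac.to_bytes(6, "big")
    out += [int.from_bytes(macs[i:i + 4], "big") for i in range(0, 12, 4)]

    # Then the IPv4 header, with changed fields taken directly from the
    # corresponding blocks instead of from memory.
    for i in range(header_len_words):
        out.append(live_fields[i] if i in live_fields else stored_words[i])
    return out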
The block diagram of the forwarding block is given below:
32-8 bit converter block
The 32-8 bit converter block stores the data received from the previous block and converts it into 8-bit words to be sent through the Local Link FIFO to the MAC sublayer. Since this block is also an interface between the MAC sublayer and the network layer, it formats the data for use by the MAC layer. After this block completes its transmission, it signals the 8-32 bit converter block, which then transfers the remaining packet data out to the MAC sublayer. At the end, an End of Frame (EOF) is asserted to signal the MAC sublayer that the data transmission is over.
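A behavioural sketch of this unpacking direction is given below, assuming the same network byte order as the 8-32 bit converter; the generator form and the names are our own illustration.

def convert_32_to_8(words32, payload_bytes):
    """Sketch of the 32-8 bit converter: serialise the rebuilt header
    (and MAC addresses) into bytes for the Local Link FIFO, let the
    payload still held by the 8-32 bit converter follow, and mark the
    final byte with an end-of-frame flag.
    Yields (byte, eof) pairs, eof being True only on the last byte.
    """
    stream = b"".join(w.to_bytes(4, "big") for w in words32) + bytes(payload_bytes)
    last = len(stream) - 1
    for i, b in enumerate(stream):
        yield b, i == last               # EOF asserted on the final byte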
 Calculations achieved
1) Total Overall Potential Throughput:
The clock speed is 125 MHz. The MAC layer transfers 8 bits per clock cycle from the physical
layer to the network layer. The network layer can handle 4 incoming packets from the MAC
layer at a time, because the bus width in the network layer is 32 bits while the bus width in
the MAC layer is 8 bits.
Thus, total overall potential throughput = 125 MHz * 8 bits * 4 = 4 Gbps.
2) Latency of the Network Layer (Assuming an average payload of 1500 bytes):
The five blocks have different latencies. The 8-32 bit converter block waits for the whole
packet to be stored before it begins its transmission; assuming an average packet size of
1500 bytes, it takes 1500 clock cycles before it can begin transmitting.
The next block, the distribution block, distributes its output to the different processing
blocks. Depending on the circumstances, either the header checksum or the look-up can take
longer: the header checksum takes at most 15 cycles, while the look-up is implemented as a
linear search over 16 entries and can therefore take at most 16 clock cycles.
The next block (the storing buffer) works in parallel with the distribution block, so, assuming
a buffer access time of 1 clock cycle, it adds no extra time while storing. However, when it
transmits the header, it takes at most 15 clock cycles for the complete header to be
transmitted. The forwarding block takes 3 additional clock cycles because it also transmits
the MAC-layer addresses. Finally, the 32-8 bit converter takes 1500 clock cycles to transfer
the whole packet from the network layer to the MAC sublayer.
Thus, the latencies of the blocks are as follows:
8-32 bit converter: 1500 clock cycles (average)
Distribution block: 16 clock cycles (maximum)
Storing buffer: 15 clock cycles (maximum)
Forwarding block: 3 clock cycles (deterministic)
32-8 bit converter: 1500 clock cycles (average)
Thus, the total latency of the network layer is:
(1500 + 16 + 15 + 3 + 1500) cycles / 125 MHz ≈ 24.3 μs.
The average latency is therefore on the order of 24 microseconds.
3) Worst Case Performance:
In the worst case, all arriving packets are bound for the same output port. The second packet
to arrive must then wait until the first packet has been sent out, so the overall performance
is equivalent to having only a single interface between the input and the output port.
Thus, the overall worst-case throughput is 125 MHz * 8 bits = 1 Gbps.
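The three figures above follow directly from the stated parameters (125 MHz clock, 8-bit MAC bus, 32-bit network bus, 1500-byte average packet); the short Python snippet below simply restates that arithmetic.

F_CLK = 125e6                 # clock frequency in Hz
MAC_BUS = 8                   # bits per clock cycle per MAC port
PORTS = 32 // MAC_BUS         # a 32-bit network bus serves 4 MAC ports

# 1) Total overall potential throughput
throughput = F_CLK * MAC_BUS * PORTS        # 4.0e9 bit/s = 4 Gbps

# 2) Network-layer latency for an average 1500-byte packet
cycles = 1500 + 16 + 15 + 3 + 1500          # per-block latencies in cycles
latency_us = cycles / F_CLK * 1e6           # about 24.3 microseconds

# 3) Worst case: every packet bound for the same output port
worst_case = F_CLK * MAC_BUS                # 1.0e9 bit/s = 1 Gbps

print(throughput, round(latency_us, 1), worst_case)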
 Limitations
The first limitation of the project is that the router performs its look-up using a linear search.
This can be very costly when connectivity is high, so a more efficient look-up algorithm could
be used instead.
The first block, the 8-32 bit converter, stores the whole packet before any processing begins,
even though the complete packet is not needed for processing. This costs us a latency penalty.
It could be improved if the 8-32 bit converter stored only the header rather than the whole
packet; however, this could itself be a serious handicap, as the design might become even
more complex.
The design is intended for fast routing. If there is a slow link between any two nodes, the
router may sit idle most of the time, wasting clock cycles and therefore power. If the speed of
operation could be reduced under these circumstances, the wasted power could be reduced,
but our router has no such provision.
 Future prospects
The objective of our project is to implement fast IP routing using an FPGA. Fast designs are
possible using ASICs, but they have serious handicaps.
An ASIC design requires a great deal of effort at the prototype stage, which makes the process
slow and incurs heavy costs that become reasonable only for very high-volume designs. FPGAs,
on the other hand, have the advantage that their design cycle is less complex and their time to
market is faster than that of ASIC devices. They can also be reprogrammed at any time to
incorporate new changes.
Thus, if we are able to achieve a fast design using FPGAs, then along with the other advantages
of FPGAs we have a strong position in the market.
Another added advantage of the design is its scalability. Simply widening the network-layer bus
from 32 to 64 bits would provide 8 MAC interfaces, each running at 1 Gbps (assuming a
MAC-layer bus width of 8 bits and a clock speed of 125 MHz), for a total of 8 Gbps. Thus,
doubling the bus width can double the speed.
Moreover, replicating the processing units can provide a further speed-up, although there is a
limit to how many processing units can be included. For example, doubling the processing units
lets us serve 4 additional ports, which in the ideal case again doubles the speed.
Finally, combining a doubled bus width with a doubled number of processing units can lead to a
four-fold increase in speed. If the clock speed could simultaneously be increased from 125 MHz
to 250 MHz (many FPGAs today use clocks in the 200-250 MHz range), the overall gain would
become 8 times.
Thus, the scalability in speed is provided by scalability in bus width, clock speed and
processing unit replication.
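The scaling argument can be condensed into a single expression for potential throughput: clock rate times network-layer bus width times the number of replicated processing units. The function below is our own restatement of the figures in the text, not part of the original design.

def potential_throughput(f_clk_hz, net_bus_bits, processing_units=1):
    """Potential throughput in bit/s under ideal conditions."""
    return f_clk_hz * net_bus_bits * processing_units

print(potential_throughput(125e6, 32))      # 4 Gbps: the current design
print(potential_throughput(125e6, 64))      # 8 Gbps: doubled bus width
print(potential_throughput(125e6, 64, 2))   # 16 Gbps: plus doubled processing units
print(potential_throughput(250e6, 64, 2))   # 32 Gbps: plus a 250 MHz clock (8x overall)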
 Conclusion
An IP router using the above design can achieve a maximum throughput of 4 Gbps. The design,
however, suffers from a few drawbacks, such as the linear-search look-up algorithm, the
overhead at the MAC-to-network-layer interface, and power wastage in some cases. If these
problems are addressed, the design can become even faster.
 References
http://en.wikipedia.org/wiki/IPv4
http://www.xilinx.com/support/documentation/user_guides/ug194.pdf
http://www.xilinx.com/support/documentation/ip_documentation/tri_mode_eth_mac_ug138.pdf