System Area Network Speeds Data Transfer between Servers with PCI Express
A new network switch technology targets the phenomenal demand for server-to-server transfer speeds that has become all too evident in the client-server architectures found throughout today's data processing environments.
by Joey Maitra, Magma
The proliferation of raw processing power in computers has resulted in system architectures where processing tasks are distributed and assigned to various processing elements in order to spread the load and derive better system throughput. The execution of these tasks is closely coordinated and then integrated by a central processing entity to produce the desired output. The intent is for the entire set of elements to share the processing load, thereby boosting the overall throughput of the system.
Processing elements must then communicate with the central entity and/or among themselves to synchronize the execution of their respective tasks. In most scenarios, there are also volatile and non-volatile storage elements dedicated to the distributed processing elements comprising the system. For instance, in blade centers, blade servers have their own private storage facilities and also communicate with each other over high-speed connections on the mid-plane, as well as with devices on a storage area network (SAN) through a switch module. This is typically the case in mid- to high-end server environments.
However, extending this paradigm to an environment made up of servers physically located in separate enclosures requires a fast interconnect mechanism. In other words, these servers must communicate among themselves via some sort of network. In such environments, there is also the need to access vast amounts of data via network attached storage (NAS) devices. This scenario is prevalent in datacenters and server farms, to name a few. Today, these access mechanisms are implemented via local area networks (LANs) with technologies such as InfiniBand, 10 Gigabit Ethernet, Fibre Channel and the like.
Another point to note is that the phenomenal rate of deployment of the Internet has resulted in most LANs using TCP/IP in the upper layers of the communication stack. IP packets from the TCP/IP layers are essentially encapsulated within the frames of the communication protocol used to form the LAN.
The physical connection to the network fabric for servers and computers takes place through either a network I/O controller card or a network controller device resident on the motherboard. These motherboards host a root complex processor, as shown in Figure 1.
A root complex denotes the root of an I/O hierarchy that connects the CPU/memory subsystem
to I/O devices. This hierarchy consists of a root complex (RC), multiple endpoints (I/O devices),
a switch and a PCI Express to PCI/PCI-X bridge, all interconnected via PCI Express links. PCI
Express is a point-to-point, low-overhead, low-latency communication link maximizing
application payload bandwidth and link efficiency. Inherent in the PCI Express technology is a
very robust communication protocol with its own set of Transaction, Data Link and Physical
Layers.
The network I/O controllers implement some specific communication protocol and provide the
interface to the physical media constituting the LAN. The controllers interface to a PCI Express
endpoint of the root complex processor (RCP) of the server node participating in the network.
Incidentally, this architecture is not restricted to servers since it is common in workstations,
desktops and laptops.
The vast majority of the industry's prevalent communication protocols were invented before the advent of PCI Express technology. These protocols have their own, almost identical, infrastructure made up of Transaction, Data Link and Physical layers. As depicted in Figure 2, data originating at the Application layer are transformed into TCP/IP packets and then embedded in PCI Express packets. These packets are then sent to the Ethernet controller, which extracts the TCP/IP packet from the PCI Express packets and re-packetizes it to be sent in Ethernet frames over the 10 Gigabit Ethernet physical media. The reverse process takes place at the destination server end.
It is obvious from the discussion so far that there is a great deal of protocol duplication. The cost of such duplication, measured in terms of overall network throughput, becomes more pronounced when the nuances of the various communication protocols are considered as they relate to efficiency, i.e., data rate, maximum payload, packet header overhead, etc. It turns out that this duplication, even though it may be executed in hardware, imposes unnecessary software and hardware overhead that seriously impacts the overall throughput of the network infrastructure.
Another important factor impacting the overall performance of a network is the bandwidth limitation of the physical media associated with the communication protocol used. This encompasses transfer rates, maximum supported distances and the connection topology, to name a few. For instance, with 10 Gbit/s Ethernet the restriction of the data transfer rate to 10 Gbit/s is potentially a very serious limitation for many applications.
Given this scenario, the ideal approach to boosting the overall performance of the network would be to use PCI Express technology as the network fabric itself. Embedded in the PCI Express packet is the IP datagram with the destination IP address of the server node. PCI Express is a point-to-point communication protocol and consequently does not have a media access control (MAC) address. Therefore, the most natural and logical approach to routing data from one node in the network to another is to have some entity route the data based on the destination IP address. Implementing this type of routing methodology essentially makes that entity an IP router.
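As a minimal sketch of what such an IP-routing entity maintains, assume a hypothetical table that correlates each server's IP address with the fabric port it sits behind; the addresses, port numbers and names below are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical routing table for the IP-routing entity described
 * above: each fabric port is associated with the IP address of the
 * server attached to it. Addresses and ports are illustrative. */
struct route_entry {
    uint32_t server_ip;   /* IP of the server behind this port */
    int      port;
};

static const struct route_entry routes[] = {
    { 0x0A000001u, 1 },   /* 10.0.0.1 -> port 1 */
    { 0x0A000002u, 2 },   /* 10.0.0.2 -> port 2 */
    { 0x0A000003u, 3 },   /* 10.0.0.3 -> port 3 */
};

/* Return the port for a destination IP, or -1 if the packet is not
 * destined for any server on the fabric (e.g., it must leave via a
 * WAN interface instead). */
static int lookup_port(uint32_t dst_ip)
{
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++)
        if (routes[i].server_ip == dst_ip)
            return routes[i].port;
    return -1;
}

int main(void)
{
    printf("10.0.0.2 -> port %d\n", lookup_port(0x0A000002u));
    return 0;
}
```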
This is where the PCI Express switch comes into play, as shown in Figure 3. All of the downstream ports of the PCI Express switch connect to the servers comprising the nodes of a system area network. Intelligence tied to the upstream port of the switch maintains the correlation between each downstream port and the IP address of the server attached to it. Data flows from one server to another through the PCI Express switch.
Consequently, the root complex processor (RCP) tied to the upstream port of the switch must communicate with the RCP of each server. This poses the question of how best to communicate between two RCPs.
Bus enumeration in PCI Express architecture, which is the same as in PCI bus architecture, does not allow one RCP to discover devices on a bus that belongs to another RCP. However, there is a technique pioneered by PLX Technology during the heyday of the PCI bus that addresses this issue: non-transparent bridging (NTB). This method allows two RCPs to communicate through the use of base address registers (BARs). The interchange of information applies to memory, I/O and configuration spaces in the context of PCI bus architecture, and it works for both systems. It can only be supported if the underlying hardware of the PCI Express switch provides NTB functions on the respective downstream ports.
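The essence of NTB is address translation through a BAR window: an access that lands in the window on one side of the bridge is redirected to a programmed base address on the other side. The sketch below models that translation with hypothetical names and addresses; real switch hardware implements it in device-specific registers.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of an NTB window: an access that falls inside
 * the BAR window on the local side is redirected to a programmed
 * base address in the remote RCP's memory space. */
struct ntb_window {
    uint64_t local_base;    /* window as seen by the local RCP */
    uint64_t remote_base;   /* translated base on the far side */
    uint64_t size;
};

/* Translate a local address inside the window into the corresponding
 * address in the remote RCP's address space; -1 if out of range. */
static int ntb_translate(const struct ntb_window *w,
                         uint64_t local, uint64_t *remote)
{
    if (local < w->local_base || local >= w->local_base + w->size)
        return -1;
    *remote = w->remote_base + (local - w->local_base);
    return 0;
}

int main(void)
{
    struct ntb_window w = { 0x80000000ULL, 0x2000000000ULL, 0x100000ULL };
    uint64_t remote;

    if (ntb_translate(&w, 0x80000010ULL, &remote) == 0)
        printf("local 0x80000010 -> remote 0x%llx\n",
               (unsigned long long)remote);
    return 0;
}
```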
The RCP of the IP router sets up the BARs on the individual PCI Express switch ports attached to the respective servers and maps their system memories to corresponding windows in the logical address space of its own system memory. This gives a single entity visibility into the individual system memories of all servers in the network. This access mechanism is used to transfer data, in this case TCP/IP packets, between the servers comprising the LAN.
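A hedged sketch of that setup step is shown below, with a stub standing in for the device-specific NTB register writes; in effect, the router's RCP programs one window per attached server, back to back in its own address space.

```c
#include <stdint.h>
#include <stdio.h>

#define WINDOW_SIZE 0x100000ULL   /* 1 MiB per server, illustrative */

/* Stand-in for real, device-specific NTB register writes. */
static void ntb_program_window(int port, uint64_t local_base,
                               uint64_t remote_base, uint64_t size)
{
    printf("port %d: local 0x%llx -> remote 0x%llx (%llu bytes)\n",
           port, (unsigned long long)local_base,
           (unsigned long long)remote_base, (unsigned long long)size);
}

/* Map each attached server's memory into a distinct window of the
 * router RCP's logical address space, as described above. */
static void map_all_servers(int nports, const uint64_t *server_mem_base)
{
    uint64_t local = 0x4000000000ULL;  /* illustrative window region */

    for (int port = 0; port < nports; port++) {
        ntb_program_window(port, local, server_mem_base[port], WINDOW_SIZE);
        local += WINDOW_SIZE;          /* next window, back to back */
    }
}

int main(void)
{
    uint64_t bases[] = { 0x100000000ULL, 0x100000000ULL, 0x100000000ULL };
    map_all_servers(3, bases);
    return 0;
}
```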
This method allows for the transfer of memory or I/O data between attached servers through the switch ports at the maximum data rate supported by the respective physical links. For example, with 8 lanes of PCI Express links using Gen 2 technology the raw data transfer rate is 40 Gbit/s, and with 16 lanes it is 80 Gbit/s. PCI Express is a full-duplex technology, meaning transmit and receive can happen at the same time. This brings the full-duplex bandwidth to 80 Gbit/s for 8 lanes of Gen 2 and 160 Gbit/s for 16 lanes. Gen 3 technology, which is currently being developed, will roughly double these numbers.
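The short program below reproduces this arithmetic. Note that it works with raw Gen 2 signaling rates of 5 GT/s per lane per direction, matching the figures quoted above, and deliberately ignores 8b/10b encoding overhead.

```c
#include <stdio.h>

/* Raw Gen 2 signaling: 5 GT/s per lane, per direction. The quoted
 * figures above are raw link rates, so encoding overhead is not
 * subtracted here. */
int main(void)
{
    double gen2_per_lane = 5.0;       /* Gbit/s, one direction */
    int lanes[] = { 8, 16 };

    for (int i = 0; i < 2; i++) {
        double one_way = gen2_per_lane * lanes[i];
        printf("x%-2d Gen 2: %4.0f Gbit/s per direction, "
               "%4.0f Gbit/s full duplex\n",
               lanes[i], one_way, 2.0 * one_way);
    }
    return 0;
}
```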
Magma’s patent-pending technology, which covers all aspects of a network based on running the TCP/IP protocol over a PCI Express fabric, inclusive of the IP router, is the basis of the network switch design. It relies on a pull model for data transfer through the network switch. This leaves the processors on the sending servers entirely free of, and oblivious to, how IP data is transferred to the destination server, significantly reducing the processor overhead of transferring data to and from the network. This is illustrated in Figure 4.
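A minimal sketch of the sender's side of such a pull model appears below. The ring layout and all names are hypothetical, not Magma's actual design: the point is that the server merely publishes a descriptor and rings a doorbell, leaving the actual payload movement to the switch side.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-server transmit ring, visible to the switch RCP
 * through an NTB window; names and layout are illustrative. */
struct tx_slot { uint64_t pkt_addr; uint32_t len; uint32_t ready; };

#define RING_SLOTS 64
static struct tx_slot tx_ring[RING_SLOTS];
static unsigned tx_head;

/* Stand-in for a doorbell register write that tells the switch RCP
 * a packet is waiting to be pulled. */
static void ring_doorbell(void) { printf("doorbell rung\n"); }

/* All the sending server does: publish where the packet sits in its
 * own kernel memory. The switch side pulls the data later via DMA. */
static void post_packet(uint64_t kernel_addr, uint32_t len)
{
    struct tx_slot *s = &tx_ring[tx_head++ % RING_SLOTS];
    s->pkt_addr = kernel_addr;
    s->len      = len;
    s->ready    = 1;
    ring_doorbell();
}

int main(void)
{
    post_packet(0xDEAD0000ULL, 1500);  /* illustrative address/size */
    return 0;
}
```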
With the PCI Express-based network switch technology, the maximum number of nodes that can be on one network is 256 because of the restriction imposed by PCI configuration space, which supports a maximum of 256 buses. This may be construed as a limitation, but it allows for a very symmetrical topology, with one RCP, that of the network switch, servicing all of the nodes as devices underneath it. No additional RCP is involved in expanding the number of nodes and, therefore, no additional memory resources are required. Consequently, adding nodes to the network simply means daisy-chaining PCI Express switches, resulting in a significant decrease in cost per port as the number of nodes in the network increases. Moreover, compared to 10 Gigabit Ethernet and other legacy networks, adding nodes to the network switch is seamless because of the plug-and-play attributes of the PCI bus architecture.
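The 256-bus ceiling falls directly out of PCI configuration addressing, which packs an 8-bit bus number, a 5-bit device number and a 3-bit function number, as the short sketch below illustrates.

```c
#include <stdint.h>
#include <stdio.h>

/* PCI configuration addressing: bus (8 bits), device (5 bits),
 * function (3 bits). The 8-bit bus field is where the 2^8 = 256-bus
 * (and hence 256-node) ceiling above comes from. */
static uint16_t bdf(unsigned bus, unsigned dev, unsigned fn)
{
    return (uint16_t)((bus & 0xFF) << 8 | (dev & 0x1F) << 3 | (fn & 0x07));
}

int main(void)
{
    printf("buses addressable: %d\n", 1 << 8);    /* 256 */
    printf("BDF 02:1f.7 -> 0x%04x\n", bdf(2, 0x1f, 7));
    return 0;
}
```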
Since the servers have no direct visibility into a remote server's memory, any data transfer operation necessarily requires the root switch to be involved. For instance, when a source server needs to read or write data from or to a target server, the source notifies the root switch rather than attempting to communicate with the target server directly. It is the root switch that accesses the memory of both the source and the target server.
To further reduce data transfer latencies, the new switch technology uses the DMA controllers built into the NTB ports of the PCI Express switch. This relieves the network switch processor from moving data between servers and allows for concurrent transfers between nodes in the network. The result is peer-to-peer transfers within the PCI Express switch array, contributing to a drastic reduction in data transfer latencies across the network. Based on the destination IP addresses of the individual packets in a particular server's kernel space, the RCP on the network switch sets up the DMA descriptor list and then fires the DMA engine.
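The sketch below models that step with a hypothetical descriptor format and a stub in place of the engine's doorbell write; real NTB-port DMA controllers define their own layouts and kick-off mechanisms.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical descriptor for the NTB-port DMA engines described
 * above; one descriptor moves one packet, peer to peer, from the
 * source server's kernel buffer to the destination's receive buffer. */
struct dma_desc {
    uint64_t src;    /* source packet, via source server's NTB window */
    uint64_t dst;    /* receive buffer, via destination's NTB window  */
    uint32_t len;    /* bytes to move                                 */
    uint32_t last;   /* nonzero on the final descriptor of the chain  */
};

/* Stand-in for handing the chain to the engine and starting it. */
static void fire_dma_engine(const struct dma_desc *chain, int n)
{
    printf("DMA engine started: %d descriptor(s)\n", n);
    (void)chain;
}

int main(void)
{
    /* The switch RCP builds a chain from destination IPs it has
     * resolved to per-server NTB windows (addresses illustrative). */
    struct dma_desc chain[2] = {
        { 0x4000000000ULL, 0x4000100000ULL, 1500, 0 },
        { 0x40000005DCULL, 0x40001005DCULL, 1500, 1 },
    };
    fire_dma_engine(chain, 2);
    return 0;
}
```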
PCI Express technology is fast becoming ubiquitous, and as a result all server and workstation manufacturers now provide a number of PCI Express slots for I/O expansion. These form the PCI Express endpoints of the RCP on the host computer backplane. A host PCI Express card takes the PCI Express signals from the backplane and brings them out on fiber, or alternatively on copper, to attach to the ports of the network switch.
The number of PCI Express lanes operational between the server and the network switch depends on the number of lanes supported by the server hardware. PCI Express allows for link negotiation, whereby both ends of a link settle on the minimum number of lanes supported by either of the two connection points. Consequently, each port of the network switch negotiates down to the number of lanes supported by the host connection to that individual port. These ports support Gen 2 PCI Express signaling and will negotiate down to Gen 1 signaling to match the corresponding server connection. This makes the network switch highly scalable.
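In effect, each link settles on the minimum capability of its two ends, as the simplified model below shows; the specific widths and generations are illustrative.

```c
#include <stdio.h>

/* Link training settles on the highest width and speed that both
 * ends support, i.e., the minimum of the two capabilities. This is
 * a simplified model of the negotiation described above. */
static int min_int(int a, int b) { return a < b ? a : b; }

int main(void)
{
    int switch_lanes = 16, switch_gen = 2;   /* network switch port */
    int host_lanes   = 8,  host_gen   = 1;   /* server's host card  */

    printf("negotiated link: x%d Gen %d\n",
           min_int(switch_lanes, host_lanes),
           min_int(switch_gen, host_gen));
    return 0;
}
```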
The network switch technology is based entirely on standards, with no aspect of the technology being proprietary. With the industry's commitment to PCI Express, it provides a migration path to newer generations of the technology, potentially extending its life cycle. The technology allows legacy networks to coexist with it as it goes through its adoption cycle and, moreover, can serve as a fallback mechanism for mission-critical applications, allowing for fail-safe deployment. Another significant advantage is the cost per port as nodes are added to the network, since there is only one root complex processor (RCP), that of the network switch, in this network topology.
Figure 5 shows an example of how servers with disparate functions participate seamlessly in a symmetrical TCP/IP-based system area network. It also shows how storage and processing servers coexist in one homogeneous network. This is facilitated by the increasingly popular use of iSCSI for communication with network attached storage devices. iSCSI is essentially the SCSI protocol embedded in TCP/IP packets; the SCSI protocol is widely used in the industry to communicate with storage devices.
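Conceptually, the nesting looks like the sketch below. The sizes and field names are simplified stand-ins rather than the actual iSCSI wire format.

```c
#include <stdint.h>
#include <stdio.h>

/* Conceptual nesting only: a SCSI command block carried inside an
 * iSCSI protocol data unit, which in turn travels as ordinary TCP
 * payload across the fabric. Simplified; not the real wire format. */
struct scsi_command { uint8_t cdb[16]; };        /* SCSI command block   */
struct iscsi_pdu {
    uint8_t             header[48];             /* basic header segment */
    struct scsi_command cmd;                     /* embedded SCSI command */
};

int main(void)
{
    /* An iscsi_pdu is handed to the TCP/IP stack like any other
     * application payload, so it crosses the PCIe fabric exactly as
     * the TCP/IP packets described earlier in this article. */
    printf("PDU size: %zu bytes\n", sizeof(struct iscsi_pdu));
    return 0;
}
```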
Also, connecting to the Internet simply means transferring intact, via a wide area network (WAN) interface, all IP packets that are not destined for any server on the network. The deployment of the network switch shown in Figure 5 is representative of a topology that, with different software modules, can be used for clustering, I/O virtualization and cloud computing applications. It is a highly flexible architecture.
Magma, San Diego, CA. (858) 530-2511. [www.magma.com].