System Area Network Speeds Data Transfer between Servers with PCI Express
A new network switch technology targets the phenomenal demand for server-to-server transfer speeds that has become all too evident in the client-server architectures found throughout today's data processing environments.
by Joey Maitra, Magma
The proliferation of raw processing power in computers has resulted in system architectures where processing tasks are distributed and assigned to various processing elements in order to spread the load and derive better system throughput. The execution of these tasks is closely coordinated and then integrated by a central processing entity to produce the desired output. The intent is for the entire set of elements to share the processing load, thereby boosting the overall throughput of the system.
Processing elements must then communicate with the central entity and/or among themselves to synchronize the execution of their respective tasks. In most scenarios, there are also volatile and non-volatile storage elements dedicated to the distributed processing elements comprising the system. For instance, in blade centers, blade servers have their own private storage facilities and also communicate with each other over high-speed connections on the mid-plane, as well as with devices on a storage area network (SAN) through a switch module. This is typically the case in mid- to high-end server environments.
However, extending this paradigm to an environment made up of servers physically located in separate enclosures requires a fast interconnect mechanism. In other words, these servers must communicate among themselves via some sort of network. In such environments, there is also the need to access vast amounts of data via network attached storage (NAS) devices. This scenario is prevalent in datacenters and server farms, to name a few. Today, these access mechanisms are implemented via local area networks (LANs) with technologies such as InfiniBand, 10 Gigabit Ethernet, Fibre Channel and the like.
Another point to note is that the phenomenal rate of deployment of the Internet has resulted in most LANs using TCP/IP in the upper layers of the communication stack. IP packets from the TCP/IP layers are essentially encapsulated within the frames of the communication protocol used to form the LAN.
The physical connection to the network fabric for servers and computers takes place through either a network I/O controller card or a network controller device resident on the motherboard. These motherboards host a root complex processor, as shown in Figure 1.
A root complex denotes the root of an I/O hierarchy that connects the CPU/memory subsystem
to I/O devices. This hierarchy consists of a root complex (RC), multiple endpoints (I/O devices),
a switch and a PCI Express to PCI/PCI-X bridge, all interconnected via PCI Express links. PCI
Express is a point-to-point, low-overhead, low-latency communication link maximizing
application payload bandwidth and link efficiency. Inherent in the PCI Express technology is a
very robust communication protocol with its own set of Transaction, Data Link and Physical
Layers.
The network I/O controllers implement some specific communication protocol and provide the
interface to the physical media constituting the LAN. The controllers interface to a PCI Express
endpoint of the root complex processor (RCP) of the server node participating in the network.
Incidentally, this architecture is not restricted to servers since it is common in workstations,
desktops and laptops.
The vast majority of the industry's prevalent communication protocols were invented before the advent of PCI Express technology. These protocols have their own, almost identical, infrastructure made up of Transaction, Data Link and Physical layers. As depicted in Figure 2, data originating at the Application layer are transformed into TCP/IP packets and then embedded in PCI Express packets. These packets are then sent to the Ethernet controller, which extracts the TCP/IP packet from the PCI Express packets and re-packetizes it to be sent in Ethernet frames over the 10 Gigabit Ethernet physical media. The reverse process takes place at the destination server end.
It is obvious from the discussion so far that there is a great deal of protocol duplication. The cost of such duplication, measured in terms of overall network throughput, becomes more pronounced when the nuances of the various communication protocols are considered as they relate to efficiency, i.e., data rate, maximum payload, packet header overhead, etc. It turns out that this duplication, even though it may be executed in hardware, imposes unnecessary software and hardware overhead that seriously impacts the overall throughput of the network infrastructure.
Another important factor impacting the overall performance of a network is the bandwidth limitation of the physical media associated with the communication protocol used. This encompasses transfer rates, maximum supported distances and the connection topology, to name a few. For instance, with 10 Gbit/s Ethernet the restriction of the data transfer rate to 10 Gbit/s is potentially a very serious limitation for many applications.
Given this scenario, the ideal approach to boosting the overall performance of the network would be to use PCI Express technology as the network fabric itself. Embedded in the PCI Express packet is the IP datagram with the destination IP address of the server node. PCI Express is a point-to-point communication protocol and consequently does not have a media access control (MAC) address. Therefore, the most natural and logical approach to routing data from one node in the network to another is to have some entity route the data based on the destination IP address. Implementing this type of routing methodology essentially makes that entity an IP router.
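As a minimal sketch of what such an IP-routing entity maintains, assume a hypothetical table that correlates each server's IP address with the fabric port it sits behind; the addresses, port numbers and names below are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical routing table for the IP-routing entity described
 * above: each fabric port is associated with the IP address of the
 * server attached to it. Addresses and ports are illustrative. */
struct route_entry {
    uint32_t server_ip;   /* IP of the server behind this port */
    int      port;
};

static const struct route_entry routes[] = {
    { 0x0A000001u, 1 },   /* 10.0.0.1 -> port 1 */
    { 0x0A000002u, 2 },   /* 10.0.0.2 -> port 2 */
    { 0x0A000003u, 3 },   /* 10.0.0.3 -> port 3 */
};

/* Return the port for a destination IP, or -1 if the packet is not
 * destined for any server on the fabric (e.g., it must leave via a
 * WAN interface instead). */
static int lookup_port(uint32_t dst_ip)
{
    for (size_t i = 0; i < sizeof routes / sizeof routes[0]; i++)
        if (routes[i].server_ip == dst_ip)
            return routes[i].port;
    return -1;
}

int main(void)
{
    printf("10.0.0.2 -> port %d\n", lookup_port(0x0A000002u));
    return 0;
}
```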
This is where the PCI Express switch comes into play, as shown in Figure 3. All of the downstream ports of the PCI Express switch connect to the servers comprising the nodes of a system area network. Intelligence tied to the upstream port of the switch maintains the correlation between each downstream port and the IP address of the server attached to it. Data flows from one server to another through the PCI Express switch.
Consequently, the root complex processor (RCP) tied to the upstream port of the switch must communicate with the RCP of each server. This poses the question of how best to communicate between two RCPs.
Bus enumeration in PCI Express architecture, which is the same as in PCI bus architecture, does not allow one RCP to discover devices on a bus that belongs to another RCP. However, there is a technique pioneered by PLX Technology during the heyday of the PCI bus that addresses this issue: non-transparent bridging (NTB). This method allows two RCPs to communicate through the use of base address registers (BARs). The interchange of information applies to memory, I/O and configuration spaces in the context of PCI bus architecture, and it works for both systems. It can only be supported if the underlying hardware of the PCI Express switch provides NTB functions on the respective downstream ports.
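The essence of NTB is address translation through a BAR window: an access that lands in the window on one side of the bridge is redirected to a programmed base address on the other side. The sketch below models that translation with hypothetical names and addresses; real switch hardware implements it in device-specific registers.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of an NTB window: an access that falls inside
 * the BAR window on the local side is redirected to a programmed
 * base address in the remote RCP's memory space. */
struct ntb_window {
    uint64_t local_base;    /* window as seen by the local RCP */
    uint64_t remote_base;   /* translated base on the far side */
    uint64_t size;
};

/* Translate a local address inside the window into the corresponding
 * address in the remote RCP's address space; -1 if out of range. */
static int ntb_translate(const struct ntb_window *w,
                         uint64_t local, uint64_t *remote)
{
    if (local < w->local_base || local >= w->local_base + w->size)
        return -1;
    *remote = w->remote_base + (local - w->local_base);
    return 0;
}

int main(void)
{
    struct ntb_window w = { 0x80000000ULL, 0x2000000000ULL, 0x100000ULL };
    uint64_t remote;

    if (ntb_translate(&w, 0x80000010ULL, &remote) == 0)
        printf("local 0x80000010 -> remote 0x%llx\n",
               (unsigned long long)remote);
    return 0;
}
```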
The RCP of the IP router sets up the BARs on the individual PCI Express switch ports attached to the respective servers and maps their system memories to corresponding windows in the logical address space of its own system memory. This gives a single entity visibility into the individual system memories of all servers in the network. This access mechanism is used to transfer data, in this case TCP/IP packets, between the servers comprising the LAN.
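A hedged sketch of that setup step is shown below, with a stub standing in for the device-specific NTB register writes; in effect, the router's RCP programs one window per attached server, back to back in its own address space.

```c
#include <stdint.h>
#include <stdio.h>

#define WINDOW_SIZE 0x100000ULL   /* 1 MiB per server, illustrative */

/* Stand-in for real, device-specific NTB register writes. */
static void ntb_program_window(int port, uint64_t local_base,
                               uint64_t remote_base, uint64_t size)
{
    printf("port %d: local 0x%llx -> remote 0x%llx (%llu bytes)\n",
           port, (unsigned long long)local_base,
           (unsigned long long)remote_base, (unsigned long long)size);
}

/* Map each attached server's memory into a distinct window of the
 * router RCP's logical address space, as described above. */
static void map_all_servers(int nports, const uint64_t *server_mem_base)
{
    uint64_t local = 0x4000000000ULL;  /* illustrative window region */

    for (int port = 0; port < nports; port++) {
        ntb_program_window(port, local, server_mem_base[port], WINDOW_SIZE);
        local += WINDOW_SIZE;          /* next window, back to back */
    }
}

int main(void)
{
    uint64_t bases[] = { 0x100000000ULL, 0x100000000ULL, 0x100000000ULL };
    map_all_servers(3, bases);
    return 0;
}
```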
This method allows for the transfer of memory or I/O data between attached servers through the switch ports at the maximum data rate supported by the respective physical links. For example, with 8 lanes of PCI Express links using Gen 2 technology the raw data transfer rate is 40 Gbit/s, and with 16 lanes it is 80 Gbit/s. PCI Express is a full-duplex technology, meaning transmit and receive can happen at the same time. This brings the full-duplex bandwidth to 80 Gbit/s for 8 lanes of Gen 2 and 160 Gbit/s for 16 lanes. Gen 3 technology, which is currently being developed, will roughly double these numbers.
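The short program below reproduces this arithmetic. Note that it works with raw Gen 2 signaling rates of 5 GT/s per lane per direction, matching the figures quoted above, and deliberately ignores 8b/10b encoding overhead.

```c
#include <stdio.h>

/* Raw Gen 2 signaling: 5 GT/s per lane, per direction. The quoted
 * figures above are raw link rates, so encoding overhead is not
 * subtracted here. */
int main(void)
{
    double gen2_per_lane = 5.0;       /* Gbit/s, one direction */
    int lanes[] = { 8, 16 };

    for (int i = 0; i < 2; i++) {
        double one_way = gen2_per_lane * lanes[i];
        printf("x%-2d Gen 2: %4.0f Gbit/s per direction, "
               "%4.0f Gbit/s full duplex\n",
               lanes[i], one_way, 2.0 * one_way);
    }
    return 0;
}
```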
Magma’s patent-pending technology, which covers all aspects of a network based on running the TCP/IP protocol over a PCI Express fabric, inclusive of the IP router, is the basis of the network switch design. It relies on a pull model for data transfer through the network switch. This leaves the processors on the sending servers entirely free of, and oblivious to, how IP data is transferred to the destination server, significantly reducing the processor overhead of transferring data to and from the network. This is illustrated in Figure 4.
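A minimal sketch of the sender's side of such a pull model appears below. The ring layout and all names are hypothetical, not Magma's actual design: the point is that the server merely publishes a descriptor and rings a doorbell, leaving the actual payload movement to the switch side.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-server transmit ring, visible to the switch RCP
 * through an NTB window; names and layout are illustrative. */
struct tx_slot { uint64_t pkt_addr; uint32_t len; uint32_t ready; };

#define RING_SLOTS 64
static struct tx_slot tx_ring[RING_SLOTS];
static unsigned tx_head;

/* Stand-in for a doorbell register write that tells the switch RCP
 * a packet is waiting to be pulled. */
static void ring_doorbell(void) { printf("doorbell rung\n"); }

/* All the sending server does: publish where the packet sits in its
 * own kernel memory. The switch side pulls the data later via DMA. */
static void post_packet(uint64_t kernel_addr, uint32_t len)
{
    struct tx_slot *s = &tx_ring[tx_head++ % RING_SLOTS];
    s->pkt_addr = kernel_addr;
    s->len      = len;
    s->ready    = 1;
    ring_doorbell();
}

int main(void)
{
    post_packet(0xDEAD0000ULL, 1500);  /* illustrative address/size */
    return 0;
}
```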
With the PCI Express-based network switch technology, the maximum number of nodes that can be on one network is 256 because of the restriction imposed by PCI configuration space, which supports a maximum of 256 buses. This may be construed as a limitation, but it allows for a very symmetrical topology, with one RCP, that of the network switch, servicing all of the nodes as devices underneath it. No additional RCP is involved in expanding the number of nodes and, therefore, no additional memory resources are required. Consequently, adding nodes to the network simply means daisy-chaining PCI Express switches, resulting in a significant decrease in cost per port as the number of nodes in the network increases. Moreover, compared to 10 Gigabit Ethernet and other legacy networks, adding nodes to the network switch is seamless because of the plug-and-play attributes of the PCI bus architecture.
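The 256-bus ceiling falls directly out of PCI configuration addressing, which packs an 8-bit bus number, a 5-bit device number and a 3-bit function number, as the short sketch below illustrates.

```c
#include <stdint.h>
#include <stdio.h>

/* PCI configuration addressing: bus (8 bits), device (5 bits),
 * function (3 bits). The 8-bit bus field is where the 2^8 = 256-bus
 * (and hence 256-node) ceiling above comes from. */
static uint16_t bdf(unsigned bus, unsigned dev, unsigned fn)
{
    return (uint16_t)((bus & 0xFF) << 8 | (dev & 0x1F) << 3 | (fn & 0x07));
}

int main(void)
{
    printf("buses addressable: %d\n", 1 << 8);    /* 256 */
    printf("BDF 02:1f.7 -> 0x%04x\n", bdf(2, 0x1f, 7));
    return 0;
}
```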
Since the servers have no direct visibility into a remote server's memory, any data transfer operation necessarily requires the root switch to be involved. For instance, when a source server needs to read or write data from or to a target server, the source notifies the root switch rather than attempting to communicate with the target server directly. It is the root switch that accesses the memory of both the source and the target server.
To further reduce data transfer latencies, the new switch technology uses the DMA controllers built into the NTB ports of the PCI Express switch. This relieves the network switch processor from moving data between servers and allows for concurrent transfers between nodes in the network. The result is peer-to-peer transfers within the PCI Express switch array, contributing to a drastic reduction in data transfer latencies across the network. Based on the destination IP addresses of the individual packets in a particular server's kernel space, the RCP on the network switch sets up the DMA descriptor list and then fires the DMA engine.
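The sketch below models that step with a hypothetical descriptor format and a stub in place of the engine's doorbell write; real NTB-port DMA controllers define their own layouts and kick-off mechanisms.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical descriptor for the NTB-port DMA engines described
 * above; one descriptor moves one packet, peer to peer, from the
 * source server's kernel buffer to the destination's receive buffer. */
struct dma_desc {
    uint64_t src;    /* source packet, via source server's NTB window */
    uint64_t dst;    /* receive buffer, via destination's NTB window  */
    uint32_t len;    /* bytes to move                                 */
    uint32_t last;   /* nonzero on the final descriptor of the chain  */
};

/* Stand-in for handing the chain to the engine and starting it. */
static void fire_dma_engine(const struct dma_desc *chain, int n)
{
    printf("DMA engine started: %d descriptor(s)\n", n);
    (void)chain;
}

int main(void)
{
    /* The switch RCP builds a chain from destination IPs it has
     * resolved to per-server NTB windows (addresses illustrative). */
    struct dma_desc chain[2] = {
        { 0x4000000000ULL, 0x4000100000ULL, 1500, 0 },
        { 0x40000005DCULL, 0x40001005DCULL, 1500, 1 },
    };
    fire_dma_engine(chain, 2);
    return 0;
}
```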
PCI Express technology is fast becoming ubiquitous, and as a result all server and workstation manufacturers now provide a number of PCI Express slots for I/O expansion. These form the PCI Express endpoints of the RCP on the host computer backplane. A host PCI Express card takes the PCI Express signals from the backplane and brings them out on fiber, or alternatively on copper, to attach to the ports of the network switch.
The number of PCI Express lanes operational between the server and the network switch depends on the number of lanes supported by the server hardware. PCI Express allows for link negotiation, whereby both ends of a link settle on the minimum number of lanes supported by either of the two connection points. Consequently, each port of the network switch negotiates down to the number of lanes supported by the host connection to that individual port. These ports support Gen 2 PCI Express signaling and will negotiate down to Gen 1 signaling to match the corresponding server connection. This makes the network switch highly scalable.
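In effect, each link settles on the minimum capability of its two ends, as the simplified model below shows; the specific widths and generations are illustrative.

```c
#include <stdio.h>

/* Link training settles on the highest width and speed that both
 * ends support, i.e., the minimum of the two capabilities. This is
 * a simplified model of the negotiation described above. */
static int min_int(int a, int b) { return a < b ? a : b; }

int main(void)
{
    int switch_lanes = 16, switch_gen = 2;   /* network switch port */
    int host_lanes   = 8,  host_gen   = 1;   /* server's host card  */

    printf("negotiated link: x%d Gen %d\n",
           min_int(switch_lanes, host_lanes),
           min_int(switch_gen, host_gen));
    return 0;
}
```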
The network switch technology is based entirely on standards, with no aspect of the technology being proprietary. With the industry's commitment to PCI Express, it provides a migration path to newer generations of the technology, potentially extending its life cycle. The technology allows legacy networks to coexist with it as it goes through its adoption cycle and, moreover, can serve as a fallback mechanism for mission-critical applications, allowing for fail-safe deployment. Another significant advantage is the cost per port as nodes are added to the network, since there is only one root complex processor (RCP), that of the network switch, in this network topology.
Figure 5 shows an example of how servers with disparate functions participate seamlessly in a symmetrical TCP/IP-based system area network. It also shows how storage and processing servers coexist in one homogeneous network. This is facilitated by the increasingly popular use of iSCSI for communication with network attached storage devices. iSCSI is essentially the SCSI protocol embedded in TCP/IP packets; the SCSI protocol is widely used in the industry to communicate with storage devices.
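Conceptually, the nesting looks like the sketch below. The sizes and field names are simplified stand-ins rather than the actual iSCSI wire format.

```c
#include <stdint.h>
#include <stdio.h>

/* Conceptual nesting only: a SCSI command block carried inside an
 * iSCSI protocol data unit, which in turn travels as ordinary TCP
 * payload across the fabric. Simplified; not the real wire format. */
struct scsi_command { uint8_t cdb[16]; };        /* SCSI command block   */
struct iscsi_pdu {
    uint8_t             header[48];             /* basic header segment */
    struct scsi_command cmd;                     /* embedded SCSI command */
};

int main(void)
{
    /* An iscsi_pdu is handed to the TCP/IP stack like any other
     * application payload, so it crosses the PCIe fabric exactly as
     * the TCP/IP packets described earlier in this article. */
    printf("PDU size: %zu bytes\n", sizeof(struct iscsi_pdu));
    return 0;
}
```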
Also, connecting to the Internet simply means transferring intact, via a wide area network (WAN) interface, all IP packets that are not destined for any server on the network. The deployment of the network switch shown in Figure 5 is representative of a topology that, with different software modules, can be used for clustering, I/O virtualization and cloud computing applications. It is a highly flexible architecture.
Magma, San Diego, CA. (858) 530-2511. [www.magma.com].