Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards
Richard Hughes-Jones1, Peter Clarke2, Steven Dallison1
1 Dept. of Physics and Astronomy, The University of Manchester, Oxford Rd., Manchester M13 9PL, UK
2 National e-Science Centre, The University of Edinburgh, 15 S. College St., Edinburgh EH8 9AA, UK
Abstract
System administrators often assume that just by plugging in a Gigabit Ethernet Interface, the system
will deliver line rate performance; sadly this is not often the case. The behaviour of various 1 and 10
Gigabit Ethernet Network Interface cards has been investigated using server quality motherboards. The
latency, throughput, and the activity on the PCI buses and Gigabit Ethernet links were chosen as
performance indicators. The tests were performed using two PCs connected back-to-back and sending
UDP/IP frames from one to the other. This paper shows the importance of having a good combination of memory and peripheral-bus chipset, Ethernet interface, CPU power, well-designed drivers and operating system, and proper configuration to achieve and sustain gigabit transfer rates. With these
considerations taken into account, and suitably designed hardware, transfers can operate at gigabit and
multi-gigabit speeds. Some recommendations are given for high performance data servers.
Key words: Gigabit Networking, Gigabit Ethernet, Network Interface Card, UDP, PCI bus, PCI-X bus.
1 Introduction
With the increased availability and decrease in cost of Gigabit Ethernet using twisted pair cabling of
suitable quality, system suppliers and integrators are offering Gigabit Ethernet and associated switches
as the preferred interconnect between disk servers and PC compute farms as well as the most common
campus or departmental backbones. With the excellent publicity that ‘Gigabit Ethernet is just Ethernet’,
users and administrators are now expecting gigabit performance just by purchasing gigabit components.
In addition, 10 Gigabit Ethernet PCI-X (Peripheral Component Interconnect – Extended) cards such as
the Intel PRO/10GbE LR Server Adapter[1] and the S2io Xframe[2] are emerging into the market.
Although initially targeted at the server market, there is considerable interest in providing access to
data at multi-gigabit rates for the emerging high performance Grid applications.
In order to be able to perform sustained data transfers over the network at gigabit speeds, it is essential
to study the behaviour and performance of the end-system compute platform and the Network Interface
Cards (NICs). In general, data must be transferred between system memory and the NIC and then
placed on the network. For this reason, the operation and interactions of the memory, CPU, memory bus, the bridge to the input-output bus (often referred to as the “chipset”), the network interface card
and the input-output bus itself (in this case PCI / PCI-X) are of great importance. The design and
implementation of the software drivers, protocol stack, and operating system are also vital to good
performance. For most data transfers, the information must be read from and stored on permanent
storage, thus the performance of the storage sub-systems and this interaction with the computer buses is
equally important.
The work reported here forms part of an ongoing evaluation programme for systems and Gigabit
Ethernet components for high performance networking for Grid computing and the Trigger/Data
Acquisition in the ATLAS [8] experiment at the Large Hadron Collider being built at CERN,
Switzerland.
The paper first describes the methodology of the tests performed (Section 2) and then details the
equipment used (Section 3). For the 1 Gigabit Ethernet NICs, the results, analysis and discussion of the
tests performed are presented in Sections 4, 5 and 6 (one for each motherboard used). Within each of
these sections, subsections describe the results of each NIC which was tested. Section 7 describes tests
made with the Intel 10 Gigabit Ethernet adaptor on both Intel 32 and 64 bit architectures. Related work
is studied in Section 8. Finally, the paper ends with concluding remarks.
2 Methodology and Tests Performed
For each combination of motherboard, NIC and Linux kernel, three sets of measurements were made
using two PCs with the NICs directly connected together with suitable fibre or copper crossover cables.
UDP/IP frames were chosen for the tests as they are processed in a similar manner to TCP/IP frames,
but are not subject to the flow control and congestion avoidance algorithms defined in the TCP protocol
and thus do not distort the base-level performance. The packet lengths given are those of the user
payload1.
The following subsections describe the measurements made of the latency, throughput, and the activity
on the PCI buses.
2.1 Latency
Round trip times were measured using request-response UDP frames. One system sent a UDP packet
requesting that a response of the required length be sent back by the remote end. The unused portions
of the UDP packets were filled with random values to prevent any data compression. Each test
involved measuring many (~1000) request-response singletons. The individual request-response times
were measured by using the CPU cycle counter on the Pentium [6],[20] and the minimum, average and
maximum times were computed. This approach is in agreement with the recommendations of the IP
Performance Metric Working Group and the Global Grid Forum (GGF) [7].
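As an illustration of this timing method, the following C sketch times a single request-response exchange with the CPU cycle counter. It is a minimal sketch rather than the test code used for the measurements; the responder address, port, payload sizes and assumed CPU frequency are illustrative assumptions.

    /* Minimal sketch: timing one UDP request-response with the CPU cycle
       counter (TSC). Not the code used in the paper; the address, port,
       sizes and the assumed CPU frequency are illustrative only. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <x86intrin.h>                       /* __rdtsc() */

    int main(void)
    {
        const double cpu_hz = 800e6;             /* assumed clock: PIII 800 MHz */
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(5001);           /* hypothetical responder port */
        inet_pton(AF_INET, "192.168.0.2", &peer.sin_addr);

        char request[64], response[1472];
        memset(request, 0xA5, sizeof(request));  /* non-compressible filler */

        uint64_t t0 = __rdtsc();
        sendto(s, request, sizeof(request), 0,
               (struct sockaddr *)&peer, sizeof(peer));
        ssize_t n = recv(s, response, sizeof(response), 0);
        uint64_t t1 = __rdtsc();

        if (n >= 0)
            printf("round trip: %.2f us for %zd byte response\n",
                   (t1 - t0) / cpu_hz * 1e6, n);
        close(s);
        return 0;
    }

In practice the request would also carry the required response length, and the exchange would be repeated ~1000 times to form the minimum, average, maximum and histogram described above.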
The round-trip latency was measured and plotted as a function of the frame size of the response. The
slope of this graph is given by the sum of the inverse data transfer rates for each step of the end-to-end
path [10]. Figure 2.1 shows a table giving the slopes expected for various PCI-Ethernet-PCI transfers
given the different PCI bus widths and speeds. The intercept gives the sum of the propagation delays in
the hardware components and the end system processing times.
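For example, taking the values in Figure 2.1, the expected slope for a response that crosses a 64 bit 66 MHz PCI bus, a Gigabit Ethernet link and a second 64 bit 66 MHz PCI bus is the sum of the three inverse data transfer rates:

\[
s_{\mathrm{expected}} = 2 \times 0.00188 + 0.008 \approx 0.0118\ \mu\mathrm{s/byte}.
\]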
Histograms were also made of the singleton request-response measurements. These histograms show
any variations in the round-trip latencies, some of which may be caused by other activity in the PCs,
which was kept to a minimum during the tests. In general for the Gigabit Ethernet systems under test,
the Full Width Half Maximum (FWHM) of the latency distribution was ~2 µs with very few events in
the tail (see Figure 6.3), which indicates that the method used to measure the times was precise.
These latency tests also indicate any interrupt coalescence in the NICs and provide information on
protocol stack performance and buffer management.
2.2 UDP Throughput
The UDPmon [11] tool was used to transmit streams of UDP packets at regular, carefully controlled
intervals and the throughput was measured at the receiver. Figure 2.2 shows how the messages and the
stream of UDP data packets are exchanged by UDPmon. On an unloaded network, UDPmon will
estimate the capacity of the link with the smallest bandwidth on the path between the two end systems.
On a loaded network, the tool gives an estimate of the available bandwidth [7]. These bandwidths are
indicated by the flat portions of the curves.
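A paced sender in this spirit might look like the following C sketch. It is not the UDPmon implementation; the destination address, port, packet count and inter-packet spacing are illustrative assumptions.

    /* Sketch of a paced UDP sender: transmit n_pkts frames with a fixed
       inter-packet spacing, busy-waiting on a fine-grained clock between
       sends. Not the UDPmon code; the constants are illustrative only. */
    #include <stdint.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    int main(void)
    {
        const int      n_pkts     = 1000;
        const size_t   pkt_len    = 1472;     /* max payload for a 1500 byte MTU */
        const uint64_t spacing_ns = 20000;    /* 20 us between frames */

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5001);         /* hypothetical receiver port */
        inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);

        char payload[1472];
        memset(payload, 0x5A, sizeof(payload));

        uint64_t next = now_ns();
        for (int i = 0; i < n_pkts; i++) {
            while (now_ns() < next)           /* busy-wait to hold the spacing */
                ;
            sendto(s, payload, pkt_len, 0, (struct sockaddr *)&dst, sizeof(dst));
            next += spacing_ns;
        }
        close(s);
        return 0;
    }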
In these tests, a series of user payloads from 64 to 1472 bytes were selected and for each packet size,
the frame transmit spacing was varied. For each point, the following information was recorded:
• the time to send and the time to receive the frames;
• the number of packets received, the number lost, and the number out of order;
• the distribution of the lost packets;
• the received inter-packet spacing;
• the CPU load and the number of interrupts for both transmitting and receiving systems.
1 Allowing for 20 bytes of IP and 8 bytes of UDP headers, the maximum user payload for an Ethernet interface with a 1500 byte Maximum Transfer Unit (MTU) would be 1472 bytes.
The “wire”2 throughput rates include an extra 66 bytes of overhead and were plotted as a function of the frame transmit spacing. On the right hand side of the plots, the curves show a 1/t behaviour, where the delay between sending successive packets is the most important factor. When the frame transmit spacing is such that the data rate would be greater than the available bandwidth, one would expect the curves to be flat (often observed to be the case). As the packet size is reduced, processing and transfer overheads become more important and this decreases the achievable data transfer rate.
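On the 1/t part of the curves the received wire rate is simply the frame size, including the 66 overhead bytes, divided by the transmit spacing:

\[
R_{\mathrm{wire}}(L, t) = \frac{8\,(L + 66)}{t}\ \mathrm{bit/s},
\]

where L is the user payload in bytes and t the spacing between frames. For example, a frame carrying 1400 bytes of user data occupies a 1 Gbit/s link for 8 x (1400 + 66) / 10^9 ≈ 11.7 µs, the back-to-back frame time observed in Section 4.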
Transfer Element                                  Inverse data transfer rate (µs/byte)
32 bit 33 MHz PCI                                 0.0075
64 bit 33 MHz PCI                                 0.00375
64 bit 66 MHz PCI                                 0.00188
64 bit 133 MHz PCI-X                              0.00094
Gigabit Ethernet                                  0.008
10 Gigabit Ethernet                               0.0008

Transfer Element                                  Expected slope (µs/byte)
32 bit 33 MHz PCI with Gigabit Ethernet           0.023
64 bit 33 MHz PCI with Gigabit Ethernet           0.0155
64 bit 66 MHz PCI with Gigabit Ethernet           0.0118
64 bit 133 MHz PCI-X with 10 Gigabit Ethernet     0.0027

Figure 2.1 Table of the slopes expected for various PCI-Ethernet-PCI transfers.
[Figure omitted: timing diagram of the UDPmon exchange. The requester sends “Start & zero stats” (acknowledged “OK done”), transmits the data frames at regular intervals (bytes per packet, inter-packet time, time to send, time to receive), then requests the remote statistics, which the responder returns before the end of test is signalled.]
Figure 2.2 Timing diagram of the messages exchanged by UDPmon
2.3 Activity on the PCI Buses and the Gigabit Ethernet Link
The activity on the PCI bus of the sender and receiver nodes and the passage of the Gigabit Ethernet
frames on the fibre was measured using a Logic Analyser to record traces of the individual signals on
the PCI/PCI-X buses, and a specialised Gigabit Ethernet Probe [12] to determine what was on the
network, as shown in Figure 2.3.
2 The 66 “wire” overhead bytes include: 12 bytes for the inter-packet gap, 8 bytes for the preamble and Start Frame Delimiter, 18 bytes for the Ethernet frame header and CRC, and 28 bytes of IP and UDP headers.
[Figure omitted: block diagram of the test set-up. Each PC contains a CPU, memory, chipset and NIC on its PCI bus; the two NICs are linked by Gigabit Ethernet with the Gigabit Ethernet Probe in the path, and a Logic Analyser display records the PCI and Ethernet signals.]
Figure 2.3 Test arrangement for examining the PCI buses and the Gigabit Ethernet link.
Possible bottlenecks are: CPU power, memory bandwidth, chipset/PCI bus, NIC.
3 Hardware and Operating Systems
Both copper and fibre NICs were tested and are listed in Figure 3.1 together with the chipset and the
Linux driver version. Details of the systems and motherboards used in the tests are given with the
summary of the measurements in Figure 9.1. As an example of a typical server quality board, the
SuperMicro [3] P4DP6 motherboard has dual 2.2 GHz Xeon processors3, a 400 MHz front-side bus and four independent 64 bit PCI / PCI-X buses with selectable speeds of 66, 100 or 133 MHz, as shown in the
block diagram of Figure 3.2; the “slow” devices and the EIDE controllers are connected via another
bus. The P4DP8-G2 used in the DataTAG [4] PCs is similar, with the on-board 10/100 LAN controller
replaced by the Intel 82546EB dual port Gigabit Ethernet controller connected to PCI Slot 5.
As most of the large Grid projects in High Energy Physics or Astronomy are currently based on Linux
PCs, this platform was selected for making the evaluations. RedHat 7.1, 7.2 and 8.0 distributions of
Linux were used with various kernels including 2.4.14, 2.4.19, 2.4.20 and 2.4.22. However, the
techniques and methodology are applicable to other platforms. Care was taken to obtain the latest
drivers for the Linux kernel in use. Normally no modifications were made to the IP stack or the drivers, but in some tests the DataTAG altAIMD kernel [28] was used. Drivers were loaded with the
default parameters except where stated.
The DataTAG altAIMD kernel incorporates several experimental TCP stacks that modify the sender
side congestion window update algorithms and can work with any TCP receiver. The version used in
these tests was a patch for the Linux 2.4.20 kernel which allowed dynamic switching between the
following stacks using a parameter in the kernel “/proc” database:
• Standard Linux TCP
• HighSpeed TCP
• ScalableTCP 0.5.1
In addition, the following modifications were incorporated and applied to all the TCP stacks:
• Web100 instrumentation [29];
• an on/off toggle for Limited Slow Start;
• Appropriate Byte Counting (ABC) as described in RFC 3465;
• the ability to enable or disable the functionality of the moderate_cwnd() procedure that adjusts the congestion window in the presence of "dubious" ACKs;
• the Txqueuelen feedback/moderation toggle, which enables TCP to ignore or moderate congestion feedback when the transmit queue becomes full.
3 The Xeon processor has two “logical” processors, each with its own independent machine state, data registers, control registers and interrupt controller (APIC). However, they share the core resources, including the execution engine, system bus interface, and level-2 cache. This hyper-threading technology allows the core to execute two or more separate code streams concurrently, using out-of-order instruction scheduling to maximise the use of the execution units [5][19].
Manufacturer   Model                         Chipset          Linux kernel      Driver version
Alteon         ACENIC                        Tigon II rev 6   2.4.14            acenic v0.83, Firmware 12.4.11
SysKonnect     SK9843                        -                2.4.14            sk98lin v4.06
                                                              2.4.19-SMP        sk98lin v6.0 β 04
                                                              2.4.20-SMP        sk98lin v6.0 β 04
Intel          PRO/1000 XT                   82544            2.4.14            e1000
                                                              2.4.19-SMP        e1000 4.4.12
                                                              2.4.20-SMP        e1000 4.4.12
Intel          On board                      82546 EB         2.4.20-SMP        e1000 4.4.12
Intel          PRO/10GbE LR Server Adapter   -                2.4.20, 2.4.22    Ixgb 1.0.45
Figure 3.1 NICs used in the tests.
Figure 3.2 Block diagram of the SuperMicro P4DP6 / P4DP8 motherboards.
4 Measurements made on the SuperMicro 370DLE
The SuperMicro 370DLE motherboard is somewhat old now, but has the advantage of having both 32 bit and 64 bit PCI slots with jumper-selectable 33 or 66 MHz bus speeds. The board uses the
ServerWorks III LE Chipset and one PIII 800 MHz CPU. RedHat 7.1 Linux was used with the 2.4.14
kernel for the tests.
4.1 SysKonnect NIC
4.1.1 Request-Response Latency
The round trip latency, shown in Figure 4.1 as a function of the packet size, is a smooth function for
both 33 MHz and 66 MHz PCI buses, indicating that the driver-NIC buffer management works well.
The clear step increase in latency at 1500 bytes is due to the need to send a second partially filled
packet, and the smaller slope for the second packet is due to the overlapping of the incoming second
frame and the data from the first frame being transferred over the PCI bus. These two effects are
common to all the gigabit NICs tested.
The slope observed for the 32 bit PCI bus (0.0286 µs/byte) is in reasonable agreement with that
expected, the increase being consistent with some memory-memory copies at both ends. However the
slope of 0.023 µs/byte measured for the 64 bit PCI bus is much larger than the 0.0118 µs/byte expected,
even allowing for memory copies. Measurements with the logic analyser, discussed in Section 4.1.3,
confirmed that the PCI data transfers did operate at 64 bit 66 MHz with no wait states. It is possible
that the extra time per byte could be due to some movement of data within the hardware interface itself.
The small intercept of 62 or 56 µs, depending on bus speed, suggests that the NIC interrupts the CPU
for each packet sent and received. This was confirmed by the UDPmon throughput data and the PCI
measurements.
4.1.2 UDP Throughput
The SysKonnect NIC managed 584 Mbit/s throughput for full 1500 byte frames on the 32 bit PCI bus
(at a spacing of 20-21 µs) and 720 Mbit/s (at a spacing of 17 µs) on the 64 bit 66 MHz PCI bus as
shown in Figure 4.2. With inter-packet spacing less than these values, there is packet loss, traced to loss
in the receiving IP layer when no buffers for the higher layers are available. However, the receiving
CPU is in kernel mode for 95 - 100% of the time when the frames were spaced at 18 µs or less, as
shown in the lower graph of Figure 4.2, so a lack of CPU power could explain the packet loss.
[Figure omitted: round-trip latency versus message length for the two bus configurations. Fitted lines: 32 bit 33 MHz, y = 0.0286x + 61.79 (first frame) and y = 0.0188x + 89.756 (second frame); 64 bit 66 MHz, y = 0.0231x + 56.088 (first frame) and y = 0.0142x + 81.975 (second frame).]
Figure 4.1 UDP request-response latency as a function of the packet size for the SysKonnect SK9843 NIC on 32 bit 33 MHz and 64 bit 66 MHz PCI buses of the SuperMicro 370DLE.
[Figure omitted: receive wire rate versus transmit time per frame for packet sizes from 50 to 1472 bytes on the 32 bit 33 MHz and 64 bit 66 MHz buses, and the percentage of time the receiving CPU spent in kernel mode versus frame spacing for the 64 bit 66 MHz case.]
Figure 4.2 UDP Throughput as a function of the packet size for the SysKonnect SK9843 NIC on 32 bit 33 MHz and 64 bit 66 MHz PCI buses of the SuperMicro 370DLE, together with the percentage of time the receiving CPU was in kernel mode.
4.1.3 PCI Activity
Figure 4.3 shows the signals on the sending and receiving PCI buses when frames with 1400 bytes of
user data were sent. At time t, the dark portions on the send PCI shows the setup of the Control and
Status Registers (CSRs) on the NIC, which then moves the data over the PCI bus – indicated by the
assertion of the PCI signals for a long period (~3 µs).
The Ethernet frame is transferred over the Gigabit Ethernet as shown by the lower two signals at time X
to O, and then on to the receive PCI bus at a time just after O. The frame exists on the Ethernet medium
for 11.6 µs. The activity seen on the sending and receiving PCI buses after the data transfers is due to
the driver updating the CSRs on the NIC after the frame has been sent or received. These actions are
performed after the NIC has generated an interrupt; for the SysKonnect card, this happens for each
frame.
Figure 4.3 also shows that the propagation delay of the frame from the first word on the sending PCI
bus to the first word on the receiving PCI was ~26 µs. Including the times to setup the sending CSRs and the interrupt processing on the receiver, the total delay from the software sending the packet to receiving it was 36 µs. Using the intercept of 56 µs from the latency measurements, one can estimate the IP stack and application processing times to be ~10 µs on each 800 MHz CPU.
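A rough reconstruction of this estimate (the intermediate step is not spelled out in the text, so the split is an assumption): dividing the difference between the round-trip intercept and the measured hardware delay equally between the two hosts gives

\[
t_{\mathrm{stack}} \approx \frac{56\ \mu\mathrm{s} - 36\ \mu\mathrm{s}}{2} = 10\ \mu\mathrm{s\ per\ host}.
\]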
The traces show that data movement over PCI and Ethernet do not overlap in time, indicating that the
NICs operate in a store-and-forward mode using buffering on the NIC for both transmit and receive.
There is no indication of multiple Ethernet frames being stored or queued on the NIC.
[Figure omitted: logic analyser traces showing the send CSR setup, the send PCI transfer, the packet on the Ethernet fibre and the receive PCI transfer.]
Figure 4.3 The traces of the signals on the send and receive PCI buses for the SysKonnect NIC on the 64 bit 66 MHz bus of the SuperMicro 370DLE motherboard. The bottom 3 signals are from the Gigabit Ethernet Probe card and show the presence of the frame on the Gigabit Ethernet fibre.
The upper plot in Figure 4.4 shows the same signals corresponding to packets being generated at a
transmit spacing of 20 µs and the lower plot with the transmit spacing set to 10 µs. In this case, the
packets are transmitted back-to-back on the wire with a spacing of 11.7 µs, i.e. full wire speed. This
shows that an 800 MHz CPU is capable of transmitting large frames at gigabit wire speed.
The time required for the Direct Memory Access (DMA) transfers over the PCI scales with PCI bus
width and speed as expected, but the time required to setup the CSRs of the NIC is almost independent
of these parameters. It is dependent on how quickly the NIC can deal with the CSR accesses internally.
Together with CPU load, these considerations explain why using 66 MHz PCI buses gives increased
performance but not by a factor of two (see Figure 4.2). Another issue to consider is the number of
CSR accesses that a NIC requires to transmit data, receive data, determine the error status, and update
the CSRs when a packet has been sent or received. Clearly, for high throughput, the number of CSR
accesses should be minimised.
[Figure omitted: logic analyser traces of the send PCI, receive PCI and Ethernet fibre signals; in the lower plot the frames are back-to-back on the fibre.]
Figure 4.4 Signals on the send and receive PCI buses and the Gigabit Ethernet Probe card for different packet spacings. Upper: signals correspond to packets being generated at a transmit spacing of 20 µs. Lower: plots with the transmit spacing set to 10 µs.
4.2 Intel PRO/1000 XT NIC
4.2.1 UDP Request-Response Latency
Figure 4.5 shows the round trip latency as a function of the packet size for the Intel PRO/1000 XT
Server NIC connected to the 64 bit 66 MHz PCI bus of the SuperMicro 370DLE motherboard. The
graph shows smooth behaviour as a function of packet size, and the observed slope of 0.018 µs/byte is
in reasonable agreement with the 0.0118 µs/byte expected. The large intercept of ~168 µs suggests that
the driver enables interrupt coalescence, which was confirmed by the throughput tests that showed
one interrupt for approximately every 33 packets sent (~ 400 µs) and one interrupt for every 10 packets
received (~ 120 µs).
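These intervals are consistent with full-size frames arriving roughly every 12 µs at gigabit wire rate (an approximate cross-check, not a calculation given in the text): 33 packets x ~12 µs ≈ 400 µs and 10 packets x ~12 µs ≈ 120 µs.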
4.2.2 UDP Throughput
Figure 4.6 shows that the maximum throughput for full 1472 byte frames (1500 byte MTU) on the 64
bit 66 MHz PCI bus was 910 Mbit/s. The lower graph shows the corresponding packet loss. This
behaviour of losing frames at wire rate is typical of most NIC-motherboard combinations. It was traced
to “indiscards”, i.e. loss in the receiving IP layer when no buffers for the higher layers were available.
The receiving CPU was loaded at between 85 and 100% when the frames were spaced at 13 µs or less,
which suggests that a lack of CPU power may be the cause of the lost packets.
4.2.3 PCI Activity
Figure 4.7 shows the PCI signals for a 64 byte request from the sending PC followed by a 1400 byte
response. Prior to sending the request frame, there is a CSR setup time of 1.75 µs followed by 0.25 µs
of DMA data transfer. The default interrupt coalescence of 64 units was enabled. This gave ~ 70 µs
delay between the PCI transfer and the update of the CSRs. This delay is seen on both send and receive
actions.
[Figure omitted: round-trip latency versus message length; fitted line y = 0.0187x + 167.86.]
Figure 4.5 Latency as a function of the packet size for the Intel PRO/1000 XT on the SuperMicro 370DLE 64 bit 66 MHz PCI bus.
Figure 4.8 shows the PCI signals for 1400 byte frames sent every 11 µs. The PCI bus is occupied for ~4.7 µs on sending, which corresponds to ~43% usage; for receiving, it takes only ~3.25 µs to transfer the frame, ~30% usage.
Figure 4.9 shows the transfers on a longer time scale. There are regular gaps in the frame transmission,
approximately every 900 µs. Transfers on the sending PCI stop first, followed by those on the receiving
PCI. Symmetric flow control was enabled, which suggests that Ethernet pause frames were used to
prevent buffer overflow and subsequent frame loss. This behaviour is not unreasonable as one of these
frames would occupy the Gigabit Ethernet for ~11.7 µs.
[Figure omitted: receive wire rate and percentage packet loss versus the spacing between frames for packet sizes from 50 to 1472 bytes.]
Figure 4.6 UDP Throughput and corresponding packet loss as a function of the packet size for the Intel PRO/1000 XT Server NIC on the SuperMicro 370DLE 64 bit 66 MHz PCI bus.
[Figure omitted: logic analyser traces showing the 64 byte request on the send PCI bus, the request received and the 1400 byte response on the receive PCI bus, and the delayed send and receive interrupt processing.]
Figure 4.7 The traces of the signals on the send and receive PCI buses for the Intel PRO/1000 XT NIC and the SuperMicro 370DLE motherboard for a 64 byte request with a 1400 byte response.
[Figure omitted: logic analyser traces showing the 1400 byte transfers to the NIC on the send PCI bus and to memory on the receive PCI bus.]
Figure 4.8 The traces of the signals on the send and receive PCI buses for the Intel PRO/1000 XT NIC and the SuperMicro 370DLE motherboard for a stream of 1400 byte packets transmitted with a separation of 11 µs.
[Figure omitted: logic analyser traces of the send and receive PCI buses on a longer time scale.]
Figure 4.9 The traces of the signals on the send and receive PCI buses for the Intel PRO/1000 XT NIC and the SuperMicro 370DLE motherboard for a stream of 1400 byte packets transmitted with a separation of 11 µs, showing pauses in the packet transmission.
5 Measurements made on the IBM DAS board
The motherboards in the IBM DAS compute server were tested. They had dual 1 GHz Pentium III processors with
the ServerWorks CNB20LE Chipset, and the PCI bus was 64 bit 33 MHz. RedHat 7.1 Linux with the
2.4.14 kernel was used for the tests.
Figure 5.1 shows the UDP throughput for the SysKonnect NIC. It performed very well, giving a
throughput of 790 Mbit/s, which may be compared to only 620 Mbit/s on the SuperMicro 370DLE.
However the IBM had much more processing power with dual 1 GHz CPUs.
The Intel NIC also performed very well on the IBM DAS system, giving a throughput of 930 Mbit/s,
which may be compared to the 910 Mbit/s on the SuperMicro 370DLE.
The latency and PCI activity were similar to those for the SuperMicro 370DLE.
[Figure omitted: receive wire rate versus the spacing between frames for packet sizes from 50 to 1472 bytes.]
Figure 5.1 UDP Throughput as a function of the packet size for the SysKonnect SK9843 NIC on the 64 bit 33 MHz PCI bus of the IBM DAS.
6 Measurements made on the SuperMicro P4DP6 board
The SuperMicro P4DP6 motherboard used the Intel E7500 (Plumas) Chipset and had Dual Xeon
Prestonia 2.2 GHz CPUs. The arrangement of the 64 bit PCI and PCI-X buses was discussed in Section
3.
6.1 SysKonnect NIC
6.1.1 Request-Response Latency
Figure 6.1 depicts the round trip latency as a function of the packet size. This smooth function indicates
that the driver-NIC buffer management works well. The slope for the 64 bit PCI bus of 0.0199 µs/byte
is in reasonable agreement with the 0.0117 µs/byte expected. The intercept of 62 µs suggests that there
is no interrupt coalescence, which is confirmed by the data from the UDPmon throughput tests that
record one interrupt per packet sent and received.
[Figure omitted: round-trip latency versus message length for the 64 bit 66 MHz bus; fitted lines y = 0.0199x + 62.406 (first frame) and y = 0.012x + 78.554 (second frame).]
Figure 6.1 UDP request-response latency as a function of the packet size for the SysKonnect NIC using the 64 bit 66 MHz PCI bus on the SuperMicro P4DP6 board.
6.1.2 UDP Throughput
As shown in Figure 6.2, the SysKonnect NIC manages 885 Mbit/s throughput for full 1472 byte
packets on the 64 bit 66 MHz PCI bus (at a spacing of 13 µs). At inter-packet spacing less than this
value, there is ~20% packet loss for packets over 1200 bytes and between 60 and 70% loss for small
packets. High packet loss corresponded to low throughput. These losses were again traced to
“indiscards”, i.e. loss in the receiving IP layer.
Although the corresponding CPU utilisation on the receiving PC, averaged over the 4 CPUs, was ~20 - 30%, with the 2.4.19 kernel only one of the CPUs was responsible for the packet processing and this
CPU spent nearly all of its time in kernel mode as shown by the lower plot in Figure 6.2. The
SysKonnect NIC generated an interrupt for each packet, and with this utilisation, the system was very
close to live lock [18], providing a likely explanation of the observed packet loss. Network studies at
Stanford Linear Accelerator Center (SLAC) [14] also indicate that at least 1 GHz of processor power per Gbit/s is required. With the 2.4.20 kernel and the same hardware, packets over 1400 bytes achieved line speed,
and in general the CPU utilisation was less. However there was still packet loss for small packets,
which is not understood.
[Figure omitted: receive wire rate, percentage packet loss, and percentage of time CPU1 on the receiver spent in kernel mode, each versus the spacing between frames for packet sizes from 50 to 1472 bytes.]
Figure 6.2 UDP Throughput, packet loss, and CPU utilization as a function of the packet size for the SysKonnect NIC using the 64 bit 66 MHz PCI bus on the SuperMicro P4DP6 board.
6.2 Intel PRO/1000 NIC
Figure 6.3 shows the round trip latency as a function of the packet size. The behaviour shows some
distinct steps. The observed slope of 0.009 µs/byte is smaller than the 0.0118 µs/byte expected, but
making rough allowance for the steps, agreement is good.
Figure 6.3 also shows histograms of the round trip times for various packet sizes. There is no variation
with packet size, all having a FWHM of 1.5 µs and no significant tail. This confirms that the timing
method of using the CPU cycle counter gives good precision and does not introduce bias due to other
activity in the end systems.
6.2.1 UDP Throughput
Figure 6.4 shows that the Intel NIC performed very well, giving a throughput of 950 Mbit/s with no
packet loss. The corresponding average CPU utilisation on the receiving PC was ~ 20 % for packets
greater than 1000 bytes and 30 - 40 % for smaller packets.
[Figure omitted: round-trip latency versus message length with fitted lines y = 0.0093x + 194.67 (first frame) and y = 0.0149x + 201.75 (second frame), together with histograms of the round-trip times for 64, 512, 1024 and 1400 byte packets.]
Figure 6.3 Latency as a function of the packet size, and histograms of the request-response latency for 64, 512, 1024 and 1400 byte packet sizes for the Intel PRO/1000 XT NIC on the 64 bit 66 MHz PCI bus of the SuperMicro P4DP6 motherboard.
[Figure omitted: receive wire rate versus the spacing between frames for packet sizes from 50 to 1472 bytes.]
Figure 6.4 UDP Throughput as a function of the packet size for the Intel PRO/1000 XT NIC on the SuperMicro P4DP6 64 bit 66 MHz PCI bus. During these tests no packet loss was observed.
[Figure omitted: logic analyser traces showing the 1400 byte transfer to the NIC on the send PCI bus, with PCI STOP asserted during the CSR accesses, and the 1400 byte transfer to memory on the receive PCI bus.]
Figure 6.5 The signals on the send and receive PCI buses for the Intel Pro 1000 NIC and the SuperMicro P4DP6 motherboard. The lower picture shows the PCI transactions to send the frame in more detail.
6.2.2 PCI Activity
Figure 6.5 shows the PCI activity when sending a 1400 byte packet with the Intel PRO/1000 XT NICs.
The DMA transfers are clean but there are several PCI STOP signals that occur when accessing the
NIC CSRs. The assertion of these STOP signals delays the completion of the PCI transaction, as shown
in more detail in the lower picture. The PCI bus is occupied for ~ 2 µs just setting up the CSRs for
sending. For 1400 byte packets at wire speed, the PCI bus is occupied for ~ 5µs on sending, ~ 50%
usage, while for receiving it takes only ~ 2.9 µs to transfer the frame over the PCI bus, ~ 25% usage.
7 10 Gigabit Ethernet Performance with Various Motherboards
For the tests described in this section, two similar systems were connected back to back.
7.1 Latency
Figure 7.1 shows the request-response latency plot as a function of message size and histograms of the
latency for the Intel 10 Gbit NIC in the DataTAG SuperMicro P4DP8-G2 PCs. For two 133 MHz PCI-X buses and a 10 Gigabit Ethernet interconnect, the expected slope is 0.00268 µs/byte, compared with
the measured 0.0034 µs/byte. The extra time could be due to memory-memory copies in each host, as it
corresponds to a transfer rate of 2.7 GBytes/s in each host. The histograms show narrow peaks (FWHM
~2 µs) indicating good stability; however there is a second smaller peak 4-6 µs away from the main
peak.
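A hedged reconstruction of that figure, assuming the extra slope is shared equally between one memory-memory copy in each host:

\[
\Delta s = 0.0034 - 0.00268 = 0.00072\ \mu\mathrm{s/byte}
\;\Rightarrow\;
\frac{1}{\Delta s / 2} \approx 2.8\ \mathrm{GBytes/s\ per\ copy},
\]

close to the 2.7 GBytes/s quoted above.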
The request-response latency was also measured for the CERN OpenLab HP RX2600 Dual Itanium 64
bit systems. The measured slope was 0.0033 µs/byte, close to that expected, and the histograms were
similar to those for the Xeon processor.
7.2 Throughput
Figure 7.2 shows the UDP throughput as a function of the spacing between the packets for different
packet sizes. In all cases the MTU was set to 16,114 bytes, giving a maximum length of 16,080 bytes
for the user data in the UDP packet. The top plot is for the DataTAG SuperMicro PCs showing a wire
rate throughput of 2.9 Gbit/s; the middle plot is for the HP Itanium PCs with a wire rate of 5.7 Gbit/s;
and the bottom plot is for the SLAC Dell PCs giving a wire rate of 5.4 Gbit/s. The PCI-X Maximum
Memory Read Byte Count (mmrbc) parameter [13] [27] was set to 4096 bytes for the SLAC Dell and
HP Itanium measurements; the DataTAG SuperMicro PCs measurements used the default mmrbc of
512 bytes. The difference between the top and bottom plots includes the effects of CPU power and the
PCI-X mmrbc parameter setting.
In general the plots indicate a well behaved hardware and driver network sub-system with few unexpected changes or lost packets. The plots also show that high throughput can only be achieved by
using jumbo frames. When 1472 byte frames (1500 byte MTU) are used, the throughput is 2 Gbit/s or
less, and some packets are lost. In this case it appears that CPU usage is the limiting factor.
In terms of throughput performance, the 1 GHz 64 bit Itaniums have a clear advantage when compared
to the 3 GHz Xeon, but further investigation is required to determine whether this difference in
performance is due to the processor characteristics, the memory interconnect or the chipset design. It is
also possible that the improvements in the kernel from 2.4.20 to 2.5.72 play an important role.
7.3 Effect of tuning the PCI-X Bus
The activity on the PCI-X bus and the UDP memory to memory throughput were investigated as a
function of the value loaded into the mmrbc register of the Intel PRO/10GbE LR Server Adapter. The
mmrbc forms part of the PCI-X Command Register and sets the maximum byte count the PCI-X device
may use when initiating a data transfer sequence with one of the burst read commands [13].
Figure 7.3 shows traces of the signals on the PCI-X Bus on the DataTAG SuperMicro PC, when a
packet with 16,080 bytes of user data is moved from memory to the NIC for different values of the
mmrbc parameter. The PCI-X bus operated at 133 MHz. The sequence starts with two accesses to the
CSRs on the NIC and then the DMA data transfer begins. For illustration, the CSR access, the length of
a PCI-X data transfer segment and the length of the whole packet (16,114 bytes including headers) are
indicated on the traces when the mmrbc was set to 2048 bytes. Note that after transferring a segment
2048 bytes long, there is a ~870 ns pause in the data flow over the PCI-X bus, which allows re-arbitration for use of the bus.
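As a back-of-envelope illustration, assuming the re-arbitration pause is the only per-segment overhead and taking the raw 64 bit 133 MHz PCI-X rate as about 8.5 Gbit/s, a 2048 byte segment occupies the bus for about 1.9 µs, so the pause limits the effective transfer rate to roughly

\[
r_{\mathrm{eff}} \approx \frac{1.9}{1.9 + 0.87} \times 8.5\ \mathrm{Gbit/s} \approx 5.9\ \mathrm{Gbit/s},
\]

which is why increasing mmrbc, and with it the segment length, raises the achievable throughput.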
[Figure omitted: round-trip latency versus message length (average and minimum times) with fitted line y = 0.0032x + 44.098, together with histograms of the latency for 64, 512 and 1400 byte packets.]
Figure 7.1 The request-response latency plots and histograms for the DataTAG SuperMicro P4DP8-G2 dual Xeon 2.2 GHz PCs.
[Figure omitted: receive wire rate versus the spacing between frames for packet sizes from 1472 to 16,080 bytes, for the three systems described in the caption.]
Figure 7.2 UDP throughput as a function of the spacing between the packets for different packet sizes.
Top: SuperMicro dual 2.2 GHz Xeon, kernel 2.4.21 DataTAG altAIMD, mmrbc = 512 bytes;
Middle: HP dual 1 GHz Itanium, kernel 2.5.72, mmrbc = 4096 bytes;
Bottom: Dell dual 3.06 GHz Xeon, kernel 2.4.20 DataTAG altAIMD, mmrbc = 4096 bytes.
The plots clearly show the expected decrease in the time taken for the data transfers as the amount of
data that may be transmitted in one PCI-X sequence is increased.
The variation of the throughput for UDP packets as a function of mmrbc is shown in Figure 7.4. The
dashed lines represent the theoretical maximum throughput for PCI-X. For the DataTAG SuperMicro
systems, which have dual 2.2 GHz Xeon CPUs and a 400 MHz memory bus, the throughput plateaus at
4.0 Gbit/s. The purple diamonds show the PCI-X throughput calculated using data from the logic
analyser traces. For the HP RX2600 system with Dual Itanium 2 CPUs, the throughput climbs
smoothly from 3.2 Gbit/s to 5.7 Gbit/s, tracking the predicted rate curve (purple line). Further work is
needed to determine whether CPU power, memory speed or the chipset is the limiting factor.
[Figure omitted: PCI-X traces for mmrbc values of 512, 1024, 2048 and 4096 bytes; for the 2048 byte case the CSR accesses, a 2048 byte PCI-X data segment, the full 16,114 byte transfer and the interrupt and CSR update are marked.]
Figure 7.3 Traces of the PCI-X signals showing movement of a 16080 byte packet from memory to the NIC as the Maximum Memory Read Byte Count is increased.
[Figure omitted: two plots of the measured rate, the rate calculated from the expected PCI-X transfer time, and the maximum PCI-X throughput (in Gbit/s) versus the Maximum Memory Read Byte Count.]
Figure 7.4 The plots of the variation of UDP throughput as a function of mmrbc together with the theoretical maximum PCI-X throughput and that calculated from the PCI-X activity traces.
Left plot: Throughput for the DataTAG SuperMicro Dual 2.2 GHz Xeon
Right plot: Throughput for the HP RX2600 dual 64 bit 1 GHz Itanium 2
8 Related Work
One Gigabit Ethernet NICs have been available for some time and many groups have evaluated their
performance in various ways. Competitive performance evaluations [26], zero-copy operating-system bypass concepts [24], and detailed NIC tests using TCP [9] have been reported. However, few included
an investigation of the role and limitations of the PCI / PCI-X bus [25].
The emergence of the 10 Gigabit Ethernet standard provided a cost effective method to upgrade
performance of network backbones, and router and switch manufacturers are keen to demonstrate
performance and inter-operability. Photonics researchers study the performance of 10 gigabit
transceivers [22] as well as the stability of Dense Wave Division Multiplexing optical filter
components for use in real-time applications [21].
Ten Gigabit Ethernet PCI-X cards[1], [2] are becoming available and network researchers have studied
their performance in moving bulk data with TCP and the effect of MTU size and PCI-X burst size on
LANs [16] and transatlantic networks [17]. Other researchers [23] have tested transatlantic UDP and
TCP flows.
Recently, other tests on Gigabit Ethernet NICs have been reported [9]. These have used TCP/IP and
long message sizes up to 1 GByte for the memory to memory transfers. It is clear that these results will include the effects of the TCP protocol and its implementation, as well as the performance of the NICs and motherboards.
By focusing on simple UDP transfers to avoid the effects of reliable network protocols, measuring the
individual CPU loads, and visualising the actual operation of the PCI bus with a logic analyser when
sending commands and data to and from the NIC, the measurements presented in this paper
complement and extend the understanding of the work of other researchers.
9 Conclusion
This paper has reported on some of the tests made on a variety of high performance server-type
motherboards with PCI or PCI-X input/output buses with NICs from different manufacturers. Figure
9.1 gives a table summarising the maximum throughput obtained for the NICs tested. Details of tests
on the older NICs may be found in [15].
The Intel PRO/1000 XT card gave a consistently high throughput between 910 – 950 Mbit/s,
improving when used on the more powerful modern motherboards, and the on-board Intel controller
operated at gigabit line speed when one port was used. The old SysKonnect card gave throughputs
between 720 – 885 Mbit/s, again improving with more modern motherboards. The SysKonnect card
used on the SuperMicro P4DP8-G2 operated at line speed. In these tests, all cards were stable in operation, with several thousand GBytes of data being transferred. In general, the SysKonnect had a larger
CPU load and gave more packet loss, probably due to the impact of not using interrupt coalescence.
Inspection of the signals on the PCI buses and the Gigabit Ethernet media has shown that a PC with an 800 MHz CPU can generate Gigabit Ethernet frames back-to-back at line speed. However, much more
processing power is required for the receiving system to prevent packet loss.
The design and implementation of the kernel and drivers are very important to ensure high throughput
while minimising CPU load. In some cases, an increase in throughput of ~12% was observed. Better
use of the multiple CPUs (e.g., by better scheduling) could give an advantage.
Timing information derived from the PCI signals and the round trip latency allowed the processing time required for the IP stack and test application to be estimated as ~10 µs per node for the send and receive of a packet.
A 32 bit 33 MHz PCI bus has almost 100% occupancy and the tests indicate a maximum throughput of
~ 670 Mbit/s. A 64 bit 33 MHz PCI bus shows 82% usage on sending when operating with interrupt
coalescence and delivering 930 Mbit/s. In both of these cases, involving a disk sub-system operating on
the same PCI bus would seriously decrease performance: the data would have to traverse the bus twice
and there would be extra control information for the disk controller.
It is worth noting that even on a well designed system, the network data must cross the CPU-memory
bus three times: once during the DMA to or from the NIC, and twice for the kernel to user space
memory-memory copy by the CPU. Thus the achievable bandwidth of the memory sub-system should
be at least three times the required network bandwidth.
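For example, by this argument a single fully loaded Gigabit Ethernet link needs roughly 3 x 1 Gbit/s ≈ 375 MBytes/s of memory bandwidth, and a 10 Gigabit Ethernet link about ten times that.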
SuperMicro 370DLE, PCI 32 bit 33 MHz, RedHat 7.1 kernel 2.4.14; PIII 800 MHz; ServerWorks III LE:
   Alteon AceNIC 674 Mbit/s; SysKonnect SK-9843 584 Mbit/s (0-0 µs)
SuperMicro 370DLE, PCI 64 bit 66 MHz, RedHat 7.1 kernel 2.4.14; PIII 800 MHz; ServerWorks III LE:
   Alteon AceNIC 930 Mbit/s; SysKonnect SK-9843 720 Mbit/s (0-0 µs); Intel Pro/1000 XT 910 Mbit/s (400-120 µs)
IBM DAS, PCI 64 bit 33 MHz, RedHat 7.1 kernel 2.4.14; dual PIII 1 GHz; ServerWorks CNB20LE:
   SysKonnect SK-9843 790 Mbit/s (0-0 µs); Intel Pro/1000 XT 930 Mbit/s (400-120 µs)
SuperMicro P4DP6, PCI 64 bit 66 MHz, RedHat 7.2 kernel 2.4.19-SMP; dual Xeon 2.2 GHz, 400 MHz system bus; Intel E7500 (Plumas):
   SysKonnect SK-9843 885 Mbit/s (0-0 µs); Intel Pro/1000 XT 950 Mbit/s (70-70 µs)
SuperMicro P4DP6, PCI 64 bit 66 MHz, RedHat 7.2 kernel 2.4.20 DataTAG altAIMD; dual Xeon 2.2 GHz, 400 MHz system bus; Intel E7500 (Plumas):
   SysKonnect SK-9843 995 Mbit/s (0-0 µs); Intel Pro/1000 XT 995 Mbit/s (70-70 µs)
SuperMicro P4DP8-G2, PCI 64 bit 66 MHz, RedHat 7.2 kernel 2.4.19-SMP; dual Xeon 2.2 GHz, 400 MHz system bus; Intel E7500 (Plumas):
   SysKonnect SK-9843 990 Mbit/s (0-0 µs); Intel Pro/1000 XT 995 Mbit/s (70-70 µs); Intel Pro/10GbE LR 4.0 Gbit/s
HP RX2600 Itanium, PCI-X 64 bit 133 MHz, kernel 2.5.72; dual Itanium 2 1 GHz 64 bit, 200 MHz dual-pumped system bus (BW = 6.4 GB/s); HP zx1:
   Intel Pro/10GbE LR 5.7 Gbit/s
Dell PowerEdge 2650, PCI-X 64 bit 133 MHz, kernel 2.4.20 DataTAG altAIMD; dual Xeon 3.06 GHz, 533 MHz system bus:
   Intel Pro/10GbE LR 5.4 Gbit/s
Figure 9.1 Table of the maximum throughput measured for 1472 byte UDP packets and the transmit-receive interrupt coalescence times in microseconds. Details of the motherboards and Operating Systems studied are also given.
The latency, throughput and PCI-X activity measurements indicate that the Intel 10 Gigabit Ethernet card and driver are well behaved and integrate well with the network sub-system for both 32 and 64 bit architectures. The plots also show that high throughput, 5.4 Gbit/s for Xeon with mmrbc tuning and 5.7
Gbit/s for Itanium with the default mmrbc setting, can only be achieved by using jumbo frames.
With 10 Gigabit Ethernet, when standard 1500 byte MTU frames are used, the throughput is 2 Gbit/s or
less, and some packets are lost. It appears that CPU usage is very important and may be the limiting
factor, but further work is needed to investigate this and the effect of interrupt coalescence. It is also
important to note the large increase in throughput that may be achieved by correctly setting the PCI-X
mmrbc parameter.
In the future, it would be interesting to study in detail the emerging 10 Gigabit Ethernet NICs, the
impact of 64-bit architectures, the impact of PCI-Express, and the performance of the latest Linux
kernels (e.g. version 2.6, which includes driver, kernel and scheduling enhancements). The techniques
described in this paper have been applied to the operation of RAID controllers and disk sub-systems;
the interaction between the network and storage subsystems should also be investigated in the future.
10 Acknowledgements
We would like to thank members of the European DataGrid and DataTAG projects, the UK e-science
MB-NG projects, and the SLAC network group for their support and help in making test systems
available, especially:
• Boston Ltd & SuperMicro for the supply of the SuperMicro motherboards and NICs - Dev Tyagi (SuperMicro), Alex Gili-Ross (Boston)
• Brunel University - Peter van Santen
• Sverre Jarp and Glen Hisdal of the CERN Open Lab
• CERN DataTAG - Sylvain Ravot, Olivier Martin and Elise Guyot
• SLAC - Les Cottrell, Connie Logg, Gary Buhrmaster
• SURFnet & Universiteit van Amsterdam - Pieter de Boer, Cees de Laat, Antony Antony
We would also like to acknowledge funding from both PPARC and EPSRC to work on these projects.
11 References
[1] Intel, PRO/10GbE LR Server Adapter User Guide,
http://support.intel.com/support/network/adapter/pro10gbe/pro10gbelr/sb/CS-008392.htm
[2] S2io, S2io Xframe 10 Gigabit Ethernet PCI-X Server/Storage Adapter DataSheet,
http://www.s2io.com/products/XframeDatasheet.pdf
[3] SuperMicro motherboard reference material,
http://www.supermicro.com/PRODUCT/MotherBoards/E7500/P4DP6.htm
[4] DataTAG,
http://datatag.web.cern.ch/datatag
[5] Intel, Hyper-threading Technology on the Intel Xeon Processor Family for Servers, White paper,
http://www.intel.com/eBusiness/pdf/prod/server/xeon/wp020901.pdf
[6] M. Boosten, R.W. Dobinson and P. van der Stok, “MESH Messaging and ScHeduling for Fine
Grain Parallel Processing on Commodity Platforms”, in Proceedings of the International Conference
on Parallel and Distributed Techniques and Applications (PDPTA'99), Las-Vegas, Nevada, June 1999.
[7] B. Lowekamp, B. Tierney, L. Cottrell, R. Hughes-Jones, T. Kielmann and M. Swany, A Hierarchy
of Network Performance Characteristics for Grid Applications and Services, GGF Network
Measurements Working Group draft, work in progress, May 2004
http://www-didc.lbl.gov/NMWG/docs/draft-ggf-nmwg-hierarchy-04.pdf
[8] ATLAS HLT/DAQ/DCS Group, ATLAS High-Level Trigger Data Acquisition and Controls,
Technical Design Report CERN/LHCC/2003-022, Reference ATLAS TDR 016, June 2003.
[9] P. Gray and A. Betz, “Performance Evaluation of Copper-based Gigabit Ethernet Interfaces”, in
Proceedings of the 27th Annual IEEE Conference on Local Computer Networks (LCN'02), Tampa,
Florida, November 2002.
[10] R. Hughes-Jones and F. Saka, Investigation of the Performance of 100Mbit and Gigabit Ethernet
Components Using Raw Ethernet Frames, Technical Report ATL-COM-DAQ-2000-014, Mar 2000.
http://www.hep.man.ac.uk/~rich/atlas/atlas_net_note_draft5.pdf
[11] R. Hughes-Jones, UDPmon: a Tool for Investigating Network Performance,
http://www.hep.man.ac.uk/~rich/net
[12] Details of the Gigabit Ethernet Probe Card,
http://www.hep.man.ac.uk/~scott/projects/atlas/probe/probehome.html
[13] PCI Special Interest Group, PCI-X Addendum to the PCI Local Bus Specification, Revision 1.0a,
July 2000.
[14] SLAC Network tests on bulk throughput site:
http://www-iepm.slac.stanford.edu/monitoring/bulk/
SC2001 high throughput measurements:
http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2001
[15] R. Hughes-Jones, S. Dallison, G. Fairey, P. Clarke and I. Bridge, “Performance Measurements on
Gigabit Ethernet NICs and Server Quality Motherboards”, In Proceedings of the 1st International
Workshop on Protocols for Fast Long-Distance Networks (PFLDnet 2003), Geneva, Switzerland,
February 2003.
[16] W.C. Feng, J. Hurwitz, H. Newman et al., “Optimizing 10-Gigabit Ethernet for Networks of
Workstations, Clusters, and Grids: a Case Study”, in Proceedings of the 15th International Conference
for High Performance Networking and Communications (SC 2003), Phoenix, AZ, November 2003.
[17] J. Hurwitz, W. Feng , “Initial End-to-End Performance Evaluation of 10-Gigabit Ethernet”, in
Proceedings of the 11th Symposium on High Performance Interconnects (HOT-I 2003), Stanford,
California, August 2003.
[18] J. C. Mogul and K.K. Ramakrishnan, “Eliminating receive livelock in an interrupt-driven kernel”.
In Proceedings of the 1996 Usenix Technical Conference, pages 99–111, 1996.
[19] Intel, IA-32 Intel Architecture Software Developer’s Manual Vol 1. Basic Architecture,
http://www.intel.com/design/pentium4/manuals/245471.htm
[20] F. Saka, “Ethernet for the ATLAS Second Level Trigger”, Chapter 6 Thesis, Physics Department,
Royal Holloway College, University of London, 2001, http://www.hep.ucl.ac.uk/~fs/thesis.pdf
[21] R. McCool, “Fibre Developments at Jodrell Bank Observatory”, in Proceedings of the 2nd Annual
e-VLBI (Very Long Baseline Interferometry) Workshop, Dwingeloo, The Netherlands, May 2003,
http://www.jive.nl/evlbi_ws/presentations/mccool.pdf
[22] Variations in 10 Gigabit Ethernet Laser Transmitter Testing Using Reference Receivers,
Tektronix Application Note, 2003,
http://www.tek.com/Measurement/App_Notes/2G_16691/eng/2GW_16691_0.pdf
[23] C. Meirosu, “The International Grid Testbed: a 10 Gigabit Ethernet success story”, Talk given at
the 1st International Grid Networking Workshop (GNEW 2004), Geneva, Switzerland, March 2004,
http://datatag.web.cern.ch/datatag/gnew2004/slides/meirosu.ppt
[24] P. Shivam, P. Wyckoff and D. Panda, “EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet
Message Passing”, in Proceedings of the 13th International Conference for High Performance
Networking and Communications (SC2001), Denver, CO, November 2001.
[25] M.L. Loeb, A.J. Rindos, W.G. Holland and S.P. Woolet, “Gigabit Ethernet PCI adapter
performance”, IEEE Network, Vol. 15, No. 2, pp.42-47, March-April 2001.
[26] R. Kovac, "Getting ’Giggy’ with it", Network World Fusion, June 2001
http://www.nwfusion.com/reviews/2001/0618rev.html
[27] T. Shanley, PCI-X System Architecture, Addison-Wesley, 2001.
[28] altAIMD kernel
http://www.hep.ucl.ac.uk/~ytl/tcpip/linux/altaimd/
[29] Web100 Project home page, http://www.web100.org/
Biographies:
Richard Hughes-Jones leads the e-science and Trigger and Data Acquisition development in the
Particle Physics group at Manchester University. He has a PhD in Particle Physics and has worked on
Data Acquisition and Network projects. He is a member of the Trigger/DAQ group of the ATLAS
experiment in the LHC programme, focusing on Gigabit Ethernet and protocol performance. He is also
responsible for the High performance, High Throughput network investigations in the European Union
DataGrid and DataTAG projects, the UK e-Science MB-NG project, and the UK GridPP project. He is
secretary of the Particle Physics Network Coordinating Group which supports networking for UK
PPARC researchers. He is a co-chair of the Network Measurements Working Group of the GGF, a co-chair of PFLDnet 2005, and a member of the UKLight Technical Committee.
Peter Clarke is deputy director of the National e-Science Centre at Edinburgh. He gained his PhD in
Particle Physics from Oxford University. His experience encompasses both Experimental Particle
Physics and High Performance Global Computing and Networks for Science. He worked in the
ATLAS experiment at the CERN Large Hadron Collider, and recently the Computing sector of the
LHC project. In the field of networking he has been a principal member of the DataGRID, DataTAG
and MB-NG projects, which address Grid and the Network issues (including QoS, high performance
data transport and optical networking). He is a member of the GGF Steering Committee and co-Director of the Data Area within the GGF.
Stephen Dallison has a PhD in Particle Physics and took up a postdoctoral post in the Network Team of
the HEP group at the University of Manchester in May 2002. He principally works on the MB-NG
project but also contributes to the EU DataTAG project. As part of the High Throughput working group
Stephen has worked on network performance tests between sites in the UK, Europe and the United
States, involving advanced networking hardware and novel transport layer alternatives to TCP/IP.