RICE UNIVERSITY
Efficient Hardware/Software Architectures for
Highly Concurrent Network Servers
by
Paul Willmann
A THESIS SUBMITTED
IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE
Doctor of Philosophy
Approved, Thesis Committee:
Behnaam Aazhang, Chair
J. S. Abercrombie Professor of
Electrical and Computer Engineering
Alan L. Cox
Associate Professor of Computer Science
and of Electrical and Computer Engineering
David B. Johnson
Associate Professor of Computer Science
and of Electrical and Computer Engineering
Scott Rixner
Associate Professor of Computer Science
and of Electrical and Computer Engineering
Houston, Texas
November 2007
Abstract
Internet services continue to incorporate increasingly bandwidth-intensive applications, including audio and high-quality, feature-length video. As the pace of uniprocessor performance improvements slows, however, network servers can no longer rely
on uniprocessor technology to fuel the overall performance improvements necessary
for next-generation, high-bandwidth applications. Furthermore, rising per-machine
power costs in the datacenter are driving demand for solutions that enable consolidation of multiple servers onto one machine, thus improving overall efficiency. This
dissertation presents strategies that improve the efficiency and performance of server
I/O using both virtual-machine concurrency and thread concurrency. Contemporary
virtual machine monitors (VMMs) aim to improve server efficiency by enabling consolidation of separate isolated servers onto one physical machine. However, modern
VMMs incur heavy device virtualization penalties, ultimately reducing application
performance by up to a factor of 3. Contemporary parallelized operating systems
aim to improve server performance by exploiting thread parallelism using multiple
processors. However, the concurrency and communication models used to implement that parallelism impose significant performance penalties, severely damaging
the server’s ability to leverage more processors to attain higher performance. This
dissertation examines the architectural sources of these inefficiencies and introduces
new OS- and VMM-level architectures that greatly reduce them.
Acknowledgments
I would like to thank Dr. Scott Rixner and Dr. Alan Cox for their steady technical guidance throughout the course of this research. Also, I would like to thank
Dr. Behnaam Aazhang and Dr. David Johnson for their perspectives regarding and
support of this work. Thanks also to Dr. Vijay Pai, who helped and encouraged
me from the beginning of my graduate school career through its conclusion, even
after he moved on to a new opportunity at Purdue University. Additionally, I want
to acknowledge Jeff Shafer’s significant contributions regarding development and debugging of the CDNA prototype hardware. David Carr helped tremendously with
bringup of and scripting support for the Xen VMM environment. Marcos Huerta
provided the formatting template for this document and graciously helped me modify it to suit my needs. I also want to thank my family and friends, whose constant
support and encouragement made this work possible. Finally, I want to particularly
thank my wife Leighann. Her perspectives on the scientific process and academic
research made this work better, and her enduring love and patience made my life
better. Thank you.
Contents

1 Introduction
  1.1 Server Concurrency Trends
  1.2 Contributions
  1.3 Dissertation Organization

2 Background
  2.1 Contemporary Server and CPU Technology
  2.2 Existing OS Support for Concurrent Network Servers
  2.3 Existing VMM Support for Concurrent Network Servers
    2.3.1 Private I/O
    2.3.2 Software-Shared I/O
    2.3.3 Hardware-shared I/O
    2.3.4 Protection Strategies for Direct-Access Private and Shared Virtualized I/O
  2.4 Hardware Support for Concurrent Server I/O
    2.4.1 Hardware Support for Parallel Receive-side OS Processing
    2.4.2 User-level Network Interfaces
  2.5 Summary

3 Parallelization Strategies for OS Network Stacks
  3.1 Background
  3.2 Parallel Network Stack Architectures
    3.2.1 Message-based Parallelism (MsgP)
    3.2.2 Connection-based Parallelism (ConnP)
  3.3 Methodology
    3.3.1 Evaluation Hardware
    3.3.2 Parallel TCP Benchmark
  3.4 Evaluation using One 10 Gigabit NIC
  3.5 Evaluation using Multiple Gigabit NICs
  3.6 Discussion and Analysis
    3.6.1 Locking Overhead
    3.6.2 Scheduler Overhead
    3.6.3 Cache Behavior

4 Concurrent Direct Network Access
  4.1 Networking in Xen
    4.1.1 Hypervisor and Driver Domain Operation
    4.1.2 Device Driver Operation
    4.1.3 Performance
  4.2 CDNA Architecture
    4.2.1 Multiplexing Network Traffic
    4.2.2 Interrupt Delivery
    4.2.3 DMA Memory Protection
    4.2.4 Discussion
  4.3 CDNA NIC Implementation
  4.4 Evaluation
    4.4.1 Experimental Setup
    4.4.2 Single Guest Performance
    4.4.3 Memory Protection
    4.4.4 Scalability

5 Protection Strategies for Direct I/O in Virtual Machine Monitors
  5.1 Background
  5.2 IOMMU-based Protection
    5.2.1 Single-use Mappings
    5.2.2 Shared Mappings
    5.2.3 Persistent Mappings
  5.3 Software-based Protection
  5.4 Protection Properties
    5.4.1 Inter-Guest Protection
    5.4.2 Intra-Guest Protection
  5.5 Experimental Setup
  5.6 Evaluation
    5.6.1 TCP Stream
    5.6.2 VoIP Server
    5.6.3 Web Server
    5.6.4 Discussion

6 Conclusion
  6.1 Orchestrating OS parallelization to characterize and improve I/O processing
  6.2 Reducing virtualization overhead using a hybrid hardware/software approach
  6.3 Improving performance and efficiency of protection strategies for direct-access I/O
  6.4 Summary
List of Figures

1.1 Uniprocessor frequency and network bandwidth history.
1.2 Network I/O throughput disparity between the modern FreeBSD operating system and link capacity, using either six 1-Gigabit interfaces or one 10-Gigabit interface.
1.3 Network I/O throughput disparity between native Linux and virtualized Linux, using six 1-Gigabit Ethernet interfaces.
2.1 Uniprocessor performance history (data source: Standard Performance Evaluation Corporation).
2.2 The efficiency/parallelism continuum of OS network-stack parallelization strategies.
2.3 A contemporary software-based, shared-I/O virtualization architecture.
3.1 Aggregate transmit throughput for uniprocessor, message-parallel and connection-parallel network stacks using 6 NICs.
3.2 Aggregate receive throughput for uniprocessor, message-parallel and connection-parallel network stacks using 6 NICs.
3.3 The outbound control path in the application thread context.
3.4 Aggregate transmit throughput for the ConnP-L network stack as the number of locks is varied.
3.5 Profile of L2 cache misses per 1 Kilobyte of payload data (transmit test).
3.6 Profile of L2 cache misses per 1 Kilobyte of payload data (receive test).
4.1 Shared networking in the Xen virtual machine environment.
4.2 The CDNA shared networking architecture in Xen.
4.3 Transmit throughput for Xen and CDNA (with CDNA idle time).
4.4 Receive throughput for Xen and CDNA (with CDNA idle time).
List of Tables

2.1 I/O virtualization methods.
3.1 FreeBSD network bandwidth (Mbps) using a single processor and a 10 Gbps network interface.
3.2 Aggregate throughput for uniprocessor, message-parallel and connection-parallel network stacks.
3.3 Percentage of lock acquisitions for global TCP/IP locks that do not succeed immediately when transmitting data.
3.4 Cycles spent managing the scheduler and scheduler synchronization per Kilobyte of payload.
3.5 Percentage of L2 cache misses within the network stack to global data structures.
4.1 Transmit and receive performance for native Linux 2.6.16.29 and paravirtualized Linux 2.6.16.29 as a guest OS within Xen 3.
4.2 Transmit performance for a single guest with 2 NICs using Xen and CDNA.
4.3 Receive performance for a single guest with 2 NICs using Xen and CDNA.
4.4 CDNA 2-NIC transmit performance with and without DMA memory protection.
4.5 CDNA 2-NIC receive performance with and without DMA memory protection.
5.1 TCP Stream profile.
5.2 OpenSER profile.
5.3 Web Server profile using write().
5.4 Web Server profile using zero-copy sendfile().
Chapter 1
Introduction
Internet services continue to incorporate ever more bandwidth-intensive, high-performance applications, including audio and high-quality, feature-length video. Furthermore, network services are proliferating into every aspect of businesses, with even
small-scale organizations leveraging robust storage, database, and voice-over-IP technology to manage resources, facilitate communications, and reduce costs. However,
ongoing processor trends toward chip multiprocessing present new challenges and opportunities for server architectures as those architectures strive to keep pace with
performance and efficiency demands. This dissertation addresses these challenges
with new operating system and virtual machine monitor architectures designed to
provide efficient, high-performance network input/output (I/O) support for coming
generations of servers.
1.1 Server Concurrency Trends
Throughout the vast expansion of Internet technology in the 1990s, processor performance and network server bandwidth both grew at exponential rates. Figure 1.1
shows the progression of uniprocessor frequency (for the Intel IA32 family of processors) and network interface bandwidth since 1982.

[Figure 1.1: log-scale plot of Ethernet bandwidth (Mbps) and processor frequency (MHz) by year, 1980-2005.]
Figure 1.1: Uniprocessor frequency and network bandwidth history.

Though frequency alone is not a
comprehensive measure of performance, the figure does show a qualitative comparison of Ethernet and commodity uniprocessor trends over the past twenty-five years.
The figure shows the exponential growth of both uniprocessor frequency and Ethernet
bandwidth throughout the 1990s and early 2000s. However, the rate of uniprocessor frequency increase shows a marked decline starting in 2003, when physical circuit
limitations, such as increasing per-cycle interconnect delays, started to overwhelm
contemporary CPU architectures. Instead of continuing to rely on CPUs built around ever larger global interconnects, CPU architects turned to multicore designs.
The move to multicore designs has implications for both server performance and
efficiency. In terms of performance, it is important that the server be able to leverage
its multiple processors to keep pace with network bandwidth improvements, just as
past servers have leveraged faster uniprocessors to deliver more server performance.
[Figure 1.2: paired bar charts of (a) transmit and (b) receive TCP throughput (Mb/s) for a uniprocessor OS and a multiprocessor OS (p=4), compared against link capacity, using either six 1-Gigabit NICs or one 10-Gigabit NIC.]
Figure 1.2: Network I/O throughput disparity between the modern FreeBSD operating system and link capacity, using either six 1-Gigabit interfaces or one 10-Gigabit interface.

In terms of efficiency, multicore architectures provide new opportunities to consolidate isolated servers from several different machines onto just one machine. Such
consolidation maximizes utilization of the server’s physical CPU and I/O resources
and can substantially reduce power and cooling costs, according to a study by Intel [22]. Virtual machine monitor (VMM) software provides multiplexing facilities
that enable this kind of consolidation and sharing, but it is important that virtualization overheads be kept low so as to maximize the capacity of a consolidated server
and thus maximize the associated power and cooling savings.
However, modern server software architectures fall well short of meeting both
performance and efficiency demands given modern network link speeds. Figure 1.2
illustrates the gap between the theoretical peak I/O throughput of a modern server
and its achieved throughput. The figure shows TCP network throughput, using either six 1 Gigabit Ethernet network interface cards (NICs) or a single 10 Gigabit
Ethernet NIC. The throughput achieved by uniprocessor and multiprocessor-capable OS configurations is compared to the theoretical aggregate TCP throughput offered by the physical links.

[Figure 1.3: bar chart of transmit and receive TCP throughput (Mb/s) for native Linux versus virtualized Linux.]
Figure 1.3: Network I/O throughput disparity between native Linux and virtualized Linux, using six 1-Gigabit Ethernet interfaces.

The operating system used is FreeBSD, which uses a similar parallelization strategy to that of Linux and achieves similar performance (not
shown). The uniprocessor configurations use just one 2.2 GHz Opteron processor
core, whereas the multiprocessor configurations use all four cores of a system that features two dual-core chips. In all cases, the application's
thread count is matched to the number of processors. The application is a lightweight
microbenchmark that simply sends or receives data, thus isolating and stressing the
operating system’s network stack. As the figure shows, existing approaches for network OS multiprocessing can improve network performance in some cases. However,
the performance improvement can be meager (or nonexistent) and falls well short of
being able to saturate link resources. Furthermore, current multiprocessor OS organizations are poorly suited for managing a single, high-bandwidth NIC, sustaining
less than half of the available link bandwidth in the best case.
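Concretely, the transmit side of such a microbenchmark can be as simple as the following sketch. The sketch is illustrative only, not the benchmark used in this work: the peer address and port are arbitrary assumptions, and the actual experiments run one such sending thread per configured processor.

    /* Illustrative transmit microbenchmark: stream a fixed buffer over one
     * TCP connection as fast as possible so that protocol processing in the
     * kernel, not the application, is the bottleneck.  The peer address and
     * port below are assumptions, not values from this dissertation. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *peer = (argc > 1) ? argv[1] : "192.168.0.2"; /* assumed sink host */
        char buf[64 * 1024];                   /* one payload chunk per write() */
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(buf, 'x', sizeof(buf));
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5001);           /* assumed sink port */
        inet_pton(AF_INET, peer, &addr.sin_addr);

        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        for (;;) {                             /* send until the run is stopped */
            if (write(fd, buf, sizeof(buf)) <= 0)
                break;
        }
        close(fd);
        return 0;
    }

A matching receive-side tool would simply accept connections and read() into a scratch buffer, again leaving the operating system's network stack as the only significant consumer of CPU time.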
Whereas native OS performance is significantly less than ideal, Figure 1.3 shows
that virtualizing an operating system reduces its I/O performance even more. Figure 1.3 compares the TCP network throughput of native, unvirtualized Linux using
six 1 Gigabit NICs to that achieved by a virtualized Linux server using the Xen virtual machine monitor. Linux is used as the benchmark native system here because
the Xen open-source VMM features mature support for Linux but only preliminary
support for FreeBSD; regardless, the performance limitations illustrated by the figure
are inherent to the VMM architecture, not the OS. The system under test uses a
single 2.4 GHz Opteron processor core. For both transmit and receive workloads, virtualization imposes an I/O slowdown of more than 300% versus native-OS execution.
1.2 Contributions
This dissertation contributes to the field at the hardware, virtual machine monitor,
and operating system levels. This work tackles the efficiency and performance issues
within and across each of these levels that cause the performance disparities illustrated
in Figures 1.2 and 1.3. The fundamental approach of this research is to use a combination of hardware and software to architect strategies that minimize inefficiencies
in concurrent network servers. Combined, these strategies form a software/hardware
architecture that will efficiently leverage coming generations of multicore network
servers. This architecture comprises three parts, each of which has its own set of
contributions: strategies for efficient network-stack processing by a parallelized operating system, strategies for efficient network I/O sharing in VMM environments, and
strategies for efficient isolation of direct-access I/O devices in VMM environments.
Defining and exploring the continuum of network stack parallelization.
At the operating system level, an efficient parallelization of the network protocol
processing stack is required so as to maximize system performance and efficiency
on coming generations of chip multiprocessor hardware and to relieve the protocol
processing bottleneck demonstrated in Figure 1.2. Past research efforts have studied this problem in the context of improving protocol-processing performance for
100 Mb/s Ethernet using the SGI Challenge shared-memory multiprocessor. Though
these studies examined some of the tradeoffs for two different strategies for network-stack parallelization, ten years later there exists no consensus among OS architects regarding the unit of concurrency for network-stack processing or the method of synchronization. This dissertation defines and explores a continuum of practical parallelization strategies for modern operating systems. The strategies on this continuum
vary according to their unit of concurrency and their method of synchronization and
have different overhead characteristics, and thus different performance and scalability.
Whereas past studies have used emulated network devices and idealized, experimental
research operating systems, this dissertation examines network-stack parallelization
on a real hardware platform using a modern network operating system, including
all of the device and software overhead. Through understanding this continuum of
parallelization and efficiency, this work finds that designing the operating system to
maximize parallel execution can actually decrease performance on parallel hardware
by ignoring sources of overhead. Further, this study identifies the hardware/software
interface between high-bandwidth devices and operating systems as a performance
bottleneck, but shows that, when that bottleneck is overcome, efficient network-stack parallelization strategies significantly improve performance versus contemporary, inefficient strategies.
Designing a hardware/software architecture for shared I/O virtualization. Contemporary architectures for I/O virtualization enable economical sharing
of physical devices among virtual machines. These architectures multiplex I/O traffic, manage device notification messages, and perform direct memory access (DMA)
memory protection entirely in software. However, this software-based design incurs
heavy penalties versus native OS execution, as depicted in Figure 1.3. In contrast,
contemporary research by others has examined purely hardware-based architectures
that aim to reduce software overhead. This dissertation contributes a new architecture for shared I/O virtualization that permits concurrent, direct access by untrusted
virtual machines using a combination of both hardware and software. This research
includes an examination of a prototype network interface developed using this hybrid hardware/software approach, which proves effective at eliminating most of the
overhead associated with traditional, software-based shared I/O virtualization. Be-
yond the prototype itself, the primarily software-based mechanism for enforcing DMA
memory protection is an entirely new contribution, differing greatly from contemporary hardware-based mechanisms. The prototype device is a standard expansion card
that requires modest additional hardware beyond that found in a commodity network
interface. This low cost and the architecture's compatibility with existing, unmodified commodity hardware make it ideal for commodity network servers.
Developing and exploring alternative strategies for virtualized direct
I/O access. Unlike the shared, direct-access architecture explored in this dissertation, the prevailing industrial solution for high-performance virtualized I/O is to
provide private, direct access to a single device by a single virtual machine. This
obviates the device’s need for multiplexing of traffic among multiple operating systems, but such systems still need reliable DMA memory protection mechanisms to
prevent an untrusted virtual machine from potentially misusing an I/O device to access another virtual machine’s memory. This dissertation contributes to the field by
developing and examining new hardware- and software-based strategies for managing
DMA memory protection and by comparing them to the state-of-the-art strategy. Contemporary high-availability server architectures use a hardware I/O memory management unit (IOMMU) to enforce the memory access rules established by the memory
management unit, and commodity CPU manufacturers are aggressively pursuing inclusion of IOMMU hardware in next-generation processors. Though the aim of these
architectures is to provide near-native virtualized I/O performance, the strategy for
managing the system’s IOMMU hardware can greatly impact performance and efficiency. This research contributes two novel strategies for achieving direct I/O access
using an IOMMU that, unlike the state-of-the-art strategy, reuse IOMMU entries to
reduce the total overhead of ensuring I/O safety. Further, this research finds that
the software-based DMA memory protection strategy introduced in this dissertation
performs comparably to the most aggressive hardware-based strategy. Contrary to
much of the industrial enthusiasm for IOMMUs in coming commodity servers, this
dissertation concludes that an IOMMU is not necessarily required to achieve safe,
high-performance virtualized I/O.
1.3 Dissertation Organization
This dissertation presents these contributions in three studies and is organized as
follows. Chapter 2 first provides some background regarding the motivation for this
work and the state-of-the-art hardware and software architectures of contemporary
network servers. Chapter 3 then presents a comparison and analysis of parallelization
strategies for modern thread-parallel operating systems. Chapter 4 introduces the
concurrent direct network access (CDNA) architecture for delivering efficient shared
access to virtualized operating systems in VMM environments. Chapter 5 follows
up with a comparison and analysis of hardware- and software-based strategies for
providing isolation among untrusted virtualized operating systems that have direct
access to I/O hardware, and Chapter 6 concludes.
Chapter 2
Background
Over the past forty years, there have been extensive industrial and academic
efforts toward improving the performance and efficiency of servers. These efforts
have touched on both multiprocessing concurrency and virtual machine concurrency.
Furthermore, there have been many efforts to coordinate the architecture of I/O
hardware (such as network interfaces) with software (both applications and operating
systems) to improve the performance and efficiency of the overall server. This chapter
discusses the background of contemporary server technology and its limitations and
then explores the prior research efforts that are related to the themes and strategies
of this dissertation.
2.1 Contemporary Server and CPU Technology
The Internet expansion of the 1990s was sustained with exponential growth in
processor performance and network server bandwidth. Figure 1.1 shows this exponential progression of uniprocessor frequency (for the Intel IA32 family of processors)
and network interface bandwidth since 1982. These steady, exponential performance
improvements came from technology improvements, such as feature-size reduction
that enabled higher-frequency operation, and from architectural innovations, such
as superscalar instruction execution in CPUs and packet checksum offloading for network interfaces. However, the dominant server platform relied on a fairly constant
architecture: a single processor using a single network interface, both of which were
exponentially improving in performance between generations. The commoditization
and rapid improvement of this architecture has yielded superior cost efficiencies compared to more specialized designs, ultimately motivating companies such as Google
to standardize their server platforms on this commodity architecture [8].
The consistency of the commodity hardware architecture ensured that legacy software architectures could readily leverage successive generations of higher-performance
hardware, ultimately producing higher-performance, higher-capacity servers. Efficiency improvements in system software, such as zero-copy I/O and efficient event-driven execution models, provided additional server performance on the same architectures. However, these software innovations did not rely on architectural changes
and instead improved performance using the existing contemporary architecture.
Figure 2.1 confirms this processor trend in terms of performance and shows that
it is not specific to the Intel IA32 family of processors. This figure plots the highest
reported SPEC CPU integer benchmark scores, selected across all families of processors, for each year since 1995. The left-hand portion of the figure shows SPEC95
scores, and the right-hand portion shows SPEC2000 scores.

[Figure 2.1: two panels plotting the highest reported SPEC95 (left) and SPEC2000 (right) integer benchmark scores by year, with a long-term performance trend line.]
Figure 2.1: Uniprocessor performance history (data source: Standard Performance Evaluation Corporation).

In 2000, the same computer system was evaluated using both SPEC95 and SPEC2000, so the “2000” point
in both graphs corresponds to the same computer. The y axes of each graph are
scaled to each other such that a line with a certain slope in the SPEC95 graph will
have the same slope in the SPEC2000 graph. Though the two benchmark suites are
not identical, they are designed to measure performance on contemporary hardware.
From 1995 to 2003, the average rate of benchmark improvement was 43% per year
(shown as the “Long-term Performance Trend” line in the figure). Though processor
frequencies remained mostly unchanged in the period of 2003-2007 (as shown for the
Intel family of processors in Figure 1.1), processor architects were still able to produce performance improvements for that time period. However, the rate of SPEC
benchmark improvement dropped significantly, to 19% per year.
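As a rough illustration, using only the two growth rates just quoted, the implied time for integer performance to double stretches from about two years to about four:

    t_{double} = \ln 2 / \ln(1 + r)
    r = 0.43:  t_{double} \approx 0.693 / 0.358 \approx 1.9 years
    r = 0.19:  t_{double} \approx 0.693 / 0.174 \approx 4.0 years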
The slowdown in frequency and performance growth stems primarily from transistor- and circuit-level physical limitations. One of the most significant of these is parasitic
wire capacitance among transistors inside CPUs. The progression toward smaller
transistors has enabled larger-scale integration, but it has also led to increasing relative delay (in cycles) for global interconnect from generation to generation [15]. The
poor scalability of global interconnection networks (and thus the control circuitry of
modern superscalar processors) is contributing to the slowdown in uniprocessor performance improvements from year to year. This scalability problem is also driving
CPU manufacturers toward multicore CPU designs. It is this migration and the subsequent poor performance of contemporary server operating systems and VMMs that
serves in part as motivation for this work.
Though commodity chip multiprocessor technology is new, software and hardware architects have been developing OS and VMM support for larger-scale, special-purpose concurrent servers over the past four decades. Contemporary OS and VMM
solutions are derived from these prior endeavors, and current I/O architectures bear
many similarities to their ancestors. However, recent advances in server I/O (such
as the adoption of 10 Gigabit Ethernet) have placed new stresses on these architectures, exposing inefficiencies and bottlenecks that prevent modern systems from fully
utilizing their I/O capabilities, as depicted in Figures 1.2 and 1.3 in the Introduction.
The poor I/O scalability and performance of modern concurrent servers are attributable to both software and hardware inefficiencies. Many existing operating
systems are designed to maximize opportunities for parallelism. However, designs
that maximize parallelism incur higher synchronization and thread-scheduling overhead, ultimately reducing performance. Furthermore, both operating systems and
VMMs are designed to interact with the traditional serialized hardware interface
exported by I/O devices. Operating system performance is bottlenecked by this interface when the multithreaded higher levels of the OS must wait for single-threaded
device-management operations to complete. This serialized interface has more far-reaching effects on VMM design, and consequently VMMs experience even heavier
efficiency penalties relative to native-OS performance. These penalties stem primarily from the separation of device management (in one privileged OS instance) from
server computation (in traditional, untrusted OS instances) and the software virtualization layers needed between them. Combined, all of these inefficiencies significantly
degrade OS and VMM performance and will prevent future servers from scaling with
contemporary I/O capabilities.
2.2 Existing OS Support for Concurrent Network Servers
Given the continuing trend toward commodity chip multiprocessor hardware, the
trend away from vast improvements in uniprocessor performance, and the ongoing
improvements in Ethernet link throughput, operating system architects must consider efficient methods to close the I/O performance gap. The organization of the
operating system’s network stack is particularly important. An operating system’s
network stack implements protocol processing (typically TCP/IP or UDP). TCP processing is the only operation being performed in the microbenchmark examined in
Figure 1.2, but there remains a clear performance gap imposed by the overhead of protocol processing. To close that gap, multiprocessor operating systems must efficiently
orchestrate concurrent protocol processing.
There exist two principal strategies for parallelizing the operating system’s network stack, both of which derive from research in the mid-1990s that was conducted
using large-scale SGI Challenge shared-memory multiprocessors. These strategies
differ according to their unit of concurrency. Though current OS implementations
are derived from one or the other of these two strategies, no consensus exists among
developers regarding the most appropriate organization for emerging processing and
I/O hardware.
Nahum et al. first examined a parallelization scheme that attempted to treat
messages (usually packets) as the fundamental unit of concurrency [41]. In its most
extreme implementation, this message-parallel (or MsgP) strategy attempts to process each packet in the system in parallel using separate processors. Because a server
has a constant stream of packets, the message-parallel approach maximizes the theoretically achievable concurrency. Though this message-oriented organization ideally
scales limitlessly given the abundance of in-flight packets available in a typical server,
Nahum et al. found that repeated synchronization for connection state shared among
packets belonging to the same connection (such as reorder queues) severely limited
scalability [41]. However, that study found that the scalability and performance of
the message-parallel organization were highly dependent on the synchronization characteristics of the system.
The connection-parallel (or ConnP) strategy treats connections as the fundamental unit of concurrency. When the operating system receives a message for transmission or has received a message from the network, a ConnP organization first associates
the message with its connection. The OS then uses a synchronization mechanism to
move the packet to a thread responsible for processing its connection. Hence, ConnP
organizations avoid the MsgP inefficiencies associated with repeated synchronization
to shared connection state. However, ConnP organizations also limit concurrency
by enlarging the granularity of parallelism from packets to connections. In its most
extreme form, a ConnP organization has as many threads as there are connections.
After Nahum et al. examined the MsgP organization, Yates et al. conducted a
similar study using the SGI Challenge architecture. In this study, Yates et al. examined connection-oriented parallelization strategies that treat connections as the
fundamental unit of concurrency [62]. This study persistently mapped network stack
operations for a specific connection to a specific worker thread in the OS. This strategy eliminates any shared connection state among threads and thus eliminates the
synchronization overhead of message-oriented parallelizations. Consequently, this organization yielded excellent scalability as threads were added.
[Figure 2.2: a one-dimensional continuum running from most concurrent to most efficient, with MsgP and in-order MsgP at the concurrency end, ConnP and grouped ConnP in the middle, and the uniprocessor organization at the efficiency end.]
Figure 2.2: The efficiency/parallelism continuum of OS network-stack parallelization
strategies.
Both of these prior works ran on the mid-90s era SGI Challenge multiprocessor, utilized the user-space x-kernel, and used a simulated network device. However,
modern servers feature processors with very different synchronization costs relative
to processing, utilize operating systems that bear little resemblance to the x-kernel,
and incur the real-world overhead associated with interrupts and device management.
These works ultimately concluded that the synchronization and packet-ordering
overhead associated with fine-grained packet-level processing could severely damage
performance, and that a connection-oriented network stack yielded better efficiency
and performance. However, ten years later there is little if any consensus regarding
the “correct” organization for modern network servers. FreeBSD and Linux both
utilize a variant of a message-parallel network stack organization, whereas Solaris 10
and DragonflyBSD both feature connection-parallel organizations.
Hence, though prior research suggested three points of consideration for network stack parallelization architectures (serial, ConnP, and MsgP), modern practical
variants represent several additional points of interest that make different efficiency
and performance tradeoffs. Consequently, there exist several additional points along
a continuum of both concurrency and efficiency. Figure 2.2 depicts this concurrency/efficiency continuum. Whereas uniprocessor organizations incur no synchronization or thread-scheduling overhead and hence are the most efficient, they are also
the least concurrent. Conversely, purely MsgP organizations exploit the highest level
of concurrency but experience overhead that reduces their efficiency. ConnP organizations attempt to trade some of the MsgP concurrency for increased efficiency, and
better overall performance.
Real-world MsgP and ConnP implementations also compromise concurrency for
efficiency, though some of these compromises are motivated by pragmatism. Whereas
theoretical MsgP organizations attempt limitless message-level concurrency, real-world implementations such as Linux and FreeBSD process messages coming into
the network stack from any given source in-order. In effect, packets from a given
hardware interface or from a given application thread will be processed in-order.
However, these in-order MsgP stacks utilize fine-grained synchronization to enable
message parallelism, particularly between received and transmitted packets. Similarly, realizable ConnP organizations do not have limitless connection parallelism;
instead, ConnP organizations such as Solaris 10 and DragonflyBSD associate each
connection with a group, and then process that group in parallel.
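To make the distinction concrete, the following sketch shows the grouped ConnP dispatch step and contrasts it, in a closing comment, with the in-order MsgP discipline. The sketch is illustrative only; the structures, hash, and queue primitive are hypothetical and are not drawn from Linux, FreeBSD, Solaris, or DragonflyBSD.

    /* Illustrative grouped connection-parallel (ConnP) dispatch: every packet
     * of a connection hashes to the same group, and each group is serviced by
     * one protocol thread, so no per-connection locks are needed in the stack. */
    #include <stdint.h>

    #define NGROUPS 8                         /* e.g., one protocol thread per core */

    struct pkt { uint32_t saddr, daddr; uint16_t sport, dport; /* headers, data... */ };

    void enqueue_for_group(unsigned group, struct pkt *p);   /* hypothetical queue */

    static unsigned conn_group(const struct pkt *p)
    {
        /* Hash the connection 4-tuple; all packets of one connection map to
         * one group and therefore to one thread. */
        uint32_t h = p->saddr ^ p->daddr ^ (((uint32_t)p->sport << 16) | p->dport);
        h ^= h >> 16;
        return h % NGROUPS;
    }

    void connp_dispatch(struct pkt *p)        /* called from the driver path */
    {
        enqueue_for_group(conn_group(p), p);  /* hand off; no protocol locks taken here */
    }

    /* An in-order MsgP stack would instead run the protocol code directly on the
     * current CPU, relying on fine-grained locks around shared connection state:
     *
     *     lock(&conn->lock);  tcp_input(conn, p);  unlock(&conn->lock);
     */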
As Figure 1.2 shows, a massive performance gap exists between achievable throughput of modern in-order MsgP organizations and the link capacity of modern 10 Gigabit interfaces. Furthermore, the figure shows that current multiprocessor operating
systems are not effective at reducing this gap, but that parallel network interfaces
achieve higher throughput than a single interface. This performance gap and the lack
of existing solutions that close it motivate the research in this dissertation.
Thus, prior research has established that there are at least two methods of parallelizing an operating system network stack, each of which differs according to its unit
of concurrency. However, prior evaluations have been conducted in the context of an
experimental user-space operating system that uses a simulated network device, and were conducted on hardware with very different synchronization overhead characteristics from modern hardware. So while previous research established that there
were different approaches to parallelizing an OS network stack, that research did not
fully evaluate those approaches on practical hardware in a practical software environment. In Chapter 3, this dissertation reaches beyond that prior work by examining
the way the network stack’s unit of concurrency affects efficiency and performance
in a real operating system with a real network device, including the associated device and scheduling overhead. This research also breaks ground by examining and
comparing how the means for synchronization (locks versus threads) within a given
organization can affect performance and efficiency, thus rounding out an examination of an entire continuum of network stack parallelization strategies, whereas prior
research examined only some of the points on that continuum.
                                        I/O Access
    Software Manager     Private                             Shared
    Hypervisor           System/360 and /370 Disks           System/360 and /370 Networking [19, 36],
                         and Terminals [37, 43]              Xen 1.0 [7], VMware ESX [56]
    Operating System     POWER4 LPAR [24]                    POWER5 LPAR [4], Xen 2.0+ [16],
                                                             VMware Workstation [53]
Table 2.1: I/O virtualization methods.
2.3 Existing VMM Support for Concurrent Network Servers
Other than OS multiprocessing, virtualization is another method of achieving
server parallelism. Just as parallelized operating systems can exploit connection-level
parallelism, parallel virtual machines exploit a coarser-grained connection parallelism
by managing separate connections inside separate virtual machines. There is a significant amount of existing research in the field with respect to virtualization techniques, much of which predates modern operating system research. With respect
to server I/O, past virtualized architectures have exploited private device architectures (in which a device is assigned to just one virtual machine) and shared device
architectures (in which I/O resources for one device are shared among many virtual
machines). These private and shared I/O architectures have been realized using either hardware-based or software-based techniques. All of these architectures require
that an I/O device cannot be used by a VM to gain access to another VM’s resources,
and prior research has explored these isolation issues as well.
The first widely available virtualized system was IBM's System/370, which was
first deployed almost 40 years ago [43]. Though demand for server consolidation has
inspired new research in virtualization technology for commodity systems, contemporary I/O virtualization architectures bear a strong resemblance to IBM’s original
concepts, particularly with respect to network I/O virtualization. Over time, this
I/O virtualization architecture has led to significant software inefficiencies, ultimately
manifesting themselves in large performance degradations such as those depicted in
Figure 1.3.
There are two approaches to I/O virtualization. Private I/O virtualization architectures statically partition a machine’s physical devices, such as disk and network
controllers, among the system’s virtual machines. In a Private I/O environment, only
one virtual machine has access to a particular device. Shared I/O virtualization architectures enable multiple virtualized operating systems to access a particular device.
Existing Shared I/O systems use software interfaces to multiplex I/O requests from
different virtual machines onto a single device. Private I/O architectures have the
benefit of near-native performance, but they require each virtual machine in a system
to have its own private set of network, disk, and terminal devices. Because this costly
requirement is impractical on both commodity servers and large-scale servers capable
of running hundreds of virtual machines, current-generation VMMs employ Shared
I/O architectures.
2.3.1 Private I/O
IBM’s System/360 was the first widely available virtualization solution [43]. The
System/370 was an extended version of this same architecture and featured hardware-assisted enhancements for processor and memory virtualization, but supported
I/O using the same mechanisms [17, 37, 49]. The first I/O architectures developed for
System/360 and System/370 did not permit shared access to physical I/O resources.
Instead, a particular VM instance had private access to a specific resource, such as a
terminal. To permit many users to access a single costly disk, the System/360 and
System/370 architecture extended the idea of private device access by sub-dividing
contiguous regions on a disk into logically separate, virtual “mini-disks” [43]. Though
multiple virtual machines could access the same physical disk via the mini-disk abstraction, these VMs did not concurrently share access to the same mini-disk region,
and hence mini-disks still represented logical private I/O access. System/360 and
System/370 required that I/O operations (triggered by the start-io instruction) be
trapped and interpreted by the system hypervisor. The hypervisor ensured that a
given virtual machine had permission to access a specific device, and that the given
VM owned the physical memory locations being read from or written to by the pending I/O command. The hypervisor would then actually restart the I/O operation,
returning control to the virtual machine only after the operation completed. Hence,
the System/360 and System/370 hypervisor managed I/O resources.
More recent virtualization systems have also relied on private device access, such as
IBM’s first release of the logical partitioning (LPAR) architecture featuring POWER4
processors [24]. The POWER4 architecture isolated devices at the PCI-slot level and
assigned them to a particular VM instance for management. Each VM required a
physically distinct disk controller for disk access and a physically distinct network
interface for network access. Unlike the System/360 and System/370 architecture,
the POWER4’s I/O devices accessed host memory asynchronously via DMA using
OS-provided DMA descriptors. Since a buggy or malicious guest OS could provide
DMA descriptors pointing to memory locations for which the given VM has no access
permissions, the POWER4 employs an IOMMU [9]. The IOMMU validates all PCI
operations per-slot using a set of hypervisor-maintained permissions. Hence, the
POWER4’s hypervisor can set up the IOMMU at device-initialization time, but I/O
resources can be directly managed at runtime by the guest operating systems.
2.3.2 Software-Shared I/O
Requiring private access to I/O devices imposes significant hardware costs and
scalability limitations, since each VM must have its own private hardware and device
slots. The first shared-I/O virtualization solutions were part of the development of
networking support for System/360 and System/370 between physically separated
virtual machines [19, 36]. This networking architecture supported shared access to
network I/O resources by use of a virtualized spool-file interface that was serviced
by a special-purpose virtual machine, or I/O domain, dedicated to networking. The
various general-purpose VMs in a machine could read from or write to virtualized
spool files. The system hypervisor would interpret these reads and writes based on
whether or not the spool locations were on a physically remote machine; if the data
was on a remote machine, the hypervisor would invoke the special-purpose networking
VM. This networking VM would then use its physical network interfaces to connect to
a remote machine. The remote machine used this same virtualized spool architecture
and dedicated networking VM to service requests. The networking I/O domain was
trusted to not violate memory protection rules, so on System/370 architectures that
supported “Preferred-Machine” execution, the I/O domain could be granted direct
access to the network interfaces and would not require the hypervisor to manage
network I/O [17].
This software architecture for sharing devices through virtualized interfaces is
logically identical to most virtualization solutions today. Xen, VMware, and the
POWER5 virtualization architectures all share access to devices through virtualized
software interfaces and rely on a dedicated software entity to actually perform physical
device management [4, 7, 53]. Subsequent releases of Xen and VMware have moved
device management either into the hypervisor and out of an I/O domain (as is the
case with VMware ESX [56]) or into an I/O domain and out of the hypervisor (as is
the case with Xen versions 2.0 and higher [16]). Furthermore, different architectures
use different interfaces for implementing shared I/O access. For example, the Denali
isolation kernel provides a high-level interface that operates on packets [58]. The Xen
VMM provides an interface that mimics that of a real network interface card but
abstracts away many of the register-level management details [16]. VMware can support either an emulated register-level interface that implements the precise semantics
of a hardware NIC, or it can support a higher-level interface similar to Xen’s [53, 56].
[Figure 2.3: block diagram of a guest domain's frontend driver exchanging data/control and virtual interrupts, through the virtual machine monitor, with the driver domain's backend driver; the driver domain's multiplexing layer and native device driver manage the physical I/O device(s), which interrupt and access CPU/memory directly.]
Figure 2.3: A contemporary software-based, shared-I/O virtualization architecture.
Regardless of the interface, however, the overall organization is fundamentally quite
similar.
Figure 2.3 depicts the organization of a typical modern Shared I/O architecture,
which is heavily influenced by IBM’s original software-based Shared I/O architecture
for sharing network resources. These modern architectures grant direct hardware
access to only a single virtual machine instance, referred to as the “driver domain” in
the figure. Each consolidated server instance exists in an operating system running
inside an unprivileged “guest domain”. The driver domain is a privileged operating
system instance (for example, running Linux) whose sole responsibility is to manage
physical hardware in the machine and present a virtualized software interface to the
guest domains. This interface is exported by the driver domain’s backend driver,
as depicted in the figure. Guest domains access this virtualized interface and issue
I/O requests using their own frontend drivers. Upon reception of I/O requests in
the backend driver, the driver domain uses a separate multiplexing module inside the
operating system (such as a software Ethernet bridge in the case of network traffic)
to map requests to physical device drivers. The driver domain then uses native device
drivers to access hardware, thus carrying out the various I/O operations requested
by guest domains.
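The request flow through Figure 2.3 can be summarized in a brief sketch. The ring, page-sharing, and event primitives named below are hypothetical stand-ins rather than the actual Xen or VMware interfaces; they are only meant to show where the frontend, the backend, the multiplexer, and the native driver sit on the transmit path.

    /* Illustrative transmit path through a split frontend/backend driver pair.
     * All types and helper functions here are hypothetical. */
    struct io_req { unsigned long shared_page; unsigned len; };
    struct ring;                                   /* shared request ring       */
    extern struct ring tx_ring;
    enum { BACKEND_EVENT, GUEST_EVENT };           /* virtual interrupt numbers */

    unsigned long share_page_with_backend(void *buf);
    void  ring_push(struct ring *r, const struct io_req *req);
    int   ring_pop(struct ring *r, struct io_req *req);
    void  send_virtual_interrupt(int event);
    void *map_shared_page(unsigned long page);
    void  bridge_select_device(void *frame);       /* software multiplexer      */
    void  native_nic_transmit(void *frame, unsigned len);

    /* Guest domain: the frontend never touches hardware.  It describes the
     * packet on a ring shared with the driver domain and raises an event.     */
    void frontend_transmit(void *pkt, unsigned len)
    {
        struct io_req r = { share_page_with_backend(pkt), len };
        ring_push(&tx_ring, &r);
        send_virtual_interrupt(BACKEND_EVENT);
    }

    /* Driver domain: the backend maps the guest's buffer, lets the software
     * bridge pick the physical device, and drives it via the native driver.   */
    void backend_service(void)
    {
        struct io_req r;
        while (ring_pop(&tx_ring, &r)) {
            void *frame = map_shared_page(r.shared_page);
            bridge_select_device(frame);            /* multiplexing step        */
            native_nic_transmit(frame, r.len);      /* native device driver     */
            send_virtual_interrupt(GUEST_EVENT);    /* completion notification  */
        }
    }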
This I/O virtualization architecture addresses the many requirements for implementing Shared I/O in a virtualized environment featuring untrusted virtualized
servers. First, the architecture provides a method to multiplex I/O requests from
various guest operating systems onto a single commodity device, which exports just
one management interface to software. This software-only architecture is practical
insofar as it supports a large class of existing commodity devices. Second, the architecture provides a centralized, trusted interface with which to safely translate virtualized
I/O requests originating from an untrusted guest into trusted requests operating on
physical hardware and memory. Third, this architecture provides inter-VM message
notification using a virtualized interrupt system. This messaging system is implemented by the VMM and is used by guest and driver domains to notify each other of
pending requests and event completion.
However, forcing all I/O operations to be forwarded through the driver domain
incurs significant overhead that ultimately reduces performance. The magnitude of
the performance loss for network I/O under the Xen VMM is depicted in Figure 1.3,
and Sugerman et al. have reported similar results using the VMware VMM [53].
These performance losses are attributable to the inefficiency of moving I/O requests
into and out of the driver domain and multiplexing those requests once inside the
driver domain.
2.3.3 Hardware-shared I/O
Direct I/O access (rather than indirect access through management software)
eliminates the overhead of moving all I/O traffic through a software management
entity for multiplexing and memory protection purposes. In addition to supporting
private management of I/O devices by just one software entity at a time (either the
OS or hypervisor), the System/360 and System/370 fully implemented the direct access storage device (DASD) architecture. The DASD architecture enabled concurrent
programs, operating systems, and the hypervisor to access disks directly and simultaneously [12]. DASD-capable devices had several distinct, separately addressable
channels. Software subroutines called channel programs performed programmed I/O
on a channel to carry out a disk request. Disk-access commands in channel programs
executed synchronously. A significant benefit of multi-channel DASD hardware was
that it permitted one channel program to access the disk while another performed data
comparisons (in local memory) to determine if further record accesses were required.
This hardware support for concurrency significantly improved I/O throughput.
On a virtualized system, the hypervisor would trap upon execution of the privileged start-io instruction that was meant to begin a channel program. The hypervisor would then inspect all of the addresses to be used by the device in the pending channel program, possibly substituting machine-physical addresses for machine-virtual addresses. The hypervisor would also verify the ownership of each address on
the disk to be accessed. After ensuring the validity of each address in the programmed-I/O channel subroutine, the hypervisor would execute the modified subroutine [17].
This trap-modify-execute interpretive execution model enabled the hypervisor to
check and ensure that no virtual machine could read or write from another VM’s
physical memory and that no virtual machine could access another VM’s disk area.
However, the synchronous nature of the interface afforded the hypervisor simplicity
with respect to memory protection enforcement; it was sufficient to check the current
permission state in the hypervisor and base I/O operations on that state. However,
modern operating systems operate devices asynchronously to achieve concurrency.
Devices use DMA to access host memory at an indeterminate time after software
makes an I/O request. Rather than simply querying memory ownership state at
the instant software issues an I/O request, a modern virtualization solution that sup-
ports concurrent access by separate virtual machines to a single physical device would
require tracking of memory ownership state over time.
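A schematic of the trap-modify-execute model described above might look like the following sketch. The structure and helper names are hypothetical, and real System/370 channel command words are more involved; the sketch only captures the validate-then-execute ordering.

    /* Illustrative handling of a trapped start-io: the hypervisor walks the
     * guest's channel program, translates and checks every address, and only
     * then runs the modified program synchronously.  All names are hypothetical. */
    struct vm;                                          /* a guest virtual machine */
    struct ccw { unsigned op; unsigned long addr; unsigned len; };  /* channel command */

    unsigned long guest_to_machine(struct vm *g, unsigned long gaddr);  /* 0 on failure */
    int  vm_owns_range(struct vm *g, unsigned long maddr, unsigned len);
    void run_channel_program(struct ccw *prog, int n);  /* returns when the I/O is done */
    void deliver_program_check(struct vm *g);

    void handle_trapped_start_io(struct vm *guest, struct ccw *prog, int n)
    {
        for (int i = 0; i < n; i++) {
            unsigned long maddr = guest_to_machine(guest, prog[i].addr);
            if (maddr == 0 || !vm_owns_range(guest, maddr, prog[i].len)) {
                deliver_program_check(guest);       /* reject the whole program */
                return;
            }
            prog[i].addr = maddr;                   /* substitute machine-physical address */
        }
        run_channel_program(prog, n);               /* synchronous: guest resumes afterward */
    }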
2.3.4 Protection Strategies for Direct-Access Private and Shared Virtualized I/O
Providing untrusted virtual machines with direct access to I/O resources (as in
Private I/O architectures or Hardware-Shared I/O architectures) can substantially
improve performance by avoiding software overheads associated with indirect access
(as in Software-Shared I/O architectures). However, VMs with direct I/O access
could maliciously or accidentally use a commodity I/O device to access another VM’s
memory via the device’s direct memory access (DMA) facilities. Furthermore, a fault
by the device could generate an invalid request to an unrequested region of memory,
possibly corrupting memory.
One approach to providing isolation among operating systems that have direct
I/O access is to leverage a hardware I/O memory management unit (IOMMU). The
IOMMU translates all DMA requests by a device according to the IOMMU’s page
table, which is managed by the VMM. Before making a DMA request, an untrusted
VM must first request that the VMM install a valid mapping in the IOMMU, so that
later the device’s transaction will proceed correctly with a current, valid translation.
Hence, the VMM can effectively use the IOMMU to enforce system-wide rules for
controlling what memory an I/O device (under the direction of an untrusted VM) may
access. By requesting immediate destruction of IOMMU translations, an untrusted
VM can furthermore protect itself against later, errant requests by a faulty I/O device.
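A minimal sketch of that request flow, using hypothetical hypercall and driver names (real interfaces differ and the competing management strategies are examined in Chapter 5), is shown below; the key point is that the mapping is installed before the device is given the buffer and destroyed as soon as the transfer completes.

    /* Illustrative single-use IOMMU protection around one receive buffer.
     * The VMM and NIC calls are hypothetical placeholders. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t dma_addr_t;

    dma_addr_t vmm_iommu_map(void *guest_buf, size_t len);      /* VMM verifies ownership,
                                                                   installs translation    */
    void       vmm_iommu_unmap(dma_addr_t io_addr, size_t len); /* destroys translation    */
    void       nic_post_rx_buffer(dma_addr_t io_addr, size_t len);
    void       nic_wait_rx_complete(dma_addr_t io_addr);

    void receive_one_frame(void *buf, size_t len)
    {
        /* 1. The untrusted guest asks the VMM for an IOMMU mapping covering buf. */
        dma_addr_t io_addr = vmm_iommu_map(buf, len);

        /* 2. Only the I/O-side address is handed to the device; its later DMA
         *    is translated and checked against the IOMMU page table.            */
        nic_post_rx_buffer(io_addr, len);
        nic_wait_rx_complete(io_addr);

        /* 3. Tearing the mapping down immediately means a late or errant DMA to
         *    this address can no longer reach the guest's memory.               */
        vmm_iommu_unmap(io_addr, len);
    }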
Contemporary commodity virtualization solutions run on standard x86 hardware,
which typically lacks an IOMMU. Hence, these solutions forbid direct I/O access
and instead use software to implement both protection and sharing of I/O resources
among untrusted guest operating systems. Confining direct I/O accesses only within
the trusted VMM ensures that all DMA descriptors used by hardware have been
constructed by trusted software. Though commodity VMMs confine direct I/O within
privileged software, they provide indirect, shared access to their unprivileged VMs
using a variety of different software interfaces.
IBM’s high-availability virtualization platforms feature IOMMUs and can support
private direct I/O by untrusted guest operating systems, but they do not support
shared direct I/O. The POWER4 platform supports logical partitioning of hardware
resources among guest operating systems but does not permit concurrent sharing of
resources [24]. The POWER5 platform adds support for concurrent sharing using
software, effectively sacrificing direct I/O access to gain sharing [4]. This sharing
mechanism works similarly to commodity solutions, effectively confining direct I/O
access within what IBM refers to as a “Virtual I/O Server”. Unlike commodity
VMMs, however, this software-based interface is used solely to gain flexibility, not
safety. When a device is privately assigned to a single untrusted guest OS, the
POWER5 platform can still use its IOMMU to support safe, direct I/O access.
The high overhead of software-based shared I/O virtualization motivated recent
research toward hardware-based techniques that support simultaneous, direct-access
network I/O by untrusted guest operating systems. These efforts each take a different approach to implementing isolation and protection. Liu et al. developed an Infiniband-based prototype that supports direct access by applications running within untrusted
virtualized guest operating systems [35]. This work adopted the Infiniband model
of registration-based direct I/O memory protection, in which trusted software (the
VMM) must validate and register the application’s memory buffers before those
buffers can be used for network I/O. Registration is similar to programming an
IOMMU but has different overhead characteristics, because registrations require interaction with the device rather than modification of IOMMU page table entries. Unlike
an IOMMU, registration alone cannot provide any protection against malfunctioning
by the device, since the protection mechanism is partially enforced within the I/O
device.
Raj and Schwan also developed an Ethernet-based prototype device that supports shared, direct I/O access by untrusted guests [45].
Because of hardware-
implementation constraints, their prototype has limited addressability of main memory and thus requires all network data to be copied through VMM-managed bounce buffers. This strategy permits the VMM to validate each buffer but does not provide
any protection against faulty accesses by the device within its addressable memory
range.
AMD and Intel have recently proposed the addition of IOMMUs to their upcoming
architectures [3, 23]. Though they will be new to commodity architectures, IOMMUs
are established components in high-availability server architectures [9]. Ben-Yehuda
et al. recently explored the TCP-stream network performance of IBM's state-of-the-art IOMMU-based architectures using both non-virtualized, “bare-metal” Linux and
paravirtualized Linux running under Xen [10]. They reported that the state-of-the-art IOMMU-management strategy can incur significant overhead. They hypothesized
that modifications to the single-use IOMMU-management strategy could avoid such
penalties.
The concurrent direct network access (CDNA) architecture described in Chapter 4
of this dissertation is an Ethernet-based prototype that supports concurrent, direct
network access by untrusted guest operating systems. Unlike the Ethernet-based
prototype developed by Raj and Schwan, the CDNA prototype requires neither extra
copying nor bounce buffers; instead, the CDNA architecture uses a novel software-based memory protection mechanism. Like registration, this software-based strategy
offers no protection against faulty device behavior. The CDNA architecture does not
fundamentally require software-based DMA memory protection. Rather, CDNA can
be used with an IOMMU to implement DMA memory protection. This dissertation
explores such an approach to DMA memory protection further in Chapter 5. Like all
direct, shared-I/O architectures, CDNA fundamentally requires that the device be
able to access several guests’ memory simultaneously. Consequently, the device could
still use the wrong VM’s data for a particular I/O transaction, and hence it is not
possible to guard against faulty behavior by the device even when using an IOMMU.
Overall, there has been a large body of research regarding support for I/O
virtualization, some of which dates back forty years. Given the continuing demand for
server consolidation, researchers continue to develop architectures for private I/O, software-shared I/O, and hardware-shared I/O, as well as protection strategies for allowing direct access to a particular I/O device by an untrusted virtual machine.
This dissertation explores a novel approach that combines several of these techniques
to create a new, hybrid architecture. This research uses a combination of software and
hardware to facilitate shared I/O with much greater efficiency and performance than
past approaches. Further, this research explores new software techniques for managing
hardware designed to enforce I/O memory protection policies, also achieving higher
efficiency and performance.
2.4
Hardware Support for Concurrent Server I/O
In addition to software support for thread and VM concurrency, there has been significant research regarding concurrent-access I/O devices. These concurrent-access I/O devices alone do not provide an architectural solution to the performance
and efficiency challenges with respect to server concurrency. However, they can be
used in concurrency-aware architectures to improve the efficiency of the software.
2.4.1
Hardware Support for Parallel Receive-side OS Processing
Proposals for parallel receive queues on NICs (such as receive-side scaling (RSS) [40])
are a first step toward providing explicitly concurrent access to I/O devices by multiple threads. Such architectures maintain separate queues for received packets that
can be processed simultaneously by the operating system. The NIC classifies packets
into a specific queue according to a hashing function that is usually based on the IP
address and port information in each IP packet header. Because this IP address and
port information is unique per connection in traditional protocols (such as TCP and
UDP), the NIC can distribute incoming packets into specific queues according to connection. This distribution ensures that packets for the same connection are not placed
in different queues and thus later processed out-of-order by the operating system.
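The sketch below illustrates this classification step. A real RSS implementation uses a Toeplitz hash keyed with a secret supplied by the host; the simpler mixing function and the flow_tuple structure here are illustrative only, and are meant to show that the queue index is a pure function of the connection's addresses and ports.
/*
 * Illustrative receive-queue selection in the spirit of RSS.  The mixing
 * function below is not the Toeplitz hash that real RSS hardware uses; it
 * only demonstrates that the queue index depends solely on the connection's
 * addresses and ports, so packets of one connection always share a queue.
 */
#include <stdint.h>

struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

static uint32_t mix32(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1u;            /* multiplicative mixing constant */
    return h ^ (h >> 16);
}

/* Map a received packet's header fields to one of nqueues receive queues. */
unsigned int rss_select_queue(const struct flow_tuple *ft, unsigned int nqueues)
{
    uint32_t h = 0;

    h = mix32(h, ft->src_ip);
    h = mix32(h, ft->dst_ip);
    h = mix32(h, ((uint32_t)ft->src_port << 16) | ft->dst_port);
    return h % nqueues;          /* same connection, same queue, every time */
}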
While this approach should efficiently improve concurrency for single, non-virtualized
operating systems in receive-dominated workloads, such proposals do not improve
transmit-side concurrency. Though parallel receive queues are a necessary component
to improving the efficiency of receive-side network stack concurrency, this driver/NIC
interface leaves the larger network stack design issues unresolved. These issues must
be confronted to prevent inefficiencies in the network stack from rendering any architectural improvements useless. Consequently, this dissertation examines the larger
network stack issues in detail.
Additionally, a restricted interface such as RSS that considers only receive concurrency is not amenable to supporting concurrent direct hardware access by parallel
virtualized operating systems. A more flexible interface would be beneficial for extracting the most utility from a modified hardware architecture. At a minimum, an
RSS-style NIC architecture would need to be modified to enable more flexible classification of incoming packets based on the virtual machine they belong to rather than
the connection they are associated with. Even so, such a modified device architecture
would be insufficient because it would still require all transmit operations to be
performed via traditional software sharing rather than direct hardware access.
2.4.2
User-level Network Interfaces
User-level network interfaces provide a more flexible hardware/software interface
that allows concurrent user-space applications to directly access a special-purpose
NIC [44, 52]. In effect, the NIC provides a separate hardware context to each requesting application instance. Hence, user-level NICs provide the functional equivalent of
implementing parallel transmit and receive queues on a single traditional NIC, which
could be used as a component toward building an interface that breaks the scalability
limitations of traditional NICs. However, user-level NIC architectures lack two key
features required for use in efficient, concurrent network servers.
First, user-level NICs do not provide context-private event notification. Instead,
applications written for user-level NICs typically poll the status of a private context
to determine if that context is ready to be serviced. While this is perfectly suitable
for high-performance message-passing applications in which the application may not
have any work to do until a new message arrives, a polling model is inappropriate for
general-purpose operating systems or virtual machine monitors in which many other
applications or devices may require service.
Second, user-level network interfaces require a single trusted software entity to implement direct memory access (DMA) memory protection, which limits the scalability
of this approach. For unvirtualized environments, this entity is the operating system;
for virtualized environments, the entity is a single trusted “driver domain” OS instance. Like all applications, user-level NIC applications manipulate virtual memory
addresses rather than physical addresses. Hence, the addresses provided by an application to a particular hardware context on a user-level NIC are virtual addresses.
However, commodity architectures (such as x86) require I/O devices to use physical
addresses. To inform the NIC of the appropriate virtual-to-physical address translations, applications invoke the trusted managing software entity to perform an I/O
interaction with the NIC (typically referred to as memory registration) that updates
the NIC’s current translations. Liu et al. present an implementation of the Infiniband
user-level NIC architecture with support for the Xen VMM and show that memory
registration costs can significantly degrade performance [35]. Unlike user-level NIC
applications that typically only invoke memory registration twice (once during initialization and again during application termination), operating systems frequently
create and destroy virtual-to-physical mappings at runtime, especially when utilizing
zero-copy I/O. Hence, the costly memory registration model is inappropriate for operating systems running on a VMM. The concurrent direct network access
architecture presented in this dissertation avoids these registration costs by using a
lightweight, primarily software-based protection strategy instead. This dissertation
also explores other IOMMU-based strategies for efficient memory protection that are
attractive alternatives to costly on-device memory registration.
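For reference, the sketch below shows what the registration model costs in terms of the standard libibverbs calls (ibv_reg_mr and ibv_dereg_mr); the surrounding helper functions are hypothetical. A long-lived message-passing application pays this price roughly once per buffer, whereas an operating system performing zero-copy I/O would have to repeat it every time a virtual-to-physical mapping changes.
/*
 * Registration of a buffer with the adapter via libibverbs.  The helper
 * functions are hypothetical; ibv_reg_mr() pins the pages and installs
 * translations on the device, which is the per-mapping cost discussed above.
 */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_send_buffer(struct ibv_pd *pd, size_t len, void **buf_out)
{
    void *buf = malloc(len);
    if (buf == NULL)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL) {
        free(buf);
        return NULL;
    }
    *buf_out = buf;
    return mr;
}

void release_send_buffer(struct ibv_mr *mr, void *buf)
{
    ibv_dereg_mr(mr);    /* removes the adapter's translations for the buffer */
    free(buf);
}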
Thus, prior research has examined hardware support for both OS and VMM concurrency, but this hardware alone is not sufficient to address the problems of each.
Furthermore, these prior efforts either solved only one aspect of concurrency
(as in the RSS model, which addresses receive parallelism but not transmit parallelism) or lacked important components necessary for high-performance servers
(such as low-overhead DMA memory protection to facilitate
zero-copy I/O). This dissertation uses both hardware and software to create a comprehensive solution or, where applicable, to determine and characterize the components that
are still necessary for one. Moreover, this research represents a
fundamentally different approach from these past efforts, using a hardware/software
synthesis to achieve a comprehensive architectural analysis and solution rather than
a primarily hardware-based approach that examines only part of the problems
and their overheads.
2.5
Summary
Server technology has become increasingly important for academic and commercial applications, and the Internet era has brought explosive growth in demand from
home users. The demand for efficient, high-performance server technology has motivated extensive research over the past several decades that touches on issues related
to the contributions of this dissertation. Though there is extensive research in this
area, there are new challenges with regard to supporting new levels of thread- and
virtual-machine-level concurrency. This chapter has described the efficiency and performance challenges observed in modern systems and has outlined the research that
is most closely related to solving these problems. Previous research has explored
some variations of OS architectures that support thread-parallel network I/O processing, but the research in this dissertation reaches beyond that by exploring a fuller
spectrum of OS architectures and by examining them on real, rather than simulated,
hardware and software. Furthermore, previous research has explored the performance,
efficiency, and protection issues related to different I/O virtualization architectures,
but the research in this dissertation presents a novel architecture
that brings with it different performance, efficiency, and protection characteristics.
Finally, a key aspect of the research in this dissertation is that it uses hardware
to improve the efficiency of software. There have been several prior efforts to use
hardware to support OS and VMM concurrency, but these efforts have been almost
exclusively hardware-centric and did not address issues relevant to real-world application performance, such as support for zero-copy I/O in modern server applications.
The architecture presented in this dissertation uses hardware in synthesis with software to comprehensively address efficiency and performance of real-world applications
running on modern thread- and virtual-machine-concurrent network servers.
Chapter 3
Parallelization Strategies for OS Network Stacks
As established in the previous chapter, network server architectures will feature
chip multiprocessors in the future. Furthermore, the slowdown in uniprocessor performance improvements means that network servers will have to leverage parallel
processors to meet the ever-increasing demand for network services. A wide range of
parallel network stack organizations have been proposed and implemented. Among
the parallel network stack organizations, there exist two major categories: message-based parallelism (MsgP) and connection-based parallelism (ConnP). These organizations expose different levels of concurrency, in terms of the maximum available
parallelism within the network stack. They also achieve different levels of efficiency,
in terms of achieved network bandwidth per processing core, as they incur differing
cache, synchronization, and scheduling overheads.
The costs of synchronization and scheduling have changed dramatically in the
years since the parallel network stack organizations introduced in Chapter 2 were
originally proposed and studied. Though processors have become much faster, the
gap between processor and memory performance has become much greater, increasing
the cost, in terms of lost execution cycles, of synchronization and scheduling. Furthermore, technology trends and architectural complexity are preventing uniprocessor
performance growth from keeping pace with Ethernet bandwidth increases. Both of
these factors motivate a fresh examination of parallel network stack architectures on
modern parallel hardware.
Today, network servers are frequently faced with tens of thousands of simultaneous
connections. The locking, cache, and scheduling overheads of parallel network stack
organizations vary depending on the number of active connections in the system.
However, network performance evaluations generally focus on the bandwidth over a
small number of connections, often just one. In contrast, this study evaluates the
different network stack organizations under widely varying connection loads.
This study has four main contributions. First, this study presents a fair comparison of uniprocessor, message-based parallel, and connection-based parallel network stack organizations on modern multiprocessor hardware. Three competing network stack organizations are implemented within the FreeBSD 7 operating system:
message-based parallelism (MsgP), connection-based parallelism using threads for
synchronization (ConnP-T), and connection-based parallelism using locks for synchronization (ConnP-L). The uniprocessor version of FreeBSD is efficient, but its
performance falls short of saturating the fastest available network interfaces. Utilizing 4 cores, the parallel stack organizations can outperform the uniprocessor stack,
but at reduced efficiency.
Second, this study compares the performance of the different network stack organizations when using a single 10 Gbps network interface versus multiple 1 Gbps
network interfaces. Unsurprisingly, a uniprocessor network stack can more efficiently
utilize a single 10 Gbps network interface, as multiple network interfaces generate
additional interrupt overheads. However, the interactions between the network stack
and the device serialize the parallel stack organizations when only a single network
interface is present in the system. The parallel network stack organizations benefit
from the device-level parallelism that is exposed by having multiple network interfaces, allowing a system with multiple 1 Gbps network interfaces to outperform a
system with a single 10 Gbps network interface. With multiple interfaces, the parallel
organizations are able to process interrupts concurrently on multiple processors
and experience reduced lock contention at the device level.
Third, this study presents an analysis of the locking and scheduling overhead
incurred by the different parallel stack organizations. MsgP experiences significant
locking overhead, but is still able to outperform the uniprocessor for almost all connection loads. In contrast, ConnP-T has very low locking overhead but incurs significant
scheduling overhead, leading to reduced performance compared to even the uniprocessor kernel for all but the heaviest loads. ConnP-L mitigates the locking overhead of
MsgP, by grouping connections so that there is little global locking, and the scheduling
overhead of ConnP-T, by using the requesting thread for network processing rather
than forwarding the request to another thread.
Finally, this study analyzes the cache behavior of the different parallel stack organizations. Specifically, this study categorizes data sharing within the network stack
as either concurrent or serial. If a datum may be accessed simultaneously by two or
more threads, that datum is shared concurrently. If, however, a datum may only be
accessed by one thread at a time, but it may be accessed by different threads over
time, that datum is shared serially. CMP organizations with shared caches will likely
reduce the cache misses to concurrently shared data, but are unlikely to provide any
benefit for serially shared data. Unfortunately, this study shows that there is a significant amount of serial sharing in the parallel network stack organizations, but very
little concurrent sharing.
The remainder of this chapter proceeds as follows. The next section further motivates the need for parallelized network stacks in current and future systems. Section 3.2 describes the parallel network stack architectures that are evaluated in this
chapter. Section 3.3 then describes the hardware and software used to evaluate each
organization. Sections 3.4 and 3.5 present evaluations of the organizations using one
10 Gbps interface and six 1 Gbps interfaces, respectively. Section 3.6 provides a discussion of these results. This chapter is based in part on my previously published
work [59].
3.1
Background
The most efficient network stacks in modern operating systems are designed for
uniprocessor systems. There are still concurrent threads in such operating systems,
but locking and scheduling overhead are minimized as only one thread can execute at
a time. For example, a lock operation can often be made atomic simply by masking
interrupts during the operation. Despite their efficiency, such network stacks are not
capable of saturating a modern 10 Gbps Ethernet link. In 2004, Hurwitz and Feng
found that, using Linux 2.4 and 2.5 uniprocessor kernels (with TCP segmentation
offloading), they were only able to achieve about 2.5 Gbps on a 2.4 GHz Intel Pentium
4 Xeon system [20].
Increasing processor performance has allowed uniprocessor network stacks to achieve
higher bandwidth, but they still are not close to saturating a 10 Gbps Ethernet link.
Table 3.1 shows the performance of FreeBSD 7 on a modern 2.2 GHz Opteron uniprocessor system. The first row shows the performance of the uniprocessor kernel, which
remains nearly constant around 4 Gbps as the number of connections in the system
is varied. While this is an improvement over the performance reported in 2004, it
is still less than one half of the link’s capacity. Though the use of jumbo frames
can improve these numbers, network servers connected to the Internet will continue
to use standard 1500-byte Ethernet frames for the foreseeable future in order to
interoperate with legacy hardware.
In the face of technology constraints and uniprocessor complexity, architects have
turned to chip multiprocessors to continue to provide additional processing performance [14, 18, 25, 30, 31, 32, 33, 34, 42, 54].
OS Type             Processors   24 conns   192 conns   16384 conns
Uniprocessor only   1            4177       4156        4037
SMP capable         1            3688       3796        3774
SMP capable         4            3328       3251        1821
Table 3.1: FreeBSD network bandwidth (Mbps) using a single 10 Gbps network interface.
The network stack within the operating system will have to be able to take advantage of such architectures in order to keep
up with increases in network bandwidth demand. However, parallelizing the network
stack inherently reduces its efficiency. A symmetric multiprocessing (SMP) kernel
must use a more expensive implementation of lock operations as there is now physical concurrency in the system. For a lock operation to be atomic, it must be ensured
that threads running on the other processors will not interfere with the read-modify-write sequence required to acquire and release a lock. On x86 hardware, this is
accomplished by adding the lock prefix to lock acquisition instructions. The lock
prefix causes the instruction to be extremely expensive, as it serializes all instruction
execution on the processor and it locks the system bus to ensure that the processor can do an atomic read-modify-write with respect to the other processors in the
system. Scheduling is also potentially more expensive, as the operating system now
must schedule multiple threads across multiple physical processors. As the second
row in Table 3.1 shows, in FreeBSD 7, the overhead of making the kernel SMP capable results in a 7–12% reduction in efficiency. Note that this is still using just a single
physical processor.
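The following sketch, which is not FreeBSD's actual mutex implementation, illustrates why SMP lock acquisition is more expensive: the atomic exchange in spin_lock() compiles to a lock-prefixed instruction on x86, whereas a uniprocessor kernel could simply mask interrupts around a test-and-set.
/*
 * Not FreeBSD's mutex code: a minimal C11 spinlock showing the cost source.
 * atomic_flag_test_and_set compiles to a lock-prefixed exchange on x86,
 * which performs a serializing read-modify-write visible to all cores.
 */
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;
} spinlock_t;

#define SPINLOCK_INITIALIZER { ATOMIC_FLAG_INIT }

void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

void spin_unlock(spinlock_t *l)
{
    /* A plain release store; no lock prefix is required to unlock. */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}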
As the number of processors increases, lock contention becomes a major issue. The
third row of Table 3.1 shows the results of this effect. With the same SMP capable
kernel with 4 physical processors, not only does the efficiency further decrease, but the
absolute performance also decreases. Note that the problem gets dramatically worse
as the number of connections is increased. This is because with a larger number of
connections, each connection has much lower bandwidth, so less work is accomplished
for each lock acquisition.
These results strongly motivate a reexamination of network stack parallelization
strategies in the face of modern technology trends. It seems unlikely that uniprocessor performance will scale fast enough to keep up with increasing network bandwidth
demands, so the efficiency of uniprocessor network stacks can no longer be relied
upon to provide the necessary networking performance. Furthermore, the inefficiencies of modern SMP capable network stacks mean that small-scale
chip multiprocessors only make matters worse: networking performance actually gets worse, not better, when using 4 processing cores. There have been
several proposals to use a single core of a multiprocessor to achieve the efficiencies
of a uniprocessor network stack [6, 11, 46, 47, 48]. However, this is not a solution,
either, as each core of a CMP is likely to provide less performance than a monolithic
uniprocessor. So, if a uniprocessor is insufficient, there is no reason to believe a single
core of a CMP will be able to do any better. Furthermore, dedicating multiple cores
for network processing reintroduces the need for synchronization. The remainder of this chapter
will examine the continuum of parallelization strategies depicted in Figure 2.2 and
analyze their behavior on small-scale multiprocessor systems to better understand
this situation.
3.2
Parallel Network Stack Architectures
As was introduced in Chapter 2 and depicted in Figure 2.2, there are two primary
network stack parallelization strategies: message-based parallelism and connection-based parallelism. Using message-based parallelism, any message (or packet) may be
processed simultaneously with respect to other messages. Hence, messages for a single
connection could be processed concurrently on different threads, potentially resulting
in improved performance. Connection-based parallelism is more coarse-grained; at the
beginning of network processing (either at the top or bottom of the network stack),
messages and packets are classified according to the connection with which they are
associated. All packets for a certain connection are then processed by a single thread
at any given time. However, each thread may be responsible for processing one or
more connections.
These parallelization strategies were studied in the mid-1990s, between the introduction of 100 Mbps and 1 Gbps Ethernet. Despite those efforts, there is not a
solid consensus among modern operating system developers on how to design efficient and scalable parallel network stacks. Major subsystems of FreeBSD and Linux,
including the network stack, have been redesigned in recent years to improve performance on parallel hardware. Both operating systems now incorporate variations of
message-based parallelism within their network stacks. Conversely, Sun has recently
redesigned the Solaris operating system for its high-throughput computing microprocessors, and it now incorporates a variation of connection-based parallelism [55].
DragonflyBSD also uses connection-based parallelism within its network stack.
Each strategy was implemented within the FreeBSD 7 operating system to enable a
fair comparison of the trade-offs among the different strategies. This section provides
a more detailed explanation of how each parallelization strategy works.
3.2.1
Message-based Parallelism (MsgP)
Message-based parallel (MsgP) network stacks, such as FreeBSD, exploit parallelism by allowing multiple threads to operate within the network stack simultaneously. Two types of threads may perform network processing: one or more application
threads and one or more inbound protocol threads. When an application thread makes
a system call, that calling thread context is “borrowed” to then enter the kernel and
carry out the requested service. So, for example, a read or write call on a socket
would loan the application thread to the operating system to perform networking
tasks. Multiple such application threads can be executing within the kernel at any
given time. The network interface’s driver executes on an inbound protocol thread
whenever the network interface card (NIC) interrupts the host, and it may transfer
packets between the NIC and host memory. After servicing the NIC, the inbound
protocol thread processes received packets “up” through the network stack.
Given that multiple threads can be active within the network stack, FreeBSD utilizes fine-grained locking around shared kernel structures to ensure proper message
ordering and connection state consistency. As a thread attempts to send or receive
a message on a connection, it must acquire various locks when accessing shared connection state, such as the global connection hash table lock (for looking up TCP
connections) and per-connection locks (for both socket state and TCP state). If a
thread is unable to obtain a lock, it is placed in the lock’s queue of waiting threads
and yields the processor, allowing another thread to execute. To prevent priority
inversion, priority propagation from the waiting threads to the thread holding the
lock is performed.
As is characteristic of message-based parallel network stacks, FreeBSD’s locking
organization thus allows concurrent processing of different messages on the same
connection, so long as the various threads are not accessing the same portion of the
connection state at the same time. For example, one thread may process TCP timeout
state based on the reception of a new ACK, while at the same time another thread
is copying data into that connection’s socket buffer for later transmission. However,
note that the inbound thread configuration described is not the FreeBSD 7 default.
Rather, the operating system’s network stack has been configured to use the optional
direct-dispatch mechanism. Normally, dedicated parallel driver threads service each
NIC and then hand off inbound packets to a single protocol thread via a shared
queue. That protocol thread then processes the received packets “up” through the
network stack. The default configuration thus limits the performance of MsgP and is
hence not considered in this chapter. The thread-per-NIC model also differs from the
message-parallel organization described by Nahum et al. [41], which used many more
worker threads than interfaces. Such an organization requires a sophisticated scheme
to ensure these worker threads do not reorder inbound packets that were received in
order, and hence that organization is also not considered.
3.2.2
Connection-based Parallelism (ConnP)
To compare connection parallelism in the same framework as message parallelism,
FreeBSD 7 was modified to support two variants of connection-based parallelism
(ConnP) that differ in how they serialize TCP/IP processing within a connection. The
first variant assigns each connection to one of a small number of protocol processing
threads (ConnP-T). The second variant assigns each connection to one of a small
number of locks (ConnP-L).
Connection Parallelism Serialized by Threads (ConnP-T)
Connection-based parallelism using threads utilizes several kernel threads dedicated to per-connection protocol processing. Each protocol thread is responsible for
processing a subset of the system’s connections. At each entry point into the TCP/IP
protocol stack, the requested operation is enqueued for service by a particular protocol
thread based on the connection that is being processed. Each connection is uniquely
mapped to a single protocol thread for the lifetime of that connection. Later, the protocol threads dequeue requests and process them appropriately. No per-connection
state locking is required within the TCP/IP protocol stack, because the state of each
connection is only manipulated by a single protocol thread.
The kernel protocol threads are simply worker threads that are bound to a specific
CPU. They dequeue requests and perform the appropriate processing; the messaging
system between the threads requesting service and kernel protocol threads maintains
strict FIFO ordering. Within each protocol thread, several data structures that are
normally system-wide (such as the TCP connection hash table) are replicated so
that they are thread-private. Kernel protocol threads provide both synchronous and
asynchronous interfaces to threads requesting service.
If a requesting thread requires a return value or if the requester must maintain
synchronous semantics (that is, the requester must wait until the kernel thread completes the desired request), that requester yields the processor and waits for the kernel
thread to complete the requested work. Once the kernel protocol thread completes
the desired function, the kernel thread sends the return value back to the requester
and signals the waiting thread. This is the common case for application threads,
which require a return value to determine if the network request succeeded. However,
interrupt threads (such as those that service the network interface card and pass “up”
packets received on the network) do not require synchronous semantics. In this case,
the interrupt context classifies each packet according to its connection and enqueues
the packet for the appropriate kernel protocol thread. The connection-based parallel stack uniquely maps a packet or socket request to a specific protocol thread by
hashing the 4-tuple of remote IP address, remote port number, local IP address, and
local port number. This implementation of connection-based parallelism is like that
of DragonflyBSD.
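A condensed sketch of this dispatch step appears below. The structure and function names are illustrative rather than the actual implementation; the point is that the 4-tuple hash alone selects the owning protocol thread, so no per-connection locking is needed inside the TCP/IP stack.
/*
 * Illustrative ConnP-T dispatch: the 4-tuple hash picks the protocol thread
 * that owns the connection.  Names and structures are stand-ins, and the
 * FIFO enqueue onto the chosen thread's request queue is left as a stub.
 */
#include <stdint.h>

#define NPROTO_THREADS 4          /* one protocol thread pinned to each core */

struct conn_request {
    uint32_t local_ip, remote_ip;
    uint16_t local_port, remote_port;
    void    *payload;             /* packet or socket request being handed off */
};

/* Hypothetical FIFO enqueue onto protocol thread thread_idx's queue. */
void proto_thread_enqueue(unsigned int thread_idx, struct conn_request *req);

static unsigned int conn_hash(const struct conn_request *req)
{
    uint32_t h = req->local_ip ^ req->remote_ip;

    h ^= ((uint32_t)req->local_port << 16) | req->remote_port;
    h ^= h >> 13;
    h *= 0x85ebca6bu;
    return h ^ (h >> 16);
}

void dispatch_to_protocol_thread(struct conn_request *req)
{
    /* Every request for a given connection maps to the same thread. */
    unsigned int idx = conn_hash(req) % NPROTO_THREADS;

    proto_thread_enqueue(idx, req);
}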
Connection Parallelism Serialized by Locks (ConnP-L)
Just as in thread-serialized connection parallelism, connection-based parallelism
using locks is based upon the principle of isolating connections into groups that are
each bound to a single entity during execution. As the name implies, however, the
binding entity is not a thread; instead, each group is isolated by a mutual exclusion
lock.
When an application thread enters the kernel to obtain service from the network
stack, the network system call maps the connection being serviced to a particular
group using a mechanism identical to that employed by thread-serialized connection
parallelism. However, rather than building a message and passing it to that group’s
specific kernel protocol thread for service, the calling thread directly obtains the lock
for the group associated with the given connection. After that point, the calling
thread may access any of the group-private data structures, such as the group-private
connection hash table or group-private per-connection structures. Hence, these locks
serve to ensure that at most one thread may be accessing each group’s private connection structures at a time. Upon completion of the system call in the network
stack, the calling thread releases the group lock, allowing another thread to obtain
that group’s lock if necessary. Threads accessing connections in different groups may
proceed concurrently through the network stack without obtaining any stack-specific
locks other than the group lock.
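The sketch below illustrates this entry path. A POSIX mutex stands in for the kernel's group lock and the helper names are hypothetical; unlike ConnP-T, the calling thread performs the protocol work itself while holding the group lock rather than handing the request to another thread.
/*
 * Illustrative ConnP-L entry path: the 4-tuple selects a group, and the
 * caller does the protocol work itself under that group's lock.  A pthread
 * mutex stands in for the kernel lock; conn_group_hash() and
 * tcp_output_locked() are hypothetical stand-ins.
 */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NGROUPS 128                          /* the ConnP-L(128) configuration */

static pthread_mutex_t group_lock[NGROUPS];

void conn_groups_init(void)
{
    for (int i = 0; i < NGROUPS; i++)
        pthread_mutex_init(&group_lock[i], NULL);
}

unsigned int conn_group_hash(uint32_t lip, uint32_t rip,
                             uint16_t lport, uint16_t rport);
void tcp_output_locked(void *conn_state, const void *data, size_t len);

void connp_l_send(uint32_t lip, uint32_t rip, uint16_t lport, uint16_t rport,
                  void *conn_state, const void *data, size_t len)
{
    unsigned int g = conn_group_hash(lip, rip, lport, rport) % NGROUPS;

    pthread_mutex_lock(&group_lock[g]);        /* serializes only this group   */
    tcp_output_locked(conn_state, data, len);  /* group-private state is safe  */
    pthread_mutex_unlock(&group_lock[g]);
}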
Inbound packet processing is also analogous to connection-based parallelism using
threads. After receiving a packet, the inbound protocol thread classifies the packet
into a group. Unlike the thread-oriented connection-parallel case, the inbound thread
need not hand off the packet from the driver to the worker thread corresponding
to the packet’s connection group. Instead, the inbound thread directly obtains the
appropriate group lock for the packet and then processes the packet “up” the protocol
stack without any thread handoff. This control flow is similar to the message-parallel
stack, but the lock-serialized connection-parallel stack does not require any further
protocol locks after obtaining the connection group lock. As in the MsgP case, there
is one inbound protocol thread for each NIC, but the number of groups may far
exceed the number of threads. This implementation of connection-based parallelism
is similar to the implementation used in Solaris 10.
3.3
Methodology
To gain insights into the behavior and characteristics of the parallel network stack
architectures described in Section 3.2, these architectures were evaluated on a modern
chip multiprocessor. All stack architectures were implemented within the 2006-03-27
repository version of the FreeBSD 7 operating system to facilitate a fair comparison.
This section describes the benchmarking methodology and hardware platforms.
3.3.1
Evaluation Hardware
The parallel network stack organizations were evaluated using a 4-way SMP
Opteron system, using either a single 10 Gbps Ethernet interface or six 1 Gbps Ethernet interfaces. The system consists of two dual-core 2.2 GHz Opteron 275 processors
and four 512 MB PC2700 DIMMs per processor (two per memory channel). Each of
the four processor cores has a private level-2 cache. The 10 Gbps NIC evaluation is
based on a Myricom 10 Gbps PCI-Express Ethernet interface. The six 1 Gbps NIC
evaluation is based on three dual-port Intel PRO/1000-MT Ethernet interfaces that
are spread across the motherboard’s PCI-X bus segments.
In both configurations, data is transferred between the 4-way Opteron’s Ethernet
interface(s) and one or more client systems. The 10 Gbps configuration uses one
client with an identical 10 Gbps interface as the system under test, whereas the six-NIC configuration uses three client systems that each have two Gigabit Ethernet
interfaces. Each client is directly connected to the 4-way Opteron without the use of
a switch. For the 10 Gbps evaluation, the client system uses faster 2.6 GHz Opteron
285 processors and PC3200 memory, so that the client will never be a bottleneck in
any of the tests. For the six-NIC evaluation, each client was independently tested to
confirm that it can simultaneously sustain the theoretical peak bandwidth of its two
interfaces. Therefore, all results are determined solely by the behavior of the 4-way
Opteron 275 system.
3.3.2
Parallel TCP Benchmark
Most existing network benchmarks evaluate single-connection performance. However, modern multithreaded server applications simultaneously manage tens to thousands of connections. This parallel network traffic behaves quite differently than a
single network connection. To address this issue, a multithreaded, event-driven network benchmark was developed that distributes traffic across a configurable number
of connections. The benchmark distributes connections evenly across threads and
utilizes libevent to manage connections within a thread. For all of the experiments
in this chapter, the number of threads used by the benchmark is equal to the number
of processor cores being used. Each thread manages an equal number of connections.
For evaluations using 6 NICs, the application’s connections are distributed across the
server’s NICs equally such that each of the four threads uses each NIC, and every
thread has the same number of connections that map to each NIC.
Each thread sends data over all of its connections using zero-copy sendfile().
Threads receive data using read(). The sending and receiving socket buffer sizes
are set to be sufficiently large (typically 256 KB) to accommodate the large TCP
windows for high-bandwidth connections. Using larger socket buffers did not improve
performance for any test. All experiments use the standard 1500-byte maximum
transmission unit and do not utilize TCP segmentation offload, which currently is
not implemented in FreeBSD. The benchmark is always run for 3 minutes.
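The fragment below sketches the benchmark's per-connection transmit step under this setup, using FreeBSD's sendfile(2) and a 256 KB send buffer. It is illustrative only; connection setup, libevent event registration, and EAGAIN handling are omitted.
/*
 * Illustrative per-connection transmit step of the benchmark: enlarge the
 * send buffer, then push file data with FreeBSD's zero-copy sendfile(2).
 * This is not the benchmark itself; event-loop and error handling are omitted.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define SOCKBUF_SIZE (256 * 1024)

/* Enlarge the send buffer so large TCP windows can stay full. */
int configure_socket(int sock)
{
    int sz = SOCKBUF_SIZE;

    return setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));
}

/* Send the next chunk of file fd over sock starting at *offset. */
int send_chunk(int sock, int fd, off_t *offset, size_t chunk)
{
    off_t sent = 0;

    if (sendfile(fd, sock, *offset, chunk, NULL, &sent, 0) == -1)
        return -1;           /* partial sends and EAGAIN handling omitted */
    *offset += sent;
    return 0;
}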
Stack Type      24 conns   192 conns   16384 conns
UP              4177       4156        4037
MsgP            3328       3251        1821
ConnP-T(4)      2543       2475        2483
ConnP-L(128)    3321       3240        1861
Table 3.2: Aggregate throughput (Mbps) for uniprocessor, message-parallel, and connection-parallel network stacks.
3.4
Evaluation using One 10 Gigabit NIC
Table 3.2 shows the aggregate throughput across all the connections of the parallel
TCP benchmark described in Section 3.3.2 when using a single 10 Gbps interface. The
table presents the throughput for each network stack organization when the evaluated
system is transmitting data on 24, 192, or 16384 simultaneous connections.
“UP” is the uniprocessor version of the FreeBSD kernel running on a single core of
the Opteron server. The rest of the configurations are run on all 4 cores. “MsgP” is
the multiprocessor FreeBSD-based MsgP kernel described in Section 3.2.1. "ConnP-T(4)" is the multiprocessor FreeBSD-based ConnP-T kernel described in Section 3.2.2,
using 4 kernel protocol threads for TCP/IP stack processing that are each pinned to a
different core. “ConnP-L(128)” is the multiprocessor FreeBSD-based ConnP-L kernel
described in Section 3.2.2. ConnP-L(128) divides the connections among 128 locks
within the TCP/IP stack.
As Table 3.2 shows, none of the parallel organizations outperform the “UP” kernel.
This corroborates prior evaluations of 10 Gbps Ethernet that used hosts with two
processors and an SMP variant of Linux and exhibited worse performance than when
the hosts used a uniprocessor kernel [20]. Of the parallel organizations, MsgP and
ConnP-L perform approximately the same and outperform ConnP-T when using 24
or 192 connections. However, ConnP-T performs best when using 16384 connections.
Both the software interface to the single 10 Gbps NIC and the various overheads
inherent to each parallel approach limit performance and prevent the parallel organizations
from outperforming the uniprocessor. When using one NIC, performance
is limited by the serialization constraints imposed by the device’s interface. Because
the device has a single physical interrupt line, only one thread is triggered when
the device raises an interrupt, and hence one thread carries received packets “up”
through the network stack as described in Section 3.2.1. Transmit-side traffic also
faces a device-imposed serialization constraint. Because multiple threads can potentially request to transmit a packet at the same time and invoke the NIC’s driver, the
driver requires acquisition of a mutual exclusion lock to ensure consistency of shared
state related to transmitting packets. Process profiling shows that for all connection
loads, the driver’s lock is held by a core in the system nearly 100% of the time, and
that even with 16384 connections, MsgP and ConnP-L organizations show more than
50% idle time. The ConnP-T organization is also constrained by the driver’s lock,
but it is able to outperform the other organizations with 16384 connections because
it does not constrain received acknowledgement packets to be processed by the single
interrupt thread, as the other organizations do. Instead, it is able to distribute receive processing to protocol threads running on all of the processor cores. However,
ConnP-T performs worse than the uniprocessor because of the significant scheduler
overheads associated with ConnP-T’s thread handoff mechanism.
3.5
Evaluation using Multiple Gigabit NICs
As is shown in the previous section, using a single 10 Gbps interface limits the
parallelism available to the network stack at the device interface. This external bottleneck prevents the parallelism within the network stack from being exercised. To
provide additional inbound parallelism and to reduce the degree to which a single
driver’s lock can serialize network stack processing, the uniprocessor, message-parallel,
and connection-parallel organizations are evaluated using six Gigabit Ethernet NICs rather than a single 10 Gigabit NIC.
[Figure: aggregate transmit throughput in Mb/s (0–6000) versus number of connections (24–16384) for the UP, MsgP, ConnP-T(4), and ConnP-L(128) kernels.]
Figure 3.1: Aggregate transmit throughput for uniprocessor, message-parallel and connection-parallel network stacks using 6 NICs.
Hence, on the inbound processing path there
are six different interrupts with six different interrupt threads to feed the network
stack in parallel. Each NIC has a separate driver instance with a separate driver
lock, reducing the probability that the network stack will contend for a driver lock.
This model more closely resembles the abundant thread parallelism that is presented
to the operating system at the application layer by the parallel benchmark and hence
fully stresses the network stack’s parallel processing capabilities. Because the single
10 Gbps-NIC configuration cannot fully exercise the processing resources of each organization and cannot effectively isolate the network stack, it is not examined further.
Figures 3.1 and 3.2 depict the aggregate TCP throughput across all connections
for the various network stack organizations when using six separate Gigabit interfaces.
Figure 3.1 shows that the “UP” kernel performs well when transmitting on a small
number of connections, achieving a bandwidth of 3804 Mb/s with 24 connections.
[Figure: aggregate receive throughput in Mb/s (0–6000) versus number of connections (24–16384) for the UP, MsgP, ConnP-T(4), and ConnP-L(128) kernels.]
Figure 3.2: Aggregate receive throughput for uniprocessor, message-parallel and
connection-parallel network stacks using 6 NICs.
However, total bandwidth decreases as the number of connections increases. MsgP
performs better, providing an 11% improvement over the uniprocessor bandwidth
at 24 connections but quickly ramps up to 4630 Mb/s, holding steady through 768
connections and then decreasing to 3403 Mb/s with 16384 connections. ConnP-T(4) achieves its peak bandwidth of 3169 Mb/s with 24 connections and provides
approximately steady bandwidth as the number of connections increases. Finally, the
ConnP-L(128) curve is shaped similarly to that of MsgP, but its performance is higher in
magnitude, and it always outperforms the uniprocessor kernel. ConnP-L(128) delivers
steady performance around 5440 Mb/s for 96–768 connections and then gradually
decreases to 4747 Mb/s with 16384 connections. This peak performance is roughly
the peak TCP throughput deliverable by the three dual-port Gigabit NICs.
Figure 3.2 shows the aggregate TCP throughput across all connections when receiving data on six Gigabit interfaces. Again, ConnP-L(128) performs best, followed
by MsgP, ConnP-T(4), and the uniprocessor kernel. Unlike the transmit case, the parallel organizations always outperform the uniprocessor, and in many cases they receive
at a higher rate than they transmit. The ConnP-L(128) organization is again able to
receive at near-peak performance at 384 connections and holds approximately steady,
receiving over 5 Gb/s of throughput using 16384 connections. Both the ConnP-T(4)
and uniprocessor kernels also receive steady (but lower) bandwidth across all connection loads tested, only slightly decreasing as connections are added. Conversely,
MsgP does not provide as consistent bandwidth across the various connection loads,
but it does uniformly outperform both ConnP-T(4) and “UP”.
3.6
Discussion and Analysis
The locking, scheduling, and cache overheads of the network stack vary depending
on both the parallel network stack organization and the number of active connections
in the system. The following subsections will examine these issues for the best performing hardware configuration, a system with six 1 Gbps network interfaces. All of
the statistics within this section were collected using either the Opterons’ performance
counters or FreeBSD’s lock-profiling facilities.
3.6.1
Locking Overhead
There are two significant costs of locking within the parallelized network stacks.
The first is that SMP locks are fundamentally more expensive than uniprocessor locks.
In a uniprocessor kernel, a simple atomic test-and-set instruction can be used to
protect against interference across context switches, whereas SMP systems must use
system-wide locking to ensure proper synchronization among simultaneously running
threads. This is likely to incur significant overhead in the SMP case. For example,
on x86 architectures, the lock prefix, which is used to ensure that an instruction is executed atomically across the system, effectively locks all other cores out of the memory system during the execution of the locked instruction.
[Diagram: the outbound send path (Socket Send, TCP Send, TCP Output, IP Output, Ethernet Output, Driver), showing acquisitions and releases of the Socket Buffer, Connection, Route, and TX Interface Queue locks along with the global Connection Hashtable and Route Hashtable locks.]
Figure 3.3: The outbound control path in the application thread context.
OS Type        6 conns   192 conns   16384 conns
MsgP           89        100         100
ConnP-L(4)     60        56          52
ConnP-L(8)     51        30          26
ConnP-L(16)    49        18          14
ConnP-L(32)    41        10          7
ConnP-L(64)    37        6           4
ConnP-L(128)   33        5           2
Table 3.3: Percentage of lock acquisitions for global TCP/IP locks that do not succeed immediately when transmitting data.
The second is that contention for global locks within the network stack is significantly increased when multiple threads are actively performing network tasks simultaneously. As an illustration of how locks can contend within the network stack,
Figure 3.3 shows the locking required in the control path for send processing within
the sending application’s thread context in the MsgP network stack of FreeBSD 7.
Most of the locks pictured are associated with a single socket buffer or connection.
Therefore, it is unlikely that multiple application threads would contend for those
locks since connection-oriented applications do not use multiple application threads
to send data over the same connection. However, those locks could be shared with
the kernel’s inbound protocol threads that are processing receive traffic on the same
connection. Global locks that must be acquired by all threads that are sending (or
possibly receiving) data over any connection are far more problematic.
There are two global locks on the send path: the Connection Hash-table lock
and the Route Hash-table lock. These locks protect the hash tables that map a
particular connection to its individual connection lock and that map a particular connection to its individual route lock, respectively. These locks are also used in lieu of
explicit reference counting for individual connections and locks. Watson presents a
more detailed description of locking within the FreeBSD network stack [57].
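The following condensed sketch of the send path's lock ordering (with POSIX mutexes standing in for kernel mutexes and illustrative names throughout) shows why the Connection Hash-table lock is the bottleneck: every sender must pass through it before reaching its per-connection lock.
/*
 * Condensed view of the send path's lock ordering from Figure 3.3, with
 * POSIX mutexes standing in for kernel mutexes and illustrative names.
 * Every sender serializes on the global Connection Hash-table lock before
 * it can take its own per-connection lock.
 */
#include <pthread.h>
#include <stddef.h>

struct flow;                                /* 4-tuple identifying a connection */

struct connection {
    pthread_mutex_t conn_lock;              /* per-connection socket/TCP lock   */
    /* ... socket and TCP state ... */
};

extern pthread_mutex_t conn_hashtable_lock; /* global lock */

struct connection *conn_hash_lookup(const struct flow *fl);
void tcp_output(struct connection *cp, const void *data, size_t len);

void msgp_tcp_send(const struct flow *fl, const void *data, size_t len)
{
    struct connection *cp;

    pthread_mutex_lock(&conn_hashtable_lock);   /* all senders pass through here */
    cp = conn_hash_lookup(fl);
    pthread_mutex_lock(&cp->conn_lock);         /* then take the private lock    */
    pthread_mutex_unlock(&conn_hashtable_lock); /* global lock released only now */

    tcp_output(cp, data, len);                  /* route and driver locks inside */

    pthread_mutex_unlock(&cp->conn_lock);
}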
There is very little contention for the Route Hash-table lock because the corresponding Route lock is quickly acquired and released so a thread is unlikely to be
blocked while holding the Route Hash-table lock and waiting for a Route lock. In
contrast, the Connection Hash-table lock is highly contended. This lock must be
acquired by any thread performing any network operation on any connection. Furthermore, it is possible for a thread to block while holding the lock and waiting for
its corresponding Connection lock, which can be held for quite some time.
Table 3.3 depicts global TCP/IP lock contention when sending data, measured as
the percentage of lock acquisitions that do not immediately succeed because another
thread holds the lock. ConnP-T is omitted from the table because it eliminates
global TCP/IP locking completely. As the table shows, the MsgP network stack
experiences significant contention for the Connection Hash-table lock, which leads
to considerable overhead as the number of connections increases.
One would expect that as connections are added, contention for per-connection
locks would decrease, and in fact lock profiling supports this conclusion. However,
because other locks (such as that guarding the scheduler) are acquired while holding
the per-connection lock, and because those other locks are system-wide and become
highly contended during heavy loads, detailed locking profiles show that the average
time per-connection locks are held increases dramatically. Hence, though contention
for per-connection locks decreases, the increasing cost for a contended lock is so
much greater that the system exhibits increasing average acquisition times for per-connection locks as connections are added. This increased per-connection acquisition
time in turn leads to longer waits for the Connection Hash-table lock, eventually
bogging down the system with contention.
Whereas the MsgP stack relies on repeated acquisition of the Connection Hash-table
and Connection locks to continue stack processing, ConnP-L stacks can also become
periodically bottlenecked if a single group becomes highly contended. Table 3.3 shows
the contention for the Network Group locks for ConnP-L stacks as the number of network groups is varied from 4 to 128 groups. The table demonstrates that contention
for the Network Group locks consistently decreases as the number of network groups
increases. Though ConnP-L(4)'s Network Group lock contention is high at over 50% for all connection loads, increasing the number of network groups to 128 reduces contention from 52% to just 2% for the heaviest connection load.
[Figure: aggregate transmit throughput in Mb/s versus number of connections for ConnP-L configured with 4, 8, 16, 32, 64, and 128 locks.]
Figure 3.4: Aggregate transmit throughput for the ConnP-L network stack as the number of locks is varied.
                 Transmit                             Receive
Stack Type       24 conns   192 conns   16384 conns   24 conns   192 conns   16384 conns
UP               452        440         423           350        378         421
MsgP             1305       1818        2448          1125       1126        1158
ConnP-T(4)       3617       3602        4535          858        957         1547
ConnP-L(128)     1056       924         1064          598        519         524
Table 3.4: Cycles spent managing the scheduler and scheduler synchronization per Kilobyte of payload.
Figure 3.4 shows the effect that increasing the number of network groups has on aggregate throughput for 6, 192, and 16384 connections. As is suggested by the contention
reduction associated with larger numbers of network groups, network throughput increases with more network groups. However, there are diminishing returns as more
groups are added.
3.6.2
Scheduler Overhead
The ConnP-T kernel trades the locking overhead of the ConnP-L and MsgP kernels
for scheduling overhead. As operations are requested for a particular connection, they must be scheduled onto the appropriate protocol thread.
[Figure: L2 cache misses per KB of payload throughput, broken down into scheduler and network stack components, for UP, MsgP, ConnP-T(4), and ConnP-L(128) at 24, 192, and 16384 connections.]
Figure 3.5: Profile of L2 cache misses per 1 Kilobyte of payload data (transmit test).
As Figures 3.1 and 3.2
showed, this results in stable but low total bandwidth as connections scale for ConnP-T(4). ConnP-L approximates the reduced intra-stack locking properties of ConnP-T
and adopts the simpler scheduling properties of MsgP; locking overhead is minimized
by the additional groups and scheduling overhead is minimized since messages are
not transferred to protocol threads. This results in consistently better performance
than the other parallel organizations.
To further explain this behavior, Table 3.4 shows the number of cycles spent
managing the scheduler and scheduler synchronization per KB of payload data transmitted and received. This shows the overhead of the scheduler normalized to network
bandwidth. Though MsgP experiences significantly less scheduling overhead than
ConnP-T in most cases, locking overhead within the threads negates the scheduler
advantage as connections are added. In contrast, the scheduler overhead of ConnP-T
remains high, particularly when transmitting, corresponding to relatively low bandwidth. Conversely, ConnP-L exhibits stable scheduler overhead that is much lower
than either MsgP or ConnP-T, contributing to its higher throughput. ConnP-L does
not require a thread handoff mechanism and its low lock contention compared to
MsgP results in fewer context switches from threads waiting for locks.
[Figure: L2 cache misses per KB of payload throughput, broken down into data copying, scheduler, and network stack components, for UP, MsgP, ConnP-T(4), and ConnP-L(128) at 24, 192, and 16384 connections.]
Figure 3.6: Profile of L2 cache misses per 1 Kilobyte of payload data (receive test).
All of the network stack organizations examined experience higher scheduler overhead when transmitting than when receiving. The reference FreeBSD 7 operating
system utilizes an interrupt-serialized task queue architecture for processing received
packets. This architecture obviates the need for explicit mutual exclusion locking
within NIC drivers when processing received packets, though locking is still required
on the transmit path. Each of the organizations examined benefits from this optimization. Because FreeBSD's kernel-adaptive mutual exclusion locks invoke the thread
scheduler when acquisitions repeatedly fail, eliminating lock acquisition attempts necessarily reduces scheduler overhead.
The ConnP-T organization experiences an additional reduction in scheduler overhead when processing received packets. In this organization, inbound packets are
queued asynchronously for later processing by the appropriate network protocol thread,
which eliminates the need to block the thread that enqueues a packet or to later notify a blocked thread of completion. When sending, most processing occurs when
the application attempts to send data, which requires a more scheduler-intensive synchronous call, and hence ConnP-T exhibits significantly higher scheduler overhead
when transmitting than when receiving.
Table 3.4 shows that the reference ConnP-T implementation in this study incurs
heavy overhead in the thread scheduler, and hence an effective ConnP-T organization would require a more efficient interprocessor communication mechanism. A
lightweight mechanism for interprocessor communication, as implemented in DragonflyBSD, would enable efficient intra-kernel messaging between processor cores. Such
an efficient messaging mechanism is likely to greatly benefit the ConnP-T organization
by allowing message transfer without invoking the general-purpose scheduler.
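As a rough illustration of what such lightweight messaging could look like, the sketch below outlines a single-producer/single-consumer ring that one core might use to hand protocol work to a pinned thread without entering the scheduler; the types and function names are hypothetical, and this is not the DragonflyBSD mechanism or the ConnP-T prototype.

    /* Hypothetical single-producer/single-consumer message ring between two
     * cores; names (struct proto_msg, msg_ring, etc.) are illustrative only. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SLOTS 256                   /* must be a power of two */

    struct proto_msg {
        void *data;                          /* e.g., pointer to an mbuf chain */
        int   op;                            /* requested protocol operation   */
    };

    struct msg_ring {
        struct proto_msg slots[RING_SLOTS];
        _Atomic size_t head;                 /* written by the producer core */
        _Atomic size_t tail;                 /* written by the consumer core */
    };

    /* Producer core: enqueue without locks or scheduler involvement. */
    static bool ring_send(struct msg_ring *r, struct proto_msg m)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SLOTS)
            return false;                    /* ring full; caller may retry */
        r->slots[head & (RING_SLOTS - 1)] = m;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer core (the pinned protocol thread): poll for work. */
    static bool ring_recv(struct msg_ring *r, struct proto_msg *out)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return false;                    /* no pending messages */
        *out = r->slots[tail & (RING_SLOTS - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }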
3.6.3 Cache Behavior
Figures 3.5 and 3.6 show the number of L2 cache misses per KB of payload data
transmitted and received, respectively. The stacked bars separate the L2 cache misses
based upon where in the operating system the misses occurred. On the transmit side,
all of the L2 cache misses occur either in the scheduler or in the network stack. On the
receive side, there are also misses copying the data from the kernel to the application.
Recall that zero-copy transmit is used, so the corresponding copy from the application
to the kernel does not occur on the transmit side.
The figures show the efficiency of the cache hierarchy normalized to network bandwidth. The uniprocessor kernel incurs very few cache misses relative to the multiprocessor configurations. The lack of data migration between processor caches accounts
for the uniprocessor kernel's cache efficiency. As the number of connections is increased, the additional connection state within the kernel stresses the cache and
directly results in increased cache misses and decreased throughput [27, 28].
The parallel network stacks incur significantly more cache misses per KB of transmitted data because of data migration and lock accesses. Surprisingly, ConnP-T(4)
incurs the most cache misses despite each thread being pinned to a specific processor core. One might expect that such pinning would improve locality by eliminating
migration of many connection data structures. However, Figure 3.5 shows that for
the cases with 24 and 192 connections, ConnP-T(4) exhibits more misses in the network stack than any of the other organizations. While thread pinning can improve
locality by eliminating migration of connection metadata, frequently updated socket
metadata is still shared between the application and protocol threads, which leads
to data migration and a higher cache miss rate. Pinning the protocol threads does
result in better utilization of the caches for the 16384-connection load when transmitting, however. In this case, ConnP-T(4) exhibits the fewest network-stack L2
cache misses. However, the relatively higher number of L2 cache misses caused by
the scheduler prevents this advantage from translating into a performance benefit.
Other than the cache misses due to data copying, the cache miss profiles for
transmit and receive are quite similar. However, ConnP-T(4) incurs far fewer cache
misses in the scheduler when receiving data than it does when transmitting data.
This is directly related to the reduced scheduling overhead on the receive side, as
discussed in the previous section.
The cache misses within the network stack can be divided between misses to
concurrently shared data and serially shared data. Global network stack data is
concurrently shared, as it may be simultaneously accessed by multiple threads in
order to transmit or receive data. In contrast, per-connection data is serially shared,
as it is only accessed by a single thread at a time, although it may be accessed by
multiple threads over time. In a true MsgP organization, the per-connection data will
also be concurrently shared, as multiple threads can process packets from the same
connection simultaneously. However, in a practical, in-order MsgP implementation,
as described in Section 3.2.1, per-connection data may be accessed by at most two
threads at a time, one sending data and one receiving data on the same connection.
Table 3.5 indicates the percentage of the cache misses within the network stack
that are due to global data structures, and are therefore concurrently shared. The
remaining L2 cache misses are due to per-connection data structures. As previously
Stack Type            Transmit                                Receive
                  24 conns   192 conns   16384 conns     24 conns   192 conns   16384 conns
UP                      4%          3%           27%           1%          1%           12%
MsgP                   12%         14%           32%          16%         15%           22%
ConnP-T(4)              7%          9%           15%          14%         13%           11%
ConnP-L(128)           15%         19%           21%          18%         18%           20%

Table 3.5: Percentage of L2 cache misses within the network stack to global data structures.
stated, these per-connection data structures are rarely, if ever, accessed by different
threads concurrently.
Data that is concurrently shared is most likely to benefit from a CMP with a
shared cache. Therefore, the percentages of Table 3.5 indicate the possible reduction
in L2 cache misses within the network stack if a CMP with a shared cache were used.
For most connection loads and network stack organizations, fewer than 20% of the L2
cache misses are due to global data. As there is no guarantee that a shared L2 will
eliminate these misses entirely, the benefits of a shared cache for the network stack
are likely to be minimal. Furthermore, it is also possible for a shared cache to have a
detrimental effect on the serially shared data. Previous work has shown that shared
caches can hurt performance when the cores are not actively sharing data [29]. If
the processor cores must compete with each other to store per-connection state, this
could potentially lead to an overall increase in L2 cache misses within the network
stack, despite the benefits for concurrently shared data.
The lock contention, scheduling, and cache efficiency data show that the different
concurrency models and the different synchronization mechanisms employed by their
implementations directly impact network stack efficiency and throughput. Though
all of these parallelized organizations can outperform the uniprocessor when using 4
cores, each parallel organization experiences higher locking overhead, decreased cache
efficiency, and higher scheduling overhead than a uniprocessor network stack. The
ConnP-L organization achieves higher performance and efficiency than the MsgP
and ConnP-T organizations. ConnP-L mitigates the locking overhead of the highly
contentious MsgP organization by grouping connections to reduce global locking.
ConnP-L also benefits from reduced scheduling overhead as compared to ConnP-T,
since ConnP-L does not require inter-thread communication or message passing to
carry out network stack processing. Hence, though the ConnP-L parallelism model
is more restricted than that of MsgP, ConnP-L still provides the same level of parallelism expected by most applications (e.g., connection- or socket-level parallelism)
and achieves higher efficiency and higher throughput.
Chapter 4
Concurrent Direct Network Access
In many organizations, the economics of supporting a growing number of Internet-based services has created a demand for server consolidation. In such organizations,
maximizing machine utilization and increasing the efficiency of the overall server is
just as important as increasing the efficiency of each individual operating system, as
in Chapter 3. Consequently, there has been a resurgence of interest in machine virtualization [1, 2, 7, 13, 16, 21, 35, 53, 58]. A virtual machine monitor (VMM) enables
multiple virtual machines, each encapsulating one or more services, to share the same
physical machine safely and fairly. In principle, general-purpose operating systems,
such as Unix and Windows, offer the same capability for multiple services to share
the same physical machine. However, VMMs provide additional advantages. For example, VMMs allow services implemented in different or customized environments,
including different operating systems, to share the same physical machine.
Modern VMMs for commodity hardware, such as VMware [1, 13] and Xen [7],
virtualize processor, memory, and I/O devices in software. This enables these VMMs
to support a variety of hardware. In an attempt to decrease the software overhead
of virtualization, both AMD and Intel are introducing hardware support for virtualization [2, 21]. Specifically, their hardware support for processor virtualization is
currently available, and their hardware support for memory virtualization is imminent. As these hardware mechanisms mature, they should reduce the overhead of
virtualization, improving the efficiency of VMMs.
Despite the renewed interest in system virtualization, there is still no clear solution to improve the efficiency of I/O virtualization. To support networking, a VMM
must present each virtual machine with a virtual network interface that is multiplexed in software onto a physical network interface card (NIC). The overhead of this
software-based network virtualization severely limits network performance [38, 39, 53].
For example, a Linux kernel running within a virtual machine on Xen is only able
to achieve about 30% of the network throughput that the same kernel can achieve
running directly on the physical machine.
This study proposes and evaluates concurrent direct network access (CDNA), a
new I/O virtualization architecture that combines software and hardware components
to significantly reduce the overhead of network virtualization in VMMs. The CDNA
network virtualization architecture provides virtual machines running on a VMM safe
direct access to the network interface. With CDNA, each virtual machine is allocated
a unique context on the network interface and communicates directly with the network
interface through that context. In this manner, the virtual machines that run on the
VMM operate as if each has access to its own dedicated network interface.
Using CDNA, a single virtual machine running Linux can transmit at a rate of
1867 Mb/s with 51% idle time and receive at a rate of 1874 Mb/s with 41% idle
time. In contrast, at 97% CPU utilization, Xen is only able to achieve 1602 Mb/s for
transmit and 1112 Mb/s for receive. Furthermore, with 24 virtual machines, CDNA
can still transmit and receive at a rate of over 1860 Mb/s, but with no idle time. In
contrast, Xen is only able to transmit at a rate of 891 Mb/s and receive at a rate of
558 Mb/s with 24 virtual machines.
The CDNA network virtualization architecture achieves this dramatic increase in
network efficiency by dividing the tasks of traffic multiplexing, interrupt delivery, and
memory protection among hardware and software in a novel way. Traffic multiplexing
is performed directly on the network interface, whereas interrupt delivery and memory
Figure 4.1: Shared networking in the Xen virtual machine environment. (The diagram shows guest domains with front-end drivers, the driver domain with back-end drivers, an Ethernet bridge, and the native NIC driver, and the hypervisor dispatching interrupts between the NIC and the domains.)
protection are performed by the VMM with support from the network interface.
This division of tasks into hardware and software components simplifies the overall
software architecture, minimizes the hardware additions to the network interface, and
addresses the network performance bottlenecks of Xen.
The remainder of this study proceeds as follows. The next section discusses networking in the Xen VMM in more detail. Section 4.2 describes how CDNA manages
traffic multiplexing, interrupt delivery, and memory protection in software and hardware to provide concurrent access to the NIC. Section 4.3 then describes the custom
hardware NIC that facilitates concurrent direct network access on a single device.
Finally, Section 4.4 presents the experimental methodology and results. This study
is based on one of my previously published works [60].
4.1 Networking in Xen
4.1.1 Hypervisor and Driver Domain Operation
A VMM allows multiple guest operating systems, each running in a virtual machine, to share a single physical machine safely and fairly. It provides isolation between these guest operating systems and manages their access to hardware resources.
Xen is an open source VMM that supports paravirtualization, which requires modifications to the guest operating system [7]. By modifying the guest operating systems
to interact with the VMM, the complexity of the VMM can be reduced and overall
system performance improved.
Xen performs three key functions in order to provide virtual machine environments. First, Xen allocates the physical resources of the machine to the guest operating systems and isolates them from each other. Second, Xen receives all interrupts
in the system and passes them on to the guest operating systems, as appropriate. Finally, all I/O operations go through Xen in order to ensure fair and non-overlapping
access to I/O devices by the guests.
Figure 4.1 shows the organization of the Xen VMM. Xen consists of two elements:
the hypervisor and the driver domain. The hypervisor provides an abstraction layer
between the virtual machines, called guest domains, and the actual hardware, enabling each guest operating system to execute as if it were the only operating system
on the machine. However, the guest operating systems cannot directly communicate
with the physical I/O devices. Exclusive access to the physical devices is given by the
hypervisor to the driver domain, a privileged virtual machine. Each guest operating
system is then given a virtual I/O device that is controlled by a paravirtualized driver,
called a front-end driver. In order to access a physical device, such as the network interface card (NIC), the guest’s front-end driver communicates with the corresponding
back-end driver in the driver domain. The driver domain then multiplexes the data
streams for each guest onto the physical device. The driver domain runs a modified
version of Linux that uses native Linux device drivers to manage I/O devices.
As the figure shows, in order to provide network access to the guest domains, the
driver domain includes a software Ethernet bridge that interconnects the physical
NIC and all of the virtual network interfaces. When a packet is transmitted by a
guest, it is first transferred to the back-end driver in the driver domain using a page
remapping operation. Within the driver domain, the packet is then routed through
the Ethernet bridge to the physical device driver. The device driver enqueues the
packet for transmission on the network interface as if it were generated normally
by the operating system within the driver domain. When a packet is received, the
network interface generates an interrupt that is captured by the hypervisor and routed
to the network interface’s device driver in the driver domain as a virtual interrupt.
The network interface’s device driver transfers the packet to the Ethernet bridge,
which routes the packet to the appropriate back-end driver. The back-end driver
then transfers the packet to the front-end driver in the guest domain using a page
remapping operation. Once the packet is transferred, the back-end driver requests
that the hypervisor send a virtual interrupt to the guest notifying it of the new packet.
Upon receiving the virtual interrupt, the front-end driver delivers the packet to the
guest operating system’s network stack, as if it had come directly from the physical
device.
4.1.2 Device Driver Operation
The driver domain in Xen is able to use unmodified Linux device drivers to access the network interface. Thus, all interactions between the device driver and the
NIC are as they would be in an unvirtualized system. These interactions include
programmed I/O (PIO) operations from the driver to the NIC, direct memory access
(DMA) transfers by the NIC to read or write host memory, and physical interrupts
from the NIC to invoke the device driver.
The device driver directs the NIC to send packets from buffers in host memory
and to place received packets into preallocated buffers in host memory. The NIC
accesses these buffers using DMA read and write operations. In order for the NIC to
know where to store or retrieve data from the host, the device driver within the host
operating system generates DMA descriptors for use by the NIC. These descriptors
indicate the buffer’s length and physical address on the host. The device driver notifies
the NIC via PIO that new descriptors are available, which causes the NIC to retrieve
them via DMA transfers. Once the NIC reads a DMA descriptor, it can either read
from or write to the associated buffer, depending on whether the descriptor is being
used by the driver to transmit or receive packets.
Device drivers organize DMA descriptors in a series of rings that are managed
using a producer/consumer protocol. As they are updated, the producer and consumer pointers wrap around the rings to create a continuous circular buffer. There
are separate rings of DMA descriptors for transmit and receive operations. Transmit
DMA descriptors point to host buffers that will be transmitted by the NIC, whereas
receive DMA descriptors point to host buffers that the OS wants the NIC to use as it
receives packets. When the host driver wants to notify the NIC of the availability of a
new DMA descriptor (and hence a new packet to be transmitted or a new buffer to be
posted for packet reception), the driver first creates the new DMA descriptor in the
next-available slot in the driver’s descriptor ring and then increments the producer
index on the NIC to reflect that a new descriptor is available. The driver updates
the NIC’s producer index by writing the value via PIO into a specific location, called
a mailbox, within the device’s PCI memory-mapped region. The network interface
monitors these mailboxes for such writes from the host. When a mailbox update is
detected, the NIC reads the new producer value from the mailbox, performs a DMA
System           Transmit (Mb/s)   Receive (Mb/s)
Native Linux                5126             3629
Xen Guest                   1602             1112

Table 4.1: Transmit and receive performance for native Linux 2.6.16.29 and paravirtualized Linux 2.6.16.29 as a guest OS within Xen 3.
read of the descriptor indicated by the index, and then is ready to use the DMA
descriptor. After the NIC consumes a descriptor from a ring, the NIC updates its
consumer index, transfers this consumer index to a location in host memory via DMA,
and raises a physical interrupt to notify the host that state has changed.
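The following sketch illustrates this producer/consumer interaction for a transmit ring; the descriptor layout, ring size, and mailbox register are hypothetical placeholders for whatever a particular NIC defines.

    /* Hypothetical driver-side transmit path: fill a DMA descriptor in the
     * next ring slot, then advance the NIC's producer index via a PIO write
     * to a mailbox register.  All names and offsets are illustrative. */
    #include <stdint.h>

    #define TX_RING_SLOTS 256                /* power of two */

    struct dma_desc {
        uint64_t paddr;                      /* physical address of the buffer */
        uint16_t len;                        /* buffer length in bytes         */
        uint16_t flags;                      /* device-specific flags          */
    };

    struct tx_ring {
        struct dma_desc   desc[TX_RING_SLOTS];
        uint32_t          producer;          /* next slot the driver will fill  */
        uint32_t          consumer;          /* last slot the NIC reported done */
        volatile uint32_t *mailbox;          /* mapped NIC register (PIO)       */
    };

    static int tx_post_buffer(struct tx_ring *r, uint64_t paddr, uint16_t len)
    {
        if (r->producer - r->consumer == TX_RING_SLOTS)
            return -1;                       /* ring full */

        struct dma_desc *d = &r->desc[r->producer & (TX_RING_SLOTS - 1)];
        d->paddr = paddr;
        d->len   = len;
        d->flags = 0;

        r->producer++;
        /* PIO write (a real driver would issue a write barrier first): the
         * NIC detects the mailbox update, DMAs the new descriptor(s),
         * transmits, and later writes back its consumer index via DMA and
         * raises an interrupt. */
        *r->mailbox = r->producer;
        return 0;
    }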
In an unvirtualized operating system, the network interface trusts that the device
driver gives it valid DMA descriptors. Similarly, the device driver trusts that the NIC
will use the DMA descriptors correctly. If either entity violates this trust, physical
memory can be corrupted. Xen also requires this trust relationship between the device
driver in the driver domain and the NIC.
4.1.3 Performance
Despite the optimizations within the paravirtualized drivers to support communication between the guest and driver domains (such as using page remapping rather
than copying to transfer packets), Xen introduces significant processing and communication overheads into the network transmit and receive paths. Table 4.1 shows
the networking performance of both native Linux 2.6.16.29 and paravirtualized Linux
2.6.16.29 as a guest operating system within Xen 3 Unstable (changeset 12053:874cc0ff214d from 11/1/2006) on a modern Opteron-based system with six Intel Gigabit Ethernet NICs. In both configurations, checksum offloading, scatter/gather I/O, and TCP Segmentation Offloading (TSO) were
enabled. Support for TSO was recently added to the unstable development branch of
Xen and is not currently available in the Xen 3 release. As the table shows, a guest
domain within Xen is only able to achieve about 30% of the performance of native
Linux. This performance gap strongly motivates the need for networking performance
improvements within Xen.
4.2 CDNA Architecture
With CDNA, the network interface and the hypervisor collaborate to provide the
abstraction that each guest operating system is connected directly to its own network interface. This eliminates many of the overheads of network virtualization in
Xen. Figure 4.2 shows the CDNA architecture. The network interface must support
multiple contexts in hardware. Each context acts as if it is an independent physical
network interface and can be controlled by a separate device driver instance. Instead
of assigning ownership of the entire network interface to the driver domain, the hypervisor treats each context as if it were a physical NIC and assigns ownership of
contexts to guest operating systems. Notice the absence of the driver domain from
the figure: each guest can transmit and receive network traffic using its own private
context without any interaction with other guest operating systems or the driver domain. The driver domain, however, is still present to perform control functions and
allow access to other I/O devices. Furthermore, the hypervisor is still involved in
networking, as it must guarantee memory protection and deliver virtual interrupts to
the guest operating systems.
With CDNA, the communication overheads between the guest and driver domains
and the software multiplexing overheads within the driver domain are eliminated
entirely. However, the network interface now must multiplex the traffic across all of
its active contexts, and the hypervisor must provide protection across the contexts.
The following sections describe how CDNA performs traffic multiplexing, interrupt
delivery, and DMA memory protection.
Figure 4.2: The CDNA shared networking architecture in Xen. (Each guest domain's NIC driver communicates directly with its own context on the CDNA NIC; the hypervisor handles interrupt dispatch and control.)
4.2.1 Multiplexing Network Traffic
CDNA eliminates the software multiplexing overheads within the driver domain
by multiplexing network traffic on the NIC. The network interface must be able to
identify the source or target guest operating system for all network traffic. The network interface accomplishes this by providing independent hardware contexts and
associating a unique Ethernet MAC address with each context. The hypervisor assigns a unique hardware context on the NIC to each guest operating system. The
device driver within the guest operating system then interacts with its context exactly as if the context were an independent physical network interface. As described
in Section 4.1.2, these interactions consist of creating DMA descriptors and updating
a mailbox on the NIC via PIO.
Each context on the network interface therefore must include a unique set of
mailboxes. This isolates the activity of each guest operating system, so that the NIC
can distinguish between the different guests. The hypervisor assigns a context to
a guest simply by mapping the I/O locations for that context’s mailboxes into the
guest’s address space. The hypervisor also notifies the NIC that the context has been
allocated and is active. As the hypervisor only maps each context into a single guest’s
address space, a guest cannot accidentally or intentionally access any context on the
NIC other than its own. When necessary, the hypervisor can also revoke a context at
any time by notifying the NIC, which will shut down all pending operations associated
with the indicated context.
To multiplex transmit network traffic, the NIC simply services all of the hardware
contexts fairly and interleaves the network traffic for each guest. When network
packets are received by the NIC, it uses the Ethernet MAC address to demultiplex
the traffic, and transfers each packet to the appropriate guest using available DMA
descriptors from that guest’s context.
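A simplified sketch of this receive-side demultiplexing step is shown below; struct nic_ctx, ctx_table, and the surrounding names are assumptions made for illustration, not the RiceNIC firmware.

    /* Hypothetical NIC-firmware receive demultiplexing by destination MAC
     * address; struct nic_ctx and ctx_table are illustrative. */
    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define MAX_CONTEXTS 32

    struct nic_ctx {
        bool    active;
        uint8_t mac[6];                      /* MAC assigned to this context */
        /* ... per-context receive descriptor ring state ... */
    };

    static struct nic_ctx ctx_table[MAX_CONTEXTS];

    /* Returns the owning context for a received frame, or -1 if no context
     * matches (e.g., the frame is dropped). */
    static int demux_rx_frame(const uint8_t *frame)
    {
        const uint8_t *dst_mac = frame;      /* Ethernet destination address */
        for (int i = 0; i < MAX_CONTEXTS; i++) {
            if (ctx_table[i].active &&
                memcmp(ctx_table[i].mac, dst_mac, 6) == 0)
                return i;                    /* DMA into this guest's buffers */
        }
        return -1;
    }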
4.2.2 Interrupt Delivery
In addition to isolating the guest operating systems and multiplexing network traffic, the hardware contexts on the NIC must also be able to interrupt their respective
guests. As the NIC carries out network requests on behalf of any particular context,
the CDNA NIC updates that context’s consumer pointers for the DMA descriptor
rings, as described in Section 4.1.2. Normally, the NIC would then interrupt the guest
to notify it that the context state has changed. However, in Xen all physical interrupts are handled by the hypervisor. Therefore, the NIC cannot physically interrupt
the guest operating systems directly. Even if it were possible to interrupt the guests
directly, that could create a much higher interrupt load on the system, which would
decrease the performance benefits of CDNA.
Under CDNA, the NIC keeps track of which contexts have been updated since the
last physical interrupt, encoding this set of contexts in an interrupt bit vector, which
is stored in the hypervisor’s private memory-mapped control context on the NIC. To
signal a set of interrupts to the hypervisor, the NIC raises a physical interrupt, which
invokes the hypervisor’s interrupt service routine (ISR). The hypervisor then reads
the interrupt bit vector from the NIC via programmed I/O. Next, the hypervisor
decodes the vector and schedules virtual interrupts to each of the guest operating
systems that have pending updates from the NIC. Because the Xen scheduler guarantees that these virtual interrupts will be delivered, the hypervisor can immediately
acknowledge the set of interrupts that have been processed. The hypervisor performs
this acknowledgment by writing the processed vector back to the NIC to a separate
acknowledgment location in the hypervisor’s private memory-mapped control context.
After acknowledgment and after the hypervisor’s ISR has run, the hypervisor’s
scheduler will execute and select the next guest operating system to run. When
subsequent guest operating systems are next scheduled by the hypervisor, the CDNA
network interface driver within the guest receives these virtual interrupts that the
hypervisor has sent. The virtual interrupts are received by the paravirtualized guest
as if they were actual physical interrupts from the hardware. At that time, the guest’s
driver examines the updates from the NIC and determines what further action, such
as processing received packets, is required.
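The control flow of the hypervisor's interrupt service routine can be summarized by the sketch below; the register layout, bit-vector width, and send_virtual_irq() helper are hypothetical placeholders rather than actual Xen or CDNA code.

    /* Hypothetical hypervisor ISR for a CDNA-style NIC: read the interrupt
     * bit vector over PIO, post a virtual interrupt to each guest whose
     * context has pending updates, then acknowledge the processed vector. */
    #include <stdint.h>

    #define MAX_CONTEXTS 32

    /* Registers in the hypervisor's private control context (illustrative). */
    struct cdna_ctrl_regs {
        volatile uint32_t irq_vector;        /* which contexts have updates */
        volatile uint32_t irq_ack;           /* write-back acknowledgment   */
    };

    extern int  context_owner[MAX_CONTEXTS]; /* context -> guest domain id  */
    extern void send_virtual_irq(int domain_id);  /* assumed hypervisor hook */

    void cdna_nic_isr(struct cdna_ctrl_regs *regs)
    {
        uint32_t pending = regs->irq_vector; /* PIO read of the bit vector */

        for (int ctx = 0; ctx < MAX_CONTEXTS; ctx++) {
            if (pending & (1u << ctx))
                send_virtual_irq(context_owner[ctx]);
        }

        /* Positive acknowledgment: tell the NIC exactly which contexts'
         * interrupts have been scheduled for delivery. */
        regs->irq_ack = pending;
    }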
During the development of this research, an alternative NIC-to-hypervisor event
notification mechanism was explored. Instead of providing the interrupt vector to
the hypervisor via memory-mapped I/O, the NIC would transfer an interrupt bit
vector into the hypervisor’s memory space using DMA. The interrupt bit vectors
were stored in a circular buffer using a producer/consumer protocol to ensure that
they are processed by the host before being overwritten by the NIC. The vectors would
then be processed identically to the memory-mapped I/O implementation. However,
further examination found that under heavy load, it was possible that the ring buffer
could fill up and interrupts could be lost, even with a very large event ring buffer
of 256 entries. The positive-acknowledgment strategy ensures more reliable delivery
under heavy load, though it does incur some additional overhead. At minimum, the
memory-mapped I/O implementation requires an additional programmed-I/O write
(for the acknowledgment) compared to the ring-buffer implementation. When several
interrupt vectors are processed in one invocation of the hypervisor's ISR, the ring-buffer implementation can save one memory-mapped I/O read per vector processed,
since those vectors are read from host memory rather than from the memory-mapped
location on the NIC. The ring-buffer-based strategy was evaluated in [61].
4.2.3 DMA Memory Protection
In the x86 architecture, network interfaces and other I/O devices use physical
addresses when reading or writing host system memory. The device driver in the
host operating system is responsible for doing virtual-to-physical address translation
for the device. The physical addresses are provided to the network interface through
read and write DMA descriptors as discussed in Section 4.1.2. By exposing physical
addresses to the network interface, the DMA engine on the NIC can be co-opted into
compromising system security by a buggy or malicious driver. There are two key
I/O protection violations that are possible in the x86 architecture. First, the device
driver could instruct the NIC to transmit packets containing a payload from physical
memory that does not contain packets generated by the operating system, thereby
creating a security hole. Second, the device driver could instruct the NIC to receive
packets into physical memory that was not designated as an available receive buffer,
possibly corrupting memory that is in use.
In the conventional Xen network architecture discussed in Section 4.1.2, Xen trusts
the device driver in the driver domain to only use the physical addresses of network
buffers in the driver domain’s address space when passing DMA descriptors to the
network interface. This ensures that all network traffic will be transferred to/from
network buffers within the driver domain. Since guest domains do not interact with
the NIC, they cannot initiate DMA operations, so they are prevented from causing
either of the I/O protection violations in the x86 architecture.
Though the Xen I/O architecture guarantees that untrusted guest domains cannot
induce memory protection violations, any domain that is granted access to an I/O
device by the hypervisor can potentially direct the device to perform DMA operations
that access memory belonging to other guests, or even the hypervisor. The Xen
architecture does not fundamentally solve this security defect but instead limits the
scope of the problem to a single, trusted driver domain [16]. Therefore, as the driver
domain is trusted, it is unlikely to intentionally violate I/O memory protection, but
a buggy driver within the driver domain could do so unintentionally.
This solution is insufficient for the CDNA architecture. In a CDNA system, device
drivers in the guest domains have direct access to the network interface and are able
to pass DMA descriptors with physical addresses to the device. Thus, the untrusted
guests could read or write memory in any other domain through the NIC, unless
additional security features are added. To maintain isolation between guests, the
CDNA architecture validates and protects all DMA descriptors and ensures that a
guest maintains ownership of physical pages that are sources or targets of outstanding
DMA accesses. Although the hypervisor and the network interface share the responsibility for implementing these protection mechanisms, the more complex aspects are
implemented in the hypervisor.
The most important protection provided by CDNA is that it does not allow guest
domains to directly enqueue DMA descriptors into the network interface descriptor
rings. Instead, the device driver in each guest must call into the hypervisor to perform the enqueue operation. This allows the hypervisor to validate that the physical
addresses provided by the guest are, in fact, owned by that guest domain. This prevents a guest domain from arbitrarily transmitting from or receiving into another
guest domain. The hypervisor prevents guest operating systems from independently
enqueueing unauthorized DMA descriptors by establishing the hypervisor’s exclusive
write access to the host memory region containing the CDNA descriptor rings during
driver initialization.
As discussed in Section 4.1.2, conventional I/O devices autonomously fetch and
process DMA descriptors from host memory at runtime. Though hypervisor-managed
validation and enqueuing of DMA descriptors ensures that DMA operations are valid
when they are enqueued, the physical memory could still be reallocated before it is
accessed by the network interface. There are two ways in which such a protection
violation could be exploited by a buggy or malicious device driver. First, the guest
could return the memory to the hypervisor to be reallocated shortly after enqueueing
the DMA descriptor. Second, the guest could attempt to reuse an old DMA descriptor
in the descriptor ring that is no longer valid.
When memory is freed by a guest operating system, it becomes available for reallocation to another guest by the hypervisor. Hence, ownership of the underlying
physical memory can change dynamically at runtime. However, it is critical to prevent any possible reallocation of physical memory during a DMA operation. CDNA
achieves this by delaying the reallocation of physical memory that is being used in
a DMA transaction until after that pending DMA has completed. When the hypervisor enqueues a DMA descriptor, it first establishes that the requesting guest owns
the physical memory associated with the requested DMA. The hypervisor then increments the reference count for each physical page associated with the requested DMA.
This per-page reference counting system already exists within the Xen hypervisor; so
long as the reference count is non-zero, a physical page cannot be reallocated. Later,
the hypervisor observes which DMA operations have completed and decrements
the associated reference counts. For efficiency, the reference counts are only decremented when additional DMA descriptors are enqueued, but there is no reason why
they could not be decremented more aggressively, if necessary.
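In outline, the hypercall that enqueues a descriptor might behave as sketched below; guest_owns_page(), get_page(), ring_next_slot(), and the descriptor layout are illustrative stand-ins for the corresponding Xen facilities, not the actual implementation.

    /* Hypothetical hypervisor-side enqueue of a guest's DMA descriptor:
     * validate ownership, pin the pages by raising their reference counts,
     * then write the descriptor into the ring that only the hypervisor can
     * modify.  All names are illustrative. */
    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12

    struct dma_desc { uint64_t paddr; uint16_t len; uint16_t flags; uint32_t seq; };

    extern bool guest_owns_page(int domid, uint64_t pfn);  /* assumed check  */
    extern void get_page(uint64_t pfn);                    /* ref count ++   */
    extern struct dma_desc *ring_next_slot(int ctx);       /* hypervisor-only */
    extern uint32_t next_seq(int ctx);                     /* per-ring seqno */

    int cdna_enqueue_dma(int domid, int ctx, uint64_t paddr,
                         uint16_t len, uint16_t flags)
    {
        if (len == 0)
            return -1;

        uint64_t first_pfn = paddr >> PAGE_SHIFT;
        uint64_t last_pfn  = (paddr + len - 1) >> PAGE_SHIFT;

        /* Every page touched by the DMA must belong to the requesting guest. */
        for (uint64_t pfn = first_pfn; pfn <= last_pfn; pfn++)
            if (!guest_owns_page(domid, pfn))
                return -1;                   /* reject the request */

        /* Pin the pages so they cannot be reallocated until the DMA has
         * completed; the matching decrement happens after completion is
         * observed. */
        for (uint64_t pfn = first_pfn; pfn <= last_pfn; pfn++)
            get_page(pfn);

        struct dma_desc *d = ring_next_slot(ctx);
        d->paddr = paddr;
        d->len   = len;
        d->flags = flags;
        d->seq   = next_seq(ctx);            /* see the sequence-number check below */
        return 0;
    }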
After enqueuing DMA descriptors, the device driver notifies the NIC by writing
a producer index into a mailbox location within that guest’s context on the NIC.
This producer index indicates the location of the last of the newly created DMA
descriptors. The NIC then assumes that all DMA descriptors up to the location
indicated by the producer index are valid. If the device driver in the guest increments
the producer index past the last valid descriptor, the NIC will attempt to use a stale
DMA descriptor that is in the descriptor ring. Since that descriptor was previously
used in a DMA operation, the hypervisor may have decremented the reference count
on the associated physical memory and reallocated the physical memory.
To prevent such stale DMA descriptors from being used, the hypervisor writes a
strictly increasing sequence number into each DMA descriptor. The NIC then checks
the sequence number before using any DMA descriptor. If the descriptor is valid,
the sequence numbers will be continuous modulo the size of the maximum sequence
number. If they are not, the NIC will refuse to use the descriptors and will report a
guest-specific protection fault error to the hypervisor. Because each DMA descriptor
in the ring buffer gets a new, increasing sequence number, a stale descriptor will have
a sequence number exactly equal to the correct value minus the number of descriptor
slots in the buffer. Making the maximum sequence number at least twice as large as
the number of DMA descriptors in a ring buffer prevents aliasing and ensures that
any stale sequence number will be detected.
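The firmware-side check can be sketched as follows, assuming a hypothetical 16-bit sequence space and a 256-entry ring; the essential property is simply that the sequence space is at least twice the ring size, so a stale descriptor can never alias a valid one.

    /* Hypothetical NIC-firmware validation of a DMA descriptor's sequence
     * number before use.  SEQ_MOD is at least twice RING_SLOTS, so a stale
     * descriptor (off by exactly RING_SLOTS) is always detected. */
    #include <stdint.h>
    #include <stdbool.h>

    #define RING_SLOTS 256
    #define SEQ_MOD    65536u                /* 16-bit sequence space */

    struct ring_state {
        uint32_t expected_seq;               /* next sequence number expected */
    };

    static bool desc_seq_valid(struct ring_state *rs, uint32_t desc_seq)
    {
        if (desc_seq != rs->expected_seq)
            return false;                    /* stale or corrupted descriptor:
                                              * report a protection fault for
                                              * this context to the hypervisor */
        rs->expected_seq = (rs->expected_seq + 1) % SEQ_MOD;
        return true;
    }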
4.2.4 Discussion
The CDNA interrupt delivery mechanism is neither device nor Xen specific. This
mechanism only requires the device to make an interrupt bit vector available to the hypervisor, whether through a memory-mapped control context or via DMA, prior to raising a physical interrupt. This is a relatively simple mechanism from the perspective of the device and is therefore generalizable to a variety of virtualized I/O devices. Furthermore, it does not rely on any Xen-specific
features.
The handling of the DMA descriptors within the hypervisor is linked to a particular network interface only because the format of the DMA descriptors and their
rings is likely to be different for each device. As the hypervisor must validate that
the host addresses referred to in each descriptor belong to the guest operating system
that provided them, the hypervisor must be aware of the descriptor format. Fortunately, there are only three fields of interest in any DMA descriptor: an address, a
length, and additional flags. This commonality should make it possible to generalize
the mechanisms within the hypervisor by having the NIC notify the hypervisor of
its preferred format. The NIC would only need to specify the size of the descriptor
and the location of the address, length, and flags. The hypervisor would not need
to interpret the flags, so they could just be copied into the appropriate location. A
generic NIC would also need to support the use of sequence numbers within each
DMA descriptor. Again, the NIC could notify the hypervisor of the size and location
of the sequence number field within the descriptors.
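One way such a self-describing format could be conveyed is sketched below; the structure and field names are purely illustrative of the idea that the NIC need only report where the address, length, flags, and sequence number fields live within its descriptors.

    /* Hypothetical "descriptor layout" record a NIC could hand to the
     * hypervisor at initialization so that generic hypervisor code can fill
     * in descriptors of any device-specific format. */
    #include <stdint.h>
    #include <string.h>

    struct desc_layout {
        uint16_t desc_size;                  /* total descriptor size in bytes   */
        uint16_t addr_offset;                /* offset of the physical address   */
        uint16_t len_offset;                 /* offset of the length field       */
        uint16_t flags_offset;               /* offset of the opaque flags field */
        uint16_t seq_offset;                 /* offset of the sequence number    */
    };

    /* Generic fill routine: the hypervisor validates 'paddr', copies the
     * guest-supplied flags verbatim, and stamps the sequence number, without
     * ever interpreting the device-specific fields. */
    static void fill_desc(void *desc, const struct desc_layout *l,
                          uint64_t paddr, uint16_t len, uint32_t flags,
                          uint32_t seq)
    {
        memcpy((char *)desc + l->addr_offset,  &paddr, sizeof paddr);
        memcpy((char *)desc + l->len_offset,   &len,   sizeof len);
        memcpy((char *)desc + l->flags_offset, &flags, sizeof flags);
        memcpy((char *)desc + l->seq_offset,   &seq,   sizeof seq);
    }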
CDNA’s DMA memory protection is specific to Xen only insofar as Xen permits
guest operating systems to use physical memory addresses. Consequently, the current
implementation must validate the ownership of those physical addresses for every
requested DMA operation. For VMMs that only permit the guest to use virtual
addresses, the hypervisor could just as easily translate those virtual addresses and
ensure physical contiguity. The current CDNA implementation does not rely on
physical addresses in the guest at all; rather, a small library translates the driver’s
virtual addresses to physical addresses within the guest’s driver before making a
hypercall request to enqueue a DMA descriptor. For VMMs that use virtual addresses,
this library would do nothing.
4.3 CDNA NIC Implementation
To evaluate the CDNA concept in a real system, RiceNIC, a programmable and
reconfigurable FPGA-based Gigabit Ethernet network interface [50], was modified
to provide virtualization support. RiceNIC contains a Virtex-II Pro FPGA with
two embedded 300MHz PowerPC processors, hundreds of megabytes of on-board
SRAM and DRAM memories, a Gigabit Ethernet PHY, and a 64-bit/66 MHz PCI
interface [5]. Custom hardware assist units for accelerated DMA transfers and MAC
packet handling are provided on the FPGA. The RiceNIC architecture is similar to
the architecture of a conventional network interface. With basic firmware and the
appropriate Linux or FreeBSD device driver, it acts as a standard Gigabit Ethernet
network interface that is capable of fully saturating the Ethernet link while only using
one of the two embedded processors.
To support CDNA, both the hardware and firmware of the RiceNIC were modified
to provide multiple protected contexts and to multiplex network traffic. The network
interface was also modified to interact with the hypervisor through a dedicated context
to allow privileged management operations. The modified hardware and firmware
components work together to implement the CDNA interfaces.
To support CDNA, the most significant addition to the network interface is the
specialized use of the 2 MB SRAM on the NIC. This SRAM is accessible via PIO from
the host. For CDNA, 128 KB of the SRAM is divided into 32 partitions of 4 KB each.
Each of these partitions is an interface to a separate hardware context on the NIC.
Only the SRAM can be memory mapped into the host’s address space, so no other
memory locations on the NIC are accessible via PIO. As a context’s memory partition
is the same size as a page on the host system and because the region is page-aligned,
the hypervisor can trivially map each context into a different guest domain’s address
space. The device drivers in the guest domains may use these 4 KB partitions as
general purpose shared memory between the corresponding guest operating system
and the network interface.
Within each context’s partition, the lowest 24 memory locations are mailboxes
that can be used to communicate from the driver to the NIC. When any mailbox
is written by PIO, a global mailbox event is automatically generated by the FPGA
hardware. The NIC firmware can then process the event and efficiently determine
which mailbox and corresponding context has been written by decoding a two-level
hierarchy of bit vectors. All of the bit vectors are generated automatically by the
hardware and stored in a data scratchpad for high speed access by the processor. The
first bit vector in the hierarchy determines which of the 32 potential contexts have
updated mailbox events to process, and the second vector in the hierarchy determines
which mailbox(es) in a particular context have been updated. Once the specific
mailbox has been identified, that off-chip SRAM location can be read by the firmware
and the mailbox information processed.
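A sketch of that two-level decode is shown below; the scratchpad variables and helper functions are hypothetical names, but the scan order mirrors the description above: the context-level vector is examined first, then the per-context mailbox vector.

    /* Hypothetical firmware decode of the two-level mailbox event hierarchy:
     * level 1 says which contexts have events, level 2 says which mailboxes
     * within a context were written.  Helper functions are illustrative. */
    #include <stdint.h>

    #define MAX_CONTEXTS      32
    #define MAILBOXES_PER_CTX 24

    extern volatile uint32_t ctx_event_vec;                  /* level 1 (scratchpad) */
    extern volatile uint32_t mbox_event_vec[MAX_CONTEXTS];   /* level 2 (scratchpad) */
    extern uint32_t read_mailbox(int ctx, int mbox);         /* off-chip SRAM read   */
    extern void     handle_mailbox(int ctx, int mbox, uint32_t value);
    extern void     clear_ctx_events(int ctx, uint32_t mask);

    void process_mailbox_events(void)
    {
        uint32_t ctxs = ctx_event_vec;
        while (ctxs) {
            int ctx = __builtin_ctz(ctxs);               /* lowest set bit */
            ctxs &= ctxs - 1;

            uint32_t mboxes  = mbox_event_vec[ctx];
            uint32_t handled = mboxes;
            while (mboxes) {
                int mbox = __builtin_ctz(mboxes);
                mboxes &= mboxes - 1;
                handle_mailbox(ctx, mbox, read_mailbox(ctx, mbox));
            }
            /* One event-clear message can retire several events at once. */
            clear_ctx_events(ctx, handled);
        }
    }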
The mailbox event and associated hierarchy of bit vectors are managed by a
small hardware core that snoops data on the SRAM bus and dispatches notification
messages when a mailbox is updated. A small state machine decodes these messages
and incrementally updates the data scratchpad with the modified bit vectors. This
state machine also handles event-clear messages from the processor that can clear
multiple events from a single context at once.
Each context requires 128 KB of storage on the NIC for metadata, such as the rings
of transmit- and receive-DMA descriptors provided by the host operating systems.
Furthermore, each context uses 128 KB of memory on the NIC for buffering transmit
packet data and 128 KB for receive packet data. However, the NIC’s transmit and
receive packet buffers are each managed globally, and hence packet buffering is shared
across all contexts.
System   NIC       Mb/s    Hyp      Driver OS  Driver User  Guest OS  Guest User   Idle     Driver Dom. Intr/s  Guest OS Intr/s
Xen      Intel     1602    19.8%    35.7%      0.8%         39.7%     1.0%         3.0%     7,438               7,853
Xen      RiceNIC   1674    13.7%    41.5%      0.5%         39.5%     1.0%         3.8%     8,839               5,661
CDNA     RiceNIC   1865    10.8%    0.1%       0.2%         42.7%     1.7%         44.5%    0                   13,903

Table 4.2: Transmit performance for a single guest with 2 NICs using Xen and CDNA.
System   NIC       Mb/s    Hyp      Driver OS  Driver User  Guest OS  Guest User   Idle     Driver Dom. Intr/s  Guest OS Intr/s
Xen      Intel     1112    25.7%    36.8%      0.5%         31.0%     1.0%         5.0%     11,138              5,193
Xen      RiceNIC   1075    30.6%    39.4%      0.6%         28.8%     0.6%         0%       10,946              5,163
CDNA     RiceNIC   1850    9.9%     0.2%       0.2%         52.6%     0.6%         36.5%    0                   7,484

Table 4.3: Receive performance for a single guest with 2 NICs using Xen and CDNA.
The modifications to the RiceNIC to support CDNA were minimal. The major
hardware change was the additional mailbox storage and handling logic. This could
easily be added to an existing NIC without interfering with the normal operation
of the network interface—unvirtualized device drivers would use a single context’s
mailboxes to interact with the base firmware. Furthermore, the computation and
storage requirements of CDNA are minimal. Only one of the RiceNIC’s two embedded
processors is needed to saturate the network, and only 12 MB of memory on the NIC
is needed to support 32 contexts. Therefore, with minor modifications, commodity
network interfaces could easily provide sufficient computation and storage resources
to support CDNA.
4.4 Evaluation
4.4.1 Experimental Setup
The performance of Xen and CDNA network virtualization was evaluated on
an AMD Opteron-based system running Xen 3 Unstable (changeset 12053:874cc0ff214d from 11/1/2006). This system used a
Tyan S2882 motherboard with a single Opteron 250 processor and 4GB of DDR400
SDRAM. Xen 3 Unstable was used because it provides the latest support for high-performance networking, including TCP segmentation offloading, and the most recent
version of Xenoprof [39] for profiling the entire system.
In all experiments, the driver domain was configured with 256 MB of memory
and each of the 24 guest domains was configured with 128 MB of memory. Each guest
domain ran a stripped-down Linux 2.6.16.29 kernel with minimal services for memory
efficiency and performance. For the base Xen experiments, a single dual-port Intel
Pro/1000 MT NIC was used in the system. In the CDNA experiments, two RiceNICs
configured to support CDNA were used in the system. Linux TCP parameters and
NIC coalescing options were tuned in the driver domain and guest domains for optimal
performance. For all experiments, checksum offloading and scatter/gather I/O were
enabled. TCP segmentation offloading was enabled for experiments using the Intel
NICs, but disabled for those using the RiceNICs due to lack of support. The Xen
system was set up to communicate with a similar Opteron system that was running a
native Linux kernel. This system was tuned so that it could easily saturate two NICs
both transmitting and receiving so that it would never be the bottleneck in any of
the tests.
To validate the performance of the CDNA approach, multiple simultaneous connections across multiple NICs to multiple guest domains were needed. A multithreaded, event-driven, lightweight network benchmark program was developed to
distribute traffic across a configurable number of connections. The benchmark program balances the bandwidth across all connections to ensure fairness and uses a
single buffer per thread to send and receive data to minimize the memory footprint
and improve cache performance.
4.4.2 Single Guest Performance
Tables 4.2 and 4.3 show the transmit and receive performance of a single guest
operating system over two physical network interfaces using Xen and CDNA. The
first two rows of each table show the performance of the Xen I/O virtualization
architecture using both the Intel and RiceNIC network interfaces. The third row of
each table shows the performance of the CDNA I/O virtualization architecture.
The Intel network interface can only be used with Xen through the use of software
virtualization. However, the RiceNIC can be used with both CDNA and software virtualization. To use the RiceNIC interface with software virtualization, a context was
assigned to the driver domain and no contexts were assigned to the guest operating
system. Therefore, all network traffic from the guest operating system is routed via
the driver domain as it normally would be, through the use of software virtualization.
Within the driver domain, all of the mechanisms within the CDNA NIC are used
identically to the way they would be used by a guest operating system when configured to use concurrent direct network access. As the tables show, the Intel network
interface performs similarly to the RiceNIC network interface. Therefore, the benefits
achieved with CDNA are the result of the CDNA I/O virtualization architecture, not
the result of differences in network interface performance.
Note that in Xen the interrupt rate for the guest is not necessarily the same as it is
for the driver. This is because the back-end driver within the driver domain attempts
to interrupt the guest operating system whenever it generates new work for the front-end driver. This can happen at a higher or lower rate than the actual interrupt rate
generated by the network interface depending on a variety of factors, including the
number of packets that traverse the Ethernet bridge each time the driver domain is
scheduled by the hypervisor.
DMA Protection   Mb/s    Hyp      Driver OS  Driver User  Guest OS  Guest User   Idle     Driver Dom. Intr/s  Guest OS Intr/s
Enabled          1865    10.8%    0.1%       0.2%         42.7%     1.7%         44.5%    0                   13,903
Disabled         1865    1.9%     0.2%       0.2%         37.0%     1.8%         58.9%    0                   14,202

Table 4.4: CDNA 2-NIC transmit performance with and without DMA memory protection.
DMA Protection   Mb/s    Hyp      Driver OS  Driver User  Guest OS  Guest User   Idle     Driver Dom. Intr/s  Guest OS Intr/s
Enabled          1850    9.9%     0.2%       0.2%         52.6%     0.6%         36.5%    0                   7,484
Disabled         1850    2.2%     0.2%       0.3%         49.5%     0.8%         47.0%    0                   7,616

Table 4.5: CDNA 2-NIC receive performance with and without DMA memory protection.
Table 4.2 shows that using all of the available processing resources, Xen’s software
virtualization is not able to transmit at line rate over two network interfaces with either the Intel hardware or the RiceNIC hardware. However, only 41% of the processor
is used by the guest operating system. The remaining resources are consumed by Xen
overheads—using the Intel hardware, approximately 20% in the hypervisor and 37%
in the driver domain performing software multiplexing and other tasks.
As the table shows, CDNA is able to saturate two network interfaces, whereas
traditional Xen networking cannot. Additionally, CDNA performs far more efficiently,
with 45% processor idle time. The increase in idle time is primarily the result of two
factors. First, nearly all of the time spent in the driver domain is eliminated. The
remaining time spent in the driver domain is unrelated to networking tasks. Second,
the time spent in the hypervisor is decreased. With Xen, the hypervisor spends the
bulk of its time managing the interactions between the front-end and back-end virtual
network interface drivers. CDNA eliminates these communication overheads with the
driver domain, so the hypervisor instead spends the bulk of its time managing DMA
memory protection.
Table 4.3 shows the receive performance of the same configurations. Receiving
network traffic requires more processor resources, so Xen only achieves 1112 Mb/s
with the Intel network interface, and slightly lower with the RiceNIC interface. Again,
Xen overheads consume the bulk of the time, as the guest operating system only
consumes about 32% of the processor resources when using the Intel hardware.
As the table shows, not only is CDNA able to saturate the two network interfaces,
it does so with 37% idle time. Again, nearly all of the time spent in the driver domain
is eliminated. As with the transmit case, the CDNA architecture permits the hypervisor to spend its time performing DMA memory protection rather than managing
higher-cost interdomain communications as is required using software virtualization.
In summary, the CDNA I/O virtualization architecture provides significant performance improvements over Xen for both transmit and receive. On the transmit
side, CDNA requires half the processor resources to deliver about 200 Mb/s higher
throughput. On the receive side, CDNA requires 63% of the processor resources to
deliver about 750 Mb/s higher throughput.
4.4.3 Memory Protection
The software-based protection mechanisms in CDNA can potentially be replaced
by a hardware IOMMU. For example, AMD has proposed an IOMMU architecture
for virtualization that restricts the physical memory that can be accessed by each
device [2]. AMD’s proposed architecture provides memory protection as long as each
device is only accessed by a single domain. For CDNA, such an IOMMU would have
to be extended to work on a per-context basis, rather than a per-device basis. This
would also require a mechanism to indicate a context for each DMA transfer. Since
CDNA only distinguishes between guest operating systems and not traffic flows, there
are a limited number of contexts, which may make a generic system-level context-aware IOMMU practical.
Tables 4.4 and 4.5 show the performance of the CDNA I/O virtualization architecture both with and without DMA memory protection under transmit and receive tests, respectively. (The performance of CDNA with DMA memory protection enabled was replicated from Tables 4.2 and 4.3 for comparison purposes.) By disabling
DMA memory protection, the performance of the modified CDNA system establishes
an upper bound on achievable performance in a system with an appropriate IOMMU.
However, there would be additional hypervisor overhead to manage the IOMMU that
is not accounted for by this experiment. Since CDNA can already saturate two network interfaces for both transmit and receive traffic, the effect of removing DMA
protection is to increase the idle time by about 10–15%, depending on the workload.
As the tables show, this increase in idle time is the direct result of reducing the number of hypercalls from the guests and the time spent in the hypervisor performing
protection operations.
Even as systems begin to provide IOMMU support for techniques such as CDNA,
older systems will continue to lack such features. In order to generalize the design
of CDNA for systems with and without an appropriate IOMMU, wrapper functions
could be used around the hypercalls within the guest device drivers. The hypervisor
must notify the guest whether or not there is an IOMMU. When no IOMMU is
present, the wrappers would simply call the hypervisor, as described here. When
an IOMMU is present, the wrapper would instead create DMA descriptors without
hypervisor intervention and only invoke the hypervisor to set up the IOMMU. Such
wrappers already exist in modern operating systems to deal with such IOMMU issues.
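The wrapper idea can be expressed roughly as follows; cdna_post_dma(), cdna_hypercall_enqueue(), and iommu_map_and_enqueue() are hypothetical stand-ins rather than real Xen hypercalls, and the IOMMU path is shown only schematically.

    /* Hypothetical guest-driver wrapper that hides whether DMA protection is
     * enforced by hypervisor-validated descriptors or by an IOMMU.  Both
     * callees are illustrative placeholders. */
    #include <stdint.h>
    #include <stdbool.h>

    extern bool hypervisor_reports_iommu(void);     /* assumed capability query */
    extern int  cdna_hypercall_enqueue(int ctx, uint64_t paddr,
                                       uint16_t len, uint16_t flags);
    extern int  iommu_map_and_enqueue(int ctx, uint64_t paddr,
                                      uint16_t len, uint16_t flags);

    int cdna_post_dma(int ctx, uint64_t paddr, uint16_t len, uint16_t flags)
    {
        if (hypervisor_reports_iommu()) {
            /* The guest builds the descriptor itself and only asks the
             * hypervisor to install the corresponding IOMMU mapping. */
            return iommu_map_and_enqueue(ctx, paddr, len, flags);
        }
        /* No IOMMU: the hypervisor validates and enqueues the descriptor. */
        return cdna_hypercall_enqueue(ctx, paddr, len, flags);
    }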
4.4.4 Scalability
Figures 4.3 and 4.4 show the aggregate transmit and receive throughput, respectively, of Xen and CDNA with two network interfaces as the number of guest operating systems varies. The percentage of CPU idle time is also plotted above each data
point. CDNA outperforms Xen for both transmit and receive both for a single guest,
as previously shown in Tables 4.2 and 4.3, and as the number of guest operating
systems is increased.
Figure 4.3: Transmit throughput for Xen and CDNA (with CDNA idle time). (Aggregate transmit throughput in Mb/s versus the number of guests, from 1 to 24, for CDNA/RiceNIC and Xen/Intel; CDNA idle time is annotated above each data point.)
Figure 4.4: Receive throughput for Xen and CDNA (with CDNA idle time). (Aggregate receive throughput in Mb/s versus the number of guests, from 1 to 24, for CDNA/RiceNIC and Xen/Intel; CDNA idle time is annotated above each data point.)
As the figures show, the performance of both CDNA and software virtualization
degrades as the number of guests increases. For Xen, this results in declining bandwidth, but the marginal reduction in bandwidth decreases with each increase in the
number of guests. For CDNA, while the bandwidth remains constant, the idle time
decreases to zero. Despite the fact that there is no idle time for 8 or more guests,
CDNA is still able to maintain constant bandwidth. This is consistent with the leveling of the bandwidth achieved by software virtualization. Therefore, it is likely that
with more CDNA NICs, the throughput curve would have a similar shape to that
of software virtualization, but with a much higher peak throughput when using 1–4
guests.
These results clearly show that not only does CDNA deliver better network performance for a single guest operating system within Xen, but it also maintains significantly higher bandwidth as the number of guest operating systems is increased. With
24 guest operating systems, CDNA’s transmit bandwidth is a factor of 2.1 higher than
Xen’s and CDNA’s receive bandwidth is a factor of 3.3 higher than Xen’s.
Chapter 5
Protection Strategies for Direct I/O in Virtual
Machine Monitors
As the CDNA architecture shows, direct I/O access by guest operating systems
can significantly improve performance. Preferably, guest operating systems within
a virtual machine monitor would be able to directly access all I/O devices without
the need for the data to traverse an intermediate software layer within the virtual
machine monitor [45, 60]. However, if a guest can directly access an I/O device, then
it can potentially direct the device to access memory that it does not own via direct
memory access (DMA). Therefore, the virtual machine monitor must still be able to
ensure that guest operating systems do not access each other’s memory indirectly
through the shared I/O devices in the system. Both IOMMUs [10] and software-based methods (as established in the previous chapter) can provide DMA memory
protection for the virtual machine monitor. They do so by preventing guest operating
systems from directing I/O devices to access memory that is not owned by that guest,
while still allowing the guest to directly access the device.
This is the first experimental study to perform a head-to-head comparison of DMA memory protection strategies supporting direct access to I/O devices
from untrusted guest operating systems within a virtual machine monitor. Specifically, three hardware IOMMU-based strategies and one software-based strategy are
explored. The first IOMMU-based strategy uses single-use I/O memory mappings
that are created before each I/O operation and immediately destroyed after each I/O
operation. The second IOMMU-based strategy uses shared I/O memory mappings
that can be reused by multiple, concurrent I/O operations. The third IOMMU-based
strategy uses persistent I/O memory mappings that can be reused until they need
to be reclaimed to create new mappings. Finally, the software-based strategy uses
validated DMA descriptors that can only be used for one I/O operation.
The comparison of these four strategies yields several insights. First, all four
strategies provide equivalent protection between guest operating systems for direct
access to shared I/O devices in a virtual machine monitor. All of the techniques
prevent a guest operating system from directing the device to access memory that
does not belong to that guest. The traditional single-use strategy, however, provides
this protection at the greatest cost. Second, there is significant opportunity to reuse
IOMMU mappings which can reduce the cost of providing protection. Multiple concurrent I/O operations are able to share the same mappings often enough that there
is a noticeable decrease in the overhead of providing protection. That overhead can
further be decreased by allowing mappings to persist so that they can also be reused
by future I/O operations. Finally, the software-based protection strategy performs
comparably to the best of the IOMMU-based strategies.
The next section provides background on how I/O devices access main memory and the possible memory protection violations that can occur when doing so.
Sections 5.2 and 5.3 discuss the three IOMMU-based protection strategies and the
one software-based protection strategy. Section 5.4 then describes the protection
properties afforded by the four strategies. Section 5.5 describes the experimental
methodology and Section 5.6 evaluates the protection strategies.
5.1 Background
Modern server I/O devices, including disk and network controllers, utilize direct
memory access (DMA) to move data between the host’s main memory and the device’s
on-board buffers. The device uses DMA to access memory independently of the host
CPU, so such accesses must be controlled and protected. To initiate a DMA operation,
the device driver within the operating system creates DMA descriptors that refer to
regions of memory. Each DMA descriptor typically includes an address, a length,
and a few device-specific flags. In commodity x86 systems, devices lack support for
virtual-to-physical address translation, so DMA descriptors always contain physical
addresses for main memory. Once created, the device driver passes the descriptors
to the device, which will later use the descriptors to transfer data to or from the
indicated memory regions autonomously. When the requested I/O operations have
been completed, the device raises an interrupt to notify the device driver.
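To make the descriptor format concrete, a DMA descriptor of the kind described above might be declared as follows. This is a minimal sketch in C; the field names and flag values are hypothetical and do not correspond to any particular device.

#include <stdint.h>

/* Hypothetical DMA descriptor: a physical buffer address, a length, and
 * a few device-specific flags, as described in the text. */
struct dma_descriptor {
    uint64_t phys_addr;  /* physical address of the host memory buffer */
    uint32_t length;     /* number of bytes to transfer */
    uint32_t flags;      /* device-specific flags */
};

/* Example (hypothetical) flag values. */
#define DESC_FLAG_END_OF_PACKET 0x1
#define DESC_FLAG_INTR_ON_DONE  0x2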
For example, to transmit a network packet, the network interface’s device driver
might create two DMA descriptors. The first descriptor might point to the packet
headers and the second descriptor might point to the packet payload. Once created,
the device driver would then notify the network interface that there are new DMA
descriptors available. The precise mechanism of that notification depends on the
particular network interface, but typically involves a programmed I/O operation to
the device telling it the location of the new descriptors. The network interface would
then retrieve the descriptors from main memory using DMA—if they were not written
to the device directly by programmed I/O. The network interface would then retrieve
the two memory regions that compose the network packet and transmit them over
the network. Finally, the network interface would interrupt the host to indicate that
the packet has been transmitted. In practice, notifications from the device driver and
interrupts from the network interface would likely be aggregated to cover multiple
packets for efficiency.
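As a sketch of this transmit sequence, the driver code below fills the two descriptors and then performs the programmed I/O notification. It reuses the hypothetical struct dma_descriptor from the previous sketch; virt_to_phys() and nic_doorbell_write() are placeholder helpers, not the interface of any real network interface, and ring wraparound is omitted for brevity.

/* Placeholder helpers (hypothetical). */
uint64_t virt_to_phys(void *vaddr);
void nic_doorbell_write(uint32_t new_tail);

/* Post one packet as a header descriptor plus a payload descriptor,
 * then notify the NIC of the new tail index via programmed I/O. */
void transmit_packet(struct dma_descriptor *tx_ring, uint32_t *tail,
                     void *hdr, uint32_t hdr_len,
                     void *payload, uint32_t payload_len)
{
    tx_ring[*tail].phys_addr = virt_to_phys(hdr);
    tx_ring[*tail].length    = hdr_len;
    tx_ring[*tail].flags     = 0;

    tx_ring[*tail + 1].phys_addr = virt_to_phys(payload);
    tx_ring[*tail + 1].length    = payload_len;
    tx_ring[*tail + 1].flags     = DESC_FLAG_END_OF_PACKET |
                                   DESC_FLAG_INTR_ON_DONE;

    *tail += 2;
    nic_doorbell_write(*tail);  /* programmed I/O: tell the NIC where the
                                   new descriptors are */
}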
Three potential memory access violations can occur on every I/O transfer initiated
using this DMA architecture:
1. The device driver could create a DMA descriptor with an incorrect address.
2. The memory referenced by the DMA descriptor could be repurposed after the
descriptor was created by the device driver, but before it is used by the device.
3. The device itself could initiate a DMA transfer to a memory address not referenced by the DMA descriptor.
These violations could occur either because of failures or because of malicious intent.
However, as devices are typically not user-programmable, the last type of violation is
only likely to occur as a result of a hardware or software failure on the device.
In a non-virtualized environment, the operating system is responsible for preventing the first two types of memory access violations. This requires the operating
system to trust the device driver to create the correct DMA descriptors and to pin
physical memory used by I/O devices. A failure of the operating system to prevent
these memory access violations could potentially result in system failure. In a virtualized environment, however, the virtual machine monitor cannot trust the guest
operating systems to prevent these memory access violations, as a memory access
violation incurred by one guest operating system can potentially harm other guest
operating systems or even bring down the whole system. Therefore, a virtual machine
monitor requires mechanisms to prevent one guest operating system from intentionally or accidentally directing an I/O device to access the memory of another guest
operating system. The only way that would be possible is via one of the first two
types of memory access violations. Depending on the reliability of the I/O devices, it
may also be desirable to try to prevent the third type of memory access violation, as
well (although it is frequently not possible to protect against a misbehaving device,
as will be discussed in Section 5.4). The following sections describe mechanisms and
strategies for preventing these memory access violations.
5.2 IOMMU-based Protection
A virtual machine monitor can utilize an I/O memory management unit (IOMMU)
to help provide DMA memory protection when allowing direct access to I/O devices.
Whereas a virtual memory management unit enforces access control and provides
address translation services for software as it accesses memory, an IOMMU enforces
access control and provides address translation services for I/O devices as they access
memory. The IOMMU uses page table entries (PTEs) that each specify translation
from an I/O address to a physical memory address and specify access control (such
as which devices are permitted to use the given PTE).
An IOMMU only permits I/O devices to access memory for which a valid mapping
exists in the IOMMU page table. Thus, in an IOMMU-based system, there must be
a valid IOMMU translation for each host memory buffer to be used in an upcoming
DMA descriptor. Otherwise, the DMA descriptor will refer to a region unmapped by
the IOMMU, and the I/O transaction will fail.
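The check performed by the IOMMU on each device access can be sketched roughly as follows. This is a conceptual illustration only; actual IOMMU page-table formats, such as those defined by AMD and Intel, differ, and the field names here are hypothetical.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IO_PAGE_SHIFT 12
#define IO_PAGE_SIZE  (1u << IO_PAGE_SHIFT)

/* Hypothetical IOMMU page table entry: translation plus access control. */
struct iommu_pte {
    uint64_t host_phys_page;  /* physical page frame the I/O address maps to */
    uint16_t allowed_device;  /* which device may use this mapping */
    bool     valid;
    bool     writable;
};

/* Conceptual per-access check: the DMA fails unless a valid mapping with
 * the right owner and permission exists for the I/O address. */
bool iommu_translate(const struct iommu_pte *table, size_t num_entries,
                     uint64_t io_addr, uint16_t device, bool is_write,
                     uint64_t *host_phys)
{
    size_t idx = io_addr >> IO_PAGE_SHIFT;

    if (idx >= num_entries || !table[idx].valid)
        return false;                      /* unmapped: transaction fails */
    if (table[idx].allowed_device != device)
        return false;                      /* access-control violation */
    if (is_write && !table[idx].writable)
        return false;

    *host_phys = (table[idx].host_phys_page << IO_PAGE_SHIFT) |
                 (io_addr & (IO_PAGE_SIZE - 1));
    return true;
}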
The following subsections present three strategies for using an IOMMU to provide
DMA memory protection in a virtual machine monitor. The strategies primarily differ
in the extent to which IOMMU mappings are allowed to be reused.
5.2.1 Single-use Mappings
A common strategy for managing an IOMMU is to create a single-use mapping
for each I/O transaction. The Linux DMA-Mapping interface, for example, implements
a single-use mapping strategy. Ben-Yehuda et al. also explored a single-use mapping strategy in the context of virtual machine monitors [10]. In such a single-use
strategy, the driver must ensure that a new IOMMU mapping is created for each
DMA descriptor. The IOMMU mapping is then destroyed once the corresponding
I/O transaction has completed. In a virtualized system, the trusted virtual machine
monitor is responsible for creating and destroying IOMMU mappings at the driver’s
request. If the VMM does not create the mapping, either because the driver did not
request it or because the request referred to memory not owned by the guest, then
the device will be unable to perform the corresponding DMA operation.
To carry out an I/O transaction using a single-use mapping strategy, the virtual
machine monitor (VMM), untrusted guest operating system (GOS), and the device
(DEV) carry out the following steps (a code sketch of the guest-side flow appears after the list):
1. GOS: The guest OS requests an IOMMU mapping for the memory buffer involved in the I/O transaction.
2. VMM: The VMM validates that the requesting guest OS has appropriate read
or write permission for each memory page in the buffer to be mapped.
3. VMM: The VMM marks the memory buffer as “in I/O use”, which prevents the
buffer from being reallocated to another guest OS during an I/O transaction.
4. VMM: The VMM creates one or more IOMMU mappings for the buffer. As
with virtual memory management units, one mapping is usually required for
each memory page in the buffer.
5. GOS: The guest OS creates a DMA descriptor with the IOMMU-mapped address that was returned by the VMM.
6. DEV: The device carries out its I/O transaction as directed by the DMA descriptor and it notifies the driver upon completion.
7. GOS: The driver requests destruction of the corresponding IOMMU mapping(s).
8. VMM: The VMM validates that the mappings belong to the guest OS making
the request.
9. VMM: The VMM destroys the IOMMU mappings.
10. VMM: The VMM clears the “in I/O use” marker associated with each memory
page referred to by the recently-destroyed mapping(s).
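The guest-side portion of these steps can be sketched as follows. The hypercall wrappers (vmm_iommu_map(), vmm_iommu_unmap()) and the descriptor helpers are hypothetical placeholders rather than Xen's actual interfaces; the numbered comments refer to the steps above.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hypercall wrappers and driver helpers. */
int  vmm_iommu_map(void *buf, uint32_t len, bool is_write, uint64_t *io_addr);
int  vmm_iommu_unmap(uint64_t io_addr, uint32_t len);
void post_descriptor(uint64_t io_addr, uint32_t len);
void wait_for_completion(void);

/* One I/O transaction under the single-use strategy. */
int single_use_io(void *buf, uint32_t len, bool is_write)
{
    uint64_t io_addr;
    int err;

    /* Steps 1-4: the VMM validates ownership, marks the buffer
     * "in I/O use", and creates the IOMMU mapping(s). */
    err = vmm_iommu_map(buf, len, is_write, &io_addr);
    if (err)
        return err;   /* no mapping: the device cannot perform the DMA */

    /* Step 5: create a DMA descriptor with the IOMMU-mapped address. */
    post_descriptor(io_addr, len);

    /* Step 6: the device performs the transfer and signals completion. */
    wait_for_completion();

    /* Steps 7-10: validate and destroy the mapping, clearing the
     * "in I/O use" marker. */
    return vmm_iommu_unmap(io_addr, len);
}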
5.2.2 Shared Mappings
Rather than creating a new IOMMU mapping for each new DMA descriptor,
it is possible to share a mapping among DMA descriptors so long as the mapping
points to the same underlying memory page and remains valid. Sharing IOMMU
mappings is advantageous because it avoids the overhead of creating and destroying
a new mapping for each I/O request by instead reusing an existing mapping. To
implement sharing, the guest operating system must keep track of which IOMMU
mappings are currently valid, and it must keep track of how many pending I/O
requests are currently using the mapping. To protect a guest’s memory from errant
device accesses, an IOMMU mapping should be destroyed once all outstanding I/O
requests that use the mapping have been completed. Though the untrusted guest
operating system has responsibilities for carrying out a shared-mapping strategy,
it need not function correctly to ensure isolation among operating systems, as is
discussed further in Section 5.4.
To carry out a shared-mapping strategy, the guest OS and the VMM perform many
of the same steps as required by the single-use strategy. The shared-mapping strategy
differs at the initiation and termination of an I/O transaction. Before step 1 would
occur in the single-use strategy, the guest operating system first queries a table of
known, valid IOMMU mappings to see if a mapping for the I/O memory buffer already
exists. If so, the driver uses the previously established IOMMU-mapped address for
a DMA descriptor, and then passes the descriptor to the device, in effect skipping
steps 1–4. If not, the guest and VMM follow steps 1–4 to create a new mapping.
Whether a new mapping is created or not, before step 5, the guest operating system
increments its own reference count for the mapping (or sets it to one for a new
mapping). This reference count is separate from the reference count maintained by
the VMM.
Steps 5 and 6 then proceed as in the single-use strategy. After these steps have
completed, the driver calls the guest operating system to decrement its reference
count. If the reference count is zero, no other I/O transactions are in progress that
are using this mapping, and it is appropriate to call the VMM to destroy the mapping
as in steps 7–10 of the single-use strategy. Otherwise, the IOMMU mapping is still
being used by another I/O transaction within the guest OS, so steps 7–10 are skipped.
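A sketch of the guest-side bookkeeping for shared mappings follows. The mapping-table helpers and hypercall wrappers are hypothetical, and the 4 KB page size is assumed to be the mapping granularity; the numbered comments again refer to the single-use steps.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct io_mapping {
    uint64_t io_addr;   /* IOMMU-mapped address for one guest page */
    unsigned refcount;  /* pending I/O transactions using this mapping */
};

/* Hypothetical helpers: guest-side mapping table and hypercall wrappers. */
struct io_mapping *mapping_lookup(void *page);
struct io_mapping *mapping_insert(void *page);
void mapping_remove(void *page);
int  vmm_iommu_map(void *page, uint32_t len, bool is_write, uint64_t *io_addr);
int  vmm_iommu_unmap(uint64_t io_addr, uint32_t len);

/* Obtain an IOMMU-mapped address for a page, reusing a live mapping. */
uint64_t shared_map_get(void *page, bool is_write)
{
    struct io_mapping *m = mapping_lookup(page);
    if (m == NULL) {                             /* no valid mapping yet */
        m = mapping_insert(page);
        vmm_iommu_map(page, PAGE_SIZE, is_write, &m->io_addr); /* steps 1-4 */
        m->refcount = 0;
    }
    m->refcount++;                 /* guest-side count of pending I/O */
    return m->io_addr;
}

/* Called when an I/O transaction using the page completes. */
void shared_map_put(void *page)
{
    struct io_mapping *m = mapping_lookup(page);
    if (--m->refcount == 0) {                    /* last pending I/O done */
        vmm_iommu_unmap(m->io_addr, PAGE_SIZE);  /* steps 7-10 */
        mapping_remove(page);
    }
}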
5.2.3 Persistent Mappings
IOMMU mappings can further be reused by allowing them to persist even after all
I/O transactions using the mapping have completed. Compared to a shared mapping
strategy, such a persistent mapping strategy attempts to further reduce the overhead
associated with creating and destroying IOMMU mappings inside the VMM. Whereas
sharing exploits reuse among mappings only when a mapping is being actively used
by at least one I/O transaction, persistence exploits temporal reuse across periods of
inactivity.
The infrastructure and mechanisms for implementing a persistent mapping strategy are similar to those required by a shared mapping strategy. The primary difference
is that the guest operating system does not request that mappings be destroyed after
the I/O transactions using them complete. In effect, this means that mappings persist
until they must be recycled. Therefore, in contrast to the shared mapping strategy,
when the guest’s reference count is decremented after step 6, the I/O transaction is
complete and steps 7–10 are always skipped. This should dramatically reduce the
number of hypercalls into the VMM.
As mappings are now persistent, they must be recycled whenever a new mapping
is needed. This changes the behavior of step 1 when compared to the shared mapping
case. Before performing step 1, as in the shared mapping case, the guest operating
system first queries a table of known, valid IOMMU mappings to see if a mapping
for the I/O memory buffer already exists. If one does not, a new mapping is needed
and the guest operating system must select an idle mapping to be recycled. In step 1,
the guest then passes this idle mapping to the virtual machine monitor along with
the request to create a new mapping. Steps 8, 10, and 2–4 are then performed by
the VMM to modify the mapping(s) for use by the new I/O transaction. Note that
step 9 can be skipped, as one valid mapping is going to be immediately replaced by
another valid mapping.
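Relative to the shared-mapping sketch above, the persistent strategy changes only the miss path and the completion path, roughly as follows. The helpers remain hypothetical; vmm_iommu_remap() stands in for the combined steps 8, 10, and 2-4 performed by the VMM.

/* Hypothetical helpers for the persistent strategy (types and the
 * lookup helper are reused from the shared-mapping sketch above). */
struct io_mapping *pick_idle_mapping(void);       /* a refcount == 0 victim */
void mapping_retarget(struct io_mapping *m, void *page);
int  vmm_iommu_remap(uint64_t io_addr, void *new_page,
                     uint32_t len, bool is_write);

uint64_t persistent_map_get(void *page, bool is_write)
{
    struct io_mapping *m = mapping_lookup(page);
    if (m == NULL) {
        /* Miss: recycle an idle mapping instead of creating a fresh one.
         * The VMM performs steps 8, 10, and 2-4; step 9 is skipped since
         * one valid mapping immediately replaces another. */
        m = pick_idle_mapping();
        vmm_iommu_remap(m->io_addr, page, PAGE_SIZE, is_write);
        mapping_retarget(m, page);
        m->refcount = 0;
    }
    m->refcount++;
    return m->io_addr;
}

void persistent_map_put(void *page)
{
    struct io_mapping *m = mapping_lookup(page);
    m->refcount--;      /* the mapping persists; steps 7-10 are skipped */
}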
5.3 Software-based Protection
IOMMU-based protection strategies enforce safety even when untrusted software
provides unverified DMA descriptors directly to hardware, because the DMA operations generated by any device are always subject to later validation. However, an
IOMMU is not necessary to ensure full isolation among untrusted guest operating
systems, even when they use DMA-capable hardware that directly reads and writes
host memory. Rather than relying on hardware to perform late validation during
I/O transactions, a lightweight software-based system performs early validation of
DMA descriptors before they are used by hardware. The software-based strategy
also must protect validated descriptors from subsequent unauthorized modification
by untrusted software, thus ensuring that all I/O transactions operate only on buffers
that have been approved by the VMM. The CDNA architecture relies on a software-based protection mechanism, as introduced in Chapter 4. This study compares that
approach to IOMMU-based approaches.
The runtime operation of a software-based protection strategy works much like
a single-use IOMMU-based strategy, since both validate permissions for each I/O
transaction. Whereas the single-use IOMMU-based strategy uses the VMM to create
IOMMU mappings for each transaction, the software-based strategy uses the VMM to create the actual DMA descriptor. The descriptor is valid only for a single I/O transaction.
Unlike an IOMMU-based system, an untrusted guest OS’s driver must first register
itself with the VMM during initialization. At that time, the VMM takes ownership
of the driver’s DMA descriptor region and the driver’s status region, revoking write
permissions from the guest. This prevents the guest from independently creating
or modifying DMA descriptors, or modifying the status region. Finally, the VMM
must prevent the guest from redirecting the device to different descriptor and status regions. This can be
trivially accomplished by only mapping the device’s configuration registers into the
VMM’s address space, and not into the guests’ address spaces.
After initialization, the runtime operation of the software-based strategy is similar
to the single-use IOMMU-based strategy outlined in Section 5.2.1. Steps 1–3 of a
software-based strategy are identical. In step 4, the VMM creates a DMA descriptor
in the write-protected DMA descriptor region, obviating the OS’s role in step 5. The
device carries out the requested operation using the validated descriptor, as in step 6,
and because the descriptor is write-protected, the untrusted guest cannot induce an
unauthorized transaction. When the device signals completion of the transaction,
the VMM inspects the device’s state (which is usually written via DMA back to the
host) to see which DMA descriptors have been used. The VMM then processes those
completed descriptors, as in step 10, permitting the associated guest memory buffers
to be reallocated.
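From the VMM's side, this runtime path can be sketched as follows. The names are hypothetical and are not the CDNA prototype's actual interface; the numbered comments refer to the single-use steps in Section 5.2.1, and the descriptor structure is the hypothetical one sketched in Section 5.1.

#include <stdbool.h>
#include <stdint.h>

struct guest;          /* opaque handle for one guest OS (hypothetical) */

/* Hypothetical VMM helpers. */
bool guest_owns_range(struct guest *g, uint64_t guest_phys,
                      uint32_t len, bool is_write);
void pin_range(struct guest *g, uint64_t guest_phys, uint32_t len);
struct dma_descriptor *next_free_descriptor(struct guest *g);
void notify_device(struct guest *g);

/* Hypercall handler: the guest asks the VMM to post a DMA on its behalf. */
int vmm_post_guest_dma(struct guest *g, uint64_t guest_phys,
                       uint32_t len, bool is_write)
{
    /* Steps 1-3: validate ownership and pin the buffer for the I/O. */
    if (!guest_owns_range(g, guest_phys, len, is_write))
        return -1;                        /* refuse the request */
    pin_range(g, guest_phys, len);

    /* Step 4 (software variant): the VMM, not the guest, writes the
     * descriptor into the descriptor region from which the guest's write
     * permission was revoked at registration time. */
    struct dma_descriptor *d = next_free_descriptor(g);
    d->phys_addr = guest_phys;
    d->length    = len;
    d->flags     = 0;

    /* The device fetches and uses the pre-validated descriptor directly. */
    notify_device(g);
    return 0;
}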
5.4 Protection Properties
The protection strategies presented in Sections 5.2 and 5.3 can be used to prevent
the memory access violations presented in Section 5.1. Those memory access violations, however, can occur both across multiple guests (inter-guest) and within a single
guest (intra-guest). A virtual machine monitor must provide inter-guest protection
in order to operate reliably. A guest operating system may additionally benefit if
the virtual machine monitor can also help provide intra-guest protection. This section describes the protection properties of the four previously presented protection
strategies.
5.4.1 Inter-Guest Protection
Perhaps surprisingly, all four strategies provide equivalent protection against the
first two types of memory access violations presented in Section 5.1: creation of
an incorrect DMA descriptor and repurposing the memory referenced by a DMA
descriptor. In all of the IOMMU-based strategies, if the device driver creates a
DMA descriptor that refers to memory that is not owned by that guest operating
system, the device will be unable to perform that DMA, as no IOMMU mapping will
exist. The only requirement to maintain this protection is that the VMM must never
create an IOMMU mapping for a guest that does not refer to that guest’s memory.
Similarly, only the VMM can repurpose memory to another guest, so as long as it does
not do so while there is an existing IOMMU mapping to that memory, the second
memory protection violation can never occur. The software-based approach provides
exactly the same guarantees by only allowing the VMM to create DMA descriptors.
Therefore, all of these strategies allow the VMM to provide inter-guest protection.
The third type of memory access violation, the device initiating a rogue DMA
operation, is more difficult to prevent. If the device is shared among multiple guest
operating systems, then no strategy can prevent this type of protection violation. For
example, if a network interface is allowed to receive packets for two guest operating
systems, there is no way for the VMM to prevent the device from sending the traffic
destined for one guest to the other. This is one simple example of many protection
violations that a shared device can commit.
However, if a device is privately assigned to a single guest operating system, the
IOMMU-based strategies can be used to provide protection against faulty device
behavior. In this case, the VMM simply has to ensure that there are only IOMMU
mappings to the guest that privately owns the device. In that manner, there is no
way the device can even access memory that does not belong to that guest. However,
the software-based strategy cannot even provide this level of protection. As DMA
descriptors are pre-validated, there is no way to stop the device from simply ignoring
the DMA descriptor and accessing any physical memory.
5.4.2 Intra-Guest Protection
None of the four protection strategies can protect the guest OS from the first
two types of access violations caused by its device drivers. In essence, the protection
afforded to the guest OS by any of the strategies is only as good as the implementation
of the strategy in a device driver. Consider the IOMMU-based strategies. For an
actual access violation to be prevented, the device driver would have to map the
correct buffer through the IOMMU but construct an incorrect DMA descriptor for
it. Such an error, however, seems unlikely. In the case of the software-based strategy,
such a scenario is impossible because the memory protection on the buffer and the
creation of the DMA descriptor are combined into one operation by the VMM.
In contrast, the IOMMU-based strategies offer some protection against the third
type of memory access violation, the device initiating a rogue DMA operation. Of
these strategies, the single-use and shared strategies will offer the greatest protection
against this type of memory access violation because the only pages that could be
corrupted are those that are the target of a pending I/O operation. However, the
persistent strategy offers very little protection, as there will be a significant number
of active mappings at any given time that the device could erroneously use.
5.5 Experimental Setup
The protection strategies described here were evaluated on a system with an AMD
Opteron 250 processor that includes an integrated graphics address relocation table
(GART) alongside the memory controller. The GART can be used to translate memory addresses using physical-to-physical address mappings. Therefore, with the appropriate software infrastructure, a GART can model the functionality of an IOMMU.
GART mappings are established at the memory-page granularity (in this case, 4
KB). Each page requires a separate GART mapping. Software programs the GART
hardware to create a mapping at a specific location within the GART's contiguous physical address range that points to a backing location in host memory. The
GART’s physical address range is often referred to as the GART “aperture”.
GART mappings are organized in an array in memory. An index into the mapping
array corresponds to a page index into the aperture. When an I/O device accesses
a location in the GART’s aperture, the GART transparently redirects the memory
access to a target memory location as specified by the corresponding GART mapping’s
address. For unused or unmapped locations within the aperture, software creates a
dummy mapping pointing to a single, shared garbage memory page.
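The relationship between the GART mapping array and the aperture described above can be sketched as follows. The field names and helpers are illustrative, and the 512 MB aperture size matches the configuration described later in this section.

#include <stdint.h>

#define GART_PAGE_SIZE      4096u
#define GART_APERTURE_BYTES (512u * 1024 * 1024)
#define GART_NUM_SLOTS      (GART_APERTURE_BYTES / GART_PAGE_SIZE)

static uint64_t gart_table[GART_NUM_SLOTS]; /* backing page address per slot */
static uint64_t garbage_page_phys;          /* shared dummy target page */

/* Map aperture page 'slot' onto host physical page 'target_phys'. */
void gart_map(uint32_t slot, uint64_t target_phys)
{
    gart_table[slot] = target_phys;
}

/* Unused or unmapped slots point at the shared garbage page. */
void gart_unmap(uint32_t slot)
{
    gart_table[slot] = garbage_page_phys;
}

/* A device access at (aperture_base + slot * GART_PAGE_SIZE + offset) is
 * transparently redirected to (gart_table[slot] + offset). */
uint64_t gart_redirect(uint32_t slot, uint32_t offset)
{
    return gart_table[slot] + offset;
}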
So long as an I/O device can only access memory within the GART aperture, all
of that device’s accesses will be subject to remapping and access controls as specified
by the virtual machine monitor. Thus, the GART’s mapping table limits I/O device
accesses to those regions approved by the VMM, just as an IOMMU limits I/O device
accesses.
Unlike an IOMMU-based system, however, a device could still generate an access
outside the GART region, thus bypassing access controls. As a practical measure, I
modify the prototype network interface to only accept DMA requests that lie within
the VMM-specified GART aperture. Even though this system architecture could
allow a faulty device to access memory outside the GART aperture, the architecture
faithfully models the overheads of a system for which all of the network interface’s
DMA requests are subject to the IOMMU strategy implemented by the VMM. Hence,
this architecture is an effective means for examining the efficiency and performance
of the various IOMMU management strategies under consideration.
Ben-Yehuda et al. identified that platform-specific IOMMU implementation details can significantly affect performance and influence the efficiency of a system’s
protection strategy [10]. Specifically, that work noted that the inability to individually replace IOMMU mappings without globally flushing the CPU cache can severely
degrade performance. The GART-based IOMMU implementation used in this work
does not incur the cache-flush penalties associated with the IBM platform, and thus
the GART-based implementation should represent a low-overhead upper-bound with
respect to architectural efficiency and performance.
I implement the IOMMU- and software-based protection strategies in the open
source Xen 3 virtual machine monitor [7]. I evaluate these strategies on a variety of
network-intensive workloads, including a TCP stream microbenchmark, a voice-over-IP (VoIP) server benchmark, and a static-content web server benchmark. The stream
microbenchmark either transmits or receives bulk data over a TCP connection to a
remote host. The VoIP benchmark uses the OpenSER server. In this benchmark,
OpenSER acts as a SIP registrar and 50 clients simultaneously initiate calls as quickly
as possible. The web server benchmark uses the lighttpd web server to host static
HTTP content. In this benchmark, 32 clients simultaneously replay requests from
various web traces as quickly as possible. Three web traces are used in this study:
“CS”, “IBM”, and “WC”. The CS trace is from a computer science departmental web
server and has a working set of 1.2 GB of data. The IBM trace is from an IBM web
server and has a working set of 1.1 GB of data. The WC trace is from the 1998 World
Cup soccer web server and has a working set of 100 MB of data. For all benchmarks,
the client machine is never saturated, so the server machine is always the bottleneck.
The server under test uses a 2.4 GHz Opteron processor, has two Gigabit Ethernet network interface cards, and features DDR 400 DRAM. The network interfaces
are publicly available prototypes that support shared, direct access [51]. A single
unprivileged guest operating system has 1.4 GB of memory. The IOMMU-based
strategies employ 512 MB of physical GART address space for remapping. In each
benchmark, direct access for the guest is granted only for the network interface cards.
Because the guest’s memory allocation is large enough to hold each benchmark and its
corresponding data set, other I/O is insignificant. For the web-based workloads, the
guest’s buffer cache is warmed prior to performance testing. For all of the benchmarks,
each configuration was performance tested at least five times with each benchmark.
Because there was effectively no variance across runs for a given configuration and
benchmark, the statistics reported are averages of those runs.
5.6 Evaluation
Network server applications can stress network I/O in different ways, depending
on the characteristics of the application and its workload. Applications may generate large or small network packets, and may or may not utilize zero-copy I/O.
For an application running on a virtualized guest operating system, these network
characteristics interact with the I/O protection strategy implemented by the VMM.
Protection      CPU %           Reuse (%)       HC/
Strategy        Total   Prot.   TX      RX      DMA
Stream Transmit
None            41      0       N/A     N/A     0
Single-use      64      23      N/A     N/A     .88
Shared          59      18      39      0       .55
Persistent      51      10      100     100     0
Software        56      15      N/A     N/A     .90
Stream Receive
None            53      0       N/A     N/A     0
Single-use      79      26      N/A     N/A     .37
Shared          73      20      39      0       .10
Persistent      66      13      100     100     0
Software        64      11      N/A     N/A     .39
Table 5.1: TCP Stream profile.
Consequently, the efficiency of the I/O protection strategy can affect application performance in different ways.
For all applications, I evaluate the four protection strategies presented earlier,
and I compare each to the performance of a system lacking any I/O protection at
all (“None”). “Single-use”, “Shared”, and “Persistent” all use an IOMMU to enforce
protection, using either single-use, shared-mapping, or persistent-mapping strategies,
respectively, as described in Section 5.2. “Software” uses software-based I/O protection, as described in Section 5.3.
5.6.1 TCP Stream
A TCP stream microbenchmark either transmits or receives bulk TCP data and
thus isolates network I/O performance. This benchmark does not use zero-copy I/O.
Table 5.1 shows the CPU efficiency and overhead associated with each protection
mechanism when streaming data over two network interfaces. The table shows the
total percentage of CPU consumed while executing the benchmark and the percentage
of CPU spent implementing the given protection strategy. The table also shows
the percentage of times a buffer to be used in an I/O transaction (either transmit
or receive) already has a valid IOMMU mapping that can be reused. Finally, the
table shows the number of VMM invocations, or hypercalls (HC), required per DMA
descriptor used by the network interface driver.
When either transmitting or receiving, all of the strategies achieve the same TCP
throughput (1865 Mb/s transmitting, 1850 Mb/s receiving), but they differ according to how costly they are in terms of CPU consumption. The single-use protection strategy is the most costly, with its repeated construction and destruction of
IOMMU mappings consuming 23% of total CPU resources for transmit and 26% for
receive. The shared strategy reclaims some of this overhead through its sharing of
in-use mappings, though this reuse only exists for transmitted packets (data in the
transmit-stream case, TCP ACK packets in the receive case). The lack of reuse for
received packets is caused by the XenoLinux buffer allocator, which dedicates an entire 4 KB page for each receive buffer, regardless of the buffer’s actual size. This
over-allocation is an artifact of the XenoLinux I/O architecture, which was designed
to remap received packets to transfer them between guest operating systems. Regardless, the persistent strategy achieves 100% reuse of mappings, as the small number of
persistent mappings that cover network buffers essentially become permanent. This
further reduces overhead relative to single-use and shared. Notably, the number of
hypercalls per DMA operation rounds to zero. However, management of the persistent mappings—mapping lookup and recycling, as described in Section 5.2.3—still
consumes over 10% of the processor's resources.
Surprisingly, the overhead incurred by the software-based technique is comparable
to the IOMMU-based persistent mapping strategy. The software-based technique
certainly requires far more hypercalls per DMA than the IOMMU-based strategies.
However, the cost of those VMM invocations and the associated page-verification
operations is similar to the cost of managing persistent mappings for an IOMMU.
Protection      Calls/  CPU %   Reuse (%)       HC/
Strategy        Sec.    Prot.   TX      RX      DMA
None            3005    0       N/A     N/A     0
Single-use      2790    6.1     N/A     N/A     .68
Shared          2835    6.0     4       0       .65
Persistent      2901    2.1     100     100     0
Software        2895    3.5     N/A     N/A     .67
Table 5.2: OpenSER profile.
5.6.2 VoIP Server
Table 5.2 shows the performance and overhead profile for the OpenSER VoIP application benchmark for the various protection strategies. The OpenSER benchmark
is largely CPU-intensive and therefore only uses one of the two network interface cards.
Though the strategies rank similarly in efficiency for the OpenSER benchmark as in
the TCP Stream benchmark, Table 5.2 shows one significant difference with respect to
reuse of IOMMU mappings. Whereas the shared strategy was able to reuse mappings
39% of the time for transmit packets under the TCP Stream benchmark, OpenSER
sees only 4% reuse. Unlike typical high-bandwidth streaming applications, OpenSER
only sends and receives very small TCP messages in order to initiate and terminate
VoIP phone calls. Consequently, the shared strategy provides only a minimal efficiency and performance improvement over the high-overhead single-use strategy for
the OpenSER benchmark, indicating that sharing alone does not provide an efficiency
gain for applications that are heavily reliant on small messages.
5.6.3 Web Server
Table 5.3 shows the performance, overhead, and sharing profiles of the various protection strategies when running a web server under each of three different trace workloads, “CS”, “IBM”, and “WC”. As in the TCP Stream and OpenSER benchmarks,
the different strategies rank identically among each other in terms of performance
Protection      HTTP    CPU %   Reuse (%)       HC/
Strategy        Mbps    Prot.   TX      RX      DMA
CS Trace
None            1336    0       N/A     N/A     0
Single-use      1142    18.2    N/A     N/A     .66
Shared          1162    16.3    40      0       .42
Persistent      1252    5.3     100     100     0
Software        1212    9.1     N/A     N/A     .67
IBM Trace
None            359     0       N/A     N/A     0
Single-use      322     8.5     N/A     N/A     .70
Shared          322     8.3     22      0       .58
Persistent      338     2.4     100     100     0
Software        326     4.5     N/A     N/A     .71
WC Trace
None            714     0       N/A     N/A     0
Single-use      617     11.8    N/A     N/A     .68
Shared          619     11.1    30      0       .50
Persistent      655     3.0     100     100     0
Software        632     5.9     N/A     N/A     .69
Table 5.3: Web Server profile using write().
and overhead. Each of the different traces generates messages of different sizes and
requires different amounts of web-server compute overhead. For the write()-based
implementation of the web server, however, the server is always completely saturated
for each workload. “CS” is primarily network-limited, generating relatively large response messages with an average HTTP message size of 34 KB. “IBM” is largely
compute-limited, generating relatively small HTTP responses with an average size
of 2.8 KB. “WC” lies in between, with an average response size of 6.7 KB. As the
table shows, the amount of reuse exploited by the shared strategy is dependent on
the average HTTP response being generated. Larger average messages lead to larger
amounts of reuse for transmitted buffers under the shared strategy. Though larger
amounts of reuse slightly reduce the CPU overhead for the shared strategy relative
Protection      HTTP             CPU %   Reuse (%)                  HC/
Strategy        Mbps             Prot.   Hdr. TX  File TX  RX       DMA
CS Trace
None            1378 (35% idle)  0       N/A      N/A      N/A      0
Single-use      1291 (7% idle)   27.6    N/A      N/A      N/A      .37
Shared          1330 (17% idle)  17.7    82       72       0        .17
Persistent      1342 (23% idle)  11.5    100      96       100      .02
Software        1351 (21% idle)  13.7    N/A      N/A      N/A      .37
IBM Trace
None            475              0       N/A      N/A      N/A      0
Single-use      403              14.0    N/A      N/A      N/A      .43
Shared          413              12.3    34       50       0        .35
Persistent      438              4.3     100      99       100      0
Software        422              6.2     N/A      N/A      N/A      .43
WC Trace
None            961              0       N/A      N/A      N/A      0
Single-use      760              19.9    N/A      N/A      N/A      .39
Shared          796              16.0    53       62       0        .27
Persistent      872              5.1     100      100      100      0
Software        833              8.7     N/A      N/A      N/A      .40
Table 5.4: Web Server profile using zero-copy sendfile().
to the single-use strategy, the reuse is not significant enough under these workloads
to yield significant performance benefits.
As in the other benchmarks, receive buffers are not subject to reuse with the
shared-mapping strategy. Regardless of the workload, the persistent strategy is 100%
effective at reusing existing mappings as the mappings again become effectively permanent. As in the other benchmarks, the software-based strategy achieves application
performance between the shared and persistent IOMMU-based strategies.
For all of the previous workloads, the network application utilized the write()
system call to send any data. Consequently, all buffers that are transmitted to the
network interface have been allocated by the guest operating system’s network-buffer
allocator. Using the zero-copy sendfile() interface, however, the guest OS generates
network buffers for the packet headers, but then appends the application’s file buffers
rather than copying the payload. This interface has the potential to change the
amount of reuse exploitable by a protection strategy. Using sendfile(), the packet-payload footprint for IOMMU mappings is no longer limited to the number of internal
network buffers allocated by the OS, but instead is limited only by the size of physical
memory allocated to the guest.
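The difference between the two transmission paths can be made concrete with the standard POSIX/Linux calls involved; the snippet below is only a sketch of the server's send path, with error handling omitted.

#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* write()-based path: the payload is copied into kernel network buffers,
 * so the pages later mapped for DMA come from the OS buffer allocator. */
void send_response_write(int sock, const char *hdr, size_t hdr_len,
                         const char *body, size_t body_len)
{
    write(sock, hdr, hdr_len);
    write(sock, body, body_len);
}

/* sendfile()-based (zero-copy) path: only the headers come from network
 * buffers; the payload pages come straight from the file cache, so the
 * set of pages visible to the protection strategy can span the guest's
 * whole file working set. */
void send_response_sendfile(int sock, const char *hdr, size_t hdr_len,
                            int file_fd, off_t offset, size_t file_len)
{
    write(sock, hdr, hdr_len);
    sendfile(sock, file_fd, &offset, file_len);
}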
Table 5.4 shows the performance, efficiency, and sharing profiles for the different
protection strategies for web-based workloads when the server uses sendfile() to
transmit HTTP responses. Note that for the “CS” trace, the host CPU is not completely saturated, and so the CPU’s idle time percentage is annotated next to HTTP
performance in the table. For the other traces, the CPU is completely saturated.
The table separates reuse statistics for transmitted buffers according to whether or
not the buffer was a packet header or packet payload. As compared to Table 5.3,
Table 5.4 shows that the shared strategy is more effective overall at exploiting reuse
using sendfile() than with write(). Consequently, the shared strategy gives a
larger performance and efficiency benefit relative to the single-use strategy when using sendfile(). Table 5.4 also shows that the persistent strategy is highly effective
at capturing file reuse, even though the total working-set sizes of the “CS” and “IBM”
traces are each more than twice as large as the 512 MB mapping space afforded by
the GART. Finally, the table shows that the software-based strategy performs better
than either the shared or single-use IOMMU strategies for all workloads, and can
perform even better than the persistent strategy on the CS trace, though it consumes
more CPU resources.
5.6.4 Discussion
The architecture of the GART imposes some limitations on this study. In particular, it is infeasible to evaluate a direct map strategy using the IOMMU. Under this
strategy, the VMM creates a persistent identity mapping for each VM that permits
access to its entire memory. This mapping is created by the VMM when the VM
is started and updated only if the memory allocated to the VM changes. Moreover,
because the direct map strategy uses an identity mapping, there is no need for the
device driver to translate the address that is provided by the guest OS into an address that is suitable for DMA. Unfortunately, the GART cannot implement such an
identity mapping because the address of the aperture cannot overlap with that of
physical memory.
Like the other protection strategies, the direct map strategy has pros and cons. It
provides the same protection between guest operating systems as the other IOMMU-based strategies, but it provides the least safety within a guest operating system.
For example, under the persistent mapping strategy, a page will only be mapped by
the IOMMU if it is the target of an I/O operation. Moreover, an unused mapping
may ultimately be destroyed. In contrast, under the direct map strategy, all pages
are mapped at all times. The direct map strategy’s unique advantage is that it can
be implemented entirely within the VMM without support from the guest OS. Its
implementation is, in effect, transparent to the guest OS.
Although it is not possible to determine the performance of the direct map strategy experimentally using the GART-based setup, it is reasonable to argue that its
performance must be bounded by the performance of the “Persistent” and “None”
strategies. Although, in many cases, the “Persistent” strategy achieves near 100%
reuse, the direct map strategy could have lower overhead because the device driver
does not have to translate the address that is provided by the guest OS into an address
that is suitable for DMA.
The GART's translation table is a single, one-dimensional array. Consequently, if
an IOTLB miss occurs, address translation requires at most one memory access. In
contrast, the coming IOMMUs from AMD and Intel will use multilevel translation
tables, similar to the page tables used by the processor’s MMU. Thus, both updates
by the hypervisor and IOTLB misses may have a higher cost because of the additional
memory references incurred by walking multilevel translation tables.
Regardless of the benchmark, the data in Section 5.6 shows many opportunities
for reuse of mappings in network I/O applications. However, some of this reuse is
a consequence of the difference between the mapping's granularity (i.e., a 4 kilobyte memory page) and the granularity of a network packet (i.e., 1500 bytes). Hence,
adjacent buffers in the same memory page can be reused for multiple packets because the packet size is smaller than that of a memory page. A hardware technique
that increases the maximum transaction size of a DMA operation could invert
this relationship and decrease the amount of reuse exploitable by the existing implementations examined here. For example, network interfaces that support TCP
segmentation offload provide the abstraction to the operating system of a NIC that
has a much larger maximum transmission unit (i.e., 16 kilobytes instead of 1500 bytes).
In this case, the Shared protection strategy could approach the reuse properties of
the Single-use strategy, since a memory page would likely be used only once for one
large buffer rather than being used multiple times. However, previous studies by Kim
et al. show that the payload data for the web-based traces examined in this study
have significant reuse, and hence one would still expect to see reuse benefits in the
Persistent protection strategy [26].
Xen differs from many virtualization systems in that it exposes host physical addresses to the guest OS. In particular, the guest OS, and not the VMM, is responsible
for translating between pseudo-physical addresses that are used at most levels of the
guest OS and host physical addresses that are used at the device level. This does
not, however, fundamentally change the implementation of the various protection
strategies.
Chapter 6
Conclusion
As demand for high-bandwidth network services continues to grow, network servers
must continue to deliver more and more performance. Simultaneously, power and
cooling continue to be first-class concerns for datacenter servers, and thus network
servers must support the highest levels of efficiency possible. Architectural trends toward chip multiprocessors are straining contemporary OS network stacks and network
hardware, exposing efficiency bottlenecks that can prevent software architectures from
gaining any substantial performance through multiprocessing. And whereas multicore
architectures offer an opportunity to better utilize physical server resources in a more
efficient manner through consolidation, inefficiencies inherent to modern I/O sharing
architectures and protection strategies severely damage performance and undermine
overall server efficiency.
This dissertation addresses key OS and VMM architectural components that can
limit I/O performance and efficiency in modern thread-concurrent and VM-concurrent
servers. Each of these components has separate performance and efficiency issues that
have tangible effects on the ability of a server to support its network applications.
The OS parallelization strategy affects the performance of network I/O processing by
the operating system, affects the maximum throughput attainable on a given connection, and thus affects application throughput and scalability. The virtual machine
monitor’s I/O virtualization architecture affects the overhead required to share access to a given I/O device, which thus affects the maximum aggregate application
performance attainable on the system and affects the ability of the system to support
larger numbers of concurrent virtual machines. Furthermore, the VMM's memory-protection strategy also affects the overhead of device virtualization, which affects
application performance, and affects the level of isolation supported by the system,
which affects the operating systems’ and hence applications’ stability. The design
decisions of each strategy explored and the characteristics of the resulting architectures have implications for server architects who will seek to build upon this work
and tackle remaining and future challenges facing server architecture. Those design
decisions, their characteristics, and the corresponding implications are discussed in
further detail in the sections that follow.
6.1 Orchestrating OS parallelization to characterize and improve I/O processing
The trend toward chip multiprocessing hardware necessitates parallelization of
the operating system's network stack. This dissertation establishes that a continuum of parallelism and efficiency exists among contemporary network stack
organizations, and this research explores points along that continuum. Along with
the synchronization mechanism employed by each organization, the parallelization
strategy has a direct impact on overall efficiency and ultimately throughput. This research found that a traditional network interface featuring a single high- bandwidth
link imposes an inherent bottleneck with regard to its single interface (ie, packet
queue) to the operating system, which limited throughput regardless of the network
stack organization used. However, introducing parallelism at the network interface
(by using separate interfaces) exposed the scheduling and synchronization efficiency
characteristics of each organization on the continuum. Through examining these
characteristics, it is clear that attempting to maximize theoretical parallelism in the
network stack can actually hurt performance, even on a highly parallel machine. This
research finds that a less-parallel, connection-parallel network stack is
both more efficient and higher-performing than other organizations that attempt to
maximize packet parallelism.
Though this dissertation explored primarily performance and efficiency, the selection of a connection-parallel network stack within the operating system has implications well beyond just the operating system. This study showed that hardware
support is needed to overcome the bottleneck imposed by the serialized interface exported by a single high-bandwidth NIC. To efficiently support a connection-parallel
network stack, the network interface card would first require parallel packet queues
so that multiple threads could access the NIC at the same time, without synchronization. Second, the NIC would require some form of packet classification that can map
incoming packets to specific connections (or connection groups) and then place them
in a specific packet queue on-board the NIC associated with that connection. With
the additional capability to fire a separate interrupt for each separate queue, it would
be possible to closely mimic the behavior of the parallel-NIC prototype evaluated in
this work, in which packets for a specific connection “queue” (a NIC in this case) are
persistently mapped to that same queue.
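A minimal sketch of the classification step such a NIC would need is shown below: hash the packet's connection identity so that every packet of a connection lands in the same on-board queue. The tuple fields, queue count, and hash are illustrative assumptions; production NICs typically use a configurable hash such as Toeplitz.

#include <stdint.h>

#define NUM_NIC_QUEUES 8u        /* illustrative number of packet queues */

/* Connection identity extracted from an incoming TCP/IP packet header. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Map a packet to a queue (and hence a connection group and interrupt)
 * so that all packets of one connection go to the same queue. */
uint32_t classify_to_queue(const struct flow_key *k)
{
    uint32_t h = k->src_ip ^ k->dst_ip ^
                 ((uint32_t)k->src_port << 16) ^ k->dst_port;
    h ^= h >> 16;                /* cheap mixing step */
    return h % NUM_NIC_QUEUES;
}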
Even with such hardware support, additional challenges in both the hardware
and software remain, including support for load balancing. In the experiments evaluated in this dissertation, the load was purposefully spread evenly across the separate
connections and their groups, and a connection always hashed to the same group.
However, a static hash mechanism may lead to undesirable overload conditions for
only a subset of connection groups, leading to under-utilization in lightly used groups.
In this case, it would be desirable to migrate busy connections to lightly-loaded connection groups. If the hardware is responsible for mapping packets to connections,
though, then clearly the hardware must participate in this scheme. One can imagine several possibilities for providing this support, including full control by hardware
(where the hardware attempts to detect the overload condition and notifies the OS
of a migration), full control by software (where the software detects overload and
migrates specific connections by notifying the NIC), or something in between, where
the software provides hints to the hardware about future possibilities for migration.
Regardless, the issue of load-balancing across multiple queues will be a critical area
for maintaining high performance with connection-parallel network stacks that have
hardware support.
6.2 Reducing virtualization overhead using a hybrid hardware/software approach
Whereas OS support for thread-parallelism incurs performance-damaging inefficiencies, contemporary software-based techniques for providing shared access to
an I/O device also incur severe performance overheads. Though the contemporary
software-based virtualization architecture supports a variety of hardware, the hypervisor and driver domain consume as much as 70% of the execution time during network
transfers. This dissertation introduces the novel CDNA I/O virtualization architecture, which is a hybrid hardware/software approach to providing safe, shared access
to a single network interface. CDNA uses hardware to perform traffic multiplexing,
a combination of hardware and software to facilitate event notification from the I/O
device to a particular virtual machine, and a combination of hardware and software
to enforce isolation of DMA requests initiated by each untrusted virtual machine.
This study demonstrates that a hybrid hardware/software approach is both economical and effective. The CDNA prototype device required about 12 MB of onboard storage and used a 300 MHz embedded processor, which is about the same as
modern network interface cards. Using these resources, the CDNA architecture improved transmit and receive performance for concurrent virtual machines by factors
of 2.1 and 3.3, respectively, over a standard software-based shared I/O virtualization
architecture. And whereas a purely hardware-based approach could require costly
memory-registration operations or system-level hardware modifications to enforce
DMA memory protection, the lightweight, software-based DMA memory protection
strategy introduced in this research incurs relatively little overhead and requires no
system-level modifications.
Moving traffic multiplexing to the hardware proved to be the biggest source of
performance and efficiency improvement in the CDNA architecture. By not forcing
the VMM to inspect, demultiplex, and page-flip data between virtual machines, the
relatively simple hardware of the CDNA prototype dramatically reduced total I/O
virtualization overhead. This reduction occurs despite introducing the new overhead
of software-based DMA memory protection for supporting direct I/O access.
Beyond the performance and efficiency issues explored in this study, the CDNA
architecture presents new opportunities for I/O virtualization research, including generalization to other devices and challenges not related to performance. As presented,
the CDNA device is a prototype. Though there is nothing specific about the prototype
or its architecture that prevents it from being adapted to other types of devices (such
as graphics cards or disk controllers), actually doing this generalization remains an
area for future exploration. Clearly the performance and efficiency benefits demonstrated for network I/O could prove advantageous for other types of I/O, though
the percentage of impact would depend on the importance of I/O for any particular
workload. Generalizing the CDNA interface would require development of a general
software interface for communicating DMA updates to the virtual machine monitor,
since the prototype’s method is actually based on updates to the NIC-specific DMA
descriptor structure. Further, generalization would require a method to generically,
concisely describe the control region (and mechanisms) for any particular device so
that the VMM maintains control of actually enqueueing DMA descriptors, as is required by the software-based DMA memory protection method.
Another area open for exploration with the CDNA architecture is that of providing quality of service guarantees, including support for customizable service allocation
and prioritization. For example, it would be advantageous to be able to guarantee
that a high-priority virtual machine received performance according to its application
needs (which could be either high bandwidth, or low latency, or both). Traditional
software-based I/O virtualization allows fine-grained centralized load balancing, because the VMM actively controls the flow of I/O into and out of the device. With
direct-access, hardware-shared devices such as the CDNA architecture, however, the
hardware ultimately determines the order and priority that concurrent requests are
processed. Thus, the hardware must have some mechanism for implementing the desired quality-of-service policy. Furthermore, the hardware must support some mechanism for the VMM to communicate the desired policy to the hardware. Finally, in
cases when the device supports unsolicited I/O (such as a network interface, which
receives packets from a network), it would be advantageous for the hardware to track
usage statistics and report them to the VMM so that it could make better decisions
that might avoid I/O failure (e.g., packet loss).
6.3 Improving performance and efficiency of protection strategies for direct-access I/O
CDNA’s performance and efficiency gains versus software-based virtualization illustrate the effectiveness of direct I/O access by untrusted virtual machines. Though
direct I/O access overcomes performance penalties, it requires new protection strategies to prevent the guest operating systems from directing the device to violate memory protection.
This dissertation has evaluated a variety of DMA memory protection strategies
for direct access to I/O devices within virtual machine monitors. As others have
noted, overhead for managing DMA memory protection using an IOMMU in a virtualized environment can noticeably degrade network I/O performance, ultimately affecting application throughput. Even with the novel IOMMU-based strategies aimed
at reducing this overhead by reusing installed mappings in the IOMMU hardware,
there remains a nonzero overhead that reduces throughput. However, this research
has shown that reuse-based strategies are effective at reducing overhead relative to
the state-of-the-art, single-use strategy. Furthermore, this research shows that the
software-based implementation for providing DMA memory protection introduced
with the CDNA architecture can deliver performance and efficiency comparable to
the most aggressive reuse-based strategies. These results held true across a wide array of TCP-based applications with different I/O demand characteristics, yielding several key insights.
This research also explored the differences in the level of protection offered by
different strategies and the level of efficiency gained through reuse. All of the strategies (single-use, shared, persistent, and software-based) explored in this study provide
equivalent protection between guest operating systems when those guest operating
systems are sharing a single device and have direct access. Further, all of these techniques prevent a guest operating system from directing the device to access memory
that does not belong to that guest. The traditional single-use strategy, however, provides this protection at the greatest cost, consuming from 6–26% of the CPU. This
cost can be reduced by reusing IOMMU mappings. Multiple concurrent network
transmit operations are typically able to share the same mappings 20–40% of the
time, yielding small performance improvements. However, due to Xen’s I/O architecture, network receive operations are usually unable to share mappings. In contrast,
even with a small pool of persistent IOMMU mappings, reuse approaches 100% in
almost all cases, reducing the overhead of protection to only 2–13% of the CPU. Finally, the software-based protection strategy performs comparably to the best of the
IOMMU-based strategies, consuming only 3–15% of the CPU for protection.
After comparing the performance and protection offered by hardware- and software-based DMA protection strategies, an IOMMU proves to provide surprisingly limited
benefits beyond what is possible with software. This finding comes despite industrial
enthusiasm for deploying IOMMU hardware in the next generation of commodity systems. As these new systems arrive, a new comparison that uses an actual IOMMU
(rather than the GART-modeled IOMMU in this study) may be illustrative so as
to quantify the performance of “direct-map” IOMMU-based protection strategies.
In such a strategy, the entire physical memory space of a given virtual machine is
mapped (usually once) by the IOMMU and remains mapped for the lifetime of the
virtual machine. Such a strategy should not impose the remapping overhead or reuselookup overheads of the strategies explored in this dissertation, but they should not
perform any better than the “no-protection” case, either. Hence, it is unlikely that
even future, improved IOMMU-based designs will offer significantly better performance than the best-performing strategies explored in this dissertation, which came
within just a few percent of native performance.
Though it is possible to achieve near-native performance with an optimized, reuse-based protection strategy, the results in this study also show that inefficient use of
hardware structures designed to reduce the burden of software (such as an IOMMU)
can in fact significantly degrade performance. Hence, this dissertation illustrates a
warning to system architects who would use hardware to solve an architectural problem without consideration of the software overhead. The availability of an IOMMU
does not significantly improve performance unless one compares against a naive,
worst-case implementation of DMA memory protection. This underscores the need
for software architects to work closely with hardware architects to solve problems
such as DMA memory protection and, in general, I/O virtualization.
6.4 Summary
This dissertation explored the performance and efficiency of server concurrency at
both the OS and VMM levels and introduced several hybrid hardware/software techniques that strategically use hardware to improve software efficiency and performance.
By changing the way the operating system uses its parallel processors to facilitate I/O
processing, by changing the responsibilities of I/O devices to more efficiently integrate
with a virtualized environment, and by strategically using memory protection hardware to reduce the total cost of using that hardware, these techniques each modify
the overall hardware/software system architecture. The OS and VMM architectures
introduced and explored in this dissertation provide a scalable means to deliver high-performance, efficient I/O for contemporary and future commodity servers. As servers
continue to support more and more cores on a chip, thread concurrency and VMM
concurrency will be increasingly critical for system integrators facing performance
challenges for high-bandwidth applications and efficiency challenges for system consolidation in the datacenter. The techniques explored in this dissertation rely on a
synthesis of software orchestration, parallel computation resources (in the OS and
among virtual machines), and lightweight, efficient interfaces with hardware that support the desired level of concurrency. Given the cost and performance advantages
that were repeatedly found using the hybrid hardware/software approach explored
in this dissertation, this approach should be a guiding principle for hardware and
software architects facing the future challenges of concurrent server architecture.