RICE UNIVERSITY
Efficient Hardware/Software Architectures for
Highly Concurrent Network Servers
by
Paul Willmann
A THESIS SUBMITTED
IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE
Doctor of Philosophy
Approved, Thesis Committee:
Behnaam Aazhang, Chair
J. S. Abercrombie Professor of
Electrical and Computer Engineering
Alan L. Cox
Associate Professor of Computer Science
and of Electrical and Computer Engineering
David B. Johnson
Associate Professor of Computer Science
and of Electrical and Computer Engineering
Scott Rixner
Associate Professor of Computer Science
and of Electrical and Computer Engineering
Houston, Texas
November 2007
Abstract
Internet services continue to incorporate increasingly bandwidth-intensive applications, including audio and high-quality, feature-length video. As the pace of uniprocessor performance improvements slows, however, network servers can no longer rely
on uniprocessor technology to fuel the overall performance improvements necessary
for next-generation, high-bandwidth applications. Furthermore, rising per-machine
power costs in the datacenter are driving demand for solutions that enable consolidation of multiple servers onto one machine, thus improving overall efficiency. This
dissertation presents strategies that improve the efficiency and performance of server
I/O using both virtual-machine concurrency and thread concurrency. Contemporary
virtual machine monitors (VMMs) aim to improve server efficiency by enabling consolidation of separate isolated servers onto one physical machine. However, modern
VMMs incur heavy device virtualization penalties, ultimately reducing application
performance by up to a factor of 3. Contemporary parallelized operating systems
aim to improve server performance by exploiting thread parallelism using multiple
processors. However, the concurrency and communication models used to implement that parallelism impose significant performance penalties, severely damaging
the server’s ability to leverage more processors to attain higher performance. This
dissertation examines the architectural sources of these inefficiencies and introduces
new OS- and VMM-level architectures that greatly reduce them.
Acknowledgments
I would like to thank Dr. Scott Rixner and Dr. Alan Cox for their steady technical guidance throughout the course of this research. Also, I would like to thank
Dr. Behnaam Aazhang and Dr. David Johnson for their perspectives regarding and
support of this work. Thanks also to Dr. Vijay Pai, who helped and encouraged
me from the beginning of my graduate school career through its conclusion, even
after he moved on to a new opportunity at Purdue University. Additionally, I want
to acknowledge Jeff Shafer’s significant contributions regarding development and debugging of the CDNA prototype hardware. David Carr helped tremendously with
bringup of and scripting support for the Xen VMM environment. Marcos Huerta
provided the formatting template for this document and graciously helped me modify it to suit my needs. I also want to thank my family and friends, whose constant
support and encouragement made this work possible. Finally, I want to particularly
thank my wife Leighann. Her perspectives on the scientific process and academic
research made this work better, and her enduring love and patience made my life
better. Thank you.
Contents

1 Introduction
  1.1 Server Concurrency Trends
  1.2 Contributions
  1.3 Dissertation Organization

2 Background
  2.1 Contemporary Server and CPU Technology
  2.2 Existing OS Support for Concurrent Network Servers
  2.3 Existing VMM Support for Concurrent Network Servers
    2.3.1 Private I/O
    2.3.2 Software-Shared I/O
    2.3.3 Hardware-shared I/O
    2.3.4 Protection Strategies for Direct-Access Private and Shared Virtualized I/O
  2.4 Hardware Support for Concurrent Server I/O
    2.4.1 Hardware Support for Parallel Receive-side OS Processing
    2.4.2 User-level Network Interfaces
  2.5 Summary

3 Parallelization Strategies for OS Network Stacks
  3.1 Background
  3.2 Parallel Network Stack Architectures
    3.2.1 Message-based Parallelism (MsgP)
    3.2.2 Connection-based Parallelism (ConnP)
  3.3 Methodology
    3.3.1 Evaluation Hardware
    3.3.2 Parallel TCP Benchmark
  3.4 Evaluation using One 10 Gigabit NIC
  3.5 Evaluation using Multiple Gigabit NICs
  3.6 Discussion and Analysis
    3.6.1 Locking Overhead
    3.6.2 Scheduler Overhead
    3.6.3 Cache Behavior

4 Concurrent Direct Network Access
  4.1 Networking in Xen
    4.1.1 Hypervisor and Driver Domain Operation
    4.1.2 Device Driver Operation
    4.1.3 Performance
  4.2 CDNA Architecture
    4.2.1 Multiplexing Network Traffic
    4.2.2 Interrupt Delivery
    4.2.3 DMA Memory Protection
    4.2.4 Discussion
  4.3 CDNA NIC Implementation
  4.4 Evaluation
    4.4.1 Experimental Setup
    4.4.2 Single Guest Performance
    4.4.3 Memory Protection
    4.4.4 Scalability

5 Protection Strategies for Direct I/O in Virtual Machine Monitors
  5.1 Background
  5.2 IOMMU-based Protection
    5.2.1 Single-use Mappings
    5.2.2 Shared Mappings
    5.2.3 Persistent Mappings
  5.3 Software-based Protection
  5.4 Protection Properties
    5.4.1 Inter-Guest Protection
    5.4.2 Intra-Guest Protection
  5.5 Experimental Setup
  5.6 Evaluation
    5.6.1 TCP Stream
    5.6.2 VoIP Server
    5.6.3 Web Server
    5.6.4 Discussion

6 Conclusion
  6.1 Orchestrating OS parallelization to characterize and improve I/O processing
  6.2 Reducing virtualization overhead using a hybrid hardware/software approach
  6.3 Improving performance and efficiency of protection strategies for direct-access I/O
  6.4 Summary
List of Figures

1.1 Uniprocessor frequency and network bandwidth history.
1.2 Network I/O throughput disparity between the modern FreeBSD operating system and link capacity, using either six 1-Gigabit interfaces or one 10-Gigabit interface.
1.3 Network I/O throughput disparity between native Linux and virtualized Linux, using six 1-Gigabit Ethernet interfaces.
2.1 Uniprocessor performance history (data source: Standard Performance Evaluation Corporation).
2.2 The efficiency/parallelism continuum of OS network-stack parallelization strategies.
2.3 A contemporary software-based, shared-I/O virtualization architecture.
3.1 Aggregate transmit throughput for uniprocessor, message-parallel and connection-parallel network stacks using 6 NICs.
3.2 Aggregate receive throughput for uniprocessor, message-parallel and connection-parallel network stacks using 6 NICs.
3.3 The outbound control path in the application thread context.
3.4 Aggregate transmit throughput for the ConnP-L network stack as the number of locks is varied.
3.5 Profile of L2 cache misses per 1 Kilobyte of payload data (transmit test).
3.6 Profile of L2 cache misses per 1 Kilobyte of payload data (receive test).
4.1 Shared networking in the Xen virtual machine environment.
4.2 The CDNA shared networking architecture in Xen.
4.3 Transmit throughput for Xen and CDNA (with CDNA idle time).
4.4 Receive throughput for Xen and CDNA (with CDNA idle time).
List of Tables

2.1 I/O virtualization methods.
3.1 FreeBSD network bandwidth (Mbps) using a single processor and a 10 Gbps network interface.
3.2 Aggregate throughput for uniprocessor, message-parallel and connection-parallel network stacks.
3.3 Percentage of lock acquisitions for global TCP/IP locks that do not succeed immediately when transmitting data.
3.4 Cycles spent managing the scheduler and scheduler synchronization per Kilobyte of payload.
3.5 Percentage of L2 cache misses within the network stack to global data structures.
4.1 Transmit and receive performance for native Linux 2.6.16.29 and paravirtualized Linux 2.6.16.29 as a guest OS within Xen 3.
4.2 Transmit performance for a single guest with 2 NICs using Xen and CDNA.
4.3 Receive performance for a single guest with 2 NICs using Xen and CDNA.
4.4 CDNA 2-NIC transmit performance with and without DMA memory protection.
4.5 CDNA 2-NIC receive performance with and without DMA memory protection.
5.1 TCP Stream profile.
5.2 OpenSER profile.
5.3 Web Server profile using write().
5.4 Web Server profile using zero-copy sendfile().
Chapter 1
Introduction
Internet services continue to incorporate ever more bandwidth-intensive, high-performance applications, including audio and high-quality, feature-length video. Furthermore, network services are proliferating into every aspect of businesses, with even
small-scale organizations leveraging robust storage, database, and voice-over-IP technology to manage resources, facilitate communications, and reduce costs. However,
ongoing processor trends toward chip multiprocessing present new challenges and opportunities for server architectures as those architectures strive to keep pace with
performance and efficiency demands. This dissertation addresses these challenges
with new operating system and virtual machine monitor architectures designed to
provide efficient, high-performance network input/output (I/O) support for coming
generations of servers.
1.1 Server Concurrency Trends
Throughout the vast expansion of Internet technology in the 1990s, processor performance and network server bandwidth both grew at exponential rates. Figure 1.1
shows the progression of uniprocessor frequency (for the Intel IA32 family of processors) and network interface bandwidth since 1982.

[Figure 1.1: log-scale plot of Ethernet bandwidth (Mbps) and processor frequency (MHz) by year, 1980-2005.]
Figure 1.1: Uniprocessor frequency and network bandwidth history.

Though frequency alone is not a
comprehensive measure of performance, the figure does show a qualitative comparison of Ethernet and commodity uniprocessor trends over the past twenty-five years.
The figure shows the exponential growth of both uniprocessor frequency and Ethernet
bandwidth throughout the 1990s and early 2000s. However, the rate of uniprocessor frequency increase shows a marked decline starting in 2003, when physical circuit
limitations, such as increasing per-cycle interconnect delays, started to overwhelm
contemporary CPU architectures. Instead of continuing to rely on CPUs built around ever larger global interconnects, CPU architects turned to multicore designs.
The move to multicore designs has implications for both server performance and
efficiency. In terms of performance, it is important that the server be able to leverage
its multiple processors to keep pace with network bandwidth improvements, just as
past servers have leveraged faster uniprocessors to deliver more server performance.
[Figure 1.2: paired bar charts of (a) transmit and (b) receive TCP throughput (Mb/s) for a uniprocessor OS and a multiprocessor OS (p=4), compared against link capacity, using either six 1-Gigabit NICs or one 10-Gigabit NIC.]
Figure 1.2: Network I/O throughput disparity between the modern FreeBSD operating system and link capacity, using either six 1-Gigabit interfaces or one 10-Gigabit interface.

In terms of efficiency, multicore architectures provide new opportunities to consolidate isolated servers from several different machines onto just one machine. Such
consolidation maximizes utilization of the server’s physical CPU and I/O resources
and can substantially reduce power and cooling costs, according to a study by Intel [22]. Virtual machine monitor (VMM) software provides multiplexing facilities
that enable this kind of consolidation and sharing, but it is important that virtualization overheads be kept low so as to maximize the capacity of a consolidated server
and thus maximize the associated power and cooling savings.
However, modern server software architectures fall well short of meeting both
performance and efficiency demands given modern network link speeds. Figure 1.2
illustrates the gap between the theoretical peak I/O throughput of a modern server
and its achieved throughput. The figure shows TCP network throughput, using either six 1 Gigabit Ethernet network interface cards (NICs) or a single 10 Gigabit
Ethernet NIC. The throughput achieved by uniprocessor and multiprocessor-capable OS configurations is compared to the theoretical aggregate TCP throughput offered by the physical links.

[Figure 1.3: bar chart of transmit and receive TCP throughput (Mb/s) for native Linux versus virtualized Linux.]
Figure 1.3: Network I/O throughput disparity between native Linux and virtualized Linux, using six 1-Gigabit Ethernet interfaces.

The operating system used is FreeBSD, which uses a similar parallelization strategy to that of Linux and achieves similar performance (not
shown). The uniprocessor configurations use just one 2.2 GHz Opteron processor
core, whereas the multiprocessor configurations use all four cores of a system that features two dual-core chips. In all cases, the application's
thread count is matched to the number of processors. The application is a lightweight
microbenchmark that simply sends or receives data, thus isolating and stressing the
operating system’s network stack. As the figure shows, existing approaches for network OS multiprocessing can improve network performance in some cases. However,
the performance improvement can be meager (or nonexistent) and falls well short of
being able to saturate link resources. Furthermore, current multiprocessor OS organizations are poorly suited for managing a single, high-bandwidth NIC, sustaining
less than half of the available link bandwidth in the best case.
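Concretely, the transmit side of such a microbenchmark can be as simple as the following sketch. The sketch is illustrative only, not the benchmark used in this work: the peer address and port are arbitrary assumptions, and the actual experiments run one such sending thread per configured processor.

    /* Illustrative transmit microbenchmark: stream a fixed buffer over one
     * TCP connection as fast as possible so that protocol processing in the
     * kernel, not the application, is the bottleneck.  The peer address and
     * port below are assumptions, not values from this dissertation. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *peer = (argc > 1) ? argv[1] : "192.168.0.2"; /* assumed sink host */
        char buf[64 * 1024];                   /* one payload chunk per write() */
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(buf, 'x', sizeof(buf));
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5001);           /* assumed sink port */
        inet_pton(AF_INET, peer, &addr.sin_addr);

        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        for (;;) {                             /* send until the run is stopped */
            if (write(fd, buf, sizeof(buf)) <= 0)
                break;
        }
        close(fd);
        return 0;
    }

A matching receive-side tool would simply accept connections and read() into a scratch buffer, again leaving the operating system's network stack as the only significant consumer of CPU time.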
Whereas native OS performance is significantly less than ideal, Figure 1.3 shows
that virtualizing an operating system reduces its I/O performance even more. Figure 1.3 compares the TCP network throughput of native, unvirtualized Linux using
six 1 Gigabit NICs to that achieved by a virtualized Linux server using the Xen virtual machine monitor. Linux is used as the benchmark native system here because
the Xen open-source VMM features mature support for Linux but only preliminary
support for FreeBSD; regardless, the performance limitations illustrated by the figure
are inherent to the VMM architecture, not the OS. The system under test uses a
single 2.4 GHz Opteron processor core. For both transmit and receive workloads, virtualization imposes an I/O slowdown of more than 300% versus native-OS execution.
1.2 Contributions
This dissertation contributes to the field at the hardware, virtual machine monitor,
and operating system levels. This work tackles the efficiency and performance issues
within and across each of these levels that cause the performance disparities illustrated
in Figures 1.2 and 1.3. The fundamental approach of this research is to use a combination of hardware and software to architect strategies that minimize inefficiencies
in concurrent network servers. Combined, these strategies form a software/hardware
architecture that will efficiently leverage coming generations of multicore network
servers. This architecture comprises three parts, each of which has its own set of
contributions: strategies for efficient network-stack processing by a parallelized operating system, strategies for efficient network I/O sharing in VMM environments, and
strategies for efficient isolation of direct-access I/O devices in VMM environments.
Defining and exploring the continuum of network stack parallelization.
At the operating system level, an efficient parallelization of the network protocol
processing stack is required so as to maximize system performance and efficiency
on coming generations of chip multiprocessor hardware and to relieve the protocol
processing bottleneck demonstrated in Figure 1.2. Past research efforts have studied this problem in the context of improving protocol-processing performance for
100 Mb/s Ethernet using the SGI Challenge shared-memory multiprocessor. Though
these studies examined some of the tradeoffs for two different strategies for network-stack parallelization, ten years later there exists no consensus among OS architects regarding the unit of concurrency for network-stack processing or the method of synchronization. This dissertation defines and explores a continuum of practical parallelization strategies for modern operating systems. The strategies on this continuum
vary according to their unit of concurrency and their method of synchronization and
have different overhead characteristics, and thus different performance and scalability.
Whereas past studies have used emulated network devices and idealized, experimental
research operating systems, this dissertation examines network-stack parallelization
on a real hardware platform using a modern network operating system, including
all of the device and software overhead. Through understanding this continuum of
parallelization and efficiency, this work finds that designing the operating system to
maximize parallel execution can actually decrease performance on parallel hardware
by ignoring sources of overhead. Further, this study identifies the hardware/software
interface between high-bandwidth devices and operating systems as a performance
bottleneck, but shows that, when that bottleneck is overcome, efficient network-stack parallelization strategies significantly improve performance versus contemporary, inefficient strategies.
Designing a hardware/software architecture for shared I/O virtualization. Contemporary architectures for I/O virtualization enable economical sharing
of physical devices among virtual machines. These architectures multiplex I/O traffic, manage device notification messages, and perform direct memory access (DMA)
memory protection entirely in software. However, this software-based design incurs
heavy penalties versus native OS execution, as depicted in Figure 1.3. In contrast,
contemporary research by others has examined purely hardware-based architectures
that aim to reduce software overhead. This dissertation contributes a new architecture for shared I/O virtualization that permits concurrent, direct access by untrusted
virtual machines using a combination of both hardware and software. This research
includes an examination of a prototype network interface developed using this hybrid hardware/software approach, which proves effective at eliminating most of the
overhead associated with traditional, software-based shared I/O virtualization. Be-
yond the prototype itself, the primarily software-based mechanism for enforcing DMA
memory protection is an entirely new contribution, differing greatly from contemporary hardware-based mechanisms. The prototype device is a standard expansion card
that requires modest additional hardware beyond that found in a commodity network
interface. This low cost and the architecture's compatibility with existing, unmodified commodity hardware make it ideal for commodity network servers.
Developing and exploring alternative strategies for virtualized direct
I/O access. Unlike the shared, direct-access architecture explored in this dissertation, the prevailing industrial solution for high-performance virtualized I/O is to
provide private, direct access to a single device by a single virtual machine. This
obviates the device’s need for multiplexing of traffic among multiple operating systems, but such systems still need reliable DMA memory protection mechanisms to
prevent an untrusted virtual machine from potentially misusing an I/O device to access another virtual machine’s memory. This dissertation contributes to the field by
developing and examining new hardware- and software-based strategies for managing
DMA memory protection and by comparing them to the state-of-the-art strategy. Contemporary high-availability server architectures use a hardware I/O memory management unit (IOMMU) to enforce the memory access rules established by the memory
management unit, and commodity CPU manufacturers are aggressively pursuing inclusion of IOMMU hardware in next-generation processors. Though the aim of these
architectures is to provide near-native virtualized I/O performance, the strategy for
managing the system’s IOMMU hardware can greatly impact performance and efficiency. This research contributes two novel strategies for achieving direct I/O access
using an IOMMU that, unlike the state-of-the-art strategy, reuse IOMMU entries to
reduce the total overhead of ensuring I/O safety. Further, this research finds that
the software-based DMA memory protection strategy introduced in this dissertation
performs comparably to the most aggressive hardware-based strategy. Contrary to
much of the industrial enthusiasm for IOMMUs in coming commodity servers, this
dissertation concludes that an IOMMU is not necessarily required to achieve safe,
high-performance virtualized I/O.
1.3 Dissertation Organization
This dissertation presents these contributions in three studies and is organized as
follows. Chapter 2 first provides some background regarding the motivation for this
work and the state-of-the-art hardware and software architectures of contemporary
network servers. Chapter 3 then presents a comparison and analysis of parallelization
strategies for modern thread-parallel operating systems. Chapter 4 introduces the
concurrent direct network access (CDNA) architecture for delivering efficient shared
access to virtualized operating systems in VMM environments. Chapter 5 follows
up with a comparison and analysis of hardware- and software-based strategies for
providing isolation among untrusted virtualized operating systems that have direct
access to I/O hardware, and Chapter 6 concludes.
Chapter 2
Background
Over the past forty years, there have been extensive industrial and academic
efforts toward improving the performance and efficiency of servers. These efforts
have touched on both multiprocessing concurrency and virtual machine concurrency.
Furthermore, there have been many efforts to coordinate the architecture of I/O
hardware (such as network interfaces) with software (both applications and operating
systems) to improve the performance and efficiency of the overall server. This chapter
discusses the background of contemporary server technology and its limitations and
then explores the prior research efforts that are related to the themes and strategies
of this dissertation.
2.1 Contemporary Server and CPU Technology
The Internet expansion of the 1990s was sustained with exponential growth in
processor performance and network server bandwidth. Figure 1.1 shows this exponential progression of uniprocessor frequency (for the Intel IA32 family of processors)
and network interface bandwidth since 1982. These steady, exponential performance
improvements came from technology improvements, such as feature-size reduction
that enabled higher-frequency operation, and from architectural innovations, such
as superscalar instruction execution in CPUs and packet checksum offloading for network interfaces. However, the dominant server platform relied on a fairly constant
architecture: a single processor using a single network interface, both of which were
exponentially improving in performance between generations. The commoditization
and rapid improvement of this architecture has yielded superior cost efficiencies compared to more specialized designs, ultimately motivating companies such as Google
to standardize their server platforms on this commodity architecture [8].
The consistency of the commodity hardware architecture ensured that legacy software architectures could readily leverage successive generations of higher-performance
hardware, ultimately producing higher-performance, higher-capacity servers. Efficiency improvements in system software, such as zero-copy I/O and efficient event-driven execution models, provided additional server performance on the same architectures. However, these software innovations did not rely on architectural changes
and instead improved performance using the existing contemporary architecture.
Figure 2.1 confirms this processor trend in terms of performance and shows that
it is not specific to the Intel IA32 family of processors. This figure plots the highest
reported SPEC CPU integer benchmark scores, selected across all families of processors, for each year since 1995. The left-hand portion of the figure shows SPEC95
scores, and the right-hand portion shows SPEC2000 scores.

[Figure 2.1: two panels plotting the highest reported SPEC95 (left) and SPEC2000 (right) integer benchmark scores by year, with a long-term performance trend line.]
Figure 2.1: Uniprocessor performance history (data source: Standard Performance Evaluation Corporation).

In 2000, the same computer system was evaluated using both SPEC95 and SPEC2000, so the “2000” point
in both graphs corresponds to the same computer. The y axes of each graph are
scaled to each other such that a line with a certain slope in the SPEC95 graph will
have the same slope in the SPEC2000 graph. Though the two benchmark suites are
not identical, they are designed to measure performance on contemporary hardware.
From 1995 to 2003, the average rate of benchmark improvement was 43% per year
(shown as the “Long-term Performance Trend” line in the figure). Though processor
frequencies remained mostly unchanged in the period of 2003-2007 (as shown for the
Intel family of processors in Figure 1.1), processor architects were still able to produce performance improvements for that time period. However, the rate of SPEC
benchmark improvement dropped significantly, to 19% per year.
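As a rough illustration, using only the two growth rates just quoted, the implied time for integer performance to double stretches from about two years to about four:

    t_{double} = \ln 2 / \ln(1 + r)
    r = 0.43:  t_{double} \approx 0.693 / 0.358 \approx 1.9 years
    r = 0.19:  t_{double} \approx 0.693 / 0.174 \approx 4.0 years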
The slowdown in frequency and performance growth stems primarily from transistor- and circuit-level physical limitations. One of the most significant of these is parasitic
wire capacitance among transistors inside CPUs. The progression toward smaller
transistors has enabled larger-scale integration, but it has also led to increasing relative delay (in cycles) for global interconnect from generation to generation [15]. The
poor scalability of global interconnection networks (and thus the control circuitry of
modern superscalar processors) is contributing to the slowdown in uniprocessor performance improvements from year to year. This scalability problem is also driving
CPU manufacturers toward multicore CPU designs. It is this migration and the subsequent poor performance of contemporary server operating systems and VMMs that
serves in part as motivation for this work.
Though commodity chip multiprocessor technology is new, software and hardware architects have been developing OS and VMM support for larger-scale, special-purpose concurrent servers over the past four decades. Contemporary OS and VMM
solutions are derived from these prior endeavors, and current I/O architectures bear
many similarities to their ancestors. However, recent advances in server I/O (such
as the adoption of 10 Gigabit Ethernet) have placed new stresses on these architectures, exposing inefficiencies and bottlenecks that prevent modern systems from fully
utilizing their I/O capabilities, as depicted in Figures 1.2 and 1.3 in the Introduction.
The poor I/O scalability and performance of modern concurrent servers are attributable to both software and hardware inefficiencies. Many existing operating
systems are designed to maximize opportunities for parallelism. However, designs
that maximize parallelism incur higher synchronization and thread-scheduling overhead, ultimately reducing performance. Furthermore, both operating systems and
VMMs are designed to interact with the traditional serialized hardware interface
exported by I/O devices. Operating system performance is bottlenecked by this interface when the multithreaded higher levels of the OS must wait for single-threaded
device-management operations to complete. This serialized interface has more far-reaching effects on VMM design, and consequently VMMs experience even heavier
efficiency penalties relative to native-OS performance. These penalties stem primarily from the separation of device management (in one privileged OS instance) from
server computation (in traditional, untrusted OS instances) and the software virtualization layers needed between them. Combined, all of these inefficiencies significantly
degrade OS and VMM performance and will prevent future servers from scaling with
contemporary I/O capabilities.
2.2 Existing OS Support for Concurrent Network Servers
Given the continuing trend toward commodity chip multiprocessor hardware, the
trend away from vast improvements in uniprocessor performance, and the ongoing
improvements in Ethernet link throughput, operating system architects must consider efficient methods to close the I/O performance gap. The organization of the
operating system’s network stack is particularly important. An operating system’s
network stack implements protocol processing (typically TCP/IP or UDP). TCP processing is the only operation being performed in the microbenchmark examined in
Figure 1.2, but there remains a clear performance gap imposed by the overhead of protocol processing. To close that gap, multiprocessor operating systems must efficiently
orchestrate concurrent protocol processing.
There exist two principal strategies for parallelizing the operating system’s network stack, both of which derive from research in the mid-1990s that was conducted
using large-scale SGI Challenge shared-memory multiprocessors. These strategies
differ according to their unit of concurrency. Though current OS implementations
are derived from one or the other of these two strategies, no consensus exists among
developers regarding the most appropriate organization for emerging processing and
I/O hardware.
Nahum et al. first examined a parallelization scheme that attempted to treat
messages (usually packets) as the fundamental unit of concurrency [41]. In its most
extreme implementation, this message-parallel (or MsgP) strategy attempts to process each packet in the system in parallel using separate processors. Because a server
has a constant stream of packets, the message-parallel approach maximizes the theoretically achievable concurrency. Though this message-oriented organization ideally
scales limitlessly given the abundance of in-flight packets available in a typical server,
Nahum et al. found that repeated synchronization for connection state shared among
packets belonging to the same connection (such as reorder queues) severely limited
scalability [41]. However, that study found that the scalability and performance of
the message-parallel organization were highly dependent on the synchronization characteristics of the system.
The connection-parallel (or ConnP) strategy treats connections as the fundamental unit of concurrency. When the operating system receives a message for transmission or has received a message from the network, a ConnP organization first associates
the message with its connection. The OS then uses a synchronization mechanism to
move the packet to a thread responsible for processing its connection. Hence, ConnP
organizations avoid the MsgP inefficiencies associated with repeated synchronization
to shared connection state. However, ConnP organizations also limit concurrency
by enlarging the granularity of parallelism from packets to connections. In its most
extreme form, a ConnP organization has as many threads as there are connections.
After Nahum et al. examined the MsgP organization, Yates et al. conducted a
similar study using the SGI Challenge architecture. In this study, Yates et al. examined connection-oriented parallelization strategies that treat connections as the
fundamental unit of concurrency [62]. This study persistently mapped network stack
operations for a specific connection to a specific worker thread in the OS. This strategy eliminates any shared connection state among threads and thus eliminates the
synchronization overhead of message-oriented parallelizations. Consequently, this organization yielded excellent scalability as threads were added.
[Figure 2.2: a one-dimensional continuum running from most concurrent to most efficient, with MsgP and in-order MsgP at the concurrency end, ConnP and grouped ConnP in the middle, and the uniprocessor organization at the efficiency end.]
Figure 2.2: The efficiency/parallelism continuum of OS network-stack parallelization
strategies.
Both of these prior works ran on the mid-90s era SGI Challenge multiprocessor, utilized the user-space x-kernel, and used a simulated network device. However,
modern servers feature processors with very different synchronization costs relative
to processing, utilize operating systems that bear little resemblance to the x-kernel,
and incur the real-world overhead associated with interrupts and device management.
These works ultimately concluded that the synchronization and packet-ordering
overhead associated with fine-grained packet-level processing could severely damage
performance, and that a connection-oriented network stack yielded better efficiency
and performance. However, ten years later there is little if any consensus regarding
the “correct” organization for modern network servers. FreeBSD and Linux both
utilize a variant of a message-parallel network stack organization, whereas Solaris 10
and DragonflyBSD both feature connection-parallel organizations.
Hence, though prior research suggested three points of consideration for network stack parallelization architectures (serial, ConnP, and MsgP), modern practical
variants represent several additional points of interest that make different efficiency
and performance tradeoffs. Consequently, there exist several additional points along
a continuum of both concurrency and efficiency. Figure 2.2 depicts this concurrency/efficiency continuum. Whereas uniprocessor organizations incur no synchronization or thread-scheduling overhead and hence are the most efficient, they are also
the least concurrent. Conversely, purely MsgP organizations exploit the highest level
of concurrency but experience overhead that reduces their efficiency. ConnP organizations attempt to trade some of the MsgP concurrency for increased efficiency, and
better overall performance.
Real-world MsgP and ConnP implementations also compromise concurrency for
efficiency, though some of these compromises are motivated by pragmatism. Whereas
theoretical MsgP organizations attempt limitless message-level concurrency, real-world implementations such as Linux and FreeBSD process messages coming into
the network stack from any given source in-order. In effect, packets from a given
hardware interface or from a given application thread will be processed in-order.
However, these in-order MsgP stacks utilize fine-grained synchronization to enable
message parallelism, particularly between received and transmitted packets. Similarly, realizable ConnP organizations do not have limitless connection parallelism;
instead, ConnP organizations such as Solaris 10 and DragonflyBSD associate each
connection with a group, and then process that group in parallel.
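To make the distinction concrete, the following sketch shows the grouped ConnP dispatch step and contrasts it, in a closing comment, with the in-order MsgP discipline. The sketch is illustrative only; the structures, hash, and queue primitive are hypothetical and are not drawn from Linux, FreeBSD, Solaris, or DragonflyBSD.

    /* Illustrative grouped connection-parallel (ConnP) dispatch: every packet
     * of a connection hashes to the same group, and each group is serviced by
     * one protocol thread, so no per-connection locks are needed in the stack. */
    #include <stdint.h>

    #define NGROUPS 8                         /* e.g., one protocol thread per core */

    struct pkt { uint32_t saddr, daddr; uint16_t sport, dport; /* headers, data... */ };

    void enqueue_for_group(unsigned group, struct pkt *p);   /* hypothetical queue */

    static unsigned conn_group(const struct pkt *p)
    {
        /* Hash the connection 4-tuple; all packets of one connection map to
         * one group and therefore to one thread. */
        uint32_t h = p->saddr ^ p->daddr ^ (((uint32_t)p->sport << 16) | p->dport);
        h ^= h >> 16;
        return h % NGROUPS;
    }

    void connp_dispatch(struct pkt *p)        /* called from the driver path */
    {
        enqueue_for_group(conn_group(p), p);  /* hand off; no protocol locks taken here */
    }

    /* An in-order MsgP stack would instead run the protocol code directly on the
     * current CPU, relying on fine-grained locks around shared connection state:
     *
     *     lock(&conn->lock);  tcp_input(conn, p);  unlock(&conn->lock);
     */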
As Figure 1.2 shows, a massive performance gap exists between achievable throughput of modern in-order MsgP organizations and the link capacity of modern 10 Gigabit interfaces. Furthermore, the figure shows that current multiprocessor operating
systems are not effective at reducing this gap, but that parallel network interfaces
achieve higher throughput than a single interface. This performance gap and the lack
of existing solutions that close it motivate the research in this dissertation.
Thus, prior research has established that there are at least two methods of parallelizing an operating system network stack, each of which differs according to its unit
of concurrency. However, prior evaluations have been conducted in the context of an
experimental user-space operating system that uses a simulated network device, and were conducted on hardware with very different synchronization overhead characteristics from modern hardware. So while previous research established that there
were different approaches to parallelizing an OS network stack, that research did not
fully evaluate those approaches on practical hardware in a practical software environment. In Chapter 3, this dissertation reaches beyond that prior work by examining
the way the network stack’s unit of concurrency affects efficiency and performance
in a real operating system with a real network device, including the associated device and scheduling overhead. This research also breaks ground by examining and
comparing how the means for synchronization (locks versus threads) within a given
organization can affect performance and efficiency, thus rounding out an examination of an entire continuum of network stack parallelization strategies, whereas prior
research examined only some of the points on that continuum.
                                        I/O Access
    Software Manager     Private                             Shared
    Hypervisor           System/360 and /370 Disks           System/360 and /370 Networking [19, 36],
                         and Terminals [37, 43]              Xen 1.0 [7], VMware ESX [56]
    Operating System     POWER4 LPAR [24]                    POWER5 LPAR [4], Xen 2.0+ [16],
                                                             VMware Workstation [53]
Table 2.1: I/O virtualization methods.
2.3 Existing VMM Support for Concurrent Network Servers
Other than OS multiprocessing, virtualization is another method of achieving
server parallelism. Just as parallelized operating systems can exploit connection-level
parallelism, parallel virtual machines exploit a coarser-grained connection parallelism
by managing separate connections inside separate virtual machines. There is a significant amount of existing research in the field with respect to virtualization techniques, much of which predates modern operating system research. With respect
to server I/O, past virtualized architectures have exploited private device architectures (in which a device is assigned to just one virtual machine) and shared device
architectures (in which I/O resources for one device are shared among many virtual
machines). These private and shared I/O architectures have been realized using either hardware-based or software-based techniques. All of these architectures require
that an I/O device cannot be used by a VM to gain access to another VM’s resources,
and prior research has explored these isolation issues as well.
The first widely available virtualized system was IBM's System/370, which was
first deployed almost 40 years ago [43]. Though demand for server consolidation has
inspired new research in virtualization technology for commodity systems, contemporary I/O virtualization architectures bear a strong resemblance to IBM’s original
concepts, particularly with respect to network I/O virtualization. Over time, this
I/O virtualization architecture has led to significant software inefficiencies, ultimately
manifesting themselves in large performance degradations such as those depicted in
Figure 1.3.
There are two approaches to I/O virtualization. Private I/O virtualization architectures statically partition a machine’s physical devices, such as disk and network
controllers, among the system’s virtual machines. In a Private I/O environment, only
one virtual machine has access to a particular device. Shared I/O virtualization architectures enable multiple virtualized operating systems to access a particular device.
Existing Shared I/O systems use software interfaces to multiplex I/O requests from
different virtual machines onto a single device. Private I/O architectures have the
benefit of near-native performance, but they require each virtual machine in a system
to have its own private set of network, disk, and terminal devices. Because this costly
requirement is impractical on both commodity servers and large-scale servers capable
of running hundreds of virtual machines, current-generation VMMs employ Shared
I/O architectures.
2.3.1 Private I/O
IBM’s System/360 was the first widely available virtualization solution [43]. The
System/370 was an extended version of this same architecture and featured hardware-assisted enhancements for processor and memory virtualization, but supported
I/O using the same mechanisms [17, 37, 49]. The first I/O architectures developed for
System/360 and System/370 did not permit shared access to physical I/O resources.
Instead, a particular VM instance had private access to a specific resource, such as a
terminal. To permit many users to access a single costly disk, the System/360 and
System/370 architecture extended the idea of private device access by sub-dividing
contiguous regions on a disk into logically separate, virtual “mini-disks” [43]. Though
multiple virtual machines could access the same physical disk via the mini-disk abstraction, these VMs did not concurrently share access to the same mini-disk region,
and hence mini-disks still represented logical private I/O access. System/360 and
System/370 required that I/O operations (triggered by the start-io instruction) be
trapped and interpreted by the system hypervisor. The hypervisor ensured that a
given virtual machine had permission to access a specific device, and that the given
VM owned the physical memory locations being read from or written to by the pending I/O command. The hypervisor would then actually restart the I/O operation,
returning control to the virtual machine only after the operation completed. Hence,
the System/360 and System/370 hypervisor managed I/O resources.
More recent virtualization systems have also relied on private device access, such as
IBM’s first release of the logical partitioning (LPAR) architecture featuring POWER4
processors [24]. The POWER4 architecture isolated devices at the PCI-slot level and
assigned them to a particular VM instance for management. Each VM required a
physically distinct disk controller for disk access and a physically distinct network
interface for network access. Unlike the System/360 and System/370 architecture,
the POWER4’s I/O devices accessed host memory asynchronously via DMA using
OS-provided DMA descriptors. Since a buggy or malicious guest OS could provide
DMA descriptors pointing to memory locations for which the given VM has no access
permissions, the POWER4 employs an IOMMU [9]. The IOMMU validates all PCI
operations per-slot using a set of hypervisor-maintained permissions. Hence, the
POWER4’s hypervisor can set up the IOMMU at device-initialization time, but I/O
resources can be directly managed at runtime by the guest operating systems.
2.3.2 Software-Shared I/O
Requiring private access to I/O devices imposes significant hardware costs and
scalability limitations, since each VM must have its own private hardware and device
slots. The first shared-I/O virtualization solutions were part of the development of
networking support for System/360 and System/370 between physically separated
virtual machines [19, 36]. This networking architecture supported shared access to
network I/O resources by use of a virtualized spool-file interface that was serviced
by a special-purpose virtual machine, or I/O domain, dedicated to networking. The
various general-purpose VMs in a machine could read from or write to virtualized
spool files. The system hypervisor would interpret these reads and writes based on
whether or not the spool locations were on a physically remote machine; if the data
was on a remote machine, the hypervisor would invoke the special-purpose networking
VM. This networking VM would then use its physical network interfaces to connect to
a remote machine. The remote machine used this same virtualized spool architecture
and dedicated networking VM to service requests. The networking I/O domain was
trusted to not violate memory protection rules, so on System/370 architectures that
supported “Preferred-Machine” execution, the I/O domain could be granted direct
access to the network interfaces and would not require the hypervisor to manage
network I/O [17].
This software architecture for sharing devices through virtualized interfaces is
logically identical to most virtualization solutions today. Xen, VMware, and the
POWER5 virtualization architectures all share access to devices through virtualized
software interfaces and rely on a dedicated software entity to actually perform physical
device management [4, 7, 53]. Subsequent releases of Xen and VMware have moved
device management either into the hypervisor and out of an I/O domain (as is the
case with VMware ESX [56]) or into an I/O domain and out of the hypervisor (as is
the case with Xen versions 2.0 and higher [16]). Furthermore, different architectures
use different interfaces for implementing shared I/O access. For example, the Denali
isolation kernel provides a high-level interface that operates on packets [58]. The Xen
VMM provides an interface that mimics that of a real network interface card but
abstracts away many of the register-level management details [16]. VMware can support either an emulated register-level interface that implements the precise semantics
of a hardware NIC, or it can support a higher-level interface similar to Xen’s [53, 56].
[Figure 2.3: block diagram of a guest domain's frontend driver exchanging data/control and virtual interrupts, through the virtual machine monitor, with the driver domain's backend driver; the driver domain's multiplexing layer and native device driver manage the physical I/O device(s), which interrupt and access CPU/memory directly.]
Figure 2.3: A contemporary software-based, shared-I/O virtualization architecture.
Regardless of the interface, however, the overall organization is fundamentally quite
similar.
Figure 2.3 depicts the organization of a typical modern Shared I/O architecture,
which is heavily influenced by IBM’s original software-based Shared I/O architecture
for sharing network resources. These modern architectures grant direct hardware
access to only a single virtual machine instance, referred to as the “driver domain” in
the figure. Each consolidated server instance exists in an operating system running
inside an unprivileged “guest domain”. The driver domain is a privileged operating
system instance (for example, running Linux) whose sole responsibility is to manage
physical hardware in the machine and present a virtualized software interface to the
guest domains. This interface is exported by the driver domain’s backend driver,
as depicted in the figure. Guest domains access this virtualized interface and issue
I/O requests using their own frontend drivers. Upon reception of I/O requests in
the backend driver, the driver domain uses a separate multiplexing module inside the
operating system (such as a software Ethernet bridge in the case of network traffic)
to map requests to physical device drivers. The driver domain then uses native device
drivers to access hardware, thus carrying out the various I/O operations requested
by guest domains.
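The request flow through Figure 2.3 can be summarized in a brief sketch. The ring, page-sharing, and event primitives named below are hypothetical stand-ins rather than the actual Xen or VMware interfaces; they are only meant to show where the frontend, the backend, the multiplexer, and the native driver sit on the transmit path.

    /* Illustrative transmit path through a split frontend/backend driver pair.
     * All types and helper functions here are hypothetical. */
    struct io_req { unsigned long shared_page; unsigned len; };
    struct ring;                                   /* shared request ring       */
    extern struct ring tx_ring;
    enum { BACKEND_EVENT, GUEST_EVENT };           /* virtual interrupt numbers */

    unsigned long share_page_with_backend(void *buf);
    void  ring_push(struct ring *r, const struct io_req *req);
    int   ring_pop(struct ring *r, struct io_req *req);
    void  send_virtual_interrupt(int event);
    void *map_shared_page(unsigned long page);
    void  bridge_select_device(void *frame);       /* software multiplexer      */
    void  native_nic_transmit(void *frame, unsigned len);

    /* Guest domain: the frontend never touches hardware.  It describes the
     * packet on a ring shared with the driver domain and raises an event.     */
    void frontend_transmit(void *pkt, unsigned len)
    {
        struct io_req r = { share_page_with_backend(pkt), len };
        ring_push(&tx_ring, &r);
        send_virtual_interrupt(BACKEND_EVENT);
    }

    /* Driver domain: the backend maps the guest's buffer, lets the software
     * bridge pick the physical device, and drives it via the native driver.   */
    void backend_service(void)
    {
        struct io_req r;
        while (ring_pop(&tx_ring, &r)) {
            void *frame = map_shared_page(r.shared_page);
            bridge_select_device(frame);            /* multiplexing step        */
            native_nic_transmit(frame, r.len);      /* native device driver     */
            send_virtual_interrupt(GUEST_EVENT);    /* completion notification  */
        }
    }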
This I/O virtualization architecture addresses the many requirements for implementing Shared I/O in a virtualized environment featuring untrusted virtualized
servers. First, the architecture provides a method to multiplex I/O requests from
various guest operating systems onto a single commodity device, which exports just
one management interface to software. This software-only architecture is practical
insofar as it supports a large class of existing commodity devices. Second, the architecture provides a centralized, trusted interface with which to safely translate virtualized
I/O requests originating from an untrusted guest into trusted requests operating on
physical hardware and memory. Third, this architecture provides inter-VM message
notification using a virtualized interrupt system. This messaging system is implemented by the VMM and is used by guest and driver domains to notify each other of
pending requests and event completion.
However, forcing all I/O operations to be forwarded through the driver domain
incurs significant overhead that ultimately reduces performance. The magnitude of
the performance loss for network I/O under the Xen VMM is depicted in Figure 1.3,
and Sugerman et al. have reported similar results using the VMware VMM [53].
These performance losses are attributable to the inefficiency of moving I/O requests
into and out of the driver domain and multiplexing those requests once inside the
driver domain.
2.3.3 Hardware-shared I/O
Direct I/O access (rather than indirect access through management software)
eliminates the overhead of moving all I/O traffic through a software management
entity for multiplexing and memory protection purposes. In addition to supporting
private management of I/O devices by just one software entity at a time (either the
OS or hypervisor), the System/360 and System/370 fully implemented the direct access storage device (DASD) architecture. The DASD architecture enabled concurrent
programs, operating systems, and the hypervisor to access disks directly and simultaneously [12]. DASD-capable devices had several distinct, separately addressable
channels. Software subroutines called channel programs performed programmed I/O
on a channel to carry out a disk request. Disk-access commands in channel programs
executed synchronously. A significant benefit of multi-channel DASD hardware was
that it permitted one channel program to access the disk while another performed data
comparisons (in local memory) to determine if further record accesses were required.
This hardware support for concurrency significantly improved I/O throughput.
On a virtualized system, the hypervisor would trap upon execution of the privileged start-io instruction that was meant to begin a channel program. The hypervisor would then inspect all of the addresses to be used by the device in the pending channel program, possibly substituting machine-physical addresses for machine-virtual addresses. The hypervisor would also verify the ownership of each address on
the disk to be accessed. After ensuring the validity of each address in the programmed-I/O channel subroutine, the hypervisor would execute the modified subroutine [17].
This trap-modify-execute interpretive execution model enabled the hypervisor to
check and ensure that no virtual machine could read or write from another VM’s
physical memory and that no virtual machine could access another VM’s disk area.
However, the synchronous nature of the interface afforded the hypervisor simplicity
with respect to memory protection enforcement; it was sufficient to check the current
permission state in the hypervisor and base I/O operations on that state. However,
modern operating systems operate devices asynchronously to achieve concurrency.
Devices use DMA to access host memory at an indeterminate time after software
makes an I/O request. Rather than simply querying memory ownership state at
the instant software issues an I/O request, a modern virtualization solution that sup-
ports concurrent access by separate virtual machines to a single physical device would
require tracking of memory ownership state over time.
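A schematic of the trap-modify-execute model described above might look like the following sketch. The structure and helper names are hypothetical, and real System/370 channel command words are more involved; the sketch only captures the validate-then-execute ordering.

    /* Illustrative handling of a trapped start-io: the hypervisor walks the
     * guest's channel program, translates and checks every address, and only
     * then runs the modified program synchronously.  All names are hypothetical. */
    struct vm;                                          /* a guest virtual machine */
    struct ccw { unsigned op; unsigned long addr; unsigned len; };  /* channel command */

    unsigned long guest_to_machine(struct vm *g, unsigned long gaddr);  /* 0 on failure */
    int  vm_owns_range(struct vm *g, unsigned long maddr, unsigned len);
    void run_channel_program(struct ccw *prog, int n);  /* returns when the I/O is done */
    void deliver_program_check(struct vm *g);

    void handle_trapped_start_io(struct vm *guest, struct ccw *prog, int n)
    {
        for (int i = 0; i < n; i++) {
            unsigned long maddr = guest_to_machine(guest, prog[i].addr);
            if (maddr == 0 || !vm_owns_range(guest, maddr, prog[i].len)) {
                deliver_program_check(guest);       /* reject the whole program */
                return;
            }
            prog[i].addr = maddr;                   /* substitute machine-physical address */
        }
        run_channel_program(prog, n);               /* synchronous: guest resumes afterward */
    }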
2.3.4 Protection Strategies for Direct-Access Private and Shared Virtualized I/O
Providing untrusted virtual machines with direct access to I/O resources (as in
Private I/O architectures or Hardware-Shared I/O architectures) can substantially
improve performance by avoiding software overheads associated with indirect access
(as in Software-Shared I/O architectures). However, VMs with direct I/O access
could maliciously or accidentally use a commodity I/O device to access another VM’s
memory via the device’s direct memory access (DMA) facilities. Furthermore, a fault
by the device could generate an invalid request to an unrequested region of memory,
possibly corrupting memory.
One approach to providing isolation among operating systems that have direct
I/O access is to leverage a hardware I/O memory management unit (IOMMU). The
IOMMU translates all DMA requests by a device according to the IOMMU’s page
table, which is managed by the VMM. Before making a DMA request, an untrusted
VM must first request that the VMM install a valid mapping in the IOMMU, so that
later the device’s transaction will proceed correctly with a current, valid translation.
Hence, the VMM can effectively use the IOMMU to enforce system-wide rules for
controlling what memory an I/O device (under the direction of an untrusted VM) may
access. By requesting immediate destruction of IOMMU translations, an untrusted
VM can furthermore protect itself against later, errant requests by a faulty I/O device.
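A minimal sketch of that request flow, using hypothetical hypercall and driver names (real interfaces differ and the competing management strategies are examined in Chapter 5), is shown below; the key point is that the mapping is installed before the device is given the buffer and destroyed as soon as the transfer completes.

    /* Illustrative single-use IOMMU protection around one receive buffer.
     * The VMM and NIC calls are hypothetical placeholders. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t dma_addr_t;

    dma_addr_t vmm_iommu_map(void *guest_buf, size_t len);      /* VMM verifies ownership,
                                                                   installs translation    */
    void       vmm_iommu_unmap(dma_addr_t io_addr, size_t len); /* destroys translation    */
    void       nic_post_rx_buffer(dma_addr_t io_addr, size_t len);
    void       nic_wait_rx_complete(dma_addr_t io_addr);

    void receive_one_frame(void *buf, size_t len)
    {
        /* 1. The untrusted guest asks the VMM for an IOMMU mapping covering buf. */
        dma_addr_t io_addr = vmm_iommu_map(buf, len);

        /* 2. Only the I/O-side address is handed to the device; its later DMA
         *    is translated and checked against the IOMMU page table.            */
        nic_post_rx_buffer(io_addr, len);
        nic_wait_rx_complete(io_addr);

        /* 3. Tearing the mapping down immediately means a late or errant DMA to
         *    this address can no longer reach the guest's memory.               */
        vmm_iommu_unmap(io_addr, len);
    }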
Contemporary commodity virtualization solutions run on standard x86 hardware,
which typically lacks an IOMMU. Hence, these solutions forbid direct I/O access
and instead use software to implement both protection and sharing of I/O resources
among untrusted guest operating systems. Confining direct I/O accesses only within
the trusted VMM ensures that all DMA descriptors used by hardware have been
constructed by trusted software. Though commodity VMMs confine direct I/O within
privileged software, they provide indirect, shared access to their unprivileged VMs
using a variety of different software interfaces.
IBM’s high-availability virtualization platforms feature IOMMUs and can support
private direct I/O by untrusted guest operating systems, but they do not support
shared direct I/O. The POWER4 platform supports logical partitioning of hardware
resources among guest operating systems but does not permit concurrent sharing of
resources [24]. The POWER5 platform adds support for concurrent sharing using
software, effectively sacrificing direct I/O access to gain sharing [4]. This sharing
mechanism works similarly to commodity solutions, effectively confining direct I/O
access within what IBM refers to as a “Virtual I/O Server”. Unlike commodity
VMMs, however, this software-based interface is used solely to gain flexibility, not
safety. When a device is privately assigned to a single untrusted guest OS, the
POWER5 platform can still use its IOMMU to support safe, direct I/O access.
The high overhead of software-based shared I/O virtualization motivated recent
research toward hardware-based techniques that support simultaneous, direct-access
network I/O by untrusted guest operating systems. These efforts each take a different approach to implementing isolation and protection. Liu et al. developed an Infiniband-based prototype that supports direct access by applications running within untrusted
virtualized guest operating systems [35]. This work adopted the Infiniband model
of registration-based direct I/O memory protection, in which trusted software (the
VMM) must validate and register the application’s memory buffers before those
buffers can be used for network I/O. Registration is similar to programming an
IOMMU but has different overhead characteristics, because registrations require interaction with the device rather than modification of IOMMU page table entries. Unlike
an IOMMU, registration alone cannot provide any protection against malfunctioning
by the device, since the protection mechanism is partially enforced within the I/O
device.
Raj and Schwan also developed an Ethernet-based prototype device that supports shared, direct I/O access by untrusted guests [45].
Because of hardware-
implementation constraints, their prototype has limited addressability of main memory and thus requires all network data to be copied through VMM-managed bounce buffers. This strategy permits the VMM to validate each buffer but does not provide
any protection against faulty accesses by the device within its addressable memory
range.
AMD and Intel have recently proposed the addition of IOMMUs to their upcoming
architectures [3, 23]. Though they will be new to commodity architectures, IOMMUs
are established components in high-availability server architectures [9]. Ben-Yehuda
et al. recently explored the TCP-stream network performance of IBM's state-of-the-art IOMMU-based architectures using both non-virtualized, “bare-metal” Linux and
paravirtualized Linux running under Xen [10]. They reported that the state-of-the-art IOMMU-management strategy can incur significant overhead. They hypothesized
that modifications to the single-use IOMMU-management strategy could avoid such
penalties.
The concurrent direct network access (CDNA) architecture described in Chapter 4
of this dissertation is an Ethernet-based prototype that supports concurrent, direct
network access by untrusted guest operating systems. Unlike the Ethernet-based
prototype developed by Raj and Schwan, the CDNA prototype requires neither extra
copying nor bounce buffers; instead, the CDNA architecture uses a novel software-based memory protection mechanism. Like registration, this software-based strategy
offers no protection against faulty device behavior. The CDNA architecture does not
fundamentally require software-based DMA memory protection. Rather, CDNA can
be used with an IOMMU to implement DMA memory protection. This dissertation
explores such an approach to DMA memory protection further in Chapter 5. Like all
direct, shared-I/O architectures, CDNA fundamentally requires that the device be
able to access several guests’ memory simultaneously. Consequently, the device could
still use the wrong VM’s data for a particular I/O transaction, and hence it is not
possible to guard against faulty behavior by the device even when using an IOMMU.
Overall, there has been a large body of research regarding support for I/O
virtualization, some of which dates back forty years. Given the continuing demand for
server consolidation, researchers continue to develop architectures for private I/O, software-shared I/O, and hardware-shared I/O, as well as protection strategies for allowing direct access to a particular I/O device by an untrusted virtual machine.
This dissertation explores a novel approach that combines several of these techniques
to create a new, hybrid architecture. This research uses a combination of software and
hardware to facilitate shared I/O with much greater efficiency and performance than
past approaches. Further, this research explores new software techniques for managing
hardware designed to enforce I/O memory protection policies, also achieving higher
efficiency and performance.
2.4
Hardware Support for Concurrent Server I/O
In addition to software support for thread and VM concurrency, there has been significant research regarding concurrent-access I/O devices. These concurrent-access I/O devices alone do not provide an architectural solution to the performance
and efficiency challenges with respect to server concurrency. However, they can be
used in concurrency-aware architectures to improve the efficiency of the software.
2.4.1
Hardware Support for Parallel Receive-side OS Processing
Proposals for parallel receive queues on NICs (such as receive-side scaling (RSS) [40])
are a first step toward providing explicitly concurrent access to I/O devices by multiple threads. Such architectures maintain separate queues for received packets that
can be processed simultaneously by the operating system. The NIC classifies packets
into a specific queue according to a hashing function that is usually based on the IP
address and port information in each IP packet header. Because this IP address and
port information is unique per connection in traditional protocols (such as TCP and
UDP), the NIC can distribute incoming packets into specific queues according to connection. This distribution ensures that packets for the same connection are not placed
in different queues and thus later processed out-of-order by the operating system.
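The sketch below illustrates this classification step. A real RSS implementation uses a Toeplitz hash keyed with a secret supplied by the host; the simpler mixing function and the flow_tuple structure here are illustrative only, and are meant to show that the queue index is a pure function of the connection's addresses and ports.
/*
 * Illustrative receive-queue selection in the spirit of RSS.  The mixing
 * function below is not the Toeplitz hash that real RSS hardware uses; it
 * only demonstrates that the queue index depends solely on the connection's
 * addresses and ports, so packets of one connection always share a queue.
 */
#include <stdint.h>

struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

static uint32_t mix32(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1u;            /* multiplicative mixing constant */
    return h ^ (h >> 16);
}

/* Map a received packet's header fields to one of nqueues receive queues. */
unsigned int rss_select_queue(const struct flow_tuple *ft, unsigned int nqueues)
{
    uint32_t h = 0;

    h = mix32(h, ft->src_ip);
    h = mix32(h, ft->dst_ip);
    h = mix32(h, ((uint32_t)ft->src_port << 16) | ft->dst_port);
    return h % nqueues;          /* same connection, same queue, every time */
}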
While this approach should efficiently improve concurrency for single, non-virtualized
operating systems in receive-dominated workloads, such proposals do not improve
transmit-side concurrency. Though parallel receive queues are a necessary component
to improving the efficiency of receive-side network stack concurrency, this driver/NIC
interface leaves the larger network stack design issues unresolved. These issues must
be confronted to prevent inefficiencies in the network stack from rendering any architectural improvements useless. Consequently, this dissertation examines the larger
network stack issues in detail.
Additionally, a restricted interface such as RSS that considers only receive concurrency is not amenable to supporting concurrent direct hardware access by parallel
virtualized operating systems. A more flexible interface would be beneficial for extracting the most utility from a modified hardware architecture. At a minimum, an
RSS-style NIC architecture would need to be modified to enable more flexible classification of incoming packets based on the virtual machine they belong to rather than
the connection they are associated with. Even so, such a modified device architecture
would be insufficient because it would still require all transmit operations to be
performed via traditional software sharing rather than direct hardware access.
2.4.2
User-level Network Interfaces
User-level network interfaces provide a more flexible hardware/software interface
that allows concurrent user-space applications to directly access a special-purpose
NIC [44, 52]. In effect, the NIC provides a separate hardware context to each requesting application instance. Hence, user-level NICs provide the functional equivalent of
implementing parallel transmit and receive queues on a single traditional NIC, which
could be used as a component toward building an interface that breaks the scalability
limitations of traditional NICs. However, user-level NIC architectures lack two key
features required for use in efficient, concurrent network servers.
First, user-level NICs do not provide context-private event notification. Instead,
applications written for user-level NICs typically poll the status of a private context
to determine if that context is ready to be serviced. While this is perfectly suitable
for high-performance message-passing applications in which the application may not
have any work to do until a new message arrives, a polling model is inappropriate for
general-purpose operating systems or virtual machine monitors in which many other
applications or devices may require service.
Second, user-level network interfaces require a single trusted software entity to implement direct memory access (DMA) memory protection, which limits the scalability
of this approach. For unvirtualized environments, this entity is the operating system;
for virtualized environments, the entity is a single trusted “driver domain” OS instance. Like all applications, user-level NIC applications manipulate virtual memory
addresses rather than physical addresses. Hence, the addresses provided by an application to a particular hardware context on a user-level NIC are virtual addresses.
However, commodity architectures (such as x86) require I/O devices to use physical
addresses. To inform the NIC of the appropriate virtual-to-physical address translations, applications invoke the trusted managing software entity to perform an I/O
interaction with the NIC (typically referred to as memory registration) that updates
the NIC’s current translations. Liu et al. present an implementation of the Infiniband
user-level NIC architecture with support for the Xen VMM and show that memory
registration costs can significantly degrade performance [35]. Unlike user-level NIC
applications that typically only invoke memory registration twice (once during initialization and again during application termination), operating systems frequently
create and destroy virtual-to-physical mappings at runtime, especially when utilizing
zero-copy I/O. Hence, the costly memory registration model is inappropriate for operating systems running on a VMM. The concurrent direct network access
architecture presented in this dissertation avoids these registration costs by using a
lightweight, primarily software-based protection strategy instead. This dissertation
also explores other IOMMU-based strategies for efficient memory protection that are
attractive alternatives to costly on-device memory registration.
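For reference, the sketch below shows what the registration model costs in terms of the standard libibverbs calls (ibv_reg_mr and ibv_dereg_mr); the surrounding helper functions are hypothetical. A long-lived message-passing application pays this price roughly once per buffer, whereas an operating system performing zero-copy I/O would have to repeat it every time a virtual-to-physical mapping changes.
/*
 * Registration of a buffer with the adapter via libibverbs.  The helper
 * functions are hypothetical; ibv_reg_mr() pins the pages and installs
 * translations on the device, which is the per-mapping cost discussed above.
 */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_send_buffer(struct ibv_pd *pd, size_t len, void **buf_out)
{
    void *buf = malloc(len);
    if (buf == NULL)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL) {
        free(buf);
        return NULL;
    }
    *buf_out = buf;
    return mr;
}

void release_send_buffer(struct ibv_mr *mr, void *buf)
{
    ibv_dereg_mr(mr);    /* removes the adapter's translations for the buffer */
    free(buf);
}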
Thus, prior research has examined hardware support for both OS and VMM concurrency, but this hardware alone is not sufficient to address the problems of each.
Furthermore, these prior efforts either solved only one aspect of concurrency
(as in the RSS model, which addresses receive parallelism but not transmit parallelism) or lacked important components necessary for high-performance servers
(such as low-overhead DMA memory protection to facilitate
zero-copy I/O). This dissertation uses both hardware and software to create a comprehensive solution or, where applicable, to determine and characterize the components that
are still necessary for one. Moreover, this research represents a
fundamentally different approach from these past efforts, using a hardware/software
synthesis to achieve a comprehensive architectural analysis and solution rather than
a primarily hardware-based approach that examines only part of the problems
and their overheads.
2.5
Summary
Server technology has become increasingly important for academic and commercial applications, and the Internet era has brought explosive growth in demand from
home users. The demand for efficient, high-performance server technology has motivated extensive research over the past several decades that touches on issues related
to the contributions of this dissertation. Though there is extensive research in this
area, there are new challenges with regard to supporting new levels of thread- and
virtual-machine-level concurrency. This chapter has described the efficiency and performance challenges observed in modern systems and has outlined the research that
is most closely related to solving these problems. Previous research has explored
some variations of OS architectures that support thread-parallel network I/O processing, but the research in this dissertation reaches beyond that by exploring a fuller
spectrum of OS architectures and by examining them on real, rather than simulated,
hardware and software. Furthermore, previous research has explored the performance,
efficiency, and protection issues related to different I/O virtualization architectures,
but the research in this dissertation presents a novel architecture
that brings with it different performance, efficiency, and protection characteristics.
Finally, a key aspect of the research in this dissertation is that it uses hardware
to improve the efficiency of software. There have been several prior efforts to use
hardware to support OS and VMM concurrency, but these efforts have been almost
exclusively hardware-centric and did not address issues relevant to real-world application performance, such as support for zero-copy I/O in modern server applications.
The architecture presented in this dissertation uses hardware in synthesis with software to comprehensively address efficiency and performance of real-world applications
running on modern thread- and virtual-machine-concurrent network servers.
Chapter 3
Parallelization Strategies for OS Network Stacks
As established in the previous chapter, network server architectures will feature
chip multiprocessors in the future. Furthermore, the slowdown in uniprocessor performance improvements means that network servers will have to leverage parallel
processors to meet the ever-increasing demand for network services. A wide range of
parallel network stack organizations have been proposed and implemented. Among
the parallel network stack organizations, there exist two major categories: message-based parallelism (MsgP) and connection-based parallelism (ConnP). These organizations expose different levels of concurrency, in terms of the maximum available
parallelism within the network stack. They also achieve different levels of efficiency,
in terms of achieved network bandwidth per processing core, as they incur differing
cache, synchronization, and scheduling overheads.
The costs of synchronization and scheduling have changed dramatically in the
years since the parallel network stack organizations introduced in Chapter 2 were
originally proposed and studied. Though processors have become much faster, the
gap between processor and memory performance has become much greater, increasing
the cost, in terms of lost execution cycles, of synchronization and scheduling. Furthermore, technology trends and architectural complexity are preventing uniprocessor
performance growth from keeping pace with Ethernet bandwidth increases. Both of
these factors motivate a fresh examination of parallel network stack architectures on
modern parallel hardware.
Today, network servers are frequently faced with tens of thousands of simultaneous
connections. The locking, cache, and scheduling overheads of parallel network stack
organizations vary depending on the number of active connections in the system.
However, network performance evaluations generally focus on the bandwidth over a
small number of connections, often just one. In contrast, this study evaluates the
different network stack organizations under widely varying connection loads.
This study has four main contributions. First, this study presents a fair comparison of uniprocessor, message-based parallel, and connection-based parallel network stack organizations on modern multiprocessor hardware. Three competing network stack organizations are implemented within the FreeBSD 7 operating system:
message-based parallelism (MsgP), connection-based parallelism using threads for
synchronization (ConnP-T), and connection-based parallelism using locks for synchronization (ConnP-L). The uniprocessor version of FreeBSD is efficient, but its
performance falls short of saturating the fastest available network interfaces. Utilizing 4 cores, the parallel stack organizations can outperform the uniprocessor stack,
but at reduced efficiency.
Second, this study compares the performance of the different network stack organizations when using a single 10 Gbps network interface versus multiple 1 Gbps
network interfaces. Unsurprisingly, a uniprocessor network stack can more efficiently
utilize a single 10 Gbps network interface, as multiple network interfaces generate
additional interrupt overheads. However, the interactions between the network stack
and the device serialize the parallel stack organizations when only a single network
interface is present in the system. The parallel network stack organizations benefit
from the device-level parallelism that is exposed by having multiple network interfaces, allowing a system with multiple 1 Gbps network interfaces to outperform a
system with a single 10 Gbps network interface. With multiple interfaces, the parallel
organizations are able to process interrupts concurrently on multiple processors
and experience reduced lock contention at the device level.
Third, this study presents an analysis of the locking and scheduling overhead
incurred by the different parallel stack organizations. MsgP experiences significant
locking overhead, but is still able to outperform the uniprocessor for almost all connection loads. In contrast, ConnP-T has very low locking overhead but incurs significant
scheduling overhead, leading to reduced performance compared to even the uniprocessor kernel for all but the heaviest loads. ConnP-L mitigates the locking overhead of
MsgP, by grouping connections so that there is little global locking, and the scheduling
overhead of ConnP-T, by using the requesting thread for network processing rather
than forwarding the request to another thread.
Finally, this study analyzes the cache behavior of the different parallel stack organizations. Specifically, this study categorizes data sharing within the network stack
as either concurrent or serial. If a datum may be accessed simultaneously by two or
more threads, that datum is shared concurrently. If, however, a datum may only be
accessed by one thread at a time, but it may be accessed by different threads over
time, that datum is shared serially. CMP organizations with shared caches will likely
reduce the cache misses to concurrently shared data, but are unlikely to provide any
benefit for serially shared data. Unfortunately, this study shows that there is a significant amount of serial sharing in the parallel network stack organizations, but very
little concurrent sharing.
The remainder of this chapter proceeds as follows. The next section further motivates the need for parallelized network stacks in current and future systems. Section 3.2 describes the parallel network stack architectures that are evaluated in this
chapter. Section 3.3 then describes the hardware and software used to evaluate each
organization. Sections 3.4 and 3.5 present evaluations of the organizations using one
10 Gbps interface and six 1 Gbps interfaces, respectively. Section 3.6 provides a discussion of these results. This chapter is based in part on my previously published
work [59].
3.1
Background
The most efficient network stacks in modern operating systems are designed for
uniprocessor systems. There are still concurrent threads in such operating systems,
but locking and scheduling overhead are minimized as only one thread can execute at
a time. For example, a lock operation can often be made atomic simply by masking
interrupts during the operation. Despite their efficiency, such network stacks are not
capable of saturating a modern 10 Gbps Ethernet link. In 2004, Hurwitz and Feng
found that, using Linux 2.4 and 2.5 uniprocessor kernels (with TCP segmentation
offloading), they were only able to achieve about 2.5 Gbps on a 2.4 GHz Intel Pentium
4 Xeon system [20].
Increasing processor performance has allowed uniprocessor network stacks to achieve
higher bandwidth, but they still are not close to saturating a 10 Gbps Ethernet link.
Table 3.1 shows the performance of FreeBSD 7 on a modern 2.2 GHz Opteron uniprocessor system. The first row shows the performance of the uniprocessor kernel, which
remains nearly constant around 4 Gbps as the number of connections in the system
is varied. While this is an improvement over the performance reported in 2004, it
is still less than one half of the link’s capacity. Though the use of jumbo frames
can improve these numbers, network servers connected to the Internet will continue
to use standard 1500-byte Ethernet frames for the foreseeable future in order to
interoperate with legacy hardware.
In the face of technology constraints and uniprocessor complexity, architects have
turned to chip multiprocessors to continue to provide additional processing performance [14, 18, 25, 30, 31, 32, 33, 34, 42, 54].
OS Type             Processors   24 conns   192 conns   16384 conns
Uniprocessor only   1            4177       4156        4037
SMP capable         1            3688       3796        3774
SMP capable         4            3328       3251        1821
Table 3.1: FreeBSD network bandwidth (Mbps) using a single 10 Gbps network interface.
The network stack within the operating system will have to be able to take advantage of such architectures in order to keep
up with increases in network bandwidth demand. However, parallelizing the network
stack inherently reduces its efficiency. A symmetric multiprocessing (SMP) kernel
must use a more expensive implementation of lock operations as there is now physical concurrency in the system. For a lock operation to be atomic, it must be ensured
that threads running on the other processors will not interfere with the read-modify-write sequence required to acquire and release a lock. On x86 hardware, this is
accomplished by adding the lock prefix to lock acquisition instructions. The lock
prefix causes the instruction to be extremely expensive, as it serializes all instruction
execution on the processor and it locks the system bus to ensure that the processor can do an atomic read-modify-write with respect to the other processors in the
system. Scheduling is also potentially more expensive, as the operating system now
must schedule multiple threads across multiple physical processors. As the second
row in Table 3.1 shows, in FreeBSD 7, the overhead of making the kernel SMP capable results in a 7–12% reduction in efficiency. Note that this is still using just a single
physical processor.
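The following sketch, which is not FreeBSD's actual mutex implementation, illustrates why SMP lock acquisition is more expensive: the atomic exchange in spin_lock() compiles to a lock-prefixed instruction on x86, whereas a uniprocessor kernel could simply mask interrupts around a test-and-set.
/*
 * Not FreeBSD's mutex code: a minimal C11 spinlock showing the cost source.
 * atomic_flag_test_and_set compiles to a lock-prefixed exchange on x86,
 * which performs a serializing read-modify-write visible to all cores.
 */
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;
} spinlock_t;

#define SPINLOCK_INITIALIZER { ATOMIC_FLAG_INIT }

void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

void spin_unlock(spinlock_t *l)
{
    /* A plain release store; no lock prefix is required to unlock. */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}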
As the number of processors increases, lock contention becomes a major issue. The
third row of Table 3.1 shows the results of this effect. With the same SMP capable
kernel with 4 physical processors, not only does the efficiency further decrease, but the
absolute performance also decreases. Note that the problem gets dramatically worse
as the number of connections is increased. This is because with a larger number of
connections, each connection has much lower bandwidth, so less work is accomplished
for each lock acquisition.
These results strongly motivate a reexamination of network stack parallelization
strategies in the face of modern technology trends. It seems unlikely that uniprocessor performance will scale fast enough to keep up with increasing network bandwidth
demands, so the efficiency of uniprocessor network stacks can no longer be relied
upon to provide the necessary networking performance. Furthermore, the inefficiencies of modern SMP capable network stacks mean that small-scale
chip multiprocessors only make matters worse: networking performance actually gets worse, not better, when using 4 processing cores. There have been
several proposals to use a single core of a multiprocessor to achieve the efficiencies
of a uniprocessor network stack [6, 11, 46, 47, 48]. However, this is not a solution,
either, as each core of a CMP is likely to provide less performance than a monolithic
uniprocessor. So, if a uniprocessor is insufficient, there is no reason to believe a single
core of a CMP will be able to do any better. Furthermore, dedicating multiple cores
for network processing reintroduces the need for synchronization. The remainder of this chapter
will examine the continuum of parallelization strategies depicted in Figure 2.2 and
analyze their behavior on small-scale multiprocessor systems to better understand
this situation.
3.2
Parallel Network Stack Architectures
As was introduced in Chapter 2 and depicted in Figure 2.2, there are two primary
network stack parallelization strategies: message-based parallelism and connection-based parallelism. Using message-based parallelism, any message (or packet) may be
processed simultaneously with respect to other messages. Hence, messages for a single
connection could be processed concurrently on different threads, potentially resulting
in improved performance. Connection-based parallelism is more coarse-grained; at the
beginning of network processing (either at the top or bottom of the network stack),
messages and packets are classified according to the connection with which they are
associated. All packets for a certain connection are then processed by a single thread
at any given time. However, each thread may be responsible for processing one or
more connections.
These parallelization strategies were studied in the mid-1990s, between the introduction of 100 Mbps and 1 Gbps Ethernet. Despite those efforts, there is not a
solid consensus among modern operating system developers on how to design efficient and scalable parallel network stacks. Major subsystems of FreeBSD and Linux,
including the network stack, have been redesigned in recent years to improve performance on parallel hardware. Both operating systems now incorporate variations of
message-based parallelism within their network stacks. Conversely, Sun has recently
redesigned the Solaris operating system for its high-throughput computing microprocessors, and it now incorporates a variation of connection-based parallelism [55].
DragonflyBSD also uses connection-based parallelism within its network stack.
Each strategy was implemented within the FreeBSD 7 operating system to enable a
fair comparison of the trade-offs among the different strategies. This section provides
a more detailed explanation of how each parallelization strategy works.
3.2.1
Message-based Parallelism (MsgP)
Message-based parallel (MsgP) network stacks, such as FreeBSD, exploit parallelism by allowing multiple threads to operate within the network stack simultaneously. Two types of threads may perform network processing: one or more application
threads and one or more inbound protocol threads. When an application thread makes
a system call, that calling thread context is “borrowed” to then enter the kernel and
carry out the requested service. So, for example, a read or write call on a socket
would loan the application thread to the operating system to perform networking
tasks. Multiple such application threads can be executing within the kernel at any
given time. The network interface’s driver executes on an inbound protocol thread
whenever the network interface card (NIC) interrupts the host, and it may transfer
packets between the NIC and host memory. After servicing the NIC, the inbound
protocol thread processes received packets “up” through the network stack.
Given that multiple threads can be active within the network stack, FreeBSD utilizes fine-grained locking around shared kernel structures to ensure proper message
ordering and connection state consistency. As a thread attempts to send or receive
a message on a connection, it must acquire various locks when accessing shared connection state, such as the global connection hash table lock (for looking up TCP
connections) and per-connection locks (for both socket state and TCP state). If a
thread is unable to obtain a lock, it is placed in the lock’s queue of waiting threads
and yields the processor, allowing another thread to execute. To prevent priority
inversion, priority propagation from the waiting threads to the thread holding the
lock is performed.
As is characteristic of message-based parallel network stacks, FreeBSD’s locking
organization thus allows concurrent processing of different messages on the same
connection, so long as the various threads are not accessing the same portion of the
connection state at the same time. For example, one thread may process TCP timeout
state based on the reception of a new ACK, while at the same time another thread
is copying data into that connection’s socket buffer for later transmission. However,
note that the inbound thread configuration described is not the FreeBSD 7 default.
Rather, the operating system’s network stack has been configured to use the optional
direct-dispatch mechanism. Normally, dedicated parallel driver threads service each
NIC and then hand off inbound packets to a single protocol thread via a shared
queue. That protocol thread then processes the received packets “up” through the
network stack. The default configuration thus limits the performance of MsgP and is
hence not considered in this chapter. The thread-per-NIC model also differs from the
message-parallel organization described by Nahum et al. [41], which used many more
worker threads than interfaces. Such an organization requires a sophisticated scheme
to ensure these worker threads do not reorder inbound packets that were received in
order, and hence that organization is also not considered.
3.2.2
Connection-based Parallelism (ConnP)
To compare connection parallelism in the same framework as message parallelism,
FreeBSD 7 was modified to support two variants of connection-based parallelism
(ConnP) that differ in how they serialize TCP/IP processing within a connection. The
first variant assigns each connection to one of a small number of protocol processing
threads (ConnP-T). The second variant assigns each connection to one of a small
number of locks (ConnP-L).
Connection Parallelism Serialized by Threads (ConnP-T)
Connection-based parallelism using threads utilizes several kernel threads dedicated to per-connection protocol processing. Each protocol thread is responsible for
processing a subset of the system’s connections. At each entry point into the TCP/IP
protocol stack, the requested operation is enqueued for service by a particular protocol
thread based on the connection that is being processed. Each connection is uniquely
mapped to a single protocol thread for the lifetime of that connection. Later, the protocol threads dequeue requests and process them appropriately. No per-connection
state locking is required within the TCP/IP protocol stack, because the state of each
connection is only manipulated by a single protocol thread.
The kernel protocol threads are simply worker threads that are bound to a specific
CPU. They dequeue requests and perform the appropriate processing; the messaging
system between the threads requesting service and kernel protocol threads maintains
strict FIFO ordering. Within each protocol thread, several data structures that are
normally system-wide (such as the TCP connection hash table) are replicated so
that they are thread-private. Kernel protocol threads provide both synchronous and
asynchronous interfaces to threads requesting service.
If a requesting thread requires a return value or if the requester must maintain
synchronous semantics (that is, the requester must wait until the kernel thread completes the desired request), that requester yields the processor and waits for the kernel
thread to complete the requested work. Once the kernel protocol thread completes
the desired function, the kernel thread sends the return value back to the requester
and signals the waiting thread. This is the common case for application threads,
which require a return value to determine if the network request succeeded. However,
interrupt threads (such as those that service the network interface card and pass “up”
packets received on the network) do not require synchronous semantics. In this case,
the interrupt context classifies each packet according to its connection and enqueues
the packet for the appropriate kernel protocol thread. The connection-based parallel stack uniquely maps a packet or socket request to a specific protocol thread by
hashing the 4-tuple of remote IP address, remote port number, local IP address, and
local port number. This implementation of connection-based parallelism is like that
of DragonflyBSD.
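A condensed sketch of this dispatch step appears below. The structure and function names are illustrative rather than the actual implementation; the point is that the 4-tuple hash alone selects the owning protocol thread, so no per-connection locking is needed inside the TCP/IP stack.
/*
 * Illustrative ConnP-T dispatch: the 4-tuple hash picks the protocol thread
 * that owns the connection.  Names and structures are stand-ins, and the
 * FIFO enqueue onto the chosen thread's request queue is left as a stub.
 */
#include <stdint.h>

#define NPROTO_THREADS 4          /* one protocol thread pinned to each core */

struct conn_request {
    uint32_t local_ip, remote_ip;
    uint16_t local_port, remote_port;
    void    *payload;             /* packet or socket request being handed off */
};

/* Hypothetical FIFO enqueue onto protocol thread thread_idx's queue. */
void proto_thread_enqueue(unsigned int thread_idx, struct conn_request *req);

static unsigned int conn_hash(const struct conn_request *req)
{
    uint32_t h = req->local_ip ^ req->remote_ip;

    h ^= ((uint32_t)req->local_port << 16) | req->remote_port;
    h ^= h >> 13;
    h *= 0x85ebca6bu;
    return h ^ (h >> 16);
}

void dispatch_to_protocol_thread(struct conn_request *req)
{
    /* Every request for a given connection maps to the same thread. */
    unsigned int idx = conn_hash(req) % NPROTO_THREADS;

    proto_thread_enqueue(idx, req);
}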
Connection Parallelism Serialized by Locks (ConnP-L)
Just as in thread-serialized connection parallelism, connection-based parallelism
using locks is based upon the principle of isolating connections into groups that are
each bound to a single entity during execution. As the name implies, however, the
binding entity is not a thread; instead, each group is isolated by a mutual exclusion
lock.
When an application thread enters the kernel to obtain service from the network
stack, the network system call maps the connection being serviced to a particular
group using a mechanism identical to that employed by thread-serialized connection
parallelism. However, rather than building a message and passing it to that group’s
specific kernel protocol thread for service, the calling thread directly obtains the lock
for the group associated with the given connection. After that point, the calling
thread may access any of the group-private data structures, such as the group-private
connection hash table or group-private per-connection structures. Hence, these locks
serve to ensure that at most one thread may be accessing each group’s private connection structures at a time. Upon completion of the system call in the network
stack, the calling thread releases the group lock, allowing another thread to obtain
that group’s lock if necessary. Threads accessing connections in different groups may
proceed concurrently through the network stack without obtaining any stack-specific
locks other than the group lock.
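The sketch below illustrates this entry path. A POSIX mutex stands in for the kernel's group lock and the helper names are hypothetical; unlike ConnP-T, the calling thread performs the protocol work itself while holding the group lock rather than handing the request to another thread.
/*
 * Illustrative ConnP-L entry path: the 4-tuple selects a group, and the
 * caller does the protocol work itself under that group's lock.  A pthread
 * mutex stands in for the kernel lock; conn_group_hash() and
 * tcp_output_locked() are hypothetical stand-ins.
 */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NGROUPS 128                          /* the ConnP-L(128) configuration */

static pthread_mutex_t group_lock[NGROUPS];

void conn_groups_init(void)
{
    for (int i = 0; i < NGROUPS; i++)
        pthread_mutex_init(&group_lock[i], NULL);
}

unsigned int conn_group_hash(uint32_t lip, uint32_t rip,
                             uint16_t lport, uint16_t rport);
void tcp_output_locked(void *conn_state, const void *data, size_t len);

void connp_l_send(uint32_t lip, uint32_t rip, uint16_t lport, uint16_t rport,
                  void *conn_state, const void *data, size_t len)
{
    unsigned int g = conn_group_hash(lip, rip, lport, rport) % NGROUPS;

    pthread_mutex_lock(&group_lock[g]);        /* serializes only this group   */
    tcp_output_locked(conn_state, data, len);  /* group-private state is safe  */
    pthread_mutex_unlock(&group_lock[g]);
}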
Inbound packet processing is also analogous to connection-based parallelism using
threads. After receiving a packet, the inbound protocol thread classifies the packet
into a group. Unlike the thread-oriented connection-parallel case, the inbound thread
need not hand off the packet from the driver to the worker thread corresponding
to the packet’s connection group. Instead, the inbound thread directly obtains the
appropriate group lock for the packet and then processes the packet “up” the protocol
stack without any thread handoff. This control flow is similar to the message-parallel
stack, but the lock-serialized connection-parallel stack does not require any further
protocol locks after obtaining the connection group lock. As in the MsgP case, there
is one inbound protocol thread for each NIC, but the number of groups may far
exceed the number of threads. This implementation of connection-based parallelism
is similar to the implementation used in Solaris 10.
3.3
Methodology
To gain insights into the behavior and characteristics of the parallel network stack
architectures described in Section 3.2, these architectures were evaluated on a modern
chip multiprocessor. All stack architectures were implemented within the 2006-03-27
repository version of the FreeBSD 7 operating system to facilitate a fair comparison.
This section describes the benchmarking methodology and hardware platforms.
3.3.1
Evaluation Hardware
The parallel network stack organizations were evaluated using a 4-way SMP
Opteron system, using either a single 10 Gbps Ethernet interface or six 1 Gbps Ethernet interfaces. The system consists of two dual-core 2.2 GHz Opteron 275 processors
and four 512 MB PC2700 DIMMs per processor (two per memory channel). Each of
the four processor cores has a private level-2 cache. The 10 Gbps NIC evaluation is
based on a Myricom 10 Gbps PCI-Express Ethernet interface. The six 1 Gbps NIC
evaluation is based on three dual-port Intel PRO/1000-MT Ethernet interfaces that
are spread across the motherboard’s PCI-X bus segments.
In both configurations, data is transferred between the 4-way Opteron’s Ethernet
interface(s) and one or more client systems. The 10 Gbps configuration uses one
client with an identical 10 Gbps interface as the system under test, whereas the six-NIC configuration uses three client systems that each have two Gigabit Ethernet
interfaces. Each client is directly connected to the 4-way Opteron without the use of
a switch. For the 10 Gbps evaluation, the client system uses faster 2.6 GHz Opteron
285 processors and PC3200 memory, so that the client will never be a bottleneck in
any of the tests. For the six-NIC evaluation, each client was independently tested to
confirm that it can simultaneously sustain the theoretical peak bandwidth of its two
interfaces. Therefore, all results are determined solely by the behavior of the 4-way
Opteron 275 system.
3.3.2
Parallel TCP Benchmark
Most existing network benchmarks evaluate single-connection performance. However, modern multithreaded server applications simultaneously manage tens to thousands of connections. This parallel network traffic behaves quite differently than a
single network connection. To address this issue, a multithreaded, event-driven network benchmark was developed that distributes traffic across a configurable number
of connections. The benchmark distributes connections evenly across threads and
utilizes libevent to manage connections within a thread. For all of the experiments
in this chapter, the number of threads used by the benchmark is equal to the number
of processor cores being used. Each thread manages an equal number of connections.
For evaluations using 6 NICs, the application’s connections are distributed across the
server’s NICs equally such that each of the four threads uses each NIC, and every
thread has the same number of connections that map to each NIC.
Each thread sends data over all of its connections using zero-copy sendfile().
Threads receive data using read(). The sending and receiving socket buffer sizes
are set to be sufficiently large (typically 256 KB) to accommodate the large TCP
windows for high-bandwidth connections. Using larger socket buffers did not improve
performance for any test. All experiments use the standard 1500-byte maximum
transmission unit and do not utilize TCP segmentation offload, which currently is
not implemented in FreeBSD. The benchmark is always run for 3 minutes.
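The fragment below sketches the benchmark's per-connection transmit step under this setup, using FreeBSD's sendfile(2) and a 256 KB send buffer. It is illustrative only; connection setup, libevent event registration, and EAGAIN handling are omitted.
/*
 * Illustrative per-connection transmit step of the benchmark: enlarge the
 * send buffer, then push file data with FreeBSD's zero-copy sendfile(2).
 * This is not the benchmark itself; event-loop and error handling are omitted.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define SOCKBUF_SIZE (256 * 1024)

/* Enlarge the send buffer so large TCP windows can stay full. */
int configure_socket(int sock)
{
    int sz = SOCKBUF_SIZE;

    return setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));
}

/* Send the next chunk of file fd over sock starting at *offset. */
int send_chunk(int sock, int fd, off_t *offset, size_t chunk)
{
    off_t sent = 0;

    if (sendfile(fd, sock, *offset, chunk, NULL, &sent, 0) == -1)
        return -1;           /* partial sends and EAGAIN handling omitted */
    *offset += sent;
    return 0;
}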
Stack Type      24 conns   192 conns   16384 conns
UP              4177       4156        4037
MsgP            3328       3251        1821
ConnP-T(4)      2543       2475        2483
ConnP-L(128)    3321       3240        1861
Table 3.2: Aggregate throughput (Mbps) for uniprocessor, message-parallel, and connection-parallel network stacks.
3.4
Evaluation using One 10 Gigabit NIC
Table 3.2 shows the aggregate throughput across all the connections of the parallel
TCP benchmark described in Section 3.3.2 when using a single 10 Gbps interface. The
table presents the throughput for each network stack organization when the evaluated
system is transmitting data on 24, 192, or 16384 simultaneous connections.
“UP” is the uniprocessor version of the FreeBSD kernel running on a single core of
the Opteron server. The rest of the configurations are run on all 4 cores. “MsgP” is
the multiprocessor FreeBSD-based MsgP kernel described in Section 3.2.1. "ConnP-T(4)" is the multiprocessor FreeBSD-based ConnP-T kernel described in Section 3.2.2,
using 4 kernel protocol threads for TCP/IP stack processing that are each pinned to a
different core. “ConnP-L(128)” is the multiprocessor FreeBSD-based ConnP-L kernel
described in Section 3.2.2. ConnP-L(128) divides the connections among 128 locks
within the TCP/IP stack.
As Table 3.2 shows, none of the parallel organizations outperform the “UP” kernel.
This corroborates prior evaluations of 10 Gbps Ethernet that used hosts with two
processors and an SMP variant of Linux and exhibited worse performance than when
the hosts used a uniprocessor kernel [20]. Of the parallel organizations, MsgP and
ConnP-L perform approximately the same and outperform ConnP-T when using 24
or 192 connections. However, ConnP-T performs best when using 16384 connections.
Both the software interface to the single 10 Gbps NIC and the various overheads
inherent to each parallel approach limit performance and prevent the parallel organizations
from outperforming the uniprocessor. When using one NIC, performance
is limited by the serialization constraints imposed by the device’s interface. Because
the device has a single physical interrupt line, only one thread is triggered when
the device raises an interrupt, and hence one thread carries received packets “up”
through the network stack as described in Section 3.2.1. Transmit-side traffic also
faces a device-imposed serialization constraint. Because multiple threads can potentially request to transmit a packet at the same time and invoke the NIC’s driver, the
driver requires acquisition of a mutual exclusion lock to ensure consistency of shared
state related to transmitting packets. Process profiling shows that for all connection
loads, the driver’s lock is held by a core in the system nearly 100% of the time, and
that even with 16384 connections, MsgP and ConnP-L organizations show more than
50% idle time. The ConnP-T organization is also constrained by the driver’s lock,
but it is able to outperform the other organizations with 16384 connections because
it does not constrain received acknowledgement packets to be processed by the single
interrupt thread, as the other organizations do. Instead, it is able to distribute receive processing to protocol threads running on all of the processor cores. However,
ConnP-T performs worse than the uniprocessor because of the significant scheduler
overheads associated with ConnP-T’s thread handoff mechanism.
3.5
Evaluation using Multiple Gigabit NICs
As is shown in the previous section, using a single 10 Gbps interface limits the
parallelism available to the network stack at the device interface. This external bottleneck prevents the parallelism within the network stack from being exercised. To
provide additional inbound parallelism and to reduce the degree to which a single
driver’s lock can serialize network stack processing, the uniprocessor, message-parallel,
and connection-parallel organizations are evaluated using six Gigabit Ethernet NICs rather than a single 10 Gigabit NIC.
[Figure: aggregate transmit throughput in Mb/s (0–6000) versus number of connections (24–16384) for the UP, MsgP, ConnP-T(4), and ConnP-L(128) kernels.]
Figure 3.1: Aggregate transmit throughput for uniprocessor, message-parallel and connection-parallel network stacks using 6 NICs.
Hence, on the inbound processing path there
are six different interrupts with six different interrupt threads to feed the network
stack in parallel. Each NIC has a separate driver instance with a separate driver
lock, reducing the probability that the network stack will contend for a driver lock.
This model more closely resembles the abundant thread parallelism that is presented
to the operating system at the application layer by the parallel benchmark and hence
fully stresses the network stack’s parallel processing capabilities. Because the single
10 Gbps-NIC configuration cannot fully exercise the processing resources of each organization and cannot effectively isolate the network stack, it is not examined further.
Figures 3.1 and 3.2 depict the aggregate TCP throughput across all connections
for the various network stack organizations when using six separate Gigabit interfaces.
Figure 3.1 shows that the “UP” kernel performs well when transmitting on a small
number of connections, achieving a bandwidth of 3804 Mb/s with 24 connections.
[Figure: aggregate receive throughput in Mb/s (0–6000) versus number of connections (24–16384) for the UP, MsgP, ConnP-T(4), and ConnP-L(128) kernels.]
Figure 3.2: Aggregate receive throughput for uniprocessor, message-parallel and
connection-parallel network stacks using 6 NICs.
However, total bandwidth decreases as the number of connections increases. MsgP
performs better, providing an 11% improvement over the uniprocessor bandwidth
at 24 connections but quickly ramps up to 4630 Mb/s, holding steady through 768
connections and then decreasing to 3403 Mb/s with 16384 connections. ConnP-T(4) achieves its peak bandwidth of 3169 Mb/s with 24 connections and provides
approximately steady bandwidth as the number of connections increases. Finally, the
ConnP-L(128) curve is shaped similarly to that of MsgP, but its performance is higher in
magnitude, and it always outperforms the uniprocessor kernel. ConnP-L(128) delivers
steady performance around 5440 Mb/s for 96–768 connections and then gradually
decreases to 4747 Mb/s with 16384 connections. This peak performance is roughly
the peak TCP throughput deliverable by the three dual-port Gigabit NICs.
Figure 3.2 shows the aggregate TCP throughput across all connections when receiving data on six Gigabit interfaces. Again, ConnP-L(128) performs best, followed
by MsgP, ConnP-T(4), and the uniprocessor kernel. Unlike the transmit case, the parallel organizations always outperform the uniprocessor, and in many cases they receive
at a higher rate than they transmit. The ConnP-L(128) organization is again able to
receive at near-peak performance at 384 connections and holds approximately steady,
receiving over 5 Gb/s of throughput using 16384 connections. Both the ConnP-T(4)
and uniprocessor kernels also receive steady (but lower) bandwidth across all connection loads tested, only slightly decreasing as connections are added. Conversely,
MsgP does not provide as consistent bandwidth across the various connection loads,
but it does uniformly outperform both ConnP-T(4) and “UP”.
3.6
Discussion and Analysis
The locking, scheduling, and cache overheads of the network stack vary depending
on both the parallel network stack organization and the number of active connections
in the system. The following subsections will examine these issues for the best performing hardware configuration, a system with six 1 Gbps network interfaces. All of
the statistics within this section were collected using either the Opterons’ performance
counters or FreeBSD’s lock-profiling facilities.
3.6.1
Locking Overhead
There are two significant costs of locking within the parallelized network stacks.
The first is that SMP locks are fundamentally more expensive than uniprocessor locks.
In a uniprocessor kernel, a simple atomic test-and-set instruction can be used to
protect against interference across context switches, whereas SMP systems must use
system-wide locking to ensure proper synchronization among simultaneously running
threads. This is likely to incur significant overhead in the SMP case. For example,
on x86 architectures, the lock prefix, which is used to ensure that an instruction is executed atomically across the system, effectively locks all other cores out of the memory system during the execution of the locked instruction.
[Diagram: the outbound send path (Socket Send, TCP Send, TCP Output, IP Output, Ethernet Output, Driver), showing acquisitions and releases of the Socket Buffer, Connection, Route, and TX Interface Queue locks along with the global Connection Hashtable and Route Hashtable locks.]
Figure 3.3: The outbound control path in the application thread context.
OS Type        6 conns   192 conns   16384 conns
MsgP           89        100         100
ConnP-L(4)     60        56          52
ConnP-L(8)     51        30          26
ConnP-L(16)    49        18          14
ConnP-L(32)    41        10          7
ConnP-L(64)    37        6           4
ConnP-L(128)   33        5           2
Table 3.3: Percentage of lock acquisitions for global TCP/IP locks that do not succeed immediately when transmitting data.
The second is that contention for global locks within the network stack is significantly increased when multiple threads are actively performing network tasks simultaneously. As an illustration of how locks can contend within the network stack,
Figure 3.3 shows the locking required in the control path for send processing within
the sending application’s thread context in the MsgP network stack of FreeBSD 7.
Most of the locks pictured are associated with a single socket buffer or connection.
Therefore, it is unlikely that multiple application threads would contend for those
locks since connection-oriented applications do not use multiple application threads
to send data over the same connection. However, those locks could be shared with
the kernel’s inbound protocol threads that are processing receive traffic on the same
connection. Global locks that must be acquired by all threads that are sending (or
possibly receiving) data over any connection are far more problematic.
There are two global locks on the send path: the Connection Hash-table lock
and the Route Hash-table lock. These locks protect the hash tables that map a
particular connection to its individual connection lock and that map a particular connection to its individual route lock, respectively. These locks are also used in lieu of
explicit reference counting for individual connections and locks. Watson presents a
more detailed description of locking within the FreeBSD network stack [57].
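The following condensed sketch of the send path's lock ordering (with POSIX mutexes standing in for kernel mutexes and illustrative names throughout) shows why the Connection Hash-table lock is the bottleneck: every sender must pass through it before reaching its per-connection lock.
/*
 * Condensed view of the send path's lock ordering from Figure 3.3, with
 * POSIX mutexes standing in for kernel mutexes and illustrative names.
 * Every sender serializes on the global Connection Hash-table lock before
 * it can take its own per-connection lock.
 */
#include <pthread.h>
#include <stddef.h>

struct flow;                                /* 4-tuple identifying a connection */

struct connection {
    pthread_mutex_t conn_lock;              /* per-connection socket/TCP lock   */
    /* ... socket and TCP state ... */
};

extern pthread_mutex_t conn_hashtable_lock; /* global lock */

struct connection *conn_hash_lookup(const struct flow *fl);
void tcp_output(struct connection *cp, const void *data, size_t len);

void msgp_tcp_send(const struct flow *fl, const void *data, size_t len)
{
    struct connection *cp;

    pthread_mutex_lock(&conn_hashtable_lock);   /* all senders pass through here */
    cp = conn_hash_lookup(fl);
    pthread_mutex_lock(&cp->conn_lock);         /* then take the private lock    */
    pthread_mutex_unlock(&conn_hashtable_lock); /* global lock released only now */

    tcp_output(cp, data, len);                  /* route and driver locks inside */

    pthread_mutex_unlock(&cp->conn_lock);
}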
There is very little contention for the Route Hash-table lock because the corresponding Route lock is quickly acquired and released so a thread is unlikely to be
blocked while holding the Route Hash-table lock and waiting for a Route lock. In
contrast, the Connection Hash-table lock is highly contended. This lock must be
acquired by any thread performing any network operation on any connection. Furthermore, it is possible for a thread to block while holding the lock and waiting for
its corresponding Connection lock, which can be held for quite some time.
Table 3.3 depicts global TCP/IP lock contention when sending data, measured as
the percentage of lock acquisitions that do not immediately succeed because another
thread holds the lock. ConnP-T is omitted from the table because it eliminates
global TCP/IP locking completely. As the table shows, the MsgP network stack
experiences significant contention for the Connection Hash-table lock, which leads
to considerable overhead as the number of connections increases.
One would expect that as connections are added, contention for per-connection
locks would decrease, and in fact lock profiling supports this conclusion. However,
because other locks (such as that guarding the scheduler) are acquired while holding
the per-connection lock, and because those other locks are system-wide and become
highly contended during heavy loads, detailed locking profiles show that the average
time per-connection locks are held increases dramatically. Hence, though contention
for per-connection locks decreases, the increasing cost for a contended lock is so
much greater that the system exhibits increasing average acquisition times for per-connection locks as connections are added. This increased per-connection acquisition
time in turn leads to longer waits for the Connection Hash-table lock, eventually
bogging down the system with contention.
Whereas the MsgP stack relies on repeated acquisition of the Connection Hash-table
and Connection locks to continue stack processing, ConnP-L stacks can also become
periodically bottlenecked if a single group becomes highly contended. Table 3.3 shows
the contention for the Network Group locks for ConnP-L stacks as the number of network groups is varied from 4 to 128 groups. The table demonstrates that contention
for the Network Group locks consistently decreases as the number of network groups
increases. Though ConnP-L(4)'s Network Group lock contention is high at over 50% for all connection loads, increasing the number of network groups to 128 reduces contention from 52% to just 2% for the heaviest connection load.
[Figure: aggregate transmit throughput in Mb/s versus number of connections for ConnP-L configured with 4, 8, 16, 32, 64, and 128 locks.]
Figure 3.4: Aggregate transmit throughput for the ConnP-L network stack as the number of locks is varied.
                 Transmit                             Receive
Stack Type       24 conns   192 conns   16384 conns   24 conns   192 conns   16384 conns
UP               452        440         423           350        378         421
MsgP             1305       1818        2448          1125       1126        1158
ConnP-T(4)       3617       3602        4535          858        957         1547
ConnP-L(128)     1056       924         1064          598        519         524
Table 3.4: Cycles spent managing the scheduler and scheduler synchronization per Kilobyte of payload.
Figure 3.4 shows the effect that increasing the number of network groups has on aggregate throughput for 6, 192, and 16384 connections. As is suggested by the contention
reduction associated with larger numbers of network groups, network throughput increases with more network groups. However, there are diminishing returns as more
groups are added.
3.6.2
Scheduler Overhead
The ConnP-T kernel trades the locking overhead of the ConnP-L and MsgP kernels
for scheduling overhead. As operations are requested for a particular connection, they must be scheduled onto the appropriate protocol thread.
[Figure: L2 cache misses per KB of payload throughput, broken down into scheduler and network stack components, for UP, MsgP, ConnP-T(4), and ConnP-L(128) at 24, 192, and 16384 connections.]
Figure 3.5: Profile of L2 cache misses per 1 Kilobyte of payload data (transmit test).
As Figures 3.1 and 3.2
showed, this results in stable but low total bandwidth as connections scale for ConnP-T(4). ConnP-L approximates the reduced intra-stack locking properties of ConnP-T
and adopts the simpler scheduling properties of MsgP; locking overhead is minimized
by the additional groups and scheduling overhead is minimized since messages are
not transferred to protocol threads. This results in consistently better performance
than the other parallel organizations.
To further explain this behavior, Table 3.4 shows the number of cycles spent
managing the scheduler and scheduler synchronization per KB of payload data transmitted and received. This shows the overhead of the scheduler normalized to network
bandwidth. Though MsgP experiences significantly less scheduling overhead than
ConnP-T in most cases, locking overhead within the threads negates the scheduler
advantage as connections are added. In contrast, the scheduler overhead of ConnP-T
remains high, particularly when transmitting, corresponding to relatively low bandwidth. Conversely, ConnP-L exhibits stable scheduler overhead that is much lower
than either MsgP or ConnP-T, contributing to its higher throughput. ConnP-L does
not require a thread handoff mechanism and its low lock contention compared to
MsgP results in fewer context switches from threads waiting for locks.
[Figure: L2 cache misses per KB of payload throughput, broken down into data copying, scheduler, and network stack components, for UP, MsgP, ConnP-T(4), and ConnP-L(128) at 24, 192, and 16384 connections.]
Figure 3.6: Profile of L2 cache misses per 1 Kilobyte of payload data (receive test).
All of the network stack organizations examined experience higher scheduler overhead when transmitting than when receiving. The reference FreeBSD 7 operating
system utilizes an interrupt-serialized task queue architecture for processing received
packets. This architecture obviates the need for explicit mutual exclusion locking
within NIC drivers when processing received packets, though locking is still required
on the transmit path. Each of the organizations examined benefits from this optimization. Because FreeBSD's kernel-adaptive mutual exclusion locks invoke the thread
scheduler when acquisitions repeatedly fail, eliminating lock acquisition attempts necessarily reduces scheduler overhead.
The ConnP-T organization experiences an additional reduction in scheduler overhead when processing received packets. In this organization, inbound packets are
queued asynchronously for later processing by the appropriate network protocol thread,
which eliminates the need to block the thread that enqueues a packet or to later notify a blocked thread of completion. When sending, most processing occurs when
the application attempts to send data, which requires a more scheduler-intensive synchronous call, and hence ConnP-T exhibits significantly higher scheduler overhead
when transmitting than when receiving.
Table 3.4 shows that the reference ConnP-T implementation in this study incurs
heavy overhead in the thread scheduler, and hence an effective ConnP-T organization would require a more efficient interprocessor communication mechanism. A
lightweight mechanism for interprocessor communication, as implemented in DragonflyBSD, would enable efficient intra-kernel messaging between processor cores. Such
an efficient messaging mechanism is likely to greatly benefit the ConnP-T organization
by allowing message transfer without invoking the general-purpose scheduler.
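As a rough illustration of what such lightweight messaging could look like, the sketch below outlines a single-producer/single-consumer ring that one core might use to hand protocol work to a pinned thread without entering the scheduler; the types and function names are hypothetical, and this is not the DragonflyBSD mechanism or the ConnP-T prototype.

    /* Hypothetical single-producer/single-consumer message ring between two
     * cores; names (struct proto_msg, msg_ring, etc.) are illustrative only. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SLOTS 256                   /* must be a power of two */

    struct proto_msg {
        void *data;                          /* e.g., pointer to an mbuf chain */
        int   op;                            /* requested protocol operation   */
    };

    struct msg_ring {
        struct proto_msg slots[RING_SLOTS];
        _Atomic size_t head;                 /* written by the producer core */
        _Atomic size_t tail;                 /* written by the consumer core */
    };

    /* Producer core: enqueue without locks or scheduler involvement. */
    static bool ring_send(struct msg_ring *r, struct proto_msg m)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SLOTS)
            return false;                    /* ring full; caller may retry */
        r->slots[head & (RING_SLOTS - 1)] = m;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer core (the pinned protocol thread): poll for work. */
    static bool ring_recv(struct msg_ring *r, struct proto_msg *out)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return false;                    /* no pending messages */
        *out = r->slots[tail & (RING_SLOTS - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }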
3.6.3 Cache Behavior
Figures 3.5 and 3.6 show the number of L2 cache misses per KB of payload data
transmitted and received, respectively. The stacked bars separate the L2 cache misses
based upon where in the operating system the misses occurred. On the transmit side,
all of the L2 cache misses occur either in the scheduler or in the network stack. On the
receive side, there are also misses copying the data from the kernel to the application.
Recall that zero-copy transmit is used, so the corresponding copy from the application
to the kernel does not occur on the transmit side.
The figures show the efficiency of the cache hierarchy normalized to network bandwidth. The uniprocessor kernel incurs very few cache misses relative to the multiprocessor configurations. The lack of data migration between processor caches accounts
for the uniprocessor kernel's cache efficiency. As the number of connections is increased, the additional connection state within the kernel stresses the cache and
directly results in increased cache misses and decreased throughput [27, 28].
The parallel network stacks incur significantly more cache misses per KB of transmitted data because of data migration and lock accesses. Surprisingly, ConnP-T(4)
incurs the most cache misses despite each thread being pinned to a specific processor core. One might expect that such pinning would improve locality by eliminating
migration of many connection data structures. However, Figure 3.5 shows that for
the cases with 24 and 192 connections, ConnP-T(4) exhibits more misses in the network stack than any of the other organizations. While thread pinning can improve
locality by eliminating migration of connection metadata, frequently updated socket
metadata is still shared between the application and protocol threads, which leads
to data migration and a higher cache miss rate. Pinning the protocol threads does
result in better utilization of the caches for the 16384-connection load when transmitting, however. In this case, ConnP-T(4) exhibits the fewest network-stack L2
cache misses. However, the relatively higher number of L2 cache misses caused by
the scheduler prevents this advantage from translating into a performance benefit.
Other than the cache misses due to data copying, the cache miss profiles for
transmit and receive are quite similar. However, ConnP-T(4) incurs far fewer cache
misses in the scheduler when receiving data than it does when transmitting data.
This is directly related to the reduced scheduling overhead on the receive side, as
discussed in the previous section.
The cache misses within the network stack can be divided between misses to
concurrently shared data and serially shared data. Global network stack data is
concurrently shared, as it may be simultaneously accessed by multiple threads in
order to transmit or receive data. In contrast, per-connection data is serially shared,
as it is only accessed by a single thread at a time, although it may be accessed by
multiple threads over time. In a true MsgP organization, the per-connection data will
also be concurrently shared, as multiple threads can process packets from the same
connection simultaneously. However, in a practical, in-order MsgP implementation,
as described in Section 3.2.1, per-connection data may be accessed by at most two
threads at a time, one sending data and one receiving data on the same connection.
Table 3.5 indicates the percentage of the cache misses within the network stack
that are due to global data structures, and are therefore concurrently shared. The
remaining L2 cache misses are due to per-connection data structures. As previously
Stack Type            Transmit                                Receive
                  24 conns   192 conns   16384 conns     24 conns   192 conns   16384 conns
UP                      4%          3%           27%           1%          1%           12%
MsgP                   12%         14%           32%          16%         15%           22%
ConnP-T(4)              7%          9%           15%          14%         13%           11%
ConnP-L(128)           15%         19%           21%          18%         18%           20%

Table 3.5: Percentage of L2 cache misses within the network stack to global data structures.
stated, these per-connection data structures are rarely, if ever, accessed by different
threads concurrently.
Data that is concurrently shared is most likely to benefit from a CMP with a
shared cache. Therefore, the percentages of Table 3.5 indicate the possible reduction
in L2 cache misses within the network stack if a CMP with a shared cache were used.
For most connection loads and network stack organizations, fewer than 20% of the L2
cache misses are due to global data. As there is no guarantee that a shared L2 will
eliminate these misses entirely, the benefits of a shared cache for the network stack
are likely to be minimal. Furthermore, it is also possible for a shared cache to have a
detrimental effect on the serially shared data. Previous work has shown that shared
caches can hurt performance when the cores are not actively sharing data [29]. If
the processor cores must compete with each other to store per-connection state, this
could potentially lead to an overall increase in L2 cache misses within the network
stack, despite the benefits for concurrently shared data.
The lock contention, scheduling, and cache efficiency data show that the different
concurrency models and the different synchronization mechanisms employed by their
implementations directly impact network stack efficiency and throughput. Though
all of these parallelized organizations can outperform the uniprocessor when using 4
cores, each parallel organization experiences higher locking overhead, decreased cache
efficiency, and higher scheduling overhead than a uniprocessor network stack. The
ConnP-L organization achieves higher performance and efficiency than the MsgP
and ConnP-T organizations. ConnP-L mitigates the locking overhead of the highly
contentious MsgP organization by grouping connections to reduce global locking.
ConnP-L also benefits from reduced scheduling overhead as compared to ConnP-T,
since ConnP-L does not require inter-thread communication or message passing to
carry out network stack processing. Hence, though the ConnP-L parallelism model
is more restricted than that of MsgP, ConnP-L still provides the same level of parallelism expected by most applications (e.g., connection- or socket-level parallelism)
and achieves higher efficiency and higher throughput.
Chapter 4
Concurrent Direct Network Access
In many organizations, the economics of supporting a growing number of Internet-based services has created a demand for server consolidation. In such organizations,
maximizing machine utilization and increasing the efficiency of the overall server is
just as important as increasing the efficiency of each individual operating system, as
in Chapter 3. Consequently, there has been a resurgence of interest in machine virtualization [1, 2, 7, 13, 16, 21, 35, 53, 58]. A virtual machine monitor (VMM) enables
multiple virtual machines, each encapsulating one or more services, to share the same
physical machine safely and fairly. In principle, general-purpose operating systems,
such as Unix and Windows, offer the same capability for multiple services to share
the same physical machine. However, VMMs provide additional advantages. For example, VMMs allow services implemented in different or customized environments,
including different operating systems, to share the same physical machine.
Modern VMMs for commodity hardware, such as VMware [1, 13] and Xen [7],
virtualize processor, memory, and I/O devices in software. This enables these VMMs
to support a variety of hardware. In an attempt to decrease the software overhead
of virtualization, both AMD and Intel are introducing hardware support for virtualization [2, 21]. Specifically, their hardware support for processor virtualization is
currently available, and their hardware support for memory virtualization is imminent. As these hardware mechanisms mature, they should reduce the overhead of
virtualization, improving the efficiency of VMMs.
Despite the renewed interest in system virtualization, there is still no clear solution to improve the efficiency of I/O virtualization. To support networking, a VMM
must present each virtual machine with a virtual network interface that is multiplexed in software onto a physical network interface card (NIC). The overhead of this
software-based network virtualization severely limits network performance [38, 39, 53].
For example, a Linux kernel running within a virtual machine on Xen is only able
to achieve about 30% of the network throughput that the same kernel can achieve
running directly on the physical machine.
This study proposes and evaluates concurrent direct network access (CDNA), a
new I/O virtualization architecture that combines software and hardware components
to significantly reduce the overhead of network virtualization in VMMs. The CDNA
network virtualization architecture provides virtual machines running on a VMM safe
direct access to the network interface. With CDNA, each virtual machine is allocated
a unique context on the network interface and communicates directly with the network
interface through that context. In this manner, the virtual machines that run on the
VMM operate as if each has access to its own dedicated network interface.
Using CDNA, a single virtual machine running Linux can transmit at a rate of
1867 Mb/s with 51% idle time and receive at a rate of 1874 Mb/s with 41% idle
time. In contrast, at 97% CPU utilization, Xen is only able to achieve 1602 Mb/s for
transmit and 1112 Mb/s for receive. Furthermore, with 24 virtual machines, CDNA
can still transmit and receive at a rate of over 1860 Mb/s, but with no idle time. In
contrast, Xen is only able to transmit at a rate of 891 Mb/s and receive at a rate of
558 Mb/s with 24 virtual machines.
The CDNA network virtualization architecture achieves this dramatic increase in
network efficiency by dividing the tasks of traffic multiplexing, interrupt delivery, and
memory protection among hardware and software in a novel way. Traffic multiplexing
is performed directly on the network interface, whereas interrupt delivery and memory
Figure 4.1: Shared networking in the Xen virtual machine environment. (The diagram shows guest domains with front-end drivers, the driver domain with back-end drivers, an Ethernet bridge, and the native NIC driver, and the hypervisor dispatching interrupts between the NIC and the domains.)
protection are performed by the VMM with support from the network interface.
This division of tasks into hardware and software components simplifies the overall
software architecture, minimizes the hardware additions to the network interface, and
addresses the network performance bottlenecks of Xen.
The remainder of this study proceeds as follows. The next section discusses networking in the Xen VMM in more detail. Section 4.2 describes how CDNA manages
traffic multiplexing, interrupt delivery, and memory protection in software and hardware to provide concurrent access to the NIC. Section 4.3 then describes the custom
hardware NIC that facilitates concurrent direct network access on a single device.
Finally, Section 4.4 presents the experimental methodology and results. This study
is based on one of my previously published works [60].
4.1 Networking in Xen
4.1.1 Hypervisor and Driver Domain Operation
A VMM allows multiple guest operating systems, each running in a virtual machine, to share a single physical machine safely and fairly. It provides isolation between these guest operating systems and manages their access to hardware resources.
Xen is an open source VMM that supports paravirtualization, which requires modifications to the guest operating system [7]. By modifying the guest operating systems
to interact with the VMM, the complexity of the VMM can be reduced and overall
system performance improved.
Xen performs three key functions in order to provide virtual machine environments. First, Xen allocates the physical resources of the machine to the guest operating systems and isolates them from each other. Second, Xen receives all interrupts
in the system and passes them on to the guest operating systems, as appropriate. Finally, all I/O operations go through Xen in order to ensure fair and non-overlapping
access to I/O devices by the guests.
Figure 4.1 shows the organization of the Xen VMM. Xen consists of two elements:
the hypervisor and the driver domain. The hypervisor provides an abstraction layer
between the virtual machines, called guest domains, and the actual hardware, enabling each guest operating system to execute as if it were the only operating system
on the machine. However, the guest operating systems cannot directly communicate
with the physical I/O devices. Exclusive access to the physical devices is given by the
hypervisor to the driver domain, a privileged virtual machine. Each guest operating
system is then given a virtual I/O device that is controlled by a paravirtualized driver,
called a front-end driver. In order to access a physical device, such as the network interface card (NIC), the guest’s front-end driver communicates with the corresponding
back-end driver in the driver domain. The driver domain then multiplexes the data
streams for each guest onto the physical device. The driver domain runs a modified
version of Linux that uses native Linux device drivers to manage I/O devices.
As the figure shows, in order to provide network access to the guest domains, the
driver domain includes a software Ethernet bridge that interconnects the physical
NIC and all of the virtual network interfaces. When a packet is transmitted by a
guest, it is first transferred to the back-end driver in the driver domain using a page
remapping operation. Within the driver domain, the packet is then routed through
the Ethernet bridge to the physical device driver. The device driver enqueues the
packet for transmission on the network interface as if it were generated normally
by the operating system within the driver domain. When a packet is received, the
network interface generates an interrupt that is captured by the hypervisor and routed
to the network interface’s device driver in the driver domain as a virtual interrupt.
The network interface’s device driver transfers the packet to the Ethernet bridge,
which routes the packet to the appropriate back-end driver. The back-end driver
then transfers the packet to the front-end driver in the guest domain using a page
remapping operation. Once the packet is transferred, the back-end driver requests
that the hypervisor send a virtual interrupt to the guest notifying it of the new packet.
Upon receiving the virtual interrupt, the front-end driver delivers the packet to the
guest operating system’s network stack, as if it had come directly from the physical
device.
4.1.2 Device Driver Operation
The driver domain in Xen is able to use unmodified Linux device drivers to access the network interface. Thus, all interactions between the device driver and the
NIC are as they would be in an unvirtualized system. These interactions include
programmed I/O (PIO) operations from the driver to the NIC, direct memory access
(DMA) transfers by the NIC to read or write host memory, and physical interrupts
from the NIC to invoke the device driver.
The device driver directs the NIC to send packets from buffers in host memory
and to place received packets into preallocated buffers in host memory. The NIC
accesses these buffers using DMA read and write operations. In order for the NIC to
know where to store or retrieve data from the host, the device driver within the host
operating system generates DMA descriptors for use by the NIC. These descriptors
indicate the buffer’s length and physical address on the host. The device driver notifies
the NIC via PIO that new descriptors are available, which causes the NIC to retrieve
them via DMA transfers. Once the NIC reads a DMA descriptor, it can either read
from or write to the associated buffer, depending on whether the descriptor is being
used by the driver to transmit or receive packets.
Device drivers organize DMA descriptors in a series of rings that are managed
using a producer/consumer protocol. As they are updated, the producer and consumer pointers wrap around the rings to create a continuous circular buffer. There
are separate rings of DMA descriptors for transmit and receive operations. Transmit
DMA descriptors point to host buffers that will be transmitted by the NIC, whereas
receive DMA descriptors point to host buffers that the OS wants the NIC to use as it
receives packets. When the host driver wants to notify the NIC of the availability of a
new DMA descriptor (and hence a new packet to be transmitted or a new buffer to be
posted for packet reception), the driver first creates the new DMA descriptor in the
next-available slot in the driver’s descriptor ring and then increments the producer
index on the NIC to reflect that a new descriptor is available. The driver updates
the NIC’s producer index by writing the value via PIO into a specific location, called
a mailbox, within the device’s PCI memory-mapped region. The network interface
monitors these mailboxes for such writes from the host. When a mailbox update is
detected, the NIC reads the new producer value from the mailbox, performs a DMA
System           Transmit (Mb/s)   Receive (Mb/s)
Native Linux                5126             3629
Xen Guest                   1602             1112

Table 4.1: Transmit and receive performance for native Linux 2.6.16.29 and paravirtualized Linux 2.6.16.29 as a guest OS within Xen 3.
read of the descriptor indicated by the index, and then is ready to use the DMA
descriptor. After the NIC consumes a descriptor from a ring, the NIC updates its
consumer index, transfers this consumer index to a location in host memory via DMA,
and raises a physical interrupt to notify the host that state has changed.
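The following sketch illustrates this producer/consumer interaction for a transmit ring; the descriptor layout, ring size, and mailbox register are hypothetical placeholders for whatever a particular NIC defines.

    /* Hypothetical driver-side transmit path: fill a DMA descriptor in the
     * next ring slot, then advance the NIC's producer index via a PIO write
     * to a mailbox register.  All names and offsets are illustrative. */
    #include <stdint.h>

    #define TX_RING_SLOTS 256                /* power of two */

    struct dma_desc {
        uint64_t paddr;                      /* physical address of the buffer */
        uint16_t len;                        /* buffer length in bytes         */
        uint16_t flags;                      /* device-specific flags          */
    };

    struct tx_ring {
        struct dma_desc   desc[TX_RING_SLOTS];
        uint32_t          producer;          /* next slot the driver will fill  */
        uint32_t          consumer;          /* last slot the NIC reported done */
        volatile uint32_t *mailbox;          /* mapped NIC register (PIO)       */
    };

    static int tx_post_buffer(struct tx_ring *r, uint64_t paddr, uint16_t len)
    {
        if (r->producer - r->consumer == TX_RING_SLOTS)
            return -1;                       /* ring full */

        struct dma_desc *d = &r->desc[r->producer & (TX_RING_SLOTS - 1)];
        d->paddr = paddr;
        d->len   = len;
        d->flags = 0;

        r->producer++;
        /* PIO write (a real driver would issue a write barrier first): the
         * NIC detects the mailbox update, DMAs the new descriptor(s),
         * transmits, and later writes back its consumer index via DMA and
         * raises an interrupt. */
        *r->mailbox = r->producer;
        return 0;
    }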
In an unvirtualized operating system, the network interface trusts that the device
driver gives it valid DMA descriptors. Similarly, the device driver trusts that the NIC
will use the DMA descriptors correctly. If either entity violates this trust, physical
memory can be corrupted. Xen also requires this trust relationship between the device
driver in the driver domain and the NIC.
4.1.3 Performance
Despite the optimizations within the paravirtualized drivers to support communication between the guest and driver domains (such as using page remapping rather
than copying to transfer packets), Xen introduces significant processing and communication overheads into the network transmit and receive paths. Table 4.1 shows
the networking performance of both native Linux 2.6.16.29 and paravirtualized Linux
2.6.16.29 as a guest operating system within Xen 3 Unstable (changeset 12053:874cc0ff214d from 11/1/2006) on a modern Opteron-based system with six Intel Gigabit Ethernet NICs. In both configurations, checksum offloading, scatter/gather I/O, and TCP Segmentation Offloading (TSO) were
enabled. Support for TSO was recently added to the unstable development branch of
Xen and is not currently available in the Xen 3 release. As the table shows, a guest
domain within Xen is only able to achieve about 30% of the performance of native
Linux. This performance gap strongly motivates the need for networking performance
improvements within Xen.
4.2 CDNA Architecture
With CDNA, the network interface and the hypervisor collaborate to provide the
abstraction that each guest operating system is connected directly to its own network interface. This eliminates many of the overheads of network virtualization in
Xen. Figure 4.2 shows the CDNA architecture. The network interface must support
multiple contexts in hardware. Each context acts as if it is an independent physical
network interface and can be controlled by a separate device driver instance. Instead
of assigning ownership of the entire network interface to the driver domain, the hypervisor treats each context as if it were a physical NIC and assigns ownership of
contexts to guest operating systems. Notice the absence of the driver domain from
the figure: each guest can transmit and receive network traffic using its own private
context without any interaction with other guest operating systems or the driver domain. The driver domain, however, is still present to perform control functions and
allow access to other I/O devices. Furthermore, the hypervisor is still involved in
networking, as it must guarantee memory protection and deliver virtual interrupts to
the guest operating systems.
With CDNA, the communication overheads between the guest and driver domains
and the software multiplexing overheads within the driver domain are eliminated
entirely. However, the network interface now must multiplex the traffic across all of
its active contexts, and the hypervisor must provide protection across the contexts.
The following sections describe how CDNA performs traffic multiplexing, interrupt
delivery, and DMA memory protection.
Figure 4.2: The CDNA shared networking architecture in Xen. (Each guest domain's NIC driver communicates directly with its own context on the CDNA NIC; the hypervisor handles interrupt dispatch and control.)
4.2.1 Multiplexing Network Traffic
CDNA eliminates the software multiplexing overheads within the driver domain
by multiplexing network traffic on the NIC. The network interface must be able to
identify the source or target guest operating system for all network traffic. The network interface accomplishes this by providing independent hardware contexts and
associating a unique Ethernet MAC address with each context. The hypervisor assigns a unique hardware context on the NIC to each guest operating system. The
device driver within the guest operating system then interacts with its context exactly as if the context were an independent physical network interface. As described
in Section 4.1.2, these interactions consist of creating DMA descriptors and updating
a mailbox on the NIC via PIO.
Each context on the network interface therefore must include a unique set of
mailboxes. This isolates the activity of each guest operating system, so that the NIC
can distinguish between the different guests. The hypervisor assigns a context to
a guest simply by mapping the I/O locations for that context’s mailboxes into the
guest’s address space. The hypervisor also notifies the NIC that the context has been
allocated and is active. As the hypervisor only maps each context into a single guest’s
address space, a guest cannot accidentally or intentionally access any context on the
NIC other than its own. When necessary, the hypervisor can also revoke a context at
any time by notifying the NIC, which will shut down all pending operations associated
with the indicated context.
To multiplex transmit network traffic, the NIC simply services all of the hardware
contexts fairly and interleaves the network traffic for each guest. When network
packets are received by the NIC, it uses the Ethernet MAC address to demultiplex
the traffic, and transfers each packet to the appropriate guest using available DMA
descriptors from that guest’s context.
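A simplified sketch of this receive-side demultiplexing step is shown below; struct nic_ctx, ctx_table, and the surrounding names are assumptions made for illustration, not the RiceNIC firmware.

    /* Hypothetical NIC-firmware receive demultiplexing by destination MAC
     * address; struct nic_ctx and ctx_table are illustrative. */
    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define MAX_CONTEXTS 32

    struct nic_ctx {
        bool    active;
        uint8_t mac[6];                      /* MAC assigned to this context */
        /* ... per-context receive descriptor ring state ... */
    };

    static struct nic_ctx ctx_table[MAX_CONTEXTS];

    /* Returns the owning context for a received frame, or -1 if no context
     * matches (e.g., the frame is dropped). */
    static int demux_rx_frame(const uint8_t *frame)
    {
        const uint8_t *dst_mac = frame;      /* Ethernet destination address */
        for (int i = 0; i < MAX_CONTEXTS; i++) {
            if (ctx_table[i].active &&
                memcmp(ctx_table[i].mac, dst_mac, 6) == 0)
                return i;                    /* DMA into this guest's buffers */
        }
        return -1;
    }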
4.2.2 Interrupt Delivery
In addition to isolating the guest operating systems and multiplexing network traffic, the hardware contexts on the NIC must also be able to interrupt their respective
guests. As the NIC carries out network requests on behalf of any particular context,
the CDNA NIC updates that context’s consumer pointers for the DMA descriptor
rings, as described in Section 4.1.2. Normally, the NIC would then interrupt the guest
to notify it that the context state has changed. However, in Xen all physical interrupts are handled by the hypervisor. Therefore, the NIC cannot physically interrupt
the guest operating systems directly. Even if it were possible to interrupt the guests
directly, that could create a much higher interrupt load on the system, which would
decrease the performance benefits of CDNA.
Under CDNA, the NIC keeps track of which contexts have been updated since the
last physical interrupt, encoding this set of contexts in an interrupt bit vector, which
is stored in the hypervisor’s private memory-mapped control context on the NIC. To
signal a set of interrupts to the hypervisor, the NIC raises a physical interrupt, which
invokes the hypervisor’s interrupt service routine (ISR). The hypervisor then reads
the interrupt bit vector from the NIC via programmed I/O. Next, the hypervisor
decodes the vector and schedules virtual interrupts to each of the guest operating
systems that have pending updates from the NIC. Because the Xen scheduler guarantees that these virtual interrupts will be delivered, the hypervisor can immediately
acknowledge the set of interrupts that have been processed. The hypervisor performs
this acknowledgment by writing the processed vector back to the NIC to a separate
acknowledgment location in the hypervisor’s private memory-mapped control context.
After acknowledgment and after the hypervisor’s ISR has run, the hypervisor’s
scheduler will execute and select the next guest operating system to run. When
subsequent guest operating systems are next scheduled by the hypervisor, the CDNA
network interface driver within the guest receives these virtual interrupts that the
hypervisor has sent. The virtual interrupts are received by the paravirtualized guest
as if they were actual physical interrupts from the hardware. At that time, the guest’s
driver examines the updates from the NIC and determines what further action, such
as processing received packets, is required.
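The control flow of the hypervisor's interrupt service routine can be summarized by the sketch below; the register layout, bit-vector width, and send_virtual_irq() helper are hypothetical placeholders rather than actual Xen or CDNA code.

    /* Hypothetical hypervisor ISR for a CDNA-style NIC: read the interrupt
     * bit vector over PIO, post a virtual interrupt to each guest whose
     * context has pending updates, then acknowledge the processed vector. */
    #include <stdint.h>

    #define MAX_CONTEXTS 32

    /* Registers in the hypervisor's private control context (illustrative). */
    struct cdna_ctrl_regs {
        volatile uint32_t irq_vector;        /* which contexts have updates */
        volatile uint32_t irq_ack;           /* write-back acknowledgment   */
    };

    extern int  context_owner[MAX_CONTEXTS]; /* context -> guest domain id  */
    extern void send_virtual_irq(int domain_id);  /* assumed hypervisor hook */

    void cdna_nic_isr(struct cdna_ctrl_regs *regs)
    {
        uint32_t pending = regs->irq_vector; /* PIO read of the bit vector */

        for (int ctx = 0; ctx < MAX_CONTEXTS; ctx++) {
            if (pending & (1u << ctx))
                send_virtual_irq(context_owner[ctx]);
        }

        /* Positive acknowledgment: tell the NIC exactly which contexts'
         * interrupts have been scheduled for delivery. */
        regs->irq_ack = pending;
    }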
During the development of this research, an alternative NIC-to-hypervisor event
notification mechanism was explored. Instead of providing the interrupt vector to
the hypervisor via memory-mapped I/O, the NIC would transfer an interrupt bit
vector into the hypervisor’s memory space using DMA. The interrupt bit vectors
were stored in a circular buffer using a producer/consumer protocol to ensure that
they are processed by the host before being overwritten by the NIC. The vectors would
then be processed identically to the memory-mapped I/O implementation. However,
further examination found that under heavy load, it was possible that the ring buffer
could fill up and interrupts could be lost, even with a very large event ring buffer
of 256 entries. The positive-acknowledgment strategy ensures more reliable delivery
under heavy load, though it does incur some additional overhead. At minimum, the
memory-mapped I/O implementation requires an additional programmed-I/O write
(for the acknowledgment) compared to the ring-buffer implementation. When several
interrupt vectors are processed in one invocation of the hypervisor's ISR, the ring-buffer implementation can save one memory-mapped I/O read per vector processed,
since those vectors are read from host memory rather than from the memory-mapped
location on the NIC. The ring-buffer-based strategy was evaluated in [61].
4.2.3 DMA Memory Protection
In the x86 architecture, network interfaces and other I/O devices use physical
addresses when reading or writing host system memory. The device driver in the
host operating system is responsible for doing virtual-to-physical address translation
for the device. The physical addresses are provided to the network interface through
read and write DMA descriptors as discussed in Section 4.1.2. By exposing physical
addresses to the network interface, the DMA engine on the NIC can be co-opted into
compromising system security by a buggy or malicious driver. There are two key
I/O protection violations that are possible in the x86 architecture. First, the device
driver could instruct the NIC to transmit packets containing a payload from physical
memory that does not contain packets generated by the operating system, thereby
creating a security hole. Second, the device driver could instruct the NIC to receive
packets into physical memory that was not designated as an available receive buffer,
possibly corrupting memory that is in use.
In the conventional Xen network architecture discussed in Section 4.1.2, Xen trusts
the device driver in the driver domain to only use the physical addresses of network
buffers in the driver domain’s address space when passing DMA descriptors to the
network interface. This ensures that all network traffic will be transferred to/from
network buffers within the driver domain. Since guest domains do not interact with
the NIC, they cannot initiate DMA operations, so they are prevented from causing
either of the I/O protection violations in the x86 architecture.
Though the Xen I/O architecture guarantees that untrusted guest domains cannot
induce memory protection violations, any domain that is granted access to an I/O
device by the hypervisor can potentially direct the device to perform DMA operations
that access memory belonging to other guests, or even the hypervisor. The Xen
architecture does not fundamentally solve this security defect but instead limits the
scope of the problem to a single, trusted driver domain [16]. Therefore, as the driver
domain is trusted, it is unlikely to intentionally violate I/O memory protection, but
a buggy driver within the driver domain could do so unintentionally.
This solution is insufficient for the CDNA architecture. In a CDNA system, device
drivers in the guest domains have direct access to the network interface and are able
to pass DMA descriptors with physical addresses to the device. Thus, the untrusted
guests could read or write memory in any other domain through the NIC, unless
additional security features are added. To maintain isolation between guests, the
CDNA architecture validates and protects all DMA descriptors and ensures that a
guest maintains ownership of physical pages that are sources or targets of outstanding
DMA accesses. Although the hypervisor and the network interface share the responsibility for implementing these protection mechanisms, the more complex aspects are
implemented in the hypervisor.
The most important protection provided by CDNA is that it does not allow guest
domains to directly enqueue DMA descriptors into the network interface descriptor
rings. Instead, the device driver in each guest must call into the hypervisor to perform the enqueue operation. This allows the hypervisor to validate that the physical
addresses provided by the guest are, in fact, owned by that guest domain. This prevents a guest domain from arbitrarily transmitting from or receiving into another
guest domain. The hypervisor prevents guest operating systems from independently
enqueueing unauthorized DMA descriptors by establishing the hypervisor’s exclusive
write access to the host memory region containing the CDNA descriptor rings during
driver initialization.
As discussed in Section 4.1.2, conventional I/O devices autonomously fetch and
process DMA descriptors from host memory at runtime. Though hypervisor-managed
validation and enqueuing of DMA descriptors ensures that DMA operations are valid
when they are enqueued, the physical memory could still be reallocated before it is
accessed by the network interface. There are two ways in which such a protection
violation could be exploited by a buggy or malicious device driver. First, the guest
could return the memory to the hypervisor to be reallocated shortly after enqueueing
the DMA descriptor. Second, the guest could attempt to reuse an old DMA descriptor
in the descriptor ring that is no longer valid.
When memory is freed by a guest operating system, it becomes available for reallocation to another guest by the hypervisor. Hence, ownership of the underlying
physical memory can change dynamically at runtime. However, it is critical to prevent any possible reallocation of physical memory during a DMA operation. CDNA
achieves this by delaying the reallocation of physical memory that is being used in
a DMA transaction until after that pending DMA has completed. When the hypervisor enqueues a DMA descriptor, it first establishes that the requesting guest owns
the physical memory associated with the requested DMA. The hypervisor then increments the reference count for each physical page associated with the requested DMA.
This per-page reference counting system already exists within the Xen hypervisor; so
long as the reference count is non-zero, a physical page cannot be reallocated. Later,
the hypervisor observes which DMA operations have completed and decrements
the associated reference counts. For efficiency, the reference counts are only decremented when additional DMA descriptors are enqueued, but there is no reason why
they could not be decremented more aggressively, if necessary.
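In outline, the hypercall that enqueues a descriptor might behave as sketched below; guest_owns_page(), get_page(), ring_next_slot(), and the descriptor layout are illustrative stand-ins for the corresponding Xen facilities, not the actual implementation.

    /* Hypothetical hypervisor-side enqueue of a guest's DMA descriptor:
     * validate ownership, pin the pages by raising their reference counts,
     * then write the descriptor into the ring that only the hypervisor can
     * modify.  All names are illustrative. */
    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12

    struct dma_desc { uint64_t paddr; uint16_t len; uint16_t flags; uint32_t seq; };

    extern bool guest_owns_page(int domid, uint64_t pfn);  /* assumed check  */
    extern void get_page(uint64_t pfn);                    /* ref count ++   */
    extern struct dma_desc *ring_next_slot(int ctx);       /* hypervisor-only */
    extern uint32_t next_seq(int ctx);                     /* per-ring seqno */

    int cdna_enqueue_dma(int domid, int ctx, uint64_t paddr,
                         uint16_t len, uint16_t flags)
    {
        if (len == 0)
            return -1;

        uint64_t first_pfn = paddr >> PAGE_SHIFT;
        uint64_t last_pfn  = (paddr + len - 1) >> PAGE_SHIFT;

        /* Every page touched by the DMA must belong to the requesting guest. */
        for (uint64_t pfn = first_pfn; pfn <= last_pfn; pfn++)
            if (!guest_owns_page(domid, pfn))
                return -1;                   /* reject the request */

        /* Pin the pages so they cannot be reallocated until the DMA has
         * completed; the matching decrement happens after completion is
         * observed. */
        for (uint64_t pfn = first_pfn; pfn <= last_pfn; pfn++)
            get_page(pfn);

        struct dma_desc *d = ring_next_slot(ctx);
        d->paddr = paddr;
        d->len   = len;
        d->flags = flags;
        d->seq   = next_seq(ctx);            /* see the sequence-number check below */
        return 0;
    }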
After enqueuing DMA descriptors, the device driver notifies the NIC by writing
a producer index into a mailbox location within that guest’s context on the NIC.
This producer index indicates the location of the last of the newly created DMA
descriptors. The NIC then assumes that all DMA descriptors up to the location
indicated by the producer index are valid. If the device driver in the guest increments
the producer index past the last valid descriptor, the NIC will attempt to use a stale
DMA descriptor that is in the descriptor ring. Since that descriptor was previously
used in a DMA operation, the hypervisor may have decremented the reference count
on the associated physical memory and reallocated the physical memory.
To prevent such stale DMA descriptors from being used, the hypervisor writes a
strictly increasing sequence number into each DMA descriptor. The NIC then checks
the sequence number before using any DMA descriptor. If the descriptor is valid,
the sequence numbers will be continuous modulo the size of the maximum sequence
number. If they are not, the NIC will refuse to use the descriptors and will report a
guest-specific protection fault error to the hypervisor. Because each DMA descriptor
in the ring buffer gets a new, increasing sequence number, a stale descriptor will have
a sequence number exactly equal to the correct value minus the number of descriptor
slots in the buffer. Making the maximum sequence number at least twice as large as
the number of DMA descriptors in a ring buffer prevents aliasing and ensures that
any stale sequence number will be detected.
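The firmware-side check can be sketched as follows, assuming a hypothetical 16-bit sequence space and a 256-entry ring; the essential property is simply that the sequence space is at least twice the ring size, so a stale descriptor can never alias a valid one.

    /* Hypothetical NIC-firmware validation of a DMA descriptor's sequence
     * number before use.  SEQ_MOD is at least twice RING_SLOTS, so a stale
     * descriptor (off by exactly RING_SLOTS) is always detected. */
    #include <stdint.h>
    #include <stdbool.h>

    #define RING_SLOTS 256
    #define SEQ_MOD    65536u                /* 16-bit sequence space */

    struct ring_state {
        uint32_t expected_seq;               /* next sequence number expected */
    };

    static bool desc_seq_valid(struct ring_state *rs, uint32_t desc_seq)
    {
        if (desc_seq != rs->expected_seq)
            return false;                    /* stale or corrupted descriptor:
                                              * report a protection fault for
                                              * this context to the hypervisor */
        rs->expected_seq = (rs->expected_seq + 1) % SEQ_MOD;
        return true;
    }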
4.2.4 Discussion
The CDNA interrupt delivery mechanism is neither device nor Xen specific. This
mechanism only requires the device to make an interrupt bit vector available to the hypervisor, whether through a memory-mapped control context or via DMA, prior to raising a physical interrupt. This is a relatively simple mechanism from the perspective of the device and is therefore generalizable to a variety of virtualized I/O devices. Furthermore, it does not rely on any Xen-specific
features.
The handling of the DMA descriptors within the hypervisor is linked to a particular network interface only because the format of the DMA descriptors and their
rings is likely to be different for each device. As the hypervisor must validate that
the host addresses referred to in each descriptor belong to the guest operating system
that provided them, the hypervisor must be aware of the descriptor format. Fortunately, there are only three fields of interest in any DMA descriptor: an address, a
length, and additional flags. This commonality should make it possible to generalize
the mechanisms within the hypervisor by having the NIC notify the hypervisor of
its preferred format. The NIC would only need to specify the size of the descriptor
and the location of the address, length, and flags. The hypervisor would not need
to interpret the flags, so they could just be copied into the appropriate location. A
generic NIC would also need to support the use of sequence numbers within each
DMA descriptor. Again, the NIC could notify the hypervisor of the size and location
of the sequence number field within the descriptors.
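One way such a self-describing format could be conveyed is sketched below; the structure and field names are purely illustrative of the idea that the NIC need only report where the address, length, flags, and sequence number fields live within its descriptors.

    /* Hypothetical "descriptor layout" record a NIC could hand to the
     * hypervisor at initialization so that generic hypervisor code can fill
     * in descriptors of any device-specific format. */
    #include <stdint.h>
    #include <string.h>

    struct desc_layout {
        uint16_t desc_size;                  /* total descriptor size in bytes   */
        uint16_t addr_offset;                /* offset of the physical address   */
        uint16_t len_offset;                 /* offset of the length field       */
        uint16_t flags_offset;               /* offset of the opaque flags field */
        uint16_t seq_offset;                 /* offset of the sequence number    */
    };

    /* Generic fill routine: the hypervisor validates 'paddr', copies the
     * guest-supplied flags verbatim, and stamps the sequence number, without
     * ever interpreting the device-specific fields. */
    static void fill_desc(void *desc, const struct desc_layout *l,
                          uint64_t paddr, uint16_t len, uint32_t flags,
                          uint32_t seq)
    {
        memcpy((char *)desc + l->addr_offset,  &paddr, sizeof paddr);
        memcpy((char *)desc + l->len_offset,   &len,   sizeof len);
        memcpy((char *)desc + l->flags_offset, &flags, sizeof flags);
        memcpy((char *)desc + l->seq_offset,   &seq,   sizeof seq);
    }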
CDNA’s DMA memory protection is specific to Xen only insofar as Xen permits
guest operating systems to use physical memory addresses. Consequently, the current
implementation must validate the ownership of those physical addresses for every
requested DMA operation. For VMMs that only permit the guest to use virtual
addresses, the hypervisor could just as easily translate those virtual addresses and
ensure physical contiguity. The current CDNA implementation does not rely on
physical addresses in the guest at all; rather, a small library translates the driver’s
virtual addresses to physical addresses within the guest’s driver before making a
hypercall request to enqueue a DMA descriptor. For VMMs that use virtual addresses,
this library would do nothing.
4.3 CDNA NIC Implementation
To evaluate the CDNA concept in a real system, RiceNIC, a programmable and
reconfigurable FPGA-based Gigabit Ethernet network interface [50], was modified
to provide virtualization support. RiceNIC contains a Virtex-II Pro FPGA with
two embedded 300MHz PowerPC processors, hundreds of megabytes of on-board
SRAM and DRAM memories, a Gigabit Ethernet PHY, and a 64-bit/66 MHz PCI
interface [5]. Custom hardware assist units for accelerated DMA transfers and MAC
packet handling are provided on the FPGA. The RiceNIC architecture is similar to
the architecture of a conventional network interface. With basic firmware and the
appropriate Linux or FreeBSD device driver, it acts as a standard Gigabit Ethernet
network interface that is capable of fully saturating the Ethernet link while only using
one of the two embedded processors.
To support CDNA, both the hardware and firmware of the RiceNIC were modified
to provide multiple protected contexts and to multiplex network traffic. The network
interface was also modified to interact with the hypervisor through a dedicated context
to allow privileged management operations. The modified hardware and firmware
components work together to implement the CDNA interfaces.
To support CDNA, the most significant addition to the network interface is the
specialized use of the 2 MB SRAM on the NIC. This SRAM is accessible via PIO from
the host. For CDNA, 128 KB of the SRAM is divided into 32 partitions of 4 KB each.
Each of these partitions is an interface to a separate hardware context on the NIC.
Only the SRAM can be memory mapped into the host’s address space, so no other
memory locations on the NIC are accessible via PIO. As a context’s memory partition
is the same size as a page on the host system and because the region is page-aligned,
the hypervisor can trivially map each context into a different guest domain’s address
space. The device drivers in the guest domains may use these 4 KB partitions as
general purpose shared memory between the corresponding guest operating system
and the network interface.
Within each context’s partition, the lowest 24 memory locations are mailboxes
that can be used to communicate from the driver to the NIC. When any mailbox
is written by PIO, a global mailbox event is automatically generated by the FPGA
hardware. The NIC firmware can then process the event and efficiently determine
which mailbox and corresponding context has been written by decoding a two-level
hierarchy of bit vectors. All of the bit vectors are generated automatically by the
hardware and stored in a data scratchpad for high speed access by the processor. The
first bit vector in the hierarchy determines which of the 32 potential contexts have
updated mailbox events to process, and the second vector in the hierarchy determines
which mailbox(es) in a particular context have been updated. Once the specific
mailbox has been identified, that off-chip SRAM location can be read by the firmware
and the mailbox information processed.
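A sketch of that two-level decode is shown below; the scratchpad variables and helper functions are hypothetical names, but the scan order mirrors the description above: the context-level vector is examined first, then the per-context mailbox vector.

    /* Hypothetical firmware decode of the two-level mailbox event hierarchy:
     * level 1 says which contexts have events, level 2 says which mailboxes
     * within a context were written.  Helper functions are illustrative. */
    #include <stdint.h>

    #define MAX_CONTEXTS      32
    #define MAILBOXES_PER_CTX 24

    extern volatile uint32_t ctx_event_vec;                  /* level 1 (scratchpad) */
    extern volatile uint32_t mbox_event_vec[MAX_CONTEXTS];   /* level 2 (scratchpad) */
    extern uint32_t read_mailbox(int ctx, int mbox);         /* off-chip SRAM read   */
    extern void     handle_mailbox(int ctx, int mbox, uint32_t value);
    extern void     clear_ctx_events(int ctx, uint32_t mask);

    void process_mailbox_events(void)
    {
        uint32_t ctxs = ctx_event_vec;
        while (ctxs) {
            int ctx = __builtin_ctz(ctxs);               /* lowest set bit */
            ctxs &= ctxs - 1;

            uint32_t mboxes  = mbox_event_vec[ctx];
            uint32_t handled = mboxes;
            while (mboxes) {
                int mbox = __builtin_ctz(mboxes);
                mboxes &= mboxes - 1;
                handle_mailbox(ctx, mbox, read_mailbox(ctx, mbox));
            }
            /* One event-clear message can retire several events at once. */
            clear_ctx_events(ctx, handled);
        }
    }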
The mailbox event and associated hierarchy of bit vectors are managed by a
small hardware core that snoops data on the SRAM bus and dispatches notification
messages when a mailbox is updated. A small state machine decodes these messages
and incrementally updates the data scratchpad with the modified bit vectors. This
state machine also handles event-clear messages from the processor that can clear
multiple events from a single context at once.
Each context requires 128 KB of storage on the NIC for metadata, such as the rings
of transmit- and receive-DMA descriptors provided by the host operating systems.
Furthermore, each context uses 128 KB of memory on the NIC for buffering transmit
packet data and 128 KB for receive packet data. However, the NIC’s transmit and
receive packet buffers are each managed globally, and hence packet buffering is shared
across all contexts.
System   NIC       Mb/s    Hyp      Driver OS  Driver User  Guest OS  Guest User   Idle     Driver Dom. Intr/s  Guest OS Intr/s
Xen      Intel     1602    19.8%    35.7%      0.8%         39.7%     1.0%         3.0%     7,438               7,853
Xen      RiceNIC   1674    13.7%    41.5%      0.5%         39.5%     1.0%         3.8%     8,839               5,661
CDNA     RiceNIC   1865    10.8%    0.1%       0.2%         42.7%     1.7%         44.5%    0                   13,903

Table 4.2: Transmit performance for a single guest with 2 NICs using Xen and CDNA.
System   NIC       Mb/s    Hyp      Driver OS  Driver User  Guest OS  Guest User   Idle     Driver Dom. Intr/s  Guest OS Intr/s
Xen      Intel     1112    25.7%    36.8%      0.5%         31.0%     1.0%         5.0%     11,138              5,193
Xen      RiceNIC   1075    30.6%    39.4%      0.6%         28.8%     0.6%         0%       10,946              5,163
CDNA     RiceNIC   1850    9.9%     0.2%       0.2%         52.6%     0.6%         36.5%    0                   7,484

Table 4.3: Receive performance for a single guest with 2 NICs using Xen and CDNA.
The modifications to the RiceNIC to support CDNA were minimal. The major
hardware change was the additional mailbox storage and handling logic. This could
easily be added to an existing NIC without interfering with the normal operation
of the network interface—unvirtualized device drivers would use a single context’s
mailboxes to interact with the base firmware. Furthermore, the computation and
storage requirements of CDNA are minimal. Only one of the RiceNIC’s two embedded
processors is needed to saturate the network, and only 12 MB of memory on the NIC
is needed to support 32 contexts. Therefore, with minor modifications, commodity
network interfaces could easily provide sufficient computation and storage resources
to support CDNA.
4.4 Evaluation
4.4.1 Experimental Setup
The performance of Xen and CDNA network virtualization was evaluated on
an AMD Opteron-based system running Xen 3 Unstable (changeset 12053:874cc0ff214d from 11/1/2006). This system used a
Tyan S2882 motherboard with a single Opteron 250 processor and 4GB of DDR400
SDRAM. Xen 3 Unstable was used because it provides the latest support for high-performance networking, including TCP segmentation offloading, and the most recent
version of Xenoprof [39] for profiling the entire system.
In all experiments, the driver domain was configured with 256 MB of memory
and each of the 24 guest domains was configured with 128 MB of memory. Each guest
domain ran a stripped-down Linux 2.6.16.29 kernel with minimal services for memory
efficiency and performance. For the base Xen experiments, a single dual-port Intel
Pro/1000 MT NIC was used in the system. In the CDNA experiments, two RiceNICs
configured to support CDNA were used in the system. Linux TCP parameters and
NIC coalescing options were tuned in the driver domain and guest domains for optimal
performance. For all experiments, checksum offloading and scatter/gather I/O were
enabled. TCP segmentation offloading was enabled for experiments using the Intel
NICs, but disabled for those using the RiceNICs due to lack of support. The Xen
system was set up to communicate with a similar Opteron system that was running a
native Linux kernel. This system was tuned so that it could easily saturate two NICs
both transmitting and receiving so that it would never be the bottleneck in any of
the tests.
To validate the performance of the CDNA approach, multiple simultaneous connections across multiple NICs to multiple guest domains were needed. A multithreaded, event-driven, lightweight network benchmark program was developed to
distribute traffic across a configurable number of connections. The benchmark program balances the bandwidth across all connections to ensure fairness and uses a
single buffer per thread to send and receive data to minimize the memory footprint
and improve cache performance.
4.4.2 Single Guest Performance
Tables 4.2 and 4.3 show the transmit and receive performance of a single guest
operating system over two physical network interfaces using Xen and CDNA. The
first two rows of each table show the performance of the Xen I/O virtualization
architecture using both the Intel and RiceNIC network interfaces. The third row of
each table shows the performance of the CDNA I/O virtualization architecture.
The Intel network interface can only be used with Xen through the use of software
virtualization. However, the RiceNIC can be used with both CDNA and software virtualization. To use the RiceNIC interface with software virtualization, a context was
assigned to the driver domain and no contexts were assigned to the guest operating
system. Therefore, all network traffic from the guest operating system is routed via
the driver domain as it normally would be, through the use of software virtualization.
Within the driver domain, all of the mechanisms within the CDNA NIC are used
identically to the way they would be used by a guest operating system when configured to use concurrent direct network access. As the tables show, the Intel network
interface performs similarly to the RiceNIC network interface. Therefore, the benefits
achieved with CDNA are the result of the CDNA I/O virtualization architecture, not
the result of differences in network interface performance.
Note that in Xen the interrupt rate for the guest is not necessarily the same as it is
for the driver. This is because the back-end driver within the driver domain attempts
to interrupt the guest operating system whenever it generates new work for the front-end driver. This can happen at a higher or lower rate than the actual interrupt rate
generated by the network interface depending on a variety of factors, including the
number of packets that traverse the Ethernet bridge each time the driver domain is
scheduled by the hypervisor.
DMA Protection   Mb/s    Hyp      Driver OS  Driver User  Guest OS  Guest User   Idle     Driver Dom. Intr/s  Guest OS Intr/s
Enabled          1865    10.8%    0.1%       0.2%         42.7%     1.7%         44.5%    0                   13,903
Disabled         1865    1.9%     0.2%       0.2%         37.0%     1.8%         58.9%    0                   14,202

Table 4.4: CDNA 2-NIC transmit performance with and without DMA memory protection.
DMA Protection   Mb/s    Hyp      Driver OS  Driver User  Guest OS  Guest User   Idle     Driver Dom. Intr/s  Guest OS Intr/s
Enabled          1850    9.9%     0.2%       0.2%         52.6%     0.6%         36.5%    0                   7,484
Disabled         1850    2.2%     0.2%       0.3%         49.5%     0.8%         47.0%    0                   7,616

Table 4.5: CDNA 2-NIC receive performance with and without DMA memory protection.
Table 4.2 shows that using all of the available processing resources, Xen’s software
virtualization is not able to transmit at line rate over two network interfaces with either the Intel hardware or the RiceNIC hardware. However, only 41% of the processor
is used by the guest operating system. The remaining resources are consumed by Xen
overheads—using the Intel hardware, approximately 20% in the hypervisor and 37%
in the driver domain performing software multiplexing and other tasks.
As the table shows, CDNA is able to saturate two network interfaces, whereas
traditional Xen networking cannot. Additionally, CDNA performs far more efficiently,
with 45% processor idle time. The increase in idle time is primarily the result of two
factors. First, nearly all of the time spent in the driver domain is eliminated. The
remaining time spent in the driver domain is unrelated to networking tasks. Second,
the time spent in the hypervisor is decreased. With Xen, the hypervisor spends the
bulk of its time managing the interactions between the front-end and back-end virtual
network interface drivers. CDNA eliminates these communication overheads with the
driver domain, so the hypervisor instead spends the bulk of its time managing DMA
memory protection.
Table 4.3 shows the receive performance of the same configurations. Receiving
network traffic requires more processor resources, so Xen only achieves 1112 Mb/s
with the Intel network interface, and slightly lower with the RiceNIC interface. Again,
Xen overheads consume the bulk of the time, as the guest operating system only
consumes about 32% of the processor resources when using the Intel hardware.
As the table shows, not only is CDNA able to saturate the two network interfaces,
it does so with 37% idle time. Again, nearly all of the time spent in the driver domain
is eliminated. As with the transmit case, the CDNA architecture permits the hypervisor to spend its time performing DMA memory protection rather than managing
higher-cost interdomain communications as is required using software virtualization.
In summary, the CDNA I/O virtualization architecture provides significant performance improvements over Xen for both transmit and receive. On the transmit
side, CDNA requires half the processor resources to deliver about 200 Mb/s higher
throughput. On the receive side, CDNA requires 63% of the processor resources to
deliver about 750 Mb/s higher throughput.
4.4.3 Memory Protection
The software-based protection mechanisms in CDNA can potentially be replaced
by a hardware IOMMU. For example, AMD has proposed an IOMMU architecture
for virtualization that restricts the physical memory that can be accessed by each
device [2]. AMD’s proposed architecture provides memory protection as long as each
device is only accessed by a single domain. For CDNA, such an IOMMU would have
to be extended to work on a per-context basis, rather than a per-device basis. This
would also require a mechanism to indicate a context for each DMA transfer. Since
CDNA only distinguishes between guest operating systems and not traffic flows, there
are a limited number of contexts, which may make a generic system-level context-aware IOMMU practical.
Tables 4.4 and 4.5 show the performance of the CDNA I/O virtualization architecture both with and without DMA memory protection under transmit and receive tests, respectively. (The performance of CDNA with DMA memory protection enabled was replicated from Tables 4.2 and 4.3 for comparison purposes.) By disabling
DMA memory protection, the performance of the modified CDNA system establishes
an upper bound on achievable performance in a system with an appropriate IOMMU.
However, there would be additional hypervisor overhead to manage the IOMMU that
is not accounted for by this experiment. Since CDNA can already saturate two network interfaces for both transmit and receive traffic, the effect of removing DMA
protection is to increase the idle time by about 10–15%, depending on the workload.
As the tables show, this increase in idle time is the direct result of reducing the number of hypercalls from the guests and the time spent in the hypervisor performing
protection operations.
Even as systems begin to provide IOMMU support for techniques such as CDNA,
older systems will continue to lack such features. In order to generalize the design
of CDNA for systems with and without an appropriate IOMMU, wrapper functions
could be used around the hypercalls within the guest device drivers. The hypervisor
must notify the guest whether or not there is an IOMMU. When no IOMMU is
present, the wrappers would simply call the hypervisor, as described here. When
an IOMMU is present, the wrapper would instead create DMA descriptors without
hypervisor intervention and only invoke the hypervisor to set up the IOMMU. Such
wrappers already exist in modern operating systems to deal with such IOMMU issues.
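The wrapper idea can be expressed roughly as follows; cdna_post_dma(), cdna_hypercall_enqueue(), and iommu_map_and_enqueue() are hypothetical stand-ins rather than real Xen hypercalls, and the IOMMU path is shown only schematically.

    /* Hypothetical guest-driver wrapper that hides whether DMA protection is
     * enforced by hypervisor-validated descriptors or by an IOMMU.  Both
     * callees are illustrative placeholders. */
    #include <stdint.h>
    #include <stdbool.h>

    extern bool hypervisor_reports_iommu(void);     /* assumed capability query */
    extern int  cdna_hypercall_enqueue(int ctx, uint64_t paddr,
                                       uint16_t len, uint16_t flags);
    extern int  iommu_map_and_enqueue(int ctx, uint64_t paddr,
                                      uint16_t len, uint16_t flags);

    int cdna_post_dma(int ctx, uint64_t paddr, uint16_t len, uint16_t flags)
    {
        if (hypervisor_reports_iommu()) {
            /* The guest builds the descriptor itself and only asks the
             * hypervisor to install the corresponding IOMMU mapping. */
            return iommu_map_and_enqueue(ctx, paddr, len, flags);
        }
        /* No IOMMU: the hypervisor validates and enqueues the descriptor. */
        return cdna_hypercall_enqueue(ctx, paddr, len, flags);
    }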
4.4.4 Scalability
Figures 4.3 and 4.4 show the aggregate transmit and receive throughput, respectively, of Xen and CDNA with two network interfaces as the number of guest operating systems varies. The percentage of CPU idle time is also plotted above each data
point. CDNA outperforms Xen for both transmit and receive both for a single guest,
as previously shown in Tables 4.2 and 4.3, and as the number of guest operating
systems is increased.
Figure 4.3: Transmit throughput for Xen and CDNA (with CDNA idle time). (Aggregate transmit throughput in Mb/s versus the number of guests, from 1 to 24, for CDNA/RiceNIC and Xen/Intel; CDNA idle time is annotated above each data point.)
Figure 4.4: Receive throughput for Xen and CDNA (with CDNA idle time). (Aggregate receive throughput in Mb/s versus the number of guests, from 1 to 24, for CDNA/RiceNIC and Xen/Intel; CDNA idle time is annotated above each data point.)
As the figures show, the performance of both CDNA and software virtualization
degrades as the number of guests increases. For Xen, this results in declining bandwidth, but the marginal reduction in bandwidth decreases with each increase in the
number of guests. For CDNA, while the bandwidth remains constant, the idle time
decreases to zero. Despite the fact that there is no idle time for 8 or more guests,
CDNA is still able to maintain constant bandwidth. This is consistent with the leveling of the bandwidth achieved by software virtualization. Therefore, it is likely that
with more CDNA NICs, the throughput curve would have a similar shape to that
of software virtualization, but with a much higher peak throughput when using 1–4
guests.
These results clearly show that not only does CDNA deliver better network performance for a single guest operating system within Xen, but it also maintains significantly higher bandwidth as the number of guest operating systems is increased. With
24 guest operating systems, CDNA’s transmit bandwidth is a factor of 2.1 higher than
Xen’s and CDNA’s receive bandwidth is a factor of 3.3 higher than Xen’s.
Chapter 5
Protection Strategies for Direct I/O in Virtual
Machine Monitors
As the CDNA architecture shows, direct I/O access by guest operating systems
can significantly improve performance. Preferably, guest operating systems within
a virtual machine monitor would be able to directly access all I/O devices without
the need for the data to traverse an intermediate software layer within the virtual
machine monitor [45, 60]. However, if a guest can directly access an I/O device, then
it can potentially direct the device to access memory that it does not own via direct
memory access (DMA). Therefore, the virtual machine monitor must still be able to
ensure that guest operating systems do not access each other’s memory indirectly
through the shared I/O devices in the system. Both IOMMUs [10] and software-based methods (as established in the previous chapter) can provide DMA memory
protection for the virtual machine monitor. They do so by preventing guest operating
systems from directing I/O devices to access memory that is not owned by that guest,
while still allowing the guest to directly access the device.
This is the first experimental study to perform a head-to-head comparison of DMA memory protection strategies supporting direct access to I/O devices
from untrusted guest operating systems within a virtual machine monitor. Specifically, three hardware IOMMU-based strategies and one software-based strategy are
explored. The first IOMMU-based strategy uses single-use I/O memory mappings
that are created before each I/O operation and immediately destroyed after each I/O
operation. The second IOMMU-based strategy uses shared I/O memory mappings
that can be reused by multiple, concurrent I/O operations. The third IOMMU-based
strategy uses persistent I/O memory mappings that can be reused until they need
to be reclaimed to create new mappings. Finally, the software-based strategy uses
validated DMA descriptors that can only be used for one I/O operation.
The comparison of these four strategies yields several insights. First, all four
strategies provide equivalent protection between guest operating systems for direct
access to shared I/O devices in a virtual machine monitor. All of the techniques
prevent a guest operating system from directing the device to access memory that
does not belong to that guest. The traditional single-use strategy, however, provides
this protection at the greatest cost. Second, there is significant opportunity to reuse
IOMMU mappings which can reduce the cost of providing protection. Multiple concurrent I/O operations are able to share the same mappings often enough that there
is a noticeable decrease in the overhead of providing protection. That overhead can
further be decreased by allowing mappings to persist so that they can also be reused
by future I/O operations. Finally, the software-based protection strategy performs
comparably to the best of the IOMMU-based strategies.
The next section provides background on how I/O devices access main memory and the possible memory protection violations that can occur when doing so.
Sections 5.2 and 5.3 discuss the three IOMMU-based protection strategies and the
one software-based protection strategy. Section 5.4 then describes the protection
properties afforded by the four strategies. Section 5.5 describes the experimental
methodology and Section 5.6 evaluates the protection strategies.
5.1 Background
Modern server I/O devices, including disk and network controllers, utilize direct
memory access (DMA) to move data between the host’s main memory and the device’s
on-board buffers. The device uses DMA to access memory independently of the host
CPU, so such accesses must be controlled and protected. To initiate a DMA operation,
the device driver within the operating system creates DMA descriptors that refer to
regions of memory. Each DMA descriptor typically includes an address, a length,
and a few device-specific flags. In commodity x86 systems, devices lack support for
virtual-to-physical address translation, so DMA descriptors always contain physical
addresses for main memory. Once created, the device driver passes the descriptors
to the device, which will later use the descriptors to transfer data to or from the
indicated memory regions autonomously. When the requested I/O operations have
been completed, the device raises an interrupt to notify the device driver.
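To make the descriptor format concrete, a DMA descriptor of the kind described above might be declared as follows. This is a minimal sketch in C; the field names and flag values are hypothetical and do not correspond to any particular device.

#include <stdint.h>

/* Hypothetical DMA descriptor: a physical buffer address, a length, and
 * a few device-specific flags, as described in the text. */
struct dma_descriptor {
    uint64_t phys_addr;  /* physical address of the host memory buffer */
    uint32_t length;     /* number of bytes to transfer */
    uint32_t flags;      /* device-specific flags */
};

/* Example (hypothetical) flag values. */
#define DESC_FLAG_END_OF_PACKET 0x1
#define DESC_FLAG_INTR_ON_DONE  0x2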
For example, to transmit a network packet, the network interface’s device driver
might create two DMA descriptors. The first descriptor might point to the packet
headers and the second descriptor might point to the packet payload. Once created,
the device driver would then notify the network interface that there are new DMA
descriptors available. The precise mechanism of that notification depends on the
particular network interface, but typically involves a programmed I/O operation to
the device telling it the location of the new descriptors. The network interface would
then retrieve the descriptors from main memory using DMA—if they were not written
to the device directly by programmed I/O. The network interface would then retrieve
the two memory regions that compose the network packet and transmit them over
the network. Finally, the network interface would interrupt the host to indicate that
the packet has been transmitted. In practice, notifications from the device driver and
interrupts from the network interface would likely be aggregated to cover multiple
packets for efficiency.
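As a sketch of this transmit sequence, the driver code below fills the two descriptors and then performs the programmed I/O notification. It reuses the hypothetical struct dma_descriptor from the previous sketch; virt_to_phys() and nic_doorbell_write() are placeholder helpers, not the interface of any real network interface, and ring wraparound is omitted for brevity.

/* Placeholder helpers (hypothetical). */
uint64_t virt_to_phys(void *vaddr);
void nic_doorbell_write(uint32_t new_tail);

/* Post one packet as a header descriptor plus a payload descriptor,
 * then notify the NIC of the new tail index via programmed I/O. */
void transmit_packet(struct dma_descriptor *tx_ring, uint32_t *tail,
                     void *hdr, uint32_t hdr_len,
                     void *payload, uint32_t payload_len)
{
    tx_ring[*tail].phys_addr = virt_to_phys(hdr);
    tx_ring[*tail].length    = hdr_len;
    tx_ring[*tail].flags     = 0;

    tx_ring[*tail + 1].phys_addr = virt_to_phys(payload);
    tx_ring[*tail + 1].length    = payload_len;
    tx_ring[*tail + 1].flags     = DESC_FLAG_END_OF_PACKET |
                                   DESC_FLAG_INTR_ON_DONE;

    *tail += 2;
    nic_doorbell_write(*tail);  /* programmed I/O: tell the NIC where the
                                   new descriptors are */
}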
Three potential memory access violations can occur on every I/O transfer initiated
using this DMA architecture:
1. The device driver could create a DMA descriptor with an incorrect address.
2. The memory referenced by the DMA descriptor could be repurposed after the
descriptor was created by the device driver, but before it is used by the device.
3. The device itself could initiate a DMA transfer to a memory address not referenced by the DMA descriptor.
These violations could occur either because of failures or because of malicious intent.
However, as devices are typically not user-programmable, the last type of violation is
only likely to occur as a result of a hardware or software failure on the device.
In a non-virtualized environment, the operating system is responsible for preventing the first two types of memory access violations. This requires the operating
system to trust the device driver to create the correct DMA descriptors and to pin
physical memory used by I/O devices. A failure of the operating system to prevent
these memory access violations could potentially result in system failure. In a virtualized environment, however, the virtual machine monitor cannot trust the guest
operating systems to prevent these memory access violations, as a memory access
violation incurred by one guest operating system can potentially harm other guest
operating systems or even bring down the whole system. Therefore, a virtual machine
monitor requires mechanisms to prevent one guest operating system from intentionally or accidentally directing an I/O device to access the memory of another guest
operating system. The only way that would be possible is via one of the first two
types of memory access violations. Depending on the reliability of the I/O devices, it
may also be desirable to try to prevent the third type of memory access violation, as
well (although it is frequently not possible to protect against a misbehaving device,
as will be discussed in Section 5.4). The following sections describe mechanisms and
strategies for preventing these memory access violations.
5.2 IOMMU-based Protection
A virtual machine monitor can utilize an I/O memory management unit (IOMMU)
to help provide DMA memory protection when allowing direct access to I/O devices.
Whereas a virtual memory management unit enforces access control and provides
address translation services for software as it accesses memory, an IOMMU enforces
access control and provides address translation services for I/O devices as they access
memory. The IOMMU uses page table entries (PTEs) that each specify translation
from an I/O address to a physical memory address and specify access control (such
as which devices are permitted to use the given PTE).
An IOMMU only permits I/O devices to access memory for which a valid mapping
exists in the IOMMU page table. Thus, in an IOMMU-based system, there must be
a valid IOMMU translation for each host memory buffer to be used in an upcoming
DMA descriptor. Otherwise, the DMA descriptor will refer to a region unmapped by
the IOMMU, and the I/O transaction will fail.
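The check performed by the IOMMU on each device access can be sketched roughly as follows. This is a conceptual illustration only; actual IOMMU page-table formats, such as those defined by AMD and Intel, differ, and the field names here are hypothetical.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IO_PAGE_SHIFT 12
#define IO_PAGE_SIZE  (1u << IO_PAGE_SHIFT)

/* Hypothetical IOMMU page table entry: translation plus access control. */
struct iommu_pte {
    uint64_t host_phys_page;  /* physical page frame the I/O address maps to */
    uint16_t allowed_device;  /* which device may use this mapping */
    bool     valid;
    bool     writable;
};

/* Conceptual per-access check: the DMA fails unless a valid mapping with
 * the right owner and permission exists for the I/O address. */
bool iommu_translate(const struct iommu_pte *table, size_t num_entries,
                     uint64_t io_addr, uint16_t device, bool is_write,
                     uint64_t *host_phys)
{
    size_t idx = io_addr >> IO_PAGE_SHIFT;

    if (idx >= num_entries || !table[idx].valid)
        return false;                      /* unmapped: transaction fails */
    if (table[idx].allowed_device != device)
        return false;                      /* access-control violation */
    if (is_write && !table[idx].writable)
        return false;

    *host_phys = (table[idx].host_phys_page << IO_PAGE_SHIFT) |
                 (io_addr & (IO_PAGE_SIZE - 1));
    return true;
}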
The following subsections present three strategies for using an IOMMU to provide
DMA memory protection in a virtual machine monitor. The strategies primarily differ
in the extent to which IOMMU mappings are allowed to be reused.
5.2.1 Single-use Mappings
A common strategy for managing an IOMMU is to create a single-use mapping
for each I/O transaction. The Linux DMA-Mapping interface, for example, implements
a single-use mapping strategy. Ben-Yehuda et al. also explored a single-use mapping strategy in the context of virtual machine monitors [10]. In such a single-use
strategy, the driver must ensure that a new IOMMU mapping is created for each
DMA descriptor. The IOMMU mapping is then destroyed once the corresponding
I/O transaction has completed. In a virtualized system, the trusted virtual machine
monitor is responsible for creating and destroying IOMMU mappings at the driver’s
request. If the VMM does not create the mapping, either because the driver did not
request it or because the request referred to memory not owned by the guest, then
the device will be unable to perform the corresponding DMA operation.
To carry out an I/O transaction using a single-use mapping strategy, the virtual
machine monitor (VMM), untrusted guest operating system (GOS), and the device
(DEV) carry out the following steps (a code sketch of the guest-side flow appears after the list):
1. GOS: The guest OS requests an IOMMU mapping for the memory buffer involved in the I/O transaction.
2. VMM: The VMM validates that the requesting guest OS has appropriate read
or write permission for each memory page in the buffer to be mapped.
3. VMM: The VMM marks the memory buffer as “in I/O use”, which prevents the
buffer from being reallocated to another guest OS during an I/O transaction.
4. VMM: The VMM creates one or more IOMMU mappings for the buffer. As
with virtual memory management units, one mapping is usually required for
each memory page in the buffer.
5. GOS: The guest OS creates a DMA descriptor with the IOMMU-mapped address that was returned by the VMM.
6. DEV: The device carries out its I/O transaction as directed by the DMA descriptor and it notifies the driver upon completion.
7. GOS: The driver requests destruction of the corresponding IOMMU mapping(s).
8. VMM: The VMM validates that the mappings belong to the guest OS making
the request.
9. VMM: The VMM destroys the IOMMU mappings.
10. VMM: The VMM clears the “in I/O use” marker associated with each memory
page referred to by the recently-destroyed mapping(s).
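The guest-side portion of these steps can be sketched as follows. The hypercall wrappers (vmm_iommu_map(), vmm_iommu_unmap()) and the descriptor helpers are hypothetical placeholders rather than Xen's actual interfaces; the numbered comments refer to the steps above.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hypercall wrappers and driver helpers. */
int  vmm_iommu_map(void *buf, uint32_t len, bool is_write, uint64_t *io_addr);
int  vmm_iommu_unmap(uint64_t io_addr, uint32_t len);
void post_descriptor(uint64_t io_addr, uint32_t len);
void wait_for_completion(void);

/* One I/O transaction under the single-use strategy. */
int single_use_io(void *buf, uint32_t len, bool is_write)
{
    uint64_t io_addr;
    int err;

    /* Steps 1-4: the VMM validates ownership, marks the buffer
     * "in I/O use", and creates the IOMMU mapping(s). */
    err = vmm_iommu_map(buf, len, is_write, &io_addr);
    if (err)
        return err;   /* no mapping: the device cannot perform the DMA */

    /* Step 5: create a DMA descriptor with the IOMMU-mapped address. */
    post_descriptor(io_addr, len);

    /* Step 6: the device performs the transfer and signals completion. */
    wait_for_completion();

    /* Steps 7-10: validate and destroy the mapping, clearing the
     * "in I/O use" marker. */
    return vmm_iommu_unmap(io_addr, len);
}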
5.2.2 Shared Mappings
Rather than creating a new IOMMU mapping for each new DMA descriptor,
it is possible to share a mapping among DMA descriptors so long as the mapping
points to the same underlying memory page and remains valid. Sharing IOMMU
mappings is advantageous because it avoids the overhead of creating and destroying
a new mapping for each I/O request by instead reusing an existing mapping. To
implement sharing, the guest operating system must keep track of which IOMMU
mappings are currently valid, and it must keep track of how many pending I/O
requests are currently using the mapping. To protect a guest’s memory from errant
device accesses, an IOMMU mapping should be destroyed once all outstanding I/O
requests that use the mapping have been completed. Though the untrusted guest
operating system has responsibilities for carrying out a shared-mapping strategy,
it need not function correctly to ensure isolation among operating systems, as is
discussed further in Section 5.4.
To carry out a shared-mapping strategy, the guest OS and the VMM perform many
of the same steps as required by the single-use strategy. The shared-mapping strategy
differs at the initiation and termination of an I/O transaction. Before step 1 would
occur in the single-use strategy, the guest operating system first queries a table of
known, valid IOMMU mappings to see if a mapping for the I/O memory buffer already
exists. If so, the driver uses the previously established IOMMU-mapped address for
a DMA descriptor, and then passes the descriptor to the device, in effect skipping
steps 1–4. If not, the guest and VMM follow steps 1–4 to create a new mapping.
Whether a new mapping is created or not, before step 5, the guest operating system
increments its own reference count for the mapping (or sets it to one for a new
mapping). This reference count is separate from the reference count maintained by
the VMM.
Steps 5 and 6 then proceed as in the single-use strategy. After these steps have
completed, the driver calls the guest operating system to decrement its reference
count. If the reference count is zero, no other I/O transactions are in progress that
are using this mapping, and it is appropriate to call the VMM to destroy the mapping
as in steps 7–10 of the single-use strategy. Otherwise, the IOMMU mapping is still
being used by another I/O transaction within the guest OS, so steps 7–10 are skipped.
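A sketch of the guest-side bookkeeping for shared mappings follows. The mapping-table helpers and hypercall wrappers are hypothetical, and the 4 KB page size is assumed to be the mapping granularity; the numbered comments again refer to the single-use steps.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct io_mapping {
    uint64_t io_addr;   /* IOMMU-mapped address for one guest page */
    unsigned refcount;  /* pending I/O transactions using this mapping */
};

/* Hypothetical helpers: guest-side mapping table and hypercall wrappers. */
struct io_mapping *mapping_lookup(void *page);
struct io_mapping *mapping_insert(void *page);
void mapping_remove(void *page);
int  vmm_iommu_map(void *page, uint32_t len, bool is_write, uint64_t *io_addr);
int  vmm_iommu_unmap(uint64_t io_addr, uint32_t len);

/* Obtain an IOMMU-mapped address for a page, reusing a live mapping. */
uint64_t shared_map_get(void *page, bool is_write)
{
    struct io_mapping *m = mapping_lookup(page);
    if (m == NULL) {                             /* no valid mapping yet */
        m = mapping_insert(page);
        vmm_iommu_map(page, PAGE_SIZE, is_write, &m->io_addr); /* steps 1-4 */
        m->refcount = 0;
    }
    m->refcount++;                 /* guest-side count of pending I/O */
    return m->io_addr;
}

/* Called when an I/O transaction using the page completes. */
void shared_map_put(void *page)
{
    struct io_mapping *m = mapping_lookup(page);
    if (--m->refcount == 0) {                    /* last pending I/O done */
        vmm_iommu_unmap(m->io_addr, PAGE_SIZE);  /* steps 7-10 */
        mapping_remove(page);
    }
}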
5.2.3 Persistent Mappings
IOMMU mappings can further be reused by allowing them to persist even after all
I/O transactions using the mapping have completed. Compared to a shared mapping
strategy, such a persistent mapping strategy attempts to further reduce the overhead
associated with creating and destroying IOMMU mappings inside the VMM. Whereas
sharing exploits reuse among mappings only when a mapping is being actively used
by at least one I/O transaction, persistence exploits temporal reuse across periods of
inactivity.
The infrastructure and mechanisms for implementing a persistent mapping strategy are similar to those required by a shared mapping strategy. The primary difference
is that the guest operating system does not request that mappings be destroyed after
the I/O transactions using them complete. In effect, this means that mappings persist
until they must be recycled. Therefore, in contrast to the shared mapping strategy,
when the guest’s reference count is decremented after step 6, the I/O transaction is
complete and steps 7–10 are always skipped. This should dramatically reduce the
number of hypercalls into the VMM.
As mappings are now persistent, they must be recycled whenever a new mapping
is needed. This changes the behavior of step 1 when compared to the shared mapping
case. Before performing step 1, as in the shared mapping case, the guest operating
system first queries a table of known, valid IOMMU mappings to see if a mapping
for the I/O memory buffer already exists. If one does not, a new mapping is needed
and the guest operating system must select an idle mapping to be recycled. In step 1,
the guest then passes this idle mapping to the virtual machine monitor along with
the request to create a new mapping. Steps 8, 10, and 2–4 are then performed by
the VMM to modify the mapping(s) for use by the new I/O transaction. Note that
step 9 can be skipped, as one valid mapping is going to be immediately replaced by
another valid mapping.
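Relative to the shared-mapping sketch above, the persistent strategy changes only the miss path and the completion path, roughly as follows. The helpers remain hypothetical; vmm_iommu_remap() stands in for the combined steps 8, 10, and 2-4 performed by the VMM.

/* Hypothetical helpers for the persistent strategy (types and the
 * lookup helper are reused from the shared-mapping sketch above). */
struct io_mapping *pick_idle_mapping(void);       /* a refcount == 0 victim */
void mapping_retarget(struct io_mapping *m, void *page);
int  vmm_iommu_remap(uint64_t io_addr, void *new_page,
                     uint32_t len, bool is_write);

uint64_t persistent_map_get(void *page, bool is_write)
{
    struct io_mapping *m = mapping_lookup(page);
    if (m == NULL) {
        /* Miss: recycle an idle mapping instead of creating a fresh one.
         * The VMM performs steps 8, 10, and 2-4; step 9 is skipped since
         * one valid mapping immediately replaces another. */
        m = pick_idle_mapping();
        vmm_iommu_remap(m->io_addr, page, PAGE_SIZE, is_write);
        mapping_retarget(m, page);
        m->refcount = 0;
    }
    m->refcount++;
    return m->io_addr;
}

void persistent_map_put(void *page)
{
    struct io_mapping *m = mapping_lookup(page);
    m->refcount--;      /* the mapping persists; steps 7-10 are skipped */
}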
5.3 Software-based Protection
IOMMU-based protection strategies enforce safety even when untrusted software
provides unverified DMA descriptors directly to hardware, because the DMA operations generated by any device are always subject to later validation. However, an
IOMMU is not necessary to ensure full isolation among untrusted guest operating
systems, even when they use DMA-capable hardware that directly reads and writes
host memory. Rather than relying on hardware to perform late validation during
I/O transactions, a lightweight software-based system performs early validation of
DMA descriptors before they are used by hardware. The software-based strategy
also must protect validated descriptors from subsequent unauthorized modification
by untrusted software, thus ensuring that all I/O transactions operate only on buffers
that have been approved by the VMM. The CDNA architecture relies on a software-based protection mechanism, as introduced in Chapter 4. This study compares that
approach to IOMMU-based approaches.
The runtime operation of a software-based protection strategy works much like
a single-use IOMMU-based strategy, since both validate permissions for each I/O
transaction. Whereas the single-use IOMMU-based strategy uses the VMM to create
IOMMU mappings for each transaction, the software-based strategy uses the VMM to create the actual DMA descriptor. The descriptor is valid only for a single I/O transaction.
Unlike an IOMMU-based system, an untrusted guest OS’s driver must first register
itself with the VMM during initialization. At that time, the VMM takes ownership
of the driver’s DMA descriptor region and the driver’s status region, revoking write
permissions from the guest. This prevents the guest from independently creating
or modifying DMA descriptors, or modifying the status region. Finally, the VMM
must prevent the guest from redirecting the device to different descriptor and status regions. This can be
trivially accomplished by only mapping the device’s configuration registers into the
VMM’s address space, and not into the guests’ address spaces.
After initialization, the runtime operation of the software-based strategy is similar
to the single-use IOMMU-based strategy outlined in Section 5.2.1. Steps 1–3 of a
software-based strategy are identical. In step 4, the VMM creates a DMA descriptor
in the write-protected DMA descriptor region, obviating the OS’s role in step 5. The
device carries out the requested operation using the validated descriptor, as in step 6,
and because the descriptor is write-protected, the untrusted guest cannot induce an
unauthorized transaction. When the device signals completion of the transaction,
the VMM inspects the device’s state (which is usually written via DMA back to the
host) to see which DMA descriptors have been used. The VMM then processes those
completed descriptors, as in step 10, permitting the associated guest memory buffers
to be reallocated.
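From the VMM's side, this runtime path can be sketched as follows. The names are hypothetical and are not the CDNA prototype's actual interface; the numbered comments refer to the single-use steps in Section 5.2.1, and the descriptor structure is the hypothetical one sketched in Section 5.1.

#include <stdbool.h>
#include <stdint.h>

struct guest;          /* opaque handle for one guest OS (hypothetical) */

/* Hypothetical VMM helpers. */
bool guest_owns_range(struct guest *g, uint64_t guest_phys,
                      uint32_t len, bool is_write);
void pin_range(struct guest *g, uint64_t guest_phys, uint32_t len);
struct dma_descriptor *next_free_descriptor(struct guest *g);
void notify_device(struct guest *g);

/* Hypercall handler: the guest asks the VMM to post a DMA on its behalf. */
int vmm_post_guest_dma(struct guest *g, uint64_t guest_phys,
                       uint32_t len, bool is_write)
{
    /* Steps 1-3: validate ownership and pin the buffer for the I/O. */
    if (!guest_owns_range(g, guest_phys, len, is_write))
        return -1;                        /* refuse the request */
    pin_range(g, guest_phys, len);

    /* Step 4 (software variant): the VMM, not the guest, writes the
     * descriptor into the descriptor region from which the guest's write
     * permission was revoked at registration time. */
    struct dma_descriptor *d = next_free_descriptor(g);
    d->phys_addr = guest_phys;
    d->length    = len;
    d->flags     = 0;

    /* The device fetches and uses the pre-validated descriptor directly. */
    notify_device(g);
    return 0;
}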
5.4 Protection Properties
The protection strategies presented in Sections 5.2 and 5.3 can be used to prevent
the memory access violations presented in Section 5.1. Those memory access violations, however, can occur both across multiple guests (inter-guest) and within a single
guest (intra-guest). A virtual machine monitor must provide inter-guest protection
in order to operate reliably. A guest operating system may additionally benefit if
the virtual machine monitor can also help provide intra-guest protection. This section describes the protection properties of the four previously presented protection
strategies.
5.4.1 Inter-Guest Protection
Perhaps surprisingly, all four strategies provide equivalent protection against the
first two types of memory access violations presented in Section 5.1: creation of
an incorrect DMA descriptor and repurposing the memory referenced by a DMA
descriptor. In all of the IOMMU-based strategies, if the device driver creates a
DMA descriptor that refers to memory that is not owned by that guest operating
system, the device will be unable to perform that DMA, as no IOMMU mapping will
exist. The only requirement to maintain this protection is that the VMM must never
create an IOMMU mapping for a guest that does not refer to that guest’s memory.
Similarly, only the VMM can repurpose memory to another guest, so as long as it does
not do so while there is an existing IOMMU mapping to that memory, the second
memory protection violation can never occur. The software-based approach provides
exactly the same guarantees by only allowing the VMM to create DMA descriptors.
Therefore, all of these strategies allow the VMM to provide inter-guest protection.
The third type of memory access violation, the device initiating a rogue DMA
operation, is more difficult to prevent. If the device is shared among multiple guest
operating systems, then no strategy can prevent this type of protection violation. For
example, if a network interface is allowed to receive packets for two guest operating
systems, there is no way for the VMM to prevent the device from sending the traffic
destined for one guest to the other. This is one simple example of many protection
violations that a shared device can commit.
However, if a device is privately assigned to a single guest operating system, the
IOMMU-based strategies can be used to provide protection against faulty device
behavior. In this case, the VMM simply has to ensure that there are only IOMMU
mappings to the guest that privately owns the device. In that manner, there is no
way the device can even access memory that does not belong to that guest. However,
the software-based strategy cannot even provide this level of protection. As DMA
descriptors are pre-validated, there is no way to stop the device from simply ignoring
the DMA descriptor and accessing any physical memory.
5.4.2 Intra-Guest Protection
None of the four protection strategies can protect the guest OS from the first
two types of access violations caused by its device drivers. In essence, the protection
afforded to the guest OS by any of the strategies is only as good as the implementation
of the strategy in a device driver. Consider the IOMMU-based strategies. For an
actual access violation to be prevented, the device driver would have to map the
correct buffer through the IOMMU but construct an incorrect DMA descriptor for
it. Such an error, however, seems unlikely. In the case of the software-based strategy,
such a scenario is impossible because the memory protection on the buffer and the
creation of the DMA descriptor are combined into one operation by the VMM.
In contrast, the IOMMU-based strategies offer some protection against the third
type of memory access violation, the device initiating a rogue DMA operation. Of
these strategies, the single-use and shared strategies will offer the greatest protection
against this type of memory access violation because the only pages that could be
corrupted are those that are the target of a pending I/O operation. However, the
persistent strategy offers very little protection, as there will be a significant number
of active mappings at any given time that the device could erroneously use.
5.5 Experimental Setup
The protection strategies described here were evaluated on a system with an AMD
Opteron 250 processor that includes an integrated graphics address relocation table
(GART) alongside the memory controller. The GART can be used to translate memory addresses using physical-to-physical address mappings. Therefore, with the appropriate software infrastructure, a GART can model the functionality of an IOMMU.
GART mappings are established at the memory-page granularity (in this case, 4
KB). Each page requires a separate GART mapping. Software programs the GART
hardware to create a mapping at a specific location within the GART's contiguous physical address range that points to a backing location in host memory. The
GART’s physical address range is often referred to as the GART “aperture”.
GART mappings are organized in an array in memory. An index into the mapping
array corresponds to a page index into the aperture. When an I/O device accesses
a location in the GART’s aperture, the GART transparently redirects the memory
access to a target memory location as specified by the corresponding GART mapping’s
address. For unused or unmapped locations within the aperture, software creates a
dummy mapping pointing to a single, shared garbage memory page.
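The relationship between the GART mapping array and the aperture described above can be sketched as follows. The field names and helpers are illustrative, and the 512 MB aperture size matches the configuration described later in this section.

#include <stdint.h>

#define GART_PAGE_SIZE      4096u
#define GART_APERTURE_BYTES (512u * 1024 * 1024)
#define GART_NUM_SLOTS      (GART_APERTURE_BYTES / GART_PAGE_SIZE)

static uint64_t gart_table[GART_NUM_SLOTS]; /* backing page address per slot */
static uint64_t garbage_page_phys;          /* shared dummy target page */

/* Map aperture page 'slot' onto host physical page 'target_phys'. */
void gart_map(uint32_t slot, uint64_t target_phys)
{
    gart_table[slot] = target_phys;
}

/* Unused or unmapped slots point at the shared garbage page. */
void gart_unmap(uint32_t slot)
{
    gart_table[slot] = garbage_page_phys;
}

/* A device access at (aperture_base + slot * GART_PAGE_SIZE + offset) is
 * transparently redirected to (gart_table[slot] + offset). */
uint64_t gart_redirect(uint32_t slot, uint32_t offset)
{
    return gart_table[slot] + offset;
}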
So long as an I/O device can only access memory within the GART aperture, all
of that device’s accesses will be subject to remapping and access controls as specified
by the virtual machine monitor. Thus, the GART’s mapping table limits I/O device
accesses to those regions approved by the VMM, just as an IOMMU limits I/O device
accesses.
Unlike an IOMMU-based system, however, a device could still generate an access
outside the GART region, thus bypassing access controls. As a practical measure, I
modify the prototype network interface to only accept DMA requests that lie within
the VMM-specified GART aperture. Even though this system architecture could
allow a faulty device to access memory outside the GART aperture, the architecture
faithfully models the overheads of a system for which all of the network interface’s
DMA requests are subject to the IOMMU strategy implemented by the VMM. Hence,
this architecture is an effective means for examining the efficiency and performance
of the various IOMMU management strategies under consideration.
Ben-Yehuda et al. identified that platform-specific IOMMU implementation details can significantly affect performance and influence the efficiency of a system’s
protection strategy [10]. Specifically, that work noted that the inability to individually replace IOMMU mappings without globally flushing the CPU cache can severely
degrade performance. The GART-based IOMMU implementation used in this work
does not incur the cache-flush penalties associated with the IBM platform, and thus
the GART-based implementation should represent a low-overhead upper-bound with
respect to architectural efficiency and performance.
I implement the IOMMU- and software-based protection strategies in the open
source Xen 3 virtual machine monitor [7]. I evaluate these strategies on a variety of
network-intensive workloads, including a TCP stream microbenchmark, a voice-over-IP (VoIP) server benchmark, and a static-content web server benchmark. The stream
microbenchmark either transmits or receives bulk data over a TCP connection to a
remote host. The VoIP benchmark uses the OpenSER server. In this benchmark,
OpenSER acts as a SIP registrar and 50 clients simultaneously initiate calls as quickly
as possible. The web server benchmark uses the lighttpd web server to host static
HTTP content. In this benchmark, 32 clients simultaneously replay requests from
various web traces as quickly as possible. Three web traces are used in this study:
“CS”, “IBM”, and “WC”. The CS trace is from a computer science departmental web
server and has a working set of 1.2 GB of data. The IBM trace is from an IBM web
server and has a working set of 1.1 GB of data. The WC trace is from the 1998 World
Cup soccer web server and has a working set of 100 MB of data. For all benchmarks,
the client machine is never saturated, so the server machine is always the bottleneck.
The server under test uses a 2.4 GHz Opteron processor, has two Gigabit Ethernet network interface cards, and features DDR 400 DRAM. The network interfaces
are publicly available prototypes that support shared, direct access [51]. A single
unprivileged guest operating system has 1.4 GB of memory. The IOMMU-based
strategies employ 512 MB of physical GART address space for remapping. In each
benchmark, direct access for the guest is granted only for the network interface cards.
Because the guest’s memory allocation is large enough to hold each benchmark and its
corresponding data set, other I/O is insignificant. For the web-based workloads, the
guest’s buffer cache is warmed prior to performance testing. For all of the benchmarks,
each configuration was performance tested at least five times with each benchmark.
Because there was effectively no variance across runs for a given configuration and
benchmark, the statistics reported are averages of those runs.
5.6 Evaluation
Network server applications can stress network I/O in different ways, depending
on the characteristics of the application and its workload. Applications may generate large or small network packets, and may or may not utilize zero-copy I/O.
For an application running on a virtualized guest operating system, these network
characteristics interact with the I/O protection strategy implemented by the VMM.
Protection      CPU %           Reuse (%)       HC/
Strategy        Total   Prot.   TX      RX      DMA
Stream Transmit
None            41      0       N/A     N/A     0
Single-use      64      23      N/A     N/A     .88
Shared          59      18      39      0       .55
Persistent      51      10      100     100     0
Software        56      15      N/A     N/A     .90
Stream Receive
None            53      0       N/A     N/A     0
Single-use      79      26      N/A     N/A     .37
Shared          73      20      39      0       .10
Persistent      66      13      100     100     0
Software        64      11      N/A     N/A     .39
Table 5.1: TCP Stream profile.
Consequently, the efficiency of the I/O protection strategy can affect application performance in different ways.
For all applications, I evaluate the four protection strategies presented earlier,
and I compare each to the performance of a system lacking any I/O protection at
all (“None”). “Single-use”, “Shared”, and “Persistent” all use an IOMMU to enforce
protection, using either single-use, shared-mapping, or persistent-mapping strategies,
respectively, as described in Section 5.2. “Software” uses software-based I/O protection, as described in Section 5.3.
5.6.1 TCP Stream
A TCP stream microbenchmark either transmits or receives bulk TCP data and
thus isolates network I/O performance. This benchmark does not use zero-copy I/O.
Table 5.1 shows the CPU efficiency and overhead associated with each protection
mechanism when streaming data over two network interfaces. The table shows the
total percentage of CPU consumed while executing the benchmark and the percentage
of CPU spent implementing the given protection strategy. The table also shows
the percentage of times a buffer to be used in an I/O transaction (either transmit
or receive) already has a valid IOMMU mapping that can be reused. Finally, the
table shows the number of VMM invocations, or hypercalls (HC), required per DMA
descriptor used by the network interface driver.
When either transmitting or receiving, all of the strategies achieve the same TCP
throughput (1865 Mb/s transmitting, 1850 Mb/s receiving), but they differ according to how costly they are in terms of CPU consumption. The single-use protection strategy is the most costly, with its repeated construction and destruction of
IOMMU mappings consuming 23% of total CPU resources for transmit and 26% for
receive. The shared strategy reclaims some of this overhead through its sharing of
in-use mappings, though this reuse only exists for transmitted packets (data in the
transmit-stream case, TCP ACK packets in the receive case). The lack of reuse for
received packets is caused by the XenoLinux buffer allocator, which dedicates an entire 4 KB page for each receive buffer, regardless of the buffer’s actual size. This
over-allocation is an artifact of the XenoLinux I/O architecture, which was designed
to remap received packets to transfer them between guest operating systems. Regardless, the persistent strategy achieves 100% reuse of mappings, as the small number of
persistent mappings that cover network buffers essentially become permanent. This
further reduces overhead relative to single-use and shared. Notably, the number of
hypercalls per DMA operation rounds to zero. However, management of the persistent mappings—mapping lookup and recycling, as described in Section 5.2.3—still
consumes over 10% of the processor's resources.
Surprisingly, the overhead incurred by the software-based technique is comparable
to the IOMMU-based persistent mapping strategy. The software-based technique
certainly requires far more hypercalls per DMA than the IOMMU-based strategies.
However, the cost of those VMM invocations and the associated page-verification
operations is similar to the cost of managing persistent mappings for an IOMMU.
Protection      Calls/  CPU %   Reuse (%)       HC/
Strategy        Sec.    Prot.   TX      RX      DMA
None            3005    0       N/A     N/A     0
Single-use      2790    6.1     N/A     N/A     .68
Shared          2835    6.0     4       0       .65
Persistent      2901    2.1     100     100     0
Software        2895    3.5     N/A     N/A     .67
Table 5.2: OpenSER profile.
5.6.2 VoIP Server
Table 5.2 shows the performance and overhead profile for the OpenSER VoIP application benchmark for the various protection strategies. The OpenSER benchmark
is largely CPU-intensive and therefore only uses one of the two network interface cards.
Though the strategies rank similarly in efficiency for the OpenSER benchmark as in
the TCP Stream benchmark, Table 5.2 shows one significant difference with respect to
reuse of IOMMU mappings. Whereas the shared strategy was able to reuse mappings
39% of the time for transmit packets under the TCP Stream benchmark, OpenSER
sees only 4% reuse. Unlike typical high-bandwidth streaming applications, OpenSER
only sends and receives very small TCP messages in order to initiate and terminate
VoIP phone calls. Consequently, the shared strategy provides only a minimal efficiency and performance improvement over the high-overhead single-use strategy for
the OpenSER benchmark, indicating that sharing alone does not provide an efficiency
gain for applications that are heavily reliant on small messages.
5.6.3 Web Server
Table 5.3 shows the performance, overhead, and sharing profiles of the various protection strategies when running a web server under each of three different trace workloads, “CS”, “IBM”, and “WC”. As in the TCP Stream and OpenSER benchmarks,
the different strategies rank identically among each other in terms of performance
Protection      HTTP    CPU %   Reuse (%)       HC/
Strategy        Mbps    Prot.   TX      RX      DMA
CS Trace
None            1336    0       N/A     N/A     0
Single-use      1142    18.2    N/A     N/A     .66
Shared          1162    16.3    40      0       .42
Persistent      1252    5.3     100     100     0
Software        1212    9.1     N/A     N/A     .67
IBM Trace
None            359     0       N/A     N/A     0
Single-use      322     8.5     N/A     N/A     .70
Shared          322     8.3     22      0       .58
Persistent      338     2.4     100     100     0
Software        326     4.5     N/A     N/A     .71
WC Trace
None            714     0       N/A     N/A     0
Single-use      617     11.8    N/A     N/A     .68
Shared          619     11.1    30      0       .50
Persistent      655     3.0     100     100     0
Software        632     5.9     N/A     N/A     .69
Table 5.3: Web Server profile using write().
and overhead. Each of the different traces generates messages of different sizes and
requires different amounts of web-server compute overhead. For the write()-based
implementation of the web server, however, the server is always completely saturated
for each workload. “CS” is primarily network-limited, generating relatively large response messages with an average HTTP message size of 34 KB. “IBM” is largely
compute-limited, generating relatively small HTTP responses with an average size
of 2.8 KB. “WC” lies in between, with an average response size of 6.7 KB. As the
table shows, the amount of reuse exploited by the shared strategy is dependent on
the average HTTP response being generated. Larger average messages lead to larger
amounts of reuse for transmitted buffers under the shared strategy. Though larger
amounts of reuse slightly reduce the CPU overhead for the shared strategy relative
Protection      HTTP             CPU %   Reuse (%)                  HC/
Strategy        Mbps             Prot.   Hdr. TX  File TX  RX       DMA
CS Trace
None            1378 (35% idle)  0       N/A      N/A      N/A      0
Single-use      1291 (7% idle)   27.6    N/A      N/A      N/A      .37
Shared          1330 (17% idle)  17.7    82       72       0        .17
Persistent      1342 (23% idle)  11.5    100      96       100      .02
Software        1351 (21% idle)  13.7    N/A      N/A      N/A      .37
IBM Trace
None            475              0       N/A      N/A      N/A      0
Single-use      403              14.0    N/A      N/A      N/A      .43
Shared          413              12.3    34       50       0        .35
Persistent      438              4.3     100      99       100      0
Software        422              6.2     N/A      N/A      N/A      .43
WC Trace
None            961              0       N/A      N/A      N/A      0
Single-use      760              19.9    N/A      N/A      N/A      .39
Shared          796              16.0    53       62       0        .27
Persistent      872              5.1     100      100      100      0
Software        833              8.7     N/A      N/A      N/A      .40
Table 5.4: Web Server profile using zero-copy sendfile().
to the single-use strategy, the reuse is not significant enough under these workloads
to yield significant performance benefits.
As in the other benchmarks, receive buffers are not subject to reuse with the
shared-mapping strategy. Regardless of the workload, the persistent strategy is 100%
effective at reusing existing mappings as the mappings again become effectively permanent. As in the other benchmarks, the software-based strategy achieves application
performance between the shared and persistent IOMMU-based strategies.
For all of the previous workloads, the network application utilized the write()
system call to send any data. Consequently, all buffers that are transmitted to the
network interface have been allocated by the guest operating system’s network-buffer
allocator. Using the zero-copy sendfile() interface, however, the guest OS generates
network buffers for the packet headers, but then appends the application’s file buffers
rather than copying the payload. This interface has the potential to change the
amount of reuse exploitable by a protection strategy. Using sendfile(), the packet-payload footprint for IOMMU mappings is no longer limited to the number of internal
network buffers allocated by the OS, but instead is limited only by the size of physical
memory allocated to the guest.
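The difference between the two transmission paths can be made concrete with the standard POSIX/Linux calls involved; the snippet below is only a sketch of the server's send path, with error handling omitted.

#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* write()-based path: the payload is copied into kernel network buffers,
 * so the pages later mapped for DMA come from the OS buffer allocator. */
void send_response_write(int sock, const char *hdr, size_t hdr_len,
                         const char *body, size_t body_len)
{
    write(sock, hdr, hdr_len);
    write(sock, body, body_len);
}

/* sendfile()-based (zero-copy) path: only the headers come from network
 * buffers; the payload pages come straight from the file cache, so the
 * set of pages visible to the protection strategy can span the guest's
 * whole file working set. */
void send_response_sendfile(int sock, const char *hdr, size_t hdr_len,
                            int file_fd, off_t offset, size_t file_len)
{
    write(sock, hdr, hdr_len);
    sendfile(sock, file_fd, &offset, file_len);
}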
Table 5.4 shows the performance, efficiency, and sharing profiles for the different
protection strategies for web-based workloads when the server uses sendfile() to
transmit HTTP responses. Note that for the “CS” trace, the host CPU is not completely saturated, and so the CPU’s idle time percentage is annotated next to HTTP
performance in the table. For the other traces, the CPU is completely saturated.
The table separates reuse statistics for transmitted buffers according to whether or
not the buffer was a packet header or packet payload. As compared to Table 5.3,
Table 5.4 shows that the shared strategy is more effective overall at exploiting reuse
using sendfile() than with write(). Consequently, the shared strategy gives a
larger performance and efficiency benefit relative to the single-use strategy when using sendfile(). Table 5.4 also shows that the persistent strategy is highly effective
at capturing file reuse, even though the total working-set sizes of the “CS” and “IBM”
traces are each more than twice as large as the 512 MB mapping space afforded by
the GART. Finally, the table shows that the software-based strategy performs better
than either the shared or single-use IOMMU strategies for all workloads, and can
perform even better than the persistent strategy on the CS trace, though it consumes
more CPU resources.
5.6.4 Discussion
The architecture of the GART imposes some limitations on this study. In particular, it is infeasible to evaluate a direct map strategy using the IOMMU. Under this
strategy, the VMM creates a persistent identity mapping for each VM that permits
access to its entire memory. This mapping is created by the VMM when the VM
is started and updated only if the memory allocated to the VM changes. Moreover,
because the direct map strategy uses an identity mapping, there is no need for the
device driver to translate the address that is provided by the guest OS into an address that is suitable for DMA. Unfortunately, the GART cannot implement such an
identity mapping because the address of the aperture cannot overlap with that of
physical memory.
Like the other protection strategies, the direct map strategy has pros and cons. It
provides the same protection between guest operating systems as the other IOMMU-based strategies, but it provides the least safety within a guest operating system.
For example, under the persistent mapping strategy, a page will only be mapped by
the IOMMU if it is the target of an I/O operation. Moreover, an unused mapping
may ultimately be destroyed. In contrast, under the direct map strategy, all pages
are mapped at all times. The direct map strategy’s unique advantage is that it can
be implemented entirely within the VMM without support from the guest OS. Its
implementation is, in effect, transparent to the guest OS.
Although it is not possible to determine the performance of the direct map strategy experimentally using the GART-based setup, it is reasonable to argue that its
performance must be bounded by the performance of the “Persistent” and “None”
strategies. Although, in many cases, the “Persistent” strategy achieves near 100%
reuse, the direct map strategy could have lower overhead because the device driver
does not have to translate the address that is provided by the guest OS into an address
that is suitable for DMA.
The GART's translation table is a single, one-dimensional array. Consequently, if
an IOTLB miss occurs, address translation requires at most one memory access. In
contrast, the coming IOMMUs from AMD and Intel will use multilevel translation
tables, similar to the page tables used by the processor’s MMU. Thus, both updates
by the hypervisor and IOTLB misses may have a higher cost because of the additional
memory references incurred by walking multilevel translation tables.
Regardless of the benchmark, the data in Section 5.6 shows many opportunities
for reuse of mappings in network I/O applications. However, some of this reuse is
a consequence of the difference between the mapping's granularity (i.e., a 4 kilobyte memory page) and the granularity of a network packet (i.e., 1500 bytes). Hence,
adjacent buffers in the same memory page can be reused for multiple packets because the packet size is smaller than that of a memory page. A hardware technique
that increases the maximum transaction size of a DMA operation could invert
this relationship and decrease the amount of reuse exploitable by the existing implementations examined here. For example, network interfaces that support TCP
segmentation offload provide the abstraction to the operating system of a NIC that
has a much larger maximum transmission unit (i.e., 16 kilobytes instead of 1500 bytes).
In this case, the Shared protection strategy could approach the reuse properties of
the Single-use strategy, since a memory page would likely be used only once for one
large buffer rather than being used multiple times. However, previous studies by Kim
et al. show that the payload data for the web-based traces examined in this study
have significant reuse, and hence one would still expect to see reuse benefits in the
Persistent protection strategy [26].
Xen differs from many virtualization systems in that it exposes host physical addresses to the guest OS. In particular, the guest OS, and not the VMM, is responsible
for translating between pseudo-physical addresses that are used at most levels of the
guest OS and host physical addresses that are used at the device level. This does
not, however, fundamentally change the implementation of the various protection
strategies.
Chapter 6
Conclusion
As demand for high-bandwidth network services continues to grow, network servers
must continue to deliver more and more performance. Simultaneously, power and
cooling continue to be first-class concerns for datacenter servers, and thus network
servers must support the highest levels of efficiency possible. Architectural trends toward chip multiprocessors are straining contemporary OS network stacks and network
hardware, exposing efficiency bottlenecks that can prevent software architectures from
gaining any substantial performance through multiprocessing. And whereas multicore
architectures offer an opportunity to better utilize physical server resources in a more
efficient manner through consolidation, inefficiencies inherent to modern I/O sharing
architectures and protection strategies severely damage performance and undermine
overall server efficiency.
This dissertation addresses key OS and VMM architectural components that can
limit I/O performance and efficiency in modern thread-concurrent and VM-concurrent
servers. Each of these components has separate performance and efficiency issues that
have tangible effects on the ability of a server to support its network applications.
The OS parallelization strategy affects the performance of network I/O processing by
the operating system, affects the maximum throughput attainable on a given connection, and thus affects application throughput and scalability. The virtual machine
monitor’s I/O virtualization architecture affects the overhead required to share access to a given I/O device, which thus affects the maximum aggregate application
performance attainable on the system and affects the ability of the system to support
larger numbers of concurrent virtual machines. Furthermore, the VMM's memory-protection strategy also affects the overhead of device virtualization, which affects
application performance, and affects the level of isolation supported by the system,
which affects the operating systems’ and hence applications’ stability. The design
decisions of each strategy explored and the characteristics of the resulting architectures have implications for server architects who will seek to build upon this work
and tackle remaining and future challenges facing server architecture. Those design
decisions, their characteristics, and the corresponding implications are discussed in
further detail in the sections that follow.
6.1 Orchestrating OS parallelization to characterize and improve I/O processing
The trend toward chip multiprocessing hardware necessitates parallelization of
the operating system's network stack. This dissertation establishes that a continuum of parallelism and efficiency exists among contemporary network stack
organizations, and this research explores points along that continuum. Along with
the synchronization mechanism employed by each organization, the parallelization
strategy has a direct impact on overall efficiency and ultimately throughput. This research found that a traditional network interface featuring a single high- bandwidth
link imposes an inherent bottleneck with regard to its single interface (ie, packet
queue) to the operating system, which limited throughput regardless of the network
stack organization used. However, introducing parallelism at the network interface
(by using separate interfaces) exposed the scheduling and synchronization efficiency
characteristics of each organization on the continuum. Through examining these
characteristics, it is clear that attempting to maximize theoretical parallelism in the
network stack can actually hurt performance, even on a highly parallel machine. This
research finds that a less-parallel, connection-parallel network stack is
both more efficient and higher-performing than other organizations that attempt to
maximize packet parallelism.
Though this dissertation explored primarily performance and efficiency, the selection of a connection-parallel network stack within the operating system has implications well beyond just the operating system. This study showed that hardware
support is needed to overcome the bottleneck imposed by the serialized interface exported by a single high-bandwidth NIC. To efficiently support a connection-parallel
network stack, the network interface card would first require parallel packet queues
so that multiple threads could access the NIC at the same time, without synchronization. Second, the NIC would require some form of packet classification that can map
incoming packets to specific connections (or connection groups) and then place them
in a specific packet queue on-board the NIC associated with that connection. With
the additional capability to fire a separate interrupt for each separate queue, it would
be possible to closely mimic the behavior of the parallel-NIC prototype evaluated in
this work, in which packets for a specific connection “queue” (a NIC in this case) are
persistently mapped to that same queue.
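A minimal sketch of the classification step such a NIC would need is shown below: hash the packet's connection identity so that every packet of a connection lands in the same on-board queue. The tuple fields, queue count, and hash are illustrative assumptions; production NICs typically use a configurable hash such as Toeplitz.

#include <stdint.h>

#define NUM_NIC_QUEUES 8u        /* illustrative number of packet queues */

/* Connection identity extracted from an incoming TCP/IP packet header. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Map a packet to a queue (and hence a connection group and interrupt)
 * so that all packets of one connection go to the same queue. */
uint32_t classify_to_queue(const struct flow_key *k)
{
    uint32_t h = k->src_ip ^ k->dst_ip ^
                 ((uint32_t)k->src_port << 16) ^ k->dst_port;
    h ^= h >> 16;                /* cheap mixing step */
    return h % NUM_NIC_QUEUES;
}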
Even with such hardware support, additional challenges in both the hardware
and software remain, including support for load balancing. In the experiments evaluated in this dissertation, the load was purposefully spread evenly across the separate
connections and their groups, and a connection always hashed to the same group.
However, a static hash mechanism may lead to undesirable overload conditions for
only a subset of connection groups, leading to under-utilization in lightly used groups.
In this case, it would be desirable to migrate busy connections to lightly-loaded connection groups. If the hardware is responsible for mapping packets to connections,
though, then clearly the hardware must participate in this scheme. One can imagine several possibilities for providing this support, including full control by hardware
(where the hardware attempts to detect the overload condition and notifies the OS
of a migration), full control by software (where the software detects overload and
migrates specific connections by notifying the NIC), or something in between, where
the software provides hints to the hardware about future possibilities for migration.
Regardless, the issue of load-balancing across multiple queues will be a critical area
for maintaining high performance with connection-parallel network stacks that have
hardware support.
6.2 Reducing virtualization overhead using a hybrid hardware/software approach
Whereas OS support for thread-parallelism incurs performance-damaging inefficiencies, contemporary software-based techniques for providing shared access to
an I/O device also incur severe performance overheads. Though the contemporary
software-based virtualization architecture supports a variety of hardware, the hypervisor and driver domain consume as much as 70% of the execution time during network
transfers. This dissertation introduces the novel CDNA I/O virtualization architecture, which is a hybrid hardware/software approach to providing safe, shared access
to a single network interface. CDNA uses hardware to perform traffic multiplexing,
a combination of hardware and software to facilitate event notification from the I/O
device to a particular virtual machine, and a combination of hardware and software
to enforce isolation of DMA requests initiated by each untrusted virtual machine.
This study demonstrates that a hybrid hardware/software approach is both economical and effective. The CDNA prototype device required about 12 MB of onboard storage and used a 300 MHz embedded processor, which is about the same as
modern network interface cards. Using these resources, the CDNA architecture improved transmit and receive performance for concurrent virtual machines by factors
of 2.1 and 3.3, respectively, over a standard software-based shared I/O virtualization
architecture. And whereas a purely hardware-based approach could require costly
memory-registration operations or system-level hardware modifications to enforce
DMA memory protection, the lightweight, software-based DMA memory protection
strategy introduced in this research incurs relatively little overhead and requires no
system-level modifications.
Moving traffic multiplexing to the hardware proved to be the biggest source of
performance and efficiency improvement in the CDNA architecture. By not forcing
the VMM to inspect, demultiplex, and page-flip data between virtual machines, the
relatively simple hardware of the CDNA prototype dramatically reduced total I/O
virtualization overhead. This reduction occurs despite introducing the new overhead
of software-based DMA memory protection for supporting direct I/O access.
Beyond the performance and efficiency issues explored in this study, the CDNA
architecture presents new opportunities for I/O virtualization research, including generalization to other devices and challenges not related to performance. As presented,
the CDNA device is a prototype. Though there is nothing specific about the prototype
or its architecture that prevents it from being adapted to other types of devices (such
as graphics cards or disk controllers), actually doing this generalization remains an
area for future exploration. Clearly the performance and efficiency benefits demonstrated for network I/O could prove advantageous for other types of I/O, though
the percentage of impact would depend on the importance of I/O for any particular
workload. Generalizing the CDNA interface would require development of a general
software interface for communicating DMA updates to the virtual machine monitor,
since the prototype’s method is actually based on updates to the NIC-specific DMA
descriptor structure. Further, generalization would require a method to generically,
concisely describe the control region (and mechanisms) for any particular device so
that the VMM maintains control of actually enqueueing DMA descriptors, as is required by the software-based DMA memory protection method.
Another area open for exploration with the CDNA architecture is that of providing quality of service guarantees, including support for customizable service allocation
and prioritization. For example, it would be advantageous to be able to guarantee
that a high-priority virtual machine received performance according to its application
needs (which could be either high bandwidth, or low latency, or both). Traditional
software-based I/O virtualization allows fine-grained centralized load balancing, because the VMM actively controls the flow of I/O into and out of the device. With
direct-access, hardware-shared devices such as the CDNA architecture, however, the
hardware ultimately determines the order and priority that concurrent requests are
processed. Thus, the hardware must have some mechanism for implementing the desired quality-of-service policy. Furthermore, the hardware must support some mechanism for the VMM to communicate the desired policy to the hardware. Finally, in
cases when the device supports unsolicited I/O (such as a network interface, which
receives packets from a network), it would be advantageous for the hardware to track
usage statistics and report them to the VMM so that it could make better decisions
that might avoid I/O failure (e.g., packet loss).
6.3 Improving performance and efficiency of protection strategies for direct-access I/O
CDNA’s performance and efficiency gains versus software-based virtualization illustrate the effectiveness of direct I/O access by untrusted virtual machines. Though
direct I/O access overcomes performance penalties, it requires new protection strategies to prevent the guest operating systems from directing the device to violate memory protection.
This dissertation has evaluated a variety of DMA memory protection strategies
for direct access to I/O devices within virtual machine monitors. As others have
noted, overhead for managing DMA memory protection using an IOMMU in a virtualized environment can noticeably degrade network I/O performance, ultimately affecting application throughput. Even with the novel IOMMU-based strategies aimed
at reducing this overhead by reusing installed mappings in the IOMMU hardware,
there remains a nonzero overhead that reduces throughput. However, this research
has shown that reuse-based strategies are effective at reducing overhead relative to
the state-of-the-art, single-use strategy. Furthermore, this research shows that the
software-based implementation for providing DMA memory protection introduced
with the CDNA architecture can deliver performance and efficiency comparable to
the most aggressive reuse-based strategies. These results held true across a wide array of TCP-based applications with different I/O demand characteristics, yielding several key insights.
This research also explored the differences in the level of protection offered by
different strategies and the level of efficiency gained through reuse. All of the strategies (single-use, shared, persistent, and software-based) explored in this study provide
equivalent protection between guest operating systems when those guest operating
systems are sharing a single device and have direct access. Further, all of these techniques prevent a guest operating system from directing the device to access memory
that does not belong to that guest. The traditional single-use strategy, however, provides this protection at the greatest cost, consuming from 6–26% of the CPU. This
cost can be reduced by reusing IOMMU mappings. Multiple concurrent network
transmit operations are typically able to share the same mappings 20–40% of the
time, yielding small performance improvements. However, due to Xen’s I/O architecture, network receive operations are usually unable to share mappings. In contrast,
even with a small pool of persistent IOMMU mappings, reuse approaches 100% in
almost all cases, reducing the overhead of protection to only 2–13% of the CPU. Finally, the software-based protection strategy performs comparably to the best of the
IOMMU-based strategies, consuming only 3–15% of the CPU for protection.
After comparing the performance and protection offered by hardware- and software-based DMA protection strategies, an IOMMU proves to provide surprisingly limited
benefits beyond what is possible with software. This finding comes despite industrial
enthusiasm for deploying IOMMU hardware in the next generation of commodity systems. As these new systems arrive, a new comparison that uses an actual IOMMU
(rather than the GART-modeled IOMMU in this study) may be illustrative so as
to quantify the performance of “direct-map” IOMMU-based protection strategies.
In such a strategy, the entire physical memory space of a given virtual machine is
mapped (usually once) by the IOMMU and remains mapped for the lifetime of the
virtual machine. Such a strategy should not impose the remapping overhead or reuselookup overheads of the strategies explored in this dissertation, but they should not
perform any better than the “no-protection” case, either. Hence, it is unlikely that
even future, improved IOMMU-based designs will offer significantly better performance than the best-performing strategies explored in this dissertation, which came
within just a few percent of native performance.
Though it is possible to achieve near-native performance with an optimized, reuse-based protection strategy, the results in this study also show that inefficient use of
hardware structures designed to reduce the burden of software (such as an IOMMU)
can in fact significantly degrade performance. Hence, this dissertation illustrates a
warning to system architects who would use hardware to solve an architectural problem without consideration of the software overhead. The availability of an IOMMU
does not significantly improve performance unless one compares against a naive,
worst-case implementation of DMA memory protection. This underscores the need
for software architects to work closely with hardware architects to solve problems
such as DMA memory protection and, in general, I/O virtualization.
6.4 Summary
This dissertation explored the performance and efficiency of server concurrency at
both the OS and VMM levels and introduced several hybrid hardware/software techniques that strategically use hardware to improve software efficiency and performance.
By changing the way the operating system uses its parallel processors to facilitate I/O
processing, by changing the responsibilities of I/O devices to more efficiently integrate
with a virtualized environment, and by strategically using memory protection hardware to reduce the total cost of using that hardware, these techniques each modify
the overall hardware/software system architecture. The OS and VMM architectures
introduced and explored in this dissertation provide a scalable means to deliver high-performance, efficient I/O for contemporary and future commodity servers. As servers
continue to support more and more cores on a chip, thread concurrency and VMM
concurrency will be increasingly critical for system integrators facing performance
challenges for high-bandwidth applications and efficiency challenges for system consolidation in the datacenter. The techniques explored in this dissertation rely on a
synthesis of software orchestration, parallel computation resources (in the OS and
among virtual machines), and lightweight, efficient interfaces with hardware that support the desired level of concurrency. Given the cost and performance advantages
that were repeatedly found using the hybrid hardware/software approach explored
in this dissertation, this approach should be a guiding principle for hardware and
software architects facing the future challenges of concurrent server architecture.