Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
VIRTUALIZING IO THROUGH THE IO MEMORY MANAGEMENT UNIT (IOMMU) ANDY KEGEL, PAUL BLINZER, ARKA BASU, MAGGIE CHAN ASPLOS 2016 WHAT THIS TUTORIAL WILL AND WILL NOT COVER Definition of “IO” or “Device” or “IO Device” : ‒ Traditional IO includes GPU for graphics, NIC, storage controller, USB controller, etc. ‒ New IO (accelerators) includes general-purpose computation on a GPU (GPGPU), encryption accelerators, digital signal processors, etc. Two Parts in Virtualizing an IO Device ‒ Device specific: Virtual instances of device ‒ Virtual functions and Physical function in devices (PCIE® SR-IOV, MR-IOV) ‒ System defined: IO Memory Management Unit or IOMMU ‒ Virtualizing DMA accesses (Address Translation and Protection) ‒ Virtualizing Interrupts (Interrupt Remapping and Virtualizing) 2 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 WHAT THIS TUTORIAL WILL AND WILL NOT COVER Definition of “IO” or “Device” or “IO Device” : ‒ Traditional IO includes GPU for graphics, NIC, storage controller, USB controller, etc. ‒ New IO (accelerators) includes general-purpose computation on a GPU (GPGPU), encryption accelerators, digital signal processors, etc. Two Parts in Virtualizing an IO Device ‒ Device specific: Virtual instances of device ‒ Virtual functions and Physical function in devices (PCIE® SR-IOV, MR-IOV) ‒ System defined: IO Memory Management Unit or IOMMU ‒ Virtualizing DMA accesses (Address Translation and Protection) ‒ Virtualizing Interrupts (Interrupt Remapping and Virtualizing) 3 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Focus AGENDA MOTIVATION & INTRODUCTION USE CASES & DEMOSTRATION What is IOMMU? -- Andy Kegel Where can IOMMU help? -- Paul Blinzer INTERNALS How does IOMMU work? -- Arka Basu, Maggie Chan RESEARCH Research Opportunities and Discussion – Arka Basu 4 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Core Core MMU MMU Memory 5 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Core Core MMU MMU Virtual Addresses Memory 6 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Core Virtual Addresses Core Protection Check MMU MMU Physical Addresses Memory 7 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Device Driver Core Virtual Addresses Core Protection Check MMU MMU Physical Addresses Memory 8 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Setup Device Driver Core Virtual Addresses Core Protection Check MMU MMU Physical Addresses Memory 9 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Setup Device Driver Core Virtual Addresses Core IO Device IO Device Protection Check MMU MMU Physical Addresses Memory 10 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Physical Addresses DMA Request MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Setup Device Driver Core Virtual Addresses Core IO Device IO Device Protection Check MMU MMU Physical Addresses Wrong location Memory 11 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Physical Addresses DMA Request MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Setup Device Driver Core Virtual Addresses Core IO Device IO Device Protection Check MMU MMU Physical Addresses DMA Request Physical Addresses Wrong location Memory 12 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 No protection from malicious devices --> “DMA Attack” (e.g., FinSpy) No protection from buggy device driver Side channel attack – leak information MOTIVATION: TRADITIONAL DMA BY IO NO SYSTEM VIRTUALIZATION Setup Device Driver Core Virtual Addresses Core IO Device IO Device Protection Check MMU MMU Physical Addresses DMA Request Physical Addresses Wrong location Memory 13 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 No protection from malicious devices --> “DMA Attack” (e.g., FinSpy) No protection from buggy device driver Side channel attack – leak information Needs hardware enforced memory protection MOTIVATION: VIRTUAL MACHINES ARE TRENDING Tremendous growth in virtualization in server Efficient access to IO under virtualization is important Source: IDC Server Virtualization, MCS 2012 14 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 BACKGROUND: TRANSLATIONS IN VIRTUALIZED SYSTEM Guest OS 0 Guest OS 1 Hypervisor (a.k.a. VMM) Hardware – CPU, Memory, IO 15 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 BACKGROUND: TRANSLATIONS IN VIRTUALIZED SYSTEM Guest Applications Guest Applications Guest Virtual Address (GVA) Guest OS 0 Guest OS 1 Hypervisor (a.k.a. VMM) Hardware – CPU, Memory, IO 16 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 BACKGROUND: TRANSLATIONS IN VIRTUALIZED SYSTEM Guest Applications Guest Applications Guest Virtual Address (GVA) Guest OS 0 Guest OS 1 Managed by Guest OS Guest Physical Address (GPA) Hypervisor (a.k.a. VMM) Hardware – CPU, Memory, IO 17 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 BACKGROUND: TRANSLATIONS IN VIRTUALIZED SYSTEM Guest Applications Guest Applications Guest Virtual Address (GVA) Guest OS 0 Guest OS 1 Managed by Guest OS Guest Physical Address (GPA) Hypervisor (a.k.a. VMM) System Physical Address(SPA) Hardware – CPU, Memory, IO 18 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Managed by VMM BACKGROUND: TRANSLATIONS IN VIRTUALIZED SYSTEM Guest Applications Guest Applications Guest Virtual Address (GVA) Guest OS 0 Guest OS 1 Managed by Guest OS Guest Physical Address (GPA) Hypervisor (a.k.a. VMM) System Physical Address(SPA) Managed by VMM Hardware – CPU, Memory, IO Isolation across Guest OS => No access to (system) physical address from Guest OS 19 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 MOTIVATION: TRADITIONAL DMA IN VIRTUAL MACHINES VIRTUALIZED SYSTEM Core Core MMU MMU IO Device IO Device Memory 20 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *SPA == “Physical Address” MOTIVATION: TRADITIONAL DMA IN VIRTUAL MACHINES VIRTUALIZED SYSTEM Guest OS 0 Guest OS 1 GVA Core Core IO Device IO Device GPA VMM MMU MMU SPA Memory 21 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *SPA == “Physical Address” MOTIVATION: TRADITIONAL DMA IN VIRTUAL MACHINES VIRTUALIZED SYSTEM Device Driver Guest OS 0 Guest OS 1 GVA Core Core Setup IO Device No access to Physical Address IO Device GPA VMM MMU MMU SPA Memory 22 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *SPA == “Physical Address” MOTIVATION: TRADITIONAL DMA IN VIRTUAL MACHINES VIRTUALIZED SYSTEM Device Driver Guest OS 0 Guest OS 1 GVA Core Core Setup IO Device IO Device GPA VMM MMU Setup MMU SPA Every DMA operation mediated by VMM Memory 23 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *SPA == “Physical Address” MOTIVATION: TRADITIONAL DMA IN VIRTUAL MACHINES VIRTUALIZED SYSTEM Device Driver Guest OS 0 Guest OS 1 GVA Core Core Setup IO Device IO Device GPA VMM MMU Setup MMU SPA Physical Addresses DMA Operation Every DMA operation mediated by VMM Memory 24 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *SPA == “Physical Address” MOTIVATION: TRADITIONAL DMA IN VIRTUAL MACHINES VIRTUALIZED SYSTEM Device Driver Guest OS 0 Guest OS 1 GVA Core Core Setup IO Device IO Device GPA VMM MMU Setup MMU SPA Physical Addresses DMA Operation Every DMA operation mediated by VMM Often ~30% performance overhead Memory 25 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *SPA == “Physical Address” MOTIVATION: TRADITIONAL DMA IN VIRTUAL MACHINES VIRTUALIZED SYSTEM Device Driver Guest OS 0 Guest OS 1 GVA Core Core Setup IO Device IO Device GPA VMM MMU Setup MMU SPA Physical Addresses DMA Operation Every DMA operation mediated by VMM Often ~30% performance overhead Memory 26 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Virtual address translation for DMA *SPA == “Physical Address” INTRODUCTION OF IOMMU: THE LOGICAL VIEW Core Core MMU MMU Memory 27 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device INTRODUCTION OF IOMMU: THE LOGICAL VIEW IOMMU Driver Core MMU Sets up IOMMU hardware Core MMU Memory 28 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device IOMMU Hardware that intercepts DMA transactions Key capabilities: 1. Memory protection for DMA 2. Virtual address translation for DMA MOTIVATION: TRADITIONAL IO INTERRUPT NON-VIRTUALIZED SYSTEM Core Core MMU MMU Memory 29 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: TRADITIONAL IO INTERRUPT NON-VIRTUALIZED SYSTEM Setup Device Driver Core Core MMU MMU Memory 30 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IRQ # + Core id IO Device MOTIVATION: TRADITIONAL IO INTERRUPT NON-VIRTUALIZED SYSTEM Setup Device Driver Core Core APIC APIC MMU MMU IO Device IRQ # Memory 31 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IRQ # + Core id IO Device MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS 0 Core Core VMM MMU MMU Memory 32 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS 0 Core Core VMM MMU IO Device Setup MMU Memory 33 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IRQ # + Core i MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS migration Guest OS 0 Core Core VMM MMU IO Device Setup MMU Memory 34 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IRQ # + Core i MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS migration Guest OS 0 Core Core VMM MMU IO Device Setup MMU Memory 35 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IRQ # + Core i MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS migration Guest OS 0 Core Core VMM MMU IO Device Setup MMU Inter-Process Interrupt Memory 36 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IRQ # + Core i MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS migration Guest OS 0 Core Core VMM MMU IO Device IO Device IRQ # + Core i Setup MMU Inter-Process Interrupt Extraneous IPI adds overheads => Each extra interrupt can add 5-10K cycles Memory 37 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Needs dynamic remapping of interrupts MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS 0 Core Core VMM MMU MMU Memory 38 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS 0 Core Core VMM MMU IO Device IO Device IRQ # + Core i Setup MMU Performance overheads VMM exits on each interrupt Memory 39 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS de-scheduled Core Core VMM MMU IO Device IO Device IRQ # + Core i Setup MMU Performance overheads VMM exits on each interrupt Memory 40 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS de-scheduled Core Core VMM MMU IO Device IO Device IRQ # + Core i Setup MMU Performance overheads VMM exits on each interrupt Unnecessary VMM wakeup Memory 41 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 MOTIVATION: TRADITIONAL IO INTERRUPT VIRTUALIZED SYSTEM Guest OS de-scheduled Core Core VMM MMU IO Device IO Device IRQ # + Core i Setup MMU Performance overheads VMM exits on each interrupt Unnecessary VMM wakeup Memory 42 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Need to virtualize interrupt: Direct interrupt delivery to guest OS and temporary queueing INTRODUCTION OF IOMMU: THE LOGICAL VIEW ADDING INTERRUPT HANDLING CAPABILITY IOMMU Driver Core MMU Sets up IOMMU hardware Core MMU IO Device IO Device IOMMU Hardware that intercepts DMA transactions Key capabilities: 1. Memory protection for DMA 2. Virtual address translation for DMA Memory 43 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 INTRODUCTION OF IOMMU: THE LOGICAL VIEW ADDING INTERRUPT HANDLING CAPABILITY IOMMU Driver Core MMU Sets up IOMMU hardware Core MMU Memory 44 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device IOMMU Hardware that intercepts DMA transactions and interrupts Key capabilities: 1. Memory protection for DMA 2. Virtual address translation for DMA 3. Interrupt remapping and virtualization MOTIVATION: EMERGENCE OF HETEROGENEOUS SYSTEMS HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) Core Core MMU MMU Memory 45 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device MOTIVATION: EMERGENCE OF HETEROGENEOUS SYSTEMS HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) Core Core MMU MMU Memory 46 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 GPU IO Device MOTIVATION: EMERGENCE OF HETEROGENEOUS SYSTEMS HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) Shared virtual addressing is key to ease of programming Core Core MMU MMU Memory 47 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 GPU IO Device MOTIVATION: EMERGENCE OF HETEROGENEOUS SYSTEMS HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) Shared virtual addressing is key to ease of programming Core MMU IO Device MMU Memory 48 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 MOTIVATION: EMERGENCE OF HETEROGENEOUS SYSTEMS HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) Shared virtual addressing is key to ease of programming Core IO Device VA0 MMU VA0 MMU “Pointer-is-a -Pointer” across CPU and devices Memory 49 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 MOTIVATION: EMERGENCE OF HETEROGENEOUS SYSTEMS HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) Shared virtual addressing is key to ease of programming Core IO Device VA0 MMU VA0 MMU “Pointer-is-a -Pointer” across CPU and devices Memory IO needs to share CPU page table* *Data Structure that keeps VA to PA mapping 50 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 INTRODUCTION OF IOMMU: THE LOGICAL VIEW ADDING ABILITY TO SHARE ADDRESS SPACE IN HETEROGENEOUS SYSTEM IOMMU Driver Core MMU Sets up IOMMU hardware Core MMU Memory 51 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device IOMMU Hardware that intercepts DMA transactions and interrupts Key capabilities: 1. Memory protection for DMA 2. Virtual address translation for DMA 3. Interrupt remapping and virtualization INTRODUCTION OF IOMMU: THE LOGICAL VIEW ADDING ABILITY TO SHARE ADDRESS SPACE IN HETEROGENEOUS SYSTEM IOMMU Driver Core Sets up IOMMU hardware Core MMU MMU Memory IO Device IO Device IOMMU Hardware that intercepts DMA transactions and interrupts Key capabilities: 1. Memory protection for DMA 2. Virtual address translation for DMA 3. Interrupt remapping and virtualization 4. IO can share CPU page tables 52 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 INTRODUCTION OF IOMMU: (TYPICAL) PHYSICAL VIEW IOMMU IS PART OF PROCESSOR COMPLEX MMU MMU Memory Controller Memory 53 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Root Complex/ “IOHUB” IO Device Interconnect Core IOMMU Core IO Device IO Device Processor /Chip IOMMU FROM THE PERSPECTIVE OF DEVICE (PCIE® SPEC) Memory Translation Agent Addr. Translation and Protection Table Root Complex (RC) Root Integrated ATC Endpoint Device ATC – Address Translation Cache 54 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Root Port Root Port Switch ATC Device ATC Device IOMMU FROM THE PERSPECTIVE OF DEVICE (PCIE® SPEC) IOMMU Translation Agent and uses the Address Translation and Protection Table Memory Translation Agent Addr. Translation and Protection Table Root Complex (RC) Root Integrated ATC Endpoint Device ATC – Address Translation Cache 55 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Root Port Root Port Switch ATC Device ATC Device COMPARING CPU MMU AND IOMMU CPU MMU IOMMU Address Translation VA PA and GVA GPA SPA Memory Protection Read/Write etc. Read/Write etc. Interrupt Handling No Remapping and Virtualization Support Parallelism Mostly Single Threaded Highly Multithreaded Page Faults, Events, etc. Synchronous Handling Asynchronous Handling 56 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 VA PA and GVA GPA SPA HISTORY A SIMPLIFIED VIEW V1, c. 2004 Technology created to translate and vet memory accesses by peripherals, replacing software V1.2, c. 2006 Interrupt remapping added for IO virtualization V2, c. 2008 Nested paging, interrupt virtualization, and improved management features added V3, c. 2010 Features added for full heterogeneous computing and further efficiencies Whither next? 57 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU TECHNOLOGY FAMILIES REFERENCES AMD IOMMU® IO Memory Management Unit Intel VT-d® Virtualization Technology for Directed IO ARM SMM® System Memory Management Unit IBM CAPI® 58 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Coherent Accelerator Processor Interface AGENDA MOTIVATION & INTRODUCTION USE CASES & DEMOSTRATION What is IOMMU? Where can IOMMU help? INTERNALS How does IOMMU work? RESEARCH Research Opportunities and Tools 60 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 FIVE USE CASES OF IOMMU LEGACY I/O Supporting legacy devices – Extending DMA “beyond reach” SECURITY AND PROTECTION Preventing uncontrolled memory access SECURE BOOT Enforcing secure boot DIRECT I/O DEVICES HETEROGENEOUS COMPUTING 61 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Secure and efficient IO from Guest OS Enabling shared virtual memory SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Many 32-bit DMA devices operate in a 64-bit system Physical Memory ‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … 232-1 Device 0 62 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Many 32-bit DMA devices operate in a 64-bit system ‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … Physical Memory 264-1 232-1 Device 0 63 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Many 32-bit DMA devices operate in a 64-bit system ‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … Physical Memory 264-1 SW Solution: Bounce buffers ‒ Device does DMA to a region in 32bit physical address, CPU copies data from buffer to the final destination 232-1 Device 0 64 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Many 32-bit DMA devices operate in a 64-bit system ‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … Physical Memory 264-1 SW Solution: Bounce buffers ‒ Device does DMA to a region in 32bit physical address, CPU copies data from buffer to the final destination 232-1 Device 0 65 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Many 32-bit DMA devices operate in a 64-bit system ‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … Physical Memory 264-1 SW Solution: Bounce buffers ‒ Device does DMA to a region in 32bit physical address, CPU copies data from buffer to the final destination 232-1 Device 0 66 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Many 32-bit DMA devices operate in a 64-bit system ‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … Physical Memory 264-1 SW Solution: Bounce buffers ‒ Device does DMA to a region in 32bit physical address, CPU copies data from buffer to the final destination 232-1 Device 0 67 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CPU SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Many 32-bit DMA devices operate in a 64-bit system ‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … Physical Memory 264-1 SW Solution: Bounce buffers ‒ Device does DMA to a region in 32bit physical address, CPU copies data from buffer to the final destination 232-1 Device 0 68 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CPU SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Many 32-bit DMA devices operate in a 64-bit system ‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … Physical Memory 264-1 SW Solution: Bounce buffers ‒ Device does DMA to a region in 32bit physical address, CPU copies data from buffer to the final destination ‒ Slow, needs SW synchronization, ties up CPU core 232-1 Device 0 69 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CPU SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Physical Memory Many 32bit DMA devices operate in a 64bit system ‒ older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … IOMMU Translation Device 264-1 232-1 0x01020304 -> 0x208090A0B0C 0 70 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Physical Memory Many 32bit DMA devices operate in a 64bit system ‒ older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … 264-1 Better solution: IOMMU remaps 32bit device physical address to system physical address beyond 32bit IOMMU Translation Device 232-1 0x01020304 -> 0x208090A0B0C 0 71 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Physical Memory Many 32bit DMA devices operate in a 64bit system ‒ older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … 264-1 Better solution: IOMMU remaps 32bit device physical address to system physical address beyond 32bit IOMMU Translation Device 232-1 0x01020304 -> 0x208090A0B0C 0 72 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Physical Memory Many 32bit DMA devices operate in a 64bit system ‒ older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … 264-1 Better solution: IOMMU remaps 32bit device physical address to system physical address beyond 32bit IOMMU Translation Device 232-1 0x01020304 -> 0x208090A0B0C 0 73 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Physical Memory Many 32bit DMA devices operate in a 64bit system ‒ older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … 264-1 Better solution: IOMMU remaps 32bit device physical address to system physical address beyond 32bit IOMMU Translation Device 232-1 0x01020304 -> 0x208090A0B0C 0 74 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Physical Memory Many 32bit DMA devices operate in a 64bit system ‒ older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … 264-1 Better solution: IOMMU remaps 32bit device physical address to system physical address beyond 32bit ‒ DMA goes directly into 64bit memory ‒ No CPU transfer ‒ More efficient IOMMU Translation Device 232-1 0x01020304 -> 0x208090A0B0C 0 75 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUPPORTING LEGACY DEVICES HOW CAN AN IOMMU HELP? Physical Memory Many 32bit DMA devices operate in a 64bit system ‒ older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), … 264-1 Better solution: IOMMU remaps 32bit device physical address to system physical address beyond 32bit ‒ DMA goes directly into 64bit memory ‒ No CPU transfer ‒ More efficient Linux: DMA redirect feature IOMMU Translation Device 232-1 0x01020304 -> 0x208090A0B0C 0 76 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU USECASE: SECURITY AND PROTECTION SECURE BOOT SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE Physical Memory DMA devices use physical addresses on the system bus to read and write memory based on SW driver or OS instructions Passwords, Critical data I/O buffer Device 78 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE Physical Memory DMA devices use physical addresses on the system bus to read and write memory based on SW driver or OS instructions Passwords, Critical data I/O buffer Device 79 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE Physical Memory DMA devices use physical addresses on the system bus to read and write memory based on SW driver or OS instructions SW bugs or attacks by malicious applications could access and modify important OS data (OS security policy, passwords,…) ‒ Without OS able to detect or prevent the access as it can for CPU ‒ Latent problem until it shows unexpectedly possibly much later Passwords, Critical data I/O buffer Device 80 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE Physical Memory DMA devices use physical addresses on the system bus to read and write memory based on SW driver or OS instructions SW bugs or attacks by malicious applications could access and modify important OS data (OS security policy, passwords,…) ‒ Without OS able to detect or prevent the access as it can for CPU ‒ Latent problem until it shows unexpectedly possibly much later Passwords, Critical data I/O buffer Device 81 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE Physical Memory DMA devices use physical addresses on the system bus to read and write memory based on SW driver or OS instructions SW bugs or attacks by malicious applications could access and modify important OS data (OS security policy, passwords,…) ‒ Without OS able to detect or prevent the access as it can for CPU ‒ Latent problem until it shows unexpectedly possibly much later This affects system stability, if just the right data is hit ‒ “Heisenbugs” are sometimes caused by bugs in system drivers Or it allows malicious driver attacks to take over the system Passwords, Critical data I/O buffer Device 82 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE DMA devices assert physical addresses on the system bus to read and write memory based on SW driver or OS settings Physical Memory SW bugs or attacks by malicious applications could access and modify important data (OS security policy, passwords,…) X OK Device 83 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Range check Passwords, critical data I/O buffer SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE DMA devices assert physical addresses on the system bus to read and write memory based on SW driver or OS settings Physical Memory SW bugs or attacks by malicious applications could access and modify important data (OS security policy, passwords,…) The IOMMU allows OS to enforce DMA access policy for any DMA capable device accessing physical memory ‒ Memory state important to stability/security ‒ If access occurs, OS gets notified and can shut the device & driver down and notifies the user or administrator X OK Device 84 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Range check Passwords, critical data I/O buffer SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE DMA devices assert physical addresses on the system bus to read and write memory based on SW driver or OS settings Physical Memory SW bugs or attacks by malicious applications could access and modify important data (OS security policy, passwords,…) The IOMMU allows OS to enforce DMA access policy for any DMA capable device accessing physical memory ‒ Memory state important to stability/security ‒ If access occurs, OS gets notified and can shut the device & driver down and notifies the user or administrator X OK Device 85 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Range check Passwords, critical data I/O buffer SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE DMA devices assert physical addresses on the system bus to read and write memory based on SW driver or OS settings Physical Memory SW bugs or attacks by malicious applications could access and modify important data (OS security policy, passwords,…) The IOMMU allows OS to enforce DMA access policy for any DMA capable device accessing physical memory ‒ Memory state important to stability/security ‒ If access occurs, OS gets notified and can shut the device & driver down and notifies the user or administrator X OK Device 86 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Range check Passwords, critical data I/O buffer SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE DMA devices assert physical addresses on the system bus to read and write memory based on SW driver or OS settings Physical Memory SW bugs or attacks by malicious applications could access and modify important data (OS security policy, passwords,…) The IOMMU allows OS to enforce DMA access policy for any DMA capable device accessing physical memory ‒ Memory state important to stability/security ‒ If access occurs, OS gets notified and can shut the device & driver down and notifies the user or administrator X OK Device 87 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Range check Passwords, critical data I/O buffer SECURITY AND PROTECTION THE TRADITIONAL IOMMU USE DMA devices assert physical addresses on the system bus to read and write memory based on SW driver or OS settings Physical Memory SW bugs or attacks by malicious applications could access and modify important data (OS security policy, passwords,…) The IOMMU allows OS to enforce DMA access policy for any DMA capable device accessing physical memory ‒ Memory state important to stability/security ‒ If access occurs, OS gets notified and can shut the device & driver down and notifies the user or administrator X OK Device 88 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Range check Passwords, critical data I/O buffer SECURE BOOT YET ANOTHER USE FOR AN IOMMU Ensuring that a system is not doing more than it’s supposed to ‒ e.g., being part of a botnet, provide banking data or other personal info to impersonators or other attackers ‒ The earliest time for attack and defense is at firmware startup ‒ From there critical memory regions are protected from invalid access UEFI Firmware OS Bootloader OS kernel drivers Application 89 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SECURE BOOT YET ANOTHER USE FOR AN IOMMU Ensuring that a system is not doing more than it’s supposed to ‒ e.g., being part of a botnet, provide banking data or other personal info to impersonators or other attackers ‒ The earliest time for attack and defense is at firmware startup ‒ From there critical memory regions are protected from invalid access The Secure Boot architecture ensures that no non-vetted OS kernel code runs on the system, changing critical settings UEFI Firmware OS Bootloader OS kernel drivers Application 90 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SECURE BOOT YET ANOTHER USE FOR AN IOMMU Ensuring that a system is not doing more than it’s supposed to ‒ e.g., being part of a botnet, provide banking data or other personal info to impersonators or other attackers ‒ The earliest time for attack and defense is at firmware startup ‒ From there critical memory regions are protected from invalid access The Secure Boot architecture ensures that no non-vetted OS kernel code runs on the system, changing critical settings Some I/O devices can issue DMA requests to system memory directly, without OS or Firmware intervention ‒ e.g.,1394/Firewire, network cards, as part of network boot ‒ That allows attacks to modify memory before even the OS has a chance to protect against the attacks UEFI Firmware OS Bootloader OS kernel drivers Application 91 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SECURE BOOT YET ANOTHER USE FOR AN IOMMU Ensuring that a system is not doing more than it’s supposed to ‒ e.g., being part of a botnet, provide banking data or other personal info to impersonators or other attackers ‒ The earliest time for attack and defense is at firmware startup ‒ From there critical memory regions are protected from invalid access The Secure Boot architecture ensures that no non-vetted OS kernel code runs on the system, changing critical settings Some I/O devices can issue DMA requests to system memory directly, without OS or Firmware intervention ‒ e.g.,1394/Firewire, network cards, as part of network boot ‒ That allows attacks to modify memory before even the OS has a chance to protect against the attacks As outlined earlier, using the IOMMU prevents DMA access to important memory regions 92 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 UEFI Firmware OS Bootloader OS kernel drivers Application IOMMU USECASE: EFFICIENT IO IN VIRTUALIZED ENVIRONMENT BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION) Core Core MMU MMU Memory 94 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION) Core Core MMU MMU Virtual Addresses Memory 95 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION) Core Core Virtual Addresses IO Device Protection Check MMU MMU Physical Addresses Memory 96 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION) Device Driver Core Core Virtual Addresses IO Device Protection Check MMU MMU Physical Addresses Memory 97 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION) Setup Device Driver Core Core Virtual Addresses IO Device Protection Check MMU MMU Physical Addresses Memory 98 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION) Setup Device Driver Core Core Virtual Addresses IO Device IO Device Protection Check MMU MMU Physical Addresses Memory 99 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Physical Addresses DMA Request BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION) Setup Device Driver Core Core Virtual Addresses IO Device IO Device Protection Check MMU MMU Physical Addresses Memory 100 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Physical Addresses DMA Request BACKGROUND: TRADITIONAL DMA BY IO (NO SYSTEM VIRTUALIZATION) Setup Device Driver Core Core Virtual Addresses IO Device IO Device Protection Check MMU MMU Physical Addresses DMA Request Physical Addresses Memory 101 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Device drivers must program the true system physical memory address No protection from SW or hardware bugs in I/O devices and drivers system crash by writing wrong memory No protection from potentially malicious driver or system SW attacks VIRTUALIZATION OF A SYSTEM IN SOFTWARE IT HAS TO LOOK REAL TO AN OPERATING SYSTEM Each OS assumes full access to the platform hardware ‒ Memory, Interrupts, Devices, CPU cores, etc. 102 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 VIRTUALIZATION OF A SYSTEM IN SOFTWARE IT HAS TO LOOK REAL TO AN OPERATING SYSTEM Each OS assumes full access to the platform hardware ‒ Memory, Interrupts, Devices, CPU cores, etc. A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage the physical hardware and define a “virtual machine” (VM) that represents the resources an OS expects to find in the system 103 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 VIRTUALIZATION OF A SYSTEM IN SOFTWARE IT HAS TO LOOK REAL TO AN OPERATING SYSTEM Each OS assumes full access to the platform hardware ‒ Memory, Interrupts, Devices, CPU cores, etc. A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage the physical hardware and define a “virtual machine” (VM) that represents the resources an OS expects to find in the system Hypervisor VMM 104 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Virtual Machine1 Virtual Machine2 Virtual Machine3 Operating System1 Operating System2 Operating System3 Application Application Application Application Application Application Application Application Application VIRTUALIZATION OF A SYSTEM IN SOFTWARE IT HAS TO LOOK REAL TO AN OPERATING SYSTEM Each OS assumes full access to the platform hardware ‒ Memory, Interrupts, Devices, CPU cores, etc. A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage the physical hardware and define a “virtual machine” (VM) that represents the resources an OS expects to find in the system Hypervisor VMM Use cases: 105 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Virtual Machine1 Virtual Machine2 Virtual Machine3 Operating System1 Operating System2 Operating System3 Application Application Application Application Application Application Application Application Application VIRTUALIZATION OF A SYSTEM IN SOFTWARE IT HAS TO LOOK REAL TO AN OPERATING SYSTEM Each OS assumes full access to the platform hardware ‒ Memory, Interrupts, Devices, CPU cores, etc. A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage the physical hardware and define a “virtual machine” (VM) that represents the resources an OS expects to find in the system Hypervisor VMM Use cases: ‒ System consolidation 106 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Virtual Machine1 Virtual Machine2 Virtual Machine3 Operating System1 Operating System2 Operating System3 Application Application Application Application Application Application Application Application Application VIRTUALIZATION OF A SYSTEM IN SOFTWARE IT HAS TO LOOK REAL TO AN OPERATING SYSTEM Each OS assumes full access to the platform hardware ‒ Memory, Interrupts, Devices, CPU cores, etc. A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage the physical hardware and define a “virtual machine” (VM) that represents the resources an OS expects to find in the system Hypervisor VMM Use cases: ‒ System consolidation ‒ OS/application compatibility 107 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Virtual Machine1 Virtual Machine2 Virtual Machine3 Operating System1 Operating System2 Operating System3 Application Application Application Application Application Application Application Application Application VIRTUALIZATION OF A SYSTEM IN SOFTWARE IT HAS TO LOOK REAL TO AN OPERATING SYSTEM Each OS assumes full access to the platform hardware ‒ Memory, Interrupts, Devices, CPU cores, etc. A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage the physical hardware and define a “virtual machine” (VM) that represents the resources an OS expects to find in the system Hypervisor VMM Use cases: ‒ System consolidation ‒ OS/application compatibility ‒ Security / Stability 108 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Virtual Machine1 Virtual Machine2 Virtual Machine3 Operating System1 Operating System2 Operating System3 Application Application Application Application Application Application Application Application Application VIRTUALIZATION OF A SYSTEM IN SOFTWARE IT HAS TO LOOK REAL TO AN OPERATING SYSTEM Each OS assumes full access to the platform hardware ‒ Memory, Interrupts, Devices, CPU cores, etc. A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage the physical hardware and define a “virtual machine” (VM) that represents the resources an OS expects to find in the system Hypervisor VMM Use cases: ‒ System consolidation ‒ OS/application compatibility ‒ Security / Stability ‒ Cloud Infrastructure 109 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Virtual Machine1 Virtual Machine2 Virtual Machine3 Operating System1 Operating System2 Operating System3 Application Application Application Application Application Application Application Application Application VIRTUALIZATION OF A SYSTEM Most CPUs today have support for system virtualization ‒ Nested page tables (HV & OS levels), allow VMM/HV to assign and manage system memory and interrupts to Virtual Machines 110 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 VIRTUALIZATION OF A SYSTEM Most CPUs today have support for system virtualization ‒ Nested page tables (HV & OS levels), allow VMM/HV to assign and manage system memory and interrupts to Virtual Machines I/O devices are typically managed by HV/VMM software, either by… 111 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 VIRTUALIZATION OF A SYSTEM Most CPUs today have support for system virtualization ‒ Nested page tables (HV & OS levels), allow VMM/HV to assign and manage system memory and interrupts to Virtual Machines I/O devices are typically managed by HV/VMM software, either by… Para-Virtualization Guest device driver uses HV “hypercalls” Hypervisor manages HW operation (DMA) Hypervisor SW validates and redirects I/O requests from Guest OS (overhead, slow) Hypervisor arbitrates and schedules requests from multiple guest OS, allows VM migration Most common operation for today’s virtualization Software Works well for CPU-heavy workloads I/O, graphics or compute-heavy workloads 112 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 VIRTUALIZATION OF A SYSTEM Most CPUs today have support for system virtualization ‒ Nested page tables (HV & OS levels), allow VMM/HV to assign and manage system memory and interrupts to Virtual Machines I/O devices are typically managed by HV/VMM software, either by… Para-Virtualization Direct-Mapped Device & SR-IOV Guest device driver uses HV “hypercalls” Hypervisor manages HW operation (DMA) Device function is mapped to guest OS Guest OS uses native HW drivers Hypervisor SW validates and redirects I/O requests from Guest OS (overhead, slow) Physical Device DMA must be limited and redirected by Hypervisor (via IOMMU), Hypervisor arbitrates and schedules requests One device function per guest OS, physical from multiple guest OS, allows VM migration memory must be committed Most common operation for today’s virtualization Software Works well for CPU-heavy workloads I/O, graphics or compute-heavy workloads 113 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 I/O device must be resettable by HV when guest error puts it in undefined state SR-IOV is a variant of direct mapped I/O device provides 1 - n “virtual” devices in HW (PCI-SIG standard) EFFICIENT I/O VIRTUALIZATION HARDWARE IMPLEMENTED TECHNIQUE THROUGH IOMMU IOMMU validates DMA accesses and validates device interrupts Core Core MMU MMU Memory 114 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device IOMMU EFFICIENT IO VIRTUALIZATION WITH IOMMU WHAT ARE THE BENEFITS? Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs 115 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EFFICIENT IO VIRTUALIZATION WITH IOMMU WHAT ARE THE BENEFITS? Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs ‒ Beneficial if one VM requires near native performance 116 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EFFICIENT IO VIRTUALIZATION WITH IOMMU WHAT ARE THE BENEFITS? Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs ‒ Beneficial if one VM requires near native performance ‒ Or if OS needs to be “sandboxed” (because of suspected malware) 117 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EFFICIENT IO VIRTUALIZATION WITH IOMMU WHAT ARE THE BENEFITS? Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs ‒ Beneficial if one VM requires near native performance ‒ Or if OS needs to be “sandboxed” (because of suspected malware) Native driver can operate in the Guest OS 118 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EFFICIENT IO VIRTUALIZATION WITH IOMMU WHAT ARE THE BENEFITS? Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs ‒ Beneficial if one VM requires near native performance ‒ Or if OS needs to be “sandboxed” (because of suspected malware) Native driver can operate in the Guest OS IOMMU enforces Hypervisor policy on memory and system resource isolation for each of the Guest Virtual Machines 119 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EFFICIENT IO VIRTUALIZATION WITH IOMMU WHAT ARE THE BENEFITS? Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs ‒ Beneficial if one VM requires near native performance ‒ Or if OS needs to be “sandboxed” (because of suspected malware) Native driver can operate in the Guest OS IOMMU enforces Hypervisor policy on memory and system resource isolation for each of the Guest Virtual Machines IOMMU redirects device physical address set up by Guest OS driver (= Guest Physical Addresses) to the actual Host System Physical Address (SPA) 120 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EFFICIENT IO VIRTUALIZATION WITH IOMMU WHAT ARE THE BENEFITS? Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs ‒ Beneficial if one VM requires near native performance ‒ Or if OS needs to be “sandboxed” (because of suspected malware) Native driver can operate in the Guest OS IOMMU enforces Hypervisor policy on memory and system resource isolation for each of the Guest Virtual Machines IOMMU redirects device physical address set up by Guest OS driver (= Guest Physical Addresses) to the actual Host System Physical Address (SPA) ‒ Useful for platform resources that have “well-known” addresses like legacy devices or system resources like APIC (Advanced Programmable Interrupt Controller) 121 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EFFICIENT IO VIRTUALIZATION WITH IOMMU WHAT ARE THE BENEFITS? Using the IOMMU allows a Hypervisor to assign a physical device exclusively to a Guest VM without danger of memory corruption to other VMs ‒ Beneficial if one VM requires near native performance ‒ Or if OS needs to be “sandboxed” (because of suspected malware) Native driver can operate in the Guest OS IOMMU enforces Hypervisor policy on memory and system resource isolation for each of the Guest Virtual Machines IOMMU redirects device physical address set up by Guest OS driver (= Guest Physical Addresses) to the actual Host System Physical Address (SPA) ‒ Useful for platform resources that have “well-known” addresses like legacy devices or system resources like APIC (Advanced Programmable Interrupt Controller) Allows near-native device performance for high-performance devices with low system impact 122 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU USECASE: ENABLING HETEROGENEOUS COMPUTING LEGACY GPU COMPUTE The limiters that need to be fixed to unleash programmers: CPU CPU ... CU CPU CU CU CU GPU PCIe™ System Memory (Coherent) 124 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU CU CU CU GPU Memory (Non-Coherent) LEGACY GPU COMPUTE The limiters that need to be fixed to unleash programmers: Multiple memory pools, multiple address spaces CPU CPU ... CU CPU CU CU CU GPU PCIe™ System Memory (Coherent) 125 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU CU CU CU GPU Memory (Non-Coherent) LEGACY GPU COMPUTE The limiters that need to be fixed to unleash programmers: Multiple memory pools, multiple address spaces High overhead dispatch, data copies across PCIe CPU CPU ... CU CPU CU CU CU GPU PCIe™ System Memory (Coherent) 126 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU CU CU CU GPU Memory (Non-Coherent) LEGACY GPU COMPUTE The limiters that need to be fixed to unleash programmers: Multiple memory pools, multiple address spaces High overhead dispatch, data copies across PCIe New languages and APIs for GPU programming necessary (OpenCL, etc.) CPU CPU ... CU CPU CU CU CU GPU PCIe™ System Memory (Coherent) 127 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU CU CU CU GPU Memory (Non-Coherent) LEGACY GPU COMPUTE The limiters that need to be fixed to unleash programmers: Multiple memory pools, multiple address spaces High overhead dispatch, data copies across PCIe New languages and APIs for GPU programming necessary (OpenCL, etc.) ‒ And sometimes proprietary environments CPU CPU ... CU CPU CU CU CU GPU PCIe™ System Memory (Coherent) 128 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU CU CU CU GPU Memory (Non-Coherent) LEGACY GPU COMPUTE The limiters that need to be fixed to unleash programmers: Multiple memory pools, multiple address spaces High overhead dispatch, data copies across PCIe New languages and APIs for GPU programming necessary (OpenCL, etc.) ‒ And sometimes proprietary environments Dual source development CPU CPU ... CU CPU CU CU CU GPU PCIe™ System Memory (Coherent) 129 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU CU CU CU GPU Memory (Non-Coherent) LEGACY GPU COMPUTE The limiters that need to be fixed to unleash programmers: Multiple memory pools, multiple address spaces High overhead dispatch, data copies across PCIe New languages and APIs for GPU programming necessary (OpenCL, etc.) ‒ And sometimes proprietary environments Dual source development Expert programmers only CPU CPU ... CU CPU CU CU CU GPU PCIe™ System Memory (Coherent) 130 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU CU CU CU GPU Memory (Non-Coherent) THE PREVIOUS APUS AND SOCS, PHYSICAL INTEGRATION Some memory copies are gone, because the same memory is accessed Physical Integration GPU CPU 1 CPU 2 … CPU N CU 1 System Memory (Coherent) 131 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU 2 CU 3 … CU M-2 CU M-1 GPU Memory (Non-Coherent) CU M THE PREVIOUS APUS AND SOCS, PHYSICAL INTEGRATION Some memory copies are gone, because the same memory is accessed ‒ But the memory is not accessible concurrently, because of cache policies Physical Integration GPU CPU 1 CPU 2 … CPU N CU 1 System Memory (Coherent) 132 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU 2 CU 3 … CU M-2 CU M-1 GPU Memory (Non-Coherent) CU M THE PREVIOUS APUS AND SOCS, PHYSICAL INTEGRATION Some memory copies are gone, because the same memory is accessed ‒ But the memory is not accessible concurrently, because of cache policies Two memory pools remain (cache coherent + non-coherent memory regions) Physical Integration GPU CPU 1 CPU 2 … CPU N CU 1 System Memory (Coherent) 133 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU 2 CU 3 … CU M-2 CU M-1 GPU Memory (Non-Coherent) CU M THE PREVIOUS APUS AND SOCS, PHYSICAL INTEGRATION Some memory copies are gone, because the same memory is accessed ‒ But the memory is not accessible concurrently, because of cache policies Two memory pools remain (cache coherent + non-coherent memory regions) Physical Integration GPU CPU 1 CPU 2 … CPU N CU 1 System Memory (Coherent) 134 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU 2 CU 3 … CU M-2 CU M-1 GPU Memory (Non-Coherent) CU M THE PREVIOUS APUS AND SOCS, PHYSICAL INTEGRATION Some memory copies are gone, because the same memory is accessed ‒ But the memory is not accessible concurrently, because of cache policies Two memory pools remain (cache coherent + non-coherent memory regions) Jobs are still queued through the OS driver chain and suffer from overhead Physical Integration GPU CPU 1 CPU 2 … CPU N CU 1 System Memory (Coherent) 135 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU 2 CU 3 … CU M-2 CU M-1 GPU Memory (Non-Coherent) CU M THE PREVIOUS APUS AND SOCS, PHYSICAL INTEGRATION Some memory copies are gone, because the same memory is accessed ‒ But the memory is not accessible concurrently, because of cache policies Two memory pools remain (cache coherent + non-coherent memory regions) Jobs are still queued through the OS driver chain and suffer from overhead Still requires expert programmers to get performance Physical Integration GPU CPU 1 CPU 2 … CPU N CU 1 System Memory (Coherent) 136 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU 2 CU 3 … CU M-2 CU M-1 GPU Memory (Non-Coherent) CU M THE PREVIOUS APUS AND SOCS, PHYSICAL INTEGRATION Some memory copies are gone, because the same memory is accessed ‒ But the memory is not accessible concurrently, because of cache policies Two memory pools remain (cache coherent + non-coherent memory regions) Jobs are still queued through the OS driver chain and suffer from overhead Still requires expert programmers to get performance This is only an intermediate step in the journey Physical Integration GPU CPU 1 CPU 2 … CPU N CU 1 System Memory (Coherent) 137 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU 2 CU 3 … CU M-2 CU M-1 GPU Memory (Non-Coherent) CU M AN HSA ENABLED SOC Unified Coherent Memory enables data sharing across all processors CPU 1 CPU 2 … CPU N CU 1 CU 2 CU 3 … Unified Coherent Memory 138 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU M-2 CU M-1 CU M AN HSA ENABLED SOC Unified Coherent Memory enables data sharing across all processors CPU 1 CPU 2 … CPU N CU 1 CU 2 CU 3 … Unified Coherent Memory 139 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU M-2 CU M-1 CU M AN HSA ENABLED SOC Unified Coherent Memory enables data sharing across all processors Processors architected to operate cooperatively CPU 1 CPU 2 … CPU N CU 1 CU 2 CU 3 … Unified Coherent Memory 140 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU M-2 CU M-1 CU M AN HSA ENABLED SOC Unified Coherent Memory enables data sharing across all processors Processors architected to operate cooperatively ‒ Can exchange data “on the fly”, similar to what CPU threads do CPU 1 CPU 2 … CPU N CU 1 CU 2 CU 3 … Unified Coherent Memory 141 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU M-2 CU M-1 CU M AN HSA ENABLED SOC Unified Coherent Memory enables data sharing across all processors Processors architected to operate cooperatively ‒ Can exchange data “on the fly”, similar to what CPU threads do ‒ The lower job dispatch overhead allows tasks to be handled by the GPU that previously were “too costly” to transfer over Designed to enable the application running on different processors without substantially changing the programming logic CPU 1 CPU 2 … CPU N CU 1 CU 2 CU 3 … Unified Coherent Memory 142 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CU M-2 CU M-1 CU M IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: 143 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system 144 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system ‒ Unified process address space access across all processors ‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr” 145 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system ‒ Unified process address space access across all processors ‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr” 146 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system ‒ Unified process address space access across all processors ‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr” ‒ Accelerator operates in pageable system memory* 147 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system ‒ Unified process address space access across all processors ‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr” ‒ Accelerator operates in pageable system memory* 148 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *with OS support & ATS/PRI IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system ‒ Unified process address space access across all processors ‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr” ‒ Accelerator operates in pageable system memory* ‒ Cache coherency between the CPU and accelerator caches ‒ User mode dispatch/scheduling reduces job-dispatch overhead ‒ QoS with preemption/context switch of GPU Compute Units 149 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *with OS support & ATS/PRI IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system ‒ Unified process address space access across all processors ‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr” ‒ Accelerator operates in pageable system memory* ‒ Cache coherency between the CPU and accelerator caches ‒ User mode dispatch/scheduling reduces job-dispatch overhead ‒ QoS with preemption/context switch of GPU Compute Units The IOMMU enforces control of GPU access to memory 150 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *with OS support & ATS/PRI IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system ‒ Unified process address space access across all processors ‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr” ‒ Accelerator operates in pageable system memory* ‒ Cache coherency between the CPU and accelerator caches ‒ User mode dispatch/scheduling reduces job-dispatch overhead ‒ QoS with preemption/context switch of GPU Compute Units The IOMMU enforces control of GPU access to memory ‒ OS can efficiently and safely share process page tables with accelerators (requires ATS/PRI protocol support) 151 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *with OS support & ATS/PRI IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The goals of the Heterogeneous System Architecture (HSA) and where the IOMMU helps: Use of accelerators as a first-class, peer processor within the system ‒ Unified process address space access across all processors ‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr” ‒ Accelerator operates in pageable system memory* ‒ Cache coherency between the CPU and accelerator caches ‒ User mode dispatch/scheduling reduces job-dispatch overhead ‒ QoS with preemption/context switch of GPU Compute Units The IOMMU enforces control of GPU access to memory ‒ OS can efficiently and safely share process page tables with accelerators (requires ATS/PRI protocol support) ‒ Accelerators can’t step outside of the OS-set boundaries 152 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 *with OS support & ATS/PRI IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The benefits of the Heterogeneous System Architecture: 153 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The benefits of the Heterogeneous System Architecture: Pageable memory access is validated and handled directly by the OS memory manager via AMD IOMMU 154 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The benefits of the Heterogeneous System Architecture: Pageable memory access is validated and handled directly by the OS memory manager via AMD IOMMU Application data structures can be directly parsed by the accelerator and pointer links followed without CPU help 155 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The benefits of the Heterogeneous System Architecture: Pageable memory access is validated and handled directly by the OS memory manager via AMD IOMMU Application data structures can be directly parsed by the accelerator and pointer links followed without CPU help Common high level languages and tools (compilers, runtimes, …) port easily to accelerators 156 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The benefits of the Heterogeneous System Architecture: Pageable memory access is validated and handled directly by the OS memory manager via AMD IOMMU Application data structures can be directly parsed by the accelerator and pointer links followed without CPU help Common high level languages and tools (compilers, runtimes, …) port easily to accelerators ‒ C/C++, Python, Java, … already have open source implementations 157 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The benefits of the Heterogeneous System Architecture: Pageable memory access is validated and handled directly by the OS memory manager via AMD IOMMU Application data structures can be directly parsed by the accelerator and pointer links followed without CPU help Common high level languages and tools (compilers, runtimes, …) port easily to accelerators ‒ C/C++, Python, Java, … already have open source implementations ‒ Many more languages to follow 158 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU: A BUILDING BLOCK FOR HSA REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS The benefits of the Heterogeneous System Architecture: Pageable memory access is validated and handled directly by the OS memory manager via AMD IOMMU Application data structures can be directly parsed by the accelerator and pointer links followed without CPU help Common high level languages and tools (compilers, runtimes, …) port easily to accelerators ‒ C/C++, Python, Java, … already have open source implementations ‒ Many more languages to follow IOMMU making it easier for programmers to use GPUs and other accelerators safely and efficiently 159 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EVOLUTION OF THE SOFTWARE STACK – A COMPARISON Goal of the software stack is to focus on high-level language support HSA Software Stack Hardware - APUs, CPUs, GPUs User mode component Kernel mode component © Copyright 2014 HSA Foundation. All Rights Reserved. 160 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Components contributed by third parties EVOLUTION OF THE SOFTWARE STACK – A COMPARISON Goal of the software stack is to focus on high-level language support HSA Software Stack Driver Stack Apps Apps Apps Apps Apps Apps Domain Libraries OpenCL™, DX Runtimes, User Mode Drivers Graphics Kernel Mode Driver Hardware - APUs, CPUs, GPUs User mode component Kernel mode component © Copyright 2014 HSA Foundation. All Rights Reserved. 161 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Components contributed by third parties EVOLUTION OF THE SOFTWARE STACK – A COMPARISON Goal of the software stack is to focus on high-level language support ‒ Allow to target the GPU directly by SW HSA Software Stack Driver Stack Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps HSA Domain Libraries, OpenCL ™ 2.x Runtime Domain Libraries OpenCL™, DX Runtimes, User Mode Drivers HSA Runtime Task Queuing Libraries HSA JIT Graphics Kernel Mode Driver HSA Kernel Mode Driver Hardware - APUs, CPUs, GPUs User mode component Kernel mode component © Copyright 2014 HSA Foundation. All Rights Reserved. 162 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Components contributed by third parties EVOLUTION OF THE SOFTWARE STACK – A COMPARISON Goal of the software stack is to focus on high-level language support ‒ Allow to target the GPU directly by SW ‒ Drivers are setting up the HW and policies, then go out of the way HSA Software Stack Driver Stack Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps HSA Domain Libraries, OpenCL ™ 2.x Runtime Domain Libraries OpenCL™, DX Runtimes, User Mode Drivers HSA Runtime Task Queuing Libraries HSA JIT Graphics Kernel Mode Driver HSA Kernel Mode Driver Hardware - APUs, CPUs, GPUs User mode component Kernel mode component © Copyright 2014 HSA Foundation. All Rights Reserved. 163 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Components contributed by third parties EVOLUTION OF THE SOFTWARE STACK – A COMPARISON Goal of the software stack is to focus on high-level language support ‒ Allow to target the GPU directly by SW ‒ Drivers are setting up the HW and policies, then go out of the way ‒ IOMMU support provide hardware enforced protections for Operating System HSA Software Stack Driver Stack Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps Apps HSA Domain Libraries, OpenCL ™ 2.x Runtime Domain Libraries OpenCL™, DX Runtimes, User Mode Drivers HSA Runtime Task Queuing Libraries HSA JIT Graphics Kernel Mode Driver HSA Kernel Mode Driver Hardware - APUs, CPUs, GPUs User mode component Kernel mode component © Copyright 2014 HSA Foundation. All Rights Reserved. 164 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Components contributed by third parties Operating System IOMMU Hardware IOMMU LINES-OF-CODE AND PERFORMANCE COMPARISONS 350 (Exemplary ISV “Hessian” Kernel) 300 Init. LOC 250 Launch 200 Compile Compile Copy Copy 150 Launch Launch Launch Algorithm Algorithm Algorithm Launch 100 Launch Algorithm Launch 50 Algorithm Algorithm Algorithm Serial CPU Copy-back Algorithm TBB Intrinsics+TBB Launch Copy OpenCL™-C Compile OpenCL™ -C++ C++ AMP HSA Bolt Init AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta 165 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Copy-back Copy-back Copyback 0 © Copyright 2014 HSA Foundation. All Rights Reserved. LINES-OF-CODE AND PERFORMANCE COMPARISONS 35.00 (Exemplary ISV “Hessian” Kernel) 300 30.00 250 25.00 200 20.00 150 15.00 100 10.00 50 Performance LOC 350 5.00 0 0 Serial CPU TBB Intrinsics+TBB OpenCL™-C OpenCL™ -C++ C++ AMP HSA Bolt Performance AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM. Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta 166 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 © Copyright 2014 HSA Foundation. All Rights Reserved. ACCELERATORS: THE PORTABILITY CHALLENGE CPU ISAs ‒ ISA innovations added incrementally (i.e., NEON, AVX, etc) ‒ ISA retains backwards-compatibility with previous generation ‒ Two dominant instruction-set architectures: ARM and x86 GPU ISAs ‒ Massive diversity of architectures in the market ‒ Each vendor has its own ISA - and often several in the market at same time ‒ No commitment (or attempt!) to provide any backwards compatibility ‒ Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction 167 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 WHAT IS HSA INTERMEDIATE LANGUAGE (HSAIL)? Intermediate language for parallel compute in HSA ‒ Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc.) ‒ Expresses parallel regions of code ‒ Binary format of HSAIL is called “BRIG” ‒ Goal: Bring parallel acceleration to mainstream programming languages IOMMU based pointer translation is key to enabling an efficient IL Implementation main() { … #pragma omp parallel for for (int i=0;i<N; i++) { } … } High-Level Compiler Host ISA BRIG Finalizer © Copyright 2014 HSA Foundation. All Rights Reserved. 168 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Component ISA MEMBERS DRIVING HAS FOUNDATION http://www.hsafoundation.com/ Founders Promoters Supporters Contributors Academic 169 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 GEN1: FIR & AES FIR is a memory-intensive streaming workload AES is a compute-intensive streaming workload CL12 – cl_mem buffer ‒ Copy to/from the device CL20 – SVM buffer – Coarse Grain Sync ‒ Copy to/from SVM ‒ Data copy cannot be avoided, since the space for SVM is limited HSA – Unified Memory Space – Fine Grained Sync ‒ Regular pointer ‒ No explicit copy Results ‒ HSA compute abstraction ‒ NO performance penalty Not all algorithms run faster ‒ Measured on Kaveri (A pre-HSA 1.0 device) ‒ Limited Coherent throughput 170 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Saoni Mukherjee, Yifan Sun, Paul Blinzer, Amir Kavyan Ziabari, David Kaeli,A Comprehensive Performance Analysis of HSA and OpenCL 2.0, Proceedings of the 2016 International Symposium on Program Analysis and System Software, April 2016, to appear. BLACKSCHOLES C++ on HSA ‒ Matches or outperforms OpenCL Course Grained SVM ‒ Matches OpenCL buffers for bandwidth ‒ More predictable performance Fine Grained SVM ‒ Faster kernel dispatch ‒ Larger allocations ‒ Shared data structure Results ‒ HSA compute abstraction ‒ NO performance penalty SOURCE: RALPH POTTER – CODEPLAY. PRESENTATION MADE TO SG14 C++ WORKGROUP 171 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 ENABLING HETEROGENEOUS COMPUTING SUMMARY AND DEMONSTRATION Key Takeaways: ‒ To further scale up compute performance, software must take better advantage of system accelerators like GPUs and DSPs in high level languages ‒ Accelerators following the HSA Foundation specification requirements allow programmers to write or port programs easily using common high level languages ‒ AMD IOMMU is key to efficiently and safely access process virtual memory! ‒ Does translation of both process address space via PASID and device physical accesses ‒ Enforces OS allocation policy, deals with virtual memory page faults, and much more 172 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 AGENDA MOTIVATION & INTRODUCTION USE CASES & DEMOSTRATION What is IOMMU? Where can IOMMU help? INTERNALS How does IOMMU work? RESEARCH Research Opportunities and Tools 173 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RECAP: IOMMU AND ITS CAPABILITIES IOMMU Driver Core Sets up IOMMU hardware Core MMU MMU Memory IO Device IO Device IOMMU Hardware that intercepts DMA transactions and interrupts Key capabilities: 1. Memory protection for DMA 2. Virtual address translation for DMA 3. Interrupt remapping and virtualization 4. IO can share CPU page tables 174 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 AGENDA: WHAT IS COMING UP? DMA Address Translation ‒ Address translation and memory protection in un-virtualized System ‒ Making address translation faster through caching ‒ Enabling shared address space in heterogeneous system ‒ Enabling pre-translation through IOMMU ‒ Enabling demand paging from devices (dynamic page fault) ‒ Nested address translation in virtualized system ‒ Invalidating IOMMU mappings 175 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Address translation, memory protection, HSA AGENDA: WHAT IS COMING UP? DMA Address Translation ‒ Address translation and memory protection in un-virtualized System ‒ Making address translation faster through caching ‒ Enabling shared address space in heterogeneous system ‒ Enabling pre-translation through IOMMU ‒ Enabling demand paging from devices (dynamic page fault) ‒ Nested address translation in virtualized system ‒ Invalidating IOMMU mappings Address translation, memory protection, HSA Interrupt Handling ‒ Interrupt filtering and remapping ‒ Interrupt virtualization 176 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Interrupts AGENDA: WHAT IS COMING UP? DMA Address Translation ‒ Address translation and memory protection in un-virtualized System ‒ Making address translation faster through caching ‒ Enabling shared address space in heterogeneous system ‒ Enabling pre-translation through IOMMU ‒ Enabling demand paging from devices (dynamic page fault) ‒ Nested address translation in virtualized system ‒ Invalidating IOMMU mappings Address translation, memory protection, HSA Interrupt Handling ‒ Interrupt filtering and remapping ‒ Interrupt virtualization Summary ‒ A peek inside a typical IOMMU implementation ‒ Data structures and their Interactions 177 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Interrupts IOMMU Internals: Address Translation and Memory Protection 178 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 ADDRESS TRANSLATION AND MEMORY PROTECTION NON-VIRTUALIZED SYSTEM Core Core MMU MMU IO Device IO Device Virtual Addresses Physical Addresses Memory 179 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU ADDRESS TRANSLATION AND MEMORY PROTECTION NON-VIRTUALIZED SYSTEM Core Core MMU MMU IO Device IO Device Virtual Addresses Physical Addresses Memory 180 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Domain (Defined by OS) ADDRESS TRANSLATION AND MEMORY PROTECTION NON-VIRTUALIZED SYSTEM Core Core Virtual Addresses MMU MMU IO Device Virtual Address IO Device DMA Request IOMMU Physical Addresses Memory DevID DomID Device Table 181 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Domain (Defined by OS) DeviceID ADDRESS TRANSLATION AND MEMORY PROTECTION NON-VIRTUALIZED SYSTEM Core Core Virtual Addresses MMU MMU IO Device Virtual Address IO Device DMA Request Domain (Defined by OS) DeviceID IOMMU Physical Addresses Memory DevID DomID Device Table 182 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Page Table ADDRESS TRANSLATION AND MEMORY PROTECTION NON-VIRTUALIZED SYSTEM Core Core Virtual Addresses MMU MMU Physical Addresses IO Device Virtual Address IO Device DMA Request Domain (Defined by OS) DeviceID IOMMU Physical Addresses Memory DevID DomID Device Table 183 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Page Table ADDRESS TRANSLATION AND MEMORY PROTECTION NON-VIRTUALIZED SYSTEM Core Core Virtual Addresses MMU Physical Addresses MMU Domain IO Device Virtual Address IO Device DMA Request DeviceID IOMMU Abort request if not sufficient permission Memory DevID Device Table 184 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Page Table MAKING TRANSLATION FAST CACHING TRANSLATION IN IOMMU Core Core IO Device Virtual Addresses MMU IO Device Device Table Entry Cache Translation Lookaside Buffer MMU Page Table Walker Physical Addresses Memory DevID Device Table 185 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Page Table IOMMU Internals: Enabling “Pointer-is-a-Pointer” in Heterogeneous Systems 186 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Core Core Virtual Addresses MMU MMU Physical Addresses Domain IO Device Virtual Address IO Device DMA Request IOMMU Physical Addresses Memory DevID Device Table 187 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Page Table SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Core Core Virtual Addresses MMU MMU Physical Addresses Domain IO Device Virtual Address GPU DMA Request IOMMU Physical Addresses Memory DevID Device Table 188 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Page Table SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process Domain Core IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request IOMMU Physical Addresses Memory DevID Device Table 189 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Page Table SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process Domain Core IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request IOMMU Physical Addresses Memory DevID Device Table 190 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 x86-64 Page Table SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process 0 Process 1 Domain IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request IOMMU Physical Addresses Memory DevID Device Table 191 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 x86-64 Page Table SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process 0 Process 1 Domain IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request IOMMU Physical Addresses Needs ability to identify more than one address space Memory DevID Device Table 192 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 x86-64 Page Table SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process 0 Process 1 Domain IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request IOMMU Physical Addresses Memory DevID Device Table 193 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DeviceID SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process 0 PASID 0 Process 1 PASID 1 Domain IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request IOMMU Physical Addresses Memory DevID Device Table 194 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DeviceID + PASID SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process 0 PASID 0 Process 1 PASID 1 Domain IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request DeviceID + PASID IOMMU Physical Addresses Memory DevID PASID gCR3 table Device Table 195 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process 0 PASID 0 Process 1 PASID 1 Domain IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request DeviceID + PASID IOMMU Physical Addresses Memory DevID PASID gCR3 table Device Table 196 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SHARING ADDRESS SPACE WITH CPU ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS Process 0 PASID 0 Process 1 PASID 1 Domain IO Device Virtual Addresses MMU MMU Physical Addresses Virtual Address GPU DMA Request DeviceID + PASID IOMMU Physical Addresses Memory DevID PASID gCR3 table Device Table 197 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Internals: Enabling Translation Caching in Devices 198 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS Core MMU Core GPU MMU Memory IO Device Device Table Entry Cache IOMMU Page Table walker DevID PASID Device Table 199 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Translation Lookaside Buffer gCR3 table CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS ATC/ IOTLB Core MMU Core TLB GPU MMU IO Device Device Table Entry Cache IOMMU Locally caching address translation in trips to IOMMU Memory Page Table device reduces walker DevID PASID Device Table 200 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Translation Lookaside Buffer gCR3 table CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS ATC/ IOTLB Core MMU Core TLB GPU MMU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker IOMMU driver assigns per-translation capability to devices Pre-translation capable? Memory DevID 1 PASID Device Table 201 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS ATC/ IOTLB Core MMU Core TLB GPU MMU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Introduce new message ype: Address Translation Service (ATS) Page Table walker Pre-translation capable? Memory DevID 1 PASID Device Table 202 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS ATC/ IOTLB Core Core TLB GPU IO Device Device Table Entry Cache ATS Req MMU MMU (DevID, PASID, VA, R/W) Translation Lookaside Buffer IOMMU Page Table walker Pre-translation capable? Memory DevID 1 PASID Device Table 203 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS ATC/ IOTLB Core Core TLB GPU IO Device Device Table Entry Cache ATS Resp MMU MMU Translation Lookaside Buffer (PASID, VA, PA, Attr.) IOMMU Page Table walker Pre-translation capable? Memory DevID 1 PASID Device Table 204 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS ATC/ IOTLB Core MMU Core MMU TLB GPU DMA Req (Physical Address) Pre-translated Req IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Pre-translation capable? Memory DevID 1 PASID Device Table 205 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS ATC/ IOTLB Core MMU Core MMU TLB GPU DMA Req (Physical Address) Pre-translated Req IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Pre-translation capable? Memory DevID 1 PASID Device Table 206 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table CACHING ADDRESS TRANSLATION IN DEVICES ENABLING MORE CAPABLE DEVICE/ACCELERATORS ATC/ IOTLB Core MMU Core MMU TLB GPU DMA Req (Physical Address) Pre-translated Req IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Pre-translation capable? Memory DevID 1 Abort if not pre-translation capable Device Table 207 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 PASID gCR3 table IOMMU Internals: Enabling Demand Paging from IO No Need to Pin Memory 208 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Device(s) access local TLB (ATC/IOTLB) first Core MMU Core TLB GPU MMU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID Device Table 209 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT On a (IO)TLB hit no access to IOMMU Core MMU Core TLB GPU MMU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID Device Table 210 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT (IO)TLB miss Core MMU Core TLB GPU MMU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID Device Table 211 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT (IO)TLB miss Core Core TLB GPU IO Device Device Table Entry Cache ATS Req MMU MMU (DevID, PASID, VA, R/W) Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID Device Table 212 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT (IO)TLB miss Core Core TLB GPU IO Device Device Table Entry Cache ATS Req MMU MMU (DevID, PASID, VA, R/W) Translation Lookaside Buffer IOMMU Page Table walker Page faultNo valid PTE DevID 1 PASID Device Table 213 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Core Core TLB GPU IO Device Device Table Entry Cache ATS Resp (NACK) MMU MMU Translation Lookaside Buffer IOMMU Page Table walker Page faultNo valid PTE DevID 1 PASID Device Table 214 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Core Core TLB GPU IO Device Device Table Entry Cache PPR* request MMU MMU (DevID, PASID, VA,R/W) Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID 215 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Core Core TLB GPU IO Device Device Table Entry Cache PPR* request MMU MMU PPR Log (circular buffer) 216 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 (DevID, PASID, VA,R/W) Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Core MMU Core TLB GPU MMU PPR Log (circular buffer) PASID dID Addr Flag 217 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Core MMU Core TLB GPU MMU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Fault batching possible PPR Log (circular buffer) PASID dID Addr Flag 218 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DevID 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Interrupt handler Core TLB GPU IO Device Device Table Entry Cache Interrupt MMU MMU PPR Log (circular buffer) PASID dID Addr Flag 219 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Interrupt handler Core MMU TLB GPU MMU PPR Log (circular buffer) PASID dID Addr Flag 220 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Interrupt handler Core MMU TLB GPU MMU Work Queue PPR Log (circular buffer) PASID dID Addr Flag 221 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT OS worker thread Core MMU TLB GPU MMU Work Queue PPR Log (circular buffer) PASID dID Addr Flag 222 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU DevID Page Table walker 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT OS worker thread Service page fault Core MMU TLB GPU MMU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Fix the page table Work Queue PPR Log (circular buffer) PASID dID Addr Flag 223 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DevID 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT OS worker thread Service page fault Core MMU TLB GPU MMU Fix the page table IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Write PPR completion command Command Buffer Work Queue PPR Log (circular buffer) PASID dID Addr Flag 224 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DevID 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT OS worker thread Service page fault Core MMU TLB GPU MMU Fix the page table IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Write PPR completion command Command Buffer Work Queue PPR Log (circular buffer) PASID dID Addr Flag 225 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DevID 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT OS worker thread Service page fault Core TLB GPU IO Device Device Table Entry Cache PPR response MMU MMU Fix the page table (DevID, PASID, VA,..) Translation Lookaside Buffer IOMMU Page Table walker Write PPR completion command Command Buffer Work Queue PPR Log (circular buffer) PASID dID Addr Flag 226 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DevID 1 PASID gCR3 table Device Table *PPR= Page Peripheral Request ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Core MMU Core TLB GPU MMU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Command Buffer Work Queue PPR Log DevID 1 (circular buffer) PASID Device Table 227 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Retry original request Core Core TLB GPU IO Device Device Table Entry Cache ATS Req MMU MMU (DevID, PASID, VA, R/W) Translation Lookaside Buffer IOMMU Page Table walker Command Buffer Work Queue PPR Log DevID 1 (circular buffer) PASID Device Table 228 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Retry original request Core Core TLB GPU IO Device Device Table Entry Cache ATS Req MMU MMU (DevID, PASID, VA, R/W) Translation Lookaside Buffer IOMMU Page Table walker Command Buffer Work Queue PPR Log DevID 1 (circular buffer) PASID Device Table 229 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Retry original request Core Core TLB GPU IO Device Device Table Entry Cache ATS Resp MMU MMU (PASID, VA, PA, Attr.) Translation Lookaside Buffer IOMMU Page Table walker Command Buffer Work Queue PPR Log DevID 1 (circular buffer) PASID Device Table 230 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table ENABLING DEMAND PAGING FROM DEVICE SERVICING DEVICE PAGE FAULT Retry original request Core Core TLB GPU IO Device Device Table Entry Cache DMA Req (Physical Address) MMU MMU Translation Lookaside Buffer IOMMU Page Table walker Command Buffer Work Queue PPR Log DevID 1 (circular buffer) PASID Device Table 231 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 gCR3 table IOMMU Internals: Nested (Two-Level) Address Translation 232 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RECAP: ADDRESS TRANSLATION IN VIRTUALIZED SYSTEMS Guest Applications Guest Applications Guest Virtual Address (GVA) Guest Page Table (GPT) Guest OS 0 Guest OS 1 Guest Physical Address (GPA) Host Page Table (HPT) Hypervisor (a.k.a. VMM) System Physical Address (SPA) Hardware – CPU, Memory, IO Guest OS does not have access to (system) physical address 233 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 NESTED ADDRESS TRANSLATION BY IOMMU Guest Process Guest Process GVA Guest OS 0 Guest OS 1 GPT GPA Core 0 HPT SPA Core 0 Domain IO Device GPU VMM MMU MMU Memory 234 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU DevID Device Table NESTED ADDRESS TRANSLATION BY IOMMU Guest Process Guest Process GVA Guest OS 0 Guest OS 1 GPT GPA Core 0 HPT SPA Core 0 VMM MMU MMU Memory 235 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Domain IO Device Guest Virtual Address GPU DMA Request IOMMU DevID Device Table Device ID + PASID NESTED ADDRESS TRANSLATION BY IOMMU Guest Process Guest Process GVA Guest OS 0 Guest OS 1 GPT GPA Core 0 HPT SPA Core 0 VMM MMU MMU Identified by PASID Identified by DevID/DomID Domain IO Device Guest Virtual Address GPU DMA Request Device ID + PASID IOMMU Physical Addresses Host Page Table Memory Guest Page Table(s) DevID PASID gCR3 table 236 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Device Table NESTED ADDRESS TRANSLATION BY IOMMU GVA GPT GCR3 table entry PASID Device Table Entry nL4 1 nL3 2 nL2 3 nL1 4 GL4 5 GVA [38:30] nL4 6 nL3 7 nL2 8 nL1 9 GL3 10 GVA [29:21] nL4 11 nL3 12 nL2 13 nL1 14 GL2 15 nL4 16 nL3 17 nL2 18 nL1 19 GL1 20 nL4 21 nL4 22 nL4 23 nL4 24 GVA [20:12] GVA [11:0] HPT Device Table Entry 237 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Nested/Host page table SPA Guest page table GVA [47:39] IOMMU Internals: Sending Commands to IOMMU 238 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 COMMANDS TO IOMMU IOMMU Driver (running on CPU) issues commands to IOMMU ‒ e.g., Invalidate IOMMU TLB Entry, Invalidate IOTLB Entry ‒ e.g., Invalidate Device Table Entry ‒ e.g., Complete PPR, Completion Wait , etc. Issued via Command Buffer ‒ Memory resident circular buffer ‒ MMIO registers: Base, Head, and Tail register 239 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 COMMANDS TO IOMMU IOMMU Driver (running on CPU) issues commands to IOMMU ‒ e.g., Invalidate IOMMU TLB Entry, Invalidate IOTLB Entry ‒ e.g., Invalidate Device Table Entry ‒ e.g., Complete PPR, Completion Wait , etc. Issued via Command Buffer ‒ Memory resident circular buffer ‒ MMIO registers: Base, Head, and Tail register IOMMU Driver Write Tail Head Base Variable holding content of registers 240 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Hardware Fetch Tail Head Base Registers EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU TLB Shootdown ‒ Update page table information ‒ Flush TLB Entry(s) containing stale information Three steps in IOMMU TLB shootdown ‒ Invalidating IOMMU TLB entry ‒ Invalidating IO TLB (Device TLB) entry ‒ Wait for completion 241 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU Command Buffer 242 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table walker EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table walker Command Buffer OpCode PASID DomID Addr Misc. invalidate iommu tlb entry 128 bits 243 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU Command Buffer 244 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table walker EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table walker Command Buffer OpCode PASID DevID Addr Misc. invalidate IO tlb entry 128 bits 245 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU Command Buffer 246 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table walker EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU TLB GPU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table walker Command Buffer OpCode Store Address Store Data completion wait 128 bits 247 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU Command Buffer 248 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table walker EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU IO Device Device Table Entry Cache TLB GPU IOMMU Update Tail pointer Command Buffer 249 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Translation Lookaside Buffer Page Table walker EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU Command Buffer 250 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table Walker EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table Walker Command Buffer OpCode PASID DomID Addr Misc. invalidate IOMMU tlb entry 128 bits 251 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table Walker Command Buffer OpCode PASID DomID Addr Misc. invalidate IOMMU tlb entry 128 bits 252 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU IO Device Device Table Entry Cache TLB GPU IOMMU Update Head pointer Command Buffer 253 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Translation Lookaside Buffer Page Table Walker EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table Walker Command Buffer OpCode PASID DevID Addr Misc. invalidate IO tlb entry 128 bits 254 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table Walker Command Buffer OpCode PASID DevID Addr Misc. invalidate IO tlb entry 128 bits 255 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU IO Device Device Table Entry Cache TLB GPU IOMMU Update Head pointer Command Buffer 256 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Translation Lookaside Buffer Page Table Walker EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU TLB GPU IO Device Device Table Entry Cache Translation Lookaside Buffer IOMMU Page Table Walker Command Buffer OpCode Store Address Store Data completion wait 128 bits 257 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU Command Buffer 258 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table Walker Wait for previous commands to finish EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core Core TLB GPU IO Device Device Table Entry Cache ACK MMU MMU Command Buffer 259 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU Translation Lookaside Buffer Page Table Walker Wait for previous commands to finish EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU Command Buffer IOMMU Stores Data to “Store Address” Or Raise Interrupt 260 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TLB GPU IO Device Device Table Entry Cache IOMMU Translation Lookaside Buffer Page Table Walker Wait for previous commands to finish EXAMPLE: IOMMU TLB SHOOTDOWN IOMMU Driver Core MMU Core MMU IO Device Device Table Entry Cache TLB GPU IOMMU Update Head pointer Command Buffer 261 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Translation Lookaside Buffer Page Table Walker Wait for previous commands to finish IOMMU INTERNALS: INTERRUPT REMAPPING AND VIRTUALIZATION INTERRUPT REMAPPING Core Core APIC MMU APIC MMU Memory 263 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table IO Device Entry Cache Interrupt Remapping Lookaside Buffer IOMMU Table walker INTERRUPT REMAPPING Core Core APIC MMU APIC MMU Memory 264 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table IO Device Entry Cache Fixed/ Arbitrated Interrupt Interrupt Remapping Lookaside Buffer IOMMU Table walker INTERRUPT REMAPPING Core Core APIC MMU APIC MMU Memory 265 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table IO Device Entry Cache Fixed/ Arbitrated Interrupt Interrupt Remapping Lookaside Buffer IOMMU DevID Did Device Table Table walker INTERRUPT REMAPPING Core Core APIC MMU APIC MMU Memory 266 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table IO Device Entry Cache Fixed/ Arbitrated Interrupt Interrupt Remapping Lookaside Buffer IOMMU DevID Table walker Did Device Table Interrupt Remapping Table INTERRUPT REMAPPING Core Core APIC MMU APIC MMU IO Device Device Table IO Device Entry Cache Fixed/ Arbitrated Interrupt Interrupt Remapping Lookaside Buffer IOMMU Abort request if not sufficient permission Memory 267 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Table walker INTERRUPT REMAPPING Core Core APIC MMU APIC MMU Memory 268 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device Device Table IO Device Entry Cache Fixed/ Arbitrated Interrupt Interrupt Remapping Lookaside Buffer IOMMU Table walker INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC IO Device IO Device VMM MMU MMU Memory 269 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU Memory 270 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device Guest Virtualized Interrupt IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU Memory 271 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device Guest Virtualized Interrupt IOMMU DevID Did Device Table INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU Memory 272 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device Guest Virtualized Interrupt IOMMU DevID Did Device Table Guest Mode Interrupt Remapping Table INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU IO Device IO Device Guest Virtualized Interrupt IOMMU Abort request if not sufficient permission Memory 273 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC VMM MMU IO Device APIC MMU Memory Guest Virtualized Interrupt Guest vAPIC backing page 274 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU Memory 275 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device Guest Virtualized Interrupt IOMMU DevID Did Device Table INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU Memory 276 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device Guest Virtualized Interrupt IOMMU DevID Did Device Table Guest Running Interrupt Remapping Table INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU Memory 277 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device Guest Virtualized Interrupt IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU Memory 278 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Inactive Guest IO Device IO Device Guest Virtualized Interrupt IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC APIC VMM MMU MMU Memory 279 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Inactive Guest IO Device IO Device Guest Virtualized Interrupt IOMMU DevID Did Device Table Guest NOT Running Interrupt Remapping Table INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC MMU IO Device APIC VMM MMU IO Device Guest Virtualized Interrupt Memory Guest vAPIC Log 280 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Inactive Guest IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Core Core APIC MMU IO Device APIC VMM MMU IO Device Guest Virtualized Interrupt Memory Guest vAPIC Log 281 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Inactive Guest IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Activate Target Guest Core Core APIC APIC VMM MMU MMU Memory 282 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Inactive Guest IO Device IO Device Guest Virtualized Interrupt IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Activate Target Guest Core Core APIC APIC VMM MMU MMU Memory 283 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device Guest Virtualized Interrupt IOMMU INTERRUPT VIRTUALIZATION Guest OS 0 vAPIC Interrupt Core Core Guest APIC APIC vAPIC VMM MMU MMU Memory 284 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IO Device IO Device Guest Virtualized Interrupt IOMMU IOMMU INTERNALS: A TYPICAL IOMMU HARDWARE DESIGN EXAMPLE OF IOMMU HARDWARE DESIGN DRAM CPU Memory Controller IOHUB IOMMU Table Walker L1 TLB Device L1 TLB L1 TLB Device Device 286 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 L2 gPDC L2 DTC L2 ITC L2 gPTC L2 nPDC L2 nPTC CACHE SIZING VS PRODUCT TYPE Typical Client Product ‒ Non-Virtualized ‒ I/O Isolation ‒ Small Working Set L1 TLB L2 DTC 287 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 L2 ITC L2 gPDC L2 gPTC L2 nPDC L2 nPTC CACHE SIZING VS PRODUCT TYPE Typical Server Product ‒ Virtualized ‒ Large Working Set L2 gPDC L1 TLB L2 DTC 288 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 L2 ITC L2 gPTC L2 nPDC L2 nPTC IOMMU INTERNALS: SUMMARY OF KEY DATA STRUCTURES IOMMU’S KEY DATA STRUCTURES IOMMU Device Table Device Table Base Register DRAM GCR3 Table Guest Page Tables Host Page Tables Interrupt Remap Table Guest Virtual APIC Backing Page Guest vAPIC Log Base Register Guest Virtual APIC Log Command Buffer Base Register Command Buffer Event Log Base Register Page Request Log Base Register 290 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Event Log Peripheral Page Request Log DEVICE TABLE ENTRY Each entry is 32B IOTLB Enable Interrupt info - Interrupt Table Root Pointer - Legacy Interrupt Permission domainID guest translation Info - GCR3 Table Root Pointer - Guest Levels translated host translation Info - Page Mode - Host Page Table Root Pointer 291 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 valid entry INTERRUPT REMAPPING TABLE ENTRY Each entry is 128b. Two modes: Interrupt Remapping (guest mode=0) Interrupt Virtualization (guest mode=1) guest mode=0: 0 guest mode=1: vector destination guest mode remap enabled 1 Guest vAPIC info - Guest vAPIC Root Pointer - Guest vAPIC Tag - Guest Running 292 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 vector destination guest mode remap enabled AGENDA MOTIVATION & INTRODUCTION USE CASES & DEMOSTRATION What is IOMMU? Where can IOMMU help? INTERNALS How does IOMMU work? RESEARCH Research Opportunities and Tools 293 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RESEARCH DIRECTIONS Isolation from malicious or buggy third party accelerators ‒ Can IOMMU ensure protection in-presence of untrusted accelerators? 294 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RESEARCH DIRECTIONS Isolation from malicious or buggy third party accelerators ‒ Can IOMMU ensure protection in-presence of untrusted accelerators? Specializing IOMMU for performance and power ‒ Can IOMMU hardware exploit predictable access pattern of some accelerators? 295 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RESEARCH DIRECTIONS Isolation from malicious or buggy third party accelerators ‒ Can IOMMU ensure protection in-presence of untrusted accelerators? Specializing IOMMU for performance and power ‒ Can IOMMU hardware exploit predictable access pattern of some accelerators? Trading memory protection for performance 296 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RESEARCH DIRECTIONS Isolation from malicious or buggy third party accelerators ‒ Can IOMMU ensure protection in-presence of untrusted accelerators? Specializing IOMMU for performance and power ‒ Can IOMMU hardware exploit predictable access pattern of some accelerators? Trading memory protection for performance ‒ Can selectively lowering protection enable better performance? 297 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RESEARCH DIRECTIONS Isolation from malicious or buggy third party accelerators ‒ Can IOMMU ensure protection in-presence of untrusted accelerators? Specializing IOMMU for performance and power ‒ Can IOMMU hardware exploit predictable access pattern of some accelerators? Trading memory protection for performance ‒ Can selectively lowering protection enable better performance? Extending (limited) virtual memory to embedded accelerators ‒ Can we design for IOMMULITE embedded low-power accelerators? 298 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RESEARCH DIRECTIONS Isolation from malicious or buggy third party accelerators ‒ Can IOMMU ensure protection in-presence of untrusted accelerators? Specializing IOMMU for performance and power ‒ Can IOMMU hardware exploit predictable access pattern of some accelerators? Trading memory protection for performance ‒ Can selectively lowering protection enable better performance? Extending (limited) virtual memory to embedded accelerators ‒ Can we design for IOMMULITE embedded low-power accelerators? Avoiding interference in the IOMMU ‒ How to reduce interference among multiple devices accessing IOMMU? 299 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 ISOLATION FROM THIRD PARTY ACCELERATORS EMERGENCE OF 3RD PARTY ACCELERATORS Core Core MMU MMU Memory 300 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 1st Party (Trusted) Accelerator Accelerator IOMMU ISOLATION FROM THIRD PARTY ACCELERATORS EMERGENCE OF 3RD PARTY ACCELERATORS Core Core MMU MMU Memory 301 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 3rd Party (Un-trusted) Accelerator Accelerator IOMMU ISOLATION FROM THIRD PARTY ACCELERATORS EMERGENCE OF 3RD PARTY ACCELERATORS Core Core MMU MMU 3rd Party (Un-trusted) Accelerator Accelerator IOMMU Q: How to integrate third party accelerators efficiently and securely? How to determine if a device is trustworthy and remains trustworthy? Memory May not be possible verify if 3rd party accelerator is not buggy. 302 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 ISOLATION FROM THIRD PARTY ACCELERATORS (CNTD.) EMERGENCE OF 3RD PARTY ACCELERATORS Core Core MMU MMU Memory 303 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 3rd Party (Un-trusted) Accelerator Accelerator IOMMU ISOLATION FROM THIRD PARTY ACCELERATORS (CNTD.) EMERGENCE OF 3RD PARTY ACCELERATORS Core Core MMU MMU Memory 304 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Physical address 3rd Party (Un-trusted) TLB Accelerator IOMMU Performance consideration: 1. TLBs in accelerator Possible to bypass IOMMU ISOLATION FROM THIRD PARTY ACCELERATORS (CNTD.) EMERGENCE OF 3RD PARTY ACCELERATORS MMU Core Physical TLB address Caches Caches Coherence traffic MMU Memory 305 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Physical address Core 3rd Party (Un-trusted) Accelerator IOMMU Performance consideration: 1. TLBs in accelerator Possible to bypass IOMMU 2. Coherent caches in accelerator Coherence traffic bypass IOMMU ISOLATION FROM THIRD PARTY ACCELERATORS (CNTD.) EMERGENCE OF 3RD PARTY ACCELERATORS MMU Core Physical TLB address Caches Caches Coherence traffic MMU Memory 306 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Physical address Core 3rd Party (Un-trusted) Accelerator IOMMU Related work: Olson et al. “Border Control” in MICRO’15 [OLSON’15] ISOLATION FROM THIRD PARTY ACCELERATORS (CNTD.) EMERGENCE OF 3RD PARTY ACCELERATORS MMU Core Physical TLB address Accelerator Caches Caches Coherence traffic BC MMU IOMMU Physical address Core 3rd Party (Un-trusted) Memory 307 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 Related work: Olson et al. “Border Control” in MICRO’15 [OLSON’15] Idea: Check every access with physical address if valid. SPECIALIZING IOMMU FOR DEVICE/ ACCELERATOR IOMMU design(s) resembles CPU MMU design ‒ But device/accelerator access patterns differs from CPU’s IOMMU caters to disparate devices ‒ Single design point may not be optimal for all ‒ e.g., access pattern from GPU likely different from NIC’s 308 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SPECIALIZING IOMMU FOR DEVICE/ ACCELERATOR IOMMU design(s) resembles CPU MMU design ‒ But device/accelerator access patterns differs from CPU’s IOMMU caters to disparate devices ‒ Single design point may not be optimal for all ‒ e.g., access pattern from GPU likely different from NIC’s Study traffic pattern to IOMMU and specialize for common patterns Related work: Malka et al. ’s “rIOMMU” in ASPLOS’15. ‒ Idea: Exploit predictable IOMMU accesses from devices using circular ring buffers 309 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SPECIALIZING IOMMU FOR DEVICE/ ACCELERATOR IOMMU design(s) resembles CPU MMU design ‒ But device/accelerator access patterns differs from CPU’s IOMMU caters to disparate devices ‒ Single design point may not be optimal for all ‒ e.g., access pattern from GPU likely different from NIC’s Study traffic pattern to IOMMU and specialize for common patterns Related work: Malka et al. ’s “rIOMMU” in ASPLOS’15. ‒ Idea: Exploit predictable IOMMU accesses from devices using circular ring buffers ‒ Replace page table with circular, flat table Easy page walk ‒ Predictable access single entry IOTLB with no TLB miss and less invalidation 310 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SPECIALIZING IOMMU FOR DEVICE/ ACCELERATOR IOMMU design(s) resembles CPU MMU design ‒ But device/accelerator access patterns differs from CPU’s IOMMU caters to disparate devices ‒ Single design point may not be optimal for all ‒ e.g., access pattern from GPU likely different from NIC’s Study traffic pattern to IOMMU and specialize for common patterns Related work: Malka et al. ’s “rIOMMU” in ASPLOS’15. ‒ Idea: Exploit predictable IOMMU accesses from devices using circular ring buffers ‒ Replace page table with circular, flat table Easy page walk ‒ Predictable access single entry IOTLB with no TLB miss and less invalidation Possible to use device-specific knowledge to optimize performance ‒ IOMMU prefetching and TLB caching hints can be useful ‒ Replacement policy coordination between IOTLB (Device TLB) and IOMMU TLB ‒ Energy/power optimization in IOMMU 311 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TRADING PROTECTION FOR PERFORMANCE IOMMU hardware allows lowering protection for performance ‒ For example: pre-translated DMA transactions pass-through IOMMU ‒ A trusted IO device can manipulate any address, including interrupt storms 312 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 TRADING PROTECTION FOR PERFORMANCE IOMMU hardware allows lowering protection for performance ‒ For example: pre-translated DMA transactions pass-through IOMMU ‒ A trusted IO device can manipulate any address, including interrupt storms OS policies for trading off protection for security ‒ Should the sysadmin decide how much to trust a device/driver? ‒ Exposing software knobs for dialing performance vs. protection ‒ Related work: OS policies for Strict vs Deferred protection strategy [WILMANN’08, BEN-YEHUDA’07, AMIT’11] ‒ ASPLOS’16: Strict, sub-page grain protection through Shadow DMA-buffer [MARKUZE’16] 313 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMULITE FOR EMBEDDED LOW-POWER ACCELERATORS Virtual memory eases programming (e.g., “pointer-is-pointer”) ‒ But comes at performance and energy cost Stripped-down IOMMU for ultra low-power accelerators ‒ Lower hardware, performance, power cost by stripping non-essential features ‒ Example “non-essential” features: IO virtualization support, Interrupt remapping, Page fault handling, Nested page table walker, etc. 314 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMULITE FOR EMBEDDED LOW-POWER ACCELERATORS Virtual memory eases programming (e.g., “pointer-is-pointer”) ‒ But comes at performance and energy cost Stripped-down IOMMU for ultra low-power accelerators ‒ Lower hardware, performance, power cost by stripping non-essential features ‒ Example “non-essential” features: IO virtualization support, Interrupt remapping, Page fault handling, Nested page table walker, etc. Related work: ‒ Vogel et al.’s “Lightweight Virtual Memory” in CODES’15 [VOGEL’15] ‒ Idea: Software managed IOMMU for FPGA No translation miss handling in hardware ‒ Simple design, high performance with effective software management 315 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 AVOIDING (DESTRUCTIVE-) INTERFERENCE IN IOMMU Core Core MMU MMU IO Device IO Device Virtual Addresses Physical Addresses Memory 316 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU AVOIDING (DESTRUCTIVE-) INTERFERENCE IN IOMMU Core Core MMU MMU GPU NIC Virtual Addresses Physical Addresses Memory 317 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 IOMMU AVOIDING (DESTRUCTIVE-) INTERFERENCE IN IOMMU Core Core Virtual Addresses GPU Virtual Address MMU MMU Memory 318 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DMA Requests IOMMU Physical Addresses Physical Addresses NIC AVOIDING (DESTRUCTIVE-) INTERFERENCE IN IOMMU Core Core Virtual Addresses GPU Virtual Address MMU MMU NIC DMA Requests IOMMU Physical Addresses Physical Addresses Memory IOMMU is a shared resource How to model contention in IOMMU? How to guarantee Quality-of-Service in IOMMU? 319 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 RESEARCH: TOOLS AND MODELING Software research: IOMMU driver/OS policies ‒ Easy! Open source IOMMU Driver in Linux Hardware research: Modifying IOMMU hardware behavior ‒ Option 1: Hardware performance counter + Analytical models ‒ Option 2: Simulator with IOMMU model ‒ Work in progress to add IOMMU model in gem5 ‒ Write down in attendance sheet your email if interested 320 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 SUMMARY IOMMU (kernel-mode) Driver: Configuration/Setup IOMMU hardware Core Core MMU MMU Memory IO Device IO Device IOMMU Hardware that intercepts DMA transactions and interrupts Important roles: 1. Memory protection from rogue devices 2. Shared virtual memory to devices 3. I/O virtualization – direct I/O 4. Supporting legacy I/O, Secure boot 321 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 REFERENCES IOMMU specification: http://support.amd.com/TechDocs/48882_IOMMU.pdf OLSON’15: Lean Olson et. al. “Border Control: Sandboxing Accelerators” , MICRO 2015 AMIT’11: Nadav Amit et al. “vIOMMU: Efficient IOMMU Emulation”, USENIX, ATC , 2011 BEN-YEHUDA’07: Muli Ben-Yehuda et al. “The Price of Safety: Evaluating IOMMU Performance”, OLS 2007 MALKA’15: Moshe Malka et al. “rIOMMU: Efficient IOMMU for I/O Devices That Employ Ring Buffers”, ASPLOS 2015. WILLMANN’08: Paul Willmann et al. “Protection Strategies for Direct Access to Virtualized I/O Devices”, USENIX, ATC 2008. VOGEl’15: Pirmin Vogel et. al. “Lightweight virtual memory support for many-core accelerators in heterogeneous embedded SoCs”, CODES’15 MARKUZE’16: Markuze et al. “True IOMMU Protection from DMA Attacks”, ASPLOS’16. 322 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 QUESTIONS AND FEEDBACK Reachable @ ‒ Arka Basu: Arkaprava “dot” Basu “at” amd.com ‒ Andy Kegel: Andrew “dot” Kegel “at” amd.com ‒ Paul Blinzer: Paul “dot” Blinzer “at” amd.com ‒ Maggie Chan: Maggie “dot” Chan “at” amd.com 323 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016 DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). OpenCL is a trademark of Apple Inc. used by permission by Khronos. ARM ® is/are the registered trademark(s) of ARM Limited in the EU and other countries. PCIe® is registered trademark of PCI-SIG corporation. Other name are for informational purposes only and may be trademarks of their respective owners. 324 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016