VIRTUALIZING IO THROUGH
THE IO MEMORY MANAGEMENT UNIT (IOMMU)
ANDY KEGEL, PAUL BLINZER, ARKA BASU, MAGGIE CHAN
ASPLOS 2016
WHAT THIS TUTORIAL WILL AND WILL NOT COVER
• Definition of “IO” or “Device” or “IO Device”:
‒ Traditional IO includes GPU for graphics, NIC, storage controller, USB controller, etc.
‒ New IO (accelerators) includes general-purpose computation on a GPU (GPGPU), encryption accelerators, digital signal processors, etc.
• Two Parts in Virtualizing an IO Device
‒ Device specific: Virtual instances of device
  ‒ Virtual functions and Physical function in devices (PCIE® SR-IOV, MR-IOV)
‒ System defined: IO Memory Management Unit or IOMMU
  ‒ Virtualizing DMA accesses (Address Translation and Protection)
  ‒ Virtualizing Interrupts (Interrupt Remapping and Virtualizing)
2 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016
Focus: the system-defined part, the IOMMU
AGENDA
MOTIVATION & INTRODUCTION: What is IOMMU? -- Andy Kegel
USE CASES & DEMONSTRATION: Where can IOMMU help? -- Paul Blinzer
INTERNALS: How does IOMMU work? -- Arka Basu, Maggie Chan
RESEARCH: Research Opportunities and Discussion -- Arka Basu
MOTIVATION: TRADITIONAL DMA BY IO
NO SYSTEM VIRTUALIZATION
[Figure: a device driver on a core sets up the IO device. Core accesses use virtual addresses that the MMU translates to physical addresses with a protection check. The IO device issues DMA requests with physical addresses directly to memory and can hit the wrong location.]
• No protection from malicious devices --> “DMA Attack” (e.g., FinSpy)
• No protection from buggy device driver
• Side channel attack – leak information
=> Needs hardware-enforced memory protection
MOTIVATION: VIRTUAL MACHINES ARE TRENDING
Tremendous growth in virtualization in servers
Efficient access to IO under virtualization is important
Source: IDC Server Virtualization, MCS 2012
BACKGROUND: TRANSLATIONS IN VIRTUALIZED SYSTEM
[Figure: layered stack. Guest applications use Guest Virtual Addresses (GVA). Guest OS 0 and Guest OS 1 manage the mapping from GVA to Guest Physical Address (GPA). The hypervisor (a.k.a. VMM) manages the mapping from GPA to System Physical Address (SPA). Hardware (CPU, memory, IO) sits at the bottom.]
Isolation across Guest OS => No access to (system) physical address from Guest OS
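The two translation stages can be sketched in a few lines of Python. This is a minimal illustration only: it assumes 4 KiB pages and single-level tables held in dictionaries, and the names `translate`, `gva_to_spa`, `guest_table` and `vmm_table` as well as all page numbers are hypothetical, not taken from the tutorial.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(page_table, addr):
    """Translate one address through a single-level toy page table."""
    page, offset = addr >> PAGE_SHIFT, addr & PAGE_MASK
    if page not in page_table:
        raise LookupError(f"no mapping for page {page:#x}")
    return (page_table[page] << PAGE_SHIFT) | offset


def gva_to_spa(guest_table, vmm_table, gva):
    """Stage 1: GVA -> GPA (guest OS table). Stage 2: GPA -> SPA (VMM table)."""
    gpa = translate(guest_table, gva)
    return translate(vmm_table, gpa)


# Hypothetical mappings, chosen only for illustration.
guest_table = {0x10: 0x2}   # managed by the guest OS: GVA page -> GPA page
vmm_table = {0x2: 0x7F}     # managed by the VMM: GPA page -> SPA page

spa = gva_to_spa(guest_table, vmm_table, 0x10ABC)
print(hex(spa))  # 0x7fabc
```

The guest only ever fills in `guest_table`; the system physical page numbers live in `vmm_table`, which is how the guest stays isolated from system physical addresses.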
MOTIVATION: TRADITIONAL DMA IN VIRTUAL MACHINES
VIRTUALIZED SYSTEM
[Figure: Guest OS 0 and Guest OS 1 run on cores above the VMM. Core accesses go from GVA to GPA to SPA through the MMU. The device driver in Guest OS 0 sets up the IO device through the VMM, and the IO device issues DMA operations to memory with physical addresses.]
• Every DMA operation mediated by VMM
=> Often ~30% performance overhead
Virtual address translation for DMA
*SPA == “Physical Address”
INTRODUCTION OF IOMMU: THE LOGICAL VIEW
[Figure: cores reach memory through their MMUs; IO devices reach memory through the IOMMU. The IOMMU driver, running on a core, sets up the IOMMU hardware.]
IOMMU: hardware that intercepts DMA transactions
Key capabilities:
1. Memory protection for DMA
2. Virtual address translation for DMA
MOTIVATION: TRADITIONAL IO INTERRUPT
NON-VIRTUALIZED SYSTEM
[Figure: the device driver sets up the IO device with an IRQ # and a core id. The IO device then delivers the IRQ # to the APIC of that core.]
MOTIVATION: TRADITIONAL IO INTERRUPT
VIRTUALIZED SYSTEM
[Figure: the VMM sets up the IO device with an IRQ # and a core id for Guest OS 0. After Guest OS migration to another core, the interrupt still arrives at the original core and is forwarded with an inter-processor interrupt (IPI).]
• Extraneous IPI adds overheads => Each extra interrupt can add 5-10K cycles
=> Needs dynamic remapping of interrupts
MOTIVATION: TRADITIONAL IO INTERRUPT
VIRTUALIZED SYSTEM
[Figure: the VMM sets up the IO device with an IRQ # and a core id; an interrupt arrives while the Guest OS is de-scheduled.]
• Performance overheads: VMM exits on each interrupt
• Unnecessary VMM wakeup
Need to virtualize interrupts:
=> Direct interrupt delivery to guest OS and temporary queueing
INTRODUCTION OF IOMMU: THE LOGICAL VIEW
ADDING INTERRUPT HANDLING CAPABILITY
[Figure: cores reach memory through their MMUs; IO devices reach memory and the cores through the IOMMU. The IOMMU driver, running on a core, sets up the IOMMU hardware.]
IOMMU: hardware that intercepts DMA transactions and interrupts
Key capabilities:
1. Memory protection for DMA
2. Virtual address translation for DMA
3. Interrupt remapping and virtualization
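Interrupt remapping can be pictured as a lookup table that sits between the device and the cores. The sketch below is a toy model, not the layout of any real IOMMU: the class `InterruptRemapTable`, its methods and the device id, IRQ and vector values are all hypothetical.

```python
class InterruptRemapTable:
    """Toy remapping table: (device id, IRQ number) -> (core id, vector)."""

    def __init__(self):
        self._entries = {}

    def program(self, device_id, irq, core_id, vector):
        """Install or overwrite the target for one device interrupt."""
        self._entries[(device_id, irq)] = (core_id, vector)

    def deliver(self, device_id, irq):
        """Return the current target, or None if the interrupt is not mapped."""
        return self._entries.get((device_id, irq))


table = InterruptRemapTable()
table.program(device_id=0x10, irq=5, core_id=0, vector=0x41)
print(table.deliver(0x10, 5))  # (0, 65)

# The guest migrates from core 0 to core 1: software rewrites one table
# entry, and the device keeps sending the same IRQ number as before.
table.program(device_id=0x10, irq=5, core_id=1, vector=0x41)
print(table.deliver(0x10, 5))  # (1, 65)
```

Because the device never learns the core id, retargeting is a table update rather than a device reconfiguration or a forwarded inter-processor interrupt.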
MOTIVATION: EMERGENCE OF HETEROGENEOUS SYSTEMS
HETEROGENEOUS SYSTEM ARCHITECTURE (HSA)
Shared virtual addressing is key to ease of programming
[Figure: a core and an IO device (e.g., a GPU) each use the same virtual address VA0 through an MMU to reach memory.]
“Pointer-is-a-Pointer” across CPU and devices
IO needs to share CPU page table*
*Data structure that keeps VA to PA mapping
INTRODUCTION OF IOMMU: THE LOGICAL VIEW
ADDING ABILITY TO SHARE ADDRESS SPACE IN HETEROGENEOUS SYSTEM
[Figure: cores reach memory through their MMUs; IO devices reach memory and the cores through the IOMMU. The IOMMU driver, running on a core, sets up the IOMMU hardware.]
IOMMU: hardware that intercepts DMA transactions and interrupts
Key capabilities:
1. Memory protection for DMA
2. Virtual address translation for DMA
3. Interrupt remapping and virtualization
4. IO can share CPU page tables
INTRODUCTION OF IOMMU: (TYPICAL) PHYSICAL VIEW
IOMMU IS PART OF PROCESSOR COMPLEX
[Figure: a processor/chip containing cores with MMUs, an interconnect, a memory controller attached to memory, and a root complex (“IOHUB”) with the IOMMU, through which the IO devices attach.]
IOMMU FROM THE PERSPECTIVE OF DEVICE (PCIE® SPEC)
The IOMMU is the Translation Agent and uses the Address Translation and Protection Table
[Figure: the Root Complex (RC) contains the Translation Agent, whose Address Translation and Protection Table resides in memory, and a root-integrated endpoint device with its own ATC. Root ports lead, directly or through a switch, to devices that each have an ATC.]
ATC – Address Translation Cache
COMPARING CPU MMU AND IOMMU
                          | CPU MMU                     | IOMMU
Address Translation       | VA → PA and GVA → GPA → SPA | VA → PA and GVA → GPA → SPA
Memory Protection         | Read/Write etc.             | Read/Write etc.
Interrupt Handling        | No                          | Remapping and Virtualization Support
Parallelism               | Mostly Single Threaded      | Highly Multithreaded
Page Faults, Events, etc. | Synchronous Handling        | Asynchronous Handling
HISTORY
A SIMPLIFIED VIEW
V1, c. 2004: Technology created to translate and vet memory accesses by peripherals, replacing software
V1.2, c. 2006: Interrupt remapping added for IO virtualization
V2, c. 2008: Nested paging, interrupt virtualization, and improved management features added
V3, c. 2010: Features added for full heterogeneous computing and further efficiencies
Whither next?
IOMMU TECHNOLOGY FAMILIES
REFERENCES
AMD IOMMU®: IO Memory Management Unit
Intel VT-d®: Virtualization Technology for Directed IO
ARM SMMU®: System Memory Management Unit
IBM CAPI®: Coherent Accelerator Processor Interface
AGENDA
MOTIVATION & INTRODUCTION: What is IOMMU?
USE CASES & DEMONSTRATION: Where can IOMMU help?
INTERNALS: How does IOMMU work?
RESEARCH: Research Opportunities and Tools
FIVE USE CASES OF IOMMU
LEGACY I/O: Supporting legacy devices – Extending DMA “beyond reach”
SECURITY AND PROTECTION: Preventing uncontrolled memory access
SECURE BOOT: Enforcing secure boot
DIRECT I/O DEVICES: Secure and efficient IO from Guest OS
HETEROGENEOUS COMPUTING: Enabling shared virtual memory
SUPPORTING LEGACY DEVICES
HOW CAN AN IOMMU HELP?
• Many 32-bit DMA devices operate in a 64-bit system
‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), …
• SW Solution: Bounce buffers
‒ Device does DMA to a region in the 32-bit physical address range, CPU copies data from buffer to the final destination
‒ Slow, needs SW synchronization, ties up CPU core
[Figure: physical memory from 0 to 2^64-1. The device reaches only addresses up to 2^32-1; the CPU copies data between that region and the rest of memory.]
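The cost of a bounce buffer is easy to see in a toy model. The sketch below is illustrative only: memory is a sparse dictionary, and the function names and all addresses are hypothetical.

```python
LIMIT_32BIT = 1 << 32  # a 32-bit device can address only [0, 2^32 - 1]


def device_dma_write(memory, addr, data):
    """Model a DMA write by a device that can only emit 32-bit addresses."""
    if addr + len(data) > LIMIT_32BIT:
        raise ValueError(f"address {addr:#x} is beyond the device's reach")
    memory[addr] = bytes(data)


def receive_via_bounce_buffer(memory, bounce_addr, dest_addr, data):
    """Device DMAs into low memory, then the CPU copies to the real target."""
    device_dma_write(memory, bounce_addr, data)     # step 1: DMA below 4 GiB
    memory[dest_addr] = bytes(memory[bounce_addr])  # step 2: extra CPU copy
    del memory[bounce_addr]                         # bounce buffer is free again


# Sparse toy memory: start address -> payload. Addresses are hypothetical.
memory = {}
receive_via_bounce_buffer(memory, bounce_addr=0x0010_0000,
                          dest_addr=0x2_0000_0000, data=b"packet")
print(memory)  # {8589934592: b'packet'}
```

Step 2 is the part that ties up a CPU core: every payload is moved twice, once by the device and once by software.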
SUPPORTING LEGACY DEVICES
HOW CAN AN IOMMU HELP?
• Many 32-bit DMA devices operate in a 64-bit system
‒ Older PCI cards (through PCI-PCIe bridges), special-purpose controllers, parallel ports (IEEE-1284), …
• Better solution: IOMMU remaps 32-bit device physical address to system physical address beyond 32 bits
‒ DMA goes directly into 64-bit memory
‒ No CPU transfer
‒ More efficient
• Linux: DMA redirect feature
[Figure: physical memory from 0 to 2^64-1. The device issues addresses up to 2^32-1; IOMMU translation maps, for example, 0x01020304 -> 0x208090A0B0C.]
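A minimal sketch of the remapping idea follows, assuming 4 KiB pages and a single-level IO page table held in a dictionary. The function `iommu_translate`, the table contents and the page size are assumptions for illustration. Because this toy keeps the page offset, the low 12 bits of the result stay 0x304 and therefore differ from the illustrative target address on the slide.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def iommu_translate(io_page_table, device_addr):
    """Map a 32-bit device address to a system physical address."""
    page, offset = device_addr >> PAGE_SHIFT, device_addr & PAGE_MASK
    frame = io_page_table[page]  # a KeyError here stands in for an IO page fault
    return (frame << PAGE_SHIFT) | offset


# Hypothetical IO page table: device page -> system physical frame above 4 GiB.
io_page_table = {0x01020: 0x208090A0}

spa = iommu_translate(io_page_table, 0x01020304)
print(hex(spa))  # 0x208090a0304
```

The device still emits a 32-bit address; only the table entry decides that the data lands above 4 GiB, so no CPU copy is involved.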
IOMMU USE CASE: SECURITY AND PROTECTION, SECURE BOOT
SECURITY AND PROTECTION
THE TRADITIONAL IOMMU USE
• DMA devices use physical addresses on the system bus to read and write memory based on SW driver or OS instructions
• SW bugs or attacks by malicious applications could access and modify important OS data (OS security policy, passwords, …)
‒ Without the OS being able to detect or prevent the access, as it can for the CPU
‒ Latent problem until it shows up unexpectedly, possibly much later
• This affects system stability, if just the right data is hit
‒ “Heisenbugs” are sometimes caused by bugs in system drivers
• Or it allows malicious driver attacks to take over the system
[Figure: a device with DMA access to all of physical memory, including passwords and critical data as well as its I/O buffer.]
SECURITY AND PROTECTION
THE TRADITIONAL IOMMU USE
• DMA devices assert physical addresses on the system bus to read and write memory based on SW driver or OS settings
• SW bugs or attacks by malicious applications could access and modify important data (OS security policy, passwords, …)
• The IOMMU allows the OS to enforce DMA access policy for any DMA-capable device accessing physical memory
‒ Memory state important to stability/security
‒ If a disallowed access occurs, the OS gets notified and can shut the device & driver down and notify the user or administrator
[Figure: the IOMMU range-checks each DMA access from the device. Access to the I/O buffer is OK; access to passwords and critical data is blocked (X).]
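The range check and the notification path can be sketched as follows. This is a toy model under stated assumptions: the class `ToyIommu`, its `allow` and `check` methods, the event log format and all addresses are hypothetical and do not describe any real IOMMU programming interface.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIommu:
    # device id -> list of (start, end_exclusive, writable) regions
    allowed: dict = field(default_factory=dict)
    # blocked accesses, recorded so the OS can be notified
    event_log: list = field(default_factory=list)

    def allow(self, device_id, start, length, writable=False):
        """Grant a device access to one physical address range."""
        regions = self.allowed.setdefault(device_id, [])
        regions.append((start, start + length, writable))

    def check(self, device_id, addr, length, is_write):
        """Return True if the DMA access is allowed; log it otherwise."""
        for start, end, writable in self.allowed.get(device_id, []):
            in_range = start <= addr and addr + length <= end
            if in_range and (writable or not is_write):
                return True
        self.event_log.append((device_id, addr, length, is_write))
        return False


iommu = ToyIommu()
IO_BUFFER, SECRETS = 0x10000, 0x50000  # hypothetical addresses
iommu.allow(device_id=7, start=IO_BUFFER, length=0x1000, writable=True)

print(iommu.check(7, IO_BUFFER + 0x80, 64, is_write=True))  # True  (OK)
print(iommu.check(7, SECRETS, 64, is_write=False))          # False (X)
print(len(iommu.event_log))                                 # 1
```

The default is deny: a device with no entry in `allowed` cannot touch any memory, and every blocked access leaves a record the OS can act on.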
SECURE BOOT
YET ANOTHER USE FOR AN IOMMU
 Ensuring that a system is not doing more than it’s supposed to
‒ e.g., being part of a botnet, or providing banking data or other personal
info to impersonators or other attackers
‒ The earliest time for attack and defense is at firmware startup
‒ From there, critical memory regions are protected from invalid access
 The Secure Boot architecture ensures that no non-vetted OS
kernel code runs on the system and changes critical settings
 Some I/O devices can issue DMA requests to system memory
directly, without OS or firmware intervention
‒ e.g., 1394/FireWire, or network cards as part of network boot
‒ That allows attacks to modify memory before the OS even has a
chance to protect against them
 As outlined earlier, using the IOMMU prevents DMA access to
important memory regions
[Figure: boot chain from UEFI Firmware through OS Bootloader and OS kernel/drivers to Application]
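The idea of keeping early DMA away from important memory regions can be sketched as a deny-by-default filter. This is only a toy model under stated assumptions: the class `PreBootDmaGuard`, the device names, and the address ranges are hypothetical and do not describe any real firmware or IOMMU interface.

```python
class PreBootDmaGuard:
    """Toy deny-by-default DMA filter; names and ranges are hypothetical."""

    def __init__(self) -> None:
        # device name -> list of allowed [start, end) address ranges
        self.allowed: dict[str, list[tuple[int, int]]] = {}

    def allow(self, device: str, start: int, length: int) -> None:
        """Open one address range for DMA by the given device."""
        self.allowed.setdefault(device, []).append((start, start + length))

    def permits(self, device: str, addr: int, length: int) -> bool:
        """A DMA is permitted only if it lies fully inside one allowed range."""
        return any(start <= addr and addr + length <= end
                   for start, end in self.allowed.get(device, []))


guard = PreBootDmaGuard()
NETBOOT_BUFFER = 0x0010_0000   # assumed receive buffer used during network boot
KERNEL_IMAGE = 0x0100_0000     # assumed location of a critical memory region

# Firmware opens only the network-boot receive buffer for the NIC.
guard.allow("nic", NETBOOT_BUFFER, 0x1_0000)

nic_to_buffer = guard.permits("nic", NETBOOT_BUFFER + 0x200, 1500)
nic_to_kernel = guard.permits("nic", KERNEL_IMAGE, 64)
firewire_anywhere = guard.permits("1394", NETBOOT_BUFFER, 64)
print(nic_to_buffer, nic_to_kernel, firewire_anywhere)
```

In this model a device with no entry, such as the 1394 controller here, cannot reach any memory at all, which is the property that matters before the OS has started.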
IOMMU USECASE: EFFICIENT IO IN VIRTUALIZED
ENVIRONMENT
BACKGROUND: TRADITIONAL DMA BY IO
(NO SYSTEM VIRTUALIZATION)
[Figure: a device driver running on the cores sets up the IO device. The cores reach memory through MMUs, which apply a protection check and turn virtual addresses into physical addresses. The IO device sends DMA requests to memory using physical addresses.]
 Device drivers must program the true system physical memory address
 No protection from SW or hardware bugs in I/O devices and drivers
‒ e.g., system crash by writing wrong memory
 No protection from potentially malicious driver or system SW attacks
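The lack of protection described above can be illustrated with a small simulation. It is a minimal sketch, assuming a flat `bytearray` as "physical memory" and a hypothetical `device_dma_write` function standing in for the device; no real driver API is modeled.

```python
# Toy model of DMA without an IOMMU. All names here are hypothetical.

PAGE = 4096
memory = bytearray(4 * PAGE)          # stands in for system physical memory
KERNEL_PAGE = 0                       # page 0 holds critical kernel data
BUFFER_PAGE = 2                       # page 2 is the driver's DMA buffer

memory[0:6] = b"KERNEL"


def device_dma_write(phys_addr: int, data: bytes) -> None:
    """The device writes straight to physical memory; nothing checks the address."""
    memory[phys_addr:phys_addr + len(data)] = data


# Correct setup: the driver programs the buffer's true physical address.
device_dma_write(BUFFER_PAGE * PAGE, b"packet")

# Buggy (or malicious) setup: a wrong address silently corrupts kernel data.
device_dma_write(KERNEL_PAGE * PAGE, b"oops!!")

kernel_intact = bytes(memory[0:6]) == b"KERNEL"
print("kernel data intact:", kernel_intact)
```

Both writes succeed in exactly the same way, so neither the device nor memory can tell the valid transfer from the corrupting one.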
VIRTUALIZATION OF A SYSTEM IN SOFTWARE
IT HAS TO LOOK REAL TO AN OPERATING SYSTEM
 Each OS assumes full access to the platform hardware
‒ Memory, Interrupts, Devices, CPU cores, etc.
 A Virtual Machine Manager (VMM) or Hypervisor (HV) is tasked to manage
the physical hardware and define a “virtual machine” (VM) that represents
the resources an OS expects to find in the system
 Use cases:
‒ System consolidation
‒ OS/application compatibility
‒ Security / Stability
‒ Cloud Infrastructure
[Figure: a Hypervisor / VMM hosting Virtual Machines 1 to 3, each running its own Operating System and Applications]
VIRTUALIZATION OF A SYSTEM
 Most CPUs today have support for system virtualization
‒ Nested page tables (HV & OS levels), allow VMM/HV to assign and manage system
memory and interrupts to Virtual Machines
 I/O devices are typically managed by HV/VMM software, either by…
Para-Virtualization
‒ Guest device driver uses HV “hypercalls”
‒ Hypervisor manages HW operation (DMA)
‒ Hypervisor SW validates and redirects I/O requests from Guest OS (overhead, slow)
‒ Hypervisor arbitrates and schedules requests from multiple guest OS, allows VM migration
‒ Most common operation for today’s virtualization software
‒ Works well for CPU-heavy workloads, less well for I/O, graphics or compute-heavy workloads
Direct-Mapped Device & SR-IOV
‒ Device function is mapped to guest OS
‒ Guest OS uses native HW drivers
‒ Physical device DMA must be limited and redirected by Hypervisor (via IOMMU)
‒ One device function per guest OS, physical memory must be committed
‒ I/O device must be resettable by HV when guest error puts it in undefined state
‒ SR-IOV is a variant of direct mapped: I/O device provides 1 - n “virtual” devices in HW (PCI-SIG standard)
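The nested page tables (HV & OS levels) mentioned on this slide amount to two chained translations. The following is a minimal sketch with single-level dictionaries as page tables; the table contents and the function names are made up for illustration, and real hardware walks multi-level tables.

```python
PAGE_SIZE = 4096

# Guest OS page table: guest-virtual page -> guest-physical page (hypothetical values).
guest_page_table = {0x10: 0x2, 0x11: 0x3}
# Hypervisor (nested) page table: guest-physical page -> system-physical page.
nested_page_table = {0x2: 0x80, 0x3: 0x95}


def translate(page_table: dict[int, int], addr: int) -> int:
    """Translate one address through a single-level page table."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page fault at {addr:#x}")
    return page_table[page] * PAGE_SIZE + offset


def guest_virtual_to_system_physical(gva: int) -> int:
    gpa = translate(guest_page_table, gva)      # stage 1: managed by the guest OS
    return translate(nested_page_table, gpa)    # stage 2: managed by the hypervisor


spa = guest_virtual_to_system_physical(0x10 * PAGE_SIZE + 0x123)
print(hex(spa))
```

The guest OS only ever edits the first table, so the hypervisor keeps control over which system memory a VM can reach by owning the second one.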
EFFICIENT I/O VIRTUALIZATION
HARDWARE IMPLEMENTED TECHNIQUE THROUGH IOMMU
 IOMMU validates DMA accesses and validates device interrupts
[Figure: cores reach memory through their MMUs, while IO devices reach memory through the IOMMU]
EFFICIENT IO VIRTUALIZATION WITH IOMMU
WHAT ARE THE BENEFITS?
 Using the IOMMU allows a Hypervisor to assign a physical device exclusively
to a Guest VM without danger of memory corruption to other VMs
‒ Beneficial if one VM requires near native performance
‒ Or if OS needs to be “sandboxed” (because of suspected malware)
 Native driver can operate in the Guest OS
 IOMMU enforces Hypervisor policy on memory and system resource
isolation for each of the Guest Virtual Machines
 IOMMU redirects device physical address set up by Guest OS driver (= Guest
Physical Addresses) to the actual Host System Physical Address (SPA)
‒ Useful for platform resources that have “well-known” addresses like legacy devices
or system resources like APIC (Advanced Programmable Interrupt Controller)
 Allows near-native device performance for high-performance devices with
low system impact
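The redirection from Guest Physical Addresses to System Physical Addresses, together with per-guest isolation, can be sketched as a per-device lookup. This is a toy model only: `ToyIommu`, `DmaFault`, the device names, and the page numbers are hypothetical, and the real IOMMU data structures are covered in the internals part of the tutorial.

```python
PAGE_SIZE = 4096


class DmaFault(Exception):
    """Raised when the model IOMMU blocks a DMA access."""


class ToyIommu:
    def __init__(self) -> None:
        # device id -> {guest-physical page: (system-physical page, writable)}
        self.device_table: dict[str, dict[int, tuple[int, bool]]] = {}

    def assign(self, device: str, mappings: dict[int, tuple[int, bool]]) -> None:
        """Hypervisor assigns a device to a guest by installing that guest's mappings."""
        self.device_table[device] = dict(mappings)

    def translate(self, device: str, gpa: int, write: bool) -> int:
        """Check and translate one DMA access issued with a guest-physical address."""
        page, offset = divmod(gpa, PAGE_SIZE)
        entry = self.device_table.get(device, {}).get(page)
        if entry is None:
            raise DmaFault(f"{device}: no mapping for GPA {gpa:#x}")
        spa_page, writable = entry
        if write and not writable:
            raise DmaFault(f"{device}: write to read-only GPA {gpa:#x}")
        return spa_page * PAGE_SIZE + offset


iommu = ToyIommu()
# Two guests both use guest-physical page 0x1, backed by different host pages.
iommu.assign("nic-vm1", {0x1: (0x40, True)})
iommu.assign("nic-vm2", {0x1: (0x70, True), 0x2: (0x71, False)})

spa_vm1 = iommu.translate("nic-vm1", 0x1008, write=True)
spa_vm2 = iommu.translate("nic-vm2", 0x1008, write=True)
print(hex(spa_vm1), hex(spa_vm2))
```

The guest drivers program identical addresses, yet each device lands in its own guest's memory, and an address outside the installed mappings raises a fault instead of reaching another VM.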
IOMMU USECASE: ENABLING HETEROGENEOUS
COMPUTING
LEGACY GPU COMPUTE
The limiters that need to be fixed to unleash programmers:
 Multiple memory pools, multiple address spaces
 High overhead dispatch, data copies across PCIe
 New languages and APIs for GPU programming necessary (OpenCL, etc.)
‒ And sometimes proprietary environments
 Dual source development
 Expert programmers only
[Figure: CPUs attached to System Memory (Coherent) and a GPU with compute units (CUs) attached to GPU Memory (Non-Coherent), connected over PCIe™]
THE PREVIOUS APUS AND SOCS, PHYSICAL INTEGRATION
 Some memory copies are gone, because the same memory is accessed
‒ But the memory is not accessible concurrently, because of cache policies
 Two memory pools remain (cache coherent + non-coherent memory regions)
 Jobs are still queued through the OS driver chain and suffer from overhead
 Still requires expert programmers to get performance
 This is only an intermediate step in the journey
[Figure: Physical Integration. CPUs 1…N and GPU compute units CU 1…M, with System Memory (Coherent) and GPU Memory (Non-Coherent)]
AN HSA ENABLED SOC
 Unified Coherent Memory enables data sharing across all processors
 Processors architected to operate cooperatively
‒ Can exchange data “on the fly”, similar to what CPU threads do
‒ The lower job dispatch overhead allows tasks to be handled by the GPU that
previously were “too costly” to transfer over
 Designed to let the application run on different processors without
substantially changing the programming logic
[Figure: CPUs 1…N and CUs 1…M all sharing Unified Coherent Memory]
IOMMU: A BUILDING BLOCK FOR HSA
REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS
The goals of the Heterogeneous System Architecture (HSA)
and where the IOMMU helps:
 Use of accelerators as a first-class, peer processor within
the system
‒ Unified process address space access across all processors
‒ Shared Virtual Memory (SVM), “GPU ptr == CPU ptr”
‒ Accelerator operates in pageable system memory*
‒ Cache coherency between the CPU and accelerator caches
‒ User mode dispatch/scheduling reduces job-dispatch
overhead
‒ QoS with preemption/context switch of GPU Compute Units
 The IOMMU enforces control of GPU access to memory
‒ OS can efficiently and safely share process page tables with
accelerators (requires ATS/PRI protocol support)
‒ Accelerators can’t step outside of the OS-set boundaries
*with OS support & ATS/PRI
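How an accelerator can work in pageable memory with “GPU ptr == CPU ptr” can be sketched as a translate, fault, service, retry loop. This is a loose model under stated assumptions: `SharedAddressSpace` and its methods are hypothetical stand-ins for an ATS-style translation request and a PRI-style page request, and they do not reproduce the actual PCIe message formats or any OS interface.

```python
PAGE_SIZE = 4096


class SharedAddressSpace:
    """One process page table used by both the CPU and the accelerator (toy model)."""

    def __init__(self) -> None:
        self.present: dict[int, int] = {}     # virtual page -> physical page
        self.swapped_out: set[int] = set()    # valid virtual pages currently paged out
        self.next_free_page = 0x100
        self.page_requests = 0

    def os_handle_page_request(self, vpage: int) -> None:
        """Stand-in for the OS memory manager servicing a PRI-style page request."""
        if vpage not in self.swapped_out:
            raise PermissionError(f"invalid access to virtual page {vpage:#x}")
        self.swapped_out.remove(vpage)
        self.present[vpage] = self.next_free_page
        self.next_free_page += 1
        self.page_requests += 1

    def device_translate(self, vaddr: int) -> int:
        """Stand-in for an ATS-style translation request from the accelerator."""
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.present:
            self.os_handle_page_request(vpage)   # the fault is serviced, then retried
        return self.present[vpage] * PAGE_SIZE + offset


space = SharedAddressSpace()
space.present[0x7F0] = 0x20          # resident page
space.swapped_out.add(0x7F1)         # pageable memory that is not resident right now

cpu_pointer = 0x7F1 * PAGE_SIZE + 0x10   # the same pointer value the CPU would use
paddr = space.device_translate(cpu_pointer)
print(hex(paddr), space.page_requests)
```

An access to a page that is neither resident nor valid raises an error in this model, which mirrors the point that accelerators cannot step outside the OS-set boundaries.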
IOMMU: A BUILDING BLOCK FOR HSA
REDUCING THE OVERHEAD TO CALL THE GPU OR OTHER ACCELERATORS
The benefits of the Heterogeneous System Architecture:
 Pageable memory access is validated and handled
directly by the OS memory manager via AMD IOMMU
 Application data structures can be directly parsed by the
accelerator and pointer links followed without CPU help
 Common high-level languages and tools (compilers,
runtimes, …) port easily to accelerators
‒ C/C++, Python, Java, … already have open source
implementations
‒ Many more languages to follow
 The IOMMU makes it easier for programmers to use GPUs
and other accelerators safely and efficiently
EVOLUTION OF THE SOFTWARE STACK – A COMPARISON
 Goal of the software stack is to focus on high-level language support
‒ Allows software to target the GPU directly
‒ Drivers set up the hardware and policies, then get out of the way
‒ IOMMU support provides hardware-enforced protections for the operating system
[Figure: side-by-side comparison of two software stacks running on the same hardware (APUs, CPUs, GPUs), with the IOMMU shown at both the operating system and hardware levels.
Driver Stack, top to bottom: Apps; Domain Libraries; OpenCL™, DX Runtimes, User Mode Drivers; Graphics Kernel Mode Driver.
HSA Software Stack, top to bottom: Apps; HSA Domain Libraries, OpenCL™ 2.x Runtime; Task Queuing Libraries; HSA Runtime; HSA JIT; HSA Kernel Mode Driver.
Legend: user mode component, kernel mode component, components contributed by third parties.]
© Copyright 2014 HSA Foundation. All Rights Reserved.
LINES-OF-CODE AND PERFORMANCE COMPARISONS
(Exemplary ISV “Hessian” Kernel)
[Figure: stacked bar chart of lines of code (LOC, axis from 0 to 350) for seven implementations: Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt. Each bar is broken down into Init, Compile, Copy, Launch, Algorithm, and Copy-back.]
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
© Copyright 2014 HSA Foundation. All Rights Reserved.
LINES-OF-CODE AND PERFORMANCE COMPARISONS
(Exemplary ISV “Hessian” Kernel)
[Figure: the same chart with a second axis added. Left axis: LOC (0 to 350); right axis: Performance (0 to 35.00). Implementations compared: Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt.]
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
© Copyright 2014 HSA Foundation. All Rights Reserved.
ACCELERATORS: THE PORTABILITY CHALLENGE
 CPU ISAs
‒ ISA innovations added incrementally (e.g., NEON, AVX, etc.)
‒ ISA retains backwards-compatibility with previous generation
‒ Two dominant instruction-set architectures: ARM and x86
 GPU ISAs
‒ Massive diversity of architectures in the market
‒ Each vendor has its own ISA - and often several in the market at same time
‒ No commitment (or attempt!) to provide any backwards compatibility
‒ Traditionally graphics APIs (OpenGL, DirectX) provide necessary abstraction
WHAT IS HSA INTERMEDIATE LANGUAGE (HSAIL)?
 Intermediate language for parallel compute in HSA
‒ Generated by a “High Level Compiler” (GCC, LLVM, Java VM, etc.)
‒ Expresses parallel regions of code
‒ Binary format of HSAIL is called “BRIG”
‒ Goal: Bring parallel acceleration to mainstream programming languages
 IOMMU-based pointer translation is key to enabling an efficient IL
implementation
main() {
…
#pragma omp parallel for
for (int i=0;i<N; i++) {
}
…
}
[Figure: the High-Level Compiler compiles the program to Host ISA and BRIG; the Finalizer translates BRIG to the Component ISA.]
© Copyright 2014 HSA Foundation. All Rights Reserved.
MEMBERS DRIVING HSA FOUNDATION
http://www.hsafoundation.com/
Founders
Promoters
Supporters
Contributors
Academic
GEN1: FIR & AES
 FIR is a memory-intensive streaming workload
 AES is a compute-intensive streaming workload
 CL12 – cl_mem buffer
‒ Copy to/from the device
 CL20 – SVM buffer – Coarse Grain Sync
‒ Copy to/from SVM
‒ Data copy cannot be avoided, since the space for SVM is
limited
 HSA – Unified Memory Space – Fine Grained Sync
‒ Regular pointer
‒ No explicit copy
 Results
‒ HSA compute abstraction
‒ NO performance penalty
 Not all algorithms run faster
‒ Measured on Kaveri (A pre-HSA 1.0 device)
‒ Limited Coherent throughput
Saoni Mukherjee, Yifan Sun, Paul Blinzer, Amir Kavyan Ziabari, David
Kaeli, A Comprehensive Performance Analysis of HSA and OpenCL 2.0,
Proceedings of the 2016 IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS), April 2016, to appear.
BLACKSCHOLES
 C++ on HSA
‒ Matches or outperforms OpenCL
 Coarse Grained SVM
‒ Matches OpenCL buffers for
bandwidth
‒ More predictable performance
 Fine Grained SVM
‒ Faster kernel dispatch
‒ Larger allocations
‒ Shared data structure
 Results
‒ HSA compute abstraction
‒ NO performance penalty
SOURCE: RALPH POTTER – CODEPLAY. PRESENTATION MADE TO SG14 C++ WORKGROUP
ENABLING HETEROGENEOUS COMPUTING
SUMMARY AND DEMONSTRATION
 Key Takeaways:
‒ To further scale up compute performance, software must take better advantage of
system accelerators like GPUs and DSPs in high level languages
‒ Accelerators following the HSA Foundation specification requirements allow
programmers to write or port programs easily using common high level languages
‒ AMD IOMMU is key to accessing process virtual memory efficiently and safely!
‒ Translates both process address space accesses (via PASID) and device physical accesses
‒ Enforces OS allocation policy, deals with virtual memory page faults, and much more
AGENDA
MOTIVATION &
INTRODUCTION
USE CASES &
DEMONSTRATION
What is IOMMU?
Where can IOMMU help?
INTERNALS
How does IOMMU work?
RESEARCH
Research Opportunities and Tools
RECAP: IOMMU AND ITS CAPABILITIES
[Figure: cores, each with an MMU, and IO devices share memory. The IOMMU is hardware that intercepts DMA transactions and interrupts from the IO devices. The IOMMU driver, running on a core, sets up the IOMMU hardware.]
Key capabilities:
1. Memory protection for DMA
2. Virtual address translation for DMA
3. Interrupt remapping and virtualization
4. IO can share CPU page tables
AGENDA: WHAT IS COMING UP?
 DMA Address Translation
‒ Address translation and memory protection in a non-virtualized system
‒ Making address translation faster through caching
‒ Enabling a shared address space in a heterogeneous system
‒ Enabling pre-translation through IOMMU
‒ Enabling demand paging from devices (dynamic page fault)
‒ Nested address translation in a virtualized system
‒ Invalidating IOMMU mappings
 Interrupt Handling
‒ Interrupt filtering and remapping
‒ Interrupt virtualization
 Summary
‒ A peek inside a typical IOMMU implementation
‒ Data structures and their interactions
[Slide callouts: the DMA Address Translation part covers address translation, memory protection, and HSA; the Interrupt Handling part covers interrupts.]
IOMMU Internals:
Address Translation and Memory Protection
ADDRESS TRANSLATION AND MEMORY PROTECTION
NON-VIRTUALIZED SYSTEM
[Figure, built up step by step: CPU cores issue virtual addresses that their MMUs translate to physical addresses; IO devices reach memory through the IOMMU; the Device Table and Page Table reside in memory.]
 The OS defines domains and assigns IO devices to them
 Each DMA request from an IO device carries its DeviceID and a virtual address
 The IOMMU uses the DeviceID to look up the Device Table (DevID to DomID), which leads to the Page Table of the domain
 The IOMMU translates the virtual address to a physical address through the Page Table
 The request is aborted if it does not have sufficient permission
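The lookup described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the hardware data layout: the class names (`ToyIommu`, `Domain`, `PageTableEntry`, `DmaAbort`), the flat dictionary standing in for a multi-level page table, and the 4 KiB page size are all assumptions made for the example.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12  # assumption: 4 KiB pages


@dataclass
class PageTableEntry:
    frame: int              # physical frame number
    readable: bool = True
    writable: bool = False


@dataclass
class Domain:
    dom_id: int
    # Flat virtual-page -> entry map standing in for a multi-level page table.
    page_table: dict = field(default_factory=dict)


class DmaAbort(Exception):
    """Raised when the modeled IOMMU rejects a DMA request."""


class ToyIommu:
    def __init__(self):
        self.device_table = {}  # DevID -> Domain

    def attach(self, dev_id, domain):
        self.device_table[dev_id] = domain

    def translate(self, dev_id, vaddr, write):
        # Step 1: the DeviceID selects the device table entry (and its domain).
        domain = self.device_table.get(dev_id)
        if domain is None:
            raise DmaAbort(f"device {dev_id:#x} has no device table entry")
        # Step 2: the domain's page table translates the virtual page.
        pte = domain.page_table.get(vaddr >> PAGE_SHIFT)
        if pte is None:
            raise DmaAbort(f"no mapping for {vaddr:#x}")
        # Step 3: abort if the request lacks sufficient permission.
        if (write and not pte.writable) or (not write and not pte.readable):
            raise DmaAbort(f"insufficient permission for {vaddr:#x}")
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (pte.frame << PAGE_SHIFT) | offset


iommu = ToyIommu()
nic_domain = Domain(dom_id=1)
nic_domain.page_table[0x10] = PageTableEntry(frame=0x80, writable=False)
iommu.attach(dev_id=0x0300, domain=nic_domain)
read_pa = iommu.translate(0x0300, 0x10ABC, write=False)
```

In this model a read of the mapped page succeeds, while a write to the same page, or any access from a device that has no device table entry, raises `DmaAbort`.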
MAKING TRANSLATION FAST
CACHING TRANSLATION IN IOMMU
[Figure: the IOMMU contains a Device Table Entry Cache, a Translation Lookaside Buffer, and a Page Table Walker. The Device Table (indexed by DevID) and the Page Table reside in memory. Cores issue virtual addresses that their MMUs translate to physical addresses; IO devices reach memory through the IOMMU.]
IOMMU Internals:
Enabling “Pointer-is-a-Pointer” in Heterogeneous
Systems
SHARING ADDRESS SPACE WITH CPU
ENABLING POINTER AS POINTER IN HETEROGENEOUS SYSTEMS
[Figure, built up step by step: a process runs on CPU cores whose MMUs translate virtual addresses to physical addresses; a GPU and an IO device in a domain issue DMA requests with virtual addresses, which the IOMMU translates to physical addresses; the Device Table (indexed by DevID), the gCR3 table, and the page tables reside in memory.]
 The IOMMU translates GPU DMA requests through the same x86-64 Page Table that the CPU MMU uses for the process
 With more than one process (Process 0, Process 1) using the GPU, the DeviceID alone is not enough
‒ Needs ability to identify more than one address space
 Each process is identified by a PASID (Process 0: PASID 0, Process 1: PASID 1)
 Each DMA request carries DeviceID + PASID
 The DevID selects the Device Table entry; the PASID selects the entry in the gCR3 table, which leads to the page table of that process
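A minimal sketch of the DeviceID + PASID lookup, assuming a toy model in which each device table entry holds a gCR3 table and each gCR3 table entry holds a flat page table. The class name `SharedAddressSpaceIommu`, the `bind` helper, and the dictionary-based tables are hypothetical and only illustrate how the same virtual address can resolve differently per process.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class SharedAddressSpaceIommu:
    def __init__(self):
        # DevID -> gCR3 table, where a gCR3 table maps PASID -> page table.
        self.device_table = {}

    def bind(self, dev_id, pasid, page_table):
        """Register the page table of one process (PASID) for one device."""
        self.device_table.setdefault(dev_id, {})[pasid] = page_table

    def translate(self, dev_id, pasid, vaddr):
        gcr3_table = self.device_table[dev_id]    # selected by DeviceID
        page_table = gcr3_table[pasid]            # selected by PASID
        frame = page_table[vaddr >> PAGE_SHIFT]   # virtual page -> frame
        return (frame << PAGE_SHIFT) | (vaddr & PAGE_MASK)


GPU = 0x0100
iommu = SharedAddressSpaceIommu()
# Both processes use virtual page 0x40, backed by different physical frames.
iommu.bind(GPU, pasid=0, page_table={0x40: 0x1000})
iommu.bind(GPU, pasid=1, page_table={0x40: 0x2000})

pa_process0 = iommu.translate(GPU, 0, 0x40010)
pa_process1 = iommu.translate(GPU, 1, 0x40010)
```

The two requests come from the same device and use the same virtual address, yet resolve to different physical addresses because the PASID selects a different page table.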
IOMMU Internals:
Enabling Translation Caching in Devices
CACHING ADDRESS TRANSLATION IN DEVICES
ENABLING MORE CAPABLE DEVICE/ACCELERATORS
[Figure, built up step by step: CPU cores with MMUs; a GPU with a local TLB (ATC/IOTLB); an IOMMU containing a Device Table Entry Cache, a Translation Lookaside Buffer, and a Page Table walker; in memory, the Device Table (indexed by DevID) and the gCR3 table (indexed by PASID).]
 Locally caching address translations in the device reduces trips to the IOMMU
 The IOMMU driver assigns pre-translation capability to devices
‒ The Device Table entry records "Pre-translation capable?" (set to 1 for the GPU in the figure)
 Introduce new message type: Address Translation Service (ATS)
‒ ATS Req (DevID, PASID, VA, R/W), sent by the device to the IOMMU
‒ ATS Resp (PASID, VA, PA, Attr.), returned by the IOMMU to the device
 The device then issues a pre-translated DMA Req (Physical Address)
 Abort if not pre-translation capable
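The ATS exchange can be modeled roughly as follows. This is a hedged sketch, not the PCIe message format: the `Iommu` and `Device` classes, the `ats_request` and `pretranslated_dma` method names, and the set of pre-translation capable devices are assumptions chosen to mirror the slide labels.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class Iommu:
    def __init__(self, page_tables, pretranslation_capable):
        self.page_tables = page_tables                   # (DevID, PASID) -> {vpn: pfn}
        self.pretranslation_capable = pretranslation_capable  # set of DevIDs
        self.ats_requests = 0                            # counts trips to the IOMMU

    def ats_request(self, dev_id, pasid, vpn):
        """ATS Req: translate one virtual page on behalf of the device."""
        self.ats_requests += 1
        return self.page_tables[(dev_id, pasid)][vpn]    # ATS Resp carries the frame

    def pretranslated_dma(self, dev_id, paddr):
        """Accept a physical-address request only from capable devices."""
        if dev_id not in self.pretranslation_capable:
            raise PermissionError("device is not pre-translation capable")
        return paddr


class Device:
    def __init__(self, dev_id, iommu):
        self.dev_id = dev_id
        self.iommu = iommu
        self.atc = {}                                    # (PASID, vpn) -> pfn

    def dma(self, pasid, vaddr):
        key = (pasid, vaddr >> PAGE_SHIFT)
        if key not in self.atc:                          # ATC miss: ask the IOMMU
            self.atc[key] = self.iommu.ats_request(self.dev_id, *key)
        paddr = (self.atc[key] << PAGE_SHIFT) | (vaddr & PAGE_MASK)
        return self.iommu.pretranslated_dma(self.dev_id, paddr)


iommu = Iommu({(7, 0): {0x20: 0x300}}, pretranslation_capable={7})
gpu = Device(7, iommu)
first = gpu.dma(pasid=0, vaddr=0x20004)
second = gpu.dma(pasid=0, vaddr=0x20008)   # same page: served from the ATC
```

Two accesses to the same page cost a single ATS request here; the second is served from the device's own cache, which is the saving the slide describes.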
IOMMU Internals:
Enabling Demand Paging from IO
 No Need to Pin Memory
ENABLING DEMAND PAGING FROM DEVICE
SERVICING DEVICE PAGE FAULT
[Figure, built up step by step: CPU cores with MMUs; a GPU with a local TLB (ATC/IOTLB); an IOMMU containing a Device Table Entry Cache, a Translation Lookaside Buffer, and a Page Table walker; in memory, the Device Table (indexed by DevID, pre-translation bit set to 1), the gCR3 table (indexed by PASID), the PPR Log, the Command Buffer, and the Work Queue.]
 Device(s) access the local TLB (ATC/IOTLB) first
 On an (IO)TLB hit, no access to the IOMMU is needed
 On an (IO)TLB miss, the device sends an ATS Req (DevID, PASID, VA, R/W) to the IOMMU
 The Page Table walker finds no valid PTE (page fault), and the IOMMU returns an ATS Resp (NACK)
 The device sends a PPR* request (DevID, PASID, VA, R/W)
 The IOMMU records the request in the PPR Log (circular buffer), with entries of the form (PASID, dID, Addr, Flag)
‒ Fault batching possible
 The IOMMU raises an interrupt; the interrupt handler running on a core places the request on a Work Queue
 An OS worker thread services the page fault and fixes the page table
 The worker thread writes a PPR completion command to the Command Buffer
 The IOMMU sends a PPR response (DevID, PASID, VA, ..) to the device
 The device retries the original request: ATS Req (DevID, PASID, VA, R/W), then ATS Resp (PASID, VA, PA, Attr.), then DMA Req (Physical Address)
*PPR = Peripheral Page Request
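The fault-servicing sequence above can be sketched as a small software model. Everything here is an assumption for illustration: the `DemandPagingModel` class, the use of `deque` for the PPR log and the work queue, and the trivial frame allocator do not reflect the real log entry format or OS fault handler.

```python
from collections import deque


class DemandPagingModel:
    def __init__(self):
        self.page_table = {}         # (PASID, vpn) -> pfn
        self.ppr_log = deque()       # stands in for the circular PPR log
        self.work_queue = deque()    # handed to the OS worker thread
        self.next_free_frame = 0x100

    def ats_request(self, pasid, vpn):
        """Return the frame, or None to model an ATS response with NACK."""
        return self.page_table.get((pasid, vpn))

    def ppr_request(self, dev_id, pasid, vpn, write):
        """Log a peripheral page request: (PASID, dID, Addr, Flag)."""
        self.ppr_log.append((pasid, dev_id, vpn, write))

    def interrupt_handler(self):
        """Drain the PPR log into the work queue (several faults may batch)."""
        while self.ppr_log:
            self.work_queue.append(self.ppr_log.popleft())

    def worker_thread(self):
        """Service each fault by installing a mapping; return completions."""
        completed = []
        while self.work_queue:
            pasid, dev_id, vpn, _write = self.work_queue.popleft()
            self.page_table[(pasid, vpn)] = self.next_free_frame
            self.next_free_frame += 1
            completed.append((dev_id, pasid, vpn))   # PPR completion / response
        return completed


def device_access(model, dev_id, pasid, vpn, write=False):
    pfn = model.ats_request(pasid, vpn)
    if pfn is None:                                  # page fault path
        model.ppr_request(dev_id, pasid, vpn, write)
        model.interrupt_handler()
        model.worker_thread()
        pfn = model.ats_request(pasid, vpn)          # retry original request
    return pfn


model = DemandPagingModel()
faulting = device_access(model, dev_id=7, pasid=1, vpn=0x55)
repeat = device_access(model, dev_id=7, pasid=1, vpn=0x55)
```

The first access takes the fault path and gets a freshly installed frame; the repeat access finds the mapping already present, so no memory had to be pinned in advance.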
IOMMU Internals:
Nested (Two-Level) Address Translation
RECAP: ADDRESS TRANSLATION IN VIRTUALIZED SYSTEMS
[Figure: guest applications run on Guest OS 0 and Guest OS 1, which run on the Hypervisor (a.k.a. VMM), which runs on the hardware (CPU, Memory, IO).]
 Guest Virtual Address (GVA) is translated to Guest Physical Address (GPA) through the Guest Page Table (GPT)
 Guest Physical Address (GPA) is translated to System Physical Address (SPA) through the Host Page Table (HPT)
 Guest OS does not have access to (system) physical address
NESTED ADDRESS TRANSLATION BY IOMMU
[Figure, built up step by step: guest processes run on Guest OS 0 and Guest OS 1 above the VMM; the cores and their MMUs translate GVA to GPA through the GPT and GPA to SPA through the HPT; a GPU and an IO device in a domain reach memory through the IOMMU; the Device Table (indexed by DevID), the gCR3 table (indexed by PASID), the Guest Page Table(s), and the Host Page Table reside in memory.]
 The GPU issues a DMA request with a guest virtual address, tagged with Device ID + PASID
 Guest Page Table(s): identified by PASID, through the gCR3 table
 Host Page Table: identified by DevID/DomID, through the Device Table
 The IOMMU performs both translations and accesses memory with physical addresses
NESTED ADDRESS TRANSLATION BY IOMMU
[Figure: nested (two-dimensional) page walk for one GVA, with memory references numbered 1 to 24. The Device Table Entry and the PASID select the gCR3 table entry that points to the guest page table (GPT); the Device Table Entry also points to the nested/host page table (HPT). Each guest level (GL4 indexed by GVA[47:39], GL3 by GVA[38:30], GL2 by GVA[29:21], GL1 by GVA[20:12]) is reached only after a four-level nested walk (nL4, nL3, nL2, nL1) of the host page table. A final four-level nested walk, combined with the page offset GVA[11:0], yields the SPA.]
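The count of 24 memory references in the figure follows from the walk order: every guest level costs one nested walk plus one guest table read, and the final guest physical address costs one more nested walk. The short sketch below reproduces that sequence; the function name and the label strings are illustrative choices, and the four-by-four level configuration is taken from the figure.

```python
def nested_walk_references(guest_levels=4, host_levels=4):
    """List the memory references of a nested page walk, in order."""
    nested_walk = [f"nL{h}" for h in range(host_levels, 0, -1)]
    refs = []
    for g in range(guest_levels, 0, -1):
        refs += nested_walk          # translate the GPA of the guest table
        refs.append(f"GL{g}")        # read the guest page table entry
    refs += nested_walk              # translate the final GPA to an SPA
    return refs


refs = nested_walk_references()
total = len(refs)                    # (4 + 1) * (4 + 1) - 1 references
```

In general a walk with g guest levels and h host levels needs (g + 1)(h + 1) - 1 references, which is why caching translations in the IOMMU and in the device matters so much in virtualized systems.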
IOMMU Internals:
Sending Commands to IOMMU
COMMANDS TO IOMMU
 IOMMU Driver (running on CPU) issues commands to IOMMU
‒ e.g., Invalidate IOMMU TLB Entry, Invalidate IOTLB Entry
‒ e.g., Invalidate Device Table Entry
‒ e.g., Complete PPR, Completion Wait , etc.
 Issued via Command Buffer
‒ Memory resident circular buffer
‒ MMIO registers: Base, Head, and Tail register
[Figure: the IOMMU driver writes commands at the Tail of the command buffer and keeps variables holding the content of the Base, Head, and Tail registers. The IOMMU hardware fetches commands from the Head and holds the Base, Head, and Tail registers.]
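A circular command buffer with Head and Tail indices can be sketched as below. This is a generic ring buffer under stated assumptions, not the register-level interface: the `CommandRing` class, the slot-index representation of Head and Tail, and the convention of leaving one slot unused to distinguish full from empty are choices made for the example.

```python
class CommandRing:
    def __init__(self, entries):
        self.slots = [None] * entries   # memory-resident buffer (at Base)
        self.head = 0                   # next slot the hardware fetches
        self.tail = 0                   # next slot the driver writes

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot stays unused so that full and empty are distinguishable.
        return (self.tail + 1) % len(self.slots) == self.head

    def driver_write(self, command):
        """Driver side: place a command, then advance Tail."""
        if self.is_full():
            raise BufferError("command buffer full")
        self.slots[self.tail] = command
        self.tail = (self.tail + 1) % len(self.slots)

    def hardware_fetch(self):
        """Hardware side: fetch the command at Head, then advance Head."""
        if self.is_empty():
            return None
        command = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return command


ring = CommandRing(entries=4)
for name in ("invalidate_iommu_tlb", "invalidate_iotlb", "completion_wait"):
    ring.driver_write(name)
was_full = ring.is_full()
fetched = [ring.hardware_fetch() for _ in range(3)]
```

The driver only ever moves Tail and the hardware only ever moves Head, so the two sides can make progress without a lock as long as each reads the other's index.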
EXAMPLE: IOMMU TLB SHOOTDOWN
 IOMMU TLB Shootdown
‒ Update page table information
‒ Flush TLB Entry(s) containing stale information
 Three steps in IOMMU TLB shootdown
‒ Invalidating IOMMU TLB entry
‒ Invalidating IO TLB (Device TLB) entry
‒ Wait for completion
241 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016
EXAMPLE: IOMMU TLB SHOOTDOWN
IOMMU Driver
Core
MMU
Core
MMU
Command Buffer
242 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016
TLB
GPU
IO Device Device Table
Entry Cache
IOMMU
Translation
Lookaside Buffer
Page Table
walker
EXAMPLE: IOMMU TLB SHOOTDOWN
IOMMU Driver
Core
MMU
Core
MMU
TLB
GPU
IO Device Device Table
Entry Cache
IOMMU
Translation
Lookaside Buffer
Page Table
walker
Command Buffer
OpCode PASID DomID Addr Misc.
invalidate iommu tlb entry
128 bits
243 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016
EXAMPLE: IOMMU TLB SHOOTDOWN
IOMMU Driver
Core
MMU
Core
MMU
Command Buffer
244 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016
TLB
GPU
IO Device Device Table
Entry Cache
IOMMU
Translation
Lookaside Buffer
Page Table
walker
EXAMPLE: IOMMU TLB SHOOTDOWN
IOMMU Driver
Core
MMU
Core
MMU
TLB
GPU
IO Device Device Table
Entry Cache
IOMMU
Translation
Lookaside Buffer
Page Table
walker
Command Buffer
OpCode PASID DevID Addr Misc.
invalidate IO tlb entry
128 bits
245 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016
EXAMPLE: IOMMU TLB SHOOTDOWN
IOMMU Driver
Core
MMU
Core
MMU
Command Buffer
246 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016
TLB
GPU
IO Device Device Table
Entry Cache
IOMMU
Translation
Lookaside Buffer
Page Table
walker
EXAMPLE: IOMMU TLB SHOOTDOWN
IOMMU Driver
Core
MMU
Core
MMU
TLB
GPU
IO Device Device Table
Entry Cache
Translation
Lookaside Buffer
IOMMU
Page Table
walker
Command Buffer
OpCode Store
Address
Store
Data
completion wait
128 bits
247 IOMMU TUTORIAL @ ASPLOS | 3RD APRIL 2016
EXAMPLE: IOMMU TLB SHOOTDOWN
[Diagram, built up over slides 248-261: IOMMU Driver; two Cores, each with an MMU; Command Buffer; GPU with its own TLB; IO Device; IOMMU with Device Table Entry Cache, Translation Lookaside Buffer, and Page Table Walker]
Sequence shown in the animation:
1. The IOMMU Driver places commands in the Command Buffer and updates the Tail pointer.
2. Command “invalidate IOMMU tlb entry” (128 bits): OpCode | PASID | DomID | Addr | Misc. It targets the IOMMU’s Translation Lookaside Buffer; the IOMMU then updates the Head pointer.
3. Command “invalidate IO tlb entry” (128 bits): OpCode | PASID | DevID | Addr | Misc. It targets the TLB inside the device (here, the GPU); the IOMMU then updates the Head pointer.
4. Command “completion wait” (128 bits): OpCode | Store Address | Store Data. The IOMMU waits for previous commands to finish, including the ACK from the device.
5. The IOMMU stores Data to “Store Address” or raises an interrupt, then updates the Head pointer.
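The shootdown sequence in this example can be sketched as a toy model in Python. This is an illustration only, not the driver's or the hardware's actual logic: the class names, the string opcodes, and the dictionary-based TLBs are assumptions made for the sketch, and real commands are 128-bit entries whose exact layout is given in the IOMMU specification.

```python
class CommandBuffer:
    """Toy ring buffer: the driver advances tail, the IOMMU advances head."""

    def __init__(self, size=8):
        self.slots = [None] * size
        self.head = 0  # next entry the IOMMU will fetch
        self.tail = 0  # next free entry for the driver

    def push(self, command):
        nxt = (self.tail + 1) % len(self.slots)
        if nxt == self.head:
            raise RuntimeError("command buffer full")
        self.slots[self.tail] = command
        self.tail = nxt  # "Update Tail pointer"

    def pop(self):
        command = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)  # "Update Head pointer"
        return command


class ToyIommu:
    """Hypothetical stand-in for the IOMMU plus one device-side TLB."""

    def __init__(self):
        self.iommu_tlb = {}   # (dom_id, addr) -> physical address
        self.device_tlb = {}  # (dev_id, addr) -> physical address (e.g. the GPU's TLB)
        self.memory = {}      # store address -> data written by completion wait

    def drain(self, cb):
        # Commands run strictly in order in this model, so by the time
        # completion wait executes, every earlier invalidation has finished.
        while cb.head != cb.tail:
            opcode, a, b = cb.pop()
            if opcode == "INVALIDATE_IOMMU_TLB":
                self.iommu_tlb.pop((a, b), None)
            elif opcode == "INVALIDATE_IO_TLB":
                self.device_tlb.pop((a, b), None)
            elif opcode == "COMPLETION_WAIT":
                self.memory[a] = b  # store Data to "Store Address"
            else:
                raise ValueError(opcode)


def shootdown(cb, iommu, dom_id, dev_id, addr, store_addr, store_data):
    """Queue the three commands, let the IOMMU run, then check the marker."""
    cb.push(("INVALIDATE_IOMMU_TLB", dom_id, addr))
    cb.push(("INVALIDATE_IO_TLB", dev_id, addr))
    cb.push(("COMPLETION_WAIT", store_addr, store_data))
    iommu.drain(cb)
    return iommu.memory.get(store_addr) == store_data
```

The driver only learns that both invalidations are done when the completion-wait marker appears in memory, which is why the marker command goes last in the queue.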
IOMMU INTERNALS: INTERRUPT REMAPPING AND VIRTUALIZATION
INTERRUPT REMAPPING
[Diagram, built up over slides 263-268: two Cores, each with an APIC and an MMU; Memory; IO Devices; IOMMU with Device Table Entry Cache, Interrupt Remapping Lookaside Buffer, and Table Walker]
Sequence shown in the animation:
1. An IO Device sends a Fixed/Arbitrated interrupt, which reaches the IOMMU.
2. The IOMMU uses the DevID (diagram labels: DevID, Did) to look up the Device Table in Memory.
3. The Device Table entry leads to the Interrupt Remapping Table in Memory.
4. Abort request if not sufficient permission.
5. Otherwise the remapped interrupt continues toward the Cores’ APICs.
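The lookup chain in the interrupt remapping diagrams (DevID, Device Table, Interrupt Remapping Table, abort on insufficient permission) can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout, the field names `valid`, `int_table_root`, `remap_enabled`, `destination`, and `vector`, and the use of `None` for an aborted request are choices made for the sketch, loosely following the field labels on the Device Table Entry and Interrupt Remapping Table Entry slides.

```python
def remap_interrupt(dev_id, irq_index, device_table, remap_tables):
    """Return (destination, vector) for a remapped interrupt, or None if aborted."""
    dte = device_table.get(dev_id)
    if dte is None or not dte["valid"]:
        return None  # unknown or invalid device: abort the request

    table = remap_tables[dte["int_table_root"]]
    if irq_index >= len(table):
        return None  # index outside the table: abort the request

    entry = table[irq_index]
    if not entry["remap_enabled"]:
        return None  # no permission to raise this interrupt: abort

    return (entry["destination"], entry["vector"])
```

Because the device only supplies an index, it cannot choose the destination core or the vector on its own; both come from a table that system software controls.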
INTERRUPT VIRTUALIZATION
[Diagram, built up over slides 269-284: Guest OS 0 with a vAPIC, above the VMM; two Cores, each with an APIC and an MMU; Memory; IO Devices; IOMMU]
Sequence shown in the animation:
1. An IO Device sends a guest virtualized interrupt to the IOMMU.
2. The IOMMU uses the DevID (diagram labels: DevID, Did) to look up the Device Table, then the Interrupt Remapping Table, whose entry is in Guest Mode.
3. Abort request if not sufficient permission.
4. The interrupt is recorded in the Guest vAPIC backing page in Memory.
5. If the Interrupt Remapping Table entry says Guest Running, the running guest receives the interrupt.
6. If the entry says Guest NOT Running (Inactive Guest), an entry is written to the Guest vAPIC Log in Memory.
7. The VMM activates the target guest (“Activate Target Guest”), and the interrupt is then delivered to the guest’s vAPIC.
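The two cases in the interrupt virtualization animation (guest running versus guest not running) can be summarized in a short sketch. It is a simplified model, not the hardware algorithm: the field names mirror labels on the slides (guest mode, remap enabled, Guest vAPIC Root Pointer, Guest vAPIC Tag, Guest Running), while the dictionary and set representations and the returned strings are assumptions made for illustration.

```python
def virtualize_interrupt(entry, backing_pages, vapic_log):
    """Model one guest virtualized interrupt; returns what happened to it."""
    if not (entry["remap_enabled"] and entry["guest_mode"]):
        return "not-virtualized"

    # Record the pending vector in the guest's vAPIC backing page.
    page = backing_pages.setdefault(entry["vapic_root"], set())
    page.add(entry["vector"])

    if entry["guest_running"]:
        return "delivered"  # the running guest picks it up

    # Guest not running: log it so the VMM can activate the target guest.
    vapic_log.append((entry["vapic_tag"], entry["vector"]))
    return "logged"
```

In the not-running case the VMM would later read the log, schedule the guest identified by the tag, and the guest would then find the vector already pending in its backing page.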
IOMMU INTERNALS: A TYPICAL IOMMU HARDWARE DESIGN
EXAMPLE OF IOMMU HARDWARE DESIGN
[Block diagram: CPU, Memory Controller, and DRAM; an IOHUB to which three Devices attach, each through its own L1 TLB; the IOMMU with a Table Walker and L2 caches labeled DTC, ITC, gPDC, gPTC, nPDC, and nPTC]
CACHE SIZING VS PRODUCT TYPE
 Typical Client Product
‒ Non-Virtualized
‒ I/O Isolation
‒ Small Working Set
 Typical Server Product
‒ Virtualized
‒ Large Working Set
[Diagram for each product type: L1 TLB and the L2 caches DTC, ITC, gPDC, gPTC, nPDC, nPTC]
IOMMU INTERNALS: SUMMARY OF KEY DATA STRUCTURES
IOMMU’S KEY DATA STRUCTURES
[Diagram: base registers in the IOMMU point to structures in DRAM]
‒ Device Table Base Register → Device Table
‒ Device Table → GCR3 Table → Guest Page Tables
‒ Device Table → Host Page Tables
‒ Device Table → Interrupt Remap Table → Guest Virtual APIC Backing Page
‒ Guest vAPIC Log Base Register → Guest Virtual APIC Log
‒ Command Buffer Base Register → Command Buffer
‒ Event Log Base Register → Event Log
‒ Page Request Log Base Register → Peripheral Page Request Log
DEVICE TABLE ENTRY
Each entry is 32B and contains:
‒ valid entry
‒ IOTLB Enable
‒ Interrupt info: Interrupt Table Root Pointer, Legacy Interrupt Permission
‒ domainID
‒ guest translation info: GCR3 Table Root Pointer, Guest Levels translated
‒ host translation info: Page Mode, Host Page Table Root Pointer
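Since each Device Table entry is 32B, locating a device's entry is simple arithmetic. The sketch below assumes the table is indexed by a 16-bit DevID built from the PCI bus, device, and function numbers; the helper names and the example base address are hypothetical.

```python
DTE_SIZE = 32  # bytes per Device Table entry, as stated on the slide


def device_id(bus, dev, fn):
    """16-bit PCI-style identifier: 8-bit bus, 5-bit device, 3-bit function."""
    return (bus << 8) | (dev << 3) | fn


def dte_address(table_base, dev_id):
    """Address of the entry for dev_id, given the Device Table base address."""
    return table_base + dev_id * DTE_SIZE


# A table covering every possible 16-bit DevID needs 65536 * 32 bytes.
FULL_TABLE_BYTES = (1 << 16) * DTE_SIZE
```

For instance, with a (made-up) table base of 0x100000, the device at bus 0, device 2, function 0 has DevID 0x10 and its entry sits 0x200 bytes into the table.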
INTERRUPT REMAPPING TABLE ENTRY
Each entry is 128b. Two modes:
‒ Interrupt Remapping (guest mode=0): vector, destination, guest mode (0), remap enabled
‒ Interrupt Virtualization (guest mode=1): Guest vAPIC info, vector, destination, guest mode (1), remap enabled
Guest vAPIC info:
‒ Guest vAPIC Root Pointer
‒ Guest vAPIC Tag
‒ Guest Running
AGENDA
MOTIVATION & INTRODUCTION: What is IOMMU?
USE CASES & DEMONSTRATION: Where can IOMMU help?
INTERNALS: How does IOMMU work?
RESEARCH: Research Opportunities and Tools
RESEARCH DIRECTIONS
 Isolation from malicious or buggy third party accelerators
‒ Can IOMMU ensure protection in the presence of untrusted accelerators?
 Specializing IOMMU for performance and power
‒ Can IOMMU hardware exploit the predictable access patterns of some accelerators?
 Trading memory protection for performance
‒ Can selectively lowering protection enable better performance?
 Extending (limited) virtual memory to embedded accelerators
‒ Can we design an IOMMULITE for embedded low-power accelerators?
 Avoiding interference in the IOMMU
‒ How to reduce interference among multiple devices accessing IOMMU?
ISOLATION FROM THIRD PARTY ACCELERATORS
EMERGENCE OF 3RD PARTY ACCELERATORS
[Diagram: two Cores, each with an MMU; Memory; Accelerators behind the IOMMU. The 1st party (trusted) accelerator is replaced by a 3rd party (un-trusted) accelerator.]
Q: How to integrate third party accelerators efficiently and securely?
 How to determine if a device is trustworthy and remains trustworthy?
 It may not be possible to verify that a 3rd party accelerator is not buggy.
ISOLATION FROM THIRD PARTY ACCELERATORS (CNTD.)
[Diagram: the 3rd party (un-trusted) accelerator has its own TLB and Caches; it issues physical addresses and exchanges coherence traffic with the Cores’ caches.]
Performance consideration:
1. TLBs in accelerator → Possible to bypass IOMMU
2. Coherent caches in accelerator → Coherence traffic bypasses IOMMU
Related work: Olson et al. “Border Control” in MICRO’15 [OLSON’15]
‒ Idea: Check every access that carries a physical address for validity.
[Diagram: a block labeled BC is added on the accelerator’s path to Memory.]
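The idea of checking every physical-address access can be illustrated with a small permission table keyed by physical page number. This is only a sketch loosely inspired by the idea stated on the slide; it is not the Border Control design from [OLSON’15], and the class name, the 4 KiB page size, and the grant/revoke interface are assumptions made here.

```python
PAGE_SHIFT = 12  # assume 4 KiB pages for this sketch


class PhysicalAccessChecker:
    """Hypothetical per-accelerator table of allowed physical pages."""

    def __init__(self):
        self.perms = {}  # physical page number -> (can_read, can_write)

    def grant(self, paddr, read=False, write=False):
        # Called when a translation is handed to the accelerator.
        ppn = paddr >> PAGE_SHIFT
        old_r, old_w = self.perms.get(ppn, (False, False))
        self.perms[ppn] = (old_r or read, old_w or write)

    def revoke(self, paddr):
        # Called when the mapping is torn down.
        self.perms.pop(paddr >> PAGE_SHIFT, None)

    def allows(self, paddr, is_write):
        # Every access from the untrusted accelerator is checked here.
        can_read, can_write = self.perms.get(paddr >> PAGE_SHIFT, (False, False))
        return can_write if is_write else can_read
```

With such a check in place, an accelerator that caches translations or data can still run at full speed on pages it was granted, while a forged physical address outside those pages is rejected.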
SPECIALIZING IOMMU FOR DEVICE/ACCELERATOR
 IOMMU designs resemble CPU MMU designs
‒ But device/accelerator access patterns differ from the CPU’s
 IOMMU caters to disparate devices
‒ A single design point may not be optimal for all
‒ e.g., the access pattern from a GPU is likely different from a NIC’s
Study traffic patterns to the IOMMU and specialize for common patterns
 Related work: Malka et al.’s “rIOMMU” in ASPLOS’15 [MALKA’15]
‒ Idea: Exploit predictable IOMMU accesses from devices using circular ring buffers
‒ Replace page table with circular, flat table → Easy page walk
‒ Predictable access → single entry IOTLB with no TLB miss and less invalidation
 Possible to use device-specific knowledge to optimize performance
‒ IOMMU prefetching and TLB caching hints can be useful
‒ Replacement policy coordination between IOTLB (Device TLB) and IOMMU TLB
‒ Energy/power optimization in IOMMU
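To see why a circular, flat table makes the page walk easy, consider the following sketch. It is an illustration of the general idea only and does not reproduce the rIOMMU design: the class, its methods, and the (index, offset) form of a device address are assumptions made for this example.

```python
class FlatRingTable:
    """Hypothetical flat, circular translation table for one ring buffer."""

    def __init__(self, size):
        self.entries = [None] * size  # each entry: (physical base, length)
        self.tail = 0                 # next slot to fill, wraps around

    def map(self, phys_base, length):
        idx = self.tail
        self.entries[idx] = (phys_base, length)
        self.tail = (self.tail + 1) % len(self.entries)
        return idx  # the device addresses the buffer as (idx, offset)

    def unmap(self, idx):
        self.entries[idx] = None

    def translate(self, idx, offset):
        # One array lookup replaces a multi-level page table walk.
        entry = self.entries[idx]
        if entry is None:
            raise LookupError("no mapping at index %d" % idx)
        base, length = entry
        if not 0 <= offset < length:
            raise ValueError("offset outside the mapped buffer")
        return base + offset
```

Because a ring-buffer device consumes entries in order, the next index to be translated is known in advance, which is what makes a very small IOTLB plausible for this access pattern.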
TRADING PROTECTION FOR PERFORMANCE
 IOMMU hardware allows lowering protection for performance
‒ For example: pre-translated DMA transactions pass through the IOMMU
‒ A trusted IO device can then access any address and can even cause interrupt storms
 OS policies for trading off protection against performance
‒ Should the sysadmin decide how much to trust a device/driver?
‒ Exposing software knobs for dialing performance vs. protection
‒ Related work: OS policies for Strict vs Deferred protection strategy [WILLMANN’08, BEN-YEHUDA’07, AMIT’11]
‒ ASPLOS’16: Strict, sub-page grain protection through Shadow DMA-buffer [MARKUZE’16]
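The Strict vs Deferred trade-off can be made concrete with a small counting model. It assumes that a strict policy issues one invalidation per unmap, while a deferred policy batches unmaps and flushes once per batch; the function name and the batch-size parameter are hypothetical, and real policies differ in detail.

```python
def simulate_unmaps(num_unmaps, batch_size):
    """Return (flushes issued, max unmapped-but-not-yet-invalidated entries).

    batch_size == 1 models the strict policy; larger values model deferral.
    """
    flushes = 0
    pending = 0      # unmapped buffers whose translations may still be cached
    max_exposed = 0  # worst-case number of stale translations left in place
    for _ in range(num_unmaps):
        pending += 1
        if pending >= batch_size:
            flushes += 1
            pending = 0
        max_exposed = max(max_exposed, pending)
    if pending:
        flushes += 1  # final flush for the leftover partial batch
    return flushes, max_exposed
```

With 100 unmaps, the strict setting issues 100 invalidations and never leaves a stale translation behind, while a batch size of 32 issues 4 invalidations but leaves up to 31 stale translations that a device could still use.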
IOMMULITE FOR EMBEDDED LOW-POWER ACCELERATORS
 Virtual memory eases programming (e.g., “pointer-is-pointer”)
‒ But comes at a performance and energy cost
 Stripped-down IOMMU for ultra low-power accelerators
‒ Lower hardware, performance, and power cost by stripping non-essential features
‒ Example “non-essential” features: IO virtualization support, Interrupt remapping, Page fault handling, Nested page table walker, etc.
 Related work:
‒ Vogel et al.’s “Lightweight Virtual Memory” in CODES’15 [VOGEL’15]
‒ Idea: Software managed IOMMU for FPGA → No translation miss handling in hardware
‒ Simple design, high performance with effective software management
AVOIDING (DESTRUCTIVE-) INTERFERENCE IN IOMMU
[Diagram: two Cores, each with an MMU; a GPU and a NIC; Memory. The Cores issue virtual addresses to their MMUs; the GPU sends virtual addresses and the NIC sends DMA requests to the one IOMMU; the MMUs and the IOMMU send physical addresses to Memory.]
IOMMU is a shared resource
 How to model contention in IOMMU?
 How to guarantee Quality-of-Service in IOMMU?
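A toy experiment shows how one device can destroy another's locality in a shared translation cache. The sketch below assumes a single fully associative, LRU-managed TLB shared by all devices; the access traces and the capacity of 4 entries are made up for illustration and do not describe any real IOMMU.

```python
from collections import OrderedDict


def hit_rate(accesses, capacity):
    """Fraction of accesses that hit in a shared LRU cache of translations."""
    tlb = OrderedDict()
    hits = 0
    for key in accesses:
        if key in tlb:
            hits += 1
            tlb.move_to_end(key)         # mark as most recently used
        else:
            if len(tlb) >= capacity:
                tlb.popitem(last=False)  # evict the least recently used entry
            tlb[key] = True
    return hits / len(accesses)


# The GPU loops over 4 pages; the NIC streams through pages it never reuses.
gpu_only = [("gpu", page) for _ in range(10) for page in range(4)]
mixed = []
for i in range(40):
    mixed.append(("gpu", i % 4))
    mixed.append(("nic", i))
```

Alone, the GPU trace misses only on its first pass over the 4 pages. Interleaved with the NIC's streaming accesses, every GPU page is evicted before it is reused, so the shared cache delivers no hits at all in this trace.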
RESEARCH: TOOLS AND MODELING
 Software research: IOMMU driver/OS policies
‒ Easy! Open source IOMMU driver in Linux
 Hardware research: Modifying IOMMU hardware behavior
‒ Option 1: Hardware performance counters + analytical models
‒ Option 2: Simulator with IOMMU model
‒ Work in progress to add an IOMMU model to gem5
‒ Write your email on the attendance sheet if interested
SUMMARY
[Diagram: two Cores, each with an MMU; Memory; two IO Devices behind the IOMMU]
IOMMU (kernel-mode) Driver: configures and sets up the IOMMU hardware
IOMMU: Hardware that intercepts DMA transactions and interrupts
Important roles:
1. Memory protection from rogue devices
2. Shared virtual memory to devices
3. I/O virtualization – direct I/O
4. Supporting legacy I/O, Secure boot
REFERENCES
 IOMMU specification: http://support.amd.com/TechDocs/48882_IOMMU.pdf
 OLSON’15: Lena Olson et al. “Border Control: Sandboxing Accelerators”, MICRO 2015.
 AMIT’11: Nadav Amit et al. “vIOMMU: Efficient IOMMU Emulation”, USENIX ATC 2011.
 BEN-YEHUDA’07: Muli Ben-Yehuda et al. “The Price of Safety: Evaluating IOMMU Performance”, OLS 2007.
 MALKA’15: Moshe Malka et al. “rIOMMU: Efficient IOMMU for I/O Devices That Employ Ring Buffers”, ASPLOS 2015.
 WILLMANN’08: Paul Willmann et al. “Protection Strategies for Direct Access to Virtualized I/O Devices”, USENIX ATC 2008.
 VOGEL’15: Pirmin Vogel et al. “Lightweight virtual memory support for many-core accelerators in heterogeneous embedded SoCs”, CODES 2015.
 MARKUZE’16: Alex Markuze et al. “True IOMMU Protection from DMA Attacks”, ASPLOS 2016.
QUESTIONS AND FEEDBACK
 Reachable @
‒ Arka Basu: Arkaprava “dot” Basu “at” amd.com
‒ Andy Kegel: Andrew “dot” Kegel “at” amd.com
‒ Paul Blinzer: Paul “dot” Blinzer “at” amd.com
‒ Maggie Chan: Maggie “dot” Chan “at” amd.com
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES,
ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). OpenCL is a
trademark of Apple Inc. used by permission by Khronos. ARM ® is/are the registered trademark(s) of ARM Limited in the EU and other countries. PCIe® is a
registered trademark of PCI-SIG corporation. Other names are for informational purposes only and may be trademarks of their respective owners.