Download ppt

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Security-focused operating system wikipedia , lookup

Unix security wikipedia , lookup

Spring (operating system) wikipedia , lookup

DNIX wikipedia , lookup

Process management (computing) wikipedia , lookup

Distributed operating system wikipedia , lookup

Transcript
Kernel-Kernel communication
in a shared-memory
multiprocessor
By
ELISEU M. CHAVES,* JR., PRAKASH CH. DAS,* THOMAS J.
LEBLANC, BRIAN D. MARSH* AND
MICHAEL L. SCOTT
Introduction
• Computer architecture can influence
multiprocessor operating system design
–
–
–
–
How to distribute kernel functionality across processors
Inter-Kernel Communication
Data Structure and Location
How to maintain Data Integrity and provide fast access
Introduction
• Computer architecture can influence
multiprocessor operating system design
–
–
–
–
How to distribute kernel functionality across processors
Inter-Kernel Communication
Data Structure and Location
How to maintain Data Integrity and provide fast access
• Some Examples...
Introduction
• Bus-based shared memory multiprocessors
– Kernel code shared for all processors
– Typically these have uniform memory access
– Use explicit synchronization to control access to data
structures (critical sections, locks, spin-locks, semaphores)
• Advantages
– Easy to port an existing uniprocessor kernel and add explicit
synchronization
– Decent performance
– Simplified load balancing and global resource management
• Disadvantages
– Good for smaller machines but not very scalable
– Increased contention leads to performance degradation
Introduction
• Distributed systems & Distributed memory computers
– Kernel data distributed among processors
– Each processor executes a copy of the kernel
– Local data operated on directly, but remote data is accessed
via remote invocation
– Uses implicit synchronization to control access to data
– Hypercubes & Mesh-connected machines
• Advantages
– Works well for machines with no shared memory
– Modular design gives increased fault protection
– Can use non-preemption for bulk of synchronization
• Disadvantages
– Reliance on a communication network to send invocations
– Security of data while in the network
Large Scale Shared Memory Multiprocessors
• Memory can be accessed directly, but at varying cost
(NUMA)
• Some with coherent cache between processors
– ( ccNUMA )
• Large scale shared memory multiprocessors have properties
common to both
– Bus-based systems
• Which use Remote access
– Distributed memory multi-computers
• Which use Remote invocation
• Kernel Designers must choose between
– Remote ACCESS
&
– Remote INVOCATION
• Tradeoff's between Remote Access/Invocation remain
largely unknown (circa 1993)
Target Parameters of this Research
• General purpose parallel operating system
–
–
–
–
–
–
–
Shared memory access to all processors
Full range of kernel services
Reasonable use of all processors
Arrange system using collection of nodes
Possibly with caches?
NUMA architecture
No cache coherence
Kernel-Kernel Communication Options
• Remote Access
– execution on node i, reading and writing memory on node
j’s memory when necessary
• Remote Invocation
– Processor at node i sends a message to node j’s
processor requesting an operation on i’s behalf
• Bulk Data Transfer
– Move data required from node j to node i
• Specific request to kernel
• Re-Map memory from node j to node i
• Data migration
– Cache coherence blurs distinction with remote access
Types of Remote Invocation [1/2]
• Interrupt Level
– Request executes in the interrupt handler “bottom half”
– Doesn’t execute in the context of a ‘process’
• Can’t sleep
• Can’t call schedule
• Can’t transfer data to/from user space
– Interrupts enabled during execution to allow top half to
continue handling other interrupts
– Mutual exclusion achieved through ‘masking interrupts’
• Course granularity can unnecessarily prohibit too many
other invocations
– Prohibits outgoing invocations when incoming invocations
are disabled. (to avoid deadlock)
• Limiting for accessing data on two different processors
– Yes, lots of limitations, but very fast!
Types of Remote Invocation [2/2]
• Kernel Process Level
– Normal “Top Half” activity
– From user space - interrupt handler performs an
asynchronous trap & remote invocation executes
– From kernel space – interrupt handler queues the invocation
for execution when kernel becomes idle, or when control
returns to user space
– Has process context so requested operation free to block
during its execution
– Process holding semaphores or scheduler-based locks can
perform process-level RI’s
– Deadlock still possible, but can be avoided
• Requires requesting process to block when making RI
• Busy waiting would lock out other incoming process-level RI’s
Coexistence is possible
•
•
Remote access and interrupt-level RI on the same data structure
Deadlock-avoidance rule: must always be possible to perform incoming
invocations wile waiting for outgoing invocation
Trade-Offs
• Direct costs of remote operations
– Latency of the operation
– (R-1)*n < C
• R: remote to local access time ratio
• n: number of memory accesses
• C: overhead of remote invocation
– If C is greater, implement Operation using Remote Access
– Notes:
• In general, C is fixed so use RI for longer operations
• Parameter passing not included in calculation
Trade-Offs
• Indirect costs for local operations
– RI may need context of invoker as parameter
– Operations arranged to increase node locality??
• Access between invoker/target processor not interleaved??
– Process level RI for all remote accesses may allow data
structure implementation without explicit synchronization
• Uses lack of pre-emption within kernel to provide implicit
synchronization
– Avoiding Explicit Synchronization
• Can improve speed of remote and local operations because
of the high cost of explicit synchronization
– Remote Memory Access may require large portions of
kernel data space on other processors must be mapped
into each instance of the kernel
• May not scale well to large machines
• Mapping on demand not a good option
Trade-Offs
• Competition for processor and memory cycles
– When do operations serialize
• Remote Invocation – on processor that executes them
– Serialize whether data is common or not
• Remote Memory Access – at the memory
– more chance of parallel computation because operations do
more than just access data
– If competing for locked data operations may serialize
– Tolerance for Contention
• Remote Memory Access has slightly higher throughput
because of parallel operation possibility
• However, this can introduce greater variance of completion
time
Trade-Offs
• Compatibility with conceptual kernel model
– Procedure based kernel
•
•
•
•
Gives uniform model for user and kernel processes
Closely mimics hardware of UMA multiprocessor
Minimizes context switching
Fits nicely with Remote Memory Access
– Message based kernel
• Comparmentalization of kernel
• Synchronization is subsumed by message passing
• Closely mimics hardware of distributed-memory
multicomputer
• Easier to debug
• Remote Invocation more appropriate
Case Study
• Psyche on the BBN butterfly
–
–
–
–
–
12:1
6.88µs
0.518µs
4.27µs
0.398µs
– 56µs
– 421µs
Remote-to-Local memory access time ratio
Read 32-bit word remote memory location
Read 32-bit word local memory location
Write 32-bit word remote memory location
Write 32-bit word local memory location
Average Latency Interrupt level Remote Invocation
Average latency Process level Remote Invocation
Latency of Kernel Operations
Local Access
Operation
locking

Remote Access
Remote Invocation
On
Off
On
Off
On
Off
42.4
21.6
247
154
197
174
25.0
16.1
131
87.6
115
96.7
Find last in list of 10
40.6
30.5
211
169
125
105
Create segment
6.20
5.69
14.8
13.1
7.42
6.88
0.96
0.86
3.05
2.62
1.94
1.77
1.43
1.35
3.30
3.04
1.89
1.75
Enqueue + dequeue
Find last in list of 5
Map segment
Create process
(µs)
(ms)
Times in columns represent cost of operations:
• col 1 & 3 measurements for original Psyche kernel
• col 2 synchronization without context switching
• col 4 subsumed in other operation with course grain locking
• col 6 cost if data accessed were always via remote
invocation – no explicit synchronization required
• col 5 cost of remote invocations in a hybrid kernel
Interrupt level RI
Process level RI
% Latency from locking
• Cost of synchronization
• Measure Locking OFF vs ON
Remote Access Penalty
• NL RA(off) - local(off)/ RA(off)
• L
RA(off) – local(off)/RA(on)
• Overhead of explicit synchronization and remote access is a
function of complexity
• Example
– Interrupt level RI can be justified to avoid 11 remote refs
– 6µs memory access cost & 60µs overhead of RI interrupt
level
RI time as percentage of RA time
•
RA: locking in the case of remote access
– column 6 as % of column 3
•
B: locking in the case of both
– column 5 as % of column 3
– Simulates Hybrid kernel
•
N:
locking in the case of neither
– column 6 as % of column 4
– Simulates desired operation being subsumed by other operation
with coarse-grained locking
Not so fast
• Not such a clear win for Remote Invocation
– Experiments represent extreme cases
• Simple enough to perform via interrupt level RI
• Complex enough to absorb overhead of process level RI
– Medium size operations might not accept limitations of
interrupt level RI or the overhead of process level RI
• Throughput for tiny operations
– Remote memory accesses steal busy cycles from
processor where memory is being used
– Remote invocations steal the entire processor for the
entire operation
– Experiments show Interrupt level RI negatively affects
throughput of the remote processor & remote access
does not
Conclusions
• Good Kernel design must consider
–
–
–
–
Cost of remote invocation mechanism
Cost of atomic operations and synchronization
Ratio of remote to local access time
Breakeven points depend on the number of remote
references in any given Operation
• Interrupt level RI has small number of uses
– TLB shootdown
– Console I/O
• Results could apply for ccNUMA machines as well
• Risks of putting data structures in interrupt level
of the kernel might be a reason to use remote
access instead for small operations
REFERENCES
•
•
•
Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and
John S. Quarterman “The Design and Implementation of the 4.3BSD
UNIX Operating System” Addison Wesley Publishing Company
Andrew S. Tannenbaum “Modern Operating Systems 2nd Edition”
Prentice-Hall of India
Alessandro Rubini, Jonathan Corbet, “Linux Device Drivers 2nd
Edition” O’Reilly