Download ppt

Kernel-Kernel communication in a shared-memory multiprocessor By ELISEU M. CHAVES,* JR., PRAKASH CH. DAS,* THOMAS J. LEBLANC, BRIAN D. MARSH* AND MICHAEL L. SCOTT Introduction • Computer architecture can influence multiprocessor operating system design – – – – How to distribute kernel functionality across processors Inter-Kernel Communication Data Structure and Location How to maintain Data Integrity and provide fast access Introduction • Computer architecture can influence multiprocessor operating system design – – – – How to distribute kernel functionality across processors Inter-Kernel Communication Data Structure and Location How to maintain Data Integrity and provide fast access • Some Examples... Introduction • Bus-based shared memory multiprocessors – Kernel code shared for all processors – Typically these have uniform memory access – Use explicit synchronization to control access to data structures (critical sections, locks, spin-locks, semaphores) • Advantages – Easy to port an existing uniprocessor kernel and add explicit synchronization – Decent performance – Simplified load balancing and global resource management • Disadvantages – Good for smaller machines but not very scalable – Increased contention leads to performance degradation Introduction • Distributed systems & Distributed memory computers – Kernel data distributed among processors – Each processor executes a copy of the kernel – Local data operated on directly, but remote data is accessed via remote invocation – Uses implicit synchronization to control access to data – Hypercubes & Mesh-connected machines • Advantages – Works well for machines with no shared memory – Modular design gives increased fault protection – Can use non-preemption for bulk of synchronization • Disadvantages – Reliance on a communication network to send invocations – Security of data while in the network Large Scale Shared Memory Multiprocessors • Memory can be accessed directly, but at varying cost (NUMA) • Some with coherent cache between processors – ( ccNUMA ) • Large scale shared memory multiprocessors have properties common to both – Bus-based systems • Which use Remote access – Distributed memory multi-computers • Which use Remote invocation • Kernel Designers must choose between – Remote ACCESS & – Remote INVOCATION • Tradeoff's between Remote Access/Invocation remain largely unknown (circa 1993) Target Parameters of this Research • General purpose parallel operating system – – – – – – – Shared memory access to all processors Full range of kernel services Reasonable use of all processors Arrange system using collection of nodes Possibly with caches? NUMA architecture No cache coherence Kernel-Kernel Communication Options • Remote Access – execution on node i, reading and writing memory on node j’s memory when necessary • Remote Invocation – Processor at node i sends a message to node j’s processor requesting an operation on i’s behalf • Bulk Data Transfer – Move data required from node j to node i • Specific request to kernel • Re-Map memory from node j to node i • Data migration – Cache coherence blurs distinction with remote access Types of Remote Invocation [1/2] • Interrupt Level – Request executes in the interrupt handler “bottom half” – Doesn’t execute in the context of a ‘process’ • Can’t sleep • Can’t call schedule • Can’t transfer data to/from user space – Interrupts enabled during execution to allow top half to continue handling other interrupts – Mutual exclusion achieved through ‘masking interrupts’ • Course granularity can unnecessarily prohibit too many other invocations – Prohibits outgoing invocations when incoming invocations are disabled. (to avoid deadlock) • Limiting for accessing data on two different processors – Yes, lots of limitations, but very fast! Types of Remote Invocation [2/2] • Kernel Process Level – Normal “Top Half” activity – From user space - interrupt handler performs an asynchronous trap & remote invocation executes – From kernel space – interrupt handler queues the invocation for execution when kernel becomes idle, or when control returns to user space – Has process context so requested operation free to block during its execution – Process holding semaphores or scheduler-based locks can perform process-level RI’s – Deadlock still possible, but can be avoided • Requires requesting process to block when making RI • Busy waiting would lock out other incoming process-level RI’s Coexistence is possible • • Remote access and interrupt-level RI on the same data structure Deadlock-avoidance rule: must always be possible to perform incoming invocations wile waiting for outgoing invocation Trade-Offs • Direct costs of remote operations – Latency of the operation – (R-1)*n < C • R: remote to local access time ratio • n: number of memory accesses • C: overhead of remote invocation – If C is greater, implement Operation using Remote Access – Notes: • In general, C is fixed so use RI for longer operations • Parameter passing not included in calculation Trade-Offs • Indirect costs for local operations – RI may need context of invoker as parameter – Operations arranged to increase node locality?? • Access between invoker/target processor not interleaved?? – Process level RI for all remote accesses may allow data structure implementation without explicit synchronization • Uses lack of pre-emption within kernel to provide implicit synchronization – Avoiding Explicit Synchronization • Can improve speed of remote and local operations because of the high cost of explicit synchronization – Remote Memory Access may require large portions of kernel data space on other processors must be mapped into each instance of the kernel • May not scale well to large machines • Mapping on demand not a good option Trade-Offs • Competition for processor and memory cycles – When do operations serialize • Remote Invocation – on processor that executes them – Serialize whether data is common or not • Remote Memory Access – at the memory – more chance of parallel computation because operations do more than just access data – If competing for locked data operations may serialize – Tolerance for Contention • Remote Memory Access has slightly higher throughput because of parallel operation possibility • However, this can introduce greater variance of completion time Trade-Offs • Compatibility with conceptual kernel model – Procedure based kernel • • • • Gives uniform model for user and kernel processes Closely mimics hardware of UMA multiprocessor Minimizes context switching Fits nicely with Remote Memory Access – Message based kernel • Comparmentalization of kernel • Synchronization is subsumed by message passing • Closely mimics hardware of distributed-memory multicomputer • Easier to debug • Remote Invocation more appropriate Case Study • Psyche on the BBN butterfly – – – – – 12:1 6.88µs 0.518µs 4.27µs 0.398µs – 56µs – 421µs Remote-to-Local memory access time ratio Read 32-bit word remote memory location Read 32-bit word local memory location Write 32-bit word remote memory location Write 32-bit word local memory location Average Latency Interrupt level Remote Invocation Average latency Process level Remote Invocation Latency of Kernel Operations Local Access Operation locking  Remote Access Remote Invocation On Off On Off On Off 42.4 21.6 247 154 197 174 25.0 16.1 131 87.6 115 96.7 Find last in list of 10 40.6 30.5 211 169 125 105 Create segment 6.20 5.69 14.8 13.1 7.42 6.88 0.96 0.86 3.05 2.62 1.94 1.77 1.43 1.35 3.30 3.04 1.89 1.75 Enqueue + dequeue Find last in list of 5 Map segment Create process (µs) (ms) Times in columns represent cost of operations: • col 1 & 3 measurements for original Psyche kernel • col 2 synchronization without context switching • col 4 subsumed in other operation with course grain locking • col 6 cost if data accessed were always via remote invocation – no explicit synchronization required • col 5 cost of remote invocations in a hybrid kernel Interrupt level RI Process level RI % Latency from locking • Cost of synchronization • Measure Locking OFF vs ON Remote Access Penalty • NL RA(off) - local(off)/ RA(off) • L RA(off) – local(off)/RA(on) • Overhead of explicit synchronization and remote access is a function of complexity • Example – Interrupt level RI can be justified to avoid 11 remote refs – 6µs memory access cost & 60µs overhead of RI interrupt level RI time as percentage of RA time • RA: locking in the case of remote access – column 6 as % of column 3 • B: locking in the case of both – column 5 as % of column 3 – Simulates Hybrid kernel • N: locking in the case of neither – column 6 as % of column 4 – Simulates desired operation being subsumed by other operation with coarse-grained locking Not so fast • Not such a clear win for Remote Invocation – Experiments represent extreme cases • Simple enough to perform via interrupt level RI • Complex enough to absorb overhead of process level RI – Medium size operations might not accept limitations of interrupt level RI or the overhead of process level RI • Throughput for tiny operations – Remote memory accesses steal busy cycles from processor where memory is being used – Remote invocations steal the entire processor for the entire operation – Experiments show Interrupt level RI negatively affects throughput of the remote processor & remote access does not Conclusions • Good Kernel design must consider – – – – Cost of remote invocation mechanism Cost of atomic operations and synchronization Ratio of remote to local access time Breakeven points depend on the number of remote references in any given Operation • Interrupt level RI has small number of uses – TLB shootdown – Console I/O • Results could apply for ccNUMA machines as well • Risks of putting data structures in interrupt level of the kernel might be a reason to use remote access instead for small operations REFERENCES • • • Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman “The Design and Implementation of the 4.3BSD UNIX Operating System” Addison Wesley Publishing Company Andrew S. Tannenbaum “Modern Operating Systems 2nd Edition” Prentice-Hall of India Alessandro Rubini, Jonathan Corbet, “Linux Device Drivers 2nd Edition” O’Reilly

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt