Many-Core Operating Systems
Burton Smith
Technical Fellow, Advanced Strategies and Policy

The von Neumann Premise
- Simply put: “there is exactly one program counter”
- It has led to some artifacts:
  - Synchronous coprocessor coroutining (e.g. the 8087)
  - Interrupts for asynchronous concurrency
  - Demand paging
    - To make memory allocation incremental
    - To let |virtual| > |physical|
- And some serious problems:
  - The memory wall (insufficient memory concurrency)
  - The ILP wall (diminished improvement in ILP)
  - The power wall (the cost of run-time ILP exploitation)
- Given multiple program counters, what should we change?
  - Scheduling? Synchronization?

Computing is at a Crossroads
- Continual performance improvement is our lifeblood
  - It encourages people to buy new hardware
  - It opens up new software possibilities
- Single-thread performance is nearing the end of the line
  - But Moore’s Law will continue for some time to come
  - What can we do with all those transistors?
- Computation needs to become as parallel as possible
  - Henceforth, serial means slow
  - Systems must support general-purpose parallel computing
  - The alternative to all this is commoditization
- New many-core chips will need new system software
  - And vice versa! This talk is about the interplay between OS and hardware

Many-Core OS Challenges
- Architecture of the parallel virtual machine
- Processor management
  - Multiple processors
  - A mix of in-order and out-of-order CPUs
  - GPUs and other performance accelerators
  - I/O processors and devices
- Memory management
  - Performance problems due to paging
  - TLB pressure from larger working sets
  - Bandwidth resources
- Quality of service (time management)
  - For media applications, games, real-time apps, etc.
  - For deadlines

The Parallel Virtual Machine
- What should the interface that the OS presents to parallel application software look like?
  - Stable, negotiated resource allocation
  - Isolation among protection domains
  - Freedom from bottlenecks in OS services
- The key objective is fine-grain application parallelism
  - We need the whole tree, not just the low-hanging fruit

Fine-grain Parallelism
- Exploitable parallelism grows as task granularity shrinks
  - But dependences among tasks become more numerous
- Inter-task dependence enforcement demands scheduling
  - A task needing a value from elsewhere must wait for it
- User-level work scheduling is called for
  - No privilege change is needed to stop or restart a task
  - Locality (e.g. cache content) can be better preserved
- Today’s OS and hardware don’t encourage waiting
  - OS thread scheduling makes blocking dangerous
  - Instruction sets encourage non-blocking approaches
  - Busy-waiting wastes instruction issue opportunities
- Impact:
  - Better instruction-set support for blocking synchronization
  - Changes to OS processor and memory resource management

Multithreading and Synchronization
- Fine-grain multithreading can use TLP to tolerate latency
  - Memory latency
  - Other operation latency, e.g. branch latency
  - Synchronization latency
- In the latter case, some architectural support is helpful
  - To stop issuing from a context while it is waiting
  - To resume issuing when the wait is over
  - To free up the context if and when a wait becomes long
- The benefits:
  - Waiting does not consume issue slots
  - Overhead is automatically amortized
- I talked about this stuff in my 1996 FCRC keynote
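To make the blocking style concrete, here is a minimal sketch of a full/empty synchronized cell in C. POSIX mutexes and condition variables stand in for the architectural support described above (stop issue while waiting, resume when the wait is over); the type and function names are illustrative, not any particular machine's design.

```c
/* Sketch: a full/empty synchronized cell. Hardware support would
 * stop instruction issue from the waiting context instead; here
 * POSIX condition variables play that role. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  filled, emptied;
    bool            full;   /* the per-cell "full/empty bit" */
    long            value;
} fe_cell;

void fe_init(fe_cell *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->filled, NULL);
    pthread_cond_init(&c->emptied, NULL);
    c->full = false;
}

/* Consume: block (not busy-wait) until full, leave the cell empty. */
long fe_read(fe_cell *c) {
    pthread_mutex_lock(&c->lock);
    while (!c->full)
        pthread_cond_wait(&c->filled, &c->lock);
    long v = c->value;
    c->full = false;
    pthread_cond_signal(&c->emptied);
    pthread_mutex_unlock(&c->lock);
    return v;
}

/* Produce: block until empty, leave the cell full and wake a reader. */
void fe_write(fe_cell *c, long v) {
    pthread_mutex_lock(&c->lock);
    while (c->full)
        pthread_cond_wait(&c->emptied, &c->lock);
    c->value = v;
    c->full = true;
    pthread_cond_signal(&c->filled);
    pthread_mutex_unlock(&c->lock);
}
```

A reader arriving before the writer simply sleeps rather than burning issue slots in a spin loop, which is exactly the trade the slides argue for.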
Resource Scheduling Consequences
- Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
  - An asynchronous OS API is a necessary corollary
  - Scheduling memory via demand paging is also problematic
- Instead, the two schedulers should negotiate
  - The application tells the OS its resource needs/desires
  - The OS makes decisions based on the big picture:
    - Availability of resources
    - Appropriateness of power consumption level
    - Requirements for quality of service
- The OS can preempt resources to reclaim them
  - But with notification, so the application can rearrange things
- Resources should be time- and space-shared in chunks
  - Scheduling turns into a bin-packing problem

Bin Packing
- The more resources allocated, the more swapping overhead
  - It would be nice to amortize it…
  - The more resources you get, the longer you may keep them
- Roughly, this means scheduling = packing squarish blocks
  - QoS applications might need long rectangles instead
  - [Figure: allocations drawn as blocks on axes of quantity of resource versus time]
- When the blocks don’t fit, the OS can morph them a little
  - Or cut corners when absolutely necessary

What About Priority Scheduling?
- Priorities are appropriate for some kinds of scheduling
  - Especially when some things to be scheduled are optional
- If it all has to be done, how do the priorities get set?
  - The answer is usually “ad hoc, and often!”
  - Fairness is seldom maintained in the process
- Quality of service needs a different approach
  - “How much work must be done before the next deadline?”
  - Even highly interactive tasks can benefit
- Deadlines are harder to implement than priorities
  - Then again, so is bin packing compared to fixed quanta
- Fairness can also be based on quality-of-service concepts
  - Relative work rates rather than absolute ones
  - “In the next 16 milliseconds, give level i activities r times as many processor-seconds as level i-1 activities”
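As a worked instance of that last rule: with L levels and ratio r, level i’s share of the window is r^i divided by the sum of r^j over all levels. A small C sketch follows; the window length, level count, ratio, and processor count are illustrative values, not figures from the talk.

```c
/* Sketch: proportional-share QoS budgeting for one scheduling window.
 * Level i is promised r times the processor-seconds of level i-1, so
 * level i's share of the window is r^i / (r^0 + r^1 + ... + r^(L-1)). */
#include <math.h>
#include <stdio.h>

int main(void) {
    const int    levels   = 3;      /* service levels 0..2 (illustrative) */
    const double r        = 2.0;    /* level i gets r times level i-1 */
    const double window_s = 0.016;  /* 16 ms scheduling window */
    const int    nprocs   = 32;     /* processors in the partition */

    double total = 0.0;
    for (int i = 0; i < levels; i++)
        total += pow(r, i);

    /* Processor-seconds available in this window across all processors */
    double budget = window_s * nprocs;

    for (int i = 0; i < levels; i++) {
        double share = pow(r, i) / total;
        printf("level %d: %.4f processor-seconds (%.1f%% of window)\n",
               i, share * budget, 100.0 * share);
    }
    return 0;
}
```

With r = 2 and three levels the shares come out to 1/7, 2/7, and 4/7 of the window: fairness expressed as relative rates rather than absolute priorities.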
Heterogeneous Processors
- There are two kinds of heterogeneity:
  - In architecture (HA), i.e. different instruction sets
  - In implementation (HI), i.e. different performance characteristics
- Both are likely to be important
- A single application might ask for a heterogeneous mix
  - Failure in the HA case might need multiple versions or JIT
  - In the HI case, scheduling might be based on instrumentation
- A key question is whether a processor is time-sharable
  - If not, the OS has to dedicate it to one application at a time
  - With user-level scheduling and some support for preemption, application state save and restore can be done at user level

Virtual Memory Design Alternatives
- Swapping instead of demand paging
- Address-space names/identifiers
  - TLB shootdown becomes a rarer event
- Hardware TLB coherence
- Two-dimensional addressing (segmentation without registers)
  - To assist with variable-granularity memory allocation
  - To help mitigate upward pressure on TLB size
  - To leverage persistent memory via segment sharing
    - A variation of mmap() might suffice for this purpose
  - To accommodate variations in memory bank architecture
    - Local versus global, for example

Physical Memory Bank Architecture
- Consider this example:
  - An application is using 31 cores (about half of them)
  - 50% of its cache misses are stack references
  - The stacks are all allocated in a compact virtual region
  - How many of the 128 memory banks are available?
- Interleaving addresses across the banks is a solution
  - Page granularity is the standard choice
- If memory access is non-uniform, this is not the best idea
  - Stacks should be allocated near their processors
  - So should compiler-allocated temporary arrays on the heap
- Is it one bank-architecture scheme fits all, or not?
  - If not, how do we manage the virtual address space?

“Hot Spots”
- When processors share memory, they can interfere
  - Not only data races, but also bandwidth oversubscription
- Within an application, this creates performance problems
  - Hardware help is needed to discover “where” these are
- Between applications, interference is even more serious
  - Performance unpredictability
  - Denial of service
  - Covert-channel signaling
- Bandwidth is a resource like any other
  - We need to be able to partition and isolate it

I/O Architecture
- Direct memory access is usually a good way to do I/O
  - Today’s DMA mostly demands “wired-down” pages
  - This leads to lots of data copying and other OS warts
- But I/O devices are getting smarter all the time
  - Transistors are cheaper than almost anything else
- Why not treat I/O devices like heterogeneous processors?
  - Teach them to do virtual address translation
  - Allocate them to real-time or sensor-intensive applications
  - Allocate them to a not-very-trusted “driver application”
  - Address space sharing can be partial, as it is now
- There is a problem, though: inter-domain signaling (IPC)
  - This is what interrupts do
  - I have some issues with interrupts

Interrupts
- Interrupts are OK when there is only one processor
  - Even so, some people avoid them to make systems more predictable
- If there are many processors, which one do you interrupt?
  - The usual solution: “just pick one and leave it to software”
- A better idea is to signal via an address space you already share (perhaps only in part) with the intended recipient
  - “The DMA at address <a> is (ready)(done)”
  - This is kinda like doing programmed I/O via device CSRs
  - It’s also the way the CDC 6600 and 7600 did things
  - You may not want to have the signal recipient busy-wait
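To make that shared-address-space signaling concrete, here is a minimal sketch assuming a Linux futex as the blocking primitive; the flag values and helper names are illustrative, and real hardware or runtime support could replace the futex. The writer plays the device’s role, publishing “the DMA at address <a> is done” by storing to a flag word mapped in both domains.

```c
/* Sketch: DMA-completion signaling through a shared address space
 * instead of an interrupt. Both domains map `status`; the device
 * side stores DMA_DONE and wakes waiters, and the recipient blocks
 * (no busy-waiting) until the flag changes. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { DMA_IDLE = 0, DMA_READY = 1, DMA_DONE = 2 };  /* illustrative */

static long futex(atomic_uint *addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Device (or driver-domain) side: "the DMA at address <a> is done". */
void dma_signal_done(atomic_uint *status) {
    atomic_store_explicit(status, DMA_DONE, memory_order_release);
    futex(status, FUTEX_WAKE, 1);             /* wake one waiter */
}

/* Recipient side: sleep until the flag leaves DMA_READY. */
uint32_t dma_wait(atomic_uint *status) {
    uint32_t s;
    while ((s = atomic_load_explicit(status, memory_order_acquire))
               == DMA_READY)
        futex(status, FUTEX_WAIT, DMA_READY); /* block, don't spin */
    return s;
}
```

FUTEX_WAIT rechecks the flag word inside the kernel, so a signal arriving between the load and the sleep cannot be lost, and the recipient never busy-waits.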
Conclusions
- It is time to rethink some of the basics of computing
- There is lots of work for everyone to do
  - e.g. I’ve left out compilers, debuggers, and applications
- We need basic research as well as industrial development
  - Research in computer systems is deprecated these days
  - In the USA, NSF and DOD need to take the initiative