Many-Core Operating Systems
Burton Smith
Technical Fellow, Advanced Strategies and Policy

The von Neumann Premise
- Simply put: “there is exactly one program counter”
- It has led to some artifacts:
  - Synchronous coprocessor coroutining (e.g. the 8087)
  - Interrupts for asynchronous concurrency
  - Demand paging
    - To make memory allocation incremental
    - To let |virtual| > |physical|
- And some serious problems:
  - The memory wall (insufficient memory concurrency)
  - The ILP wall (diminished improvement in ILP)
  - The power wall (the cost of run-time ILP exploitation)
- Given multiple program counters, what should we change?
  - Scheduling? Synchronization?

Computing is at a Crossroads
- Continual performance improvement is our lifeblood
  - It encourages people to buy new hardware
  - It opens up new software possibilities
- Single-thread performance is nearing the end of the line
  - But Moore’s Law will continue for some time to come
  - What can we do with all those transistors?
- Computation needs to become as parallel as possible
  - Henceforth, serial means slow
  - Systems must support general-purpose parallel computing
  - The alternative to all this is commoditization
- New many-core chips will need new system software
  - And vice versa! This talk is about the interplay between OS and hardware

Many-Core OS Challenges
- Architecture of the parallel virtual machine
- Processor management
  - Multiple processors
  - A mix of in-order and out-of-order CPUs
  - GPUs and other performance accelerators
  - I/O processors and devices
- Memory management
  - Performance problems due to paging
  - TLB pressure from larger working sets
  - Bandwidth resources
- Quality of service (time management)
  - For media applications, games, real-time apps, etc.
  - For deadlines

The Parallel Virtual Machine
- What should the interface that the OS presents to parallel application software look like?
  - Stable, negotiated resource allocation
  - Isolation among protection domains
  - Freedom from bottlenecks in OS services
- The key objective is fine-grain application parallelism
  - We need the whole tree, not just the low-hanging fruit

Fine-grain Parallelism
- Exploitable parallelism grows as task granularity shrinks
  - But dependences among tasks become more numerous
- Inter-task dependence enforcement demands scheduling
  - A task needing a value from elsewhere must wait for it
- User-level work scheduling is called for
  - No privilege change is needed to stop or restart a task
  - Locality (e.g. cache content) can be better preserved
- Today’s OS and hardware don’t encourage waiting
  - OS thread scheduling makes blocking dangerous
  - Instruction sets encourage non-blocking approaches
  - Busy-waiting wastes instruction issue opportunities
- Impact:
  - Better instruction-set support for blocking synchronization
  - Changes to OS processor and memory resource management

Multithreading and Synchronization
- Fine-grain multithreading can use TLP to tolerate latency
  - Memory latency
  - Other operation latency, e.g. branch latency
  - Synchronization latency
- In the latter case, some architectural support is helpful
  - To stop issuing from a context while it is waiting
  - To resume issuing when the wait is over
  - To free up the context if and when a wait becomes long
- The benefits:
  - Waiting does not consume issue slots
  - Overhead is automatically amortized
- I talked about this stuff in my 1996 FCRC keynote
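To make the blocking style concrete, here is a minimal sketch of a full/empty synchronized cell in C. POSIX mutexes and condition variables stand in for the architectural support described above (stop issue while waiting, resume when the wait is over); the type and function names are illustrative, not any particular machine's design.

```c
/* Sketch: a full/empty synchronized cell. Hardware support would
 * stop instruction issue from the waiting context instead; here
 * POSIX condition variables play that role. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  filled, emptied;
    bool            full;   /* the per-cell "full/empty bit" */
    long            value;
} fe_cell;

void fe_init(fe_cell *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->filled, NULL);
    pthread_cond_init(&c->emptied, NULL);
    c->full = false;
}

/* Consume: block (not busy-wait) until full, leave the cell empty. */
long fe_read(fe_cell *c) {
    pthread_mutex_lock(&c->lock);
    while (!c->full)
        pthread_cond_wait(&c->filled, &c->lock);
    long v = c->value;
    c->full = false;
    pthread_cond_signal(&c->emptied);
    pthread_mutex_unlock(&c->lock);
    return v;
}

/* Produce: block until empty, leave the cell full and wake a reader. */
void fe_write(fe_cell *c, long v) {
    pthread_mutex_lock(&c->lock);
    while (c->full)
        pthread_cond_wait(&c->emptied, &c->lock);
    c->value = v;
    c->full = true;
    pthread_cond_signal(&c->filled);
    pthread_mutex_unlock(&c->lock);
}
```

A reader arriving before the writer simply sleeps rather than burning issue slots in a spin loop, which is exactly the trade the slides argue for.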
Resource Scheduling Consequences
- Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
  - An asynchronous OS API is a necessary corollary
  - Scheduling memory via demand paging is also problematic
- Instead, the two schedulers should negotiate
  - The application tells the OS its resource needs/desires
  - The OS makes decisions based on the big picture:
    - Availability of resources
    - Appropriateness of power consumption level
    - Requirements for quality of service
- The OS can preempt resources to reclaim them
  - But with notification, so the application can rearrange things
- Resources should be time- and space-shared in chunks
  - Scheduling turns into a bin-packing problem

Bin Packing
- The more resources allocated, the more swapping overhead
  - It would be nice to amortize it…
  - The more resources you get, the longer you may keep them
- Roughly, this means scheduling = packing squarish blocks
  - QoS applications might need long rectangles instead
  - [Figure: allocations drawn as blocks on axes of quantity of resource versus time]
- When the blocks don’t fit, the OS can morph them a little
  - Or cut corners when absolutely necessary

What About Priority Scheduling?
- Priorities are appropriate for some kinds of scheduling
  - Especially when some things to be scheduled are optional
- If it all has to be done, how do the priorities get set?
  - The answer is usually “ad hoc, and often!”
  - Fairness is seldom maintained in the process
- Quality of service needs a different approach
  - “How much work must be done before the next deadline?”
  - Even highly interactive tasks can benefit
- Deadlines are harder to implement than priorities
  - Then again, so is bin packing compared to fixed quanta
- Fairness can also be based on quality-of-service concepts
  - Relative work rates rather than absolute ones
  - “In the next 16 milliseconds, give level i activities r times as many processor-seconds as level i-1 activities”
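As a worked instance of that last rule: with L levels and ratio r, level i’s share of the window is r^i divided by the sum of r^j over all levels. A small C sketch follows; the window length, level count, ratio, and processor count are illustrative values, not figures from the talk.

```c
/* Sketch: proportional-share QoS budgeting for one scheduling window.
 * Level i is promised r times the processor-seconds of level i-1, so
 * level i's share of the window is r^i / (r^0 + r^1 + ... + r^(L-1)). */
#include <math.h>
#include <stdio.h>

int main(void) {
    const int    levels   = 3;      /* service levels 0..2 (illustrative) */
    const double r        = 2.0;    /* level i gets r times level i-1 */
    const double window_s = 0.016;  /* 16 ms scheduling window */
    const int    nprocs   = 32;     /* processors in the partition */

    double total = 0.0;
    for (int i = 0; i < levels; i++)
        total += pow(r, i);

    /* Processor-seconds available in this window across all processors */
    double budget = window_s * nprocs;

    for (int i = 0; i < levels; i++) {
        double share = pow(r, i) / total;
        printf("level %d: %.4f processor-seconds (%.1f%% of window)\n",
               i, share * budget, 100.0 * share);
    }
    return 0;
}
```

With r = 2 and three levels the shares come out to 1/7, 2/7, and 4/7 of the window: fairness expressed as relative rates rather than absolute priorities.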
Heterogeneous Processors
- There are two kinds of heterogeneity:
  - In architecture (HA), i.e. different instruction sets
  - In implementation (HI), i.e. different performance characteristics
- Both are likely to be important
- A single application might ask for a heterogeneous mix
  - Failure in the HA case might need multiple versions or JIT
  - In the HI case, scheduling might be based on instrumentation
- A key question is whether a processor is time-sharable
  - If not, the OS has to dedicate it to one application at a time
  - With user-level scheduling and some support for preemption, application state save and restore can be done at user level

Virtual Memory Design Alternatives
- Swapping instead of demand paging
- Address-space names/identifiers
  - TLB shootdown becomes a rarer event
- Hardware TLB coherence
- Two-dimensional addressing (segmentation without registers)
  - To assist with variable-granularity memory allocation
  - To help mitigate upward pressure on TLB size
  - To leverage persistent memory via segment sharing
    - A variation of mmap() might suffice for this purpose
  - To accommodate variations in memory bank architecture
    - Local versus global, for example

Physical Memory Bank Architecture
- Consider this example:
  - An application is using 31 cores (about half of them)
  - 50% of its cache misses are stack references
  - The stacks are all allocated in a compact virtual region
  - How many of the 128 memory banks are available?
- Interleaving addresses across the banks is a solution
  - Page granularity is the standard choice
- If memory access is non-uniform, this is not the best idea
  - Stacks should be allocated near their processors
  - So should compiler-allocated temporary arrays on the heap
- Is it one bank-architecture scheme fits all, or not?
  - If not, how do we manage the virtual address space?

“Hot Spots”
- When processors share memory, they can interfere
  - Not only data races, but also bandwidth oversubscription
- Within an application, this creates performance problems
  - Hardware help is needed to discover “where” these are
- Between applications, interference is even more serious
  - Performance unpredictability
  - Denial of service
  - Covert-channel signaling
- Bandwidth is a resource like any other
  - We need to be able to partition and isolate it

I/O Architecture
- Direct memory access is usually a good way to do I/O
  - Today’s DMA mostly demands “wired-down” pages
  - This leads to lots of data copying and other OS warts
- But I/O devices are getting smarter all the time
  - Transistors are cheaper than almost anything else
- Why not treat I/O devices like heterogeneous processors?
  - Teach them to do virtual address translation
  - Allocate them to real-time or sensor-intensive applications
  - Allocate them to a not-very-trusted “driver application”
  - Address space sharing can be partial, as it is now
- There is a problem, though: inter-domain signaling (IPC)
  - This is what interrupts do
  - I have some issues with interrupts

Interrupts
- Interrupts are OK when there is only one processor
  - Even so, some people avoid them to make systems more predictable
- If there are many processors, which one do you interrupt?
  - The usual solution: “just pick one and leave it to software”
- A better idea is to signal via an address space you already share (perhaps only in part) with the intended recipient
  - “The DMA at address <a> is (ready)(done)”
  - This is kinda like doing programmed I/O via device CSRs
  - It’s also the way the CDC 6600 and 7600 did things
  - You may not want to have the signal recipient busy-wait
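To make that shared-address-space signaling concrete, here is a minimal sketch assuming a Linux futex as the blocking primitive; the flag values and helper names are illustrative, and real hardware or runtime support could replace the futex. The writer plays the device’s role, publishing “the DMA at address <a> is done” by storing to a flag word mapped in both domains.

```c
/* Sketch: DMA-completion signaling through a shared address space
 * instead of an interrupt. Both domains map `status`; the device
 * side stores DMA_DONE and wakes waiters, and the recipient blocks
 * (no busy-waiting) until the flag changes. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { DMA_IDLE = 0, DMA_READY = 1, DMA_DONE = 2 };  /* illustrative */

static long futex(atomic_uint *addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Device (or driver-domain) side: "the DMA at address <a> is done". */
void dma_signal_done(atomic_uint *status) {
    atomic_store_explicit(status, DMA_DONE, memory_order_release);
    futex(status, FUTEX_WAKE, 1);             /* wake one waiter */
}

/* Recipient side: sleep until the flag leaves DMA_READY. */
uint32_t dma_wait(atomic_uint *status) {
    uint32_t s;
    while ((s = atomic_load_explicit(status, memory_order_acquire))
               == DMA_READY)
        futex(status, FUTEX_WAIT, DMA_READY); /* block, don't spin */
    return s;
}
```

FUTEX_WAIT rechecks the flag word inside the kernel, so a signal arriving between the load and the sleep cannot be lost, and the recipient never busy-waits.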
Conclusions
- It is time to rethink some of the basics of computing
- There is lots of work for everyone to do
  - e.g. I’ve left out compilers, debuggers, and applications
- We need basic research as well as industrial development
  - Research in computer systems is deprecated these days
  - In the USA, NSF and DOD need to take the initiative