Technology Drivers
• Traditional HPC application drivers
  – OS noise, resource monitoring and management, memory footprint
  – Complexity of resources to be managed
• New and evolving programming models
  – Shifting emphasis from managing cycles to managing data
  – Programming models require more access to resource management decisions
  – Hybrid/mixed programming models (composing applications)
• Node and memory structures
  – On-node RAM, DRAM, Flash
  – Stacked memory (performance implications for different access patterns)
  – Explicit cache/hierarchy management
  – On-node interconnect
  – Heterogeneous cores
  – On-node power management
• Global structures
  – Global address space
  – Integration of collectives, especially synchronization
• Resilience (soft errors and damaged cores)
• HPC OS sustainability
Increasing importance and complexity of resource management

Alternate R&D Strategies
• Evolve an existing OS
  – Linux, Plan 9, IBM CNK, Kitten
• Start with an empty emacs buffer
• Steal components from existing operating systems
• Partitioning resources
  – Independent management within a partition
  – Composability
• Collective/Global OS
  – Global address space?
It’s time to define the winner

Research Agenda
• HPC Community OS
  – Define basic structure
  – Individual groups work on components
• Expose management of critical resources
• Simulation to evaluate scalability of resource management strategies
• Enable co-design of hardware to support resource management
• Define and implement OS mechanisms that will enable global, autonomic runtime systems

Priority Research Direction: Community OS Framework for HPC Systems

Key challenges
1. HPC applications have unique resource management needs (e.g., memory layout)
2. Anticipated rapid evolution/revolution in architectures and programming models
3. Limited ability to innovate in existing commodity operating systems
4. Sustainability of HPC OS is difficult

Summary of research direction
1. Develop an OS framework specific to the needs of HPC
2. Open system architecture that exposes the management of critical resources
3. Empower developers of libraries and runtime systems

Potential impact on software component
1. This will enable full access to hardware resources
2. Common foundation for libraries and runtime environments

Potential impact on usability, capability, and breadth of community
1. Context for individual innovation and contribution
2. Timeframe: 2-3 years

Priority Research Direction: Scalable System Simulation

Key challenges
1. Inability to conduct “apples to apples” comparisons in scalable resource management
2. Evolution/revolution in new systems
3. Wide variety of existing simulators

Summary of research direction
1. Develop a scalable, full system simulation capability
2. Address multi-scale challenges
3. Adapt techniques that have been used in other branches of computational science
4. Develop common interfaces between simulators

Potential impact on software component
1. Ability to evaluate resource management mechanisms and policies at scale
2. Enable architecture/OS co-design

Potential impact on usability, capability, and breadth of community
1. Critical for the OS research/development community
2. Important for runtime community
3. Timeframe: 2-4 years

Priority Research Direction: Open System APIs

Key challenges
1. Communication management
2. Thread management
3. Memory management
4. Power management
5. Resilience (fault/failure isolation/management)

Summary of research direction
1. Develop community-based APIs to expose critical resources
2. Develop prototype runtime environments for common programming models

Potential impact on software component
1. Provides a fixed point for innovation in API implementation and innovation in the implementation of runtimes (hourglass principle)
2. Differentiation based on performance, not functionality

Potential impact on usability, capability, and breadth of community
1. Critical for supporting the development of new programming models
2. Critical for enabling the development of new architectures
3. Timeframe: 3 to 8 years

4.1 Operating Systems
[Roadmap figure: timeline from 2010 to 2019. Milestones shown: A Community HPC OS; Robust, Scalable System Simulation; APIs for energy management; Runtime Environments enabled; Autonomic runtime systems; API for node resilience; Community OS Framework; Prototype implementation of OS Framework; Next Generation Interconnect API.]
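
The hourglass principle named under Open System APIs can be illustrated with a small sketch. This is only an illustration under stated assumptions, not an API proposed in the slides: the names `MemoryAPI`, `BumpAllocator`, `AlignedAllocator`, and `place_arrays` are hypothetical, and memory management stands in for any of the critical resources listed. The fixed interface is the narrow waist; implementations below it and runtimes above it can change independently.

```python
from abc import ABC, abstractmethod
from typing import List


class MemoryAPI(ABC):
    """Hypothetical 'waist' of the hourglass: a fixed memory-management API."""

    @abstractmethod
    def allocate(self, nbytes: int) -> int:
        """Return the starting offset of a region of nbytes."""

    @abstractmethod
    def used(self) -> int:
        """Return the offset just past the last allocated region."""


class BumpAllocator(MemoryAPI):
    """One implementation below the waist: pack regions back to back."""

    def __init__(self) -> None:
        self._next = 0

    def allocate(self, nbytes: int) -> int:
        offset = self._next
        self._next += nbytes
        return offset

    def used(self) -> int:
        return self._next


class AlignedAllocator(MemoryAPI):
    """Another implementation: start every region on an alignment boundary."""

    def __init__(self, alignment: int = 64) -> None:
        self._alignment = alignment
        self._next = 0

    def allocate(self, nbytes: int) -> int:
        # Round the next free offset up to a multiple of the alignment.
        offset = -(-self._next // self._alignment) * self._alignment
        self._next = offset + nbytes
        return offset

    def used(self) -> int:
        return self._next


def place_arrays(mem: MemoryAPI, sizes: List[int]) -> List[int]:
    """A runtime above the waist: it sees only the MemoryAPI interface."""
    return [mem.allocate(n) for n in sizes]


# The same runtime code runs unchanged on either implementation.
packed = place_arrays(BumpAllocator(), [100, 100, 100])
aligned = place_arrays(AlignedAllocator(64), [100, 100, 100])
print(packed)   # [0, 100, 200]
print(aligned)  # [0, 128, 256]
```

Both allocators offer the same functionality through the same interface and differ only in layout, which is the sense in which implementations would be differentiated on performance rather than functionality.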