Toward a Sustainable Architecture at Extreme Scale
Zhimin Tang, CTO
[email protected]

Outline
  Sustainable (cost-effective) HPC
    Counterexamples in history
  Current and future challenges
    New computing forms, from sensor to cloud
    Silicon-based IC process approaching its physical limit
  Strategy
    Abandon HPC-only acceleration features
    Design a sustainable architecture for HPC and other applications

Considerations of Cost Effectiveness or Sustainability
  Application (algorithm) requirements
    High performance
  Technology constraints
    CMOS vs. bipolar, Moore's Law
    Commercial MPU vs. custom ASIP
  Economic feasibility
    Good ecosystem
    Mass production
    Low energy consumption

HPCs in History
  Vector supercomputers
    CMOS dominated, SIMD weakness
  Connection Machine
    SIMD PE array
    Optimal only for some algorithms
    Custom chips, tiny processor
  MIMD with custom CPUs
    Chip-level integration (SoC)
    nCube/2, KSR-1 (COMA), …
    High NRE cost due to custom design without mass production
    Low node processor performance

Why No Cost Effectiveness
  HPC is a small market
  Architectures designed only for HPC
    Lower volume, higher cost (NRE)
    Not enough resources to implement a top-level (w.r.t. performance) solution
    Longer time-to-market, behind Moore's Law
  Result: COTS (commercial off-the-shelf) solutions in the last 20 years

Co-design with the IT Ecosystem
  From cloud computers to sensors
  Ecosystem requirements

High Performance and Low Cost
  Low cost remains a must
    New factors of cost: energy/power, big NRE
  Performance is no longer the bottleneck for most applications, as with cars, trains, and airplanes in transportation
  New appearances of performance
    Computing: MIPS/MFLOPS
    Transaction processing: TPM
    Cloud applications: requests serviced per unit time

Energy Efficiency
  Two ends of the computing system
    Cloud: large-scale power dissipation
    Terminal: limited battery life
  Energy: compute < memory < communication
    For each FLOP in Linpack, the FPU spends 10 pJ and memory access 475 pJ
    Wireless sensor network: the RF radio consumes most of the power
  What do we need besides locality?

New Architecture Needed
  Architecture consuming less energy
    Many-core, custom designed for applications
    Flattened software stack
  Architecture for new performance metrics
    High-volume throughput computers
  New algorithms and methodology
    Complexity of computation
    Complexity of memory access and communication

Constraints to Innovation
  Existing software ecosystem
    Standard or de facto interfaces, e.g., the ISA (Instruction Set Architecture)
    Pro: software compatibility
    Con: obstacle to innovation, legacy
  Huge development expenses
    A new architecture needs new processors
    NRE (Non-Recurring Engineering) cost of chip development is increasing rapidly as the CMOS process approaches its limit

CMOS Technology Approaching Its Limit, and No Replacement!
  Moore's Law: 7 nm @ 2024, ~30 atoms
  Different from the transition in the 1990s
    Bipolar (ECL/TTL) is faster, but consumes much power
    CMOS had been developed for 20 years: not too slow, low cost, and low power
  But now: liquid cooling for CMOS
  In the foreseeable future, still CMOS

More and More than Moore
  [Figure: 2011 ITRS Exec. Summary, Fig. 4]

Dark Silicon
  At 8 nm, more than half of the transistors must be turned off
  Speedup of 4-8x over 5 process generations
  ISCA'11, IEEE Micro'12, CACM'13

Economic Feasibility
  Moore's Law provides more transistors
    But switching speed is no longer increasing
    Process development at the nanometer scale increases NRE tremendously
  Mass production is essential
    Otherwise, the chip business is not sustainable
    Advantages of general-purpose processors
  How about many-core processors?
    GPU, Tilera, MIC, …

Pros and Cons of MPU
  Most advanced process, mass production
    Stable, reliable, low cost
    Mature ecosystem and solutions
  Not optimal for many applications
    Aim: not too bad for most applications
    Over-allocation of resources
    Waste of resources, higher energy consumption

MPU Not Good for Cloud
  High L1-I cache miss rate
    Processor idle (instruction starvation)
  Small ILP and MLP
    Wide issue not effective
  Low efficiency of memory access
    Large L3 takes ½ of the chip area, with no performance benefit
  Useless high on-chip bandwidth
    Little data sharing among cores
  Low utilization of resources
    Only 1/3 are frequently used
  [Figure: chip floorplan with four cores (each with L2 cache, OOO logic, and FPU), a GPU, and a shared L3 cache]

Pros and Cons of ASIP
  Optimally designed for some applications
    High efficiency, low resource usage, low power
  But there is no free lunch
    Much design/verification work
    Stability/reliability?
    May affect the time to market
  How to amortize the huge NRE
    Small market means high cost

MPU + Accelerator
  GPU
    Pro: mass production
    Con: PCIe overhead, small memory size
  MIC (Phi)
    Mass production possible?
  FPGA
    Resource utilization
    Ease of programming
    MPU interface, e.g., QPI or PCIe

Design of New Processors
  Crossing the gap between general and special
  Many simple cores
    Reduce power consumption
  Multiple hardware threads in each core
    Massive threads on chip
    Exploit concurrency, tolerate latency
  Dynamic scheduling of on-chip threads
    Improve performance for general apps

Combining Multithreading and Vector Pipelining
  [Figure: pipelined vector processing engine. Blocks: instruction cache (I$), per-thread PCs, instruction register (IR), instruction decode (ID), register file (RF), vector registers, ALU, FPU, LSU, data cache/SPM (D$/SPM). Modes: switch to single thread (deep scalar pipeline); switch to vector pipeline]

Thread Parallelism and Data Parallelism in Two Dimensions
  [Figure: three configurations of the same engine (I$, PCs, IR, ID, RF, vector register file, ALU, FPU, LSU, D$/SPM). Panels: deep thread parallelism and data parallelism; wide data parallelism; wide thread parallelism]

In Conclusion
  A universal architecture
    Scalable and reconfigurable processor array
    Supports thread-level and data-level parallelism
  Fulfills all requirements from terminal to cloud data center
    High-performance computers
    Cloud computing servers
    Equipment in the core network
    Terminals for cloud and mobile Internet

Thanks!
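
Backup: Energy per FLOP, a Back-of-Envelope Sketch
  The Energy Efficiency slide quotes 10 pJ for the FPU and 475 pJ for memory access per Linpack FLOP. The sketch below is one rough way to see why locality matters. It is not a measured model and makes two assumptions: that the 475 pJ figure is the cost of a single memory access, and that the quoted case corresponds to one access per FLOP. The function names and the sample access rates are hypothetical.

```python
# Back-of-envelope energy model. The two constants are the figures quoted on
# the "Energy Efficiency" slide; treating 475 pJ as a per-access cost and
# varying the access rate are assumptions made for illustration only.
FPU_PJ_PER_FLOP = 10.0      # from the slide
MEM_PJ_PER_ACCESS = 475.0   # from the slide (assumed to be per access)


def energy_per_flop_pj(mem_accesses_per_flop: float) -> float:
    """Average energy per FLOP in pJ for a given memory-access rate."""
    if mem_accesses_per_flop < 0:
        raise ValueError("mem_accesses_per_flop must be non-negative")
    return FPU_PJ_PER_FLOP + mem_accesses_per_flop * MEM_PJ_PER_ACCESS


def memory_share(mem_accesses_per_flop: float) -> float:
    """Fraction of the per-FLOP energy that goes to memory accesses."""
    total = energy_per_flop_pj(mem_accesses_per_flop)
    return (total - FPU_PJ_PER_FLOP) / total


if __name__ == "__main__":
    # Hypothetical access rates, chosen only to show the trend.
    for rate in (1.0, 0.1, 0.01):
        print(f"{rate:5.2f} accesses/FLOP: "
              f"{energy_per_flop_pj(rate):7.2f} pJ/FLOP, "
              f"memory share {memory_share(rate):.0%}")
```

  Under these assumptions the memory term dominates until the access rate falls well below one access per FLOP. That fits the ordering compute < memory < communication on the slide.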