Transactional Server Processor – A Bright Future Ahead
Nhon Quach (Guo Ruiren), Ph.D., Processor Architect
Zhejiang University, Hangzhou
September 13th, 2005

Outline
– What are transactional server processors (TSPs)?
– Background
– Past and present TSP design trends
– Server performance growth
– A bright future for TSPs
– Challenges and opportunities
– Summary
You are expected to have fun and ask as many questions as you can.

What Are TSPs?
Processors designed for transaction processing.
– Formally, transactions have ACID (atomicity, consistency, isolation, and durability) properties in database terminology.
• E.g., placing an order, doing a query, updating the database, etc.
– Often used in corporate server farms and datacenters in tiers:
• Tier-2: web servers running Apache.
• Tier-3: application servers running business logic.
• Tier-4: backend servers running commercial databases such as Oracle, SQL Server, or DB2.
Does NOT include processors used in compute servers.
– Compute servers place a high demand on floating-point computing capability.
Examples: IBM POWER4, AMD Opteron and Jackhammer, Intel Xeon and Itanium.
Today, the same TSPs are used in all servers (including compute servers).

Tiers of Servers
[Diagram: clients reach the web, application, and backend server tiers through the Intranet/Internet and a switch.]
– Web servers: small SMPs with 2-8P; stateless connections.
– Application servers: small-to-medium SMPs with 4-16P.
– Backend servers: large SMPs with 8-128 processors; stateful transactions.
Web & app servers usually scale out and backend servers usually scale up.

Next Item of Discussion
(Outline slide repeated.)

Server Hardware Characteristics
Symmetric multiprocessing (SMP) systems:
– 2 to 128 processors with large physical memory.
• Large (>16P) SMPs are often logically partitioned.
• Stringent requirements on RAS (reliability, availability, serviceability), performance stability, and scalability.
• Negligible silent data corruption.
• Vendors often need to write service-level agreements.
SMP vs. clusters:
– Hardware- vs. software-enforced coherence.
– Scale up vs. scale out.
– Manageability vs. scalability and availability.
– 90% of the servers sold today are SMPs.
• But clusters are gaining in popularity recently (e.g., grid or utility computing, etc.).
Some large SMPs have more than 1 terabyte of main memory!

SMP vs. Cluster
[Diagram: an SMP joins processors and memories over hardware links with coherency logic; a cluster joins SMP nodes over software links via TCP.]
SMP:
– One OS image; hardware-enforced coherency.
• All memory accesses must check other processors' caches or memory for consistency.
• Large systems need multiple layers of coherency logic.
Cluster:
– Multiple OS images; software-enforced coherency.
• Only remote data accesses experience messaging delay.
Manageability is a major reason why SMPs are preferred over clusters.

Server Workload Characteristics
Large code and data footprints.
– Large cache and TLB miss rates.
• Performance highly sensitive to memory latency.
• Low hardware instruction-level parallelism (ILP).
– Often misconstrued to have low application ILP.
Large number of I/O operations and context switches.
– Results in a large number of interrupts.
Large number of synchronization operations.
High (>100) level of thread-level parallelism (TLP).
Low demand on floating-point computing capability.
– May change with the proposed IEEE 754R decimal floating-point standard.
TSPs face one of the most demanding workloads.

TSP Performance – A Quick Introduction
Measured by TPC-C transactions per minute (TPMC):
– TPMC = N * (1/PL) * (1/CPI) * Freq * 60
• N: number of processors (or threads) in the system. A function of processor implementation style.
• PL: pathlength, the number of instructions per transaction. A function of the ISA and the compiler.
• CPI: cycles per instruction. A function of microarchitecture and implementation style; dominated by L3 miss rates (30-40%!).
• Freq: frequency of the processor. A function of process technology and pipeline length.
It is easier to increase N and/or reduce CPI:
– Use more processors (or multithreading) and a large L3 cache.
TPMC ratings often decide the market success (or failure) of a TSP.

Components of Memory Latency
[Diagram: four processors (P1-P4), each with L1 and L2 caches, on a bus with local and remote memory; numbered arrows trace the miss and snoop path.]
Memory latency = L1 + L2 + L3 + snoop + local/remote memory access time.
Memory latencies are an important determining factor of server performance.

Next Items of Discussion
(Outline slide repeated.)

The 80's
No concept of tiered servers yet.
– RISC processors dominated the server and workstation markets.
– CISC processors dominated the desktop market.
All processors were designed to exploit ILP.
– One-size-fits-all design style.
– Processed multiple (2-3) instructions per cycle.
– Used dynamic branch prediction and on-chip I and D caches.
– Performance limited by the number of available transistors.
• Simplified instruction sets to ease decoding => the RISC "revolution".
Power consumption was not an issue.
Design focus: how to find more transistors.

The 90's
The search for ILP continued.
– Processed 4-8 instructions per cycle.
– Out-of-order (OOO) execution.
– Pipelined floating-point execution units.
– Larger and more (2-3) levels of on-chip caches.
Frequency sold!
– Ultra-long, high-frequency pipelines became popular.
Efforts continued to invent new ISAs (e.g., Intel EPIC).
CISC processors took over the workstation market.
– But RISC processors still dominated the server market.
Increased awareness of the abundance of TLP in server apps.
Power consumption was somewhat an issue.
– But most vendors adopted a wait-and-see attitude.
Design focus: how to find even more transistors.
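The TPMC formula and the memory-latency breakdown above can be combined into a toy model: the miss path inflates CPI, and CPI in turn divides throughput. The function names and every numeric constant below are illustrative assumptions, not figures from the talk.

```python
# Toy TPMC model from the slides' formula TPMC = N * (1/PL) * (1/CPI) * Freq * 60.
# All parameter values are invented for illustration.

def memory_latency_cycles(l1, l2, l3, snoop, mem):
    """Miss-path latency in cycles: L1 + L2 + L3 + snoop + memory access."""
    return l1 + l2 + l3 + snoop + mem

def effective_cpi(base_cpi, l3_miss_rate, miss_penalty_cycles, mem_refs_per_instr):
    """CPI inflated by L3 misses that stall the pipeline."""
    return base_cpi + mem_refs_per_instr * l3_miss_rate * miss_penalty_cycles

def tpmc(n_procs, pathlength, cpi, freq_hz):
    """Transactions per minute: N * (1/PL) * (1/CPI) * Freq * 60."""
    return n_procs * (1.0 / pathlength) * (1.0 / cpi) * freq_hz * 60

# Hypothetical 4P system at 2 GHz with a 500K-instruction pathlength.
penalty = memory_latency_cycles(l1=2, l2=10, l3=30, snoop=40, mem=200)
cpi = effective_cpi(base_cpi=1.0, l3_miss_rate=0.01,
                    miss_penalty_cycles=penalty, mem_refs_per_instr=0.3)
rating = tpmc(n_procs=4, pathlength=500_000, cpi=cpi, freq_hz=2.0e9)
print(f"effective CPI = {cpi:.2f}, TPMC = {rating:,.0f}")
```

The model makes the deck's point concrete: doubling N doubles TPMC directly, while shaving the miss penalty (a bigger L3, an integrated memory controller) works through the CPI term.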
TSP Integration History
[Diagram: a 1980-2005 timeline of progressive on-die integration, from a bare CPU to CPU + FPU, branch tables, L1/L2/L3 caches, dual cores, and finally the memory controller.]
Integration reduces both local and remote memory latencies.

Performance Improvement
[Chart: TPMC and $/TPMC vs. year reported (1990-2004) on a log scale, both tracking Moore's law. Source: http://www.tpc.org, 2005.]
Achieved by having faster and more processors in the system.

The Present
Today, all TSPs use:
– Large dynamic branch prediction tables.
– Three levels of on-chip caches with huge L3 caches.
– Two levels of on-chip TLBs.
– Pure hardware or software-assisted OOO (e.g., Intel IPF).
Impressive TSP performance gains:
– 50% due to process technology.
– 50% due to implementation (uarch + circuit) techniques.
ISA largely irrelevant:
– CISC processors dominate the low-end server market.
• Dominance of the high-end market likely.
Memory and system architecture is more important than ISA style.

The Present (Cont.)
The search for ILP is slowing.
– Though certain researchers are still trying.
– Re-discovery of the KISS (Keep It Simple, Stupid) principle.
– The search for ways to exploit TLP begins.
Memory latencies are stabilizing!
– Processor clock frequency increase is slowing.
– Use of integrated memory controllers.
– Cache hierarchy stops growing – stays at three levels.
Transistor integration density still follows Moore's law.
– But transistors are slower, leakier, and less reliable.
Power consumption is a major problem.
Market segmentation – the birth of TSPs.
Design focus is how to use the transistors more efficiently.

Next Item of Discussion
(Outline slide repeated.)
New Developments Favoring TSPs
Multiple processor cores on a single chip.
– The era of chip multiprocessing (CMP) has finally arrived!
Implicit and explicit multithreading.
– An efficient way to increase the number of logical processors.
Active power management.
– Consume power only when absolutely needed.
Virtualization technology.
– Allows server consolidation.
– Reduces cost of ownership.
These technologies will enable a new breed of low-cost and powerful servers.

The Promise of CMP
[Diagram: a 4P SMP board rebuilt from CMP chips; multi-core CMPs scale the same board design up to 8P, 16P, and 32P systems.]
Scale out by increasing the number of cores in a CMP.
– Protects IT infrastructure investment.
– Memory latencies remain unchanged.
CMP is an especially cost-effective way to build a single-chip server.

Two-Core CMP – An Example
[Diagram: two high-performance ILP cores with private L1I/L1D and L2 caches feeding a shared L3 and a point-to-point fabric with an integrated memory controller. Annotations:]
– Cores: use dynamic branch prediction, multiple issue, and OOO execution.
– L1I: 16-64KB, read-only, designed for latency. Kept small so branches resolve quickly with a one-cycle access time; TLB address translation is often in the critical path.
– L1D: 16-64KB, WT or WB, designed for latency. Determines the memory-op penalty; every extra cycle of latency costs heavily. TLB translation is often in the critical path.
– L2: 0.5-1MB, unified and WB, designed for latency and bandwidth. May or may not be shared.
– L2/L3 queues: buffer read/write and instruction traffic; reads and writes may be reordered depending on the memory ordering model.
– L3: 4-8MB, WB, designed for bandwidth with a large line size. May or may not be shared.
– P2P fabric + MC logic: a point-to-point network running at a high frequency.
• Enforces cache coherency and ordering.
For simplicity, some CMP designs don't share the L3 cache.

The Promise of Multithreading
[Diagram: multithreaded CMP chips scale the same boards further; thread counts multiply the logical processor count up to 16P, 32P, and 64P.]
CMP + multithreading is an effective way to exploit TLP.
– Increases the number of (logical) processors without increasing memory latency.
I/O and memory bandwidths place an upper limit on the number of threads.

Multithreading
Amortizes memory and stall penalties.
– The benefit depends on the stall and memory access patterns.
Threads share the same pipelines and uarch resources.
– Need to duplicate architectural resources and enlarge uarch resources.
– Thread switch and run-time slice depend on the threading style.
• May require complex pipeline interlock logic for certain threading styles.
Explicit threads:
– Threads appear as logical processors to the OS.
• Exploit TLP to improve transaction throughput.
• Single-thread performance may suffer.
Implicit threads:
– Used strictly by the hardware to enhance single-thread performance.
• Often with compiler assist.
There is still no consensus on how to best use the threads.

Multithreading – An Example
[Diagram: an unthreaded processor (L1I$, branch tables, register file, execution units, D$) beside a 2-way threaded one. Annotations:]
– Branch tables: typically enlarged to handle two threads.
– L1I$: contains instructions from both threads.
– Thread-switch (TS) logic: many ways to implement, as shown in the next slide.
– Register file (RF): contains architectural state and must be duplicated.
– Execution units and D$ (L1/L2/L3): shared between the threads.
The complexity of implementing multithreading is often underestimated.

Many Multithreading Styles
Interleaved – each thread takes turns.
– Simple issue logic, complicated interlock logic.
• Fits well with non-stallable pipeline and in-order design styles.
– Exploits both short and long stall latencies.
Blocked – thread switch on long-latency events (e.g., an L3 miss).
– Simple issue and interlock logic.
• Pipeline flush on each thread switch.
• No need to redesign the pipeline interlock logic.
– Reasonable performance.
SMT – both threads may issue in the same clock.
– Complicated issue and interlock logic.
• Fits well with non-stallable pipeline design styles.
– Exploits both short and long stall latencies.
– Takes advantage of often-empty slots in superscalar processors.
Existing implementation constraints often determine the threading style.

Server Performance Trend Prediction
[Chart: predicted TPMC on a log scale vs. chip integration generation; growth comes first from ILP only, then ILP + CMP ("we are here"), then ILP + multithreading + CMP.]
Achieved by increasing the number of processors and reducing memory latencies.

Future TSP – An Example
Processor core design:
– 15-45nm technology.
– 4-8 GHz (~10-20 stage pipeline).
– Dynamic branch prediction with three levels of caches.
– Adequate level of reliability.
4-16 processor cores.
2-8 explicit threads.
16-128MB L3 cache.
Point-to-point bus topology with integrated memory controller.
Aggressive active power management.
128P server systems on a single chip!
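The trend-prediction argument above — throughput grows with cores and threads until memory and I/O bandwidth cap it — can be sketched as a toy model of blocked multithreading, where a core switches threads on long-latency stalls. The function names and all numbers below are illustrative assumptions.

```python
# Sketch: relative throughput of a multithreaded CMP, assuming each thread
# alternates a compute burst with a long memory stall and the hardware
# switches threads on a stall (blocked multithreading). All numbers invented.

def core_utilization(n_threads, compute_cycles, stall_cycles):
    """Fraction of cycles the pipeline does useful work.

    One thread's pattern is compute_cycles of work then stall_cycles idle;
    extra threads fill the stall window until compute demand exceeds it.
    """
    period = compute_cycles + stall_cycles   # one thread's repeating pattern
    demand = n_threads * compute_cycles      # work available per period
    return min(1.0, demand / period)

def chip_throughput(n_cores, n_threads, compute_cycles, stall_cycles, bw_cap):
    """Relative chip throughput, clipped by a memory/I-O bandwidth ceiling."""
    per_core = core_utilization(n_threads, compute_cycles, stall_cycles)
    return min(n_cores * per_core, bw_cap)

for t in (1, 2, 4, 8):
    print(t, chip_throughput(n_cores=4, n_threads=t,
                             compute_cycles=20, stall_cycles=100, bw_cap=12.0))
```

With these numbers a single thread keeps the pipeline busy only 1/6 of the time, so thread counts pay off almost linearly at first; the `bw_cap` clip is the deck's point that I/O and memory bandwidth bound the useful thread count.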
Outline
(Outline slide repeated; next topic: Challenges and opportunities.)

Possible Roadblocks
Power consumption.
– Limits the number of cores or threads on a chip.
– Passive power management (PPM) still has a lot of room for improvement – and research!
• Process technology level: reduce leakage currents, etc.
• Circuit level: size transistors properly, use sleep transistors, etc.
• Logic level: use low-voltage-swing logic, use clock gating, etc.
• Micro-architecture level: use stallable pipelines, etc.
• Systems level: use active power management (APM).
Reliability (against soft errors).
– May require expensive protection of random logic elements.
• The extent of the problem is not yet known.
Power consumption will play the same prominent role as ILP once did.

Different Types of Power Consumption
[Chart: power vs. time for two applications, with levels marked from highest to lowest. Each is determined by:]
– Thermal power: the type of packaging.
– Worst-case power: a representative set of target applications or a power virus.
– Max. app. power: the currently running application.
– Ave. power: a representative set of target applications.
– Ave. app. power: the currently running application.
Power consumption is one of the most difficult quantities to estimate.

Power Consumption Trend with PPM
[Chart: relative SPECint performance, die size, and power for the 486, Pentium, PIII, and P4. Source: Deborah Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Q1 2002.]
With PPM, it is hard to integrate multiple cores on a single chip.

Active Power Management
[Diagram: four processor cores with power, temperature, and performance sensors feeding an APM module, which drives frequency and voltage controllers and an off-chip voltage regulator.]
– Max. power is dynamically allocated among cores depending on the power requirements of the running apps.
Adjusting voltage is the most effective way to control power consumption.

PPM vs. APM
PPM:
– Designs for worst-case power.
– Max. power = thermal power; no power balancing between cores; fixed power budget.
– Processor runs at fixed voltage and frequency.
– Shutdown on overheat.
– Saves energy by:
• Local clock gating when blocks are idle.
• Global clock gating when the processor is idle (P states rarely used).
– No saving on static power consumption.
APM:
– Designs for ave. app. power; max. power may be adjusted.
– Dynamic power balancing among cores; power budget dynamically adjusted.
– Processor runs at variable voltages and frequencies.
– Shutdown on overheat.
– Saves energy by:
• Local clock gating.
• Dynamically adjusting voltage and clock frequency (hardware P-state transitions).
• Shutting off power to a block, saving static power consumption.
How fast the voltage regulator can change voltage determines how aggressively APM can be applied.

APM Challenges
Difficult to monitor the performance level.
– The processor may be stalled for memory accesses.
– Applications often show phase behaviors.
• Quick response time needed.
Customers want a minimum guarantee on performance.
May affect system software behaviors.
– Timestamps, timing loops, etc.
Intel IPF2 is a good start.
– Showed a 3x improvement in power efficiency.
– Still much room for improvement.
Software and systems vendors are now open to the concept of APM.

Summary
Server workloads are highly demanding because of their large code and data footprints.
– Large miss rates and high synchronization and I/O rates.
Past TSPs used a one-size-fits-all design style.
– Good performance gains in the 80's and 90's.
– Poor scalability in terms of power consumption and memory latency.
CMP + multithreading is a more scalable way to build SMPs.
– But it needs active power management techniques.
Future TSP designs will be more tailored and balanced, focusing on memory latency, TLP, soft errors, and power consumption.
The best is yet to come for TSPs.
– An extended period of growth and a simultaneous drop in cost.
TSPs – A Bright Future Awaits!

Q&A
Thank you! Hope you had fun and have learned something new.
Questions or comments?
Contact info: [email protected]
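The deck's claim that adjusting voltage is the most effective APM knob follows from the standard CMOS dynamic-power relation P = C·V²·f: since achievable frequency tracks voltage roughly linearly, scaling voltage (and frequency with it) cuts dynamic power roughly cubically. A minimal sketch, with all constants invented for illustration:

```python
# Why voltage scaling dominates: dynamic power = C_eff * V^2 * f, and
# frequency tracks voltage roughly linearly, so scaling V (with f along)
# cuts power ~cubically. Constants are illustrative, not a real processor.

def dynamic_power(c_eff, voltage, freq_hz):
    """Switching power of CMOS logic: P = C_eff * V^2 * f."""
    return c_eff * voltage**2 * freq_hz

def dvfs_scale(p_nominal, v_scale):
    """Power after scaling voltage by v_scale and frequency along with it."""
    return p_nominal * v_scale**2 * v_scale  # V^2 term times the f term

p0 = dynamic_power(c_eff=1.0e-9, voltage=1.2, freq_hz=3.0e9)
print(f"nominal: {p0:.2f} W, at 80% V/f: {dvfs_scale(p0, 0.8):.2f} W")
```

At 80% voltage and frequency, dynamic power drops to about half while performance falls only 20% — the asymmetry that makes hardware P-state transitions the centerpiece of APM.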