Download presentation source

Chapter 5 Future processor to use FineGrain Parallelism 1 Microprocessors: today (Y2K++)   Chip-Technology 2000/01: – 0.18-m CMOS-technology, 10 to 100 M transistors per chip, 600 MHz to 1.4 GHz cycle rate Example processors: – Intel Pentium III: 7.5 M transistors, 0.18 m (Bi-)CMOS, up to 1 GHz – Intel Pentium 4: ?? Transistors, 0.18 m (Bi-)CMOS, up to 1.4 GHz, uses a Trace Cache! – Intel IA-64 Itanium (already announced for 2000?): 0.18 m CMOS, 800 MHz; successor McKinley (announced for 2001): > 1 GHz – Alpha 21364: 100 M transistors, 0.18 m CMOS (1.5 volt, 100 watt), ? MHz – HAL SPARC uses Trace Cache and Value Prediction – Alpha 21464 will be 4-way simultaneous multithreaded – Sun MAJC will be a two processor-chip, each processor a 4-way blockmultithreaded VLIW 2 Trends and principles in the giga chip era  Forecasting the effects of technology is hard: “Everything that can be invented has been invented.” US Commissioner of Patents, 1899. “I think there is a world market for about five computers.” Thomas J. Watson Sr., IBM founder, 1943. 3 Microprocessors: tomorrow (Y2K-2012)  Moore's Law: number of transistors per chip double every two years  SIA (semiconductor industries association) prognosis 1998: Year of 1st shipment Local Clock (GHz) Across Chip (GHz) Chip Size (mm²) Dense Lines (nm) Number of chip I/O Transistors per chip 1997 1999 2002 2005 2008 2011 2014 0,75 1,25 2,1 3,5 6 10 16,9 0,75 1,2 1,6 2 2,5 3 3,674 300 340 430 520 620 750 901 250 180 130 100 70 50 35 1515 1867 2553 3492 4776 6532 8935 11M 21M 76M 200M 520M 1,4B 3,62B 4 Design challenges         increasing clock speed, the amount of work that can be performed per cycle, and the number of instructions needed to perform a task. Today's general trend toward more complex designs is opposed by the wiring delay within the processor chip as main technological problem. higher clock rates with subquarter-micron designs  on-chip interconnecting wires cause a significant portion of the delay time in circuits. Especially global interconnects within a processor chip cause problems with higher clock rates. Maintaining the integrity of a signal as it moves from one end of the chip to the other becomes more difficult. Copper metallization is worth a 20 % to 30 % reduction in wiring delay. 5 Application and economy-related trends    Applications: – generally: user interactions like video, audio, voice recognition, speech processing, and 3D graphics. – large data sets and huge databases, – large data mining applications, – transaction processing, – huge EDA applications like CAD/CAM software, – virtual reality computer games, – signal processing and real-time control. Colwell from Intel: the real threat for processor designers is shipping 30 million CPUs only to discover they are imperfect and cause a recall. Economies of scale: – Fabrication plants now cost about $2 billion, a factor of ten more than a decade ago. Manufacturers can only sustain such development costs if larger markets with greater economies of scale emerge.  Workloads will concentrate on human computer interface.  Multimedia workloads will grow and influence architectures. 6 Architectural challenges and implications       Preserve object code compatibility (may be avoided by a virtual machine that targets run-time ISAs) It is necessary to find ways of expressing and exposing more parallelism to the processor. It is doubtful if enough ILP is available. Buses may probably scale. Expect much wider buses in future. Memory bottleneck: Memory latency may be solved by a combination of technological improvements in memory chip technology and by applying advanced memory hierarchy techniques (other authors disagree). Power consumption for mobile computers and appliances. Soft errors by cosmic rays of gamma radiation may be faced with fault-tolerant design through the chip. 7 Possible solutions  a focus of processor chips on particular market segments; multimedia pushes desktop personal computers while high-end microprocessors will serve specialized applications; integrate functionalities to systems on a chip; partition a microprocessor in a client chip part that focuses on general user interaction enhanced by server chip parts that are tailored for special applications; a CPU core that works like a large ASIC block and that allows system developers to instantiate various devices on a chip with a simple CPU core; and reconfigurable on-chip parts that adapt to application requirements.  Functional partitioning becomes more important!      8 Future processor architecture principles     Speed-up of a single-threaded application: – Trace cache – Superspeculative – Advanced superscalar Speed-up of multi-threaded applications: – Chip multiprocessors (CMPs) – Simultaneous multithreading Speed-up of a single-threaded application by multithreading: – Multiscalar processors – Trace processors – DataScalar Exotics: – Processor-in-memory (PIM) or intelligent RAM (IRAM) – Reconfigurable – Asynchronous 9 Future processors to use coarse-grain parallelism   Today‘s microprocessors utilize instruction level parallelism by a deep instruction pipeline and by the superscalar or VLIW multiple issue techniques Today‘s (2001) technology: approx. 40 M transistors per chip, In future (2012): 1.4 G transistors per chip, What next?   Two directions: – Increase of single-thread performance --> use of more speculative instruction-level parallelism – Increase of multi-thread (multi-task) performance --> Utilize thread-level parallism additionally to instruction-level parallelism A „thread“ in this lecture means a „HW thread“ which can be a SW (Posix) thread, a process, ... Far future (??): Increase of single-thread performance by use of speculative instruction-level and thread-level parallelism 10 Processor techniques that speedup single threaded applications  Trace Cache tries to fetch from dynamic instruction sequences instead of the static code cache  Advanced superscalar processors scale current designs up to issue 16 or 32 instructions per cycle.  Superspeculative processors enhance wide-issue superscalar performance by speculating aggressively at every point. 11 The trace cache      Trace cache is a new paradigm for caching instructions. A Trace cache is a special I-cache that captures dynamic instruction sequences in contrast to the I-cache that contains static instruction sequences. Like the I-cache, the trace cache is accessed using the starting address of the next block of instructions. Unlike the I-cache, it stores logically contiguous instructions in physically contiguous storage. A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken branches. I-cache Trace cache 12 The trace cache (2)      Each line stores a snapshot, or trace, of the dynamic instruction stream. The trace construction is of the critical path. As a group of instructions is processed, it is latched into the fill unit. The fill unit maximizes the size of the segment and finalizes a segment when the segment can be expanded no further. – The number of instructions within a trace is limited by the trace cache line size. – Finalized segments are written into the trace cache. Instructions can be sent from the trace cache into the reservation stations (??) without having to undergo a large amount of processing and rerouting. – It is under research if the instructions in the Trace cache are • fetched but not yet decoded • decoded but not yet renamed • or decoded and partly renamed – Trace cache placement in the microarchitecture is dependent on this decision. 13 The trace cache - performance Three applications from the SPECint95 benchmarks are simulated on a 16wide issue machine with perfect branch prediction. 14 Superspeculative processors   Idea: Instructions generate many highly predictable result values in real programs  Speculate on source operand values and begin execution without waiting for result from the previous instruction. Speculate about true data dependences!! reasons for the existence of value locality – Due to register spill code the reuse distance of many shared values is very short in processor cycles. Many stores do not even make it out of the store queue before their values are needed again. – Input sets often contain data with little variation (e.g., sparse matrices or text files with white spaces). – A compiler often generates run-time constants due to error-checking, switch statement evaluation, and virtual function calls. – The compiler also often loads program constants from memory rather than using immediate operands. 15 Strong- vs. weak-dependence model   Strong-dependence model for program execution: a total instruction ordering of a sequential program. – Two instructions are identified as either dependent or independent, and when in doubt, dependences are pessimistically assumed to exist. – Dependences are never allowed to be violated and are enforced during instruction processing. – To date, most machines enforce such dependences in a rigorous fashion. – This traditional model is overly rigorous and unnecessarily restricts available parallelism. Weak-dependence model: – specifying that dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state. – Advantage: the machine can speculate aggressively and temporarily violate the dependences. The machine can exceed the performance limit imposed by the strongdependence model. 16 Implementation of a weak-dependence model  The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions to aggressively speculate past them.  The back-end engine still uses the strong-dependence model to validate the speculations, recover from misspeculation, and provide history and guidance information to the speculative engine. 17 Superflow processor  The Superflow processor speculates on – instruction flow: two-phase branch predictor combined with trace cache – register data flow: dependence prediction: predict the register value dependence between instructions • source operand value prediction • constant value prediction • value stride prediction: speculate on constant, incremental increases in operand values • dependence prediction predicts inter-instruction dependences – memory data flow: prediction of load values, of load addresses and alias prediction – Superflow simulations: 7.3 IPC for SPEC95 integer suite, up to 9 instructions per cycle when 32 instructions are potentially issued per cycle – With dependence and value prediction a three cycle issue nearly matches the performance of a single issue dispatch. 18 RSs Branch predictor RSs Trace cache Decode Floating-point Register dataflow D-cache RSs Memory Store queue Retire Reorder buffer I-cache RSs Instruction buffer Fetch Instruction flow Superflow processor proposal Memory dataflow Multimedia Integer 19 Advanced superscalar processors - Characteristics  Aggressive speculation, such as a very aggressive dynamic branch predictor,  a large trace cache,  very-wide-issue superscalar processing (an issue width of 16 or 32 instructions per cycle),  a large number of reservation stations to accommodate 2,000 instructions,  24 to 48 highly optimized, pipelined functional units,  sufficient on-chip data cache, and  sufficient resolution and forwarding logic. 20 Advanced superscalar processor 21 Requirements and solution    Delivering optimal instruction bandwidth requires: – a minimal number of empty fetch cycles, – a very wide (conservatively 16 instructions, aggressively 32), full issue each cycle, – and a minimal number of cycles in which the instructions fetched are subsequently discarded. Consuming this instruction bandwidth requires: – sufficient data supply, – and sufficient processing resources to handle the instruction bandwidth. Suggestions: – an instruction cache system (the I-cache) that provides for out-of-order fetch (fetch, decode, and issue in the presence of I-cache misses). – a large Trace cache for providing a logically contiguous instruction stream, – an aggressive Multi-Hybrid branch predictor (multiple, separate branch predictors, each tuned to a different class of branches) with support for context switching, indirect jumps, and interference handling. 22 Multi-hybrid branch predictor        Hybrid predictors comprise several predictors, each targeting different classes of branches. Idea: each predictor scheme works best for another branch type. McFarling already combined two predictors (see also PowerPC 620, Alpha 21364,...). As predictors increase in size, they often take more time to react to changes in a program (warm-up time). A hybrid predictor with several components can solve this problem by using component predictors with shorter warm-up times while the larger predictors are warming up. Examples of predictors with shorter warm-up times are two-level predictors with shorter histories as well as smaller dynamic predictors. The Multi-Hybrid uses a set of selection counters for each entry in the branch target buffer, in the trace cache, or in a similar structure, keeping track of the predictor currently most accurate for each branch and then using the prediction from that predictor for that branch. 23 Instruction supply and out-of-order fetch       An in-order fetch processor, upon encountering a trace cache miss, waits until the miss is serviced before fetching any new segments. But an out-of-order fetch processor temporarily ignores the segment associated with the miss, attempting to fetch, decode, and issue the segments that follow it. After the miss has been serviced, the processor decodes and issues the ignored segment. Higher performance can be achieved by fetching instructions that—in terms of a dynamic instruction trace—appear after a mispredicted branch, but are not control-dependent upon that branch. In the event of a mispredict, only instructions control-dependent on the mispredicted branch are discarded. Out-of-order fetch provides a way to fetch control-independent instructions. 24 Data supply      A 16-wide-issue processor will need to execute about eight loads/stores per cycle. The primary design goal of the data-cache hierarchy is to provide the necessary bandwidth to support eight loads/stores per cycle. The size of a single, monolithic, multi-ported, first-level data cache would likely be so large that it would jeopardize the cycle time. Because of this, we expect the first-level data cache to be replicated to provide the required ports. Further features of the data supply system: – A bigger, second-level data cache with less port requirements. – Data prefetching. – Processors will predict the addresses of loads, allowing loads to be executed before the computation of operands needed for their address calculation. – Processors will predict dependencies between loads and stores, allowing them to predict that a load is always dependent on some older store. 25 Execution core       If the fetch engine is providing 16 to 32 instructions per cycle, then the execution core must consume instructions just as rapidly. To avoid unnecessary delays due to false dependencies, logical registers must be renamed. To compensate for the delays imposed by the true data dependencies, instructions must be executed out of order. Patt et al. envision an execution core comprising 24 to 48 functional units supplied with instructions from large reservation stations and having a total storage capacity of 2,000 or more instructions. Functional units will be partitioned into clusters of three to five units. Each cluster will maintain an individual register file. Each functional unit has its own reservation station. Instruction scheduling will be done in stages. 26 Conclusions  Patt et al. argue that the highest performance computing system (when one billion transistor chips are available) will contain on each processor chip a single processor.  All this makes sense, however, only – if CAD tools can be improved to design such chips – if algorithms and compilers can be redesigned to take advantage of such powerful dynamically scheduled engines. 27 Processors that use thread-level speculation to boost single-threaded programs  Multiscalar processors divide a program in a collection of tasks that are distributed to a number of parallel processing units under control of a single hardware sequencer.  Trace processors facilitate high ILP and a fast clock by breaking up the processor into multiple distinct cores (similar to multiscalar!), and breaking up the program into traces (dynamic sequences of instructions).  DataScalar processors run the same sequential program redundantly across multiple processors using distributed data sets. 28 Multiscalar processors        A program is represented as a control flow graph (CFG), where basic blocks are nodes, and arcs represent flow of control. Program execution can be viewed as walking through the CFG with a high level of parallelism. A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task. The primary constraint: it must preserve the sequential program semantics. A program is statically partitioned into tasks which are marked by annotations of the CFG. The tasks are distributed to a number of parallel PEs within a processor. Each PE fetches and executes instructions belonging to its assigned task. 29 Multiscalar mode of execution PE 0 A B C D Data values Task A PE 1 Task B PE 2 Task D E PE 3 Task E 30 Multiscalar processor Tail Sequencer I-cache unidirectional unidirectional ring ring Processing ... ... unit Processing element n Processing element 1 Head Register file ... D-cache Data bank ARB Interconnect ... Data bank 31 Proper resolution of inter-task data dependences      Concerns in particular data that is passed between instructions via registers and via memory. To maintain a sequential appearance a twofold strategy is employed. – First, each processing element adheres to sequential execution semantics for the task assigned to it. – Second, a loose sequential order is enforced over the collection of processing elements, which in turn imposes a sequential order of the tasks. The sequential order on the processing elements is maintained by organizing the elements into a circular queue. The appearance of a single logical register file is maintained although copies are distributed to each parallel PE. – Register results are dynamically routed among the many parallel processing elements with the help of compiler-generated masks. For memory operations: An address resolution buffer (ARB) is provided to hold speculative memory operations and to detect violations of memory dependences. 32 The multiscalar paradigm has at least two forms of speculation:  Control speculation, which is used by the task sequencer, and  Data dependence speculation, which is performed by each PE.  It could also use other forms of speculation, such as data value speculation, to alleviate inter-task dependences. 33 Alternatives with thread-level speculation  Static Thread-level speculation: multiscalar  Dynamic thread-level speculation: „threads“ are generated during run-time from: – Loops – Subprogram invocations – Specific start or stop instruction – Traces: Trace processors  Datascalar 34 Trace processors  The focus of a Trace processor is the combination of the Trace cache with multiscalar.  Idea: Create subsystems similar in complexity to today's superscalar processors and combine replicated subsystems into a full processor.  A trace is a sequence of instructions that potentially covers several basic blocks starting at any point in the dynamic instruction stream.  A trace cache is a special I-cache that captures dynamic instruction sequences in contrast to the I-cache that contains static instruction sequences. 35 Trace processor Processing element 1 Global registers Local registers Instruction buffer Branch prediction Functional units Next-trace prediction Processing element 2 ... Instruction preprocessing Trace cache Trace construction Data-value prediction Processing element n 36 Trace processor       Instruction fetch hardware fetches instructions from the I-cache and simultaneously generates traces of 8 to 32 instructions including predicted conditional branches. Traces are built as the program executes and they are stored in a trace cache. A trace fetch unit reads traces from the trace cache and parcels them out to the parallel PEs. Next trace prediction speculates on the next traces to be executed, – Next trace prediction predicts multiple branches per cycle. – Data prediction speculates on the trace's input data values. Constant value prediction predicts 80% correct for gcc. A trace cache miss causes a trace to be built through conventional instruction fetching with branch prediction. Trace processor is similar to multiscalar except for its use of hardwaregenerated dynamic traces rather than compiler-generated static tasks. 37 DataScalar processors  The DataScalar model of execution runs the same sequential program redundantly across multiple processors.  The data set is distributed across physical memories that are tightly coupled to their distinct processors.   Each processor broadcasts operands that it loads from its local memory to all other processors.  Instead of explicitly accessing a remote memory, processors wait until the requested value is broadcasted.  Stores are completed only by the processor that owns the operand, and are dropped by the others. 38 Address space  The address space is divided into a replicated and a communicated section: – The communicated section holds values that only exist in single copies and are owned by the respective processor. – The communicated section of the address space is distributed among the processors. – Replicated pages are mapped in each processor's local memory. – Access to a replicated page requires no data communication. 39 Accesses of DataScalar processors to memory Memory Processor 1 load-1 store-1 replicated load-2 store-2 communicated broadcast load-2 Processor 2 load-1 store-1 store-2 Memory replicated communicated 40 Why DataScalar?  The main goal of the DataScalar model of execution is the improvement of memory system performance by introducing redundancy in execution by replicating processors and part of the data storage.  Since all physical memory is local to at least one processor, a request for a remote operand is never sent.  Memory access latency and bus traffic is reduced.  All communication is one-way.  Writes never appear on the global bus. 41 Execution mode  The processors execute the same program in slightly different time steps.  The lead processor runs slightly ahead of the others, especially when it is broadcasting while the others wait for the broadcasted value.  When the program execution accesses an operand that is not owned by the lead processor, a lead change occurs.  All processors stall until the new lead processor catches up and broadcasts its operands.  The capability that each processor may run ahead on computation that involves operands owned by the processor is called datathreading. 42 Pros and cons  The DataScalar model is primarily a memory system optimization intended for codes that are performance limited by the memory system and difficult to parallelize.  Simulation results : the DataScalar model of execution works best with codes for which traditional parallelization techniques fail.  Six unmodied SPEC95 binaries ran from 7 % slower to 50 % faster on two nodes, and from 9 % to 100 % faster on four nodes, than on a system with comparable, more traditional memory system.  Current technological parameters do not make DataScalar systems a costeffective alternative to today's microprocessors. 43 Pros and cons  For a DataScalar system to be more cost-effective than the alternatives, the following three conditions must hold: – Processing power must be cheap, the dominant cost of each node should be memory. – Remote memory accesses should be slower than local memory accesses. – Broadcasts should not be prohibitively expensive.  Three possible candidates: – IRAM: because remote memory accesses to other IRAM chips will be more expensive than on-chip memory accesses. – extending concept of CMP: CMPs access operands from a processor-local memory faster than requesting an operand from a remote processor memory across the chip due to wiring delays. – NOWs (networks of workstations): alternative to paging, provided that broadcasts are efficient. 44

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download presentation source