Chapter 5
Future processors to use fine-grain parallelism
Microprocessors: today (Y2K++)
Chip technology 2000/01:
– 0.18-µm CMOS technology, 10 to 100 M transistors per chip,
600 MHz to 1.4 GHz cycle rate
Example processors:
– Intel Pentium III: 7.5 M transistors, 0.18 µm (Bi-)CMOS, up to 1 GHz
– Intel Pentium 4: ?? transistors, 0.18 µm (Bi-)CMOS, up to 1.4 GHz, uses a
Trace Cache!
– Intel IA-64 Itanium (already announced for 2000?): 0.18 µm CMOS, 800 MHz;
successor McKinley (announced for 2001): > 1 GHz
– Alpha 21364: 100 M transistors, 0.18 µm CMOS (1.5 volt, 100 watt), ? MHz
– HAL SPARC uses Trace Cache and Value Prediction
– Alpha 21464 will be 4-way simultaneous multithreaded
– Sun MAJC will be a two-processor chip, each processor a 4-way block-multithreaded VLIW
Trends and principles in the giga chip era
Forecasting the effects of technology is hard:
“Everything that can be invented has been invented.”
US Commissioner of Patents, 1899.
“I think there is a world market for about five computers.”
Thomas J. Watson Sr., IBM founder, 1943.
Microprocessors: tomorrow (Y2K-2012)
Moore's Law: the number of transistors per chip doubles every two years
SIA (Semiconductor Industry Association) prognosis 1998:

Year of 1st shipment    1997  1999  2002  2005  2008  2011  2014
Local clock (GHz)       0.75  1.25   …     …     …    10    16.9
Across chip (GHz)        …     …     …     …     …     3    3.674
Chip size (mm²)          …
Dense lines (nm)         …
Number of chip I/O      1515  1867  2553  3492  4776  6532  8935
Transistors per chip    11M   21M   76M   200M  520M  1.4B  3.62B
Design challenges
Performance improvements depend on three factors:
– increasing clock speed,
– increasing the amount of work that can be performed per cycle,
– and reducing the number of instructions needed to perform a task.
Today's general trend toward more complex designs is opposed by the wiring
delay within the processor chip as the main technological problem:
higher clock rates with sub-quarter-micron designs
 on-chip interconnect wires cause a significant portion of the delay time
in circuits.
Global interconnects within a processor chip in particular cause problems at
higher clock rates.
Maintaining the integrity of a signal as it moves from one end of the chip to the
other becomes more difficult.
Copper metallization yields a 20 % to 30 % reduction in wiring delay.
Application and economy-related trends
– generally: user interactions like video, audio, voice recognition, speech
processing, and 3D graphics.
– large data sets and huge databases,
– large data mining applications,
– transaction processing,
– huge EDA applications like CAD/CAM software,
– virtual reality computer games,
– signal processing and real-time control.
Colwell from Intel: the real threat for processor designers is shipping 30 million
CPUs only to discover they are imperfect and cause a recall.
Economies of scale:
– Fabrication plants now cost about $2 billion, a factor of ten more than a
decade ago. Manufacturers can only sustain such development costs if
larger markets with greater economies of scale emerge.
 Workloads will concentrate on human computer interface.
 Multimedia workloads will grow and influence architectures.
Architectural challenges and implications
Preserve object code compatibility (may be avoided by a virtual machine that
targets run-time ISAs)
It is necessary to find ways of expressing and exposing more parallelism to the
processor. It is doubtful whether enough ILP is available.
Buses will probably scale; expect much wider buses in the future.
Memory bottleneck: Memory latency may be solved by a combination of
technological improvements in memory chip technology and by applying
advanced memory hierarchy techniques (other authors disagree).
Power consumption for mobile computers and appliances.
Soft errors caused by cosmic rays or gamma radiation may be countered with
fault-tolerant design throughout the chip.
Possible solutions
a focus of processor chips on particular market segments;
multimedia pushes desktop personal computers while high-end
microprocessors will serve specialized applications;
integrate functionalities to systems on a chip;
partition a microprocessor into a client chip part that focuses on general user
interaction, enhanced by server chip parts that are tailored for special tasks;
a CPU core that works like a large ASIC block and that allows system
developers to instantiate various devices on a chip with a simple CPU core;
and reconfigurable on-chip parts that adapt to application requirements.
Functional partitioning becomes more important!
Future processor architecture principles
Speed-up of a single-threaded application:
– Trace cache
– Superspeculative
– Advanced superscalar
Speed-up of multi-threaded applications:
– Chip multiprocessors (CMPs)
– Simultaneous multithreading
Speed-up of a single-threaded application by multithreading:
– Multiscalar processors
– Trace processors
– DataScalar
– Processor-in-memory (PIM) or intelligent RAM (IRAM)
– Reconfigurable
– Asynchronous
Future processors to use coarse-grain parallelism
Today's microprocessors utilize instruction-level parallelism by a deep
instruction pipeline and by the superscalar or VLIW multiple-issue techniques.
Today's (2001) technology: approx. 40 M transistors per chip.
In future (2012): 1.4 G transistors per chip.
What next?
Two directions:
– Increase of single-thread performance
--> use of more speculative instruction-level parallelism
– Increase of multi-thread (multi-task) performance
--> Utilize thread-level parallelism in addition to instruction-level parallelism
A "thread" in this lecture means a "HW thread", which can be a SW (Posix)
thread, a process, ...
Far future (??): Increase of single-thread performance by use of speculative
instruction-level and thread-level parallelism
Processor techniques that speed up single-threaded applications
The Trace Cache fetches from dynamic instruction sequences instead of from
the static code in a conventional I-cache.
Advanced superscalar processors scale current designs up to issue 16 or 32
instructions per cycle.
Superspeculative processors enhance wide-issue superscalar performance by
speculating aggressively at every point.
The trace cache
Trace cache is a new paradigm for caching instructions.
A Trace cache is a special I-cache that captures dynamic instruction sequences
in contrast to the I-cache that contains static instruction sequences.
Like the I-cache, the trace cache is accessed using the starting address of the
next block of instructions.
Unlike the I-cache, it stores logically contiguous instructions in physically
contiguous storage.
A trace cache line stores a segment of the dynamic instruction trace across
multiple, potentially taken branches.
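The indexing scheme described above can be sketched in a few lines; the structure and the parameters (line length, branch limit) are illustrative assumptions, not any shipped design:

```python
# Minimal trace-cache sketch: lines are indexed by the starting address
# plus the outcomes of the branches inside the trace, so one static start
# address can map to several different dynamic instruction sequences.

class TraceCache:
    def __init__(self, max_len=16, max_branches=3):
        self.lines = {}  # (start_pc, branch_outcomes) -> instruction list
        self.max_len = max_len
        self.max_branches = max_branches

    def insert(self, start_pc, branch_outcomes, trace):
        if len(trace) <= self.max_len and len(branch_outcomes) <= self.max_branches:
            self.lines[(start_pc, tuple(branch_outcomes))] = list(trace)

    def fetch(self, start_pc, predicted_outcomes):
        # Hit only if both the start address and the predicted branch
        # directions match a stored dynamic sequence.
        return self.lines.get((start_pc, tuple(predicted_outcomes)))

tc = TraceCache()
# A dynamic sequence crossing a taken branch and then a not-taken branch:
tc.insert(0x100, [True, False], ["i1", "br1", "i2", "br2", "i3"])
hit = tc.fetch(0x100, [True, False])    # whole dynamic trace in one fetch
miss = tc.fetch(0x100, [False, False])  # different predicted path: miss
```

The point of the sketch is the key: unlike an I-cache, the predicted branch directions are part of the lookup.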
Trace cache
The trace cache (2)
Each line stores a snapshot, or trace, of the dynamic instruction stream.
The trace construction is off the critical path.
As a group of instructions is processed, it is latched into the fill unit.
The fill unit maximizes the size of the segment and finalizes a segment when the
segment can be expanded no further.
– The number of instructions within a trace is limited by the trace cache line size.
– Finalized segments are written into the trace cache.
Instructions can be sent from the trace cache into the reservation stations (??)
without having to undergo a large amount of processing and rerouting.
– It is under research if the instructions in the Trace cache are
• fetched but not yet decoded
• decoded but not yet renamed
• or decoded and partly renamed
– Trace cache placement in the microarchitecture depends on this decision.
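The fill-unit behaviour above can be sketched roughly as follows; the limits and the stream encoding are hypothetical simplifications:

```python
# Sketch of a fill unit: retired instructions are latched in; a segment is
# finalized (and would be written into the trace cache) when adding the
# next instruction would exceed the line size or the branch limit.

MAX_INSTRS, MAX_BRANCHES = 16, 3

def build_segments(dynamic_stream):
    """dynamic_stream: iterable of (instruction, is_branch) pairs."""
    segments, current, branches = [], [], 0
    for instr, is_branch in dynamic_stream:
        if len(current) == MAX_INSTRS or (is_branch and branches == MAX_BRANCHES):
            segments.append(current)      # segment can grow no further
            current, branches = [], 0
        current.append(instr)
        branches += is_branch
    if current:
        segments.append(current)
    return segments

# Ten instructions where every second one is a branch:
stream = [(f"i{k}", k % 2 == 1) for k in range(10)]
segments = build_segments(stream)
```

With three branches allowed per line, the fourth branch forces the first segment (seven instructions) to be finalized before the stream continues.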
The trace cache - performance
Three applications from the benchmarks are simulated on a 16-wide-issue
machine with perfect branch prediction (figure).
Superspeculative processors
Idea: Instructions generate many highly predictable result values in real
programs.
 Speculate on source operand values and begin execution without waiting
for the result of the previous instruction.
Speculate about true data dependences!!
Reasons for the existence of value locality:
– Due to register spill code the reuse distance of many shared values is very
short in processor cycles.
Many stores do not even make it out of the store queue before their values
are needed again.
– Input sets often contain data with little variation (e.g., sparse matrices or
text files with white spaces).
– A compiler often generates run-time constants due to error-checking,
switch statement evaluation, and virtual function calls.
– The compiler also often loads program constants from memory rather than
using immediate operands.
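A last-value predictor exploiting this locality could look roughly like this; the table organization and the 2-bit confidence threshold are illustrative assumptions:

```python
# Sketch of a last-value predictor: a per-instruction table holds the last
# result plus a 2-bit confidence counter; a value is only speculated on
# once confidence is high.

class LastValuePredictor:
    def __init__(self):
        self.table = {}   # pc -> [last_value, confidence 0..3]

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is not None and entry[1] >= 2:
            return entry[0]    # confident: speculate on the previous result
        return None            # no speculation

    def update(self, pc, actual):
        entry = self.table.setdefault(pc, [actual, 0])
        if entry[0] == actual:
            entry[1] = min(3, entry[1] + 1)     # saturating increment
        else:
            entry[0], entry[1] = actual, max(0, entry[1] - 1)

lvp = LastValuePredictor()
for _ in range(3):          # e.g. a load that keeps returning the same value
    lvp.update(0x80, 7)
guess = lvp.predict(0x80)   # confidence is now high enough to predict 7
```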
Strong- vs. weak-dependence model
Strong-dependence model for program execution:
a total instruction ordering of a sequential program.
– Two instructions are identified as either dependent or independent, and
when in doubt, dependences are pessimistically assumed to exist.
– Dependences are never allowed to be violated and are enforced during
instruction processing.
– To date, most machines enforce such dependences in a rigorous fashion.
– This traditional model is overly rigorous and unnecessarily restricts
available parallelism.
Weak-dependence model:
– specifying that dependences can be temporarily violated during instruction
execution as long as recovery can be performed prior to affecting the
permanent machine state.
– Advantage: the machine can speculate aggressively and temporarily violate
the dependences.
The machine can exceed the performance limit imposed by the strong-dependence model.
Implementation of a weak-dependence model
The front-end engine assumes the weak-dependence model and is highly
speculative, predicting instructions to aggressively speculate past them.
The back-end engine still uses the strong-dependence model to validate the
speculations, recover from misspeculation, and provide history and guidance
information to the speculative engine.
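The split described above can be caricatured in a few lines; the function names and the single-operand shape are hypothetical:

```python
# Sketch of the weak-dependence idea: the front end predicts a source
# operand and executes early (temporarily violating the dependence); the
# back end validates against the real value and replays on a mismatch, so
# the permanent machine state is never affected by a wrong guess.

def speculative_execute(op, predicted_operand, real_operand):
    result = op(predicted_operand)          # front end: execute early
    if real_operand == predicted_operand:   # back end: validate
        return result, False                # speculation held, no recovery
    return op(real_operand), True           # recovery: re-execute with real value

ok_result, recovered = speculative_execute(lambda x: x + 1, 5, 5)
bad_result, replayed = speculative_execute(lambda x: x + 1, 5, 7)
```

Either way the committed result is correct; the speculation only changes how early the dependent work could start.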
Superflow processor
The Superflow processor speculates on
– instruction flow: two-phase branch predictor combined with trace cache
– register data flow: dependence prediction: predict the register value
dependence between instructions
• source operand value prediction
• constant value prediction
• value stride prediction: speculate on constant, incremental increases in
operand values
• dependence prediction predicts inter-instruction dependences
– memory data flow: prediction of load values, of load addresses, and of aliases
– Superflow simulations: 7.3 IPC for SPEC95 integer suite, up to 9
instructions per cycle when 32 instructions are potentially issued per cycle
– With dependence and value prediction, a three-cycle issue nearly matches
the performance of a single-cycle issue.
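Value stride prediction, one of the speculation forms listed above, can be sketched as follows (the table layout is an illustrative assumption):

```python
# Sketch of value stride prediction: per instruction, track the last
# result and the stride between the last two results, then predict
# last + stride — capturing constant, incremental increases in operand
# values such as loop-carried address increments.

class StridePredictor:
    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, actual):
        last, _ = self.table.get(pc, (actual, 0))
        self.table[pc] = (actual, actual - last)

sp = StridePredictor()
for value in (100, 104, 108):   # e.g. an address incremented by 4 per iteration
    sp.update(0x40, value)
prediction = sp.predict(0x40)   # last value 108 + stride 4
```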
Superflow processor (figure): instruction flow, register dataflow, and memory
dataflow paths, with instruction buffer, reorder buffer, and store queue.
Advanced superscalar processors - Characteristics
Aggressive speculation, such as a very aggressive dynamic branch predictor,
a large trace cache,
very-wide-issue superscalar processing (an issue width of 16 or 32
instructions per cycle),
a large number of reservation stations to accommodate 2,000 instructions,
24 to 48 highly optimized, pipelined functional units,
sufficient on-chip data cache, and
sufficient resolution and forwarding logic.
Advanced superscalar processor
Requirements and solution
Delivering optimal instruction bandwidth requires:
– a minimal number of empty fetch cycles,
– a very wide (conservatively 16 instructions, aggressively 32) full issue each cycle,
– and a minimal number of cycles in which the instructions fetched are
subsequently discarded.
Consuming this instruction bandwidth requires:
– sufficient data supply,
– and sufficient processing resources to handle the instruction bandwidth.
Solutions:
– an instruction cache system (the I-cache) that provides for out-of-order fetch
(fetch, decode, and issue in the presence of I-cache misses),
– a large Trace cache for providing a logically contiguous instruction stream,
– an aggressive Multi-Hybrid branch predictor (multiple, separate branch
predictors, each tuned to a different class of branches) with support for
context switching, indirect jumps, and interference handling.
Multi-hybrid branch predictor
Hybrid predictors comprise several predictors, each targeting different classes
of branches.
Idea: each predictor scheme works best for a different branch type.
McFarling already combined two predictors
(see also PowerPC 620, Alpha 21364,...).
As predictors increase in size, they often take more time to react to changes in
a program (warm-up time).
A hybrid predictor with several components can solve this problem by using
component predictors with shorter warm-up times while the larger predictors
are warming up.
Examples of predictors with shorter warm-up times are two-level predictors
with shorter histories as well as smaller dynamic predictors.
The Multi-Hybrid uses a set of selection counters for each entry in the branch
target buffer, in the trace cache, or in a similar structure, keeping track of the
predictor currently most accurate for each branch and then using the
prediction from that predictor for that branch.
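The selection-counter mechanism can be sketched roughly like this; the component predictors here are trivial stand-ins for real schemes such as gshare or a bimodal predictor:

```python
# Sketch of the Multi-Hybrid selection mechanism: per branch, one
# saturating counter per component predictor tracks which component has
# been most accurate for that branch, and that component supplies the
# prediction.

class MultiHybrid:
    def __init__(self, predictors):
        self.predictors = predictors   # callables: pc -> predicted direction
        self.counters = {}             # pc -> one score per component

    def predict(self, pc):
        scores = self.counters.setdefault(pc, [0] * len(self.predictors))
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.predictors[best](pc)

    def update(self, pc, taken):
        scores = self.counters.setdefault(pc, [0] * len(self.predictors))
        for i, component in enumerate(self.predictors):
            if component(pc) == taken:
                scores[i] = min(3, scores[i] + 1)   # 2-bit saturating count
            else:
                scores[i] = max(0, scores[i] - 1)

always_not_taken = lambda pc: False
always_taken = lambda pc: True
hybrid = MultiHybrid([always_not_taken, always_taken])
for _ in range(3):             # the branch at 0x10 is always taken
    hybrid.update(0x10, True)
```

After a few updates the counters favour the component that matches this branch's behaviour, so its prediction is used.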
Instruction supply and out-of-order fetch
An in-order fetch processor, upon encountering a trace cache miss, waits until
the miss is serviced before fetching any new segments.
But an out-of-order fetch processor temporarily ignores the segment
associated with the miss, attempting to fetch, decode, and issue the segments
that follow it.
After the miss has been serviced, the processor decodes and issues the
ignored segment.
Higher performance can be achieved by fetching instructions that—in terms of
a dynamic instruction trace—appear after a mispredicted branch, but are not
control-dependent upon that branch.
In the event of a mispredict, only instructions control-dependent on the
mispredicted branch are discarded.
Out-of-order fetch provides a way to fetch control-independent instructions.
Data supply
A 16-wide-issue processor will need to execute about eight loads/stores per cycle.
The primary design goal of the data-cache hierarchy is to provide the
necessary bandwidth to support eight loads/stores per cycle.
The size of a single, monolithic, multi-ported, first-level data cache would likely
be so large that it would jeopardize the cycle time.
Because of this, we expect the first-level data cache to be replicated to provide
the required ports.
Further features of the data supply system:
– A bigger, second-level data cache with less port requirements.
– Data prefetching.
– Processors will predict the addresses of loads, allowing loads to be
executed before the computation of the operands needed for their address
calculation.
– Processors will predict dependencies between loads and stores, allowing
them to predict that a load is always dependent on some older store.
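The load/store dependence prediction in the last bullet can be sketched as follows; the set-of-pairs representation is a simplifying assumption (real designs use structures such as store sets):

```python
# Sketch of memory-dependence prediction: remember (load, store) pairs
# that conflicted in the past; a load predicted to depend on a pending
# older store waits for it, while all other loads speculate past stores.

class MemoryDependencePredictor:
    def __init__(self):
        self.conflicts = set()   # (load_pc, store_pc) pairs seen to conflict

    def record_violation(self, load_pc, store_pc):
        self.conflicts.add((load_pc, store_pc))

    def must_wait(self, load_pc, pending_store_pcs):
        return any((load_pc, s) in self.conflicts for s in pending_store_pcs)

mdp = MemoryDependencePredictor()
mdp.record_violation(load_pc=0x50, store_pc=0x20)  # one past misspeculation
wait = mdp.must_wait(0x50, [0x20, 0x30])   # now predicted dependent: wait
free = mdp.must_wait(0x50, [0x30])         # no known conflict: speculate
```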
Execution core
If the fetch engine is providing 16 to 32 instructions per cycle, then the
execution core must consume instructions just as rapidly.
To avoid unnecessary delays due to false dependencies, logical registers must
be renamed.
To compensate for the delays imposed by the true data dependencies,
instructions must be executed out of order.
Patt et al. envision an execution core comprising 24 to 48 functional units
supplied with instructions from large reservation stations and having a total
storage capacity of 2,000 or more instructions.
Functional units will be partitioned into clusters of three to five units.
Each cluster will maintain an individual register file.
Each functional unit has its own reservation station.
Instruction scheduling will be done in stages.
Patt et al. argue that the highest performance computing system (when one
billion transistor chips are available) will contain on each processor chip a
single processor.
All this makes sense, however, only
– if CAD tools can be improved to design such chips
– if algorithms and compilers can be redesigned to take advantage of such
powerful dynamically scheduled engines.
Processors that use thread-level speculation to
boost single-threaded programs
Multiscalar processors divide a program in a collection of tasks that are
distributed to a number of parallel processing units under control of a single
hardware sequencer.
Trace processors facilitate high ILP and a fast clock by breaking up the
processor into multiple distinct cores (similar to multiscalar!), and breaking up
the program into traces (dynamic sequences of instructions).
DataScalar processors run the same sequential program redundantly across
multiple processors using distributed data sets.
Multiscalar processors
A program is represented as a control flow graph (CFG), where basic blocks are
nodes, and arcs represent flow of control.
Program execution can be viewed as walking through the CFG with a high level
of parallelism.
A multiscalar processor walks through the CFG speculatively, taking task-sized
steps, without pausing to inspect any of the instructions within a task.
The primary constraint: it must preserve the sequential program semantics.
A program is statically partitioned into tasks which are marked by annotations of
the CFG.
The tasks are distributed to a number of parallel PEs within a processor.
Each PE fetches and executes instructions belonging to its assigned task.
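The task-sized walk can be sketched as follows; the CFG and the perfect next-task lookup are hypothetical simplifications of the real task sequencer:

```python
# Sketch of the multiscalar sequencer: it walks a task-annotated CFG in
# task-sized steps, assigning each whole task to the next PE in a circular
# queue without inspecting the instructions inside the task.

# Hypothetical task-level CFG: task -> (predicted) successor task
cfg_next_task = {"A": "B", "B": "D", "D": "E", "E": None}

def sequence_tasks(start_task, num_pes):
    assignments, task, pe = [], start_task, 0
    while task is not None:
        assignments.append((pe, task))   # PE executes the whole task
        task = cfg_next_task[task]       # speculative task-sized step
        pe = (pe + 1) % num_pes          # circular queue preserves order
    return assignments

plan = sequence_tasks("A", num_pes=4)
```

The circular PE order is what lets the hardware later enforce a loose sequential order over the speculatively started tasks.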
Multiscalar mode of execution (figure): tasks A, B, D, and E are assigned to
processing elements PE 0 to PE 3, which forward data values among one another.
Multiscalar processor (figure): processing elements 1 to n.
Proper resolution of inter-task data dependences
This concerns in particular data that is passed between instructions via registers
and via memory.
To maintain a sequential appearance a twofold strategy is employed.
– First, each processing element adheres to sequential execution semantics
for the task assigned to it.
– Second, a loose sequential order is enforced over the collection of
processing elements, which in turn imposes a sequential order on the tasks.
The sequential order on the processing elements is maintained by organizing
the elements into a circular queue.
The appearance of a single logical register file is maintained although copies
are distributed to each parallel PE.
– Register results are dynamically routed among the many parallel
processing elements with the help of compiler-generated masks.
For memory operations, an address resolution buffer (ARB) is provided to hold
speculative memory operations and to detect violations of memory
dependences.
The multiscalar paradigm uses at least two forms of speculation:
Control speculation, which is used by the task sequencer, and
Data dependence speculation, which is performed by each PE.
It could also use other forms of speculation, such as data value speculation,
to alleviate inter-task dependences.
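The ARB's violation detection mentioned above can be sketched as follows, heavily simplified and with illustrative details:

```python
# Sketch of ARB-style violation detection: speculative loads are recorded
# per address with their task number; a store from an *older* task to an
# address a *younger* task has already loaded means the younger task
# consumed a stale value and must be squashed and re-executed.

class AddressResolutionBuffer:
    def __init__(self):
        self.loads = {}   # address -> oldest task that speculatively loaded it

    def record_load(self, address, task):
        if address not in self.loads or task < self.loads[address]:
            self.loads[address] = task

    def store_violates(self, address, task):
        loader = self.loads.get(address)
        # violation only if some younger task already loaded this address
        return loader is not None and loader > task

arb = AddressResolutionBuffer()
arb.record_load(0x1000, task=2)               # task 2 speculatively loaded
older = arb.store_violates(0x1000, task=1)    # older store arrives: violation
younger = arb.store_violates(0x1000, task=3)  # younger store: no violation
```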
Alternatives with thread-level speculation
Static thread-level speculation: multiscalar
Dynamic thread-level speculation: "threads" are generated during run-time
– Loops
– Subprogram invocations
– Specific start or stop instruction
– Traces: Trace processors
Trace processors
The focus of a Trace processor is the combination of the trace cache with
multiple parallel processing elements.
Idea: Create subsystems similar in complexity to today's superscalar
processors and combine replicated subsystems into a full processor.
A trace is a sequence of instructions that potentially covers several basic blocks
starting at any point in the dynamic instruction stream.
A trace cache is a special I-cache that captures dynamic instruction sequences
in contrast to the I-cache that contains static instruction sequences.
Trace processor (figure): a trace cache supplying processing elements 1 to n.
Trace processor
Instruction fetch hardware fetches instructions from the I-cache and
simultaneously generates traces of 8 to 32 instructions including predicted
conditional branches.
Traces are built as the program executes and they are stored in a trace cache.
A trace fetch unit reads traces from the trace cache and parcels them out to the
parallel PEs.
Next trace prediction speculates on the next traces to be executed:
– Next trace prediction predicts multiple branches per cycle.
– Data prediction speculates on the trace's input data values.
Constant value prediction achieves 80 % correct predictions for gcc.
A trace cache miss causes a trace to be built through conventional instruction
fetching with branch prediction.
A Trace processor is similar to a multiscalar processor, except for its use of hardware-generated dynamic traces rather than compiler-generated static tasks.
DataScalar processors
The DataScalar model of execution runs the same sequential program
redundantly across multiple processors.
The data set is distributed across physical memories that are tightly coupled to
their distinct processors.
Each processor broadcasts operands that it loads from its local memory to all
other processors.
Instead of explicitly accessing a remote memory, processors wait until the
requested value is broadcasted.
Stores are completed only by the processor that owns the operand, and are
dropped by the others.
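The broadcast rules above can be sketched as a toy interpreter; the memory partitions and the program encoding are illustrative:

```python
# Sketch of DataScalar execution: all processors run the same program;
# each load is served and broadcast by the processor owning the address,
# and each store is completed only by the owner and dropped by the rest.

def run_datascalar(program, partitions):
    """partitions: one {address: value} local memory per processor."""
    broadcasts = []
    for op, addr, value in program:
        owner = next(i for i, mem in enumerate(partitions) if addr in mem)
        if op == "load":
            # the owner broadcasts; the others wait for the value
            broadcasts.append((owner, addr, partitions[owner][addr]))
        else:  # store: completed by the owner, dropped by all others
            partitions[owner][addr] = value
    return broadcasts

partitions = [{0x0: 10}, {0x4: 20}]
program = [("load", 0x0, None), ("store", 0x4, 99), ("load", 0x4, None)]
log = run_datascalar(program, partitions)
```

Note that the store generates no bus traffic at all, matching the claim that all communication is one-way broadcasts of loaded operands.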
Address space
The address space is divided into a replicated and a communicated section:
– The communicated section holds values that only exist in single copies
and are owned by the respective processor.
– The communicated section of the address space is distributed among the
processors.
– Replicated pages are mapped in each processor's local memory.
– Access to a replicated page requires no data communication.
Accesses of DataScalar processors to memory (figure): processors 1 and 2
with their local memories.
Why DataScalar?
The main goal of the DataScalar model of execution is the improvement of
memory system performance by introducing redundancy in execution by
replicating processors and part of the data storage.
Since all physical memory is local to at least one processor, a request for a
remote operand is never sent.
Memory access latency and bus traffic is reduced.
All communication is one-way.
Writes never appear on the global bus.
Execution mode
The processors execute the same program in slightly different time steps.
The lead processor runs slightly ahead of the others, especially when it is
broadcasting while the others wait for the broadcasted value.
When the program execution accesses an operand that is not owned by the
lead processor, a lead change occurs.
All processors stall until the new lead processor catches up and broadcasts its operand.
The capability that each processor may run ahead on computation that involves
operands owned by the processor is called datathreading.
Pros and cons
The DataScalar model is primarily a memory system optimization intended for
codes that are performance-limited by the memory system and difficult to
parallelize.
Simulation results: the DataScalar model of execution works best with codes
for which traditional parallelization techniques fail.
Six unmodified SPEC95 binaries ran from 7 % slower to 50 % faster on two
nodes, and from 9 % to 100 % faster on four nodes, than on a system with a
comparable, more traditional memory system.
Current technological parameters do not make DataScalar systems a cost-effective
alternative to today's microprocessors.
Pros and cons
For a DataScalar system to be more cost-effective than the alternatives, the
following three conditions must hold:
– Processing power must be cheap, the dominant cost of each node should
be memory.
– Remote memory accesses should be slower than local memory accesses.
– Broadcasts should not be prohibitively expensive.
Three possible candidates:
– IRAM: because remote memory accesses to other IRAM chips will be more
expensive than on-chip memory accesses.
– extending concept of CMP: CMPs access operands from a processor-local
memory faster than requesting an operand from a remote processor
memory across the chip due to wiring delays.
– NOWs (networks of workstations): alternative to paging, provided that
broadcasts are efficient.