CMPE 511
Computer Architecture
GPAs: Grid Processor Architectures
Report
Instructor:
Prof. Oğuz Tosun
Prepared by:
A. Emre Arpacı
Abstract
1 Introduction
2 The Tera-op Reliable Intelligently adaptive Processing System (TRIPS) Project
2.1 Goals
3 The TRIPS Architecture
3.1 Core Execution Model
3.2 Architectural Overview
3.3 Polymorphous Resources
4 D-morph: Instruction-Level Parallelism
4.1 Frame Space Management
4.2 Multiblock Speculation
4.3 High-Bandwidth Instruction Fetch
4.4 Memory Interface
4.5 D-morph Results
5 Related Work
6 Conclusions
References
Abstract
The doubling of microprocessor performance every three years has been the result of
two factors: more transistors per chip and superlinear scaling of the processor clock with
technology generation. Due to both diminishing improvements in clock rates and poor wire
scaling as semiconductor devices shrink, the achievable performance growth of conventional
microarchitectures will slow substantially.
In this report, we survey the design space of a new class of architectures called Grid
Processor Architectures (GPAs for short). Grid Processor Architectures are not “grid
computing” in the popular sense. These architectures are designed to scale with technology,
allowing faster clock rates than conventional architectures while providing superior
instruction-level parallelism on traditional workloads and high performance across a range of
application classes. The Grid Processor Architecture concept is exemplified by two machines: the MIT RAW machine and the Texas TRIPS machine, which we take as the basis of this report. MIT RAW uses the compiler to map computation onto the grid and to schedule the communication across it, whereas TRIPS exploits physical locality among PEs to speed up communication of data [1].
1 Introduction
For the past decade, microprocessors have been improving in overall performance at a
rate of approximately 50–60% per year. These substantial performance improvements have
been mined from two sources. First, designers have been increasing clock rates at a rapid rate,
both by scaling technology and by reducing the number of levels of logic per cycle. Second,
designers have been exploiting the increasing number of transistors on a chip, plus
improvements in compiler technology, to improve instruction throughput (IPC). Although
designers have generally opted to emphasize one over the other, both clock rates and IPC
have been improving consistently. Figure 1 shows that while some designers have chosen to optimize their designs for fast clocks (Compaq Alpha) and others for high instruction throughput (HP PA-RISC), the past decade’s performance increases have been a function of both. Achieving high performance in future microprocessors
will be a tremendous challenge, as both components of performance improvement are facing
emerging technology-driven limitations. Designers will soon be unable to sustain clock speed
improvements at the past decade’s annualized rate of 50% per year.
Compensating for the slower clock growth by increasing sustained IPC proportionally
will be difficult. Wire delays will limit the ability of conventional microarchitectures to
improve instruction throughput. Microprocessor cores will soon face a new constraint, one in
which they are communication bound on the die instead of capacity bound. As feature sizes
shrink, and wires become slower relative to logic, the amount of state that can be accessed in
a single clock cycle will cease to grow, and will eventually begin to decline. Increases in
instruction-level parallelism will be limited by the amount of state reachable in a cycle, not by
the number of transistors that can be manufactured on a chip [2].
Figure 1: Processor clock rates and normalized processor performance (SpecInt/clock rate), 1995–2000.
Future microprocessors must thus achieve ILP considerably higher than today’s
designs, even while being partitioned, and do so with a high clock rate. These future
processors must exploit increased device counts to meet the above goals, but must do so while
considering the increased communication delays and partitioning requirements [3]. In this
report, we survey a class of architectures intended to address these problems faced by future
systems. Grid Processor Architectures are designed to enable both faster clock rates and
higher ILP than conventional architectures, even as devices shrink and wire delays increase.
A GPA consists of an array of ALUs connected by a lightweight routed network.
Each ALU in the array contains local instruction and data storage buffers. Banked
instruction and data caches are placed around the array of ALUs, backed by partitioned
second-level cache banks. The processor follows a block-atomic model of execution in which
an entire block of instructions is fetched and mapped onto the execution array. A dataflow-style
ISA that encodes each instruction’s placement and its consumers allows a statically
placed but dynamically issued (SPDI) execution model. The dataflow-style ISA, together with the
distributed control and local storage inherently provided by the architecture, makes the
implementation of these mechanisms straightforward.
This organization eliminates the centralized instruction issue window and converts the
conventional broadcast bypass network into a routed point-to-point network. Similar to VLIW
architectures, a compiler is used to detect parallelism and statically schedule instructions onto
the computation substrate, such that the topography of the dataflow graph matches the
mapping. However, instructions are issued dynamically with the execution order determined
by the availability of input operands.
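To make the SPDI model concrete, the following Python sketch (ours, not part of any TRIPS toolchain; the instruction fields and the grid representation are illustrative assumptions) shows an instruction firing as soon as its operands arrive, with the result delivered point-to-point to the consumers named by the producer rather than broadcast on a bypass bus.

    from dataclasses import dataclass, field

    @dataclass
    class Instr:
        op: str                                        # operation, e.g. "add"
        targets: list                                  # consumer slots as (node_id, operand_name)
        operands: dict = field(default_factory=dict)   # operands that have arrived so far
        needed: int = 2                                # operand count required before firing
        done: bool = False

    def step(grid):
        """One execution step: fire every instruction whose operands have all
        arrived, and route each result directly to the consumers that the
        producer itself names (no broadcast bypass, no issue-window search)."""
        fired = []
        for node_id, instr in grid.items():
            if not instr.done and len(instr.operands) >= instr.needed:
                result = sum(instr.operands.values())   # stand-in for the real ALU op
                instr.done = True
                fired.append(node_id)
                for tgt_node, slot in instr.targets:
                    grid[tgt_node].operands[slot] = result
        return fired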
In a GPA, few large structures reside on the critical execution path, enhancing
scalability as wire resistance increases. Out-of-order execution is achieved with greatly
reduced register file bandwidth and with no associative issue window or register rename table.
Compiler-controlled physical layout ensures that the critical path is scheduled along the
shortest physical path, and that banked instruction caches reside near the units to which they
will issue instructions. Finally, large instruction blocks are mapped onto the nodes as single
units of computation, amortizing scheduling and decode overhead over a large number of
instructions. In a GPA, the register file bandwidth is also reduced. Experiments show that
register file writes are reduced by 30% to 90% using direct communication between
producing and consuming instructions. On a set of conventional uniprocessor (SPEC
CPU2000 and Mediabench) benchmarks, simulation results show IPCs between one and nine,
nine, running on a substrate that can likely be clocked faster than conventional designs and
that will scale with technology. Assuming small routing delays, perfect memory and perfect
branch prediction, the GPA averages eleven instructions per cycle across these benchmarks
[4].
GPA-based systems provide unique opportunities for power efficiency. The
elimination of structures dedicated to instruction-level register renaming, associative operand
comparisons, and state tracking reduces the overhead circuitry and power on a per-ALU basis.
ALU chaining dramatically reduces the number of global register file accesses in exchange
for short point-to-point connections. The dynamic power of the ALU array and banked
memory structures can be actively managed to reduce consumption during periods of lighter
utilization. The dataflow execution model of the GPA is also amenable to power-efficient
asynchronous design techniques.
In addition to high ILP, a secondary design goal of the GPA is polymorphism, or the
ability to adapt the hardware to the execution characteristics of the application. Grid
Processors can be easily sub-divided into sub-processors, allowing discrete threads to be
assigned to different sub-processors for high thread-level parallelism (TLP). Grid Processors
can also be configured to target data-level parallelism (DLP), often exhibited in media,
streaming, and scientific codes. For DLP applications, the same GPA hardware employs a
different execution model in which instructions for kernels or inner loops are mapped to the
ALUs and stay resident for multiple iterations. In addition, each access to a data cache bank
provides multiple values that are distributed to the ALUs in each row. Initial results on a set
of 7 signal processing kernels show that an 8x8 GPA can average 48 compute instructions per
cycle. Assuming an 8-GHz clock in 50nm CMOS, this configuration would achieve a
performance level of 384 Gflops [5].
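As a quick check of the quoted figure (our arithmetic, under the assumption that every compute instruction is a floating-point operation):

    # 48 compute instructions per cycle at an assumed 8 GHz clock:
    print(48 * 8e9 / 1e9)   # 384.0, i.e. 384 Gops/s, quoted above as 384 Gflops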
The remainder of this report is organized as follows. Section 2 describes the TRIPS
Project. Section 3 describes the TRIPS architecture and GPAs. Section 4 presents the D-morph, the
instruction-level parallelism (ILP) configuration of the TRIPS grid processor.
Section 5 describes related work pertaining to wide-issue and dataflow-oriented machines.
Finally, Section 6 concludes with a discussion of the strengths and weaknesses of GPAs as
realized in TRIPS.
2 The Tera-op Reliable Intelligently adaptive Processing System (TRIPS)
Project
TRIPS is a multidisciplinary project sponsored under the DARPA Polymorphous
Computing Architecture (PCA) initiative. The goal of the TRIPS project is to develop a
computing system that outperforms evolutionary architectures on a wide range of applications,
achieving single-chip Tera-op performance that scales with advances in semiconductor
technology [6].
TRIPS is a collaborative effort among multidisciplinary groups from UT-Austin and
the IBM Austin Research Laboratory. The design, evaluation, and implementation span the
research disciplines of VLSI design, architecture, compilers, operating systems, and
applications.
2.1 Goals
The TRIPS project has four major research goals:
Technology-Scalable Architecture: To address the semiconductor scaling challenges
of high-performance processors, particularly in instruction selection, execution, and bypass,
the TRIPS team has proposed a new class of processor organizations called Grid Processor
Architectures (GPAs). A GPA is composed of a tightly coupled array of ALUs connected via
a thin network, onto which large blocks of instructions are scheduled and mapped. To
mitigate on-chip communication delays, applications are scheduled so that their critical
dataflow paths are placed along nearby ALUs.
Malleable Architecture: The TRIPS architecture is designed to be configurable to
meet the needs of a variety of workloads and environmental conditions. Both the grid
processors and the on-chip memory system are configurable, able to run workloads as diverse
as control-bound integer codes, highly parallel threaded codes, and regular, computationally
intensive streaming codes efficiently. The allocation of ALUs within the grid, the instruction
mapping onto the grid, the number of executing threads, and the flow of instructions across
the grid are all exposed to the system, compiler, and application software for maximum
flexibility. A TRIPS chip consists of one or more interconnected grid processors working in
parallel.
Dynamic Adaptivity: To respond to changing workloads and conditions, a TRIPS
chip provides on-chip sensors and a lightweight software layer called morphware, which
monitors power, temperature, memory performance, and ALU utilization. The morphware
layer controls the runtime operation of the execution resources, mediating between the
requirements of running applications, the capabilities of a TRIPS implementation, and the
operating environment of the system.
Application Diversity: TRIPS is intended to support a variety of runtime workloads,
including desktop, scientific, streaming, and server workloads. Desktop applications are
characterized by irregular integer operations, scientific applications by their large data sets,
streaming applications by their regularity and predictability, and server applications by their
non-uniform workloads, independent thread execution, and real-time response requirements.
The TRIPS system dynamically responds to each in kind and supports concurrent execution of
all.
3 The TRIPS Architecture
The TRIPS architecture uses large, coarse-grained processing cores to achieve high
performance on single threaded applications with high ILP, and augments them with
polymorphous features that enable the core to be subdivided for explicitly concurrent
applications at different granularities. Unlike conventional large-core designs with
centralized components that are difficult to scale, the TRIPS architecture is heavily partitioned
to avoid large centralized structures and long wire runs. These partitioned computation and
memory elements are connected by point-to-point communication channels that are exposed
to software schedulers for optimization.
The key challenge in defining the polymorphous features is choosing an appropriate
granularity, so that workloads involving different levels of ILP, TLP, and DLP can maximize
their use of the available resources while avoiding escalating complexity and
non-scalable structures. The TRIPS system employs coarse-grained polymorphous features, at
the level of memory banks and instruction storage, to minimize both software and hardware
complexity and configuration overheads.
3.1 Core Execution Model
The TRIPS architecture is fundamentally block oriented. In all modes of operation,
programs compiled for TRIPS are partitioned into large blocks of instructions with a single
entry point, no internal loops, and possibly multiple exit points, as found in
hyperblocks [7]. For instruction and thread level parallel programs, blocks commit atomically
and interrupts are block precise, meaning that they are handled only at block boundaries. For
all modes of execution, the compiler is responsible for statically scheduling each block of
instructions onto the computational engine such that inter-instruction dependences are explicit.
Each block has a static set of state inputs, and a potentially variable set of state outputs that
depends upon the exit point from the block. At runtime, the basic operational flow of the
processor includes fetching a block from memory, loading it into the computational engine,
executing it to completion, committing its results to the persistent architectural state if
necessary, and then proceeding to the next block.
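The operational flow above can be summarized by the following sketch of the block-atomic loop (the four callables are hypothetical stand-ins for the fetch, map, execute, and commit stages; this illustrates the execution model, not the TRIPS control logic):

    def run_blocks(entry_pc, fetch_block, map_block, execute_block, commit_block):
        """Block-atomic execution loop with hypothetical stage functions."""
        pc = entry_pc
        while pc is not None:
            block = fetch_block(pc)                    # one hyperblock: single entry, no internal loops
            frames = map_block(block)                  # load it into the computational engine
            next_pc, outputs = execute_block(frames)   # run to completion; the exit taken fixes the outputs
            commit_block(outputs)                      # commit atomically; interrupts only at this boundary
            pc = next_pc                               # proceed to the next block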
Figure 2: The TRIPS architecture with a GPA core
3.2 Architectural Overview
Figure 2a shows a diagram of the TRIPS architecture as it will be implemented in a
prototype chip. While the architecture is scalable to both larger dimensions and high clock
rates due to both the partitioned structures and short point-to-point wiring connections, the
TRIPS prototype chip will consist of four polymorphous 16-wide cores, an array of 32KB
memory tiles connected by a routed network, and a set of distributed memory controllers with
channels to external memory. The prototype chip will be built using a 100nm process and is
targeted for completion in 2005.
Figure 2b shows an expanded view of a TRIPS core (GPA) and the primary memory
system. The TRIPS core is an example of the Grid Processor family of designs [4], which are
typically composed of an array of homogeneous execution nodes, each containing an integer
ALU, a floating point unit, a set of reservation stations, and router connections at the input
and output. Each reservation station has storage for an instruction and two source operands.
When a reservation station contains a valid instruction and a pair of valid operands, the node
can select the instruction for execution. After execution, the node can forward the result to
any of the operand slots in local or remote reservation stations within the ALU array. The
nodes are directly connected to their nearest neighbors, but the routing network can deliver
results to any node in the array. The banked instruction cache on the left couples one bank per
row, with an additional instruction cache bank that issues fetches of values from the registers for
injection into the ALU array. The banked register file above the ALU array holds a portion of
the architectural state. To the right of the execution nodes are a set of banked level-1 data
caches, which can be accessed by any ALU through the local grid routing network. Below the
ALU array is the block control logic that is responsible for sequencing block execution and
selecting the next block. The backsides of the L1 caches are connected to secondary memory
tiles through the chip-wide two-dimensional interconnection network. The switched network
provides a robust and scalable connection to a large number of tiles, using less wiring than
conventional dedicated channels between these components.
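Because results travel over a routed point-to-point network rather than a broadcast bypass bus, operand delivery cost grows with physical distance. A small sketch of this cost, assuming dimension-order routing over the nearest-neighbor links and a fixed per-hop delay (the 0.5-cycle value matches the figure assumed later in the D-morph results; it is not otherwise specified here):

    def routing_delay(src, dst, cycles_per_hop=0.5):
        """src and dst are (row, col) coordinates of execution nodes; the
        delay is the Manhattan hop count times an assumed per-hop cost."""
        hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
        return hops * cycles_per_hop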
The TRIPS architecture contains three main types of resources. First, the hardcoded,
non-polymorphous resources operate in the same manner, and present the same view of
internal state in all modes of operation. Some examples include the execution units within the
nodes, the interconnect fabric between the nodes, and the L1 instruction cache banks. In the
second type, polymorphous resources are used in all modes of operation, but can be
configured to operate differently depending on the mode. The third type comprises resources that
are not required in all modes and can be disabled when unused in a given mode.
3.3 Polymorphous Resources
Frame Space: As shown in Figure 2c, each execution node contains a set of
reservation stations. Reservation stations with the same index across all of the nodes combine
to form a physical frame. For example, combining the first slot for all nodes in the grid forms
frame 0. The frame space, or collection of frames, is a polymorphous resource in TRIPS, as it
is managed differently by different modes to support efficient execution of alternate forms of
parallelism.
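A minimal sketch of the frame abstraction, assuming the grid is represented as a 2-D array of per-node reservation-station lists (a representation of our choosing, not the hardware layout):

    def physical_frame(grid, index):
        """grid[row][col] is the list of reservation stations at one execution
        node; collecting the slot with the same index at every node yields one
        physical frame (index 0 gives frame 0)."""
        return [[node[index] for node in row] for row in grid]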
Register File Banks: Although the programming model of each execution mode sees
essentially the same number of architecturally visible registers, the hardware substrate
provides many more. The extra copies can be used in different ways, such as for speculation
or multithreading, depending on the mode of operation.
Block Sequencing Controls: The block sequencing controls determine when a block
has completed execution, when a block should be deallocated from the frame space, and
which block should be loaded next into the free frame space. To implement different modes
of operation, a range of policies can govern these actions. The deallocation logic may be
configured to allow a block to execute more than once, as is useful in streaming applications
in which the same inner loop is applied to multiple data elements. The next-block selector can
be configured to limit speculation and to prioritize among multiple concurrently
executing threads, which is useful for multithreaded parallel programs.
Memory Tiles: The TRIPS memory tiles can be configured to behave as NUCA-style
L2 cache banks, scratchpad memory, or synchronization buffers for producer/consumer
communication. In addition, the memory tiles closest to each processor present a special high
bandwidth interface that further optimizes their use as stream register files.
4 D-morph: Instruction-Level Parallelism
The desktop morph, or D-morph, of the TRIPS processor uses the polymorphous
capabilities of the processor to run single-threaded codes efficiently by exploiting instruction-level parallelism. The TRIPS processor core is an instantiation of the Grid Processor family of
architectures, but with some important differences as described in this section. To achieve
high ILP, the D-morph configuration treats the instruction buffers in the processor core as a
large, distributed, instruction issue window, which uses the TRIPS ISA to enable out-of-order
execution while avoiding the associative issue window lookups of conventional machines.
To use the instruction buffers effectively as a large window, the D-morph must
provide high-bandwidth instruction fetching, aggressive control and data speculation, and a
high-bandwidth, low-latency memory system that preserves sequential memory semantics
across a window of thousands of instructions.
Figure 3: D-morph frame management
4.1 Frame Space Management
By treating the instruction buffers at each ALU as a distributed issue window, orders-of-magnitude increases in window sizes are possible. This window is fundamentally a three-dimensional scheduling region, where the x- and y-dimensions correspond to the physical
dimensions of the ALU array and the z-dimension corresponds to multiple instruction slots at
each ALU node, as shown in Figure 2c. This three-dimensional region can be viewed as a
series of frames, as shown in Figure 3b, in which each frame consists of one instruction buffer
entry per ALU node, resulting in a 2-D slice of the 3-D scheduling region. To fill one of these
scheduling regions, the compiler schedules hyperblocks into a 3-D region, assigning each
instruction to one node in the 3-D space. Hyperblocks are predicated, single-entry, multiple-exit
regions formed by the compiler. A 3-D region (the array and the set of frames) into which
one hyperblock is mapped is called an architectural frame, or A-frame. Figure 3a shows a
four-instruction hyperblock (H0) mapped into A-frame 0 as shown in Figure 3b, where N0
and N2 are mapped to different buffer slots (frames) on the same physical ALU node. All
communication within the block is determined by the compiler, which schedules operand
routing directly from ALU to ALU. Consumers are encoded in the producer instructions as X,
Y, and Z relative offsets. Instructions can direct a produced value to any element within the
same A-frame, using the lightweight routed network in the ALU array. The maximum number
of frames that can be occupied by one program block (the maximum A-frame size) is
architecturally limited by the number of instruction bits to specify destinations, and physically
limited by the total number of frames available in a given implementation.
The current TRIPS ISA limits the number of instructions in a hyperblock to 128, and
the current implementation limits the maximum number of frames per A-frame to 16, the
maximum number of A-frames to 32, and provides 128 frames total.
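The following sketch illustrates consumer addressing within an A-frame: a producer names each consumer by X, Y, and Z offsets relative to its own position. The representation and the helper function are ours; only the two limits quoted above come from the text.

    FRAMES_PER_AFRAME = 16    # current implementation limit quoted above
    MAX_BLOCK_INSTRS = 128    # current ISA limit quoted above

    def consumer_slot(producer_pos, offset):
        """producer_pos is the (x, y, z) position of the producing instruction
        within its A-frame; offset is the (dx, dy, dz) target encoded in the
        producer. The result is the consumer's buffer slot, still A-frame relative."""
        x, y, z = producer_pos
        dx, dy, dz = offset
        return (x + dx, y + dy, z + dz)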
4.2 Multiblock Speculation
The TRIPS instruction window size is much larger than the average hyperblock size
that can be constructed. The hardware fills empty A-frames with speculatively mapped
hyperblocks, predicting which hyperblock will be executed next, mapping it to an empty A-frame, and so on.
The A-frames are treated as a circular buffer in which the oldest A-frame is non-speculative and all other A-frames are speculative (analogous to tasks in a Multiscalar
processor). When the A-frame holding the oldest hyperblock completes, the block is
committed and removed. The next oldest hyperblock becomes non-speculative, and the
released frames can be filled with a new speculative hyperblock. On a misprediction, all
blocks past the offending prediction are squashed and restarted. Since A-frame IDs are
assigned dynamically and all intra-hyperblock communication occurs within a single A-frame,
each producer instruction prepends its A-frame ID to the Z-coordinate of its consumer to form
the correct instruction buffer address of the consumer. Values passed between hyperblocks are
transmitted through the register file, as shown by the communication of R1 from H0 to H1 in
Figure 3b. Such values are aggressively forwarded when they are produced, using the register
stitch table that dynamically matches the register outputs of earlier hyperblocks to the register
inputs of later hyperblocks.
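A sketch of the multiblock speculation mechanism described above, with the A-frames managed as a circular buffer (the class and its methods are illustrative stand-ins, not the TRIPS block sequencing logic):

    from collections import deque

    class AFrameWindow:
        """Circular-buffer view of the A-frames: the leftmost entry is the oldest,
        non-speculative block; everything younger is speculative."""
        def __init__(self, max_aframes=32):
            self.blocks = deque()
            self.max_aframes = max_aframes

        def map_block(self, block):
            if len(self.blocks) < self.max_aframes:     # free frame space available
                self.blocks.append(block)               # youngest, most speculative

        def commit_oldest(self):
            return self.blocks.popleft()                # oldest commits; next becomes non-speculative

        def squash_after(self, bad_index):
            while len(self.blocks) > bad_index + 1:     # discard every block past the
                self.blocks.pop()                       # offending prediction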
Table 1. Execution Characteristics of D-morph codes.
4.3 High-Bandwidth Instruction Fetch
To fill the large distributed window, the D-morph requires high-bandwidth instruction
fetch. The control model uses a program counter that points to hyperblock headers. When
there is sufficient frame space to map a hyperblock, the control logic accesses a partitioned
instruction cache by broadcasting the index of the hyperblock to all banks. Each bank then
fetches a row’s worth of instructions with a single access and streams it to the bank’s
respective row. Hyperblocks are encoded as VLIW-like blocks, along with a prepended
header that contains the number of frames consumed by the block. The next-hyperblock
prediction is made using a highly tuned tournament exit predictor, which predicts a binary
value that indicates the branch predicted to be the first to exit the hyperblock. The per-block
accuracy of the exit predictor is shown in row 3 of Table 1. The value generated by the exit
predictor is used both to index into a BTB to obtain the next predicted hyperblock address,
and also to avoid forwarding register outputs produced past the predicted branch to
subsequent blocks.
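The prediction path can be summarized as follows (a sketch under the assumption that the exit predictor and BTB behave as simple lookup tables; the real structures are a tuned tournament predictor and a hardware BTB):

    def predict_next_block(block_addr, exit_predictor, btb):
        """Return the predicted exit number and the predicted next block address."""
        exit_id = exit_predictor[block_addr]          # which branch exits the block first
        next_addr = btb.get((block_addr, exit_id))    # BTB indexed by block and predicted exit
        return exit_id, next_addr                     # exit_id also gates register-output forwarding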
4.4 Memory Interface
To support high ILP, the D-morph memory system must provide a high-bandwidth,
low-latency data cache, and must maintain sequential memory semantics. As shown in Figure
2b, the right side of each TRIPS core contains distributed primary memory system banks,
which are tightly coupled to the processing logic for low latency. The banks are interleaved
using the low-order bits of the cache index, and can process multiple non-conflicting accesses
simultaneously. Each bank is coupled with MSHRs for the cache bank and a partition of the
address-interleaved load/store queues that enforce ordering of loads and stores. The MSHRs,
the load/store queues, and the cache banks all use the same interleaving scheme. Stores are
written back to the cache from the LSQs upon block commit.
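A sketch of the address interleaving, assuming a hypothetical line size and bank count; because the MSHRs and load/store-queue partitions use the same scheme, a given address always maps to the same bank, MSHR set, and LSQ partition:

    NUM_BANKS = 4       # bank count assumed for illustration
    LINE_BYTES = 64     # cache line size assumed for illustration

    def bank_for(address):
        """Select the L1 bank (and the matching MSHR set and LSQ partition)."""
        cache_index = address // LINE_BYTES
        return cache_index % NUM_BANKS    # low-order cache-index bits pick the bank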
The secondary memory system in the D-morph configures the networked banks as a
non-uniform cache access (NUCA) array, in which elements of a set are spread across
multiple secondary banks, and are capable of migrating data on the two-dimensional switched
network that connects the secondary banks. This network also provides a high-bandwidth link
to each L1 bank for parallel L1 miss processing and fills. To summarize, with accurate exit
prediction, high-bandwidth I-fetching, partitioned data caches, and concurrent execution of
hyperblocks with inter-block value forwarding, the D-morph is able to use the instruction
buffers as a polymorphous out-of-order issue window effectively, as shown in the next
subsection.
Figure 4: D-morph performance as a function of the number of A-frames
4.5 D-morph Results
In this subsection, the ILP is measured using the mechanisms described above. The
results shown in this section assume a 4x4 (16-wide issue) core, with 128 physical frames, a
64KB L1 data cache that requires three cycles to access, a 64KB L1 instruction cache (both
partitioned into 4 banks), 0.5 cycles per hop in the ALU array, a 10-cycle branch
misprediction penalty, a 250Kb exit predictor, a 12-cycle access penalty to a 2MB L2 cache,
and a 132-cycle main memory access penalty. Optimistic assumptions in the simulator
currently include no modeling of TLBs or page faults, oracular load/store ordering, simulation
of a centralized register file, and no issue of wrong-path instructions to the memory system.
All of the binaries were compiled with the Trimaran tool set [8] and scheduled for the TRIPS
processor with a custom scheduler/rewriter.
The first row of Table 1 shows the average number of useful dynamically executed
instructions per block, discounting overhead instructions, instructions with false predicates, and
instructions past a block exit. The second row shows the average dynamic number of frames
allocated per block by the scheduler for a 4x4 grid. Using the steady-state block (exit) prediction
accuracies shown in the third row, each benchmark holds 965 useful instructions in the
distributed window, on average, as shown in Table 1.
Figure 4 shows how IPC scales as the number of A-frames is increased from 1 to 32,
permitting deeper speculative execution. The integer benchmarks are shown on the left; the
floating point and Mediabench benchmarks are shown on the right. Each 32 A-frame bar also
has two additional IPC values, showing the performance with perfect memory in the hashed
fraction of each bar, and then adding perfect branch prediction, shown in white. Increasing the
number of A-frames provides a consistent performance boost across many of the benchmarks,
since it permits greater exploitation of ILP by providing a larger window of instructions.
Some benchmarks show no performance improvements beyond 16 A-frames (bzip2, m88ksim,
and tomcatv), and a few reach their peak at 8 A-frames (adpcm, gzip, twolf, and hydro2d). In
such cases, the large frame space is underutilized when running a single thread, due to either
low hyperblock predictability in some cases or a lack of program ILP in others. The graphs
demonstrate that while control mispredictions cause large performance losses for the integer
codes (close to 50% on average), the large window is able to tolerate memory latencies
extremely well, resulting in negligible slowdowns due to an imperfect memory system for all
benchmarks but mgrid.
5 Related Work
The goals of high clock rate and high IPC are not unique to GPAs. Many prior
approaches have attempted to use both static and dynamic techniques to discover and execute
along the critical path of a program, but they are too numerous to discuss here. In this section
we describe what we believe to be the most relevant related work. Dennis and Misunas
proposed a static dataflow architecture [9], and Arvind proposed the Tagged-Token Dataflow
Architecture, with purely data-driven instruction scheduling for programs expressed in a
dataflow language [10]. Culler later proposed a hybrid dataflow execution model where
programs are partitioned into code blocks made up of instruction sequences, called threads,
with dataflow execution between threads [11].
The TRIPS Grid Processor Architecture approach differs from these in that the
architects use a conventional programming interface with dataflow execution for a limited
window of instructions, and rely on compiler instruction mapping to reduce the complexity of
the token matching. In a sense, GPAs are a hybrid approach between VLIW and conventional
superscalar architectures. A GPA statically schedules the instructions using a compiler, but
then dynamically issues them based on data dependences. Other efforts have attempted to
enhance VLIW architectures with dynamic execution. Rau proposed a split-issue mechanism
to separate register read and execute from writeback and a delay buffer to support dynamic
scheduling for VLIW processors [12]. Grid Processors share many characteristics with the
Transport Triggered Architectures proposed by Corporaal and Mulder, including data-driven
execution, reduced register file traffic, and non-broadcast bypassing of execution unit results
[13].
Others have looked at various naming mechanisms for values to reduce the register
pressure and register file size. Smelyanskiy et al. proposed Register Queues for allocating live
values in software pipelined loops [14]. Llosa proposed register sacks, which are low
bandwidth, port-limited register files for allocating live values in pipelined loops [15]. Patt
proposed a Block-Structured Instruction Set Architecture for increasing the fetch rate of wide-issue
machines, where the atomic unit of execution is a block and not an instruction [16].
Many researchers are exploring distributed or partitioned uniprocessor designs.
Waingold et al. proposed a distributed execution model with extensive compiler support in the
RAW architecture [17]. The RAW architecture assumes a coarser-grain execution than does
the Grid Processor, exploiting parallelism across multiple compiler-generated instruction
streams.
6 Conclusions
In this report we surveyed Grid Processor Architectures, a new class of microarchitecture
intended to enable continued scaling of both clock rate and instruction
throughput. By mapping dependence chains onto an array of ALUs, GPAs allow conventional large
structures such as register files and instruction windows to be distributed throughout the
ALU array, permitting better scalability of the processing core. By delivering ALU results
point-to-point instead of broadcasting them, GPAs mitigate the growing global wire and delay
overheads of conventional bypass architectures. Studies on sequential applications are
promising, with the grid processor achieving IPCs ranging from 1 to 9, competitive with those
of idealized superscalar microarchitectures, and exceeding those of VLIW microarchitectures.
It is not clear that GPAs will be superior to the conventional alternatives, which may
find more incremental, but equally good solutions to the wire delay and clock scaling
problems. GPAs have several disadvantages; they force the data caches to be far away from
many of the ALUs, and incur delays between dependent operations due to the network router
and wires, which can be significant. The complexity of frame management and block stitching
(allowing successor hyperblocks to execute speculatively) is significant and may interfere
with the goal of fast clock rates. However, future architectures must be partitioned somehow,
and the partitioning and the flow of operations are likely to be exposed to the compiler while
still preserving dynamic execution. Many of the techniques discussed here are thus likely to
appear in future designs. Research groups are still refining the microarchitecture of
GPAs and the hyperblock scheduler, in the expectation that hardware complexity can be
further reduced without undue burden on the software.
References
[1] D. J. Sorin. 10 Novel Architectures. Presentation, Advanced Computer Architecture II (Parallel
Computer Architecture).
[2] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate versus IPC: The end
of the road for conventional microarchitectures. In Proceedings of the 27th Annual International
Symposium on Computer Architecture, June 2000.
[3] D. I. August, D. A. Connors, S. A. Mahlke, J. W. Sias, K. M. Crozier, B.-C. Cheng, P. R.
Eaton, Q. B. Olaniran, and W. Hwu. Integrated predicated and speculative execution in the
IMPACT EPIC architecture. In Proceedings of the 25th International Symposium on
Computer Architecture, pages 45–54, July 1998.
[4] R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler. A design space evaluation
of grid processor architectures. In Proceedings of the 34th Annual International Symposium
on Microarchitecture, pages 40–51, December 2001.
[5] S. W. Keckler, D. Burger, C. R. Moore, R. Nagarajan, K. Sankaralingam, V. Agarwal,
M. S. Hrishikesh, N. Ranganathan, and P. Shivakumar. Wire-Delay Scalable Microprocessor
Architecture for High Performance Systems. Department of Computer Sciences, The University
of Texas at Austin, Austin, TX.
[6] TRIPS Project Home Page
[7] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective
compiler support for predicated execution using the hyperblock. In Proceedings of the 25th
International Symposium on Microarchitecture, pages 45–54, 1992.
[8] Trimaran: An infrastructure for research in instruction-level parallelism.
http://www.trimaran.org.
[9] J. Dennis and D. Misunas. A preliminary architecture for a basic data-flow processor. In
Proceedings of the 2nd Annual Symposium on Computer Architecture, pages 126–132,
January 1975.
[10] Arvind and R. S. Nikhil. Executing a program on the MIT Tagged-Token Dataflow
Architecture. IEEE Transactions on Computers, 39(3):300–318, 1990.
[11] D. E. Culler, A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek. Fine-grain
parallelism with minimal hardware support: A compiler-controlled threaded abstract machine.
In Proceedings of the 4th International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 164–175, April 1991.
[12] B. Rau. Dynamically scheduled VLIW processors. In Proceedings of the 26th Annual
International Symposium on Microarchitecture, pages 80–90, December 1993.
[13] H. Corporaal and H. Mulder. Move: A framework for high-performance processor design.
In Supercomputing-91, pages 692–701, November 1991.
[14] M. Smelyanskiy, G. Tyson, and E. Davidson. Register queues: A new hardware/software
approach to efficient software pipelining. In International Conference on Parallel
Architectures and Compilation Techniques (PACT 2000), pages 3–12, October
2000.
[15] J. Llosa, M. Valero, J. Fortes, and E. Ayguade. Using sacks to organize register files in
VLIW machines. In CONPAR 94 - VAPP VI, pages 628–639, September 1994.
[16] E. Hao, P. Chang, M. Evers, and Y. Patt. Increasing the instruction fetch rate via block-structured instruction set architectures. In Proceedings of the 29th International Symposium
on Microarchitecture, pages 191–200, December 1996.
[17] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P.
Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software:
RAW machines. IEEE Computer, 30(9):86–93, September 1997.