CMPE 511 Computer Architecture
GPAs: Grid Processor Architectures Report
Instructor: Prof. Oğuz Tosun
Prepared by: A. Emre Arpacı

Contents
Abstract
1 Introduction
2 The Tera-op Reliable Intelligently adaptive Processing System (TRIPS) Project
2.1 Goals
3 The TRIPS Architecture
3.1 Core Execution Model
3.2 Architectural Overview
3.3 Polymorphous Resources
4 D-morph: Instruction-Level Parallelism
4.1 Frame Space Management
4.2 Multiblock Speculation
4.3 High-Bandwidth Instruction Fetch
4.4 Memory Interface
4.5 D-morph Results
5 Related Work
6 Conclusions
References

Abstract

The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Due to both diminishing improvements in clock rates and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow substantially. In this report, we survey the design space of a new class of architectures called Grid Processor Architectures (GPAs for short). Grid Processor Architectures are not “grid computing” in the popular sense. These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. The Grid Processor concept has been explored in two main machines: the MIT RAW machine and the Texas TRIPS machine, which we take as the basis of this report.
MIT RAW uses the compiler to map computation onto the grid and to schedule the communication across it, whereas TRIPS exploits physical locality among PEs to speed up communication of data [1].

1 Introduction

For the past decade, microprocessors have been improving in overall performance at a rate of approximately 50–60% per year. These substantial performance improvements have been mined from two sources. First, designers have been increasing clock rates at a rapid rate, both by scaling technology and by reducing the number of levels of logic per cycle. Second, designers have been exploiting the increasing number of transistors on a chip, plus improvements in compiler technology, to improve instruction throughput (IPC). Although designers have generally opted to emphasize one over the other, both clock rates and IPC have been improving consistently. As Figure 1 shows, while some designers have chosen to optimize their designs for fast clocks (Compaq Alpha) and others for high instruction throughput (HP PA-RISC), the past decade's performance increases have been a function of both.

Achieving high performance in future microprocessors will be a tremendous challenge, as both components of performance improvement are facing emerging technology-driven limitations. Designers will soon be unable to sustain clock speed improvements at the past decade's annualized rate of 50% per year. Compensating for the slower clock growth by increasing sustained IPC proportionally will be difficult. Wire delays will limit the ability of conventional microarchitectures to improve instruction throughput. Microprocessor cores will soon face a new constraint, one in which they are communication bound on the die instead of capacity bound. As feature sizes shrink and wires become slower relative to logic, the amount of state that can be accessed in a single clock cycle will cease to grow, and will eventually begin to decline. Increases in instruction-level parallelism will be limited by the amount of state reachable in a cycle, not by the number of transistors that can be manufactured on a chip [2].

Figure 1: Processor clock rates and normalized processor performance (SpecInt/clock rate), 1995–2000.

Future microprocessors must thus achieve ILP considerably higher than today's designs, even while being partitioned, and do so with a high clock rate. These future processors must exploit increased device counts to meet the above goals, but must do so while considering the increased communication delays and partitioning requirements [3].

In this report, we survey a class of architectures intended to address these problems faced by future systems. Grid Processor Architectures are designed to enable both faster clock rates and higher ILP than conventional architectures, even as devices shrink and wire delays increase. A GPA consists of an array of ALUs connected by a lightweight routed network. Each ALU in the array contains local instruction and data storage buffers. Banked instruction and data caches are placed around the array of ALUs, backed by partitioned second-level cache banks. The processor follows a block-atomic model of execution, in which an entire block of instructions is fetched and mapped onto the execution array. A dataflow-style ISA that encodes each instruction's placement and its consumers allows a statically placed but dynamically issued (SPDI) execution model.
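To make the SPDI idea concrete, the following is a minimal, illustrative sketch rather than the actual TRIPS ISA or microarchitecture: each instruction is statically assigned to an ALU and statically names its consumers, but fires dynamically once its operands arrive. All names, encodings, and the two-operand restriction here are our own assumptions for illustration.

```python
from dataclasses import dataclass, field

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

@dataclass
class Instr:
    op: str                       # opcode, e.g. "add" or "mul"
    node: tuple                   # static placement: (x, y) coordinates of an ALU
    targets: list                 # consumers (index, slot), named by the compiler
    operands: dict = field(default_factory=dict)  # arrive dynamically at run time

def execute_block(instrs, block_inputs):
    """Statically placed, dynamically issued (SPDI) execution of one block."""
    # Seed the dataflow graph with the block's inputs (e.g. register reads).
    for (idx, slot), value in block_inputs.items():
        instrs[idx].operands[slot] = value

    outputs, done = {}, set()
    progress = True
    while progress:
        progress = False
        for idx, ins in enumerate(instrs):
            # Dynamic issue rule: fire once both source operands have arrived.
            if idx in done or len(ins.operands) < 2:
                continue
            result = OPS[ins.op](ins.operands["a"], ins.operands["b"])
            # Results are routed point-to-point to statically named consumers,
            # not broadcast on a global bypass network.
            for tgt_idx, tgt_slot in ins.targets:
                instrs[tgt_idx].operands[tgt_slot] = result
            if not ins.targets:          # no consumer: treat as a block output
                outputs[idx] = result
            done.add(idx)
            progress = True
    return outputs

# Usage: compute (2 + 3) * 4 on two "ALUs"; the add forwards directly to the mul.
block = [Instr("add", (0, 0), [(1, "a")]),
         Instr("mul", (1, 0), [])]
print(execute_block(block, {(0, "a"): 2, (0, "b"): 3, (1, "b"): 4}))  # {1: 20}
```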
The dataflow-style ISA and the distributed control and local storage inherently provided by the architecture make the implementation of these mechanisms straightforward. This organization eliminates the centralized instruction issue window and converts the conventional broadcast bypass network into a routed point-to-point network. Similar to VLIW architectures, a compiler is used to detect parallelism and statically schedule instructions onto the computation substrate, such that the mapping matches the topology of the dataflow graph. However, instructions are issued dynamically, with the execution order determined by the availability of input operands. In a GPA, few large structures reside on the critical execution path, enhancing scalability as wire resistance increases. Out-of-order execution is achieved with greatly reduced register file bandwidth and with no associative issue window or register rename table. Compiler-controlled physical layout ensures that the critical path is scheduled along the shortest physical path, and that banked instruction caches reside near the units to which they will issue instructions. Finally, large instruction blocks are mapped onto the nodes as single units of computation, amortizing scheduling and decode overhead over a large number of instructions.

In a GPA, the register file bandwidth is also reduced. Experiments show that register file writes are reduced by 30% to 90% by using direct communication between producing and consuming instructions. On a set of conventional uniprocessor (SPEC CPU2000 and Mediabench) benchmarks, simulation results show IPCs between one and nine, running on a substrate that can likely be clocked faster than conventional designs and that will scale with technology. Assuming small routing delays, perfect memory, and perfect branch prediction, the GPA averages eleven instructions per cycle across these benchmarks [4].

GPA-based systems provide unique opportunities for power efficiency. The elimination of structures dedicated to instruction-level register renaming, associative operand comparisons, and state tracking reduces the overhead circuitry and power on a per-ALU basis. ALU chaining dramatically reduces the number of global register file accesses in exchange for short point-to-point connections. The dynamic power of the ALU array and banked memory structures can be actively managed to reduce consumption during periods of lighter utilization. The dataflow execution model of the GPA is also amenable to power-efficient asynchronous design techniques.

In addition to high ILP, a secondary design goal of the GPA is polymorphism, or the ability to adapt the hardware to the execution characteristics of the application. Grid Processors can easily be subdivided into sub-processors, allowing discrete threads to be assigned to different sub-processors for high thread-level parallelism (TLP). Grid Processors can also be configured to target data-level parallelism (DLP), often exhibited in media, streaming, and scientific codes. For DLP applications, the same GPA hardware employs a different execution model in which instructions for kernels or inner loops are mapped to the ALUs and stay resident for multiple iterations. In addition, each access to a data cache bank provides multiple values that are distributed to the ALUs in each row. Initial results on a set of 7 signal processing kernels show that an 8x8 GPA can average 48 compute instructions per cycle. Assuming an 8-GHz clock in 50nm CMOS, this configuration would achieve a performance level of 384 Gflops [5].
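As a quick back-of-the-envelope check (our own arithmetic, not taken from [5]): 48 compute instructions per cycle x 8 x 10^9 cycles per second = 384 x 10^9 operations per second, i.e. 384 Gflops when those operations are floating-point.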
The remainder of this report is organized as follows. Section 2 describes the TRIPS project. Section 3 describes the TRIPS architecture and GPAs. Section 4 presents the design space of the GPA-class TRIPS machine and its D-morph (instruction-level parallelism) configuration. Section 5 describes related work pertaining to wide-issue and dataflow-oriented machines. Finally, Section 6 concludes with a discussion of the strengths and weaknesses of GPAs and offers some critiques of the GPAs in TRIPS.

2 The Tera-op Reliable Intelligently adaptive Processing System (TRIPS) Project

TRIPS is a multidisciplinary project sponsored under the DARPA Polymorphous Computing Architecture (PCA) initiative. The goal of the TRIPS project is to develop a computing system that outperforms evolutionary architectures on a wide range of applications, achieving single-chip Tera-op performance that scales with advances in semiconductor technology [6]. TRIPS is a collaborative effort among multidisciplinary groups from UT-Austin and the IBM Austin Research Laboratory. The design, evaluation, and implementation span the research disciplines of VLSI design, architecture, compilers, operating systems, and applications.

2.1 Goals

The TRIPS project has four major research goals:

Technology-Scalable Architecture: To address the semiconductor scaling challenges of high-performance processors, particularly in instruction selection, execution, and bypass, the TRIPS team has proposed a new class of processor organizations called Grid Processor Architectures (GPAs). A GPA is composed of a tightly coupled array of ALUs connected via a thin network, onto which large blocks of instructions are scheduled and mapped. To mitigate on-chip communication delays, applications are scheduled so that their critical dataflow paths are placed along nearby ALUs.

Malleable Architecture: The TRIPS architecture is designed to be configurable to meet the needs of a variety of workloads and environmental conditions. Both the grid processors and the on-chip memory system are configurable, able to run workloads as diverse as control-bound integer codes, highly parallel threaded codes, and regular, computationally intensive streaming codes efficiently. The allocation of ALUs within the grid, the instruction mapping onto the grid, the number of executing threads, and the flow of instructions across the grid are all exposed to the system, compiler, and application software for maximum flexibility. A TRIPS chip consists of one or more interconnected grid processors working in parallel.

Dynamic Adaptivity: To respond to changing workloads and conditions, a TRIPS chip provides on-chip sensors and a lightweight software layer called morphware, which monitors power, temperature, memory performance, and ALU utilization. The morphware layer controls the runtime operation of the execution resources, mediating between the requirements of running applications, the capabilities of a TRIPS implementation, and the operating environment of the system.

Application Diversity: TRIPS is intended to support a variety of runtime workloads, including desktop, scientific, streaming, and server workloads. Desktop applications are characterized by irregular integer operations, scientific applications by their large data sets, streaming applications by their regularity and predictability, and server applications by their non-uniform workloads, independent thread execution, and real-time response requirements. The TRIPS system dynamically responds to each in kind and supports concurrent execution of all.
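As a purely illustrative reading of the Dynamic Adaptivity goal above, a morphware-style layer can be pictured as a feedback loop that reads on-chip sensors and adjusts resource configuration. None of the sensor names, thresholds, or knobs below come from the TRIPS documentation; they are hypothetical placeholders.

```python
# Hypothetical sketch of a morphware-style control loop; sensor names,
# thresholds, and configuration knobs are invented for illustration only.
def morphware_step(sensors, config):
    """One monitoring/adaptation step over the sensed chip state."""
    readings = {name: read() for name, read in sensors.items()}

    # Throttle the ALU array if temperature or power is out of budget.
    if readings["temperature_c"] > 85 or readings["power_w"] > config["power_budget_w"]:
        config["active_alu_rows"] = max(1, config["active_alu_rows"] - 1)
    # Re-enable resources when utilization is high and we are within budget.
    elif readings["alu_utilization"] > 0.9:
        config["active_alu_rows"] = min(config["max_alu_rows"],
                                        config["active_alu_rows"] + 1)

    # Repartition memory tiles toward cache if memory stalls dominate.
    if readings["memory_stall_fraction"] > 0.5:
        config["tiles_as_cache"] = min(config["total_tiles"],
                                       config["tiles_as_cache"] + 1)
    return config

# Example usage with stubbed sensors.
sensors = {"temperature_c": lambda: 78.0, "power_w": lambda: 42.0,
           "alu_utilization": lambda: 0.95, "memory_stall_fraction": lambda: 0.3}
config = {"power_budget_w": 60.0, "active_alu_rows": 3, "max_alu_rows": 4,
          "tiles_as_cache": 16, "total_tiles": 32}
print(morphware_step(sensors, config))
```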
3 The TRIPS Architecture

The TRIPS architecture uses large, coarse-grained processing cores to achieve high performance on single-threaded applications with high ILP, and augments them with polymorphous features that enable the core to be subdivided for explicitly concurrent applications at different granularities. Contrary to conventional large-core designs with centralized components that are difficult to scale, the TRIPS architecture is heavily partitioned to avoid large centralized structures and long wire runs. These partitioned computation and memory elements are connected by point-to-point communication channels that are exposed to software schedulers for optimization.

The key challenge in defining the polymorphous features is balancing their granularity so that workloads with different levels of ILP, TLP, and DLP can maximize their use of the available resources, while avoiding escalating complexity and non-scalable structures. The TRIPS system employs coarse-grained polymorphous features, at the level of memory banks and instruction storage, to minimize both software and hardware complexity and configuration overheads.

3.1 Core Execution Model

The TRIPS architecture is fundamentally block oriented. In all modes of operation, programs compiled for TRIPS are partitioned into large blocks of instructions with a single entry point, no internal loops, and possibly multiple exit points, as found in hyperblocks [7]. For instruction- and thread-level parallel programs, blocks commit atomically and interrupts are block precise, meaning that they are handled only at block boundaries. For all modes of execution, the compiler is responsible for statically scheduling each block of instructions onto the computational engine such that inter-instruction dependences are explicit. Each block has a static set of state inputs and a potentially variable set of state outputs that depends upon the exit point taken from the block. At runtime, the basic operational flow of the processor is to fetch a block from memory, load it into the computational engine, execute it to completion, commit its results to the persistent architectural state if necessary, and then proceed to the next block.
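A minimal sketch of this block-atomic operational flow is given below. It is a schematic of the execution model just described, not of the actual TRIPS control logic, and the helper functions passed in are hypothetical stubs.

```python
# Schematic of the block-atomic flow: fetch -> map -> execute -> commit ->
# select next block. The fetch/map/execute/select helpers are caller-supplied
# stubs standing in for hardware mechanisms.
def run(program, arch_state, fetch_block, map_block, execute, select_next):
    pc = arch_state["entry_block"]
    while pc is not None:
        block = fetch_block(program, pc)       # one hyperblock: single entry point
        frames = map_block(block)              # statically scheduled onto the ALU array
        outputs, exit_taken = execute(frames, arch_state)
        # Blocks commit atomically: architectural state changes only here, which
        # is why interrupts can be handled precisely at block boundaries.
        arch_state.update(outputs)
        pc = select_next(block, exit_taken)    # next block depends on the exit taken
    return arch_state
```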
Figure 2: The TRIPS architecture with a GPA as its core.

3.2 Architectural Overview

Figure 2a shows a diagram of the TRIPS architecture that will be implemented in a prototype chip. While the architecture is scalable to both larger dimensions and higher clock rates, due to its partitioned structures and short point-to-point wiring, the TRIPS prototype chip will consist of four polymorphous 16-wide cores, an array of 32KB memory tiles connected by a routed network, and a set of distributed memory controllers with channels to external memory. The prototype chip will be built in a 100nm process and is targeted for completion in 2005. Figure 2b shows an expanded view of a TRIPS core (GPA) and the primary memory system.

The TRIPS core is an example of the Grid Processor family of designs [4], which are typically composed of an array of homogeneous execution nodes, each containing an integer ALU, a floating-point unit, a set of reservation stations, and router connections at the input and output. Each reservation station has storage for an instruction and two source operands. When a reservation station contains a valid instruction and a pair of valid operands, the node can select the instruction for execution. After execution, the node can forward the result to any of the operand slots in local or remote reservation stations within the ALU array. The nodes are directly connected to their nearest neighbors, but the routing network can deliver results to any node in the array. The banked instruction cache on the left couples one bank to each row, with an additional instruction cache bank that issues fetches of register values for injection into the ALU array. The banked register file above the ALU array holds a portion of the architectural state. To the right of the execution nodes is a set of banked level-1 data caches, which can be accessed by any ALU through the local grid routing network. Below the ALU array is the block control logic responsible for sequencing block execution and selecting the next block. The backside of the L1 caches is connected to secondary memory tiles through the chip-wide two-dimensional interconnection network. This switched network provides a robust and scalable connection to a large number of tiles, using less wiring than conventional dedicated channels between these components.

The TRIPS architecture contains three main types of resources. First, the hardcoded, non-polymorphous resources operate in the same manner, and present the same view of internal state, in all modes of operation; examples include the execution units within the nodes, the interconnect fabric between the nodes, and the L1 instruction cache banks. Second, polymorphous resources are used in all modes of operation, but can be configured to operate differently depending on the mode. Third, some resources are not required in all modes and can be disabled when not in use for a given mode.

3.3 Polymorphous Resources

Frame Space: As shown in Figure 2c, each execution node contains a set of reservation stations. Reservation stations with the same index across all of the nodes combine to form a physical frame; for example, combining the first slot of every node in the grid forms frame 0. The frame space, or collection of frames, is a polymorphous resource in TRIPS, as it is managed differently in different modes to support efficient execution of alternate forms of parallelism.

Register File Banks: Although the programming model of each execution mode sees essentially the same number of architecturally visible registers, the hardware substrate provides many more. The extra copies can be used in different ways, such as for speculation or multithreading, depending on the mode of operation.

Block Sequencing Controls: The block sequencing controls determine when a block has completed execution, when a block should be deallocated from the frame space, and which block should be loaded next into the free frame space. To implement different modes of operation, a range of policies can govern these actions. The deallocation logic may be configured to allow a block to execute more than once, as is useful in streaming applications in which the same inner loop is applied to multiple data elements. The next-block selector can be configured to limit speculation and to prioritize between multiple concurrently executing threads, which is useful for multithreaded parallel programs.
Memory Tiles: The TRIPS memory tiles can be configured to behave as NUCA-style L2 cache banks, as scratchpad memory, or as synchronization buffers for producer/consumer communication. In addition, the memory tiles closest to each processor present a special high-bandwidth interface that further optimizes their use as stream register files.

4 D-morph: Instruction-Level Parallelism

The desktop morph, or D-morph, of the TRIPS processor uses the polymorphous capabilities of the processor to run single-threaded codes efficiently by exploiting instruction-level parallelism. The TRIPS processor core is an instantiation of the Grid Processor family of architectures, but with some important differences, as described in this section. To achieve high ILP, the D-morph configuration treats the instruction buffers in the processor core as a large, distributed instruction issue window, using the TRIPS ISA to enable out-of-order execution while avoiding the associative issue-window lookups of conventional machines. To use the instruction buffers effectively as a large window, the D-morph must provide high-bandwidth instruction fetching, aggressive control and data speculation, and a high-bandwidth, low-latency memory system that preserves sequential memory semantics across a window of thousands of instructions.

Figure 3: D-morph frame management.

4.1 Frame Space Management

By treating the instruction buffers at each ALU as a distributed issue window, orders-of-magnitude increases in window size are possible. This window is fundamentally a three-dimensional scheduling region, where the x- and y-dimensions correspond to the physical dimensions of the ALU array and the z-dimension corresponds to the multiple instruction slots at each ALU node, as shown in Figure 2c. This three-dimensional region can be viewed as a series of frames, as shown in Figure 3b, in which each frame consists of one instruction buffer entry per ALU node, resulting in a 2-D slice of the 3-D scheduling region. To fill one of these scheduling regions, the compiler schedules hyperblocks into a 3-D region, assigning each instruction to one node in the 3-D space. Hyperblocks are predicated, single-entry, multiple-exit regions formed by the compiler. A 3-D region (the array and the set of frames) into which one hyperblock is mapped is called an architectural frame, or A-frame. Figure 3a shows a four-instruction hyperblock (H0) mapped into A-frame 0 as shown in Figure 3b, where N0 and N2 are mapped to different buffer slots (frames) on the same physical ALU node. All communication within the block is determined by the compiler, which schedules operand routing directly from ALU to ALU. Consumers are encoded in the producer instructions as X, Y, and Z relative offsets. Instructions can direct a produced value to any element within the same A-frame, using the lightweight routed network in the ALU array. The maximum number of frames that can be occupied by one program block (the maximum A-frame size) is architecturally limited by the number of instruction bits used to specify destinations, and physically limited by the total number of frames available in a given implementation. The current TRIPS ISA limits the number of instructions in a hyperblock to 128, and the current implementation limits the maximum number of frames per A-frame to 16 and the maximum number of A-frames to 32, and provides 128 frames in total.
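The following sketch illustrates this style of consumer encoding. The field layout, the helper name, and the assumption that an A-frame occupies a contiguous run of physical frames are ours, not the real TRIPS instruction format; only the architectural limits quoted above come from the report. A producer names each consumer by relative (X, Y, Z) offsets within the same A-frame, and the hardware resolves that into a concrete instruction-buffer address at run time.

```python
# Illustrative consumer addressing within an A-frame. Field widths and the
# frame-allocation scheme are assumptions of this sketch.
GRID_X, GRID_Y = 4, 4            # ALU array dimensions of the evaluated core
MAX_FRAMES_PER_AFRAME = 16
TOTAL_PHYSICAL_FRAMES = 128

def consumer_buffer_address(producer_xyz, offset_xyz, aframe_base):
    """Resolve a compiler-encoded (dx, dy, dz) offset to a buffer address.

    The compiler encodes targets relative to the producer and only within the
    same A-frame; at run time the hardware combines the target with the
    dynamically assigned A-frame location (modeled here as a base frame, an
    assumption of this sketch) so that speculative copies of a hyperblock do
    not collide in the physical frame space.
    """
    px, py, pz = producer_xyz
    dx, dy, dz = offset_xyz
    x, y, z = px + dx, py + dy, pz + dz
    assert 0 <= x < GRID_X and 0 <= y < GRID_Y, "target must stay on the grid"
    assert 0 <= z < MAX_FRAMES_PER_AFRAME, "target must stay inside the A-frame"
    physical_frame = aframe_base + z
    assert physical_frame < TOTAL_PHYSICAL_FRAMES
    return (x, y, physical_frame)

# Example: a producer at node (1, 2), slot 0, targets the node one column to
# the right, two slots deeper, in an A-frame whose frames start at slot 48.
print(consumer_buffer_address((1, 2, 0), (1, 0, 2), aframe_base=48))  # (2, 2, 50)
```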
4.2 Multiblock Speculation

The TRIPS instruction window is much larger than the average hyperblock that can be constructed. The hardware therefore fills empty A-frames with speculatively mapped hyperblocks, predicting which hyperblock will be executed next, mapping it to an empty A-frame, and so on. The A-frames are treated as a circular buffer in which the oldest A-frame is non-speculative and all other A-frames are speculative (analogous to tasks in a Multiscalar processor). When the A-frame holding the oldest hyperblock completes, the block is committed and removed. The next-oldest hyperblock becomes non-speculative, and the released frames can be filled with a new speculative hyperblock. On a misprediction, all blocks past the offending prediction are squashed and restarted. Since A-frame IDs are assigned dynamically and all intra-hyperblock communication occurs within a single A-frame, each producer instruction prepends its A-frame ID to the Z-coordinate of its consumer to form the correct instruction buffer address of the consumer. Values passed between hyperblocks are transmitted through the register file, as shown by the communication of R1 from H0 to H1 in Figure 3b. Such values are aggressively forwarded as they are produced, using a register stitch table that dynamically matches the register outputs of earlier hyperblocks to the register inputs of later hyperblocks.

Table 1. Execution characteristics of D-morph codes.

4.3 High-Bandwidth Instruction Fetch

To fill the large distributed window, the D-morph requires high-bandwidth instruction fetch. The control model uses a program counter that points to hyperblock headers. When there is sufficient frame space to map a hyperblock, the control logic accesses the partitioned instruction cache by broadcasting the index of the hyperblock to all banks. Each bank then fetches a row's worth of instructions with a single access and streams it to the bank's respective row. Hyperblocks are encoded as VLIW-like blocks, with a prepended header that contains the number of frames consumed by the block. The next-hyperblock prediction is made using a highly tuned tournament exit predictor, which predicts a binary value indicating the branch expected to be the first to exit the hyperblock. The per-block accuracy of the exit predictor is shown in row 3 of Table 1. The value generated by the exit predictor is used both to index into a BTB to obtain the next predicted hyperblock address, and to avoid forwarding register outputs produced past the predicted branch to subsequent blocks.

4.4 Memory Interface

To support high ILP, the D-morph memory system must provide a high-bandwidth, low-latency data cache and must maintain sequential memory semantics. As shown in Figure 2b, the right side of each TRIPS core contains the distributed primary memory system banks, which are tightly coupled to the processing logic for low latency. The banks are interleaved using the low-order bits of the cache index and can process multiple non-conflicting accesses simultaneously. Each bank is coupled with MSHRs for the cache bank and with a partition of the address-interleaved load/store queues that enforce ordering of loads and stores. The MSHRs, the load/store queues, and the cache banks all use the same interleaving scheme. Stores are written back to the cache from the LSQs upon block commit.
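A small sketch of this kind of address interleaving follows. The line size and field boundaries are illustrative assumptions (the evaluated L1 data cache is described only as 64KB in four banks), but it shows how the same low-order index bits can steer an access to its cache bank, its MSHR group, and its LSQ partition.

```python
# Illustrative address interleaving across banks; line size and field layout
# are assumptions of this sketch, not TRIPS-specified values.
LINE_BYTES = 64
NUM_BANKS = 4          # the evaluated D-morph L1 caches use four banks

def bank_of(addr):
    """Select a bank from the low-order bits of the cache index."""
    cache_index = addr // LINE_BYTES          # drop the line-offset bits
    return cache_index % NUM_BANKS            # low-order index bits pick the bank

def route_access(addr, banks):
    """Cache bank, MSHRs, and LSQ partition all share one interleaving."""
    b = bank_of(addr)
    return banks[b]["cache"], banks[b]["mshrs"], banks[b]["lsq"]

# Accesses to consecutive cache lines map to different banks and can proceed
# in parallel when they do not conflict.
banks = [{"cache": f"D$[{i}]", "mshrs": f"MSHR[{i}]", "lsq": f"LSQ[{i}]"}
         for i in range(NUM_BANKS)]
for a in (0x1000, 0x1040, 0x1080, 0x10C0):
    print(hex(a), "->", route_access(a, banks))
```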
The secondary memory system in the D-morph configures the networked banks as a non-uniform cache access (NUCA) array, in which the elements of a set are spread across multiple secondary banks and data can migrate over the two-dimensional switched network that connects the secondary banks. This network also provides a high-bandwidth link to each L1 bank for parallel L1 miss processing and fills.

To summarize, with accurate exit prediction, high-bandwidth instruction fetching, partitioned data caches, and concurrent execution of hyperblocks with inter-block value forwarding, the D-morph is able to use the instruction buffers effectively as a polymorphous out-of-order issue window, as shown in the next subsection.

Figure 4: D-morph performance as a function of the number of A-frames.

4.5 D-morph Results

In this subsection, the achievable ILP is measured using the mechanisms described above. The results assume a 4x4 (16-wide issue) core, 128 physical frames, a 64KB L1 data cache that requires three cycles to access, a 64KB L1 instruction cache (both partitioned into four banks), 0.5 cycles per hop in the ALU array, a 10-cycle branch misprediction penalty, a 250Kb exit predictor, a 12-cycle access penalty to a 2MB L2 cache, and a 132-cycle main memory access penalty. Optimistic assumptions currently made in the simulator include no modeling of TLBs or page faults, oracular load/store ordering, simulation of a centralized register file, and no issue of wrong-path instructions to the memory system. All of the binaries were compiled with the Trimaran tool set [8] and scheduled for the TRIPS processor with a custom scheduler/rewriter.

The first row of Table 1 shows the average number of useful dynamically executed instructions per block, discounting overhead instructions, instructions with false predicates, and instructions past a block exit. The second row shows the average dynamic number of frames allocated per block by the scheduler for a 4x4 grid. Using the steady-state block (exit) prediction accuracies shown in the third row, each benchmark holds 965 useful instructions in the distributed window, on average, as shown in Table 1.

Figure 4 shows how IPC scales as the number of A-frames is increased from 1 to 32, permitting deeper speculative execution. The integer benchmarks are shown on the left; the floating-point and Mediabench benchmarks are shown on the right. Each 32 A-frame bar also has two additional IPC values, showing the performance with perfect memory in the hashed fraction of each bar, and then additionally with perfect branch prediction, shown in white. Increasing the number of A-frames provides a consistent performance boost across many of the benchmarks, since it permits greater exploitation of ILP by providing a larger window of instructions. Some benchmarks show no performance improvement beyond 16 A-frames (bzip2, m88ksim, and tomcatv), and a few reach their peak at 8 A-frames (adpcm, gzip, twolf, and hydro2d). In such cases, the large frame space is underutilized when running a single thread, due either to low hyperblock predictability or to a lack of program ILP. The graphs demonstrate that while control mispredictions cause large performance losses for the integer codes (close to 50% on average), the large window is able to tolerate memory latencies extremely well, resulting in negligible slowdowns due to an imperfect memory system for all benchmarks but mgrid.
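For convenience, the simulated machine parameters listed at the start of this subsection are collected below. This is a plain restatement of the configuration above; the dictionary keys are our own naming.

```python
# Simulated D-morph configuration used for the results in this subsection,
# restated from the text; key names are ours, values are from the report.
dmorph_sim_config = {
    "alu_grid": (4, 4),                  # 16-wide issue core
    "physical_frames": 128,
    "l1_dcache": {"size_kb": 64, "banks": 4, "access_cycles": 3},
    "l1_icache": {"size_kb": 64, "banks": 4},
    "alu_hop_cycles": 0.5,
    "branch_mispredict_penalty_cycles": 10,
    "exit_predictor_size_kbit": 250,
    "l2_cache": {"size_mb": 2, "access_penalty_cycles": 12},
    "main_memory_penalty_cycles": 132,
}
```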
5 Related Work

The goals of high clock rate and high IPC are not unique to GPAs. Many prior approaches have attempted to use both static and dynamic techniques to discover and execute along the critical path of a program, but they are too numerous to discuss here. In this section we describe what we believe to be the most relevant related work.

Dennis and Misunas proposed a static dataflow architecture [9], and Arvind proposed the Tagged-Token Dataflow architecture, with purely data-driven instruction scheduling for programs expressed in a dataflow language [10]. Culler later proposed a hybrid dataflow execution model in which programs are partitioned into code blocks made up of instruction sequences, called threads, with dataflow execution between threads [11]. The TRIPS Grid Processor approach differs from these in that it uses a conventional programming interface with dataflow execution only within a limited window of instructions, and relies on compiler instruction mapping to reduce the complexity of token matching.

In a sense, GPAs are a hybrid of VLIW and conventional superscalar architectures: a GPA statically schedules instructions using a compiler, but then dynamically issues them based on data dependences. Other efforts have attempted to enhance VLIW architectures with dynamic execution. Rau proposed a split-issue mechanism to separate register read and execute from writeback, and a delay buffer to support dynamic scheduling in VLIW processors [12]. Grid Processors share many characteristics with the Transport Triggered Architectures proposed by Corporaal and Mulder, including data-driven execution, reduced register file traffic, and non-broadcast bypass of execution unit results [13].

Others have looked at various naming mechanisms for values to reduce register pressure and register file size. Smelyanskiy et al. proposed register queues for allocating live values in software-pipelined loops [14]. Llosa proposed register sacks, which are low-bandwidth, port-limited register files for allocating live values in pipelined loops [15]. Patt proposed a block-structured instruction set architecture to increase the fetch rate of wide-issue machines, in which the atomic unit of execution is a block rather than an instruction [16]. Many researchers are exploring distributed or partitioned uniprocessor designs. Waingold et al. proposed a distributed execution model with extensive compiler support in the RAW architecture [17]. The RAW architecture assumes coarser-grain execution than the Grid Processor, exploiting parallelism across multiple compiler-generated instruction streams.

6 Conclusions

In this report we surveyed Grid Processor Architectures, a new class of microarchitecture intended to enable continued scaling of both clock rate and instruction throughput. By mapping dependence chains onto an array of ALUs, GPAs allow conventionally large structures such as register files and instruction windows to be distributed throughout the ALU array, permitting better scalability of the processing core. By delivering ALU results point-to-point instead of broadcasting them, GPAs mitigate the growing global wire and delay overheads of conventional bypass architectures. Studies on sequential applications are promising, with the grid processor achieving IPCs ranging from 1 to 9, competitive with those of idealized superscalar microarchitectures and exceeding those of VLIW microarchitectures.
It is not clear that GPAs will be superior to the conventional alternatives, which may find more incremental, but equally good, solutions to the wire delay and clock scaling problems. GPAs have several disadvantages: they force the data caches to be far away from many of the ALUs, and they incur delays between dependent operations due to the network routers and wires, which can be significant. The complexity of frame management and block stitching (allowing successor hyperblocks to execute speculatively) is significant and may interfere with the goal of fast clock rates. However, future architectures must be partitioned somehow, and the partitioning and the flow of operations are likely to be exposed to the compiler while still preserving dynamic execution. Many of the techniques discussed here are thus likely to appear in future designs. Research groups are still refining the microarchitecture of the GPAs and the hyperblock scheduler, with the expectation that the hardware complexity can be further reduced without undue burden on the software.

References

[1] D. J. Sorin. "10 Novel Architectures", Advanced Computer Architecture II (Parallel Computer Architecture), presentation slides.
[2] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[3] D. I. August, D. A. Connors, S. A. Mahlke, J. W. Sias, K. M. Crozier, B.-C. Cheng, P. R. Eaton, Q. B. Olaniran, and W. Hwu. Integrated predicated and speculative execution in the IMPACT EPIC architecture. In Proceedings of the 25th International Symposium on Computer Architecture, pages 45–54, July 1998.
[4] R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler. A design space evaluation of grid processor architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 40–51, December 2001.
[5] S. W. Keckler, D. Burger, C. R. Moore, R. Nagarajan, K. Sankaralingam, V. Agarwal, M. S. Hrishikesh, N. Ranganathan, and P. Shivakumar. A wire-delay scalable microprocessor architecture for high performance systems. Department of Computer Sciences, The University of Texas at Austin, Austin, TX.
[6] TRIPS Project Home Page.
[7] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th International Symposium on Microarchitecture, pages 45–54, 1992.
[8] Trimaran: An infrastructure for research in instruction-level parallelism. http://www.trimaran.org.
[9] J. Dennis and D. Misunas. A preliminary architecture for a basic data-flow processor. In Proceedings of the 2nd Annual Symposium on Computer Architecture, pages 126–132, January 1975.
[10] Arvind and R. S. Nikhil. Executing a program on the MIT Tagged-Token Dataflow Architecture. IEEE Transactions on Computers, 39(3):300–318, 1990.
[11] D. E. Culler, A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 164–175, April 1991.
[12] B. Rau. Dynamically scheduled VLIW processors. In Proceedings of the 26th Annual International Symposium on Microarchitecture, pages 80–90, December 1993.
[13] H. Corporaal and H. Mulder. Move: A framework for high-performance processor design. In Supercomputing '91, pages 692–701, November 1991.
[14] M. Smelyanskiy, G. Tyson, and E. Davidson. Register queues: A new hardware/software approach to efficient software pipelining. In International Conference on Parallel Architectures and Compilation Techniques (PACT 2000), pages 3–12, October 2000.
[15] J. Llosa, M. Valero, J. Fortes, and E. Ayguade. Using sacks to organize register files in VLIW machines. In CONPAR 94 - VAPP VI, pages 628–639, September 1994.
[16] E. Hao, P. Chang, M. Evers, and Y. Patt. Increasing the instruction fetch rate via block-structured instruction set architectures. In Proceedings of the 29th International Symposium on Microarchitecture, pages 191–200, December 1996.
[17] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw machines. IEEE Computer, 30(9):86–93, September 1997.