3. CPU DESIGN
INTRODUCTION
As increased CPU speed and complexity yield higher peak performance ratings, it
becomes ever more difficult for the rest of the computer system to keep up. To
deliver high sustained performance on real applications, the entire system must be
balanced in both bandwidth and latency. In other words, students will learn how to
convert the potential of a modern CPU into a computer system that can actually
deliver high performance.
CPU designers construct these processors using logic gates to execute the
instructions. To keep the number of logic gates reasonably small, CPU designers
must necessarily restrict the number and complexity of the commands the CPU
recognizes. This small set of commands is the CPU's instruction set.
Programs in early (pre-Von Neumann) computer systems were often "hard-wired"
into the circuitry. That is, the computer's wiring determined what problem the
computer would solve. One had to rewire the circuitry in order to change the
program, a very difficult task. The next advance in computer design was the
programmable computer system, one that allowed a computer programmer to
easily "rewire" the computer system using a sequence of sockets and plug wires. A
computer program consisted of a set of rows of holes (sockets), each row
representing one operation during the execution of the program. The programmer
could select one of several instructions by plugging a wire into the particular socket
for the desired instruction.
Figure:- Patch Panel Programming
Of course, a major difficulty with this scheme is that the number of possible
instructions is severely limited by the number of sockets one could physically place
on each row. However, CPU designers quickly discovered that with a small
amount of additional logic circuitry, they could reduce the number of sockets
required from n holes for n instructions to log2(n) holes for n instructions. They did
this by assigning a numeric code to each instruction and then encoding that
instruction as a binary number using log2(n) holes.
Figure: Encoding Instructions
This addition requires eight logic functions to decode the A, B, and C bits from the
patch panel, but the extra circuitry is well worth the cost because it reduces the
number of sockets that must be repeated for each instruction (this circuitry, by the
way, is nothing more than a single three-line to eight-line decoder).
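The log2(n) socket-count reduction and the decoder can be sketched numerically. A minimal Python sketch (the function names and the list-based decoder model are illustrative stand-ins for the logic circuitry, not part of the original design):

```python
import math

def bits_needed(n_instructions: int) -> int:
    """Sockets (bits) needed to encode n instructions in binary."""
    return math.ceil(math.log2(n_instructions))

def decode_3_to_8(code: int) -> list[int]:
    """A 3-line to 8-line decoder: exactly one output line is asserted."""
    assert 0 <= code < 8
    return [1 if i == code else 0 for i in range(8)]

# 8 instructions need only 3 holes per row instead of 8
print(bits_needed(8))    # 3
print(decode_3_to_8(5))  # [0, 0, 0, 0, 0, 1, 0, 0]
```

Note that the saving grows quickly: 64 instructions need only 6 sockets per row rather than 64.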
CPU design focuses on these areas:
1. Datapaths (such as ALUs and pipelines)
2. Control unit: logic which controls the datapaths
3. Memory components such as register files and caches
4. Clock circuitry such as clock drivers, PLLs, and clock distribution networks
5. Pad transceiver circuitry
6. The logic gate cell library used to implement the logic
CPUs designed for high-performance markets might require custom designs for
each of these items to achieve frequency, power-dissipation, and chip-area goals.
CPUs designed for lower performance markets might lessen the implementation
burden by:
- Acquiring some of these items by purchasing them as intellectual property
- Using control logic implementation techniques (logic synthesis using CAD
tools) to implement the other components (datapaths, register files, clocks)
Common logic styles used in CPU design include:
- Unstructured random logic
- Finite-state machines
- Microprogramming (common from 1965 to 1985, no longer common except
for CISC CPUs)
- Programmable logic array (common in the 1980s, no longer common)
Device types used to implement the logic include:
- Transistor-transistor logic (TTL) Small Scale Integration "jelly-bean" logic
chips - no longer used for CPUs
- Programmable Array Logic and programmable logic devices - no longer
used for CPUs
- Emitter-coupled logic (ECL) gate arrays - no longer common
- CMOS gate arrays - no longer used for CPUs
- CMOS ASICs - what is commonly used today; they are so common that the
term ASIC is not used for CPUs
- Field-programmable gate arrays (FPGAs) - common for soft
microprocessors, and more or less required for reconfigurable computing
A CPU design project generally has these major tasks:
- Programmer-visible instruction set architecture, which can be implemented
by a variety of microarchitectures
- Architectural study and performance modeling in ANSI C/C++ or SystemC
- High-level synthesis (HLS) or RTL (e.g. logic) implementation
- RTL verification
- Circuit design of speed-critical components (caches, registers, ALUs)
- Logic synthesis or logic-gate-level design
- Timing analysis to confirm that all logic and circuits will run at the
specified operating frequency
- Physical design, including floorplanning and place-and-route of logic gates
- Checking that the RTL, gate-level, transistor-level, and physical-level
representations are equivalent
- Checks for signal integrity and chip manufacturability
As with most complex electronic designs, the logic verification effort (proving that
the design does not have bugs) now dominates the project schedule of a CPU.
Key CPU architectural innovations include cache, virtual memory, instruction
pipelining, superscalar, CISC, RISC, virtual machine, emulators, microprogram,
and stack.
Goals
The first CPUs were designed to do mathematical calculations faster and more
reliably than human computers.
Each successive generation of CPU might be designed to achieve some of these
goals:
- higher performance levels for a single program or thread
- higher throughput levels for multiple programs/threads
- less power consumption for the same performance level
- lower cost for the same performance level
- greater connectivity to build larger, more parallel systems
- more specialization to aid in specific targeted markets
Re-designing a CPU core to a smaller die area helps achieve several of these goals:
1. Shrinking everything (a "photomask shrink"), resulting in the same number
of transistors on a smaller die, improves performance (smaller transistors
switch faster), reduces power (smaller wires have less parasitic capacitance),
and reduces cost (more CPUs fit on the same wafer of silicon).
2. Releasing a CPU on the same size die, but with a smaller CPU core, keeps
the cost about the same but allows higher levels of integration within one
VLSI chip (additional cache, multiple CPUs, or other components),
improving performance and reducing overall system cost.
Performance analysis and benchmarking
Because there are too many programs to test a CPU's speed on all of them,
benchmarks were developed. The most famous benchmarks are the SPECint and
SPECfp benchmarks developed by Standard Performance Evaluation Corporation
and the ConsumerMark benchmark developed by the Embedded Microprocessor
Benchmark Consortium (EEMBC).
Some important measurements include:
- Instructions per second - Most consumers pick a computer architecture
(normally the Intel IA-32 architecture) to be able to run a large base of
pre-existing, pre-compiled software. Being relatively uninformed about
computer benchmarks, some of them pick a particular CPU based on
operating frequency (see Megahertz Myth).
- FLOPS - The number of floating-point operations per second is often
important in selecting computers for scientific computations.
- Performance per watt - System designers building parallel computers, such
as Google, pick CPUs based on their speed per watt of power, because the
cost of powering the CPU outweighs the cost of the CPU itself.
- Performance per dollar - Some system designers building parallel
computers pick CPUs based on speed per dollar.
- Deterministic response - System designers building real-time computing
systems want to guarantee worst-case response. That is easier to do when
the CPU has low interrupt latency and deterministic response, as in a DSP.
- Full instruction set - Computer programmers who program directly in
assembly language want a CPU to support a full-featured instruction set.
- Low power - For systems with limited power sources (e.g. solar, batteries,
human power).
- Small size or low weight - For portable embedded systems and systems for
spacecraft.
- Environmental impact - Minimizing the environmental impact of computers
during manufacturing, use, and recycling: reducing waste and reducing
hazardous materials (see Green computing).
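The trade-offs among these metrics can be illustrated with a toy comparison. The CPU names and figures below are entirely hypothetical, chosen only to show how different metrics can rank the same parts differently:

```python
# Hypothetical CPU figures (illustrative numbers, not real parts)
cpus = {
    "A": {"score": 120, "watts": 65, "price": 200},
    "B": {"score": 150, "watts": 125, "price": 250},
    "C": {"score": 90, "watts": 35, "price": 120},
}

def best_by(metric):
    """Return the CPU name that maximizes the given metric."""
    return max(cpus, key=lambda name: metric(cpus[name]))

raw_speed = best_by(lambda c: c["score"])            # what benchmarks rank
perf_per_watt = best_by(lambda c: c["score"] / c["watts"])
perf_per_dollar = best_by(lambda c: c["score"] / c["price"])
print(raw_speed, perf_per_watt, perf_per_dollar)     # B C C
```

Here the fastest part ("B") is the worst choice for a power- or cost-constrained system, which is exactly why the measurements above are evaluated separately.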
This chapter describes the features found in practical machines that
enhance the performance of the control unit. The following parameters are
usually considered in the design of a control unit:
Speed
The control unit should generate control signals fast enough to
utilize the processor bus structure most efficiently and minimize the
instruction execution time.
Cost and complexity
The control unit is the most complex subsystem of any processing system.
The complexity should be reduced as much as possible to make
maintenance easier and the cost low. In general, random logic (hardwired)
implementations of the control unit are faster but more complex and harder
to modify than microprogrammed implementations.
3. Microprogrammed Control
The control signals needed in each step of instruction execution can be generated
by the finite state machine method, also called hardwired control, or, alternatively,
by the microprogrammed control method discussed below.

Basic Concepts of Microprogramming:
- Control word (CW): A word with one bit for each control signal. Each step
of the instruction execution is represented by a control word in which all of
the bits corresponding to the control signals needed for that step are set to
one.
- Microinstruction: Each step in the sequence of steps in the execution of a
certain machine instruction is considered a microinstruction, and it is
represented by a control word. All of the bits corresponding to the control
signals that need to be asserted in this step are set to 1, and all others are
set to 0 (horizontal organization).
- Microprogram: A sequence of microinstructions corresponding to the
sequence of steps in the execution of a given machine instruction.
- Microprogramming: The method of generating the control signals by
properly setting the individual bits in the control word of each step.
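These definitions can be modeled with bitmasks. The signal names and the two-step fetch sequence below are hypothetical examples of a horizontal organization, not any real machine's control store:

```python
# Hypothetical control signals for a toy datapath (one bit each)
SIGNALS = ["PCout", "MARin", "Read", "MDRout", "IRin", "PCin", "ALUadd", "ACCin"]
BIT = {name: 1 << i for i, name in enumerate(SIGNALS)}

def control_word(*asserted):
    """Build a horizontal control word: set one bit per asserted signal."""
    word = 0
    for name in asserted:
        word |= BIT[name]
    return word

# A microprogram is a sequence of control words, one per step
fetch_microprogram = [
    control_word("PCout", "MARin", "Read"),  # step 1: send PC address to memory
    control_word("MDRout", "IRin"),          # step 2: latch instruction into IR
]

def asserted_signals(word):
    """Recover which control signals a control word asserts."""
    return [name for name in SIGNALS if word & BIT[name]]

print(asserted_signals(fetch_microprogram[0]))  # ['PCout', 'MARin', 'Read']
```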
The four classifications defined by Flynn are based upon the number of concurrent
instruction (or control) and data streams available in the architecture:
3.1.1 Single Instruction, Single Data stream (SISD)
A sequential computer which exploits no parallelism in either the
instruction or data streams. Examples of SISD architecture are the
traditional uniprocessor machines like a PC or old mainframes.
SISD (Single Instruction, Single Data) is a term
referring to a computer architecture in which a single processor, a
uniprocessor, executes a single instruction stream, to operate on data stored
in a single memory. This corresponds to the von Neumann architecture.
SISD is one of the four main classifications defined in Flynn's taxonomy.
In this system, classifications are based upon the number of concurrent
instructions and data streams present in the computer architecture.
According to Michael J. Flynn, SISD can have concurrent processing
characteristics. Instruction fetching and pipelined execution of instructions
are common examples found in most modern SISD computers.
3.1.2 Single Instruction, Multiple Data streams (SIMD)
A computer which exploits multiple data streams against a single instruction
stream to perform operations which may be naturally parallelized. For example, an
array processor or GPU. SIMD (Single Instruction, Multiple Data; colloquially,
"vector instructions") is a technique employed to achieve data level parallelism. An
application that may take advantage of SIMD is one where the same value is being
added (or subtracted) to a large number of data points, a common operation in
many multimedia applications. One example would be changing the brightness of
an image. Each pixel of an image consists of three values for the brightness of the
red, green and blue portions of the color. To change the brightness, the R G and B
values are read from memory, a value is added (or subtracted) from them, and the
resulting values are written back out to memory.
With a SIMD processor there are two improvements to this process. For one, the
data is understood to be in blocks, and a number of values can be loaded all at
once. Instead of a series of instructions saying "get this pixel, now get the next
pixel", a SIMD processor will have a single instruction that effectively says "get
lots of pixels" ("lots" is a number that varies from design to design). For a variety
of reasons, this can take much less time than "getting" each pixel individually, like
with traditional CPU design.
Another advantage is that SIMD systems typically include only those instructions
that can be applied to all of the data in one operation. In other words, if the SIMD
system works by loading up eight data points at once, the add operation being
applied to the data will happen to all eight values at the same time. Although the
same is true for any superscalar processor design, the level of parallelism in a
SIMD system is typically much higher.
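The blocked "get lots of pixels" idea can be sketched in plain Python. This is only a software simulation of the SIMD pattern; the block size of 8 and the clamping to the 0-255 pixel range are assumptions for illustration:

```python
def brighten_scalar(pixels, delta):
    """Scalar style: fetch, add, and store one pixel value at a time."""
    out = []
    for p in pixels:
        out.append(min(255, max(0, p + delta)))
    return out

def brighten_blocked(pixels, delta, block=8):
    """SIMD style: process a fixed-size block per 'instruction'.
    The block operation is simulated with a comprehension here;
    a real SIMD unit applies the add to all lanes simultaneously."""
    out = []
    for i in range(0, len(pixels), block):
        chunk = pixels[i:i + block]  # "get lots of pixels" in one load
        out.extend(min(255, max(0, p + delta)) for p in chunk)  # one vector add
    return out

rgb = [10, 200, 250, 0, 100, 255, 30, 60]
print(brighten_scalar(rgb, 20) == brighten_blocked(rgb, 20))  # True
```

Both functions compute the same result; the SIMD version simply issues far fewer "instructions" for the same amount of data.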
3.1.3 Multiple Instruction, Single Data stream (MISD)
Multiple instructions operate on a single data stream. This is an uncommon
architecture, generally used for fault tolerance: heterogeneous systems
operate on the same data stream and must agree on the result.
Examples include the Space Shuttle flight control computer.
MISD (Multiple Instruction, Single Data) is a type of parallel computing
architecture where many functional units perform different operations on
the same data. Pipeline architectures belong to this type, though a purist
might say that the data is different after processing by each stage in the
pipeline. Fault-tolerant computers executing the same instructions
redundantly in order to detect and mask errors, in a manner known as task
replication, may be considered to belong to this type. Not many instances of
this architecture exist, as MIMD and SIMD are often more appropriate for
common data parallel techniques. Specifically, they allow better scaling and
use of computational resources than MISD does.
3.1.4 Multiple Instruction, Multiple Data streams (MIMD)
Multiple autonomous processors simultaneously execute different
instructions on different data. Distributed systems are generally recognized
to be MIMD architectures, exploiting either a single shared memory space
or a distributed memory space.
MIMD (Multiple Instruction stream, Multiple Data stream) is a technique
employed to achieve parallelism. Machines using MIMD have a number of
processors that function asynchronously and independently. At any time,
different processors may be executing different instructions on different
pieces of data. MIMD architectures may be used in a number of application
areas such as computer-aided design/computer-aided manufacturing,
simulation, modeling, and as communication switches. MIMD machines
can be of either shared memory or distributed memory categories. These
classifications are based on how MIMD processors access memory. Shared
memory machines may be of the bus-based, extended, or hierarchical type.
Distributed memory machines may have hypercube or mesh
interconnection schemes.

Horizontal vs. Vertical
As the CPU may need hundreds of control signals, the control word will
inevitably be long. To reduce the length of the control word, groups of control
signals that are mutually exclusive (only one of them need be asserted at a
time) can be encoded to form shorter fields. This shorter form of control
word is called vertical organization.
For example, if only 1 of a group of 8 signals is needed at any time, they
can be encoded into a field of log2(8) = 3 bits, instead of 8 bits. The price
to pay is the time delay needed for decoding the encoded field.
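The 8-signals-into-3-bits encoding can be shown directly. A small sketch, where the one-hot `decode` function models the decoder hardware (and its delay) that the vertical organization must add:

```python
import math

N_SIGNALS = 8
FIELD_WIDTH = int(math.log2(N_SIGNALS))  # 3 bits instead of 8

def encode(signal_index: int) -> int:
    """Vertical organization: store which signal is asserted,
    not one bit per signal."""
    assert 0 <= signal_index < N_SIGNALS
    return signal_index  # fits in FIELD_WIDTH bits

def decode(field: int) -> list[int]:
    """The decoder (extra hardware, extra delay) expands the short
    field back into the full one-hot set of control signals."""
    return [1 if i == field else 0 for i in range(N_SIGNALS)]

print(FIELD_WIDTH)        # 3
print(decode(encode(6)))  # [0, 0, 0, 0, 0, 0, 1, 0]
```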
Flexibility
Hardwired control units (HCUs) are inflexible in terms of adding new
architectural features to the processor, since they require a redesign of the
hardware. Microprogrammed control units (MCUs) offer very high flexibility,
since microprograms can be easily updated without a substantial redesign
of the hardware involved. With the advances in
hardware technology, faster and more versatile processors are introduced
to the market very rapidly. This requires that the design cycle time for
newer processors be as small as possible. Since the design costs
must be recovered over the short life span of a new processor, they must
be minimized. MCUs offer such flexibility and low-cost redesign
capabilities, although they are inherently slower than HCUs. This
speed differential between the two designs is getting smaller, since in the
current technology, the MCU is fabricated on the same IC (i.e. with the
same technology) as the rest of the processor. We will concentrate on the
popular speed-enhancement techniques used in contemporary machines in
this chapter.
3.2 SPEED ENHANCEMENT
In ASC, the control unit fetches an instruction, decodes it and executes it,
before fetching the next instruction as dictated by the program logic. That is, the
control unit brings about the instruction cycles one after the other. With this serial
instruction execution mode, the only way to increase the speed of execution of the
overall program is to minimize the instruction cycle time of the individual
instructions. This concept is called the instruction cycle speedup. The program
execution time can be reduced further, if the instruction cycles can be overlapped.
That is, if the next instruction can be fetched and/or decoded during the current
instruction cycle. This overlapped operation mode is termed instruction execution
overlap or more commonly pipelining. Another obvious technique would be to
bring about the instruction cycles of more than one instruction simultaneously.
This is the parallel mode of instruction execution. We will now describe these
speed enhancement mechanisms.
3.2.1 Instruction Cycle Speedup
Recall that the processor cycle time (i.e., minor cycle time) depends on the
register transfer time on the processor bus structure. If the processor structure
consists of multiple buses, it is possible to perform several register transfers
simultaneously. This requires that the control unit produce the appropriate control
signals simultaneously. In a synchronous HCU, the processor cycle time is fixed
by the slowest register transfer. Thus even the fastest register transfer operation
consumes a complete processor cycle. In an asynchronous HCU, the completion
of one register transfer triggers the next; therefore, if properly designed, the
asynchronous HCU would be faster than the synchronous HCU. Since the design
and maintenance of an asynchronous HCU is difficult, the majority of the practical
processors have synchronous HCUs. An MCU is slower than an HCU since the
microinstruction execution time is the sum of processor
cycle time and the CROM access time. The HCU of ASC has the simplest
configuration possible. Each instruction cycle is divided into one or more phases
(states or major cycles), each phase consisting of four processor cycles (i.e.,
minor cycles). A majority of actual control units are synchronous control units that
are, in essence, enhanced versions of the ASC control unit. For example, it is not
necessary to use up a complete major cycle if the micro operations corresponding
to an instruction execution (or fetch or defer) can be completed in a part of the
major cycle. The only optimization performed in the ASC control unit was
to reduce the number of major cycles needed to execute certain instructions
(SHR, SHL) by not entering an execute cycle, since all the required micro
operations to implement those instructions could be completed in one major cycle.
Further optimization is possible. For example, the micro operations corresponding
to the execution of each branch instruction (BRU, BIP, BIN) could all be completed
in one minor cycle rather than in a complete major cycle as they are in the ASC
control unit. Thus, three minor cycles could be saved in the execution of branch
instructions by returning to the fetch cycle after the first minor cycle in the execute
cycle. When such enhancements are implemented, the state-change circuitry of
the control unit becomes more complex but the execution speed increases. Note
that in the case of an MCU, the concept of the major cycle is not present, and the
basic unit we work with is the minor cycle (i.e., processor cycle time + CROM
access time). Thus the lengths of microprograms corresponding to each
instruction are different. Each microinstruction is executed in one minor cycle, and
the microprograms do not include any idle cycles. In developing the microprogram
for ASC (Table 5.6) the microoperation sequences from the HCU design were
reorganized to make them as short as possible.
Section 5 provides the instruction cycle details of Intel 8080, to
illustrate the instruction cycle speedup concept. Although this is an obsolete
processor, it was selected for its simplicity. The more recent Intel processors
adopt these techniques very extensively.
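The cycle arithmetic above (an MCU minor cycle is the processor cycle time plus the CROM access time; an ASC-style HCU major cycle is four minor cycles) can be made concrete. All timing numbers below are hypothetical, chosen only for illustration:

```python
# Illustrative timings in nanoseconds (hypothetical, not from any real machine)
PROCESSOR_CYCLE = 10  # register transfer time on the processor bus
CROM_ACCESS = 4       # control ROM read time

hcu_minor_cycle = PROCESSOR_CYCLE
mcu_minor_cycle = PROCESSOR_CYCLE + CROM_ACCESS  # extra CROM access per step

def hcu_instruction_time(major_cycles, cycles_per_major=4):
    """ASC-style synchronous HCU: each major cycle consumes four
    minor cycles, even when some of them are idle."""
    return major_cycles * cycles_per_major * hcu_minor_cycle

def mcu_instruction_time(microinstructions):
    """MCU: one minor cycle per microinstruction, with no idle cycles."""
    return microinstructions * mcu_minor_cycle

# A fetch + execute sequence: 2 major cycles on the HCU vs. a
# 5-microinstruction microprogram on the MCU
print(hcu_instruction_time(2))  # 80 (ns)
print(mcu_instruction_time(5))  # 70 (ns)
```

With these (hypothetical) numbers the MCU wins despite its slower minor cycle, because it skips the idle cycles that pad out the fixed-length HCU major cycle; this is the effect the text attributes to removing idle cycles from the microprograms.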
3.2.2 Instruction Execution Overlap
Note that in INTEL 8080, for instructions such as ADD r, once the memory
operand is fetched into the CPU, addition is performed while the CPU is
fetching the next instruction in sequence from the memory. This overlap of
instruction fetch and execute phases increases the program execution speed. In
general, the control unit can be envisioned as a device with three
subfunctions: fetch, decode (or address computation), and execute. If the
control unit is designed in a modular form with one module for each of these
functions, it is possible to overlap the instruction-processing functions. The
overlapped processing is brought about by a pipeline. A pipeline is a structure
that, like an automobile assembly line, consists of several stations, each of which
is capable of performing a certain subtask. The work flows from one station to the
next. As the work leaves a station, the subsequent unit of the work is picked up by
that station. When the work leaves the last station in the pipeline the task is
complete. If the pipeline has N stations and the work stays at each station for T
seconds, the complete processing time for a task is (N × T) seconds. But since all
the N stations are working in an overlapped manner (on various tasks), the
pipeline outputs one completed task every T seconds (after an initial period in
which the pipeline is being filled).
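The N-station, T-per-station arithmetic can be checked with a short helper. The fill-period formula (N + n − 1) × T for n tasks is the standard consequence of the overlap just described:

```python
def pipeline_times(n_stations: int, t_per_station: float, n_tasks: int):
    """Latency for one task is N*T; after the fill period the pipeline
    completes one task every T, so n tasks take (N + n - 1) * T."""
    latency = n_stations * t_per_station
    total = (n_stations + n_tasks - 1) * t_per_station
    return latency, total

# 3-stage pipeline (fetch, decode, execute), T = 1 time slot, 10 instructions
latency, total = pipeline_times(3, 1, 10)
print(latency, total)  # 3 12
# Serial execution of the same 10 instructions would take 3 * 10 = 30 slots
```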
Figure 9.1 introduces the concept of an instruction processing pipeline. The
control unit has three modules. The processing sequence is shown in (b).
Any time after t2, the first module will be fetching instruction (I + 1), the
second module will be decoding instruction I, while the last module will be
executing instruction (I – 1). From t3 onwards, the pipeline flows full and the
throughput is one instruction per time slot.
For simplicity, we have assumed that each module in the above pipeline
consumes the same amount of processing time. If such equal time
partitioning of the processing task cannot be made, intermediate registers
to hold the results and flags that indicate the completion of one task and the
beginning of the next are needed. We have assumed that the instructions are
always executed in the sequence they appear in the program. This assumption is
valid as long as the program does not contain a branch instruction. When a branch
instruction
Figure: Pipelined instruction processing
is encountered, the next instruction is to be fetched from the target
address of the branch instruction. If it is a conditional branch, the target
address would not be known until the instruction reaches the execute stage. If a
branch is indeed taken, then the instructions following the branch
instruction that are already in the pipeline need to be discarded, and the
pipeline needs to be filled from the target address. Another approach would
be to stop fetching subsequent instructions into the pipeline, once the
branch instruction is fetched, until the target address is known. The former
approach is preferred for handling conditional branches since there is a
good chance that the branch might not occur; in that case, the pipeline
would flow full. There is an inefficiency in the pipeline flow only when a
branch occurs. For unconditional branches, the latter approach can be used. The
following mechanisms have been used to handle the conditional branch
inefficiency of pipeline structures.
3.3 Branch Prediction
It is sometimes possible to predict the target address of a conditional branch
based on the execution characteristics. For example, the target address of a
branch instruction controlling the iterations through a loop is most likely the
address of the first instruction in the loop, except for the last time through the
iteration. If this compiler-dependent characteristic is known, the most likely target
address can be used to fetch the subsequent instructions into the pipeline.
Branch History If the statistics on the previous branches for each instruction are
maintained, the most likely target address can be inferred.
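The branch-history idea can be sketched as a 1-bit predictor: remember each branch's last outcome and predict the same outcome again. This is a minimal illustrative model, not a description of any particular machine's predictor:

```python
def run_1bit_predictor(outcomes):
    """A minimal 1-bit branch-history predictor for a single branch:
    predict whatever the branch did last time. Returns the number of
    correct predictions."""
    prediction = True  # assume taken (the loop-back case) initially
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember only the most recent outcome
    return correct

# A loop branch that is taken 9 times, then falls through on exit
history = [True] * 9 + [False]
print(run_1bit_predictor(history), "of", len(history))  # 9 of 10
```

For loop-controlling branches like this one, the predictor mispredicts only on the final iteration, which matches the "most likely target is the top of the loop, except the last time through" observation above.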
Delayed Branching In the three-stage pipeline of Fig. 9.1, two instructions
following the branch instruction will have entered the pipeline by the time the
target address is determined. If, in the instruction stream, the branch instruction is
moved ahead two positions from where it logically appears in the program, the
execution of instructions in the pipeline can proceed while the target address is
determined.
All three stages of the instruction processing pipeline can potentially
access the main memory simultaneously, thereby increasing the traffic on
the memory bus. A memory system should provide for such simultaneous
access and high throughput by multiport, interleaving, and banking
schemes.
In order to implement the control unit as a pipeline, each stage should be
designed to operate independently, performing its own function while sharing
resources such as the processor bus structure and the main memory system.
Such designs become very complex. Refer to the books listed in the
Reference section for further details. The pipeline concept is now used very
extensively in all modern processors. It is typical for processors today to have
four or five stages in their pipelines. As hardware technology progresses,
processors with deeper pipelines (i.e., pipelines with larger numbers of stages)
have been introduced. These processors belong to the so-called superpipelined
processor class; subsequent chapters provide some examples.
3.3.1 Parallel Instruction Execution
The control unit of these processors fetches instructions from the same instruction
stream (i.e., program), decodes them and delivers them to the
appropriate execution unit. Thus, the execution units would be working in
parallel, each executing its own instruction. The control unit must now be
capable of creating these parallel streams of execution and synchronizing the two
streams appropriately, based on the precedence constraints imposed by the
program. That is, the result of the computation must be the same, whether the
program is executed serially or in parallel. This class of
architectures, in which more than one instruction is processed
simultaneously, is called a superscalar architecture.
3.3.2
Instruction Buffer and Cache
The instruction buffer schemes used by processors such as INTEL 8086 and CDC
6600, and the instruction cache schemes used by processors such as MC68020
also bring about instruction processing overlap, although at the complete
instruction level. That is, the operation of fetching instructions into the buffer is
overlapped with the operation of retrieving and executing instructions that are in
the buffer.
4. HARDWIRED CONTROL UNITS
All the speedup techniques described in the previous section have been
adopted by HCUs of practical machines. As mentioned earlier, the main
advantage of the HCU is its speed, while the disadvantage is its inflexibility.
Although asynchronous HCUs offer a higher speed capability than
synchronous HCUs, the majority of practical machines have synchronous
HCUs because they have the simpler design of the two. In the current VLSI
era, complete processing systems are fabricated on a single IC. Because
the control unit is the most complex unit, it occupies a large percentage of
the chip ‘‘real estate.’’ Its complexity increases as the number of
instructions in the instruction set of the machine increases. Studies have
shown that, in practice, only a few instructions are used by programmers,
although the machine offers a large instruction set. One way, then, of
reducing the complexity of the control unit is to have a small instruction set.
Such machines with small sets of powerful instructions are
called reduced instruction set computers (RISC). Section 9.4 describes
RISCs further.
4.1 MICROPROGRAMMED CONTROL UNITS
The execution time for an instruction is proportional to the number of
microoperations required and hence the length of microprogram sequence
for the instruction. Since a microprogrammed control unit (MCU) starts
fetching the next instruction once the last microoperation of the current
instruction's microprogram is executed, the MCU can be treated as an
asynchronous control unit. An MCU is slower than the hardwired control
unit because of the addition of CROM access time to the register transfer
time. But it is more flexible than hardwired CU and requires minimum
changes in the hardware if the instruction set is to be modified or
enhanced.
The CROM word size is one of the design parameters of an MCU. Although
the price of ROMs is decreasing, the cost of data path circuits required
within the control unit increases as the CROM word size increases.
Therefore, the word size should be reduced to reduce the cost of the MCU.
We will now examine the microinstruction formats used by practical
machines with respect to their cost effectiveness.
Figure: Pipelining in an MCU
The most common format for a microinstruction is shown in Fig.
The ‘‘instruction’’ portion of the microinstruction is used in generating the
control signals, and the ‘‘address’’ portion indicates the address of the next
microinstruction. Execution of such a microinstruction corresponds to the
generation of control signals and transferring the address portion to the
μMAR to retrieve the next microinstruction. The advantage of this format is
that very little external circuitry is needed to generate the next
microinstruction address, while the disadvantage is that conditional
branches in the microprogram cannot be easily coded. The format shown in
(b) allows for conditional branching. It is now assumed that when the
condition is not satisfied, the μMAR is simply incremented to point to the
next microinstruction in sequence. However, this requires additional
μMAR circuitry. The microinstruction format shown in (c) explicitly codes
the jump addresses corresponding to both the outcomes of the test
condition, thus reducing the external μMAR circuitry. The CROM word size
will be large in all the above formats since they contain one or more
address fields. It is possible to reduce the CROM word size if the address
representation can be made implicit. The format shown in (d) is similar to
the one used by the MCU of ASC.
In the horizontal (or unpacked) microinstruction, each bit corresponds to
one control signal; since no decoding is needed, the control signals are
available immediately, making
the MCU fast. The disadvantage is that this requires larger CROM words.
Also, instruction encoding is cumbersome because a thorough familiarity
with the processor hardware structure is needed to prevent the generation
of control signals that cause conflicting operations in the processor
hardware. In the vertical (or packed) microinstruction, the instruction is
divided into several fields, each field corresponding to either a resource or
a function in the processing hardware. (In the design of ASC MCU, each
field corresponded to a resource such as ALU, BUS1, etc.). Vertical
microinstruction reduces the CROM word size; but the decoders needed to
generate control signals from each field of the instruction contribute to the
delays in control signals. Encoding for vertical microinstruction is easier
than that for horizontal microinstruction because of the former’s
function/resource partitioning. In the foregoing discussion, we have
assumed that all the control signals implied by the microinstruction are
generated simultaneously and the next clock pulse fetches a new
microinstruction. This type of microinstruction encoding is called
monophase encoding. It is also possible to associate each field or each bit
in the microinstruction with a time value. That is, the execution of each
microinstruction now requires more than one clock pulse. This type of
microinstruction encoding is called polyphase encoding. Figure (a) shows
the monophase encoding, where all the control signals are generated
simultaneously at the clock pulse. Figure (b) shows an n-phase
encoding where microoperations M1 through Mn are associated with
time values t1 through tn, respectively.
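The horizontal-versus-vertical word-size trade-off discussed above can be sketched as follows; the grouping of 24 control signals into four resource fields is an assumption chosen only for illustration:

```python
import math

# Hypothetical grouping of 24 control signals into resource fields
# (e.g. ALU, BUS1, BUS2, registers), as in a vertical (packed) format.
fields = [8, 6, 6, 4]  # number of control signals per field (assumed)

# Horizontal (unpacked): one bit per control signal, no decoding needed.
horizontal_width = sum(fields)

# Vertical (packed): each field is binary-encoded, reserving one extra
# code per field for "no signal active"; decoders add delay but the
# CROM word shrinks.
vertical_width = sum(math.ceil(math.log2(n + 1)) for n in fields)

print(horizontal_width, vertical_width)  # 24 vs 13 bits
```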
5. REDUCED INSTRUCTION SET COMPUTERS
The acronym RISC (pronounced as risk), for reduced instruction set
computer, represents a CPU design strategy emphasizing the insight that
simplified instructions that "do less" may still provide for higher performance
if this simplicity can be utilized to make instructions execute very quickly.
Many proposals for a "precise" definition have been attempted, and the
term is being slowly replaced by the more descriptive load-store
architecture. Well-known RISC families include Alpha, ARC, ARM, AVR,
MIPS, PA-RISC, Power Architecture (including PowerPC), SPARC, and SuperH.
The concept was developed
by John Cocke of IBM Research during 1974. His argument was based
upon the notion that a computer uses only 20% of the instructions, making
the other 80% superfluous to requirement. A processor based upon this
concept would use few instructions, which would require fewer transistors,
and make them cheaper to manufacture. By reducing the number of
transistors and instructions to only those most frequently used, the
computer would get more done in a shorter amount of time. The term
'RISC' (short for Reduced Instruction Set Computer) was later coined by
David Patterson, a professor at the University of California, Berkeley.
The RISC concept was first used to simplify the design of IBM's 801
minicomputer, and was later used in the IBM RISC System/6000 and Sun
Microsystems' SPARC microprocessors. A parallel research project at
Stanford led to the founding of MIPS Technologies, which developed the
MIPS RISC microprocessor (Microprocessor without Interlocked Pipeline
Stages); MIPS Technologies was later bought by Silicon Graphics. The MIPS
processor has continued development, remaining a popular choice in the
embedded and low-end markets. At one time, it was suspected the Amiga MCC
would use this CPU to reduce the cost of manufacture. However, the consumer
desktop market is limited, and only the PowerPC processor remains popular
among the RISC alternatives. This is mainly due to Apple's continued use
of the series for its PowerMac range.
Being an old idea, some aspects attributed to the first RISC-labeled
designs (around 1975) include the observations that the memory restricted
compilers of the time were often unable to take advantage of features
intended to facilitate coding, and that complex addressing inherently takes
many cycles to perform. It was argued that such functions would better be
performed by sequences of simpler instructions, if this could yield
implementations simple enough to cope with really high frequencies, and
small enough to leave room for many registers[2], factoring out slow
memory accesses. Uniform, fixed-length instructions with arithmetic
restricted to registers were chosen to ease instruction pipelining in these
simple designs, with special load-store instructions accessing memory.
There is still considerable controversy among experts about the ultimate
value of RISC architectures. Its proponents argue that RISC machines are
both cheaper and faster, and are therefore the machines of the future.
Skeptics note that by making the hardware simpler, RISC architectures put
a greater burden on the software. They argue that this is not worth the
trouble because conventional microprocessors are becoming increasingly
fast and cheap anyway.
To some extent, the argument is becoming moot because CISC and RISC
implementations are becoming more and more alike. Many of today's RISC
chips support as many instructions as yesterday's CISC chips. And today's
CISC chips use many techniques formerly associated with RISC chips.
RISC (reduced instruction set computer) is a microprocessor that is
designed to perform a smaller number of types of computer instructions so
that it can operate at a higher speed (perform more millions of instructions
per second, or MIPS). Since each instruction type that a computer must
perform requires additional transistors and circuitry, a larger list or set of
computer instructions tends to make the microprocessor more complicated
and slower in operation.
John Cocke of IBM Research in Yorktown, New York, originated the RISC
concept in 1974 by showing that about 20% of the instructions in a computer
did 80% of the work. The first computer to benefit from this insight was
IBM's 801 minicomputer, around 1980. Later, IBM's RISC System/6000 made use of the
idea. The term itself (RISC) is credited to David Patterson, a professor at the
University of California, Berkeley. The concept was used in Sun
Microsystems' SPARC microprocessors and led to the founding of what is
now MIPS Technologies, part of Silicon Graphics. A number of current
microchips now use the RISC concept.
The RISC concept has led to a more thoughtful design of the
microprocessor. Among design considerations are how well an instruction
can be mapped to the clock speed of the microprocessor (ideally, an
instruction can be performed in one clock cycle); how "simple" an
architecture is required; and how much work can be done by the microchip
itself without resorting to software help.
Besides performance improvement, some advantages of RISC and related
design improvements are:
 A new microprocessor can be developed and tested more quickly if
one of its aims is to be less complicated.
 Operating system and application programmers who use the
microprocessor's instructions will find it easier to develop code with a
smaller instruction set.
 The simplicity of RISC allows more freedom to choose how to use
the space on a microprocessor.
 Higher-level language compilers produce more efficient code than
formerly because they have always tended to use the smaller set of
instructions to be found in a RISC computer.
6. RISC Vs. CISC
The argument over which concept is better has been repeated
over the past few years. Macintosh owners have elevated the
argument to a pseudo-religious level in support of their RISC-based
God (the PowerPC sits next to the Steve Jobs statue on
every Mac altar). Both positions have been blurred by the
argument that we have entered a Post-RISC stage.
RISC: For and Against
RISC supporters argue that it is the way of the future, producing
faster and cheaper processors - an Apple Mac G3 offers a
significant performance advantage over its Intel equivalent.
Instructions are executed over 4x faster, providing a significant
performance boost! However, RISC chips require more lines of
code to produce the same results and are increasingly complex.
This will increase the size of the application and the amount of
overhead required. RISC developers have also failed to remain in
competition with CISC alternatives. The Macintosh market has
been damaged by several problems that have affected the
availability of 500MHz+ PowerPC chips. In contrast, the PC
compatible market has stormed ahead and has broken the 1GHz
barrier. Despite the speed advantages of the RISC processor, it
cannot compete with a CISC CPU that boasts twice the clock
speed.
The simplest way to examine the advantages and disadvantages
of RISC architecture is by contrasting it with its predecessor:
CISC (Complex Instruction Set Computer) architecture.
Multiplying Two Numbers in Memory
Consider the storage scheme for a generic computer. The main
memory is divided into locations numbered from (row) 1: (column) 1
to (row) 6: (column) 4. The
execution unit is responsible for carrying out all computations.
However, the execution unit can only operate on data that has
been loaded into one of the six registers (A, B, C, D, E, or F).
Let's say we want to find the product of two numbers - one stored
in location 2:3 and another stored in location 5:2 - and then store
the product back in the location 2:3.
The advantages of RISC over CISC today are these:
 RISC processors are much simpler to build, which in turn results
in the following advantages:
o easier to build, i.e. you can use already existing production
facilities
o much less expensive; just compare the price of an XScale with
that of a Pentium III at 1 GHz
o less power consumption, which again gives two advantages:
 much longer use of battery-driven devices
 no need for cooling of the device, which again gives two
advantages:
 smaller design of the whole device
 no noise
 RISC processors are much simpler to program, which helps not only the
assembler programmer but the compiler designer, too. You'll hardly find
any compiler which uses all the functions of a Pentium III optimally.
The CISC Approach
The primary goal of CISC architecture is to complete a task in as
few lines of assembly as possible. This is achieved by building
processor hardware that is capable of understanding and
executing a series of operations. For this particular task, a CISC
processor would come prepared with a specific instruction (we'll
call it "MULT"). When executed, this instruction loads the two
values into separate registers, multiplies the operands in the
execution unit, and then stores the product in the appropriate
register. Thus, the entire task of multiplying two numbers can be
completed with one instruction:
MULT 2:3, 5:2
MULT is what is known as a "complex instruction." It operates
directly on the computer's memory banks and does not require
the programmer to explicitly call any loading or storing functions.
It closely resembles a command in a higher level language. For
instance, if we let "a" represent the value of 2:3 and "b"
represent the value of 5:2, then this command is identical to the
C statement "a = a * b."
One of the primary advantages of this system is that the compiler
has to do very little work to translate a high-level language
statement into assembly. Because the length of the code is
relatively short, very little RAM is required to store instructions.
The emphasis is put on building complex instructions directly into
the hardware.
The RISC Approach
RISC processors only use simple instructions that can be
executed within one clock cycle. Thus, the "MULT" command
described above could be divided into three separate commands:
"LOAD," which moves data from the memory bank to a register,
"PROD," which finds the product of two operands located within
the registers, and "STORE," which moves data from a register to
the memory banks. In order to perform the exact series of steps
described in the CISC approach, a programmer would need to
code four lines of assembly:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
At first, this may seem like a much less efficient way of
completing the operation. Because there are more lines of code,
more RAM is needed to store the assembly level instructions. The
compiler must also perform more work to convert a high-level
language statement into code of this form.
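The four-line RISC sequence can be mimicked with a tiny register-transfer sketch; the sample memory contents (6 and 7) and the modelling of each instruction as a Python function are assumptions for illustration only:

```python
# Minimal simulator of the generic machine above: a memory addressed
# as "row:col" strings and six registers A-F. Instruction names
# (LOAD, PROD, STORE) follow the text; the operand values are assumed.

memory = {"2:3": 6, "5:2": 7}       # sample operands (assumed values)
regs = dict.fromkeys("ABCDEF", 0)   # six general-purpose registers

def LOAD(reg, addr):  regs[reg] = memory[addr]            # memory -> register
def PROD(dst, src):   regs[dst] = regs[dst] * regs[src]   # register * register
def STORE(addr, reg): memory[addr] = regs[reg]            # register -> memory

# The four-line RISC sequence from the text:
LOAD("A", "2:3")
LOAD("B", "5:2")
PROD("A", "B")
STORE("2:3", "A")

print(memory["2:3"])  # 42: the product replaces the first operand
```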
CISC                                        RISC
Emphasis on hardware                        Emphasis on software
Includes multi-clock                        Single-clock,
complex instructions                        reduced instructions only
Memory-to-memory: "LOAD" and "STORE"        Register-to-register: "LOAD" and "STORE"
incorporated in instructions                are independent instructions
Small code sizes,                           Large code sizes,
high cycles per second                      low cycles per second
Transistors used for storing                Spends more transistors
complex instructions                        on memory registers
However, the RISC strategy also brings some very important
advantages. Because each instruction requires only one clock
cycle to execute, the entire program will execute in approximately
the same amount of time as the multi-cycle "MULT" command.
These RISC "reduced instructions" require fewer transistors of
hardware space than the complex instructions, leaving more room
for general-purpose registers. Because all of the instructions
execute in a uniform amount of time (i.e. one clock), pipelining is
possible.
Separating the "LOAD" and "STORE" instructions actually reduces
the amount of work that the computer must perform. After a
CISC-style "MULT" command is executed, the processor
automatically erases the registers. If one of the operands needs
to be used for another computation, the processor must re-load
the data from the memory bank into a register. In RISC, the
operand will remain in the register until another value is loaded in
its place.
The Performance Equation
The following equation is commonly used for expressing a
computer's performance ability:

    time/program = (instructions/program) x (cycles/instruction) x (time/cycle)
The CISC approach attempts to minimize the number of
instructions per program, sacrificing the number of cycles per
instruction. RISC does the opposite, reducing the cycles per
instruction at the cost of the number of instructions per program.
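Numerically, the trade-off reads as follows, using the standard performance equation time/program = instructions × CPI × cycle time; all counts below are hypothetical, chosen only to make the two strategies comparable:

```python
# Performance equation:
#   time/program = (instructions/program) x (cycles/instruction) x (time/cycle)
# The numbers below are purely illustrative.

def exec_time(instructions, cpi, cycle_time_ns):
    return instructions * cpi * cycle_time_ns

# CISC-style: fewer instructions, but more cycles per instruction.
cisc = exec_time(instructions=1_000, cpi=4.0, cycle_time_ns=2.0)
# RISC-style: more instructions, one cycle each, same clock.
risc = exec_time(instructions=2_500, cpi=1.0, cycle_time_ns=2.0)

print(cisc, risc)  # 8000.0 ns vs 5000.0 ns
```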
RISC Roadblocks
Despite the advantages of RISC based processing, RISC chips
took over a decade to gain a foothold in the commercial world.
This was largely due to a lack of software support.
Although Apple's Power Macintosh line featured RISC-based chips
and Windows NT was RISC compatible, Windows 3.1 and
Windows 95 were designed with CISC processors in mind. Many
companies were unwilling to take a chance with the emerging
RISC technology. Without commercial interest, processor
developers were unable to manufacture RISC chips in large
enough volumes to make their price competitive.
Another major setback was the presence of Intel. Although their
CISC chips were becoming increasingly unwieldy and difficult to
develop, Intel had the resources to plow through development
and produce powerful processors. Although RISC chips might
surpass Intel's efforts in specific areas, the differences were not
great enough to persuade buyers to change technologies.
The Overall RISC Advantage
Today, the Intel x86 is arguably the only chip which retains CISC
architecture. This is primarily due to advancements in other areas
of computer technology. The price of RAM has decreased
dramatically. In 1977, 1MB of DRAM cost about $5,000. By 1994,
the same amount of memory cost only $6 (when adjusted for
inflation). Compiler technology has also become more
sophisticated, so that the RISC use of RAM and emphasis on
software has become ideal.
7. MIPS
History
The MIPS processor was developed as part of a VLSI research program at
Stanford University in the early 80s. Professor John Hennessy, now the
University's President, started the development of MIPS with a
brainstorming class for graduate students. The readings and idea sessions
helped launch the development of the processor which became one of the
first RISC processors, with IBM and Berkeley developing processors at
around the same time.
MIPS Architecture
The Stanford research group had a strong background in compilers, which
led them to develop a processor whose architecture would represent the
lowering of the compiler to the hardware level, as opposed to the raising of
hardware to the software level, which had been a long running design
philosophy in the hardware industry.
Thus, the MIPS processor implemented a smaller, simpler instruction set.
Each of the instructions included in the chip design ran in a single clock
cycle. The processor used a technique called pipelining to more efficiently
process instructions.
MIPS used 32 registers, each 32 bits wide (a bit pattern of this size is
referred to as a word).
Instruction Set
The MIPS instruction set consists of about 111 total instructions, each
represented in 32 bits. An example of a MIPS instruction is below:
add $r12, $r7, $r8
Above is the assembly representation of a MIPS
addition instruction. The instruction tells the processor to compute the sum
of the values in registers 7 and 8 and store the result in register 12. The
dollar signs are used to indicate an operation on a register. The 32-bit
binary representation is divided into 6 fields. The processor identifies
the type of instruction by the binary digits in the first and last fields;
in this case, the processor recognizes that this instruction is an addition
from the zero in its first (opcode) field and the 20 (hexadecimal) in its
last (function) field.
The two operands are represented in the second and third fields, and the
desired result location is given in the fourth field. The fifth field
represents the shift amount, something that is not used in an addition
operation.
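The field layout just described can be sketched by packing the register numbers (7, 8, and 12), a zero opcode, and the 0x20 function code into one 32-bit word; the helper function name is an illustration, not part of any MIPS toolchain:

```python
# Encoding of the R-type instruction "add $r12, $r7, $r8" into its six
# fields: opcode (6 bits), rs (5), rt (5), rd (5), shamt (5), funct (6).
# For add: opcode = 0 and funct = 0x20 (i.e. "20" in hexadecimal).

def encode_r_type(opcode, rs, rt, rd, shamt, funct):
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

word = encode_r_type(opcode=0, rs=7, rt=8, rd=12, shamt=0, funct=0x20)
print(f"{word:032b}")  # 00000000111010000110000000100000
print(hex(word))       # 0xe86020
```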
The instruction set consists of a variety of basic instructions, including:
 21 arithmetic instructions (+, -, *, /, %)
 8 logic instructions (&, |, ~)
 8 bit manipulation instructions
 12 comparison instructions (>, <, =, >=, <=, ¬)
 25 branch/jump instructions
 15 load instructions
 10 store instructions
 8 move instructions
 4 miscellaneous instructions
MIPS Today
MIPS Computer Systems, Inc. was founded in 1984 upon the Stanford
research from which the first MIPS chip resulted. The company was
purchased by Silicon Graphics, Inc. in 1992, and was spun off as MIPS
Technologies, Inc. in 1998. Today, MIPS powers many consumer
electronics and other devices.
8. Pipelining
How Pipelining Works
Pipelining, a standard feature in RISC processors, is much like an assembly
line. Because the processor works on different steps of the instruction at
the same time, more instructions can be executed in a shorter period of
time.
A useful method of demonstrating this is the laundry analogy. Let's say that
there are four loads of dirty laundry that need to be washed, dried, and
folded. We could put the first load in the washer for 30 minutes, dry it for 40
minutes, and then take 20 minutes to fold the clothes. Then pick up the
second load and wash, dry, and fold, and repeat for the third and fourth
loads. Supposing we started at 6 PM and worked as efficiently as possible,
we would still be doing laundry until midnight.
However, a smarter approach to the problem would be to put the second
load of dirty laundry into the washer after the first was already clean and
whirling happily in the dryer. Then, while the first load was being folded, the
second load would dry, and a third load could be added to the pipeline of
laundry. Using this method, the laundry would be finished by 9:30.
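The laundry arithmetic above can be checked directly; the formula used for the pipelined case is specific to this example, where the dryer (the longest stage) sets the rhythm:

```python
# Completion times for the laundry analogy: wash 30 min, dry 40 min,
# fold 20 min, four loads, starting at 6:00 PM.

stages = [30, 40, 20]   # wash, dry, fold (minutes)
loads = 4

# One load at a time: every load takes the full 90 minutes.
sequential = loads * sum(stages)                         # 360 min -> midnight

# Pipelined: after the first wash, the dryer (longest stage) is the
# bottleneck; the last load still needs its 20-minute fold at the end.
pipelined = stages[0] + loads * max(stages) + stages[2]  # 210 min -> 9:30 PM

print(sequential, pipelined)
```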
RISC Pipelines
A RISC processor pipeline operates in much the same way, although the
stages in the pipeline are different. While different processors have different
numbers of steps, they are basically variations of these five, used in the
MIPS R3000 processor:
1. fetch instructions from memory
2. read registers and decode the instruction
3. execute the instruction or calculate an address
4. access an operand in data memory
5. write the result into a register
If you glance back at the diagram of the laundry pipeline, you'll notice that
although the washer finishes in half an hour, the dryer takes an extra ten
minutes, and thus the wet clothes must wait ten minutes for the dryer to
free up. Thus, the length of the pipeline is dependent on the length of the
longest step. Because RISC instructions are simpler than those used in
pre-RISC processors (now called CISC, or Complex Instruction Set
Computer), they are more conducive to pipelining. While CISC instructions
varied in length, RISC instructions are all the same length and can be
fetched in a single operation. Ideally, each of the stages in a RISC
processor pipeline should take 1 clock cycle so that the processor finishes
an instruction each clock cycle and averages one cycle per instruction
(CPI).
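The ideal case can be expressed as a small formula: a k-stage pipeline finishes n instructions in k + (n - 1) cycles, so the average CPI approaches one for long instruction streams:

```python
# Ideal cycle count for a k-stage pipeline running n instructions:
# the first instruction takes k cycles to drain through the pipeline,
# and each later instruction finishes one cycle after its predecessor.

def pipeline_cycles(n_instructions, n_stages=5):
    return n_stages + (n_instructions - 1)

n = 1000
cycles = pipeline_cycles(n)   # 1004 cycles for 1000 instructions
print(cycles, cycles / n)     # average CPI approaches 1
```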
Pipeline Problems
In practice, however, RISC processors operate at more than one cycle per
instruction. The processor might occasionally stall as a result of data
dependencies and branch instructions.
A data dependency occurs when an instruction depends on the results of a
previous instruction. A particular instruction might need data in a register
which has not yet been stored since that is the job of a preceding
instruction which has not yet reached that step in the pipeline.
For example:
add $r3, $r2, $r1
add $r5, $r4, $r3
... more instructions that are independent of the first two ...
In this example, the first instruction tells the processor to add the contents
of registers r1 and r2 and store the result in register r3. The second
instructs it to add r3 and r4 and store the sum in r5. We place this set of
instructions in a pipeline. When the second instruction is in the second
stage, the processor will be attempting to read r3 and r4 from the registers.
Remember, though, that the first instruction is just one step ahead of the
second, so the contents of r1 and r2 are being added, but the result has not
yet been written into register r3. The second instruction therefore cannot
read from the register r3 because it hasn't been written yet and must wait
until the data it needs is stored. Consequently, the pipeline is stalled and a
number of empty instructions (known as bubbles) go into the pipeline. Data
dependency affects long pipelines more than shorter ones since it takes a
longer period of time for an instruction to reach the final register-writing
stage of a long pipeline.
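The dependency in the example above can be detected mechanically, the way a pipeline interlock (or a reordering compiler) would; the encoding of each instruction as a (destination, sources) tuple is an assumption for illustration:

```python
# Detecting the read-after-write (RAW) dependency from the example:
# an instruction must stall if it reads a register written by the
# immediately preceding instruction, which has not reached the
# register-writing stage yet.

program = [
    ("r3", ("r2", "r1")),   # add $r3, $r2, $r1
    ("r5", ("r4", "r3")),   # add $r5, $r4, $r3  <- reads r3 too early
]

def raw_hazards(instrs):
    """Return indices of instructions that depend on their predecessor."""
    hazards = []
    for i in range(1, len(instrs)):
        prev_dest = instrs[i - 1][0]
        if prev_dest in instrs[i][1]:
            hazards.append(i)
    return hazards

print(raw_hazards(program))  # [1]: the second instruction must wait
```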
MIPS' solution to this problem is code reordering. If, as in the example
above, the following instructions have nothing to do with the first two, the
code could be rearranged so that those instructions are executed in
between the two dependent instructions and the pipeline could flow
efficiently. The task of code reordering is generally left to the compiler,
which recognizes data dependencies and attempts to minimize
performance stalls.
Branch instructions are those that tell the processor to make a decision
about what the next instruction to be executed should be based on the
results of another instruction. Branch instructions can be troublesome in a
pipeline if a branch is conditional on the results of an instruction which has
not yet finished its path through the pipeline.
For example:
Loop : add $r3, $r2, $r1
sub $r6, $r5, $r4
beq $r3, $r6, Loop
The example above instructs the processor to add r1 and r2 and put the
result in r3, then subtract r4 from r5, storing the difference in r6. In the third
instruction, beq stands for branch if equal. If the contents of r3 and r6 are
equal, the processor should execute the instruction labeled "Loop."
Otherwise, it should continue to the next instruction. In this example, the
processor cannot make a decision about which branch to take because
neither the value of r3 nor that of r6 has been written into the registers yet.
The processor could stall, but a more sophisticated method of dealing with
branch instructions is branch prediction. The processor makes a guess
about which path to take - if the guess is wrong, anything written into the
registers must be cleared, and the pipeline must be started again with the
correct instruction. Some methods of branch prediction depend on
stereotypical behavior. Branches pointing backward are taken about 90% of
the time since backward-pointing branches are often found at the bottom of
loops. On the other hand, branches pointing forward are only taken
approximately 50% of the time. Thus, it would be logical for processors to
always follow the branch when it points backward, but not when it points
forward. Other methods of branch prediction are less static: processors that
use dynamic prediction keep a history for each branch and use it to predict
future branches. These processors are correct in their predictions 90% of
the time.
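A minimal sketch of such a dynamic predictor, assuming a 2-bit saturating counter per branch (a common scheme, though the text does not name a specific one); the branch trace here is invented for illustration:

```python
# Dynamic branch prediction with a 2-bit saturating counter per branch:
# counter states 0-1 predict "not taken", states 2-3 predict "taken".
# Two consecutive mispredictions are needed to flip the prediction.

counters = {}  # branch label -> 2-bit counter (history table, assumed)

def predict(branch):
    return counters.get(branch, 0) >= 2  # True = predict taken

def update(branch, taken):
    c = counters.get(branch, 0)
    counters[branch] = min(c + 1, 3) if taken else max(c - 1, 0)

# A backward loop branch taken 9 times, then falling through once:
correct = 0
for outcome in [True] * 9 + [False]:
    if predict("Loop") == outcome:
        correct += 1
    update("Loop", outcome)

print(correct, "of 10 predicted correctly")
```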
Still other processors forgo the entire branch prediction ordeal. The RISC
System/6000 fetches and starts decoding instructions from both sides of
the branch. When it determines which branch should be followed, it then
sends the correct instructions down the pipeline to be executed.
Pipelining Developments
In order to make processors even faster, various methods of optimizing
pipelines have been devised.
Superpipelining refers to dividing the pipeline into more steps. The more
pipe stages there are, the faster the pipeline is because each stage is then
shorter. Ideally, a pipeline with five stages should be five times faster than a
non-pipelined processor (or rather, a pipeline with one stage). The
instructions are executed at the speed at which each stage is completed,
and each stage takes one fifth of the amount of time that the non-pipelined
instruction takes. Thus, a processor with an 8-step pipeline (the MIPS
R4000) will be even faster than its 5-step counterpart. The MIPS R4000
chops its pipeline into more pieces by dividing some steps into two.
Instruction fetching, for example, is now done in two stages rather than
one. The stages are as shown:
1. Instruction Fetch (First Half)
2. Instruction Fetch (Second Half)
3. Register Fetch
4. Instruction Execute
5. Data Cache Access (First Half)
6. Data Cache Access (Second Half)
7. Tag Check
8. Write Back
Superscalar pipelining involves multiple pipelines in parallel. Internal
components of the processor are replicated so it can launch multiple
instructions in some or all of its pipeline stages. The RISC System/6000
has a forked pipeline with different paths for floating-point and integer
instructions. If there is a mixture of both types in a program, the processor
can keep both forks running simultaneously. Both types of instructions
share two initial stages (Instruction Fetch and Instruction Dispatch) before
they fork. Often, however, superscalar pipelining refers to multiple copies of
all pipeline stages (In terms of laundry, this would mean four washers, four
dryers, and four people who fold clothes). Many of today's machines
attempt to find two to six instructions that they can execute in every pipeline
stage. If some of the instructions are dependent, however, only the first
instruction or instructions are issued.
Dynamic pipelines have the capability to schedule around stalls. A dynamic
pipeline is divided into three units: the instruction fetch and decode unit, five
to ten execute or functional units, and a commit unit. Each execute unit has
reservation stations, which act as buffers and hold the operands and
operations.
While the functional units have the freedom to execute out of order, the
instruction fetch/decode and commit units must operate in-order to maintain
simple pipeline behavior. When the instruction is executed and the result is
calculated, the commit unit decides when it is safe to store the result. If a
stall occurs, the processor can schedule other instructions to be executed
until the stall is resolved. This, coupled with the efficiency of multiple units
executing instructions simultaneously, makes a dynamic pipeline an
attractive alternative.
9. Pipelining in CPU Design: Superscalar Machines
A superscalar CPU architecture implements a form of parallelism called
instruction-level parallelism within a single processor. It thereby allows
faster CPU throughput than would otherwise be possible at the same clock
rate. A superscalar processor executes more than one instruction during a
clock cycle by simultaneously dispatching multiple instructions to redundant
functional units on the processor. Each functional unit is not a separate
CPU core but an execution resource within a single CPU such as an
arithmetic logic unit, a bit shifter, or a multiplier.
While a superscalar CPU is typically also pipelined, they are two different
performance enhancement techniques. It is theoretically possible to have a
non-pipelined superscalar CPU or a pipelined non-superscalar CPU.
The superscalar technique is traditionally associated with several
identifying characteristics. Note these are applied within a given CPU core.
 Instructions are issued from a sequential instruction stream
 CPU hardware dynamically checks for data dependencies between
instructions at run time (versus software checking at compile time)
 The CPU accepts multiple instructions per clock cycle
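These characteristics can be sketched with a toy dual-issue dispatcher; the issue width, the (destination, sources) instruction encoding, and the dependency rule are all assumptions chosen only to illustrate run-time dependency checking:

```python
# Toy superscalar issue model: up to `width` instructions issue per
# cycle from a sequential stream, but an instruction that depends on
# one issued in the same cycle must wait for the next cycle
# (hardware dependency checking at run time).

def issue_cycles(instrs, width=2):
    """instrs: list of (dest, sources). Returns number of issue cycles."""
    cycles, i = 0, 0
    while i < len(instrs):
        issued = [instrs[i]]
        i += 1
        while i < len(instrs) and len(issued) < width:
            dests = {d for d, _ in issued}
            if dests & set(instrs[i][1]):
                break  # depends on an instruction issued this cycle
            issued.append(instrs[i])
            i += 1
        cycles += 1
    return cycles

prog = [("r1", ()), ("r2", ()), ("r3", ("r1", "r2")), ("r4", ("r3",))]
print(issue_cycles(prog))  # 3 cycles: {0,1}, then {2}, then {3}
```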
History
Seymour Cray's CDC 6600 from 1965 is often mentioned as the first
superscalar design. The Intel i960CA (1988) and the AMD 29000-series
29050 (1990) microprocessors were the first commercial single chip
superscalar microprocessors. RISC CPUs like these brought the
superscalar concept to micro computers because the RISC design results
in a simple core, allowing straightforward instruction dispatch and the
inclusion of multiple functional units (such as ALUs) on a single CPU in the
constrained design rules of the time. This was the reason that RISC
designs were faster than CISC designs through the 1980s and into the
1990s.
Except for CPUs used in low-power applications, embedded systems, and
battery-powered devices, essentially all general-purpose CPUs developed
since about 1998 are superscalar.
The Pentium was the first superscalar x86 processor; the Nx586, Pentium
Pro, and AMD K5 were among the first designs which decode x86
instructions asynchronously into dynamic microcode-like micro-op
sequences prior to actual execution on a superscalar microarchitecture. This
opened the way for dynamic scheduling of buffered partial instructions and
enabled more parallelism to be extracted compared to the more rigid
methods used in the simpler Pentium; it also simplified speculative
execution and allowed higher clock frequencies compared to designs such
as the advanced Cyrix 6x86.
From scalar to superscalar
The simplest processors are scalar processors. Each instruction executed
by a scalar processor typically manipulates one or two data items at a time.
By contrast, each instruction executed by a vector processor operates
simultaneously on many data items. An analogy is the difference between
scalar and vector arithmetic. A superscalar processor is sort of a mixture of
the two. Each instruction processes one data item, but there are multiple
redundant functional units within each CPU thus multiple instructions can
be processing separate data items concurrently.
Superscalar CPU design emphasizes improving the instruction dispatcher
accuracy, and allowing it to keep the multiple functional units in use at all
times. This has become increasingly important as the number of units has
increased. While early superscalar CPUs would have two ALUs and a
single FPU, a modern design such as the PowerPC 970 includes four
ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at
keeping all of these units fed with instructions, the performance of the
system will suffer.
A superscalar processor usually sustains an execution rate in excess of
one instruction per machine cycle. But merely processing multiple
instructions concurrently does not make an architecture superscalar, since
pipelined, multiprocessor or multi-core architectures also achieve that, but
with different methods.
In a superscalar CPU the dispatcher reads instructions from memory and
decides which ones can be run in parallel, dispatching them to redundant
functional units contained inside a single CPU. Therefore a superscalar
processor can be envisioned having multiple parallel pipelines, each of
which is processing instructions simultaneously from a single instruction
thread.
Limitations
Available performance improvement from superscalar techniques is limited
by two key areas:
1. The degree of intrinsic parallelism in the instruction stream, i.e.
limited amount of instruction-level parallelism, and
2. The complexity and time cost of the dispatcher and associated
dependency checking logic.
Existing binary executable programs have varying degrees of intrinsic
parallelism. In some cases instructions are not dependent on each other
and can be executed simultaneously. In other cases they are interdependent: one instruction impacts either resources or results of the other.
The instructions a = b + c; d = e + f can be run in parallel because none of the
results depend on other calculations. However, the instructions a = b + c; d =
a + f might not be runnable in parallel, depending on the order in which the
instructions complete while they move through the units.
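The dependency test behind this example can be sketched in a few lines of Python; the encoding of each instruction as a (destination, sources) tuple is purely illustrative, not any real dispatcher's data structure:

```python
# Minimal sketch of a read-after-write (RAW) dependency check, the kind
# of test a superscalar dispatcher performs in hardware at run time.
# The (dest, (src, src)) tuple format is an illustrative assumption.

def raw_dependent(first, second):
    """True if 'second' reads a register that 'first' writes."""
    dest_first, _ = first
    _, srcs_second = second
    return dest_first in srcs_second

# a = b + c ; d = e + f  -> independent, can issue together
print(raw_dependent(("a", ("b", "c")), ("d", ("e", "f"))))  # False

# a = b + c ; d = a + f  -> the second reads 'a', must wait
print(raw_dependent(("a", ("b", "c")), ("d", ("a", "f"))))  # True
```

A real dispatcher must also check write-after-read and write-after-write conflicts, but the RAW case shown here is the one illustrated by the two instruction sequences above.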
When the number of simultaneously issued instructions increases, the cost
of dependency checking increases extremely rapidly. This is exacerbated
by the need to check dependencies at run time and at the CPU's clock rate.
This cost includes additional logic gates required to implement the checks,
and time delays through those gates. Research shows the gate cost in
some cases may be n·k gates, and the delay cost k²·log n, where n is the
number of instructions in the processor's instruction set and k is the
number of simultaneously dispatched instructions. In mathematics, this is
called a combinatoric problem involving permutations.
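Plugging illustrative numbers into those cost expressions shows how steeply the delay term grows with the dispatch width k. The instruction-set size n = 256 below is an assumed value, not a figure from the text, and the absolute numbers are meaningless; only the growth rate matters:

```python
import math

# Illustrative growth of the quoted cost expressions:
#   gate cost  ~ n * k
#   delay cost ~ k^2 * log2(n)
# for an instruction set of n instructions and k simultaneously
# dispatched instructions. n = 256 is an assumption for the sketch.
n = 256
for k in (2, 4, 8):
    gates = n * k
    delay = k * k * math.log2(n)
    print(k, gates, delay)
```

Doubling k doubles the gate count but quadruples the delay term, which is why the practical dispatch limit stays small even as transistor budgets grow.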
Even though the instruction stream may contain no inter-instruction
dependencies, a superscalar CPU must nonetheless check for that
possibility, since there is no assurance otherwise and failure to detect a
dependency would produce incorrect results.
No matter how advanced the semiconductor process or how fast the
switching speed, this places a practical limit on how many instructions can
be simultaneously dispatched. While process advances will allow ever
greater numbers of functional units (e.g., ALUs), the burden of checking
instruction dependencies grows so rapidly that the achievable superscalar
dispatch limit is fairly small -- likely on the order of five to six
simultaneously dispatched instructions.
However even given infinitely fast dependency checking logic on an
otherwise conventional superscalar CPU, if the instruction stream itself has
many dependencies, this would also limit the possible speedup. Thus the
degree of intrinsic parallelism in the code stream forms a second limitation.
Alternatives
Collectively, these two limits drive investigation into alternative architectural
performance increases such as Very Long Instruction Word (VLIW), Explicitly
Parallel Instruction Computing (EPIC), simultaneous multithreading (SMT),
and multi-core processors.
With VLIW, the burdensome task of dependency checking by hardware
logic at run time is removed and delegated to the compiler. Explicitly
Parallel Instruction Computing (EPIC) is like VLIW, with extra cache
prefetching instructions.
Simultaneous multithreading, often abbreviated as SMT, is a technique for
improving the overall efficiency of superscalar CPUs. SMT permits multiple
independent threads of execution to better utilize the resources provided by
modern processor architectures.
Superscalar processors differ from multi-core processors in that the
redundant functional units are not entire processors. A single processor is
composed of finer-grained functional units such as the ALU, integer
multiplier, integer shifter, floating point unit, etc. There may be multiple
versions of each functional unit to enable execution of many instructions in
parallel. This differs from a multicore CPU that concurrently processes
instructions from multiple threads, one thread per core. It also differs from a
pipelined CPU, where the multiple instructions can concurrently be in
various stages of execution, assembly-line fashion.
The various alternative techniques are not mutually exclusive—they can be
(and frequently are) combined in a single processor. Thus a multicore CPU
is possible where each core is an independent processor containing
multiple parallel pipelines, each pipeline being superscalar. Some
processors also include vector capability.
Superscalar Operation - Executing Instructions in Parallel
With the pipelined architecture we could achieve, at best, an execution
rate of one CPI (clock cycle per instruction). Is it possible to execute instructions
faster than this? At first glance you might think, "Of course not, we can do
at most one operation per clock cycle. So there is no way we can execute
more than one instruction per clock cycle." Keep in mind however, that a
single instruction is not a single operation. In the examples presented
earlier each instruction has taken between six and eight operations to
complete. By adding seven or eight separate units to the CPU, we could
effectively execute these eight operations in one clock cycle, yielding one
CPI. If we add more hardware and execute, say, 16 operations at once, can
we achieve 0.5 CPI? The answer is a qualified "yes." A CPU including this
additional hardware is a superscalar CPU and can execute more than one
instruction during a single clock cycle. The 80x86 family began supporting
superscalar execution with the introduction of the Pentium processor.
A superscalar CPU has, essentially, several execution units. If it encounters
two or more instructions in the instruction stream (i.e., the prefetch queue)
which can execute independently, it will do so.
Figure:- A CPU that Supports Superscalar Operation
There are a couple of advantages to going superscalar. Suppose you have
the following instructions in the instruction stream:
mov( 1000, eax );
mov( 2000, ebx );
If there are no other problems or hazards in the surrounding code, and all
six bytes for these two instructions are currently in the prefetch queue,
there is no reason why the CPU cannot fetch and execute both instructions
in parallel. All it takes is extra silicon on the CPU chip to implement two
execution units.
Besides speeding up independent instructions, a superscalar CPU can also
speed up program sequences that have hazards. One limitation of a
standard pipelined CPU is that once a hazard occurs, the offending instruction
will completely stall the pipeline. Every instruction which follows will also have
to wait for the CPU to synchronize the execution of the instructions. With a
superscalar CPU, however, instructions following the hazard may continue
execution through the pipeline as long as they don't have hazards of their
own. This alleviates (though does not eliminate) some of the need for
careful instruction scheduling.
As an assembly language programmer, the way you write software for a
superscalar CPU can dramatically affect its performance. First and
foremost is that rule you're probably sick of by now: use short instructions.
The shorter your instructions are, the more instructions the CPU can fetch
in a single operation and, therefore, the more likely the CPU will execute
faster than one CPI. Most superscalar CPUs do not completely duplicate
the execution unit. There might be multiple ALUs, floating point units, etc.
This means that certain instruction sequences can execute very quickly
while others won't. You have to study the exact composition of your CPU to
decide which instruction sequences produce the best performance.
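The effect of instruction ordering can be illustrated with a toy dual-issue model. This is a sketch under simplifying assumptions (two issue slots, only adjacent-pair RAW checks), not any real CPU's dispatch logic:

```python
# Toy dual-issue simulator: each cycle the CPU issues up to two
# instructions, but the second slot is used only when that instruction
# does not read the register written by the first. Instructions are
# (dest, (src, src)) tuples -- an illustrative encoding, not real ISA.

def cycles_dual_issue(program):
    """Count cycles to issue 'program' on the toy dual-issue model."""
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program) and program[i][0] not in program[i + 1][1]:
            i += 2  # both instructions issue this cycle
        else:
            i += 1  # dependency (or end of stream): issue only one
    return cycles

# Dependent ordering: every adjacent pair conflicts -> 4 cycles
dep = [("a", ("b", "c")), ("d", ("a", "f")),
       ("x", ("d", "c")), ("y", ("x", "f"))]
# Same four additions reordered so adjacent pairs are independent -> 2 cycles
ind = [("a", ("b", "c")), ("x", ("g", "h")),
       ("d", ("a", "f")), ("y", ("x", "f"))]
print(cycles_dual_issue(dep), cycles_dual_issue(ind))  # 4 2
```

The same work runs twice as fast in the second ordering, which is exactly the kind of scheduling decision the assembly programmer (or compiler) makes for a superscalar target.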
Very Long Instruction Word Architecture (VLIW)
Superscalar operation attempts to schedule, in hardware, the execution of
multiple instructions simultaneously. Another technique that Intel is using in
their IA-64 architecture is the use of very long instruction words, or VLIW. In
a VLIW computer system, the CPU fetches a large block of bits (a 128-bit
bundle in the case of the IA-64 Itanium CPU) and decodes and executes this
block all at once. This block usually contains two or more instructions (three
41-bit instructions per bundle in the case of the IA-64). VLIW computing requires the programmer or
compiler to properly schedule the instructions in each block (so there are
no hazards or other conflicts), but if properly scheduled, the CPU can
execute three or more instructions per clock cycle.
The Intel IA-64 Architecture is not the only computer system to employ a
VLIW architecture. Transmeta's Crusoe processor family also uses a VLIW
architecture. The Crusoe processor differs from the IA-64 architecture
in that it does not support native execution of IA-32 instructions.
Instead, the Crusoe processor dynamically translates 80x86 instructions to
Crusoe's VLIW instructions. This "code morphing" technology results in
code running about 50% slower than native code, though the Crusoe
processor has other advantages.
We will not consider VLIW computing any further since the IA-32
architecture does not support it. But keep this architectural advance in mind
if you move towards the IA-64 family or the Crusoe family.
Parallel Processing
Most of the techniques for improving CPU performance via architectural
advances involve the parallel (overlapped) execution of instructions. Most
of the techniques of this chapter are transparent to the programmer. That
is, the programmer does not have to do anything special to take minimal
advantage of the parallel operation of pipelines and superscalar operations.
True, if programmers are aware of the underlying architecture they can
write code that runs even faster, but these architectural advances often
improve performance even if programmers do not write special code to take
advantage of them.
The only problem with this approach (attempting to dynamically parallelize
an inherently sequential program) is that there is only so much you can do
to parallelize a program that requires sequential execution for proper
operation (which covers most programs). To truly produce a parallel
program, the programmer must specifically write parallel code; of course,
this does require architectural support from the CPU. This section and the
next touch on the types of support a CPU can provide.
Typical CPUs use what is known as the SISD model: Single Instruction,
Single Data. This means that the CPU executes one instruction at a time
that operates on a single piece of data. Two common parallel models are
the so-called SIMD (Single Instruction, Multiple Data) and MIMD (Multiple
Instruction, Multiple Data) models. As it turns out, x86 systems can support
both of these parallel execution models.
In the SIMD model, the CPU executes a single instruction stream, just like
the standard SISD model. However, the CPU executes the specified
operation on multiple pieces of data concurrently rather than a single data
object. For example, consider the 80x86 ADD instruction. This is a SISD
instruction that operates on (that is, produces) a single piece of data; true,
the instruction fetches values from two source operands and stores a sum
into a destination operand but the end result is that the ADD instruction will
only produce a single sum. An SIMD version of ADD, on the other hand,
would compute the sum of several values simultaneously. The Pentium III's
MMX and SIMD instruction extensions operate in exactly this fashion. With
an MMX instruction, for example, you can add up to eight separate pairs of
values with the execution of a single instruction. The aptly named SIMD
instruction extensions operate in a similar fashion.
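What a single packed add does can be sketched in plain Python standing in for the hardware. The wraparound behavior below mirrors the modular (non-saturating) form of MMX's PADDB packed-byte add; treating the eight bytes as a Python list is purely an illustration:

```python
# Sketch of a SIMD "packed add": eight byte-sized additions performed
# as one operation. The modulo-256 wraparound models the non-saturating
# variant of the MMX PADDB instruction.

def paddb(xs, ys):
    """Add eight unsigned bytes pairwise, wrapping at 256."""
    assert len(xs) == len(ys) == 8
    return [(x + y) % 256 for x, y in zip(xs, ys)]

a = [10, 20, 30, 40, 50, 60, 70, 250]
b = [1, 2, 3, 4, 5, 6, 7, 10]
print(paddb(a, b))  # [11, 22, 33, 44, 55, 66, 77, 4]
```

Note the last lane: 250 + 10 wraps to 4. MMX also provides saturating variants that would clamp the result at 255 instead, which is often what multimedia code wants.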
Note that SIMD instructions are only useful in specialized situations. Unless
you have an algorithm that can take advantage of SIMD instructions,
they're not that useful. Fortunately, high-speed 3-D graphics and
multimedia applications benefit greatly from these SIMD (and MMX)
instructions, so their inclusion in the 80x86 CPU offers a huge performance
boost for these important applications.
The MIMD model uses multiple instructions, operating on multiple pieces of
data (usually one instruction per data object, though one of these
instructions could also operate on multiple data items). These multiple
instructions execute independently of one another. Therefore, it's very rare
that a single program (or, more specifically, a single thread of execution)
would use the MIMD model. However, if you have a multiprogramming
environment with multiple programs attempting to execute concurrently in
memory, the MIMD model does allow each of those programs to execute
their own code stream concurrently. This type of parallel system is usually
called a multiprocessor system. Multiprocessor systems are the subject of
the next section.
The common computation models are SISD, SIMD, and MIMD. If you're
wondering if there is a MISD model (Multiple Instruction, Single Data) the
answer is no. Such an architecture doesn't really make sense.
Multiprocessing
Pipelining, superscalar operation, out of order execution, and VLIW design
are techniques CPU designers use in order to execute several operations
in parallel. These techniques support fine-grained parallelism and are
useful for speeding up adjacent instructions in a computer system. If adding
more functional units increases parallelism (and, therefore, speeds up the
system), you might wonder what would happen if you added two CPUs to
the system. This technique, known as multiprocessing, can improve system
performance, though not as uniformly as other techniques. As noted in the
previous section, a multiprocessor system uses the MIMD parallel
execution model.
The techniques we've considered to this point don't require special
programming to realize a performance increase. True, if you do pay
attention you will get better performance; but no special programming is
necessary to activate these features. Multiprocessing, on the other hand,
doesn't help a program one bit unless that program was specifically written
to use multiprocessing (or runs under an O/S specifically written to support
multiprocessing). If you build a system with two CPUs, those CPUs cannot
trade off executing alternate instructions within a program. In fact, it is very
expensive (timewise) to switch the execution of a program from one
processor to another. Therefore, multiprocessor systems are really only
effective in a system that executes multiple programs concurrently (i.e., a
multitasking system). To differentiate this type of parallelism from that
afforded by pipelining and superscalar operation, we'll call this kind of
parallelism coarse-grained parallelism.
Adding multiple processors to a system is not as simple as wiring the
processor to the motherboard. A big problem with multiple processors is the
cache coherency problem. To understand this problem, consider two
separate programs running on separate processors in a multiprocessor
system. Suppose also that these two processors communicate with one
another by writing to a block of shared physical memory. Unfortunately,
when CPU #1 writes to this block of addresses, the CPU caches the data
and might not actually write the data to physical memory for some time.
Simultaneously, CPU #2 might be attempting to read this block of shared
memory but winds up reading the data out of its local cache rather than the
data that CPU #1 wrote to the block of shared memory (assuming the data
made it out of CPU #1's local cache). In order for these two programs to
operate properly, the two CPUs must communicate writes to common
memory addresses in cache between themselves. This is a very complex
and involved process.
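The stale-read scenario above can be sketched with two toy write-back caches sitting over one shared memory. This illustrates only the problem, not any real coherency protocol; the class and addresses are invented for the sketch:

```python
# Toy illustration of the cache coherency problem: each CPU keeps a
# private write-back cache over shared memory. With no invalidation
# traffic between the caches, CPU #2 keeps reading a stale value.

class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}  # write-back cache: address -> value

    def write(self, addr, value):
        self.cache[addr] = value  # sits in cache, not yet in memory

    def read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.memory.get(addr, 0)  # miss: fill line
        return self.cache[addr]

shared = {0x100: 0}
cpu1, cpu2 = Cpu(shared), Cpu(shared)
cpu2.read(0x100)         # CPU #2 caches the old value, 0
cpu1.write(0x100, 42)    # CPU #1's write stays in its own cache
print(cpu2.read(0x100))  # 0 -- stale: the update never reached CPU #2
```

A coherency protocol fixes this by having CPU #1's write invalidate (or update) the matching line in CPU #2's cache, which is the inter-CPU communication the text describes as complex and involved.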
Currently, the Pentium III and IV processors directly support cache updates
between two CPUs in a system. Intel also builds a more expensive
processor, the Xeon, that supports more than two CPUs in a system.
However, one area where the RISC CPUs have a big advantage over Intel
is in the support for multiple processors in a system. While Intel systems
reach a point of diminishing returns at about 16 processors, Sun SPARC
and other RISC processors easily support 64-CPU systems (with more
arriving, it seems, every day).