ESS Design White Paper – RISC Embedded Controller

A True Single Cycle RISC Processor without Pipelining

Robert S. Plachno, VP of Audio

Abstract: This paper details the design of an embedded RISC controller used for mixed signal audio integrated circuits. This processor replaced an existing 8 bit CISC embedded processor and obtained a performance improvement of about 6x. This performance improvement was entirely due to architectural improvements, using the same input clock rate and external ROM IP block.

Index Terms: Computer architecture, Memory management, Pipeline processing, Reduced instruction set computing.

I. INTRODUCTION

The architecture of a RISC processor should support single cycle operation. Single cycle operation is defined here as continuous instruction fetches from the instruction memory (ROM in this case) at the maximum access rate of the memory. Most RISC processor designs obtain this performance by pipelining. With a pipelined architecture each instruction is fetched assuming the next instruction is at the next physical instruction address (PC+1). If a jump instruction occurs then the pipeline is flushed or a delay must occur while the correct instruction address is calculated. This paper describes a RISC architecture in which single cycle operation is obtained without using a pipelined design. This RISC processor was designed as an embedded controller. Other architectural advantages of this design will be discussed as well as the implementation and design techniques.

II. THE LIMITATIONS OF THE PREVIOUS DESIGN

A. CISC versus RISC

The previous designs used an 8-bit CISC as an embedded controller. This CISC was inefficient and had difficulty supporting continuous customer requests for additional features. The register instructions required 5 internal clock cycles and instructions using external memory usually required 6 internal cycles. The internal register-to-register instructions had low utilization in the program.
Instructions using external memory were more common. The instruction fetch occurred over an 8 bit bus using multiple cycles, the number of which depended on the instruction type. The CISC design used a ROM organized as 24K by 8 for the instruction memory. Most instruction fetches required 2 to 3 ROM accesses.

This technical paper describes an embedded RISC controller used in mixed signal audio products from 1993 through 1999.

Adopting a RISC architecture was forced by the new customer feature requirements. This integrated circuit did not use a PLL, so the input clock rate could not simply be sped up. Any performance improvement had to come purely from an architectural change, by obtaining single cycle operation.

B. Operating Voltage Range

This integrated circuit was for PC audio products that could be designed into desktop PCs or notebooks. In this time frame desktops used 5V operation while notebooks required 3.3V operation. The analog circuits were designed to run over this wide power supply range and at multiple fabrication houses. It was surprising to find that the circuit with the least operating voltage margin was the CISC processor. The CISC was a fully custom design that had issues with low voltage operation. The new RISC design lowered the minimum operating voltage dramatically, to the point where the processor was not the limiting factor and the chip gained over half a volt of margin.

C. Royalty Cost

The CISC was a purchased design which required royalty payments. Since the PC audio products had volume shipments of over 2M per month, it was desirable to eliminate this cost burden.

III. MEMORY FOR PROCESSOR DESIGNS

My original experience was in memory design, including pseudo-static designs where the memory pre-charge is hidden from the user. Later I worked on several processor designs including a 64 bit processor that had a large design group.
It became apparent that most processor engineers did not understand how memories worked and I had to teach them how to efficiently interface to their memory blocks.

Memories require a pre-charge. In this time frame it was standard practice to divide the memory cycle in two. The first half-cycle is for pre-charge and the second half-cycle is the actual memory access. Addresses must only change during the pre-charge time, and the address set-up time is actually measured to the center clock transition (at the end of the first half-cycle). The existing ROM used for the CISC was designed exactly in this manner.

As time progressed logic engineers became even more ignorant of their memory blocks and the circuit designers of the memories made their specified interfaces safer. With the change in philosophy to synchronous designs, latches fell out of favor and flip-flops took their place. Most engineers now put flip-flops at the input of the memory to fix the addresses. This wastes almost a full half-cycle. A correct design would use a latch that is open during the pre-charge period. It is also extremely wasteful to insert flip-flops on the output data of the memory. A correct design would have a latch open during the second half-cycle of the memory, and memory designs should already include this latch internally.

This problem is a conceptual difference caused by going to a synchronous design philosophy. The pipeline design engineer has flip-flops in too many places. How can you do an instruction decode and the next instruction address calculation (for jump instructions) all on the one clock edge between two instruction fetches? In reality you have about half a clock cycle to perform these calculations. You have the margin on the ROM access time for the last instruction fetch, plus the pre-charge time, minus the address set-up time for the next instruction fetch.
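The half-cycle estimate above can be checked with a quick calculation. The sketch below assumes a memory cycle split evenly into a pre-charge half and an access half, as described; the function name and the example numbers (100 ns cycle, 40 ns ROM access, 5 ns address set-up) are hypothetical and are not taken from the ESS design.

```python
def decode_budget_ns(cycle_ns, rom_access_ns, addr_setup_ns):
    """Time available for decode and next-address selection between two fetches.

    Assumes the cycle is split evenly: first half pre-charge, second half access.
    """
    half_cycle = cycle_ns / 2.0
    # Slack left in the access half after the ROM data becomes valid.
    access_margin = half_cycle - rom_access_ns
    # Add the following pre-charge half, then subtract the address set-up time,
    # which is measured to the center clock transition of the next cycle.
    return access_margin + half_cycle - addr_setup_ns


# Hypothetical numbers, for illustration only.
budget = decode_budget_ns(cycle_ns=100, rom_access_ns=40, addr_setup_ns=5)
print(budget)  # 55.0 ns, roughly half of the 100 ns cycle
```

With these assumed values the budget comes out slightly above half a cycle; a slower ROM or a longer set-up time shrinks it accordingly.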
IV. BASIC OVERVIEW OF THE RISC DESIGN

The bit widths of each unit are as follows:
• Instruction Unit: 24 bits
• Execution Unit: 8 bits
• Memory Unit: 16 bits

The instruction ROM used for the CISC was an 8Kx8 block repeated 3 times and organized as 24K by 8. The RISC uses the same original 8Kx8 ROM block but has it organized as 8Kx24, since the instruction is 24 bits wide. All instruction fetches are a single cycle 24 bit access.

Both the instruction ROM and any external RAM have 16 bit addressing, as indicated by the 16 bit memory unit width. This allows the RISC to have an 8 bit opcode and a 16 bit address for jump instructions, etc., all in one instruction fetch. This means there are no relative addressing jump calculations. The address for the jump instructions is always immediate. It does not have to be calculated but only multiplexed with PC+1. ROM and RAM accesses have separate opcodes, so in effect there is 17 bit addressing for external memories.

The register file feeding the execution unit has 8 bit addressing but the MSB for the register file is always 0. Since the RISC is an embedded controller, the other 128 register addresses (MSB=1) are reserved for external user defined registers. Within the 24 bit instruction width there are 8 bits for the opcode and two 8 bit addresses for the two operands of the execution unit instructions. Immediate commands have an 8 bit opcode, an 8 bit address for the operand destination, and the 8 bit immediate value.

There is an obvious trade-off in having only an 8 bit execution unit. For 16 bit audio calculations two instructions are performed. For example, 16 bit values are added by two instructions: ADD followed by an ADC (add with carry). The CISC performed in the same manner.

The assembler cross-assembled all of the previous CISC instructions directly to the RISC instruction set, using multiple RISC instructions if required.
For example, a DJNZ (decrement and jump if not zero) command is assembled as two instructions: ADD Rnum, #%FF (add negative 1) and JP NZ, 2 “label” (conditional jump if not zero). Fig 1 documents the RISC instruction set and decode table.

V. THE REGISTER FILE

The operands are read from and written to a 128 byte register file. All registers are general purpose. This register file is double pumped, meaning there are two accesses in the same time period as one ROM access (instruction fetch), and it is a true dual-port memory, meaning both source operands can be read at the same time. Since it is a dual-port memory there are two 8 bit data busses, A and B, that connect the register file to the execution unit. The A bus is used for reading operand A and for writing the result. The B bus is only for reading operand B. Fig 2 shows the memory cell for the register file.

VI. EXECUTION UNIT

The execution unit consists of two operand latches, a barrel shifter, and an ALU.

A. Operand Latches

Two operand busses (A and B) are required for single cycle operation. Both input operands are read from the register file simultaneously. There are two non-overlapping clocks in the RISC called CK1 and CK2. The operand latches are loaded during CK1 from the register file and the output from the ALU is stored back into the register file during CK2. Note that the operand data is allowed to ripple through the latches during the time the data becomes valid. The latches are closed later, before the register file goes into pre-charge, to avoid corruption. Both operand latches are identical and can be independently reset or inverted (reset and invert together is a set). Clearing and inverting the operands are required for some of the operations in both the ALU and the shifter.

B. Barrel Shifter

This shifter can rotate right or left, inserting zeroes or ones, or wrapping around LSB-to-MSB. The trick is all in the layout and in how the operand registers drive it.
Logically the shifter is nothing more than an 8 to 1 multiplexer for every bit of the operand. Physically the A bus drives into the top and then the wires shift down one bit to the left for each multiplexer input. Physically the B bus drives into the bottom and then the wires shift up one bit to the right for each multiplexer input. If you drive both operands with the same value then you barrel shift. All eight shift possibilities are available at the inputs of the multiplexers. The shift amount is determined by selecting one of the eight possibilities. By clearing or setting one of the operand registers you insert either leading or trailing 0’s or 1’s as you shift. Fig 4 shows the shifter wiring.

C. ALU

The Arithmetic Logic Unit can be described as an LFU (Logical Function Unit) with a generate-propagate static CMOS Manchester carry chain design. An LFU means it has a programmable truth table operation for any Boolean function of the two input operands. The four control lines, LFU0 to LFU3, specify each bit of the truth table function. Both the carry chain and the zero detect logic are buffered every four bits. To generate the result for the 8 bit ALU, the carry ripples through only four inverting stages (total) and each of these stages is an inverter. This carry chain implementation is fully static. It does not require any pre-charge clocks. Fig 3 shows the schematic for the ALU design.

VII. INSTRUCTION UNIT

The instruction unit consists of the Program Counter (PC), the ROM interface registers, and a hardware stack for subroutines and interrupts.

A. Program Counter Register

This is a 16 bit register. A separate adder, which is a simplified but similar design to the ALU, does the count function. This register always holds the current ROM instruction address plus one. This is the value which is pushed onto the stack for subroutine CALLs and for interrupts so that you RETURN to the next valid instruction.
If the present instruction is not a jump, call, branch, etc., then the default is to use the PC register value for the next ROM address.

B. The ROM Interface Registers

This logic has several functions. The next ROM address is multiplexed from the Program Counter, the stack (for Returns), or the immediate value in the instruction word (for Jumps). Reset forces the address to 0, and interrupt addresses can be forced to 1, 2, or 3 (which always contain unconditional jumps). This circuit also does the address manipulation for reading data from the ROM. The data stored in the ROM uses byte addressing, which has to be unpacked from an 8Kx24 or a 16Kx24 configuration.

C. The Hardware Stack

The stack design is a pure synchronous implementation. It consists of four 16 bit registers which are all loaded every cycle. For each register, if the current opcode is a CALL then the register above it is loaded (pushed). If the current opcode is a RETURN then the register below it is loaded (popped). If the opcode is neither then the register's own output is multiplexed back to retain its present value.

VIII. MEMORY UNIT

The RISC does not have complicated memory management. However, the memory unit performs the register file address generation and the A-B-C data bus multiplexing, and holds the flip-flops for the instruction word used for data unpacking.

IX. OTHER FEATURES

This RISC was designed as an embedded controller for audio applications and has some unique features.

A. Indirect Addressing

Since the execution unit does not have a multiplier, audio compression and decompression is done by table look-up. For this function indirect addressing is very important. Both the register file addressing and the external memory addressing can be done through another set of registers. The register file addressing costs another instruction to set a unique page register. The external memory can be addressed using a register pair from the register file.
The logic for the register pair addressing can also be used for indirect addressing on subroutine calls or jump instructions.

B. External Register Addressing

As mentioned before, there are 8 address bits for the registers and the register file only uses the lower 128 bytes. The upper 128 addresses are user defined as specific registers through the integrated circuit design. These external registers interface to the RISC module through the C bus. This means that external registers can be specified as a source or destination in an ADD or other execution unit instruction. Other architectures require loading the values into an internal register first.

C. User Defined Flags

The RISC has the four standard condition flags for Carry, Zero, Sign, and Overflow. However, there are eight flags in total that can be utilized. The other four flag bits are user defined. For example, one of these can be a signal such as a FIFO “full” flag.

D. ROM Data Packing

The instruction width is 24 bits. However, to perform table look-up compression, data must be read from the ROM using byte addressing. The three byte wide words are packed with data in the two LSB bytes sequentially up to the top of the memory and then back down using the MSB bytes.

E. Hardware Stack

This is a feature which can be a disadvantage. Using a hardware stack simplifies the design and speeds up the execution. The subroutine RETURN in the CISC took 19 cycles while the RETURN in the RISC takes one cycle. However, the design is limited to only four nested calls and interrupts, which is sufficient for the audio design. This can be unnerving to some programmers.

F. Large Register File

The RISC has 128 bytes of general purpose registers. This was large enough that no external RAM was used for the audio application. The original CISC design did use an external RAM in addition to its internal registers.

X. DESIGN IMPLEMENTATION

A. Engineers

The RISC was designed by Roi Peers and Robert Plachno.
Roi was the architect on the PC audio chips. He did the system level design and the software coding. For the RISC he defined the requirements and helped with the architecture. Robert Plachno had designed several processors prior to this RISC, including smaller embedded controllers and a larger 64 bit processor. Robert did the design and simulation of the RISC. Vincent Chueng replaced the CISC with the RISC in the audio chip and fixed the software timing issues.

B. Semi-Custom Design

The RISC was designed in CMOS technology and was run at numerous fabrication houses at channel lengths from 0.6µ to 0.35µ. The data paths were designed at the transistor level. Four sections, comprising the data paths for the execution unit, instruction unit, and memory unit, plus the register file, were custom laid out and placed together in a rectangular area. The remaining control logic was routed as standard cells. Figure 5 shows the PC audio chip with the RISC in the top left.

C. CAE Tools

The original design was entered using the ORCAD schematic tool. The simulations were done using Robert Plachno’s EESIM. This simulator allows a mixed mode SPICE and logic netlist. Modules can be simulated at the transistor, gate, or behavioral level. The simulator also indicates the 10 worst (or a specified number of) set-up times, hold times, etc., and calculates the power dissipation and test vector coverage. The initial design was simulated on a PC and then progressed to a UNIX system. Eventually the schematics were recaptured into Cadence and simulated using Verilog.

D. Initial Debug

The design was initially done as a test chip on a multi-up mask set. This was debugged using test vectors transferred to an IMS tester. Then the CISC was replaced by the RISC in a full PC audio design. The assembler (written by Plachno) automatically cross-assembles the CISC instruction set to the RISC instruction set.
However, certain parts of the program were found to be self-timed (a software emulated serial port) and other problems occurred since the processor performed about 6x faster. I would describe replacing the CISC with the RISC in the design as having moderately few issues.

XI. CONCLUSION

A design for a single cycle RISC processor has been discussed that does not use pipelining. This operation is obtained by folding the processor execution into the memory cycle. The RISC uses a ripple-through latch style of design as opposed to a synchronous flip-flop style of design. The design improvements include:
• 6x performance improvement.
• Wide power supply range for both desktop and notebook applications.
• No royalty fees.
• Minimum software impact.
• This is an architectural change only. No process improvements or clock rate increase were required.

Figure 1. RISC Instruction Set Definition

Figure 2. Register File Dual Port Memory Cell

Figure 3. ALU Design

Figure 4. Barrel Shifter

Figure 5. PC Audio Chip with RISC
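As a supplementary illustration of the barrel shifter described in Section VI.B, the behavior of the 8 to 1 multiplexers fed by the two operand busses can be modeled as a funnel shift. This is only a behavioral sketch under assumptions: the function names are hypothetical, and treating the A operand as the high side and the B operand as the low side is one reading of the wiring description, not something confirmed by the figure.

```python
MASK = 0xFF  # the execution unit is 8 bits wide


def funnel_shift(a, b, n):
    """Select one of eight shift positions: A shifted left by n, with the
    top n bits of B filling the vacated low bits (n in 0..7)."""
    if not 0 <= n <= 7:
        raise ValueError("shift amount must be 0..7")
    a &= MASK
    b &= MASK
    return ((a << n) | (b >> (8 - n))) & MASK


def rotate_left(x, n):
    # Driving both operands with the same value gives a barrel rotate.
    return funnel_shift(x, x, n)


def shift_left(x, n, fill=0):
    # Clearing (or setting) the B operand inserts trailing 0's (or 1's).
    return funnel_shift(x, MASK if fill else 0x00, n)


def shift_right(x, n, fill=0):
    # A right shift by n is a left funnel shift by 8 - n, with the cleared
    # (or set) operand on the A side supplying the leading 0's (or 1's).
    if n == 0:
        return x & MASK
    return funnel_shift(MASK if fill else 0x00, x, 8 - n)


print(hex(rotate_left(0x96, 3)))          # 0xb4
print(hex(shift_left(0x96, 3)))           # 0xb0
print(hex(shift_right(0x96, 3, fill=1)))  # 0xf2
```

One multiplexer select therefore covers rotates and shifts in both directions; only the values loaded into the two operand latches differ between the cases.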
