ESS Design White Paper – RISC Embedded Controller
A True Single Cycle RISC Processor without
Pipelining
Robert S. Plachno, VP of Audio
Abstract—This paper details the design of an embedded RISC
controller used for mixed signal audio integrated circuits. This
processor replaced an existing 8 bit CISC embedded processor
and obtained a performance improvement of about 6x. This
performance improvement was entirely due to architectural
improvements using the same input clock rate and external ROM
IP block.
Index Terms—Computer architecture, Memory management,
Pipeline processing, Reduced instruction set computing.
I. INTRODUCTION
The architecture of a RISC processor should support
single cycle operation. The definition of single cycle
operation is continuous instruction fetches from the instruction
memory (ROM in this case) at the maximum access rate of the
memory. Most RISC processor designs obtain this
performance by pipelining. With a pipelined architecture each
instruction is fetched assuming the next instruction is at the
next physical instruction address (PC+1). If a jump instruction
occurs then the pipeline is flushed or a delay must occur while
the correct instruction address is calculated. This paper
describes a RISC architecture in which single cycle operation
is obtained without using a pipelined design.
This RISC processor was designed as an embedded
controller. Other architectural advantages of this design will be
discussed as well as the implementation and design techniques.
II. THE LIMITATIONS OF THE PREVIOUS DESIGN
A. CISC versus RISC
The previous designs used an 8-bit CISC as an embedded
controller. This CISC was inefficient and made it difficult to
support continual customer requests for additional features.
The register instructions required 5 internal clock cycles and
instructions using external memory usually required 6 internal
cycles. The internal register to register instructions had a low
utilization in the program. Instructions using external memory
were more common. The instruction fetch occurred over an 8
bit bus using multiple cycles which depended on the
instruction type. The CISC design used a ROM organized as
24K by 8 for the instruction memory. Most instruction fetches
required 2 to 3 ROM accesses.

This technical paper describes an embedded RISC controller used in
mixed signal audio products from 1993 through 1999.
Adopting a RISC architecture was forced by the new
customer feature requirements. This integrated circuit did not
use a PLL so the input clock rate could not be simply sped up.
Any performance improvement had to come purely from an
architectural change by obtaining single cycle operation.
B. Operating Voltage Range
This integrated circuit was for PC audio products that could
be designed into desktop PCs or notebooks. In this time frame
desktops used 5V operation while notebooks required 3.3V
operation. The analog circuits were designed to run over this
wide power supply range and at multiple fabrication houses. It
was surprising to find that the circuit with the least operating
voltage margin was the CISC processor. The CISC was a fully
custom design that had issues with the low voltage operation.
The new RISC design lowered the minimum operating voltage
dramatically, to the point where the processor was no longer the
limiting factor and the chip gained over half a volt of margin.
C. Royalty Cost
The CISC was a purchased design which required royalty
payments. Since the PC audio products had volume shipments
at over 2M per month it was desirable to eliminate this cost
burden.
III. MEMORY FOR PROCESSOR DESIGNS
My original experience was in memory design including
pseudo-static designs where the memory pre-charge is hidden
from the user. Later I worked on several processor designs
including a 64 bit processor that had a large design group. It
became apparent that most processor engineers did not
understand how memories worked and I had to teach them
how to efficiently interface to their memory blocks.
Memories require a pre-charge. In this time frame it was a
standard practice to divide the memory cycle in two. The first
half cycle is for pre-charge and the second half-cycle is the
actual memory access. Addresses must only change during the
pre-charge time and the address set-up time is actually
measured to the center clock transition (at the end of the first
half-cycle). The existing ROM used for the CISC was
designed exactly in this manner.
As time progressed logic engineers became even more
ignorant of their memory blocks and the circuit designers of
the memories made their specified interfaces safer. With the
change in philosophy to synchronous designs, latches fell
out of favor and were replaced by flip-flops. Most engineers
now put flip-flops to fix the addresses at the input of the
memory. This wastes almost a full half cycle. A correct design
would use a latch that is open during the pre-charge period. It
is also extremely wasteful to insert flip-flops on the output data
of the memory. A correct design would have a latch open
during the second half cycle of the memory and memory
designs should already include this latch internally.
Mentally this problem is a conceptual difference caused by
going to a synchronous design philosophy. The pipeline design
engineer has flip-flops in too many places. How can you do an
instruction decode and the next instruction address calculation
(for jump instructions) all on the one clock edge between two
instruction fetches? In reality you have about half a clock cycle
to perform these calculations. You have the margin for the
ROM access time for the last instruction fetch plus the
pre-charge time minus the address set-up time for the next
instruction fetch.
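The timing budget described above can be sketched as simple arithmetic. This is a minimal illustration, assuming the memory cycle is split into equal pre-charge and access halves as stated earlier; the function name and the example numbers are hypothetical and are not taken from the actual design.

```python
def decode_budget_ns(cycle_ns, rom_access_ns, addr_setup_ns):
    """Time available for decode and next-address selection between two fetches.

    Assumes a 50/50 split of the memory cycle: first half pre-charge,
    second half access. All parameter names are illustrative.
    """
    half = cycle_ns / 2.0
    # Margin left in the access half-cycle once the ROM data is valid.
    access_margin = half - rom_access_ns
    if access_margin < 0:
        raise ValueError("ROM access does not fit in the access half-cycle")
    # Addresses may change during pre-charge, up to the set-up time
    # before the center clock transition.
    return access_margin + half - addr_setup_ns


# Hypothetical numbers: 100 ns cycle, 40 ns ROM access, 5 ns address set-up.
print(decode_budget_ns(100, 40, 5))
```

With these made-up values the budget comes out slightly above half a cycle, which is consistent with the "about half a clock cycle" estimate in the text.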
IV. BASIC OVERVIEW OF THE RISC DESIGN
The bit widths of each unit are as follows:
• Instruction Unit: 24 bits
• Execution Unit: 8 bits
• Memory Unit: 16 bits
The instruction ROM used for the CISC was an 8Kx8 block
repeated 3 times and organized as 24K by 8. The RISC used
the same original ROM 8Kx8 block but has it organized as
8Kx24 since the instruction is 24 bits wide. All instruction
fetches are a single cycle 24 bit access.
Both the instruction ROM and any external RAM have 16 bit
addressing as indicated by the 16 bit memory unit width. This
allows the RISC to have an 8 bit opcode and 16 bit address for
jump instructions, etc., all in one instruction fetch. This means
there are no relative addressing jump calculations. The address
for the jump instructions is always immediate. It does not have
to be calculated but only multiplexed with PC+1. ROM and
RAM accesses have separate opcodes so in effect there is 17 bit
addressing for external memories.
The register file feeding the execution unit has 8 bit
addressing but the MSB for the register file is always 0. Since
the RISC is an embedded controller the other 128 register
addresses (MSB=1) are reserved for external user defined
registers. Within the 24 bit instruction width you have 8 bits for
the opcode and two 8 bit addresses for the two operands for
the execution unit instructions. Immediate commands have an
8 bit opcode, an 8 bit address for the operand destination and
the 8 bit immediate value.
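The three instruction layouts described above can be illustrated with a short decoding sketch. The paper gives the field widths but not the bit positions, so placing the opcode in the most significant byte is an assumption, as are all function names; the opcode values in the usage lines are arbitrary.

```python
def fields(word):
    """Split a 24 bit instruction word into three 8 bit fields.

    Assumption: the opcode is the most significant byte.
    """
    if not 0 <= word <= 0xFFFFFF:
        raise ValueError("instruction word must fit in 24 bits")
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF


def decode_jump(word):
    """Jump format: 8 bit opcode plus a 16 bit immediate address."""
    opcode, high, low = fields(word)
    return opcode, (high << 8) | low


def decode_register(word):
    """Execution unit format: 8 bit opcode plus two 8 bit operand addresses."""
    return fields(word)


def is_external_register(reg_addr):
    """MSB = 1 selects one of the 128 external user defined registers."""
    return bool(reg_addr & 0x80)


print(decode_jump(0x12ABCD))        # arbitrary opcode 0x12, target 0xABCD
print(is_external_register(0x80))   # first external register address
```

Because the jump target is carried whole in the instruction word, the decoder only has to select it, which is the point the text makes about avoiding relative address calculations.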
There is an obvious trade off to have only an 8 bit execution
unit. For 16 bit audio calculations two instructions are
performed. For example, 16 bit values are added by two
instructions: ADD followed by an ADC (add with carry). The
CISC performed in the same manner.
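As an illustration of the ADD followed by ADC sequence, the sketch below models an 8 bit adder with a carry flag and chains it twice to form a 16 bit sum. The function names are hypothetical and the model ignores the other condition flags.

```python
def add8(a, b, carry_in=0):
    """8 bit add returning (result, carry_out), like ADD (carry_in=0) or ADC."""
    total = (a & 0xFF) + (b & 0xFF) + (carry_in & 1)
    return total & 0xFF, total >> 8


def add16(x, y):
    """16 bit add done as two 8 bit operations: ADD on the low bytes, ADC on the high bytes."""
    low, carry = add8(x & 0xFF, y & 0xFF)                     # ADD
    high, carry = add8((x >> 8) & 0xFF, (y >> 8) & 0xFF, carry)  # ADC
    return (high << 8) | low, carry


print(hex(add16(0x12FF, 0x0001)[0]))
```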
The assembler cross assembled all of the previous CISC
instructions directly to the RISC instruction set using multiple
RISC instructions if required. For example a DJNZ (decrement
and jump if not zero) command is assembled as two
instructions: ADD Rnum, #%FF (add negative 1) and JP NZ,
“label” (conditional jump if not zero).
Fig 1 documents the RISC instruction set and decode table.
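The DJNZ expansion can be sketched as follows. This is not the original assembler; the function names and the register name in the usage line are hypothetical, and the second function only demonstrates that adding %FF in 8 bit arithmetic behaves as a decrement that reaches zero after the expected number of passes.

```python
def expand_djnz(reg, label):
    """Return the two RISC instructions that stand in for one CISC DJNZ."""
    return [f"ADD {reg}, #%FF", f"JP NZ, {label}"]


def loop_passes(count):
    """Count loop passes when an 8 bit register starts at `count`."""
    reg = count & 0xFF
    passes = 0
    while True:
        passes += 1
        reg = (reg + 0xFF) & 0xFF   # ADD reg, #%FF: 8 bit add of negative 1
        if reg == 0:                # JP NZ falls through once the zero flag is set
            return passes


print(expand_djnz("R5", "loop"))
print(loop_passes(3))
```

As with a conventional DJNZ, a starting value of zero wraps around and runs 256 passes in this model.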
V. THE REGISTER FILE
The operands are read from and written to a 128 byte register file.
All registers are general purpose. This register file is
double pumped, meaning there are two accesses in the same
time period as one ROM access (instruction fetch), and it is
a true dual-port memory, meaning both source operands
can be read at the same time. Since it is a dual port memory
there are two 8 bit data busses: A & B that connect the register
file to the execution unit. The A bus is used for reading
operand A and for writing the result. The B bus is only for
reading operand B. Fig 2 shows the memory cell for the
register file.
VI. EXECUTION UNIT
The execution unit consists of two operand latches, a barrel
shifter, and an ALU.
A. Operand Latches
Two operand busses (A and B) are required for single cycle
operation. Both input operands are read from the register file
simultaneously. There are two non-overlapping clocks in the
RISC called CK1 and CK2. The operand latches are loaded
during CK1 from the register file and the output from the ALU
is stored back into the register file during CK2. Note that the
operand data is allowed to ripple through the latches during
the time the data becomes valid. The latches are closed later to
avoid corruption before the register file goes into pre-charge.
Both operand latches are identical and can be independently
reset or inverted (reset and invert together is a set). Clearing
and inverting the operands are required for some of the
operations in both the ALU and the shifter.
B. Barrel Shifter
This shifter can rotate right or left inserting zeroes, ones or
wrapping around LSB-to-MSB. The trick is all in the layout
and in how the operand registers drive it. Logically the shifter
is nothing more than 8 to 1 multiplexers for every bit of the
operand. Physically the A bus drives into the top and then the
wires shift down one bit to the left for each multiplexer input.
Physically the B bus drives into the bottom and then the wires
shift up one bit to the right for each multiplexer input. If you
drive both operands with the same value then you barrel shift.
All eight shift possibilities are available at the inputs of the
multiplexers. The shift amount is determined by selecting one
of the eight possibilities. By clearing or setting one of the
operand registers you insert either leading or trailing 0’s or 1’s
as you shift. Fig 4 shows the shifter wiring.
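One way to picture the shifter logically is as an 8 bit window selected from the two operands placed side by side. The sketch below is a behavioral abstraction only, not the wiring of Fig 4: treating operand A as the upper byte and operand B as the lower byte is an assumption, and all function names are hypothetical.

```python
def operand_latch(value, reset=False, invert=False):
    """Operand latch with independent reset and invert; both together give all ones."""
    latched = 0 if reset else value & 0xFF
    return (~latched) & 0xFF if invert else latched


def shifter(a, b, select):
    """Pick one of eight 8 bit windows from the 16 bit pair (a, b).

    Assumption: `a` occupies the upper byte. select=0 passes `a` unchanged.
    """
    if not 0 <= select <= 7:
        raise ValueError("select must be 0..7")
    window = ((a & 0xFF) << 8) | (b & 0xFF)
    return (window >> (8 - select)) & 0xFF


def rotate_left(x, n):
    """Same value on both operands wraps bits around."""
    return shifter(x, x, n)


def shift_left_zeros(x, n):
    """Clearing the second operand shifts in zeroes."""
    return shifter(x, operand_latch(x, reset=True), n)


def shift_left_ones(x, n):
    """Setting the second operand (reset plus invert) shifts in ones."""
    return shifter(x, operand_latch(x, reset=True, invert=True), n)


print(hex(rotate_left(0x81, 1)), hex(shift_left_zeros(0x81, 1)))
```

In the same model, clearing the first operand instead and choosing select = 8 - m gives a right shift by m with zeroes inserted.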
C. ALU
The Arithmetic Logic Unit can be described as an LFU
(Logical Function Unit) with a generate-propagate static
CMOS Manchester carry chain design. An LFU means it has a
programmable truth table operation for any Boolean function
of the two input operands. The four control lines, LFU0 to
LFU3, specify each bit of the truth table function. Both the
carry chain and the zero detect logic are buffered every four
bits. To generate the result for the 8 bit ALU, the carry ripples
through only four inverting stages (total) and each of these
stages is an inverter. This carry chain implementation is fully
static. It does not require any pre-charge clocks. Fig 3 shows
the schematic for the ALU design.
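The programmable truth table idea can be shown with a small bitwise model. The paper does not say which control line corresponds to which input combination, so the mapping used here (table bit index = 2*A + B) is an assumption, and the constant names are illustrative.

```python
def lfu(a, b, table):
    """Apply a 4 entry truth table to each bit pair of two 8 bit operands.

    `table` is a 4 bit value standing in for LFU3..LFU0. Assumed mapping:
    bit (2*a_bit + b_bit) of `table` is the output for that input pair.
    """
    result = 0
    for bit in range(8):
        index = (((a >> bit) & 1) << 1) | ((b >> bit) & 1)
        result |= ((table >> index) & 1) << bit
    return result


# Truth tables under the assumed mapping.
AND, OR, XOR, NOT_A = 0b1000, 0b1110, 0b0110, 0b0011

print(hex(lfu(0xF0, 0xCC, XOR)))
```

Any of the sixteen Boolean functions of two inputs can be selected this way, which is what makes a single logic block sufficient for all the logical instructions.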
VII. INSTRUCTION UNIT
The instruction unit consists of the Program Counter (PC),
the ROM interface registers, and a hardware stack for
subroutines and interrupts.
A. Program Counter Register
This is a 16 bit register. A separate adder which is a
simplified but similar design to the ALU does the count
function. This register always holds the current ROM
instruction address plus one. This is the value which is pushed
onto the stack for subroutine CALL’s and for interrupts so that
you RETURN to the next valid instruction. If the present
instruction is not a jump, call, branch, etc. then the default is to
use the PC register value for the next ROM address.
B. The ROM Interface Registers
This logic has several functions. The next ROM address is
multiplexed from the Program Counter, the stack (for
Returns), or the immediate value on the instruction word (for
Jumps). Reset forces the address 0 and interrupt addresses can
be forced to 1, 2, or 3 (which always contain unconditional
jumps). This circuit also does the addressing manipulation for
reading data from the ROM. The data stored in the ROM uses
byte addressing which has to be unpacked from an 8Kx24 or a
16Kx24 configuration.
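The next-address selection described above amounts to a multiplexer. The sketch below lists the sources named in the text; the priority order between them and all parameter names are assumptions made for illustration.

```python
def next_rom_address(pc, stack_top, immediate, *, reset=False,
                     interrupt=None, is_return=False, is_jump=False):
    """Select the next ROM address.

    `pc` is the Program Counter register, which already holds the current
    instruction address plus one. The priority order is assumed.
    """
    if reset:
        return 0
    if interrupt is not None:
        if interrupt not in (1, 2, 3):
            raise ValueError("interrupt addresses are 1, 2, or 3")
        return interrupt          # these locations hold unconditional jumps
    if is_return:
        return stack_top & 0xFFFF
    if is_jump:
        return immediate & 0xFFFF  # taken directly from the instruction word
    return pc & 0xFFFF


print(hex(next_rom_address(0x0101, 0x0200, 0x1234, is_jump=True)))
```

No adder sits in this path: every candidate address already exists when the instruction word arrives, so only a selection is needed.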
C. The Hardware Stack
The stack design is a pure synchronous implementation. It is
four 16 bit registers which are always loaded every cycle. If
the current opcode is a CALL then the register above it is
loaded (pushed). If the current opcode is a RETURN then the
register below it is loaded (popped). If the opcode is neither
then its own output is multiplexed back to retain its present
value.
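A behavioral sketch of this stack follows. It models four registers that all reload on every clock, with the source chosen by the opcode. The class and method names are hypothetical, and what the bottom register loads on a RETURN is not stated in the paper, so holding its value is an assumption.

```python
class HardwareStack:
    """Four 16 bit registers, index 0 being the top of the stack."""

    DEPTH = 4

    def __init__(self):
        self.regs = [0] * self.DEPTH

    def clock(self, op, return_address=0):
        """Advance one cycle and return the top of stack as seen before the edge."""
        old = self.regs
        if op == "CALL":
            # Push: every register loads its neighbor; the oldest entry is lost.
            self.regs = [return_address & 0xFFFF] + old[:-1]
        elif op == "RETURN":
            # Pop: every register loads the one below it (bottom holds, assumed).
            self.regs = old[1:] + [old[-1]]
        else:
            # Any other opcode: each register reloads its own output.
            self.regs = list(old)
        return old[0]


stack = HardwareStack()
stack.clock("CALL", 0x0010)
stack.clock("CALL", 0x0020)
print(hex(stack.clock("RETURN")))
```

Because there is no stack pointer or memory access involved, a CALL or RETURN completes in the same single cycle as any other instruction, at the cost of a fixed nesting depth of four.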
VIII. MEMORY UNIT
The RISC does not have complicated memory management.
However, the memory unit does the functions of the register
file address generation, the A-B-C data bus multiplexing, and
the flip-flops for the instruction word for data unpacking.
IX. OTHER FEATURES
This RISC was designed as an embedded controller for
audio applications and has some unique features.
A. Indirect Addressing
Since the execution unit does not have a multiplier, audio
compression and decompression is done by table look-up. For
this function indirect addressing is very important. Both the
register file addresses and the external memory addressing can
be done through another set of registers. The register file
addressing costs another instruction to set a unique page
register. The external memory can be addressed using a
register pair from the register file. The logic for the register
pair addressing can also be used for indirect addressing on
subroutine calls or jump instructions.
B. External Register Addressing
As mentioned before there are 8 address bits for the register
and the register file only uses the lower 128 bytes. The higher
128 bytes are user defined to be specific registers through the
integrated circuit design. These external registers interface to
the RISC module through the C bus. This means that external
registers can be specified as a source or destination in an ADD
or other execution unit instruction. Other architectures require
loading the values first to an internal register.
C. User Defined Flags
The RISC has the four standard condition flags for Carry,
Zero, Sign, and Overflow. However there are eight flags total
that can be utilized. The other four flag bits are user defined.
For example this can be a signal such as a FIFO “full” flag.
D. ROM Data Packing
The instruction width is 24 bits. However, to perform table
look-up compression, data must be read from the ROM using
byte addressing. The three byte wide words are packed with
the LSB two data bytes sequentially up to the top of the
memory and then back down using the MSB bytes.
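The packing scheme is only described in one sentence, so the sketch below is one possible reading of it, not the documented address mapping. It assumes an 8Kx24 ROM, that the first 16K byte addresses walk upward through the two least significant bytes of each word, and that the remaining 8K byte addresses walk back down through the most significant byte. The function name and lane numbering (0 = least significant byte) are hypothetical.

```python
WORDS = 8 * 1024  # assumed 8Kx24 configuration


def unpack_byte_address(byte_addr, words=WORDS):
    """Map a data byte address to (word address, byte lane) under the assumed scheme."""
    if not 0 <= byte_addr < 3 * words:
        raise ValueError("byte address outside the ROM")
    if byte_addr < 2 * words:
        # Lower two bytes of each word, filled sequentially upward.
        return byte_addr // 2, byte_addr % 2
    # Most significant byte of each word, filled from the top word downward.
    return words - 1 - (byte_addr - 2 * words), 2


print(unpack_byte_address(0), unpack_byte_address(16384))
```

Under this reading every one of the 24K bytes in the ROM is reachable through a single linear byte address.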
E. Hardware Stack
This is a feature which can be a disadvantage. Using a
hardware stack simplifies the design and speeds up the
execution. The subroutine RETURN in the CISC took 19
cycles while the RETURN in the RISC takes one cycle.
However, the design is limited to only four nested calls and
interrupts which is sufficient for the audio design. This can be
unnerving to some programmers.
F. Large Register File
The RISC has 128 bytes of general purpose registers. This
was large enough that no external RAM was used for the audio
application. The original CISC design did use an external
RAM in addition to its internal registers.
X. DESIGN IMPLEMENTATION
A. Engineers
The RISC was designed by Roi Peers and Robert Plachno.
Roi was the architect on the PC audio chips. He did the system
level design and the software coding. For the RISC he defined
the requirements and helped with the architecture. Robert
Plachno had designed several processors prior to this RISC
including smaller embedded controllers and a larger 64 bit
processor. Robert did the design and simulation of the RISC.
Vincent Chueng replaced the CISC with the RISC in the audio
chip and fixed the software timing issues.
B. Semi-Custom Design
The RISC was designed in CMOS technology and run at
numerous fabrication houses from 0.6µ to 0.35µ channel
lengths. The data paths were designed at a transistor level.
Four sections including the data paths for the execution unit,
instruction unit, memory unit, and the register file were custom
laid out and placed together in a rectangular area. The
remaining control logic was routed as standard cells. Figure 5
shows the PC audio chip with the RISC in the top left.
C. CAE Tools
The original design was entered using the ORCAD
schematic tool. The simulations were done using Robert
Plachno’s EESIM. This simulator allows a mixed mode SPICE
and logic netlist. Modules can be simulated at the transistor,
gate or behavioral level. The simulator also indicates the 10
worst (or specified) set-up times, hold times, etc., and calculates
the power dissipation and test vector coverage. The initial
design was simulated on a PC and then progressed to a UNIX
system. Eventually the schematics were recaptured into
Cadence and simulated using Verilog.
D. Initial Debug
The design was initially done as a test chip on a multi-up
mask set. This was debugged using test vectors transferred to
an IMS tester. Then the CISC was replaced by the RISC in a
full PC audio design. The assembler (written by Plachno)
automatically cross-assembles the CISC instruction set to the
RISC instruction set. However, certain parts of the program
were found to be self-timed (software emulated serial port) and
other problems occurred since the processor performed about
6x faster. I would describe replacing the CISC with the RISC
in the design as having moderately few issues.
XI. CONCLUSION
A design for a single cycle RISC processor has been
discussed that does not use pipelining. This operation is
obtained by folding the processor execution into the memory
cycle. The RISC uses a ripple through latch style of design as
opposed to a synchronous flip-flop style design. The design
improvements include:
• 6x performance improvement.
• Wide power supply range for both desktop and
notebook applications.
• No royalty fees.
• Minimum software impact.
• This is an architectural change only. No process
improvements or clock rate increase were required.
Figure 1. RISC Instruction Set Definition
Figure 2. Register File Dual Port Memory Cell
Figure 3. ALU Design
Figure 4. Barrel Shifter
Figure 5. PC Audio Chip with RISC