3. CPU DESIGN
INTRODUCTION
As increased CPU speed and complexity yield higher peak performance ratings, it
becomes ever more difficult for the rest of the computer system to keep up. To
deliver high sustained performance on real applications, the entire system must be
balanced in both bandwidth and latency. In other words, students will learn how to
convert the potential of a modern CPU into a computer system that can actually
deliver high performance.
CPU designers construct these processors using logic gates to execute the
instructions. To keep the number of logic gates reasonably small, CPU designers
must necessarily restrict the number and complexity of the commands the CPU
recognizes. This small set of commands is the CPU's instruction set.
Programs in early (pre-Von Neumann) computer systems were often "hard-wired"
into the circuitry. That is, the computer's wiring determined what problem the
computer would solve. One had to rewire the circuitry in order to change the
program, a very difficult task. The next advance in computer design was the
programmable computer system, one that allowed a computer programmer to
easily "rewire" the computer system using a sequence of sockets and plug wires. A
computer program consisted of a set of rows of holes (sockets), each row
representing one operation during the execution of the program. The programmer
could select one of several instructions by plugging a wire into the particular socket
for the desired instruction.
Figure:- Patch Panel Programming
Of course, a major difficulty with this scheme is that the number of possible
instructions is severely limited by the number of sockets one could physically place
on each row. However, CPU designers quickly discovered that with a small
amount of additional logic circuitry, they could reduce the number of sockets
required from n holes for n instructions to log2(n) holes for n instructions. They did
this by assigning a numeric code to each instruction and then encoding that
instruction as a binary number using log2(n) holes.
Figure: Encoding Instructions
This addition requires eight logic functions to decode the A, B, and C bits from the
patch panel, but the extra circuitry is well worth the cost because it reduces the
number of sockets that must be repeated for each instruction (this circuitry, by the
way, is nothing more than a single three-line to eight-line decoder).
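The log2(n) socket-count reduction and the decoder can be sketched numerically. A minimal Python sketch (the function names and the list-based decoder model are illustrative stand-ins for the logic circuitry, not part of the original design):

```python
import math

def bits_needed(n_instructions: int) -> int:
    """Sockets (bits) needed to encode n instructions in binary."""
    return math.ceil(math.log2(n_instructions))

def decode_3_to_8(code: int) -> list[int]:
    """A 3-line to 8-line decoder: exactly one output line is asserted."""
    assert 0 <= code < 8
    return [1 if i == code else 0 for i in range(8)]

# 8 instructions need only 3 holes per row instead of 8
print(bits_needed(8))    # 3
print(decode_3_to_8(5))  # [0, 0, 0, 0, 0, 1, 0, 0]
```

Note that the saving grows quickly: 64 instructions need only 6 sockets per row rather than 64.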
CPU design focuses on these areas:
1. Datapaths (such as ALUs and pipelines)
2. Control unit: logic which controls the datapaths
3. Memory components such as register files and caches
4. Clock circuitry such as clock drivers, PLLs, and clock distribution networks
5. Pad transceiver circuitry
6. The logic gate cell library used to implement the logic
CPUs designed for high-performance markets might require custom designs for
each of these items to achieve frequency, power-dissipation, and chip-area goals.
CPUs designed for lower performance markets might lessen the implementation
burden by:
- Acquiring some of these items by purchasing them as intellectual property
- Using control logic implementation techniques (logic synthesis using CAD
tools) to implement the other components (datapaths, register files, clocks)
Common logic styles used in CPU design include:
- Unstructured random logic
- Finite-state machines
- Microprogramming (common from 1965 to 1985, no longer common except
for CISC CPUs)
- Programmable logic array (common in the 1980s, no longer common)
Device types used to implement the logic include:
- Transistor-transistor logic (TTL) Small Scale Integration "jelly-bean" logic
chips - no longer used for CPUs
- Programmable Array Logic and programmable logic devices - no longer
used for CPUs
- Emitter-coupled logic (ECL) gate arrays - no longer common
- CMOS gate arrays - no longer used for CPUs
- CMOS ASICs - what is commonly used today; they are so common that the
term ASIC is not used for CPUs
- Field-programmable gate arrays (FPGAs) - common for soft
microprocessors, and more or less required for reconfigurable computing
A CPU design project generally has these major tasks:
- Programmer-visible instruction set architecture, which can be implemented
by a variety of microarchitectures
- Architectural study and performance modeling in ANSI C/C++ or SystemC
- High-level synthesis (HLS) or RTL (e.g. logic) implementation
- RTL verification
- Circuit design of speed-critical components (caches, registers, ALUs)
- Logic synthesis or logic-gate-level design
- Timing analysis to confirm that all logic and circuits will run at the
specified operating frequency
- Physical design, including floorplanning and place-and-route of logic gates
- Checking that the RTL, gate-level, transistor-level, and physical-level
representations are equivalent
- Checks for signal integrity and chip manufacturability
As with most complex electronic designs, the logic verification effort (proving that
the design does not have bugs) now dominates the project schedule of a CPU.
Key CPU architectural innovations include cache, virtual memory, instruction
pipelining, superscalar, CISC, RISC, virtual machine, emulators, microprogram,
and stack.
Goals
The first CPUs were designed to do mathematical calculations faster and more
reliably than human computers.
Each successive generation of CPU might be designed to achieve some of these
goals:
- higher performance levels for a single program or thread
- higher throughput levels for multiple programs/threads
- less power consumption for the same performance level
- lower cost for the same performance level
- greater connectivity to build larger, more parallel systems
- more specialization to aid in specific targeted markets
Re-designing a CPU core to a smaller die area helps achieve several of these goals:
1. Shrinking everything (a "photomask shrink"), resulting in the same number
of transistors on a smaller die, improves performance (smaller transistors
switch faster), reduces power (smaller wires have less parasitic capacitance),
and reduces cost (more CPUs fit on the same wafer of silicon).
2. Releasing a CPU on the same size die, but with a smaller CPU core, keeps
the cost about the same but allows higher levels of integration within one
VLSI chip (additional cache, multiple CPUs, or other components),
improving performance and reducing overall system cost.
Performance analysis and benchmarking
Because there are too many programs to test a CPU's speed on all of them,
benchmarks were developed. The most famous benchmarks are the SPECint and
SPECfp benchmarks developed by Standard Performance Evaluation Corporation
and the ConsumerMark benchmark developed by the Embedded Microprocessor
Benchmark Consortium (EEMBC).
Some important measurements include:
- Instructions per second - Most consumers pick a computer architecture
(normally the Intel IA-32 architecture) to be able to run a large base of
pre-existing, pre-compiled software. Being relatively uninformed about
computer benchmarks, some of them pick a particular CPU based on
operating frequency (see Megahertz Myth).
- FLOPS - The number of floating-point operations per second is often
important in selecting computers for scientific computations.
- Performance per watt - System designers building parallel computers, such
as Google, pick CPUs based on their speed per watt of power, because the
cost of powering the CPU outweighs the cost of the CPU itself.
- Performance per dollar - Some system designers building parallel
computers pick CPUs based on speed per dollar.
- Deterministic response - System designers building real-time computing
systems want to guarantee worst-case response. That is easier to do when
the CPU has low interrupt latency and deterministic response, as in a DSP.
- Full instruction set - Computer programmers who program directly in
assembly language want a CPU to support a full-featured instruction set.
- Low power - For systems with limited power sources (e.g. solar, batteries,
human power).
- Small size or low weight - For portable embedded systems and systems for
spacecraft.
- Environmental impact - Minimizing the environmental impact of computers
during manufacturing, use, and recycling: reducing waste and reducing
hazardous materials (see Green computing).
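The trade-offs among these metrics can be illustrated with a toy comparison. The CPU names and figures below are entirely hypothetical, chosen only to show how different metrics can rank the same parts differently:

```python
# Hypothetical CPU figures (illustrative numbers, not real parts)
cpus = {
    "A": {"score": 120, "watts": 65, "price": 200},
    "B": {"score": 150, "watts": 125, "price": 250},
    "C": {"score": 90, "watts": 35, "price": 120},
}

def best_by(metric):
    """Return the CPU name that maximizes the given metric."""
    return max(cpus, key=lambda name: metric(cpus[name]))

raw_speed = best_by(lambda c: c["score"])            # what benchmarks rank
perf_per_watt = best_by(lambda c: c["score"] / c["watts"])
perf_per_dollar = best_by(lambda c: c["score"] / c["price"])
print(raw_speed, perf_per_watt, perf_per_dollar)     # B C C
```

Here the fastest part ("B") is the worst choice for a power- or cost-constrained system, which is exactly why the measurements above are evaluated separately.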
This chapter describes the features found in practical machines that
enhance the performance of the control unit. The following parameters are
usually considered in the design of a control unit:
Speed
The control unit should generate control signals fast enough to
utilize the processor bus structure most efficiently and minimize the
instruction execution time.
Cost and complexity
The control unit is the most complex subsystem of any processing system.
The complexity should be reduced as much as possible to make
maintenance easier and the cost low. In general, random logic (hardwired)
implementations of the control unit are faster but more complex and harder
to modify than microprogrammed implementations.
3. Microprogrammed Control
The control signals needed in each step of instruction execution can be generated
by the finite state machine method, also called hardwired control, or, alternatively,
by the microprogrammed control method discussed below.

Basic Concepts of Microprogramming:
- Control word (CW): A word with one bit for each control signal. Each step
of the instruction execution is represented by a control word in which all of
the bits corresponding to the control signals needed for that step are set to
one.
- Microinstruction: Each step in the sequence of steps in the execution of a
certain machine instruction is considered a microinstruction, and it is
represented by a control word. All of the bits corresponding to the control
signals that need to be asserted in this step are set to 1, and all others are
set to 0 (horizontal organization).
- Microprogram: A sequence of microinstructions corresponding to the
sequence of steps in the execution of a given machine instruction.
- Microprogramming: The method of generating the control signals by
properly setting the individual bits in the control word of each step.
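These definitions can be modeled with bitmasks. The signal names and the two-step fetch sequence below are hypothetical examples of a horizontal organization, not any real machine's control store:

```python
# Hypothetical control signals for a toy datapath (one bit each)
SIGNALS = ["PCout", "MARin", "Read", "MDRout", "IRin", "PCin", "ALUadd", "ACCin"]
BIT = {name: 1 << i for i, name in enumerate(SIGNALS)}

def control_word(*asserted):
    """Build a horizontal control word: set one bit per asserted signal."""
    word = 0
    for name in asserted:
        word |= BIT[name]
    return word

# A microprogram is a sequence of control words, one per step
fetch_microprogram = [
    control_word("PCout", "MARin", "Read"),  # step 1: send PC address to memory
    control_word("MDRout", "IRin"),          # step 2: latch instruction into IR
]

def asserted_signals(word):
    """Recover which control signals a control word asserts."""
    return [name for name in SIGNALS if word & BIT[name]]

print(asserted_signals(fetch_microprogram[0]))  # ['PCout', 'MARin', 'Read']
```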
The four classifications defined by Flynn are based upon the number of concurrent
instruction (or control) and data streams available in the architecture:
3.1.1 Single Instruction, Single Data stream (SISD)
A sequential computer which exploits no parallelism in either the
instruction or data streams. Examples of SISD architecture are the
traditional uniprocessor machines like a PC or old mainframes.
SISD (Single Instruction, Single Data) is a term
referring to a computer architecture in which a single processor, a
uniprocessor, executes a single instruction stream, to operate on data stored
in a single memory. This corresponds to the von Neumann architecture.
SISD is one of the four main classifications defined in Flynn's taxonomy.
In this system, classifications are based upon the number of concurrent
instructions and data streams present in the computer architecture.
According to Michael J. Flynn, SISD can have concurrent processing
characteristics. Instruction fetching and pipelined execution of instructions
are common examples found in most modern SISD computers.
3.1.2 Single Instruction, Multiple Data streams (SIMD)
A computer which exploits multiple data streams against a single instruction
stream to perform operations which may be naturally parallelized. For example, an
array processor or GPU. SIMD (Single Instruction, Multiple Data; colloquially,
"vector instructions") is a technique employed to achieve data level parallelism. An
application that may take advantage of SIMD is one where the same value is being
added (or subtracted) to a large number of data points, a common operation in
many multimedia applications. One example would be changing the brightness of
an image. Each pixel of an image consists of three values for the brightness of the
red, green and blue portions of the color. To change the brightness, the R G and B
values are read from memory, a value is added (or subtracted) from them, and the
resulting values are written back out to memory.
With a SIMD processor there are two improvements to this process. For one, the
data is understood to be in blocks, and a number of values can be loaded all at
once. Instead of a series of instructions saying "get this pixel, now get the next
pixel", a SIMD processor will have a single instruction that effectively says "get
lots of pixels" ("lots" is a number that varies from design to design). For a variety
of reasons, this can take much less time than "getting" each pixel individually, like
with traditional CPU design.
Another advantage is that SIMD systems typically include only those instructions
that can be applied to all of the data in one operation. In other words, if the SIMD
system works by loading up eight data points at once, the add operation being
applied to the data will happen to all eight values at the same time. Although the
same is true for any superscalar processor design, the level of parallelism in a
SIMD system is typically much higher.
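The blocked "get lots of pixels" idea can be sketched in plain Python. This is only a software simulation of the SIMD pattern; the block size of 8 and the clamping to the 0-255 pixel range are assumptions for illustration:

```python
def brighten_scalar(pixels, delta):
    """Scalar style: fetch, add, and store one pixel value at a time."""
    out = []
    for p in pixels:
        out.append(min(255, max(0, p + delta)))
    return out

def brighten_blocked(pixels, delta, block=8):
    """SIMD style: process a fixed-size block per 'instruction'.
    The block operation is simulated with a comprehension here;
    a real SIMD unit applies the add to all lanes simultaneously."""
    out = []
    for i in range(0, len(pixels), block):
        chunk = pixels[i:i + block]  # "get lots of pixels" in one load
        out.extend(min(255, max(0, p + delta)) for p in chunk)  # one vector add
    return out

rgb = [10, 200, 250, 0, 100, 255, 30, 60]
print(brighten_scalar(rgb, 20) == brighten_blocked(rgb, 20))  # True
```

Both functions compute the same result; the SIMD version simply issues far fewer "instructions" for the same amount of data.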
3.1.3 Multiple Instruction, Single Data stream (MISD)
Multiple instructions operate on a single data stream. This is an uncommon
architecture, generally used for fault tolerance: heterogeneous systems
operate on the same data stream and must agree on the result.
Examples include the Space Shuttle flight control computer.
MISD (Multiple Instruction, Single Data) is a type of parallel computing
architecture where many functional units perform different operations on
the same data. Pipeline architectures belong to this type, though a purist
might say that the data is different after processing by each stage in the
pipeline. Fault-tolerant computers executing the same instructions
redundantly in order to detect and mask errors, in a manner known as task
replication, may be considered to belong to this type. Not many instances of
this architecture exist, as MIMD and SIMD are often more appropriate for
common data parallel techniques. Specifically, they allow better scaling and
use of computational resources than MISD does.
3.1.4 Multiple Instruction, Multiple Data streams (MIMD)
Multiple autonomous processors simultaneously execute different
instructions on different data. Distributed systems are generally recognized
to be MIMD architectures, exploiting either a single shared memory space
or a distributed memory space.
MIMD (Multiple Instruction stream, Multiple Data stream) is a technique
employed to achieve parallelism. Machines using MIMD have a number of
processors that function asynchronously and independently. At any time,
different processors may be executing different instructions on different
pieces of data. MIMD architectures may be used in a number of application
areas such as computer-aided design/computer-aided manufacturing,
simulation, modeling, and as communication switches. MIMD machines
can be of either shared memory or distributed memory categories. These
classifications are based on how MIMD processors access memory. Shared
memory machines may be of the bus-based, extended, or hierarchical type.
Distributed memory machines may have hypercube or mesh
interconnection schemes.

Horizontal vs. Vertical
As the CPU may need hundreds of control signals, the control word will
inevitably be long. To reduce the length of the control word, groups of control
signals that are mutually exclusive (only one of them need be asserted at a
time) can be encoded to form shorter fields. This shorter form of control
word is called vertical organization.
For example, if only 1 of a group of 8 signals is needed at any time, they
can be encoded into a field of log2(8) = 3 bits, instead of 8 bits. The price
to pay is the time delay needed for decoding the encoded field.
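The 8-signals-into-3-bits encoding can be shown directly. A small sketch, where the one-hot `decode` function models the decoder hardware (and its delay) that the vertical organization must add:

```python
import math

N_SIGNALS = 8
FIELD_WIDTH = int(math.log2(N_SIGNALS))  # 3 bits instead of 8

def encode(signal_index: int) -> int:
    """Vertical organization: store which signal is asserted,
    not one bit per signal."""
    assert 0 <= signal_index < N_SIGNALS
    return signal_index  # fits in FIELD_WIDTH bits

def decode(field: int) -> list[int]:
    """The decoder (extra hardware, extra delay) expands the short
    field back into the full one-hot set of control signals."""
    return [1 if i == field else 0 for i in range(N_SIGNALS)]

print(FIELD_WIDTH)        # 3
print(decode(encode(6)))  # [0, 0, 0, 0, 0, 0, 1, 0]
```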
Flexibility
Hardwired control units (HCUs) are inflexible in terms of adding new
architectural features to the processor, since they require a redesign of the
hardware. Microprogrammed control units (MCUs) offer very high flexibility,
since microprograms can be easily updated without a substantial redesign
of the hardware involved. With the advances in
hardware technology, faster and more versatile processors are introduced
to the market very rapidly. This requires that the design cycle time for
newer processors be as small as possible. Since the design costs
must be recovered over the short life span of a new processor, they must
be minimized. MCUs offer such flexibility and low-cost redesign
capabilities, although they are inherently slower than HCUs. This
speed differential between the two designs is getting smaller, since in the
current technology, the MCU is fabricated on the same IC (i.e. with the
same technology) as the rest of the processor. We will concentrate on the
popular speed-enhancement techniques used in contemporary machines in
this chapter.
3.2 SPEED ENHANCEMENT
In ASC, the control unit fetches an instruction, decodes it and executes it,
before fetching the next instruction as dictated by the program logic. That is, the
control unit brings about the instruction cycles one after the other. With this serial
instruction execution mode, the only way to increase the speed of execution of the
overall program is to minimize the instruction cycle time of the individual
instructions. This concept is called the instruction cycle speedup. The program
execution time can be reduced further, if the instruction cycles can be overlapped.
That is, if the next instruction can be fetched and/or decoded during the current
instruction cycle. This overlapped operation mode is termed instruction execution
overlap or more commonly pipelining. Another obvious technique would be to
bring about the instruction cycles of more than one instruction simultaneously.
This is the parallel mode of instruction execution. We will now describe these
speed enhancement mechanisms.
3.2.1 Instruction Cycle Speedup
Recall that the processor cycle time (i.e., minor cycle time) depends on the
register transfer time on the processor bus structure. If the processor structure
consists of multiple buses, it is possible to perform several register transfers
simultaneously. This requires that the control unit produce the appropriate control
signals simultaneously. In a synchronous HCU, the processor cycle time is fixed
by the slowest register transfer. Thus even the fastest register transfer operation
consumes a complete processor cycle. In an asynchronous HCU, the completion
of one register transfer triggers the next; therefore, if properly designed, the
asynchronous HCU would be faster than the synchronous HCU. Since the design
and maintenance of an asynchronous HCU is difficult, the majority of the practical
processors have synchronous HCUs. An MCU is slower than an HCU since the
microinstruction execution time is the sum of processor
cycle time and the CROM access time. The HCU of ASC has the simplest
configuration possible. Each instruction cycle is divided into one or more phases
(states or major cycles), each phase consisting of four processor cycles (i.e.,
minor cycles). A majority of actual control units are synchronous control units that
are, in essence, enhanced versions of the ASC control unit. For example, it is not
necessary to use up a complete major cycle if the micro operations corresponding
to an instruction execution (or fetch or defer) can be completed in a part of the
major cycle. The only optimization performed in the ASC control unit was
to reduce the number of major cycles needed to execute certain instructions
(SHR, SHL) by not entering an execute cycle, since all the required micro
operations to implement those instructions could be completed in one major cycle.
Further optimization is possible. For example, the micro operations corresponding
to the execution of each branch instruction (BRU, BIP, BIN) could all be completed
in one minor cycle rather than in a complete major cycle as they are in the ASC
control unit. Thus, three minor cycles could be saved in the execution of branch
instructions by returning to the fetch cycle after the first minor cycle in the execute
cycle. When such enhancements are implemented, the state-change circuitry of
the control unit becomes more complex but the execution speed increases. Note
that in the case of an MCU, the concept of the major cycle is not present, and the
basic unit we work with is the minor cycle (i.e., processor cycle time + CROM
access time). Thus the lengths of microprograms corresponding to each
instruction are different. Each microinstruction is executed in one minor cycle, and
the microprograms do not include any idle cycles. In developing the microprogram
for ASC (Table 5.6) the microoperation sequences from the HCU design were
reorganized to make them as short as possible.
Section 5 provides the instruction cycle details of Intel 8080, to
illustrate the instruction cycle speedup concept. Although this is an obsolete
processor, it was selected for its simplicity. The more recent Intel processors
adopt these techniques very extensively.
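The cycle arithmetic above (an MCU minor cycle is the processor cycle time plus the CROM access time; an ASC-style HCU major cycle is four minor cycles) can be made concrete. All timing numbers below are hypothetical, chosen only for illustration:

```python
# Illustrative timings in nanoseconds (hypothetical, not from any real machine)
PROCESSOR_CYCLE = 10  # register transfer time on the processor bus
CROM_ACCESS = 4       # control ROM read time

hcu_minor_cycle = PROCESSOR_CYCLE
mcu_minor_cycle = PROCESSOR_CYCLE + CROM_ACCESS  # extra CROM access per step

def hcu_instruction_time(major_cycles, cycles_per_major=4):
    """ASC-style synchronous HCU: each major cycle consumes four
    minor cycles, even when some of them are idle."""
    return major_cycles * cycles_per_major * hcu_minor_cycle

def mcu_instruction_time(microinstructions):
    """MCU: one minor cycle per microinstruction, with no idle cycles."""
    return microinstructions * mcu_minor_cycle

# A fetch + execute sequence: 2 major cycles on the HCU vs. a
# 5-microinstruction microprogram on the MCU
print(hcu_instruction_time(2))  # 80 (ns)
print(mcu_instruction_time(5))  # 70 (ns)
```

With these (hypothetical) numbers the MCU wins despite its slower minor cycle, because it skips the idle cycles that pad out the fixed-length HCU major cycle; this is the effect the text attributes to removing idle cycles from the microprograms.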
3.2.2 Instruction Execution Overlap
Note that in INTEL 8080, for instructions such as ADD r, once the memory
operand is fetched into the CPU, addition is performed while the CPU is
fetching the next instruction in sequence from the memory. This overlap of
instruction fetch and execute phases increases the program execution speed. In
general, the control unit can be envisioned as a device with three
subfunctions: fetch, decode (or address computation), and execute. If the
control unit is designed in a modular form with one module for each of these
functions, it is possible to overlap the instruction-processing functions. The
overlapped processing is brought about by a pipeline. A pipeline is a structure
that, like an automobile assembly line, consists of several stations, each of which
is capable of performing a certain subtask. The work flows from one station to the
next. As the work leaves a station, the subsequent unit of the work is picked up by
that station. When the work leaves the last station in the pipeline the task is
complete. If the pipeline has N stations and the work stays at each station for T
seconds, the complete processing time for a task is (N × T) seconds. But since all
the N stations are working in an overlapped manner (on various tasks), the
pipeline outputs one completed task every T seconds (after an initial period in
which the pipeline is being filled).
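The N-station, T-per-station arithmetic can be checked with a short helper. The fill-period formula (N + n − 1) × T for n tasks is the standard consequence of the overlap just described:

```python
def pipeline_times(n_stations: int, t_per_station: float, n_tasks: int):
    """Latency for one task is N*T; after the fill period the pipeline
    completes one task every T, so n tasks take (N + n - 1) * T."""
    latency = n_stations * t_per_station
    total = (n_stations + n_tasks - 1) * t_per_station
    return latency, total

# 3-stage pipeline (fetch, decode, execute), T = 1 time slot, 10 instructions
latency, total = pipeline_times(3, 1, 10)
print(latency, total)  # 3 12
# Serial execution of the same 10 instructions would take 3 * 10 = 30 slots
```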
Figure 9.1 introduces the concept of an instruction processing pipeline. The
control unit has three modules. The processing sequence is shown in (b).
Any time after t2, the first module will be fetching instruction (I + 1), the
second module will be decoding instruction I, while the last module will be
executing instruction (I – 1). From t3 onwards, the pipeline flows full and the
throughput is one instruction per time slot.
For simplicity, we have assumed that each module in the above pipeline
consumes the same amount of processing time. If such equal time
partitioning of the processing task cannot be made, intermediate registers
to hold the results and flags that indicate the completion of one task and the
beginning of the next are needed. We have assumed that the instructions are
always executed in the sequence they appear in the program. This assumption is
valid as long as the program does not contain a branch instruction. When a branch
instruction
Figure: Pipelined instruction processing
is encountered, the next instruction is to be fetched from the target
address of the branch instruction. If it is a conditional branch, the target
address would not be known until the instruction reaches the execute stage. If a
branch is indeed taken, then the instructions following the branch
instruction that are already in the pipeline need to be discarded, and the
pipeline needs to be filled from the target address. Another approach would
be to stop fetching subsequent instructions into the pipeline, once the
branch instruction is fetched, until the target address is known. The former
approach is preferred for handling conditional branches since there is a
good chance that the branch might not occur; in that case, the pipeline
would flow full. There is an inefficiency in the pipeline flow only when a
branch occurs. For unconditional branches, the latter approach can be used. The
following mechanisms have been used to handle the conditional branch
inefficiency of pipeline structures.
3.3 Branch Prediction
It is sometimes possible to predict the target address of a conditional branch
based on the execution characteristics. For example, the target address of a
branch instruction controlling the iterations through a loop is most likely the
address of the first instruction in the loop, except for the last time through the
iteration. If this compiler-dependent characteristic is known, the most likely target
address can be used to fetch the subsequent instructions into the pipeline.
Branch History If the statistics on the previous branches for each instruction are
maintained, the most likely target address can be inferred.
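The branch-history idea can be sketched as a 1-bit predictor: remember each branch's last outcome and predict the same outcome again. This is a minimal illustrative model, not a description of any particular machine's predictor:

```python
def run_1bit_predictor(outcomes):
    """A minimal 1-bit branch-history predictor for a single branch:
    predict whatever the branch did last time. Returns the number of
    correct predictions."""
    prediction = True  # assume taken (the loop-back case) initially
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember only the most recent outcome
    return correct

# A loop branch that is taken 9 times, then falls through on exit
history = [True] * 9 + [False]
print(run_1bit_predictor(history), "of", len(history))  # 9 of 10
```

For loop-controlling branches like this one, the predictor mispredicts only on the final iteration, which matches the "most likely target is the top of the loop, except the last time through" observation above.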
Delayed Branching In the three-stage pipeline of Fig. 9.1, two instructions
following the branch instruction will have entered the pipeline by the time the
target address is determined. If, in the instruction stream, the branch instruction is
moved ahead two positions from where it logically appears in the program, the
execution of instructions in the pipeline can proceed while the target address is
determined.
All three stages of the instruction processing pipeline can potentially
access the main memory simultaneously, thereby increasing the traffic on
the memory bus. A memory system should provide for such simultaneous
access and high throughput by multiport, interleaving, and banking
schemes.
In order to implement the control unit as a pipeline, each stage should be
designed to operate independently, performing its own function while sharing
resources such as the processor bus structure and the main memory system.
Such designs become very complex. Refer to the books listed in the
Reference section for further details. The pipeline concept is now used very
extensively in all modern processors. It is typical for processors today to have
four or five stages in their pipelines. As hardware technology progresses,
processors with deeper pipelines (i.e., pipelines with larger numbers of stages)
have been introduced. These processors belong to the so-called superpipelined
processor class; subsequent chapters provide some examples.
3.3.1 Parallel Instruction Execution
The control unit of these processors fetches instructions from the same instruction
stream (i.e., program), decodes them and delivers them to the
appropriate execution unit. Thus, the execution units would be working in
parallel, each executing its own instruction. The control unit must now be
capable of creating these parallel streams of execution and synchronizing the two
streams appropriately, based on the precedence constraints imposed by the
program. That is, the result of the computation must be the same, whether the
program is executed serially or in parallel. This class of
architectures, in which more than one instruction is processed
simultaneously, is called a superscalar architecture.
3.3.2
Instruction Buffer and Cache
The instruction buffer schemes used by processors such as INTEL 8086 and CDC
6600, and the instruction cache schemes used by processors such as MC68020
also bring about instruction processing overlap, although at the complete
instruction level. That is, the operation of fetching instructions into the buffer is
overlapped with the operation of retrieving and executing instructions that are in
the buffer.
4. HARDWIRED CONTROL UNITS
All the speedup techniques described in the previous section have been
adopted by HCUs of practical machines. As mentioned earlier, the main
advantage of the HCU is its speed, while the disadvantage is its inflexibility.
Although asynchronous HCUs offer a higher speed capability than
synchronous HCUs, the majority of practical machines have synchronous
HCUs because they have the simpler design of the two. In the current VLSI
era, complete processing systems are fabricated on a single IC. Because
the control unit is the most complex unit, it occupies a large percentage of
the chip ‘‘real estate.’’ Its complexity increases as the number of
instructions in the instruction set of the machine increases. Studies have
shown that, in practice, only a few instructions are used by programmers,
although the machine offers a large instruction set. One way, then, of
reducing the complexity of the control unit is to have a small instruction set.
Such machines with small sets of powerful instructions are
called reduced instruction set computers (RISC). Section 9.4 describes
RISCs further.
4.1 MICROPROGRAMMED CONTROL UNITS
The execution time for an instruction is proportional to the number of
microoperations required and hence the length of microprogram sequence
for the instruction. Since a microprogrammed control unit (MCU) starts
fetching the next instruction once the last microoperation of the current
instruction's microprogram is executed, the MCU can be treated as an
asynchronous control unit. An MCU is slower than the hardwired control
unit because of the addition of CROM access time to the register transfer
time. But it is more flexible than hardwired CU and requires minimum
changes in the hardware if the instruction set is to be modified or
enhanced.
The CROM word size is one of the design parameters of an MCU. Although
the price of ROMs is decreasing, the cost of data path circuits required
within the control unit increases as the CROM word size increases.
Therefore, the word size should be reduced to reduce the cost of the MCU.
We will now examine the microinstruction formats used by practical
machines with respect to their cost effectiveness.
Figure: Pipelining in an MCU
The most common format for a microinstruction is shown in Fig.
The ‘‘instruction’’ portion of the microinstruction is used in generating the
control signals, and the ‘‘address’’ portion indicates the address of the next
microinstruction. Execution of such a microinstruction corresponds to the
generation of control signals and transferring the address portion to the
μMAR to retrieve the next microinstruction. The advantage of this format is
that very little external circuitry is needed to generate the next
microinstruction address, while the disadvantage is that conditional
branches in the microprogram cannot be easily coded. The format shown in
(b) allows for conditional branching. It is now assumed that when the
condition is not satisfied, the μMAR is simply incremented to point to the
next microinstruction in sequence. However, this requires additional
μMAR circuitry. The microinstruction format shown in (c) explicitly codes
the jump addresses corresponding to both the outcomes of the test
condition, thus reducing the external μMAR circuitry. The CROM word size
will be large in all the above formats since they contain one or more
address fields. It is possible to reduce the CROM word size if the address
representation can be made implicit. The format shown in (d) is similar to
the one used by the MCU of ASC.
In the horizontal (or unpacked) microinstruction, each bit corresponds to
one control signal; since no decoding is needed, the control signals are
available immediately, making
the MCU fast. The disadvantage is that this requires larger CROM words.
Also, instruction encoding is cumbersome because a thorough familiarity
with the processor hardware structure is needed to prevent the generation
of control signals that cause conflicting operations in the processor
hardware. In the vertical (or packed) microinstruction, the instruction is
divided into several fields, each field corresponding to either a resource or
a function in the processing hardware. (In the design of ASC MCU, each
field corresponded to a resource such as ALU, BUS1, etc.). Vertical
microinstruction reduces the CROM word size; but the decoders needed to
generate control signals from each field of the instruction contribute to the
delays in control signals. Encoding for vertical microinstruction is easier
than that for horizontal microinstruction because of the former’s
function/resource partitioning. In the foregoing discussion, we have
assumed that all the control signals implied by the microinstruction are
generated simultaneously and the next clock pulse fetches a new
microinstruction. This type of microinstruction encoding is called
monophase encoding. It is also possible to associate each field or each bit
in the microinstruction with a time value. That is, the execution of each
microinstruction now requires more than one clock pulse. This type of
microinstruction encoding is called polyphase encoding. Figure (a) shows
the monophase encoding, where all the control signals are generated
simultaneously at the clock pulse. Figure (b) shows an n-phase
encoding where microoperations M1 through Mn are associated with
time values t1 through tn, respectively.
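The horizontal-versus-vertical word-size trade-off discussed above can be sketched as follows; the grouping of 24 control signals into four resource fields is an assumption chosen only for illustration:

```python
import math

# Hypothetical grouping of 24 control signals into resource fields
# (e.g. ALU, BUS1, BUS2, registers), as in a vertical (packed) format.
fields = [8, 6, 6, 4]  # number of control signals per field (assumed)

# Horizontal (unpacked): one bit per control signal, no decoding needed.
horizontal_width = sum(fields)

# Vertical (packed): each field is binary-encoded, reserving one extra
# code per field for "no signal active"; decoders add delay but the
# CROM word shrinks.
vertical_width = sum(math.ceil(math.log2(n + 1)) for n in fields)

print(horizontal_width, vertical_width)  # 24 vs 13 bits
```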
5. REDUCED INSTRUCTION SET COMPUTERS
The acronym RISC (pronounced as risk), for reduced instruction set
computer, represents a CPU design strategy emphasizing the insight that
simplified instructions that "do less" may still provide for higher performance
if this simplicity can be utilized to make instructions execute very quickly.
Many proposals for a "precise" definition have been attempted, and the
term is being slowly replaced by the more descriptive load-store
architecture. Well-known RISC families include Alpha, ARC, ARM, AVR,
MIPS, PA-RISC, Power Architecture (including PowerPC), SPARC, and SuperH.
The concept was developed
by John Cocke of IBM Research during 1974. His argument was based
upon the notion that a computer uses only 20% of the instructions, making
the other 80% superfluous to requirement. A processor based upon this
concept would use few instructions, which would require fewer transistors,
and make them cheaper to manufacture. By reducing the number of
transistors and instructions to only those most frequently used, the
computer would get more done in a shorter amount of time. The term
'RISC' (short for Reduced Instruction Set Computer) was later coined by
David Patterson, a professor at the University of California, Berkeley.
The RISC concept was first used to simplify the design of IBM's 801
minicomputer, and was later used in the IBM RISC System/6000 and Sun
Microsystems' SPARC microprocessors. A parallel research project at
Stanford led to the founding of MIPS Technologies, which developed the
MIPS RISC microprocessor (Microprocessor without Interlocked Pipeline
Stages); MIPS Technologies was later bought by Silicon Graphics. The MIPS
processor has continued development, remaining a popular choice in the
embedded and low-end markets. At one time, it was suspected the Amiga MCC
would use this CPU to reduce the cost of manufacture. However, the consumer
desktop market is limited, and only the PowerPC processor remains popular
among the RISC alternatives. This is mainly due to Apple's continued use
of the series for its PowerMac range.
Being an old idea, some aspects attributed to the first RISC-labeled
designs (around 1975) include the observations that the memory restricted
compilers of the time were often unable to take advantage of features
intended to facilitate coding, and that complex addressing inherently takes
many cycles to perform. It was argued that such functions would better be
performed by sequences of simpler instructions, if this could yield
implementations simple enough to cope with really high frequencies, and
small enough to leave room for many registers[2], factoring out slow
memory accesses. Uniform, fixed-length instructions with arithmetic
restricted to registers were chosen to ease instruction pipelining in these
simple designs, with special load-store instructions accessing memory.
There is still considerable controversy among experts about the ultimate
value of RISC architectures. Its proponents argue that RISC machines are
both cheaper and faster, and are therefore the machines of the future.
Skeptics note that by making the hardware simpler, RISC architectures put
a greater burden on the software. They argue that this is not worth the
trouble because conventional microprocessors are becoming increasingly
fast and cheap anyway.
To some extent, the argument is becoming moot because CISC and RISC
implementations are becoming more and more alike. Many of today's RISC
chips support as many instructions as yesterday's CISC chips. And today's
CISC chips use many techniques formerly associated with RISC chips.
RISC (reduced instruction set computer) is a microprocessor that is
designed to perform a smaller number of types of computer instructions so
that it can operate at a higher speed (perform more millions of instructions
per second, or MIPS). Since each instruction type that a computer must
perform requires additional transistors and circuitry, a larger list or set of
computer instructions tends to make the microprocessor more complicated
and slower in operation.
John Cocke of IBM Research in Yorktown, New York, originated the RISC
concept in 1974 by showing that about 20% of the instructions in a computer
did 80% of the work. The first computer to benefit from this insight was
IBM's 801 minicomputer, around 1980. Later, IBM's RISC System/6000 made use of the
idea. The term itself (RISC) is credited to David Patterson, a professor at the
University of California, Berkeley. The concept was used in Sun
Microsystems' SPARC microprocessors and led to the founding of what is
now MIPS Technologies, part of Silicon Graphics. A number of current
microchips now use the RISC concept.
The RISC concept has led to a more thoughtful design of the
microprocessor. Among design considerations are how well an instruction
can be mapped to the clock speed of the microprocessor (ideally, an
instruction can be performed in one clock cycle); how "simple" an
architecture is required; and how much work can be done by the microchip
itself without resorting to software help.
Besides performance improvement, some advantages of RISC and related
design improvements are:
 A new microprocessor can be developed and tested more quickly if
one of its aims is to be less complicated.
 Operating system and application programmers who use the
microprocessor's instructions will find it easier to develop code with a
smaller instruction set.
 The simplicity of RISC allows more freedom to choose how to use
the space on a microprocessor.
 Higher-level language compilers produce more efficient code than
formerly because they have always tended to use the smaller set of
instructions to be found in a RISC computer.
6. RISC Vs. CISC
The argument over which concept is better has been repeated
over the past few years. Macintosh owners have elevated the
argument to a pseudo-religious level in support of their RISC-based
God (the PowerPC sits next to the Steve Jobs statue on
every Mac altar). Both positions have been blurred by the
argument that we have entered a Post-RISC stage.
RISC: For and Against
RISC supporters argue that it is the way of the future, producing
faster and cheaper processors - an Apple Mac G3 offers a
significant performance advantage over its Intel equivalent.
Instructions are executed over 4x faster, providing a significant
performance boost! However, RISC chips require more lines of
code to produce the same results and are increasingly complex.
This will increase the size of the application and the amount of
overhead required. RISC developers have also failed to remain in
competition with CISC alternatives. The Macintosh market has
been damaged by several problems that have affected the
availability of 500MHz+ PowerPC chips. In contrast, the PC
compatible market has stormed ahead and has broken the 1GHz
barrier. Despite the speed advantages of the RISC processor, it
cannot compete with a CISC CPU that boasts twice the clock
speed.
The simplest way to examine the advantages and disadvantages
of RISC architecture is by contrasting it with its predecessor:
CISC (Complex Instruction Set Computer) architecture.
Multiplying Two Numbers in Memory
Consider the storage scheme for a generic computer. The main
memory is divided into locations numbered from (row) 1: (column) 1
to (row) 6: (column) 4. The
execution unit is responsible for carrying out all computations.
However, the execution unit can only operate on data that has
been loaded into one of the six registers (A, B, C, D, E, or F).
Let's say we want to find the product of two numbers - one stored
in location 2:3 and another stored in location 5:2 - and then store
the product back in the location 2:3.
The advantages of RISC over CISC today are these:
 RISC processors are much simpler to build, which in turn results
in the following advantages:
o easier to build, i.e. you can use already existing production
facilities
o much less expensive; just compare the price of an XScale with
that of a Pentium III at 1 GHz
o less power consumption, which again gives two advantages:
 much longer use of battery-driven devices
 no need for cooling of the device, which again gives two
advantages:
 smaller design of the whole device
 no noise
 RISC processors are much simpler to program, which helps not only the
assembler programmer but the compiler designer, too. You'll hardly find
any compiler which uses all the functions of a Pentium III optimally.
The CISC Approach
The primary goal of CISC architecture is to complete a task in as
few lines of assembly as possible. This is achieved by building
processor hardware that is capable of understanding and
executing a series of operations. For this particular task, a CISC
processor would come prepared with a specific instruction (we'll
call it "MULT"). When executed, this instruction loads the two
values into separate registers, multiplies the operands in the
execution unit, and then stores the product in the appropriate
register. Thus, the entire task of multiplying two numbers can be
completed with one instruction:
MULT 2:3, 5:2
MULT is what is known as a "complex instruction." It operates
directly on the computer's memory banks and does not require
the programmer to explicitly call any loading or storing functions.
It closely resembles a command in a higher level language. For
instance, if we let "a" represent the value of 2:3 and "b"
represent the value of 5:2, then this command is identical to the
C statement "a = a * b."
One of the primary advantages of this system is that the compiler
has to do very little work to translate a high-level language
statement into assembly. Because the length of the code is
relatively short, very little RAM is required to store instructions.
The emphasis is put on building complex instructions directly into
the hardware.
The RISC Approach
RISC processors only use simple instructions that can be
executed within one clock cycle. Thus, the "MULT" command
described above could be divided into three separate commands:
"LOAD," which moves data from the memory bank to a register,
"PROD," which finds the product of two operands located within
the registers, and "STORE," which moves data from a register to
the memory banks. In order to perform the exact series of steps
described in the CISC approach, a programmer would need to
code four lines of assembly:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
At first, this may seem like a much less efficient way of
completing the operation. Because there are more lines of code,
more RAM is needed to store the assembly level instructions. The
compiler must also perform more work to convert a high-level
language statement into code of this form.
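The four-line RISC sequence can be mimicked with a tiny register-transfer sketch; the sample memory contents (6 and 7) and the modelling of each instruction as a Python function are assumptions for illustration only:

```python
# Minimal simulator of the generic machine above: a memory addressed
# as "row:col" strings and six registers A-F. Instruction names
# (LOAD, PROD, STORE) follow the text; the operand values are assumed.

memory = {"2:3": 6, "5:2": 7}       # sample operands (assumed values)
regs = dict.fromkeys("ABCDEF", 0)   # six general-purpose registers

def LOAD(reg, addr):  regs[reg] = memory[addr]            # memory -> register
def PROD(dst, src):   regs[dst] = regs[dst] * regs[src]   # register * register
def STORE(addr, reg): memory[addr] = regs[reg]            # register -> memory

# The four-line RISC sequence from the text:
LOAD("A", "2:3")
LOAD("B", "5:2")
PROD("A", "B")
STORE("2:3", "A")

print(memory["2:3"])  # 42: the product replaces the first operand
```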
CISC                                        RISC
Emphasis on hardware                        Emphasis on software
Includes multi-clock                        Single-clock,
complex instructions                        reduced instructions only
Memory-to-memory: "LOAD" and "STORE"        Register-to-register: "LOAD" and "STORE"
incorporated in instructions                are independent instructions
Small code sizes,                           Large code sizes,
high cycles per second                      low cycles per second
Transistors used for storing                Spends more transistors
complex instructions                        on memory registers
However, the RISC strategy also brings some very important
advantages. Because each instruction requires only one clock
cycle to execute, the entire program will execute in approximately
the same amount of time as the multi-cycle "MULT" command.
These RISC "reduced instructions" require fewer transistors of
hardware space than the complex instructions, leaving more room
for general-purpose registers. Because all of the instructions
execute in a uniform amount of time (i.e. one clock), pipelining is
possible.
Separating the "LOAD" and "STORE" instructions actually reduces
the amount of work that the computer must perform. After a
CISC-style "MULT" command is executed, the processor
automatically erases the registers. If one of the operands needs
to be used for another computation, the processor must re-load
the data from the memory bank into a register. In RISC, the
operand will remain in the register until another value is loaded in
its place.
The Performance Equation
The following equation is commonly used for expressing a
computer's performance ability:

    time/program = (instructions/program) x (cycles/instruction) x (time/cycle)
The CISC approach attempts to minimize the number of
instructions per program, sacrificing the number of cycles per
instruction. RISC does the opposite, reducing the cycles per
instruction at the cost of the number of instructions per program.
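Numerically, the trade-off reads as follows, using the standard performance equation time/program = instructions × CPI × cycle time; all counts below are hypothetical, chosen only to make the two strategies comparable:

```python
# Performance equation:
#   time/program = (instructions/program) x (cycles/instruction) x (time/cycle)
# The numbers below are purely illustrative.

def exec_time(instructions, cpi, cycle_time_ns):
    return instructions * cpi * cycle_time_ns

# CISC-style: fewer instructions, but more cycles per instruction.
cisc = exec_time(instructions=1_000, cpi=4.0, cycle_time_ns=2.0)
# RISC-style: more instructions, one cycle each, same clock.
risc = exec_time(instructions=2_500, cpi=1.0, cycle_time_ns=2.0)

print(cisc, risc)  # 8000.0 ns vs 5000.0 ns
```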
RISC Roadblocks
Despite the advantages of RISC based processing, RISC chips
took over a decade to gain a foothold in the commercial world.
This was largely due to a lack of software support.
Although Apple's Power Macintosh line featured RISC-based chips
and Windows NT was RISC compatible, Windows 3.1 and
Windows 95 were designed with CISC processors in mind. Many
companies were unwilling to take a chance with the emerging
RISC technology. Without commercial interest, processor
developers were unable to manufacture RISC chips in large
enough volumes to make their price competitive.
Another major setback was the presence of Intel. Although their
CISC chips were becoming increasingly unwieldy and difficult to
develop, Intel had the resources to plow through development
and produce powerful processors. Although RISC chips might
surpass Intel's efforts in specific areas, the differences were not
great enough to persuade buyers to change technologies.
The Overall RISC Advantage
Today, the Intel x86 is arguably the only chip which retains CISC
architecture. This is primarily due to advancements in other areas
of computer technology. The price of RAM has decreased
dramatically. In 1977, 1MB of DRAM cost about $5,000. By 1994,
the same amount of memory cost only $6 (when adjusted for
inflation). Compiler technology has also become more
sophisticated, so that the RISC use of RAM and emphasis on
software has become ideal.
7. MIPS
History
The MIPS processor was developed as part of a VLSI research program at
Stanford University in the early 80s. Professor John Hennessy, now the
University's President, started the development of MIPS with a
brainstorming class for graduate students. The readings and idea sessions
helped launch the development of the processor which became one of the
first RISC processors, with IBM and Berkeley developing processors at
around the same time.
MIPS Architecture
The Stanford research group had a strong background in compilers, which
led them to develop a processor whose architecture would represent the
lowering of the compiler to the hardware level, as opposed to the raising of
hardware to the software level, which had been a long running design
philosophy in the hardware industry.
Thus, the MIPS processor implemented a smaller, simpler instruction set.
Each of the instructions included in the chip design ran in a single clock
cycle. The processor used a technique called pipelining to more efficiently
process instructions.
MIPS used 32 registers, each 32 bits wide (a bit pattern of this size is
referred to as a word).
Instruction Set
The MIPS instruction set consists of about 111 total instructions, each
represented in 32 bits. An example of a MIPS instruction is below:
add $r12, $r7, $r8
Above is the assembly representation of a MIPS
addition instruction. The instruction tells the processor to compute the sum
of the values in registers 7 and 8 and store the result in register 12. The
dollar signs are used to indicate an operation on a register. The 32-bit
binary representation is divided into 6 fields. The processor identifies
the type of instruction by the binary digits in the first and last fields;
in this case, the processor recognizes that this instruction is an addition
from the zero in its first (opcode) field and the 20 (hexadecimal) in its
last (function) field.
The two operands are represented in the second and third fields, and the
desired result location is given in the fourth field. The fifth field
represents the shift amount, something that is not used in an addition
operation.
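The field layout just described can be sketched by packing the register numbers (7, 8, and 12), a zero opcode, and the 0x20 function code into one 32-bit word; the helper function name is an illustration, not part of any MIPS toolchain:

```python
# Encoding of the R-type instruction "add $r12, $r7, $r8" into its six
# fields: opcode (6 bits), rs (5), rt (5), rd (5), shamt (5), funct (6).
# For add: opcode = 0 and funct = 0x20 (i.e. "20" in hexadecimal).

def encode_r_type(opcode, rs, rt, rd, shamt, funct):
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

word = encode_r_type(opcode=0, rs=7, rt=8, rd=12, shamt=0, funct=0x20)
print(f"{word:032b}")  # 00000000111010000110000000100000
print(hex(word))       # 0xe86020
```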
The instruction set consists of a variety of basic instructions, including:
 21 arithmetic instructions (+, -, *, /, %)
 8 logic instructions (&, |, ~)
 8 bit manipulation instructions
 12 comparison instructions (>, <, =, >=, <=, ¬)
 25 branch/jump instructions
 15 load instructions
 10 store instructions
 8 move instructions
 4 miscellaneous instructions
MIPS Today
MIPS Computer Systems, Inc. was founded in 1984 upon the Stanford
research from which the first MIPS chip resulted. The company was
purchased by Silicon Graphics, Inc. in 1992, and was spun off as MIPS
Technologies, Inc. in 1998. Today, MIPS powers many consumer
electronics and other devices.
8. Pipelining
How Pipelining Works
Pipelining, a standard feature in RISC processors, is much like an assembly
line. Because the processor works on different steps of the instruction at
the same time, more instructions can be executed in a shorter period of
time.
A useful method of demonstrating this is the laundry analogy. Let's say that
there are four loads of dirty laundry that need to be washed, dried, and
folded. We could put the first load in the washer for 30 minutes, dry it for 40
minutes, and then take 20 minutes to fold the clothes. Then pick up the
second load and wash, dry, and fold, and repeat for the third and fourth
loads. Supposing we started at 6 PM and worked as efficiently as possible,
we would still be doing laundry until midnight.
However, a smarter approach to the problem would be to put the second
load of dirty laundry into the washer after the first was already clean and
whirling happily in the dryer. Then, while the first load was being folded, the
second load would dry, and a third load could be added to the pipeline of
laundry. Using this method, the laundry would be finished by 9:30.
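The laundry arithmetic above can be checked directly; the formula used for the pipelined case is specific to this example, where the dryer (the longest stage) sets the rhythm:

```python
# Completion times for the laundry analogy: wash 30 min, dry 40 min,
# fold 20 min, four loads, starting at 6:00 PM.

stages = [30, 40, 20]   # wash, dry, fold (minutes)
loads = 4

# One load at a time: every load takes the full 90 minutes.
sequential = loads * sum(stages)                         # 360 min -> midnight

# Pipelined: after the first wash, the dryer (longest stage) is the
# bottleneck; the last load still needs its 20-minute fold at the end.
pipelined = stages[0] + loads * max(stages) + stages[2]  # 210 min -> 9:30 PM

print(sequential, pipelined)
```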
RISC Pipelines
A RISC processor pipeline operates in much the same way, although the
stages in the pipeline are different. While different processors have different
numbers of steps, they are basically variations of these five, used in the
MIPS R3000 processor:
1. fetch instructions from memory
2. read registers and decode the instruction
3. execute the instruction or calculate an address
4. access an operand in data memory
5. write the result into a register
If you glance back at the diagram of the laundry pipeline, you'll notice that
although the washer finishes in half an hour, the dryer takes an extra ten
minutes, and thus the wet clothes must wait ten minutes for the dryer to
free up. Thus, the length of the pipeline is dependent on the length of the
longest step. Because RISC instructions are simpler than those used in
pre-RISC processors (now called CISC, or Complex Instruction Set
Computer), they are more conducive to pipelining. While CISC instructions
varied in length, RISC instructions are all the same length and can be
fetched in a single operation. Ideally, each of the stages in a RISC
processor pipeline should take 1 clock cycle so that the processor finishes
an instruction each clock cycle and averages one cycle per instruction
(CPI).
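The ideal case can be expressed as a small formula: a k-stage pipeline finishes n instructions in k + (n - 1) cycles, so the average CPI approaches one for long instruction streams:

```python
# Ideal cycle count for a k-stage pipeline running n instructions:
# the first instruction takes k cycles to drain through the pipeline,
# and each later instruction finishes one cycle after its predecessor.

def pipeline_cycles(n_instructions, n_stages=5):
    return n_stages + (n_instructions - 1)

n = 1000
cycles = pipeline_cycles(n)   # 1004 cycles for 1000 instructions
print(cycles, cycles / n)     # average CPI approaches 1
```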
Pipeline Problems
In practice, however, RISC processors operate at more than one cycle per
instruction. The processor might occasionally stall as a result of data
dependencies and branch instructions.
A data dependency occurs when an instruction depends on the results of a
previous instruction. A particular instruction might need data in a register
which has not yet been stored since that is the job of a preceding
instruction which has not yet reached that step in the pipeline.
For example:
add $r3, $r2, $r1
add $r5, $r4, $r3
... more instructions that are independent of the first two ...
In this example, the first instruction tells the processor to add the contents
of registers r1 and r2 and store the result in register r3. The second
instructs it to add r3 and r4 and store the sum in r5. We place this set of
instructions in a pipeline. When the second instruction is in the second
stage, the processor will be attempting to read r3 and r4 from the registers.
Remember, though, that the first instruction is just one step ahead of the
second, so the contents of r1 and r2 are being added, but the result has not
yet been written into register r3. The second instruction therefore cannot
read from the register r3 because it hasn't been written yet and must wait
until the data it needs is stored. Consequently, the pipeline is stalled and a
number of empty instructions (known as bubbles) go into the pipeline. Data
dependency affects long pipelines more than shorter ones since it takes a
longer period of time for an instruction to reach the final register-writing
stage of a long pipeline.
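The dependency in the example above can be detected mechanically, the way a pipeline interlock (or a reordering compiler) would; the encoding of each instruction as a (destination, sources) tuple is an assumption for illustration:

```python
# Detecting the read-after-write (RAW) dependency from the example:
# an instruction must stall if it reads a register written by the
# immediately preceding instruction, which has not reached the
# register-writing stage yet.

program = [
    ("r3", ("r2", "r1")),   # add $r3, $r2, $r1
    ("r5", ("r4", "r3")),   # add $r5, $r4, $r3  <- reads r3 too early
]

def raw_hazards(instrs):
    """Return indices of instructions that depend on their predecessor."""
    hazards = []
    for i in range(1, len(instrs)):
        prev_dest = instrs[i - 1][0]
        if prev_dest in instrs[i][1]:
            hazards.append(i)
    return hazards

print(raw_hazards(program))  # [1]: the second instruction must wait
```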
MIPS' solution to this problem is code reordering. If, as in the example
above, the following instructions have nothing to do with the first two, the
code could be rearranged so that those instructions are executed in
between the two dependent instructions and the pipeline could flow
efficiently. The task of code reordering is generally left to the compiler,
which recognizes data dependencies and attempts to minimize
performance stalls.
Branch instructions are those that tell the processor to make a decision
about what the next instruction to be executed should be based on the
results of another instruction. Branch instructions can be troublesome in a
pipeline if a branch is conditional on the results of an instruction which has
not yet finished its path through the pipeline.
For example:
Loop : add $r3, $r2, $r1
sub $r6, $r5, $r4
beq $r3, $r6, Loop
The example above instructs the processor to add r1 and r2 and put the
result in r3, then subtract r4 from r5, storing the difference in r6. In the third
instruction, beq stands for branch if equal. If the contents of r3 and r6 are
equal, the processor should execute the instruction labeled "Loop."
Otherwise, it should continue to the next instruction. In this example, the
processor cannot make a decision about which branch to take because
neither the value of r3 nor that of r6 has been written into the registers yet.
The processor could stall, but a more sophisticated method of dealing with
branch instructions is branch prediction. The processor makes a guess
about which path to take - if the guess is wrong, anything written into the
registers must be cleared, and the pipeline must be started again with the
correct instruction. Some methods of branch prediction depend on
stereotypical behavior. Branches pointing backward are taken about 90% of
the time since backward-pointing branches are often found at the bottom of
loops. On the other hand, branches pointing forward are only taken
approximately 50% of the time. Thus, it would be logical for processors to
always follow the branch when it points backward, but not when it points
forward. Other methods of branch prediction are less static: processors that
use dynamic prediction keep a history for each branch and use it to predict
future branches. These processors are correct in their predictions 90% of
the time.
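A minimal sketch of such a dynamic predictor, assuming a 2-bit saturating counter per branch (a common scheme, though the text does not name a specific one); the branch trace here is invented for illustration:

```python
# Dynamic branch prediction with a 2-bit saturating counter per branch:
# counter states 0-1 predict "not taken", states 2-3 predict "taken".
# Two consecutive mispredictions are needed to flip the prediction.

counters = {}  # branch label -> 2-bit counter (history table, assumed)

def predict(branch):
    return counters.get(branch, 0) >= 2  # True = predict taken

def update(branch, taken):
    c = counters.get(branch, 0)
    counters[branch] = min(c + 1, 3) if taken else max(c - 1, 0)

# A backward loop branch taken 9 times, then falling through once:
correct = 0
for outcome in [True] * 9 + [False]:
    if predict("Loop") == outcome:
        correct += 1
    update("Loop", outcome)

print(correct, "of 10 predicted correctly")
```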
Still other processors forgo the entire branch prediction ordeal. The RISC
System/6000 fetches and starts decoding instructions from both sides of
the branch. When it determines which branch should be followed, it then
sends the correct instructions down the pipeline to be executed.
Pipelining Developments
In order to make processors even faster, various methods of optimizing
pipelines have been devised.
Superpipelining refers to dividing the pipeline into more steps. The more
pipe stages there are, the faster the pipeline is because each stage is then
shorter. Ideally, a pipeline with five stages should be five times faster than a
non-pipelined processor (or rather, a pipeline with one stage). The
instructions are executed at the speed at which each stage is completed,
and each stage takes one fifth of the amount of time that the non-pipelined
instruction takes. Thus, a processor with an 8-step pipeline (the MIPS
R4000) will be even faster than its 5-step counterpart. The MIPS R4000
chops its pipeline into more pieces by dividing some steps into two.
Instruction fetching, for example, is now done in two stages rather than
one. The stages are as shown:
1. Instruction Fetch (First Half)
2. Instruction Fetch (Second Half)
3. Register Fetch
4. Instruction Execute
5. Data Cache Access (First Half)
6. Data Cache Access (Second Half)
7. Tag Check
8. Write Back
Superscalar pipelining involves multiple pipelines in parallel. Internal
components of the processor are replicated so it can launch multiple
instructions in some or all of its pipeline stages. The RISC System/6000
has a forked pipeline with different paths for floating-point and integer
instructions. If there is a mixture of both types in a program, the processor
can keep both forks running simultaneously. Both types of instructions
share two initial stages (Instruction Fetch and Instruction Dispatch) before
they fork. Often, however, superscalar pipelining refers to multiple copies of
all pipeline stages (In terms of laundry, this would mean four washers, four
dryers, and four people who fold clothes). Many of today's machines
attempt to find two to six instructions that they can execute in every pipeline
stage. If some of the instructions are dependent, however, only the first
instruction or instructions are issued.
Dynamic pipelines have the capability to schedule around stalls. A dynamic
pipeline is divided into three units: the instruction fetch and decode unit, five
to ten execute or functional units, and a commit unit. Each execute unit has
reservation stations, which act as buffers and hold the operands and
operations.
While the functional units have the freedom to execute out of order, the
instruction fetch/decode and commit units must operate in-order to maintain
simple pipeline behavior. When the instruction is executed and the result is
calculated, the commit unit decides when it is safe to store the result. If a
stall occurs, the processor can schedule other instructions to be executed
until the stall is resolved. This, coupled with the efficiency of multiple units
executing instructions simultaneously, makes a dynamic pipeline an
attractive alternative.
9. Pipelining in CPU Design: Superscalar Machines
A superscalar CPU architecture implements a form of parallelism called
instruction-level parallelism within a single processor. It thereby allows
faster CPU throughput than would otherwise be possible at the same clock
rate. A superscalar processor executes more than one instruction during a
clock cycle by simultaneously dispatching multiple instructions to redundant
functional units on the processor. Each functional unit is not a separate
CPU core but an execution resource within a single CPU such as an
arithmetic logic unit, a bit shifter, or a multiplier.
While a superscalar CPU is typically also pipelined, they are two different
performance enhancement techniques. It is theoretically possible to have a
non-pipelined superscalar CPU or a pipelined non-superscalar CPU.
The superscalar technique is traditionally associated with several
identifying characteristics. Note these are applied within a given CPU core.
 Instructions are issued from a sequential instruction stream
 CPU hardware dynamically checks for data dependencies between
instructions at run time (versus software checking at compile time)
 The CPU accepts multiple instructions per clock cycle
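These characteristics can be sketched with a toy dual-issue dispatcher; the issue width, the (destination, sources) instruction encoding, and the dependency rule are all assumptions chosen only to illustrate run-time dependency checking:

```python
# Toy superscalar issue model: up to `width` instructions issue per
# cycle from a sequential stream, but an instruction that depends on
# one issued in the same cycle must wait for the next cycle
# (hardware dependency checking at run time).

def issue_cycles(instrs, width=2):
    """instrs: list of (dest, sources). Returns number of issue cycles."""
    cycles, i = 0, 0
    while i < len(instrs):
        issued = [instrs[i]]
        i += 1
        while i < len(instrs) and len(issued) < width:
            dests = {d for d, _ in issued}
            if dests & set(instrs[i][1]):
                break  # depends on an instruction issued this cycle
            issued.append(instrs[i])
            i += 1
        cycles += 1
    return cycles

prog = [("r1", ()), ("r2", ()), ("r3", ("r1", "r2")), ("r4", ("r3",))]
print(issue_cycles(prog))  # 3 cycles: {0,1}, then {2}, then {3}
```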
History
Seymour Cray's CDC 6600 from 1965 is often mentioned as the first
superscalar design. The Intel i960CA (1988) and the AMD 29000-series
29050 (1990) microprocessors were the first commercial single chip
superscalar microprocessors. RISC CPUs like these brought the
superscalar concept to micro computers because the RISC design results
in a simple core, allowing straightforward instruction dispatch and the
inclusion of multiple functional units (such as ALUs) on a single CPU in the
constrained design rules of the time. This was the reason that RISC
designs were faster than CISC designs through the 1980s and into the
1990s.
Except for CPUs used in low-power applications, embedded systems, and
battery-powered devices, essentially all general-purpose CPUs developed
since about 1998 are superscalar.
The Pentium was the first superscalar x86 processor; the Nx586, Pentium
Pro, and AMD K5 were among the first designs which decode x86
instructions asynchronously into dynamic microcode-like micro-op
sequences prior to actual execution on a superscalar microarchitecture. This
opened the way for dynamic scheduling of buffered partial instructions and
enabled more parallelism to be extracted compared to the more rigid
methods used in the simpler Pentium; it also simplified speculative
execution and allowed higher clock frequencies compared to designs such
as the advanced Cyrix 6x86.
From scalar to superscalar
The simplest processors are scalar processors. Each instruction executed
by a scalar processor typically manipulates one or two data items at a time.
By contrast, each instruction executed by a vector processor operates
simultaneously on many data items. An analogy is the difference between
scalar and vector arithmetic. A superscalar processor is sort of a mixture of
the two. Each instruction processes one data item, but there are multiple
redundant functional units within each CPU thus multiple instructions can
be processing separate data items concurrently.
Superscalar CPU design emphasizes improving the instruction dispatcher
accuracy, and allowing it to keep the multiple functional units in use at all
times. This has become increasingly important as the number of units has
increased. While early superscalar CPUs would have two ALUs and a
single FPU, a modern design such as the PowerPC 970 includes four
ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at
keeping all of these units fed with instructions, the performance of the
system will suffer.
A superscalar processor usually sustains an execution rate in excess of
one instruction per machine cycle. But merely processing multiple
instructions concurrently does not make an architecture superscalar, since
pipelined, multiprocessor or multi-core architectures also achieve that, but
with different methods.
In a superscalar CPU the dispatcher reads instructions from memory and
decides which ones can be run in parallel, dispatching them to redundant
functional units contained inside a single CPU. Therefore a superscalar
processor can be envisioned having multiple parallel pipelines, each of
which is processing instructions simultaneously from a single instruction
thread.
Limitations
Available performance improvement from superscalar techniques is limited
by two key areas:
1. The degree of intrinsic parallelism in the instruction stream, i.e.
limited amount of instruction-level parallelism, and
2. The complexity and time cost of the dispatcher and associated
dependency checking logic.
Existing binary executable programs have varying degrees of intrinsic
parallelism. In some cases instructions are not dependent on each other
and can be executed simultaneously. In other cases they are interdependent: one instruction impacts either resources or results of the other.
The instructions a = b + c; d = e + f can be run in parallel because none of the
results depend on other calculations. However, the instructions a = b + c; d =
a + f might not be runnable in parallel, depending on the order in which the
instructions complete while they move through the units.
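The dependency test behind this example can be sketched in a few lines of Python; the encoding of each instruction as a (destination, sources) tuple is purely illustrative, not any real dispatcher's data structure:

```python
# Minimal sketch of a read-after-write (RAW) dependency check, the kind
# of test a superscalar dispatcher performs in hardware at run time.
# The (dest, (src, src)) tuple format is an illustrative assumption.

def raw_dependent(first, second):
    """True if 'second' reads a register that 'first' writes."""
    dest_first, _ = first
    _, srcs_second = second
    return dest_first in srcs_second

# a = b + c ; d = e + f  -> independent, can issue together
print(raw_dependent(("a", ("b", "c")), ("d", ("e", "f"))))  # False

# a = b + c ; d = a + f  -> the second reads 'a', must wait
print(raw_dependent(("a", ("b", "c")), ("d", ("a", "f"))))  # True
```

A real dispatcher must also check write-after-read and write-after-write conflicts, but the RAW case shown here is the one illustrated by the two instruction sequences above.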
When the number of simultaneously issued instructions increases, the cost
of dependency checking increases extremely rapidly. This is exacerbated
by the need to check dependencies at run time and at the CPU's clock rate.
This cost includes additional logic gates required to implement the checks,
and time delays through those gates. Research shows the gate cost in
some cases may be n·k gates, and the delay cost k²·log n, where n is the
number of instructions in the processor's instruction set and k is the
number of simultaneously dispatched instructions. In mathematics, this is
called a combinatoric problem involving permutations.
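Plugging illustrative numbers into those cost expressions shows how steeply the delay term grows with the dispatch width k. The instruction-set size n = 256 below is an assumed value, not a figure from the text, and the absolute numbers are meaningless; only the growth rate matters:

```python
import math

# Illustrative growth of the quoted cost expressions:
#   gate cost  ~ n * k
#   delay cost ~ k^2 * log2(n)
# for an instruction set of n instructions and k simultaneously
# dispatched instructions. n = 256 is an assumption for the sketch.
n = 256
for k in (2, 4, 8):
    gates = n * k
    delay = k * k * math.log2(n)
    print(k, gates, delay)
```

Doubling k doubles the gate count but quadruples the delay term, which is why the practical dispatch limit stays small even as transistor budgets grow.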
Even though the instruction stream may contain no inter-instruction
dependencies, a superscalar CPU must nonetheless check for that
possibility, since there is no assurance otherwise and failure to detect a
dependency would produce incorrect results.
No matter how advanced the semiconductor process or how fast the
switching speed, this places a practical limit on how many instructions can
be simultaneously dispatched. While process advances will allow ever
greater numbers of functional units (e.g., ALUs), the burden of checking
instruction dependencies grows so rapidly that the achievable superscalar
dispatch limit is fairly small -- likely on the order of five to six
simultaneously dispatched instructions.
However even given infinitely fast dependency checking logic on an
otherwise conventional superscalar CPU, if the instruction stream itself has
many dependencies, this would also limit the possible speedup. Thus the
degree of intrinsic parallelism in the code stream forms a second limitation.
Alternatives
Collectively, these two limits drive investigation into alternative architectural
performance increases such as Very Long Instruction Word (VLIW), Explicitly
Parallel Instruction Computing (EPIC), simultaneous multithreading (SMT),
and multi-core processors.
With VLIW, the burdensome task of dependency checking by hardware
logic at run time is removed and delegated to the compiler. Explicitly
Parallel Instruction Computing (EPIC) is like VLIW, with extra cache
prefetching instructions.
Simultaneous multithreading, often abbreviated as SMT, is a technique for
improving the overall efficiency of superscalar CPUs. SMT permits multiple
independent threads of execution to better utilize the resources provided by
modern processor architectures.
Superscalar processors differ from multi-core processors in that the
redundant functional units are not entire processors. A single processor is
composed of finer-grained functional units such as the ALU, integer
multiplier, integer shifter, floating point unit, etc. There may be multiple
versions of each functional unit to enable execution of many instructions in
parallel. This differs from a multicore CPU that concurrently processes
instructions from multiple threads, one thread per core. It also differs from a
pipelined CPU, where the multiple instructions can concurrently be in
various stages of execution, assembly-line fashion.
The various alternative techniques are not mutually exclusive—they can be
(and frequently are) combined in a single processor. Thus a multicore CPU
is possible where each core is an independent processor containing
multiple parallel pipelines, each pipeline being superscalar. Some
processors also include vector capability.
Superscalar Operation - Executing Instructions in Parallel
With the pipelined architecture we could achieve, at best, an execution
rate of one CPI (clock cycle per instruction). Is it possible to execute instructions
faster than this? At first glance you might think, "Of course not, we can do
at most one operation per clock cycle. So there is no way we can execute
more than one instruction per clock cycle." Keep in mind however, that a
single instruction is not a single operation. In the examples presented
earlier each instruction has taken between six and eight operations to
complete. By adding seven or eight separate units to the CPU, we could
effectively execute these eight operations in one clock cycle, yielding one
CPI. If we add more hardware and execute, say, 16 operations at once, can
we achieve 0.5 CPI? The answer is a qualified "yes." A CPU including this
additional hardware is a superscalar CPU and can execute more than one
instruction during a single clock cycle. The 80x86 family began supporting
superscalar execution with the introduction of the Pentium processor.
A superscalar CPU has, essentially, several execution units. If it encounters
two or more instructions in the instruction stream (i.e., the prefetch queue)
which can execute independently, it will do so.
Figure:- A CPU that Supports Superscalar Operation
There are a couple of advantages to going superscalar. Suppose you have
the following instructions in the instruction stream:
mov( 1000, eax );
mov( 2000, ebx );
If there are no other problems or hazards in the surrounding code, and all
six bytes for these two instructions are currently in the prefetch queue,
there is no reason why the CPU cannot fetch and execute both instructions
in parallel. All it takes is extra silicon on the CPU chip to implement two
execution units.
Besides speeding up independent instructions, a superscalar CPU can also
speed up program sequences that have hazards. One limitation of a
standard pipelined CPU is that once a hazard occurs, the offending instruction
will completely stall the pipeline. Every instruction which follows will also have
to wait for the CPU to synchronize the execution of the instructions. With a
superscalar CPU, however, instructions following the hazard may continue
execution through the pipeline as long as they don't have hazards of their
own. This alleviates (though does not eliminate) some of the need for
careful instruction scheduling.
As an assembly language programmer, the way you write software for a
superscalar CPU can dramatically affect its performance. First and
foremost is that rule you're probably sick of by now: use short instructions.
The shorter your instructions are, the more instructions the CPU can fetch
in a single operation and, therefore, the more likely the CPU will execute
faster than one CPI. Most superscalar CPUs do not completely duplicate
the execution unit. There might be multiple ALUs, floating point units, etc.
This means that certain instruction sequences can execute very quickly
while others won't. You have to study the exact composition of your CPU to
decide which instruction sequences produce the best performance.
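The effect of instruction ordering can be illustrated with a toy dual-issue model. This is a sketch under simplifying assumptions (two issue slots, only adjacent-pair RAW checks), not any real CPU's dispatch logic:

```python
# Toy dual-issue simulator: each cycle the CPU issues up to two
# instructions, but the second slot is used only when that instruction
# does not read the register written by the first. Instructions are
# (dest, (src, src)) tuples -- an illustrative encoding, not real ISA.

def cycles_dual_issue(program):
    """Count cycles to issue 'program' on the toy dual-issue model."""
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program) and program[i][0] not in program[i + 1][1]:
            i += 2  # both instructions issue this cycle
        else:
            i += 1  # dependency (or end of stream): issue only one
    return cycles

# Dependent ordering: every adjacent pair conflicts -> 4 cycles
dep = [("a", ("b", "c")), ("d", ("a", "f")),
       ("x", ("d", "c")), ("y", ("x", "f"))]
# Same four additions reordered so adjacent pairs are independent -> 2 cycles
ind = [("a", ("b", "c")), ("x", ("g", "h")),
       ("d", ("a", "f")), ("y", ("x", "f"))]
print(cycles_dual_issue(dep), cycles_dual_issue(ind))  # 4 2
```

The same work runs twice as fast in the second ordering, which is exactly the kind of scheduling decision the assembly programmer (or compiler) makes for a superscalar target.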
Very Long Instruction Word Architecture (VLIW)
Superscalar operation attempts to schedule, in hardware, the execution of
multiple instructions simultaneously. Another technique that Intel is using in
their IA-64 architecture is the use of very long instruction words, or VLIW. In
a VLIW computer system, the CPU fetches a large block of bits (a 128-bit
bundle in the case of the IA-64 Itanium CPU) and decodes and executes this
block all at once. This block usually contains two or more instructions (three
41-bit instructions per bundle in the case of the IA-64). VLIW computing requires the programmer or
compiler to properly schedule the instructions in each block (so there are
no hazards or other conflicts), but if properly scheduled, the CPU can
execute three or more instructions per clock cycle.
The Intel IA-64 Architecture is not the only computer system to employ a
VLIW architecture. Transmeta's Crusoe processor family also uses a VLIW
architecture. The Crusoe processor differs from the IA-64 architecture
in that it does not support native execution of IA-32 instructions.
Instead, the Crusoe processor dynamically translates 80x86 instructions to
Crusoe's VLIW instructions. This "code morphing" technology results in
code running about 50% slower than native code, though the Crusoe
processor has other advantages.
We will not consider VLIW computing any further since the IA-32
architecture does not support it. But keep this architectural advance in mind
if you move towards the IA-64 family or the Crusoe family.
Parallel Processing
Most of the techniques for improving CPU performance via architectural
advances involve the parallel (overlapped) execution of instructions. Most
of the techniques of this chapter are transparent to the programmer. That
is, the programmer does not have to do anything special to take minimal
advantage of the parallel operation of pipelines and superscalar operations.
True, if programmers are aware of the underlying architecture they can
write code that runs even faster, but these architectural advances often
improve performance even if programmers do not write special code to take
advantage of them.
The only problem with this approach (attempting to dynamically parallelize
an inherently sequential program) is that there is only so much you can do
to parallelize a program that requires sequential execution for proper
operation (which covers most programs). To truly produce a parallel
program, the programmer must specifically write parallel code; of course,
this does require architectural support from the CPU. This section and the
next touch on the types of support a CPU can provide.
Typical CPUs use what is known as the SISD model: Single Instruction,
Single Data. This means that the CPU executes one instruction at a time
that operates on a single piece of data. Two common parallel models are
the so-called SIMD (Single Instruction, Multiple Data) and MIMD (Multiple
Instruction, Multiple Data) models. As it turns out, x86 systems can support
both of these parallel execution models.
In the SIMD model, the CPU executes a single instruction stream, just like
the standard SISD model. However, the CPU executes the specified
operation on multiple pieces of data concurrently rather than a single data
object. For example, consider the 80x86 ADD instruction. This is a SISD
instruction that operates on (that is, produces) a single piece of data; true,
the instruction fetches values from two source operands and stores a sum
into a destination operand but the end result is that the ADD instruction will
only produce a single sum. An SIMD version of ADD, on the other hand,
would compute the sum of several values simultaneously. The Pentium III's
MMX and SIMD instruction extensions operate in exactly this fashion. With
an MMX instruction, for example, you can add up to eight separate pairs of
values with the execution of a single instruction. The aptly named SIMD
instruction extensions operate in a similar fashion.
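What a single packed add does can be sketched in plain Python standing in for the hardware. The wraparound behavior below mirrors the modular (non-saturating) form of MMX's PADDB packed-byte add; treating the eight bytes as a Python list is purely an illustration:

```python
# Sketch of a SIMD "packed add": eight byte-sized additions performed
# as one operation. The modulo-256 wraparound models the non-saturating
# variant of the MMX PADDB instruction.

def paddb(xs, ys):
    """Add eight unsigned bytes pairwise, wrapping at 256."""
    assert len(xs) == len(ys) == 8
    return [(x + y) % 256 for x, y in zip(xs, ys)]

a = [10, 20, 30, 40, 50, 60, 70, 250]
b = [1, 2, 3, 4, 5, 6, 7, 10]
print(paddb(a, b))  # [11, 22, 33, 44, 55, 66, 77, 4]
```

Note the last lane: 250 + 10 wraps to 4. MMX also provides saturating variants that would clamp the result at 255 instead, which is often what multimedia code wants.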
Note that SIMD instructions are only useful in specialized situations. Unless
you have an algorithm that can take advantage of SIMD instructions,
they're not that useful. Fortunately, high-speed 3-D graphics and
multimedia applications benefit greatly from these SIMD (and MMX)
instructions, so their inclusion in the 80x86 CPU offers a huge performance
boost for these important applications.
The MIMD model uses multiple instructions, operating on multiple pieces of
data (usually one instruction per data object, though one of these
instructions could also operate on multiple data items). These multiple
instructions execute independently of one another. Therefore, it's very rare
that a single program (or, more specifically, a single thread of execution)
would use the MIMD model. However, if you have a multiprogramming
environment with multiple programs attempting to execute concurrently in
memory, the MIMD model does allow each of those programs to execute
their own code stream concurrently. This type of parallel system is usually
called a multiprocessor system. Multiprocessor systems are the subject of
the next section.
The common computation models are SISD, SIMD, and MIMD. If you're
wondering if there is a MISD model (Multiple Instruction, Single Data) the
answer is no. Such an architecture doesn't really make sense.
Multiprocessing
Pipelining, superscalar operation, out of order execution, and VLIW design
are techniques CPU designers use in order to execute several operations
in parallel. These techniques support fine-grained parallelism and are
useful for speeding up adjacent instructions in a computer system. If adding
more functional units increases parallelism (and, therefore, speeds up the
system), you might wonder what would happen if you added two CPUs to
the system. This technique, known as multiprocessing, can improve system
performance, though not as uniformly as other techniques. As noted in the
previous section, a multiprocessor system uses the MIMD parallel
execution model.
The techniques we've considered to this point don't require special
programming to realize a performance increase. True, if you do pay
attention you will get better performance; but no special programming is
necessary to activate these features. Multiprocessing, on the other hand,
doesn't help a program one bit unless that program was specifically written
to use multiprocessing (or runs under an O/S specifically written to support
multiprocessing). If you build a system with two CPUs, those CPUs cannot
trade off executing alternate instructions within a program. In fact, it is very
expensive (timewise) to switch the execution of a program from one
processor to another. Therefore, multiprocessor systems are really only
effective in a system that executes multiple programs concurrently (i.e., a
multitasking system). To differentiate this type of parallelism from that
afforded by pipelining and superscalar operation, we'll call this kind of
parallelism coarse-grained parallelism.
Adding multiple processors to a system is not as simple as wiring the
processor to the motherboard. A big problem with multiple processors is the
cache coherency problem. To understand this problem, consider two
separate programs running on separate processors in a multiprocessor
system. Suppose also that these two processors communicate with one
another by writing to a block of shared physical memory. Unfortunately,
when CPU #1 writes to this block of addresses, the CPU caches the data
and might not actually write the data to physical memory for some time.
Simultaneously, CPU #2 might be attempting to read this block of shared
memory but winds up reading the data out of its local cache rather than the
data that CPU #1 wrote to the block of shared memory (assuming the data
made it out of CPU #1's local cache). In order for these two programs to
operate properly, the two CPUs must communicate writes to common
memory addresses in cache between themselves. This is a very complex
and involved process.
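The stale-read scenario above can be sketched with two toy write-back caches sitting over one shared memory. This illustrates only the problem, not any real coherency protocol; the class and addresses are invented for the sketch:

```python
# Toy illustration of the cache coherency problem: each CPU keeps a
# private write-back cache over shared memory. With no invalidation
# traffic between the caches, CPU #2 keeps reading a stale value.

class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}  # write-back cache: address -> value

    def write(self, addr, value):
        self.cache[addr] = value  # sits in cache, not yet in memory

    def read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.memory.get(addr, 0)  # miss: fill line
        return self.cache[addr]

shared = {0x100: 0}
cpu1, cpu2 = Cpu(shared), Cpu(shared)
cpu2.read(0x100)         # CPU #2 caches the old value, 0
cpu1.write(0x100, 42)    # CPU #1's write stays in its own cache
print(cpu2.read(0x100))  # 0 -- stale: the update never reached CPU #2
```

A coherency protocol fixes this by having CPU #1's write invalidate (or update) the matching line in CPU #2's cache, which is the inter-CPU communication the text describes as complex and involved.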
Currently, the Pentium III and IV processors directly support cache updates
between two CPUs in a system. Intel also builds a more expensive
processor, the Xeon, that supports more than two CPUs in a system.
However, one area where the RISC CPUs have a big advantage over Intel
is in the support for multiple processors in a system. While Intel systems
reach a point of diminishing returns at about 16 processors, Sun SPARC
and other RISC processors easily support 64-CPU systems (with more
arriving, it seems, every day).