Computer Hardware and System
Software Concepts
●
Processor Structure
–
Von Neumann Machines
–
Pipelined
–
Clocked logic systems
Von Neumann Machine
●
John von Neumann proposed the concept of
a stored program computer in the mid-1940s.
●
In a von Neumann machine, the program and
the data occupy the same memory.
●
The machine has a program counter (PC)
which points to the current instruction in
memory.
●
The PC is updated on every instruction.
●
When there are no branches, program
instructions are fetched from sequential
memory locations.
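A minimal sketch in C of this fetch-execute cycle (a toy model, not any
real instruction set; the names and sizes are assumptions for illustration):

    #include <stdint.h>

    #define MEM_WORDS 1024

    /* One memory holds both instructions and data (the von Neumann property). */
    static uint32_t memory[MEM_WORDS];
    static uint32_t pc;                       /* program counter, in word units */

    void execute(uint32_t instruction);       /* decode/execute; may rewrite pc */

    void run(void)
    {
        for (;;) {
            uint32_t instruction = memory[pc];  /* fetch the instruction PC points to */
            pc = pc + 1;                        /* default: next sequential location  */
            execute(instruction);               /* a branch would overwrite pc here   */
        }
    }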
Synchronous Machines
●
Most machines nowadays are synchronous,
that is, they are controlled by a clock.
●
Datapaths
Synchronous Machines
●
Registers and combinatorial logic blocks
alternate along the data-paths through the
machine.
●
Data advances from one register to the next
on each cycle of the global clock: as the
clock edge clocks new data into a register, its
current output (processed by passing through
the combinatorial block) is latched into the
next register in the pipeline.
●
The registers are master-slave flip-flops
which allow the input to be isolated from the
output, ensuring a "clean" transfer of the new
data.
Synchronous Machines
●
In a synchronous machine, the worst-case
propagation delay, t_pd(max), through
any combinatorial block must be less than
the smallest clock cycle time, t_cyc - otherwise
a pipeline hazard will occur and data from a
previous stage will be clocked into a register
again.
●
If t_cyc < t_pd for any operation in any stage of
the pipeline, the clock edge will arrive at the
register before data has propagated through
the combinatorial block.
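As a concrete illustration (the delay figures below are invented, not taken
from any particular design), the worst-case combinatorial delay plus the
register setup time fixes the smallest safe clock period:

    #include <stdio.h>

    int main(void)
    {
        double t_pd_max = 8.0e-9;                /* worst-case combinatorial delay: 8 ns (assumed) */
        double t_setup  = 1.0e-9;                /* register setup time: 1 ns (assumed)            */
        double t_cyc_min = t_pd_max + t_setup;   /* the clock period must not be smaller than this */

        printf("t_cyc(min) = %.1f ns, f(max) = %.0f MHz\n",
               t_cyc_min * 1e9, 1.0 / t_cyc_min / 1e6);   /* 9.0 ns, ~111 MHz */
        return 0;
    }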
Synchronous Machines...
●
There may also be feedback loops - in which
the output of the current stage is fed back
and latched in the same register: a
conventional state machine.
●
This sort of logic is used to determine the
next operation (ie next microcode word or
next address for branching purposes).
Basic Processor Structure
●
We will consider the basic structure of a
simple processor.
Basic Processor Structure...
●
ALU
–
Arithmetic Logic Unit - this circuit takes two
operands on the inputs (labelled A and B) and
produces a result on the output (labelled Y). The
operations will usually include, as a minimum:
●
add, subtract
●
and, or, not
●
shift right, shift left
–
ALUs in more complex processors will execute
many more instructions.
Basic Processor Structure...
●
Register File
–
A set of storage locations (registers) for storing
temporary results.
–
Early machines had just one register - usually
termed an accumulator.
–
Modern RISC processors will have at least 32
registers.
●
Instruction Register
–
The instruction currently being executed by the
processor is stored here.
Basic Processor Structure...
●
Control Unit
–
The control unit decodes the instruction in the
instruction register and sets signals which control
the operation of most other units of the
processor.
–
For example, the operation code (opcode) in the
instruction will be used to determine the settings
of control signals for the ALU which determine
which operation (+, -, AND, OR, NOT, shift, etc) it performs.
Basic Processor Structure...
●
Clock
–
The vast majority of processors are
synchronous, that is, they use a clock signal to
determine when to capture the next data word
and perform an operation on it.
–
In a globally synchronous processor, a common
clock needs to be routed (connected) to every
unit in the processor.
Basic Processor Structure...
●
Program counter
–
The program counter holds the memory address
of the next instruction to be executed.
–
It is updated every instruction cycle to point to
the next instruction in the program.
●
Memory Address Register
–
This register is loaded with the address of the
next data word to be fetched from or stored into
main memory.
Basic Processor Structure...
●
Address Bus
–
This bus is used to transfer addresses to
memory and memory-mapped peripherals.
–
It is driven by the processor acting as a bus
master.
●
Data Bus
–
This bus carries data to and from the processor,
memory and peripherals. It will be driven by the
source of data, ie processor, memory or
peripheral device.
Basic Processor Structure...
●
Multiplexed Bus
–
Of necessity, high performance processors
provide separate address and data buses.
–
To limit device pin counts and bus complexity,
some simple processors multiplex address and
data onto the same bus: naturally this has an
adverse effect on performance.
–
When a bus is used for multiple purposes, eg
address and data, it's called a multiplexed bus.
Executing Instructions
●
Let's examine the steps in the execution of a
simple memory fetch instruction, eg the
instruction at address 101c₁₆:
lw $1,0($2)
●
This instruction tells the processor to take the
address stored in register 2, add 0 to it and
load the word found at that address in main
memory into register 1.
●
As the next instruction to be executed (our lw
instruction) is at memory address 101c₁₆, the
program counter contains 101c.
Execution Steps
●
The control unit sets the multiplexer to drive
the PC onto the address bus.
●
The memory unit responds by placing
8c410000₁₆ - the lw $1,0($2) instruction as
encoded for a MIPS processor - on the data
bus, from where it is latched into the
instruction register.
●
The control unit decodes the instruction,
recognises it as a memory load instruction
and directs the register file to drive the
contents of register 2 onto the A input of the
ALU and the value 0 onto the B input. At the
same time, it instructs the ALU to add its inputs.
Execution Steps....
●
The output from the ALU is latched into the
MAR. The controller ensures that this value
is directed onto the address bus by setting
the multiplexor.
●
When the memory responds with the value
sought, it is captured on the internal data bus
and latched into register 1 of the register file.
●
The program counter is now updated to point
to the next instruction and the cycle can start
again.
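The same sequence can be sketched in C over a toy register/memory model
(the helper mem_read and the register array are assumptions made for
illustration, not real MIPS hardware):

    #include <stdint.h>

    extern uint32_t mem_read(uint32_t addr);   /* assumed memory interface */

    static uint32_t pc = 0x101c;
    static uint32_t reg[32];
    static uint32_t ir, mar;

    void lw_example(void)
    {
        ir  = mem_read(pc);        /* 1. PC driven onto the address bus; memory returns  */
                                   /*    0x8c410000, which is latched into the IR        */
        mar = reg[2] + 0;          /* 2. ALU adds $2 and the offset 0; result into MAR   */
        reg[1] = mem_read(mar);    /* 3. MAR driven onto the address bus; the word       */
                                   /*    returned is latched into register 1             */
        pc += 4;                   /* 4. PC updated to point to the next instruction     */
    }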
Another Example
●
Let's assume the next instruction is an add
instruction at address 1020₁₆: add $4,$1,$3
●
This instruction tells the processor to add the
contents of registers 1 and 3 and place the
result in register 4.
●
The control unit sets the multiplexer to drive
the PC onto the address bus.
●
The memory unit responds by placing
00232020₁₆ - the encoded add $4,$1,$3
instruction - on the data bus, from where it is
latched into the instruction register.
Another Example...
●
The control unit decodes the instruction,
recognizes it as an arithmetic instruction and
directs the register file to drive the contents
of register 1 onto the A input of the ALU and
the contents of register 3 onto the B input. At
the same time, it instructs the ALU to add its
inputs.
●
The output from the ALU is latched into the
register file at register address 4.
●
The program counter is now updated to point
to the next instruction.
Key Terms
●
von Neumann machine
–
A computer which stores its program in memory and steps
through instructions in that memory.
●
Pipeline
–
A sequence of alternating storage elements (registers or
latches) and combinatorial blocks, making up a datapath
through the computer.
●
program counter
–
A register or memory location which holds the address of
the next instruction to be executed.
●
synchronous (system/machine)
–
A computer in which instructions and data move from one
stage of a datapath to the next under the control of a
global clock.
Performance
●
Assume that the whole system is driven by a
clock at f MHz. This means that each clock
cycle takes t = 1/f microseconds.
●
Generally, a processor will execute one step
every cycle; thus, for a memory load
instruction, our simple processor needs:
Performance...
●
PC to bus: 1 cycle
●
Memory response: t_ac
●
Decode and register access: 1 cycle
●
ALU operation and latch result to MAR: 1 cycle
●
Memory response: t_ac
Performance...
●
If the clock is 100 MHz (10 ns per cycle) and
the memory response time is, say, 100 ns,
then our simple processor needs 3×10 + 2×100
= 230 ns to execute a load instruction.
●
For the add instruction, we make a similar
table:
–
an add instruction requires 3×10 + 100 = 130 ns to
execute.
●
A store operation will also need more than
200 ns, so instructions will require, on
average, about 150 ns.
Performance Measures
●
One commonly used performance measure
is MIPS, or millions of instructions per
second.
●
Our simple processor will achieve 1/(150×10⁻⁹ s)
≈ 6.6 × 10⁶ instructions per second
≈ 6.6 MIPS.
●
100 MHz is a very common clock frequency for
processors in 1998.
●
A MIPS rating of 6.6 is very ordinary.
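The arithmetic above can be checked with a few lines of C (the 10 ns cycle
and the 100 ns memory response are the figures assumed in the text):

    #include <stdio.h>

    int main(void)
    {
        double t_cycle = 10.0;    /* ns per clock cycle (100 MHz clock)  */
        double t_ac    = 100.0;   /* ns per memory response (assumed)    */

        double t_load = 3 * t_cycle + 2 * t_ac;   /* 230 ns */
        double t_add  = 3 * t_cycle + 1 * t_ac;   /* 130 ns */
        double t_avg  = 150.0;                    /* rough average from the text */

        printf("load: %.0f ns, add: %.0f ns\n", t_load, t_add);
        printf("~%.1f MIPS\n", 1e3 / t_avg);      /* 1 / 150 ns = ~6.7 million per second */
        return 0;
    }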
Bottlenecks
●
It will be obvious that access to main memory
is a major limiting factor in the performance
of a processor.
●
Management of the memory hierarchy to
achieve maximum performance is one of the
major challenges for a computer architect.
●
Unfortunately, the hardware maxim "smaller is
faster" conflicts with programmers' and users'
desires for more and more capabilities and
more elaborate user interfaces in their
programs - resulting in programs that require
megabytes of main memory to run!
Bottlenecks...
●
This has led the memory manufacturers to
concentrate on density (improving the
number of bits stored in a single package)
rather than speed.
●
They have been remarkably successful in
this: the growth in capacity of the standard
DRAM chips which form the bulk of any
computer's semiconductor memory has
matched the increase in speed of
processors.
Bottlenecks...
●
However the increase in DRAM access
speeds has been much more modest - even
if we consider recent developments in
synchronous RAM and FRAM.
●
Another reason for the manufacturers'
concentration on density is that a small
increase in DRAM access time has a
negligible effect on the effective access time,
which needs to include overheads for bus
protocols.
Bottlenecks...
●
Cache memories are the most significant
device used to reduce memory overheads.
●
However, a host of other techniques such as
pipelining, pre-fetching, branch prediction,
etc are all used to alleviate the impact of
memory fetch times on performance.
ALU
●
The Arithmetic and Logic Unit is the 'core' of
any processor: it's the unit that performs the
calculations.
●
A typical ALU will have two input ports (A and
B) and a result port (Y).
●
It will also have a control input telling it which
operation (add, subtract, and, or, etc) to
perform and additional outputs for condition
codes (carry, overflow, negative, zero result).
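A behavioural sketch of such an ALU in C (the operation encoding and the
condition-code handling are invented for illustration; carry and overflow
are omitted for brevity):

    #include <stdint.h>

    enum alu_op { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_NOT, ALU_SHL, ALU_SHR };

    /* Takes operands A and B plus an opcode; returns Y and sets two
     * condition codes (zero and negative). */
    uint32_t alu(enum alu_op op, uint32_t a, uint32_t b, int *zero, int *negative)
    {
        uint32_t y = 0;
        switch (op) {
        case ALU_ADD: y = a + b;         break;
        case ALU_SUB: y = a - b;         break;
        case ALU_AND: y = a & b;         break;
        case ALU_OR:  y = a | b;         break;
        case ALU_NOT: y = ~a;            break;
        case ALU_SHL: y = a << (b & 31); break;
        case ALU_SHR: y = a >> (b & 31); break;
        }
        *zero     = (y == 0);
        *negative = (y >> 31) & 1;       /* sign bit of the result */
        return y;
    }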
ALU...
●
ALUs may be simple and perform only a few
operations: integer arithmetic (add, subtract),
boolean logic (and, or, complement) and
shifts (left, right, rotate).
●
Such simple ALUs may be found in small 4-
and 8-bit processors used in embedded
systems.
ALU...
●
More complex ALUs will support a wider range of
integer operations (multiply and divide), floating point
operations (add, subtract, multiply, divide) and even
mathematical functions (square root, sine, cosine,
log, etc).
●
The largest market for general purpose
programmable processors is the commercial one,
where the commonest arithmetic operations are
addition and subtraction. Integer multiply and all other
more complex operations were performed in software
- although this takes considerable time (a 32-bit
integer multiply needs 32 adds and shifts), the low
frequency of such operations made this acceptable.
ALU...
●
Thus designers would allocate their valuable
silicon area to cache and other devices which
had a more direct impact on processor
performance in the target marketplace.
●
More recently, transistor geometries have
shrunk to the point where it's possible to get
10⁷ transistors on a single die.
●
Thus it becomes feasible to include floating
point ALUs on every chip - probably more
economic than designing separate
processors without the floating point unit.
ALU...
●
In fact, some manufacturers will supply otherwise
identical processors with and without floating point
capability.
●
This can be achieved economically by marking
chips which had defects only in the region of the
floating point unit as "integer-only" processors and
selling them at a lower price for the commercial
information processing market!
●
This has the desirable effect of increasing your
semiconductor yield quite significantly - a floating
point unit is quite complex and occupies a
considerable area of silicon.
ALU...
●
In simple processors, the ALU is a large
block of combinatorial logic with the A and B
operands and the opcode (operation code)
as inputs and a result, Y, plus the condition
codes as outputs.
●
Operands and opcode are applied on one
clock edge and the circuit is expected to
produce a result before the next clock edge.
●
Thus the propagation delay through the ALU
determines a minimum clock period and sets
an upper limit on the clock frequency.
ALU...
●
In advanced processors, the ALU is heavily
pipelined to extract higher instruction
throughput.
●
Faster clock speeds are now possible
because complex operations (eg floating
point operations) are done in multiple stages:
each individual stage is smaller and faster.
Software or Hardware?
●
The question of which instructions should be
implemented in hardware and which can be
left to software continues to occupy
designers.
●
A high performance processor with 10⁷
transistors is very expensive to design - $10⁸
is probably a minimum!
●
Thus the trend seems to be to place
everything on the die.
●
However, there is an enormous market for
lower capability processors - for embedded
systems.
Note for hackers
●
A small "industry" has grown up around the
phenomenon of "clock-chipping" - the
discovery that a processor will generally run
at a frequency somewhat higher than its
specification.
●
Of necessity, manufacturers are somewhat
conservative about the performance of their
products and have to specify performance
over a certain temperature range.
●
For commercial products this is commonly
0°C to 70°C.
Note for hackers...
●
A reputable computer manufacturer will also
be somewhat conservative, ensuring that the
temperature inside the case of its computers
normally never rises above, say, 45°C.
●
This allows sufficient margin for error in both
directions - chips sometimes degrade with
age and computers may encounter unusual
environmental conditions - so that systems
will continue to function to their
specifications.
Note for hackers...
●
Clock-chippers rely on the fact that
propagation delays usually increase with
temperature, so that a chip specified at x MHz
at 70°C may well run at 1.5x at 45°C.
●
Needless to say this is a somewhat reckless
strategy: your processor may function
perfectly well for a few months in winter - and
then start failing, initially occasionally, and
then more regularly as summer approaches!
Note for hackers...
●
The manufacturer may also have allowed for
some degradation with age, so that a chip
specified for 70°C now will still function at
x MHz in two years' time.
●
Thus a clock-chipped processor may start to
fail after a few months at the higher speed -
again the failures may be irregular and
occasional initially, and start to occur with
greater frequency as the effects of age show
themselves.
●
Restoring the original clock chip may be all
that is needed to restore correct operation.
Key terms
●
condition codes
–
a set of bits which store general information
about the result of an operation, eg result was
zero, result was negative, overflow occurred, etc.
Register File
●
The Register File is the highest level of the
memory hierarchy.
●
In a very simple processor, it consists of a
single memory location - usually called an
accumulator.
●
The result of ALU operations was stored here
and could be re-used in a subsequent
operation or saved into memory.
●
In a modern processor, it's considered
necessary to have at least 32 registers for
good performance.
Register File...
●
Thus the register file is a small, addressable
memory at the top of the memory hierarchy.
●
It's visible to programs (which address
registers directly), so that the number and
type (integer or floating point) of registers is
part of the instruction set architecture (ISA).
Register File...
●
Registers are built from fast multi-ported
memory cells.
●
They must be fast: a register must be able to
drive its data onto an internal bus in a single
clock cycle.
●
They are multi-ported because a register
must be able to supply its data to either the A
or the B input of the ALU and accept a value
to be stored from the internal data bus.
Register File...
Register File Capacity
●
A modern processor will have at least 32
integer registers, each capable of storing a
word of 32 (or, more recently, 64) bits.
●
A processor with floating point capabilities
will generally also provide 32 or more floating
point registers, each capable of holding a
double precision floating point word.
●
These registers are used by programs as
temporary storage for values which will be
needed for calculations.
Register File Capacity...
●
Because the registers are "closest" to the processor
in terms of access time - able to supply a value within
a single clock cycle - an optimising compiler for a
high level language will attempt to retain as many
frequently used values in the registers as possible.
●
Thus the size of the register file is an important factor
in the overall speed of programs.
●
Earlier processors with fewer than 32 registers (eg
early members of the x86 family) severely hampered
the ability of the compiler to keep frequently
referenced values close to the processor.
Register File Capacity...
●
However, it isn't possible to arbitrarily
increase the size of the register file. With too
many registers:
–
the capacitative load of too many cells on the
data line will reduce its response time,
–
the resistance of long data lines needed to
connect many cells will combine with the
capacitative load to reduce the response time,
Register File Capacity...
–
the number of bits needed to address the
registers will result in longer instructions. A
typical RISC instruction has three operands:
sub $5, $3, $6
requiring 15 bits with 32 (= 2⁵) registers (see the
sketch after this list),
–
the complexity of the address decoder (and thus
its propagation delay time) will increase as the
size of the register file increases.
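A few lines of C confirm the operand-field arithmetic (the register counts
below are just sample points for illustration):

    #include <stdio.h>

    int main(void)
    {
        for (int nregs = 16; nregs <= 128; nregs *= 2) {
            int bits = 0;
            while ((1 << bits) < nregs)
                bits++;                       /* bits needed to name one register */
            /* three register operands, as in "sub $5, $3, $6" */
            printf("%3d registers -> %2d instruction bits for operands\n",
                   nregs, 3 * bits);          /* 32 registers -> 15 bits */
        }
        return 0;
    }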
Ports
●
Register files need at least 2 read ports: the ALU
has two input ports and it may be necessary to
supply both of its inputs from the same register: eg
add $3, $2, $2
●
The value in register 2 is added to itself and the
result stored in register 3. So that both operands
can be fetched in the same cycle, each register
must have two read ports.
●
In superscalar processors, it's necessary to have
two read ports and a write port for each functional
unit, because such processors can issue an
instruction to each functional unit in every cycle.
Key terms
●
memory hierarchy
–
Storage in a processor may be arranged in a
hierarchy, with small, fast memories at the "top"
of the hierarchy and slower, larger ones at the
bottom. Managing this hierarchy effectively is
one of the major challenges of computer
architecture.
Cache
●
Cache is a key to the performance of a
modern processor.
●
Typically ~25% of instructions reference
memory, so that memory access time is a
critical factor in performance.
●
By effectively reducing the cost of a memory
access, caches enable the greater than one
instruction/cycle goal for instruction
throughput for modern processors.
Cache...
●
A further indication of the importance of
cache may be gained from noting that, of the
6.6×10⁶ transistors in the MIPS R10000,
4.4×10⁶ were used for primary caches.
Locality of Reference
●
All programs show some locality of
reference. This appears to be a universal
property of programs - whether commercial,
scientific, games, etc.
●
Cache exploits this property to improve the
access time to data and reduce the cost of
accessing main memory. There are two types
of locality:
–
Temporal Locality
–
Spatial Locality
Temporal Locality
●
Once a location is referenced, there is a high
probability that it will be referenced again in the near
future.
●
Instructions
–
The simplest example of temporal locality is instructions
in loops: once the loop is entered, all the instructions in
the loop will be referenced again - perhaps many times -
before the loop exits.
–
However, commonly called subroutines and functions
and interrupt handlers (eg the timer interrupt handler)
also have the same property: if they are accessed once,
they are very likely to be accessed again soon.
Temporal Locality...
●
Many types of data exhibit temporal locality:
at any point in a program there will tend to be
some "hot" data that the program uses or
updates many times before going on to
another block of data.
●
Some examples are (see the sketch below):
–
Counters
–
Look-up Tables
–
Accumulation variables
–
Stack variables
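A tiny made-up fragment of the kind the list above refers to - the loop
counter and the accumulation variable are read and written on every
iteration, so they stay "hot" for the whole loop:

    /* data[] and n are parameters; sum and i are the temporally local data. */
    double total(const double *data, int n)
    {
        double sum = 0.0;                /* accumulation variable: updated every iteration */
        for (int i = 0; i < n; i++)      /* loop counter: reused every iteration           */
            sum += data[i];
        return sum;
    }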
Spatial Locality
●
When an instruction or datum is accessed it is very
likely that nearby instructions or data will be
accessed soon.
●
Instructions
–
It's obvious that an instruction stream will exhibit
considerable spatial locality. In the absence of jumps, the
next instruction to be executed is the one immediately
following the current one.
●
Data
–
Data also shows considerable spatial locality - particularly
when arrays or strings are accessed. Programs commonly
step through the elements of an array or string in order,
so successive accesses fall in neighbouring memory
locations (see the sketch below).
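A small made-up example of that access pattern - successive iterations
touch consecutive bytes of both strings, so each access lands next to the
previous one:

    #include <stddef.h>

    /* Copy a NUL-terminated string: src[0], src[1], ... and dst[0], dst[1], ...
     * are visited in strictly increasing address order (spatial locality). */
    void copy_string(char *dst, const char *src)
    {
        size_t i = 0;
        do {
            dst[i] = src[i];
        } while (src[i++] != '\0');
    }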
Cache operation
●
The most basic cache is a direct-mapped
cache. It is a small table of fast memory
(modern processors will store 16-256 Kbytes
of data in a first-level cache and access a
cache word in 2 cycles).
●
There are two parts to each entry in the
cache, the data and a tag.
Cache operation...
●
Suppose memory addresses have p bits (allowing 2ᵖ
bytes of memory to be addressed) and the
cache can store 2ᵏ words of memory.
●
Then the least significant m bits of the
address address a byte within a word. (Each
word contains 2ᵐ bytes.)
●
The next k bits of the address select one of
the 2ᵏ entries in the cache.
Cache operation...
●
The p-k-m bits of the tag in this entry are
compared with the most significant p-k-m bits
of the memory address: if they match, then
the data "belongs" to the required memory
address and is used instead of data from the
main memory.
●
When the cache tag matches the high bits of
the address, we say that we've got a cache
hit.
●
Thus a request for data from the CPU may
be supplied in 2 cycles rather than the 20 or
more cycles needed to access main memory.
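The address split can be written out directly (the values of k and m below
are sample choices, and the field extraction is a sketch rather than any
particular cache's implementation):

    #include <stdint.h>

    #define M 2      /* 2^m = 4 bytes per word        */
    #define K 12     /* 2^k = 4096 cache entries      */

    struct cache_fields {
        uint32_t byte_in_word;   /* least significant m bits           */
        uint32_t index;          /* next k bits: selects a cache entry */
        uint32_t tag;            /* remaining p-k-m high bits          */
    };

    struct cache_fields split(uint32_t addr)
    {
        struct cache_fields f;
        f.byte_in_word = addr & ((1u << M) - 1);
        f.index        = (addr >> M) & ((1u << K) - 1);
        f.tag          = addr >> (M + K);
        return f;
    }

    /* On a lookup, the tag stored at entry f.index is compared with f.tag:
     * equal   -> cache hit, data supplied from the cache,
     * unequal -> miss, the word must be fetched from main memory. */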
Cache operation...
Basic operations
●
Write-Through
●
Write-Back
Cache organizations
●
Direct Mapped
●
Fully Associative
Cache organizations...
Memory
●
Memory Technologies
–
Two different technologies can be used to store
bits in semiconductor random access memory
(RAM): static RAM and dynamic RAM.
Static RAM
●
Static RAM cells use 4-6 transistors to store a single
bit of data.
●
This provides faster access times at the expense of
lower bit densities.
●
A processor's internal memory (registers and cache)
will be fabricated in static RAM.
●
Because the industry has focussed on mass-producing
dynamic RAM in ever-increasing
densities, static RAM is usually considerably more
expensive than dynamic RAM: due both to its lower
density and the smaller demand (lower production
volumes).
Static RAM...
●
Static RAM is used extensively for second
level cache memory, where its speed is
needed and a relatively small memory will
lead to a significant increase in performance.
●
A high-performance 1998 processor will
generally have 512 KB to 4 Mbyte of L2 cache.
Static RAM...
●
Since it doesn't need refresh, static RAM's
power consumption is much less than
dynamic RAM's, so SRAMs will be found in
battery-powered systems.
●
The absence of refresh circuitry leads to
slightly simpler systems, so SRAM will also
be found in very small systems, where the
simplicity of the circuitry compensates for the
cost of the memory devices themselves.
Dynamic RAM
●
The bulk of a modern processor's memory is
composed of dynamic RAM (DRAM) chips.
●
One of the reasons that memory access times
have not reduced as dramatically as processor
speeds have increased is probably that the
memory manufacturers appear to be involved in a
race to produce higher and higher capacity chips.
●
It seems there is considerable kudos in being first
to market with the next generation of chips. Thus
density increases have been similar to processor
speed increases.
Dynamic RAM...
●
A DRAM memory cell uses a single transistor and a
capacitor to store a bit of data.
●
Devices providing 256 Mbits of storage in a single
chip are reported to be in limited production.
●
In the same period, CPUs with 10 million transistors
in them are considered state-of-the-art.
●
Regularity is certainly a major contributor to this
apparent discrepancy: a DRAM is about as regular
as it is possible to imagine any device could be: a
massive 2-D array of bit storage cells.
●
In contrast, a CPU has a large amount of irregular
control and datapath logic.
Dynamic RAM...
●
A typical DRAM cell with a single MOSFET
and a storage capacitor
The Memory Hierarchy
●
Processors use memory of various types to store
the data on which a program operates. Data may
start in a file on a magnetic disc ("hard disc"), be
read into semiconductor memory ("D-RAM") for
processing, transformed and written back to disc.
●
As part of the transformation process, individual
words of data will be transferred to the
processor's registers and thence to the ALU.
●
Transfers from semiconductor memory to
registers will usually pass (transparently to a
programmer) through one or more levels of
cache.
The Memory Hierarchy...
●
Memory designers are able to trade speed
for capacity - the fastest memory (registers)
having access times below 10 ns but the
lowest capacity (10s of words) and the
slowest (magnetic tape) having access times
of several seconds but the highest capacity
(10s of Gbytes).
●
Thus the memory in a system can usually be
arranged in a hierarchy from the slowest (and
highest capacity) to the fastest (and lowest
capacity).
The Memory Hierarchy...
●
Discrepancy between Processor and Bus
Frequencies
The Memory Hierarchy...
●
This figure shows the ratio between
processor clock frequency ("Core
Frequency") and bus frequency for the last
dozen years.
●
It can be seen that the ratio has started to
increase dramatically recently - as processor
frequencies have continued to increase and
bus frequencies have remained fixed or
increased at a slower rate.
●
The bus frequency determines the rate at
which data can be transferred to and from
main memory.
The Memory Hierarchy...
●
The widening gap between processor and
bus frequencies shows that effective
management of this memory hierarchy is
now critical to performance, posing a challenge
for architects and programmers alike.
The Memory Hierarchy...
Further Speed!
●
Instruction Issue Unit
–
With more than one word in a cache line, every request
from the instruction fetch unit to the cache is able to fetch
several instructions.
–
The instruction issue unit will normally try to take
advantage of this by considering all of these instructions
as possible candidates for the next instruction to be
issued to the ALU.
–
In the absence of branches, if the next instruction is
unable to be issued because its operands are not
available, there may be a subsequent instruction which
can be issued in its place.
Further Speed!..
●
This is the simplest form of pre-fetching: the
instruction issue unit fetches and considers
all the instructions in the cache line as
potential candidates for issue to the ALU(s)
next.
●
In a super-scalar, the instruction issue unit
will try to issue an instruction to each
functional unit in every cycle, so it wants as
many "candidates" as possible to check for
dependencies!
Further Speed!..
●
Branching Costs
–
Branching is always expensive in a heavily
pipelined processor.
–
Even unconditional branches require interruption
of current instruction fetching operations and
restarting the instruction stream from a new
memory location - any instructions which may
have been fetched from memory and are waiting
in instruction buffers or cache lines must be
discarded.
Further Speed!..
●
Conditional branches can be much more
disruptive as they must often wait for
operands to be generated or status bits to be
set before the direction of the branch (taken
or not taken) can be determined: an efficient
processor may have fetched and partially
executed many instructions beyond the
branch before it is known whether to take it
or not.
●
Thus considerable effort has been devoted to
predicting branches accurately.
Further Speed!..
●
Branch Prediction
–
Predicting that a branch will be taken turns out to
be a good choice most of the time.
–
The majority of branches are those at the end of
loops which branch back to the start of the loop.
If the loop iterates n times, then the branch will
be taken (n-1)/n of the time.
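A small illustration of why "predict taken" works well for loop-closing
branches (the iteration counts below are arbitrary sample values): for a
loop that iterates n times, the backward branch is taken n-1 times and
falls through only once.

    #include <stdio.h>

    int main(void)
    {
        for (int n = 2; n <= 1000; n *= 10) {
            double accuracy = (double)(n - 1) / n;   /* fraction predicted correctly */
            printf("n = %4d iterations: predict-taken accuracy = %.1f%%\n",
                   n, 100.0 * accuracy);
        }
        return 0;
    }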
Further Speed!..
●
Branch Target Buffers