Computer Architecture 219 (623.219)
Lecturer:
A/Prof Gary Bundell
Room:
4.12 (EECE)
Email:
[email protected]
Phone:
6488 3815
Associate Lecturer (tutorial & lab coordination):
Filip Welna
Room:
2.82 (EECE)
Email:
[email protected]
Phone:
6488 1245
1
Website:
swww.ee.uwa.edu.au/~ca219
Text:
W. Stallings, Computer Organization and Architecture, 6th Edition, 2003 (5th Edition is also ok)
Tutorials:
Starting 2nd week of semester
Labs:
Starting 3rd week of semester
Assessment:
Mid-semester Test (10%), Essay (10%),
Laboratories (20%), Exam (60%)
Penalties:
Essay & Lab reports 10% per day late
Lecture Notes:
Available from School
Plagiarism:
www.ecm.uwa.edu.au/for/staff/pol/plagiarism
and swww.ee.uwa.edu.au/policies/plagiarism
Scaling Policy:
www.ecm.uwa.edu.au/for/staff/pol/assess
Appeals:
www.ecm.uwa.edu.au/for/staff/pol/exam_appeals
2
Course Objectives:
To review developments and the historical background of
computers.
To introduce and motivate the study of computer architectures,
organisation, and design.
To present foundational concepts for learning computer
organisation.
To describe the technological context of current computer
organisation
Generic skills:
Investigation and report writing (assignment and labs)
Programming (labs)
Prerequisites:
Computer Engineering 102 or Computer Hardware 103
3
1
Lecture Content Schedule:
1. Introduction and Overview
1a. C basics
2. Evolution of Computer Architecture - Historical Perspectives
3. Computer Systems and Interconnection Structures (Buses)
4. The Memory System
4a. Cache Memory
4b. Virtual Memory
5. The Input/Output (I/O) System
6. Instruction Set Architecture
7. CPU Structure and the Control Unit
8. Instruction Pipelining
9. RISC Architectures
4
Section 1
Introduction & Overview
Motivation
This course has several goals:
• To review developments and the historical background of digital
computers.
• To introduce and motivate the study of computer architectures,
organisation, and design.
• To present foundational concepts for learning computer
architectures.
• To describe the technological context of current computer
organisation.
• To examine the operation of the major building blocks of a
computer system, and investigate performance enhancements
for each component
6
2
What is a Computer?
Functional requirements of a computer:
• Process data
• Store data
• Move data between the computer and the outside world
Need to control the operation of the above
7
Functional View of a Computer
[Diagram: the four functional elements – a data storage facility, a data movement apparatus, a control mechanism and a data processing facility.]
8
Basic Functional units
[Diagram: basic functional units of a computer – input and output (I/O), memory, and the processor (central processing unit, CPU), which comprises the arithmetic and logic unit and the control unit.]
9
3
Computer Architecture
Hayes: “The study of the structure, behaviour and
design of computers”
Hennessy and Patterson: “The interface between the
hardware and the lowest level of software”
Architecture is those attributes visible to the
programmer – those that have a direct impact on the
execution of a program
• Instruction set, number of bits used for data representation,
I/O mechanisms, addressing techniques.
• e.g. Is there a multiply instruction?
10
Computer Organisation
Organization is how features are implemented
• Control signals, interfaces, technologies.
• e.g. Is there a hardware multiply unit or is it done by repeated
addition?
Synonymous with “architecture” in many texts
An organisation is the underlying implementation of an
architecture
Transparent to the programmer – does not affect
him/her
11
Architecture and Organisation
All Intel x86 family share the same basic architecture
The IBM System/370 family share the same basic
architecture
This gives code compatibility
• At least backwards
Organization differs between different versions
12
4
Types of Computers
Microcomputers: better known as personal computers. Powered by
microprocessors (where the entire processor is contained on a
single chip)
Minicomputers: more powerful than personal computers and
usually operate in a time-shared fashion (supported by
magnetic/optical disk capacity). Can be used for applications such
as, payroll, accounting, and scientific computing. Most popular in
the 1970’s, and have more recently been replaced by servers.
Workstations: these machines have a computational power in the
minicomputer class supported by graphical input/output capacity.
Mainly used for engineering applications.
13
Types of Computers
Mainframes: this type of computer is used for business and
data processing in medium to large corporations whose computing
and storage capacity requirements exceed the capacity of
minicomputers. Mainframes were the dominant form of computing in the
1960’s.
Supercomputers: these are very powerful machines. The power of
a single supercomputer can be compared to that of two or more
mainframes put together! Mainly used for intensive and large-scale
numerical calculations such as weather forecasting and
aircraft design and simulation. Can be custom-built, and one-of-a-kind.
14
Types of Computers
Parallel and Distributed Computing Systems: these are computer
systems that evolved from machines based on a single processing
unit into configurations that contain a number of processors.
Many servers, mainframes, and (pretty much all) supercomputers
nowadays contain multiple CPU’s. The differentiation between
parallel and distributed is in the level of coupling between the
processing elements.
15
5
Measuring the Quality of a Computer
Architecture
One can evaluate the quality of an architecture using a
variety of measures that tend to be important in
different contexts (Is it fast??).
One of the most important comparisons is made on the
basis of speed (performance) or the ratio of price to
performance. However, speed is not necessarily an
absolute measure.
Also, architectures can be compared on critical
measures when choices must be made.
16
Measuring the Quality of a Computer
Architecture
Generality
• scientific applications vs. business applications.
• floating-point arithmetic vs. decimal arithmetic.
• the diversity of the instruction set.
• generality means more complexity.
Applicability
• utility of an architecture for its intended use.
• special-purpose architectures.
Expandability
• expanding the capabilities of a given architecture.
17
Measuring the Quality of a Computer
Architecture
Efficiency
• utilisation of hardware.
• efficiency and generality?
• decreasing cost of hardware
• simplicity
Ease of Use
• ease of building an operating system
• instruction set architecture
• user-friendliness
Malleability
• the ease of implementing a wide variety of computers that share the
same architecture.
18
6
The Success of a Computer
Architecture
Architectural Merit
• Applicability, Malleability, Expandability.
• Compatibility (with members of the same family).
• Commercial success:
• openness of an architecture.
• availability of a compatible and comprehensible programming model.
• quality of the early implementations.
System Performance
• speed of a computer.
• how good is the implementation.
19
Obtaining Performance
So, what is the Problem?
Processor speeds – we need more speed!
Very large memories
Slow memories – speeds
Slow buses – speeds, bandwidth
20
CPU and Memory Speeds
21
7
Why do we need more powerful
computers?
To solve the Grand Challenges!
[Chart: Grand Challenge applications plotted by storage requirement (10 MB up to 1 TB) against computational performance requirement (100 MFLOPS up to 1 TFLOPS) – 2D airfoil, 48-hour and 72-hour weather forecasting, chemical dynamics, 3D plasma modelling, oil reservoir modelling, vehicle dynamics, pharmaceutical design, structural biology.]
22
Measuring Performance
Hardware performance is often the key to the effectiveness of
an entire system of hardware and software.
To compare two architectures, or two implementations of an
architecture, or two compilers for an architecture.
Different types of applications require different performance
metrics.
Certain parts of a computer system (e.g., CPU, memory) may be
most significant in determining the overall performance.
Several criteria can be used (e.g., response time, execution time,
throughput).
23
Performance and Execution Time
Performance ∝ 1 / ExecutionTime

Performance1 > Performance2
⟺ 1 / ExecutionTime1 > 1 / ExecutionTime2
⟺ ExecutionTime2 > ExecutionTime1

Performance1 / Performance2 = ExecutionTime2 / ExecutionTime1
24
8
Execution Time
It should be noted that the only complete and reliable
measure of performance is time.
Usually, program execution time is measured in seconds
(per program).
Wall-clock time, response time, or elapsed time.
Most computers are time-shared systems (so CPU time
may be much less than elapsed time).
25
Execution Time
Elapsed time includes both user CPU time and system CPU time – the two are
hard to distinguish.
90.7u 12.9s 2:39 65% (from the “time” command in UNIX):
• User CPU time is 90.7 seconds
• System CPU time is 12.9 seconds
• Elapsed time is 2 minutes and 39 seconds (159 seconds)
• The % of elapsed time that is CPU time is (90.7 + 12.9) / 159 = 0.65
26
Execution Time
At times, system CPU time is ignored when examining
CPU execution time because of the inaccuracy of
operating system’s self measurement.
The inclusion of system CPU time is possibly not valid
when comparing computer systems that run different
operating systems.
However, a case can be made for using the sum of user
CPU time and system CPU time to measure program
execution time.
27
9
Execution Time
Distinction made between different performance measures, on
the basis of the time that was used in their calculation
Elapsed Time
System Performance
(unloaded system)
CPU Execution Time
CPU Performance
(user CPU time)
28
CPU Clock
Almost all computers are constructed using a clock that runs at a
constant rate and determines when events take place in the
hardware (clock cycles, clock periods, clock ticks, etc).
Clock period given by clock cycle (e.g., 10 nanoseconds, or 10ns).
Clock rate (e.g., 100 megahertz, or 100 MHz).
29
Relating Performance Metrics
Users and designers use different metrics to examine
performance. If these metrics can be related, one
could determine the effect of a design change on the
performance as seen by the user.
Designers can improve the performance by reducing
either the length of the clock cycle or the number of
clock cycles required for a program.
CPU execution time for a program = CPU clock cycles for a program× Clock cycle time
CPU execution time for a program =
CPU clock cycles for a program
Clock rate
30
10
Relating Performance Metrics
It is important to make a reference to the number of instructions
needed for the program.
CPI: provides one way of comparing two different
implementations of the same instruction set architecture, since
the instruction count for a program will be the same (since the
instruction set is the same).
CPU clock cycles = Instructions for a program ×
Average clock cycles per instruction (CPI)
31
Basic Performance Formula
The Golden Formula!
CPUTime = InstructionCount × CPI × ClockCycleTime

CPUTime = (InstructionCount × CPI) / ClockRate
Three key factors that affect performance.
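As a quick illustration of the golden formula, the following C sketch plugs in made-up values for the three factors (the instruction count, CPI and clock rate below are assumptions, not measurements):

#include <stdio.h>

int main(void) {
    double instruction_count = 2.0e9;   /* instructions executed (assumed) */
    double cpi = 1.5;                   /* average clock cycles per instruction (assumed) */
    double clock_rate_hz = 500.0e6;     /* 500 MHz clock (assumed) */

    double cycle_time = 1.0 / clock_rate_hz;                 /* seconds per cycle */
    double cpu_time = instruction_count * cpi * cycle_time;  /* IC x CPI x ClockCycleTime */

    printf("CPU time = %.3f s\n", cpu_time);                 /* 2e9 * 1.5 / 5e8 = 6 s */
    return 0;
}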
32
Performance Formula
CPUTime = InstructionCount × CPI × ClockCycleTime
CPU execution time: can be measured by running a program.
Clock cycle time: can be obtained from manuals.
Instruction count (IC): can be measured by using software tools that
profile the execution time or by using simulators of the architecture
under study. Note that instruction count depends on the architecture
(the instruction set) and not the implementation.
CPI: depends on a wide variety of design details (memory system,
processor structure, and mix of instruction types executed in an
application). So, CPI varies by application, as well as among
implementations of the same instruction set.
33
11
Performance Formula
CPU clock cycles = ∑ (j=1 to n) CPIj × ICj

ICj: count of number of instructions in class j executed.
CPIj: the average number of cycles per instruction for that instruction
class.
n: number of instruction classes.

CPIaverage = ( ∑ (j=1 to n) CPIj × ICj ) / ICtotal
           = ( ∑ (j=1 to n) CPIj × ICj ) / ( ∑ (j=1 to n) ICj )
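A minimal C sketch of the weighted-average CPI calculation above; the three instruction classes and their counts are illustrative assumptions:

#include <stdio.h>

int main(void) {
    double cpi[] = {1.0, 2.0, 3.0};        /* cycles per instruction for each class (assumed) */
    double ic[]  = {45e6, 30e6, 25e6};     /* instructions executed in each class (assumed) */
    int n = 3;

    double cycles = 0.0, ic_total = 0.0;
    for (int j = 0; j < n; j++) {
        cycles   += cpi[j] * ic[j];        /* sum of CPIj x ICj */
        ic_total += ic[j];
    }
    printf("total cycles = %.0f, average CPI = %.2f\n", cycles, cycles / ic_total);
    return 0;
}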
34
Other Performance Metrics
From a practical standpoint, it is always clearer when quantities
are described as rates or ratios, for example:
•
•
•
•
•
millions of instructions per sec (MIPS)
millions of floating point operations per sec (MFLOPS)
millions of bytes per sec (Mbytes/sec) -> bandwidth
similarly, millions of bits per sec (Mbits/sec)
transactions per sec (TPS) -> throughput
35
MIPS
Millions of Instructions Per Second
MIPS is a measure of the instruction execution rate for a particular
machine (native MIPS). MIPS is easy to understand (?). Faster machines
have larger MIPS.

MIPS = InstructionCount / (ExecutionTime × 10^6)
     = InstructionCount / (CPUClocks × CycleTime × 10^6)

MIPS = (InstructionCount × ClockRate) / (InstructionCount × CPI × 10^6)
     = ClockRate / (CPI × 10^6)

ExecutionTime = (InstructionCount × CPI) / ClockRate
              = InstructionCount / ((ClockRate / (CPI × 10^6)) × 10^6)

ExecutionTime = InstructionCount / (MIPS × 10^6)
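A small C sketch of the native-MIPS relations above; the clock rate, CPI and instruction count are invented values for illustration only:

#include <stdio.h>

int main(void) {
    double clock_rate_hz = 400.0e6;  /* 400 MHz (assumed) */
    double cpi = 2.0;                /* average CPI for this program (assumed) */
    double ic = 1.0e9;               /* instruction count (assumed) */

    double mips = clock_rate_hz / (cpi * 1.0e6);   /* native MIPS = ClockRate / (CPI x 10^6) */
    double exec_time = ic / (mips * 1.0e6);        /* same as IC x CPI / ClockRate */

    printf("MIPS = %.1f, execution time = %.2f s\n", mips, exec_time);
    return 0;
}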
36
12
MIPS
MIPS specifies the instruction execution rate which depends on the instruction
set (the instruction count and the CPI). Computers with different instruction sets
cannot be compared by using MIPS.
MIPS varies between programs on the same computer (a given machine cannot have
a single MIPS rating), due to different instruction mixes and hence different
average CPI.
MIPS can vary inversely with perceived performance
Peak MIPS: choosing an instruction mix that minimises the CPI, even if that
instruction mix is totally impractical.
To standardise MIPS ratings across machines, there are native and normalised
(with respect to a reference machine, the VAX 11-780) ratings
37
MFLOPS
Millions of Floating point Operations Per Second
Used in comparing performance of machines running
“scientific applications”
Intended to provide a fair comparison between
different architectures, since a flop is the same on all
machines.
Problem: MFLOPS rating varies with floating point
instruction mix
38
Speedup
A ratio of two performances
Used to quantify the extent of an architectural
improvement
Speedup = Performance after enhancement / Performance before enhancement
        = Execution time before enhancement / Execution time after enhancement
39
13
Amdahl’s Law
Limits the performance gain that can be obtained by
improving some portion/component of a computer.
Performance improvement when some mode of
execution is increased in speed is determined by the
fraction of the time that the faster mode is used
ExecutionTimenew = ExecutionTimeold × Fractionold +
                   (ExecutionTimeold × Fractionnew) / Speedupnew
                 = ExecutionTimeold × ((1 − Fractionnew) + Fractionnew / Speedupnew)

let α = Fractionold = 1 - Fractionnew

Speeduptotal = ExecutionTimeold / ExecutionTimenew
             = Speedupnew / (1 + α × (Speedupnew − 1))
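A minimal C sketch of Amdahl's Law: with f the fraction of execution time that can use the enhancement and s the speedup of that fraction, the overall speedup is 1 / ((1 − f) + f/s), which equals the form above for α = 1 − f. The 70% / 10× figures are assumptions chosen for illustration:

#include <stdio.h>

static double amdahl(double f, double s) {
    /* overall speedup when fraction f of the old execution time is sped up by s */
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    /* illustrative assumption: 70% of execution can be made 10x faster */
    printf("overall speedup = %.2f\n", amdahl(0.7, 10.0));   /* about 2.70 */
    return 0;
}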
40
Aggregate Performance Measures
arithmetic mean of {Xi} = ( ∑ (i=1 to N) Xi ) / N

harmonic mean of {Xi} = N / ( ∑ (i=1 to N) 1/Xi )
41
Aggregate Performance Measures
weighted arithmetic mean of {Xi} = ( ∑ (i=1 to N) Wi × Xi ) / ( ∑ (i=1 to N) Wi )

geometric mean of {Xi} = ( ∏ (i=1 to N) Xi )^(1/N)

ie, GM is the product of the N values, with the N-th root then taken
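The C sketch below evaluates the four means for a small sample set; the data values and weights are made up, and the program must be linked with the maths library:

#include <stdio.h>
#include <math.h>   /* link with -lm */

int main(void) {
    double x[] = {2.0, 4.0, 8.0};     /* sample performance figures (assumed) */
    double w[] = {1.0, 1.0, 2.0};     /* weights for the weighted mean (assumed) */
    int n = 3;

    double sum = 0.0, inv_sum = 0.0, wsum = 0.0, wxsum = 0.0, log_sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum     += x[i];
        inv_sum += 1.0 / x[i];
        wsum    += w[i];
        wxsum   += w[i] * x[i];
        log_sum += log(x[i]);         /* take the product via logs to avoid overflow */
    }
    printf("arithmetic mean = %.3f\n", sum / n);            /* 4.667 */
    printf("harmonic mean   = %.3f\n", n / inv_sum);        /* 3.429 */
    printf("weighted mean   = %.3f\n", wxsum / wsum);       /* 5.500 */
    printf("geometric mean  = %.3f\n", exp(log_sum / n));   /* 4.000 */
    return 0;
}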
42
14
Timing
Generally, the timing mechanism has a much coarser resolution
than the CPU clock.

[Diagram: timer ticks T1, T2, …, Tk, …, Tn, spaced dt apart, shown against
the program execution time running from T(start) to T(finish).]

Timer Period = dt sec/tick
Timer Resolution = 1/dt tick/sec

measured time = Tn - T1
actual time = (Tn - T1) + (Tfinish - Tn) - (Tstart - T1)

fstart = (Tstart - T1)/dt (fraction overreported)
ffinish = (Tfinish - Tn)/dt (fraction underreported)

Absolute Error = dt × fstart - dt × ffinish = dt × (fstart - ffinish)
Max Absolute Error = ±dt
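A minimal C sketch of measuring a run with a coarse timer, using the standard clock() routine; the work loop is only a placeholder, the reported period is the nominal one, and any single measurement carries an error of up to about one timer period, as described above:

#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t t1 = clock();                  /* tick at (or just after) T(start) */

    volatile double acc = 0.0;             /* placeholder work being measured */
    for (long i = 0; i < 10000000L; i++)
        acc += (double)i;

    clock_t tn = clock();                  /* last tick before T(finish) */
    double dt = 1.0 / CLOCKS_PER_SEC;      /* nominal timer period in seconds */
    double measured = (double)(tn - t1) / CLOCKS_PER_SEC;

    printf("measured = %.4f s (timer period %.6f s, error up to +/- %.6f s)\n",
           measured, dt, dt);
    return 0;
}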
43
Timing
[Diagram: two runs laid against the timer tick grid.]
Case 1: actual running time ≈ zero, Measured Time = dt,
Absolute Measurement Error = +dt.
Case 2: actual running time ≈ 2dt, Measured Time = dt,
Absolute Measurement Error = -dt.
44
Benchmarking
A computer benchmark is typically a computer program
that performs a strictly defined set of operations (a
workload) and returns some form of result (a metric)
describing how the tested computer performed.
Computer benchmark metrics usually measure speed
(how fast was the workload completed) or throughput
(how many workloads per unit time were measured).
Running the same computer benchmark on several
computers allows a comparison to be made.
45
15
Benchmarking
Finding “real-world” programs is not easy (portability)!
Provide a target for computer system developers.
Benchmarks could be abused as well. However, they
tend to shape a field.
46
Classes of Benchmarks
Toy benchmarks: Few lines of code (multiply two
matrices, etc).
Kernels: Time-critical chunks of real programs (e.g.
Livermore loops).
Synthetic: Attempts to match average frequencies of
real workloads (e.g. Whetstone, Dhrystone, etc).
Real applications: Compilers, etc.
47
SPEC
www.spec.org.
Five companies formed the System Performance Evaluation Corporation
(SPEC) in 1988 (SUN, MIPS, HP, Apollo, DEC).
Development of a “standard”.
Floating-point and Integer benchmarks.
48
16
SPEC Floating Point (CFP2000)
Name – Application
168.wupwise – Quantum chromodynamics
171.swim – Shallow water modelling
173.applu – Parabolic/elliptic partial differential equations
177.mesa – 3D Graphics library
187.facerec – Computer vision: recognises faces
188.ammp – Computational chemistry
189.lucas – Number theory: primality testing
191.fma3d – Finite element crash simulation
49
SPEC CFP2000 (contd)
Name – Application
178.galgel – Fluid dynamics: analysis of oscillatory instability
179.art – Neural network simulation; adaptive resonance theory
183.equake – Finite element simulation; earthquake modelling
200.sixtrack – Particle accelerator model
301.apsi – Solves problems regarding temperature, wind, velocity and distribution of pollutants
172.mgrid – Multigrid solver over 3D field
50
SPEC Integer Benchmarks (CINT2000)
Name – Application
164.gzip – Data compression utility
175.vpr – FPGA circuit placement and routing
176.gcc – C compiler
181.mcf – Minimum cost network flow solver
186.crafty – Chess program
197.parser – Natural language processing
252.eon – Ray tracing (C++)
253.perlbmk – Perl
254.gap – Computational group theory
255.vortex – Object Oriented Database
256.bzip2 – Data compression utility
300.twolf – Place and route simulator
51
17
Summary
Differentiate between computer architecture and organisation
Many factors determine the “quality” of an architecture. There is
no “best” system!
Defined measures for computer performance evaluation and
comparison.
52
Section 2
Historical Perspectives
Summary of Generations
Computer generations are usually determined by the change in dominant
implementation technology.
Generation 1 (1950-1958): Hardware – Vacuum tubes, magnetic drums; Software – Stored programs; Product – Commercial electronic computer.
Generation 2 (1958-1964): Hardware – Transistors, core memory, FP arithmetic; Software – High level programming languages; Product – Mainframes.
Generation 3 (1964-1971): Hardware – Integrated circuits, semiconductor memory; Software – Multiprogramming / time-sharing, graphics; Product – Minicomputers.
Generation 4 (1972-1980): Hardware – LSI/VLSI, single chip CPU, single board computer; Software – Expert systems; Product – PC’s and workstations.
Generation 5 (1980s-today): Hardware – VLSI/ULSI, massively parallel machines, networks; Software – Parallel languages, AI, the internet; Product – Mobile computing devices and parallel computers.
54
18
Mechanical Era
Generation 0
Wilhelm Schickhard (1623)
Blaise Pascal (1642)
Gottfried Leibniz (1673)
Charles Babbage (1822) built the Difference Engine and the
Analytic Engine – “Father of the modern computer”
George Boole (1847)
Herman Hollerith (1889) formed the Tabulating Machine Company
which became IBM
Konrad Zuse (1938)
Howard Aiken (1943) designed the Harvard Mark 1 electromechanical calculator
55
The ENIAC
Electronic Numerical Integrator And Computer (1943-1946)
J. Presper Eckert and John Mauchly @ University of Pennsylvania
Supposedly the first general-purpose electronic digital computer,
but this is in contention
Built for WWII ballistics calculations
20 accumulators of 10 digits (decimal)
Programmed manually by switches
18,000 vacuum tubes
30 tons + 15,000 square feet
140 kW power consumption
5,000 additions per second
Disassembled in 1955
© http://inventorsmuseum.com/eniac.htm
http://www.library.upenn.edu/special/gallery/mauchly/jwm8b.html
56
von Neumann @ IAS
John von Neumann (1903-1957) worked at the Princeton Institute for Advanced
Studies
Worked on ENIAC and then proposed EDVAC (Electronic Discrete Variable
Computer) in 1945
Stored Program concept
IAS computer completed in 1952
[Diagram: IAS machine structure – Main Memory connected to the Arithmetic and Logic Unit, the Program Control Unit, and Input/Output.]
57
19
More on the IAS Computer
Memory of 1024 x 40 bit words (binary)
Set of registers (storage in CPU)
• Memory Buffer Register
• Memory Address Register
• Instruction Register
• Instruction Buffer Register
• Program Counter
• Accumulator
A number of clones: the MANIAC at Los Alamos Scientific Laboratory, the
ILLIAC at the University of Illinois, the Johnniac at Rand Corp., the SILLIAC
in Australia, and others.
http://www.computerhistory.org/timeline/1952/index.page
58
The First Commercial Computers
1947 - Eckert-Mauchly Computer Corporation formed
They built UNIVAC I (Universal Automatic Computer), which sold 48
units for $250K each – the first successful commercial computer
Late 1950s - UNIVAC II
IBM were originally in punched-card processing equipment
1953 - the 701
IBM’s first stored program computer
Intended for scientific applications
1955 - the 702
Business applications
Led to 700/7000 series
59
The Second Generation
Transistor invented in 1947 at Bell Labs, but not until
the late 1950’s that fully transistorised computers
were available
DEC founded in 1957 and delivered the PDP-1 in that
year
IBM dominant with the 7000 series (curtailed in 1964)
In 1958, the invention of the integrated circuit
revolutionised the manufacture of computers (greatly
reducing size and cost)
60
20
IBM 360 series (1964)
One of the most significant architectures of the third generation
Fred Brooks, R.O. Evans, Erich Bloch.
The first “real” computer.
Introduced the “family” concept and the term “computer architecture”.
The 360 was first to employ instruction microprogramming to facilitate
derivative designs - and create the concept of a family architecture.
“Computer Architecture” (the term was first used to describe the 360).
Technology: compact solid-state devices mounted on ceramic substrate.
61
DEC PDP-8 (1965)
Another significant third generation architecture
Established the “minicomputer” industry – no longer room-sized, but could
now fit on a lab bench.
Positioned Digital as the leader in this area.
Cost just $16K compared to the System 360 which cost hundreds of
thousands
Technology: transistors, random access memory, 12-bit architecture.
Used a system bus – universally accepted nowadays
62
Later Generations
Technologies: LSI (1000 components on IC), VLSI
(10,000 per chip) and ULSI (>100,000)
Semiconductor memory was a big breakthrough –
faster and smaller, but initially more expensive than
core memory (now very cheap)
First CPU built on a single chip (Intel 4004) enabled
the first computer to be built on a single board.
63
21
Intel Microprocessors
4004 (1971)- First microprocessor. 2300 transistors. 4-bit word. Used in
a hand-held calculator built by Busicom (Japan).
8008 (1972) – 8-bit word length.
8080 (1974) - Designed as the CPU of a general-purpose microcomputer.
20 times as fast as the 4004. Still 8-bit
8086 (1978) - 16-bit processor. Used for first IBM PC.
80286 (1982) - 16 MB of addressable memory and 1 GB of virtual
memory. First serious microprocessor.
80386 (1985), 80486 (1989) - 32-bit processors.
Pentium (1993) - 3 million transistors.
Pentium Pro (1995) - performance-enhancing features and more than 5.5
million transistors. 64-bit bus width.
Itanium (2001) – 25 million transistors in CPU, 300 million in cache
64
Parallel Computing
The next “paradigm shift”
Early parallel machines were one-of-a-kind prototypes: Illiac IV,
NYU Ultracomputer, Manchester Dataflow machine, Illinois Cedar
Early commercial players were Cray (founded 1972), BBN and CDC
Nowadays everybody is building parallel supercomputers: NEC,
Fujitsu, IBM, Intel, SGI, TMC, Sun, etc
65
Other Milestones
FORTRAN, the first high level programming language, was invented by John Backus
for IBM, in 1954, and released commercially, in 1957.
1962 Steve Russell from MIT created the first computer video game, written on a
PDP-1
1965: First computer science Ph.D. was granted to Richard L. Wexelblat at the
University of Pennsylvania
In 1969 the first version of UNIX was created at Bell Labs for the PDP-7. The
first manual appeared in 1971.
1969: first computer connected to the internet (ARPANET) at UCLA
1972: Ray Tomlinson sent first email
In 1972, Dennis Ritchie produced C. The definitive reference manual for it did not
appear until 1974
In 1975 Bill Gates and Paul Allen founded Microsoft
In March 1976, Steve Wozniak and Steve Jobs finished work on a home-grown
computer which they called the Apple 1 (a few weeks later they formed the Apple
Computer Company on April Fools day).
66
22
Growth in CPU Transistor Count
© Stallings, 2000, Computer Organization and
Architecture, Prentice-Hall.
67
Moore’s Law
Each new chip contained
roughly twice as much
capacity as its predecessor.
Each chip was released within
18-24 months of the previous
chip.
© Intel Corp
68
Technological Limitations
Importance of better device technology. Generally attained by
building smaller devices, and packing more onto a chip.
Currently working on a transistor comprising a single atom.
Possible future technology based on the photon, not the electron.
Limitations imposed by device technology on the speed of any
single processor. Even Technologists can’t beat Laws of Physics!
Further improvements in the performance of sequential
computers might not be attainable at acceptable cost.
A very small addendum to Moore's Law is:
• “…that the cost of capital equipment to build semiconductors will double
every four years."
69
23
The Big Picture: Possible Solutions
Processor speeds – Pipelining, multithreading, speculative
execution, multiprocessing, etc.
Slow memories – memory hierarchy, caches, reduce memory
accesses, etc.
Slow buses – multiple buses, wider buses, etc.
Smarter software – programming languages, compilers, operating
systems, etc
70
Section 3
Computer Systems and
Interconnection Structures
Basic Functional Units
[Diagram: the computer – central processing unit, main memory, and input/output, linked by the system’s interconnection – together with its peripherals and communication lines.]
72
24
von Neumann’s Architecture
The majority of computers today are similar. John von Neumann
and co-workers (during the 1940s) proposed a paradigm to build
computers.
Control the operation of hardware through the manipulation of
control signals.
First machines (eg ENIAC) required physical rewiring to change
the computation being performed.
von Neumann used the memory of the computer to store the
sequence of control signal manipulations required to perform a
task
This is the stored program concept, and was the birth of
software programming
73
von Neumann Machines
[Diagram: main memory holds data and instructions; addresses and data/instructions pass between main memory and the arithmetic & logic unit, the operational registers and the control unit, which also connect to input and output.]
74
von Neumann Machines
Both data and instructions (control sequences) stored
in a single read-write memory (which is addressed by
location), with no distinction between them
Execution occurs in a sequential fashion by reading
instructions from memory
In the von Neumann machine cycle, instructions are fetched from
memory, interpreted by the control unit and then
executed (the so-called fetch-execute cycle).
Repeated until program completion
75
25
Non von Neumann Machines
Not all computers follow the von Neumann model
Parallel processors (multiprocessors and multicomputers) are an
example
Many classification schemes exist today (on the basis of
structure and/or behaviour):
• Number of instructions that can be processed simultaneously
• Internal organization of processors.
• Inter-processor connection structure
• Methods used to control the flow of instructions and data through
the system
Flynn’s taxonomy is probably the most popular – provides four
classifications
76
Flynn’s Taxonomy (1966)
Flynn, 1966, Very High-Speed Computing Systems, Proc. IEEE, vol.
54, pp. 1901-1909.
The most universally accepted method of classifying computer
systems on the basis of their global structure
Uses block diagrams indicating flow of instructions and data
• single instruction stream single data stream (SISD).
• single instruction stream multiple data stream (SIMD).
• multiple instruction stream single data stream (MISD).
• multiple instruction stream multiple data stream (MIMD).
77
Single Instruction Stream Single Data
Stream (SISD)
[Diagram: CU → instruction stream (IS) → PU → data stream (DS) → M.
Legend: Control Unit (CU), Processing Unit (PU), Memory (M), Instruction Stream (IS), Data Stream (DS).]
All von Neumann machines belong to this class.
78
26
Single Instruction Stream Multiple Data
Stream (SIMD)
[Diagram: a single CU broadcasts one instruction stream (IS) to processing units PU1 … PUn, each working on its own data stream (DS1 … DSn) to/from memory M.
Legend: Control Unit (CU), Processing Unit (PU), Memory (M), Instruction Stream (IS), Data Stream (DS).]
79
Multiple Instruction Stream Single Data
Stream (MISD)
[Diagram: control units CU1 … CUn each issue their own instruction stream (IS1 … ISn) to processing units PU1 … PUn, which all operate on the same data stream (DS) from memory M.
Legend: Control Unit (CU), Processing Unit (PU), Memory (M), Instruction Stream (IS), Data Stream (DS).]
80
Multiple Instruction Stream Multiple Data
Stream (MIMD)
[Diagram: control units CU1 … CUn each issue their own instruction stream (IS1 … ISn) to processing units PU1 … PUn, each working on its own data stream (DS1 … DSn) to/from memory M.
Legend: Control Unit (CU), Processing Unit (PU), Memory (M), Instruction Stream (IS), Data Stream (DS).]
81
27
Using Flynn’s Taxonomy
Advantages
• Universally accepted
• Compact notation
• Easy to classify a system
Disadvantages
• Very coarse-grain differentiation
• Comparison of different systems is limited
• Many features of the systems ( interconnections, I/O,
memory, etc) not considered in the scheme
82
von Neumann Machines: Instruction Cycle
Two basic steps:
• Fetch
• Execute
Repeated until program completes
83
Fetch Cycle
Program Counter (PC) holds address of next instruction
to fetch
Processor fetches instruction from memory location
pointed to by PC
Increment PC
• Unless told otherwise
Instruction loaded into Instruction Register (IR)
Processor interprets instruction and performs required
actions
Branch instructions modify PC
84
28
Execute Cycle
Processor-memory
• data transfer between CPU and main memory
Processor I/O
• Data transfer between CPU and I/O module
Data processing
• Some arithmetic or logical operation on data
Control
• Alteration of sequence of operations, modify PC
• e.g. jump
Combination of above
85
Example of Program Execution
Partial list of opcodes:
• 0001 (decimal 1):
Load AC from
memory
• 0010 (decimal 2):
Store AC to memory
• 0101 (decimal 5):
Add to AC from
memory
PC – program counter
IR – instruction register
AC - accumulator
86
Instruction Cycle State Diagram
87
29
Interrupts
Mechanism by which other system modules (e.g. I/O) may interrupt
normal sequence of processing
Interrupt types/sources:
• Program
  e.g. overflow, division by zero
• Timer
  Generated by internal processor timer
  Used in pre-emptive multi-tasking
• I/O
  from I/O controller
• Hardware failure
  e.g. memory parity error
Sometimes distinguish between interrupts (an asynchronous signal
from hardware), and exceptions (synchronously generated by
software) – both handled similarly
88
Transfer of Control
The processor and the O/S are responsible for recognising an
interrupt, suspending the user program, servicing the interrupt
and then resuming the user program as though nothing had
happened.
89
Interrupt Cycle
Added to instruction cycle
Processor checks for interrupt
• Indicated by an interrupt signal
If no interrupt, fetch next instruction
If interrupt pending:
• Suspend execution of current program
• Save context
• Set PC to start address of interrupt handler routine
• Process interrupt
• Restore context (includes PC) and continue interrupted program
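A toy C sketch of the fetch-execute-interrupt loop just described; the three-opcode "machine", its memory image and the point at which an interrupt arrives are all invented for illustration:

#include <stdio.h>

enum { NOP = 0, INC = 1, HALT = 9 };                /* invented opcodes */

static int mem[16] = { INC, INC, NOP, INC, HALT };  /* toy program in "memory" */
static int acc = 0;                                 /* accumulator */
static int pc  = 0;                                 /* program counter */

static int interrupt_pending(int step) { return step == 2; }  /* pretend an I/O interrupt arrives here */

int main(void) {
    int running = 1, step = 0;
    while (running) {
        int ir = mem[pc++];                 /* fetch: IR <- M[PC], then increment PC */
        switch (ir) {                       /* execute */
            case INC:  acc++;       break;
            case HALT: running = 0; break;
            default:                break;  /* NOP */
        }
        if (running && interrupt_pending(step)) {   /* interrupt cycle */
            int saved_pc = pc;              /* save context (just the PC here) */
            printf("interrupt after step %d: service it, then resume\n", step);
            pc = saved_pc;                  /* restore context and continue; a real machine
                                               would first jump to the handler's address */
        }
        step++;
    }
    printf("halted with ACC = %d\n", acc);  /* the three INCs were executed */
    return 0;
}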
90
30
Instruction Cycle (with Interrupts)
91
Multiple Interrupts
What to do if more than one interrupt occurs at the same time?
Disable interrupts
• Processor will ignore further interrupts whilst processing one
interrupt
• Interrupts remain pending and are checked after first interrupt has
been processed
• Interrupts handled in sequence as they occur
Define priorities
• At the start of the interrupt cycle, the highest priority pending
interrupt is serviced
• Low priority interrupts can be interrupted by higher priority
interrupts
• When higher priority interrupt has been processed, processor
returns to previous interrupt
• Interrupts can be “nested”
92
Multiple Interrupts - Sequential
93
31
Multiple Interrupts - Nested
94
Interconnection Structures
The collection of paths that connect the system
modules together
Necessary to allow the movement of data:
• between processor and memory
• between processor and I/O
• between memory and I/O
95
Interconnection Structures
Major forms of input and output
for each module:
• memory: address, data, read/write
control signal
• I/O: functionally similar to memory,
but can also send interrupts to the
CPU
• processor: reads instructions and
data, writes data after processing,
and uses control signals to control
the system
96
32
Bus Interconnections
Most common interconnection structure
Connects two or more devices, is shared between the devices
attached to it, so that information is broadcast to all devices (not
just intended recipient)
Consists of multiple communication lines transmitting information
(a 0 or 1) in parallel. Width is important in determining
performance
Systems can contain multiple buses
97
Bus Interconnections
Address bus: source or destination address of the data on the
data bus
Data bus: for moving data between modules
Control bus: set of control lines used to control use of the data
and address lines by the attached devices
98
Bus Design Issues
Type
• dedicated or multiplexed
Arbitration
• centralised or distributed
Timing
• synchronous or asynchronous
Width
Transfer type
99
33
Bus arbitration
Ensuring only one device uses the bus at a time
Otherwise we have a collision
Master-slave mechanism
Two methods:
• centralised – single bus controller or bus arbiter
• distributed – any module (except passive devices like memory)
can become the bus master
100
Multiple Buses
Bus performance depends on: bus length (propagation
delay), number of attached devices (contention delay)
Becomes a bottleneck
Solution: use multiple buses
Spreads (and hence reduces) traffic
Hierarchical organisation
High speed, restricted access bus local (close) to CPU
System and expansion buses (further from the CPU)
connect slower devices
101
PC Buses
ISA (Industry Standard Architecture) - 8 and 16 bit
MCA (Micro Channel Architecture) - 16 and 32 bit in IBM PS/2.
Never caught on.
EISA (Extended ISA) - 16/32 bit data, 24/32 bit address. High
end machines only.
VESA (Video Electronics Standards Assoc) Video Local Bus – used
in conjunction with ISA or EISA to give video devices quick
access to memory.
PCI (Peripheral Component Interconnect) – 64 bit data and address
lines (multiplexed). Systems include ISA slots for backward
compatibility
Futurebus+
102
34
Summary
von Neumann architecture – the stored program
concept and the fetch-execute cycle
A computer generally comprises a CPU (which itself
contains a control unit, arithmetic processing units,
registers, etc), a memory system, and I/O system and
a system of interconnects between them
Buses commonly used to connect system components
103
Section 4
Memory
Memory Systems
From the earliest days of computing, programmers have
wanted unlimited amounts of fast memory.
The speed of execution of instructions is highly dependent
upon the speed with which data can be transferred
to/from the main memory.
Find a way to help programmers by creating the illusion of
unlimited fast memory. There are many techniques for
making this illusion robust and enhancing its performance.
105
35
Memory Systems
In most modern computers, the physical MM is not as large as the
address space spanned by an address issued by the CPU (main
memory and secondary storage devices).
Maurice Wilkes, “Memoirs of a Computer Pioneer”, 1985.
“. . . the one single development that put computers on their feet
was the invention of a reliable form of memory, namely, the core
memory . . . Its cost was reasonable, it was reliable, and it could in
due course be made large.”
106
The von Neumann bottleneck
Recall Moore’s Law.
Speed gap between CPU and DRAM (the von Neumann bottleneck).

DRAM: Year – Size – Cycle Time
1980 – 64 Kb – 250 ns
1983 – 256 Kb – 220 ns
1986 – 1 Mb – 190 ns
1989 – 4 Mb – 165 ns
1992 – 16 Mb – 145 ns
1995 – 64 Mb – 120 ns
© Stallings, 2000, Computer Organization and Architecture, Prentice-Hall.
107
Terminology
Capacity: the amount of information that can be
contained in a memory unit
Word: the natural unit of organisation of the memory
Addressable unit: typically either a word or individual
bytes
Unit of transfer: the number of bits transferred at a
time
108
36
Characterising Memory Systems
A great deal of variety and decisions are involved in the design of memory
systems.
Location
• CPU (registers)
• Internal (main)
• External (secondary)
Capacity
• Word size
• Number of words/block
Unit of Transfer
• Bit
• Word
• Block
Performance
• Access time
• Cycle time
• Transfer rate
Physical Type
• Semiconductor
• Magnetic surface
Physical Characteristics
• Volatile/nonvolatile
• Erasable/nonerasable
Access Method
• Random access
• Direct access
• Sequential access
• Associative access
109
Location and Hierarchy
The maximum size of the MM that can be used in any
computer is determined by the addressing scheme.
Registers.
• CPU
Internal or Main memory.
• may include one or more levels of cache.
• “RAM”.
External memory.
• secondary storage devices.
110
Location and Hierarchy
How much? (more applications, more capacity)
How fast? (keep up with the speed of CPU)
How expensive? (reasonable as compared to other components)
A trade-off exists between the three key characteristics of
memory, namely cost, capacity, and access time. The following
relationships hold:
• smaller access time, greater cost per bit.
• greater capacity, smaller cost per bit.
• greater capacity, greater access time.
The way out of this dilemma is not to rely on a single memory
component or technology, but to employ a memory hierarchy.
111
37
Location and Hierarchy
As one goes down the hierarchy:
• Decreasing cost/bit
• Increasing capacity
• Increasing access time
• Decreasing frequency of access of the memory by the CPU

[Memory Hierarchy diagram: Registers, Cache, Main Memory, Magnetic Disk, Magnetic Tape, Optical Disk – moving down gives increased capacity, reduced cost, reduced speed.]
112
Hierarchy Levels
Registers → L1 Cache → L2 Cache → Main memory → Disk cache
→ Disk → Optical → Tape.
[Diagram: Processor (Registers, On-Chip Cache) → Second Level Cache (SRAM) → Main Memory (DRAM) → Secondary Storage (Disk) → Tertiary Storage (Disk).]
113
Hierarchy Management
registers ↔ memory: compiler
cache ↔ memory: hardware
memory ↔ disks: hardware and operating system (virtual memory); programmer (file i/o)
114
38
Typical Memory Parameters
Type – Size – Access Time
Cache – 128-512 KB – 10 ns
Main memory – 4-256 MB – 50 ns
Magnetic disk (hard disk) – GB-TB – 10 ms, 10 MB/s
Optical disk (CD-ROM) – <1 GB – 300 ms, 600 KB/s
Magnetic tape – 100’s GB – sec-min, 10 MB/s
115
Capacity
Addressable Units:
word and/or byte.
address of length (A) and the number (N) of addressable units is
related by 2^A = N. (A 16-bit computer is capable of addressing up to
2^16 = 64K memory locations).
Example: In a byte-addressable 32-bit computer, each memory word
contains 4 bytes. Instructions generate 32 bit addresses. High (or low)
order 30 bits determine which word will be accessed. Low (High) order 2
bits of the address specify which byte location is involved.
Units of transfer: Internal (usually governed by data bus width (e.g.
bits)), External (usually a block which is much larger than a word).
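A minimal C sketch of the byte-addressable 32-bit example above: with 4 bytes per word, the high-order 30 bits of an address select the word and the low-order 2 bits select the byte within it (taking the convention that the low-order bits are the byte offset); the sample address is arbitrary:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr = 0x0000ABCD;    /* an arbitrary byte address (assumed) */
    uint32_t word = addr >> 2;     /* high-order 30 bits: word number */
    uint32_t byte = addr & 0x3;    /* low-order 2 bits: byte within the word */
    printf("address 0x%08X -> word %u, byte %u\n",
           (unsigned)addr, (unsigned)word, (unsigned)byte);
    return 0;
}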
116
Access Methods
Sequential
memory is organised into records. Access is made in a specific
linear sequence.
time to access an arbitrary record is highly variable (e.g. tape).
Direct
individual blocks or records have a unique address based on
physical location.
direct access to reach a general vicinity and a sequential search.
Access time depends on location and previous location (e.g. disk).
117
39
Access Methods
Random
the time to access a given location is independent of the sequence of
prior accesses and is constant (e.g. RAM).
Associative
a word is retrieved based on a portion of its contents rather than its
address.
retrieval time is constant (e.g. cache).
118
Performance
Access Time
the time it takes to perform a read/write operation (random-access memory)
or the time it takes to position the read-write mechanism at
the desired location (non-random-access memory).
Memory Cycle Time
primarily associated with random-access memory
consists of the access time plus any additional time required
before a second access can commence (e.g. transients).
119
Performance
Transfer Rate
the rate at which data can be transferred into or out of a memory
unit.
for a single block of random-access memory, it is equal to (1/Cycle
time).
for non-random-access memory, the following relationship holds:
TN = TA + N / R
TN: Average time to read or write N bits
TA: Average access time
N: Number of bits
R: Transfer rate, in bits/sec (bps)
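A small C sketch of the relationship above, using assumed figures roughly in line with the disk parameters quoted earlier (10 ms access time, 10 MB/s transfer rate, one 4 KB block):

#include <stdio.h>

int main(void) {
    double t_a = 0.010;        /* average access time: 10 ms (assumed) */
    double r   = 10e6 * 8;     /* transfer rate: 10 MB/s expressed in bits/s (assumed) */
    double n   = 4096.0 * 8;   /* read one 4 KB block, in bits (assumed) */

    double t_n = t_a + n / r;  /* TN = TA + N/R */
    printf("T_N = %.6f s\n", t_n);   /* about 0.01041 s */
    return 0;
}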
120
40
Physical Characteristics
Volatility:
• volatile memory: information decays naturally or is lost when
electrical power is switched off. e.g. semiconductor memory.
• non-volatile memory: information once recorded remains
without deterioration (no electrical power is needed to retain
information). e.g. magnetic-surface memories, semiconductor
memory.
Decay
Erasability
Power consumption
121
Memory Technology
Core memory
• magnetic cores (toroids) used to store logical 1 or 0 by
inducing a magnetic field in it (in either direction) – 1 core stores 1
bit
• destructive reads
• obsolete – used in generations two and three (somewhat).
Replaced in the early 70’s by semiconductor memory
122
Memory Technology
Semiconductor memory, using LSI or VLSI technology
• ROM
• RAM
Magnetic surface memory, used for disks and tapes.
Optical
• CD & DVD
Others
• Bubble
• Hologram
123
41
Read Only Memory (ROM)
“Permanent” data storage
Still random access, but read-only
ROM – data is wired in during fabrication
PROM (Programmable ROM) – can be written once
EPROM (Erasable PROM) – by exposure to UV light
EEPROM (Electrically Erasable PROM) and EAPROM
(Electrically Alterable PROM)
Flash memory – similar to EEPROM
124
Semiconductor Memory
RAM
• all semiconductor memory is random access.
• generally read/write
• volatile
• temporary storage
• static or dynamic

DRAM: Dynamic RAM
• High density, low power, cheap, slow.
• Dynamic: needs to be “refreshed” regularly.
• Used for main memory.

SRAM: Static RAM
• Low density, high power, expensive, fast.
• Static: content will last “forever” (until power is lost).
• Used for cache and registers.
125
RAM
[Diagrams: a 6-transistor SRAM cell (word/row-select line and complementary bit lines) and a 1-transistor DRAM cell (row select and bit line). SRAM: six transistors use up a lot of area.]
126
42
Newer RAM Technology
Basic DRAM same since first RAM chips.
Enhanced DRAM.
• contains small SRAM as well
• SRAM holds last line read (like a mini cache!)
Cache DRAM.
• larger SRAM component.
• use as cache or serial buffer.
127
Newer RAM Technology
Synchronous DRAM (SDRAM).
• access is synchronised with an external clock.
• address is presented to RAM, RAM finds data (CPU waits in
conventional DRAM).
• since SDRAM moves data in time with the system clock, the CPU
knows when data will be ready.
• CPU does not have to wait, it can do something else.
• burst mode allows SDRAM to set up stream of data and fire it
out in block.
128
SDRAM
129
43
Organisation
A 16Mbit chip can be organised
as 1M of 16 bit words.
A bit per chip system has 16 lots
of 1Mbit chip with bit 1 of each
word in chip 1 and so on.
A 16Mbit chip can be organised
as a 2048 x 2048 x 4bit array.
• reduces number of address pins
• multiplex row address and column
address.
• 11 pins to address (211=2048).
• adding one more pin doubles
range of values so x4 capacity.
256K×8 memory from 256K×1 chips
130
1 MB Memory Organisation
131
Refreshing (DRAM)
Refresh circuit included on chip.
Disable chip.
Count through rows.
Read & Write back.
Takes time.
Slows down apparent performance.
132
44
Typical 16 Mb DRAM (4M x 4)
133
Packaging
134
Error Correction
Hard Failure.
• permanent defect.
Soft Error.
• random, non-destructive.
• no permanent damage to memory.
• Power supply or alpha particles.
Detected using Hamming error correcting code.
135
45
Error Correcting Code Function
136
Memory Interleaving
Independent accesses to the main memory can be made
simultaneously.
The main memory must be partitioned into memory modules
(banks).
The expenses involved with this approach are usually justified in
large computers.
Addressing circuitry (for banks), bus control, buffer storage for
processors.
Memory references should be distributed evenly among all
modules (m modules, n words per memory cycle).
137
Memory Interleaving
[Diagram: the decoded memory address is split into a module field (m bits) and an address-of-word-in-module field (g = n - m bits). Each module (Module 0 … Module j … Module l) has its own MAR and MBR connected to the data bus.]
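A minimal C sketch of splitting an n-bit word address across interleaved modules. The figure above does not say whether the module field is the low-order or the high-order end of the address, so both conventions are shown; the field widths are assumptions:

#include <stdio.h>

int main(void) {
    unsigned n = 20, m = 4;              /* 2^20 words, 2^4 = 16 modules (assumed) */
    unsigned g = n - m;                  /* bits for the word address within a module */

    for (unsigned addr = 0; addr < 4; addr++) {
        unsigned lo_module = addr & ((1u << m) - 1);   /* low-order interleaving: consecutive
                                                          addresses land in different modules */
        unsigned lo_offset = addr >> m;
        unsigned hi_module = addr >> g;                /* high-order interleaving */
        unsigned hi_offset = addr & ((1u << g) - 1);
        printf("addr %u: low-order -> module %u word %u, high-order -> module %u word %u\n",
               addr, lo_module, lo_offset, hi_module, hi_offset);
    }
    return 0;
}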
138
46
Memory Interleaving
Cray-1 (CPU cycle time 12.5nsec, Memory modules of
cycle time 50nsec, Word size of 64 bits).
Memory bandwidth of 4 words per CPU cycle or 16
words per memory cycle. (Cray-1 has 16 memory
modules).
Two different types of interleaving (low-order and
high-order).
139
Section 4a
Cache Memory
Locality of Reference
The Principle of Locality of Reference:
• program access a relatively small portion of the address space at any
instant of time.
• loops, subroutines, arrays, tables, etc.
The “active” part of the memory should be kept in fast memory
and close to CPU.
The “active” parts of the memory will change with time and should
be changed (memory management).
This is the basis of the memory hierarchy
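A small C example of the principle: both loop nests below compute the same sum over the same (hypothetical) array, but the first walks memory in the order the array is laid out (good spatial locality, cache-friendly), while the second jumps a whole row length between consecutive accesses:

#include <stdio.h>
#define N 512

static double a[N][N];   /* illustrative array, zero-initialised */

int main(void) {
    double sum = 0.0;

    /* good spatial locality: innermost loop walks along a row */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* poor spatial locality: innermost loop walks down a column */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);   /* same result, very different cache behaviour */
    return 0;
}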
141
47
Locality of Reference
Temporal Locality (locality in time):
• loops, subroutines, etc.
• most recently accessed data items closer to the processor.
Spatial Locality (locality in space):
• arrays, tables, etc.
• instructions are normally accessed sequentially in a program
(branches ≈20%).
By taking advantage of the principle of locality of reference:
• large memory → cheapest technology.
• speed → fastest technology.
142
Cache Memories
The philosophy behind Cache memories is to provide users with a
very fast and large enough memory.
Placed between normal main memory and CPU.
May be located on CPU chip.
Cache operates at or near the speed of processor.
Cache contains “copies” of sections of main memory (recall:
locality of reference).
143
Cache Memories
A block contains a fixed number
of words.
Main memory consists of up to 2n
addressable words (unique
addresses).
For mapping purposes, the
memory is considered to consist
of a number of fixed-length
blocks of m words each (i.e., k=
2n/m blocks).
Block stored in cache as a single
unit (slot or line).
[Diagram: CPU ↔ Cache (words transferred) ↔ Main Memory (blocks transferred).]
144
48
Cache operation - overview
CPU requests contents of memory location. Check cache for this
data.
If present, get from cache (fast).
If not present, read required block from main memory to cache.
Then deliver from cache to CPU.
Cache includes tags to identify which block of main memory is in
each cache slot.
145
Typical Cache Organization
146
Cache Terminology
Hit: data appears in some block in the upper level (e.g.
cache).
Hit Rate: the fraction of memory access found in the
upper level.
Hit Time: the time to access the upper level and it
consists of:
• SRAM access time + Time to determine hit or miss
147
49
Cache Terminology
Miss: data needs to be retrieved from a block in the lower level
(e.g. main memory or secondary storage device).
Miss Rate = 1 - (Hit Rate).
Miss Penalty: Time to fetch a block from the lower level into the
upper level (generally does not include time to determine the miss
or to deliver the block from the cache to the processor)
Hit Time << Miss Penalty.
Average Memory Access Time (AMAT) is an average of times to
perform accesses, weighted by probabilities of data being in the
various levels.
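One common single-level formulation of that weighted average is AMAT = Hit Time + Miss Rate × Miss Penalty; the C sketch below evaluates it with assumed figures (the exact definition used in this unit may differ in detail):

#include <stdio.h>

int main(void) {
    double hit_time_ns = 2.0;       /* time to access the cache (assumed) */
    double miss_rate = 0.05;        /* 5% of accesses miss (assumed) */
    double miss_penalty_ns = 50.0;  /* time to fetch the block from main memory (assumed) */

    double amat = hit_time_ns + miss_rate * miss_penalty_ns;
    printf("AMAT = %.2f ns\n", amat);   /* 2 + 0.05 * 50 = 4.5 ns */
    return 0;
}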
148
Cache Design Issues
Size.
Mapping Function (move blocks from MM to/from
cache).
Replacement Algorithm (which blocks to replace ).
Write Policy.
Block Size.
Number of Caches.
149
Assumptions
The CPU does not need to know explicitly about the
existence of the cache.
The CPU simply makes Read/Write requests.
When the referenced data is in the cache:
• If the operation is a Read, then the MM is not involved.
• If the operation is a Write:
  • Update both the MM and the cache simultaneously (write-through method)
  • You can also update the cache location only and mark it as such through the
    use of an associated flag bit (write-back method)
150
50
Assumptions
If the data is not in the cache:
• If the operation is a Read then time savings can be achieved if
the required word is sent to the CPU as soon as it becomes
available, rather than waiting for the whole block to be loaded
into the cache (load-through)
• If a Write operation is executed, then the operation can be
sent directly to MM. Alternatively, the block could be loaded
into the cache and then updated (write-allocate)
151
Elements of Cache Design: Size
Cost vs. performance.
Cost
• more cache is expensive.
Speed
• more cache is faster (up to a point). The larger the cache, the larger
the number of gates involved in addressing the cache.
• checking cache for data takes time.
A number of studies have suggested that cache sizes of between
1K and 512K words would be optimum.
However, because the performance of the cache is very sensitive
to the nature of the workload, it is impossible to arrive at an
“optimum” cache size.
152
Elements of Cache Design:
Mapping Function
There are fewer cache blocks (or lines) than main memory blocks.
An algorithm is needed for determining which main memory block
currently occupies a cache line.
The choice of the mapping function dictates how the cache is
organised.
Three techniques can be used: direct, associative, and set
associative.
153
51
Mapping Function
A cache of 2048 (2K) words with a block size of 16
words.
The cache is organised as 128 blocks. Let the MM have
64K words, addressable by a 16-bit address.
For mapping purposes, the memory will be considered
as composed of 4K blocks of 16 words each.
154
Direct Mapping
[Diagram: main memory blocks 0, 1, …, 127, 128, 129, …, 255, … map onto the 128 cache lines (block 0 and block 128 share cache line 0, and so on), each cache line storing a TAG. The 16-bit address is split into a 5-bit TAG, a 7-bit BLOCK (cache line) field and a 4-bit WORD field.]
155
Direct Mapping
Each block of main memory maps to only one cache line.
• if a block is in cache, it must be in one specific place.
Address is in two parts.
Least significant bits identify unique word.
Most significant bits specify one memory block.
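A minimal C sketch of the worked example above (64K-word main memory, 2K-word cache, 128 lines of 16 words, so a 5-bit tag, 7-bit line and 4-bit word split); the sample address is arbitrary:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t addr = 0xA5C3;                 /* an arbitrary 16-bit word address (assumed) */

    unsigned word = addr & 0xF;             /* low 4 bits: word within the block */
    unsigned line = (addr >> 4) & 0x7F;     /* next 7 bits: cache line (block) number */
    unsigned tag  = (addr >> 11) & 0x1F;    /* high 5 bits: tag stored with the line */

    printf("addr 0x%04X -> tag %u, line %u, word %u\n", (unsigned)addr, tag, line, word);
    return 0;
}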
156
52
Direct Mapping Pros & Cons
Simple and easy to implement.
Inexpensive.
Fixed location for given block.
• if a program accesses 2 blocks that map to the same line
repeatedly, cache misses are very high.
Hit ratio is low.
157
Associative Mapping
A main memory block can load into any line of cache.
Memory address is interpreted as tag and word.
Tag uniquely identifies block of memory.
Every line’s tag is examined for a match.
Cache searching gets expensive.
158
Associative Mapping
[Diagram: any main memory block can be loaded into any of the 128 cache lines; each cache line stores a TAG identifying the block it holds. The 16-bit address is split into a 12-bit TAG and a 4-bit WORD field.]
159
53
Set Associative Mapping
This final method is the most practical and it exhibits the
strengths of both of the previous two techniques (without their
disadvantages).
Cache is divided into a number of sets. Each set contains a number
of lines.
A given block maps to any line in a given set (e.g. block B can be in
any line of set i).
For example, 2 lines per set.
• 2 way associative mapping.
• a given block can be in one of 2 lines in only one set.
160
Set Associative Mapping
[Diagram: the 128 cache lines are grouped into 64 sets (SET 0 … SET 63) of two lines each; a main memory block maps to one set but may occupy either line of that set, each line storing a TAG. The 16-bit address is split into a 6-bit TAG, a 6-bit SET field and a 4-bit WORD field.]
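A companion C sketch for the two-way set-associative version of the same example (64 sets of 2 lines, so a 6-bit tag, 6-bit set and 4-bit word split); the block may then sit in either line of its set, and the sample address is the same arbitrary one as before:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t addr = 0xA5C3;                 /* same arbitrary example address as before */

    unsigned word = addr & 0xF;             /* low 4 bits: word within the block */
    unsigned set  = (addr >> 4) & 0x3F;     /* next 6 bits: set number */
    unsigned tag  = (addr >> 10) & 0x3F;    /* high 6 bits: tag compared in both lines of the set */

    printf("addr 0x%04X -> tag %u, set %u, word %u\n", (unsigned)addr, tag, set, word);
    return 0;
}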
161
Remarks
The tag bits could be placed in a separate, even faster cache,
especially when associative searches are required (tag directory).
Normally, there is a valid bit with each block that indicates whether
or not the block contains valid data.
Also, another bit (dirty bit) is needed to distinguish whether or
not the cache contains an updated version of the MM block –
necessary with write-back caches.
162
54
A Sidebar on Associative Searches
Different way of identifying data.
Contents of data used to identify data (content addressable
memories).
In a RAM the search time will be of order t × d(m) (t: time to
fetch and compare one word from memory, d(m) an increasing
function on m).
In an m-word CAM, the search time is independent of m because the
search is performed in parallel (two memory cycles).
163
A Sidebar on Associative Searches
[Diagram: an associative memory of n-bit words 0 … k-1, with an Argument Register and a Key Register feeding a select circuit, and a Match Register driving the output.
Example – Argument Register: 10 110011; Key Register: 11 000000;
Word 1: 00 111001; Word 2: 11 001101; Word 3: 10 110111; Word 4: 10 110001.]
A Sidebar on Associative Searches
Parallel search requires specialised hardware (highest level of
memory, real-time systems, etc).
Exact match CAM (equality with key data) and Comparison CAM
(various relational operators >, <, etc). See, for example, A.G. Hanlon,
“Content-Addressable and Associative Memory Systems – A Survey,” IEEE
Transactions on Electronic Computers, Vol. 15, No. 4, pp. 509-521, 1966.
See also T. Kohonen, 1987, Content-Addressable Memories, 2nd edition, Springer-Verlag.
In general, CAMs are very expensive but they have a fast
response time.
165
55
Elements of Cache Design:
Replacement Algorithm
When a new block must be brought into the cache and all positions
that it may occupy are full, a decision must be made as to which of
the old blocks is to be overwritten (system performance). Not very
easy to resolve.
Locality of reference, again!
For direct mapping, there is only one possible line for any
particular block and no choice is possible.
For the associative and set associative techniques, a replacement
algorithm is needed. To achieve high speed, the algorithm is
normally hardwired.
166
LRU Replacement Algorithm
Least-Recently-Used (LRU).
The basic idea is to replace that block in the set which
has been in the cache longest with no reference to it.
This technique should give the best hit ratio since the
locality of reference concept assumes that more
recently used memory locations are more likely to be
referenced.
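A software sketch of one way LRU bookkeeping can work for a single set, using an age counter per line; real caches implement the equivalent in hardware, and the 4-way set size and reference sequence below are made up:

#include <stdio.h>

#define WAYS 4

struct line { int valid; unsigned tag; unsigned age; };

static struct line set[WAYS];

/* Access a block with the given tag in this set; return 1 on hit, 0 on miss. */
static int access_set(unsigned tag) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (set[i].valid && set[i].tag == tag) {
            set[i].age = 0;                        /* hit: this line is now most recently used */
            for (int k = 0; k < WAYS; k++)
                if (k != i && set[k].valid) set[k].age++;
            return 1;
        }
        /* remember an invalid line, or the oldest valid line, as the victim */
        if (!set[i].valid) victim = i;
        else if (set[victim].valid && set[i].age > set[victim].age) victim = i;
    }
    for (int k = 0; k < WAYS; k++)                 /* miss: age everyone, then load into the victim */
        if (set[k].valid) set[k].age++;
    set[victim] = (struct line){1, tag, 0};
    return 0;
}

int main(void) {
    unsigned refs[] = {1, 2, 3, 4, 1, 5, 2};       /* block tags referenced (made up) */
    for (int i = 0; i < 7; i++)
        printf("tag %u: %s\n", refs[i], access_set(refs[i]) ? "hit" : "miss");
    return 0;
}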
167
Other Replacement Algorithms
First-In-First-Out (FIFO).
• Replace that block in the set which has been in the cache the longest.
FIFO is easily implemented as a circular buffer technique.
Least-Frequently Used (LFU).
• Replace that block in the set which has experienced the fewest
references. LFU can be implemented by associating a counter with
each line.
Random.
• This is a technique which is not based on usage. A line is picked
among the candidate slots at random. Simulation studies show only
slightly inferior performance.
168
56
Elements of Cache Design: Write Policy
A variety of write policies, with different performance
and economic trade-offs, are possible. Two problems to
deal with:
• More than one device may have access to main memory. For
example, an I/O module may be able to read/write directly to
memory.
• A more complex problem occurs when multiple CPUs (e.g.
multiprocessor systems) are attached to the same bus and
each CPU has its own local cache.
169
Write Through
This is the simplest technique.
All write operations are made to MM and cache (memory traffic
and congestion).
Multiple CPUs can monitor main memory traffic to keep local (to
CPU) cache up to date.
A lot of traffic.
Slows down writes.
170
Write Back
Updates are only made in the cache.
When an update occurs, an UPDATE bit associated with the line is
set (so portions of main memory could be invalid).
• if a block is to be replaced, write it to main memory only if its update bit is
set.
• other caches can get out of sync.
• I/O must access main memory through the cache (complicated and expensive).
• 15% of memory references are writes.
171
57
Cache Coherency Problem
Where there are multiple CPUs and caches, accessing a single
shared memory.
Active area of research.
Bus Watching (snoopy caches) with Write Through: Each cache
controller monitors the address lines to detect write operations
to memory by other bus masters (write-through policy is used by
all cache controllers).
Hardware Transparency: Additional hardware is used to ensure
that all updates to main memory via cache are reflected in all
caches.
Non-cacheable Memory: Only a portion of main memory is shared
by more than one processor, and this is designated as non-cacheable.
172
Elements of Cache Design:
Block Size
Larger blocks - reduce the number of blocks that fit
into a cache (higher block replacement rate).
Larger blocks - each additional word is farther from
the requested word (reduced locality of reference).
No optimum solutions.
173
Elements of Cache Design:
Number of Caches
Single- Versus Two-Level Caches.
As logic density has increased, it has become possible to have a
cache on the same chip as the processor.
Most contemporary processor designs include both on-chip and
external caches (two-level cache).
Pentium (L1, 16KB), PowerPC (L1, up to 64KB).
L2 is generally 512KB or less.
174
58
Number of Caches
Unified Versus Split Caches.
For a given cache size, the unified cache has a higher hit rate
than split caches (instruction and data fetches are balanced).
Parallel execution and pipelining can help in making split caches
more powerful than unified caches.
Pentium and PowerPC use split caches (one for instructions and
one for data).
175
Examples (Pentium and PowerPC)
On board cache (L1) is supplemented with external fast SRAM
cache (L2).
Intel family
• X386 – no internal cache
• X486 – 8KB unified cache
• Pentium – 16KB split cache (8KB data and 8KB instructions)
• Pentium supports 256KB or 512KB external L2 cache which is 2-way set associative
PowerPC
• 601 – one 32KB cache
• 603/604/620 – split cache of size 16/32/64KB
176
PowerPC 604 Caches
Split caches. Cache (16K bytes) - four-way set-associative
organisation.
128 sets (four blocks/set) - block has 8 words (32 bits).
The least-significant 5 bits of an address select the byte within a block, the next 7 bits select the set, and the high-order 20 bits form the tag.
MESI protocol for cache coherency (Modified, Exclusive, Shared,
Invalid).
177
59
PowerPC 604 Caches
The data cache (write-back protocol) can also support the write-through protocol.
The LRU replacement algorithm is used to choose which block will
be sent back to MM.
Instruction unit can read 4 words of some selected block in the
instruction cache in parallel.
For performance reasons, the load-through approach is used for
read misses.
178
Other Enhancements (Write Buffer )
Write-through protocol is used (CPU could be slowed down).
The CPU does not depend on the writes.
To improve performance, a write buffer can be included for
temporary storage of write requests. (DECStation 3100 - 4 words
deep write buffer).
A read request could refer to data held in the write buffer
(better performance).
179
Other Enhancements (Prefetching)
Insertion of “prefetch” instructions in the program by
programmer or compiler. (see for example, T.C. Mowry,
“Tolerating Latency through Software-Controlled Data
Prefetching,” Tech. Report CSL-TR-94-628, Stanford
University, California, 1994).
Prefetching can be implemented in hardware or
software. (see for example, J.L. Baer and T.F. Chen,
“An Effective On-Chip Preloading Scheme to Reduce
Data Access Penalty,” Proceedings of Supercomputing,
1991, pp. 176-186).
180
60
Other Enhancements
(Lockup-Free Cache )
Lockup-free caches were first used in the early 1980s in the
Cyber series of computers manufactured by the Control Data
company (see D. Kroft, “Lockup-Free Instruction Fetch/Prefetch
Cache Organization,” Proceedings of the 8th Annual International
Symposium on Computer Architecture, 1981, pp. 81-85).
The cache structure can be modified to allow the CPU to access
the cache while a miss is being serviced.
A cache that can support multiple outstanding misses is called
lockup-free.
Since a normal cache can handle one miss at a time, a lockup-free
cache should have circuitry that keeps track of all outstanding
misses.
181
Remarks
Compulsory misses (cold-start misses). A block that has never
resided in the cache.
Capacity misses. A cache cannot contain all the blocks needed for
a program (replacing and retrieving the same block).
Conflict (collision) misses (direct-mapped and set-associative
caches). Two blocks compete for the same set.
Never use the miss rate (hit rate) as a single measure of
performance for caches. Why?
182
Remarks
Compilers or program writers should take into account the behaviour of
the memory system
Software- or compiler-based techniques to make programs better (in
terms of spatial and temporal locality). (Prefetching is also useful but
needs capable compilers).
for (i = 0; i != 500; i = i + 1)
  for (j = 0; j != 500; j = j + 1)
    for (k = 0; k != 500; k = k + 1)
      x[i][j] = x[i][j] + y[i][k] * z[k][j];
With double precision matrices, on SG Challenge L (MIPS R4000, 1-MB
secondary cache) this takes 77.2 seconds
Changing the loop order gives execution time of 44.2 seconds.
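The notes do not state which reordering was measured; a common cache-friendly choice is the ikj order sketched below (the function name and the assumption of statically allocated 500×500 double arrays are mine), which sweeps x[i][...] and z[k][...] along rows and so improves spatial locality:

#define N 500
double x[N][N], y[N][N], z[N][N];

/* ikj order: for a fixed (i,k), y[i][k] is a scalar, and both x[i][...]
   and z[k][...] are traversed sequentially along a row. */
void matmul_accumulate(void)
{
    for (int i = 0; i != N; i = i + 1)
        for (int k = 0; k != N; k = k + 1) {
            double r = y[i][k];
            for (int j = 0; j != N; j = j + 1)
                x[i][j] = x[i][j] + r * z[k][j];
        }
}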
183
61
Section 4b
Virtual Memory
Virtual Memories
The physical main memory space is not enough to contain everything
(secondary storage devices - e.g. disks, tapes, drums).
Programmers used to explicitly move programs or parts of programs from
secondary storage to MM when they are to be executed.
This problem is machine-dependent and should not be solved by the
programmer.
Virtual memory describes a hierarchical storage system of at least two
levels, which is managed by the operating system to appear to a
programmer like a single large directly addressable MM.
185
Why Do We Need VM?
Free programmers from the need to carry out storage allocation
and to permit efficient sharing of memory space among different
users.
Make programs independent of the configuration and capacity of
the memory systems used during their execution.
Achieve the high access rates and low cost per bit that is possible
with a memory hierarchy.
186
62
Terminology
Logical (virtual) address: An address expressed as a location
relative to the beginning of the program.
Physical address: This is an actual location in the memory.
Pages: fixed-size blocks of words that must always occupy contiguous locations, whether they are resident in the MM or in secondary storage.
187
The Basic Idea
The cache concept is intended to bridge the speed gap between
the CPU and the MM.
The virtual-memory concept bridges the size gap between the MM
and the secondary storage (disk/tape).
Conceptually, cache and virtual-memory techniques involve very
similar ideas. They differ mainly in the details of their
implementation.
188
Memory Management
To avoid CPU idle time, can use (or increase)
multiprogramming (having multiple jobs executing at
once)
• Requires increased memory size to accommodate all jobs
Rather than increasing MM size (costly), we can use a
mechanism to remove a waiting/idle job from memory
to allow an active job to use that space
• this is swapping
• swapped job has its memory written to secondary storage
• active jobs are allocated a portion of memory called a partition
189
63
Partitioning
Partitions can be fixed or variable size
Fixed:
• rigid size divisions of memory (though can be of various sizes)
• job assigned to smallest available partition it will fit into
• can be wasteful
Variable:
• allocate a process only as much memory as it needs
• efficient
• over time can lead to fragmentation (large numbers of pieces
of free memory which are too small to use), must consider
compaction to recover them
190
Virtual Addresses
Since a process can be swapped in and out of memory, it can end
up in different partitions
Addressing within the process’ data must not be tied to a specific
physical location in memory
Addresses are considered to be relative to the starting address
of the process’ memory (partition)
Hardware must convert logical addresses into physical addresses
before passing to the memory system – this is effectively a form
of index addressing
191
Paging
Sub-divide memory into small fixed-size “chunks”
called frames or page-frames
Divide program into same sized chunks called pages
Loading a program into memory requires loading the
pages into page frames (which need not be contiguous)
Limits wastage to a fraction of a single page
Each program has a page table (maps each page to its
page-frame in memory)
Logical addresses are interpreted as a page (converted
into a physical page frame) and an offset within the
page
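As a quick illustration of this interpretation, the fragment below splits a 32-bit logical address into a page number and an offset; the 4 KB page size is an assumption (it matches the Pentium figures later in these notes), and the example address is arbitrary.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u     /* assumed: 4 KB pages */
#define OFFSET_BITS 12        /* log2(PAGE_SIZE)     */

int main(void)
{
    uint32_t logical = 0x0001A2B4;                 /* example logical address */
    uint32_t page    = logical >> OFFSET_BITS;     /* page number             */
    uint32_t offset  = logical & (PAGE_SIZE - 1);  /* offset within the page  */

    /* The physical address is then frame_base(page) + offset, where the
       page table supplies frame_base for this page. */
    printf("page %u, offset %u\n", (unsigned)page, (unsigned)offset);
    return 0;
}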
192
64
The Basic Idea
[Figure: virtual-to-physical address translation – the virtual address from the CPU is split into a virtual page number and an offset; the virtual page number is added to the Page Table Base Register to form the page table entry address, the entry holds control bits and the address of the page frame, and the page frame address combined with the offset selects the location in MM]
193
The Basic Idea
A virtual address generated by the CPU (instructions/data) is
interpreted as a page number (high-order bits) followed by a word
number (low-order bits).
The page table in the MM specifies the location of the pages that are
currently in the MM.
By adding the page number to the contents of the page table base
register, the address of the corresponding entry in the page table is
obtained.
What if the page is not in memory?
194
Demand Paging
Only the program pages that are actually (currently)
required for execution are loaded
Only a few (of the potentially many) pages of any one
program might be loaded at any time
It is possible for a program to consist of more pages
than could fit into memory
• memory not a limit to program size
• virtual address space much larger than physical
• simplifies program development
195
65
Page Tables
Stored in memory
Can be large and are themselves subject to being stored in
secondary storage
Can have two level tables (require extra look-ups)
Each virtual address reference causes two memory accesses – to
get to the page table and then to get to the data
Extra delay, extra memory traffic
Solution: use a special cache to hold page table information – the
translation lookaside buffer (TLB) – a buffer of page table
entries for recently accessed pages
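A minimal sketch (hypothetical structure names and an invented page-table mapping, not real MMU hardware) of how a small direct-mapped TLB avoids the extra page-table access on most references:

#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 64   /* assumed small, power-of-two TLB */
#define OFFSET_BITS 12   /* assumed 4 KB pages              */

typedef struct {
    uint32_t vpn;        /* virtual page number   */
    uint32_t frame;      /* physical page frame   */
    int      valid;
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the slow page-table walk in main memory (mapping invented). */
static uint32_t page_table_lookup(uint32_t vpn)
{
    return vpn + 0x100;
}

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
    tlb_entry *e    = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped TLB index */

    if (!e->valid || e->vpn != vpn) {            /* TLB miss                */
        e->vpn   = vpn;
        e->frame = page_table_lookup(vpn);       /* the extra memory access */
        e->valid = 1;
    }
    return (e->frame << OFFSET_BITS) | offset;   /* physical address        */
}

int main(void)
{
    printf("0x%08X\n", (unsigned)translate(0x0001A2B4u));  /* miss: walks table */
    printf("0x%08X\n", (unsigned)translate(0x0001A2B8u));  /* hit: TLB only     */
    return 0;
}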
196
Segmentation
Another way in which addressable memory can be subdivided (in addition
to partitioning and paging).
Paging is invisible to the programmer and serves the purpose of providing
the programmer with a larger address space.
Segmentation is usually visible to the programmer and is provided as a
convenience for organising programs and data, and as a means for
associating privilege and protection attributes with instructions and data.
Segmentation allows the programmer to view memory as consisting of
multiple address spaces or segments. Segments are of variable (and
dynamic) size. Each segment may be assigned access and usage rights.
197
Advantages
• Simplifies the handling of growing data structures. The data structure can be assigned its own segment, and the operating system will expand or shrink the segment as needed.
• Lends itself to sharing among processes. A programmer can place a utility program or a useful table of data in a segment that can be addressed by other processes.
• Allows programs to be altered and recompiled independently, without requiring that an entire set of programs be re-linked and reloaded. Again, this is accomplished using multiple segments.
• Lends itself to protection. Since a segment can be constructed to contain a well-defined set of programs or data, the programmer or a system administrator can assign access privileges in a convenient fashion.
198
66
Pentium Memory Management
Based on the early 386 and 486 family
Supports segmentation and paging (though both can be disabled)
32 bit physical address – max MM size of 2^32 = 4GB
Unsegmented, unpaged memory has virtual address space of 4GB
Segmentation
• 16-bit segment reference (2 for protection/privilege bits, 14 for id),
and 32-bit offset within segment
• now, virtual address space is 2^(14+32) = 64TB
Paging
• 2-level table (1024 by 1024)
• each page is 4KB in size
199
Secondary Storage : Hard Disks
Circular platter of metal or plastic with a magnetisable
coating. Rotates at constant speed.
Data is written and read by the head – a conducting
coil through which current flows to induce a magnetic
field
Tracks: concentric rings separated by gaps
Same number of bits stored in each track – density
greatest towards the centre
Block: unit of data transfer
Sector: block-sized regions making up a track
200
201
67
Disk Characteristics
Single or multiple platters per drive – each platter has
its own head
Fixed or movable head
Removable or non-removable
Single sided or double sided platters
Head contact mechanism
202
Disk Performance
Seek time: position the movable head over the correct
track
Rotational delay: for desired sector to come under the
head
Access time: sum of the above two
Block transfer time: for actual data transfer
Contiguous storage of data desired
203
Section 5
The I/O System
68
External Devices
Human readable
• Screen, printer, keyboard
Machine readable
• Disk, tape
Communication
• Modem
• Network Interface Card (NIC)
205
Input/Output Problems
Wide variety of peripherals
• Delivering different amounts of data
• At different speeds
• In different formats
• Different interfaces
All slower than CPU and RAM
Generally not connected directly into system bus
Need I/O modules – standard interface
206
Input/Output Module
Relieves CPU of management of I/O
Interface to CPU and Memory
Interface to one or more peripherals
Interface consists of:
• control
• status
• data
207
69
I/O Modules
Block diagram
208
I/O Module Function
Control & Timing
CPU Communication
Device Communication
Data Buffering
Error Detection
209
I/O Steps
CPU checks I/O module device status
I/O module returns status
If ready, CPU requests data transfer
I/O module gets data from device
I/O module transfers data to CPU
Similar for output
210
70
Programmed I/O
CPU has direct control over I/O
• Sensing status
• Read/write commands
• Transferring data
CPU waits for I/O module to complete operation
Wastes CPU time
211
Programmed I/O - detail
CPU requests I/O operation
I/O module performs operation
I/O module sets status bits
CPU checks status bits periodically
I/O module does not inform CPU directly
I/O module does not interrupt CPU
CPU may wait or come back later
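A hedged sketch of what this polling looks like in C – the device register addresses and the READY bit below are invented for illustration; the point is that the CPU itself spins on the status bits and moves every word:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical I/O module registers and READY bit, invented for illustration. */
#define DEV_STATUS ((volatile uint8_t  *)0x20000000)
#define DEV_DATA   ((volatile uint32_t *)0x20000004)
#define DEV_READY  0x80

/* Programmed I/O: the CPU itself polls the status bits and moves every word,
   so it is busy for the whole duration of the transfer. */
void read_block(uint32_t *buf, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        while ((*DEV_STATUS & DEV_READY) == 0)
            ;                    /* CPU time wasted spinning here      */
        buf[i] = *DEV_DATA;      /* the CPU performs the data movement */
    }
}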
212
I/O Commands
CPU issues address
• Identifies module (& device if >1 per module)
CPU issues command
• Control – telling module what to do, e.g. spin up disk
• Test – check status, e.g. power? Error?
• Read/Write – module transfers data via buffer from/to device
213
71
Addressing I/O Devices
Under programmed I/O, data transfer is very like memory access
(CPU viewpoint)
Each device given unique identifier
CPU commands contain identifier (address)
Memory mapped I/O
• Devices and memory share an address space
• I/O looks just like memory read/write
• No special commands for I/O
• Large selection of memory access methods and instructions available
Isolated I/O
• Separate address spaces
• Need I/O or memory select lines
• Special commands for I/O
• Limited set
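Under memory-mapped I/O a device register is driven with ordinary loads and stores. The sketch below assumes a hypothetical UART at invented addresses, so treat it as an illustration of the idea rather than code for any real platform:

#include <stdint.h>

/* Hypothetical memory-mapped UART registers; real addresses are platform-specific. */
#define UART_STATUS ((volatile uint8_t *)0x10000000)
#define UART_DATA   ((volatile uint8_t *)0x10000004)
#define TX_READY    0x01

/* With memory-mapped I/O the device is driven by ordinary loads and stores;
   no special I/O instructions are required. */
void uart_putc(char c)
{
    while ((*UART_STATUS & TX_READY) == 0)
        ;                        /* wait until the transmitter is ready */
    *UART_DATA = (uint8_t)c;     /* a plain store performs the output   */
}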
214
Interrupt Driven I/O
Overcomes CPU waiting
No repeated CPU checking of device
I/O module interrupts when ready
215
Interrupt Driven I/O - Basic Operation
CPU issues read command
I/O module gets data from peripheral whilst CPU does
other work
I/O module interrupts CPU
CPU requests data
I/O module transfers data
216
72
Identifying Interrupting Module
How do you identify the module issuing the interrupt?
Different line for each module
• PC
• Limits number of devices
Software poll
• CPU asks each module in turn
• Slow
217
Identifying Interrupting Module
Daisy Chain or Hardware poll
• Interrupt Acknowledge sent down a chain
• Module responsible places vector on bus
• CPU uses vector to identify handler routine
Bus Master
• Module must claim the bus before it can raise interrupt
• e.g. PCI & SCSI
218
Multiple Interrupts
How do you deal with multiple interrupts?
• an interrupt handler being interrupted, or multiple I/O devices being ready at once
Higher priority devices can interrupt lower priority devices
For multiple interrupt request lines: each interrupt line is assigned a priority
If bus mastering, only the current master can interrupt
Polling: order of device polling establishes priority
219
73
Direct Memory Access
Interrupt driven and programmed I/O require active
CPU intervention
• Transfer rate is limited
• CPU is tied up
DMA is the answer
Additional module (hardware) on bus
DMA controller takes over from CPU for I/O
220
DMA Operation
CPU tells DMA controller:
• Read/Write
• Device address
• Starting address of memory block for data
• Amount of data to be transferred
CPU carries on with other work
DMA controller deals with transfer (between I/O
device and MM)
DMA controller sends interrupt when finished
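A sketch of how the CPU might describe such a transfer – the DMA register layout and addresses here are invented for illustration; real controllers differ:

#include <stdint.h>

/* Hypothetical DMA controller register layout, invented for illustration. */
#define DMA_CONTROL ((volatile uint32_t *)0x30000000)
#define DMA_DEVICE  ((volatile uint32_t *)0x30000004)
#define DMA_ADDRESS ((volatile uint32_t *)0x30000008)
#define DMA_COUNT   ((volatile uint32_t *)0x3000000C)
#define DMA_START_READ 0x1u

/* The CPU only describes the transfer; the DMA controller moves the data
   and raises an interrupt when the whole block has been transferred. */
void dma_read(uint32_t device, uint32_t *dest, uint32_t words)
{
    *DMA_DEVICE  = device;                        /* device address             */
    *DMA_ADDRESS = (uint32_t)(uintptr_t)dest;     /* start address of the block */
    *DMA_COUNT   = words;                         /* amount of data             */
    *DMA_CONTROL = DMA_START_READ;                /* read/write command + go    */
    /* The CPU now carries on with other work; completion is signalled
       by the DMA controller's interrupt. */
}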
221
DMA Transfer - Cycle Stealing
DMA controller takes over bus for a cycle
Transfer of one word of data to/from memory
Not an interrupt
• CPU does not switch context
CPU suspended just before it accesses bus
• i.e. before an operand or data fetch or a data write
Slows down CPU but not as much as CPU doing transfer
222
74
DMA Configurations (1)
[Figure: CPU, a detached DMA controller, I/O modules and main memory all sharing a single bus]
Single Bus, Detached DMA controller
Each transfer uses bus twice
• I/O to DMA then DMA to memory
CPU is suspended twice (cycle stealing)
223
DMA Configurations (2)
[Figure: CPU, main memory and DMA controllers on a single bus, with the I/O devices attached directly to the DMA controllers]
Single Bus, Integrated DMA controller
Controller may support >1 device
Each transfer uses bus once
• DMA to memory
CPU is suspended once
224
DMA Configurations (3)
[Figure: CPU, main memory and DMA controller on the system bus; the I/O modules hang off a separate I/O bus connected to the DMA controller]
Separate I/O Bus
Bus supports all DMA enabled devices
Each transfer uses bus once
• DMA to memory
CPU is suspended once
225
75
Comparison
226
I/O Channels
Extends DMA concept
I/O devices getting more sophisticated, e.g. 3D
graphics cards
CPU instructs I/O controller to do transfer (execute
I/O program)
I/O controller does entire transfer
Improves speed
• Takes load off CPU
• Dedicated processor is faster
227
External Interface
Between I/O module and I/O device (peripheral)
Tailored to the nature and operation of the device
• parallel vs serial transfer
• data format conversions
• transfer rates
• number of devices supported
Point-to-point: dedicated connection
Multipoint: external buses (external mass storage and
multimedia devices)
228
76
External Interfaces
RS-232 serial port
Games port
Small Computer System Interface (SCSI)
FireWire (IEEE standard 1394) high performance
serial bus
Universal Serial Bus (USB)
229
Section 6
Instruction Set Architecture
What ISA is all about?
[Figure: a C program is translated by the compiler into an assembly language program, the assembler produces a machine language program, and the loader places it in memory for execution]
Much of a computer system’s architecture is hidden from a High Level
Language programmer.
In the abstract sense, the programmer should not really care what the underlying architecture is.
The instruction set is the boundary where the computer designer and the
computer programmer can view the same machine.
231
77
What ISA is all about?
The complete collection of instructions that are understood by a
CPU.
Thus, an examination of the instruction set goes a long way to
explaining the design and behaviour of the CPU, for example.
An attempt is made to look closely at the instruction set and see
how it influences the design parameters of a given machine.
232
What to consider in ISA design?
Operation repertoire
• How many operations?
• What can they do?
• How complex are they?
Data types
Instruction formats
• Length of op code field.
• Number of addresses.
233
What to consider in ISA design?
Registers
• These days all machines use general purpose registers.
• Registers are faster than memory and memory traffic is reduced.
• Code density improves.
• Number of CPU registers available
• Which operations can be performed on which registers?
Addressing modes (discussed in more detail later)
RISC v CISC
234
78
Instruction Repertoire
What types of instructions should be included in a general-purpose
processor’s instruction set?
A typical machine instruction executes one or two very simple (micro)
operations, eg transferring the contents of one register to another. A
sequence of such instructions is typically needed to implement a
statement in a high-level programming language such as C++, C, or
FORTRAN.
Because of the complexity of the operations, data types, and syntax of
high-level languages, few successful attempts have been made to
construct computers whose machine language directly corresponds to a
high-level language (e.g., DEC VAX-11).
This creates a semantic gap between the high-level problem specification
and the machine instruction set that implements it, a gap that a compiler
must bridge.
235
ISA Requirements
A number of requirements need to be satisfied by an instruction set.
It should be complete in the sense that one should be able to construct a
machine-language program to evaluate any function that is computable
using a reasonable amount of memory space.
The instruction set should be efficient in that frequently required
functions can be performed rapidly using relatively few instructions.
It should be regular in that the instruction set should contain expected
opcodes and addressing modes, e.g., if there is a left-shift, there should
be a right-shift.
236
ISA Requirements
The instruction set should also be reasonably orthogonal with respect to
the addressing modes. An instruction set is said to be orthogonal if there
is only one easy way to do any operation.
To reduce both hardware and software design costs, the instructions may
be required to be compatible with those of existing machines, e.g.,
previous members of the same computer family.
Simple instruction sets require simple, and therefore inexpensive, logic circuits to implement them, but they can lead to longer, more complex programs. So, there is a fundamental trade-off between processor simplicity and programming complexity.
237
79
Elements of a Machine Instruction
Steps in executing an instruction:
• Instruction Fetch
• Instruction Decode
• Operand Fetch
• Execute
• Store Results
• Next Instruction
Design elements these steps raise:
• Instruction Format or Encoding – decoding
• Location of operands and result – is it in memory? how many explicit operands? how are memory operands located? which can or cannot be in memory?
• Data type and Size
• Operations – what operations?
• Next instruction – Branch, conditional branch, etc
238
Instruction Types
To execute programs we need the following basic types
of instructions:
• Data storage: transfers between main memory and CPU
registers.
• Data processing: arithmetic and logic operations on data.
• Control: program sequencing and control (branches).
• I/O transfers for data movement.
239
Classes of Instructions
Data-transfer instructions, which cause information to be copied
from one location to another either in the processor’s internal
memory or in the external main memory.
Arithmetic instructions, operations on numerical data.
Logical instructions, which include Boolean and other non-numerical operations.
Program control instructions, branch instructions, which change
the sequence in which programs are executed.
Input-output (I/O) instructions, which cause information to be
transferred between the processor or its main memory and
external IO devices.
System control functions
240
80
Data Transfer
Specify:
• source
• destination
• amount of data
Maybe different instructions for different movements
• e.g. IBM S/390 (table 10.5 in Stallings)
Or one instruction and different addresses
• e.g. DEC VAX
Note: different representation conventions
• INST SRC1, SRC2, DEST vs. INST DEST, SRC1, SRC2
241
Arithmetic and Logical
Arithmetic
• Add, Subtract, Multiply, Divide.
• Signed Integer.
• Floating point?
• May include: absolute, increment, decrement, negate
Logic
• Bit-wise operations.
• AND, OR, NOT.
242
Conversion, Input/Output,
System Control
Conversion
• Binary to Decimal.
I/O
• Specific instructions.
• Data movement instructions (memory mapped).
• A separate controller (DMA).
System control
• Privileged instructions.
• CPU needs to be in a specific state.
• For operating system use.
243
81
Transfer of Control
Branch instructions.
• Conditional or unconditional
• The testing capability of different conditions and subsequently
choosing one of a set of alternative ways to continue computation has
many more applications than just loop control.
• This capability is embedded in the instruction sets of all computers
and is fundamental to the programming of most nontrivial tasks.
• eg BRP X (branch to location X if result is +ve)
244
Transfer of Control
Skip instructions.
• Loop: ……
• ISZ R1 (increment and skip if zero)
• BR Loop
• ADD A
Subroutine call instructions.
• Jump to routine with the expectation of returning and resuming
operation at the next instruction.
• Must preserve the address of the next instruction (the return
address)
• Store in a register or memory location
• Store as part of the subroutine itself
• Store on the stack
245
Instruction Representation
The main memory is organised so
that a group of n bits is referred
to as a word of information, and n
is called the word length (8, 32,
64, 128, 256, etc bits).
Each word location has a distinct
address.
Each word may hold instructions or operands (numerals or characters).
[Figure: main memory viewed as M words of n bits each, word 0 to word M−1, with addresses 0 to M−1]
2^m = M (m bits are needed to represent all addresses)
246
82
Instruction Format
The purpose of an instruction is to specify an operation to be carried out
on the set of operands or data.
The operation is specified by a field called the “opcode” (operation code).
[Figure: an n-bit instruction divided into an opcode field and operand fields; e.g. MOVE A,R0 and ADD B,R0 – the symbolic form is easy for people to understand]
The symbolic names are called mnemonics; the set of rules for using the
mnemonics in the specification of complete instructions and programs is
called the “syntax” of the language.
247
Instruction Length
Affected by and affects:
• Memory size
• Memory organization
• Bus structure
• CPU complexity
• CPU speed
Trade off between powerful instruction repertoire and saving space.
If code size is most important, use variable length instructions.
If performance is most important, use fixed length instructions.
248
Allocation of Bits
Tradeoff between number of opcodes supported (rich
instruction set) and the power of the addressing
capability.
• Number of addressing modes.
• Number of operands.
• Register versus memory.
• Number of register sets.
• Address range.
• Address granularity.
249
83
Example Instruction Formats
PDP-8: 12-bit fixed format, 35 instructions
PDP-10: 36-bit fixed format, 365 instructions
PDP-11: variable length instructions (16, 32 or 48 bits)
used in 13 different formats
VAX: highly variable format (1 or 2 byte opcode
followed by 0-6 operands), with instruction lengths of
1-37 bytes
Pentium II: highly variable with a large number of
instruction formats
PowerPC: 32-bit fixed format
250
Types of Operands
Addresses
Numbers
• integer
• floating point
Characters
• ASCII, EBCDIC, etc.
Logical Data
• Bits or flags
251
Number of Addresses (pros and cons)
The fewer the addresses, the shorter the instruction. Long instructions
with multiple addresses usually require more complex decoding and
processing circuits.
Limiting the number of addresses also limits the range of functions each
instruction can perform.
Fewer addresses mean more primitive instructions, and longer programs
are needed.
Storage requirements of shorter instructions and longer programs tend
to balance, larger programs require longer execution time.
252
84
Zero-Address or Stack Machines
The order in which an arithmetic expression is evaluated in a
stack machine corresponds to the order in the Polish notation for
the expression, so-called after the Polish logician Jan Lukasiewicz
(1878-1956), who first introduced it.
The basic idea is to write a binary operation X*Y either in the
form *XY (prefix notation) or XY* (suffix or reverse Polish
notation).
Compilers for stack machines convert ordinary infix arithmetic
expressions into Polish form for execution in a stack.
253
Stack Machines
A × B + C × C ⇒ AB × CC × +
Instruction    Comment
PUSH A         Transfer A to top of the stack
PUSH B         Transfer B to top of the stack
MULT           Remove A & B from stack and replace by A * B
PUSH C         Transfer C to top of stack
PUSH C         Transfer second copy of C to top of stack
MULT           Remove C & C from stack and replace by C * C
ADD            Remove C * C & A * B from stack and replace by their sum
POP X          Transfer result from top of stack to X
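To see the zero-address idea concretely, the toy C evaluator below mirrors the instruction sequence in the table; the push/pop helpers and the operand values are invented for illustration:

#include <stdio.h>

/* A toy zero-address (stack) machine evaluating A*B + C*C, mirroring the
   instruction sequence in the table above. */
static double stack[16];
static int sp = 0;                        /* next free slot */

static void   push(double v) { stack[sp++] = v; }
static double pop(void)      { return stack[--sp]; }

int main(void)
{
    double A = 2.0, B = 3.0, C = 4.0;     /* example operand values */

    push(A);                              /* PUSH A */
    push(B);                              /* PUSH B */
    push(pop() * pop());                  /* MULT   */
    push(C);                              /* PUSH C */
    push(C);                              /* PUSH C */
    push(pop() * pop());                  /* MULT   */
    push(pop() + pop());                  /* ADD    */

    printf("X = %g\n", pop());            /* POP X: 2*3 + 4*4 = 22 */
    return 0;
}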
254
One-Address Machines
X = A * B + C * C
Instruction    Comment
LOAD A         Transfer A to accumulator AC
MULT B         AC = AC * B
STORE T        Transfer AC to memory location T
LOAD C         Transfer C to accumulator AC
MULT C         AC = AC * C
ADD T          AC = AC + T
STORE X        Transfer result to location X
255
85
Two-Address Machines
X = A * B + C * C
Instruction    Comment
MOVE T,A       T = A
MULT T,B       T = T * B
MOVE X,C       X = C
MULT X,C       X = X * C
ADD X,T        X = X + T
256
Three-Address Machines
X = A * B + C * C
Instruction     Comment
MULT T,A,B      T = A * B
MULT X,C,C      X = C * C
ADD X,X,T       X = X + T
257
Data Types
Bit: 0, 1
Bit String:
• 8 bits is a byte
• 16 bits is a word
• 32 bits is a double-word
• 64 bits is a quad-word
Character:
• ASCII 7 bit code
Decimal:
• digits 0-9 encoded as 0000b – 1001b
• two decimal digits packed per 8 bit byte
Integers:
• Sign-magnitude
• Ones and Twos complement
Floating Point:
• Single Precision
• Double Precision
• Extended Precision
258
86
Specific Data Types
General - arbitrary binary contents
Integer - single binary value
Ordinal - unsigned integer
Unpacked BCD - One digit per byte
Packed BCD - 2 BCD digits per byte
Near Pointer - 32 bit offset within segment
Bit field
Byte String
Floating Point
259
Data Types for Pentium & PowerPC
Pentium
• 8 bit Byte
• 16 bit word
• 32 bit double word
• 64 bit quad word
• Addressing is by 8 bit unit
• A 32 bit double word is read at addresses divisible by 4.
PowerPC
• 8 bit Byte
• 16 bit half-word
• 32 bit word
• 64 bit double-word
260
Data Types
[Figure: three 32-bit word layouts]
(a) Signed Integer – sign bit b31 (0 for +ve numbers, 1 for -ve numbers) followed by the magnitude b30 … b0; value = b30 × 2^30 + … + b1 × 2^1 + b0 × 2^0
(b) Four Characters – four 8-bit ASCII characters per word
(c) A Machine Instruction – an 8-bit operation field (2^8 = 256 distinct instructions) and 24 bits of addressing information (addresses 0 to 2^24 − 1)
261
87
Byte Order
[Figure: little-endian byte assignment – within each word, byte addresses increase from the least-significant byte, so word 0 holds bytes 3, 2, 1, 0, word 4 holds bytes 7, 6, 5, 4, and the last word (address 2^k − 4) holds bytes 2^k − 1 down to 2^k − 4]
Little-endian assignment
Intel 80x86, DEC Vax, DEC Alpha (Windows NT), Pentium
262
Byte Order
[Figure: big-endian byte assignment – within each word, byte addresses increase from the most-significant byte, so word 0 holds bytes 0, 1, 2, 3, word 4 holds bytes 4, 5, 6, 7, and the last word (address 2^k − 4) holds bytes 2^k − 4 up to 2^k − 1]
Big-endian assignment
IBM 360/370, MIPS, Sparc, Motorola 680x0 (Mac)
Most RISC designs
Internet protocols are big-endian. WinSock (like BSD sockets) provides host-to-network (htons, htonl) and network-to-host (ntohs, ntohl) functions for conversion.
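A small, self-contained C check of the host's byte order, with a manual 32-bit byte swap of the kind htonl() performs on a little-endian machine:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t word = 0x01020304;
    uint8_t *bytes = (uint8_t *)&word;    /* view the same word as bytes */

    /* On a little-endian host byte 0 holds the least-significant byte (0x04);
       on a big-endian host it holds 0x01. */
    printf("%s-endian host\n", bytes[0] == 0x04 ? "little" : "big");

    /* Manual host-to-big-endian conversion of a 32-bit value, which is what
       htonl() does on a little-endian machine. */
    uint32_t big = ((word & 0x000000FFu) << 24) |
                   ((word & 0x0000FF00u) << 8)  |
                   ((word & 0x00FF0000u) >> 8)  |
                   ((word & 0xFF000000u) >> 24);
    printf("0x%08X -> 0x%08X\n", (unsigned)word, (unsigned)big);
    return 0;
}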
263
Byte Alignment
Alignment requires that objects fall on an address that is a multiple of their size.
[Figure: aligned versus not-aligned placement of multi-byte objects in memory]
264
88
Addressing Modes
Once the number of addresses contained in an instruction has been decided, the way in which each address field specifies a memory location must be determined.
The ability to reference a large range of address locations.
Tradeoffs:
• addressing range and flexibility.
• complexity of the address calculation.
265
Addressing Modes
Immediate
Direct
Indirect
Register
Register Indirect
Displacement
Stack
266
Immediate Addressing
Operand is part of instruction.
Data is a constant at run time.
No additional memory references are required after the fetch of
the instruction itself.
Size of the operand (thus its range of values) is limited.
e.g. LOADI 999 (the constant 999 is placed directly in the accumulator AC), or MOV #200,R0.
267
89
Direct Addressing
The simplest mode of “direct” address formation. It requires the
complete operand address to appear in the instruction operand field.
One additional memory access is required to fetch the operand.
Address range limited by the width of the field that contains the
address reference.
Address is a constant at run time but data itself can be changed during
program execution.
This address is used without further modification to access the desired
data item
e.g. LOAD X (the contents of memory location X, here 999, are copied into the accumulator AC), or MOV A,B.
268
Indirect Addressing
The effective address of the operand is in the register or main memory
location whose address appears in the instruction.
Large address space.
Multilevel (e.g. EA=(…(A)…)).
Multiple memory accesses to find operand (slow).
e.g. ADD (A),R0, or LOADN A (memory location A holds the address of the operand; that operand is then fetched into the accumulator AC).
269
Register Addressing
Just like direct addressing.
Operand is held in the register named in the address field.
For example, Add R0,R1.
Limited number of registers (very limited address space).
Very small address field needed.
• Shorter instructions.
• no memory access, faster instruction fetch and faster execution.
Multiple registers helps performance.
Requires good assembly programming or compiler writing.
270
90
Register Indirect Addressing
Just like indirect addressing.
Operand is in memory cell pointed to by contents of
register R.
Large address space (2N).
One fewer memory access than indirect addressing.
271
Displacement Addressing
Powerful.
The capabilities of direct addressing and register
indirect addressing.
EA = A + (R)
Relative addressing.
Base-register addressing.
Indexing.
272
Displacement Addressing Diagram
e.g. ADD 20(R1),R2 – R1 contains 1050, the offset 20 is given as a constant in the instruction, so the operand is at address 1050 + 20 = 1070.
273
91
Relative Addressing
EA = A + (PC)
For example, get operand from A cells from current
location pointed to by PC.
Locality of reference and cache utilisation.
274
Base-Register Addressing
A holds displacement.
R holds pointer to base address.
A is a displacement added to the contents of the
referenced “base register” to form the EA.
Used by programmers and O/S to identify the start of
user areas, segments, etc. and provide accesses within
them.
275
Indexed Addressing
A = base
R = displacement
EA = A + R
Good for accessing arrays (see the sketch after this list)
• EA = A + R
• R++ (autoindexing)
• ADD (R2)+,R0
• ADD –(R2),R0
Postindex: EA = (A) + (R)
Preindex: EA = (A + (R))
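A rough illustration of why indexed addressing suits arrays – the C loop below walks an array, and the comments sketch the EA = base + index computation each access implies (which instructions a compiler actually emits depends on the ISA):

#include <stdio.h>

int main(void)
{
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int sum = 0;

    /* Conceptually: A = base address of a[], R = i scaled by the element size,
       and every load uses EA = A + R (indexed addressing).  Autoindexing modes
       such as (R)+ would update R as a side effect of the access. */
    for (int i = 0; i < 8; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}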
276
92
Stack Addressing
Operand is (implicitly) on top of stack
[Figure: a stack occupying part of memory between address 0 and M−1; the stack pointer register (SP) points to the current top element, and the values −28, 17, 739 and 43 are shown on the stack, with 43 as the last element]
PUSH operation:
• Decrement SP
• MOVE NEWITEM,(SP)
POP operation:
• MOVE (SP),ITEM
• Increment SP
277
Pentium and PowerPC addressing
Stallings text shows 9 addressing modes for the
Pentium II.
• Range from simple modes (e.g., immediate) to very complex
modes (e.g., bases with scaled index and displacement).
The PowerPC, in contrast has fewer, simpler addressing
modes.
CISC vs. RISC design issue (later in course)
278
Quantifying ISA Design
Design-time metrics:
• Can it be implemented, in how long, at what cost?
• Can it be programmed? Ease of compilation?
Static metrics:
• How many bytes does the program occupy in memory?
Dynamic metrics:
• How many instructions are executed?
• How many bytes does the processor fetch to execute the program?
• How many clocks (clock ticks) are required per instruction?
Always remember that “Time is the best metric”
279
93
Section 7
CPU Structure
and the Control Unit
CPU Organisation
Components of the CPU
• ALU
• Control logic
• Temporary storage
• Means to move data in, out and around the CPU
281
CPU Organisation
External View
Internal Structure
282
94
Register Organisation
Registers are the highest level of the memory
hierarchy – small number of fast temporary storage
locations
User-visible registers
Control and status registers – most are not visible to
the user
283
User-visible Registers
Categories based on function:
• General purpose
• Data
• Addresses, eg segment pointers, stack pointers, index registers
• Condition codes – visible to the user, but values set by the CPU as a result of performing operations
Design tradeoff between general purpose and specialised registers
How many registers are “enough”? Most CISC machines have 8-32 registers. RISC can have many more.
How big (wide)?
284
Control and Status Registers
Used during execution of instructions – mostly not visible, or
cannot have contents modified
Memory Address Register (MAR)
• Connected to address bus.
• Specifies address for read or write operation.
Memory Buffer Register (MBR)
• Connected to data bus.
• Holds data to write or last data read.
Program Counter (PC)
• Holds address of next instruction to be fetched.
Instruction Register (IR)
• Holds last instruction fetched.
Program Status Word (PSW)
285
95
Micro-Operations
The CPU executes sequences of instructions (the program) stored in a
main memory.
Fetch/execute cycle (instruction cycle).
Each cycle has a number of steps (comprising a number of microoperations), and each step does very little (atomic operation of CPU).
286
Micro-Operations
The time required for the shortest well-defined CPU microoperation is
defined to be the CPU cycle time and is the basic unit of time for
measuring all CPU actions. What is CPU “clock rate”?
The clock rate depends directly on the circuit technology used to
fabricate the CPU.
Main memory speed may be measured by the memory cycle time , which is
the minimum time that must elapse between two successive read or write
operations. The ratio tm/tCPU typically ranges from 1 – 10 (or 15).
The CPU contains a finite number of registers, used for temporary
storage of instructions and operands. The transfer of information among
these registers can proceed at a rate approximately tm/tCPU times that of
a transfer between the CPU and main memory.
287
Types of Micro-operation
Transfer data from one register to another.
Transfer data from a register to an external
component (e.g. bus).
Transfer data from an external component to a
register.
Perform arithmetic or logical operations using
registers for input and output.
288
96
Fetch Sequence
Address of next instruction is in PC.
Copied to MAR
MAR is placed on address bus.
Control unit issues READ command.
Result (data from memory) appears on data bus.
Data from data bus copied into MBR.
PC incremented by 1 (in parallel with data fetch from memory). Note that incrementing the PC by 1 is only correct because we are assuming that each instruction occupies one memory word and that memory is word-addressable.
Data (instruction) moved from MBR to IR.
MBR is now free for further data fetches.
289
Fetch Sequence
t1:  MAR ← (PC)
t2:  MBR ← (memory)
     PC ← (PC) + 1
t3:  IR ← (MBR)

or, with an equivalent grouping:

t1:  MAR ← (PC)
t2:  MBR ← (memory)
t3:  PC ← (PC) + 1
     IR ← (MBR)
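The same register transfers can be mimicked in a few lines of C – a toy model of the fetch cycle with MAR, MBR, PC and IR as variables and memory as an array, not a description of how the control unit is actually built:

#include <stdio.h>
#include <stdint.h>

static uint16_t mem[16] = { 0x1A05, 0x2B06, 0x3C07 };   /* toy instruction words */
static uint16_t MAR, MBR, PC, IR;

/* One fetch cycle expressed as the time steps t1..t3 above. */
static void fetch(void)
{
    MAR = PC;            /* t1: MAR <- (PC)      */
    MBR = mem[MAR];      /* t2: MBR <- (memory)  */
    PC  = PC + 1;        /*     PC  <- (PC) + 1  */
    IR  = MBR;           /* t3: IR  <- (MBR)     */
}

int main(void)
{
    PC = 0;
    for (int i = 0; i < 3; i++) {
        fetch();
        printf("fetched 0x%04X, PC now %u\n", (unsigned)IR, (unsigned)PC);
    }
    return 0;
}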
290
Groupings of Micro-Operations
Proper sequence must be followed:
• MAR ← (PC) must precede MBR ← (memory)
Conflicts must be avoided:
• Must not read & write same register at same time
• MBR ← (memory) and IR ← (MBR) must not be in same cycle
Also: PC ← (PC) + 1 involves addition:
• Use ALU
• May need additional micro-operations
291
97
Indirect Cycle
t1:  MAR ← (IRaddress)
t2:  MBR ← (memory)
t3:  IRaddress ← (MBR)
IRaddress is address field of instruction
MBR contains an address
IR is now in same state as if direct addressing had
been used
292
Interrupt Cycle
Differs from one machine to another.
t1:  MBR ← (PC)
t2:  MAR ← save-address
     PC ← routine-address
t3:  memory ← (MBR)
The above steps are the minimal number needed.
May need additional micro-ops to get addresses.
293
Execute Cycle (eg ADD)
ADD R1,X - add the contents of location X to Register 1, store
result back in R1.
• Fetch the instruction.
• Fetch the first operand (the contents of the memory location pointed
to by the address field of the instruction).
• Perform the addition.
• Load the result into R1
Examples to follow assume the IR contains the add instruction, ie
already fetched.
t1: MAR ← (IR address)
t2: MBR ← (memory)
t3: R1 ← R1 + (MBR)
294
98
Execute Cycle (eg ISZ)
ISZ X - increment and skip if zero.
t1:  MAR ← (IRaddress)
t2:  MBR ← (memory)
t3:  MBR ← (MBR) + 1
t4:  memory ← (MBR)
     if (MBR) == 0 then PC ← (PC) + 1
Note, the last two microoperations are performed at the same
time.
The conditional in t4 would be enforced by checking the ZERO
condition code
295
Execute Cycle (eg BSA)
BSA X - Branch and save address (subroutine calls).
Address of instruction following BSA is saved in X. Execution
continues from X+1.
t1:  MAR ← (IRaddress)
     MBR ← (PC)
t2:  PC ← (IRaddress)
     memory ← (MBR)
t3:  PC ← (PC) + 1
296
Control of the Processor
(Functional Requirements)
Define the basic elements of the processor:
• ALU
• Registers
• Internal data paths
• External data paths
• Control Unit
Describe the micro-operations that the processor performs.
Determine the functions that the control unit must perform to cause the micro-operations to be performed.
297
99
Functions of Control Unit
Sequencing
• causing the processor to step through a series of microoperations.
Execution
• causing the performance of each micro-operation.
The above is performed using “control signals”.
298
Control Signals
Clock
• processor cycle time.
• one micro-instruction (or set of parallel micro-instructions) per clock
cycle.
Instruction register
• op-code for current instruction determines which micro-instructions
are performed.
Flags
• state of CPU.
• results of previous operations.
299
Control Signals
From control bus
• interrupts.
• acknowledgements.
Within CPU
• cause data movement (between registers).
• activate specific ALU functions.
Via control bus
• to memory.
• to I/O modules.
300
100
Example Control Signal Sequence –
Instruction Fetch
MAR ← (PC)
• control unit activates signal to open gates between PC and
MAR.
MBR ← (memory)
• open gates between MAR and address bus.
• memory read control signal.
• open gates between data bus and MBR.
Have omitted incrementing PC
301
Internal Processor Organisation
Usually a single internal bus.
Gates control movement of data onto and off the bus.
Control signals control data transfer to and from
external systems bus.
Temporary registers needed for proper operation of
ALU.
302
Internal Processor Organisation
[Figure: single internal CPU bus connecting the ALU (with temporary registers Y and Z), the general-purpose registers, the memory buffer register, memory address register, program counter, instruction register and instruction decoder; the MAR and MBR connect to the external memory bus and the decoder drives the control lines]
Note: The internal bus (CPU) should not be confused with the
external bus or buses connecting the CPU to the memory and I/O
devices.
Registers Y and Z are used by the CPU for temporary storage.
303
101
Example
Example: Add R1, A.
Fetch the instruction.
Fetch the first operand (the contents of the memory
location pointed to by the address field of the
instruction, A).
Perform the addition.
Load the result into R1.
304
Example
Step   Action
1      PCout, MARin, Read, Clear Y, Set carry-in to ALU, Add, Zin
2      Zout, PCin, Wait for MFC
3      MBRout, IRin   (steps 2-3: wait until the MFC signal is received)
4      Address-field-of-IRout, MARin, Read   (interpreting the contents of the IR enables the control circuitry to choose the appropriate signals)
5      R1out, Yin, Wait for MFC
6      MBRout, Add, Zin
7      Zout, R1in, End
305
More on the Single Bus CPU
[Figure: gating detail for the single-bus CPU – temporary register Y (controlled by Yin/Yout) feeds ALU input A, the internal bus feeds input B, the ALU result is latched into temporary register Z (Zin/Zout), and each general register such as R(i-1) has its own R(i-1)in and R(i-1)out gating signals onto the bus]
306
102
More Performance!
Performance of a computer depends on many factors,
some of which are related to the design of the CPU
(power of instructions, clock cycle time, and the
number of clock cycles per instruction) – remember the
golden formula!
Power of instructions: simple vs. complex, single clock
cycle vs. multiple clock cycles, pipeline design.
Clock speed has a major influence on performance.
307
Two-Bus and Three-Bus CPUs
[Figure: two-bus and three-bus CPU organisations – the same functional units (ALU with temporary register Y, general registers, memory data register, memory address register, program counter, instruction register and instruction decoder) connected by two or three internal CPU buses (Bus1, Bus2, …), which reduces the number of sequential steps needed per instruction]
308
Can Performance Be Further
Enhanced!?
Instruction unit (Pre-fetching).
Caches (memory that responds in a single clock cycle).
Superscalar processors.
[Figure: a RISC processor containing an instruction unit fed by an instruction cache, integer and floating-point units, a data cache and a bus interface; the CPU connects over the system bus to main memory and input/output]
103
Hardwired Implementation of CU
To execute instructions, the CPU must have some means of
generating the control signals discussed previously.
• Hardwired control.
• Microprogrammed control.
The execution of the control sequences (micro-operations)
requires non-overlapping time slots. Each time slot must be at
least long enough for the functions specified in the corresponding
step to be completed.
If one assumes that all the time slots are equal in duration, then,
the required control unit may be based on the use of a counter
driven by a clock signal.
310
Hardwired Implementation
Therefore, the required control signals are uniquely determined by the following
information:
• Contents of the control step counter.
• Contents of the instruction register.
• Contents of the condition codes and other status flags.
[Figure: hardwired control unit – a clock drives a control step counter and step decoder producing timing signals T1 … Tn; the instruction register feeds an instruction decoder producing INS1 … INSn; these signals, together with the status flags and condition codes, feed an encoder that generates the control signals, plus Run, End and Reset]
311
Hardwired Implementation
Control unit inputs.
Flags and control bus.
Instruction register.
• each bit means something.
• op-code causes different control signals for each different
instruction.
• unique logic for each op-code.
• decoder takes encoded input and produces single output.
• n binary inputs and 2^n outputs.
312
104
Hardwired Implementation
Clock
• repetitive sequence of pulses.
• useful for measuring duration of micro-operations.
• must be long enough to allow signal propagation.
• different control signals at different times within instruction
cycle.
• need a counter with different control signals for t1, t2 etc.
313
Hardwired Design Approaches
Traditional “state table” method
• can produce the minimum component design
• complex design that may be hard to modify
Clocked delay element
• straight-forward layout based on flow chart of the instruction implementation
• requires more delay elements (flip-flops) than are really needed
Sequence counter approach
• polyphase clock signals are derived from the master clock using a standard counter-decoder approach
• these signals are applied to the combinatorial portion of the control circuit
314
Example
One possible structure is a programmable logic array (PLA). A PLA
consists of an array of AND gates followed by an array of OR gates; it
can be used to implement combinational logic functions of several
variables.
The entire decoder/encoder block shown previously can be implemented
in the form of a single PLA. Thus, the control of a CPU can be organised
as shown below.
[Figure: PLA-based control – the instruction register, the control step counter and the flags/condition codes feed the AND array; the OR array produces the control signals]
315
105
Problems With Hardwired Control
Complex sequencing and micro-operation logic.
Difficult to design and test.
Inflexible design.
Difficult to add new instructions.
316
Microprogramming
Hardwired control is not very flexible.
Wilkes introduced a technique called microprogramming (M.V.
Wilkes, 1951, “The Best Way to Design an Automatic Calculating
Machine,” Report of the Manchester University Computer
Inaugural Conference. Manchester, U.K., University of
Manchester).
Computer within a computer.
Each machine instruction is translated into a sequence of
microinstructions (in firmware) that activate the control signals
of the different resources of the computer (ALU, registers, etc).
317
Microprogram Controllers
All the control unit does is generate a set of control
signals.
Each control signal is on or off.
Represent each control signal by a bit.
Have a control word for each micro-operation.
Have a sequence of control words for each machine code
instruction.
Add an address/branching field to each microinstruction
to specify the next micro-instruction, depending on
conditions.
Put these together to form a microprogram with routines
for fetch, indirect and interrupt cycles, plus one for the
execute cycle of each machine instruction
318
106
Microprogrammed Control Unit
Sequence logic unit issues read
command.
Word specified in control
address register is read into
control buffer register.
Control buffer register contents
generates control signals and
next address information.
Sequencing logic loads a new address into the control address register, based on the next address information from the control buffer register and the ALU flags.
319
Microinstruction Execution
Each cycle is made up of two
events:
• fetch – generation of the microinstruction address by the sequencing logic.
• execute
Effect is to generate control
signals.
Some control signals internal
to processor.
Rest go to external control
bus or other interface.
320
Control Memory
[Figure: organisation of the control memory]
Fetch cycle routine – ends with a jump to the indirect or execute routine
Indirect cycle routine – ends with a jump to the execute routine
Interrupt cycle routine – ends with a jump to the fetch routine
Execute cycle begin – jump to the appropriate op code routine
AND routine – ends with a jump to the fetch or interrupt routine
ADD routine – ends with a jump to the fetch or interrupt routine
321
107
Microinstruction Sequencing
Design Considerations
• Size of microinstructions.
• Address generation time.
Next microinstruction can be:
• determined by IR – once per cycle, after the instruction is fetched.
• next sequential address – most common.
• a branch – conditional or unconditional.
Sequencing Techniques
Based on the current µinstruction, condition flags and contents of the IR, the control memory address of the next µinstruction must be generated after each step.
Based on the format of the address information in the µinstruction:
• Two address fields.
• Single address field.
• Variable format.
322
Remarks
Due to the sophistication of modern processors we
have control memories that:
• contain a large number of words which correspond to the large
number of instructions to be executed.
• have a wide word width due to the large number of control
points to be manipulated.
323
Remarks
The microprogram defines the instruction set of the computer.
Change the instruction set by changing/modifying the contents of
the microprogram memory (flexibility).
Microprogram memory (fast). The speed of this memory plays a
major role in determining the overall speed of the computer.
Read-only type memory (ROM) is typically used for that purpose
(microprogram unit is not often modified).
In the case where an entire CPU is fabricated as a single chip, the
microprogram ROM is a part of that chip.
324
108
Design Issues
Minimise the word size (or length) and microprogram
storage.
Minimise the size of the microprograms.
High levels of concurrency in the execution of
microinstructions.
Word size (or length) is based on:
• Maximum number of simultaneous micro-operations supported.
• The way control information is represented or encoded.
• The way in which the next micro-instruction address is
specified.
325
Micro-instruction Types
Parallelism in microinstructions. Microprogrammable
processors are frequently characterised by the
maximum number of microoperations that can be
specified by a single microinstruction (e.g. 1-100).
Microinstructions that specify a single microoperation
are quite similar to conventional machine instructions
(short and more microinstructions needed to perform a
given operation).
326
Horizontal Microprogramming
Wide memory word.
High degree of parallel operations possible.
Little encoding of control information.
IBM System/370 Model 50. The microinstruction consists of 90
bits.
• 21 bits constitute the control fields
• remaining fields are used for generating the next microinstruction
address and for error detection.
[Figure: horizontal microinstruction format – a field of internal CPU control signals, a field of system bus control signals, a jump condition field and the next micro-instruction address]
327
109
Vertical Microprogramming
Width is narrow: n control signals are encoded into log2(n) bits.
Limited ability to express parallelism.
IBM System/370 Model 145. The microinstruction consists of 4
bytes (32 bits).
• one byte specifies the microoperation.
• two bytes specify operands.
• one byte used to construct the address of the next microinstruction.
Considerable encoding of control information requires external
memory word decoder to identify the exact control line being
manipulated.
328
Vertical Microprogramming
[Figure: vertical microinstruction format – function code fields, a jump condition field and the next micro-instruction address]
329
Pros and Cons
Horizontal microinstructions:
• Long format
• Ability to express a high degree of parallelism
• Little encoding of the control information
Vertical microinstructions:
• Short format
• Limited ability to express parallel microoperations
• Considerable encoding of the control information
330
110
Applications of Microprogramming
Realisation of Computers.
Emulation.
Operating System Support.
Realisation of Special Purpose Devices.
High Level Language Support.
Micro-diagnostics.
User Tailoring.
331
Emulation
One of the main functions of microprogramming control is to
provide a means for simple, flexible, and relatively inexpensive
control of a computer.
In a computer with a fixed set of instructions, the control
memory can be a ROM. However, if we use a read/write memory
(or a programmable ROM) for the control memory, it is possible to
alter the instruction set by writing new microprograms. This is
called emulation.
Emulation is easily accomplished when the machines involved have
similar architectures (e.g. members of the same family).
332
Emulation
In emulation, a computer is microprogrammed to have
exactly the same instruction set as another computer
and to behave in exactly the same manner.
Emulation can be used to replace obsolete equipment
with more up-to-date machines without forcing the
users to rewrite the bulk of their software. If the
replacement computer fully emulates the original one,
no software changes have to be made to run the
existing software (cost and time savings).
333
111
Is Microprogramming Good?
Compromise.
Simplifies design of control unit.
• cheaper.
• less error-prone.
• better for diagnosis.
Slower.
A much greater variability exists among computers at
the microinstruction level than at the instruction level.
334
Microprogrammed Designs:
Advantages and Disadvantages
A more structured approach to the design of the control circuitry
(e.g. enhanced diagnostic capabilities and better reliability when
compared to hardwired designs).
Slower than the hardwired ones. Economies of scale seem to
favour hardwired control only when the system is not too complex
and requires only a few control operations.
Main memory utilisation in microprogrammed computers is usually
better (software is stored in the microprogram control memory
instead of main memory).
Better ROMs - in terms of cost and in access time - will further
enhance the use of microprogramming.
335
Section 8
Instruction Pipelining
112
Pipelining
Pipelining is used in many high-performance computers, but it is a
key element in the implementation of the RISC architecture.
Pipelining influences the design of the instruction set of a
computer.
A “good” design goal of any system is to have all of its components
performing useful work all of the time – high efficiency.
Following the instruction cycle in a strictly sequential fashion does not permit this level of efficiency.
337
Pipelining
If we assume that the fetch and execute stages require the same amount of time, and if the computer has two hardware units, one for fetching instructions and the other for executing them (what is the implication?), then the fetch and execute operations can each be completed in one clock cycle.
If pipelined execution can be
sustained for a long period, the
speed of instruction execution is
twice that of sequential
operation (easier said than
done!).
[Figure: sequential execution of I1 … I6 (F1 E1 F2 E2 …) versus two-stage pipelined execution, where the fetch of each instruction overlaps the execution of the previous one (F1 E1 / F2 E2 / F3 E3 / …), completing one instruction per cycle once the pipeline is full]
338
Observations
First step: instruction pre-fetch.
Divide the instruction cycle into two (equal??) “parts”.
• Instruction fetch
• Everything else (execution phase)
While one instruction is in “execution,” overlap the prefetching of
the next instruction.
• Assumes the memory bus will be idle at some point during the
execution phase.
• Reduces the time to fetch an instruction to zero (ideal situation).
Problems
• The two parts are not equal in size.
• Branching can negate the prefetching.
• As a result of the branch instruction, you could have prefetched the “wrong” instruction.
339
113
Observations
An ideal pipeline divides a task into k independent sequential
subtasks:
• each subtask requires 1 time unit to complete.
• the task itself then requires k time units to complete.
For n iterations of the task, the execution times will be:
• with no pipelining: nk time units
• with pipelining: k + (n-1) time units
Speedup of a k-stage pipeline is thus:
S = nk / [k + (n-1)] → k (for large n)
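For example, with k = 6 stages and n = 100 instructions the formula gives S = 600/105 ≈ 5.7, approaching 6 as n grows. A tiny C check (the sample values are chosen arbitrarily):

#include <stdio.h>

/* Ideal k-stage pipeline speedup for n instructions: S = nk / (k + n - 1). */
static double speedup(int k, int n)
{
    return (double)n * k / (k + n - 1.0);
}

int main(void)
{
    printf("k=6, n=100   : S = %.2f\n", speedup(6, 100));    /* about 5.71  */
    printf("k=6, n=10000 : S = %.2f\n", speedup(6, 10000));  /* approaches 6 */
    return 0;
}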
340
More Stages in the Pipeline
F: Fetch the instruction and
fetch the source operand(s)
D: Decode the instruction
A: ALU Operation
S: Store back the results
Four times faster than that of
the sequential operation (when?).
No two operations performed in
parallel can have resource
conflicts (give examples).
[Figure: four-stage pipeline timing – while instruction I1 is in the S stage, I2 is in A, I3 is in D and I4 is being fetched, so once the pipeline is full one instruction completes every cycle]
341
Observations
Alternative approaches:
• Finer division of the instruction cycle: use a 6-stage pipeline.
• Instruction fetch, Decode opcode, Calculate operand address(es), Fetch operands, Perform execution, Write (store) result.
Use multiple execution “functional units” to parallelise
the actual execution phase of several instructions.
Use branching strategies to minimise branch impact.
342
114
Deeper Pipelines
Fetch instruction (FI).
Decode instruction (DI).
Calculate operands (CO)
(i.e. effective addresses).
Fetch operands (FO).
Execute instructions (EI).
Write operand (WO).
343
Remarks
Pipelined operation cannot be
maintained in the presence of
branch or jump instructions.
The data paths for fetching
instructions must be separate
from those involved in the
execution of an instruction.
If one instruction requires data
generated by a previous
instruction, it is essential to
ensure that the correct values
are used (for longer pipelines).
Minimise interruptions in the flow
of work through the stages of
the pipeline.
Conflict over hardware (execute
and fetch operations).
RISC machines have two memory
buses, one for instructions and
one for data (cost, performance).
Memory access time (needs to be
decreased because the CPU is
faster than memory). Different
stages in a pipeline should take
the same time.
Two caches need to be used, one
for instructions and one for data.
344
Other Problems (Pipeline Depth)
If the speedup is based on the number of stages, why not build
lots of stages?
Each stage uses latches at its input (output) to buffer the next
set of inputs.
• If the stage granularity is reduced too much, the latches and their
control become a significant hardware overhead.
• Also suffer a time overhead in the propagation time through the
latches.
• Limits the rate at which data can be clocked through the pipeline.
Logic to handle memory and register use and to control the overall
pipeline increases significantly with increasing pipeline depth.
Data dependencies also factor into the effective length of
pipelines.
345
115
Other Problems (Data Dependencies)
Pipelining, as a form of parallelism, must ensure that computed
results are the same as if computation was performed in strict
sequential order.
With multiple stages, two instructions “in execution” in the
pipeline may have data dependencies - must design the pipeline to
prevent this.
• Data dependencies limit when an instruction can be input to the
pipeline.
Data dependency examples:
A=B+C
D=E+A
C=GxH
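Written as C statements, the dependencies are easy to see (a minimal
illustration of the three statements above):

#include <stdio.h>

int main(void)
{
    int B = 2, C = 3, E = 4, G = 5, H = 6;
    int A, D;

    A = B + C;      /* produces A; also reads the old value of C        */
    D = E + A;      /* read-after-write (RAW) dependency on A: this
                       cannot start until A has been computed           */
    C = G * H;      /* no RAW dependency, so it may be overlapped with
                       the ADD above; but it must not be moved before
                       the first statement, which still needs the old
                       value of C (a write-after-read hazard)           */

    printf("A=%d D=%d C=%d\n", A, D, C);
    return 0;
}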
346
Branching
A pipelined machine must provide
some mechanism for dealing with
branch instructions.
However, 15-20% of instructions in
an assembly-level stream are
(conditional) branches.
Of these, 60-70% take the branch
to a target address.
[Timing diagram: a two-stage pipeline in which one instruction is a Goto.
The instruction prefetched behind it is discarded and its execute slot
becomes a NOP (No OPeration), i.e. a "bubble", before fetching resumes at
the branch target (F3 E3, F4 E4, F5 E5).]
Impact of the branch is that pipeline
never really operates at its full
capacity – limiting the performance
improvement that is derived from
the pipeline.
347
Branch in a Pipeline
348
116
Dealing with Branches
(Multiple Streams)
Have two pipelines.
Prefetch each branch into a separate pipeline.
Use appropriate pipeline.
Leads to bus and register contention.
Multiple branches lead to further pipelines being
needed.
349
Dealing with Branches
(Prefetch Branch Target)
Target of branch is prefetched in addition to
instructions following branch.
Keep target until branch is executed.
Used by IBM 360/91.
350
Dealing with Branches (Loop Buffer)
Very fast memory (cache).
Maintained by fetch stage of pipeline.
Check buffer before fetching from memory.
Very good for small loops or jumps.
Used by CRAY-1.
351
117
Branch Prediction
Predict never taken
• assume that jump will not happen.
• always fetch next instruction.
• 68020 & VAX 11/780.
• VAX will not prefetch after branch if a page fault would result
(O/S v CPU design).
Predict always taken
• assume that jump will happen.
• always fetch target instruction.
352
Branch Prediction
Predict by Opcode.
• some instructions are more likely to result in a jump than
others.
• static predictor
• can get up to 75% success.
Taken/Not taken switch.
• based on previous history.
• good for loops.
353
Branch Prediction State Diagram
Using 2 history bits
This is not how Simplescalar’s bimod predictor works
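A minimal C sketch of a generic 2-bit saturating-counter predictor follows
(the usual textbook scheme is assumed here; as noted above, Simplescalar's
bimod predictor differs in detail). The prediction only flips after two
consecutive mispredictions, which is why a loop-closing branch is predicted
well:

#include <stdio.h>

/* 2-bit saturating counter: states 0,1 predict "not taken"; 2,3 predict "taken". */
static int state = 0;                        /* start in strongly not-taken       */

static int predict(void) { return state >= 2; }

static void update(int taken)
{
    if (taken)  { if (state < 3) state++; }  /* move towards strongly taken       */
    else        { if (state > 0) state--; }  /* move towards strongly not-taken   */
}

int main(void)
{
    /* Outcome pattern of a loop branch: taken several times, then not taken once. */
    int outcomes[] = { 1, 1, 1, 1, 0, 1, 1, 1, 1, 0 };
    for (int i = 0; i < 10; i++) {
        int p = predict();
        printf("predict %-9s actual %-9s %s\n",
               p ? "taken" : "not-taken",
               outcomes[i] ? "taken" : "not-taken",
               p == outcomes[i] ? "(hit)" : "(miss)");
        update(outcomes[i]);
    }
    return 0;
}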
354
118
Branch Address Prediction
If ever choose to take a branch, must predict the
address of the branch target, since not known until
after instruction decoding (or even later)
Branch Target Buffer (BTB)
aka Branch History Table (BHT) in Stallings
Cache of recent branch instructions and their target
addresses.
Could store actual target instruction instead of just
address
Updated after branch direction and address known
(same as history bits)
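A branch target buffer can be pictured as a small cache indexed by the
branch instruction's address. The C sketch below is a hypothetical
direct-mapped layout for illustration only; the field sizes, indexing and
2-bit counter are assumptions, not a description of any particular
processor:

#include <stdio.h>
#include <stdint.h>

#define BTB_ENTRIES 512                 /* 512 entries, as in the processors compared later */

struct btb_entry {
    uint32_t tag;                       /* address of the branch instruction  */
    uint32_t target;                    /* most recently seen target address  */
    uint8_t  history;                   /* 2-bit taken/not-taken history      */
    uint8_t  valid;
};

static struct btb_entry btb[BTB_ENTRIES];

/* Look up the fetch address; on a hit that predicts "taken", return 1 and the target. */
static int btb_lookup(uint32_t pc, uint32_t *target)
{
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == pc && e->history >= 2) { *target = e->target; return 1; }
    return 0;
}

/* Update once the branch direction and real target address are known. */
static void btb_update(uint32_t pc, uint32_t target, int taken)
{
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (!e->valid || e->tag != pc) { e->valid = 1; e->tag = pc; e->history = taken ? 2 : 1; }
    else if (taken  && e->history < 3) e->history++;
    else if (!taken && e->history > 0) e->history--;
    e->target = target;
}

int main(void)
{
    uint32_t t;
    btb_update(0x1000, 0x2000, 1);      /* branch at 0x1000 observed taken to 0x2000 */
    btb_update(0x1000, 0x2000, 1);
    if (btb_lookup(0x1000, &t))
        printf("predict taken, target 0x%x\n", (unsigned)t);
    return 0;
}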
355
Delayed Branch
Do not take jump until you have to
Minimise branch penalty by finding valid instructions to execute
while the branch address is being resolved
Rearrange instructions in compiler.
Original sequence:
LOOP   Shift-Left      R1
       Decrement       R2
       Branch if ≠ 0   LOOP
Next   ..................

Rearranged (the shift now fills the branch delay slot):
LOOP   Decrement       R2
       Branch if ≠ 0   LOOP
       Shift-Left      R1
Next   ..................
Cannot find an instruction for the branch delay slot? Insert a NOP.
There can be more than one branch delay slot; the number depends on the
length of the pipeline.
356
Delayed Branch: An example
For an architecture with 1 delay slot:
• May be able to fill 70% of all delay slots with useful work
• For 100 instructions, say 15% of which are conditional
branches, we have 15 branch delay slots. 30% of these (which
is about 5) will have NOP instructions put in them.
• Hence, execution time is as though 105 instructions were
executed
For an architecture with 2 delay slots:
• Might fill 70% of first delay slot and 50% of second slot
• For 100 instructions, cannot fill 4.5 out of 15 first slots and
7.5 out of 15 second slots. Hence must insert 12 NOP
instructions.
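The arithmetic above can be checked with a few lines of C, using only the
stated assumptions (15% conditional branches, 70% fill for the first slot,
50% fill for the second):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double instructions = 100.0;
    double branches     = 0.15 * instructions;     /* 15 conditional branches      */

    /* One delay slot, 70% of slots filled with useful work.                       */
    double nops1 = branches * (1.0 - 0.70);        /* 4.5, i.e. about 5 NOPs       */
    printf("1 slot : %.1f NOPs -> about %.0f instructions executed\n",
           nops1, instructions + ceil(nops1));

    /* Two delay slots: 70% of first slots and 50% of second slots filled.         */
    double nops2 = branches * (1.0 - 0.70)         /* 4.5 unfilled first slots     */
                 + branches * (1.0 - 0.50);        /* 7.5 unfilled second slots    */
    printf("2 slots: %.1f NOPs -> about %.0f instructions executed\n",
           nops2, instructions + nops2);
    return 0;
}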
357
119
Higher Performance
Although the hardware may or may not rely on the compiler to resolve
hazard dependencies to ensure correct execution, the compiler must
“understand” the pipeline to achieve the best performance. Otherwise,
unexpected stalls will reduce the performance of the compiled code.
Techniques that were limited to mainframes and supercomputers have
made their way down to single-chip computers (pipelined functions in
single-chip computers). These are called superpipelined processors
(deeper pipelines than the five-stage RISC model).
Superscalar machines: these are machines that issue multiple
independent instructions per clock cycle (2-4 independent
instructions/clock cycle).
358
Superscalar vs. Superpipelined
The term superscalar, first coined in 1987, refers to a
machine that is designed to improve the performance
of the execution of scalar instructions (as opposed to
vector processors).
In most applications, the bulk of the operations are on
scalar quantities. Accordingly, the superscalar
approach represents the next step in the evolution of
high-performance general-purpose processors.
359
Superscalar vs. Superpipelined
An alternative approach to achieving greater
performance is referred to as superpipelining, a term
first coined in 1988.
Superpipelining exploits the fact that many pipeline
stages perform tasks that require less than half a clock
cycle. Thus, doubling the internal clock speed allows two
tasks to be performed in one external clock cycle.
Both approaches offer similar challenges to compiler
designers.
360
120
Superscalar vs. Superpipelined
361
Pentium vs. PowerPC
Characteristic                                              Pentium Pro      PowerPC 604
Max. number of instructions issued per clock cycle          3                4
Max. number of instructions completing execution
  per clock cycle                                           5                6
Maximum number of instructions committed per clock cycle    3                6
No. of bytes fetched from instruction cache                 16               16
No. of bytes in instruction queue                           32               32
No. of instructions in reorder buffer                       40               16
No. of entries in branch table buffer                       512              512
No. of history bits per entry in branch history buffer      4                2
Number of functional units                                  6                6
Number of integer functional units                          2                2
Number of complex integer operation functional units       0                1
No. of floating-point functional units                      1                1
No. of branch functional units                              1                1
No. of memory functional units                              1 for load,      1 for load/store
                                                            1 for store
Patterson, D.A. and Hennessy, J.L., 1994, Computer Organization and Design: The Hardware and
Software Interface, Morgan Kaufmann.
362
Section 9
Reduced Instruction Set
Computers (RISC)
121
RISC
Reduced Instruction Set Computer (RISC).
Another milestone in the development of computer
architecture
Main characteristics:
• limited and simple instruction set.
• large number of general purpose registers.
• use of compiler technology to optimise register use.
• emphasis on optimising the instruction pipeline.
364
Comparison of Processors
                   IBM       DEC VAX   Intel    Motorola   MIPS     IBM        Intel
                   370/168   11/780    486      88000      R4000    RS/6000    80960
Year               1973      1978      1989     1988       1991     1990       1989
# instr’s          208       303       235      51         94       184        62
Instr. size        2-6       2-57      1-11     4          32       4          4 or 8
Addressing modes   4         22        11       3          1        2          11
GP registers       16        16        8        32         32       32         23-256
µ Control
memory (kB)        420       480       246      -          -        -          -
                   CISC      CISC      CISC     RISC       RISC     S/Scalar   S/Scalar
365
Driving force for CISC
Software costs far exceed hardware costs.
Increasingly complex high level languages.
Semantic gap.
The above leads to:
• large instruction sets.
• more addressing modes.
• hardware implementations of HLL statements.
Examples of such machines: Motorola 68000, DEC
VAX, IBM 370.
366
122
Intention of CISC
Simplify the task of designing compilers and lead to
an overall improvement in performance.
Minimise the number of instructions that are needed
to perform a given task (complex operations in
microcode).
Improve execution efficiency.
Support more complex HLLs.
367
Program Execution Characteristics
Operations performed.
Operands used.
Execution sequencing.
Studies have been done based on programs written in
HLLs.
Dynamic studies are measured during the execution of
the program.
368
Operations
Assignments.
• movement of data.
Conditional statements (IF, LOOP).
• sequence control.
Procedure call-return is very time consuming.
Some HLL instructions lead to many machine code
operations.
369
123
Relative Dynamic Frequency
             Dynamic Occurrence   Machine Instr. (Weighted)   Memory Ref. (Weighted)
             Pascal     C         Pascal     C                Pascal     C
Assign       45         38        13         13               14         15
Loop         5          3         42         32               33         26
Call         15         12        31         33               44         45
If           29         43        11         21               7          13
GoTo         -          3         -          -                -          -
Other        6          1         3          1                2          1
370
Operands
Mainly local scalar variables.
Optimisation should concentrate on accessing
local variables.
                     Pascal    C       Average
Integer constant     16        23      20
Scalar variable      58        53      55
Array/structure      26        24      25
371
Procedure Calls
Very time consuming.
Depends on number of parameters passed.
Depends on level of nesting.
Most programs do not do a lot of calls followed by lots
of returns.
Most variables are local.
Recall: locality of reference!
372
124
Implications
Best support is given by optimising most used and most
time consuming features.
Large number of registers.
• operand referencing.
Careful design of pipelines.
• branch prediction etc.
Simplified (reduced or streamlined) instruction set.
373
RISC Philosophy
Complex instructions that perform specialised tasks tend to appear
infrequently, if at all, in the code generated by a compiler.
The design of the instruction set in a RISC processor is heavily guided by
compiler considerations.
Only those instructions that are easily used by compilers and that can be
efficiently implemented in hardware are included in the instruction set.
The more complex tasks are left to the compiler to construct from
simpler operations.
The design of the instruction set of a RISC is streamlined for efficient
use by the compiler to maximise performance for programs written in
HLLs.
374
RISC Philosophy
RISC Philosophy: reduce hardware complexity at the
expense of increased compiler complexity, compilation
time, and the size of the object code.
A processor executes instructions by performing a sequence of steps.
Execution Time = N × S × T
N: number of instructions
S: average number of steps/instruction (like CPI)
T: time needed to perform one step
Higher speed of execution can be achieved by reducing the value of
any or all of the above three parameters.
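A one-line C calculation makes the trade-off concrete. The figures below are
purely illustrative, chosen only to contrast a "fewer but more complex
instructions" mix with a "more but simpler instructions" mix:

#include <stdio.h>

/* Execution time = N x S x T (instructions x steps/instruction x step time). */
static double exec_time(double N, double S, double T_ns)
{
    return N * S * T_ns;                 /* result in nanoseconds */
}

int main(void)
{
    /* Illustrative figures only. */
    printf("CISC-like: %.0f ns\n", exec_time(80.0e6, 4.0, 10.0));  /* fewer, complex instructions */
    printf("RISC-like: %.0f ns\n", exec_time(120.0e6, 1.2, 5.0));  /* more, simple instructions   */
    return 0;
}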
375
125
RISC Philosophy
CISCs attempt to decrease N, while RISCs decrease the values of S and
T.
Pipelining can be used to make S ≅ 1 (i.e. the computer can complete the
execution of one instruction in every CPU clock cycle).
To reduce the value of T (the clock period), the number of logic levels in
the hardware that decodes the instructions and generates various
control signals must be kept to a minimum (simpler instructions/smaller
numbers of instructions).
The instruction’s effect on the execution time of an average task should
be kept in mind when considering an instruction for inclusion in a RISC
processor.
376
Example
MOVE 50(A2), (A5)+
Now, if no memory-to-memory operations are allowed and there is no
auto-increment addressing mode, the same effect requires three simpler instructions:
MOVE 50(A2), D3
MOVE D3, (A5)
ADD #2, A5
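In C pointer terms the same decomposition looks roughly as follows (a
hypothetical illustration that assumes 16-bit word operands, so the
auto-increment corresponds to ADD #2):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int16_t src[40] = {0}, dst[4] = {0};
    src[25] = 1234;                        /* the word 50 bytes past A2            */

    int16_t *a2 = src, *a5 = dst;
    int16_t  d3;

    /* CISC, one instruction:  MOVE 50(A2), (A5)+                                  */
    /*   *a5++ = *(int16_t *)((char *)a2 + 50);                                    */

    /* RISC-style, no memory-to-memory move and no auto-increment mode:            */
    d3 = *(int16_t *)((char *)a2 + 50);    /* MOVE 50(A2), D3   - load             */
    *a5 = d3;                              /* MOVE D3, (A5)     - store            */
    a5  = a5 + 1;                          /* ADD  #2, A5       - +1 word = +2 bytes */

    printf("dst[0] = %d\n", (int)dst[0]);
    return 0;
}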
377
More on RISC
Instruction fetch operations in a pipelined CPU are carried out in parallel
with internal operations and do not contribute to the execution time of a
program.
READ/WRITE operations on data operands do contribute to the
execution time.
Most RISC machines have a “load-store” architecture
All data manipulation instructions are limited to operands that are either
in CPU registers or are contained within the instruction word.
All addressing modes used in load/store instructions are limited to those
that require only a single access to the main memory.
378
126
More on RISC
Addressing modes requiring several internal operations in the CPU are
also avoided.
For example, most RISC machines do not provide an auto-increment
addressing mode because it usually requires an additional clock cycle to
increment the contents of the register (auto-increment modes are rarely
used by computers).
Small register sets increase the number of load/store operations (smaller chip
size or better utilisation of chip area, but this slows down the operation of a
pipelined CPU).
More registers might waste space if the compiler can’t use them effectively
(register files of 32-128 or more registers).
379
Floating-Point Operations
Computer applications including scientific computations, digital
signal processing, and graphics require the use of floating-point
instructions.
Most RISC machines have special instructions for floating-point
operations and hardware that performs these operations at high
speed.
Floating-point instructions involve a complex, multi-step sequence
of operations to handle the mantissa and the exponent and to
carry out such tasks as operand alignment and normalization.
380
Floating-Point Operations
Is this inconsistent with RISC design philosophy? Yes
But, floating-point instructions are widely used.
An important consideration in the instruction set of a
pipelined CPU is that most instructions should take
about the same time to execute (separate units for
integer and floating-point operations).
381
127
Caches
Play an important role in the design of RISCs (split
caches, etc).
Memory access time directly affects the
values of S and T.
Again, is it consistent with RISC philosophy?
382
A Typical RISC
[Block diagram: A Typical RISC System. An integer unit and a floating-point
unit read up to three 32-bit operands from, and write a destination result
back to, a register file; an instruction unit with its own instruction cache
and a data unit with its own data cache connect to main memory over the
system bus.]
383
Choice of Addressing Modes
RISC philosophy influences instruction set designs by considering
the effect of the addressing modes on the pipeline.
Addressing modes are supposed to facilitate accessing a variety
of data structures simply and efficiently. (e.g., index, indirect,
auto-increment, auto-decrement).
When it comes to RISC machines, the effect of the addressing
modes on aspects such as the clock period, chip area, and the
instruction execution pipeline is considered.
Most importantly, the extent to which these modes are likely to
be used by compilers is studied.
384
128
Addressing Modes
MOVE   (X(R1)),R2
The above instruction can be implemented using simpler instructions:
ADD    #X,R1,R2
MOVE   (R2),R2
MOVE   (R2),R2
In a pipelined machine which is capable of starting a new instruction in
every clock cycle, complex addressing modes that involve several
accesses to the main memory do not necessarily lead to faster execution.
385
Addressing Modes
More complex hardware (for decode and execute) is required to
deal with complex instructions, although complex instructions
reduce the space needed in the main memory.
Complex hardware ⇒ more chip area.
Complex hardware ⇒ longer clock cycle.
Complex addressing modes ⇒ increase the value of “T”.
Complex addressing modes ⇒ offsetting reduction in the value of
N (not necessarily true).
386
Addressing Modes
In general, the addressing modes found in RISC
machines often have the following features:
• Access to an operand does not require more than one access to
the main memory.
• Access to the main memory is restricted to load and store
instructions.
• The addressing modes used do not have side effects.
The basic addressing modes that adhere to the above
rules are: register, register indirect, and indexed.
387
129
Effect of Condition Codes
In a machine such as the 68000, the condition code bits, which are a part
of the processor status register, are set or cleared based on the result
of instructions.
Condition codes cause data dependencies and complicate the task of the
compiler, which must ensure that reordering will not cause a change in the
outcome of a computation. (instruction reordering to eliminate NOP
cycles).
Increment        R5
Add              R1,R2
Add-with-carry   R3,R4
388
Condition Codes
The results from the second instruction are used in the third instruction
(data dependency that must be recognised by both hardware and
software).
The order of instructions cannot be reversed.
The hardware must delay the second add instruction.
The condition codes (bits) should not be updated automatically after every
instruction. Instead they are changed only when explicitly requested in
the instruction OP-code. This makes the detection of data dependencies
much easier.
389
Register Files
CPU registers.
Data stored in these registers can be accessed faster than data
stored in the main memory or cache. (Note: fewer bits are needed
to specify the location of an operand when that operand is in a
CPU register).
CPU registers are not referenced explicitly by high-level language programmers
(they are used by the CPU to store intermediate results).
A strategy is needed that will allow the most frequently accessed
operands to be kept in registers and to minimise register-memory
operations.
390
130
Register Files
Software solution.
• require compiler to allocate registers.
• allocate based on most used variables in a given time.
• requires sophisticated program analysis.
Hardware solution.
• have more registers.
• thus more variables will be in registers.
• pioneered by the Berkeley RISC group and is used in the first
commercial RISC product, the Pyramid.
391
Register Windows
In most computers, all CPU registers
are available for use at any time
(which register to use is left entirely
to the software).
Registers are divided into groups
called “windows”.
A procedure can access only the
registers in its window.
R0 of the called procedure will
physically differ from R0 of the
calling procedure (register windows
are assigned to procedures as if they
were entries on a stack).
[Figure: the register file is divided into windows (Window 0, Window 1,
Window 2, ...), each holding its own copy of registers R0-R3, so R0 of the
called procedure is a different physical register from R0 of the caller;
windows are assigned to nested procedures (A, B, C, D, ...) like entries on
a stack kept in memory.]
392
Overlapping Windows
(and Global Variables)
The window scheme can be modified
to accommodate global variables and
to provide an easy mechanism for
passing parameters.
To support global variables, a few
registers can be made accessible to
all procedures.
To pass parameters easily, the
solution is to overlap windows
between procedures.
This approach reduces the need for
saving and restoring registers, but
the number of registers allocated to
a procedure is fixed.
[Figure: overlapping register windows. Global registers R0-R3 are visible to
every procedure; each window then has its own local registers, and the top
registers of Window 0 (shown as R10/R4 and R11/R5) are physically the same
as the bottom registers of Window 1, so parameters can be passed through the
overlap.]
393
131
Circular Buffer Mechanism
When a call is made, a current
window pointer is moved to show
the currently active register
window.
If all windows are in use, an
interrupt is generated and the
oldest window (the one furthest
back in the call nesting) is saved
to memory.
A saved window pointer records
which window was most recently
saved to memory, so that it can be
restored when the call nesting
unwinds back to it.
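A rough C sketch of this bookkeeping is given below. It assumes one simple
way of managing N windows as a circular buffer, with a single spill array
standing in for the memory stack; real designs such as the Berkeley RISC
differ in detail:

#include <stdio.h>

#define NWINDOWS   8
#define REGS_PER_W 16

static int regfile[NWINDOWS][REGS_PER_W];   /* the physical register windows            */
static int saved[NWINDOWS][REGS_PER_W];     /* stand-in for the memory stack            */
static int cwp = 0;                         /* current window pointer                   */
static int swp = 0;                         /* saved window pointer: boundary between
                                               spilled windows and in-use windows       */
static int depth = 1;                       /* windows currently held in the file       */

static void call(void)
{
    if (depth == NWINDOWS) {                /* all windows in use: "interrupt"          */
        for (int r = 0; r < REGS_PER_W; r++)/* spill the oldest window to memory        */
            saved[swp][r] = regfile[swp][r];
        swp = (swp + 1) % NWINDOWS;
        depth--;
    }
    cwp = (cwp + 1) % NWINDOWS;             /* advance to the callee's window           */
    depth++;
}

static void ret(void)
{
    cwp = (cwp + NWINDOWS - 1) % NWINDOWS;  /* back to the caller's window              */
    depth--;
    if (depth == 0) {                       /* caller's window had been spilled:        */
        swp = (swp + NWINDOWS - 1) % NWINDOWS;
        for (int r = 0; r < REGS_PER_W; r++)/* restore it from memory                   */
            regfile[swp][r] = saved[swp][r];
        depth = 1;
    }
}

int main(void)
{
    for (int i = 0; i < 10; i++) call();    /* nest deeper than NWINDOWS                */
    for (int i = 0; i < 10; i++) ret();
    printf("cwp=%d swp=%d depth=%d\n", cwp, swp, depth);
    return 0;
}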
394
Large Register File vs. Cache
Large Register File                           Cache
All local scalars                             Recently-used local scalars
Individual variables                          Blocks of memory
Compiler-assigned global variables            Recently-used global variables
Save/Restore based on procedure               Save/Restore based on cache
nesting depth                                 replacement algorithm
Register addressing                           Memory addressing
395
Large Register File vs. Cache
[Figure: referencing an operand in a window-based register file compared
with referencing it through a cache.]
396
132
Compiler Based Register Optimisation
Assume small number of registers (16-32).
Optimising register use is up to the compiler.
HLL programs have no explicit references to registers.
Assign symbolic or virtual register to each candidate variable.
Map (unlimited) symbolic registers to real registers.
Symbolic registers that do not overlap can share real registers.
If you run out of real registers some variables use memory.
397
Graph Colouring
The technique most commonly used in RISC compilers
is known as graph colouring (a technique borrowed from
the discipline of topology).
The graph colouring problem can be stated as follows:
“Given a graph consisting of nodes and edges, assign
colours to nodes such that adjacent nodes have
different colours, and do this in such a way as to
minimise the number of different colours.”
398
Graph Colouring
Nodes are symbolic registers.
Two registers that are live in the
same program fragment are
joined by an edge.
Try to colour the graph with n
colours, where n is the number of
real registers.
Nodes that cannot be coloured
are placed in memory.
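A minimal greedy sketch in C is shown below (hypothetical; production
compilers use far more elaborate colouring and spilling heuristics).
Symbolic registers are the nodes, an interference matrix supplies the edges,
and any node that cannot be given one of the n colours is spilled to memory:

#include <stdio.h>

#define SYMREGS 6      /* symbolic (virtual) registers                            */
#define NCOLORS 2      /* n = number of real registers (kept small to force a spill) */

/* interfere[i][j] = 1 if symbolic registers i and j are live at the same time
   and therefore cannot share a real register.                                    */
static const int interfere[SYMREGS][SYMREGS] = {
    {0,1,1,0,0,0},
    {1,0,1,1,0,0},
    {1,1,0,1,0,0},
    {0,1,1,0,1,0},
    {0,0,0,1,0,1},
    {0,0,0,0,1,0},
};

int main(void)
{
    int color[SYMREGS];                     /* assigned real register, or -1 = spill */

    for (int i = 0; i < SYMREGS; i++) {
        int used[NCOLORS] = {0};
        for (int j = 0; j < i; j++)         /* colours taken by interfering neighbours */
            if (interfere[i][j] && color[j] >= 0)
                used[color[j]] = 1;

        color[i] = -1;
        for (int c = 0; c < NCOLORS; c++)   /* pick the first free colour, if any   */
            if (!used[c]) { color[i] = c; break; }
    }

    for (int i = 0; i < SYMREGS; i++)
        if (color[i] >= 0) printf("symbolic r%d -> real register R%d\n", i, color[i]);
        else               printf("symbolic r%d -> memory (spilled)\n", i);
    return 0;
}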
399
133
Summarising RISC
One instruction per cycle.
Register to register operations.
Few, simple addressing modes.
Few, simple instruction formats.
Hardwired design (no microcode).
Fixed instruction format.
More compile time/effort.
400
RISC v CISC
Not clear cut.
Many designs borrow from both philosophies.
e.g. PowerPC and Pentium II.
401
Why CISC?
Compiler simplification?
• disputed.
• complex machine instructions harder to exploit.
• optimisation more difficult.
Smaller programs?
• program takes up less memory but memory is now cheap.
• may not occupy fewer bits, just look shorter in symbolic form:
  • more instructions require longer op-codes.
  • register references require fewer bits.
402
134
Why CISC?
Faster programs?
• bias towards use of simpler instructions.
• more complex control unit.
• microprogram control store larger, thus simple instructions
take longer to execute.
It is far from clear that CISC is the appropriate
solution.
403
Unanswered Questions
The work that has been done on assessing the merits
of the RISC approach can be grouped into two
categories:
• Quantitative: Attempts to compare program size and
execution speed of programs on RISC and CISC machines that
use comparable technology.
• Qualitative: Examination of issues such as high-level language
support and optimum use of VLSI real estate.
404
Problems
There is no pair of RISC and
CISC machines that are
comparable in life-cycle cost,
level of technology, gate
complexity, sophistication of
compiler, operating system
support, and so on.
No definitive test set of
programs exists. Performance
varies with the program.
It is difficult to sort out
hardware effects from effects
due to skill in compiler writing.
Most of the comparative analysis
on RISC has been done on “toy”
machines rather than commercial
products.
Furthermore, most commercially
available machines advertised as
RISC possess a mixture of RISC
and CISC characteristics.
Thus, a fair comparison with a
commercial, “pure-play” CISC
machine (e.g., VAX, Intel 80386)
is difficult.
405
135
Example: Pentium 4
80486 - CISC
Pentium – some superscalar components
• Two separate integer execution units
Pentium Pro – Full blown superscalar
Subsequent models refine & enhance superscalar
design
406
Pentium 4 Block Diagram
407
Pentium 4 Operation
Fetch instructions from memory in the order of the static
program
Translate instruction into one or more fixed length
RISC instructions (micro-operations)
Execute micro-ops on superscalar pipeline
• micro-ops may be executed out of order
Commit results of micro-ops to register set in original
program flow order
Outer CISC shell with inner RISC core
Inner RISC core pipeline at least 20 stages
• Some micro-ops require multiple execution stages
• Longer pipeline
• c.f. five stage pipeline on x86 up to Pentium
408
136
Pentium 4 Pipeline
409
Pentium 4 Pipeline Operation (1)
410
Pentium 4 Pipeline Operation (2)
411
137
Pentium 4 Pipeline Operation (3)
412
Pentium 4 Pipeline Operation (4)
413
Pentium 4 Pipeline Operation (5)
414
138
Pentium 4 Pipeline Operation (6)
415
Pentium 4 Hyperthreading
416
Example: PowerPC
Direct descendent of IBM 801, RT PC and RS/6000
All are RISC
RS/6000 first superscalar
PowerPC 601 superscalar design similar to RS/6000
Later versions extend superscalar concept
417
139
PowerPC 601 General View
418
PowerPC 601
Pipeline
Structure
419
PowerPC 601 Pipeline
420
140
Section 1a
[modified from Braunl, 2002]
C Basics
C Basics
Program structure
Variables
Assignments and expressions
Control structures
Functions
Arrays
There are many on-line tutorials: e.g. see
www.freeprogrammingresources.com/ctutor.html
For a basic tutorial start: www.eecs.wsu.edu/ctutorial.html
For a ‘best practices’ start: www106.ibm.com/developerworks/eserver/articles/hook_duttaC.html
422
Program Structure
Comments are enclosed in /* and */
“Include” required for libraries used
Each program starts execution with “main”
Statements follow enclosed by “{“ and “}”, separated by semicolons “;”
The return value of “main” is passed back to the command line; parameters of
“main” give access to command-line arguments
/* Demo program */
#include “demo.h”
int main()
{ …
return 0;
}
423
141
Variables
Variables contain data
Simple data types are:
• int     (integer)
• float   (floating point)
• char    (character)
Variables can be
• local   (declaration inside function or main)
• global  (declaration outside) ← Try to avoid as much as possible!
424
Variables
#include “demo.h”
int distance;                    ← global variable

int main()
{ char direction;                ← local variable
  distance  = 100;               ← assignment
  direction = ‘S’;               ← assignment
  printf(“Go %c for %d steps\n”, direction, distance);
                                 ↑ system function call for printing
  return 0;
}
425
Notes
C also allows variable declaration and initialization in one step:   int distance = 100;
Char constants are enclosed in apostrophes:                           ‘S’
String constants are enclosed in quotes:                              “Go %c for %d steps\n”
“printf” is a system function
This function takes a number of arguments enclosed in parentheses “(“
and “)”.
The first parameter is a formatting string, the following parameters are
data values (in our example variables), which will replace the placeholders
“%c” (char) and “%d” (decimal) of the string in the order they are listed.
Special symbols start with a backslash like “\n” (newline).
The standard C library requires “\n” (newline) to actually print any text.
The program with print statement:
printf(“Go %c for %d steps\n”, direction, distance);
will print on screen:
Go S for 100 steps
426
142
Assignments and Expressions
Variables are assigned values with operator “=“
distance = 2 * (distance - 5) + 75;
Do not confuse with comparison operator “==“
if (distance == 100) { … }
else { … }
Expressions are evaluated left-to-right
distance = 4 / 2 * 3           → 6
However:
• Multiplication is executed before addition
  distance = 4 + 2 * 3         → 10
• Parentheses may be used to group sub-expressions
  distance = (4 + 2) * 3       → 18
427
Assignments and Expressions
Abbreviations
distance = distance + 1;
is identical to
distance++;
Also:
distance--;
Note: There are plenty more abbreviations in C/C++.
Many of these are confusing, so their use is
discouraged.
428
Assignments and Expressions
Data types can be converted by using “type casts”,
i.e. placing the desired type name in parentheses before the expression
distance = (int) direction;
For int ↔ char conversions, the ASCII char code is used:
(int) ‘A’    → 65
(int) ‘B’    → 66
…
(char) 70    → ‘F’
429
143
Control Structures
Selection (if-then-else)
if (distance == 1)               ← parentheses required after “if”
{ direction = ‘S’;               ← two statements for “then”
  distance = distance - 10;        require brackets { }
}
else direction = ‘N’;            ← single statement for “else”
                                   requires no brackets;
                                   the “else-part” is optional
430
Control Structures
Selection (if-then-else)
Comparisons can be:
<, >, <=, >=, ==, !=
Logic expressions a and b can be combined as:
a && b
→ a and b
a || b
→ a or b
!a
→ not a
431
Control Structures
Multiple Selection (switch-case)
switch (distance)                    ← parentheses required
{ case 1: direction = ‘S’;           ← no parentheses for
          distance = distance - 10;    multiple statements
          break;                     ← “break” signals end of case
  case 7: direction = ‘N’;
          break;
  …
  default: direction = ‘W’;          ← optional default case
}
432
144
Control Structures
Iteration (for)
int i;
…
for (i=0; i<4; i++)              ← for (init; terminate; increment)
{ printf(“%d\n”, i);             ← loop contains single statement
}                                  brackets { } optional
Output:
0
1
2
3
433
Control Structures
Iteration (while)
int i;
…
i=0;                             ← explicit initialization
while (i<4)                      ← while (termination-condition)
{ printf(“%d\n”, i);               is true, repeat loop execution
  i++;                           ← explicit increment
}
Output:
0
1
2
3
434
Control Structures
Iteration (do-while)
int i;
…
i=0;
do
{ printf(“%d\n”, i);
  i++;
} while (i<4);                   ← test of termination condition at the end;
                                   otherwise identical to while-loop
Output:
0
1
2
3
435
145
Functions
Functions are sub-programs
• take a number of parameters (optional)
• return a result (optional)
“main” has the same structure as a function
int main (void)                  ← returns int, no parameters
{ …
}
436
Functions
Parameter values are copied to function
Input parameters are simple
Function declaration:
int sum (int a, int b)
{ return a+b;
}
Function call:
int x,y,z;
x = sum(y, z);                   ← can use constants or variables
x = sum(y, 10);                    y and z remain unchanged
437
Functions
Input/output parameters require pointer
Function declaration:
void increment (int *a)          ← no return value (void)
{ *a = *a + 1;                   ← use “*” as reference to access parameter
}
Function call:
int x;
increment(&x);                   ← use “&” as address for variable
                                   “x” will get changed
increment(&10);                  ← error !
438
146
Arrays
Data structure with multiple elements of the same type
Declaration:
int field[100];                  ← elements field[0] .. field[99]
char text[20];                   ← elements text[0] .. text[19]
Use:
field[0] = 7;
text[3] = ‘T’;
for (i=0; i<100; i++)
field[i] = 2*i-1;
Note: If arrays are used as parameters,
their address is used implicitly - not their contents
439
Arrays
Strings are actually character arrays
Each string must be terminated by a “Null character”: (char) 0
C provides a library with a number of string manipulation functions
Declaration:
char string[100];                ← 99 chars + Null character
Use:
string[0] = ‘H’;
string[1] = ‘i’;
string[2] = (char) 0;
printf(“%s\n”, string);
Output: Hi↵
440
Suggested basic programming practice:
1. Fibonacci numbers
2. String conversion
441
147
1. Fibonacci Numbers
Definition:
Fibonacci numbers
f(0) = 0
f(1) = 1
f(i) = f(i-1) + f(i-2)
→ 0, 1, 1, 2, 3, 5, 8, ...
Algorithm:
Loop
1. Add:  f(i) = f(i-1) + f(i-2)
2. Move: f(i-2) = f(i-1); f(i-1) = f(i)
Project:
Compute all Fibonacci numbers ≤ 100
[Diagram: variable mapping, with current holding f(i), last holding f(i-1)
 and last2 holding f(i-2); step 1 (ADD) computes current, step 2 (MOVE)
 shifts last into last2 and current into last.]
442
Fibonacci Numbers
int main()
{ int current, last, last2;
  last=1; last2=0;
  printf(“0\n1\n”); /* print first 2 num.*/
  do { current = last+last2;
       printf(“%d\n”, current);
       last2 = last; last = current;
  } while (current <= 100);
  return 0;
}

Output:
0
1
1
2
3
5
8
…
443
2. String Conversion
Write a sub-routine that converts a given string
to uppercase
Sample main:
#include “eyebot.h”
int main(void)
{ char str[] = “Test String No. 1”;   /* modifiable copy: a string literal must not be changed in place */
  uppercase(str);
  printf(“%s\n”, str);
  return 0;
}
Desired output:
TEST STRING NO. 1↵
444
148
Project: String Conversion
void uppercase (char s[])           ← array reference, identical to “*s”
{ int i;
  i=0;
  while (s[i])                      ← while (s[i] != (char) 0)
  { if (‘a’<= s[i] && s[i] <= ‘z’)  ← check if s[i] is lower case
      s[i] = s[i] - ‘a’ + ‘A’;      ← convert a → A, b → B, …
    i++;                            ← go to next character
  }
}
445
149