Computational Methods in Astrophysics
ASTR 5210
Dr Rob Thacker (AT319E)
[email protected]
Today's Lecture
 More computer architecture
 Flynn's taxonomy
 Improving CPU performance (instructions per clock)
 Instruction Set Architecture classifications
 Future of CPU design

Machine architecture classifications
 Flynn's taxonomy (see IEEE Trans. Comp., Vol. C-21, pp. 948-960, 1972)
 A way of describing the information flow in computers: an architectural definition
 Information is divided into instructions (I) and data (D)
 There can be single (S) or multiple (M) instances of both
 Four combinations: SISD, SIMD, MISD, MIMD
SISD
 Single Instruction, Single Data
 An absolutely serial execution model
 Typically viewed as describing a serial computer, but today's CPUs exploit parallelism
[Diagram: a single processor (P) connected to a single memory (M) - one processor, one data element]
SIMD
 Single Instruction, Multiple Data
 One instruction is applied to multiple data streams at the same time
 A single instruction processor (K) broadcasts each instruction to an array of processing elements (PEs); each PE typically has its own data memory
[Diagram: instruction processor K broadcasting to an array of processors (P), each with its own data memory (Ma)]
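To make this concrete, here is a minimal C sketch (my own illustration, not from the lecture) using the x86 SSE intrinsics: a single _mm_add_ps instruction adds four pairs of floats at once, which is exactly the "one instruction, multiple data" pattern.

#include <immintrin.h>  /* SSE intrinsics */
#include <stdio.h>

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction adds all 4 lanes: SIMD */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%f\n", c[i]);
    return 0;
}

In practice a compiler will often generate such instructions automatically ("auto-vectorization") from an ordinary loop.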
MISD
 Multiple Instruction, Single Data
 Largely a useless definition (not important)
 The closest relevant example would be a CPU that can `pipeline' instructions
 Each processor has its own instruction stream but operates on the same data stream
 Example: a systolic array - a network of small elements connected in a regular grid, operating under a global clock, reading and writing elements from/to their neighbours
[Diagram: several processors (P), each with its own instruction memory (Mi), all operating on one data memory (Ma)]
MIMD
 Multiple Instruction, Multiple Data
 Covers a host of modern architectures
 Processors have independent data and instruction streams
 Processors may communicate directly or via shared memory
[Diagram: multiple processors (P), each with its own memory (M), connected by an interconnect or shared memory]
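As a small illustration (my own sketch, not from the slides), the POSIX threads program below runs two threads that execute different functions on different data - independent instruction and data streams, i.e. MIMD in miniature.

#include <pthread.h>
#include <stdio.h>

/* Two different instruction streams operating on two different data sets */
void *sum_ints(void *arg)
{
    int *v = (int *)arg;
    long s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("integer sum = %ld\n", s);
    return NULL;
}

void *scale_floats(void *arg)
{
    float *v = (float *)arg;
    for (int i = 0; i < 4; i++) v[i] *= 2.0f;
    printf("floats scaled\n");
    return NULL;
}

int main(void)
{
    int a[4] = {1, 2, 3, 4};
    float b[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    pthread_t t1, t2;

    pthread_create(&t1, NULL, sum_ints, a);      /* stream 1 */
    pthread_create(&t2, NULL, scale_floats, b);  /* stream 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}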
Instruction Set Architecture
 ISA – the interface between hardware and software
 ISAs are typically common to a CPU family, e.g. x86, MIPS (members of a family are more alike than different)
 Assembly language is a realization of the ISA in a form that is easy to remember (and program)
Key concept in ISA evolution and CPU design
 There are efficiency gains to be had by executing as many operations per clock cycle as possible
 Instruction level parallelism (ILP): exploit parallelism within the instruction stream
 The programmer does not see this parallelism explicitly
 Goal of modern CPU design – maximize the number of instructions per clock cycle (IPC), or equivalently reduce the cycles per instruction (CPI)
ILP versus thread level parallelism
 Many modern programs have more than one (parallel) "thread" of execution
 Instruction level parallelism breaks down a single thread of execution to try and find parallelism at the instruction level
[Diagram: instructions 1, 2 and 3 from a single thread are executed in parallel even though there is only one thread]
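A short C sketch of the idea (my own example): the first three statements below have no data dependences on each other, so a superscalar/out-of-order CPU may issue them in the same cycle even though the program is a single thread, while the final statements form a chain that must execute in order.

#include <stdio.h>

int main(void)
{
    int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;

    /* Independent operations: no result feeds another, so the hardware
       can execute all three at once (instruction level parallelism). */
    int x = a + b;
    int y = c * d;
    int z = e - f;

    /* Dependent chain: each line needs the previous result, so these
       must execute one after another. */
    int w = x + y;
    w = w + z;

    printf("%d\n", w);
    return 0;
}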
ILP techniques
 The two main ILP techniques are:
  Pipelining – including additional techniques such as out-of-order execution
  Superscalar execution
Pipelining
 Multiple instructions are overlapped in execution
 A throughput optimization: it doesn't reduce the time for individual instructions
[Diagram: instructions 1, 2, 3, ... moving through pipeline stages 1-7, each instruction one stage behind the previous one]
Design sweetspot
 The pipeline stepping time is determined by the slowest operation in the pipeline
 Best speed-up is obtained if all operations take the same amount of time
 Net time per instruction (in steady state) = one stepping time, i.e. the unpipelined instruction time divided by the number of pipeline stages in the ideal case
 Perfect speed-up factor = number of pipeline stages
 Never achieved in practice: there are start-up overheads to consider
Pipeline compromises
 Example: a 7-stage pipeline whose stages would naturally take 10, 10, 5, 10, 5, 10, 5 ns – a total of 55 ns of work per instruction
 Because every stage must step at the pace of the slowest stage (10 ns), each instruction actually spends 7 x 10 ns = 70 ns in the pipeline: the 5 ns stages take longer than necessary
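A small sketch of my own, using the stage times from the example above, that reproduces the arithmetic: the pipeline step is set by the slowest stage, so the effective latency is step x stages even though the raw work per instruction is smaller.

#include <stdio.h>

int main(void)
{
    /* Stage times (ns) from the example above */
    int stage_ns[7] = {10, 10, 5, 10, 5, 10, 5};
    int n = 7, raw = 0, slowest = 0;

    for (int i = 0; i < n; i++) {
        raw += stage_ns[i];
        if (stage_ns[i] > slowest) slowest = stage_ns[i];
    }

    printf("raw work per instruction      = %d ns\n", raw);         /* 55 ns */
    printf("pipeline step (slowest stage) = %d ns\n", slowest);     /* 10 ns */
    printf("pipelined latency             = %d ns\n", slowest * n); /* 70 ns */
    printf("steady-state throughput: 1 instruction per %d ns\n", slowest);
    return 0;
}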
Superscalar execution
 Be careful about definitions: superscalar execution is not simply about having multiple instructions in flight
 Superscalar processors have more than one of a given functional unit (such as the arithmetic logic unit (ALU) or load/store unit)
Benefits of superscalar design
 Having more than one functional unit of a given type can help schedule more instructions within the pipeline
 The Pentium IV pipeline was 20 stages deep! Enormous throughput potential, but a big pipeline-stall penalty
 Incorporation of multiple units into the pipeline is sometimes called superpipelining
Other ways of increasing ILP
 Branch prediction: predict which path will be taken by assigning certain probabilities
 Out-of-order execution: independent operations can be rescheduled within the instruction stream
 Pipelined functional units: floating point units can be pipelined to increase throughput (see the sketch below)
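A hedged C example (mine, not the lecturer's) of keeping a pipelined floating-point adder busy: the first loop is a single dependence chain, so each add must wait for the previous one; the second uses four independent partial sums, so several adds can be in flight in the pipeline at once.

#include <stdio.h>

#define N 1024

int main(void)
{
    double x[N];
    for (int i = 0; i < N; i++) x[i] = 1.0;

    /* Dependent chain: each add needs the previous sum, so the FP
       pipeline is mostly empty. */
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i];

    /* Four independent accumulators: up to four adds can be in the
       pipelined FP unit at the same time. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < N; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s_unrolled = (s0 + s1) + (s2 + s3);

    printf("%f %f\n", s, s_unrolled);
    return 0;
}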
Limits of ILP
 See D. Wall, "Limits of Instruction-Level Parallelism", 1991
 The probability of hitting hazards (instructions that cannot be pipelined) increases as the pipeline gets longer
 Instruction fetch and decode rate
  Remember the "von Neumann" bottleneck? It would be nice to have a single instruction for multiple operations…
 Branch prediction
  Multiple condition statements increase the number of branches severely (see the sketch below)
 Cache locality and memory limitations
  There are finite limits to the effectiveness of prefetch
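To make the branch cost concrete, here is a small sketch of my own: a data-dependent branch inside a hot loop, versus a branch-free rewrite that the compiler can turn into a conditional move (or SIMD code), sidestepping misprediction entirely.

#include <stdio.h>
#include <stdlib.h>

#define N 100000

int main(void)
{
    int a[N];
    for (int i = 0; i < N; i++) a[i] = rand() % 256;

    /* Branchy version: whether the branch is taken depends on random
       data, so hardware branch prediction does poorly. */
    long sum_branchy = 0;
    for (int i = 0; i < N; i++) {
        if (a[i] >= 128)
            sum_branchy += a[i];
    }

    /* Branch-free version: the condition becomes an arithmetic mask,
       so there is nothing to mispredict. */
    long sum_branchless = 0;
    for (int i = 0; i < N; i++) {
        long keep = -(long)(a[i] >= 128);   /* all ones if true, 0 if false */
        sum_branchless += a[i] & keep;
    }

    printf("%ld %ld\n", sum_branchy, sum_branchless);
    return 0;
}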
Scalar Processor Architectures
 'Scalar' processors exploit parallelism in two ways:
  Pipelined – functional unit parallelism, e.g. the load/store and arithmetic units can be used in parallel (instructions in parallel)
  Superscalar – multiple functional units, e.g. 4 floating point units can operate at the same time
 Modern processors exploit this parallelism and can't really be called SISD
Complex Instruction Set Computing
 CISC – the older design idea (the x86 instruction set is CISC)
 Many (powerful) instructions supported within the ISA
 Upside: makes assembly programming much easier (there was a lot of assembly programming in the 1960s-70s)
 Upside: reduced instruction memory usage
 Downside: designing the CPU is much harder
Reduced Instruction Set Computing
 RISC – a newer concept than CISC (but still old)
 MIPS, PowerPC and SPARC are all RISC designs
 Small instruction set; a CISC-type operation becomes a chain of RISC operations (see the sketch below)
 Upside: easier to design the CPU
 Upside: smaller instruction set => higher clock speed
 Downside: assembly language is typically longer (though that is a compiler design issue)
 Most modern x86 processors are implemented internally using RISC techniques
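As a rough illustration (my own sketch, with schematic assembly only in the comments): on a CISC ISA such as x86 the statement below can compile to a single instruction that operates directly on memory, while on a load/store RISC ISA such as MIPS it becomes a chain of load, add and store operations.

/* One C statement, two very different instruction sequences */
int counter = 0;

void bump(void)
{
    counter += 1;
    /* Schematic x86 (CISC): one instruction operating on memory
           add dword ptr [counter], 1
       Schematic MIPS (RISC): a chain of simpler operations
           lw    $t0, counter
           addiu $t0, $t0, 1
           sw    $t0, counter
       (address formation omitted for clarity)                    */
}

int main(void) { bump(); return 0; }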
Birth of RISC
 Its roots can be traced to three research projects:
  IBM 801 (late 1970s, J. Cocke)
  Berkeley RISC processor (~1980, D. Patterson)
  Stanford MIPS processor (~1981, J. Hennessy)
 The Stanford and Berkeley projects were driven by interest in building a simple chip that could be made in a university environment
 Commercialization benefitted from the three independent projects:
  Berkeley project -> begat Sun Microsystems
  Stanford project -> begat MIPS (used by SGI)
Modern RISC processors
 Complexity has nonetheless increased significantly
 Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control the scheduling of operations
 What if we could remove the scheduling complexity by using a smart compiler…?
VLIW & EPIC
 VLIW – very long instruction word
 Idea: pack a number of non-interdependent operations into one long instruction
 Strong emphasis on compilers to schedule instructions
 When executed, the long words are easily broken up and the operations dispatched to independent execution units
[Diagram: 3 instructions scheduled into one long instruction word]
VLIW & EPIC II
 A natural successor to RISC – designed to avoid the need for complex scheduling in RISC designs
 VLIW processors should be faster and less expensive than RISC
 EPIC – explicitly parallel instruction computing, Intel's implementation (roughly) of VLIW
 The ISA is called IA-64
VLIW & EPIC III
 Hey – it's 2015, why aren't we all using Intel Itanium processors?
 AMD figured out an easy extension to make x86 support 64 bits, and introduced multicore
 Backwards compatibility + "good enough performance" + poor Itanium compiler performance killed IA-64
RISC vs CISC recap
RISC (popular by the mid-80s) – operations on registers
 Pro: small instruction set makes design easy
 Pro: decreased CPI, and a faster CPU through easier design (reduced cycle time tc)
 Con: complicated instructions must be built from simpler ones
 Con: efficient compiler technology is absolutely essential
CISC (pre-1970s) – operations directly on memory
 Pro: many powerful instructions, easy to write assembly language*
 Pro: reduced memory requirement for instructions, reduced number of total instructions (Ni)*
 Con: ISA often large and wasteful (20-25% usage)
 Con: ISA hard to debug during development
*Driven by 1970s issues of memory size (small) and speed (faster than the CPU)
Who "won"? – Not VLIW!
 Modern x86 processors are RISC-CISC hybrids
  The ISA is translated at the hardware level into shorter, RISC-like internal instructions
  They are very complicated designs though, with lots of scheduling hardware
 MIPS, Sun SPARC and DEC Alpha were much truer implementations of the RISC ideal
 A modern metric for the "RISCiness" of a design: is it a load/store architecture, i.e. is memory accessed only through explicit LOAD and STORE instructions?
From Patterson’s lectures (UC Berkeley CS252)
Evolution of Instruction Sets
 Single Accumulator (EDSAC, 1950)
 Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
 Separation of programming model from implementation:
  High-level Language Based (B5000, 1963)
  Concept of a Family (IBM 360, 1964)
 General Purpose Register Machines:
  Complex Instruction Sets (VAX, Intel 432, 1977-80)
  Load/Store Architecture (CDC 6600, Cray-1, 1963-76)
 RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, ..., 1987)
 LIW/"EPIC"? (IA-64, ..., 1999)
Simultaneous multithreading
 A completely different technology to ILP
 NOT multi-core
 Designed to overcome the lack of fine-grained parallelism in code
 The idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales
 Requires the programmer to have created a parallel program for this to work, though
 One physical processor looks like two logical processors
Motivation for SMT
 A strong motivation for SMT is memory latency, which makes load operations take longer and longer
 We need some way to hide this bottleneck (the memory wall again!)
 SMT: switch execution over to threads that already have their data, and execute those
 The Tera MTA (Tera later became Cray Inc.) was an attempt to design a computer entirely around this concept
SMT Example: IBM POWER5
 Dual core, and each core can support 2 SMT threads
 "MCM" package: 4 dual-core processors and 144 MB of cache
 SMT gives a ~40-60% improvement in performance – not bad
 By comparison, Intel Hyperthreading gives a ~10% improvement
Multiple cores
 Simply add more CPUs – the easiest way to increase throughput now (see the sketch below)
 Why do this? It is a response to the problem of increasing power consumption in modern CPUs
 We've essentially reached the limit on improving individual core speeds
 The design involves a compromise: n CPUs must now share the memory bus, so there is less bandwidth for each
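A minimal sketch of using multiple cores from C (my own example, using OpenMP, which the slides do not mention): the loop iterations are divided among the available cores, each core running its own instruction stream on its own slice of the data.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    double sum = 0.0;

    /* Iterations are split across the cores; the reduction clause
       combines the per-core partial sums at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i] * y[i];

    printf("dot product = %g (up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}

Compile with -fopenmp (gcc/clang); without it the pragma is ignored and the code still runs serially.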
Intel & AMD multi-core processors
 Intel: 18-core processors, codename "Haswell"
  Design envelope of 150 W; dividing by the number of cores, each core is very power efficient
 AMD: 16-core processors, codename "Warsaw"
  115 W design envelope
  The individual cores are not as good as Intel's, though
Summary
 Flynn's taxonomy categorizes the instruction and data flow in computers; modern processors are MIMD
 Pipelining and superscalar design improve CPU performance by increasing the number of instructions per clock
 CISC/RISC design approaches appear to be reaching the limits of their applicability
 VLIW didn't make an impact – will it return?
 In the absence of improved single-core performance, designers are simply integrating more cores