Download Next Generation Instruction Set Architecture

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Next Generation Instruction Set
Architecture
John Crawford, Intel Fellow
Jerry Huck
Director, Microprocessor Architecture
Intel Corporation
Manager and Lead Architect
Hewlett-Packard Company
®
1
Objectives
s Unveil the technology behind the next
generation ISA
l Today’s focus on architecture, not implementation
s Context
l History
l Motivation
s ISA
Preview
l A few key features
l Benefits
®
2
Intel and HP Technology Alliance
s
Intel
l Microprocessor / platform technology
l 64-bit architecture definition
s
HP
l E nterprise systems technology expertise
l Architecture research advancements
s
Jointly defined next generation 64-bit
instruction set
l Instruction set specification
l Compiler optimization
l Performance simulation and projection
®
3
Instruction Set Architecture (ISA)
Objectives
s
E nable industry leading system performance
l Breakthrough performance
l Headroom
s
E nable compatibility with today’s IA-32
software & PA-RISC software
s
Allow scalability over a wide range of
implementations
s
Full 64-bit computing
®
4
Current State of The Art
Arch.
Research
???
Performance
OOO SuperScalar
CISC
RISC
ŸH/W detects implicit parallelism
ŸH/W O-O-O scheduling & speculation
ŸH/W renames 8-32 registers to 64+
ŸSimple, fixed length instructions
ŸSequencing done by compiler
ŸComplex, variable length instructions
ŸSequencing done in hardware
Time
What’
ssnext,
What’
next,beyond
beyondtraditional
traditionalarchitectures?
architectures?
®
5
Current Performance Limiters:
Branches
Mispredicts limit performance
Small blocks restrict code scheduling freedom
s
s
Fragmentation
l Poor utilization of wide machines
l
ld R1
IF
use R1
unused
execution slots
st R2
THEN
E LSE
ld R4
use R4
®
6
Current Performance Limiters:
Latency to Memory
Memory latency increasing relative to processor
speed
Load delay compounded by machine width
s
s
l
Latency hiding requires more parallelism in a wide machine
branch
branch is a
barrier
branch
load
load
use
use
Scalar
®
4-way Superscalar
7
Current Performance Limiters:
Extracting Parallelism
Sequential execution model
s
original source
code
compiler
sequential machine
hardware
code
parallelized
code
parallelized
parallelized
code
code
multiple
functional units
. . .
Compiler has limited, indirect view of hardware
s
Implicit
Implicitparallelism
parallelismlimits
limitsperformance
performance
®
8
Better Strategy: Explicit Parallelism
Compiler exposes, enhances, and exploits parallelism in the
source program and makes it explicit in the machine code.
original source
code
compiler
parallel machine
code
“Expose”
“E xploit”
“E nhance”
®
9
Next Generation Architecture Technology
Arch.
Research
EPIC
Performance
OOO SuperScalar
CISC
RISC
Time
®
ŸH/W detects implicit parallelism
ŸH/W O-O-O scheduling & speculation
ŸH/W renames 8-32 registers to 64+
ŸSimple, fixed length instructions
ŸSequencing done by compiler
ŸComplex, variable length instructions.
ŸSequencing done in hardware
10
E xplicitly
P arallel
Instruction
Computing
Next Generation Terminology
s
EPIC is the next generation technology
l e.g.,
s
IA-64 is the architecture that incorporates
EPIC Technology
l e.g.,
s
RISC, CISC
IA-32, PA-RISC
Merced™ processor is the first IA-64-based
implementation
l e.g., Pentium®
®
II processor, PA-8500
11
Key 64-bit ISA Features within IA-64
s
Architecture Resources
s
Instruction Format
s
Predication
s
Speculation
s
(Branch Architecture)
s
(Floating-Point Architecture)
s
(Multimedia Architecture)
s
(Memory Management & Protection)
s
(Compatibility)
®
12
Architecture Resources Provide for
Parallel E xecution & Scalability
M
E
M
O
R
Y
128
128
GRs
GRs
Execution
Units
128
128
FRs
FRs
s Massively resourced - large register files
l Traditional architectures are forced to rename registers
s Inherently scalable - replicated function units
s Explictly parallel - transistors used more effectively
®
13
Instruction Format: Explicit Parallelism
s
Breaking the sequential execution paradigm
l Explicit instruction dependency: template
l Flexibly groups any number of independent instructions
s
Explicitly scheduled parallelism
l E nables compiler to create greater parallelism
l Simplifies hardware by removing dynamic mechanisms
l Fully interlocked- hardware provides compatibility
128-bit bundle
127
0
Instruction 2
s
Instruction 1
Instruction 0
Template
Modest code size expansion
The
Thenew
newinstruction
instructionformat
formatenables
enablesscalability
scalabilityw/
w/compatibility
compatibility
®
14
Branches Limit Performance
Traditional Architectures: 4 basic blocks
instr 1
instr 2
if
then
:
p1, p2 <-cmp(a==b)
jump p2
instr 3
instr 4
:
jump
instr 5
else
instr 6
:
instr 7
instr 8
:
Control
Controlflow
flowintroduces
introducesbranches
branches
®
15
Predication
if
then
instr 1
instr 2
:
p1, p2 <-cmp(a==b)
jump p2
(p1) instr 3
(p1) instr 4
:
jump
(p2) instr 5
else
(p2) instr 6
:
instr 7
instr 8
:
The
Thepredicate
predicatecan
canremove
removebranches
branches
®
16
Predication E nhances Parallelism
Traditional Architectures : 4 basic blocks
EPIC Architectures:
Architectures: 1 basic block
instr 2
if
if
:
p1, p2 <-cmp
<-cmp(a==b)
(a==b)
jump p2
instr 3
then
instr 4
then
instr 1
instr 2
:
p1, p2 <- cmp(a==b)
cmp(a==b)
(p1) instr 3
(p1) instr 4
:
:
jump
else
(p2) instr 5
(p2) instr 6
:
instr 5
else
instr 6
instr 7
instr 8
:
:
instr 7
instr 8
:
new Basic Block
Predication
Predicationenables
enablesmore
moreeffective
effectiveuse
useof
ofparallel
parallelhardware
hardware
®
17
Predication: Features and Benefits
s
Compiler given larger scheduling scope
l Nearly
all instructions can be predicated
l State updated if an instruction’s predicate is true, otherwise
acts as a NOP
l Compiler assigns predicates, compare instructions set them
l Architecture provides 64 1-bit predicate registers (PR)
s
Predicated execution removes branches
l Convert
a control dependence to a data dependence
l Reduce mispredict penalties
s
Parallel execution through larger basic
blocks
l Effective
®
use of parallel hardware
18
Predication Increases Performance
100%
Branches
Removed
90%
80%
Mispredicts
Removed
70%
60%
50%
40%
30%
20%
AVERAGE
yacc
wc
qsort
lex
grep
eqn
cmp
cccp
sc
ear
alvinn
compress
eqntott
li
0%
espresso
10%
On
Onaverage,
average,over
overhalf
halfof
ofall
allbranches
branchesare
areremoved
removed
Source: ISCA ‘95 S.Mahlke
S.Mahlke,, et.al.
®
19
Memory Latency Causes Delays
s
Loads significantly affect performance
l Often first instruction in dependency chain of instructions
l Can incur high latencies
Traditional Architectures
instr 1
instr 2
...
jump_equ
jump_equ
Barrier
Load
use
s
®
Loads can cause exceptions
20
Speculation
EPIC Architectures
;Exception Detection
ld.s
ld.s
instr 1
instr 2
jump_equ
jump_equ
Propagate
Exception
;Exception Delivery
chk.s
chk.s
use
Home Block
s
Separate load behavior from exception behavior
l Speculative load instruction (ld
.s)) initiates a load operation
(ld.s
and detects exceptions
l Propagate an exception “token” (stored with destination
register) from ld.s
ld.s to chk.s
chk.s
l Speculative check instruction (chk
.s)) delivers any
(chk.s
exceptions detected by ld.s
ld.s
®
21
Speculation Minimizes the E ffect of
Memory Latency
Traditional Architectures
instr 1
instr 2
...
jump_equ
jump_equ
EPIC Architectures
ld.s
ld.s
instr 1
instr 2
jump_equ
jump_equ
Barrier
chk.s
chk.s
use
Load
use
s
;Exception Detection
Propagate
Exception
;Exception Delivery
Home Block
Give scheduling freedom to the compiler
l Allows ld.s
ld.s to be scheduled above branches
l chk.s
chk.s remains in home block, branches to
exception is propagated
®
22
fixup code if an
E xample: 8 Queens Loop
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Original Code
2
4
5
R1=&b[j]
R3=&a[i+j]
R5=&c[i-j+7]
ld R2=[R1]
P1,P2 <-cmp
(R2==true)
<-cmp(R2==true)
<P2> br exit
True
38%
Mispred
43%
6
8
9
ld R4=[R3]
P3,P4 <-cmp
(R4==true)
<-cmp(R4==true)
<P4> br exit
72%
33%
10
12
ld R6=[R5]
P5,P6 <-cmp
(R5==true)
<-cmp(R5==true)
<P5> br then
else
47%
39%
1
13
13 cycles
3 potential mispredicts
®
23
E xample: 8 Queens Loop
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Original Code
2
4
5
R1=&b[j]
R3=&a[i+j]
R5=&c[i-j+7]
ld R2=[R1]
P1,P2 <-cmp
(R2==true)
<-cmp(R2==true)
<P2> br exit
6
8
9
ld R4=[R3]
P3,P4 <-cmp
(R4==true)
<-cmp(R4==true)
<P4> br exit
1
Speculation
1
2
4
5
6
7
10
12
13
ld R6=[R5]
P5,P6 <-cmp
(R5==true)
<-cmp(R5==true)
<P5> br then
else
13 cycles
3 potential mispredicts
®
8
9
R1=&b[j]
R3=&a[i+j]
R5=&c[i-j+7]
ld R2=[R1]
ld.s
ld.s R4=[R3]
ld.s
ld.s R6=[R5]
P1,P2 <-cmp
(R2==true)
<-cmp(R2==true)
<P2> br exit
chk.s
chk.s R4
P3,P4 <-cmp
(R4==true)
<-cmp(R4==true)
<P4> br exit
chk.s
chk.s R6
P5,P6 <-cmp
(R5==true)
<-cmp(R5==true)
<P5> br then
else
9 cycles
3 potential mispredicts
24
E xample: 8 Queens Loop
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Speculation
1
2
4
5
6
7
R1=&b[j]
R3=&a[i+j]
R5=&c[i-j+7]
ld R2=[R1]
ld.s
ld.s R4=[R3]
ld.s
ld.s R6=[R5]
P1,P2 <-cmp
(R2==true)
<-cmp(R2==true)
<P2> br exit
chk.s
chk.s R4
P3,P4 <-cmp
(R4==true)
<-cmp(R4==true)
<P4> br exit
Predication
1
2
4
5
6
8
9
chk.s
chk.s R6
P5,P6 <-cmp
(R5==true)
<-cmp(R5==true)
<P5> br then
else
7
9 cycles
3 potential mispredicts
®
R1=&b[j]
R3=&a[i+j]
R5=&c[i-j+7]
ld R2=[R1]
ld.s
ld.s R4=[R3]
ld.s
ld.s R6=[R5]
P1,P2 <-cmp
(R2==true)
<-cmp(R2==true)
<P2> br exit
<p1> chk.s
chk.s R4
<p1> P3,P4 <-cmp
(R4==true)
<-cmp(R4==true)
<P4> br exit
<p3> chk.s
chk.s R6
<p3> P5,P6 <-cmp
(R5==true) True Mispred
<-cmp(R5==true)
<P5> br then
12% 16%
else
7 cycles
1 potential mispredict
25
E xample: 8 Queens Loop
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Original Code
Predication
R1=&b[j]
R1=&b[j]
1
R3=&a[i+j]
R3=&a[i+j]
1
R5=&c[i-j+7]
R5=&c[i-j+7]
ld R2=[R1]
ld R2=[R1]
2
ld.s
ld.s R4=[R3]
2
4
P1,P2 <-cmp
(R2==true)
<-cmp(R2==true)
ld.s
ld.s R6=[R5]
RESULT:
Almost
half
cycles
are
RESULT:
Almost
halfthe
therequired
required
cycles
arereduced
reduced
<P2> br
exit
5
P1,P2
<-cmp
(R2==true)
4
<-cmp(R2==true)
<P2> br exit are eliminated.
and
and2/3
2/3of
ofthe
thepotential
potential mispredicts
mispredicts
are eliminated.
<p1> chk.s
chk.s R4
6
ld R4=[R3]
5
<p1> P3,P4 <-cmp
(R4==true)
<-cmp(R4==true)
8
P3,P4 <-cmp
(R4==true)
<-cmp(R4==true)
<P4> br exit
9
<P4> br exit
<p3> chk.s
chk.s R6
6
<p3> P5,P6 <-cmp
(R5==true)
<-cmp(R5==true)
<P5>
br
then
ld
R6=[R5]
10
7
else
P5,P6 <-cmp
(R5==true)
<-cmp(R5==true)
12
<P5> br then
13
else
13 cycles
3 potential mispredicts
®
7 cycles
1 potential mispredict
26
EPIC is the Next Generation Technology
Explicitly Parallel Instruction Computing
s Features that enhance ILP
s Explicit parallelism
l Predication
l ILP is explicit in machine code
l Compiler schedules across a wide scope l Speculation
l Others...
l Binary compatibility across all family
members
EPIC
s Resources
for parallel execution
l Many registers
l Many functional units
Performance
ŸNo Binary Compatibility
ŸNo Performance Scaling
ŸCode Size Explosion
CISC
l Inherently scalable
OOO SuperScalar
ŸH/W detects Independent Instructions
ŸH/W O-O-O Scheduling & Speculation
ŸH/W Renames 8-32 Registers to 64+
RISC
ŸSimple, fixed length instructions
ŸSequencing done by Compiler
ŸComplex, variable length instructions.
ŸSequencing done in hardware
Time
®
27
IA-64: EPIC Technology Applied
s E nables industry leading performance and capability
l Explicitly
parallel: Beyond the limitations of current architectures
l Inherently scalable, massively resourced:
resourced: Provides headroom for future
market requirements
l Fully compatible: For existing applications and the future
s Addresses server and workstation market
requirements
l Enterprise
transaction processing
l Decision support
l Graphical imaging
l Volume rendering
l Many others
The
TheNext
NextGeneration
Generationin
inComputer
ComputerArchitecture
Architecture
®
28
Related documents