3.13. Fallacies and Pitfalls
• Fallacy: Processors with lower CPIs will
always be faster
• Fallacy: Processors with faster clock rates
will always be faster
– Balance must be found:
• E.g. sophisticated pipeline: CPI ↓ clock cycle ↑
Fallacies and Pitfalls
• Pitfall: Emphasizing an improvement in CPI by
increasing issue rate while sacrificing
clock rate can lead to lower performance
– Again, question of balance
• SuperSPARC –vs– HP PA 7100
– Complex interactions between cycle time and
organisation
Fallacies and Pitfalls
• Pitfall: Improving only one aspect of a
multiple-issue processor and expecting
overall performance improvement
– Amdahl’s Law!
– Boosting performance of one area may uncover
problems in another
Fallacies and Pitfalls
• Pitfall: Sometimes bigger and dumber is
better!
– Alpha 21264: sophisticated multilevel
tournament branch predictor
– Alpha 21164: simple two-bit predictor
– 21164 performs better for transaction
processing applications!
• Can handle twice as many local branch predictions
Concluding Remarks
• Lots of open questions!
– Clock speed –vs– CPI
– Power issues
– Exploiting parallelism
• ILP –vs– explicit
Characteristics of Modern (2001)
Processors
• Figure 3.61
– 3–4 way superscalar
– 4–22 stage pipelines
– Branch prediction
– Register renaming (except UltraSPARC)
– 400MHz – 1.7GHz
– 7–130 million transistors
Chapter 4
Exploiting ILP with Software
4.1. Compiler Techniques for
Exposing ILP
• Compilers can improve the performance of
simple pipelines
– Reduce data hazards
– Reduce control hazards
Loop Unrolling
• Compiler technique to increase ILP
– Duplicate loop body
– Decrease iterations
• Example:
– Basic code: 10 cycles per iteration
– Scheduled: 6 cycles
for (int k = 0; k < 1000; k++)
{
    x[k] = x[k] + s;
}
Loop Unrolling
for (int k = 0; k < 1000; k += 4)
{
    x[k]     = x[k]     + s;
    x[k + 1] = x[k + 1] + s;
    x[k + 2] = x[k + 2] + s;
    x[k + 3] = x[k + 3] + s;
}
• Basic code: 7 cycles per “iteration”
• Scheduled: 3.5 cycles (no stalls!)
Loop Unrolling
• Requires clever compilers
– Analysing data dependences, name
dependences and control dependences
• Limitations
– Code size
– Decrease in amortisation of overheads
– “Register pressure”
– Compiler limitations
• Useful for any architecture
Superscalar Performance
• Two-issue MIPS (int + FP)
• 2.4 cycles per “iteration”
– Unrolled five times
4.2. Static Branch Prediction
• Useful:
– where behaviour can be predicted at compile time
– to assist dynamic prediction
• Architectural support
– Delayed branches
Static Branch Prediction
• Simple:
– Predict taken
– Has average misprediction rate of 34% (SPEC)
– Range: 9% – 59%
• Better:
– Predict backward taken, forward not-taken
– Worse for SPEC!
Static Branch Prediction
• Advanced compiler analysis can do better
• Profiling is very useful
– FP: 9% ± 4%
– Int: 15% ± 5%
4.3. Static Multiple Issue: VLIW
• Compiler groups instructions into
“packets”, checking for dependences
– Remove dependences
– Flag dependences
• Simplifies hardware
VLIW
• First machines used a wide instruction with
multiple operations per instruction
– Hence Very Long Instruction Word (VLIW)
– 64–128 bits
• Alternative: group several instructions into
an issue packet
VLIW Architectures
• Multiple functional units
• Compiler selects instructions for each unit
to create one long instruction (an issue
packet)
• Example: five operations
– Integer/branch, 2 × FP, 2 × memory access
• Need lots of parallelism
– Use loop unrolling, or global scheduling
Example
for (int k = 0; k < 1000; k++)
{ x[k]
= x[k] + s;
}
• Loop unrolled seven times!
• 1.29 cycles per result
• 60% of available instruction “slots” filled
Summary of Improvements
Technique            Unscheduled   Scheduled
Basic code           10            6
Loop unrolled (4)    7             3.5
Superscalar (5)      —             2.4
VLIW (7)             —             1.29
(cycles per “iteration”)
Drawbacks of Original VLIWs
• Large code size
– Need to use loop unrolling
– Wasted space for unused slots
• Clever encoding techniques, compression
• Lock-step execution
– Stalling one unit stalls them all
• Binary code compatibility
– Variations on structure required recompilation
4.4. Compiler Support for
Exploiting ILP
• We will not cover this section in detail
• Loop unrolling
– Loop-carried dependences
• Software pipelining
– Interleave instructions from different iterations
4.5. Hardware Support for
Extracting More Parallelism
• Techniques like loop-unrolling work well
when branch behaviour can be predicted at
compile time
• If not, we need more advanced techniques:
– Conditional instructions
– Hardware support for compiler speculation
Conditional or Predicated Instructions
• Instructions have associated conditions
– If condition is true execution proceeds normally
– If not, instruction becomes a no-op
• Example: if (a == 0) b = c;
– With a branch (assuming a in %r8, c in %r1,
b in %r2; source-before-destination syntax):
      bnez  %r8, L1
      nop
      mov   %r1, %r2
L1:   ...
– With a conditional move (if %r8 == 0, copy
%r1 to %r2):
      cmovz %r8, %r1, %r2
• Removes control hazards
Conditional Instructions
• Control hazards effectively replaced by data
hazards
• Can be used for speculation
– Compiler reorders instructions depending on
likely outcome of branches
Limitations on Conditional
Instructions
• Annulled instructions still execute
– But may occupy otherwise stalled time
• Most useful when conditions evaluated
early
• Limited usefulness for complex conditions
• May be slower than unconditional
operations
Conditional Instructions in
Practice
Machine                Conditional instructions
MIPS, Alpha, SPARC     Conditional move
HP PA                  Any register-register instruction can
                       annul the following instruction
IA-64                  Full predication