Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
Nathan Clark, Amir Hormati, Scott Mahlke, Sami Yehia*, Krisztián Flautner*
University of Michigan
*ARM Ltd.
University of Michigan
Electrical Engineering and Computer Science
Computational Efficiency
• Low power envelope
• More useful work/transistors
• Hardware accelerators
• Niagara II encryption engine
Source: AMD Analyst Day 12/14/06
How Are Accelerators Used?
[Diagram: Program, CPU, Accel.]
Control statically placed in binary
Problem With Static Control
[Diagram: Program running on several CPU and Accel. configurations]
Not forward/backward compatible
Solution: Virtualization
• Statically identify accelerated computation
• Abstract accelerator features
• Dynamically retarget binary
[Diagram: Engineer/Compiler produces Program; each Proc. has a Trans. and may have an Accel.]
Liquid SIMD
• Virtualize SIMD accelerators
• Why virtualize SIMD?
– Intel MMX to SSE2
– ARM v6 to Neon
– Wide vectors useful [Lin 06]
SIMD Accelerator Assumptions
[Diagram: Fetch, Decode, Scalar Exec, SIMD Exec, Retire]
• Same instruction stream
• Separate pipeline – memory interface
How to Virtualize
• Use scalar ISA to represent SIMD operations
– Compatibility, low overhead
[Diagram: Program, Branch]
• Key: easy to translate
Virtualization Architecture
[Diagram: Fetch, Decode, Execute, Retire, Trans., uCode Cache, Accel.]
1. Data Parallel Operations
[Diagram: A and B feed +, then &, producing C]
for(i = 0; i < 8; i++) {
r1 = A[i];
r2 = B[i];
r3 = r1 + r2;
r4 = r3 & constant;
C[i] = r4;
}
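The loop above can be sketched as compilable C, next to a lane-grouped version that models what a wider accelerator would compute. This is only an illustration: the function names, the `MASK` value standing in for `constant`, and the 8-bit element type are assumptions, not details from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 8          /* trip count used on the slide */
#define MASK 0x0Fu   /* stands in for `constant`; the value is an assumption */

/* Scalar representation: one element per iteration, as on the slide. */
void add_mask_scalar(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    for (size_t i = 0; i < N; i++) {
        uint8_t r1 = A[i];
        uint8_t r2 = B[i];
        uint8_t r3 = (uint8_t)(r1 + r2);
        uint8_t r4 = (uint8_t)(r3 & MASK);
        C[i] = r4;
    }
}

/* Lane-grouped model: the same operations applied to `width` elements per
   step. The inner loop plays the role of one SIMD instruction's lanes. */
void add_mask_lanes(const uint8_t *A, const uint8_t *B, uint8_t *C, size_t width) {
    assert(width > 0);
    for (size_t i = 0; i < N; i += width) {
        for (size_t lane = 0; lane < width && i + lane < N; lane++) {
            C[i + lane] = (uint8_t)((A[i + lane] + B[i + lane]) & MASK);
        }
    }
}
```

Because no iteration reads a value written by another iteration, the two functions produce the same `C` for any width.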
1a. What If There’s No Scalar Equivalent?
[Diagram: A and B feed SADD]
for(i = 0; i < 8; i++) {
r1 = A[i];
r2 = B[i];
r3 = r1 + r2;
cmp r3, #FF;
r3 = movgt #FF;
...
}
Idioms can always be constructed
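A minimal sketch of the saturating-add idiom in C, assuming unsigned 8-bit elements. The add, compare, and conditional move follow the slide; the function names and the store to `C` are assumptions, since the slide elides the rest of the loop body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar idiom for an unsigned 8-bit saturating add: add in a wider type,
   compare against 0xFF, and clamp when the sum is greater. */
uint8_t sat_add_u8(uint8_t a, uint8_t b) {
    unsigned r3 = (unsigned)a + (unsigned)b;   /* r3 = r1 + r2     */
    if (r3 > 0xFFu) {                          /* cmp r3, #FF      */
        r3 = 0xFFu;                            /* r3 = movgt #FF   */
    }
    return (uint8_t)r3;
}

/* The idiom inside the element-wise loop. Storing to C is an assumption. */
void sat_add_loop(const uint8_t *A, const uint8_t *B, uint8_t *C, size_t n) {
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++) {
        C[i] = sat_add_u8(A[i], B[i]);
    }
}
```

A translator that recognizes this fixed instruction pattern can map it back to a single saturating SIMD add.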
2. Scalarizing Permutations
[Diagram: + result permuted, then &]
for(i = 0; i < 8; i++) {
…
r1 = r2 + r3;
tmp[i] = r1;
}
for(i = 0; i < 8; i++) {
r1 = offset[i];
r2 = tmp[r1 + i];
r3 = r2 & const;
…
}
offset = {4, 4, 4, 4, -4, -4, -4, -4}
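The two loops and the offset array can be sketched as runnable C to show what the permutation does. The input arrays, the `mask` parameter standing in for `const`, and the store to `out` are assumptions that fill in the parts the slide elides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 8

/* Offsets from the slide: element i reads tmp[i + offset[i]], which swaps
   the two 4-element halves of tmp. */
static const int offset[N] = {4, 4, 4, 4, -4, -4, -4, -4};

void add_then_permute(const uint8_t *A, const uint8_t *B, uint8_t mask, uint8_t *out) {
    uint8_t tmp[N];
    assert(A != NULL && B != NULL && out != NULL);

    /* First loop: element-wise work; the result is parked in tmp. */
    for (int i = 0; i < N; i++) {
        uint8_t r1 = (uint8_t)(A[i] + B[i]);   /* r1 = r2 + r3 */
        tmp[i] = r1;
    }

    /* Second loop: the permutation, written as an offset load followed by
       an indexed read of tmp. */
    for (int i = 0; i < N; i++) {
        int r1 = offset[i];
        uint8_t r2 = tmp[r1 + i];
        out[i] = (uint8_t)(r2 & mask);         /* r3 = r2 & const */
    }
}
```

With `A = {0, 1, ..., 7}`, `B` all zero, and `mask = 0xFF`, the result is `{4, 5, 6, 7, 0, 1, 2, 3}`. Because the offsets are a fixed array, the pattern can be matched to a specific shuffle.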
3. Scalarizing Reductions
for(i = 0; i < 8; i++) {
…
r1 = A[i];
r2 = r2 + r1;
…
}
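A short sketch of why the scalar accumulation loop can stand in for a vector reduction: for wrap-around integer addition, per-lane partial sums combined at the end give the same total as the running sum in `r2`. The lane count, element types, and function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4   /* assumed accelerator width */

/* Scalar form from the slide: r2 carries the running sum across iterations. */
uint32_t reduce_scalar(const uint8_t *A, size_t n) {
    uint32_t r2 = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t r1 = A[i];
        r2 = r2 + r1;
    }
    return r2;
}

/* Lane-wise form: LANES partial sums are updated together and combined
   once after the loop. */
uint32_t reduce_lanes(const uint8_t *A, size_t n) {
    uint32_t partial[LANES] = {0};
    uint32_t total = 0;
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t l = 0; l < LANES; l++) {
            partial[l] += A[i + l];
        }
    }
    for (size_t l = 0; l < LANES; l++) {
        total += partial[l];
    }
    return total;
}
```

This equivalence relies on the operation being associative and commutative, which holds for unsigned integer addition but not, in general, for floating-point addition.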
Applied to ARM Neon
• All instructions supported except…
• VTBL – indirect indexing
v1 = vtbl v2, v3
• Interleaved memory accesses
• Not needed in evaluated benchmarks
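For readers unfamiliar with VTBL, the following is a rough C model of a byte-wise table lookup over one 8-byte table, where out-of-range indices yield 0. Which operand of `v1 = vtbl v2, v3` is the table and which is the index vector is an assumption here, as are the function and parameter names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLANES 8   /* one 8-byte vector; an assumption for this model */

/* Model of a single-register table lookup: each output byte is chosen by
   the matching index byte; indices outside the table produce 0. */
void vtbl_model(const uint8_t *table, const uint8_t *index, uint8_t *out) {
    assert(table != NULL && index != NULL && out != NULL);
    for (int i = 0; i < VLANES; i++) {
        uint8_t idx = index[i];
        out[i] = (idx < VLANES) ? table[idx] : 0;
    }
}
```

Unlike the fixed `offset` array used when scalarizing permutations, the indices here are ordinary run-time data, which is presumably what "indirect indexing" refers to.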
Translation to SIMD
for(i = 0; i < 8; i++)
{
r1 = A[i];
r2 = B[i];
r3 = r1 + r2;
r4 = offset[i];
C[i + r4] = r3;
}
for(i = 0; i < 8; i += 4)
{
v1 = A[i];
v2 = B[i];
v3 = v1 + v2;
v3 = shuffle v3;
C[i] = v3;
}
• Update induction variable
• Use inverse of defined translation rules
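The translation step can be sketched in C by modelling a 4-lane vector as a small struct. Everything beyond the loop shapes is an assumption: the `vec4` type, the helper names, and in particular the pair-swapping `offset` pattern, chosen so that the shuffle stays inside one 4-lane vector.

```c
#include <assert.h>
#include <stdint.h>

#define N 8
#define W 4   /* assumed SIMD width in lanes */

typedef struct { uint8_t lane[W]; } vec4;   /* hypothetical 4-lane vector */

/* Assumed pattern: swap adjacent pairs. */
static const int offset[N] = {1, -1, 1, -1, 1, -1, 1, -1};

/* Scalar representation, following the first loop on the slide. */
void scalar_loop(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    for (int i = 0; i < N; i++) {
        uint8_t r1 = A[i];
        uint8_t r2 = B[i];
        uint8_t r3 = (uint8_t)(r1 + r2);
        int r4 = offset[i];
        C[i + r4] = r3;
    }
}

static vec4 vload(const uint8_t *p) {
    vec4 v;
    for (int l = 0; l < W; l++) v.lane[l] = p[l];
    return v;
}

static void vstore(uint8_t *p, vec4 v) {
    for (int l = 0; l < W; l++) p[l] = v.lane[l];
}

static vec4 vadd(vec4 a, vec4 b) {
    vec4 r;
    for (int l = 0; l < W; l++) r.lane[l] = (uint8_t)(a.lane[l] + b.lane[l]);
    return r;
}

/* Shuffle matching the offset pattern: input lane l moves to lane l + offset[l]. */
static vec4 vshuffle(vec4 v) {
    vec4 r;
    for (int l = 0; l < W; l++) r.lane[l + offset[l]] = v.lane[l];
    return r;
}

/* Translated form: induction variable stepped by W, offset load plus
   indexed store replaced by a shuffle. */
void translated_loop(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    assert(N % W == 0);
    for (int i = 0; i < N; i += W) {
        vec4 v1 = vload(&A[i]);
        vec4 v2 = vload(&B[i]);
        vec4 v3 = vadd(v1, v2);
        v3 = vshuffle(v3);
        vstore(&C[i], v3);
    }
}
```

Both functions write the same `C`, which is the property a translator must preserve when it updates the induction variable and applies the scalarization rules in reverse.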
Translator Design
[Diagram: Engineer/Compiler produces Program; each Proc. has a Trans. and may have an Accel.]
Translator: efficiency, speed, flexibility
Evaluation
• Trimaran ARM
• Hand SIMDized loops
• SimpleScalar model ARM926 w/ Neon SIMD
• VHDL translator, 130nm std. cell
Liquid SIMD Issues
• Code bloat
– <1% overhead beyond baseline
• Register pressure
– Not a problem
• Translator cost
– 0.2 mm² + 2 KB cache
• Translation overhead
Translation Overhead
[Figure: translation overhead for the MediaBench, SPECfp, and Kernels benchmark groups]
Summary
• Accelerators are increasingly common and still evolving
– Costly binary migration
• SIMD virtualization using scalar ISA
– One binary: forward/backward compatibility
– Negligible overhead
Questions