Architecture Support for
Disciplined Approximate
Programming
Esmaeilzadeh, Sampson, Ceze, and Burger
Presented by
John Kloosterman, Mick Wollman
Approximate computing
Replace guarantees with expectations
● 2 + 2 = {3,4,5}
● 2 + 2 = 4, most of the time
● 2 + 2 = 1,000,000,000
Why?
● 90/10 rule: can get 90% of the answer for
10% of the effort
● Effort can be either ALU operations or power
Applications of Approximation
e.g. raytracing
Result won't noticeably change if the ray
angle is 10% off, or wrong 10% of the time
sphere from http://www.piprime.fr/1143/official_asymptote_example-sphere/
Approximate results vs. semantics
Approximate operations have undefined
results but defined semantics
Undefined result
● 2+2 = 5 is OK
Defined semantics
● 2+2 throwing a divide-by-zero exception is
not OK
Disciplined Approximate Programming
Split program into approximate/exact portions
● exact ⇒ approximate OK
● approximate ⇒ exact only with annotation
Some computations must always be exact:
● address calculations
● control flow
Need ability to switch between exact and
approximate at instruction granularity
Example Approximate Kernel
approximate int sum;
for (int i = 0; i < 100; i++)
sum += *(array + i);
(*output) = sum;
Exact:
● loop counter
● address calculation
Approximate:
● accumulator
● store to approximate
memory
Approximate ISA Design
Have both approximate and exact:
● integer arithmetic
● FP arithmetic
● bit operations
● load/store instructions
Disciplined model: partition approximate/exact
● This paper uses the same HW for both,
compiler enforces data flow rules
Microarchitectural planes
● Instruction control plane ✕
○ Fetch
○ Decode
○ Instruction bookkeeping
○ Approximation would break semantics
● Data movement / processing plane ✓
○ Datapath (RF, $, LSQ, FUs, Bypass network)
○ Approximation will only affect results
Power Reduction Methods
● Global Voltage Reduction
○ Error checking + rollback to provide precision
○ Lower voltage ⇒ More rollbacks
● Dual Voltage, VH and VL
○ High voltage for IC plane
○ Either voltage for data plane
● Dual Voltage, VLH and VL
○ Lower IC plane voltage further
○ Rollback adds complexity, not examined
Errors vs. Voltage (Razor)
http://web.eecs.umich.edu/~taustin/papers/IEEEMICRO05-Razor.pdf
Important structures
● DV-SRAM
○ Each row is prefixed with a VH-driven precision bit
○ The in-row VH bit selects which power rail the row connects to
○ Precharge based on instruction/operand precision
Important structures cont’d.
● DV-Mux
○ Select between two different voltage-level signals
○ Controlled by precision bit
[Diagram: DV-Mux — two inputs, input[0] and input[1] (each 0-VH/L), a select line, one output (0-VH/L)]
Important structures cont’d.
● L2H and H2L shifters
[Diagram: L2H and H2L level shifters, each built from a DeMux/Mux pair steered by a select signal; L2H converts a 0-VH/L input to a 0-VH output, H2L does the reverse]
Microarch. Changes
● Opcode and source register operands carry
added precision bit
● RF precision set at register granularity
● Duplicate pipeline data registers
● Approximate shadow FUs
○ DV FUs possible but complicated
Microarch. Changes cont’d.
● Broadcast network carries precision bit
● Memory precision set at cache line
granularity
○ Precision set by fills and writes, left unchanged by
read hits
○ Can do a precise read from an approx. line, and
vice versa
○ NB: Tags, MSHR, etc. always precise
Overheads
● Precision bits
● Pipeline reg / FU duplication
● Shifters / multiplexers
Results: Energy Savings
Problems:
● Many computations for control flow/address generation,
not data
● Much of processor is control plane, not data plane
Best-case energy savings (best benchmark) for
50% voltage:
○ In-order: ~40%
○ OoO: ~15%
○ Difference due to size of control plane
Energy Savings
● 25% energy savings with 50% voltage
● Modeled using McPAT and Cacti
Results: Program % Approximate
● low opportunity in integer code
● high opportunity in FP code
Program Error Sensitivity
● How sensitive are applications to approximation?
○ Model approximate results by flipping bits
○ exact: 0000010 + 0000010 = 0000100
○ approximate with one bit flip could be
■ 0000101 = 5
■ 1000100 = 68
Questions?
Discussion questions
● Pro
○ Many kinds of useful compute-intensive operations
can be approximated (images, video, simulations)
○ SW approximation has given good results
○ Razor: undervolting can produce acceptable types of
errors
● Con
○ Can this hardware be built?
○ Scheduling is more expensive than the operations
themselves
○ Rounding error tolerant ≟ Approximation tolerant