Architecture Support for Disciplined Approximate Programming
Esmaeilzadeh, Sampson, Ceze, and Burger
Presented by John Kloosterman, Mick Wollman

Approximate computing
Replace guarantees with expectations:
● 2 + 2 = {3, 4, 5}
● 2 + 2 = 4, most of the time
● 2 + 2 = 1,000,000,000

Why?
● 90/10 rule: can get 90% of the answer for 10% of the effort
● Effort can be either ALU operations or power

Applications of Approximation
e.g. raytracing: the result won't noticeably change if a ray angle is 10% off, or wrong 10% of the time.
Sphere from http://www.piprime.fr/1143/official_asymptote_example-sphere/

Approximate results vs. semantics
Approximate operations have undefined results but defined semantics.
● Undefined result: 2 + 2 = 5 is OK
● Defined semantics: 2 + 2 throwing a divide-by-zero exception is not OK

Disciplined Approximate Programming
Split the program into approximate and exact portions:
● exact ⇒ approximate: OK
● approximate ⇒ exact: only with annotation
Some computations must always be exact:
● address calculations
● control flow
Need the ability to switch between exact and approximate at instruction granularity.

Example Approximate Kernel

    approximate int sum;
    for (int i = 0; i < 100; i++)
        sum += *(array + i);
    (*output) = sum;

Exact:
● loop counter
● address calculation
Approximate:
● accumulator
● store to approximate memory

Approximate ISA Design
Have both approximate and exact versions of:
● integer arithmetic
● FP arithmetic
● bit operations
● load/store instructions
Disciplined model: partition approximate/exact.
● This paper uses the same HW for both; the compiler enforces data-flow rules.

Microarchitectural planes
● Instruction control plane ✕
  ○ Fetch
  ○ Decode
  ○ Instruction bookkeeping
  ○ Approximation would break semantics
● Data movement / processing plane ✓
  ○ Datapath (RF, $, LSQ, FUs, bypass network)
  ○ Approximation will only affect results

Power Reduction Methods
● Global voltage reduction
  ○ Error checking + rollback to provide precision
  ○ Lower voltage ⇒ more rollbacks
● Dual voltage, VH and VL
  ○ High voltage for the IC plane
  ○ Either voltage for the data plane
● Dual voltage, VLH and VL
  ○ Lower the IC plane voltage further
  ○ Rollback adds complexity, not examined

Errors vs. Voltage (Razor)
http://web.eecs.umich.edu/~taustin/papers/IEEEMICRO05-Razor.pdf

Important structures
● DV-SRAM
  ○ Each row is prepended with a VH-driven precision bit
  ○ The in-row VH bit is used to connect to the power lines
  ○ Precharge based on inst/op precision

Important structures cont'd.
● DV-Mux
  ○ Selects between two signals at different voltage levels
  ○ Controlled by the precision bit
(diagram: DV mux with inputs input[0] and input[1] at 0-VH/L, a select line, and a 0-VH/L output)

Important structures cont'd.
● L2H and H2L level shifters
(diagram: L2H and H2L shifters built from demux/mux pairs with select lines; L2H takes a 0-VH/L input to a 0-VH output)

Microarch. Changes
● Opcode and source register operands carry an added precision bit
● RF precision set at register granularity
● Duplicate pipeline data registers
● Approximate shadow FUs
  ○ DV FUs are possible but complicated

Microarch. Changes cont'd.
● Broadcast network carries the precision bit
● Memory precision set at cache-line granularity
  ○ Precision set by fills and writes, left unchanged by read hits
  ○ Can do a precise read from an approximate line, and vice versa
  ○ NB: tags, MSHRs, etc. are always precise

Overheads
● Precision bits
● Pipeline register / FU duplication
● Shifters / multiplexers

Results: Energy Savings
Problems:
● Many computations are for control flow / address generation, not data
● Much of the processor is control plane, not data plane
Best-case energy savings (best benchmark) at 50% voltage:
  ○ In-order: ~40%
  ○ OoO: ~15%
  ○ Difference due to the size of the control plane

Energy Savings
● 25% energy savings at 50% voltage
● Modeled using McPAT and CACTI

Results: Program % Approximate
● Low integer opportunity, high FP opportunity

Program Error Sensitivity
● How sensitive are applications to approximation?
  ○ Model approximate results by flipping bits
  ○ exact: 0000010 + 0000010 = 0000100
  ○ approximate, with one bit flip, could be
    ■ 0000101 = 5
    ■ 1000100 = 68
(plot: input errors)

Questions?

Discussion questions
● Pro
  ○ Many kinds of useful compute-intensive operations can be approximated (images, video, simulations)
  ○ SW approximation has given good results
  ○ Razor: undervolting can produce acceptable types of errors
● Con
  ○ Can this hardware be built?
  ○ Scheduling is more expensive than the operations themselves
  ○ Rounding-error tolerant ≟ approximation tolerant