Tarantula: A Vector Extension to the Alpha Architecture
Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, André Seznec
Universitat Politècnica Catalunya, Barcelona, Spain
Compaq Computer Corporation, Shrewsbury, MA

State of the World
• CMOS Technology progresses
  – More transistors, more functional units, more control overhead
• VLIW and Wide Superscalar
  – More individually controlled units
  – Amount of real estate for control logic grows nonlinearly
• Vector ISA
  – Localization of parallelism, aggregation of control
  – Regular structures, simple control

Tarantula
• EV8 core + tightly integrated Vector Unit
  – Out of Order execution, Register Renaming
  – Integrated in VM and cache coherence system
  – SMT support
• Targeted at scientific computing applications
• Requires compiler support and recompilation

Vector ISA
• New Architectural State
  – 32 vector registers (v0-v31)
    • v31 wired to 0. Used for prefetch
  – Vector length (vl), Vector stride (vs), Vector Mask (vm)
• 45 New Instructions
  – 5 Groups
    • Vector-Vector, Vector-Scalar, Strided Memory Access, Random Memory Access, Vector Control

Vector Mask
• Allows conditional execution without EV8 scalar registers
• VM can be renamed
• Example: computing the mask for the condition A(i).ne.0.and.B(i).gt.2
    vloadq A(i)    --> v0
    vloadq B(i)    --> v1
    vcmpne v0, #0  --> v6
    vcmpgt v1, #2  --> v7
    vand   v6, v7  --> v8
    setvm  v8      --> vm

Tarantula Block Diagram
[Figure: Tarantula block diagram]

Vector Execution Unit
• 16 independent lanes
  – No communication, except for gather/scatter
• Each lane has
  – 2 functional units
  – Slice of Register File and Mask
    • Allows high bandwidth
  – Address generator and private TLB
• 32 functional units appear as only 2 issue ports
  – Simple scheduling

Vector Unit – Core Interface
• Vector Unit physically separate from core
  – Little modification to core
• Large bus prevented by routing space
  – Core to VBox
    • 3-Instruction Bus
    • 2 Data Buses for Scalars from EV8 register file
    • 3-Instruction Kill Signal Bus for misspeculation
  – VBox to Core
    • 3-Instruction Completion Bus

Power Consumption
[Figure: power consumption]

Vector Memory System
• Bound to EV8 VM and Cache Coherence architecture
• High Load/Store Bandwidth required
  – Goal: one 64-bit datum per flop
  – Memory Bus too slow
  – L1 Cache too small for vector data
  – Direct Connection to L2 Cache
• Non-Unit Stride central problem
  – 20% of all accesses
  – Don't match cache lines

Non-Unit Strides
• EV8 4 MByte L2 Cache in 128 banks
  – 8 ways, 16 banks per way
  – Read 8 ways, select correct one
• Non-unit stride accesses
  – Read 16 independent cache lines
  – Select one qword per line
• Requires
  – Conflict free addresses
  – Conflict free writes to 16 lanes
    • One qword per lane per cycle

Conflict Free Addresses
• Possible for any 128 consecutive elements
  – For stride S = σ × 2^s (σ odd) with s ≤ 4
  – Order stored in ROM table
• Elements accessed out of order
  – Even for length < 128, address generation takes the full eight cycles
• Slice
  – Group of 16 conflict free addresses

PUMP
• Stride 1 accesses
  – 80% of all accesses
  – 128 Qwords in 16 (aligned) or 17 (misaligned) cache lines
• Full cache lines read into PUMP latches
  – Two qwords per cycle sent to VBox
• Similar for writes
• Allows double bandwidth

Gathers and Scatters
• Arbitrary Address for every vector element
  – Reordering algorithm doesn't work
• Conflict Resolution Box (CR)
  – Find biggest subset of non-conflicting addresses, pack into slice
  – Add new addresses to remaining ones and repeat
• Worst case 128 slices generated
• Same algorithm used for self-conflicting strides
  – Stride S = σ × 2^s (σ odd) with s > 4

Vector Misses
• To handle L2 misses, consider slices as atomic
• On miss, slice moved to Miss Address File (MAF)
  – Wait for missing data
  – Go to retry queue
• Too many retries cause Panic Mode
  – MAF nacks all other L2 requests that might prevent progress

Scalar-Vector Coherency
• VBox bypasses L1 cache
  – Presence bit P indicates L2 cache line loaded by VCore
  – If P set, VBox invalidates L1
• Scalar Write followed by Vector Read is not covered
  – Barrier command required
  – DrainM purges write buffer and causes replay trap

Evaluation
• No Compiler support available
  – Hand coded assembler cores
• Scientific Benchmarks
• ASIM Simulator
  – Cycle Accurate EV8 simulator
• Tarantula compared to
  – EV8
  – EV8 + Tarantula's memory system
  – Tarantula4: 1:4 ratio to RAMBUS frequency

Operations per Cycle
[Figure: operations per cycle]

Speed Up over EV8
[Figure: speedup over EV8]

Conclusions
• Vector Processor most efficient solution for many applications
• Vector Unit can be added to standard microprocessor core
• Big Bandwidth requirement can only be satisfied by L2 cache
• Potentially big performance gains
  – Speedup of 2 to 20 over EV8
• Performance depends on good code
  – Tiling + aggressive prefetching
• Very good power/performance ratio

Questions
• Can only scientific applications exploit vector processors?
  – Radix sort worked
  – Powerful memory access instructions
  – Masks allow logic execution
• Does anyone know more about PRAM algorithms?
• EV8/VBox coherency seems quirky. Does anyone see a better solution?
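
Appendix: sketch of slice packing for gathers and scatters

The Conflict Resolution Box procedure from the "Gathers and Scatters" slide (pick a non-conflicting subset, pack it into a slice, repeat on what is left) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware algorithm. The names `bank_of` and `pack_into_slices` are hypothetical, and the bank-selection function (low bits of the cache-line index) is an assumption. The 64-byte line size is inferred from the PUMP slide, where 128 qwords fill 16 aligned cache lines. The model checks bank conflicts only and ignores the lane-write constraint. It also sees all addresses at once, whereas the slide describes new addresses being added to the leftovers each round.

```python
from typing import List

LINE_BYTES = 64   # inferred: 128 qwords (1024 bytes) fill 16 aligned lines
NUM_BANKS = 16    # cache lines that can be read in parallel
SLICE_SIZE = 16   # a slice is a group of 16 conflict-free addresses


def bank_of(addr: int) -> int:
    """Hypothetical bank mapping (an assumption): low bits of the line index."""
    return (addr // LINE_BYTES) % NUM_BANKS


def pack_into_slices(addresses: List[int]) -> List[List[int]]:
    """Greedily group addresses so that no slice holds two addresses in one bank."""
    pending = list(addresses)
    slices: List[List[int]] = []
    while pending:
        used_banks = set()
        current: List[int] = []
        leftover: List[int] = []
        for addr in pending:
            bank = bank_of(addr)
            if bank not in used_banks and len(current) < SLICE_SIZE:
                used_banks.add(bank)      # first address seen for this bank wins
                current.append(addr)
            else:
                leftover.append(addr)     # conflicting address waits for a later slice
        slices.append(current)
        pending = leftover
    return slices


# 128 addresses on consecutive cache lines: banks rotate, so slices fill up.
spread = [i * LINE_BYTES for i in range(128)]
# 128 addresses that all map to bank 0: every slice holds a single address.
clash = [i * LINE_BYTES * NUM_BANKS for i in range(128)]

print(len(pack_into_slices(spread)))  # 8
print(len(pack_into_slices(clash)))   # 128
```

Under this model, taking one pending address per bank yields the largest possible conflict-free subset in each round. The second case reproduces the worst case from the slide: 128 elements that all collide need 128 slices.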