Compiler and Runtime Support for Efficient Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation
Multi-core architectures are mainstream
– Software concurrency needed for scalability
Concurrent programming is hard
– Difficult to reason about shared data
Traditional mechanism: lock-based synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable
New solution: Transactional Memory (TM)
– Simpler programming model: atomicity, consistency, isolation
– No deadlocks
– Composability
– Optimistic concurrency
Analogy
• GC : Memory allocation ≈ TM : Mutual exclusion
Composability
class Bank {
    ConcurrentHashMap accounts;
    …
    void deposit(String name, int amount) {
        synchronized (accounts) {
            int balance = accounts.get(name); // Get the current balance
            balance = balance + amount;       // Increment it
            accounts.put(name, balance);      // Set the new balance
        }
    }
    …
}
Thread-safe – but no scaling
• ConcurrentHashMap (Java 5 / JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
Transactional solution
class Bank {
    HashMap accounts;
    …
    void deposit(String name, int amount) {
        atomic {
            int balance = accounts.get(name); // Get the current balance
            balance = balance + amount;       // Increment it
            accounts.put(name, balance);      // Set the new balance
        }
    }
    …
}
The underlying system provides:
• isolation (thread safety)
• optimistic concurrency
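Composability is the payoff: because atomic blocks nest, a transfer can be built from existing methods without knowing their synchronization internals. A sketch in the deck's Transactional Java syntax (not compilable Java; `withdraw` is a hypothetical method assumed to mirror `deposit`):

```java
// Hypothetical: composes deposit with an assumed withdraw method.
void transfer(String from, String to, int amount) {
    atomic {                      // outer transaction; inner atomics nest
        withdraw(from, amount);
        deposit(to, amount);
    }                             // either both updates commit or neither does
}
```

With locks, the same composition risks deadlock and exposes the intermediate state; here the outer transaction makes the pair atomic as a unit.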
Transactions are Composable
[Figure: Scalability, 10,000,000 operations. Scalability (0–4) vs. number of processors (0–16) for Synchronized and Transactional, on a 16-way 2.2 GHz Xeon system.]
Our System
A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT
Novel features
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word- and object-level conflict detection
– Complete GC support
System Overview
Transactional Java
  → (Polyglot) → Java + STM API
  → (StarJIT) → Transactional STIR → Optimized T-STIR
  → Native Code
Runs on the ORP VM with the McRT STM runtime.
Transactional Java
Java + new language constructs:
• Atomic: execute block atomically
    atomic {S}
• Retry: block until alternate path possible
    atomic {… retry; …}
• Orelse: compose alternate atomic blocks
    atomic {S1} orelse {S2} … orelse {Sn}
• Tryatomic: atomic with escape hatch
    tryatomic {S} catch (TxnFailed e) {…}
• When: conditionally atomic region
    when (condition) {S}
Builds on prior research:
Concurrent Haskell, CAML, CILK, Java
HPCS languages: Fortress, Chapel, X10
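A sketch of how these constructs might combine, in the extended syntax above (not compilable Java; `BoundedBuffer` and its methods are hypothetical): retry blocks the consumer until another transaction's commit can make the condition true, and orelse falls back to an alternative.

```java
// Hypothetical example in the Transactional Java syntax.
int takeFromEither(BoundedBuffer a, BoundedBuffer b) {
    atomic {
        if (a.isEmpty()) retry;   // wait until some transaction fills a
        return a.remove();
    } orelse {
        if (b.isEmpty()) retry;   // otherwise try b; retry waits on both
        return b.remove();
    }
}
```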
Transactional Java → Java
Transactional Java:
atomic {
    S;
}
Standard Java + STM API:
while (true) {
    TxnHandle th = txnStart();
    try {
        S’;
        break;
    } finally {
        if (!txnCommit(th))
            continue;
    }
}
STM API:
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...
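The expansion's control flow is subtle: the `continue` in the `finally` clause overrides the `break` whenever commit fails, re-running the body. A toy, single-threaded Java model of just that loop (txnStart/txnCommit here are hypothetical fakes, rigged to fail the first commit):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the atomic-block expansion: retry the body until txnCommit succeeds.
final class RetryLoopDemo {
    static final AtomicInteger commits = new AtomicInteger();
    static int attempts;

    static int txnStart() { return commits.get(); }                   // fake handle
    static boolean txnCommit(int th) { return commits.incrementAndGet() > 1; } // fails once

    static void runAtomically(Runnable body) {
        while (true) {
            int th = txnStart();
            try {
                attempts++;
                body.run();
                break;                     // reached only if the body completes
            } finally {
                if (!txnCommit(th))
                    continue;              // overrides the break: re-execute body
            }
        }
    }
}
```

Running the loop once shows the body executing twice: the first commit fails and the `continue` restarts the transaction.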
JVM STM support
On-demand cloning of methods called inside transactions
Garbage collection support
• Enumeration of refs in read set, write set & undo log
Extra transaction record field in each object
• Supports both word & object granularity
Native method invocation throws exception inside transaction
• Some intrinsic functions allowed
Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API
Background: McRT-STM
STM for:
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)
Writes:
– strict two-phase locking
– update in place
– undo on abort
Reads:
– versioning
– validation before commit
Granularity per type:
– object-level: small objects
– word-level: large arrays
Benefits:
– fast memory accesses (no buffering / object wrapping)
– minimal copying (no cloning for large objects)
– compatible with existing types & libraries
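The write path above (update in place, log the old value, undo on abort) can be modeled with a minimal single-threaded Java sketch; `ToyTxn` and its int-array "heap" are illustrative stand-ins, not the McRT-STM implementation:

```java
import java.util.ArrayDeque;

// Toy model of the McRT-STM write path: in-place update plus an undo log.
final class ToyTxn {
    private final ArrayDeque<Runnable> undo = new ArrayDeque<>();
    private final int[] heap;   // stand-in for transactional memory

    ToyTxn(int[] heap) { this.heap = heap; }

    void write(int addr, int value) {
        final int old = heap[addr];
        undo.push(() -> heap[addr] = old);  // log old value before updating in place
        heap[addr] = value;
    }
    void abort()  { while (!undo.isEmpty()) undo.pop().run(); } // roll back newest-first
    void commit() { undo.clear(); }  // nothing to write back: updates already in place
}
```

This shows why commits are fast in this design (the data is already in place) while aborts pay for the rollback.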
Ensuring Atomicity: Novel Combination
Reads → optimistic concurrency:
+ fast reads
+ avoids lock operations
+ caching effects
Writes → pessimistic concurrency:
+ in-place updates
+ fast commits
Quantitative results in PPoPP’06
McRT-STM: Example
…
atomic {
    B = A + 5;
}
…
becomes:
…
stmStart();
temp = stmRd(A);
stmWr(B, temp + 5);
stmCommit();
…
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts
Transaction Record
Pointer-sized record per object / word
Two states:
• Shared (low bit is 1)
  – Read-only / multiple readers
  – Value is version number (odd)
• Exclusive (low bit is 0)
  – Write-only / single owner
  – Value is thread transaction descriptor (4-byte aligned)
Mapping:
• Object: slot in object
• Field: hashed index into global record table
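The encoding exploits alignment: an odd value can never be a 4-byte-aligned pointer, so the low bit alone distinguishes the two states. A minimal Java sketch, assuming the record fits in a `long` (names are illustrative):

```java
// Sketch of the assumed encoding: shared records hold an odd version number
// (low bit 1); exclusive records hold a 4-byte-aligned descriptor address
// (low bits 00), so the low bit discriminates the states.
final class TxnRecordState {
    static boolean isShared(long rec)    { return (rec & 1L) != 0; }
    static boolean isExclusive(long rec) { return (rec & 1L) == 0; }
    static long versionOf(long rec)      { return rec; }  // valid only when shared
}
```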
Transaction Record: Example
Every data item has an associated transaction record.

Object granularity:
class Foo {
    int x;
    int y;
}
Layout: vtbl | TxR | x | y
– Extra transaction record field in each object

Word granularity:
class Foo {
    int x;
    int y;
}
Layout: vtbl | hash | x | y
– Object words hash into a global table of TxRs (TxR1, TxR2, TxR3, … TxRn)
– Hash is f(obj.hash, offset)
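The slide leaves f(obj.hash, offset) unspecified; a plausible Java sketch, assuming a power-of-two global table and a simple mixing function (both are illustrative assumptions, not the McRT-STM hash):

```java
// Hypothetical word-granularity lookup into a global transaction-record table.
final class TxnRecordTable {
    static final int SIZE = 1 << 20;              // 2^20 records (assumed)
    static final long[] records = new long[SIZE];

    // f(obj.hash, offset): mix the object's identity hash with the field offset.
    static int slot(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;       // illustrative mixing only
        return h & (SIZE - 1);                    // mask into the power-of-two table
    }
}
```

Distinct fields of one object map to distinct records (false sharing aside), which is what makes word granularity win on large arrays.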
Transaction Descriptor
Descriptor per thread
– Info for version validation, lock release, undo on abort, …
Read and write set: {<Ti, Ni>}
– Ti: transaction record
– Ni: version number
Undo log: {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)
In an atomic region:
– a read appends to the read set
– a write appends to the write set and undo log
– GC enumerates read/write/undo logs
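A hedged Java sketch of these per-thread structures (field names and types are illustrative; the type tag Ki is omitted for brevity):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative per-thread descriptor: read set <Ti, Ni> and undo log <Ai, Oi, Vi>.
final class TxnDescriptor {
    record ReadEntry(long txnRecord, long version) {}              // <Ti, Ni>
    record UndoEntry(long address, Object owner, Object originalValue) {} // <Ai, Oi, Vi>

    final List<ReadEntry> readSet = new ArrayList<>();
    final List<UndoEntry> undoLog = new ArrayList<>();

    void logRead(long rec, long version) { readSet.add(new ReadEntry(rec, version)); }
    void logWrite(long addr, Object owner, Object oldVal) {
        undoLog.add(new UndoEntry(addr, owner, oldVal));  // a write also joins the write set
    }
}
```

Keeping the containing object Oi (not just a raw address) is what lets the GC enumerate and update these logs when objects move.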
McRT-STM: Example
class Foo {
    int x;
    int y;
}
Foo bar, foo;

T1:
atomic {
    t = foo.x;
    bar.x = t;
    t = foo.y;
    bar.y = t;
}
T2:
atomic {
    t1 = bar.x;
    t2 = bar.y;
}
• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();
T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();
• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
[Figure: animation of the two transactions racing. foo has version 3 with x=9, y=7; bar starts at version 5 with x=0, y=0.]
– T1 acquires bar exclusively (its record points to T1’s descriptor); T1’s reads are <foo, 3> <foo, 3>, its write set is <bar, 5>, and its undo log is <bar.x, 0> <bar.y, 0>
– T2’s reads are <bar, 5> <bar, 7>: T2 waits while T1 holds bar, T1 commits (bar’s version becomes 7), and T2’s validation detects the version mismatch and aborts
• T2 should read [0, 0] or [9, 7], never an intermediate state
Early Results: Overhead breakdown
[Figure: STM time breakdown (application, TLS access, STM write, STM commit, STM validate, STM read; 0–100%) on a single processor for Binary tree, Hashtable, Linked list, and Btree.]
STM read & validation overheads dominate
Good optimization targets
Leveraging the JIT
StarJIT: High-performance dynamic compiler
• Identifies transactional regions in Java+STM code
• Differentiates top-level and nested transactions
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in STIR
Good compiler representation →
greater optimization opportunities
Representing Read/Write Barriers
Traditional barriers hide redundant locking/logging:

atomic {
    a.x = t1;
    a.y = t2;
    if (a.z == 0) {
        a.x = 0;
        a.z = t3;
    }
}
becomes:
stmWr(&a.x, t1);
stmWr(&a.y, t2);
if (stmRd(&a.z) == 0) {
    stmWr(&a.x, 0);
    stmWr(&a.z, t3);
}
An STM IR for Optimization
Redundancies exposed:

atomic {
    a.x = t1;
    a.y = t2;
    if (a.z == 0) {
        a.x = 0;
        a.z = t3;
    }
}
becomes:
txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnOpenForWrite(a);
txnLogObjectInt(&a.y, a);
a.y = t2;
txnOpenForRead(a);
if (a.z == 0) {
    txnOpenForWrite(a);
    txnLogObjectInt(&a.x, a);
    a.x = 0;
    txnOpenForWrite(a);
    txnLogObjectInt(&a.z, a);
    a.z = t3;
}
Optimized Code
Fewer & cheaper STM operations:

atomic {
    a.x = t1;
    a.y = t2;
    if (a.z == 0) {
        a.x = 0;
        a.z = t3;
    }
}
becomes:
txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnLogObjectInt(&a.y, a);
a.y = t2;
if (a.z == 0) {
    a.x = 0;
    txnLogObjectInt(&a.z, a);
    a.z = t3;
}
Compiler Optimizations for Transactions
Standard optimizations
• CSE, dead-code elimination, …
• Subtle in presence of nesting
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
– Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
– Word / mixed where beneficial
Effectiveness of Compiler Optimizations
[Figure: % overhead on 1P over a thread-unsafe baseline (0–90%) for HashMap and TreeMap, comparing Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, and +Fast Path Inlining.]
Prior STMs typically incur ~2x on 1P
With compiler optimizations:
– < 40% over no concurrency control
– < 30% over synchronization
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe, without concurrency control
Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (size / locking)
Atomic
• Transactional version via “AtomicMap” wrapper
Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap
Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
Scalability: 100% Gets
[Figure: speedup over 1P Unsafe (0–16) vs. number of processors (0–16) for Unsafe, Synchronized, Atomic (No Opt), Atomic, and Concurrent.]
Atomic wrapper is competitive with ConcurrentHashMap
The effect of the compiler optimizations scales
Scalability: 20% Gets / 80% Updates
[Figure: speedup over 1P Unsafe (0–16) vs. number of processors for Synchronized, Concurrent, Atomic (No Opt), and Atomic.]
ConcurrentHashMap thrashes on 16 segments
Atomic still scales
20% Inserts and Removes
[Figure: speedup over 1P Unsafe (0–3) vs. number of processors for Synchronized, Concurrent, and Atomic.]
Atomic conflicts on the entire bucket array
– The array is a single object
20% Inserts and Removes: Word-Level
[Figure: speedup over 1P Unsafe (0–3) vs. number of processors, adding Object Atomic and Word Atomic to Synchronized and Concurrent.]
We still conflict on the single size field in java.util.HashMap
20% Inserts and Removes: Atomic Prime
[Figure: speedup over 1P Unsafe (0–3) vs. number of processors, adding Word Atomic Prime to Synchronized, Concurrent, Object Atomic, and Word Atomic.]
Atomic Prime tracks size per segment, lowering the bottleneck
No degradation, modest performance gain
20% Inserts and Removes: Mixed-Level
[Figure: speedup over 1P Unsafe (0–3) vs. number of processors, adding Mixed Atomic Prime to Synchronized, Concurrent, Object Atomic, Word Atomic, and Word Atomic Prime.]
Mixed-level preserves the wins & reduces overheads
– word-level for arrays
– object-level for non-arrays
Scalability: java.util.TreeMap
[Figure: two panels of scalability vs. number of processors (0–16): 100% Gets (scalability 0–16) and 80% Gets (scalability 0–1.2), comparing Unsafe, Synchronized, Atomic, and Atomic Prime.]
Results similar to HashMap
Scalability: OO7 – 80% Reads
Operations & traversal over a synthetic database
[Figure: scalability (0–6) vs. number of processors (0–16) for Atomic, Synch (Coarse), Synch (Med.), and Synch (Fine).]
“Coarse” atomic is competitive with medium-grain synchronization
Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
– 20–40% over thread-unsafe
– 10–30% over synchronized
Simple atomic wrappers are sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level conflict detection is crucial for large arrays
Mixed granularity provides the best of both
Research challenges
Performance
– Compiler optimizations
– Hardware support
– Dealing with contention
Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions
Debugging & performance analysis tools
System integration
Conclusions
Rich transactional language constructs in Java
Efficient, first-class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word- and object-level conflict detection
Complete GC support