Compiler and Runtime Support for Efficient Software Transactional Memory

Vijay Menon, Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Programming Systems Lab

Motivation

Locks are hard to get right
• Programmability vs. scalability

Transactional memory is an appealing alternative
• Simpler programming model
• Stronger guarantees
  – Atomicity, consistency, isolation
  – Deadlock avoidance
• Closer to programmer intent
• Scalable implementations

Questions
• How to lower TM overheads – particularly in software?
• How to balance granularity and scalability?

Our System

Java Software Transactional Memory (STM) system
• Pure software implementation (McRT-STM – PPoPP ’06)
• Language extensions in Java (Polyglot)
• Integrated with JVM & JIT (ORP & StarJIT)

Novel features
• Rich transactional language constructs in Java
• Efficient, first-class nested transactions
• Complete GC support
• RISC-like STM API / IR
• Compiler optimizations
• Per-type word- and object-level conflict detection

Transactional Java → Java

Transactional Java:

    atomic {
      S;
    }

Standard Java + STM API:

    while(true) {
      TxnHandle th = txnStart();
      try {
        S';
        break;
      } finally {
        if(!txnCommit(th))
          continue;
      }
    }

Other language constructs, built on prior research:
• retry (STM Haskell, …)
• orelse (STM Haskell)
• tryatomic (Fortress)
• when (X10, …)

Tight Integration with JVM & JIT

StarJIT & ORP
• On-demand cloning of methods (Harris ’03)
• Identifies transactional regions in Java+STM code
• Inserts read/write barriers in transactional code
• Maps STM API to first-class opcodes in StarJIT IR (STIR)

A good compiler representation → greater optimization opportunities

Representing Read/Write Barriers

Traditional barriers hide redundant locking/logging. Source:

    atomic {
      a.x = t1
      a.y = t2
      if(a.z == 0) {
        a.x = 0
        a.z = t3
      }
    }

With traditional barriers:

    stmWr(&a.x, t1)
    stmWr(&a.y, t2)
    if(stmRd(&a.z) == 0) {
      stmWr(&a.x, 0)
      stmWr(&a.z, t3)
    }

An STM IR for Optimization

The expanded form exposes redundancies:

    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnOpenForWrite(a)
    txnLogObjectInt(&a.y, a)
    a.y = t2
    txnOpenForRead(a)
    if(a.z == 0) {
      txnOpenForWrite(a)
      txnLogObjectInt(&a.x, a)
      a.x = 0
      txnOpenForWrite(a)
      txnLogObjectInt(&a.z, a)
      a.z = t3
    }

(Expanded from the same source transaction as before: a.x = t1; a.y = t2; if(a.z == 0) { a.x = 0; a.z = t3 }.)

Optimized Code

Fewer & cheaper STM operations:

    txnOpenForWrite(a)
    txnLogObjectInt(&a.x, a)
    a.x = t1
    txnLogObjectInt(&a.y, a)
    a.y = t2
    if(a.z == 0) {
      a.x = 0
      txnLogObjectInt(&a.z, a)
      a.z = t3
    }

Compiler Optimizations for Transactions

Standard optimizations
• CSE, dead-code elimination, …
• Subtle in the presence of nesting
• A careful IR representation exposes opportunities and enables optimizations with almost no modifications

STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead

McRT-STM (Saha et al., PPoPP 2006)
• C / C++ STM
• Pessimistic writes:
  – strict two-phase locking
  – update in place
  – undo on abort
• Optimistic reads:
  – versioning
  – validation before commit
• Benefits:
  – Fast memory accesses (no buffering / object wrapping)
  – Minimal copying (no cloning for large objects)
  – Compatible with existing types & libraries

Similar STMs: Ennals (FastSTM); Harris et al. (PLDI ’06)

STM Data Structures

Per-thread:
• Transaction descriptor
  – Per-thread info for version validation, acquired locks, rollback
  – Maintained in read / write / undo logs
• Transaction memento
  – Checkpoint of logs for nesting / partial rollback

Per-data:
• Transaction record
  – Pointer-sized field guarding a set of shared data
  – Encodes the transactional state of the data:
    • Shared: version number (odd)
    • Exclusive: owner’s transaction descriptor (even / aligned)

Mapping Data to Transaction Record

Every data item has an associated transaction record.

    class Foo {
      int x;
      int y;
    }

Object granularity: the transaction record is embedded in the object (vtbl, TxR, x, y).

Word granularity: object words hash into a table of TxRs (TxR1, TxR2, …, TxRn); the hash is f(obj.hash, offset).

Granularity of Conflict Detection

Object-level
• Cheaper operation
• Exposes CSE opportunities
• Lower overhead on 1P

Word-level
• Reduces false sharing
• Better scalability

    // Thread 1       // Thread 2
    a.x = …           … = … a.z …
    a.y = …

Mix & match
• Per-type basis
• E.g., word-level for arrays, object-level for non-arrays

Experiments

16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four processors)

Workloads
• Hashtable, binary tree, OO7 (OODBMS)
  – Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
  – Word / mixed where beneficial

Effectiveness of Compiler Optimizations

[Bar chart: % overhead on 1P over a thread-unsafe baseline (0–90%) for Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, and +Fast Path Inlining, on HashMap and TreeMap.]
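The transaction-record state described on the STM Data Structures slide — an odd value is a version number (data shared), an even, pointer-aligned value is the owning transaction descriptor (data exclusively locked) — can be sketched in a few lines of Java. The class and method names here are illustrative, not the McRT-STM API:

```java
// Sketch of a pointer-sized transaction record (names are hypothetical).
// Low bit set  => shared at some version number (odd).
// Low bit clear => exclusively owned; the word is an aligned descriptor pointer.
final class TxnRecord {
    static boolean isShared(long word)    { return (word & 1L) == 1L; }
    static boolean isExclusive(long word) { return (word & 1L) == 0L; }

    // Optimistic-read validation: the record must still be shared and
    // must hold the same version that was observed at read time.
    static boolean validate(long wordAtRead, long wordAtCommit) {
        return isShared(wordAtCommit) && wordAtRead == wordAtCommit;
    }

    // A committing writer releases the record at the next version;
    // adding 2 keeps the value odd.
    static long nextVersion(long version) { return version + 2; }

    public static void main(String[] args) {
        long version = 7L;          // odd => shared at version 7
        long descriptor = 0x1000L;  // even, aligned => owned by a writer
        System.out.println(isShared(version));             // true
        System.out.println(isExclusive(descriptor));       // true
        System.out.println(validate(version, version));    // true: unchanged
        System.out.println(validate(version, descriptor)); // false: locked since read
    }
}
```

Validation before commit simply rechecks each read record against the logged value, which is why reads need no locking on the fast path.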
Prior STMs typically incur ~2x overhead on 1P. With compiler optimizations:
• < 40% over no concurrency control
• < 30% over synchronization

Scalability: Java HashMap Shootout

Unsafe (java.util.HashMap)
• Thread-unsafe, no concurrency control

Synchronized
• Coarse-grain synchronization via the SynchronizedMap wrapper

Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides the bucket array into 16 segments (size / locking)

Atomic
• Transactional version via an “AtomicMap” wrapper

Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap

Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level

Scalability: 100% Gets

[Line chart: speedup over 1P Unsafe vs. number of processors (0–16) for Unsafe, Synchronized, Atomic (No Opt), Atomic, and Concurrent.]

The Atomic wrapper is competitive with ConcurrentHashMap, and the effect of the compiler optimizations scales.

Scalability: 20% Gets / 80% Updates

[Line chart: speedup over 1P Unsafe vs. number of processors for Synchronized, Concurrent, Atomic (No Opt), and Atomic.]

ConcurrentHashMap thrashes on its 16 segments; Atomic still scales.

20% Inserts and Removes

[Line chart: speedup over 1P Unsafe (0–3x) vs. number of processors for Synchronized, Concurrent, and Atomic.]

Atomic conflicts on the entire bucket array – the array is a single object.

20% Inserts and Removes: Word-Level

[Line chart: speedup over 1P Unsafe vs. number of processors for Synchronized, Concurrent, Object Atomic, and Word Atomic.]

We still conflict on the single size field in java.util.HashMap.

20% Inserts and Removes: Atomic Prime

[Line chart: speedup over 1P Unsafe vs. number of processors for Synchronized, Concurrent, Object Atomic, Word Atomic, and Word Atomic Prime.]

Atomic Prime tracks size per segment, lowering the bottleneck: no degradation, and a modest performance gain.
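Atomic Prime's per-segment size tracking, borrowed from ConcurrentHashMap's segmented design, can be illustrated outside a TM context with a striped counter. Inside a transaction these would be plain fields guarded by per-segment transaction records; the class and method names below are hypothetical:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Striped size counter: each update touches only one segment's slot,
// so concurrent inserts/removes on different segments no longer
// conflict on a single shared size field.
final class SegmentedSize {
    private final AtomicIntegerArray counts;

    SegmentedSize(int segments) { counts = new AtomicIntegerArray(segments); }

    private int segmentFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % counts.length();
    }

    void onInsert(Object key) { counts.incrementAndGet(segmentFor(key)); }
    void onRemove(Object key) { counts.decrementAndGet(segmentFor(key)); }

    // The full size is needed only rarely, so it is summed on demand.
    int size() {
        int total = 0;
        for (int i = 0; i < counts.length(); i++) total += counts.get(i);
        return total;
    }

    public static void main(String[] args) {
        SegmentedSize s = new SegmentedSize(16);
        s.onInsert("a");
        s.onInsert("b");
        s.onRemove("a");
        System.out.println(s.size()); // prints 1
    }
}
```

The design trade-off matches the chart above: updates become nearly conflict-free, while a rare size() query pays the cost of reading every segment.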
20% Inserts and Removes: Mixed-Level

[Line chart: speedup over 1P Unsafe vs. number of processors for Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, and Mixed Atomic Prime.]

Mixed-level conflict detection preserves the wins and reduces overheads:
• word-level for arrays
• object-level for non-arrays

Key Takeaways

Optimistic reads + pessimistic writes is a nice sweet spot.

Compiler optimizations significantly reduce STM overhead:
• 20–40% over thread-unsafe
• 10–30% over synchronized

Simple atomic wrappers are sometimes good enough: minor modifications give performance competitive with complex fine-grain synchronization.

Word-level contention is crucial for large arrays; mixed contention provides the best of both.

Novel Contributions

• Rich transactional language constructs in Java
• Efficient, first-class nested transactions
• Complete GC support
• RISC-like STM API
• Compiler optimizations
• Per-type word- and object-level conflict detection
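The word-granularity side of that last contribution — object words hashing into a table of transaction records via f(obj.hash, offset), as on the "Mapping Data to Transaction Record" slide — might look like the following sketch. The particular mixing function is an assumption; the deck only states that the index is some function of the object hash and the field offset:

```java
// Sketch of word-granularity conflict detection: each (object, word offset)
// pair maps to a slot in a fixed global table of pointer-sized transaction
// records. Class and method names are hypothetical.
final class TxnRecordTable {
    private final long[] records; // one transaction record per slot
    private final int mask;       // table size must be a power of two

    TxnRecordTable(int sizePow2) {
        records = new long[sizePow2];
        mask = sizePow2 - 1;
    }

    // Assumed mixing function standing in for f(obj.hash, offset).
    int slotFor(Object obj, int wordOffset) {
        int h = System.identityHashCode(obj) * 31 + wordOffset;
        h ^= h >>> 16; // spread high bits before masking
        return h & mask;
    }

    long recordFor(Object obj, int wordOffset) {
        return records[slotFor(obj, wordOffset)];
    }

    public static void main(String[] args) {
        TxnRecordTable t = new TxnRecordTable(1024);
        long[] a = new long[2];
        // The same word always maps to the same record within a run...
        System.out.println(t.slotFor(a, 0) == t.slotFor(a, 0)); // true
        // ...and the slot is always within the table bounds.
        int s = t.slotFor(a, 1);
        System.out.println(s >= 0 && s < 1024);                 // true
    }
}
```

Because distinct words of one array usually hash to distinct records, two threads updating different elements need not conflict, which is exactly the false-sharing reduction the word-level results above depend on.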