Compiler and Runtime Support for Efficient Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation
Multi-core architectures are mainstream
– Software concurrency needed for scalability
Concurrent programming is hard
– Difficult to reason about shared data
Traditional mechanism: lock-based synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable
New solution: Transactional Memory (TM)
– Simpler programming model: atomicity, consistency, isolation
– No deadlocks
– Composability
– Optimistic concurrency
Analogy
• GC : Memory allocation ≈ TM : Mutual exclusion
Composability
class Bank {
    ConcurrentHashMap accounts;
    …
    void deposit(String name, int amount) {
        synchronized (accounts) {
            int balance = accounts.get(name); // Get the current balance
            balance = balance + amount;       // Increment it
            accounts.put(name, balance);      // Set the new balance
        }
    }
    …
}
Thread-safe – but no scaling
• ConcurrentHashMap (Java 5 / JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
Transactional solution
class Bank {
    HashMap accounts;
    …
    void deposit(String name, int amount) {
        atomic {
            int balance = accounts.get(name); // Get the current balance
            balance = balance + amount;       // Increment it
            accounts.put(name, balance);      // Set the new balance
        }
    }
    …
}
The underlying system provides:
• isolation (thread safety)
• optimistic concurrency
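Composability is the payoff: because atomic blocks nest, a transfer can be built from existing methods without knowing their synchronization internals. A sketch in the deck's Transactional Java syntax (not compilable Java; `withdraw` is a hypothetical method assumed to mirror `deposit`):

```java
// Hypothetical: composes deposit with an assumed withdraw method.
void transfer(String from, String to, int amount) {
    atomic {                      // outer transaction; inner atomics nest
        withdraw(from, amount);
        deposit(to, amount);
    }                             // either both updates commit or neither does
}
```

With locks, the same composition risks deadlock and exposes the intermediate state; here the outer transaction makes the pair atomic as a unit.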
Transactions are Composable
[Figure: Scalability, 10,000,000 operations. Scalability (0–4) vs. number of processors (0–16) for Synchronized and Transactional, on a 16-way 2.2 GHz Xeon system.]
Our System
A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT
Novel features
– Rich transactional language constructs in Java
– Efficient, first-class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word- and object-level conflict detection
– Complete GC support
System Overview
Transactional Java
  → (Polyglot) → Java + STM API
  → (StarJIT) → Transactional STIR → Optimized T-STIR
  → Native Code
Runs on the ORP VM with the McRT STM runtime.
Transactional Java
Java + new language constructs:
• Atomic: execute block atomically
    atomic {S}
• Retry: block until alternate path possible
    atomic {… retry; …}
• Orelse: compose alternate atomic blocks
    atomic {S1} orelse {S2} … orelse {Sn}
• Tryatomic: atomic with escape hatch
    tryatomic {S} catch (TxnFailed e) {…}
• When: conditionally atomic region
    when (condition) {S}
Builds on prior research:
Concurrent Haskell, CAML, CILK, Java
HPCS languages: Fortress, Chapel, X10
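A sketch of how these constructs might combine, in the extended syntax above (not compilable Java; `BoundedBuffer` and its methods are hypothetical): retry blocks the consumer until another transaction's commit can make the condition true, and orelse falls back to an alternative.

```java
// Hypothetical example in the Transactional Java syntax.
int takeFromEither(BoundedBuffer a, BoundedBuffer b) {
    atomic {
        if (a.isEmpty()) retry;   // wait until some transaction fills a
        return a.remove();
    } orelse {
        if (b.isEmpty()) retry;   // otherwise try b; retry waits on both
        return b.remove();
    }
}
```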
Transactional Java → Java
Transactional Java:
atomic {
    S;
}
Standard Java + STM API:
while (true) {
    TxnHandle th = txnStart();
    try {
        S’;
        break;
    } finally {
        if (!txnCommit(th))
            continue;
    }
}
STM API:
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...
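The expansion's control flow is subtle: the `continue` in the `finally` clause overrides the `break` whenever commit fails, re-running the body. A toy, single-threaded Java model of just that loop (txnStart/txnCommit here are hypothetical fakes, rigged to fail the first commit):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the atomic-block expansion: retry the body until txnCommit succeeds.
final class RetryLoopDemo {
    static final AtomicInteger commits = new AtomicInteger();
    static int attempts;

    static int txnStart() { return commits.get(); }                   // fake handle
    static boolean txnCommit(int th) { return commits.incrementAndGet() > 1; } // fails once

    static void runAtomically(Runnable body) {
        while (true) {
            int th = txnStart();
            try {
                attempts++;
                body.run();
                break;                     // reached only if the body completes
            } finally {
                if (!txnCommit(th))
                    continue;              // overrides the break: re-execute body
            }
        }
    }
}
```

Running the loop once shows the body executing twice: the first commit fails and the `continue` restarts the transaction.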
JVM STM support
On-demand cloning of methods called inside transactions
Garbage collection support
• Enumeration of refs in read set, write set & undo log
Extra transaction record field in each object
• Supports both word & object granularity
Native method invocation throws exception inside transaction
• Some intrinsic functions allowed
Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API
Background: McRT-STM
STM for:
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)
Writes:
– strict two-phase locking
– update in place
– undo on abort
Reads:
– versioning
– validation before commit
Granularity per type:
– object-level: small objects
– word-level: large arrays
Benefits:
– fast memory accesses (no buffering / object wrapping)
– minimal copying (no cloning for large objects)
– compatible with existing types & libraries
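The write path above (update in place, log the old value, undo on abort) can be modeled with a minimal single-threaded Java sketch; `ToyTxn` and its int-array "heap" are illustrative stand-ins, not the McRT-STM implementation:

```java
import java.util.ArrayDeque;

// Toy model of the McRT-STM write path: in-place update plus an undo log.
final class ToyTxn {
    private final ArrayDeque<Runnable> undo = new ArrayDeque<>();
    private final int[] heap;   // stand-in for transactional memory

    ToyTxn(int[] heap) { this.heap = heap; }

    void write(int addr, int value) {
        final int old = heap[addr];
        undo.push(() -> heap[addr] = old);  // log old value before updating in place
        heap[addr] = value;
    }
    void abort()  { while (!undo.isEmpty()) undo.pop().run(); } // roll back newest-first
    void commit() { undo.clear(); }  // nothing to write back: updates already in place
}
```

This shows why commits are fast in this design (the data is already in place) while aborts pay for the rollback.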
Ensuring Atomicity: Novel Combination
Reads → optimistic concurrency:
+ fast reads
+ avoids lock operations
+ caching effects
Writes → pessimistic concurrency:
+ in-place updates
+ fast commits
Quantitative results in PPoPP’06
McRT-STM: Example
…
atomic {
    B = A + 5;
}
…
becomes:
…
stmStart();
temp = stmRd(A);
stmWr(B, temp + 5);
stmCommit();
…
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts
Transaction Record
Pointer-sized record per object / word
Two states:
• Shared (low bit is 1)
  – Read-only / multiple readers
  – Value is version number (odd)
• Exclusive (low bit is 0)
  – Write-only / single owner
  – Value is thread transaction descriptor (4-byte aligned)
Mapping:
• Object: slot in object
• Field: hashed index into global record table
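The encoding exploits alignment: an odd value can never be a 4-byte-aligned pointer, so the low bit alone distinguishes the two states. A minimal Java sketch, assuming the record fits in a `long` (names are illustrative):

```java
// Sketch of the assumed encoding: shared records hold an odd version number
// (low bit 1); exclusive records hold a 4-byte-aligned descriptor address
// (low bits 00), so the low bit discriminates the states.
final class TxnRecordState {
    static boolean isShared(long rec)    { return (rec & 1L) != 0; }
    static boolean isExclusive(long rec) { return (rec & 1L) == 0; }
    static long versionOf(long rec)      { return rec; }  // valid only when shared
}
```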
Transaction Record: Example
Every data item has an associated transaction record.

Object granularity:
class Foo {
    int x;
    int y;
}
Layout: vtbl | TxR | x | y
– Extra transaction record field in each object

Word granularity:
class Foo {
    int x;
    int y;
}
Layout: vtbl | hash | x | y
– Object words hash into a global table of TxRs (TxR1, TxR2, TxR3, … TxRn)
– Hash is f(obj.hash, offset)
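The slide leaves f(obj.hash, offset) unspecified; a plausible Java sketch, assuming a power-of-two global table and a simple mixing function (both are illustrative assumptions, not the McRT-STM hash):

```java
// Hypothetical word-granularity lookup into a global transaction-record table.
final class TxnRecordTable {
    static final int SIZE = 1 << 20;              // 2^20 records (assumed)
    static final long[] records = new long[SIZE];

    // f(obj.hash, offset): mix the object's identity hash with the field offset.
    static int slot(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;       // illustrative mixing only
        return h & (SIZE - 1);                    // mask into the power-of-two table
    }
}
```

Distinct fields of one object map to distinct records (false sharing aside), which is what makes word granularity win on large arrays.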
Transaction Descriptor
Descriptor per thread
– Info for version validation, lock release, undo on abort, …
Read and write set: {<Ti, Ni>}
– Ti: transaction record
– Ni: version number
Undo log: {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)
In an atomic region:
– a read appends to the read set
– a write appends to the write set and undo log
– GC enumerates read/write/undo logs
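A hedged Java sketch of these per-thread structures (field names and types are illustrative; the type tag Ki is omitted for brevity):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative per-thread descriptor: read set <Ti, Ni> and undo log <Ai, Oi, Vi>.
final class TxnDescriptor {
    record ReadEntry(long txnRecord, long version) {}              // <Ti, Ni>
    record UndoEntry(long address, Object owner, Object originalValue) {} // <Ai, Oi, Vi>

    final List<ReadEntry> readSet = new ArrayList<>();
    final List<UndoEntry> undoLog = new ArrayList<>();

    void logRead(long rec, long version) { readSet.add(new ReadEntry(rec, version)); }
    void logWrite(long addr, Object owner, Object oldVal) {
        undoLog.add(new UndoEntry(addr, owner, oldVal));  // a write also joins the write set
    }
}
```

Keeping the containing object Oi (not just a raw address) is what lets the GC enumerate and update these logs when objects move.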
McRT-STM: Example
class Foo {
    int x;
    int y;
}
Foo bar, foo;

T1:
atomic {
    t = foo.x;
    bar.x = t;
    t = foo.y;
    bar.y = t;
}
T2:
atomic {
    t1 = bar.x;
    t2 = bar.y;
}
• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
T1:
stmStart();
t = stmRd(foo.x);
stmWr(bar.x, t);
t = stmRd(foo.y);
stmWr(bar.y, t);
stmCommit();
T2:
stmStart();
t1 = stmRd(bar.x);
t2 = stmRd(bar.y);
stmCommit();
• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
[Figure: animation of the two transactions racing. foo has version 3 with x=9, y=7; bar starts at version 5 with x=0, y=0.]
– T1 acquires bar exclusively (its record points to T1’s descriptor); T1’s reads are <foo, 3> <foo, 3>, its write set is <bar, 5>, and its undo log is <bar.x, 0> <bar.y, 0>
– T2’s reads are <bar, 5> <bar, 7>: T2 waits while T1 holds bar, T1 commits (bar’s version becomes 7), and T2’s validation detects the version mismatch and aborts
• T2 should read [0, 0] or [9, 7], never an intermediate state
Early Results: Overhead breakdown
[Figure: STM time breakdown (application, TLS access, STM write, STM commit, STM validate, STM read; 0–100%) on a single processor for Binary tree, Hashtable, Linked list, and Btree.]
STM read & validation overheads dominate
Good optimization targets
Leveraging the JIT
StarJIT: High-performance dynamic compiler
• Identifies transactional regions in Java+STM code
• Differentiates top-level and nested transactions
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in STIR
Good compiler representation →
greater optimization opportunities
Representing Read/Write Barriers
Traditional barriers hide redundant locking/logging:

atomic {
    a.x = t1;
    a.y = t2;
    if (a.z == 0) {
        a.x = 0;
        a.z = t3;
    }
}
becomes:
stmWr(&a.x, t1);
stmWr(&a.y, t2);
if (stmRd(&a.z) == 0) {
    stmWr(&a.x, 0);
    stmWr(&a.z, t3);
}
An STM IR for Optimization
Redundancies exposed:

atomic {
    a.x = t1;
    a.y = t2;
    if (a.z == 0) {
        a.x = 0;
        a.z = t3;
    }
}
becomes:
txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnOpenForWrite(a);
txnLogObjectInt(&a.y, a);
a.y = t2;
txnOpenForRead(a);
if (a.z == 0) {
    txnOpenForWrite(a);
    txnLogObjectInt(&a.x, a);
    a.x = 0;
    txnOpenForWrite(a);
    txnLogObjectInt(&a.z, a);
    a.z = t3;
}
Optimized Code
Fewer & cheaper STM operations:

atomic {
    a.x = t1;
    a.y = t2;
    if (a.z == 0) {
        a.x = 0;
        a.z = t3;
    }
}
becomes:
txnOpenForWrite(a);
txnLogObjectInt(&a.x, a);
a.x = t1;
txnLogObjectInt(&a.y, a);
a.y = t2;
if (a.z == 0) {
    a.x = 0;
    txnLogObjectInt(&a.z, a);
    a.z = t3;
}
Compiler Optimizations for Transactions
Standard optimizations
• CSE, dead-code elimination, …
• Subtle in presence of nesting
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
– Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
– Word / mixed where beneficial
Effectiveness of Compiler Optimizations
[Figure: % overhead on 1P over a thread-unsafe baseline (0–90%) for HashMap and TreeMap, comparing Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, and +Fast Path Inlining.]
Prior STMs typically incur ~2x on 1P
With compiler optimizations:
– < 40% over no concurrency control
– < 30% over synchronization
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe, without concurrency control
Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 → Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (size / locking)
Atomic
• Transactional version via “AtomicMap” wrapper
Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap
Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
Scalability: 100% Gets
[Figure: speedup over 1P Unsafe (0–16) vs. number of processors (0–16) for Unsafe, Synchronized, Atomic (No Opt), Atomic, and Concurrent.]
Atomic wrapper is competitive with ConcurrentHashMap
The effect of the compiler optimizations scales
Scalability: 20% Gets / 80% Updates
[Figure: speedup over 1P Unsafe (0–16) vs. number of processors for Synchronized, Concurrent, Atomic (No Opt), and Atomic.]
ConcurrentHashMap thrashes on 16 segments
Atomic still scales
20% Inserts and Removes
[Figure: speedup over 1P Unsafe (0–3) vs. number of processors for Synchronized, Concurrent, and Atomic.]
Atomic conflicts on the entire bucket array
– The array is a single object
20% Inserts and Removes: Word-Level
[Figure: speedup over 1P Unsafe (0–3) vs. number of processors, adding Object Atomic and Word Atomic to Synchronized and Concurrent.]
We still conflict on the single size field in java.util.HashMap
20% Inserts and Removes: Atomic Prime
[Figure: speedup over 1P Unsafe (0–3) vs. number of processors, adding Word Atomic Prime to Synchronized, Concurrent, Object Atomic, and Word Atomic.]
Atomic Prime tracks size per segment, lowering the bottleneck
No degradation, modest performance gain
20% Inserts and Removes: Mixed-Level
[Figure: speedup over 1P Unsafe (0–3) vs. number of processors, adding Mixed Atomic Prime to Synchronized, Concurrent, Object Atomic, Word Atomic, and Word Atomic Prime.]
Mixed-level preserves the wins & reduces overheads
– word-level for arrays
– object-level for non-arrays
Scalability: java.util.TreeMap
[Figure: two panels of scalability vs. number of processors (0–16): 100% Gets (scalability 0–16) and 80% Gets (scalability 0–1.2), comparing Unsafe, Synchronized, Atomic, and Atomic Prime.]
Results similar to HashMap
Scalability: OO7 – 80% Reads
Operations & traversal over a synthetic database
[Figure: scalability (0–6) vs. number of processors (0–16) for Atomic, Synch (Coarse), Synch (Med.), and Synch (Fine).]
“Coarse” atomic is competitive with medium-grain synchronization
Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
– 20–40% over thread-unsafe
– 10–30% over synchronized
Simple atomic wrappers are sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level conflict detection is crucial for large arrays
Mixed granularity provides the best of both
Research challenges
Performance
– Compiler optimizations
– Hardware support
– Dealing with contention
Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions
Debugging & performance analysis tools
System integration
Conclusions
Rich transactional language constructs in Java
Efficient, first-class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word- and object-level conflict detection
Complete GC support