Distributed Computing and Systems
Chalmers University of Technology
Gothenburg, Sweden
Concurrent Data Structures in Architectures with Limited Shared Memory Support
Ivan Walulya
Yiannis Nikolakopoulos
Marina Papatriantafilou
Philippas Tsigas
Concurrent Data Structures
• Parallel/Concurrent programming:
– Share data among threads/processes,
sharing a uniform address space
(shared memory)
• Inter-process/thread communication
and synchronization
– Both a tool and a goal
Concurrent Data Structures:
Implementations
• Coarse grained locking
– Easy but slow...
• Fine grained locking
– Fast/scalable but: error-prone, deadlocks
• Non-blocking
– Atomic hardware primitives (e.g. TAS, CAS)
– Good progress guarantees (lock/wait-freedom)
– Scalable
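To make the non-blocking style concrete, here is a minimal sketch (not from the talk) of a lock-free counter increment built on a C11 compare-and-swap loop; CAS is exactly the kind of universal primitive that, as later slides note, the SCC does not offer.

```c
#include <stdatomic.h>

static atomic_int counter;  /* shared among threads */

/* Lock-free increment: read the current value, then try to install
 * old+1 with a CAS. If another thread changed the counter in between,
 * the CAS fails, 'old' is refreshed with the observed value, and we
 * retry. No thread ever holds a lock, so a stalled thread cannot
 * block the others. */
int increment(void)
{
    int old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
        /* retry with the freshly observed value */
    }
    return old + 1;
}
```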
What’s happening in hardware?
• Multi-cores → many-cores
– “Cache coherency wall” [Kumar et al 2011]
– Shared address space will not scale
– Universal atomic primitives (CAS, LL/SC) harder to implement
• Shared memory → message passing
[Tile figure: IA core with local cache and shared local cache]
• Networks on chip (NoC)
• Short distance between cores
• Message passing model support
• Shared memory support
• Eliminated cache coherency
• Limited support for synchronization primitives
Can we have Data Structures:
Fast
Scalable
Good progress guarantees
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Single-chip Cloud Computer (SCC)
• Experimental processor by Intel
• 48 independent x86 cores arranged on 24 tiles
• NoC connects all tiles
• TestAndSet register per core
SCC: Architecture Overview
• Message Passing Buffer (MPB): 16 KB per tile
• Memory controllers: to private & shared main memory
Programming Challenges in SCC
• Message Passing but…
– MPB small for large data transfers
– Data Replication is difficult
• No universal atomic primitives (CAS);
no wait-free implementations [Herlihy91]
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Concurrent FIFO Queues
• Main idea:
– Data are stored in shared off-chip memory
– Message passing for communication/coordination
• 2 design methodologies:
– Lock-based synchronization (2-lock Queue)
– Message passing-based synchronization
(MP-Queue, MP-Acks)
2-lock Queue
• Array based, in shared off-chip memory (SHM)
• Head/Tail pointers in MPBs
• 1 lock for each pointer [Michael&Scott96]
• TAS based locks on 2 cores
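As a rough illustration (not the talk's actual code), such a lock can be built on the SCC's per-core test-and-set register; scc_test_and_set() and scc_reset() below are hypothetical wrappers around that memory-mapped register.

```c
#include <stdbool.h>

/* Hypothetical wrappers for the per-core test-and-set register:
 * scc_test_and_set(core) atomically tests the register on 'core' and
 * returns true if the lock was already held; scc_reset(core) frees it. */
extern bool scc_test_and_set(int core);
extern void scc_reset(int core);

typedef struct { int core; } tas_lock_t;  /* the lock lives on this core's register */

static void lock_acquire(tas_lock_t *l)
{
    while (scc_test_and_set(l->core)) {
        /* spin; a back-off here would reduce traffic on the mesh */
    }
}

static void lock_release(tas_lock_t *l)
{
    scc_reset(l->core);
}
```

The Head lock and the Tail lock are two such locks hosted on two different cores, one per pointer, so enqueuers and dequeuers do not contend with each other.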
2-lock Queue:
“Traditional” Enqueue Algorithm
• Acquire lock
• Read & Update Tail pointer (MPB)
• Add data (SHM)
• Release lock
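A minimal sketch of this enqueue, reusing the hypothetical tas_lock_t from the previous sketch; the mpb_* and shm_* helpers are placeholders for accessing the Tail index in the MPB and the data array in off-chip shared memory, not a real SCC API.

```c
#include <stddef.h>

typedef struct {
    tas_lock_t head_lock;   /* one lock per pointer, hosted on two cores */
    tas_lock_t tail_lock;
    unsigned   capacity;    /* number of slots in the off-chip array */
} queue_t;

/* Placeholder accessors for the MPB and off-chip shared memory. */
extern unsigned mpb_read_tail(queue_t *q);
extern void     mpb_write_tail(queue_t *q, unsigned tail);
extern void     shm_write_node(queue_t *q, unsigned slot, const void *data, size_t len);

void enqueue_traditional(queue_t *q, const void *data, size_t len)
{
    lock_acquire(&q->tail_lock);                  /* acquire lock           */
    unsigned slot = mpb_read_tail(q);             /* read Tail (MPB)        */
    mpb_write_tail(q, (slot + 1) % q->capacity);  /* update Tail (MPB)      */
    shm_write_node(q, slot, data, len);           /* add data (SHM), slow   */
    lock_release(&q->tail_lock);                  /* release lock           */
}
```

Note that the slow off-chip write happens while the lock is held, which is exactly what the optimized version on the next slide avoids.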
2-lock Queue:
Optimized Enqueue Algorithm
• Acquire lock
• Read & Update Tail pointer (MPB)
• Release lock
• Add data to node (SHM)
• Set memory flag to dirty
Why? No Cache Coherency!
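A sketch of the optimized version, continuing the same hypothetical types and helpers; the lock now covers only the Tail pointer, and the per-slot dirty flag set after the off-chip write is what publishes the data to dequeuers, since there is no cache coherency protocol to do it.

```c
/* Continues the hypothetical queue_t, lock and helper declarations above. */
extern void shm_set_dirty(queue_t *q, unsigned slot);  /* placeholder: marks the slot as written */

void enqueue_optimized(queue_t *q, const void *data, size_t len)
{
    lock_acquire(&q->tail_lock);                  /* acquire lock              */
    unsigned slot = mpb_read_tail(q);             /* read & update Tail (MPB)  */
    mpb_write_tail(q, (slot + 1) % q->capacity);
    lock_release(&q->tail_lock);                  /* release lock early        */

    shm_write_node(q, slot, data, len);           /* add data to node (SHM)    */
    shm_set_dirty(q, slot);                       /* set flag to dirty: without
                                                     cache coherency, this flag,
                                                     not the lock, makes the
                                                     write visible to dequeuers */
}
```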
2-lock Queue:
Dequeue Algorithm
• Acquire lock
• Read & Update Head pointer
• Release lock
• Check flag
• Read node data
What about progress?
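A matching dequeue sketch under the same assumptions; after releasing the lock the dequeuer still spins on the slot's dirty flag, which is where the progress question arises: a crashed or very slow enqueuer leaves the flag unset and that dequeuer blocks.

```c
/* Continues the hypothetical queue_t and helpers; the head-side
 * accessors mirror the tail-side ones. Empty-queue handling omitted. */
extern unsigned mpb_read_head(queue_t *q);
extern void     mpb_write_head(queue_t *q, unsigned head);
extern int      shm_is_dirty(queue_t *q, unsigned slot);
extern void     shm_read_node(queue_t *q, unsigned slot, void *out, size_t len);
extern void     shm_clear_dirty(queue_t *q, unsigned slot);

void dequeue(queue_t *q, void *out, size_t len)
{
    lock_acquire(&q->head_lock);                  /* acquire lock              */
    unsigned slot = mpb_read_head(q);             /* read & update Head (MPB)  */
    mpb_write_head(q, (slot + 1) % q->capacity);
    lock_release(&q->head_lock);                  /* release lock              */

    while (!shm_is_dirty(q, slot)) {              /* check flag: wait for the
                                                     enqueuer to finish         */
    }
    shm_read_node(q, slot, out, len);             /* read node data (SHM)      */
    shm_clear_dirty(q, slot);                     /* let the slot be reused    */
}
```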
2-lock Queue:
Implementation
[Figure: data nodes in off-chip SHM; Head/Tail pointers in the MPB; Locks? On which tile(s)?]
Message Passing-based Queue
• Data nodes in SHM
• Access coordinated by a Server node that keeps the Head/Tail pointers
• Enqueuers/Dequeuers request access through dedicated slots in the MPB
• Successfully enqueued data are flagged with a dirty bit
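A rough sketch of the server side of this scheme (the request codes, slot accessors, and sizes below are invented for illustration, not the talk's protocol): the server alone owns the Head/Tail pointers, polls each core's dedicated MPB slot, and answers with an index into the off-chip array.

```c
/* Hypothetical request codes and MPB slot accessors. */
enum { REQ_NONE, REQ_ENQ, REQ_DEQ };

extern int  mpb_slot_poll(int core);                  /* returns a REQ_* code          */
extern void mpb_slot_reply(int core, unsigned slot);  /* hands the requester a slot id */

#define NCORES   48
#define CAPACITY 1024

/* Server loop: since only the server touches Head/Tail, no locks are
 * needed. Empty/full checks are omitted to keep the sketch short. */
void server_loop(void)
{
    unsigned head = 0, tail = 0;
    for (;;) {
        for (int core = 0; core < NCORES; core++) {
            switch (mpb_slot_poll(core)) {
            case REQ_ENQ:                     /* reserve a slot; the enqueuer writes
                                                 the data there and sets its dirty bit */
                mpb_slot_reply(core, tail);
                tail = (tail + 1) % CAPACITY;
                break;
            case REQ_DEQ:                     /* hand out the oldest slot; the dequeuer
                                                 spins on its dirty bit before reading  */
                mpb_slot_reply(core, head);
                head = (head + 1) % CAPACITY;
                break;
            default:
                break;                        /* no request from this core */
            }
        }
    }
}
```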
MP-Queue
[Protocol figure: enqueuer sends ENQ, server replies with TAIL; dequeuer sends DEQ, server replies with HEAD; the enqueuer then ADDs the DATA while the dequeuer SPINs on the flag]
What if this fails and the node is never flagged? “Pairwise blocking”: only 1 dequeue blocks.
Adding Acknowledgements
• No more flags! Enqueue sends ACK when done
• Server maintains in SHM a private queue of pointers
• On ACK: Server adds data location to its private queue
• On Dequeue: Server returns only ACKed locations
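Continuing the server sketch under the same invented encoding, the acknowledgement variant could look roughly like this: the server keeps a private queue of ACKed slot indices in SHM and only ever returns those to dequeuers, so no dequeuer has to spin on a flag set by a slow enqueuer.

```c
/* Hypothetical additions to the earlier server sketch. */
enum { REQ_NONE, REQ_ENQ, REQ_DEQ, REQ_ACK };

extern int      mpb_slot_poll(int core);
extern void     mpb_slot_reply(int core, unsigned slot);
extern unsigned mpb_slot_acked_index(int core);   /* which slot the ACK refers to     */
extern void     priv_queue_push(unsigned slot);   /* server-private queue kept in SHM */
extern int      priv_queue_pop(unsigned *slot);   /* returns 0 if it is empty         */

/* One polling step for a single core; wrap-around of 'tail' is omitted. */
void handle_request(int core, unsigned *tail)
{
    unsigned slot;
    switch (mpb_slot_poll(core)) {
    case REQ_ENQ:                          /* reserve a slot, as before            */
        mpb_slot_reply(core, (*tail)++);
        break;
    case REQ_ACK:                          /* the enqueuer finished writing: only
                                              now does the slot become dequeueable */
        priv_queue_push(mpb_slot_acked_index(core));
        break;
    case REQ_DEQ:                          /* serve dequeues from ACKed slots only */
        if (priv_queue_pop(&slot))
            mpb_slot_reply(core, slot);
        /* else: nothing ACKed yet; reply "empty" (omitted) */
        break;
    default:
        break;
    }
}
```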
MP-Acks
[Protocol figure: enqueuer sends ENQ, server replies with TAIL, enqueuer sends ACK when done; dequeuer sends DEQ, server replies with HEAD]
No blocking between enqueues/dequeues.
Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion
Evaluation
• Performance? Scalability?
• Is it the same for all cores?
Benchmark:
• Each core performs Enq/Deq at random
• High/Low contention
Measures
• Throughput: data structure operations completed per time unit
• Fairness [Cederman et al 2013]:
  fairness_{\Delta t} = \min\left( \frac{\min_i(n_i)}{\frac{1}{N}\sum_i n_i}, \; \frac{\frac{1}{N}\sum_i n_i}{\max_i(n_i)} \right)
  where n_i is the number of operations completed by core i during the interval and N is the number of cores, so the middle term is the average operations per core.
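For concreteness, a small helper (illustrative only, not from the talk) that computes the fairness value above from per-core operation counts:

```c
#include <stddef.h>

/* fairness = min( min_i / avg, avg / max_i ), in (0, 1]; it is 1 when
 * every core completed the same number of operations in the interval. */
double fairness(const unsigned long *n, size_t N)
{
    unsigned long min = n[0], max = n[0], sum = 0;
    for (size_t i = 0; i < N; i++) {
        if (n[i] < min) min = n[i];
        if (n[i] > max) max = n[i];
        sum += n[i];
    }
    if (max == 0)
        return 1.0;                 /* no operations at all: trivially fair */
    double avg = (double)sum / (double)N;
    double a = (double)min / avg;   /* slowest core vs. the average */
    double b = avg / (double)max;   /* the average vs. fastest core */
    return a < b ? a : b;
}
```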
Throughput – High Contention
Fairness – High Contention
Throughput VS Lock Location
Conclusion
• Lock based queue
– High throughput
– Less fair
– Sensitive to lock locations, NoC performance
• MP based queues
– Lower throughput
– Fairer
– Better liveness properties
– Promising scalability
Thank you!
[email protected]
[email protected]
BACKUP SLIDES
Experimental Setup
• 533MHz cores, 800MHz mesh, 800MHz DDR3
• Randomized Enq/Deq operations
• High/Low contention
• One thread per core
• 600ms per execution
• Averaged over 12 runs
Concurrent FIFO Queues
• Typical 2-lock queue [Michael&Scott96]