Concurrent Data Structures in Architectures with Limited Shared Memory Support

Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden
([email protected])

Concurrent Data Structures
• Parallel/concurrent programming:
– Threads/processes share data in a uniform address space (shared memory)
• Inter-process/thread communication and synchronization
– Both a tool and a goal

Concurrent Data Structures: Implementations
• Coarse-grained locking
– Easy, but slow
• Fine-grained locking
– Fast and scalable, but error-prone: deadlocks
• Non-blocking
– Atomic hardware primitives (e.g. TAS, CAS)
– Good progress guarantees (lock-/wait-freedom)
– Scalable

What’s happening in hardware?
• Multi-cores → many-cores
– The “cache coherency wall” [Kumar et al 2011]
– A shared address space will not scale
– Universal atomic primitives (CAS, LL/SC) are harder to implement
• Shared memory → message passing

Many-core architectures
• Networks on chip (NoC)
• Short distances between cores
• Message passing model support
• Shared memory support
• Cache coherency eliminated
• Limited support for synchronization primitives
Can we have data structures that are fast, scalable, and provide good progress guarantees?

Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion

Single-chip Cloud Computer (SCC)
• Experimental processor by Intel
• 48 independent x86 cores, arranged on 24 tiles
• A NoC connects all the tiles
• One TestAndSet register per core

SCC: Architecture Overview
• Message Passing Buffer (MPB): 16 KB per tile
• Memory controllers: to private and shared main memory
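The per-core TestAndSet register is the only hardware synchronization primitive the SCC offers, so a spinlock built on it is the natural building block for the lock-based designs that follow. The sketch below is illustrative only: a C11 atomic_flag stands in for the memory-mapped SCC register, and the names (tas_lock_t, tas_acquire) are assumptions, not the authors' code.

```c
#include <stdatomic.h>

/* Illustrative TAS-based spinlock. On the real SCC the flag would be
 * the core's memory-mapped test-and-set register; here a C11
 * atomic_flag stands in for it. */
typedef atomic_flag tas_lock_t;

void tas_lock_init(tas_lock_t *l) {
    atomic_flag_clear(l);                 /* lock starts out free */
}

void tas_acquire(tas_lock_t *l) {
    /* test-and-set returns the previous value: spin while it was set */
    while (atomic_flag_test_and_set(l)) {
        /* busy-wait */
    }
}

void tas_release(tas_lock_t *l) {
    atomic_flag_clear(l);
}
```

Note that on the SCC, acquiring a lock hosted on a remote core costs a round trip over the NoC, which is why lock placement matters in the evaluation later on.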
Programming Challenges on the SCC
• Message passing, but…
– The MPB is too small for large data transfers
– Data replication is difficult
• No universal atomic primitives (e.g. CAS), hence no wait-free implementations [Herlihy91]

Concurrent FIFO Queues
• Main idea:
– Data are stored in shared off-chip memory
– Message passing is used for communication/coordination
• Two design methodologies:
– Lock-based synchronization (2-lock Queue)
– Message passing-based synchronization (MP-Queue, MP-Acks)

2-lock Queue
• Array-based, in shared off-chip memory (SHM)
• Head/Tail pointers in MPBs
• One lock for each pointer [Michael&Scott96]
• TAS-based locks, hosted on two cores

2-lock Queue: “Traditional” Enqueue Algorithm
• Acquire lock
• Read and update Tail pointer (MPB)
• Add data (SHM)
• Release lock

2-lock Queue: Optimized Enqueue Algorithm
• Acquire lock
• Read and update Tail pointer (MPB)
• Release lock
• Add data to the node in SHM
• Set the node’s memory flag to dirty
Why? No cache coherency!

2-lock Queue: Dequeue Algorithm
• Acquire lock
• Read and update Head pointer
• Release lock
• Check flag
• Read node data
What about progress?

2-lock Queue: Implementation
• Data nodes in SHM; Head/Tail pointers in MPB
• Locks: on which tile(s)?
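The optimized enqueue and dequeue steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the capacity, names, and the use of C11 atomics in place of MPB reads/writes and the SCC TAS registers are all mine. The key point it shows is that the lock covers only the pointer update, while the data write and the per-slot dirty flag happen after the lock is released.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 1024   /* assumed capacity; overflow checks omitted */

/* Sketch of the optimized 2-lock queue: without cache coherency, the
 * dirty flag (not the pointer) tells a dequeuer the data has arrived. */
typedef struct {
    int         data[QSIZE];
    atomic_bool dirty[QSIZE];          /* slot holds valid data      */
    unsigned    head, tail;            /* guarded by the two locks   */
    atomic_flag head_lock, tail_lock;
} queue_t;

static void lock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) {} }
static void unlock(atomic_flag *l) { atomic_flag_clear(l); }

void queue_init(queue_t *q) {
    q->head = q->tail = 0;
    for (unsigned i = 0; i < QSIZE; i++) atomic_init(&q->dirty[i], false);
    atomic_flag_clear(&q->head_lock);
    atomic_flag_clear(&q->tail_lock);
}

void enqueue(queue_t *q, int v) {
    lock(&q->tail_lock);
    unsigned slot = q->tail++ % QSIZE;   /* read & update Tail       */
    unlock(&q->tail_lock);               /* release BEFORE the write */
    q->data[slot] = v;                   /* add data to node in SHM  */
    atomic_store(&q->dirty[slot], true); /* set the flag to dirty    */
}

bool dequeue(queue_t *q, int *out) {
    lock(&q->head_lock);
    if (q->head == q->tail) {            /* empty (simplified check) */
        unlock(&q->head_lock);
        return false;
    }
    unsigned slot = q->head++ % QSIZE;   /* read & update Head       */
    unlock(&q->head_lock);
    while (!atomic_load(&q->dirty[slot])) {} /* wait for the flag    */
    *out = q->data[slot];
    atomic_store(&q->dirty[slot], false);
    return true;
}
```

This also makes the slide's progress question concrete: a dequeuer that has claimed a slot spins until the matching enqueuer sets the flag, so a stalled enqueuer blocks exactly that dequeuer.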
Message Passing-based Queue
• Data nodes in SHM
• Access is coordinated by a server node, which keeps the Head/Tail pointers
• Enqueuers/dequeuers request access through dedicated slots in the MPB
• Successfully enqueued data are flagged with a dirty bit

MP-Queue
• Enqueue: get Tail from the server, add the data, set the dirty flag
• Dequeue: get Head from the server, spin on the dirty flag
• What if an enqueue fails and the node is never flagged?
• “Pairwise blocking”: only the one matching dequeue blocks

Adding Acknowledgements
• No more flags! An enqueuer sends an ACK when done
• The server maintains a private queue of pointers in SHM
• On ACK: the server adds the data location to its private queue
• On dequeue: the server returns only ACKed locations

MP-Acks
• No blocking between enqueues and dequeues

Evaluation
• Performance? Scalability?
• Is it the same for all cores?
• Benchmark: each core performs Enq/Deq operations at random
• High/low contention

Measures
• Throughput: data structure operations completed per time unit
• Fairness: with n_i the number of operations completed by core i during the interval Δt,
fairness(Δt) = min_i(n_i) / max_i(n_i)   [Cederman et al 2013]

Throughput – High Contention (results plot)
Fairness – High Contention (results plot)
Throughput vs. Lock Location (results plots)

Conclusion
• Lock-based queue
– High throughput
– Less fair
– Sensitive to lock locations and NoC performance
• MP-based queues
– Lower throughput
– Fairer
– Better liveness properties
– Promising scalability

Thank you!
[email protected], [email protected]

BACKUP SLIDES

Experimental Setup
• 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
• Randomized Enq/Deq operations
• High/low contention
• One thread per core
• 600 ms per execution
• Averaged over 12 runs

Concurrent FIFO Queues
• Typical 2-lock queue [Michael&Scott96]
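The "typical 2-lock queue" of the backup slide, due to Michael and Scott (1996), can be sketched as below. This is a generic textbook rendering for an ordinary cache-coherent machine (pthread mutexes, heap-allocated linked list with a dummy node); the identifiers are mine. It is the baseline that the SCC variants in this talk adapt to the no-coherency setting.

```c
#include <stdlib.h>
#include <stdbool.h>
#include <pthread.h>

/* Classic two-lock queue of Michael & Scott: a linked list with a
 * dummy node; one mutex guards Head, another guards Tail, so
 * enqueues and dequeues never contend with each other. */
typedef struct node { int val; struct node *next; } node_t;

typedef struct {
    node_t *head, *tail;               /* head points at a dummy node */
    pthread_mutex_t head_lock, tail_lock;
} msq_t;

void msq_init(msq_t *q) {
    node_t *dummy = calloc(1, sizeof *dummy);
    q->head = q->tail = dummy;
    pthread_mutex_init(&q->head_lock, NULL);
    pthread_mutex_init(&q->tail_lock, NULL);
}

void msq_enqueue(msq_t *q, int v) {
    node_t *n = calloc(1, sizeof *n);
    n->val = v;
    pthread_mutex_lock(&q->tail_lock);
    q->tail->next = n;                 /* link after current tail */
    q->tail = n;                       /* swing Tail to new node  */
    pthread_mutex_unlock(&q->tail_lock);
}

bool msq_dequeue(msq_t *q, int *out) {
    pthread_mutex_lock(&q->head_lock);
    node_t *dummy = q->head;
    node_t *first = dummy->next;
    if (first == NULL) {               /* queue is empty */
        pthread_mutex_unlock(&q->head_lock);
        return false;
    }
    *out = first->val;
    q->head = first;                   /* first becomes the new dummy */
    pthread_mutex_unlock(&q->head_lock);
    free(dummy);
    return true;
}
```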