Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
Ruoming Jin and Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University

Outline
- Motivation
- Random write reductions and parallelization techniques
- Problem definition
- Analytical model: general approach; modeling cache and TLB; modeling waiting for locks and memory contention
- Experimental validation
- Conclusions

Motivation
- Frequently need to mine very large datasets
- Large and powerful SMP machines are becoming available
- Vendors often target data mining and data warehousing as the main market
- Data mining is emerging as an important class of applications for SMP machines

Common Processing Structure
Structure of common data mining algorithms:

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

- Applies to major association mining, clustering, and decision tree construction algorithms
- How do we parallelize it on a shared memory machine?

Challenges in Parallelization
- Statically partitioning the reduction object to avoid race conditions is generally impossible
- Runtime preprocessing or scheduling also cannot be applied: you cannot tell what needs to be updated without processing the element
- The size of the reduction object means significant memory overheads for replication
- Locking and synchronization costs can be significant because of the fine-grained updates to the reduction object

Parallelization Techniques
- Full replication: create a copy of the reduction object for each thread
- Full locking: associate a lock with each element
- Optimized full locking: put the element and its corresponding lock on the same cache block
- Cache-sensitive locking: one lock for all elements in a cache block
(A code sketch of two of these schemes follows the memory-layout figure below.)

Memory Layout for Locking Schemes
[Figure: memory layout of optimized full locking (each reduction element paired with its own lock) versus cache-sensitive locking (one lock per cache block of reduction elements)]
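To make the schemes above concrete, here is a minimal sketch, not the FREERIDE implementation, of full replication and optimized full locking for the reduction loop, using C++ threads. process(), NUM_THREADS, REDUC_SIZE, and the interleaved partitioning of the input are illustrative assumptions, not details taken from the paper.

    // Minimal sketch: full replication vs. optimized full locking for the
    // reduction loop. All sizes and the process() function are placeholders.
    #include <mutex>
    #include <thread>
    #include <utility>
    #include <vector>

    constexpr int NUM_THREADS = 4;
    constexpr int REDUC_SIZE  = 1 << 20;   // number of reduction elements

    // Hypothetical process(): maps an input element to (index, value).
    static std::pair<int, double> process(int e) { return {e % REDUC_SIZE, 1.0}; }

    // Full replication: each thread updates a private copy of the reduction
    // object; the copies are merged after the loop, so no locks are needed.
    void full_replication(const std::vector<int>& input, std::vector<double>& reduc) {
        std::vector<std::vector<double>> copies(
            NUM_THREADS, std::vector<double>(REDUC_SIZE, 0.0));
        std::vector<std::thread> workers;
        for (int t = 0; t < NUM_THREADS; ++t)
            workers.emplace_back([&, t] {
                for (std::size_t i = t; i < input.size(); i += NUM_THREADS) {
                    auto [idx, val] = process(input[i]);
                    copies[t][idx] += val;               // race-free private update
                }
            });
        for (auto& w : workers) w.join();
        for (const auto& c : copies)                     // sequential merge phase
            for (int i = 0; i < REDUC_SIZE; ++i) reduc[i] += c[i];
    }

    // Optimized full locking: one lock per reduction element, stored next to
    // the element so both tend to fall on the same cache block (a real layout
    // would pad/align to the block size). Cache-sensitive locking would instead
    // share one lock among all elements of a cache block.
    struct LockedElem { std::mutex lock; double value = 0.0; };

    void optimized_full_locking(const std::vector<int>& input,
                                std::vector<LockedElem>& reduc) {
        // Caller sizes reduc up front, e.g. std::vector<LockedElem> r(REDUC_SIZE);
        // (mutexes are not movable, so the vector cannot be resized later).
        std::vector<std::thread> workers;
        for (int t = 0; t < NUM_THREADS; ++t)
            workers.emplace_back([&, t] {
                for (std::size_t i = t; i < input.size(); i += NUM_THREADS) {
                    auto [idx, val] = process(input[i]);
                    std::lock_guard<std::mutex> g(reduc[idx].lock);  // fine-grained lock
                    reduc[idx].value += val;
                }
            });
        for (auto& w : workers) w.join();
    }

Full replication trades memory (one copy of the reduction object per thread) for the elimination of synchronization, while the locking schemes keep a single copy and pay fine-grained synchronization costs; which choice wins is exactly the question the analytical model below is meant to answer.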
Relative Experimental Performance
- Different techniques can outperform each other depending upon problem and machine parameters
[Figure: relative performance of full replication (fr), optimized full locking (ofl), and cache-sensitive locking (csl); execution time in seconds at support levels of 0.10%, 0.05%, 0.03%, and 0.02%]

Problem Definition
- Can we predict the relative performance of the different techniques for given machine, algorithm, and dataset parameters?
- Develop an analytical model capturing the impact of the memory hierarchy and modeling the different parallelization overheads
- Other applications of the model:
  - Predicting speedups possible on parallel configurations
  - Predicting performance as the output size is increased
  - Scheduling and QoS in multiprogrammed environments
  - Choosing the accuracy of analysis and the sampling rate in an interactive environment or when mining over data streams

Context
- Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system
- Supports parallelization on shared-nothing configurations
- Supports parallelization on shared memory configurations
- Supports processing of large datasets
- We previously reported our work on parallelization techniques and processing of disk-resident datasets (SDM 01, SDM 02)

Analytical Model: Overview
- Input data is read from disks: constant processing time
- Reduction elements are accessed randomly, and their size can vary considerably
- Factors to model:
  - Cache misses on reduction elements (capacity and coherence)
  - TLB misses on reduction elements
  - Waiting time for locks
  - Memory contention

Basic Approach
- Focus on modeling reduction loops
- Tloop = Taverage * N
- Taverage = Tcompute + Treduc
- Treduc = Tupdate + Twait + Tcache_miss + Ttlb_miss + Tmemory_contention
- Tupdate can be computed by executing the loop with a reduction object that fits into the L1 cache

Modeling Waiting Time for Locks
- The time spent by a thread in one iteration of the loop can be divided into three components: computing independently (a), waiting for a lock (Twait), and holding a lock (b), where b = Treduc - Twait
- Each lock is modeled as an M/D/1 queue
- The rate at which requests to acquire a given lock are issued is λ = t / ((a + b + Twait) * m)
- The standard result on M/D/1 queues gives Twait = b * U / (2 * (1 - U)), where U is the server utilization, U = λ * b
- The resulting expression is Twait = b / (2 * (a/b + 1) * (m/t) - 1)
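As a worked illustration of the lock-waiting model, the sketch below solves the coupled equations λ = t / ((a + b + Twait) * m), U = λ * b, and Twait = b * U / (2 * (1 - U)) by simple fixed-point iteration. Reading t as the number of threads and m as the number of locks, and the numeric inputs themselves, are assumptions made for this example rather than values from the paper.

    // Minimal numeric sketch of the M/D/1 lock-waiting model above.
    // Assumptions: t = number of threads, m = number of locks; the inputs
    // below are illustrative, not measurements reported in the paper.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double t = 8.0;      // threads
        const double m = 1024.0;   // locks (one per element or per cache block)
        const double a = 200.0;    // time computing independently, per iteration
        const double b = 50.0;     // time holding a lock (Treduc - Twait)

        double Twait = 0.0;
        for (int iter = 0; iter < 1000; ++iter) {
            double lambda = t / ((a + b + Twait) * m);   // per-lock request rate
            double U      = lambda * b;                  // lock (server) utilization
            double next   = b * U / (2.0 * (1.0 - U));   // M/D/1 waiting time
            if (std::fabs(next - Twait) < 1e-12) { Twait = next; break; }
            Twait = next;
        }
        std::printf("predicted Twait per lock acquisition: %g time units\n", Twait);
        return 0;
    }

In this model, raising the number of threads t or reducing the number of locks m (as cache-sensitive locking does) increases the utilization U and hence the predicted waiting time.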
Modeling the Memory Hierarchy
- Need to model the L1 cache, the L2 cache, and TLB misses
- Ignore cold misses
- Only consider a directly-mapped cache, so capacity and conflict misses are analyzed together
- The analysis of capacity and conflict misses is simple because accesses to the reduction object are random

Modeling Coherence Cache Misses
- A coherence miss occurs when a cache block is invalidated by another CPU
- Analyze the probability that, between two accesses to a cache block on a processor, the same memory block is accessed and is not updated by one of the other processors in the meantime
- Details are available in the paper

Modeling Memory Contention
- Input elements displace reduction objects from the cache
- This results in a write-back followed by a read operation
- The memory system on many machines requires extra cycles to switch between write-back and read operations, which is a source of contention
- Modeled using M/D/1 queues, similarly to the waiting time for locks

Experimental Platform
- Small SMP machine: Sun Ultra Enterprise 450, 4 x 250 MHz Ultra-II processors, 1 GB of 4-way interleaved main memory
- Large SMP machine: Sun Fire 6800, 24 x 900 MHz Sun UltraSparc III processors, a 96 KB L1 cache and a 64 MB L2 cache per processor, 24 GB main memory

Impact of Memory Hierarchy, Large SMP
[Figures: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]

Modeling Parallel Performance with Locking, Large SMP
[Figure: parallel performance with cache-sensitive locking for small reduction object sizes, with 1, 2, 4, 8, and 12 threads]

Modeling Parallel Performance, Large SMP
[Figure: performance of optimized full locking with large reduction object sizes, with 1, 2, 4, and 8 threads]

How Good is the Model in Predicting Relative Performance? (Large SMP)
[Figure: performance of optimized full locking and cache-sensitive locking with 12 threads]

Impact of Memory Hierarchy, Small SMP
[Figures: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]

Parallel Performance, Small SMP
[Figure: performance of optimized full locking with 1, 2, and 3 threads]

Summary
- A new application of performance modeling: choosing among different parallelization techniques
- A detailed analytical model capturing the memory hierarchy and parallelization overheads
- Evaluated on two different SMP machines
- Predicted performance within 20% in almost all cases
- Effectively captures the impact of both the memory hierarchy and parallelization overheads
- Quite accurate in predicting the relative performance of the different techniques