Performance Prediction for Random Write Reductions:
A Case Study in Modelling Shared Memory Programs

Ruoming Jin and Gagan Agrawal
Department of Computer and Information Sciences
Ohio State University
Outline

- Motivation
- Random Write Reductions and Parallelization Techniques
- Problem Definition
- Analytical Model
  - General Approach
  - Modeling Cache and TLB
  - Modeling Waiting for Locks and Memory Contention
- Experimental Validation
- Conclusions
Motivation

- Frequently need to mine very large datasets
- Large and powerful SMP machines are becoming available
  - Vendors often target data mining and data warehousing as the main market
- Data mining emerging as an important class of applications for SMP machines
Common Processing Structure

- Structure of common data mining algorithms:

  {* Outer Sequential Loop *}
  While () {
      {* Reduction Loop *}
      Foreach (element e) {
          (i, val) = process(e);
          Reduc(i) = Reduc(i) op val;
      }
  }
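As a concrete illustration, here is a minimal sequential C++ sketch of this structure, assuming a hypothetical histogram-style reduction; process(), the bin count, and the input type are illustrative choices, not taken from the paper.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical instance of the generic reduction loop: a frequency
// histogram. process() maps an input element to (index, value).
std::pair<size_t, uint64_t> process(uint32_t e, size_t num_bins) {
    return {e % num_bins, 1};  // (i, val): bin index and count increment
}

// The reduction loop: Reduc(i) = Reduc(i) op val, with op = '+'.
void reduction_loop(const std::vector<uint32_t>& elements,
                    std::vector<uint64_t>& reduc) {
    for (uint32_t e : elements) {               // Foreach (element e)
        auto [i, val] = process(e, reduc.size());
        reduc[i] += val;
    }
}
```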
- Applies to major association mining, clustering, and decision tree construction algorithms
- How to parallelize it on a shared memory machine?
Challenges in Parallelization

- Statically partitioning the reduction object to avoid race conditions is generally impossible
- Runtime preprocessing or scheduling also cannot be applied
  - Can't tell what you need to update without processing the element
- The size of the reduction object means significant memory overheads for replication
- Locking and synchronization costs could be significant because of the fine-grained updates to the reduction object
Parallelization Techniques

- Full Replication: create a copy of the reduction object for each thread (sketched below)
- Full Locking: associate a lock with each element
- Optimized Full Locking: put the element and its corresponding lock on the same cache block
- Cache-Sensitive Locking: one lock for all elements in a cache block
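A minimal sketch of the first technique, full replication, reusing the hypothetical histogram reduction from above; the chunked partitioning and the merge loop are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Full replication: each thread updates a private copy of the reduction
// object, so no locking is needed; copies are merged after the loop.
void full_replication(const std::vector<uint32_t>& elements,
                      std::vector<uint64_t>& reduc, unsigned num_threads) {
    std::vector<std::vector<uint64_t>> copies(
        num_threads, std::vector<uint64_t>(reduc.size(), 0));
    std::vector<std::thread> workers;
    const size_t chunk = (elements.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const size_t begin = t * chunk;
            const size_t end = std::min(begin + chunk, elements.size());
            for (size_t k = begin; k < end; ++k)          // private updates:
                copies[t][elements[k] % reduc.size()] += 1;  // no races
        });
    }
    for (auto& w : workers) w.join();
    for (const auto& copy : copies)         // merge step: memory overhead
        for (size_t i = 0; i < reduc.size(); ++i)  // is one copy per thread
            reduc[i] += copy[i];
}
```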
Memory Layout for Locking Schemes

[Figure: memory layouts. Optimized full locking interleaves each lock with its reduction element; cache-sensitive locking places one lock followed by the reduction elements of one cache block.]
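In C++, the two layouts might be expressed as below, assuming a 64-byte cache block and a one-byte spinlock; both are illustrative assumptions rather than the paper's implementation.

```cpp
#include <atomic>

// A one-byte spinlock keeps each lock small enough to share a cache block
// with reduction elements (sketch only; real code would add backoff).
struct SpinLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    void lock()   { while (f.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { f.clear(std::memory_order_release); }
};

// Optimized full locking: each element is padded so that it shares one
// 64-byte cache block with its own lock; acquiring the lock also brings
// the element into cache.
struct alignas(64) LockedElement {
    SpinLock lock;
    double value;
};

// Cache-sensitive locking: a single lock guards all the elements that fill
// the rest of one 64-byte cache block, reducing per-element lock overhead.
struct alignas(64) LockedBlock {
    SpinLock lock;       // 1 byte, padded to 8 by double alignment
    double values[7];    // 8 + 7*8 = 64 bytes in total
};
```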
Relative Experimental Performance

- Different techniques can outperform each other depending upon problem and machine parameters

[Figure: "Relative Performance of Full Replication, Optimized Full Locking and Cache-Sensitive Locking" – execution time in seconds for fr, ofl, and csl at support levels of 0.10%, 0.05%, 0.03%, and 0.02%.]
Problem Definition

- Can we predict the relative performance of the different techniques for given machine, algorithm, and dataset parameters?
- Develop an analytical model capturing the impact of the memory hierarchy and modeling the different parallelization overheads
- Other applications of the model:
  - Predicting speedups possible on parallel configurations
  - Predicting performance as the output size is increased
  - Scheduling and QoS in multiprogrammed environments
  - Choosing the accuracy of analysis and the sampling rate in an interactive environment or when mining over data streams
Context

- Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system
  - Supports parallelization on shared-nothing configurations
  - Supports parallelization on shared memory configurations
  - Supports processing of large datasets
- Previously reported our work on parallelization techniques and processing of disk-resident datasets (SDM 01, SDM 02)
Analytical Model: Overview

- Input data is read from disks – constant processing time
- Reduction elements are accessed randomly – their size can vary considerably
- Factors to model:
  - Cache misses on reduction elements -> capacity and coherence misses
  - TLB misses on reduction elements
  - Waiting time for locks
  - Memory contention
Basic Approach

- Focus on modeling reduction loops, where N is the number of iterations of the loop:

  Tloop = Taverage * N
  Taverage = Tcompute + Treduc
  Treduc = Tupdate + Twait + Tcache_miss + Ttlb_miss + Tmemory_contention

- Tupdate can be computed by executing the loop with a reduction object that fits into the L1 cache (see the sketch below)
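A sketch of how Tupdate could be measured, assuming a 4 KB reduction object (small enough to stay L1-resident) and the histogram reduction from earlier; subtracting a separately measured Tcompute is left out.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Time the reduction loop with an L1-resident reduction object, so the
// cache, TLB, and contention terms are (nearly) zero and the measured
// per-element time approximates Tcompute + Tupdate.
double measure_update_cost(const std::vector<uint32_t>& elements) {
    std::vector<uint64_t> reduc(512, 0);  // 4 KB: fits comfortably in L1
    auto start = std::chrono::steady_clock::now();
    for (uint32_t e : elements)
        reduc[e % reduc.size()] += 1;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count()
           / elements.size();  // ~ Tcompute + Tupdate per element
}
```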
Modeling Waiting Time for Locks

- The time spent by a thread in one iteration of the loop can be divided into three components:
  - Computing independently (a)
  - Waiting for a lock (Twait)
  - Holding a lock (b), where b = Treduc - Twait
- Each lock is modeled as an M/D/1 queue
- With t threads and m locks, the rate at which requests to acquire a given lock are issued is:

  λ = t / ((a + b + Twait) * m)
Modeling Waiting Time for Locks (cont.)

- The standard result for M/D/1 queues is:

  Twait = bU / (2(1 - U))

  where U is the server utilization, given by U = λb
- The resulting expression for Twait is:

  Twait = b / (2(a/b + 1)(m/t) - 1)
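A small helper evaluating this closed form; the parameter names follow the slide's symbols (a, b, m, t) and the numbers in main() are made up for illustration.

```cpp
#include <cstdio>

// Closed-form waiting time per lock acquisition from the M/D/1 analysis:
//   Twait = b / (2 * (a/b + 1) * (m/t) - 1)
// a: independent compute time per iteration, b: lock hold time,
// m: number of locks, t: number of threads.
double t_wait(double a, double b, double m, double t) {
    return b / (2.0 * (a / b + 1.0) * (m / t) - 1.0);
}

int main() {
    // Illustrative numbers only: 1 us compute, 0.1 us hold time,
    // 100,000 locks, 8 threads -> waiting time is negligible.
    std::printf("Twait = %g us\n", t_wait(1.0, 0.1, 100000, 8));
}
```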
Modeling Memory Hierarchy

- Need to model:
  - L1 cache misses
  - L2 cache misses
  - TLB misses
- Ignore cold misses
- Consider only a direct-mapped cache, so capacity and conflict misses can be analyzed together
- The analysis of capacity and conflict misses is simple because accesses to the reduction object are random (see the sketch below)
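To make the "simple analysis" concrete: under uniform random access through a direct-mapped cache, a reference hits only if its block is the one currently resident in its set, which happens with probability about (cache blocks)/(object blocks). The sketch below is a back-of-the-envelope version of that idea, not necessarily the paper's exact expressions.

```cpp
// Back-of-the-envelope capacity/conflict miss estimate for uniform random
// accesses to a reduction object of `object_blocks` cache blocks through
// a direct-mapped cache of `cache_blocks` blocks. An illustrative
// assumption, not the paper's exact model.
double estimated_miss_rate(double object_blocks, double cache_blocks) {
    if (object_blocks <= cache_blocks) return 0.0;  // object fits in cache
    // Each set holds the most recently accessed of the object blocks that
    // map to it, so a random access finds its block resident with
    // probability ~ cache_blocks / object_blocks.
    return 1.0 - cache_blocks / object_blocks;
}
```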
Modeling Coherence Cache Misses

- A coherence miss occurs when a cache block is invalidated by another CPU
- Analyze the probability that, between two accesses to a cache block on a processor, the same memory block is accessed and is not updated by any of the other processors in the meantime
- Details are available in the paper
Modeling Memory Contention

- Input elements displace reduction objects from the cache
- This results in a write-back followed by a read operation
- The memory system on many machines requires extra cycles to switch between write-back and read operations
- This switching is a source of contention
- Model it using M/D/1 queues, similar to the waiting time for locks (see the sketch below)
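A sketch of the generic M/D/1 waiting-time formula applied to the write-back/read turnaround; treating the turnaround penalty as the deterministic service time is an illustrative assumption, not the paper's exact parameterization.

```cpp
// Generic M/D/1 mean waiting time: W = s * U / (2 * (1 - U)), where s is
// the deterministic service time and U = lambda * s is the utilization.
// For memory contention, one illustrative choice (an assumption) is
// s = write-back/read turnaround penalty and lambda = rate at which
// displaced reduction-object blocks reach the memory system.
double md1_wait(double service_time, double arrival_rate) {
    double u = arrival_rate * service_time;  // utilization, must be < 1
    return service_time * u / (2.0 * (1.0 - u));
}
```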
Experimental Platform

- Small SMP machine:
  - Sun Ultra Enterprise 450
  - 4 x 250 MHz Ultra-II processors
  - 1 GB of 4-way interleaved main memory
- Large SMP machine:
  - Sun Fire 6800
  - 24 x 900 MHz Sun UltraSPARC III processors
  - A 96 KB L1 cache and an 8 MB L2 cache per processor
  - 24 GB main memory
Impact of Memory Hierarchy, Large SMP

[Figure: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking.]
Modeling Parallel Performance with Locking, Large SMP

[Figure: parallel performance with cache-sensitive locking at small reduction object sizes, for 1, 2, 4, 8, and 12 threads.]
Modeling Parallel Performance, Large SMP

[Figure: performance of optimized full locking with large reduction object sizes, for 1, 2, 4, and 8 threads.]
How Good is the Model in Predicting Relative Performance? (Large SMP)

[Figure: performance of optimized full locking and cache-sensitive locking with 12 threads.]
Impact of Memory Hierarchy, Small SMP

[Figure: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking.]
Parallel Performance, Small SMP

[Figure: performance of optimized full locking, for 1, 2, and 3 threads.]
Summary

- A new application of performance modeling
  - Choosing among different parallelization techniques
- Detailed analytical model capturing the memory hierarchy and parallelization overheads
- Evaluated on two different SMP machines
  - Predicted performance within 20% in almost all cases
  - Effectively captures the impact of both the memory hierarchy and parallelization overheads
  - Quite accurate in predicting the relative performance of the different techniques