Performance Prediction for Simple CPU and Network Sharing
Shreenivasa Venkataramaiah
Jaspal Subhlok
University of Houston
LACSI Symposium 2002
LACSI 2002, slide 1
Distributed Applications on Networks:
Resource selection, Mapping, Adapting
[Figure: a distributed application's task graph (Data, Model, Pre, Sim 1, Sim 2, Stream, Vis) with a question mark over how to map the application onto a network.]
LACSI 2002, slide 2
Resource Selection Framework
[Figure: resource selection framework. Measured and forecast network conditions (current resource availability) feed a network model (related work focuses on building logical network maps); together with an application model, it is used to predict application performance under current network conditions (the subject of this paper), which drives resource selection and scheduling.]
LACSI 2002, slide 3
Building “Sharing Performance Model”
Sharing Performance Model: predicts application performance under a given availability of CPU and network resources.
1. Execute application on a controlled testbed
   – monitor CPU and network during execution
   [Figure: testbed of four CPUs connected through a router.]
2. Analyze to generate the sharing performance model
   • application resource needs determine performance
   • application treated as a black box
LACSI 2002, slide 4
Resource Shared Execution Principles
• Network Sharing
  – sharing changes the observed bandwidth and latency
  – the effective application-level latency and bandwidth determine the time to transfer a message (a worked form follows this slide)
• CPU Sharing
  – the scheduler attempts to give equal CPU time to all processes
  – a competing process is first awarded "idle time", then competes for an overall equal share of the CPU
LACSI 2002, slide 5
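
A worked form of the message-cost rule above (our reading of the slide; the linear model is an assumption, not a formula stated on the slide): for a message of s bytes over a link with effective application-level latency L seconds and bandwidth B bytes/second,

    transfer_time(s) ≈ L + s / B

This is the per-message cost used again in the prediction sketch after slide 11.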
CPU sharing (1 competing process)
[Figure: timelines of dedicated vs. CPU-shared execution, marking in each CPU time slice whether the application is using the CPU, the CPU is idle, or a competing process is using the CPU, and the corresponding application progress.]
If an application keeps the CPU 100% busy during dedicated execution, its execution time will double on sharing the CPU with a compute-intensive process.
LACSI 2002, slide 6
CPU sharing (1 competing process)
[Figure: dedicated vs. CPU-shared execution timelines for a mostly idle application and for a 50-100% busy one.]
If the CPU is mostly idle (less than 50% busy) for dedicated execution, execution time is unchanged with CPU sharing.
If the CPU is busy 50-100% of the time for dedicated execution, execution time increases by between 0 and 100%.
• the slowdown is predictable if the usage pattern is known (a sketch follows this slide)
LACSI 2002, slide 7
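
The rule on slides 6 and 7 is mechanical enough to state as code. A minimal sketch in Python (ours, not the authors'; the handling of more than one competing process is our extrapolation from the fair-share argument):

def shared_cpu_time(dedicated_time, busy_fraction, competing=1):
    """Predict execution time when the CPU is shared with
    `competing` compute-intensive processes (slides 6-7)."""
    demand = busy_fraction * dedicated_time  # CPU time the application needs
    fair_share = 1.0 / (competing + 1)       # share a fair scheduler guarantees it
    if busy_fraction <= fair_share:
        return dedicated_time                # demand fits within its share: no slowdown
    return demand / fair_share               # stretched until its demand is met

# One competing process: a 100% busy application doubles, a <=50%
# busy one is unchanged, and a 75% busy one slows down by 50%.
assert shared_cpu_time(10.0, 1.0) == 20.0
assert shared_cpu_time(10.0, 0.4) == 10.0
assert shared_cpu_time(10.0, 0.75) == 15.0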
Methodology for Building an Application's Sharing Performance Model
1. Execute the application on a controlled testbed and measure system-level activity
   – such as CPU and network usage
2. Analyze it to construct program-level activity
   – such as message exchanges and synchronization waits
3. Develop the sharing performance model by modeling execution under different sharing scenarios
• This paper is limited to predicting execution time with one shared node and/or link in a cluster
LACSI 2002, slide 8
Measurement and Modeling of Communication
1. The tcpdump utility records all TCP segments exchanged by the executing nodes.
2. The sequence of application messages is inferred by analyzing the TCP stream (Singh & Subhlok, CCN 2002); a simple version is sketched below.
The goal is to capture the size and sequence of application messages, such as MPI messages
– this can also be done by instrumenting/profiling
– more precise, but the application is no longer a black box (access to the source, or the ability to link to a profiler, is needed)
LACSI 2002, slide 9
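
The real reconstruction algorithm is the one in Singh & Subhlok (CCN 2002); the following is only an illustrative heuristic of the idea, assuming a pcap trace (e.g., captured with "tcpdump -w trace.pcap tcp") and the third-party dpkt library:

import dpkt

def messages_from_trace(pcap_path):
    """Coalesce consecutive TCP payload segments flowing in the
    same direction into one application-level message (e.g., an
    MPI message split across several segments)."""
    msgs, cur_dir, cur_size = [], None, 0
    with open(pcap_path, 'rb') as f:
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            if not isinstance(eth.data, dpkt.ip.IP):
                continue                      # non-IP frame
            ip = eth.data
            if not isinstance(ip.data, dpkt.tcp.TCP):
                continue                      # non-TCP packet
            tcp = ip.data
            if not tcp.data:
                continue                      # pure ACK, no payload
            direction = (ip.src, tcp.sport, ip.dst, tcp.dport)
            if direction == cur_dir:
                cur_size += len(tcp.data)     # same message continues
            else:
                if cur_dir is not None:
                    msgs.append((cur_dir, cur_size))
                cur_dir, cur_size = direction, len(tcp.data)
    if cur_dir is not None:
        msgs.append((cur_dir, cur_size))      # flush the last message
    return msgs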
Measurement and modeling of CPU activity
1. CPU status is measured at a fine grain with
   • a top-based program that probes the CPU status (busy or idle) every 20 milliseconds
   • CPU utilization data from the Unix kernel over a specified interval of time
2. This provides the CPU busy and idle sequence for
application execution at each node
3. The CPU busy time is divided into compute and
communication time based on the time it takes to
send application messages
LACSI 2002, slide 10
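
The probe described above is top-based and ran on FreeBSD; as a rough illustration of the same idea (ours, assuming Linux and its /proc/stat counters rather than the authors' tool), sampling at the 20 ms granularity the slide mentions:

import time

def busy_idle_trace(interval=0.02, samples=500):
    """Record, every `interval` seconds, whether the CPU was
    mostly busy (True) or mostly idle (False) over that tick."""
    def read_stat():
        with open('/proc/stat') as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return sum(fields), fields[3] + fields[4]  # total, idle+iowait jiffies
    trace = []
    total_prev, idle_prev = read_stat()
    for _ in range(samples):
        time.sleep(interval)
        total, idle = read_stat()
        d_total, d_idle = total - total_prev, idle - idle_prev
        trace.append(d_total > 0 and (d_total - d_idle) / d_total > 0.5)
        total_prev, idle_prev = total, idle
    return trace

At 20 ms the kernel's jiffy counters are coarse (often 10 ms ticks), so this sketches the measurement loop rather than replacing the paper's tool.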
Prediction of Performance with Shared CPU
and Communication Link
• It is assumed that we know:
  – the load on the shared node
  – the expected latency and bandwidth on the shared link
• The execution time for every computation phase and the time to transfer every message can then be computed
  → an estimate of overall execution time (a sketch follows this slide)
LACSI 2002, slide 11
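
A minimal sketch of how the pieces combine for one node (ours; the ('compute', seconds, busy_fraction) / ('message', bytes) trace encoding is an assumed representation, and the sketch ignores cross-node synchronization waits, the complication raised on slide 17):

def stretch(seconds, busy, competing):
    # fair-share CPU rule from slides 6-7
    share = 1.0 / (competing + 1)
    return seconds if busy <= share else busy * seconds / share

def predict_runtime(trace, competing, latency, bandwidth):
    """Walk a node's activity trace: stretch each computation
    phase by the CPU-sharing rule, and charge each message its
    transfer time under the effective latency (seconds) and
    bandwidth (bytes/second) of the shared link."""
    total = 0.0
    for record in trace:
        if record[0] == 'compute':
            _, seconds, busy = record
            total += stretch(seconds, busy, competing)
        else:                                  # ('message', size_in_bytes)
            _, size = record
            total += latency + size / bandwidth
    return total

# Example: two compute phases around a 1 MB message, one competing
# process on the node, and a link throttled to 50 Mbps (6.25e6 B/s).
trace = [('compute', 2.0, 0.9), ('message', 1_000_000), ('compute', 1.0, 0.3)]
print(predict_runtime(trace, competing=1, latency=0.001, bandwidth=6.25e6))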
Validation
• Resource utilization of the Class A MPI NAS benchmarks measured on a dedicated testbed
• Sharing performance model developed for each
benchmark program
• Measured performance with competing loads and
limited bandwidth compared with estimates from
sharing performance model
(All measurements presented are on 500 MHz Pentium Duos, a 100 Mbps network, TCP/IP, and FreeBSD; dummynet was employed to control network bandwidth.)
LACSI 2002, slide 12
Discovered Communication Structure of NAS
Benchmarks
[Figure: discovered communication graphs, each over four processes (0-3), for BT, CG, IS, LU, MG, SP, and EP.]
LACSI 2002, slide 13
CPU Behavior of NAS Benchmarks
[Figure: stacked bars showing the split of CPU time (0-100%) into computation, communication, and idle for CG, IS, MG, SP, LU, BT, and EP.]
LACSI 2002, slide 14
Predicted and Measured Performance with Resource Sharing
[Figure: predicted vs. measured percentage slowdown with sharing (0-450%) for CG, MG, SP, LU, BT, and EP under three scenarios: shared node, shared link, and shared node & link.]
LACSI 2002, slide 15
Conclusions (2 more slides though)
• A sharing performance model can be built by non-intrusive execution monitoring of an application treated as a black box; it can estimate performance under simple sharing fairly well
• Major challenges
  – Predictions are tied to the data set – the hope is that resource selection may still be good even if the estimates are off
  – Prediction with traffic on all links and computation loads on all nodes?
  – Is the overall approach practical for large-scale grid computing?
LACSI 2002, slide 16
Sharing of resources on multiple nodes and links
• Impact of sharing can be estimated on individual
nodes…
• but impact on overall execution is difficult to model
because of the combination of
– synchronization waits with unbalanced execution
– independent scheduling (lack of gang scheduling)
(e.g., one node is ready to communicate but the other is
swapped out due to independent scheduling…)
Preliminary result: lack of gang scheduling has a modest overhead (~5-40%) for small clusters (up to ~20 processors), not an order-of-magnitude overhead
LACSI 2002, slide 17
Scalability of Shared Performance Models
i.e., is the whole idea of using network measurement tools and application information to make resource selection decisions practical?
• The jury is still out
• An alternative approach being studied is…
  – automatically build an execution skeleton → a short-running program that reflects the execution behavior of an application
  – the performance of the skeleton is a measure of full application performance – run it to estimate performance on a given network
LACSI 2002, slide 18