Performance Estimation for Scheduling on Shared Networks
Shreeni Venkataramaiah and Jaspal Subhlok, University of Houston
JSSPP 2003, slide 1

Distributed Applications on Networks: Resource Selection, Mapping, Adapting
[Figure: an application graph (Data, Model, Pre, Stream, Vis, Sim 1, Sim 2) to be mapped onto a network. Where is the best performance?]
JSSPP 2003, slide 2

Resource Selection Framework
• Inputs: a network model, an application model, and measured & forecast network conditions (current resource availability)
• Predict application performance under current network conditions (the subject of this paper)
• The prediction feeds resource selection and scheduling
JSSPP 2003, slide 3

Building the "Sharing Performance Model"
Sharing performance model: predicts application performance under a given availability of CPU and network resources.
1. Execute the application on a controlled testbed; monitor CPU and network during execution
   [Figure: testbed of CPUs connected through a router]
2. Analyze the measurements to generate the sharing performance model
• application resource needs determine performance
• the application is treated as a black box
JSSPP 2003, slide 4

Resource-Shared Execution Principles
• Network sharing
  – sharing changes the observed bandwidth and latency
  – the effective application-level latency and bandwidth determine the time to transfer a message
• CPU sharing
  – the scheduler attempts to give equal CPU time to all processes
  – a competing process is first awarded "idle time", then competes for an overall equal share of the CPU
JSSPP 2003, slide 5

CPU Sharing (1 competing process)
[Figure: timelines of dedicated vs CPU-shared execution, marking application CPU use, idle CPU, the competing process's CPU use, and the corresponding progress]
• this application keeps the CPU 100% busy, so its execution time doubles with CPU sharing
JSSPP 2003, slide 6

CPU Sharing (1 competing process)
[Figure: dedicated vs CPU-shared execution timelines for a mostly idle CPU and for a 50–100% busy CPU]
• If the CPU is mostly idle (less than 50% busy) during dedicated execution, execution time is unchanged with CPU sharing.
• If the CPU is busy 50–100% of the time during dedicated execution, execution time increases by between 0 and 100%.
• The slowdown is predictable if the usage pattern is known.
JSSPP 2003, slide 7

Shared CPU on All Nodes
• Each node is scheduled independently: when one process attempts to send a message, its peer may be swapped out, leading to a synchronization wait
  – difficult to model because of timing
  – we therefore develop upper and lower bounds on execution time
JSSPP 2003, slide 8

All Shared CPUs: Lower Bound on Execution Time
• Ignore the additional synchronization waits due to independent scheduling; execution time is then the maximum of the per-node execution times, each computed individually.
• This is not necessarily outrageously optimistic. Why? Because application processes often fall into a lock-step execution mode on common OSs:
1. Process A tries to send a message to B
2. B is not executing, so A is swapped out but retains its priority
3. When B is swapped in, A re-enters the ready queue immediately and starts executing
4. Eventually the processes start getting scheduled together
JSSPP 2003, slide 9

All Shared CPUs: Upper Bound on Execution Time
The CPU is in one of these modes during application execution. Consider the impact of one competing compute-intensive load:
• Computation: at most doubles
• Communication: can double because of the node's own CPU scheduling, and can double again because of the other node's CPU scheduling – quadruples in the worst case
• Idle: idle time is spent waiting for another computation and/or communication to finish – quadruples in the worst case
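The single-node sharing rule (slide 7) and the two bounds for all-shared CPUs (slides 9 and 10) can be sketched as a small model. This is a hypothetical reconstruction, not the authors' code: the function names and the decomposition of each node's dedicated run into busy/compute/communication/idle seconds are assumptions for illustration.

```python
def shared_time_one_node(busy, idle):
    """Predicted execution time on one node with one competing
    compute-bound process (slide 7 rule).

    busy, idle: CPU-busy and CPU-idle seconds during dedicated
    execution. The competing process first absorbs idle time,
    then the scheduler splits the CPU evenly, so the busy portion
    can stretch to at most twice its dedicated length.
    """
    dedicated = busy + idle
    # Busy time doubles under equal sharing; the run can never be
    # shorter than the dedicated execution.
    return max(dedicated, 2.0 * busy)


def lower_bound_all_shared(node_times):
    """Slide 9: ignore extra synchronization waits from independent
    scheduling; the job finishes when the slowest node (each
    predicted individually) finishes."""
    return max(node_times)


def upper_bound_all_shared(comp, comm, idle):
    """Slide 10 worst case for one node: computation at most
    doubles; communication and idle time at most quadruple."""
    return 2.0 * comp + 4.0 * comm + 4.0 * idle
```

For example, a node that is 100% busy doubles (`shared_time_one_node(10, 0)` gives 20), a node that is less than 50% busy is unchanged (`shared_time_one_node(4, 6)` gives 10), and a 70%-busy node slows down by 40%, consistent with the 0–100% range on slide 7.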
JSSPP 2003, slide 10

Shared Communication Links
• It is assumed that we know at runtime the expected latency and bandwidth on shared links
• We will see how to compute the sequence of messages exchanged by processes
• ONE LINK SHARED
  – the time to transfer a message of a given size can be computed, giving a new total communication time
  – computation and idle time are unchanged
• ALL LINKS SHARED
  – idle time may also increase by the same factor as communication time, because of the slowdown of communication on other nodes
JSSPP 2003, slide 11

Methodology for Building an Application's Sharing Performance Model
1. Execute the application on a controlled testbed and measure system-level activity (CPU and network usage)
2. Analyze the measurements and reconstruct program-level activity, such as message exchanges and synchronization waits
3. Develop the sharing performance model
JSSPP 2003, slide 12

Measurement and Modeling of Communication
The goal is to capture the size and sequence of application messages, such as MPI messages. Two approaches:
1. Use the tcpdump utility to record all TCP/network segments; the sequence of application messages is inferred by analyzing the TCP stream (Singh & Subhlok, CCN 2002)
2. Instrument or profile calls to the message-passing library – more precise, but the application is no longer a black box (access to the source, or the ability to link to a profiler, is needed)
In practice both approaches give the "correct answer".
JSSPP 2003, slide 13

Measurement and Modeling of CPU Activity
1. CPU status is measured at a fine grain with
  • a top-based program that probes the CPU status (busy or idle) every 20 milliseconds
  • CPU utilization data from the Unix kernel over a specified interval of time
2. This provides the CPU busy/idle sequence for the application's execution at each node
3. The CPU busy time is divided into computation and communication time based on the time it takes to send the application's messages
JSSPP 2003, slide 14

Validation
• Resource utilization of Class A/B MPI NAS benchmarks measured on a dedicated testbed
• A sharing performance model was developed for each benchmark program
• Measured performance with competing loads and limited bandwidth was compared with estimates from the sharing performance model
  – experiments on 500 MHz Pentium Duos, 100 Mbps switched network, TCP/IP, FreeBSD; dummynet was employed to control network bandwidth
  – some new measurements for Class B benchmarks on 1.7 GHz Pentium Duos with Linux (in the talk only)
JSSPP 2003, slide 15

Discovered Communication Structure of NAS Benchmarks
[Figure: communication graphs over four processes (0–3) for BT, CG, IS, LU, MG, SP, and EP]
JSSPP 2003, slide 16

CPU Behavior of NAS Benchmarks
[Figure: stacked bars of computation, communication, and idle time (0–100%) for CG, IS, MG, SP, LU, BT, and EP]
JSSPP 2003, slide 17

Predicted and Measured Performance with One Shared CPU and/or Link (4 nodes)
[Figure: predicted vs measured percentage slowdown for a shared node, a shared link, and a shared node & link, for CG, MG, SP, LU, BT, and EP]
JSSPP 2003, slide 18

Predicted and Measured Performance with One Shared CPU
[Figure: predicted vs measured normalized execution time with one CPU load for the faster cluster / Class B benchmarks (CG, EP, IS, LU, MG, SP, BT)]
JSSPP 2003, slide 19

Predicted and Measured Performance with All Shared CPUs or Links
[Figure: measured performance plotted against the lower and upper bounds, for shared CPUs and for shared links]
• measured performance is generally within the bounds
• rather close to the upper bound in many cases
JSSPP 2003, slide 20

Predicted and Measured Performance with All Shared Nodes on
the New Cluster (faster nodes, same network)
[Figure: normalized execution time with all shared nodes for CG, EP, IS, LU, MG, SP, and BT on the new cluster]
• the new cluster's results are closer to the lower bound
• speculation: the CPUs have more idle time, hence more flexibility to synchronize
JSSPP 2003, slide 21

Conclusions
• Applications respond differently to sharing; this matters for grid scheduling
• A sharing performance model can be built by non-intrusive execution monitoring of an application treated as a black box
• Major challenges
  – prediction related to the data set
  – scalability to large systems?
  – other limitations
• Is the overall approach practical for large-scale grid computing?
JSSPP 2003, slide 22

Alternate Approach: Node Selection with Performance Skeletons (AMS 2003, tomorrow)
[Figure: an application graph (Data, Model, GUI, Pre, Stream, Sim 1) reduced to a skeleton]
• Construct a skeleton for the application (a small program with the same execution behavior as the application)
• Select candidate node sets based on network status
• Execute the skeleton on them
• Select the node set with the best skeleton performance
JSSPP 2003, slide 23
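The one-link-shared adjustment on slide 11 can be sketched in the same style: given the recorded message sizes and the currently available latency and bandwidth on the shared link, recompute each message's transfer time and hence the total communication time, leaving computation and idle time unchanged. This is a hedged sketch; the additive latency-plus-serialization cost model and all names are assumptions, not the paper's exact formulation.

```python
def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Application-level time to move one message: effective
    latency plus serialization at the effective bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def shared_link_time(messages, comp, idle,
                     latency_s, bandwidth_bytes_per_s):
    """One shared link (slide 11): recompute total communication
    time from the recorded message sizes under the currently
    available latency/bandwidth; computation and idle seconds
    are left unchanged."""
    comm = sum(message_time(m, latency_s, bandwidth_bytes_per_s)
               for m in messages)
    return comp + idle + comm
```

At the full 100 Mbps of the testbed (about 12.5 MB/s), a 1 MB message costs roughly latency + 0.08 s; halving the available bandwidth would double the serialization term, stretching only the communication component of the predicted run.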